* [PATCH v9 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
@ 2017-04-13  9:35 ` Wei Wang
  0 siblings, 0 replies; 136+ messages in thread
From: Wei Wang @ 2017-04-13  9:35 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

This patch series implements two optimizations:
1) transfer pages in chunks between the guest and host;
2) report the guest's unused pages to the host so that live
migration can skip them.

Changes:
v8->v9:
1) Split the two new features, VIRTIO_BALLOON_F_BALLOON_CHUNKS and
VIRTIO_BALLOON_F_MISC_VQ, which were mixed together in the previous
implementation;
2) Simpler function to get the free page block.

v7->v8:
1) Use only one chunk format, instead of two.
2) re-write the virtio-balloon implementation patch.
3) commit changes
4) patch re-org

Liang Li (1):
  virtio-balloon: deflate via a page list

Wei Wang (4):
  virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
  mm: function to offer a page block on the free list
  mm: export symbol of next_zone and first_online_pgdat
  virtio-balloon: VIRTIO_BALLOON_F_MISC_VQ

 drivers/virtio/virtio_balloon.c     | 615 +++++++++++++++++++++++++++++++++---
 include/linux/mm.h                  |   3 +
 include/uapi/linux/virtio_balloon.h |  21 ++
 mm/mmzone.c                         |   2 +
 mm/page_alloc.c                     |  87 +++++
 5 files changed, 678 insertions(+), 50 deletions(-)

-- 
2.7.4

* [PATCH v9 1/5] virtio-balloon: deflate via a page list
  2017-04-13  9:35 ` Wei Wang
  (?)
@ 2017-04-13  9:35   ` Wei Wang
  -1 siblings, 0 replies; 136+ messages in thread
From: Wei Wang @ 2017-04-13  9:35 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

From: Liang Li <liang.z.li@intel.com>

This patch saves the deflated pages to a list, instead of the PFN array.
Accordingly, the balloon_pfn_to_page() function is removed.
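
To illustrate the idea outside the kernel, here is a minimal userspace
sketch of releasing pages by walking an intrusive list instead of
decoding a PFN array. The names struct list_node, struct fake_page,
list_push() and release_all() are hypothetical stand-ins for
struct list_head, struct page, list_add() and release_pages_balloon();
this is not the driver code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct list_head. */
struct list_node {
	struct list_node *prev, *next;
};

/* Hypothetical stand-in for struct page; only the list linkage matters. */
struct fake_page {
	unsigned long pfn;
	int released;
	struct list_node lru;
};

static void list_init(struct list_node *head)
{
	head->prev = head->next = head;
}

/* Insert right after head, in the manner of the kernel's list_add(). */
static void list_push(struct list_node *node, struct list_node *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

static void list_unlink(struct list_node *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->prev = node->next = node;
}

/*
 * Release every page on the list. The successor is saved before the
 * current entry is unlinked, which is what makes deletion during the
 * walk safe. Returns the number of pages released.
 */
static unsigned int release_all(struct list_node *head)
{
	struct list_node *pos, *next;
	unsigned int n = 0;

	assert(head != NULL);
	for (pos = head->next; pos != head; pos = next) {
		struct fake_page *page;

		next = pos->next;
		page = (struct fake_page *)((char *)pos -
					    offsetof(struct fake_page, lru));
		list_unlink(pos);
		page->released = 1;	/* stands in for put_page() */
		n++;
	}
	return n;
}
```

The caller builds the list while dequeuing pages and hands the list
head to the release function, so no PFN-to-page lookup is needed.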

Signed-off-by: Liang Li <liang.z.li@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 drivers/virtio/virtio_balloon.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 181793f..f59cb4f 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -103,12 +103,6 @@ static u32 page_to_balloon_pfn(struct page *page)
 	return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE;
 }
 
-static struct page *balloon_pfn_to_page(u32 pfn)
-{
-	BUG_ON(pfn % VIRTIO_BALLOON_PAGES_PER_PAGE);
-	return pfn_to_page(pfn / VIRTIO_BALLOON_PAGES_PER_PAGE);
-}
-
 static void balloon_ack(struct virtqueue *vq)
 {
 	struct virtio_balloon *vb = vq->vdev->priv;
@@ -181,18 +175,16 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 	return num_allocated_pages;
 }
 
-static void release_pages_balloon(struct virtio_balloon *vb)
+static void release_pages_balloon(struct virtio_balloon *vb,
+				 struct list_head *pages)
 {
-	unsigned int i;
-	struct page *page;
+	struct page *page, *next;
 
-	/* Find pfns pointing at start of each page, get pages and free them. */
-	for (i = 0; i < vb->num_pfns; i += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-		page = balloon_pfn_to_page(virtio32_to_cpu(vb->vdev,
-							   vb->pfns[i]));
+	list_for_each_entry_safe(page, next, pages, lru) {
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
 			adjust_managed_page_count(page, 1);
+		list_del(&page->lru);
 		put_page(page); /* balloon reference */
 	}
 }
@@ -202,6 +194,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	unsigned num_freed_pages;
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
+	LIST_HEAD(pages);
 
 	/* We can only do one array worth at a time. */
 	num = min(num, ARRAY_SIZE(vb->pfns));
@@ -215,6 +208,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		if (!page)
 			break;
 		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
 
@@ -226,7 +220,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 */
 	if (vb->num_pfns != 0)
 		tell_host(vb, vb->deflate_vq);
-	release_pages_balloon(vb);
+	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
 }
-- 
2.7.4

* [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
  2017-04-13  9:35 ` Wei Wang
  (?)
@ 2017-04-13  9:35   ` Wei Wang
  -1 siblings, 0 replies; 136+ messages in thread
From: Wei Wang @ 2017-04-13  9:35 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

Add a new feature, VIRTIO_BALLOON_F_BALLOON_CHUNKS, which enables
the transfer of the ballooned (i.e. inflated/deflated) pages in
chunks to the host.

The implementation of the previous virtio-balloon is not very
efficient, because the ballooned pages are transferred to the
host one by one. Here is the breakdown of the time in percentage
spent on each step of the balloon inflating process (inflating
7GB of an 8GB idle guest).

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

The inflating process takes about 4126ms to complete. The profile
above shows that steps 2) and 4) are the bottlenecks.

This patch optimizes step 2) by transferring pages to the host in
chunks. A chunk consists of guest-physically contiguous pages, and
it is offered to the host via a base PFN (i.e. the start PFN of
those physically contiguous pages) and a size (i.e. the total
number of pages). A chunk is formatted as below:
--------------------------------------------------------
|                 Base (52 bit)        | Rsvd (12 bit) |
--------------------------------------------------------
--------------------------------------------------------
|                 Size (52 bit)        | Rsvd (12 bit) |
--------------------------------------------------------
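
As a rough illustration of this layout, the sketch below packs a base
PFN and a page count into two 64-bit words. It assumes the 12 reserved
bits are the low bits, so both shifts are 12; the driver uses
VIRTIO_BALLOON_CHUNK_BASE_SHIFT and VIRTIO_BALLOON_CHUNK_SIZE_SHIFT,
whose values are not shown here. struct page_chunk, chunk_encode() and
the accessors are hypothetical names, and the little-endian conversion
done by cpu_to_le64() in the driver is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 12 reserved low bits, matching the diagram above. */
#define CHUNK_RSVD_BITS 12

/* Hypothetical userspace mirror of one chunk entry. */
struct page_chunk {
	uint64_t base;	/* start PFN in the upper 52 bits */
	uint64_t size;	/* number of pages in the upper 52 bits */
};

/* Pack a run of contiguous pages into the two-word chunk format. */
static struct page_chunk chunk_encode(uint64_t base_pfn, uint64_t nr_pages)
{
	struct page_chunk c;

	/* Both fields must fit in 52 bits. */
	assert(base_pfn < (UINT64_C(1) << 52));
	assert(nr_pages < (UINT64_C(1) << 52));

	c.base = base_pfn << CHUNK_RSVD_BITS;
	c.size = nr_pages << CHUNK_RSVD_BITS;
	return c;
}

/* Recover the start PFN, as the host side would. */
static uint64_t chunk_base_pfn(const struct page_chunk *c)
{
	return c->base >> CHUNK_RSVD_BITS;
}

/* Recover the page count, as the host side would. */
static uint64_t chunk_nr_pages(const struct page_chunk *c)
{
	return c->size >> CHUNK_RSVD_BITS;
}
```

One 16-byte entry can therefore describe an arbitrarily long run of
contiguous pages, where the PFN-array format needs one 32-bit entry
per page.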

By doing so, steps 3) and 4) can also be optimized by doing address
translation and madvise() in chunks rather than page by page.
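
The diff in this patch finds chunks by recording ballooned PFNs in a
bitmap and then scanning for runs of set bits. The userspace sketch
below shows that scan in isolation. It uses a plain byte-array bitmap
and simple loops instead of the kernel's find_next_bit() and
find_next_zero_bit(); struct pfn_run, test_bit_at() and
bitmap_to_runs() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: one run of contiguous PFNs. */
struct pfn_run {
	uint64_t base;	/* first PFN of the run */
	uint64_t size;	/* number of pages in the run */
};

/* Bit i lives in byte i / 8, at bit position i % 8. */
static int test_bit_at(const uint8_t *bmap, size_t bit)
{
	return (bmap[bit / 8] >> (bit % 8)) & 1;
}

/*
 * Scan the first nbits bits of bmap and record one run for each
 * maximal stretch of set bits. Bit 0 corresponds to pfn_start.
 * At most max_runs runs are written; the number written is returned.
 */
static size_t bitmap_to_runs(const uint8_t *bmap, size_t nbits,
			     uint64_t pfn_start,
			     struct pfn_run *out, size_t max_runs)
{
	size_t pos = 0, n = 0;

	assert(bmap != NULL && out != NULL);
	while (pos < nbits && n < max_runs) {
		size_t one, zero;

		/* Find the next set bit at or after pos. */
		for (one = pos; one < nbits && !test_bit_at(bmap, one); one++)
			;
		if (one == nbits)
			break;

		/* Find the first clear bit after it, or the end. */
		for (zero = one + 1;
		     zero < nbits && test_bit_at(bmap, zero); zero++)
			;

		out[n].base = pfn_start + one;
		out[n].size = zero - one;
		n++;
		pos = zero;
	}
	return n;
}
```

Each run found this way maps to one (base, size) chunk, so the host
can translate and madvise() a whole range per entry.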

With this new feature, the above ballooning process takes ~590ms,
an improvement of ~85% over the 4126ms baseline.

TODO: optimize step 1) by allocating/freeing a chunk of pages
instead of a single page each time.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_balloon.c     | 384 +++++++++++++++++++++++++++++++++---
 include/uapi/linux/virtio_balloon.h |  13 ++
 2 files changed, 374 insertions(+), 23 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f59cb4f..5e2e7cc 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -42,6 +42,10 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+#define PAGE_BMAP_SIZE		(8 * PAGE_SIZE)
+#define PFNS_PER_PAGE_BMAP	(PAGE_BMAP_SIZE * BITS_PER_BYTE)
+#define PAGE_BMAP_COUNT_MAX	32
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
@@ -50,6 +54,10 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 static struct vfsmount *balloon_mnt;
 #endif
 
+/* Types of pages to chunk */
+#define PAGE_CHUNK_TYPE_BALLOON 0
+
+#define MAX_PAGE_CHUNKS 4096
 struct virtio_balloon {
 	struct virtio_device *vdev;
 	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
@@ -78,6 +86,32 @@ struct virtio_balloon {
 	/* Synchronize access/update to this struct virtio_balloon elements */
 	struct mutex balloon_lock;
 
+	/*
+	 * Buffer for PAGE_CHUNK_TYPE_BALLOON:
+	 * virtio_balloon_page_chunk_hdr +
+	 * virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
+	 */
+	struct virtio_balloon_page_chunk_hdr *balloon_page_chunk_hdr;
+	struct virtio_balloon_page_chunk *balloon_page_chunk;
+
+	/* Bitmap used to record pages */
+	unsigned long *page_bmap[PAGE_BMAP_COUNT_MAX];
+	/* Number of the allocated page_bmap */
+	unsigned int page_bmaps;
+
+	/*
+	 * The allocated page_bmap size may be smaller than the pfn range of
+	 * the ballooned pages. In this case, we need to use the page_bmap
+	 * multiple times to cover the entire pfn range. It's like using a
+	 * short ruler several times to finish measuring a long object.
+	 * The start location of the ruler in the next measurement is the end
+	 * location of the ruler in the previous measurement.
+	 *
+	 * pfn_max & pfn_min: forms the pfn range of the ballooned pages
+	 * pfn_start & pfn_stop: records the start and stop pfn in each cover
+	 */
+	unsigned long pfn_min, pfn_max, pfn_start, pfn_stop;
+
 	/* The array of pfns we tell the Host about. */
 	unsigned int num_pfns;
 	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -110,20 +144,201 @@ static void balloon_ack(struct virtqueue *vq)
 	wake_up(&vb->acked);
 }
 
-static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+static inline void init_page_bmap_range(struct virtio_balloon *vb)
+{
+	vb->pfn_min = ULONG_MAX;
+	vb->pfn_max = 0;
+}
+
+static inline void update_page_bmap_range(struct virtio_balloon *vb,
+					  struct page *page)
+{
+	unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+	vb->pfn_min = min(balloon_pfn, vb->pfn_min);
+	vb->pfn_max = max(balloon_pfn, vb->pfn_max);
+}
+
+/* The page_bmap size is extended by adding more number of page_bmap */
+static void extend_page_bmap_size(struct virtio_balloon *vb,
+				  unsigned long pfns)
+{
+	int i, bmaps;
+	unsigned long bmap_len;
+
+	bmap_len = ALIGN(pfns, BITS_PER_LONG) / BITS_PER_BYTE;
+	bmap_len = ALIGN(bmap_len, PAGE_BMAP_SIZE);
+	bmaps = min((int)(bmap_len / PAGE_BMAP_SIZE),
+		    PAGE_BMAP_COUNT_MAX);
+
+	for (i = 1; i < bmaps; i++) {
+		vb->page_bmap[i] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
+		if (vb->page_bmap[i])
+			vb->page_bmaps++;
+		else
+			break;
+	}
+}
+
+static void free_extended_page_bmap(struct virtio_balloon *vb)
+{
+	int i, bmaps = vb->page_bmaps;
+
+	for (i = 1; i < bmaps; i++) {
+		kfree(vb->page_bmap[i]);
+		vb->page_bmap[i] = NULL;
+		vb->page_bmaps--;
+	}
+}
+
+static void free_page_bmap(struct virtio_balloon *vb)
+{
+	int i;
+
+	for (i = 0; i < vb->page_bmaps; i++)
+		kfree(vb->page_bmap[i]);
+}
+
+static void clear_page_bmap(struct virtio_balloon *vb)
+{
+	int i;
+
+	for (i = 0; i < vb->page_bmaps; i++)
+		memset(vb->page_bmap[i], 0, PAGE_BMAP_SIZE);
+}
+
+static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq,
+			     int type, bool busy_wait)
 {
 	struct scatterlist sg;
+	struct virtio_balloon_page_chunk_hdr *hdr;
+	void *buf;
 	unsigned int len;
 
-	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
+	switch (type) {
+	case PAGE_CHUNK_TYPE_BALLOON:
+		hdr = vb->balloon_page_chunk_hdr;
+		len = 0;
+		break;
+	default:
+		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
+			 __func__, type);
+		return;
+	}
 
-	/* We should always be able to add one buffer to an empty queue. */
-	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
-	virtqueue_kick(vq);
+	buf = (void *)hdr - len;
+	len += sizeof(struct virtio_balloon_page_chunk_hdr);
+	len += hdr->chunks * sizeof(struct virtio_balloon_page_chunk);
+	sg_init_table(&sg, 1);
+	sg_set_buf(&sg, buf, len);
+	if (!virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL)) {
+		virtqueue_kick(vq);
+		if (busy_wait)
+			while (!virtqueue_get_buf(vq, &len) &&
+			       !virtqueue_is_broken(vq))
+				cpu_relax();
+		else
+			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		hdr->chunks = 0;
+	}
+}
+
+static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
+			  int type, u64 base, u64 size)
+{
+	struct virtio_balloon_page_chunk_hdr *hdr;
+	struct virtio_balloon_page_chunk *chunk;
+
+	switch (type) {
+	case PAGE_CHUNK_TYPE_BALLOON:
+		hdr = vb->balloon_page_chunk_hdr;
+		chunk = vb->balloon_page_chunk;
+		break;
+	default:
+		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
+			 __func__, type);
+		return;
+	}
+	chunk = chunk + hdr->chunks;
+	chunk->base = cpu_to_le64(base << VIRTIO_BALLOON_CHUNK_BASE_SHIFT);
+	chunk->size = cpu_to_le64(size << VIRTIO_BALLOON_CHUNK_SIZE_SHIFT);
+	hdr->chunks++;
+	if (hdr->chunks == MAX_PAGE_CHUNKS)
+		send_page_chunks(vb, vq, type, false);
+}
+
+static void chunking_pages_from_bmap(struct virtio_balloon *vb,
+				     struct virtqueue *vq,
+				     unsigned long pfn_start,
+				     unsigned long *bmap,
+				     unsigned long len)
+{
+	unsigned long pos = 0, end = len * BITS_PER_BYTE;
+
+	while (pos < end) {
+		unsigned long one = find_next_bit(bmap, end, pos);
+
+		if (one < end) {
+			unsigned long chunk_size, zero;
+
+			zero = find_next_zero_bit(bmap, end, one + 1);
+			if (zero >= end)
+				chunk_size = end - one;
+			else
+				chunk_size = zero - one;
+
+			if (chunk_size)
+				add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
+					      pfn_start + one, chunk_size);
+			pos = one + chunk_size;
+		} else
+			break;
+	}
+}
+
+static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+{
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS)) {
+		int pfns, page_bmaps, i;
+		unsigned long pfn_start, pfns_len;
+
+		pfn_start = vb->pfn_start;
+		pfns = vb->pfn_stop - pfn_start + 1;
+		pfns = roundup(roundup(pfns, BITS_PER_LONG),
+			       PFNS_PER_PAGE_BMAP);
+		page_bmaps = pfns / PFNS_PER_PAGE_BMAP;
+		pfns_len = pfns / BITS_PER_BYTE;
+
+		for (i = 0; i < page_bmaps; i++) {
+			unsigned int bmap_len = PAGE_BMAP_SIZE;
+
+			/* The last one takes the leftover only */
+			if (i + 1 == page_bmaps)
+				bmap_len = pfns_len - PAGE_BMAP_SIZE * i;
+
+			chunking_pages_from_bmap(vb, vq, pfn_start +
+						 i * PFNS_PER_PAGE_BMAP,
+						 vb->page_bmap[i], bmap_len);
+		}
+		if (vb->balloon_page_chunk_hdr->chunks > 0)
+			send_page_chunks(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
+					 false);
+	} else {
+		struct scatterlist sg;
+		unsigned int len;
 
-	/* When host has read buffer, this completes via balloon_ack */
-	wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
 
+		/*
+		 * We should always be able to add one buffer to an empty
+		 * queue.
+		 */
+		virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
+		virtqueue_kick(vq);
+
+		/* When host has read buffer, this completes via balloon_ack */
+		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+	}
 }
 
 static void set_page_pfns(struct virtio_balloon *vb,
@@ -131,20 +346,73 @@ static void set_page_pfns(struct virtio_balloon *vb,
 {
 	unsigned int i;
 
-	/* Set balloon pfns pointing at this page.
-	 * Note that the first pfn points at start of the page. */
+	/*
+	 * Set balloon pfns pointing at this page.
+	 * Note that the first pfn points at start of the page.
+	 */
 	for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++)
 		pfns[i] = cpu_to_virtio32(vb->vdev,
 					  page_to_balloon_pfn(page) + i);
 }
 
+static void set_page_bmap(struct virtio_balloon *vb,
+			  struct list_head *pages, struct virtqueue *vq)
+{
+	unsigned long pfn_start, pfn_stop;
+	struct page *page;
+	bool found;
+
+	vb->pfn_min = rounddown(vb->pfn_min, BITS_PER_LONG);
+	vb->pfn_max = roundup(vb->pfn_max, BITS_PER_LONG);
+
+	extend_page_bmap_size(vb, vb->pfn_max - vb->pfn_min + 1);
+	pfn_start = vb->pfn_min;
+
+	while (pfn_start < vb->pfn_max) {
+		pfn_stop = pfn_start + PFNS_PER_PAGE_BMAP * vb->page_bmaps;
+		pfn_stop = pfn_stop < vb->pfn_max ? pfn_stop : vb->pfn_max;
+
+		vb->pfn_start = pfn_start;
+		clear_page_bmap(vb);
+		found = false;
+
+		list_for_each_entry(page, pages, lru) {
+			unsigned long bmap_idx, bmap_pos, balloon_pfn;
+
+			balloon_pfn = page_to_balloon_pfn(page);
+			if (balloon_pfn < pfn_start || balloon_pfn > pfn_stop)
+				continue;
+			bmap_idx = (balloon_pfn - pfn_start) /
+				   PFNS_PER_PAGE_BMAP;
+			bmap_pos = (balloon_pfn - pfn_start) %
+				   PFNS_PER_PAGE_BMAP;
+			set_bit(bmap_pos, vb->page_bmap[bmap_idx]);
+
+			found = true;
+		}
+		if (found) {
+			vb->pfn_stop = pfn_stop;
+			tell_host(vb, vq);
+		}
+		pfn_start = pfn_stop;
+	}
+	free_extended_page_bmap(vb);
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	unsigned num_allocated_pages;
+	bool chunking = virtio_has_feature(vb->vdev,
+					   VIRTIO_BALLOON_F_BALLOON_CHUNKS);
 
 	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (chunking) {
+		init_page_bmap_range(vb);
+	} else {
+		/* We can only do one array worth at a time. */
+		num = min(num, ARRAY_SIZE(vb->pfns));
+	}
 
 	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
@@ -159,7 +427,10 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 			msleep(200);
 			break;
 		}
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (chunking)
+			update_page_bmap_range(vb, page);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -168,8 +439,13 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns != 0) {
+		if (chunking)
+			set_page_bmap(vb, &vb_dev_info->pages,
+					vb->inflate_vq);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -195,6 +471,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool chunking = virtio_has_feature(vb->vdev,
+					   VIRTIO_BALLOON_F_BALLOON_CHUNKS);
+	if (chunking)
+		init_page_bmap_range(vb);
+	else
+		/* We can only do one array worth at a time. */
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	/* We can only do one array worth at a time. */
 	num = min(num, ARRAY_SIZE(vb->pfns));
@@ -208,6 +491,10 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		if (!page)
 			break;
 		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (chunking)
+			update_page_bmap_range(vb, page);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -218,8 +505,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns != 0) {
+		if (chunking)
+			set_page_bmap(vb, &pages, vb->deflate_vq);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
@@ -431,6 +722,13 @@ static int init_vqs(struct virtio_balloon *vb)
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
+
+static void tell_host_one_page(struct virtio_balloon *vb,
+			       struct virtqueue *vq, struct page *page)
+{
+	add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON, page_to_pfn(page), 1);
+}
+
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
  *			     a compation thread.     (called under page lock)
@@ -454,6 +752,8 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 {
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
+	bool chunking = virtio_has_feature(vb->vdev,
+					   VIRTIO_BALLOON_F_BALLOON_CHUNKS);
 	unsigned long flags;
 
 	/*
@@ -475,16 +775,22 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
-
+	if (chunking) {
+		tell_host_one_page(vb, vb->inflate_vq, newpage);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
-
+	if (chunking) {
+		tell_host_one_page(vb, vb->deflate_vq, page);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	put_page(page); /* balloon reference */
@@ -511,6 +817,32 @@ static struct file_system_type balloon_fs = {
 
 #endif /* CONFIG_BALLOON_COMPACTION */
 
+static void balloon_page_chunk_init(struct virtio_balloon *vb)
+{
+	void *buf;
+
+	/*
+	 * By default, we allocate page_bmap[0] only. More page_bmap will be
+	 * allocated on demand.
+	 */
+	vb->page_bmap[0] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
+	buf = kmalloc(sizeof(struct virtio_balloon_page_chunk_hdr) +
+		      sizeof(struct virtio_balloon_page_chunk) *
+		      MAX_PAGE_CHUNKS, GFP_KERNEL);
+	if (!vb->page_bmap[0] || !buf) {
+		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS);
+		kfree(vb->page_bmap[0]);
+		kfree(buf);
+		dev_warn(&vb->vdev->dev, "%s: failed\n", __func__);
+	} else {
+		vb->page_bmaps = 1;
+		vb->balloon_page_chunk_hdr = buf;
+		vb->balloon_page_chunk_hdr->chunks = 0;
+		vb->balloon_page_chunk = buf +
+				sizeof(struct virtio_balloon_page_chunk_hdr);
+	}
+}
+
 static int virtballoon_probe(struct virtio_device *vdev)
 {
 	struct virtio_balloon *vb;
@@ -533,6 +865,10 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	spin_lock_init(&vb->stop_update_lock);
 	vb->stop_update = false;
 	vb->num_pages = 0;
+
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS))
+		balloon_page_chunk_init(vb);
+
 	mutex_init(&vb->balloon_lock);
 	init_waitqueue_head(&vb->acked);
 	vb->vdev = vdev;
@@ -609,6 +945,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	cancel_work_sync(&vb->update_balloon_stats_work);
 
 	remove_common(vb);
+	free_page_bmap(vb);
 	if (vb->vb_dev_info.inode)
 		iput(vb->vb_dev_info.inode);
 	kfree(vb);
@@ -649,6 +986,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_BALLOON_CHUNKS,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..be317b7 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_BALLOON_CHUNKS 3 /* Inflate/Deflate pages in chunks */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -82,4 +83,16 @@ struct virtio_balloon_stat {
 	__virtio64 val;
 } __attribute__((packed));
 
+struct virtio_balloon_page_chunk_hdr {
+	/* Number of chunks in the payload */
+	__le32 chunks;
+};
+
+#define VIRTIO_BALLOON_CHUNK_BASE_SHIFT 12
+#define VIRTIO_BALLOON_CHUNK_SIZE_SHIFT 12
+struct virtio_balloon_page_chunk {
+	__le64 base;
+	__le64 size;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
@ 2017-04-13  9:35   ` Wei Wang
  0 siblings, 0 replies; 136+ messages in thread
From: Wei Wang @ 2017-04-13  9:35 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

Add a new feature, VIRTIO_BALLOON_F_BALLOON_CHUNKS, which enables
the transfer of the ballooned (i.e. inflated/deflated) pages in
chunks to the host.

The existing virtio-balloon implementation is inefficient because
the ballooned pages are transferred to the host one by one. Here is
the breakdown, in percent, of the time spent on each step of the
balloon inflating process (inflating 7GB of an 8GB idle guest):

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete.
The above profiling shows that the bottlenecks are step 2)
and step 4).

This patch optimizes step 2) by transferring pages to the host in
chunks. A chunk consists of guest physically contiguous pages, and
it is offered to the host via a base PFN (i.e. the start PFN of
those physically contiguous pages) and the size (i.e. the total
number of the pages). A chunk is formatted as below:
--------------------------------------------------------
|                 Base (52 bit)        | Rsvd (12 bit) |
--------------------------------------------------------
--------------------------------------------------------
|                 Size (52 bit)        | Rsvd (12 bit) |
--------------------------------------------------------
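
As a rough illustration of the layout above, the following user-space
C sketch packs a base PFN and a page count into the two 64-bit fields
and unpacks them again. The names page_chunk, chunk_encode(),
chunk_base_pfn() and chunk_nr_pages() are hypothetical and not part of
the patch. The sketch also keeps the values in host byte order, whereas
the patch stores them as __le64.

```c
#include <assert.h>
#include <stdint.h>

/* Mirror VIRTIO_BALLOON_CHUNK_{BASE,SIZE}_SHIFT from the patch. */
#define CHUNK_BASE_SHIFT 12
#define CHUNK_SIZE_SHIFT 12

/* Hypothetical host-endian stand-in for struct virtio_balloon_page_chunk. */
struct page_chunk {
	uint64_t base;	/* base PFN in bits 63..12, bits 11..0 reserved */
	uint64_t size;	/* page count in bits 63..12, bits 11..0 reserved */
};

static struct page_chunk chunk_encode(uint64_t base_pfn, uint64_t nr_pages)
{
	struct page_chunk c;

	/* Both values must fit in 52 bits, or the shift drops high bits. */
	assert(base_pfn < (1ULL << 52));
	assert(nr_pages < (1ULL << 52));

	c.base = base_pfn << CHUNK_BASE_SHIFT;
	c.size = nr_pages << CHUNK_SIZE_SHIFT;
	return c;
}

static uint64_t chunk_base_pfn(const struct page_chunk *c)
{
	return c->base >> CHUNK_BASE_SHIFT;
}

static uint64_t chunk_nr_pages(const struct page_chunk *c)
{
	return c->size >> CHUNK_SIZE_SHIFT;
}
```

For example, chunk_encode(0x12345, 512) yields base 0x12345000 and
size 0x200000, which with 4KB pages is a 2MB chunk starting at that
guest physical address.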

By doing so, steps 3) and 4) can also be optimized by doing address
translation and madvise() in chunks rather than page by page.

With this new feature, the above ballooning process takes ~590ms,
resulting in an improvement of ~85%.

TODO: optimize step 1) by allocating/freeing a chunk of pages
instead of a single page each time.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_balloon.c     | 384 +++++++++++++++++++++++++++++++++---
 include/uapi/linux/virtio_balloon.h |  13 ++
 2 files changed, 374 insertions(+), 23 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f59cb4f..5e2e7cc 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -42,6 +42,10 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+#define PAGE_BMAP_SIZE		(8 * PAGE_SIZE)
+#define PFNS_PER_PAGE_BMAP	(PAGE_BMAP_SIZE * BITS_PER_BYTE)
+#define PAGE_BMAP_COUNT_MAX	32
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
@@ -50,6 +54,10 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 static struct vfsmount *balloon_mnt;
 #endif
 
+/* Types of pages to chunk */
+#define PAGE_CHUNK_TYPE_BALLOON 0
+
+#define MAX_PAGE_CHUNKS 4096
 struct virtio_balloon {
 	struct virtio_device *vdev;
 	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
@@ -78,6 +86,32 @@ struct virtio_balloon {
 	/* Synchronize access/update to this struct virtio_balloon elements */
 	struct mutex balloon_lock;
 
+	/*
+	 * Buffer for PAGE_CHUNK_TYPE_BALLOON:
+	 * virtio_balloon_page_chunk_hdr +
+	 * virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
+	 */
+	struct virtio_balloon_page_chunk_hdr *balloon_page_chunk_hdr;
+	struct virtio_balloon_page_chunk *balloon_page_chunk;
+
+	/* Bitmap used to record pages */
+	unsigned long *page_bmap[PAGE_BMAP_COUNT_MAX];
+	/* Number of the allocated page_bmap */
+	unsigned int page_bmaps;
+
+	/*
+	 * The allocated page_bmap size may be smaller than the pfn range of
+	 * the ballooned pages. In this case, we need to use the page_bmap
+	 * multiple times to cover the entire pfn range. It's like using a
+	 * short ruler several times to finish measuring a long object.
+	 * The start location of the ruler in the next measurement is the end
+	 * location of the ruler in the previous measurement.
+	 *
+	 * pfn_max & pfn_min: forms the pfn range of the ballooned pages
+	 * pfn_start & pfn_stop: records the start and stop pfn in each cover
+	 */
+	unsigned long pfn_min, pfn_max, pfn_start, pfn_stop;
+
 	/* The array of pfns we tell the Host about. */
 	unsigned int num_pfns;
 	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -110,20 +144,201 @@ static void balloon_ack(struct virtqueue *vq)
 	wake_up(&vb->acked);
 }
 
-static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+static inline void init_page_bmap_range(struct virtio_balloon *vb)
+{
+	vb->pfn_min = ULONG_MAX;
+	vb->pfn_max = 0;
+}
+
+static inline void update_page_bmap_range(struct virtio_balloon *vb,
+					  struct page *page)
+{
+	unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+	vb->pfn_min = min(balloon_pfn, vb->pfn_min);
+	vb->pfn_max = max(balloon_pfn, vb->pfn_max);
+}
+
+/* The page_bmap size is extended by adding more number of page_bmap */
+static void extend_page_bmap_size(struct virtio_balloon *vb,
+				  unsigned long pfns)
+{
+	int i, bmaps;
+	unsigned long bmap_len;
+
+	bmap_len = ALIGN(pfns, BITS_PER_LONG) / BITS_PER_BYTE;
+	bmap_len = ALIGN(bmap_len, PAGE_BMAP_SIZE);
+	bmaps = min((int)(bmap_len / PAGE_BMAP_SIZE),
+		    PAGE_BMAP_COUNT_MAX);
+
+	for (i = 1; i < bmaps; i++) {
+		vb->page_bmap[i] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
+		if (vb->page_bmap[i])
+			vb->page_bmaps++;
+		else
+			break;
+	}
+}
+
+static void free_extended_page_bmap(struct virtio_balloon *vb)
+{
+	int i, bmaps = vb->page_bmaps;
+
+	for (i = 1; i < bmaps; i++) {
+		kfree(vb->page_bmap[i]);
+		vb->page_bmap[i] = NULL;
+		vb->page_bmaps--;
+	}
+}
+
+static void free_page_bmap(struct virtio_balloon *vb)
+{
+	int i;
+
+	for (i = 0; i < vb->page_bmaps; i++)
+		kfree(vb->page_bmap[i]);
+}
+
+static void clear_page_bmap(struct virtio_balloon *vb)
+{
+	int i;
+
+	for (i = 0; i < vb->page_bmaps; i++)
+		memset(vb->page_bmap[i], 0, PAGE_BMAP_SIZE);
+}
+
+static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq,
+			     int type, bool busy_wait)
 {
 	struct scatterlist sg;
+	struct virtio_balloon_page_chunk_hdr *hdr;
+	void *buf;
 	unsigned int len;
 
-	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
+	switch (type) {
+	case PAGE_CHUNK_TYPE_BALLOON:
+		hdr = vb->balloon_page_chunk_hdr;
+		len = 0;
+		break;
+	default:
+		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
+			 __func__, type);
+		return;
+	}
 
-	/* We should always be able to add one buffer to an empty queue. */
-	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
-	virtqueue_kick(vq);
+	buf = (void *)hdr - len;
+	len += sizeof(struct virtio_balloon_page_chunk_hdr);
+	len += hdr->chunks * sizeof(struct virtio_balloon_page_chunk);
+	sg_init_table(&sg, 1);
+	sg_set_buf(&sg, buf, len);
+	if (!virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL)) {
+		virtqueue_kick(vq);
+		if (busy_wait)
+			while (!virtqueue_get_buf(vq, &len) &&
+			       !virtqueue_is_broken(vq))
+				cpu_relax();
+		else
+			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		hdr->chunks = 0;
+	}
+}
+
+static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
+			  int type, u64 base, u64 size)
+{
+	struct virtio_balloon_page_chunk_hdr *hdr;
+	struct virtio_balloon_page_chunk *chunk;
+
+	switch (type) {
+	case PAGE_CHUNK_TYPE_BALLOON:
+		hdr = vb->balloon_page_chunk_hdr;
+		chunk = vb->balloon_page_chunk;
+		break;
+	default:
+		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
+			 __func__, type);
+		return;
+	}
+	chunk = chunk + hdr->chunks;
+	chunk->base = cpu_to_le64(base << VIRTIO_BALLOON_CHUNK_BASE_SHIFT);
+	chunk->size = cpu_to_le64(size << VIRTIO_BALLOON_CHUNK_SIZE_SHIFT);
+	hdr->chunks++;
+	if (hdr->chunks == MAX_PAGE_CHUNKS)
+		send_page_chunks(vb, vq, type, false);
+}
+
+static void chunking_pages_from_bmap(struct virtio_balloon *vb,
+				     struct virtqueue *vq,
+				     unsigned long pfn_start,
+				     unsigned long *bmap,
+				     unsigned long len)
+{
+	unsigned long pos = 0, end = len * BITS_PER_BYTE;
+
+	while (pos < end) {
+		unsigned long one = find_next_bit(bmap, end, pos);
+
+		if (one < end) {
+			unsigned long chunk_size, zero;
+
+			zero = find_next_zero_bit(bmap, end, one + 1);
+			if (zero >= end)
+				chunk_size = end - one;
+			else
+				chunk_size = zero - one;
+
+			if (chunk_size)
+				add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
+					      pfn_start + one, chunk_size);
+			pos = one + chunk_size;
+		} else
+			break;
+	}
+}
+
+static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+{
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS)) {
+		int pfns, page_bmaps, i;
+		unsigned long pfn_start, pfns_len;
+
+		pfn_start = vb->pfn_start;
+		pfns = vb->pfn_stop - pfn_start + 1;
+		pfns = roundup(roundup(pfns, BITS_PER_LONG),
+			       PFNS_PER_PAGE_BMAP);
+		page_bmaps = pfns / PFNS_PER_PAGE_BMAP;
+		pfns_len = pfns / BITS_PER_BYTE;
+
+		for (i = 0; i < page_bmaps; i++) {
+			unsigned int bmap_len = PAGE_BMAP_SIZE;
+
+			/* The last one takes the leftover only */
+			if (i + 1 == page_bmaps)
+				bmap_len = pfns_len - PAGE_BMAP_SIZE * i;
+
+			chunking_pages_from_bmap(vb, vq, pfn_start +
+						 i * PFNS_PER_PAGE_BMAP,
+						 vb->page_bmap[i], bmap_len);
+		}
+		if (vb->balloon_page_chunk_hdr->chunks > 0)
+			send_page_chunks(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
+					 false);
+	} else {
+		struct scatterlist sg;
+		unsigned int len;
 
-	/* When host has read buffer, this completes via balloon_ack */
-	wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
 
+		/*
+		 * We should always be able to add one buffer to an empty
+		 * queue.
+		 */
+		virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
+		virtqueue_kick(vq);
+
+		/* When host has read buffer, this completes via balloon_ack */
+		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+	}
 }
 
 static void set_page_pfns(struct virtio_balloon *vb,
@@ -131,20 +346,73 @@ static void set_page_pfns(struct virtio_balloon *vb,
 {
 	unsigned int i;
 
-	/* Set balloon pfns pointing at this page.
-	 * Note that the first pfn points at start of the page. */
+	/*
+	 * Set balloon pfns pointing at this page.
+	 * Note that the first pfn points at start of the page.
+	 */
 	for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++)
 		pfns[i] = cpu_to_virtio32(vb->vdev,
 					  page_to_balloon_pfn(page) + i);
 }
 
+static void set_page_bmap(struct virtio_balloon *vb,
+			  struct list_head *pages, struct virtqueue *vq)
+{
+	unsigned long pfn_start, pfn_stop;
+	struct page *page;
+	bool found;
+
+	vb->pfn_min = rounddown(vb->pfn_min, BITS_PER_LONG);
+	vb->pfn_max = roundup(vb->pfn_max, BITS_PER_LONG);
+
+	extend_page_bmap_size(vb, vb->pfn_max - vb->pfn_min + 1);
+	pfn_start = vb->pfn_min;
+
+	while (pfn_start < vb->pfn_max) {
+		pfn_stop = pfn_start + PFNS_PER_PAGE_BMAP * vb->page_bmaps;
+		pfn_stop = pfn_stop < vb->pfn_max ? pfn_stop : vb->pfn_max;
+
+		vb->pfn_start = pfn_start;
+		clear_page_bmap(vb);
+		found = false;
+
+		list_for_each_entry(page, pages, lru) {
+			unsigned long bmap_idx, bmap_pos, balloon_pfn;
+
+			balloon_pfn = page_to_balloon_pfn(page);
+			if (balloon_pfn < pfn_start || balloon_pfn > pfn_stop)
+				continue;
+			bmap_idx = (balloon_pfn - pfn_start) /
+				   PFNS_PER_PAGE_BMAP;
+			bmap_pos = (balloon_pfn - pfn_start) %
+				   PFNS_PER_PAGE_BMAP;
+			set_bit(bmap_pos, vb->page_bmap[bmap_idx]);
+
+			found = true;
+		}
+		if (found) {
+			vb->pfn_stop = pfn_stop;
+			tell_host(vb, vq);
+		}
+		pfn_start = pfn_stop;
+	}
+	free_extended_page_bmap(vb);
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	unsigned num_allocated_pages;
+	bool chunking = virtio_has_feature(vb->vdev,
+					   VIRTIO_BALLOON_F_BALLOON_CHUNKS);
 
 	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (chunking) {
+		init_page_bmap_range(vb);
+	} else {
+		/* We can only do one array worth at a time. */
+		num = min(num, ARRAY_SIZE(vb->pfns));
+	}
 
 	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
@@ -159,7 +427,10 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 			msleep(200);
 			break;
 		}
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (chunking)
+			update_page_bmap_range(vb, page);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -168,8 +439,13 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns != 0) {
+		if (chunking)
+			set_page_bmap(vb, &vb_dev_info->pages,
+					vb->inflate_vq);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -195,6 +471,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool chunking = virtio_has_feature(vb->vdev,
+					   VIRTIO_BALLOON_F_BALLOON_CHUNKS);
+	if (chunking)
+		init_page_bmap_range(vb);
+	else
+		/* We can only do one array worth at a time. */
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	/* We can only do one array worth at a time. */
 	num = min(num, ARRAY_SIZE(vb->pfns));
@@ -208,6 +491,10 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		if (!page)
 			break;
 		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (chunking)
+			update_page_bmap_range(vb, page);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -218,8 +505,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns != 0) {
+		if (chunking)
+			set_page_bmap(vb, &pages, vb->deflate_vq);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
@@ -431,6 +722,13 @@ static int init_vqs(struct virtio_balloon *vb)
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
+
+static void tell_host_one_page(struct virtio_balloon *vb,
+			       struct virtqueue *vq, struct page *page)
+{
+	add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON, page_to_pfn(page), 1);
+}
+
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
  *			     a compation thread.     (called under page lock)
@@ -454,6 +752,8 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 {
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
+	bool chunking = virtio_has_feature(vb->vdev,
+					   VIRTIO_BALLOON_F_BALLOON_CHUNKS);
 	unsigned long flags;
 
 	/*
@@ -475,16 +775,22 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
-
+	if (chunking) {
+		tell_host_one_page(vb, vb->inflate_vq, newpage);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
-
+	if (chunking) {
+		tell_host_one_page(vb, vb->deflate_vq, page);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	put_page(page); /* balloon reference */
@@ -511,6 +817,32 @@ static struct file_system_type balloon_fs = {
 
 #endif /* CONFIG_BALLOON_COMPACTION */
 
+static void balloon_page_chunk_init(struct virtio_balloon *vb)
+{
+	void *buf;
+
+	/*
+	 * By default, we allocate page_bmap[0] only. More page_bmap will be
+	 * allocated on demand.
+	 */
+	vb->page_bmap[0] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
+	buf = kmalloc(sizeof(struct virtio_balloon_page_chunk_hdr) +
+		      sizeof(struct virtio_balloon_page_chunk) *
+		      MAX_PAGE_CHUNKS, GFP_KERNEL);
+	if (!vb->page_bmap[0] || !buf) {
+		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS);
+		kfree(vb->page_bmap[0]);
+		kfree(buf);
+		dev_warn(&vb->vdev->dev, "%s: failed\n", __func__);
+	} else {
+		vb->page_bmaps = 1;
+		vb->balloon_page_chunk_hdr = buf;
+		vb->balloon_page_chunk_hdr->chunks = 0;
+		vb->balloon_page_chunk = buf +
+				sizeof(struct virtio_balloon_page_chunk_hdr);
+	}
+}
+
 static int virtballoon_probe(struct virtio_device *vdev)
 {
 	struct virtio_balloon *vb;
@@ -533,6 +865,10 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	spin_lock_init(&vb->stop_update_lock);
 	vb->stop_update = false;
 	vb->num_pages = 0;
+
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS))
+		balloon_page_chunk_init(vb);
+
 	mutex_init(&vb->balloon_lock);
 	init_waitqueue_head(&vb->acked);
 	vb->vdev = vdev;
@@ -609,6 +945,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	cancel_work_sync(&vb->update_balloon_stats_work);
 
 	remove_common(vb);
+	free_page_bmap(vb);
 	if (vb->vb_dev_info.inode)
 		iput(vb->vb_dev_info.inode);
 	kfree(vb);
@@ -649,6 +986,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_BALLOON_CHUNKS,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..be317b7 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_BALLOON_CHUNKS 3 /* Inflate/Deflate pages in chunks */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -82,4 +83,16 @@ struct virtio_balloon_stat {
 	__virtio64 val;
 } __attribute__((packed));
 
+struct virtio_balloon_page_chunk_hdr {
+	/* Number of chunks in the payload */
+	__le32 chunks;
+};
+
+#define VIRTIO_BALLOON_CHUNK_BASE_SHIFT 12
+#define VIRTIO_BALLOON_CHUNK_SIZE_SHIFT 12
+struct virtio_balloon_page_chunk {
+	__le64 base;
+	__le64 size;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [Qemu-devel] [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
@ 2017-04-13  9:35   ` Wei Wang
  0 siblings, 0 replies; 136+ messages in thread
From: Wei Wang @ 2017-04-13  9:35 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

Add a new feature, VIRTIO_BALLOON_F_BALLOON_CHUNKS, which enables
the transfer of the ballooned (i.e. inflated/deflated) pages in
chunks to the host.

The existing virtio-balloon implementation is inefficient because
the ballooned pages are transferred to the host one by one. Here is
the breakdown, in percent, of the time spent on each step of the
balloon inflating process (inflating 7GB of an 8GB idle guest):

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete.
The above profiling shows that the bottlenecks are step 2)
and step 4).

This patch optimizes step 2) by transferring pages to the host in
chunks. A chunk consists of guest physically contiguous pages, and
it is offered to the host via a base PFN (i.e. the start PFN of
those physically contiguous pages) and the size (i.e. the total
number of the pages). A chunk is formatted as below:
--------------------------------------------------------
|                 Base (52 bit)        | Rsvd (12 bit) |
--------------------------------------------------------
--------------------------------------------------------
|                 Size (52 bit)        | Rsvd (12 bit) |
--------------------------------------------------------
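
A chunk is a run of physically contiguous pages, so one way to build
chunks is to mark the ballooned PFNs in a bitmap and then report each
run of set bits as a (base PFN, page count) pair. The user-space sketch
below shows such a scan. It is a simplified illustration rather than
the patch's code, and pfn_run, bit_is_set() and bitmap_to_runs() are
hypothetical names.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* One run of contiguous pages: first PFN and number of pages. */
struct pfn_run {
	uint64_t base_pfn;
	uint64_t nr_pages;
};

static int bit_is_set(const unsigned long *bmap, size_t bit)
{
	return (bmap[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1UL;
}

/*
 * Scan the first nbits bits of bmap, where bit i stands for PFN
 * pfn_start + i, and store each run of set bits in out[].
 * Returns the number of runs stored, at most max_out.
 */
static size_t bitmap_to_runs(const unsigned long *bmap, size_t nbits,
			     uint64_t pfn_start,
			     struct pfn_run *out, size_t max_out)
{
	size_t pos = 0, n = 0;

	assert(bmap && out);

	while (pos < nbits && n < max_out) {
		size_t start;

		/* Skip clear bits up to the next set bit. */
		while (pos < nbits && !bit_is_set(bmap, pos))
			pos++;
		if (pos == nbits)
			break;

		/* Walk to the end of this run of set bits. */
		start = pos;
		while (pos < nbits && bit_is_set(bmap, pos))
			pos++;

		out[n].base_pfn = pfn_start + start;
		out[n].nr_pages = pos - start;
		n++;
	}
	return n;
}
```

A bitmap with bits 1, 2, 3 and 6 set and pfn_start 1000 produces two
runs, (1001, 3) and (1006, 1), so four pages are described by two
entries instead of four.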

By doing so, steps 3) and 4) can also be optimized by doing address
translation and madvise() in chunks rather than page by page.

With this new feature, the above ballooning process takes ~590ms,
resulting in an improvement of ~85%.

TODO: optimize step 1) by allocating/freeing a chunk of pages
instead of a single page each time.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_balloon.c     | 384 +++++++++++++++++++++++++++++++++---
 include/uapi/linux/virtio_balloon.h |  13 ++
 2 files changed, 374 insertions(+), 23 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f59cb4f..5e2e7cc 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -42,6 +42,10 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+#define PAGE_BMAP_SIZE		(8 * PAGE_SIZE)
+#define PFNS_PER_PAGE_BMAP	(PAGE_BMAP_SIZE * BITS_PER_BYTE)
+#define PAGE_BMAP_COUNT_MAX	32
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
@@ -50,6 +54,10 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 static struct vfsmount *balloon_mnt;
 #endif
 
+/* Types of pages to chunk */
+#define PAGE_CHUNK_TYPE_BALLOON 0
+
+#define MAX_PAGE_CHUNKS 4096
 struct virtio_balloon {
 	struct virtio_device *vdev;
 	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
@@ -78,6 +86,32 @@ struct virtio_balloon {
 	/* Synchronize access/update to this struct virtio_balloon elements */
 	struct mutex balloon_lock;
 
+	/*
+	 * Buffer for PAGE_CHUNK_TYPE_BALLOON:
+	 * virtio_balloon_page_chunk_hdr +
+	 * virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
+	 */
+	struct virtio_balloon_page_chunk_hdr *balloon_page_chunk_hdr;
+	struct virtio_balloon_page_chunk *balloon_page_chunk;
+
+	/* Bitmap used to record pages */
+	unsigned long *page_bmap[PAGE_BMAP_COUNT_MAX];
+	/* Number of the allocated page_bmap */
+	unsigned int page_bmaps;
+
+	/*
+	 * The allocated page_bmap size may be smaller than the pfn range of
+	 * the ballooned pages. In this case, we need to use the page_bmap
+	 * multiple times to cover the entire pfn range. It's like using a
+	 * short ruler several times to finish measuring a long object.
+	 * The start location of the ruler in the next measurement is the end
+	 * location of the ruler in the previous measurement.
+	 *
+	 * pfn_max & pfn_min: forms the pfn range of the ballooned pages
+	 * pfn_start & pfn_stop: records the start and stop pfn in each cover
+	 */
+	unsigned long pfn_min, pfn_max, pfn_start, pfn_stop;
+
 	/* The array of pfns we tell the Host about. */
 	unsigned int num_pfns;
 	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -110,20 +144,201 @@ static void balloon_ack(struct virtqueue *vq)
 	wake_up(&vb->acked);
 }
 
-static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+static inline void init_page_bmap_range(struct virtio_balloon *vb)
+{
+	vb->pfn_min = ULONG_MAX;
+	vb->pfn_max = 0;
+}
+
+static inline void update_page_bmap_range(struct virtio_balloon *vb,
+					  struct page *page)
+{
+	unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+	vb->pfn_min = min(balloon_pfn, vb->pfn_min);
+	vb->pfn_max = max(balloon_pfn, vb->pfn_max);
+}
+
+/* The page_bmap size is extended by adding more number of page_bmap */
+static void extend_page_bmap_size(struct virtio_balloon *vb,
+				  unsigned long pfns)
+{
+	int i, bmaps;
+	unsigned long bmap_len;
+
+	bmap_len = ALIGN(pfns, BITS_PER_LONG) / BITS_PER_BYTE;
+	bmap_len = ALIGN(bmap_len, PAGE_BMAP_SIZE);
+	bmaps = min((int)(bmap_len / PAGE_BMAP_SIZE),
+		    PAGE_BMAP_COUNT_MAX);
+
+	for (i = 1; i < bmaps; i++) {
+		vb->page_bmap[i] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
+		if (vb->page_bmap[i])
+			vb->page_bmaps++;
+		else
+			break;
+	}
+}
+
+static void free_extended_page_bmap(struct virtio_balloon *vb)
+{
+	int i, bmaps = vb->page_bmaps;
+
+	for (i = 1; i < bmaps; i++) {
+		kfree(vb->page_bmap[i]);
+		vb->page_bmap[i] = NULL;
+		vb->page_bmaps--;
+	}
+}
+
+static void free_page_bmap(struct virtio_balloon *vb)
+{
+	int i;
+
+	for (i = 0; i < vb->page_bmaps; i++)
+		kfree(vb->page_bmap[i]);
+}
+
+static void clear_page_bmap(struct virtio_balloon *vb)
+{
+	int i;
+
+	for (i = 0; i < vb->page_bmaps; i++)
+		memset(vb->page_bmap[i], 0, PAGE_BMAP_SIZE);
+}
+
+static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq,
+			     int type, bool busy_wait)
 {
 	struct scatterlist sg;
+	struct virtio_balloon_page_chunk_hdr *hdr;
+	void *buf;
 	unsigned int len;
 
-	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
+	switch (type) {
+	case PAGE_CHUNK_TYPE_BALLOON:
+		hdr = vb->balloon_page_chunk_hdr;
+		len = 0;
+		break;
+	default:
+		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
+			 __func__, type);
+		return;
+	}
 
-	/* We should always be able to add one buffer to an empty queue. */
-	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
-	virtqueue_kick(vq);
+	buf = (void *)hdr - len;
+	len += sizeof(struct virtio_balloon_page_chunk_hdr);
+	len += hdr->chunks * sizeof(struct virtio_balloon_page_chunk);
+	sg_init_table(&sg, 1);
+	sg_set_buf(&sg, buf, len);
+	if (!virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL)) {
+		virtqueue_kick(vq);
+		if (busy_wait)
+			while (!virtqueue_get_buf(vq, &len) &&
+			       !virtqueue_is_broken(vq))
+				cpu_relax();
+		else
+			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		hdr->chunks = 0;
+	}
+}
+
+static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
+			  int type, u64 base, u64 size)
+{
+	struct virtio_balloon_page_chunk_hdr *hdr;
+	struct virtio_balloon_page_chunk *chunk;
+
+	switch (type) {
+	case PAGE_CHUNK_TYPE_BALLOON:
+		hdr = vb->balloon_page_chunk_hdr;
+		chunk = vb->balloon_page_chunk;
+		break;
+	default:
+		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
+			 __func__, type);
+		return;
+	}
+	chunk = chunk + hdr->chunks;
+	chunk->base = cpu_to_le64(base << VIRTIO_BALLOON_CHUNK_BASE_SHIFT);
+	chunk->size = cpu_to_le64(size << VIRTIO_BALLOON_CHUNK_SIZE_SHIFT);
+	hdr->chunks++;
+	if (hdr->chunks == MAX_PAGE_CHUNKS)
+		send_page_chunks(vb, vq, type, false);
+}
+
+static void chunking_pages_from_bmap(struct virtio_balloon *vb,
+				     struct virtqueue *vq,
+				     unsigned long pfn_start,
+				     unsigned long *bmap,
+				     unsigned long len)
+{
+	unsigned long pos = 0, end = len * BITS_PER_BYTE;
+
+	while (pos < end) {
+		unsigned long one = find_next_bit(bmap, end, pos);
+
+		if (one < end) {
+			unsigned long chunk_size, zero;
+
+			zero = find_next_zero_bit(bmap, end, one + 1);
+			if (zero >= end)
+				chunk_size = end - one;
+			else
+				chunk_size = zero - one;
+
+			if (chunk_size)
+				add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
+					      pfn_start + one, chunk_size);
+			pos = one + chunk_size;
+		} else
+			break;
+	}
+}
+
+static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+{
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS)) {
+		int pfns, page_bmaps, i;
+		unsigned long pfn_start, pfns_len;
+
+		pfn_start = vb->pfn_start;
+		pfns = vb->pfn_stop - pfn_start + 1;
+		pfns = roundup(roundup(pfns, BITS_PER_LONG),
+			       PFNS_PER_PAGE_BMAP);
+		page_bmaps = pfns / PFNS_PER_PAGE_BMAP;
+		pfns_len = pfns / BITS_PER_BYTE;
+
+		for (i = 0; i < page_bmaps; i++) {
+			unsigned int bmap_len = PAGE_BMAP_SIZE;
+
+			/* The last one takes the leftover only */
+			if (i + 1 == page_bmaps)
+				bmap_len = pfns_len - PAGE_BMAP_SIZE * i;
+
+			chunking_pages_from_bmap(vb, vq, pfn_start +
+						 i * PFNS_PER_PAGE_BMAP,
+						 vb->page_bmap[i], bmap_len);
+		}
+		if (vb->balloon_page_chunk_hdr->chunks > 0)
+			send_page_chunks(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
+					 false);
+	} else {
+		struct scatterlist sg;
+		unsigned int len;
 
-	/* When host has read buffer, this completes via balloon_ack */
-	wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
 
+		/*
+		 * We should always be able to add one buffer to an empty
+		 * queue.
+		 */
+		virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
+		virtqueue_kick(vq);
+
+		/* When host has read buffer, this completes via balloon_ack */
+		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+	}
 }
 
 static void set_page_pfns(struct virtio_balloon *vb,
@@ -131,20 +346,73 @@ static void set_page_pfns(struct virtio_balloon *vb,
 {
 	unsigned int i;
 
-	/* Set balloon pfns pointing at this page.
-	 * Note that the first pfn points at start of the page. */
+	/*
+	 * Set balloon pfns pointing at this page.
+	 * Note that the first pfn points at start of the page.
+	 */
 	for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++)
 		pfns[i] = cpu_to_virtio32(vb->vdev,
 					  page_to_balloon_pfn(page) + i);
 }
 
+static void set_page_bmap(struct virtio_balloon *vb,
+			  struct list_head *pages, struct virtqueue *vq)
+{
+	unsigned long pfn_start, pfn_stop;
+	struct page *page;
+	bool found;
+
+	vb->pfn_min = rounddown(vb->pfn_min, BITS_PER_LONG);
+	vb->pfn_max = roundup(vb->pfn_max, BITS_PER_LONG);
+
+	extend_page_bmap_size(vb, vb->pfn_max - vb->pfn_min + 1);
+	pfn_start = vb->pfn_min;
+
+	while (pfn_start < vb->pfn_max) {
+		pfn_stop = pfn_start + PFNS_PER_PAGE_BMAP * vb->page_bmaps;
+		pfn_stop = pfn_stop < vb->pfn_max ? pfn_stop : vb->pfn_max;
+
+		vb->pfn_start = pfn_start;
+		clear_page_bmap(vb);
+		found = false;
+
+		list_for_each_entry(page, pages, lru) {
+			unsigned long bmap_idx, bmap_pos, balloon_pfn;
+
+			balloon_pfn = page_to_balloon_pfn(page);
+			if (balloon_pfn < pfn_start || balloon_pfn > pfn_stop)
+				continue;
+			bmap_idx = (balloon_pfn - pfn_start) /
+				   PFNS_PER_PAGE_BMAP;
+			bmap_pos = (balloon_pfn - pfn_start) %
+				   PFNS_PER_PAGE_BMAP;
+			set_bit(bmap_pos, vb->page_bmap[bmap_idx]);
+
+			found = true;
+		}
+		if (found) {
+			vb->pfn_stop = pfn_stop;
+			tell_host(vb, vq);
+		}
+		pfn_start = pfn_stop;
+	}
+	free_extended_page_bmap(vb);
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	unsigned num_allocated_pages;
+	bool chunking = virtio_has_feature(vb->vdev,
+					   VIRTIO_BALLOON_F_BALLOON_CHUNKS);
 
 	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (chunking) {
+		init_page_bmap_range(vb);
+	} else {
+		/* We can only do one array worth at a time. */
+		num = min(num, ARRAY_SIZE(vb->pfns));
+	}
 
 	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
@@ -159,7 +427,10 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 			msleep(200);
 			break;
 		}
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (chunking)
+			update_page_bmap_range(vb, page);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -168,8 +439,13 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns != 0) {
+		if (chunking)
+			set_page_bmap(vb, &vb_dev_info->pages,
+					vb->inflate_vq);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -195,6 +471,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool chunking = virtio_has_feature(vb->vdev,
+					   VIRTIO_BALLOON_F_BALLOON_CHUNKS);
+	if (chunking)
+		init_page_bmap_range(vb);
+	else
+		/* We can only do one array worth at a time. */
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	/* We can only do one array worth at a time. */
 	num = min(num, ARRAY_SIZE(vb->pfns));
@@ -208,6 +491,10 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		if (!page)
 			break;
 		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (chunking)
+			update_page_bmap_range(vb, page);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -218,8 +505,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns != 0) {
+		if (chunking)
+			set_page_bmap(vb, &pages, vb->deflate_vq);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
@@ -431,6 +722,13 @@ static int init_vqs(struct virtio_balloon *vb)
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
+
+static void tell_host_one_page(struct virtio_balloon *vb,
+			       struct virtqueue *vq, struct page *page)
+{
+	add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON, page_to_pfn(page), 1);
+}
+
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
  *			     a compation thread.     (called under page lock)
@@ -454,6 +752,8 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 {
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
+	bool chunking = virtio_has_feature(vb->vdev,
+					   VIRTIO_BALLOON_F_BALLOON_CHUNKS);
 	unsigned long flags;
 
 	/*
@@ -475,16 +775,22 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
-
+	if (chunking) {
+		tell_host_one_page(vb, vb->inflate_vq, newpage);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
-
+	if (chunking) {
+		tell_host_one_page(vb, vb->deflate_vq, page);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	put_page(page); /* balloon reference */
@@ -511,6 +817,32 @@ static struct file_system_type balloon_fs = {
 
 #endif /* CONFIG_BALLOON_COMPACTION */
 
+static void balloon_page_chunk_init(struct virtio_balloon *vb)
+{
+	void *buf;
+
+	/*
+	 * By default, we allocate page_bmap[0] only. More page_bmap will be
+	 * allocated on demand.
+	 */
+	vb->page_bmap[0] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
+	buf = kmalloc(sizeof(struct virtio_balloon_page_chunk_hdr) +
+		      sizeof(struct virtio_balloon_page_chunk) *
+		      MAX_PAGE_CHUNKS, GFP_KERNEL);
+	if (!vb->page_bmap[0] || !buf) {
+		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS);
+		kfree(vb->page_bmap[0]);
+		kfree(buf);
+		dev_warn(&vb->vdev->dev, "%s: failed\n", __func__);
+	} else {
+		vb->page_bmaps = 1;
+		vb->balloon_page_chunk_hdr = buf;
+		vb->balloon_page_chunk_hdr->chunks = 0;
+		vb->balloon_page_chunk = buf +
+				sizeof(struct virtio_balloon_page_chunk_hdr);
+	}
+}
+
 static int virtballoon_probe(struct virtio_device *vdev)
 {
 	struct virtio_balloon *vb;
@@ -533,6 +865,10 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	spin_lock_init(&vb->stop_update_lock);
 	vb->stop_update = false;
 	vb->num_pages = 0;
+
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS))
+		balloon_page_chunk_init(vb);
+
 	mutex_init(&vb->balloon_lock);
 	init_waitqueue_head(&vb->acked);
 	vb->vdev = vdev;
@@ -609,6 +945,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	cancel_work_sync(&vb->update_balloon_stats_work);
 
 	remove_common(vb);
+	free_page_bmap(vb);
 	if (vb->vb_dev_info.inode)
 		iput(vb->vb_dev_info.inode);
 	kfree(vb);
@@ -649,6 +986,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_BALLOON_CHUNKS,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..be317b7 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_BALLOON_CHUNKS 3 /* Inflate/Deflate pages in chunks */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -82,4 +83,16 @@ struct virtio_balloon_stat {
 	__virtio64 val;
 } __attribute__((packed));
 
+struct virtio_balloon_page_chunk_hdr {
+	/* Number of chunks in the payload */
+	__le32 chunks;
+};
+
+#define VIRTIO_BALLOON_CHUNK_BASE_SHIFT 12
+#define VIRTIO_BALLOON_CHUNK_SIZE_SHIFT 12
+struct virtio_balloon_page_chunk {
+	__le64 base;
+	__le64 size;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
2.7.4


* [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
  2017-04-13  9:35 ` Wei Wang
                   ` (3 preceding siblings ...)
  (?)
@ 2017-04-13  9:35 ` Wei Wang
  -1 siblings, 0 replies; 136+ messages in thread
From: Wei Wang @ 2017-04-13  9:35 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

Add a new feature, VIRTIO_BALLOON_F_BALLOON_CHUNKS, which enables
the transfer of the ballooned (i.e. inflated/deflated) pages in
chunks to the host.

The current virtio-balloon implementation is not very efficient,
because the ballooned pages are transferred to the host one by
one. Here is the breakdown of the time, in percent, spent on each
step of the balloon inflating process (inflating 7GB of an 8GB
idle guest):

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete.
The above profiling shows that the bottlenecks are step 2)
and step 4).

This patch optimizes step 2) by transferring pages to the host in
chunks. A chunk consists of guest physically contiguous pages, and
it is offered to the host via a base PFN (i.e. the start PFN of
those physically contiguous pages) and the size (i.e. the total
number of the pages). A chunk is formatted as below:
--------------------------------------------------------
|                 Base (52 bit)        | Rsvd (12 bit) |
--------------------------------------------------------
--------------------------------------------------------
|                 Size (52 bit)        | Rsvd (12 bit) |
--------------------------------------------------------

By doing so, steps 3) and 4) can also be optimized by doing address
translation and madvise() in chunks rather than page by page.

With this new feature, the above ballooning process takes ~590ms,
resulting in an improvement of ~85%.

TODO: optimize step 1) by allocating/freeing a chunk of pages
instead of a single page each time.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_balloon.c     | 384 +++++++++++++++++++++++++++++++++---
 include/uapi/linux/virtio_balloon.h |  13 ++
 2 files changed, 374 insertions(+), 23 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f59cb4f..5e2e7cc 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -42,6 +42,10 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+#define PAGE_BMAP_SIZE		(8 * PAGE_SIZE)
+#define PFNS_PER_PAGE_BMAP	(PAGE_BMAP_SIZE * BITS_PER_BYTE)
+#define PAGE_BMAP_COUNT_MAX	32
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
@@ -50,6 +54,10 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 static struct vfsmount *balloon_mnt;
 #endif
 
+/* Types of pages to chunk */
+#define PAGE_CHUNK_TYPE_BALLOON 0
+
+#define MAX_PAGE_CHUNKS 4096
 struct virtio_balloon {
 	struct virtio_device *vdev;
 	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
@@ -78,6 +86,32 @@ struct virtio_balloon {
 	/* Synchronize access/update to this struct virtio_balloon elements */
 	struct mutex balloon_lock;
 
+	/*
+	 * Buffer for PAGE_CHUNK_TYPE_BALLOON:
+	 * virtio_balloon_page_chunk_hdr +
+	 * virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
+	 */
+	struct virtio_balloon_page_chunk_hdr *balloon_page_chunk_hdr;
+	struct virtio_balloon_page_chunk *balloon_page_chunk;
+
+	/* Bitmap used to record pages */
+	unsigned long *page_bmap[PAGE_BMAP_COUNT_MAX];
+	/* Number of the allocated page_bmap */
+	unsigned int page_bmaps;
+
+	/*
+	 * The allocated page_bmap size may be smaller than the pfn range of
+	 * the ballooned pages. In this case, we need to use the page_bmap
+	 * multiple times to cover the entire pfn range. It's like using a
+	 * short ruler several times to finish measuring a long object.
+	 * The start location of the ruler in the next measurement is the end
+	 * location of the ruler in the previous measurement.
+	 *
+	 * pfn_max & pfn_min: forms the pfn range of the ballooned pages
+	 * pfn_start & pfn_stop: records the start and stop pfn in each cover
+	 */
+	unsigned long pfn_min, pfn_max, pfn_start, pfn_stop;
+
 	/* The array of pfns we tell the Host about. */
 	unsigned int num_pfns;
 	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
@@ -110,20 +144,201 @@ static void balloon_ack(struct virtqueue *vq)
 	wake_up(&vb->acked);
 }
 
-static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+static inline void init_page_bmap_range(struct virtio_balloon *vb)
+{
+	vb->pfn_min = ULONG_MAX;
+	vb->pfn_max = 0;
+}
+
+static inline void update_page_bmap_range(struct virtio_balloon *vb,
+					  struct page *page)
+{
+	unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+	vb->pfn_min = min(balloon_pfn, vb->pfn_min);
+	vb->pfn_max = max(balloon_pfn, vb->pfn_max);
+}
+
+/* Extend the bitmap capacity by allocating additional page_bmap entries */
+static void extend_page_bmap_size(struct virtio_balloon *vb,
+				  unsigned long pfns)
+{
+	int i, bmaps;
+	unsigned long bmap_len;
+
+	bmap_len = ALIGN(pfns, BITS_PER_LONG) / BITS_PER_BYTE;
+	bmap_len = ALIGN(bmap_len, PAGE_BMAP_SIZE);
+	bmaps = min((int)(bmap_len / PAGE_BMAP_SIZE),
+		    PAGE_BMAP_COUNT_MAX);
+
+	for (i = 1; i < bmaps; i++) {
+		vb->page_bmap[i] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
+		if (vb->page_bmap[i])
+			vb->page_bmaps++;
+		else
+			break;
+	}
+}
+
+static void free_extended_page_bmap(struct virtio_balloon *vb)
+{
+	int i, bmaps = vb->page_bmaps;
+
+	for (i = 1; i < bmaps; i++) {
+		kfree(vb->page_bmap[i]);
+		vb->page_bmap[i] = NULL;
+		vb->page_bmaps--;
+	}
+}
+
+static void free_page_bmap(struct virtio_balloon *vb)
+{
+	int i;
+
+	for (i = 0; i < vb->page_bmaps; i++)
+		kfree(vb->page_bmap[i]);
+}
+
+static void clear_page_bmap(struct virtio_balloon *vb)
+{
+	int i;
+
+	for (i = 0; i < vb->page_bmaps; i++)
+		memset(vb->page_bmap[i], 0, PAGE_BMAP_SIZE);
+}
+
+static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq,
+			     int type, bool busy_wait)
 {
 	struct scatterlist sg;
+	struct virtio_balloon_page_chunk_hdr *hdr;
+	void *buf;
 	unsigned int len;
 
-	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
+	switch (type) {
+	case PAGE_CHUNK_TYPE_BALLOON:
+		hdr = vb->balloon_page_chunk_hdr;
+		len = 0;
+		break;
+	default:
+		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
+			 __func__, type);
+		return;
+	}
 
-	/* We should always be able to add one buffer to an empty queue. */
-	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
-	virtqueue_kick(vq);
+	buf = (void *)hdr - len;
+	len += sizeof(struct virtio_balloon_page_chunk_hdr);
+	len += hdr->chunks * sizeof(struct virtio_balloon_page_chunk);
+	sg_init_table(&sg, 1);
+	sg_set_buf(&sg, buf, len);
+	if (!virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL)) {
+		virtqueue_kick(vq);
+		if (busy_wait)
+			while (!virtqueue_get_buf(vq, &len) &&
+			       !virtqueue_is_broken(vq))
+				cpu_relax();
+		else
+			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		hdr->chunks = 0;
+	}
+}
+
+static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
+			  int type, u64 base, u64 size)
+{
+	struct virtio_balloon_page_chunk_hdr *hdr;
+	struct virtio_balloon_page_chunk *chunk;
+
+	switch (type) {
+	case PAGE_CHUNK_TYPE_BALLOON:
+		hdr = vb->balloon_page_chunk_hdr;
+		chunk = vb->balloon_page_chunk;
+		break;
+	default:
+		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
+			 __func__, type);
+		return;
+	}
+	chunk = chunk + hdr->chunks;
+	chunk->base = cpu_to_le64(base << VIRTIO_BALLOON_CHUNK_BASE_SHIFT);
+	chunk->size = cpu_to_le64(size << VIRTIO_BALLOON_CHUNK_SIZE_SHIFT);
+	hdr->chunks++;
+	if (hdr->chunks == MAX_PAGE_CHUNKS)
+		send_page_chunks(vb, vq, type, false);
+}
+
+static void chunking_pages_from_bmap(struct virtio_balloon *vb,
+				     struct virtqueue *vq,
+				     unsigned long pfn_start,
+				     unsigned long *bmap,
+				     unsigned long len)
+{
+	unsigned long pos = 0, end = len * BITS_PER_BYTE;
+
+	while (pos < end) {
+		unsigned long one = find_next_bit(bmap, end, pos);
+
+		if (one < end) {
+			unsigned long chunk_size, zero;
+
+			zero = find_next_zero_bit(bmap, end, one + 1);
+			if (zero >= end)
+				chunk_size = end - one;
+			else
+				chunk_size = zero - one;
+
+			if (chunk_size)
+				add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
+					      pfn_start + one, chunk_size);
+			pos = one + chunk_size;
+		} else
+			break;
+	}
+}
+
+static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+{
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS)) {
+		int pfns, page_bmaps, i;
+		unsigned long pfn_start, pfns_len;
+
+		pfn_start = vb->pfn_start;
+		pfns = vb->pfn_stop - pfn_start + 1;
+		pfns = roundup(roundup(pfns, BITS_PER_LONG),
+			       PFNS_PER_PAGE_BMAP);
+		page_bmaps = pfns / PFNS_PER_PAGE_BMAP;
+		pfns_len = pfns / BITS_PER_BYTE;
+
+		for (i = 0; i < page_bmaps; i++) {
+			unsigned int bmap_len = PAGE_BMAP_SIZE;
+
+			/* The last one takes the leftover only */
+			if (i + 1 == page_bmaps)
+				bmap_len = pfns_len - PAGE_BMAP_SIZE * i;
+
+			chunking_pages_from_bmap(vb, vq, pfn_start +
+						 i * PFNS_PER_PAGE_BMAP,
+						 vb->page_bmap[i], bmap_len);
+		}
+		if (vb->balloon_page_chunk_hdr->chunks > 0)
+			send_page_chunks(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
+					 false);
+	} else {
+		struct scatterlist sg;
+		unsigned int len;
 
-	/* When host has read buffer, this completes via balloon_ack */
-	wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
 
+		/*
+		 * We should always be able to add one buffer to an empty
+		 * queue.
+		 */
+		virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
+		virtqueue_kick(vq);
+
+		/* When host has read buffer, this completes via balloon_ack */
+		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+	}
 }
 
 static void set_page_pfns(struct virtio_balloon *vb,
@@ -131,20 +346,73 @@ static void set_page_pfns(struct virtio_balloon *vb,
 {
 	unsigned int i;
 
-	/* Set balloon pfns pointing at this page.
-	 * Note that the first pfn points at start of the page. */
+	/*
+	 * Set balloon pfns pointing at this page.
+	 * Note that the first pfn points at start of the page.
+	 */
 	for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++)
 		pfns[i] = cpu_to_virtio32(vb->vdev,
 					  page_to_balloon_pfn(page) + i);
 }
 
+static void set_page_bmap(struct virtio_balloon *vb,
+			  struct list_head *pages, struct virtqueue *vq)
+{
+	unsigned long pfn_start, pfn_stop;
+	struct page *page;
+	bool found;
+
+	vb->pfn_min = rounddown(vb->pfn_min, BITS_PER_LONG);
+	vb->pfn_max = roundup(vb->pfn_max, BITS_PER_LONG);
+
+	extend_page_bmap_size(vb, vb->pfn_max - vb->pfn_min + 1);
+	pfn_start = vb->pfn_min;
+
+	while (pfn_start < vb->pfn_max) {
+		pfn_stop = pfn_start + PFNS_PER_PAGE_BMAP * vb->page_bmaps;
+		pfn_stop = pfn_stop < vb->pfn_max ? pfn_stop : vb->pfn_max;
+
+		vb->pfn_start = pfn_start;
+		clear_page_bmap(vb);
+		found = false;
+
+		list_for_each_entry(page, pages, lru) {
+			unsigned long bmap_idx, bmap_pos, balloon_pfn;
+
+			balloon_pfn = page_to_balloon_pfn(page);
+			if (balloon_pfn < pfn_start || balloon_pfn > pfn_stop)
+				continue;
+			bmap_idx = (balloon_pfn - pfn_start) /
+				   PFNS_PER_PAGE_BMAP;
+			bmap_pos = (balloon_pfn - pfn_start) %
+				   PFNS_PER_PAGE_BMAP;
+			set_bit(bmap_pos, vb->page_bmap[bmap_idx]);
+
+			found = true;
+		}
+		if (found) {
+			vb->pfn_stop = pfn_stop;
+			tell_host(vb, vq);
+		}
+		pfn_start = pfn_stop;
+	}
+	free_extended_page_bmap(vb);
+}
+
 static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	unsigned num_allocated_pages;
+	bool chunking = virtio_has_feature(vb->vdev,
+					   VIRTIO_BALLOON_F_BALLOON_CHUNKS);
 
 	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (chunking) {
+		init_page_bmap_range(vb);
+	} else {
+		/* We can only do one array worth at a time. */
+		num = min(num, ARRAY_SIZE(vb->pfns));
+	}
 
 	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
@@ -159,7 +427,10 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 			msleep(200);
 			break;
 		}
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (chunking)
+			update_page_bmap_range(vb, page);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -168,8 +439,13 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns != 0) {
+		if (chunking)
+			set_page_bmap(vb, &vb_dev_info->pages,
+					vb->inflate_vq);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -195,6 +471,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool chunking = virtio_has_feature(vb->vdev,
+					   VIRTIO_BALLOON_F_BALLOON_CHUNKS);
+	if (chunking)
+		init_page_bmap_range(vb);
+	else
+		/* We can only do one array worth at a time. */
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	/* We can only do one array worth at a time. */
 	num = min(num, ARRAY_SIZE(vb->pfns));
@@ -208,6 +491,10 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		if (!page)
 			break;
 		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (chunking)
+			update_page_bmap_range(vb, page);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -218,8 +505,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns != 0) {
+		if (chunking)
+			set_page_bmap(vb, &pages, vb->deflate_vq);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
@@ -431,6 +722,13 @@ static int init_vqs(struct virtio_balloon *vb)
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
+
+static void tell_host_one_page(struct virtio_balloon *vb,
+			       struct virtqueue *vq, struct page *page)
+{
+	add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON, page_to_pfn(page), 1);
+}
+
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
  *			     a compation thread.     (called under page lock)
@@ -454,6 +752,8 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 {
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
+	bool chunking = virtio_has_feature(vb->vdev,
+					   VIRTIO_BALLOON_F_BALLOON_CHUNKS);
 	unsigned long flags;
 
 	/*
@@ -475,16 +775,22 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
-
+	if (chunking) {
+		tell_host_one_page(vb, vb->inflate_vq, newpage);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
-
+	if (chunking) {
+		tell_host_one_page(vb, vb->deflate_vq, page);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	put_page(page); /* balloon reference */
@@ -511,6 +817,32 @@ static struct file_system_type balloon_fs = {
 
 #endif /* CONFIG_BALLOON_COMPACTION */
 
+static void balloon_page_chunk_init(struct virtio_balloon *vb)
+{
+	void *buf;
+
+	/*
+	 * By default, we allocate page_bmap[0] only. More page_bmap will be
+	 * allocated on demand.
+	 */
+	vb->page_bmap[0] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
+	buf = kmalloc(sizeof(struct virtio_balloon_page_chunk_hdr) +
+		      sizeof(struct virtio_balloon_page_chunk) *
+		      MAX_PAGE_CHUNKS, GFP_KERNEL);
+	if (!vb->page_bmap[0] || !buf) {
+		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS);
+		kfree(vb->page_bmap[0]);
+		kfree(buf);
+		dev_warn(&vb->vdev->dev, "%s: failed\n", __func__);
+	} else {
+		vb->page_bmaps = 1;
+		vb->balloon_page_chunk_hdr = buf;
+		vb->balloon_page_chunk_hdr->chunks = 0;
+		vb->balloon_page_chunk = buf +
+				sizeof(struct virtio_balloon_page_chunk_hdr);
+	}
+}
+
 static int virtballoon_probe(struct virtio_device *vdev)
 {
 	struct virtio_balloon *vb;
@@ -533,6 +865,10 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	spin_lock_init(&vb->stop_update_lock);
 	vb->stop_update = false;
 	vb->num_pages = 0;
+
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS))
+		balloon_page_chunk_init(vb);
+
 	mutex_init(&vb->balloon_lock);
 	init_waitqueue_head(&vb->acked);
 	vb->vdev = vdev;
@@ -609,6 +945,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	cancel_work_sync(&vb->update_balloon_stats_work);
 
 	remove_common(vb);
+	free_page_bmap(vb);
 	if (vb->vb_dev_info.inode)
 		iput(vb->vb_dev_info.inode);
 	kfree(vb);
@@ -649,6 +986,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_BALLOON_CHUNKS,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..be317b7 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_BALLOON_CHUNKS 3 /* Inflate/Deflate pages in chunks */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -82,4 +83,16 @@ struct virtio_balloon_stat {
 	__virtio64 val;
 } __attribute__((packed));
 
+struct virtio_balloon_page_chunk_hdr {
+	/* Number of chunks in the payload */
+	__le32 chunks;
+};
+
+#define VIRTIO_BALLOON_CHUNK_BASE_SHIFT 12
+#define VIRTIO_BALLOON_CHUNK_SIZE_SHIFT 12
+struct virtio_balloon_page_chunk {
+	__le64 base;
+	__le64 size;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH v9 3/5] mm: function to offer a page block on the free list
  2017-04-13  9:35 ` Wei Wang
  (?)
@ 2017-04-13  9:35   ` Wei Wang
  -1 siblings, 0 replies; 136+ messages in thread
From: Wei Wang @ 2017-04-13  9:35 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

Add a function to find a page block on the free list specified by the
caller. Pages from the page block may be used immediately after the
function returns. The caller is responsible for detecting or preventing
the use of such pages.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
---
 include/linux/mm.h |  3 ++
 mm/page_alloc.c    | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 90 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b84615b..096705e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1764,6 +1764,9 @@ extern void free_area_init(unsigned long * zones_size);
 extern void free_area_init_node(int nid, unsigned long * zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
+extern int inquire_unused_page_block(struct zone *zone, unsigned int order,
+				     unsigned int migratetype,
+				     struct page **page);
 
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f3e0c69..fa8203f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4498,6 +4498,93 @@ void show_free_areas(unsigned int filter)
 	show_swap_cache_info();
 }
 
+/**
+ * Heuristically get a page block in the system that is unused.
+ * It is possible that pages from the page block are used immediately after
+ * inquire_unused_page_block() returns. It is the caller's responsibility
+ * to either detect or prevent the use of such pages.
+ *
+ * The free list to check: zone->free_area[order].free_list[migratetype].
+ *
+ * If the caller supplied page block (i.e. **page) is on the free list, offer
+ * the next page block on the list to the caller. Otherwise, offer the first
+ * page block on the list.
+ *
+ * Return 0 if a page block is found, 1 if none is left, -EINVAL on bad input.
+ */
+int inquire_unused_page_block(struct zone *zone, unsigned int order,
+			      unsigned int migratetype, struct page **page)
+{
+	struct zone *this_zone;
+	struct list_head *this_list;
+	int ret = 0;
+	unsigned long flags;
+
+	/* Sanity check */
+	if (zone == NULL || page == NULL || order >= MAX_ORDER ||
+	    migratetype >= MIGRATE_TYPES)
+		return -EINVAL;
+
+	/* Zone validity check */
+	for_each_populated_zone(this_zone) {
+		if (zone == this_zone)
+			break;
+	}
+
+	/* Got a non-existent zone from the caller? */
+	if (zone != this_zone)
+		return -EINVAL;
+
+	spin_lock_irqsave(&this_zone->lock, flags);
+
+	this_list = &zone->free_area[order].free_list[migratetype];
+	if (list_empty(this_list)) {
+		*page = NULL;
+		ret = 1;
+		goto out;
+	}
+
+	/* The caller is asking for the first free page block on the list */
+	if ((*page) == NULL) {
+		*page = list_first_entry(this_list, struct page, lru);
+		ret = 0;
+		goto out;
+	}
+
+	/**
+	 * The page block passed from the caller is not on this free list
+	 * anymore (e.g. a 1MB free page block has been split). In this case,
+	 * offer the first page block on the free list that the caller is
+	 * asking for.
+	 */
+	if (PageBuddy(*page) && order != page_order(*page)) {
+		*page = list_first_entry(this_list, struct page, lru);
+		ret = 0;
+		goto out;
+	}
+
+	/**
+	 * The page block passed from the caller has been the last page block
+	 * on the list.
+	 */
+	if ((*page)->lru.next == this_list) {
+		*page = NULL;
+		ret = 1;
+		goto out;
+	}
+
+	/**
+	 * Finally, fall into the regular case: the page block passed from the
+	 * caller is still on the free list. Offer the next one.
+	 */
+	*page = list_next_entry((*page), lru);
+	ret = 0;
+out:
+	spin_unlock_irqrestore(&this_zone->lock, flags);
+	return ret;
+}
+EXPORT_SYMBOL(inquire_unused_page_block);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 136+ messages in thread
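
The cursor protocol described in the function comment above (pass NULL to
get the first block, pass the previous block to get the next one, stop when
the return value is non-zero) can be sketched in userspace as follows. This
is only a rough model under stated assumptions: struct block,
next_unused_block() and count_unused_pages() are hypothetical names that do
not appear in the patch, a NULL-terminated singly linked list stands in for
the kernel free list, and the zone lock and the "block is no longer on the
list" fallback are left out.

```c
#include <stddef.h>

/* Hypothetical stand-in for a free page block on one free list. */
struct block {
	unsigned long pfn;
	struct block *next;	/* NULL terminates the list in this model */
};

/*
 * Model of the cursor protocol: *cur == NULL asks for the first block,
 * otherwise the block following *cur is offered.
 * Returns 0 when a block is offered, 1 when the list is exhausted
 * (*cur is then reset to NULL), and -1 on a bad argument.
 */
int next_unused_block(struct block *head, struct block **cur)
{
	if (cur == NULL)
		return -1;

	if (*cur == NULL)
		*cur = head;
	else
		*cur = (*cur)->next;

	return *cur ? 0 : 1;
}

/*
 * Walk the whole list the way a caller would: keep asking until the
 * return value is non-zero, counting 2^order pages per block offered.
 */
unsigned long count_unused_pages(struct block *head, unsigned int order)
{
	struct block *cur = NULL;
	unsigned long pages = 0;

	while (next_unused_block(head, &cur) == 0)
		pages += 1UL << order;

	return pages;
}
```

Because the cursor is reset to NULL once the list is exhausted, a caller can
reuse the same variable for the next free list without re-initializing it.
In the real function the offered block may be allocated as soon as the zone
lock is dropped, which is why the commit message leaves detecting or
preventing that to the caller.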

* [PATCH v9 4/5] mm: export symbol of next_zone and first_online_pgdat
  2017-04-13  9:35 ` Wei Wang
  (?)
@ 2017-04-13  9:35   ` Wei Wang
  -1 siblings, 0 replies; 136+ messages in thread
From: Wei Wang @ 2017-04-13  9:35 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

Export next_zone() and first_online_pgdat() so that for_each_zone() and
for_each_populated_zone() can be invoked by a kernel module.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 mm/mmzone.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/mmzone.c b/mm/mmzone.c
index 5652be8..e14b7ec 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -13,6 +13,7 @@ struct pglist_data *first_online_pgdat(void)
 {
 	return NODE_DATA(first_online_node);
 }
+EXPORT_SYMBOL_GPL(first_online_pgdat);
 
 struct pglist_data *next_online_pgdat(struct pglist_data *pgdat)
 {
@@ -41,6 +42,7 @@ struct zone *next_zone(struct zone *zone)
 	}
 	return zone;
 }
+EXPORT_SYMBOL_GPL(next_zone);
 
 static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
 {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 136+ messages in thread

* [PATCH v9 5/5] virtio-balloon: VIRTIO_BALLOON_F_MISC_VQ
  2017-04-13  9:35 ` Wei Wang
  (?)
@ 2017-04-13  9:35   ` Wei Wang
  -1 siblings, 0 replies; 136+ messages in thread
From: Wei Wang @ 2017-04-13  9:35 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

Add a new vq, miscq, to handle miscellaneous requests between the device
and the driver.

This patch implements the VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES
request sent from the device. Upon receiving this request from the
miscq, the driver offers the guest's unused pages to the device.

Tests have shown that skipping the transfer of unused pages of a 32G
guest can reduce the live migration time to 1/8.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
---
 drivers/virtio/virtio_balloon.c     | 209 +++++++++++++++++++++++++++++++++---
 include/uapi/linux/virtio_balloon.h |   8 ++
 2 files changed, 204 insertions(+), 13 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 5e2e7cc..95c703e 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -56,11 +56,12 @@ static struct vfsmount *balloon_mnt;
 
 /* Types of pages to chunk */
 #define PAGE_CHUNK_TYPE_BALLOON 0
+#define PAGE_CHUNK_TYPE_UNUSED 1
 
 #define MAX_PAGE_CHUNKS 4096
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *miscq;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
@@ -94,6 +95,19 @@ struct virtio_balloon {
 	struct virtio_balloon_page_chunk_hdr *balloon_page_chunk_hdr;
 	struct virtio_balloon_page_chunk *balloon_page_chunk;
 
+	/*
+	 * Buffer for PAGE_CHUNK_TYPE_UNUSED:
+	 * virtio_balloon_miscq_hdr +
+	 * virtio_balloon_page_chunk_hdr +
+	 * virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
+	 */
+	struct virtio_balloon_miscq_hdr *miscq_out_hdr;
+	struct virtio_balloon_page_chunk_hdr *unused_page_chunk_hdr;
+	struct virtio_balloon_page_chunk *unused_page_chunk;
+
+	/* Buffer for host to send cmd to miscq */
+	struct virtio_balloon_miscq_hdr *miscq_in_hdr;
+
 	/* Bitmap used to record pages */
 	unsigned long *page_bmap[PAGE_BMAP_COUNT_MAX];
 	/* Number of the allocated page_bmap */
@@ -220,6 +234,10 @@ static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq,
 		hdr = vb->balloon_page_chunk_hdr;
 		len = 0;
 		break;
+	case PAGE_CHUNK_TYPE_UNUSED:
+		hdr = vb->unused_page_chunk_hdr;
+		len = sizeof(struct virtio_balloon_miscq_hdr);
+		break;
 	default:
 		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
 			 __func__, type);
@@ -254,6 +272,10 @@ static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
 		hdr = vb->balloon_page_chunk_hdr;
 		chunk = vb->balloon_page_chunk;
 		break;
+	case PAGE_CHUNK_TYPE_UNUSED:
+		hdr = vb->unused_page_chunk_hdr;
+		chunk = vb->unused_page_chunk;
+		break;
 	default:
 		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
 			 __func__, type);
@@ -686,28 +708,139 @@ static void update_balloon_size_func(struct work_struct *work)
 		queue_work(system_freezable_wq, work);
 }
 
+static void miscq_in_hdr_add(struct virtio_balloon *vb)
+{
+	struct scatterlist sg_in;
+
+	sg_init_one(&sg_in, vb->miscq_in_hdr,
+		    sizeof(struct virtio_balloon_miscq_hdr));
+	if (virtqueue_add_inbuf(vb->miscq, &sg_in, 1, vb->miscq_in_hdr,
+	    GFP_KERNEL) < 0) {
+		__virtio_clear_bit(vb->vdev,
+				   VIRTIO_BALLOON_F_MISC_VQ);
+		dev_warn(&vb->vdev->dev, "%s: add miscq_in_hdr err\n",
+			 __func__);
+		return;
+	}
+	virtqueue_kick(vb->miscq);
+}
+
+static void miscq_send_unused_pages(struct virtio_balloon *vb)
+{
+	struct virtio_balloon_miscq_hdr *miscq_out_hdr = vb->miscq_out_hdr;
+	struct virtqueue *vq = vb->miscq;
+	int ret = 0;
+	unsigned int order = 0, migratetype = 0;
+	struct zone *zone = NULL;
+	struct page *page = NULL;
+	u64 pfn;
+
+	miscq_out_hdr->cmd = VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES;
+	miscq_out_hdr->flags = 0;
+
+	for_each_populated_zone(zone) {
+		for (order = MAX_ORDER - 1; order > 0; order--) {
+			for (migratetype = 0; migratetype < MIGRATE_TYPES;
+			     migratetype++) {
+				do {
+					ret = inquire_unused_page_block(zone,
+						order, migratetype, &page);
+					if (!ret) {
+						pfn = (u64)page_to_pfn(page);
+						add_one_chunk(vb, vq,
+							PAGE_CHUNK_TYPE_UNUSED,
+							pfn,
+							(u64)(1 << order));
+					}
+				} while (!ret);
+			}
+		}
+	}
+	miscq_out_hdr->flags |= VIRTIO_BALLOON_MISCQ_F_COMPLETE;
+	send_page_chunks(vb, vq, PAGE_CHUNK_TYPE_UNUSED, true);
+}
+
+static void miscq_handle(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+	struct virtio_balloon_miscq_hdr *hdr;
+	unsigned int len;
+
+	hdr = virtqueue_get_buf(vb->miscq, &len);
+	if (!hdr || len != sizeof(struct virtio_balloon_miscq_hdr)) {
+		dev_warn(&vb->vdev->dev, "%s: invalid miscq hdr len\n",
+			 __func__);
+		miscq_in_hdr_add(vb);
+		return;
+	}
+	switch (hdr->cmd) {
+	case VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES:
+		miscq_send_unused_pages(vb);
+		break;
+	default:
+		dev_warn(&vb->vdev->dev, "%s: miscq cmd %d not supported\n",
+			 __func__, hdr->cmd);
+	}
+	miscq_in_hdr_add(vb);
+}
+
 static int init_vqs(struct virtio_balloon *vb)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
-	int err, nvqs;
+	struct virtqueue **vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	int err = -ENOMEM;
+	int i, nvqs;
+
+	/* Inflateq and deflateq are used unconditionally */
+	nvqs = 2;
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		nvqs++;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_MISC_VQ))
+		nvqs++;
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs)
+		goto err_vq;
+	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
+	if (!callbacks)
+		goto err_callback;
+	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
+	if (!names)
+		goto err_names;
+
+	callbacks[0] = balloon_ack;
+	names[0] = "inflate";
+	callbacks[1] = balloon_ack;
+	names[1] = "deflate";
+
+	i = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		callbacks[i] = stats_request;
+		names[i] = "stats";
+		i++;
+	}
 
-	/*
-	 * We expect two virtqueues: inflate and deflate, and
-	 * optionally stat.
-	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names);
+	if (virtio_has_feature(vb->vdev,
+				      VIRTIO_BALLOON_F_MISC_VQ)) {
+		callbacks[i] = miscq_handle;
+		names[i] = "miscq";
+	}
+
+	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks,
+					 names);
 	if (err)
-		return err;
+		goto err_find;
 
 	vb->inflate_vq = vqs[0];
 	vb->deflate_vq = vqs[1];
+	i = 2;
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
 		struct scatterlist sg;
-		vb->stats_vq = vqs[2];
 
+		vb->stats_vq = vqs[i++];
 		/*
 		 * Prime this virtqueue with one buffer so the hypervisor can
 		 * use it to signal us later (it can't be broken yet!).
@@ -718,7 +851,25 @@ static int init_vqs(struct virtio_balloon *vb)
 			BUG();
 		virtqueue_kick(vb->stats_vq);
 	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_MISC_VQ)) {
+		vb->miscq = vqs[i];
+		miscq_in_hdr_add(vb);
+	}
+
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
 	return 0;
+
+err_find:
+	kfree(names);
+err_names:
+	kfree(callbacks);
+err_callback:
+	kfree(vqs);
+err_vq:
+	return err;
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -843,6 +994,32 @@ static void balloon_page_chunk_init(struct virtio_balloon *vb)
 	}
 }
 
+static void miscq_init(struct virtio_balloon *vb)
+{
+	void *buf;
+
+	vb->miscq_in_hdr = kmalloc(sizeof(struct virtio_balloon_miscq_hdr),
+				   GFP_KERNEL);
+	buf = kmalloc(sizeof(struct virtio_balloon_miscq_hdr) +
+		      sizeof(struct virtio_balloon_page_chunk_hdr) +
+		      sizeof(struct virtio_balloon_page_chunk) *
+		      MAX_PAGE_CHUNKS, GFP_KERNEL);
+	if (!vb->miscq_in_hdr || !buf) {
+		kfree(buf);
+		kfree(vb->miscq_in_hdr);
+		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_MISC_VQ);
+		dev_warn(&vb->vdev->dev, "%s: failed\n", __func__);
+	} else {
+		vb->miscq_out_hdr = buf;
+		vb->unused_page_chunk_hdr = buf +
+				sizeof(struct virtio_balloon_miscq_hdr);
+		vb->unused_page_chunk_hdr->chunks = 0;
+		vb->unused_page_chunk = buf +
+				sizeof(struct virtio_balloon_miscq_hdr) +
+				sizeof(struct virtio_balloon_page_chunk_hdr);
+	}
+}
+
 static int virtballoon_probe(struct virtio_device *vdev)
 {
 	struct virtio_balloon *vb;
@@ -869,6 +1046,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS))
 		balloon_page_chunk_init(vb);
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_MISC_VQ))
+		miscq_init(vb);
+
 	mutex_init(&vb->balloon_lock);
 	init_waitqueue_head(&vb->acked);
 	vb->vdev = vdev;
@@ -946,6 +1126,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
 
 	remove_common(vb);
 	free_page_bmap(vb);
+	kfree(vb->miscq_out_hdr);
+	kfree(vb->miscq_in_hdr);
 	if (vb->vb_dev_info.inode)
 		iput(vb->vb_dev_info.inode);
 	kfree(vb);
@@ -987,6 +1169,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_BALLOON_CHUNKS,
+	VIRTIO_BALLOON_F_MISC_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index be317b7..96bdc86 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_BALLOON_CHUNKS 3 /* Inflate/Deflate pages in chunks */
+#define VIRTIO_BALLOON_F_MISC_VQ	4 /* Virtqueue for misc. requests */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -95,4 +96,11 @@ struct virtio_balloon_page_chunk {
 	__le64 size;
 };
 
+#define VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES 0
+#define VIRTIO_BALLOON_MISCQ_F_COMPLETE 0x1
+struct virtio_balloon_miscq_hdr {
+	__le16 cmd;
+	__le16 flags;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 136+ messages in thread
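
The init_vqs() hunk above stops hard-coding the stats queue at vqs[2]
and instead hands out slots in feature order. A minimal userspace
sketch of that bookkeeping follows; struct vq_slots and
balloon_vq_slots() are hypothetical names used only for illustration,
and only the two feature bit numbers are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Feature bit numbers as defined in the patch's uapi header. */
#define VIRTIO_BALLOON_F_STATS_VQ 1
#define VIRTIO_BALLOON_F_MISC_VQ  4

/* Hypothetical result type: an index of -1 means "queue not present". */
struct vq_slots {
	int nvqs;
	int stats_idx;
	int misc_idx;
};

/* Hypothetical helper mirroring the index bookkeeping in init_vqs(). */
static struct vq_slots balloon_vq_slots(uint64_t features)
{
	/* Slots 0 and 1 are always inflate and deflate. */
	struct vq_slots s = { .nvqs = 2, .stats_idx = -1, .misc_idx = -1 };

	/* Optional queues take the next free slot, stats before misc. */
	if (features & (1ULL << VIRTIO_BALLOON_F_STATS_VQ))
		s.stats_idx = s.nvqs++;
	if (features & (1ULL << VIRTIO_BALLOON_F_MISC_VQ))
		s.misc_idx = s.nvqs++;
	return s;
}
```

Under these assumptions the miscq lands in slot 2 when the stats queue
is absent and in slot 3 when it is present, which is why the patch
tracks the slot with a running index rather than a constant.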

* [PATCH v9 5/5] virtio-balloon: VIRTIO_BALLOON_F_MISC_VQ
@ 2017-04-13  9:35   ` Wei Wang
  0 siblings, 0 replies; 136+ messages in thread
From: Wei Wang @ 2017-04-13  9:35 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

Add a new vq, miscq, to handle miscellaneous requests between the device
and the driver.

This patch implements the VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES
request sent from the device. Upon receiving this request on the
miscq, the driver reports the guest's unused pages to the device.

Tests have shown that skipping the transfer of unused pages of a 32G
guest can reduce the live migration time to 1/8.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
---
 drivers/virtio/virtio_balloon.c     | 209 +++++++++++++++++++++++++++++++++---
 include/uapi/linux/virtio_balloon.h |   8 ++
 2 files changed, 204 insertions(+), 13 deletions(-)
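
As a rough illustration of the exchange described in the commit
message above, the following userspace sketch decodes a 4-byte miscq
header and picks an action. The enum, struct miscq_hdr_host and
miscq_classify() are hypothetical and exist only for this sketch. It
converts the fields from little-endian explicitly because the uapi
struct declares them as __le16, whereas the driver code in this patch
compares hdr->cmd directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values from the patch's uapi header. */
#define VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES 0
#define VIRTIO_BALLOON_MISCQ_F_COMPLETE 0x1

/* Host-order view of the 4-byte miscq header (cmd, then flags). */
struct miscq_hdr_host {
	uint16_t cmd;
	uint16_t flags;
};

/* Hypothetical outcome codes, for this sketch only. */
enum miscq_action {
	MISCQ_BAD_LEN,
	MISCQ_UNSUPPORTED,
	MISCQ_SEND_UNUSED_PAGES,
};

/* Read a little-endian 16-bit value byte by byte (endian-independent). */
static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/* Decode the wire header and report which branch a handler would take. */
static enum miscq_action miscq_classify(const uint8_t *buf, size_t len,
					struct miscq_hdr_host *out)
{
	/* Same length check as the patch: exactly one header. */
	if (!buf || len != 4)
		return MISCQ_BAD_LEN;

	out->cmd = get_le16(buf);
	out->flags = get_le16(buf + 2);

	switch (out->cmd) {
	case VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES:
		return MISCQ_SEND_UNUSED_PAGES;
	default:
		return MISCQ_UNSUPPORTED;
	}
}
```

In the driver, every one of these branches ends by re-adding the input
header buffer to the miscq so that the device can send the next
request.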

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 5e2e7cc..95c703e 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -56,11 +56,12 @@ static struct vfsmount *balloon_mnt;
 
 /* Types of pages to chunk */
 #define PAGE_CHUNK_TYPE_BALLOON 0
+#define PAGE_CHUNK_TYPE_UNUSED 1
 
 #define MAX_PAGE_CHUNKS 4096
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *miscq;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
@@ -94,6 +95,19 @@ struct virtio_balloon {
 	struct virtio_balloon_page_chunk_hdr *balloon_page_chunk_hdr;
 	struct virtio_balloon_page_chunk *balloon_page_chunk;
 
+	/*
+	 * Buffer for PAGE_CHUNK_TYPE_UNUSED:
+	 * virtio_balloon_miscq_hdr +
+	 * virtio_balloon_page_chunk_hdr +
+	 * virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
+	 */
+	struct virtio_balloon_miscq_hdr *miscq_out_hdr;
+	struct virtio_balloon_page_chunk_hdr *unused_page_chunk_hdr;
+	struct virtio_balloon_page_chunk *unused_page_chunk;
+
+	/* Buffer for host to send cmd to miscq */
+	struct virtio_balloon_miscq_hdr *miscq_in_hdr;
+
 	/* Bitmap used to record pages */
 	unsigned long *page_bmap[PAGE_BMAP_COUNT_MAX];
 	/* Number of the allocated page_bmap */
@@ -220,6 +234,10 @@ static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq,
 		hdr = vb->balloon_page_chunk_hdr;
 		len = 0;
 		break;
+	case PAGE_CHUNK_TYPE_UNUSED:
+		hdr = vb->unused_page_chunk_hdr;
+		len = sizeof(struct virtio_balloon_miscq_hdr);
+		break;
 	default:
 		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
 			 __func__, type);
@@ -254,6 +272,10 @@ static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
 		hdr = vb->balloon_page_chunk_hdr;
 		chunk = vb->balloon_page_chunk;
 		break;
+	case PAGE_CHUNK_TYPE_UNUSED:
+		hdr = vb->unused_page_chunk_hdr;
+		chunk = vb->unused_page_chunk;
+		break;
 	default:
 		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
 			 __func__, type);
@@ -686,28 +708,139 @@ static void update_balloon_size_func(struct work_struct *work)
 		queue_work(system_freezable_wq, work);
 }
 
+static void miscq_in_hdr_add(struct virtio_balloon *vb)
+{
+	struct scatterlist sg_in;
+
+	sg_init_one(&sg_in, vb->miscq_in_hdr,
+		    sizeof(struct virtio_balloon_miscq_hdr));
+	if (virtqueue_add_inbuf(vb->miscq, &sg_in, 1, vb->miscq_in_hdr,
+	    GFP_KERNEL) < 0) {
+		__virtio_clear_bit(vb->vdev,
+				   VIRTIO_BALLOON_F_MISC_VQ);
+		dev_warn(&vb->vdev->dev, "%s: add miscq_in_hdr err\n",
+			 __func__);
+		return;
+	}
+	virtqueue_kick(vb->miscq);
+}
+
+static void miscq_send_unused_pages(struct virtio_balloon *vb)
+{
+	struct virtio_balloon_miscq_hdr *miscq_out_hdr = vb->miscq_out_hdr;
+	struct virtqueue *vq = vb->miscq;
+	int ret = 0;
+	unsigned int order = 0, migratetype = 0;
+	struct zone *zone = NULL;
+	struct page *page = NULL;
+	u64 pfn;
+
+	miscq_out_hdr->cmd = VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES;
+	miscq_out_hdr->flags = 0;
+
+	for_each_populated_zone(zone) {
+		for (order = MAX_ORDER - 1; order > 0; order--) {
+			for (migratetype = 0; migratetype < MIGRATE_TYPES;
+			     migratetype++) {
+				do {
+					ret = inquire_unused_page_block(zone,
+						order, migratetype, &page);
+					if (!ret) {
+						pfn = (u64)page_to_pfn(page);
+						add_one_chunk(vb, vq,
+							PAGE_CHUNK_TYPE_UNUSED,
+							pfn,
+							(u64)(1 << order));
+					}
+				} while (!ret);
+			}
+		}
+	}
+	miscq_out_hdr->flags |= VIRTIO_BALLOON_MISCQ_F_COMPLETE;
+	send_page_chunks(vb, vq, PAGE_CHUNK_TYPE_UNUSED, true);
+}
+
+static void miscq_handle(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+	struct virtio_balloon_miscq_hdr *hdr;
+	unsigned int len;
+
+	hdr = virtqueue_get_buf(vb->miscq, &len);
+	if (!hdr || len != sizeof(struct virtio_balloon_miscq_hdr)) {
+		dev_warn(&vb->vdev->dev, "%s: invalid miscq hdr len\n",
+			 __func__);
+		miscq_in_hdr_add(vb);
+		return;
+	}
+	switch (hdr->cmd) {
+	case VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES:
+		miscq_send_unused_pages(vb);
+		break;
+	default:
+		dev_warn(&vb->vdev->dev, "%s: miscq cmd %d not supported\n",
+			 __func__, hdr->cmd);
+	}
+	miscq_in_hdr_add(vb);
+}
+
 static int init_vqs(struct virtio_balloon *vb)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
-	int err, nvqs;
+	struct virtqueue **vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	int err = -ENOMEM;
+	int i, nvqs;
+
+	/* Inflateq and deflateq are used unconditionally */
+	nvqs = 2;
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		nvqs++;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_MISC_VQ))
+		nvqs++;
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs)
+		goto err_vq;
+	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
+	if (!callbacks)
+		goto err_callback;
+	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
+	if (!names)
+		goto err_names;
+
+	callbacks[0] = balloon_ack;
+	names[0] = "inflate";
+	callbacks[1] = balloon_ack;
+	names[1] = "deflate";
+
+	i = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		callbacks[i] = stats_request;
+		names[i] = "stats";
+		i++;
+	}
 
-	/*
-	 * We expect two virtqueues: inflate and deflate, and
-	 * optionally stat.
-	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names);
+	if (virtio_has_feature(vb->vdev,
+				      VIRTIO_BALLOON_F_MISC_VQ)) {
+		callbacks[i] = miscq_handle;
+		names[i] = "miscq";
+	}
+
+	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks,
+					 names);
 	if (err)
-		return err;
+		goto err_find;
 
 	vb->inflate_vq = vqs[0];
 	vb->deflate_vq = vqs[1];
+	i = 2;
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
 		struct scatterlist sg;
-		vb->stats_vq = vqs[2];
 
+		vb->stats_vq = vqs[i++];
 		/*
 		 * Prime this virtqueue with one buffer so the hypervisor can
 		 * use it to signal us later (it can't be broken yet!).
@@ -718,7 +851,25 @@ static int init_vqs(struct virtio_balloon *vb)
 			BUG();
 		virtqueue_kick(vb->stats_vq);
 	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_MISC_VQ)) {
+		vb->miscq = vqs[i];
+		miscq_in_hdr_add(vb);
+	}
+
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
 	return 0;
+
+err_find:
+	kfree(names);
+err_names:
+	kfree(callbacks);
+err_callback:
+	kfree(vqs);
+err_vq:
+	return err;
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -843,6 +994,32 @@ static void balloon_page_chunk_init(struct virtio_balloon *vb)
 	}
 }
 
+static void miscq_init(struct virtio_balloon *vb)
+{
+	void *buf;
+
+	vb->miscq_in_hdr = kmalloc(sizeof(struct virtio_balloon_miscq_hdr),
+				   GFP_KERNEL);
+	buf = kmalloc(sizeof(struct virtio_balloon_miscq_hdr) +
+		      sizeof(struct virtio_balloon_page_chunk_hdr) +
+		      sizeof(struct virtio_balloon_page_chunk) *
+		      MAX_PAGE_CHUNKS, GFP_KERNEL);
+	if (!vb->miscq_in_hdr || !buf) {
+		kfree(buf);
+		kfree(vb->miscq_in_hdr);
+		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_MISC_VQ);
+		dev_warn(&vb->vdev->dev, "%s: failed\n", __func__);
+	} else {
+		vb->miscq_out_hdr = buf;
+		vb->unused_page_chunk_hdr = buf +
+				sizeof(struct virtio_balloon_miscq_hdr);
+		vb->unused_page_chunk_hdr->chunks = 0;
+		vb->unused_page_chunk = buf +
+				sizeof(struct virtio_balloon_miscq_hdr) +
+				sizeof(struct virtio_balloon_page_chunk_hdr);
+	}
+}
+
 static int virtballoon_probe(struct virtio_device *vdev)
 {
 	struct virtio_balloon *vb;
@@ -869,6 +1046,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS))
 		balloon_page_chunk_init(vb);
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_MISC_VQ))
+		miscq_init(vb);
+
 	mutex_init(&vb->balloon_lock);
 	init_waitqueue_head(&vb->acked);
 	vb->vdev = vdev;
@@ -946,6 +1126,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
 
 	remove_common(vb);
 	free_page_bmap(vb);
+	kfree(vb->miscq_out_hdr);
+	kfree(vb->miscq_in_hdr);
 	if (vb->vb_dev_info.inode)
 		iput(vb->vb_dev_info.inode);
 	kfree(vb);
@@ -987,6 +1169,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_BALLOON_CHUNKS,
+	VIRTIO_BALLOON_F_MISC_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index be317b7..96bdc86 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_BALLOON_CHUNKS 3 /* Inflate/Deflate pages in chunks */
+#define VIRTIO_BALLOON_F_MISC_VQ	4 /* Virtqueue for misc. requests */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -95,4 +96,11 @@ struct virtio_balloon_page_chunk {
 	__le64 size;
 };
 
+#define VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES 0
+#define VIRTIO_BALLOON_MISCQ_F_COMPLETE 0x1
+struct virtio_balloon_miscq_hdr {
+	__le16 cmd;
+	__le16 flags;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 136+ messages in thread
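
miscq_send_unused_pages() in the patch above reports each free block
as a chunk of (1 << order) pages, walking orders from MAX_ORDER - 1
down to 1 as posted, so order-0 blocks are not reported. A small
sketch of the size arithmetic follows. It assumes 4 KiB pages
(PAGE_SHIFT_ASSUMED is a hypothetical constant, not taken from the
patch) and widens to 64 bits before shifting, whereas the patch shifts
an int and casts the result.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages. */
#define PAGE_SHIFT_ASSUMED 12

/* Number of pages in a free block of the given buddy order. */
static uint64_t chunk_pages(unsigned int order)
{
	assert(order < 64);
	return (uint64_t)1 << order;
}

/* Bytes covered by one chunk of the given order. */
static uint64_t chunk_bytes(unsigned int order)
{
	assert(order < 64 - PAGE_SHIFT_ASSUMED);
	return chunk_pages(order) << PAGE_SHIFT_ASSUMED;
}
```

With these assumptions an order-10 block is 1024 pages, or 4 MiB, per
chunk entry.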

* [PATCH v9 5/5] virtio-balloon: VIRTIO_BALLOON_F_MISC_VQ
  2017-04-13  9:35 ` Wei Wang
                   ` (9 preceding siblings ...)
  (?)
@ 2017-04-13  9:35 ` Wei Wang
  -1 siblings, 0 replies; 136+ messages in thread
From: Wei Wang @ 2017-04-13  9:35 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

Add a new vq, miscq, to handle miscellaneous requests between the device
and the driver.

This patch implements the VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES
request sent from the device. Upon receiving this request on the
miscq, the driver reports the guest's unused pages to the device.

Tests have shown that skipping the transfer of unused pages of a 32G
guest can reduce the live migration time to 1/8.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
---
 drivers/virtio/virtio_balloon.c     | 209 +++++++++++++++++++++++++++++++++---
 include/uapi/linux/virtio_balloon.h |   8 ++
 2 files changed, 204 insertions(+), 13 deletions(-)
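
miscq_init() in the diff below carves one allocation into three
regions: the miscq header, the chunk header, and the chunk array. The
sketch here computes the same offsets in userspace. The struct field
widths for the chunk header and chunk entry are assumptions, since
those definitions come from an earlier patch in the series and only
"->chunks" and "__le64 size" are visible here. miscq_out_layout() and
struct miscq_layout are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PAGE_CHUNKS 4096	/* same value as in the driver */

struct miscq_hdr {		/* stand-in for virtio_balloon_miscq_hdr */
	uint16_t cmd;
	uint16_t flags;
};

struct page_chunk_hdr {		/* assumed: a single 32-bit chunk count */
	uint32_t chunks;
};

struct page_chunk {		/* assumed: 64-bit base and 64-bit size */
	uint64_t base;
	uint64_t size;
};

/* Hypothetical summary of where each region starts in the buffer. */
struct miscq_layout {
	size_t chunk_hdr_off;
	size_t chunk_off;
	size_t total;
};

static struct miscq_layout miscq_out_layout(void)
{
	struct miscq_layout l;

	/* The miscq header sits at offset 0, then the chunk header. */
	l.chunk_hdr_off = sizeof(struct miscq_hdr);
	/* The chunk array follows the chunk header directly. */
	l.chunk_off = l.chunk_hdr_off + sizeof(struct page_chunk_hdr);
	/* Total size matches the kmalloc() expression in miscq_init(). */
	l.total = l.chunk_off + sizeof(struct page_chunk) * MAX_PAGE_CHUNKS;
	return l;
}
```

With the assumed field widths the whole buffer is a little over 64 KiB.
The real sizes depend on the struct definitions in the earlier patch.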

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 5e2e7cc..95c703e 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -56,11 +56,12 @@ static struct vfsmount *balloon_mnt;
 
 /* Types of pages to chunk */
 #define PAGE_CHUNK_TYPE_BALLOON 0
+#define PAGE_CHUNK_TYPE_UNUSED 1
 
 #define MAX_PAGE_CHUNKS 4096
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *miscq;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
@@ -94,6 +95,19 @@ struct virtio_balloon {
 	struct virtio_balloon_page_chunk_hdr *balloon_page_chunk_hdr;
 	struct virtio_balloon_page_chunk *balloon_page_chunk;
 
+	/*
+	 * Buffer for PAGE_CHUNK_TYPE_UNUSED:
+	 * virtio_balloon_miscq_hdr +
+	 * virtio_balloon_page_chunk_hdr +
+	 * virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
+	 */
+	struct virtio_balloon_miscq_hdr *miscq_out_hdr;
+	struct virtio_balloon_page_chunk_hdr *unused_page_chunk_hdr;
+	struct virtio_balloon_page_chunk *unused_page_chunk;
+
+	/* Buffer for host to send cmd to miscq */
+	struct virtio_balloon_miscq_hdr *miscq_in_hdr;
+
 	/* Bitmap used to record pages */
 	unsigned long *page_bmap[PAGE_BMAP_COUNT_MAX];
 	/* Number of the allocated page_bmap */
@@ -220,6 +234,10 @@ static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq,
 		hdr = vb->balloon_page_chunk_hdr;
 		len = 0;
 		break;
+	case PAGE_CHUNK_TYPE_UNUSED:
+		hdr = vb->unused_page_chunk_hdr;
+		len = sizeof(struct virtio_balloon_miscq_hdr);
+		break;
 	default:
 		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
 			 __func__, type);
@@ -254,6 +272,10 @@ static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
 		hdr = vb->balloon_page_chunk_hdr;
 		chunk = vb->balloon_page_chunk;
 		break;
+	case PAGE_CHUNK_TYPE_UNUSED:
+		hdr = vb->unused_page_chunk_hdr;
+		chunk = vb->unused_page_chunk;
+		break;
 	default:
 		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
 			 __func__, type);
@@ -686,28 +708,139 @@ static void update_balloon_size_func(struct work_struct *work)
 		queue_work(system_freezable_wq, work);
 }
 
+static void miscq_in_hdr_add(struct virtio_balloon *vb)
+{
+	struct scatterlist sg_in;
+
+	sg_init_one(&sg_in, vb->miscq_in_hdr,
+		    sizeof(struct virtio_balloon_miscq_hdr));
+	if (virtqueue_add_inbuf(vb->miscq, &sg_in, 1, vb->miscq_in_hdr,
+	    GFP_KERNEL) < 0) {
+		__virtio_clear_bit(vb->vdev,
+				   VIRTIO_BALLOON_F_MISC_VQ);
+		dev_warn(&vb->vdev->dev, "%s: add miscq_in_hdr err\n",
+			 __func__);
+		return;
+	}
+	virtqueue_kick(vb->miscq);
+}
+
+static void miscq_send_unused_pages(struct virtio_balloon *vb)
+{
+	struct virtio_balloon_miscq_hdr *miscq_out_hdr = vb->miscq_out_hdr;
+	struct virtqueue *vq = vb->miscq;
+	int ret = 0;
+	unsigned int order = 0, migratetype = 0;
+	struct zone *zone = NULL;
+	struct page *page = NULL;
+	u64 pfn;
+
+	miscq_out_hdr->cmd = VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES;
+	miscq_out_hdr->flags = 0;
+
+	for_each_populated_zone(zone) {
+		for (order = MAX_ORDER - 1; order > 0; order--) {
+			for (migratetype = 0; migratetype < MIGRATE_TYPES;
+			     migratetype++) {
+				do {
+					ret = inquire_unused_page_block(zone,
+						order, migratetype, &page);
+					if (!ret) {
+						pfn = (u64)page_to_pfn(page);
+						add_one_chunk(vb, vq,
+							PAGE_CHUNK_TYPE_UNUSED,
+							pfn,
+							(u64)(1 << order));
+					}
+				} while (!ret);
+			}
+		}
+	}
+	miscq_out_hdr->flags |= VIRTIO_BALLOON_MISCQ_F_COMPLETE;
+	send_page_chunks(vb, vq, PAGE_CHUNK_TYPE_UNUSED, true);
+}
+
+static void miscq_handle(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+	struct virtio_balloon_miscq_hdr *hdr;
+	unsigned int len;
+
+	hdr = virtqueue_get_buf(vb->miscq, &len);
+	if (!hdr || len != sizeof(struct virtio_balloon_miscq_hdr)) {
+		dev_warn(&vb->vdev->dev, "%s: invalid miscq hdr len\n",
+			 __func__);
+		miscq_in_hdr_add(vb);
+		return;
+	}
+	switch (hdr->cmd) {
+	case VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES:
+		miscq_send_unused_pages(vb);
+		break;
+	default:
+		dev_warn(&vb->vdev->dev, "%s: miscq cmd %d not supported\n",
+			 __func__, hdr->cmd);
+	}
+	miscq_in_hdr_add(vb);
+}
+
 static int init_vqs(struct virtio_balloon *vb)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
-	int err, nvqs;
+	struct virtqueue **vqs;
+	vq_callback_t **callbacks;
+	const char **names;
+	int err = -ENOMEM;
+	int i, nvqs;
+
+	/* Inflateq and deflateq are used unconditionally */
+	nvqs = 2;
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		nvqs++;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_MISC_VQ))
+		nvqs++;
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs)
+		goto err_vq;
+	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
+	if (!callbacks)
+		goto err_callback;
+	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
+	if (!names)
+		goto err_names;
+
+	callbacks[0] = balloon_ack;
+	names[0] = "inflate";
+	callbacks[1] = balloon_ack;
+	names[1] = "deflate";
+
+	i = 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		callbacks[i] = stats_request;
+		names[i] = "stats";
+		i++;
+	}
 
-	/*
-	 * We expect two virtqueues: inflate and deflate, and
-	 * optionally stat.
-	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names);
+	if (virtio_has_feature(vb->vdev,
+				      VIRTIO_BALLOON_F_MISC_VQ)) {
+		callbacks[i] = miscq_handle;
+		names[i] = "miscq";
+	}
+
+	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks,
+					 names);
 	if (err)
-		return err;
+		goto err_find;
 
 	vb->inflate_vq = vqs[0];
 	vb->deflate_vq = vqs[1];
+	i = 2;
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
 		struct scatterlist sg;
-		vb->stats_vq = vqs[2];
 
+		vb->stats_vq = vqs[i++];
 		/*
 		 * Prime this virtqueue with one buffer so the hypervisor can
 		 * use it to signal us later (it can't be broken yet!).
@@ -718,7 +851,25 @@ static int init_vqs(struct virtio_balloon *vb)
 			BUG();
 		virtqueue_kick(vb->stats_vq);
 	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_MISC_VQ)) {
+		vb->miscq = vqs[i];
+		miscq_in_hdr_add(vb);
+	}
+
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
 	return 0;
+
+err_find:
+	kfree(names);
+err_names:
+	kfree(callbacks);
+err_callback:
+	kfree(vqs);
+err_vq:
+	return err;
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
@@ -843,6 +994,32 @@ static void balloon_page_chunk_init(struct virtio_balloon *vb)
 	}
 }
 
+static void miscq_init(struct virtio_balloon *vb)
+{
+	void *buf;
+
+	vb->miscq_in_hdr = kmalloc(sizeof(struct virtio_balloon_miscq_hdr),
+				   GFP_KERNEL);
+	buf = kmalloc(sizeof(struct virtio_balloon_miscq_hdr) +
+		      sizeof(struct virtio_balloon_page_chunk_hdr) +
+		      sizeof(struct virtio_balloon_page_chunk) *
+		      MAX_PAGE_CHUNKS, GFP_KERNEL);
+	if (!vb->miscq_in_hdr || !buf) {
+		kfree(buf);
+		kfree(vb->miscq_in_hdr);
+		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_MISC_VQ);
+		dev_warn(&vb->vdev->dev, "%s: failed\n", __func__);
+	} else {
+		vb->miscq_out_hdr = buf;
+		vb->unused_page_chunk_hdr = buf +
+				sizeof(struct virtio_balloon_miscq_hdr);
+		vb->unused_page_chunk_hdr->chunks = 0;
+		vb->unused_page_chunk = buf +
+				sizeof(struct virtio_balloon_miscq_hdr) +
+				sizeof(struct virtio_balloon_page_chunk_hdr);
+	}
+}
+
 static int virtballoon_probe(struct virtio_device *vdev)
 {
 	struct virtio_balloon *vb;
@@ -869,6 +1046,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS))
 		balloon_page_chunk_init(vb);
 
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_MISC_VQ))
+		miscq_init(vb);
+
 	mutex_init(&vb->balloon_lock);
 	init_waitqueue_head(&vb->acked);
 	vb->vdev = vdev;
@@ -946,6 +1126,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
 
 	remove_common(vb);
 	free_page_bmap(vb);
+	kfree(vb->miscq_out_hdr);
+	kfree(vb->miscq_in_hdr);
 	if (vb->vb_dev_info.inode)
 		iput(vb->vb_dev_info.inode);
 	kfree(vb);
@@ -987,6 +1169,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_BALLOON_CHUNKS,
+	VIRTIO_BALLOON_F_MISC_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index be317b7..96bdc86 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_BALLOON_CHUNKS 3 /* Inflate/Deflate pages in chunks */
+#define VIRTIO_BALLOON_F_MISC_VQ	4 /* Virtqueue for misc. requests */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -95,4 +96,11 @@ struct virtio_balloon_page_chunk {
 	__le64 size;
 };
 
+#define VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES 0
+#define VIRTIO_BALLOON_MISCQ_F_COMPLETE 0x1
+struct virtio_balloon_miscq_hdr {
+	__le16 cmd;
+	__le16 flags;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
2.7.4


* Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
  2017-04-13  9:35   ` Wei Wang
  (?)
@ 2017-04-13 16:34     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 136+ messages in thread
From: Michael S. Tsirkin @ 2017-04-13 16:34 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Thu, Apr 13, 2017 at 05:35:05PM +0800, Wei Wang wrote:
> Add a new feature, VIRTIO_BALLOON_F_BALLOON_CHUNKS, which enables

Let's find a better name here.
VIRTIO_BALLOON_F_PAGE_CHUNK


> the transfer of the ballooned (i.e. inflated/deflated) pages in
> chunks to the host.
> 
> The implementation of the previous virtio-balloon is not very
> efficient, because the ballooned pages are transferred to the
> host one by one. Here is the breakdown of the time in percentage
> spent on each step of the balloon inflating process (inflating
> 7GB of an 8GB idle guest).
> 
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
> 
> It takes about 4126ms for the inflating process to complete.
> The above profiling shows that the bottlenecks are stage 2)
> and stage 4).
> 
> This patch optimizes step 2) by transferring pages to the host in
> chunks. A chunk consists of guest physically continuous pages, and
> it is offered to the host via a base PFN (i.e. the start PFN of
> those physically continuous pages) and the size (i.e. the total
> number of the pages). A chunk is formated as below:

formatted

> --------------------------------------------------------
> |                 Base (52 bit)        | Rsvd (12 bit) |
> --------------------------------------------------------
> --------------------------------------------------------
> |                 Size (52 bit)        | Rsvd (12 bit) |
> --------------------------------------------------------
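
To illustrate the layout above, here is a minimal user-space sketch of
packing one chunk into two 64-bit fields. It assumes the 12-bit shifts
that the patch defines as VIRTIO_BALLOON_CHUNK_BASE_SHIFT and
VIRTIO_BALLOON_CHUNK_SIZE_SHIFT. The struct and helper names are
hypothetical, and the little-endian conversion that the driver does
with cpu_to_le64() is left out:

```c
#include <stdint.h>

#define CHUNK_BASE_SHIFT 12	/* mirrors VIRTIO_BALLOON_CHUNK_BASE_SHIFT */
#define CHUNK_SIZE_SHIFT 12	/* mirrors VIRTIO_BALLOON_CHUNK_SIZE_SHIFT */

/* Hypothetical CPU-endian view of one chunk. */
struct chunk {
	uint64_t base;	/* start PFN in bits 63..12, low 12 bits reserved */
	uint64_t size;	/* page count in bits 63..12, low 12 bits reserved */
};

/* Pack a start PFN and a page count into the two shifted fields. */
static struct chunk chunk_encode(uint64_t pfn, uint64_t npages)
{
	struct chunk c = {
		.base = pfn << CHUNK_BASE_SHIFT,
		.size = npages << CHUNK_SIZE_SHIFT,
	};
	return c;
}

/* Recover the start PFN from an encoded chunk. */
static uint64_t chunk_base_pfn(const struct chunk *c)
{
	return c->base >> CHUNK_BASE_SHIFT;
}

/* Recover the page count from an encoded chunk. */
static uint64_t chunk_npages(const struct chunk *c)
{
	return c->size >> CHUNK_SIZE_SHIFT;
}
```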
> 
> By doing so, step 4) can also be optimized by doing address
> translation and madvise() in chunks rather than page by page.
> 
> With this new feature, the above ballooning process takes ~590ms
> resulting in an improvement of ~85%.
> 
> TODO: optimize stage 1) by allocating/freeing a chunk of pages
> instead of a single page each time.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>

So we don't need the bitmap to talk to host, it is just
a data structure we chose to maintain lists of pages, right?
OK as far as it goes but you need much better isolation for it.
Build a data structure with APIs such as _init, _cleanup, _add, _clear,
_find_first, _find_next.
Completely unrelated to pages, it just maintains bits.
Then use it here.
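
A minimal user-space sketch of what such an isolated structure could
look like. Everything here is an assumption for illustration: the
"bitset" name, the function names (modelled on the _init, _cleanup,
_add, _clear, _find_first, _find_next list above) and the use of
calloc() in place of a kernel allocator:

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical standalone bit set; it knows nothing about pages. */
struct bitset {
	unsigned long *words;
	size_t nbits;
};

static size_t bitset_nwords(size_t nbits)
{
	return (nbits + BITS_PER_WORD - 1) / BITS_PER_WORD;
}

/* Returns 0 on success, -1 if the allocation fails. */
static int bitset_init(struct bitset *bs, size_t nbits)
{
	bs->words = calloc(bitset_nwords(nbits), sizeof(*bs->words));
	bs->nbits = bs->words ? nbits : 0;
	return bs->words ? 0 : -1;
}

static void bitset_cleanup(struct bitset *bs)
{
	free(bs->words);
	bs->words = NULL;	/* a repeated cleanup is then harmless */
	bs->nbits = 0;
}

/* Returns -1 instead of writing out of bounds. */
static int bitset_add(struct bitset *bs, size_t bit)
{
	if (bit >= bs->nbits)
		return -1;
	bs->words[bit / BITS_PER_WORD] |= 1UL << (bit % BITS_PER_WORD);
	return 0;
}

static void bitset_clear(struct bitset *bs)
{
	if (bs->words)
		memset(bs->words, 0,
		       bitset_nwords(bs->nbits) * sizeof(*bs->words));
}

/* First set bit at or after 'from', or nbits if there is none. */
static size_t bitset_find_next(const struct bitset *bs, size_t from)
{
	size_t bit;

	for (bit = from; bit < bs->nbits; bit++)
		if (bs->words[bit / BITS_PER_WORD] &
		    (1UL << (bit % BITS_PER_WORD)))
			return bit;
	return bs->nbits;
}

static size_t bitset_find_first(const struct bitset *bs)
{
	return bitset_find_next(bs, 0);
}
```

The balloon code would then only translate between pfns and bit
indexes, and the bounds check in the _add helper replaces an
unchecked set_bit() on a possibly missing bitmap.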


> ---
>  drivers/virtio/virtio_balloon.c     | 384 +++++++++++++++++++++++++++++++++---
>  include/uapi/linux/virtio_balloon.h |  13 ++
>  2 files changed, 374 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index f59cb4f..5e2e7cc 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -42,6 +42,10 @@
>  #define OOM_VBALLOON_DEFAULT_PAGES 256
>  #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
>  
> +#define PAGE_BMAP_SIZE		(8 * PAGE_SIZE)
> +#define PFNS_PER_PAGE_BMAP	(PAGE_BMAP_SIZE * BITS_PER_BYTE)
> +#define PAGE_BMAP_COUNT_MAX	32
> +

Please prefix with VIRTIO_BALLOON_ and add comments.

>  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
>  module_param(oom_pages, int, S_IRUSR | S_IWUSR);
>  MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> @@ -50,6 +54,10 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>  static struct vfsmount *balloon_mnt;
>  #endif
>  
> +/* Types of pages to chunk */
> +#define PAGE_CHUNK_TYPE_BALLOON 0
> +

Doesn't look like you are ever adding more types in this
patchset.  Pls keep code simple, generalize it later.

> +#define MAX_PAGE_CHUNKS 4096

This is an order-4 allocation. I'd make it 4095 and then it's
an order-3 one.
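
The effect of dropping one entry can be checked with a little
arithmetic. A minimal sketch, assuming 4 KiB pages, the 4-byte header
and 16-byte chunk entry from this patch, and an allocator that rounds
up to a power-of-two number of pages; the helper names are
hypothetical:

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */
#define HDR_SIZE 4UL		/* one __le32 chunk count */
#define ENTRY_SIZE 16UL		/* two __le64 fields per chunk */

/* Smallest order such that (SKETCH_PAGE_SIZE << order) >= size. */
static unsigned int alloc_order(size_t size)
{
	unsigned int order = 0;

	while ((SKETCH_PAGE_SIZE << order) < size)
		order++;
	return order;
}

/* Bytes needed for the header plus max_chunks entries. */
static size_t chunk_buf_size(size_t max_chunks)
{
	return HDR_SIZE + ENTRY_SIZE * max_chunks;
}
```

Under these assumptions the 4096-entry buffer is 65540 bytes, 4 bytes
past a 64 KiB block, so a power-of-two allocator has to hand out twice
that; with 4095 entries it fits. The exact order numbers depend on the
page size and on the final struct sizes.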

>  struct virtio_balloon {
>  	struct virtio_device *vdev;
>  	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> @@ -78,6 +86,32 @@ struct virtio_balloon {
>  	/* Synchronize access/update to this struct virtio_balloon elements */
>  	struct mutex balloon_lock;
>  
> +	/*
> +	 * Buffer for PAGE_CHUNK_TYPE_BALLOON:
> +	 * virtio_balloon_page_chunk_hdr +
> +	 * virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
> +	 */
> +	struct virtio_balloon_page_chunk_hdr *balloon_page_chunk_hdr;
> +	struct virtio_balloon_page_chunk *balloon_page_chunk;
> +
> +	/* Bitmap used to record pages */
> +	unsigned long *page_bmap[PAGE_BMAP_COUNT_MAX];
> +	/* Number of the allocated page_bmap */
> +	unsigned int page_bmaps;
> +
> +	/*
> +	 * The allocated page_bmap size may be smaller than the pfn range of
> +	 * the ballooned pages. In this case, we need to use the page_bmap
> +	 * multiple times to cover the entire pfn range. It's like using a
> +	 * short ruler several times to finish measuring a long object.
> +	 * The start location of the ruler in the next measurement is the end
> +	 * location of the ruler in the previous measurement.
> +	 *
> +	 * pfn_max & pfn_min: forms the pfn range of the ballooned pages
> +	 * pfn_start & pfn_stop: records the start and stop pfn in each cover

cover? what does this mean?

looks like you only use these to pass data to tell_host.
so pass these as parameters and you won't need to keep
them in this structure.

And then you can move this comment to set_page_bmap where
it belongs.

> +	 */
> +	unsigned long pfn_min, pfn_max, pfn_start, pfn_stop;
> +
>  	/* The array of pfns we tell the Host about. */
>  	unsigned int num_pfns;
>  	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> @@ -110,20 +144,201 @@ static void balloon_ack(struct virtqueue *vq)
>  	wake_up(&vb->acked);
>  }
>  
> -static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> +static inline void init_page_bmap_range(struct virtio_balloon *vb)
> +{
> +	vb->pfn_min = ULONG_MAX;
> +	vb->pfn_max = 0;
> +}
> +
> +static inline void update_page_bmap_range(struct virtio_balloon *vb,
> +					  struct page *page)
> +{
> +	unsigned long balloon_pfn = page_to_balloon_pfn(page);
> +
> +	vb->pfn_min = min(balloon_pfn, vb->pfn_min);
> +	vb->pfn_max = max(balloon_pfn, vb->pfn_max);
> +}
> +
> +/* The page_bmap size is extended by adding more number of page_bmap */

did you mean

	Allocate more bitmaps to cover the given number of pfns
	and add them to page_bmap

?

This isn't what this function does.
It blindly assumes 1 bitmap is allocated
and allocates more, up to PAGE_BMAP_COUNT_MAX.

> +static void extend_page_bmap_size(struct virtio_balloon *vb,
> +				  unsigned long pfns)
> +{
> +	int i, bmaps;
> +	unsigned long bmap_len;
> +
> +	bmap_len = ALIGN(pfns, BITS_PER_LONG) / BITS_PER_BYTE;
> +	bmap_len = ALIGN(bmap_len, PAGE_BMAP_SIZE);

Align? PAGE_BMAP_SIZE doesn't even have to be a power of 2 ...

> +	bmaps = min((int)(bmap_len / PAGE_BMAP_SIZE),
> +		    PAGE_BMAP_COUNT_MAX);

I got lost here.

Please use things like ARRAY_SIZE instead of macros.
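
One way to make this computation easier to follow, sketched in user
space: divide with round-up rather than ALIGN (which needs a
power-of-two alignment), and cap the result with ARRAY_SIZE of the
array itself. The macro values and the page_bmap stand-in are
assumptions for illustration:

```c
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

#define BMAP_BYTES (8 * 4096UL)		/* assumption: PAGE_BMAP_SIZE, 4 KiB pages */
#define PFNS_PER_BMAP (BMAP_BYTES * 8)	/* one bit per pfn */

/* Hypothetical stand-in for the page_bmap[] array in struct virtio_balloon. */
static unsigned long *page_bmap[32];

/* Number of bitmaps needed to cover 'pfns' bits, capped by the array size. */
static size_t bmaps_needed(unsigned long pfns)
{
	size_t n = DIV_ROUND_UP(pfns, PFNS_PER_BMAP);

	return n < ARRAY_SIZE(page_bmap) ? n : ARRAY_SIZE(page_bmap);
}
```

Because the cap comes from the array, changing its declared size
cannot get out of sync with the limit.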

> +
> +	for (i = 1; i < bmaps; i++) {
> +		vb->page_bmap[i] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
> +		if (vb->page_bmap[i])
> +			vb->page_bmaps++;
> +		else
> +			break;
> +	}
> +}
> +
> +static void free_extended_page_bmap(struct virtio_balloon *vb)
> +{
> +	int i, bmaps = vb->page_bmaps;
> +
> +	for (i = 1; i < bmaps; i++) {
> +		kfree(vb->page_bmap[i]);
> +		vb->page_bmap[i] = NULL;
> +		vb->page_bmaps--;
> +	}
> +}
> +

What's the magic number 1 here?
Maybe you want to document what is going on.
Here's a guess:

We keep a single bmap around at all times.
If memory does not fit there, we allocate up to
PAGE_BMAP_COUNT_MAX of chunks.


> +static void free_page_bmap(struct virtio_balloon *vb)
> +{
> +	int i;
> +
> +	for (i = 0; i < vb->page_bmaps; i++)
> +		kfree(vb->page_bmap[i]);
> +}
> +
> +static void clear_page_bmap(struct virtio_balloon *vb)
> +{
> +	int i;
> +
> +	for (i = 0; i < vb->page_bmaps; i++)
> +		memset(vb->page_bmap[i], 0, PAGE_BMAP_SIZE);
> +}
> +
> +static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq,
> +			     int type, bool busy_wait)

busy_wait seems unused. pls drop.

>  {
>  	struct scatterlist sg;
> +	struct virtio_balloon_page_chunk_hdr *hdr;
> +	void *buf;
>  	unsigned int len;
>  
> -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> +	switch (type) {
> +	case PAGE_CHUNK_TYPE_BALLOON:
> +		hdr = vb->balloon_page_chunk_hdr;
> +		len = 0;
> +		break;
> +	default:
> +		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
> +			 __func__, type);
> +		return;
> +	}
>  
> -	/* We should always be able to add one buffer to an empty queue. */
> -	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> -	virtqueue_kick(vq);
> +	buf = (void *)hdr - len;

Moving back to before the header? How can this make sense?
It works fine since len is 0, so just buf = hdr.

> +	len += sizeof(struct virtio_balloon_page_chunk_hdr);
> +	len += hdr->chunks * sizeof(struct virtio_balloon_page_chunk);
> +	sg_init_table(&sg, 1);
> +	sg_set_buf(&sg, buf, len);
> +	if (!virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL)) {
> +		virtqueue_kick(vq);
> +		if (busy_wait)
> +			while (!virtqueue_get_buf(vq, &len) &&
> +			       !virtqueue_is_broken(vq))
> +				cpu_relax();
> +		else
> +			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		hdr->chunks = 0;

Why zero it here after device used it? Better to zero before use.

> +	}
> +}
> +
> +static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
> +			  int type, u64 base, u64 size)

what are the units here? Looks like it's in 4kbyte units?

> +{
> +	struct virtio_balloon_page_chunk_hdr *hdr;
> +	struct virtio_balloon_page_chunk *chunk;
> +
> +	switch (type) {
> +	case PAGE_CHUNK_TYPE_BALLOON:
> +		hdr = vb->balloon_page_chunk_hdr;
> +		chunk = vb->balloon_page_chunk;
> +		break;
> +	default:
> +		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
> +			 __func__, type);
> +		return;
> +	}
> +	chunk = chunk + hdr->chunks;
> +	chunk->base = cpu_to_le64(base << VIRTIO_BALLOON_CHUNK_BASE_SHIFT);
> +	chunk->size = cpu_to_le64(size << VIRTIO_BALLOON_CHUNK_SIZE_SHIFT);
> +	hdr->chunks++;

Isn't this LE? You should keep it somewhere else.

> +	if (hdr->chunks == MAX_PAGE_CHUNKS)
> +		send_page_chunks(vb, vq, type, false);
		and zero chunks here?
> +}
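
A sketch of the two points above, in user space: keep the running
count in CPU byte order, write the little-endian header only when a
batch is about to be sent, and reset the count right there. The struct
and function names are hypothetical and MAX_CHUNKS is kept tiny for
illustration:

```c
#include <stdint.h>

#define MAX_CHUNKS 4	/* small value for illustration only */

/* Hypothetical driver-side state: the count stays in CPU byte order. */
struct chunk_queue {
	uint32_t nr_chunks;	/* native endian, safe to increment */
	uint8_t wire_hdr[4];	/* little-endian count as sent to the host */
};

/* Store a 32-bit value in little-endian byte order on any CPU. */
static void store_le32(uint8_t *dst, uint32_t v)
{
	dst[0] = v & 0xff;
	dst[1] = (v >> 8) & 0xff;
	dst[2] = (v >> 16) & 0xff;
	dst[3] = (v >> 24) & 0xff;
}

/* Fill in the wire header, then reset the count for the next batch. */
static uint32_t queue_flush(struct chunk_queue *q)
{
	uint32_t sent = q->nr_chunks;

	store_le32(q->wire_hdr, sent);
	q->nr_chunks = 0;
	return sent;
}

/* Returns the number of chunks flushed by this call (0 if none). */
static uint32_t queue_add(struct chunk_queue *q)
{
	q->nr_chunks++;
	if (q->nr_chunks == MAX_CHUNKS)
		return queue_flush(q);
	return 0;
}
```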
> +
> +static void chunking_pages_from_bmap(struct virtio_balloon *vb,

Does this mean "convert_bmap_to_chunks"?

> +				     struct virtqueue *vq,
> +				     unsigned long pfn_start,
> +				     unsigned long *bmap,
> +				     unsigned long len)
> +{
> +	unsigned long pos = 0, end = len * BITS_PER_BYTE;
> +
> +	while (pos < end) {
> +		unsigned long one = find_next_bit(bmap, end, pos);
> +
> +		if (one < end) {
> +			unsigned long chunk_size, zero;
> +
> +			zero = find_next_zero_bit(bmap, end, one + 1);


zero and one are unhelpful names unless they equal 0 and 1.
current/next?


> +			if (zero >= end)
> +				chunk_size = end - one;
> +			else
> +				chunk_size = zero - one;
> +
> +			if (chunk_size)
> +				add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
> +					      pfn_start + one, chunk_size);

Still not sure what a bit refers to: a page or 4 kbytes?
I think it should be a page.

> +			pos = one + chunk_size;
> +		} else
> +			break;
> +	}
> +}
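
A user-space sketch of the bitmap-to-chunk conversion with "cur" and
"next" in place of "one" and "zero". It scans bit by bit rather than
using find_next_bit()/find_next_zero_bit(), and the struct and
function names are hypothetical:

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

struct run {
	unsigned long base;	/* first pfn of the run */
	unsigned long npages;	/* length of the run in pages */
};

static int test_bit_at(const unsigned long *bmap, size_t bit)
{
	return (bmap[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1;
}

/*
 * Convert the set bits in bmap[0..nbits) into runs of consecutive pfns.
 * Writes at most max_runs entries and returns the number written.
 */
static size_t bmap_to_runs(const unsigned long *bmap, size_t nbits,
			   unsigned long pfn_start,
			   struct run *runs, size_t max_runs)
{
	size_t cur = 0, count = 0;

	while (cur < nbits && count < max_runs) {
		size_t next;

		/* Skip clear bits to find where the next run starts. */
		while (cur < nbits && !test_bit_at(bmap, cur))
			cur++;
		if (cur == nbits)
			break;

		/* Walk to the first clear bit after the run. */
		next = cur;
		while (next < nbits && test_bit_at(bmap, next))
			next++;

		runs[count].base = pfn_start + cur;
		runs[count].npages = next - cur;
		count++;
		cur = next;
	}
	return count;
}
```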
> +



> +static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> +{
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS)) {
> +		int pfns, page_bmaps, i;
> +		unsigned long pfn_start, pfns_len;
> +
> +		pfn_start = vb->pfn_start;
> +		pfns = vb->pfn_stop - pfn_start + 1;
> +		pfns = roundup(roundup(pfns, BITS_PER_LONG),
> +			       PFNS_PER_PAGE_BMAP);
> +		page_bmaps = pfns / PFNS_PER_PAGE_BMAP;
> +		pfns_len = pfns / BITS_PER_BYTE;
> +
> +		for (i = 0; i < page_bmaps; i++) {
> +			unsigned int bmap_len = PAGE_BMAP_SIZE;
> +
> +			/* The last one takes the leftover only */

I don't understand what this means.

> +			if (i + 1 == page_bmaps)
> +				bmap_len = pfns_len - PAGE_BMAP_SIZE * i;
> +
> +			chunking_pages_from_bmap(vb, vq, pfn_start +
> +						 i * PFNS_PER_PAGE_BMAP,
> +						 vb->page_bmap[i], bmap_len);
> +		}
> +		if (vb->balloon_page_chunk_hdr->chunks > 0)
> +			send_page_chunks(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
> +					 false);
> +	} else {
> +		struct scatterlist sg;
> +		unsigned int len;
>  
> -	/* When host has read buffer, this completes via balloon_ack */
> -	wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
>  
> +		/*
> +		 * We should always be able to add one buffer to an empty
> +		 * queue.
> +		 */
> +		virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> +		virtqueue_kick(vq);
> +
> +		/* When host has read buffer, this completes via balloon_ack */
> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +	}
>  }
>  
>  static void set_page_pfns(struct virtio_balloon *vb,
> @@ -131,20 +346,73 @@ static void set_page_pfns(struct virtio_balloon *vb,
>  {
>  	unsigned int i;
>  
> -	/* Set balloon pfns pointing at this page.
> -	 * Note that the first pfn points at start of the page. */
> +	/*
> +	 * Set balloon pfns pointing at this page.
> +	 * Note that the first pfn points at start of the page.
> +	 */
>  	for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++)
>  		pfns[i] = cpu_to_virtio32(vb->vdev,
>  					  page_to_balloon_pfn(page) + i);
>  }
>

Nice cleanup but pls split this out. This patch is big enough as it is.
  
> +static void set_page_bmap(struct virtio_balloon *vb,
> +			  struct list_head *pages, struct virtqueue *vq)
> +{
> +	unsigned long pfn_start, pfn_stop;
> +	struct page *page;
> +	bool found;
> +
> +	vb->pfn_min = rounddown(vb->pfn_min, BITS_PER_LONG);
> +	vb->pfn_max = roundup(vb->pfn_max, BITS_PER_LONG);
> +
> +	extend_page_bmap_size(vb, vb->pfn_max - vb->pfn_min + 1);

This might not do anything; in particular, it might not cover the
given pfn range. Do we care? Why not?

> +	pfn_start = vb->pfn_min;
> +
> +	while (pfn_start < vb->pfn_max) {
> +		pfn_stop = pfn_start + PFNS_PER_PAGE_BMAP * vb->page_bmaps;
> +		pfn_stop = pfn_stop < vb->pfn_max ? pfn_stop : vb->pfn_max;
> +
> +		vb->pfn_start = pfn_start;
> +		clear_page_bmap(vb);
> +		found = false;
> +
> +		list_for_each_entry(page, pages, lru) {
> +			unsigned long bmap_idx, bmap_pos, balloon_pfn;
> +
> +			balloon_pfn = page_to_balloon_pfn(page);
> +			if (balloon_pfn < pfn_start || balloon_pfn > pfn_stop)
> +				continue;
> +			bmap_idx = (balloon_pfn - pfn_start) /
> +				   PFNS_PER_PAGE_BMAP;
> +			bmap_pos = (balloon_pfn - pfn_start) %
> +				   PFNS_PER_PAGE_BMAP;
> +			set_bit(bmap_pos, vb->page_bmap[bmap_idx]);

Looks like this will crash if bmap_idx is out of range or
if page_bmap allocation failed.

> +
> +			found = true;
> +		}
> +		if (found) {
> +			vb->pfn_stop = pfn_stop;
> +			tell_host(vb, vq);
> +		}
> +		pfn_start = pfn_stop;
> +	}
> +	free_extended_page_bmap(vb);
> +}
> +
>  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	unsigned num_allocated_pages;
> +	bool chunking = virtio_has_feature(vb->vdev,
> +					   VIRTIO_BALLOON_F_BALLOON_CHUNKS);
>  
>  	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	if (chunking) {
> +		init_page_bmap_range(vb);
> +	} else {
> +		/* We can only do one array worth at a time. */
> +		num = min(num, ARRAY_SIZE(vb->pfns));
> +	}
>  
>  	mutex_lock(&vb->balloon_lock);
>  	for (vb->num_pfns = 0; vb->num_pfns < num;
> @@ -159,7 +427,10 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  			msleep(200);
>  			break;
>  		}
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (chunking)
> +			update_page_bmap_range(vb, page);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>  		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
>  		if (!virtio_has_feature(vb->vdev,
>  					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> @@ -168,8 +439,13 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  
>  	num_allocated_pages = vb->num_pfns;
>  	/* Did we get any? */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->inflate_vq);
> +	if (vb->num_pfns != 0) {
> +		if (chunking)
> +			set_page_bmap(vb, &vb_dev_info->pages,
> +					vb->inflate_vq);
> +		else
> +			tell_host(vb, vb->inflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	return num_allocated_pages;
> @@ -195,6 +471,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	struct page *page;
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	LIST_HEAD(pages);
> +	bool chunking = virtio_has_feature(vb->vdev,
> +					   VIRTIO_BALLOON_F_BALLOON_CHUNKS);
> +	if (chunking)
> +		init_page_bmap_range(vb);
> +	else
> +		/* We can only do one array worth at a time. */
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	/* We can only do one array worth at a time. */
>  	num = min(num, ARRAY_SIZE(vb->pfns));
> @@ -208,6 +491,10 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  		if (!page)
>  			break;
>  		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (chunking)
> +			update_page_bmap_range(vb, page);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>  		list_add(&page->lru, &pages);
>  		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>  	}
> @@ -218,8 +505,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
>  	 * is true, we *have* to do it in this order
>  	 */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->deflate_vq);
> +	if (vb->num_pfns != 0) {
> +		if (chunking)
> +			set_page_bmap(vb, &pages, vb->deflate_vq);
> +		else
> +			tell_host(vb, vb->deflate_vq);
> +	}
>  	release_pages_balloon(vb, &pages);
>  	mutex_unlock(&vb->balloon_lock);
>  	return num_freed_pages;
> @@ -431,6 +722,13 @@ static int init_vqs(struct virtio_balloon *vb)
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> +
> +static void tell_host_one_page(struct virtio_balloon *vb,
> +			       struct virtqueue *vq, struct page *page)
> +{
> +	add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON, page_to_pfn(page), 1);

This passes 4kbytes to host which seems wrong - I think you want a full page.
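
The mismatch only shows up when the guest page size is larger than
4 KiB. A minimal sketch of the conversion, assuming chunk base and
size are expressed in the 4 KiB balloon units of the existing pfn
interface; the helper names and the page_shift parameter are
hypothetical:

```c
#include <stdint.h>

#define BALLOON_PFN_SHIFT 12	/* mirrors VIRTIO_BALLOON_PFN_SHIFT */

/* Number of 4 KiB balloon pages in one guest page of size 1 << page_shift. */
static uint64_t balloon_pages_per_page(unsigned int page_shift)
{
	return 1ULL << (page_shift - BALLOON_PFN_SHIFT);
}

/* Convert a guest pfn into the 4 KiB units used on the balloon interface. */
static uint64_t pfn_to_balloon_pfn(uint64_t pfn, unsigned int page_shift)
{
	return pfn * balloon_pages_per_page(page_shift);
}
```

With 64 KiB guest pages, one guest page at pfn 5 would then be
reported as base 80 and size 16 rather than base 5 and size 1.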

> +}
> +
>  /*
>   * virtballoon_migratepage - perform the balloon page migration on behalf of
>   *			     a compation thread.     (called under page lock)
> @@ -454,6 +752,8 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  {
>  	struct virtio_balloon *vb = container_of(vb_dev_info,
>  			struct virtio_balloon, vb_dev_info);
> +	bool chunking = virtio_has_feature(vb->vdev,
> +					   VIRTIO_BALLOON_F_BALLOON_CHUNKS);
>  	unsigned long flags;
>  
>  	/*
> @@ -475,16 +775,22 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  	vb_dev_info->isolated_pages--;
>  	__count_vm_event(BALLOON_MIGRATE);
>  	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, newpage);
> -	tell_host(vb, vb->inflate_vq);
> -
> +	if (chunking) {
> +		tell_host_one_page(vb, vb->inflate_vq, newpage);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, newpage);
> +		tell_host(vb, vb->inflate_vq);
> +	}
>  	/* balloon's page migration 2nd step -- deflate "page" */
>  	balloon_page_delete(page);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, page);
> -	tell_host(vb, vb->deflate_vq);
> -
> +	if (chunking) {
> +		tell_host_one_page(vb, vb->deflate_vq, page);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, page);
> +		tell_host(vb, vb->deflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	put_page(page); /* balloon reference */
> @@ -511,6 +817,32 @@ static struct file_system_type balloon_fs = {
>  
>  #endif /* CONFIG_BALLOON_COMPACTION */
>  
> +static void balloon_page_chunk_init(struct virtio_balloon *vb)
> +{
> +	void *buf;
> +
> +	/*
> +	 * By default, we allocate page_bmap[0] only. More page_bmap will be
> +	 * allocated on demand.
> +	 */
> +	vb->page_bmap[0] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
> +	buf = kmalloc(sizeof(struct virtio_balloon_page_chunk_hdr) +
> +		      sizeof(struct virtio_balloon_page_chunk) *
> +		      MAX_PAGE_CHUNKS, GFP_KERNEL);
> +	if (!vb->page_bmap[0] || !buf) {
> +		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS);

This doesn't work as expected, as the features have been OK'd by then.
You want something like the validate_features callback that I posted.
See "virtio: allow drivers to validate features".

> +		kfree(vb->page_bmap[0]);

Looks like this will double free. you want to zero them I think.

> +		kfree(vb->balloon_page_chunk_hdr);
> +		dev_warn(&vb->vdev->dev, "%s: failed\n", __func__);
> +	} else {
> +		vb->page_bmaps = 1;
> +		vb->balloon_page_chunk_hdr = buf;
> +		vb->balloon_page_chunk_hdr->chunks = 0;
> +		vb->balloon_page_chunk = buf +
> +				sizeof(struct virtio_balloon_page_chunk_hdr);
> +	}
> +}
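
A user-space sketch of an init path that cannot double free: free the
pointers that were actually allocated, then NULL them so that a later
cleanup is harmless. The struct and function names are hypothetical,
and malloc()/free() stand in for kmalloc()/kfree():

```c
#include <stdlib.h>

/* Hypothetical subset of the driver state touched by the init path. */
struct chunk_state {
	unsigned long *bmap0;	/* stands in for vb->page_bmap[0] */
	void *chunk_buf;	/* stands in for vb->balloon_page_chunk_hdr */
	unsigned int nr_bmaps;
};

/* Returns 0 on success. On failure nothing stays allocated or dangling. */
static int chunk_state_init(struct chunk_state *s, size_t bmap_size,
			    size_t buf_size)
{
	s->bmap0 = malloc(bmap_size);
	s->chunk_buf = malloc(buf_size);
	if (!s->bmap0 || !s->chunk_buf) {
		free(s->bmap0);
		free(s->chunk_buf);
		s->bmap0 = NULL;	/* later cleanup must not free these again */
		s->chunk_buf = NULL;
		s->nr_bmaps = 0;
		return -1;
	}
	s->nr_bmaps = 1;
	return 0;
}

/* Safe after a failed init or when called twice: free(NULL) is a no-op. */
static void chunk_state_cleanup(struct chunk_state *s)
{
	free(s->bmap0);
	free(s->chunk_buf);
	s->bmap0 = NULL;
	s->chunk_buf = NULL;
	s->nr_bmaps = 0;
}
```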
> +
>  static int virtballoon_probe(struct virtio_device *vdev)
>  {
>  	struct virtio_balloon *vb;
> @@ -533,6 +865,10 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	spin_lock_init(&vb->stop_update_lock);
>  	vb->stop_update = false;
>  	vb->num_pages = 0;
> +
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS))
> +		balloon_page_chunk_init(vb);
> +
>  	mutex_init(&vb->balloon_lock);
>  	init_waitqueue_head(&vb->acked);
>  	vb->vdev = vdev;
> @@ -609,6 +945,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	cancel_work_sync(&vb->update_balloon_stats_work);
>  
>  	remove_common(vb);
> +	free_page_bmap(vb);
>  	if (vb->vb_dev_info.inode)
>  		iput(vb->vb_dev_info.inode);
>  	kfree(vb);
> @@ -649,6 +986,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> +	VIRTIO_BALLOON_F_BALLOON_CHUNKS,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 343d7dd..be317b7 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -34,6 +34,7 @@
>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> +#define VIRTIO_BALLOON_F_BALLOON_CHUNKS 3 /* Inflate/Deflate pages in chunks */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> @@ -82,4 +83,16 @@ struct virtio_balloon_stat {
>  	__virtio64 val;
>  } __attribute__((packed));
>  
> +struct virtio_balloon_page_chunk_hdr {
> +	/* Number of chunks in the payload */
> +	__le32 chunks;

You want to make this __le64 to align everything to 64 bit.

> +};
> +
> +#define VIRTIO_BALLOON_CHUNK_BASE_SHIFT 12
> +#define VIRTIO_BALLOON_CHUNK_SIZE_SHIFT 12
> +struct virtio_balloon_page_chunk {

so rename this virtio_balloon_page_chunk_entry

> +	__le64 base;
> +	__le64 size;
> +};
> +

And then:

struct virtio_balloon_page_chunk {
	struct virtio_balloon_page_chunk_hdr hdr;
	struct virtio_balloon_page_chunk_entry entries[];
};
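
A user-space sketch of how a layout like this could be sized and
allocated in one go. The types are CPU-endian stand-ins with
hypothetical names (the wire structs would use __le64), and the header
is assumed to be widened to 64 bits as suggested above:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* CPU-endian stand-ins for the proposed wire structs. */
struct page_chunk_hdr {
	uint64_t chunks;	/* assumed 64 bits wide for alignment */
};

struct page_chunk_entry {
	uint64_t base;
	uint64_t size;
};

struct page_chunk {
	struct page_chunk_hdr hdr;
	struct page_chunk_entry entries[];	/* flexible array member */
};

/* Bytes needed for a header followed by n entries. */
static size_t page_chunk_bytes(size_t n)
{
	return sizeof(struct page_chunk) + n * sizeof(struct page_chunk_entry);
}

/* One zeroed allocation holds the header and all entries; NULL on failure. */
static struct page_chunk *page_chunk_alloc(size_t n)
{
	return calloc(1, page_chunk_bytes(n));
}
```

With a single struct there is one pointer to keep and free, and the
entries are reached as c->entries[i] without pointer arithmetic on a
void buffer.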



>  #endif /* _LINUX_VIRTIO_BALLOON_H */
> -- 
> 2.7.4


* Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
@ 2017-04-13 16:34     ` Michael S. Tsirkin
  0 siblings, 0 replies; 136+ messages in thread
From: Michael S. Tsirkin @ 2017-04-13 16:34 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Thu, Apr 13, 2017 at 05:35:05PM +0800, Wei Wang wrote:
> Add a new feature, VIRTIO_BALLOON_F_BALLOON_CHUNKS, which enables

Let's find a better name here.
VIRTIO_BALLOON_F_PAGE_CHUNK


> the transfer of the ballooned (i.e. inflated/deflated) pages in
> chunks to the host.
> 
> The implementation of the previous virtio-balloon is not very
> efficient, because the ballooned pages are transferred to the
> host one by one. Here is the breakdown of the time in percentage
> spent on each step of the balloon inflating process (inflating
> 7GB of an 8GB idle guest).
> 
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
> 
> It takes about 4126ms for the inflating process to complete.
> The above profiling shows that the bottlenecks are stage 2)
> and stage 4).
> 
> This patch optimizes step 2) by transferring pages to the host in
> chunks. A chunk consists of guest physically continuous pages, and
> it is offered to the host via a base PFN (i.e. the start PFN of
> those physically continuous pages) and the size (i.e. the total
> number of the pages). A chunk is formated as below:

formatted

> --------------------------------------------------------
> |                 Base (52 bit)        | Rsvd (12 bit) |
> --------------------------------------------------------
> --------------------------------------------------------
> |                 Size (52 bit)        | Rsvd (12 bit) |
> --------------------------------------------------------
> 
> By doing so, step 4) can also be optimized by doing address
> translation and madvise() in chunks rather than page by page.
> 
> With this new feature, the above ballooning process takes ~590ms
> resulting in an improvement of ~85%.
> 
> TODO: optimize stage 1) by allocating/freeing a chunk of pages
> instead of a single page each time.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>

So we don't need the bitmap to talk to host, it is just
a data structure we chose to maintain lists of pages, right?
OK as far as it goes but you need much better isolation for it.
Build a data structure with APIs such as _init, _cleanup, _add, _clear,
_find_first, _find_next.
Completely unrelated to pages, it just maintains bits.
Then use it here.


> ---
>  drivers/virtio/virtio_balloon.c     | 384 +++++++++++++++++++++++++++++++++---
>  include/uapi/linux/virtio_balloon.h |  13 ++
>  2 files changed, 374 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index f59cb4f..5e2e7cc 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -42,6 +42,10 @@
>  #define OOM_VBALLOON_DEFAULT_PAGES 256
>  #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
>  
> +#define PAGE_BMAP_SIZE		(8 * PAGE_SIZE)
> +#define PFNS_PER_PAGE_BMAP	(PAGE_BMAP_SIZE * BITS_PER_BYTE)
> +#define PAGE_BMAP_COUNT_MAX	32
> +

Please prefix with VIRTIO_BALLOON_ and add comments.

>  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
>  module_param(oom_pages, int, S_IRUSR | S_IWUSR);
>  MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> @@ -50,6 +54,10 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>  static struct vfsmount *balloon_mnt;
>  #endif
>  
> +/* Types of pages to chunk */
> +#define PAGE_CHUNK_TYPE_BALLOON 0
> +

Doesn't look like you are ever adding more types in this
patchset.  Pls keep code simple, generalize it later.

> +#define MAX_PAGE_CHUNKS 4096

This is an order-4 allocation. I'd make it 4095 and then it's
an order-3 one.

>  struct virtio_balloon {
>  	struct virtio_device *vdev;
>  	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> @@ -78,6 +86,32 @@ struct virtio_balloon {
>  	/* Synchronize access/update to this struct virtio_balloon elements */
>  	struct mutex balloon_lock;
>  
> +	/*
> +	 * Buffer for PAGE_CHUNK_TYPE_BALLOON:
> +	 * virtio_balloon_page_chunk_hdr +
> +	 * virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
> +	 */
> +	struct virtio_balloon_page_chunk_hdr *balloon_page_chunk_hdr;
> +	struct virtio_balloon_page_chunk *balloon_page_chunk;
> +
> +	/* Bitmap used to record pages */
> +	unsigned long *page_bmap[PAGE_BMAP_COUNT_MAX];
> +	/* Number of the allocated page_bmap */
> +	unsigned int page_bmaps;
> +
> +	/*
> +	 * The allocated page_bmap size may be smaller than the pfn range of
> +	 * the ballooned pages. In this case, we need to use the page_bmap
> +	 * multiple times to cover the entire pfn range. It's like using a
> +	 * short ruler several times to finish measuring a long object.
> +	 * The start location of the ruler in the next measurement is the end
> +	 * location of the ruler in the previous measurement.
> +	 *
> +	 * pfn_max & pfn_min: forms the pfn range of the ballooned pages
> +	 * pfn_start & pfn_stop: records the start and stop pfn in each cover

cover? what does this mean?

looks like you only use these to pass data to tell_host.
so pass these as parameters and you won't need to keep
them in this structure.

And then you can move this comment to set_page_bmap where
it belongs.

> +	 */
> +	unsigned long pfn_min, pfn_max, pfn_start, pfn_stop;
> +
>  	/* The array of pfns we tell the Host about. */
>  	unsigned int num_pfns;
>  	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> @@ -110,20 +144,201 @@ static void balloon_ack(struct virtqueue *vq)
>  	wake_up(&vb->acked);
>  }
>  
> -static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> +static inline void init_page_bmap_range(struct virtio_balloon *vb)
> +{
> +	vb->pfn_min = ULONG_MAX;
> +	vb->pfn_max = 0;
> +}
> +
> +static inline void update_page_bmap_range(struct virtio_balloon *vb,
> +					  struct page *page)
> +{
> +	unsigned long balloon_pfn = page_to_balloon_pfn(page);
> +
> +	vb->pfn_min = min(balloon_pfn, vb->pfn_min);
> +	vb->pfn_max = max(balloon_pfn, vb->pfn_max);
> +}
> +
> +/* The page_bmap size is extended by adding more number of page_bmap */

did you mean

	Allocate more bitmaps to cover the given number of pfns
	and add them to page_bmap

?

This isn't what this function does.
It blindly assumes 1 bitmap is allocated
and allocates more, up to PAGE_BMAP_COUNT_MAX.

> +static void extend_page_bmap_size(struct virtio_balloon *vb,
> +				  unsigned long pfns)
> +{
> +	int i, bmaps;
> +	unsigned long bmap_len;
> +
> +	bmap_len = ALIGN(pfns, BITS_PER_LONG) / BITS_PER_BYTE;
> +	bmap_len = ALIGN(bmap_len, PAGE_BMAP_SIZE);

Align? PAGE_BMAP_SIZE doesn't even have to be a power of 2 ...
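
To illustrate the concern: mask-based alignment is only correct when the alignment is a power of two, while division-based rounding works for any positive value. The macros below are user-space stand-ins with hypothetical SKETCH_ names, written to behave like the kernel's ALIGN(), DIV_ROUND_UP() and roundup(); sketch_bmaps_needed() is likewise only an illustration of how the bitmap count could be derived directly.

```c
#include <assert.h>

/* Mask-based alignment: valid only when a is a power of two. */
#define SKETCH_ALIGN(x, a) (((x) + ((a) - 1)) & ~((unsigned long)(a) - 1))
/* Division-based helpers: valid for any positive divisor. */
#define SKETCH_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
#define SKETCH_ROUNDUP(x, y) (SKETCH_DIV_ROUND_UP(x, y) * (y))

/* Number of bitmaps needed to hold one bit per pfn. */
static unsigned long sketch_bmaps_needed(unsigned long pfns,
					 unsigned long pfns_per_bmap)
{
	assert(pfns_per_bmap > 0);
	return SKETCH_DIV_ROUND_UP(pfns, pfns_per_bmap);
}
```

For a power-of-two alignment both forms agree; for 48 the mask-based form returns a value that is not a multiple of 48.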

> +	bmaps = min((int)(bmap_len / PAGE_BMAP_SIZE),
> +		    PAGE_BMAP_COUNT_MAX);

I got lost here.

Please use things like ARRAY_SIZE(vb->page_bmap) instead of
repeating the PAGE_BMAP_COUNT_MAX macro.

> +
> +	for (i = 1; i < bmaps; i++) {
> +		vb->page_bmap[i] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
> +		if (vb->page_bmap[i])
> +			vb->page_bmaps++;
> +		else
> +			break;
> +	}
> +}
> +
> +static void free_extended_page_bmap(struct virtio_balloon *vb)
> +{
> +	int i, bmaps = vb->page_bmaps;
> +
> +	for (i = 1; i < bmaps; i++) {
> +		kfree(vb->page_bmap[i]);
> +		vb->page_bmap[i] = NULL;
> +		vb->page_bmaps--;
> +	}
> +}
> +

What's the magic number 1 here?
Maybe you want to document what is going on.
Here's a guess:

We keep a single bmap around at all times.
If memory does not fit there, we allocate up to
PAGE_BMAP_COUNT_MAX of chunks.


> +static void free_page_bmap(struct virtio_balloon *vb)
> +{
> +	int i;
> +
> +	for (i = 0; i < vb->page_bmaps; i++)
> +		kfree(vb->page_bmap[i]);
> +}
> +
> +static void clear_page_bmap(struct virtio_balloon *vb)
> +{
> +	int i;
> +
> +	for (i = 0; i < vb->page_bmaps; i++)
> +		memset(vb->page_bmap[i], 0, PAGE_BMAP_SIZE);
> +}
> +
> +static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq,
> +			     int type, bool busy_wait)

busy_wait seems unused: every caller in this patch passes false.
Please drop it.

>  {
>  	struct scatterlist sg;
> +	struct virtio_balloon_page_chunk_hdr *hdr;
> +	void *buf;
>  	unsigned int len;
>  
> -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> +	switch (type) {
> +	case PAGE_CHUNK_TYPE_BALLOON:
> +		hdr = vb->balloon_page_chunk_hdr;
> +		len = 0;
> +		break;
> +	default:
> +		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
> +			 __func__, type);
> +		return;
> +	}
>  
> -	/* We should always be able to add one buffer to an empty queue. */
> -	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> -	virtqueue_kick(vq);
> +	buf = (void *)hdr - len;

Moving back to before the header? How can this make sense?
It works fine since len is 0, so just buf = hdr.

> +	len += sizeof(struct virtio_balloon_page_chunk_hdr);
> +	len += hdr->chunks * sizeof(struct virtio_balloon_page_chunk);
> +	sg_init_table(&sg, 1);
> +	sg_set_buf(&sg, buf, len);
> +	if (!virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL)) {
> +		virtqueue_kick(vq);
> +		if (busy_wait)
> +			while (!virtqueue_get_buf(vq, &len) &&
> +			       !virtqueue_is_broken(vq))
> +				cpu_relax();
> +		else
> +			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		hdr->chunks = 0;

Why zero it here after device used it? Better to zero before use.

> +	}
> +}
> +
> +static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
> +			  int type, u64 base, u64 size)

what are the units here? Looks like it's in 4kbyte units?

> +{
> +	struct virtio_balloon_page_chunk_hdr *hdr;
> +	struct virtio_balloon_page_chunk *chunk;
> +
> +	switch (type) {
> +	case PAGE_CHUNK_TYPE_BALLOON:
> +		hdr = vb->balloon_page_chunk_hdr;
> +		chunk = vb->balloon_page_chunk;
> +		break;
> +	default:
> +		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
> +			 __func__, type);
> +		return;
> +	}
> +	chunk = chunk + hdr->chunks;
> +	chunk->base = cpu_to_le64(base << VIRTIO_BALLOON_CHUNK_BASE_SHIFT);
> +	chunk->size = cpu_to_le64(size << VIRTIO_BALLOON_CHUNK_SIZE_SHIFT);
> +	hdr->chunks++;

Isn't hdr->chunks LE (__le32)? You should keep the native count
somewhere else.
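
One way to read this comment: keep a driver-private counter in native byte order and only write the little-endian field when the buffer is about to be sent. The following is a user-space sketch under that assumption; the struct and function names are hypothetical, and the byte-wise store stands in for the kernel's cpu_to_le32().

```c
#include <assert.h>
#include <stdint.h>

/* Wire header: the chunk count as four little-endian bytes. */
struct sketch_chunk_hdr {
	uint8_t chunks_le[4];
};

struct sketch_chunk_state {
	uint32_t nr_chunks;          /* native endianness, never sent */
	struct sketch_chunk_hdr hdr; /* what the device reads */
};

/* Store v as little-endian regardless of host byte order. */
static void sketch_put_le32(uint8_t out[4], uint32_t v)
{
	out[0] = (uint8_t)(v & 0xff);
	out[1] = (uint8_t)((v >> 8) & 0xff);
	out[2] = (uint8_t)((v >> 16) & 0xff);
	out[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Count one more chunk; arithmetic is done on the native counter only. */
static void sketch_add_chunk(struct sketch_chunk_state *s)
{
	s->nr_chunks++;
}

/* Publish the count into the wire header and start a new batch. */
static void sketch_prepare_send(struct sketch_chunk_state *s)
{
	sketch_put_le32(s->hdr.chunks_le, s->nr_chunks);
	s->nr_chunks = 0;
}
```

This keeps increments and comparisons such as "== MAX_PAGE_CHUNKS" off the endian-annotated field.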

> +	if (hdr->chunks == MAX_PAGE_CHUNKS)
> +		send_page_chunks(vb, vq, type, false);
		and zero chunks here?
> +}
> +
> +static void chunking_pages_from_bmap(struct virtio_balloon *vb,

Does this mean "convert_bmap_to_chunks"?

> +				     struct virtqueue *vq,
> +				     unsigned long pfn_start,
> +				     unsigned long *bmap,
> +				     unsigned long len)
> +{
> +	unsigned long pos = 0, end = len * BITS_PER_BYTE;
> +
> +	while (pos < end) {
> +		unsigned long one = find_next_bit(bmap, end, pos);
> +
> +		if (one < end) {
> +			unsigned long chunk_size, zero;
> +
> +			zero = find_next_zero_bit(bmap, end, one + 1);


zero and one are unhelpful names unless they equal 0 and 1.
current/next?
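
For reference, the bitmap-to-chunks conversion can be sketched with names that say what they hold. This is a plain user-space illustration, not the driver code: it scans bit by bit instead of using find_next_bit()/find_next_zero_bit(), and sketch_run, run_start and run_end are hypothetical names.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* One run of consecutive set bits, expressed as base pfn + length. */
struct sketch_run {
	unsigned long base;
	unsigned long size;
};

static int sketch_test_bit(const unsigned long *bmap, unsigned long nr)
{
	return (int)((bmap[nr / SKETCH_BITS_PER_LONG] >>
		      (nr % SKETCH_BITS_PER_LONG)) & 1UL);
}

/* Convert set-bit runs in bmap[0..nbits) into at most max_runs entries. */
static size_t sketch_bmap_to_runs(const unsigned long *bmap,
				  unsigned long nbits,
				  unsigned long pfn_start,
				  struct sketch_run *out, size_t max_runs)
{
	unsigned long pos = 0;
	size_t n = 0;

	while (pos < nbits && n < max_runs) {
		unsigned long run_start, run_end;

		/* Skip clear bits up to the start of the next run. */
		while (pos < nbits && !sketch_test_bit(bmap, pos))
			pos++;
		if (pos == nbits)
			break;
		run_start = pos;
		/* Walk to the first clear bit after the run, or the end. */
		while (pos < nbits && sketch_test_bit(bmap, pos))
			pos++;
		run_end = pos;

		out[n].base = pfn_start + run_start;
		out[n].size = run_end - run_start;
		n++;
	}
	return n;
}
```

A run that crosses a word boundary comes out as one entry, since the scan works on bit positions rather than words.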


> +			if (zero >= end)
> +				chunk_size = end - one;
> +			else
> +				chunk_size = zero - one;
> +
> +			if (chunk_size)
> +				add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
> +					      pfn_start + one, chunk_size);

Still not sure what a bit refers to: a page or 4kbytes?
I think it should be a page.

> +			pos = one + chunk_size;
> +		} else
> +			break;
> +	}
> +}
> +



> +static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> +{
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS)) {
> +		int pfns, page_bmaps, i;
> +		unsigned long pfn_start, pfns_len;
> +
> +		pfn_start = vb->pfn_start;
> +		pfns = vb->pfn_stop - pfn_start + 1;
> +		pfns = roundup(roundup(pfns, BITS_PER_LONG),
> +			       PFNS_PER_PAGE_BMAP);
> +		page_bmaps = pfns / PFNS_PER_PAGE_BMAP;
> +		pfns_len = pfns / BITS_PER_BYTE;
> +
> +		for (i = 0; i < page_bmaps; i++) {
> +			unsigned int bmap_len = PAGE_BMAP_SIZE;
> +
> +			/* The last one takes the leftover only */

I don't understand what this means.

> +			if (i + 1 == page_bmaps)
> +				bmap_len = pfns_len - PAGE_BMAP_SIZE * i;
> +
> +			chunking_pages_from_bmap(vb, vq, pfn_start +
> +						 i * PFNS_PER_PAGE_BMAP,
> +						 vb->page_bmap[i], bmap_len);
> +		}
> +		if (vb->balloon_page_chunk_hdr->chunks > 0)
> +			send_page_chunks(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
> +					 false);
> +	} else {
> +		struct scatterlist sg;
> +		unsigned int len;
>  
> -	/* When host has read buffer, this completes via balloon_ack */
> -	wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
>  
> +		/*
> +		 * We should always be able to add one buffer to an empty
> +		 * queue.
> +		 */
> +		virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> +		virtqueue_kick(vq);
> +
> +		/* When host has read buffer, this completes via balloon_ack */
> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +	}
>  }
>  
>  static void set_page_pfns(struct virtio_balloon *vb,
> @@ -131,20 +346,73 @@ static void set_page_pfns(struct virtio_balloon *vb,
>  {
>  	unsigned int i;
>  
> -	/* Set balloon pfns pointing at this page.
> -	 * Note that the first pfn points at start of the page. */
> +	/*
> +	 * Set balloon pfns pointing at this page.
> +	 * Note that the first pfn points at start of the page.
> +	 */
>  	for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++)
>  		pfns[i] = cpu_to_virtio32(vb->vdev,
>  					  page_to_balloon_pfn(page) + i);
>  }
>

Nice cleanup but pls split this out. This patch is big enough as it is.
  
> +static void set_page_bmap(struct virtio_balloon *vb,
> +			  struct list_head *pages, struct virtqueue *vq)
> +{
> +	unsigned long pfn_start, pfn_stop;
> +	struct page *page;
> +	bool found;
> +
> +	vb->pfn_min = rounddown(vb->pfn_min, BITS_PER_LONG);
> +	vb->pfn_max = roundup(vb->pfn_max, BITS_PER_LONG);
> +
> +	extend_page_bmap_size(vb, vb->pfn_max - vb->pfn_min + 1);

This might not do anything; in particular, it might not cover the
given pfn range. Do we care? Why not?

> +	pfn_start = vb->pfn_min;
> +
> +	while (pfn_start < vb->pfn_max) {
> +		pfn_stop = pfn_start + PFNS_PER_PAGE_BMAP * vb->page_bmaps;
> +		pfn_stop = pfn_stop < vb->pfn_max ? pfn_stop : vb->pfn_max;
> +
> +		vb->pfn_start = pfn_start;
> +		clear_page_bmap(vb);
> +		found = false;
> +
> +		list_for_each_entry(page, pages, lru) {
> +			unsigned long bmap_idx, bmap_pos, balloon_pfn;
> +
> +			balloon_pfn = page_to_balloon_pfn(page);
> +			if (balloon_pfn < pfn_start || balloon_pfn > pfn_stop)
> +				continue;
> +			bmap_idx = (balloon_pfn - pfn_start) /
> +				   PFNS_PER_PAGE_BMAP;
> +			bmap_pos = (balloon_pfn - pfn_start) %
> +				   PFNS_PER_PAGE_BMAP;
> +			set_bit(bmap_pos, vb->page_bmap[bmap_idx]);

Looks like this will crash if bmap_idx is out of range or
if page_bmap allocation failed.
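
A guarded mapping from pfn to bitmap slot might look like the sketch below. It is a user-space illustration with hypothetical names; the caller would skip the page when it returns false. Note that in the quoted loop a pfn equal to pfn_stop is not skipped, so when pfn_stop is pfn_start + PFNS_PER_PAGE_BMAP * page_bmaps the computed index equals page_bmaps, which is the out-of-range case this check rejects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Map pfn to (bitmap index, bit position). Returns false when the pfn
 * falls outside the covered window or the target bitmap is missing.
 */
static bool sketch_pfn_to_slot(unsigned long pfn, unsigned long pfn_start,
			       unsigned long pfns_per_bmap,
			       unsigned int nr_bmaps,
			       unsigned long *const *bmaps,
			       unsigned long *idx, unsigned long *pos)
{
	unsigned long off;

	assert(pfns_per_bmap > 0);
	if (pfn < pfn_start)
		return false;
	off = pfn - pfn_start;
	if (off / pfns_per_bmap >= nr_bmaps)
		return false;	/* index past the allocated bitmaps */
	if (!bmaps[off / pfns_per_bmap])
		return false;	/* allocation of this bitmap failed */
	*idx = off / pfns_per_bmap;
	*pos = off % pfns_per_bmap;
	return true;
}
```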

> +
> +			found = true;
> +		}
> +		if (found) {
> +			vb->pfn_stop = pfn_stop;
> +			tell_host(vb, vq);
> +		}
> +		pfn_start = pfn_stop;
> +	}
> +	free_extended_page_bmap(vb);
> +}
> +
>  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	unsigned num_allocated_pages;
> +	bool chunking = virtio_has_feature(vb->vdev,
> +					   VIRTIO_BALLOON_F_BALLOON_CHUNKS);
>  
>  	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	if (chunking) {
> +		init_page_bmap_range(vb);
> +	} else {
> +		/* We can only do one array worth at a time. */
> +		num = min(num, ARRAY_SIZE(vb->pfns));
> +	}
>  
>  	mutex_lock(&vb->balloon_lock);
>  	for (vb->num_pfns = 0; vb->num_pfns < num;
> @@ -159,7 +427,10 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  			msleep(200);
>  			break;
>  		}
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (chunking)
> +			update_page_bmap_range(vb, page);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>  		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
>  		if (!virtio_has_feature(vb->vdev,
>  					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> @@ -168,8 +439,13 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  
>  	num_allocated_pages = vb->num_pfns;
>  	/* Did we get any? */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->inflate_vq);
> +	if (vb->num_pfns != 0) {
> +		if (chunking)
> +			set_page_bmap(vb, &vb_dev_info->pages,
> +					vb->inflate_vq);
> +		else
> +			tell_host(vb, vb->inflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	return num_allocated_pages;
> @@ -195,6 +471,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	struct page *page;
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	LIST_HEAD(pages);
> +	bool chunking = virtio_has_feature(vb->vdev,
> +					   VIRTIO_BALLOON_F_BALLOON_CHUNKS);
> +	if (chunking)
> +		init_page_bmap_range(vb);
> +	else
> +		/* We can only do one array worth at a time. */
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	/* We can only do one array worth at a time. */
>  	num = min(num, ARRAY_SIZE(vb->pfns));
> @@ -208,6 +491,10 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  		if (!page)
>  			break;
>  		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (chunking)
> +			update_page_bmap_range(vb, page);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>  		list_add(&page->lru, &pages);
>  		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>  	}
> @@ -218,8 +505,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
>  	 * is true, we *have* to do it in this order
>  	 */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->deflate_vq);
> +	if (vb->num_pfns != 0) {
> +		if (chunking)
> +			set_page_bmap(vb, &pages, vb->deflate_vq);
> +		else
> +			tell_host(vb, vb->deflate_vq);
> +	}
>  	release_pages_balloon(vb, &pages);
>  	mutex_unlock(&vb->balloon_lock);
>  	return num_freed_pages;
> @@ -431,6 +722,13 @@ static int init_vqs(struct virtio_balloon *vb)
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> +
> +static void tell_host_one_page(struct virtio_balloon *vb,
> +			       struct virtqueue *vq, struct page *page)
> +{
> +	add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON, page_to_pfn(page), 1);

This passes 4kbytes to host which seems wrong - I think you want a full page.

> +}
> +
>  /*
>   * virtballoon_migratepage - perform the balloon page migration on behalf of
>   *			     a compation thread.     (called under page lock)
> @@ -454,6 +752,8 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  {
>  	struct virtio_balloon *vb = container_of(vb_dev_info,
>  			struct virtio_balloon, vb_dev_info);
> +	bool chunking = virtio_has_feature(vb->vdev,
> +					   VIRTIO_BALLOON_F_BALLOON_CHUNKS);
>  	unsigned long flags;
>  
>  	/*
> @@ -475,16 +775,22 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  	vb_dev_info->isolated_pages--;
>  	__count_vm_event(BALLOON_MIGRATE);
>  	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, newpage);
> -	tell_host(vb, vb->inflate_vq);
> -
> +	if (chunking) {
> +		tell_host_one_page(vb, vb->inflate_vq, newpage);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, newpage);
> +		tell_host(vb, vb->inflate_vq);
> +	}
>  	/* balloon's page migration 2nd step -- deflate "page" */
>  	balloon_page_delete(page);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, page);
> -	tell_host(vb, vb->deflate_vq);
> -
> +	if (chunking) {
> +		tell_host_one_page(vb, vb->deflate_vq, page);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, page);
> +		tell_host(vb, vb->deflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	put_page(page); /* balloon reference */
> @@ -511,6 +817,32 @@ static struct file_system_type balloon_fs = {
>  
>  #endif /* CONFIG_BALLOON_COMPACTION */
>  
> +static void balloon_page_chunk_init(struct virtio_balloon *vb)
> +{
> +	void *buf;
> +
> +	/*
> +	 * By default, we allocate page_bmap[0] only. More page_bmap will be
> +	 * allocated on demand.
> +	 */
> +	vb->page_bmap[0] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
> +	buf = kmalloc(sizeof(struct virtio_balloon_page_chunk_hdr) +
> +		      sizeof(struct virtio_balloon_page_chunk) *
> +		      MAX_PAGE_CHUNKS, GFP_KERNEL);
> +	if (!vb->page_bmap[0] || !buf) {
> +		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS);

This doesn't work as expected, as features have been OK'd by then.
You want something like
validate_features that I posted. See
"virtio: allow drivers to validate features".

> +		kfree(vb->page_bmap[0]);

Looks like this will double free. You want to zero them, I think.
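
A common pattern for this kind of init path is to allocate into locals, free the locals on failure, and leave the struct's pointers NULL so a later cleanup is harmless. The sketch below shows that pattern in user space with malloc()/free() standing in for kmalloc()/kfree(); the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_balloon {
	void *bmap0;		/* first bitmap, kept for the device lifetime */
	void *chunk_buf;	/* header + chunk entries */
	unsigned int nr_bmaps;
};

/* Returns 0 on success, -1 on failure with all fields left cleared. */
static int sketch_chunk_init(struct sketch_balloon *vb,
			     size_t bmap_size, size_t buf_size)
{
	void *bmap = malloc(bmap_size);
	void *buf = malloc(buf_size);

	assert(vb != NULL);
	if (!bmap || !buf) {
		/* Free only what this call allocated; free(NULL) is a no-op. */
		free(bmap);
		free(buf);
		vb->bmap0 = NULL;
		vb->chunk_buf = NULL;
		vb->nr_bmaps = 0;
		return -1;
	}
	vb->bmap0 = bmap;
	vb->chunk_buf = buf;
	vb->nr_bmaps = 1;
	return 0;
}

/* Safe to call more than once because pointers are reset to NULL. */
static void sketch_chunk_cleanup(struct sketch_balloon *vb)
{
	assert(vb != NULL);
	free(vb->bmap0);
	free(vb->chunk_buf);
	vb->bmap0 = NULL;
	vb->chunk_buf = NULL;
	vb->nr_bmaps = 0;
}
```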

> +		kfree(vb->balloon_page_chunk_hdr);
> +		dev_warn(&vb->vdev->dev, "%s: failed\n", __func__);
> +	} else {
> +		vb->page_bmaps = 1;
> +		vb->balloon_page_chunk_hdr = buf;
> +		vb->balloon_page_chunk_hdr->chunks = 0;
> +		vb->balloon_page_chunk = buf +
> +				sizeof(struct virtio_balloon_page_chunk_hdr);
> +	}
> +}
> +
>  static int virtballoon_probe(struct virtio_device *vdev)
>  {
>  	struct virtio_balloon *vb;
> @@ -533,6 +865,10 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	spin_lock_init(&vb->stop_update_lock);
>  	vb->stop_update = false;
>  	vb->num_pages = 0;
> +
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS))
> +		balloon_page_chunk_init(vb);
> +
>  	mutex_init(&vb->balloon_lock);
>  	init_waitqueue_head(&vb->acked);
>  	vb->vdev = vdev;
> @@ -609,6 +945,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	cancel_work_sync(&vb->update_balloon_stats_work);
>  
>  	remove_common(vb);
> +	free_page_bmap(vb);
>  	if (vb->vb_dev_info.inode)
>  		iput(vb->vb_dev_info.inode);
>  	kfree(vb);
> @@ -649,6 +986,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> +	VIRTIO_BALLOON_F_BALLOON_CHUNKS,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 343d7dd..be317b7 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -34,6 +34,7 @@
>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> +#define VIRTIO_BALLOON_F_BALLOON_CHUNKS 3 /* Inflate/Deflate pages in chunks */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> @@ -82,4 +83,16 @@ struct virtio_balloon_stat {
>  	__virtio64 val;
>  } __attribute__((packed));
>  
> +struct virtio_balloon_page_chunk_hdr {
> +	/* Number of chunks in the payload */
> +	__le32 chunks;

You want to make this __le64 to align everything to 64 bit.

> +};
> +
> +#define VIRTIO_BALLOON_CHUNK_BASE_SHIFT 12
> +#define VIRTIO_BALLOON_CHUNK_SIZE_SHIFT 12
> +struct virtio_balloon_page_chunk {

so rename this virtio_balloon_page_chunk_entry

> +	__le64 base;
> +	__le64 size;
> +};
> +

And then:

struct virtio_balloon_page_chunk {
	struct virtio_balloon_page_chunk_hdr hdr;
	struct virtio_balloon_page_chunk_entry entries[];
};
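
The effect of the suggested layout on sizes can be checked in user space. In the sketch below uint64_t stands in for __le64, the struct names carry a hypothetical sketch_ prefix, and sketch_page_chunk_bytes() is an illustrative helper; with a 64-bit header the entries start at offset 8, and 4095 entries plus the header stay within 64 KiB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_page_chunk_hdr {
	uint64_t chunks;	/* stand-in for __le64 */
};

struct sketch_page_chunk_entry {
	uint64_t base;		/* stand-in for __le64 */
	uint64_t size;		/* stand-in for __le64 */
};

struct sketch_page_chunk {
	struct sketch_page_chunk_hdr hdr;
	struct sketch_page_chunk_entry entries[];	/* flexible array */
};

/* Bytes needed for one header followed by nr_entries entries. */
static size_t sketch_page_chunk_bytes(size_t nr_entries)
{
	return sizeof(struct sketch_page_chunk) +
	       nr_entries * sizeof(struct sketch_page_chunk_entry);
}
```

With a single allocation of this type, the separate header and entry pointers in struct virtio_balloon would no longer be needed.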



>  #endif /* _LINUX_VIRTIO_BALLOON_H */
> -- 
> 2.7.4


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [Qemu-devel] [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
@ 2017-04-13 16:34     ` Michael S. Tsirkin
  0 siblings, 0 replies; 136+ messages in thread
From: Michael S. Tsirkin @ 2017-04-13 16:34 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Thu, Apr 13, 2017 at 05:35:05PM +0800, Wei Wang wrote:
> Add a new feature, VIRTIO_BALLOON_F_BALLOON_CHUNKS, which enables

Let's find a better name here.
VIRTIO_BALLOON_F_PAGE_CHUNK


> the transfer of the ballooned (i.e. inflated/deflated) pages in
> chunks to the host.
> 
> The implementation of the previous virtio-balloon is not very
> efficient, because the ballooned pages are transferred to the
> host one by one. Here is the breakdown of the time in percentage
> spent on each step of the balloon inflating process (inflating
> 7GB of an 8GB idle guest).
> 
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
> 
> It takes about 4126ms for the inflating process to complete.
> The above profiling shows that the bottlenecks are stage 2)
> and stage 4).
> 
> This patch optimizes step 2) by transferring pages to the host in
> chunks. A chunk consists of guest physically continuous pages, and
> it is offered to the host via a base PFN (i.e. the start PFN of
> those physically continuous pages) and the size (i.e. the total
> number of the pages). A chunk is formated as below:

formatted

> --------------------------------------------------------
> |                 Base (52 bit)        | Rsvd (12 bit) |
> --------------------------------------------------------
> --------------------------------------------------------
> |                 Size (52 bit)        | Rsvd (12 bit) |
> --------------------------------------------------------
> 
> By doing so, step 4) can also be optimized by doing address
> translation and madvise() in chunks rather than page by page.
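
The chunk layout quoted above (52 bits of value, low 12 bits reserved) can be sketched as an encode/decode pair. This is a user-space illustration with hypothetical names; the fields are kept in host byte order here, whereas the patch additionally converts them with cpu_to_le64().

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_CHUNK_BASE_SHIFT 12
#define SKETCH_CHUNK_SIZE_SHIFT 12

struct sketch_chunk {
	uint64_t base;	/* base pfn in bits 63..12, bits 11..0 reserved */
	uint64_t size;	/* page count in bits 63..12, bits 11..0 reserved */
};

/* Pack a base pfn and a page count into the 52 + 12 bit layout. */
static struct sketch_chunk sketch_chunk_encode(uint64_t base_pfn,
					       uint64_t nr_pages)
{
	struct sketch_chunk c;

	/* Both values must fit in 52 bits. */
	assert(base_pfn < (1ULL << 52));
	assert(nr_pages < (1ULL << 52));
	c.base = base_pfn << SKETCH_CHUNK_BASE_SHIFT;
	c.size = nr_pages << SKETCH_CHUNK_SIZE_SHIFT;
	return c;
}

static uint64_t sketch_chunk_base_pfn(struct sketch_chunk c)
{
	return c.base >> SKETCH_CHUNK_BASE_SHIFT;
}

static uint64_t sketch_chunk_nr_pages(struct sketch_chunk c)
{
	return c.size >> SKETCH_CHUNK_SIZE_SHIFT;
}
```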
> 
> With this new feature, the above ballooning process takes ~590ms
> resulting in an improvement of ~85%.
> 
> TODO: optimize stage 1) by allocating/freeing a chunk of pages
> instead of a single page each time.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>

So we don't need the bitmap to talk to host, it is just
a data structure we chose to maintain lists of pages, right?
OK as far as it goes but you need much better isolation for it.
Build a data structure with APIs such as _init, _cleanup, _add, _clear,
_find_first, _find_next.
Completely unrelated to pages, it just maintains bits.
Then use it here.
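
As a rough idea of the shape such an API could take, here is a self-contained user-space sketch. It knows nothing about pages, only bits. All names are hypothetical, calloc()/free() stand in for kernel allocators, and the linear scan in _find_next is chosen for clarity rather than speed.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

struct sketch_bitmap {
	unsigned long *words;
	unsigned long nbits;
};

static unsigned long sketch_bitmap_words(unsigned long nbits)
{
	return (nbits + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;
}

/* Returns 0 on success, -1 if the allocation failed. */
static int sketch_bitmap_init(struct sketch_bitmap *b, unsigned long nbits)
{
	unsigned long nwords = sketch_bitmap_words(nbits);

	b->words = calloc(nwords ? nwords : 1, sizeof(unsigned long));
	b->nbits = b->words ? nbits : 0;
	return b->words ? 0 : -1;
}

static void sketch_bitmap_cleanup(struct sketch_bitmap *b)
{
	free(b->words);
	b->words = NULL;
	b->nbits = 0;
}

static void sketch_bitmap_add(struct sketch_bitmap *b, unsigned long bit)
{
	assert(bit < b->nbits);
	b->words[bit / SKETCH_BITS_PER_LONG] |=
		1UL << (bit % SKETCH_BITS_PER_LONG);
}

static void sketch_bitmap_clear(struct sketch_bitmap *b)
{
	memset(b->words, 0,
	       sketch_bitmap_words(b->nbits) * sizeof(unsigned long));
}

/* First set bit at or after from; returns nbits when there is none. */
static unsigned long sketch_bitmap_find_next(const struct sketch_bitmap *b,
					     unsigned long from)
{
	unsigned long bit;

	for (bit = from; bit < b->nbits; bit++)
		if (b->words[bit / SKETCH_BITS_PER_LONG] &
		    (1UL << (bit % SKETCH_BITS_PER_LONG)))
			return bit;
	return b->nbits;
}

static unsigned long sketch_bitmap_find_first(const struct sketch_bitmap *b)
{
	return sketch_bitmap_find_next(b, 0);
}
```

The balloon code would then translate pfns to bit numbers at the call site and never touch the word array directly.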


> ---
>  drivers/virtio/virtio_balloon.c     | 384 +++++++++++++++++++++++++++++++++---
>  include/uapi/linux/virtio_balloon.h |  13 ++
>  2 files changed, 374 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index f59cb4f..5e2e7cc 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -42,6 +42,10 @@
>  #define OOM_VBALLOON_DEFAULT_PAGES 256
>  #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
>  
> +#define PAGE_BMAP_SIZE		(8 * PAGE_SIZE)
> +#define PFNS_PER_PAGE_BMAP	(PAGE_BMAP_SIZE * BITS_PER_BYTE)
> +#define PAGE_BMAP_COUNT_MAX	32
> +

Please prefix with VIRTIO_BALLOON_ and add comments.
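
For illustration, prefixed and commented definitions could look like the sketch below. The VIRTIO_BALLOON_ names are only a guess at what the prefixed macros might be called, and the 4 KiB page size and 4 KiB balloon pfn size are assumptions made so the block builds in user space; sketch_max_tracked_bytes() is a hypothetical helper showing how much memory the bitmaps can describe in one pass.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE	4096UL	/* assumption: 4 KiB pages */
#define SKETCH_BITS_PER_BYTE	8UL
#define SKETCH_BALLOON_PFN_SIZE	4096ULL	/* 1 << VIRTIO_BALLOON_PFN_SHIFT */

/* Size in bytes of a single page bitmap */
#define VIRTIO_BALLOON_PAGE_BMAP_SIZE	(8 * SKETCH_PAGE_SIZE)
/* Number of balloon pfns one bitmap can record, one bit per pfn */
#define VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP \
	(VIRTIO_BALLOON_PAGE_BMAP_SIZE * SKETCH_BITS_PER_BYTE)
/* Upper bound on the number of bitmaps allocated at the same time */
#define VIRTIO_BALLOON_PAGE_BMAP_COUNT_MAX	32

/* Bytes of guest memory covered when every bitmap is allocated. */
static unsigned long long sketch_max_tracked_bytes(void)
{
	return (unsigned long long)VIRTIO_BALLOON_PFNS_PER_PAGE_BMAP *
	       VIRTIO_BALLOON_PAGE_BMAP_COUNT_MAX * SKETCH_BALLOON_PFN_SIZE;
}
```

Under these assumptions one bitmap covers 1 GiB worth of balloon pfns and the full set covers 32 GiB per pass.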

>  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
>  module_param(oom_pages, int, S_IRUSR | S_IWUSR);
>  MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> @@ -50,6 +54,10 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>  static struct vfsmount *balloon_mnt;
>  #endif
>  
> +/* Types of pages to chunk */
> +#define PAGE_CHUNK_TYPE_BALLOON 0
> +

Doesn't look like you are ever adding more types in this
patchset.  Pls keep code simple, generalize it later.

> +#define MAX_PAGE_CHUNKS 4096

This is an order-4 allocation. I'd make it 4095 and then it's
an order-3 one.

>  struct virtio_balloon {
>  	struct virtio_device *vdev;
>  	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> @@ -78,6 +86,32 @@ struct virtio_balloon {
>  	/* Synchronize access/update to this struct virtio_balloon elements */
>  	struct mutex balloon_lock;
>  
> +	/*
> +	 * Buffer for PAGE_CHUNK_TYPE_BALLOON:
> +	 * virtio_balloon_page_chunk_hdr +
> +	 * virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
> +	 */
> +	struct virtio_balloon_page_chunk_hdr *balloon_page_chunk_hdr;
> +	struct virtio_balloon_page_chunk *balloon_page_chunk;
> +
> +	/* Bitmap used to record pages */
> +	unsigned long *page_bmap[PAGE_BMAP_COUNT_MAX];
> +	/* Number of the allocated page_bmap */
> +	unsigned int page_bmaps;
> +
> +	/*
> +	 * The allocated page_bmap size may be smaller than the pfn range of
> +	 * the ballooned pages. In this case, we need to use the page_bmap
> +	 * multiple times to cover the entire pfn range. It's like using a
> +	 * short ruler several times to finish measuring a long object.
> +	 * The start location of the ruler in the next measurement is the end
> +	 * location of the ruler in the previous measurement.
> +	 *
> +	 * pfn_max & pfn_min: forms the pfn range of the ballooned pages
> +	 * pfn_start & pfn_stop: records the start and stop pfn in each cover

cover? what does this mean?

looks like you only use these to pass data to tell_host.
so pass these as parameters and you won't need to keep
them in this structure.

And then you can move this comment to set_page_bmap where
it belongs.

> +	 */
> +	unsigned long pfn_min, pfn_max, pfn_start, pfn_stop;
> +
>  	/* The array of pfns we tell the Host about. */
>  	unsigned int num_pfns;
>  	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> @@ -110,20 +144,201 @@ static void balloon_ack(struct virtqueue *vq)
>  	wake_up(&vb->acked);
>  }
>  
> -static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> +static inline void init_page_bmap_range(struct virtio_balloon *vb)
> +{
> +	vb->pfn_min = ULONG_MAX;
> +	vb->pfn_max = 0;
> +}
> +
> +static inline void update_page_bmap_range(struct virtio_balloon *vb,
> +					  struct page *page)
> +{
> +	unsigned long balloon_pfn = page_to_balloon_pfn(page);
> +
> +	vb->pfn_min = min(balloon_pfn, vb->pfn_min);
> +	vb->pfn_max = max(balloon_pfn, vb->pfn_max);
> +}
> +
> +/* The page_bmap size is extended by adding more number of page_bmap */

did you mean

	Allocate more bitmaps to cover the given number of pfns
	and add them to page_bmap

?

This isn't what this function does.
It blindly assumes 1 bitmap is allocated
and allocates more, up to PAGE_BMAP_COUNT_MAX.

> +static void extend_page_bmap_size(struct virtio_balloon *vb,
> +				  unsigned long pfns)
> +{
> +	int i, bmaps;
> +	unsigned long bmap_len;
> +
> +	bmap_len = ALIGN(pfns, BITS_PER_LONG) / BITS_PER_BYTE;
> +	bmap_len = ALIGN(bmap_len, PAGE_BMAP_SIZE);

Align? PAGE_BMAP_SIZE doesn't even have to be a power of 2 ...

> +	bmaps = min((int)(bmap_len / PAGE_BMAP_SIZE),
> +		    PAGE_BMAP_COUNT_MAX);

I got lost here.

Please use things like ARRAY_SIZE(vb->page_bmap) instead of
repeating the PAGE_BMAP_COUNT_MAX macro.

> +
> +	for (i = 1; i < bmaps; i++) {
> +		vb->page_bmap[i] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
> +		if (vb->page_bmap[i])
> +			vb->page_bmaps++;
> +		else
> +			break;
> +	}
> +}
> +
> +static void free_extended_page_bmap(struct virtio_balloon *vb)
> +{
> +	int i, bmaps = vb->page_bmaps;
> +
> +	for (i = 1; i < bmaps; i++) {
> +		kfree(vb->page_bmap[i]);
> +		vb->page_bmap[i] = NULL;
> +		vb->page_bmaps--;
> +	}
> +}
> +

What's the magic number 1 here?
Maybe you want to document what is going on.
Here's a guess:

We keep a single bmap around at all times.
If memory does not fit there, we allocate up to
PAGE_BMAP_COUNT_MAX of chunks.


> +static void free_page_bmap(struct virtio_balloon *vb)
> +{
> +	int i;
> +
> +	for (i = 0; i < vb->page_bmaps; i++)
> +		kfree(vb->page_bmap[i]);
> +}
> +
> +static void clear_page_bmap(struct virtio_balloon *vb)
> +{
> +	int i;
> +
> +	for (i = 0; i < vb->page_bmaps; i++)
> +		memset(vb->page_bmap[i], 0, PAGE_BMAP_SIZE);
> +}
> +
> +static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq,
> +			     int type, bool busy_wait)

busy_wait seems unused: every caller in this patch passes false.
Please drop it.

>  {
>  	struct scatterlist sg;
> +	struct virtio_balloon_page_chunk_hdr *hdr;
> +	void *buf;
>  	unsigned int len;
>  
> -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> +	switch (type) {
> +	case PAGE_CHUNK_TYPE_BALLOON:
> +		hdr = vb->balloon_page_chunk_hdr;
> +		len = 0;
> +		break;
> +	default:
> +		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
> +			 __func__, type);
> +		return;
> +	}
>  
> -	/* We should always be able to add one buffer to an empty queue. */
> -	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> -	virtqueue_kick(vq);
> +	buf = (void *)hdr - len;

Moving back to before the header? How can this make sense?
It works fine since len is 0, so just buf = hdr.

> +	len += sizeof(struct virtio_balloon_page_chunk_hdr);
> +	len += hdr->chunks * sizeof(struct virtio_balloon_page_chunk);
> +	sg_init_table(&sg, 1);
> +	sg_set_buf(&sg, buf, len);
> +	if (!virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL)) {
> +		virtqueue_kick(vq);
> +		if (busy_wait)
> +			while (!virtqueue_get_buf(vq, &len) &&
> +			       !virtqueue_is_broken(vq))
> +				cpu_relax();
> +		else
> +			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		hdr->chunks = 0;

Why zero it here after device used it? Better to zero before use.

> +	}
> +}
> +
> +static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
> +			  int type, u64 base, u64 size)

what are the units here? Looks like it's in 4kbyte units?

> +{
> +	struct virtio_balloon_page_chunk_hdr *hdr;
> +	struct virtio_balloon_page_chunk *chunk;
> +
> +	switch (type) {
> +	case PAGE_CHUNK_TYPE_BALLOON:
> +		hdr = vb->balloon_page_chunk_hdr;
> +		chunk = vb->balloon_page_chunk;
> +		break;
> +	default:
> +		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
> +			 __func__, type);
> +		return;
> +	}
> +	chunk = chunk + hdr->chunks;
> +	chunk->base = cpu_to_le64(base << VIRTIO_BALLOON_CHUNK_BASE_SHIFT);
> +	chunk->size = cpu_to_le64(size << VIRTIO_BALLOON_CHUNK_SIZE_SHIFT);
> +	hdr->chunks++;

Isn't hdr->chunks LE (__le32)? You should keep the native count
somewhere else.

> +	if (hdr->chunks == MAX_PAGE_CHUNKS)
> +		send_page_chunks(vb, vq, type, false);
		and zero chunks here?
> +}
> +
> +static void chunking_pages_from_bmap(struct virtio_balloon *vb,

Does this mean "convert_bmap_to_chunks"?

> +				     struct virtqueue *vq,
> +				     unsigned long pfn_start,
> +				     unsigned long *bmap,
> +				     unsigned long len)
> +{
> +	unsigned long pos = 0, end = len * BITS_PER_BYTE;
> +
> +	while (pos < end) {
> +		unsigned long one = find_next_bit(bmap, end, pos);
> +
> +		if (one < end) {
> +			unsigned long chunk_size, zero;
> +
> +			zero = find_next_zero_bit(bmap, end, one + 1);


zero and one are unhelpful names unless they equal 0 and 1.
current/next?


> +			if (zero >= end)
> +				chunk_size = end - one;
> +			else
> +				chunk_size = zero - one;
> +
> +			if (chunk_size)
> +				add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
> +					      pfn_start + one, chunk_size);

Still not sure what a bit refers to: a page or 4kbytes?
I think it should be a page.

> +			pos = one + chunk_size;
> +		} else
> +			break;
> +	}
> +}
> +



> +static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> +{
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS)) {
> +		int pfns, page_bmaps, i;
> +		unsigned long pfn_start, pfns_len;
> +
> +		pfn_start = vb->pfn_start;
> +		pfns = vb->pfn_stop - pfn_start + 1;
> +		pfns = roundup(roundup(pfns, BITS_PER_LONG),
> +			       PFNS_PER_PAGE_BMAP);
> +		page_bmaps = pfns / PFNS_PER_PAGE_BMAP;
> +		pfns_len = pfns / BITS_PER_BYTE;
> +
> +		for (i = 0; i < page_bmaps; i++) {
> +			unsigned int bmap_len = PAGE_BMAP_SIZE;
> +
> +			/* The last one takes the leftover only */

I don't understand what this means.

> +			if (i + 1 == page_bmaps)
> +				bmap_len = pfns_len - PAGE_BMAP_SIZE * i;
> +
> +			chunking_pages_from_bmap(vb, vq, pfn_start +
> +						 i * PFNS_PER_PAGE_BMAP,
> +						 vb->page_bmap[i], bmap_len);
> +		}
> +		if (vb->balloon_page_chunk_hdr->chunks > 0)
> +			send_page_chunks(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
> +					 false);
> +	} else {
> +		struct scatterlist sg;
> +		unsigned int len;
>  
> -	/* When host has read buffer, this completes via balloon_ack */
> -	wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
>  
> +		/*
> +		 * We should always be able to add one buffer to an empty
> +		 * queue.
> +		 */
> +		virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> +		virtqueue_kick(vq);
> +
> +		/* When host has read buffer, this completes via balloon_ack */
> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +	}
>  }
>  
>  static void set_page_pfns(struct virtio_balloon *vb,
> @@ -131,20 +346,73 @@ static void set_page_pfns(struct virtio_balloon *vb,
>  {
>  	unsigned int i;
>  
> -	/* Set balloon pfns pointing at this page.
> -	 * Note that the first pfn points at start of the page. */
> +	/*
> +	 * Set balloon pfns pointing at this page.
> +	 * Note that the first pfn points at start of the page.
> +	 */
>  	for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++)
>  		pfns[i] = cpu_to_virtio32(vb->vdev,
>  					  page_to_balloon_pfn(page) + i);
>  }
>

Nice cleanup but pls split this out. This patch is big enough as it is.
  
> +static void set_page_bmap(struct virtio_balloon *vb,
> +			  struct list_head *pages, struct virtqueue *vq)
> +{
> +	unsigned long pfn_start, pfn_stop;
> +	struct page *page;
> +	bool found;
> +
> +	vb->pfn_min = rounddown(vb->pfn_min, BITS_PER_LONG);
> +	vb->pfn_max = roundup(vb->pfn_max, BITS_PER_LONG);
> +
> +	extend_page_bmap_size(vb, vb->pfn_max - vb->pfn_min + 1);

This might not do anything; in particular, it might not cover the
given pfn range. Do we care? Why not?
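
A small user-space sketch of the window walk in the quoted set_page_bmap()
loop, assuming 32 KiB bitmaps as in this patch. The names ex_passes() and
EX_PFNS_PER_BMAP are hypothetical. It only counts how many passes the loop
would make for a given number of allocated bitmaps. Under these assumptions a
short allocation appears to cost extra passes over the page list rather than
lose coverage, provided at least one bitmap exists. With zero bitmaps the
window is empty and the walk would never advance.

```c
#include <assert.h>

/* Bits in one 32 KiB bitmap (assumed: 8 pages of 4 KiB each). */
#define EX_PFNS_PER_BMAP (8UL * 4096UL * 8UL)

/*
 * Count the passes needed to cover [pfn_min, pfn_max) when each pass
 * covers bmaps bitmaps. Overflow of start + window is not handled here.
 */
static unsigned long ex_passes(unsigned long pfn_min, unsigned long pfn_max,
			       unsigned int bmaps)
{
	unsigned long window, start = pfn_min, n = 0;

	assert(bmaps > 0);	/* an empty window would never advance */
	window = EX_PFNS_PER_BMAP * bmaps;
	while (start < pfn_max) {
		unsigned long stop = start + window;

		if (stop > pfn_max)
			stop = pfn_max;
		start = stop;
		n++;
	}
	return n;
}
```

Each extra pass repeats the walk over the whole page list, so the cost of a
failed extension grows with the list length.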

> +	pfn_start = vb->pfn_min;
> +
> +	while (pfn_start < vb->pfn_max) {
> +		pfn_stop = pfn_start + PFNS_PER_PAGE_BMAP * vb->page_bmaps;
> +		pfn_stop = pfn_stop < vb->pfn_max ? pfn_stop : vb->pfn_max;
> +
> +		vb->pfn_start = pfn_start;
> +		clear_page_bmap(vb);
> +		found = false;
> +
> +		list_for_each_entry(page, pages, lru) {
> +			unsigned long bmap_idx, bmap_pos, balloon_pfn;
> +
> +			balloon_pfn = page_to_balloon_pfn(page);
> +			if (balloon_pfn < pfn_start || balloon_pfn > pfn_stop)
> +				continue;
> +			bmap_idx = (balloon_pfn - pfn_start) /
> +				   PFNS_PER_PAGE_BMAP;
> +			bmap_pos = (balloon_pfn - pfn_start) %
> +				   PFNS_PER_PAGE_BMAP;
> +			set_bit(bmap_pos, vb->page_bmap[bmap_idx]);

Looks like this will crash if bmap_idx is out of range or
if page_bmap allocation failed.
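
A hedged sketch of a bounds-checked version of the set_bit() step, written
as plain user-space C. struct ex_bmaps, ex_set_pfn() and the EX_ constants
are hypothetical stand-ins, not the driver's own types. Note that the quoted
test "balloon_pfn > pfn_stop" lets balloon_pfn == pfn_stop through, which
gives an index one past the last bitmap when the window is full. The sketch
rejects that case along with missing bitmaps.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define EX_BMAP_COUNT_MAX 32
/* Bits in one 32 KiB bitmap (assumed size). */
#define EX_PFNS_PER_BMAP  (8UL * 4096UL * CHAR_BIT)
#define EX_BITS_PER_LONG  (sizeof(unsigned long) * CHAR_BIT)

struct ex_bmaps {
	unsigned long *bmap[EX_BMAP_COUNT_MAX];	/* NULL where allocation failed */
	unsigned int count;			/* number of usable bitmaps */
};

/*
 * Record pfn in the window that starts at pfn_start. Returns false, and
 * touches nothing, when pfn lies outside the window or its bitmap is
 * missing.
 */
static bool ex_set_pfn(struct ex_bmaps *b, unsigned long pfn_start,
		       unsigned long pfn)
{
	unsigned long off, idx, pos;

	assert(b != NULL);
	if (pfn < pfn_start)
		return false;
	off = pfn - pfn_start;
	idx = off / EX_PFNS_PER_BMAP;
	pos = off % EX_PFNS_PER_BMAP;
	if (idx >= b->count || idx >= EX_BMAP_COUNT_MAX || !b->bmap[idx])
		return false;
	b->bmap[idx][pos / EX_BITS_PER_LONG] |= 1UL << (pos % EX_BITS_PER_LONG);
	return true;
}
```

A caller can treat a false return as "not in this pass" and pick the page up
in a later window.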

> +
> +			found = true;
> +		}
> +		if (found) {
> +			vb->pfn_stop = pfn_stop;
> +			tell_host(vb, vq);
> +		}
> +		pfn_start = pfn_stop;
> +	}
> +	free_extended_page_bmap(vb);
> +}
> +
>  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	unsigned num_allocated_pages;
> +	bool chunking = virtio_has_feature(vb->vdev,
> +					   VIRTIO_BALLOON_F_BALLOON_CHUNKS);
>  
>  	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	if (chunking) {
> +		init_page_bmap_range(vb);
> +	} else {
> +		/* We can only do one array worth at a time. */
> +		num = min(num, ARRAY_SIZE(vb->pfns));
> +	}
>  
>  	mutex_lock(&vb->balloon_lock);
>  	for (vb->num_pfns = 0; vb->num_pfns < num;
> @@ -159,7 +427,10 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  			msleep(200);
>  			break;
>  		}
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (chunking)
> +			update_page_bmap_range(vb, page);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>  		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
>  		if (!virtio_has_feature(vb->vdev,
>  					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> @@ -168,8 +439,13 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  
>  	num_allocated_pages = vb->num_pfns;
>  	/* Did we get any? */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->inflate_vq);
> +	if (vb->num_pfns != 0) {
> +		if (chunking)
> +			set_page_bmap(vb, &vb_dev_info->pages,
> +					vb->inflate_vq);
> +		else
> +			tell_host(vb, vb->inflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	return num_allocated_pages;
> @@ -195,6 +471,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	struct page *page;
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	LIST_HEAD(pages);
> +	bool chunking = virtio_has_feature(vb->vdev,
> +					   VIRTIO_BALLOON_F_BALLOON_CHUNKS);
> +	if (chunking)
> +		init_page_bmap_range(vb);
> +	else
> +		/* We can only do one array worth at a time. */
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	/* We can only do one array worth at a time. */
>  	num = min(num, ARRAY_SIZE(vb->pfns));
> @@ -208,6 +491,10 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  		if (!page)
>  			break;
>  		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (chunking)
> +			update_page_bmap_range(vb, page);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>  		list_add(&page->lru, &pages);
>  		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>  	}
> @@ -218,8 +505,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
>  	 * is true, we *have* to do it in this order
>  	 */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->deflate_vq);
> +	if (vb->num_pfns != 0) {
> +		if (chunking)
> +			set_page_bmap(vb, &pages, vb->deflate_vq);
> +		else
> +			tell_host(vb, vb->deflate_vq);
> +	}
>  	release_pages_balloon(vb, &pages);
>  	mutex_unlock(&vb->balloon_lock);
>  	return num_freed_pages;
> @@ -431,6 +722,13 @@ static int init_vqs(struct virtio_balloon *vb)
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> +
> +static void tell_host_one_page(struct virtio_balloon *vb,
> +			       struct virtqueue *vq, struct page *page)
> +{
> +	add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON, page_to_pfn(page), 1);

This passes 4kbytes to host which seems wrong - I think you want a full page.
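
To make the unit question concrete, here is a minimal sketch assuming the
balloon interface counts in 4 KiB units (VIRTIO_BALLOON_PFN_SHIFT is 12)
while the guest page size may be larger. struct ex_chunk and
ex_one_page_chunk() are hypothetical names for illustration. The sketch
converts one guest page into a base and size in balloon units.

```c
#include <assert.h>
#include <stdint.h>

#define EX_BALLOON_PFN_SHIFT 12	/* balloon interface counts 4 KiB units */

struct ex_chunk {
	uint64_t base;	/* first 4 KiB unit of the chunk */
	uint64_t size;	/* number of 4 KiB units */
};

/*
 * Describe one guest page in balloon units. page_shift is log2 of the
 * guest page size, e.g. 12 for 4 KiB pages or 16 for 64 KiB pages.
 */
static struct ex_chunk ex_one_page_chunk(uint64_t guest_pfn,
					 unsigned int page_shift)
{
	uint64_t units_per_page;
	struct ex_chunk c;

	assert(page_shift >= EX_BALLOON_PFN_SHIFT);
	units_per_page = UINT64_C(1) << (page_shift - EX_BALLOON_PFN_SHIFT);
	c.base = guest_pfn * units_per_page;
	c.size = units_per_page;
	return c;
}
```

With 4 KiB guest pages the two units coincide, which is why passing
page_to_pfn(page) and a size of 1 only goes wrong on larger page sizes.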

> +}
> +
>  /*
>   * virtballoon_migratepage - perform the balloon page migration on behalf of
>   *			     a compation thread.     (called under page lock)
> @@ -454,6 +752,8 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  {
>  	struct virtio_balloon *vb = container_of(vb_dev_info,
>  			struct virtio_balloon, vb_dev_info);
> +	bool chunking = virtio_has_feature(vb->vdev,
> +					   VIRTIO_BALLOON_F_BALLOON_CHUNKS);
>  	unsigned long flags;
>  
>  	/*
> @@ -475,16 +775,22 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  	vb_dev_info->isolated_pages--;
>  	__count_vm_event(BALLOON_MIGRATE);
>  	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, newpage);
> -	tell_host(vb, vb->inflate_vq);
> -
> +	if (chunking) {
> +		tell_host_one_page(vb, vb->inflate_vq, newpage);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, newpage);
> +		tell_host(vb, vb->inflate_vq);
> +	}
>  	/* balloon's page migration 2nd step -- deflate "page" */
>  	balloon_page_delete(page);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, page);
> -	tell_host(vb, vb->deflate_vq);
> -
> +	if (chunking) {
> +		tell_host_one_page(vb, vb->deflate_vq, page);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, page);
> +		tell_host(vb, vb->deflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	put_page(page); /* balloon reference */
> @@ -511,6 +817,32 @@ static struct file_system_type balloon_fs = {
>  
>  #endif /* CONFIG_BALLOON_COMPACTION */
>  
> +static void balloon_page_chunk_init(struct virtio_balloon *vb)
> +{
> +	void *buf;
> +
> +	/*
> +	 * By default, we allocate page_bmap[0] only. More page_bmap will be
> +	 * allocated on demand.
> +	 */
> +	vb->page_bmap[0] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
> +	buf = kmalloc(sizeof(struct virtio_balloon_page_chunk_hdr) +
> +		      sizeof(struct virtio_balloon_page_chunk) *
> +		      MAX_PAGE_CHUNKS, GFP_KERNEL);
> +	if (!vb->page_bmap[0] || !buf) {
> +		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS);

This doesn't work as expected, as features have been OK'd by then.
You want something like the validate_features that I posted. See
"virtio: allow drivers to validate features".

> +		kfree(vb->page_bmap[0]);

Looks like this will double free. You want to zero them, I think.
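
One way to structure the error path, sketched in user-space C with malloc()
and free() standing in for kmalloc() and kfree(). struct ex_balloon,
ex_chunk_init() and ex_chunk_cleanup() are hypothetical names. The idea is to
allocate into locals, free the locals on failure, and publish pointers only
on success, so that a later cleanup never sees a stale pointer.

```c
#include <assert.h>
#include <stdlib.h>

struct ex_balloon {
	void *page_bmap0;	/* stands in for vb->page_bmap[0] */
	void *chunk_buf;	/* stands in for vb->balloon_page_chunk_hdr */
	unsigned int page_bmaps;
};

/* Returns 0 on success, -1 on failure. On failure all pointers are NULL. */
static int ex_chunk_init(struct ex_balloon *vb, size_t bmap_size,
			 size_t buf_size)
{
	void *bmap, *buf;

	assert(vb != NULL);
	bmap = malloc(bmap_size);
	buf = malloc(buf_size);
	if (!bmap || !buf) {
		/* free(NULL) is a no-op, so both calls are always safe. */
		free(bmap);
		free(buf);
		vb->page_bmap0 = NULL;
		vb->chunk_buf = NULL;
		vb->page_bmaps = 0;
		return -1;
	}
	vb->page_bmap0 = bmap;
	vb->chunk_buf = buf;
	vb->page_bmaps = 1;
	return 0;
}

/* Safe to call more than once: pointers are cleared after being freed. */
static void ex_chunk_cleanup(struct ex_balloon *vb)
{
	assert(vb != NULL);
	free(vb->page_bmap0);
	free(vb->chunk_buf);
	vb->page_bmap0 = NULL;
	vb->chunk_buf = NULL;
	vb->page_bmaps = 0;
}
```

Returning an error code also gives the caller a chance to react to the
failure instead of relying on a feature bit being cleared afterwards.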

> +		kfree(vb->balloon_page_chunk_hdr);
> +		dev_warn(&vb->vdev->dev, "%s: failed\n", __func__);
> +	} else {
> +		vb->page_bmaps = 1;
> +		vb->balloon_page_chunk_hdr = buf;
> +		vb->balloon_page_chunk_hdr->chunks = 0;
> +		vb->balloon_page_chunk = buf +
> +				sizeof(struct virtio_balloon_page_chunk_hdr);
> +	}
> +}
> +
>  static int virtballoon_probe(struct virtio_device *vdev)
>  {
>  	struct virtio_balloon *vb;
> @@ -533,6 +865,10 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	spin_lock_init(&vb->stop_update_lock);
>  	vb->stop_update = false;
>  	vb->num_pages = 0;
> +
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS))
> +		balloon_page_chunk_init(vb);
> +
>  	mutex_init(&vb->balloon_lock);
>  	init_waitqueue_head(&vb->acked);
>  	vb->vdev = vdev;
> @@ -609,6 +945,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	cancel_work_sync(&vb->update_balloon_stats_work);
>  
>  	remove_common(vb);
> +	free_page_bmap(vb);
>  	if (vb->vb_dev_info.inode)
>  		iput(vb->vb_dev_info.inode);
>  	kfree(vb);
> @@ -649,6 +986,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> +	VIRTIO_BALLOON_F_BALLOON_CHUNKS,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 343d7dd..be317b7 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -34,6 +34,7 @@
>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> +#define VIRTIO_BALLOON_F_BALLOON_CHUNKS 3 /* Inflate/Deflate pages in chunks */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> @@ -82,4 +83,16 @@ struct virtio_balloon_stat {
>  	__virtio64 val;
>  } __attribute__((packed));
>  
> +struct virtio_balloon_page_chunk_hdr {
> +	/* Number of chunks in the payload */
> +	__le32 chunks;

You want to make this __le64 to align everything to 64 bit.

> +};
> +
> +#define VIRTIO_BALLOON_CHUNK_BASE_SHIFT 12
> +#define VIRTIO_BALLOON_CHUNK_SIZE_SHIFT 12
> +struct virtio_balloon_page_chunk {

So rename this to virtio_balloon_page_chunk_entry.

> +	__le64 base;
> +	__le64 size;
> +};
> +

And then:

struct virtio_balloon_page_chunk {
	struct virtio_balloon_page_chunk_hdr hdr;
	struct virtio_balloon_page_chunk_entry entries[];
};
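
A user-space sketch of that layout, assuming the header count is widened to
64 bits as suggested above. The ex_ types use uint64_t where the uapi header
would use __le64, and ex_page_chunk_bytes() is a hypothetical helper. It
shows how the buffer length for n entries follows from the flexible array
member.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the uapi types; the real header would use __le64. */
struct ex_chunk_hdr {
	uint64_t chunks;	/* number of entries that follow */
};

struct ex_chunk_entry {
	uint64_t base;
	uint64_t size;
};

struct ex_page_chunk {
	struct ex_chunk_hdr hdr;
	struct ex_chunk_entry entries[];	/* flexible array member */
};

/* Bytes to hand to the virtqueue for a message carrying n entries. */
static size_t ex_page_chunk_bytes(size_t n)
{
	assert(n <= (SIZE_MAX - offsetof(struct ex_page_chunk, entries)) /
		    sizeof(struct ex_chunk_entry));
	return offsetof(struct ex_page_chunk, entries) +
	       n * sizeof(struct ex_chunk_entry);
}
```

With an 8-byte header the entries start on a 64-bit boundary, and a single
pointer to the outer struct replaces the separate header and entry pointers.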



>  #endif /* _LINUX_VIRTIO_BALLOON_H */
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
  2017-04-13  9:35   ` Wei Wang
  (?)
  (?)
@ 2017-04-13 16:34   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 136+ messages in thread
From: Michael S. Tsirkin @ 2017-04-13 16:34 UTC (permalink / raw)
  To: Wei Wang
  Cc: aarcange, virtio-dev, kvm, qemu-devel, amit.shah,
	liliang.opensource, dave.hansen, linux-kernel, virtualization,
	linux-mm, cornelia.huck, pbonzini, akpm, mgorman

On Thu, Apr 13, 2017 at 05:35:05PM +0800, Wei Wang wrote:
> Add a new feature, VIRTIO_BALLOON_F_BALLOON_CHUNKS, which enables

Let's find a better name here.
VIRTIO_BALLOON_F_PAGE_CHUNK


> the transfer of the ballooned (i.e. inflated/deflated) pages in
> chunks to the host.
> 
> The implementation of the previous virtio-balloon is not very
> efficient, because the ballooned pages are transferred to the
> host one by one. Here is the breakdown of the time in percentage
> spent on each step of the balloon inflating process (inflating
> 7GB of an 8GB idle guest).
> 
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
> 
> It takes about 4126ms for the inflating process to complete.
> The above profiling shows that the bottlenecks are stage 2)
> and stage 4).
> 
> This patch optimizes step 2) by transferring pages to the host in
> chunks. A chunk consists of guest physically continuous pages, and
> it is offered to the host via a base PFN (i.e. the start PFN of
> those physically continuous pages) and the size (i.e. the total
> number of the pages). A chunk is formated as below:

formatted

> --------------------------------------------------------
> |                 Base (52 bit)        | Rsvd (12 bit) |
> --------------------------------------------------------
> --------------------------------------------------------
> |                 Size (52 bit)        | Rsvd (12 bit) |
> --------------------------------------------------------
> 
> By doing so, step 4) can also be optimized by doing address
> translation and madvise() in chunks rather than page by page.
> 
> With this new feature, the above ballooning process takes ~590ms
> resulting in an improvement of ~85%.
> 
> TODO: optimize stage 1) by allocating/freeing a chunk of pages
> instead of a single page each time.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>

So we don't need the bitmap to talk to host, it is just
a data structure we chose to maintain lists of pages, right?
OK as far as it goes but you need much better isolation for it.
Build a data structure with APIs such as _init, _cleanup, _add, _clear,
_find_first, _find_next.
Completely unrelated to pages, it just maintains bits.
Then use it here.


> ---
>  drivers/virtio/virtio_balloon.c     | 384 +++++++++++++++++++++++++++++++++---
>  include/uapi/linux/virtio_balloon.h |  13 ++
>  2 files changed, 374 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index f59cb4f..5e2e7cc 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -42,6 +42,10 @@
>  #define OOM_VBALLOON_DEFAULT_PAGES 256
>  #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
>  
> +#define PAGE_BMAP_SIZE		(8 * PAGE_SIZE)
> +#define PFNS_PER_PAGE_BMAP	(PAGE_BMAP_SIZE * BITS_PER_BYTE)
> +#define PAGE_BMAP_COUNT_MAX	32
> +

Please prefix with VIRTIO_BALLOON_ and add comments.

>  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
>  module_param(oom_pages, int, S_IRUSR | S_IWUSR);
>  MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> @@ -50,6 +54,10 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>  static struct vfsmount *balloon_mnt;
>  #endif
>  
> +/* Types of pages to chunk */
> +#define PAGE_CHUNK_TYPE_BALLOON 0
> +

Doesn't look like you are ever adding more types in this
patchset.  Pls keep code simple, generalize it later.

> +#define MAX_PAGE_CHUNKS 4096

This is an order-4 allocation. I'd make it 4095 and then it's
an order-3 one.

>  struct virtio_balloon {
>  	struct virtio_device *vdev;
>  	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> @@ -78,6 +86,32 @@ struct virtio_balloon {
>  	/* Synchronize access/update to this struct virtio_balloon elements */
>  	struct mutex balloon_lock;
>  
> +	/*
> +	 * Buffer for PAGE_CHUNK_TYPE_BALLOON:
> +	 * virtio_balloon_page_chunk_hdr +
> +	 * virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
> +	 */
> +	struct virtio_balloon_page_chunk_hdr *balloon_page_chunk_hdr;
> +	struct virtio_balloon_page_chunk *balloon_page_chunk;
> +
> +	/* Bitmap used to record pages */
> +	unsigned long *page_bmap[PAGE_BMAP_COUNT_MAX];
> +	/* Number of the allocated page_bmap */
> +	unsigned int page_bmaps;
> +
> +	/*
> +	 * The allocated page_bmap size may be smaller than the pfn range of
> +	 * the ballooned pages. In this case, we need to use the page_bmap
> +	 * multiple times to cover the entire pfn range. It's like using a
> +	 * short ruler several times to finish measuring a long object.
> +	 * The start location of the ruler in the next measurement is the end
> +	 * location of the ruler in the previous measurement.
> +	 *
> +	 * pfn_max & pfn_min: forms the pfn range of the ballooned pages
> +	 * pfn_start & pfn_stop: records the start and stop pfn in each cover

cover? what does this mean?

looks like you only use these to pass data to tell_host.
so pass these as parameters and you won't need to keep
them in this structure.

And then you can move this comment to set_page_bmap where
it belongs.

> +	 */
> +	unsigned long pfn_min, pfn_max, pfn_start, pfn_stop;
> +
>  	/* The array of pfns we tell the Host about. */
>  	unsigned int num_pfns;
>  	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> @@ -110,20 +144,201 @@ static void balloon_ack(struct virtqueue *vq)
>  	wake_up(&vb->acked);
>  }
>  
> -static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> +static inline void init_page_bmap_range(struct virtio_balloon *vb)
> +{
> +	vb->pfn_min = ULONG_MAX;
> +	vb->pfn_max = 0;
> +}
> +
> +static inline void update_page_bmap_range(struct virtio_balloon *vb,
> +					  struct page *page)
> +{
> +	unsigned long balloon_pfn = page_to_balloon_pfn(page);
> +
> +	vb->pfn_min = min(balloon_pfn, vb->pfn_min);
> +	vb->pfn_max = max(balloon_pfn, vb->pfn_max);
> +}
> +
> +/* The page_bmap size is extended by adding more number of page_bmap */

did you mean

	Allocate more bitmaps to cover the given number of pfns
	and add them to page_bmap

?

This isn't what this function does.
It blindly assumes 1 bitmap is allocated
and allocates more, up to PAGE_BMAP_COUNT_MAX.

> +static void extend_page_bmap_size(struct virtio_balloon *vb,
> +				  unsigned long pfns)
> +{
> +	int i, bmaps;
> +	unsigned long bmap_len;
> +
> +	bmap_len = ALIGN(pfns, BITS_PER_LONG) / BITS_PER_BYTE;
> +	bmap_len = ALIGN(bmap_len, PAGE_BMAP_SIZE);

Align? PAGE_BMAP_SIZE doesn't even have to be a power of 2 ...

> +	bmaps = min((int)(bmap_len / PAGE_BMAP_SIZE),
> +		    PAGE_BMAP_COUNT_MAX);

I got lost here.

Please use things like ARRAY_SIZE instead of macros.

> +
> +	for (i = 1; i < bmaps; i++) {
> +		vb->page_bmap[i] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
> +		if (vb->page_bmap[i])
> +			vb->page_bmaps++;
> +		else
> +			break;
> +	}
> +}
> +
> +static void free_extended_page_bmap(struct virtio_balloon *vb)
> +{
> +	int i, bmaps = vb->page_bmaps;
> +
> +	for (i = 1; i < bmaps; i++) {
> +		kfree(vb->page_bmap[i]);
> +		vb->page_bmap[i] = NULL;
> +		vb->page_bmaps--;
> +	}
> +}
> +

What's the magic number 1 here?
Maybe you want to document what is going on.
Here's a guess:

We keep a single bmap around at all times.
If memory does not fit there, we allocate up to
PAGE_BMAP_COUNT_MAX of chunks.


> +static void free_page_bmap(struct virtio_balloon *vb)
> +{
> +	int i;
> +
> +	for (i = 0; i < vb->page_bmaps; i++)
> +		kfree(vb->page_bmap[i]);
> +}
> +
> +static void clear_page_bmap(struct virtio_balloon *vb)
> +{
> +	int i;
> +
> +	for (i = 0; i < vb->page_bmaps; i++)
> +		memset(vb->page_bmap[i], 0, PAGE_BMAP_SIZE);
> +}
> +
> +static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq,
> +			     int type, bool busy_wait)

busy_wait seems unused. pls drop.

>  {
>  	struct scatterlist sg;
> +	struct virtio_balloon_page_chunk_hdr *hdr;
> +	void *buf;
>  	unsigned int len;
>  
> -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> +	switch (type) {
> +	case PAGE_CHUNK_TYPE_BALLOON:
> +		hdr = vb->balloon_page_chunk_hdr;
> +		len = 0;
> +		break;
> +	default:
> +		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
> +			 __func__, type);
> +		return;
> +	}
>  
> -	/* We should always be able to add one buffer to an empty queue. */
> -	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> -	virtqueue_kick(vq);
> +	buf = (void *)hdr - len;

Moving back to before the header? How can this make sense?
It works fine since len is 0, so just buf = hdr.

> +	len += sizeof(struct virtio_balloon_page_chunk_hdr);
> +	len += hdr->chunks * sizeof(struct virtio_balloon_page_chunk);
> +	sg_init_table(&sg, 1);
> +	sg_set_buf(&sg, buf, len);
> +	if (!virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL)) {
> +		virtqueue_kick(vq);
> +		if (busy_wait)
> +			while (!virtqueue_get_buf(vq, &len) &&
> +			       !virtqueue_is_broken(vq))
> +				cpu_relax();
> +		else
> +			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		hdr->chunks = 0;

Why zero it here after device used it? Better to zero before use.

> +	}
> +}
> +
> +static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
> +			  int type, u64 base, u64 size)

what are the units here? Looks like it's in 4kbyte units?

> +{
> +	struct virtio_balloon_page_chunk_hdr *hdr;
> +	struct virtio_balloon_page_chunk *chunk;
> +
> +	switch (type) {
> +	case PAGE_CHUNK_TYPE_BALLOON:
> +		hdr = vb->balloon_page_chunk_hdr;
> +		chunk = vb->balloon_page_chunk;
> +		break;
> +	default:
> +		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
> +			 __func__, type);
> +		return;
> +	}
> +	chunk = chunk + hdr->chunks;
> +	chunk->base = cpu_to_le64(base << VIRTIO_BALLOON_CHUNK_BASE_SHIFT);
> +	chunk->size = cpu_to_le64(size << VIRTIO_BALLOON_CHUNK_SIZE_SHIFT);
> +	hdr->chunks++;

Isn't this LE? You should keep it somewhere else.

> +	if (hdr->chunks == MAX_PAGE_CHUNKS)
> +		send_page_chunks(vb, vq, type, false);
		and zero chunks here?
> +}
> +
> +static void chunking_pages_from_bmap(struct virtio_balloon *vb,

Does this mean "convert_bmap_to_chunks"?

> +				     struct virtqueue *vq,
> +				     unsigned long pfn_start,
> +				     unsigned long *bmap,
> +				     unsigned long len)
> +{
> +	unsigned long pos = 0, end = len * BITS_PER_BYTE;
> +
> +	while (pos < end) {
> +		unsigned long one = find_next_bit(bmap, end, pos);
> +
> +		if (one < end) {
> +			unsigned long chunk_size, zero;
> +
> +			zero = find_next_zero_bit(bmap, end, one + 1);


zero and one are unhelpful names unless they equal 0 and 1.
current/next?


> +			if (zero >= end)
> +				chunk_size = end - one;
> +			else
> +				chunk_size = zero - one;
> +
> +			if (chunk_size)
> +				add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
> +					      pfn_start + one, chunk_size);

Still not so what does a bit refer to? page or 4kbytes?
I think it should be a page.

> +			pos = one + chunk_size;
> +		} else
> +			break;
> +	}
> +}
> +



> +static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> +{
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS)) {
> +		int pfns, page_bmaps, i;
> +		unsigned long pfn_start, pfns_len;
> +
> +		pfn_start = vb->pfn_start;
> +		pfns = vb->pfn_stop - pfn_start + 1;
> +		pfns = roundup(roundup(pfns, BITS_PER_LONG),
> +			       PFNS_PER_PAGE_BMAP);
> +		page_bmaps = pfns / PFNS_PER_PAGE_BMAP;
> +		pfns_len = pfns / BITS_PER_BYTE;
> +
> +		for (i = 0; i < page_bmaps; i++) {
> +			unsigned int bmap_len = PAGE_BMAP_SIZE;
> +
> +			/* The last one takes the leftover only */

I don't understand what does this mean.

> +			if (i + 1 == page_bmaps)
> +				bmap_len = pfns_len - PAGE_BMAP_SIZE * i;
> +
> +			chunking_pages_from_bmap(vb, vq, pfn_start +
> +						 i * PFNS_PER_PAGE_BMAP,
> +						 vb->page_bmap[i], bmap_len);
> +		}
> +		if (vb->balloon_page_chunk_hdr->chunks > 0)
> +			send_page_chunks(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
> +					 false);
> +	} else {
> +		struct scatterlist sg;
> +		unsigned int len;
>  
> -	/* When host has read buffer, this completes via balloon_ack */
> -	wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +		sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
>  
> +		/*
> +		 * We should always be able to add one buffer to an empty
> +		 * queue.
> +		 */
> +		virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> +		virtqueue_kick(vq);
> +
> +		/* When host has read buffer, this completes via balloon_ack */
> +		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> +	}
>  }
>  
>  static void set_page_pfns(struct virtio_balloon *vb,
> @@ -131,20 +346,73 @@ static void set_page_pfns(struct virtio_balloon *vb,
>  {
>  	unsigned int i;
>  
> -	/* Set balloon pfns pointing at this page.
> -	 * Note that the first pfn points at start of the page. */
> +	/*
> +	 * Set balloon pfns pointing at this page.
> +	 * Note that the first pfn points at start of the page.
> +	 */
>  	for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++)
>  		pfns[i] = cpu_to_virtio32(vb->vdev,
>  					  page_to_balloon_pfn(page) + i);
>  }
>

Nice cleanup but pls split this out. This patch is big enough as it is.
  
> +static void set_page_bmap(struct virtio_balloon *vb,
> +			  struct list_head *pages, struct virtqueue *vq)
> +{
> +	unsigned long pfn_start, pfn_stop;
> +	struct page *page;
> +	bool found;
> +
> +	vb->pfn_min = rounddown(vb->pfn_min, BITS_PER_LONG);
> +	vb->pfn_max = roundup(vb->pfn_max, BITS_PER_LONG);
> +
> +	extend_page_bmap_size(vb, vb->pfn_max - vb->pfn_min + 1);

This might not do anything in particular might not cover the
given pfn range. Do we care? Why not?

> +	pfn_start = vb->pfn_min;
> +
> +	while (pfn_start < vb->pfn_max) {
> +		pfn_stop = pfn_start + PFNS_PER_PAGE_BMAP * vb->page_bmaps;
> +		pfn_stop = pfn_stop < vb->pfn_max ? pfn_stop : vb->pfn_max;
> +
> +		vb->pfn_start = pfn_start;
> +		clear_page_bmap(vb);
> +		found = false;
> +
> +		list_for_each_entry(page, pages, lru) {
> +			unsigned long bmap_idx, bmap_pos, balloon_pfn;
> +
> +			balloon_pfn = page_to_balloon_pfn(page);
> +			if (balloon_pfn < pfn_start || balloon_pfn > pfn_stop)
> +				continue;
> +			bmap_idx = (balloon_pfn - pfn_start) /
> +				   PFNS_PER_PAGE_BMAP;
> +			bmap_pos = (balloon_pfn - pfn_start) %
> +				   PFNS_PER_PAGE_BMAP;
> +			set_bit(bmap_pos, vb->page_bmap[bmap_idx]);

Looks like this will crash if bmap_idx is out of range or
if page_bmap allocation failed.

> +
> +			found = true;
> +		}
> +		if (found) {
> +			vb->pfn_stop = pfn_stop;
> +			tell_host(vb, vq);
> +		}
> +		pfn_start = pfn_stop;
> +	}
> +	free_extended_page_bmap(vb);
> +}
> +
>  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  {
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	unsigned num_allocated_pages;
> +	bool chunking = virtio_has_feature(vb->vdev,
> +					   VIRTIO_BALLOON_F_BALLOON_CHUNKS);
>  
>  	/* We can only do one array worth at a time. */
> -	num = min(num, ARRAY_SIZE(vb->pfns));
> +	if (chunking) {
> +		init_page_bmap_range(vb);
> +	} else {
> +		/* We can only do one array worth at a time. */
> +		num = min(num, ARRAY_SIZE(vb->pfns));
> +	}
>  
>  	mutex_lock(&vb->balloon_lock);
>  	for (vb->num_pfns = 0; vb->num_pfns < num;
> @@ -159,7 +427,10 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  			msleep(200);
>  			break;
>  		}
> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (chunking)
> +			update_page_bmap_range(vb, page);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>  		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
>  		if (!virtio_has_feature(vb->vdev,
>  					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> @@ -168,8 +439,13 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>  
>  	num_allocated_pages = vb->num_pfns;
>  	/* Did we get any? */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->inflate_vq);
> +	if (vb->num_pfns != 0) {
> +		if (chunking)
> +			set_page_bmap(vb, &vb_dev_info->pages,
> +					vb->inflate_vq);
> +		else
> +			tell_host(vb, vb->inflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	return num_allocated_pages;
> @@ -195,6 +471,13 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	struct page *page;
>  	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
>  	LIST_HEAD(pages);
> +	bool chunking = virtio_has_feature(vb->vdev,
> +					   VIRTIO_BALLOON_F_BALLOON_CHUNKS);
> +	if (chunking)
> +		init_page_bmap_range(vb);
> +	else
> +		/* We can only do one array worth at a time. */
> +		num = min(num, ARRAY_SIZE(vb->pfns));
>  
>  	/* We can only do one array worth at a time. */
>  	num = min(num, ARRAY_SIZE(vb->pfns));
> @@ -208,6 +491,10 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  		if (!page)
>  			break;
>  		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> +		if (chunking)
> +			update_page_bmap_range(vb, page);
> +		else
> +			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>  		list_add(&page->lru, &pages);
>  		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>  	}
> @@ -218,8 +505,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>  	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
>  	 * is true, we *have* to do it in this order
>  	 */
> -	if (vb->num_pfns != 0)
> -		tell_host(vb, vb->deflate_vq);
> +	if (vb->num_pfns != 0) {
> +		if (chunking)
> +			set_page_bmap(vb, &pages, vb->deflate_vq);
> +		else
> +			tell_host(vb, vb->deflate_vq);
> +	}
>  	release_pages_balloon(vb, &pages);
>  	mutex_unlock(&vb->balloon_lock);
>  	return num_freed_pages;
> @@ -431,6 +722,13 @@ static int init_vqs(struct virtio_balloon *vb)
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> +
> +static void tell_host_one_page(struct virtio_balloon *vb,
> +			       struct virtqueue *vq, struct page *page)
> +{
> +	add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON, page_to_pfn(page), 1);

This passes 4kbytes to host which seems wrong - I think you want a full page.

> +}
> +
>  /*
>   * virtballoon_migratepage - perform the balloon page migration on behalf of
>   *			     a compation thread.     (called under page lock)
> @@ -454,6 +752,8 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  {
>  	struct virtio_balloon *vb = container_of(vb_dev_info,
>  			struct virtio_balloon, vb_dev_info);
> +	bool chunking = virtio_has_feature(vb->vdev,
> +					   VIRTIO_BALLOON_F_BALLOON_CHUNKS);
>  	unsigned long flags;
>  
>  	/*
> @@ -475,16 +775,22 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>  	vb_dev_info->isolated_pages--;
>  	__count_vm_event(BALLOON_MIGRATE);
>  	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, newpage);
> -	tell_host(vb, vb->inflate_vq);
> -
> +	if (chunking) {
> +		tell_host_one_page(vb, vb->inflate_vq, newpage);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, newpage);
> +		tell_host(vb, vb->inflate_vq);
> +	}
>  	/* balloon's page migration 2nd step -- deflate "page" */
>  	balloon_page_delete(page);
> -	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> -	set_page_pfns(vb, vb->pfns, page);
> -	tell_host(vb, vb->deflate_vq);
> -
> +	if (chunking) {
> +		tell_host_one_page(vb, vb->deflate_vq, page);
> +	} else {
> +		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
> +		set_page_pfns(vb, vb->pfns, page);
> +		tell_host(vb, vb->deflate_vq);
> +	}
>  	mutex_unlock(&vb->balloon_lock);
>  
>  	put_page(page); /* balloon reference */
> @@ -511,6 +817,32 @@ static struct file_system_type balloon_fs = {
>  
>  #endif /* CONFIG_BALLOON_COMPACTION */
>  
> +static void balloon_page_chunk_init(struct virtio_balloon *vb)
> +{
> +	void *buf;
> +
> +	/*
> +	 * By default, we allocate page_bmap[0] only. More page_bmap will be
> +	 * allocated on demand.
> +	 */
> +	vb->page_bmap[0] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
> +	buf = kmalloc(sizeof(struct virtio_balloon_page_chunk_hdr) +
> +		      sizeof(struct virtio_balloon_page_chunk) *
> +		      MAX_PAGE_CHUNKS, GFP_KERNEL);
> +	if (!vb->page_bmap[0] || !buf) {
> +		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS);

this doesn't work as expected as features has been OK'd by then.
You want something like
validate_features that I posted. See
"virtio: allow drivers to validate features".

> +		kfree(vb->page_bmap[0]);

Looks like this will double free. you want to zero them I think.

> +		kfree(vb->balloon_page_chunk_hdr);
> +		dev_warn(&vb->vdev->dev, "%s: failed\n", __func__);
> +	} else {
> +		vb->page_bmaps = 1;
> +		vb->balloon_page_chunk_hdr = buf;
> +		vb->balloon_page_chunk_hdr->chunks = 0;
> +		vb->balloon_page_chunk = buf +
> +				sizeof(struct virtio_balloon_page_chunk_hdr);
> +	}
> +}
> +
>  static int virtballoon_probe(struct virtio_device *vdev)
>  {
>  	struct virtio_balloon *vb;
> @@ -533,6 +865,10 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	spin_lock_init(&vb->stop_update_lock);
>  	vb->stop_update = false;
>  	vb->num_pages = 0;
> +
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS))
> +		balloon_page_chunk_init(vb);
> +
>  	mutex_init(&vb->balloon_lock);
>  	init_waitqueue_head(&vb->acked);
>  	vb->vdev = vdev;
> @@ -609,6 +945,7 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  	cancel_work_sync(&vb->update_balloon_stats_work);
>  
>  	remove_common(vb);
> +	free_page_bmap(vb);
>  	if (vb->vb_dev_info.inode)
>  		iput(vb->vb_dev_info.inode);
>  	kfree(vb);
> @@ -649,6 +986,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_MUST_TELL_HOST,
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> +	VIRTIO_BALLOON_F_BALLOON_CHUNKS,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index 343d7dd..be317b7 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -34,6 +34,7 @@
>  #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
> +#define VIRTIO_BALLOON_F_BALLOON_CHUNKS 3 /* Inflate/Deflate pages in chunks */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> @@ -82,4 +83,16 @@ struct virtio_balloon_stat {
>  	__virtio64 val;
>  } __attribute__((packed));
>  
> +struct virtio_balloon_page_chunk_hdr {
> +	/* Number of chunks in the payload */
> +	__le32 chunks;

You want to make this __le64 to align everything to 64 bits.

> +};
> +
> +#define VIRTIO_BALLOON_CHUNK_BASE_SHIFT 12
> +#define VIRTIO_BALLOON_CHUNK_SIZE_SHIFT 12
> +struct virtio_balloon_page_chunk {

So rename this to virtio_balloon_page_chunk_entry.

> +	__le64 base;
> +	__le64 size;
> +};
> +

And then:

struct virtio_balloon_page_chunk {
	struct virtio_balloon_page_chunk_hdr hdr;
	struct virtio_balloon_page_chunk_entry entries[];
};
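
As a rough check of the layout suggested above, here is a userspace sketch
with shortened, hypothetical struct names. It uses uint64_t where the uapi
header would use __le64, and assumes the header field has been widened to
64 bits as requested. The static assertions confirm that the entries start
on a 64-bit boundary, and the helper computes the buffer size for n entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins; the kernel header would use __le64, not uint64_t. */
struct chunk_hdr {
	uint64_t chunks;	/* number of entries in the payload */
};

struct chunk_entry {
	uint64_t base;
	uint64_t size;
};

struct page_chunk {
	struct chunk_hdr hdr;
	struct chunk_entry entries[];	/* flexible array member */
};

static_assert(sizeof(struct chunk_hdr) == 8, "header is one 64-bit word");
static_assert(sizeof(struct chunk_entry) == 16, "entry is two 64-bit words");
static_assert(offsetof(struct page_chunk, entries) == 8,
	      "entries start on a 64-bit boundary");

/* Bytes needed for a buffer that holds n entries after the header. */
static size_t page_chunk_bytes(size_t n)
{
	return sizeof(struct page_chunk) + n * sizeof(struct chunk_entry);
}
```

With a single struct like this, one allocation and one pointer replace the
separate hdr and chunk pointers that the patch derives by hand from buf.
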



>  #endif /* _LINUX_VIRTIO_BALLOON_H */
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
  2017-04-13 16:34     ` Michael S. Tsirkin
  (?)
@ 2017-04-13 17:03       ` Matthew Wilcox
  -1 siblings, 0 replies; 136+ messages in thread
From: Matthew Wilcox @ 2017-04-13 17:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Wei Wang, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Thu, Apr 13, 2017 at 07:34:19PM +0300, Michael S. Tsirkin wrote:
> So we don't need the bitmap to talk to host, it is just
> a data structure we chose to maintain lists of pages, right?
> OK as far as it goes but you need much better isolation for it.
> Build a data structure with APIs such as _init, _cleanup, _add, _clear,
> _find_first, _find_next.
> Completely unrelated to pages, it just maintains bits.
> Then use it here.

That sounds an awful lot like the xbitmap I wrote a few months ago ...

http://git.infradead.org/users/willy/linux-dax.git/commit/727e401bee5ad7d37e0077291d90cc17475c6392
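
For readers who want to see the shape of the API being asked for, here is
a minimal userspace sketch. It is a plain flat bitmap, not the xbitmap
linked above, and all names are hypothetical. It knows nothing about pages;
it only maintains bits, with _init, _cleanup, _add, _clear, _find_first and
_find_next operations.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

struct bits {
	unsigned long *words;
	size_t nbits;
};

/* Returns 0 on success, -1 if the allocation fails. */
static int bits_init(struct bits *b, size_t nbits)
{
	size_t nwords = (nbits + BITS_PER_WORD - 1) / BITS_PER_WORD;

	b->words = calloc(nwords ? nwords : 1, sizeof(*b->words));
	b->nbits = b->words ? nbits : 0;
	return b->words ? 0 : -1;
}

static void bits_cleanup(struct bits *b)
{
	free(b->words);
	b->words = NULL;
	b->nbits = 0;
}

static void bits_add(struct bits *b, size_t i)
{
	if (i < b->nbits)
		b->words[i / BITS_PER_WORD] |= 1UL << (i % BITS_PER_WORD);
}

static void bits_clear(struct bits *b, size_t i)
{
	if (i < b->nbits)
		b->words[i / BITS_PER_WORD] &= ~(1UL << (i % BITS_PER_WORD));
}

/* Returns the first set bit at index >= start, or nbits if there is none. */
static size_t bits_find_next(const struct bits *b, size_t start)
{
	for (size_t i = start; i < b->nbits; i++)
		if (b->words[i / BITS_PER_WORD] & (1UL << (i % BITS_PER_WORD)))
			return i;
	return b->nbits;
}

static size_t bits_find_first(const struct bits *b)
{
	return bits_find_next(b, 0);
}
```

The balloon code would then map PFNs to bit indices on its side of the
interface, which is the isolation the quoted review asks for.
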

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v9 5/5] virtio-balloon: VIRTIO_BALLOON_F_MISC_VQ
  2017-04-13  9:35   ` Wei Wang
  (?)
@ 2017-04-13 17:08     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 136+ messages in thread
From: Michael S. Tsirkin @ 2017-04-13 17:08 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Thu, Apr 13, 2017 at 05:35:08PM +0800, Wei Wang wrote:
> Add a new vq, miscq, to handle miscellaneous requests between the device
> and the driver.
> 
> This patch implemnts the VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES

implements

> request sent from the device.

Commands are sent from the host and handled in the guest.
In fact, how is this so different from stats?
How about reusing the stats vq, then? You can use one buffer
for stats and one buffer for commands.

> Upon receiving this request from the
> miscq, the driver offers to the device the guest unused pages.
> 
> Tests have shown that skipping the transfer of unused pages of a 32G
> guest can get the live migration time reduced to 1/8.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 209 +++++++++++++++++++++++++++++++++---
>  include/uapi/linux/virtio_balloon.h |   8 ++
>  2 files changed, 204 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 5e2e7cc..95c703e 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -56,11 +56,12 @@ static struct vfsmount *balloon_mnt;
>  
>  /* Types of pages to chunk */
>  #define PAGE_CHUNK_TYPE_BALLOON 0
> +#define PAGE_CHUNK_TYPE_UNUSED 1
>  
>  #define MAX_PAGE_CHUNKS 4096
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *miscq;
>  
>  	/* The balloon servicing is delegated to a freezable workqueue. */
>  	struct work_struct update_balloon_stats_work;
> @@ -94,6 +95,19 @@ struct virtio_balloon {
>  	struct virtio_balloon_page_chunk_hdr *balloon_page_chunk_hdr;
>  	struct virtio_balloon_page_chunk *balloon_page_chunk;
>  
> +	/*
> +	 * Buffer for PAGE_CHUNK_TYPE_UNUSED:
> +	 * virtio_balloon_miscq_hdr +
> +	 * virtio_balloon_page_chunk_hdr +
> +	 * virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
> +	 */
> +	struct virtio_balloon_miscq_hdr *miscq_out_hdr;
> +	struct virtio_balloon_page_chunk_hdr *unused_page_chunk_hdr;
> +	struct virtio_balloon_page_chunk *unused_page_chunk;
> +
> +	/* Buffer for host to send cmd to miscq */
> +	struct virtio_balloon_miscq_hdr *miscq_in_hdr;
> +
>  	/* Bitmap used to record pages */
>  	unsigned long *page_bmap[PAGE_BMAP_COUNT_MAX];
>  	/* Number of the allocated page_bmap */
> @@ -220,6 +234,10 @@ static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq,
>  		hdr = vb->balloon_page_chunk_hdr;
>  		len = 0;
>  		break;
> +	case PAGE_CHUNK_TYPE_UNUSED:
> +		hdr = vb->unused_page_chunk_hdr;
> +		len = sizeof(struct virtio_balloon_miscq_hdr);
> +		break;
>  	default:
>  		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
>  			 __func__, type);
> @@ -254,6 +272,10 @@ static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
>  		hdr = vb->balloon_page_chunk_hdr;
>  		chunk = vb->balloon_page_chunk;
>  		break;
> +	case PAGE_CHUNK_TYPE_UNUSED:
> +		hdr = vb->unused_page_chunk_hdr;
> +		chunk = vb->unused_page_chunk;
> +		break;
>  	default:
>  		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
>  			 __func__, type);
> @@ -686,28 +708,139 @@ static void update_balloon_size_func(struct work_struct *work)
>  		queue_work(system_freezable_wq, work);
>  }
>  
> +static void miscq_in_hdr_add(struct virtio_balloon *vb)
> +{
> +	struct scatterlist sg_in;
> +
> +	sg_init_one(&sg_in, vb->miscq_in_hdr,
> +		    sizeof(struct virtio_balloon_miscq_hdr));
> +	if (virtqueue_add_inbuf(vb->miscq, &sg_in, 1, vb->miscq_in_hdr,
> +	    GFP_KERNEL) < 0) {
> +		__virtio_clear_bit(vb->vdev,
> +				   VIRTIO_BALLOON_F_MISC_VQ);
> +		dev_warn(&vb->vdev->dev, "%s: add miscq_in_hdr err\n",
> +			 __func__);
> +		return;
> +	}
> +	virtqueue_kick(vb->miscq);
> +}
> +
> +static void miscq_send_unused_pages(struct virtio_balloon *vb)
> +{
> +	struct virtio_balloon_miscq_hdr *miscq_out_hdr = vb->miscq_out_hdr;
> +	struct virtqueue *vq = vb->miscq;
> +	int ret = 0;
> +	unsigned int order = 0, migratetype = 0;
> +	struct zone *zone = NULL;
> +	struct page *page = NULL;
> +	u64 pfn;
> +
> +	miscq_out_hdr->cmd =  VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES;

This gets the endianness and the whitespace wrong. Please use static
checkers to catch this type of error.
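
A small userspace sketch of the endianness point, with hypothetical names.
In the kernel the fix would be to assign through cpu_to_le16() and read back
through le16_to_cpu(); the helpers below only imitate that behaviour so the
idea can be tried outside the kernel. Wrapping the two bytes in a struct
means a plain integer cannot be assigned to the field by accident, which is
roughly the class of mistake that sparse reports for __le16 fields.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's __le16: two bytes, least significant first. */
typedef struct {
	uint8_t b[2];
} le16;

/* Imitates cpu_to_le16(): the result is the same on any host byte order. */
static le16 cpu_to_le16_sketch(uint16_t v)
{
	le16 r = { { (uint8_t)(v & 0xff), (uint8_t)(v >> 8) } };

	return r;
}

/* Imitates le16_to_cpu(). */
static uint16_t le16_to_cpu_sketch(le16 v)
{
	return (uint16_t)(v.b[0] | (v.b[1] << 8));
}

/* Mirrors the two fields of the header in the patch. */
struct miscq_hdr_sketch {
	le16 cmd;
	le16 flags;
};
```

The same conversion applies on the receive side, where the command value is
read from a buffer filled in by the host.
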

> +	miscq_out_hdr->flags = 0;
> +
> +	for_each_populated_zone(zone) {
> +		for (order = MAX_ORDER - 1; order > 0; order--) {
> +			for (migratetype = 0; migratetype < MIGRATE_TYPES;
> +			     migratetype++) {
> +				do {
> +					ret = inquire_unused_page_block(zone,
> +						order, migratetype, &page);
> +					if (!ret) {
> +						pfn = (u64)page_to_pfn(page);
> +						add_one_chunk(vb, vq,
> +							PAGE_CHUNK_TYPE_UNUSED,
> +							pfn,
> +							(u64)(1 << order));
> +					}
> +				} while (!ret);
> +			}
> +		}
> +	}
> +	miscq_out_hdr->flags |= VIRTIO_BALLOON_MISCQ_F_COMPLETE;

And where is miscq_out_hdr used? I see no add_outbuf anywhere.

Things like this should be passed through function parameters
and not stuffed into the device structure; fields should be
initialized before use, not wherever we happen to
have the data handy.



Also, _F_ is normally a bit number; you use it as a value here.
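
To spell out the bit-number convention, here is a short sketch with a
hypothetical macro name. The _F_ constant holds the position of the bit,
as the VIRTIO_BALLOON_F_* feature definitions do, and the mask is derived
with a shift wherever the flag is set or tested.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name: a bit number, not a mask. */
#define MISCQ_F_COMPLETE_SKETCH 0

/* Converts a bit number into the corresponding 16-bit mask. */
static uint16_t flag_mask(unsigned int bit)
{
	return (uint16_t)(1u << bit);
}

static uint16_t set_flag(uint16_t flags, unsigned int bit)
{
	return (uint16_t)(flags | flag_mask(bit));
}

static int test_flag(uint16_t flags, unsigned int bit)
{
	return (flags & flag_mask(bit)) != 0;
}
```

With bit number 0 the resulting mask happens to be 0x1, the value in the
patch, but the two conventions diverge from the second flag onwards.
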


> +	send_page_chunks(vb, vq, PAGE_CHUNK_TYPE_UNUSED, true);
> +}
> +
> +static void miscq_handle(struct virtqueue *vq)
> +{
> +	struct virtio_balloon *vb = vq->vdev->priv;
> +	struct virtio_balloon_miscq_hdr *hdr;
> +	unsigned int len;
> +
> +	hdr = virtqueue_get_buf(vb->miscq, &len);
> +	if (!hdr || len != sizeof(struct virtio_balloon_miscq_hdr)) {
> +		dev_warn(&vb->vdev->dev, "%s: invalid miscq hdr len\n",
> +			 __func__);
> +		miscq_in_hdr_add(vb);
> +		return;
> +	}
> +	switch (hdr->cmd) {
> +	case VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES:
> +		miscq_send_unused_pages(vb);
> +		break;
> +	default:
> +		dev_warn(&vb->vdev->dev, "%s: miscq cmd %d not supported\n",
> +			 __func__, hdr->cmd);
> +	}
> +	miscq_in_hdr_add(vb);
> +}
> +
>  static int init_vqs(struct virtio_balloon *vb)
>  {
> -	struct virtqueue *vqs[3];
> -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
> -	static const char * const names[] = { "inflate", "deflate", "stats" };
> -	int err, nvqs;
> +	struct virtqueue **vqs;
> +	vq_callback_t **callbacks;
> +	const char **names;
> +	int err = -ENOMEM;
> +	int i, nvqs;
> +
> +	 /* Inflateq and deflateq are used unconditionally */
> +	nvqs = 2;
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
> +		nvqs++;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_MISC_VQ))
> +		nvqs++;
> +
> +	/* Allocate space for find_vqs parameters */
> +	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
> +	if (!vqs)
> +		goto err_vq;
> +	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
> +	if (!callbacks)
> +		goto err_callback;
> +	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
> +	if (!names)
> +		goto err_names;
> +

There are 4 VQs at most; why are dynamic allocations called for?
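
A sketch of the fixed-size alternative, in userspace C with hypothetical
names. Only the names array is shown; the vqs and callbacks arrays would be
sized and filled the same way. The arrays live on the stack with room for
the maximum of four queues, and the count handed to find_vqs is simply the
number of slots that were filled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_VQS_SKETCH 4	/* inflate, deflate, stats, miscq */

/* Fills names[] in queue order and returns the number of queues in use. */
static int build_vq_names(bool has_stats, bool has_misc,
			  const char *names[MAX_VQS_SKETCH])
{
	int n = 0;

	/* Inflate and deflate queues are used unconditionally. */
	names[n++] = "inflate";
	names[n++] = "deflate";
	if (has_stats)
		names[n++] = "stats";
	if (has_misc)
		names[n++] = "miscq";
	return n;
}
```

This removes the three allocations and the err_* unwind labels from
init_vqs() without changing which queues are requested.
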

> +	callbacks[0] = balloon_ack;
> +	names[0] = "inflate";
> +	callbacks[1] = balloon_ack;
> +	names[1] = "deflate";
> +
> +	i = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> +		callbacks[i] = stats_request;
> +		names[i] = "stats";
> +		i++;
> +	}
>  
> -	/*
> -	 * We expect two virtqueues: inflate and deflate, and
> -	 * optionally stat.
> -	 */
> -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> -	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names);
> +	if (virtio_has_feature(vb->vdev,
> +				      VIRTIO_BALLOON_F_MISC_VQ)) {
> +		callbacks[i] = miscq_handle;
> +		names[i] = "miscq";
> +	}
> +
> +	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks,
> +					 names);
>  	if (err)
> -		return err;
> +		goto err_find;
>  
>  	vb->inflate_vq = vqs[0];
>  	vb->deflate_vq = vqs[1];
> +	i = 2;
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>  		struct scatterlist sg;
> -		vb->stats_vq = vqs[2];
>  
> +		vb->stats_vq = vqs[i++];
>  		/*
>  		 * Prime this virtqueue with one buffer so the hypervisor can
>  		 * use it to signal us later (it can't be broken yet!).
> @@ -718,7 +851,25 @@ static int init_vqs(struct virtio_balloon *vb)
>  			BUG();
>  		virtqueue_kick(vb->stats_vq);
>  	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_MISC_VQ)) {
> +		vb->miscq = vqs[i];
> +		miscq_in_hdr_add(vb);
> +	}
> +
> +	kfree(names);
> +	kfree(callbacks);
> +	kfree(vqs);
>  	return 0;
> +
> +err_find:
> +	kfree(names);
> +err_names:
> +	kfree(callbacks);
> +err_callback:
> +	kfree(vqs);
> +err_vq:
> +	return err;
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -843,6 +994,32 @@ static void balloon_page_chunk_init(struct virtio_balloon *vb)
>  	}
>  }
>  
> +static void miscq_init(struct virtio_balloon *vb)
> +{
> +	void *buf;
> +
> +	vb->miscq_in_hdr = kmalloc(sizeof(struct virtio_balloon_miscq_hdr),
> +				   GFP_KERNEL);
> +	buf = kmalloc(sizeof(struct virtio_balloon_miscq_hdr) +
> +		      sizeof(struct virtio_balloon_page_chunk_hdr) +
> +		      sizeof(struct virtio_balloon_page_chunk) *
> +		      MAX_PAGE_CHUNKS, GFP_KERNEL);

Maybe reduce MAX_PAGE_CHUNKS even further to fit in an order-3 allocation.
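
Some back-of-the-envelope arithmetic for this, as a sketch. It assumes
4 KiB pages and the struct sizes from the patch as posted: a 4-byte miscq
header, a 4-byte chunk header and 16-byte chunks. The helper names are
hypothetical. With MAX_PAGE_CHUNKS at 4096 the buffer is 65544 bytes, just
over 64 KiB, so a power-of-two page allocation needs order 5.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_SKETCH 4096u	/* assumption: 4 KiB pages */

/* Smallest order such that (PAGE_SIZE << order) >= size. */
static unsigned int order_for_size(size_t size)
{
	unsigned int order = 0;
	size_t span = PAGE_SIZE_SKETCH;

	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}

/* Largest n such that hdr_bytes + n * entry_bytes fits in the given order. */
static size_t max_entries_for_order(unsigned int order, size_t hdr_bytes,
				    size_t entry_bytes)
{
	size_t span = (size_t)PAGE_SIZE_SKETCH << order;

	return (span - hdr_bytes) / entry_bytes;
}
```

Under these assumptions an order-3 allocation (32 KiB) has room for 2047
chunks after 8 bytes of headers.
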


> +	if (!vb->miscq_in_hdr || !buf) {
> +		kfree(buf);
> +		kfree(vb->miscq_in_hdr);
> +		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_MISC_VQ);

Again, this does not really work here. In this case it might be best to
just fail probe.

> +		dev_warn(&vb->vdev->dev, "%s: failed\n", __func__);
> +	} else {
> +		vb->miscq_out_hdr = buf;
> +		vb->unused_page_chunk_hdr = buf +
> +				sizeof(struct virtio_balloon_miscq_hdr);
> +		vb->unused_page_chunk_hdr->chunks = 0;
> +		vb->unused_page_chunk = buf +
> +				sizeof(struct virtio_balloon_miscq_hdr) +
> +				sizeof(struct virtio_balloon_page_chunk_hdr);
> +	}
> +}
> +
>  static int virtballoon_probe(struct virtio_device *vdev)
>  {
>  	struct virtio_balloon *vb;
> @@ -869,6 +1046,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS))
>  		balloon_page_chunk_init(vb);
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_MISC_VQ))
> +		miscq_init(vb);
> +
>  	mutex_init(&vb->balloon_lock);
>  	init_waitqueue_head(&vb->acked);
>  	vb->vdev = vdev;
> @@ -946,6 +1126,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  
>  	remove_common(vb);
>  	free_page_bmap(vb);
> +	kfree(vb->miscq_out_hdr);
> +	kfree(vb->miscq_in_hdr);
>  	if (vb->vb_dev_info.inode)
>  		iput(vb->vb_dev_info.inode);
>  	kfree(vb);
> @@ -987,6 +1169,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
>  	VIRTIO_BALLOON_F_BALLOON_CHUNKS,
> +	VIRTIO_BALLOON_F_MISC_VQ,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index be317b7..96bdc86 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -35,6 +35,7 @@
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_BALLOON_CHUNKS 3 /* Inflate/Deflate pages in chunks */
> +#define VIRTIO_BALLOON_F_MISC_VQ	4 /* Virtqueue for misc. requests */

Is "misc" the best we can do? I think these are
actually host commands - aren't they?

>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> @@ -95,4 +96,11 @@ struct virtio_balloon_page_chunk {
>  	__le64 size;
>  };
>  
> +#define VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES 0

Meaning what? Is this a command value? Is this a command
to report unused memory, then? If so, let's call it that.


> +#define VIRTIO_BALLOON_MISCQ_F_COMPLETE 0x1

Meaning what?

> +struct virtio_balloon_miscq_hdr {
> +	__le16 cmd;
> +	__le16 flags;

Add padding to make it a full 64 bits.

> +};
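
One possible padded layout, as a userspace sketch. The struct name and the
padding field name are hypothetical, and uint16_t/uint32_t stand in for
__le16/__le32. The compile-time assertions pin the size so that a later
change to the struct cannot silently break the 64-bit layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in; the uapi header would use __le16 and __le32. */
struct miscq_hdr_padded {
	uint16_t cmd;
	uint16_t flags;
	uint32_t padding;	/* hypothetical name; fills the header to 64 bits */
};

static_assert(sizeof(struct miscq_hdr_padded) == 8,
	      "header must be exactly 64 bits");
static_assert(offsetof(struct miscq_hdr_padded, padding) == 4,
	      "padding follows cmd and flags directly");
```
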
> +
>  #endif /* _LINUX_VIRTIO_BALLOON_H */
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [Qemu-devel] [PATCH v9 5/5] virtio-balloon: VIRTIO_BALLOON_F_MISC_VQ
@ 2017-04-13 17:08     ` Michael S. Tsirkin
  0 siblings, 0 replies; 136+ messages in thread
From: Michael S. Tsirkin @ 2017-04-13 17:08 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Thu, Apr 13, 2017 at 05:35:08PM +0800, Wei Wang wrote:
> Add a new vq, miscq, to handle miscellaneous requests between the device
> and the driver.
> 
> This patch implemnts the VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES

implements

> request sent from the device.

Commands are sent from the host and handled on the guest.
In fact, how is this so different from stats?
How about reusing the stats vq then? You can use one buffer
for stats and one buffer for commands.

> Upon receiving this request from the
> miscq, the driver offers to the device the guest unused pages.
> 
> Tests have shown that skipping the transfer of unused pages of a 32G
> guest can get the live migration time reduced to 1/8.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 209 +++++++++++++++++++++++++++++++++---
>  include/uapi/linux/virtio_balloon.h |   8 ++
>  2 files changed, 204 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 5e2e7cc..95c703e 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -56,11 +56,12 @@ static struct vfsmount *balloon_mnt;
>  
>  /* Types of pages to chunk */
>  #define PAGE_CHUNK_TYPE_BALLOON 0
> +#define PAGE_CHUNK_TYPE_UNUSED 1
>  
>  #define MAX_PAGE_CHUNKS 4096
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *miscq;
>  
>  	/* The balloon servicing is delegated to a freezable workqueue. */
>  	struct work_struct update_balloon_stats_work;
> @@ -94,6 +95,19 @@ struct virtio_balloon {
>  	struct virtio_balloon_page_chunk_hdr *balloon_page_chunk_hdr;
>  	struct virtio_balloon_page_chunk *balloon_page_chunk;
>  
> +	/*
> +	 * Buffer for PAGE_CHUNK_TYPE_UNUSED:
> +	 * virtio_balloon_miscq_hdr +
> +	 * virtio_balloon_page_chunk_hdr +
> +	 * virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
> +	 */
> +	struct virtio_balloon_miscq_hdr *miscq_out_hdr;
> +	struct virtio_balloon_page_chunk_hdr *unused_page_chunk_hdr;
> +	struct virtio_balloon_page_chunk *unused_page_chunk;
> +
> +	/* Buffer for host to send cmd to miscq */
> +	struct virtio_balloon_miscq_hdr *miscq_in_hdr;
> +
>  	/* Bitmap used to record pages */
>  	unsigned long *page_bmap[PAGE_BMAP_COUNT_MAX];
>  	/* Number of the allocated page_bmap */
> @@ -220,6 +234,10 @@ static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq,
>  		hdr = vb->balloon_page_chunk_hdr;
>  		len = 0;
>  		break;
> +	case PAGE_CHUNK_TYPE_UNUSED:
> +		hdr = vb->unused_page_chunk_hdr;
> +		len = sizeof(struct virtio_balloon_miscq_hdr);
> +		break;
>  	default:
>  		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
>  			 __func__, type);
> @@ -254,6 +272,10 @@ static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
>  		hdr = vb->balloon_page_chunk_hdr;
>  		chunk = vb->balloon_page_chunk;
>  		break;
> +	case PAGE_CHUNK_TYPE_UNUSED:
> +		hdr = vb->unused_page_chunk_hdr;
> +		chunk = vb->unused_page_chunk;
> +		break;
>  	default:
>  		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
>  			 __func__, type);
> @@ -686,28 +708,139 @@ static void update_balloon_size_func(struct work_struct *work)
>  		queue_work(system_freezable_wq, work);
>  }
>  
> +static void miscq_in_hdr_add(struct virtio_balloon *vb)
> +{
> +	struct scatterlist sg_in;
> +
> +	sg_init_one(&sg_in, vb->miscq_in_hdr,
> +		    sizeof(struct virtio_balloon_miscq_hdr));
> +	if (virtqueue_add_inbuf(vb->miscq, &sg_in, 1, vb->miscq_in_hdr,
> +	    GFP_KERNEL) < 0) {
> +		__virtio_clear_bit(vb->vdev,
> +				   VIRTIO_BALLOON_F_MISC_VQ);
> +		dev_warn(&vb->vdev->dev, "%s: add miscq_in_hdr err\n",
> +			 __func__);
> +		return;
> +	}
> +	virtqueue_kick(vb->miscq);
> +}
> +
> +static void miscq_send_unused_pages(struct virtio_balloon *vb)
> +{
> +	struct virtio_balloon_miscq_hdr *miscq_out_hdr = vb->miscq_out_hdr;
> +	struct virtqueue *vq = vb->miscq;
> +	int ret = 0;
> +	unsigned int order = 0, migratetype = 0;
> +	struct zone *zone = NULL;
> +	struct page *page = NULL;
> +	u64 pfn;
> +
> +	miscq_out_hdr->cmd =  VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES;

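This gets the endianness wrong: cmd is __le16, so the constant needs a
cpu_to_le16() conversion. The whitespace is off too (two spaces after
the '='). Please use static checkers to catch this type of error.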

> +	miscq_out_hdr->flags = 0;
> +
> +	for_each_populated_zone(zone) {
> +		for (order = MAX_ORDER - 1; order > 0; order--) {
> +			for (migratetype = 0; migratetype < MIGRATE_TYPES;
> +			     migratetype++) {
> +				do {
> +					ret = inquire_unused_page_block(zone,
> +						order, migratetype, &page);
> +					if (!ret) {
> +						pfn = (u64)page_to_pfn(page);
> +						add_one_chunk(vb, vq,
> +							PAGE_CHUNK_TYPE_UNUSED,
> +							pfn,
> +							(u64)(1 << order));
> +					}
> +				} while (!ret);
> +			}
> +		}
> +	}
> +	miscq_out_hdr->flags |= VIRTIO_BALLOON_MISCQ_F_COMPLETE;

And where is miscq_out_hdr used? I see no add_outbuf anywhere.

Things like this should be passed through function parameters
and not stuffed into the device structure. Fields should be
initialized before use, not wherever we happen to
have the data handy.



Also, _F_ is normally a bit number, but you use it as a value here.
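
A short sketch of the naming convention being asked for, assuming the flag occupies bit 0 as the quoted 0x1 value suggests. The macro and function names below are hypothetical, chosen only to show a bit number and the mask derived from it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names: the _F_ macro holds a bit number, the mask is derived from it. */
#define MISCQ_F_COMPLETE    0                         /* bit number */
#define MISCQ_COMPLETE_MASK (1u << MISCQ_F_COMPLETE)  /* value to OR into flags */

static_assert(MISCQ_COMPLETE_MASK == 0x1u, "bit 0 maps to mask 0x1");

/* Return flags with the "complete" bit set. */
static uint16_t miscq_flags_set_complete(uint16_t flags)
{
	return (uint16_t)(flags | MISCQ_COMPLETE_MASK);
}

/* Return non-zero if the "complete" bit is set. */
static int miscq_flags_is_complete(uint16_t flags)
{
	return (flags & MISCQ_COMPLETE_MASK) != 0;
}
```

This matches how the feature bits quoted in the header diff (VIRTIO_BALLOON_F_STATS_VQ 1, and so on) are bit numbers rather than masks.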


> +	send_page_chunks(vb, vq, PAGE_CHUNK_TYPE_UNUSED, true);
> +}
> +
> +static void miscq_handle(struct virtqueue *vq)
> +{
> +	struct virtio_balloon *vb = vq->vdev->priv;
> +	struct virtio_balloon_miscq_hdr *hdr;
> +	unsigned int len;
> +
> +	hdr = virtqueue_get_buf(vb->miscq, &len);
> +	if (!hdr || len != sizeof(struct virtio_balloon_miscq_hdr)) {
> +		dev_warn(&vb->vdev->dev, "%s: invalid miscq hdr len\n",
> +			 __func__);
> +		miscq_in_hdr_add(vb);
> +		return;
> +	}
> +	switch (hdr->cmd) {
> +	case VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES:
> +		miscq_send_unused_pages(vb);
> +		break;
> +	default:
> +		dev_warn(&vb->vdev->dev, "%s: miscq cmd %d not supported\n",
> +			 __func__, hdr->cmd);
> +	}
> +	miscq_in_hdr_add(vb);
> +}
> +
>  static int init_vqs(struct virtio_balloon *vb)
>  {
> -	struct virtqueue *vqs[3];
> -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
> -	static const char * const names[] = { "inflate", "deflate", "stats" };
> -	int err, nvqs;
> +	struct virtqueue **vqs;
> +	vq_callback_t **callbacks;
> +	const char **names;
> +	int err = -ENOMEM;
> +	int i, nvqs;
> +
> +	 /* Inflateq and deflateq are used unconditionally */
> +	nvqs = 2;
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
> +		nvqs++;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_MISC_VQ))
> +		nvqs++;
> +
> +	/* Allocate space for find_vqs parameters */
> +	vqs = kcalloc(nvqs, sizeof(*vqs), GFP_KERNEL);
> +	if (!vqs)
> +		goto err_vq;
> +	callbacks = kmalloc_array(nvqs, sizeof(*callbacks), GFP_KERNEL);
> +	if (!callbacks)
> +		goto err_callback;
> +	names = kmalloc_array(nvqs, sizeof(*names), GFP_KERNEL);
> +	if (!names)
> +		goto err_names;
> +

There are at most 4 VQs, so why are dynamic allocations called for?
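
One possible shape for this, as a user-space sketch: fixed-size arrays sized for the maximum of four queues, filled conditionally. The callback stubs, the BALLOON_MAX_VQS macro and fill_vq_params() are hypothetical stand-ins and not code from the driver.

```c
#include <assert.h>

#define BALLOON_MAX_VQS 4 /* inflate, deflate, stats, misc */

typedef void (*vq_cb)(void);

/* Hypothetical stand-ins for the driver callbacks. */
static void balloon_ack_cb(void) {}
static void stats_request_cb(void) {}
static void miscq_handle_cb(void) {}

/*
 * Fill caller-provided fixed-size arrays and return the number of
 * queues in use. Nothing is allocated, so no error-unwinding labels
 * are needed.
 */
static int fill_vq_params(int has_stats, int has_misc,
			  vq_cb callbacks[BALLOON_MAX_VQS],
			  const char *names[BALLOON_MAX_VQS])
{
	int n = 0;

	callbacks[n] = balloon_ack_cb;
	names[n++] = "inflate";
	callbacks[n] = balloon_ack_cb;
	names[n++] = "deflate";
	if (has_stats) {
		callbacks[n] = stats_request_cb;
		names[n++] = "stats";
	}
	if (has_misc) {
		callbacks[n] = miscq_handle_cb;
		names[n++] = "miscq";
	}
	assert(n <= BALLOON_MAX_VQS);
	return n;
}
```

The arrays can then live on the caller's stack, and the three kfree() calls plus the err_* labels in the quoted hunk would no longer be needed.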

> +	callbacks[0] = balloon_ack;
> +	names[0] = "inflate";
> +	callbacks[1] = balloon_ack;
> +	names[1] = "deflate";
> +
> +	i = 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> +		callbacks[i] = stats_request;
> +		names[i] = "stats";
> +		i++;
> +	}
>  
> -	/*
> -	 * We expect two virtqueues: inflate and deflate, and
> -	 * optionally stat.
> -	 */
> -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> -	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names);
> +	if (virtio_has_feature(vb->vdev,
> +				      VIRTIO_BALLOON_F_MISC_VQ)) {
> +		callbacks[i] = miscq_handle;
> +		names[i] = "miscq";
> +	}
> +
> +	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks,
> +					 names);
>  	if (err)
> -		return err;
> +		goto err_find;
>  
>  	vb->inflate_vq = vqs[0];
>  	vb->deflate_vq = vqs[1];
> +	i = 2;
>  	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>  		struct scatterlist sg;
> -		vb->stats_vq = vqs[2];
>  
> +		vb->stats_vq = vqs[i++];
>  		/*
>  		 * Prime this virtqueue with one buffer so the hypervisor can
>  		 * use it to signal us later (it can't be broken yet!).
> @@ -718,7 +851,25 @@ static int init_vqs(struct virtio_balloon *vb)
>  			BUG();
>  		virtqueue_kick(vb->stats_vq);
>  	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_MISC_VQ)) {
> +		vb->miscq = vqs[i];
> +		miscq_in_hdr_add(vb);
> +	}
> +
> +	kfree(names);
> +	kfree(callbacks);
> +	kfree(vqs);
>  	return 0;
> +
> +err_find:
> +	kfree(names);
> +err_names:
> +	kfree(callbacks);
> +err_callback:
> +	kfree(vqs);
> +err_vq:
> +	return err;
>  }
>  
>  #ifdef CONFIG_BALLOON_COMPACTION
> @@ -843,6 +994,32 @@ static void balloon_page_chunk_init(struct virtio_balloon *vb)
>  	}
>  }
>  
> +static void miscq_init(struct virtio_balloon *vb)
> +{
> +	void *buf;
> +
> +	vb->miscq_in_hdr = kmalloc(sizeof(struct virtio_balloon_miscq_hdr),
> +				   GFP_KERNEL);
> +	buf = kmalloc(sizeof(struct virtio_balloon_miscq_hdr) +
> +		      sizeof(struct virtio_balloon_page_chunk_hdr) +
> +		      sizeof(struct virtio_balloon_page_chunk) *
> +		      MAX_PAGE_CHUNKS, GFP_KERNEL);

Maybe reduce MAX_PAGE_CHUNKS even further to fit in an order-3 allocation.
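
A rough sizing sketch for the order-3 remark. The numbers are assumptions, not taken from the patch: 4 KiB pages, 16 bytes per chunk (two 64-bit fields), and a combined header size passed in by the caller. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define ASSUMED_PAGE_SIZE  4096u /* assumption: 4 KiB pages */
#define ASSUMED_CHUNK_SIZE 16u   /* assumption: two 64-bit fields per chunk */

/* Bytes covered by an allocation of the given page order. */
static size_t order_bytes(unsigned int order)
{
	return (size_t)ASSUMED_PAGE_SIZE << order;
}

/* Largest chunk count whose buffer (headers + chunks) fits in that order. */
static size_t max_chunks_for_order(unsigned int order, size_t hdr_bytes)
{
	size_t total = order_bytes(order);

	assert(hdr_bytes <= total);
	return (total - hdr_bytes) / ASSUMED_CHUNK_SIZE;
}

/* Smallest order whose allocation holds nr_chunks plus the headers. */
static unsigned int order_for_chunks(size_t nr_chunks, size_t hdr_bytes)
{
	size_t need = hdr_bytes + nr_chunks * ASSUMED_CHUNK_SIZE;
	unsigned int order = 0;

	while (order_bytes(order) < need)
		order++;
	return order;
}
```

Under these assumptions and with 12 bytes of headers, 4096 chunks spill just past 64 KiB, while roughly 2047 chunks fit in the 32 KiB of an order-3 allocation.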


> +	if (!vb->miscq_in_hdr || !buf) {
> +		kfree(buf);
> +		kfree(vb->miscq_in_hdr);
> +		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_MISC_VQ);

Again this does not really work here. In this case it might be best to
just fail probe.
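
A user-space sketch of the "fail probe" alternative, assuming the init helper is changed to return an error code that the probe path propagates. malloc()/free() stand in for kmalloc()/kfree(), and struct miscq_bufs, miscq_bufs_init(), miscq_bufs_free() and probe_sketch() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct miscq_bufs {
	void *in_hdr;
	void *out_buf;
};

/* Returns 0 on success or -ENOMEM; on failure nothing stays allocated. */
static int miscq_bufs_init(struct miscq_bufs *b, size_t in_len, size_t out_len)
{
	assert(b);
	b->in_hdr = malloc(in_len);
	b->out_buf = malloc(out_len);
	if (!b->in_hdr || !b->out_buf) {
		free(b->out_buf);
		free(b->in_hdr);
		b->in_hdr = b->out_buf = NULL;
		return -ENOMEM;
	}
	return 0;
}

static void miscq_bufs_free(struct miscq_bufs *b)
{
	free(b->out_buf);
	free(b->in_hdr);
	b->in_hdr = b->out_buf = NULL;
}

/* Probe-style caller: propagate the error instead of clearing a feature bit. */
static int probe_sketch(struct miscq_bufs *b, int has_misc_vq)
{
	if (has_misc_vq) {
		int err = miscq_bufs_init(b, 4, 64);

		if (err)
			return err;
	}
	return 0;
}
```

Returning the error keeps the negotiated feature set and the driver state consistent, at the cost of the device not binding when memory is short.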

> +		dev_warn(&vb->vdev->dev, "%s: failed\n", __func__);
> +	} else {
> +		vb->miscq_out_hdr = buf;
> +		vb->unused_page_chunk_hdr = buf +
> +				sizeof(struct virtio_balloon_miscq_hdr);
> +		vb->unused_page_chunk_hdr->chunks = 0;
> +		vb->unused_page_chunk = buf +
> +				sizeof(struct virtio_balloon_miscq_hdr) +
> +				sizeof(struct virtio_balloon_page_chunk_hdr);
> +	}
> +}
> +
>  static int virtballoon_probe(struct virtio_device *vdev)
>  {
>  	struct virtio_balloon *vb;
> @@ -869,6 +1046,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS))
>  		balloon_page_chunk_init(vb);
>  
> +	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_MISC_VQ))
> +		miscq_init(vb);
> +
>  	mutex_init(&vb->balloon_lock);
>  	init_waitqueue_head(&vb->acked);
>  	vb->vdev = vdev;
> @@ -946,6 +1126,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
>  
>  	remove_common(vb);
>  	free_page_bmap(vb);
> +	kfree(vb->miscq_out_hdr);
> +	kfree(vb->miscq_in_hdr);
>  	if (vb->vb_dev_info.inode)
>  		iput(vb->vb_dev_info.inode);
>  	kfree(vb);
> @@ -987,6 +1169,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
>  	VIRTIO_BALLOON_F_BALLOON_CHUNKS,
> +	VIRTIO_BALLOON_F_MISC_VQ,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index be317b7..96bdc86 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -35,6 +35,7 @@
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_BALLOON_CHUNKS 3 /* Inflate/Deflate pages in chunks */
> +#define VIRTIO_BALLOON_F_MISC_VQ	4 /* Virtqueue for misc. requests */

Is "misc" the best we can do? I think these are
actually host commands - aren't they?

>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> @@ -95,4 +96,11 @@ struct virtio_balloon_page_chunk {
>  	__le64 size;
>  };
>  
> +#define VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES 0

meaning what? Is this a command value? Is this a command
to report unused memory then? Let's call it this then.


> +#define VIRTIO_BALLOON_MISCQ_F_COMPLETE 0x1

meaning what?

> +struct virtio_balloon_miscq_hdr {
> +	__le16 cmd;
> +	__le16 flags;

Add padding to make it a full 64 bits.
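
A sketch of what the padded header could look like, with compile-time checks on its size. The struct name and the padding field are hypothetical, and fixed-width user-space types stand in for __le16 and __le32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the two quoted 16-bit fields plus explicit padding. */
struct miscq_hdr_padded {
	uint16_t cmd;     /* little-endian on the wire */
	uint16_t flags;   /* little-endian on the wire */
	uint32_t padding; /* explicit, so the header is 8 bytes on every ABI */
};

static_assert(sizeof(struct miscq_hdr_padded) == 8,
	      "header must be exactly 64 bits");
static_assert(offsetof(struct miscq_hdr_padded, padding) == 4,
	      "padding starts right after the two 16-bit fields");
```

With the header at 8 bytes, 64-bit fields placed directly after it start on an 8-byte boundary relative to the start of the buffer.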

> +};
> +
>  #endif /* _LINUX_VIRTIO_BALLOON_H */
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v9 3/5] mm: function to offer a page block on the free list
  2017-04-13  9:35   ` Wei Wang
  (?)
  (?)
@ 2017-04-13 20:02     ` Andrew Morton
  -1 siblings, 0 replies; 136+ messages in thread
From: Andrew Morton @ 2017-04-13 20:02 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Thu, 13 Apr 2017 17:35:06 +0800 Wei Wang <wei.w.wang@intel.com> wrote:

> Add a function to find a page block on the free list specified by the
> caller. Pages from the page block may be used immediately after the
> function returns. The caller is responsible for detecting or preventing
> the use of such pages.
> 
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4498,6 +4498,93 @@ void show_free_areas(unsigned int filter)
>  	show_swap_cache_info();
>  }
>  
> +/**
> + * Heuristically get a page block in the system that is unused.
> + * It is possible that pages from the page block are used immediately after
> + * inquire_unused_page_block() returns. It is the caller's responsibility
> + * to either detect or prevent the use of such pages.
> + *
> + * The free list to check: zone->free_area[order].free_list[migratetype].
> + *
> + * If the caller supplied page block (i.e. **page) is on the free list, offer
> + * the next page block on the list to the caller. Otherwise, offer the first
> + * page block on the list.
> + *
> + * Return 0 when a page block is found on the caller specified free list.
> + */
> +int inquire_unused_page_block(struct zone *zone, unsigned int order,
> +			      unsigned int migratetype, struct page **page)
> +{

Perhaps we can wrap this in the appropriate ifdef so the kernels which
won't be using virtio-balloon don't carry the added overhead.
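
A small sketch of the ifdef pattern suggested here. Which config symbol is "appropriate" is left open in the thread; CONFIG_VIRTIO_BALLOON is used below purely as an assumption, and in a real kernel build IS_ENABLED() would be needed to cover the modular (=m) case. The function body is a hypothetical placeholder, not the patch's implementation.

```c
#include <assert.h>

/* Stand-in for a Kconfig-generated symbol; in a real build it comes from the kernel config. */
#define CONFIG_VIRTIO_BALLOON 1

#define SKETCH_MAX_ORDER 11 /* assumption, for illustration only */

static_assert(SKETCH_MAX_ORDER > 0, "placeholder bound must be positive");

#ifdef CONFIG_VIRTIO_BALLOON
/* Compiled only when the consumer is configured in. */
static int inquire_unused_page_block_sketch(unsigned int order)
{
	return order < SKETCH_MAX_ORDER ? 0 : -1;
}
#else
/* Stub so callers still compile; no code is carried when the feature is off. */
static inline int inquire_unused_page_block_sketch(unsigned int order)
{
	(void)order;
	return -1;
}
#endif
```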

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v9 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2017-04-13  9:35 ` Wei Wang
  (?)
  (?)
@ 2017-04-13 20:44   ` Matthew Wilcox
  -1 siblings, 0 replies; 136+ messages in thread
From: Matthew Wilcox @ 2017-04-13 20:44 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Thu, Apr 13, 2017 at 05:35:03PM +0800, Wei Wang wrote:
> 2) transfer the guest unused pages to the host so that they
> can be skipped to migrate in live migration.

I don't understand this second bit.  You leave the pages on the free list,
and tell the host they're free.  What's preventing somebody else from
allocating them and using them for something?  Is the guest semi-frozen
at this point with just enough of it running to ask the balloon driver
to do things?

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v9 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2017-04-13 20:44   ` Matthew Wilcox
  (?)
@ 2017-04-14  1:50     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 136+ messages in thread
From: Michael S. Tsirkin @ 2017-04-14  1:50 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Wei Wang, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Thu, Apr 13, 2017 at 01:44:11PM -0700, Matthew Wilcox wrote:
> On Thu, Apr 13, 2017 at 05:35:03PM +0800, Wei Wang wrote:
> > 2) transfer the guest unused pages to the host so that they
> > can be skipped to migrate in live migration.
> 
> I don't understand this second bit.  You leave the pages on the free list,
> and tell the host they're free.  What's preventing somebody else from
> allocating them and using them for something?  Is the guest semi-frozen
> at this point with just enough of it running to ask the balloon driver
> to do things?

There's missing documentation here.

The way things actually work is that the host sends the guest
a request for unused pages and then write-protects all memory.

So the guest isn't frozen, but any changes will be detected by the host.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v9 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2017-04-14  1:50     ` Michael S. Tsirkin
  (?)
@ 2017-04-14  2:28       ` Wei Wang
  -1 siblings, 0 replies; 136+ messages in thread
From: Wei Wang @ 2017-04-14  2:28 UTC (permalink / raw)
  To: Michael S. Tsirkin, Matthew Wilcox
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On 04/14/2017 09:50 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 13, 2017 at 01:44:11PM -0700, Matthew Wilcox wrote:
>> On Thu, Apr 13, 2017 at 05:35:03PM +0800, Wei Wang wrote:
>>> 2) transfer the guest unused pages to the host so that they
>>> can be skipped to migrate in live migration.
>> I don't understand this second bit.  You leave the pages on the free list,
>> and tell the host they're free.  What's preventing somebody else from
>> allocating them and using them for something?  Is the guest semi-frozen
>> at this point with just enough of it running to ask the balloon driver
>> to do things?
> There's missing documentation here.
>
> The way things actually work is host sends to guest
> a request for unused pages and then write-protects all memory.
>
> So guest isn't frozen but any changes will be detected by host.
>

Probably it's better to say "transfer the info about the guest unused pages
to the host so that the host gets a chance to skip the transfer of the
unused pages during live migration".

Best,
Wei

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v9 3/5] mm: function to offer a page block on the free list
  2017-04-13 20:02     ` Andrew Morton
  (?)
@ 2017-04-14  2:30       ` Wei Wang
  -1 siblings, 0 replies; 136+ messages in thread
From: Wei Wang @ 2017-04-14  2:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On 04/14/2017 04:02 AM, Andrew Morton wrote:
> On Thu, 13 Apr 2017 17:35:06 +0800 Wei Wang <wei.w.wang@intel.com> wrote:
>
>> Add a function to find a page block on the free list specified by the
>> caller. Pages from the page block may be used immediately after the
>> function returns. The caller is responsible for detecting or preventing
>> the use of such pages.
>>
>> ...
>>
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4498,6 +4498,93 @@ void show_free_areas(unsigned int filter)
>>   	show_swap_cache_info();
>>   }
>>   
>> +/**
>> + * Heuristically get a page block in the system that is unused.
>> + * It is possible that pages from the page block are used immediately after
>> + * inquire_unused_page_block() returns. It is the caller's responsibility
>> + * to either detect or prevent the use of such pages.
>> + *
>> + * The free list to check: zone->free_area[order].free_list[migratetype].
>> + *
>> + * If the caller supplied page block (i.e. **page) is on the free list, offer
>> + * the next page block on the list to the caller. Otherwise, offer the first
>> + * page block on the list.
>> + *
>> + * Return 0 when a page block is found on the caller specified free list.
>> + */
>> +int inquire_unused_page_block(struct zone *zone, unsigned int order,
>> +			      unsigned int migratetype, struct page **page)
>> +{
> Perhaps we can wrap this in the appropriate ifdef so the kernels which
> won't be using virtio-balloon don't carry the added overhead.
>
>

OK. What do you think about adding this guard:

#if defined(CONFIG_VIRTIO_BALLOON) || defined(CONFIG_VIRTIO_BALLOON_MODULE)
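
A rough, compilable illustration of how such a guard might wrap the function follows. It is a sketch under assumptions: CONFIG_VIRTIO_BALLOON_MODULE is defined by hand here to mimic a =m build (Kconfig would normally generate it), the function body is a placeholder, and the -ENOSYS stub and the BALLOON_HELPER_BUILT macro are hypothetical additions for illustration only. In kernel code, IS_ENABLED(CONFIG_VIRTIO_BALLOON) from <linux/kconfig.h> expresses the same built-in-or-module test more compactly.

```c
#include <errno.h>
#include <stddef.h>

/* Assumption: mimic CONFIG_VIRTIO_BALLOON=m; normally set by Kconfig. */
#define CONFIG_VIRTIO_BALLOON_MODULE 1

/* Opaque stand-ins so the sketch compiles outside the kernel. */
struct zone;
struct page;

#if defined(CONFIG_VIRTIO_BALLOON) || defined(CONFIG_VIRTIO_BALLOON_MODULE)
#define BALLOON_HELPER_BUILT 1	/* hypothetical marker for this sketch */

int inquire_unused_page_block(struct zone *zone, unsigned int order,
			      unsigned int migratetype, struct page **page)
{
	/* Placeholder: the real free-list walk is not reproduced here. */
	(void)zone;
	(void)order;
	(void)migratetype;
	(void)page;
	return -ENOENT;		/* pretend the free list is empty */
}
#else
#define BALLOON_HELPER_BUILT 0

/* Hypothetical stub so callers still build when the balloon is off. */
static inline int inquire_unused_page_block(struct zone *zone,
					     unsigned int order,
					     unsigned int migratetype,
					     struct page **page)
{
	return -ENOSYS;
}
#endif
```

If the balloon driver turns out to be the only caller, the stub branch may be unnecessary and the guard alone would keep the code out of other kernels.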



Best,
Wei

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v9 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2017-04-14  2:28       ` Wei Wang
  (?)
@ 2017-04-14  2:57         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 136+ messages in thread
From: Michael S. Tsirkin @ 2017-04-14  2:57 UTC (permalink / raw)
  To: Wei Wang
  Cc: Matthew Wilcox, virtio-dev, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, david, dave.hansen, cornelia.huck,
	akpm, mgorman, aarcange, amit.shah, pbonzini, liliang.opensource

On Fri, Apr 14, 2017 at 10:28:32AM +0800, Wei Wang wrote:
> On 04/14/2017 09:50 AM, Michael S. Tsirkin wrote:
> > On Thu, Apr 13, 2017 at 01:44:11PM -0700, Matthew Wilcox wrote:
> > > On Thu, Apr 13, 2017 at 05:35:03PM +0800, Wei Wang wrote:
> > > > 2) transfer the guest unused pages to the host so that they
> > > > can be skipped to migrate in live migration.
> > > I don't understand this second bit.  You leave the pages on the free list,
> > > and tell the host they're free.  What's preventing somebody else from
> > > allocating them and using them for something?  Is the guest semi-frozen
> > > at this point with just enough of it running to ask the balloon driver
> > > to do things?
> > There's missing documentation here.
> > 
> > The way things actually work is host sends to guest
> > a request for unused pages and then write-protects all memory.
> > 
> > So guest isn't frozen but any changes will be detected by host.
> > 
> 
> Probably it's better to say " transfer the info about the guest unused pages
> to the host so that the host gets a chance to skip the transfer of the
> unused
> pages during live migration".
> 
> Best,
> Wei

IMHO this would not be helpful.
Most people don't know how migration works, and even if they did,
this isn't tied to migration in any way.
It just makes people go "oh it's some virtualization mumbo jumbo".
We want people to be able to review, and for that
interfaces need to be separate from the implementation.

IOW we must document what the interface promises, not how it's used.


The promise is that pages have been unused at some time between when
the host sent the command and when the guest completed it.  The host uses
that by tracking memory changes and then discarding changes made to pages
it gets from the guest before it sent the command.

Say that and drop all mention of transfer, migration etc.
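
To make that promise concrete, here is a toy model of the host-side bookkeeping. It is a sketch under assumptions, not code from QEMU or from this series: struct host_tracker, its functions, and the fixed NPAGES size are all hypothetical. Writes are sorted into "before the command was sent" and "since the command was sent"; a hint from the guest discards only the former.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NPAGES 8	/* toy guest size, chosen only for the sketch */

/* Hypothetical host-side state for change tracking. */
struct host_tracker {
	bool dirty_before_cmd[NPAGES];	/* writes seen before the command */
	bool dirty_since_cmd[NPAGES];	/* writes seen after the command */
	bool cmd_sent;
};

void tracker_init(struct host_tracker *t)
{
	memset(t, 0, sizeof(*t));
}

/* Record a guest write that the host detected. */
void tracker_record_write(struct host_tracker *t, size_t pfn)
{
	if (pfn >= NPAGES)
		return;
	if (t->cmd_sent)
		t->dirty_since_cmd[pfn] = true;
	else
		t->dirty_before_cmd[pfn] = true;
}

/* The host sends the "report unused pages" command to the guest. */
void tracker_send_cmd(struct host_tracker *t)
{
	t->cmd_sent = true;
}

/*
 * The guest reported pfn as unused at some time after the command was
 * sent, so changes recorded before the command can be discarded.
 * Changes recorded since the command are kept: they may be newer than
 * the moment the page was unused.
 */
void tracker_apply_hint(struct host_tracker *t, size_t pfn)
{
	if (pfn >= NPAGES || !t->cmd_sent)
		return;
	t->dirty_before_cmd[pfn] = false;
}

/* Does the host still have to treat this page as changed? */
bool tracker_page_changed(const struct host_tracker *t, size_t pfn)
{
	return pfn < NPAGES &&
	       (t->dirty_before_cmd[pfn] || t->dirty_since_cmd[pfn]);
}
```

In this model a page written only before the command and then hinted is no longer considered changed, while a page written again after the command stays changed even if the guest reports it, which is why the guest does not need to be frozen.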

-- 
MST

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v9 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2017-04-14  2:28       ` Wei Wang
  (?)
  (?)
@ 2017-04-14  2:57       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 136+ messages in thread
From: Michael S. Tsirkin @ 2017-04-14  2:57 UTC (permalink / raw)
  To: Wei Wang
  Cc: aarcange, virtio-dev, kvm, qemu-devel, amit.shah,
	liliang.opensource, dave.hansen, linux-kernel, Matthew Wilcox,
	virtualization, linux-mm, cornelia.huck, pbonzini, akpm, mgorman

On Fri, Apr 14, 2017 at 10:28:32AM +0800, Wei Wang wrote:
> On 04/14/2017 09:50 AM, Michael S. Tsirkin wrote:
> > On Thu, Apr 13, 2017 at 01:44:11PM -0700, Matthew Wilcox wrote:
> > > On Thu, Apr 13, 2017 at 05:35:03PM +0800, Wei Wang wrote:
> > > > 2) transfer the guest unused pages to the host so that they
> > > > can be skipped to migrate in live migration.
> > > I don't understand this second bit.  You leave the pages on the free list,
> > > and tell the host they're free.  What's preventing somebody else from
> > > allocating them and using them for something?  Is the guest semi-frozen
> > > at this point with just enough of it running to ask the balloon driver
> > > to do things?
> > There's missing documentation here.
> > 
> > The way things actually work is host sends to guest
> > a request for unused pages and then write-protects all memory.
> > 
> > So guest isn't frozen but any changes will be detected by host.
> > 
> 
> Probably it's better to say " transfer the info about the guest unused pages
> to the host so that the host gets a chance to skip the transfer of the
> unused
> pages during live migration".
> 
> Best,
> Wei

IMHO this would not be helpful.
Most people don't know how does migration work, even if they did
this isn't tied to migration in any way.
It just makes people go "oh it's some virtualization mumbo jumbo".
We want people to be able to review and for that
interfaces need to be separate from the implementation.

IOW we must document what the interface promises not how it's used.


The promise is that pages have been unused at some time between when
host sent command and when guest completed it.  Host uses that by
tracking memory changes and then discarding changes made to pages
it gets from guest before it sent the command.

Say that and drop all mention of transfer, migration etc.

-- 
MST

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v9 3/5] mm: function to offer a page block on the free list
  2017-04-14  2:30       ` Wei Wang
  (?)
@ 2017-04-14  2:58         ` Matthew Wilcox
  -1 siblings, 0 replies; 136+ messages in thread
From: Matthew Wilcox @ 2017-04-14  2:58 UTC (permalink / raw)
  To: Wei Wang
  Cc: Andrew Morton, virtio-dev, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, mst, david, dave.hansen,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource

On Fri, Apr 14, 2017 at 10:30:27AM +0800, Wei Wang wrote:
> OK. What do you think if we add this:
> 
> #if defined(CONFIG_VIRTIO_BALLOON) || defined(CONFIG_VIRTIO_BALLOON_MODULE)

That's spelled "IS_ENABLED(CONFIG_VIRTIO_BALLOON)", FYI.
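
For readers unfamiliar with the macro, the following userspace sketch
re-creates the idea behind IS_ENABLED() so that it can be compiled outside the
kernel. The DEMO_/demo_ names are made up for this example; the kernel's own
definitions live in include/linux/kconfig.h and differ in detail. The sketch
assumes the option is built as a module, in which case Kconfig defines
CONFIG_VIRTIO_BALLOON_MODULE to 1.

```c
/* Userspace re-creation of the kconfig helpers, for illustration only. */
#define DEMO_ARG_PLACEHOLDER_1 0,
#define demo_take_second_arg(ignored, val, ...) val
#define demo_is_defined(x) demo_is_defined_1(x)
#define demo_is_defined_1(val) demo_is_defined_2(DEMO_ARG_PLACEHOLDER_##val)
#define demo_is_defined_2(arg1_or_junk) \
	demo_take_second_arg(arg1_or_junk 1, 0, 0)

#define DEMO_IS_BUILTIN(option) demo_is_defined(option)
#define DEMO_IS_MODULE(option)  demo_is_defined(option##_MODULE)
#define DEMO_IS_ENABLED(option) \
	(DEMO_IS_BUILTIN(option) || DEMO_IS_MODULE(option))

/* Assumption for the demo: the balloon driver is configured as a module. */
#define CONFIG_VIRTIO_BALLOON_MODULE 1

/* Short spelling. */
static int balloon_enabled_short(void)
{
#if DEMO_IS_ENABLED(CONFIG_VIRTIO_BALLOON)
	return 1;
#else
	return 0;
#endif
}

/* Long spelling from the quoted proposal. */
static int balloon_enabled_long(void)
{
#if defined(CONFIG_VIRTIO_BALLOON) || defined(CONFIG_VIRTIO_BALLOON_MODULE)
	return 1;
#else
	return 0;
#endif
}
```

Both functions return the same value for the built-in, module and disabled
cases, which is why the short spelling is preferred.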

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [virtio-dev] Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
  2017-04-13 16:34     ` Michael S. Tsirkin
  (?)
  (?)
@ 2017-04-14  8:37       ` Wei Wang
  -1 siblings, 0 replies; 136+ messages in thread
From: Wei Wang @ 2017-04-14  8:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On 04/14/2017 12:34 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 13, 2017 at 05:35:05PM +0800, Wei Wang wrote:
>
> So we don't need the bitmap to talk to host, it is just
> a data structure we chose to maintain lists of pages, right?
Right. The bitmap is just the way we gather pages into chunks.
It's only needed in the balloon page case.
For the unused page case we don't need it, since the free
page blocks are already chunks.

> OK as far as it goes but you need much better isolation for it.
> Build a data structure with APIs such as _init, _cleanup, _add, _clear,
> _find_first, _find_next.
> Completely unrelated to pages, it just maintains bits.
> Then use it here.
>
>
>>   static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
>>   module_param(oom_pages, int, S_IRUSR | S_IWUSR);
>>   MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>> @@ -50,6 +54,10 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>>   static struct vfsmount *balloon_mnt;
>>   #endif
>>   
>> +/* Types of pages to chunk */
>> +#define PAGE_CHUNK_TYPE_BALLOON 0
>> +
> Doesn't look like you are ever adding more types in this
> patchset.  Pls keep code simple, generalize it later.
>
"#define PAGE_CHUNK_TYPE_UNUSED 1" is added in another patch.

The types of pages to chunk are treated differently: each type of page
chunk is sent to the host via its own protocol.

1) PAGE_CHUNK_TYPE_BALLOON: ballooned (i.e. inflated/deflated) pages
to chunk.  The ballooned type uses the basic chunk msg format:

virtio_balloon_page_chunk_hdr +
virtio_balloon_page_chunk * MAX_PAGE_CHUNKS

2) PAGE_CHUNK_TYPE_UNUSED: unused pages to chunk. It uses this miscq msg
format:
miscq_hdr +
virtio_balloon_page_chunk_hdr +
virtio_balloon_page_chunk * MAX_PAGE_CHUNKS

The chunk msg is actually the payload of the miscq msg.
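
The two formats can be sketched as length arithmetic. The struct layouts below
are assumptions made for this example only (the demo_ structs and their fields
are hypothetical stand-ins, not the uapi definitions); the point is just that
the miscq msg is the chunk msg with one extra header in front.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the headers discussed above. */
struct demo_chunk_hdr {
	uint32_t chunks;	/* number of chunks in the payload */
};

struct demo_chunk {
	uint64_t base;		/* first pfn of the chunk */
	uint64_t size;		/* number of pfns in the chunk */
};

struct demo_miscq_hdr {
	uint16_t cmd;
	uint16_t flags;
};

/* Basic chunk msg: chunk hdr + chunks (PAGE_CHUNK_TYPE_BALLOON). */
static size_t balloon_msg_len(uint32_t chunks)
{
	return sizeof(struct demo_chunk_hdr) +
	       (size_t)chunks * sizeof(struct demo_chunk);
}

/* miscq msg: miscq hdr + chunk msg as payload (PAGE_CHUNK_TYPE_UNUSED). */
static size_t unused_msg_len(uint32_t chunks)
{
	return sizeof(struct demo_miscq_hdr) + balloon_msg_len(chunks);
}
```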



>> +#define MAX_PAGE_CHUNKS 4096
> This is an order-4 allocation. I'd make it 4095 and then it's
> an order-3 one.

Sounds good, thanks.
I think it would be better to make it 4090. Leave some space for the hdr
as well.
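
The order arithmetic can be checked with a small sketch. The sizes are
assumptions: 4KB pages, and an 8-byte header plus 8-byte chunk entries, chosen
only because they reproduce the order-4 vs. order-3 numbers quoted above. The
real struct sizes may differ, so the helpers take them as parameters.

```c
#include <stddef.h>

/* Smallest order such that (page_size << order) >= bytes. */
static unsigned int alloc_order(size_t bytes, size_t page_size)
{
	unsigned int order = 0;
	size_t span = page_size;

	while (span < bytes) {
		span <<= 1;
		order++;
	}
	return order;
}

/* Size of one chunk buffer: header followed by max_chunks entries. */
static size_t chunk_buf_bytes(size_t hdr_bytes, size_t chunk_bytes,
			      size_t max_chunks)
{
	return hdr_bytes + chunk_bytes * max_chunks;
}
```

Under these assumed sizes, 4096 entries plus the header spill just past 32KB
and need an order-4 allocation, while 4095 or 4090 entries fit in order 3.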

>
>>   struct virtio_balloon {
>>   	struct virtio_device *vdev;
>>   	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
>> @@ -78,6 +86,32 @@ struct virtio_balloon {
>>   	/* Synchronize access/update to this struct virtio_balloon elements */
>>   	struct mutex balloon_lock;
>>   
>> +	/*
>> +	 * Buffer for PAGE_CHUNK_TYPE_BALLOON:
>> +	 * virtio_balloon_page_chunk_hdr +
>> +	 * virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
>> +	 */
>> +	struct virtio_balloon_page_chunk_hdr *balloon_page_chunk_hdr;
>> +	struct virtio_balloon_page_chunk *balloon_page_chunk;
>> +
>> +	/* Bitmap used to record pages */
>> +	unsigned long *page_bmap[PAGE_BMAP_COUNT_MAX];
>> +	/* Number of the allocated page_bmap */
>> +	unsigned int page_bmaps;
>> +
>> +	/*
>> +	 * The allocated page_bmap size may be smaller than the pfn range of
>> +	 * the ballooned pages. In this case, we need to use the page_bmap
>> +	 * multiple times to cover the entire pfn range. It's like using a
>> +	 * short ruler several times to finish measuring a long object.
>> +	 * The start location of the ruler in the next measurement is the end
>> +	 * location of the ruler in the previous measurement.
>> +	 *
>> +	 * pfn_max & pfn_min: forms the pfn range of the ballooned pages
>> +	 * pfn_start & pfn_stop: records the start and stop pfn in each cover
> cover? what does this mean?
>
> looks like you only use these to pass data to tell_host.
> so pass these as parameters and you won't need to keep
> them in this structure.
>
> And then you can move this comment to set_page_bmap where
> it belongs.
>
>> +	 */
>> +	unsigned long pfn_min, pfn_max, pfn_start, pfn_stop;
>> +
>>   	/* The array of pfns we tell the Host about. */
>>   	unsigned int num_pfns;
>>   	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
>> @@ -110,20 +144,201 @@ static void balloon_ack(struct virtqueue *vq)
>>   	wake_up(&vb->acked);
>>   }
>>   
>> -static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
>> +static inline void init_page_bmap_range(struct virtio_balloon *vb)
>> +{
>> +	vb->pfn_min = ULONG_MAX;
>> +	vb->pfn_max = 0;
>> +}
>> +
>> +static inline void update_page_bmap_range(struct virtio_balloon *vb,
>> +					  struct page *page)
>> +{
>> +	unsigned long balloon_pfn = page_to_balloon_pfn(page);
>> +
>> +	vb->pfn_min = min(balloon_pfn, vb->pfn_min);
>> +	vb->pfn_max = max(balloon_pfn, vb->pfn_max);
>> +}
>> +
>> +/* The page_bmap size is extended by adding more number of page_bmap */
> did you mean
>
> 	Allocate more bitmaps to cover the given number of pfns
> 	and add them to page_bmap
>
> ?
>
> This isn't what this function does.
> It blindly assumes 1 bitmap is allocated
> and allocates more, up to PAGE_BMAP_COUNT_MAX.
>

Please let me use a concrete analogy to explain this algorithm:
We have a 2-meter long ruler (i.e. page_bmap[0]).

Case 1:
To measure a 1-meter long object (i.e. pfn_max=1, pfn_min=0),
we can simply use the ruler once and find that the object
is 1 meter long.

Case 2:
To measure an 11-meter long object (i.e. pfn_max=11, pfn_min=0),
we first see if we can extend the 2-meter long ruler, for example,
to 12 meters by getting another five 2-meter rulers and combining them
(i.e. extend_page_bmap_size() allocates page_bmap[1],
page_bmap[2]...page_bmap[5]).
Case 2.1: If the ruler is successfully extended to 12 meters, that is,
                 we get a 12-meter long ruler, then we can simply use the
                 ruler once and find that the object is 11 meters long.
Case 2.2: If the ruler cannot be extended, we need to use the
                 2-meter long ruler 6 times to measure the 11-meter long
                 object:
                 1st time: pfn_start=0, pfn_stop=2;
                 2nd time: pfn_start=2, pfn_stop=4;
                 ..
                 6th time: pfn_start=10, pfn_stop=11
                 We still cover the entire length of the long object with
                 the short ruler that we have, but we use it 6 times
                 (i.e. use page_bmap[0] 6 times).

With this analogy in mind, I think the answers to the following
questions will be easier to understand.
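
The ruler analogy can be written down as a small sketch. struct pfn_window,
count_windows() and window_at() are hypothetical helpers for this example, not
functions from the patch; "ruler" is the number of pfns the allocated bitmaps
can cover in one pass.

```c
/* One placement of the ruler: covers pfns from start up to stop. */
struct pfn_window {
	unsigned long start;
	unsigned long stop;
};

/* How many times the ruler must be laid down to cover pfn_min..pfn_max. */
static unsigned int count_windows(unsigned long pfn_min, unsigned long pfn_max,
				  unsigned long ruler)
{
	if (ruler == 0 || pfn_max <= pfn_min)
		return 0;
	return (unsigned int)((pfn_max - pfn_min + ruler - 1) / ruler);
}

/* The idx-th placement (0-based); the last one takes the leftover only. */
static struct pfn_window window_at(unsigned long pfn_min, unsigned long pfn_max,
				   unsigned long ruler, unsigned int idx)
{
	struct pfn_window w;

	w.start = pfn_min + (unsigned long)idx * ruler;
	w.stop = w.start + ruler;
	if (w.stop > pfn_max)
		w.stop = pfn_max;
	return w;
}
```

With pfn_min=0, pfn_max=11 and a 2-pfn ruler this gives the six passes of
Case 2.2; with a 12-pfn ruler it gives the single pass of Case 2.1.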

>> +static void extend_page_bmap_size(struct virtio_balloon *vb,
>> +				  unsigned long pfns)
>> +{
>> +	int i, bmaps;
>> +	unsigned long bmap_len;
>> +
>> +	bmap_len = ALIGN(pfns, BITS_PER_LONG) / BITS_PER_BYTE;
>> +	bmap_len = ALIGN(bmap_len, PAGE_BMAP_SIZE);
> Align? PAGE_BMAP_SIZE doesn't even have to be a power of 2 ...

Though PAGE_BMAP_SIZE has been set to 32K in the implementation,
would you prefer to use roundup() here?


>> +	bmaps = min((int)(bmap_len / PAGE_BMAP_SIZE),
>> +		    PAGE_BMAP_COUNT_MAX);
> I got lost here.
>
> Please use things like ARRAY_SIZE instead of macros.

PAGE_BMAP_COUNT_MAX is the maximum number of page_bmap[] entries that
may be allocated on demand. It is 32 in the implementation.

For example, if the calculation shows that 100 page_bmap[] are needed
but we can only afford 32, bmaps is set to 32 instead of 100. The
implementation then goes through Case 2.2.
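
As a sketch of that calculation, using the 32K and 32 values mentioned above:
page_bmaps_for() is a hypothetical helper, and defining PFNS_PER_PAGE_BMAP as
PAGE_BMAP_SIZE * BITS_PER_BYTE is an assumption. The division rounds up, so it
does not depend on PAGE_BMAP_SIZE being a power of 2.

```c
#define BITS_PER_BYTE		8UL
#define PAGE_BMAP_SIZE		(32UL * 1024UL)	/* bytes per page_bmap */
#define PAGE_BMAP_COUNT_MAX	32UL
/* Assumed definition: one bit per pfn. */
#define PFNS_PER_PAGE_BMAP	(PAGE_BMAP_SIZE * BITS_PER_BYTE)

/* Number of page_bmap[] entries wanted for pfns, capped at the maximum. */
static unsigned long page_bmaps_for(unsigned long pfns)
{
	/* Round up without assuming a power-of-2 bitmap size. */
	unsigned long needed =
		(pfns + PFNS_PER_PAGE_BMAP - 1) / PFNS_PER_PAGE_BMAP;

	if (needed == 0)
		needed = 1;	/* page_bmap[0] always exists */
	return needed > PAGE_BMAP_COUNT_MAX ? PAGE_BMAP_COUNT_MAX : needed;
}
```

A request that would need 100 bitmaps is capped at 32, and the remaining range
is then covered by reusing the bitmaps as in Case 2.2.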



>> +
>> +	for (i = 1; i < bmaps; i++) {
>> +		vb->page_bmap[i] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
>> +		if (vb->page_bmap[i])
>> +			vb->page_bmaps++;
>> +		else
>> +			break;
>> +	}
>> +}
>> +
>> +static void free_extended_page_bmap(struct virtio_balloon *vb)
>> +{
>> +	int i, bmaps = vb->page_bmaps;
>> +
>> +	for (i = 1; i < bmaps; i++) {
>> +		kfree(vb->page_bmap[i]);
>> +		vb->page_bmap[i] = NULL;
>> +		vb->page_bmaps--;
>> +	}
>> +}
>> +
> What's the magic number 1 here?
> Maybe you want to document what is going on.
> Here's a guess:
>
> We keep a single bmap around at all times.
> If memory does not fit there, we allocate up to
> PAGE_BMAP_COUNT_MAX of chunks.
>

Right. By default, we have only 1 page_bmap[] allocated.


>> +static void free_page_bmap(struct virtio_balloon *vb)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < vb->page_bmaps; i++)
>> +		kfree(vb->page_bmap[i]);
>> +}
>> +
>> +static void clear_page_bmap(struct virtio_balloon *vb)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < vb->page_bmaps; i++)
>> +		memset(vb->page_bmap[i], 0, PAGE_BMAP_SIZE);
>> +}
>> +
>> +static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq,
>> +			     int type, bool busy_wait)
> busy_wait seems unused. pls drop.

It will be used in another patch (the 5th patch) for sending unused pages.
I can introduce it in that patch instead.

>
>>   {
>>   	struct scatterlist sg;
>> +	struct virtio_balloon_page_chunk_hdr *hdr;
>> +	void *buf;
>>   	unsigned int len;
>>   
>> -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
>> +	switch (type) {
>> +	case PAGE_CHUNK_TYPE_BALLOON:
>> +		hdr = vb->balloon_page_chunk_hdr;
>> +		len = 0;
>> +		break;
>> +	default:
>> +		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
>> +			 __func__, type);
>> +		return;
>> +	}
>>   
>> -	/* We should always be able to add one buffer to an empty queue. */
>> -	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>> -	virtqueue_kick(vq);
>> +	buf = (void *)hdr - len;
> Moving back to before the header? How can this make sense?
> It works fine since len is 0, so just buf = hdr.
>
The unused page chunk case follows its own protocol:
miscq_hdr + payload (chunk msg).
"buf = (void *)hdr - len" moves the buf pointer back to the miscq_hdr, so
that the entire miscq msg is sent.

Please check the patch that implements the unused page chunk and
it will be clear. If necessary, I can move "buf = (void *)hdr - len"
into that patch.
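
A small sketch of the pointer arithmetic, under the assumption that the
miscq_hdr and the chunk hdr sit back to back in one buffer. The demo_ structs
and msg_start() are hypothetical; the sketch uses char * arithmetic and
offsetof() rather than arithmetic on void *.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts, for illustration only. */
struct demo_miscq_hdr {
	uint16_t cmd;
	uint16_t flags;
};

struct demo_chunk_hdr {
	uint32_t chunks;
};

/* miscq msg: the chunk msg is the payload that follows the miscq hdr. */
struct demo_unused_msg {
	struct demo_miscq_hdr miscq;
	struct demo_chunk_hdr chunk_hdr;
};

static struct demo_unused_msg demo_msg;

/*
 * Start of the buffer to send: prefix_len is 0 for the ballooned type and
 * the size of what precedes the chunk hdr for the unused type.
 */
static void *msg_start(struct demo_chunk_hdr *hdr, size_t prefix_len)
{
	return (char *)hdr - prefix_len;
}
```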


>> +	len += sizeof(struct virtio_balloon_page_chunk_hdr);
>> +	len += hdr->chunks * sizeof(struct virtio_balloon_page_chunk);
>> +	sg_init_table(&sg, 1);
>> +	sg_set_buf(&sg, buf, len);
>> +	if (!virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL)) {
>> +		virtqueue_kick(vq);
>> +		if (busy_wait)
>> +			while (!virtqueue_get_buf(vq, &len) &&
>> +			       !virtqueue_is_broken(vq))
>> +				cpu_relax();
>> +		else
>> +			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>> +		hdr->chunks = 0;
> Why zero it here after device used it? Better to zero before use.

hdr->chunks tells the host how many chunks there are in the payload.
Once the device has used it, it is safe to zero it.

>
>> +	}
>> +}
>> +
>> +static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
>> +			  int type, u64 base, u64 size)
> what are the units here? Looks like it's in 4kbyte units?

What is the "unit" you are referring to?
This is the function to add one chunk; the base pfn and the size of the
chunk are supplied to the function.



>
>> +	if (hdr->chunks == MAX_PAGE_CHUNKS)
>> +		send_page_chunks(vb, vq, type, false);
> 		and zero chunks here?
>> +}
>> +
>> +static void chunking_pages_from_bmap(struct virtio_balloon *vb,
> Does this mean "convert_bmap_to_chunks"?
>

Yes.


>> +				     struct virtqueue *vq,
>> +				     unsigned long pfn_start,
>> +				     unsigned long *bmap,
>> +				     unsigned long len)
>> +{
>> +	unsigned long pos = 0, end = len * BITS_PER_BYTE;
>> +
>> +	while (pos < end) {
>> +		unsigned long one = find_next_bit(bmap, end, pos);
>> +
>> +		if (one < end) {
>> +			unsigned long chunk_size, zero;
>> +
>> +			zero = find_next_zero_bit(bmap, end, one + 1);
>
> zero and one are unhelpful names unless they equal 0 and 1.
> current/next?
>

I think it is clear if we think about the bitmap, for example:
00001111000011110000
one = the position of the next "1" bit,
zero = the position of the next "0" bit, starting from one.

Then it is clear that chunk_size = zero - one.

Would it be better to use pos_0 and pos_1?

>> +			if (zero >= end)
>> +				chunk_size = end - one;
>> +			else
>> +				chunk_size = zero - one;
>> +
>> +			if (chunk_size)
>> +				add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
>> +					      pfn_start + one, chunk_size);
> Still not so what does a bit refer to? page or 4kbytes?
> I think it should be a page.
A bit in the bitmap corresponds to a pfn of a balloon page (4KB).
But I think it doesn't matter here, since it is a pfn.
Using the above example:
00001111000011110000

If the starting bit above corresponds to pfn 0x1000 (i.e. pfn_start),
then the chunk base = 0x1004
(one is the position of the first set bit, which is 4), so
pfn_start + one = 0x1004.
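
The bitmap example can be run as a userspace sketch. It uses plain loops
instead of the kernel's find_next_bit()/find_next_zero_bit(), and struct
demo_chunk, demo_test_bit() and bmap_to_chunks() are hypothetical names. Bit 0
is the leftmost digit of the example string, so 00001111000011110000 is 0xF0F0.

```c
#include <limits.h>

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

struct demo_chunk {
	unsigned long base;	/* first pfn of the run */
	unsigned long size;	/* number of consecutive set bits */
};

/* Bits 4-7 and 12-15 set, as in the example above. */
static const unsigned long demo_bmap[1] = { 0xF0F0UL };
static struct demo_chunk demo_out[4];

static int demo_test_bit(const unsigned long *bmap, unsigned long pos)
{
	return (bmap[pos / DEMO_BITS_PER_LONG] >>
		(pos % DEMO_BITS_PER_LONG)) & 1UL;
}

/* Convert runs of set bits into (base, size) chunks; returns the count. */
static unsigned int bmap_to_chunks(const unsigned long *bmap,
				   unsigned long nbits,
				   unsigned long pfn_start,
				   struct demo_chunk *out, unsigned int max_out)
{
	unsigned long pos = 0;
	unsigned int n = 0;

	while (pos < nbits && n < max_out) {
		unsigned long one;

		while (pos < nbits && !demo_test_bit(bmap, pos))
			pos++;		/* position of the next "1" bit */
		if (pos == nbits)
			break;
		one = pos;
		while (pos < nbits && demo_test_bit(bmap, pos))
			pos++;		/* position of the next "0" bit, or end */
		out[n].base = pfn_start + one;
		out[n].size = pos - one;	/* chunk_size = zero - one */
		n++;
	}
	return n;
}
```

For the 20-bit example with pfn_start = 0x1000 this yields two chunks of four
pfns each, the first with base 0x1004.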

>> +static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
>> +{
>> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS)) {
>> +		int pfns, page_bmaps, i;
>> +		unsigned long pfn_start, pfns_len;
>> +
>> +		pfn_start = vb->pfn_start;
>> +		pfns = vb->pfn_stop - pfn_start + 1;
>> +		pfns = roundup(roundup(pfns, BITS_PER_LONG),
>> +			       PFNS_PER_PAGE_BMAP);
>> +		page_bmaps = pfns / PFNS_PER_PAGE_BMAP;
>> +		pfns_len = pfns / BITS_PER_BYTE;
>> +
>> +		for (i = 0; i < page_bmaps; i++) {
>> +			unsigned int bmap_len = PAGE_BMAP_SIZE;
>> +
>> +			/* The last one takes the leftover only */
> I don't understand what does this mean.
Using the ruler analogy again: the object is 11 meters long, and we have
a 2-meter long ruler. After the 5th time we have covered 10 meters of the
object. Then, the last time, the leftover is 1 meter, which means we can
use half of the ruler to cover the remaining 1 meter.

Back to the implementation here: if there are only 10 pfns left in the
last round, I think it's not necessary to search the entire page_bmap[]
till the end.

>> +static void set_page_bmap(struct virtio_balloon *vb,
>> +			  struct list_head *pages, struct virtqueue *vq)
>> +{
>> +	unsigned long pfn_start, pfn_stop;
>> +	struct page *page;
>> +	bool found;
>> +
>> +	vb->pfn_min = rounddown(vb->pfn_min, BITS_PER_LONG);
>> +	vb->pfn_max = roundup(vb->pfn_max, BITS_PER_LONG);
>> +
>> +	extend_page_bmap_size(vb, vb->pfn_max - vb->pfn_min + 1);
> This might not do anything in particular might not cover the
> given pfn range. Do we care? Why not?

We have allocated only 1 page_bmap[], which is able to cover 1GB of memory.
To inflate 2GB, it will try to extend by getting one more page_bmap,
page_bmap[1].

>> +	pfn_start = vb->pfn_min;
>> +
>> +	while (pfn_start < vb->pfn_max) {
>> +		pfn_stop = pfn_start + PFNS_PER_PAGE_BMAP * vb->page_bmaps;
>> +		pfn_stop = pfn_stop < vb->pfn_max ? pfn_stop : vb->pfn_max;
>> +
>> +		vb->pfn_start = pfn_start;
>> +		clear_page_bmap(vb);
>> +		found = false;
>> +
>> +		list_for_each_entry(page, pages, lru) {
>> +			unsigned long bmap_idx, bmap_pos, balloon_pfn;
>> +
>> +			balloon_pfn = page_to_balloon_pfn(page);
>> +			if (balloon_pfn < pfn_start || balloon_pfn > pfn_stop)
>> +				continue;
>> +			bmap_idx = (balloon_pfn - pfn_start) /
>> +				   PFNS_PER_PAGE_BMAP;
>> +			bmap_pos = (balloon_pfn - pfn_start) %
>> +				   PFNS_PER_PAGE_BMAP;
>> +			set_bit(bmap_pos, vb->page_bmap[bmap_idx]);
> Looks like this will crash if bmap_idx is out of range or
> if page_bmap allocation failed.

No, it won't. Please think about the analogy Case 2.2: pfn_start is updated
in each round. Like in the 2nd round, pfn_start is updated to 2, balloon_pfn
will be a value between 2 and 4, so the result of
"(balloon_pfn - pfn_start) /  PFNS_PER_PAGE_BMAP" will always be 0 when
we only have page_bmap[0].
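
The index calculation can be sketched as follows. bmap_idx() is a hypothetical
helper, and it treats the window as half-open, i.e. pfn_stop itself is not
part of the round. That bound is a choice made for this sketch: with an
inclusive upper bound, a pfn equal to pfn_stop would map to index 1 in the
2nd-round example, so whether the boundary pfn can occur in the patch is worth
double-checking.

```c
/*
 * Index of the page_bmap[] entry that holds pfn in the current round, or -1
 * if pfn is outside the window pfn_start <= pfn < pfn_stop.
 */
static long bmap_idx(unsigned long pfn, unsigned long pfn_start,
		     unsigned long pfn_stop, unsigned long pfns_per_bmap)
{
	if (pfn < pfn_start || pfn >= pfn_stop)
		return -1;
	return (long)((pfn - pfn_start) / pfns_per_bmap);
}
```

With the ruler numbers (2 pfns per bitmap, 2nd round covering 2 to 4), pfns 2
and 3 land in page_bmap[0] and pfn 4 is left for the next round.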

>   
>   #ifdef CONFIG_BALLOON_COMPACTION
> +
> +static void tell_host_one_page(struct virtio_balloon *vb,
> +			       struct virtqueue *vq, struct page *page)
> +{
> +	add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON, page_to_pfn(page), 1);
> This passes 4kbytes to host which seems wrong - I think you want a full page.

OK. It should be
add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
                           page_to_pfn(page), VIRTIO_BALLOON_PAGES_PER_PAGE)

right?

If Page=2*BalloonPage, it will pass 2*4K to the host.
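
A quick sketch of the arithmetic behind VIRTIO_BALLOON_PAGES_PER_PAGE, which is
the page size shifted right by VIRTIO_BALLOON_PFN_SHIFT (12). The helper names
are hypothetical, and the page size is a parameter here so that several
configurations can be tried in userspace.

```c
#define VIRTIO_BALLOON_PFN_SHIFT 12

/* Number of 4KB balloon pages that make up one page of page_size bytes. */
static unsigned long balloon_pages_per_page(unsigned long page_size)
{
	return page_size >> VIRTIO_BALLOON_PFN_SHIFT;
}

/* Bytes described by a chunk whose size is one full page. */
static unsigned long bytes_for_one_page(unsigned long page_size)
{
	return balloon_pages_per_page(page_size) << VIRTIO_BALLOON_PFN_SHIFT;
}
```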

+static void balloon_page_chunk_init(struct virtio_balloon *vb)
+{
+	void *buf;
+
+	/*
+	 * By default, we allocate page_bmap[0] only. More page_bmap will be
+	 * allocated on demand.
+	 */
+	vb->page_bmap[0] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
+	buf = kmalloc(sizeof(struct virtio_balloon_page_chunk_hdr) +
+		      sizeof(struct virtio_balloon_page_chunk) *
+		      MAX_PAGE_CHUNKS, GFP_KERNEL);
+	if (!vb->page_bmap[0] || !buf) {
+		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS);

> this doesn't work as expected as features has been OK'd by then.
> You want something like
> validate_features that I posted. See
> "virtio: allow drivers to validate features".

OK. I will change it after that patch is merged.

>
>> +		kfree(vb->page_bmap[0]);
> Looks like this will double free. you want to zero them I think.
>

OK. I'll NULL the pointers after kfree().
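
As a userspace sketch of that pattern, with malloc()/free() standing in for
kmalloc()/kfree(): struct demo_balloon and both helpers are hypothetical, and
the simulate_failure flag exists only to exercise the error path.

```c
#include <stdlib.h>

struct demo_balloon {
	unsigned long *page_bmap0;
	void *chunk_buf;
};

static struct demo_balloon demo_vb;

/* Free both buffers and NULL the pointers so a later cleanup is harmless. */
static void demo_chunk_cleanup(struct demo_balloon *vb)
{
	free(vb->page_bmap0);
	vb->page_bmap0 = NULL;
	free(vb->chunk_buf);
	vb->chunk_buf = NULL;
}

/* Returns 0 on success, -1 if either allocation failed. */
static int demo_chunk_init(struct demo_balloon *vb, size_t bmap_bytes,
			   size_t buf_bytes, int simulate_failure)
{
	vb->page_bmap0 = malloc(bmap_bytes);
	vb->chunk_buf = simulate_failure ? NULL : malloc(buf_bytes);
	if (!vb->page_bmap0 || !vb->chunk_buf) {
		demo_chunk_cleanup(vb);
		return -1;
	}
	return 0;
}
```

Because free(NULL) is a no-op (as is kfree(NULL)), calling the cleanup again
after a failed init does not double free.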



Best,
Wei

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
@ 2017-04-14  8:37       ` Wei Wang
  0 siblings, 0 replies; 136+ messages in thread
From: Wei Wang @ 2017-04-14  8:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On 04/14/2017 12:34 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 13, 2017 at 05:35:05PM +0800, Wei Wang wrote:
>
> So we don't need the bitmap to talk to host, it is just
> a data structure we chose to maintain lists of pages, right?
Right. bitmap is the way to gather pages to chunk.
It's only needed in the balloon page case.
For the unused page case, we don't need it, since the free
page blocks are already chunks.

> OK as far as it goes but you need much better isolation for it.
> Build a data structure with APIs such as _init, _cleanup, _add, _clear,
> _find_first, _find_next.
> Completely unrelated to pages, it just maintains bits.
> Then use it here.
>
>
>>   static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
>>   module_param(oom_pages, int, S_IRUSR | S_IWUSR);
>>   MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>> @@ -50,6 +54,10 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>>   static struct vfsmount *balloon_mnt;
>>   #endif
>>   
>> +/* Types of pages to chunk */
>> +#define PAGE_CHUNK_TYPE_BALLOON 0
>> +
> Doesn't look like you are ever adding more types in this
> patchset.  Pls keep code simple, generalize it later.
>
"#define PAGE_CHUNK_TYPE_UNUSED 1" is added in another patch.

Types of page to chunk are treated differently. Different types of page
chunks are sent to the host via different protocols.

1) PAGE_CHUNK_TYPE_BALLOON: Ballooned (i.e. inflated/deflated) pages
to chunk.  For the ballooned type, it uses the basic chunk msg format:

virtio_balloon_page_chunk_hdr +
virtio_balloon_page_chunk * MAX_PAGE_CHUNKS

2) PAGE_CHUNK_TYPE_UNUSED: unused pages to chunk. It uses this miscq msg
format:
miscq_hdr +
virtio_balloon_page_chunk_hdr +
virtio_balloon_page_chunk * MAX_PAGE_CHUNKS

The chunk msg is actually the payload of the miscq msg.



>> +#define MAX_PAGE_CHUNKS 4096
> This is an order-4 allocation. I'd make it 4095 and then it's
> an order-3 one.

Sounds good, thanks.
I think it would be better to make it 4090. Leave some space for the hdr
as well.

>
>>   struct virtio_balloon {
>>   	struct virtio_device *vdev;
>>   	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
>> @@ -78,6 +86,32 @@ struct virtio_balloon {
>>   	/* Synchronize access/update to this struct virtio_balloon elements */
>>   	struct mutex balloon_lock;
>>   
>> +	/*
>> +	 * Buffer for PAGE_CHUNK_TYPE_BALLOON:
>> +	 * virtio_balloon_page_chunk_hdr +
>> +	 * virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
>> +	 */
>> +	struct virtio_balloon_page_chunk_hdr *balloon_page_chunk_hdr;
>> +	struct virtio_balloon_page_chunk *balloon_page_chunk;
>> +
>> +	/* Bitmap used to record pages */
>> +	unsigned long *page_bmap[PAGE_BMAP_COUNT_MAX];
>> +	/* Number of the allocated page_bmap */
>> +	unsigned int page_bmaps;
>> +
>> +	/*
>> +	 * The allocated page_bmap size may be smaller than the pfn range of
>> +	 * the ballooned pages. In this case, we need to use the page_bmap
>> +	 * multiple times to cover the entire pfn range. It's like using a
>> +	 * short ruler several times to finish measuring a long object.
>> +	 * The start location of the ruler in the next measurement is the end
>> +	 * location of the ruler in the previous measurement.
>> +	 *
>> +	 * pfn_max & pfn_min: forms the pfn range of the ballooned pages
>> +	 * pfn_start & pfn_stop: records the start and stop pfn in each cover
> cover? what does this mean?
>
> looks like you only use these to pass data to tell_host.
> so pass these as parameters and you won't need to keep
> them in this structure.
>
> And then you can move this comment to set_page_bmap where
> it belongs.
>
>> +	 */
>> +	unsigned long pfn_min, pfn_max, pfn_start, pfn_stop;
>> +
>>   	/* The array of pfns we tell the Host about. */
>>   	unsigned int num_pfns;
>>   	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
>> @@ -110,20 +144,201 @@ static void balloon_ack(struct virtqueue *vq)
>>   	wake_up(&vb->acked);
>>   }
>>   
>> -static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
>> +static inline void init_page_bmap_range(struct virtio_balloon *vb)
>> +{
>> +	vb->pfn_min = ULONG_MAX;
>> +	vb->pfn_max = 0;
>> +}
>> +
>> +static inline void update_page_bmap_range(struct virtio_balloon *vb,
>> +					  struct page *page)
>> +{
>> +	unsigned long balloon_pfn = page_to_balloon_pfn(page);
>> +
>> +	vb->pfn_min = min(balloon_pfn, vb->pfn_min);
>> +	vb->pfn_max = max(balloon_pfn, vb->pfn_max);
>> +}
>> +
>> +/* The page_bmap size is extended by adding more number of page_bmap */
> did you mean
>
> 	Allocate more bitmaps to cover the given number of pfns
> 	and add them to page_bmap
>
> ?
>
> This isn't what this function does.
> It blindly assumes 1 bitmap is allocated
> and allocates more, up to PAGE_BMAP_COUNT_MAX.
>

Please let me use a concrete analogy to explain this algorithm:
We have a 2-meter long ruler (i.e. page_bmap[0]).

Case 1:
To measure a  1-meter long object (i.e. pfn_max=1, pfn_min=0),
we can simply use the ruler once and get to know that the object
is 1-meter long.

Case 2:
To measure a 11-meter long object (i.e. pfn_max=11, pfn_min=0).
We will first see if we can extend the 2-meter long ruler, for example,
to 12-meter by getting another five 2-meter rulers and combine them
(i.e. extend_page_bmap_size() to allocate page_bmap[1],
page_bmap[2]...page_bmap[5]).
Case 2.1: If the length of the ruler is successfully extended to
                 12-meter, that is, we get a 12-meter long ruler, then we
                 can simply use the ruler once and know the length of the
                 object is 11-meter.
Case 2.2: If the ruler failed to be extended. Then we need to use the
                 2-meter long ruler 6 times to measure the 11-meter long
                 object:
                 1st time: pfn_start=0, pfn_stop=2;
                 2nd time: pfn_start=2, pfn_stop=4;
                 ..
                 6th time: pfn_start=10, pfn_stop=11
                 Still, we covered the entire length of the long object 
with the
                 short ruler that we have. But we used it 6 times (i.e. use
                 page_bmap[0], 6 times).

Based on the understanding of this analogy, I think the following
questions would be easier to understand.

>> +static void extend_page_bmap_size(struct virtio_balloon *vb,
>> +				  unsigned long pfns)
>> +{
>> +	int i, bmaps;
>> +	unsigned long bmap_len;
>> +
>> +	bmap_len = ALIGN(pfns, BITS_PER_LONG) / BITS_PER_BYTE;
>> +	bmap_len = ALIGN(bmap_len, PAGE_BMAP_SIZE);
> Align? PAGE_BMAP_SIZE doesn't even have to be a power of 2 ...

ThoughPAGE_BMAP_SIZE has been set to 32K in the implementation,
would you prefer to use roundup() here?


>> +	bmaps = min((int)(bmap_len / PAGE_BMAP_SIZE),
>> +		    PAGE_BMAP_COUNT_MAX);
> I got lost here.
>
> Please use things like ARRAY_SIZE instead of macros.

PAGE_BMAP_COUNT_MAX is the total amount of page_bmap[] that is
allowed to be allocated on demand. It is 32 in the implementation.

For example, if the calculation shows that it needs 100 page_bmap[],
but we can only afford 32, so use 32 for bmaps, instead of 100. The
the following implementation go through Case 2.2.



>> +
>> +	for (i = 1; i < bmaps; i++) {
>> +		vb->page_bmap[i] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
>> +		if (vb->page_bmap[i])
>> +			vb->page_bmaps++;
>> +		else
>> +			break;
>> +	}
>> +}
>> +
>> +static void free_extended_page_bmap(struct virtio_balloon *vb)
>> +{
>> +	int i, bmaps = vb->page_bmaps;
>> +
>> +	for (i = 1; i < bmaps; i++) {
>> +		kfree(vb->page_bmap[i]);
>> +		vb->page_bmap[i] = NULL;
>> +		vb->page_bmaps--;
>> +	}
>> +}
>> +
> What's the magic number 1 here?
> Maybe you want to document what is going on.
> Here's a guess:
>
> We keep a single bmap around at all times.
> If memory does not fit there, we allocate up to
> PAGE_BMAP_COUNT_MAX of chunks.
>

Right. By default, we have only 1 page_bmap[] allocated.


>> +static void free_page_bmap(struct virtio_balloon *vb)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < vb->page_bmaps; i++)
>> +		kfree(vb->page_bmap[i]);
>> +}
>> +
>> +static void clear_page_bmap(struct virtio_balloon *vb)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < vb->page_bmaps; i++)
>> +		memset(vb->page_bmap[i], 0, PAGE_BMAP_SIZE);
>> +}
>> +
>> +static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq,
>> +			     int type, bool busy_wait)
> busy_wait seems unused. pls drop.

It is used in the 5th patch for sending unused pages.
I can introduce it in that patch instead.

>
>>   {
>>   	struct scatterlist sg;
>> +	struct virtio_balloon_page_chunk_hdr *hdr;
>> +	void *buf;
>>   	unsigned int len;
>>   
>> -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
>> +	switch (type) {
>> +	case PAGE_CHUNK_TYPE_BALLOON:
>> +		hdr = vb->balloon_page_chunk_hdr;
>> +		len = 0;
>> +		break;
>> +	default:
>> +		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
>> +			 __func__, type);
>> +		return;
>> +	}
>>   
>> -	/* We should always be able to add one buffer to an empty queue. */
>> -	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>> -	virtqueue_kick(vq);
>> +	buf = (void *)hdr - len;
> Moving back to before the header? How can this make sense?
> It works fine since len is 0, so just buf = hdr.
>
The unused page chunk case follows its own protocol:
miscq_hdr + payload (chunk msg).
"buf = (void *)hdr - len" moves the buf pointer back to the miscq_hdr, so
that the entire miscq msg is sent.

Please check the patch that implements the unused page chunks;
it will be clear there. If necessary, I can move
"buf = (void *)hdr - len" into that patch.
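
The pointer arithmetic can be sketched in userspace C as below. The
struct layouts are hypothetical (field names and widths are assumptions
of this sketch, not the patch's uapi definitions); only the idea of a
prefix header sitting directly in front of the chunk header is taken
from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts, for illustration only. */
struct sketch_miscq_hdr {
	uint16_t cmd;
	uint16_t flags;
};

struct sketch_chunk_hdr {
	uint32_t chunks;	/* number of chunks in the payload */
};

struct sketch_page_chunk {
	uint64_t base;
	uint64_t size;
};

/*
 * Return where the message starts and store its length in *msg_len.
 * prefix_len is the number of bytes placed directly in front of the
 * chunk header: 0 for the balloon case, the miscq header size for the
 * unused page case.
 */
static void *msg_start(struct sketch_chunk_hdr *hdr, size_t prefix_len,
		       size_t *msg_len)
{
	assert(hdr != NULL && msg_len != NULL);

	*msg_len = prefix_len + sizeof(*hdr) +
		   hdr->chunks * sizeof(struct sketch_page_chunk);

	/* Step back over the prefix; a no-op when prefix_len is 0. */
	return (char *)hdr - prefix_len;
}
```

With prefix_len == 0 the returned pointer is simply hdr, which is the
balloon case in this patch.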


>> +	len += sizeof(struct virtio_balloon_page_chunk_hdr);
>> +	len += hdr->chunks * sizeof(struct virtio_balloon_page_chunk);
>> +	sg_init_table(&sg, 1);
>> +	sg_set_buf(&sg, buf, len);
>> +	if (!virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL)) {
>> +		virtqueue_kick(vq);
>> +		if (busy_wait)
>> +			while (!virtqueue_get_buf(vq, &len) &&
>> +			       !virtqueue_is_broken(vq))
>> +				cpu_relax();
>> +		else
>> +			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>> +		hdr->chunks = 0;
> Why zero it here after device used it? Better to zero before use.

hdr->chunks tells the host how many chunks there are in the payload.
Once the device has used it, it is safe to zero it.

>
>> +	}
>> +}
>> +
>> +static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
>> +			  int type, u64 base, u64 size)
> what are the units here? Looks like it's in 4kbyte units?

What "unit" are you referring to?
This function adds one chunk; the base pfn and the size of the chunk are
supplied to it.



>
>> +	if (hdr->chunks == MAX_PAGE_CHUNKS)
>> +		send_page_chunks(vb, vq, type, false);
> 		and zero chunks here?
>> +}
>> +
>> +static void chunking_pages_from_bmap(struct virtio_balloon *vb,
> Does this mean "convert_bmap_to_chunks"?
>

Yes.


>> +				     struct virtqueue *vq,
>> +				     unsigned long pfn_start,
>> +				     unsigned long *bmap,
>> +				     unsigned long len)
>> +{
>> +	unsigned long pos = 0, end = len * BITS_PER_BYTE;
>> +
>> +	while (pos < end) {
>> +		unsigned long one = find_next_bit(bmap, end, pos);
>> +
>> +		if (one < end) {
>> +			unsigned long chunk_size, zero;
>> +
>> +			zero = find_next_zero_bit(bmap, end, one + 1);
>
> zero and one are unhelpful names unless they equal 0 and 1.
> current/next?
>

I think it is clear if we think about the bitmap, for example:
00001111000011110000
one  = the position of the next "1" bit,
zero = the position of the next "0" bit, starting from one.

Then it is clear that chunk_size = zero - one.

Would it be better to use pos_0 and pos_1?

>> +			if (zero >= end)
>> +				chunk_size = end - one;
>> +			else
>> +				chunk_size = zero - one;
>> +
>> +			if (chunk_size)
>> +				add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
>> +					      pfn_start + one, chunk_size);
> Still not so what does a bit refer to? page or 4kbytes?
> I think it should be a page.
A bit in the bitmap corresponds to the pfn of a balloon page (4KB).
But I think it doesn't matter here, since it is a pfn.
Using the above example:
00001111000011110000

If the starting bit above corresponds to pfn 0x1000 (i.e. pfn_start),
then the chunk base = 0x1004
("one" is the position of the first set bit, which is 4, so
pfn_start + one = 0x1004).
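
The conversion from a bitmap to chunks can be sketched in userspace C
as below. This is an illustration only: it uses plain loops instead of
find_next_bit()/find_next_zero_bit(), and the names bmap_to_chunks,
sketch_run and bit_is_set are made up for the sketch.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

struct sketch_run {
	unsigned long base;	/* first pfn of the run */
	unsigned long size;	/* number of pfns in the run */
};

static int bit_is_set(const unsigned long *bmap, unsigned long nr)
{
	return (bmap[nr / SKETCH_BITS_PER_LONG] >>
		(nr % SKETCH_BITS_PER_LONG)) & 1UL;
}

/*
 * Walk the first nbits bits of bmap and record every run of set bits
 * as one chunk. Bit 0 corresponds to pfn_start. Returns the number of
 * runs found; at most max_runs of them are stored in out.
 */
static size_t bmap_to_chunks(const unsigned long *bmap, unsigned long nbits,
			     unsigned long pfn_start,
			     struct sketch_run *out, size_t max_runs)
{
	unsigned long pos = 0;
	size_t n = 0;

	assert(bmap != NULL);

	while (pos < nbits) {
		unsigned long one, zero;

		/* Position of the next "1" bit at or after pos. */
		for (one = pos; one < nbits && !bit_is_set(bmap, one); one++)
			;
		if (one == nbits)
			break;

		/* Position of the next "0" bit after the run has started. */
		for (zero = one + 1;
		     zero < nbits && bit_is_set(bmap, zero); zero++)
			;

		if (n < max_runs) {
			out[n].base = pfn_start + one;
			out[n].size = zero - one;
		}
		n++;
		pos = zero;
	}
	return n;
}
```

For the example bitmap 00001111000011110000 with pfn_start = 0x1000,
the sketch yields two chunks: base 0x1004 with size 4, and base 0x100c
with size 4.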

>> +static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
>> +{
>> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS)) {
>> +		int pfns, page_bmaps, i;
>> +		unsigned long pfn_start, pfns_len;
>> +
>> +		pfn_start = vb->pfn_start;
>> +		pfns = vb->pfn_stop - pfn_start + 1;
>> +		pfns = roundup(roundup(pfns, BITS_PER_LONG),
>> +			       PFNS_PER_PAGE_BMAP);
>> +		page_bmaps = pfns / PFNS_PER_PAGE_BMAP;
>> +		pfns_len = pfns / BITS_PER_BYTE;
>> +
>> +		for (i = 0; i < page_bmaps; i++) {
>> +			unsigned int bmap_len = PAGE_BMAP_SIZE;
>> +
>> +			/* The last one takes the leftover only */
> I don't understand what does this mean.
To use the ruler analogy again: the object is 11 meters long and we have
a 2-meter long ruler. After the 5th time, 10 meters of the object have
been covered. The last time, the leftover is 1 meter, which means that
half of the ruler is enough to cover it.

Back to the implementation here: if there are only 10 pfns left in the
last round, I think it's not necessary to search the entire page_bmap[]
till the end.

>> +static void set_page_bmap(struct virtio_balloon *vb,
>> +			  struct list_head *pages, struct virtqueue *vq)
>> +{
>> +	unsigned long pfn_start, pfn_stop;
>> +	struct page *page;
>> +	bool found;
>> +
>> +	vb->pfn_min = rounddown(vb->pfn_min, BITS_PER_LONG);
>> +	vb->pfn_max = roundup(vb->pfn_max, BITS_PER_LONG);
>> +
>> +	extend_page_bmap_size(vb, vb->pfn_max - vb->pfn_min + 1);
> This might not do anything in particular might not cover the
> given pfn range. Do we care? Why not?

By default, only 1 page_bmap[] is allocated, which is able to cover 1GB
of memory. To inflate 2GB, it will try to extend by getting one more
page_bmap, page_bmap[1].

>> +	pfn_start = vb->pfn_min;
>> +
>> +	while (pfn_start < vb->pfn_max) {
>> +		pfn_stop = pfn_start + PFNS_PER_PAGE_BMAP * vb->page_bmaps;
>> +		pfn_stop = pfn_stop < vb->pfn_max ? pfn_stop : vb->pfn_max;
>> +
>> +		vb->pfn_start = pfn_start;
>> +		clear_page_bmap(vb);
>> +		found = false;
>> +
>> +		list_for_each_entry(page, pages, lru) {
>> +			unsigned long bmap_idx, bmap_pos, balloon_pfn;
>> +
>> +			balloon_pfn = page_to_balloon_pfn(page);
>> +			if (balloon_pfn < pfn_start || balloon_pfn > pfn_stop)
>> +				continue;
>> +			bmap_idx = (balloon_pfn - pfn_start) /
>> +				   PFNS_PER_PAGE_BMAP;
>> +			bmap_pos = (balloon_pfn - pfn_start) %
>> +				   PFNS_PER_PAGE_BMAP;
>> +			set_bit(bmap_pos, vb->page_bmap[bmap_idx]);
> Looks like this will crash if bmap_idx is out of range or
> if page_bmap allocation failed.

No, it won't. Please think about Case 2.2 of the analogy: pfn_start is
updated in each round. For example, in the 2nd round pfn_start is updated
to 2 and balloon_pfn will be a value between 2 and 4, so the result of
"(balloon_pfn - pfn_start) / PFNS_PER_PAGE_BMAP" will always be 0 when
we only have page_bmap[0].
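
The index calculation can be sketched in userspace C as below. The
helper name and its return convention are made up for the sketch, and
the window is treated as half-open, [pfn_start, pfn_stop), which is an
assumption of the sketch rather than a statement about the hunk above.

```c
#include <assert.h>
#include <stddef.h>

/* Bits in one 32K bitmap (assumed size, as mentioned in this thread). */
#define SKETCH_PFNS_PER_PAGE_BMAP (32UL * 1024 * 8)

/*
 * Map balloon_pfn to (bitmap index, bit position) relative to the start
 * of the current round. Returns 0 on success, or -1 if the pfn lies
 * outside [pfn_start, pfn_start + nr_bmaps * SKETCH_PFNS_PER_PAGE_BMAP).
 */
static int pfn_to_bmap_slot(unsigned long balloon_pfn,
			    unsigned long pfn_start, unsigned long nr_bmaps,
			    unsigned long *bmap_idx, unsigned long *bmap_pos)
{
	unsigned long off;

	assert(bmap_idx != NULL && bmap_pos != NULL);

	if (balloon_pfn < pfn_start)
		return -1;
	off = balloon_pfn - pfn_start;
	if (off >= nr_bmaps * SKETCH_PFNS_PER_PAGE_BMAP)
		return -1;

	/* off is below nr_bmaps bitmaps' worth of bits, so idx < nr_bmaps. */
	*bmap_idx = off / SKETCH_PFNS_PER_PAGE_BMAP;
	*bmap_pos = off % SKETCH_PFNS_PER_PAGE_BMAP;
	return 0;
}
```

With a single bitmap, every pfn accepted by the range check maps to
index 0, because its offset from pfn_start is smaller than the number
of bits in one bitmap.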

>   
>   #ifdef CONFIG_BALLOON_COMPACTION
> +
> +static void tell_host_one_page(struct virtio_balloon *vb,
> +			       struct virtqueue *vq, struct page *page)
> +{
> +	add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON, page_to_pfn(page), 1);
> This passes 4kbytes to host which seems wrong - I think you want a full page.

OK. It should be
add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
                           page_to_pfn(page), VIRTIO_BALLOON_PAGES_PER_PAGE)

right?

If Page=2*BalloonPage, it will pass 2*4K to the host.
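
The relation between a guest page and 4KB balloon pages can be sketched
in userspace C as below. The helper names are made up for the sketch;
the shift of 12 corresponds to the 4KB balloon page size mentioned
above.

```c
#include <assert.h>

#define SKETCH_BALLOON_PFN_SHIFT 12	/* balloon pages are 4KB */

/* Number of 4KB balloon pages in one guest page of page_size bytes. */
static unsigned long balloon_pages_per_page(unsigned long page_size)
{
	assert(page_size >= (1UL << SKETCH_BALLOON_PFN_SHIFT));
	return page_size >> SKETCH_BALLOON_PFN_SHIFT;
}

/* Bytes described by a chunk whose size is counted in balloon pages. */
static unsigned long chunk_bytes(unsigned long chunk_size)
{
	return chunk_size << SKETCH_BALLOON_PFN_SHIFT;
}
```

With a 4KB guest page the chunk size is 1 and 4KB is passed to the
host; with an 8KB guest page the chunk size is 2 and 2*4K is passed.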

+static void balloon_page_chunk_init(struct virtio_balloon *vb)
+{
+	void *buf;
+
+	/*
+	 * By default, we allocate page_bmap[0] only. More page_bmap will be
+	 * allocated on demand.
+	 */
+	vb->page_bmap[0] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
+	buf = kmalloc(sizeof(struct virtio_balloon_page_chunk_hdr) +
+		      sizeof(struct virtio_balloon_page_chunk) *
+		      MAX_PAGE_CHUNKS, GFP_KERNEL);
+	if (!vb->page_bmap[0] || !buf) {
+		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS);

> this doesn't work as expected as features has been OK'd by then.
> You want something like
> validate_features that I posted. See
> "virtio: allow drivers to validate features".

OK. I will change it after that patch is merged.

>
>> +		kfree(vb->page_bmap[0]);
> Looks like this will double free. you want to zero them I think.
>

OK. I'll NULL the pointers after kfree().
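
The pattern can be sketched in userspace C as below, with free()
standing in for kfree() (both accept a NULL pointer and do nothing in
that case). The helper name is made up for the sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* Free *slot and clear it, so that freeing the same slot again is a no-op. */
static void free_and_clear(void **slot)
{
	assert(slot != NULL);
	free(*slot);	/* free(NULL) is defined to do nothing */
	*slot = NULL;
}
```

After the first call the slot holds NULL, so a later cleanup path that
frees the same slot again does not touch already freed memory.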



Best,
Wei

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [Qemu-devel] [virtio-dev] Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
@ 2017-04-14  8:37       ` Wei Wang
  0 siblings, 0 replies; 136+ messages in thread
From: Wei Wang @ 2017-04-14  8:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On 04/14/2017 12:34 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 13, 2017 at 05:35:05PM +0800, Wei Wang wrote:
>
> So we don't need the bitmap to talk to host, it is just
> a data structure we chose to maintain lists of pages, right?
Right. bitmap is the way to gather pages to chunk.
It's only needed in the balloon page case.
For the unused page case, we don't need it, since the free
page blocks are already chunks.

> OK as far as it goes but you need much better isolation for it.
> Build a data structure with APIs such as _init, _cleanup, _add, _clear,
> _find_first, _find_next.
> Completely unrelated to pages, it just maintains bits.
> Then use it here.
>
>
>>   static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
>>   module_param(oom_pages, int, S_IRUSR | S_IWUSR);
>>   MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>> @@ -50,6 +54,10 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>>   static struct vfsmount *balloon_mnt;
>>   #endif
>>   
>> +/* Types of pages to chunk */
>> +#define PAGE_CHUNK_TYPE_BALLOON 0
>> +
> Doesn't look like you are ever adding more types in this
> patchset.  Pls keep code simple, generalize it later.
>
"#define PAGE_CHUNK_TYPE_UNUSED 1" is added in another patch.

Types of page to chunk are treated differently. Different types of page
chunks are sent to the host via different protocols.

1) PAGE_CHUNK_TYPE_BALLOON: Ballooned (i.e. inflated/deflated) pages
to chunk.  For the ballooned type, it uses the basic chunk msg format:

virtio_balloon_page_chunk_hdr +
virtio_balloon_page_chunk * MAX_PAGE_CHUNKS

2) PAGE_CHUNK_TYPE_UNUSED: unused pages to chunk. It uses this miscq msg
format:
miscq_hdr +
virtio_balloon_page_chunk_hdr +
virtio_balloon_page_chunk * MAX_PAGE_CHUNKS

The chunk msg is actually the payload of the miscq msg.



>> +#define MAX_PAGE_CHUNKS 4096
> This is an order-4 allocation. I'd make it 4095 and then it's
> an order-3 one.

Sounds good, thanks.
I think it would be better to make it 4090. Leave some space for the hdr
as well.

>
>>   struct virtio_balloon {
>>   	struct virtio_device *vdev;
>>   	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
>> @@ -78,6 +86,32 @@ struct virtio_balloon {
>>   	/* Synchronize access/update to this struct virtio_balloon elements */
>>   	struct mutex balloon_lock;
>>   
>> +	/*
>> +	 * Buffer for PAGE_CHUNK_TYPE_BALLOON:
>> +	 * virtio_balloon_page_chunk_hdr +
>> +	 * virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
>> +	 */
>> +	struct virtio_balloon_page_chunk_hdr *balloon_page_chunk_hdr;
>> +	struct virtio_balloon_page_chunk *balloon_page_chunk;
>> +
>> +	/* Bitmap used to record pages */
>> +	unsigned long *page_bmap[PAGE_BMAP_COUNT_MAX];
>> +	/* Number of the allocated page_bmap */
>> +	unsigned int page_bmaps;
>> +
>> +	/*
>> +	 * The allocated page_bmap size may be smaller than the pfn range of
>> +	 * the ballooned pages. In this case, we need to use the page_bmap
>> +	 * multiple times to cover the entire pfn range. It's like using a
>> +	 * short ruler several times to finish measuring a long object.
>> +	 * The start location of the ruler in the next measurement is the end
>> +	 * location of the ruler in the previous measurement.
>> +	 *
>> +	 * pfn_max & pfn_min: forms the pfn range of the ballooned pages
>> +	 * pfn_start & pfn_stop: records the start and stop pfn in each cover
> cover? what does this mean?
>
> looks like you only use these to pass data to tell_host.
> so pass these as parameters and you won't need to keep
> them in this structure.
>
> And then you can move this comment to set_page_bmap where
> it belongs.
>
>> +	 */
>> +	unsigned long pfn_min, pfn_max, pfn_start, pfn_stop;
>> +
>>   	/* The array of pfns we tell the Host about. */
>>   	unsigned int num_pfns;
>>   	__virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
>> @@ -110,20 +144,201 @@ static void balloon_ack(struct virtqueue *vq)
>>   	wake_up(&vb->acked);
>>   }
>>   
>> -static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
>> +static inline void init_page_bmap_range(struct virtio_balloon *vb)
>> +{
>> +	vb->pfn_min = ULONG_MAX;
>> +	vb->pfn_max = 0;
>> +}
>> +
>> +static inline void update_page_bmap_range(struct virtio_balloon *vb,
>> +					  struct page *page)
>> +{
>> +	unsigned long balloon_pfn = page_to_balloon_pfn(page);
>> +
>> +	vb->pfn_min = min(balloon_pfn, vb->pfn_min);
>> +	vb->pfn_max = max(balloon_pfn, vb->pfn_max);
>> +}
>> +
>> +/* The page_bmap size is extended by adding more number of page_bmap */
> did you mean
>
> 	Allocate more bitmaps to cover the given number of pfns
> 	and add them to page_bmap
>
> ?
>
> This isn't what this function does.
> It blindly assumes 1 bitmap is allocated
> and allocates more, up to PAGE_BMAP_COUNT_MAX.
>

Please let me use a concrete analogy to explain this algorithm:
We have a 2-meter long ruler (i.e. page_bmap[0]).

Case 1:
To measure a 1-meter long object (i.e. pfn_max=1, pfn_min=0),
we can simply use the ruler once and find that the object
is 1 meter long.

Case 2:
To measure an 11-meter long object (i.e. pfn_max=11, pfn_min=0),
we first see if we can extend the 2-meter long ruler, for example,
to 12 meters by getting another five 2-meter rulers and combining them
(i.e. extend_page_bmap_size() allocates page_bmap[1],
page_bmap[2]...page_bmap[5]).
Case 2.1: If the ruler is successfully extended to 12 meters, that is,
                 we get a 12-meter long ruler, then we can simply use the
                 ruler once and know that the object is 11 meters long.
Case 2.2: If the ruler fails to be extended, then we need to use the
                 2-meter long ruler 6 times to measure the 11-meter long
                 object:
                 1st time: pfn_start=0, pfn_stop=2;
                 2nd time: pfn_start=2, pfn_stop=4;
                 ..
                 6th time: pfn_start=10, pfn_stop=11
                 Still, we cover the entire length of the long object with
                 the short ruler that we have, but we use it 6 times
                 (i.e. we use page_bmap[0] 6 times).

Based on the understanding of this analogy, I think the following
questions would be easier to understand.
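
To make the analogy concrete, here is a rough standalone userspace
sketch (not the patch's code; plan_passes() and struct pass are made-up
names) that splits a pfn range into passes of at most "capacity" pfns,
where capacity stands for what the allocated bitmaps can track at once:

```c
#include <assert.h>
#include <stddef.h>

/* One use of the "ruler": covers pfns in [start, stop). */
struct pass {
	unsigned long start;
	unsigned long stop;
};

/*
 * Split [pfn_min, pfn_max) into passes of at most `capacity` pfns.
 * Returns the number of passes needed; only the first `max_out`
 * passes are stored in `out`.
 */
static size_t plan_passes(unsigned long pfn_min, unsigned long pfn_max,
			  unsigned long capacity,
			  struct pass *out, size_t max_out)
{
	size_t n = 0;
	unsigned long start = pfn_min;

	assert(capacity > 0);

	while (start < pfn_max) {
		unsigned long stop;

		/* Clamp the last pass to the end of the range. */
		if (pfn_max - start < capacity)
			stop = pfn_max;
		else
			stop = start + capacity;

		if (n < max_out) {
			out[n].start = start;
			out[n].stop = stop;
		}
		n++;
		start = stop;	/* next pass begins where this one ended */
	}
	return n;
}
```

With pfn_min=0, pfn_max=11 and capacity=2 this yields the 6 passes of
Case 2.2; with capacity=12 it yields the single pass of Case 2.1.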

>> +static void extend_page_bmap_size(struct virtio_balloon *vb,
>> +				  unsigned long pfns)
>> +{
>> +	int i, bmaps;
>> +	unsigned long bmap_len;
>> +
>> +	bmap_len = ALIGN(pfns, BITS_PER_LONG) / BITS_PER_BYTE;
>> +	bmap_len = ALIGN(bmap_len, PAGE_BMAP_SIZE);
> Align? PAGE_BMAP_SIZE doesn't even have to be a power of 2 ...

Though PAGE_BMAP_SIZE has been set to 32K in the implementation,
would you prefer to use roundup() here?


>> +	bmaps = min((int)(bmap_len / PAGE_BMAP_SIZE),
>> +		    PAGE_BMAP_COUNT_MAX);
> I got lost here.
>
> Please use things like ARRAY_SIZE instead of macros.

PAGE_BMAP_COUNT_MAX is the maximum number of page_bmap[] entries that
are allowed to be allocated on demand. It is 32 in the implementation.

For example, if the calculation shows that 100 page_bmap[] entries are
needed but we can only afford 32, bmaps is set to 32 instead of 100. The
implementation that follows then goes through Case 2.2.
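
A minimal userspace sketch of that calculation, assuming the 32K bitmap
size and the limit of 32 mentioned above (round_up_to() and
bmaps_wanted() are hypothetical helper names; the rounding is done by
division so it does not depend on the size being a power of 2):

```c
#include <assert.h>

#define BITS_PER_BYTE		8UL
#define PAGE_BMAP_SIZE		(32UL * 1024)	/* bytes per bitmap (from the thread) */
#define PAGE_BMAP_COUNT_MAX	32UL		/* cap on bitmaps (from the thread) */
#define PFNS_PER_PAGE_BMAP	(PAGE_BMAP_SIZE * BITS_PER_BYTE)

/* Round n up to a multiple of m; m does not have to be a power of 2. */
static unsigned long round_up_to(unsigned long n, unsigned long m)
{
	assert(m > 0);
	return ((n + m - 1) / m) * m;
}

/* Number of bitmaps to aim for when covering `pfns` page frames. */
static unsigned long bmaps_wanted(unsigned long pfns)
{
	unsigned long n;

	n = round_up_to(pfns, PFNS_PER_PAGE_BMAP) / PFNS_PER_PAGE_BMAP;

	/* Never ask for more than the fixed-size array can hold. */
	return n < PAGE_BMAP_COUNT_MAX ? n : PAGE_BMAP_COUNT_MAX;
}
```

A request that would need 100 bitmaps is clamped to 32 here, and the
remaining range has to be handled by reusing the bitmaps (Case 2.2).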



>> +
>> +	for (i = 1; i < bmaps; i++) {
>> +		vb->page_bmap[i] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
>> +		if (vb->page_bmap[i])
>> +			vb->page_bmaps++;
>> +		else
>> +			break;
>> +	}
>> +}
>> +
>> +static void free_extended_page_bmap(struct virtio_balloon *vb)
>> +{
>> +	int i, bmaps = vb->page_bmaps;
>> +
>> +	for (i = 1; i < bmaps; i++) {
>> +		kfree(vb->page_bmap[i]);
>> +		vb->page_bmap[i] = NULL;
>> +		vb->page_bmaps--;
>> +	}
>> +}
>> +
> What's the magic number 1 here?
> Maybe you want to document what is going on.
> Here's a guess:
>
> We keep a single bmap around at all times.
> If memory does not fit there, we allocate up to
> PAGE_BMAP_COUNT_MAX of chunks.
>

Right. By default, we have only 1 page_bmap[] allocated.


>> +static void free_page_bmap(struct virtio_balloon *vb)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < vb->page_bmaps; i++)
>> +		kfree(vb->page_bmap[i]);
>> +}
>> +
>> +static void clear_page_bmap(struct virtio_balloon *vb)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < vb->page_bmaps; i++)
>> +		memset(vb->page_bmap[i], 0, PAGE_BMAP_SIZE);
>> +}
>> +
>> +static void send_page_chunks(struct virtio_balloon *vb, struct virtqueue *vq,
>> +			     int type, bool busy_wait)
> busy_wait seems unused. pls drop.

It will be used in the other patch (the 5th patch) for sending unused pages.
I can probably add it in that patch instead.

>
>>   {
>>   	struct scatterlist sg;
>> +	struct virtio_balloon_page_chunk_hdr *hdr;
>> +	void *buf;
>>   	unsigned int len;
>>   
>> -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
>> +	switch (type) {
>> +	case PAGE_CHUNK_TYPE_BALLOON:
>> +		hdr = vb->balloon_page_chunk_hdr;
>> +		len = 0;
>> +		break;
>> +	default:
>> +		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
>> +			 __func__, type);
>> +		return;
>> +	}
>>   
>> -	/* We should always be able to add one buffer to an empty queue. */
>> -	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>> -	virtqueue_kick(vq);
>> +	buf = (void *)hdr - len;
> Moving back to before the header? How can this make sense?
> It works fine since len is 0, so just buf = hdr.
>
For the unused page chunk case, it follows its own protocol:
miscq_hdr + payload (chunk msg).
"buf = (void *)hdr - len" moves the buf pointer back to the miscq_hdr, so
that the entire miscq msg is sent.

Please check the patch that implements the unused page chunks, and
it will be clear. If necessary, I can move "buf = (void *)hdr - len" into
that patch.
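
As a rough illustration of the pointer arithmetic, here is a userspace
sketch in which an optional prefix header sits directly in front of the
chunk header in one buffer. All struct layouts and names below (demo_*)
are invented for the example and are not the structures from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts, for illustration only. */
struct demo_miscq_hdr { uint16_t cmd; uint16_t flags; };
struct demo_chunk_hdr { uint32_t chunks; };
struct demo_chunk     { uint64_t base; uint64_t size; };

/* A message whose chunk header is preceded by a miscq header. */
struct demo_msg {
	struct demo_miscq_hdr miscq;
	struct demo_chunk_hdr hdr;
	struct demo_chunk chunk[4];
};

/* Start of the buffer to send: step back over the prefix, if any. */
static void *msg_start(struct demo_chunk_hdr *hdr, size_t prefix_len)
{
	assert(hdr != NULL);
	return (char *)hdr - prefix_len;
}

/* Total bytes to send: prefix + chunk header + the chunks in use. */
static size_t msg_len(const struct demo_chunk_hdr *hdr, size_t prefix_len)
{
	assert(hdr != NULL);
	return prefix_len + sizeof(*hdr) +
	       hdr->chunks * sizeof(struct demo_chunk);
}
```

With prefix_len == 0 (the ballooned page case) the buffer starts at the
chunk header itself; with a non-zero prefix it starts at the header in
front of it.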


>> +	len += sizeof(struct virtio_balloon_page_chunk_hdr);
>> +	len += hdr->chunks * sizeof(struct virtio_balloon_page_chunk);
>> +	sg_init_table(&sg, 1);
>> +	sg_set_buf(&sg, buf, len);
>> +	if (!virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL)) {
>> +		virtqueue_kick(vq);
>> +		if (busy_wait)
>> +			while (!virtqueue_get_buf(vq, &len) &&
>> +			       !virtqueue_is_broken(vq))
>> +				cpu_relax();
>> +		else
>> +			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>> +		hdr->chunks = 0;
> Why zero it here after device used it? Better to zero before use.

hdr->chunks tells the host how many chunks there are in the payload.
After the device has used it, it is safe to zero it.

>
>> +	}
>> +}
>> +
>> +static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
>> +			  int type, u64 base, u64 size)
> what are the units here? Looks like it's in 4kbyte units?

Which "unit" are you referring to?
This is the function to add one chunk; the base pfn and the size of the
chunk are supplied to the function.



>
>> +	if (hdr->chunks == MAX_PAGE_CHUNKS)
>> +		send_page_chunks(vb, vq, type, false);
> 		and zero chunks here?
>> +}
>> +
>> +static void chunking_pages_from_bmap(struct virtio_balloon *vb,
> Does this mean "convert_bmap_to_chunks"?
>

Yes.


>> +				     struct virtqueue *vq,
>> +				     unsigned long pfn_start,
>> +				     unsigned long *bmap,
>> +				     unsigned long len)
>> +{
>> +	unsigned long pos = 0, end = len * BITS_PER_BYTE;
>> +
>> +	while (pos < end) {
>> +		unsigned long one = find_next_bit(bmap, end, pos);
>> +
>> +		if (one < end) {
>> +			unsigned long chunk_size, zero;
>> +
>> +			zero = find_next_zero_bit(bmap, end, one + 1);
>
> zero and one are unhelpful names unless they equal 0 and 1.
> current/next?
>

I think it is clear if we think about the bitmap, for example:
00001111000011110000
one = the position of the next "1" bit,
zero = the position of the next "0" bit, starting from one.

Then it is clear that chunk_size = zero - one.

Would it be better to use pos_0 and pos_1?

>> +			if (zero >= end)
>> +				chunk_size = end - one;
>> +			else
>> +				chunk_size = zero - one;
>> +
>> +			if (chunk_size)
>> +				add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
>> +					      pfn_start + one, chunk_size);
> Still not so what does a bit refer to? page or 4kbytes?
> I think it should be a page.
A bit in the bitmap corresponds to the pfn of a balloon page (4KB).
But I think it doesn't matter here, since it is a pfn.
Using the above example:
00001111000011110000

If the starting bit above corresponds to pfn 0x1000 (i.e. pfn_start),
then the chunk base = 0x1004
(one is the position of the first set bit, which is 4), so
pfn_start + one = 0x1004.
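
The example can be worked through with a small standalone sketch. It
scans the bitmap with plain loops rather than the kernel's
find_next_bit()/find_next_zero_bit(), and bmap_to_chunks() and struct
chunk are names made up for this illustration:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

struct chunk {
	unsigned long base;	/* first pfn of the run */
	unsigned long size;	/* number of consecutive pfns */
};

static int bit_is_set(const unsigned long *bmap, unsigned long pos)
{
	return (int)((bmap[pos / BITS_PER_LONG] >> (pos % BITS_PER_LONG)) & 1UL);
}

/*
 * Convert runs of set bits in bmap[0..nbits) into chunks. Bit 0
 * corresponds to pfn_start. Returns the number of runs found; only
 * the first `max_out` are stored in `out`.
 */
static size_t bmap_to_chunks(const unsigned long *bmap, unsigned long nbits,
			     unsigned long pfn_start,
			     struct chunk *out, size_t max_out)
{
	size_t n = 0;
	unsigned long pos = 0;

	assert(bmap != NULL);

	while (pos < nbits) {
		unsigned long one, zero;

		/* Position of the next "1" bit at or after pos. */
		for (one = pos; one < nbits && !bit_is_set(bmap, one); one++)
			;
		if (one == nbits)
			break;

		/* Position of the next "0" bit after `one`, or nbits. */
		for (zero = one + 1; zero < nbits && bit_is_set(bmap, zero); zero++)
			;

		if (n < max_out) {
			out[n].base = pfn_start + one;
			out[n].size = zero - one;
		}
		n++;
		pos = zero;
	}
	return n;
}
```

For the bitmap 00001111000011110000 (bit 0 on the left, i.e. the word
value 0xF0F0) and pfn_start = 0x1000, this produces two chunks of 4
pfns each, with bases 0x1004 and 0x100c.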

>> +static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
>> +{
>> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS)) {
>> +		int pfns, page_bmaps, i;
>> +		unsigned long pfn_start, pfns_len;
>> +
>> +		pfn_start = vb->pfn_start;
>> +		pfns = vb->pfn_stop - pfn_start + 1;
>> +		pfns = roundup(roundup(pfns, BITS_PER_LONG),
>> +			       PFNS_PER_PAGE_BMAP);
>> +		page_bmaps = pfns / PFNS_PER_PAGE_BMAP;
>> +		pfns_len = pfns / BITS_PER_BYTE;
>> +
>> +		for (i = 0; i < page_bmaps; i++) {
>> +			unsigned int bmap_len = PAGE_BMAP_SIZE;
>> +
>> +			/* The last one takes the leftover only */
> I don't understand what does this mean.
Still using the ruler analogy here: the object is 11 meters long, and we have
a 2-meter ruler. After the 5th time, 10 meters of the object have been covered.
The last time, the leftover is 1 meter, which means we can use half of the
ruler to cover the remaining 1 meter.

Back to the implementation here: if there are only 10 pfns left in the last
round, I think it's not necessary to search the entire page_bmap[] to the end.
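
A hedged sketch of that idea, assuming the 32K bitmap size from above
and rounding the leftover up to whole longs (scan_bytes() is a
hypothetical helper, not a function from the patch):

```c
#include <limits.h>

#define PAGE_BMAP_SIZE		(32UL * 1024)	/* bytes per bitmap (from the thread) */
#define PFNS_PER_PAGE_BMAP	(PAGE_BMAP_SIZE * CHAR_BIT)
#define BITS_PER_LONG		(sizeof(unsigned long) * CHAR_BIT)

/*
 * Number of bitmap bytes worth searching when `pfns_left` page frames
 * remain to be covered by the current bitmap.
 */
static unsigned long scan_bytes(unsigned long pfns_left)
{
	unsigned long bits;

	/* A full bitmap's worth (or more) remains: search all of it. */
	if (pfns_left >= PFNS_PER_PAGE_BMAP)
		return PAGE_BMAP_SIZE;

	/* Otherwise only the leftover, rounded up to whole longs. */
	bits = (pfns_left + BITS_PER_LONG - 1) / BITS_PER_LONG * BITS_PER_LONG;
	return bits / CHAR_BIT;
}
```

With 10 pfns left, only one long needs to be searched instead of the
whole 32K bitmap.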

>> +static void set_page_bmap(struct virtio_balloon *vb,
>> +			  struct list_head *pages, struct virtqueue *vq)
>> +{
>> +	unsigned long pfn_start, pfn_stop;
>> +	struct page *page;
>> +	bool found;
>> +
>> +	vb->pfn_min = rounddown(vb->pfn_min, BITS_PER_LONG);
>> +	vb->pfn_max = roundup(vb->pfn_max, BITS_PER_LONG);
>> +
>> +	extend_page_bmap_size(vb, vb->pfn_max - vb->pfn_min + 1);
> This might not do anything in particular might not cover the
> given pfn range. Do we care? Why not?

We have allocated only 1 page_bmap[], which is able to cover 1GB of memory.
To inflate 2GB, it will try to extend by getting one more page_bmap,
page_bmap[1].

>> +	pfn_start = vb->pfn_min;
>> +
>> +	while (pfn_start < vb->pfn_max) {
>> +		pfn_stop = pfn_start + PFNS_PER_PAGE_BMAP * vb->page_bmaps;
>> +		pfn_stop = pfn_stop < vb->pfn_max ? pfn_stop : vb->pfn_max;
>> +
>> +		vb->pfn_start = pfn_start;
>> +		clear_page_bmap(vb);
>> +		found = false;
>> +
>> +		list_for_each_entry(page, pages, lru) {
>> +			unsigned long bmap_idx, bmap_pos, balloon_pfn;
>> +
>> +			balloon_pfn = page_to_balloon_pfn(page);
>> +			if (balloon_pfn < pfn_start || balloon_pfn > pfn_stop)
>> +				continue;
>> +			bmap_idx = (balloon_pfn - pfn_start) /
>> +				   PFNS_PER_PAGE_BMAP;
>> +			bmap_pos = (balloon_pfn - pfn_start) %
>> +				   PFNS_PER_PAGE_BMAP;
>> +			set_bit(bmap_pos, vb->page_bmap[bmap_idx]);
> Looks like this will crash if bmap_idx is out of range or
> if page_bmap allocation failed.

No, it won't. Please think about Case 2.2 of the analogy: pfn_start is
updated in each round. For example, in the 2nd round pfn_start is updated
to 2 and balloon_pfn will be a value between 2 and 4, so the result of
"(balloon_pfn - pfn_start) / PFNS_PER_PAGE_BMAP" will always be 0 when
we only have page_bmap[0].
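
The index calculation can be sketched in userspace as follows
(pfn_to_slot() is a made-up name, and the 32K bitmap size is the one
mentioned earlier in the thread). Note that the guarantee relies on the
window excluding its upper bound: a pfn equal to
pfn_start + nr_bmaps * PFNS_PER_PAGE_BMAP would map to index nr_bmaps,
one past the allocated bitmaps, so the sketch rejects it explicitly:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PFNS_PER_PAGE_BMAP (32UL * 1024 * 8)	/* bits in one 32K bitmap */

/*
 * Map a pfn to (bitmap index, bit position) within the window that
 * starts at pfn_start and spans nr_bmaps bitmaps. Returns false if
 * the pfn lies outside [pfn_start, pfn_start + nr_bmaps * PFNS_PER_PAGE_BMAP).
 */
static bool pfn_to_slot(unsigned long pfn, unsigned long pfn_start,
			unsigned long nr_bmaps,
			unsigned long *idx, unsigned long *pos)
{
	unsigned long off;

	assert(idx != NULL && pos != NULL);

	if (pfn < pfn_start)
		return false;

	off = pfn - pfn_start;
	if (off / PFNS_PER_PAGE_BMAP >= nr_bmaps)
		return false;	/* beyond the bitmaps we actually have */

	*idx = off / PFNS_PER_PAGE_BMAP;
	*pos = off % PFNS_PER_PAGE_BMAP;
	return true;
}
```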

>   
>   #ifdef CONFIG_BALLOON_COMPACTION
> +
> +static void tell_host_one_page(struct virtio_balloon *vb,
> +			       struct virtqueue *vq, struct page *page)
> +{
> +	add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON, page_to_pfn(page), 1);
> This passes 4kbytes to host which seems wrong - I think you want a full page.

OK. It should be
add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
                           page_to_pfn(page), VIRTIO_BALLOON_PAGES_PER_PAGE)

right?

If Page=2*BalloonPage, it will pass 2*4K to the host.
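
For reference, a small sketch of that arithmetic, assuming the 4KB
balloon page size (VIRTIO_BALLOON_PFN_SHIFT of 12); the two helper
functions are made up for the example:

```c
#define VIRTIO_BALLOON_PFN_SHIFT 12	/* balloon pages are 4KB */

/* How many 4KB balloon pages make up one guest page of `page_size` bytes. */
static unsigned long balloon_pages_per_page(unsigned long page_size)
{
	return page_size >> VIRTIO_BALLOON_PFN_SHIFT;
}

/* Bytes described by a chunk of `nr_balloon_pages` balloon pages. */
static unsigned long chunk_bytes(unsigned long nr_balloon_pages)
{
	return nr_balloon_pages << VIRTIO_BALLOON_PFN_SHIFT;
}
```

With an 8KB guest page, one guest page is 2 balloon pages, so the chunk
covers 2 * 4KB = 8KB.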

+static void balloon_page_chunk_init(struct virtio_balloon *vb)
+{
+	void *buf;
+
+	/*
+	 * By default, we allocate page_bmap[0] only. More page_bmap will be
+	 * allocated on demand.
+	 */
+	vb->page_bmap[0] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
+	buf = kmalloc(sizeof(struct virtio_balloon_page_chunk_hdr) +
+		      sizeof(struct virtio_balloon_page_chunk) *
+		      MAX_PAGE_CHUNKS, GFP_KERNEL);
+	if (!vb->page_bmap[0] || !buf) {
+		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS);

> this doesn't work as expected as features has been OK'd by then.
> You want something like
> validate_features that I posted. See
> "virtio: allow drivers to validate features".

OK. I will change it after that patch is merged.

>
>> +		kfree(vb->page_bmap[0]);
> Looks like this will double free. you want to zero them I think.
>

OK. I'll NULL the pointers after kfree().
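
The idiom, sketched in userspace with free() standing in for kfree()
(free_and_clear() is a hypothetical helper name; like kfree(), free()
accepts NULL and does nothing):

```c
#include <assert.h>
#include <stdlib.h>

/* Free *p and clear it, so that a later free of the same slot is harmless. */
static void free_and_clear(void **p)
{
	assert(p != NULL);
	free(*p);	/* free(NULL) is a no-op */
	*p = NULL;
}
```

Calling it twice on the same pointer is then safe, since the second
call only sees NULL.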



Best,
Wei

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v9 3/5] mm: function to offer a page block on the free list
  2017-04-14  2:58         ` Matthew Wilcox
  (?)
@ 2017-04-14  8:58           ` Wei Wang
  -1 siblings, 0 replies; 136+ messages in thread
From: Wei Wang @ 2017-04-14  8:58 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, virtio-dev, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, mst, david, dave.hansen,
	cornelia.huck, mgorman, aarcange, amit.shah, pbonzini,
	liliang.opensource

On 04/14/2017 10:58 AM, Matthew Wilcox wrote:
> On Fri, Apr 14, 2017 at 10:30:27AM +0800, Wei Wang wrote:
>> OK. What do you think if we add this:
>>
>> #if defined(CONFIG_VIRTIO_BALLOON) || defined(CONFIG_VIRTIO_BALLOON_MODULE)
> That's spelled "IS_ENABLED(CONFIG_VIRTIO_BALLOON)", FYI.

Right, thanks.
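
For anyone reading along, a userspace sketch of how such a check can
work. Kconfig defines CONFIG_FOO as 1 for a built-in option and
CONFIG_FOO_MODULE as 1 for a modular one; the DEMO_* macros below are
my own simplified imitation of the kernel's kconfig.h trick, and
CONFIG_DEMO_BALLOON is an invented option name:

```c
/* Expands to "0," only when pasted with a macro that is defined as 1. */
#define DEMO_ARG_PLACEHOLDER_1 0,
#define DEMO_TAKE_SECOND(ignored, val, ...) val

/* 1 if x is defined as 1, otherwise 0. */
#define DEMO_IS_DEFINED(x)	DEMO_IS_DEFINED_(x)
#define DEMO_IS_DEFINED_(val)	DEMO_IS_DEFINED__(DEMO_ARG_PLACEHOLDER_##val)
#define DEMO_IS_DEFINED__(arg1_or_junk) \
	DEMO_TAKE_SECOND(arg1_or_junk 1, 0, 0)

/* True if the option is built in or built as a module. */
#define DEMO_IS_ENABLED(option) \
	(DEMO_IS_DEFINED(option) || DEMO_IS_DEFINED(option##_MODULE))

/* Pretend the option was configured as a module. */
#define CONFIG_DEMO_BALLOON_MODULE 1

static int demo_balloon_hook_available(void)
{
#if DEMO_IS_ENABLED(CONFIG_DEMO_BALLOON)
	return 1;
#else
	return 0;
#endif
}
```

The single DEMO_IS_ENABLED() test replaces the two defined() checks and
can be used both in #if and in ordinary C expressions.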

Best,
Wei

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v9 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2017-04-14  1:50     ` Michael S. Tsirkin
  (?)
@ 2017-04-14  9:47       ` Matthew Wilcox
  -1 siblings, 0 replies; 136+ messages in thread
From: Matthew Wilcox @ 2017-04-14  9:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Wei Wang, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Fri, Apr 14, 2017 at 04:50:48AM +0300, Michael S. Tsirkin wrote:
> On Thu, Apr 13, 2017 at 01:44:11PM -0700, Matthew Wilcox wrote:
> > On Thu, Apr 13, 2017 at 05:35:03PM +0800, Wei Wang wrote:
> > > 2) transfer the guest unused pages to the host so that they
> > > can be skipped to migrate in live migration.
> > 
> > I don't understand this second bit.  You leave the pages on the free list,
> > and tell the host they're free.  What's preventing somebody else from
> > allocating them and using them for something?  Is the guest semi-frozen
> > at this point with just enough of it running to ask the balloon driver
> > to do things?
> 
> There's missing documentation here.
> 
> The way things actually work is host sends to guest
> a request for unused pages and then write-protects all memory.

... hopefully you mean "write protects all memory, then sends a request
for unused pages", otherwise there's a race condition.
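
To make the ordering concern concrete, here is a toy model of the race.
It is only a sketch: struct vm_model and every function in it are
invented for illustration and do not correspond to QEMU or kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NPAGES 8

/* Toy model; all names here are hypothetical. */
struct vm_model {
	bool free_in_guest[NPAGES]; /* guest view: page is on the free list */
	bool hinted_free[NPAGES];   /* snapshot of free pages given to the host */
	bool tracking;              /* host write protection is active */
	bool dirty[NPAGES];         /* writes the host caught while tracking */
};

void host_write_protect_all(struct vm_model *vm)
{
	vm->tracking = true;
}

void guest_report_free(struct vm_model *vm)
{
	memcpy(vm->hinted_free, vm->free_in_guest, sizeof(vm->hinted_free));
}

void guest_alloc_and_write(struct vm_model *vm, int pfn)
{
	vm->free_in_guest[pfn] = false;
	if (vm->tracking)
		vm->dirty[pfn] = true; /* only seen if protection is already on */
}

/* The host sends a page unless it was hinted free and never dirtied. */
bool host_will_send(const struct vm_model *vm, int pfn)
{
	return !vm->hinted_free[pfn] || vm->dirty[pfn];
}

/* Returns true if a page the guest is now using would be skipped. */
bool loses_data(bool protect_first)
{
	struct vm_model vm;

	memset(&vm, 0, sizeof(vm));
	vm.free_in_guest[3] = true;

	if (protect_first) {
		host_write_protect_all(&vm);
		guest_report_free(&vm);
		guest_alloc_and_write(&vm, 3);
	} else {
		guest_report_free(&vm);
		guest_alloc_and_write(&vm, 3); /* write lands in the window */
		host_write_protect_all(&vm);
	}
	return !vm.free_in_guest[3] && !host_will_send(&vm, 3);
}
```

In this model loses_data(false) reports a skipped in-use page, while
loses_data(true) does not, because the write is caught as dirty.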

And I see the utility of this, but does this functionality belong in
the balloon driver?  It seems like it's something you might want even
if you don't have the balloon driver loaded.  Or something you might
not want if you do have the balloon driver loaded.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v9 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2017-04-14  9:47       ` Matthew Wilcox
  (?)
@ 2017-04-14 14:22         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 136+ messages in thread
From: Michael S. Tsirkin @ 2017-04-14 14:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Wei Wang, virtio-dev, linux-kernel, qemu-devel, virtualization,
	kvm, linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Fri, Apr 14, 2017 at 02:47:40AM -0700, Matthew Wilcox wrote:
> On Fri, Apr 14, 2017 at 04:50:48AM +0300, Michael S. Tsirkin wrote:
> > On Thu, Apr 13, 2017 at 01:44:11PM -0700, Matthew Wilcox wrote:
> > > On Thu, Apr 13, 2017 at 05:35:03PM +0800, Wei Wang wrote:
> > > > 2) transfer the guest unused pages to the host so that they
> > > > can be skipped to migrate in live migration.
> > > 
> > > I don't understand this second bit.  You leave the pages on the free list,
> > > and tell the host they're free.  What's preventing somebody else from
> > > allocating them and using them for something?  Is the guest semi-frozen
> > > at this point with just enough of it running to ask the balloon driver
> > > to do things?
> > 
> > There's missing documentation here.
> > 
> > The way things actually work is host sends to guest
> > a request for unused pages and then write-protects all memory.
> 
> ... hopefully you mean "write protects all memory, then sends a request
> for unused pages", otherwise there's a race condition.

Exactly.

> And I see the utility of this, but does this functionality belong in
> the balloon driver?

We have historically put all kinds of memory-related functionality in the
balloon device. Consider, for example, memory statistics, which seem
conceptually related. See patches 1-2: the new mechanism for reporting lists of
pages seems to be beneficial for both, which suggests that using the
balloon for this is a good idea.

> It seems like it's something you might want even if you don't have the
> balloon driver loaded.  Or something you might not want if you do have
> the balloon driver loaded.

Most of the balloon functionality is only loosely coupled.  Yes, we could
split it up, but I'm not sure what this would buy us. What do you have
in mind?

-- 
MST

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [virtio-dev] Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
  2017-04-14  8:37       ` Wei Wang
  (?)
  (?)
@ 2017-04-14 21:38         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 136+ messages in thread
From: Michael S. Tsirkin @ 2017-04-14 21:38 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Fri, Apr 14, 2017 at 04:37:52PM +0800, Wei Wang wrote:
> On 04/14/2017 12:34 AM, Michael S. Tsirkin wrote:
> > On Thu, Apr 13, 2017 at 05:35:05PM +0800, Wei Wang wrote:
> > 
> > So we don't need the bitmap to talk to host, it is just
> > a data structure we chose to maintain lists of pages, right?
> Right. bitmap is the way to gather pages to chunk.
> It's only needed in the balloon page case.
> For the unused page case, we don't need it, since the free
> page blocks are already chunks.
> 
> > OK as far as it goes but you need much better isolation for it.
> > Build a data structure with APIs such as _init, _cleanup, _add, _clear,
> > _find_first, _find_next.
> > Completely unrelated to pages, it just maintains bits.
> > Then use it here.
> > 
> > 
> > >   static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > >   module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > >   MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> > > @@ -50,6 +54,10 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> > >   static struct vfsmount *balloon_mnt;
> > >   #endif
> > > +/* Types of pages to chunk */
> > > +#define PAGE_CHUNK_TYPE_BALLOON 0
> > > +
> > Doesn't look like you are ever adding more types in this
> > patchset.  Pls keep code simple, generalize it later.
> > 
> "#define PAGE_CHUNK_TYPE_UNUSED 1" is added in another patch.

I would say add the extra code there too. Or maybe we can avoid
adding it altogether.

> Types of page to chunk are treated differently. Different types of page
> chunks are sent to the host via different protocols.
> 
> 1) PAGE_CHUNK_TYPE_BALLOON: Ballooned (i.e. inflated/deflated) pages
> to chunk.  For the ballooned type, it uses the basic chunk msg format:
> 
> virtio_balloon_page_chunk_hdr +
> virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
> 
> 2) PAGE_CHUNK_TYPE_UNUSED: unused pages to chunk. It uses this miscq msg
> format:
> miscq_hdr +
> virtio_balloon_page_chunk_hdr +
> virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
> 
> The chunk msg is actually the payload of the miscq msg.
> 
> 

So just combine the two message formats and then it'll all be easier?


> > > +#define MAX_PAGE_CHUNKS 4096
> > This is an order-4 allocation. I'd make it 4095 and then it's
> > an order-3 one.
> 
> Sounds good, thanks.
> I think it would be better to make it 4090. Leave some space for the hdr
> as well.

And the miscq hdr. In fact, just let the compiler do the math, something like:
(8 * PAGE_SIZE - sizeof(hdr)) / sizeof(chunk)
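
A userspace sketch of that compile-time calculation follows. The
struct layouts, the SKETCH_ names and the 4 KiB page size are
assumptions for illustration only; the real header and chunk
definitions are whatever the patch ends up with.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u                  /* assumption: 4 KiB pages */
#define SKETCH_BUF_SIZE  (8 * SKETCH_PAGE_SIZE) /* eight pages, order 3 */

/* Hypothetical layouts, not the ones from the patch. */
struct sketch_chunk {
	uint64_t base;
	uint64_t size;
};

struct sketch_msg {
	uint32_t miscq_hdr;          /* stand-in for the miscq header */
	uint32_t chunks;             /* stand-in for the chunk header */
	struct sketch_chunk chunk[]; /* payload */
};

/* Let the compiler work out how many chunks fit after the headers. */
#define SKETCH_MAX_PAGE_CHUNKS \
	((SKETCH_BUF_SIZE - offsetof(struct sketch_msg, chunk)) / \
	 sizeof(struct sketch_chunk))

/* Total message size in bytes for a given number of chunks. */
size_t sketch_msg_bytes(size_t nchunks)
{
	return offsetof(struct sketch_msg, chunk) +
	       nchunks * sizeof(struct sketch_chunk);
}

_Static_assert(offsetof(struct sketch_msg, chunk) +
	       SKETCH_MAX_PAGE_CHUNKS * sizeof(struct sketch_chunk) <=
	       SKETCH_BUF_SIZE, "message must fit in the buffer");
```

If a header grows, the chunk count shrinks automatically and the
static assertion keeps the buffer bound honest.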


I skimmed the explanation of the algorithms below, but please make sure
the code speaks for itself, and add inline comments to document it.
Wherever you had to answer me inline is where you want to
make the code clearer and add comments.

Also, pls find ways to abstract the data structure so we don't
need to deal with its internals all over the code.
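
One possible shape for such an abstraction, as a userspace sketch. The
bitset_* names are hypothetical, and a kernel version would more
likely build on the existing bitmap helpers than on an open-coded
loop; the point is only that callers never touch the word array.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define BITSET_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical container: it only maintains bits, nothing page-related. */
struct bitset {
	unsigned long *words;
	size_t nbits;
};

static size_t bitset_words(size_t nbits)
{
	return (nbits + BITSET_WORD_BITS - 1) / BITSET_WORD_BITS;
}

/* Returns 0 on success, -1 on failure (including nbits == 0). */
int bitset_init(struct bitset *bs, size_t nbits)
{
	bs->words = nbits ? calloc(bitset_words(nbits), sizeof(*bs->words)) : NULL;
	bs->nbits = bs->words ? nbits : 0;
	return bs->words ? 0 : -1;
}

void bitset_cleanup(struct bitset *bs)
{
	free(bs->words);
	bs->words = NULL; /* makes a second cleanup harmless */
	bs->nbits = 0;
}

void bitset_add(struct bitset *bs, size_t bit)
{
	assert(bit < bs->nbits);
	bs->words[bit / BITSET_WORD_BITS] |= 1UL << (bit % BITSET_WORD_BITS);
}

void bitset_clear(struct bitset *bs)
{
	memset(bs->words, 0, bitset_words(bs->nbits) * sizeof(*bs->words));
}

/* First set bit at or after start, or nbits if there is none. */
size_t bitset_find_next(const struct bitset *bs, size_t start)
{
	for (size_t bit = start; bit < bs->nbits; bit++)
		if (bs->words[bit / BITSET_WORD_BITS] &
		    (1UL << (bit % BITSET_WORD_BITS)))
			return bit;
	return bs->nbits;
}

size_t bitset_find_first(const struct bitset *bs)
{
	return bitset_find_next(bs, 0);
}
```

The balloon code would then translate between pfns and bit indexes in
one place and use only these calls everywhere else.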


....

> > 
> > >   {
> > >   	struct scatterlist sg;
> > > +	struct virtio_balloon_page_chunk_hdr *hdr;
> > > +	void *buf;
> > >   	unsigned int len;
> > > -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> > > +	switch (type) {
> > > +	case PAGE_CHUNK_TYPE_BALLOON:
> > > +		hdr = vb->balloon_page_chunk_hdr;
> > > +		len = 0;
> > > +		break;
> > > +	default:
> > > +		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
> > > +			 __func__, type);
> > > +		return;
> > > +	}
> > > -	/* We should always be able to add one buffer to an empty queue. */
> > > -	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > > -	virtqueue_kick(vq);
> > > +	buf = (void *)hdr - len;
> > Moving back to before the header? How can this make sense?
> > It works fine since len is 0, so just buf = hdr.
> > 
> For the unused page chunk case, it follows its own protocol:
> miscq_hdr + payload(chunk msg).
>  "buf = (void *)hdr - len" moves the buf pointer to the miscq_hdr, to send
> the entire miscq msg.

Well just pass the correct pointer in.

> Please check the patch for implementing the unused page chunk,
> it will be clear. If necessary, I can put "buf = (void *)hdr - len" from
> that patch.

Exactly. And all this pointer math is very messy. Please look for ways
to clean it up. It's generally easy to fill structures:

struct foo *foo = kmalloc(sizeof(*foo) + n * sizeof(foo->a[0]), ...);
for (i = 0; i < n; ++i)
	foo->a[i] = b;

This is the kind of code that's easy to understand, and it's
obvious there are no overflows and no info leaks here.
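
A self-contained userspace version of that pattern, as a sketch: calloc
stands in for kmalloc, and struct foo with its field n and the helper
foo_alloc_filled() are hypothetical names extending the fragment above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Header followed by a flexible array, allocated in one piece. */
struct foo {
	size_t n;
	uint64_t a[];
};

/* Hypothetical helper: allocate a foo with n entries, each set to b. */
struct foo *foo_alloc_filled(size_t n, uint64_t b)
{
	struct foo *foo;

	/* Refuse sizes whose byte count would overflow size_t. */
	if (n > (SIZE_MAX - sizeof(*foo)) / sizeof(foo->a[0]))
		return NULL;

	/* Zeroed allocation, so no stale heap bytes can leak out. */
	foo = calloc(1, sizeof(*foo) + n * sizeof(foo->a[0]));
	if (!foo)
		return NULL;

	foo->n = n;
	for (size_t i = 0; i < n; ++i)
		foo->a[i] = b;
	return foo;
}
```

The caller indexes foo->a[] directly, so no pointer ever has to be
moved backwards or forwards past a header by hand.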

> 
> > > +	len += sizeof(struct virtio_balloon_page_chunk_hdr);
> > > +	len += hdr->chunks * sizeof(struct virtio_balloon_page_chunk);
> > > +	sg_init_table(&sg, 1);
> > > +	sg_set_buf(&sg, buf, len);
> > > +	if (!virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL)) {
> > > +		virtqueue_kick(vq);
> > > +		if (busy_wait)
> > > +			while (!virtqueue_get_buf(vq, &len) &&
> > > +			       !virtqueue_is_broken(vq))
> > > +				cpu_relax();
> > > +		else
> > > +			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > > +		hdr->chunks = 0;
> > Why zero it here after device used it? Better to zero before use.
> 
> hdr->chunks tells the host how many chunks are there in the payload.
> After the device use it, it is ready to zero it.

It's rather confusing. Try to pass the number of chunks around
in some other way.

> > 
> > > +	}
> > > +}
> > > +
> > > +static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
> > > +			  int type, u64 base, u64 size)
> > what are the units here? Looks like it's in 4kbyte units?
> 
> what is the "unit" you referred to?
> This is the function to add one chunk, base pfn and size of the chunk are
> supplied to the function.
> 

Are both size and base in bytes then?
But you do not send them to the host as is; you shift them for some reason
before sending them.


> 
> > 
> > > +	if (hdr->chunks == MAX_PAGE_CHUNKS)
> > > +		send_page_chunks(vb, vq, type, false);
> > 		and zero chunks here?
> > > +}
> > > +
> > > +static void chunking_pages_from_bmap(struct virtio_balloon *vb,
> > Does this mean "convert_bmap_to_chunks"?
> > 
> 
> Yes.
> 

Pls name it accordingly then.

> > > +				     struct virtqueue *vq,
> > > +				     unsigned long pfn_start,
> > > +				     unsigned long *bmap,
> > > +				     unsigned long len)
> > > +{
> > > +	unsigned long pos = 0, end = len * BITS_PER_BYTE;
> > > +
> > > +	while (pos < end) {
> > > +		unsigned long one = find_next_bit(bmap, end, pos);
> > > +
> > > +		if (one < end) {
> > > +			unsigned long chunk_size, zero;
> > > +
> > > +			zero = find_next_zero_bit(bmap, end, one + 1);
> > 
> > zero and one are unhelpful names unless they equal 0 and 1.
> > current/next?
> > 
> 
> I think it is clear if we think about the bitmap, for example:
> 00001111000011110000
> one = the position of the next "1" bit,
> zero= the position of the next "0" bit, starting from one.
> 
> Then, it is clear, chunk_size= zero - one
> 
> would it be better to use pos_0 and pos_1?

Oh, so it's next_zero_bit and next_bit.
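
With those names, the walk over the example bitmap can be sketched in
userspace as below. The bitmap is passed as a string of '0' and '1'
characters purely for readability (index 0 is bit 0); struct chunk and
bits_to_chunks() are hypothetical, and kernel code would use
find_next_bit() and find_next_zero_bit() instead of the inner loops.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical chunk record: first pfn and number of pfns. */
struct chunk {
	unsigned long base;
	unsigned long size;
};

/*
 * Convert runs of set bits into (base, size) chunks.
 * Returns the number of chunks written, at most max_out.
 */
size_t bits_to_chunks(const char *bits, unsigned long pfn_start,
		      struct chunk *out, size_t max_out)
{
	size_t n = 0, pos = 0;

	while (bits[pos] != '\0' && n < max_out) {
		size_t next_bit = pos, next_zero_bit;

		/* Skip clear bits to find the start of the next run. */
		while (bits[next_bit] == '0')
			next_bit++;
		if (bits[next_bit] == '\0')
			break; /* no more set bits */

		/* Find the first clear bit (or the end) after the run. */
		next_zero_bit = next_bit + 1;
		while (bits[next_zero_bit] == '1')
			next_zero_bit++;

		out[n].base = pfn_start + next_bit;
		out[n].size = next_zero_bit - next_bit;
		n++;
		pos = next_zero_bit;
	}
	return n;
}
```

For the bitmap 00001111000011110000 and pfn_start 0x1000 this produces
two chunks of four pfns each, based at 0x1004 and 0x100c.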


> > > +			if (zero >= end)
> > > +				chunk_size = end - one;
> > > +			else
> > > +				chunk_size = zero - one;
> > > +
> > > +			if (chunk_size)
> > > +				add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
> > > +					      pfn_start + one, chunk_size);
> > Still not sure what a bit refers to: a page or 4kbytes?
> > I think it should be a page.
> A bit in the bitmap corresponds to a pfn of a balloon page(4KB).

That's a waste on systems with large page sizes, and it does not
look like you handle that case correctly.


> But I think it doesn't matter here, since it is pfn.
> Using the above example:
> 00001111000011110000
> 
> If the starting bit above corresponds to pfn-0x1000 (i.e. pfn_start)
> Then the chunk base = 0x1004
> (one is the position of the "Set" bit, which is 4), so pfn_start +one=0x1004
> 
> > > +static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> > > +{
> > > +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS)) {
> > > +		int pfns, page_bmaps, i;
> > > +		unsigned long pfn_start, pfns_len;
> > > +
> > > +		pfn_start = vb->pfn_start;
> > > +		pfns = vb->pfn_stop - pfn_start + 1;
> > > +		pfns = roundup(roundup(pfns, BITS_PER_LONG),
> > > +			       PFNS_PER_PAGE_BMAP);
> > > +		page_bmaps = pfns / PFNS_PER_PAGE_BMAP;
> > > +		pfns_len = pfns / BITS_PER_BYTE;
> > > +
> > > +		for (i = 0; i < page_bmaps; i++) {
> > > +			unsigned int bmap_len = PAGE_BMAP_SIZE;
> > > +
> > > +			/* The last one takes the leftover only */
> > I don't understand what does this mean.
> Still use the ruler analogy here: the object is 11-meter long, and we have
> a 2-meter long ruler. The 5th time has covered 10 meters of the object, Then
> the last time, the leftover is 1 meter, which means we can use half of the
> ruler
> to cover the left 1 meter.
> 
> Back to the implementation here, if there are only 10 pfns left in the last
> round,
> I think it's not necessary to search the entire page_bmap[] till the end.


Pls reword the comment to make it a whole sentence.


> > > +static void set_page_bmap(struct virtio_balloon *vb,
> > > +			  struct list_head *pages, struct virtqueue *vq)
> > > +{
> > > +	unsigned long pfn_start, pfn_stop;
> > > +	struct page *page;
> > > +	bool found;
> > > +
> > > +	vb->pfn_min = rounddown(vb->pfn_min, BITS_PER_LONG);
> > > +	vb->pfn_max = roundup(vb->pfn_max, BITS_PER_LONG);
> > > +
> > > +	extend_page_bmap_size(vb, vb->pfn_max - vb->pfn_min + 1);
> > > This might not do anything; in particular, it might not cover the
> > > given pfn range. Do we care? Why not?
> 
> We have allocated only 1 page_bmap[], which is able to cover 1GB memory.
> To inflate 2GB, it will try to extend by getting one more page_bmap,
> page_bmap[1].
> 
> > > +	pfn_start = vb->pfn_min;
> > > +
> > > +	while (pfn_start < vb->pfn_max) {
> > > +		pfn_stop = pfn_start + PFNS_PER_PAGE_BMAP * vb->page_bmaps;
> > > +		pfn_stop = pfn_stop < vb->pfn_max ? pfn_stop : vb->pfn_max;
> > > +
> > > +		vb->pfn_start = pfn_start;
> > > +		clear_page_bmap(vb);
> > > +		found = false;
> > > +
> > > +		list_for_each_entry(page, pages, lru) {
> > > +			unsigned long bmap_idx, bmap_pos, balloon_pfn;
> > > +
> > > +			balloon_pfn = page_to_balloon_pfn(page);
> > > +			if (balloon_pfn < pfn_start || balloon_pfn > pfn_stop)
> > > +				continue;
> > > +			bmap_idx = (balloon_pfn - pfn_start) /
> > > +				   PFNS_PER_PAGE_BMAP;
> > > +			bmap_pos = (balloon_pfn - pfn_start) %
> > > +				   PFNS_PER_PAGE_BMAP;
> > > +			set_bit(bmap_pos, vb->page_bmap[bmap_idx]);
> > Looks like this will crash if bmap_idx is out of range or
> > if page_bmap allocation failed.
> 
> No, it won't. Please think about the analogy Case 2.2: pfn_start is updated
> in each round. Like in the 2nd round, pfn_start is updated to 2, balloon_pfn
> will be a value between 2 and 4, so the result of
> "(balloon_pfn - pfn_start) /  PFNS_PER_PAGE_BMAP" will always be 0 when
> we only have page_bmap[0].

All these cases are too confusing. Pls abstract away the underlying data
structure (or, better, find an appropriate existing one). Things should
become clearer then.



> >   #ifdef CONFIG_BALLOON_COMPACTION
> > +
> > +static void tell_host_one_page(struct virtio_balloon *vb,
> > +			       struct virtqueue *vq, struct page *page)
> > +{
> > +	add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON, page_to_pfn(page), 1);
> > This passes 4kbytes to host which seems wrong - I think you want a full page.
> 
> OK. It should be
> add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
>                           page_to_pfn(page), VIRTIO_BALLOON_PAGES_PER_PAGE)
> 
> right?
> 
> If Page=2*BalloonPage, it will pass 2*4K to the host.

I guess, or better, use whole-page units.
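
The unit mismatch can be sketched as follows. VIRTIO_BALLOON_PFN_SHIFT
is the value from the uapi header; the two helper functions are
hypothetical userspace stand-ins that take the page size as a
parameter so both the 4 KiB and the 64 KiB case can be shown.

```c
#include <assert.h>

#define VIRTIO_BALLOON_PFN_SHIFT 12 /* balloon pages are always 4 KiB */

/* Number of 4 KiB balloon pages that make up one kernel page. */
unsigned long balloon_pages_per_page(unsigned long page_size)
{
	assert(page_size >= (1UL << VIRTIO_BALLOON_PFN_SHIFT));
	return page_size >> VIRTIO_BALLOON_PFN_SHIFT;
}

/* Convert a kernel pfn into the balloon's 4 KiB pfn space. */
unsigned long page_pfn_to_balloon_pfn(unsigned long pfn,
				      unsigned long page_size)
{
	return pfn * balloon_pages_per_page(page_size);
}
```

With 64 KiB pages, one kernel page is 16 balloon pages, so a chunk
size of 1 in balloon units would cover only a sixteenth of the page.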


> +static void balloon_page_chunk_init(struct virtio_balloon *vb)
> +{
> +	void *buf;
> +
> +	/*
> +	 * By default, we allocate page_bmap[0] only. More page_bmap will be
> +	 * allocated on demand.
> +	 */
> +	vb->page_bmap[0] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
> +	buf = kmalloc(sizeof(struct virtio_balloon_page_chunk_hdr) +
> +		      sizeof(struct virtio_balloon_page_chunk) *
> +		      MAX_PAGE_CHUNKS, GFP_KERNEL);
> +	if (!vb->page_bmap[0] || !buf) {
> +		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS);
> 
> > this doesn't work as expected as features has been OK'd by then.
> > You want something like
> > validate_features that I posted. See
> > "virtio: allow drivers to validate features".
> 
> OK. I will change it after that patch is merged.

It's upstream now.


> > 
> > > +		kfree(vb->page_bmap[0]);
> > Looks like this will double free. you want to zero them I think.
> > 
> 
> OK. I'll NULL the pointers after kfree().
> 
> 
> 
> Best,
> Wei
> 
> 

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
@ 2017-04-14 21:38         ` Michael S. Tsirkin
  0 siblings, 0 replies; 136+ messages in thread
From: Michael S. Tsirkin @ 2017-04-14 21:38 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Fri, Apr 14, 2017 at 04:37:52PM +0800, Wei Wang wrote:
> On 04/14/2017 12:34 AM, Michael S. Tsirkin wrote:
> > On Thu, Apr 13, 2017 at 05:35:05PM +0800, Wei Wang wrote:
> > 
> > So we don't need the bitmap to talk to host, it is just
> > a data structure we chose to maintain lists of pages, right?
> Right. bitmap is the way to gather pages to chunk.
> It's only needed in the balloon page case.
> For the unused page case, we don't need it, since the free
> page blocks are already chunks.
> 
> > OK as far as it goes but you need much better isolation for it.
> > Build a data structure with APIs such as _init, _cleanup, _add, _clear,
> > _find_first, _find_next.
> > Completely unrelated to pages, it just maintains bits.
> > Then use it here.
> > 
> > 
> > >   static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > >   module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > >   MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> > > @@ -50,6 +54,10 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> > >   static struct vfsmount *balloon_mnt;
> > >   #endif
> > > +/* Types of pages to chunk */
> > > +#define PAGE_CHUNK_TYPE_BALLOON 0
> > > +
> > Doesn't look like you are ever adding more types in this
> > patchset.  Pls keep code simple, generalize it later.
> > 
> "#define PAGE_CHUNK_TYPE_UNUSED 1" is added in another patch.

I would say add the extra code there too. Or maybe we can avoid
adding it altogether.

> Types of page to chunk are treated differently. Different types of page
> chunks are sent to the host via different protocols.
> 
> 1) PAGE_CHUNK_TYPE_BALLOON: Ballooned (i.e. inflated/deflated) pages
> to chunk.  For the ballooned type, it uses the basic chunk msg format:
> 
> virtio_balloon_page_chunk_hdr +
> virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
> 
> 2) PAGE_CHUNK_TYPE_UNUSED: unused pages to chunk. It uses this miscq msg
> format:
> miscq_hdr +
> virtio_balloon_page_chunk_hdr +
> virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
> 
> The chunk msg is actually the payload of the miscq msg.
> 
> 

So just combine the two message formats and then it'll all be easier?


> > > +#define MAX_PAGE_CHUNKS 4096
> > This is an order-4 allocation. I'd make it 4095 and then it's
> > an order-3 one.
> 
> Sounds good, thanks.
> I think it would be better to make it 4090. Leave some space for the hdr
> as well.

And miscq hdr. In fact just let compiler do the math - something like:
(8 * PAGE_SIZE - sizeof(hdr)) / sizeof(chunk)
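
A minimal userspace sketch of that "let the compiler do the math" idea follows. The struct layouts, the 4096-byte page size, and all `sketch_`/`SKETCH_` names are assumptions made for illustration; they are not the layouts from the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u                      /* assumed page size */
#define SKETCH_BUF_SIZE  (8 * SKETCH_PAGE_SIZE)     /* one order-3 allocation */

/* Hypothetical layouts, for illustration only. */
struct sketch_miscq_hdr  { uint16_t cmd; uint16_t flags; };
struct sketch_chunk_hdr  { uint32_t chunks; };
struct sketch_page_chunk { uint64_t packed; };

/* The compiler works out how many chunks fit after both headers. */
#define SKETCH_MAX_PAGE_CHUNKS \
	((SKETCH_BUF_SIZE - sizeof(struct sketch_miscq_hdr) - \
	  sizeof(struct sketch_chunk_hdr)) / sizeof(struct sketch_page_chunk))

/* Bytes occupied by a completely full message. */
static size_t sketch_full_msg_size(void)
{
	return sizeof(struct sketch_miscq_hdr) +
	       sizeof(struct sketch_chunk_hdr) +
	       SKETCH_MAX_PAGE_CHUNKS * sizeof(struct sketch_page_chunk);
}
```

With these assumed sizes the count stays correct if either header grows later, which a hand-picked constant such as 4090 would not guarantee.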


I skimmed the explanation of the algorithms below, but please make sure
the code speaks for itself and add inline comments to document it.
Wherever you had to answer me inline, that is where you want to
make the code clearer and add comments.

Also, pls find ways to abstract the data structure so we don't
need to deal with its internals all over the code.
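
As a rough illustration of such an abstraction, here is a userspace sketch of a bit container with the _init, _cleanup, _add, _clear, _find_first and _find_next operations mentioned earlier. Every name is hypothetical, it uses calloc/free rather than kernel allocators, and the linear scan is only for clarity.

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_WORDS(nbits) \
	(((nbits) + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG)

struct sketch_bitset {
	unsigned long *words;
	size_t nbits;
};

/* Allocate a zeroed set of nbits bits; 0 on success, -1 on failure. */
static int sketch_bitset_init(struct sketch_bitset *bs, size_t nbits)
{
	bs->words = calloc(SKETCH_WORDS(nbits) + 1, sizeof(unsigned long));
	bs->nbits = bs->words ? nbits : 0;
	return bs->words ? 0 : -1;
}

/* Free the storage and NULL the pointer, so a repeated cleanup is harmless. */
static void sketch_bitset_cleanup(struct sketch_bitset *bs)
{
	free(bs->words);
	bs->words = NULL;
	bs->nbits = 0;
}

/* Set one bit; an out-of-range position is rejected, never written. */
static int sketch_bitset_add(struct sketch_bitset *bs, size_t pos)
{
	if (pos >= bs->nbits)
		return -1;
	bs->words[pos / SKETCH_BITS_PER_LONG] |=
		1UL << (pos % SKETCH_BITS_PER_LONG);
	return 0;
}

/* Clear every bit. */
static void sketch_bitset_clear(struct sketch_bitset *bs)
{
	if (bs->words)
		memset(bs->words, 0,
		       SKETCH_WORDS(bs->nbits) * sizeof(unsigned long));
}

/* First set bit at or after start, or nbits if there is none. */
static size_t sketch_bitset_find_next(const struct sketch_bitset *bs,
				      size_t start)
{
	for (size_t pos = start; pos < bs->nbits; pos++)
		if (bs->words[pos / SKETCH_BITS_PER_LONG] &
		    (1UL << (pos % SKETCH_BITS_PER_LONG)))
			return pos;
	return bs->nbits;
}

static size_t sketch_bitset_find_first(const struct sketch_bitset *bs)
{
	return sketch_bitset_find_next(bs, 0);
}
```

Callers then never index the backing array themselves, so range checks and the NULL-after-free handling live in one place.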


....

> > 
> > >   {
> > >   	struct scatterlist sg;
> > > +	struct virtio_balloon_page_chunk_hdr *hdr;
> > > +	void *buf;
> > >   	unsigned int len;
> > > -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> > > +	switch (type) {
> > > +	case PAGE_CHUNK_TYPE_BALLOON:
> > > +		hdr = vb->balloon_page_chunk_hdr;
> > > +		len = 0;
> > > +		break;
> > > +	default:
> > > +		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
> > > +			 __func__, type);
> > > +		return;
> > > +	}
> > > -	/* We should always be able to add one buffer to an empty queue. */
> > > -	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > > -	virtqueue_kick(vq);
> > > +	buf = (void *)hdr - len;
> > Moving back to before the header? How can this make sense?
> > It works fine since len is 0, so just buf = hdr.
> > 
> For the unused page chunk case, it follows its own protocol:
> miscq_hdr + payload(chunk msg).
>  "buf = (void *)hdr - len" moves the buf pointer to the miscq_hdr, to send
> the entire miscq msg.

Well just pass the correct pointer in.

> Please check the patch for implementing the unused page chunk,
> it will be clear. If necessary, I can put "buf = (void *)hdr - len" from
> that patch.

Exactly. And all this pointer math is very messy. Please look for ways
to clean it. It's generally easy to fill structures:

struct foo *foo = kmalloc(..., sizeof(*foo) + n * sizeof(foo->a[0]));
for (i = 0; i < n; ++i)
	foo->a[i] = b;

this is the kind of code that's easy to understand and it's
obvious there are no overflows and no info leaks here.
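
Spelled out a little further, the same pattern could look like the userspace sketch below. The message layout and all `sketch_` names are assumptions for illustration, and calloc stands in for a zeroing kernel allocation.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical chunk entry and message layout, for illustration only. */
struct sketch_chunk {
	uint64_t base;   /* first pfn of the chunk */
	uint64_t size;   /* number of pfns in the chunk */
};

struct sketch_chunk_msg {
	uint64_t chunks;               /* entries currently valid in chunk[] */
	struct sketch_chunk chunk[];   /* flexible array member */
};

/* Header and array come from one zeroed block, so no stale bytes leak. */
static struct sketch_chunk_msg *sketch_msg_alloc(size_t capacity)
{
	struct sketch_chunk_msg *msg;

	msg = calloc(1, sizeof(*msg) + capacity * sizeof(msg->chunk[0]));
	return msg;
}

/* Append by index; -1 means the message is full and must be sent first. */
static int sketch_msg_add(struct sketch_chunk_msg *msg, size_t capacity,
			  uint64_t base, uint64_t size)
{
	if (msg->chunks >= capacity)
		return -1;
	msg->chunk[msg->chunks].base = base;
	msg->chunk[msg->chunks].size = size;
	msg->chunks++;
	return 0;
}

/* Bytes that would be handed to the device: header plus used entries. */
static size_t sketch_msg_len(const struct sketch_chunk_msg *msg)
{
	return sizeof(*msg) + msg->chunks * sizeof(msg->chunk[0]);
}
```

The buffer pointer passed to the queue is simply `msg`, and the length comes from one function, so no "hdr - len" arithmetic is needed.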

> 
> > > +	len += sizeof(struct virtio_balloon_page_chunk_hdr);
> > > +	len += hdr->chunks * sizeof(struct virtio_balloon_page_chunk);
> > > +	sg_init_table(&sg, 1);
> > > +	sg_set_buf(&sg, buf, len);
> > > +	if (!virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL)) {
> > > +		virtqueue_kick(vq);
> > > +		if (busy_wait)
> > > +			while (!virtqueue_get_buf(vq, &len) &&
> > > +			       !virtqueue_is_broken(vq))
> > > +				cpu_relax();
> > > +		else
> > > +			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > > +		hdr->chunks = 0;
> > Why zero it here after device used it? Better to zero before use.
> 
> hdr->chunks tells the host how many chunks there are in the payload.
> After the device has used it, it can be zeroed.

It's rather confusing. Try to pass # of chunks around
in some other way.

> > 
> > > +	}
> > > +}
> > > +
> > > +static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
> > > +			  int type, u64 base, u64 size)
> > what are the units here? Looks like it's in 4kbyte units?
> 
> what is the "unit" you referred to?
> This is the function to add one chunk, base pfn and size of the chunk are
> supplied to the function.
> 

Are both size and base in bytes then?
But you do not send them to host as is, you shift them for some reason
before sending them to host.


> 
> > 
> > > +	if (hdr->chunks == MAX_PAGE_CHUNKS)
> > > +		send_page_chunks(vb, vq, type, false);
> > 		and zero chunks here?
> > > +}
> > > +
> > > +static void chunking_pages_from_bmap(struct virtio_balloon *vb,
> > Does this mean "convert_bmap_to_chunks"?
> > 
> 
> Yes.
> 

Pls name it accordingly then.

> > > +				     struct virtqueue *vq,
> > > +				     unsigned long pfn_start,
> > > +				     unsigned long *bmap,
> > > +				     unsigned long len)
> > > +{
> > > +	unsigned long pos = 0, end = len * BITS_PER_BYTE;
> > > +
> > > +	while (pos < end) {
> > > +		unsigned long one = find_next_bit(bmap, end, pos);
> > > +
> > > +		if (one < end) {
> > > +			unsigned long chunk_size, zero;
> > > +
> > > +			zero = find_next_zero_bit(bmap, end, one + 1);
> > 
> > zero and one are unhelpful names unless they equal 0 and 1.
> > current/next?
> > 
> 
> I think it is clear if we think about the bitmap, for example:
> 00001111000011110000
> one = the position of the next "1" bit,
> zero= the position of the next "0" bit, starting from one.
> 
> Then, it is clear, chunk_size= zero - one
> 
> would it be better to use pos_0 and pos_1?

Oh, so it's next_zero_bit and next_bit.
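
With those names, the run-to-chunk conversion can be sketched in userspace as below. The helpers are simple linear-scan stand-ins rather than the kernel's find_next_bit()/find_next_zero_bit(), and all `sketch_` names are hypothetical.

```c
#include <limits.h>
#include <stddef.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

struct sketch_run {
	unsigned long base;   /* first pfn of the run */
	unsigned long size;   /* number of consecutive pfns */
};

static int sketch_test_bit(const unsigned long *bmap, size_t pos)
{
	return (int)((bmap[pos / SKETCH_BITS_PER_LONG] >>
		      (pos % SKETCH_BITS_PER_LONG)) & 1UL);
}

/* First position in [start, end) whose bit equals want, or end if none. */
static size_t sketch_next(const unsigned long *bmap, size_t end,
			  size_t start, int want)
{
	size_t pos = start;

	while (pos < end && sketch_test_bit(bmap, pos) != want)
		pos++;
	return pos;
}

/*
 * Turn each run of set bits into one (base, size) pair.
 * Returns the number of runs written to out, at most max_out.
 */
static size_t sketch_bmap_to_chunks(const unsigned long *bmap, size_t nbits,
				    unsigned long pfn_start,
				    struct sketch_run *out, size_t max_out)
{
	size_t n = 0, pos = 0;

	while (pos < nbits && n < max_out) {
		size_t next_bit = sketch_next(bmap, nbits, pos, 1);
		size_t next_zero_bit;

		if (next_bit == nbits)
			break;          /* no set bits left */
		next_zero_bit = sketch_next(bmap, nbits, next_bit + 1, 0);
		out[n].base = pfn_start + next_bit;
		out[n].size = next_zero_bit - next_bit;
		n++;
		pos = next_zero_bit;
	}
	return n;
}
```

For the 20-bit example 00001111000011110000 with pfn_start 0x1000, this yields two runs of size 4 starting at 0x1004 and 0x100c. Because the search returns the end position when no zero bit follows, the separate "zero >= end" branch is not needed.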


> > > +			if (zero >= end)
> > > +				chunk_size = end - one;
> > > +			else
> > > +				chunk_size = zero - one;
> > > +
> > > +			if (chunk_size)
> > > +				add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
> > > +					      pfn_start + one, chunk_size);
> > > Still not clear: what does a bit refer to? A page or 4 kbytes?
> > I think it should be a page.
> A bit in the bitmap corresponds to a pfn of a balloon page(4KB).

That's a waste on systems with large page sizes, and it does not
look like you handle that case correctly.


> But I think it doesn't matter here, since it is pfn.
> Using the above example:
> 00001111000011110000
> 
> If the starting bit above corresponds to pfn-0x1000 (i.e. pfn_start)
> Then the chunk base = 0x1004
> (one is the position of the "Set" bit, which is 4), so pfn_start +one=0x1004
> 
> > > +static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> > > +{
> > > +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS)) {
> > > +		int pfns, page_bmaps, i;
> > > +		unsigned long pfn_start, pfns_len;
> > > +
> > > +		pfn_start = vb->pfn_start;
> > > +		pfns = vb->pfn_stop - pfn_start + 1;
> > > +		pfns = roundup(roundup(pfns, BITS_PER_LONG),
> > > +			       PFNS_PER_PAGE_BMAP);
> > > +		page_bmaps = pfns / PFNS_PER_PAGE_BMAP;
> > > +		pfns_len = pfns / BITS_PER_BYTE;
> > > +
> > > +		for (i = 0; i < page_bmaps; i++) {
> > > +			unsigned int bmap_len = PAGE_BMAP_SIZE;
> > > +
> > > +			/* The last one takes the leftover only */
> > > I don't understand what this means.
> Still use the ruler analogy here: the object is 11 meters long, and we have
> a 2-meter ruler. After the 5th measurement we have covered 10 meters of the
> object. For the last measurement the leftover is 1 meter, so half of the
> ruler is enough to cover it.
> 
> Back to the implementation here: if there are only 10 pfns left in the last
> round, I think it's not necessary to search the entire page_bmap[] till the
> end.


Pls reword the comment to make it a whole sentence.
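
The "leftover" arithmetic from the ruler analogy could be captured in one small helper whose comment says what happens, for example as in this userspace sketch. The function and parameter names are hypothetical, and pfns_per_bmap is assumed to be nonzero.

```c
#include <stddef.h>

/*
 * Number of bits to scan in bitmap number idx when total_pfns pfns are
 * spread over bitmaps holding pfns_per_bmap bits each. Every bitmap is
 * scanned in full except the last one, which holds only the leftover.
 */
static size_t sketch_bits_in_bmap(size_t total_pfns, size_t pfns_per_bmap,
				  size_t idx)
{
	size_t covered = idx * pfns_per_bmap;

	if (covered >= total_pfns)
		return 0;                       /* past the end: nothing to scan */
	if (total_pfns - covered < pfns_per_bmap)
		return total_pfns - covered;    /* last bitmap: leftover only */
	return pfns_per_bmap;
}
```

For the 11-meter object and the 2-meter ruler, indexes 0 to 4 return 2 and index 5 returns 1.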


> > > +static void set_page_bmap(struct virtio_balloon *vb,
> > > +			  struct list_head *pages, struct virtqueue *vq)
> > > +{
> > > +	unsigned long pfn_start, pfn_stop;
> > > +	struct page *page;
> > > +	bool found;
> > > +
> > > +	vb->pfn_min = rounddown(vb->pfn_min, BITS_PER_LONG);
> > > +	vb->pfn_max = roundup(vb->pfn_max, BITS_PER_LONG);
> > > +
> > > +	extend_page_bmap_size(vb, vb->pfn_max - vb->pfn_min + 1);
> > > This might not do anything; in particular it might not cover the
> > > given pfn range. Do we care? Why not?
> 
> We have allocated only 1 page_bmap[], which is able to cover 1GB memory.
> To inflate 2GB, it will try to extend by getting one more page_bmap,
> page_bmap[1].
> 
> > > +	pfn_start = vb->pfn_min;
> > > +
> > > +	while (pfn_start < vb->pfn_max) {
> > > +		pfn_stop = pfn_start + PFNS_PER_PAGE_BMAP * vb->page_bmaps;
> > > +		pfn_stop = pfn_stop < vb->pfn_max ? pfn_stop : vb->pfn_max;
> > > +
> > > +		vb->pfn_start = pfn_start;
> > > +		clear_page_bmap(vb);
> > > +		found = false;
> > > +
> > > +		list_for_each_entry(page, pages, lru) {
> > > +			unsigned long bmap_idx, bmap_pos, balloon_pfn;
> > > +
> > > +			balloon_pfn = page_to_balloon_pfn(page);
> > > +			if (balloon_pfn < pfn_start || balloon_pfn > pfn_stop)
> > > +				continue;
> > > +			bmap_idx = (balloon_pfn - pfn_start) /
> > > +				   PFNS_PER_PAGE_BMAP;
> > > +			bmap_pos = (balloon_pfn - pfn_start) %
> > > +				   PFNS_PER_PAGE_BMAP;
> > > +			set_bit(bmap_pos, vb->page_bmap[bmap_idx]);
> > Looks like this will crash if bmap_idx is out of range or
> > if page_bmap allocation failed.
> 
> No, it won't. Please think about the analogy Case 2.2: pfn_start is updated
> in each round. Like in the 2nd round, pfn_start is updated to 2, balloon_pfn
> will be a value between 2 and 4, so the result of
> "(balloon_pfn - pfn_start) /  PFNS_PER_PAGE_BMAP" will always be 0 when
> we only have page_bmap[0].

All these cases are too confusing. Pls abstract away the underlying data
structure (or better find an appropriate existing one). Things should
become clearer then.
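
One way to make the range question answerable by inspection is to keep the index calculation and its bounds check in a single helper, as in this userspace sketch. All names are hypothetical and pfns_per_bmap is assumed to be nonzero. Note that the quoted check skips a page only when balloon_pfn > pfn_stop, so a pfn equal to pfn_stop seems to produce an index one past the last bitmap; the helper below rejects that case.

```c
#include <stddef.h>

struct sketch_bmap_pos {
	size_t idx;   /* which bitmap */
	size_t pos;   /* bit inside that bitmap */
};

/*
 * Locate balloon_pfn inside a window of nbmaps bitmaps that starts at
 * pfn_start. Returns 0 and fills *out on success, or -1 when the pfn
 * lies outside the window, so the caller never indexes past the array.
 */
static int sketch_locate(unsigned long balloon_pfn, unsigned long pfn_start,
			 size_t nbmaps, size_t pfns_per_bmap,
			 struct sketch_bmap_pos *out)
{
	unsigned long off;

	if (balloon_pfn < pfn_start)
		return -1;
	off = balloon_pfn - pfn_start;
	if (off / pfns_per_bmap >= nbmaps)
		return -1;
	out->idx = off / pfns_per_bmap;
	out->pos = off % pfns_per_bmap;
	return 0;
}
```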



> >   #ifdef CONFIG_BALLOON_COMPACTION
> > +
> > +static void tell_host_one_page(struct virtio_balloon *vb,
> > +			       struct virtqueue *vq, struct page *page)
> > +{
> > +	add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON, page_to_pfn(page), 1);
> > This passes 4kbytes to host which seems wrong - I think you want a full page.
> 
> OK. It should be
> add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
>                           page_to_pfn(page), VIRTIO_BALLOON_PAGES_PER_PAGE)
> 
> right?
> 
> If Page=2*BalloonPage, it will pass 2*4K to the host.

I guess, or better use whole page units.
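
To make the units explicit, here is a small userspace sketch of the conversions involved. The 12-bit balloon pfn shift matches the 4 KiB balloon page discussed above; the function names and the page_shift parameter are assumptions for illustration, and page_shift is assumed to be at least 12.

```c
#include <stdint.h>

#define SKETCH_BALLOON_PFN_SHIFT 12   /* balloon pfns count 4 KiB units */

/* Number of 4 KiB balloon pages in one guest page of 2^page_shift bytes. */
static uint64_t sketch_balloon_pages_per_page(unsigned int page_shift)
{
	return (uint64_t)1 << (page_shift - SKETCH_BALLOON_PFN_SHIFT);
}

/* Balloon pfn of the first 4 KiB unit inside guest page number pfn. */
static uint64_t sketch_page_to_balloon_pfn(uint64_t pfn,
					   unsigned int page_shift)
{
	return pfn * sketch_balloon_pages_per_page(page_shift);
}

/* Bytes described by a chunk of npages whole guest pages. */
static uint64_t sketch_chunk_bytes(uint64_t npages, unsigned int page_shift)
{
	return npages << page_shift;
}
```

With 4 KiB guest pages the two units coincide. With 64 KiB guest pages one page is 16 balloon pages, which is why a size of 1 means different things depending on the unit.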


> +static void balloon_page_chunk_init(struct virtio_balloon *vb)
> +{
> +	void *buf;
> +
> +	/*
> +	 * By default, we allocate page_bmap[0] only. More page_bmap will be
> +	 * allocated on demand.
> +	 */
> +	vb->page_bmap[0] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
> +	buf = kmalloc(sizeof(struct virtio_balloon_page_chunk_hdr) +
> +		      sizeof(struct virtio_balloon_page_chunk) *
> +		      MAX_PAGE_CHUNKS, GFP_KERNEL);
> +	if (!vb->page_bmap[0] || !buf) {
> +		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS);
> 
> > this doesn't work as expected as features has been OK'd by then.
> > You want something like
> > validate_features that I posted. See
> > "virtio: allow drivers to validate features".
> 
> OK. I will change it after that patch is merged.

It's upstream now.


> > 
> > > +		kfree(vb->page_bmap[0]);
> > Looks like this will double free. you want to zero them I think.
> > 
> 
> OK. I'll NULL the pointers after kfree().
> 
> 
> 
> Best,
> Wei
> 
> 

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [Qemu-devel] [virtio-dev] Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
@ 2017-04-14 21:38         ` Michael S. Tsirkin
  0 siblings, 0 replies; 136+ messages in thread
From: Michael S. Tsirkin @ 2017-04-14 21:38 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Fri, Apr 14, 2017 at 04:37:52PM +0800, Wei Wang wrote:
> On 04/14/2017 12:34 AM, Michael S. Tsirkin wrote:
> > On Thu, Apr 13, 2017 at 05:35:05PM +0800, Wei Wang wrote:
> > 
> > So we don't need the bitmap to talk to host, it is just
> > a data structure we chose to maintain lists of pages, right?
> Right. bitmap is the way to gather pages to chunk.
> It's only needed in the balloon page case.
> For the unused page case, we don't need it, since the free
> page blocks are already chunks.
> 
> > OK as far as it goes but you need much better isolation for it.
> > Build a data structure with APIs such as _init, _cleanup, _add, _clear,
> > _find_first, _find_next.
> > Completely unrelated to pages, it just maintains bits.
> > Then use it here.
> > 
> > 
> > >   static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > >   module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > >   MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> > > @@ -50,6 +54,10 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> > >   static struct vfsmount *balloon_mnt;
> > >   #endif
> > > +/* Types of pages to chunk */
> > > +#define PAGE_CHUNK_TYPE_BALLOON 0
> > > +
> > Doesn't look like you are ever adding more types in this
> > patchset.  Pls keep code simple, generalize it later.
> > 
> "#define PAGE_CHUNK_TYPE_UNUSED 1" is added in another patch.

I would say add the extra code there too. Or maybe we can avoid
adding it altogether.

> Types of page to chunk are treated differently. Different types of page
> chunks are sent to the host via different protocols.
> 
> 1) PAGE_CHUNK_TYPE_BALLOON: Ballooned (i.e. inflated/deflated) pages
> to chunk.  For the ballooned type, it uses the basic chunk msg format:
> 
> virtio_balloon_page_chunk_hdr +
> virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
> 
> 2) PAGE_CHUNK_TYPE_UNUSED: unused pages to chunk. It uses this miscq msg
> format:
> miscq_hdr +
> virtio_balloon_page_chunk_hdr +
> virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
> 
> The chunk msg is actually the payload of the miscq msg.
> 
> 

So just combine the two message formats and then it'll all be easier?


> > > +#define MAX_PAGE_CHUNKS 4096
> > This is an order-4 allocation. I'd make it 4095 and then it's
> > an order-3 one.
> 
> Sounds good, thanks.
> I think it would be better to make it 4090. Leave some space for the hdr
> as well.

And miscq hdr. In fact just let compiler do the math - something like:
(8 * PAGE_SIZE - sizeof(hdr)) / sizeof(chunk)


I skimmed explanation of algorithms below but please make sure
code speaks for itself and add comments inline to document it.
Whenever you answered me inline this is where you want to
try to make code clearer and add comments.

Also, pls find ways to abstract the data structure so we don't
need to deal with its internals all over the code.


....

> > 
> > >   {
> > >   	struct scatterlist sg;
> > > +	struct virtio_balloon_page_chunk_hdr *hdr;
> > > +	void *buf;
> > >   	unsigned int len;
> > > -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> > > +	switch (type) {
> > > +	case PAGE_CHUNK_TYPE_BALLOON:
> > > +		hdr = vb->balloon_page_chunk_hdr;
> > > +		len = 0;
> > > +		break;
> > > +	default:
> > > +		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
> > > +			 __func__, type);
> > > +		return;
> > > +	}
> > > -	/* We should always be able to add one buffer to an empty queue. */
> > > -	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > > -	virtqueue_kick(vq);
> > > +	buf = (void *)hdr - len;
> > Moving back to before the header? How can this make sense?
> > It works fine since len is 0, so just buf = hdr.
> > 
> For the unused page chunk case, it follows its own protocol:
> miscq_hdr + payload(chunk msg).
>  "buf = (void *)hdr - len" moves the buf pointer to the miscq_hdr, to send
> the entire miscq msg.

Well just pass the correct pointer in.

> Please check the patch for implementing the unused page chunk,
> it will be clear. If necessary, I can put "buf = (void *)hdr - len" from
> that patch.

Exactly. And all this pointer math is very messy. Please look for ways
to clean it. It's generally easy to fill structures:

struct foo *foo = kmalloc(..., sizeof(*foo) + n * sizeof(foo->a[0]));
for (i = 0; i < n; ++i)
	foo->a[i] = b;

this is the kind of code that's easy to understand and it's
obvious there are no overflows and no info leaks here.

> 
> > > +	len += sizeof(struct virtio_balloon_page_chunk_hdr);
> > > +	len += hdr->chunks * sizeof(struct virtio_balloon_page_chunk);
> > > +	sg_init_table(&sg, 1);
> > > +	sg_set_buf(&sg, buf, len);
> > > +	if (!virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL)) {
> > > +		virtqueue_kick(vq);
> > > +		if (busy_wait)
> > > +			while (!virtqueue_get_buf(vq, &len) &&
> > > +			       !virtqueue_is_broken(vq))
> > > +				cpu_relax();
> > > +		else
> > > +			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > > +		hdr->chunks = 0;
> > Why zero it here after device used it? Better to zero before use.
> 
> hdr->chunks tells the host how many chunks are there in the payload.
> After the device use it, it is ready to zero it.

It's rather confusing. Try to pass # of chunks around
in some other way.

> > 
> > > +	}
> > > +}
> > > +
> > > +static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
> > > +			  int type, u64 base, u64 size)
> > what are the units here? Looks like it's in 4kbyte units?
> 
> what is the "unit" you referred to?
> This is the function to add one chunk, base pfn and size of the chunk are
> supplied to the function.
> 

Are both size and base in bytes then?
But you do not send them to host as is, you shift them for some reason
before sending them to host.


> 
> > 
> > > +	if (hdr->chunks == MAX_PAGE_CHUNKS)
> > > +		send_page_chunks(vb, vq, type, false);
> > 		and zero chunks here?
> > > +}
> > > +
> > > +static void chunking_pages_from_bmap(struct virtio_balloon *vb,
> > Does this mean "convert_bmap_to_chunks"?
> > 
> 
> Yes.
> 

Pls name it accordingly then.

> > > +				     struct virtqueue *vq,
> > > +				     unsigned long pfn_start,
> > > +				     unsigned long *bmap,
> > > +				     unsigned long len)
> > > +{
> > > +	unsigned long pos = 0, end = len * BITS_PER_BYTE;
> > > +
> > > +	while (pos < end) {
> > > +		unsigned long one = find_next_bit(bmap, end, pos);
> > > +
> > > +		if (one < end) {
> > > +			unsigned long chunk_size, zero;
> > > +
> > > +			zero = find_next_zero_bit(bmap, end, one + 1);
> > 
> > zero and one are unhelpful names unless they equal 0 and 1.
> > current/next?
> > 
> 
> I think it is clear if we think about the bitmap, for example:
> 00001111000011110000
> one = the position of the next "1" bit,
> zero= the position of the next "0" bit, starting from one.
> 
> Then, it is clear, chunk_size= zero - one
> 
> would it be better to use pos_0 and pos_1?

Oh, so it's next_zero_bit and next_bit.


> > > +			if (zero >= end)
> > > +				chunk_size = end - one;
> > > +			else
> > > +				chunk_size = zero - one;
> > > +
> > > +			if (chunk_size)
> > > +				add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
> > > +					      pfn_start + one, chunk_size);
> > Still not so what does a bit refer to? page or 4kbytes?
> > I think it should be a page.
> A bit in the bitmap corresponds to a pfn of a balloon page(4KB).

That's a waste on systems with large page sizes, and it does not
look like you handle that case correctly.


> But I think it doesn't matter here, since it is pfn.
> Using the above example:
> 00001111000011110000
> 
> If the starting bit above corresponds to pfn-0x1000 (i.e. pfn_start)
> Then the chunk base = 0x1004
> (one is the position of the "Set" bit, which is 4), so pfn_start +one=0x1004
> 
> > > +static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> > > +{
> > > +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS)) {
> > > +		int pfns, page_bmaps, i;
> > > +		unsigned long pfn_start, pfns_len;
> > > +
> > > +		pfn_start = vb->pfn_start;
> > > +		pfns = vb->pfn_stop - pfn_start + 1;
> > > +		pfns = roundup(roundup(pfns, BITS_PER_LONG),
> > > +			       PFNS_PER_PAGE_BMAP);
> > > +		page_bmaps = pfns / PFNS_PER_PAGE_BMAP;
> > > +		pfns_len = pfns / BITS_PER_BYTE;
> > > +
> > > +		for (i = 0; i < page_bmaps; i++) {
> > > +			unsigned int bmap_len = PAGE_BMAP_SIZE;
> > > +
> > > +			/* The last one takes the leftover only */
> > I don't understand what does this mean.
> Still use the ruler analogy here: the object is 11-meter long, and we have
> a 2-meter long ruler. The 5th time has covered 10 meters of the object, Then
> the last time, the leftover is 1 meter, which means we can use half of the
> ruler
> to cover the left 1 meter.
> 
> Back to the implementation here, if there are only 10 pfns left in the last
> round,
> I think it's not necessary to search the entire page_bmap[] till the end.


Pls reword the comment to make it a whole sentence.


> > > +static void set_page_bmap(struct virtio_balloon *vb,
> > > +			  struct list_head *pages, struct virtqueue *vq)
> > > +{
> > > +	unsigned long pfn_start, pfn_stop;
> > > +	struct page *page;
> > > +	bool found;
> > > +
> > > +	vb->pfn_min = rounddown(vb->pfn_min, BITS_PER_LONG);
> > > +	vb->pfn_max = roundup(vb->pfn_max, BITS_PER_LONG);
> > > +
> > > +	extend_page_bmap_size(vb, vb->pfn_max - vb->pfn_min + 1);
> > This might not do anything in particular might not cover the
> > given pfn range. Do we care? Why not?
> 
> We have allocated only 1 page_bmap[], which is able to cover 1GB memory.
> To inflate 2GB, it will try to extend by getting one more page_bmap,
> page_bmap[1].
> 
> > > +	pfn_start = vb->pfn_min;
> > > +
> > > +	while (pfn_start < vb->pfn_max) {
> > > +		pfn_stop = pfn_start + PFNS_PER_PAGE_BMAP * vb->page_bmaps;
> > > +		pfn_stop = pfn_stop < vb->pfn_max ? pfn_stop : vb->pfn_max;
> > > +
> > > +		vb->pfn_start = pfn_start;
> > > +		clear_page_bmap(vb);
> > > +		found = false;
> > > +
> > > +		list_for_each_entry(page, pages, lru) {
> > > +			unsigned long bmap_idx, bmap_pos, balloon_pfn;
> > > +
> > > +			balloon_pfn = page_to_balloon_pfn(page);
> > > +			if (balloon_pfn < pfn_start || balloon_pfn > pfn_stop)
> > > +				continue;
> > > +			bmap_idx = (balloon_pfn - pfn_start) /
> > > +				   PFNS_PER_PAGE_BMAP;
> > > +			bmap_pos = (balloon_pfn - pfn_start) %
> > > +				   PFNS_PER_PAGE_BMAP;
> > > +			set_bit(bmap_pos, vb->page_bmap[bmap_idx]);
> > Looks like this will crash if bmap_idx is out of range or
> > if page_bmap allocation failed.
> 
> No, it won't. Please think about the analogy Case 2.2: pfn_start is updated
> in each round. Like in the 2nd round, pfn_start is updated to 2, balloon_pfn
> will be a value between 2 and 4, so the result of
> "(balloon_pfn - pfn_start) /  PFNS_PER_PAGE_BMAP" will always be 0 when
> we only have page_bmap[0].

All these cases confuse too much. Pls abstract away the underlying data
structure (or better find an appropriate existing one). Things should
become clearer then.



> >   #ifdef CONFIG_BALLOON_COMPACTION
> > +
> > +static void tell_host_one_page(struct virtio_balloon *vb,
> > +			       struct virtqueue *vq, struct page *page)
> > +{
> > +	add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON, page_to_pfn(page), 1);
> > This passes 4kbytes to host which seems wrong - I think you want a full page.
> 
> OK. It should be
> add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
>                           page_to_pfn(page), VIRTIO_BALLOON_PAGES_PER_PAGE)
> 
> right?
> 
> If Page=2*BalloonPage, it will pass 2*4K to the host.

I guess, or better use whole page units.


> +static void balloon_page_chunk_init(struct virtio_balloon *vb)
> +{
> +	void *buf;
> +
> +	/*
> +	 * By default, we allocate page_bmap[0] only. More page_bmap will be
> +	 * allocated on demand.
> +	 */
> +	vb->page_bmap[0] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
> +	buf = kmalloc(sizeof(struct virtio_balloon_page_chunk_hdr) +
> +		      sizeof(struct virtio_balloon_page_chunk) *
> +		      MAX_PAGE_CHUNKS, GFP_KERNEL);
> +	if (!vb->page_bmap[0] || !buf) {
> +		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS);
> 
> > this doesn't work as expected as features has been OK'd by then.
> > You want something like
> > validate_features that I posted. See
> > "virtio: allow drivers to validate features".
> 
> OK. I will change it after that patch is merged.

It's upstream now.


> > 
> > > +		kfree(vb->page_bmap[0]);
> > Looks like this will double free. you want to zero them I think.
> > 
> 
> OK. I'll NULL the pointers after kfree().
> 
> 
> 
> Best,
> Wei
> 
> 


* Re: [virtio-dev] Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
  2017-04-14  8:37       ` Wei Wang
@ 2017-04-14 21:38       ` Michael S. Tsirkin
From: Michael S. Tsirkin @ 2017-04-14 21:38 UTC (permalink / raw)
  To: Wei Wang
  Cc: aarcange, virtio-dev, kvm, qemu-devel, amit.shah,
	liliang.opensource, dave.hansen, linux-kernel, virtualization,
	linux-mm, cornelia.huck, pbonzini, akpm, mgorman

On Fri, Apr 14, 2017 at 04:37:52PM +0800, Wei Wang wrote:
> On 04/14/2017 12:34 AM, Michael S. Tsirkin wrote:
> > On Thu, Apr 13, 2017 at 05:35:05PM +0800, Wei Wang wrote:
> > 
> > So we don't need the bitmap to talk to host, it is just
> > a data structure we chose to maintain lists of pages, right?
> Right. bitmap is the way to gather pages to chunk.
> It's only needed in the balloon page case.
> For the unused page case, we don't need it, since the free
> page blocks are already chunks.
> 
> > OK as far as it goes but you need much better isolation for it.
> > Build a data structure with APIs such as _init, _cleanup, _add, _clear,
> > _find_first, _find_next.
> > Completely unrelated to pages, it just maintains bits.
> > Then use it here.
> > 
> > 
> > >   static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > >   module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > >   MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> > > @@ -50,6 +54,10 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> > >   static struct vfsmount *balloon_mnt;
> > >   #endif
> > > +/* Types of pages to chunk */
> > > +#define PAGE_CHUNK_TYPE_BALLOON 0
> > > +
> > Doesn't look like you are ever adding more types in this
> > patchset.  Pls keep code simple, generalize it later.
> > 
> "#define PAGE_CHUNK_TYPE_UNUSED 1" is added in another patch.

I would say add the extra code there too. Or maybe we can avoid
adding it altogether.

> Types of page to chunk are treated differently. Different types of page
> chunks are sent to the host via different protocols.
> 
> 1) PAGE_CHUNK_TYPE_BALLOON: Ballooned (i.e. inflated/deflated) pages
> to chunk.  For the ballooned type, it uses the basic chunk msg format:
> 
> virtio_balloon_page_chunk_hdr +
> virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
> 
> 2) PAGE_CHUNK_TYPE_UNUSED: unused pages to chunk. It uses this miscq msg
> format:
> miscq_hdr +
> virtio_balloon_page_chunk_hdr +
> virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
> 
> The chunk msg is actually the payload of the miscq msg.
> 
> 

So just combine the two message formats and then it'll all be easier?


> > > +#define MAX_PAGE_CHUNKS 4096
> > This is an order-4 allocation. I'd make it 4095 and then it's
> > an order-3 one.
> 
> Sounds good, thanks.
> I think it would be better to make it 4090. Leave some space for the hdr
> as well.

And miscq hdr. In fact just let compiler do the math - something like:
(8 * PAGE_SIZE - sizeof(hdr)) / sizeof(chunk)
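
To make the "let the compiler do the math" idea concrete, here is a minimal userspace sketch. The struct layouts, the 4 KiB page size and all SKETCH_/sketch_ names are assumptions made up for illustration; the real definitions are the ones in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed 4 KiB pages; the kernel's PAGE_SIZE depends on the architecture. */
#define SKETCH_PAGE_SIZE 4096u
/* An order-3 allocation spans 2^3 = 8 contiguous pages. */
#define SKETCH_BUF_SIZE (8u * SKETCH_PAGE_SIZE)

/* Hypothetical layouts, standing in for the structs from the patch. */
struct sketch_miscq_hdr { uint16_t cmd; uint16_t flags; };
struct sketch_chunk_hdr { uint32_t chunks; };
struct sketch_chunk { uint64_t base; uint64_t size; };

/* Number of chunks that fit in the buffer after both headers. */
#define SKETCH_MAX_PAGE_CHUNKS \
	((SKETCH_BUF_SIZE - sizeof(struct sketch_miscq_hdr) - \
	  sizeof(struct sketch_chunk_hdr)) / sizeof(struct sketch_chunk))

/* Bytes occupied by a message that carries nr_chunks chunks. */
static size_t sketch_msg_bytes(size_t nr_chunks)
{
	assert(nr_chunks <= SKETCH_MAX_PAGE_CHUNKS);
	return sizeof(struct sketch_miscq_hdr) +
	       sizeof(struct sketch_chunk_hdr) +
	       nr_chunks * sizeof(struct sketch_chunk);
}
```

With the limit derived from sizeof, changing a header later cannot silently push the allocation to the next order.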


I skimmed explanation of algorithms below but please make sure
code speaks for itself and add comments inline to document it.
Whenever you answered me inline this is where you want to
try to make code clearer and add comments.

Also, pls find ways to abstract the data structure so we don't
need to deal with its internals all over the code.


....

> > 
> > >   {
> > >   	struct scatterlist sg;
> > > +	struct virtio_balloon_page_chunk_hdr *hdr;
> > > +	void *buf;
> > >   	unsigned int len;
> > > -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> > > +	switch (type) {
> > > +	case PAGE_CHUNK_TYPE_BALLOON:
> > > +		hdr = vb->balloon_page_chunk_hdr;
> > > +		len = 0;
> > > +		break;
> > > +	default:
> > > +		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
> > > +			 __func__, type);
> > > +		return;
> > > +	}
> > > -	/* We should always be able to add one buffer to an empty queue. */
> > > -	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > > -	virtqueue_kick(vq);
> > > +	buf = (void *)hdr - len;
> > Moving back to before the header? How can this make sense?
> > It works fine since len is 0, so just buf = hdr.
> > 
> For the unused page chunk case, it follows its own protocol:
> miscq_hdr + payload(chunk msg).
>  "buf = (void *)hdr - len" moves the buf pointer to the miscq_hdr, to send
> the entire miscq msg.

Well just pass the correct pointer in.

> Please check the patch for implementing the unused page chunk,
> it will be clear. If necessary, I can put "buf = (void *)hdr - len" from
> that patch.

Exactly. And all this pointer math is very messy. Please look for ways
to clean it. It's generally easy to fill structures:

struct foo *foo = kmalloc(..., sizeof(*foo) + n * sizeof(foo->a[0]));
for (i = 0; i < n; ++i)
	foo->a[i] = b;

this is the kind of code that's easy to understand and it's
obvious there are no overflows and no info leaks here.
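
A userspace sketch of that pattern, using calloc() in place of kmalloc() so it can be compiled outside the kernel. struct foo and foo_alloc_filled() are placeholder names for illustration only, not anything from the driver.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct foo {
	size_t n;	/* number of valid entries in a[] */
	uint64_t a[];	/* flexible array member */
};

/* Allocate a foo with room for n entries and set every entry to b. */
static struct foo *foo_alloc_filled(size_t n, uint64_t b)
{
	struct foo *foo;
	size_t i;

	/* The size computation below must not wrap around. */
	assert(n <= (SIZE_MAX - sizeof(*foo)) / sizeof(foo->a[0]));

	/* calloc() zeroes the buffer, so no stale bytes can leak out. */
	foo = calloc(1, sizeof(*foo) + n * sizeof(foo->a[0]));
	if (!foo)
		return NULL;

	foo->n = n;
	for (i = 0; i < n; ++i)
		foo->a[i] = b;
	return foo;
}
```

Every access goes through a named member, so the bounds are visible at the point of use and no pointer arithmetic on the buffer is needed.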

> 
> > > +	len += sizeof(struct virtio_balloon_page_chunk_hdr);
> > > +	len += hdr->chunks * sizeof(struct virtio_balloon_page_chunk);
> > > +	sg_init_table(&sg, 1);
> > > +	sg_set_buf(&sg, buf, len);
> > > +	if (!virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL)) {
> > > +		virtqueue_kick(vq);
> > > +		if (busy_wait)
> > > +			while (!virtqueue_get_buf(vq, &len) &&
> > > +			       !virtqueue_is_broken(vq))
> > > +				cpu_relax();
> > > +		else
> > > +			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> > > +		hdr->chunks = 0;
> > Why zero it here after device used it? Better to zero before use.
> 
> hdr->chunks tells the host how many chunks are there in the payload.
> After the device use it, it is ready to zero it.

It's rather confusing. Try to pass # of chunks around
in some other way.

> > 
> > > +	}
> > > +}
> > > +
> > > +static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
> > > +			  int type, u64 base, u64 size)
> > what are the units here? Looks like it's in 4kbyte units?
> 
> what is the "unit" you referred to?
> This is the function to add one chunk, base pfn and size of the chunk are
> supplied to the function.
> 

Are both size and base in bytes then?
But you do not send them to host as is, you shift them for some reason
before sending them to host.


> 
> > 
> > > +	if (hdr->chunks == MAX_PAGE_CHUNKS)
> > > +		send_page_chunks(vb, vq, type, false);
> > 		and zero chunks here?
> > > +}
> > > +
> > > +static void chunking_pages_from_bmap(struct virtio_balloon *vb,
> > Does this mean "convert_bmap_to_chunks"?
> > 
> 
> Yes.
> 

Pls name it accordingly then.

> > > +				     struct virtqueue *vq,
> > > +				     unsigned long pfn_start,
> > > +				     unsigned long *bmap,
> > > +				     unsigned long len)
> > > +{
> > > +	unsigned long pos = 0, end = len * BITS_PER_BYTE;
> > > +
> > > +	while (pos < end) {
> > > +		unsigned long one = find_next_bit(bmap, end, pos);
> > > +
> > > +		if (one < end) {
> > > +			unsigned long chunk_size, zero;
> > > +
> > > +			zero = find_next_zero_bit(bmap, end, one + 1);
> > 
> > zero and one are unhelpful names unless they equal 0 and 1.
> > current/next?
> > 
> 
> I think it is clear if we think about the bitmap, for example:
> 00001111000011110000
> one = the position of the next "1" bit,
> zero= the position of the next "0" bit, starting from one.
> 
> Then, it is clear, chunk_size= zero - one
> 
> would it be better to use pos_0 and pos_1?

Oh, so it's next_zero_bit and next_bit.
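
For reference, the run-finding loop can be sketched in userspace C with those names. This is only an illustration: it uses a plain byte array with bit 0 in the least significant bit of byte 0, and simple loops instead of the kernel's find_next_bit()/find_next_zero_bit(). struct sketch_run and the sketch_ helpers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_run {
	size_t start;	/* position of the first set bit in the run */
	size_t len;	/* number of consecutive set bits */
};

static int sketch_test_bit(const unsigned char *bmap, size_t pos)
{
	return (bmap[pos / 8] >> (pos % 8)) & 1;
}

/* Collect runs of set bits; returns how many runs were stored. */
static size_t sketch_bmap_to_runs(const unsigned char *bmap, size_t nbits,
				  struct sketch_run *runs, size_t max_runs)
{
	size_t pos = 0, n = 0;

	assert(bmap && runs);
	while (pos < nbits && n < max_runs) {
		size_t next_bit, next_zero_bit;

		/* First set bit at or after pos. */
		for (next_bit = pos; next_bit < nbits; next_bit++)
			if (sketch_test_bit(bmap, next_bit))
				break;
		if (next_bit == nbits)
			break;

		/* First clear bit after it, or nbits if the run hits the end. */
		for (next_zero_bit = next_bit + 1; next_zero_bit < nbits;
		     next_zero_bit++)
			if (!sketch_test_bit(bmap, next_zero_bit))
				break;

		runs[n].start = next_bit;
		runs[n].len = next_zero_bit - next_bit;
		n++;
		pos = next_zero_bit;
	}
	return n;
}
```

For the 00001111000011110000 example quoted above, this yields two runs of length 4, starting at positions 4 and 12; the chunk base is then pfn_start plus the run start.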


> > > +			if (zero >= end)
> > > +				chunk_size = end - one;
> > > +			else
> > > +				chunk_size = zero - one;
> > > +
> > > +			if (chunk_size)
> > > +				add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
> > > +					      pfn_start + one, chunk_size);
> > Still not so what does a bit refer to? page or 4kbytes?
> > I think it should be a page.
> A bit in the bitmap corresponds to a pfn of a balloon page(4KB).

That's a waste on systems with large page sizes, and it does not
look like you handle that case correctly.


> But I think it doesn't matter here, since it is pfn.
> Using the above example:
> 00001111000011110000
> 
> If the starting bit above corresponds to pfn-0x1000 (i.e. pfn_start)
> Then the chunk base = 0x1004
> (one is the position of the "Set" bit, which is 4), so pfn_start +one=0x1004
> 
> > > +static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> > > +{
> > > +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS)) {
> > > +		int pfns, page_bmaps, i;
> > > +		unsigned long pfn_start, pfns_len;
> > > +
> > > +		pfn_start = vb->pfn_start;
> > > +		pfns = vb->pfn_stop - pfn_start + 1;
> > > +		pfns = roundup(roundup(pfns, BITS_PER_LONG),
> > > +			       PFNS_PER_PAGE_BMAP);
> > > +		page_bmaps = pfns / PFNS_PER_PAGE_BMAP;
> > > +		pfns_len = pfns / BITS_PER_BYTE;
> > > +
> > > +		for (i = 0; i < page_bmaps; i++) {
> > > +			unsigned int bmap_len = PAGE_BMAP_SIZE;
> > > +
> > > +			/* The last one takes the leftover only */
> > I don't understand what does this mean.
> Still use the ruler analogy here: the object is 11-meter long, and we have
> a 2-meter long ruler. The 5th time has covered 10 meters of the object, Then
> the last time, the leftover is 1 meter, which means we can use half of the
> ruler
> to cover the left 1 meter.
> 
> Back to the implementation here, if there are only 10 pfns left in the last
> round,
> I think it's not necessary to search the entire page_bmap[] till the end.


Pls reword the comment to make it a whole sentence.


> > > +static void set_page_bmap(struct virtio_balloon *vb,
> > > +			  struct list_head *pages, struct virtqueue *vq)
> > > +{
> > > +	unsigned long pfn_start, pfn_stop;
> > > +	struct page *page;
> > > +	bool found;
> > > +
> > > +	vb->pfn_min = rounddown(vb->pfn_min, BITS_PER_LONG);
> > > +	vb->pfn_max = roundup(vb->pfn_max, BITS_PER_LONG);
> > > +
> > > +	extend_page_bmap_size(vb, vb->pfn_max - vb->pfn_min + 1);
> > This might not do anything in particular might not cover the
> > given pfn range. Do we care? Why not?
> 
> We have allocated only 1 page_bmap[], which is able to cover 1GB memory.
> To inflate 2GB, it will try to extend by getting one more page_bmap,
> page_bmap[1].
> 
> > > +	pfn_start = vb->pfn_min;
> > > +
> > > +	while (pfn_start < vb->pfn_max) {
> > > +		pfn_stop = pfn_start + PFNS_PER_PAGE_BMAP * vb->page_bmaps;
> > > +		pfn_stop = pfn_stop < vb->pfn_max ? pfn_stop : vb->pfn_max;
> > > +
> > > +		vb->pfn_start = pfn_start;
> > > +		clear_page_bmap(vb);
> > > +		found = false;
> > > +
> > > +		list_for_each_entry(page, pages, lru) {
> > > +			unsigned long bmap_idx, bmap_pos, balloon_pfn;
> > > +
> > > +			balloon_pfn = page_to_balloon_pfn(page);
> > > +			if (balloon_pfn < pfn_start || balloon_pfn > pfn_stop)
> > > +				continue;
> > > +			bmap_idx = (balloon_pfn - pfn_start) /
> > > +				   PFNS_PER_PAGE_BMAP;
> > > +			bmap_pos = (balloon_pfn - pfn_start) %
> > > +				   PFNS_PER_PAGE_BMAP;
> > > +			set_bit(bmap_pos, vb->page_bmap[bmap_idx]);
> > Looks like this will crash if bmap_idx is out of range or
> > if page_bmap allocation failed.
> 
> No, it won't. Please think about the analogy Case 2.2: pfn_start is updated
> in each round. Like in the 2nd round, pfn_start is updated to 2, balloon_pfn
> will be a value between 2 and 4, so the result of
> "(balloon_pfn - pfn_start) /  PFNS_PER_PAGE_BMAP" will always be 0 when
> we only have page_bmap[0].

All these cases confuse too much. Pls abstract away the underlying data
structure (or better find an appropriate existing one). Things should
become clearer then.



> >   #ifdef CONFIG_BALLOON_COMPACTION
> > +
> > +static void tell_host_one_page(struct virtio_balloon *vb,
> > +			       struct virtqueue *vq, struct page *page)
> > +{
> > +	add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON, page_to_pfn(page), 1);
> > This passes 4kbytes to host which seems wrong - I think you want a full page.
> 
> OK. It should be
> add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
>                           page_to_pfn(page), VIRTIO_BALLOON_PAGES_PER_PAGE)
> 
> right?
> 
> If Page=2*BalloonPage, it will pass 2*4K to the host.

I guess, or better use whole page units.


> +static void balloon_page_chunk_init(struct virtio_balloon *vb)
> +{
> +	void *buf;
> +
> +	/*
> +	 * By default, we allocate page_bmap[0] only. More page_bmap will be
> +	 * allocated on demand.
> +	 */
> +	vb->page_bmap[0] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
> +	buf = kmalloc(sizeof(struct virtio_balloon_page_chunk_hdr) +
> +		      sizeof(struct virtio_balloon_page_chunk) *
> +		      MAX_PAGE_CHUNKS, GFP_KERNEL);
> +	if (!vb->page_bmap[0] || !buf) {
> +		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_BALLOON_CHUNKS);
> 
> > this doesn't work as expected as features has been OK'd by then.
> > You want something like
> > validate_features that I posted. See
> > "virtio: allow drivers to validate features".
> 
> OK. I will change it after that patch is merged.

It's upstream now.


> > 
> > > +		kfree(vb->page_bmap[0]);
> > Looks like this will double free. you want to zero them I think.
> > 
> 
> OK. I'll NULL the pointers after kfree().
> 
> 
> 
> Best,
> Wei
> 
> 


* Re: [virtio-dev] Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
  2017-04-14 21:38         ` Michael S. Tsirkin
@ 2017-04-17  3:35           ` Wei Wang
From: Wei Wang @ 2017-04-17  3:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On 04/15/2017 05:38 AM, Michael S. Tsirkin wrote:
> On Fri, Apr 14, 2017 at 04:37:52PM +0800, Wei Wang wrote:
>> On 04/14/2017 12:34 AM, Michael S. Tsirkin wrote:
>>> On Thu, Apr 13, 2017 at 05:35:05PM +0800, Wei Wang wrote:
>>>
>>> So we don't need the bitmap to talk to host, it is just
>>> a data structure we chose to maintain lists of pages, right?
>> Right. bitmap is the way to gather pages to chunk.
>> It's only needed in the balloon page case.
>> For the unused page case, we don't need it, since the free
>> page blocks are already chunks.
>>
>>> OK as far as it goes but you need much better isolation for it.
>>> Build a data structure with APIs such as _init, _cleanup, _add, _clear,
>>> _find_first, _find_next.
>>> Completely unrelated to pages, it just maintains bits.
>>> Then use it here.
>>>
>>>
>>>>    static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
>>>>    module_param(oom_pages, int, S_IRUSR | S_IWUSR);
>>>>    MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>>>> @@ -50,6 +54,10 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>>>>    static struct vfsmount *balloon_mnt;
>>>>    #endif
>>>> +/* Types of pages to chunk */
>>>> +#define PAGE_CHUNK_TYPE_BALLOON 0
>>>> +
>>> Doesn't look like you are ever adding more types in this
>>> patchset.  Pls keep code simple, generalize it later.
>>>
>> "#define PAGE_CHUNK_TYPE_UNUSED 1" is added in another patch.
> I would say add the extra code there too. Or maybe we can avoid
> adding it altogether.

I'm trying to keep the two features (i.e. "balloon pages" and
"unused pages") decoupled while using common functions to deal
with the commonalities. That's the reason for defining the above
macro.
Without the macro, we would need separate functions. For example,
instead of one add_one_chunk(), we would need
add_one_balloon_page_chunk() and add_one_unused_page_chunk(),
and parts of the implementation would be duplicated across the
two functions.
We can probably add it when the second feature is added to
the code.

>
>> Types of page to chunk are treated differently. Different types of page
>> chunks are sent to the host via different protocols.
>>
>> 1) PAGE_CHUNK_TYPE_BALLOON: Ballooned (i.e. inflated/deflated) pages
>> to chunk.  For the ballooned type, it uses the basic chunk msg format:
>>
>> virtio_balloon_page_chunk_hdr +
>> virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
>>
>> 2) PAGE_CHUNK_TYPE_UNUSED: unused pages to chunk. It uses this miscq msg
>> format:
>> miscq_hdr +
>> virtio_balloon_page_chunk_hdr +
>> virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
>>
>> The chunk msg is actually the payload of the miscq msg.
>>
>>
> So just combine the two message formats and then it'll all be easier?
>

Yes, it'll be simple with only one msg format. But the problem I see
here is that miscq hdr is something necessary for the "unused page"
usage, but not needed by the "balloon page" usage. To be more
precise,
struct virtio_balloon_miscq_hdr {
  __le16 cmd;
  __le16 flags;
};
'cmd' specifies the command on the miscq (I envision that the
miscq will be further used to handle other possible miscellaneous
requests either from the host or to the host), so 'cmd' is necessary
for the miscq. But the inflateq is used exclusively for inflating
pages, so adding a command to it would be redundant and look a little
confusing there.
'flags': We currently use bit 0 of flags to indicate the completion
of a command. This is also useful in the "unused page" usage, but not
needed by the "balloon page" usage.
>>>> +#define MAX_PAGE_CHUNKS 4096
>>> This is an order-4 allocation. I'd make it 4095 and then it's
>>> an order-3 one.
>> Sounds good, thanks.
>> I think it would be better to make it 4090. Leave some space for the hdr
>> as well.
> And miscq hdr. In fact just let compiler do the math - something like:
> (8 * PAGE_SIZE - sizeof(hdr)) / sizeof(chunk)
Agree, thanks.

>
> I skimmed explanation of algorithms below but please make sure
> code speaks for itself and add comments inline to document it.
> Whenever you answered me inline this is where you want to
> try to make code clearer and add comments.
>
> Also, pls find ways to abstract the data structure so we don't
> need to deal with its internals all over the code.
>
>
> ....
>
>>>>    {
>>>>    	struct scatterlist sg;
>>>> +	struct virtio_balloon_page_chunk_hdr *hdr;
>>>> +	void *buf;
>>>>    	unsigned int len;
>>>> -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
>>>> +	switch (type) {
>>>> +	case PAGE_CHUNK_TYPE_BALLOON:
>>>> +		hdr = vb->balloon_page_chunk_hdr;
>>>> +		len = 0;
>>>> +		break;
>>>> +	default:
>>>> +		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
>>>> +			 __func__, type);
>>>> +		return;
>>>> +	}
>>>> -	/* We should always be able to add one buffer to an empty queue. */
>>>> -	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>>>> -	virtqueue_kick(vq);
>>>> +	buf = (void *)hdr - len;
>>> Moving back to before the header? How can this make sense?
>>> It works fine since len is 0, so just buf = hdr.
>>>
>> For the unused page chunk case, it follows its own protocol:
>> miscq_hdr + payload(chunk msg).
>>   "buf = (void *)hdr - len" moves the buf pointer to the miscq_hdr, to send
>> the entire miscq msg.
> Well just pass the correct pointer in.
>
OK. The miscq msg is
{
miscq_hdr;
chunk_msg;
}

We can probably change the code like this:

#define CHUNK_TO_MISCQ_MSG(chunk) \
	((void *)(chunk) - sizeof(struct virtio_balloon_miscq_hdr))

switch (type) {
case PAGE_CHUNK_TYPE_BALLOON:
	msg_buf = vb->balloon_page_chunk_hdr;
	msg_len = sizeof(struct virtio_balloon_page_chunk_hdr) +
		  nr_chunks * sizeof(struct virtio_balloon_page_chunk_entry);
	break;
case PAGE_CHUNK_TYPE_UNUSED:
	msg_buf = CHUNK_TO_MISCQ_MSG(vb->unused_page_chunk_hdr);
	msg_len = sizeof(struct virtio_balloon_miscq_hdr) +
		  sizeof(struct virtio_balloon_page_chunk_hdr) +
		  nr_chunks * sizeof(struct virtio_balloon_page_chunk_entry);
	break;
default:
	dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
		 __func__, type);
	return;
}



>> Please check the patch for implementing the unused page chunk,
>> it will be clear. If necessary, I can put "buf = (void *)hdr - len" from
>> that patch.
> Exactly. And all this pointer math is very messy. Please look for ways
> to clean it. It's generally easy to fill structures:
>
> struct foo *foo = kmalloc(..., sizeof(*foo) + n * sizeof(foo->a[0]));
> for (i = 0; i < n; ++i)
> 	foo->a[i] = b;
>
> this is the kind of code that's easy to understand and it's
> obvious there are no overflows and no info leaks here.
>
OK, will take your suggestion:

struct virtio_balloon_page_chunk {
	struct virtio_balloon_page_chunk_hdr hdr;
	struct virtio_balloon_page_chunk_entry entries[];
};
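
Going one step further, the miscq header could also be a named member, so that the start of each message is taken from the struct layout instead of being computed by subtracting a header size. The sketch below is a userspace illustration only; all sketch_ names and field widths are assumptions, and the real wire layout (including any padding) would have to be fixed by the spec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the structs discussed in this thread. */
struct sketch_miscq_hdr { uint16_t cmd; uint16_t flags; };
struct sketch_chunk_hdr { uint32_t chunks; };
struct sketch_chunk_entry { uint64_t base; uint64_t size; };

/* One buffer layout that can serve both message types. */
struct sketch_msg {
	struct sketch_miscq_hdr miscq;	/* sent only for unused pages */
	struct sketch_chunk_hdr hdr;
	struct sketch_chunk_entry entries[];
};

/*
 * Return the first byte to send and store the byte count in *len.
 * with_miscq_hdr selects whether the miscq header is part of the message.
 */
static void *sketch_msg_span(struct sketch_msg *msg, int with_miscq_hdr,
			     size_t nr_chunks, size_t *len)
{
	size_t start = with_miscq_hdr ? 0 : offsetof(struct sketch_msg, hdr);
	size_t end = offsetof(struct sketch_msg, entries) +
		     nr_chunks * sizeof(msg->entries[0]);

	assert(msg && len);
	*len = end - start;
	return (unsigned char *)msg + start;
}
```

The type switch then only has to pick with_miscq_hdr; both the pointer and the length follow from offsetof().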


>>>> +	len += sizeof(struct virtio_balloon_page_chunk_hdr);
>>>> +	len += hdr->chunks * sizeof(struct virtio_balloon_page_chunk);
>>>> +	sg_init_table(&sg, 1);
>>>> +	sg_set_buf(&sg, buf, len);
>>>> +	if (!virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL)) {
>>>> +		virtqueue_kick(vq);
>>>> +		if (busy_wait)
>>>> +			while (!virtqueue_get_buf(vq, &len) &&
>>>> +			       !virtqueue_is_broken(vq))
>>>> +				cpu_relax();
>>>> +		else
>>>> +			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>>>> +		hdr->chunks = 0;
>>> Why zero it here after device used it? Better to zero before use.
>> hdr->chunks tells the host how many chunks are there in the payload.
>> After the device use it, it is ready to zero it.
> It's rather confusing. Try to pass # of chunks around
> in some other way.

Not sure if this was explained clearly - we just let the chunk msg hdr
indicate the # of chunks in the payload. I think this is a pretty
normal usage, like the UDP hdr, which uses a length field to indicate
the packet length.

>>>> +	}
>>>> +}
>>>> +
>>>> +static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
>>>> +			  int type, u64 base, u64 size)
>>> what are the units here? Looks like it's in 4kbyte units?
>> what is the "unit" you referred to?
>> This is the function to add one chunk, base pfn and size of the chunk are
>> supplied to the function.
>>
> Are both size and base in bytes then?
> But you do not send them to host as is, you shift them for some reason
> before sending them to host.
>
Not in bytes, actually. base is the base pfn, i.e. the first pfn of a
run of contiguous pfns. size is the chunk size, i.e. the number of
contiguous pfns in the run.

They are shifted based on the chunk format we agreed on before:

--------------------------------------------------------
|                 Base (52 bit)        | Rsvd (12 bit) |
--------------------------------------------------------
--------------------------------------------------------
|                 Size (52 bit)        | Rsvd (12 bit) |
--------------------------------------------------------


Here, the pfn will be the balloon page pfn (4KB). In this way, the host
doesn't need to know PAGE_SIZE of the guest.
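
As an illustration of the shifting, here is a small userspace sketch of packing a (base pfn, number of pfns) pair into that format. It assumes the 52-bit field sits in the upper bits with the 12 reserved bits at the bottom, and it skips the little-endian conversion the real message would need; the sketch_ names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_CHUNK_RSVD_BITS	12	/* low bits, reserved */
#define SKETCH_CHUNK_FIELD_BITS	52	/* high bits, base or size */

struct sketch_chunk {
	uint64_t base;	/* balloon pfn << 12, low 12 bits reserved */
	uint64_t size;	/* number of balloon pfns << 12 */
};

/* Pack a run of 4KB balloon pages into one chunk. */
static struct sketch_chunk sketch_chunk_encode(uint64_t base_pfn,
					       uint64_t nr_pfns)
{
	struct sketch_chunk c;

	/* Both values must fit in the 52-bit field. */
	assert(base_pfn < (1ULL << SKETCH_CHUNK_FIELD_BITS));
	assert(nr_pfns < (1ULL << SKETCH_CHUNK_FIELD_BITS));

	c.base = base_pfn << SKETCH_CHUNK_RSVD_BITS;
	c.size = nr_pfns << SKETCH_CHUNK_RSVD_BITS;
	return c;
}

/* Recover the base pfn on the receiving side. */
static uint64_t sketch_chunk_base_pfn(const struct sketch_chunk *c)
{
	return c->base >> SKETCH_CHUNK_RSVD_BITS;
}
```

Because the reserved field is 12 bits wide and the balloon page is 4KB, the encoded base of a chunk happens to equal its guest-physical byte address.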



>>>> +			if (zero >= end)
>>>> +				chunk_size = end - one;
>>>> +			else
>>>> +				chunk_size = zero - one;
>>>> +
>>>> +			if (chunk_size)
>>>> +				add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
>>>> +					      pfn_start + one, chunk_size);
>>> Still not so what does a bit refer to? page or 4kbytes?
>>> I think it should be a page.
>> A bit in the bitmap corresponds to a pfn of a balloon page(4KB).
> That's a waste on systems with large page sizes, and it does not
> look like you handle that case correctly.

OK, I will change the bitmap to be PAGE_SIZE based here, instead of
BALLOON_PAGE_SIZE based. When converting the bits into chunks, I will
express the chunks in BALLOON_PAGE_SIZE units.


Best,
Wei


* Re: Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
@ 2017-04-17  3:35           ` Wei Wang
From: Wei Wang @ 2017-04-17  3:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On 04/15/2017 05:38 AM, Michael S. Tsirkin wrote:
> On Fri, Apr 14, 2017 at 04:37:52PM +0800, Wei Wang wrote:
>> On 04/14/2017 12:34 AM, Michael S. Tsirkin wrote:
>>> On Thu, Apr 13, 2017 at 05:35:05PM +0800, Wei Wang wrote:
>>>
>>> So we don't need the bitmap to talk to host, it is just
>>> a data structure we chose to maintain lists of pages, right?
>> Right. bitmap is the way to gather pages to chunk.
>> It's only needed in the balloon page case.
>> For the unused page case, we don't need it, since the free
>> page blocks are already chunks.
>>
>>> OK as far as it goes but you need much better isolation for it.
>>> Build a data structure with APIs such as _init, _cleanup, _add, _clear,
>>> _find_first, _find_next.
>>> Completely unrelated to pages, it just maintains bits.
>>> Then use it here.
>>>
>>>
>>>>    static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
>>>>    module_param(oom_pages, int, S_IRUSR | S_IWUSR);
>>>>    MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>>>> @@ -50,6 +54,10 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>>>>    static struct vfsmount *balloon_mnt;
>>>>    #endif
>>>> +/* Types of pages to chunk */
>>>> +#define PAGE_CHUNK_TYPE_BALLOON 0
>>>> +
>>> Doesn't look like you are ever adding more types in this
>>> patchset.  Pls keep code simple, generalize it later.
>>>
>> "#define PAGE_CHUNK_TYPE_UNUSED 1" is added in another patch.
> I would say add the extra code there too. Or maybe we can avoid
> adding it altogether.

I'm trying to keep the two features (i.e. "balloon pages" and
"unused pages") decoupled while using common functions to deal
with the commonalities. That's the reason for defining the above
macro.
Without the macro, we would need separate functions. For example,
instead of one add_one_chunk(), we would need
add_one_balloon_page_chunk() and add_one_unused_page_chunk(),
and parts of the implementation would be duplicated across the
two functions.
We can probably add it when the second feature is added to
the code.

>
>> Types of page to chunk are treated differently. Different types of page
>> chunks are sent to the host via different protocols.
>>
>> 1) PAGE_CHUNK_TYPE_BALLOON: Ballooned (i.e. inflated/deflated) pages
>> to chunk.  For the ballooned type, it uses the basic chunk msg format:
>>
>> virtio_balloon_page_chunk_hdr +
>> virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
>>
>> 2) PAGE_CHUNK_TYPE_UNUSED: unused pages to chunk. It uses this miscq msg
>> format:
>> miscq_hdr +
>> virtio_balloon_page_chunk_hdr +
>> virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
>>
>> The chunk msg is actually the payload of the miscq msg.
>>
>>
> So just combine the two message formats and then it'll all be easier?
>

Yes, it would be simpler with only one msg format. But the problem I see
here is that the miscq hdr is necessary for the "unused page" usage,
but not needed by the "balloon page" usage. To be more precise:

struct virtio_balloon_miscq_hdr {
  __le16 cmd;
  __le16 flags;
};

'cmd' specifies the command carried on the miscq (I envision that the
miscq will be further used to handle other miscellaneous requests,
either from the host or to the host), so 'cmd' is necessary for the
miscq. But the inflateq is used exclusively for inflating pages, so
adding a command to it would be redundant and would look out of place
there.
'flags': we currently use bit 0 of flags to indicate the completion
of a command. This is also useful in the "unused page" usage, and not
needed by the "balloon page" usage.

>>>> +#define MAX_PAGE_CHUNKS 4096
>>> This is an order-4 allocation. I'd make it 4095 and then it's
>>> an order-3 one.
>> Sounds good, thanks.
>> I think it would be better to make it 4090. Leave some space for the hdr
>> as well.
> And miscq hdr. In fact just let compiler do the math - something like:
> (8 * PAGE_SIZE - sizeof(hdr)) / sizeof(chunk)
Agreed, thanks.
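
A minimal user-space sketch of the "let the compiler do the math"
suggestion. Everything prefixed sketch_/SKETCH_ is hypothetical: the
header and entry layouts, the 4KB page size and the 8-byte rounding are
assumptions made for illustration, not the definitions from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed for illustration: 4KB pages, so 8 pages form an order-3 buffer. */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_BUF_SIZE  (8 * SKETCH_PAGE_SIZE)

/* Hypothetical layouts; the real field sets are defined by the patch. */
struct sketch_miscq_hdr {
	uint16_t cmd;
	uint16_t flags;
};

struct sketch_chunk_hdr {
	uint32_t chunks;
};

struct sketch_chunk_entry {
	uint64_t base;
	uint64_t size;
};

/* Space taken by both headers, rounded up so entries stay 8-byte aligned. */
#define SKETCH_HDR_SIZE \
	((sizeof(struct sketch_miscq_hdr) + \
	  sizeof(struct sketch_chunk_hdr) + 7UL) & ~7UL)

/* Let the compiler work out how many entries fit in the buffer. */
#define SKETCH_MAX_PAGE_CHUNKS \
	((SKETCH_BUF_SIZE - SKETCH_HDR_SIZE) / sizeof(struct sketch_chunk_entry))

/* Bytes occupied by a message carrying nr_chunks entries. */
static size_t sketch_msg_size(size_t nr_chunks)
{
	assert(nr_chunks <= SKETCH_MAX_PAGE_CHUNKS);
	return SKETCH_HDR_SIZE + nr_chunks * sizeof(struct sketch_chunk_entry);
}
```

With the limit derived this way, a change to either header or to the
entry size cannot silently push the buffer into a higher-order
allocation.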

>
> I skimmed explanation of algorithms below but please make sure
> code speaks for itself and add comments inline to document it.
> Whenever you answered me inline this is where you want to
> try to make code clearer and add comments.
>
> Also, pls find ways to abstract the data structure so we don't
> need to deal with its internals all over the code.
>
>
> ....
>
>>>>    {
>>>>    	struct scatterlist sg;
>>>> +	struct virtio_balloon_page_chunk_hdr *hdr;
>>>> +	void *buf;
>>>>    	unsigned int len;
>>>> -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
>>>> +	switch (type) {
>>>> +	case PAGE_CHUNK_TYPE_BALLOON:
>>>> +		hdr = vb->balloon_page_chunk_hdr;
>>>> +		len = 0;
>>>> +		break;
>>>> +	default:
>>>> +		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
>>>> +			 __func__, type);
>>>> +		return;
>>>> +	}
>>>> -	/* We should always be able to add one buffer to an empty queue. */
>>>> -	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>>>> -	virtqueue_kick(vq);
>>>> +	buf = (void *)hdr - len;
>>> Moving back to before the header? How can this make sense?
>>> It works fine since len is 0, so just buf = hdr.
>>>
>> For the unused page chunk case, it follows its own protocol:
>> miscq_hdr + payload(chunk msg).
>>   "buf = (void *)hdr - len" moves the buf pointer to the miscq_hdr, to send
>> the entire miscq msg.
> Well just pass the correct pointer in.
>
OK. The miscq msg is
{
miscq_hdr;
chunk_msg;
}

We can probably change the code like this:

/* Step back from the chunk hdr to the miscq hdr that precedes it.
 * The cast makes the subtraction count bytes, not hdr-sized elements.
 */
#define CHUNK_TO_MISCQ_MSG(chunk) \
	((void *)(chunk) - sizeof(struct virtio_balloon_miscq_hdr))

switch (type) {
case PAGE_CHUNK_TYPE_BALLOON:
	msg_buf = vb->balloon_page_chunk_hdr;
	msg_len = sizeof(struct virtio_balloon_page_chunk_hdr) +
		  nr_chunks * sizeof(struct virtio_balloon_page_chunk_entry);
	break;
case PAGE_CHUNK_TYPE_UNUSED:
	msg_buf = CHUNK_TO_MISCQ_MSG(vb->unused_page_chunk_hdr);
	msg_len = sizeof(struct virtio_balloon_miscq_hdr) +
		  sizeof(struct virtio_balloon_page_chunk_hdr) +
		  nr_chunks * sizeof(struct virtio_balloon_page_chunk_entry);
	break;
default:
	dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
		 __func__, type);
	return;
}
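
One way to avoid the pointer subtraction at the call sites is to
describe the whole miscq msg as a single struct, so the buffer pointer
is simply the struct pointer. This is only a user-space sketch: the
sketch_ names, the field layouts and the flag name are assumptions, and
endianness handling (__le16 etc.) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts mirroring the message description above. */
struct sketch_miscq_hdr {
	uint16_t cmd;
	uint16_t flags;
};

struct sketch_chunk_hdr {
	uint32_t chunks;
};

struct sketch_chunk_entry {
	uint64_t base;
	uint64_t size;
};

/* Whole miscq msg: miscq hdr first, then the chunk msg as its payload. */
struct sketch_miscq_msg {
	struct sketch_miscq_hdr hdr;
	struct sketch_chunk_hdr chunk_hdr;
	struct sketch_chunk_entry entries[];
};

/* Assumed name for "bit 0 of flags indicates completion". */
#define SKETCH_MISCQ_FLAG_DONE (1u << 0)

/* Length of a miscq msg that carries nr_chunks entries. */
static size_t sketch_miscq_msg_len(uint32_t nr_chunks)
{
	return offsetof(struct sketch_miscq_msg, entries) +
	       (size_t)nr_chunks * sizeof(struct sketch_chunk_entry);
}

/* Recover the enclosing msg from its chunk hdr, in one audited place. */
static struct sketch_miscq_msg *
sketch_msg_from_chunk_hdr(struct sketch_chunk_hdr *chunk_hdr)
{
	assert(chunk_hdr != NULL);
	return (struct sketch_miscq_msg *)
		((char *)chunk_hdr -
		 offsetof(struct sketch_miscq_msg, chunk_hdr));
}

/* True once the completion bit is set in the miscq hdr. */
static int sketch_miscq_done(const struct sketch_miscq_hdr *hdr)
{
	return (hdr->flags & SKETCH_MISCQ_FLAG_DONE) != 0;
}
```

The balloon path can keep using the bare chunk msg; only the miscq
path needs the outer struct.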



>> Please check the patch for implementing the unused page chunk,
>> it will be clear. If necessary, I can put "buf = (void *)hdr - len" from
>> that patch.
> Exactly. And all this pointer math is very messy. Please look for ways
> to clean it. It's generally easy to fill structures:
>
> struct foo *foo = kmalloc(..., sizeof(*foo) + n * sizeof(foo->a[0]));
> for (i = 0; i < n; ++i)
> 	foo->a[i] = b;
>
> this is the kind of code that's easy to understand and it's
> obvious there are no overflows and no info leaks here.
>
OK, will take your suggestion:

struct virtio_balloon_page_chunk {
	struct virtio_balloon_page_chunk_hdr hdr;
	struct virtio_balloon_page_chunk_entry entries[];
};
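
A user-space sketch of how such a struct could be allocated and filled
by index, in the style suggested above. The sketch_ names and field
layouts are assumptions; kernel code would use kmalloc()/kzalloc() and
little-endian types rather than calloc() and plain integers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layouts; the real ones come from the uapi header. */
struct sketch_chunk_hdr {
	uint32_t chunks;
};

struct sketch_chunk_entry {
	uint64_t base;
	uint64_t size;
};

struct sketch_page_chunk {
	struct sketch_chunk_hdr hdr;
	struct sketch_chunk_entry entries[];
};

/* Allocate a zeroed message with room for max_chunks entries. */
static struct sketch_page_chunk *sketch_chunk_alloc(uint32_t max_chunks)
{
	/* Zeroing means padding and unused entries carry no stale data. */
	return calloc(1, sizeof(struct sketch_page_chunk) +
			 (size_t)max_chunks * sizeof(struct sketch_chunk_entry));
}

/* Append one entry; returns 0 on success, -1 if the message is full. */
static int sketch_chunk_add(struct sketch_page_chunk *msg,
			    uint32_t max_chunks, uint64_t base, uint64_t size)
{
	assert(msg != NULL);
	if (msg->hdr.chunks >= max_chunks)
		return -1;
	msg->entries[msg->hdr.chunks].base = base;
	msg->entries[msg->hdr.chunks].size = size;
	msg->hdr.chunks++;
	return 0;
}

/* Number of bytes to hand to the virtqueue for the current contents. */
static size_t sketch_chunk_msg_len(const struct sketch_page_chunk *msg)
{
	return sizeof(*msg) + msg->hdr.chunks * sizeof(msg->entries[0]);
}
```

Because every write goes through entries[i] with a bounds check, there
is no open-coded offset arithmetic left to review.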


>>>> +	len += sizeof(struct virtio_balloon_page_chunk_hdr);
>>>> +	len += hdr->chunks * sizeof(struct virtio_balloon_page_chunk);
>>>> +	sg_init_table(&sg, 1);
>>>> +	sg_set_buf(&sg, buf, len);
>>>> +	if (!virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL)) {
>>>> +		virtqueue_kick(vq);
>>>> +		if (busy_wait)
>>>> +			while (!virtqueue_get_buf(vq, &len) &&
>>>> +			       !virtqueue_is_broken(vq))
>>>> +				cpu_relax();
>>>> +		else
>>>> +			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>>>> +		hdr->chunks = 0;
>>> Why zero it here after device used it? Better to zero before use.
>> hdr->chunks tells the host how many chunks are there in the payload.
>> After the device use it, it is ready to zero it.
> It's rather confusing. Try to pass # of chunks around
> in some other way.

Not sure if this was explained clearly: we just let the chunk msg hdr
indicate the # of chunks in the payload. I think this is a fairly
normal usage, like the UDP hdr, which uses a length field to indicate
the packet length.

>>>> +	}
>>>> +}
>>>> +
>>>> +static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
>>>> +			  int type, u64 base, u64 size)
>>> what are the units here? Looks like it's in 4kbyte units?
>> what is the "unit" you referred to?
>> This is the function to add one chunk, base pfn and size of the chunk are
>> supplied to the function.
>>
> Are both size and base in bytes then?
> But you do not send them to host as is, you shift them for some reason
> before sending them to host.
>
Not in bytes, actually. base is a base pfn, i.e. the first pfn of a
run of contiguous pfns. size is the chunk size, i.e. the number of
contiguous pfns in that run.

They are shifted according to the chunk format we agreed on before:

--------------------------------------------------------
|                 Base (52 bit)        | Rsvd (12 bit) |
--------------------------------------------------------
--------------------------------------------------------
|                 Size (52 bit)        | Rsvd (12 bit) |
--------------------------------------------------------


Here, the pfn is the balloon page pfn (4KB). This way, the host
doesn't need to know the PAGE_SIZE of the guest.
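
A small user-space sketch of packing a (base pfn, nr pfns) pair into the
two 64-bit fields drawn above. It assumes the reserved 12 bits are the
low bits of each field; the sketch_ names are hypothetical, and the
conversion to little-endian wire order is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed from the diagram: the low 12 bits of each field are reserved. */
#define SKETCH_CHUNK_SHIFT 12
#define SKETCH_CHUNK_MAX   ((UINT64_C(1) << 52) - 1)

struct sketch_chunk_entry {
	uint64_t base;
	uint64_t size;
};

/* Pack a base pfn and a pfn count, both in 4KB balloon-page units. */
static struct sketch_chunk_entry sketch_chunk_encode(uint64_t base_pfn,
						     uint64_t nr_pfns)
{
	struct sketch_chunk_entry e;

	/* Both values must fit in the 52-bit fields. */
	assert(base_pfn <= SKETCH_CHUNK_MAX && nr_pfns <= SKETCH_CHUNK_MAX);
	e.base = base_pfn << SKETCH_CHUNK_SHIFT;
	e.size = nr_pfns << SKETCH_CHUNK_SHIFT;
	return e;
}

/* Unpack the base pfn, dropping the reserved bits. */
static uint64_t sketch_chunk_base_pfn(struct sketch_chunk_entry e)
{
	return e.base >> SKETCH_CHUNK_SHIFT;
}

/* Unpack the pfn count, dropping the reserved bits. */
static uint64_t sketch_chunk_nr_pfns(struct sketch_chunk_entry e)
{
	return e.size >> SKETCH_CHUNK_SHIFT;
}
```

Under this assumption, a 4KB pfn shifted left by 12 is numerically the
guest physical byte address, which is one reason the host can decode
it without knowing the guest's PAGE_SIZE.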



>>>> +			if (zero >= end)
>>>> +				chunk_size = end - one;
>>>> +			else
>>>> +				chunk_size = zero - one;
>>>> +
>>>> +			if (chunk_size)
>>>> +				add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
>>>> +					      pfn_start + one, chunk_size);
>>> Still not so what does a bit refer to? page or 4kbytes?
>>> I think it should be a page.
>> A bit in the bitmap corresponds to a pfn of a balloon page(4KB).
> That's a waste on systems with large page sizes, and it does not
> look like you handle that case correctly.

OK, I will change the bitmap to be PAGE_SIZE based here, instead of
BALLOON_PAGE_SIZE based. When the bits are converted into chunks, the
base and size will be expressed in BALLOON_PAGE_SIZE units.
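
A user-space sketch of that conversion: one bit per guest page, with
each run of set bits turned into one chunk expressed in 4KB
balloon-page units. The sketch_ names and the pages_per_page parameter
(PAGE_SIZE / 4KB) are hypothetical; kernel code would use
find_next_bit() and find_next_zero_bit() instead of the bit-by-bit
scan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_chunk {
	uint64_t base;	/* first pfn of the run, in 4KB units */
	uint64_t size;	/* number of pfns in the run, in 4KB units */
};

static int sketch_test_bit(const uint64_t *map, size_t bit)
{
	return (int)((map[bit / 64] >> (bit % 64)) & 1);
}

/*
 * Walk bits [0, nbits) and emit one chunk per run of set bits.
 * pfn_start is the guest pfn (PAGE_SIZE units) that bit 0 stands for.
 * Returns the number of chunks written, at most max_out.
 */
static size_t sketch_bitmap_to_chunks(const uint64_t *map, size_t nbits,
				      uint64_t pfn_start,
				      uint64_t pages_per_page,
				      struct sketch_chunk *out, size_t max_out)
{
	size_t n = 0, bit = 0;

	assert(pages_per_page > 0);
	while (bit < nbits && n < max_out) {
		size_t one, zero;

		/* Skip clear bits to find the start of the next run. */
		while (bit < nbits && !sketch_test_bit(map, bit))
			bit++;
		if (bit == nbits)
			break;
		one = bit;

		/* Skip set bits; zero ends up one past the end of the run. */
		while (bit < nbits && sketch_test_bit(map, bit))
			bit++;
		zero = bit;

		/* Scale from guest pages to 4KB balloon pages. */
		out[n].base = (pfn_start + one) * pages_per_page;
		out[n].size = (zero - one) * pages_per_page;
		n++;
	}
	return n;
}
```

On a guest with 4KB pages pages_per_page is 1 and the scaling is a
no-op; with 64KB pages each bit covers 16 balloon pages.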


Best,
Wei

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [virtio-dev] Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
  2017-04-14 21:38         ` Michael S. Tsirkin
                           ` (3 preceding siblings ...)
  (?)
@ 2017-04-17  3:35         ` Wei Wang
  -1 siblings, 0 replies; 136+ messages in thread
From: Wei Wang @ 2017-04-17  3:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aarcange, virtio-dev, kvm, qemu-devel, amit.shah,
	liliang.opensource, dave.hansen, linux-kernel, virtualization,
	linux-mm, cornelia.huck, pbonzini, akpm, mgorman

On 04/15/2017 05:38 AM, Michael S. Tsirkin wrote:
> On Fri, Apr 14, 2017 at 04:37:52PM +0800, Wei Wang wrote:
>> On 04/14/2017 12:34 AM, Michael S. Tsirkin wrote:
>>> On Thu, Apr 13, 2017 at 05:35:05PM +0800, Wei Wang wrote:
>>>
>>> So we don't need the bitmap to talk to host, it is just
>>> a data structure we chose to maintain lists of pages, right?
>> Right. bitmap is the way to gather pages to chunk.
>> It's only needed in the balloon page case.
>> For the unused page case, we don't need it, since the free
>> page blocks are already chunks.
>>
>>> OK as far as it goes but you need much better isolation for it.
>>> Build a data structure with APIs such as _init, _cleanup, _add, _clear,
>>> _find_first, _find_next.
>>> Completely unrelated to pages, it just maintains bits.
>>> Then use it here.
>>>
>>>
>>>>    static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
>>>>    module_param(oom_pages, int, S_IRUSR | S_IWUSR);
>>>>    MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>>>> @@ -50,6 +54,10 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>>>>    static struct vfsmount *balloon_mnt;
>>>>    #endif
>>>> +/* Types of pages to chunk */
>>>> +#define PAGE_CHUNK_TYPE_BALLOON 0
>>>> +
>>> Doesn't look like you are ever adding more types in this
>>> patchset.  Pls keep code simple, generalize it later.
>>>
>> "#define PAGE_CHUNK_TYPE_UNUSED 1" is added in another patch.
> I would say add the extra code there too. Or maybe we can avoid
> adding it altogether.

I'm trying to keep the two features (i.e. "balloon pages" and
"unused pages") decoupled while using common functions for the
parts they share. That's the reason for defining the above macro.
Without the macro, we would need separate functions: for example,
instead of one add_one_chunk(), we would need
add_one_balloon_page_chunk() and add_one_unused_page_chunk(),
and parts of the two implementations would be duplicated.
We can probably add the macro when the second feature is
introduced.

>
>> Types of page to chunk are treated differently. Different types of page
>> chunks are sent to the host via different protocols.
>>
>> 1) PAGE_CHUNK_TYPE_BALLOON: Ballooned (i.e. inflated/deflated) pages
>> to chunk.  For the ballooned type, it uses the basic chunk msg format:
>>
>> virtio_balloon_page_chunk_hdr +
>> virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
>>
>> 2) PAGE_CHUNK_TYPE_UNUSED: unused pages to chunk. It uses this miscq msg
>> format:
>> miscq_hdr +
>> virtio_balloon_page_chunk_hdr +
>> virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
>>
>> The chunk msg is actually the payload of the miscq msg.
>>
>>
> So just combine the two message formats and then it'll all be easier?
>

Yes, it would be simpler with only one msg format. But the problem I see
here is that the miscq hdr is necessary for the "unused page" usage,
but not needed by the "balloon page" usage. To be more precise:

struct virtio_balloon_miscq_hdr {
  __le16 cmd;
  __le16 flags;
};

'cmd' specifies the command carried on the miscq (I envision that the
miscq will be further used to handle other miscellaneous requests,
either from the host or to the host), so 'cmd' is necessary for the
miscq. But the inflateq is used exclusively for inflating pages, so
adding a command to it would be redundant and would look out of place
there.
'flags': we currently use bit 0 of flags to indicate the completion
of a command. This is also useful in the "unused page" usage, and not
needed by the "balloon page" usage.

>>>> +#define MAX_PAGE_CHUNKS 4096
>>> This is an order-4 allocation. I'd make it 4095 and then it's
>>> an order-3 one.
>> Sounds good, thanks.
>> I think it would be better to make it 4090. Leave some space for the hdr
>> as well.
> And miscq hdr. In fact just let compiler do the math - something like:
> (8 * PAGE_SIZE - sizeof(hdr)) / sizeof(chunk)
Agree, thanks.
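
To make the "let the compiler do the math" idea concrete, here is a
rough userspace sketch. The struct layouts, the 4KB page size and all
sketch_/SKETCH_ names are assumptions for illustration only, not the
definitions from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed for illustration: 4KB pages and an order-3 (8-page) buffer. */
#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_BUF_PAGES 8u

/* Hypothetical layouts; the real ones are defined by the patch. */
struct sketch_miscq_hdr { uint16_t cmd; uint16_t flags; };
struct sketch_chunk_hdr { uint32_t chunks; };
struct sketch_chunk_entry { uint64_t base; uint64_t size; };

/* Entries that fit in the buffer once both headers are accounted for. */
#define SKETCH_MAX_PAGE_CHUNKS \
	((SKETCH_BUF_PAGES * SKETCH_PAGE_SIZE - \
	  sizeof(struct sketch_miscq_hdr) - \
	  sizeof(struct sketch_chunk_hdr)) / sizeof(struct sketch_chunk_entry))

/* Total bytes of a message carrying nr_chunks entries. */
static size_t sketch_msg_bytes(size_t nr_chunks)
{
	assert(nr_chunks <= SKETCH_MAX_PAGE_CHUNKS);
	return sizeof(struct sketch_miscq_hdr) +
	       sizeof(struct sketch_chunk_hdr) +
	       nr_chunks * sizeof(struct sketch_chunk_entry);
}
```

With these assumed sizes the limit follows from the buffer size, so
changing a header or the entry layout adjusts it automatically.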

>
> I skimmed explanation of algorithms below but please make sure
> code speaks for itself and add comments inline to document it.
> Whenever you answered me inline this is where you want to
> try to make code clearer and add comments.
>
> Also, pls find ways to abstract the data structure so we don't
> need to deal with its internals all over the code.
>
>
> ....
>
>>>>    {
>>>>    	struct scatterlist sg;
>>>> +	struct virtio_balloon_page_chunk_hdr *hdr;
>>>> +	void *buf;
>>>>    	unsigned int len;
>>>> -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
>>>> +	switch (type) {
>>>> +	case PAGE_CHUNK_TYPE_BALLOON:
>>>> +		hdr = vb->balloon_page_chunk_hdr;
>>>> +		len = 0;
>>>> +		break;
>>>> +	default:
>>>> +		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
>>>> +			 __func__, type);
>>>> +		return;
>>>> +	}
>>>> -	/* We should always be able to add one buffer to an empty queue. */
>>>> -	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
>>>> -	virtqueue_kick(vq);
>>>> +	buf = (void *)hdr - len;
>>> Moving back to before the header? How can this make sense?
>>> It works fine since len is 0, so just buf = hdr.
>>>
>> For the unused page chunk case, it follows its own protocol:
>> miscq_hdr + payload(chunk msg).
>>   "buf = (void *)hdr - len" moves the buf pointer to the miscq_hdr, to send
>> the entire miscq msg.
> Well just pass the correct pointer in.
>
OK. The miscq msg is
{
miscq_hdr;
chunk_msg;
}

We can probably change the code like this:

/* Cast first so the subtraction is in bytes, not in units of *chunk. */
#define CHUNK_TO_MISCQ_MSG(chunk) \
	((void *)(chunk) - sizeof(struct virtio_balloon_miscq_hdr))

switch (type) {
case PAGE_CHUNK_TYPE_BALLOON:
	msg_buf = vb->balloon_page_chunk_hdr;
	msg_len = sizeof(struct virtio_balloon_page_chunk_hdr) +
		  nr_chunks * sizeof(struct virtio_balloon_page_chunk_entry);
	break;
case PAGE_CHUNK_TYPE_UNUSED:
	msg_buf = CHUNK_TO_MISCQ_MSG(vb->unused_page_chunk_hdr);
	msg_len = sizeof(struct virtio_balloon_miscq_hdr) +
		  sizeof(struct virtio_balloon_page_chunk_hdr) +
		  nr_chunks * sizeof(struct virtio_balloon_page_chunk_entry);
	break;
default:
	dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
		 __func__, type);
	return;
}
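
One possible way to avoid the pointer arithmetic altogether is to
describe the whole miscq msg as a struct and use named members. The
following is only a userspace sketch; every sketch_ name and field
layout is hypothetical and not taken from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts, for illustration only. */
struct sketch_miscq_hdr { uint16_t cmd; uint16_t flags; };
struct sketch_chunk_hdr { uint32_t chunks; };
struct sketch_chunk_entry { uint64_t base; uint64_t size; };

/* Balloon case: chunk hdr followed by the entries. */
struct sketch_chunk_msg {
	struct sketch_chunk_hdr chunk;
	struct sketch_chunk_entry entries[];
};

/* Unused-page case: miscq hdr, then the same chunk hdr and entries. */
struct sketch_miscq_msg {
	struct sketch_miscq_hdr miscq;
	struct sketch_chunk_hdr chunk;
	struct sketch_chunk_entry entries[];
};

enum sketch_type { SKETCH_TYPE_BALLOON, SKETCH_TYPE_UNUSED };

/* Named member access replaces "hdr - sizeof(miscq hdr)" arithmetic. */
static struct sketch_chunk_hdr *sketch_miscq_chunk_hdr(struct sketch_miscq_msg *msg)
{
	assert(msg != NULL);
	return &msg->chunk;
}

/* Bytes to send for a message of the given type; 0 for an unknown type. */
static size_t sketch_msg_len(enum sketch_type type, size_t nr_chunks)
{
	size_t payload = nr_chunks * sizeof(struct sketch_chunk_entry);

	switch (type) {
	case SKETCH_TYPE_BALLOON:
		return offsetof(struct sketch_chunk_msg, entries) + payload;
	case SKETCH_TYPE_UNUSED:
		return offsetof(struct sketch_miscq_msg, entries) + payload;
	}
	return 0;
}
```

The caller then passes a pointer to the start of the right struct, so
no code needs to step backwards from the chunk hdr.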



>> Please check the patch for implementing the unused page chunk,
>> it will be clear. If necessary, I can put "buf = (void *)hdr - len" from
>> that patch.
> Exactly. And all this pointer math is very messy. Please look for ways
> to clean it. It's generally easy to fill structures:
>
> struct foo *foo = kmalloc(..., sizeof(*foo) + n * sizeof(foo->a[0]));
> for (i = 0; i < n; ++i)
> 	foo->a[i] = b;
>
> this is the kind of code that's easy to understand and it's
> obvious there are no overflows and no info leaks here.
>
OK, will take your suggestion:

struct virtio_balloon_page_chunk {
	struct virtio_balloon_page_chunk_hdr hdr;
	struct virtio_balloon_page_chunk_entry entries[];
};
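
As a rough illustration of how such a struct could be allocated and
filled by index, here is a userspace sketch. calloc() stands in for
the kernel allocator, and the sketch_ names and field layouts are
assumptions, not the patch's definitions:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layouts, for illustration only. */
struct sketch_chunk_hdr { uint32_t chunks; };
struct sketch_chunk_entry { uint64_t base; uint64_t size; };

struct sketch_page_chunk {
	struct sketch_chunk_hdr hdr;
	struct sketch_chunk_entry entries[];	/* flexible array member */
};

/* Allocate zeroed room for the header plus max_entries entries. */
static struct sketch_page_chunk *sketch_chunk_alloc(size_t max_entries)
{
	return calloc(1, sizeof(struct sketch_page_chunk) +
			 max_entries * sizeof(struct sketch_chunk_entry));
}

/* Append one entry; returns 0 on success, -1 if the buffer is full. */
static int sketch_chunk_add(struct sketch_page_chunk *c, size_t max_entries,
			    uint64_t base, uint64_t size)
{
	assert(c != NULL);
	if (c->hdr.chunks >= max_entries)
		return -1;
	c->entries[c->hdr.chunks].base = base;
	c->entries[c->hdr.chunks].size = size;
	c->hdr.chunks++;
	return 0;
}
```

Because every write goes through entries[i] with a bounds check, it is
easy to see that the buffer cannot overflow.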


>>>> +	len += sizeof(struct virtio_balloon_page_chunk_hdr);
>>>> +	len += hdr->chunks * sizeof(struct virtio_balloon_page_chunk);
>>>> +	sg_init_table(&sg, 1);
>>>> +	sg_set_buf(&sg, buf, len);
>>>> +	if (!virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL)) {
>>>> +		virtqueue_kick(vq);
>>>> +		if (busy_wait)
>>>> +			while (!virtqueue_get_buf(vq, &len) &&
>>>> +			       !virtqueue_is_broken(vq))
>>>> +				cpu_relax();
>>>> +		else
>>>> +			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
>>>> +		hdr->chunks = 0;
>>> Why zero it here after device used it? Better to zero before use.
>> hdr->chunks tells the host how many chunks are there in the payload.
>> After the device use it, it is ready to zero it.
> It's rather confusing. Try to pass # of chunks around
> in some other way.

Not sure if this was explained clearly - we just let the chunk msg hdr
indicate the # of chunks in the payload. I think this is fairly normal
usage, like the UDP hdr, which uses a length field to indicate the
packet length.

>>>> +	}
>>>> +}
>>>> +
>>>> +static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
>>>> +			  int type, u64 base, u64 size)
>>> what are the units here? Looks like it's in 4kbyte units?
>> what is the "unit" you referred to?
>> This is the function to add one chunk, base pfn and size of the chunk are
>> supplied to the function.
>>
> Are both size and base in bytes then?
> But you do not send them to host as is, you shift them for some reason
> before sending them to host.
>
Not in bytes actually. base is the base pfn, i.e. the first pfn of a
run of contiguous pfns. size is the chunk size, i.e. the number of
contiguous pfns in the run.

They are shifted based on the chunk format we agreed on before:

--------------------------------------------------------
|                 Base (52 bit)        | Rsvd (12 bit) |
--------------------------------------------------------
--------------------------------------------------------
|                 Size (52 bit)        | Rsvd (12 bit) |
--------------------------------------------------------


Here, the pfn is the balloon page pfn (4KB). This way, the host
doesn't need to know the guest's PAGE_SIZE.
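
A minimal sketch of the shifting, assuming the 52-bit value sits above
12 reserved low bits as in the diagram. The helper names are
hypothetical, and the little-endian conversion a real driver would
apply (e.g. cpu_to_le64) is left out:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_CHUNK_RSVD_BITS 12	/* low 12 bits are reserved */
#define SKETCH_CHUNK_VAL_BITS  52	/* base or size, in 4KB units */

/* Place a base pfn or a size into the upper 52 bits of a field. */
static uint64_t sketch_chunk_encode(uint64_t val)
{
	assert(val < (1ULL << SKETCH_CHUNK_VAL_BITS));
	return val << SKETCH_CHUNK_RSVD_BITS;
}

/* Recover the 52-bit value, dropping the reserved bits. */
static uint64_t sketch_chunk_decode(uint64_t field)
{
	return field >> SKETCH_CHUNK_RSVD_BITS;
}
```

For a base pfn in 4KB units, the encoded field is numerically the
byte address of the first page, with the low 12 bits free for later
use.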



>>>> +			if (zero >= end)
>>>> +				chunk_size = end - one;
>>>> +			else
>>>> +				chunk_size = zero - one;
>>>> +
>>>> +			if (chunk_size)
>>>> +				add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
>>>> +					      pfn_start + one, chunk_size);
>>> Still not so what does a bit refer to? page or 4kbytes?
>>> I think it should be a page.
>> A bit in the bitmap corresponds to a pfn of a balloon page(4KB).
> That's a waste on systems with large page sizes, and it does not
> look like you handle that case correctly.

OK, I will change the bitmap to be PAGE_SIZE based here, instead of
BALLOON_PAGE_SIZE based, and convert to BALLOON_PAGE_SIZE units when
turning the bits into chunks.
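
Roughly, the conversion could look like the userspace sketch below:
one bit per guest page, with each run of set bits reported as a
(base, size) pair in 4KB units. The byte-array bitmap and all sketch_
names are assumptions for illustration; the driver would use the
kernel bitmap helpers instead:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BALLOON_PAGE_SHIFT 12	/* balloon pages are 4KB */

struct sketch_run { uint64_t base; uint64_t size; };

static int sketch_test_bit(const uint8_t *map, size_t bit)
{
	return (map[bit / 8] >> (bit % 8)) & 1;
}

/*
 * Walk nbits bits (one per guest page of size 1 << page_shift) and record
 * each run of set bits in 4KB balloon-page units. pfn_start is the guest
 * pfn of bit 0. Returns the number of runs written to out.
 */
static size_t sketch_bitmap_to_runs(const uint8_t *map, size_t nbits,
				    uint64_t pfn_start, unsigned int page_shift,
				    struct sketch_run *out, size_t max_out)
{
	unsigned int scale;
	size_t n = 0, i = 0;

	assert(page_shift >= SKETCH_BALLOON_PAGE_SHIFT);
	scale = page_shift - SKETCH_BALLOON_PAGE_SHIFT;

	while (i < nbits && n < max_out) {
		size_t one;

		if (!sketch_test_bit(map, i)) {
			i++;
			continue;
		}
		one = i;			/* first set bit of the run */
		while (i < nbits && sketch_test_bit(map, i))
			i++;			/* i ends one past the run */
		out[n].base = (pfn_start + one) << scale;
		out[n].size = (uint64_t)(i - one) << scale;
		n++;
	}
	return n;
}
```

With 4KB guest pages the scale is 0 and the runs are reported as is;
with 64KB guest pages each bit expands to sixteen 4KB balloon pages.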


Best,
Wei

^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [virtio-dev] Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
  2017-04-17  3:35           ` Wei Wang
  (?)
  (?)
@ 2017-04-26 11:03             ` Wang, Wei W
  -1 siblings, 0 replies; 136+ messages in thread
From: Wang, Wei W @ 2017-04-26 11:03 UTC (permalink / raw)
  To: Wang, Wei W, Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, Hansen, Dave, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

Hi Michael, could you please give some feedback?

On Monday, April 17, 2017 11:35 AM, Wei Wang wrote:
> On 04/15/2017 05:38 AM, Michael S. Tsirkin wrote:
> > On Fri, Apr 14, 2017 at 04:37:52PM +0800, Wei Wang wrote:
> >> On 04/14/2017 12:34 AM, Michael S. Tsirkin wrote:
> >>> On Thu, Apr 13, 2017 at 05:35:05PM +0800, Wei Wang wrote:
> >>>
> >>> So we don't need the bitmap to talk to host, it is just a data
> >>> structure we chose to maintain lists of pages, right?
> >> Right. bitmap is the way to gather pages to chunk.
> >> It's only needed in the balloon page case.
> >> For the unused page case, we don't need it, since the free page
> >> blocks are already chunks.
> >>
> >>> OK as far as it goes but you need much better isolation for it.
> >>> Build a data structure with APIs such as _init, _cleanup, _add,
> >>> _clear, _find_first, _find_next.
> >>> Completely unrelated to pages, it just maintains bits.
> >>> Then use it here.
> >>>
> >>>
> >>>>    static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> >>>>    module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> >>>>    MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> >>>> @@ -50,6 +54,10 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> >>>>    static struct vfsmount *balloon_mnt;
> >>>>    #endif
> >>>> +/* Types of pages to chunk */
> >>>> +#define PAGE_CHUNK_TYPE_BALLOON 0
> >>>> +
> >>> Doesn't look like you are ever adding more types in this patchset.
> >>> Pls keep code simple, generalize it later.
> >>>
> >> "#define PAGE_CHUNK_TYPE_UNUSED 1" is added in another patch.
> > I would say add the extra code there too. Or maybe we can avoid adding
> > it altogether.
> 
> I'm trying to keep the two features (i.e. "balloon pages" and "unused pages")
> decoupled while using common functions for the parts they share.
> That's the reason for defining the above macro.
> Without the macro, we would need separate functions: for example,
> instead of one add_one_chunk(), we would need
> add_one_balloon_page_chunk() and add_one_unused_page_chunk(), and parts
> of the two implementations would be duplicated.
> We can probably add the macro when the second feature is introduced.
> 
> >
> >> Types of page to chunk are treated differently. Different types of
> >> page chunks are sent to the host via different protocols.
> >>
> >> 1) PAGE_CHUNK_TYPE_BALLOON: Ballooned (i.e. inflated/deflated) pages
> >> to chunk.  For the ballooned type, it uses the basic chunk msg format:
> >>
> >> virtio_balloon_page_chunk_hdr +
> >> virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
> >>
> >> 2) PAGE_CHUNK_TYPE_UNUSED: unused pages to chunk. It uses this miscq
> >> msg
> >> format:
> >> miscq_hdr +
> >> virtio_balloon_page_chunk_hdr +
> >> virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
> >>
> >> The chunk msg is actually the payload of the miscq msg.
> >>
> >>
> > So just combine the two message formats and then it'll all be easier?
> >
> 
> Yes, it would be simpler with only one msg format. But the problem I see here
> is that the miscq hdr is necessary for the "unused page" usage, but not
> needed by the "balloon page" usage. To be more precise:
> struct virtio_balloon_miscq_hdr {
>   __le16 cmd;
>   __le16 flags;
> };
> 'cmd' specifies the command on the miscq (I envision that the miscq will
> later handle other miscellaneous requests, either from the host or to the
> host), so 'cmd' is necessary for the miscq. But the inflateq is used
> exclusively for inflating pages, so adding a command to it would be
> redundant and look out of place there.
> 'flags': We currently use bit 0 of flags to indicate the completion of a
> command. This is also useful for the "unused page" usage, and not needed by
> the "balloon page" usage.
> >>>> +#define MAX_PAGE_CHUNKS 4096
> >>> This is an order-4 allocation. I'd make it 4095 and then it's an
> >>> order-3 one.
> >> Sounds good, thanks.
> >> I think it would be better to make it 4090. Leave some space for the
> >> hdr as well.
> > And miscq hdr. In fact just let compiler do the math - something like:
> > (8 * PAGE_SIZE - sizeof(hdr)) / sizeof(chunk)
> Agree, thanks.
> 
> >
> > I skimmed explanation of algorithms below but please make sure code
> > speaks for itself and add comments inline to document it.
> > Whenever you answered me inline this is where you want to try to make
> > code clearer and add comments.
> >
> > Also, pls find ways to abstract the data structure so we don't need to
> > deal with its internals all over the code.
> >
> >
> > ....
> >
> >>>>    {
> >>>>    	struct scatterlist sg;
> >>>> +	struct virtio_balloon_page_chunk_hdr *hdr;
> >>>> +	void *buf;
> >>>>    	unsigned int len;
> >>>> -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> >>>> +	switch (type) {
> >>>> +	case PAGE_CHUNK_TYPE_BALLOON:
> >>>> +		hdr = vb->balloon_page_chunk_hdr;
> >>>> +		len = 0;
> >>>> +		break;
> >>>> +	default:
> >>>> +		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
> >>>> +			 __func__, type);
> >>>> +		return;
> >>>> +	}
> >>>> -	/* We should always be able to add one buffer to an empty queue. */
> >>>> -	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> >>>> -	virtqueue_kick(vq);
> >>>> +	buf = (void *)hdr - len;
> >>> Moving back to before the header? How can this make sense?
> >>> It works fine since len is 0, so just buf = hdr.
> >>>
> >> For the unused page chunk case, it follows its own protocol:
> >> miscq_hdr + payload(chunk msg).
> >>   "buf = (void *)hdr - len" moves the buf pointer to the miscq_hdr,
> >> to send the entire miscq msg.
> > Well just pass the correct pointer in.
> >
> OK. The miscq msg is
> {
> miscq_hdr;
> chunk_msg;
> }
> 
> We can probably change the code like this:
> 
> /* Cast first so the subtraction is in bytes, not in units of *chunk. */
> #define CHUNK_TO_MISCQ_MSG(chunk) \
> 	((void *)(chunk) - sizeof(struct virtio_balloon_miscq_hdr))
> 
> switch (type) {
> case PAGE_CHUNK_TYPE_BALLOON:
> 	msg_buf = vb->balloon_page_chunk_hdr;
> 	msg_len = sizeof(struct virtio_balloon_page_chunk_hdr) +
> 		  nr_chunks * sizeof(struct virtio_balloon_page_chunk_entry);
> 	break;
> case PAGE_CHUNK_TYPE_UNUSED:
> 	msg_buf = CHUNK_TO_MISCQ_MSG(vb->unused_page_chunk_hdr);
> 	msg_len = sizeof(struct virtio_balloon_miscq_hdr) +
> 		  sizeof(struct virtio_balloon_page_chunk_hdr) +
> 		  nr_chunks * sizeof(struct virtio_balloon_page_chunk_entry);
> 	break;
> default:
> 	dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
> 		 __func__, type);
> 	return;
> }
> 
> 
> 
> >> Please check the patch for implementing the unused page chunk, it
> >> will be clear. If necessary, I can put "buf = (void *)hdr - len" from
> >> that patch.
> > Exactly. And all this pointer math is very messy. Please look for ways
> > to clean it. It's generally easy to fill structures:
> >
> > struct foo *foo = kmalloc(..., sizeof(*foo) + n * sizeof(foo->a[0]));
> > for (i = 0; i < n; ++i)
> > 	foo->a[i] = b;
> >
> > this is the kind of code that's easy to understand and it's obvious
> > there are no overflows and no info leaks here.
> >
> OK, will take your suggestion:
> 
> struct virtio_balloon_page_chunk {
> 	struct virtio_balloon_page_chunk_hdr hdr;
> 	struct virtio_balloon_page_chunk_entry entries[];
> };
> 
> 
> >>>> +	len += sizeof(struct virtio_balloon_page_chunk_hdr);
> >>>> +	len += hdr->chunks * sizeof(struct virtio_balloon_page_chunk);
> >>>> +	sg_init_table(&sg, 1);
> >>>> +	sg_set_buf(&sg, buf, len);
> >>>> +	if (!virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL)) {
> >>>> +		virtqueue_kick(vq);
> >>>> +		if (busy_wait)
> >>>> +			while (!virtqueue_get_buf(vq, &len) &&
> >>>> +			       !virtqueue_is_broken(vq))
> >>>> +				cpu_relax();
> >>>> +		else
> >>>> +			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> >>>> +		hdr->chunks = 0;
> >>> Why zero it here after device used it? Better to zero before use.
> >> hdr->chunks tells the host how many chunks are there in the payload.
> >> After the device use it, it is ready to zero it.
> > It's rather confusing. Try to pass # of chunks around in some other
> > way.
> 
> Not sure if this was explained clearly - we just let the chunk msg hdr indicate
> the # of chunks in the payload. I think this is fairly normal usage, like
> the UDP hdr, which uses a length field to indicate the packet length.
> 
> >>>> +	}
> >>>> +}
> >>>> +
> >>>> +static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
> >>>> +			  int type, u64 base, u64 size)
> >>> what are the units here? Looks like it's in 4kbyte units?
> >> what is the "unit" you referred to?
> >> This is the function to add one chunk, base pfn and size of the chunk
> >> are supplied to the function.
> >>
> > Are both size and base in bytes then?
> > But you do not send them to host as is, you shift them for some reason
> > before sending them to host.
> >
> Not in bytes actually. base is the base pfn, i.e. the first pfn of a run of
> contiguous pfns. size is the chunk size, i.e. the number of contiguous pfns
> in the run.
> 
> They are shifted based on the chunk format we agreed on before:
> 
> --------------------------------------------------------
> |                 Base (52 bit)        | Rsvd (12 bit) |
> --------------------------------------------------------
> --------------------------------------------------------
> |                 Size (52 bit)        | Rsvd (12 bit) |
> --------------------------------------------------------
> 
> 
> Here, the pfn is the balloon page pfn (4KB). This way, the host doesn't
> need to know the guest's PAGE_SIZE.
> 
> 
> 
> >>>> +			if (zero >= end)
> >>>> +				chunk_size = end - one;
> >>>> +			else
> >>>> +				chunk_size = zero - one;
> >>>> +
> >>>> +			if (chunk_size)
> >>>> +				add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
> >>>> +					      pfn_start + one, chunk_size);
> >>> Still not so what does a bit refer to? page or 4kbytes?
> >>> I think it should be a page.
> >> A bit in the bitmap corresponds to a pfn of a balloon page(4KB).
> > That's a waste on systems with large page sizes, and it does not look
> > like you handle that case correctly.
> 
> OK, I will change the bitmap to be PAGE_SIZE based here, instead of
> BALLOON_PAGE_SIZE based, and convert to BALLOON_PAGE_SIZE units when
> turning the bits into chunks.
> 
> 
> Best,
> Wei
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [virtio-dev] Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
@ 2017-04-26 11:03             ` Wang, Wei W
  0 siblings, 0 replies; 136+ messages in thread
From: Wang, Wei W @ 2017-04-26 11:03 UTC (permalink / raw)
  To: Wang, Wei W, Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, Hansen, Dave, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

Hi Michael, could you please give some feedback?

On Monday, April 17, 2017 11:35 AM, Wei Wang wrote:
> On 04/15/2017 05:38 AM, Michael S. Tsirkin wrote:
> > On Fri, Apr 14, 2017 at 04:37:52PM +0800, Wei Wang wrote:
> >> On 04/14/2017 12:34 AM, Michael S. Tsirkin wrote:
> >>> On Thu, Apr 13, 2017 at 05:35:05PM +0800, Wei Wang wrote:
> >>>
> >>> So we don't need the bitmap to talk to host, it is just a data
> >>> structure we chose to maintain lists of pages, right?
> >> Right. bitmap is the way to gather pages to chunk.
> >> It's only needed in the balloon page case.
> >> For the unused page case, we don't need it, since the free page
> >> blocks are already chunks.
> >>
> >>> OK as far as it goes but you need much better isolation for it.
> >>> Build a data structure with APIs such as _init, _cleanup, _add,
> >>> _clear, _find_first, _find_next.
> >>> Completely unrelated to pages, it just maintains bits.
> >>> Then use it here.
> >>>
> >>>
> >>>>    static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> >>>>    module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> >>>>    MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> >>>> @@ -50,6 +54,10 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> >>>>    static struct vfsmount *balloon_mnt;
> >>>>    #endif
> >>>> +/* Types of pages to chunk */
> >>>> +#define PAGE_CHUNK_TYPE_BALLOON 0
> >>>> +
> >>> Doesn't look like you are ever adding more types in this patchset.
> >>> Pls keep code simple, generalize it later.
> >>>
> >> "#define PAGE_CHUNK_TYPE_UNUSED 1" is added in another patch.
> > I would say add the extra code there too. Or maybe we can avoid adding
> > it altogether.
> 
> I'm trying to keep the two features (i.e. "balloon pages" and "unused pages")
> decoupled while using common functions for the parts they share.
> That's the reason for defining the above macro.
> Without the macro, we would need separate functions: for example,
> instead of one add_one_chunk(), we would need
> add_one_balloon_page_chunk() and add_one_unused_page_chunk(), and parts
> of the two implementations would be duplicated.
> We can probably add the macro when the second feature is introduced.
> 
> >
> >> Types of page to chunk are treated differently. Different types of
> >> page chunks are sent to the host via different protocols.
> >>
> >> 1) PAGE_CHUNK_TYPE_BALLOON: Ballooned (i.e. inflated/deflated) pages
> >> to chunk.  For the ballooned type, it uses the basic chunk msg format:
> >>
> >> virtio_balloon_page_chunk_hdr +
> >> virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
> >>
> >> 2) PAGE_CHUNK_TYPE_UNUSED: unused pages to chunk. It uses this miscq
> >> msg format:
> >> miscq_hdr +
> >> virtio_balloon_page_chunk_hdr +
> >> virtio_balloon_page_chunk * MAX_PAGE_CHUNKS
> >>
> >> The chunk msg is actually the payload of the miscq msg.
> >>
> >>
> > So just combine the two message formats and then it'll all be easier?
> >
> 
> Yes, it would be simpler with only one msg format. The problem I see is that
> the miscq hdr is necessary for the "unused page" usage, but not needed by
> the "balloon page" usage. To be more precise:
> 
> struct virtio_balloon_miscq_hdr {
>   __le16 cmd;
>   __le16 flags;
> };
> 
> 'cmd' specifies the command carried on the miscq (I envision that the miscq
> will be used further to handle other miscellaneous requests, either from the
> host or to the host), so 'cmd' is necessary for the miscq. But the inflateq is
> used exclusively for inflating pages, so adding a command to it would be
> redundant and look out of place there.
> 'flags': We currently use bit 0 of flags to indicate the completion of a command.
> This is also useful in the "unused page" usage, but not needed by the "balloon
> page" usage.
> >>>> +#define MAX_PAGE_CHUNKS 4096
> >>> This is an order-4 allocation. I'd make it 4095 and then it's an
> >>> order-3 one.
> >> Sounds good, thanks.
> >> I think it would be better to make it 4090. Leave some space for the
> >> hdr as well.
> > And miscq hdr. In fact just let compiler do the math - something like:
> > (8 * PAGE_SIZE - sizeof(hdr)) / sizeof(chunk)
> Agree, thanks.
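A minimal sketch of letting the compiler do that math. The layouts below are assumptions made for illustration (a 4-byte miscq hdr as quoted earlier, a 4-byte chunk hdr, 16-byte entries, 4KB pages); the SKETCH_/sketch_ names are hypothetical and the real sizes come from the structs in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u			/* assumption: 4KB pages */
#define SKETCH_BUF_SIZE  (8 * SKETCH_PAGE_SIZE)	/* order-3: 8 pages */

/* Hypothetical layouts; the real ones are defined by the patch. */
struct sketch_miscq_hdr {
	uint16_t cmd;
	uint16_t flags;
};

struct sketch_chunk_hdr {
	uint32_t chunks;
};

struct sketch_chunk_entry {
	uint64_t base;
	uint64_t size;
};

#define SKETCH_HDRS_SIZE \
	(sizeof(struct sketch_miscq_hdr) + sizeof(struct sketch_chunk_hdr))

/* Number of entries that fit once both headers are accounted for. */
#define SKETCH_MAX_PAGE_CHUNKS \
	((SKETCH_BUF_SIZE - SKETCH_HDRS_SIZE) / sizeof(struct sketch_chunk_entry))

/* Bytes needed for a message carrying nr_chunks entries. */
static size_t sketch_msg_size(size_t nr_chunks)
{
	assert(nr_chunks <= SKETCH_MAX_PAGE_CHUNKS);
	return SKETCH_HDRS_SIZE + nr_chunks * sizeof(struct sketch_chunk_entry);
}
```

Because the limit is derived from sizeof(), it stays correct if a header grows or the entry layout changes, which a hand-picked constant such as 4090 would not.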
> 
> >
> > I skimmed explanation of algorithms below but please make sure code
> > speaks for itself and add comments inline to document it.
> > Whenever you answered me inline this is where you want to try to make
> > code clearer and add comments.
> >
> > Also, pls find ways to abstract the data structure so we don't need to
> > deal with its internals all over the code.
> >
> >
> > ....
> >
> >>>>    {
> >>>>    	struct scatterlist sg;
> >>>> +	struct virtio_balloon_page_chunk_hdr *hdr;
> >>>> +	void *buf;
> >>>>    	unsigned int len;
> >>>> -	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
> >>>> +	switch (type) {
> >>>> +	case PAGE_CHUNK_TYPE_BALLOON:
> >>>> +		hdr = vb->balloon_page_chunk_hdr;
> >>>> +		len = 0;
> >>>> +		break;
> >>>> +	default:
> >>>> +		dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
> >>>> +			 __func__, type);
> >>>> +		return;
> >>>> +	}
> >>>> -	/* We should always be able to add one buffer to an empty queue. */
> >>>> -	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> >>>> -	virtqueue_kick(vq);
> >>>> +	buf = (void *)hdr - len;
> >>> Moving back to before the header? How can this make sense?
> >>> It works fine since len is 0, so just buf = hdr.
> >>>
> >> For the unused page chunk case, it follows its own protocol:
> >> miscq_hdr + payload(chunk msg).
> >>   "buf = (void *)hdr - len" moves the buf pointer to the miscq_hdr,
> >> to send the entire miscq msg.
> > Well just pass the correct pointer in.
> >
> OK. The miscq msg is
> {
> miscq_hdr;
> chunk_msg;
> }
> 
> We can probably change the code like this:
> 
> #define CHUNK_TO_MISCQ_MSG(chunk) \
> 	((void *)(chunk) - sizeof(struct virtio_balloon_miscq_hdr))
> 
> switch (type) {
>          case PAGE_CHUNK_TYPE_BALLOON:
>                  msg_buf = vb->balloon_page_chunk_hdr;
>                  msg_len = sizeof(struct virtio_balloon_page_chunk_hdr) +
>                      nr_chunks * sizeof(struct virtio_balloon_page_chunk_entry);
>                  break;
>          case PAGE_CHUNK_TYPE_UNUSED:
>                  msg_buf = CHUNK_TO_MISCQ_MSG(vb->unused_page_chunk_hdr);
>                  msg_len = sizeof(struct virtio_balloon_miscq_hdr) +
>                      sizeof(struct virtio_balloon_page_chunk_hdr) +
>                      nr_chunks * sizeof(struct virtio_balloon_page_chunk_entry);
>                  break;
>          default:
>                  dev_warn(&vb->vdev->dev, "%s: chunk %d of unknown pages\n",
>                           __func__, type);
>                  return;
>          }
> 
> 
> 
> >> Please check the patch for implementing the unused page chunk, it
> >> will be clear. If necessary, I can put "buf = (void *)hdr - len" from
> >> that patch.
> > Exactly. And all this pointer math is very messy. Please look for ways
> > to clean it. It's generally easy to fill structures:
> >
> > struct foo *foo = kmalloc(..., sizeof(*foo) + n * sizeof(foo->a[0]));
> > for (i = 0; i < n; ++i)
> > 	foo->a[i] = b;
> >
> > this is the kind of code that's easy to understand and it's obvious
> > there are no overflows and no info leaks here.
> >
> OK, will take your suggestion:
> 
> struct virtio_balloon_page_chunk {
> 	struct virtio_balloon_page_chunk_hdr hdr;
> 	struct virtio_balloon_page_chunk_entry entries[];
> };
> 
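The same idea can be carried over to the miscq message, so that no pointer arithmetic is needed to find its start. The sketch below is a userspace illustration under assumed layouts: the sk_* names, the 4-byte chunk hdr and the 16-byte entries are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layouts for illustration; the patch defines the real ones. */
struct sk_miscq_hdr {
	uint16_t cmd;
	uint16_t flags;
};

struct sk_chunk_hdr {
	uint32_t chunks;	/* number of valid entries that follow */
};

struct sk_chunk_entry {
	uint64_t base;
	uint64_t size;
};

/* The whole miscq message as one structure with a flexible array. */
struct sk_miscq_msg {
	struct sk_miscq_hdr miscq_hdr;
	struct sk_chunk_hdr chunk_hdr;
	struct sk_chunk_entry entries[];
};

static struct sk_miscq_msg *sk_miscq_msg_alloc(size_t max_chunks)
{
	struct sk_miscq_msg *msg;

	/* calloc() zeroes the buffer, so no stale memory is sent along. */
	msg = calloc(1, sizeof(*msg) + max_chunks * sizeof(msg->entries[0]));
	return msg;
}

/* Append one entry; returns -1 when the message is already full. */
static int sk_miscq_msg_add(struct sk_miscq_msg *msg, size_t max_chunks,
			    uint64_t base, uint64_t size)
{
	assert(msg != NULL);
	if (msg->chunk_hdr.chunks >= max_chunks)
		return -1;
	msg->entries[msg->chunk_hdr.chunks].base = base;
	msg->entries[msg->chunk_hdr.chunks].size = size;
	msg->chunk_hdr.chunks++;
	return 0;
}

/* Bytes to hand to the virtqueue: both headers plus the used entries. */
static size_t sk_miscq_msg_len(const struct sk_miscq_msg *msg)
{
	return offsetof(struct sk_miscq_msg, entries) +
	       msg->chunk_hdr.chunks * sizeof(msg->entries[0]);
}
```

The buffer pointer passed to the queue is then simply the message pointer, and the length follows from the entry count, so the bounds are easy to check by inspection.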
> 
> >>>> +	len += sizeof(struct virtio_balloon_page_chunk_hdr);
> >>>> +	len += hdr->chunks * sizeof(struct virtio_balloon_page_chunk);
> >>>> +	sg_init_table(&sg, 1);
> >>>> +	sg_set_buf(&sg, buf, len);
> >>>> +	if (!virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL)) {
> >>>> +		virtqueue_kick(vq);
> >>>> +		if (busy_wait)
> >>>> +			while (!virtqueue_get_buf(vq, &len) &&
> >>>> +			       !virtqueue_is_broken(vq))
> >>>> +				cpu_relax();
> >>>> +		else
> >>>> +			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
> >>>> +		hdr->chunks = 0;
> >>> Why zero it here after device used it? Better to zero before use.
> >> hdr->chunks tells the host how many chunks there are in the payload.
> >> After the device has used it, it is safe to zero it.
> > It's rather confusing. Try to pass # of chunks around in some other
> > way.
> 
> Not sure if this was explained clearly - we just let the chunk msg hdr indicate
> the # of chunks in the payload. I think this is a fairly normal usage, like
> the network UDP hdr, which uses a length field to indicate the packet length.
> 
> >>>> +	}
> >>>> +}
> >>>> +
> >>>> +static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
> >>>> +			  int type, u64 base, u64 size)
> >>> what are the units here? Looks like it's in 4kbyte units?
> >> what is the "unit" you referred to?
> >> This is the function to add one chunk, base pfn and size of the chunk
> >> are supplied to the function.
> >>
> > Are both size and base in bytes then?
> > But you do not send them to host as is, you shift them for some reason
> > before sending them to host.
> >
> Not in bytes actually. base is the base pfn, i.e. the first pfn of a run of
> contiguous pfns. size is the chunk size, i.e. the number of contiguous pfns.
> 
> They are shifted based on the chunk format we agreed on before:
> 
> --------------------------------------------------------
> |                 Base (52 bit)        | Rsvd (12 bit) |
> --------------------------------------------------------
> --------------------------------------------------------
> |                 Size (52 bit)        | Rsvd (12 bit) |
> --------------------------------------------------------
> 
> 
> Here, the pfn will be the balloon page pfn (4KB). In this way, the host doesn't
> need to know PAGE_SIZE of the guest.
> 
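As a rough sketch of the format in the diagram: each 64-bit word carries a 52-bit value in the upper bits and leaves the low 12 bits reserved. The helper names below are hypothetical, and the sketch ignores byte order; a real driver would additionally convert each word to little-endian before placing it on the ring.

```c
#include <assert.h>
#include <stdint.h>

#define SK_CHUNK_RSVD_BITS 12	/* low 12 bits are reserved, per the diagram */
#define SK_CHUNK_FIELD_MAX ((UINT64_C(1) << 52) - 1)

/* Pack a base pfn or a size (both counted in 4KB units) into one word. */
static uint64_t sk_chunk_encode(uint64_t val)
{
	assert(val <= SK_CHUNK_FIELD_MAX);
	return val << SK_CHUNK_RSVD_BITS;
}

/* Recover the 52-bit value, dropping whatever sits in the reserved bits. */
static uint64_t sk_chunk_decode(uint64_t word)
{
	return word >> SK_CHUNK_RSVD_BITS;
}
```

Since the unit is fixed at 4KB, the encoded base is numerically the same as the byte address of the first page, which is one way to see why the host does not need the guest's PAGE_SIZE.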
> 
> 
> >>>> +			if (zero >= end)
> >>>> +				chunk_size = end - one;
> >>>> +			else
> >>>> +				chunk_size = zero - one;
> >>>> +
> >>>> +			if (chunk_size)
> >>>> +				add_one_chunk(vb, vq, PAGE_CHUNK_TYPE_BALLOON,
> >>>> +					      pfn_start + one, chunk_size);
> >>> Still not sure: what does a bit refer to? A page or 4 kbytes?
> >>> I think it should be a page.
> >> A bit in the bitmap corresponds to a pfn of a balloon page(4KB).
> > That's a waste on systems with large page sizes, and it does not look
> > like you handle that case correctly.
> 
> OK, I will change the bitmap to be PAGE_SIZE based here, instead of
> BALLOON_PAGE_SIZE based. When converting the bits into chunks, I will make
> the chunks BALLOON_PAGE_SIZE based.
> 
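A sketch of that conversion, assuming one bit per guest page and chunks expressed in 4KB balloon pages. For readability the bitmap is modelled as one byte per bit; the sk_* names and the function signature are hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_BALLOON_PAGE_SHIFT 12	/* 4KB balloon page */

struct sk_chunk {
	uint64_t base;	/* first balloon page of the run, in 4KB units */
	uint64_t size;	/* length of the run, in 4KB units */
};

/*
 * bits[i] != 0 means guest page (pfn_start + i) is in the balloon.
 * Emit one chunk per run of set bits, scaled from guest pages
 * (1 << page_shift bytes each) to 4KB balloon pages.
 * Returns the number of chunks written to out[].
 */
static size_t sk_bits_to_chunks(const unsigned char *bits, size_t nbits,
				uint64_t pfn_start, unsigned int page_shift,
				struct sk_chunk *out, size_t max_out)
{
	uint64_t per_page;
	size_t i = 0, n = 0;

	assert(page_shift >= SK_BALLOON_PAGE_SHIFT);
	per_page = UINT64_C(1) << (page_shift - SK_BALLOON_PAGE_SHIFT);

	while (i < nbits && n < max_out) {
		size_t one, zero;

		while (i < nbits && !bits[i])	/* skip clear bits */
			i++;
		one = i;
		while (i < nbits && bits[i])	/* walk the run of set bits */
			i++;
		zero = i;

		if (zero > one) {
			out[n].base = (pfn_start + one) * per_page;
			out[n].size = (zero - one) * per_page;
			n++;
		}
	}
	return n;
}
```

With 64KB guest pages (page_shift of 16), each bit covers 16 balloon pages, so the bitmap is 16 times smaller than a 4KB-based one for the same memory range.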
> 
> Best,
> Wei

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [virtio-dev] Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
  2017-04-26 11:03             ` Wang, Wei W
  (?)
@ 2017-04-26 23:20               ` Michael S. Tsirkin
  -1 siblings, 0 replies; 136+ messages in thread
From: Michael S. Tsirkin @ 2017-04-26 23:20 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, Hansen, Dave, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Wed, Apr 26, 2017 at 11:03:34AM +0000, Wang, Wei W wrote:
> Hi Michael, could you please give some feedback?

I'm sorry, I'm not sure what you are requesting feedback on.

The interface looks reasonable now, even though there's
a way to make it even simpler if we can limit chunk size
to 2G (in fact 4G - 1). Do you think we can live with this
limitation?

But the code still needs some cleanup.

-- 
MST

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [virtio-dev] Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
  2017-04-26 23:20               ` Michael S. Tsirkin
  (?)
@ 2017-04-27  6:31                 ` Wei Wang
  -1 siblings, 0 replies; 136+ messages in thread
From: Wei Wang @ 2017-04-27  6:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, Hansen, Dave, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On 04/27/2017 07:20 AM, Michael S. Tsirkin wrote:
> On Wed, Apr 26, 2017 at 11:03:34AM +0000, Wang, Wei W wrote:
>> Hi Michael, could you please give some feedback?
> I'm sorry, I'm not sure what you are requesting feedback on.
Oh, just some trivial things (e.g. using a field in the
header, hdr->chunks, to indicate the number of chunks
in the payload) that weren't confirmed.

I will prepare a new version that fixes the agreed issues,
and we can continue to discuss those parts if you still find
them improper.


>
> The interface looks reasonable now, even though there's
> a way to make it even simpler if we can limit chunk size
> to 2G (in fact 4G - 1). Do you think we can live with this
> limitation?
Yes, I think we can. So, would it be fine to switch back to the
previous 64-bit chunk format (52-bit base + 12-bit size)?
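
A minimal sketch of what a single 64-bit chunk word with a 52-bit base and a 12-bit size could look like. How the 12-bit size field is interpreted is not spelled out in this message; the sketch assumes it is a plain count of balloon pages stored in the low bits, and chunk_pack(), chunk_base() and chunk_npages() are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_SIZE_BITS	12
#define CHUNK_SIZE_MASK	((UINT64_C(1) << CHUNK_SIZE_BITS) - 1)	/* 0xfff */
#define CHUNK_BASE_MAX	((UINT64_C(1) << 52) - 1)

/* Pack a balloon pfn (upper 52 bits) and a page count (lower 12 bits). */
static uint64_t chunk_pack(uint64_t base_pfn, uint64_t npages)
{
	assert(base_pfn <= CHUNK_BASE_MAX);
	assert(npages <= CHUNK_SIZE_MASK);
	return (base_pfn << CHUNK_SIZE_BITS) | npages;
}

/* Recover the balloon pfn from a packed chunk word. */
static uint64_t chunk_base(uint64_t chunk)
{
	return chunk >> CHUNK_SIZE_BITS;
}

/* Recover the page count from a packed chunk word. */
static uint64_t chunk_npages(uint64_t chunk)
{
	return chunk & CHUNK_SIZE_MASK;
}
```

Under this assumption one word describes at most 4095 pages, so a longer run would have to be split across several words.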


>
> But the code still needs some cleanup.
>

OK. We still need to discuss your comments on patch 05 as well.

Best,
Wei

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v9 5/5] virtio-balloon: VIRTIO_BALLOON_F_MISC_VQ
  2017-04-13 17:08     ` Michael S. Tsirkin
  (?)
@ 2017-04-27  6:33       ` Wei Wang
  -1 siblings, 0 replies; 136+ messages in thread
From: Wei Wang @ 2017-04-27  6:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On 04/14/2017 01:08 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 13, 2017 at 05:35:08PM +0800, Wei Wang wrote:
>> Add a new vq, miscq, to handle miscellaneous requests between the device
>> and the driver.
>>
>> This patch implemnts the VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES
> implements
>
>> request sent from the device.
> Commands are sent from host and handled on guest.
> In fact how is this so different from stats?
> How about reusing the stats vq then? You can use one buffer
> for stats and one buffer for commands.
>

The meaning of the two vqs is a little different. statq is used for
reporting statistics, while miscq is intended to be used to handle
miscellaneous requests from the guest or host (I think it can
also be used the other way around in the future when other
new features are added which need the guest to send requests
and the host to provide responses).

I would prefer to have them separate, because:
If we plan to combine them, we need to put the previous statq
related implementation under miscq with a new command (I think
we can't combine them without using commands to distinguish
the two features).
In this way, an old driver won't work with a new QEMU or a new
driver won't work with an old QEMU. Would this be considered
as an issue here?



>
>> +	miscq_out_hdr->flags = 0;
>> +
>> +	for_each_populated_zone(zone) {
>> +		for (order = MAX_ORDER - 1; order > 0; order--) {
>> +			for (migratetype = 0; migratetype < MIGRATE_TYPES;
>> +			     migratetype++) {
>> +				do {
>> +					ret = inquire_unused_page_block(zone,
>> +						order, migratetype, &page);
>> +					if (!ret) {
>> +						pfn = (u64)page_to_pfn(page);
>> +						add_one_chunk(vb, vq,
>> +							PAGE_CHUNK_TYPE_UNUSED,
>> +							pfn,
>> +							(u64)(1 << order));
>> +					}
>> +				} while (!ret);
>> +			}
>> +		}
>> +	}
>> +	miscq_out_hdr->flags |= VIRTIO_BALLOON_MISCQ_F_COMPLETE;
> And where is miscq_out_hdr used? I see no add_outbuf anywhere.
>
> Things like this should be passed through function parameters
> and not stuffed into device structure, fields should be
> initialized before use and not where we happen to
> have the data handy.
>

miscq_out_hdr is contiguous with the payload (i.e. a single kmalloc(hdr + payload)).
It is used the same way as statq: one request in flight at a time.
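
A minimal userspace sketch of a header that shares one allocation with its payload, using calloc() in place of kmalloc(). The struct layout, the field names other than flags and chunks, and both helper functions are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout: the payload starts right after the header. */
struct miscq_hdr {
	uint16_t cmd;
	uint16_t flags;
	uint32_t chunks;	/* number of valid entries in payload[] */
	uint64_t payload[];	/* chunk words, same allocation as the header */
};

/* Allocate the header plus room for max_chunks payload words in one block. */
static struct miscq_hdr *miscq_buf_alloc(size_t max_chunks)
{
	struct miscq_hdr *hdr;

	/* calloc() zeroes the block, so flags and chunks start at 0. */
	hdr = calloc(1, sizeof(*hdr) + max_chunks * sizeof(hdr->payload[0]));
	return hdr;	/* NULL on allocation failure */
}

/* Append one chunk word; returns 0 on success, -1 if the buffer is full. */
static int miscq_add_chunk(struct miscq_hdr *hdr, size_t max_chunks,
			   uint64_t word)
{
	assert(hdr != NULL);
	if (hdr->chunks >= max_chunks)
		return -1;
	hdr->payload[hdr->chunks++] = word;
	return 0;
}
```

Because header and payload are one block, a single buffer can be handed to the virtqueue, which fits the one-request-in-flight model described here.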


>
> Also, _F_ is normally a bit number, you use it as a value here.
>
It is intended to be a bit number: bit 0 of flags indicates that
handling of the request has completed.
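
The difference between a bit number and a mask can be sketched as follows. If an _F_ constant names a bit position, OR-ing it into flags directly is wrong, and for bit 0 it sets nothing at all. The value 0 below follows from "bit 0 of flags" in this reply; how the patch itself defines the macro is not shown here, and set_flag() and test_flag() are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed definition: the constant is a bit position, not a mask. */
#define VIRTIO_BALLOON_MISCQ_F_COMPLETE	0

/* Return flags with the given bit number set. */
static uint32_t set_flag(uint32_t flags, unsigned int bit)
{
	assert(bit < 32);
	return flags | (UINT32_C(1) << bit);
}

/* Return 1 if the given bit number is set in flags, else 0. */
static int test_flag(uint32_t flags, unsigned int bit)
{
	assert(bit < 32);
	return (flags >> bit) & 1;
}
```

With these helpers, `flags = set_flag(flags, VIRTIO_BALLOON_MISCQ_F_COMPLETE)` sets bit 0, whereas `flags |= VIRTIO_BALLOON_MISCQ_F_COMPLETE` would OR in the value 0 and leave flags unchanged.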

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v9 5/5] virtio-balloon: VIRTIO_BALLOON_F_MISC_VQ
  2017-04-27  6:33       ` Wei Wang
  (?)
@ 2017-05-05 22:21         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 136+ messages in thread
From: Michael S. Tsirkin @ 2017-05-05 22:21 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Thu, Apr 27, 2017 at 02:33:44PM +0800, Wei Wang wrote:
> On 04/14/2017 01:08 AM, Michael S. Tsirkin wrote:
> > On Thu, Apr 13, 2017 at 05:35:08PM +0800, Wei Wang wrote:
> > > Add a new vq, miscq, to handle miscellaneous requests between the device
> > > and the driver.
> > > 
> > > This patch implemnts the VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES
> > implements
> > 
> > > request sent from the device.
> > Commands are sent from host and handled on guest.
> > In fact how is this so different from stats?
> > How about reusing the stats vq then? You can use one buffer
> > for stats and one buffer for commands.
> > 
> 
> The meaning of the two vqs is a little different. statq is used for
> reporting statistics, while miscq is intended to be used to handle
> miscellaneous requests from the guest or host

misc just means "anything goes". If you want it to mean
"commands" name it so.

> (I think it can
> also be used the other way around in the future when other
> new features are added which need the guest to send requests
> and the host to provide responses).
> 
> I would prefer to have them separate, because:
> If we plan to combine them, we need to put the previous statq
> related implementation under miscq with a new command (I think
> we can't combine them without using commands to distinguish
> the two features).

Right.

> In this way, an old driver won't work with a new QEMU or a new
> driver won't work with an old QEMU. Would this be considered
> as an issue here?

Compatibility is and should always be handled using
feature flags.  There's a feature flag for this, isn't there?
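
A minimal sketch of how feature flags keep old and new sides compatible: only features offered by the device and accepted by the driver are in effect, so a new command path is used only when both sides know about it. VIRTIO_BALLOON_F_STATS_VQ is the existing bit 1; the bit numbers given to the two new features are placeholders for illustration, and negotiate() and has_feature() are hypothetical helpers, not virtio core functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Existing balloon feature bit. */
#define VIRTIO_BALLOON_F_STATS_VQ	1
/* Placeholder bit numbers, assumed for this sketch only. */
#define VIRTIO_BALLOON_F_BALLOON_CHUNKS	3
#define VIRTIO_BALLOON_F_MISC_VQ	4

/* Features in effect are those supported by both device and driver. */
static uint64_t negotiate(uint64_t device_features, uint64_t driver_features)
{
	return device_features & driver_features;
}

/* Check whether feature bit number fbit was negotiated. */
static bool has_feature(uint64_t features, unsigned int fbit)
{
	assert(fbit < 64);
	return (features & (UINT64_C(1) << fbit)) != 0;
}
```

For example, a new device offering both STATS_VQ and MISC_VQ paired with an old driver that only knows STATS_VQ ends up with STATS_VQ alone, so the old stats path keeps working and the new queue is never touched.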

> 
> 
> > 
> > > +	miscq_out_hdr->flags = 0;
> > > +
> > > +	for_each_populated_zone(zone) {
> > > +		for (order = MAX_ORDER - 1; order > 0; order--) {
> > > +			for (migratetype = 0; migratetype < MIGRATE_TYPES;
> > > +			     migratetype++) {
> > > +				do {
> > > +					ret = inquire_unused_page_block(zone,
> > > +						order, migratetype, &page);
> > > +					if (!ret) {
> > > +						pfn = (u64)page_to_pfn(page);
> > > +						add_one_chunk(vb, vq,
> > > +							PAGE_CHUNK_TYPE_UNUSED,
> > > +							pfn,
> > > +							(u64)(1 << order));
> > > +					}
> > > +				} while (!ret);
> > > +			}
> > > +		}
> > > +	}
> > > +	miscq_out_hdr->flags |= VIRTIO_BALLOON_MISCQ_F_COMPLETE;
> > And where is miscq_out_hdr used? I see no add_outbuf anywhere.
> > 
> > Things like this should be passed through function parameters
> > and not stuffed into device structure, fields should be
> > initialized before use and not where we happen to
> > have the data handy.
> > 
> 
> miscq_out_hdr is linear with the payload (i.e. kmalloc(hdr+payload) ).
> It is the same as the use of statq - one request in-flight each time.
> 
> 
> > 
> > Also, _F_ is normally a bit number, you use it as a value here.
> > 
> It intends to be a bit number. Bit 0 of flags to indicate the completion
> of handling the request.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH v9 5/5] virtio-balloon: VIRTIO_BALLOON_F_MISC_VQ
@ 2017-05-05 22:21         ` Michael S. Tsirkin
  0 siblings, 0 replies; 136+ messages in thread
From: Michael S. Tsirkin @ 2017-05-05 22:21 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Thu, Apr 27, 2017 at 02:33:44PM +0800, Wei Wang wrote:
> On 04/14/2017 01:08 AM, Michael S. Tsirkin wrote:
> > On Thu, Apr 13, 2017 at 05:35:08PM +0800, Wei Wang wrote:
> > > Add a new vq, miscq, to handle miscellaneous requests between the device
> > > and the driver.
> > > 
> > > This patch implemnts the VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES
> > implements
> > 
> > > request sent from the device.
> > Commands are sent from host and handled on guest.
> > In fact how is this so different from stats?
> > How about reusing the stats vq then? You can use one buffer
> > for stats and one buffer for commands.
> > 
> 
> The meaning of the two vqs is a little different. statq is used for
> reporting statistics, while miscq is intended to be used to handle
> miscellaneous requests from the guest or host

misc just means "anything goes". If you want it to mean
"commands" name it so.

> (I think it can
> also be used the other way around in the future when other
> new features are added which need the guest to send requests
> and the host to provide responses).
> 
> I would prefer to have them separate, because:
> If we plan to combine them, we need to put the previous statq
> related implementation under miscq with a new command (I think
> we can't combine them without using commands to distinguish
> the two features).

Right.

> In this way, an old driver won't work with a new QEMU or a new
> driver won't work with an old QEMU. Would this be considered
> as an issue here?

Compatibility is and should always be handled using
feature flags.  There's a feature flag for this, isn't it?

> 
> 
> > 
> > > +	miscq_out_hdr->flags = 0;
> > > +
> > > +	for_each_populated_zone(zone) {
> > > +		for (order = MAX_ORDER - 1; order > 0; order--) {
> > > +			for (migratetype = 0; migratetype < MIGRATE_TYPES;
> > > +			     migratetype++) {
> > > +				do {
> > > +					ret = inquire_unused_page_block(zone,
> > > +						order, migratetype, &page);
> > > +					if (!ret) {
> > > +						pfn = (u64)page_to_pfn(page);
> > > +						add_one_chunk(vb, vq,
> > > +							PAGE_CHUNK_TYPE_UNUSED,
> > > +							pfn,
> > > +							(u64)(1 << order));
> > > +					}
> > > +				} while (!ret);
> > > +			}
> > > +		}
> > > +	}
> > > +	miscq_out_hdr->flags |= VIRTIO_BALLOON_MISCQ_F_COMPLETE;
> > And where is miscq_out_hdr used? I see no add_outbuf anywhere.
> > 
> > Things like this should be passed through function parameters
> > and not stuffed into device structure, fields should be
> > initialized before use and not where we happen to
> > have the data handy.
> > 
> 
> miscq_out_hdr is linear with the payload (i.e. kmalloc(hdr+payload) ).
> It is the same as the use of statq - one request in-flight each time.
> 
> 
> > 
> > Also, _F_ is normally a bit number, you use it as a value here.
> > 
> It intends to be a bit number. Bit 0 of flags to indicate the completion
> of handling the request.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [Qemu-devel] [PATCH v9 5/5] virtio-balloon: VIRTIO_BALLOON_F_MISC_VQ
@ 2017-05-05 22:21         ` Michael S. Tsirkin
  0 siblings, 0 replies; 136+ messages in thread
From: Michael S. Tsirkin @ 2017-05-05 22:21 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Thu, Apr 27, 2017 at 02:33:44PM +0800, Wei Wang wrote:
> On 04/14/2017 01:08 AM, Michael S. Tsirkin wrote:
> > On Thu, Apr 13, 2017 at 05:35:08PM +0800, Wei Wang wrote:
> > > Add a new vq, miscq, to handle miscellaneous requests between the device
> > > and the driver.
> > > 
> > > This patch implemnts the VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES
> > implements
> > 
> > > request sent from the device.
> > Commands are sent from host and handled on guest.
> > In fact how is this so different from stats?
> > How about reusing the stats vq then? You can use one buffer
> > for stats and one buffer for commands.
> > 
> 
> The meaning of the two vqs is a little different. statq is used for
> reporting statistics, while miscq is intended to be used to handle
> miscellaneous requests from the guest or host

misc just means "anything goes". If you want it to mean
"commands" name it so.

> (I think it can
> also be used the other way around in the future when other
> new features are added which need the guest to send requests
> and the host to provide responses).
> 
> I would prefer to have them separate, because:
> If we plan to combine them, we need to put the previous statq
> related implementation under miscq with a new command (I think
> we can't combine them without using commands to distinguish
> the two features).

Right.

> In this way, an old driver won't work with a new QEMU or a new
> driver won't work with an old QEMU. Would this be considered
> as an issue here?

Compatibility is and should always be handled using
feature flags.  There's a feature flag for this, isn't it?

> 
> 
> > 
> > > +	miscq_out_hdr->flags = 0;
> > > +
> > > +	for_each_populated_zone(zone) {
> > > +		for (order = MAX_ORDER - 1; order > 0; order--) {
> > > +			for (migratetype = 0; migratetype < MIGRATE_TYPES;
> > > +			     migratetype++) {
> > > +				do {
> > > +					ret = inquire_unused_page_block(zone,
> > > +						order, migratetype, &page);
> > > +					if (!ret) {
> > > +						pfn = (u64)page_to_pfn(page);
> > > +						add_one_chunk(vb, vq,
> > > +							PAGE_CHUNK_TYPE_UNUSED,
> > > +							pfn,
> > > +							(u64)(1 << order));
> > > +					}
> > > +				} while (!ret);
> > > +			}
> > > +		}
> > > +	}
> > > +	miscq_out_hdr->flags |= VIRTIO_BALLOON_MISCQ_F_COMPLETE;
> > And where is miscq_out_hdr used? I see no add_outbuf anywhere.
> > 
> > Things like this should be passed through function parameters
> > and not stuffed into device structure, fields should be
> > initialized before use and not where we happen to
> > have the data handy.
> > 
> 
> miscq_out_hdr is linear with the payload (i.e. kmalloc(hdr+payload) ).
> It is the same as the use of statq - one request in-flight each time.
> 
> 
> > 
> > Also, _F_ is normally a bit number, you use it as a value here.
> > 
> It intends to be a bit number. Bit 0 of flags to indicate the completion
> of handling the request.
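
To illustrate the bit-number convention, here is a minimal sketch (not
taken from the patch). It assumes the flag is defined as bit number 0,
as described above; miscq_flag_mask() and miscq_flag_test() are
hypothetical helper names introduced only for this example.

```c
#include <stdint.h>

/* Assumption: COMPLETE is bit number 0 of the header flags field. */
#define VIRTIO_BALLOON_MISCQ_F_COMPLETE 0

/* Hypothetical helper: turn a bit number into a mask. */
static inline uint32_t miscq_flag_mask(unsigned int bit)
{
	return (uint32_t)1 << bit;
}

/* Hypothetical helper: test whether a given bit number is set. */
static inline int miscq_flag_test(uint32_t flags, unsigned int bit)
{
	return (flags & miscq_flag_mask(bit)) != 0;
}

/* ORing in the bit number itself is a no-op when the number is 0. */
static inline uint32_t set_complete_wrong(uint32_t flags)
{
	return flags | VIRTIO_BALLOON_MISCQ_F_COMPLETE;
}

/* ORing in the mask derived from the bit number sets bit 0. */
static inline uint32_t set_complete_right(uint32_t flags)
{
	return flags | miscq_flag_mask(VIRTIO_BALLOON_MISCQ_F_COMPLETE);
}
```

With a bit number of 0, "flags |= VIRTIO_BALLOON_MISCQ_F_COMPLETE"
leaves flags unchanged, so the completion would never be signalled.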

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [virtio-dev] Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
  2017-04-27  6:31                 ` Wei Wang
  (?)
  (?)
@ 2017-05-05 22:25                   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 136+ messages in thread
From: Michael S. Tsirkin @ 2017-05-05 22:25 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, Hansen, Dave, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Thu, Apr 27, 2017 at 02:31:49PM +0800, Wei Wang wrote:
> On 04/27/2017 07:20 AM, Michael S. Tsirkin wrote:
> > On Wed, Apr 26, 2017 at 11:03:34AM +0000, Wang, Wei W wrote:
> > > Hi Michael, could you please give some feedback?
> > I'm sorry, I'm not sure feedback on what you are requesting.
> Oh, just some trivial things (e.g. use a field in the
> header, hdr->chunks to indicate the number of chunks
> in the payload) that wasn't confirmed.
> 
> I will prepare the new version with fixing the agreed issues,
> and we can continue to discuss those parts if you still find
> them improper.
> 
> 
> > 
> > The interface looks reasonable now, even though there's
> > a way to make it even simpler if we can limit chunk size
> > to 2G (in fact 4G - 1). Do you think we can live with this
> > limitation?
> Yes, I think we can. So, is it good to change to use the
> previous 64-bit chunk format (52-bit base + 12-bit size)?

This isn't what I meant. virtio ring has descriptors with
a 64 bit address and 32 bit size.

If size < 4g is not a significant limitation, why not just
use that to pass address/size in a standard s/g list,
possibly using INDIRECT?
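
A rough sketch of the idea, under stated assumptions: the descriptor
fields below follow the split-ring layout in the virtio spec, but the
struct is renamed to avoid clashing with the real struct vring_desc.
The 4 KiB page size and the helper chunk_to_desc() are assumptions made
only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Split-ring descriptor fields as laid out in the virtio spec. */
struct vring_desc_sketch {
	uint64_t addr;  /* guest-physical address of the buffer */
	uint32_t len;   /* buffer length in bytes: at most 4G - 1 */
	uint16_t flags;
	uint16_t next;
};

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/*
 * Hypothetical helper: describe one page chunk with a single descriptor.
 * Returns false when the chunk does not fit in the 32-bit length field
 * or when the address would overflow 64 bits.
 */
static bool chunk_to_desc(uint64_t pfn, uint64_t nr_pages,
			  struct vring_desc_sketch *desc)
{
	if (nr_pages == 0 || nr_pages > (UINT32_MAX >> SKETCH_PAGE_SHIFT))
		return false;
	if (pfn > (UINT64_MAX >> SKETCH_PAGE_SHIFT))
		return false;

	desc->addr = pfn << SKETCH_PAGE_SHIFT;
	desc->len = (uint32_t)(nr_pages << SKETCH_PAGE_SHIFT);
	desc->flags = 0;
	desc->next = 0;
	return true;
}
```

In a real driver the descriptors would not be filled in by hand; the
chunk would go into a scatterlist entry and be queued through the usual
virtqueue add calls.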

> 
> > 
> > But the code still needs some cleanup.
> > 
> 
> OK. We'll also still to discuss your comments in the patch 05.
> 
> Best,
> Wei

^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [virtio-dev] Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
  2017-05-05 22:25                   ` Michael S. Tsirkin
  (?)
  (?)
@ 2017-05-07  4:19                     ` Wang, Wei W
  -1 siblings, 0 replies; 136+ messages in thread
From: Wang, Wei W @ 2017-05-07  4:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, Hansen, Dave, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On 05/06/2017 06:26 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 27, 2017 at 02:31:49PM +0800, Wei Wang wrote:
> > On 04/27/2017 07:20 AM, Michael S. Tsirkin wrote:
> > > On Wed, Apr 26, 2017 at 11:03:34AM +0000, Wang, Wei W wrote:
> > > > Hi Michael, could you please give some feedback?
> > > I'm sorry, I'm not sure feedback on what you are requesting.
> > Oh, just some trivial things (e.g. use a field in the header,
> > hdr->chunks to indicate the number of chunks in the payload) that
> > wasn't confirmed.
> >
> > I will prepare the new version with fixing the agreed issues, and we
> > can continue to discuss those parts if you still find them improper.
> >
> >
> > >
> > > The interface looks reasonable now, even though there's a way to
> > > make it even simpler if we can limit chunk size to 2G (in fact 4G -
> > > 1). Do you think we can live with this limitation?
> > Yes, I think we can. So, is it good to change to use the previous
> > 64-bit chunk format (52-bit base + 12-bit size)?
> 
> This isn't what I meant. virtio ring has descriptors with a 64 bit address and 32 bit
> size.
> 
> If size < 4g is not a significant limitation, why not just use that to pass
> address/size in a standard s/g list, possibly using INDIRECT?

OK, I see your point, thanks. Here are the two options for analysis:
Option1 (what we have now):
struct virtio_balloon_page_chunk {
        __le64 chunk_num;
        struct virtio_balloon_page_chunk_entry entry[];
};
Option2:
struct virtio_balloon_page_chunk {
        __le64 chunk_num;
        struct scatterlist entry[];
};

I have no issue with changing it to Option2, but I would prefer Option1,
because I think there is no obvious difference between the two options,
while Option1 appears to have a few small advantages here:
1) "struct virtio_balloon_page_chunk_entry" is smaller than
"struct scatterlist", so a page chunk buffer of the same allocated size
can hold more entry[] elements using Option1;
2) INDIRECT needs an on-demand kmalloc();
3) no 4G size limit;
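
Point 1) can be sketched roughly as below. The layouts are assumptions
for illustration only: the Option1 entry is taken to be two 64-bit
fields, and scatterlist_sketch is a stand-in for struct scatterlist on
a 64-bit build with a DMA length field (the real layout depends on the
architecture and kernel config). entries_per_buffer() is a hypothetical
helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed Option1 entry: two 64-bit little-endian fields. */
struct chunk_entry_sketch {
	uint64_t base;
	uint64_t size;
};

/* Assumed stand-in for struct scatterlist; not the kernel definition. */
struct scatterlist_sketch {
	uint64_t page_link;
	uint32_t offset;
	uint32_t length;
	uint64_t dma_address;
	uint32_t dma_length;
};

/* Entries that fit in buf_size bytes after the 8-byte chunk_num header. */
static size_t entries_per_buffer(size_t buf_size, size_t entry_size)
{
	const size_t hdr = sizeof(uint64_t);

	if (entry_size == 0 || buf_size < hdr)
		return 0;
	return (buf_size - hdr) / entry_size;
}
```

Under these assumptions a 4096-byte buffer holds 255 entries of 16 bytes
but only 127 entries of 32 bytes, so roughly twice as many with Option1.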

What do you think?

Best,
Wei

^ permalink raw reply	[flat|nested] 136+ messages in thread

* RE: [PATCH v9 5/5] virtio-balloon: VIRTIO_BALLOON_F_MISC_VQ
  2017-05-05 22:21         ` Michael S. Tsirkin
  (?)
  (?)
@ 2017-05-07  4:20           ` Wang, Wei W
  -1 siblings, 0 replies; 136+ messages in thread
From: Wang, Wei W @ 2017-05-07  4:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, Hansen, Dave, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On 05/06/2017 06:21 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 27, 2017 at 02:33:44PM +0800, Wei Wang wrote:
> > On 04/14/2017 01:08 AM, Michael S. Tsirkin wrote:
> > > On Thu, Apr 13, 2017 at 05:35:08PM +0800, Wei Wang wrote:
> > > > Add a new vq, miscq, to handle miscellaneous requests between the
> > > > device and the driver.
> > > >
> > > > This patch implemnts the VIRTIO_BALLOON_MISCQ_INQUIRE_UNUSED_PAGES
> > > implements
> > >
> > > > request sent from the device.
> > > Commands are sent from host and handled on guest.
> > > In fact how is this so different from stats?
> > > How about reusing the stats vq then? You can use one buffer for
> > > stats and one buffer for commands.
> > >
> >
> > The meaning of the two vqs is a little different. statq is used for
> > reporting statistics, while miscq is intended to be used to handle
> > miscellaneous requests from the guest or host
> 
> misc just means "anything goes". If you want it to mean "commands" name it so.

Ok, will change it.

> > (I think it can
> > also be used the other way around in the future when other new
> > features are added which need the guest to send requests and the host
> > to provide responses).
> >
> > I would prefer to have them separate, because:
> > If we plan to combine them, we need to put the previous statq related
> > implementation under miscq with a new command (I think we can't
> > combine them without using commands to distinguish the two features).
> 
> Right.

> > In this way, an old driver won't work with a new QEMU or a new driver
> > won't work with an old QEMU. Would this be considered as an issue
> > here?
> 
> Compatibility is and should always be handled using feature flags.  There's a
> feature flag for this, isn't there?

Negotiating the existing feature flag, VIRTIO_BALLOON_F_STATS_VQ,
only indicates support for the old statq implementation. To move the statq
implementation under cmdq, I think we would need a new feature flag for the
new statq implementation:
#define VIRTIO_BALLOON_F_CMDQ_STATS      5
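
How the two flags could coexist can be sketched as below. This is only
an illustration under assumptions: VIRTIO_BALLOON_F_CMDQ_STATS is the
flag proposed above (not an existing one), and pick_stats_path() and
the enum are hypothetical names.

```c
#include <stdint.h>

#define VIRTIO_BALLOON_F_STATS_VQ	1 /* existing: legacy stats vq */
#define VIRTIO_BALLOON_F_CMDQ_STATS	5 /* proposed above, hypothetical */

enum stats_path {
	STATS_NONE,      /* no stats reporting negotiated */
	STATS_LEGACY_VQ, /* old statq implementation */
	STATS_CMDQ,      /* stats carried as a cmdq command */
};

/*
 * Hypothetical helper: only features offered by the device and accepted
 * by the driver count as negotiated. Prefer the cmdq path when both
 * sides support it, else fall back to the legacy statq.
 */
static enum stats_path pick_stats_path(uint64_t device_features,
				       uint64_t driver_features)
{
	uint64_t negotiated = device_features & driver_features;

	if (negotiated & (1ULL << VIRTIO_BALLOON_F_CMDQ_STATS))
		return STATS_CMDQ;
	if (negotiated & (1ULL << VIRTIO_BALLOON_F_STATS_VQ))
		return STATS_LEGACY_VQ;
	return STATS_NONE;
}
```

With this kind of fallback, a new driver on an old QEMU (or the reverse)
would still end up on the legacy statq path.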

What do you think?

Best,
Wei

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [virtio-dev] Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
  2017-05-07  4:19                     ` Wang, Wei W
  (?)
  (?)
@ 2017-05-08 17:40                       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 136+ messages in thread
From: Michael S. Tsirkin @ 2017-05-08 17:40 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, Hansen, Dave, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Sun, May 07, 2017 at 04:19:28AM +0000, Wang, Wei W wrote:
> On 05/06/2017 06:26 AM, Michael S. Tsirkin wrote:
> > On Thu, Apr 27, 2017 at 02:31:49PM +0800, Wei Wang wrote:
> > > On 04/27/2017 07:20 AM, Michael S. Tsirkin wrote:
> > > > On Wed, Apr 26, 2017 at 11:03:34AM +0000, Wang, Wei W wrote:
> > > > > Hi Michael, could you please give some feedback?
> > > > > I'm sorry, I'm not sure what you are requesting feedback on.
> > > > Oh, just some trivial things (e.g. use a field in the header,
> > > > hdr->chunks to indicate the number of chunks in the payload) that
> > > > weren't confirmed.
> > > >
> > > > I will prepare the new version, fixing the agreed issues, and we
> > > can continue to discuss those parts if you still find them improper.
> > >
> > >
> > > >
> > > > The interface looks reasonable now, even though there's a way to
> > > > make it even simpler if we can limit chunk size to 2G (in fact 4G -
> > > > 1). Do you think we can live with this limitation?
> > > Yes, I think we can. So, is it good to change to use the previous
> > > 64-bit chunk format (52-bit base + 12-bit size)?
> > 
> > This isn't what I meant. virtio ring has descriptors with a 64 bit address and 32 bit
> > size.
> > 
> > If size < 4g is not a significant limitation, why not just use that to pass
> > address/size in a standard s/g list, possibly using INDIRECT?
> 
> OK, I see your point, thanks. Posting the two options here for analysis:
> Option1 (what we have now):
> struct virtio_balloon_page_chunk {
>         __le64 chunk_num;
>         struct virtio_balloon_page_chunk_entry entry[];
> };
> Option2:
> struct virtio_balloon_page_chunk {
>         __le64 chunk_num;
>         struct scatterlist entry[];
> };

This isn't what I meant really :) I meant vring_desc.

> I don't have an issue with changing it to Option2, but I would prefer Option1,
> because I think there is no obvious difference between the two options,
> while Option1 appears to have a few small advantages here:
> 1) "struct virtio_balloon_page_chunk_entry" has smaller size than
> "struct scatterlist", so the same size of allocated page chunk buffer
> can hold more entry[] using Option1;
> 2) INDIRECT needs on demand kmalloc();

Within alloc_indirect?  We can fix that with a separate patch.


> 3) no 4G size limit;

Do you see lots of >=4g chunks in practice?

> What do you think?
> 
> Best,
> Wei
> 
>

OTOH using existing vring APIs handles things like DMA transparently.


-- 
MST

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [virtio-dev] Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
  2017-05-08 17:40                       ` Michael S. Tsirkin
  (?)
@ 2017-05-09  2:45                         ` Wei Wang
  -1 siblings, 0 replies; 136+ messages in thread
From: Wei Wang @ 2017-05-09  2:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, Hansen, Dave, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On 05/09/2017 01:40 AM, Michael S. Tsirkin wrote:
> On Sun, May 07, 2017 at 04:19:28AM +0000, Wang, Wei W wrote:
>> On 05/06/2017 06:26 AM, Michael S. Tsirkin wrote:
>>> On Thu, Apr 27, 2017 at 02:31:49PM +0800, Wei Wang wrote:
>>>> On 04/27/2017 07:20 AM, Michael S. Tsirkin wrote:
>>>>> On Wed, Apr 26, 2017 at 11:03:34AM +0000, Wang, Wei W wrote:
>>>>>> Hi Michael, could you please give some feedback?
>>>>> I'm sorry, I'm not sure what you are requesting feedback on.
>>>> Oh, just some trivial things (e.g. use a field in the header,
>>>> hdr->chunks to indicate the number of chunks in the payload) that
>>>> weren't confirmed.
>>>>
>>>> I will prepare the new version, fixing the agreed issues, and we
>>>> can continue to discuss those parts if you still find them improper.
>>>>
>>>>
>>>>> The interface looks reasonable now, even though there's a way to
>>>>> make it even simpler if we can limit chunk size to 2G (in fact 4G -
>>>>> 1). Do you think we can live with this limitation?
>>>> Yes, I think we can. So, is it good to change to use the previous
>>>> 64-bit chunk format (52-bit base + 12-bit size)?
>>> This isn't what I meant. virtio ring has descriptors with a 64 bit address and 32 bit
>>> size.
>>>
>>> If size < 4g is not a significant limitation, why not just use that to pass
>>> address/size in a standard s/g list, possibly using INDIRECT?
>> OK, I see your point, thanks. Posting the two options here for analysis:
>> Option1 (what we have now):
>> struct virtio_balloon_page_chunk {
>>          __le64 chunk_num;
>>          struct virtio_balloon_page_chunk_entry entry[];
>> };
>> Option2:
>> struct virtio_balloon_page_chunk {
>>          __le64 chunk_num;
>>          struct scatterlist entry[];
>> };
> This isn't what I meant really :) I meant vring_desc.

OK. Reposting the code change:

Option2:
struct virtio_balloon_page_chunk {
         __le64 chunk_num;
         struct vring_desc entry[];
};

We pre-allocate a table of descriptors, and each descriptor holds one chunk.

In that case, the virtqueue_add() function, which deals with sg, is not
usable for us. We may need to add a new one, virtqueue_add_indirect_desc(),
to add a pre-allocated indirect descriptor table to the vring.
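
A rough user-space sketch of filling such a pre-allocated table, one
descriptor per chunk. struct desc mirrors the field layout of struct
vring_desc (64-bit address, 32-bit length, 16-bit flags and next) but uses
host-endian integers for simplicity; struct chunk and fill_desc_table() are
hypothetical names, and this is not the proposed virtqueue API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field layout as struct vring_desc; endianness handling omitted. */
struct desc {
	uint64_t addr;   /* guest-physical address of the chunk */
	uint32_t len;    /* chunk length in bytes, so below 4G */
	uint16_t flags;
	uint16_t next;   /* index of the next descriptor in the table */
};

#define DESC_F_NEXT 1    /* same value as VRING_DESC_F_NEXT */

/* Hypothetical description of one chunk of contiguous pages. */
struct chunk {
	uint64_t base;
	uint32_t size;
};

/*
 * Fill a pre-allocated table with one descriptor per chunk and chain
 * them through the next field. Returns the number of descriptors used.
 */
static size_t fill_desc_table(struct desc *table, size_t table_len,
			      const struct chunk *chunks, size_t nr_chunks)
{
	size_t n = nr_chunks < table_len ? nr_chunks : table_len;
	size_t i;

	assert(table_len <= 65536);  /* next is only 16 bits wide */

	for (i = 0; i < n; i++) {
		int last = (i + 1 == n);

		table[i].addr = chunks[i].base;
		table[i].len = chunks[i].size;
		table[i].flags = last ? 0 : DESC_F_NEXT;
		table[i].next = last ? 0 : (uint16_t)(i + 1);
	}
	return n;
}
```

Because the table is allocated once up front, nothing needs to be
kmalloc()ed on demand when chunks are added.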


>
>> I don't have an issue with changing it to Option2, but I would prefer Option1,
>> because I think there is no obvious difference between the two options,
>> while Option1 appears to have a few small advantages here:
>> 1) "struct virtio_balloon_page_chunk_entry" has smaller size than
>> "struct scatterlist", so the same size of allocated page chunk buffer
>> can hold more entry[] using Option1;
>> 2) INDIRECT needs on demand kmalloc();
> Within alloc_indirect?  We can fix that with a separate patch.
>
>
>> 3) no 4G size limit;
> Do you see lots of >=4g chunks in practice?
There wouldn't be many in practice, but we would still need extra code to
handle the case: breaking larger chunks into ones smaller than 4G.
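
A minimal sketch of what that extra code might look like, assuming 4 KiB
pages and capping each piece at 4G - 4K so that every length stays
page-aligned and fits a 32-bit field. The names (struct piece,
split_chunk, MAX_PIECE_BYTES) are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES       4096ULL                      /* assumed page size */
#define MAX_PIECE_BYTES  (0x100000000ULL - PAGE_BYTES) /* 4G - 4K */

/* One piece of a chunk; len fits a 32-bit descriptor length field. */
struct piece {
	uint64_t addr;
	uint32_t len;
};

/*
 * Split [base, base + size) into pieces of at most MAX_PIECE_BYTES.
 * Returns the number of pieces written; if out_len is too small, the
 * remainder of the chunk is not covered and the caller has to retry.
 */
static size_t split_chunk(uint64_t base, uint64_t size,
			  struct piece *out, size_t out_len)
{
	size_t n = 0;

	assert(out != NULL || out_len == 0);

	while (size > 0 && n < out_len) {
		uint64_t len = size < MAX_PIECE_BYTES ? size : MAX_PIECE_BYTES;

		out[n].addr = base;
		out[n].len = (uint32_t)len;
		base += len;
		size -= len;
		n++;
	}
	return n;
}
```

For example, a 10G chunk would be split into two pieces of 4G - 4K and one
piece holding the rest, while a chunk below the cap is passed through as a
single piece.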

Best,
Wei

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [Qemu-devel] [virtio-dev] Re: [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS
@ 2017-05-09  2:45                         ` Wei Wang
  0 siblings, 0 replies; 136+ messages in thread
From: Wei Wang @ 2017-05-09  2:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, Hansen, Dave, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On 05/09/2017 01:40 AM, Michael S. Tsirkin wrote:
> On Sun, May 07, 2017 at 04:19:28AM +0000, Wang, Wei W wrote:
>> On 05/06/2017 06:26 AM, Michael S. Tsirkin wrote:
>>> On Thu, Apr 27, 2017 at 02:31:49PM +0800, Wei Wang wrote:
>>>> On 04/27/2017 07:20 AM, Michael S. Tsirkin wrote:
>>>>> On Wed, Apr 26, 2017 at 11:03:34AM +0000, Wang, Wei W wrote:
>>>>>> Hi Michael, could you please give some feedback?
>>>>> I'm sorry, I'm not sure feedback on what you are requesting.
>>>> Oh, just some trivial things (e.g. use a field in the header,
>>>> hdr->chunks to indicate the number of chunks in the payload) that
>>>> wasn't confirmed.
>>>>
>>>> I will prepare the new version with fixing the agreed issues, and we
>>>> can continue to discuss those parts if you still find them improper.
>>>>
>>>>
>>>>> The interface looks reasonable now, even though there's a way to
>>>>> make it even simpler if we can limit chunk size to 2G (in fact 4G -
>>>>> 1). Do you think we can live with this limitation?
>>>> Yes, I think we can. So, is it good to change to use the previous
>>>> 64-bit chunk format (52-bit base + 12-bit size)?
>>> This isn't what I meant. virtio ring has descriptors with a 64 bit address and 32 bit
>>> size.
>>>
>>> If size < 4g is not a significant limitation, why not just use that to pass
>>> address/size in a standard s/g list, possibly using INDIRECT?
>> OK, I see your point, thanks. Post the two options here for an analysis:
>> Option1 (what we have now):
>> struct virtio_balloon_page_chunk {
>>          __le64 chunk_num;
>>          struct virtio_balloon_page_chunk_entry entry[];
>> };
>> Option2:
>> struct virtio_balloon_page_chunk {
>>          __le64 chunk_num;
>>          struct scatterlist entry[];
>> };
> This isn't what I meant really :) I meant vring_desc.

OK. Reposting the code change:

Option2:
struct virtio_balloon_page_chunk {
         __le64 chunk_num;
         struct vring_desc entry[];
};

We pre-allocate a table of descriptors, and each descriptor holds one chunk.

In that case, the virtqueue_add() function, which deals with sg, is not
usable for us. We may need to add a new one, virtqueue_add_indirect_desc(),
to add a pre-allocated indirect descriptor table to the vring.
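
To make the pre-allocated table idea concrete, here is a minimal userspace
sketch, not the driver code. It assumes a local struct that mirrors the
vring descriptor layout (64-bit address, 32-bit length, 16-bit flags, 16-bit
next index) and a hypothetical helper, fill_desc_table(), that stores one
chunk per descriptor and chains the entries. Endianness conversion is left
out, and virtqueue_add_indirect_desc() itself is not modelled, since it does
not exist yet.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the vring descriptor layout (assumption: host byte
 * order is used here; the real structure uses virtio-endian fields). */
struct desc_sketch {
	uint64_t addr;   /* guest-physical base address of the chunk */
	uint32_t len;    /* chunk length in bytes, hence the < 4G limit */
	uint16_t flags;  /* DESC_F_NEXT if another entry follows */
	uint16_t next;   /* index of the next entry in the chain */
};

#define DESC_F_NEXT 1 /* same value as VRING_DESC_F_NEXT */

/*
 * Hypothetical helper: fill a pre-allocated table with n chunks, one chunk
 * per descriptor, chained in order. Returns the number of entries used, or
 * 0 if there is nothing to add or the chunks do not fit in the table.
 */
static size_t fill_desc_table(struct desc_sketch *tbl, size_t cap,
			      const uint64_t *base, const uint32_t *len,
			      size_t n)
{
	size_t i;

	if (n == 0 || n > cap)
		return 0;

	for (i = 0; i < n; i++) {
		int last = (i + 1 == n);

		tbl[i].addr = base[i];
		tbl[i].len = len[i];
		tbl[i].flags = last ? 0 : DESC_F_NEXT;
		tbl[i].next = last ? 0 : (uint16_t)(i + 1);
	}
	return n;
}
```

Each entry is 16 bytes, so a 4 KiB buffer would hold 256 chunks in this
layout.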


>
>> I don't have an issue with changing it to Option2, but I would prefer Option1,
>> because I think there is no obvious difference between the two options,
>> while Option1 appears to have a few small advantages here:
>> 1) "struct virtio_balloon_page_chunk_entry" has smaller size than
>> "struct scatterlist", so the same size of allocated page chunk buffer
>> can hold more entry[] using Option1;
>> 2) INDIRECT needs on demand kmalloc();
> Within alloc_indirect?  We can fix that with a separate patch.
>
>
>> 3) no 4G size limit;
> Do you see lots of >=4g chunks in practice?
There wouldn't be many in practice, but we would still need extra code to
handle the case, i.e. to break larger chunks into pieces smaller than 4G.
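
The extra code would be roughly the following. This is only a sketch under
my own assumptions: split_chunk() and struct piece are hypothetical names,
and MAX_PIECE assumes 4 KiB pages, so that every piece except possibly the
last stays page aligned while its length still fits in a 32-bit field.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cap: the largest 4 KiB-aligned length that fits in 32 bits. */
#define MAX_PIECE 0xFFFFF000ULL

struct piece {
	uint64_t addr; /* start address of this piece */
	uint32_t len;  /* length in bytes, always <= MAX_PIECE */
};

/*
 * Split the range [addr, addr + size) into pieces of at most MAX_PIECE
 * bytes. Returns the number of pieces required; at most cap of them are
 * written to out, so the caller can detect a table that is too small.
 */
static size_t split_chunk(uint64_t addr, uint64_t size,
			  struct piece *out, size_t cap)
{
	size_t n = 0;

	while (size > 0) {
		uint64_t step = size < MAX_PIECE ? size : MAX_PIECE;

		if (n < cap) {
			out[n].addr = addr;
			out[n].len = (uint32_t)step;
		}
		n++;
		addr += step;
		size -= step;
	}
	return n;
}
```

With this cap, an 8G chunk needs three entries: two of MAX_PIECE bytes and
one of 8 KiB.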

Best,
Wei

^ permalink raw reply	[flat|nested] 136+ messages in thread

end of thread, other threads:[~2017-05-09  2:45 UTC | newest]

Thread overview: 136+ messages
2017-04-13  9:35 [PATCH v9 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration Wei Wang
2017-04-13  9:35 ` [Qemu-devel] " Wei Wang
2017-04-13  9:35 ` Wei Wang
2017-04-13  9:35 ` [PATCH v9 1/5] virtio-balloon: deflate via a page list Wei Wang
2017-04-13  9:35 ` Wei Wang
2017-04-13  9:35   ` [Qemu-devel] " Wei Wang
2017-04-13  9:35   ` Wei Wang
2017-04-13  9:35 ` [PATCH v9 2/5] virtio-balloon: VIRTIO_BALLOON_F_BALLOON_CHUNKS Wei Wang
2017-04-13  9:35 ` Wei Wang
2017-04-13  9:35   ` [Qemu-devel] " Wei Wang
2017-04-13  9:35   ` Wei Wang
2017-04-13 16:34   ` Michael S. Tsirkin
2017-04-13 16:34   ` Michael S. Tsirkin
2017-04-13 16:34     ` [Qemu-devel] " Michael S. Tsirkin
2017-04-13 16:34     ` Michael S. Tsirkin
2017-04-13 17:03     ` Matthew Wilcox
2017-04-13 17:03       ` [Qemu-devel] " Matthew Wilcox
2017-04-13 17:03       ` Matthew Wilcox
2017-04-13 17:03     ` Matthew Wilcox
2017-04-14  8:37     ` [virtio-dev] " Wei Wang
2017-04-14  8:37       ` [Qemu-devel] " Wei Wang
2017-04-14  8:37       ` Wei Wang
2017-04-14  8:37       ` Wei Wang
2017-04-14 21:38       ` [virtio-dev] " Michael S. Tsirkin
2017-04-14 21:38       ` Michael S. Tsirkin
2017-04-14 21:38         ` [Qemu-devel] " Michael S. Tsirkin
2017-04-14 21:38         ` Michael S. Tsirkin
2017-04-14 21:38         ` Michael S. Tsirkin
2017-04-17  3:35         ` [virtio-dev] " Wei Wang
2017-04-17  3:35           ` [Qemu-devel] " Wei Wang
2017-04-17  3:35           ` Wei Wang
2017-04-17  3:35           ` Wei Wang
2017-04-26 11:03           ` [virtio-dev] " Wang, Wei W
2017-04-26 11:03           ` Wang, Wei W
2017-04-26 11:03             ` [Qemu-devel] " Wang, Wei W
2017-04-26 11:03             ` Wang, Wei W
2017-04-26 11:03             ` Wang, Wei W
2017-04-26 23:20             ` [virtio-dev] " Michael S. Tsirkin
2017-04-26 23:20             ` Michael S. Tsirkin
2017-04-26 23:20               ` [Qemu-devel] " Michael S. Tsirkin
2017-04-26 23:20               ` Michael S. Tsirkin
2017-04-27  6:31               ` Wei Wang
2017-04-27  6:31               ` Wei Wang
2017-04-27  6:31                 ` [Qemu-devel] " Wei Wang
2017-04-27  6:31                 ` Wei Wang
2017-05-05 22:25                 ` Michael S. Tsirkin
2017-05-05 22:25                 ` Michael S. Tsirkin
2017-05-05 22:25                   ` [Qemu-devel] " Michael S. Tsirkin
2017-05-05 22:25                   ` Michael S. Tsirkin
2017-05-05 22:25                   ` Michael S. Tsirkin
2017-05-07  4:19                   ` [virtio-dev] " Wang, Wei W
2017-05-07  4:19                     ` [Qemu-devel] " Wang, Wei W
2017-05-07  4:19                     ` Wang, Wei W
2017-05-07  4:19                     ` Wang, Wei W
2017-05-08 17:40                     ` [virtio-dev] " Michael S. Tsirkin
2017-05-08 17:40                       ` [Qemu-devel] " Michael S. Tsirkin
2017-05-08 17:40                       ` Michael S. Tsirkin
2017-05-08 17:40                       ` Michael S. Tsirkin
2017-05-09  2:45                       ` [virtio-dev] " Wei Wang
2017-05-09  2:45                       ` Wei Wang
2017-05-09  2:45                         ` [Qemu-devel] " Wei Wang
2017-05-09  2:45                         ` Wei Wang
2017-05-08 17:40                     ` Michael S. Tsirkin
2017-05-07  4:19                   ` Wang, Wei W
2017-04-17  3:35         ` Wei Wang
2017-04-14  8:37     ` Wei Wang
2017-04-13  9:35 ` [PATCH v9 3/5] mm: function to offer a page block on the free list Wei Wang
2017-04-13  9:35 ` Wei Wang
2017-04-13  9:35   ` [Qemu-devel] " Wei Wang
2017-04-13  9:35   ` Wei Wang
2017-04-13 20:02   ` Andrew Morton
2017-04-13 20:02     ` [Qemu-devel] " Andrew Morton
2017-04-13 20:02     ` Andrew Morton
2017-04-13 20:02     ` Andrew Morton
2017-04-14  2:30     ` Wei Wang
2017-04-14  2:30       ` [Qemu-devel] " Wei Wang
2017-04-14  2:30       ` Wei Wang
2017-04-14  2:58       ` Matthew Wilcox
2017-04-14  2:58       ` Matthew Wilcox
2017-04-14  2:58         ` [Qemu-devel] " Matthew Wilcox
2017-04-14  2:58         ` Matthew Wilcox
2017-04-14  8:58         ` Wei Wang
2017-04-14  8:58           ` [Qemu-devel] " Wei Wang
2017-04-14  8:58           ` Wei Wang
2017-04-14  8:58         ` Wei Wang
2017-04-14  2:30     ` Wei Wang
2017-04-13  9:35 ` [PATCH v9 4/5] mm: export symbol of next_zone and first_online_pgdat Wei Wang
2017-04-13  9:35   ` [Qemu-devel] " Wei Wang
2017-04-13  9:35   ` Wei Wang
2017-04-13  9:35 ` Wei Wang
2017-04-13  9:35 ` [PATCH v9 5/5] virtio-balloon: VIRTIO_BALLOON_F_MISC_VQ Wei Wang
2017-04-13  9:35 ` Wei Wang
2017-04-13  9:35   ` [Qemu-devel] " Wei Wang
2017-04-13  9:35   ` Wei Wang
2017-04-13 17:08   ` Michael S. Tsirkin
2017-04-13 17:08     ` [Qemu-devel] " Michael S. Tsirkin
2017-04-13 17:08     ` Michael S. Tsirkin
2017-04-27  6:33     ` Wei Wang
2017-04-27  6:33     ` Wei Wang
2017-04-27  6:33       ` [Qemu-devel] " Wei Wang
2017-04-27  6:33       ` Wei Wang
2017-05-05 22:21       ` Michael S. Tsirkin
2017-05-05 22:21       ` Michael S. Tsirkin
2017-05-05 22:21         ` [Qemu-devel] " Michael S. Tsirkin
2017-05-05 22:21         ` Michael S. Tsirkin
2017-05-07  4:20         ` Wang, Wei W
2017-05-07  4:20         ` Wang, Wei W
2017-05-07  4:20           ` [Qemu-devel] " Wang, Wei W
2017-05-07  4:20           ` Wang, Wei W
2017-05-07  4:20           ` Wang, Wei W
2017-04-13 17:08   ` Michael S. Tsirkin
2017-04-13 20:44 ` [PATCH v9 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration Matthew Wilcox
2017-04-13 20:44   ` [Qemu-devel] " Matthew Wilcox
2017-04-13 20:44   ` Matthew Wilcox
2017-04-13 20:44   ` Matthew Wilcox
2017-04-14  1:50   ` Michael S. Tsirkin
2017-04-14  1:50     ` [Qemu-devel] " Michael S. Tsirkin
2017-04-14  1:50     ` Michael S. Tsirkin
2017-04-14  2:28     ` Wei Wang
2017-04-14  2:28       ` [Qemu-devel] " Wei Wang
2017-04-14  2:28       ` Wei Wang
2017-04-14  2:57       ` Michael S. Tsirkin
2017-04-14  2:57       ` Michael S. Tsirkin
2017-04-14  2:57         ` [Qemu-devel] " Michael S. Tsirkin
2017-04-14  2:57         ` Michael S. Tsirkin
2017-04-14  2:28     ` Wei Wang
2017-04-14  9:47     ` Matthew Wilcox
2017-04-14  9:47     ` Matthew Wilcox
2017-04-14  9:47       ` [Qemu-devel] " Matthew Wilcox
2017-04-14  9:47       ` Matthew Wilcox
2017-04-14 14:22       ` Michael S. Tsirkin
2017-04-14 14:22         ` [Qemu-devel] " Michael S. Tsirkin
2017-04-14 14:22         ` Michael S. Tsirkin
2017-04-14 14:22       ` Michael S. Tsirkin
2017-04-14  1:50   ` Michael S. Tsirkin
2017-04-13  9:35 Wei Wang
