linux-kernel.vger.kernel.org archive mirror
* [PATCH kernel v8 0/4] Extend virtio-balloon for fast (de)inflating & fast live migration
@ 2017-03-16  7:08 Wei Wang
  2017-03-16  7:08 ` [PATCH kernel v8 1/4] virtio-balloon: deflate via a page list Wei Wang
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Wei Wang @ 2017-03-16  7:08 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

This patch series implements two optimizations:
1) transfer pages between the guest and host in chunks;
2) report the guest's unused pages to the host so that they
can be skipped during live migration.

Please read each patch commit log for details.

Changes:
v7->v8:
1) Use a single chunk format, instead of two.
2) Re-wrote the virtio-balloon implementation patch.
3) Revised the commit messages.
4) Re-organized the patch series.

Liang Li (4):
  virtio-balloon: deflate via a page list
  virtio-balloon: VIRTIO_BALLOON_F_CHUNK_TRANSFER
  mm: add interface to offer info about unused pages
  virtio-balloon: VIRTIO_BALLOON_F_HOST_REQ_VQ

 drivers/virtio/virtio_balloon.c     | 533 ++++++++++++++++++++++++++++++++----
 include/linux/mm.h                  |   3 +
 include/uapi/linux/virtio_balloon.h |  31 +++
 mm/page_alloc.c                     | 114 ++++++++
 4 files changed, 635 insertions(+), 46 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH kernel v8 1/4] virtio-balloon: deflate via a page list
  2017-03-16  7:08 [PATCH kernel v8 0/4] Extend virtio-balloon for fast (de)inflating & fast live migration Wei Wang
@ 2017-03-16  7:08 ` Wei Wang
  2017-03-16  7:08 ` [PATCH kernel v8 2/4] virtio-balloon: VIRTIO_BALLOON_F_CHUNK_TRANSFER Wei Wang
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: Wei Wang @ 2017-03-16  7:08 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

From: Liang Li <liang.z.li@intel.com>

This patch saves the deflated pages to a list, instead of the PFN array.
Accordingly, the balloon_pfn_to_page() function is removed.

Signed-off-by: Liang Li <liang.z.li@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 drivers/virtio/virtio_balloon.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 181793f..f59cb4f 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -103,12 +103,6 @@ static u32 page_to_balloon_pfn(struct page *page)
 	return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE;
 }
 
-static struct page *balloon_pfn_to_page(u32 pfn)
-{
-	BUG_ON(pfn % VIRTIO_BALLOON_PAGES_PER_PAGE);
-	return pfn_to_page(pfn / VIRTIO_BALLOON_PAGES_PER_PAGE);
-}
-
 static void balloon_ack(struct virtqueue *vq)
 {
 	struct virtio_balloon *vb = vq->vdev->priv;
@@ -181,18 +175,16 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 	return num_allocated_pages;
 }
 
-static void release_pages_balloon(struct virtio_balloon *vb)
+static void release_pages_balloon(struct virtio_balloon *vb,
+				 struct list_head *pages)
 {
-	unsigned int i;
-	struct page *page;
+	struct page *page, *next;
 
-	/* Find pfns pointing at start of each page, get pages and free them. */
-	for (i = 0; i < vb->num_pfns; i += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-		page = balloon_pfn_to_page(virtio32_to_cpu(vb->vdev,
-							   vb->pfns[i]));
+	list_for_each_entry_safe(page, next, pages, lru) {
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
 			adjust_managed_page_count(page, 1);
+		list_del(&page->lru);
 		put_page(page); /* balloon reference */
 	}
 }
@@ -202,6 +194,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	unsigned num_freed_pages;
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
+	LIST_HEAD(pages);
 
 	/* We can only do one array worth at a time. */
 	num = min(num, ARRAY_SIZE(vb->pfns));
@@ -215,6 +208,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		if (!page)
 			break;
 		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
 
@@ -226,7 +220,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 */
 	if (vb->num_pfns != 0)
 		tell_host(vb, vb->deflate_vq);
-	release_pages_balloon(vb);
+	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
 }
-- 
2.7.4


* [PATCH kernel v8 2/4] virtio-balloon: VIRTIO_BALLOON_F_CHUNK_TRANSFER
  2017-03-16  7:08 [PATCH kernel v8 0/4] Extend virtio-balloon for fast (de)inflating & fast live migration Wei Wang
  2017-03-16  7:08 ` [PATCH kernel v8 1/4] virtio-balloon: deflate via a page list Wei Wang
@ 2017-03-16  7:08 ` Wei Wang
  2017-04-05  3:31   ` Wang, Wei W
  2017-03-16  7:08 ` [PATCH kernel v8 3/4] mm: add interface to offer info about unused pages Wei Wang
  2017-03-16  7:08 ` [PATCH kernel v8 4/4] virtio-balloon: VIRTIO_BALLOON_F_HOST_REQ_VQ Wei Wang
  3 siblings, 1 reply; 18+ messages in thread
From: Wei Wang @ 2017-03-16  7:08 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

From: Liang Li <liang.z.li@intel.com>

The current virtio-balloon implementation is not very efficient,
because ballooned pages are transferred to the host one by one.
Here is the breakdown of the time, in percentage, spent on each
step of the balloon inflating process (inflating 7GB of an 8GB
idle guest).

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete.
The above profiling shows that the bottlenecks are steps 2)
and 4).

This patch optimizes step 2) by transferring pages to the host in
chunks. A chunk consists of physically contiguous guest pages, and
it is offered to the host via a base PFN (i.e. the start PFN of
those contiguous pages) and the size (i.e. the total number of
pages). A chunk is formatted as below:

--------------------------------------------------------
|                 Base (52 bit)        | Rsvd (12 bit) |
--------------------------------------------------------
--------------------------------------------------------
|                 Size (52 bit)        | Rsvd (12 bit) |
--------------------------------------------------------

By doing so, step 4) can also be optimized by doing address
translation and madvise() in chunks rather than page by page.

This optimization requires the negotiation of a new feature bit,
VIRTIO_BALLOON_F_CHUNK_TRANSFER.

With this new feature, the above ballooning process takes ~590ms,
an improvement of ~85%.

TODO: optimize step 1) by allocating/freeing a chunk of pages
at a time instead of a single page.
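
The 64-bit chunk encoding described above can be sketched in plain C as follows. This is a hedged userspace illustration only: the helper names are hypothetical (the kernel patch applies the same shifts inline with cpu_to_le64()), and the 12-bit reserved low field follows the diagram above.

```c
#include <assert.h>
#include <stdint.h>

/* Low 12 bits of each word are reserved; the upper 52 bits carry
 * the base PFN or the page count, matching BALLOON_CHUNK_BASE_SHIFT
 * and BALLOON_CHUNK_SIZE_SHIFT in the patch. */
#define CHUNK_SHIFT 12

struct page_chunk {
	uint64_t base;	/* start PFN << 12, low 12 bits reserved */
	uint64_t size;	/* page count << 12, low 12 bits reserved */
};

/* Pack a run of physically contiguous pages into one chunk. */
static struct page_chunk chunk_encode(uint64_t base_pfn, uint64_t npages)
{
	struct page_chunk c;

	c.base = base_pfn << CHUNK_SHIFT;
	c.size = npages << CHUNK_SHIFT;
	return c;
}

/* Recover the start PFN the host would read out of a chunk. */
static uint64_t chunk_base_pfn(const struct page_chunk *c)
{
	return c->base >> CHUNK_SHIFT;
}

/* Recover the number of pages in the chunk. */
static uint64_t chunk_npages(const struct page_chunk *c)
{
	return c->size >> CHUNK_SHIFT;
}
```

A 512-page run starting at PFN 0x12345 thus travels as a single 16-byte chunk instead of 512 individual PFN entries.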

Signed-off-by: Liang Li <liang.z.li@intel.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/virtio/virtio_balloon.c     | 371 +++++++++++++++++++++++++++++++++---
 include/uapi/linux/virtio_balloon.h |   9 +
 2 files changed, 353 insertions(+), 27 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f59cb4f..3f4a161 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -42,6 +42,10 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+#define PAGE_BMAP_SIZE	(8 * PAGE_SIZE)
+#define PFNS_PER_PAGE_BMAP	(PAGE_BMAP_SIZE * BITS_PER_BYTE)
+#define PAGE_BMAP_COUNT_MAX	32
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
@@ -50,6 +54,14 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 static struct vfsmount *balloon_mnt;
 #endif
 
+#define BALLOON_CHUNK_BASE_SHIFT 12
+#define BALLOON_CHUNK_SIZE_SHIFT 12
+struct balloon_page_chunk {
+	__le64 base;
+	__le64 size;
+};
+
+typedef __le64 resp_data_t;
 struct virtio_balloon {
 	struct virtio_device *vdev;
 	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
@@ -67,6 +79,31 @@ struct virtio_balloon {
 
 	/* Number of balloon pages we've told the Host we're not using. */
 	unsigned int num_pages;
+	/* Pointer to the response header. */
+	struct virtio_balloon_resp_hdr *resp_hdr;
+	/* Pointer to the start address of response data. */
+	resp_data_t *resp_data;
+	/* Size of response data buffer. */
+	unsigned int resp_buf_size;
+	/* Pointer offset of the response data. */
+	unsigned int resp_pos;
+	/* Bitmap used to record pages */
+	unsigned long *page_bmap[PAGE_BMAP_COUNT_MAX];
+	/* Number of split page bmaps */
+	unsigned int page_bmaps;
+
+	/*
+	 * The allocated page_bmap size may be smaller than the pfn range of
+	 * the ballooned pages. In this case, we need to use the page_bmap
+	 * multiple times to cover the entire pfn range. It's like using a
+	 * short ruler several times to finish measuring a long object.
+	 * The start location of the ruler in the next measurement is the end
+	 * location of the ruler in the previous measurement.
+	 *
+	 * pfn_max & pfn_min: forms the pfn range of the ballooned pages
+	 * pfn_start & pfn_stop: records the start and stop pfn in each cover
+	 */
+	unsigned long pfn_min, pfn_max, pfn_start, pfn_stop;
 	/*
 	 * The pages we've told the Host we're not using are enqueued
 	 * at vb_dev_info->pages list.
@@ -110,20 +147,187 @@ static void balloon_ack(struct virtqueue *vq)
 	wake_up(&vb->acked);
 }
 
-static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+static inline void init_page_bmap_range(struct virtio_balloon *vb)
+{
+	vb->pfn_min = ULONG_MAX;
+	vb->pfn_max = 0;
+}
+
+static inline void update_page_bmap_range(struct virtio_balloon *vb,
+					  struct page *page)
+{
+	unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+	vb->pfn_min = min(balloon_pfn, vb->pfn_min);
+	vb->pfn_max = max(balloon_pfn, vb->pfn_max);
+}
+
+/* Extend the page_bmap by allocating more bitmap arrays */
+static void extend_page_bmap_size(struct virtio_balloon *vb,
+				  unsigned long pfns)
+{
+	int i, bmaps;
+	unsigned long bmap_len;
+
+	bmap_len = ALIGN(pfns, BITS_PER_LONG) / BITS_PER_BYTE;
+	bmap_len = ALIGN(bmap_len, PAGE_BMAP_SIZE);
+	bmaps = min((int)(bmap_len / PAGE_BMAP_SIZE),
+		    PAGE_BMAP_COUNT_MAX);
+
+	for (i = 1; i < bmaps; i++) {
+		vb->page_bmap[i] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
+		if (vb->page_bmap[i])
+			vb->page_bmaps++;
+		else
+			break;
+	}
+}
+
+static void free_extended_page_bmap(struct virtio_balloon *vb)
+{
+	int i, bmaps = vb->page_bmaps;
+
+	for (i = 1; i < bmaps; i++) {
+		kfree(vb->page_bmap[i]);
+		vb->page_bmap[i] = NULL;
+		vb->page_bmaps--;
+	}
+}
+
+static void free_page_bmap(struct virtio_balloon *vb)
+{
+	int i;
+
+	for (i = 0; i < vb->page_bmaps; i++)
+		kfree(vb->page_bmap[i]);
+}
+
+static void clear_page_bmap(struct virtio_balloon *vb)
+{
+	int i;
+
+	for (i = 0; i < vb->page_bmaps; i++)
+		memset(vb->page_bmap[i], 0, PAGE_BMAP_SIZE);
+}
+
+static void send_resp_data(struct virtio_balloon *vb, struct virtqueue *vq,
+			   bool busy_wait)
 {
 	struct scatterlist sg;
+	struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
 	unsigned int len;
 
-	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
+	len = vb->resp_pos * sizeof(resp_data_t);
+	hdr->data_len = cpu_to_le32(len);
+	len += sizeof(struct virtio_balloon_resp_hdr);
+	sg_init_table(&sg, 1);
+	sg_set_buf(&sg, hdr, len);
+
+	if (!virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL)) {
+		virtqueue_kick(vq);
+		if (busy_wait)
+			while (!virtqueue_get_buf(vq, &len) &&
+			       !virtqueue_is_broken(vq))
+				cpu_relax();
+		else
+			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		vb->resp_pos = 0;
+		free_extended_page_bmap(vb);
+	}
+}
+
+/* Number of resp_data elements one chunk occupies */
+#define RESP_POS_ADD_CHUNK (sizeof(struct balloon_page_chunk) / \
+			    sizeof(resp_data_t))
+static void add_one_chunk(struct virtio_balloon *vb, struct virtqueue *vq,
+			  unsigned long base, int size)
+{
+	struct balloon_page_chunk *chunk =
+				(struct balloon_page_chunk *)(vb->resp_data +
+								vb->resp_pos);
+	/*
+	 * Not enough resp_data space to hold the next
+	 * chunk?
+	 */
+	if ((vb->resp_pos + RESP_POS_ADD_CHUNK) *
+	    sizeof(resp_data_t) > vb->resp_buf_size)
+		send_resp_data(vb, vq, false);
 
-	/* We should always be able to add one buffer to an empty queue. */
-	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
-	virtqueue_kick(vq);
+	chunk->base = cpu_to_le64(base << BALLOON_CHUNK_BASE_SHIFT);
+	chunk->size = cpu_to_le64(size << BALLOON_CHUNK_SIZE_SHIFT);
+	vb->resp_pos += RESP_POS_ADD_CHUNK;
+}
+
+static void chunking_pages_from_bmap(struct virtio_balloon *vb,
+				     struct virtqueue *vq,
+				     unsigned long pfn_start,
+				     unsigned long *bmap,
+				     unsigned long len)
+{
+	unsigned long pos = 0, end = len * BITS_PER_BYTE;
+
+	while (pos < end) {
+		unsigned long one = find_next_bit(bmap, end, pos);
+
+		if (one < end) {
+			unsigned long chunk_size, zero;
 
-	/* When host has read buffer, this completes via balloon_ack */
-	wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+			zero = find_next_zero_bit(bmap, end, one + 1);
+			if (zero >= end)
+				chunk_size = end - one;
+			else
+				chunk_size = zero - one;
 
+			if (chunk_size)
+				add_one_chunk(vb, vq, pfn_start + one,
+						chunk_size);
+			pos = one + chunk_size;
+		} else
+			break;
+	}
+}
+
+static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+{
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_CHUNK_TRANSFER)) {
+		int pfns, page_bmaps, i;
+		unsigned long pfn_start, pfns_len;
+
+		pfn_start = vb->pfn_start;
+		pfns = vb->pfn_stop - pfn_start + 1;
+		pfns = roundup(roundup(pfns, BITS_PER_LONG),
+			       PFNS_PER_PAGE_BMAP);
+		page_bmaps = pfns / PFNS_PER_PAGE_BMAP;
+		pfns_len = pfns / BITS_PER_BYTE;
+
+		for (i = 0; i < page_bmaps; i++) {
+			unsigned int bmap_len = PAGE_BMAP_SIZE;
+
+			/* The last one takes the leftover only */
+			if (i + 1 == page_bmaps)
+				bmap_len = pfns_len - PAGE_BMAP_SIZE * i;
+
+			chunking_pages_from_bmap(vb, vq, pfn_start +
+						 i * PFNS_PER_PAGE_BMAP,
+						 vb->page_bmap[i], bmap_len);
+		}
+		if (vb->resp_pos > 0)
+			send_resp_data(vb, vq, false);
+	} else {
+		struct scatterlist sg;
+		unsigned int len;
+
+		sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
+
+		/*
+		 * We should always be able to add one buffer to an
+		 * empty queue
+		 */
+		virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
+		virtqueue_kick(vq);
+		/* When host has read buffer, this completes via balloon_ack */
+		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+	}
 }
 
 static void set_page_pfns(struct virtio_balloon *vb,
@@ -138,13 +342,61 @@ static void set_page_pfns(struct virtio_balloon *vb,
 					  page_to_balloon_pfn(page) + i);
 }
 
-static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
+static void set_page_bmap(struct virtio_balloon *vb,
+			  struct list_head *pages, struct virtqueue *vq)
+{
+	unsigned long pfn_start, pfn_stop;
+	struct page *page;
+	bool found;
+
+	vb->pfn_min = rounddown(vb->pfn_min, BITS_PER_LONG);
+	vb->pfn_max = roundup(vb->pfn_max, BITS_PER_LONG);
+
+	extend_page_bmap_size(vb, vb->pfn_max - vb->pfn_min + 1);
+	pfn_start = vb->pfn_min;
+
+	while (pfn_start < vb->pfn_max) {
+		pfn_stop = pfn_start + PFNS_PER_PAGE_BMAP * vb->page_bmaps;
+		pfn_stop = pfn_stop < vb->pfn_max ? pfn_stop : vb->pfn_max;
+
+		vb->pfn_start = pfn_start;
+		clear_page_bmap(vb);
+		found = false;
+
+		list_for_each_entry(page, pages, lru) {
+			unsigned long bmap_idx, bmap_pos, balloon_pfn;
+
+			balloon_pfn = page_to_balloon_pfn(page);
+			if (balloon_pfn < pfn_start || balloon_pfn > pfn_stop)
+				continue;
+			bmap_idx = (balloon_pfn - pfn_start) /
+				   PFNS_PER_PAGE_BMAP;
+			bmap_pos = (balloon_pfn - pfn_start) %
+				   PFNS_PER_PAGE_BMAP;
+			set_bit(bmap_pos, vb->page_bmap[bmap_idx]);
+
+			found = true;
+		}
+		if (found) {
+			vb->pfn_stop = pfn_stop;
+			tell_host(vb, vq);
+		}
+		pfn_start = pfn_stop;
+	}
+}
+
+static unsigned int fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
-	unsigned num_allocated_pages;
+	unsigned int num_allocated_pages;
+	bool chunking = virtio_has_feature(vb->vdev,
+					   VIRTIO_BALLOON_F_CHUNK_TRANSFER);
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (chunking)
+		init_page_bmap_range(vb);
+	else
+		/* We can only do one array worth at a time. */
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
@@ -159,7 +411,10 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 			msleep(200);
 			break;
 		}
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (chunking)
+			update_page_bmap_range(vb, page);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -168,8 +423,13 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns != 0) {
+		if (chunking)
+			set_page_bmap(vb, &vb_dev_info->pages,
+					vb->inflate_vq);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -189,15 +449,20 @@ static void release_pages_balloon(struct virtio_balloon *vb,
 	}
 }
 
-static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
+static unsigned int leak_balloon(struct virtio_balloon *vb, size_t num)
 {
-	unsigned num_freed_pages;
+	unsigned int num_freed_pages;
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool chunking = virtio_has_feature(vb->vdev,
+					   VIRTIO_BALLOON_F_CHUNK_TRANSFER);
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (chunking)
+		init_page_bmap_range(vb);
+	else
+		/* We can only do one array worth at a time. */
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	/* We can't release more pages than taken */
@@ -207,7 +472,10 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		page = balloon_page_dequeue(vb_dev_info);
 		if (!page)
 			break;
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (chunking)
+			update_page_bmap_range(vb, page);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -218,8 +486,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns != 0) {
+		if (chunking)
+			set_page_bmap(vb, &pages, vb->deflate_vq);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
@@ -431,6 +703,12 @@ static int init_vqs(struct virtio_balloon *vb)
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
+static void tell_host_one_page(struct virtio_balloon *vb,
+			       struct virtqueue *vq, struct page *page)
+{
+	add_one_chunk(vb, vq, page_to_pfn(page), 1);
+}
+
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
 *			     a compaction thread.     (called under page lock)
@@ -455,6 +733,8 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
 	unsigned long flags;
+	bool chunking = virtio_has_feature(vb->vdev,
+					   VIRTIO_BALLOON_F_CHUNK_TRANSFER);
 
 	/*
 	 * In order to avoid lock contention while migrating pages concurrently
@@ -475,15 +755,23 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
+	if (chunking) {
+		tell_host_one_page(vb, vb->inflate_vq, newpage);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
+	if (chunking) {
+		tell_host_one_page(vb, vb->deflate_vq, page);
+	} else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 
 	mutex_unlock(&vb->balloon_lock);
 
@@ -533,6 +821,30 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	spin_lock_init(&vb->stop_update_lock);
 	vb->stop_update = false;
 	vb->num_pages = 0;
+
+	/*
+	 * By default, we allocate page_bmap[0] only. More page_bmap will be
+	 * allocated on demand.
+	 */
+	vb->page_bmap[0] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
+	if (!vb->page_bmap[0]) {
+		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_CHUNK_TRANSFER);
+	} else {
+		vb->page_bmaps = 1;
+		vb->resp_hdr =
+			kmalloc(sizeof(struct virtio_balloon_resp_hdr) +
+				PAGE_BMAP_SIZE, GFP_KERNEL);
+		if (!vb->resp_hdr) {
+			__virtio_clear_bit(vdev,
+					   VIRTIO_BALLOON_F_CHUNK_TRANSFER);
+			kfree(vb->page_bmap[0]);
+		} else {
+			vb->resp_data = (void *)vb->resp_hdr +
+					sizeof(struct virtio_balloon_resp_hdr);
+			vb->resp_pos = 0;
+			vb->resp_buf_size = PAGE_BMAP_SIZE;
+		}
+	}
 	mutex_init(&vb->balloon_lock);
 	init_waitqueue_head(&vb->acked);
 	vb->vdev = vdev;
@@ -578,6 +890,8 @@ static int virtballoon_probe(struct virtio_device *vdev)
 out_del_vqs:
 	vdev->config->del_vqs(vdev);
 out_free_vb:
+	kfree(vb->resp_hdr);
+	free_page_bmap(vb);
 	kfree(vb);
 out:
 	return err;
@@ -611,6 +925,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	remove_common(vb);
 	if (vb->vb_dev_info.inode)
 		iput(vb->vb_dev_info.inode);
+	free_page_bmap(vb);
+	kfree(vb->resp_hdr);
 	kfree(vb);
 }
 
@@ -649,6 +965,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_CHUNK_TRANSFER,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..aa0e5f0 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_CHUNK_TRANSFER	3 /* Transfer pages in chunks */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -82,4 +83,12 @@ struct virtio_balloon_stat {
 	__virtio64 val;
 } __attribute__((packed));
 
+/* Response header structure */
+struct virtio_balloon_resp_hdr {
+	u8 cmd;
+	u8 flag;
+	__le16 id; /* cmd id */
+	__le32 data_len; /* Payload len in bytes */
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
2.7.4


* [PATCH kernel v8 3/4] mm: add interface to offer info about unused pages
  2017-03-16  7:08 [PATCH kernel v8 0/4] Extend virtio-balloon for fast (de)inflating & fast live migration Wei Wang
  2017-03-16  7:08 ` [PATCH kernel v8 1/4] virtio-balloon: deflate via a page list Wei Wang
  2017-03-16  7:08 ` [PATCH kernel v8 2/4] virtio-balloon: VIRTIO_BALLOON_F_CHUNK_TRANSFER Wei Wang
@ 2017-03-16  7:08 ` Wei Wang
  2017-03-16 21:28   ` Andrew Morton
  2017-03-17  1:21   ` Michael S. Tsirkin
  2017-03-16  7:08 ` [PATCH kernel v8 4/4] virtio-balloon: VIRTIO_BALLOON_F_HOST_REQ_VQ Wei Wang
  3 siblings, 2 replies; 18+ messages in thread
From: Wei Wang @ 2017-03-16  7:08 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

From: Liang Li <liang.z.li@intel.com>

This patch adds a function that provides a snapshot of the currently
unused pages in the system. An important use of this function is to
report the unused pages to the live migration thread, which then skips
transferring those pages. Pages that become used afterwards are
re-tracked by the dirty page logging mechanism.
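
The two-word record that the new interface emits per free block can be sketched in plain C as below. This is a hedged userspace illustration: `record_block()` is a hypothetical helper name (the kernel writes the two words inline under the zone lock), and endian conversion to __le64 is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Each free block of a given order becomes two 64-bit words in the
 * caller's buffer: base PFN << 12, then (1 << order) << 12, matching
 * the chunk format used by virtio-balloon. Returns -1 (the kernel
 * returns -ENOSPC) when the buffer cannot hold another record. */
static int record_block(uint64_t *buf, unsigned int size,
			unsigned int *offset, uint64_t pfn, int order)
{
	if (*offset + 2 > size)
		return -1;
	buf[*offset] = pfn << 12;			/* base word */
	buf[*offset + 1] = ((uint64_t)1 << order) << 12;/* size word */
	*offset += 2;
	return 0;
}
```

With a 4-word buffer, two blocks fit and a third attempt fails, which is exactly the -ENOSPC path that makes the caller grow the buffer or fall back to partial fill.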

Signed-off-by: Liang Li <liang.z.li@intel.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 include/linux/mm.h |   3 ++
 mm/page_alloc.c    | 114 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 117 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b84615b..869749d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1764,6 +1764,9 @@ extern void free_area_init(unsigned long * zones_size);
 extern void free_area_init_node(int nid, unsigned long * zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
+extern int record_unused_pages(struct zone **start_zone, int order,
+			       __le64 *pages, unsigned int size,
+			       unsigned int *pos, bool part_fill);
 
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f3e0c69..b72a7ac 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4498,6 +4498,120 @@ void show_free_areas(unsigned int filter)
 	show_swap_cache_info();
 }
 
+static int __record_unused_pages(struct zone *zone, int order,
+				 __le64 *buf, unsigned int size,
+				 unsigned int *offset, bool part_fill)
+{
+	unsigned long pfn, flags;
+	int t, ret = 0;
+	struct list_head *curr;
+	__le64 *chunk;
+
+	if (zone_is_empty(zone))
+		return 0;
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	if (*offset + zone->free_area[order].nr_free > size && !part_fill) {
+		ret = -ENOSPC;
+		goto out;
+	}
+	for (t = 0; t < MIGRATE_TYPES; t++) {
+		list_for_each(curr, &zone->free_area[order].free_list[t]) {
+			pfn = page_to_pfn(list_entry(curr, struct page, lru));
+			chunk = buf + *offset;
+			if (*offset + 2 > size) {
+				ret = -ENOSPC;
+				goto out;
+			}
+			/* Align to the chunk format used in virtio-balloon */
+			*chunk = cpu_to_le64(pfn << 12);
+			*(chunk + 1) = cpu_to_le64((1 << order) << 12);
+			*offset += 2;
+		}
+	}
+
+out:
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	return ret;
+}
+
+/*
+ * The record_unused_pages() function records the currently unused
+ * pages of the system; those pages can be skipped during live
+ * migration. Although the set of unused pages changes dynamically,
+ * dirty page logging captures any page that becomes used after it
+ * was recorded as unused via this function.
+ *
+ * This function scans the free page list of the specified order to record
+ * the unused pages, and combines physically contiguous pages into chunks
+ * following the format below:
+ * --------------------------------------
+ * |	Base (52-bit)	| Rsvd (12-bit) |
+ * --------------------------------------
+ * --------------------------------------
+ * |	Size (52-bit)	| Rsvd (12-bit) |
+ * --------------------------------------
+ *
+ * @start_zone: zone to start the record operation.
+ * @order: order of the free page list to record.
+ * @buf: buffer to record the unused page info in chunks.
+ * @size: size of the buffer in units of __le64.
+ * @offset: offset in the buffer to record.
+ * @part_fill: indicate if partial fill is used.
+ *
+ * return -EINVAL if a parameter is invalid
+ * return -ENOSPC when the buffer is too small to record all the unused pages
+ * return 0 on success
+ */
+int record_unused_pages(struct zone **start_zone, int order,
+			__le64 *buf, unsigned int size,
+			unsigned int *offset, bool part_fill)
+{
+	struct zone *zone;
+	int ret = 0;
+	bool skip_check = false;
+
+	/* Make sure all the parameters are valid */
+	if (buf == NULL || offset == NULL || order >= MAX_ORDER)
+		return -EINVAL;
+
+	if (*start_zone != NULL) {
+		bool found = false;
+
+		for_each_populated_zone(zone) {
+			if (zone != *start_zone)
+				continue;
+			found = true;
+			break;
+		}
+		if (!found)
+			return -EINVAL;
+	} else
+		skip_check = true;
+
+	for_each_populated_zone(zone) {
+		/* Start from *start_zone if it's not NULL */
+		if (!skip_check) {
+			if (*start_zone != zone)
+				continue;
+			else
+				skip_check = true;
+		}
+		ret = __record_unused_pages(zone, order, buf, size,
+					    offset, part_fill);
+		if (ret < 0) {
+			/* record the failed zone */
+			*start_zone = zone;
+			break;
+		}
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(record_unused_pages);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;
-- 
2.7.4


* [PATCH kernel v8 4/4] virtio-balloon: VIRTIO_BALLOON_F_HOST_REQ_VQ
  2017-03-16  7:08 [PATCH kernel v8 0/4] Extend virtio-balloon for fast (de)inflating & fast live migration Wei Wang
                   ` (2 preceding siblings ...)
  2017-03-16  7:08 ` [PATCH kernel v8 3/4] mm: add interface to offer info about unused pages Wei Wang
@ 2017-03-16  7:08 ` Wei Wang
  2017-03-17  1:39   ` Michael S. Tsirkin
  3 siblings, 1 reply; 18+ messages in thread
From: Wei Wang @ 2017-03-16  7:08 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, wei.w.wang, liliang.opensource

From: Liang Li <liang.z.li@intel.com>

Add a new virtqueue: the host request vq. The host uses this vq to
send requests to the guest, and the guest responds with the requested
data over the same vq.

This patch implements the request for getting the unused pages from
the guest. The unused guest pages can then be skipped during live
migration. For an idle guest with 8GB RAM, this optimization shortens
the total migration time to about 1/4.

Furthermore, it's also possible to drop the guest's page cache before
live migration. This optimization will be implemented on top of this
new feature in the future.
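
As a rough userspace illustration of the response header this vq carries: the field layout below mirrors struct virtio_balloon_resp_hdr from the uapi change in this patch, but the numeric command and flag values are assumptions for the sketch, and the cpu_to_le16/32 endian conversion is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Mirror of the response header: cmd and flag bytes, the echoed
 * request id, and the payload length in bytes. */
struct resp_hdr {
	uint8_t  cmd;		/* e.g. get-unused-pages */
	uint8_t  flag;		/* continuation or final response */
	uint16_t id;		/* echoes the host's request id */
	uint32_t data_len;	/* payload length in bytes */
};

enum { CMD_GET_UNUSED_PAGES = 1 };	/* assumed value */
enum { FLAG_CONT = 0, FLAG_DONE = 1 };	/* assumed values */

/* Build the header for a response carrying nchunks (base, size)
 * pairs; a multi-part reply sends FLAG_CONT until the last part. */
static struct resp_hdr make_resp(uint16_t req_id, uint32_t nchunks,
				 int done)
{
	struct resp_hdr h;

	h.cmd = CMD_GET_UNUSED_PAGES;
	h.flag = done ? FLAG_DONE : FLAG_CONT;
	h.id = req_id;
	h.data_len = nchunks * 2 * (uint32_t)sizeof(uint64_t);
	return h;
}
```

The guest-side __send_unused_pages() in this patch follows the same shape: it stamps the command and request id, picks BALLOON_FLAG_CONT or BALLOON_FLAG_DONE, and lets send_resp_data() fill in data_len from resp_pos.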

Signed-off-by: Liang Li <liang.z.li@intel.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 drivers/virtio/virtio_balloon.c     | 140 ++++++++++++++++++++++++++++++++++--
 include/uapi/linux/virtio_balloon.h |  22 ++++++
 2 files changed, 157 insertions(+), 5 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 3f4a161..bcf2baa 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -64,7 +64,7 @@ struct balloon_page_chunk {
 typedef __le64 resp_data_t;
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *host_req_vq;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
@@ -104,6 +104,8 @@ struct virtio_balloon {
 	 * pfn_start & pfn_stop: records the start and stop pfn in each cover
 	 */
 	unsigned long pfn_min, pfn_max, pfn_start, pfn_stop;
+	/* Request header */
+	struct virtio_balloon_req_hdr req_hdr;
 	/*
 	 * The pages we've told the Host we're not using are enqueued
 	 * at vb_dev_info->pages list.
@@ -568,6 +570,81 @@ static void stats_handle_request(struct virtio_balloon *vb)
 	virtqueue_kick(vq);
 }
 
+static void __send_unused_pages(struct virtio_balloon *vb,
+				unsigned long req_id, unsigned int pos,
+				bool done)
+{
+	struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
+	struct virtqueue *vq = vb->host_req_vq;
+
+	vb->resp_pos = pos;
+	hdr->cmd = BALLOON_GET_UNUSED_PAGES;
+	hdr->id = cpu_to_le16(req_id);
+	if (!done)
+		hdr->flag = BALLOON_FLAG_CONT;
+	else
+		hdr->flag = BALLOON_FLAG_DONE;
+
+	if (pos > 0 || done)
+		send_resp_data(vb, vq, true);
+
+}
+
+static void send_unused_pages(struct virtio_balloon *vb,
+			      unsigned long req_id)
+{
+	struct scatterlist sg_in;
+	unsigned int pos = 0;
+	struct virtqueue *vq = vb->host_req_vq;
+	int ret, order;
+	struct zone *zone = NULL;
+	bool part_fill = false;
+
+	mutex_lock(&vb->balloon_lock);
+
+	for (order = MAX_ORDER - 1; order >= 0; order--) {
+		ret = record_unused_pages(&zone, order, vb->resp_data,
+					  vb->resp_buf_size / sizeof(__le64),
+					  &pos, part_fill);
+		if (ret == -ENOSPC) {
+			if (pos == 0) {
+				void *new_resp_data;
+
+				new_resp_data = kmalloc(2 * vb->resp_buf_size,
+							GFP_KERNEL);
+				if (new_resp_data) {
+					kfree(vb->resp_data);
+					vb->resp_data = new_resp_data;
+					vb->resp_buf_size *= 2;
+				} else {
+					part_fill = true;
+					dev_warn(&vb->vdev->dev,
+						 "%s: part fill order: %d\n",
+						 __func__, order);
+				}
+			} else {
+				__send_unused_pages(vb, req_id, pos, false);
+				pos = 0;
+			}
+
+			if (!part_fill) {
+				order++;
+				continue;
+			}
+		} else
+			zone = NULL;
+
+		if (order == 0)
+			__send_unused_pages(vb, req_id, pos, true);
+
+	}
+
+	mutex_unlock(&vb->balloon_lock);
+	sg_init_one(&sg_in, &vb->req_hdr, sizeof(vb->req_hdr));
+	virtqueue_add_inbuf(vq, &sg_in, 1, &vb->req_hdr, GFP_KERNEL);
+	virtqueue_kick(vq);
+}
+
 static void virtballoon_changed(struct virtio_device *vdev)
 {
 	struct virtio_balloon *vb = vdev->priv;
@@ -667,18 +744,53 @@ static void update_balloon_size_func(struct work_struct *work)
 		queue_work(system_freezable_wq, work);
 }
 
+static void handle_host_request(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+	struct virtio_balloon_req_hdr *hdr;
+	unsigned long req_id;
+	unsigned int len;
+
+	hdr = virtqueue_get_buf(vb->host_req_vq, &len);
+	if (!hdr || len != sizeof(vb->req_hdr))
+		return;
+
+	switch (hdr->cmd) {
+	case BALLOON_GET_UNUSED_PAGES:
+		req_id = le64_to_cpu(hdr->param);
+		send_unused_pages(vb, req_id);
+		break;
+	default:
+		dev_warn(&vb->vdev->dev, "%s: host request %d not supported\n",
+			 __func__, hdr->cmd);
+	}
+}
+
 static int init_vqs(struct virtio_balloon *vb)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
+	struct virtqueue *vqs[4];
+	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack,
+				       stats_request, handle_host_request };
+	static const char * const names[] = { "inflate", "deflate",
+					      "stats", "host_request" };
 	int err, nvqs;
 
 	/*
 	 * We expect two virtqueues: inflate and deflate, and
 	 * optionally stat.
 	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HOST_REQ_VQ))
+		nvqs = 4;
+	else if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		nvqs = 3;
+	else
+		nvqs = 2;
+
+	if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_CHUNK_TRANSFER);
+		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_HOST_REQ_VQ);
+	}
+
 	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names);
 	if (err)
 		return err;
@@ -699,6 +811,20 @@ static int init_vqs(struct virtio_balloon *vb)
 			BUG();
 		virtqueue_kick(vb->stats_vq);
 	}
+
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HOST_REQ_VQ)) {
+		struct scatterlist sg_in;
+
+		vb->host_req_vq = vqs[3];
+		sg_init_one(&sg_in, &vb->req_hdr, sizeof(vb->req_hdr));
+		if (virtqueue_add_inbuf(vb->host_req_vq, &sg_in, 1,
+					&vb->req_hdr, GFP_KERNEL) < 0)
+			__virtio_clear_bit(vb->vdev,
+					   VIRTIO_BALLOON_F_HOST_REQ_VQ);
+		else
+			virtqueue_kick(vb->host_req_vq);
+	}
+
 	return 0;
 }
 
@@ -829,6 +955,7 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	vb->page_bmap[0] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
 	if (!vb->page_bmap[0]) {
 		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_CHUNK_TRANSFER);
+		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_HOST_REQ_VQ);
 	} else {
 		vb->page_bmaps = 1;
 		vb->resp_hdr =
@@ -837,6 +964,8 @@ static int virtballoon_probe(struct virtio_device *vdev)
 		if (!vb->resp_hdr) {
 			__virtio_clear_bit(vdev,
 					   VIRTIO_BALLOON_F_CHUNK_TRANSFER);
+			__virtio_clear_bit(vdev,
+					   VIRTIO_BALLOON_F_HOST_REQ_VQ);
 			kfree(vb->page_bmap[0]);
 		} else {
 			vb->resp_data = (void *)vb->resp_hdr +
@@ -966,6 +1095,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_CHUNK_TRANSFER,
+	VIRTIO_BALLOON_F_HOST_REQ_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index aa0e5f0..1f75bee 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_CHUNK_TRANSFER	3 /* Transfer pages in chunks */
+#define VIRTIO_BALLOON_F_HOST_REQ_VQ	4 /* Host request virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -91,4 +92,25 @@ struct virtio_balloon_resp_hdr {
 	__le32 data_len; /* Payload len in bytes */
 };
 
+enum virtio_balloon_req_id {
+	/* Get unused page information */
+	BALLOON_GET_UNUSED_PAGES,
+};
+
+enum virtio_balloon_flag {
+	/* Have more data for a request */
+	BALLOON_FLAG_CONT,
+	/* No more data for a request */
+	BALLOON_FLAG_DONE,
+};
+
+struct virtio_balloon_req_hdr {
+	/* Used to distinguish different requests */
+	__le16 cmd;
+	/* Reserved */
+	__le16 reserved[3];
+	/* Request parameter */
+	__le64 param;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH kernel v8 3/4] mm: add inerface to offer info about unused pages
  2017-03-16  7:08 ` [PATCH kernel v8 3/4] mm: add inerface to offer info about unused pages Wei Wang
@ 2017-03-16 21:28   ` Andrew Morton
  2017-03-17  6:55     ` Wei Wang
  2017-04-13 11:07     ` Wei Wang
  2017-03-17  1:21   ` Michael S. Tsirkin
  1 sibling, 2 replies; 18+ messages in thread
From: Andrew Morton @ 2017-03-16 21:28 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Thu, 16 Mar 2017 15:08:46 +0800 Wei Wang <wei.w.wang@intel.com> wrote:

> From: Liang Li <liang.z.li@intel.com>
> 
> This patch adds a function that provides a snapshot of the present system's
> unused pages. An important usage of this function is to provide the
> unused pages to the live migration thread, which skips the transfer of
> those unused pages. Newly used pages can be re-tracked by the dirty
> page logging mechanisms.

I don't think this will be useful for anything other than
virtio-balloon.  I guess it would be better to keep this code in the
virtio-balloon driver if possible, even though that's rather a layering
violation :( What would have to be done to make that possible?  Perhaps
we can put some *small* helpers into page_alloc.c to prevent things
from becoming too ugly.

> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4498,6 +4498,120 @@ void show_free_areas(unsigned int filter)
>  	show_swap_cache_info();
>  }
>  
> +static int __record_unused_pages(struct zone *zone, int order,
> +				 __le64 *buf, unsigned int size,
> +				 unsigned int *offset, bool part_fill)
> +{
> +	unsigned long pfn, flags;
> +	int t, ret = 0;
> +	struct list_head *curr;
> +	__le64 *chunk;
> +
> +	if (zone_is_empty(zone))
> +		return 0;
> +
> +	spin_lock_irqsave(&zone->lock, flags);
> +
> +	if (*offset + zone->free_area[order].nr_free > size && !part_fill) {
> +		ret = -ENOSPC;
> +		goto out;
> +	}
> +	for (t = 0; t < MIGRATE_TYPES; t++) {
> +		list_for_each(curr, &zone->free_area[order].free_list[t]) {
> +			pfn = page_to_pfn(list_entry(curr, struct page, lru));
> +			chunk = buf + *offset;
> +			if (*offset + 2 > size) {
> +				ret = -ENOSPC;
> +				goto out;
> +			}
> +			/* Align to the chunk format used in virtio-balloon */
> +			*chunk = cpu_to_le64(pfn << 12);
> +			*(chunk + 1) = cpu_to_le64((1 << order) << 12);
> +			*offset += 2;
> +		}
> +	}
> +
> +out:
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +
> +	return ret;
> +}

This looks like it could disable interrupts for a long time.  Too long?

> +/*
> + * The record_unused_pages() function is used to record the system unused
> + * pages. The unused pages can be skipped to transfer during live migration.
> + * Though the unused pages are dynamically changing, dirty page logging
> + * mechanisms are able to capture the newly used pages though they were
> + * recorded as unused pages via this function.
> + *
> + * This function scans the free page list of the specified order to record
> + * the unused pages, and chunks those continuous pages following the chunk
> + * format below:
> + * --------------------------------------
> + * |	Base (52-bit)	| Rsvd (12-bit) |
> + * --------------------------------------
> + * --------------------------------------
> + * |	Size (52-bit)	| Rsvd (12-bit) |
> + * --------------------------------------
> + *
> + * @start_zone: zone to start the record operation.
> + * @order: order of the free page list to record.
> + * @buf: buffer to record the unused page info in chunks.
> + * @size: size of the buffer in __le64 to record
> + * @offset: offset in the buffer to record.
> + * @part_fill: indicate if partial fill is used.
> + *
> + * return -EINVAL if parameter is invalid
> + * return -ENOSPC when the buffer is too small to record all the unused pages
> + * return 0 on success
> + */
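The two-word chunk layout documented above can be exercised stand-alone. Below is a minimal user-space sketch (hypothetical helper names, not part of the posted patch; the kernel code additionally applies cpu_to_le64 for the on-wire byte order, which is omitted here):

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_RSVD_BITS 12  /* low 12 bits of each word are reserved */

/* Pack a run of free pages into the two-word chunk format above. */
static void encode_chunk(uint64_t chunk[2], uint64_t pfn, uint64_t nr_pages)
{
    chunk[0] = pfn << CHUNK_RSVD_BITS;       /* base address in bytes */
    chunk[1] = nr_pages << CHUNK_RSVD_BITS;  /* size in bytes */
}

/* Recover the page frame number and page count from a chunk. */
static void decode_chunk(const uint64_t chunk[2],
                         uint64_t *pfn, uint64_t *nr_pages)
{
    *pfn = chunk[0] >> CHUNK_RSVD_BITS;
    *nr_pages = chunk[1] >> CHUNK_RSVD_BITS;
}
```

Since a page is 4KB (1 << 12 bytes), shifting the pfn left by 12 turns it into a byte address whose low 12 bits are naturally zero, which is why the reserved field costs no extra bits.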

It's a strange thing - it returns information which will instantly
become incorrect.

> +int record_unused_pages(struct zone **start_zone, int order,
> +			__le64 *buf, unsigned int size,
> +			unsigned int *offset, bool part_fill)
> +{
> +	struct zone *zone;
> +	int ret = 0;
> +	bool skip_check = false;
> +
> +	/* Make sure all the parameters are valid */
> +	if (buf == NULL || offset == NULL || order >= MAX_ORDER)
> +		return -EINVAL;
> +
> +	if (*start_zone != NULL) {
> +		bool found = false;
> +
> +		for_each_populated_zone(zone) {
> +			if (zone != *start_zone)
> +				continue;
> +			found = true;
> +			break;
> +		}
> +		if (!found)
> +			return -EINVAL;
> +	} else
> +		skip_check = true;
> +
> +	for_each_populated_zone(zone) {
> +		/* Start from *start_zone if it's not NULL */
> +		if (!skip_check) {
> +			if (*start_zone != zone)
> +				continue;
> +			else
> +				skip_check = true;
> +		}
> +		ret = __record_unused_pages(zone, order, buf, size,
> +					    offset, part_fill);
> +		if (ret < 0) {
> +			/* record the failed zone */
> +			*start_zone = zone;
> +			break;
> +		}
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(record_unused_pages);

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH kernel v8 3/4] mm: add inerface to offer info about unused pages
  2017-03-16  7:08 ` [PATCH kernel v8 3/4] mm: add inerface to offer info about unused pages Wei Wang
  2017-03-16 21:28   ` Andrew Morton
@ 2017-03-17  1:21   ` Michael S. Tsirkin
  1 sibling, 0 replies; 18+ messages in thread
From: Michael S. Tsirkin @ 2017-03-17  1:21 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Thu, Mar 16, 2017 at 03:08:46PM +0800, Wei Wang wrote:
> +/*
> + * The record_unused_pages() function is used to record the system unused
> + * pages. The unused pages can be skipped to transfer during live migration.
> + * Though the unused pages are dynamically changing, dirty page logging
> + * mechanisms are able to capture the newly used pages though they were
> + * recorded as unused pages via this function.

You will keep confusing people as long as you keep using
this terminology which only makes sense in a very specific
use and a very specific implementation.

How does guest developer know this does the right thing wrt locking etc?
Look at hypervisor spec and try to figure it all out together?

So stop saying what caller should do, describe what does the *API* does.

You want something like this:

	Get a list of pages in the system that are unused at some point between
	record_unused_pages is called and before it returns, implying that any
	data that was present in these pages before record_unused_pages was
	called is safe to discard. Pages can be used immediately after
	this point and any data written after this point is not safe to discard,
	it is caller's responsibility to either prevent the use or
	detect such pages.

-- 
MST

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH kernel v8 4/4] virtio-balloon: VIRTIO_BALLOON_F_HOST_REQ_VQ
  2017-03-16  7:08 ` [PATCH kernel v8 4/4] virtio-balloon: VIRTIO_BALLOON_F_HOST_REQ_VQ Wei Wang
@ 2017-03-17  1:39   ` Michael S. Tsirkin
  0 siblings, 0 replies; 18+ messages in thread
From: Michael S. Tsirkin @ 2017-03-17  1:39 UTC (permalink / raw)
  To: Wei Wang
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, dave.hansen, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Thu, Mar 16, 2017 at 03:08:47PM +0800, Wei Wang wrote:
> From: Liang Li <liang.z.li@intel.com>
> 
> Add a new vq, the host request vq. The host uses this vq to send
> requests to the guest. Upon getting a request, the guest responds
> with what the host needs via this vq.
> 
> The patch implements the request for getting the unused pages from the guest.
> The unused guest pages are skipped during live migration. For an
> idle guest with 8GB RAM, this optimization shortens the total migration
> time to 1/4.
> 
> Furthermore, it's also possible to drop the guest's page cache before
> live migration. This optimization will be implemented on top of this
> new feature in the future.
> 
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 140 ++++++++++++++++++++++++++++++++++--
>  include/uapi/linux/virtio_balloon.h |  22 ++++++
>  2 files changed, 157 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 3f4a161..bcf2baa 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -64,7 +64,7 @@ struct balloon_page_chunk {
>  typedef __le64 resp_data_t;
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
> -	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> +	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *host_req_vq;
>  
>  	/* The balloon servicing is delegated to a freezable workqueue. */
>  	struct work_struct update_balloon_stats_work;
> @@ -104,6 +104,8 @@ struct virtio_balloon {
>  	 * pfn_start & pfn_stop: records the start and stop pfn in each cover
>  	 */
>  	unsigned long pfn_min, pfn_max, pfn_start, pfn_stop;
> +	/* Request header */
> +	struct virtio_balloon_req_hdr req_hdr;
>  	/*
>  	 * The pages we've told the Host we're not using are enqueued
>  	 * at vb_dev_info->pages list.
> @@ -568,6 +570,81 @@ static void stats_handle_request(struct virtio_balloon *vb)
>  	virtqueue_kick(vq);
>  }
>  
> +static void __send_unused_pages(struct virtio_balloon *vb,
> +				unsigned long req_id, unsigned int pos,
> +				bool done)
> +{
> +	struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
> +	struct virtqueue *vq = vb->host_req_vq;
> +
> +	vb->resp_pos = pos;
> +	hdr->cmd = BALLOON_GET_UNUSED_PAGES;
> +	hdr->id = cpu_to_le16(req_id);
> +	if (!done)
> +		hdr->flag = BALLOON_FLAG_CONT;
> +	else
> +		hdr->flag = BALLOON_FLAG_DONE;
> +
> +	if (pos > 0 || done)
> +		send_resp_data(vb, vq, true);
> +
> +}
> +
> +static void send_unused_pages(struct virtio_balloon *vb,
> +			      unsigned long req_id)
> +{
> +	struct scatterlist sg_in;
> +	unsigned int pos = 0;
> +	struct virtqueue *vq = vb->host_req_vq;
> +	int ret, order;
> +	struct zone *zone = NULL;
> +	bool part_fill = false;
> +
> +	mutex_lock(&vb->balloon_lock);
> +
> +	for (order = MAX_ORDER - 1; order >= 0; order--) {
> +		ret = record_unused_pages(&zone, order, vb->resp_data,
> +					  vb->resp_buf_size / sizeof(__le64),
> +					  &pos, part_fill);
> +		if (ret == -ENOSPC) {
> +			if (pos == 0) {
> +				void *new_resp_data;
> +
> +				new_resp_data = kmalloc(2 * vb->resp_buf_size,
> +							GFP_KERNEL);
> +				if (new_resp_data) {
> +					kfree(vb->resp_data);
> +					vb->resp_data = new_resp_data;
> +					vb->resp_buf_size *= 2;
> +				} else {
> +					part_fill = true;
> +					dev_warn(&vb->vdev->dev,
> +						 "%s: part fill order: %d\n",
> +						 __func__, order);
> +				}
> +			} else {
> +				__send_unused_pages(vb, req_id, pos, false);
> +				pos = 0;
> +			}
> +
> +			if (!part_fill) {
> +				order++;
> +				continue;
> +			}
> +		} else
> +			zone = NULL;
> +
> +		if (order == 0)
> +			__send_unused_pages(vb, req_id, pos, true);
> +
> +	}
> +
> +	mutex_unlock(&vb->balloon_lock);
> +	sg_init_one(&sg_in, &vb->req_hdr, sizeof(vb->req_hdr));
> +	virtqueue_add_inbuf(vq, &sg_in, 1, &vb->req_hdr, GFP_KERNEL);

I note you ignore errors here. WARN_ON or something?
And why not handle same call during probe in the same way as here?

> +	virtqueue_kick(vq);
> +}
> +
>  static void virtballoon_changed(struct virtio_device *vdev)
>  {
>  	struct virtio_balloon *vb = vdev->priv;
> @@ -667,18 +744,53 @@ static void update_balloon_size_func(struct work_struct *work)
>  		queue_work(system_freezable_wq, work);
>  }
>  
> +static void handle_host_request(struct virtqueue *vq)
> +{
> +	struct virtio_balloon *vb = vq->vdev->priv;
> +	struct virtio_balloon_req_hdr *hdr;
> +	unsigned long req_id;
> +	unsigned int len;
> +
> +	hdr = virtqueue_get_buf(vb->host_req_vq, &len);
> +	if (!hdr || len != sizeof(vb->req_hdr))
> +		return;
> +
> +	switch (hdr->cmd) {
> +	case BALLOON_GET_UNUSED_PAGES:
> +		req_id = le64_to_cpu(hdr->param);

Will not fit in a long.

> +		send_unused_pages(vb, req_id);
> +		break;
> +	default:
> +		dev_warn(&vb->vdev->dev, "%s: host request %d not supported\n",
> +			 __func__, hdr->cmd);

why warn here but not above where length is wrong?

> +	}


All errors never re-add buffers so balloon will stop
working.

> +}
> +
>  static int init_vqs(struct virtio_balloon *vb)
>  {
> -	struct virtqueue *vqs[3];
> -	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
> -	static const char * const names[] = { "inflate", "deflate", "stats" };
> +	struct virtqueue *vqs[4];
> +	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack,
> +				       stats_request, handle_host_request };
> +	static const char * const names[] = { "inflate", "deflate",
> +					      "stats", "host_request" };

Pls use {} according to coding style now they don't fit on one line.

>  	int err, nvqs;
>  
>  	/*
>  	 * We expect two virtqueues: inflate and deflate, and
>  	 * optionally stat.
>  	 */
> -	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HOST_REQ_VQ))
> +		nvqs = 4;
> +	else if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
> +		nvqs = 3;
> +	else
> +		nvqs = 2;
> +

You assume REQ VQ implies STATS VQ. Why?


> +	if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> +		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_CHUNK_TRANSFER);
> +		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_HOST_REQ_VQ);
> +	}
> +
>  	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names);
>  	if (err)
>  		return err;
> @@ -699,6 +811,20 @@ static int init_vqs(struct virtio_balloon *vb)
>  			BUG();
>  		virtqueue_kick(vb->stats_vq);
>  	}
> +
> +	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HOST_REQ_VQ)) {
> +		struct scatterlist sg_in;
> +
> +		vb->host_req_vq = vqs[3];
> +		sg_init_one(&sg_in, &vb->req_hdr, sizeof(vb->req_hdr));
> +		if (virtqueue_add_inbuf(vb->host_req_vq, &sg_in, 1,
> +					&vb->req_hdr, GFP_KERNEL) < 0)
> +			__virtio_clear_bit(vb->vdev,
> +					   VIRTIO_BALLOON_F_HOST_REQ_VQ);
> +		else
> +			virtqueue_kick(vb->host_req_vq);

> +	}
> +
>  	return 0;
>  }
>  
> @@ -829,6 +955,7 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  	vb->page_bmap[0] = kmalloc(PAGE_BMAP_SIZE, GFP_KERNEL);
>  	if (!vb->page_bmap[0]) {
>  		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_CHUNK_TRANSFER);
> +		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_HOST_REQ_VQ);
>  	} else {
>  		vb->page_bmaps = 1;
>  		vb->resp_hdr =
> @@ -837,6 +964,8 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  		if (!vb->resp_hdr) {
>  			__virtio_clear_bit(vdev,
>  					   VIRTIO_BALLOON_F_CHUNK_TRANSFER);
> +			__virtio_clear_bit(vdev,
> +					   VIRTIO_BALLOON_F_HOST_REQ_VQ);
>  			kfree(vb->page_bmap[0]);
>  		} else {
>  			vb->resp_data = (void *)vb->resp_hdr +
> @@ -966,6 +1095,7 @@ static unsigned int features[] = {
>  	VIRTIO_BALLOON_F_STATS_VQ,
>  	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
>  	VIRTIO_BALLOON_F_CHUNK_TRANSFER,
> +	VIRTIO_BALLOON_F_HOST_REQ_VQ,
>  };
>  
>  static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index aa0e5f0..1f75bee 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -35,6 +35,7 @@
>  #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
>  #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
>  #define VIRTIO_BALLOON_F_CHUNK_TRANSFER	3 /* Transfer pages in chunks */
> +#define VIRTIO_BALLOON_F_HOST_REQ_VQ	4 /* Host request virtqueue */
>  
>  /* Size of a PFN in the balloon interface. */
>  #define VIRTIO_BALLOON_PFN_SHIFT 12
> @@ -91,4 +92,25 @@ struct virtio_balloon_resp_hdr {
>  	__le32 data_len; /* Payload len in bytes */
>  };
>  
> +enum virtio_balloon_req_id {
> +	/* Get unused page information */
> +	BALLOON_GET_UNUSED_PAGES,
> +};


This is cmd, this is not req id.



> +
> +enum virtio_balloon_flag {
> +	/* Have more data for a request */
> +	BALLOON_FLAG_CONT,
> +	/* No more data for a request */
> +	BALLOON_FLAG_DONE,
> +};
Define values for these, this is good practice for an ABI.
And I think I would make DONE be 0 and CONT be 1.
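A hedged sketch of what that suggestion could look like (a hypothetical revision, not what the posted patch contains):

```c
/* ABI flag values pinned explicitly, with DONE as the zero value. */
enum virtio_balloon_flag {
	/* No more data for a request */
	BALLOON_FLAG_DONE = 0,
	/* Have more data for a request */
	BALLOON_FLAG_CONT = 1,
};
```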

> +
> +struct virtio_balloon_req_hdr {
> +	/* Used to distinguish different requests */
> +	__le16 cmd;
> +	/* Reserved */
> +	__le16 reserved[3];
> +	/* Request parameter */
> +	__le64 param;

Comments above are not informative.
Which enums refer to which field?
What is param and what happens to it?

Can't you reuse the same header for request and response?
That would be cleaner.

> +};
> +
>  #endif /* _LINUX_VIRTIO_BALLOON_H */
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH kernel v8 3/4] mm: add inerface to offer info about unused pages
  2017-03-16 21:28   ` Andrew Morton
@ 2017-03-17  6:55     ` Wei Wang
  2017-03-22 10:52       ` Wang, Wei W
  2017-03-29 17:48       ` Michael S. Tsirkin
  2017-04-13 11:07     ` Wei Wang
  1 sibling, 2 replies; 18+ messages in thread
From: Wei Wang @ 2017-03-17  6:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On 03/17/2017 05:28 AM, Andrew Morton wrote:
> On Thu, 16 Mar 2017 15:08:46 +0800 Wei Wang <wei.w.wang@intel.com> wrote:
>
>> From: Liang Li <liang.z.li@intel.com>
>>
>> This patch adds a function that provides a snapshot of the present system's
>> unused pages. An important usage of this function is to provide the
>> unused pages to the live migration thread, which skips the transfer of
>> those unused pages. Newly used pages can be re-tracked by the dirty
>> page logging mechanisms.
> I don't think this will be useful for anything other than
> virtio-balloon.  I guess it would be better to keep this code in the
> virtio-balloon driver if possible, even though that's rather a layering
> violation :( What would have to be done to make that possible?  Perhaps
> we can put some *small* helpers into page_alloc.c to prevent things
> from becoming too ugly.

The patch description was too narrow and may have caused some
confusion, sorry about that. This function is aimed to be generic. I
agree with the description suggested by Michael.

Since the main body of the function is about operating on the
free_list, I think it is better to keep it located here.
Small helpers may be less efficient and thereby cause some
performance loss as well.
I think one improvement we can make is to remove the "chunk format"
related things from this function. The function can generally offer the
base pfn to the caller's recording buffer. Then it will be the caller's
responsibility to format the pfn as they need.

>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4498,6 +4498,120 @@ void show_free_areas(unsigned int filter)
>>   	show_swap_cache_info();
>>   }
>>   
>> +static int __record_unused_pages(struct zone *zone, int order,
>> +				 __le64 *buf, unsigned int size,
>> +				 unsigned int *offset, bool part_fill)
>> +{
>> +	unsigned long pfn, flags;
>> +	int t, ret = 0;
>> +	struct list_head *curr;
>> +	__le64 *chunk;
>> +
>> +	if (zone_is_empty(zone))
>> +		return 0;
>> +
>> +	spin_lock_irqsave(&zone->lock, flags);
>> +
>> +	if (*offset + zone->free_area[order].nr_free > size && !part_fill) {
>> +		ret = -ENOSPC;
>> +		goto out;
>> +	}
>> +	for (t = 0; t < MIGRATE_TYPES; t++) {
>> +		list_for_each(curr, &zone->free_area[order].free_list[t]) {
>> +			pfn = page_to_pfn(list_entry(curr, struct page, lru));
>> +			chunk = buf + *offset;
>> +			if (*offset + 2 > size) {
>> +				ret = -ENOSPC;
>> +				goto out;
>> +			}
>> +			/* Align to the chunk format used in virtio-balloon */
>> +			*chunk = cpu_to_le64(pfn << 12);
>> +			*(chunk + 1) = cpu_to_le64((1 << order) << 12);
>> +			*offset += 2;
>> +		}
>> +	}
>> +
>> +out:
>> +	spin_unlock_irqrestore(&zone->lock, flags);
>> +
>> +	return ret;
>> +}
> This looks like it could disable interrupts for a long time.  Too long?

What do you think if we give a "budget" to the above function?
For example, budget=1000, and there are 2000 nodes on the list.
record() returns with "incomplete" status in the first round, along
with the status info, "*continue_node".

*continue_node: pointer to the starting node of the leftover. If
*continue_node has been used by the time of the second call (i.e.
continue_node->next == NULL), which implies that the previous 1000
nodes have been used, then the record() function can simply start from
the head of the list.

It is up to the caller whether it needs to continue with a second round
when getting "incomplete".
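A rough user-space sketch of this budget idea (hypothetical names, with a plain singly linked list standing in for the zone free_list; the real code would also drop and re-take zone->lock between rounds):

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *next;
	unsigned long pfn;
};

/*
 * Record at most `budget` nodes, starting from *cont, or from `head`
 * on the first call.  On return, *cont points at the resume position
 * for the next round, or is NULL once the whole list has been walked
 * (i.e. the "incomplete" status is *cont != NULL).
 */
static int record_with_budget(struct node *head, struct node **cont,
			      unsigned long *buf, int budget)
{
	struct node *n = *cont ? *cont : head;
	int recorded = 0;

	while (n && recorded < budget) {
		buf[recorded++] = n->pfn;
		n = n->next;
	}
	*cont = n;  /* resume point; NULL means done */
	return recorded;
}
```

The caller loops until *cont comes back NULL, which bounds the time spent per round under the lock at the cost of a slightly staler snapshot.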

>
>> +/*
>> + * The record_unused_pages() function is used to record the system unused
>> + * pages. The unused pages can be skipped to transfer during live migration.
>> + * Though the unused pages are dynamically changing, dirty page logging
>> + * mechanisms are able to capture the newly used pages though they were
>> + * recorded as unused pages via this function.
>> + *
>> + * This function scans the free page list of the specified order to record
>> + * the unused pages, and chunks those continuous pages following the chunk
>> + * format below:
>> + * --------------------------------------
>> + * |	Base (52-bit)	| Rsvd (12-bit) |
>> + * --------------------------------------
>> + * --------------------------------------
>> + * |	Size (52-bit)	| Rsvd (12-bit) |
>> + * --------------------------------------
>> + *
>> + * @start_zone: zone to start the record operation.
>> + * @order: order of the free page list to record.
>> + * @buf: buffer to record the unused page info in chunks.
>> + * @size: size of the buffer in __le64 to record
>> + * @offset: offset in the buffer to record.
>> + * @part_fill: indicate if partial fill is used.
>> + *
>> + * return -EINVAL if parameter is invalid
>> + * return -ENOSPC when the buffer is too small to record all the unused pages
>> + * return 0 on success
>> + */
> It's a strange thing - it returns information which will instantly
> become incorrect.

I didn't get the point, could you please explain more? Thanks.

Best,
Wei

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH kernel v8 3/4] mm: add inerface to offer info about unused pages
  2017-03-17  6:55     ` Wei Wang
@ 2017-03-22 10:52       ` Wang, Wei W
  2017-03-29 17:48       ` Michael S. Tsirkin
  1 sibling, 0 replies; 18+ messages in thread
From: Wang, Wei W @ 2017-03-22 10:52 UTC (permalink / raw)
  To: Wang, Wei W, Andrew Morton
  Cc: aarcange, virtio-dev, kvm, mst, qemu-devel, amit.shah,
	liliang.opensource, Hansen, Dave, linux-kernel, virtualization,
	linux-mm, cornelia.huck, pbonzini, mgorman

Hi Andrew, 

Do you have any comments on my thoughts? Thanks.

> On 03/17/2017 05:28 AM, Andrew Morton wrote:
> > On Thu, 16 Mar 2017 15:08:46 +0800 Wei Wang <wei.w.wang@intel.com>
> wrote:
> >
> >> From: Liang Li <liang.z.li@intel.com>
> >>
> >> This patch adds a function that provides a snapshot of the present
> >> system's unused pages. An important usage of this function is to
> >> provide the unused pages to the live migration thread, which skips
> >> the transfer of those unused pages. Newly used pages can be
> >> re-tracked by the dirty page logging mechanisms.
> > I don't think this will be useful for anything other than
> > virtio-balloon.  I guess it would be better to keep this code in the
> > virtio-balloon driver if possible, even though that's rather a
> > layering violation :( What would have to be done to make that
> > possible?  Perhaps we can put some *small* helpers into page_alloc.c
> > to prevent things from becoming too ugly.
> 
> The patch description was too narrow and may have caused some confusion,
> sorry about that. This function is aimed to be generic. I agree with the
> description suggested by Michael.
> 
> Since the main body of the function is about operating on the free_list, I
> think it is better to keep it located here.
> Small helpers may be less efficient and thereby cause some performance loss
> as well.
> I think one improvement we can make is to remove the "chunk format"
> related things from this function. The function can generally offer the base pfn
> to the caller's recording buffer. Then it will be the caller's responsibility to
> format the pfn as they need.
> 
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -4498,6 +4498,120 @@ void show_free_areas(unsigned int filter)
> >>   	show_swap_cache_info();
> >>   }
> >>
> >> +static int __record_unused_pages(struct zone *zone, int order,
> >> +				 __le64 *buf, unsigned int size,
> >> +				 unsigned int *offset, bool part_fill) {
> >> +	unsigned long pfn, flags;
> >> +	int t, ret = 0;
> >> +	struct list_head *curr;
> >> +	__le64 *chunk;
> >> +
> >> +	if (zone_is_empty(zone))
> >> +		return 0;
> >> +
> >> +	spin_lock_irqsave(&zone->lock, flags);
> >> +
> >> +	if (*offset + zone->free_area[order].nr_free > size && !part_fill) {
> >> +		ret = -ENOSPC;
> >> +		goto out;
> >> +	}
> >> +	for (t = 0; t < MIGRATE_TYPES; t++) {
> >> +		list_for_each(curr, &zone->free_area[order].free_list[t]) {
> >> +			pfn = page_to_pfn(list_entry(curr, struct page, lru));
> >> +			chunk = buf + *offset;
> >> +			if (*offset + 2 > size) {
> >> +				ret = -ENOSPC;
> >> +				goto out;
> >> +			}
> >> +			/* Align to the chunk format used in virtio-balloon */
> >> +			*chunk = cpu_to_le64(pfn << 12);
> >> +			*(chunk + 1) = cpu_to_le64((1 << order) << 12);
> >> +			*offset += 2;
> >> +		}
> >> +	}
> >> +
> >> +out:
> >> +	spin_unlock_irqrestore(&zone->lock, flags);
> >> +
> >> +	return ret;
> >> +}
> > This looks like it could disable interrupts for a long time.  Too long?
> 
> What do you think if we give "budgets" to the above function?
> For example, budget=1000, and there are 2000 nodes on the list.
> record() returns with "incomplete" status in the first round, along with the status
> info, "*continue_node".
> 
> *continue_node: pointer to the starting node of the leftover. If *continue_node
> has been used at the time of the second call (i.e. continue_node->next == NULL),
> which implies that the previous 1000 nodes have been used, then the record()
> function can simply start from the head of the list.
> 
> It is up to the caller whether it needs to continue the second round when getting
> "incomplete".
> 
> >
> >> +/*
> >> + * The record_unused_pages() function is used to record the system
> >> +unused
> >> + * pages. The unused pages can be skipped to transfer during live migration.
> >> + * Though the unused pages are dynamically changing, dirty page
> >> +logging
> >> + * mechanisms are able to capture the newly used pages though they
> >> +were
> >> + * recorded as unused pages via this function.
> >> + *
> >> + * This function scans the free page list of the specified order to
> >> +record
> >> + * the unused pages, and chunks those continuous pages following the
> >> +chunk
> >> + * format below:
> >> + * --------------------------------------
> >> + * |	Base (52-bit)	| Rsvd (12-bit) |
> >> + * --------------------------------------
> >> + * --------------------------------------
> >> + * |	Size (52-bit)	| Rsvd (12-bit) |
> >> + * --------------------------------------
> >> + *
> >> + * @start_zone: zone to start the record operation.
> >> + * @order: order of the free page list to record.
> >> + * @buf: buffer to record the unused page info in chunks.
> >> + * @size: size of the buffer in __le64 to record
> >> + * @offset: offset in the buffer to record.
> >> + * @part_fill: indicate if partial fill is used.
> >> + *
> >> + * return -EINVAL if parameter is invalid
> >> + * return -ENOSPC when the buffer is too small to record all the
> >> +unused pages
> >> + * return 0 on success
> >> + */
> > It's a strange thing - it returns information which will instantly
> > become incorrect.
> 
> I didn't get the point, could you please explain more? Thanks.

Best,
Wei

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH kernel v8 3/4] mm: add inerface to offer info about unused pages
  2017-03-17  6:55     ` Wei Wang
  2017-03-22 10:52       ` Wang, Wei W
@ 2017-03-29 17:48       ` Michael S. Tsirkin
  2017-03-31  9:53         ` Wei Wang
  1 sibling, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2017-03-29 17:48 UTC (permalink / raw)
  To: Wei Wang
  Cc: Andrew Morton, virtio-dev, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, david, dave.hansen, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, liliang.opensource

On Fri, Mar 17, 2017 at 02:55:33PM +0800, Wei Wang wrote:
> On 03/17/2017 05:28 AM, Andrew Morton wrote:
> > On Thu, 16 Mar 2017 15:08:46 +0800 Wei Wang <wei.w.wang@intel.com> wrote:
> > 
> > > From: Liang Li <liang.z.li@intel.com>
> > > 
> > > This patch adds a function that provides a snapshot of the present system
> > > unused pages. An important usage of this function is to provide the
> > > unused pages to the live migration thread, which skips the transfer of
> > > those unused pages. Newly used pages can be re-tracked by the dirty
> > > page logging mechanisms.
> > I don't think this will be useful for anything other than
> > virtio-balloon.  I guess it would be better to keep this code in the
> > virtio-balloon driver if possible, even though that's rather a layering
> > violation :( What would have to be done to make that possible?  Perhaps
> > we can put some *small* helpers into page_alloc.c to prevent things
> > from becoming too ugly.
> 
> The patch description was too narrow and may have caused some
> confusion, sorry about that. This function is intended to be generic. I
> agree with the description suggested by Michael.
> 
> Since the main body of the function is related to operating on the
> free_list, I think it is better to have it located here.
> Small helpers may be less efficient and thereby cause some
> performance loss as well.
> I think one improvement we can make is to remove the "chunk format"
> related things from this function. The function can generally offer the
> base pfn to the caller's recording buffer. Then it will be the caller's
> responsibility to format the pfn if they need to.

Sounds good at a high level, but we'd have to see the implementation
to judge it properly.

> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -4498,6 +4498,120 @@ void show_free_areas(unsigned int filter)
> > >   	show_swap_cache_info();
> > >   }
> > > +static int __record_unused_pages(struct zone *zone, int order,
> > > +				 __le64 *buf, unsigned int size,
> > > +				 unsigned int *offset, bool part_fill)
> > > +{
> > > +	unsigned long pfn, flags;
> > > +	int t, ret = 0;
> > > +	struct list_head *curr;
> > > +	__le64 *chunk;
> > > +
> > > +	if (zone_is_empty(zone))
> > > +		return 0;
> > > +
> > > +	spin_lock_irqsave(&zone->lock, flags);
> > > +
> > > +	if (*offset + zone->free_area[order].nr_free > size && !part_fill) {
> > > +		ret = -ENOSPC;
> > > +		goto out;
> > > +	}
> > > +	for (t = 0; t < MIGRATE_TYPES; t++) {
> > > +		list_for_each(curr, &zone->free_area[order].free_list[t]) {
> > > +			pfn = page_to_pfn(list_entry(curr, struct page, lru));
> > > +			chunk = buf + *offset;
> > > +			if (*offset + 2 > size) {
> > > +				ret = -ENOSPC;
> > > +				goto out;
> > > +			}
> > > +			/* Align to the chunk format used in virtio-balloon */
> > > +			*chunk = cpu_to_le64(pfn << 12);
> > > +			*(chunk + 1) = cpu_to_le64((1 << order) << 12);
> > > +			*offset += 2;
> > > +		}
> > > +	}
> > > +
> > > +out:
> > > +	spin_unlock_irqrestore(&zone->lock, flags);
> > > +
> > > +	return ret;
> > > +}
> > This looks like it could disable interrupts for a long time.  Too long?
> 
> What do you think if we give "budgets" to the above function?
> For example, budget=1000, and there are 2000 nodes on the list.
> record() returns with "incomplete" status in the first round, along with the
> status info, "*continue_node".
> 
> *continue_node: pointer to the starting node of the leftover. If
> *continue_node
> has been used at the time of the second call (i.e. continue_node->next ==
> NULL),
> which implies that the previous 1000 nodes have been used, then the record()
> function can simply start from the head of the list.
> 
> It is up to the caller whether it needs to continue the second round
> when getting "incomplete".

It might be cleaner to add APIs to
	- start iteration
	- do one step
	- end iteration

caller can then iterate without too many issues

> > 
> > > +/*
> > > + * The record_unused_pages() function is used to record the system unused
> > > + * pages. The unused pages can be skipped to transfer during live migration.
> > > + * Though the unused pages are dynamically changing, dirty page logging
> > > + * mechanisms are able to capture the newly used pages though they were
> > > + * recorded as unused pages via this function.
> > > + *
> > > + * This function scans the free page list of the specified order to record
> > > + * the unused pages, and chunks those continuous pages following the chunk
> > > + * format below:
> > > + * --------------------------------------
> > > + * |	Base (52-bit)	| Rsvd (12-bit) |
> > > + * --------------------------------------
> > > + * --------------------------------------
> > > + * |	Size (52-bit)	| Rsvd (12-bit) |
> > > + * --------------------------------------
> > > + *
> > > + * @start_zone: zone to start the record operation.
> > > + * @order: order of the free page list to record.
> > > + * @buf: buffer to record the unused page info in chunks.
> > > + * @size: size of the buffer in __le64 to record
> > > + * @offset: offset in the buffer to record.
> > > + * @part_fill: indicate if partial fill is used.
> > > + *
> > > + * return -EINVAL if parameter is invalid
> > > + * return -ENOSPC when the buffer is too small to record all the unused pages
> > > + * return 0 on success
> > > + */
> > It's a strange thing - it returns information which will instantly
> > become incorrect.
> 
> I didn't get the point, could you please explain more? Thanks.
> 
> Best,
> Wei

I tried to explain it in my reply.  Once this function drops the lock,
the pages can immediately be used so the returned value is wrong.
balloon uses tricks to recover from this but this needs to be explicit
at the API level.

-- 
MST

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH kernel v8 3/4] mm: add inerface to offer info about unused pages
  2017-03-29 17:48       ` Michael S. Tsirkin
@ 2017-03-31  9:53         ` Wei Wang
  2017-03-31 16:25           ` Michael S. Tsirkin
  0 siblings, 1 reply; 18+ messages in thread
From: Wei Wang @ 2017-03-31  9:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Andrew Morton, virtio-dev, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, david, dave.hansen, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, liliang.opensource

On 03/30/2017 01:48 AM, Michael S. Tsirkin wrote:
> On Fri, Mar 17, 2017 at 02:55:33PM +0800, Wei Wang wrote:
>> On 03/17/2017 05:28 AM, Andrew Morton wrote:
>>> On Thu, 16 Mar 2017 15:08:46 +0800 Wei Wang <wei.w.wang@intel.com> wrote:
>>>
>>>> From: Liang Li <liang.z.li@intel.com>
>>>>
>>>> This patch adds a function that provides a snapshot of the present system
>>>> unused pages. An important usage of this function is to provide the
>>>> unused pages to the live migration thread, which skips the transfer of
>>>> those unused pages. Newly used pages can be re-tracked by the dirty
>>>> page logging mechanisms.
>>> I don't think this will be useful for anything other than
>>> virtio-balloon.  I guess it would be better to keep this code in the
>>> virtio-balloon driver if possible, even though that's rather a layering
>>> violation :( What would have to be done to make that possible?  Perhaps
>>> we can put some *small* helpers into page_alloc.c to prevent things
>>> from becoming too ugly.
>> The patch description was too narrow and may have caused some
>> confusion, sorry about that. This function is intended to be generic. I
>> agree with the description suggested by Michael.
>>
>> Since the main body of the function is related to operating on the
>> free_list, I think it is better to have it located here.
>> Small helpers may be less efficient and thereby cause some
>> performance loss as well.
>> I think one improvement we can make is to remove the "chunk format"
>> related things from this function. The function can generally offer the
>> base pfn to the caller's recording buffer. Then it will be the caller's
>> responsibility to format the pfn if they need to.
> Sounds good at a high level, but we'd have to see the implementation
> to judge it properly.
>
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -4498,6 +4498,120 @@ void show_free_areas(unsigned int filter)
>>>>    	show_swap_cache_info();
>>>>    }
>>>> +static int __record_unused_pages(struct zone *zone, int order,
>>>> +				 __le64 *buf, unsigned int size,
>>>> +				 unsigned int *offset, bool part_fill)
>>>> +{
>>>> +	unsigned long pfn, flags;
>>>> +	int t, ret = 0;
>>>> +	struct list_head *curr;
>>>> +	__le64 *chunk;
>>>> +
>>>> +	if (zone_is_empty(zone))
>>>> +		return 0;
>>>> +
>>>> +	spin_lock_irqsave(&zone->lock, flags);
>>>> +
>>>> +	if (*offset + zone->free_area[order].nr_free > size && !part_fill) {
>>>> +		ret = -ENOSPC;
>>>> +		goto out;
>>>> +	}
>>>> +	for (t = 0; t < MIGRATE_TYPES; t++) {
>>>> +		list_for_each(curr, &zone->free_area[order].free_list[t]) {
>>>> +			pfn = page_to_pfn(list_entry(curr, struct page, lru));
>>>> +			chunk = buf + *offset;
>>>> +			if (*offset + 2 > size) {
>>>> +				ret = -ENOSPC;
>>>> +				goto out;
>>>> +			}
>>>> +			/* Align to the chunk format used in virtio-balloon */
>>>> +			*chunk = cpu_to_le64(pfn << 12);
>>>> +			*(chunk + 1) = cpu_to_le64((1 << order) << 12);
>>>> +			*offset += 2;
>>>> +		}
>>>> +	}
>>>> +
>>>> +out:
>>>> +	spin_unlock_irqrestore(&zone->lock, flags);
>>>> +
>>>> +	return ret;
>>>> +}
>>> This looks like it could disable interrupts for a long time.  Too long?
>> What do you think if we give "budgets" to the above function?
>> For example, budget=1000, and there are 2000 nodes on the list.
>> record() returns with "incomplete" status in the first round, along with the
>> status info, "*continue_node".
>>
>> *continue_node: pointer to the starting node of the leftover. If
>> *continue_node has been used at the time of the second call
>> (i.e. continue_node->next == NULL),
>> which implies that the previous 1000 nodes have been used, then the record()
>> function can simply start from the head of the list.
>>
>> It is up to the caller whether it needs to continue the second round
>> when getting "incomplete".
> It might be cleaner to add APIs to
> 	- start iteration
> 	- do one step
> 	- end iteration
>
> caller can then iterate without too many issues
>

OK. I will re-implement it with this simple one - get only one node (page
block) from the list in each call, and check if the time would increase
a lot in comparison to v8.

Best,
Wei

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH kernel v8 3/4] mm: add inerface to offer info about unused pages
  2017-03-31  9:53         ` Wei Wang
@ 2017-03-31 16:25           ` Michael S. Tsirkin
  0 siblings, 0 replies; 18+ messages in thread
From: Michael S. Tsirkin @ 2017-03-31 16:25 UTC (permalink / raw)
  To: Wei Wang
  Cc: Andrew Morton, virtio-dev, linux-kernel, qemu-devel,
	virtualization, kvm, linux-mm, david, dave.hansen, cornelia.huck,
	mgorman, aarcange, amit.shah, pbonzini, liliang.opensource

On Fri, Mar 31, 2017 at 05:53:00PM +0800, Wei Wang wrote:
> On 03/30/2017 01:48 AM, Michael S. Tsirkin wrote:
> > On Fri, Mar 17, 2017 at 02:55:33PM +0800, Wei Wang wrote:
> > > On 03/17/2017 05:28 AM, Andrew Morton wrote:
> > > > On Thu, 16 Mar 2017 15:08:46 +0800 Wei Wang <wei.w.wang@intel.com> wrote:
> > > > 
> > > > > From: Liang Li <liang.z.li@intel.com>
> > > > > 
> > > > > This patch adds a function that provides a snapshot of the present system
> > > > > unused pages. An important usage of this function is to provide the
> > > > > unused pages to the live migration thread, which skips the transfer of
> > > > > those unused pages. Newly used pages can be re-tracked by the dirty
> > > > > page logging mechanisms.
> > > > I don't think this will be useful for anything other than
> > > > virtio-balloon.  I guess it would be better to keep this code in the
> > > > virtio-balloon driver if possible, even though that's rather a layering
> > > > violation :( What would have to be done to make that possible?  Perhaps
> > > > we can put some *small* helpers into page_alloc.c to prevent things
> > > > from becoming too ugly.
> > > The patch description was too narrow and may have caused some
> > > confusion, sorry about that. This function is intended to be generic. I
> > > agree with the description suggested by Michael.
> > > 
> > > Since the main body of the function is related to operating on the
> > > free_list, I think it is better to have it located here.
> > > Small helpers may be less efficient and thereby cause some
> > > performance loss as well.
> > > I think one improvement we can make is to remove the "chunk format"
> > > related things from this function. The function can generally offer the
> > > base pfn to the caller's recording buffer. Then it will be the caller's
> > > responsibility to format the pfn if they need to.
> > Sounds good at a high level, but we'd have to see the implementation
> > to judge it properly.
> > 
> > > > > --- a/mm/page_alloc.c
> > > > > +++ b/mm/page_alloc.c
> > > > > @@ -4498,6 +4498,120 @@ void show_free_areas(unsigned int filter)
> > > > >    	show_swap_cache_info();
> > > > >    }
> > > > > +static int __record_unused_pages(struct zone *zone, int order,
> > > > > +				 __le64 *buf, unsigned int size,
> > > > > +				 unsigned int *offset, bool part_fill)
> > > > > +{
> > > > > +	unsigned long pfn, flags;
> > > > > +	int t, ret = 0;
> > > > > +	struct list_head *curr;
> > > > > +	__le64 *chunk;
> > > > > +
> > > > > +	if (zone_is_empty(zone))
> > > > > +		return 0;
> > > > > +
> > > > > +	spin_lock_irqsave(&zone->lock, flags);
> > > > > +
> > > > > +	if (*offset + zone->free_area[order].nr_free > size && !part_fill) {
> > > > > +		ret = -ENOSPC;
> > > > > +		goto out;
> > > > > +	}
> > > > > +	for (t = 0; t < MIGRATE_TYPES; t++) {
> > > > > +		list_for_each(curr, &zone->free_area[order].free_list[t]) {
> > > > > +			pfn = page_to_pfn(list_entry(curr, struct page, lru));
> > > > > +			chunk = buf + *offset;
> > > > > +			if (*offset + 2 > size) {
> > > > > +				ret = -ENOSPC;
> > > > > +				goto out;
> > > > > +			}
> > > > > +			/* Align to the chunk format used in virtio-balloon */
> > > > > +			*chunk = cpu_to_le64(pfn << 12);
> > > > > +			*(chunk + 1) = cpu_to_le64((1 << order) << 12);
> > > > > +			*offset += 2;
> > > > > +		}
> > > > > +	}
> > > > > +
> > > > > +out:
> > > > > +	spin_unlock_irqrestore(&zone->lock, flags);
> > > > > +
> > > > > +	return ret;
> > > > > +}
> > > > This looks like it could disable interrupts for a long time.  Too long?
> > > What do you think if we give "budgets" to the above function?
> > > For example, budget=1000, and there are 2000 nodes on the list.
> > > record() returns with "incomplete" status in the first round, along with the
> > > status info, "*continue_node".
> > > 
> > > *continue_node: pointer to the starting node of the leftover. If
> > > *continue_node has been used at the time of the second call
> > > (i.e. continue_node->next == NULL),
> > > which implies that the previous 1000 nodes have been used, then the record()
> > > function can simply start from the head of the list.
> > > 
> > > It is up to the caller whether it needs to continue the second round
> > > when getting "incomplete".
> > It might be cleaner to add APIs to
> > 	- start iteration
> > 	- do one step
> > 	- end iteration
> > 
> > caller can then iterate without too many issues
> > 
> 
> OK. I will re-implement it with this simple one - get only one node (page
> block) from the list in each call, and check if the time would increase a
> lot in comparison to v8.
> 
> Best,
> Wei

Might work, though this isn't what was suggested - just an iterator-based
approach that allows the user to drop the lock periodically.

-- 
MST

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH kernel v8 2/4] virtio-balloon: VIRTIO_BALLOON_F_CHUNK_TRANSFER
  2017-03-16  7:08 ` [PATCH kernel v8 2/4] virtio-balloon: VIRTIO_BALLOON_F_CHUNK_TRANSFER Wei Wang
@ 2017-04-05  3:31   ` Wang, Wei W
  2017-04-05  3:53     ` Michael S. Tsirkin
  0 siblings, 1 reply; 18+ messages in thread
From: Wang, Wei W @ 2017-04-05  3:31 UTC (permalink / raw)
  To: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, Hansen, Dave, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Thursday, March 16, 2017 3:09 PM Wei Wang wrote:
> The implementation of the current virtio-balloon is not very efficient, because
> the ballooned pages are transferred to the host one by one. Here is the
> breakdown of the time in percentage spent on each step of the balloon inflating
> process (inflating 7GB of an 8GB idle guest).
> 
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
> 
> It takes about 4126ms for the inflating process to complete.
> The above profiling shows that the bottlenecks are stage 2) and stage 4).
> 
> This patch optimizes step 2) by transferring pages to the host in chunks. A chunk
> consists of guest physically contiguous pages, and it is offered to the host via a
> base PFN (i.e. the start PFN of those physically contiguous pages) and the size
> (i.e. the total number of the pages). A chunk is formatted as below:
> 
> --------------------------------------------------------
> |                 Base (52 bit)        | Rsvd (12 bit) |
> --------------------------------------------------------
> --------------------------------------------------------
> |                 Size (52 bit)        | Rsvd (12 bit) |
> --------------------------------------------------------
> 
> By doing so, step 4) can also be optimized by doing address translation and
> madvise() in chunks rather than page by page.
> 
> This optimization requires the negotiation of a new feature bit,
> VIRTIO_BALLOON_F_CHUNK_TRANSFER.
> 
> With this new feature, the above ballooning process takes ~590ms resulting in
> an improvement of ~85%.
> 
> TODO: optimize stage 1) by allocating/freeing a chunk of pages instead of a
> single page each time.
> 
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/virtio/virtio_balloon.c     | 371 +++++++++++++++++++++++++++++++++---
>  include/uapi/linux/virtio_balloon.h |   9 +
>  2 files changed, 353 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index
> f59cb4f..3f4a161 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -42,6 +42,10 @@
>  #define OOM_VBALLOON_DEFAULT_PAGES 256
>  #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> 
> +#define PAGE_BMAP_SIZE	(8 * PAGE_SIZE)
> +#define PFNS_PER_PAGE_BMAP	(PAGE_BMAP_SIZE * BITS_PER_BYTE)
> +#define PAGE_BMAP_COUNT_MAX	32
> +
>  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> MODULE_PARM_DESC(oom_pages, "pages to free on OOM"); @@ -50,6 +54,14
> @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");  static struct
> vfsmount *balloon_mnt;  #endif
> 
> +#define BALLOON_CHUNK_BASE_SHIFT 12
> +#define BALLOON_CHUNK_SIZE_SHIFT 12
> +struct balloon_page_chunk {
> +	__le64 base;
> +	__le64 size;
> +};
> +
> +typedef __le64 resp_data_t;
>  struct virtio_balloon {
>  	struct virtio_device *vdev;
>  	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq; @@ -67,6 +79,31
> @@ struct virtio_balloon {
> 
>  	/* Number of balloon pages we've told the Host we're not using. */
>  	unsigned int num_pages;
> +	/* Pointer to the response header. */
> +	struct virtio_balloon_resp_hdr *resp_hdr;
> +	/* Pointer to the start address of response data. */
> +	resp_data_t *resp_data;

I think the implementation has an issue here - both the balloon pages and
the unused pages use the same buffer ("resp_data" above) to store chunks.
It would cause a race in this case: live migration starts while ballooning
is also in progress. I plan to use separate buffers for
CHUNKS_OF_BALLOON_PAGES and CHUNKS_OF_UNUSED_PAGES. Please let me know if
you have a different suggestion. Thanks.

Best,
Wei

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH kernel v8 2/4] virtio-balloon: VIRTIO_BALLOON_F_CHUNK_TRANSFER
  2017-04-05  3:31   ` Wang, Wei W
@ 2017-04-05  3:53     ` Michael S. Tsirkin
  2017-04-05  4:31       ` Wang, Wei W
  0 siblings, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2017-04-05  3:53 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, Hansen, Dave, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Wed, Apr 05, 2017 at 03:31:36AM +0000, Wang, Wei W wrote:
> On Thursday, March 16, 2017 3:09 PM Wei Wang wrote:
> > The implementation of the current virtio-balloon is not very efficient, because
> > the ballooned pages are transferred to the host one by one. Here is the
> > breakdown of the time in percentage spent on each step of the balloon inflating
> > process (inflating 7GB of an 8GB idle guest).
> > 
> > 1) allocating pages (6.5%)
> > 2) sending PFNs to host (68.3%)
> > 3) address translation (6.1%)
> > 4) madvise (19%)
> > 
> > It takes about 4126ms for the inflating process to complete.
> > The above profiling shows that the bottlenecks are stage 2) and stage 4).
> > 
> > This patch optimizes step 2) by transferring pages to the host in chunks. A chunk
> > consists of guest physically contiguous pages, and it is offered to the host via a
> > base PFN (i.e. the start PFN of those physically contiguous pages) and the size
> > (i.e. the total number of the pages). A chunk is formatted as below:
> > 
> > --------------------------------------------------------
> > |                 Base (52 bit)        | Rsvd (12 bit) |
> > --------------------------------------------------------
> > --------------------------------------------------------
> > |                 Size (52 bit)        | Rsvd (12 bit) |
> > --------------------------------------------------------
> > 
> > By doing so, step 4) can also be optimized by doing address translation and
> > madvise() in chunks rather than page by page.
> > 
> > This optimization requires the negotiation of a new feature bit,
> > VIRTIO_BALLOON_F_CHUNK_TRANSFER.
> > 
> > With this new feature, the above ballooning process takes ~590ms resulting in
> > an improvement of ~85%.
> > 
> > TODO: optimize stage 1) by allocating/freeing a chunk of pages instead of a
> > single page each time.
> > 
> > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >  drivers/virtio/virtio_balloon.c     | 371 +++++++++++++++++++++++++++++++++---
> >  include/uapi/linux/virtio_balloon.h |   9 +
> >  2 files changed, 353 insertions(+), 27 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index
> > f59cb4f..3f4a161 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -42,6 +42,10 @@
> >  #define OOM_VBALLOON_DEFAULT_PAGES 256
> >  #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> > 
> > +#define PAGE_BMAP_SIZE	(8 * PAGE_SIZE)
> > +#define PFNS_PER_PAGE_BMAP	(PAGE_BMAP_SIZE * BITS_PER_BYTE)
> > +#define PAGE_BMAP_COUNT_MAX	32
> > +
> >  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > MODULE_PARM_DESC(oom_pages, "pages to free on OOM"); @@ -50,6 +54,14
> > @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");  static struct
> > vfsmount *balloon_mnt;  #endif
> > 
> > +#define BALLOON_CHUNK_BASE_SHIFT 12
> > +#define BALLOON_CHUNK_SIZE_SHIFT 12
> > +struct balloon_page_chunk {
> > +struct balloon_page_chunk {
> > +	__le64 base;
> > +	__le64 size;
> > +};
> > +
> > +typedef __le64 resp_data_t;
> >  struct virtio_balloon {
> >  	struct virtio_device *vdev;
> >  	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq; @@ -67,6 +79,31
> > @@ struct virtio_balloon {
> > 
> >  	/* Number of balloon pages we've told the Host we're not using. */
> >  	unsigned int num_pages;
> > +	/* Pointer to the response header. */
> > +	struct virtio_balloon_resp_hdr *resp_hdr;
> > +	/* Pointer to the start address of response data. */
> > +	resp_data_t *resp_data;
> 
> I think the implementation has an issue here - both the balloon pages and the unused pages use the same buffer ("resp_data" above) to store chunks. It would cause a race in this case: live migration starts while ballooning is also in progress. I plan to use separate buffers for CHUNKS_OF_BALLOON_PAGES and CHUNKS_OF_UNUSED_PAGES. Please let me know if you have a different suggestion. Thanks.
> 
> Best,
> Wei

Is only one resp data ever in flight for each kind?
If not, you want as many buffers as the vq allows.

-- 
MST

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH kernel v8 2/4] virtio-balloon: VIRTIO_BALLOON_F_CHUNK_TRANSFER
  2017-04-05  3:53     ` Michael S. Tsirkin
@ 2017-04-05  4:31       ` Wang, Wei W
  2017-04-05  7:47         ` Wang, Wei W
  0 siblings, 1 reply; 18+ messages in thread
From: Wang, Wei W @ 2017-04-05  4:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, david, Hansen, Dave, cornelia.huck, akpm, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On Wednesday, April 5, 2017 11:54 AM, Michael S. Tsirkin wrote:
> On Wed, Apr 05, 2017 at 03:31:36AM +0000, Wang, Wei W wrote:
> > On Thursday, March 16, 2017 3:09 PM Wei Wang wrote:
> > > The implementation of the current virtio-balloon is not very
> > > efficient, because the ballooned pages are transferred to the host
> > > one by one. Here is the breakdown of the time in percentage spent on
> > > each step of the balloon inflating process (inflating 7GB of an 8GB idle guest).
> > >
> > > 1) allocating pages (6.5%)
> > > 2) sending PFNs to host (68.3%)
> > > 3) address translation (6.1%)
> > > 4) madvise (19%)
> > >
> > > It takes about 4126ms for the inflating process to complete.
> > > The above profiling shows that the bottlenecks are stage 2) and stage 4).
> > >
> > > This patch optimizes step 2) by transferring pages to the host in
> > > chunks. A chunk consists of guest physically contiguous pages, and
> > > it is offered to the host via a base PFN (i.e. the start PFN of
> > > those physically contiguous pages) and the size (i.e. the total
> > > number of the pages). A chunk is formatted as below:
> > >
> > > --------------------------------------------------------
> > > |                 Base (52 bit)        | Rsvd (12 bit) |
> > > --------------------------------------------------------
> > > --------------------------------------------------------
> > > |                 Size (52 bit)        | Rsvd (12 bit) |
> > > --------------------------------------------------------
> > >
> > > By doing so, step 4) can also be optimized by doing address
> > > translation and
> > > madvise() in chunks rather than page by page.
> > >
> > > This optimization requires the negotiation of a new feature bit,
> > > VIRTIO_BALLOON_F_CHUNK_TRANSFER.
> > >
> > > With this new feature, the above ballooning process takes ~590ms
> > > resulting in an improvement of ~85%.
> > >
> > > TODO: optimize stage 1) by allocating/freeing a chunk of pages
> > > instead of a single page each time.
> > >
> > > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > > Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> > > ---
> > >  drivers/virtio/virtio_balloon.c     | 371 +++++++++++++++++++++++++++++++++---
> > >  include/uapi/linux/virtio_balloon.h |   9 +
> > >  2 files changed, 353 insertions(+), 27 deletions(-)
> > >
> > > diff --git a/drivers/virtio/virtio_balloon.c
> > > b/drivers/virtio/virtio_balloon.c index
> > > f59cb4f..3f4a161 100644
> > > --- a/drivers/virtio/virtio_balloon.c
> > > +++ b/drivers/virtio/virtio_balloon.c
> > > @@ -42,6 +42,10 @@
> > >  #define OOM_VBALLOON_DEFAULT_PAGES 256  #define
> > > VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> > >
> > > +#define PAGE_BMAP_SIZE	(8 * PAGE_SIZE)
> > > +#define PFNS_PER_PAGE_BMAP	(PAGE_BMAP_SIZE * BITS_PER_BYTE)
> > > +#define PAGE_BMAP_COUNT_MAX	32
> > > +
> > >  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > > MODULE_PARM_DESC(oom_pages, "pages to free on OOM"); @@ -50,6
> +54,14
> > > @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");  static
> > > struct vfsmount *balloon_mnt;  #endif
> > >
> > > +#define BALLOON_CHUNK_BASE_SHIFT 12 #define
> > > +BALLOON_CHUNK_SIZE_SHIFT 12 struct balloon_page_chunk {
> > > +	__le64 base;
> > > +	__le64 size;
> > > +};
> > > +
> > > +typedef __le64 resp_data_t;
> > >  struct virtio_balloon {
> > >  	struct virtio_device *vdev;
> > >  	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq; @@ -67,6
> > > +79,31 @@ struct virtio_balloon {
> > >
> > >  	/* Number of balloon pages we've told the Host we're not using. */
> > >  	unsigned int num_pages;
> > > +	/* Pointer to the response header. */
> > > +	struct virtio_balloon_resp_hdr *resp_hdr;
> > > +	/* Pointer to the start address of response data. */
> > > +	resp_data_t *resp_data;
> >
> > I think the implementation has an issue here - both the balloon pages and the
> unused pages use the same buffer ("resp_data" above) to store chunks. It would
> cause a race in this case: live migration starts while ballooning is also in progress.
> I plan to use separate buffers for CHUNKS_OF_BALLOON_PAGES and
> CHUNKS_OF_UNUSED_PAGES. Please let me know if you have a different
> suggestion. Thanks.
> >
> > Best,
> > Wei
> 
> Is only one resp data ever in flight for each kind?
> If not you want as many buffers as vq allows.
> 

No, all the kinds currently share a single resp_data. I will make it one resp_data for each kind.

Best,
Wei

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH kernel v8 2/4] virtio-balloon: VIRTIO_BALLOON_F_CHUNK_TRANSFER
  2017-04-05  4:31       ` Wang, Wei W
@ 2017-04-05  7:47         ` Wang, Wei W
  0 siblings, 0 replies; 18+ messages in thread
From: Wang, Wei W @ 2017-04-05  7:47 UTC (permalink / raw)
  To: Wang, Wei W, Michael S. Tsirkin
  Cc: aarcange, virtio-dev, kvm, qemu-devel, amit.shah,
	liliang.opensource, Hansen, Dave, linux-kernel, virtualization,
	linux-mm, cornelia.huck, pbonzini, akpm, mgorman

On Wednesday, April 5, 2017 12:31 PM, Wei Wang wrote:
> On Wednesday, April 5, 2017 11:54 AM, Michael S. Tsirkin wrote:
> > On Wed, Apr 05, 2017 at 03:31:36AM +0000, Wang, Wei W wrote:
> > > On Thursday, March 16, 2017 3:09 PM Wei Wang wrote:
> > > > The implementation of the current virtio-balloon is not very
> > > > efficient, because the ballooned pages are transferred to the host
> > > > one by one. Here is the breakdown of the time in percentage spent
> > > > on each step of the balloon inflating process (inflating 7GB of an 8GB idle
> guest).
> > > >
> > > > 1) allocating pages (6.5%)
> > > > 2) sending PFNs to host (68.3%)
> > > > 3) address translation (6.1%)
> > > > 4) madvise (19%)
> > > >
> > > > It takes about 4126ms for the inflating process to complete.
> > > > The above profiling shows that the bottlenecks are stage 2) and stage 4).
> > > >
> > > > This patch optimizes step 2) by transferring pages to the host in
> > > chunks. A chunk consists of guest physically contiguous pages, and
> > > it is offered to the host via a base PFN (i.e. the start PFN of
> > > those physically contiguous pages) and the size (i.e. the total
> > > number of
> > pages). A chunk is formatted as below:
> > > >
> > > > --------------------------------------------------------
> > > > |                 Base (52 bit)        | Rsvd (12 bit) |
> > > > --------------------------------------------------------
> > > > --------------------------------------------------------
> > > > |                 Size (52 bit)        | Rsvd (12 bit) |
> > > > --------------------------------------------------------
> > > >
> > > > By doing so, step 4) can also be optimized by doing address
> > > > translation and
> > > > madvise() in chunks rather than page by page.
> > > >
> > > > This optimization requires the negotiation of a new feature bit,
> > > > VIRTIO_BALLOON_F_CHUNK_TRANSFER.
> > > >
> > > > With this new feature, the above ballooning process takes ~590ms
> > > > resulting in an improvement of ~85%.
> > > >
> > > > TODO: optimize stage 1) by allocating/freeing a chunk of pages
> > > > instead of a single page each time.
> > > >
> > > > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > > > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > > > Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> > > > ---
> > > >  drivers/virtio/virtio_balloon.c     | 371
> > +++++++++++++++++++++++++++++++++-
> > > > --
> > > >  include/uapi/linux/virtio_balloon.h |   9 +
> > > >  2 files changed, 353 insertions(+), 27 deletions(-)
> > > >
> > > > diff --git a/drivers/virtio/virtio_balloon.c
> > > > b/drivers/virtio/virtio_balloon.c index
> > > > f59cb4f..3f4a161 100644
> > > > --- a/drivers/virtio/virtio_balloon.c
> > > > +++ b/drivers/virtio/virtio_balloon.c
> > > > @@ -42,6 +42,10 @@
> > > >  #define OOM_VBALLOON_DEFAULT_PAGES 256  #define
> > > > VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> > > >
> > > > +#define PAGE_BMAP_SIZE	(8 * PAGE_SIZE)
> > > > +#define PFNS_PER_PAGE_BMAP	(PAGE_BMAP_SIZE * BITS_PER_BYTE)
> > > > +#define PAGE_BMAP_COUNT_MAX	32
> > > > +
> > > >  static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > > > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > > > MODULE_PARM_DESC(oom_pages, "pages to free on OOM"); @@ -50,6
> > +54,14
> > > > @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");  static
> > > > struct vfsmount *balloon_mnt;  #endif
> > > >
> > > > +#define BALLOON_CHUNK_BASE_SHIFT 12 #define
> > > > +BALLOON_CHUNK_SIZE_SHIFT 12 struct balloon_page_chunk {
> > > > +	__le64 base;
> > > > +	__le64 size;
> > > > +};
> > > > +
> > > > +typedef __le64 resp_data_t;
> > > >  struct virtio_balloon {
> > > >  	struct virtio_device *vdev;
> > > >  	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq; @@ -67,6
> > > > +79,31 @@ struct virtio_balloon {
> > > >
> > > >  	/* Number of balloon pages we've told the Host we're not using. */
> > > >  	unsigned int num_pages;
> > > > +	/* Pointer to the response header. */
> > > > +	struct virtio_balloon_resp_hdr *resp_hdr;
> > > > +	/* Pointer to the start address of response data. */
> > > > +	resp_data_t *resp_data;
> > >
> > > I think the implementation has an issue here - both the balloon
> > > pages and the
> > unused pages use the same buffer ("resp_data" above) to store chunks.
> > It would cause a race in this case: live migration starts while ballooning is also
> in progress.
> > I plan to use separate buffers for CHUNKS_OF_BALLOON_PAGES and
> > CHUNKS_OF_UNUSED_PAGES. Please let me know if you have a different
> > suggestion. Thanks.
> > >
> > > Best,
> > > Wei
> >
> > Is only one resp data ever in flight for each kind?
> > If not you want as many buffers as vq allows.
> >
> 
> No, all the kinds currently share a single resp_data. I will make it one
> resp_data for each kind.
> 

Just in case it wasn't well explained - there is one resp_data in flight for each kind, but the two kinds currently share that one resp_data under the balloon_lock. I'm wondering whether it would be worthwhile to give them separate resp_data buffers, so that when live migration begins it doesn't need to wait for the unused pages (e.g. when inflating holds the resp_data and it is unclear how long it will take to finish). The unused pages are useful for the first round of RAM transfer, so I think it is important to get them as soon as possible.

Best,
Wei

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH kernel v8 3/4] mm: add interface to offer info about unused pages
  2017-03-16 21:28   ` Andrew Morton
  2017-03-17  6:55     ` Wei Wang
@ 2017-04-13 11:07     ` Wei Wang
  1 sibling, 0 replies; 18+ messages in thread
From: Wei Wang @ 2017-04-13 11:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: virtio-dev, linux-kernel, qemu-devel, virtualization, kvm,
	linux-mm, mst, david, dave.hansen, cornelia.huck, mgorman,
	aarcange, amit.shah, pbonzini, liliang.opensource

On 03/17/2017 05:28 AM, Andrew Morton wrote:
> On Thu, 16 Mar 2017 15:08:46 +0800 Wei Wang <wei.w.wang@intel.com> wrote:
>
>> From: Liang Li <liang.z.li@intel.com>
>>
>> This patch adds a function to provide a snapshot of the system's
>> currently unused pages. An important usage of this function is to provide
>> the unused pages to the live migration thread, which then skips the
>> transfer of those unused pages. Newly used pages can be re-tracked by the
>> dirty page logging mechanisms.
> I don't think this will be useful for anything other than
> virtio-balloon.  I guess it would be better to keep this code in the
> virtio-balloon driver if possible, even though that's rather a layering
> violation :( What would have to be done to make that possible?  Perhaps
> we can put some *small* helpers into page_alloc.c to prevent things
> from becoming too ugly.
>
>
Thanks for the suggestion. Small helpers do look more elegant. The nice 
thing is that I also didn't see any performance loss.
To make that possible, we need to make for_each_populated_zone() 
callable by a kernel module. Please check the v9 patches that I 
just posted.

Best,
Wei

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-04-13 11:06 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-16  7:08 [PATCH kernel v8 0/4] Extend virtio-balloon for fast (de)inflating & fast live migration Wei Wang
2017-03-16  7:08 ` [PATCH kernel v8 1/4] virtio-balloon: deflate via a page list Wei Wang
2017-03-16  7:08 ` [PATCH kernel v8 2/4] virtio-balloon: VIRTIO_BALLOON_F_CHUNK_TRANSFER Wei Wang
2017-04-05  3:31   ` Wang, Wei W
2017-04-05  3:53     ` Michael S. Tsirkin
2017-04-05  4:31       ` Wang, Wei W
2017-04-05  7:47         ` Wang, Wei W
2017-03-16  7:08 ` [PATCH kernel v8 3/4] mm: add interface to offer info about unused pages
2017-03-16 21:28   ` Andrew Morton
2017-03-17  6:55     ` Wei Wang
2017-03-22 10:52       ` Wang, Wei W
2017-03-29 17:48       ` Michael S. Tsirkin
2017-03-31  9:53         ` Wei Wang
2017-03-31 16:25           ` Michael S. Tsirkin
2017-04-13 11:07     ` Wei Wang
2017-03-17  1:21   ` Michael S. Tsirkin
2017-03-16  7:08 ` [PATCH kernel v8 4/4] virtio-balloon: VIRTIO_BALLOON_F_HOST_REQ_VQ Wei Wang
2017-03-17  1:39   ` Michael S. Tsirkin
