linux-kernel.vger.kernel.org archive mirror
* [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
@ 2016-11-30  8:43 Liang Li
  2016-11-30  8:43 ` [PATCH kernel v5 1/5] virtio-balloon: rework deflate to add page to a list Liang Li
                   ` (5 more replies)
  0 siblings, 6 replies; 41+ messages in thread
From: Liang Li @ 2016-11-30  8:43 UTC (permalink / raw)
  To: kvm
  Cc: virtualization, linux-kernel, linux-mm, virtio-dev, qemu-devel,
	quintela, dgilbert, dave.hansen, mst, jasowang, kirill.shutemov,
	akpm, mhocko, pbonzini, Liang Li

This patch set contains two sets of changes to the virtio-balloon.

The first speeds up the inflating & deflating process. The main idea
of this optimization is to use a bitmap instead of PFNs to send the
page information to the host, which reduces the overhead of virtio
data transmission, address translation and madvise(). This improves
performance by about 85%.

The second speeds up live migration. By skipping the guest's unused
pages in the first round of data copy, we reduce needless data
processing and save quite a lot of CPU cycles and network bandwidth.
We put the guest's unused page information in a bitmap and send it to
the host via a virtio-balloon virtqueue. For an idle guest with 8GB
RAM, this shortens the total live migration time from about 2s to
about 500ms on a 10Gbps network.
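A back-of-envelope calculation (not code from this series; the helper
names are illustrative only) shows why a bitmap is so much cheaper than
a PFN array on the wire:

```c
#include <stddef.h>

/* Bytes needed to describe `nr_pages` guest pages when each page is
 * reported as a 32-bit balloon PFN, vs. one bit per page. */
static unsigned long pfn_array_bytes(unsigned long nr_pages)
{
	return nr_pages * sizeof(unsigned int);	/* 4 bytes per page */
}

static unsigned long bitmap_bytes(unsigned long nr_pages)
{
	return (nr_pages + 7) / 8;		/* 1 bit per page */
}
```

For the 8GB guest above (2,097,152 4KB pages), the PFN array would be
8MB while the bitmap is only 256KB, a 32x reduction.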
 
Changes from v4 to v5:
    * Drop the code to get the max_pfn, use another way instead.
    * Simplify the API to get the unused page information from mm. 

Changes from v3 to v4:
    * Use the new scheme suggested by Dave Hansen to encode the bitmap.
    * Add code missed in v3 to handle page migration.
    * Free the bitmap memory promptly once the operation is done.
    * Address some of the comments in v3.

Changes from v2 to v3:
    * Change the name of 'free page' to 'unused page'.
    * Use the scatter & gather bitmap instead of a 1MB page bitmap.
    * Fix overwriting the page bitmap after kicking.
    * Address some of MST's comments for v2.
 
Changes from v1 to v2:
    * Abandon the patch for dropping page cache.
    * Move some structures to the uapi header file.
    * Use a new way to determine the page bitmap size.
    * Use a unified way to send the free page information with the bitmap.
    * Address the issues referred to in MST's comments.

Liang Li (5):
  virtio-balloon: rework deflate to add page to a list
  virtio-balloon: define new feature bit and head struct
  virtio-balloon: speed up inflate/deflate process
  virtio-balloon: define flags and head for host request vq
  virtio-balloon: tell host vm's unused page info

 drivers/virtio/virtio_balloon.c     | 539 ++++++++++++++++++++++++++++++++----
 include/linux/mm.h                  |   3 +-
 include/uapi/linux/virtio_balloon.h |  41 +++
 mm/page_alloc.c                     |  72 +++++
 4 files changed, 607 insertions(+), 48 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH kernel v5 1/5] virtio-balloon: rework deflate to add page to a list
  2016-11-30  8:43 [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration Liang Li
@ 2016-11-30  8:43 ` Liang Li
  2016-11-30  8:43 ` [PATCH kernel v5 2/5] virtio-balloon: define new feature bit and head struct Liang Li
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 41+ messages in thread
From: Liang Li @ 2016-11-30  8:43 UTC (permalink / raw)
  To: kvm
  Cc: virtualization, linux-kernel, linux-mm, virtio-dev, qemu-devel,
	quintela, dgilbert, dave.hansen, mst, jasowang, kirill.shutemov,
	akpm, mhocko, pbonzini, Liang Li, Cornelia Huck, Amit Shah

When inflating/deflating, the current virtio-balloon implementation
uses an array to save 256 PFNs, then sends these PFNs to the host
through virtio and processes each PFN one by one. This is inefficient
when inflating/deflating a large amount of memory, because it repeats
the following operations many times:

    1. Virtio data transmission
    2. Page allocate/free
    3. Address translation(GPA->HVA)
    4. madvise

The overhead of these operations consumes a lot of CPU cycles and
takes a long time to complete, which may impact the QoS of the guest
as well as the host. The overhead can be reduced a lot with batch
processing: e.g. if several pages are physically contiguous in the
guest, they can be processed in one operation.

The main idea of the optimization is to reduce the above operations
as much as possible, which can be achieved by using a bitmap instead
of a PFN array. Compared with a PFN array, a bitmap of the same
buffer size can describe far more pages, which is important for batch
processing.
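The batch processing described above amounts to scanning the bitmap
for runs of consecutive set bits, so that each run needs only one
translation and one madvise(). A minimal userspace sketch of such a
run finder (illustrative only; the patch itself uses the kernel's
find_next_bit/find_next_zero_bit helpers):

```c
#define ULONG_BITS (8 * sizeof(unsigned long))

static int bmap_test(const unsigned long *bmap, unsigned long bit)
{
	return (bmap[bit / ULONG_BITS] >> (bit % ULONG_BITS)) & 1;
}

/* Find the next run of set bits at or after `start`; return its first
 * bit index and store its length in `*len` (len == 0 if none left). */
static unsigned long next_run(const unsigned long *bmap, unsigned long size,
			      unsigned long start, unsigned long *len)
{
	unsigned long one = start, end;

	while (one < size && !bmap_test(bmap, one))
		one++;			/* skip clear bits */
	end = one;
	while (end < size && bmap_test(bmap, end))
		end++;			/* extend over set bits */
	*len = end - one;
	return one;
}
```

Each run returned this way corresponds to one physically contiguous
range that can be handled in a single operation.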

Using a bitmap instead of PFNs is not very helpful when
inflating/deflating a small number of pages; in that case, using PFNs
is better. But a bitmap will not impact the QoS of the guest or host
heavily, because the operation completes very quickly for a small
number of pages, and we take some measures to make sure the
efficiency does not drop too much.

This patch saves the deflated pages to a list instead of the PFN
array, which will allow faster notifications using a bitmap down the
road. balloon_pfn_to_page() can be removed because it is no longer
used.

Signed-off-by: Liang Li <liang.z.li@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Amit Shah <amit.shah@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
---
 drivers/virtio/virtio_balloon.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 181793f..f59cb4f 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -103,12 +103,6 @@ static u32 page_to_balloon_pfn(struct page *page)
 	return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE;
 }
 
-static struct page *balloon_pfn_to_page(u32 pfn)
-{
-	BUG_ON(pfn % VIRTIO_BALLOON_PAGES_PER_PAGE);
-	return pfn_to_page(pfn / VIRTIO_BALLOON_PAGES_PER_PAGE);
-}
-
 static void balloon_ack(struct virtqueue *vq)
 {
 	struct virtio_balloon *vb = vq->vdev->priv;
@@ -181,18 +175,16 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 	return num_allocated_pages;
 }
 
-static void release_pages_balloon(struct virtio_balloon *vb)
+static void release_pages_balloon(struct virtio_balloon *vb,
+				 struct list_head *pages)
 {
-	unsigned int i;
-	struct page *page;
+	struct page *page, *next;
 
-	/* Find pfns pointing at start of each page, get pages and free them. */
-	for (i = 0; i < vb->num_pfns; i += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-		page = balloon_pfn_to_page(virtio32_to_cpu(vb->vdev,
-							   vb->pfns[i]));
+	list_for_each_entry_safe(page, next, pages, lru) {
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
 			adjust_managed_page_count(page, 1);
+		list_del(&page->lru);
 		put_page(page); /* balloon reference */
 	}
 }
@@ -202,6 +194,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	unsigned num_freed_pages;
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
+	LIST_HEAD(pages);
 
 	/* We can only do one array worth at a time. */
 	num = min(num, ARRAY_SIZE(vb->pfns));
@@ -215,6 +208,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		if (!page)
 			break;
 		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
 
@@ -226,7 +220,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 */
 	if (vb->num_pfns != 0)
 		tell_host(vb, vb->deflate_vq);
-	release_pages_balloon(vb);
+	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
 }
-- 
1.8.3.1


* [PATCH kernel v5 2/5] virtio-balloon: define new feature bit and head struct
  2016-11-30  8:43 [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration Liang Li
  2016-11-30  8:43 ` [PATCH kernel v5 1/5] virtio-balloon: rework deflate to add page to a list Liang Li
@ 2016-11-30  8:43 ` Liang Li
  2016-11-30  8:43 ` [PATCH kernel v5 3/5] virtio-balloon: speed up inflate/deflate process Liang Li
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 41+ messages in thread
From: Liang Li @ 2016-11-30  8:43 UTC (permalink / raw)
  To: kvm
  Cc: virtualization, linux-kernel, linux-mm, virtio-dev, qemu-devel,
	quintela, dgilbert, dave.hansen, mst, jasowang, kirill.shutemov,
	akpm, mhocko, pbonzini, Liang Li, Cornelia Huck, Amit Shah

Add a new feature bit which indicates support for sending the page
information with a bitmap. The current implementation uses a PFN
array, which is not very efficient; using a bitmap can improve the
performance of inflating/deflating significantly.

The page bitmap header is used to tell the host some information
about the page bitmap, e.g. the page size, the bitmap length and the
start PFN.
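For reference, the three bitfields of the bmap header added below pack
into a single 64-bit word. A userspace sketch of that packing (an
assumption: LSB-first bitfield layout, as GCC produces on
little-endian targets):

```c
#include <stdint.h>

/* Pack start_pfn (52 bits), page_shift (6 bits) and bmap_len (6 bits)
 * into one 64-bit word, least-significant field first. */
static uint64_t pack_bmap_head(uint64_t start_pfn, unsigned int page_shift,
			       unsigned int bmap_len)
{
	return (start_pfn & ((1ULL << 52) - 1)) |
	       ((uint64_t)(page_shift & 0x3f) << 52) |
	       ((uint64_t)(bmap_len & 0x3f) << 58);
}

static uint64_t head_start_pfn(uint64_t head)
{
	return head & ((1ULL << 52) - 1);
}

static unsigned int head_page_shift(uint64_t head)
{
	return (head >> 52) & 0x3f;
}

static unsigned int head_bmap_len(uint64_t head)
{
	return (head >> 58) & 0x3f;
}
```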

Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Amit Shah <amit.shah@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
---
 include/uapi/linux/virtio_balloon.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 343d7dd..1be4b1f 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST	0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_PAGE_BITMAP	3 /* Send page info with bitmap */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -82,4 +83,22 @@ struct virtio_balloon_stat {
 	__virtio64 val;
 } __attribute__((packed));
 
+/* Response header structure */
+struct virtio_balloon_resp_hdr {
+	__le64 cmd : 8; /* Distinguish different requests type */
+	__le64 flag: 8; /* Mark status for a specific request type */
+	__le64 id : 16; /* Distinguish requests of a specific type */
+	__le64 data_len: 32; /* Length of the following data, in bytes */
+};
+
+/* Page bitmap header structure */
+struct virtio_balloon_bmap_hdr {
+	struct {
+		__le64 start_pfn : 52; /* start pfn for the bitmap */
+		__le64 page_shift : 6; /* page shift width, in bytes */
+		__le64 bmap_len : 6;  /* bitmap length, in bytes */
+	} head;
+	__le64 bmap[0];
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.8.3.1


* [PATCH kernel v5 3/5] virtio-balloon: speed up inflate/deflate process
  2016-11-30  8:43 [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration Liang Li
  2016-11-30  8:43 ` [PATCH kernel v5 1/5] virtio-balloon: rework deflate to add page to a list Liang Li
  2016-11-30  8:43 ` [PATCH kernel v5 2/5] virtio-balloon: define new feature bit and head struct Liang Li
@ 2016-11-30  8:43 ` Liang Li
  2016-11-30  8:43 ` [PATCH kernel v5 4/5] virtio-balloon: define flags and head for host request vq Liang Li
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 41+ messages in thread
From: Liang Li @ 2016-11-30  8:43 UTC (permalink / raw)
  To: kvm
  Cc: virtualization, linux-kernel, linux-mm, virtio-dev, qemu-devel,
	quintela, dgilbert, dave.hansen, mst, jasowang, kirill.shutemov,
	akpm, mhocko, pbonzini, Liang Li, Cornelia Huck, Amit Shah

The current virtio-balloon implementation is not very efficient. The
time spent on the different stages of inflating the balloon to 7GB of
an 8GB idle guest breaks down as:

a. allocating pages (6.5%)
b. sending PFNs to host (68.3%)
c. address translation (6.1%)
d. madvise (19%)

It takes about 4126ms for the inflating process to complete.
Debugging shows that the bottlenecks are stages b and d.

If a bitmap is used to send the page info instead of PFNs, the
overhead in stage b can be reduced quite a lot. Furthermore, the
address translation and madvise() calls can be done on a bulk of RAM
pages instead of page by page, so the overhead of stages c and d can
also be reduced a lot.

This patch is the kernel side implementation which is intended to
speed up the inflating & deflating process by adding a new feature
to the virtio-balloon device. With this new feature, inflating the
balloon to 7GB of an 8GB idle guest only takes 590ms, a performance
improvement of about 85%.

TODO: optimize stage a by allocating/freeing a chunk of pages
instead of a single page at a time.

Signed-off-by: Liang Li <liang.z.li@intel.com>
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Amit Shah <amit.shah@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
---
 drivers/virtio/virtio_balloon.c | 395 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 367 insertions(+), 28 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f59cb4f..c3ddec3 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -42,6 +42,10 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+#define BALLOON_BMAP_SIZE	(8 * PAGE_SIZE)
+#define PFNS_PER_BMAP		(BALLOON_BMAP_SIZE * BITS_PER_BYTE)
+#define BALLOON_BMAP_COUNT	32
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
@@ -67,6 +71,18 @@ struct virtio_balloon {
 
 	/* Number of balloon pages we've told the Host we're not using. */
 	unsigned int num_pages;
+	/* Pointer to the response header. */
+	void *resp_hdr;
+	/* Pointer to the start address of response data. */
+	unsigned long *resp_data;
+	/* Pointer offset of the response data. */
+	unsigned long resp_pos;
+	/* Bitmap and bitmap count used to tell the host the pages */
+	unsigned long *page_bitmap[BALLOON_BMAP_COUNT];
+	/* Number of split page bitmaps */
+	unsigned int nr_page_bmap;
+	/* Used to record the processed pfn range */
+	unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
 	/*
 	 * The pages we've told the Host we're not using are enqueued
 	 * at vb_dev_info->pages list.
@@ -110,20 +126,228 @@ static void balloon_ack(struct virtqueue *vq)
 	wake_up(&vb->acked);
 }
 
-static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+static inline void init_bmap_pfn_range(struct virtio_balloon *vb)
 {
-	struct scatterlist sg;
+	vb->min_pfn = ULONG_MAX;
+	vb->max_pfn = 0;
+}
+
+static inline void update_bmap_pfn_range(struct virtio_balloon *vb,
+				 struct page *page)
+{
+	unsigned long balloon_pfn = page_to_balloon_pfn(page);
+
+	vb->min_pfn = min(balloon_pfn, vb->min_pfn);
+	vb->max_pfn = max(balloon_pfn, vb->max_pfn);
+}
+
+static void extend_page_bitmap(struct virtio_balloon *vb,
+				unsigned long nr_pfn)
+{
+	int i, bmap_count;
+	unsigned long bmap_len;
+
+	bmap_len = ALIGN(nr_pfn, BITS_PER_LONG) / BITS_PER_BYTE;
+	bmap_len = ALIGN(bmap_len, BALLOON_BMAP_SIZE);
+	bmap_count = min((int)(bmap_len / BALLOON_BMAP_SIZE),
+				 BALLOON_BMAP_COUNT);
+
+	for (i = 1; i < bmap_count; i++) {
+		vb->page_bitmap[i] = kmalloc(BALLOON_BMAP_SIZE, GFP_KERNEL);
+		if (vb->page_bitmap[i])
+			vb->nr_page_bmap++;
+		else
+			break;
+	}
+}
+
+static void free_extended_page_bitmap(struct virtio_balloon *vb)
+{
+	int i, bmap_count = vb->nr_page_bmap;
+
+
+	for (i = 1; i < bmap_count; i++) {
+		kfree(vb->page_bitmap[i]);
+		vb->page_bitmap[i] = NULL;
+		vb->nr_page_bmap--;
+	}
+}
+
+static void kfree_page_bitmap(struct virtio_balloon *vb)
+{
+	int i;
+
+	for (i = 0; i < vb->nr_page_bmap; i++)
+		kfree(vb->page_bitmap[i]);
+}
+
+static void clear_page_bitmap(struct virtio_balloon *vb)
+{
+	int i;
+
+	for (i = 0; i < vb->nr_page_bmap; i++)
+		memset(vb->page_bitmap[i], 0, BALLOON_BMAP_SIZE);
+}
+
+static unsigned long do_set_resp_bitmap(struct virtio_balloon *vb,
+	unsigned long *bitmap,	unsigned long base_pfn,
+	unsigned long pos, int nr_page)
+
+{
+	struct virtio_balloon_bmap_hdr *hdr;
+	unsigned long end, new_pos, new_end, nr_left, proccessed = 0;
+
+	new_pos = pos;
+	new_end = end = pos + nr_page;
+
+	if (pos % BITS_PER_LONG) {
+		unsigned long pos_s;
+
+		pos_s = rounddown(pos, BITS_PER_LONG);
+		hdr = (struct virtio_balloon_bmap_hdr *)(vb->resp_data
+							 + vb->resp_pos);
+		hdr->head.start_pfn = base_pfn + pos_s;
+		hdr->head.page_shift = PAGE_SHIFT;
+		hdr->head.bmap_len = sizeof(unsigned long);
+		hdr->bmap[0] = cpu_to_virtio64(vb->vdev,
+				 bitmap[pos_s / BITS_PER_LONG]);
+		vb->resp_pos += 2;
+		if (pos_s + BITS_PER_LONG >= end)
+			return roundup(end, BITS_PER_LONG) - pos;
+		new_pos = roundup(pos, BITS_PER_LONG);
+	}
+
+	if (end % BITS_PER_LONG) {
+		unsigned long pos_e;
+
+		pos_e = roundup(end, BITS_PER_LONG);
+		hdr = (struct virtio_balloon_bmap_hdr *)(vb->resp_data
+							 + vb->resp_pos);
+		hdr->head.start_pfn = base_pfn + pos_e - BITS_PER_LONG;
+		hdr->head.page_shift = PAGE_SHIFT;
+		hdr->head.bmap_len = sizeof(unsigned long);
+		hdr->bmap[0] = bitmap[pos_e / BITS_PER_LONG - 1];
+		vb->resp_pos += 2;
+		if (new_pos + BITS_PER_LONG >= pos_e)
+			return pos_e - pos;
+		new_end = rounddown(end, BITS_PER_LONG);
+	}
+
+	nr_left = nr_page = new_end - new_pos;
+
+	while (proccessed < nr_page) {
+		int bulk, order;
+
+		order = get_order(nr_left << PAGE_SHIFT);
+		if ((1 << order) > nr_left)
+			order--;
+		hdr = (struct virtio_balloon_bmap_hdr *)(vb->resp_data
+							 + vb->resp_pos);
+		hdr->head.start_pfn = base_pfn + new_pos + proccessed;
+		hdr->head.page_shift = order + PAGE_SHIFT;
+		hdr->head.bmap_len = 0;
+		bulk = 1 << order;
+		nr_left -= bulk;
+		proccessed += bulk;
+		vb->resp_pos++;
+	}
+
+	return roundup(end, BITS_PER_LONG) - pos;
+}
+
+static void send_resp_data(struct virtio_balloon *vb, struct virtqueue *vq,
+			bool busy_wait)
+{
+	struct scatterlist sg[2];
+	struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
 	unsigned int len;
 
-	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
+	len = hdr->data_len = vb->resp_pos * sizeof(unsigned long);
+	sg_init_table(sg, 2);
+	sg_set_buf(&sg[0], hdr, sizeof(struct virtio_balloon_resp_hdr));
+	sg_set_buf(&sg[1], vb->resp_data, len);
+
+	if (virtqueue_add_outbuf(vq, sg, 2, vb, GFP_KERNEL) == 0) {
+		virtqueue_kick(vq);
+		if (busy_wait)
+			while (!virtqueue_get_buf(vq, &len)
+				&& !virtqueue_is_broken(vq))
+				cpu_relax();
+		else
+			wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+		vb->resp_pos = 0;
+		free_extended_page_bitmap(vb);
+	}
+}
 
-	/* We should always be able to add one buffer to an empty queue. */
-	virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
-	virtqueue_kick(vq);
+static void set_bulk_pages(struct virtio_balloon *vb, struct virtqueue *vq,
+		unsigned long start_pfn, unsigned long *bitmap,
+		unsigned long len, bool busy_wait)
+{
+	unsigned long pos = 0, end = len * BITS_PER_BYTE;
+
+	while (pos < end) {
+		unsigned long one = find_next_bit(bitmap, end, pos);
+
+		if ((vb->resp_pos + 64) * sizeof(unsigned long) >
+			 BALLOON_BMAP_SIZE)
+			send_resp_data(vb, vq, busy_wait);
+		if (one < end) {
+			unsigned long pages, zero;
+
+			zero = find_next_zero_bit(bitmap, end, one + 1);
+			if (zero >= end)
+				pages = end - one;
+			else
+				pages = zero - one;
+			if (pages) {
+				pages = do_set_resp_bitmap(vb, bitmap,
+					 start_pfn, one, pages);
+			}
+			pos = one + pages;
+		} else
+			pos = one;
+	}
+}
 
-	/* When host has read buffer, this completes via balloon_ack */
-	wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
+{
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP)) {
+		int nr_pfn, nr_used_bmap, i;
+		unsigned long start_pfn, bmap_len;
+
+		start_pfn = vb->start_pfn;
+		nr_pfn = vb->end_pfn - start_pfn + 1;
+		nr_pfn = roundup(nr_pfn, BITS_PER_LONG);
+		nr_used_bmap = nr_pfn / PFNS_PER_BMAP;
+		if (nr_pfn % PFNS_PER_BMAP)
+			nr_used_bmap++;
+		bmap_len = nr_pfn / BITS_PER_BYTE;
+
+		for (i = 0; i < nr_used_bmap; i++) {
+			unsigned int bmap_size = BALLOON_BMAP_SIZE;
+
+			if (i + 1 == nr_used_bmap)
+				bmap_size = bmap_len - BALLOON_BMAP_SIZE * i;
+			set_bulk_pages(vb, vq, start_pfn + i * PFNS_PER_BMAP,
+				 vb->page_bitmap[i], bmap_size, false);
+		}
+		if (vb->resp_pos > 0)
+			send_resp_data(vb, vq, false);
+	} else {
+		struct scatterlist sg;
+		unsigned int len;
 
+		sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
+
+		/* We should always be able to add one buffer to an
+		 * empty queue
+		 */
+		virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
+		virtqueue_kick(vq);
+		/* When host has read buffer, this completes via balloon_ack */
+		wait_event(vb->acked, virtqueue_get_buf(vq, &len));
+	}
 }
 
 static void set_page_pfns(struct virtio_balloon *vb,
@@ -138,13 +362,59 @@ static void set_page_pfns(struct virtio_balloon *vb,
 					  page_to_balloon_pfn(page) + i);
 }
 
-static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
+static void set_page_bitmap(struct virtio_balloon *vb,
+			 struct list_head *pages, struct virtqueue *vq)
+{
+	unsigned long pfn, pfn_limit;
+	struct page *page;
+	bool found;
+	int bmap_idx;
+
+	vb->min_pfn = rounddown(vb->min_pfn, BITS_PER_LONG);
+	vb->max_pfn = roundup(vb->max_pfn, BITS_PER_LONG);
+	pfn_limit = PFNS_PER_BMAP * vb->nr_page_bmap;
+
+	if (vb->nr_page_bmap == 1)
+		extend_page_bitmap(vb, vb->max_pfn - vb->min_pfn + 1);
+	for (pfn = vb->min_pfn; pfn < vb->max_pfn; pfn += pfn_limit) {
+		unsigned long end_pfn;
+
+		clear_page_bitmap(vb);
+		vb->start_pfn = pfn;
+		end_pfn = pfn;
+		found = false;
+		list_for_each_entry(page, pages, lru) {
+			unsigned long pos, balloon_pfn;
+
+			balloon_pfn = page_to_balloon_pfn(page);
+			if (balloon_pfn < pfn || balloon_pfn >= pfn + pfn_limit)
+				continue;
+			bmap_idx = (balloon_pfn - pfn) / PFNS_PER_BMAP;
+			pos = (balloon_pfn - pfn) % PFNS_PER_BMAP;
+			set_bit(pos, vb->page_bitmap[bmap_idx]);
+			if (balloon_pfn > end_pfn)
+				end_pfn = balloon_pfn;
+			found = true;
+		}
+		if (found) {
+			vb->end_pfn = end_pfn;
+			tell_host(vb, vq);
+		}
+	}
+}
+
+static unsigned int fill_balloon(struct virtio_balloon *vb, size_t num)
 {
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
-	unsigned num_allocated_pages;
+	unsigned int num_allocated_pages;
+	bool use_bmap = virtio_has_feature(vb->vdev,
+				 VIRTIO_BALLOON_F_PAGE_BITMAP);
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (use_bmap)
+		init_bmap_pfn_range(vb);
+	else
+		/* We can only do one array worth at a time. */
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
@@ -159,7 +429,10 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 			msleep(200);
 			break;
 		}
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (use_bmap)
+			update_bmap_pfn_range(vb, page);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		if (!virtio_has_feature(vb->vdev,
 					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
@@ -168,8 +441,13 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
 
 	num_allocated_pages = vb->num_pfns;
 	/* Did we get any? */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->inflate_vq);
+	if (vb->num_pfns != 0) {
+		if (use_bmap)
+			set_page_bitmap(vb, &vb_dev_info->pages,
+					vb->inflate_vq);
+		else
+			tell_host(vb, vb->inflate_vq);
+	}
 	mutex_unlock(&vb->balloon_lock);
 
 	return num_allocated_pages;
@@ -189,15 +467,20 @@ static void release_pages_balloon(struct virtio_balloon *vb,
 	}
 }
 
-static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
+static unsigned int leak_balloon(struct virtio_balloon *vb, size_t num)
 {
-	unsigned num_freed_pages;
+	unsigned int num_freed_pages;
 	struct page *page;
 	struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 	LIST_HEAD(pages);
+	bool use_bmap = virtio_has_feature(vb->vdev,
+			 VIRTIO_BALLOON_F_PAGE_BITMAP);
 
-	/* We can only do one array worth at a time. */
-	num = min(num, ARRAY_SIZE(vb->pfns));
+	if (use_bmap)
+		init_bmap_pfn_range(vb);
+	else
+		/* We can only do one array worth at a time. */
+		num = min(num, ARRAY_SIZE(vb->pfns));
 
 	mutex_lock(&vb->balloon_lock);
 	/* We can't release more pages than taken */
@@ -207,7 +490,10 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 		page = balloon_page_dequeue(vb_dev_info);
 		if (!page)
 			break;
-		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
+		if (use_bmap)
+			update_bmap_pfn_range(vb, page);
+		else
+			set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
 		list_add(&page->lru, &pages);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
@@ -218,8 +504,12 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	if (vb->num_pfns != 0)
-		tell_host(vb, vb->deflate_vq);
+	if (vb->num_pfns != 0) {
+		if (use_bmap)
+			set_page_bitmap(vb, &pages, vb->deflate_vq);
+		else
+			tell_host(vb, vb->deflate_vq);
+	}
 	release_pages_balloon(vb, &pages);
 	mutex_unlock(&vb->balloon_lock);
 	return num_freed_pages;
@@ -431,6 +721,20 @@ static int init_vqs(struct virtio_balloon *vb)
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
+static void tell_host_one_page(struct virtio_balloon *vb,
+	struct virtqueue *vq, struct page *page)
+{
+	struct virtio_balloon_bmap_hdr *bmap_hdr;
+
+	bmap_hdr = (struct virtio_balloon_bmap_hdr *)(vb->resp_data
+							 + vb->resp_pos);
+	bmap_hdr->head.start_pfn = page_to_pfn(page);
+	bmap_hdr->head.page_shift = PAGE_SHIFT;
+	bmap_hdr->head.bmap_len = 0;
+	vb->resp_pos++;
+	send_resp_data(vb, vq, false);
+}
+
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
  *			     a compation thread.     (called under page lock)
@@ -455,6 +759,8 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	struct virtio_balloon *vb = container_of(vb_dev_info,
 			struct virtio_balloon, vb_dev_info);
 	unsigned long flags;
+	bool use_bmap = virtio_has_feature(vb->vdev,
+				 VIRTIO_BALLOON_F_PAGE_BITMAP);
 
 	/*
 	 * In order to avoid lock contention while migrating pages concurrently
@@ -475,15 +781,23 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 	vb_dev_info->isolated_pages--;
 	__count_vm_event(BALLOON_MIGRATE);
 	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, newpage);
-	tell_host(vb, vb->inflate_vq);
+	if (use_bmap)
+		tell_host_one_page(vb, vb->inflate_vq, newpage);
+	else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, newpage);
+		tell_host(vb, vb->inflate_vq);
+	}
 
 	/* balloon's page migration 2nd step -- deflate "page" */
 	balloon_page_delete(page);
-	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
-	set_page_pfns(vb, vb->pfns, page);
-	tell_host(vb, vb->deflate_vq);
+	if (use_bmap)
+		tell_host_one_page(vb, vb->deflate_vq, page);
+	else {
+		vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+		set_page_pfns(vb, vb->pfns, page);
+		tell_host(vb, vb->deflate_vq);
+	}
 
 	mutex_unlock(&vb->balloon_lock);
 
@@ -533,6 +847,28 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	spin_lock_init(&vb->stop_update_lock);
 	vb->stop_update = false;
 	vb->num_pages = 0;
+	vb->resp_hdr = kzalloc(sizeof(struct virtio_balloon_resp_hdr),
+				 GFP_KERNEL);
+	/* Clear the feature bit if memory allocation fails */
+	if (!vb->resp_hdr)
+		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_BITMAP);
+	else {
+		vb->page_bitmap[0] = kmalloc(BALLOON_BMAP_SIZE, GFP_KERNEL);
+		if (!vb->page_bitmap[0]) {
+			__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_BITMAP);
+			kfree(vb->resp_hdr);
+		} else {
+			vb->nr_page_bmap = 1;
+			vb->resp_data = kmalloc(BALLOON_BMAP_SIZE, GFP_KERNEL);
+			if (!vb->resp_data) {
+				__virtio_clear_bit(vdev,
+						VIRTIO_BALLOON_F_PAGE_BITMAP);
+				kfree(vb->page_bitmap[0]);
+				kfree(vb->resp_hdr);
+			}
+		}
+	}
+	vb->resp_pos = 0;
 	mutex_init(&vb->balloon_lock);
 	init_waitqueue_head(&vb->acked);
 	vb->vdev = vdev;
@@ -611,6 +947,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	remove_common(vb);
 	if (vb->vb_dev_info.inode)
 		iput(vb->vb_dev_info.inode);
+	kfree_page_bitmap(vb);
+	kfree(vb->resp_hdr);
 	kfree(vb);
 }
 
@@ -649,6 +987,7 @@ static int virtballoon_restore(struct virtio_device *vdev)
 	VIRTIO_BALLOON_F_MUST_TELL_HOST,
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+	VIRTIO_BALLOON_F_PAGE_BITMAP,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
-- 
1.8.3.1


* [PATCH kernel v5 4/5] virtio-balloon: define flags and head for host request vq
  2016-11-30  8:43 [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration Liang Li
                   ` (2 preceding siblings ...)
  2016-11-30  8:43 ` [PATCH kernel v5 3/5] virtio-balloon: speed up inflate/deflate process Liang Li
@ 2016-11-30  8:43 ` Liang Li
  2016-11-30  8:43 ` [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info Liang Li
  2016-12-06  8:40 ` [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration David Hildenbrand
  5 siblings, 0 replies; 41+ messages in thread
From: Liang Li @ 2016-11-30  8:43 UTC (permalink / raw)
  To: kvm
  Cc: virtualization, linux-kernel, linux-mm, virtio-dev, qemu-devel,
	quintela, dgilbert, dave.hansen, mst, jasowang, kirill.shutemov,
	akpm, mhocko, pbonzini, Liang Li, Mel Gorman, Cornelia Huck,
	Amit Shah

Define the flags and head struct for a new host request virtual
queue. The guest can receive requests from the host and respond to
them on this new virtual queue. The host can use it to request that
the guest perform some operations, e.g. drop the page cache, sync
the file system, etc. The hypervisor can also obtain some of the
guest's runtime information through this virtual queue, e.g. the
guest's unused page information, which can be used for live
migration optimization.
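On the guest side, handling a host request reduces to reading a header
from the vq and dispatching on `cmd`. A userspace sketch mirroring the
`virtio_balloon_req_hdr` layout added below (the dispatch function and
its behavior are hypothetical, not code from this series):

```c
#include <stdint.h>

struct req_hdr {		/* mirrors virtio_balloon_req_hdr */
	uint16_t cmd;		/* request type */
	uint16_t reserved[3];
	uint64_t param;		/* request parameter */
};

enum { BALLOON_GET_UNUSED_PAGES };	/* first request id */

/* Dispatch one host request; returns 0 if handled, -1 if unknown. */
static int handle_host_request(const struct req_hdr *hdr)
{
	switch (hdr->cmd) {
	case BALLOON_GET_UNUSED_PAGES:
		/* would walk the free lists and reply with bitmaps,
		 * marking partial replies BALLOON_FLAG_CONT and the
		 * final one BALLOON_FLAG_DONE */
		return 0;
	default:
		return -1;
	}
}
```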

Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Amit Shah <amit.shah@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
---
 include/uapi/linux/virtio_balloon.h | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 1be4b1f..5ac3a40 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ	1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_PAGE_BITMAP	3 /* Send page info with bitmap */
+#define VIRTIO_BALLOON_F_HOST_REQ_VQ	4 /* Host request virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -101,4 +102,25 @@ struct virtio_balloon_bmap_hdr {
 	__le64 bmap[0];
 };
 
+enum virtio_balloon_req_id {
+	/* Get unused page information */
+	BALLOON_GET_UNUSED_PAGES,
+};
+
+enum virtio_balloon_flag {
+	/* Have more data for a request */
+	BALLOON_FLAG_CONT,
+	/* No more data for a request */
+	BALLOON_FLAG_DONE,
+};
+
+struct virtio_balloon_req_hdr {
+	/* Used to distinguish different requests */
+	__le16 cmd;
+	/* Reserved */
+	__le16 reserved[3];
+	/* Request parameter */
+	__le64 param;
+};
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info
  2016-11-30  8:43 [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration Liang Li
                   ` (3 preceding siblings ...)
  2016-11-30  8:43 ` [PATCH kernel v5 4/5] virtio-balloon: define flags and head for host request vq Liang Li
@ 2016-11-30  8:43 ` Liang Li
  2016-11-30 19:15   ` Dave Hansen
  2016-12-06  8:40 ` [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration David Hildenbrand
  5 siblings, 1 reply; 41+ messages in thread
From: Liang Li @ 2016-11-30  8:43 UTC (permalink / raw)
  To: kvm
  Cc: virtualization, linux-kernel, linux-mm, virtio-dev, qemu-devel,
	quintela, dgilbert, dave.hansen, mst, jasowang, kirill.shutemov,
	akpm, mhocko, pbonzini, Liang Li, Mel Gorman, Cornelia Huck,
	Amit Shah

This patch contains two parts:

One is to add a new API to mm to get the unused page information.
The virtio balloon driver will use this new API to get the unused
page info and send it to the hypervisor (QEMU) to speed up live
migration. While the bitmap is being sent, some of the pages may be
modified and put into use by the guest; this inaccuracy can be
corrected by the dirty page logging mechanism.

The other is to add support for the host's request for the VM's
unused page information. QEMU can make use of the unused page
information and the dirty page logging mechanism to skip the
transmission of some of these unused pages, which is very helpful to
reduce the network traffic and speed up the live migration process.

Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Amit Shah <amit.shah@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
---
 drivers/virtio/virtio_balloon.c | 126 +++++++++++++++++++++++++++++++++++++---
 include/linux/mm.h              |   3 +-
 mm/page_alloc.c                 |  72 +++++++++++++++++++++++
 3 files changed, 193 insertions(+), 8 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index c3ddec3..2626cc0 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -56,7 +56,7 @@
 
 struct virtio_balloon {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *req_vq;
 
 	/* The balloon servicing is delegated to a freezable workqueue. */
 	struct work_struct update_balloon_stats_work;
@@ -75,6 +75,8 @@ struct virtio_balloon {
 	void *resp_hdr;
 	/* Pointer to the start address of response data. */
 	unsigned long *resp_data;
+	/* Size of response data buffer. */
+	unsigned long resp_buf_size;
 	/* Pointer offset of the response data. */
 	unsigned long resp_pos;
 	/* Bitmap and bitmap count used to tell the host the pages */
@@ -83,6 +85,8 @@ struct virtio_balloon {
 	unsigned int nr_page_bmap;
 	/* Used to record the processed pfn range */
 	unsigned long min_pfn, max_pfn, start_pfn, end_pfn;
+	/* Request header */
+	struct virtio_balloon_req_hdr req_hdr;
 	/*
 	 * The pages we've told the Host we're not using are enqueued
 	 * at vb_dev_info->pages list.
@@ -551,6 +555,58 @@ static void update_balloon_stats(struct virtio_balloon *vb)
 				pages_to_bytes(available));
 }
 
+static void send_unused_pages_info(struct virtio_balloon *vb,
+				unsigned long req_id)
+{
+	struct scatterlist sg_in;
+	unsigned long pos = 0;
+	struct virtqueue *vq = vb->req_vq;
+	struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
+	int ret, order;
+
+	mutex_lock(&vb->balloon_lock);
+
+	for (order = MAX_ORDER - 1; order >= 0; order--) {
+		pos = 0;
+		ret = get_unused_pages(vb->resp_data,
+			 vb->resp_buf_size / sizeof(unsigned long),
+			 order, &pos);
+		if (ret == -ENOSPC) {
+			void *new_resp_data;
+
+			new_resp_data = kmalloc(2 * vb->resp_buf_size,
+						GFP_KERNEL);
+			if (new_resp_data) {
+				kfree(vb->resp_data);
+				vb->resp_data = new_resp_data;
+				vb->resp_buf_size *= 2;
+				order++;
+				continue;
+			} else
+				dev_warn(&vb->vdev->dev,
+					 "%s: omit some %d order pages\n",
+					 __func__, order);
+		}
+
+		if (pos > 0) {
+			vb->resp_pos = pos;
+			hdr->cmd = BALLOON_GET_UNUSED_PAGES;
+			hdr->id = req_id;
+			if (order > 0)
+				hdr->flag = BALLOON_FLAG_CONT;
+			else
+				hdr->flag = BALLOON_FLAG_DONE;
+
+			send_resp_data(vb, vq, true);
+		}
+	}
+
+	mutex_unlock(&vb->balloon_lock);
+	sg_init_one(&sg_in, &vb->req_hdr, sizeof(vb->req_hdr));
+	virtqueue_add_inbuf(vq, &sg_in, 1, &vb->req_hdr, GFP_KERNEL);
+	virtqueue_kick(vq);
+}
+
 /*
  * While most virtqueues communicate guest-initiated requests to the hypervisor,
  * the stats queue operates in reverse.  The driver initializes the virtqueue
@@ -685,18 +741,56 @@ static void update_balloon_size_func(struct work_struct *work)
 		queue_work(system_freezable_wq, work);
 }
 
+static void misc_handle_rq(struct virtio_balloon *vb)
+{
+	struct virtio_balloon_req_hdr *ptr_hdr;
+	unsigned int len;
+
+	ptr_hdr = virtqueue_get_buf(vb->req_vq, &len);
+	if (!ptr_hdr || len != sizeof(vb->req_hdr))
+		return;
+
+	switch (ptr_hdr->cmd) {
+	case BALLOON_GET_UNUSED_PAGES:
+		send_unused_pages_info(vb, ptr_hdr->param);
+		break;
+	default:
+		break;
+	}
+}
+
+static void misc_request(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+
+	misc_handle_rq(vb);
+}
+
 static int init_vqs(struct virtio_balloon *vb)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-	static const char * const names[] = { "inflate", "deflate", "stats" };
+	struct virtqueue *vqs[4];
+	vq_callback_t *callbacks[] = { balloon_ack, balloon_ack,
+					 stats_request, misc_request };
+	static const char * const names[] = { "inflate", "deflate", "stats",
+						 "misc" };
 	int err, nvqs;
 
 	/*
 	 * We expect two virtqueues: inflate and deflate, and
 	 * optionally stat.
 	 */
-	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HOST_REQ_VQ))
+		nvqs = 4;
+	else if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ))
+		nvqs = 3;
+	else
+		nvqs = 2;
+
+	if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_PAGE_BITMAP);
+		__virtio_clear_bit(vb->vdev, VIRTIO_BALLOON_F_HOST_REQ_VQ);
+	}
+
 	err = vb->vdev->config->find_vqs(vb->vdev, nvqs, vqs, callbacks, names);
 	if (err)
 		return err;
@@ -717,6 +811,18 @@ static int init_vqs(struct virtio_balloon *vb)
 			BUG();
 		virtqueue_kick(vb->stats_vq);
 	}
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HOST_REQ_VQ)) {
+		struct scatterlist sg_in;
+
+		vb->req_vq = vqs[3];
+		sg_init_one(&sg_in, &vb->req_hdr, sizeof(vb->req_hdr));
+		if (virtqueue_add_inbuf(vb->req_vq, &sg_in, 1,
+		    &vb->req_hdr, GFP_KERNEL) < 0)
+			__virtio_clear_bit(vb->vdev,
+					VIRTIO_BALLOON_F_HOST_REQ_VQ);
+		else
+			virtqueue_kick(vb->req_vq);
+	}
 	return 0;
 }
 
@@ -850,11 +956,13 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	vb->resp_hdr = kzalloc(sizeof(struct virtio_balloon_resp_hdr),
 				 GFP_KERNEL);
 	/* Clear the feature bit if memory allocation fails */
-	if (!vb->resp_hdr)
+	if (!vb->resp_hdr) {
 		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_BITMAP);
-	else {
+		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_HOST_REQ_VQ);
+	} else {
 		vb->page_bitmap[0] = kmalloc(BALLOON_BMAP_SIZE, GFP_KERNEL);
 		if (!vb->page_bitmap[0]) {
+			__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_HOST_REQ_VQ);
 			__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_BITMAP);
 			kfree(vb->resp_hdr);
 		} else {
@@ -863,12 +971,15 @@ static int virtballoon_probe(struct virtio_device *vdev)
 			if (!vb->resp_data) {
 				__virtio_clear_bit(vdev,
 						VIRTIO_BALLOON_F_PAGE_BITMAP);
+				__virtio_clear_bit(vdev,
+						VIRTIO_BALLOON_F_HOST_REQ_VQ);
 				kfree(vb->page_bitmap[0]);
 				kfree(vb->resp_hdr);
 			}
 		}
 	}
 	vb->resp_pos = 0;
+	vb->resp_buf_size = BALLOON_BMAP_SIZE;
 	mutex_init(&vb->balloon_lock);
 	init_waitqueue_head(&vb->acked);
 	vb->vdev = vdev;
@@ -988,6 +1099,7 @@ static int virtballoon_restore(struct virtio_device *vdev)
 	VIRTIO_BALLOON_F_STATS_VQ,
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_PAGE_BITMAP,
+	VIRTIO_BALLOON_F_HOST_REQ_VQ,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a92c8d7..e05ca86 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1772,7 +1772,8 @@ static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
 extern void free_area_init_node(int nid, unsigned long * zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
-
+extern int get_unused_pages(unsigned long *unused_pages, unsigned long size,
+				int order, unsigned long *pos);
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
  * into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6de9440..d5a5952 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4428,6 +4428,78 @@ void show_free_areas(unsigned int filter)
 	show_swap_cache_info();
 }
 
+struct page_info_item {
+	__le64 start_pfn : 52; /* start pfn for the bitmap */
+	__le64 page_shift : 6; /* page shift width, in bytes */
+	__le64 bmap_len : 6;  /* bitmap length, in bytes */
+};
+
+static int  mark_unused_pages(struct zone *zone,
+		unsigned long *unused_pages, unsigned long size,
+		int order, unsigned long *pos)
+{
+	unsigned long pfn, flags;
+	unsigned int t;
+	struct list_head *curr;
+	struct page_info_item *info;
+
+	if (zone_is_empty(zone))
+		return 0;
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	if (*pos + zone->free_area[order].nr_free > size)
+		return -ENOSPC;
+	for (t = 0; t < MIGRATE_TYPES; t++) {
+		list_for_each(curr, &zone->free_area[order].free_list[t]) {
+			pfn = page_to_pfn(list_entry(curr, struct page, lru));
+			info = (struct page_info_item *)(unused_pages + *pos);
+			info->start_pfn = pfn;
+			info->page_shift = order + PAGE_SHIFT;
+			*pos += 1;
+		}
+	}
+
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	return 0;
+}
+
+/*
+ * During live migration, a page is always discardable unless its
+ * content is needed by the system.
+ * get_unused_pages provides an API to get the unused pages, these
+ * unused pages can be discarded if there is no modification since
+ * the request. Some other mechanism, like the dirty page logging
+ * can be used to track the modification.
+ *
+ * This function scans the free page list to get the unused pages
+ * with the specified order, and set the corresponding element in
+ * the bitmap if an unused page is found.
+ *
+ * return -EINVAL if parameters are invalid
+ * return -ENOSPC when bitmap can't contain the pages
+ * return 0 on success
+ */
+int get_unused_pages(unsigned long *unused_pages, unsigned long size,
+		int order, unsigned long *pos)
+{
+	struct zone *zone;
+	int ret = 0;
+
+	if (unused_pages == NULL || pos == NULL || *pos < 0)
+		return -EINVAL;
+
+	for_each_populated_zone(zone) {
+		ret = mark_unused_pages(zone, unused_pages, size, order, pos);
+		if (ret < 0)
+			break;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(get_unused_pages);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info
  2016-11-30  8:43 ` [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info Liang Li
@ 2016-11-30 19:15   ` Dave Hansen
  2016-12-04 13:13     ` Li, Liang Z
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2016-11-30 19:15 UTC (permalink / raw)
  To: Liang Li, kvm
  Cc: virtualization, linux-kernel, linux-mm, virtio-dev, qemu-devel,
	quintela, dgilbert, mst, jasowang, kirill.shutemov, akpm, mhocko,
	pbonzini, Mel Gorman, Cornelia Huck, Amit Shah

On 11/30/2016 12:43 AM, Liang Li wrote:
> +static void send_unused_pages_info(struct virtio_balloon *vb,
> +				unsigned long req_id)
> +{
> +	struct scatterlist sg_in;
> +	unsigned long pos = 0;
> +	struct virtqueue *vq = vb->req_vq;
> +	struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
> +	int ret, order;
> +
> +	mutex_lock(&vb->balloon_lock);
> +
> +	for (order = MAX_ORDER - 1; order >= 0; order--) {

I scratched my head for a bit on this one.  Why are you walking over
orders, *then* zones.  I *think* you're doing it because you can
efficiently fill the bitmaps at a given order for all zones, then move
to a new bitmap.  But, it would be interesting to document this.

> +		pos = 0;
> +		ret = get_unused_pages(vb->resp_data,
> +			 vb->resp_buf_size / sizeof(unsigned long),
> +			 order, &pos);

FWIW, get_unused_pages() is a pretty bad name.  "get" usually implies
bumping reference counts or consuming something.  You're just
"recording" or "marking" them.

> +		if (ret == -ENOSPC) {
> +			void *new_resp_data;
> +
> +			new_resp_data = kmalloc(2 * vb->resp_buf_size,
> +						GFP_KERNEL);
> +			if (new_resp_data) {
> +				kfree(vb->resp_data);
> +				vb->resp_data = new_resp_data;
> +				vb->resp_buf_size *= 2;

What happens to the data in ->resp_data at this point?  Doesn't this
just throw it away?

...
> +struct page_info_item {
> +	__le64 start_pfn : 52; /* start pfn for the bitmap */
> +	__le64 page_shift : 6; /* page shift width, in bytes */
> +	__le64 bmap_len : 6;  /* bitmap length, in bytes */
> +};

Is 'bmap_len' too short?  a 64-byte buffer is a bit tiny.  Right?

> +static int  mark_unused_pages(struct zone *zone,
> +		unsigned long *unused_pages, unsigned long size,
> +		int order, unsigned long *pos)
> +{
> +	unsigned long pfn, flags;
> +	unsigned int t;
> +	struct list_head *curr;
> +	struct page_info_item *info;
> +
> +	if (zone_is_empty(zone))
> +		return 0;
> +
> +	spin_lock_irqsave(&zone->lock, flags);
> +
> +	if (*pos + zone->free_area[order].nr_free > size)
> +		return -ENOSPC;

Urg, so this won't partially fill?  So, what's the nr_free pages limit
where we no longer fit in the kmalloc()'d buffer, where this simply
won't work?

> +	for (t = 0; t < MIGRATE_TYPES; t++) {
> +		list_for_each(curr, &zone->free_area[order].free_list[t]) {
> +			pfn = page_to_pfn(list_entry(curr, struct page, lru));
> +			info = (struct page_info_item *)(unused_pages + *pos);
> +			info->start_pfn = pfn;
> +			info->page_shift = order + PAGE_SHIFT;
> +			*pos += 1;
> +		}
> +	}

Do we need to fill in ->bmap_len here?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info
  2016-11-30 19:15   ` Dave Hansen
@ 2016-12-04 13:13     ` Li, Liang Z
  2016-12-05 17:22       ` Dave Hansen
  0 siblings, 1 reply; 41+ messages in thread
From: Li, Liang Z @ 2016-12-04 13:13 UTC (permalink / raw)
  To: Hansen, Dave, kvm
  Cc: virtualization, linux-kernel, linux-mm, virtio-dev, qemu-devel,
	quintela, dgilbert, mst, jasowang, kirill.shutemov, akpm, mhocko,
	pbonzini, Mel Gorman, Cornelia Huck, Amit Shah

> On 11/30/2016 12:43 AM, Liang Li wrote:
> > +static void send_unused_pages_info(struct virtio_balloon *vb,
> > +				unsigned long req_id)
> > +{
> > +	struct scatterlist sg_in;
> > +	unsigned long pos = 0;
> > +	struct virtqueue *vq = vb->req_vq;
> > +	struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
> > +	int ret, order;
> > +
> > +	mutex_lock(&vb->balloon_lock);
> > +
> > +	for (order = MAX_ORDER - 1; order >= 0; order--) {
> 
> I scratched my head for a bit on this one.  Why are you walking over orders,
> *then* zones.  I *think* you're doing it because you can efficiently fill the
> bitmaps at a given order for all zones, then move to a new bitmap.  But, it
> would be interesting to document this.
> 

Yes, using the order is somewhat strange, but it helps to keep the API simple.
Do you think it's acceptable?

> > +		pos = 0;
> > +		ret = get_unused_pages(vb->resp_data,
> > +			 vb->resp_buf_size / sizeof(unsigned long),
> > +			 order, &pos);
> 
> FWIW, get_unused_pages() is a pretty bad name.  "get" usually implies
> bumping reference counts or consuming something.  You're just "recording"
> or "marking" them.
> 

Will change to mark_unused_pages().

> > +		if (ret == -ENOSPC) {
> > +			void *new_resp_data;
> > +
> > +			new_resp_data = kmalloc(2 * vb->resp_buf_size,
> > +						GFP_KERNEL);
> > +			if (new_resp_data) {
> > +				kfree(vb->resp_data);
> > +				vb->resp_data = new_resp_data;
> > +				vb->resp_buf_size *= 2;
> 
> What happens to the data in ->resp_data at this point?  Doesn't this just
> throw it away?
> 

Yes, so we should make sure the data in resp_data is not in use.

> ...
> > +struct page_info_item {
> > +	__le64 start_pfn : 52; /* start pfn for the bitmap */
> > +	__le64 page_shift : 6; /* page shift width, in bytes */
> > +	__le64 bmap_len : 6;  /* bitmap length, in bytes */ };
> 
> Is 'bmap_len' too short?  a 64-byte buffer is a bit tiny.  Right?
> 

Currently, we just use 8-byte and 0-byte bitmaps; should we support more than 64 bytes?

> > +static int  mark_unused_pages(struct zone *zone,
> > +		unsigned long *unused_pages, unsigned long size,
> > +		int order, unsigned long *pos)
> > +{
> > +	unsigned long pfn, flags;
> > +	unsigned int t;
> > +	struct list_head *curr;
> > +	struct page_info_item *info;
> > +
> > +	if (zone_is_empty(zone))
> > +		return 0;
> > +
> > +	spin_lock_irqsave(&zone->lock, flags);
> > +
> > +	if (*pos + zone->free_area[order].nr_free > size)
> > +		return -ENOSPC;
> 
> Urg, so this won't partially fill?  So, what's the nr_free pages limit where we no
> longer fit in the kmalloc()'d buffer, where this simply won't work?
> 

Yes.  My initial implementation filled partially; that's better for the worst case.
I thought the above code is more efficient for the common case ...
Do you think partially filling the bitmap is better?
 
> > +	for (t = 0; t < MIGRATE_TYPES; t++) {
> > +		list_for_each(curr, &zone->free_area[order].free_list[t]) {
> > +			pfn = page_to_pfn(list_entry(curr, struct page, lru));
> > +			info = (struct page_info_item *)(unused_pages +
> *pos);
> > +			info->start_pfn = pfn;
> > +			info->page_shift = order + PAGE_SHIFT;
> > +			*pos += 1;
> > +		}
> > +	}
> 
> Do we need to fill in ->bmap_len here?

For completeness, the bmap_len should be filled in; will add.
This step was omitted just because QEMU assumes ->bmap_len is 0 and ignores this field.

Thanks for your comment!

Liang

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info
  2016-12-04 13:13     ` Li, Liang Z
@ 2016-12-05 17:22       ` Dave Hansen
  2016-12-06  4:47         ` Li, Liang Z
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2016-12-05 17:22 UTC (permalink / raw)
  To: Li, Liang Z, kvm
  Cc: virtualization, linux-kernel, linux-mm, virtio-dev, qemu-devel,
	quintela, dgilbert, mst, jasowang, kirill.shutemov, akpm, mhocko,
	pbonzini, Mel Gorman, Cornelia Huck, Amit Shah

On 12/04/2016 05:13 AM, Li, Liang Z wrote:
>> On 11/30/2016 12:43 AM, Liang Li wrote:
>>> +static void send_unused_pages_info(struct virtio_balloon *vb,
>>> +				unsigned long req_id)
>>> +{
>>> +	struct scatterlist sg_in;
>>> +	unsigned long pos = 0;
>>> +	struct virtqueue *vq = vb->req_vq;
>>> +	struct virtio_balloon_resp_hdr *hdr = vb->resp_hdr;
>>> +	int ret, order;
>>> +
>>> +	mutex_lock(&vb->balloon_lock);
>>> +
>>> +	for (order = MAX_ORDER - 1; order >= 0; order--) {
>>
>> I scratched my head for a bit on this one.  Why are you walking over orders,
>> *then* zones.  I *think* you're doing it because you can efficiently fill the
>> bitmaps at a given order for all zones, then move to a new bitmap.  But, it
>> would be interesting to document this.
> 
> Yes, using the order is somewhat strange, but it helps to keep the API simple.
> Do you think it's acceptable?

Yeah, it's fine.  Just comment it, please.

>>> +		if (ret == -ENOSPC) {
>>> +			void *new_resp_data;
>>> +
>>> +			new_resp_data = kmalloc(2 * vb->resp_buf_size,
>>> +						GFP_KERNEL);
>>> +			if (new_resp_data) {
>>> +				kfree(vb->resp_data);
>>> +				vb->resp_data = new_resp_data;
>>> +				vb->resp_buf_size *= 2;
>>
>> What happens to the data in ->resp_data at this point?  Doesn't this just
>> throw it away?
> 
> Yes, so we should make sure the data in resp_data is not in use.

But doesn't it have valid data that we just collected and haven't told
the hypervisor about yet?  Aren't we throwing away good data that cost
us something to collect?

>> ...
>>> +struct page_info_item {
>>> +	__le64 start_pfn : 52; /* start pfn for the bitmap */
>>> +	__le64 page_shift : 6; /* page shift width, in bytes */

What does a page_shift "in bytes" mean? :)

>>> +	__le64 bmap_len : 6;  /* bitmap length, in bytes */ };
>>
>> Is 'bmap_len' too short?  a 64-byte buffer is a bit tiny.  Right?
> 
> Currently, we just use 8-byte and 0-byte bitmaps; should we support more than 64 bytes?

It just means that with this format, you end up wasting at least ~1/8th
of the space with metadata.  That's a bit unfortunate, but I guess it's
not fatal.

I'd definitely call it out in the patch description and make sure other
folks take a look at it.

There's a somewhat easy fix, but that would make the qemu implementation
more complicated: You could just have bmap_len==0x3f imply that there's
another field that contains an extended bitmap length for when you need
long bitmaps.

But, as you note, there's no need for it, so it's a matter of trading
the extra complexity against the desire to not have to change the ABI
again for longer (hopefully).

>>> +static int  mark_unused_pages(struct zone *zone,
>>> +		unsigned long *unused_pages, unsigned long size,
>>> +		int order, unsigned long *pos)
>>> +{
>>> +	unsigned long pfn, flags;
>>> +	unsigned int t;
>>> +	struct list_head *curr;
>>> +	struct page_info_item *info;
>>> +
>>> +	if (zone_is_empty(zone))
>>> +		return 0;
>>> +
>>> +	spin_lock_irqsave(&zone->lock, flags);
>>> +
>>> +	if (*pos + zone->free_area[order].nr_free > size)
>>> +		return -ENOSPC;
>>
>> Urg, so this won't partially fill?  So, what's the nr_free pages limit where we no
>> longer fit in the kmalloc()'d buffer, where this simply won't work?
> 
> Yes.  My initial implementation filled partially; that's better for the worst case.
> I thought the above code is more efficient for the common case ...
> Do you think partially filling the bitmap is better?

Could you please answer the question I asked?

Because if you don't get this right, it could mean that there are system
that simply *fail* here.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info
  2016-12-05 17:22       ` Dave Hansen
@ 2016-12-06  4:47         ` Li, Liang Z
  0 siblings, 0 replies; 41+ messages in thread
From: Li, Liang Z @ 2016-12-06  4:47 UTC (permalink / raw)
  To: Hansen, Dave, kvm
  Cc: virtualization, linux-kernel, linux-mm, virtio-dev, qemu-devel,
	quintela, dgilbert, mst, jasowang, kirill.shutemov, akpm, mhocko,
	pbonzini, Mel Gorman, Cornelia Huck, Amit Shah

> >>> +	mutex_lock(&vb->balloon_lock);
> >>> +
> >>> +	for (order = MAX_ORDER - 1; order >= 0; order--) {
> >>
> >> I scratched my head for a bit on this one.  Why are you walking over
> >> orders,
> >> *then* zones.  I *think* you're doing it because you can efficiently
> >> fill the bitmaps at a given order for all zones, then move to a new
> >> bitmap.  But, it would be interesting to document this.
> >
> > Yes, using the order is somewhat strange, but it helps to keep the API simple.
> > Do you think it's acceptable?
> 
> Yeah, it's fine.  Just comment it, please.
> 
Good!

> >>> +		if (ret == -ENOSPC) {
> >>> +			void *new_resp_data;
> >>> +
> >>> +			new_resp_data = kmalloc(2 * vb->resp_buf_size,
> >>> +						GFP_KERNEL);
> >>> +			if (new_resp_data) {
> >>> +				kfree(vb->resp_data);
> >>> +				vb->resp_data = new_resp_data;
> >>> +				vb->resp_buf_size *= 2;
> >>
> >> What happens to the data in ->resp_data at this point?  Doesn't this
> >> just throw it away?
> >
> > Yes, so we should make sure the data in resp_data is not in use.
> 
> But doesn't it have valid data that we just collected and haven't told the
> hypervisor about yet?  Aren't we throwing away good data that cost us
> something to collect?

Indeed.  Some filled data may exist for the previous zone. Should we
change the API to
'int get_unused_pages(unsigned long *unused_pages, unsigned long size,
		int order, unsigned long *pos, struct zone *zone)' ?

Then we could use the 'zone' to record the zone to retry from and avoid
discarding the filled data.

> >> ...
> >>> +struct page_info_item {
> >>> +	__le64 start_pfn : 52; /* start pfn for the bitmap */
> >>> +	__le64 page_shift : 6; /* page shift width, in bytes */
> 
> What does a page_shift "in bytes" mean? :)

Obviously, you know. :o
I will try to make it clear.

> 
> >>> +	__le64 bmap_len : 6;  /* bitmap length, in bytes */ };
> >>
> >> Is 'bmap_len' too short?  a 64-byte buffer is a bit tiny.  Right?
> >
> > Currently, we just use 8-byte and 0-byte bitmaps; should we support
> more than 64 bytes?
> 
> It just means that with this format, you end up wasting at least ~1/8th of the
> space with metadata.  That's a bit unfortunate, but I guess it's not fatal.
> 
> I'd definitely call it out in the patch description and make sure other folks take
> a look at it.

OK.

> 
> There's a somewhat easy fix, but that would make the qemu implementation
> more complicated: You could just have bmap_len==0x3f imply that there's
> another field that contains an extended bitmap length for when you need long
> bitmaps.
> 
> But, as you note, there's no need for it, so it's a matter of trading the extra
> complexity against the desire to not have to change the ABI again for longer
> (hopefully).
> 

Your suggestion still works without changing the current code: just reserve
'bmap_len==0x3f' for future extension; it's not used by the current code.

> >>> +static int  mark_unused_pages(struct zone *zone,
> >>> +		unsigned long *unused_pages, unsigned long size,
> >>> +		int order, unsigned long *pos)
> >>> +{
> >>> +	unsigned long pfn, flags;
> >>> +	unsigned int t;
> >>> +	struct list_head *curr;
> >>> +	struct page_info_item *info;
> >>> +
> >>> +	if (zone_is_empty(zone))
> >>> +		return 0;
> >>> +
> >>> +	spin_lock_irqsave(&zone->lock, flags);
> >>> +
> >>> +	if (*pos + zone->free_area[order].nr_free > size)
> >>> +		return -ENOSPC;
> >>
> >> Urg, so this won't partially fill?  So, what's the nr_free pages limit
> >> where we no longer fit in the kmalloc()'d buffer, where this simply won't
> work?
> >
> > Yes.  My initial implementation filled partially; that's better for the worst case.
> > I thought the above code is more efficient for the common case ...
> > Do you think partially filling the bitmap is better?
> 
> Could you please answer the question I asked?
> 

For your question:
-------------------------------------------------------------------------------------------------------
> So, what's the nr_free pages limit where we no longer fit in the kmalloc()'d buffer
> where this simply won't work?
------------------------------------------------------------------------------------------------------
No. If the buffer is not big enough to hold 'nr_free' pages, get_unused_pages() will return
'-ENOSPC', and the following code will try to allocate a buffer twice the size and retry,
until a buffer of the proper size is allocated. The current order will not be skipped unless the
buffer allocation fails.

> Because if you don't get this right, it could mean that there are system that
> simply *fail* here.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-11-30  8:43 [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration Liang Li
                   ` (4 preceding siblings ...)
  2016-11-30  8:43 ` [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info Liang Li
@ 2016-12-06  8:40 ` David Hildenbrand
  2016-12-07 13:35   ` Li, Liang Z
  5 siblings, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2016-12-06  8:40 UTC (permalink / raw)
  To: Liang Li, kvm
  Cc: virtio-dev, mhocko, mst, dave.hansen, qemu-devel, linux-kernel,
	linux-mm, kirill.shutemov, pbonzini, akpm, virtualization,
	dgilbert

Am 30.11.2016 um 09:43 schrieb Liang Li:
> This patch set contains two parts of changes to the virtio-balloon.
>
> One is the change for speeding up the inflating & deflating process,
> the main idea of this optimization is to use bitmap to send the page
> information to host instead of the PFNs, to reduce the overhead of
> virtio data transmission, address translation and madvise(). This can
> help to improve the performance by about 85%.

Do you have some statistics/some rough feeling how many consecutive bits 
are usually set in the bitmaps? Is it really just purely random or is 
there some granularity that is usually consecutive?

IOW in real examples, do we have really large consecutive areas or are 
all pages just completely distributed over our memory?

Thanks!

-- 

David

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-06  8:40 ` [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration David Hildenbrand
@ 2016-12-07 13:35   ` Li, Liang Z
  2016-12-07 15:34     ` Dave Hansen
  2016-12-07 15:42     ` David Hildenbrand
  0 siblings, 2 replies; 41+ messages in thread
From: Li, Liang Z @ 2016-12-07 13:35 UTC (permalink / raw)
  To: David Hildenbrand, kvm
  Cc: virtio-dev, mhocko, mst, Hansen, Dave, qemu-devel, linux-kernel,
	linux-mm, kirill.shutemov, pbonzini, akpm, virtualization,
	dgilbert

> Am 30.11.2016 um 09:43 schrieb Liang Li:
> > This patch set contains two parts of changes to the virtio-balloon.
> >
> > One is the change for speeding up the inflating & deflating process,
> > the main idea of this optimization is to use bitmap to send the page
> > information to host instead of the PFNs, to reduce the overhead of
> > virtio data transmission, address translation and madvise(). This can
> > help to improve the performance by about 85%.
> 
> Do you have some statistics/some rough feeling how many consecutive bits are
> usually set in the bitmaps? Is it really just purely random or is there some
> granularity that is usually consecutive?
> 

I did something similar: I filled the balloon with 15GB for a 16GB idle guest. By
using the bitmap, the madvise() count was reduced to 605; when using the PFNs, the madvise()
count was 3932160. That means there are quite a lot of consecutive bits in the bitmap.
I didn't test a guest with a heavy memory workload.

> IOW in real examples, do we have really large consecutive areas or are all
> pages just completely distributed over our memory?
> 

The buddy system of the Linux kernel's memory management suggests there should be quite a lot
of consecutive pages as long as a portion of memory in the guest is free.
If all pages were completely distributed over memory, it would mean the memory
fragmentation is very serious; the kernel has mechanisms to avoid that happening.
On the other hand, inflating should not happen at that time because the guest is almost
'out of memory'.

Liang

> Thanks!
> 
> --
> 
> David

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-07 13:35   ` Li, Liang Z
@ 2016-12-07 15:34     ` Dave Hansen
  2016-12-09  3:09       ` Li, Liang Z
  2016-12-07 15:42     ` David Hildenbrand
  1 sibling, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2016-12-07 15:34 UTC (permalink / raw)
  To: Li, Liang Z, David Hildenbrand, kvm
  Cc: virtio-dev, mhocko, mst, qemu-devel, linux-kernel, linux-mm,
	kirill.shutemov, pbonzini, akpm, virtualization, dgilbert

On 12/07/2016 05:35 AM, Li, Liang Z wrote:
>> Am 30.11.2016 um 09:43 schrieb Liang Li:
>> IOW in real examples, do we have really large consecutive areas or are all
>> pages just completely distributed over our memory?
> 
> The buddy system of the Linux kernel's memory management suggests
> there should be quite a lot of consecutive pages as long as a portion
> of the guest's memory is free.
...
> If all free pages were completely scattered over memory, the memory
> fragmentation would be very serious, and the kernel has mechanisms to
> avoid that happening.

While it is correct that the kernel has anti-fragmentation mechanisms, I
don't think it invalidates the question as to whether a bitmap would be
too sparse to be effective.

> On the other hand, inflating should not happen in that situation
> because the guest is almost 'out of memory'.

I don't think this is correct.  Most systems try to run with relatively
little free memory all the time, using the bulk of it as page cache.  We
have no reason to expect that ballooning will only occur when there is
lots of actual free memory and that it will not occur when that same
memory is in use as page cache.

In these patches, you're effectively still sending pfns.  You're just
sending one pfn per high-order page which is giving a really nice
speedup.  IMNHO, you're avoiding doing a real bitmap because creating a
bitmap means either having a really big bitmap, or doing some sorting
(or multiple passes) of the free lists before populating a smaller
bitmap.

Like David, I would still like to see some data on whether the choice
between bitmaps and pfn lists is ever clearly in favor of bitmaps.  You
haven't convinced me, at least, that the data isn't even worth collecting.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-07 13:35   ` Li, Liang Z
  2016-12-07 15:34     ` Dave Hansen
@ 2016-12-07 15:42     ` David Hildenbrand
  2016-12-07 15:45       ` Dave Hansen
  1 sibling, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2016-12-07 15:42 UTC (permalink / raw)
  To: Li, Liang Z, kvm
  Cc: virtio-dev, mhocko, mst, Hansen, Dave, qemu-devel, linux-kernel,
	linux-mm, kirill.shutemov, pbonzini, akpm, virtualization,
	dgilbert, Andrea Arcangeli

Am 07.12.2016 um 14:35 schrieb Li, Liang Z:
>> Am 30.11.2016 um 09:43 schrieb Liang Li:
>>> This patch set contains two parts of changes to the virtio-balloon.
>>>
>>> One is the change for speeding up the inflating & deflating process,
>>> the main idea of this optimization is to use bitmap to send the page
>>> information to host instead of the PFNs, to reduce the overhead of
>>> virtio data transmission, address translation and madvise(). This can
>>> help to improve the performance by about 85%.
>>
>> Do you have some statistics/some rough feeling how many consecutive bits are
>> usually set in the bitmaps? Is it really just purely random or is there some
>> granularity that is usually consecutive?
>>
>
> I did something similar: I filled the balloon with 15GB for a 16GB
> idle guest. When using the bitmap, the madvise() count was reduced to
> 605; when using PFNs, the madvise() count was 3932160. That means
> there are quite a lot of consecutive set bits in the bitmap.
> I didn't test a guest with a heavy memory workload.

Would it then even make sense to go one step further and report {pfn, 
length} combinations?

So simply send over an array of {pfn, length}?

This idea came up when talking to Andrea Arcangeli (put him on cc).

And it makes sense if you think about:

a) hugetlb backing: The host may only be able to free huge pages (we 
might want to communicate that to the guest later, that's another 
story). Still we would have to send bitmaps full of 4k frames (512 bits 
for 2mb frames). Of course, we could add a way to communicate that we 
are using a different bitmap-granularity.

b) if we really inflate huge memory regions (and it sounds like that 
according to your measurements), we can minimize the communication to 
the hypervisor and therefore the madvise calls.

c) we don't want to optimize for inflating guests with almost full 
memory (and therefore little consecutive memory areas) - my opinion :)


Thanks for the explanation!

-- 

David

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-07 15:42     ` David Hildenbrand
@ 2016-12-07 15:45       ` Dave Hansen
  2016-12-07 16:21         ` David Hildenbrand
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2016-12-07 15:45 UTC (permalink / raw)
  To: David Hildenbrand, Li, Liang Z, kvm
  Cc: virtio-dev, mhocko, mst, qemu-devel, linux-kernel, linux-mm,
	kirill.shutemov, pbonzini, akpm, virtualization, dgilbert,
	Andrea Arcangeli

On 12/07/2016 07:42 AM, David Hildenbrand wrote:
> Am 07.12.2016 um 14:35 schrieb Li, Liang Z:
>>> Am 30.11.2016 um 09:43 schrieb Liang Li:
>>>> This patch set contains two parts of changes to the virtio-balloon.
>>>>
>>>> One is the change for speeding up the inflating & deflating process,
>>>> the main idea of this optimization is to use bitmap to send the page
>>>> information to host instead of the PFNs, to reduce the overhead of
>>>> virtio data transmission, address translation and madvise(). This can
>>>> help to improve the performance by about 85%.
>>>
>>> Do you have some statistics/some rough feeling how many consecutive
>>> bits are
>>> usually set in the bitmaps? Is it really just purely random or is
>>> there some
>>> granularity that is usually consecutive?
>>>
>>
>> I did something similar. Filled the balloon with 15GB for a 16GB idle
>> guest, by
>> using bitmap, the madvise count was reduced to 605. when using the
>> PFNs, the madvise count
>> was 3932160. It means there are quite a lot consecutive bits in the
>> bitmap.
>> I didn't test for a guest with heavy memory workload.
> 
> Would it then even make sense to go one step further and report {pfn,
> length} combinations?
> 
> So simply send over an array of {pfn, length}?

Li's current patches do that.  Well, maybe not pfn/length, but they do
take a pfn and page-order, which fits perfectly with the kernel's
concept of high-order pages.

> And it makes sense if you think about:
> 
> a) hugetlb backing: The host may only be able to free huge pages (we
> might want to communicate that to the guest later, that's another
> story). Still we would have to send bitmaps full of 4k frames (512 bits
> for 2mb frames). Of course, we could add a way to communicate that we
> are using a different bitmap-granularity.

Yeah, please read the patches.  If they're not clear, then the
descriptions need work, but this is done already.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-07 15:45       ` Dave Hansen
@ 2016-12-07 16:21         ` David Hildenbrand
  2016-12-07 16:57           ` Dave Hansen
  0 siblings, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2016-12-07 16:21 UTC (permalink / raw)
  To: Dave Hansen, Li, Liang Z, kvm
  Cc: virtio-dev, mhocko, mst, qemu-devel, linux-kernel, linux-mm,
	kirill.shutemov, pbonzini, akpm, virtualization, dgilbert,
	Andrea Arcangeli

>>>
>>> I did something similar: I filled the balloon with 15GB for a 16GB
>>> idle guest. When using the bitmap, the madvise() count was reduced
>>> to 605; when using PFNs, the madvise() count was 3932160. That means
>>> there are quite a lot of consecutive set bits in the bitmap.
>>> I didn't test a guest with a heavy memory workload.
>>
>> Would it then even make sense to go one step further and report {pfn,
>> length} combinations?
>>
>> So simply send over an array of {pfn, length}?
>
> Li's current patches do that.  Well, maybe not pfn/length, but they do
> take a pfn and page-order, which fits perfectly with the kernel's
> concept of high-order pages.

So we can send length in powers of two. Still, I don't see any benefit
over a simple pfn/len scheme. But I'll have a more detailed look at the
implementation first, maybe that will enlighten me :)

>
>> And it makes sense if you think about:
>>
>> a) hugetlb backing: The host may only be able to free huge pages (we
>> might want to communicate that to the guest later, that's another
>> story). Still we would have to send bitmaps full of 4k frames (512 bits
>> for 2mb frames). Of course, we could add a way to communicate that we
>> are using a different bitmap-granularity.
>
> Yeah, please read the patches.  If they're not clear, then the
> descriptions need work, but this is done already.
>

I missed the page_shift, thanks for the hint.

-- 

David

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-07 16:21         ` David Hildenbrand
@ 2016-12-07 16:57           ` Dave Hansen
  2016-12-07 18:38             ` [Qemu-devel] " Andrea Arcangeli
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2016-12-07 16:57 UTC (permalink / raw)
  To: David Hildenbrand, Li, Liang Z, kvm
  Cc: mhocko, mst, qemu-devel, linux-kernel, linux-mm, kirill.shutemov,
	pbonzini, akpm, virtualization, dgilbert, Andrea Arcangeli

Removing silly virtio-dev@ list because it's bouncing mail...

On 12/07/2016 08:21 AM, David Hildenbrand wrote:
>> Li's current patches do that.  Well, maybe not pfn/length, but they do
>> take a pfn and page-order, which fits perfectly with the kernel's
>> concept of high-order pages.
> 
> So we can send length in powers of two. Still, I don't see any benefit
> over a simple pfn/len schema. But I'll have a more detailed look at the
> implementation first, maybe that will enlighten me :)

It is more space-efficient.  We're fitting the order into 6 bits, which
would allow the full 2^64 address space to be represented in one entry,
and leaves room for the bitmap size to be encoded as well, if we decide
we need a bitmap in the future.

If that was purely a length, we'd be limited to 64*4k pages per entry,
which isn't even a full large page.
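To make the space argument concrete, here is an illustrative back-of-the-envelope sketch (the 6-bit order width is from the discussion above; the field layout is hypothetical, not necessarily the patch's exact ABI):

```python
PAGE_SHIFT = 12   # 4KB base pages
ORDER_BITS = 6    # order field width discussed above

# A 6-bit order can go up to 63, so a single entry can span
# 2^(12+63) bytes, which is far beyond the full 2^64 address space.
max_order_span = 1 << (PAGE_SHIFT + (1 << ORDER_BITS) - 1)
assert max_order_span >= 1 << 64

# If those same 6 bits held a raw page count instead, one entry could
# cover at most 63 pages (~256KB), less than a single 2MB huge page.
max_len_span = ((1 << ORDER_BITS) - 1) << PAGE_SHIFT
assert max_len_span < 2 * 1024 * 1024
```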

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-07 16:57           ` Dave Hansen
@ 2016-12-07 18:38             ` Andrea Arcangeli
  2016-12-07 18:44               ` Dave Hansen
  2016-12-07 19:54               ` Dave Hansen
  0 siblings, 2 replies; 41+ messages in thread
From: Andrea Arcangeli @ 2016-12-07 18:38 UTC (permalink / raw)
  To: Dave Hansen
  Cc: David Hildenbrand, Li, Liang Z, kvm, mhocko, mst, linux-kernel,
	qemu-devel, linux-mm, dgilbert, pbonzini, akpm, virtualization,
	kirill.shutemov

Hello,

On Wed, Dec 07, 2016 at 08:57:01AM -0800, Dave Hansen wrote:
> It is more space-efficient.  We're fitting the order into 6 bits, which
> would allow the full 2^64 address space to be represented in one entry,

Very large order is the same as very large len, 6 bits of order or 8
bytes of len won't really move the needle here, simpler code is
preferable.

The main benefit of "len" is that it can be more granular, plus it's
simpler than the bitmap too. Eventually all this stuff has to end up
in a madvisev (not yet upstream, but somebody posted it for jemalloc
and it should get merged eventually).

So the bitmap shall be demuxed to an addr,len array anyway; the bitmap
won't ever be sent to the madvise syscall, which makes the
intermediate representation with the bitmap a complication with
basically no benefits compared to a (N, [addr1,len1], .., [addrN,
lenN]) representation.

If you prefer 1 byte of order (not just 6 bits) instead of 8 bytes of
len, that's possible too; I wouldn't be against that, and the
conversion before calling madvise would be pretty efficient too.

> and leaves room for the bitmap size to be encoded as well, if we decide
> we need a bitmap in the future.

How would a bitmap ever be useful with very large page-order?

> If that was purely a length, we'd be limited to 64*4k pages per entry,
> which isn't even a full large page.

I don't follow here.

What we suggest is to send the data down represented as (N,
[addr1,len1], ..., [addrN, lenN]), which allows an unlimited number of
ranges, each of maximum length 2^64, so 2^64 multiplied infinitely
many times if you wish. This simplifies the code: no bitmap at all,
and no :6 bits either.

The high order to low order loop of allocations is the interesting part
here, not the bitmap, and the fact of doing a single vmexit to send
the large ranges.

Once we pull out the largest-order regions, we just add them to the
array as [addr, 1UL<<order]; when the array reaches a maximum of N
entries or we fail an order-0 allocation, we flush all those entries
down to qemu. Qemu then builds the iov for madvisev and it's pretty
much a 1:1 conversion, not a decoding operation converting a bitmap
into the (N, [addr1,len1], ..., [addrN, lenN]) form for madvisev (or
into a flood of madvise MADV_DONTNEED calls with current kernels).
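That accumulate-and-flush loop could be sketched like this (illustrative only; drain_free_pages, the (addr, order) feed, and the flush() callback are hypothetical names, with flush() standing in for the single vmexit that hands a batch down to qemu):

```python
MAX_ENTRIES = 256   # flush threshold, arbitrary for illustration
PAGE_SHIFT = 12     # 4KB base pages

def drain_free_pages(allocated_runs, flush):
    """allocated_runs yields (addr, order) pairs as the MAX_ORDER..0
    loop pulls regions out of the buddy; flush() sends one batch of
    [addr, len] entries down to the hypervisor."""
    batch = []
    for addr, order in allocated_runs:
        batch.append((addr, 1 << (order + PAGE_SHIFT)))
        if len(batch) == MAX_ENTRIES:
            flush(batch)
            batch = []
    if batch:           # final partial batch, e.g. after an
        flush(batch)    # order-0 allocation failure
```

Qemu's side is then essentially a 1:1 walk of each flushed batch into the iovec for madvisev.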

Considering the loop that allocates starting from MAX_ORDER..1, the
chance of the bitmap actually getting filled with more than one bit at
a page_shift of PAGE_SHIFT should be very low after some uptime.

By the very nature of this loop, if we have already exhausted all
high-order buddies, the order-0 pages obtained are going to be fairly
fragmented, reducing the usefulness of the bitmap and potentially only
wasting CPU/memory.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-07 18:38             ` [Qemu-devel] " Andrea Arcangeli
@ 2016-12-07 18:44               ` Dave Hansen
  2016-12-07 18:58                 ` Andrea Arcangeli
  2016-12-07 19:54               ` Dave Hansen
  1 sibling, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2016-12-07 18:44 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: David Hildenbrand, Li, Liang Z, kvm, mhocko, mst, linux-kernel,
	qemu-devel, linux-mm, dgilbert, pbonzini, akpm, virtualization,
	kirill.shutemov

On 12/07/2016 10:38 AM, Andrea Arcangeli wrote:
>> > and leaves room for the bitmap size to be encoded as well, if we decide
>> > we need a bitmap in the future.
> How would a bitmap ever be useful with very large page-order?

Please, guys.  Read the patches.  *Please*.

The current code doesn't even _use_ a bitmap.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-07 18:44               ` Dave Hansen
@ 2016-12-07 18:58                 ` Andrea Arcangeli
  0 siblings, 0 replies; 41+ messages in thread
From: Andrea Arcangeli @ 2016-12-07 18:58 UTC (permalink / raw)
  To: Dave Hansen
  Cc: David Hildenbrand, Li, Liang Z, kvm, mhocko, mst, linux-kernel,
	qemu-devel, linux-mm, dgilbert, pbonzini, akpm, virtualization,
	kirill.shutemov

On Wed, Dec 07, 2016 at 10:44:31AM -0800, Dave Hansen wrote:
> On 12/07/2016 10:38 AM, Andrea Arcangeli wrote:
> >> > and leaves room for the bitmap size to be encoded as well, if we decide
> >> > we need a bitmap in the future.
> > How would a bitmap ever be useful with very large page-order?
> 
> Please, guys.  Read the patches.  *Please*.

I did read the code but you didn't answer my question.

Why should a feature exist in the code that will never be useful? Why
do you think we could ever decide we'll need the bitmap in the future
for high-order pages?

> The current code doesn't even _use_ a bitmap.

It's not using it right now; my question is exactly when it will ever
use it.

Leaving the bitmap only for order-0 allocations, when you have already
wiped all higher page orders from the buddy, doesn't seem a very good
idea overall, as the chance of getting order-0 pages with close physical
addresses doesn't seem very high. It would be high if the loop that eats
into every possible higher order didn't run first, but that loop has
just run and already wiped everything.

Also note, we need to call compaction very aggressively before falling
back from order 9 down to order 8. Ideally we should never use the
page_shift = PAGE_SHIFT case at all! That leaves the bitmap at best an
optimization for a case that is already suboptimal. If the bitmap
starts to pay off, it means the admin made a mistake and shrunk the
guest too much.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-07 18:38             ` [Qemu-devel] " Andrea Arcangeli
  2016-12-07 18:44               ` Dave Hansen
@ 2016-12-07 19:54               ` Dave Hansen
  2016-12-07 20:28                 ` Andrea Arcangeli
  1 sibling, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2016-12-07 19:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: David Hildenbrand, Li, Liang Z, kvm, mhocko, mst, linux-kernel,
	qemu-devel, linux-mm, dgilbert, pbonzini, akpm, virtualization,
	kirill.shutemov

We're talking about a bunch of different stuff which is all being
conflated.  There are 3 issues here that I can see.  I'll attempt to
summarize what I think is going on:

1. Current patches do a hypercall for each order in the allocator.
   This is inefficient, but independent from the underlying data
   structure in the ABI, unless bitmaps are in play, which they aren't.
2. Should we have bitmaps in the ABI, even if they are not in use by the
   guest implementation today?  Andrea says they have zero benefits
   over a pfn/len scheme.  Dave doesn't think they have zero benefits
   but isn't that attached to them.  QEMU's handling gets more
   complicated when using a bitmap.
3. Should the ABI contain records each with a pfn/len pair or a
   pfn/order pair?
   3a. 'len' is more flexible, but will always be a power-of-two anyway
	for high-order pages (the common case)
   3b. if we decide not to have a bitmap, then we basically have plenty
	of space for 'len' and should just do it
   3c. It's easiest for the hypervisor to turn pfn/len into the
       madvise() calls that it needs.

Did I miss anything?

On 12/07/2016 10:38 AM, Andrea Arcangeli wrote:
> On Wed, Dec 07, 2016 at 08:57:01AM -0800, Dave Hansen wrote:
>> It is more space-efficient.  We're fitting the order into 6 bits, which
>> would allow the full 2^64 address space to be represented in one entry,
> 
> Very large order is the same as very large len, 6 bits of order or 8
> bytes of len won't really move the needle here, simpler code is
> preferable.

Agreed.  But without seeing them side-by-side I'm not sure we can say
which is simpler.

> The main benefit of "len" is that it can be more granular, plus it's
> simpler than the bitmap too. Eventually all this stuff has to end up
> into a madvisev (not yet upstream but somebody posted it for jemalloc
> and should get merged eventually).
> 
> So the bitmap shall be demuxed to a addr,len array anyway, the bitmap
> won't ever be sent to the madvise syscall, which makes the
> intermediate representation with the bitmap a complication with
> basically no benefits compared to a (N, [addr1,len1], .., [addrN,
> lenN]) representation.

FWIW, I don't feel that strongly about the bitmap.  Li had one
originally, but I think the code thus far has demonstrated a huge
benefit without even having a bitmap.

I've got no objections to ripping the bitmap out of the ABI.

>> and leaves room for the bitmap size to be encoded as well, if we decide
>> we need a bitmap in the future.
> 
> How would a bitmap ever be useful with very large page-order?

Surely we can think of a few ways...

A bitmap is 64x more dense if the lists are unordered.  It means being
able to store ~32k*2M=64G worth of 2M pages in one data page vs. ~1G.
That's 64x fewer cachelines to touch, 64x fewer pages to move to the
hypervisor and lets us allocate 1/64th the memory.  Given a maximum
allocation that we're allowed, it lets us do 64x more per-pass.

Now, are those benefits worth it?  Maybe not, but let's not pretend they
don't exist. ;)
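The 64x density claim above checks out arithmetically (illustrative calculation only):

```python
PAGE = 4096
HUGE = 2 * 1024 * 1024

# One 4KB page used as a bitmap at 2MB granularity:
bitmap_bits = PAGE * 8                  # 32768 bits
bitmap_coverage = bitmap_bits * HUGE    # covers 64GB of 2MB pages
assert bitmap_coverage == 64 * 1024 ** 3

# The same 4KB page used as a list of 8-byte PFN entries:
list_entries = PAGE // 8                # 512 entries
list_coverage = list_entries * HUGE     # covers 1GB of 2MB pages
assert list_coverage == 1024 ** 3

assert bitmap_coverage // list_coverage == 64   # the claimed 64x
```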

>> If that was purely a length, we'd be limited to 64*4k pages per entry,
>> which isn't even a full large page.
> 
> I don't follow here.
> 
> What we suggest is to send the data down represented as (N,
> [addr1,len1], ..., [addrN, lenN]), which allows an unlimited number of
> ranges, each of maximum length 2^64, so 2^64 multiplied infinitely
> many times if you wish. This simplifies the code: no bitmap at all,
> and no :6 bits either.
> 
> The high order to low order loop of allocations is the interesting part
> here, not the bitmap, and the fact of doing a single vmexit to send
> the large ranges.

Yes, the current code sends one batch of pages up to the hypervisor per
order.  But, this has nothing to do with the underlying data structure,
or the choice to have an order vs. len in the ABI.

What you describe here is obviously more efficient.

> Considering the loop that allocates starting from MAX_ORDER..1, the
> chance of the bitmap actually getting filled with more than one bit at
> a page_shift of PAGE_SHIFT should be very low after some uptime.

Yes, if bitmaps were in use, this is true.  I think a guest populating
bitmaps would probably not use the same algorithm.

> By the very nature of this loop, if we have already exhausted all
> high-order buddies, the order-0 pages obtained are going to be fairly
> fragmented, reducing the usefulness of the bitmap and potentially only
> wasting CPU/memory.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-07 19:54               ` Dave Hansen
@ 2016-12-07 20:28                 ` Andrea Arcangeli
  2016-12-09  4:45                   ` Li, Liang Z
  0 siblings, 1 reply; 41+ messages in thread
From: Andrea Arcangeli @ 2016-12-07 20:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: David Hildenbrand, Li, Liang Z, kvm, mhocko, mst, linux-kernel,
	qemu-devel, linux-mm, dgilbert, pbonzini, akpm, virtualization,
	kirill.shutemov

On Wed, Dec 07, 2016 at 11:54:34AM -0800, Dave Hansen wrote:
> We're talking about a bunch of different stuff which is all being
> conflated.  There are 3 issues here that I can see.  I'll attempt to
> summarize what I think is going on:
> 
> 1. Current patches do a hypercall for each order in the allocator.
>    This is inefficient, but independent from the underlying data
>    structure in the ABI, unless bitmaps are in play, which they aren't.
> 2. Should we have bitmaps in the ABI, even if they are not in use by the
>    guest implementation today?  Andrea says they have zero benefits
>    over a pfn/len scheme.  Dave doesn't think they have zero benefits
>    but isn't that attached to them.  QEMU's handling gets more
>    complicated when using a bitmap.
> 3. Should the ABI contain records each with a pfn/len pair or a
>    pfn/order pair?
>    3a. 'len' is more flexible, but will always be a power-of-two anyway
> 	for high-order pages (the common case)

In practice, len wouldn't be a power of two only if we detect adjacent
pages of a smaller order that merge with larger orders we already
allocated (or the other way around).

[addr=2M, len=2M] allocated at order 9 pass
[addr=4M, len=1M] allocated at order 8 pass -> merge as [addr=2M, len=3M]

Not sure if it would be worth it, but unless we do this, page-order
vs. len won't make much difference.
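Such adjacent-range merging could be done as entries are appended; a minimal sketch (add_range is a hypothetical helper, not from the patches):

```python
def add_range(ranges, addr, length):
    """Append the range [addr, addr+length), merging with the previous
    entry when the new range starts exactly where the last one ends."""
    if ranges and ranges[-1][0] + ranges[-1][1] == addr:
        last_addr, last_len = ranges[-1]
        ranges[-1] = (last_addr, last_len + length)  # e.g. 2M+1M -> 3M
    else:
        ranges.append((addr, length))
    return ranges
```

With the example above, appending [addr=4M, len=1M] after [addr=2M, len=2M] yields the single entry [addr=2M, len=3M].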

>    3b. if we decide not to have a bitmap, then we basically have plenty
> 	of space for 'len' and should just do it
>    3c. It's easiest for the hypervisor to turn pfn/len into the
>        madvise() calls that it needs.
> 
> Did I miss anything?

I think you summarized fine all my arguments in your summary.

> FWIW, I don't feel that strongly about the bitmap.  Li had one
> originally, but I think the code thus far has demonstrated a huge
> benefit without even having a bitmap.
> 
> I've got no objections to ripping the bitmap out of the ABI.

I think we need to see statistics showing the average number of bits
set in each bitmap, after some uptime and LRU churn, e.g. running a
stress-test app with I/O for a while and then inflating the balloon
and counting:

1) how many bits were set vs total number of bits used in bitmaps

2) how many times bitmaps were used vs bitmap_len = 0 case of single
   page

My guess would be a very low percentage for both points.

> Surely we can think of a few ways...
> 
> A bitmap is 64x more dense if the lists are unordered.  It means being
> able to store ~32k*2M=64G worth of 2M pages in one data page vs. ~1G.
> That's 64x fewer cachelines to touch, 64x fewer pages to move to the
> hypervisor and lets us allocate 1/64th the memory.  Given a maximum
> allocation that we're allowed, it lets us do 64x more per-pass.
> 
> Now, are those benefits worth it?  Maybe not, but let's not pretend they
> don't exist. ;)

In the best case there are benefits obviously; the question is how
common the best case is.

The best case, if I understand correctly, is that no high-order pages
are available, but plenty of order-0 pages are available at phys
addresses X, X+8k, X+16k, ..., X+(8k*nr_bits_in_bitmap). How common is
it that order-0 pages exist but none of them are at an address < X
or > X+(8k*nr_bits_in_bitmap)?

> Yes, the current code sends one batch of pages up to the hypervisor per
> order.  But, this has nothing to do with the underlying data structure,
> or the choice to have an order vs. len in the ABI.
> 
> What you describe here is obviously more efficient.

And it isn't possible with the current ABI.

So there is a connection between the MAX_ORDER..0 allocation loop and
the ABI change, but I agree any of the proposed ABIs would still allow
this logic to be used. Bitmap or no bitmap, the loop would still work.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-07 15:34     ` Dave Hansen
@ 2016-12-09  3:09       ` Li, Liang Z
  0 siblings, 0 replies; 41+ messages in thread
From: Li, Liang Z @ 2016-12-09  3:09 UTC (permalink / raw)
  To: Hansen, Dave, David Hildenbrand, kvm
  Cc: virtio-dev, mhocko, mst, qemu-devel, linux-kernel, linux-mm,
	kirill.shutemov, pbonzini, akpm, virtualization, dgilbert

> Subject: Re: [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating
> & fast live migration
> 
> On 12/07/2016 05:35 AM, Li, Liang Z wrote:
> >> Am 30.11.2016 um 09:43 schrieb Liang Li:
> >> IOW in real examples, do we have really large consecutive areas or
> >> are all pages just completely distributed over our memory?
> >
> > The buddy system of the Linux kernel's memory management suggests
> > there should be quite a lot of consecutive pages as long as a
> > portion of the guest's memory is free.
> ...
> > If all free pages were completely scattered over memory, the memory
> > fragmentation would be very serious, and the kernel has mechanisms
> > to avoid that happening.
> 
> While it is correct that the kernel has anti-fragmentation mechanisms, I don't
> think it invalidates the question as to whether a bitmap would be too sparse
> to be effective.
> 
> > On the other hand, inflating should not happen in that situation
> > because the guest is almost 'out of memory'.
> 
> I don't think this is correct.  Most systems try to run with relatively little free
> memory all the time, using the bulk of it as page cache.  We have no reason
> to expect that ballooning will only occur when there is lots of actual free
> memory and that it will not occur when that same memory is in use as page
> cache.
> 

Yes.
> In these patches, you're effectively still sending pfns.  You're just sending
> one pfn per high-order page which is giving a really nice speedup.  IMNHO,
> you're avoiding doing a real bitmap because creating a bitmap means either
> having a really big bitmap, or doing some sorting (or multiple
> passes) of the free lists before populating a smaller bitmap.
> 
> Like David, I would still like to see some data on whether the choice between
> bitmaps and pfn lists is ever clearly in favor of bitmaps.  You haven't
> convinced me, at least, that the data isn't even worth collecting.

I will try to get some data with a real workload and share it with you guys.

Thanks!
Liang

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-07 20:28                 ` Andrea Arcangeli
@ 2016-12-09  4:45                   ` Li, Liang Z
  2016-12-09  4:53                     ` Dave Hansen
  0 siblings, 1 reply; 41+ messages in thread
From: Li, Liang Z @ 2016-12-09  4:45 UTC (permalink / raw)
  To: Andrea Arcangeli, Hansen, Dave
  Cc: David Hildenbrand, kvm, mhocko, mst, linux-kernel, qemu-devel,
	linux-mm, dgilbert, pbonzini, akpm, virtualization,
	kirill.shutemov

> > 1. Current patches do a hypercall for each order in the allocator.
> >    This is inefficient, but independent from the underlying data
> >    structure in the ABI, unless bitmaps are in play, which they aren't.
> > 2. Should we have bitmaps in the ABI, even if they are not in use by the
> >    guest implementation today?  Andrea says they have zero benefits
> >    over a pfn/len scheme.  Dave doesn't think they have zero benefits
> >    but isn't that attached to them.  QEMU's handling gets more
> >    complicated when using a bitmap.
> > 3. Should the ABI contain records each with a pfn/len pair or a
> >    pfn/order pair?
> >    3a. 'len' is more flexible, but will always be a power-of-two anyway
> > 	for high-order pages (the common case)
> 
> Len wouldn't be a power of two practically only if we detect adjacent pages
> of smaller order that may merge into larger orders we already allocated (or
> the other way around).
> 
> [addr=2M, len=2M] allocated at order 9 pass [addr=4M, len=1M] allocated at
> order 8 pass -> merge as [addr=2M, len=3M]
> 
> Not sure if it would be worth it, but unless we do this, page-order vs. len
> won't make much difference.
> 
> >    3b. if we decide not to have a bitmap, then we basically have plenty
> > 	of space for 'len' and should just do it
> >    3c. It's easiest for the hypervisor to turn pfn/len into the
> >        madvise() calls that it needs.
> >
> > Did I miss anything?
> 
> I think you summarized fine all my arguments in your summary.
> 
> > FWIW, I don't feel that strongly about the bitmap.  Li had one
> > originally, but I think the code thus far has demonstrated a huge
> > benefit without even having a bitmap.
> >
> > I've got no objections to ripping the bitmap out of the ABI.
> 
> I think we need to see statistics showing the number of bits set in each
> bitmap on average, after some uptime and lru churn, like running a stresstest
> app for a while with I/O and then inflate the balloon and
> count:
> 
> 1) how many bits were set vs total number of bits used in bitmaps
> 
> 2) how many times bitmaps were used vs bitmap_len = 0 case of single
>    page
> 
> My guess would be like very low percentage for both points.
> 

> So there is a connection with the MAX_ORDER..0 allocation loop and the ABI
> change, but I agree any of the proposed ABIs would still allow for this logic to
> be used. Bitmap or not bitmap, the loop would still work.
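The adjacent-range merging in the quoted example can be sketched as below; `struct pfn_range` and `range_merge` are hypothetical names for illustration, not code from the patchset.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical record: base pfn plus length, both counted in 4K pages. */
struct pfn_range {
    uint64_t pfn;
    uint64_t len;
};

/*
 * Try to merge a newly allocated range into the previous record.
 * Returns 1 on merge (no new record needed), 0 otherwise.  With
 * merging, 'len' stops being a power of two: a 2M chunk at 2M
 * ([pfn=512, len=512]) followed by a 1M chunk at 4M ([pfn=1024,
 * len=256]) becomes [pfn=512, len=768], i.e. [addr=2M, len=3M].
 */
static int range_merge(struct pfn_range *prev, uint64_t pfn, uint64_t len)
{
    if (prev->pfn + prev->len == pfn) {
        prev->len += len;
        return 1;
    }
    return 0;
}
```

Whether the extra bookkeeping pays off depends on how often the allocator actually returns adjacent chunks of different orders.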

Hi guys,

What's the conclusion of your discussion?
It seems you want some statistics before deciding whether to rip the bitmap from the ABI, am I right?

Thanks!
Liang 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-09  4:45                   ` Li, Liang Z
@ 2016-12-09  4:53                     ` Dave Hansen
  2016-12-09  5:35                       ` Li, Liang Z
  2016-12-14  8:59                       ` Li, Liang Z
  0 siblings, 2 replies; 41+ messages in thread
From: Dave Hansen @ 2016-12-09  4:53 UTC (permalink / raw)
  To: Li, Liang Z, Andrea Arcangeli
  Cc: David Hildenbrand, kvm, mhocko, mst, linux-kernel, qemu-devel,
	linux-mm, dgilbert, pbonzini, akpm, virtualization,
	kirill.shutemov

On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> What's the conclusion of your discussion? It seems you want some
> statistic before deciding whether to  ripping the bitmap from the
> ABI, am I right?

I think Andrea and David feel pretty strongly that we should remove the
bitmap, unless we have some data to support keeping it.  I don't feel as
strongly about it, but I think their critique of it is pretty valid.  I
think the consensus is that the bitmap needs to go.

The only real question IMNHO is whether we should do a power-of-2 or a
length.  But, if we have 12 bits, then the argument for doing length is
pretty strong.  We don't need anywhere near 12 bits if doing power-of-2.


* RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-09  4:53                     ` Dave Hansen
@ 2016-12-09  5:35                       ` Li, Liang Z
  2016-12-09 16:42                         ` Andrea Arcangeli
  2016-12-14  8:59                       ` Li, Liang Z
  1 sibling, 1 reply; 41+ messages in thread
From: Li, Liang Z @ 2016-12-09  5:35 UTC (permalink / raw)
  To: Hansen, Dave, Andrea Arcangeli
  Cc: David Hildenbrand, kvm, mhocko, mst, linux-kernel, qemu-devel,
	linux-mm, dgilbert, pbonzini, akpm, virtualization,
	kirill.shutemov

> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > What's the conclusion of your discussion? It seems you want some
> > statistic before deciding whether to  ripping the bitmap from the ABI,
> > am I right?
> 
> I think Andrea and David feel pretty strongly that we should remove the
> bitmap, unless we have some data to support keeping it.  I don't feel as
> strongly about it, but I think their critique of it is pretty valid.  I think the
> consensus is that the bitmap needs to go.
> 

Thanks for your clarification.

> The only real question IMNHO is whether we should do a power-of-2 or a
> length.  But, if we have 12 bits, then the argument for doing length is pretty
> strong.  We don't need anywhere near 12 bits if doing power-of-2.
> 
So each item can represent at most 2^12 pages, i.e. 16MB with 4K pages; that seems not very big,
but it is enough for most cases.
Things become much simpler without the bitmap, and I like a simple solution too. :)

I will prepare the v6 and remove all the bitmap related stuff. Thank you all!

Liang


* Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-09  5:35                       ` Li, Liang Z
@ 2016-12-09 16:42                         ` Andrea Arcangeli
  2016-12-14  8:20                           ` Li, Liang Z
  0 siblings, 1 reply; 41+ messages in thread
From: Andrea Arcangeli @ 2016-12-09 16:42 UTC (permalink / raw)
  To: Li, Liang Z
  Cc: Hansen, Dave, David Hildenbrand, kvm, mhocko, mst, linux-kernel,
	qemu-devel, linux-mm, dgilbert, pbonzini, akpm, virtualization,
	kirill.shutemov

Hello,

On Fri, Dec 09, 2016 at 05:35:45AM +0000, Li, Liang Z wrote:
> > On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > > What's the conclusion of your discussion? It seems you want some
> > > statistic before deciding whether to  ripping the bitmap from the ABI,
> > > am I right?
> > 
> > I think Andrea and David feel pretty strongly that we should remove the
> > bitmap, unless we have some data to support keeping it.  I don't feel as
> > strongly about it, but I think their critique of it is pretty valid.  I think the
> > consensus is that the bitmap needs to go.
> > 
> 
> Thanks for you clarification.
> 
> > The only real question IMNHO is whether we should do a power-of-2 or a
> > length.  But, if we have 12 bits, then the argument for doing length is pretty
> > strong.  We don't need anywhere near 12 bits if doing power-of-2.
> > 
> So each item can max represent 16MB Bytes, seems not big enough,
> but enough for most case.
> Things became much more simple without the bitmap, and I like simple solution too. :)
> 
> I will prepare the v6 and remove all the bitmap related stuffs. Thank you all!

Sounds great!

I suggested checking the statistics because collecting those stats
looked simpler and quicker than removing all the bitmap related stuff
from the patchset. However, if you prefer to prepare a v6 without the
bitmap, another, perhaps more interesting, way to evaluate the
usefulness of the bitmap is to just run the same benchmark and verify
that there is no regression compared to the bitmap enabled code.

The other issue with the bitmap is that its best case becomes ever
less likely to materialize as more RAM is added to the guest. It
won't regress linearly, because after all there can be some locality
bias in the buddy splits, but if sync compaction is used in the large
order allocations tried before reaching order 0, the bitmap payoff
will regress close to linearly with the increase of RAM.

So it'd be good to check the stats or the benchmark on large guests,
at least one hundred gigabytes or so.

Changing topic but still about the ABI features needed, so it may be
relevant for this discussion:

1) vNUMA locality: i.e. allowing host to specify which vNODEs to take
   memory from, using alloc_pages_node in guest. So you can ask to
   take X pages from vnode A, Y pages from vnode B, in one vmenter.

2) allowing qemu to tell the guest to stop inflating the balloon and
   report a fragmentation limit being hit, when sync compaction
   powered allocations fail at a certain power-of-two order granularity
   passed by qemu to the guest. This order constraint will be passed
   by default for hugetlbfs guests with 2MB hpage size, while it can
   be used optionally on THP backed guests. This option with THP
   guests would allow highlevel management software to provide a
   "don't reduce guest performance" option while shrinking the memory
   size of the guest from the GUI. If you deselect the option, you can shrink
   down to the last freeable 4k guest page, but doing so may have to
   split THP in the host (you don't know for sure if they were really
   THP but they could have been), and it may regress
   performance. Inflating the balloon while passing a minimum
   granularity "order" of the pages being zapped will instead
   guarantee that inflating the balloon cannot decrease guest
   performance. Plus it's needed for hugetlbfs anyway, as far as I can
   tell. hugetlbfs would not be host enforceable even if the idea is
   not to free memory but only reduce the available memory of the
   guest (not without major changes that maps a hugetlb page with 4k
   ptes at least). While for a more cooperative usage of hugetlbfs
   guests, it's simply not useful to inflate the balloon at anything
   less than the "HPAGE_SIZE" hugetlbfs granularity.

We also plan to use userfaultfd to make the balloon driver host
enforced (will work fine on hugetlbfs 2M and tmpfs too) but that's
going to be invisible to the ABI so it's not strictly relevant for
this discussion.

On a side note, registering userfaultfd on the ballooned range will
keep khugepaged at bay, so it won't risk re-inflating the
MADV_DONTNEED-zapped sub-THP fragments no matter the sysfs tunings.

Thanks!
Andrea


* RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-09 16:42                         ` Andrea Arcangeli
@ 2016-12-14  8:20                           ` Li, Liang Z
  0 siblings, 0 replies; 41+ messages in thread
From: Li, Liang Z @ 2016-12-14  8:20 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hansen, Dave, David Hildenbrand, kvm, mhocko, mst, linux-kernel,
	qemu-devel, linux-mm, dgilbert, pbonzini, akpm, virtualization,
	kirill.shutemov

> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for
> fast (de)inflating & fast live migration
> 
> Hello,
> 
> On Fri, Dec 09, 2016 at 05:35:45AM +0000, Li, Liang Z wrote:
> > > On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > > > What's the conclusion of your discussion? It seems you want some
> > > > statistic before deciding whether to  ripping the bitmap from the
> > > > ABI, am I right?
> > >
> > > I think Andrea and David feel pretty strongly that we should remove
> > > the bitmap, unless we have some data to support keeping it.  I don't
> > > feel as strongly about it, but I think their critique of it is
> > > pretty valid.  I think the consensus is that the bitmap needs to go.
> > >
> >
> > Thanks for you clarification.
> >
> > > The only real question IMNHO is whether we should do a power-of-2 or
> > > a length.  But, if we have 12 bits, then the argument for doing
> > > length is pretty strong.  We don't need anywhere near 12 bits if doing
> power-of-2.
> > >
> > So each item can max represent 16MB Bytes, seems not big enough, but
> > enough for most case.
> > Things became much more simple without the bitmap, and I like simple
> > solution too. :)
> >
> > I will prepare the v6 and remove all the bitmap related stuffs. Thank you all!
> 
> Sounds great!
> 
> I suggested to check the statistics, because collecting those stats looked
> simpler and quicker than removing all bitmap related stuff from the patchset.
> However if you prefer to prepare a v6 without the bitmap another perhaps
> more interesting way to evaluate the usefulness of the bitmap is to just run
> the same benchmark and verify that there is no regression compared to the
> bitmap enabled code.
> 
> The other issue with the bitmap is, the best case for the bitmap is ever less
> likely to materialize the more RAM is added to the guest. It won't regress
> linearly because after all there can be some locality bias in the buddy splits,
> but if sync compaction is used in the large order allocations tried before
> reaching order 0, the bitmap payoff will regress close to linearly with the
> increase of RAM.
> 
> So it'd be good to check the stats or the benchmark on large guests, at least
> one hundred gigabytes or so.
> 
> Changing topic but still about the ABI features needed, so it may be relevant
> for this discussion:
> 
> 1) vNUMA locality: i.e. allowing host to specify which vNODEs to take
>    memory from, using alloc_pages_node in guest. So you can ask to
>    take X pages from vnode A, Y pages from vnode B, in one vmenter.
> 
> 2) allowing qemu to tell the guest to stop inflating the balloon and
>    report a fragmentation limit being hit, when sync compaction
>    powered allocations fails at certain power-of-two order granularity
>    passed by qemu to the guest. This order constraint will be passed
>    by default for hugetlbfs guests with 2MB hpage size, while it can
>    be used optionally on THP backed guests. This option with THP
>    guests would allow a highlevel management software to provide a
>    "don't reduce guest performance" while shrinking the memory size of
>    the guest from the GUI. If you deselect the option, you can shrink
>    down to the last freeable 4k guest page, but doing so may have to
>    split THP in the host (you don't know for sure if they were really
>    THP but they could have been), and it may regress
>    performance. Inflating the balloon while passing a minimum
>    granularity "order" of the pages being zapped, will guarantee
>    inflating the balloon cannot decrease guest performance
>    instead. Plus it's needed for hugetlbfs anyway as far as I can
>    tell. hugetlbfs would not be host enforceable even if the idea is
>    not to free memory but only reduce the available memory of the
>    guest (not without major changes that maps a hugetlb page with 4k
>    ptes at least). While for a more cooperative usage of hugetlbfs
>    guests, it's simply not useful to inflate the balloon at anything
>    less than the "HPAGE_SIZE" hugetlbfs granularity.
> 
> We also plan to use userfaultfd to make the balloon driver host enforced (will
> work fine on hugetlbfs 2M and tmpfs too) but that's going to be invisible to
> the ABI so it's not strictly relevant for this discussion.
> 
> On a side note, registering userfaultfd on the ballooned range, will keep
> khugepaged at bay so it won't risk to re-inflating the MADV_DONTNEED
> zapped sub-THP fragments no matter the sysfs tunings.
> 

Thanks for your elaboration!

> Thanks!
> Andrea


* RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-09  4:53                     ` Dave Hansen
  2016-12-09  5:35                       ` Li, Liang Z
@ 2016-12-14  8:59                       ` Li, Liang Z
  2016-12-15 15:34                         ` Dave Hansen
  1 sibling, 1 reply; 41+ messages in thread
From: Li, Liang Z @ 2016-12-14  8:59 UTC (permalink / raw)
  To: Hansen, Dave, Andrea Arcangeli
  Cc: David Hildenbrand, kvm, mhocko, mst, linux-kernel, qemu-devel,
	linux-mm, dgilbert, pbonzini, akpm, virtualization,
	kirill.shutemov

> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for
> fast (de)inflating & fast live migration
> 
> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > What's the conclusion of your discussion? It seems you want some
> > statistic before deciding whether to  ripping the bitmap from the ABI,
> > am I right?
> 
> I think Andrea and David feel pretty strongly that we should remove the
> bitmap, unless we have some data to support keeping it.  I don't feel as
> strongly about it, but I think their critique of it is pretty valid.  I think the
> consensus is that the bitmap needs to go.
> 
> The only real question IMNHO is whether we should do a power-of-2 or a
> length.  But, if we have 12 bits, then the argument for doing length is pretty
> strong.  We don't need anywhere near 12 bits if doing power-of-2.

Just found that MAX_ORDER should be limited to 12 if we use length instead of order;
if MAX_ORDER is configured to a value bigger than 12, it will make things more
complex to handle this case.

If we use order, we need to break a large memory range whose length is not a power of 2 into
several smaller ranges, which also makes the code complex.

It seems we leave too many bits for the pfn, and the bits left for the length are not enough.
How about keeping 45 bits for the pfn and 19 bits for the length? 45 bits for the pfn can
cover a 57-bit physical address, which should be enough for the near future.

What's your opinion?


thanks!
Liang

 


* Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-14  8:59                       ` Li, Liang Z
@ 2016-12-15 15:34                         ` Dave Hansen
  2016-12-15 15:54                           ` Michael S. Tsirkin
  2016-12-16  0:48                           ` Li, Liang Z
  0 siblings, 2 replies; 41+ messages in thread
From: Dave Hansen @ 2016-12-15 15:34 UTC (permalink / raw)
  To: Li, Liang Z, Andrea Arcangeli
  Cc: David Hildenbrand, kvm, mhocko, mst, linux-kernel, qemu-devel,
	linux-mm, dgilbert, pbonzini, akpm, virtualization,
	kirill.shutemov

On 12/14/2016 12:59 AM, Li, Liang Z wrote:
>> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for
>> fast (de)inflating & fast live migration
>>
>> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
>>> What's the conclusion of your discussion? It seems you want some
>>> statistic before deciding whether to  ripping the bitmap from the ABI,
>>> am I right?
>>
>> I think Andrea and David feel pretty strongly that we should remove the
>> bitmap, unless we have some data to support keeping it.  I don't feel as
>> strongly about it, but I think their critique of it is pretty valid.  I think the
>> consensus is that the bitmap needs to go.
>>
>> The only real question IMNHO is whether we should do a power-of-2 or a
>> length.  But, if we have 12 bits, then the argument for doing length is pretty
>> strong.  We don't need anywhere near 12 bits if doing power-of-2.
> 
> Just found the MAX_ORDER should be limited to 12 if use length instead of order,
> If the MAX_ORDER is configured to a value bigger than 12, it will make things more
> complex to handle this case. 
> 
> If use order, we need to break a large memory range whose length is not the power of 2 into several
> small ranges, it also make the code complex.

I can't imagine it makes the code that much more complex.  It adds a for
loop.  Right?

> It seems we leave too many bit  for the pfn, and the bits leave for length is not enough,
> How about keep 45 bits for the pfn and 19 bits for length, 45 bits for pfn can cover 57 bits
> physical address, that should be enough in the near feature. 
> 
> What's your opinion?

I still think 'order' makes a lot of sense.  But, as you say, 57 bits is
enough for x86 for a while.  Other architectures.... who knows?


* Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-15 15:34                         ` Dave Hansen
@ 2016-12-15 15:54                           ` Michael S. Tsirkin
  2016-12-16  1:12                             ` Li, Liang Z
  2016-12-16  0:48                           ` Li, Liang Z
  1 sibling, 1 reply; 41+ messages in thread
From: Michael S. Tsirkin @ 2016-12-15 15:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Li, Liang Z, Andrea Arcangeli, David Hildenbrand, kvm, mhocko,
	linux-kernel, qemu-devel, linux-mm, dgilbert, pbonzini, akpm,
	virtualization, kirill.shutemov

On Thu, Dec 15, 2016 at 07:34:33AM -0800, Dave Hansen wrote:
> On 12/14/2016 12:59 AM, Li, Liang Z wrote:
> >> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for
> >> fast (de)inflating & fast live migration
> >>
> >> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> >>> What's the conclusion of your discussion? It seems you want some
> >>> statistic before deciding whether to  ripping the bitmap from the ABI,
> >>> am I right?
> >>
> >> I think Andrea and David feel pretty strongly that we should remove the
> >> bitmap, unless we have some data to support keeping it.  I don't feel as
> >> strongly about it, but I think their critique of it is pretty valid.  I think the
> >> consensus is that the bitmap needs to go.
> >>
> >> The only real question IMNHO is whether we should do a power-of-2 or a
> >> length.  But, if we have 12 bits, then the argument for doing length is pretty
> >> strong.  We don't need anywhere near 12 bits if doing power-of-2.
> > 
> > Just found the MAX_ORDER should be limited to 12 if use length instead of order,
> > If the MAX_ORDER is configured to a value bigger than 12, it will make things more
> > complex to handle this case. 
> > 
> > If use order, we need to break a large memory range whose length is not the power of 2 into several
> > small ranges, it also make the code complex.
> 
> I can't imagine it makes the code that much more complex.  It adds a for
> loop.  Right?
> 
> > It seems we leave too many bit  for the pfn, and the bits leave for length is not enough,
> > How about keep 45 bits for the pfn and 19 bits for length, 45 bits for pfn can cover 57 bits
> > physical address, that should be enough in the near feature. 
> > 
> > What's your opinion?
> 
> I still think 'order' makes a lot of sense.  But, as you say, 57 bits is
> enough for x86 for a while.  Other architectures.... who knows?

I think you can probably assume page size >= 4K. But I would not want
to make any other assumptions. E.g. there are systems that absolutely
require you to set high bits for DMA.

I think we really want both length and order.

I understand how you are trying to pack them as tightly as possible.

However, I thought of a trick, we don't need to encode all
possible orders. For example, with 2 bits of order,
we can make them mean:
00 - 4K pages
01 - 2M pages
02 - 1G pages

The guest can program the sizes for each order through config space.

We will have 10 bits left for length.

It might make sense to also allow guest to program the number of bits
used for order, this will make it easy to extend without
host changes.

-- 
MST


* RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-15 15:34                         ` Dave Hansen
  2016-12-15 15:54                           ` Michael S. Tsirkin
@ 2016-12-16  0:48                           ` Li, Liang Z
  2016-12-16  1:09                             ` Dave Hansen
  1 sibling, 1 reply; 41+ messages in thread
From: Li, Liang Z @ 2016-12-16  0:48 UTC (permalink / raw)
  To: Hansen, Dave, Andrea Arcangeli
  Cc: David Hildenbrand, kvm, mhocko, mst, linux-kernel, qemu-devel,
	linux-mm, dgilbert, pbonzini, akpm, virtualization,
	kirill.shutemov

> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for
> fast (de)inflating & fast live migration
> 
> On 12/14/2016 12:59 AM, Li, Liang Z wrote:
> >> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon
> >> for fast (de)inflating & fast live migration
> >>
> >> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> >>> What's the conclusion of your discussion? It seems you want some
> >>> statistic before deciding whether to  ripping the bitmap from the
> >>> ABI, am I right?
> >>
> >> I think Andrea and David feel pretty strongly that we should remove
> >> the bitmap, unless we have some data to support keeping it.  I don't
> >> feel as strongly about it, but I think their critique of it is pretty
> >> valid.  I think the consensus is that the bitmap needs to go.
> >>
> >> The only real question IMNHO is whether we should do a power-of-2 or
> >> a length.  But, if we have 12 bits, then the argument for doing
> >> length is pretty strong.  We don't need anywhere near 12 bits if doing
> power-of-2.
> >
> > Just found the MAX_ORDER should be limited to 12 if use length instead
> > of order, If the MAX_ORDER is configured to a value bigger than 12, it
> > will make things more complex to handle this case.
> >
> > If use order, we need to break a large memory range whose length is
> > not the power of 2 into several small ranges, it also make the code complex.
> 
> I can't imagine it makes the code that much more complex.  It adds a for loop.
> Right?
> 

Yes, just a little. :)

> > It seems we leave too many bit  for the pfn, and the bits leave for
> > length is not enough, How about keep 45 bits for the pfn and 19 bits
> > for length, 45 bits for pfn can cover 57 bits physical address, that should be
> enough in the near feature.
> >
> > What's your opinion?
> 
> I still think 'order' makes a lot of sense.  But, as you say, 57 bits is enough for
> x86 for a while.  Other architectures.... who knows?

Yes. 


* Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-16  0:48                           ` Li, Liang Z
@ 2016-12-16  1:09                             ` Dave Hansen
  2016-12-16  1:38                               ` Li, Liang Z
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2016-12-16  1:09 UTC (permalink / raw)
  To: Li, Liang Z, Andrea Arcangeli
  Cc: David Hildenbrand, kvm, mhocko, mst, linux-kernel, qemu-devel,
	linux-mm, dgilbert, pbonzini, akpm, virtualization,
	kirill.shutemov

On 12/15/2016 04:48 PM, Li, Liang Z wrote:
>>> It seems we leave too many bit  for the pfn, and the bits leave for
>>> length is not enough, How about keep 45 bits for the pfn and 19 bits
>>> for length, 45 bits for pfn can cover 57 bits physical address, that should be
>> enough in the near feature.
>>> What's your opinion?
>> I still think 'order' makes a lot of sense.  But, as you say, 57 bits is enough for
>> x86 for a while.  Other architectures.... who knows?

Thinking about this some more...  There are really only two cases that
matter: 4k pages and "much bigger" ones.

Squeezing each 4k page into 8 bytes of metadata helps guarantee that
this scheme won't regress over the old scheme in any cases.  For bigger
ranges, 8 vs 16 bytes means *nothing*.  And 16 bytes will be as good or
better than the old scheme for everything which is >4k.

How about this:
 * 52 bits of 'pfn', 5 bits of 'order', 7 bits of 'length'
 * One special 'length' value to mean "actual length in next 8 bytes"

That should be pretty simple to produce and decode.  We have two record
sizes, but I think it is manageable.
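One possible encoding of the 52/5/7 layout with the escape length could look like the sketch below; `LEN_ESCAPE` and `encode` are made-up names, and nothing here is a settled ABI.

```c
#include <assert.h>
#include <stdint.h>

#define ORDER_BITS  5
#define LEN_BITS    7
/* Special length value meaning "actual length in next 8 bytes". */
#define LEN_ESCAPE  ((1ULL << LEN_BITS) - 1)

/*
 * Record layout (sketch): [ pfn:52 | order:5 | len:7 ].
 * Returns 1 if the length fit inline, 0 if the caller must emit
 * the real length as a second 8-byte word.
 */
static int encode(uint64_t *rec, uint64_t pfn, unsigned int order,
                  uint64_t len)
{
    uint64_t inline_len = (len < LEN_ESCAPE) ? len : LEN_ESCAPE;

    *rec = (pfn << (ORDER_BITS + LEN_BITS)) |
           ((uint64_t)order << LEN_BITS) |
           inline_len;
    return inline_len != LEN_ESCAPE;
}
```

The two record sizes keep the common 4k case at 8 bytes while still allowing arbitrarily long runs.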


* RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-15 15:54                           ` Michael S. Tsirkin
@ 2016-12-16  1:12                             ` Li, Liang Z
  2016-12-16 15:40                               ` Andrea Arcangeli
  0 siblings, 1 reply; 41+ messages in thread
From: Li, Liang Z @ 2016-12-16  1:12 UTC (permalink / raw)
  To: Michael S. Tsirkin, Hansen, Dave
  Cc: Andrea Arcangeli, David Hildenbrand, kvm, mhocko, linux-kernel,
	qemu-devel, linux-mm, dgilbert, pbonzini, akpm, virtualization,
	kirill.shutemov

> On Thu, Dec 15, 2016 at 07:34:33AM -0800, Dave Hansen wrote:
> > On 12/14/2016 12:59 AM, Li, Liang Z wrote:
> > >> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend
> > >> virtio-balloon for fast (de)inflating & fast live migration
> > >>
> > >> On 12/08/2016 08:45 PM, Li, Liang Z wrote:
> > >>> What's the conclusion of your discussion? It seems you want some
> > >>> statistic before deciding whether to  ripping the bitmap from the
> > >>> ABI, am I right?
> > >>
> > >> I think Andrea and David feel pretty strongly that we should remove
> > >> the bitmap, unless we have some data to support keeping it.  I
> > >> don't feel as strongly about it, but I think their critique of it
> > >> is pretty valid.  I think the consensus is that the bitmap needs to go.
> > >>
> > >> The only real question IMNHO is whether we should do a power-of-2
> > >> or a length.  But, if we have 12 bits, then the argument for doing
> > >> length is pretty strong.  We don't need anywhere near 12 bits if doing
> power-of-2.
> > >
> > > Just found the MAX_ORDER should be limited to 12 if use length
> > > instead of order, If the MAX_ORDER is configured to a value bigger
> > > than 12, it will make things more complex to handle this case.
> > >
> > > If use order, we need to break a large memory range whose length is
> > > not the power of 2 into several small ranges, it also make the code
> complex.
> >
> > I can't imagine it makes the code that much more complex.  It adds a
> > for loop.  Right?
> >
> > > It seems we leave too many bit  for the pfn, and the bits leave for
> > > length is not enough, How about keep 45 bits for the pfn and 19 bits
> > > for length, 45 bits for pfn can cover 57 bits physical address, that should
> be enough in the near feature.
> > >
> > > What's your opinion?
> >
> > I still think 'order' makes a lot of sense.  But, as you say, 57 bits
> > is enough for x86 for a while.  Other architectures.... who knows?
> 
> I think you can probably assume page size >= 4K. But I would not want to
> make any other assumptions. E.g. there are systems that absolutely require
> you to set high bits for DMA.
> 
> I think we really want both length and order.
> 
> I understand how you are trying to pack them as tightly as possible.
> 
> However, I thought of a trick, we don't need to encode all possible orders.
> For example, with 2 bits of order, we can make them mean:
> 00 - 4K pages
> 01 - 2M pages
> 02 - 1G pages
> 
> guest can program the sizes for each order through config space.
> 
> We will have 10 bits left for legth.
> 

Please don't; we just got rid of the bitmap for simplification. :)

> It might make sense to also allow guest to program the number of bits used
> for order, this will make it easy to extend without host changes.
> 

There still exists the case where MAX_ORDER is configured to a large value, e.g. 36 for a system
with a huge amount of memory; then there are only 28 bits left for the pfn, which is not enough.
Should we limit MAX_ORDER? I don't think so.

It seems using order is better.

Thanks!
Liang
> --
> MST


* RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-16  1:09                             ` Dave Hansen
@ 2016-12-16  1:38                               ` Li, Liang Z
  2016-12-16  1:40                                 ` Dave Hansen
  0 siblings, 1 reply; 41+ messages in thread
From: Li, Liang Z @ 2016-12-16  1:38 UTC (permalink / raw)
  To: Hansen, Dave, Andrea Arcangeli
  Cc: David Hildenbrand, kvm, mhocko, mst, linux-kernel, qemu-devel,
	linux-mm, dgilbert, pbonzini, akpm, virtualization,
	kirill.shutemov

> On 12/15/2016 04:48 PM, Li, Liang Z wrote:
> >>> It seems we leave too many bit  for the pfn, and the bits leave for
> >>> length is not enough, How about keep 45 bits for the pfn and 19 bits
> >>> for length, 45 bits for pfn can cover 57 bits physical address, that
> >>> should be
> >> enough in the near feature.
> >>> What's your opinion?
> >> I still think 'order' makes a lot of sense.  But, as you say, 57 bits
> >> is enough for
> >> x86 for a while.  Other architectures.... who knows?
> 
> Thinking about this some more...  There are really only two cases that
> matter: 4k pages and "much bigger" ones.
> 
> Squeezing each 4k page into 8 bytes of metadata helps guarantee that this
> scheme won't regress over the old scheme in any cases.  For bigger ranges, 8
> vs 16 bytes means *nothing*.  And 16 bytes will be as good or better than
> the old scheme for everything which is >4k.
> 
> How about this:
>  * 52 bits of 'pfn', 5 bits of 'order', 7 bits of 'length'
>  * One special 'length' value to mean "actual length in next 8 bytes"
> 
> That should be pretty simple to produce and decode.  We have two record
> sizes, but I think it is manageable.

It works. Now that we intend to use another 8 bytes for the length anyway,
why not:

Use 52 bits for the 'pfn' and 12 bits for the 'length'; when 12 bits is not long enough
for the 'length', set the 'length' to a special value to indicate "actual length in next 8 bytes".

That will be much simpler. Right?
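A sketch of this 52/12 split with the escape value, assuming 'length' counts 4K pages; the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define LEN_BITS    12
/* Special value: actual length follows in the next 8 bytes. */
#define LEN_ESCAPE  ((1ULL << LEN_BITS) - 1)

/* Record layout (sketch): [ pfn:52 | len:12 ]. */
static uint64_t encode_head(uint64_t pfn, uint64_t len)
{
    uint64_t inline_len = (len < LEN_ESCAPE) ? len : LEN_ESCAPE;

    return (pfn << LEN_BITS) | inline_len;
}

/* Runs of LEN_ESCAPE (4095) pages or more spill into a second word. */
static int needs_ext_word(uint64_t len)
{
    return len >= LEN_ESCAPE;
}
```

Ranges up to just under 16MB stay in one 8-byte record; anything larger pays one extra word.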

Liang


* Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-16  1:38                               ` Li, Liang Z
@ 2016-12-16  1:40                                 ` Dave Hansen
  2016-12-16  1:43                                   ` Li, Liang Z
  2016-12-16 16:01                                   ` Andrea Arcangeli
  0 siblings, 2 replies; 41+ messages in thread
From: Dave Hansen @ 2016-12-16  1:40 UTC (permalink / raw)
  To: Li, Liang Z, Andrea Arcangeli
  Cc: David Hildenbrand, kvm, mhocko, mst, linux-kernel, qemu-devel,
	linux-mm, dgilbert, pbonzini, akpm, virtualization,
	kirill.shutemov

On 12/15/2016 05:38 PM, Li, Liang Z wrote:
> 
> > Use 52 bits for 'pfn' and 12 bits for 'length'; when the 12 bits are
> > not long enough for the 'length', set the 'length' to a special value
> > to indicate "actual length in next 8 bytes".
> > 
> > That will be much simpler. Right?

Sounds fine to me.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-16  1:40                                 ` Dave Hansen
@ 2016-12-16  1:43                                   ` Li, Liang Z
  2016-12-16 16:01                                   ` Andrea Arcangeli
  1 sibling, 0 replies; 41+ messages in thread
From: Li, Liang Z @ 2016-12-16  1:43 UTC (permalink / raw)
  To: Hansen, Dave, Andrea Arcangeli
  Cc: David Hildenbrand, kvm, mhocko, mst, linux-kernel, qemu-devel,
	linux-mm, dgilbert, pbonzini, akpm, virtualization,
	kirill.shutemov

> On 12/15/2016 05:38 PM, Li, Liang Z wrote:
> >
> > Use 52 bits for 'pfn' and 12 bits for 'length'; when the 12 bits are
> > not long enough for the 'length', set the 'length' to a special value
> > to indicate "actual length in next 8 bytes".
> >
> > That will be much simpler. Right?
> 
> Sounds fine to me.

Thanks for your inspiration!

Liang

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-16  1:12                             ` Li, Liang Z
@ 2016-12-16 15:40                               ` Andrea Arcangeli
  2016-12-17 11:56                                 ` Li, Liang Z
  0 siblings, 1 reply; 41+ messages in thread
From: Andrea Arcangeli @ 2016-12-16 15:40 UTC (permalink / raw)
  To: Li, Liang Z
  Cc: Michael S. Tsirkin, Hansen, Dave, David Hildenbrand, kvm, mhocko,
	linux-kernel, qemu-devel, linux-mm, dgilbert, pbonzini, akpm,
	virtualization, kirill.shutemov

On Fri, Dec 16, 2016 at 01:12:21AM +0000, Li, Liang Z wrote:
> There still exists the case where MAX_ORDER is configured to a large value, e.g. 36 for a system
> with a huge amount of memory; then there are only 28 bits left for the pfn, which is not enough.

Not related to the balloon but how would it help to set MAX_ORDER to
36?

What MAX_ORDER affects is that you won't be able to ask the kernel
page allocator for contiguous memory bigger than 1<<(MAX_ORDER-1)
pages, but that's a driver issue not relevant to the amount of RAM.
Drivers won't suddenly start asking the kernel allocator for compound
pages at orders >= 11 just because more RAM was added.

Put simply, the higher the MAX_ORDER, the slower the kernel runs, so
the smaller the MAX_ORDER the better.

> Should  we limit the MAX_ORDER? I don't think so.

We shouldn't strictly depend on the MAX_ORDER value, but it's mostly
limited already even if configurable at build time.

We definitely need it to reach at least the hugepage size; beyond that
it's mostly a driver issue. Drivers requiring large contiguous
allocations should rely on CMA only, or on vmalloc if they only need it
virtually contiguous, and not rely on a larger MAX_ORDER that would
slow down all kernel allocations/freeing.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-16  1:40                                 ` Dave Hansen
  2016-12-16  1:43                                   ` Li, Liang Z
@ 2016-12-16 16:01                                   ` Andrea Arcangeli
  2016-12-17 12:39                                     ` Li, Liang Z
  1 sibling, 1 reply; 41+ messages in thread
From: Andrea Arcangeli @ 2016-12-16 16:01 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Li, Liang Z, David Hildenbrand, kvm, mhocko, mst, linux-kernel,
	qemu-devel, linux-mm, dgilbert, pbonzini, akpm, virtualization,
	kirill.shutemov

On Thu, Dec 15, 2016 at 05:40:45PM -0800, Dave Hansen wrote:
> On 12/15/2016 05:38 PM, Li, Liang Z wrote:
> > 
> > Use 52 bits for 'pfn', 12 bits for 'length', when the 12 bits is not long enough for the 'length'
> > Set the 'length' to a special value to indicate the "actual length in next 8 bytes".
> > 
> > That will be much more simple. Right?
> 
> Sounds fine to me.
> 

Sounds fine to me too indeed.

I'm only wondering what the major point is of compressing gpfn+len into
8 bytes in the common case. You already use sg_init_table to send down
two pages; we could send three as well and avoid all the math, bit
shifts and ORs, or not?

I agree with the above because, from a performance perspective, I tend
to think the above proposal will run at least theoretically faster: the
other way wastes double the amount of CPU cache, and the bit mangling
in the encoding and the later decoding on the qemu side should be
faster than accessing an array of double the size. But I'm not sure
it's a measurable optimization, so I'd be curious to know the exact
motivation, and whether it is to reduce CPU cache usage or there's some
other fundamental reason to compress it.

The header already tells qemu how big the array payload is; couldn't
we just add more pages if one isn't enough?

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-16 15:40                               ` Andrea Arcangeli
@ 2016-12-17 11:56                                 ` Li, Liang Z
  0 siblings, 0 replies; 41+ messages in thread
From: Li, Liang Z @ 2016-12-17 11:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Michael S. Tsirkin, Hansen, Dave, David Hildenbrand, kvm, mhocko,
	linux-kernel, qemu-devel, linux-mm, dgilbert, pbonzini, akpm,
	virtualization, kirill.shutemov

> On Fri, Dec 16, 2016 at 01:12:21AM +0000, Li, Liang Z wrote:
> > There still exists the case where MAX_ORDER is configured to a large
> > value, e.g. 36 for a system with a huge amount of memory; then there
> > are only 28 bits left for the pfn, which is not enough.
> 
> Not related to the balloon but how would it help to set MAX_ORDER to 36?
> 

My point here is that MAX_ORDER may be configured to a big value.

> What the MAX_ORDER affects is that you won't be able to ask the kernel
> page allocator for contiguous memory bigger than 1<<(MAX_ORDER-1), but
> that's a driver issue not relevant to the amount of RAM. Drivers won't
> suddenly start to ask the kernel allocator to allocate compound pages at
> orders >= 11 just because more RAM was added.
> 
> Put simply, the higher the MAX_ORDER, the slower the kernel runs, so
> the smaller the MAX_ORDER the better.
> 
> > Should  we limit the MAX_ORDER? I don't think so.
> 
> We shouldn't strictly depend on the MAX_ORDER value, but it's mostly
> limited already even if configurable at build time.
> 

I didn't know that and will take a look, thanks for your information.


Liang
> We definitely need it to reach at least the hugepage size; beyond that
> it's mostly a driver issue. Drivers requiring large contiguous
> allocations should rely on CMA only, or on vmalloc if they only need
> it virtually contiguous, and not rely on a larger MAX_ORDER that would
> slow down all kernel allocations/freeing.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration
  2016-12-16 16:01                                   ` Andrea Arcangeli
@ 2016-12-17 12:39                                     ` Li, Liang Z
  0 siblings, 0 replies; 41+ messages in thread
From: Li, Liang Z @ 2016-12-17 12:39 UTC (permalink / raw)
  To: Andrea Arcangeli, Hansen, Dave
  Cc: David Hildenbrand, kvm, mhocko, mst, linux-kernel, qemu-devel,
	linux-mm, dgilbert, pbonzini, akpm, virtualization,
	kirill.shutemov

> Subject: Re: [Qemu-devel] [PATCH kernel v5 0/5] Extend virtio-balloon for
> fast (de)inflating & fast live migration
> 
> On Thu, Dec 15, 2016 at 05:40:45PM -0800, Dave Hansen wrote:
> > On 12/15/2016 05:38 PM, Li, Liang Z wrote:
> > >
> > > Use 52 bits for 'pfn' and 12 bits for 'length'; when the 12 bits
> > > are not long enough for the 'length', set the 'length' to a special
> > > value to indicate "actual length in next 8 bytes".
> > >
> > > That will be much simpler. Right?
> >
> > Sounds fine to me.
> >
> 
> Sounds fine to me too indeed.
> 
> I'm only wondering what the major point is of compressing gpfn+len
> into 8 bytes in the common case. You already use sg_init_table to send
> down two pages; we could send three as well and avoid all the math,
> bit shifts and ORs, or not?
> 

Yes, we can use more pages for that.

> I agree with the above because, from a performance perspective, I tend
> to think the above proposal will run at least theoretically faster:
> the other way wastes double the amount of CPU cache, and the bit
> mangling in the encoding and the later decoding on the qemu side
> should be faster than accessing an array of double the size. But I'm
> not sure it's a measurable optimization, so I'd be curious to know the
> exact motivation, and whether it is to reduce CPU cache usage or
> there's some other fundamental reason to compress it.
> The header already tells qemu how big the array payload is; couldn't
> we just add more pages if one isn't enough?
> 

The original intention of compressing the PFN and length was to reduce the memory required.
Even though the code has changed a lot since the previous versions, I think this is still true.

Currently we allocate a buffer of a specified size to save the 'PFN|length' records; when the
buffer is not big enough to hold all the page info for a given order, a buffer of double the
size will be allocated. This is what we want to avoid, because the allocation may fail and
takes some time. For fast live migration, time is a critical factor: the more time taken, the
more unnecessary pages are sent, because live migration starts before the request for unused
pages gets a response.

Thanks

Liang

> Thanks,
> Andrea

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2016-12-17 12:39 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-30  8:43 [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration Liang Li
2016-11-30  8:43 ` [PATCH kernel v5 1/5] virtio-balloon: rework deflate to add page to a list Liang Li
2016-11-30  8:43 ` [PATCH kernel v5 2/5] virtio-balloon: define new feature bit and head struct Liang Li
2016-11-30  8:43 ` [PATCH kernel v5 3/5] virtio-balloon: speed up inflate/deflate process Liang Li
2016-11-30  8:43 ` [PATCH kernel v5 4/5] virtio-balloon: define flags and head for host request vq Liang Li
2016-11-30  8:43 ` [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info Liang Li
2016-11-30 19:15   ` Dave Hansen
2016-12-04 13:13     ` Li, Liang Z
2016-12-05 17:22       ` Dave Hansen
2016-12-06  4:47         ` Li, Liang Z
2016-12-06  8:40 ` [PATCH kernel v5 0/5] Extend virtio-balloon for fast (de)inflating & fast live migration David Hildenbrand
2016-12-07 13:35   ` Li, Liang Z
2016-12-07 15:34     ` Dave Hansen
2016-12-09  3:09       ` Li, Liang Z
2016-12-07 15:42     ` David Hildenbrand
2016-12-07 15:45       ` Dave Hansen
2016-12-07 16:21         ` David Hildenbrand
2016-12-07 16:57           ` Dave Hansen
2016-12-07 18:38             ` [Qemu-devel] " Andrea Arcangeli
2016-12-07 18:44               ` Dave Hansen
2016-12-07 18:58                 ` Andrea Arcangeli
2016-12-07 19:54               ` Dave Hansen
2016-12-07 20:28                 ` Andrea Arcangeli
2016-12-09  4:45                   ` Li, Liang Z
2016-12-09  4:53                     ` Dave Hansen
2016-12-09  5:35                       ` Li, Liang Z
2016-12-09 16:42                         ` Andrea Arcangeli
2016-12-14  8:20                           ` Li, Liang Z
2016-12-14  8:59                       ` Li, Liang Z
2016-12-15 15:34                         ` Dave Hansen
2016-12-15 15:54                           ` Michael S. Tsirkin
2016-12-16  1:12                             ` Li, Liang Z
2016-12-16 15:40                               ` Andrea Arcangeli
2016-12-17 11:56                                 ` Li, Liang Z
2016-12-16  0:48                           ` Li, Liang Z
2016-12-16  1:09                             ` Dave Hansen
2016-12-16  1:38                               ` Li, Liang Z
2016-12-16  1:40                                 ` Dave Hansen
2016-12-16  1:43                                   ` Li, Liang Z
2016-12-16 16:01                                   ` Andrea Arcangeli
2016-12-17 12:39                                     ` Li, Liang Z
