* [Qemu-devel] [RFC 0/5] Improve balloon handling of pagesizes other than 4kiB
@ 2018-10-12  3:24 David Gibson
  2018-10-12  3:24 ` [Qemu-devel] [RFC 1/5] virtio-balloon: Remove unnecessary MADV_WILLNEED on deflate David Gibson
                   ` (5 more replies)
  0 siblings, 6 replies; 31+ messages in thread
From: David Gibson @ 2018-10-12  3:24 UTC (permalink / raw)
  To: dhildenb, imammedo, ehabkost
  Cc: mst, pbonzini, qemu-devel, qemu-ppc, David Gibson

The virtio-balloon device was never really designed for cases other than
a 4kiB page size on both guest and host.  It works in some cases, but in
others it can be ineffectual or even cause guest memory corruption.

This series makes a handful of preliminary cleanups, then makes a
change to safely, though not perfectly, handle cases with non-4kiB
page sizes.

David Gibson (5):
  virtio-balloon: Remove unnecessary MADV_WILLNEED on deflate
  virtio-balloon: Corrections to address verification
  virtio-balloon: Rework ballon_page() interface
  virtio-balloon: Use ram_block_discard_range() instead of raw madvise()
  virtio-balloon: Safely handle BALLOON_PAGE_SIZE < host page size

 hw/virtio/virtio-balloon.c         | 100 ++++++++++++++++++++++++-----
 include/hw/virtio/virtio-balloon.h |   3 +
 2 files changed, 87 insertions(+), 16 deletions(-)

-- 
2.17.1

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Qemu-devel] [RFC 1/5] virtio-balloon: Remove unnecessary MADV_WILLNEED on deflate
  2018-10-12  3:24 [Qemu-devel] [RFC 0/5] Improve balloon handling of pagesizes other than 4kiB David Gibson
@ 2018-10-12  3:24 ` David Gibson
  2018-10-12  7:40   ` David Hildenbrand
                     ` (2 more replies)
  2018-10-12  3:24 ` [Qemu-devel] [RFC 2/5] virtio-balloon: Corrections to address verification David Gibson
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 31+ messages in thread
From: David Gibson @ 2018-10-12  3:24 UTC (permalink / raw)
  To: dhildenb, imammedo, ehabkost
  Cc: mst, pbonzini, qemu-devel, qemu-ppc, David Gibson

When the balloon is inflated, we discard the memory placed in it using
madvise() with MADV_DONTNEED.  When we deflate it, we use MADV_WILLNEED,
which sounds sensible but is actually unnecessary.

The misleadingly named MADV_DONTNEED just discards the memory in question;
it doesn't set any persistent in-kernel state on it.  All that's necessary
to bring the memory back is to touch it.  MADV_WILLNEED, in contrast,
specifically says that the memory will be used soon and faults it in.

Memory that's being given back to the guest by deflating the balloon
*might* be used soon, but it could equally just sit around in the guest's
pools until the guest actually needs it.  And, over the general timescale
on which memory ballooning operates, it seems unlikely that one extra
fault per page will be a significant performance issue for the guest.

So, simplify the balloon's operation by dropping the madvise() on deflate.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/virtio/virtio-balloon.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 1728e4f83a..6ec4bcf4e1 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -35,9 +35,8 @@
 
 static void balloon_page(void *addr, int deflate)
 {
-    if (!qemu_balloon_is_inhibited()) {
-        qemu_madvise(addr, BALLOON_PAGE_SIZE,
-                deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
+    if (!qemu_balloon_is_inhibited() && !deflate) {
+        qemu_madvise(addr, BALLOON_PAGE_SIZE, QEMU_MADV_DONTNEED);
     }
 }
 
-- 
2.17.1


* [Qemu-devel] [RFC 2/5] virtio-balloon: Corrections to address verification
  2018-10-12  3:24 [Qemu-devel] [RFC 0/5] Improve balloon handling of pagesizes other than 4kiB David Gibson
  2018-10-12  3:24 ` [Qemu-devel] [RFC 1/5] virtio-balloon: Remove unnecessary MADV_WILLNEED on deflate David Gibson
@ 2018-10-12  3:24 ` David Gibson
  2018-10-12  7:44   ` David Hildenbrand
  2018-10-12  3:24 ` [Qemu-devel] [RFC 3/5] virtio-balloon: Rework ballon_page() interface David Gibson
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 31+ messages in thread
From: David Gibson @ 2018-10-12  3:24 UTC (permalink / raw)
  To: dhildenb, imammedo, ehabkost
  Cc: mst, pbonzini, qemu-devel, qemu-ppc, David Gibson

The virtio-balloon device's verification of the address given to it by the
guest has a number of faults:
    * The addresses here are guest physical addresses, which should be
      'hwaddr' rather than 'ram_addr_t' (the distinction is admittedly
      pretty subtle and confusing)
    * We don't check for section.mr being NULL, which is the main way that
      memory_region_find() reports basic failures.  We really need to check
      that before looking at any other section fields, because
      memory_region_find() doesn't initialize them on the failure path
    * We're passing a length of '1' to memory_region_find(), but really the
      guest is requesting that we put the entire page into the balloon,
      so it makes more sense to call it with BALLOON_PAGE_SIZE

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/virtio/virtio-balloon.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 6ec4bcf4e1..e8611aab0e 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -221,17 +221,20 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
         }
 
         while (iov_to_buf(elem->out_sg, elem->out_num, offset, &pfn, 4) == 4) {
-            ram_addr_t pa;
-            ram_addr_t addr;
+            hwaddr pa;
+            hwaddr addr;
             int p = virtio_ldl_p(vdev, &pfn);
 
-            pa = (ram_addr_t) p << VIRTIO_BALLOON_PFN_SHIFT;
+            pa = (hwaddr) p << VIRTIO_BALLOON_PFN_SHIFT;
             offset += 4;
 
-            /* FIXME: remove get_system_memory(), but how? */
-            section = memory_region_find(get_system_memory(), pa, 1);
-            if (!int128_nz(section.size) ||
-                !memory_region_is_ram(section.mr) ||
+            section = memory_region_find(get_system_memory(), pa,
+                                         BALLOON_PAGE_SIZE);
+            if (!section.mr) {
+                trace_virtio_balloon_bad_addr(pa);
+                continue;
+            }
+            if (!memory_region_is_ram(section.mr) ||
                 memory_region_is_rom(section.mr) ||
                 memory_region_is_romd(section.mr)) {
                 trace_virtio_balloon_bad_addr(pa);
-- 
2.17.1


* [Qemu-devel] [RFC 3/5] virtio-balloon: Rework ballon_page() interface
  2018-10-12  3:24 [Qemu-devel] [RFC 0/5] Improve balloon handling of pagesizes other than 4kiB David Gibson
  2018-10-12  3:24 ` [Qemu-devel] [RFC 1/5] virtio-balloon: Remove unnecessary MADV_WILLNEED on deflate David Gibson
  2018-10-12  3:24 ` [Qemu-devel] [RFC 2/5] virtio-balloon: Corrections to address verification David Gibson
@ 2018-10-12  3:24 ` David Gibson
  2018-10-12  7:46   ` David Hildenbrand
  2018-10-12  3:24 ` [Qemu-devel] [RFC 4/5] virtio-balloon: Use ram_block_discard_range() instead of raw madvise() David Gibson
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 31+ messages in thread
From: David Gibson @ 2018-10-12  3:24 UTC (permalink / raw)
  To: dhildenb, imammedo, ehabkost
  Cc: mst, pbonzini, qemu-devel, qemu-ppc, David Gibson

This replaces the internal balloon_page() interface with
balloon_inflate_page(), which takes slightly different parameters.  The
new interface will make future alterations simpler.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/virtio/virtio-balloon.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index e8611aab0e..7229afad6e 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -33,11 +33,11 @@
 
 #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
 
-static void balloon_page(void *addr, int deflate)
+static void balloon_inflate_page(VirtIOBalloon *balloon,
+                                 MemoryRegion *mr, hwaddr offset)
 {
-    if (!qemu_balloon_is_inhibited() && !deflate) {
-        qemu_madvise(addr, BALLOON_PAGE_SIZE, QEMU_MADV_DONTNEED);
-    }
+    void *addr = memory_region_get_ram_ptr(mr) + offset;
+    qemu_madvise(addr, BALLOON_PAGE_SIZE, QEMU_MADV_DONTNEED);
 }
 
 static const char *balloon_stat_names[] = {
@@ -222,7 +222,6 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 
         while (iov_to_buf(elem->out_sg, elem->out_num, offset, &pfn, 4) == 4) {
             hwaddr pa;
-            hwaddr addr;
             int p = virtio_ldl_p(vdev, &pfn);
 
             pa = (hwaddr) p << VIRTIO_BALLOON_PFN_SHIFT;
@@ -244,11 +243,9 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 
             trace_virtio_balloon_handle_output(memory_region_name(section.mr),
                                                pa);
-            /* Using memory_region_get_ram_ptr is bending the rules a bit, but
-               should be OK because we only want a single page.  */
-            addr = section.offset_within_region;
-            balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
-                         !!(vq == s->dvq));
+            if (!qemu_balloon_is_inhibited() && vq != s->dvq) {
+                balloon_inflate_page(s, section.mr, section.offset_within_region);
+            }
             memory_region_unref(section.mr);
         }
 
-- 
2.17.1


* [Qemu-devel] [RFC 4/5] virtio-balloon: Use ram_block_discard_range() instead of raw madvise()
  2018-10-12  3:24 [Qemu-devel] [RFC 0/5] Improve balloon handling of pagesizes other than 4kiB David Gibson
                   ` (2 preceding siblings ...)
  2018-10-12  3:24 ` [Qemu-devel] [RFC 3/5] virtio-balloon: Rework ballon_page() interface David Gibson
@ 2018-10-12  3:24 ` David Gibson
  2018-10-12  3:24 ` [Qemu-devel] [RFC 5/5] virtio-balloon: Safely handle BALLOON_PAGE_SIZE < host page size David Gibson
  2018-10-12 17:26 ` [Qemu-devel] [RFC 0/5] Improve balloon handling of pagesizes other than 4kiB Michael S. Tsirkin
  5 siblings, 0 replies; 31+ messages in thread
From: David Gibson @ 2018-10-12  3:24 UTC (permalink / raw)
  To: dhildenb, imammedo, ehabkost
  Cc: mst, pbonzini, qemu-devel, qemu-ppc, David Gibson

Currently, virtio-balloon uses madvise() with MADV_DONTNEED to actually
discard RAM pages inserted into the balloon.  This is basically a
Linux-only interface (MADV_DONTNEED exists on some other platforms, but
doesn't always have the same semantics).  It also doesn't work on
hugepages and has some other limitations.

It turns out that postcopy also needs to discard chunks of memory, and uses
a better interface for it: ram_block_discard_range().  It doesn't cover
every case, but it covers more than going direct to madvise() and this
gives us a single place to update for more possibilities in future.

There are some subtleties here to maintain the current balloon behaviour:

* For now, we just ignore requests to balloon in a hugepage backed region.
  That matches current behaviour, because MADV_DONTNEED on a hugepage would
  simply fail, and we ignore the error.

* If the host page size is > BALLOON_PAGE_SIZE we can frequently be called
  on non-host-page-aligned addresses.  These would also fail in madvise(),
  which we then ignored.  ram_block_discard_range() reports an error for
  calls on unaligned addresses, so we explicitly check for that case to
  avoid spamming the logs.

* We now call ram_block_discard_range() with the *host* page size, whereas
  we previously called madvise() with BALLOON_PAGE_SIZE.  Surprisingly,
  this also matches existing behaviour.  Although the kernel fails madvise
  on unaligned addresses, it will round unaligned sizes *up* to the host
  page size.  Yes, this means that if BALLOON_PAGE_SIZE < guest page size
  we can incorrectly discard more memory than the guest asked us to.  I'm
  planning to address that soon.

Errors other than the ones discussed above will now be reported by
ram_block_discard_range(), rather than silently ignored, which means we
have a much better chance of noticing when something goes wrong.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/virtio/virtio-balloon.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 7229afad6e..4435905c87 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -37,7 +37,29 @@ static void balloon_inflate_page(VirtIOBalloon *balloon,
                                  MemoryRegion *mr, hwaddr offset)
 {
     void *addr = memory_region_get_ram_ptr(mr) + offset;
-    qemu_madvise(addr, BALLOON_PAGE_SIZE, QEMU_MADV_DONTNEED);
+    RAMBlock *rb;
+    size_t rb_page_size;
+    ram_addr_t ram_offset;
+
+    /* XXX is there a better way to get to the RAMBlock than via a
+     * host address? */
+    rb = qemu_ram_block_from_host(addr, false, &ram_offset);
+    rb_page_size = qemu_ram_pagesize(rb);
+
+    /* Silently ignore hugepage RAM blocks */
+    if (rb_page_size != getpagesize()) {
+        return;
+    }
+
+    /* Silently ignore unaligned requests */
+    if (ram_offset & (rb_page_size - 1)) {
+        return;
+    }
+
+    ram_block_discard_range(rb, ram_offset, rb_page_size);
+    /* We ignore errors from ram_block_discard_range(), because it has
+     * already reported them, and failing to discard a balloon page is
+     * not fatal */
 }
 
 static const char *balloon_stat_names[] = {
-- 
2.17.1


* [Qemu-devel] [RFC 5/5] virtio-balloon: Safely handle BALLOON_PAGE_SIZE < host page size
  2018-10-12  3:24 [Qemu-devel] [RFC 0/5] Improve balloon handling of pagesizes other than 4kiB David Gibson
                   ` (3 preceding siblings ...)
  2018-10-12  3:24 ` [Qemu-devel] [RFC 4/5] virtio-balloon: Use ram_block_discard_range() instead of raw madvise() David Gibson
@ 2018-10-12  3:24 ` David Gibson
  2018-10-12  8:06   ` David Hildenbrand
  2018-10-12  8:32   ` David Hildenbrand
  2018-10-12 17:26 ` [Qemu-devel] [RFC 0/5] Improve balloon handling of pagesizes other than 4kiB Michael S. Tsirkin
  5 siblings, 2 replies; 31+ messages in thread
From: David Gibson @ 2018-10-12  3:24 UTC (permalink / raw)
  To: dhildenb, imammedo, ehabkost
  Cc: mst, pbonzini, qemu-devel, qemu-ppc, David Gibson

The virtio-balloon always works in units of 4kiB (BALLOON_PAGE_SIZE), but
on the host side, we can only actually discard memory in units of the host
page size.

At present we handle this very badly: we silently ignore balloon requests
that aren't host page aligned, and for requests that are host page aligned
we discard the entire host page.  The latter potentially corrupts guest
memory if its page size is smaller than the host's.

We could just disable the balloon if the host page size is not 4kiB, but
that would break the special case where host and guest have the same page
size, but one larger than 4kiB.  This case currently works by accident:
when the guest puts one of its pages into the balloon, it will submit
balloon requests for each 4kiB subpage.  Most will be ignored, but the one
which happens to be host page aligned will discard the whole lot.

This occurs in practice routinely for POWER KVM systems, since both host
and guest typically use 64kiB pages.

To make this safe, without breaking that useful case, we need to
accumulate 4kiB balloon requests until we have a whole contiguous host page
at which point we can discard it.

We could in principle do that across all guest memory, but it would require
a large bitmap to track.  This patch represents a compromise: instead we
track ballooned subpages for a single contiguous host page at a time.  This
means that if the guest discards all 4kiB chunks of a host page in
succession, we will discard it.  In particular that means the balloon will
continue to work for the (host page size) == (guest page size) > 4kiB case.

If the guest scatters 4kiB requests across different host pages, we don't
discard anything, and issue a warning.  Not ideal, but at least we don't
corrupt guest memory as the previous version could.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/virtio/virtio-balloon.c         | 67 +++++++++++++++++++++++++-----
 include/hw/virtio/virtio-balloon.h |  3 ++
 2 files changed, 60 insertions(+), 10 deletions(-)

diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 4435905c87..39573ef2e3 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -33,33 +33,80 @@
 
 #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
 
+typedef struct PartiallyBalloonedPage {
+    RAMBlock *rb;
+    ram_addr_t base;
+    unsigned long bitmap[];
+} PartiallyBalloonedPage;
+
 static void balloon_inflate_page(VirtIOBalloon *balloon,
                                  MemoryRegion *mr, hwaddr offset)
 {
     void *addr = memory_region_get_ram_ptr(mr) + offset;
     RAMBlock *rb;
     size_t rb_page_size;
-    ram_addr_t ram_offset;
+    int subpages;
+    ram_addr_t ram_offset, host_page_base;
 
     /* XXX is there a better way to get to the RAMBlock than via a
      * host address? */
     rb = qemu_ram_block_from_host(addr, false, &ram_offset);
     rb_page_size = qemu_ram_pagesize(rb);
+    host_page_base = ram_offset & ~(rb_page_size - 1);
+
+    if (rb_page_size == BALLOON_PAGE_SIZE) {
+        /* Easy case */
 
-    /* Silently ignore hugepage RAM blocks */
-    if (rb_page_size != getpagesize()) {
+        ram_block_discard_range(rb, ram_offset, rb_page_size);
+        /* We ignore errors from ram_block_discard_range(), because it
+         * has already reported them, and failing to discard a balloon
+         * page is not fatal */
         return;
     }
 
-    /* Silently ignore unaligned requests */
-    if (ram_offset & (rb_page_size - 1)) {
-        return;
+    /* Hard case
+     *
+     * We've put a piece of a larger host page into the balloon - we
+     * need to keep track until we have a whole host page to
+     * discard
+     */
+    subpages = rb_page_size / BALLOON_PAGE_SIZE;
+
+    if (balloon->pbp
+        && (rb != balloon->pbp->rb
+            || host_page_base != balloon->pbp->base)) {
+        /* We've partially ballooned part of a host page, but now
+         * we're trying to balloon part of a different one.  Too hard,
+         * give up on the old partial page */
+        warn_report("Unable to insert a partial page into virtio-balloon");
+        g_free(balloon->pbp);
+        balloon->pbp = NULL;
     }
 
-    ram_block_discard_range(rb, ram_offset, rb_page_size);
-    /* We ignore errors from ram_block_discard_range(), because it has
-     * already reported them, and failing to discard a balloon page is
-     * not fatal */
+    if (!balloon->pbp) {
+        /* Starting on a new host page */
+        size_t bitlen = BITS_TO_LONGS(subpages) * sizeof(unsigned long);
+        balloon->pbp = g_malloc0(sizeof(PartiallyBalloonedPage) + bitlen);
+        balloon->pbp->rb = rb;
+        balloon->pbp->base = host_page_base;
+    }
+
+    bitmap_set(balloon->pbp->bitmap,
+               (ram_offset - balloon->pbp->base) / BALLOON_PAGE_SIZE,
+               subpages);
+
+    if (bitmap_full(balloon->pbp->bitmap, subpages)) {
+        /* We've accumulated a full host page, we can actually discard
+         * it now */
+
+        ram_block_discard_range(rb, balloon->pbp->base, rb_page_size);
+        /* We ignore errors from ram_block_discard_range(), because it
+         * has already reported them, and failing to discard a balloon
+         * page is not fatal */
+
+        g_free(balloon->pbp);
+        balloon->pbp = NULL;
+    }
 }
 
 static const char *balloon_stat_names[] = {
diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
index e0df3528c8..99dcd6d105 100644
--- a/include/hw/virtio/virtio-balloon.h
+++ b/include/hw/virtio/virtio-balloon.h
@@ -30,6 +30,8 @@ typedef struct virtio_balloon_stat_modern {
        uint64_t val;
 } VirtIOBalloonStatModern;
 
+typedef struct PartiallyBalloonedPage PartiallyBalloonedPage;
+
 typedef struct VirtIOBalloon {
     VirtIODevice parent_obj;
     VirtQueue *ivq, *dvq, *svq;
@@ -42,6 +44,7 @@ typedef struct VirtIOBalloon {
     int64_t stats_last_update;
     int64_t stats_poll_interval;
     uint32_t host_features;
+    PartiallyBalloonedPage *pbp;
 } VirtIOBalloon;
 
 #endif
-- 
2.17.1


* Re: [Qemu-devel] [RFC 1/5] virtio-balloon: Remove unnecessary MADV_WILLNEED on deflate
  2018-10-12  3:24 ` [Qemu-devel] [RFC 1/5] virtio-balloon: Remove unnecessary MADV_WILLNEED on deflate David Gibson
@ 2018-10-12  7:40   ` David Hildenbrand
  2018-10-13  6:26     ` David Gibson
  2018-10-12 17:41   ` Richard Henderson
  2018-10-12 18:05   ` Michael S. Tsirkin
  2 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2018-10-12  7:40 UTC (permalink / raw)
  To: David Gibson, dhildenb, imammedo, ehabkost
  Cc: mst, pbonzini, qemu-devel, qemu-ppc

On 12/10/2018 05:24, David Gibson wrote:
> When the balloon is inflated, we discard the memory placed in it using
> madvise() with MADV_DONTNEED.  When we deflate it, we use MADV_WILLNEED,
> which sounds sensible but is actually unnecessary.
> 
> The misleadingly named MADV_DONTNEED just discards the memory in question,
> it doesn't set any persistent state on it in-kernel; all that's necessary
> to bring the memory back is to touch it.  MADV_WILLNEED in contrast
> specifically says that the memory will be used soon and faults it in.
> 
> Memory that's being given back to the guest by deflating the balloon
> *might* be used soon, but it equally could just sit around in the guest's
> pools until it actually needs it.  And, over the general timescale that
> memory ballooning operates, it seems unlikely that one extra fault for the
> guest will be a vast performance issue.
> 
> So, simplify the balloon's operation by dropping the madvise() on deflate.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/virtio/virtio-balloon.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 1728e4f83a..6ec4bcf4e1 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -35,9 +35,8 @@
>  
>  static void balloon_page(void *addr, int deflate)
>  {
> -    if (!qemu_balloon_is_inhibited()) {
> -        qemu_madvise(addr, BALLOON_PAGE_SIZE,
> -                deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> +    if (!qemu_balloon_is_inhibited() && !deflate) {
> +        qemu_madvise(addr, BALLOON_PAGE_SIZE, QEMU_MADV_DONTNEED);
>      }
>  }
>  
> 

Care to rename the function and drop the deflate parameter?
(call it only when actually inflating)

apart from that

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 

Thanks,

David / dhildenb


* Re: [Qemu-devel] [RFC 2/5] virtio-balloon: Corrections to address verification
  2018-10-12  3:24 ` [Qemu-devel] [RFC 2/5] virtio-balloon: Corrections to address verification David Gibson
@ 2018-10-12  7:44   ` David Hildenbrand
  2018-10-13  6:25     ` David Gibson
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2018-10-12  7:44 UTC (permalink / raw)
  To: David Gibson, dhildenb, imammedo, ehabkost
  Cc: mst, pbonzini, qemu-devel, qemu-ppc

On 12/10/2018 05:24, David Gibson wrote:
> The virtio-balloon device's verification of the address given to it by the
> guest has a number of faults:
>     * The addresses here are guest physical addresses, which should be
>       'hwaddr' rather than 'ram_addr_t' (the distinction is admittedly
>       pretty subtle and confusing)
>     * We don't check for section.mr being NULL, which is the main way that
>       memory_region_find() reports basic failures.  We really need to check
>       that before looking at any other section fields, because
>       memory_region_find() doesn't initialize them on the failure path
>     * We're passing a length of '1' to memory_region_find(), but really the
>       guest is requesting that we put the entire page into the balloon,
>       so it makes more sense to call it with BALLOON_PAGE_SIZE
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/virtio/virtio-balloon.c | 17 ++++++++++-------
>  1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 6ec4bcf4e1..e8611aab0e 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -221,17 +221,20 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>          }
>  
>          while (iov_to_buf(elem->out_sg, elem->out_num, offset, &pfn, 4) == 4) {
> -            ram_addr_t pa;
> -            ram_addr_t addr;
> +            hwaddr pa;
> +            hwaddr addr;
>              int p = virtio_ldl_p(vdev, &pfn);
>  
> -            pa = (ram_addr_t) p << VIRTIO_BALLOON_PFN_SHIFT;
> +            pa = (hwaddr) p << VIRTIO_BALLOON_PFN_SHIFT;
>              offset += 4;
>  
> -            /* FIXME: remove get_system_memory(), but how? */

Should we leave that fixme? (on the other hand, virtio-balloon operates
on all system memory, so I also don't really see a way around this ...)

> -            section = memory_region_find(get_system_memory(), pa, 1);
> -            if (!int128_nz(section.size) ||
> -                !memory_region_is_ram(section.mr) ||
> +            section = memory_region_find(get_system_memory(), pa,
> +                                         BALLOON_PAGE_SIZE);
> +            if (!section.mr) {
> +                trace_virtio_balloon_bad_addr(pa);
> +                continue;
> +            }
> +            if (!memory_region_is_ram(section.mr) ||
>                  memory_region_is_rom(section.mr) ||
>                  memory_region_is_romd(section.mr)) {
>                  trace_virtio_balloon_bad_addr(pa);
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 

Thanks,

David / dhildenb


* Re: [Qemu-devel] [RFC 3/5] virtio-balloon: Rework ballon_page() interface
  2018-10-12  3:24 ` [Qemu-devel] [RFC 3/5] virtio-balloon: Rework ballon_page() interface David Gibson
@ 2018-10-12  7:46   ` David Hildenbrand
  2018-10-13  6:29     ` David Gibson
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2018-10-12  7:46 UTC (permalink / raw)
  To: David Gibson, dhildenb, imammedo, ehabkost
  Cc: mst, pbonzini, qemu-devel, qemu-ppc

On 12/10/2018 05:24, David Gibson wrote:
> This replaces the balloon_page() internal interface with
> ballon_inflate_page(), with a slightly different interface.  The new
> interface will make future alterations simpler.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/virtio/virtio-balloon.c | 17 +++++++----------
>  1 file changed, 7 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index e8611aab0e..7229afad6e 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -33,11 +33,11 @@
>  
>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
>  
> -static void balloon_page(void *addr, int deflate)
> +static void balloon_inflate_page(VirtIOBalloon *balloon,
> +                                 MemoryRegion *mr, hwaddr offset)
>  {
> -    if (!qemu_balloon_is_inhibited() && !deflate) {
> -        qemu_madvise(addr, BALLOON_PAGE_SIZE, QEMU_MADV_DONTNEED);
> -    }
> +    void *addr = memory_region_get_ram_ptr(mr) + offset;

I prefer an empty line here

> +    qemu_madvise(addr, BALLOON_PAGE_SIZE, QEMU_MADV_DONTNEED);
>  }
>  
>  static const char *balloon_stat_names[] = {
> @@ -222,7 +222,6 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>  
>          while (iov_to_buf(elem->out_sg, elem->out_num, offset, &pfn, 4) == 4) {
>              hwaddr pa;
> -            hwaddr addr;
>              int p = virtio_ldl_p(vdev, &pfn);
>  
>              pa = (hwaddr) p << VIRTIO_BALLOON_PFN_SHIFT;
> @@ -244,11 +243,9 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>  
>              trace_virtio_balloon_handle_output(memory_region_name(section.mr),
>                                                 pa);
> -            /* Using memory_region_get_ram_ptr is bending the rules a bit, but
> -               should be OK because we only want a single page.  */
> -            addr = section.offset_within_region;
> -            balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
> -                         !!(vq == s->dvq));
> +            if (!qemu_balloon_is_inhibited() && vq != s->dvq) {
> +                balloon_inflate_page(s, section.mr, section.offset_within_region);
> +            }
>              memory_region_unref(section.mr);
>          }
>  
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 

Thanks,

David / dhildenb


* Re: [Qemu-devel] [RFC 5/5] virtio-balloon: Safely handle BALLOON_PAGE_SIZE < host page size
  2018-10-12  3:24 ` [Qemu-devel] [RFC 5/5] virtio-balloon: Safely handle BALLOON_PAGE_SIZE < host page size David Gibson
@ 2018-10-12  8:06   ` David Hildenbrand
  2018-10-13  6:40     ` David Gibson
  2018-10-12  8:32   ` David Hildenbrand
  1 sibling, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2018-10-12  8:06 UTC (permalink / raw)
  To: David Gibson, dhildenb, imammedo, ehabkost
  Cc: mst, pbonzini, qemu-devel, qemu-ppc

On 12/10/2018 05:24, David Gibson wrote:
> The virtio-balloon always works in units of 4kiB (BALLOON_PAGE_SIZE), but
> on the host side, we can only actually discard memory in units of the host
> page size.
> 
> At present we handle this very badly: we silently ignore balloon requests
> that aren't host page aligned, and for requests that are host page aligned
> we discard the entire host page.  The latter potentially corrupts guest
> memory if its page size is smaller than the host's.
> 
> We could just disable the balloon if the host page size is not 4kiB, but
> that would break the special case where host and guest have the same page
> size, but one larger than 4kiB.  This case currently works by accident:
> when the guest puts its page into the balloon, it will submit balloon
> requests for each 4kiB subpage.  Most will be ignored, but the one which
> happens to be host page aligned will discard the whole lot.

Actually I would vote for disabling it if the host page size is not 4k.
This is broken by design and should not be worked around.

I consider 64k a similar case as huge pages in the hypervisor (from a
virtio-balloon interface point of view - which is broken by design for
these cases).

> 
> This occurs in practice routinely for POWER KVM systems, since both host
> and guest typically use 64kiB pages.
> 
> To make this safe, without breaking that useful case, we need to
> accumulate 4kiB balloon requests until we have a whole contiguous host page
> at which point we can discard it.

... and if you have a 4k guest it will just randomly report pages and
your 67 LOC tweak is useless. And this can and will easily happen.

> 
> We could in principle do that across all guest memory, but it would require
> a large bitmap to track.  This patch represents a compromise: instead we

Oh god no, no tracking of all memory. (size of bitmap, memory hotplug,
... migration?)

> track ballooned subpages for a single contiguous host page at a time.  This
> means that if the guest discards all 4kiB chunks of a host page in
> succession, we will discard it.  In particular that means the balloon will
> continue to work for the (host page size) == (guest page size) > 4kiB case.
> 
> If the guest scatters 4kiB requests across different host pages, we don't
> discard anything, and issue a warning.  Not ideal, but at least we don't
> corrupt guest memory as the previous version could.

My take would be to somehow disable the balloon on the hypervisor side
in case the host page size is not 4k, for example by not allowing the
device to realize.  No corruptions, no warnings people will ignore.

> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/virtio/virtio-balloon.c         | 67 +++++++++++++++++++++++++-----
>  include/hw/virtio/virtio-balloon.h |  3 ++
>  2 files changed, 60 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 4435905c87..39573ef2e3 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -33,33 +33,80 @@
>  
>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
>  
> +typedef struct PartiallyBalloonedPage {
> +    RAMBlock *rb;
> +    ram_addr_t base;
> +    unsigned long bitmap[];
> +} PartiallyBalloonedPage;
> +
>  static void balloon_inflate_page(VirtIOBalloon *balloon,
>                                   MemoryRegion *mr, hwaddr offset)
>  {
>      void *addr = memory_region_get_ram_ptr(mr) + offset;
>      RAMBlock *rb;
>      size_t rb_page_size;
> -    ram_addr_t ram_offset;
> +    int subpages;
> +    ram_addr_t ram_offset, host_page_base;
>  
>      /* XXX is there a better way to get to the RAMBlock than via a
>       * host address? */
>      rb = qemu_ram_block_from_host(addr, false, &ram_offset);
>      rb_page_size = qemu_ram_pagesize(rb);
> +    host_page_base = ram_offset & ~(rb_page_size - 1);
> +
> +    if (rb_page_size == BALLOON_PAGE_SIZE) {
> +        /* Easy case */
>  
> -    /* Silently ignore hugepage RAM blocks */
> -    if (rb_page_size != getpagesize()) {
> +        ram_block_discard_range(rb, ram_offset, rb_page_size);
> +        /* We ignore errors from ram_block_discard_range(), because it
> +         * has already reported them, and failing to discard a balloon
> +         * page is not fatal */
>          return;
>      }
>  
> -    /* Silently ignore unaligned requests */
> -    if (ram_offset & (rb_page_size - 1)) {
> -        return;
> +    /* Hard case
> +     *
> +     * We've put a piece of a larger host page into the balloon - we
> +     * need to keep track until we have a whole host page to
> +     * discard
> +     */
> +    subpages = rb_page_size / BALLOON_PAGE_SIZE;
> +
> +    if (balloon->pbp
> +        && (rb != balloon->pbp->rb
> +            || host_page_base != balloon->pbp->base)) {
> +        /* We've partially ballooned part of a host page, but now
> +         * we're trying to balloon part of a different one.  Too hard,
> +         * give up on the old partial page */
> +        warn_report("Unable to insert a partial page into virtio-balloon");

I am pretty sure that you can trigger misleading warnings if you
migrate at the wrong time (migrate while half the 64k page is inflated;
on the new host the other half is inflated, so you get a warning when
switching to another 64k page).

> +        free(balloon->pbp);
> +        balloon->pbp = NULL;
>      }
>  
> -    ram_block_discard_range(rb, ram_offset, rb_page_size);
> -    /* We ignore errors from ram_block_discard_range(), because it has
> -     * already reported them, and failing to discard a balloon page is
> -     * not fatal */
> +    if (!balloon->pbp) {
> +        /* Starting on a new host page */
> +        size_t bitlen = BITS_TO_LONGS(subpages) * sizeof(unsigned long);
> +        balloon->pbp = g_malloc0(sizeof(PartiallyBalloonedPage) + bitlen);
> +        balloon->pbp->rb = rb;
> +        balloon->pbp->base = host_page_base;
> +    }
> +
> +    bitmap_set(balloon->pbp->bitmap,
> +               (ram_offset - balloon->pbp->base) / BALLOON_PAGE_SIZE,
> +               subpages);
> +
> +    if (bitmap_full(balloon->pbp->bitmap, subpages)) {
> +        /* We've accumulated a full host page, we can actually discard
> +         * it now */
> +
> +        ram_block_discard_range(rb, balloon->pbp->base, rb_page_size);
> +        /* We ignore errors from ram_block_discard_range(), because it
> +         * has already reported them, and failing to discard a balloon
> +         * page is not fatal */
> +
> +        free(balloon->pbp);
> +        balloon->pbp = NULL;
> +    }
>  }
No, not a fan of this approach.
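For readers following along, the accumulation scheme in the quoted patch can be modeled outside QEMU as a standalone sketch. All names here are hypothetical: a uint32_t stands in for PartiallyBalloonedPage's bitmap[], and a 64kiB host page (16 subpages of 4kiB) is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_BALLOON_PAGE_SIZE 4096u
#define MODEL_HOST_PAGE_SIZE    65536u
#define MODEL_SUBPAGES (MODEL_HOST_PAGE_SIZE / MODEL_BALLOON_PAGE_SIZE)

struct pbp_model {
    bool active;       /* tracking a partially ballooned host page? */
    uint64_t base;     /* host-page-aligned base of the tracked page */
    uint32_t bitmap;   /* one bit per 4kiB subpage */
};

/* Returns true at the point the real code would call
 * ram_block_discard_range() on a fully accumulated host page. */
static bool model_inflate(struct pbp_model *pbp, uint64_t addr)
{
    uint64_t base = addr & ~(uint64_t)(MODEL_HOST_PAGE_SIZE - 1);

    if (pbp->active && pbp->base != base) {
        /* Request for a different host page: give up on the old
         * partial page, as in the patch's warn_report() path. */
        pbp->active = false;
    }
    if (!pbp->active) {
        pbp->active = true;
        pbp->base = base;
        pbp->bitmap = 0;
    }
    pbp->bitmap |= 1u << ((addr - base) / MODEL_BALLOON_PAGE_SIZE);
    if (pbp->bitmap == (1u << MODEL_SUBPAGES) - 1) {
        pbp->active = false;
        return true;    /* whole host page accumulated */
    }
    return false;
}

/* Convenience: inflate the first n subpages of the host page at base. */
static bool model_inflate_run(struct pbp_model *pbp, uint64_t base, unsigned n)
{
    bool discarded = false;
    for (unsigned i = 0; i < n; i++) {
        discarded = model_inflate(pbp, base + (uint64_t)i * MODEL_BALLOON_PAGE_SIZE);
    }
    return discarded;
}
```

A guest freeing whole 64kiB pages hits the discard path; a guest scattering 4kiB requests keeps resetting the state and never discards anything, which is the behaviour being objected to here.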

-- 

Thanks,

David / dhildenb


* Re: [Qemu-devel] [RFC 5/5] virtio-balloon: Safely handle BALLOON_PAGE_SIZE < host page size
  2018-10-12  3:24 ` [Qemu-devel] [RFC 5/5] virtio-balloon: Safely handle BALLOON_PAGE_SIZE < host page size David Gibson
  2018-10-12  8:06   ` David Hildenbrand
@ 2018-10-12  8:32   ` David Hildenbrand
  2018-10-13  6:41     ` David Gibson
  1 sibling, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2018-10-12  8:32 UTC (permalink / raw)
  To: David Gibson, dhildenb, imammedo, ehabkost
  Cc: mst, pbonzini, qemu-devel, qemu-ppc

On 12/10/2018 05:24, David Gibson wrote:
> The virtio-balloon always works in units of 4kiB (BALLOON_PAGE_SIZE), but
> on the host side, we can only actually discard memory in units of the host
> page size.
> 
> At present we handle this very badly: we silently ignore balloon requests
> that aren't host page aligned, and for requests that are host page aligned
> we discard the entire host page.  The latter potentially corrupts guest
> memory if its page size is smaller than the host's.
> 
> We could just disable the balloon if the host page size is not 4kiB, but
> that would break the special case where host and guest have the same page
> size, but one that's larger than 4kiB.  This case currently works by accident:
> when the guest puts its page into the balloon, it will submit balloon
> requests for each 4kiB subpage.  Most will be ignored, but the one which
> happens to be host page aligned will discard the whole lot.
> 
> This occurs in practice routinely for POWER KVM systems, since both host
> and guest typically use 64kiB pages.
> 
> To make this safe, without breaking that useful case, we need to
> accumulate 4kiB balloon requests until we have a whole contiguous host page
> at which point we can discard it.
> 
> We could in principle do that across all guest memory, but it would require
> a large bitmap to track.  This patch represents a compromise: instead we
> track ballooned subpages for a single contiguous host page at a time.  This
> means that if the guest discards all 4kiB chunks of a host page in
> succession, we will discard it.  In particular that means the balloon will
> continue to work for the (host page size) == (guest page size) > 4kiB case.
> 
> If the guest scatters 4kiB requests across different host pages, we don't
> discard anything, and issue a warning.  Not ideal, but at least we don't
> corrupt guest memory as the previous version could.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/virtio/virtio-balloon.c         | 67 +++++++++++++++++++++++++-----
>  include/hw/virtio/virtio-balloon.h |  3 ++
>  2 files changed, 60 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 4435905c87..39573ef2e3 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -33,33 +33,80 @@
>  
>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
>  
> +typedef struct PartiallyBalloonedPage {
> +    RAMBlock *rb;
> +    ram_addr_t base;
> +    unsigned long bitmap[];

BTW, it might be easier to only remember the last inflated page and
increment it when you see its successor.

initialize last_page to -1ull on realize/reset


if (QEMU_IS_ALIGNED(addr, PAGE_SIZE)) {
	/* start of a new potential page block */
	last_page = addr;
} else if (addr == last_page + BALLOON_PAGE_SIZE) {
	/* next successor */
	last_page = addr;
	if (QEMU_IS_ALIGNED(last_page + BALLOON_PAGE_SIZE, PAGE_SIZE)) {
		ramblock_discard()....
	}
} else {
	last_page = -1ull;
}
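Expanded into a compilable form, that successor-tracking idea might look like the following hypothetical sketch. The SK_* names and the 64kiB host page size are assumptions, not QEMU code; sk_inflate() returns true where the real code would call ram_block_discard_range().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_BALLOON_PAGE_SIZE 4096ull
#define SK_HOST_PAGE_SIZE    65536ull
#define SK_IS_ALIGNED(x, a)  (((x) & ((a) - 1)) == 0)

struct sk_state { uint64_t last_page; };

/* Initialize to -1ull on realize/reset, as suggested above. */
static void sk_reset(struct sk_state *s) { s->last_page = ~0ull; }

/* Returns true when an in-order run of 4kiB requests covers a whole
 * host page, i.e. where the real code would discard it. */
static bool sk_inflate(struct sk_state *s, uint64_t addr)
{
    if (SK_IS_ALIGNED(addr, SK_HOST_PAGE_SIZE)) {
        s->last_page = addr;        /* start of a new potential run */
    } else if (addr == s->last_page + SK_BALLOON_PAGE_SIZE) {
        s->last_page = addr;        /* next successor in the run */
        if (SK_IS_ALIGNED(addr + SK_BALLOON_PAGE_SIZE, SK_HOST_PAGE_SIZE)) {
            sk_reset(s);
            return true;            /* the run now covers a host page */
        }
    } else {
        sk_reset(s);                /* out-of-order request: give up */
    }
    return false;
}
```

Compared with the bitmap, this needs only one u64 of state, but it discards a host page only when the guest submits its subpages in strictly ascending order.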


> +} PartiallyBalloonedPage;
> +
>  static void balloon_inflate_page(VirtIOBalloon *balloon,
>                                   MemoryRegion *mr, hwaddr offset)
>  {
>      void *addr = memory_region_get_ram_ptr(mr) + offset;
>      RAMBlock *rb;
>      size_t rb_page_size;
> -    ram_addr_t ram_offset;
> +    int subpages;
> +    ram_addr_t ram_offset, host_page_base;
>  
>      /* XXX is there a better way to get to the RAMBlock than via a
>       * host address? */
>      rb = qemu_ram_block_from_host(addr, false, &ram_offset);
>      rb_page_size = qemu_ram_pagesize(rb);
> +    host_page_base = ram_offset & ~(rb_page_size - 1);
> +
> +    if (rb_page_size == BALLOON_PAGE_SIZE) {
> +        /* Easy case */
>  
> -    /* Silently ignore hugepage RAM blocks */
> -    if (rb_page_size != getpagesize()) {
> +        ram_block_discard_range(rb, ram_offset, rb_page_size);
> +        /* We ignore errors from ram_block_discard_range(), because it
> +         * has already reported them, and failing to discard a balloon
> +         * page is not fatal */
>          return;
>      }
>  
> -    /* Silently ignore unaligned requests */
> -    if (ram_offset & (rb_page_size - 1)) {
> -        return;
> +    /* Hard case
> +     *
> +     * We've put a piece of a larger host page into the balloon - we
> +     * need to keep track until we have a whole host page to
> +     * discard
> +     */
> +    subpages = rb_page_size / BALLOON_PAGE_SIZE;
> +
> +    if (balloon->pbp
> +        && (rb != balloon->pbp->rb
> +            || host_page_base != balloon->pbp->base)) {
> +        /* We've partially ballooned part of a host page, but now
> +         * we're trying to balloon part of a different one.  Too hard,
> +         * give up on the old partial page */
> +        warn_report("Unable to insert a partial page into virtio-balloon");
> +        free(balloon->pbp);
> +        balloon->pbp = NULL;
>      }
>  
> -    ram_block_discard_range(rb, ram_offset, rb_page_size);
> -    /* We ignore errors from ram_block_discard_range(), because it has
> -     * already reported them, and failing to discard a balloon page is
> -     * not fatal */
> +    if (!balloon->pbp) {
> +        /* Starting on a new host page */
> +        size_t bitlen = BITS_TO_LONGS(subpages) * sizeof(unsigned long);
> +        balloon->pbp = g_malloc0(sizeof(PartiallyBalloonedPage) + bitlen);
> +        balloon->pbp->rb = rb;
> +        balloon->pbp->base = host_page_base;
> +    }
> +
> +    bitmap_set(balloon->pbp->bitmap,
> +               (ram_offset - balloon->pbp->base) / BALLOON_PAGE_SIZE,
> +               subpages);
> +
> +    if (bitmap_full(balloon->pbp->bitmap, subpages)) {
> +        /* We've accumulated a full host page, we can actually discard
> +         * it now */
> +
> +        ram_block_discard_range(rb, balloon->pbp->base, rb_page_size);
> +        /* We ignore errors from ram_block_discard_range(), because it
> +         * has already reported them, and failing to discard a balloon
> +         * page is not fatal */
> +
> +        free(balloon->pbp);
> +        balloon->pbp = NULL;
> +    }
>  }
>  
>  static const char *balloon_stat_names[] = {
> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> index e0df3528c8..99dcd6d105 100644
> --- a/include/hw/virtio/virtio-balloon.h
> +++ b/include/hw/virtio/virtio-balloon.h
> @@ -30,6 +30,8 @@ typedef struct virtio_balloon_stat_modern {
>         uint64_t val;
>  } VirtIOBalloonStatModern;
>  
> +typedef struct PartiallyBalloonedPage PartiallyBalloonedPage;
> +
>  typedef struct VirtIOBalloon {
>      VirtIODevice parent_obj;
>      VirtQueue *ivq, *dvq, *svq;
> @@ -42,6 +44,7 @@ typedef struct VirtIOBalloon {
>      int64_t stats_last_update;
>      int64_t stats_poll_interval;
>      uint32_t host_features;
> +    PartiallyBalloonedPage *pbp;
>  } VirtIOBalloon;
>  
>  #endif
> 


-- 

Thanks,

David / dhildenb


* Re: [Qemu-devel] [RFC 0/5] Improve balloon handling of pagesizes other than 4kiB
  2018-10-12  3:24 [Qemu-devel] [RFC 0/5] Improve balloon handling of pagesizes other than 4kiB David Gibson
                   ` (4 preceding siblings ...)
  2018-10-12  3:24 ` [Qemu-devel] [RFC 5/5] virtio-balloon: Safely handle BALLOON_PAGE_SIZE < host page size David Gibson
@ 2018-10-12 17:26 ` Michael S. Tsirkin
  2018-10-17  3:31   ` David Gibson
  5 siblings, 1 reply; 31+ messages in thread
From: Michael S. Tsirkin @ 2018-10-12 17:26 UTC (permalink / raw)
  To: David Gibson; +Cc: dhildenb, imammedo, ehabkost, pbonzini, qemu-devel, qemu-ppc

On Fri, Oct 12, 2018 at 02:24:26PM +1100, David Gibson wrote:
> The virtio-balloon device was never really thought out for cases
> other than 4kiB pagesize on both guest and host.  It works in some
> cases, but in others can be ineffectual or even cause guest memory
> corruption.
> 
> This series makes a handful of preliminary cleanups, then makes a
> change to safely, though not perfectly, handle cases with non 4kiB
> pagesizes.

BTW do you want to add an interface to specify the page size?
I can see either host or guest or both supporting that.

Reviewed-by: Michael S. Tsirkin <mst@redhat.com>


> David Gibson (5):
>   virtio-balloon: Remove unnecessary MADV_WILLNEED on deflate
>   virtio-balloon: Corrections to address verification
>   virtio-balloon: Rework ballon_page() interface
>   virtio-balloon: Use ram_block_discard_range() instead of raw madvise()
>   virtio-balloon: Safely handle BALLOON_PAGE_SIZE < host page size
> 
>  hw/virtio/virtio-balloon.c         | 100 ++++++++++++++++++++++++-----
>  include/hw/virtio/virtio-balloon.h |   3 +
>  2 files changed, 87 insertions(+), 16 deletions(-)
> 
> -- 
> 2.17.1


* Re: [Qemu-devel] [RFC 1/5] virtio-balloon: Remove unnecessary MADV_WILLNEED on deflate
  2018-10-12  3:24 ` [Qemu-devel] [RFC 1/5] virtio-balloon: Remove unnecessary MADV_WILLNEED on deflate David Gibson
  2018-10-12  7:40   ` David Hildenbrand
@ 2018-10-12 17:41   ` Richard Henderson
  2018-10-12 17:59     ` Eric Blake
  2018-10-13  6:23     ` David Gibson
  2018-10-12 18:05   ` Michael S. Tsirkin
  2 siblings, 2 replies; 31+ messages in thread
From: Richard Henderson @ 2018-10-12 17:41 UTC (permalink / raw)
  To: David Gibson, dhildenb, imammedo, ehabkost
  Cc: pbonzini, qemu-ppc, qemu-devel, mst

On 10/11/18 8:24 PM, David Gibson wrote:
> When the balloon is inflated, we discard memory placed in it using madvise()
> with MADV_DONTNEED.  And when we deflate it we use MADV_WILLNEED, which
> sounds like it makes sense but is actually unnecessary.
> 
> The misleadingly named MADV_DONTNEED just discards the memory in question,
> it doesn't set any persistent state on it in-kernel; all that's necessary
> to bring the memory back is to touch it.


Isn't the point of deflate to free up host memory, for use by other guests or
whatever?

If you do nothing upon deflate, then that doesn't actually happen.  It seems to
me that this is backward and you should use DONTNEED on deflate and then
inflate should do nothing.


r~


* Re: [Qemu-devel] [RFC 1/5] virtio-balloon: Remove unnecessary MADV_WILLNEED on deflate
  2018-10-12 17:41   ` Richard Henderson
@ 2018-10-12 17:59     ` Eric Blake
  2018-10-13  6:23     ` David Gibson
  1 sibling, 0 replies; 31+ messages in thread
From: Eric Blake @ 2018-10-12 17:59 UTC (permalink / raw)
  To: Richard Henderson, David Gibson, dhildenb, imammedo, ehabkost
  Cc: pbonzini, qemu-ppc, qemu-devel, mst

On 10/12/18 12:41 PM, Richard Henderson wrote:
> On 10/11/18 8:24 PM, David Gibson wrote:
>> When the balloon is inflated, we discard memory placed in it using madvise()
>> with MADV_DONTNEED.  And when we deflate it we use MADV_WILLNEED, which
>> sounds like it makes sense but is actually unnecessary.
>>
>> The misleadingly named MADV_DONTNEED just discards the memory in question,
>> it doesn't set any persistent state on it in-kernel; all that's necessary
>> to bring the memory back is to touch it.
> 
> 
> Isn't the point of deflate to free up host memory, for use by other guests or
> whatever?

If I remember correctly, deflate says to make the balloon smaller (that 
is, the balloon gets out of the way so that the guest can use more 
memory); inflate says to make the balloon bigger (that is, the guest has 
less memory available to use because the inflated balloon is occupying 
that memory instead - where memory occupied by the balloon is really 
memory handed back to the host).

> 
> If you do nothing upon deflate, then that doesn't actually happen.  It seems to
> me that this is backward and you should use DONTNEED on deflate and then
> inflate should do nothing.

Or rather, it sounds like your definition of deflate is opposite the one 
I recall (you are trying to argue that deflate means the guest uses less 
memory and is therefore deflating in size; while I thought deflate meant 
that the balloon shrinks in size making more memory available to the guest).
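That direction-of-transfer convention, as it maps onto the madvise() calls after patch 1 of this series, can be captured in a minimal sketch. The helper and MODEL_MADV_* enum below are hypothetical stand-ins, not QEMU's actual QEMU_MADV_* API.

```c
#include <assert.h>
#include <stdbool.h>

enum model_advice {
    MODEL_MADV_NONE,        /* no madvise() call at all */
    MODEL_MADV_DONTNEED,    /* discard: memory handed back to the host */
    MODEL_MADV_WILLNEED     /* prefault: no longer issued on deflate */
};

/* Inflate grows the balloon (the guest gives memory away, so the host
 * discards it); deflate shrinks the balloon (the guest reclaims the
 * memory, and after this series a touch fault brings it back). */
static enum model_advice advice_for(bool inflate, bool inhibited)
{
    if (inhibited || !inflate) {
        return MODEL_MADV_NONE;
    }
    return MODEL_MADV_DONTNEED;
}
```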

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


* Re: [Qemu-devel] [RFC 1/5] virtio-balloon: Remove unnecessary MADV_WILLNEED on deflate
  2018-10-12  3:24 ` [Qemu-devel] [RFC 1/5] virtio-balloon: Remove unnecessary MADV_WILLNEED on deflate David Gibson
  2018-10-12  7:40   ` David Hildenbrand
  2018-10-12 17:41   ` Richard Henderson
@ 2018-10-12 18:05   ` Michael S. Tsirkin
  2018-10-15  6:54     ` David Hildenbrand
  2 siblings, 1 reply; 31+ messages in thread
From: Michael S. Tsirkin @ 2018-10-12 18:05 UTC (permalink / raw)
  To: David Gibson; +Cc: dhildenb, imammedo, ehabkost, pbonzini, qemu-devel, qemu-ppc

On Fri, Oct 12, 2018 at 02:24:27PM +1100, David Gibson wrote:
> When the balloon is inflated, we discard memory placed in it using madvise()
> with MADV_DONTNEED.  And when we deflate it we use MADV_WILLNEED, which
> sounds like it makes sense but is actually unnecessary.
> 
> The misleadingly named MADV_DONTNEED just discards the memory in question,
> it doesn't set any persistent state on it in-kernel; all that's necessary
> to bring the memory back is to touch it.  MADV_WILLNEED in contrast
> specifically says that the memory will be used soon and faults it in.
> 
> Memory that's being given back to the guest by deflating the balloon
> *might* be used soon, but it equally could just sit around in the guest's
> pools until it actually needs it.  And, over the general timescale at which
> memory ballooning operates, it seems unlikely that one extra fault for the
> guest will be a vast performance issue.

Thinking about it, it might matter for RT guests.

So I suspect if you want to drop MADV_WILLNEED you need a flag
telling QEMU that's not the use case.




> So, simplify the balloon's operation by dropping the madvise() on deflate.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/virtio/virtio-balloon.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 1728e4f83a..6ec4bcf4e1 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -35,9 +35,8 @@
>  
>  static void balloon_page(void *addr, int deflate)
>  {
> -    if (!qemu_balloon_is_inhibited()) {
> -        qemu_madvise(addr, BALLOON_PAGE_SIZE,
> -                deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> +    if (!qemu_balloon_is_inhibited() && !deflate) {
> +        qemu_madvise(addr, BALLOON_PAGE_SIZE, QEMU_MADV_DONTNEED);
>      }
>  }
>  
> -- 
> 2.17.1


* Re: [Qemu-devel] [RFC 1/5] virtio-balloon: Remove unnecessary MADV_WILLNEED on deflate
  2018-10-12 17:41   ` Richard Henderson
  2018-10-12 17:59     ` Eric Blake
@ 2018-10-13  6:23     ` David Gibson
  1 sibling, 0 replies; 31+ messages in thread
From: David Gibson @ 2018-10-13  6:23 UTC (permalink / raw)
  To: Richard Henderson
  Cc: dhildenb, imammedo, ehabkost, pbonzini, qemu-ppc, qemu-devel, mst


On Fri, Oct 12, 2018 at 10:41:33AM -0700, Richard Henderson wrote:
> On 10/11/18 8:24 PM, David Gibson wrote:
> > When the balloon is inflated, we discard memory placed in it using madvise()
> > with MADV_DONTNEED.  And when we deflate it we use MADV_WILLNEED, which
> > sounds like it makes sense but is actually unnecessary.
> > 
> > The misleadingly named MADV_DONTNEED just discards the memory in question,
> > it doesn't set any persistent state on it in-kernel; all that's necessary
> > to bring the memory back is to touch it.
> 
> Isn't the point of deflate to free up host memory, for use by other guests or
> whatever?

As Eric notes, I think you have inflate and deflate the wrong way around.

> If you do nothing upon deflate, then that doesn't actually happen.  It seems to
> me that this is backward and you should use DONTNEED on deflate and then
> inflate should do nothing.
> 
> 
> r~
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [RFC 2/5] virtio-balloon: Corrections to address verification
  2018-10-12  7:44   ` David Hildenbrand
@ 2018-10-13  6:25     ` David Gibson
  0 siblings, 0 replies; 31+ messages in thread
From: David Gibson @ 2018-10-13  6:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: dhildenb, imammedo, ehabkost, mst, pbonzini, qemu-devel, qemu-ppc


On Fri, Oct 12, 2018 at 09:44:25AM +0200, David Hildenbrand wrote:
> On 12/10/2018 05:24, David Gibson wrote:
> > The virtio-balloon device's verification of the address given to it by the
> > guest has a number of faults:
> >     * The addresses here are guest physical addresses, which should be
> >       'hwaddr' rather than 'ram_addr_t' (the distinction is admittedly
> >       pretty subtle and confusing)
> >     * We don't check for section.mr being NULL, which is the main way that
> >       memory_region_find() reports basic failures.  We really need to check
> >       that before looking at any other section fields, because
> >       memory_region_find() doesn't initialize them on the failure path
> >     * We're passing a length of '1' to memory_region_find(), but really the
> >       guest is requesting that we put the entire page into the balloon,
> >       so it makes more sense to call it with BALLOON_PAGE_SIZE
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  hw/virtio/virtio-balloon.c | 17 ++++++++++-------
> >  1 file changed, 10 insertions(+), 7 deletions(-)
> > 
> > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> > index 6ec4bcf4e1..e8611aab0e 100644
> > --- a/hw/virtio/virtio-balloon.c
> > +++ b/hw/virtio/virtio-balloon.c
> > @@ -221,17 +221,20 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
> >          }
> >  
> >          while (iov_to_buf(elem->out_sg, elem->out_num, offset, &pfn, 4) == 4) {
> > -            ram_addr_t pa;
> > -            ram_addr_t addr;
> > +            hwaddr pa;
> > +            hwaddr addr;
> >              int p = virtio_ldl_p(vdev, &pfn);
> >  
> > -            pa = (ram_addr_t) p << VIRTIO_BALLOON_PFN_SHIFT;
> > +            pa = (hwaddr) p << VIRTIO_BALLOON_PFN_SHIFT;
> >              offset += 4;
> >  
> > -            /* FIXME: remove get_system_memory(), but how? */
> 
> Should we leave that fixme? (on the other hand, virtio-balloon operates
> on all system mamory, so I also don't really see a way around this ...)

Yeah, I took that out deliberately.  Given what this is supposed to be
doing, I can't really see how removing get_system_memory() is either
possible or desirable.

> > -            section = memory_region_find(get_system_memory(), pa, 1);
> > -            if (!int128_nz(section.size) ||
> > -                !memory_region_is_ram(section.mr) ||
> > +            section = memory_region_find(get_system_memory(), pa,
> > +                                         BALLOON_PAGE_SIZE);
> > +            if (!section.mr) {
> > +                trace_virtio_balloon_bad_addr(pa);
> > +                continue;
> > +            }
> > +            if (!memory_region_is_ram(section.mr) ||
> >                  memory_region_is_rom(section.mr) ||
> >                  memory_region_is_romd(section.mr)) {
> >                  trace_virtio_balloon_bad_addr(pa);
> > 
> 
> Reviewed-by: David Hildenbrand <david@redhat.com>
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
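The address handling being reviewed in this subthread can be modeled in isolation. In this hypothetical sketch, a flat ram_size bound stands in for memory_region_find() and the RAM/ROM checks; the AV_* names are assumptions, with AV_PFN_SHIFT playing the role of VIRTIO_BALLOON_PFN_SHIFT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define AV_PFN_SHIFT 12
#define AV_BALLOON_PAGE_SIZE ((uint64_t)1 << AV_PFN_SHIFT)

/* The guest submits 32-bit 4kiB page frame numbers; the device derives
 * a guest-physical address from each. */
static uint64_t pfn_to_gpa(uint32_t pfn)
{
    /* The 64-bit cast matters: a 32-bit PFN can address > 4GiB. */
    return (uint64_t)pfn << AV_PFN_SHIFT;
}

/* Validate the full BALLOON_PAGE_SIZE range, not a 1-byte probe at its
 * start, mirroring the change from length 1 to BALLOON_PAGE_SIZE. */
static bool balloon_range_valid(uint64_t gpa, uint64_t ram_size)
{
    return gpa + AV_BALLOON_PAGE_SIZE <= ram_size;
}
```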



* Re: [Qemu-devel] [RFC 1/5] virtio-balloon: Remove unnecessary MADV_WILLNEED on deflate
  2018-10-12  7:40   ` David Hildenbrand
@ 2018-10-13  6:26     ` David Gibson
  0 siblings, 0 replies; 31+ messages in thread
From: David Gibson @ 2018-10-13  6:26 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: dhildenb, imammedo, ehabkost, mst, pbonzini, qemu-devel, qemu-ppc


On Fri, Oct 12, 2018 at 09:40:24AM +0200, David Hildenbrand wrote:
> On 12/10/2018 05:24, David Gibson wrote:
> > When the balloon is inflated, we discard memory placed in it using madvise()
> > with MADV_DONTNEED.  And when we deflate it we use MADV_WILLNEED, which
> > sounds like it makes sense but is actually unnecessary.
> > 
> > The misleadingly named MADV_DONTNEED just discards the memory in question,
> > it doesn't set any persistent state on it in-kernel; all that's necessary
> > to bring the memory back is to touch it.  MADV_WILLNEED in contrast
> > specifically says that the memory will be used soon and faults it in.
> > 
> > Memory that's being given back to the guest by deflating the balloon
> > *might* be used soon, but it equally could just sit around in the guest's
> > pools until it actually needs it.  And, over the general timescale at which
> > memory ballooning operates, it seems unlikely that one extra fault for the
> > guest will be a vast performance issue.
> > 
> > So, simplify the balloon's operation by dropping the madvise() on deflate.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  hw/virtio/virtio-balloon.c | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> > 
> > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> > index 1728e4f83a..6ec4bcf4e1 100644
> > --- a/hw/virtio/virtio-balloon.c
> > +++ b/hw/virtio/virtio-balloon.c
> > @@ -35,9 +35,8 @@
> >  
> >  static void balloon_page(void *addr, int deflate)
> >  {
> > -    if (!qemu_balloon_is_inhibited()) {
> > -        qemu_madvise(addr, BALLOON_PAGE_SIZE,
> > -                deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> > +    if (!qemu_balloon_is_inhibited() && !deflate) {
> > +        qemu_madvise(addr, BALLOON_PAGE_SIZE, QEMU_MADV_DONTNEED);
> >      }
> >  }
> >  
> > 
> 
> Care to rename the function and drop the deflate parameter?
> (call it only when actually inflating)

As you probably found, I do that in a later patch.

> apart from that
> 
> Reviewed-by: David Hildenbrand <david@redhat.com>
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [RFC 3/5] virtio-balloon: Rework ballon_page() interface
  2018-10-12  7:46   ` David Hildenbrand
@ 2018-10-13  6:29     ` David Gibson
  0 siblings, 0 replies; 31+ messages in thread
From: David Gibson @ 2018-10-13  6:29 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: dhildenb, imammedo, ehabkost, mst, pbonzini, qemu-devel, qemu-ppc


On Fri, Oct 12, 2018 at 09:46:16AM +0200, David Hildenbrand wrote:
> On 12/10/2018 05:24, David Gibson wrote:
> > This replaces the balloon_page() internal interface with
> > balloon_inflate_page(), which has a slightly different interface.  The new
> > interface will make future alterations simpler.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  hw/virtio/virtio-balloon.c | 17 +++++++----------
> >  1 file changed, 7 insertions(+), 10 deletions(-)
> > 
> > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> > index e8611aab0e..7229afad6e 100644
> > --- a/hw/virtio/virtio-balloon.c
> > +++ b/hw/virtio/virtio-balloon.c
> > @@ -33,11 +33,11 @@
> >  
> >  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
> >  
> > -static void balloon_page(void *addr, int deflate)
> > +static void balloon_inflate_page(VirtIOBalloon *balloon,
> > +                                 MemoryRegion *mr, hwaddr offset)
> >  {
> > -    if (!qemu_balloon_is_inhibited() && !deflate) {
> > -        qemu_madvise(addr, BALLOON_PAGE_SIZE, QEMU_MADV_DONTNEED);
> > -    }
> > +    void *addr = memory_region_get_ram_ptr(mr) + offset;
> 
> I prefer an empty line here

Ok.

> > +    qemu_madvise(addr, BALLOON_PAGE_SIZE, QEMU_MADV_DONTNEED);
> >  }
> >  
> >  static const char *balloon_stat_names[] = {
> > @@ -222,7 +222,6 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
> >  
> >          while (iov_to_buf(elem->out_sg, elem->out_num, offset, &pfn, 4) == 4) {
> >              hwaddr pa;
> > -            hwaddr addr;
> >              int p = virtio_ldl_p(vdev, &pfn);
> >  
> >              pa = (hwaddr) p << VIRTIO_BALLOON_PFN_SHIFT;
> > @@ -244,11 +243,9 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
> >  
> >              trace_virtio_balloon_handle_output(memory_region_name(section.mr),
> >                                                 pa);
> > -            /* Using memory_region_get_ram_ptr is bending the rules a bit, but
> > -               should be OK because we only want a single page.  */
> > -            addr = section.offset_within_region;
> > -            balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
> > -                         !!(vq == s->dvq));
> > +            if (!qemu_balloon_is_inhibited() && vq != s->dvq) {
> > +                balloon_inflate_page(s, section.mr, section.offset_within_region);
> > +            }
> >              memory_region_unref(section.mr);
> >          }
> >  
> > 
> 
> Reviewed-by: David Hildenbrand <david@redhat.com>
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [RFC 5/5] virtio-balloon: Safely handle BALLOON_PAGE_SIZE < host page size
  2018-10-12  8:06   ` David Hildenbrand
@ 2018-10-13  6:40     ` David Gibson
  2018-10-15  7:08       ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread
From: David Gibson @ 2018-10-13  6:40 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: dhildenb, imammedo, ehabkost, mst, pbonzini, qemu-devel, qemu-ppc

[-- Attachment #1: Type: text/plain, Size: 8624 bytes --]

On Fri, Oct 12, 2018 at 10:06:50AM +0200, David Hildenbrand wrote:
> On 12/10/2018 05:24, David Gibson wrote:
> > The virtio-balloon always works in units of 4kiB (BALLOON_PAGE_SIZE), but
> > on the host side, we can only actually discard memory in units of the host
> > page size.
> > 
> > At present we handle this very badly: we silently ignore balloon requests
> > that aren't host page aligned, and for requests that are host page aligned
> > we discard the entire host page.  The latter potentially corrupts guest
> > memory if its page size is smaller than the host's.
> > 
> > We could just disable the balloon if the host page size is not 4kiB, but
> > that would break the special case where host and guest have the same page
> > size, but that's larger than 4kiB.  This case currently works by accident:
> > when the guest puts its page into the balloon, it will submit balloon
> > requests for each 4kiB subpage.  Most will be ignored, but the one which
> > happens to be host page aligned will discard the whole lot.
> 
> Actually I would vote for disabling it if the host page size is not 4k.
> This is broken by design and should not be worked around.

I really don't think that's feasible.  This is not some theoretical
edge case: essentially every KVM on POWER system out there with
balloon enabled is in this situation.  And AIUI, a bunch of common
management layers routinely enable the balloon.  Just disabling this
for non-4k systems will break existing, supported production setups.

> I consider 64k a similar case as huge pages in the hypervisor (from a
> virtio-balloon interface point of view - which is broken by design for
> these cases).

Technically it is very similar, but again, this is in routine use on
real systems out there.

> > This occurs in practice routinely for POWER KVM systems, since both host
> > and guest typically use 64kiB pages.
> > 
> > To make this safe, without breaking that useful case, we need to
> > accumulate 4kiB balloon requests until we have a whole contiguous host page
> > at which point we can discard it.
> 
> ... and if you have a 4k guest it will just randomly report pages and
> your 67 LOC tweak is useless.

Yes.

> And this can and will easily happen.

What cases are you thinking of?  For POWER at least, 4k on 64k will be
vastly less common than 64k on 64k.

> > We could in principle do that across all guest memory, but it would require
> > a large bitmap to track.  This patch represents a compromise: instead we
> 
> Oh god no, no tracking of all memory. (size of bitmap, memory hotplug,
> ... migration?)

Quite, that's why I didn't do it.  Although, I don't actually think
migration is such an issue: we *already* essentially lose track of
which pages are inside the balloon across migration.

> > track ballooned subpages for a single contiguous host page at a time.  This
> > means that if the guest discards all 4kiB chunks of a host page in
> > succession, we will discard it.  In particular that means the balloon will
> > continue to work for the (host page size) == (guest page size) > 4kiB case.
> > 
> > If the guest scatters 4kiB requests across different host pages, we don't
> > discard anything, and issue a warning.  Not ideal, but at least we don't
> > corrupt guest memory as the previous version could.
> 
> My take would be to somehow disable the balloon on the hypervisor side
> in case the host page size is not 4k. Like not allowing to realize it.
> No corruptions, no warnings people will ignore.

No, that's even worse than just having it silently do nothing on
non-4k setups.  Silently being useless we might get away with, we'll
just waste memory, but failing the device load will absolutely break
existing setups.

> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  hw/virtio/virtio-balloon.c         | 67 +++++++++++++++++++++++++-----
> >  include/hw/virtio/virtio-balloon.h |  3 ++
> >  2 files changed, 60 insertions(+), 10 deletions(-)
> > 
> > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> > index 4435905c87..39573ef2e3 100644
> > --- a/hw/virtio/virtio-balloon.c
> > +++ b/hw/virtio/virtio-balloon.c
> > @@ -33,33 +33,80 @@
> >  
> >  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
> >  
> > +typedef struct PartiallyBalloonedPage {
> > +    RAMBlock *rb;
> > +    ram_addr_t base;
> > +    unsigned long bitmap[];
> > +} PartiallyBalloonedPage;
> > +
> >  static void balloon_inflate_page(VirtIOBalloon *balloon,
> >                                   MemoryRegion *mr, hwaddr offset)
> >  {
> >      void *addr = memory_region_get_ram_ptr(mr) + offset;
> >      RAMBlock *rb;
> >      size_t rb_page_size;
> > -    ram_addr_t ram_offset;
> > +    int subpages;
> > +    ram_addr_t ram_offset, host_page_base;
> >  
> >      /* XXX is there a better way to get to the RAMBlock than via a
> >       * host address? */
> >      rb = qemu_ram_block_from_host(addr, false, &ram_offset);
> >      rb_page_size = qemu_ram_pagesize(rb);
> > +    host_page_base = ram_offset & ~(rb_page_size - 1);
> > +
> > +    if (rb_page_size == BALLOON_PAGE_SIZE) {
> > +        /* Easy case */
> >  
> > -    /* Silently ignore hugepage RAM blocks */
> > -    if (rb_page_size != getpagesize()) {
> > +        ram_block_discard_range(rb, ram_offset, rb_page_size);
> > +        /* We ignore errors from ram_block_discard_range(), because it
> > +         * has already reported them, and failing to discard a balloon
> > +         * page is not fatal */
> >          return;
> >      }
> >  
> > -    /* Silently ignore unaligned requests */
> > -    if (ram_offset & (rb_page_size - 1)) {
> > -        return;
> > +    /* Hard case
> > +     *
> > +     * We've put a piece of a larger host page into the balloon - we
> > +     * need to keep track until we have a whole host page to
> > +     * discard
> > +     */
> > +    subpages = rb_page_size / BALLOON_PAGE_SIZE;
> > +
> > +    if (balloon->pbp
> > +        && (rb != balloon->pbp->rb
> > +            || host_page_base != balloon->pbp->base)) {
> > +        /* We've partially ballooned part of a host page, but now
> > +         * we're trying to balloon part of a different one.  Too hard,
> > +         * give up on the old partial page */
> > +        warn_report("Unable to insert a partial page into virtio-balloon");
> 
> I am pretty sure that you can create misleading warnings in case you
> migrate at the wrong time. (migrate while half the 64k page is inflated,
> on the new host the other part is inflated - a warning when switching to
> another 64k page).

Yes we can get bogus warnings across migration with this.  I was
considering that an acceptable price, but I'm open to better
alternatives.

> > +        free(balloon->pbp);
> > +        balloon->pbp = NULL;
> >      }
> >  
> > -    ram_block_discard_range(rb, ram_offset, rb_page_size);
> > -    /* We ignore errors from ram_block_discard_range(), because it has
> > -     * already reported them, and failing to discard a balloon page is
> > -     * not fatal */
> > +    if (!balloon->pbp) {
> > +        /* Starting on a new host page */
> > +        size_t bitlen = BITS_TO_LONGS(subpages) * sizeof(unsigned long);
> > +        balloon->pbp = g_malloc0(sizeof(PartiallyBalloonedPage) + bitlen);
> > +        balloon->pbp->rb = rb;
> > +        balloon->pbp->base = host_page_base;
> > +    }
> > +
> > +    bitmap_set(balloon->pbp->bitmap,
> > +               (ram_offset - balloon->pbp->base) / BALLOON_PAGE_SIZE,
> > +               subpages);
> > +
> > +    if (bitmap_full(balloon->pbp->bitmap, subpages)) {
> > +        /* We've accumulated a full host page, we can actually discard
> > +         * it now */
> > +
> > +        ram_block_discard_range(rb, balloon->pbp->base, rb_page_size);
> > +        /* We ignore errors from ram_block_discard_range(), because it
> > +         * has already reported them, and failing to discard a balloon
> > +         * page is not fatal */
> > +
> > +        free(balloon->pbp);
> > +        balloon->pbp = NULL;
> > +    }
> >  }
> No, not a fan of this approach.

I can see why, but I really can't see what else to do without breaking
existing, supported, working (albeit by accident) setups.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [RFC 5/5] virtio-balloon: Safely handle BALLOON_PAGE_SIZE < host page size
  2018-10-12  8:32   ` David Hildenbrand
@ 2018-10-13  6:41     ` David Gibson
  0 siblings, 0 replies; 31+ messages in thread
From: David Gibson @ 2018-10-13  6:41 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: dhildenb, imammedo, ehabkost, mst, pbonzini, qemu-devel, qemu-ppc


On Fri, Oct 12, 2018 at 10:32:34AM +0200, David Hildenbrand wrote:
> On 12/10/2018 05:24, David Gibson wrote:
> > The virtio-balloon always works in units of 4kiB (BALLOON_PAGE_SIZE), but
> > on the host side, we can only actually discard memory in units of the host
> > page size.
> > 
> > At present we handle this very badly: we silently ignore balloon requests
> > that aren't host page aligned, and for requests that are host page aligned
> > we discard the entire host page.  The latter potentially corrupts guest
> > memory if its page size is smaller than the host's.
> > 
> > We could just disable the balloon if the host page size is not 4kiB, but
> > that would break the special case where host and guest have the same page
> > size, but that's larger than 4kiB.  This case currently works by accident:
> > when the guest puts its page into the balloon, it will submit balloon
> > requests for each 4kiB subpage.  Most will be ignored, but the one which
> > happens to be host page aligned will discard the whole lot.
> > 
> > This occurs in practice routinely for POWER KVM systems, since both host
> > and guest typically use 64kiB pages.
> > 
> > To make this safe, without breaking that useful case, we need to
> > accumulate 4kiB balloon requests until we have a whole contiguous host page
> > at which point we can discard it.
> > 
> > We could in principle do that across all guest memory, but it would require
> > a large bitmap to track.  This patch represents a compromise: instead we
> > track ballooned subpages for a single contiguous host page at a time.  This
> > means that if the guest discards all 4kiB chunks of a host page in
> > succession, we will discard it.  In particular that means the balloon will
> > continue to work for the (host page size) == (guest page size) > 4kiB case.
> > 
> > If the guest scatters 4kiB requests across different host pages, we don't
> > discard anything, and issue a warning.  Not ideal, but at least we don't
> > corrupt guest memory as the previous version could.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  hw/virtio/virtio-balloon.c         | 67 +++++++++++++++++++++++++-----
> >  include/hw/virtio/virtio-balloon.h |  3 ++
> >  2 files changed, 60 insertions(+), 10 deletions(-)
> > 
> > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> > index 4435905c87..39573ef2e3 100644
> > --- a/hw/virtio/virtio-balloon.c
> > +++ b/hw/virtio/virtio-balloon.c
> > @@ -33,33 +33,80 @@
> >  
> >  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
> >  
> > +typedef struct PartiallyBalloonedPage {
> > +    RAMBlock *rb;
> > +    ram_addr_t base;
> > +    unsigned long bitmap[];
> 
> BTW, it might be easier to only remember the last inflated page and
> incrementing it when you see the successor.

That would be marginally simpler, but I was preferring not to rely on
the guest always doing things in a particular order.

> 
> initialize last_page to -1ull on realize/reset
> 
> 
> if (QEMU_IS_ALIGNED(addr, PAGE_SIZE)) {
> 	/* start of a new potential page block */
> 	last_page = addr;
> } else if (addr == last_page + BALLOON_PAGE_SIZE) {
> 	/* next successor */
> 	last_page = addr;
> 	if (QEMU_IS_ALIGNED(last_page + BALLOON_PAGE_SIZE, PAGE_SIZE)) {
> 		ramblock_discard()....
> 	}
> } else {
> 	last_page = -1ull;
> }
> 
> 
> > +} PartiallyBalloonedPage;
> > +
> >  static void balloon_inflate_page(VirtIOBalloon *balloon,
> >                                   MemoryRegion *mr, hwaddr offset)
> >  {
> >      void *addr = memory_region_get_ram_ptr(mr) + offset;
> >      RAMBlock *rb;
> >      size_t rb_page_size;
> > -    ram_addr_t ram_offset;
> > +    int subpages;
> > +    ram_addr_t ram_offset, host_page_base;
> >  
> >      /* XXX is there a better way to get to the RAMBlock than via a
> >       * host address? */
> >      rb = qemu_ram_block_from_host(addr, false, &ram_offset);
> >      rb_page_size = qemu_ram_pagesize(rb);
> > +    host_page_base = ram_offset & ~(rb_page_size - 1);
> > +
> > +    if (rb_page_size == BALLOON_PAGE_SIZE) {
> > +        /* Easy case */
> >  
> > -    /* Silently ignore hugepage RAM blocks */
> > -    if (rb_page_size != getpagesize()) {
> > +        ram_block_discard_range(rb, ram_offset, rb_page_size);
> > +        /* We ignore errors from ram_block_discard_range(), because it
> > +         * has already reported them, and failing to discard a balloon
> > +         * page is not fatal */
> >          return;
> >      }
> >  
> > -    /* Silently ignore unaligned requests */
> > -    if (ram_offset & (rb_page_size - 1)) {
> > -        return;
> > +    /* Hard case
> > +     *
> > +     * We've put a piece of a larger host page into the balloon - we
> > +     * need to keep track until we have a whole host page to
> > +     * discard
> > +     */
> > +    subpages = rb_page_size / BALLOON_PAGE_SIZE;
> > +
> > +    if (balloon->pbp
> > +        && (rb != balloon->pbp->rb
> > +            || host_page_base != balloon->pbp->base)) {
> > +        /* We've partially ballooned part of a host page, but now
> > +         * we're trying to balloon part of a different one.  Too hard,
> > +         * give up on the old partial page */
> > +        warn_report("Unable to insert a partial page into virtio-balloon");
> > +        free(balloon->pbp);
> > +        balloon->pbp = NULL;
> >      }
> >  
> > -    ram_block_discard_range(rb, ram_offset, rb_page_size);
> > -    /* We ignore errors from ram_block_discard_range(), because it has
> > -     * already reported them, and failing to discard a balloon page is
> > -     * not fatal */
> > +    if (!balloon->pbp) {
> > +        /* Starting on a new host page */
> > +        size_t bitlen = BITS_TO_LONGS(subpages) * sizeof(unsigned long);
> > +        balloon->pbp = g_malloc0(sizeof(PartiallyBalloonedPage) + bitlen);
> > +        balloon->pbp->rb = rb;
> > +        balloon->pbp->base = host_page_base;
> > +    }
> > +
> > +    bitmap_set(balloon->pbp->bitmap,
> > +               (ram_offset - balloon->pbp->base) / BALLOON_PAGE_SIZE,
> > +               subpages);
> > +
> > +    if (bitmap_full(balloon->pbp->bitmap, subpages)) {
> > +        /* We've accumulated a full host page, we can actually discard
> > +         * it now */
> > +
> > +        ram_block_discard_range(rb, balloon->pbp->base, rb_page_size);
> > +        /* We ignore errors from ram_block_discard_range(), because it
> > +         * has already reported them, and failing to discard a balloon
> > +         * page is not fatal */
> > +
> > +        free(balloon->pbp);
> > +        balloon->pbp = NULL;
> > +    }
> >  }
> >  
> >  static const char *balloon_stat_names[] = {
> > diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> > index e0df3528c8..99dcd6d105 100644
> > --- a/include/hw/virtio/virtio-balloon.h
> > +++ b/include/hw/virtio/virtio-balloon.h
> > @@ -30,6 +30,8 @@ typedef struct virtio_balloon_stat_modern {
> >         uint64_t val;
> >  } VirtIOBalloonStatModern;
> >  
> > +typedef struct PartiallyBalloonedPage PartiallyBalloonedPage;
> > +
> >  typedef struct VirtIOBalloon {
> >      VirtIODevice parent_obj;
> >      VirtQueue *ivq, *dvq, *svq;
> > @@ -42,6 +44,7 @@ typedef struct VirtIOBalloon {
> >      int64_t stats_last_update;
> >      int64_t stats_poll_interval;
> >      uint32_t host_features;
> > +    PartiallyBalloonedPage *pbp;
> >  } VirtIOBalloon;
> >  
> >  #endif
> > 
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [RFC 1/5] virtio-balloon: Remove unnecessary MADV_WILLNEED on deflate
  2018-10-12 18:05   ` Michael S. Tsirkin
@ 2018-10-15  6:54     ` David Hildenbrand
  2018-10-15 10:43       ` Michael S. Tsirkin
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2018-10-15  6:54 UTC (permalink / raw)
  To: Michael S. Tsirkin, David Gibson
  Cc: dhildenb, imammedo, ehabkost, pbonzini, qemu-devel, qemu-ppc

On 12/10/2018 20:05, Michael S. Tsirkin wrote:
> On Fri, Oct 12, 2018 at 02:24:27PM +1100, David Gibson wrote:
>> When the balloon is inflated, we discard memory placed in it using madvise()
>> with MADV_DONTNEED.  And when we deflate it we use MADV_WILLNEED, which
>> sounds like it makes sense but is actually unnecessary.
>>
>> The misleadingly named MADV_DONTNEED just discards the memory in question,
>> it doesn't set any persistent state on it in-kernel; all that's necessary
>> to bring the memory back is to touch it.  MADV_WILLNEED in contrast
>> specifically says that the memory will be used soon and faults it in.
>>
>> Memory that's being given back to the guest by deflating the balloon
>> *might* be used soon, but it equally could just sit around in the guest's
>> pools until it actually needs it.  And, over the general timescale that
>> memory ballooning operates, it seems unlikely that one extra fault for the
>> guest will be a vast performance issue.
> 
> Thinking about it, it might be for RT guests.
> 
> So I suspect if you want to drop MADV_WILLNEED you need a flag
> telling qemu that's not the usecase.
> 

As far as I know RT guests

1. mlock all memory
2. use huge pages

"MADV_DONTNEED cannot be applied to locked pages, Huge TLB pages, or
VM_PFNMAP pages."

As DONTNEED never succeeded, WILLNEED will do nothing. (e.g. postcopy
temporarily has to unlock all memory in order to make DONTNEED work)

(And "Expect access in the near future." does not sound like guarantees
to me either way)

So I think this can go.

-- 

Thanks,

David / dhildenb


* Re: [Qemu-devel] [RFC 5/5] virtio-balloon: Safely handle BALLOON_PAGE_SIZE < host page size
  2018-10-13  6:40     ` David Gibson
@ 2018-10-15  7:08       ` David Hildenbrand
  2018-10-17  3:28         ` David Gibson
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2018-10-15  7:08 UTC (permalink / raw)
  To: David Gibson
  Cc: dhildenb, imammedo, ehabkost, mst, pbonzini, qemu-devel, qemu-ppc

>>> This occurs in practice routinely for POWER KVM systems, since both host
>>> and guest typically use 64kiB pages.
>>>
>>> To make this safe, without breaking that useful case, we need to
>>> accumulate 4kiB balloon requests until we have a whole contiguous host page
>>> at which point we can discard it.
>>
>> ... and if you have a 4k guest it will just randomly report pages and
>> your 67 LOC tweak is useless.
> 
> Yes.
> 
>> And this can and will easily happen.
> 
> What cases are you thinking of?  For POWER at least, 4k on 64k will be
> vastly less common than 64k on 64k.

Okay, I was wondering why we don't get tons of bug reports for 4k on
64k. So 64k on 64k while using ballooning is supported and heavily used
then I guess? Because as you said, 4k on 64k triggers memory corruptions.

> 
>>> We could in principle do that across all guest memory, but it would require
>>> a large bitmap to track.  This patch represents a compromise: instead we
>>
>> Oh god no, no tracking of all memory. (size of bitmap, memory hotplug,
>> ... migration?)
> 
> Quite, that's why I didn't do it.  Although, I don't actually think
> migration is such an issue: we *already* essentially lose track of
> which pages are inside the balloon across migration.

Well, we migrate zero pages that get replaced by zero pages on the
target. So at least KSM could recover them.

> 
>>> track ballooned subpages for a single contiguous host page at a time.  This
>>> means that if the guest discards all 4kiB chunks of a host page in
>>> succession, we will discard it.  In particular that means the balloon will
>>> continue to work for the (host page size) == (guest page size) > 4kiB case.
>>>
>>> If the guest scatters 4kiB requests across different host pages, we don't
>>> discard anything, and issue a warning.  Not ideal, but at least we don't
>>> corrupt guest memory as the previous version could.
>>
>> My take would be to somehow disable the balloon on the hypervisor side
>> in case the host page size is not 4k. Like not allowing to realize it.
>> No corruptions, no warnings people will ignore.
> 
> No, that's even worse than just having it silently do nothing on
> non-4k setups.  Silently being useless we might get away with, we'll
> just waste memory, but failing the device load will absolutely break
> existing setups.

Silently consuming more memory is very, very evil. Think about
auto-ballooning setups
https://www.ovirt.org/documentation/how-to/autoballooning/

But on the other hand, this has been broken forever for huge pages
without printing warnings. Oh man, virtio-balloon ...

Disallowing realization will only break migration from an old host to a
new host. But migration will bail out right away. We could of course
glue this to a compat machine, but I guess the point you have is that
customers want to continue using this "works by accident without memory
corruptions" virtio-balloon implementation.

> 
>>> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
>>> ---
>>>  hw/virtio/virtio-balloon.c         | 67 +++++++++++++++++++++++++-----
>>>  include/hw/virtio/virtio-balloon.h |  3 ++
>>>  2 files changed, 60 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
>>> index 4435905c87..39573ef2e3 100644
>>> --- a/hw/virtio/virtio-balloon.c
>>> +++ b/hw/virtio/virtio-balloon.c
>>> @@ -33,33 +33,80 @@
>>>  
>>>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
>>>  
>>> +typedef struct PartiallyBalloonedPage {
>>> +    RAMBlock *rb;
>>> +    ram_addr_t base;
>>> +    unsigned long bitmap[];
>>> +} PartiallyBalloonedPage;
>>> +
>>>  static void balloon_inflate_page(VirtIOBalloon *balloon,
>>>                                   MemoryRegion *mr, hwaddr offset)
>>>  {
>>>      void *addr = memory_region_get_ram_ptr(mr) + offset;
>>>      RAMBlock *rb;
>>>      size_t rb_page_size;
>>> -    ram_addr_t ram_offset;
>>> +    int subpages;
>>> +    ram_addr_t ram_offset, host_page_base;
>>>  
>>>      /* XXX is there a better way to get to the RAMBlock than via a
>>>       * host address? */
>>>      rb = qemu_ram_block_from_host(addr, false, &ram_offset);
>>>      rb_page_size = qemu_ram_pagesize(rb);
>>> +    host_page_base = ram_offset & ~(rb_page_size - 1);
>>> +
>>> +    if (rb_page_size == BALLOON_PAGE_SIZE) {
>>> +        /* Easy case */
>>>  
>>> -    /* Silently ignore hugepage RAM blocks */
>>> -    if (rb_page_size != getpagesize()) {
>>> +        ram_block_discard_range(rb, ram_offset, rb_page_size);
>>> +        /* We ignore errors from ram_block_discard_range(), because it
>>> +         * has already reported them, and failing to discard a balloon
>>> +         * page is not fatal */
>>>          return;
>>>      }
>>>  
>>> -    /* Silently ignore unaligned requests */
>>> -    if (ram_offset & (rb_page_size - 1)) {
>>> -        return;
>>> +    /* Hard case
>>> +     *
>>> +     * We've put a piece of a larger host page into the balloon - we
>>> +     * need to keep track until we have a whole host page to
>>> +     * discard
>>> +     */
>>> +    subpages = rb_page_size / BALLOON_PAGE_SIZE;
>>> +
>>> +    if (balloon->pbp
>>> +        && (rb != balloon->pbp->rb
>>> +            || host_page_base != balloon->pbp->base)) {
>>> +        /* We've partially ballooned part of a host page, but now
>>> +         * we're trying to balloon part of a different one.  Too hard,
>>> +         * give up on the old partial page */
>>> +        warn_report("Unable to insert a partial page into virtio-balloon");
>>
>> I am pretty sure that you can create misleading warnings in case you
>> migrate at the wrong time. (migrate while half the 64k page is inflated,
>> on the new host the other part is inflated - a warning when switching to
>> another 64k page).
> 
> Yes we can get bogus warnings across migration with this.  I was
> considering that an acceptable price, but I'm open to better
> alternatives.

Maybe reporting a warning on a 64k host at realize time is a better
approach than warning on every inflation?

"host page size does not match virtio-balloon page size. If the guest
has a different page size than the host, inflating the balloon might not
effectively free up memory."

Or reporting a warning whenever changing the balloon target size.

> 
>>> +        free(balloon->pbp);
>>> +        balloon->pbp = NULL;
>>>      }
>>>  
>>> -    ram_block_discard_range(rb, ram_offset, rb_page_size);
>>> -    /* We ignore errors from ram_block_discard_range(), because it has
>>> -     * already reported them, and failing to discard a balloon page is
>>> -     * not fatal */
>>> +    if (!balloon->pbp) {
>>> +        /* Starting on a new host page */
>>> +        size_t bitlen = BITS_TO_LONGS(subpages) * sizeof(unsigned long);
>>> +        balloon->pbp = g_malloc0(sizeof(PartiallyBalloonedPage) + bitlen);
>>> +        balloon->pbp->rb = rb;
>>> +        balloon->pbp->base = host_page_base;
>>> +    }
>>> +
>>> +    bitmap_set(balloon->pbp->bitmap,
>>> +               (ram_offset - balloon->pbp->base) / BALLOON_PAGE_SIZE,
>>> +               subpages);
>>> +
>>> +    if (bitmap_full(balloon->pbp->bitmap, subpages)) {
>>> +        /* We've accumulated a full host page, we can actually discard
>>> +         * it now */
>>> +
>>> +        ram_block_discard_range(rb, balloon->pbp->base, rb_page_size);
>>> +        /* We ignore errors from ram_block_discard_range(), because it
>>> +         * has already reported them, and failing to discard a balloon
>>> +         * page is not fatal */
>>> +
>>> +        free(balloon->pbp);
>>> +        balloon->pbp = NULL;
>>> +    }
>>>  }
>> No, not a fan of this approach.
> 
> I can see why, but I really can't see what else to do without breaking
> existing, supported, working (albeit by accident) setups.
> 

Is there any reason to use this more complicated "allow random freeing"
approach over the simpler sequential freeing I propose? Then we can
simply migrate the last freed page and should be fine.

As far as I know Linux guests have been freeing and reporting these
pages sequentially, or is that not true? Are you aware of other
implementations that we might miss?

-- 

Thanks,

David / dhildenb


* Re: [Qemu-devel] [RFC 1/5] virtio-balloon: Remove unnecessary MADV_WILLNEED on deflate
  2018-10-15  6:54     ` David Hildenbrand
@ 2018-10-15 10:43       ` Michael S. Tsirkin
  2018-10-15 11:14         ` David Hildenbrand
  2018-12-04  4:26         ` David Gibson
  0 siblings, 2 replies; 31+ messages in thread
From: Michael S. Tsirkin @ 2018-10-15 10:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: David Gibson, dhildenb, imammedo, ehabkost, pbonzini, qemu-devel,
	qemu-ppc

On Mon, Oct 15, 2018 at 08:54:27AM +0200, David Hildenbrand wrote:
> On 12/10/2018 20:05, Michael S. Tsirkin wrote:
> > On Fri, Oct 12, 2018 at 02:24:27PM +1100, David Gibson wrote:
> >> When the balloon is inflated, we discard memory placed in it using madvise()
> >> with MADV_DONTNEED.  And when we deflate it we use MADV_WILLNEED, which
> >> sounds like it makes sense but is actually unnecessary.
> >>
> >> The misleadingly named MADV_DONTNEED just discards the memory in question,
> >> it doesn't set any persistent state on it in-kernel; all that's necessary
> >> to bring the memory back is to touch it.  MADV_WILLNEED in contrast
> >> specifically says that the memory will be used soon and faults it in.
> >>
> >> Memory that's being given back to the guest by deflating the balloon
> >> *might* be used soon, but it equally could just sit around in the guest's
> >> pools until it actually needs it.  And, over the general timescale that
> >> memory ballooning operates, it seems unlikely that one extra fault for the
> >> guest will be a vast performance issue.
> > 
> > Thinking about it, it might be for RT guests.
> > 
> > So I suspect if you want to drop MADV_WILLNEED you need a flag
> > telling qemu that's not the usecase.
> > 
> 
> As far as I know RT guests
> 
> 1. mlock all memory
> 2. use huge pages
> 
> "MADV_DONTNEED cannot be applied to locked pages, Huge TLB pages, or
> VM_PFNMAP pages."
> 
> As DONTNEED never succeeded, WILLNEED will do nothing. (e.g. postcopy
> temporarily has to unlock all memory in order to make DONTNEED work)


Hmm you are right.
Document this in commit log?


> (And "Expect access in the near future." does not sound like guarantees
> to me either way)

Should we be concerned about impact on performance though?

Thinking aloud it might have an impact.  But there is also a benefit.
Let's document known effects so people know what to look for
if they see issues?

	WILLNEED aggressively faults all memory in; removing it will
	cause only as much memory as is needed to be allocated on access,
	slowing down the CPU at that point but conserving host memory.

	With this patch, device assignment hotplug (e.g. with VFIO) will
	take longer, as it needs to allocate all this memory at hotplug
	time. While this happens, the whole VM is currently paused.


> So I think this can go.
> 
> -- 
> 
> Thanks,
> 
> David / dhildenb


* Re: [Qemu-devel] [RFC 1/5] virtio-balloon: Remove unnecessary MADV_WILLNEED on deflate
  2018-10-15 10:43       ` Michael S. Tsirkin
@ 2018-10-15 11:14         ` David Hildenbrand
  2018-12-04  4:26         ` David Gibson
  1 sibling, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2018-10-15 11:14 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Gibson, dhildenb, imammedo, ehabkost, pbonzini, qemu-devel,
	qemu-ppc


>> (And "Expect access in the near future." does not sound like guarantees
>> to me either way)
> 
> Should we be concerned about impact on performance though?

Yes, it really depends on the use case (and most of the time, the memory
won't be directly reused). Let's document it in the commit log just as
you suggest.

> 
> Thinking aloud it might have an impact.  But there is also a benefit.
> Let's document known effects so people know what to look for
> if they see issues?
> 
> 	WILLNEED aggressively faults all memory in; removing it
> 	means just as much memory as needed is allocated on access,
> 	slowing down the CPU at that point but conserving host memory.
> 
> 	With this patch, device assignment (e.g. with VFIO) hotplug will take
> 	longer, as it needs to allocate all this memory at hotplug time.
> 	While this happens, the whole VM is paused.


-- 

Thanks,

David / dhildenb


* Re: [Qemu-devel] [RFC 5/5] virtio-balloon: Safely handle BALLOON_PAGE_SIZE < host page size
  2018-10-15  7:08       ` David Hildenbrand
@ 2018-10-17  3:28         ` David Gibson
  2018-10-17  9:56           ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread
From: David Gibson @ 2018-10-17  3:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: dhildenb, imammedo, ehabkost, mst, pbonzini, qemu-devel, qemu-ppc


On Mon, Oct 15, 2018 at 09:08:45AM +0200, David Hildenbrand wrote:
> >>> This occurs in practice routinely for POWER KVM systems, since both host
> >>> and guest typically use 64kiB pages.
> >>>
> >>> To make this safe, without breaking that useful case, we need to
> >>> accumulate 4kiB balloon requests until we have a whole contiguous host page
> >>> at which point we can discard it.
> >>
> >> ... and if you have a 4k guest it will just randomly report pages and
> >> your 67 LOC tweak is useless.
> > 
> > Yes.
> > 
> >> And this can and will easily happen.
> > 
> > What cases are you thinking of?  For POWER at least, 4k on 64k will be
> > vastly less common than 64k on 64k.
> 
> Okay, I was wondering why we don't get tons of bug reports for 4k on
> 64k.

Right, and the answer is because 64k has been the normal configuration
on every major ppc64 distro for years (for both host and guest).

> So 64k on 64k while using ballooning is supported and heavily used
> then I guess? Because as you said, 4k on 64k triggers memory corruptions.

Right.

> >>> We could in principle do that across all guest memory, but it would require
> >>> a large bitmap to track.  This patch represents a compromise: instead we
> >>
> >> Oh god no, no tracking of all memory. (size of bitmap, memory hotplug,
> >> ... migration?)
> > 
> > Quite, that's why I didn't do it.  Although, I don't actually think
> > migration is such an issue: we *already* essentially lose track of
> > which pages are inside the balloon across migration.
> 
> Well, we migrate zero pages that get replaced by zero pages on the
> target. So at least KSM could recover them.

Right, but there's no explicit state being transferred.  In a sense,
KSM and zero page handling across migration resets the balloon to a
state optimized for the current memory state of the guest.

> >>> track ballooned subpages for a single contiguous host page at a time.  This
> >>> means that if the guest discards all 4kiB chunks of a host page in
> >>> succession, we will discard it.  In particular that means the balloon will
> >>> continue to work for the (host page size) == (guest page size) > 4kiB case.
> >>>
> >>> If the guest scatters 4kiB requests across different host pages, we don't
> >>> discard anything, and issue a warning.  Not ideal, but at least we don't
> >>> corrupt guest memory as the previous version could.
> >>
> >> My take would be to somehow disable the balloon on the hypervisor side
> >> in case the host page size is not 4k. Like not allowing to realize it.
> >> No corruptions, no warnings people will ignore.
> > 
> > No, that's even worse than just having it silently do nothing on
> > non-4k setups.  Silently being useless we might get away with, we'll
> > just waste memory, but failing the device load will absolutely break
> > existing setups.
> 
> Silently consuming more memory is very, very evil. Think about
> auto-ballooning setups
> https://www.ovirt.org/documentation/how-to/autoballooning/

Well, sure, I don't think this is a good idea.

> But on the other hand, this has been broken forever for huge pages
> without printing warnings. Oh man, virtio-balloon ...
> 
> Disallowing to realize will only break migration from an old host to a
> new host. But migration will bail out right away.

The issue isn't really migration here.  If we prevent the balloon
device from realizing, guests which already had it configured simply
won't be able to start after qemu is updated.

> We could of course
> glue this to a compat machine,

We could, but what would it accomplish?  We'd still have to do
something for the old machine versions - so we're back at either
allowing memory corruption like we do right now, or batch gathering as
my patch proposes.

> but I guess the point you have is that
> customer want to continue using this "works by accident without memory
> corruptions" virtio-balloon implementation.

Right.  That's why I really think we need this batch-gathering
approach, ugly and irritating though it is.

> 
> > 
> >>> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> >>> ---
> >>>  hw/virtio/virtio-balloon.c         | 67 +++++++++++++++++++++++++-----
> >>>  include/hw/virtio/virtio-balloon.h |  3 ++
> >>>  2 files changed, 60 insertions(+), 10 deletions(-)
> >>>
> >>> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> >>> index 4435905c87..39573ef2e3 100644
> >>> --- a/hw/virtio/virtio-balloon.c
> >>> +++ b/hw/virtio/virtio-balloon.c
> >>> @@ -33,33 +33,80 @@
> >>>  
> >>>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
> >>>  
> >>> +typedef struct PartiallyBalloonedPage {
> >>> +    RAMBlock *rb;
> >>> +    ram_addr_t base;
> >>> +    unsigned long bitmap[];
> >>> +} PartiallyBalloonedPage;
> >>> +
> >>>  static void balloon_inflate_page(VirtIOBalloon *balloon,
> >>>                                   MemoryRegion *mr, hwaddr offset)
> >>>  {
> >>>      void *addr = memory_region_get_ram_ptr(mr) + offset;
> >>>      RAMBlock *rb;
> >>>      size_t rb_page_size;
> >>> -    ram_addr_t ram_offset;
> >>> +    int subpages;
> >>> +    ram_addr_t ram_offset, host_page_base;
> >>>  
> >>>      /* XXX is there a better way to get to the RAMBlock than via a
> >>>       * host address? */
> >>>      rb = qemu_ram_block_from_host(addr, false, &ram_offset);
> >>>      rb_page_size = qemu_ram_pagesize(rb);
> >>> +    host_page_base = ram_offset & ~(rb_page_size - 1);
> >>> +
> >>> +    if (rb_page_size == BALLOON_PAGE_SIZE) {
> >>> +        /* Easy case */
> >>>  
> >>> -    /* Silently ignore hugepage RAM blocks */
> >>> -    if (rb_page_size != getpagesize()) {
> >>> +        ram_block_discard_range(rb, ram_offset, rb_page_size);
> >>> +        /* We ignore errors from ram_block_discard_range(), because it
> >>> +         * has already reported them, and failing to discard a balloon
> >>> +         * page is not fatal */
> >>>          return;
> >>>      }
> >>>  
> >>> -    /* Silently ignore unaligned requests */
> >>> -    if (ram_offset & (rb_page_size - 1)) {
> >>> -        return;
> >>> +    /* Hard case
> >>> +     *
> >>> +     * We've put a piece of a larger host page into the balloon - we
> >>> +     * need to keep track until we have a whole host page to
> >>> +     * discard
> >>> +     */
> >>> +    subpages = rb_page_size / BALLOON_PAGE_SIZE;
> >>> +
> >>> +    if (balloon->pbp
> >>> +        && (rb != balloon->pbp->rb
> >>> +            || host_page_base != balloon->pbp->base)) {
> >>> +        /* We've partially ballooned part of a host page, but now
> >>> +         * we're trying to balloon part of a different one.  Too hard,
> >>> +         * give up on the old partial page */
> >>> +        warn_report("Unable to insert a partial page into virtio-balloon");
> >>
> >> I am pretty sure that you can create misleading warnings in case you
> >> migrate at the wrong time. (migrate while half the 64k page is inflated,
> >> on the new host the other part is inflated - a warning when switching to
> >> another 64k page).
> > 
> > Yes we can get bogus warnings across migration with this.  I was
> > considering that an acceptable price, but I'm open to better
> > alternatives.
> 
> Is maybe reporting a warning on a 64k host when realizing the better
> approach than on every inflation?
> 
> "host page size does not match virtio-balloon page size. If the guest
> has a different page size than the host, inflating the balloon might not
> effectively free up memory."

That might work - I'll see what I can come up with.  One complication
is that theoretically at least, you can have multiple host page sizes
(main memory in normal pages, a DIMM in hugepages).  That makes the
condition on which the warning should be issued a bit fiddly to work
out.

> Or reporting a warning whenever changing the balloon target size.

I'm not sure what you mean by this.

> 
> > 
> >>> +        free(balloon->pbp);
> >>> +        balloon->pbp = NULL;
> >>>      }
> >>>  
> >>> -    ram_block_discard_range(rb, ram_offset, rb_page_size);
> >>> -    /* We ignore errors from ram_block_discard_range(), because it has
> >>> -     * already reported them, and failing to discard a balloon page is
> >>> -     * not fatal */
> >>> +    if (!balloon->pbp) {
> >>> +        /* Starting on a new host page */
> >>> +        size_t bitlen = BITS_TO_LONGS(subpages) * sizeof(unsigned long);
> >>> +        balloon->pbp = g_malloc0(sizeof(PartiallyBalloonedPage) + bitlen);
> >>> +        balloon->pbp->rb = rb;
> >>> +        balloon->pbp->base = host_page_base;
> >>> +    }
> >>> +
> >>> +    bitmap_set(balloon->pbp->bitmap,
> >>> +               (ram_offset - balloon->pbp->base) / BALLOON_PAGE_SIZE,
> >>> +               subpages);
> >>> +
> >>> +    if (bitmap_full(balloon->pbp->bitmap, subpages)) {
> >>> +        /* We've accumulated a full host page, we can actually discard
> >>> +         * it now */
> >>> +
> >>> +        ram_block_discard_range(rb, balloon->pbp->base, rb_page_size);
> >>> +        /* We ignore errors from ram_block_discard_range(), because it
> >>> +         * has already reported them, and failing to discard a balloon
> >>> +         * page is not fatal */
> >>> +
> >>> +        free(balloon->pbp);
> >>> +        balloon->pbp = NULL;
> >>> +    }
> >>>  }
> >> No, not a fan of this approach.
> > 
> > I can see why, but I really can't see what else to do without breaking
> > existing, supported, working (albeit by accident) setups.
> 
> Is there any reason to use this more complicated "allow random freeing"
> approach over a simplistic sequential freeing I propose? Then we can
> simply migrate the last freed page and should be fine.

Well.. your approach is probably simpler in terms of the calculations
that need to be done, though only very slightly.  I think my approach
is conceptually clearer though, since we're explicitly checking for
exactly the condition we need, rather than something we think should
match up with that condition.

Even if the extra robustness is strictly theoretical, I prefer the
idea of more robust but slightly longer code versus less robust and
slightly shorter code - at least when the latter will want a long
detailed comment explaining why it's correct.

> As far as I know Linux guests have been freeing and reporting these
> pages sequentially, or is that not true? Are you aware of other
> implementations that we might miss?

No, I'm not aware of any other implementations we're likely to care
about.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [RFC 0/5] Improve balloon handling of pagesizes other than 4kiB
  2018-10-12 17:26 ` [Qemu-devel] [RFC 0/5] Improve balloon handling of pagesizes other than 4kiB Michael S. Tsirkin
@ 2018-10-17  3:31   ` David Gibson
  0 siblings, 0 replies; 31+ messages in thread
From: David Gibson @ 2018-10-17  3:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: dhildenb, imammedo, ehabkost, pbonzini, qemu-devel, qemu-ppc


On Fri, Oct 12, 2018 at 01:26:45PM -0400, Michael S. Tsirkin wrote:
> On Fri, Oct 12, 2018 at 02:24:26PM +1100, David Gibson wrote:
> > The virtio-balloon device was never really thought out for cases
> > other than 4kiB pagesize on both guest and host.  It works in some
> > cases, but in others can be ineffectual or even cause guest memory
> > corruption.
> > 
> > This series makes a handful of preliminary cleanups, then makes a
> > change to safely, though not perfectly, handle cases with non 4kiB
> > pagesizes.
> 
> BTW do you want to add an interface to specify the page size?
> I can see either host or guest or both supporting that.

To make this work, it really has to be advertised by the host.  Only
the host can know what a suitable discard granularity is.  It would
make the balloon more widely usable, however it makes the guest side
driver vastly more complex, since it will need to find contiguous
chunks of memory to discard.

Given the numerous problems with the balloon model (which David
Hildenbrand has pointed out and continues to point out), I'm not sure
it's worth the trouble.

> Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
> 
> 
> > David Gibson (5):
> >   virtio-balloon: Remove unnecessary MADV_WILLNEED on deflate
> >   virtio-balloon: Corrections to address verification
> >   virtio-balloon: Rework ballon_page() interface
> >   virtio-balloon: Use ram_block_discard_range() instead of raw madvise()
> >   virtio-balloon: Safely handle BALLOON_PAGE_SIZE < host page size
> > 
> >  hw/virtio/virtio-balloon.c         | 100 ++++++++++++++++++++++++-----
> >  include/hw/virtio/virtio-balloon.h |   3 +
> >  2 files changed, 87 insertions(+), 16 deletions(-)
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [RFC 5/5] virtio-balloon: Safely handle BALLOON_PAGE_SIZE < host page size
  2018-10-17  3:28         ` David Gibson
@ 2018-10-17  9:56           ` David Hildenbrand
  2018-10-23  8:02             ` David Gibson
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2018-10-17  9:56 UTC (permalink / raw)
  To: David Gibson
  Cc: dhildenb, imammedo, ehabkost, mst, pbonzini, qemu-devel, qemu-ppc

>>>> I am pretty sure that you can create misleading warnings in case you
>>>> migrate at the wrong time. (migrate while half the 64k page is inflated,
>>>> on the new host the other part is inflated - a warning when switching to
>>>> another 64k page).
>>>
>>> Yes we can get bogus warnings across migration with this.  I was
>>> considering that an acceptable price, but I'm open to better
>>> alternatives.
>>
>> Is maybe reporting a warning on a 64k host when realizing the better
>> approach than on every inflation?
>>
>> "host page size does not match virtio-balloon page size. If the guest
>> has a different page size than the host, inflating the balloon might not
>> effectively free up memory."
> 
> That might work - I'll see what I can come up with.  One complication
> is that theoretically at least, you can have multiple host page sizes
> (main memory in normal pages, a DIMM in hugepages).  That makes the
> condition on which the warning should be issued a bit fiddly to work
> out.

I assume issuing a warning on these strange systems would not hurt after
all. ("there is a chance this might not work")

> 
>> Or reporting a warning whenever changing the balloon target size.
> 
> I'm not sure what you mean by this.

I mean when setting e.g. qmp_balloon() to something != 0. This avoids a
warning when a virtio-balloon device is silently created (e.g. by
libvirt?) but never used.

Checking in virtio_balloon_to_target would be sufficient I guess.

> 
>>
>>>
>>>>> +        free(balloon->pbp);
>>>>> +        balloon->pbp = NULL;
>>>>>      }
>>>>>  
>>>>> -    ram_block_discard_range(rb, ram_offset, rb_page_size);
>>>>> -    /* We ignore errors from ram_block_discard_range(), because it has
>>>>> -     * already reported them, and failing to discard a balloon page is
>>>>> -     * not fatal */
>>>>> +    if (!balloon->pbp) {
>>>>> +        /* Starting on a new host page */
>>>>> +        size_t bitlen = BITS_TO_LONGS(subpages) * sizeof(unsigned long);
>>>>> +        balloon->pbp = g_malloc0(sizeof(PartiallyBalloonedPage) + bitlen);
>>>>> +        balloon->pbp->rb = rb;
>>>>> +        balloon->pbp->base = host_page_base;
>>>>> +    }
>>>>> +
>>>>> +    bitmap_set(balloon->pbp->bitmap,
>>>>> +               (ram_offset - balloon->pbp->base) / BALLOON_PAGE_SIZE,
>>>>> +               subpages);
>>>>> +
>>>>> +    if (bitmap_full(balloon->pbp->bitmap, subpages)) {
>>>>> +        /* We've accumulated a full host page, we can actually discard
>>>>> +         * it now */
>>>>> +
>>>>> +        ram_block_discard_range(rb, balloon->pbp->base, rb_page_size);
>>>>> +        /* We ignore errors from ram_block_discard_range(), because it
>>>>> +         * has already reported them, and failing to discard a balloon
>>>>> +         * page is not fatal */
>>>>> +
>>>>> +        free(balloon->pbp);
>>>>> +        balloon->pbp = NULL;
>>>>> +    }
>>>>>  }
>>>> No, not a fan of this approach.
>>>
>>> I can see why, but I really can't see what else to do without breaking
>>> existing, supported, working (albeit by accident) setups.
>>
>> Is there any reason to use this more complicated "allow random freeing"
>> approach over a simplistic sequential freeing I propose? Then we can
>> simply migrate the last freed page and should be fine.
> 
> Well.. your approach is probably simpler in terms of the calculations
> that need to be done, though only very slightly.  I think my approach
> is conceptually clearer though, since we're explicitly checking for
> exactly the condition we need, rather than something we think should
> match up with that condition.

I prefer to keep it simple where possible. We expect sequential freeing,
so it's easy to implement with only one additional uint64_t that can be
easily migrated. Having to use bitmaps + alloc/free is not really needed.

If you insist, at least try to get rid of the malloc to e.g. simplify
migration. (otherwise, please add freeing code on unrealize(), I guess
you are missing that right now)

-- 

Thanks,

David / dhildenb


* Re: [Qemu-devel] [RFC 5/5] virtio-balloon: Safely handle BALLOON_PAGE_SIZE < host page size
  2018-10-17  9:56           ` David Hildenbrand
@ 2018-10-23  8:02             ` David Gibson
  2018-10-23 15:13               ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread
From: David Gibson @ 2018-10-23  8:02 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: dhildenb, imammedo, ehabkost, mst, pbonzini, qemu-devel, qemu-ppc


On Wed, Oct 17, 2018 at 11:56:09AM +0200, David Hildenbrand wrote:
> >>>> I am pretty sure that you can create misleading warnings in case you
> >>>> migrate at the wrong time. (migrate while half the 64k page is inflated,
> >>>> on the new host the other part is inflated - a warning when switching to
> >>>> another 64k page).
> >>>
> >>> Yes we can get bogus warnings across migration with this.  I was
> >>> considering that an acceptable price, but I'm open to better
> >>> alternatives.
> >>
> >> Is maybe reporting a warning on a 64k host when realizing the better
> >> approach than on every inflation?
> >>
> >> "host page size does not match virtio-balloon page size. If the guest
> >> has a different page size than the host, inflating the balloon might not
> >> effectively free up memory."
> > 
> > That might work - I'll see what I can come up with.  One complication
> > is that theoretically at least, you can have multiple host page sizes
> > (main memory in normal pages, a DIMM in hugepages).  That makes the
> > condition on which the warning should be issued a bit fiddly to work
> > out.
> 
> I assume issuing a warning on these strange systems would not hurt after
> all. ("there is a chance this might not work")

Sure.

> >> Or reporting a warning whenever changing the balloon target size.
> > 
> > I'm not sure what you mean by this.
> 
> I mean when setting e.g. qmp_balloon() to something != 0. This avoids a
> warning when a virtio-balloon device is silently created (e.g. by
> libvirt?) but never used.

Ah, right.  Yeah, that's a good idea.

> Checking in virtio_balloon_to_target would be sufficient I guess.
> 
> > 
> >>
> >>>
> >>>>> +        free(balloon->pbp);
> >>>>> +        balloon->pbp = NULL;
> >>>>>      }
> >>>>>  
> >>>>> -    ram_block_discard_range(rb, ram_offset, rb_page_size);
> >>>>> -    /* We ignore errors from ram_block_discard_range(), because it has
> >>>>> -     * already reported them, and failing to discard a balloon page is
> >>>>> -     * not fatal */
> >>>>> +    if (!balloon->pbp) {
> >>>>> +        /* Starting on a new host page */
> >>>>> +        size_t bitlen = BITS_TO_LONGS(subpages) * sizeof(unsigned long);
> >>>>> +        balloon->pbp = g_malloc0(sizeof(PartiallyBalloonedPage) + bitlen);
> >>>>> +        balloon->pbp->rb = rb;
> >>>>> +        balloon->pbp->base = host_page_base;
> >>>>> +    }
> >>>>> +
> >>>>> +    bitmap_set(balloon->pbp->bitmap,
> >>>>> +               (ram_offset - balloon->pbp->base) / BALLOON_PAGE_SIZE,
> >>>>> +               subpages);
> >>>>> +
> >>>>> +    if (bitmap_full(balloon->pbp->bitmap, subpages)) {
> >>>>> +        /* We've accumulated a full host page, we can actually discard
> >>>>> +         * it now */
> >>>>> +
> >>>>> +        ram_block_discard_range(rb, balloon->pbp->base, rb_page_size);
> >>>>> +        /* We ignore errors from ram_block_discard_range(), because it
> >>>>> +         * has already reported them, and failing to discard a balloon
> >>>>> +         * page is not fatal */
> >>>>> +
> >>>>> +        free(balloon->pbp);
> >>>>> +        balloon->pbp = NULL;
> >>>>> +    }
> >>>>>  }
> >>>> No, not a fan of this approach.
> >>>
> >>> I can see why, but I really can't see what else to do without breaking
> >>> existing, supported, working (albeit by accident) setups.
> >>
> >> Is there any reason to use this more complicated "allow random freeing"
> >> approach over a simplistic sequential freeing I propose? Then we can
> >> simply migrate the last freed page and should be fine.
> > 
> > Well.. your approach is probably simpler in terms of the calculations
> > that need to be done, though only very slightly.  I think my approach
> > is conceptually clearer though, since we're explicitly checking for
> > exactly the condition we need, rather than something we think should
> > match up with that condition.
> 
> I prefer to keep it simple where possible. We expect sequential freeing,
> so it's easy to implement with only one additional uint64_t that can be
> easily migrated. Having to use bitmaps + alloc/free is not really needed.
> 
> If you insist, at least try to get rid of the malloc to e.g. simplify
> migration.

I don't think there's really anything to do for migration; we already
effectively lose the state of the balloon across migration.  I think
it's fine to lose a little extra transitional state.

> (otherwise, please add freeing code on unrealize(), I guess
> you are missing that right now)

Ah, good point, I need to fix that.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [RFC 5/5] virtio-balloon: Safely handle BALLOON_PAGE_SIZE < host page size
  2018-10-23  8:02             ` David Gibson
@ 2018-10-23 15:13               ` David Hildenbrand
  0 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2018-10-23 15:13 UTC (permalink / raw)
  To: David Gibson
  Cc: dhildenb, imammedo, ehabkost, mst, pbonzini, qemu-devel, qemu-ppc


>>>>> I can see why, but I really can't see what else to do without breaking
>>>>> existing, supported, working (albeit by accident) setups.
>>>>
>>>> Is there any reason to use this more complicated "allow random freeing"
>>>> approach over a simplistic sequential freeing I propose? Then we can
>>>> simply migrate the last freed page and should be fine.
>>>
>>> Well.. your approach is probably simpler in terms of the calculations
>>> that need to be done, though only very slightly.  I think my approach
>>> is conceptually clearer though, since we're explicitly checking for
>>> exactly the condition we need, rather than something we think should
>>> match up with that condition.
>>
>> I prefer to keep it simple where possible. We expect sequential freeing,
>> so it's easy to implement with only one additional uint64_t that can be
>> easily migrated. Having to use bitmaps + alloc/free is not really needed.
>>
>> If you insist, at least try to get rid of the malloc to e.g. simplify
>> migration.
> 
> I don't think there's really anything to do for migration; we already
> effectively lose the state of the balloon across migration.  I think
> it's fine to lose a little extra transitional state.

Yeah, I guess it would only make sense when trying to avoid misleading
warnings (depending on which approach you'll use for when to report
warnings).

-- 

Thanks,

David / dhildenb


* Re: [Qemu-devel] [RFC 1/5] virtio-balloon: Remove unnecessary MADV_WILLNEED on deflate
  2018-10-15 10:43       ` Michael S. Tsirkin
  2018-10-15 11:14         ` David Hildenbrand
@ 2018-12-04  4:26         ` David Gibson
  1 sibling, 0 replies; 31+ messages in thread
From: David Gibson @ 2018-12-04  4:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Hildenbrand, dhildenb, imammedo, ehabkost, pbonzini,
	qemu-devel, qemu-ppc


On Mon, Oct 15, 2018 at 06:43:06AM -0400, Michael S. Tsirkin wrote:
> On Mon, Oct 15, 2018 at 08:54:27AM +0200, David Hildenbrand wrote:
> > On 12/10/2018 20:05, Michael S. Tsirkin wrote:
> > > On Fri, Oct 12, 2018 at 02:24:27PM +1100, David Gibson wrote:
> > >> When the balloon is inflated, we discard memory place in it using madvise()
> > >> with MADV_DONTNEED.  And when we deflate it we use MADV_WILLNEED, which
> > >> sounds like it makes sense but is actually unnecessary.
> > >>
> > >> The misleadingly named MADV_DONTNEED just discards the memory in question,
> > >> it doesn't set any persistent state on it in-kernel; all that's necessary
> > >> to bring the memory back is to touch it.  MADV_WILLNEED in contrast
> > >> specifically says that the memory will be used soon and faults it in.
> > >>
> > >> Memory that's being given back to the guest by deflating the balloon
> > >> *might* be used soon, but it equally could just sit around in the guest's
> > >> pools until it actually needs it.  And, over the general timescale that
> > >> memory ballooning operates, it seems unlikely that one extra fault for the
> > >> guest will be a vast performance issue.
> > > 
> > > Thinking about it, it might be for RT guests.
> > > 
> > > So I suspect if you want to drop MADV_WILLNEED you need a flag
> > > telling qemu that's not the usecase.
> > > 
> > 
> > As far as I know RT guests
> > 
> > 1. mlock all memory
> > 2. use huge pages
> > 
> > "MADV_DONTNEED cannot be applied to locked pages, Huge TLB pages, or
> > VM_PFNMAP pages."
> > 
> > As DONTNEED never succeeded, WILLNEED will do nothing. (e.g. postcopy
> > temporarily has to unlock all memory in order to make DONTNEED work)
> 
> 
> Hmm you are right.
> Document this in commit log?
> 
> 
> > (And "Expect access in the near future." does not sound like guarantees
> > to me either way)
> 
> Should we be concerned about impact on performance though?
> 
> Thinking aloud it might have an impact.  But there is also a benefit.
> Let's document known effects so people know what to look for
> if they see issues?

Ok, my new text in the commit message is:

This patch simplifies the balloon operation by dropping the madvise()
on deflate.  This might have an impact on performance - it moves a
delay from deflate time to the point where that memory is actually
touched, which may be more latency-sensitive.  However:

  * Memory that's being given back to the guest by deflating the
    balloon *might* be used soon, but it equally could just sit around
    in the guest's pools until needed (or even be faulted out again if
    the host is under memory pressure).

  * Usually, the timescale over which you'll be adjusting the balloon
    is long enough that a few extra faults after deflation aren't
    going to make a difference.

Does that seem like it covers what it should?

> 	WILLNEED aggressively faults all memory in; removing it
> 	means just as much memory as needed is allocated on access,
> 	slowing down the CPU at that point but conserving host memory.
> 
> 	With this patch, device assignment (e.g. with VFIO) hotplug will take
> 	longer, as it needs to allocate all this memory at hotplug time.
> 	While this happens, the whole VM is paused.

I'm not sure what you mean by this.  This patch is very specific to
the virtio-balloon deflate path.  "True" memory hotplug (e.g. via ACPI
or PAPR dynamic reconfiguration) is unaffected.  And if VFIO is in
play, all guest memory will generally be locked, just like with
realtime.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



end of thread, other threads:[~2018-12-04  5:19 UTC | newest]

Thread overview: 31+ messages
-- links below jump to the message on this page --
2018-10-12  3:24 [Qemu-devel] [RFC 0/5] Improve balloon handling of pagesizes other than 4kiB David Gibson
2018-10-12  3:24 ` [Qemu-devel] [RFC 1/5] virtio-balloon: Remove unnecessary MADV_WILLNEED on deflate David Gibson
2018-10-12  7:40   ` David Hildenbrand
2018-10-13  6:26     ` David Gibson
2018-10-12 17:41   ` Richard Henderson
2018-10-12 17:59     ` Eric Blake
2018-10-13  6:23     ` David Gibson
2018-10-12 18:05   ` Michael S. Tsirkin
2018-10-15  6:54     ` David Hildenbrand
2018-10-15 10:43       ` Michael S. Tsirkin
2018-10-15 11:14         ` David Hildenbrand
2018-12-04  4:26         ` David Gibson
2018-10-12  3:24 ` [Qemu-devel] [RFC 2/5] virtio-balloon: Corrections to address verification David Gibson
2018-10-12  7:44   ` David Hildenbrand
2018-10-13  6:25     ` David Gibson
2018-10-12  3:24 ` [Qemu-devel] [RFC 3/5] virtio-balloon: Rework ballon_page() interface David Gibson
2018-10-12  7:46   ` David Hildenbrand
2018-10-13  6:29     ` David Gibson
2018-10-12  3:24 ` [Qemu-devel] [RFC 4/5] virtio-balloon: Use ram_block_discard_range() instead of raw madvise() David Gibson
2018-10-12  3:24 ` [Qemu-devel] [RFC 5/5] virtio-balloon: Safely handle BALLOON_PAGE_SIZE < host page size David Gibson
2018-10-12  8:06   ` David Hildenbrand
2018-10-13  6:40     ` David Gibson
2018-10-15  7:08       ` David Hildenbrand
2018-10-17  3:28         ` David Gibson
2018-10-17  9:56           ` David Hildenbrand
2018-10-23  8:02             ` David Gibson
2018-10-23 15:13               ` David Hildenbrand
2018-10-12  8:32   ` David Hildenbrand
2018-10-13  6:41     ` David Gibson
2018-10-12 17:26 ` [Qemu-devel] [RFC 0/5] Improve balloon handling of pagesizes other than 4kiB Michael S. Tsirkin
2018-10-17  3:31   ` David Gibson
