[PATCH v2 0/4] Balloon inhibit enhancements, vfio restriction

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2 0/4] Balloon inhibit enhancements, vfio restriction
@ 2018-07-30 23:13 ` Alex Williamson
  0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2018-07-30 23:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: cohuck, david, peterx, kvm, mst

v2:
 - Use atomic ops for balloon inhibit counter (Peter)
 - Allow endpoint driver opt-in for ballooning, vfio-ccw opt-in by
   default, vfio-pci opt-in by device option, only allowed for mdev
   devices, no support added for platform as there are no platform
   mdev devices.

See patch 3/4 for detailed explanation why ballooning and device
assignment typically don't mix.  If this eventually changes, flags
on the iommu info struct or perhaps device info struct can inform
us for automatic opt-in.  Thanks,

Alex

---

Alex Williamson (4):
      balloon: Allow nested inhibits
      kvm: Use inhibit to prevent ballooning without synchronous mmu
      vfio: Inhibit ballooning based on group attachment to a container
      vfio/ccw/pci: Allow devices to opt-in for ballooning


 accel/kvm/kvm-all.c           |    4 ++++
 balloon.c                     |   13 ++++++++++---
 hw/vfio/ccw.c                 |    9 +++++++++
 hw/vfio/common.c              |   26 ++++++++++++++++++++++++++
 hw/vfio/pci.c                 |   26 +++++++++++++++++++++++++-
 hw/vfio/trace-events          |    1 +
 hw/virtio/virtio-balloon.c    |    4 +---
 include/hw/vfio/vfio-common.h |    2 ++
 8 files changed, 78 insertions(+), 7 deletions(-)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 0/4] Balloon inhibit enhancements, vfio restriction
@ 2018-07-30 23:13 ` Alex Williamson
  0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2018-07-30 23:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: kvm, peterx, cohuck, mst, david

v2:
 - Use atomic ops for balloon inhibit counter (Peter)
 - Allow endpoint driver opt-in for ballooning, vfio-ccw opt-in by
   default, vfio-pci opt-in by device option, only allowed for mdev
   devices, no support added for platform as there are no platform
   mdev devices.

See patch 3/4 for detailed explanation why ballooning and device
assignment typically don't mix.  If this eventually changes, flags
on the iommu info struct or perhaps device info struct can inform
us for automatic opt-in.  Thanks,

Alex

---

Alex Williamson (4):
      balloon: Allow nested inhibits
      kvm: Use inhibit to prevent ballooning without synchronous mmu
      vfio: Inhibit ballooning based on group attachment to a container
      vfio/ccw/pci: Allow devices to opt-in for ballooning


 accel/kvm/kvm-all.c           |    4 ++++
 balloon.c                     |   13 ++++++++++---
 hw/vfio/ccw.c                 |    9 +++++++++
 hw/vfio/common.c              |   26 ++++++++++++++++++++++++++
 hw/vfio/pci.c                 |   26 +++++++++++++++++++++++++-
 hw/vfio/trace-events          |    1 +
 hw/virtio/virtio-balloon.c    |    4 +---
 include/hw/vfio/vfio-common.h |    2 ++
 8 files changed, 78 insertions(+), 7 deletions(-)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v2 1/4] balloon: Allow nested inhibits
  2018-07-30 23:13 ` [Qemu-devel] " Alex Williamson
@ 2018-07-30 23:13   ` Alex Williamson
  -1 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2018-07-30 23:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: cohuck, david, peterx, kvm, mst

A simple true/false internal state does not allow multiple users.  Fix
this within the existing interface by converting to a counter, so long
as the counter is elevated, ballooning is inhibited.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 balloon.c |   13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/balloon.c b/balloon.c
index 6bf0a9681377..931987983858 100644
--- a/balloon.c
+++ b/balloon.c
@@ -26,6 +26,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu-common.h"
+#include "qemu/atomic.h"
 #include "exec/cpu-common.h"
 #include "sysemu/kvm.h"
 #include "sysemu/balloon.h"
@@ -37,16 +38,22 @@
 static QEMUBalloonEvent *balloon_event_fn;
 static QEMUBalloonStatus *balloon_stat_fn;
 static void *balloon_opaque;
-static bool balloon_inhibited;
+static int balloon_inhibit_count;
 
 bool qemu_balloon_is_inhibited(void)
 {
-    return balloon_inhibited;
+    return atomic_read(&balloon_inhibit_count) > 0;
 }
 
 void qemu_balloon_inhibit(bool state)
 {
-    balloon_inhibited = state;
+    if (state) {
+        atomic_inc(&balloon_inhibit_count);
+    } else {
+        atomic_dec(&balloon_inhibit_count);
+    }
+
+    assert(atomic_read(&balloon_inhibit_count) >= 0);
 }
 
 static bool have_balloon(Error **errp)

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 1/4] balloon: Allow nested inhibits
@ 2018-07-30 23:13   ` Alex Williamson
  0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2018-07-30 23:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: kvm, peterx, cohuck, mst, david

A simple true/false internal state does not allow multiple users.  Fix
this within the existing interface by converting to a counter, so long
as the counter is elevated, ballooning is inhibited.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 balloon.c |   13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/balloon.c b/balloon.c
index 6bf0a9681377..931987983858 100644
--- a/balloon.c
+++ b/balloon.c
@@ -26,6 +26,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu-common.h"
+#include "qemu/atomic.h"
 #include "exec/cpu-common.h"
 #include "sysemu/kvm.h"
 #include "sysemu/balloon.h"
@@ -37,16 +38,22 @@
 static QEMUBalloonEvent *balloon_event_fn;
 static QEMUBalloonStatus *balloon_stat_fn;
 static void *balloon_opaque;
-static bool balloon_inhibited;
+static int balloon_inhibit_count;
 
 bool qemu_balloon_is_inhibited(void)
 {
-    return balloon_inhibited;
+    return atomic_read(&balloon_inhibit_count) > 0;
 }
 
 void qemu_balloon_inhibit(bool state)
 {
-    balloon_inhibited = state;
+    if (state) {
+        atomic_inc(&balloon_inhibit_count);
+    } else {
+        atomic_dec(&balloon_inhibit_count);
+    }
+
+    assert(atomic_read(&balloon_inhibit_count) >= 0);
 }
 
 static bool have_balloon(Error **errp)

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 2/4] kvm: Use inhibit to prevent ballooning without synchronous mmu
  2018-07-30 23:13 ` [Qemu-devel] " Alex Williamson
@ 2018-07-30 23:14   ` Alex Williamson
  -1 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2018-07-30 23:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: cohuck, david, peterx, kvm, mst

Remove KVM specific tests in balloon_page(), instead marking
ballooning as inhibited without KVM_CAP_SYNC_MMU support.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 accel/kvm/kvm-all.c        |    4 ++++
 hw/virtio/virtio-balloon.c |    4 +---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index eb7db92a5e3b..38f468d8e2b1 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -39,6 +39,7 @@
 #include "trace.h"
 #include "hw/irq.h"
 #include "sysemu/sev.h"
+#include "sysemu/balloon.h"
 
 #include "hw/boards.h"
 
@@ -1698,6 +1699,9 @@ static int kvm_init(MachineState *ms)
     s->many_ioeventfds = kvm_check_many_ioeventfds();
 
     s->sync_mmu = !!kvm_vm_check_extension(kvm_state, KVM_CAP_SYNC_MMU);
+    if (!s->sync_mmu) {
+        qemu_balloon_inhibit(true);
+    }
 
     return 0;
 
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 1f7a87f09429..b5425080c5fb 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -21,7 +21,6 @@
 #include "hw/mem/pc-dimm.h"
 #include "sysemu/balloon.h"
 #include "hw/virtio/virtio-balloon.h"
-#include "sysemu/kvm.h"
 #include "exec/address-spaces.h"
 #include "qapi/error.h"
 #include "qapi/qapi-events-misc.h"
@@ -36,8 +35,7 @@
 
 static void balloon_page(void *addr, int deflate)
 {
-    if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
-                                         kvm_has_sync_mmu())) {
+    if (!qemu_balloon_is_inhibited()) {
         qemu_madvise(addr, BALLOON_PAGE_SIZE,
                 deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
     }

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 2/4] kvm: Use inhibit to prevent ballooning without synchronous mmu
@ 2018-07-30 23:14   ` Alex Williamson
  0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2018-07-30 23:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: kvm, peterx, cohuck, mst, david

Remove KVM specific tests in balloon_page(), instead marking
ballooning as inhibited without KVM_CAP_SYNC_MMU support.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 accel/kvm/kvm-all.c        |    4 ++++
 hw/virtio/virtio-balloon.c |    4 +---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index eb7db92a5e3b..38f468d8e2b1 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -39,6 +39,7 @@
 #include "trace.h"
 #include "hw/irq.h"
 #include "sysemu/sev.h"
+#include "sysemu/balloon.h"
 
 #include "hw/boards.h"
 
@@ -1698,6 +1699,9 @@ static int kvm_init(MachineState *ms)
     s->many_ioeventfds = kvm_check_many_ioeventfds();
 
     s->sync_mmu = !!kvm_vm_check_extension(kvm_state, KVM_CAP_SYNC_MMU);
+    if (!s->sync_mmu) {
+        qemu_balloon_inhibit(true);
+    }
 
     return 0;
 
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 1f7a87f09429..b5425080c5fb 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -21,7 +21,6 @@
 #include "hw/mem/pc-dimm.h"
 #include "sysemu/balloon.h"
 #include "hw/virtio/virtio-balloon.h"
-#include "sysemu/kvm.h"
 #include "exec/address-spaces.h"
 #include "qapi/error.h"
 #include "qapi/qapi-events-misc.h"
@@ -36,8 +35,7 @@
 
 static void balloon_page(void *addr, int deflate)
 {
-    if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
-                                         kvm_has_sync_mmu())) {
+    if (!qemu_balloon_is_inhibited()) {
         qemu_madvise(addr, BALLOON_PAGE_SIZE,
                 deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
     }

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 3/4] vfio: Inhibit ballooning based on group attachment to a container
  2018-07-30 23:13 ` [Qemu-devel] " Alex Williamson
@ 2018-07-30 23:14   ` Alex Williamson
  -1 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2018-07-30 23:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: cohuck, david, peterx, kvm, mst

We use a VFIOContainer to associate an AddressSpace to one or more
VFIOGroups.  The VFIOContainer represents the DMA context for that
AdressSpace for those VFIOGroups and is synchronized to changes in
that AddressSpace via a MemoryListener.  For IOMMU backed devices,
maintaining the DMA context for a VFIOGroup generally involves
pinning a host virtual address in order to create a stable host
physical address and then mapping a translation from the associated
guest physical address to that host physical address into the IOMMU.

While the above maintains the VFIOContainer synchronized to the QEMU
memory API of the VM, memory ballooning occurs outside of that API.
Inflating the memory balloon (ie. cooperatively capturing pages from
the guest for use by the host) simply uses MADV_DONTNEED to "zap"
pages from QEMU's host virtual address space.  The page pinning and
IOMMU mapping above remains in place, negating the host's ability to
reuse the page, but the host virtual to host physical mapping of the
page is invalidated outside of QEMU's memory API.

When the balloon is later deflated, attempting to cooperatively
return pages to the guest, the page is simply freed by the guest
balloon driver, allowing it to be used in the guest and incurring a
page fault when that occurs.  The page fault maps a new host physical
page backing the existing host virtual address, meanwhile the
VFIOContainer still maintains the translation to the original host
physical address.  At this point the guest vCPU and any assigned
devices will map different host physical addresses to the same guest
physical address.  Badness.

The IOMMU typically does not have page level granularity with which
it can track this mapping without also incurring inefficiencies in
using page size mappings throughout.  MMU notifiers in the host
kernel also provide indicators for invalidating the mapping on
balloon inflation, not for updating the mapping when the balloon is
deflated.  For these reasons we assume a default behavior that the
mapping of each VFIOGroup into the VFIOContainer is incompatible
with memory ballooning and increment the balloon inhibitor to match
the attached VFIOGroups.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/common.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index fb396cf00ac4..4881b691a659 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -32,6 +32,7 @@
 #include "hw/hw.h"
 #include "qemu/error-report.h"
 #include "qemu/range.h"
+#include "sysemu/balloon.h"
 #include "sysemu/kvm.h"
 #include "trace.h"
 #include "qapi/error.h"
@@ -1049,6 +1050,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
             group->container = container;
             QLIST_INSERT_HEAD(&container->group_list, group, container_next);
             vfio_kvm_device_add_group(group);
+            qemu_balloon_inhibit(true);
             return 0;
         }
     }
@@ -1198,6 +1200,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     }

     vfio_kvm_device_add_group(group);
+    qemu_balloon_inhibit(true);

     QLIST_INIT(&container->group_list);
     QLIST_INSERT_HEAD(&space->containers, container, next);
@@ -1222,6 +1225,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
 listener_release_exit:
     QLIST_REMOVE(group, container_next);
     QLIST_REMOVE(container, next);
+    qemu_balloon_inhibit(false);
     vfio_kvm_device_del_group(group);
     vfio_listener_release(container);

@@ -1352,6 +1356,7 @@ void vfio_put_group(VFIOGroup *group)
         return;
     }

+    qemu_balloon_inhibit(false);
     vfio_kvm_device_del_group(group);
     vfio_disconnect_container(group);
     QLIST_REMOVE(group, next);

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 3/4] vfio: Inhibit ballooning based on group attachment to a container
@ 2018-07-30 23:14   ` Alex Williamson
  0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2018-07-30 23:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: kvm, peterx, cohuck, mst, david

We use a VFIOContainer to associate an AddressSpace to one or more
VFIOGroups.  The VFIOContainer represents the DMA context for that
AdressSpace for those VFIOGroups and is synchronized to changes in
that AddressSpace via a MemoryListener.  For IOMMU backed devices,
maintaining the DMA context for a VFIOGroup generally involves
pinning a host virtual address in order to create a stable host
physical address and then mapping a translation from the associated
guest physical address to that host physical address into the IOMMU.

While the above maintains the VFIOContainer synchronized to the QEMU
memory API of the VM, memory ballooning occurs outside of that API.
Inflating the memory balloon (ie. cooperatively capturing pages from
the guest for use by the host) simply uses MADV_DONTNEED to "zap"
pages from QEMU's host virtual address space.  The page pinning and
IOMMU mapping above remains in place, negating the host's ability to
reuse the page, but the host virtual to host physical mapping of the
page is invalidated outside of QEMU's memory API.

When the balloon is later deflated, attempting to cooperatively
return pages to the guest, the page is simply freed by the guest
balloon driver, allowing it to be used in the guest and incurring a
page fault when that occurs.  The page fault maps a new host physical
page backing the existing host virtual address, meanwhile the
VFIOContainer still maintains the translation to the original host
physical address.  At this point the guest vCPU and any assigned
devices will map different host physical addresses to the same guest
physical address.  Badness.

The IOMMU typically does not have page level granularity with which
it can track this mapping without also incurring inefficiencies in
using page size mappings throughout.  MMU notifiers in the host
kernel also provide indicators for invalidating the mapping on
balloon inflation, not for updating the mapping when the balloon is
deflated.  For these reasons we assume a default behavior that the
mapping of each VFIOGroup into the VFIOContainer is incompatible
with memory ballooning and increment the balloon inhibitor to match
the attached VFIOGroups.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/common.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index fb396cf00ac4..4881b691a659 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -32,6 +32,7 @@
 #include "hw/hw.h"
 #include "qemu/error-report.h"
 #include "qemu/range.h"
+#include "sysemu/balloon.h"
 #include "sysemu/kvm.h"
 #include "trace.h"
 #include "qapi/error.h"
@@ -1049,6 +1050,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
             group->container = container;
             QLIST_INSERT_HEAD(&container->group_list, group, container_next);
             vfio_kvm_device_add_group(group);
+            qemu_balloon_inhibit(true);
             return 0;
         }
     }
@@ -1198,6 +1200,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     }

     vfio_kvm_device_add_group(group);
+    qemu_balloon_inhibit(true);

     QLIST_INIT(&container->group_list);
     QLIST_INSERT_HEAD(&space->containers, container, next);
@@ -1222,6 +1225,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
 listener_release_exit:
     QLIST_REMOVE(group, container_next);
     QLIST_REMOVE(container, next);
+    qemu_balloon_inhibit(false);
     vfio_kvm_device_del_group(group);
     vfio_listener_release(container);

@@ -1352,6 +1356,7 @@ void vfio_put_group(VFIOGroup *group)
         return;
     }

+    qemu_balloon_inhibit(false);
     vfio_kvm_device_del_group(group);
     vfio_disconnect_container(group);
     QLIST_REMOVE(group, next);

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 4/4] vfio/ccw/pci: Allow devices to opt-in for ballooning
  2018-07-30 23:13 ` [Qemu-devel] " Alex Williamson
@ 2018-07-30 23:14   ` Alex Williamson
  -1 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2018-07-30 23:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: cohuck, david, peterx, kvm, mst

If a vfio assigned device makes use of a physical IOMMU, then memory
ballooning is necessarily inhibited due to the page pinning, lack of
page level granularity at the IOMMU, and sufficient notifiers to both
remove the page on balloon inflation and add it back on deflation.
However, not all devices are backed by a physical IOMMU.  In the case
of mediated devices, if a vendor driver is well synchronized with the
guest driver, such that only pages actively used by the guest driver
are pinned by the host mdev vendor driver, then there should be no
overlap between pages available for the balloon driver and pages
actively in use by the device.  Under these conditions, ballooning
should be safe.

vfio-ccw devices are always mediated devices and always operate under
the constraints above.  Therefore we can consider all vfio-ccw devices
as balloon compatible.

The situation is far from straightforward with vfio-pci.  These
devices can be physical devices with physical IOMMU backing or
mediated devices where it is unknown whether a physical IOMMU is in
use or whether the vendor driver is well synchronized to the working
set of the guest driver.  The safest approach is therefore to assume
all vfio-pci devices are incompatible with ballooning, but allow user
opt-in should they have further insight into mediated devices.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/ccw.c                 |    9 +++++++++
 hw/vfio/common.c              |   23 ++++++++++++++++++++++-
 hw/vfio/pci.c                 |   26 +++++++++++++++++++++++++-
 hw/vfio/trace-events          |    1 +
 include/hw/vfio/vfio-common.h |    2 ++
 5 files changed, 59 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
index 351b305e1ae7..40e7b5623e69 100644
--- a/hw/vfio/ccw.c
+++ b/hw/vfio/ccw.c
@@ -349,6 +349,15 @@ static void vfio_ccw_get_device(VFIOGroup *group, VFIOCCWDevice *vcdev,
         }
     }
 
+    /*
+     * All vfio-ccw devices are believed to operate compatibly with memory
+     * ballooning, ie. pages pinned in the host are in the current working
+     * set of the guest driver and therefore never overlap with pages
+     * available to the guest balloon driver.  This needs to be set before
+     * vfio_get_device() for vfio common to handle the balloon inhibitor.
+     */
+    vcdev->vdev.balloon_allowed = true;
+
     if (vfio_get_device(group, vcdev->cdev.mdevid, &vcdev->vdev, errp)) {
         goto out_err;
     }
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 4881b691a659..ef5f4b77548a 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1356,7 +1356,9 @@ void vfio_put_group(VFIOGroup *group)
         return;
     }
 
-    qemu_balloon_inhibit(false);
+    if (!group->balloon_allowed) {
+        qemu_balloon_inhibit(false);
+    }
     vfio_kvm_device_del_group(group);
     vfio_disconnect_container(group);
     QLIST_REMOVE(group, next);
@@ -1392,6 +1394,25 @@ int vfio_get_device(VFIOGroup *group, const char *name,
         return ret;
     }
 
+    /*
+     * Clear the balloon inhibitor for this group if the driver knows the
+     * device operates compatibly with ballooning.  Setting must be consistent
+     * per group, but since compatibility is really only possible with mdev
+     * currently, we expect singleton groups.
+     */
+    if (vbasedev->balloon_allowed != group->balloon_allowed) {
+        if (!QLIST_EMPTY(&group->device_list)) {
+            error_setg(errp,
+                       "Inconsistent device balloon setting within group");
+            return -1;
+        }
+
+        if (!group->balloon_allowed) {
+            group->balloon_allowed = true;
+            qemu_balloon_inhibit(false);
+        }
+    }
+
     vbasedev->fd = fd;
     vbasedev->group = group;
     QLIST_INSERT_HEAD(&group->device_list, vbasedev, next);
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 6cbb8fa0549d..056f3a887a8f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2804,12 +2804,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
     VFIODevice *vbasedev_iter;
     VFIOGroup *group;
-    char *tmp, group_path[PATH_MAX], *group_name;
+    char *tmp, *subsys, group_path[PATH_MAX], *group_name;
     Error *err = NULL;
     ssize_t len;
     struct stat st;
     int groupid;
     int i, ret;
+    bool is_mdev;
 
     if (!vdev->vbasedev.sysfsdev) {
         if (!(~vdev->host.domain || ~vdev->host.bus ||
@@ -2869,6 +2870,27 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         }
     }
 
+    /*
+     * Mediated devices *might* operate compatibly with memory ballooning, but
+     * we cannot know for certain, it depends on whether the mdev vendor driver
+     * stays in sync with the active working set of the guest driver.  Prevent
+     * the x-balloon-allowed option unless this is minimally an mdev device.
+     */
+    tmp = g_strdup_printf("%s/subsystem", vdev->vbasedev.sysfsdev);
+    subsys = realpath(tmp, NULL);
+    g_free(tmp);
+    is_mdev = (strcmp(subsys, "/sys/bus/mdev") == 0);
+    free(subsys);
+
+    trace_vfio_mdev(vdev->vbasedev.name, is_mdev);
+
+    if (vdev->vbasedev.balloon_allowed && !is_mdev) {
+        error_setg(errp, "x-balloon-allowed only potentially compatible "
+                   "with mdev devices");
+        vfio_put_group(group);
+        goto error;
+    }
+
     ret = vfio_get_device(group, vdev->vbasedev.name, &vdev->vbasedev, errp);
     if (ret) {
         vfio_put_group(group);
@@ -3170,6 +3192,8 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_BIT("x-igd-opregion", VFIOPCIDevice, features,
                     VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
     DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
+    DEFINE_PROP_BOOL("x-balloon-allowed", VFIOPCIDevice,
+                     vbasedev.balloon_allowed, false),
     DEFINE_PROP_BOOL("x-no-kvm-intx", VFIOPCIDevice, no_kvm_intx, false),
     DEFINE_PROP_BOOL("x-no-kvm-msi", VFIOPCIDevice, no_kvm_msi, false),
     DEFINE_PROP_BOOL("x-no-kvm-msix", VFIOPCIDevice, no_kvm_msix, false),
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index d2a74952e389..a85e8662eadb 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -39,6 +39,7 @@ vfio_pci_hot_reset_result(const char *name, const char *result) "%s hot reset: %
 vfio_populate_device_config(const char *name, unsigned long size, unsigned long offset, unsigned long flags) "Device %s config:\n  size: 0x%lx, offset: 0x%lx, flags: 0x%lx"
 vfio_populate_device_get_irq_info_failure(void) "VFIO_DEVICE_GET_IRQ_INFO failure: %m"
 vfio_realize(const char *name, int group_id) " (%s) group %d"
+vfio_mdev(const char *name, bool is_mdev) " (%s) is_mdev %d"
 vfio_add_ext_cap_dropped(const char *name, uint16_t cap, uint16_t offset) "%s 0x%x@0x%x"
 vfio_pci_reset(const char *name) " (%s)"
 vfio_pci_reset_flr(const char *name) "%s FLR/VFIO_DEVICE_RESET"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index a9036929b220..15ea6c26fdbc 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -112,6 +112,7 @@ typedef struct VFIODevice {
     bool reset_works;
     bool needs_reset;
     bool no_mmap;
+    bool balloon_allowed;
     VFIODeviceOps *ops;
     unsigned int num_irqs;
     unsigned int num_regions;
@@ -131,6 +132,7 @@ typedef struct VFIOGroup {
     QLIST_HEAD(, VFIODevice) device_list;
     QLIST_ENTRY(VFIOGroup) next;
     QLIST_ENTRY(VFIOGroup) container_next;
+    bool balloon_allowed;
 } VFIOGroup;
 
 typedef struct VFIODMABuf {

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 4/4] vfio/ccw/pci: Allow devices to opt-in for ballooning
@ 2018-07-30 23:14   ` Alex Williamson
  0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2018-07-30 23:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: kvm, peterx, cohuck, mst, david

If a vfio assigned device makes use of a physical IOMMU, then memory
ballooning is necessarily inhibited due to the page pinning, lack of
page level granularity at the IOMMU, and sufficient notifiers to both
remove the page on balloon inflation and add it back on deflation.
However, not all devices are backed by a physical IOMMU.  In the case
of mediated devices, if a vendor driver is well synchronized with the
guest driver, such that only pages actively used by the guest driver
are pinned by the host mdev vendor driver, then there should be no
overlap between pages available for the balloon driver and pages
actively in use by the device.  Under these conditions, ballooning
should be safe.

vfio-ccw devices are always mediated devices and always operate under
the constraints above.  Therefore we can consider all vfio-ccw devices
as balloon compatible.

The situation is far from straightforward with vfio-pci.  These
devices can be physical devices with physical IOMMU backing or
mediated devices where it is unknown whether a physical IOMMU is in
use or whether the vendor driver is well synchronized to the working
set of the guest driver.  The safest approach is therefore to assume
all vfio-pci devices are incompatible with ballooning, but allow user
opt-in should they have further insight into mediated devices.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/ccw.c                 |    9 +++++++++
 hw/vfio/common.c              |   23 ++++++++++++++++++++++-
 hw/vfio/pci.c                 |   26 +++++++++++++++++++++++++-
 hw/vfio/trace-events          |    1 +
 include/hw/vfio/vfio-common.h |    2 ++
 5 files changed, 59 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
index 351b305e1ae7..40e7b5623e69 100644
--- a/hw/vfio/ccw.c
+++ b/hw/vfio/ccw.c
@@ -349,6 +349,15 @@ static void vfio_ccw_get_device(VFIOGroup *group, VFIOCCWDevice *vcdev,
         }
     }
 
+    /*
+     * All vfio-ccw devices are believed to operate compatibly with memory
+     * ballooning, ie. pages pinned in the host are in the current working
+     * set of the guest driver and therefore never overlap with pages
+     * available to the guest balloon driver.  This needs to be set before
+     * vfio_get_device() for vfio common to handle the balloon inhibitor.
+     */
+    vcdev->vdev.balloon_allowed = true;
+
     if (vfio_get_device(group, vcdev->cdev.mdevid, &vcdev->vdev, errp)) {
         goto out_err;
     }
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 4881b691a659..ef5f4b77548a 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1356,7 +1356,9 @@ void vfio_put_group(VFIOGroup *group)
         return;
     }
 
-    qemu_balloon_inhibit(false);
+    if (!group->balloon_allowed) {
+        qemu_balloon_inhibit(false);
+    }
     vfio_kvm_device_del_group(group);
     vfio_disconnect_container(group);
     QLIST_REMOVE(group, next);
@@ -1392,6 +1394,25 @@ int vfio_get_device(VFIOGroup *group, const char *name,
         return ret;
     }
 
+    /*
+     * Clear the balloon inhibitor for this group if the driver knows the
+     * device operates compatibly with ballooning.  Setting must be consistent
+     * per group, but since compatibility is really only possible with mdev
+     * currently, we expect singleton groups.
+     */
+    if (vbasedev->balloon_allowed != group->balloon_allowed) {
+        if (!QLIST_EMPTY(&group->device_list)) {
+            error_setg(errp,
+                       "Inconsistent device balloon setting within group");
+            return -1;
+        }
+
+        if (!group->balloon_allowed) {
+            group->balloon_allowed = true;
+            qemu_balloon_inhibit(false);
+        }
+    }
+
     vbasedev->fd = fd;
     vbasedev->group = group;
     QLIST_INSERT_HEAD(&group->device_list, vbasedev, next);
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 6cbb8fa0549d..056f3a887a8f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2804,12 +2804,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
     VFIODevice *vbasedev_iter;
     VFIOGroup *group;
-    char *tmp, group_path[PATH_MAX], *group_name;
+    char *tmp, *subsys, group_path[PATH_MAX], *group_name;
     Error *err = NULL;
     ssize_t len;
     struct stat st;
     int groupid;
     int i, ret;
+    bool is_mdev;
 
     if (!vdev->vbasedev.sysfsdev) {
         if (!(~vdev->host.domain || ~vdev->host.bus ||
@@ -2869,6 +2870,27 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         }
     }
 
+    /*
+     * Mediated devices *might* operate compatibly with memory ballooning, but
+     * we cannot know for certain, it depends on whether the mdev vendor driver
+     * stays in sync with the active working set of the guest driver.  Prevent
+     * the x-balloon-allowed option unless this is minimally an mdev device.
+     */
+    tmp = g_strdup_printf("%s/subsystem", vdev->vbasedev.sysfsdev);
+    subsys = realpath(tmp, NULL);
+    g_free(tmp);
+    is_mdev = (strcmp(subsys, "/sys/bus/mdev") == 0);
+    free(subsys);
+
+    trace_vfio_mdev(vdev->vbasedev.name, is_mdev);
+
+    if (vdev->vbasedev.balloon_allowed && !is_mdev) {
+        error_setg(errp, "x-balloon-allowed only potentially compatible "
+                   "with mdev devices");
+        vfio_put_group(group);
+        goto error;
+    }
+
     ret = vfio_get_device(group, vdev->vbasedev.name, &vdev->vbasedev, errp);
     if (ret) {
         vfio_put_group(group);
@@ -3170,6 +3192,8 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_BIT("x-igd-opregion", VFIOPCIDevice, features,
                     VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
     DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
+    DEFINE_PROP_BOOL("x-balloon-allowed", VFIOPCIDevice,
+                     vbasedev.balloon_allowed, false),
     DEFINE_PROP_BOOL("x-no-kvm-intx", VFIOPCIDevice, no_kvm_intx, false),
     DEFINE_PROP_BOOL("x-no-kvm-msi", VFIOPCIDevice, no_kvm_msi, false),
     DEFINE_PROP_BOOL("x-no-kvm-msix", VFIOPCIDevice, no_kvm_msix, false),
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index d2a74952e389..a85e8662eadb 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -39,6 +39,7 @@ vfio_pci_hot_reset_result(const char *name, const char *result) "%s hot reset: %
 vfio_populate_device_config(const char *name, unsigned long size, unsigned long offset, unsigned long flags) "Device %s config:\n  size: 0x%lx, offset: 0x%lx, flags: 0x%lx"
 vfio_populate_device_get_irq_info_failure(void) "VFIO_DEVICE_GET_IRQ_INFO failure: %m"
 vfio_realize(const char *name, int group_id) " (%s) group %d"
+vfio_mdev(const char *name, bool is_mdev) " (%s) is_mdev %d"
 vfio_add_ext_cap_dropped(const char *name, uint16_t cap, uint16_t offset) "%s 0x%x@0x%x"
 vfio_pci_reset(const char *name) " (%s)"
 vfio_pci_reset_flr(const char *name) "%s FLR/VFIO_DEVICE_RESET"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index a9036929b220..15ea6c26fdbc 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -112,6 +112,7 @@ typedef struct VFIODevice {
     bool reset_works;
     bool needs_reset;
     bool no_mmap;
+    bool balloon_allowed;
     VFIODeviceOps *ops;
     unsigned int num_irqs;
     unsigned int num_regions;
@@ -131,6 +132,7 @@ typedef struct VFIOGroup {
     QLIST_HEAD(, VFIODevice) device_list;
     QLIST_ENTRY(VFIOGroup) next;
     QLIST_ENTRY(VFIOGroup) container_next;
+    bool balloon_allowed;
 } VFIOGroup;
 
 typedef struct VFIODMABuf {

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 2/4] kvm: Use inhibit to prevent ballooning without synchronous mmu
  2018-07-30 23:14   ` [Qemu-devel] " Alex Williamson
@ 2018-07-31  8:24     ` David Hildenbrand
  -1 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2018-07-31  8:24 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel; +Cc: cohuck, peterx, kvm, mst

On 31.07.2018 01:14, Alex Williamson wrote:
> Remove KVM specific tests in balloon_page(), instead marking
> ballooning as inhibited without KVM_CAP_SYNC_MMU support.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  accel/kvm/kvm-all.c        |    4 ++++
>  hw/virtio/virtio-balloon.c |    4 +---
>  2 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index eb7db92a5e3b..38f468d8e2b1 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -39,6 +39,7 @@
>  #include "trace.h"
>  #include "hw/irq.h"
>  #include "sysemu/sev.h"
> +#include "sysemu/balloon.h"
>  
>  #include "hw/boards.h"
>  
> @@ -1698,6 +1699,9 @@ static int kvm_init(MachineState *ms)
>      s->many_ioeventfds = kvm_check_many_ioeventfds();
>  
>      s->sync_mmu = !!kvm_vm_check_extension(kvm_state, KVM_CAP_SYNC_MMU);
> +    if (!s->sync_mmu) {
> +        qemu_balloon_inhibit(true);
> +    }
>  
>      return 0;
>  
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 1f7a87f09429..b5425080c5fb 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -21,7 +21,6 @@
>  #include "hw/mem/pc-dimm.h"
>  #include "sysemu/balloon.h"
>  #include "hw/virtio/virtio-balloon.h"
> -#include "sysemu/kvm.h"
>  #include "exec/address-spaces.h"
>  #include "qapi/error.h"
>  #include "qapi/qapi-events-misc.h"
> @@ -36,8 +35,7 @@
>  
>  static void balloon_page(void *addr, int deflate)
>  {
> -    if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
> -                                         kvm_has_sync_mmu())) {
> +    if (!qemu_balloon_is_inhibited()) {
>          qemu_madvise(addr, BALLOON_PAGE_SIZE,
>                  deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
>      }
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/4] kvm: Use inhibit to prevent ballooning without synchronous mmu
@ 2018-07-31  8:24     ` David Hildenbrand
  0 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2018-07-31  8:24 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel; +Cc: kvm, peterx, cohuck, mst

On 31.07.2018 01:14, Alex Williamson wrote:
> Remove KVM specific tests in balloon_page(), instead marking
> ballooning as inhibited without KVM_CAP_SYNC_MMU support.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  accel/kvm/kvm-all.c        |    4 ++++
>  hw/virtio/virtio-balloon.c |    4 +---
>  2 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index eb7db92a5e3b..38f468d8e2b1 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -39,6 +39,7 @@
>  #include "trace.h"
>  #include "hw/irq.h"
>  #include "sysemu/sev.h"
> +#include "sysemu/balloon.h"
>  
>  #include "hw/boards.h"
>  
> @@ -1698,6 +1699,9 @@ static int kvm_init(MachineState *ms)
>      s->many_ioeventfds = kvm_check_many_ioeventfds();
>  
>      s->sync_mmu = !!kvm_vm_check_extension(kvm_state, KVM_CAP_SYNC_MMU);
> +    if (!s->sync_mmu) {
> +        qemu_balloon_inhibit(true);
> +    }
>  
>      return 0;
>  
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 1f7a87f09429..b5425080c5fb 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -21,7 +21,6 @@
>  #include "hw/mem/pc-dimm.h"
>  #include "sysemu/balloon.h"
>  #include "hw/virtio/virtio-balloon.h"
> -#include "sysemu/kvm.h"
>  #include "exec/address-spaces.h"
>  #include "qapi/error.h"
>  #include "qapi/qapi-events-misc.h"
> @@ -36,8 +35,7 @@
>  
>  static void balloon_page(void *addr, int deflate)
>  {
> -    if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
> -                                         kvm_has_sync_mmu())) {
> +    if (!qemu_balloon_is_inhibited()) {
>          qemu_madvise(addr, BALLOON_PAGE_SIZE,
>                  deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
>      }
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] balloon: Allow nested inhibits
  2018-07-30 23:13   ` [Qemu-devel] " Alex Williamson
@ 2018-07-31  8:25     ` David Hildenbrand
  -1 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2018-07-31  8:25 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel; +Cc: cohuck, peterx, kvm, mst

On 31.07.2018 01:13, Alex Williamson wrote:
> A simple true/false internal state does not allow multiple users.  Fix
> this within the existing interface by converting to a counter, so long
> as the counter is elevated, ballooning is inhibited.
> 

Not sure if "nested" is really the right term here. It is really
multiple users. anyhow

Reviewed-by: David Hildenbrand <david@redhat.com>

> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  balloon.c |   13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/balloon.c b/balloon.c
> index 6bf0a9681377..931987983858 100644
> --- a/balloon.c
> +++ b/balloon.c
> @@ -26,6 +26,7 @@
>  
>  #include "qemu/osdep.h"
>  #include "qemu-common.h"
> +#include "qemu/atomic.h"
>  #include "exec/cpu-common.h"
>  #include "sysemu/kvm.h"
>  #include "sysemu/balloon.h"
> @@ -37,16 +38,22 @@
>  static QEMUBalloonEvent *balloon_event_fn;
>  static QEMUBalloonStatus *balloon_stat_fn;
>  static void *balloon_opaque;
> -static bool balloon_inhibited;
> +static int balloon_inhibit_count;
>  
>  bool qemu_balloon_is_inhibited(void)
>  {
> -    return balloon_inhibited;
> +    return atomic_read(&balloon_inhibit_count) > 0;
>  }
>  
>  void qemu_balloon_inhibit(bool state)
>  {
> -    balloon_inhibited = state;
> +    if (state) {
> +        atomic_inc(&balloon_inhibit_count);
> +    } else {
> +        atomic_dec(&balloon_inhibit_count);
> +    }
> +
> +    assert(atomic_read(&balloon_inhibit_count) >= 0);
>  }
>  
>  static bool have_balloon(Error **errp)
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 1/4] balloon: Allow nested inhibits
@ 2018-07-31  8:25     ` David Hildenbrand
  0 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2018-07-31  8:25 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel; +Cc: kvm, peterx, cohuck, mst

On 31.07.2018 01:13, Alex Williamson wrote:
> A simple true/false internal state does not allow multiple users.  Fix
> this within the existing interface by converting to a counter, so long
> as the counter is elevated, ballooning is inhibited.
> 

Not sure if "nested" is really the right term here. It is really
multiple users. anyhow

Reviewed-by: David Hildenbrand <david@redhat.com>

> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  balloon.c |   13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/balloon.c b/balloon.c
> index 6bf0a9681377..931987983858 100644
> --- a/balloon.c
> +++ b/balloon.c
> @@ -26,6 +26,7 @@
>  
>  #include "qemu/osdep.h"
>  #include "qemu-common.h"
> +#include "qemu/atomic.h"
>  #include "exec/cpu-common.h"
>  #include "sysemu/kvm.h"
>  #include "sysemu/balloon.h"
> @@ -37,16 +38,22 @@
>  static QEMUBalloonEvent *balloon_event_fn;
>  static QEMUBalloonStatus *balloon_stat_fn;
>  static void *balloon_opaque;
> -static bool balloon_inhibited;
> +static int balloon_inhibit_count;
>  
>  bool qemu_balloon_is_inhibited(void)
>  {
> -    return balloon_inhibited;
> +    return atomic_read(&balloon_inhibit_count) > 0;
>  }
>  
>  void qemu_balloon_inhibit(bool state)
>  {
> -    balloon_inhibited = state;
> +    if (state) {
> +        atomic_inc(&balloon_inhibit_count);
> +    } else {
> +        atomic_dec(&balloon_inhibit_count);
> +    }
> +
> +    assert(atomic_read(&balloon_inhibit_count) >= 0);
>  }
>  
>  static bool have_balloon(Error **errp)
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 0/4] Balloon inhibit enhancements, vfio restriction
  2018-07-30 23:13 ` [Qemu-devel] " Alex Williamson
@ 2018-07-31 12:29   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 42+ messages in thread
From: Michael S. Tsirkin @ 2018-07-31 12:29 UTC (permalink / raw)
  To: Alex Williamson; +Cc: peterx, cohuck, qemu-devel, kvm, david

On Mon, Jul 30, 2018 at 05:13:26PM -0600, Alex Williamson wrote:
> v2:
>  - Use atomic ops for balloon inhibit counter (Peter)
>  - Allow endpoint driver opt-in for ballooning, vfio-ccw opt-in by
>    default, vfio-pci opt-in by device option, only allowed for mdev
>    devices, no support added for platform as there are no platform
>    mdev devices.
> 
> See patch 3/4 for detailed explanation why ballooning and device
> assignment typically don't mix.  If this eventually changes, flags
> on the iommu info struct or perhaps device info struct can inform
> us for automatic opt-in.  Thanks,
> 
> Alex

So this patch seems to block ballooning when vfio is added.
But what if balloon is added and inflated first?

I'd suggest making qemu_balloon_inhibit fail in that case,
and then vfio realize will fail as well.


> ---
> 
> Alex Williamson (4):
>       balloon: Allow nested inhibits
>       kvm: Use inhibit to prevent ballooning without synchronous mmu
>       vfio: Inhibit ballooning based on group attachment to a container
>       vfio/ccw/pci: Allow devices to opt-in for ballooning
> 
> 
>  accel/kvm/kvm-all.c           |    4 ++++
>  balloon.c                     |   13 ++++++++++---
>  hw/vfio/ccw.c                 |    9 +++++++++
>  hw/vfio/common.c              |   26 ++++++++++++++++++++++++++
>  hw/vfio/pci.c                 |   26 +++++++++++++++++++++++++-
>  hw/vfio/trace-events          |    1 +
>  hw/virtio/virtio-balloon.c    |    4 +---
>  include/hw/vfio/vfio-common.h |    2 ++
>  8 files changed, 78 insertions(+), 7 deletions(-)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 0/4] Balloon inhibit enhancements, vfio restriction
@ 2018-07-31 12:29   ` Michael S. Tsirkin
  0 siblings, 0 replies; 42+ messages in thread
From: Michael S. Tsirkin @ 2018-07-31 12:29 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, kvm, peterx, cohuck, david

On Mon, Jul 30, 2018 at 05:13:26PM -0600, Alex Williamson wrote:
> v2:
>  - Use atomic ops for balloon inhibit counter (Peter)
>  - Allow endpoint driver opt-in for ballooning, vfio-ccw opt-in by
>    default, vfio-pci opt-in by device option, only allowed for mdev
>    devices, no support added for platform as there are no platform
>    mdev devices.
> 
> See patch 3/4 for detailed explanation why ballooning and device
> assignment typically don't mix.  If this eventually changes, flags
> on the iommu info struct or perhaps device info struct can inform
> us for automatic opt-in.  Thanks,
> 
> Alex

So this patch seems to block ballooning when vfio is added.
But what if balloon is added and inflated first?

I'd suggest making qemu_balloon_inhibit fail in that case,
and then vfio realize will fail as well.


> ---
> 
> Alex Williamson (4):
>       balloon: Allow nested inhibits
>       kvm: Use inhibit to prevent ballooning without synchronous mmu
>       vfio: Inhibit ballooning based on group attachment to a container
>       vfio/ccw/pci: Allow devices to opt-in for ballooning
> 
> 
>  accel/kvm/kvm-all.c           |    4 ++++
>  balloon.c                     |   13 ++++++++++---
>  hw/vfio/ccw.c                 |    9 +++++++++
>  hw/vfio/common.c              |   26 ++++++++++++++++++++++++++
>  hw/vfio/pci.c                 |   26 +++++++++++++++++++++++++-
>  hw/vfio/trace-events          |    1 +
>  hw/virtio/virtio-balloon.c    |    4 +---
>  include/hw/vfio/vfio-common.h |    2 ++
>  8 files changed, 78 insertions(+), 7 deletions(-)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 0/4] Balloon inhibit enhancements, vfio restriction
  2018-07-31 12:29   ` [Qemu-devel] " Michael S. Tsirkin
@ 2018-07-31 14:44     ` Alex Williamson
  -1 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2018-07-31 14:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, david, cohuck, Dr. David Alan Gilbert, peterx, qemu-devel

On Tue, 31 Jul 2018 15:29:17 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Jul 30, 2018 at 05:13:26PM -0600, Alex Williamson wrote:
> > v2:
> >  - Use atomic ops for balloon inhibit counter (Peter)
> >  - Allow endpoint driver opt-in for ballooning, vfio-ccw opt-in by
> >    default, vfio-pci opt-in by device option, only allowed for mdev
> >    devices, no support added for platform as there are no platform
> >    mdev devices.
> > 
> > See patch 3/4 for detailed explanation why ballooning and device
> > assignment typically don't mix.  If this eventually changes, flags
> > on the iommu info struct or perhaps device info struct can inform
> > us for automatic opt-in.  Thanks,
> > 
> > Alex  
> 
> So this patch seems to block ballooning when vfio is added.
> But what if balloon is added and inflated first?

Good point.
 
> I'd suggest making qemu_balloon_inhibit fail in that case,
> and then vfio realize will fail as well.

That might be the correct behavior for vfio, but I wonder about the
existing postcopy use case.  Dave Gilbert, what do you think?  We might
need a separate interface for callers that cannot tolerate existing
ballooned pages.  Of course we'll also need another atomic counter to
keep a tally of ballooned pages.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 0/4] Balloon inhibit enhancements, vfio restriction
@ 2018-07-31 14:44     ` Alex Williamson
  0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2018-07-31 14:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, kvm, peterx, cohuck, david, Dr. David Alan Gilbert

On Tue, 31 Jul 2018 15:29:17 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Jul 30, 2018 at 05:13:26PM -0600, Alex Williamson wrote:
> > v2:
> >  - Use atomic ops for balloon inhibit counter (Peter)
> >  - Allow endpoint driver opt-in for ballooning, vfio-ccw opt-in by
> >    default, vfio-pci opt-in by device option, only allowed for mdev
> >    devices, no support added for platform as there are no platform
> >    mdev devices.
> > 
> > See patch 3/4 for detailed explanation why ballooning and device
> > assignment typically don't mix.  If this eventually changes, flags
> > on the iommu info struct or perhaps device info struct can inform
> > us for automatic opt-in.  Thanks,
> > 
> > Alex  
> 
> So this patch seems to block ballooning when vfio is added.
> But what if balloon is added and inflated first?

Good point.
 
> I'd suggest making qemu_balloon_inhibit fail in that case,
> and then vfio realize will fail as well.

That might be the correct behavior for vfio, but I wonder about the
existing postcopy use case.  Dave Gilbert, what do you think?  We might
need a separate interface for callers that cannot tolerate existing
ballooned pages.  Of course we'll also need another atomic counter to
keep a tally of ballooned pages.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 0/4] Balloon inhibit enhancements, vfio restriction
  2018-07-31 14:44     ` [Qemu-devel] " Alex Williamson
@ 2018-07-31 15:07       ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 42+ messages in thread
From: Dr. David Alan Gilbert @ 2018-07-31 15:07 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, david, cohuck, qemu-devel, peterx, Michael S. Tsirkin

* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Tue, 31 Jul 2018 15:29:17 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Mon, Jul 30, 2018 at 05:13:26PM -0600, Alex Williamson wrote:
> > > v2:
> > >  - Use atomic ops for balloon inhibit counter (Peter)
> > >  - Allow endpoint driver opt-in for ballooning, vfio-ccw opt-in by
> > >    default, vfio-pci opt-in by device option, only allowed for mdev
> > >    devices, no support added for platform as there are no platform
> > >    mdev devices.
> > > 
> > > See patch 3/4 for detailed explanation why ballooning and device
> > > assignment typically don't mix.  If this eventually changes, flags
> > > on the iommu info struct or perhaps device info struct can inform
> > > us for automatic opt-in.  Thanks,
> > > 
> > > Alex  
> > 
> > So this patch seems to block ballooning when vfio is added.
> > But what if balloon is added and inflated first?
> 
> Good point.
>  
> > I'd suggest making qemu_balloon_inhibit fail in that case,
> > and then vfio realize will fail as well.
> 
> That might be the correct behavior for vfio, but I wonder about the
> existing postcopy use case.  Dave Gilbert, what do you think?  We might
> need a separate interface for callers that cannot tolerate existing
> ballooned pages.  Of course we'll also need another atomic counter to
> keep a tally of ballooned pages.  Thanks,

For postcopy, preinflation isn't a problem; our only issue is ballooning
during the postcopy phase itself.

Dave

> Alex
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 0/4] Balloon inhibit enhancements, vfio restriction
@ 2018-07-31 15:07       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 42+ messages in thread
From: Dr. David Alan Gilbert @ 2018-07-31 15:07 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Michael S. Tsirkin, qemu-devel, kvm, peterx, cohuck, david

* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Tue, 31 Jul 2018 15:29:17 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Mon, Jul 30, 2018 at 05:13:26PM -0600, Alex Williamson wrote:
> > > v2:
> > >  - Use atomic ops for balloon inhibit counter (Peter)
> > >  - Allow endpoint driver opt-in for ballooning, vfio-ccw opt-in by
> > >    default, vfio-pci opt-in by device option, only allowed for mdev
> > >    devices, no support added for platform as there are no platform
> > >    mdev devices.
> > > 
> > > See patch 3/4 for detailed explanation why ballooning and device
> > > assignment typically don't mix.  If this eventually changes, flags
> > > on the iommu info struct or perhaps device info struct can inform
> > > us for automatic opt-in.  Thanks,
> > > 
> > > Alex  
> > 
> > So this patch seems to block ballooning when vfio is added.
> > But what if balloon is added and inflated first?
> 
> Good point.
>  
> > I'd suggest making qemu_balloon_inhibit fail in that case,
> > and then vfio realize will fail as well.
> 
> That might be the correct behavior for vfio, but I wonder about the
> existing postcopy use case.  Dave Gilbert, what do you think?  We might
> need a separate interface for callers that cannot tolerate existing
> ballooned pages.  Of course we'll also need another atomic counter to
> keep a tally of ballooned pages.  Thanks,

For postcopy, preinflation isn't a problem; our only issue is ballooning
during the postcopy phase itself.

Dave

> Alex
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 0/4] Balloon inhibit enhancements, vfio restriction
  2018-07-31 15:07       ` [Qemu-devel] " Dr. David Alan Gilbert
@ 2018-07-31 21:50         ` Alex Williamson
  -1 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2018-07-31 21:50 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: kvm, david, cohuck, qemu-devel, peterx, Michael S. Tsirkin

On Tue, 31 Jul 2018 16:07:46 +0100
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Alex Williamson (alex.williamson@redhat.com) wrote:
> > On Tue, 31 Jul 2018 15:29:17 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > On Mon, Jul 30, 2018 at 05:13:26PM -0600, Alex Williamson wrote:  
> > > > v2:
> > > >  - Use atomic ops for balloon inhibit counter (Peter)
> > > >  - Allow endpoint driver opt-in for ballooning, vfio-ccw opt-in by
> > > >    default, vfio-pci opt-in by device option, only allowed for mdev
> > > >    devices, no support added for platform as there are no platform
> > > >    mdev devices.
> > > > 
> > > > See patch 3/4 for detailed explanation why ballooning and device
> > > > assignment typically don't mix.  If this eventually changes, flags
> > > > on the iommu info struct or perhaps device info struct can inform
> > > > us for automatic opt-in.  Thanks,
> > > > 
> > > > Alex    
> > > 
> > > So this patch seems to block ballooning when vfio is added.
> > > But what if balloon is added and inflated first?  
> > 
> > Good point.
> >    
> > > I'd suggest making qemu_balloon_inhibit fail in that case,
> > > and then vfio realize will fail as well.  
> > 
> > That might be the correct behavior for vfio, but I wonder about the
> > existing postcopy use case.  Dave Gilbert, what do you think?  We might
> > need a separate interface for callers that cannot tolerate existing
> > ballooned pages.  Of course we'll also need another atomic counter to
> > keep a tally of ballooned pages.  Thanks,  
> 
> For postcopy, preinflation isn't a problem; our only issue is ballooning
> during the postcopy phase itself.

On further consideration, I think device assignment is in the same
category.  The balloon inhibitor does not actually stop the guest
balloon driver from grabbing and freeing pages, it only changes whether
QEMU releases the pages with madvise DONTNEED.  The problem we have
with ballooning and device assignment is when we have an existing HPA
mapping in the IOMMU that isn't invalidated on DONTNEED and becomes
inconsistent when the page is re-populated.  Zapped pages at the time
an assigned device is added do not trigger this, those pages will be
repopulated when pages are pinned for the assigned device.  This is the
identical scenario to a freshly started VM that doesn't use memory
preallocation and therefore faults in pages on demand.  When an
assigned device is attached to such a VM, page pinning will fault in
and lock all of those pages.

This is observable behavior, for example if I start a VM with 16GB of
RAM, booted to a command prompt the VM shows less that 1GB of RAM
resident in the host.  If I set the balloon to 2048, there's no
observable change in the QEMU process size on the host.  If I hot-add
an assigned device while we're ballooned down, the resident memory size
from the host jumps up to 16GB.  All of the zapped pages have been
reclaimed.  Adjusting ballooning at this point only changes the balloon
size in the guest, inflating the balloon no longer zaps pages from the
process.

The only oddity I see is the one Dave noted in the commit introducing
balloon inhibiting (371ff5a3f04c):

    Queueing the requests until after migration would be nice, but is
    non-trivial, since the set of inflate/deflate requests have to
    be compared with the state of the page to know what the final
    outcome is allowed to be.

So for this example of a 16GB VM ballooned down to 2GB then an assigned
device added and subsequently removed, the resident memory remains 16GB
and I need to deflate the balloon and reinflate it in order to zap them
from the QEMU process.  Therefore, I think that with respect to this
inquiry, the series stands as is.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 0/4] Balloon inhibit enhancements, vfio restriction
@ 2018-07-31 21:50         ` Alex Williamson
  0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2018-07-31 21:50 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Michael S. Tsirkin, qemu-devel, kvm, peterx, cohuck, david

On Tue, 31 Jul 2018 16:07:46 +0100
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Alex Williamson (alex.williamson@redhat.com) wrote:
> > On Tue, 31 Jul 2018 15:29:17 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > On Mon, Jul 30, 2018 at 05:13:26PM -0600, Alex Williamson wrote:  
> > > > v2:
> > > >  - Use atomic ops for balloon inhibit counter (Peter)
> > > >  - Allow endpoint driver opt-in for ballooning, vfio-ccw opt-in by
> > > >    default, vfio-pci opt-in by device option, only allowed for mdev
> > > >    devices, no support added for platform as there are no platform
> > > >    mdev devices.
> > > > 
> > > > See patch 3/4 for detailed explanation why ballooning and device
> > > > assignment typically don't mix.  If this eventually changes, flags
> > > > on the iommu info struct or perhaps device info struct can inform
> > > > us for automatic opt-in.  Thanks,
> > > > 
> > > > Alex    
> > > 
> > > So this patch seems to block ballooning when vfio is added.
> > > But what if balloon is added and inflated first?  
> > 
> > Good point.
> >    
> > > I'd suggest making qemu_balloon_inhibit fail in that case,
> > > and then vfio realize will fail as well.  
> > 
> > That might be the correct behavior for vfio, but I wonder about the
> > existing postcopy use case.  Dave Gilbert, what do you think?  We might
> > need a separate interface for callers that cannot tolerate existing
> > ballooned pages.  Of course we'll also need another atomic counter to
> > keep a tally of ballooned pages.  Thanks,  
> 
> For postcopy, preinflation isn't a problem; our only issue is ballooning
> during the postcopy phase itself.

On further consideration, I think device assignment is in the same
category.  The balloon inhibitor does not actually stop the guest
balloon driver from grabbing and freeing pages, it only changes whether
QEMU releases the pages with madvise DONTNEED.  The problem we have
with ballooning and device assignment is when we have an existing HPA
mapping in the IOMMU that isn't invalidated on DONTNEED and becomes
inconsistent when the page is re-populated.  Zapped pages at the time
an assigned device is added do not trigger this, those pages will be
repopulated when pages are pinned for the assigned device.  This is the
identical scenario to a freshly started VM that doesn't use memory
preallocation and therefore faults in pages on demand.  When an
assigned device is attached to such a VM, page pinning will fault in
and lock all of those pages.

This is observable behavior, for example if I start a VM with 16GB of
RAM, booted to a command prompt the VM shows less that 1GB of RAM
resident in the host.  If I set the balloon to 2048, there's no
observable change in the QEMU process size on the host.  If I hot-add
an assigned device while we're ballooned down, the resident memory size
from the host jumps up to 16GB.  All of the zapped pages have been
reclaimed.  Adjusting ballooning at this point only changes the balloon
size in the guest, inflating the balloon no longer zaps pages from the
process.

The only oddity I see is the one Dave noted in the commit introducing
balloon inhibiting (371ff5a3f04c):

    Queueing the requests until after migration would be nice, but is
    non-trivial, since the set of inflate/deflate requests have to
    be compared with the state of the page to know what the final
    outcome is allowed to be.

So for this example of a 16GB VM ballooned down to 2GB then an assigned
device added and subsequently removed, the resident memory remains 16GB
and I need to deflate the balloon and reinflate it in order to zap them
from the QEMU process.  Therefore, I think that with respect to this
inquiry, the series stands as is.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 0/4] Balloon inhibit enhancements, vfio restriction
  2018-07-31 21:50         ` [Qemu-devel] " Alex Williamson
@ 2018-08-03 18:42           ` Michael S. Tsirkin
  -1 siblings, 0 replies; 42+ messages in thread
From: Michael S. Tsirkin @ 2018-08-03 18:42 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, david, cohuck, Dr. David Alan Gilbert, peterx, qemu-devel

On Tue, Jul 31, 2018 at 03:50:30PM -0600, Alex Williamson wrote:
> On Tue, 31 Jul 2018 16:07:46 +0100
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> > * Alex Williamson (alex.williamson@redhat.com) wrote:
> > > On Tue, 31 Jul 2018 15:29:17 +0300
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >   
> > > > On Mon, Jul 30, 2018 at 05:13:26PM -0600, Alex Williamson wrote:  
> > > > > v2:
> > > > >  - Use atomic ops for balloon inhibit counter (Peter)
> > > > >  - Allow endpoint driver opt-in for ballooning, vfio-ccw opt-in by
> > > > >    default, vfio-pci opt-in by device option, only allowed for mdev
> > > > >    devices, no support added for platform as there are no platform
> > > > >    mdev devices.
> > > > > 
> > > > > See patch 3/4 for detailed explanation why ballooning and device
> > > > > assignment typically don't mix.  If this eventually changes, flags
> > > > > on the iommu info struct or perhaps device info struct can inform
> > > > > us for automatic opt-in.  Thanks,
> > > > > 
> > > > > Alex    
> > > > 
> > > > So this patch seems to block ballooning when vfio is added.
> > > > But what if balloon is added and inflated first?  
> > > 
> > > Good point.
> > >    
> > > > I'd suggest making qemu_balloon_inhibit fail in that case,
> > > > and then vfio realize will fail as well.  
> > > 
> > > That might be the correct behavior for vfio, but I wonder about the
> > > existing postcopy use case.  Dave Gilbert, what do you think?  We might
> > > need a separate interface for callers that cannot tolerate existing
> > > ballooned pages.  Of course we'll also need another atomic counter to
> > > keep a tally of ballooned pages.  Thanks,  
> > 
> > For postcopy, preinflation isn't a problem; our only issue is ballooning
> > during the postcopy phase itself.
> 
> On further consideration, I think device assignment is in the same
> category.  The balloon inhibitor does not actually stop the guest
> balloon driver from grabbing and freeing pages, it only changes whether
> QEMU releases the pages with madvise DONTNEED.  The problem we have
> with ballooning and device assignment is when we have an existing HPA
> mapping in the IOMMU that isn't invalidated on DONTNEED and becomes
> inconsistent when the page is re-populated.  Zapped pages at the time
> an assigned device is added do not trigger this, those pages will be
> repopulated when pages are pinned for the assigned device.  This is the
> identical scenario to a freshly started VM that doesn't use memory
> preallocation and therefore faults in pages on demand.  When an
> assigned device is attached to such a VM, page pinning will fault in
> and lock all of those pages.

Granted this means memory won't be corrupted, but it is
also highly unlikely to be what the user wanted.

> This is observable behavior, for example if I start a VM with 16GB of
> RAM, booted to a command prompt the VM shows less that 1GB of RAM
> resident in the host.  If I set the balloon to 2048, there's no
> observable change in the QEMU process size on the host.  If I hot-add
> an assigned device while we're ballooned down, the resident memory size
> from the host jumps up to 16GB.  All of the zapped pages have been
> reclaimed.  Adjusting ballooning at this point only changes the balloon
> size in the guest, inflating the balloon no longer zaps pages from the
> process.
> 
> The only oddity I see is the one Dave noted in the commit introducing
> balloon inhibiting (371ff5a3f04c):
> 
>     Queueing the requests until after migration would be nice, but is
>     non-trivial, since the set of inflate/deflate requests have to
>     be compared with the state of the page to know what the final
>     outcome is allowed to be.
> 
> So for this example of a 16GB VM ballooned down to 2GB then an assigned
> device added and subsequently removed, the resident memory remains 16GB
> and I need to deflate the balloon and reinflate it in order to zap them
> from the QEMU process.  Therefore, I think that with respect to this
> inquiry, the series stands as is.  Thanks,
> 
> Alex

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 0/4] Balloon inhibit enhancements, vfio restriction
@ 2018-08-03 18:42           ` Michael S. Tsirkin
  0 siblings, 0 replies; 42+ messages in thread
From: Michael S. Tsirkin @ 2018-08-03 18:42 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Dr. David Alan Gilbert, qemu-devel, kvm, peterx, cohuck, david

On Tue, Jul 31, 2018 at 03:50:30PM -0600, Alex Williamson wrote:
> On Tue, 31 Jul 2018 16:07:46 +0100
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> > * Alex Williamson (alex.williamson@redhat.com) wrote:
> > > On Tue, 31 Jul 2018 15:29:17 +0300
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >   
> > > > On Mon, Jul 30, 2018 at 05:13:26PM -0600, Alex Williamson wrote:  
> > > > > v2:
> > > > >  - Use atomic ops for balloon inhibit counter (Peter)
> > > > >  - Allow endpoint driver opt-in for ballooning, vfio-ccw opt-in by
> > > > >    default, vfio-pci opt-in by device option, only allowed for mdev
> > > > >    devices, no support added for platform as there are no platform
> > > > >    mdev devices.
> > > > > 
> > > > > See patch 3/4 for detailed explanation why ballooning and device
> > > > > assignment typically don't mix.  If this eventually changes, flags
> > > > > on the iommu info struct or perhaps device info struct can inform
> > > > > us for automatic opt-in.  Thanks,
> > > > > 
> > > > > Alex    
> > > > 
> > > > So this patch seems to block ballooning when vfio is added.
> > > > But what if balloon is added and inflated first?  
> > > 
> > > Good point.
> > >    
> > > > I'd suggest making qemu_balloon_inhibit fail in that case,
> > > > and then vfio realize will fail as well.  
> > > 
> > > That might be the correct behavior for vfio, but I wonder about the
> > > existing postcopy use case.  Dave Gilbert, what do you think?  We might
> > > need a separate interface for callers that cannot tolerate existing
> > > ballooned pages.  Of course we'll also need another atomic counter to
> > > keep a tally of ballooned pages.  Thanks,  
> > 
> > For postcopy, preinflation isn't a problem; our only issue is ballooning
> > during the postcopy phase itself.
> 
> On further consideration, I think device assignment is in the same
> category.  The balloon inhibitor does not actually stop the guest
> balloon driver from grabbing and freeing pages, it only changes whether
> QEMU releases the pages with madvise DONTNEED.  The problem we have
> with ballooning and device assignment is when we have an existing HPA
> mapping in the IOMMU that isn't invalidated on DONTNEED and becomes
> inconsistent when the page is re-populated.  Zapped pages at the time
> an assigned device is added do not trigger this, those pages will be
> repopulated when pages are pinned for the assigned device.  This is the
> identical scenario to a freshly started VM that doesn't use memory
> preallocation and therefore faults in pages on demand.  When an
> assigned device is attached to such a VM, page pinning will fault in
> and lock all of those pages.

Granted this means memory won't be corrupted, but it is
also highly unlikely to be what the user wanted.

> This is observable behavior, for example if I start a VM with 16GB of
> RAM, booted to a command prompt the VM shows less that 1GB of RAM
> resident in the host.  If I set the balloon to 2048, there's no
> observable change in the QEMU process size on the host.  If I hot-add
> an assigned device while we're ballooned down, the resident memory size
> from the host jumps up to 16GB.  All of the zapped pages have been
> reclaimed.  Adjusting ballooning at this point only changes the balloon
> size in the guest, inflating the balloon no longer zaps pages from the
> process.
> 
> The only oddity I see is the one Dave noted in the commit introducing
> balloon inhibiting (371ff5a3f04c):
> 
>     Queueing the requests until after migration would be nice, but is
>     non-trivial, since the set of inflate/deflate requests have to
>     be compared with the state of the page to know what the final
>     outcome is allowed to be.
> 
> So for this example of a 16GB VM ballooned down to 2GB then an assigned
> device added and subsequently removed, the resident memory remains 16GB
> and I need to deflate the balloon and reinflate it in order to zap them
> from the QEMU process.  Therefore, I think that with respect to this
> inquiry, the series stands as is.  Thanks,
> 
> Alex

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 0/4] Balloon inhibit enhancements, vfio restriction
  2018-08-03 18:42           ` [Qemu-devel] " Michael S. Tsirkin
@ 2018-08-03 20:12             ` Alex Williamson
  -1 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2018-08-03 20:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, david, cohuck, Dr. David Alan Gilbert, peterx, qemu-devel

On Fri, 3 Aug 2018 21:42:18 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Jul 31, 2018 at 03:50:30PM -0600, Alex Williamson wrote:
> > On Tue, 31 Jul 2018 16:07:46 +0100
> > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> >   
> > > * Alex Williamson (alex.williamson@redhat.com) wrote:  
> > > > On Tue, 31 Jul 2018 15:29:17 +0300
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > >     
> > > > > On Mon, Jul 30, 2018 at 05:13:26PM -0600, Alex Williamson wrote:    
> > > > > > v2:
> > > > > >  - Use atomic ops for balloon inhibit counter (Peter)
> > > > > >  - Allow endpoint driver opt-in for ballooning, vfio-ccw opt-in by
> > > > > >    default, vfio-pci opt-in by device option, only allowed for mdev
> > > > > >    devices, no support added for platform as there are no platform
> > > > > >    mdev devices.
> > > > > > 
> > > > > > See patch 3/4 for detailed explanation why ballooning and device
> > > > > > assignment typically don't mix.  If this eventually changes, flags
> > > > > > on the iommu info struct or perhaps device info struct can inform
> > > > > > us for automatic opt-in.  Thanks,
> > > > > > 
> > > > > > Alex      
> > > > > 
> > > > > So this patch seems to block ballooning when vfio is added.
> > > > > But what if balloon is added and inflated first?    
> > > > 
> > > > Good point.
> > > >      
> > > > > I'd suggest making qemu_balloon_inhibit fail in that case,
> > > > > and then vfio realize will fail as well.    
> > > > 
> > > > That might be the correct behavior for vfio, but I wonder about the
> > > > existing postcopy use case.  Dave Gilbert, what do you think?  We might
> > > > need a separate interface for callers that cannot tolerate existing
> > > > ballooned pages.  Of course we'll also need another atomic counter to
> > > > keep a tally of ballooned pages.  Thanks,    
> > > 
> > > For postcopy, preinflation isn't a problem; our only issue is ballooning
> > > during the postcopy phase itself.  
> > 
> > On further consideration, I think device assignment is in the same
> > category.  The balloon inhibitor does not actually stop the guest
> > balloon driver from grabbing and freeing pages, it only changes whether
> > QEMU releases the pages with madvise DONTNEED.  The problem we have
> > with ballooning and device assignment is when we have an existing HPA
> > mapping in the IOMMU that isn't invalidated on DONTNEED and becomes
> > inconsistent when the page is re-populated.  Zapped pages at the time
> > an assigned device is added do not trigger this, those pages will be
> > repopulated when pages are pinned for the assigned device.  This is the
> > identical scenario to a freshly started VM that doesn't use memory
> > preallocation and therefore faults in pages on demand.  When an
> > assigned device is attached to such a VM, page pinning will fault in
> > and lock all of those pages.  
> 
> Granted this means memory won't be corrupted, but it is
> also highly unlikely to be what the user wanted.

This is the current state of balloon inhibiting, it does not revoke
ballooned pages, ie. giving them back to the back to the guest, or
prevent the guest from inflating the balloon, it only controls whether
the pages are released back to the host.  I suspect a much more
involved overhaul of ballooning in general would be necessary to
support such features.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 0/4] Balloon inhibit enhancements, vfio restriction
@ 2018-08-03 20:12             ` Alex Williamson
  0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2018-08-03 20:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Dr. David Alan Gilbert, qemu-devel, kvm, peterx, cohuck, david

On Fri, 3 Aug 2018 21:42:18 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Jul 31, 2018 at 03:50:30PM -0600, Alex Williamson wrote:
> > On Tue, 31 Jul 2018 16:07:46 +0100
> > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> >   
> > > * Alex Williamson (alex.williamson@redhat.com) wrote:  
> > > > On Tue, 31 Jul 2018 15:29:17 +0300
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > >     
> > > > > On Mon, Jul 30, 2018 at 05:13:26PM -0600, Alex Williamson wrote:    
> > > > > > v2:
> > > > > >  - Use atomic ops for balloon inhibit counter (Peter)
> > > > > >  - Allow endpoint driver opt-in for ballooning, vfio-ccw opt-in by
> > > > > >    default, vfio-pci opt-in by device option, only allowed for mdev
> > > > > >    devices, no support added for platform as there are no platform
> > > > > >    mdev devices.
> > > > > > 
> > > > > > See patch 3/4 for detailed explanation why ballooning and device
> > > > > > assignment typically don't mix.  If this eventually changes, flags
> > > > > > on the iommu info struct or perhaps device info struct can inform
> > > > > > us for automatic opt-in.  Thanks,
> > > > > > 
> > > > > > Alex      
> > > > > 
> > > > > So this patch seems to block ballooning when vfio is added.
> > > > > But what if balloon is added and inflated first?    
> > > > 
> > > > Good point.
> > > >      
> > > > > I'd suggest making qemu_balloon_inhibit fail in that case,
> > > > > and then vfio realize will fail as well.    
> > > > 
> > > > That might be the correct behavior for vfio, but I wonder about the
> > > > existing postcopy use case.  Dave Gilbert, what do you think?  We might
> > > > need a separate interface for callers that cannot tolerate existing
> > > > ballooned pages.  Of course we'll also need another atomic counter to
> > > > keep a tally of ballooned pages.  Thanks,    
> > > 
> > > For postcopy, preinflation isn't a problem; our only issue is ballooning
> > > during the postcopy phase itself.  
> > 
> > On further consideration, I think device assignment is in the same
> > category.  The balloon inhibitor does not actually stop the guest
> > balloon driver from grabbing and freeing pages, it only changes whether
> > QEMU releases the pages with madvise DONTNEED.  The problem we have
> > with ballooning and device assignment is when we have an existing HPA
> > mapping in the IOMMU that isn't invalidated on DONTNEED and becomes
> > inconsistent when the page is re-populated.  Zapped pages at the time
> > an assigned device is added do not trigger this, those pages will be
> > repopulated when pages are pinned for the assigned device.  This is the
> > identical scenario to a freshly started VM that doesn't use memory
> > preallocation and therefore faults in pages on demand.  When an
> > assigned device is attached to such a VM, page pinning will fault in
> > and lock all of those pages.  
> 
> Granted this means memory won't be corrupted, but it is
> also highly unlikely to be what the user wanted.

This is the current state of balloon inhibiting, it does not revoke
ballooned pages, ie. giving them back to the back to the guest, or
prevent the guest from inflating the balloon, it only controls whether
the pages are released back to the host.  I suspect a much more
involved overhaul of ballooning in general would be necessary to
support such features.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] balloon: Allow nested inhibits
  2018-07-30 23:13   ` [Qemu-devel] " Alex Williamson
@ 2018-08-07 12:56     ` Peter Xu
  -1 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2018-08-07 12:56 UTC (permalink / raw)
  To: Alex Williamson; +Cc: cohuck, david, qemu-devel, kvm, mst

On Mon, Jul 30, 2018 at 05:13:46PM -0600, Alex Williamson wrote:
> A simple true/false internal state does not allow multiple users.  Fix
> this within the existing interface by converting to a counter, so long
> as the counter is elevated, ballooning is inhibited.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 1/4] balloon: Allow nested inhibits
@ 2018-08-07 12:56     ` Peter Xu
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2018-08-07 12:56 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, kvm, cohuck, mst, david

On Mon, Jul 30, 2018 at 05:13:46PM -0600, Alex Williamson wrote:
> A simple true/false internal state does not allow multiple users.  Fix
> this within the existing interface by converting to a counter, so long
> as the counter is elevated, ballooning is inhibited.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 2/4] kvm: Use inhibit to prevent ballooning without synchronous mmu
  2018-07-30 23:14   ` [Qemu-devel] " Alex Williamson
@ 2018-08-07 12:56     ` Peter Xu
  -1 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2018-08-07 12:56 UTC (permalink / raw)
  To: Alex Williamson; +Cc: cohuck, david, qemu-devel, kvm, mst

On Mon, Jul 30, 2018 at 05:14:06PM -0600, Alex Williamson wrote:
> Remove KVM specific tests in balloon_page(), instead marking
> ballooning as inhibited without KVM_CAP_SYNC_MMU support.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/4] kvm: Use inhibit to prevent ballooning without synchronous mmu
@ 2018-08-07 12:56     ` Peter Xu
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2018-08-07 12:56 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, kvm, cohuck, mst, david

On Mon, Jul 30, 2018 at 05:14:06PM -0600, Alex Williamson wrote:
> Remove KVM specific tests in balloon_page(), instead marking
> ballooning as inhibited without KVM_CAP_SYNC_MMU support.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 3/4] vfio: Inhibit ballooning based on group attachment to a container
  2018-07-30 23:14   ` [Qemu-devel] " Alex Williamson
@ 2018-08-07 13:10     ` Peter Xu
  -1 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2018-08-07 13:10 UTC (permalink / raw)
  To: Alex Williamson; +Cc: cohuck, david, qemu-devel, kvm, mst

On Mon, Jul 30, 2018 at 05:14:21PM -0600, Alex Williamson wrote:
> We use a VFIOContainer to associate an AddressSpace to one or more
> VFIOGroups.  The VFIOContainer represents the DMA context for that
> AdressSpace for those VFIOGroups and is synchronized to changes in
> that AddressSpace via a MemoryListener.  For IOMMU backed devices,
> maintaining the DMA context for a VFIOGroup generally involves
> pinning a host virtual address in order to create a stable host
> physical address and then mapping a translation from the associated
> guest physical address to that host physical address into the IOMMU.
> 
> While the above maintains the VFIOContainer synchronized to the QEMU
> memory API of the VM, memory ballooning occurs outside of that API.
> Inflating the memory balloon (ie. cooperatively capturing pages from
> the guest for use by the host) simply uses MADV_DONTNEED to "zap"
> pages from QEMU's host virtual address space.  The page pinning and
> IOMMU mapping above remains in place, negating the host's ability to
> reuse the page, but the host virtual to host physical mapping of the
> page is invalidated outside of QEMU's memory API.
> 
> When the balloon is later deflated, attempting to cooperatively
> return pages to the guest, the page is simply freed by the guest
> balloon driver, allowing it to be used in the guest and incurring a
> page fault when that occurs.  The page fault maps a new host physical
> page backing the existing host virtual address, meanwhile the
> VFIOContainer still maintains the translation to the original host
> physical address.  At this point the guest vCPU and any assigned
> devices will map different host physical addresses to the same guest
> physical address.  Badness.
> 
> The IOMMU typically does not have page level granularity with which
> it can track this mapping without also incurring inefficiencies in
> using page size mappings throughout.  MMU notifiers in the host
> kernel also provide indicators for invalidating the mapping on
> balloon inflation, not for updating the mapping when the balloon is
> deflated.  For these reasons we assume a default behavior that the
> mapping of each VFIOGroup into the VFIOContainer is incompatible
> with memory ballooning and increment the balloon inhibitor to match
> the attached VFIOGroups.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  hw/vfio/common.c |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index fb396cf00ac4..4881b691a659 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -32,6 +32,7 @@
>  #include "hw/hw.h"
>  #include "qemu/error-report.h"
>  #include "qemu/range.h"
> +#include "sysemu/balloon.h"
>  #include "sysemu/kvm.h"
>  #include "trace.h"
>  #include "qapi/error.h"
> @@ -1049,6 +1050,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>              group->container = container;
>              QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>              vfio_kvm_device_add_group(group);
> +            qemu_balloon_inhibit(true);

[1]

>              return 0;
>          }
>      }
> @@ -1198,6 +1200,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      }
>  
>      vfio_kvm_device_add_group(group);
> +    qemu_balloon_inhibit(true);

AFAIU there is a very critical information that this
qemu_balloon_inhibit() call must be before the call to:

    memory_listener_register(&container->listener, container->space->as);

Since the memory listener registeration is the point when we do the
pinning of the pages.  So to make sure we won't have stale pages we
must call qemu_balloon_inhibit() before memory_listener_register()
(which is what this patch does).  However this is not that obvious,
not sure whether that might worth a comment.

Considering this, not sure whether we can just do this per-container
instead of per-group, then we also don't need to bother with extra
group-add paths like [1].

No matter what, this patch looks good to me (and it is correct AFAIK),
so I'm leaving r-b and I'll leave Alex to decide:

Reviewed-by: Peter Xu <peterx@redhat.com>

>  
>      QLIST_INIT(&container->group_list);
>      QLIST_INSERT_HEAD(&space->containers, container, next);
> @@ -1222,6 +1225,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>  listener_release_exit:
>      QLIST_REMOVE(group, container_next);
>      QLIST_REMOVE(container, next);
> +    qemu_balloon_inhibit(false);
>      vfio_kvm_device_del_group(group);
>      vfio_listener_release(container);
>  
> @@ -1352,6 +1356,7 @@ void vfio_put_group(VFIOGroup *group)
>          return;
>      }
>  
> +    qemu_balloon_inhibit(false);
>      vfio_kvm_device_del_group(group);
>      vfio_disconnect_container(group);
>      QLIST_REMOVE(group, next);
> 

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 3/4] vfio: Inhibit ballooning based on group attachment to a container
@ 2018-08-07 13:10     ` Peter Xu
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2018-08-07 13:10 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, kvm, cohuck, mst, david

On Mon, Jul 30, 2018 at 05:14:21PM -0600, Alex Williamson wrote:
> We use a VFIOContainer to associate an AddressSpace to one or more
> VFIOGroups.  The VFIOContainer represents the DMA context for that
> AdressSpace for those VFIOGroups and is synchronized to changes in
> that AddressSpace via a MemoryListener.  For IOMMU backed devices,
> maintaining the DMA context for a VFIOGroup generally involves
> pinning a host virtual address in order to create a stable host
> physical address and then mapping a translation from the associated
> guest physical address to that host physical address into the IOMMU.
> 
> While the above maintains the VFIOContainer synchronized to the QEMU
> memory API of the VM, memory ballooning occurs outside of that API.
> Inflating the memory balloon (ie. cooperatively capturing pages from
> the guest for use by the host) simply uses MADV_DONTNEED to "zap"
> pages from QEMU's host virtual address space.  The page pinning and
> IOMMU mapping above remains in place, negating the host's ability to
> reuse the page, but the host virtual to host physical mapping of the
> page is invalidated outside of QEMU's memory API.
> 
> When the balloon is later deflated, attempting to cooperatively
> return pages to the guest, the page is simply freed by the guest
> balloon driver, allowing it to be used in the guest and incurring a
> page fault when that occurs.  The page fault maps a new host physical
> page backing the existing host virtual address, meanwhile the
> VFIOContainer still maintains the translation to the original host
> physical address.  At this point the guest vCPU and any assigned
> devices will map different host physical addresses to the same guest
> physical address.  Badness.
> 
> The IOMMU typically does not have page level granularity with which
> it can track this mapping without also incurring inefficiencies in
> using page size mappings throughout.  MMU notifiers in the host
> kernel also provide indicators for invalidating the mapping on
> balloon inflation, not for updating the mapping when the balloon is
> deflated.  For these reasons we assume a default behavior that the
> mapping of each VFIOGroup into the VFIOContainer is incompatible
> with memory ballooning and increment the balloon inhibitor to match
> the attached VFIOGroups.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  hw/vfio/common.c |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index fb396cf00ac4..4881b691a659 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -32,6 +32,7 @@
>  #include "hw/hw.h"
>  #include "qemu/error-report.h"
>  #include "qemu/range.h"
> +#include "sysemu/balloon.h"
>  #include "sysemu/kvm.h"
>  #include "trace.h"
>  #include "qapi/error.h"
> @@ -1049,6 +1050,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>              group->container = container;
>              QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>              vfio_kvm_device_add_group(group);
> +            qemu_balloon_inhibit(true);

[1]

>              return 0;
>          }
>      }
> @@ -1198,6 +1200,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      }
>  
>      vfio_kvm_device_add_group(group);
> +    qemu_balloon_inhibit(true);

AFAIU there is a very critical information that this
qemu_balloon_inhibit() call must be before the call to:

    memory_listener_register(&container->listener, container->space->as);

Since the memory listener registeration is the point when we do the
pinning of the pages.  So to make sure we won't have stale pages we
must call qemu_balloon_inhibit() before memory_listener_register()
(which is what this patch does).  However this is not that obvious,
not sure whether that might worth a comment.

Considering this, not sure whether we can just do this per-container
instead of per-group, then we also don't need to bother with extra
group-add paths like [1].

No matter what, this patch looks good to me (and it is correct AFAIK),
so I'm leaving r-b and I'll leave Alex to decide:

Reviewed-by: Peter Xu <peterx@redhat.com>

>  
>      QLIST_INIT(&container->group_list);
>      QLIST_INSERT_HEAD(&space->containers, container, next);
> @@ -1222,6 +1225,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>  listener_release_exit:
>      QLIST_REMOVE(group, container_next);
>      QLIST_REMOVE(container, next);
> +    qemu_balloon_inhibit(false);
>      vfio_kvm_device_del_group(group);
>      vfio_listener_release(container);
>  
> @@ -1352,6 +1356,7 @@ void vfio_put_group(VFIOGroup *group)
>          return;
>      }
>  
> +    qemu_balloon_inhibit(false);
>      vfio_kvm_device_del_group(group);
>      vfio_disconnect_container(group);
>      QLIST_REMOVE(group, next);
> 

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 4/4] vfio/ccw/pci: Allow devices to opt-in for ballooning
  2018-07-30 23:14   ` [Qemu-devel] " Alex Williamson
@ 2018-08-07 14:15     ` Cornelia Huck
  -1 siblings, 0 replies; 42+ messages in thread
From: Cornelia Huck @ 2018-08-07 14:15 UTC (permalink / raw)
  To: Alex Williamson; +Cc: peterx, david, qemu-devel, kvm, mst

On Mon, 30 Jul 2018 17:14:37 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> If a vfio assigned device makes use of a physical IOMMU, then memory
> ballooning is necessarily inhibited due to the page pinning, lack of
> page level granularity at the IOMMU, and sufficient notifiers to both
> remove the page on balloon inflation and add it back on deflation.
> However, not all devices are backed by a physical IOMMU.  In the case
> of mediated devices, if a vendor driver is well synchronized with the
> guest driver, such that only pages actively used by the guest driver
> are pinned by the host mdev vendor driver, then there should be no
> overlap between pages available for the balloon driver and pages
> actively in use by the device.  Under these conditions, ballooning
> should be safe.
> 
> vfio-ccw devices are always mediated devices and always operate under
> the constraints above.  Therefore we can consider all vfio-ccw devices
> as balloon compatible.

I agree, that should be the case.

For the upcoming vfio-ap devices? I'm not sure how much control there
is over the pages that are used.

> 
> The situation is far from straightforward with vfio-pci.  These
> devices can be physical devices with physical IOMMU backing or
> mediated devices where it is unknown whether a physical IOMMU is in
> use or whether the vendor driver is well synchronized to the working
> set of the guest driver.  The safest approach is therefore to assume
> all vfio-pci devices are incompatible with ballooning, but allow user
> opt-in should they have further insight into mediated devices.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  hw/vfio/ccw.c                 |    9 +++++++++
>  hw/vfio/common.c              |   23 ++++++++++++++++++++++-
>  hw/vfio/pci.c                 |   26 +++++++++++++++++++++++++-
>  hw/vfio/trace-events          |    1 +
>  include/hw/vfio/vfio-common.h |    2 ++
>  5 files changed, 59 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
> index 351b305e1ae7..40e7b5623e69 100644
> --- a/hw/vfio/ccw.c
> +++ b/hw/vfio/ccw.c
> @@ -349,6 +349,15 @@ static void vfio_ccw_get_device(VFIOGroup *group, VFIOCCWDevice *vcdev,
>          }
>      }
>  
> +    /*
> +     * All vfio-ccw devices are believed to operate compatibly with memory

Better "to operate in a way compatible with memory ballooning"?

> +     * ballooning, ie. pages pinned in the host are in the current working
> +     * set of the guest driver and therefore never overlap with pages
> +     * available to the guest balloon driver.  This needs to be set before
> +     * vfio_get_device() for vfio common to handle the balloon inhibitor.
> +     */
> +    vcdev->vdev.balloon_allowed = true;
> +
>      if (vfio_get_device(group, vcdev->cdev.mdevid, &vcdev->vdev, errp)) {
>          goto out_err;
>      }

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 4/4] vfio/ccw/pci: Allow devices to opt-in for ballooning
@ 2018-08-07 14:15     ` Cornelia Huck
  0 siblings, 0 replies; 42+ messages in thread
From: Cornelia Huck @ 2018-08-07 14:15 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, kvm, peterx, mst, david

On Mon, 30 Jul 2018 17:14:37 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> If a vfio assigned device makes use of a physical IOMMU, then memory
> ballooning is necessarily inhibited due to the page pinning, lack of
> page level granularity at the IOMMU, and sufficient notifiers to both
> remove the page on balloon inflation and add it back on deflation.
> However, not all devices are backed by a physical IOMMU.  In the case
> of mediated devices, if a vendor driver is well synchronized with the
> guest driver, such that only pages actively used by the guest driver
> are pinned by the host mdev vendor driver, then there should be no
> overlap between pages available for the balloon driver and pages
> actively in use by the device.  Under these conditions, ballooning
> should be safe.
> 
> vfio-ccw devices are always mediated devices and always operate under
> the constraints above.  Therefore we can consider all vfio-ccw devices
> as balloon compatible.

I agree, that should be the case.

For the upcoming vfio-ap devices? I'm not sure how much control there
is over the pages that are used.

> 
> The situation is far from straightforward with vfio-pci.  These
> devices can be physical devices with physical IOMMU backing or
> mediated devices where it is unknown whether a physical IOMMU is in
> use or whether the vendor driver is well synchronized to the working
> set of the guest driver.  The safest approach is therefore to assume
> all vfio-pci devices are incompatible with ballooning, but allow user
> opt-in should they have further insight into mediated devices.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  hw/vfio/ccw.c                 |    9 +++++++++
>  hw/vfio/common.c              |   23 ++++++++++++++++++++++-
>  hw/vfio/pci.c                 |   26 +++++++++++++++++++++++++-
>  hw/vfio/trace-events          |    1 +
>  include/hw/vfio/vfio-common.h |    2 ++
>  5 files changed, 59 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
> index 351b305e1ae7..40e7b5623e69 100644
> --- a/hw/vfio/ccw.c
> +++ b/hw/vfio/ccw.c
> @@ -349,6 +349,15 @@ static void vfio_ccw_get_device(VFIOGroup *group, VFIOCCWDevice *vcdev,
>          }
>      }
>  
> +    /*
> +     * All vfio-ccw devices are believed to operate compatibly with memory

Better "to operate in a way compatible with memory ballooning"?

> +     * ballooning, ie. pages pinned in the host are in the current working
> +     * set of the guest driver and therefore never overlap with pages
> +     * available to the guest balloon driver.  This needs to be set before
> +     * vfio_get_device() for vfio common to handle the balloon inhibitor.
> +     */
> +    vcdev->vdev.balloon_allowed = true;
> +
>      if (vfio_get_device(group, vcdev->cdev.mdevid, &vcdev->vdev, errp)) {
>          goto out_err;
>      }

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] balloon: Allow nested inhibits
  2018-07-30 23:13   ` [Qemu-devel] " Alex Williamson
@ 2018-08-07 14:20     ` Cornelia Huck
  -1 siblings, 0 replies; 42+ messages in thread
From: Cornelia Huck @ 2018-08-07 14:20 UTC (permalink / raw)
  To: Alex Williamson; +Cc: peterx, david, qemu-devel, kvm, mst

On Mon, 30 Jul 2018 17:13:46 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> A simple true/false internal state does not allow multiple users.  Fix
> this within the existing interface by converting to a counter, so long
> as the counter is elevated, ballooning is inhibited.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  balloon.c |   13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)

Reviewed-by: Cornelia Huck <cohuck@redhat.com>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 1/4] balloon: Allow nested inhibits
@ 2018-08-07 14:20     ` Cornelia Huck
  0 siblings, 0 replies; 42+ messages in thread
From: Cornelia Huck @ 2018-08-07 14:20 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, kvm, peterx, mst, david

On Mon, 30 Jul 2018 17:13:46 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> A simple true/false internal state does not allow multiple users.  Fix
> this within the existing interface by converting to a counter, so long
> as the counter is elevated, ballooning is inhibited.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  balloon.c |   13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)

Reviewed-by: Cornelia Huck <cohuck@redhat.com>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 2/4] kvm: Use inhibit to prevent ballooning without synchronous mmu
  2018-07-30 23:14   ` [Qemu-devel] " Alex Williamson
@ 2018-08-07 14:24     ` Cornelia Huck
  -1 siblings, 0 replies; 42+ messages in thread
From: Cornelia Huck @ 2018-08-07 14:24 UTC (permalink / raw)
  To: Alex Williamson; +Cc: peterx, david, qemu-devel, kvm, mst

On Mon, 30 Jul 2018 17:14:06 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> Remove KVM specific tests in balloon_page(), instead marking
> ballooning as inhibited without KVM_CAP_SYNC_MMU support.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  accel/kvm/kvm-all.c        |    4 ++++
>  hw/virtio/virtio-balloon.c |    4 +---
>  2 files changed, 5 insertions(+), 3 deletions(-)

Reviewed-by: Cornelia Huck <cohuck@redhat.com>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/4] kvm: Use inhibit to prevent ballooning without synchronous mmu
@ 2018-08-07 14:24     ` Cornelia Huck
  0 siblings, 0 replies; 42+ messages in thread
From: Cornelia Huck @ 2018-08-07 14:24 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, kvm, peterx, mst, david

On Mon, 30 Jul 2018 17:14:06 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> Remove KVM specific tests in balloon_page(), instead marking
> ballooning as inhibited without KVM_CAP_SYNC_MMU support.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  accel/kvm/kvm-all.c        |    4 ++++
>  hw/virtio/virtio-balloon.c |    4 +---
>  2 files changed, 5 insertions(+), 3 deletions(-)

Reviewed-by: Cornelia Huck <cohuck@redhat.com>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 3/4] vfio: Inhibit ballooning based on group attachment to a container
  2018-08-07 13:10     ` [Qemu-devel] " Peter Xu
@ 2018-08-07 16:35       ` Alex Williamson
  -1 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2018-08-07 16:35 UTC (permalink / raw)
  To: Peter Xu; +Cc: cohuck, david, qemu-devel, kvm, mst

On Tue, 7 Aug 2018 21:10:21 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Mon, Jul 30, 2018 at 05:14:21PM -0600, Alex Williamson wrote:
> > We use a VFIOContainer to associate an AddressSpace to one or more
> > VFIOGroups.  The VFIOContainer represents the DMA context for that
> > AdressSpace for those VFIOGroups and is synchronized to changes in
> > that AddressSpace via a MemoryListener.  For IOMMU backed devices,
> > maintaining the DMA context for a VFIOGroup generally involves
> > pinning a host virtual address in order to create a stable host
> > physical address and then mapping a translation from the associated
> > guest physical address to that host physical address into the IOMMU.
> > 
> > While the above maintains the VFIOContainer synchronized to the QEMU
> > memory API of the VM, memory ballooning occurs outside of that API.
> > Inflating the memory balloon (ie. cooperatively capturing pages from
> > the guest for use by the host) simply uses MADV_DONTNEED to "zap"
> > pages from QEMU's host virtual address space.  The page pinning and
> > IOMMU mapping above remains in place, negating the host's ability to
> > reuse the page, but the host virtual to host physical mapping of the
> > page is invalidated outside of QEMU's memory API.
> > 
> > When the balloon is later deflated, attempting to cooperatively
> > return pages to the guest, the page is simply freed by the guest
> > balloon driver, allowing it to be used in the guest and incurring a
> > page fault when that occurs.  The page fault maps a new host physical
> > page backing the existing host virtual address, meanwhile the
> > VFIOContainer still maintains the translation to the original host
> > physical address.  At this point the guest vCPU and any assigned
> > devices will map different host physical addresses to the same guest
> > physical address.  Badness.
> > 
> > The IOMMU typically does not have page level granularity with which
> > it can track this mapping without also incurring inefficiencies in
> > using page size mappings throughout.  MMU notifiers in the host
> > kernel also provide indicators for invalidating the mapping on
> > balloon inflation, not for updating the mapping when the balloon is
> > deflated.  For these reasons we assume a default behavior that the
> > mapping of each VFIOGroup into the VFIOContainer is incompatible
> > with memory ballooning and increment the balloon inhibitor to match
> > the attached VFIOGroups.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  hw/vfio/common.c |    5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > index fb396cf00ac4..4881b691a659 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -32,6 +32,7 @@
> >  #include "hw/hw.h"
> >  #include "qemu/error-report.h"
> >  #include "qemu/range.h"
> > +#include "sysemu/balloon.h"
> >  #include "sysemu/kvm.h"
> >  #include "trace.h"
> >  #include "qapi/error.h"
> > @@ -1049,6 +1050,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> >              group->container = container;
> >              QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> >              vfio_kvm_device_add_group(group);
> > +            qemu_balloon_inhibit(true);  
> 
> [1]
> 
> >              return 0;
> >          }
> >      }
> > @@ -1198,6 +1200,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> >      }
> >  
> >      vfio_kvm_device_add_group(group);
> > +    qemu_balloon_inhibit(true);  
> 
> AFAIU there is a very critical information that this
> qemu_balloon_inhibit() call must be before the call to:
> 
>     memory_listener_register(&container->listener, container->space->as);
> 
> Since the memory listener registeration is the point when we do the
> pinning of the pages.  So to make sure we won't have stale pages we
> must call qemu_balloon_inhibit() before memory_listener_register()
> (which is what this patch does).  However this is not that obvious,
> not sure whether that might worth a comment.
> 
> Considering this, not sure whether we can just do this per-container
> instead of per-group, then we also don't need to bother with extra
> group-add paths like [1].
> 
> No matter what, this patch looks good to me (and it is correct AFAIK),
> so I'm leaving r-b and I'll leave Alex to decide:

Thanks Peter.  I agree, I'll add more commentary.  A minor correction,
we won't have "stale" pages at the time we pin, the act of pinning will
make those valid (same as I discussed in reply to mst why we don't need
to worry about pages ballooned before the device is added), but once a
page is pinned, we need to make sure it's not madvised dontneed, so we
need to be sure there's no possible race there, which effectively means
inhibiting before the memory listener can do any pinning.

The reason I chose to inhibit per group is that it becomes easier to
allow endpoint drivers to opt-in.  For instance if we could have ccw
and vfio-pci in the same VM, they would by default share a container.
If ccw releases the inhibit, we'd need to somehow reinstate it for the
vfio-pci device and remember which did what if one is hot unplugged.
Doing the inhibit at the group level resolves this, the ccw group adds
an inhibit by default, then releases it, the vfio-pci group adds an
inhibit and maintains it so long as attached.  I struggled with whether
this should actually be a per-device inhibit, but then there's a gap
that the container listener is active before the device is retrieved,
so again the per-group inhibit was a better fit.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 3/4] vfio: Inhibit ballooning based on group attachment to a container
@ 2018-08-07 16:35       ` Alex Williamson
  0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2018-08-07 16:35 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, kvm, cohuck, mst, david

On Tue, 7 Aug 2018 21:10:21 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Mon, Jul 30, 2018 at 05:14:21PM -0600, Alex Williamson wrote:
> > We use a VFIOContainer to associate an AddressSpace to one or more
> > VFIOGroups.  The VFIOContainer represents the DMA context for that
> > AdressSpace for those VFIOGroups and is synchronized to changes in
> > that AddressSpace via a MemoryListener.  For IOMMU backed devices,
> > maintaining the DMA context for a VFIOGroup generally involves
> > pinning a host virtual address in order to create a stable host
> > physical address and then mapping a translation from the associated
> > guest physical address to that host physical address into the IOMMU.
> > 
> > While the above maintains the VFIOContainer synchronized to the QEMU
> > memory API of the VM, memory ballooning occurs outside of that API.
> > Inflating the memory balloon (ie. cooperatively capturing pages from
> > the guest for use by the host) simply uses MADV_DONTNEED to "zap"
> > pages from QEMU's host virtual address space.  The page pinning and
> > IOMMU mapping above remains in place, negating the host's ability to
> > reuse the page, but the host virtual to host physical mapping of the
> > page is invalidated outside of QEMU's memory API.
> > 
> > When the balloon is later deflated, attempting to cooperatively
> > return pages to the guest, the page is simply freed by the guest
> > balloon driver, allowing it to be used in the guest and incurring a
> > page fault when that occurs.  The page fault maps a new host physical
> > page backing the existing host virtual address, meanwhile the
> > VFIOContainer still maintains the translation to the original host
> > physical address.  At this point the guest vCPU and any assigned
> > devices will map different host physical addresses to the same guest
> > physical address.  Badness.
> > 
> > The IOMMU typically does not have page level granularity with which
> > it can track this mapping without also incurring inefficiencies in
> > using page size mappings throughout.  MMU notifiers in the host
> > kernel also provide indicators for invalidating the mapping on
> > balloon inflation, not for updating the mapping when the balloon is
> > deflated.  For these reasons we assume a default behavior that the
> > mapping of each VFIOGroup into the VFIOContainer is incompatible
> > with memory ballooning and increment the balloon inhibitor to match
> > the attached VFIOGroups.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  hw/vfio/common.c |    5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > index fb396cf00ac4..4881b691a659 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -32,6 +32,7 @@
> >  #include "hw/hw.h"
> >  #include "qemu/error-report.h"
> >  #include "qemu/range.h"
> > +#include "sysemu/balloon.h"
> >  #include "sysemu/kvm.h"
> >  #include "trace.h"
> >  #include "qapi/error.h"
> > @@ -1049,6 +1050,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> >              group->container = container;
> >              QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> >              vfio_kvm_device_add_group(group);
> > +            qemu_balloon_inhibit(true);  
> 
> [1]
> 
> >              return 0;
> >          }
> >      }
> > @@ -1198,6 +1200,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> >      }
> >  
> >      vfio_kvm_device_add_group(group);
> > +    qemu_balloon_inhibit(true);  
> 
> AFAIU there is a very critical information that this
> qemu_balloon_inhibit() call must be before the call to:
> 
>     memory_listener_register(&container->listener, container->space->as);
> 
> Since the memory listener registeration is the point when we do the
> pinning of the pages.  So to make sure we won't have stale pages we
> must call qemu_balloon_inhibit() before memory_listener_register()
> (which is what this patch does).  However this is not that obvious,
> not sure whether that might worth a comment.
> 
> Considering this, not sure whether we can just do this per-container
> instead of per-group, then we also don't need to bother with extra
> group-add paths like [1].
> 
> No matter what, this patch looks good to me (and it is correct AFAIK),
> so I'm leaving r-b and I'll leave Alex to decide:

Thanks Peter.  I agree, I'll add more commentary.  A minor correction,
we won't have "stale" pages at the time we pin, the act of pinning will
make those valid (same as I discussed in reply to mst why we don't need
to worry about pages ballooned before the device is added), but once a
page is pinned, we need to make sure it's not madvised dontneed, so we
need to be sure there's no possible race there, which effectively means
inhibiting before the memory listener can do any pinning.

The reason I chose to inhibit per group is that it becomes easier to
allow endpoint drivers to opt-in.  For instance if we could have ccw
and vfio-pci in the same VM, they would by default share a container.
If ccw releases the inhibit, we'd need to somehow reinstate it for the
vfio-pci device and remember which did what if one is hot unplugged.
Doing the inhibit at the group level resolves this, the ccw group adds
an inhibit by default, then releases it, the vfio-pci group adds an
inhibit and maintains it so long as attached.  I struggled with whether
this should actually be a per-device inhibit, but then there's a gap
that the container listener is active before the device is retrieved,
so again the per-group inhibit was a better fit.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 3/4] vfio: Inhibit ballooning based on group attachment to a container
  2018-08-07 16:35       ` [Qemu-devel] " Alex Williamson
@ 2018-08-08  3:22         ` Peter Xu
  -1 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2018-08-08  3:22 UTC (permalink / raw)
  To: Alex Williamson; +Cc: cohuck, david, qemu-devel, kvm, mst

On Tue, Aug 07, 2018 at 10:35:05AM -0600, Alex Williamson wrote:
> On Tue, 7 Aug 2018 21:10:21 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Mon, Jul 30, 2018 at 05:14:21PM -0600, Alex Williamson wrote:
> > > We use a VFIOContainer to associate an AddressSpace to one or more
> > > VFIOGroups.  The VFIOContainer represents the DMA context for that
> > > AdressSpace for those VFIOGroups and is synchronized to changes in
> > > that AddressSpace via a MemoryListener.  For IOMMU backed devices,
> > > maintaining the DMA context for a VFIOGroup generally involves
> > > pinning a host virtual address in order to create a stable host
> > > physical address and then mapping a translation from the associated
> > > guest physical address to that host physical address into the IOMMU.
> > > 
> > > While the above maintains the VFIOContainer synchronized to the QEMU
> > > memory API of the VM, memory ballooning occurs outside of that API.
> > > Inflating the memory balloon (ie. cooperatively capturing pages from
> > > the guest for use by the host) simply uses MADV_DONTNEED to "zap"
> > > pages from QEMU's host virtual address space.  The page pinning and
> > > IOMMU mapping above remains in place, negating the host's ability to
> > > reuse the page, but the host virtual to host physical mapping of the
> > > page is invalidated outside of QEMU's memory API.
> > > 
> > > When the balloon is later deflated, attempting to cooperatively
> > > return pages to the guest, the page is simply freed by the guest
> > > balloon driver, allowing it to be used in the guest and incurring a
> > > page fault when that occurs.  The page fault maps a new host physical
> > > page backing the existing host virtual address, meanwhile the
> > > VFIOContainer still maintains the translation to the original host
> > > physical address.  At this point the guest vCPU and any assigned
> > > devices will map different host physical addresses to the same guest
> > > physical address.  Badness.
> > > 
> > > The IOMMU typically does not have page level granularity with which
> > > it can track this mapping without also incurring inefficiencies in
> > > using page size mappings throughout.  MMU notifiers in the host
> > > kernel also provide indicators for invalidating the mapping on
> > > balloon inflation, not for updating the mapping when the balloon is
> > > deflated.  For these reasons we assume a default behavior that the
> > > mapping of each VFIOGroup into the VFIOContainer is incompatible
> > > with memory ballooning and increment the balloon inhibitor to match
> > > the attached VFIOGroups.
> > > 
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > ---
> > >  hw/vfio/common.c |    5 +++++
> > >  1 file changed, 5 insertions(+)
> > > 
> > > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > > index fb396cf00ac4..4881b691a659 100644
> > > --- a/hw/vfio/common.c
> > > +++ b/hw/vfio/common.c
> > > @@ -32,6 +32,7 @@
> > >  #include "hw/hw.h"
> > >  #include "qemu/error-report.h"
> > >  #include "qemu/range.h"
> > > +#include "sysemu/balloon.h"
> > >  #include "sysemu/kvm.h"
> > >  #include "trace.h"
> > >  #include "qapi/error.h"
> > > @@ -1049,6 +1050,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> > >              group->container = container;
> > >              QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> > >              vfio_kvm_device_add_group(group);
> > > +            qemu_balloon_inhibit(true);  
> > 
> > [1]
> > 
> > >              return 0;
> > >          }
> > >      }
> > > @@ -1198,6 +1200,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> > >      }
> > >  
> > >      vfio_kvm_device_add_group(group);
> > > +    qemu_balloon_inhibit(true);  
> > 
> > AFAIU there is a very critical information that this
> > qemu_balloon_inhibit() call must be before the call to:
> > 
> >     memory_listener_register(&container->listener, container->space->as);
> > 
> > Since the memory listener registeration is the point when we do the
> > pinning of the pages.  So to make sure we won't have stale pages we
> > must call qemu_balloon_inhibit() before memory_listener_register()
> > (which is what this patch does).  However this is not that obvious,
> > not sure whether that might worth a comment.
> > 
> > Considering this, not sure whether we can just do this per-container
> > instead of per-group, then we also don't need to bother with extra
> > group-add paths like [1].
> > 
> > No matter what, this patch looks good to me (and it is correct AFAIK),
> > so I'm leaving r-b and I'll leave Alex to decide:
> 
> Thanks Peter.  I agree, I'll add more commentary.  A minor correction,
> we won't have "stale" pages at the time we pin, the act of pinning will
> make those valid (same as I discussed in reply to mst why we don't need
> to worry about pages ballooned before the device is added), but once a
> page is pinned, we need to make sure it's not madvised dontneed, so we
> need to be sure there's no possible race there, which effectively means
> inhibiting before the memory listener can do any pinning.

Yes.

> 
> The reason I chose to inhibit per group is that it becomes easier to
> allow endpoint drivers to opt-in.  For instance if we could have ccw
> and vfio-pci in the same VM, they would by default share a container.
> If ccw releases the inhibit, we'd need to somehow reinstate it for the
> vfio-pci device and remember which did what if one is hot unplugged.
> Doing the inhibit at the group level resolves this, the ccw group adds
> an inhibit by default, then releases it, the vfio-pci group adds an
> inhibit and maintains it so long as attached.  I struggled with whether
> this should actually be a per-device inhibit, but then there's a gap
> that the container listener is active before the device is retrieved,
> so again the per-group inhibit was a better fit.  Thanks,

Thanks for explaining.  I didn't look into the ccw patch, but it
sounds reasonable to me now.

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 3/4] vfio: Inhibit ballooning based on group attachment to a container
@ 2018-08-08  3:22         ` Peter Xu
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2018-08-08  3:22 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, kvm, cohuck, mst, david

On Tue, Aug 07, 2018 at 10:35:05AM -0600, Alex Williamson wrote:
> On Tue, 7 Aug 2018 21:10:21 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Mon, Jul 30, 2018 at 05:14:21PM -0600, Alex Williamson wrote:
> > > We use a VFIOContainer to associate an AddressSpace to one or more
> > > VFIOGroups.  The VFIOContainer represents the DMA context for that
> > > AdressSpace for those VFIOGroups and is synchronized to changes in
> > > that AddressSpace via a MemoryListener.  For IOMMU backed devices,
> > > maintaining the DMA context for a VFIOGroup generally involves
> > > pinning a host virtual address in order to create a stable host
> > > physical address and then mapping a translation from the associated
> > > guest physical address to that host physical address into the IOMMU.
> > > 
> > > While the above maintains the VFIOContainer synchronized to the QEMU
> > > memory API of the VM, memory ballooning occurs outside of that API.
> > > Inflating the memory balloon (ie. cooperatively capturing pages from
> > > the guest for use by the host) simply uses MADV_DONTNEED to "zap"
> > > pages from QEMU's host virtual address space.  The page pinning and
> > > IOMMU mapping above remains in place, negating the host's ability to
> > > reuse the page, but the host virtual to host physical mapping of the
> > > page is invalidated outside of QEMU's memory API.
> > > 
> > > When the balloon is later deflated, attempting to cooperatively
> > > return pages to the guest, the page is simply freed by the guest
> > > balloon driver, allowing it to be used in the guest and incurring a
> > > page fault when that occurs.  The page fault maps a new host physical
> > > page backing the existing host virtual address, meanwhile the
> > > VFIOContainer still maintains the translation to the original host
> > > physical address.  At this point the guest vCPU and any assigned
> > > devices will map different host physical addresses to the same guest
> > > physical address.  Badness.
> > > 
> > > The IOMMU typically does not have page level granularity with which
> > > it can track this mapping without also incurring inefficiencies in
> > > using page size mappings throughout.  MMU notifiers in the host
> > > kernel also provide indicators for invalidating the mapping on
> > > balloon inflation, not for updating the mapping when the balloon is
> > > deflated.  For these reasons we assume a default behavior that the
> > > mapping of each VFIOGroup into the VFIOContainer is incompatible
> > > with memory ballooning and increment the balloon inhibitor to match
> > > the attached VFIOGroups.
> > > 
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > ---
> > >  hw/vfio/common.c |    5 +++++
> > >  1 file changed, 5 insertions(+)
> > > 
> > > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > > index fb396cf00ac4..4881b691a659 100644
> > > --- a/hw/vfio/common.c
> > > +++ b/hw/vfio/common.c
> > > @@ -32,6 +32,7 @@
> > >  #include "hw/hw.h"
> > >  #include "qemu/error-report.h"
> > >  #include "qemu/range.h"
> > > +#include "sysemu/balloon.h"
> > >  #include "sysemu/kvm.h"
> > >  #include "trace.h"
> > >  #include "qapi/error.h"
> > > @@ -1049,6 +1050,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> > >              group->container = container;
> > >              QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> > >              vfio_kvm_device_add_group(group);
> > > +            qemu_balloon_inhibit(true);  
> > 
> > [1]
> > 
> > >              return 0;
> > >          }
> > >      }
> > > @@ -1198,6 +1200,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> > >      }
> > >  
> > >      vfio_kvm_device_add_group(group);
> > > +    qemu_balloon_inhibit(true);  
> > 
> > AFAIU there is a very critical information that this
> > qemu_balloon_inhibit() call must be before the call to:
> > 
> >     memory_listener_register(&container->listener, container->space->as);
> > 
> > Since the memory listener registeration is the point when we do the
> > pinning of the pages.  So to make sure we won't have stale pages we
> > must call qemu_balloon_inhibit() before memory_listener_register()
> > (which is what this patch does).  However this is not that obvious,
> > not sure whether that might worth a comment.
> > 
> > Considering this, not sure whether we can just do this per-container
> > instead of per-group, then we also don't need to bother with extra
> > group-add paths like [1].
> > 
> > No matter what, this patch looks good to me (and it is correct AFAIK),
> > so I'm leaving r-b and I'll leave Alex to decide:
> 
> Thanks Peter.  I agree, I'll add more commentary.  A minor correction,
> we won't have "stale" pages at the time we pin, the act of pinning will
> make those valid (same as I discussed in reply to mst why we don't need
> to worry about pages ballooned before the device is added), but once a
> page is pinned, we need to make sure it's not madvised dontneed, so we
> need to be sure there's no possible race there, which effectively means
> inhibiting before the memory listener can do any pinning.

Yes.

> 
> The reason I chose to inhibit per group is that it becomes easier to
> allow endpoint drivers to opt-in.  For instance if we could have ccw
> and vfio-pci in the same VM, they would by default share a container.
> If ccw releases the inhibit, we'd need to somehow reinstate it for the
> vfio-pci device and remember which did what if one is hot unplugged.
> Doing the inhibit at the group level resolves this, the ccw group adds
> an inhibit by default, then releases it, the vfio-pci group adds an
> inhibit and maintains it so long as attached.  I struggled with whether
> this should actually be a per-device inhibit, but then there's a gap
> that the container listener is active before the device is retrieved,
> so again the per-group inhibit was a better fit.  Thanks,

Thanks for explaining.  I didn't look into the ccw patch, but it
sounds reasonable to me now.

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2018-08-08  3:22 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-30 23:13 [PATCH v2 0/4] Balloon inhibit enhancements, vfio restriction Alex Williamson
2018-07-30 23:13 ` [Qemu-devel] " Alex Williamson
2018-07-30 23:13 ` [PATCH v2 1/4] balloon: Allow nested inhibits Alex Williamson
2018-07-30 23:13   ` [Qemu-devel] " Alex Williamson
2018-07-31  8:25   ` David Hildenbrand
2018-07-31  8:25     ` [Qemu-devel] " David Hildenbrand
2018-08-07 12:56   ` Peter Xu
2018-08-07 12:56     ` [Qemu-devel] " Peter Xu
2018-08-07 14:20   ` Cornelia Huck
2018-08-07 14:20     ` [Qemu-devel] " Cornelia Huck
2018-07-30 23:14 ` [PATCH v2 2/4] kvm: Use inhibit to prevent ballooning without synchronous mmu Alex Williamson
2018-07-30 23:14   ` [Qemu-devel] " Alex Williamson
2018-07-31  8:24   ` David Hildenbrand
2018-07-31  8:24     ` [Qemu-devel] " David Hildenbrand
2018-08-07 12:56   ` Peter Xu
2018-08-07 12:56     ` [Qemu-devel] " Peter Xu
2018-08-07 14:24   ` Cornelia Huck
2018-08-07 14:24     ` [Qemu-devel] " Cornelia Huck
2018-07-30 23:14 ` [PATCH v2 3/4] vfio: Inhibit ballooning based on group attachment to a container Alex Williamson
2018-07-30 23:14   ` [Qemu-devel] " Alex Williamson
2018-08-07 13:10   ` Peter Xu
2018-08-07 13:10     ` [Qemu-devel] " Peter Xu
2018-08-07 16:35     ` Alex Williamson
2018-08-07 16:35       ` [Qemu-devel] " Alex Williamson
2018-08-08  3:22       ` Peter Xu
2018-08-08  3:22         ` [Qemu-devel] " Peter Xu
2018-07-30 23:14 ` [PATCH v2 4/4] vfio/ccw/pci: Allow devices to opt-in for ballooning Alex Williamson
2018-07-30 23:14   ` [Qemu-devel] " Alex Williamson
2018-08-07 14:15   ` Cornelia Huck
2018-08-07 14:15     ` [Qemu-devel] " Cornelia Huck
2018-07-31 12:29 ` [PATCH v2 0/4] Balloon inhibit enhancements, vfio restriction Michael S. Tsirkin
2018-07-31 12:29   ` [Qemu-devel] " Michael S. Tsirkin
2018-07-31 14:44   ` Alex Williamson
2018-07-31 14:44     ` [Qemu-devel] " Alex Williamson
2018-07-31 15:07     ` Dr. David Alan Gilbert
2018-07-31 15:07       ` [Qemu-devel] " Dr. David Alan Gilbert
2018-07-31 21:50       ` Alex Williamson
2018-07-31 21:50         ` [Qemu-devel] " Alex Williamson
2018-08-03 18:42         ` Michael S. Tsirkin
2018-08-03 18:42           ` [Qemu-devel] " Michael S. Tsirkin
2018-08-03 20:12           ` Alex Williamson
2018-08-03 20:12             ` [Qemu-devel] " Alex Williamson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.