* [RFC PATCH 0/3] Balloon inhibit enhancements
@ 2018-07-17 22:47 ` Alex Williamson
  0 siblings, 0 replies; 58+ messages in thread
From: Alex Williamson @ 2018-07-17 22:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: kvm

Directly assigned vfio devices have never been compatible with
ballooning.  Zapping MADV_DONTNEED pages happens completely
independently of vfio page pinning and IOMMU mapping, leaving us with
inconsistent GPA to HPA mapping between vCPUs and assigned devices
when the balloon deflates.  Mediated devices can theoretically do
better, if we make the assumption that the mdev vendor driver is fully
synchronized to the actual working set of the guest driver.  In that
case the guest balloon driver should never be able to allocate an mdev
pinned page for balloon inflation.  Unfortunately, QEMU can't know the
workings of the vendor driver pinning, and doesn't actually know the
difference between mdev devices and directly assigned devices.  Until
we can sort out how the vfio IOMMU backend can tell us if ballooning
is safe, the best approach is to disable ballooning any time a vfio
device is attached.

To do that, simply make the balloon inhibitor a counter rather than a
boolean, fix up a case where KVM can then use the inhibit interface
directly, and inhibit ballooning any time a vfio device is attached.
I'm expecting we'll expose some sort of flag similar to
KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
this.  An addition we could consider here would be yet another device
option for vfio, such as x-disable-balloon-inhibit, in case there are
mdev devices that behave in a manner compatible with ballooning.
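
For illustration only, not part of the series: a sketch of the nesting
semantics patch 1 aims for, using the two interfaces it touches (the
example_nested_inhibit() wrapper below is hypothetical):

    /* Ballooning stays inhibited until every inhibitor has released. */
    #include "qemu/osdep.h"
    #include "sysemu/balloon.h"

    static void example_nested_inhibit(void)
    {
        qemu_balloon_inhibit(true);            /* e.g. vfio container created */
        qemu_balloon_inhibit(true);            /* e.g. KVM without sync MMU   */
        assert(qemu_balloon_is_inhibited());

        qemu_balloon_inhibit(false);           /* one user releases           */
        assert(qemu_balloon_is_inhibited());   /* still held by the other     */

        qemu_balloon_inhibit(false);
        assert(!qemu_balloon_is_inhibited());  /* ballooning may resume       */
    }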

Please let me know if this looks like a good idea.  Thanks,

Alex

---

Alex Williamson (3):
      balloon: Allow nested inhibits
      kvm: Use inhibit to prevent ballooning without synchronous mmu
      vfio: Inhibit ballooning


 accel/kvm/kvm-all.c        |    4 ++++
 balloon.c                  |    7 ++++---
 hw/vfio/common.c           |    6 ++++++
 hw/virtio/virtio-balloon.c |    4 +---
 4 files changed, 15 insertions(+), 6 deletions(-)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [RFC PATCH 1/3] balloon: Allow nested inhibits
  2018-07-17 22:47 ` [Qemu-devel] " Alex Williamson
@ 2018-07-17 22:47   ` Alex Williamson
  -1 siblings, 0 replies; 58+ messages in thread
From: Alex Williamson @ 2018-07-17 22:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: kvm

A simple true/false internal state does not allow multiple users.  Fix
this within the existing interface by converting to a counter: so long
as the counter is elevated, ballooning is inhibited.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 balloon.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/balloon.c b/balloon.c
index 6bf0a9681377..2a6a7e1a22a0 100644
--- a/balloon.c
+++ b/balloon.c
@@ -37,16 +37,17 @@
 static QEMUBalloonEvent *balloon_event_fn;
 static QEMUBalloonStatus *balloon_stat_fn;
 static void *balloon_opaque;
-static bool balloon_inhibited;
+static int balloon_inhibited;
 
 bool qemu_balloon_is_inhibited(void)
 {
-    return balloon_inhibited;
+    return balloon_inhibited > 0;
 }
 
 void qemu_balloon_inhibit(bool state)
 {
-    balloon_inhibited = state;
+    balloon_inhibited += (state ? 1 : -1);
+    assert(balloon_inhibited >= 0);
 }
 
 static bool have_balloon(Error **errp)

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH 2/3] kvm: Use inhibit to prevent ballooning without synchronous mmu
  2018-07-17 22:47 ` [Qemu-devel] " Alex Williamson
@ 2018-07-17 22:47   ` Alex Williamson
  -1 siblings, 0 replies; 58+ messages in thread
From: Alex Williamson @ 2018-07-17 22:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: kvm

Remove the KVM-specific tests in balloon_page(), instead marking
ballooning as inhibited when KVM_CAP_SYNC_MMU is not supported.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 accel/kvm/kvm-all.c        |    4 ++++
 hw/virtio/virtio-balloon.c |    4 +---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index eb7db92a5e3b..38f468d8e2b1 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -39,6 +39,7 @@
 #include "trace.h"
 #include "hw/irq.h"
 #include "sysemu/sev.h"
+#include "sysemu/balloon.h"
 
 #include "hw/boards.h"
 
@@ -1698,6 +1699,9 @@ static int kvm_init(MachineState *ms)
     s->many_ioeventfds = kvm_check_many_ioeventfds();
 
     s->sync_mmu = !!kvm_vm_check_extension(kvm_state, KVM_CAP_SYNC_MMU);
+    if (!s->sync_mmu) {
+        qemu_balloon_inhibit(true);
+    }
 
     return 0;
 
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 1f7a87f09429..b5425080c5fb 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -21,7 +21,6 @@
 #include "hw/mem/pc-dimm.h"
 #include "sysemu/balloon.h"
 #include "hw/virtio/virtio-balloon.h"
-#include "sysemu/kvm.h"
 #include "exec/address-spaces.h"
 #include "qapi/error.h"
 #include "qapi/qapi-events-misc.h"
@@ -36,8 +35,7 @@
 
 static void balloon_page(void *addr, int deflate)
 {
-    if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
-                                         kvm_has_sync_mmu())) {
+    if (!qemu_balloon_is_inhibited()) {
         qemu_madvise(addr, BALLOON_PAGE_SIZE,
                 deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
     }

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH 3/3] vfio: Inhibit ballooning
  2018-07-17 22:47 ` [Qemu-devel] " Alex Williamson
@ 2018-07-17 22:47   ` Alex Williamson
  -1 siblings, 0 replies; 58+ messages in thread
From: Alex Williamson @ 2018-07-17 22:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: kvm

For all directly assigned devices and potentially some mdev devices,
internal management of pinned pages is incompatible with ballooning.
In the case of mdev devices, QEMU cannot know which pages the vendor
driver has mapped for the device.  It could be only the working set,
which should never include a ballooning victim page, or it could be
the entire VM address space.  Until the vfio IOMMU interface can tell
us more about its support for ballooning, assume any vfio container
inhibits ballooning.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/common.c |    6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index fb396cf00ac4..8124dd8df3c1 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -32,6 +32,7 @@
 #include "hw/hw.h"
 #include "qemu/error-report.h"
 #include "qemu/range.h"
+#include "sysemu/balloon.h"
 #include "sysemu/kvm.h"
 #include "trace.h"
 #include "qapi/error.h"
@@ -1218,6 +1219,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
 
     container->initialized = true;
 
+    /* vfio is currently incompatible with ballooning */
+    qemu_balloon_inhibit(true);
+
     return 0;
 listener_release_exit:
     QLIST_REMOVE(group, container_next);
@@ -1276,6 +1280,8 @@ static void vfio_disconnect_container(VFIOGroup *group)
         g_free(container);
 
         vfio_put_address_space(space);
+
+        qemu_balloon_inhibit(false);
     }
 }
 

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 1/3] balloon: Allow nested inhibits
  2018-07-17 22:47   ` [Qemu-devel] " Alex Williamson
@ 2018-07-18  6:40     ` Peter Xu
  -1 siblings, 0 replies; 58+ messages in thread
From: Peter Xu @ 2018-07-18  6:40 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, kvm

On Tue, Jul 17, 2018 at 04:47:37PM -0600, Alex Williamson wrote:
> A simple true/false internal state does not allow multiple users.  Fix
> this within the existing interface by converting to a counter, so long
> as the counter is elevated, ballooning is inhibited.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  balloon.c |    7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/balloon.c b/balloon.c
> index 6bf0a9681377..2a6a7e1a22a0 100644
> --- a/balloon.c
> +++ b/balloon.c
> @@ -37,16 +37,17 @@
>  static QEMUBalloonEvent *balloon_event_fn;
>  static QEMUBalloonStatus *balloon_stat_fn;
>  static void *balloon_opaque;
> -static bool balloon_inhibited;
> +static int balloon_inhibited;
>  
>  bool qemu_balloon_is_inhibited(void)
>  {
> -    return balloon_inhibited;
> +    return balloon_inhibited > 0;
>  }
>  
>  void qemu_balloon_inhibit(bool state)
>  {
> -    balloon_inhibited = state;
> +    balloon_inhibited += (state ? 1 : -1);
> +    assert(balloon_inhibited >= 0);

Better do it atomically?

>  }
>  
>  static bool have_balloon(Error **errp)
> 
> 

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/3] Balloon inhibit enhancements
  2018-07-17 22:47 ` [Qemu-devel] " Alex Williamson
@ 2018-07-18  6:48   ` Peter Xu
  -1 siblings, 0 replies; 58+ messages in thread
From: Peter Xu @ 2018-07-18  6:48 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, kvm

On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
> Directly assigned vfio devices have never been compatible with
> ballooning.  Zapping MADV_DONTNEED pages happens completely
> independent of vfio page pinning and IOMMU mapping, leaving us with
> inconsistent GPA to HPA mapping between vCPUs and assigned devices
> when the balloon deflates.  Mediated devices can theoretically do
> better, if we make the assumption that the mdev vendor driver is fully
> synchronized to the actual working set of the guest driver.  In that
> case the guest balloon driver should never be able to allocate an mdev
> pinned page for balloon inflation.  Unfortunately, QEMU can't know the
> workings of the vendor driver pinning, and doesn't actually know the
> difference between mdev devices and directly assigned devices.  Until
> we can sort out how the vfio IOMMU backend can tell us if ballooning
> is safe, the best approach is to disabling ballooning any time a vfio
> devices is attached.
> 
> To do that, simply make the balloon inhibitor a counter rather than a
> boolean, fixup a case where KVM can then simply use the inhibit
> interface, and inhibit ballooning any time a vfio device is attached.
> I'm expecting we'll expose some sort of flag similar to
> KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
> this.  An addition we could consider here would be yet another device
> option for vfio, such as x-disable-balloon-inhibit, in case there are
> mdev devices that behave in a manner compatible with ballooning.
> 
> Please let me know if this looks like a good idea.  Thanks,

IMHO patches 1-2 are good cleanups as standalone patches...

I honestly have no idea whether people would want to use vfio-pci and
the balloon device at the same time.  After all, vfio-pci is mostly
used by performance-sensitive users, so I would vaguely guess that
they don't really care about thin provisioning of memory, hence the
usage scenario might barely exist.  Is that the major reason we'd just
like to disable it (which makes sense to me)?

I'm wondering what would happen if we wanted to support that some
day... Would it work if we just let vfio-pci devices register some
guest memory invalidation hook (just like the iommu notifiers, but for
the guest memory address space instead), then map/unmap the IOMMU
pages there for the vfio-pci device, making sure the inflated balloon
pages are not mapped and that new pages are remapped with the correct
HPA after deflation?  This is purely a question out of curiosity, and
it makes little sense if the answer to the first question above is
positive.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/3] Balloon inhibit enhancements
  2018-07-18  6:48   ` [Qemu-devel] " Peter Xu
@ 2018-07-18  9:36     ` Cornelia Huck
  -1 siblings, 0 replies; 58+ messages in thread
From: Cornelia Huck @ 2018-07-18  9:36 UTC (permalink / raw)
  To: Peter Xu; +Cc: Alex Williamson, qemu-devel, kvm

On Wed, 18 Jul 2018 14:48:03 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
> > Directly assigned vfio devices have never been compatible with
> > ballooning.  Zapping MADV_DONTNEED pages happens completely
> > independent of vfio page pinning and IOMMU mapping, leaving us with
> > inconsistent GPA to HPA mapping between vCPUs and assigned devices
> > when the balloon deflates.  Mediated devices can theoretically do
> > better, if we make the assumption that the mdev vendor driver is fully
> > synchronized to the actual working set of the guest driver.  In that
> > case the guest balloon driver should never be able to allocate an mdev
> > pinned page for balloon inflation.  Unfortunately, QEMU can't know the
> > workings of the vendor driver pinning, and doesn't actually know the
> > difference between mdev devices and directly assigned devices.  Until
> > we can sort out how the vfio IOMMU backend can tell us if ballooning
> > is safe, the best approach is to disabling ballooning any time a vfio
> > devices is attached.
> > 
> > To do that, simply make the balloon inhibitor a counter rather than a
> > boolean, fixup a case where KVM can then simply use the inhibit
> > interface, and inhibit ballooning any time a vfio device is attached.
> > I'm expecting we'll expose some sort of flag similar to
> > KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
> > this.  An addition we could consider here would be yet another device
> > option for vfio, such as x-disable-balloon-inhibit, in case there are
> > mdev devices that behave in a manner compatible with ballooning.
> > 
> > Please let me know if this looks like a good idea.  Thanks,  
> 
> IMHO patches 1-2 are good cleanup as standalone patches...
> 
> I totally have no idea on whether people would like to use vfio-pci
> and the balloon device at the same time.  After all vfio-pci are
> majorly for performance players, then I would vaguely guess that they
> don't really care thin provisioning of memory at all, hence the usage
> scenario might not exist much.  Is that the major reason that we'd
> just like to disable it (which makes sense to me)?

Don't people use vfio-pci as well if they want some special
capabilities from the passed-through device? (At least that's the main
use case for vfio-ccw, not any performance considerations.)

> 
> I'm wondering what if want to do that somehow some day... Whether
> it'll work if we just let vfio-pci devices to register some guest
> memory invalidation hook (just like the iommu notifiers, but for guest
> memory address space instead), then we map/unmap the IOMMU pages there
> for vfio-pci device to make sure the inflated balloon pages are not
> mapped and also make sure new pages are remapped with correct HPA
> after deflated.  This is a pure question out of my curiosity, and for
> sure it makes little sense if the answer of the first question above
> is positive.
> 
> Thanks,
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/3] Balloon inhibit enhancements
  2018-07-18  6:48   ` [Qemu-devel] " Peter Xu
@ 2018-07-18 16:31     ` Alex Williamson
  -1 siblings, 0 replies; 58+ messages in thread
From: Alex Williamson @ 2018-07-18 16:31 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, kvm

On Wed, 18 Jul 2018 14:48:03 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
> > Directly assigned vfio devices have never been compatible with
> > ballooning.  Zapping MADV_DONTNEED pages happens completely
> > independent of vfio page pinning and IOMMU mapping, leaving us with
> > inconsistent GPA to HPA mapping between vCPUs and assigned devices
> > when the balloon deflates.  Mediated devices can theoretically do
> > better, if we make the assumption that the mdev vendor driver is fully
> > synchronized to the actual working set of the guest driver.  In that
> > case the guest balloon driver should never be able to allocate an mdev
> > pinned page for balloon inflation.  Unfortunately, QEMU can't know the
> > workings of the vendor driver pinning, and doesn't actually know the
> > difference between mdev devices and directly assigned devices.  Until
> > we can sort out how the vfio IOMMU backend can tell us if ballooning
> > is safe, the best approach is to disabling ballooning any time a vfio
> > devices is attached.
> > 
> > To do that, simply make the balloon inhibitor a counter rather than a
> > boolean, fixup a case where KVM can then simply use the inhibit
> > interface, and inhibit ballooning any time a vfio device is attached.
> > I'm expecting we'll expose some sort of flag similar to
> > KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
> > this.  An addition we could consider here would be yet another device
> > option for vfio, such as x-disable-balloon-inhibit, in case there are
> > mdev devices that behave in a manner compatible with ballooning.
> > 
> > Please let me know if this looks like a good idea.  Thanks,  
> 
> IMHO patches 1-2 are good cleanup as standalone patches...
> 
> I totally have no idea on whether people would like to use vfio-pci
> and the balloon device at the same time.  After all vfio-pci are
> majorly for performance players, then I would vaguely guess that they
> don't really care thin provisioning of memory at all, hence the usage
> scenario might not exist much.  Is that the major reason that we'd
> just like to disable it (which makes sense to me)?

Well, the major reason for disabling it is that it currently doesn't
work: when the balloon is deflated, the device and vCPU end up talking
to different host pages for the same GPA for previously ballooned
pages.  Regardless of the amenability of device assignment to various
usage scenarios, that's a bad thing.  I guess most device assignment
users have either realized this doesn't work and avoid it, or perhaps
they have VMs tuned more for performance than density and (again) don't
use ballooning.
 
> I'm wondering what if want to do that somehow some day... Whether
> it'll work if we just let vfio-pci devices to register some guest
> memory invalidation hook (just like the iommu notifiers, but for guest
> memory address space instead), then we map/unmap the IOMMU pages there
> for vfio-pci device to make sure the inflated balloon pages are not
> mapped and also make sure new pages are remapped with correct HPA
> after deflated.  This is a pure question out of my curiosity, and for
> sure it makes little sense if the answer of the first question above
> is positive.

This is why I mention the KVM MMU synchronization flag above.  KVM
essentially had this same problem and fixed it with MMU notifiers
in the kernel.  They expose that KVM has the capability of handling
such a scenario via a feature flag.  We can do the same with vfio.  In
scenarios where we're able to fix this, we could expose a flag on the
container indicating support for the same sort of thing.
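
As a rough sketch of how QEMU might probe such a flag:
VFIO_CHECK_EXTENSION is the existing ioctl, but the VFIO_BALLOON_SAFE
extension number below is purely a hypothetical placeholder for
whatever the vfio IOMMU backend would eventually expose:

    #include <stdbool.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    #define VFIO_BALLOON_SAFE 100  /* hypothetical extension number */

    static bool vfio_container_balloon_safe(int container_fd)
    {
        /* VFIO_CHECK_EXTENSION returns > 0 when the extension exists */
        return ioctl(container_fd, VFIO_CHECK_EXTENSION,
                     VFIO_BALLOON_SAFE) > 0;
    }

vfio_connect_container() could then skip qemu_balloon_inhibit(true)
for containers that advertise such a capability.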

There are a few complications to this support though.  First ballooning
works at page size granularity, but IOMMU mapping can make use of
arbitrary superpage sizes and the IOMMU API only guarantees unmap
granularity equal to the original mapping.  Therefore we cannot unmap
individual pages unless we require that all mappings through the IOMMU
API are done with page granularity, precluding the use of superpages by
the IOMMU and thereby inflicting higher IOTLB overhead.  Unlike a CPU,
we can't invalidate the mappings and fault them back in or halt the
processor to make the page table updates appear atomic.  The device is
considered always running and interfering with that would likely lead
to functional issues.
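
Concretely, honoring page-granularity unmaps would mean mapping
through the type1 IOMMU one page at a time rather than in large
chunks, something like the sketch below (illustration only; the
MAP_PAGE_SIZE constant and helper are hypothetical, and the cost is
one ioctl and one IOMMU PTE per 4K page):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    #define MAP_PAGE_SIZE 4096ULL

    static int map_region_page_by_page(int container_fd, uint64_t iova,
                                       uint64_t vaddr, uint64_t size)
    {
        uint64_t off;

        for (off = 0; off < size; off += MAP_PAGE_SIZE) {
            struct vfio_iommu_type1_dma_map map = {
                .argsz = sizeof(map),
                .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                .vaddr = vaddr + off,
                .iova  = iova + off,
                .size  = MAP_PAGE_SIZE,
            };

            /* Each page becomes its own mapping, so it can later be
             * unmapped individually if the balloon claims it. */
            if (ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map)) {
                return -1;  /* caller must unwind pages already mapped */
            }
        }
        return 0;
    }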

Second MMU notifiers seem to provide invalidation, pte change notices,
and page aging interfaces, so if a page is consumed by the balloon
inflating, we can invalidate it (modulo the issues in the previous
paragraph), but how do we re-populate the mapping through the IOMMU
when the page is released as the balloon is deflated?  KVM seems to do
this by handling the page fault, but we don't really have that option
for devices.  If we try to solve this only for mdev devices, we can
request invalidation down to the vendor driver with page granularity and
we could assume a vendor driver that's well synchronized with the
working set of the device would re-request a page if it was previously
invalidated and becomes part of the working set.  But if we have that
assumption, then we could also assume that such a vendor driver would
never have a ballooning victim page in its working set and therefore we
don't need to do anything.  Unfortunately without an audit, we can't
really know the behavior of the vendor driver.  vfio-ccw might be an
exception here since this entire class of devices doesn't really
perform DMA and page pinning is done on a per-transaction basis, aiui.

The vIOMMU is yet another consideration as it can effectively define
the working set for a device via the device AddressSpace.  If a
ballooned request does not fall within the AddressSpace of any assigned
device, it would be safe to balloon the page.  So long as we're not
running in IOMMU passthrough mode, these should be distinctly separate
sets, active DMA pages should not be ballooning targets.  However, I
believe the current state of vIOMMU with assigned devices is that it's
functional, but not in any way performant for this scenario.  We see
massive performance degradation when trying to use vIOMMU for anything
other than mostly static mappings, such as when using passthrough mode
or using userspace drivers or nested guests with relatively static
mappings.  So I don't know that it's a worthwhile return on investment
if we were to test whether a balloon victim page falls within a
device's AddressSpace as a further level of granularity.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 1/3] balloon: Allow nested inhibits
  2018-07-18  6:40     ` [Qemu-devel] " Peter Xu
@ 2018-07-18 16:37       ` Alex Williamson
  -1 siblings, 0 replies; 58+ messages in thread
From: Alex Williamson @ 2018-07-18 16:37 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, kvm

On Wed, 18 Jul 2018 14:40:15 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Tue, Jul 17, 2018 at 04:47:37PM -0600, Alex Williamson wrote:
> > A simple true/false internal state does not allow multiple users.  Fix
> > this within the existing interface by converting to a counter, so long
> > as the counter is elevated, ballooning is inhibited.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  balloon.c |    7 ++++---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> > 
> > diff --git a/balloon.c b/balloon.c
> > index 6bf0a9681377..2a6a7e1a22a0 100644
> > --- a/balloon.c
> > +++ b/balloon.c
> > @@ -37,16 +37,17 @@
> >  static QEMUBalloonEvent *balloon_event_fn;
> >  static QEMUBalloonStatus *balloon_stat_fn;
> >  static void *balloon_opaque;
> > -static bool balloon_inhibited;
> > +static int balloon_inhibited;
> >  
> >  bool qemu_balloon_is_inhibited(void)
> >  {
> > -    return balloon_inhibited;
> > +    return balloon_inhibited > 0;
> >  }
> >  
> >  void qemu_balloon_inhibit(bool state)
> >  {
> > -    balloon_inhibited = state;
> > +    balloon_inhibited += (state ? 1 : -1);
> > +    assert(balloon_inhibited >= 0);  
> 
> Better do it atomically?

I'd assumed we're protected by the BQL anywhere this is called.  Is
that not the case?  Generally when I try to add any sort of locking to
QEMU it's shot down because the code paths are already serialized.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 1/3] balloon: Allow nested inhibits
  2018-07-18 16:37       ` [Qemu-devel] " Alex Williamson
@ 2018-07-19  4:45         ` Peter Xu
  -1 siblings, 0 replies; 58+ messages in thread
From: Peter Xu @ 2018-07-19  4:45 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, kvm

On Wed, Jul 18, 2018 at 10:37:36AM -0600, Alex Williamson wrote:
> On Wed, 18 Jul 2018 14:40:15 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Tue, Jul 17, 2018 at 04:47:37PM -0600, Alex Williamson wrote:
> > > A simple true/false internal state does not allow multiple users.  Fix
> > > this within the existing interface by converting to a counter, so long
> > > as the counter is elevated, ballooning is inhibited.
> > > 
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > ---
> > >  balloon.c |    7 ++++---
> > >  1 file changed, 4 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/balloon.c b/balloon.c
> > > index 6bf0a9681377..2a6a7e1a22a0 100644
> > > --- a/balloon.c
> > > +++ b/balloon.c
> > > @@ -37,16 +37,17 @@
> > >  static QEMUBalloonEvent *balloon_event_fn;
> > >  static QEMUBalloonStatus *balloon_stat_fn;
> > >  static void *balloon_opaque;
> > > -static bool balloon_inhibited;
> > > +static int balloon_inhibited;
> > >  
> > >  bool qemu_balloon_is_inhibited(void)
> > >  {
> > > -    return balloon_inhibited;
> > > +    return balloon_inhibited > 0;
> > >  }
> > >  
> > >  void qemu_balloon_inhibit(bool state)
> > >  {
> > > -    balloon_inhibited = state;
> > > +    balloon_inhibited += (state ? 1 : -1);
> > > +    assert(balloon_inhibited >= 0);  
> > 
> > Better do it atomically?
> 
> I'd assumed we're protected by the BQL anywhere this is called.  Is
> that not the case?  Generally when I try to add any sort of locking to
> QEMU it's shot down because the code paths are already serialized.

Ah I see. :-)

But I guess the current code might call these without the BQL.  For
example, qemu_balloon_inhibit() has:

postcopy_ram_listen_thread
  qemu_loadvm_state_main
    loadvm_process_command
      loadvm_postcopy_handle_listen
        postcopy_ram_enable_notify
          qemu_balloon_inhibit

The whole stack runs inside a standalone thread that loads the VM
state for incoming postcopy migration, rather than in the main thread.
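
For illustration, a sketch of the same counter using QEMU's atomic
helpers from "qemu/atomic.h", so it would not have to rely on the BQL
(a sketch only, not the actual patch):

    #include "qemu/osdep.h"
    #include "qemu/atomic.h"
    #include "sysemu/balloon.h"

    static int balloon_inhibit_count;

    bool qemu_balloon_is_inhibited(void)
    {
        return atomic_read(&balloon_inhibit_count) > 0;
    }

    void qemu_balloon_inhibit(bool state)
    {
        if (state) {
            atomic_inc(&balloon_inhibit_count);
        } else {
            atomic_dec(&balloon_inhibit_count);
        }
        assert(atomic_read(&balloon_inhibit_count) >= 0);
    }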

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/3] Balloon inhibit enhancements
  2018-07-18  9:36     ` [Qemu-devel] " Cornelia Huck
@ 2018-07-19  4:49       ` Peter Xu
  -1 siblings, 0 replies; 58+ messages in thread
From: Peter Xu @ 2018-07-19  4:49 UTC (permalink / raw)
  To: Cornelia Huck; +Cc: Alex Williamson, qemu-devel, kvm

On Wed, Jul 18, 2018 at 11:36:40AM +0200, Cornelia Huck wrote:
> On Wed, 18 Jul 2018 14:48:03 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
> > > Directly assigned vfio devices have never been compatible with
> > > ballooning.  Zapping MADV_DONTNEED pages happens completely
> > > independent of vfio page pinning and IOMMU mapping, leaving us with
> > > inconsistent GPA to HPA mapping between vCPUs and assigned devices
> > > when the balloon deflates.  Mediated devices can theoretically do
> > > better, if we make the assumption that the mdev vendor driver is fully
> > > synchronized to the actual working set of the guest driver.  In that
> > > case the guest balloon driver should never be able to allocate an mdev
> > > pinned page for balloon inflation.  Unfortunately, QEMU can't know the
> > > workings of the vendor driver pinning, and doesn't actually know the
> > > difference between mdev devices and directly assigned devices.  Until
> > > we can sort out how the vfio IOMMU backend can tell us if ballooning
> > > is safe, the best approach is to disabling ballooning any time a vfio
> > > devices is attached.
> > > 
> > > To do that, simply make the balloon inhibitor a counter rather than a
> > > boolean, fixup a case where KVM can then simply use the inhibit
> > > interface, and inhibit ballooning any time a vfio device is attached.
> > > I'm expecting we'll expose some sort of flag similar to
> > > KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
> > > this.  An addition we could consider here would be yet another device
> > > option for vfio, such as x-disable-balloon-inhibit, in case there are
> > > mdev devices that behave in a manner compatible with ballooning.
> > > 
> > > Please let me know if this looks like a good idea.  Thanks,  
> > 
> > IMHO patches 1-2 are good cleanup as standalone patches...
> > 
> > I totally have no idea on whether people would like to use vfio-pci
> > and the balloon device at the same time.  After all vfio-pci are
> > majorly for performance players, then I would vaguely guess that they
> > don't really care thin provisioning of memory at all, hence the usage
> > scenario might not exist much.  Is that the major reason that we'd
> > just like to disable it (which makes sense to me)?
> 
> Don't people use vfio-pci as well if they want some special
> capabilities from the passed-through device? (At least that's the main
> use case for vfio-ccw, not any performance considerations.)

Good to know.

Off topic: could I ask what these capabilities are, and why they can't
be emulated (or are hard to emulate) if we don't care about
performance?

(Any link or keyword would be welcome too, if there is one.)

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/3] Balloon inhibit enhancements
  2018-07-18 16:31     ` [Qemu-devel] " Alex Williamson
@ 2018-07-19  5:40       ` Peter Xu
  -1 siblings, 0 replies; 58+ messages in thread
From: Peter Xu @ 2018-07-19  5:40 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, kvm

On Wed, Jul 18, 2018 at 10:31:33AM -0600, Alex Williamson wrote:
> On Wed, 18 Jul 2018 14:48:03 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
> > > Directly assigned vfio devices have never been compatible with
> > > ballooning.  Zapping MADV_DONTNEED pages happens completely
> > > independent of vfio page pinning and IOMMU mapping, leaving us with
> > > inconsistent GPA to HPA mapping between vCPUs and assigned devices
> > > when the balloon deflates.  Mediated devices can theoretically do
> > > better, if we make the assumption that the mdev vendor driver is fully
> > > synchronized to the actual working set of the guest driver.  In that
> > > case the guest balloon driver should never be able to allocate an mdev
> > > pinned page for balloon inflation.  Unfortunately, QEMU can't know the
> > > workings of the vendor driver pinning, and doesn't actually know the
> > > difference between mdev devices and directly assigned devices.  Until
> > > we can sort out how the vfio IOMMU backend can tell us if ballooning
> > > is safe, the best approach is to disabling ballooning any time a vfio
> > > devices is attached.
> > > 
> > > To do that, simply make the balloon inhibitor a counter rather than a
> > > boolean, fixup a case where KVM can then simply use the inhibit
> > > interface, and inhibit ballooning any time a vfio device is attached.
> > > I'm expecting we'll expose some sort of flag similar to
> > > KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
> > > this.  An addition we could consider here would be yet another device
> > > option for vfio, such as x-disable-balloon-inhibit, in case there are
> > > mdev devices that behave in a manner compatible with ballooning.
> > > 
> > > Please let me know if this looks like a good idea.  Thanks,  
> > 
> > IMHO patches 1-2 are good cleanup as standalone patches...
> > 
> > I totally have no idea on whether people would like to use vfio-pci
> > and the balloon device at the same time.  After all vfio-pci are
> > majorly for performance players, then I would vaguely guess that they
> > don't really care thin provisioning of memory at all, hence the usage
> > scenario might not exist much.  Is that the major reason that we'd
> > just like to disable it (which makes sense to me)?
> 
> Well, the major reason for disabling it is that it currently doesn't
> work and when the balloon is deflated, the device and vCPU are talking
> to different host pages for the same GPA for previously ballooned
> pages.  Regardless of the amenability of device assignment to various
> usage scenarios, that's a bad thing.  I guess most device assignment
> users have either realized this doesn't work and avoid it, or perhaps
> they have VMs tuned more for performance than density and (again) don't
> use ballooning.

Makes sense to me.

>  
> > I'm wondering what if want to do that somehow some day... Whether
> > it'll work if we just let vfio-pci devices to register some guest
> > memory invalidation hook (just like the iommu notifiers, but for guest
> > memory address space instead), then we map/unmap the IOMMU pages there
> > for vfio-pci device to make sure the inflated balloon pages are not
> > mapped and also make sure new pages are remapped with correct HPA
> > after deflated.  This is a pure question out of my curiosity, and for
> > sure it makes little sense if the answer of the first question above
> > is positive.
> 
> This is why I mention the KVM MMU synchronization flag above.  KVM
> essentially had this same problem and fixed it with with MMU notifiers
> in the kernel.  They expose that KVM has the capability of handling
> such a scenario via a feature flag.  We can do the same with vfio.  In
> scenarios where we're able to fix this, we could expose a flag on the
> container indicating support for the same sort of thing.

Sorry, I didn't really catch that point when replying.  So that's why
we have the mmu notifiers... Hmm, glad to know that.

But I would guess that if we want such a notifier for vfio it should
be in QEMU rather than in the kernel, since the kernel vfio driver
doesn't have enough information about the GPA address space, hence it
might not be able to rebuild the mapping when a new page is mapped,
while QEMU should be able to get both the GPA and HVA easily when the
balloon device wants to deflate a page. [1]

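(To illustrate [1] a bit: in the balloon path the GPA comes straight
from the guest-provided pfn and the HVA can be looked up through the
memory API.  A rough sketch below, ignoring error handling and the
reference that memory_region_find() takes; this is a hypothetical
helper, not the actual virtio-balloon code.)

#include "qemu/osdep.h"
#include "exec/memory.h"
#include "exec/address-spaces.h"

/* given a balloon pfn, recover the (gpa, hva) pair for that page */
static void balloon_pfn_to_gpa_hva(uint64_t pfn, hwaddr *gpa, void **hva)
{
    MemoryRegionSection section;

    *gpa = (hwaddr)pfn << 12;   /* VIRTIO_BALLOON_PFN_SHIFT */

    section = memory_region_find(get_system_memory(), *gpa, 1);
    *hva = memory_region_get_ram_ptr(section.mr) +
           section.offset_within_region;
}
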
> 
> There are a few complications to this support though.  First ballooning
> works at page size granularity, but IOMMU mapping can make use of
> arbitrary superpage sizes and the IOMMU API only guarantees unmap
> granularity equal to the original mapping.  Therefore we cannot unmap
> individual pages unless we require that all mappings through the IOMMU
> API are done with page granularity, precluding the use of superpages by
> the IOMMU and thereby inflicting higher IOTLB overhead.  Unlike a CPU,
> we can't invalidate the mappings and fault them back in or halt the
> processor to make the page table updates appear atomic.  The device is
> considered always running and interfering with that would likely lead
> to functional issues.

Indeed.  Actually a VT-d emulation bug was fixed just months ago where
the QEMU shadow page code for the device quickly unmapped and rebuilt
the pages, but within that window DMA happened, hence DMA errors on
the missing page entries.  I wish I had learnt that earlier from you!
Then the bug would have been even more obvious to me.

And I would guess that if we want to do that in the future, the
easiest first step would be to just tell vfio to avoid using huge
pages when we see balloon devices.  That might be an acceptable cost,
at least to me, for using both vfio-pci and the balloon device.

> 
> Second MMU notifiers seem to provide invalidation, pte change notices,
> and page aging interfaces, so if a page is consumed by the balloon
> inflating, we can invalidate it (modulo the issues in the previous
> paragraph), but how do we re-populate the mapping through the IOMMU
> when the page is released as the balloon is deflated?  KVM seems to do
> this by handling the page fault, but we don't really have that option
> for devices.  If we try to solve this only for mdev devices, we can
> request invalidation down the vendor driver with page granularity and
> we could assume a vendor driver that's well synchronized with the
> working set of the device would re-request a page if it was previously
> invalidated and becomes part of the working set.  But if we have that
> assumption, then we could also assume that such a vendor driver would
> never have a ballooning victim page in its working set and therefore we
> don't need to do anything.  Unfortunately without an audit, we can't
> really know the behavior of the vendor driver.  vfio-ccw might be an
> exception here since this entire class of devices doesn't really
> perform DMA and page pinning is done on a per transaction basis, aiui.

Could we just provide the MMU notifier in QEMU instead of the kernel,
as I mentioned in [1] (no matter what we call it...)?  Basically, when
we deflate the balloon we trigger that notifier, then we pass a new
VFIO_IOMMU_MAP_DMA down to vfio with the correct GPA/HVA.  Would that
work?

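(Purely as an illustration of that deflate-time re-map, assuming we
already hold the container fd and know the GPA/HVA of the deflated
page; in QEMU proper this would have to go through the vfio IOMMU
backend rather than a raw ioctl, so treat the helper below as a
hypothetical sketch, not real code.)

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Hypothetical helper: re-insert the GPA->HPA mapping for one page
 * after the balloon deflates.  container_fd/gpa/hva are placeholders. */
static int remap_deflated_page(int container_fd, uint64_t gpa, void *hva,
                               uint64_t page_size)
{
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uintptr_t)hva,   /* HVA that backs the page again */
        .iova  = gpa,              /* GPA as seen by the device */
        .size  = page_size,        /* e.g. 4096 */
    };

    /* re-pins the HVA and programs the IOMMU for this one page */
    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}
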
> 
> The vIOMMU is yet another consideration as it can effectively define
> the working set for a device via the device AddressSpace.  If a
> ballooned request does not fall within the AddressSpace of any assigned
> device, it would be safe to balloon the page.  So long as we're not
> running in IOMMU passthrough mode, these should be distinctly separate
> sets, active DMA pages should not be ballooning targets.  However, I
> believe the current state of vIOMMU with assigned devices is that it's
> functional, but not in any way performant for this scenario.  We see
> massive performance degradation when trying to use vIOMMU for anything
> other than mostly static mappings, such as when using passthrough mode
> or using userspace drivers or nested guests with relatively static
> mappings.  So I don't know that it's a worthwhile return on investment
> if we were to test whether a balloon victim page falls within a
> device's AddressSpace as a further level of granularity.  Thanks,

Yeah, vIOMMU will be another story.  Maybe that could be the last
thing to consider.  AFAIU the only users of that combination (both
vIOMMU and vfio-pci) are NFV setups, and I don't think they need the
balloon at all, so maybe we can just keep it disabled there.

Thanks for the details (as always)!  FWIW I'd agree this is the only
correct thing to do, at least for me, as a first step, no matter what
our possible next move is.

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/3] Balloon inhibit enhancements
  2018-07-19  4:49       ` [Qemu-devel] " Peter Xu
@ 2018-07-19  8:42         ` Cornelia Huck
  -1 siblings, 0 replies; 58+ messages in thread
From: Cornelia Huck @ 2018-07-19  8:42 UTC (permalink / raw)
  To: Peter Xu; +Cc: Alex Williamson, qemu-devel, kvm

On Thu, 19 Jul 2018 12:49:23 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Wed, Jul 18, 2018 at 11:36:40AM +0200, Cornelia Huck wrote:
> > On Wed, 18 Jul 2018 14:48:03 +0800
> > Peter Xu <peterx@redhat.com> wrote:

> > > I totally have no idea on whether people would like to use vfio-pci
> > > and the balloon device at the same time.  After all vfio-pci are
> > > majorly for performance players, then I would vaguely guess that they
> > > don't really care thin provisioning of memory at all, hence the usage
> > > scenario might not exist much.  Is that the major reason that we'd
> > > just like to disable it (which makes sense to me)?  
> > 
> > Don't people use vfio-pci as well if they want some special
> > capabilities from the passed-through device? (At least that's the main
> > use case for vfio-ccw, not any performance considerations.)  
> 
> Good to know these.
> 
> Off topic: could I further ask what these capabilities are, and why
> they can't be emulated (or are hard to emulate) if we don't care
> about performance?

For vfio-ccw, the (current) main use case is ECKD DASD. While this is
basically a block device, it has some useful features (path failover,
remote copy, exclusive locking) that are not replicated by any emulated
device (and I'm not sure how to do this, either). It also has things
like a weird disk layout, which we _could_ emulate, but I'm not sure
why we would do that (it is mainly interesting if you want to share a
disk with a traditional mainframe OS).

Other usecases I'm thinking about are related to the
on-list-but-not-yet-merged vfio-ap crypto adapter support: Using a
crypto adapter that has been certified without needing to make keys
available to the OS, getting real randomness out of the card and so on.

Generally, I'd think anything that needs complicated transformations,
interaction with other ecosystems or exploiting reliability features
might be a case for using assignment: not just because of performance,
but also because you don't need to reinvent the wheel.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/3] Balloon inhibit enhancements
  2018-07-19  8:42         ` [Qemu-devel] " Cornelia Huck
@ 2018-07-19  9:30           ` Peter Xu
  -1 siblings, 0 replies; 58+ messages in thread
From: Peter Xu @ 2018-07-19  9:30 UTC (permalink / raw)
  To: Cornelia Huck; +Cc: Alex Williamson, qemu-devel, kvm

On Thu, Jul 19, 2018 at 10:42:22AM +0200, Cornelia Huck wrote:
> On Thu, 19 Jul 2018 12:49:23 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Wed, Jul 18, 2018 at 11:36:40AM +0200, Cornelia Huck wrote:
> > > On Wed, 18 Jul 2018 14:48:03 +0800
> > > Peter Xu <peterx@redhat.com> wrote:
> 
> > > > I totally have no idea on whether people would like to use vfio-pci
> > > > and the balloon device at the same time.  After all vfio-pci are
> > > > majorly for performance players, then I would vaguely guess that they
> > > > don't really care thin provisioning of memory at all, hence the usage
> > > > scenario might not exist much.  Is that the major reason that we'd
> > > > just like to disable it (which makes sense to me)?  
> > > 
> > > Don't people use vfio-pci as well if they want some special
> > > capabilities from the passed-through device? (At least that's the main
> > > use case for vfio-ccw, not any performance considerations.)  
> > 
> > Good to know these.
> > 
> > Off topic: could I further ask what these capabilities are, and why
> > they can't be emulated (or are hard to emulate) if we don't care
> > about performance?
> 
> For vfio-ccw, the (current) main use case is ECKD DASD. While this is
> basically a block device, it has some useful features (path failover,
> remote copy, exclusive locking) that are not replicated by any emulated
> device (and I'm not sure how to do this, either). It also has things
> like a weird disk layout, which we _could_ emulate, but I'm not sure
> why we would do that (it is mainly interesting if you want to share a
> disk with a traditional mainframe OS).
> 
> Other usecases I'm thinking about are related to the
> on-list-but-not-yet-merged vfio-ap crypto adapter support: Using a
> crypto adapter that has been certified without needing to make keys
> available to the OS, getting real randomness out of the card and so on.
> 
> Generally, I'd think anything that needs complicated transformations,
> interaction with other ecosystems or exploiting reliability features
> might be a case for using assignment: not just because of performance,
> but also because you don't need to reinvent the wheel.

I'd confess that the mainframe has brought many interesting (and
historical) facts to me, and I even feel like I understand
virtualization a bit better with them. :)

And I believe the crypto use case is a good one too, since it can
apply to all architectures rather than being something special to the
mainframe.  Security, randomness, and possibly other things are
indeed good reasons to use assigned devices.

Thanks for sharing these!

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/3] Balloon inhibit enhancements
  2018-07-19  5:40       ` [Qemu-devel] " Peter Xu
@ 2018-07-19 15:01         ` Alex Williamson
  -1 siblings, 0 replies; 58+ messages in thread
From: Alex Williamson @ 2018-07-19 15:01 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, kvm

On Thu, 19 Jul 2018 13:40:51 +0800
Peter Xu <peterx@redhat.com> wrote:
> On Wed, Jul 18, 2018 at 10:31:33AM -0600, Alex Williamson wrote:
> > On Wed, 18 Jul 2018 14:48:03 +0800
> > Peter Xu <peterx@redhat.com> wrote:
> > > I'm wondering what if want to do that somehow some day... Whether
> > > it'll work if we just let vfio-pci devices to register some guest
> > > memory invalidation hook (just like the iommu notifiers, but for guest
> > > memory address space instead), then we map/unmap the IOMMU pages there
> > > for vfio-pci device to make sure the inflated balloon pages are not
> > > mapped and also make sure new pages are remapped with correct HPA
> > > after deflated.  This is a pure question out of my curiosity, and for
> > > sure it makes little sense if the answer of the first question above
> > > is positive.  
> > 
> > This is why I mention the KVM MMU synchronization flag above.  KVM
> > essentially had this same problem and fixed it with with MMU notifiers
> > in the kernel.  They expose that KVM has the capability of handling
> > such a scenario via a feature flag.  We can do the same with vfio.  In
> > scenarios where we're able to fix this, we could expose a flag on the
> > container indicating support for the same sort of thing.  
> 
> Sorry, I didn't really catch that point when replying.  So that's why
> we have the mmu notifiers... Hmm, glad to know that.
> 
> But I would guess that if we want such a notifier for vfio it should
> be in QEMU rather than in the kernel, since the kernel vfio driver
> doesn't have enough information about the GPA address space, hence it
> might not be able to rebuild the mapping when a new page is mapped,
> while QEMU should be able to get both the GPA and HVA easily when the
> balloon device wants to deflate a page. [1]

This is where the vfio IOMMU backend comes into play.  vfio devices
make use of MemoryListeners to register the HVA to GPA translations
within the AddressSpace of a device.  When we're using an IOMMU, we pin
those HVAs in order to make the HPA static and insert the GPA to HPA
mappings into the IOMMU.  When we don't have an IOMMU, the IOMMU
backend is storing those HVA to GPA translations so that the mediated
device vendor driver can make pinning requests.  The vendor driver
requests pinning of a given set of GPAs and the IOMMU backend pins the
matching HVA to provide an HPA.

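(For illustration, a rough sketch of that MemoryListener side, loosely
modelled on hw/vfio/common.c but with hypothetical names, not a
drop-in replacement for it:)

#include "qemu/osdep.h"
#include "qemu/int128.h"
#include "exec/memory.h"

static void example_region_add(MemoryListener *listener,
                               MemoryRegionSection *section)
{
    hwaddr gpa;
    uint64_t size;
    void *hva;

    if (!memory_region_is_ram(section->mr)) {
        return;   /* only RAM sections have an HVA we could pin */
    }

    gpa  = section->offset_within_address_space;
    size = int128_get64(section->size);
    hva  = memory_region_get_ram_ptr(section->mr) +
           section->offset_within_region;

    /*
     * Here the backend would pin [hva, hva + size) and, with an IOMMU,
     * insert the gpa->hpa mappings; without one it just records the
     * hva/gpa pair so the vendor driver can request pinning later.
     */
    (void)gpa; (void)size; (void)hva;
}

static MemoryListener example_listener = {
    .region_add = example_region_add,
    /* .region_del would unpin and unmap symmetrically */
};

static void example_register(AddressSpace *as)
{
    /* 'as' being the device's AddressSpace */
    memory_listener_register(&example_listener, as);
}
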
When a page is ballooned, it's zapped from the process address space,
so we need to invalidate the HVA to HPA mapping.  When the page is
restored, we still have the correct HVA, but we need a notifier to tell
us to put it back into play, re-pinning and inserting the mapping into
the IOMMU if we have one.

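(And the invalidation half, again only as a userspace sketch with
container_fd/gpa/page_size as placeholders; note that a per-page unmap
like this only works if the original mapping was also done at page
granularity, which is exactly the superpage problem below.)

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int unmap_ballooned_page(int container_fd, uint64_t gpa,
                                uint64_t page_size)
{
    struct vfio_iommu_type1_dma_unmap unmap = {
        .argsz = sizeof(unmap),
        .iova  = gpa,        /* GPA of the page the balloon just took */
        .size  = page_size,
    };

    /* drops the pin and the IOMMU mapping for this range */
    return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
}
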
In order for QEMU to do this, this ballooned page would need to be
reflected in the memory API.  This would be quite simple, inserting a
MemoryRegion overlapping the RAM page which is ballooned out and
removing it when the balloon is deflated.  But we run into the same
problems with mapping granularity.  In order to accommodate this new
overlap, the memory API would first remove the previous mapping, split
or truncate the region, then reinsert the result.  Just like if we tried
to do this in the IOMMU, it's not atomic with respect to device DMA.  In
order to achieve this model, the memory API would need to operate
entirely on page size regions.  Now imagine that every MiB of guest RAM
requires 256 ioctls to map (assuming 4KiB pages), 256K per GiB.  Clearly
we'd want to use a larger granularity for efficiency.  If we allow the
user to specify the granularity, perhaps abstracting that granularity
as the size of a DIMM, suddenly we've moved from memory ballooning to
memory hotplug, where the latter does make use of the memory API and
has none of these issues AIUI.

> > There are a few complications to this support though.  First ballooning
> > works at page size granularity, but IOMMU mapping can make use of
> > arbitrary superpage sizes and the IOMMU API only guarantees unmap
> > granularity equal to the original mapping.  Therefore we cannot unmap
> > individual pages unless we require that all mappings through the IOMMU
> > API are done with page granularity, precluding the use of superpages by
> > the IOMMU and thereby inflicting higher IOTLB overhead.  Unlike a CPU,
> > we can't invalidate the mappings and fault them back in or halt the
> > processor to make the page table updates appear atomic.  The device is
> > considered always running and interfering with that would likely lead
> > to functional issues.  
> 
> Indeed.  Actually a VT-d emulation bug was fixed just months ago where
> the QEMU shadow page code for the device quickly unmapped and rebuilt
> the pages, but within that window DMA happened, hence DMA errors on
> the missing page entries.  I wish I had learnt that earlier from you!
> Then the bug would have been even more obvious to me.
> 
> And I would guess that if we want to do that in the future, the
> easiest first step would be to just tell vfio to avoid using huge
> pages when we see balloon devices.  That might be an acceptable cost,
> at least to me, for using both vfio-pci and the balloon device.

There are a couple of problems there though.  First, if we decide to use
smaller pages for any case where we have a balloon device (a device
that libvirt adds by default and requires manually editing the XML to
remove), we introduce a performance regression for pretty much every
existing VM as we restrict the IOMMU from making use of superpages and
therefore depend far more on the IOTLB.  Second, QEMU doesn't have
control of the mapping page size.  The vfio MAP_DMA ioctl simply takes
a virtual address, IOVA (GPA) and size, the IOMMU gets to map this
however it finds most efficient and the API requires unmapping with a
minimum granularity matching the original mapping.  So again, the only
way QEMU can get page size unmapping granularity is to perform only
page sized mappings.  We could add a mapping flag to specify page size
mapping and therefore page granularity unmapping, but that's a new
contract (i.e. an API) between the user and vfio that comes with a
performance penalty.  There is currently a vfio_iommu_type1 module
option which disables IOMMU superpage support globally, but we don't
have per instance control with the current APIs.

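(As an aside, the module option referred to above is, if I recall
correctly, vfio_iommu_type1's disable_hugepages parameter.  And while
userspace can't pick a mapping page size today, it can at least query
which IOMMU page sizes the container supports -- a rough sketch, with
container_fd as a placeholder:)

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static void print_iommu_pgsizes(int container_fd)
{
    struct vfio_iommu_type1_info info = { .argsz = sizeof(info) };

    if (ioctl(container_fd, VFIO_IOMMU_GET_INFO, &info) == 0 &&
        (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
        /* bitmap of supported IOMMU page sizes, e.g. 4KiB/2MiB/1GiB */
        printf("iova_pgsizes: 0x%llx\n",
               (unsigned long long)info.iova_pgsizes);
    }
}
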
> > Second MMU notifiers seem to provide invalidation, pte change notices,
> > and page aging interfaces, so if a page is consumed by the balloon
> > inflating, we can invalidate it (modulo the issues in the previous
> > paragraph), but how do we re-populate the mapping through the IOMMU
> > when the page is released as the balloon is deflated?  KVM seems to do
> > this by handling the page fault, but we don't really have that option
> > for devices.  If we try to solve this only for mdev devices, we can
> > request invalidation down the vendor driver with page granularity and
> > we could assume a vendor driver that's well synchronized with the
> > working set of the device would re-request a page if it was previously
> > invalidated and becomes part of the working set.  But if we have that
> > assumption, then we could also assume that such a vendor driver would
> > never have a ballooning victim page in its working set and therefore we
> > don't need to do anything.  Unfortunately without an audit, we can't
> > really know the behavior of the vendor driver.  vfio-ccw might be an
> > exception here since this entire class of devices doesn't really
> > perform DMA and page pinning is done on a per transaction basis, aiui.  
> 
> Could we just provide the MMU notifier in QEMU instead of the kernel,
> as I mentioned in [1] (no matter what we call it...)?  Basically, when
> we deflate the balloon we trigger that notifier, then we pass a new
> VFIO_IOMMU_MAP_DMA down to vfio with the correct GPA/HVA.  Would that
> work?

I've discussed the issues above.

> > The vIOMMU is yet another consideration as it can effectively define
> > the working set for a device via the device AddressSpace.  If a
> > ballooned request does not fall within the AddressSpace of any assigned
> > device, it would be safe to balloon the page.  So long as we're not
> > running in IOMMU passthrough mode, these should be distinctly separate
> > sets, active DMA pages should not be ballooning targets.  However, I
> > believe the current state of vIOMMU with assigned devices is that it's
> > functional, but not in any way performant for this scenario.  We see
> > massive performance degradation when trying to use vIOMMU for anything
> > other than mostly static mappings, such as when using passthrough mode
> > or using userspace drivers or nested guests with relatively static
> > mappings.  So I don't know that it's a worthwhile return on investment
> > if we were to test whether a balloon victim page falls within a
> > device's AddressSpace as a further level of granularity.  Thanks,  
> 
> Yeah, vIOMMU will be another story.  Maybe that could be the last
> thing to consider.  AFAIU the only users of that combination (both
> vIOMMU and vfio-pci) are NFV setups, and I don't think they need the
> balloon at all, so maybe we can just keep it disabled there.
> 
> Thanks for the details (as always)!  FWIW I'd agree this is the only
> correct thing to do, at least for me, as a first step, no matter what
> our possible next move is.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/3] Balloon inhibit enhancements
  2018-07-19  4:49       ` [Qemu-devel] " Peter Xu
@ 2018-07-19 15:31         ` Alex Williamson
  -1 siblings, 0 replies; 58+ messages in thread
From: Alex Williamson @ 2018-07-19 15:31 UTC (permalink / raw)
  To: Peter Xu; +Cc: Cornelia Huck, qemu-devel, kvm

On Thu, 19 Jul 2018 12:49:23 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Wed, Jul 18, 2018 at 11:36:40AM +0200, Cornelia Huck wrote:
> > On Wed, 18 Jul 2018 14:48:03 +0800
> > Peter Xu <peterx@redhat.com> wrote:
> >   
> > > On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:  
> > > > Directly assigned vfio devices have never been compatible with
> > > > ballooning.  Zapping MADV_DONTNEED pages happens completely
> > > > independent of vfio page pinning and IOMMU mapping, leaving us with
> > > > inconsistent GPA to HPA mapping between vCPUs and assigned devices
> > > > when the balloon deflates.  Mediated devices can theoretically do
> > > > better, if we make the assumption that the mdev vendor driver is fully
> > > > synchronized to the actual working set of the guest driver.  In that
> > > > case the guest balloon driver should never be able to allocate an mdev
> > > > pinned page for balloon inflation.  Unfortunately, QEMU can't know the
> > > > workings of the vendor driver pinning, and doesn't actually know the
> > > > difference between mdev devices and directly assigned devices.  Until
> > > > we can sort out how the vfio IOMMU backend can tell us if ballooning
> > > > is safe, the best approach is to disabling ballooning any time a vfio
> > > > devices is attached.
> > > > 
> > > > To do that, simply make the balloon inhibitor a counter rather than a
> > > > boolean, fixup a case where KVM can then simply use the inhibit
> > > > interface, and inhibit ballooning any time a vfio device is attached.
> > > > I'm expecting we'll expose some sort of flag similar to
> > > > KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
> > > > this.  An addition we could consider here would be yet another device
> > > > option for vfio, such as x-disable-balloon-inhibit, in case there are
> > > > mdev devices that behave in a manner compatible with ballooning.
> > > > 
> > > > Please let me know if this looks like a good idea.  Thanks,    
> > > 
> > > IMHO patches 1-2 are good cleanup as standalone patches...
> > > 
> > > I totally have no idea on whether people would like to use vfio-pci
> > > and the balloon device at the same time.  After all vfio-pci are
> > > majorly for performance players, then I would vaguely guess that they
> > > don't really care thin provisioning of memory at all, hence the usage
> > > scenario might not exist much.  Is that the major reason that we'd
> > > just like to disable it (which makes sense to me)?  
> > 
> > Don't people use vfio-pci as well if they want some special
> > capabilities from the passed-through device? (At least that's the main
> > use case for vfio-ccw, not any performance considerations.)  
> 
> Good to know these.
> 
> Off topic: could I further ask what these capabilities are, and why
> they can't be emulated (or are hard to emulate) if we don't care
> about performance?

Are you assuming that anything that isn't strictly performance focused
for device assignment is self contained, fully documented, suitable for
emulation, and there's someone willing and able to invest and upstream
(open source) that emulation?  What about things like data acquisition
devices, TV capture cards, serial ports, real-time control systems,
etc.  This is one of the basic tenets of device assignment: it
provides users the ability to migrate physical systems to virtual, even
if the entire reason for the system existing is tied to hardware.  The
world is more than just NICs and HBAs.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/3] Balloon inhibit enhancements
  2018-07-19 15:01         ` [Qemu-devel] " Alex Williamson
@ 2018-07-20  2:56           ` Peter Xu
  -1 siblings, 0 replies; 58+ messages in thread
From: Peter Xu @ 2018-07-20  2:56 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, kvm

On Thu, Jul 19, 2018 at 09:01:46AM -0600, Alex Williamson wrote:
> On Thu, 19 Jul 2018 13:40:51 +0800
> Peter Xu <peterx@redhat.com> wrote:
> > On Wed, Jul 18, 2018 at 10:31:33AM -0600, Alex Williamson wrote:
> > > On Wed, 18 Jul 2018 14:48:03 +0800
> > > Peter Xu <peterx@redhat.com> wrote:
> > > > I'm wondering what if want to do that somehow some day... Whether
> > > > it'll work if we just let vfio-pci devices to register some guest
> > > > memory invalidation hook (just like the iommu notifiers, but for guest
> > > > memory address space instead), then we map/unmap the IOMMU pages there
> > > > for vfio-pci device to make sure the inflated balloon pages are not
> > > > mapped and also make sure new pages are remapped with correct HPA
> > > > after deflated.  This is a pure question out of my curiosity, and for
> > > > sure it makes little sense if the answer of the first question above
> > > > is positive.  
> > > 
> > > This is why I mention the KVM MMU synchronization flag above.  KVM
> > > essentially had this same problem and fixed it with with MMU notifiers
> > > in the kernel.  They expose that KVM has the capability of handling
> > > such a scenario via a feature flag.  We can do the same with vfio.  In
> > > scenarios where we're able to fix this, we could expose a flag on the
> > > container indicating support for the same sort of thing.  
> > 
> > Sorry I didn't really caught that point when reply.  So that's why we
> > have had the mmu notifiers... Hmm, glad to know that.
> > 
> > But I would guess that if we want that notifier for vfio it should be
> > in QEMU rather than the kernel one since kernel vfio driver should not
> > have enough information on the GPA address space, hence it might not
> > be able to rebuild the mapping when a new page is mapped?  While QEMU
> > should be able to get both GPA and HVA easily when the balloon device
> > wants to deflate a page. [1]
> 
> This is where the vfio IOMMU backend comes into play.  vfio devices
> make use of MemoryListeners to register the HVA to GPA translations
> within the AddressSpace of a device.  When we're using an IOMMU, we pin
> those HVAs in order to make the HPA static and insert the GPA to HPA
> mappings into the IOMMU.  When we don't have an IOMMU, the IOMMU
> backend is storing those HVA to GPA translations so that the mediated
> device vendor driver can make pinning requests.  The vendor driver
> requests pinning of a given set of GPAs and the IOMMU backend pins the
> matching HVA to provide an HPA.
> 
> When a page is ballooned, it's zapped from the process address space,
> so we need to invalidate the HVA to HPA mapping.  When the page is
> restored, we still have the correct HVA, but we need a notifier to tell
> us to put it back into play, re-pinning and inserting the mapping into
> the IOMMU if we have one.
> 
> In order for QEMU to do this, this ballooned page would need to be
> reflected in the memory API.  This would be quite simple, inserting a
> MemoryRegion overlapping the RAM page which is ballooned out and
> removing it when the balloon is deflated.  But we run into the same
> problems with mapping granularity.  In order to accommodate this new
> overlap, the memory API would first remove the previous mapping, split
> or truncate the region, then reinsert the result.  Just like if we tried
> to do this in the IOMMU, it's not atomic with respect to device DMA.  In
> order to achieve this model, the memory API would need to operate
> entirely on page size regions.  Now imagine that every MiB of guest RAM
> requires 256 ioctls to map (assuming 4KiB pages), 256K per GiB.  Clearly
> we'd want to use a larger granularity for efficiency.  If we allow the
> user to specify the granularity, perhaps abstracting that granularity
> as the size of a DIMM, suddenly we've moved from memory ballooning to
> memory hotplug, where the latter does make use of the memory API and
> has none of these issues AIUI.

I see.  Indeed pc-dimm seems to be more suitable here.  And I think I
better understand the awkwardness that the page granularity problem
brings - since we would need page granularity even in the QEMU memory
API, we could end up with 4k-sized memory regions filling the entire
RAM address space.  That sounds like a hard mission.

> 
> > > There are a few complications to this support though.  First ballooning
> > > works at page size granularity, but IOMMU mapping can make use of
> > > arbitrary superpage sizes and the IOMMU API only guarantees unmap
> > > granularity equal to the original mapping.  Therefore we cannot unmap
> > > individual pages unless we require that all mappings through the IOMMU
> > > API are done with page granularity, precluding the use of superpages by
> > > the IOMMU and thereby inflicting higher IOTLB overhead.  Unlike a CPU,
> > > we can't invalidate the mappings and fault them back in or halt the
> > > processor to make the page table updates appear atomic.  The device is
> > > considered always running and interfering with that would likely lead
> > > to functional issues.  
> > 
> > Indeed.  Actually VT-d emulation bug was fixed just months ago where
> > the QEMU shadow page code for the device quickly unmapped the pages
> > and rebuilt the pages, but within the window we see DMA happened hence
> > DMA error on missing page entries.  I wish I have had learnt that
> > earlier from you!  Then the bug would be even more obvious to me.
> > 
> > And I would guess that if we want to do that in the future, the
> > easiest way as the first step would be that we just tell vfio to avoid
> > using huge pages when we see balloon devices.  It might be an
> > understandable cost at least to me to use both vfio-pci and the
> > balloon device.
> 
> There are a couple problem there though, first if we decide to use
> smaller pages for any case where we have a balloon device (a device
> that libvirt adds by default and requires manually editing the XML to
> remove), we introduce a performance regression for pretty much every
> existing VM as we restrict the IOMMU from making use of superpages and
> therefore depend far more on the IOTLB.  Second, QEMU doesn't have
> control of the mapping page size.  The vfio MAP_DMA ioctl simply takes
> a virtual address, IOVA (GPA) and size, the IOMMU gets to map this
> however it finds most efficient and the API requires unmapping with a
> minimum granularity matching the original mapping.  So again, the only
> way QEMU can get page size unmapping granularity is to perform only
> page sized mappings.  We could add a mapping flag to specify page size
> mapping and therefore page granularity unmapping, but that's a new
> contract (ie. API) between the user and vfio that comes with a
> performance penalty.  There is currently a vfio_iommu_type1 module
> option which disables IOMMU superpage support globally, but we don't
> have per instance control with the current APIs.

IIRC there were similar requests before to allow userspace to specify
the page sizes used by the vfio IOMMU backend, but they didn't make it
in the end.  Now I understand that this page granularity problem inside
the vfio IOMMU backend is a separate one to settle, distinct from the
one you mentioned for the QEMU memory API.
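
For reference, a minimal sketch of the type1 mapping call in question
(the helper name is made up, error handling omitted) - the MAP_DMA
ioctl carries only a virtual address, an IOVA and a size, so the page
sizes used inside the IOMMU are entirely the kernel's choice, and
unmaps must then cover whole original mappings:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Map one chunk of guest RAM through the type1 backend.  The kernel
     * alone decides which IOMMU page sizes back this mapping. */
    static int map_guest_ram(int container, void *hva,
                             uint64_t gpa, uint64_t size)
    {
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = (uintptr_t)hva,
            .iova  = gpa,
            .size  = size,
        };

        /* There is no field to request 4 KiB-only mappings here. */
        return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
    }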

(Though I also feel uncertain whether libvirt should at least provide
 a simpler interface for disabling the default balloon device...)

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/3] Balloon inhibit enhancements
  2018-07-17 22:47 ` [Qemu-devel] " Alex Williamson
@ 2018-07-30 13:34   ` Michael S. Tsirkin
  -1 siblings, 0 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2018-07-30 13:34 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, kvm

On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
> Directly assigned vfio devices have never been compatible with
> ballooning.  Zapping MADV_DONTNEED pages happens completely
> independent of vfio page pinning and IOMMU mapping, leaving us with
> inconsistent GPA to HPA mapping between vCPUs and assigned devices
> when the balloon deflates.  Mediated devices can theoretically do
> better, if we make the assumption that the mdev vendor driver is fully
> synchronized to the actual working set of the guest driver.  In that
> case the guest balloon driver should never be able to allocate an mdev
> pinned page for balloon inflation.  Unfortunately, QEMU can't know the
> workings of the vendor driver pinning, and doesn't actually know the
> difference between mdev devices and directly assigned devices.  Until
> we can sort out how the vfio IOMMU backend can tell us if ballooning
> is safe, the best approach is to disabling ballooning any time a vfio
> devices is attached.
> 
> To do that, simply make the balloon inhibitor a counter rather than a
> boolean, fixup a case where KVM can then simply use the inhibit
> interface, and inhibit ballooning any time a vfio device is attached.
> I'm expecting we'll expose some sort of flag similar to
> KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
> this.  An addition we could consider here would be yet another device
> option for vfio, such as x-disable-balloon-inhibit, in case there are
> mdev devices that behave in a manner compatible with ballooning.
> 
> Please let me know if this looks like a good idea.  Thanks,
> 
> Alex

It's probably the only reasonable thing to do for this release.
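
Roughly, the counting inhibitor amounts to something like this (a
sketch of the idea, not necessarily the exact patch):

    #include "qemu/osdep.h"
    #include "qemu/atomic.h"
    #include "sysemu/balloon.h"

    static int balloon_inhibit_count;

    bool qemu_balloon_is_inhibited(void)
    {
        return atomic_read(&balloon_inhibit_count) > 0;
    }

    void qemu_balloon_inhibit(bool state)
    {
        /* Each inhibitor (KVM without a synchronous MMU, every vfio
         * container, ...) nests; ballooning resumes only once all of
         * them have released their hold. */
        if (state) {
            atomic_inc(&balloon_inhibit_count);
        } else {
            atomic_dec(&balloon_inhibit_count);
        }
        assert(atomic_read(&balloon_inhibit_count) >= 0);
    }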

Long term, however, why can't the balloon notify vfio as pages are
added and removed?  VFIO could update its mappings then.
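
Purely as a thought experiment, such a hook might look like the
following - none of these names exist today, this only illustrates the
shape of the notification:

    /* Hypothetical QEMU-internal interface - not an existing API. */
    #include <stdbool.h>
    #include <stdint.h>
    #include "qemu/queue.h"

    typedef struct BalloonPageNotifier BalloonPageNotifier;
    struct BalloonPageNotifier {
        /* inflate == true: page handed to the balloon (unpin/unmap);
         * inflate == false: page returned to the guest (re-pin/map). */
        void (*page_state_changed)(BalloonPageNotifier *n,
                                   uint64_t gpa, uint64_t size,
                                   bool inflate);
        QLIST_ENTRY(BalloonPageNotifier) next;
    };

    /* virtio-balloon would call the notify helper per page; vfio would
     * register one notifier per container and adjust its mappings. */
    void balloon_page_notifier_register(BalloonPageNotifier *notifier);
    void balloon_page_notify(uint64_t gpa, uint64_t size, bool inflate);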

> ---
> 
> Alex Williamson (3):
>       balloon: Allow nested inhibits
>       kvm: Use inhibit to prevent ballooning without synchronous mmu
>       vfio: Inhibit ballooning
> 
> 
>  accel/kvm/kvm-all.c        |    4 ++++
>  balloon.c                  |    7 ++++---
>  hw/vfio/common.c           |    6 ++++++
>  hw/virtio/virtio-balloon.c |    4 +---
>  4 files changed, 15 insertions(+), 6 deletions(-)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/3] Balloon inhibit enhancements
  2018-07-30 13:34   ` [Qemu-devel] " Michael S. Tsirkin
@ 2018-07-30 13:54     ` David Hildenbrand
  -1 siblings, 0 replies; 58+ messages in thread
From: David Hildenbrand @ 2018-07-30 13:54 UTC (permalink / raw)
  To: Michael S. Tsirkin, Alex Williamson; +Cc: qemu-devel, kvm

On 30.07.2018 15:34, Michael S. Tsirkin wrote:
> On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
>> Directly assigned vfio devices have never been compatible with
>> ballooning.  Zapping MADV_DONTNEED pages happens completely
>> independent of vfio page pinning and IOMMU mapping, leaving us with
>> inconsistent GPA to HPA mapping between vCPUs and assigned devices
>> when the balloon deflates.  Mediated devices can theoretically do
>> better, if we make the assumption that the mdev vendor driver is fully
>> synchronized to the actual working set of the guest driver.  In that
>> case the guest balloon driver should never be able to allocate an mdev
>> pinned page for balloon inflation.  Unfortunately, QEMU can't know the
>> workings of the vendor driver pinning, and doesn't actually know the
>> difference between mdev devices and directly assigned devices.  Until
>> we can sort out how the vfio IOMMU backend can tell us if ballooning
>> is safe, the best approach is to disabling ballooning any time a vfio
>> devices is attached.
>>
>> To do that, simply make the balloon inhibitor a counter rather than a
>> boolean, fixup a case where KVM can then simply use the inhibit
>> interface, and inhibit ballooning any time a vfio device is attached.
>> I'm expecting we'll expose some sort of flag similar to
>> KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
>> this.  An addition we could consider here would be yet another device
>> option for vfio, such as x-disable-balloon-inhibit, in case there are
>> mdev devices that behave in a manner compatible with ballooning.
>>
>> Please let me know if this looks like a good idea.  Thanks,
>>
>> Alex
> 
> It's probably the only a reasonable thing to do for this release.
> 
> Long term however, why can't balloon notify vfio as pages are
> added and removed? VFIO could update its mappings then.

What if the guest is rebooted and pages are silently reused without a
deflation request first?

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/3] Balloon inhibit enhancements
  2018-07-30 13:54     ` [Qemu-devel] " David Hildenbrand
@ 2018-07-30 13:59       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2018-07-30 13:59 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: Alex Williamson, qemu-devel, kvm

On Mon, Jul 30, 2018 at 03:54:04PM +0200, David Hildenbrand wrote:
> On 30.07.2018 15:34, Michael S. Tsirkin wrote:
> > On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
> >> Directly assigned vfio devices have never been compatible with
> >> ballooning.  Zapping MADV_DONTNEED pages happens completely
> >> independent of vfio page pinning and IOMMU mapping, leaving us with
> >> inconsistent GPA to HPA mapping between vCPUs and assigned devices
> >> when the balloon deflates.  Mediated devices can theoretically do
> >> better, if we make the assumption that the mdev vendor driver is fully
> >> synchronized to the actual working set of the guest driver.  In that
> >> case the guest balloon driver should never be able to allocate an mdev
> >> pinned page for balloon inflation.  Unfortunately, QEMU can't know the
> >> workings of the vendor driver pinning, and doesn't actually know the
> >> difference between mdev devices and directly assigned devices.  Until
> >> we can sort out how the vfio IOMMU backend can tell us if ballooning
> >> is safe, the best approach is to disabling ballooning any time a vfio
> >> devices is attached.
> >>
> >> To do that, simply make the balloon inhibitor a counter rather than a
> >> boolean, fixup a case where KVM can then simply use the inhibit
> >> interface, and inhibit ballooning any time a vfio device is attached.
> >> I'm expecting we'll expose some sort of flag similar to
> >> KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
> >> this.  An addition we could consider here would be yet another device
> >> option for vfio, such as x-disable-balloon-inhibit, in case there are
> >> mdev devices that behave in a manner compatible with ballooning.
> >>
> >> Please let me know if this looks like a good idea.  Thanks,
> >>
> >> Alex
> > 
> > It's probably the only a reasonable thing to do for this release.
> > 
> > Long term however, why can't balloon notify vfio as pages are
> > added and removed? VFIO could update its mappings then.
> 
> What if the guest is rebooted and pages are silently getting reused
> without getting a deflation request first?

Good point.  To handle that we'd need to deflate fully on device
reset, allowing access to all memory again.

> -- 
> 
> Thanks,
> 
> David / dhildenb

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/3] Balloon inhibit enhancements
  2018-07-30 13:34   ` [Qemu-devel] " Michael S. Tsirkin
@ 2018-07-30 14:39     ` Alex Williamson
  -1 siblings, 0 replies; 58+ messages in thread
From: Alex Williamson @ 2018-07-30 14:39 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: qemu-devel, kvm

On Mon, 30 Jul 2018 16:34:09 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
> > Directly assigned vfio devices have never been compatible with
> > ballooning.  Zapping MADV_DONTNEED pages happens completely
> > independent of vfio page pinning and IOMMU mapping, leaving us with
> > inconsistent GPA to HPA mapping between vCPUs and assigned devices
> > when the balloon deflates.  Mediated devices can theoretically do
> > better, if we make the assumption that the mdev vendor driver is fully
> > synchronized to the actual working set of the guest driver.  In that
> > case the guest balloon driver should never be able to allocate an mdev
> > pinned page for balloon inflation.  Unfortunately, QEMU can't know the
> > workings of the vendor driver pinning, and doesn't actually know the
> > difference between mdev devices and directly assigned devices.  Until
> > we can sort out how the vfio IOMMU backend can tell us if ballooning
> > is safe, the best approach is to disabling ballooning any time a vfio
> > devices is attached.
> > 
> > To do that, simply make the balloon inhibitor a counter rather than a
> > boolean, fixup a case where KVM can then simply use the inhibit
> > interface, and inhibit ballooning any time a vfio device is attached.
> > I'm expecting we'll expose some sort of flag similar to
> > KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
> > this.  An addition we could consider here would be yet another device
> > option for vfio, such as x-disable-balloon-inhibit, in case there are
> > mdev devices that behave in a manner compatible with ballooning.
> > 
> > Please let me know if this looks like a good idea.  Thanks,
> > 
> > Alex  
> 
> It's probably the only a reasonable thing to do for this release.
> 
> Long term however, why can't balloon notify vfio as pages are
> added and removed? VFIO could update its mappings then.

Are you thinking of a notifier outside of the memory API or updating
the memory API to reflect the current ballooning state?  In the former
case, we don't have page level granularity for mapping and un-mapping.
We could invent a mechanism for userspace to specify page granularity
mapping to the vfio kernel module, but that incurs a cost at the
hardware and host level with poor IOTLB efficiency and excessive page
tables.  Additionally, how would a notifier approach handle hot-added
devices?  Is the notifier replayed for each added device?  This starts
to sound more like the existing functionality of the memory API.
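
That existing functionality being the MemoryListener interface vfio
already uses: registering a listener replays .region_add for every
range already present, which is exactly the replay semantics a
per-device balloon notifier would have to reinvent.  A minimal sketch
(callback bodies omitted):

    #include "qemu/osdep.h"
    #include "exec/memory.h"
    #include "exec/address-spaces.h"

    static void balloon_region_add(MemoryListener *listener,
                                   MemoryRegionSection *section)
    {
        /* pin/map the range described by 'section' */
    }

    static void balloon_region_del(MemoryListener *listener,
                                   MemoryRegionSection *section)
    {
        /* unpin/unmap the range described by 'section' */
    }

    static MemoryListener balloon_listener = {
        .region_add = balloon_region_add,
        .region_del = balloon_region_del,
    };

    static void balloon_listener_init(void)
    {
        /* Registration replays .region_add for everything that already
         * exists in the address space. */
        memory_listener_register(&balloon_listener, &address_space_memory);
    }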

If we go through the memory API then we also don't really have page
level granularity: removing a page from a SubRegion will remove the
entire region and add back the remaining SubRegion(s).  This is more
compatible with the IOMMU mappings, but I don't think it can be done
atomically with respect to inflight DMA of a physical device where we
cannot halt the device without interfering with its state.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/3] Balloon inhibit enhancements
  2018-07-30 13:59       ` [Qemu-devel] " Michael S. Tsirkin
@ 2018-07-30 14:46         ` David Hildenbrand
  -1 siblings, 0 replies; 58+ messages in thread
From: David Hildenbrand @ 2018-07-30 14:46 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Alex Williamson, qemu-devel, kvm

On 30.07.2018 15:59, Michael S. Tsirkin wrote:
> On Mon, Jul 30, 2018 at 03:54:04PM +0200, David Hildenbrand wrote:
>> On 30.07.2018 15:34, Michael S. Tsirkin wrote:
>>> On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
>>>> Directly assigned vfio devices have never been compatible with
>>>> ballooning.  Zapping MADV_DONTNEED pages happens completely
>>>> independent of vfio page pinning and IOMMU mapping, leaving us with
>>>> inconsistent GPA to HPA mapping between vCPUs and assigned devices
>>>> when the balloon deflates.  Mediated devices can theoretically do
>>>> better, if we make the assumption that the mdev vendor driver is fully
>>>> synchronized to the actual working set of the guest driver.  In that
>>>> case the guest balloon driver should never be able to allocate an mdev
>>>> pinned page for balloon inflation.  Unfortunately, QEMU can't know the
>>>> workings of the vendor driver pinning, and doesn't actually know the
>>>> difference between mdev devices and directly assigned devices.  Until
>>>> we can sort out how the vfio IOMMU backend can tell us if ballooning
>>>> is safe, the best approach is to disabling ballooning any time a vfio
>>>> devices is attached.
>>>>
>>>> To do that, simply make the balloon inhibitor a counter rather than a
>>>> boolean, fixup a case where KVM can then simply use the inhibit
>>>> interface, and inhibit ballooning any time a vfio device is attached.
>>>> I'm expecting we'll expose some sort of flag similar to
>>>> KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
>>>> this.  An addition we could consider here would be yet another device
>>>> option for vfio, such as x-disable-balloon-inhibit, in case there are
>>>> mdev devices that behave in a manner compatible with ballooning.
>>>>
>>>> Please let me know if this looks like a good idea.  Thanks,
>>>>
>>>> Alex
>>>
>>> It's probably the only a reasonable thing to do for this release.
>>>
>>> Long term however, why can't balloon notify vfio as pages are
>>> added and removed? VFIO could update its mappings then.
>>
>> What if the guest is rebooted and pages are silently getting reused
>> without getting a deflation request first?
> 
> Good point. To handle we'd need to deflate fully on
> on device reset, allowing access to all memory again.

1. Doing it from the guest: not reliable. E.g. think about crashes +
reboots, or a plain "system_reset" in QEMU. Deflation is definitely not
reliably possible.

2. Doing it in the QEMU balloon implementation: not possible. We don't
track the memory that has been inflated (and also should not do so).

So the only thing we could do is "deflate all guest memory", which
implies a madvise WILLNEED on all guest memory. We definitely don't
want this. Alternatively, we could inform vfio about all guest memory.
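
Concretely, "deflate all guest memory" would be something like the
following over every RAM block, prefaulting everything back in (a
sketch only):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Hint the kernel to fault the whole RAM block back in. */
    static void deflate_whole_block(void *hva, size_t size)
    {
        madvise(hva, size, MADV_WILLNEED);
    }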

Everything sounds like a big hack that should be handled internally in
the kernel.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/3] Balloon inhibit enhancements
  2018-07-30 14:39     ` [Qemu-devel] " Alex Williamson
@ 2018-07-30 14:51       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2018-07-30 14:51 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, kvm

On Mon, Jul 30, 2018 at 08:39:39AM -0600, Alex Williamson wrote:
> This is more
> compatible with the IOMMU mappings,

Precisely. These are at page granularity.

> but I don't think it can be done
> atomically with respect to inflight DMA of a physical device where we
> cannot halt the device without interfering with its state.

Guests never add pages to the balloon while they are under DMA, so
that's fine - there's never an in-flight DMA to a ballooned page; if
there is, the guest is buggy and it's OK to crash it.

-- 
MST

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/3] Balloon inhibit enhancements
  2018-07-30 14:46         ` [Qemu-devel] " David Hildenbrand
@ 2018-07-30 14:58           ` Michael S. Tsirkin
  -1 siblings, 0 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2018-07-30 14:58 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: Alex Williamson, qemu-devel, kvm

On Mon, Jul 30, 2018 at 04:46:25PM +0200, David Hildenbrand wrote:
> On 30.07.2018 15:59, Michael S. Tsirkin wrote:
> > On Mon, Jul 30, 2018 at 03:54:04PM +0200, David Hildenbrand wrote:
> >> On 30.07.2018 15:34, Michael S. Tsirkin wrote:
> >>> On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
> >>>> Directly assigned vfio devices have never been compatible with
> >>>> ballooning.  Zapping MADV_DONTNEED pages happens completely
> >>>> independent of vfio page pinning and IOMMU mapping, leaving us with
> >>>> inconsistent GPA to HPA mapping between vCPUs and assigned devices
> >>>> when the balloon deflates.  Mediated devices can theoretically do
> >>>> better, if we make the assumption that the mdev vendor driver is fully
> >>>> synchronized to the actual working set of the guest driver.  In that
> >>>> case the guest balloon driver should never be able to allocate an mdev
> >>>> pinned page for balloon inflation.  Unfortunately, QEMU can't know the
> >>>> workings of the vendor driver pinning, and doesn't actually know the
> >>>> difference between mdev devices and directly assigned devices.  Until
> >>>> we can sort out how the vfio IOMMU backend can tell us if ballooning
> >>>> is safe, the best approach is to disabling ballooning any time a vfio
> >>>> devices is attached.
> >>>>
> >>>> To do that, simply make the balloon inhibitor a counter rather than a
> >>>> boolean, fixup a case where KVM can then simply use the inhibit
> >>>> interface, and inhibit ballooning any time a vfio device is attached.
> >>>> I'm expecting we'll expose some sort of flag similar to
> >>>> KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
> >>>> this.  An addition we could consider here would be yet another device
> >>>> option for vfio, such as x-disable-balloon-inhibit, in case there are
> >>>> mdev devices that behave in a manner compatible with ballooning.
> >>>>
> >>>> Please let me know if this looks like a good idea.  Thanks,
> >>>>
> >>>> Alex
> >>>
> >>> It's probably the only a reasonable thing to do for this release.
> >>>
> >>> Long term however, why can't balloon notify vfio as pages are
> >>> added and removed? VFIO could update its mappings then.
> >>
> >> What if the guest is rebooted and pages are silently getting reused
> >> without getting a deflation request first?
> > 
> > Good point. To handle we'd need to deflate fully on
> > on device reset, allowing access to all memory again.
> 
> 1. Doing it from the guest: not reliable. E.g. think about crashes +
> reboots, or a plain "system_reset" in QEMU. Deflation is definetly not
> reliably possible.
> 
> 2. Doing it in QEMU balloon implementation. Not possible. We don't track
> the memory that has been inflated (and also should not do it).
>
> So the only thing we could do is "deflate all guest memory" which
> implies a madvise WILLNEED on all guest memory. We definitely don't want
> this. We could inform vfio about all guest memory.

Exactly.  No need to track anything; we just need QEMU to allow access
to all guest memory.

> Everything sounds like a big hack that should be handled internally in
> the kernel.

What exactly do you want the kernel to do?

> -- 
> 
> Thanks,
> 
> David / dhildenb

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/3] Balloon inhibit enhancements
  2018-07-30 14:51       ` [Qemu-devel] " Michael S. Tsirkin
@ 2018-07-30 15:01         ` Alex Williamson
  -1 siblings, 0 replies; 58+ messages in thread
From: Alex Williamson @ 2018-07-30 15:01 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: qemu-devel, kvm

On Mon, 30 Jul 2018 17:51:28 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Jul 30, 2018 at 08:39:39AM -0600, Alex Williamson wrote:
> > This is more
> > compatible with the IOMMU mappings,  
> 
> Precisely. These are at page granularity.

(This/these being memory API mappings for context)

SubRegions are not page granularity; the entire previous SubRegion
needs to be unmapped and any remaining SubRegions re-mapped.
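
A sketch of what the type1 backend then sees when a single 4 KiB page
is ballooned out of a 1 GiB RAM SubRegion (vfio_dma_map/vfio_dma_unmap
stand in for the MAP_DMA/UNMAP_DMA ioctl wrappers, error handling
omitted):

    #include <stdint.h>

    /* Assumed wrappers around VFIO_IOMMU_MAP_DMA / VFIO_IOMMU_UNMAP_DMA. */
    int vfio_dma_map(int container, void *hva, uint64_t iova, uint64_t size);
    int vfio_dma_unmap(int container, uint64_t iova, uint64_t size);

    static void punch_balloon_hole(int container, void *hva)
    {
        const uint64_t base = 0x40000000ULL;   /* IOVA of the SubRegion  */
        const uint64_t size = 1ULL << 30;      /* 1 GiB RAM SubRegion    */
        const uint64_t hole = 0x40001000ULL;   /* ballooned 4 KiB page   */

        vfio_dma_unmap(container, base, size);           /* whole region */
        vfio_dma_map(container, hva, base, hole - base); /* low part     */
        vfio_dma_map(container, (char *)hva + (hole + 0x1000 - base),
                     hole + 0x1000, base + size - (hole + 0x1000));
        /* Between the unmap and the two maps the device can still issue
         * DMA anywhere in the other 1 GiB - 4 KiB: that's the window. */
    }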
 
> > but I don't think it can be done
> > atomically with respect to inflight DMA of a physical device where we
> > cannot halt the device without interfering with its state.  
> 
> Guests never add pages to the balloon if they are under DMA,
> so that's fine - there's never an in-flight DMA, if
> there is guest is buggy and it's ok to crash it.

It's not the ballooned page that I'm trying to note; it's the entire
remainder of the SubRegion which needs to be unmapped to remove that
one page.  It's more compatible from an IOMMU perspective in that we're
only unmapping with the same granularity with which we mapped, but it's
incompatible with inflight DMA as we have no idea what DMA targets may
reside within the remainder of that mapping while it's temporarily
unmapped.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/3] Balloon inhibit enhancements
  2018-07-30 14:58           ` [Qemu-devel] " Michael S. Tsirkin
@ 2018-07-30 15:05             ` David Hildenbrand
  -1 siblings, 0 replies; 58+ messages in thread
From: David Hildenbrand @ 2018-07-30 15:05 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Alex Williamson, qemu-devel, kvm

On 30.07.2018 16:58, Michael S. Tsirkin wrote:
> On Mon, Jul 30, 2018 at 04:46:25PM +0200, David Hildenbrand wrote:
>> On 30.07.2018 15:59, Michael S. Tsirkin wrote:
>>> On Mon, Jul 30, 2018 at 03:54:04PM +0200, David Hildenbrand wrote:
>>>> On 30.07.2018 15:34, Michael S. Tsirkin wrote:
>>>>> On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
>>>>>> Directly assigned vfio devices have never been compatible with
>>>>>> ballooning.  Zapping MADV_DONTNEED pages happens completely
>>>>>> independent of vfio page pinning and IOMMU mapping, leaving us with
>>>>>> inconsistent GPA to HPA mapping between vCPUs and assigned devices
>>>>>> when the balloon deflates.  Mediated devices can theoretically do
>>>>>> better, if we make the assumption that the mdev vendor driver is fully
>>>>>> synchronized to the actual working set of the guest driver.  In that
>>>>>> case the guest balloon driver should never be able to allocate an mdev
>>>>>> pinned page for balloon inflation.  Unfortunately, QEMU can't know the
>>>>>> workings of the vendor driver pinning, and doesn't actually know the
>>>>>> difference between mdev devices and directly assigned devices.  Until
>>>>>> we can sort out how the vfio IOMMU backend can tell us if ballooning
>>>>>> is safe, the best approach is to disabling ballooning any time a vfio
>>>>>> devices is attached.
>>>>>>
>>>>>> To do that, simply make the balloon inhibitor a counter rather than a
>>>>>> boolean, fixup a case where KVM can then simply use the inhibit
>>>>>> interface, and inhibit ballooning any time a vfio device is attached.
>>>>>> I'm expecting we'll expose some sort of flag similar to
>>>>>> KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
>>>>>> this.  An addition we could consider here would be yet another device
>>>>>> option for vfio, such as x-disable-balloon-inhibit, in case there are
>>>>>> mdev devices that behave in a manner compatible with ballooning.
>>>>>>
>>>>>> Please let me know if this looks like a good idea.  Thanks,
>>>>>>
>>>>>> Alex
>>>>>
>>>>> It's probably the only reasonable thing to do for this release.
>>>>>
>>>>> Long term however, why can't balloon notify vfio as pages are
>>>>> added and removed? VFIO could update its mappings then.
>>>>
>>>> What if the guest is rebooted and pages are silently getting reused
>>>> without getting a deflation request first?
>>>
>>> Good point. To handle this we'd need to deflate fully on
>>> device reset, allowing access to all memory again.
>>
>> 1. Doing it from the guest: not reliable. E.g. think about crashes +
>> reboots, or a plain "system_reset" in QEMU. Deflation is definitely not
>> reliably possible.
>>
>> 2. Doing it in QEMU balloon implementation. Not possible. We don't track
>> the memory that has been inflated (and also should not do it).
>>
>> So the only thing we could do is "deflate all guest memory" which
>> implies a madvise WILLNEED on all guest memory. We definitely don't want
>> this. We could inform vfio about all guest memory.
> 
> Exactly. No need to track anything; we just need QEMU to allow access to
> all guest memory.
> 
>> Everything sounds like a big hack that should be handled internally in
>> the kernel.
> 
> What exactly do you want the kernel to do?

As already discussed (in this thread? I don't remember), Alex was asking
whether there is some kind of notifier in the kernel that fires when a
fresh page is used in memory that was previously madvise DONTNEEDed.
Then that page could be immediately repinned.
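
Purely hypothetical sketch of what such a hook could look like - no
such interface exists today and every name below is invented:

  /* Invented interface: mm tells the owner of a registered HVA range
   * when a page in it goes from "zapped by MADV_DONTNEED" back to
   * present, so vfio could re-pin the fresh page and restore the IOMMU
   * mapping before the device can use stale translations. */
  struct repopulate_notifier {
      unsigned long start, end;      /* watched HVA range */
      void (*page_populated)(struct repopulate_notifier *rn,
                             unsigned long hva);
  };

  /* vfio's (equally hypothetical) callback */
  static void vfio_page_populated(struct repopulate_notifier *rn,
                                  unsigned long hva)
  {
      /* look up the IOVA backed by hva, re-pin the new page and
       * redo the IOMMU mapping for it - details elided */
  }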

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/3] Balloon inhibit enhancements
  2018-07-30 15:01         ` [Qemu-devel] " Alex Williamson
@ 2018-07-30 15:49           ` Michael S. Tsirkin
  -1 siblings, 0 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2018-07-30 15:49 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, kvm

On Mon, Jul 30, 2018 at 09:01:37AM -0600, Alex Williamson wrote:
> > > but I don't think it can be done
> > > atomically with respect to inflight DMA of a physical device where we
> > > cannot halt the device without interfering with its state.  
> > 
> > Guests never add pages to the balloon if they are under DMA,
> > so that's fine - there's never an in-flight DMA; if
> > there is, the guest is buggy and it's ok to crash it.
> 
> It's not the ballooned page that I'm trying to note, it's the entire
> remainder of the SubRegion which needs to be unmapped to remove that
> one page.  It's more compatible from an IOMMU perspective in that we're
> only unmapping with the same granularity with which we mapped, but it's
> incompatible with inflight DMA as we have no idea what DMA targets may
> reside within the remainder of that mapping while it's temporarily
> unmapped.

I see. Yes you need to be careful to replace the host IOMMU PTE
atomically. Same applies to vIOMMU though - if guest changes
a PTE atomically host should do the same.

--
MST

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/3] Balloon inhibit enhancements
  2018-07-30 15:49           ` [Qemu-devel] " Michael S. Tsirkin
@ 2018-07-30 16:42             ` Alex Williamson
  -1 siblings, 0 replies; 58+ messages in thread
From: Alex Williamson @ 2018-07-30 16:42 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: qemu-devel, kvm

On Mon, 30 Jul 2018 18:49:58 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Jul 30, 2018 at 09:01:37AM -0600, Alex Williamson wrote:
> > > > but I don't think it can be done
> > > > atomically with respect to inflight DMA of a physical device where we
> > > > cannot halt the device without interfering with its state.    
> > > 
> > > Guests never add pages to the balloon if they are under DMA,
> > > so that's fine - there's never an in-flight DMA; if
> > > there is, the guest is buggy and it's ok to crash it.
> > 
> > It's not the ballooned page that I'm trying to note, it's the entire
> > remainder of the SubRegion which needs to be unmapped to remove that
> > one page.  It's more compatible from an IOMMU perspective in that we're
> > only unmapping with the same granularity with which we mapped, but it's
> > incompatible with inflight DMA as we have no idea what DMA targets may
> > reside within the remainder of that mapping while it's temporarily
> > unmapped.  
> 
> I see. Yes you need to be careful to replace the host IOMMU PTE
> atomically. Same applies to vIOMMU though - if guest changes
> a PTE atomically host should do the same.

I'm not sure the hardware supports atomic updates in these cases and
therefore I don't think the vIOMMU does either.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH 0/3] Balloon inhibit enhancements
  2018-07-30 16:42             ` [Qemu-devel] " Alex Williamson
@ 2018-07-30 17:35               ` Michael S. Tsirkin
  -1 siblings, 0 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2018-07-30 17:35 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, kvm

On Mon, Jul 30, 2018 at 10:42:05AM -0600, Alex Williamson wrote:
> On Mon, 30 Jul 2018 18:49:58 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Mon, Jul 30, 2018 at 09:01:37AM -0600, Alex Williamson wrote:
> > > > > but I don't think it can be done
> > > > > atomically with respect to inflight DMA of a physical device where we
> > > > > cannot halt the device without interfering with its state.    
> > > > 
> > > > Guests never add pages to the balloon if they are under DMA,
> > > > so that's fine - there's never an in-flight DMA; if
> > > > there is, the guest is buggy and it's ok to crash it.
> > > 
> > > It's not the ballooned page that I'm trying to note, it's the entire
> > > remainder of the SubRegion which needs to be unmapped to remove that
> > > one page.  It's more compatible from an IOMMU perspective in that we're
> > > only unmapping with the same granularity with which we mapped, but it's
> > > incompatible with inflight DMA as we have no idea what DMA targets may
> > > reside within the remainder of that mapping while it's temporarily
> > > unmapped.  
> > 
> > I see. Yes you need to be careful to replace the host IOMMU PTE
> > atomically. Same applies to vIOMMU though - if guest changes
> > a PTE atomically host should do the same.
> 
> I'm not sure the hardware supports atomic updates in these cases and
> therefore I don't think the vIOMMU does either.  Thanks,
> 
> Alex

Interesting. What makes you think simply writing into the PTE and
then flushing the cache isn't atomic?
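
What I have in mind, as a simplified sketch (not any specific IOMMU
driver's code; flush() stands in for the hardware invalidation):

  #include <stdint.h>

  /* The table walker reads the entry with a single 64-bit access, so
   * an aligned 64-bit store is seen whole - either the old or the new
   * translation, never a torn mix.  After the store, any cached
   * translation of the old entry is invalidated. */
  static void replace_pte(volatile uint64_t *pte, uint64_t new_pte,
                          void (*flush)(void))
  {
      *pte = new_pte;   /* single aligned 64-bit store */
      flush();          /* IOTLB invalidation for the old entry */
  }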

-- 
MST

^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2018-07-30 17:35 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-17 22:47 [RFC PATCH 0/3] Balloon inhibit enhancements Alex Williamson
2018-07-17 22:47 ` [Qemu-devel] " Alex Williamson
2018-07-17 22:47 ` [RFC PATCH 1/3] balloon: Allow nested inhibits Alex Williamson
2018-07-17 22:47   ` [Qemu-devel] " Alex Williamson
2018-07-18  6:40   ` Peter Xu
2018-07-18  6:40     ` [Qemu-devel] " Peter Xu
2018-07-18 16:37     ` Alex Williamson
2018-07-18 16:37       ` [Qemu-devel] " Alex Williamson
2018-07-19  4:45       ` Peter Xu
2018-07-19  4:45         ` [Qemu-devel] " Peter Xu
2018-07-17 22:47 ` [RFC PATCH 2/3] kvm: Use inhibit to prevent ballooning without synchronous mmu Alex Williamson
2018-07-17 22:47   ` [Qemu-devel] " Alex Williamson
2018-07-17 22:47 ` [RFC PATCH 3/3] vfio: Inhibit ballooning Alex Williamson
2018-07-17 22:47   ` [Qemu-devel] " Alex Williamson
2018-07-18  6:48 ` [RFC PATCH 0/3] Balloon inhibit enhancements Peter Xu
2018-07-18  6:48   ` [Qemu-devel] " Peter Xu
2018-07-18  9:36   ` Cornelia Huck
2018-07-18  9:36     ` [Qemu-devel] " Cornelia Huck
2018-07-19  4:49     ` Peter Xu
2018-07-19  4:49       ` [Qemu-devel] " Peter Xu
2018-07-19  8:42       ` Cornelia Huck
2018-07-19  8:42         ` [Qemu-devel] " Cornelia Huck
2018-07-19  9:30         ` Peter Xu
2018-07-19  9:30           ` [Qemu-devel] " Peter Xu
2018-07-19 15:31       ` Alex Williamson
2018-07-19 15:31         ` [Qemu-devel] " Alex Williamson
2018-07-18 16:31   ` Alex Williamson
2018-07-18 16:31     ` [Qemu-devel] " Alex Williamson
2018-07-19  5:40     ` Peter Xu
2018-07-19  5:40       ` [Qemu-devel] " Peter Xu
2018-07-19 15:01       ` Alex Williamson
2018-07-19 15:01         ` [Qemu-devel] " Alex Williamson
2018-07-20  2:56         ` Peter Xu
2018-07-20  2:56           ` [Qemu-devel] " Peter Xu
2018-07-30 13:34 ` Michael S. Tsirkin
2018-07-30 13:34   ` [Qemu-devel] " Michael S. Tsirkin
2018-07-30 13:54   ` David Hildenbrand
2018-07-30 13:54     ` [Qemu-devel] " David Hildenbrand
2018-07-30 13:59     ` Michael S. Tsirkin
2018-07-30 13:59       ` [Qemu-devel] " Michael S. Tsirkin
2018-07-30 14:46       ` David Hildenbrand
2018-07-30 14:46         ` [Qemu-devel] " David Hildenbrand
2018-07-30 14:58         ` Michael S. Tsirkin
2018-07-30 14:58           ` [Qemu-devel] " Michael S. Tsirkin
2018-07-30 15:05           ` David Hildenbrand
2018-07-30 15:05             ` [Qemu-devel] " David Hildenbrand
2018-07-30 14:39   ` Alex Williamson
2018-07-30 14:39     ` [Qemu-devel] " Alex Williamson
2018-07-30 14:51     ` Michael S. Tsirkin
2018-07-30 14:51       ` [Qemu-devel] " Michael S. Tsirkin
2018-07-30 15:01       ` Alex Williamson
2018-07-30 15:01         ` [Qemu-devel] " Alex Williamson
2018-07-30 15:49         ` Michael S. Tsirkin
2018-07-30 15:49           ` [Qemu-devel] " Michael S. Tsirkin
2018-07-30 16:42           ` Alex Williamson
2018-07-30 16:42             ` [Qemu-devel] " Alex Williamson
2018-07-30 17:35             ` Michael S. Tsirkin
2018-07-30 17:35               ` [Qemu-devel] " Michael S. Tsirkin
