* [PATCH v1 00/17] virtio-mem: Paravirtualized memory hot(un)plug
From: David Hildenbrand @ 2020-05-06  9:49 UTC
  To: qemu-devel
  Cc: kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin,
	David Hildenbrand, Alex Williamson, Christian Borntraeger,
	Cornelia Huck, Eric Blake, Eric Farman, Hailiang Zhang,
	Halil Pasic, Igor Mammedov, Janosch Frank, Juan Quintela,
	Keith Busch, Marcel Apfelbaum, Markus Armbruster, Peter Maydell,
	Pierre Morel, Tony Krowiak

This is the very basic, initial version of virtio-mem. More info on
virtio-mem in general can be found in the Linux kernel driver posting [1]
and in patch #10.

"The basic idea of virtio-mem is to provide a flexible,
cross-architecture memory hot(un)plug solution that avoids many limitations
imposed by existing technologies, architectures, and interfaces."

There are a lot of add-ons in the works (esp. protection of unplugged
memory, better hugepage support (esp. when reading unplugged memory),
resizable memory backends, migration optimizations, support for more
architectures, ...); this is the very basic version to get the ball
rolling.

The first 8 patches make sure we don't have any sudden surprises, e.g., if
somebody tries to pin all memory in RAM blocks, resulting in a higher
memory consumption than desired. The remaining patches add basic virtio-mem
along with support for x86-64.

[1] https://lkml.kernel.org/r/20200311171422.10484-1-david@redhat.com

David Hildenbrand (17):
  exec: Introduce ram_block_discard_set_(unreliable|required)()
  vfio: Convert to ram_block_discard_set_broken()
  accel/kvm: Convert to ram_block_discard_set_broken()
  s390x/pv: Convert to ram_block_discard_set_broken()
  virtio-balloon: Rip out qemu_balloon_inhibit()
  target/i386: sev: Use ram_block_discard_set_broken()
  migration/rdma: Use ram_block_discard_set_broken()
  migration/colo: Use ram_block_discard_set_broken()
  linux-headers: update to contain virtio-mem
  virtio-mem: Paravirtualized memory hot(un)plug
  virtio-pci: Proxy for virtio-mem
  MAINTAINERS: Add myself as virtio-mem maintainer
  hmp: Handle virtio-mem when printing memory device info
  numa: Handle virtio-mem in NUMA stats
  pc: Support for virtio-mem-pci
  virtio-mem: Allow notifiers for size changes
  virtio-pci: Send qapi events when the virtio-mem size changes

 MAINTAINERS                                 |   8 +
 accel/kvm/kvm-all.c                         |   3 +-
 balloon.c                                   |  17 -
 exec.c                                      |  48 ++
 hw/core/numa.c                              |   6 +
 hw/i386/Kconfig                             |   1 +
 hw/i386/pc.c                                |  49 +-
 hw/s390x/s390-virtio-ccw.c                  |  22 +-
 hw/vfio/ap.c                                |  10 +-
 hw/vfio/ccw.c                               |  11 +-
 hw/vfio/common.c                            |  53 +-
 hw/vfio/pci.c                               |   6 +-
 hw/virtio/Kconfig                           |  11 +
 hw/virtio/Makefile.objs                     |   2 +
 hw/virtio/virtio-balloon.c                  |  12 +-
 hw/virtio/virtio-mem-pci.c                  | 159 ++++
 hw/virtio/virtio-mem-pci.h                  |  34 +
 hw/virtio/virtio-mem.c                      | 781 ++++++++++++++++++++
 include/exec/memory.h                       |  41 +
 include/hw/pci/pci.h                        |   1 +
 include/hw/vfio/vfio-common.h               |   4 +-
 include/hw/virtio/virtio-mem.h              |  85 +++
 include/migration/colo.h                    |   2 +-
 include/standard-headers/linux/virtio_ids.h |   1 +
 include/standard-headers/linux/virtio_mem.h | 208 ++++++
 include/sysemu/balloon.h                    |   2 -
 migration/migration.c                       |   8 +-
 migration/postcopy-ram.c                    |  23 -
 migration/rdma.c                            |  18 +-
 migration/savevm.c                          |  11 +-
 monitor/hmp-cmds.c                          |  16 +
 monitor/monitor.c                           |   1 +
 qapi/misc.json                              |  64 +-
 target/i386/sev.c                           |   1 +
 34 files changed, 1598 insertions(+), 121 deletions(-)
 create mode 100644 hw/virtio/virtio-mem-pci.c
 create mode 100644 hw/virtio/virtio-mem-pci.h
 create mode 100644 hw/virtio/virtio-mem.c
 create mode 100644 include/hw/virtio/virtio-mem.h
 create mode 100644 include/standard-headers/linux/virtio_mem.h

-- 
2.25.3



* [PATCH v1 01/17] exec: Introduce ram_block_discard_set_(unreliable|required)()
From: David Hildenbrand @ 2020-05-06  9:49 UTC
  To: qemu-devel
  Cc: kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin,
	David Hildenbrand

We want to replace qemu_balloon_inhibit() with something more generic.
In particular, we want to make sure that technologies that rely on RAM
block discards working reliably run mutually exclusive with technologies
that break them.

E.g., vfio will usually pin all guest memory, rendering virtio-balloon
basically useless and making the VM consume more memory than reported via
the balloon. While the balloon is special already (=> no guarantees, same
behavior possible after reboots and with huge pages), this will be
different, especially with virtio-mem.

Let's implement a way such that we can make both types of technology run
mutually exclusive. We'll convert existing balloon inhibitors in successive
patches and add some new ones. Add the check to
qemu_balloon_is_inhibited() for now. We might want to make
virtio-balloon an actual inhibitor in the future - however, that
requires more thought to not break existing setups.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 balloon.c             |  3 ++-
 exec.c                | 48 +++++++++++++++++++++++++++++++++++++++++++
 include/exec/memory.h | 41 ++++++++++++++++++++++++++++++++++++
 3 files changed, 91 insertions(+), 1 deletion(-)

diff --git a/balloon.c b/balloon.c
index f104b42961..c49f57c27b 100644
--- a/balloon.c
+++ b/balloon.c
@@ -40,7 +40,8 @@ static int balloon_inhibit_count;
 
 bool qemu_balloon_is_inhibited(void)
 {
-    return atomic_read(&balloon_inhibit_count) > 0;
+    return atomic_read(&balloon_inhibit_count) > 0 ||
+           ram_block_discard_is_broken();
 }
 
 void qemu_balloon_inhibit(bool state)
diff --git a/exec.c b/exec.c
index 2874bb5088..52a6e40e99 100644
--- a/exec.c
+++ b/exec.c
@@ -4049,4 +4049,52 @@ void mtree_print_dispatch(AddressSpaceDispatch *d, MemoryRegion *root)
     }
 }
 
+static int ram_block_discard_broken;
+
+int ram_block_discard_set_broken(bool state)
+{
+    int old;
+
+    if (!state) {
+        atomic_dec(&ram_block_discard_broken);
+        return 0;
+    }
+
+    do {
+        old = atomic_read(&ram_block_discard_broken);
+        if (old < 0) {
+            return -EBUSY;
+        }
+    } while (atomic_cmpxchg(&ram_block_discard_broken, old, old + 1) != old);
+    return 0;
+}
+
+int ram_block_discard_set_required(bool state)
+{
+    int old;
+
+    if (!state) {
+        atomic_inc(&ram_block_discard_broken);
+        return 0;
+    }
+
+    do {
+        old = atomic_read(&ram_block_discard_broken);
+        if (old > 0) {
+            return -EBUSY;
+        }
+    } while (atomic_cmpxchg(&ram_block_discard_broken, old, old - 1) != old);
+    return 0;
+}
+
+bool ram_block_discard_is_broken(void)
+{
+    return atomic_read(&ram_block_discard_broken) > 0;
+}
+
+bool ram_block_discard_is_required(void)
+{
+    return atomic_read(&ram_block_discard_broken) < 0;
+}
+
 #endif
diff --git a/include/exec/memory.h b/include/exec/memory.h
index e000bd2f97..9bb5ced38d 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -2463,6 +2463,47 @@ static inline MemOp devend_memop(enum device_endian end)
 }
 #endif
 
+/*
+ * Inhibit technologies that rely on discarding of parts of RAM blocks to work
+ * reliably, e.g., to manage the actual amount of memory consumed by the VM
+ * (then, the memory provided by RAM blocks might be bigger than the desired
+ * memory consumption). This *must* be set if:
+ * - Discarding parts of a RAM block does not result in the change being
+ *   reflected in the VM and the pages getting freed.
+ * - All memory in RAM blocks is pinned or duplicated, invalidating any previous
+ *   discards blindly.
+ * - Discarding parts of a RAM block will result in integrity issues (e.g.,
+ *   encrypted VMs).
+ * Technologies that only temporarily pin the current working set of a
+ * driver are fine, because we don't expect such pages to be discarded
+ * (esp. based on guest action like balloon inflation).
+ *
+ * This is *not* to be used to protect from concurrent discards (esp.,
+ * postcopy).
+ *
+ * Returns 0 if successful. Returns -EBUSY if a technology that relies on
+ * discards to work reliably is active.
+ */
+int ram_block_discard_set_broken(bool state);
+
+/*
+ * Inhibit technologies that will break discarding of pages in RAM blocks.
+ *
+ * Returns 0 if successful. Returns -EBUSY if discards are already set to
+ * broken.
+ */
+int ram_block_discard_set_required(bool state);
+
+/*
+ * Test if discarding of memory in ram blocks is broken.
+ */
+bool ram_block_discard_is_broken(void);
+
+/*
+ * Test if discarding of memory in ram blocks is required to work reliably.
+ */
+bool ram_block_discard_is_required(void);
+
 #endif
 
 #endif
-- 
2.25.3
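
The exec.c hunk above encodes both sides of the mutual exclusion in one
signed counter: positive values count users that break discards, negative
values count users that require them. A minimal standalone sketch of that
scheme follows. It is a model, not the QEMU code: it assumes C11
<stdatomic.h> in place of QEMU's atomic_*() helpers, and the model_*() names,
the shared discard_acquire() helper, and the balance assertions on release
are hypothetical additions for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Model of the single signed counter (names are hypothetical):
 *   > 0: that many users have marked discards as broken
 *   < 0: that many users require discards to work reliably
 *  == 0: neither side is active
 */
static atomic_int discard_state;

/*
 * Move the counter by 'step' (+1 for "broken", -1 for "required") unless
 * the opposite side currently holds it, in which case return -EBUSY.
 */
static int discard_acquire(int step)
{
    int old = atomic_load(&discard_state);

    do {
        if ((step > 0 && old < 0) || (step < 0 && old > 0)) {
            return -EBUSY;
        }
        /* On failure, 'old' is refreshed with the current value. */
    } while (!atomic_compare_exchange_weak(&discard_state, &old, old + step));
    return 0;
}

int model_set_broken(bool state)
{
    if (!state) {
        /* Release: only valid after a successful model_set_broken(true). */
        int prev = atomic_fetch_sub(&discard_state, 1);
        assert(prev > 0);
        (void)prev;
        return 0;
    }
    return discard_acquire(+1);
}

int model_set_required(bool state)
{
    if (!state) {
        /* Release: only valid after a successful model_set_required(true). */
        int prev = atomic_fetch_add(&discard_state, 1);
        assert(prev < 0);
        (void)prev;
        return 0;
    }
    return discard_acquire(-1);
}

bool model_is_broken(void)
{
    return atomic_load(&discard_state) > 0;
}

bool model_is_required(void)
{
    return atomic_load(&discard_state) < 0;
}
```

Several users on the same side can stack (the counter simply moves further
from zero), while the first user on the opposite side gets -EBUSY until all
of them have released again.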



* [PATCH v1 02/17] vfio: Convert to ram_block_discard_set_broken()
From: David Hildenbrand @ 2020-05-06  9:49 UTC
  To: qemu-devel
  Cc: kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin,
	David Hildenbrand, Cornelia Huck, Alex Williamson,
	Christian Borntraeger, Tony Krowiak, Halil Pasic, Pierre Morel,
	Eric Farman

VFIO is (except for devices without a physical IOMMU or some mediated
devices) incompatible with discarding of RAM. The kernel will pin basically
all VM memory. Let's convert to ram_block_discard_set_broken(), which can
now fail, in contrast to qemu_balloon_inhibit().

Leave "x-balloon-allowed" named as it is for now.

Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Tony Krowiak <akrowiak@linux.ibm.com>
Cc: Halil Pasic <pasic@linux.ibm.com>
Cc: Pierre Morel <pmorel@linux.ibm.com>
Cc: Eric Farman <farman@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/vfio/ap.c                  | 10 +++----
 hw/vfio/ccw.c                 | 11 ++++----
 hw/vfio/common.c              | 53 +++++++++++++++++++----------------
 hw/vfio/pci.c                 |  6 ++--
 include/hw/vfio/vfio-common.h |  4 +--
 5 files changed, 45 insertions(+), 39 deletions(-)

diff --git a/hw/vfio/ap.c b/hw/vfio/ap.c
index 8649ac15f9..b51546d67a 100644
--- a/hw/vfio/ap.c
+++ b/hw/vfio/ap.c
@@ -105,12 +105,12 @@ static void vfio_ap_realize(DeviceState *dev, Error **errp)
     vapdev->vdev.dev = dev;
 
     /*
-     * vfio-ap devices operate in a way compatible with
-     * memory ballooning, as no pages are pinned in the host.
-     * This needs to be set before vfio_get_device() for vfio common to
-     * handle the balloon inhibitor.
+     * vfio-ap devices operate in a way compatible with discarding of memory in
+     * RAM blocks, as no pages are pinned in the host. This needs to be
+     * set before vfio_get_device() for vfio common to handle
+     * ram_block_discard_set_broken().
      */
-    vapdev->vdev.balloon_allowed = true;
+    vapdev->vdev.ram_block_discard_allowed = true;
 
     ret = vfio_get_device(vfio_group, mdevid, &vapdev->vdev, errp);
     if (ret) {
diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
index 50cc2ec75c..0dd6c3f2ab 100644
--- a/hw/vfio/ccw.c
+++ b/hw/vfio/ccw.c
@@ -425,12 +425,13 @@ static void vfio_ccw_get_device(VFIOGroup *group, VFIOCCWDevice *vcdev,
 
     /*
      * All vfio-ccw devices are believed to operate in a way compatible with
-     * memory ballooning, ie. pages pinned in the host are in the current
-     * working set of the guest driver and therefore never overlap with pages
-     * available to the guest balloon driver.  This needs to be set before
-     * vfio_get_device() for vfio common to handle the balloon inhibitor.
+     * discarding of memory in RAM blocks, ie. pages pinned in the host are
+     * in the current working set of the guest driver and therefore never
+     * overlap e.g., with pages available to the guest balloon driver.  This
+     * needs to be set before vfio_get_device() for vfio common to handle
+     * ram_block_discard_set_broken().
      */
-    vcdev->vdev.balloon_allowed = true;
+    vcdev->vdev.ram_block_discard_allowed = true;
 
     if (vfio_get_device(group, vcdev->cdev.mdevid, &vcdev->vdev, errp)) {
         goto out_err;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0b3593b3c0..98b2573ae6 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -33,7 +33,6 @@
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "qemu/range.h"
-#include "sysemu/balloon.h"
 #include "sysemu/kvm.h"
 #include "sysemu/reset.h"
 #include "trace.h"
@@ -1215,31 +1214,36 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     space = vfio_get_address_space(as);
 
     /*
-     * VFIO is currently incompatible with memory ballooning insofar as the
+     * VFIO is currently incompatible with discarding of RAM insofar as the
      * madvise to purge (zap) the page from QEMU's address space does not
      * interact with the memory API and therefore leaves stale virtual to
      * physical mappings in the IOMMU if the page was previously pinned.  We
-     * therefore add a balloon inhibit for each group added to a container,
+     * therefore set discarding broken for each group added to a container,
      * whether the container is used individually or shared.  This provides
      * us with options to allow devices within a group to opt-in and allow
-     * ballooning, so long as it is done consistently for a group (for instance
+     * discarding, so long as it is done consistently for a group (for instance
      * if the device is an mdev device where it is known that the host vendor
      * driver will never pin pages outside of the working set of the guest
-     * driver, which would thus not be ballooning candidates).
+     * driver, which would thus not be discarding candidates).
      *
      * The first opportunity to induce pinning occurs here where we attempt to
      * attach the group to existing containers within the AddressSpace.  If any
-     * pages are already zapped from the virtual address space, such as from a
-     * previous ballooning opt-in, new pinning will cause valid mappings to be
+     * pages are already zapped from the virtual address space, such as from
+     * previous discards, new pinning will cause valid mappings to be
      * re-established.  Likewise, when the overall MemoryListener for a new
      * container is registered, a replay of mappings within the AddressSpace
      * will occur, re-establishing any previously zapped pages as well.
      *
-     * NB. Balloon inhibiting does not currently block operation of the
-     * balloon driver or revoke previously pinned pages, it only prevents
-     * calling madvise to modify the virtual mapping of ballooned pages.
+     * In particular, virtio-balloon is currently only prevented from
+     * discarding new memory; it does not yet call
+     * ram_block_discard_set_required() and therefore neither stops us here
+     * nor deals with the sudden memory consumption of inflated memory.
      */
-    qemu_balloon_inhibit(true);
+    ret = ram_block_discard_set_broken(true);
+    if (ret) {
+        error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
+        return ret;
+    }
 
     QLIST_FOREACH(container, &space->containers, next) {
         if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
@@ -1405,7 +1409,7 @@ close_fd_exit:
     close(fd);
 
 put_space_exit:
-    qemu_balloon_inhibit(false);
+    ram_block_discard_set_broken(false);
     vfio_put_address_space(space);
 
     return ret;
@@ -1526,8 +1530,8 @@ void vfio_put_group(VFIOGroup *group)
         return;
     }
 
-    if (!group->balloon_allowed) {
-        qemu_balloon_inhibit(false);
+    if (!group->ram_block_discard_allowed) {
+        ram_block_discard_set_broken(false);
     }
     vfio_kvm_device_del_group(group);
     vfio_disconnect_container(group);
@@ -1565,22 +1569,23 @@ int vfio_get_device(VFIOGroup *group, const char *name,
     }
 
     /*
-     * Clear the balloon inhibitor for this group if the driver knows the
-     * device operates compatibly with ballooning.  Setting must be consistent
-     * per group, but since compatibility is really only possible with mdev
-     * currently, we expect singleton groups.
+     * Set discarding of RAM as not broken for this group if the driver knows
+     * the device operates compatibly with discarding.  Setting must be
+     * consistent per group, but since compatibility is really only possible
+     * with mdev currently, we expect singleton groups.
      */
-    if (vbasedev->balloon_allowed != group->balloon_allowed) {
+    if (vbasedev->ram_block_discard_allowed !=
+        group->ram_block_discard_allowed) {
         if (!QLIST_EMPTY(&group->device_list)) {
-            error_setg(errp,
-                       "Inconsistent device balloon setting within group");
+            error_setg(errp, "Inconsistent setting of support for discarding "
+                       "RAM (e.g., balloon) within group");
             close(fd);
             return -1;
         }
 
-        if (!group->balloon_allowed) {
-            group->balloon_allowed = true;
-            qemu_balloon_inhibit(false);
+        if (!group->ram_block_discard_allowed) {
+            group->ram_block_discard_allowed = true;
+            ram_block_discard_set_broken(false);
         }
     }
 
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5e75a95129..88c630c21c 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2796,7 +2796,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     }
 
     /*
-     * Mediated devices *might* operate compatibly with memory ballooning, but
+     * Mediated devices *might* operate compatibly with discarding of RAM, but
      * we cannot know for certain, it depends on whether the mdev vendor driver
      * stays in sync with the active working set of the guest driver.  Prevent
      * the x-balloon-allowed option unless this is minimally an mdev device.
@@ -2809,7 +2809,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 
     trace_vfio_mdev(vdev->vbasedev.name, is_mdev);
 
-    if (vdev->vbasedev.balloon_allowed && !is_mdev) {
+    if (vdev->vbasedev.ram_block_discard_allowed && !is_mdev) {
         error_setg(errp, "x-balloon-allowed only potentially compatible "
                    "with mdev devices");
         vfio_put_group(group);
@@ -3163,7 +3163,7 @@ static Property vfio_pci_dev_properties[] = {
                     VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
     DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
     DEFINE_PROP_BOOL("x-balloon-allowed", VFIOPCIDevice,
-                     vbasedev.balloon_allowed, false),
+                     vbasedev.ram_block_discard_allowed, false),
     DEFINE_PROP_BOOL("x-no-kvm-intx", VFIOPCIDevice, no_kvm_intx, false),
     DEFINE_PROP_BOOL("x-no-kvm-msi", VFIOPCIDevice, no_kvm_msi, false),
     DEFINE_PROP_BOOL("x-no-kvm-msix", VFIOPCIDevice, no_kvm_msix, false),
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index fd564209ac..c78f3ff559 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -108,7 +108,7 @@ typedef struct VFIODevice {
     bool reset_works;
     bool needs_reset;
     bool no_mmap;
-    bool balloon_allowed;
+    bool ram_block_discard_allowed;
     VFIODeviceOps *ops;
     unsigned int num_irqs;
     unsigned int num_regions;
@@ -128,7 +128,7 @@ typedef struct VFIOGroup {
     QLIST_HEAD(, VFIODevice) device_list;
     QLIST_ENTRY(VFIOGroup) next;
     QLIST_ENTRY(VFIOGroup) container_next;
-    bool balloon_allowed;
+    bool ram_block_discard_allowed;
 } VFIOGroup;
 
 typedef struct VFIODMABuf {
-- 
2.25.3
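
The hw/vfio/common.c changes above follow a balanced pattern: every group
marks discards as broken when it connects to a container, a
discard-compatible device may drop that marker again for its (singleton)
group, and vfio_put_group() releases the marker only if the group still holds
it. A small single-threaded sketch of that bookkeeping follows. Everything in
it is a hypothetical stand-in: struct group, group_connect(),
group_add_device(), group_put(), and the simplified non-atomic counter are
not the QEMU implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Simplified, non-atomic stand-in for the global state from patch 01. */
static int discard_broken_count;
static bool discard_required;    /* set by a hypothetical discard user */

static int set_broken(bool state)
{
    if (!state) {
        assert(discard_broken_count > 0);   /* releases must be balanced */
        discard_broken_count--;
        return 0;
    }
    if (discard_required) {
        return -EBUSY;
    }
    discard_broken_count++;
    return 0;
}

struct group {
    bool discard_allowed;   /* plays the role of ram_block_discard_allowed */
    int num_devices;
};

/* Like vfio_connect_container(): a new group marks discards as broken. */
int group_connect(struct group *g)
{
    int ret = set_broken(true);

    if (ret) {
        return ret;         /* unlike the old inhibitor, this can fail */
    }
    g->discard_allowed = false;
    g->num_devices = 0;
    return 0;
}

/* Like vfio_get_device(): a compatible device drops the group's marker. */
int group_add_device(struct group *g, bool dev_discard_allowed)
{
    if (dev_discard_allowed != g->discard_allowed) {
        if (g->num_devices > 0) {
            return -1;      /* inconsistent setting within the group */
        }
        if (!g->discard_allowed) {
            g->discard_allowed = true;
            set_broken(false);
        }
    }
    g->num_devices++;
    return 0;
}

/* Like vfio_put_group(): release only if the marker is still held. */
void group_put(struct group *g)
{
    if (!g->discard_allowed) {
        set_broken(false);
    }
}

/* Helpers for experimenting with the model. */
int broken_count(void) { return discard_broken_count; }
void discard_user_set_active(bool on) { discard_required = on; }
```

The point of the sketch is the invariant that the counter returns to zero
once all groups are gone, regardless of whether a group dropped its marker
early via a discard-compatible device.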



* [PATCH v1 02/17] vfio: Convert to ram_block_discard_set_broken()
From: David Hildenbrand @ 2020-05-06  9:49 UTC
  To: qemu-devel
  Cc: Tony Krowiak, Eric Farman, Cornelia Huck, Alex Williamson,
	Eduardo Habkost, kvm, Michael S . Tsirkin, Pierre Morel,
	David Hildenbrand, Dr . David Alan Gilbert, Halil Pasic,
	Christian Borntraeger, qemu-s390x, Paolo Bonzini,
	Richard Henderson

VFIO is (except for devices without a physical IOMMU or some mediated
devices) incompatible with discarding of RAM. The kernel will pin basically
all VM memory. Let's convert to ram_block_discard_set_broken(), which can
now fail, in contrast to qemu_balloon_inhibit().

Leave "x-balloon-allowed" named as it is for now.

Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Tony Krowiak <akrowiak@linux.ibm.com>
Cc: Halil Pasic <pasic@linux.ibm.com>
Cc: Pierre Morel <pmorel@linux.ibm.com>
Cc: Eric Farman <farman@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/vfio/ap.c                  | 10 +++----
 hw/vfio/ccw.c                 | 11 ++++----
 hw/vfio/common.c              | 53 +++++++++++++++++++----------------
 hw/vfio/pci.c                 |  6 ++--
 include/hw/vfio/vfio-common.h |  4 +--
 5 files changed, 45 insertions(+), 39 deletions(-)

diff --git a/hw/vfio/ap.c b/hw/vfio/ap.c
index 8649ac15f9..b51546d67a 100644
--- a/hw/vfio/ap.c
+++ b/hw/vfio/ap.c
@@ -105,12 +105,12 @@ static void vfio_ap_realize(DeviceState *dev, Error **errp)
     vapdev->vdev.dev = dev;
 
     /*
-     * vfio-ap devices operate in a way compatible with
-     * memory ballooning, as no pages are pinned in the host.
-     * This needs to be set before vfio_get_device() for vfio common to
-     * handle the balloon inhibitor.
+     * vfio-ap devices operate in a way compatible with discarding of memory in
+     * RAM blocks, as no pages are pinned in the host. This needs to be
+     * set before vfio_get_device() for vfio common to handle
+     * ram_block_discard_set_broken().
      */
-    vapdev->vdev.balloon_allowed = true;
+    vapdev->vdev.ram_block_discard_allowed = true;
 
     ret = vfio_get_device(vfio_group, mdevid, &vapdev->vdev, errp);
     if (ret) {
diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
index 50cc2ec75c..0dd6c3f2ab 100644
--- a/hw/vfio/ccw.c
+++ b/hw/vfio/ccw.c
@@ -425,12 +425,13 @@ static void vfio_ccw_get_device(VFIOGroup *group, VFIOCCWDevice *vcdev,
 
     /*
      * All vfio-ccw devices are believed to operate in a way compatible with
-     * memory ballooning, ie. pages pinned in the host are in the current
-     * working set of the guest driver and therefore never overlap with pages
-     * available to the guest balloon driver.  This needs to be set before
-     * vfio_get_device() for vfio common to handle the balloon inhibitor.
+     * discarding of memory in RAM blocks, ie. pages pinned in the host are
+     * in the current working set of the guest driver and therefore never
+     * overlap e.g., with pages available to the guest balloon driver.  This
+     * needs to be set before vfio_get_device() for vfio common to handle
+     * ram_block_discard_set_broken().
      */
-    vcdev->vdev.balloon_allowed = true;
+    vcdev->vdev.ram_block_discard_allowed = true;
 
     if (vfio_get_device(group, vcdev->cdev.mdevid, &vcdev->vdev, errp)) {
         goto out_err;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0b3593b3c0..98b2573ae6 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -33,7 +33,6 @@
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "qemu/range.h"
-#include "sysemu/balloon.h"
 #include "sysemu/kvm.h"
 #include "sysemu/reset.h"
 #include "trace.h"
@@ -1215,31 +1214,36 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     space = vfio_get_address_space(as);
 
     /*
-     * VFIO is currently incompatible with memory ballooning insofar as the
+     * VFIO is currently incompatible with discarding of RAM insofar as the
      * madvise to purge (zap) the page from QEMU's address space does not
      * interact with the memory API and therefore leaves stale virtual to
      * physical mappings in the IOMMU if the page was previously pinned.  We
-     * therefore add a balloon inhibit for each group added to a container,
+     * therefore set discarding broken for each group added to a container,
      * whether the container is used individually or shared.  This provides
      * us with options to allow devices within a group to opt-in and allow
-     * ballooning, so long as it is done consistently for a group (for instance
+     * discarding, so long as it is done consistently for a group (for instance
      * if the device is an mdev device where it is known that the host vendor
      * driver will never pin pages outside of the working set of the guest
-     * driver, which would thus not be ballooning candidates).
+     * driver, which would thus not be discarding candidates).
      *
      * The first opportunity to induce pinning occurs here where we attempt to
      * attach the group to existing containers within the AddressSpace.  If any
-     * pages are already zapped from the virtual address space, such as from a
-     * previous ballooning opt-in, new pinning will cause valid mappings to be
+     * pages are already zapped from the virtual address space, such as from
+     * previous discards, new pinning will cause valid mappings to be
      * re-established.  Likewise, when the overall MemoryListener for a new
      * container is registered, a replay of mappings within the AddressSpace
      * will occur, re-establishing any previously zapped pages as well.
      *
-     * NB. Balloon inhibiting does not currently block operation of the
-     * balloon driver or revoke previously pinned pages, it only prevents
-     * calling madvise to modify the virtual mapping of ballooned pages.
+     * In particular, virtio-balloon is currently only prevented from
+     * discarding new memory; it does not yet call
+     * ram_block_discard_set_required() and therefore neither stops us here
+     * nor deals with the sudden memory consumption of inflated memory.
      */
-    qemu_balloon_inhibit(true);
+    ret = ram_block_discard_set_broken(true);
+    if (ret) {
+        error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
+        return ret;
+    }
 
     QLIST_FOREACH(container, &space->containers, next) {
         if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
@@ -1405,7 +1409,7 @@ close_fd_exit:
     close(fd);
 
 put_space_exit:
-    qemu_balloon_inhibit(false);
+    ram_block_discard_set_broken(false);
     vfio_put_address_space(space);
 
     return ret;
@@ -1526,8 +1530,8 @@ void vfio_put_group(VFIOGroup *group)
         return;
     }
 
-    if (!group->balloon_allowed) {
-        qemu_balloon_inhibit(false);
+    if (!group->ram_block_discard_allowed) {
+        ram_block_discard_set_broken(false);
     }
     vfio_kvm_device_del_group(group);
     vfio_disconnect_container(group);
@@ -1565,22 +1569,23 @@ int vfio_get_device(VFIOGroup *group, const char *name,
     }
 
     /*
-     * Clear the balloon inhibitor for this group if the driver knows the
-     * device operates compatibly with ballooning.  Setting must be consistent
-     * per group, but since compatibility is really only possible with mdev
-     * currently, we expect singleton groups.
+     * Set discarding of RAM as not broken for this group if the driver knows
+     * the device operates compatibly with discarding.  Setting must be
+     * consistent per group, but since compatibility is really only possible
+     * with mdev currently, we expect singleton groups.
      */
-    if (vbasedev->balloon_allowed != group->balloon_allowed) {
+    if (vbasedev->ram_block_discard_allowed !=
+        group->ram_block_discard_allowed) {
         if (!QLIST_EMPTY(&group->device_list)) {
-            error_setg(errp,
-                       "Inconsistent device balloon setting within group");
+            error_setg(errp, "Inconsistent setting of support for discarding "
+                       "RAM (e.g., balloon) within group");
             close(fd);
             return -1;
         }
 
-        if (!group->balloon_allowed) {
-            group->balloon_allowed = true;
-            qemu_balloon_inhibit(false);
+        if (!group->ram_block_discard_allowed) {
+            group->ram_block_discard_allowed = true;
+            ram_block_discard_set_broken(false);
         }
     }
 
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5e75a95129..88c630c21c 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2796,7 +2796,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     }
 
     /*
-     * Mediated devices *might* operate compatibly with memory ballooning, but
+     * Mediated devices *might* operate compatibly with discarding of RAM, but
      * we cannot know for certain, it depends on whether the mdev vendor driver
      * stays in sync with the active working set of the guest driver.  Prevent
      * the x-balloon-allowed option unless this is minimally an mdev device.
@@ -2809,7 +2809,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 
     trace_vfio_mdev(vdev->vbasedev.name, is_mdev);
 
-    if (vdev->vbasedev.balloon_allowed && !is_mdev) {
+    if (vdev->vbasedev.ram_block_discard_allowed && !is_mdev) {
         error_setg(errp, "x-balloon-allowed only potentially compatible "
                    "with mdev devices");
         vfio_put_group(group);
@@ -3163,7 +3163,7 @@ static Property vfio_pci_dev_properties[] = {
                     VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
     DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
     DEFINE_PROP_BOOL("x-balloon-allowed", VFIOPCIDevice,
-                     vbasedev.balloon_allowed, false),
+                     vbasedev.ram_block_discard_allowed, false),
     DEFINE_PROP_BOOL("x-no-kvm-intx", VFIOPCIDevice, no_kvm_intx, false),
     DEFINE_PROP_BOOL("x-no-kvm-msi", VFIOPCIDevice, no_kvm_msi, false),
     DEFINE_PROP_BOOL("x-no-kvm-msix", VFIOPCIDevice, no_kvm_msix, false),
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index fd564209ac..c78f3ff559 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -108,7 +108,7 @@ typedef struct VFIODevice {
     bool reset_works;
     bool needs_reset;
     bool no_mmap;
-    bool balloon_allowed;
+    bool ram_block_discard_allowed;
     VFIODeviceOps *ops;
     unsigned int num_irqs;
     unsigned int num_regions;
@@ -128,7 +128,7 @@ typedef struct VFIOGroup {
     QLIST_HEAD(, VFIODevice) device_list;
     QLIST_ENTRY(VFIOGroup) next;
     QLIST_ENTRY(VFIOGroup) container_next;
-    bool balloon_allowed;
+    bool ram_block_discard_allowed;
 } VFIOGroup;
 
 typedef struct VFIODMABuf {
-- 
2.25.3



^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v1 03/17] accel/kvm: Convert to ram_block_discard_set_broken()
  2020-05-06  9:49 ` David Hildenbrand
@ 2020-05-06  9:49   ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-06  9:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin,
	David Hildenbrand

Without KVM_CAP_SYNC_MMU, discarding memory does not work as expected. At
the time kvm_init() is called, we cannot have anyone active that relies on
discards to work properly.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 accel/kvm/kvm-all.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 439a4efe52..33421184ac 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -40,7 +40,6 @@
 #include "trace.h"
 #include "hw/irq.h"
 #include "sysemu/sev.h"
-#include "sysemu/balloon.h"
 #include "qapi/visitor.h"
 #include "qapi/qapi-types-common.h"
 #include "qapi/qapi-visit-common.h"
@@ -2107,7 +2106,7 @@ static int kvm_init(MachineState *ms)
 
     s->sync_mmu = !!kvm_vm_check_extension(kvm_state, KVM_CAP_SYNC_MMU);
     if (!s->sync_mmu) {
-        qemu_balloon_inhibit(true);
+        g_assert(!ram_block_discard_set_broken(true));
     }
 
     return 0;
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v1 04/17] s390x/pv: Convert to ram_block_discard_set_broken()
  2020-05-06  9:49 ` David Hildenbrand
@ 2020-05-06  9:49   ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-06  9:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin,
	David Hildenbrand, Cornelia Huck, Halil Pasic,
	Christian Borntraeger, Janosch Frank

Discarding RAM does not work as expected with protected VMs. Let's
switch to ram_block_discard_set_broken() for now, as we want to get rid
of qemu_balloon_inhibit(). Note that it will currently never fail, but
might fail in the future with new technologies (e.g., virtio-mem).
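
The error handling this implies in s390_machine_protect() can be sketched
as follows: every step that succeeded has to be undone in reverse order
when a later step fails. This is a stand-alone model, not the QEMU code.
All names (step_set_broken(), step_add_blocker(), step_enable(),
model_protect(), the fail_at knob) are hypothetical, and the sketch uses
goto labels where the patch repeats the cleanup calls inline.

```c
#include <assert.h>
#include <stdbool.h>

int broken_votes;     /* stands in for the discard "broken" counter */
bool blocker_added;   /* stands in for the migration blocker */
int fail_at;          /* test knob: step that should fail (0 = none) */

static int step_set_broken(bool on)
{
    if (on) {
        if (fail_at == 1) {
            return -1;
        }
        broken_votes++;
    } else {
        assert(broken_votes > 0);
        broken_votes--;
    }
    return 0;
}

static int step_add_blocker(void)
{
    if (fail_at == 2) {
        return -1;
    }
    blocker_added = true;
    return 0;
}

static void step_del_blocker(void)
{
    blocker_added = false;
}

static int step_enable(void)
{
    return fail_at == 3 ? -1 : 0;
}

int model_protect(void)
{
    int rc = step_set_broken(true);
    if (rc) {
        return rc;              /* nothing to undo yet */
    }
    rc = step_add_blocker();
    if (rc) {
        goto undo_broken;
    }
    rc = step_enable();
    if (rc) {
        goto undo_blocker;
    }
    return 0;

undo_blocker:
    step_del_blocker();
undo_broken:
    step_set_broken(false);
    return rc;
}

void model_unprotect(void)
{
    step_del_blocker();
    step_set_broken(false);
}
```

Whichever step fails, the model ends up with no vote and no blocker left
behind, which is the property the converted code has to keep.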

Cc: Richard Henderson <rth@twiddle.net>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Halil Pasic <pasic@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/s390x/s390-virtio-ccw.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c
index 45292fb5a8..883ea392bc 100644
--- a/hw/s390x/s390-virtio-ccw.c
+++ b/hw/s390x/s390-virtio-ccw.c
@@ -43,7 +43,6 @@
 #include "hw/qdev-properties.h"
 #include "hw/s390x/tod.h"
 #include "sysemu/sysemu.h"
-#include "sysemu/balloon.h"
 #include "hw/s390x/pv.h"
 #include "migration/blocker.h"
 
@@ -329,7 +328,7 @@ static void s390_machine_unprotect(S390CcwMachineState *ms)
     ms->pv = false;
     migrate_del_blocker(pv_mig_blocker);
     error_free_or_abort(&pv_mig_blocker);
-    qemu_balloon_inhibit(false);
+    ram_block_discard_set_broken(false);
 }
 
 static int s390_machine_protect(S390CcwMachineState *ms)
@@ -338,17 +337,22 @@ static int s390_machine_protect(S390CcwMachineState *ms)
     int rc;
 
    /*
-    * Ballooning on protected VMs needs support in the guest for
-    * sharing and unsharing balloon pages. Block ballooning for
-    * now, until we have a solution to make at least Linux guests
-    * either support it or fail gracefully.
+    * Discarding of memory in RAM blocks does not work as expected with
+    * protected VMs. Sharing and unsharing pages would be required. Mark it as
+    * broken for now, until we have a solution to make at least Linux
+    * guests either support it (e.g., virtio-balloon) or fail gracefully.
     */
-    qemu_balloon_inhibit(true);
+    rc = ram_block_discard_set_broken(true);
+    if (rc) {
+        error_report("protected VMs: cannot set discarding of RAM broken");
+        return rc;
+    }
+
     error_setg(&pv_mig_blocker,
                "protected VMs are currently not migrateable.");
     rc = migrate_add_blocker(pv_mig_blocker, &local_err);
     if (rc) {
-        qemu_balloon_inhibit(false);
+        ram_block_discard_set_broken(false);
         error_report_err(local_err);
         error_free_or_abort(&pv_mig_blocker);
         return rc;
@@ -357,7 +361,7 @@ static int s390_machine_protect(S390CcwMachineState *ms)
     /* Create SE VM */
     rc = s390_pv_vm_enable();
     if (rc) {
-        qemu_balloon_inhibit(false);
+        ram_block_discard_set_broken(false);
         migrate_del_blocker(pv_mig_blocker);
         error_free_or_abort(&pv_mig_blocker);
         return rc;
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v1 05/17] virtio-balloon: Rip out qemu_balloon_inhibit()
  2020-05-06  9:49 ` David Hildenbrand
@ 2020-05-06  9:49   ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-06  9:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin,
	David Hildenbrand, Juan Quintela

The only remaining special case is postcopy. It cannot handle
concurrent discards yet, which would result in requesting already sent
pages from the source. Special-case it in virtio-balloon instead.
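
The resulting check boils down to a half-open range test on the postcopy
state. A minimal stand-alone sketch, assuming the states are ordered so
that everything from "discard" up to (but excluding) "end" means an
incoming postcopy is in flight: enum model_ps and model_balloon_inhibited()
are hypothetical, and the states between DISCARD and END are assumptions
(the patch itself only names POSTCOPY_INCOMING_DISCARD and
POSTCOPY_INCOMING_END).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for PostcopyState; only the ordering matters. */
enum model_ps {
    MODEL_PS_NONE,
    MODEL_PS_ADVISE,
    MODEL_PS_DISCARD,
    MODEL_PS_LISTENING,
    MODEL_PS_RUNNING,
    MODEL_PS_END,
};

/*
 * The balloon must not discard if discards are globally marked broken, or
 * while an incoming postcopy is between its discard phase and its end.
 */
bool model_balloon_inhibited(bool discard_broken, enum model_ps ps)
{
    assert(ps <= MODEL_PS_END);
    return discard_broken ||
           (ps >= MODEL_PS_DISCARD && ps < MODEL_PS_END);
}
```

Deriving the answer from the current state is what makes the dedicated
postcopy inhibit vote, and its "last caller wins" bookkeeping, unnecessary.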

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Juan Quintela <quintela@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 balloon.c                  | 18 ------------------
 hw/virtio/virtio-balloon.c | 12 +++++++++++-
 include/sysemu/balloon.h   |  2 --
 migration/postcopy-ram.c   | 23 -----------------------
 4 files changed, 11 insertions(+), 44 deletions(-)

diff --git a/balloon.c b/balloon.c
index c49f57c27b..354408c6ea 100644
--- a/balloon.c
+++ b/balloon.c
@@ -36,24 +36,6 @@
 static QEMUBalloonEvent *balloon_event_fn;
 static QEMUBalloonStatus *balloon_stat_fn;
 static void *balloon_opaque;
-static int balloon_inhibit_count;
-
-bool qemu_balloon_is_inhibited(void)
-{
-    return atomic_read(&balloon_inhibit_count) > 0 ||
-           ram_block_discard_is_broken();
-}
-
-void qemu_balloon_inhibit(bool state)
-{
-    if (state) {
-        atomic_inc(&balloon_inhibit_count);
-    } else {
-        atomic_dec(&balloon_inhibit_count);
-    }
-
-    assert(atomic_read(&balloon_inhibit_count) >= 0);
-}
 
 static bool have_balloon(Error **errp)
 {
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index a4729f7fc9..aa5b89fb47 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -29,6 +29,7 @@
 #include "trace.h"
 #include "qemu/error-report.h"
 #include "migration/misc.h"
+#include "migration/postcopy-ram.h"
 
 #include "hw/virtio/virtio-bus.h"
 #include "hw/virtio/virtio-access.h"
@@ -63,6 +64,15 @@ static bool virtio_balloon_pbp_matches(PartiallyBalloonedPage *pbp,
     return pbp->base_gpa == base_gpa;
 }
 
+static bool virtio_balloon_inhibited(void)
+{
+    PostcopyState ps = postcopy_state_get();
+
+    /* Postcopy cannot deal with concurrent discards (yet), so it's special. */
+    return ram_block_discard_is_broken() ||
+           (ps >= POSTCOPY_INCOMING_DISCARD && ps < POSTCOPY_INCOMING_END);
+}
+
 static void balloon_inflate_page(VirtIOBalloon *balloon,
                                  MemoryRegion *mr, hwaddr mr_offset,
                                  PartiallyBalloonedPage *pbp)
@@ -360,7 +370,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 
             trace_virtio_balloon_handle_output(memory_region_name(section.mr),
                                                pa);
-            if (!qemu_balloon_is_inhibited()) {
+            if (!virtio_balloon_inhibited()) {
                 if (vq == s->ivq) {
                     balloon_inflate_page(s, section.mr,
                                          section.offset_within_region, &pbp);
diff --git a/include/sysemu/balloon.h b/include/sysemu/balloon.h
index aea0c44985..20a2defe3a 100644
--- a/include/sysemu/balloon.h
+++ b/include/sysemu/balloon.h
@@ -23,7 +23,5 @@ typedef void (QEMUBalloonStatus)(void *opaque, BalloonInfo *info);
 int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
                              QEMUBalloonStatus *stat_func, void *opaque);
 void qemu_remove_balloon_handler(void *opaque);
-bool qemu_balloon_is_inhibited(void);
-void qemu_balloon_inhibit(bool state);
 
 #endif
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index a36402722b..b41a9fe2fd 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -27,7 +27,6 @@
 #include "qemu/notify.h"
 #include "qemu/rcu.h"
 #include "sysemu/sysemu.h"
-#include "sysemu/balloon.h"
 #include "qemu/error-report.h"
 #include "trace.h"
 #include "hw/boards.h"
@@ -520,20 +519,6 @@ int postcopy_ram_incoming_init(MigrationIncomingState *mis)
     return 0;
 }
 
-/*
- * Manage a single vote to the QEMU balloon inhibitor for all postcopy usage,
- * last caller wins.
- */
-static void postcopy_balloon_inhibit(bool state)
-{
-    static bool cur_state = false;
-
-    if (state != cur_state) {
-        qemu_balloon_inhibit(state);
-        cur_state = state;
-    }
-}
-
 /*
  * At the end of a migration where postcopy_ram_incoming_init was called.
  */
@@ -565,8 +550,6 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
         mis->have_fault_thread = false;
     }
 
-    postcopy_balloon_inhibit(false);
-
     if (enable_mlock) {
         if (os_mlock() < 0) {
             error_report("mlock: %s", strerror(errno));
@@ -1160,12 +1143,6 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
     }
     memset(mis->postcopy_tmp_zero_page, '\0', mis->largest_page_size);
 
-    /*
-     * Ballooning can mark pages as absent while we're postcopying
-     * that would cause false userfaults.
-     */
-    postcopy_balloon_inhibit(true);
-
     trace_postcopy_ram_enable_notify();
 
     return 0;
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v1 06/17] target/i386: sev: Use ram_block_discard_set_broken()
  2020-05-06  9:49 ` David Hildenbrand
@ 2020-05-06  9:49   ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-06  9:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin,
	David Hildenbrand

AMD SEV will pin all guest memory, so mark discarding of RAM broken. At the
time this is called, we cannot have anyone active that relies on discards
to work properly.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 target/i386/sev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/target/i386/sev.c b/target/i386/sev.c
index 846018a12d..608225f9ba 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -722,6 +722,7 @@ sev_guest_init(const char *id)
     ram_block_notifier_add(&sev_ram_notifier);
     qemu_add_machine_init_done_notifier(&sev_machine_done_notify);
     qemu_add_vm_change_state_handler(sev_vm_state_change, s);
+    g_assert(!ram_block_discard_set_broken(true));
 
     return s;
 err:
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v1 07/17] migration/rdma: Use ram_block_discard_set_broken()
  2020-05-06  9:49 ` David Hildenbrand
@ 2020-05-06  9:49   ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-06  9:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin,
	David Hildenbrand, Juan Quintela

RDMA will pin all guest memory (as documented in docs/rdma.txt). We want
to mark RAM block discards as broken. However, to keep it simple, check
ram_block_discard_is_required() instead of inhibiting discards.
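
For illustration only, a minimal sketch of the idea, assuming the two
properties ("broken" and "required") are tracked as mutually exclusive
counters. The model_* names are hypothetical and this is not the
implementation from patch #1; it only shows why a read-only "is required"
check avoids having to undo a "broken" marker once migration ends.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model, not QEMU's implementation. */
static int model_broken_cnt;    /* users that pin/copy all RAM */
static int model_required_cnt;  /* users that rely on discards working */

/* Mark discards broken (true) or drop that marker again (false). */
int model_discard_set_broken(bool state)
{
    if (!state) {
        assert(model_broken_cnt > 0);
        model_broken_cnt--;
        return 0;
    }
    if (model_required_cnt > 0) {
        return -EBUSY;  /* somebody relies on discards, refuse */
    }
    model_broken_cnt++;
    return 0;
}

/* Mark discards required (true) or drop that marker again (false). */
int model_discard_set_required(bool state)
{
    if (!state) {
        assert(model_required_cnt > 0);
        model_required_cnt--;
        return 0;
    }
    if (model_broken_cnt > 0) {
        return -EBUSY;  /* somebody pins all RAM, refuse */
    }
    model_required_cnt++;
    return 0;
}

/* Read-only check: takes no reference, so nothing has to be undone. */
bool model_discard_is_required(void)
{
    return model_required_cnt > 0;
}
```

In this model, a caller that only checks model_discard_is_required()
fails early but does not block a later user from requiring discards.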

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Juan Quintela <quintela@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 migration/rdma.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/migration/rdma.c b/migration/rdma.c
index f61587891b..029adbb950 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -29,6 +29,7 @@
 #include "qemu/sockets.h"
 #include "qemu/bitmap.h"
 #include "qemu/coroutine.h"
+#include "exec/memory.h"
 #include <sys/socket.h>
 #include <netdb.h>
 #include <arpa/inet.h>
@@ -4017,8 +4018,14 @@ void rdma_start_incoming_migration(const char *host_port, Error **errp)
     Error *local_err = NULL;
 
     trace_rdma_start_incoming_migration();
-    rdma = qemu_rdma_data_init(host_port, &local_err);
 
+    /* Avoid ram_block_discard_set_broken(), cannot change during migration. */
+    if (ram_block_discard_is_required()) {
+        error_setg(errp, "RDMA: cannot set discarding of RAM broken");
+        return;
+    }
+
+    rdma = qemu_rdma_data_init(host_port, &local_err);
     if (rdma == NULL) {
         goto err;
     }
@@ -4064,10 +4071,17 @@ void rdma_start_outgoing_migration(void *opaque,
                             const char *host_port, Error **errp)
 {
     MigrationState *s = opaque;
-    RDMAContext *rdma = qemu_rdma_data_init(host_port, errp);
     RDMAContext *rdma_return_path = NULL;
+    RDMAContext *rdma;
     int ret = 0;
 
+    /* Avoid ram_block_discard_set_broken(), cannot change during migration. */
+    if (ram_block_discard_is_required()) {
+        error_setg(errp, "RDMA: cannot set discarding of RAM broken");
+        return;
+    }
+
+    rdma = qemu_rdma_data_init(host_port, errp);
     if (rdma == NULL) {
         goto err;
     }
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v1 08/17] migration/colo: Use ram_block_discard_set_broken()
  2020-05-06  9:49 ` David Hildenbrand
@ 2020-05-06  9:49   ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-06  9:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin,
	David Hildenbrand, Hailiang Zhang, Juan Quintela

COLO will copy all memory in a RAM block, so mark discarding of RAM broken.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Hailiang Zhang <zhang.zhanghailiang@huawei.com>
Cc: Juan Quintela <quintela@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/migration/colo.h |  2 +-
 migration/migration.c    |  8 +++++++-
 migration/savevm.c       | 11 +++++++++--
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/include/migration/colo.h b/include/migration/colo.h
index 1636e6f907..768e1f04c3 100644
--- a/include/migration/colo.h
+++ b/include/migration/colo.h
@@ -25,7 +25,7 @@ void migrate_start_colo_process(MigrationState *s);
 bool migration_in_colo_state(void);
 
 /* loadvm */
-void migration_incoming_enable_colo(void);
+int migration_incoming_enable_colo(void);
 void migration_incoming_disable_colo(void);
 bool migration_incoming_colo_enabled(void);
 void *colo_process_incoming_thread(void *opaque);
diff --git a/migration/migration.c b/migration/migration.c
index 177cce9e95..f6830e4620 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -338,12 +338,18 @@ bool migration_incoming_colo_enabled(void)
 
 void migration_incoming_disable_colo(void)
 {
+    ram_block_discard_set_broken(false);
     migration_colo_enabled = false;
 }
 
-void migration_incoming_enable_colo(void)
+int migration_incoming_enable_colo(void)
 {
+    if (ram_block_discard_set_broken(true)) {
+        error_report("COLO: cannot set discarding of RAM broken");
+        return -EBUSY;
+    }
     migration_colo_enabled = true;
+    return 0;
 }
 
 void migrate_add_address(SocketAddress *address)
diff --git a/migration/savevm.c b/migration/savevm.c
index c00a6807d9..19b4f9600d 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2111,8 +2111,15 @@ static int loadvm_handle_recv_bitmap(MigrationIncomingState *mis,
 
 static int loadvm_process_enable_colo(MigrationIncomingState *mis)
 {
-    migration_incoming_enable_colo();
-    return colo_init_ram_cache();
+    int ret = migration_incoming_enable_colo();
+
+    if (!ret) {
+        ret = colo_init_ram_cache();
+        if (ret) {
+            migration_incoming_disable_colo();
+        }
+    }
+    return ret;
 }
 
 /*
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v1 09/17] linux-headers: update to contain virtio-mem
  2020-05-06  9:49 ` David Hildenbrand
@ 2020-05-06  9:49   ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-06  9:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin,
	David Hildenbrand

To be merged hopefully soon. Then, we can replace this by a proper
header sync.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/standard-headers/linux/virtio_ids.h |   1 +
 include/standard-headers/linux/virtio_mem.h | 208 ++++++++++++++++++++
 2 files changed, 209 insertions(+)
 create mode 100644 include/standard-headers/linux/virtio_mem.h

diff --git a/include/standard-headers/linux/virtio_ids.h b/include/standard-headers/linux/virtio_ids.h
index ecc27a1740..b052355ac7 100644
--- a/include/standard-headers/linux/virtio_ids.h
+++ b/include/standard-headers/linux/virtio_ids.h
@@ -44,6 +44,7 @@
 #define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
 #define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
 #define VIRTIO_ID_IOMMU        23 /* virtio IOMMU */
+#define VIRTIO_ID_MEM          24 /* virtio mem */
 #define VIRTIO_ID_FS           26 /* virtio filesystem */
 #define VIRTIO_ID_PMEM         27 /* virtio pmem */
 #define VIRTIO_ID_MAC80211_HWSIM 29 /* virtio mac80211-hwsim */
diff --git a/include/standard-headers/linux/virtio_mem.h b/include/standard-headers/linux/virtio_mem.h
new file mode 100644
index 0000000000..c28dd40870
--- /dev/null
+++ b/include/standard-headers/linux/virtio_mem.h
@@ -0,0 +1,208 @@
+/* SPDX-License-Identifier: BSD-3-Clause */
+/*
+ * Virtio Mem Device
+ *
+ * Copyright Red Hat, Inc. 2020
+ *
+ * Authors:
+ *     David Hildenbrand <david@redhat.com>
+ *
+ * This header is BSD licensed so anyone can use the definitions
+ * to implement compatible drivers/servers:
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of IBM nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#ifndef _LINUX_VIRTIO_MEM_H
+#define _LINUX_VIRTIO_MEM_H
+
+#include "standard-headers/linux/types.h"
+#include "standard-headers/linux/virtio_types.h"
+#include "standard-headers/linux/virtio_ids.h"
+#include "standard-headers/linux/virtio_config.h"
+
+/*
+ * Each virtio-mem device manages a dedicated region in physical address
+ * space. Each device can belong to a single NUMA node, multiple devices
+ * for a single NUMA node are possible. A virtio-mem device is like a
+ * "resizable DIMM" consisting of small memory blocks that can be plugged
+ * or unplugged. The device driver is responsible for (un)plugging memory
+ * blocks on demand.
+ *
+ * Virtio-mem devices can only operate on their assigned memory region in
+ * order to (un)plug memory. A device cannot (un)plug memory belonging to
+ * other devices.
+ *
+ * The "region_size" corresponds to the maximum amount of memory that can
+ * be provided by a device. The "size" corresponds to the amount of memory
+ * that is currently plugged. "requested_size" corresponds to a request
+ * from the device to the device driver to (un)plug blocks. The
+ * device driver should try to (un)plug blocks in order to reach the
+ * "requested_size". It is impossible to plug more memory than requested.
+ *
+ * The "usable_region_size" represents the memory region that can actually
+ * be used to (un)plug memory. It is always at least as big as the
+ * "requested_size" and will grow dynamically. It will only shrink when
+ * explicitly triggered (VIRTIO_MEM_REQ_UNPLUG).
+ *
+ * There are no guarantees what will happen if unplugged memory is
+ * read/written. Such memory should, in general, not be touched. E.g.,
+ * even writing might succeed, but the values will simply be discarded at
+ * random points in time.
+ *
+ * It can happen that the device cannot process a request, because it is
+ * busy. The device driver has to retry later.
+ *
+ * Usually, during system resets all memory will get unplugged, so the
+ * device driver can start with a clean state. However, in specific
+ * scenarios (if the device is busy) it can happen that the device still
+ * has memory plugged. The device driver can request to unplug all memory
+ * (VIRTIO_MEM_REQ_UNPLUG) - which might take a while to succeed if the
+ * device is busy.
+ */
+
+/* --- virtio-mem: feature bits --- */
+
+/* node_id is an ACPI PXM and is valid */
+#define VIRTIO_MEM_F_ACPI_PXM		0
+
+
+/* --- virtio-mem: guest -> host requests --- */
+
+/* request to plug memory blocks */
+#define VIRTIO_MEM_REQ_PLUG			0
+/* request to unplug memory blocks */
+#define VIRTIO_MEM_REQ_UNPLUG			1
+/* request to unplug all blocks and shrink the usable size */
+#define VIRTIO_MEM_REQ_UNPLUG_ALL		2
+/* request information about the plugged state of memory blocks */
+#define VIRTIO_MEM_REQ_STATE			3
+
+struct virtio_mem_req_plug {
+	__virtio64 addr;
+	__virtio16 nb_blocks;
+};
+
+struct virtio_mem_req_unplug {
+	__virtio64 addr;
+	__virtio16 nb_blocks;
+};
+
+struct virtio_mem_req_state {
+	__virtio64 addr;
+	__virtio16 nb_blocks;
+};
+
+struct virtio_mem_req {
+	__virtio16 type;
+	__virtio16 padding[3];
+
+	union {
+		struct virtio_mem_req_plug plug;
+		struct virtio_mem_req_unplug unplug;
+		struct virtio_mem_req_state state;
+	} u;
+};
+
+
+/* --- virtio-mem: host -> guest response --- */
+
+/*
+ * Request processed successfully, applicable for
+ * - VIRTIO_MEM_REQ_PLUG
+ * - VIRTIO_MEM_REQ_UNPLUG
+ * - VIRTIO_MEM_REQ_UNPLUG_ALL
+ * - VIRTIO_MEM_REQ_STATE
+ */
+#define VIRTIO_MEM_RESP_ACK			0
+/*
+ * Request denied - e.g. trying to plug more than requested, applicable for
+ * - VIRTIO_MEM_REQ_PLUG
+ */
+#define VIRTIO_MEM_RESP_NACK			1
+/*
+ * Request cannot be processed right now, try again later, applicable for
+ * - VIRTIO_MEM_REQ_PLUG
+ * - VIRTIO_MEM_REQ_UNPLUG
+ * - VIRTIO_MEM_REQ_UNPLUG_ALL
+ */
+#define VIRTIO_MEM_RESP_BUSY			2
+/*
+ * Error in request (e.g. addresses/alignment), applicable for
+ * - VIRTIO_MEM_REQ_PLUG
+ * - VIRTIO_MEM_REQ_UNPLUG
+ * - VIRTIO_MEM_REQ_STATE
+ */
+#define VIRTIO_MEM_RESP_ERROR			3
+
+
+/* State of memory blocks is "plugged" */
+#define VIRTIO_MEM_STATE_PLUGGED		0
+/* State of memory blocks is "unplugged" */
+#define VIRTIO_MEM_STATE_UNPLUGGED		1
+/* State of memory blocks is "mixed" */
+#define VIRTIO_MEM_STATE_MIXED			2
+
+struct virtio_mem_resp_state {
+	__virtio16 state;
+};
+
+struct virtio_mem_resp {
+	__virtio16 type;
+	__virtio16 padding[3];
+
+	union {
+		struct virtio_mem_resp_state state;
+	} u;
+};
+
+/* --- virtio-mem: configuration --- */
+
+struct virtio_mem_config {
+	/* Block size and alignment. Cannot change. */
+	uint32_t block_size;
+	/* Valid with VIRTIO_MEM_F_ACPI_PXM. Cannot change. */
+	uint16_t node_id;
+	uint16_t padding;
+	/* Start address of the memory region. Cannot change. */
+	uint64_t addr;
+	/* Region size (maximum). Cannot change. */
+	uint64_t region_size;
+	/*
+	 * Currently usable region size. Can grow up to region_size. Can
+	 * shrink due to VIRTIO_MEM_REQ_UNPLUG_ALL (in which case no config
+	 * update will be sent).
+	 */
+	uint64_t usable_region_size;
+	/*
+	 * Currently used size. Changes due to plug/unplug requests, but no
+	 * config updates will be sent.
+	 */
+	uint64_t plugged_size;
+	/* Requested size. New plug requests cannot exceed it. Can change. */
+	uint64_t requested_size;
+};
+
+#endif /* _LINUX_VIRTIO_MEM_H */
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v1 10/17] virtio-mem: Paravirtualized memory hot(un)plug
  2020-05-06  9:49 ` David Hildenbrand
@ 2020-05-06  9:49   ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-06  9:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin,
	David Hildenbrand, Eric Blake, Markus Armbruster, Igor Mammedov

This is the very basic/initial version of virtio-mem. An introduction to
virtio-mem can be found in the Linux kernel driver [1]. While it can be
used in the current state for hotplug of a smaller amount of memory, it
will heavily benefit from resizeable memory regions in the future.

Each virtio-mem device manages a memory region (provided via a memory
backend). Once the hypervisor requests a size ("requested-size"), the
guest can try to plug/unplug blocks of memory within that region in order
to reach the requested size. Initially, and after a reboot, all memory is
unplugged (except in special cases, such as a reboot during postcopy).

The guest may only try to plug/unplug blocks of memory within the usable
region size. The usable region size is a little bigger than the
requested size, to give the device driver some flexibility. The usable
region size will only grow, except on reboots or when all memory is
requested to get unplugged. The guest can never plug more memory than
requested. Unplugged memory will get zapped/discarded, as with a balloon
device.

The block size is variable, however, it is always chosen in a way such that
THP splits are avoided (e.g., 2MB). The state of each block
(plugged/unplugged) is tracked in a bitmap.
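
The rules above can be sketched roughly as follows. This is a simplified
model and not the device code from this patch: the model_* names, the
fixed 2MB block size, the single 64-bit bitmap word, and the errno-style
return values are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_BLOCK_SIZE (UINT64_C(2) * 1024 * 1024) /* assumed 2MB blocks */
#define MODEL_NB_BLOCKS  64                          /* one bitmap word */

struct model_vmem {
    uint64_t requested_size;     /* set by the hypervisor */
    uint64_t usable_region_size; /* guest may only (un)plug below this */
    uint64_t size;               /* currently plugged */
    uint64_t bitmap;             /* bit i set: block i is plugged */
};

/* Reject empty, misaligned, or out-of-range requests. */
static int model_check_range(const struct model_vmem *v, uint64_t offset,
                             uint16_t nb_blocks)
{
    uint64_t len = nb_blocks * MODEL_BLOCK_SIZE;

    if (nb_blocks == 0 || offset % MODEL_BLOCK_SIZE) {
        return -EINVAL;
    }
    if (v->usable_region_size > MODEL_NB_BLOCKS * MODEL_BLOCK_SIZE) {
        return -EINVAL;
    }
    if (offset > v->usable_region_size ||
        len > v->usable_region_size - offset) {
        return -EINVAL;
    }
    return 0;
}

/* True if all blocks in [first, first + nb) are in the given state. */
static bool model_range_is(const struct model_vmem *v, uint64_t first,
                           uint16_t nb, bool plugged)
{
    for (uint64_t i = first; i < first + nb; i++) {
        if ((((v->bitmap >> i) & 1) != 0) != plugged) {
            return false;
        }
    }
    return true;
}

int model_plug(struct model_vmem *v, uint64_t offset, uint16_t nb_blocks)
{
    uint64_t len = nb_blocks * MODEL_BLOCK_SIZE;
    uint64_t first = offset / MODEL_BLOCK_SIZE;
    int ret = model_check_range(v, offset, nb_blocks);

    if (ret) {
        return ret;
    }
    /* The guest can never plug more memory than requested. */
    if (len > v->requested_size || v->size > v->requested_size - len) {
        return -EPERM;
    }
    if (!model_range_is(v, first, nb_blocks, false)) {
        return -EINVAL;
    }
    for (uint64_t i = first; i < first + nb_blocks; i++) {
        v->bitmap |= UINT64_C(1) << i;
    }
    v->size += len;
    return 0;
}

int model_unplug(struct model_vmem *v, uint64_t offset, uint16_t nb_blocks)
{
    uint64_t len = nb_blocks * MODEL_BLOCK_SIZE;
    uint64_t first = offset / MODEL_BLOCK_SIZE;
    int ret = model_check_range(v, offset, nb_blocks);

    if (ret) {
        return ret;
    }
    if (!model_range_is(v, first, nb_blocks, true)) {
        return -EINVAL;
    }
    assert(v->size >= len); /* plugged blocks are always accounted for */
    for (uint64_t i = first; i < first + nb_blocks; i++) {
        v->bitmap &= ~(UINT64_C(1) << i);
    }
    v->size -= len; /* a real device would also discard the memory here */
    return 0;
}
```

A real device needs a bitmap sized to the region and would answer with
the virtio-mem response codes rather than errno values.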

As virtio-mem devices (e.g., virtio-mem-pci) will be memory devices, we now
expose "VirtioMEMDeviceInfo" via "query-memory-devices".

--------------------------------------------------------------------------

There are two important follow-up items that are in the works:
1. Resizeable memory regions: Use resizeable allocations/RAM blocks to
   grow/shrink along with the usable region size. This avoids creating
   initially very big VMAs, RAM blocks, and KVM slots.
2. Protection of unplugged memory: Make sure the guest cannot actually
   make use of unplugged memory.

Other follow-up items that are in the works:
1. Exclude unplugged memory during migration (via precopy notifier).
2. Handle remapping of memory.
3. Support for other architectures.

--------------------------------------------------------------------------

Example usage (virtio-mem-pci is introduced in follow-up patches):

Start QEMU with two virtio-mem devices (one per NUMA node):
 $ qemu-system-x86_64 -m 4G,maxmem=20G \
  -smp sockets=2,cores=2 \
  -numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3 \
  [...]
  -object memory-backend-ram,id=mem0,size=8G \
  -device virtio-mem-pci,id=vm0,memdev=mem0,node=0,requested-size=0M \
  -object memory-backend-ram,id=mem1,size=8G \
  -device virtio-mem-pci,id=vm1,memdev=mem1,node=1,requested-size=1G

Query the configuration:
 (qemu) info memory-devices
 Memory device [virtio-mem]: "vm0"
   memaddr: 0x140000000
   node: 0
   requested-size: 0
   size: 0
   max-size: 8589934592
   block-size: 2097152
   memdev: /objects/mem0
 Memory device [virtio-mem]: "vm1"
   memaddr: 0x340000000
   node: 1
   requested-size: 1073741824
   size: 1073741824
   max-size: 8589934592
   block-size: 2097152
   memdev: /objects/mem1

Add some memory to node 0:
 (qemu) qom-set vm0 requested-size 500M

Remove some memory from node 1:
 (qemu) qom-set vm1 requested-size 200M

Query the configuration again:
 (qemu) info memory-devices
 Memory device [virtio-mem]: "vm0"
   memaddr: 0x140000000
   node: 0
   requested-size: 524288000
   size: 524288000
   max-size: 8589934592
   block-size: 2097152
   memdev: /objects/mem0
 Memory device [virtio-mem]: "vm1"
   memaddr: 0x340000000
   node: 1
   requested-size: 209715200
   size: 209715200
   max-size: 8589934592
   block-size: 2097152
   memdev: /objects/mem1
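
The byte values in the output follow from the size suffixes being powers of
1024 and from the 2 MiB block size. A small sketch of that arithmetic, with
helper names assumed purely for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define MiB (1024ULL * 1024)
#define BLOCK_SIZE (2 * MiB) /* block-size reported above: 2097152 */

/* Convert a size given in MiB (e.g., "500M") to bytes. */
static uint64_t mib_to_bytes(uint64_t mib)
{
    return mib * MiB;
}

/* Number of blocks covering a size; sizes must be block-aligned. */
static uint64_t blocks_for_size(uint64_t bytes)
{
    assert(bytes % BLOCK_SIZE == 0);
    return bytes / BLOCK_SIZE;
}
```

For example, 500M is 524288000 bytes (250 blocks) and 200M is 209715200
bytes (100 blocks), so shrinking vm1 from 1G (512 blocks) to 200M requires
the guest to unplug 412 blocks.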

[1] https://lkml.kernel.org/r/20200311171422.10484-1-david@redhat.com

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Eric Blake <eblake@redhat.com>
Cc: Markus Armbruster <armbru@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/Kconfig              |  11 +
 hw/virtio/Makefile.objs        |   1 +
 hw/virtio/virtio-mem.c         | 762 +++++++++++++++++++++++++++++++++
 include/hw/virtio/virtio-mem.h |  80 ++++
 qapi/misc.json                 |  39 +-
 5 files changed, 892 insertions(+), 1 deletion(-)
 create mode 100644 hw/virtio/virtio-mem.c
 create mode 100644 include/hw/virtio/virtio-mem.h

diff --git a/hw/virtio/Kconfig b/hw/virtio/Kconfig
index 83122424fa..0eda25c4e1 100644
--- a/hw/virtio/Kconfig
+++ b/hw/virtio/Kconfig
@@ -47,3 +47,14 @@ config VIRTIO_PMEM
     depends on VIRTIO
     depends on VIRTIO_PMEM_SUPPORTED
     select MEM_DEVICE
+
+config VIRTIO_MEM_SUPPORTED
+    bool
+
+config VIRTIO_MEM
+    bool
+    default y
+    depends on VIRTIO
+    depends on LINUX
+    depends on VIRTIO_MEM_SUPPORTED
+    select MEM_DEVICE
diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs
index 4e4d39a0a4..7df70e977e 100644
--- a/hw/virtio/Makefile.objs
+++ b/hw/virtio/Makefile.objs
@@ -18,6 +18,7 @@ common-obj-$(call land,$(CONFIG_VIRTIO_PMEM),$(CONFIG_VIRTIO_PCI)) += virtio-pme
 obj-$(call land,$(CONFIG_VHOST_USER_FS),$(CONFIG_VIRTIO_PCI)) += vhost-user-fs-pci.o
 obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
 obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock.o
+obj-$(CONFIG_VIRTIO_MEM) += virtio-mem.o
 
 ifeq ($(CONFIG_VIRTIO_PCI),y)
 obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock-pci.o
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
new file mode 100644
index 0000000000..e25b2c74f2
--- /dev/null
+++ b/hw/virtio/virtio-mem.c
@@ -0,0 +1,762 @@
+/*
+ * Virtio MEM device
+ *
+ * Copyright (C) 2020 Red Hat, Inc.
+ *
+ * Authors:
+ *  David Hildenbrand <david@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "qemu/iov.h"
+#include "qemu/cutils.h"
+#include "qemu/error-report.h"
+#include "qemu/units.h"
+#include "sysemu/numa.h"
+#include "sysemu/sysemu.h"
+#include "sysemu/reset.h"
+#include "hw/virtio/virtio.h"
+#include "hw/virtio/virtio-bus.h"
+#include "hw/virtio/virtio-access.h"
+#include "hw/virtio/virtio-mem.h"
+#include "qapi/error.h"
+#include "qapi/visitor.h"
+#include "exec/ram_addr.h"
+#include "migration/misc.h"
+#include "migration/postcopy-ram.h"
+#include "hw/boards.h"
+#include "hw/qdev-properties.h"
+#include "config-devices.h"
+
+/*
+ * Use QEMU_VMALLOC_ALIGN, so no THP will have to be split when unplugging
+ * memory (e.g., 2MB on x86_64).
+ */
+#define VIRTIO_MEM_MIN_BLOCK_SIZE QEMU_VMALLOC_ALIGN
+/*
+ * Size the usable region bigger than the requested size if possible. Esp.
+ * Linux guests will only add (aligned) memory blocks in case they fully
+ * fit into the usable region, but plug+online only a subset of the pages.
+ * The memory block size corresponds mostly to the section size.
+ *
+ * This allows e.g., to add 20MB with a section size of 128MB on x86_64, and
+ * a section size of 1GB on arm64 (as long as the start address is properly
+ * aligned, similar to ordinary DIMMs).
+ *
+ * We can change this at any time and maybe even make it configurable if
+ * necessary (as the section size can change). But it's more likely that the
+ * section size will rather get smaller and not bigger over time.
+ */
+#if defined(__x86_64__)
+#define VIRTIO_MEM_USABLE_EXTENT (2 * (128 * MiB))
+#else
+#error VIRTIO_MEM_USABLE_EXTENT not defined
+#endif
+
+static bool virtio_mem_discard_inhibited(void)
+{
+    PostcopyState ps = postcopy_state_get();
+
+    /* Postcopy cannot deal with concurrent discards (yet), so it's special. */
+    return ps >= POSTCOPY_INCOMING_DISCARD && ps < POSTCOPY_INCOMING_END;
+}
+
+static bool virtio_mem_test_bitmap(VirtIOMEM *vmem, uint64_t start_gpa,
+                                   uint64_t size, bool plug)
+{
+    uint64_t bit = (start_gpa - vmem->addr) / vmem->block_size;
+
+    g_assert(QEMU_IS_ALIGNED(start_gpa, vmem->block_size));
+    g_assert(QEMU_IS_ALIGNED(size, vmem->block_size));
+    g_assert(vmem->bitmap);
+
+    while (size) {
+        g_assert(bit < vmem->bitmap_size);
+
+        if (plug && !test_bit(bit, vmem->bitmap)) {
+            return false;
+        } else if (!plug && test_bit(bit, vmem->bitmap)) {
+            return false;
+        }
+        size -= vmem->block_size;
+        bit++;
+    }
+    return true;
+}
+
+static void virtio_mem_set_bitmap(VirtIOMEM *vmem, uint64_t start_gpa,
+                                  uint64_t size, bool plug)
+{
+    const uint64_t bit = (start_gpa - vmem->addr) / vmem->block_size;
+    const uint64_t nbits = size / vmem->block_size;
+
+    g_assert(QEMU_IS_ALIGNED(start_gpa, vmem->block_size));
+    g_assert(QEMU_IS_ALIGNED(size, vmem->block_size));
+    g_assert(vmem->bitmap);
+
+    if (plug) {
+        bitmap_set(vmem->bitmap, bit, nbits);
+    } else {
+        bitmap_clear(vmem->bitmap, bit, nbits);
+    }
+}
+
+static void virtio_mem_send_response(VirtIOMEM *vmem, VirtQueueElement *elem,
+                                     struct virtio_mem_resp *resp)
+{
+    VirtIODevice *vdev = VIRTIO_DEVICE(vmem);
+    VirtQueue *vq = vmem->vq;
+
+    iov_from_buf(elem->in_sg, elem->in_num, 0, resp, sizeof(*resp));
+
+    virtqueue_push(vq, elem, sizeof(*resp));
+    virtio_notify(vdev, vq);
+}
+
+static void virtio_mem_send_response_simple(VirtIOMEM *vmem,
+                                            VirtQueueElement *elem,
+                                            uint16_t type)
+{
+    VirtIODevice *vdev = VIRTIO_DEVICE(vmem);
+    struct virtio_mem_resp resp = {};
+
+    virtio_stw_p(vdev, &resp.type, type);
+    virtio_mem_send_response(vmem, elem, &resp);
+}
+
+static void virtio_mem_bad_request(VirtIOMEM *vmem, const char *msg)
+{
+    virtio_error(VIRTIO_DEVICE(vmem), "virtio-mem protocol violation: %s", msg);
+}
+
+static bool virtio_mem_valid_range(VirtIOMEM *vmem, uint64_t gpa, uint64_t size)
+{
+    if (!QEMU_IS_ALIGNED(gpa, vmem->block_size)) {
+        return false;
+    }
+    if (gpa + size < gpa || size == 0) {
+        return false;
+    }
+    if (gpa < vmem->addr || gpa >= vmem->addr + vmem->usable_region_size) {
+        return false;
+    }
+    if (gpa + size > vmem->addr + vmem->usable_region_size) {
+        return false;
+    }
+    return true;
+}
+
+static int virtio_mem_set_block_state(VirtIOMEM *vmem, uint64_t start_gpa,
+                                      uint64_t size, bool plug)
+{
+    const uint64_t offset = start_gpa - vmem->addr;
+    int ret;
+
+    if (!plug) {
+        if (virtio_mem_discard_inhibited()) {
+            return -EBUSY;
+        }
+        /* Note: Discarding should never fail at this point. */
+        ret = ram_block_discard_range(vmem->memdev->mr.ram_block, offset, size);
+        if (ret) {
+            return -EBUSY;
+        }
+    }
+    virtio_mem_set_bitmap(vmem, start_gpa, size, plug);
+    return 0;
+}
+
+static int virtio_mem_state_change_request(VirtIOMEM *vmem, uint64_t gpa,
+                                           uint16_t nb_blocks, bool plug)
+{
+    const uint64_t size = (uint64_t)nb_blocks * vmem->block_size;
+    int ret;
+
+    if (!virtio_mem_valid_range(vmem, gpa, size)) {
+        return VIRTIO_MEM_RESP_ERROR;
+    }
+
+    if (plug && (vmem->size + size > vmem->requested_size)) {
+        return VIRTIO_MEM_RESP_NACK;
+    }
+
+    /* test if really all blocks are in the opposite state */
+    if (!virtio_mem_test_bitmap(vmem, gpa, size, !plug)) {
+        return VIRTIO_MEM_RESP_ERROR;
+    }
+
+    ret = virtio_mem_set_block_state(vmem, gpa, size, plug);
+    if (ret) {
+        return VIRTIO_MEM_RESP_BUSY;
+    }
+    if (plug) {
+        vmem->size += size;
+    } else {
+        vmem->size -= size;
+    }
+    return VIRTIO_MEM_RESP_ACK;
+}
+
+static void virtio_mem_plug_request(VirtIOMEM *vmem, VirtQueueElement *elem,
+                                    struct virtio_mem_req *req)
+{
+    const uint64_t gpa = le64_to_cpu(req->u.plug.addr);
+    const uint16_t nb_blocks = le16_to_cpu(req->u.plug.nb_blocks);
+    uint16_t type;
+
+    type = virtio_mem_state_change_request(vmem, gpa, nb_blocks, true);
+    virtio_mem_send_response_simple(vmem, elem, type);
+}
+
+static void virtio_mem_unplug_request(VirtIOMEM *vmem, VirtQueueElement *elem,
+                                      struct virtio_mem_req *req)
+{
+    const uint64_t gpa = le64_to_cpu(req->u.unplug.addr);
+    const uint16_t nb_blocks = le16_to_cpu(req->u.unplug.nb_blocks);
+    uint16_t type;
+
+    type = virtio_mem_state_change_request(vmem, gpa, nb_blocks, false);
+    virtio_mem_send_response_simple(vmem, elem, type);
+}
+
+static void virtio_mem_resize_usable_region(VirtIOMEM *vmem,
+                                            uint64_t requested_size,
+                                            bool can_shrink)
+{
+    uint64_t newsize = MIN(memory_region_size(&vmem->memdev->mr),
+                           requested_size + VIRTIO_MEM_USABLE_EXTENT);
+
+    /* We must only grow while the guest is running. */
+    if (newsize < vmem->usable_region_size && !can_shrink) {
+        return;
+    }
+
+    vmem->usable_region_size = newsize;
+}
+
+static int virtio_mem_unplug_all(VirtIOMEM *vmem)
+{
+    RAMBlock *rb = vmem->memdev->mr.ram_block;
+    int ret;
+
+    if (virtio_mem_discard_inhibited()) {
+        return -EBUSY;
+    }
+
+    ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb));
+    if (ret) {
+        /* Note: Discarding should never fail at this point. */
+        return -EBUSY;
+    }
+    bitmap_clear(vmem->bitmap, 0, vmem->bitmap_size);
+    vmem->size = 0;
+
+    virtio_mem_resize_usable_region(vmem, vmem->requested_size, true);
+    return 0;
+}
+
+static void virtio_mem_unplug_all_request(VirtIOMEM *vmem,
+                                          VirtQueueElement *elem)
+{
+
+    if (virtio_mem_unplug_all(vmem)) {
+        virtio_mem_send_response_simple(vmem, elem, VIRTIO_MEM_RESP_BUSY);
+    } else {
+        virtio_mem_send_response_simple(vmem, elem, VIRTIO_MEM_RESP_ACK);
+    }
+}
+
+static void virtio_mem_state_request(VirtIOMEM *vmem, VirtQueueElement *elem,
+                                     struct virtio_mem_req *req)
+{
+    const uint64_t gpa = le64_to_cpu(req->u.state.addr);
+    const uint16_t nb_blocks = le16_to_cpu(req->u.state.nb_blocks);
+    const uint64_t size = (uint64_t)nb_blocks * vmem->block_size;
+    VirtIODevice *vdev = VIRTIO_DEVICE(vmem);
+    struct virtio_mem_resp resp = {};
+
+    if (!virtio_mem_valid_range(vmem, gpa, size)) {
+        virtio_mem_send_response_simple(vmem, elem, VIRTIO_MEM_RESP_ERROR);
+        return;
+    }
+
+    virtio_stw_p(vdev, &resp.type, VIRTIO_MEM_RESP_ACK);
+    if (virtio_mem_test_bitmap(vmem, gpa, size, true)) {
+        virtio_stw_p(vdev, &resp.u.state.state, VIRTIO_MEM_STATE_PLUGGED);
+    } else if (virtio_mem_test_bitmap(vmem, gpa, size, false)) {
+        virtio_stw_p(vdev, &resp.u.state.state, VIRTIO_MEM_STATE_UNPLUGGED);
+    } else {
+        virtio_stw_p(vdev, &resp.u.state.state, VIRTIO_MEM_STATE_MIXED);
+    }
+    virtio_mem_send_response(vmem, elem, &resp);
+}
+
+static void virtio_mem_handle_request(VirtIODevice *vdev, VirtQueue *vq)
+{
+    const int len = sizeof(struct virtio_mem_req);
+    VirtIOMEM *vmem = VIRTIO_MEM(vdev);
+    VirtQueueElement *elem;
+    struct virtio_mem_req req;
+    uint64_t type;
+
+    while (true) {
+        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!elem) {
+            return;
+        }
+
+        if (iov_to_buf(elem->out_sg, elem->out_num, 0, &req, len) < len) {
+            virtio_mem_bad_request(vmem, "invalid request size");
+            g_free(elem);
+            return;
+        }
+
+        if (iov_size(elem->in_sg, elem->in_num) <
+            sizeof(struct virtio_mem_resp)) {
+            virtio_mem_bad_request(vmem, "not enough space for response");
+            g_free(elem);
+            return;
+        }
+
+        type = le16_to_cpu(req.type);
+        switch (type) {
+        case VIRTIO_MEM_REQ_PLUG:
+            virtio_mem_plug_request(vmem, elem, &req);
+            break;
+        case VIRTIO_MEM_REQ_UNPLUG:
+            virtio_mem_unplug_request(vmem, elem, &req);
+            break;
+        case VIRTIO_MEM_REQ_UNPLUG_ALL:
+            virtio_mem_unplug_all_request(vmem, elem);
+            break;
+        case VIRTIO_MEM_REQ_STATE:
+            virtio_mem_state_request(vmem, elem, &req);
+            break;
+        default:
+            virtio_mem_bad_request(vmem, "unknown request type");
+            g_free(elem);
+            return;
+        }
+
+        g_free(elem);
+    }
+}
+
+static void virtio_mem_get_config(VirtIODevice *vdev, uint8_t *config_data)
+{
+    VirtIOMEM *vmem = VIRTIO_MEM(vdev);
+    struct virtio_mem_config *config = (void *) config_data;
+
+    config->block_size = cpu_to_le32(vmem->block_size);
+    config->node_id = cpu_to_le16(vmem->node);
+    config->requested_size = cpu_to_le64(vmem->requested_size);
+    config->plugged_size = cpu_to_le64(vmem->size);
+    config->addr = cpu_to_le64(vmem->addr);
+    config->region_size = cpu_to_le64(memory_region_size(&vmem->memdev->mr));
+    config->usable_region_size = cpu_to_le64(vmem->usable_region_size);
+}
+
+static uint64_t virtio_mem_get_features(VirtIODevice *vdev, uint64_t features,
+                                        Error **errp)
+{
+    MachineState *ms = MACHINE(qdev_get_machine());
+
+    if (ms->numa_state) {
+#if defined(CONFIG_ACPI)
+        virtio_add_feature(&features, VIRTIO_MEM_F_ACPI_PXM);
+#endif
+    }
+    return features;
+}
+
+static void virtio_mem_system_reset(void *opaque)
+{
+    VirtIOMEM *vmem = VIRTIO_MEM(opaque);
+
+    /*
+     * During usual resets, we will unplug all memory and shrink the usable
+     * region size. This is, however, not possible in all scenarios. Then,
+     * the guest has to deal with this manually (VIRTIO_MEM_REQ_UNPLUG_ALL).
+     */
+    virtio_mem_unplug_all(vmem);
+}
+
+static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
+{
+    MachineState *ms = MACHINE(qdev_get_machine());
+    int nb_numa_nodes = ms->numa_state ? ms->numa_state->num_nodes : 0;
+    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+    VirtIOMEM *vmem = VIRTIO_MEM(dev);
+    uint64_t page_size;
+    RAMBlock *rb;
+    int ret;
+
+    if (!vmem->memdev) {
+        error_setg(errp, "'%s' property must be set", VIRTIO_MEM_MEMDEV_PROP);
+        return;
+    } else if (host_memory_backend_is_mapped(vmem->memdev)) {
+        char *path = object_get_canonical_path_component(OBJECT(vmem->memdev));
+
+        error_setg(errp, "can't use already busy memdev: %s", path);
+        g_free(path);
+        return;
+    }
+
+    if ((nb_numa_nodes && vmem->node >= nb_numa_nodes) ||
+        (!nb_numa_nodes && vmem->node)) {
+        error_setg(errp, "Property '%s' has value '%" PRIu32
+                   "', which exceeds the number of numa nodes: %d",
+                   VIRTIO_MEM_NODE_PROP, vmem->node,
+                   nb_numa_nodes ? nb_numa_nodes : 1);
+        return;
+    }
+
+    if (enable_mlock) {
+        error_setg(errp, "not compatible with mlock yet");
+        return;
+    }
+
+    if (!memory_region_is_ram(&vmem->memdev->mr) ||
+        memory_region_is_rom(&vmem->memdev->mr) ||
+        !vmem->memdev->mr.ram_block) {
+        error_setg(errp, "unsupported memdev");
+        return;
+    }
+
+    rb = vmem->memdev->mr.ram_block;
+    page_size = qemu_ram_pagesize(rb);
+
+    if (vmem->block_size < page_size) {
+        error_setg(errp, "'%s' has to be at least the page size (0x%"
+                   PRIx64 ")", VIRTIO_MEM_BLOCK_SIZE_PROP, page_size);
+        return;
+    } else if (!QEMU_IS_ALIGNED(vmem->requested_size, vmem->block_size)) {
+        error_setg(errp, "'%s' has to be multiples of '%s' (0x%" PRIx32
+                   ")", VIRTIO_MEM_REQUESTED_SIZE_PROP,
+                   VIRTIO_MEM_BLOCK_SIZE_PROP, vmem->block_size);
+        return;
+    } else if (!QEMU_IS_ALIGNED(memory_region_size(&vmem->memdev->mr),
+                                vmem->block_size)) {
+        error_setg(errp, "'%s' backend size has to be multiples of '%s' (0x%"
+                   PRIx32 ")", VIRTIO_MEM_MEMDEV_PROP,
+                   VIRTIO_MEM_BLOCK_SIZE_PROP, vmem->block_size);
+        return;
+    }
+
+    if (ram_block_discard_set_required(true)) {
+        error_setg(errp, "Discarding RAM is marked broken.");
+        return;
+    }
+
+    ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb));
+    if (ret) {
+        /* Note: Discarding should never fail at this point. */
+        error_setg_errno(errp, -ret, "Discarding RAM failed.");
+        ram_block_discard_set_required(false);
+        return;
+    }
+
+    virtio_mem_resize_usable_region(vmem, vmem->requested_size, true);
+
+    vmem->bitmap_size = memory_region_size(&vmem->memdev->mr) /
+                        vmem->block_size;
+    vmem->bitmap = bitmap_new(vmem->bitmap_size);
+
+    virtio_init(vdev, TYPE_VIRTIO_MEM, VIRTIO_ID_MEM,
+                sizeof(struct virtio_mem_config));
+    vmem->vq = virtio_add_queue(vdev, 128, virtio_mem_handle_request);
+
+    host_memory_backend_set_mapped(vmem->memdev, true);
+    vmstate_register_ram(&vmem->memdev->mr, DEVICE(vmem));
+    qemu_register_reset(virtio_mem_system_reset, vmem);
+    return;
+}
+
+static void virtio_mem_device_unrealize(DeviceState *dev, Error **errp)
+{
+    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+    VirtIOMEM *vmem = VIRTIO_MEM(dev);
+
+    qemu_unregister_reset(virtio_mem_system_reset, vmem);
+    vmstate_unregister_ram(&vmem->memdev->mr, DEVICE(vmem));
+    host_memory_backend_set_mapped(vmem->memdev, false);
+    virtio_del_queue(vdev, 0);
+    virtio_cleanup(vdev);
+    g_free(vmem->bitmap);
+    ram_block_discard_set_required(false);
+}
+
+static int virtio_mem_pre_save(void *opaque)
+{
+    VirtIOMEM *vmem = VIRTIO_MEM(opaque);
+
+    vmem->migration_addr = vmem->addr;
+    vmem->migration_block_size = vmem->block_size;
+
+    return 0;
+}
+
+static int virtio_mem_restore_unplugged(VirtIOMEM *vmem)
+{
+    unsigned long bit;
+    uint64_t offset;
+    int ret;
+
+    /* TODO: Better postcopy handling - defer to postcopy end. */
+    if (virtio_mem_discard_inhibited()) {
+        return 0;
+    }
+
+    bit = find_first_zero_bit(vmem->bitmap, vmem->bitmap_size);
+    while (bit < vmem->bitmap_size) {
+        offset = bit * vmem->block_size;
+
+        if (offset + vmem->block_size >
+            memory_region_size(&vmem->memdev->mr)) {
+            break;
+        }
+        /* Note: Discarding should never fail at this point. */
+        ret = ram_block_discard_range(vmem->memdev->mr.ram_block, offset,
+                                      vmem->block_size);
+        if (ret) {
+            return -EINVAL;
+        }
+        bit = find_next_zero_bit(vmem->bitmap, vmem->bitmap_size, bit + 1);
+    }
+    return 0;
+}
+
+static int virtio_mem_post_load(void *opaque, int version_id)
+{
+    VirtIOMEM *vmem = VIRTIO_MEM(opaque);
+
+    if (vmem->migration_block_size != vmem->block_size) {
+        error_report("'%s' doesn't match", VIRTIO_MEM_BLOCK_SIZE_PROP);
+        return -EINVAL;
+    }
+    if (vmem->migration_addr != vmem->addr) {
+        error_report("'%s' doesn't match", VIRTIO_MEM_ADDR_PROP);
+        return -EINVAL;
+    }
+    return virtio_mem_restore_unplugged(vmem);
+}
+
+static const VMStateDescription vmstate_virtio_mem_device = {
+    .name = "virtio-mem-device",
+    .minimum_version_id = 1,
+    .version_id = 1,
+    .pre_save = virtio_mem_pre_save,
+    .post_load = virtio_mem_post_load,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT64(usable_region_size, VirtIOMEM),
+        VMSTATE_UINT64(size, VirtIOMEM),
+        VMSTATE_UINT64(requested_size, VirtIOMEM),
+        VMSTATE_UINT64(migration_addr, VirtIOMEM),
+        VMSTATE_UINT32(migration_block_size, VirtIOMEM),
+        VMSTATE_BITMAP(bitmap, VirtIOMEM, 0, bitmap_size),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+static const VMStateDescription vmstate_virtio_mem = {
+    .name = "virtio-mem",
+    .minimum_version_id = 1,
+    .version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_VIRTIO_DEVICE,
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+static void virtio_mem_fill_device_info(const VirtIOMEM *vmem,
+                                        VirtioMEMDeviceInfo *vi)
+{
+    vi->memaddr = vmem->addr;
+    vi->node = vmem->node;
+    vi->requested_size = vmem->requested_size;
+    vi->size = vmem->size;
+    vi->max_size = memory_region_size(&vmem->memdev->mr);
+    vi->block_size = vmem->block_size;
+    vi->memdev = object_get_canonical_path(OBJECT(vmem->memdev));
+}
+
+static MemoryRegion *virtio_mem_get_memory_region(VirtIOMEM *vmem, Error **errp)
+{
+    if (!vmem->memdev) {
+        error_setg(errp, "'%s' property must be set", VIRTIO_MEM_MEMDEV_PROP);
+        return NULL;
+    }
+
+    return &vmem->memdev->mr;
+}
+
+static void virtio_mem_get_size(Object *obj, Visitor *v, const char *name,
+                                void *opaque, Error **errp)
+{
+    const VirtIOMEM *vmem = VIRTIO_MEM(obj);
+    uint64_t value = vmem->size;
+
+    visit_type_size(v, name, &value, errp);
+}
+
+static void virtio_mem_get_requested_size(Object *obj, Visitor *v,
+                                          const char *name, void *opaque,
+                                          Error **errp)
+{
+    const VirtIOMEM *vmem = VIRTIO_MEM(obj);
+    uint64_t value = vmem->requested_size;
+
+    visit_type_size(v, name, &value, errp);
+}
+
+static void virtio_mem_set_requested_size(Object *obj, Visitor *v,
+                                          const char *name, void *opaque,
+                                          Error **errp)
+{
+    VirtIOMEM *vmem = VIRTIO_MEM(obj);
+    Error *err = NULL;
+    uint64_t value;
+
+    visit_type_size(v, name, &value, &err);
+    if (err) {
+        error_propagate(errp, err);
+        return;
+    }
+
+    /*
+     * The block size and memory backend are not fixed until the device was
+     * realized. realize() will verify these properties then.
+     */
+    if (DEVICE(obj)->realized) {
+        if (!QEMU_IS_ALIGNED(value, vmem->block_size)) {
+            error_setg(errp, "'%s' has to be multiples of '%s' (0x%" PRIx32
+                       ")", name, VIRTIO_MEM_BLOCK_SIZE_PROP,
+                       vmem->block_size);
+            return;
+        } else if (value > memory_region_size(&vmem->memdev->mr)) {
+            error_setg(errp, "'%s' cannot exceed the memory backend size"
+                       " (0x%" PRIx64 ")", name,
+                       memory_region_size(&vmem->memdev->mr));
+            return;
+        }
+
+        if (value != vmem->requested_size) {
+            virtio_mem_resize_usable_region(vmem, value, false);
+            vmem->requested_size = value;
+        }
+        /*
+         * Trigger a config update so the guest gets notified. We trigger
+         * even if the size didn't change (especially helpful for debugging).
+         */
+        virtio_notify_config(VIRTIO_DEVICE(vmem));
+    } else {
+        vmem->requested_size = value;
+    }
+}
+
+static void virtio_mem_get_block_size(Object *obj, Visitor *v, const char *name,
+                                      void *opaque, Error **errp)
+{
+    const VirtIOMEM *vmem = VIRTIO_MEM(obj);
+    uint64_t value = vmem->block_size;
+
+    visit_type_size(v, name, &value, errp);
+}
+
+static void virtio_mem_set_block_size(Object *obj, Visitor *v, const char *name,
+                                      void *opaque, Error **errp)
+{
+    VirtIOMEM *vmem = VIRTIO_MEM(obj);
+    Error *err = NULL;
+    uint64_t value;
+
+    if (DEVICE(obj)->realized) {
+        error_setg(errp, "'%s' cannot be changed", name);
+        return;
+    }
+
+    visit_type_size(v, name, &value, &err);
+    if (err) {
+        error_propagate(errp, err);
+        return;
+    }
+
+    if (value > UINT32_MAX) {
+        error_setg(errp, "'%s' has to be smaller than 0x%" PRIx32, name,
+                   UINT32_MAX);
+        return;
+    } else if (value < VIRTIO_MEM_MIN_BLOCK_SIZE) {
+        error_setg(errp, "'%s' has to be at least 0x%" PRIx32, name,
+                   VIRTIO_MEM_MIN_BLOCK_SIZE);
+        return;
+    } else if (!is_power_of_2(value)) {
+        error_setg(errp, "'%s' has to be a power of two", name);
+        return;
+    }
+    vmem->block_size = value;
+}
+
+static void virtio_mem_instance_init(Object *obj)
+{
+    VirtIOMEM *vmem = VIRTIO_MEM(obj);
+
+    vmem->block_size = VIRTIO_MEM_MIN_BLOCK_SIZE;
+
+    object_property_add(obj, VIRTIO_MEM_SIZE_PROP, "size", virtio_mem_get_size,
+                        NULL, NULL, NULL, &error_abort);
+    object_property_add(obj, VIRTIO_MEM_REQUESTED_SIZE_PROP, "size",
+                        virtio_mem_get_requested_size,
+                        virtio_mem_set_requested_size, NULL, NULL,
+                        &error_abort);
+    object_property_add(obj, VIRTIO_MEM_BLOCK_SIZE_PROP, "size",
+                        virtio_mem_get_block_size, virtio_mem_set_block_size,
+                        NULL, NULL, &error_abort);
+}
+
+static Property virtio_mem_properties[] = {
+    DEFINE_PROP_UINT64(VIRTIO_MEM_ADDR_PROP, VirtIOMEM, addr, 0),
+    DEFINE_PROP_UINT32(VIRTIO_MEM_NODE_PROP, VirtIOMEM, node, 0),
+    DEFINE_PROP_LINK(VIRTIO_MEM_MEMDEV_PROP, VirtIOMEM, memdev,
+                     TYPE_MEMORY_BACKEND, HostMemoryBackend *),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void virtio_mem_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    VirtioDeviceClass *vdc = VIRTIO_DEVICE_CLASS(klass);
+    VirtIOMEMClass *vmc = VIRTIO_MEM_CLASS(klass);
+
+    device_class_set_props(dc, virtio_mem_properties);
+    dc->vmsd = &vmstate_virtio_mem;
+
+    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
+    vdc->realize = virtio_mem_device_realize;
+    vdc->unrealize = virtio_mem_device_unrealize;
+    vdc->get_config = virtio_mem_get_config;
+    vdc->get_features = virtio_mem_get_features;
+    vdc->vmsd = &vmstate_virtio_mem_device;
+
+    vmc->fill_device_info = virtio_mem_fill_device_info;
+    vmc->get_memory_region = virtio_mem_get_memory_region;
+}
+
+static const TypeInfo virtio_mem_info = {
+    .name = TYPE_VIRTIO_MEM,
+    .parent = TYPE_VIRTIO_DEVICE,
+    .instance_size = sizeof(VirtIOMEM),
+    .instance_init = virtio_mem_instance_init,
+    .class_init = virtio_mem_class_init,
+    .class_size = sizeof(VirtIOMEMClass),
+};
+
+static void virtio_register_types(void)
+{
+    type_register_static(&virtio_mem_info);
+}
+
+type_init(virtio_register_types)
diff --git a/include/hw/virtio/virtio-mem.h b/include/hw/virtio/virtio-mem.h
new file mode 100644
index 0000000000..27158cb611
--- /dev/null
+++ b/include/hw/virtio/virtio-mem.h
@@ -0,0 +1,80 @@
+/*
+ * Virtio MEM device
+ *
+ * Copyright (C) 2020 Red Hat, Inc.
+ *
+ * Authors:
+ *  David Hildenbrand <david@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef HW_VIRTIO_MEM_H
+#define HW_VIRTIO_MEM_H
+
+#include "standard-headers/linux/virtio_mem.h"
+#include "hw/virtio/virtio.h"
+#include "qapi/qapi-types-misc.h"
+#include "sysemu/hostmem.h"
+
+#define TYPE_VIRTIO_MEM "virtio-mem"
+
+#define VIRTIO_MEM(obj) \
+        OBJECT_CHECK(VirtIOMEM, (obj), TYPE_VIRTIO_MEM)
+#define VIRTIO_MEM_CLASS(oc) \
+        OBJECT_CLASS_CHECK(VirtIOMEMClass, (oc), TYPE_VIRTIO_MEM)
+#define VIRTIO_MEM_GET_CLASS(obj) \
+        OBJECT_GET_CLASS(VirtIOMEMClass, (obj), TYPE_VIRTIO_MEM)
+
+#define VIRTIO_MEM_MEMDEV_PROP "memdev"
+#define VIRTIO_MEM_NODE_PROP "node"
+#define VIRTIO_MEM_SIZE_PROP "size"
+#define VIRTIO_MEM_REQUESTED_SIZE_PROP "requested-size"
+#define VIRTIO_MEM_BLOCK_SIZE_PROP "block-size"
+#define VIRTIO_MEM_ADDR_PROP "memaddr"
+
+typedef struct VirtIOMEM {
+    VirtIODevice parent_obj;
+
+    /* guest -> host request queue */
+    VirtQueue *vq;
+
+    /* bitmap used to track plugged (set) and unplugged (clear) blocks */
+    int32_t bitmap_size;
+    unsigned long *bitmap;
+
+    /* assigned memory backend and memory region */
+    HostMemoryBackend *memdev;
+
+    /* NUMA node */
+    uint32_t node;
+
+    /* assigned address of the region in guest physical memory */
+    uint64_t addr;
+    uint64_t migration_addr;
+
+    /* usable region size (<= region_size) */
+    uint64_t usable_region_size;
+
+    /* actual size (how much the guest plugged) */
+    uint64_t size;
+
+    /* requested size */
+    uint64_t requested_size;
+
+    /* block size and alignment */
+    uint32_t block_size;
+    uint32_t migration_block_size;
+} VirtIOMEM;
+
+typedef struct VirtIOMEMClass {
+    /* private */
+    VirtioDeviceClass parent;
+
+    /* public */
+    void (*fill_device_info)(const VirtIOMEM *vmem, VirtioMEMDeviceInfo *vi);
+    MemoryRegion *(*get_memory_region)(VirtIOMEM *vmem, Error **errp);
+} VirtIOMEMClass;
+
+#endif
diff --git a/qapi/misc.json b/qapi/misc.json
index 99b90ac80b..feaeacec22 100644
--- a/qapi/misc.json
+++ b/qapi/misc.json
@@ -1354,19 +1354,56 @@
           }
 }
 
+##
+# @VirtioMEMDeviceInfo:
+#
+# VirtioMEMDevice state information
+#
+# @id: device's ID
+#
+# @memaddr: physical address in memory, where device is mapped
+#
+# @requested-size: the user requested size of the device
+#
+# @size: the (current) size of memory that the device provides
+#
+# @max-size: the maximum size of memory that the device can provide
+#
+# @block-size: the block size of memory that the device provides
+#
+# @node: NUMA node number where device is assigned to
+#
+# @memdev: memory backend linked with the region
+#
+# Since: 5.1
+##
+{ 'struct': 'VirtioMEMDeviceInfo',
+  'data': { '*id': 'str',
+            'memaddr': 'size',
+            'requested-size': 'size',
+            'size': 'size',
+            'max-size': 'size',
+            'block-size': 'size',
+            'node': 'int',
+            'memdev': 'str'
+          }
+}
+
 ##
 # @MemoryDeviceInfo:
 #
 # Union containing information about a memory device
 #
 # nvdimm is included since 2.12. virtio-pmem is included since 4.1.
+# virtio-mem is included since 5.1.
 #
 # Since: 2.1
 ##
 { 'union': 'MemoryDeviceInfo',
   'data': { 'dimm': 'PCDIMMDeviceInfo',
             'nvdimm': 'PCDIMMDeviceInfo',
-            'virtio-pmem': 'VirtioPMEMDeviceInfo'
+            'virtio-pmem': 'VirtioPMEMDeviceInfo',
+            'virtio-mem': 'VirtioMEMDeviceInfo'
           }
 }
 
-- 
2.25.3



* [PATCH v1 10/17] virtio-mem: Paravirtualized memory hot(un)plug
@ 2020-05-06  9:49   ` David Hildenbrand
  0 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-06  9:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, kvm, Michael S . Tsirkin, David Hildenbrand,
	Dr . David Alan Gilbert, Markus Armbruster, qemu-s390x,
	Igor Mammedov, Paolo Bonzini, Richard Henderson

This is the very basic/initial version of virtio-mem. An introduction to
virtio-mem can be found in the Linux kernel driver [1]. While it can be
used in the current state for hotplug of a smaller amount of memory, it
will heavily benefit from resizeable memory regions in the future.

Each virtio-mem device manages a memory region (provided via a memory
backend). Once the hypervisor requests a size ("requested-size"), the
guest can try to plug/unplug blocks of memory within that region in order
to reach the requested size. Initially, and after a reboot, all memory is
unplugged (except in special cases, e.g., a reboot during postcopy).
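
The size accounting behind plug/unplug requests can be sketched as
follows. This is a minimal illustration only, not the code from this patch:
the names sketch_dev, sketch_resp and sketch_change are hypothetical, the
response codes merely stand in for the VIRTIO_MEM_RESP_* values, and range
validation, bitmap checks and discarding are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical response codes, standing in for VIRTIO_MEM_RESP_*. */
enum sketch_resp { SKETCH_ACK, SKETCH_NACK };

/* Hypothetical, stripped-down device state. */
struct sketch_dev {
    uint64_t size;           /* bytes currently plugged by the guest */
    uint64_t requested_size; /* bytes requested by the hypervisor */
};

/* Account for a plug (true) or unplug (false) of `bytes`. */
static enum sketch_resp sketch_change(struct sketch_dev *d, uint64_t bytes,
                                      bool plug)
{
    if (plug) {
        /* Plugging beyond the requested size is refused. */
        if (d->size + bytes > d->requested_size) {
            return SKETCH_NACK;
        }
        d->size += bytes;
    } else {
        /* Assumption: the caller only unplugs blocks that are plugged. */
        assert(bytes <= d->size);
        d->size -= bytes;
    }
    return SKETCH_ACK;
}
```

With requested_size set to 4 MiB, two 2 MiB plug requests are acked and a
third one is nacked, as the plugged size would exceed the requested size.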

The guest may only try to plug/unplug blocks of memory within the usable
region size. The usable region size is a little bigger than the
requested size, to give the device driver some flexibility. The usable
region size will only grow, except on reboots or when all memory is
requested to get unplugged. The guest can never plug more memory than
requested. Unplugged memory will get zapped/discarded, similar to what a
balloon device does.
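
How the usable region size follows the requested size can be sketched like
this. It is a simplified, standalone rendering of the rule described above;
the names sketch_usable_size and SKETCH_* are hypothetical, and the extent
of 2 * 128 MiB is the x86-64 value used in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MIB (1024ULL * 1024ULL)
/* Same value as VIRTIO_MEM_USABLE_EXTENT on x86-64 in this patch. */
#define SKETCH_USABLE_EXTENT (2 * 128 * SKETCH_MIB)

/*
 * Return the new usable region size, given the current one (cur), the
 * size of the memory backend (region_size) and the requested size.
 * can_shrink is only true on reboot or when unplugging all memory.
 */
static uint64_t sketch_usable_size(uint64_t cur, uint64_t region_size,
                                   uint64_t requested_size, bool can_shrink)
{
    uint64_t want, newsize;

    /* Assumption: the requested size was validated against the backend. */
    assert(requested_size <= region_size);

    want = requested_size + SKETCH_USABLE_EXTENT;
    newsize = want < region_size ? want : region_size;

    /* While the guest is running, the usable region may only grow. */
    if (newsize < cur && !can_shrink) {
        return cur;
    }
    return newsize;
}
```

With an 8G backend and a requested size of 1G, this yields 1G + 256M.
Lowering the requested size to 200M at runtime leaves the usable region
unchanged; only a reboot or unplugging all memory shrinks it to 456M.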

The block size is variable, however, it is always chosen in a way such that
THP splits are avoided (e.g., 2MB). The state of each block
(plugged/unplugged) is tracked in a bitmap.
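
The per-block tracking can be illustrated with a small standalone sketch.
It assumes a fixed 2 MiB block size and at most 64 blocks, so that a single
64-bit word can serve as the bitmap; the actual device uses QEMU's bitmap
helpers and a variable block size. All sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_BLOCK_SIZE (2ULL * 1024 * 1024) /* 2 MiB, as in the example */
#define SKETCH_NBLOCKS 64                      /* fits one 64-bit word */

struct sketch_map {
    uint64_t addr; /* guest physical start address of the region */
    uint64_t bits; /* bit i set => block i is plugged */
};

/* Mark all blocks in [gpa, gpa + size) as plugged or unplugged. */
static void sketch_set(struct sketch_map *m, uint64_t gpa, uint64_t size,
                       bool plug)
{
    uint64_t bit = (gpa - m->addr) / SKETCH_BLOCK_SIZE;

    assert(size % SKETCH_BLOCK_SIZE == 0);
    for (; size; size -= SKETCH_BLOCK_SIZE, bit++) {
        assert(bit < SKETCH_NBLOCKS);
        if (plug) {
            m->bits |= 1ULL << bit;
        } else {
            m->bits &= ~(1ULL << bit);
        }
    }
}

/* Return true if all blocks in the range are in the given state. */
static bool sketch_test(const struct sketch_map *m, uint64_t gpa,
                        uint64_t size, bool plugged)
{
    uint64_t bit = (gpa - m->addr) / SKETCH_BLOCK_SIZE;

    assert(size % SKETCH_BLOCK_SIZE == 0);
    for (; size; size -= SKETCH_BLOCK_SIZE, bit++) {
        assert(bit < SKETCH_NBLOCKS);
        if (((m->bits >> bit) & 1) != plugged) {
            return false;
        }
    }
    return true;
}
```

A range for which sketch_test() fails for both states contains plugged and
unplugged blocks, which corresponds to the "mixed" state a guest can see
when querying a range.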

As virtio-mem devices (e.g., virtio-mem-pci) will be memory devices, we now
expose "VirtioMEMDeviceInfo" via "query-memory-devices".

--------------------------------------------------------------------------

There are two important follow-up items that are in the works:
1. Resizeable memory regions: Use resizeable allocations/RAM blocks to
   grow/shrink along with the usable region size. This avoids creating
   initially very big VMAs, RAM blocks, and KVM slots.
2. Protection of unplugged memory: Make sure the guest cannot actually
   make use of unplugged memory.

Other follow-up items that are in the works:
1. Exclude unplugged memory during migration (via precopy notifier).
2. Handle remapping of memory.
3. Support for other architectures.

--------------------------------------------------------------------------

Example usage (virtio-mem-pci is introduced in follow-up patches):

Start QEMU with two virtio-mem devices (one per NUMA node):
 $ qemu-system-x86_64 -m 4G,maxmem=20G \
  -smp sockets=2,cores=2 \
  -numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3 \
  [...]
  -object memory-backend-ram,id=mem0,size=8G \
  -device virtio-mem-pci,id=vm0,memdev=mem0,node=0,requested-size=0M \
  -object memory-backend-ram,id=mem1,size=8G \
  -device virtio-mem-pci,id=vm1,memdev=mem1,node=1,requested-size=1G

Query the configuration:
 (qemu) info memory-devices
 Memory device [virtio-mem]: "vm0"
   memaddr: 0x140000000
   node: 0
   requested-size: 0
   size: 0
   max-size: 8589934592
   block-size: 2097152
   memdev: /objects/mem0
 Memory device [virtio-mem]: "vm1"
   memaddr: 0x340000000
   node: 1
   requested-size: 1073741824
   size: 1073741824
   max-size: 8589934592
   block-size: 2097152
   memdev: /objects/mem1

Add some memory to node 0:
 (qemu) qom-set vm0 requested-size 500M

Remove some memory from node 1:
 (qemu) qom-set vm1 requested-size 200M

Query the configuration again:
 (qemu) info memory-devices
 Memory device [virtio-mem]: "vm0"
   memaddr: 0x140000000
   node: 0
   requested-size: 524288000
   size: 524288000
   max-size: 8589934592
   block-size: 2097152
   memdev: /objects/mem0
 Memory device [virtio-mem]: "vm1"
   memaddr: 0x340000000
   node: 1
   requested-size: 209715200
   size: 209715200
   max-size: 8589934592
   block-size: 2097152
   memdev: /objects/mem1
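
The values accepted for "requested-size" in the example above can be
checked with a short sketch. It mirrors the two conditions this patch
enforces once the device is realized (multiple of the block size, not
larger than the memory backend); the helper name sketch_requested_blocks
is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MIB (1024ULL * 1024ULL)

/*
 * Return the number of blocks a requested size corresponds to, or -1 if
 * the value would be rejected.
 */
static int64_t sketch_requested_blocks(uint64_t requested,
                                       uint64_t block_size,
                                       uint64_t backend_size)
{
    assert(block_size != 0);

    if (requested % block_size != 0 || requested > backend_size) {
        return -1;
    }
    return (int64_t)(requested / block_size);
}
```

With the 2 MiB block size and the 8G backends from the example, 500M
(524288000 bytes) corresponds to 250 blocks and 200M (209715200 bytes) to
100 blocks, while a value such as 501M would be rejected.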

[1] https://lkml.kernel.org/r/20200311171422.10484-1-david@redhat.com

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Eric Blake <eblake@redhat.com>
Cc: Markus Armbruster <armbru@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/Kconfig              |  11 +
 hw/virtio/Makefile.objs        |   1 +
 hw/virtio/virtio-mem.c         | 762 +++++++++++++++++++++++++++++++++
 include/hw/virtio/virtio-mem.h |  80 ++++
 qapi/misc.json                 |  39 +-
 5 files changed, 892 insertions(+), 1 deletion(-)
 create mode 100644 hw/virtio/virtio-mem.c
 create mode 100644 include/hw/virtio/virtio-mem.h

diff --git a/hw/virtio/Kconfig b/hw/virtio/Kconfig
index 83122424fa..0eda25c4e1 100644
--- a/hw/virtio/Kconfig
+++ b/hw/virtio/Kconfig
@@ -47,3 +47,14 @@ config VIRTIO_PMEM
     depends on VIRTIO
     depends on VIRTIO_PMEM_SUPPORTED
     select MEM_DEVICE
+
+config VIRTIO_MEM_SUPPORTED
+    bool
+
+config VIRTIO_MEM
+    bool
+    default y
+    depends on VIRTIO
+    depends on LINUX
+    depends on VIRTIO_MEM_SUPPORTED
+    select MEM_DEVICE
diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs
index 4e4d39a0a4..7df70e977e 100644
--- a/hw/virtio/Makefile.objs
+++ b/hw/virtio/Makefile.objs
@@ -18,6 +18,7 @@ common-obj-$(call land,$(CONFIG_VIRTIO_PMEM),$(CONFIG_VIRTIO_PCI)) += virtio-pme
 obj-$(call land,$(CONFIG_VHOST_USER_FS),$(CONFIG_VIRTIO_PCI)) += vhost-user-fs-pci.o
 obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
 obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock.o
+obj-$(CONFIG_VIRTIO_MEM) += virtio-mem.o
 
 ifeq ($(CONFIG_VIRTIO_PCI),y)
 obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock-pci.o
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
new file mode 100644
index 0000000000..e25b2c74f2
--- /dev/null
+++ b/hw/virtio/virtio-mem.c
@@ -0,0 +1,762 @@
+/*
+ * Virtio MEM device
+ *
+ * Copyright (C) 2020 Red Hat, Inc.
+ *
+ * Authors:
+ *  David Hildenbrand <david@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "qemu/iov.h"
+#include "qemu/cutils.h"
+#include "qemu/error-report.h"
+#include "qemu/units.h"
+#include "sysemu/numa.h"
+#include "sysemu/sysemu.h"
+#include "sysemu/reset.h"
+#include "hw/virtio/virtio.h"
+#include "hw/virtio/virtio-bus.h"
+#include "hw/virtio/virtio-access.h"
+#include "hw/virtio/virtio-mem.h"
+#include "qapi/error.h"
+#include "qapi/visitor.h"
+#include "exec/ram_addr.h"
+#include "migration/misc.h"
+#include "migration/postcopy-ram.h"
+#include "hw/boards.h"
+#include "hw/qdev-properties.h"
+#include "config-devices.h"
+
+/*
+ * Use QEMU_VMALLOC_ALIGN, so no THP will have to be split when unplugging
+ * memory (e.g., 2MB on x86_64).
+ */
+#define VIRTIO_MEM_MIN_BLOCK_SIZE QEMU_VMALLOC_ALIGN
+/*
+ * Size the usable region bigger than the requested size if possible. Esp.
+ * Linux guests will only add (aligned) memory blocks in case they fully
+ * fit into the usable region, but plug+online only a subset of the pages.
+ * The memory block size corresponds mostly to the section size.
+ *
+ * This allows e.g., to add 20MB with a section size of 128MB on x86_64, and
+ * a section size of 1GB on arm64 (as long as the start address is properly
+ * aligned, similar to ordinary DIMMs).
+ *
+ * We can change this at any time and maybe even make it configurable if
+ * necessary (as the section size can change). But it's more likely that the
+ * section size will rather get smaller and not bigger over time.
+ */
+#if defined(__x86_64__)
+#define VIRTIO_MEM_USABLE_EXTENT (2 * (128 * MiB))
+#else
+#error VIRTIO_MEM_USABLE_EXTENT not defined
+#endif
+
+static bool virtio_mem_discard_inhibited(void)
+{
+    PostcopyState ps = postcopy_state_get();
+
+    /* Postcopy cannot deal with concurrent discards (yet), so it's special. */
+    return ps >= POSTCOPY_INCOMING_DISCARD && ps < POSTCOPY_INCOMING_END;
+}
+
+static bool virtio_mem_test_bitmap(VirtIOMEM *vmem, uint64_t start_gpa,
+                                   uint64_t size, bool plug)
+{
+    uint64_t bit = (start_gpa - vmem->addr) / vmem->block_size;
+
+    g_assert(QEMU_IS_ALIGNED(start_gpa, vmem->block_size));
+    g_assert(QEMU_IS_ALIGNED(size, vmem->block_size));
+    g_assert(vmem->bitmap);
+
+    while (size) {
+        g_assert(bit < vmem->bitmap_size);
+
+        if (plug && !test_bit(bit, vmem->bitmap)) {
+            return false;
+        } else if (!plug && test_bit(bit, vmem->bitmap)) {
+            return false;
+        }
+        size -= vmem->block_size;
+        bit++;
+    }
+    return true;
+}
+
+static void virtio_mem_set_bitmap(VirtIOMEM *vmem, uint64_t start_gpa,
+                                  uint64_t size, bool plug)
+{
+    const uint64_t bit = (start_gpa - vmem->addr) / vmem->block_size;
+    const uint64_t nbits = size / vmem->block_size;
+
+    g_assert(QEMU_IS_ALIGNED(start_gpa, vmem->block_size));
+    g_assert(QEMU_IS_ALIGNED(size, vmem->block_size));
+    g_assert(vmem->bitmap);
+
+    if (plug) {
+        bitmap_set(vmem->bitmap, bit, nbits);
+    } else {
+        bitmap_clear(vmem->bitmap, bit, nbits);
+    }
+}
+
+static void virtio_mem_send_response(VirtIOMEM *vmem, VirtQueueElement *elem,
+                                     struct virtio_mem_resp *resp)
+{
+    VirtIODevice *vdev = VIRTIO_DEVICE(vmem);
+    VirtQueue *vq = vmem->vq;
+
+    iov_from_buf(elem->in_sg, elem->in_num, 0, resp, sizeof(*resp));
+
+    virtqueue_push(vq, elem, sizeof(*resp));
+    virtio_notify(vdev, vq);
+}
+
+static void virtio_mem_send_response_simple(VirtIOMEM *vmem,
+                                            VirtQueueElement *elem,
+                                            uint16_t type)
+{
+    VirtIODevice *vdev = VIRTIO_DEVICE(vmem);
+    struct virtio_mem_resp resp = {};
+
+    virtio_stw_p(vdev, &resp.type, type);
+    virtio_mem_send_response(vmem, elem, &resp);
+}
+
+static void virtio_mem_bad_request(VirtIOMEM *vmem, const char *msg)
+{
+    virtio_error(VIRTIO_DEVICE(vmem), "virtio-mem protocol violation: %s", msg);
+}
+
+static bool virtio_mem_valid_range(VirtIOMEM *vmem, uint64_t gpa, uint64_t size)
+{
+    if (!QEMU_IS_ALIGNED(gpa, vmem->block_size)) {
+        return false;
+    }
+    if (gpa + size < gpa || size == 0) {
+        return false;
+    }
+    if (gpa < vmem->addr || gpa >= vmem->addr + vmem->usable_region_size) {
+        return false;
+    }
+    if (gpa + size > vmem->addr + vmem->usable_region_size) {
+        return false;
+    }
+    return true;
+}
+
+static int virtio_mem_set_block_state(VirtIOMEM *vmem, uint64_t start_gpa,
+                                      uint64_t size, bool plug)
+{
+    const uint64_t offset = start_gpa - vmem->addr;
+    int ret;
+
+    if (!plug) {
+        if (virtio_mem_discard_inhibited()) {
+            return -EBUSY;
+        }
+        /* Note: Discarding should never fail at this point. */
+        ret = ram_block_discard_range(vmem->memdev->mr.ram_block, offset, size);
+        if (ret) {
+            return -EBUSY;
+        }
+    }
+    virtio_mem_set_bitmap(vmem, start_gpa, size, plug);
+    return 0;
+}
+
+static int virtio_mem_state_change_request(VirtIOMEM *vmem, uint64_t gpa,
+                                           uint16_t nb_blocks, bool plug)
+{
+    const uint64_t size = nb_blocks * (uint64_t)vmem->block_size;
+    int ret;
+
+    if (!virtio_mem_valid_range(vmem, gpa, size)) {
+        return VIRTIO_MEM_RESP_ERROR;
+    }
+
+    if (plug && (vmem->size + size > vmem->requested_size)) {
+        return VIRTIO_MEM_RESP_NACK;
+    }
+
+    /* test if really all blocks are in the opposite state */
+    if (!virtio_mem_test_bitmap(vmem, gpa, size, !plug)) {
+        return VIRTIO_MEM_RESP_ERROR;
+    }
+
+    ret = virtio_mem_set_block_state(vmem, gpa, size, plug);
+    if (ret) {
+        return VIRTIO_MEM_RESP_BUSY;
+    }
+    if (plug) {
+        vmem->size += size;
+    } else {
+        vmem->size -= size;
+    }
+    return VIRTIO_MEM_RESP_ACK;
+}
+
+static void virtio_mem_plug_request(VirtIOMEM *vmem, VirtQueueElement *elem,
+                                    struct virtio_mem_req *req)
+{
+    const uint64_t gpa = le64_to_cpu(req->u.plug.addr);
+    const uint16_t nb_blocks = le16_to_cpu(req->u.plug.nb_blocks);
+    uint16_t type;
+
+    type = virtio_mem_state_change_request(vmem, gpa, nb_blocks, true);
+    virtio_mem_send_response_simple(vmem, elem, type);
+}
+
+static void virtio_mem_unplug_request(VirtIOMEM *vmem, VirtQueueElement *elem,
+                                      struct virtio_mem_req *req)
+{
+    const uint64_t gpa = le64_to_cpu(req->u.unplug.addr);
+    const uint16_t nb_blocks = le16_to_cpu(req->u.unplug.nb_blocks);
+    uint16_t type;
+
+    type = virtio_mem_state_change_request(vmem, gpa, nb_blocks, false);
+    virtio_mem_send_response_simple(vmem, elem, type);
+}
+
+static void virtio_mem_resize_usable_region(VirtIOMEM *vmem,
+                                            uint64_t requested_size,
+                                            bool can_shrink)
+{
+    uint64_t newsize = MIN(memory_region_size(&vmem->memdev->mr),
+                           requested_size + VIRTIO_MEM_USABLE_EXTENT);
+
+    /* We must only grow while the guest is running. */
+    if (newsize < vmem->usable_region_size && !can_shrink) {
+        return;
+    }
+
+    vmem->usable_region_size = newsize;
+}
+
+static int virtio_mem_unplug_all(VirtIOMEM *vmem)
+{
+    RAMBlock *rb = vmem->memdev->mr.ram_block;
+    int ret;
+
+    if (virtio_mem_discard_inhibited()) {
+        return -EBUSY;
+    }
+
+    ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb));
+    if (ret) {
+        /* Note: Discarding should never fail at this point. */
+        return -EBUSY;
+    }
+    bitmap_clear(vmem->bitmap, 0, vmem->bitmap_size);
+    vmem->size = 0;
+
+    virtio_mem_resize_usable_region(vmem, vmem->requested_size, true);
+    return 0;
+}
+
+static void virtio_mem_unplug_all_request(VirtIOMEM *vmem,
+                                          VirtQueueElement *elem)
+{
+
+    if (virtio_mem_unplug_all(vmem)) {
+        virtio_mem_send_response_simple(vmem, elem, VIRTIO_MEM_RESP_BUSY);
+    } else {
+        virtio_mem_send_response_simple(vmem, elem, VIRTIO_MEM_RESP_ACK);
+    }
+}
+
+static void virtio_mem_state_request(VirtIOMEM *vmem, VirtQueueElement *elem,
+                                     struct virtio_mem_req *req)
+{
+    const uint64_t gpa = le64_to_cpu(req->u.state.addr);
+    const uint16_t nb_blocks = le16_to_cpu(req->u.state.nb_blocks);
+    const uint64_t size = nb_blocks * (uint64_t)vmem->block_size;
+    VirtIODevice *vdev = VIRTIO_DEVICE(vmem);
+    struct virtio_mem_resp resp = {};
+
+    if (!virtio_mem_valid_range(vmem, gpa, size)) {
+        virtio_mem_send_response_simple(vmem, elem, VIRTIO_MEM_RESP_ERROR);
+        return;
+    }
+
+    virtio_stw_p(vdev, &resp.type, VIRTIO_MEM_RESP_ACK);
+    if (virtio_mem_test_bitmap(vmem, gpa, size, true)) {
+        virtio_stw_p(vdev, &resp.u.state.state, VIRTIO_MEM_STATE_PLUGGED);
+    } else if (virtio_mem_test_bitmap(vmem, gpa, size, false)) {
+        virtio_stw_p(vdev, &resp.u.state.state, VIRTIO_MEM_STATE_UNPLUGGED);
+    } else {
+        virtio_stw_p(vdev, &resp.u.state.state, VIRTIO_MEM_STATE_MIXED);
+    }
+    virtio_mem_send_response(vmem, elem, &resp);
+}
+
+static void virtio_mem_handle_request(VirtIODevice *vdev, VirtQueue *vq)
+{
+    const int len = sizeof(struct virtio_mem_req);
+    VirtIOMEM *vmem = VIRTIO_MEM(vdev);
+    VirtQueueElement *elem;
+    struct virtio_mem_req req;
+    uint16_t type;
+
+    while (true) {
+        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!elem) {
+            return;
+        }
+
+        if (iov_to_buf(elem->out_sg, elem->out_num, 0, &req, len) < len) {
+            virtio_mem_bad_request(vmem, "invalid request size");
+            g_free(elem);
+            return;
+        }
+
+        if (iov_size(elem->in_sg, elem->in_num) <
+            sizeof(struct virtio_mem_resp)) {
+            virtio_mem_bad_request(vmem, "not enough space for response");
+            g_free(elem);
+            return;
+        }
+
+        type = le16_to_cpu(req.type);
+        switch (type) {
+        case VIRTIO_MEM_REQ_PLUG:
+            virtio_mem_plug_request(vmem, elem, &req);
+            break;
+        case VIRTIO_MEM_REQ_UNPLUG:
+            virtio_mem_unplug_request(vmem, elem, &req);
+            break;
+        case VIRTIO_MEM_REQ_UNPLUG_ALL:
+            virtio_mem_unplug_all_request(vmem, elem);
+            break;
+        case VIRTIO_MEM_REQ_STATE:
+            virtio_mem_state_request(vmem, elem, &req);
+            break;
+        default:
+            virtio_mem_bad_request(vmem, "unknown request type");
+            g_free(elem);
+            return;
+        }
+
+        g_free(elem);
+    }
+}
+
+static void virtio_mem_get_config(VirtIODevice *vdev, uint8_t *config_data)
+{
+    VirtIOMEM *vmem = VIRTIO_MEM(vdev);
+    struct virtio_mem_config *config = (void *) config_data;
+
+    config->block_size = cpu_to_le32(vmem->block_size);
+    config->node_id = cpu_to_le16(vmem->node);
+    config->requested_size = cpu_to_le64(vmem->requested_size);
+    config->plugged_size = cpu_to_le64(vmem->size);
+    config->addr = cpu_to_le64(vmem->addr);
+    config->region_size = cpu_to_le64(memory_region_size(&vmem->memdev->mr));
+    config->usable_region_size = cpu_to_le64(vmem->usable_region_size);
+}
+
+static uint64_t virtio_mem_get_features(VirtIODevice *vdev, uint64_t features,
+                                        Error **errp)
+{
+    MachineState *ms = MACHINE(qdev_get_machine());
+
+    if (ms->numa_state) {
+#if defined(CONFIG_ACPI)
+        virtio_add_feature(&features, VIRTIO_MEM_F_ACPI_PXM);
+#endif
+    }
+    return features;
+}
+
+static void virtio_mem_system_reset(void *opaque)
+{
+    VirtIOMEM *vmem = VIRTIO_MEM(opaque);
+
+    /*
+     * During usual resets, we will unplug all memory and shrink the usable
+     * region size. This is, however, not possible in all scenarios. Then,
+     * the guest has to deal with this manually (VIRTIO_MEM_REQ_UNPLUG_ALL).
+     */
+    virtio_mem_unplug_all(vmem);
+}
+
+static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
+{
+    MachineState *ms = MACHINE(qdev_get_machine());
+    int nb_numa_nodes = ms->numa_state ? ms->numa_state->num_nodes : 0;
+    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+    VirtIOMEM *vmem = VIRTIO_MEM(dev);
+    uint64_t page_size;
+    RAMBlock *rb;
+    int ret;
+
+    if (!vmem->memdev) {
+        error_setg(errp, "'%s' property must be set", VIRTIO_MEM_MEMDEV_PROP);
+        return;
+    } else if (host_memory_backend_is_mapped(vmem->memdev)) {
+        char *path = object_get_canonical_path_component(OBJECT(vmem->memdev));
+
+        error_setg(errp, "can't use already busy memdev: %s", path);
+        g_free(path);
+        return;
+    }
+
+    if ((nb_numa_nodes && vmem->node >= nb_numa_nodes) ||
+        (!nb_numa_nodes && vmem->node)) {
+        error_setg(errp, "Property '%s' has value '%" PRIu32
+                   "', which exceeds the number of numa nodes: %d",
+                   VIRTIO_MEM_NODE_PROP, vmem->node,
+                   nb_numa_nodes ? nb_numa_nodes : 1);
+        return;
+    }
+
+    if (enable_mlock) {
+        error_setg(errp, "not compatible with mlock yet");
+        return;
+    }
+
+    if (!memory_region_is_ram(&vmem->memdev->mr) ||
+        memory_region_is_rom(&vmem->memdev->mr) ||
+        !vmem->memdev->mr.ram_block) {
+        error_setg(errp, "unsupported memdev");
+        return;
+    }
+
+    rb = vmem->memdev->mr.ram_block;
+    page_size = qemu_ram_pagesize(rb);
+
+    if (vmem->block_size < page_size) {
+        error_setg(errp, "'%s' has to be at least the page size (0x%"
+                   PRIx64 ")", VIRTIO_MEM_BLOCK_SIZE_PROP, page_size);
+        return;
+    } else if (!QEMU_IS_ALIGNED(vmem->requested_size, vmem->block_size)) {
+        error_setg(errp, "'%s' has to be multiples of '%s' (0x%" PRIx32
+                   ")", VIRTIO_MEM_REQUESTED_SIZE_PROP,
+                   VIRTIO_MEM_BLOCK_SIZE_PROP, vmem->block_size);
+        return;
+    } else if (!QEMU_IS_ALIGNED(memory_region_size(&vmem->memdev->mr),
+                                vmem->block_size)) {
+        error_setg(errp, "'%s' backend size has to be multiples of '%s' (0x%"
+                   PRIx32 ")", VIRTIO_MEM_MEMDEV_PROP,
+                   VIRTIO_MEM_BLOCK_SIZE_PROP, vmem->block_size);
+        return;
+    }
+
+    if (ram_block_discard_set_required(true)) {
+        error_setg(errp, "Discarding RAM is marked broken.");
+        return;
+    }
+
+    ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb));
+    if (ret) {
+        /* Note: Discarding should never fail at this point. */
+        error_setg_errno(errp, -ret, "Discarding RAM failed.");
+        ram_block_discard_set_required(false);
+        return;
+    }
+
+    virtio_mem_resize_usable_region(vmem, vmem->requested_size, true);
+
+    vmem->bitmap_size = memory_region_size(&vmem->memdev->mr) /
+                        vmem->block_size;
+    vmem->bitmap = bitmap_new(vmem->bitmap_size);
+
+    virtio_init(vdev, TYPE_VIRTIO_MEM, VIRTIO_ID_MEM,
+                sizeof(struct virtio_mem_config));
+    vmem->vq = virtio_add_queue(vdev, 128, virtio_mem_handle_request);
+
+    host_memory_backend_set_mapped(vmem->memdev, true);
+    vmstate_register_ram(&vmem->memdev->mr, DEVICE(vmem));
+    qemu_register_reset(virtio_mem_system_reset, vmem);
+    return;
+}
+
+static void virtio_mem_device_unrealize(DeviceState *dev, Error **errp)
+{
+    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+    VirtIOMEM *vmem = VIRTIO_MEM(dev);
+
+    qemu_unregister_reset(virtio_mem_system_reset, vmem);
+    vmstate_unregister_ram(&vmem->memdev->mr, DEVICE(vmem));
+    host_memory_backend_set_mapped(vmem->memdev, false);
+    virtio_del_queue(vdev, 0);
+    virtio_cleanup(vdev);
+    g_free(vmem->bitmap);
+    ram_block_discard_set_required(false);
+}
+
+static int virtio_mem_pre_save(void *opaque)
+{
+    VirtIOMEM *vmem = VIRTIO_MEM(opaque);
+
+    vmem->migration_addr = vmem->addr;
+    vmem->migration_block_size = vmem->block_size;
+
+    return 0;
+}
+
+static int virtio_mem_restore_unplugged(VirtIOMEM *vmem)
+{
+    unsigned long bit;
+    uint64_t offset;
+    int ret;
+
+    /* TODO: Better postcopy handling - defer to postcopy end. */
+    if (virtio_mem_discard_inhibited()) {
+        return 0;
+    }
+
+    bit = find_first_zero_bit(vmem->bitmap, vmem->bitmap_size);
+    while (bit < vmem->bitmap_size) {
+        offset = bit * vmem->block_size;
+
+        if (offset + vmem->block_size >
+            memory_region_size(&vmem->memdev->mr)) {
+            break;
+        }
+        /* Note: Discarding should never fail at this point. */
+        ret = ram_block_discard_range(vmem->memdev->mr.ram_block, offset,
+                                      vmem->block_size);
+        if (ret) {
+            return -EINVAL;
+        }
+        bit = find_next_zero_bit(vmem->bitmap, vmem->bitmap_size, bit + 1);
+    }
+    return 0;
+}
+
+static int virtio_mem_post_load(void *opaque, int version_id)
+{
+    VirtIOMEM *vmem = VIRTIO_MEM(opaque);
+
+    if (vmem->migration_block_size != vmem->block_size) {
+        error_report("'%s' doesn't match", VIRTIO_MEM_BLOCK_SIZE_PROP);
+        return -EINVAL;
+    }
+    if (vmem->migration_addr != vmem->addr) {
+        error_report("'%s' doesn't match", VIRTIO_MEM_ADDR_PROP);
+        return -EINVAL;
+    }
+    return virtio_mem_restore_unplugged(vmem);
+}
+
+static const VMStateDescription vmstate_virtio_mem_device = {
+    .name = "virtio-mem-device",
+    .minimum_version_id = 1,
+    .version_id = 1,
+    .pre_save = virtio_mem_pre_save,
+    .post_load = virtio_mem_post_load,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT64(usable_region_size, VirtIOMEM),
+        VMSTATE_UINT64(size, VirtIOMEM),
+        VMSTATE_UINT64(requested_size, VirtIOMEM),
+        VMSTATE_UINT64(migration_addr, VirtIOMEM),
+        VMSTATE_UINT32(migration_block_size, VirtIOMEM),
+        VMSTATE_BITMAP(bitmap, VirtIOMEM, 0, bitmap_size),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+static const VMStateDescription vmstate_virtio_mem = {
+    .name = "virtio-mem",
+    .minimum_version_id = 1,
+    .version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_VIRTIO_DEVICE,
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+static void virtio_mem_fill_device_info(const VirtIOMEM *vmem,
+                                        VirtioMEMDeviceInfo *vi)
+{
+    vi->memaddr = vmem->addr;
+    vi->node = vmem->node;
+    vi->requested_size = vmem->requested_size;
+    vi->size = vmem->size;
+    vi->max_size = memory_region_size(&vmem->memdev->mr);
+    vi->block_size = vmem->block_size;
+    vi->memdev = object_get_canonical_path(OBJECT(vmem->memdev));
+}
+
+static MemoryRegion *virtio_mem_get_memory_region(VirtIOMEM *vmem, Error **errp)
+{
+    if (!vmem->memdev) {
+        error_setg(errp, "'%s' property must be set", VIRTIO_MEM_MEMDEV_PROP);
+        return NULL;
+    }
+
+    return &vmem->memdev->mr;
+}
+
+static void virtio_mem_get_size(Object *obj, Visitor *v, const char *name,
+                                void *opaque, Error **errp)
+{
+    const VirtIOMEM *vmem = VIRTIO_MEM(obj);
+    uint64_t value = vmem->size;
+
+    visit_type_size(v, name, &value, errp);
+}
+
+static void virtio_mem_get_requested_size(Object *obj, Visitor *v,
+                                          const char *name, void *opaque,
+                                          Error **errp)
+{
+    const VirtIOMEM *vmem = VIRTIO_MEM(obj);
+    uint64_t value = vmem->requested_size;
+
+    visit_type_size(v, name, &value, errp);
+}
+
+static void virtio_mem_set_requested_size(Object *obj, Visitor *v,
+                                          const char *name, void *opaque,
+                                          Error **errp)
+{
+    VirtIOMEM *vmem = VIRTIO_MEM(obj);
+    Error *err = NULL;
+    uint64_t value;
+
+    visit_type_size(v, name, &value, &err);
+    if (err) {
+        error_propagate(errp, err);
+        return;
+    }
+
+    /*
+     * The block size and memory backend are not fixed until the device is
+     * realized. realize() will verify these properties then.
+     */
+    if (DEVICE(obj)->realized) {
+        if (!QEMU_IS_ALIGNED(value, vmem->block_size)) {
+            error_setg(errp, "'%s' has to be multiples of '%s' (0x%" PRIx32
+                       ")", name, VIRTIO_MEM_BLOCK_SIZE_PROP,
+                       vmem->block_size);
+            return;
+        } else if (value > memory_region_size(&vmem->memdev->mr)) {
+            error_setg(errp, "'%s' cannot exceed the memory backend size "
+                       "(0x%" PRIx64 ")", name,
+                       memory_region_size(&vmem->memdev->mr));
+            return;
+        }
+
+        if (value != vmem->requested_size) {
+            virtio_mem_resize_usable_region(vmem, value, false);
+            vmem->requested_size = value;
+        }
+        /*
+         * Trigger a config update so the guest gets notified. We trigger
+         * even if the size didn't change (especially helpful for debugging).
+         */
+        virtio_notify_config(VIRTIO_DEVICE(vmem));
+    } else {
+        vmem->requested_size = value;
+    }
+}
+
+static void virtio_mem_get_block_size(Object *obj, Visitor *v, const char *name,
+                                      void *opaque, Error **errp)
+{
+    const VirtIOMEM *vmem = VIRTIO_MEM(obj);
+    uint64_t value = vmem->block_size;
+
+    visit_type_size(v, name, &value, errp);
+}
+
+static void virtio_mem_set_block_size(Object *obj, Visitor *v, const char *name,
+                                      void *opaque, Error **errp)
+{
+    VirtIOMEM *vmem = VIRTIO_MEM(obj);
+    Error *err = NULL;
+    uint64_t value;
+
+    if (DEVICE(obj)->realized) {
+        error_setg(errp, "'%s' cannot be changed", name);
+        return;
+    }
+
+    visit_type_size(v, name, &value, &err);
+    if (err) {
+        error_propagate(errp, err);
+        return;
+    }
+
+    if (value > UINT32_MAX) {
+        error_setg(errp, "'%s' has to be smaller than 0x%" PRIx32, name,
+                   UINT32_MAX);
+        return;
+    } else if (value < VIRTIO_MEM_MIN_BLOCK_SIZE) {
+        error_setg(errp, "'%s' has to be at least 0x%" PRIx32, name,
+                   VIRTIO_MEM_MIN_BLOCK_SIZE);
+        return;
+    } else if (!is_power_of_2(value)) {
+        error_setg(errp, "'%s' has to be a power of two", name);
+        return;
+    }
+    vmem->block_size = value;
+}
+
+static void virtio_mem_instance_init(Object *obj)
+{
+    VirtIOMEM *vmem = VIRTIO_MEM(obj);
+
+    vmem->block_size = VIRTIO_MEM_MIN_BLOCK_SIZE;
+
+    object_property_add(obj, VIRTIO_MEM_SIZE_PROP, "size", virtio_mem_get_size,
+                        NULL, NULL, NULL, &error_abort);
+    object_property_add(obj, VIRTIO_MEM_REQUESTED_SIZE_PROP, "size",
+                        virtio_mem_get_requested_size,
+                        virtio_mem_set_requested_size, NULL, NULL,
+                        &error_abort);
+    object_property_add(obj, VIRTIO_MEM_BLOCK_SIZE_PROP, "size",
+                        virtio_mem_get_block_size, virtio_mem_set_block_size,
+                        NULL, NULL, &error_abort);
+}
+
+static Property virtio_mem_properties[] = {
+    DEFINE_PROP_UINT64(VIRTIO_MEM_ADDR_PROP, VirtIOMEM, addr, 0),
+    DEFINE_PROP_UINT32(VIRTIO_MEM_NODE_PROP, VirtIOMEM, node, 0),
+    DEFINE_PROP_LINK(VIRTIO_MEM_MEMDEV_PROP, VirtIOMEM, memdev,
+                     TYPE_MEMORY_BACKEND, HostMemoryBackend *),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void virtio_mem_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    VirtioDeviceClass *vdc = VIRTIO_DEVICE_CLASS(klass);
+    VirtIOMEMClass *vmc = VIRTIO_MEM_CLASS(klass);
+
+    device_class_set_props(dc, virtio_mem_properties);
+    dc->vmsd = &vmstate_virtio_mem;
+
+    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
+    vdc->realize = virtio_mem_device_realize;
+    vdc->unrealize = virtio_mem_device_unrealize;
+    vdc->get_config = virtio_mem_get_config;
+    vdc->get_features = virtio_mem_get_features;
+    vdc->vmsd = &vmstate_virtio_mem_device;
+
+    vmc->fill_device_info = virtio_mem_fill_device_info;
+    vmc->get_memory_region = virtio_mem_get_memory_region;
+}
+
+static const TypeInfo virtio_mem_info = {
+    .name = TYPE_VIRTIO_MEM,
+    .parent = TYPE_VIRTIO_DEVICE,
+    .instance_size = sizeof(VirtIOMEM),
+    .instance_init = virtio_mem_instance_init,
+    .class_init = virtio_mem_class_init,
+    .class_size = sizeof(VirtIOMEMClass),
+};
+
+static void virtio_register_types(void)
+{
+    type_register_static(&virtio_mem_info);
+}
+
+type_init(virtio_register_types)
diff --git a/include/hw/virtio/virtio-mem.h b/include/hw/virtio/virtio-mem.h
new file mode 100644
index 0000000000..27158cb611
--- /dev/null
+++ b/include/hw/virtio/virtio-mem.h
@@ -0,0 +1,80 @@
+/*
+ * Virtio MEM device
+ *
+ * Copyright (C) 2020 Red Hat, Inc.
+ *
+ * Authors:
+ *  David Hildenbrand <david@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef HW_VIRTIO_MEM_H
+#define HW_VIRTIO_MEM_H
+
+#include "standard-headers/linux/virtio_mem.h"
+#include "hw/virtio/virtio.h"
+#include "qapi/qapi-types-misc.h"
+#include "sysemu/hostmem.h"
+
+#define TYPE_VIRTIO_MEM "virtio-mem"
+
+#define VIRTIO_MEM(obj) \
+        OBJECT_CHECK(VirtIOMEM, (obj), TYPE_VIRTIO_MEM)
+#define VIRTIO_MEM_CLASS(oc) \
+        OBJECT_CLASS_CHECK(VirtIOMEMClass, (oc), TYPE_VIRTIO_MEM)
+#define VIRTIO_MEM_GET_CLASS(obj) \
+        OBJECT_GET_CLASS(VirtIOMEMClass, (obj), TYPE_VIRTIO_MEM)
+
+#define VIRTIO_MEM_MEMDEV_PROP "memdev"
+#define VIRTIO_MEM_NODE_PROP "node"
+#define VIRTIO_MEM_SIZE_PROP "size"
+#define VIRTIO_MEM_REQUESTED_SIZE_PROP "requested-size"
+#define VIRTIO_MEM_BLOCK_SIZE_PROP "block-size"
+#define VIRTIO_MEM_ADDR_PROP "memaddr"
+
+typedef struct VirtIOMEM {
+    VirtIODevice parent_obj;
+
+    /* guest -> host request queue */
+    VirtQueue *vq;
+
+    /* bitmap used to track unplugged memory */
+    int32_t bitmap_size;
+    unsigned long *bitmap;
+
+    /* assigned memory backend and memory region */
+    HostMemoryBackend *memdev;
+
+    /* NUMA node */
+    uint32_t node;
+
+    /* assigned address of the region in guest physical memory */
+    uint64_t addr;
+    uint64_t migration_addr;
+
+    /* usable region size (<= region_size) */
+    uint64_t usable_region_size;
+
+    /* actual size (how much the guest plugged) */
+    uint64_t size;
+
+    /* requested size */
+    uint64_t requested_size;
+
+    /* block size and alignment */
+    uint32_t block_size;
+    uint32_t migration_block_size;
+} VirtIOMEM;
+
+typedef struct VirtIOMEMClass {
+    /* private */
+    VirtioDeviceClass parent;
+
+    /* public */
+    void (*fill_device_info)(const VirtIOMEM *vmem, VirtioMEMDeviceInfo *vi);
+    MemoryRegion *(*get_memory_region)(VirtIOMEM *vmem, Error **errp);
+} VirtIOMEMClass;
+
+#endif
diff --git a/qapi/misc.json b/qapi/misc.json
index 99b90ac80b..feaeacec22 100644
--- a/qapi/misc.json
+++ b/qapi/misc.json
@@ -1354,19 +1354,56 @@
           }
 }
 
+##
+# @VirtioMEMDeviceInfo:
+#
+# VirtioMEMDevice state information
+#
+# @id: device's ID
+#
+# @memaddr: physical address in memory where the device is mapped
+#
+# @requested-size: the user requested size of the device
+#
+# @size: the (current) size of memory that the device provides
+#
+# @max-size: the maximum size of memory that the device can provide
+#
+# @block-size: the block size of memory that the device provides
+#
+# @node: NUMA node number the device is assigned to
+#
+# @memdev: memory backend linked with the region
+#
+# Since: 5.1
+##
+{ 'struct': 'VirtioMEMDeviceInfo',
+  'data': { '*id': 'str',
+            'memaddr': 'size',
+            'requested-size': 'size',
+            'size': 'size',
+            'max-size': 'size',
+            'block-size': 'size',
+            'node': 'int',
+            'memdev': 'str'
+          }
+}
+
 ##
 # @MemoryDeviceInfo:
 #
 # Union containing information about a memory device
 #
 # nvdimm is included since 2.12. virtio-pmem is included since 4.1.
+# virtio-mem is included since 5.1.
 #
 # Since: 2.1
 ##
 { 'union': 'MemoryDeviceInfo',
   'data': { 'dimm': 'PCDIMMDeviceInfo',
             'nvdimm': 'PCDIMMDeviceInfo',
-            'virtio-pmem': 'VirtioPMEMDeviceInfo'
+            'virtio-pmem': 'VirtioPMEMDeviceInfo',
+            'virtio-mem': 'VirtioMEMDeviceInfo'
           }
 }
 
-- 
2.25.3



^ permalink raw reply related	[flat|nested] 94+ messages in thread
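For illustration, the check at the end of virtio_mem_set_block_size() above
can be sketched in standalone C. Everything named sketch_* is hypothetical;
the patch itself uses QEMU's is_power_of_2() and VIRTIO_MEM_MIN_BLOCK_SIZE.
The lower bound is passed in as a parameter because its value is not shown
here, and the range check against the 32-bit block_size field is an
assumption about what the setter has to guarantee.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for QEMU's is_power_of_2(): exactly one bit set. */
static bool sketch_is_power_of_2(uint64_t value)
{
    return value != 0 && (value & (value - 1)) == 0;
}

/*
 * Hypothetical validator for a user-supplied block size.
 * 'min_block_size' is an assumed lower bound chosen by the caller;
 * it is not taken from the patch.
 */
static bool sketch_block_size_valid(uint64_t value, uint64_t min_block_size)
{
    assert(sketch_is_power_of_2(min_block_size));

    if (value > UINT32_MAX) {
        return false;   /* would not fit into a uint32_t block_size field */
    }
    if (value < min_block_size) {
        return false;   /* below the assumed minimum */
    }
    return sketch_is_power_of_2(value);
}
```

With an assumed minimum of 1 MiB, 2 MiB passes, while 3 MiB (not a power
of two) and 512 KiB (too small) are rejected.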

* [PATCH v1 11/17] virtio-pci: Proxy for virtio-mem
  2020-05-06  9:49 ` David Hildenbrand
@ 2020-05-06  9:49   ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-06  9:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin,
	David Hildenbrand, Marcel Apfelbaum, Igor Mammedov

Let's add a proxy for virtio-mem, make it a memory device, and
pass through the properties.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/Makefile.objs    |   1 +
 hw/virtio/virtio-mem-pci.c | 131 +++++++++++++++++++++++++++++++++++++
 hw/virtio/virtio-mem-pci.h |  33 ++++++++++
 include/hw/pci/pci.h       |   1 +
 4 files changed, 166 insertions(+)
 create mode 100644 hw/virtio/virtio-mem-pci.c
 create mode 100644 hw/virtio/virtio-mem-pci.h

diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs
index 7df70e977e..b9661f9c01 100644
--- a/hw/virtio/Makefile.objs
+++ b/hw/virtio/Makefile.objs
@@ -19,6 +19,7 @@ obj-$(call land,$(CONFIG_VHOST_USER_FS),$(CONFIG_VIRTIO_PCI)) += vhost-user-fs-p
 obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
 obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock.o
 obj-$(CONFIG_VIRTIO_MEM) += virtio-mem.o
+common-obj-$(call land,$(CONFIG_VIRTIO_MEM),$(CONFIG_VIRTIO_PCI)) += virtio-mem-pci.o
 
 ifeq ($(CONFIG_VIRTIO_PCI),y)
 obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock-pci.o
diff --git a/hw/virtio/virtio-mem-pci.c b/hw/virtio/virtio-mem-pci.c
new file mode 100644
index 0000000000..a47d21c81f
--- /dev/null
+++ b/hw/virtio/virtio-mem-pci.c
@@ -0,0 +1,131 @@
+/*
+ * Virtio MEM PCI device
+ *
+ * Copyright (C) 2020 Red Hat, Inc.
+ *
+ * Authors:
+ *  David Hildenbrand <david@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+
+#include "virtio-mem-pci.h"
+#include "hw/mem/memory-device.h"
+#include "qapi/error.h"
+
+static void virtio_mem_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
+{
+    VirtIOMEMPCI *mem_pci = VIRTIO_MEM_PCI(vpci_dev);
+    DeviceState *vdev = DEVICE(&mem_pci->vdev);
+
+    qdev_set_parent_bus(vdev, BUS(&vpci_dev->bus));
+    object_property_set_bool(OBJECT(vdev), true, "realized", errp);
+}
+
+static void virtio_mem_pci_set_addr(MemoryDeviceState *md, uint64_t addr,
+                                    Error **errp)
+{
+    object_property_set_uint(OBJECT(md), addr, VIRTIO_MEM_ADDR_PROP, errp);
+}
+
+static uint64_t virtio_mem_pci_get_addr(const MemoryDeviceState *md)
+{
+    return object_property_get_uint(OBJECT(md), VIRTIO_MEM_ADDR_PROP,
+                                    &error_abort);
+}
+
+static MemoryRegion *virtio_mem_pci_get_memory_region(MemoryDeviceState *md,
+                                                      Error **errp)
+{
+    VirtIOMEMPCI *pci_mem = VIRTIO_MEM_PCI(md);
+    VirtIOMEM *vmem = VIRTIO_MEM(&pci_mem->vdev);
+    VirtIOMEMClass *vmc = VIRTIO_MEM_GET_CLASS(vmem);
+
+    return vmc->get_memory_region(vmem, errp);
+}
+
+static uint64_t virtio_mem_pci_get_plugged_size(const MemoryDeviceState *md,
+                                                Error **errp)
+{
+    return object_property_get_uint(OBJECT(md), VIRTIO_MEM_SIZE_PROP,
+                                    errp);
+}
+
+static void virtio_mem_pci_fill_device_info(const MemoryDeviceState *md,
+                                            MemoryDeviceInfo *info)
+{
+    VirtioMEMDeviceInfo *vi = g_new0(VirtioMEMDeviceInfo, 1);
+    VirtIOMEMPCI *pci_mem = VIRTIO_MEM_PCI(md);
+    VirtIOMEM *vmem = VIRTIO_MEM(&pci_mem->vdev);
+    VirtIOMEMClass *vpc = VIRTIO_MEM_GET_CLASS(vmem);
+    DeviceState *dev = DEVICE(md);
+
+    if (dev->id) {
+        vi->has_id = true;
+        vi->id = g_strdup(dev->id);
+    }
+
+    /* let the real device handle everything else */
+    vpc->fill_device_info(vmem, vi);
+
+    info->u.virtio_mem.data = vi;
+    info->type = MEMORY_DEVICE_INFO_KIND_VIRTIO_MEM;
+}
+
+static void virtio_mem_pci_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    VirtioPCIClass *k = VIRTIO_PCI_CLASS(klass);
+    PCIDeviceClass *pcidev_k = PCI_DEVICE_CLASS(klass);
+    MemoryDeviceClass *mdc = MEMORY_DEVICE_CLASS(klass);
+
+    k->realize = virtio_mem_pci_realize;
+    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
+    pcidev_k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
+    pcidev_k->device_id = PCI_DEVICE_ID_VIRTIO_MEM;
+    pcidev_k->revision = VIRTIO_PCI_ABI_VERSION;
+    pcidev_k->class_id = PCI_CLASS_OTHERS;
+
+    mdc->get_addr = virtio_mem_pci_get_addr;
+    mdc->set_addr = virtio_mem_pci_set_addr;
+    mdc->get_plugged_size = virtio_mem_pci_get_plugged_size;
+    mdc->get_memory_region = virtio_mem_pci_get_memory_region;
+    mdc->fill_device_info = virtio_mem_pci_fill_device_info;
+}
+
+static void virtio_mem_pci_instance_init(Object *obj)
+{
+    VirtIOMEMPCI *dev = VIRTIO_MEM_PCI(obj);
+
+    virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev),
+                                TYPE_VIRTIO_MEM);
+    object_property_add_alias(obj, VIRTIO_MEM_BLOCK_SIZE_PROP,
+                              OBJECT(&dev->vdev),
+                              VIRTIO_MEM_BLOCK_SIZE_PROP, &error_abort);
+    object_property_add_alias(obj, VIRTIO_MEM_SIZE_PROP, OBJECT(&dev->vdev),
+                              VIRTIO_MEM_SIZE_PROP, &error_abort);
+    object_property_add_alias(obj, VIRTIO_MEM_REQUESTED_SIZE_PROP,
+                              OBJECT(&dev->vdev),
+                              VIRTIO_MEM_REQUESTED_SIZE_PROP, &error_abort);
+}
+
+static const VirtioPCIDeviceTypeInfo virtio_mem_pci_info = {
+    .base_name = TYPE_VIRTIO_MEM_PCI,
+    .generic_name = "virtio-mem-pci",
+    .instance_size = sizeof(VirtIOMEMPCI),
+    .instance_init = virtio_mem_pci_instance_init,
+    .class_init = virtio_mem_pci_class_init,
+    .interfaces = (InterfaceInfo[]) {
+        { TYPE_MEMORY_DEVICE },
+        { }
+    },
+};
+
+static void virtio_mem_pci_register_types(void)
+{
+    virtio_pci_types_register(&virtio_mem_pci_info);
+}
+type_init(virtio_mem_pci_register_types)
diff --git a/hw/virtio/virtio-mem-pci.h b/hw/virtio/virtio-mem-pci.h
new file mode 100644
index 0000000000..8820cd6628
--- /dev/null
+++ b/hw/virtio/virtio-mem-pci.h
@@ -0,0 +1,33 @@
+/*
+ * Virtio MEM PCI device
+ *
+ * Copyright (C) 2020 Red Hat, Inc.
+ *
+ * Authors:
+ *  David Hildenbrand <david@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef QEMU_VIRTIO_MEM_PCI_H
+#define QEMU_VIRTIO_MEM_PCI_H
+
+#include "hw/virtio/virtio-pci.h"
+#include "hw/virtio/virtio-mem.h"
+
+typedef struct VirtIOMEMPCI VirtIOMEMPCI;
+
+/*
+ * virtio-mem-pci: This extends VirtioPCIProxy.
+ */
+#define TYPE_VIRTIO_MEM_PCI "virtio-mem-pci-base"
+#define VIRTIO_MEM_PCI(obj) \
+        OBJECT_CHECK(VirtIOMEMPCI, (obj), TYPE_VIRTIO_MEM_PCI)
+
+struct VirtIOMEMPCI {
+    VirtIOPCIProxy parent_obj;
+    VirtIOMEM vdev;
+};
+
+#endif /* QEMU_VIRTIO_MEM_PCI_H */
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index cfedf5a995..fec72d5a31 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -87,6 +87,7 @@ extern bool pci_available;
 #define PCI_DEVICE_ID_VIRTIO_VSOCK       0x1012
 #define PCI_DEVICE_ID_VIRTIO_PMEM        0x1013
 #define PCI_DEVICE_ID_VIRTIO_IOMMU       0x1014
+#define PCI_DEVICE_ID_VIRTIO_MEM         0x1015
 
 #define PCI_VENDOR_ID_REDHAT             0x1b36
 #define PCI_DEVICE_ID_REDHAT_BRIDGE      0x0001
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 94+ messages in thread
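The delegation in virtio_mem_pci_fill_device_info() above (the proxy fills
in the ID, the embedded device fills in everything else through its class
callback) can be sketched without QOM roughly as follows. All Sketch* types
and sketch_* functions are hypothetical simplifications and not QEMU APIs;
only the control flow mirrors the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical stand-in for VirtioMEMDeviceInfo. */
typedef struct SketchInfo {
    bool has_id;
    const char *id;
    uint64_t memaddr;
    uint64_t size;
} SketchInfo;

typedef struct SketchMem SketchMem;

/* Hypothetical class struct holding the per-type callback. */
typedef struct SketchMemClass {
    void (*fill_device_info)(const SketchMem *vmem, SketchInfo *vi);
} SketchMemClass;

struct SketchMem {
    const SketchMemClass *klass;
    uint64_t addr;
    uint64_t size;
};

/* Hypothetical proxy embedding the device, like VirtIOMEMPCI does. */
typedef struct SketchMemPCI {
    const char *id;     /* may be NULL if no ID was assigned */
    SketchMem vdev;
} SketchMemPCI;

static void sketch_mem_fill_device_info(const SketchMem *vmem, SketchInfo *vi)
{
    vi->memaddr = vmem->addr;
    vi->size = vmem->size;
}

static const SketchMemClass sketch_mem_class = {
    .fill_device_info = sketch_mem_fill_device_info,
};

static void sketch_mem_pci_init(SketchMemPCI *pci, const char *id,
                                uint64_t addr, uint64_t size)
{
    pci->id = id;
    pci->vdev.klass = &sketch_mem_class;
    pci->vdev.addr = addr;
    pci->vdev.size = size;
}

/* 'vi' is expected to be zero-initialized by the caller. */
static void sketch_mem_pci_fill_device_info(const SketchMemPCI *pci,
                                            SketchInfo *vi)
{
    assert(pci != NULL && vi != NULL);

    if (pci->id) {
        vi->has_id = true;
        vi->id = pci->id;
    }
    /* let the embedded device handle everything else */
    pci->vdev.klass->fill_device_info(&pci->vdev, vi);
}
```

Keeping the callback in the device's class means that another transport
proxy could reuse the same fill logic without duplicating it.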

* [PATCH v1 12/17] MAINTAINERS: Add myself as virtio-mem maintainer
  2020-05-06  9:49 ` David Hildenbrand
@ 2020-05-06  9:49   ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-06  9:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin,
	David Hildenbrand, Peter Maydell, Markus Armbruster

Let's make sure patches/bug reports find the right person.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: Markus Armbruster <armbru@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 MAINTAINERS | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 1f84e3ae2c..09fff9e1bd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1734,6 +1734,14 @@ F: hw/virtio/virtio-crypto.c
 F: hw/virtio/virtio-crypto-pci.c
 F: include/hw/virtio/virtio-crypto.h
 
+virtio-mem
+M: David Hildenbrand <david@redhat.com>
+S: Supported
+F: hw/virtio/virtio-mem.c
+F: hw/virtio/virtio-mem-pci.h
+F: hw/virtio/virtio-mem-pci.c
+F: include/hw/virtio/virtio-mem.h
+
 nvme
 M: Keith Busch <kbusch@kernel.org>
 L: qemu-block@nongnu.org
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v1 13/17] hmp: Handle virtio-mem when printing memory device info
  2020-05-06  9:49 ` David Hildenbrand
@ 2020-05-06  9:49   ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-06  9:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin,
	David Hildenbrand

Print the memory device info just like for other memory devices.

Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 monitor/hmp-cmds.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index 7f6e982dc8..4b3638a2a6 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -1805,6 +1805,7 @@ void hmp_info_memory_devices(Monitor *mon, const QDict *qdict)
     MemoryDeviceInfoList *info_list = qmp_query_memory_devices(&err);
     MemoryDeviceInfoList *info;
     VirtioPMEMDeviceInfo *vpi;
+    VirtioMEMDeviceInfo *vmi;
     MemoryDeviceInfo *value;
     PCDIMMDeviceInfo *di;
 
@@ -1839,6 +1840,21 @@ void hmp_info_memory_devices(Monitor *mon, const QDict *qdict)
                 monitor_printf(mon, "  size: %" PRIu64 "\n", vpi->size);
                 monitor_printf(mon, "  memdev: %s\n", vpi->memdev);
                 break;
+            case MEMORY_DEVICE_INFO_KIND_VIRTIO_MEM:
+                vmi = value->u.virtio_mem.data;
+                monitor_printf(mon, "Memory device [%s]: \"%s\"\n",
+                               MemoryDeviceInfoKind_str(value->type),
+                               vmi->id ? vmi->id : "");
+                monitor_printf(mon, "  memaddr: 0x%" PRIx64 "\n", vmi->memaddr);
+                monitor_printf(mon, "  node: %" PRId64 "\n", vmi->node);
+                monitor_printf(mon, "  requested-size: %" PRIu64 "\n",
+                               vmi->requested_size);
+                monitor_printf(mon, "  size: %" PRIu64 "\n", vmi->size);
+                monitor_printf(mon, "  max-size: %" PRIu64 "\n", vmi->max_size);
+                monitor_printf(mon, "  block-size: %" PRIu64 "\n",
+                               vmi->block_size);
+                monitor_printf(mon, "  memdev: %s\n", vmi->memdev);
+                break;
             default:
                 g_assert_not_reached();
             }
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 94+ messages in thread
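To make the resulting "info memory-devices" output easier to picture, here
is a standalone sketch that formats the same fields with the same format
macros into a buffer. SketchVirtioMemInfo and sketch_format_virtio_mem()
are hypothetical; the patch prints through monitor_printf() and obtains the
kind string from MemoryDeviceInfoKind_str(), which is hardcoded here as
"virtio-mem" (the union branch name added to qapi/misc.json).

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mirror of the fields printed for a virtio-mem device. */
typedef struct SketchVirtioMemInfo {
    const char *id;         /* NULL if the device has no ID */
    uint64_t memaddr;
    int64_t node;
    uint64_t requested_size;
    uint64_t size;
    uint64_t max_size;
    uint64_t block_size;
    const char *memdev;
} SketchVirtioMemInfo;

/* Returns what snprintf() returns: the length the full text would need. */
static int sketch_format_virtio_mem(char *buf, size_t len,
                                    const SketchVirtioMemInfo *vmi)
{
    assert(buf != NULL && vmi != NULL && vmi->memdev != NULL);

    return snprintf(buf, len,
                    "Memory device [virtio-mem]: \"%s\"\n"
                    "  memaddr: 0x%" PRIx64 "\n"
                    "  node: %" PRId64 "\n"
                    "  requested-size: %" PRIu64 "\n"
                    "  size: %" PRIu64 "\n"
                    "  max-size: %" PRIu64 "\n"
                    "  block-size: %" PRIu64 "\n"
                    "  memdev: %s\n",
                    vmi->id ? vmi->id : "",
                    vmi->memaddr, vmi->node, vmi->requested_size,
                    vmi->size, vmi->max_size, vmi->block_size,
                    vmi->memdev);
}
```

Sizes are printed as plain byte counts and only the address is printed in
hexadecimal, matching the existing output for the other memory devices.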

* [PATCH v1 14/17] numa: Handle virtio-mem in NUMA stats
  2020-05-06  9:49 ` David Hildenbrand
@ 2020-05-06  9:49   ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-06  9:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin,
	David Hildenbrand, Marcel Apfelbaum

Account the memory to the configured NUMA node (nid).

Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/core/numa.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/hw/core/numa.c b/hw/core/numa.c
index 316bc50d75..06960918e7 100644
--- a/hw/core/numa.c
+++ b/hw/core/numa.c
@@ -812,6 +812,7 @@ static void numa_stat_memory_devices(NumaNodeMem node_mem[])
     MemoryDeviceInfoList *info;
     PCDIMMDeviceInfo     *pcdimm_info;
     VirtioPMEMDeviceInfo *vpi;
+    VirtioMEMDeviceInfo *vmi;
 
     for (info = info_list; info; info = info->next) {
         MemoryDeviceInfo *value = info->value;
@@ -832,6 +833,11 @@ static void numa_stat_memory_devices(NumaNodeMem node_mem[])
                 node_mem[0].node_mem += vpi->size;
                 node_mem[0].node_plugged_mem += vpi->size;
                 break;
+            case MEMORY_DEVICE_INFO_KIND_VIRTIO_MEM:
+                vmi = value->u.virtio_mem.data;
+                node_mem[vmi->node].node_mem += vmi->size;
+                node_mem[vmi->node].node_plugged_mem += vmi->size;
+                break;
             default:
                 g_assert_not_reached();
             }
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 94+ messages in thread
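The accounting added above boils down to adding the currently plugged size
to both counters of the device's node. A minimal standalone sketch follows;
SketchNodeMem and sketch_account_virtio_mem() are hypothetical. The range
check is an addition made for this sketch: the patch indexes node_mem[]
directly and presumably relies on the node having been validated earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two counters kept per NUMA node. */
typedef struct SketchNodeMem {
    uint64_t node_mem;
    uint64_t node_plugged_mem;
} SketchNodeMem;

/*
 * Account 'size' bytes of plugged memory to 'node'.
 * Returns 0 on success, -1 if 'node' is outside [0, nb_nodes).
 */
static int sketch_account_virtio_mem(SketchNodeMem node_mem[],
                                     size_t nb_nodes,
                                     int64_t node, uint64_t size)
{
    assert(node_mem != NULL);

    if (node < 0 || (uint64_t)node >= nb_nodes) {
        return -1;
    }
    node_mem[node].node_mem += size;
    node_mem[node].node_plugged_mem += size;
    return 0;
}
```

Note that the size accounted is the amount currently plugged, so the
per-node numbers change as the guest plugs and unplugs blocks.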

* [PATCH v1 15/17] pc: Support for virtio-mem-pci
  2020-05-06  9:49 ` David Hildenbrand
@ 2020-05-06  9:49   ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-06  9:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin,
	David Hildenbrand, Marcel Apfelbaum, Eric Blake,
	Markus Armbruster

Let's wire it up similarly to virtio-pmem. Also disallow unplug, so it's
harder for users to shoot themselves in the foot.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Eric Blake <eblake@redhat.com>
Cc: Markus Armbruster <armbru@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
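A rough, standalone sketch of the dispatch idea follows: both virtio
based memory device types go through the same handlers, and an unplug
request for either is always refused. Everything named sketch_* or
SKETCH_* is hypothetical and only for illustration. Plain strings stand
in for QOM types here, whereas the patch itself relies on
object_dynamic_cast().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for QOM type names; not QEMU's macros. */
#define SKETCH_TYPE_PC_DIMM         "pc-dimm"
#define SKETCH_TYPE_VIRTIO_PMEM_PCI "virtio-pmem-pci"
#define SKETCH_TYPE_VIRTIO_MEM_PCI  "virtio-mem-pci"

/* True if the type is one of the virtio based memory devices. */
static bool sketch_is_virtio_md_pci(const char *type)
{
    assert(type != NULL);
    return strcmp(type, SKETCH_TYPE_VIRTIO_PMEM_PCI) == 0 ||
           strcmp(type, SKETCH_TYPE_VIRTIO_MEM_PCI) == 0;
}

/*
 * Models the unplug request path. Returns NULL if the request is
 * accepted, or an error string if it is refused.
 */
static const char *sketch_unplug_request(const char *type)
{
    if (sketch_is_virtio_md_pci(type)) {
        /* Shared handler: unplug is never allowed for these devices. */
        return "virtio based memory devices cannot be unplugged.";
    }
    if (strcmp(type, SKETCH_TYPE_PC_DIMM) == 0) {
        return NULL;
    }
    return "device unplug request for not supported device type";
}
```

Calling sketch_unplug_request("virtio-mem-pci") yields the same refusal
as for "virtio-pmem-pci", which is the behavior the shared handler is
meant to guarantee.
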
 hw/i386/Kconfig |  1 +
 hw/i386/pc.c    | 49 ++++++++++++++++++++++++++++---------------------
 2 files changed, 29 insertions(+), 21 deletions(-)

diff --git a/hw/i386/Kconfig b/hw/i386/Kconfig
index c93f32f657..03e347b207 100644
--- a/hw/i386/Kconfig
+++ b/hw/i386/Kconfig
@@ -35,6 +35,7 @@ config PC
     select ACPI_PCI
     select ACPI_VMGENID
     select VIRTIO_PMEM_SUPPORTED
+    select VIRTIO_MEM_SUPPORTED
 
 config PC_PCI
     bool
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index f6b8431c8b..588804f895 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -86,6 +86,7 @@
 #include "hw/net/ne2000-isa.h"
 #include "standard-headers/asm-x86/bootparam.h"
 #include "hw/virtio/virtio-pmem-pci.h"
+#include "hw/virtio/virtio-mem-pci.h"
 #include "hw/mem/memory-device.h"
 #include "sysemu/replay.h"
 #include "qapi/qmp/qerror.h"
@@ -1654,8 +1655,8 @@ static void pc_cpu_pre_plug(HotplugHandler *hotplug_dev,
     numa_cpu_pre_plug(cpu_slot, dev, errp);
 }
 
-static void pc_virtio_pmem_pci_pre_plug(HotplugHandler *hotplug_dev,
-                                        DeviceState *dev, Error **errp)
+static void pc_virtio_md_pci_pre_plug(HotplugHandler *hotplug_dev,
+                                      DeviceState *dev, Error **errp)
 {
     HotplugHandler *hotplug_dev2 = qdev_get_bus_hotplug_handler(dev);
     Error *local_err = NULL;
@@ -1666,7 +1667,8 @@ static void pc_virtio_pmem_pci_pre_plug(HotplugHandler *hotplug_dev,
          * order. This should never be the case on x86, however better add
          * a safety net.
          */
-        error_setg(errp, "virtio-pmem-pci not supported on this bus.");
+        error_setg(errp,
+                   "virtio based memory devices not supported on this bus.");
         return;
     }
     /*
@@ -1681,8 +1683,8 @@ static void pc_virtio_pmem_pci_pre_plug(HotplugHandler *hotplug_dev,
     error_propagate(errp, local_err);
 }
 
-static void pc_virtio_pmem_pci_plug(HotplugHandler *hotplug_dev,
-                                    DeviceState *dev, Error **errp)
+static void pc_virtio_md_pci_plug(HotplugHandler *hotplug_dev,
+                                  DeviceState *dev, Error **errp)
 {
     HotplugHandler *hotplug_dev2 = qdev_get_bus_hotplug_handler(dev);
     Error *local_err = NULL;
@@ -1700,17 +1702,17 @@ static void pc_virtio_pmem_pci_plug(HotplugHandler *hotplug_dev,
     error_propagate(errp, local_err);
 }
 
-static void pc_virtio_pmem_pci_unplug_request(HotplugHandler *hotplug_dev,
-                                              DeviceState *dev, Error **errp)
+static void pc_virtio_md_pci_unplug_request(HotplugHandler *hotplug_dev,
+                                            DeviceState *dev, Error **errp)
 {
-    /* We don't support virtio pmem hot unplug */
-    error_setg(errp, "virtio pmem device unplug not supported.");
+    /* We don't support hot unplug of virtio based memory devices */
+    error_setg(errp, "virtio based memory devices cannot be unplugged.");
 }
 
-static void pc_virtio_pmem_pci_unplug(HotplugHandler *hotplug_dev,
-                                      DeviceState *dev, Error **errp)
+static void pc_virtio_md_pci_unplug(HotplugHandler *hotplug_dev,
+                                    DeviceState *dev, Error **errp)
 {
-    /* We don't support virtio pmem hot unplug */
+    /* We don't support hot unplug of virtio based memory devices */
 }
 
 static void pc_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev,
@@ -1720,8 +1722,9 @@ static void pc_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev,
         pc_memory_pre_plug(hotplug_dev, dev, errp);
     } else if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
         pc_cpu_pre_plug(hotplug_dev, dev, errp);
-    } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI)) {
-        pc_virtio_pmem_pci_pre_plug(hotplug_dev, dev, errp);
+    } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI) ||
+               object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI)) {
+        pc_virtio_md_pci_pre_plug(hotplug_dev, dev, errp);
     }
 }
 
@@ -1732,8 +1735,9 @@ static void pc_machine_device_plug_cb(HotplugHandler *hotplug_dev,
         pc_memory_plug(hotplug_dev, dev, errp);
     } else if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
         pc_cpu_plug(hotplug_dev, dev, errp);
-    } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI)) {
-        pc_virtio_pmem_pci_plug(hotplug_dev, dev, errp);
+    } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI) ||
+               object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI)) {
+        pc_virtio_md_pci_plug(hotplug_dev, dev, errp);
     }
 }
 
@@ -1744,8 +1748,9 @@ static void pc_machine_device_unplug_request_cb(HotplugHandler *hotplug_dev,
         pc_memory_unplug_request(hotplug_dev, dev, errp);
     } else if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
         pc_cpu_unplug_request_cb(hotplug_dev, dev, errp);
-    } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI)) {
-        pc_virtio_pmem_pci_unplug_request(hotplug_dev, dev, errp);
+    } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI) ||
+               object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI)) {
+        pc_virtio_md_pci_unplug_request(hotplug_dev, dev, errp);
     } else {
         error_setg(errp, "acpi: device unplug request for not supported device"
                    " type: %s", object_get_typename(OBJECT(dev)));
@@ -1759,8 +1764,9 @@ static void pc_machine_device_unplug_cb(HotplugHandler *hotplug_dev,
         pc_memory_unplug(hotplug_dev, dev, errp);
     } else if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
         pc_cpu_unplug_cb(hotplug_dev, dev, errp);
-    } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI)) {
-        pc_virtio_pmem_pci_unplug(hotplug_dev, dev, errp);
+    } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI) ||
+               object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI)) {
+        pc_virtio_md_pci_unplug(hotplug_dev, dev, errp);
     } else {
         error_setg(errp, "acpi: device unplug for not supported device"
                    " type: %s", object_get_typename(OBJECT(dev)));
@@ -1772,7 +1778,8 @@ static HotplugHandler *pc_get_hotplug_handler(MachineState *machine,
 {
     if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM) ||
         object_dynamic_cast(OBJECT(dev), TYPE_CPU) ||
-        object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI)) {
+        object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI) ||
+        object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI)) {
         return HOTPLUG_HANDLER(machine);
     }
 
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v1 16/17] virtio-mem: Allow notifiers for size changes
  2020-05-06  9:49 ` David Hildenbrand
@ 2020-05-06  9:49   ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-06  9:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin,
	David Hildenbrand, Igor Mammedov

We want to send qapi events in case the size of a virtio-mem device
changes. This allows upper layers to always know how much memory is
actually currently consumed via a virtio-mem device.

Unfortunately, we have to report the id of our proxy device. Let's provide
an easy way for our proxy device to register, so it can send the qapi
events. Piggy-backing on the notifier infrastructure (although we'll
only ever have one notifier registered) seems to be an easy way.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
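For readers unfamiliar with the notifier pattern, here is a minimal,
standalone sketch of the idea: a device keeps a list of callbacks and
invokes them with a pointer to the new size whenever the size changes.
The Sketch* types and sketch_* functions are simplified, hypothetical
stand-ins and do not reproduce QEMU's Notifier/NotifierList
implementation. The sketch notifies only when the value actually
changes, which mirrors the check added to virtio_mem_unplug_all().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a notifier; QEMU's own types differ. */
typedef struct SketchNotifier SketchNotifier;
struct SketchNotifier {
    void (*notify)(SketchNotifier *notifier, void *data);
    SketchNotifier *next;
};

typedef struct SketchNotifierList {
    SketchNotifier *head;
} SketchNotifierList;

/* Hypothetical device: only the fields the sketch needs. */
typedef struct SketchMemDevice {
    uint64_t size;                            /* currently plugged size */
    SketchNotifierList size_change_notifiers;
} SketchMemDevice;

static void sketch_notifier_list_add(SketchNotifierList *list,
                                     SketchNotifier *notifier)
{
    assert(notifier->notify != NULL);
    notifier->next = list->head;
    list->head = notifier;
}

static void sketch_notifier_list_notify(SketchNotifierList *list, void *data)
{
    for (SketchNotifier *n = list->head; n; n = n->next) {
        n->notify(n, data);
    }
}

/* Update the size and tell listeners, but only on an actual change. */
static void sketch_set_size(SketchMemDevice *dev, uint64_t new_size)
{
    if (dev->size == new_size) {
        return;
    }
    dev->size = new_size;
    sketch_notifier_list_notify(&dev->size_change_notifiers, &dev->size);
}

/* Example listener: remembers the last reported size and counts calls. */
typedef struct SketchRecorder {
    SketchNotifier notifier;   /* first member, so the cast below is valid */
    uint64_t last_size;
    unsigned calls;
} SketchRecorder;

static void sketch_record(SketchNotifier *notifier, void *data)
{
    SketchRecorder *rec = (SketchRecorder *)notifier;

    rec->last_size = *(const uint64_t *)data;
    rec->calls++;
}
```

A listener such as SketchRecorder would be registered once with
sketch_notifier_list_add(); every later sketch_set_size() call that
changes the value then reaches it through the list.
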
 hw/virtio/virtio-mem.c         | 21 ++++++++++++++++++++-
 include/hw/virtio/virtio-mem.h |  5 +++++
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index e25b2c74f2..88a99a0d90 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -198,6 +198,7 @@ static int virtio_mem_state_change_request(VirtIOMEM *vmem, uint64_t gpa,
     } else {
         vmem->size -= size;
     }
+    notifier_list_notify(&vmem->size_change_notifiers, &vmem->size);
     return VIRTIO_MEM_RESP_ACK;
 }
 
@@ -253,7 +254,10 @@ static int virtio_mem_unplug_all(VirtIOMEM *vmem)
         return -EBUSY;
     }
     bitmap_clear(vmem->bitmap, 0, vmem->bitmap_size);
-    vmem->size = 0;
+    if (vmem->size != 0) {
+        vmem->size = 0;
+        notifier_list_notify(&vmem->size_change_notifiers, &vmem->size);
+    }
 
     virtio_mem_resize_usable_region(vmem, vmem->requested_size, true);
     return 0;
@@ -594,6 +598,18 @@ static MemoryRegion *virtio_mem_get_memory_region(VirtIOMEM *vmem, Error **errp)
     return &vmem->memdev->mr;
 }
 
+static void virtio_mem_add_size_change_notifier(VirtIOMEM *vmem,
+                                                Notifier *notifier)
+{
+    notifier_list_add(&vmem->size_change_notifiers, notifier);
+}
+
+static void virtio_mem_remove_size_change_notifier(VirtIOMEM *vmem,
+                                                   Notifier *notifier)
+{
+    notifier_remove(notifier);
+}
+
 static void virtio_mem_get_size(Object *obj, Visitor *v, const char *name,
                                 void *opaque, Error **errp)
 {
@@ -705,6 +721,7 @@ static void virtio_mem_instance_init(Object *obj)
     VirtIOMEM *vmem = VIRTIO_MEM(obj);
 
     vmem->block_size = VIRTIO_MEM_MIN_BLOCK_SIZE;
+    notifier_list_init(&vmem->size_change_notifiers);
 
     object_property_add(obj, VIRTIO_MEM_SIZE_PROP, "size", virtio_mem_get_size,
                         NULL, NULL, NULL, &error_abort);
@@ -743,6 +760,8 @@ static void virtio_mem_class_init(ObjectClass *klass, void *data)
 
     vmc->fill_device_info = virtio_mem_fill_device_info;
     vmc->get_memory_region = virtio_mem_get_memory_region;
+    vmc->add_size_change_notifier = virtio_mem_add_size_change_notifier;
+    vmc->remove_size_change_notifier = virtio_mem_remove_size_change_notifier;
 }
 
 static const TypeInfo virtio_mem_info = {
diff --git a/include/hw/virtio/virtio-mem.h b/include/hw/virtio/virtio-mem.h
index 27158cb611..5820b5c23e 100644
--- a/include/hw/virtio/virtio-mem.h
+++ b/include/hw/virtio/virtio-mem.h
@@ -66,6 +66,9 @@ typedef struct VirtIOMEM {
     /* block size and alignment */
     uint32_t block_size;
     uint32_t migration_block_size;
+
+    /* notifiers to notify when "size" changes */
+    NotifierList size_change_notifiers;
 } VirtIOMEM;
 
 typedef struct VirtIOMEMClass {
@@ -75,6 +78,8 @@ typedef struct VirtIOMEMClass {
     /* public */
     void (*fill_device_info)(const VirtIOMEM *vmen, VirtioMEMDeviceInfo *vi);
     MemoryRegion *(*get_memory_region)(VirtIOMEM *vmem, Error **errp);
+    void (*add_size_change_notifier)(VirtIOMEM *vmem, Notifier *notifier);
+    void (*remove_size_change_notifier)(VirtIOMEM *vmem, Notifier *notifier);
 } VirtIOMEMClass;
 
 #endif
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v1 17/17] virtio-pci: Send qapi events when the virtio-mem size changes
  2020-05-06  9:49 ` David Hildenbrand
@ 2020-05-06  9:49   ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-06  9:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin,
	David Hildenbrand, Markus Armbruster, Eric Blake, Igor Mammedov

Let's register the notifier and trigger the qapi event with the right
device id.

MEMORY_DEVICE_SIZE_CHANGE is similar to BALLOON_CHANGE, however on a
memory device level.

Don't unregister the notifier (we have neither finalize() nor unrealize()
for VirtIOPCIProxy, so it's not that simple to do it) - both devices are
expected to vanish at the same time.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Markus Armbruster <armbru@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Eric Blake <eblake@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
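The callback recovers the proxy from the embedded notifier and reports
the proxy's id together with the new size. A minimal, standalone sketch
of that pattern follows. All Sketch* and sketch_* names are hypothetical,
and the formatted string only loosely mirrors the example event in the
schema; in QEMU the event is emitted by generated QAPI code, not by
snprintf().

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Generic container_of; QEMU ships its own macro for this. */
#define sketch_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

typedef struct SketchNotifier {
    void (*notify)(struct SketchNotifier *notifier, void *data);
} SketchNotifier;

/* Hypothetical proxy: owns the user-visible id and embeds the notifier. */
typedef struct SketchProxy {
    const char *id;                  /* NULL if no id was assigned */
    SketchNotifier size_change_notifier;
    char last_event[128];            /* stands in for the emitted event */
} SketchProxy;

static void sketch_size_change_notify(SketchNotifier *notifier, void *data)
{
    /* Walk back from the embedded member to the enclosing proxy. */
    SketchProxy *proxy = sketch_container_of(notifier, SketchProxy,
                                             size_change_notifier);
    const uint64_t *size_p = data;

    assert(size_p != NULL);
    if (proxy->id) {
        snprintf(proxy->last_event, sizeof(proxy->last_event),
                 "{\"id\": \"%s\", \"size\": %" PRIu64 "}",
                 proxy->id, *size_p);
    } else {
        /* The id is optional in the event, so it is simply left out. */
        snprintf(proxy->last_event, sizeof(proxy->last_event),
                 "{\"size\": %" PRIu64 "}", *size_p);
    }
}
```

Because the notifier is a member of the proxy, the callback needs no
extra opaque pointer to find the id it has to report.
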
 hw/virtio/virtio-mem-pci.c | 28 ++++++++++++++++++++++++++++
 hw/virtio/virtio-mem-pci.h |  1 +
 hw/virtio/virtio-mem.c     |  2 +-
 monitor/monitor.c          |  1 +
 qapi/misc.json             | 25 +++++++++++++++++++++++++
 5 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/virtio-mem-pci.c b/hw/virtio/virtio-mem-pci.c
index a47d21c81f..780d7b4af7 100644
--- a/hw/virtio/virtio-mem-pci.c
+++ b/hw/virtio/virtio-mem-pci.c
@@ -15,6 +15,7 @@
 #include "virtio-mem-pci.h"
 #include "hw/mem/memory-device.h"
 #include "qapi/error.h"
+#include "qapi/qapi-events-misc.h"
 
 static void virtio_mem_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
 {
@@ -75,6 +76,21 @@ static void virtio_mem_pci_fill_device_info(const MemoryDeviceState *md,
     info->type = MEMORY_DEVICE_INFO_KIND_VIRTIO_MEM;
 }
 
+static void virtio_mem_pci_size_change_notify(Notifier *notifier, void *data)
+{
+    VirtIOMEMPCI *pci_mem = container_of(notifier, VirtIOMEMPCI,
+                                         size_change_notifier);
+    DeviceState *dev = DEVICE(pci_mem);
+    const uint64_t * const size_p = data;
+    g_autofree char *id = NULL;
+
+    if (dev->id) {
+        id = g_strdup(dev->id);
+    }
+
+    qapi_event_send_memory_device_size_change(!!id, id, *size_p);
+}
+
 static void virtio_mem_pci_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
@@ -99,9 +115,21 @@ static void virtio_mem_pci_class_init(ObjectClass *klass, void *data)
 static void virtio_mem_pci_instance_init(Object *obj)
 {
     VirtIOMEMPCI *dev = VIRTIO_MEM_PCI(obj);
+    VirtIOMEMClass *vmc;
+    VirtIOMEM *vmem;
 
     virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev),
                                 TYPE_VIRTIO_MEM);
+
+    dev->size_change_notifier.notify = virtio_mem_pci_size_change_notify;
+    vmem = VIRTIO_MEM(&dev->vdev);
+    vmc = VIRTIO_MEM_GET_CLASS(vmem);
+    /*
+     * We never remove the notifier again, as we expect both devices to
+     * disappear at the same time.
+     */
+    vmc->add_size_change_notifier(vmem, &dev->size_change_notifier);
+
     object_property_add_alias(obj, VIRTIO_MEM_BLOCK_SIZE_PROP,
                               OBJECT(&dev->vdev),
                               VIRTIO_MEM_BLOCK_SIZE_PROP, &error_abort);
diff --git a/hw/virtio/virtio-mem-pci.h b/hw/virtio/virtio-mem-pci.h
index 8820cd6628..b51a28b275 100644
--- a/hw/virtio/virtio-mem-pci.h
+++ b/hw/virtio/virtio-mem-pci.h
@@ -28,6 +28,7 @@ typedef struct VirtIOMEMPCI VirtIOMEMPCI;
 struct VirtIOMEMPCI {
     VirtIOPCIProxy parent_obj;
     VirtIOMEM vdev;
+    Notifier size_change_notifier;
 };
 
 #endif /* QEMU_VIRTIO_MEM_PCI_H */
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 88a99a0d90..eb5cf66855 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -491,7 +491,7 @@ static void virtio_mem_device_unrealize(DeviceState *dev, Error **errp)
     virtio_del_queue(vdev, 0);
     virtio_cleanup(vdev);
     g_free(vmem->bitmap);
-    ramblock_discard_set_required(false);
+    ram_block_discard_set_required(false);
 }
 
 static int virtio_mem_pre_save(void *opaque)
diff --git a/monitor/monitor.c b/monitor/monitor.c
index 125494410a..19dcb8fbe3 100644
--- a/monitor/monitor.c
+++ b/monitor/monitor.c
@@ -235,6 +235,7 @@ static MonitorQAPIEventConf monitor_qapi_event_conf[QAPI_EVENT__MAX] = {
     [QAPI_EVENT_QUORUM_REPORT_BAD] = { 1000 * SCALE_MS },
     [QAPI_EVENT_QUORUM_FAILURE]    = { 1000 * SCALE_MS },
     [QAPI_EVENT_VSERPORT_CHANGE]   = { 1000 * SCALE_MS },
+    [QAPI_EVENT_MEMORY_DEVICE_SIZE_CHANGE] = { 1000 * SCALE_MS },
 };
 
 /*
diff --git a/qapi/misc.json b/qapi/misc.json
index feaeacec22..58b073562b 100644
--- a/qapi/misc.json
+++ b/qapi/misc.json
@@ -1432,6 +1432,31 @@
 ##
 { 'command': 'query-memory-devices', 'returns': ['MemoryDeviceInfo'] }
 
+##
+# @MEMORY_DEVICE_SIZE_CHANGE:
+#
+# Emitted when the size of a memory device changes. Only emitted for memory
+# devices that can actually change the size (e.g., virtio-mem due to guest
+# action).
+#
+# @id: device's ID
+# @size: the new size of memory that the device provides
+#
+# Note: this event is rate-limited.
+#
+# Since: 5.1
+#
+# Example:
+#
+# <- { "event": "MEMORY_DEVICE_SIZE_CHANGE",
+#      "data": { "id": "vm0", "size": 1073741824},
+#      "timestamp": { "seconds": 1588168529, "microseconds": 201316 } }
+#
+##
+{ 'event': 'MEMORY_DEVICE_SIZE_CHANGE',
+  'data': { '*id': 'str', 'size': 'size' } }
+
+
 ##
 # @MEM_UNPLUG_ERROR:
 #
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v1 17/17] virtio-pci: Send qapi events when the virtio-mem size changes
@ 2020-05-06  9:49   ` David Hildenbrand
  0 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-06  9:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, kvm, Michael S . Tsirkin, David Hildenbrand,
	Dr . David Alan Gilbert, Markus Armbruster, qemu-s390x,
	Igor Mammedov, Paolo Bonzini, Richard Henderson

Let's register the notifier and trigger the qapi event with the right
device id.

MEMORY_DEVICE_SIZE_CHANGE is similar to BALLOON_CHANGE, however on a
memory device level.

Don't unregister the notifier (we have neither finalize() nor unrealize()
for VirtIOPCIProxy, so it's not that simple to do it) - both devices are
expected to vanish at the same time.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Markus Armbruster <armbru@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Eric Blake <eblake@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/virtio-mem-pci.c | 28 ++++++++++++++++++++++++++++
 hw/virtio/virtio-mem-pci.h |  1 +
 hw/virtio/virtio-mem.c     |  2 +-
 monitor/monitor.c          |  1 +
 qapi/misc.json             | 25 +++++++++++++++++++++++++
 5 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/virtio-mem-pci.c b/hw/virtio/virtio-mem-pci.c
index a47d21c81f..780d7b4af7 100644
--- a/hw/virtio/virtio-mem-pci.c
+++ b/hw/virtio/virtio-mem-pci.c
@@ -15,6 +15,7 @@
 #include "virtio-mem-pci.h"
 #include "hw/mem/memory-device.h"
 #include "qapi/error.h"
+#include "qapi/qapi-events-misc.h"
 
 static void virtio_mem_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
 {
@@ -75,6 +76,21 @@ static void virtio_mem_pci_fill_device_info(const MemoryDeviceState *md,
     info->type = MEMORY_DEVICE_INFO_KIND_VIRTIO_MEM;
 }
 
+static void virtio_mem_pci_size_change_notify(Notifier *notifier, void *data)
+{
+    VirtIOMEMPCI *pci_mem = container_of(notifier, VirtIOMEMPCI,
+                                         size_change_notifier);
+    DeviceState *dev = DEVICE(pci_mem);
+    const uint64_t * const size_p = data;
+    g_autofree char *id = NULL;
+
+    if (dev->id) {
+        id = g_strdup(dev->id);
+    }
+
+    qapi_event_send_memory_device_size_change(!!id, id, *size_p);
+}
+
 static void virtio_mem_pci_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
@@ -99,9 +115,21 @@ static void virtio_mem_pci_class_init(ObjectClass *klass, void *data)
 static void virtio_mem_pci_instance_init(Object *obj)
 {
     VirtIOMEMPCI *dev = VIRTIO_MEM_PCI(obj);
+    VirtIOMEMClass *vmc;
+    VirtIOMEM *vmem;
 
     virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev),
                                 TYPE_VIRTIO_MEM);
+
+    dev->size_change_notifier.notify = virtio_mem_pci_size_change_notify;
+    vmem = VIRTIO_MEM(&dev->vdev);
+    vmc = VIRTIO_MEM_GET_CLASS(vmem);
+    /*
+     * We never remove the notifier again, as we expect both devices to
+     * disappear at the same time.
+     */
+    vmc->add_size_change_notifier(vmem, &dev->size_change_notifier);
+
     object_property_add_alias(obj, VIRTIO_MEM_BLOCK_SIZE_PROP,
                               OBJECT(&dev->vdev),
                               VIRTIO_MEM_BLOCK_SIZE_PROP, &error_abort);
diff --git a/hw/virtio/virtio-mem-pci.h b/hw/virtio/virtio-mem-pci.h
index 8820cd6628..b51a28b275 100644
--- a/hw/virtio/virtio-mem-pci.h
+++ b/hw/virtio/virtio-mem-pci.h
@@ -28,6 +28,7 @@ typedef struct VirtIOMEMPCI VirtIOMEMPCI;
 struct VirtIOMEMPCI {
     VirtIOPCIProxy parent_obj;
     VirtIOMEM vdev;
+    Notifier size_change_notifier;
 };
 
 #endif /* QEMU_VIRTIO_MEM_PCI_H */
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 88a99a0d90..eb5cf66855 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -491,7 +491,7 @@ static void virtio_mem_device_unrealize(DeviceState *dev, Error **errp)
     virtio_del_queue(vdev, 0);
     virtio_cleanup(vdev);
     g_free(vmem->bitmap);
-    ramblock_discard_set_required(false);
+    ram_block_discard_set_required(false);
 }
 
 static int virtio_mem_pre_save(void *opaque)
diff --git a/monitor/monitor.c b/monitor/monitor.c
index 125494410a..19dcb8fbe3 100644
--- a/monitor/monitor.c
+++ b/monitor/monitor.c
@@ -235,6 +235,7 @@ static MonitorQAPIEventConf monitor_qapi_event_conf[QAPI_EVENT__MAX] = {
     [QAPI_EVENT_QUORUM_REPORT_BAD] = { 1000 * SCALE_MS },
     [QAPI_EVENT_QUORUM_FAILURE]    = { 1000 * SCALE_MS },
     [QAPI_EVENT_VSERPORT_CHANGE]   = { 1000 * SCALE_MS },
+    [QAPI_EVENT_MEMORY_DEVICE_SIZE_CHANGE] = { 1000 * SCALE_MS },
 };
 
 /*
diff --git a/qapi/misc.json b/qapi/misc.json
index feaeacec22..58b073562b 100644
--- a/qapi/misc.json
+++ b/qapi/misc.json
@@ -1432,6 +1432,31 @@
 ##
 { 'command': 'query-memory-devices', 'returns': ['MemoryDeviceInfo'] }
 
+##
+# @MEMORY_DEVICE_SIZE_CHANGE:
+#
+# Emitted when the size of a memory device changes. Only emitted for memory
+# devices that can actually change the size (e.g., virtio-mem due to guest
+# action).
+#
+# @id: device's ID
+# @size: the new size of memory that the device provides
+#
+# Note: this event is rate-limited.
+#
+# Since: 5.1
+#
+# Example:
+#
+# <- { "event": "MEMORY_DEVICE_SIZE_CHANGE",
+#      "data": { "id": "vm0", "size": 1073741824},
+#      "timestamp": { "seconds": 1588168529, "microseconds": 201316 } }
+#
+##
+{ 'event': 'MEMORY_DEVICE_SIZE_CHANGE',
+  'data': { '*id': 'str', 'size': 'size' } }
+
+
 ##
 # @MEM_UNPLUG_ERROR:
 #
-- 
2.25.3
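
The notifier wiring in the patch above relies on embedding a Notifier in the proxy struct and recovering the proxy with container_of(). Below is a minimal, standalone sketch of that pattern, not QEMU code: SketchNotifier, SketchProxy, sketch_container_of and the sketch_* functions are hypothetical stand-ins, and the callback only records the new size where the real one would emit the QMP event.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for container_of(): map a member pointer back to
 * the struct that embeds it. */
#define sketch_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for a notifier: just a callback slot. */
typedef struct SketchNotifier SketchNotifier;
struct SketchNotifier {
    void (*notify)(SketchNotifier *notifier, void *data);
};

/* Hypothetical proxy device that embeds its notifier, as VirtIOMEMPCI
 * embeds size_change_notifier in the patch. */
typedef struct {
    const char *id;               /* may be NULL, like dev->id */
    uint64_t last_reported_size;  /* stands in for emitting the event */
    SketchNotifier size_change_notifier;
} SketchProxy;

/* Callback: recover the proxy from the embedded notifier, read the size. */
static void sketch_size_change_notify(SketchNotifier *notifier, void *data)
{
    SketchProxy *proxy = sketch_container_of(notifier, SketchProxy,
                                             size_change_notifier);
    const uint64_t *size_p = data;

    proxy->last_reported_size = *size_p;
}

void sketch_proxy_init(SketchProxy *proxy, const char *id)
{
    proxy->id = id;
    proxy->last_reported_size = 0;
    proxy->size_change_notifier.notify = sketch_size_change_notify;
}

/* Device side: invoke the registered notifier with the new size. */
void sketch_device_resize(SketchNotifier *notifier, uint64_t new_size)
{
    notifier->notify(notifier, &new_size);
}
```

Calling sketch_proxy_init() on a SketchProxy and then sketch_device_resize() on its embedded notifier leaves the new size in last_reported_size. The device side never needs to know the proxy type, which is the property the patch relies on.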



^ permalink raw reply related	[flat|nested] 94+ messages in thread

* Re: [PATCH v1 15/17] pc: Support for virtio-mem-pci
  2020-05-06  9:49   ` David Hildenbrand
@ 2020-05-06 12:19     ` Pankaj Gupta
  -1 siblings, 0 replies; 94+ messages in thread
From: Pankaj Gupta @ 2020-05-06 12:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, Eduardo Habkost, kvm, Michael S . Tsirkin,
	Dr . David Alan Gilbert, Markus Armbruster, qemu-s390x,
	Paolo Bonzini, Richard Henderson

> Let's wire it up similar to virtio-pmem. Also disallow unplug, so it's
> harder for users to shoot themselves in the foot.
>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Markus Armbruster <armbru@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  hw/i386/Kconfig |  1 +
>  hw/i386/pc.c    | 49 ++++++++++++++++++++++++++++---------------------
>  2 files changed, 29 insertions(+), 21 deletions(-)
>
> diff --git a/hw/i386/Kconfig b/hw/i386/Kconfig
> index c93f32f657..03e347b207 100644
> --- a/hw/i386/Kconfig
> +++ b/hw/i386/Kconfig
> @@ -35,6 +35,7 @@ config PC
>      select ACPI_PCI
>      select ACPI_VMGENID
>      select VIRTIO_PMEM_SUPPORTED
> +    select VIRTIO_MEM_SUPPORTED
>
>  config PC_PCI
>      bool
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index f6b8431c8b..588804f895 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -86,6 +86,7 @@
>  #include "hw/net/ne2000-isa.h"
>  #include "standard-headers/asm-x86/bootparam.h"
>  #include "hw/virtio/virtio-pmem-pci.h"
> +#include "hw/virtio/virtio-mem-pci.h"
>  #include "hw/mem/memory-device.h"
>  #include "sysemu/replay.h"
>  #include "qapi/qmp/qerror.h"
> @@ -1654,8 +1655,8 @@ static void pc_cpu_pre_plug(HotplugHandler *hotplug_dev,
>      numa_cpu_pre_plug(cpu_slot, dev, errp);
>  }
>
> -static void pc_virtio_pmem_pci_pre_plug(HotplugHandler *hotplug_dev,
> -                                        DeviceState *dev, Error **errp)
> +static void pc_virtio_md_pci_pre_plug(HotplugHandler *hotplug_dev,
> +                                      DeviceState *dev, Error **errp)
>  {
>      HotplugHandler *hotplug_dev2 = qdev_get_bus_hotplug_handler(dev);
>      Error *local_err = NULL;
> @@ -1666,7 +1667,8 @@ static void pc_virtio_pmem_pci_pre_plug(HotplugHandler *hotplug_dev,
>           * order. This should never be the case on x86, however better add
>           * a safety net.
>           */
> -        error_setg(errp, "virtio-pmem-pci not supported on this bus.");
> +        error_setg(errp,
> +                   "virtio based memory devices not supported on this bus.");
>          return;
>      }
>      /*
> @@ -1681,8 +1683,8 @@ static void pc_virtio_pmem_pci_pre_plug(HotplugHandler *hotplug_dev,
>      error_propagate(errp, local_err);
>  }
>
> -static void pc_virtio_pmem_pci_plug(HotplugHandler *hotplug_dev,
> -                                    DeviceState *dev, Error **errp)
> +static void pc_virtio_md_pci_plug(HotplugHandler *hotplug_dev,
> +                                  DeviceState *dev, Error **errp)
>  {
>      HotplugHandler *hotplug_dev2 = qdev_get_bus_hotplug_handler(dev);
>      Error *local_err = NULL;
> @@ -1700,17 +1702,17 @@ static void pc_virtio_pmem_pci_plug(HotplugHandler *hotplug_dev,
>      error_propagate(errp, local_err);
>  }
>
> -static void pc_virtio_pmem_pci_unplug_request(HotplugHandler *hotplug_dev,
> -                                              DeviceState *dev, Error **errp)
> +static void pc_virtio_md_pci_unplug_request(HotplugHandler *hotplug_dev,
> +                                            DeviceState *dev, Error **errp)
>  {
> -    /* We don't support virtio pmem hot unplug */
> -    error_setg(errp, "virtio pmem device unplug not supported.");
> +    /* We don't support hot unplug of virtio based memory devices */
> +    error_setg(errp, "virtio based memory devices cannot be unplugged.");
>  }
>
> -static void pc_virtio_pmem_pci_unplug(HotplugHandler *hotplug_dev,
> -                                      DeviceState *dev, Error **errp)
> +static void pc_virtio_md_pci_unplug(HotplugHandler *hotplug_dev,
> +                                    DeviceState *dev, Error **errp)
>  {
> -    /* We don't support virtio pmem hot unplug */
> +    /* We don't support hot unplug of virtio based memory devices */
>  }
>
>  static void pc_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev,
> @@ -1720,8 +1722,9 @@ static void pc_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev,
>          pc_memory_pre_plug(hotplug_dev, dev, errp);
>      } else if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
>          pc_cpu_pre_plug(hotplug_dev, dev, errp);
> -    } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI)) {
> -        pc_virtio_pmem_pci_pre_plug(hotplug_dev, dev, errp);
> +    } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI) ||
> +               object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI)) {
> +        pc_virtio_md_pci_pre_plug(hotplug_dev, dev, errp);
>      }
>  }
>
> @@ -1732,8 +1735,9 @@ static void pc_machine_device_plug_cb(HotplugHandler *hotplug_dev,
>          pc_memory_plug(hotplug_dev, dev, errp);
>      } else if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
>          pc_cpu_plug(hotplug_dev, dev, errp);
> -    } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI)) {
> -        pc_virtio_pmem_pci_plug(hotplug_dev, dev, errp);
> +    } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI) ||
> +               object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI)) {
> +        pc_virtio_md_pci_plug(hotplug_dev, dev, errp);
>      }
>  }
>
> @@ -1744,8 +1748,9 @@ static void pc_machine_device_unplug_request_cb(HotplugHandler *hotplug_dev,
>          pc_memory_unplug_request(hotplug_dev, dev, errp);
>      } else if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
>          pc_cpu_unplug_request_cb(hotplug_dev, dev, errp);
> -    } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI)) {
> -        pc_virtio_pmem_pci_unplug_request(hotplug_dev, dev, errp);
> +    } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI) ||
> +               object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI)) {
> +        pc_virtio_md_pci_unplug_request(hotplug_dev, dev, errp);
>      } else {
>          error_setg(errp, "acpi: device unplug request for not supported device"
>                     " type: %s", object_get_typename(OBJECT(dev)));
> @@ -1759,8 +1764,9 @@ static void pc_machine_device_unplug_cb(HotplugHandler *hotplug_dev,
>          pc_memory_unplug(hotplug_dev, dev, errp);
>      } else if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
>          pc_cpu_unplug_cb(hotplug_dev, dev, errp);
> -    } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI)) {
> -        pc_virtio_pmem_pci_unplug(hotplug_dev, dev, errp);
> +    } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI) ||
> +               object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI)) {
> +        pc_virtio_md_pci_unplug(hotplug_dev, dev, errp);
>      } else {
>          error_setg(errp, "acpi: device unplug for not supported device"
>                     " type: %s", object_get_typename(OBJECT(dev)));
> @@ -1772,7 +1778,8 @@ static HotplugHandler *pc_get_hotplug_handler(MachineState *machine,
>  {
>      if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM) ||
>          object_dynamic_cast(OBJECT(dev), TYPE_CPU) ||
> -        object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI)) {
> +        object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI) ||
> +        object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI)) {
>          return HOTPLUG_HANDLER(machine);
>      }
>
> --

Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>

> 2.25.3
>
>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v1 10/17] virtio-mem: Paravirtualized memory hot(un)plug
  2020-05-06  9:49   ` David Hildenbrand
@ 2020-05-06 16:12     ` Eric Blake
  -1 siblings, 0 replies; 94+ messages in thread
From: Eric Blake @ 2020-05-06 16:12 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin,
	Markus Armbruster, Igor Mammedov

On 5/6/20 4:49 AM, David Hildenbrand wrote:
> This is the very basic/initial version of virtio-mem. An introduction to
> virtio-mem can be found in the Linux kernel driver [1]. While it can be
> used in the current state for hotplug of a smaller amount of memory, it
> will heavily benefit from resizeable memory regions in the future.
> 
> Each virtio-mem device manages a memory region (provided via a memory
> backend). Once the hypervisor requests a size ("requested-size"), the
> guest can try to plug/unplug blocks of memory within that region in order
> to reach the requested size. Initially, and after a reboot, all memory is
> unplugged (except in special cases, such as a reboot during postcopy).
> 
> The guest may only try to plug/unplug blocks of memory within the usable
> region size. The usable region size is a little bigger than the
> requested size, to give the device driver some flexibility. The usable
> region size will only grow, except on reboots or when all memory is
> requested to get unplugged. The guest can never plug more memory than
> requested. Unplugged memory will get zapped/discarded, similar to a
> balloon device.
> 
> The block size is variable; however, it is always chosen such that
> THP splits are avoided (e.g., 2MB). The state of each block
> (plugged/unplugged) is tracked in a bitmap.
> 
> As virtio-mem devices (e.g., virtio-mem-pci) will be memory devices, we now
> expose "VirtioMEMDeviceInfo" via "query-memory-devices".
> 
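
For readers following the description quoted above, here is a rough, standalone sketch of the block accounting it describes: one bit per block, with plug requests refused once they would exceed the requested size. It is an illustration under assumptions, not the code from the patch. RegionSketch and the sketch_* functions are hypothetical names, and the "usable region size" is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified model of one region (not QEMU's VirtIOMEM). */
typedef struct {
    uint64_t block_size;     /* bytes per block, e.g. 2 MiB */
    uint64_t nb_blocks;      /* region size / block size */
    uint64_t requested_size; /* target set by the hypervisor, in bytes */
    uint64_t plugged_size;   /* bytes currently plugged by the guest */
    uint8_t *bitmap;         /* one bit per block: 1 = plugged */
} RegionSketch;

static bool sketch_test_bit(const RegionSketch *r, uint64_t block)
{
    return (r->bitmap[block / 8] >> (block % 8)) & 1;
}

static void sketch_assign_bit(RegionSketch *r, uint64_t block, bool plugged)
{
    uint8_t mask = (uint8_t)(1u << (block % 8));

    if (plugged) {
        r->bitmap[block / 8] |= mask;
    } else {
        r->bitmap[block / 8] &= (uint8_t)~mask;
    }
}

/* All blocks start unplugged; the region must be a multiple of the block. */
bool sketch_init(RegionSketch *r, uint64_t region_size, uint64_t block_size)
{
    if (block_size == 0 || region_size % block_size != 0) {
        return false;
    }
    r->block_size = block_size;
    r->nb_blocks = region_size / block_size;
    r->requested_size = 0;
    r->plugged_size = 0;
    r->bitmap = calloc((r->nb_blocks + 7) / 8, 1);
    return r->bitmap != NULL;
}

/* Guest request: plug or unplug nb consecutive blocks starting at first. */
bool sketch_set_blocks(RegionSketch *r, uint64_t first, uint64_t nb, bool plug)
{
    uint64_t i;

    if (nb == 0 || first >= r->nb_blocks || nb > r->nb_blocks - first) {
        return false; /* range outside the region */
    }
    for (i = first; i < first + nb; i++) {
        if (sketch_test_bit(r, i) == plug) {
            return false; /* a block is already in the target state */
        }
    }
    if (plug && r->plugged_size + nb * r->block_size > r->requested_size) {
        return false; /* never plug more than requested */
    }
    for (i = first; i < first + nb; i++) {
        sketch_assign_bit(r, i, plug);
    }
    if (plug) {
        r->plugged_size += nb * r->block_size;
    } else {
        r->plugged_size -= nb * r->block_size;
    }
    return true;
}
```

In this sketch, a 16 MiB region with 2 MiB blocks and a requested size of 4 MiB accepts plugging blocks 0 and 1, and refuses a third block until the requested size is raised.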

> +++ b/qapi/misc.json
> @@ -1354,19 +1354,56 @@
>             }
>   }
>   
> +##
> +# @VirtioMEMDeviceInfo:
> +#

> +# @memdev: memory backend linked with the region
> +#
> +# Since: 5.1

Here you claim 5.1,

> +##
> +{ 'struct': 'VirtioMEMDeviceInfo',
> +  'data': { '*id': 'str',
> +            'memaddr': 'size',
> +            'requested-size': 'size',
> +            'size': 'size',
> +            'max-size': 'size',
> +            'block-size': 'size',
> +            'node': 'int',
> +            'memdev': 'str'
> +          }
> +}
> +
>   ##
>   # @MemoryDeviceInfo:
>   #
>   # Union containing information about a memory device
>   #
>   # nvdimm is included since 2.12. virtio-pmem is included since 4.1.
> +# virtio-mem is included since 5.2.

but here 5.2.  They should probably be the same :)

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v1 10/17] virtio-mem: Paravirtualized memory hot(un)plug
  2020-05-06 16:12     ` Eric Blake
@ 2020-05-06 16:14       ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-06 16:14 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin,
	Markus Armbruster, Igor Mammedov

>> +##
>> +{ 'struct': 'VirtioMEMDeviceInfo',
>> +  'data': { '*id': 'str',
>> +            'memaddr': 'size',
>> +            'requested-size': 'size',
>> +            'size': 'size',
>> +            'max-size': 'size',
>> +            'block-size': 'size',
>> +            'node': 'int',
>> +            'memdev': 'str'
>> +          }
>> +}
>> +
>>   ##
>>   # @MemoryDeviceInfo:
>>   #
>>   # Union containing information about a memory device
>>   #
>>   # nvdimm is included since 2.12. virtio-pmem is included since 4.1.
>> +# virtio-mem is included since 5.2.
> 
> but here 5.2.  They should probably be the same :)
> 

Thanks! I've been changing these numbers for a couple of releases
already; it was bound to go wrong at some point :)


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v1 11/17] virtio-pci: Proxy for virtio-mem
  2020-05-06  9:49   ` David Hildenbrand
@ 2020-05-06 18:57     ` Pankaj Gupta
  -1 siblings, 0 replies; 94+ messages in thread
From: Pankaj Gupta @ 2020-05-06 18:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin,
	Marcel Apfelbaum, Igor Mammedov

> Let's add a proxy for virtio-mem, make it a memory device, and
> pass-through the properties.
>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  hw/virtio/Makefile.objs    |   1 +
>  hw/virtio/virtio-mem-pci.c | 131 +++++++++++++++++++++++++++++++++++++
>  hw/virtio/virtio-mem-pci.h |  33 ++++++++++
>  include/hw/pci/pci.h       |   1 +
>  4 files changed, 166 insertions(+)
>  create mode 100644 hw/virtio/virtio-mem-pci.c
>  create mode 100644 hw/virtio/virtio-mem-pci.h
>
> diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs
> index 7df70e977e..b9661f9c01 100644
> --- a/hw/virtio/Makefile.objs
> +++ b/hw/virtio/Makefile.objs
> @@ -19,6 +19,7 @@ obj-$(call land,$(CONFIG_VHOST_USER_FS),$(CONFIG_VIRTIO_PCI)) += vhost-user-fs-p
>  obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
>  obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock.o
>  obj-$(CONFIG_VIRTIO_MEM) += virtio-mem.o
> +common-obj-$(call land,$(CONFIG_VIRTIO_MEM),$(CONFIG_VIRTIO_PCI)) += virtio-mem-pci.o
>
>  ifeq ($(CONFIG_VIRTIO_PCI),y)
>  obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock-pci.o
> diff --git a/hw/virtio/virtio-mem-pci.c b/hw/virtio/virtio-mem-pci.c
> new file mode 100644
> index 0000000000..a47d21c81f
> --- /dev/null
> +++ b/hw/virtio/virtio-mem-pci.c
> @@ -0,0 +1,131 @@
> +/*
> + * Virtio MEM PCI device
> + *
> + * Copyright (C) 2020 Red Hat, Inc.
> + *
> + * Authors:
> + *  David Hildenbrand <david@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +

I don't think we need the blank line here.

> +#include "virtio-mem-pci.h"
> +#include "hw/mem/memory-device.h"
> +#include "qapi/error.h"
> +
> +static void virtio_mem_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
> +{
> +    VirtIOMEMPCI *mem_pci = VIRTIO_MEM_PCI(vpci_dev);
> +    DeviceState *vdev = DEVICE(&mem_pci->vdev);
> +
> +    qdev_set_parent_bus(vdev, BUS(&vpci_dev->bus));
> +    object_property_set_bool(OBJECT(vdev), true, "realized", errp);
> +}
> +
> +static void virtio_mem_pci_set_addr(MemoryDeviceState *md, uint64_t addr,
> +                                    Error **errp)
> +{
> +    object_property_set_uint(OBJECT(md), addr, VIRTIO_MEM_ADDR_PROP, errp);
> +}
> +
> +static uint64_t virtio_mem_pci_get_addr(const MemoryDeviceState *md)
> +{
> +    return object_property_get_uint(OBJECT(md), VIRTIO_MEM_ADDR_PROP,
> +                                    &error_abort);
> +}
> +
> +static MemoryRegion *virtio_mem_pci_get_memory_region(MemoryDeviceState *md,
> +                                                      Error **errp)
> +{
> +    VirtIOMEMPCI *pci_mem = VIRTIO_MEM_PCI(md);
> +    VirtIOMEM *vmem = VIRTIO_MEM(&pci_mem->vdev);
> +    VirtIOMEMClass *vmc = VIRTIO_MEM_GET_CLASS(vmem);
> +
> +    return vmc->get_memory_region(vmem, errp);
> +}
> +
> +static uint64_t virtio_mem_pci_get_plugged_size(const MemoryDeviceState *md,
> +                                                Error **errp)
> +{
> +    return object_property_get_uint(OBJECT(md), VIRTIO_MEM_SIZE_PROP,
> +                                    errp);
> +}
> +
> +static void virtio_mem_pci_fill_device_info(const MemoryDeviceState *md,
> +                                            MemoryDeviceInfo *info)
> +{
> +    VirtioMEMDeviceInfo *vi = g_new0(VirtioMEMDeviceInfo, 1);
> +    VirtIOMEMPCI *pci_mem = VIRTIO_MEM_PCI(md);
> +    VirtIOMEM *vmem = VIRTIO_MEM(&pci_mem->vdev);
> +    VirtIOMEMClass *vpc = VIRTIO_MEM_GET_CLASS(vmem);
> +    DeviceState *dev = DEVICE(md);
> +
> +    if (dev->id) {
> +        vi->has_id = true;
> +        vi->id = g_strdup(dev->id);
> +    }
> +
> +    /* let the real device handle everything else */
> +    vpc->fill_device_info(vmem, vi);
> +
> +    info->u.virtio_mem.data = vi;
> +    info->type = MEMORY_DEVICE_INFO_KIND_VIRTIO_MEM;
> +}
> +
> +static void virtio_mem_pci_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    VirtioPCIClass *k = VIRTIO_PCI_CLASS(klass);
> +    PCIDeviceClass *pcidev_k = PCI_DEVICE_CLASS(klass);
> +    MemoryDeviceClass *mdc = MEMORY_DEVICE_CLASS(klass);
> +
> +    k->realize = virtio_mem_pci_realize;
> +    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
> +    pcidev_k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
> +    pcidev_k->device_id = PCI_DEVICE_ID_VIRTIO_MEM;
> +    pcidev_k->revision = VIRTIO_PCI_ABI_VERSION;
> +    pcidev_k->class_id = PCI_CLASS_OTHERS;
> +
> +    mdc->get_addr = virtio_mem_pci_get_addr;
> +    mdc->set_addr = virtio_mem_pci_set_addr;
> +    mdc->get_plugged_size = virtio_mem_pci_get_plugged_size;
> +    mdc->get_memory_region = virtio_mem_pci_get_memory_region;
> +    mdc->fill_device_info = virtio_mem_pci_fill_device_info;
> +}
> +
> +static void virtio_mem_pci_instance_init(Object *obj)
> +{
> +    VirtIOMEMPCI *dev = VIRTIO_MEM_PCI(obj);
> +
> +    virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev),
> +                                TYPE_VIRTIO_MEM);
> +    object_property_add_alias(obj, VIRTIO_MEM_BLOCK_SIZE_PROP,
> +                              OBJECT(&dev->vdev),
> +                              VIRTIO_MEM_BLOCK_SIZE_PROP, &error_abort);
> +    object_property_add_alias(obj, VIRTIO_MEM_SIZE_PROP, OBJECT(&dev->vdev),
> +                              VIRTIO_MEM_SIZE_PROP, &error_abort);
> +    object_property_add_alias(obj, VIRTIO_MEM_REQUESTED_SIZE_PROP,
> +                              OBJECT(&dev->vdev),
> +                              VIRTIO_MEM_REQUESTED_SIZE_PROP, &error_abort);
> +}
> +
> +static const VirtioPCIDeviceTypeInfo virtio_mem_pci_info = {
> +    .base_name = TYPE_VIRTIO_MEM_PCI,
> +    .generic_name = "virtio-mem-pci",
> +    .instance_size = sizeof(VirtIOMEMPCI),
> +    .instance_init = virtio_mem_pci_instance_init,
> +    .class_init = virtio_mem_pci_class_init,
> +    .interfaces = (InterfaceInfo[]) {
> +        { TYPE_MEMORY_DEVICE },
> +        { }
> +    },
> +};
> +
> +static void virtio_mem_pci_register_types(void)
> +{
> +    virtio_pci_types_register(&virtio_mem_pci_info);
> +}
> +type_init(virtio_mem_pci_register_types)
> diff --git a/hw/virtio/virtio-mem-pci.h b/hw/virtio/virtio-mem-pci.h
> new file mode 100644
> index 0000000000..8820cd6628
> --- /dev/null
> +++ b/hw/virtio/virtio-mem-pci.h
> @@ -0,0 +1,33 @@
> +/*
> + * Virtio MEM PCI device
> + *
> + * Copyright (C) 2020 Red Hat, Inc.
> + *
> + * Authors:
> + *  David Hildenbrand <david@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#ifndef QEMU_VIRTIO_MEM_PCI_H
> +#define QEMU_VIRTIO_MEM_PCI_H
> +
> +#include "hw/virtio/virtio-pci.h"
> +#include "hw/virtio/virtio-mem.h"
> +
> +typedef struct VirtIOMEMPCI VirtIOMEMPCI;
> +
> +/*
> + * virtio-mem-pci: This extends VirtioPCIProxy.
> + */
> +#define TYPE_VIRTIO_MEM_PCI "virtio-mem-pci-base"
> +#define VIRTIO_MEM_PCI(obj) \
> +        OBJECT_CHECK(VirtIOMEMPCI, (obj), TYPE_VIRTIO_MEM_PCI)
> +
> +struct VirtIOMEMPCI {
> +    VirtIOPCIProxy parent_obj;
> +    VirtIOMEM vdev;
> +};
> +
> +#endif /* QEMU_VIRTIO_MEM_PCI_H */
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index cfedf5a995..fec72d5a31 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -87,6 +87,7 @@ extern bool pci_available;
>  #define PCI_DEVICE_ID_VIRTIO_VSOCK       0x1012
>  #define PCI_DEVICE_ID_VIRTIO_PMEM        0x1013
>  #define PCI_DEVICE_ID_VIRTIO_IOMMU       0x1014
> +#define PCI_DEVICE_ID_VIRTIO_MEM         0x1015
>
>  #define PCI_VENDOR_ID_REDHAT             0x1b36
>  #define PCI_DEVICE_ID_REDHAT_BRIDGE      0x0001
> --
> 2.25.3
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v1 11/17] virtio-pci: Proxy for virtio-mem
@ 2020-05-06 18:57     ` Pankaj Gupta
  0 siblings, 0 replies; 94+ messages in thread
From: Pankaj Gupta @ 2020-05-06 18:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eduardo Habkost, kvm, Michael S . Tsirkin,
	Dr . David Alan Gilbert, qemu-devel, qemu-s390x, Igor Mammedov,
	Paolo Bonzini, Richard Henderson

> Let's add a proxy for virtio-mem, make it a memory device, and
> pass-through the properties.
>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  hw/virtio/Makefile.objs    |   1 +
>  hw/virtio/virtio-mem-pci.c | 131 +++++++++++++++++++++++++++++++++++++
>  hw/virtio/virtio-mem-pci.h |  33 ++++++++++
>  include/hw/pci/pci.h       |   1 +
>  4 files changed, 166 insertions(+)
>  create mode 100644 hw/virtio/virtio-mem-pci.c
>  create mode 100644 hw/virtio/virtio-mem-pci.h
>
> diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs
> index 7df70e977e..b9661f9c01 100644
> --- a/hw/virtio/Makefile.objs
> +++ b/hw/virtio/Makefile.objs
> @@ -19,6 +19,7 @@ obj-$(call land,$(CONFIG_VHOST_USER_FS),$(CONFIG_VIRTIO_PCI)) += vhost-user-fs-p
>  obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
>  obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock.o
>  obj-$(CONFIG_VIRTIO_MEM) += virtio-mem.o
> +common-obj-$(call land,$(CONFIG_VIRTIO_MEM),$(CONFIG_VIRTIO_PCI)) += virtio-mem-pci.o
>
>  ifeq ($(CONFIG_VIRTIO_PCI),y)
>  obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock-pci.o
> diff --git a/hw/virtio/virtio-mem-pci.c b/hw/virtio/virtio-mem-pci.c
> new file mode 100644
> index 0000000000..a47d21c81f
> --- /dev/null
> +++ b/hw/virtio/virtio-mem-pci.c
> @@ -0,0 +1,131 @@
> +/*
> + * Virtio MEM PCI device
> + *
> + * Copyright (C) 2020 Red Hat, Inc.
> + *
> + * Authors:
> + *  David Hildenbrand <david@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +
Don't think we need the blank line here.

> +#include "virtio-mem-pci.h"
> +#include "hw/mem/memory-device.h"
> +#include "qapi/error.h"
> +
> +static void virtio_mem_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
> +{
> +    VirtIOMEMPCI *mem_pci = VIRTIO_MEM_PCI(vpci_dev);
> +    DeviceState *vdev = DEVICE(&mem_pci->vdev);
> +
> +    qdev_set_parent_bus(vdev, BUS(&vpci_dev->bus));
> +    object_property_set_bool(OBJECT(vdev), true, "realized", errp);
> +}
> +
> +static void virtio_mem_pci_set_addr(MemoryDeviceState *md, uint64_t addr,
> +                                    Error **errp)
> +{
> +    object_property_set_uint(OBJECT(md), addr, VIRTIO_MEM_ADDR_PROP, errp);
> +}
> +
> +static uint64_t virtio_mem_pci_get_addr(const MemoryDeviceState *md)
> +{
> +    return object_property_get_uint(OBJECT(md), VIRTIO_MEM_ADDR_PROP,
> +                                    &error_abort);
> +}
> +
> +static MemoryRegion *virtio_mem_pci_get_memory_region(MemoryDeviceState *md,
> +                                                      Error **errp)
> +{
> +    VirtIOMEMPCI *pci_mem = VIRTIO_MEM_PCI(md);
> +    VirtIOMEM *vmem = VIRTIO_MEM(&pci_mem->vdev);
> +    VirtIOMEMClass *vmc = VIRTIO_MEM_GET_CLASS(vmem);
> +
> +    return vmc->get_memory_region(vmem, errp);
> +}
> +
> +static uint64_t virtio_mem_pci_get_plugged_size(const MemoryDeviceState *md,
> +                                                Error **errp)
> +{
> +    return object_property_get_uint(OBJECT(md), VIRTIO_MEM_SIZE_PROP,
> +                                    errp);
> +}
> +
> +static void virtio_mem_pci_fill_device_info(const MemoryDeviceState *md,
> +                                            MemoryDeviceInfo *info)
> +{
> +    VirtioMEMDeviceInfo *vi = g_new0(VirtioMEMDeviceInfo, 1);
> +    VirtIOMEMPCI *pci_mem = VIRTIO_MEM_PCI(md);
> +    VirtIOMEM *vmem = VIRTIO_MEM(&pci_mem->vdev);
> +    VirtIOMEMClass *vpc = VIRTIO_MEM_GET_CLASS(vmem);
> +    DeviceState *dev = DEVICE(md);
> +
> +    if (dev->id) {
> +        vi->has_id = true;
> +        vi->id = g_strdup(dev->id);
> +    }
> +
> +    /* let the real device handle everything else */
> +    vpc->fill_device_info(vmem, vi);
> +
> +    info->u.virtio_mem.data = vi;
> +    info->type = MEMORY_DEVICE_INFO_KIND_VIRTIO_MEM;
> +}
> +
> +static void virtio_mem_pci_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    VirtioPCIClass *k = VIRTIO_PCI_CLASS(klass);
> +    PCIDeviceClass *pcidev_k = PCI_DEVICE_CLASS(klass);
> +    MemoryDeviceClass *mdc = MEMORY_DEVICE_CLASS(klass);
> +
> +    k->realize = virtio_mem_pci_realize;
> +    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
> +    pcidev_k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
> +    pcidev_k->device_id = PCI_DEVICE_ID_VIRTIO_MEM;
> +    pcidev_k->revision = VIRTIO_PCI_ABI_VERSION;
> +    pcidev_k->class_id = PCI_CLASS_OTHERS;
> +
> +    mdc->get_addr = virtio_mem_pci_get_addr;
> +    mdc->set_addr = virtio_mem_pci_set_addr;
> +    mdc->get_plugged_size = virtio_mem_pci_get_plugged_size;
> +    mdc->get_memory_region = virtio_mem_pci_get_memory_region;
> +    mdc->fill_device_info = virtio_mem_pci_fill_device_info;
> +}
> +
> +static void virtio_mem_pci_instance_init(Object *obj)
> +{
> +    VirtIOMEMPCI *dev = VIRTIO_MEM_PCI(obj);
> +
> +    virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev),
> +                                TYPE_VIRTIO_MEM);
> +    object_property_add_alias(obj, VIRTIO_MEM_BLOCK_SIZE_PROP,
> +                              OBJECT(&dev->vdev),
> +                              VIRTIO_MEM_BLOCK_SIZE_PROP, &error_abort);
> +    object_property_add_alias(obj, VIRTIO_MEM_SIZE_PROP, OBJECT(&dev->vdev),
> +                              VIRTIO_MEM_SIZE_PROP, &error_abort);
> +    object_property_add_alias(obj, VIRTIO_MEM_REQUESTED_SIZE_PROP,
> +                              OBJECT(&dev->vdev),
> +                              VIRTIO_MEM_REQUESTED_SIZE_PROP, &error_abort);
> +}
> +
> +static const VirtioPCIDeviceTypeInfo virtio_mem_pci_info = {
> +    .base_name = TYPE_VIRTIO_MEM_PCI,
> +    .generic_name = "virtio-mem-pci",
> +    .instance_size = sizeof(VirtIOMEMPCI),
> +    .instance_init = virtio_mem_pci_instance_init,
> +    .class_init = virtio_mem_pci_class_init,
> +    .interfaces = (InterfaceInfo[]) {
> +        { TYPE_MEMORY_DEVICE },
> +        { }
> +    },
> +};
> +
> +static void virtio_mem_pci_register_types(void)
> +{
> +    virtio_pci_types_register(&virtio_mem_pci_info);
> +}
> +type_init(virtio_mem_pci_register_types)
> diff --git a/hw/virtio/virtio-mem-pci.h b/hw/virtio/virtio-mem-pci.h
> new file mode 100644
> index 0000000000..8820cd6628
> --- /dev/null
> +++ b/hw/virtio/virtio-mem-pci.h
> @@ -0,0 +1,33 @@
> +/*
> + * Virtio MEM PCI device
> + *
> + * Copyright (C) 2020 Red Hat, Inc.
> + *
> + * Authors:
> + *  David Hildenbrand <david@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#ifndef QEMU_VIRTIO_MEM_PCI_H
> +#define QEMU_VIRTIO_MEM_PCI_H
> +
> +#include "hw/virtio/virtio-pci.h"
> +#include "hw/virtio/virtio-mem.h"
> +
> +typedef struct VirtIOMEMPCI VirtIOMEMPCI;
> +
> +/*
> + * virtio-mem-pci: This extends VirtioPCIProxy.
> + */
> +#define TYPE_VIRTIO_MEM_PCI "virtio-mem-pci-base"
> +#define VIRTIO_MEM_PCI(obj) \
> +        OBJECT_CHECK(VirtIOMEMPCI, (obj), TYPE_VIRTIO_MEM_PCI)
> +
> +struct VirtIOMEMPCI {
> +    VirtIOPCIProxy parent_obj;
> +    VirtIOMEM vdev;
> +};
> +
> +#endif /* QEMU_VIRTIO_MEM_PCI_H */
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index cfedf5a995..fec72d5a31 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -87,6 +87,7 @@ extern bool pci_available;
>  #define PCI_DEVICE_ID_VIRTIO_VSOCK       0x1012
>  #define PCI_DEVICE_ID_VIRTIO_PMEM        0x1013
>  #define PCI_DEVICE_ID_VIRTIO_IOMMU       0x1014
> +#define PCI_DEVICE_ID_VIRTIO_MEM         0x1015
>
>  #define PCI_VENDOR_ID_REDHAT             0x1b36
>  #define PCI_DEVICE_ID_REDHAT_BRIDGE      0x0001
> --
> 2.25.3
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v1 13/17] hmp: Handle virtio-mem when printing memory device info
  2020-05-06  9:49   ` David Hildenbrand
@ 2020-05-06 19:03     ` Pankaj Gupta
  -1 siblings, 0 replies; 94+ messages in thread
From: Pankaj Gupta @ 2020-05-06 19:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin

> Print the memory device info just like for other memory devices.
>
> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  monitor/hmp-cmds.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
>
> diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
> index 7f6e982dc8..4b3638a2a6 100644
> --- a/monitor/hmp-cmds.c
> +++ b/monitor/hmp-cmds.c
> @@ -1805,6 +1805,7 @@ void hmp_info_memory_devices(Monitor *mon, const QDict *qdict)
>      MemoryDeviceInfoList *info_list = qmp_query_memory_devices(&err);
>      MemoryDeviceInfoList *info;
>      VirtioPMEMDeviceInfo *vpi;
> +    VirtioMEMDeviceInfo *vmi;
>      MemoryDeviceInfo *value;
>      PCDIMMDeviceInfo *di;
>
> @@ -1839,6 +1840,21 @@ void hmp_info_memory_devices(Monitor *mon, const QDict *qdict)
>                  monitor_printf(mon, "  size: %" PRIu64 "\n", vpi->size);
>                  monitor_printf(mon, "  memdev: %s\n", vpi->memdev);
>                  break;
> +            case MEMORY_DEVICE_INFO_KIND_VIRTIO_MEM:
> +                vmi = value->u.virtio_mem.data;
> +                monitor_printf(mon, "Memory device [%s]: \"%s\"\n",
> +                               MemoryDeviceInfoKind_str(value->type),
> +                               vmi->id ? vmi->id : "");
> +                monitor_printf(mon, "  memaddr: 0x%" PRIx64 "\n", vmi->memaddr);
> +                monitor_printf(mon, "  node: %" PRId64 "\n", vmi->node);
> +                monitor_printf(mon, "  requested-size: %" PRIu64 "\n",
> +                               vmi->requested_size);
> +                monitor_printf(mon, "  size: %" PRIu64 "\n", vmi->size);
> +                monitor_printf(mon, "  max-size: %" PRIu64 "\n", vmi->max_size);
> +                monitor_printf(mon, "  block-size: %" PRIu64 "\n",
> +                               vmi->block_size);
> +                monitor_printf(mon, "  memdev: %s\n", vmi->memdev);
> +                break;
>              default:
>                  g_assert_not_reached();
>              }
> --
> 2.25.3

Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
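
For reference, a rough sketch of the output shape the new case would
produce. This is only an illustration under assumptions: the struct and
function names below are hypothetical stand-ins (not the QAPI-generated
VirtioMEMDeviceInfo or monitor_printf()), and the kind string
"virtio-mem" is assumed to be what MemoryDeviceInfoKind_str() returns.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the fields printed by the quoted hunk. */
struct sketch_vmem_info {
    const char *id;          /* may be NULL when the device has no id */
    uint64_t memaddr;
    int64_t node;
    uint64_t requested_size;
    uint64_t size;
    uint64_t max_size;
    uint64_t block_size;
    const char *memdev;
};

/*
 * Format the info into buf using the same format strings as the patch.
 * Returns what snprintf() returns (number of characters needed).
 */
int sketch_format_info(char *buf, size_t len,
                       const struct sketch_vmem_info *vi)
{
    return snprintf(buf, len,
                    "Memory device [%s]: \"%s\"\n"
                    "  memaddr: 0x%" PRIx64 "\n"
                    "  node: %" PRId64 "\n"
                    "  requested-size: %" PRIu64 "\n"
                    "  size: %" PRIu64 "\n"
                    "  max-size: %" PRIu64 "\n"
                    "  block-size: %" PRIu64 "\n"
                    "  memdev: %s\n",
                    "virtio-mem",            /* assumed kind string */
                    vi->id ? vi->id : "",
                    vi->memaddr, vi->node, vi->requested_size,
                    vi->size, vi->max_size, vi->block_size,
                    vi->memdev);
}
```

Sizes are printed as plain byte counts, matching how the existing
virtio-pmem case prints its size.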

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v1 01/17] exec: Introduce ram_block_discard_set_(unreliable|required)()
  2020-05-06  9:49   ` David Hildenbrand
@ 2020-05-15  9:54     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-15  9:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Eduardo Habkost, Michael S . Tsirkin

* David Hildenbrand (david@redhat.com) wrote:
> We want to replace qemu_balloon_inhibit() with something more generic.
> In particular, we want to make sure that technologies that rely on RAM
> block discards working reliably run mutually exclusive with
> technologies that break them.
> 
> E.g., vfio will usually pin all guest memory, rendering virtio-balloon
> basically useless and making the VM consume more memory than reported via
> the balloon. While the balloon is special already (=> no guarantees, same
> behavior possible after reboots and with huge pages), this will be
> different, especially with virtio-mem.
> 
> Let's implement a way such that we can make both types of technology run
> mutually exclusive. We'll convert existing balloon inhibitors in successive
> patches and add some new ones. Add the check to
> qemu_balloon_is_inhibited() for now. We might want to make
> virtio-balloon an actual inhibitor in the future - however, that
> requires more thought to not break existing setups.
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  balloon.c             |  3 ++-
>  exec.c                | 48 +++++++++++++++++++++++++++++++++++++++++++
>  include/exec/memory.h | 41 ++++++++++++++++++++++++++++++++++++
>  3 files changed, 91 insertions(+), 1 deletion(-)
> 
> diff --git a/balloon.c b/balloon.c
> index f104b42961..c49f57c27b 100644
> --- a/balloon.c
> +++ b/balloon.c
> @@ -40,7 +40,8 @@ static int balloon_inhibit_count;
>  
>  bool qemu_balloon_is_inhibited(void)
>  {
> -    return atomic_read(&balloon_inhibit_count) > 0;
> +    return atomic_read(&balloon_inhibit_count) > 0 ||
> +           ram_block_discard_is_broken();
>  }
>  
>  void qemu_balloon_inhibit(bool state)
> diff --git a/exec.c b/exec.c
> index 2874bb5088..52a6e40e99 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -4049,4 +4049,52 @@ void mtree_print_dispatch(AddressSpaceDispatch *d, MemoryRegion *root)
>      }
>  }
>  
> +static int ram_block_discard_broken;

This could do with a comment; if I'm reading this right then
  +ve means broken
  -ve means required
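
Something along these lines might do as the comment, together with a
minimal standalone sketch of the counter semantics. This is an
illustration only: it uses C11 <stdatomic.h> instead of QEMU's atomic_*
helpers, and every sketch_ name is hypothetical.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * One signed counter encodes both states:
 *   > 0: that many users have marked RAM block discards as broken
 *   < 0: that many users require RAM block discards to work
 *   = 0: neither
 * The two states can therefore never be active at the same time.
 */
atomic_int sketch_discard_state;

/* Returns 0 on success, -EBUSY while discards are required. */
int sketch_set_broken(bool state)
{
    int old = atomic_load(&sketch_discard_state);

    if (!state) {
        atomic_fetch_sub(&sketch_discard_state, 1);
        return 0;
    }
    do {
        if (old < 0) {
            return -EBUSY;
        }
        /* On failure the compare-exchange reloads 'old' for the retry. */
    } while (!atomic_compare_exchange_weak(&sketch_discard_state,
                                           &old, old + 1));
    return 0;
}

/* Returns 0 on success, -EBUSY while discards are broken. */
int sketch_set_required(bool state)
{
    int old = atomic_load(&sketch_discard_state);

    if (!state) {
        atomic_fetch_add(&sketch_discard_state, 1);
        return 0;
    }
    do {
        if (old > 0) {
            return -EBUSY;
        }
    } while (!atomic_compare_exchange_weak(&sketch_discard_state,
                                           &old, old - 1));
    return 0;
}

bool sketch_is_broken(void)
{
    return atomic_load(&sketch_discard_state) > 0;
}

bool sketch_is_required(void)
{
    return atomic_load(&sketch_discard_state) < 0;
}
```

As in the quoted code, the release paths (state == false) do not check
the sign, so callers must pair every successful set with exactly one
clear.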

> +int ram_block_discard_set_broken(bool state)
> +{
> +    int old;
> +
> +    if (!state) {
> +        atomic_dec(&ram_block_discard_broken);
> +        return 0;
> +    }
> +
> +    do {
> +        old = atomic_read(&ram_block_discard_broken);
> +        if (old < 0) {
               /* Currently required */
> +            return -EBUSY;
> +        }
> +    } while (atomic_cmpxchg(&ram_block_discard_broken, old, old + 1) != old);
> +    return 0;
> +}
> +
> +int ram_block_discard_set_required(bool state)
> +{
> +    int old;
> +
> +    if (!state) {
> +        atomic_inc(&ram_block_discard_broken);
> +        return 0;
> +    }
> +
> +    do {
> +        old = atomic_read(&ram_block_discard_broken);
> +        if (old > 0) {
               /* Currently broken */
> +            return -EBUSY;
> +        }
> +    } while (atomic_cmpxchg(&ram_block_discard_broken, old, old - 1) != old);
> +    return 0;
> +}
> +
> +bool ram_block_discard_is_broken(void)
> +{
> +    return atomic_read(&ram_block_discard_broken) > 0;
> +}
> +
> +bool ram_block_discard_is_required(void)
> +{
> +    return atomic_read(&ram_block_discard_broken) < 0;
> +}
> +
>  #endif
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index e000bd2f97..9bb5ced38d 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -2463,6 +2463,47 @@ static inline MemOp devend_memop(enum device_endian end)
>  }
>  #endif
>  
> +/*
> + * Inhibit technologies that rely on discarding of parts of RAM blocks to work
> + * reliably, e.g., to manage the actual amount of memory consumed by the VM
> + * (then, the memory provided by RAM blocks might be bigger than the desired
> + * memory consumption). This *must* be set if:

'technologies that rely on discarding of parts of RAM blocks to work
reliably' is pretty long; I'm not sure of a better way of saying it
though.

Other than the comments;


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> + * - Discarding parts of a RAM block does not result in the change being
> + *   reflected in the VM and the pages getting freed.
> + * - All memory in RAM blocks is pinned or duplicated, invalidating any previous
> + *   discards blindly.
> + * - Discarding parts of a RAM block will result in integrity issues (e.g.,
> + *   encrypted VMs).
> + * Technologies that only temporarily pin the current working set of a
> + * driver are fine, because we don't expect such pages to be discarded
> + * (esp. based on guest action like balloon inflation).
> + *
> + * This is *not* to be used to protect from concurrent discards (esp.,
> + * postcopy).
> + *
> + * Returns 0 if successful. Returns -EBUSY if a technology that relies on
> + * discards to work reliably is active.
> + */
> +int ram_block_discard_set_broken(bool state);
> +
> +/*
> + * Inhibit technologies that will break discarding of pages in RAM blocks.
> + *
> + * Returns 0 if successful. Returns -EBUSY if discards are already set to
> + * broken.
> + */
> +int ram_block_discard_set_required(bool state);
> +
> +/*
> + * Test if discarding of memory in ram blocks is broken.
> + */
> +bool ram_block_discard_is_broken(void);
> +
> +/*
> + * Test if discarding of memory in ram blocks is required to work reliably.
> + */
> +bool ram_block_discard_is_required(void);
> +
>  #endif
>  
>  #endif
> -- 
> 2.25.3
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v1 03/17] accel/kvm: Convert to ram_block_discard_set_broken()
  2020-05-06  9:49   ` David Hildenbrand
@ 2020-05-15 11:57     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-15 11:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Eduardo Habkost, Michael S . Tsirkin

* David Hildenbrand (david@redhat.com) wrote:
> Discarding memory does not work as expected. At the time this is called,
> we cannot have anyone active that relies on discards to work properly.
> 
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  accel/kvm/kvm-all.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 439a4efe52..33421184ac 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -40,7 +40,6 @@
>  #include "trace.h"
>  #include "hw/irq.h"
>  #include "sysemu/sev.h"
> -#include "sysemu/balloon.h"
>  #include "qapi/visitor.h"
>  #include "qapi/qapi-types-common.h"
>  #include "qapi/qapi-visit-common.h"
> @@ -2107,7 +2106,7 @@ static int kvm_init(MachineState *ms)
>  
>      s->sync_mmu = !!kvm_vm_check_extension(kvm_state, KVM_CAP_SYNC_MMU);
>      if (!s->sync_mmu) {
> -        qemu_balloon_inhibit(true);
> +        g_assert(ram_block_discard_set_broken(true));
>      }
>  
>      return 0;
> -- 
> 2.25.3
> 
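
A side note on the g_assert() line quoted above, offered as an
observation rather than a definitive reading: patch 01 documents
ram_block_discard_set_broken() as returning 0 on success and -EBUSY on
conflict, and g_assert() aborts when its expression evaluates to 0. If
both hold, asserting on the raw return value would fire on success. A
minimal sketch of the polarity, using plain assert() and hypothetical
stub_ names in place of the QEMU and GLib functions:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical: set when something that relies on discards is active. */
bool stub_discard_required;

/*
 * Hypothetical stand-in following the documented convention:
 * 0 on success, -EBUSY on conflict.
 */
int stub_discard_set_broken(bool state)
{
    if (state && stub_discard_required) {
        return -EBUSY;
    }
    return 0;
}

/* Hypothetical init path mirroring the shape of the quoted hunk. */
int stub_accel_init(bool sync_mmu)
{
    if (!sync_mmu) {
        int ret = stub_discard_set_broken(true);

        /*
         * Success is ret == 0, so the assertion has to test for zero;
         * assert(ret) would abort exactly when the call succeeded.
         */
        assert(ret == 0);
    }
    return 0;
}
```

In QEMU terms this would presumably read g_assert(!ret), but that is an
assumption about the intended fix, not something stated in the thread.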
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v1 02/17] vfio: Convert to ram_block_discard_set_broken()
  2020-05-06  9:49   ` David Hildenbrand
@ 2020-05-15 12:01     ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-15 12:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin,
	Cornelia Huck, Alex Williamson, Christian Borntraeger,
	Tony Krowiak, Halil Pasic, Pierre Morel, Eric Farman

On 06.05.20 11:49, David Hildenbrand wrote:
> VFIO is (except devices without a physical IOMMU or some mediated devices)
> incompatible ram_block_discard_set_broken. The kernel will pin basically
> all VM memory. Let's convert to ram_block_discard_set_broke(), which can
> now fail, in contrast to qemu_balloon_inhibit().

Not sure what I was smoking when rewriting this 3 times:

"VFIO is (except devices without a physical IOMMU or some mediated
devices) incompatible with discarding of RAM. The kernel will pin
basically all VM memory. Let's convert to
ram_block_discard_set_broken(), which can now fail, in contrast to
qemu_balloon_inhibit()."
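
To illustrate "can now fail": below is a minimal sketch of how marking
discards broken and requiring discards could exclude each other. It is
not the code from patch #1. The sketch_* names and the single signed
counter are assumptions made for this example only, and the locking that
real code would need is left out.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical shared state (assumption for this sketch):
 *   > 0: discards were marked broken that many times
 *   < 0: discards were marked required that many times
 *   = 0: nobody cares either way
 */
static int sketch_discard_state;

/* Mark discards broken (state=true) or drop that vote (state=false). */
int sketch_discard_set_broken(bool state)
{
    if (!state) {
        sketch_discard_state--;
        return 0;
    }
    if (sketch_discard_state < 0) {
        /* Somebody relies on discards working, so refuse. */
        return -EBUSY;
    }
    sketch_discard_state++;
    return 0;
}

/* Mark discards required (state=true) or drop that vote (state=false). */
int sketch_discard_set_required(bool state)
{
    if (!state) {
        sketch_discard_state++;
        return 0;
    }
    if (sketch_discard_state > 0) {
        /* Somebody already marked discards broken, so refuse. */
        return -EBUSY;
    }
    sketch_discard_state--;
    return 0;
}
```

A plain inhibit counter like the one behind qemu_balloon_inhibit() can
only go up and down. Here a caller has to check the return value and
back out if the other side got there first.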

-- 
Thanks,

David / dhildenb



* Re: [PATCH v1 05/17] virtio-balloon: Rip out qemu_balloon_inhibit()
  2020-05-06  9:49   ` David Hildenbrand
@ 2020-05-15 12:09     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-15 12:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Eduardo Habkost, Michael S . Tsirkin, Juan Quintela

* David Hildenbrand (david@redhat.com) wrote:
> The only remaining special case is postcopy. It cannot handle
> concurrent discards yet, which would result in requesting already sent
> pages from the source. Special-case it in virtio-balloon instead.
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Juan Quintela <quintela@redhat.com>
> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  balloon.c                  | 18 ------------------
>  hw/virtio/virtio-balloon.c | 12 +++++++++++-
>  include/sysemu/balloon.h   |  2 --
>  migration/postcopy-ram.c   | 23 -----------------------
>  4 files changed, 11 insertions(+), 44 deletions(-)
> 
> diff --git a/balloon.c b/balloon.c
> index c49f57c27b..354408c6ea 100644
> --- a/balloon.c
> +++ b/balloon.c
> @@ -36,24 +36,6 @@
>  static QEMUBalloonEvent *balloon_event_fn;
>  static QEMUBalloonStatus *balloon_stat_fn;
>  static void *balloon_opaque;
> -static int balloon_inhibit_count;
> -
> -bool qemu_balloon_is_inhibited(void)
> -{
> -    return atomic_read(&balloon_inhibit_count) > 0 ||
> -           ram_block_discard_is_broken();
> -}
> -
> -void qemu_balloon_inhibit(bool state)
> -{
> -    if (state) {
> -        atomic_inc(&balloon_inhibit_count);
> -    } else {
> -        atomic_dec(&balloon_inhibit_count);
> -    }
> -
> -    assert(atomic_read(&balloon_inhibit_count) >= 0);
> -}
>  
>  static bool have_balloon(Error **errp)
>  {
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index a4729f7fc9..aa5b89fb47 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -29,6 +29,7 @@
>  #include "trace.h"
>  #include "qemu/error-report.h"
>  #include "migration/misc.h"
> +#include "migration/postcopy-ram.h"
>  
>  #include "hw/virtio/virtio-bus.h"
>  #include "hw/virtio/virtio-access.h"
> @@ -63,6 +64,15 @@ static bool virtio_balloon_pbp_matches(PartiallyBalloonedPage *pbp,
>      return pbp->base_gpa == base_gpa;
>  }
>  
> +static bool virtio_balloon_inhibited(void)
> +{
> +    PostcopyState ps = postcopy_state_get();
> +
> +    /* Postcopy cannot deal with concurrent discards (yet), so it's special. */
> +    return ram_block_discard_is_broken() ||
> +           (ps >= POSTCOPY_INCOMING_DISCARD && ps < POSTCOPY_INCOMING_END);

It's a shame this is open-coded here; it would be better to have a
helper in migration.c. We have migration_in_postcopy(), but that really
covers the sending side; a 'migration_in_incoming_postcopy' would
perhaps be good.

Dave
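
A minimal sketch of what such a helper might look like, so the range
check lives in one place. Everything prefixed sketch_/SKETCH_ is a
stand-in invented for this example so that it compiles on its own. Only
the DISCARD and END states appear in the patch above; the other enum
members and their order are assumptions.

```c
#include <stdbool.h>

/*
 * Stand-in for PostcopyState (assumption). The range check below only
 * relies on DISCARD coming before END, as the quoted patch does.
 */
typedef enum {
    SKETCH_POSTCOPY_INCOMING_NONE = 0,
    SKETCH_POSTCOPY_INCOMING_ADVISE,
    SKETCH_POSTCOPY_INCOMING_DISCARD,
    SKETCH_POSTCOPY_INCOMING_LISTENING,
    SKETCH_POSTCOPY_INCOMING_RUNNING,
    SKETCH_POSTCOPY_INCOMING_END
} SketchPostcopyState;

/* Stand-in for the state that postcopy_state_get() would return. */
static SketchPostcopyState sketch_state = SKETCH_POSTCOPY_INCOMING_NONE;

static SketchPostcopyState sketch_postcopy_state_get(void)
{
    return sketch_state;
}

/* True while the incoming side is between DISCARD and END. */
bool sketch_migration_in_incoming_postcopy(void)
{
    SketchPostcopyState ps = sketch_postcopy_state_get();

    return ps >= SKETCH_POSTCOPY_INCOMING_DISCARD &&
           ps < SKETCH_POSTCOPY_INCOMING_END;
}
```

With something along these lines, virtio_balloon_inhibited() would only
need to OR the helper's result with ram_block_discard_is_broken().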

> +}
> +
>  static void balloon_inflate_page(VirtIOBalloon *balloon,
>                                   MemoryRegion *mr, hwaddr mr_offset,
>                                   PartiallyBalloonedPage *pbp)
> @@ -360,7 +370,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>  
>              trace_virtio_balloon_handle_output(memory_region_name(section.mr),
>                                                 pa);
> -            if (!qemu_balloon_is_inhibited()) {
> +            if (!virtio_balloon_inhibited()) {
>                  if (vq == s->ivq) {
>                      balloon_inflate_page(s, section.mr,
>                                           section.offset_within_region, &pbp);
> diff --git a/include/sysemu/balloon.h b/include/sysemu/balloon.h
> index aea0c44985..20a2defe3a 100644
> --- a/include/sysemu/balloon.h
> +++ b/include/sysemu/balloon.h
> @@ -23,7 +23,5 @@ typedef void (QEMUBalloonStatus)(void *opaque, BalloonInfo *info);
>  int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
>                               QEMUBalloonStatus *stat_func, void *opaque);
>  void qemu_remove_balloon_handler(void *opaque);
> -bool qemu_balloon_is_inhibited(void);
> -void qemu_balloon_inhibit(bool state);
>  
>  #endif
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index a36402722b..b41a9fe2fd 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -27,7 +27,6 @@
>  #include "qemu/notify.h"
>  #include "qemu/rcu.h"
>  #include "sysemu/sysemu.h"
> -#include "sysemu/balloon.h"
>  #include "qemu/error-report.h"
>  #include "trace.h"
>  #include "hw/boards.h"
> @@ -520,20 +519,6 @@ int postcopy_ram_incoming_init(MigrationIncomingState *mis)
>      return 0;
>  }
>  
> -/*
> - * Manage a single vote to the QEMU balloon inhibitor for all postcopy usage,
> - * last caller wins.
> - */
> -static void postcopy_balloon_inhibit(bool state)
> -{
> -    static bool cur_state = false;
> -
> -    if (state != cur_state) {
> -        qemu_balloon_inhibit(state);
> -        cur_state = state;
> -    }
> -}
> -
>  /*
>   * At the end of a migration where postcopy_ram_incoming_init was called.
>   */
> @@ -565,8 +550,6 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
>          mis->have_fault_thread = false;
>      }
>  
> -    postcopy_balloon_inhibit(false);
> -
>      if (enable_mlock) {
>          if (os_mlock() < 0) {
>              error_report("mlock: %s", strerror(errno));
> @@ -1160,12 +1143,6 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
>      }
>      memset(mis->postcopy_tmp_zero_page, '\0', mis->largest_page_size);
>  
> -    /*
> -     * Ballooning can mark pages as absent while we're postcopying
> -     * that would cause false userfaults.
> -     */
> -    postcopy_balloon_inhibit(true);
> -
>      trace_postcopy_ram_enable_notify();
>  
>      return 0;
> -- 
> 2.25.3
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [PATCH v1 05/17] virtio-balloon: Rip out qemu_balloon_inhibit()
  2020-05-15 12:09     ` Dr. David Alan Gilbert
@ 2020-05-15 12:12       ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-15 12:12 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Eduardo Habkost, Michael S . Tsirkin, Juan Quintela

On 15.05.20 14:09, Dr. David Alan Gilbert wrote:
> * David Hildenbrand (david@redhat.com) wrote:
>> The only remaining special case is postcopy. It cannot handle
>> concurrent discards yet, which would result in requesting already sent
>> pages from the source. Special-case it in virtio-balloon instead.
>>
>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>> Cc: Juan Quintela <quintela@redhat.com>
>> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  balloon.c                  | 18 ------------------
>>  hw/virtio/virtio-balloon.c | 12 +++++++++++-
>>  include/sysemu/balloon.h   |  2 --
>>  migration/postcopy-ram.c   | 23 -----------------------
>>  4 files changed, 11 insertions(+), 44 deletions(-)
>>
>> diff --git a/balloon.c b/balloon.c
>> index c49f57c27b..354408c6ea 100644
>> --- a/balloon.c
>> +++ b/balloon.c
>> @@ -36,24 +36,6 @@
>>  static QEMUBalloonEvent *balloon_event_fn;
>>  static QEMUBalloonStatus *balloon_stat_fn;
>>  static void *balloon_opaque;
>> -static int balloon_inhibit_count;
>> -
>> -bool qemu_balloon_is_inhibited(void)
>> -{
>> -    return atomic_read(&balloon_inhibit_count) > 0 ||
>> -           ram_block_discard_is_broken();
>> -}
>> -
>> -void qemu_balloon_inhibit(bool state)
>> -{
>> -    if (state) {
>> -        atomic_inc(&balloon_inhibit_count);
>> -    } else {
>> -        atomic_dec(&balloon_inhibit_count);
>> -    }
>> -
>> -    assert(atomic_read(&balloon_inhibit_count) >= 0);
>> -}
>>  
>>  static bool have_balloon(Error **errp)
>>  {
>> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
>> index a4729f7fc9..aa5b89fb47 100644
>> --- a/hw/virtio/virtio-balloon.c
>> +++ b/hw/virtio/virtio-balloon.c
>> @@ -29,6 +29,7 @@
>>  #include "trace.h"
>>  #include "qemu/error-report.h"
>>  #include "migration/misc.h"
>> +#include "migration/postcopy-ram.h"
>>  
>>  #include "hw/virtio/virtio-bus.h"
>>  #include "hw/virtio/virtio-access.h"
>> @@ -63,6 +64,15 @@ static bool virtio_balloon_pbp_matches(PartiallyBalloonedPage *pbp,
>>      return pbp->base_gpa == base_gpa;
>>  }
>>  
>> +static bool virtio_balloon_inhibited(void)
>> +{
>> +    PostcopyState ps = postcopy_state_get();
>> +
>> +    /* Postcopy cannot deal with concurrent discards (yet), so it's special. */
>> +    return ram_block_discard_is_broken() ||
>> +           (ps >= POSTCOPY_INCOMING_DISCARD && ps < POSTCOPY_INCOMING_END);
> 
> It's a shame this is open-coded here; it would be better to have
> something in migration.c ; we have a migration_in_postcopy but that's
> really the sending side; a 'migration_in_incoming_postcopy' would
> perhaps be good.

Yes, makes sense - then I can also reuse it in patch #10.

Thanks!

-- 
Thanks,

David / dhildenb



* Re: [PATCH v1 07/17] migration/rdma: Use ram_block_discard_set_broken()
  2020-05-06  9:49   ` David Hildenbrand
@ 2020-05-15 12:45     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-15 12:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Eduardo Habkost, Michael S . Tsirkin, Juan Quintela

* David Hildenbrand (david@redhat.com) wrote:
> RDMA will pin all guest memory (as documented in docs/rdma.txt). We want
> to mark RAM block discards to be broken - however, to keep it simple
> use ram_block_discard_is_required() instead of inhibiting.

Should this depend on whether rdma->pin_all is set?
Even with !pin_all, some memory will be pinned at any given time
(while it is registered with the rdma stack).

Dave

> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Juan Quintela <quintela@redhat.com>
> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  migration/rdma.c | 18 ++++++++++++++++--
>  1 file changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/migration/rdma.c b/migration/rdma.c
> index f61587891b..029adbb950 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -29,6 +29,7 @@
>  #include "qemu/sockets.h"
>  #include "qemu/bitmap.h"
>  #include "qemu/coroutine.h"
> +#include "exec/memory.h"
>  #include <sys/socket.h>
>  #include <netdb.h>
>  #include <arpa/inet.h>
> @@ -4017,8 +4018,14 @@ void rdma_start_incoming_migration(const char *host_port, Error **errp)
>      Error *local_err = NULL;
>  
>      trace_rdma_start_incoming_migration();
> -    rdma = qemu_rdma_data_init(host_port, &local_err);
>  
> +    /* Avoid ram_block_discard_set_broken(), cannot change during migration. */
> +    if (ram_block_discard_is_required()) {
> +        error_setg(errp, "RDMA: cannot set discarding of RAM broken");
> +        return;
> +    }
> +
> +    rdma = qemu_rdma_data_init(host_port, &local_err);
>      if (rdma == NULL) {
>          goto err;
>      }
> @@ -4064,10 +4071,17 @@ void rdma_start_outgoing_migration(void *opaque,
>                              const char *host_port, Error **errp)
>  {
>      MigrationState *s = opaque;
> -    RDMAContext *rdma = qemu_rdma_data_init(host_port, errp);
>      RDMAContext *rdma_return_path = NULL;
> +    RDMAContext *rdma;
>      int ret = 0;
>  
> +    /* Avoid ram_block_discard_set_broken(), cannot change during migration. */
> +    if (ram_block_discard_is_required()) {
> +        error_setg(errp, "RDMA: cannot set discarding of RAM broken");
> +        return;
> +    }
> +
> +    rdma = qemu_rdma_data_init(host_port, errp);
>      if (rdma == NULL) {
>          goto err;
>      }
> -- 
> 2.25.3
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [PATCH v1 08/17] migration/colo: Use ram_block_discard_set_broken()
  2020-05-06  9:49   ` David Hildenbrand
@ 2020-05-15 13:58     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-15 13:58 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Eduardo Habkost, Michael S . Tsirkin, Hailiang Zhang,
	Juan Quintela

* David Hildenbrand (david@redhat.com) wrote:
> COLO will copy all memory in a RAM block, mark discarding of RAM broken.
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Hailiang Zhang <zhang.zhanghailiang@huawei.com>
> Cc: Juan Quintela <quintela@redhat.com>
> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/migration/colo.h |  2 +-
>  migration/migration.c    |  8 +++++++-
>  migration/savevm.c       | 11 +++++++++--
>  3 files changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/include/migration/colo.h b/include/migration/colo.h
> index 1636e6f907..768e1f04c3 100644
> --- a/include/migration/colo.h
> +++ b/include/migration/colo.h
> @@ -25,7 +25,7 @@ void migrate_start_colo_process(MigrationState *s);
>  bool migration_in_colo_state(void);
>  
>  /* loadvm */
> -void migration_incoming_enable_colo(void);
> +int migration_incoming_enable_colo(void);
>  void migration_incoming_disable_colo(void);
>  bool migration_incoming_colo_enabled(void);
>  void *colo_process_incoming_thread(void *opaque);
> diff --git a/migration/migration.c b/migration/migration.c
> index 177cce9e95..f6830e4620 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -338,12 +338,18 @@ bool migration_incoming_colo_enabled(void)
>  
>  void migration_incoming_disable_colo(void)
>  {
> +    ram_block_discard_set_broken(false);
>      migration_colo_enabled = false;
>  }
>  
> -void migration_incoming_enable_colo(void)
> +int migration_incoming_enable_colo(void)
>  {
> +    if (ram_block_discard_set_broken(true)) {
> +        error_report("COLO: cannot set discarding of RAM broken");

I'd prefer 'COLO: cannot disable RAM discard'

'broken' suggests that the user has to go and fix something or report
a bug.

Other than that:


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Dave

> +        return -EBUSY;
> +    }
>      migration_colo_enabled = true;
> +    return 0;
>  }
>  
>  void migrate_add_address(SocketAddress *address)
> diff --git a/migration/savevm.c b/migration/savevm.c
> index c00a6807d9..19b4f9600d 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2111,8 +2111,15 @@ static int loadvm_handle_recv_bitmap(MigrationIncomingState *mis,
>  
>  static int loadvm_process_enable_colo(MigrationIncomingState *mis)
>  {
> -    migration_incoming_enable_colo();
> -    return colo_init_ram_cache();
> +    int ret = migration_incoming_enable_colo();
> +
> +    if (!ret) {
> +        ret = colo_init_ram_cache();
> +        if (ret) {
> +            migration_incoming_disable_colo();
> +        }
> +    }
> +    return ret;
>  }
>  
>  /*
> -- 
> 2.25.3
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [PATCH v1 08/17] migration/colo: Use ram_block_discard_set_broken()
  2020-05-15 13:58     ` Dr. David Alan Gilbert
@ 2020-05-15 14:05       ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-15 14:05 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Eduardo Habkost, Michael S . Tsirkin, Hailiang Zhang,
	Juan Quintela

On 15.05.20 15:58, Dr. David Alan Gilbert wrote:
> * David Hildenbrand (david@redhat.com) wrote:
>> COLO will copy all memory in a RAM block, so mark discarding of RAM as broken.
>>
>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>> Cc: Hailiang Zhang <zhang.zhanghailiang@huawei.com>
>> Cc: Juan Quintela <quintela@redhat.com>
>> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  include/migration/colo.h |  2 +-
>>  migration/migration.c    |  8 +++++++-
>>  migration/savevm.c       | 11 +++++++++--
>>  3 files changed, 17 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/migration/colo.h b/include/migration/colo.h
>> index 1636e6f907..768e1f04c3 100644
>> --- a/include/migration/colo.h
>> +++ b/include/migration/colo.h
>> @@ -25,7 +25,7 @@ void migrate_start_colo_process(MigrationState *s);
>>  bool migration_in_colo_state(void);
>>  
>>  /* loadvm */
>> -void migration_incoming_enable_colo(void);
>> +int migration_incoming_enable_colo(void);
>>  void migration_incoming_disable_colo(void);
>>  bool migration_incoming_colo_enabled(void);
>>  void *colo_process_incoming_thread(void *opaque);
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 177cce9e95..f6830e4620 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -338,12 +338,18 @@ bool migration_incoming_colo_enabled(void)
>>  
>>  void migration_incoming_disable_colo(void)
>>  {
>> +    ram_block_discard_set_broken(false);
>>      migration_colo_enabled = false;
>>  }
>>  
>> -void migration_incoming_enable_colo(void)
>> +int migration_incoming_enable_colo(void)
>>  {
>> +    if (ram_block_discard_set_broken(true)) {
>> +        error_report("COLO: cannot set discarding of RAM broken");
> 
> I'd prefer 'COLO: cannot disable RAM discard'
> 
> 'broken' suggests the user has to go and fix something or report a bug
> or something.

Sounds better, I'll adjust similar messages in the other patches. Thanks!
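
For reference, the error path in the quoted loadvm_process_enable_colo()
hunk is an enable-then-roll-back pattern. Below is a minimal standalone
sketch of that control flow. The QEMU functions are replaced by
hypothetical stubs, and the fail_* flags and the -ENOMEM return value are
assumptions made only so the flow can be exercised outside of QEMU:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the QEMU state touched by the patch. */
static bool colo_enabled;
static bool fail_enable;      /* simulate ram_block_discard_set_broken(true) failing */
static bool fail_cache_init;  /* simulate colo_init_ram_cache() failing */

static int incoming_enable_colo(void)
{
    if (fail_enable) {
        fprintf(stderr, "COLO: cannot disable RAM discard\n");
        return -EBUSY;
    }
    colo_enabled = true;
    return 0;
}

static void incoming_disable_colo(void)
{
    colo_enabled = false;
}

static int init_ram_cache(void)
{
    /* -ENOMEM is an assumed error code for this sketch. */
    return fail_cache_init ? -ENOMEM : 0;
}

/* Same shape as the quoted hunk: undo the enable if the second step fails. */
static int process_enable_colo(void)
{
    int ret = incoming_enable_colo();

    if (!ret) {
        ret = init_ram_cache();
        if (ret) {
            incoming_disable_colo();
        }
    }
    return ret;
}
```

The point of the rollback is that a failed cache init does not leave RAM
discards marked as unusable for everybody else.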

> 
> Other than that:
> 
> 
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> Dave


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v1 07/17] migration/rdma: Use ram_block_discard_set_broken()
  2020-05-15 12:45     ` Dr. David Alan Gilbert
@ 2020-05-15 14:09       ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-15 14:09 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Eduardo Habkost, Michael S . Tsirkin, Juan Quintela

On 15.05.20 14:45, Dr. David Alan Gilbert wrote:
> * David Hildenbrand (david@redhat.com) wrote:
>> RDMA will pin all guest memory (as documented in docs/rdma.txt). We want
>> to mark RAM block discards to be broken - however, to keep it simple
>> use ram_block_discard_is_required() instead of inhibiting.
> 
> Should this be dependent on whether rdma->pin_all is set?
> Even with !pin_all some will be pinned at any given time
> (when it's registered with the rdma stack).

Do you know how much memory this is? Is such memory only temporarily pinned?

At least with special-cases of vfio, it's acceptable if some memory is
temporarily pinned - we assume it's only the working set of the driver,
which guests will not inflate as long as they don't want to shoot
themselves in the foot.

This here sounds like the guest does not know the pinned memory is
special, right?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v1 01/17] exec: Introduce ram_block_discard_set_(unreliable|required)()
  2020-05-15  9:54     ` Dr. David Alan Gilbert
@ 2020-05-15 14:40       ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-15 14:40 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Eduardo Habkost, Michael S . Tsirkin

On 15.05.20 11:54, Dr. David Alan Gilbert wrote:
> * David Hildenbrand (david@redhat.com) wrote:
>> We want to replace qemu_balloon_inhibit() with something more generic.
>> In particular, we want to make sure that technologies that rely on RAM
>> block discards working reliably run mutually exclusive with
>> technologies that break them.
>>
>> E.g., vfio will usually pin all guest memory, making the virtio-balloon
>> basically useless and making the VM consume more memory than reported via
>> the balloon. While the balloon is special already (=> no guarantees, same
>> behavior possible after reboots and with huge pages), this will be
>> different, especially, with virtio-mem.
>>
>> Let's implement a way such that we can make both types of technology run
>> mutually exclusive. We'll convert existing balloon inhibitors in successive
>> patches and add some new ones. Add the check to
>> qemu_balloon_is_inhibited() for now. We might want to make
>> virtio-balloon an actual inhibitor in the future - however, that
>> requires more thought to not break existing setups.
>>
>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>> Cc: Richard Henderson <rth@twiddle.net>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  balloon.c             |  3 ++-
>>  exec.c                | 48 +++++++++++++++++++++++++++++++++++++++++++
>>  include/exec/memory.h | 41 ++++++++++++++++++++++++++++++++++++
>>  3 files changed, 91 insertions(+), 1 deletion(-)
>>
>> diff --git a/balloon.c b/balloon.c
>> index f104b42961..c49f57c27b 100644
>> --- a/balloon.c
>> +++ b/balloon.c
>> @@ -40,7 +40,8 @@ static int balloon_inhibit_count;
>>  
>>  bool qemu_balloon_is_inhibited(void)
>>  {
>> -    return atomic_read(&balloon_inhibit_count) > 0;
>> +    return atomic_read(&balloon_inhibit_count) > 0 ||
>> +           ram_block_discard_is_broken();
>>  }
>>  
>>  void qemu_balloon_inhibit(bool state)
>> diff --git a/exec.c b/exec.c
>> index 2874bb5088..52a6e40e99 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -4049,4 +4049,52 @@ void mtree_print_dispatch(AddressSpaceDispatch *d, MemoryRegion *root)
>>      }
>>  }
>>  
>> +static int ram_block_discard_broken;
> 
> This could do with a comment; if I'm reading this right then
>   +ve means broken
>   -ve means required
> 

I'll add to ram_block_discard_broken:

"If positive, discarding RAM is broken. If negative, discarding of RAM
is required to work correctly."
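
To illustrate the sign convention, here is a minimal standalone sketch of
the same signed-counter idea. It uses C11 <stdatomic.h> instead of QEMU's
atomic_*() helpers, and all names (discard_state, discard_state_acquire(),
discard_set_broken(), ...) are made up for this sketch, so it is not the
code from the patch:

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * > 0: discarding RAM is broken (counts the users that broke it).
 * < 0: discarding RAM is required (counts the users that rely on it).
 *   0: neither.
 */
static atomic_int discard_state;

/*
 * Move the counter one step in direction 'step' (+1 or -1). Fails with
 * -EBUSY if the counter currently has the opposite sign.
 */
static int discard_state_acquire(int step)
{
    int old = atomic_load(&discard_state);

    do {
        if ((step > 0 && old < 0) || (step < 0 && old > 0)) {
            return -EBUSY;
        }
        /* On failure, 'old' is refreshed with the current value. */
    } while (!atomic_compare_exchange_weak(&discard_state, &old, old + step));
    return 0;
}

static int discard_set_broken(bool state)
{
    if (!state) {
        atomic_fetch_sub(&discard_state, 1);
        return 0;
    }
    return discard_state_acquire(+1);
}

static int discard_set_required(bool state)
{
    if (!state) {
        atomic_fetch_add(&discard_state, 1);
        return 0;
    }
    return discard_state_acquire(-1);
}

static bool discard_is_broken(void)
{
    return atomic_load(&discard_state) > 0;
}

static bool discard_is_required(void)
{
    return atomic_load(&discard_state) < 0;
}
```

A single counter is enough because both sides can never be active at the
same time: whoever comes second gets -EBUSY until the other side has
dropped all of its references.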

[...]

>>  
>> +/*
>> + * Inhibit technologies that rely on discarding of parts of RAM blocks to work
>> + * reliably, e.g., to manage the actual amount of memory consumed by the VM
>> + * (then, the memory provided by RAM blocks might be bigger than the desired
>> + * memory consumption). This *must* be set if:
> 
> 'technologies that rely on discarding of parts of RAM blocks to work
> reliably' is pretty long; I'm not sure of a better way of saying it
> though.

Maybe simply

"Inhibit technologies that rely on discarding of pages in RAM blocks to
work"?

?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v1 01/17] exec: Introduce ram_block_discard_set_(unreliable|required)()
  2020-05-06  9:49   ` David Hildenbrand
@ 2020-05-15 14:54     ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-15 14:54 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin

On 06.05.20 11:49, David Hildenbrand wrote:
> We want to replace qemu_balloon_inhibit() with something more generic.
> In particular, we want to make sure that technologies that rely on RAM
> block discards working reliably run mutually exclusive with
> technologies that break them.
> 
> E.g., vfio will usually pin all guest memory, making the virtio-balloon
> basically useless and making the VM consume more memory than reported via
> the balloon. While the balloon is special already (=> no guarantees, same
> behavior possible after reboots and with huge pages), this will be
> different, especially, with virtio-mem.
> 
> Let's implement a way such that we can make both types of technology run
> mutually exclusive. We'll convert existing balloon inhibitors in successive
> patches and add some new ones. Add the check to
> qemu_balloon_is_inhibited() for now. We might want to make
> virtio-balloon an actual inhibitor in the future - however, that
> requires more thought to not break existing setups.
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  balloon.c             |  3 ++-
>  exec.c                | 48 +++++++++++++++++++++++++++++++++++++++++++
>  include/exec/memory.h | 41 ++++++++++++++++++++++++++++++++++++
>  3 files changed, 91 insertions(+), 1 deletion(-)
> 
> diff --git a/balloon.c b/balloon.c
> index f104b42961..c49f57c27b 100644
> --- a/balloon.c
> +++ b/balloon.c
> @@ -40,7 +40,8 @@ static int balloon_inhibit_count;
>  
>  bool qemu_balloon_is_inhibited(void)
>  {
> -    return atomic_read(&balloon_inhibit_count) > 0;
> +    return atomic_read(&balloon_inhibit_count) > 0 ||
> +           ram_block_discard_is_broken();
>  }
>  
>  void qemu_balloon_inhibit(bool state)
> diff --git a/exec.c b/exec.c
> index 2874bb5088..52a6e40e99 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -4049,4 +4049,52 @@ void mtree_print_dispatch(AddressSpaceDispatch *d, MemoryRegion *root)
>      }
>  }
>  
> +static int ram_block_discard_broken;
> +
> +int ram_block_discard_set_broken(bool state)
> +{
> +    int old;
> +
> +    if (!state) {
> +        atomic_dec(&ram_block_discard_broken);
> +        return 0;
> +    }
> +
> +    do {
> +        old = atomic_read(&ram_block_discard_broken);
> +        if (old < 0) {
> +            return -EBUSY;
> +        }
> +    } while (atomic_cmpxchg(&ram_block_discard_broken, old, old + 1) != old);
> +    return 0;
> +}
> +
> +int ram_block_discard_set_required(bool state)
> +{
> +    int old;
> +
> +    if (!state) {
> +        atomic_inc(&ram_block_discard_broken);
> +        return 0;
> +    }
> +
> +    do {
> +        old = atomic_read(&ram_block_discard_broken);
> +        if (old > 0) {
> +            return -EBUSY;
> +        }
> +    } while (atomic_cmpxchg(&ram_block_discard_broken, old, old - 1) != old);
> +    return 0;
> +}
> +
> +bool ram_block_discard_is_broken(void)
> +{
> +    return atomic_read(&ram_block_discard_broken) > 0;
> +}
> +
> +bool ram_block_discard_is_required(void)
> +{
> +    return atomic_read(&ram_block_discard_broken) < 0;
> +}
> +
>  #endif
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index e000bd2f97..9bb5ced38d 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -2463,6 +2463,47 @@ static inline MemOp devend_memop(enum device_endian end)
>  }
>  #endif
>  
> +/*
> + * Inhibit technologies that rely on discarding of parts of RAM blocks to work
> + * reliably, e.g., to manage the actual amount of memory consumed by the VM
> + * (then, the memory provided by RAM blocks might be bigger than the desired
> + * memory consumption). This *must* be set if:
> + * - Discarding parts of a RAM block does not result in the change being
> + *   reflected in the VM and the pages getting freed.
> + * - All memory in RAM blocks is pinned or duplicated, invalidating any previous
> + *   discards blindly.
> + * - Discarding parts of a RAM block will result in integrity issues (e.g.,
> + *   encrypted VMs).
> + * Technologies that only temporarily pin the current working set of a
> + * driver are fine, because we don't expect such pages to be discarded
> + * (esp. based on guest action like balloon inflation).
> + *
> + * This is *not* to be used to protect from concurrent discards (esp.,
> + * postcopy).
> + *
> + * Returns 0 if successful. Returns -EBUSY if a technology that relies on
> + * discards to work reliably is active.
> + */
> +int ram_block_discard_set_broken(bool state);
> +
> +/*
> + * Inhibit technologies that will break discarding of pages in RAM blocks.
> + *
> + * Returns 0 if successful. Returns -EBUSY if discards are already set to
> + * broken.
> + */
> +int ram_block_discard_set_required(bool state);
> +
> +/*
> + * Test if discarding of memory in ram blocks is broken.
> + */
> +bool ram_block_discard_is_broken(void);
> +
> +/*
> + * Test if discarding of memory in ram blocks is required to work reliably.
> + */
> +bool ram_block_discard_is_required(void);
> +
>  #endif
>  
>  #endif
> 

I'm wondering if I'll just call these functions

ram_block_discard_disable()

and

ram_block_discard_require()

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v1 17/17] virtio-pci: Send qapi events when the virtio-mem size changes
  2020-05-06  9:49   ` David Hildenbrand
@ 2020-05-15 15:18     ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-15 15:18 UTC (permalink / raw)
  To: qemu-devel
  Cc: kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin,
	Markus Armbruster, Eric Blake, Igor Mammedov


>  #endif /* QEMU_VIRTIO_MEM_PCI_H */
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index 88a99a0d90..eb5cf66855 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -491,7 +491,7 @@ static void virtio_mem_device_unrealize(DeviceState *dev, Error **errp)
>      virtio_del_queue(vdev, 0);
>      virtio_cleanup(vdev);
>      g_free(vmem->bitmap);
> -    ramblock_discard_set_required(false);
> +    ram_block_discard_set_required(false);

^ this belongs in patch #10.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v1 10/17] virtio-mem: Paravirtualized memory hot(un)plug
  2020-05-06  9:49   ` David Hildenbrand
@ 2020-05-15 15:37     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-15 15:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Eduardo Habkost, Michael S . Tsirkin, Eric Blake,
	Markus Armbruster, Igor Mammedov

I'm not sure if it's possible to split this up; it's a bit big.
It could also do with a pile of trace_ entries to figure out what it's
doing.


* David Hildenbrand (david@redhat.com) wrote:
> This is the very basic/initial version of virtio-mem. An introduction to
> virtio-mem can be found in the Linux kernel driver [1]. While it can be
> used in the current state for hotplug of a smaller amount of memory, it
> will heavily benefit from resizeable memory regions in the future.
> 
> Each virtio-mem device manages a memory region (provided via a memory
> backend). Once the hypervisor sets a requested size ("requested-size"),
> the guest can try to plug/unplug blocks of memory within that region in
> order to reach the requested size. Initially, and after a reboot, all
> memory is unplugged (except in special cases - reboot during postcopy).
> 
> The guest may only try to plug/unplug blocks of memory within the usable
> region size. The usable region size is a little bigger than the
> requested size, to give the device driver some flexibility. The usable
> region size will only grow, except on reboots or when all memory is
> requested to get unplugged. The guest can never plug more memory than
> requested. Unplugged memory will get zapped/discarded, similar to in a
> balloon device.
> 
> The block size is variable, however, it is always chosen in a way such that
> THP splits are avoided (e.g., 2MB). The state of each block
> (plugged/unplugged) is tracked in a bitmap.
> 
> As virtio-mem devices (e.g., virtio-mem-pci) will be memory devices, we now
> expose "VirtioMEMDeviceInfo" via "query-memory-devices".
> 
> --------------------------------------------------------------------------
> 
> There are two important follow-up items that are in the works:
> 1. Resizeable memory regions: Use resizeable allocations/RAM blocks to
>    grow/shrink along with the usable region size. This avoids creating
>    initially very big VMAs, RAM blocks, and KVM slots.
> 2. Protection of unplugged memory: Make sure the guest cannot actually
>    make use of unplugged memory.
> 
> Other follow-up items that are in the works:
> 1. Exclude unplugged memory during migration (via precopy notifier).
> 2. Handle remapping of memory.
> 3. Support for other architectures.
> 
> --------------------------------------------------------------------------
> 
> Example usage (virtio-mem-pci is introduced in follow-up patches):
> 
> Start QEMU with two virtio-mem devices (one per NUMA node):
>  $ qemu-system-x86_64 -m 4G,maxmem=20G \
>   -smp sockets=2,cores=2 \
>   -numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3 \
>   [...]
>   -object memory-backend-ram,id=mem0,size=8G \
>   -device virtio-mem-pci,id=vm0,memdev=mem0,node=0,requested-size=0M \
>   -object memory-backend-ram,id=mem1,size=8G \
>   -device virtio-mem-pci,id=vm1,memdev=mem1,node=1,requested-size=1G
> 
> Query the configuration:
>  (qemu) info memory-devices
>  Memory device [virtio-mem]: "vm0"
>    memaddr: 0x140000000
>    node: 0
>    requested-size: 0
>    size: 0
>    max-size: 8589934592
>    block-size: 2097152
>    memdev: /objects/mem0
>  Memory device [virtio-mem]: "vm1"
>    memaddr: 0x340000000
>    node: 1
>    requested-size: 1073741824
>    size: 1073741824
>    max-size: 8589934592
>    block-size: 2097152
>    memdev: /objects/mem1
> 
> Add some memory to node 0:
>  (qemu) qom-set vm0 requested-size 500M
> 
> Remove some memory from node 1:
>  (qemu) qom-set vm1 requested-size 200M
> 
> Query the configuration again:
>  (qemu) info memory-devices
>  Memory device [virtio-mem]: "vm0"
>    memaddr: 0x140000000
>    node: 0
>    requested-size: 524288000
>    size: 524288000
>    max-size: 8589934592
>    block-size: 2097152
>    memdev: /objects/mem0
>  Memory device [virtio-mem]: "vm1"
>    memaddr: 0x340000000
>    node: 1
>    requested-size: 209715200
>    size: 209715200
>    max-size: 8589934592
>    block-size: 2097152
>    memdev: /objects/mem1
> 
> [1] https://lkml.kernel.org/r/20200311171422.10484-1-david@redhat.com
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Markus Armbruster <armbru@redhat.com>
> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  hw/virtio/Kconfig              |  11 +
>  hw/virtio/Makefile.objs        |   1 +
>  hw/virtio/virtio-mem.c         | 762 +++++++++++++++++++++++++++++++++
>  include/hw/virtio/virtio-mem.h |  80 ++++
>  qapi/misc.json                 |  39 +-
>  5 files changed, 892 insertions(+), 1 deletion(-)
>  create mode 100644 hw/virtio/virtio-mem.c
>  create mode 100644 include/hw/virtio/virtio-mem.h
> 
> diff --git a/hw/virtio/Kconfig b/hw/virtio/Kconfig
> index 83122424fa..0eda25c4e1 100644
> --- a/hw/virtio/Kconfig
> +++ b/hw/virtio/Kconfig
> @@ -47,3 +47,14 @@ config VIRTIO_PMEM
>      depends on VIRTIO
>      depends on VIRTIO_PMEM_SUPPORTED
>      select MEM_DEVICE
> +
> +config VIRTIO_MEM_SUPPORTED
> +    bool
> +
> +config VIRTIO_MEM
> +    bool
> +    default y
> +    depends on VIRTIO
> +    depends on LINUX
> +    depends on VIRTIO_MEM_SUPPORTED
> +    select MEM_DEVICE
> diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs
> index 4e4d39a0a4..7df70e977e 100644
> --- a/hw/virtio/Makefile.objs
> +++ b/hw/virtio/Makefile.objs
> @@ -18,6 +18,7 @@ common-obj-$(call land,$(CONFIG_VIRTIO_PMEM),$(CONFIG_VIRTIO_PCI)) += virtio-pme
>  obj-$(call land,$(CONFIG_VHOST_USER_FS),$(CONFIG_VIRTIO_PCI)) += vhost-user-fs-pci.o
>  obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
>  obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock.o
> +obj-$(CONFIG_VIRTIO_MEM) += virtio-mem.o
>  
>  ifeq ($(CONFIG_VIRTIO_PCI),y)
>  obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock-pci.o
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> new file mode 100644
> index 0000000000..e25b2c74f2
> --- /dev/null
> +++ b/hw/virtio/virtio-mem.c
> @@ -0,0 +1,762 @@
> +/*
> + * Virtio MEM device
> + *
> + * Copyright (C) 2020 Red Hat, Inc.
> + *
> + * Authors:
> + *  David Hildenbrand <david@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu-common.h"
> +#include "qemu/iov.h"
> +#include "qemu/cutils.h"
> +#include "qemu/error-report.h"
> +#include "qemu/units.h"
> +#include "sysemu/numa.h"
> +#include "sysemu/sysemu.h"
> +#include "sysemu/reset.h"
> +#include "hw/virtio/virtio.h"
> +#include "hw/virtio/virtio-bus.h"
> +#include "hw/virtio/virtio-access.h"
> +#include "hw/virtio/virtio-mem.h"
> +#include "qapi/error.h"
> +#include "qapi/visitor.h"
> +#include "exec/ram_addr.h"
> +#include "migration/misc.h"
> +#include "migration/postcopy-ram.h"
> +#include "hw/boards.h"
> +#include "hw/qdev-properties.h"
> +#include "config-devices.h"
> +
> +/*
> + * Use QEMU_VMALLOC_ALIGN, so no THP will have to be split when unplugging
> + * memory (e.g., 2MB on x86_64).
> + */
> +#define VIRTIO_MEM_MIN_BLOCK_SIZE QEMU_VMALLOC_ALIGN
> +/*
> + * Size the usable region bigger than the requested size if possible. Esp.
> + * Linux guests will only add (aligned) memory blocks in case they fully
> + * fit into the usable region, but plug+online only a subset of the pages.
> + * The memory block size corresponds mostly to the section size.
> + *
> + * This allows e.g., to add 20MB with a section size of 128MB on x86_64, and
> + * a section size of 1GB on arm64 (as long as the start address is properly
> + * aligned, similar to ordinary DIMMs).
> + *
> + * We can change this at any time and maybe even make it configurable if
> + * necessary (as the section size can change). But it's more likely that the
> + * section size will rather get smaller and not bigger over time.
> + */
> +#if defined(__x86_64__)
> +#define VIRTIO_MEM_USABLE_EXTENT (2 * (128 * MiB))
> +#else
> +#error VIRTIO_MEM_USABLE_EXTENT not defined
> +#endif
> +
> +static bool virtio_mem_discard_inhibited(void)
> +{
> +    PostcopyState ps = postcopy_state_get();
> +
> +    /* Postcopy cannot deal with concurrent discards (yet), so it's special. */
> +    return ps >= POSTCOPY_INCOMING_DISCARD && ps < POSTCOPY_INCOMING_END;
> +}
> +
> +static bool virtio_mem_test_bitmap(VirtIOMEM *vmem, uint64_t start_gpa,
> +                                   uint64_t size, bool plug)
> +{
> +    uint64_t bit = (start_gpa - vmem->addr) / vmem->block_size;
> +
> +    g_assert(QEMU_IS_ALIGNED(start_gpa, vmem->block_size));
> +    g_assert(QEMU_IS_ALIGNED(size, vmem->block_size));
> +    g_assert(vmem->bitmap);
> +
> +    while (size) {
> +        g_assert((bit / BITS_PER_BYTE) <= vmem->bitmap_size);
> +
> +        if (plug && !test_bit(bit, vmem->bitmap)) {
> +            return false;
> +        } else if (!plug && test_bit(bit, vmem->bitmap)) {
> +            return false;
> +        }
> +        size -= vmem->block_size;
> +        bit++;
> +    }
> +    return true;
> +}
> +
> +static void virtio_mem_set_bitmap(VirtIOMEM *vmem, uint64_t start_gpa,
> +                                  uint64_t size, bool plug)
> +{
> +    const uint64_t bit = (start_gpa - vmem->addr) / vmem->block_size;
> +    const uint64_t nbits = size / vmem->block_size;
> +
> +    g_assert(QEMU_IS_ALIGNED(start_gpa, vmem->block_size));
> +    g_assert(QEMU_IS_ALIGNED(size, vmem->block_size));
> +    g_assert(vmem->bitmap);

This bit/nbits/alignment checking could be split out and shared between
these two functions.
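
Something along these lines, as a rough sketch only. It uses a stand-in
struct instead of VirtIOMEM so that it compiles outside of QEMU, and the
names vmem_range_ctx and vmem_range_to_bits are made up for illustration;
neither exists in the patch. The asserts mirror the three g_assert() calls
that both functions currently repeat:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the VirtIOMEM fields the helper needs (assumption). */
struct vmem_range_ctx {
    uint64_t addr;        /* start of the region in guest physical memory */
    uint64_t block_size;  /* block size in bytes */
    bool has_bitmap;      /* stands in for the vmem->bitmap != NULL check */
};

/*
 * Hypothetical shared helper: run the alignment checks once and convert
 * a (start_gpa, size) range into a (first bit, number of bits) pair.
 */
static void vmem_range_to_bits(const struct vmem_range_ctx *ctx,
                               uint64_t start_gpa, uint64_t size,
                               uint64_t *first_bit, uint64_t *nbits)
{
    assert(ctx->has_bitmap);
    assert((start_gpa % ctx->block_size) == 0);
    assert((size % ctx->block_size) == 0);

    *first_bit = (start_gpa - ctx->addr) / ctx->block_size;
    *nbits = size / ctx->block_size;
}
```

Both virtio_mem_test_bitmap() and virtio_mem_set_bitmap() could then start
from the returned bit/nbits pair instead of open-coding the checks.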

> +    if (plug) {
> +        bitmap_set(vmem->bitmap, bit, nbits);
> +    } else {
> +        bitmap_clear(vmem->bitmap, bit, nbits);
> +    }
> +}
> +
> +static void virtio_mem_send_response(VirtIOMEM *vmem, VirtQueueElement *elem,
> +                                     struct virtio_mem_resp *resp)
> +{
> +    VirtIODevice *vdev = VIRTIO_DEVICE(vmem);
> +    VirtQueue *vq = vmem->vq;
> +
> +    iov_from_buf(elem->in_sg, elem->in_num, 0, resp, sizeof(*resp));
> +
> +    virtqueue_push(vq, elem, sizeof(*resp));
> +    virtio_notify(vdev, vq);
> +}
> +
> +static void virtio_mem_send_response_simple(VirtIOMEM *vmem,
> +                                            VirtQueueElement *elem,
> +                                            uint16_t type)
> +{
> +    VirtIODevice *vdev = VIRTIO_DEVICE(vmem);
> +    struct virtio_mem_resp resp = {};
> +
> +    virtio_stw_p(vdev, &resp.type, type);
> +    virtio_mem_send_response(vmem, elem, &resp);
> +}
> +
> +static void virtio_mem_bad_request(VirtIOMEM *vmem, const char *msg)
> +{
> +    virtio_error(VIRTIO_DEVICE(vmem), "virtio-mem protocol violation: %s", msg);
> +}
> +
> +static bool virtio_mem_valid_range(VirtIOMEM *vmem, uint64_t gpa, uint64_t size)
> +{
> +    if (!QEMU_IS_ALIGNED(gpa, vmem->block_size)) {
> +        return false;
> +    }
> +    if (gpa + size < gpa || size == 0) {
> +        return false;
> +    }
> +    if (gpa < vmem->addr || gpa >= vmem->addr + vmem->usable_region_size) {
> +        return false;
> +    }
> +    if (gpa + size > vmem->addr + vmem->usable_region_size) {
> +        return false;
> +    }
> +    return true;
> +}
> +
> +static int virtio_mem_set_block_state(VirtIOMEM *vmem, uint64_t start_gpa,
> +                                      uint64_t size, bool plug)
> +{
> +    const uint64_t offset = start_gpa - vmem->addr;
> +    int ret;
> +
> +    if (!plug) {
> +        if (virtio_mem_discard_inhibited()) {
> +            return -EBUSY;
> +        }
> +        /* Note: Discarding should never fail at this point. */
> +        ret = ram_block_discard_range(vmem->memdev->mr.ram_block, offset, size);
> +        if (ret) {

Maybe an error_report() here?
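
A sketch of the idea, under the assumption that the interesting bits are
the range and the original errno, which are lost once the result is turned
into -EBUSY. error_report_stub() is only a stand-in for QEMU's
error_report() so the snippet builds on its own, and
discard_failed_to_busy() is a made-up name, not something in the patch:

```c
#include <errno.h>
#include <inttypes.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Last formatted message, kept so the sketch can be checked without QEMU. */
static char last_error[256];

/* Stand-in for QEMU's error_report(): formats the message and prints it. */
static void error_report_stub(const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    vsnprintf(last_error, sizeof(last_error), fmt, ap);
    va_end(ap);
    fprintf(stderr, "%s\n", last_error);
}

/*
 * Hypothetical wrapper: report the failed range and the original errno,
 * then return -EBUSY like the patch does.
 */
static int discard_failed_to_busy(int ret, uint64_t offset, uint64_t size)
{
    if (ret) {
        error_report_stub("virtio-mem: discarding 0x%" PRIx64
                          " bytes at offset 0x%" PRIx64 " failed: %s",
                          size, offset, strerror(-ret));
        return -EBUSY;
    }
    return 0;
}
```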

> +            return -EBUSY;
> +        }
> +    }
> +    virtio_mem_set_bitmap(vmem, start_gpa, size, plug);
> +    return 0;
> +}
> +
> +static int virtio_mem_state_change_request(VirtIOMEM *vmem, uint64_t gpa,
> +                                           uint16_t nb_blocks, bool plug)
> +{
> +    const uint64_t size = nb_blocks * vmem->block_size;
> +    int ret;
> +
> +    if (!virtio_mem_valid_range(vmem, gpa, size)) {
> +        return VIRTIO_MEM_RESP_ERROR;
> +    }
> +
> +    if (plug && (vmem->size + size > vmem->requested_size)) {
> +        return VIRTIO_MEM_RESP_NACK;
> +    }
> +
> +    /* test if really all blocks are in the opposite state */
> +    if (!virtio_mem_test_bitmap(vmem, gpa, size, !plug)) {
> +        return VIRTIO_MEM_RESP_ERROR;
> +    }
> +
> +    ret = virtio_mem_set_block_state(vmem, gpa, size, plug);
> +    if (ret) {
> +        return VIRTIO_MEM_RESP_BUSY;
> +    }
> +    if (plug) {
> +        vmem->size += size;
> +    } else {
> +        vmem->size -= size;
> +    }
> +    return VIRTIO_MEM_RESP_ACK;
> +}
> +
> +static void virtio_mem_plug_request(VirtIOMEM *vmem, VirtQueueElement *elem,
> +                                    struct virtio_mem_req *req)
> +{
> +    const uint64_t gpa = le64_to_cpu(req->u.plug.addr);
> +    const uint16_t nb_blocks = le16_to_cpu(req->u.plug.nb_blocks);
> +    uint16_t type;
> +
> +    type = virtio_mem_state_change_request(vmem, gpa, nb_blocks, true);
> +    virtio_mem_send_response_simple(vmem, elem, type);
> +}
> +
> +static void virtio_mem_unplug_request(VirtIOMEM *vmem, VirtQueueElement *elem,
> +                                      struct virtio_mem_req *req)
> +{
> +    const uint64_t gpa = le64_to_cpu(req->u.unplug.addr);
> +    const uint16_t nb_blocks = le16_to_cpu(req->u.unplug.nb_blocks);
> +    uint16_t type;
> +
> +    type = virtio_mem_state_change_request(vmem, gpa, nb_blocks, false);
> +    virtio_mem_send_response_simple(vmem, elem, type);
> +}
> +
> +static void virtio_mem_resize_usable_region(VirtIOMEM *vmem,
> +                                            uint64_t requested_size,
> +                                            bool can_shrink)
> +{
> +    uint64_t newsize = MIN(memory_region_size(&vmem->memdev->mr),
> +                           requested_size + VIRTIO_MEM_USABLE_EXTENT);
> +
> +    /* We must only grow while the guest is running. */
> +    if (newsize < vmem->usable_region_size && !can_shrink) {
> +        return;
> +    }
> +
> +    vmem->usable_region_size = newsize;
> +}
> +
> +static int virtio_mem_unplug_all(VirtIOMEM *vmem)
> +{
> +    RAMBlock *rb = vmem->memdev->mr.ram_block;
> +    int ret;
> +
> +    if (virtio_mem_discard_inhibited()) {
> +        return -EBUSY;
> +    }
> +
> +    ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb));
> +    if (ret) {
> +        /* Note: Discarding should never fail at this point. */

error_report() here as well?

> +        return -EBUSY;
> +    }
> +    bitmap_clear(vmem->bitmap, 0, vmem->bitmap_size);
> +    vmem->size = 0;
> +
> +    virtio_mem_resize_usable_region(vmem, vmem->requested_size, true);
> +    return 0;
> +}
> +
> +static void virtio_mem_unplug_all_request(VirtIOMEM *vmem,
> +                                          VirtQueueElement *elem)
> +{
> +
> +    if (virtio_mem_unplug_all(vmem)) {
> +        virtio_mem_send_response_simple(vmem, elem, VIRTIO_MEM_RESP_BUSY);
> +    } else {
> +        virtio_mem_send_response_simple(vmem, elem, VIRTIO_MEM_RESP_ACK);
> +    }
> +}
> +
> +static void virtio_mem_state_request(VirtIOMEM *vmem, VirtQueueElement *elem,
> +                                     struct virtio_mem_req *req)
> +{
> +    const uint64_t gpa = le64_to_cpu(req->u.state.addr);
> +    const uint16_t nb_blocks = le16_to_cpu(req->u.state.nb_blocks);
> +    const uint64_t size = nb_blocks * vmem->block_size;
> +    VirtIODevice *vdev = VIRTIO_DEVICE(vmem);
> +    struct virtio_mem_resp resp = {};
> +
> +    if (!virtio_mem_valid_range(vmem, gpa, size)) {
> +        virtio_mem_send_response_simple(vmem, elem, VIRTIO_MEM_RESP_ERROR);
> +        return;
> +    }
> +
> +    virtio_stw_p(vdev, &resp.type, VIRTIO_MEM_RESP_ACK);
> +    if (virtio_mem_test_bitmap(vmem, gpa, size, true)) {
> +        virtio_stw_p(vdev, &resp.u.state.state, VIRTIO_MEM_STATE_PLUGGED);
> +    } else if (virtio_mem_test_bitmap(vmem, gpa, size, false)) {
> +        virtio_stw_p(vdev, &resp.u.state.state, VIRTIO_MEM_STATE_UNPLUGGED);
> +    } else {
> +        virtio_stw_p(vdev, &resp.u.state.state, VIRTIO_MEM_STATE_MIXED);
> +    }
> +    virtio_mem_send_response(vmem, elem, &resp);
> +}
> +
> +static void virtio_mem_handle_request(VirtIODevice *vdev, VirtQueue *vq)
> +{
> +    const int len = sizeof(struct virtio_mem_req);
> +    VirtIOMEM *vmem = VIRTIO_MEM(vdev);
> +    VirtQueueElement *elem;
> +    struct virtio_mem_req req;
> +    uint64_t type;
> +
> +    while (true) {
> +        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
> +        if (!elem) {
> +            return;
> +        }
> +
> +        if (iov_to_buf(elem->out_sg, elem->out_num, 0, &req, len) < len) {
> +            virtio_mem_bad_request(vmem, "invalid request size");

Print the size in the error message.

> +            g_free(elem);
> +            return;
> +        }
> +
> +        if (iov_size(elem->in_sg, elem->in_num) <
> +            sizeof(struct virtio_mem_resp)) {
> +            virtio_mem_bad_request(vmem, "not enough space for response");
> +            g_free(elem);
> +            return;
> +        }
> +
> +        type = le16_to_cpu(req.type);
> +        switch (type) {
> +        case VIRTIO_MEM_REQ_PLUG:
> +            virtio_mem_plug_request(vmem, elem, &req);
> +            break;
> +        case VIRTIO_MEM_REQ_UNPLUG:
> +            virtio_mem_unplug_request(vmem, elem, &req);
> +            break;
> +        case VIRTIO_MEM_REQ_UNPLUG_ALL:
> +            virtio_mem_unplug_all_request(vmem, elem);
> +            break;
> +        case VIRTIO_MEM_REQ_STATE:
> +            virtio_mem_state_request(vmem, elem, &req);
> +            break;
> +        default:
> +            virtio_mem_bad_request(vmem, "unknown request type");

Could include the type in the message.
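
For both this and the request size comment further up, one option might be
to make the bad-request helper take a format string. A minimal standalone
sketch, assuming the formatted text would end up being passed on to
virtio_error(); here it is only stored in a buffer so the snippet builds
without QEMU. bad_request_fmt(), report_short_request() and
report_unknown_type() are made-up names for illustration:

```c
#include <inttypes.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Holds the last message; in QEMU this would go to virtio_error(). */
static char last_violation[256];

/* Hypothetical printf-style variant of virtio_mem_bad_request(). */
static void bad_request_fmt(const char *fmt, ...)
{
    char msg[200];
    va_list ap;

    va_start(ap, fmt);
    vsnprintf(msg, sizeof(msg), fmt, ap);
    va_end(ap);
    snprintf(last_violation, sizeof(last_violation),
             "virtio-mem protocol violation: %s", msg);
}

/* Request shorter than expected: include both sizes. */
static void report_short_request(size_t got, size_t expected)
{
    bad_request_fmt("invalid request size: got %zu bytes, expected %zu",
                    got, expected);
}

/* Unknown request: include the type value the guest sent. */
static void report_unknown_type(uint16_t type)
{
    bad_request_fmt("unknown request type: %" PRIu16, type);
}
```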


> +            g_free(elem);
> +            return;
> +        }
> +
> +        g_free(elem);
> +    }
> +}
> +
> +static void virtio_mem_get_config(VirtIODevice *vdev, uint8_t *config_data)
> +{
> +    VirtIOMEM *vmem = VIRTIO_MEM(vdev);
> +    struct virtio_mem_config *config = (void *) config_data;
> +
> +    config->block_size = cpu_to_le32(vmem->block_size);
> +    config->node_id = cpu_to_le16(vmem->node);
> +    config->requested_size = cpu_to_le64(vmem->requested_size);
> +    config->plugged_size = cpu_to_le64(vmem->size);
> +    config->addr = cpu_to_le64(vmem->addr);
> +    config->region_size = cpu_to_le64(memory_region_size(&vmem->memdev->mr));
> +    config->usable_region_size = cpu_to_le64(vmem->usable_region_size);
> +}
> +
> +static uint64_t virtio_mem_get_features(VirtIODevice *vdev, uint64_t features,
> +                                        Error **errp)
> +{
> +    MachineState *ms = MACHINE(qdev_get_machine());
> +
> +    if (ms->numa_state) {
> +#if defined(CONFIG_ACPI)
> +        virtio_add_feature(&features, VIRTIO_MEM_F_ACPI_PXM);
> +#endif
> +    }
> +    return features;
> +}
> +
> +static void virtio_mem_system_reset(void *opaque)
> +{
> +    VirtIOMEM *vmem = VIRTIO_MEM(opaque);
> +
> +    /*
> +     * During usual resets, we will unplug all memory and shrink the usable
> +     * region size. This is, however, not possible in all scenarios. Then,
> +     * the guest has to deal with this manually (VIRTIO_MEM_REQ_UNPLUG_ALL).
> +     */
> +    virtio_mem_unplug_all(vmem);
> +}
> +
> +static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
> +{
> +    MachineState *ms = MACHINE(qdev_get_machine());
> +    int nb_numa_nodes = ms->numa_state ? ms->numa_state->num_nodes : 0;
> +    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
> +    VirtIOMEM *vmem = VIRTIO_MEM(dev);
> +    uint64_t page_size;
> +    RAMBlock *rb;
> +    int ret;
> +
> +    if (!vmem->memdev) {
> +        error_setg(errp, "'%s' property must be set", VIRTIO_MEM_MEMDEV_PROP);
> +        return;
> +    } else if (host_memory_backend_is_mapped(vmem->memdev)) {
> +        char *path = object_get_canonical_path_component(OBJECT(vmem->memdev));
> +
> +        error_setg(errp, "can't use already busy memdev: %s", path);
> +        g_free(path);
> +        return;
> +    }
> +
> +    if ((nb_numa_nodes && vmem->node >= nb_numa_nodes) ||
> +        (!nb_numa_nodes && vmem->node)) {
> +        error_setg(errp, "Property '%s' has value '%" PRIu32
> +                   "', which exceeds the number of numa nodes: %d",
> +                   VIRTIO_MEM_NODE_PROP, vmem->node,
> +                   nb_numa_nodes ? nb_numa_nodes : 1);
> +        return;
> +    }
> +
> +    if (enable_mlock) {
> +        error_setg(errp, "not compatible with mlock yet");
> +        return;
> +    }
> +
> +    if (!memory_region_is_ram(&vmem->memdev->mr) ||
> +        memory_region_is_rom(&vmem->memdev->mr) ||
> +        !vmem->memdev->mr.ram_block) {
> +        error_setg(errp, "unsupported memdev");
> +        return;
> +    }
> +
> +    rb = vmem->memdev->mr.ram_block;
> +    page_size = qemu_ram_pagesize(rb);
> +
> +    if (vmem->block_size < page_size) {
> +        error_setg(errp, "'%s' has to be at least the page size (0x%"
> +                   PRIx64 ")", VIRTIO_MEM_BLOCK_SIZE_PROP, page_size);
> +        return;
> +    } else if (!QEMU_IS_ALIGNED(vmem->requested_size, vmem->block_size)) {
> +        error_setg(errp, "'%s' has to be multiples of '%s' (0x%" PRIx32
> +                   ")", VIRTIO_MEM_REQUESTED_SIZE_PROP,
> +                   VIRTIO_MEM_BLOCK_SIZE_PROP, vmem->block_size);
> +        return;
> +    } else if (!QEMU_IS_ALIGNED(memory_region_size(&vmem->memdev->mr),
> +                                vmem->block_size)) {
> +        error_setg(errp, "'%s' backend size has to be multiples of '%s' (0x%"
> +                   PRIx32 ")", VIRTIO_MEM_MEMDEV_PROP,
> +                   VIRTIO_MEM_BLOCK_SIZE_PROP, vmem->block_size);
> +        return;
> +    }
> +
> +    if (ram_block_discard_set_required(true)) {
> +        error_setg(errp, "Discarding RAM is marked broken.");
> +        return;
> +    }
> +
> +    ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb));
> +    if (ret) {
> +        /* Note: Discarding should never fail at this point. */
> +        error_setg_errno(errp, -ret, "Discarding RAM failed.");
> +        ram_block_discard_set_required(false);
> +        return;
> +    }
> +
> +    virtio_mem_resize_usable_region(vmem, vmem->requested_size, true);
> +
> +    vmem->bitmap_size = memory_region_size(&vmem->memdev->mr) /
> +                        vmem->block_size;
> +    vmem->bitmap = bitmap_new(vmem->bitmap_size);
> +
> +    virtio_init(vdev, TYPE_VIRTIO_MEM, VIRTIO_ID_MEM,
> +                sizeof(struct virtio_mem_config));
> +    vmem->vq = virtio_add_queue(vdev, 128, virtio_mem_handle_request);
> +
> +    host_memory_backend_set_mapped(vmem->memdev, true);
> +    vmstate_register_ram(&vmem->memdev->mr, DEVICE(vmem));
> +    qemu_register_reset(virtio_mem_system_reset, vmem);
> +    return;
> +}
> +
> +static void virtio_mem_device_unrealize(DeviceState *dev, Error **errp)
> +{
> +    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
> +    VirtIOMEM *vmem = VIRTIO_MEM(dev);
> +
> +    qemu_unregister_reset(virtio_mem_system_reset, vmem);
> +    vmstate_unregister_ram(&vmem->memdev->mr, DEVICE(vmem));
> +    host_memory_backend_set_mapped(vmem->memdev, false);
> +    virtio_del_queue(vdev, 0);
> +    virtio_cleanup(vdev);
> +    g_free(vmem->bitmap);
> +    ramblock_discard_set_required(false);
> +}
> +
> +static int virtio_mem_pre_save(void *opaque)
> +{
> +    VirtIOMEM *vmem = VIRTIO_MEM(opaque);
> +
> +    vmem->migration_addr = vmem->addr;
> +    vmem->migration_block_size = vmem->block_size;

You might look at VMSTATE_WITH_TMP; it could let you avoid having the
dummy fields.

> +    return 0;
> +}
> +
> +static int virtio_mem_restore_unplugged(VirtIOMEM *vmem)
> +{
> +    unsigned long bit;
> +    uint64_t offset;
> +    int ret;
> +
> +    /* TODO: Better postcopy handling - defer to postcopy end. */
> +    if (virtio_mem_discard_inhibited()) {
> +        return 0;
> +    }
> +
> +    bit = find_first_zero_bit(vmem->bitmap, vmem->bitmap_size);
> +    while (bit < vmem->bitmap_size) {
> +        offset = bit * vmem->block_size;
> +
> +        if (offset + vmem->block_size >=
> +            memory_region_size(&vmem->memdev->mr)) {
> +            break;
> +        }
> +        /* Note: Discarding should never fail at this point. */
> +        ret = ram_block_discard_range(vmem->memdev->mr.ram_block, offset,
> +                                      vmem->block_size);
> +        if (ret) {
> +            return -EINVAL;
> +        }
> +        bit = find_next_zero_bit(vmem->bitmap, vmem->bitmap_size, bit + 1);
> +    }
> +    return 0;
> +}
> +
> +static int virtio_mem_post_load(void *opaque, int version_id)
> +{
> +    VirtIOMEM *vmem = VIRTIO_MEM(opaque);
> +
> +    if (vmem->migration_block_size != vmem->block_size) {
> +        error_report("'%s' doesn't match", VIRTIO_MEM_BLOCK_SIZE_PROP);
> +        return -EINVAL;
> +    }
> +    if (vmem->migration_addr != vmem->addr) {
> +        error_report("'%s' doesn't match", VIRTIO_MEM_ADDR_PROP);
> +        return -EINVAL;
> +    }
> +    return virtio_mem_restore_unplugged(vmem);
> +}
> +
> +static const VMStateDescription vmstate_virtio_mem_device = {
> +    .name = "virtio-mem-device",
> +    .minimum_version_id = 1,
> +    .version_id = 1,
> +    .pre_save = virtio_mem_pre_save,
> +    .post_load = virtio_mem_post_load,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_UINT64(usable_region_size, VirtIOMEM),
> +        VMSTATE_UINT64(size, VirtIOMEM),
> +        VMSTATE_UINT64(requested_size, VirtIOMEM),
> +        VMSTATE_UINT64(migration_addr, VirtIOMEM),
> +        VMSTATE_UINT32(migration_block_size, VirtIOMEM),
> +        VMSTATE_BITMAP(bitmap, VirtIOMEM, 0, bitmap_size),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
> +static const VMStateDescription vmstate_virtio_mem = {
> +    .name = "virtio-mem",
> +    .minimum_version_id = 1,
> +    .version_id = 1,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_VIRTIO_DEVICE,
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
> +static void virtio_mem_fill_device_info(const VirtIOMEM *vmem,
> +                                        VirtioMEMDeviceInfo *vi)
> +{
> +    vi->memaddr = vmem->addr;
> +    vi->node = vmem->node;
> +    vi->requested_size = vmem->requested_size;
> +    vi->size = vmem->size;
> +    vi->max_size = memory_region_size(&vmem->memdev->mr);
> +    vi->block_size = vmem->block_size;
> +    vi->memdev = object_get_canonical_path(OBJECT(vmem->memdev));
> +}
> +
> +static MemoryRegion *virtio_mem_get_memory_region(VirtIOMEM *vmem, Error **errp)
> +{
> +    if (!vmem->memdev) {
> +        error_setg(errp, "'%s' property must be set", VIRTIO_MEM_MEMDEV_PROP);
> +        return NULL;
> +    }
> +
> +    return &vmem->memdev->mr;
> +}
> +
> +static void virtio_mem_get_size(Object *obj, Visitor *v, const char *name,
> +                                void *opaque, Error **errp)
> +{
> +    const VirtIOMEM *vmem = VIRTIO_MEM(obj);
> +    uint64_t value = vmem->size;
> +
> +    visit_type_size(v, name, &value, errp);
> +}
> +
> +static void virtio_mem_get_requested_size(Object *obj, Visitor *v,
> +                                          const char *name, void *opaque,
> +                                          Error **errp)
> +{
> +    const VirtIOMEM *vmem = VIRTIO_MEM(obj);
> +    uint64_t value = vmem->requested_size;
> +
> +    visit_type_size(v, name, &value, errp);
> +}
> +
> +static void virtio_mem_set_requested_size(Object *obj, Visitor *v,
> +                                          const char *name, void *opaque,
> +                                          Error **errp)
> +{
> +    VirtIOMEM *vmem = VIRTIO_MEM(obj);
> +    Error *err = NULL;
> +    uint64_t value;
> +
> +    visit_type_size(v, name, &value, &err);
> +    if (err) {
> +        error_propagate(errp, err);
> +        return;
> +    }
> +
> +    /*
> +     * The block size and memory backend are not fixed until the device was
> +     * realized. realize() will verify these properties then.
> +     */
> +    if (DEVICE(obj)->realized) {
> +        if (!QEMU_IS_ALIGNED(value, vmem->block_size)) {
> +            error_setg(errp, "'%s' has to be multiples of '%s' (0x%" PRIx32
> +                       ")", name, VIRTIO_MEM_BLOCK_SIZE_PROP,
> +                       vmem->block_size);
> +            return;
> +        } else if (value > memory_region_size(&vmem->memdev->mr)) {
> +            error_setg(errp, "'%s' cannot exceed the memory backend size"
> +                       "(0x%" PRIx64 ")", name,
> +                       memory_region_size(&vmem->memdev->mr));
> +            return;
> +        }
> +
> +        if (value != vmem->requested_size) {
> +            virtio_mem_resize_usable_region(vmem, value, false);
> +            vmem->requested_size = value;
> +        }
> +        /*
> +         * Trigger a config update so the guest gets notified. We trigger
> +         * even if the size didn't change (especially helpful for debugging).
> +         */
> +        virtio_notify_config(VIRTIO_DEVICE(vmem));
> +    } else {
> +        vmem->requested_size = value;
> +    }
> +}
> +
> +static void virtio_mem_get_block_size(Object *obj, Visitor *v, const char *name,
> +                                      void *opaque, Error **errp)
> +{
> +    const VirtIOMEM *vmem = VIRTIO_MEM(obj);
> +    uint64_t value = vmem->block_size;
> +
> +    visit_type_size(v, name, &value, errp);
> +}
> +
> +static void virtio_mem_set_block_size(Object *obj, Visitor *v, const char *name,
> +                                      void *opaque, Error **errp)
> +{
> +    VirtIOMEM *vmem = VIRTIO_MEM(obj);
> +    Error *err = NULL;
> +    uint64_t value;
> +
> +    if (DEVICE(obj)->realized) {
> +        error_setg(errp, "'%s' cannot be changed", name);
> +        return;
> +    }
> +
> +    visit_type_size(v, name, &value, &err);
> +    if (err) {
> +        error_propagate(errp, err);
> +        return;
> +    }
> +
> +    if (value > UINT32_MAX) {
> +        error_setg(errp, "'%s' has to be smaller than 0x%" PRIx32, name,
> +                   UINT32_MAX);
> +        return;
> +    } else if (value < VIRTIO_MEM_MIN_BLOCK_SIZE) {
> +        error_setg(errp, "'%s' has to be at least 0x%" PRIx32, name,
> +                   VIRTIO_MEM_MIN_BLOCK_SIZE);
> +        return;
> +    } else if (!is_power_of_2(value)) {
> +        error_setg(errp, "'%s' has to be a power of two", name);
> +        return;
> +    }
> +    vmem->block_size = value;
> +}
> +
> +static void virtio_mem_instance_init(Object *obj)
> +{
> +    VirtIOMEM *vmem = VIRTIO_MEM(obj);
> +
> +    vmem->block_size = VIRTIO_MEM_MIN_BLOCK_SIZE;
> +
> +    object_property_add(obj, VIRTIO_MEM_SIZE_PROP, "size", virtio_mem_get_size,
> +                        NULL, NULL, NULL, &error_abort);
> +    object_property_add(obj, VIRTIO_MEM_REQUESTED_SIZE_PROP, "size",
> +                        virtio_mem_get_requested_size,
> +                        virtio_mem_set_requested_size, NULL, NULL,
> +                        &error_abort);
> +    object_property_add(obj, VIRTIO_MEM_BLOCK_SIZE_PROP, "size",
> +                        virtio_mem_get_block_size, virtio_mem_set_block_size,
> +                        NULL, NULL, &error_abort);
> +}
> +
> +static Property virtio_mem_properties[] = {
> +    DEFINE_PROP_UINT64(VIRTIO_MEM_ADDR_PROP, VirtIOMEM, addr, 0),
> +    DEFINE_PROP_UINT32(VIRTIO_MEM_NODE_PROP, VirtIOMEM, node, 0),
> +    DEFINE_PROP_LINK(VIRTIO_MEM_MEMDEV_PROP, VirtIOMEM, memdev,
> +                     TYPE_MEMORY_BACKEND, HostMemoryBackend *),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void virtio_mem_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    VirtioDeviceClass *vdc = VIRTIO_DEVICE_CLASS(klass);
> +    VirtIOMEMClass *vmc = VIRTIO_MEM_CLASS(klass);
> +
> +    device_class_set_props(dc, virtio_mem_properties);
> +    dc->vmsd = &vmstate_virtio_mem;
> +
> +    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
> +    vdc->realize = virtio_mem_device_realize;
> +    vdc->unrealize = virtio_mem_device_unrealize;
> +    vdc->get_config = virtio_mem_get_config;
> +    vdc->get_features = virtio_mem_get_features;
> +    vdc->vmsd = &vmstate_virtio_mem_device;
> +
> +    vmc->fill_device_info = virtio_mem_fill_device_info;
> +    vmc->get_memory_region = virtio_mem_get_memory_region;
> +}
> +
> +static const TypeInfo virtio_mem_info = {
> +    .name = TYPE_VIRTIO_MEM,
> +    .parent = TYPE_VIRTIO_DEVICE,
> +    .instance_size = sizeof(VirtIOMEM),
> +    .instance_init = virtio_mem_instance_init,
> +    .class_init = virtio_mem_class_init,
> +    .class_size = sizeof(VirtIOMEMClass),
> +};
> +
> +static void virtio_register_types(void)
> +{
> +    type_register_static(&virtio_mem_info);
> +}
> +
> +type_init(virtio_register_types)
> diff --git a/include/hw/virtio/virtio-mem.h b/include/hw/virtio/virtio-mem.h
> new file mode 100644
> index 0000000000..27158cb611
> --- /dev/null
> +++ b/include/hw/virtio/virtio-mem.h
> @@ -0,0 +1,80 @@
> +/*
> + * Virtio MEM device
> + *
> + * Copyright (C) 2020 Red Hat, Inc.
> + *
> + * Authors:
> + *  David Hildenbrand <david@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#ifndef HW_VIRTIO_MEM_H
> +#define HW_VIRTIO_MEM_H
> +
> +#include "standard-headers/linux/virtio_mem.h"
> +#include "hw/virtio/virtio.h"
> +#include "qapi/qapi-types-misc.h"
> +#include "sysemu/hostmem.h"
> +
> +#define TYPE_VIRTIO_MEM "virtio-mem"
> +
> +#define VIRTIO_MEM(obj) \
> +        OBJECT_CHECK(VirtIOMEM, (obj), TYPE_VIRTIO_MEM)
> +#define VIRTIO_MEM_CLASS(oc) \
> +        OBJECT_CLASS_CHECK(VirtIOMEMClass, (oc), TYPE_VIRTIO_MEM)
> +#define VIRTIO_MEM_GET_CLASS(obj) \
> +        OBJECT_GET_CLASS(VirtIOMEMClass, (obj), TYPE_VIRTIO_MEM)
> +
> +#define VIRTIO_MEM_MEMDEV_PROP "memdev"
> +#define VIRTIO_MEM_NODE_PROP "node"
> +#define VIRTIO_MEM_SIZE_PROP "size"
> +#define VIRTIO_MEM_REQUESTED_SIZE_PROP "requested-size"
> +#define VIRTIO_MEM_BLOCK_SIZE_PROP "block-size"
> +#define VIRTIO_MEM_ADDR_PROP "memaddr"
> +
> +typedef struct VirtIOMEM {
> +    VirtIODevice parent_obj;
> +
> +    /* guest -> host request queue */
> +    VirtQueue *vq;
> +
> +    /* bitmap used to track unplugged memory */
> +    int32_t bitmap_size;
> +    unsigned long *bitmap;
> +
> +    /* assigned memory backend and memory region */
> +    HostMemoryBackend *memdev;
> +
> +    /* NUMA node */
> +    uint32_t node;
> +
> +    /* assigned address of the region in guest physical memory */
> +    uint64_t addr;
> +    uint64_t migration_addr;
> +
> +    /* usable region size (<= region_size) */
> +    uint64_t usable_region_size;
> +
> +    /* actual size (how much the guest plugged) */
> +    uint64_t size;
> +
> +    /* requested size */
> +    uint64_t requested_size;
> +
> +    /* block size and alignment */
> +    uint32_t block_size;
> +    uint32_t migration_block_size;
> +} VirtIOMEM;
> +
> +typedef struct VirtIOMEMClass {
> +    /* private */
> +    VirtioDeviceClass parent;
> +
> +    /* public */
> +    void (*fill_device_info)(const VirtIOMEM *vmem, VirtioMEMDeviceInfo *vi);
> +    MemoryRegion *(*get_memory_region)(VirtIOMEM *vmem, Error **errp);
> +} VirtIOMEMClass;
> +
> +#endif
> diff --git a/qapi/misc.json b/qapi/misc.json
> index 99b90ac80b..feaeacec22 100644
> --- a/qapi/misc.json
> +++ b/qapi/misc.json
> @@ -1354,19 +1354,56 @@
>            }
>  }
>  
> +##
> +# @VirtioMEMDeviceInfo:
> +#
> +# VirtioMEMDevice state information
> +#
> +# @id: device's ID
> +#
> +# @memaddr: physical address in memory, where device is mapped
> +#
> +# @requested-size: the user requested size of the device
> +#
> +# @size: the (current) size of memory that the device provides
> +#
> +# @max-size: the maximum size of memory that the device can provide
> +#
> +# @block-size: the block size of memory that the device provides
> +#
> +# @node: NUMA node number where device is assigned to
> +#
> +# @memdev: memory backend linked with the region
> +#
> +# Since: 5.1
> +##
> +{ 'struct': 'VirtioMEMDeviceInfo',
> +  'data': { '*id': 'str',
> +            'memaddr': 'size',
> +            'requested-size': 'size',
> +            'size': 'size',
> +            'max-size': 'size',
> +            'block-size': 'size',
> +            'node': 'int',
> +            'memdev': 'str'
> +          }
> +}
> +
>  ##
>  # @MemoryDeviceInfo:
>  #
>  # Union containing information about a memory device
>  #
>  # nvdimm is included since 2.12. virtio-pmem is included since 4.1.
> +# virtio-mem is included since 5.1.
>  #
>  # Since: 2.1
>  ##
>  { 'union': 'MemoryDeviceInfo',
>    'data': { 'dimm': 'PCDIMMDeviceInfo',
>              'nvdimm': 'PCDIMMDeviceInfo',
> -            'virtio-pmem': 'VirtioPMEMDeviceInfo'
> +            'virtio-pmem': 'VirtioPMEMDeviceInfo',
> +            'virtio-mem': 'VirtioMEMDeviceInfo'
>            }
>  }
>  
> -- 
> 2.25.3
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [PATCH v1 10/17] virtio-mem: Paravirtualized memory hot(un)plug
@ 2020-05-15 15:37     ` Dr. David Alan Gilbert
From: Dr. David Alan Gilbert @ 2020-05-15 15:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eduardo Habkost, kvm, Michael S . Tsirkin, qemu-devel,
	Markus Armbruster, qemu-s390x, Igor Mammedov, Paolo Bonzini,
	Richard Henderson

I'm not sure if it's possible to split this up; it's a bit big.
It could also do with a pile of trace_ entries to figure out what it's
doing.


* David Hildenbrand (david@redhat.com) wrote:
> This is the very basic/initial version of virtio-mem. An introduction to
> virtio-mem can be found in the Linux kernel driver [1]. While it can be
> used in the current state for hotplug of a smaller amount of memory, it
> will heavily benefit from resizeable memory regions in the future.
> 
> Each virtio-mem device manages a memory region (provided via a memory
> backend). Once the hypervisor requests a size ("requested-size"), the
> guest can try to plug/unplug blocks of memory within that region, in order
> to reach the requested size. Initially, and after a reboot, all memory is
> unplugged (except in special cases - reboot during postcopy).
> 
> The guest may only try to plug/unplug blocks of memory within the usable
> region size. The usable region size is a little bigger than the
> requested size, to give the device driver some flexibility. The usable
> region size will only grow, except on reboots or when all memory is
> requested to get unplugged. The guest can never plug more memory than
> requested. Unplugged memory will get zapped/discarded, similar to a
> balloon device.
> 
> The block size is variable, however, it is always chosen in a way such that
> THP splits are avoided (e.g., 2MB). The state of each block
> (plugged/unplugged) is tracked in a bitmap.
> 
> As virtio-mem devices (e.g., virtio-mem-pci) will be memory devices, we now
> expose "VirtioMEMDeviceInfo" via "query-memory-devices".
> 
> --------------------------------------------------------------------------
> 
> There are two important follow-up items that are in the works:
> 1. Resizeable memory regions: Use resizeable allocations/RAM blocks to
>    grow/shrink along with the usable region size. This avoids creating
>    initially very big VMAs, RAM blocks, and KVM slots.
> 2. Protection of unplugged memory: Make sure the guest cannot actually
>    make use of unplugged memory.
> 
> Other follow-up items that are in the works:
> 1. Exclude unplugged memory during migration (via precopy notifier).
> 2. Handle remapping of memory.
> 3. Support for other architectures.
> 
> --------------------------------------------------------------------------
> 
> Example usage (virtio-mem-pci is introduced in follow-up patches):
> 
> Start QEMU with two virtio-mem devices (one per NUMA node):
>  $ qemu-system-x86_64 -m 4G,maxmem=20G \
>   -smp sockets=2,cores=2 \
>   -numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3 \
>   [...]
>   -object memory-backend-ram,id=mem0,size=8G \
>   -device virtio-mem-pci,id=vm0,memdev=mem0,node=0,requested-size=0M \
>   -object memory-backend-ram,id=mem1,size=8G \
>   -device virtio-mem-pci,id=vm1,memdev=mem1,node=1,requested-size=1G
> 
> Query the configuration:
>  (qemu) info memory-devices
>  Memory device [virtio-mem]: "vm0"
>    memaddr: 0x140000000
>    node: 0
>    requested-size: 0
>    size: 0
>    max-size: 8589934592
>    block-size: 2097152
>    memdev: /objects/mem0
>  Memory device [virtio-mem]: "vm1"
>    memaddr: 0x340000000
>    node: 1
>    requested-size: 1073741824
>    size: 1073741824
>    max-size: 8589934592
>    block-size: 2097152
>    memdev: /objects/mem1
> 
> Add some memory to node 0:
>  (qemu) qom-set vm0 requested-size 500M
> 
> Remove some memory from node 1:
>  (qemu) qom-set vm1 requested-size 200M
> 
> Query the configuration again:
>  (qemu) info memory-devices
>  Memory device [virtio-mem]: "vm0"
>    memaddr: 0x140000000
>    node: 0
>    requested-size: 524288000
>    size: 524288000
>    max-size: 8589934592
>    block-size: 2097152
>    memdev: /objects/mem0
>  Memory device [virtio-mem]: "vm1"
>    memaddr: 0x340000000
>    node: 1
>    requested-size: 209715200
>    size: 209715200
>    max-size: 8589934592
>    block-size: 2097152
>    memdev: /objects/mem1
> 
> [1] https://lkml.kernel.org/r/20200311171422.10484-1-david@redhat.com
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Markus Armbruster <armbru@redhat.com>
> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  hw/virtio/Kconfig              |  11 +
>  hw/virtio/Makefile.objs        |   1 +
>  hw/virtio/virtio-mem.c         | 762 +++++++++++++++++++++++++++++++++
>  include/hw/virtio/virtio-mem.h |  80 ++++
>  qapi/misc.json                 |  39 +-
>  5 files changed, 892 insertions(+), 1 deletion(-)
>  create mode 100644 hw/virtio/virtio-mem.c
>  create mode 100644 include/hw/virtio/virtio-mem.h
> 
> diff --git a/hw/virtio/Kconfig b/hw/virtio/Kconfig
> index 83122424fa..0eda25c4e1 100644
> --- a/hw/virtio/Kconfig
> +++ b/hw/virtio/Kconfig
> @@ -47,3 +47,14 @@ config VIRTIO_PMEM
>      depends on VIRTIO
>      depends on VIRTIO_PMEM_SUPPORTED
>      select MEM_DEVICE
> +
> +config VIRTIO_MEM_SUPPORTED
> +    bool
> +
> +config VIRTIO_MEM
> +    bool
> +    default y
> +    depends on VIRTIO
> +    depends on LINUX
> +    depends on VIRTIO_MEM_SUPPORTED
> +    select MEM_DEVICE
> diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs
> index 4e4d39a0a4..7df70e977e 100644
> --- a/hw/virtio/Makefile.objs
> +++ b/hw/virtio/Makefile.objs
> @@ -18,6 +18,7 @@ common-obj-$(call land,$(CONFIG_VIRTIO_PMEM),$(CONFIG_VIRTIO_PCI)) += virtio-pme
>  obj-$(call land,$(CONFIG_VHOST_USER_FS),$(CONFIG_VIRTIO_PCI)) += vhost-user-fs-pci.o
>  obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
>  obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock.o
> +obj-$(CONFIG_VIRTIO_MEM) += virtio-mem.o
>  
>  ifeq ($(CONFIG_VIRTIO_PCI),y)
>  obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock-pci.o
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> new file mode 100644
> index 0000000000..e25b2c74f2
> --- /dev/null
> +++ b/hw/virtio/virtio-mem.c
> @@ -0,0 +1,762 @@
> +/*
> + * Virtio MEM device
> + *
> + * Copyright (C) 2020 Red Hat, Inc.
> + *
> + * Authors:
> + *  David Hildenbrand <david@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu-common.h"
> +#include "qemu/iov.h"
> +#include "qemu/cutils.h"
> +#include "qemu/error-report.h"
> +#include "qemu/units.h"
> +#include "sysemu/numa.h"
> +#include "sysemu/sysemu.h"
> +#include "sysemu/reset.h"
> +#include "hw/virtio/virtio.h"
> +#include "hw/virtio/virtio-bus.h"
> +#include "hw/virtio/virtio-access.h"
> +#include "hw/virtio/virtio-mem.h"
> +#include "qapi/error.h"
> +#include "qapi/visitor.h"
> +#include "exec/ram_addr.h"
> +#include "migration/misc.h"
> +#include "migration/postcopy-ram.h"
> +#include "hw/boards.h"
> +#include "hw/qdev-properties.h"
> +#include "config-devices.h"
> +
> +/*
> + * Use QEMU_VMALLOC_ALIGN, so no THP will have to be split when unplugging
> + * memory (e.g., 2MB on x86_64).
> + */
> +#define VIRTIO_MEM_MIN_BLOCK_SIZE QEMU_VMALLOC_ALIGN
> +/*
> + * Size the usable region bigger than the requested size if possible. Esp.
> + * Linux guests will only add (aligned) memory blocks in case they fully
> + * fit into the usable region, but plug+online only a subset of the pages.
> + * The memory block size corresponds mostly to the section size.
> + *
> + * This allows e.g., to add 20MB with a section size of 128MB on x86_64, and
> + * a section size of 1GB on arm64 (as long as the start address is properly
> + * aligned, similar to ordinary DIMMs).
> + *
> + * We can change this at any time and maybe even make it configurable if
> + * necessary (as the section size can change). But it's more likely that the
> + * section size will rather get smaller and not bigger over time.
> + */
> +#if defined(__x86_64__)
> +#define VIRTIO_MEM_USABLE_EXTENT (2 * (128 * MiB))
> +#else
> +#error VIRTIO_MEM_USABLE_EXTENT not defined
> +#endif
> +
> +static bool virtio_mem_discard_inhibited(void)
> +{
> +    PostcopyState ps = postcopy_state_get();
> +
> +    /* Postcopy cannot deal with concurrent discards (yet), so it's special. */
> +    return ps >= POSTCOPY_INCOMING_DISCARD && ps < POSTCOPY_INCOMING_END;
> +}
> +
> +static bool virtio_mem_test_bitmap(VirtIOMEM *vmem, uint64_t start_gpa,
> +                                   uint64_t size, bool plug)
> +{
> +    uint64_t bit = (start_gpa - vmem->addr) / vmem->block_size;
> +
> +    g_assert(QEMU_IS_ALIGNED(start_gpa, vmem->block_size));
> +    g_assert(QEMU_IS_ALIGNED(size, vmem->block_size));
> +    g_assert(vmem->bitmap);
> +
> +    while (size) {
> +        g_assert(bit < vmem->bitmap_size);
> +
> +        if (plug && !test_bit(bit, vmem->bitmap)) {
> +            return false;
> +        } else if (!plug && test_bit(bit, vmem->bitmap)) {
> +            return false;
> +        }
> +        size -= vmem->block_size;
> +        bit++;
> +    }
> +    return true;
> +}
> +
> +static void virtio_mem_set_bitmap(VirtIOMEM *vmem, uint64_t start_gpa,
> +                                  uint64_t size, bool plug)
> +{
> +    const uint64_t bit = (start_gpa - vmem->addr) / vmem->block_size;
> +    const uint64_t nbits = size / vmem->block_size;
> +
> +    g_assert(QEMU_IS_ALIGNED(start_gpa, vmem->block_size));
> +    g_assert(QEMU_IS_ALIGNED(size, vmem->block_size));
> +    g_assert(vmem->bitmap);

This bit/nbits/alignment checking could be split out and shared between
these two functions.
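
Something along these lines, perhaps. This is only a standalone sketch:
struct vmem_model, struct block_range and vmem_block_range() are made-up
names standing in for VirtIOMEM and a new helper, and plain assert() stands
in for g_assert().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the VirtIOMEM fields the helper needs. */
struct vmem_model {
    uint64_t addr;        /* start of the region in guest physical memory */
    uint64_t block_size;  /* size of one block in bytes */
    uint64_t nr_blocks;   /* number of bits in the bitmap */
};

/* First bit and number of bits covering a guest physical range. */
struct block_range {
    uint64_t first_bit;
    uint64_t nbits;
};

/*
 * Check the alignment once and convert [start_gpa, start_gpa + size) into
 * bitmap coordinates, so the test and the set function share the checks.
 */
static struct block_range vmem_block_range(const struct vmem_model *vmem,
                                           uint64_t start_gpa, uint64_t size)
{
    struct block_range r;

    assert(start_gpa >= vmem->addr);
    assert(start_gpa % vmem->block_size == 0);
    assert(size % vmem->block_size == 0);

    r.first_bit = (start_gpa - vmem->addr) / vmem->block_size;
    r.nbits = size / vmem->block_size;

    /* The whole range has to fit into the bitmap. */
    assert(r.first_bit + r.nbits <= vmem->nr_blocks);
    return r;
}
```

Both virtio_mem_test_bitmap() and virtio_mem_set_bitmap() could then start
from the returned first bit and bit count.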

> +    if (plug) {
> +        bitmap_set(vmem->bitmap, bit, nbits);
> +    } else {
> +        bitmap_clear(vmem->bitmap, bit, nbits);
> +    }
> +}
> +
> +static void virtio_mem_send_response(VirtIOMEM *vmem, VirtQueueElement *elem,
> +                                     struct virtio_mem_resp *resp)
> +{
> +    VirtIODevice *vdev = VIRTIO_DEVICE(vmem);
> +    VirtQueue *vq = vmem->vq;
> +
> +    iov_from_buf(elem->in_sg, elem->in_num, 0, resp, sizeof(*resp));
> +
> +    virtqueue_push(vq, elem, sizeof(*resp));
> +    virtio_notify(vdev, vq);
> +}
> +
> +static void virtio_mem_send_response_simple(VirtIOMEM *vmem,
> +                                            VirtQueueElement *elem,
> +                                            uint16_t type)
> +{
> +    VirtIODevice *vdev = VIRTIO_DEVICE(vmem);
> +    struct virtio_mem_resp resp = {};
> +
> +    virtio_stw_p(vdev, &resp.type, type);
> +    virtio_mem_send_response(vmem, elem, &resp);
> +}
> +
> +static void virtio_mem_bad_request(VirtIOMEM *vmem, const char *msg)
> +{
> +    virtio_error(VIRTIO_DEVICE(vmem), "virtio-mem protocol violation: %s", msg);
> +}
> +
> +static bool virtio_mem_valid_range(VirtIOMEM *vmem, uint64_t gpa, uint64_t size)
> +{
> +    if (!QEMU_IS_ALIGNED(gpa, vmem->block_size)) {
> +        return false;
> +    }
> +    if (gpa + size < gpa || size == 0) {
> +        return false;
> +    }
> +    if (gpa < vmem->addr || gpa >= vmem->addr + vmem->usable_region_size) {
> +        return false;
> +    }
> +    if (gpa + size > vmem->addr + vmem->usable_region_size) {
> +        return false;
> +    }
> +    return true;
> +}
> +
> +static int virtio_mem_set_block_state(VirtIOMEM *vmem, uint64_t start_gpa,
> +                                      uint64_t size, bool plug)
> +{
> +    const uint64_t offset = start_gpa - vmem->addr;
> +    int ret;
> +
> +    if (!plug) {
> +        if (virtio_mem_discard_inhibited()) {
> +            return -EBUSY;
> +        }
> +        /* Note: Discarding should never fail at this point. */
> +        ret = ram_block_discard_range(vmem->memdev->mr.ram_block, offset, size);
> +        if (ret) {

error_report ?
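
Roughly like this, for instance. A standalone sketch only:
format_discard_error() is a made-up helper and the message wording is an
assumption; in the patch the resulting text would go through error_report()
before returning -EBUSY.

```c
#include <errno.h>      /* errno values the callers pass in (negated) */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the diagnostic for a failed discard. 'ret' is the negative errno
 * returned by the discard call; offset and size describe the range within
 * the RAM block. Returns what snprintf() returns.
 */
static int format_discard_error(char *buf, size_t len, uint64_t offset,
                                uint64_t size, int ret)
{
    return snprintf(buf, len,
                    "virtio-mem: discarding 0x%" PRIx64 "+0x%" PRIx64
                    " failed: %s", offset, size, strerror(-ret));
}
```

Having the range and the errno in the log makes a "should never fail" path
much easier to chase when it does fail.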

> +            return -EBUSY;
> +        }
> +    }
> +    virtio_mem_set_bitmap(vmem, start_gpa, size, plug);
> +    return 0;
> +}
> +
> +static int virtio_mem_state_change_request(VirtIOMEM *vmem, uint64_t gpa,
> +                                           uint16_t nb_blocks, bool plug)
> +{
> +    const uint64_t size = nb_blocks * vmem->block_size;
> +    int ret;
> +
> +    if (!virtio_mem_valid_range(vmem, gpa, size)) {
> +        return VIRTIO_MEM_RESP_ERROR;
> +    }
> +
> +    if (plug && (vmem->size + size > vmem->requested_size)) {
> +        return VIRTIO_MEM_RESP_NACK;
> +    }
> +
> +    /* test if really all blocks are in the opposite state */
> +    if (!virtio_mem_test_bitmap(vmem, gpa, size, !plug)) {
> +        return VIRTIO_MEM_RESP_ERROR;
> +    }
> +
> +    ret = virtio_mem_set_block_state(vmem, gpa, size, plug);
> +    if (ret) {
> +        return VIRTIO_MEM_RESP_BUSY;
> +    }
> +    if (plug) {
> +        vmem->size += size;
> +    } else {
> +        vmem->size -= size;
> +    }
> +    return VIRTIO_MEM_RESP_ACK;
> +}
> +
> +static void virtio_mem_plug_request(VirtIOMEM *vmem, VirtQueueElement *elem,
> +                                    struct virtio_mem_req *req)
> +{
> +    const uint64_t gpa = le64_to_cpu(req->u.plug.addr);
> +    const uint16_t nb_blocks = le16_to_cpu(req->u.plug.nb_blocks);
> +    uint16_t type;
> +
> +    type = virtio_mem_state_change_request(vmem, gpa, nb_blocks, true);
> +    virtio_mem_send_response_simple(vmem, elem, type);
> +}
> +
> +static void virtio_mem_unplug_request(VirtIOMEM *vmem, VirtQueueElement *elem,
> +                                      struct virtio_mem_req *req)
> +{
> +    const uint64_t gpa = le64_to_cpu(req->u.unplug.addr);
> +    const uint16_t nb_blocks = le16_to_cpu(req->u.unplug.nb_blocks);
> +    uint16_t type;
> +
> +    type = virtio_mem_state_change_request(vmem, gpa, nb_blocks, false);
> +    virtio_mem_send_response_simple(vmem, elem, type);
> +}
> +
> +static void virtio_mem_resize_usable_region(VirtIOMEM *vmem,
> +                                            uint64_t requested_size,
> +                                            bool can_shrink)
> +{
> +    uint64_t newsize = MIN(memory_region_size(&vmem->memdev->mr),
> +                           requested_size + VIRTIO_MEM_USABLE_EXTENT);
> +
> +    /* We must only grow while the guest is running. */
> +    if (newsize < vmem->usable_region_size && !can_shrink) {
> +        return;
> +    }
> +
> +    vmem->usable_region_size = newsize;
> +}
> +
> +static int virtio_mem_unplug_all(VirtIOMEM *vmem)
> +{
> +    RAMBlock *rb = vmem->memdev->mr.ram_block;
> +    int ret;
> +
> +    if (virtio_mem_discard_inhibited()) {
> +        return -EBUSY;
> +    }
> +
> +    ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb));
> +    if (ret) {
> +        /* Note: Discarding should never fail at this point. */

error_report?

> +        return -EBUSY;
> +    }
> +    bitmap_clear(vmem->bitmap, 0, vmem->bitmap_size);
> +    vmem->size = 0;
> +
> +    virtio_mem_resize_usable_region(vmem, vmem->requested_size, true);
> +    return 0;
> +}
> +
> +static void virtio_mem_unplug_all_request(VirtIOMEM *vmem,
> +                                          VirtQueueElement *elem)
> +{
> +
> +    if (virtio_mem_unplug_all(vmem)) {
> +        virtio_mem_send_response_simple(vmem, elem, VIRTIO_MEM_RESP_BUSY);
> +    } else {
> +        virtio_mem_send_response_simple(vmem, elem, VIRTIO_MEM_RESP_ACK);
> +    }
> +}
> +
> +static void virtio_mem_state_request(VirtIOMEM *vmem, VirtQueueElement *elem,
> +                                     struct virtio_mem_req *req)
> +{
> +    const uint64_t gpa = le64_to_cpu(req->u.state.addr);
> +    const uint16_t nb_blocks = le16_to_cpu(req->u.state.nb_blocks);
> +    const uint64_t size = nb_blocks * vmem->block_size;
> +    VirtIODevice *vdev = VIRTIO_DEVICE(vmem);
> +    struct virtio_mem_resp resp = {};
> +
> +    if (!virtio_mem_valid_range(vmem, gpa, size)) {
> +        virtio_mem_send_response_simple(vmem, elem, VIRTIO_MEM_RESP_ERROR);
> +        return;
> +    }
> +
> +    virtio_stw_p(vdev, &resp.type, VIRTIO_MEM_RESP_ACK);
> +    if (virtio_mem_test_bitmap(vmem, gpa, size, true)) {
> +        virtio_stw_p(vdev, &resp.u.state.state, VIRTIO_MEM_STATE_PLUGGED);
> +    } else if (virtio_mem_test_bitmap(vmem, gpa, size, false)) {
> +        virtio_stw_p(vdev, &resp.u.state.state, VIRTIO_MEM_STATE_UNPLUGGED);
> +    } else {
> +        virtio_stw_p(vdev, &resp.u.state.state, VIRTIO_MEM_STATE_MIXED);
> +    }
> +    virtio_mem_send_response(vmem, elem, &resp);
> +}
> +
> +static void virtio_mem_handle_request(VirtIODevice *vdev, VirtQueue *vq)
> +{
> +    const int len = sizeof(struct virtio_mem_req);
> +    VirtIOMEM *vmem = VIRTIO_MEM(vdev);
> +    VirtQueueElement *elem;
> +    struct virtio_mem_req req;
> +    uint64_t type;
> +
> +    while (true) {
> +        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
> +        if (!elem) {
> +            return;
> +        }
> +
> +        if (iov_to_buf(elem->out_sg, elem->out_num, 0, &req, len) < len) {
> +            virtio_mem_bad_request(vmem, "invalid request size");

Print the size.
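
virtio_mem_bad_request() only takes a fixed string, so the value would have
to be formatted first. A standalone sketch, assuming a printf-style helper:
format_bad_request() is a made-up name, and in the patch the buffer would be
handed on to virtio_error().

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * printf-style formatter for protocol violations, so callers can include
 * the offending value (request size, request type, ...).
 * Returns a negative value on error and a non-negative value otherwise.
 * The message is NUL-terminated and truncated if buf is too small
 * (len must be greater than 0).
 */
static int format_bad_request(char *buf, size_t len, const char *fmt, ...)
{
    va_list ap;
    int n, m;

    n = snprintf(buf, len, "virtio-mem protocol violation: ");
    if (n < 0 || (size_t)n >= len) {
        /* Error, or no room left after the prefix. */
        return n;
    }

    va_start(ap, fmt);
    m = vsnprintf(buf + n, len - (size_t)n, fmt, ap);
    va_end(ap);

    return m < 0 ? m : n + m;
}
```

The same helper would cover the unknown request type in the switch
statement.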

> +            g_free(elem);
> +            return;
> +        }
> +
> +        if (iov_size(elem->in_sg, elem->in_num) <
> +            sizeof(struct virtio_mem_resp)) {
> +            virtio_mem_bad_request(vmem, "not enough space for response");
> +            g_free(elem);
> +            return;
> +        }
> +
> +        type = le16_to_cpu(req.type);
> +        switch (type) {
> +        case VIRTIO_MEM_REQ_PLUG:
> +            virtio_mem_plug_request(vmem, elem, &req);
> +            break;
> +        case VIRTIO_MEM_REQ_UNPLUG:
> +            virtio_mem_unplug_request(vmem, elem, &req);
> +            break;
> +        case VIRTIO_MEM_REQ_UNPLUG_ALL:
> +            virtio_mem_unplug_all_request(vmem, elem);
> +            break;
> +        case VIRTIO_MEM_REQ_STATE:
> +            virtio_mem_state_request(vmem, elem, &req);
> +            break;
> +        default:
> +            virtio_mem_bad_request(vmem, "unknown request type");

Could include the type.


> +            g_free(elem);
> +            return;
> +        }
> +
> +        g_free(elem);
> +    }
> +}
> +
> +static void virtio_mem_get_config(VirtIODevice *vdev, uint8_t *config_data)
> +{
> +    VirtIOMEM *vmem = VIRTIO_MEM(vdev);
> +    struct virtio_mem_config *config = (void *) config_data;
> +
> +    config->block_size = cpu_to_le32(vmem->block_size);
> +    config->node_id = cpu_to_le16(vmem->node);
> +    config->requested_size = cpu_to_le64(vmem->requested_size);
> +    config->plugged_size = cpu_to_le64(vmem->size);
> +    config->addr = cpu_to_le64(vmem->addr);
> +    config->region_size = cpu_to_le64(memory_region_size(&vmem->memdev->mr));
> +    config->usable_region_size = cpu_to_le64(vmem->usable_region_size);
> +}
> +
> +static uint64_t virtio_mem_get_features(VirtIODevice *vdev, uint64_t features,
> +                                        Error **errp)
> +{
> +    MachineState *ms = MACHINE(qdev_get_machine());
> +
> +    if (ms->numa_state) {
> +#if defined(CONFIG_ACPI)
> +        virtio_add_feature(&features, VIRTIO_MEM_F_ACPI_PXM);
> +#endif
> +    }
> +    return features;
> +}
> +
> +static void virtio_mem_system_reset(void *opaque)
> +{
> +    VirtIOMEM *vmem = VIRTIO_MEM(opaque);
> +
> +    /*
> +     * During usual resets, we will unplug all memory and shrink the usable
> +     * region size. This is, however, not possible in all scenarios. Then,
> +     * the guest has to deal with this manually (VIRTIO_MEM_REQ_UNPLUG_ALL).
> +     */
> +    virtio_mem_unplug_all(vmem);
> +}
> +
> +static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
> +{
> +    MachineState *ms = MACHINE(qdev_get_machine());
> +    int nb_numa_nodes = ms->numa_state ? ms->numa_state->num_nodes : 0;
> +    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
> +    VirtIOMEM *vmem = VIRTIO_MEM(dev);
> +    uint64_t page_size;
> +    RAMBlock *rb;
> +    int ret;
> +
> +    if (!vmem->memdev) {
> +        error_setg(errp, "'%s' property must be set", VIRTIO_MEM_MEMDEV_PROP);
> +        return;
> +    } else if (host_memory_backend_is_mapped(vmem->memdev)) {
> +        char *path = object_get_canonical_path_component(OBJECT(vmem->memdev));
> +
> +        error_setg(errp, "can't use already busy memdev: %s", path);
> +        g_free(path);
> +        return;
> +    }
> +
> +    if ((nb_numa_nodes && vmem->node >= nb_numa_nodes) ||
> +        (!nb_numa_nodes && vmem->node)) {
> +        error_setg(errp, "Property '%s' has value '%" PRIu32
> +                   "', which exceeds the number of numa nodes: %d",
> +                   VIRTIO_MEM_NODE_PROP, vmem->node,
> +                   nb_numa_nodes ? nb_numa_nodes : 1);
> +        return;
> +    }
> +
> +    if (enable_mlock) {
> +        error_setg(errp, "not compatible with mlock yet");
> +        return;
> +    }
> +
> +    if (!memory_region_is_ram(&vmem->memdev->mr) ||
> +        memory_region_is_rom(&vmem->memdev->mr) ||
> +        !vmem->memdev->mr.ram_block) {
> +        error_setg(errp, "unsupported memdev");
> +        return;
> +    }
> +
> +    rb = vmem->memdev->mr.ram_block;
> +    page_size = qemu_ram_pagesize(rb);
> +
> +    if (vmem->block_size < page_size) {
> +        error_setg(errp, "'%s' has to be at least the page size (0x%"
> +                   PRIx64 ")", VIRTIO_MEM_BLOCK_SIZE_PROP, page_size);
> +        return;
> +    } else if (!QEMU_IS_ALIGNED(vmem->requested_size, vmem->block_size)) {
> +        error_setg(errp, "'%s' has to be multiples of '%s' (0x%" PRIx32
> +                   ")", VIRTIO_MEM_REQUESTED_SIZE_PROP,
> +                   VIRTIO_MEM_BLOCK_SIZE_PROP, vmem->block_size);
> +        return;
> +    } else if (!QEMU_IS_ALIGNED(memory_region_size(&vmem->memdev->mr),
> +                                vmem->block_size)) {
> +        error_setg(errp, "'%s' backend size has to be multiples of '%s' (0x%"
> +                   PRIx32 ")", VIRTIO_MEM_MEMDEV_PROP,
> +                   VIRTIO_MEM_BLOCK_SIZE_PROP, vmem->block_size);
> +        return;
> +    }
> +
> +    if (ram_block_discard_set_required(true)) {
> +        error_setg(errp, "Discarding RAM is marked broken.");
> +        return;
> +    }
> +
> +    ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb));
> +    if (ret) {
> +        /* Note: Discarding should never fail at this point. */
> +        error_setg_errno(errp, -ret, "Discarding RAM failed.");
> +        ram_block_discard_set_required(false);
> +        return;
> +    }
> +
> +    virtio_mem_resize_usable_region(vmem, vmem->requested_size, true);
> +
> +    vmem->bitmap_size = memory_region_size(&vmem->memdev->mr) /
> +                        vmem->block_size;
> +    vmem->bitmap = bitmap_new(vmem->bitmap_size);
> +
> +    virtio_init(vdev, TYPE_VIRTIO_MEM, VIRTIO_ID_MEM,
> +                sizeof(struct virtio_mem_config));
> +    vmem->vq = virtio_add_queue(vdev, 128, virtio_mem_handle_request);
> +
> +    host_memory_backend_set_mapped(vmem->memdev, true);
> +    vmstate_register_ram(&vmem->memdev->mr, DEVICE(vmem));
> +    qemu_register_reset(virtio_mem_system_reset, vmem);
> +    return;
> +}
> +
> +static void virtio_mem_device_unrealize(DeviceState *dev, Error **errp)
> +{
> +    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
> +    VirtIOMEM *vmem = VIRTIO_MEM(dev);
> +
> +    qemu_unregister_reset(virtio_mem_system_reset, vmem);
> +    vmstate_unregister_ram(&vmem->memdev->mr, DEVICE(vmem));
> +    host_memory_backend_set_mapped(vmem->memdev, false);
> +    virtio_del_queue(vdev, 0);
> +    virtio_cleanup(vdev);
> +    g_free(vmem->bitmap);
> +    ram_block_discard_set_required(false);
> +}
> +
> +static int virtio_mem_pre_save(void *opaque)
> +{
> +    VirtIOMEM *vmem = VIRTIO_MEM(opaque);
> +
> +    vmem->migration_addr = vmem->addr;
> +    vmem->migration_block_size = vmem->block_size;

You might look at VMSTATE_WITH_TMP; it could save you from having the dummy
fields.

> +    return 0;
> +}
> +
> +static int virtio_mem_restore_unplugged(VirtIOMEM *vmem)
> +{
> +    unsigned long bit;
> +    uint64_t offset;
> +    int ret;
> +
> +    /* TODO: Better postcopy handling - defer to postcopy end. */
> +    if (virtio_mem_discard_inhibited()) {
> +        return 0;
> +    }
> +
> +    bit = find_first_zero_bit(vmem->bitmap, vmem->bitmap_size);
> +    while (bit < vmem->bitmap_size) {
> +        offset = bit * vmem->block_size;
> +
> +        if (offset + vmem->block_size >
> +            memory_region_size(&vmem->memdev->mr)) {
> +            break;
> +        }
> +        /* Note: Discarding should never fail at this point. */
> +        ret = ram_block_discard_range(vmem->memdev->mr.ram_block, offset,
> +                                      vmem->block_size);
> +        if (ret) {
> +            return -EINVAL;
> +        }
> +        bit = find_next_zero_bit(vmem->bitmap, vmem->bitmap_size, bit + 1);
> +    }
> +    return 0;
> +}
> +
> +static int virtio_mem_post_load(void *opaque, int version_id)
> +{
> +    VirtIOMEM *vmem = VIRTIO_MEM(opaque);
> +
> +    if (vmem->migration_block_size != vmem->block_size) {
> +        error_report("'%s' doesn't match", VIRTIO_MEM_BLOCK_SIZE_PROP);
> +        return -EINVAL;
> +    }
> +    if (vmem->migration_addr != vmem->addr) {
> +        error_report("'%s' doesn't match", VIRTIO_MEM_ADDR_PROP);
> +        return -EINVAL;
> +    }
> +    return virtio_mem_restore_unplugged(vmem);
> +}
> +
> +static const VMStateDescription vmstate_virtio_mem_device = {
> +    .name = "virtio-mem-device",
> +    .minimum_version_id = 1,
> +    .version_id = 1,
> +    .pre_save = virtio_mem_pre_save,
> +    .post_load = virtio_mem_post_load,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_UINT64(usable_region_size, VirtIOMEM),
> +        VMSTATE_UINT64(size, VirtIOMEM),
> +        VMSTATE_UINT64(requested_size, VirtIOMEM),
> +        VMSTATE_UINT64(migration_addr, VirtIOMEM),
> +        VMSTATE_UINT32(migration_block_size, VirtIOMEM),
> +        VMSTATE_BITMAP(bitmap, VirtIOMEM, 0, bitmap_size),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
> +static const VMStateDescription vmstate_virtio_mem = {
> +    .name = "virtio-mem",
> +    .minimum_version_id = 1,
> +    .version_id = 1,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_VIRTIO_DEVICE,
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
> +static void virtio_mem_fill_device_info(const VirtIOMEM *vmem,
> +                                        VirtioMEMDeviceInfo *vi)
> +{
> +    vi->memaddr = vmem->addr;
> +    vi->node = vmem->node;
> +    vi->requested_size = vmem->requested_size;
> +    vi->size = vmem->size;
> +    vi->max_size = memory_region_size(&vmem->memdev->mr);
> +    vi->block_size = vmem->block_size;
> +    vi->memdev = object_get_canonical_path(OBJECT(vmem->memdev));
> +}
> +
> +static MemoryRegion *virtio_mem_get_memory_region(VirtIOMEM *vmem, Error **errp)
> +{
> +    if (!vmem->memdev) {
> +        error_setg(errp, "'%s' property must be set", VIRTIO_MEM_MEMDEV_PROP);
> +        return NULL;
> +    }
> +
> +    return &vmem->memdev->mr;
> +}
> +
> +static void virtio_mem_get_size(Object *obj, Visitor *v, const char *name,
> +                                void *opaque, Error **errp)
> +{
> +    const VirtIOMEM *vmem = VIRTIO_MEM(obj);
> +    uint64_t value = vmem->size;
> +
> +    visit_type_size(v, name, &value, errp);
> +}
> +
> +static void virtio_mem_get_requested_size(Object *obj, Visitor *v,
> +                                          const char *name, void *opaque,
> +                                          Error **errp)
> +{
> +    const VirtIOMEM *vmem = VIRTIO_MEM(obj);
> +    uint64_t value = vmem->requested_size;
> +
> +    visit_type_size(v, name, &value, errp);
> +}
> +
> +static void virtio_mem_set_requested_size(Object *obj, Visitor *v,
> +                                          const char *name, void *opaque,
> +                                          Error **errp)
> +{
> +    VirtIOMEM *vmem = VIRTIO_MEM(obj);
> +    Error *err = NULL;
> +    uint64_t value;
> +
> +    visit_type_size(v, name, &value, &err);
> +    if (err) {
> +        error_propagate(errp, err);
> +        return;
> +    }
> +
> +    /*
> +     * The block size and memory backend are not fixed until the device is
> +     * realized. realize() will verify these properties then.
> +     */
> +    if (DEVICE(obj)->realized) {
> +        if (!QEMU_IS_ALIGNED(value, vmem->block_size)) {
> +            error_setg(errp, "'%s' has to be a multiple of '%s' (0x%" PRIx32
> +                       ")", name, VIRTIO_MEM_BLOCK_SIZE_PROP,
> +                       vmem->block_size);
> +            return;
> +        } else if (value > memory_region_size(&vmem->memdev->mr)) {
> +            error_setg(errp, "'%s' cannot exceed the memory backend size"
> +                       " (0x%" PRIx64 ")", name,
> +                       memory_region_size(&vmem->memdev->mr));
> +            return;
> +        }
> +
> +        if (value != vmem->requested_size) {
> +            virtio_mem_resize_usable_region(vmem, value, false);
> +            vmem->requested_size = value;
> +        }
> +        /*
> +         * Trigger a config update so the guest gets notified. We trigger
> +         * even if the size didn't change (especially helpful for debugging).
> +         */
> +        virtio_notify_config(VIRTIO_DEVICE(vmem));
> +    } else {
> +        vmem->requested_size = value;
> +    }
> +}
> +
> +static void virtio_mem_get_block_size(Object *obj, Visitor *v, const char *name,
> +                                      void *opaque, Error **errp)
> +{
> +    const VirtIOMEM *vmem = VIRTIO_MEM(obj);
> +    uint64_t value = vmem->block_size;
> +
> +    visit_type_size(v, name, &value, errp);
> +}
> +
> +static void virtio_mem_set_block_size(Object *obj, Visitor *v, const char *name,
> +                                      void *opaque, Error **errp)
> +{
> +    VirtIOMEM *vmem = VIRTIO_MEM(obj);
> +    Error *err = NULL;
> +    uint64_t value;
> +
> +    if (DEVICE(obj)->realized) {
> +        error_setg(errp, "'%s' cannot be changed", name);
> +        return;
> +    }
> +
> +    visit_type_size(v, name, &value, &err);
> +    if (err) {
> +        error_propagate(errp, err);
> +        return;
> +    }
> +
> +    if (value > UINT32_MAX) {
> +        error_setg(errp, "'%s' has to be smaller than 0x%" PRIx32, name,
> +                   UINT32_MAX);
> +        return;
> +    } else if (value < VIRTIO_MEM_MIN_BLOCK_SIZE) {
> +        error_setg(errp, "'%s' has to be at least 0x%" PRIx32, name,
> +                   VIRTIO_MEM_MIN_BLOCK_SIZE);
> +        return;
> +    } else if (!is_power_of_2(value)) {
> +        error_setg(errp, "'%s' has to be a power of two", name);
> +        return;
> +    }
> +    vmem->block_size = value;
> +}
> +
> +static void virtio_mem_instance_init(Object *obj)
> +{
> +    VirtIOMEM *vmem = VIRTIO_MEM(obj);
> +
> +    vmem->block_size = VIRTIO_MEM_MIN_BLOCK_SIZE;
> +
> +    object_property_add(obj, VIRTIO_MEM_SIZE_PROP, "size", virtio_mem_get_size,
> +                        NULL, NULL, NULL, &error_abort);
> +    object_property_add(obj, VIRTIO_MEM_REQUESTED_SIZE_PROP, "size",
> +                        virtio_mem_get_requested_size,
> +                        virtio_mem_set_requested_size, NULL, NULL,
> +                        &error_abort);
> +    object_property_add(obj, VIRTIO_MEM_BLOCK_SIZE_PROP, "size",
> +                        virtio_mem_get_block_size, virtio_mem_set_block_size,
> +                        NULL, NULL, &error_abort);
> +}
> +
> +static Property virtio_mem_properties[] = {
> +    DEFINE_PROP_UINT64(VIRTIO_MEM_ADDR_PROP, VirtIOMEM, addr, 0),
> +    DEFINE_PROP_UINT32(VIRTIO_MEM_NODE_PROP, VirtIOMEM, node, 0),
> +    DEFINE_PROP_LINK(VIRTIO_MEM_MEMDEV_PROP, VirtIOMEM, memdev,
> +                     TYPE_MEMORY_BACKEND, HostMemoryBackend *),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void virtio_mem_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    VirtioDeviceClass *vdc = VIRTIO_DEVICE_CLASS(klass);
> +    VirtIOMEMClass *vmc = VIRTIO_MEM_CLASS(klass);
> +
> +    device_class_set_props(dc, virtio_mem_properties);
> +    dc->vmsd = &vmstate_virtio_mem;
> +
> +    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
> +    vdc->realize = virtio_mem_device_realize;
> +    vdc->unrealize = virtio_mem_device_unrealize;
> +    vdc->get_config = virtio_mem_get_config;
> +    vdc->get_features = virtio_mem_get_features;
> +    vdc->vmsd = &vmstate_virtio_mem_device;
> +
> +    vmc->fill_device_info = virtio_mem_fill_device_info;
> +    vmc->get_memory_region = virtio_mem_get_memory_region;
> +}
> +
> +static const TypeInfo virtio_mem_info = {
> +    .name = TYPE_VIRTIO_MEM,
> +    .parent = TYPE_VIRTIO_DEVICE,
> +    .instance_size = sizeof(VirtIOMEM),
> +    .instance_init = virtio_mem_instance_init,
> +    .class_init = virtio_mem_class_init,
> +    .class_size = sizeof(VirtIOMEMClass),
> +};
> +
> +static void virtio_register_types(void)
> +{
> +    type_register_static(&virtio_mem_info);
> +}
> +
> +type_init(virtio_register_types)
> diff --git a/include/hw/virtio/virtio-mem.h b/include/hw/virtio/virtio-mem.h
> new file mode 100644
> index 0000000000..27158cb611
> --- /dev/null
> +++ b/include/hw/virtio/virtio-mem.h
> @@ -0,0 +1,80 @@
> +/*
> + * Virtio MEM device
> + *
> + * Copyright (C) 2020 Red Hat, Inc.
> + *
> + * Authors:
> + *  David Hildenbrand <david@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#ifndef HW_VIRTIO_MEM_H
> +#define HW_VIRTIO_MEM_H
> +
> +#include "standard-headers/linux/virtio_mem.h"
> +#include "hw/virtio/virtio.h"
> +#include "qapi/qapi-types-misc.h"
> +#include "sysemu/hostmem.h"
> +
> +#define TYPE_VIRTIO_MEM "virtio-mem"
> +
> +#define VIRTIO_MEM(obj) \
> +        OBJECT_CHECK(VirtIOMEM, (obj), TYPE_VIRTIO_MEM)
> +#define VIRTIO_MEM_CLASS(oc) \
> +        OBJECT_CLASS_CHECK(VirtIOMEMClass, (oc), TYPE_VIRTIO_MEM)
> +#define VIRTIO_MEM_GET_CLASS(obj) \
> +        OBJECT_GET_CLASS(VirtIOMEMClass, (obj), TYPE_VIRTIO_MEM)
> +
> +#define VIRTIO_MEM_MEMDEV_PROP "memdev"
> +#define VIRTIO_MEM_NODE_PROP "node"
> +#define VIRTIO_MEM_SIZE_PROP "size"
> +#define VIRTIO_MEM_REQUESTED_SIZE_PROP "requested-size"
> +#define VIRTIO_MEM_BLOCK_SIZE_PROP "block-size"
> +#define VIRTIO_MEM_ADDR_PROP "memaddr"
> +
> +typedef struct VirtIOMEM {
> +    VirtIODevice parent_obj;
> +
> +    /* guest -> host request queue */
> +    VirtQueue *vq;
> +
> +    /* bitmap used to track unplugged memory */
> +    int32_t bitmap_size;
> +    unsigned long *bitmap;
> +
> +    /* assigned memory backend and memory region */
> +    HostMemoryBackend *memdev;
> +
> +    /* NUMA node */
> +    uint32_t node;
> +
> +    /* assigned address of the region in guest physical memory */
> +    uint64_t addr;
> +    uint64_t migration_addr;
> +
> +    /* usable region size (<= region_size) */
> +    uint64_t usable_region_size;
> +
> +    /* actual size (how much the guest plugged) */
> +    uint64_t size;
> +
> +    /* requested size */
> +    uint64_t requested_size;
> +
> +    /* block size and alignment */
> +    uint32_t block_size;
> +    uint32_t migration_block_size;
> +} VirtIOMEM;
> +
> +typedef struct VirtIOMEMClass {
> +    /* private */
> +    VirtioDeviceClass parent;
> +
> +    /* public */
> +    void (*fill_device_info)(const VirtIOMEM *vmem, VirtioMEMDeviceInfo *vi);
> +    MemoryRegion *(*get_memory_region)(VirtIOMEM *vmem, Error **errp);
> +} VirtIOMEMClass;
> +
> +#endif
> diff --git a/qapi/misc.json b/qapi/misc.json
> index 99b90ac80b..feaeacec22 100644
> --- a/qapi/misc.json
> +++ b/qapi/misc.json
> @@ -1354,19 +1354,56 @@
>            }
>  }
>  
> +##
> +# @VirtioMEMDeviceInfo:
> +#
> +# VirtioMEMDevice state information
> +#
> +# @id: device's ID
> +#
> +# @memaddr: physical address in memory, where device is mapped
> +#
> +# @requested-size: the user requested size of the device
> +#
> +# @size: the (current) size of memory that the device provides
> +#
> +# @max-size: the maximum size of memory that the device can provide
> +#
> +# @block-size: the block size of memory that the device provides
> +#
> +# @node: NUMA node number where device is assigned to
> +#
> +# @memdev: memory backend linked with the region
> +#
> +# Since: 5.1
> +##
> +{ 'struct': 'VirtioMEMDeviceInfo',
> +  'data': { '*id': 'str',
> +            'memaddr': 'size',
> +            'requested-size': 'size',
> +            'size': 'size',
> +            'max-size': 'size',
> +            'block-size': 'size',
> +            'node': 'int',
> +            'memdev': 'str'
> +          }
> +}
> +
>  ##
>  # @MemoryDeviceInfo:
>  #
>  # Union containing information about a memory device
>  #
>  # nvdimm is included since 2.12. virtio-pmem is included since 4.1.
> +# virtio-mem is included since 5.1.
>  #
>  # Since: 2.1
>  ##
>  { 'union': 'MemoryDeviceInfo',
>    'data': { 'dimm': 'PCDIMMDeviceInfo',
>              'nvdimm': 'PCDIMMDeviceInfo',
> -            'virtio-pmem': 'VirtioPMEMDeviceInfo'
> +            'virtio-pmem': 'VirtioPMEMDeviceInfo',
> +            'virtio-mem': 'VirtioMEMDeviceInfo'
>            }
>  }
>  
> -- 
> 2.25.3
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
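
As a reading aid for the property setters quoted above, the checks they
apply can be condensed into two predicates. This is a minimal standalone
sketch, not the patch's code: SKETCH_MIN_BLOCK_SIZE, block_size_valid() and
requested_size_valid() are made-up names, and the 1 MiB minimum is an
assumed placeholder for VIRTIO_MEM_MIN_BLOCK_SIZE, whose value is not shown
in this part of the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed placeholder; the real VIRTIO_MEM_MIN_BLOCK_SIZE is not shown here. */
#define SKETCH_MIN_BLOCK_SIZE (1024u * 1024u)

/* True if exactly one bit is set. */
static bool is_pow2(uint64_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Mirrors the three checks in the quoted block-size setter: the value must
 * fit into 32 bits, be at least the minimum, and be a power of two.
 */
bool block_size_valid(uint64_t value)
{
    return value <= UINT32_MAX &&
           value >= SKETCH_MIN_BLOCK_SIZE &&
           is_pow2(value);
}

/*
 * Mirrors the two checks the quoted requested-size setter applies once the
 * device is realized: aligned to the block size, and no larger than the
 * memory backend. block_size is assumed to be a power of two, so the
 * alignment test reduces to a mask.
 */
bool requested_size_valid(uint64_t value, uint32_t block_size,
                          uint64_t backend_size)
{
    return (value & ((uint64_t)block_size - 1)) == 0 &&
           value <= backend_size;
}
```

With a 2 MiB block size and an 8 MiB backend, a requested size of 4 MiB
passes, while 3 MiB (unaligned) and 16 MiB (too large) are rejected.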



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v1 06/17] target/i386: sev: Use ram_block_discard_set_broken()
  2020-05-06  9:49   ` David Hildenbrand
@ 2020-05-15 15:51     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-15 15:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Eduardo Habkost, Michael S . Tsirkin

* David Hildenbrand (david@redhat.com) wrote:
> AMD SEV will pin all guest memory, so mark discarding of RAM as broken. At
> the time this is called, we cannot have anyone active that relies on
> discards to work properly.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  target/i386/sev.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/target/i386/sev.c b/target/i386/sev.c
> index 846018a12d..608225f9ba 100644
> --- a/target/i386/sev.c
> +++ b/target/i386/sev.c
> @@ -722,6 +722,7 @@ sev_guest_init(const char *id)
>      ram_block_notifier_add(&sev_ram_notifier);
>      qemu_add_machine_init_done_notifier(&sev_machine_done_notify);
>      qemu_add_vm_change_state_handler(sev_vm_state_change, s);
> +    g_assert(!ram_block_discard_set_broken(true));
>  
>      return s;
>  err:
> -- 
> 2.25.3
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v1 12/17] MAINTAINERS: Add myself as virtio-mem maintainer
  2020-05-06  9:49   ` David Hildenbrand
@ 2020-05-15 15:55     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-15 15:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Eduardo Habkost, Michael S . Tsirkin, Peter Maydell,
	Markus Armbruster

* David Hildenbrand (david@redhat.com) wrote:
> Let's make sure patches/bug reports find the right person.
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Peter Maydell <peter.maydell@linaro.org>
> Cc: Markus Armbruster <armbru@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  MAINTAINERS | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 1f84e3ae2c..09fff9e1bd 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1734,6 +1734,14 @@ F: hw/virtio/virtio-crypto.c
>  F: hw/virtio/virtio-crypto-pci.c
>  F: include/hw/virtio/virtio-crypto.h
>  
> +virtio-mem
> +M: David Hildenbrand <david@redhat.com>
> +S: Supported
> +F: hw/virtio/virtio-mem.c
> +F: hw/virtio/virtio-mem-pci.h
> +F: hw/virtio/virtio-mem-pci.c
> +F: include/hw/virtio/virtio-mem.h
> +
>  nvme
>  M: Keith Busch <kbusch@kernel.org>
>  L: qemu-block@nongnu.org
> -- 
> 2.25.3
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v1 01/17] exec: Introduce ram_block_discard_set_(unreliable|required)()
  2020-05-15 14:54     ` David Hildenbrand
@ 2020-05-15 16:15       ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-15 16:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Eduardo Habkost, Michael S . Tsirkin

* David Hildenbrand (david@redhat.com) wrote:
> On 06.05.20 11:49, David Hildenbrand wrote:
> > We want to replace qemu_balloon_inhibit() by something more generic.
> > In particular, we want to make sure that technologies that rely on RAM
> > block discards working reliably run mutually exclusive with technologies
> > that break them.
> > 
> > E.g., vfio will usually pin all guest memory, rendering virtio-balloon
> > basically useless and making the VM consume more memory than reported via
> > the balloon. While the balloon is special already (=> no guarantees, same
> > behavior possible after reboots and with huge pages), this will be
> > different, especially with virtio-mem.
> > 
> > Let's implement a way such that we can make both types of technology run
> > mutually exclusive. We'll convert existing balloon inhibitors in successive
> > patches and add some new ones. Add the check to
> > qemu_balloon_is_inhibited() for now. We might want to make
> > virtio-balloon an actual inhibitor in the future - however, that
> > requires more thought to not break existing setups.
> > 
> > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > Cc: Richard Henderson <rth@twiddle.net>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: David Hildenbrand <david@redhat.com>
> > ---
> >  balloon.c             |  3 ++-
> >  exec.c                | 48 +++++++++++++++++++++++++++++++++++++++++++
> >  include/exec/memory.h | 41 ++++++++++++++++++++++++++++++++++++
> >  3 files changed, 91 insertions(+), 1 deletion(-)
> > 
> > diff --git a/balloon.c b/balloon.c
> > index f104b42961..c49f57c27b 100644
> > --- a/balloon.c
> > +++ b/balloon.c
> > @@ -40,7 +40,8 @@ static int balloon_inhibit_count;
> >  
> >  bool qemu_balloon_is_inhibited(void)
> >  {
> > -    return atomic_read(&balloon_inhibit_count) > 0;
> > +    return atomic_read(&balloon_inhibit_count) > 0 ||
> > +           ram_block_discard_is_broken();
> >  }
> >  
> >  void qemu_balloon_inhibit(bool state)
> > diff --git a/exec.c b/exec.c
> > index 2874bb5088..52a6e40e99 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -4049,4 +4049,52 @@ void mtree_print_dispatch(AddressSpaceDispatch *d, MemoryRegion *root)
> >      }
> >  }
> >  
> > +static int ram_block_discard_broken;
> > +
> > +int ram_block_discard_set_broken(bool state)
> > +{
> > +    int old;
> > +
> > +    if (!state) {
> > +        atomic_dec(&ram_block_discard_broken);
> > +        return 0;
> > +    }
> > +
> > +    do {
> > +        old = atomic_read(&ram_block_discard_broken);
> > +        if (old < 0) {
> > +            return -EBUSY;
> > +        }
> > +    } while (atomic_cmpxchg(&ram_block_discard_broken, old, old + 1) != old);
> > +    return 0;
> > +}
> > +
> > +int ram_block_discard_set_required(bool state)
> > +{
> > +    int old;
> > +
> > +    if (!state) {
> > +        atomic_inc(&ram_block_discard_broken);
> > +        return 0;
> > +    }
> > +
> > +    do {
> > +        old = atomic_read(&ram_block_discard_broken);
> > +        if (old > 0) {
> > +            return -EBUSY;
> > +        }
> > +    } while (atomic_cmpxchg(&ram_block_discard_broken, old, old - 1) != old);
> > +    return 0;
> > +}
> > +
> > +bool ram_block_discard_is_broken(void)
> > +{
> > +    return atomic_read(&ram_block_discard_broken) > 0;
> > +}
> > +
> > +bool ram_block_discard_is_required(void)
> > +{
> > +    return atomic_read(&ram_block_discard_broken) < 0;
> > +}
> > +
> >  #endif
> > diff --git a/include/exec/memory.h b/include/exec/memory.h
> > index e000bd2f97..9bb5ced38d 100644
> > --- a/include/exec/memory.h
> > +++ b/include/exec/memory.h
> > @@ -2463,6 +2463,47 @@ static inline MemOp devend_memop(enum device_endian end)
> >  }
> >  #endif
> >  
> > +/*
> > + * Inhibit technologies that rely on discarding of parts of RAM blocks to work
> > + * reliably, e.g., to manage the actual amount of memory consumed by the VM
> > + * (then, the memory provided by RAM blocks might be bigger than the desired
> > + * memory consumption). This *must* be set if:
> > > + * - Discarding parts of a RAM block does not result in the change being
> > > + *   reflected in the VM and the pages getting freed.
> > > + * - All memory in RAM blocks is pinned or duplicated, invalidating any previous
> > > + *   discards blindly.
> > > + * - Discarding parts of a RAM block will result in integrity issues (e.g.,
> > > + *   encrypted VMs).
> > + * Technologies that only temporarily pin the current working set of a
> > + * driver are fine, because we don't expect such pages to be discarded
> > + * (esp. based on guest action like balloon inflation).
> > + *
> > + * This is *not* to be used to protect from concurrent discards (esp.,
> > + * postcopy).
> > + *
> > + * Returns 0 if successful. Returns -EBUSY if a technology that relies on
> > + * discards to work reliably is active.
> > + */
> > +int ram_block_discard_set_broken(bool state);
> > +
> > +/*
> > + * Inhibit technologies that will break discarding of pages in RAM blocks.
> > + *
> > + * Returns 0 if successful. Returns -EBUSY if discards are already set to
> > + * broken.
> > + */
> > +int ram_block_discard_set_required(bool state);
> > +
> > +/*
> > + * Test if discarding of memory in ram blocks is broken.
> > + */
> > +bool ram_block_discard_is_broken(void);
> > +
> > +/*
> > + * Test if discarding of memory in ram blocks is required to work reliably.
> > + */
> > +bool ram_block_discard_is_required(void);
> > +
> >  #endif
> >  
> >  #endif
> > 
> 
> I'm wondering if I'll just call these functions
> 
> ram_block_discard_disable()
> 
> and
> 
> ram_block_discard_require()

Yeh I prefer that.

Dave

> -- 
> Thanks,
> 
> David / dhildenb
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
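
For readers skimming the quoted exec.c hunk: both setters share a single
signed counter, where a positive value counts users that break discards and
a negative value counts users that require them. A minimal standalone
sketch of that idea follows. It uses C11 <stdatomic.h> rather than QEMU's
atomic_*() helpers, and the discard_* names and the discard_take() helper
are illustrative only, not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Sketch only. One signed counter encodes both states:
 *   > 0: that many users have marked discards as broken
 *   < 0: that many users require discards to work
 *   = 0: nobody has registered either way
 */
static atomic_int discard_state;

/*
 * Hypothetical helper: move the counter one step in direction 'step'
 * (+1 = broken, -1 = required) unless the opposite side is active.
 */
static int discard_take(int step)
{
    int old = atomic_load(&discard_state);

    do {
        if ((step > 0 && old < 0) || (step < 0 && old > 0)) {
            return -EBUSY;
        }
        /* On failure, 'old' is refreshed with the current counter value. */
    } while (!atomic_compare_exchange_weak(&discard_state, &old, old + step));
    return 0;
}

int discard_set_broken(bool state)
{
    if (!state) {
        /* Drop one "broken" reference. */
        atomic_fetch_sub(&discard_state, 1);
        return 0;
    }
    return discard_take(+1);
}

int discard_set_required(bool state)
{
    if (!state) {
        /* Drop one "required" reference. */
        atomic_fetch_add(&discard_state, 1);
        return 0;
    }
    return discard_take(-1);
}

bool discard_is_broken(void)
{
    return atomic_load(&discard_state) > 0;
}

bool discard_is_required(void)
{
    return atomic_load(&discard_state) < 0;
}
```

Whichever side registers first wins: once discard_set_broken(true) has
succeeded, discard_set_required(true) returns -EBUSY until every "broken"
user has called discard_set_broken(false), and vice versa.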


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v1 16/17] virtio-mem: Allow notifiers for size changes
  2020-05-06  9:49   ` David Hildenbrand
@ 2020-05-15 16:46     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-15 16:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Eduardo Habkost, Michael S . Tsirkin, Igor Mammedov

* David Hildenbrand (david@redhat.com) wrote:
> We want to send qapi events in case the size of a virtio-mem device
> changes. This allows upper layers to always know how much memory is
> actually currently consumed via a virtio-mem device.
> 
> Unfortunately, we have to report the id of our proxy device. Let's provide
> an easy way for our proxy device to register, so it can send the qapi
> events. Piggy-backing on the notifier infrastructure (although we'll
> only ever have one notifier registered) seems to be an easy way.
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  hw/virtio/virtio-mem.c         | 21 ++++++++++++++++++++-
>  include/hw/virtio/virtio-mem.h |  5 +++++
>  2 files changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index e25b2c74f2..88a99a0d90 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -198,6 +198,7 @@ static int virtio_mem_state_change_request(VirtIOMEM *vmem, uint64_t gpa,
>      } else {
>          vmem->size -= size;
>      }
> +    notifier_list_notify(&vmem->size_change_notifiers, &vmem->size);
>      return VIRTIO_MEM_RESP_ACK;
>  }
>  
> @@ -253,7 +254,10 @@ static int virtio_mem_unplug_all(VirtIOMEM *vmem)
>          return -EBUSY;
>      }
>      bitmap_clear(vmem->bitmap, 0, vmem->bitmap_size);
> -    vmem->size = 0;
> +    if (vmem->size != 0) {
> +        vmem->size = 0;
> +        notifier_list_notify(&vmem->size_change_notifiers, &vmem->size);
> +    }
>  
>      virtio_mem_resize_usable_region(vmem, vmem->requested_size, true);
>      return 0;
> @@ -594,6 +598,18 @@ static MemoryRegion *virtio_mem_get_memory_region(VirtIOMEM *vmem, Error **errp)
>      return &vmem->memdev->mr;
>  }
>  
> +static void virtio_mem_add_size_change_notifier(VirtIOMEM *vmem,
> +                                                Notifier *notifier)
> +{
> +    notifier_list_add(&vmem->size_change_notifiers, notifier);
> +}
> +
> +static void virtio_mem_remove_size_change_notifier(VirtIOMEM *vmem,
> +                                                   Notifier *notifier)
> +{
> +    notifier_remove(notifier);
> +}
> +
>  static void virtio_mem_get_size(Object *obj, Visitor *v, const char *name,
>                                  void *opaque, Error **errp)
>  {
> @@ -705,6 +721,7 @@ static void virtio_mem_instance_init(Object *obj)
>      VirtIOMEM *vmem = VIRTIO_MEM(obj);
>  
>      vmem->block_size = VIRTIO_MEM_MIN_BLOCK_SIZE;
> +    notifier_list_init(&vmem->size_change_notifiers);
>  
>      object_property_add(obj, VIRTIO_MEM_SIZE_PROP, "size", virtio_mem_get_size,
>                          NULL, NULL, NULL, &error_abort);
> @@ -743,6 +760,8 @@ static void virtio_mem_class_init(ObjectClass *klass, void *data)
>  
>      vmc->fill_device_info = virtio_mem_fill_device_info;
>      vmc->get_memory_region = virtio_mem_get_memory_region;
> +    vmc->add_size_change_notifier = virtio_mem_add_size_change_notifier;
> +    vmc->remove_size_change_notifier = virtio_mem_remove_size_change_notifier;
>  }
>  
>  static const TypeInfo virtio_mem_info = {
> diff --git a/include/hw/virtio/virtio-mem.h b/include/hw/virtio/virtio-mem.h
> index 27158cb611..5820b5c23e 100644
> --- a/include/hw/virtio/virtio-mem.h
> +++ b/include/hw/virtio/virtio-mem.h
> @@ -66,6 +66,9 @@ typedef struct VirtIOMEM {
>      /* block size and alignment */
>      uint32_t block_size;
>      uint32_t migration_block_size;
> +
> +    /* notifiers to notify when "size" changes */
> +    NotifierList size_change_notifiers;
>  } VirtIOMEM;
>  
>  typedef struct VirtIOMEMClass {
> @@ -75,6 +78,8 @@ typedef struct VirtIOMEMClass {
>      /* public */
>      void (*fill_device_info)(const VirtIOMEM *vmen, VirtioMEMDeviceInfo *vi);
>      MemoryRegion *(*get_memory_region)(VirtIOMEM *vmem, Error **errp);
> +    void (*add_size_change_notifier)(VirtIOMEM *vmem, Notifier *notifier);
> +    void (*remove_size_change_notifier)(VirtIOMEM *vmem, Notifier *notifier);
>  } VirtIOMEMClass;
>  
>  #endif
> -- 
> 2.25.3
> 
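
For readers unfamiliar with the pattern the patch piggy-backs on, here is a
minimal, self-contained sketch of a notifier list in plain C. It is not QEMU's
implementation: SizeNotifier, SizeNotifierList, ToyMemDevice, and all helper
names are hypothetical stand-ins for Notifier, NotifierList, and the
notifier_list_add()/notifier_remove()/notifier_list_notify() calls used above.
It only illustrates that every registered callback receives a pointer to the
new size, and that unplug-all notifies only when the size actually changes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for QEMU's Notifier: a callback plus list links. */
typedef struct SizeNotifier SizeNotifier;
struct SizeNotifier {
    void (*notify)(SizeNotifier *n, void *data);
    SizeNotifier *next;
    SizeNotifier **pprev; /* address of the pointer that points at us */
};

typedef struct {
    SizeNotifier *head;
} SizeNotifierList;

static void size_notifier_list_init(SizeNotifierList *list)
{
    list->head = NULL;
}

static void size_notifier_list_add(SizeNotifierList *list, SizeNotifier *n)
{
    n->next = list->head;
    n->pprev = &list->head;
    if (list->head) {
        list->head->pprev = &n->next;
    }
    list->head = n;
}

/* As in the patch's remove hook, removal only needs the notifier itself. */
static void size_notifier_remove(SizeNotifier *n)
{
    *n->pprev = n->next;
    if (n->next) {
        n->next->pprev = n->pprev;
    }
    n->next = NULL;
    n->pprev = NULL;
}

static void size_notifier_list_notify(SizeNotifierList *list, void *data)
{
    SizeNotifier *n = list->head;

    while (n) {
        SizeNotifier *next = n->next; /* tolerate removal inside callback */
        n->notify(n, data);
        n = next;
    }
}

/* Toy device: only the fields needed for the illustration. */
typedef struct {
    uint64_t size;
    SizeNotifierList size_change_notifiers;
} ToyMemDevice;

static void toy_unplug_all(ToyMemDevice *dev)
{
    /* Mirrors the patch: notify only if the size really changes. */
    if (dev->size != 0) {
        dev->size = 0;
        size_notifier_list_notify(&dev->size_change_notifiers, &dev->size);
    }
}

/* Example listener: records how often it ran and the last size seen. */
typedef struct {
    SizeNotifier notifier; /* first member, so the callback can cast back */
    unsigned calls;
    uint64_t last_size;
} RecordingListener;

static void recording_notify(SizeNotifier *n, void *data)
{
    RecordingListener *l = (RecordingListener *)n;

    l->calls++;
    l->last_size = *(uint64_t *)data;
}
```

In the series, the single listener would be the proxy device, which can turn
the callback into a qapi event carrying its own id.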
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v1 10/17] virtio-mem: Paravirtualized memory hot(un)plug
  2020-05-15 15:37     ` Dr. David Alan Gilbert
@ 2020-05-15 16:48       ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-15 16:48 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Eduardo Habkost, Michael S . Tsirkin, Eric Blake,
	Markus Armbruster, Igor Mammedov

On 15.05.20 17:37, Dr. David Alan Gilbert wrote:
> I'm not sure if it's possible to split this up; it's a bit big.

Functionality-wise, it's the bare minimum. I could split out handling of
all 4 request types, but they are only ~150-200 LOC. Not sure if that is
really worth the trouble. Open to suggestions.

> It could also do with a pile of trace_ entries to figure out what it's
> doing.

Good idea, will add that with a patch on top.

[...]

>> +static int virtio_mem_set_block_state(VirtIOMEM *vmem, uint64_t start_gpa,
>> +                                      uint64_t size, bool plug)
>> +{
>> +    const uint64_t offset = start_gpa - vmem->addr;
>> +    int ret;
>> +
>> +    if (!plug) {
>> +        if (virtio_mem_discard_inhibited()) {
>> +            return -EBUSY;
>> +        }
>> +        /* Note: Discarding should never fail at this point. */
>> +        ret = ram_block_discard_range(vmem->memdev->mr.ram_block, offset, size);
>> +        if (ret) {
> 
> error_report ?


error_report("Unexpected error discarding RAM: %s",
	     strerror(-ret));
it is.

[...]

>> +    ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb));
>> +    if (ret) {
>> +        /* Note: Discarding should never fail at this point. */
> 
> error_report?

Ditto.

> 
>> +        return -EBUSY;
>> +    }
>> +    bitmap_clear(vmem->bitmap, 0, vmem->bitmap_size);
>> +    vmem->size = 0;
>> +
>> +    virtio_mem_resize_usable_region(vmem, vmem->requested_size, true);
>> +    return 0;
>> +}

[...]

>> +static void virtio_mem_handle_request(VirtIODevice *vdev, VirtQueue *vq)
>> +{
>> +    const int len = sizeof(struct virtio_mem_req);
>> +    VirtIOMEM *vmem = VIRTIO_MEM(vdev);
>> +    VirtQueueElement *elem;
>> +    struct virtio_mem_req req;
>> +    uint64_t type;
>> +
>> +    while (true) {
>> +        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
>> +        if (!elem) {
>> +            return;
>> +        }
>> +
>> +        if (iov_to_buf(elem->out_sg, elem->out_num, 0, &req, len) < len) {
>> +            virtio_mem_bad_request(vmem, "invalid request size");
> 
> Print the size.

Makes sense. I'll probably get rid of virtio_mem_bad_request() and just
call virtio_error() directly with additional parameters.

> 
>> +            g_free(elem);
>> +            return;
>> +        }
>> +
>> +        if (iov_size(elem->in_sg, elem->in_num) <
>> +            sizeof(struct virtio_mem_resp)) {
>> +            virtio_mem_bad_request(vmem, "not enough space for response");
>> +            g_free(elem);
>> +            return;
>> +        }
>> +
>> +        type = le16_to_cpu(req.type);
>> +        switch (type) {
>> +        case VIRTIO_MEM_REQ_PLUG:
>> +            virtio_mem_plug_request(vmem, elem, &req);
>> +            break;
>> +        case VIRTIO_MEM_REQ_UNPLUG:
>> +            virtio_mem_unplug_request(vmem, elem, &req);
>> +            break;
>> +        case VIRTIO_MEM_REQ_UNPLUG_ALL:
>> +            virtio_mem_unplug_all_request(vmem, elem);
>> +            break;
>> +        case VIRTIO_MEM_REQ_STATE:
>> +            virtio_mem_state_request(vmem, elem, &req);
>> +            break;
>> +        default:
>> +            virtio_mem_bad_request(vmem, "unknown request type");
> 
> Could include the type .

Yes, will do!
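
A rough sketch of what the two review comments ask for, i.e. putting the
offending size and type into the message. This is not the virtio-mem code:
ToyDevice, toy_device_error(), toy_check_request_size(), and
toy_unknown_request() are hypothetical names, and the helper merely imitates a
printf-style error function that also marks the device as broken.

```c
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a device that can be marked broken. */
typedef struct {
    bool broken;
    char last_error[128];
} ToyDevice;

/* printf-style helper: formats the message and marks the device broken. */
static void toy_device_error(ToyDevice *dev, const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    vsnprintf(dev->last_error, sizeof(dev->last_error), fmt, ap);
    va_end(ap);
    dev->broken = true;
}

/* Returns true if the request is large enough; otherwise reports both sizes. */
static bool toy_check_request_size(ToyDevice *dev, size_t got, size_t expected)
{
    if (got < expected) {
        toy_device_error(dev, "invalid request size: %zu (expected %zu)",
                         got, expected);
        return false;
    }
    return true;
}

/* Reports an unknown request type together with its numeric value. */
static void toy_unknown_request(ToyDevice *dev, unsigned type)
{
    toy_device_error(dev, "unknown request type: %u", type);
}
```

With the values in the message, a misbehaving guest driver can be diagnosed
from the log alone.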

[...]

>> +
>> +static int virtio_mem_pre_save(void *opaque)
>> +{
>> +    VirtIOMEM *vmem = VIRTIO_MEM(opaque);
>> +
>> +    vmem->migration_addr = vmem->addr;
>> +    vmem->migration_block_size = vmem->block_size;
> 
> You might look at VMSTATE_WITH_TMP; it could avoid you having the dummy
> fields.

Thanks, will have a look.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v1 07/17] migration/rdma: Use ram_block_discard_set_broken()
  2020-05-15 14:09       ` David Hildenbrand
@ 2020-05-15 17:51         ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-15 17:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Eduardo Habkost, Michael S . Tsirkin, Juan Quintela

* David Hildenbrand (david@redhat.com) wrote:
> On 15.05.20 14:45, Dr. David Alan Gilbert wrote:
> > * David Hildenbrand (david@redhat.com) wrote:
> >> RDMA will pin all guest memory (as documented in docs/rdma.txt). We want
> >> to mark RAM block discards to be broken - however, to keep it simple
> >> use ram_block_discard_is_required() instead of inhibiting.
> > 
> > Should this be dependent on whether rdma->pin_all is set?
> > Even with !pin_all some will be pinned at any given time
> > (when it's registered with the rdma stack).
> 
> Do you know how much memory this is? Is such memory only temporarily pinned?

With pin_all not set, only a subset of memory (I think multiple 1MB
chunks) is pinned at any one time.

> At least with special-cases of vfio, it's acceptable if some memory is
> temporarily pinned - we assume it's only the working set of the driver,
> which guests will not inflate as long as they don't want to shoot
> themselves in the foot.
> 
> This here sounds like the guest does not know the pinned memory is
> special, right?

Right - for RDMA it's all of memory that's being transferred, and the
guest doesn't see when each part is transferred.

Dave

> -- 
> Thanks,
> 
> David / dhildenb
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v1 07/17] migration/rdma: Use ram_block_discard_set_broken()
  2020-05-15 17:51         ` Dr. David Alan Gilbert
@ 2020-05-15 17:59           ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-15 17:59 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Eduardo Habkost, Michael S . Tsirkin, Juan Quintela

On 15.05.20 19:51, Dr. David Alan Gilbert wrote:
> * David Hildenbrand (david@redhat.com) wrote:
>> On 15.05.20 14:45, Dr. David Alan Gilbert wrote:
>>> * David Hildenbrand (david@redhat.com) wrote:
>>>> RDMA will pin all guest memory (as documented in docs/rdma.txt). We want
>>>> to mark RAM block discards to be broken - however, to keep it simple
>>>> use ram_block_discard_is_required() instead of inhibiting.
>>>
>>> Should this be dependent on whether rdma->pin_all is set?
>>> Even with !pin_all some will be pinned at any given time
>>> (when it's registered with the rdma stack).
>>
>> Do you know how much memory this is? Is such memory only temporarily pinned?
> 
> With pin_all not set, only a subset of memory, I think multiple 1MB
> chunks, are pinned at any one time.
> 
>> At least with special-cases of vfio, it's acceptable if some memory is
>> temporarily pinned - we assume it's only the working set of the driver,
>> which guests will not inflate as long as they don't want to shoot
>> themselves in the foot.
>>
>> This here sounds like the guest does not know the pinned memory is
>> special, right?
> 
> Right - for RDMA it's all of memory that's being transferred, and the
> guest doesn't see when each part is transferred.


Okay, so all memory will eventually be pinned, just not at the same
time, correct?

I think this implies that any memory that was previously discarded will
be backed by new pages, meaning we will consume more memory than intended.

If so, always disabling discarding of RAM seems to be the right thing to do.
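
The concern can be made concrete with a toy model. This is a sketch under
simplifying assumptions, not QEMU or kernel code: ToyRam, toy_discard(), and
toy_pin_chunk() are hypothetical names, and pinning is modeled as simply
forcing every page of the chunk to be backed, with the pages staying backed
after the chunk is unpinned.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PAGES 16

/* Hypothetical model: one flag per page, true if the host backs it. */
typedef struct {
    bool backed[TOY_NR_PAGES];
} ToyRam;

/* Start with every page backed. */
static void toy_init(ToyRam *ram)
{
    for (size_t i = 0; i < TOY_NR_PAGES; i++) {
        ram->backed[i] = true;
    }
}

/* Discarding frees the backing pages of [first, first + count). */
static void toy_discard(ToyRam *ram, size_t first, size_t count)
{
    for (size_t i = first; i < first + count && i < TOY_NR_PAGES; i++) {
        ram->backed[i] = false;
    }
}

/*
 * Pinning a chunk for transfer populates every page in it. In this model
 * the pages remain populated once the chunk is unpinned again.
 */
static void toy_pin_chunk(ToyRam *ram, size_t first, size_t count)
{
    for (size_t i = first; i < first + count && i < TOY_NR_PAGES; i++) {
        ram->backed[i] = true;
    }
}

/* Number of pages the host currently has to provide. */
static size_t toy_backed_pages(const ToyRam *ram)
{
    size_t n = 0;

    for (size_t i = 0; i < TOY_NR_PAGES; i++) {
        n += ram->backed[i];
    }
    return n;
}
```

In this model, discarding half of the pages and then walking all of memory
chunk by chunk ends with every page backed again, even though only one chunk
is ever pinned at a time.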


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v1 07/17] migration/rdma: Use ram_block_discard_set_broken()
  2020-05-15 17:59           ` David Hildenbrand
@ 2020-05-15 18:36             ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 94+ messages in thread
From: Dr. David Alan Gilbert @ 2020-05-15 18:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Eduardo Habkost, Michael S . Tsirkin, Juan Quintela

* David Hildenbrand (david@redhat.com) wrote:
> On 15.05.20 19:51, Dr. David Alan Gilbert wrote:
> > * David Hildenbrand (david@redhat.com) wrote:
> >> On 15.05.20 14:45, Dr. David Alan Gilbert wrote:
> >>> * David Hildenbrand (david@redhat.com) wrote:
> >>>> RDMA will pin all guest memory (as documented in docs/rdma.txt). We want
> >>>> to mark RAM block discards to be broken - however, to keep it simple
> >>>> use ram_block_discard_is_required() instead of inhibiting.
> >>>
> >>> Should this be dependent on whether rdma->pin_all is set?
> >>> Even with !pin_all some will be pinned at any given time
> >>> (when it's registered with the rdma stack).
> >>
> >> Do you know how much memory this is? Is such memory only temporarily pinned?
> > 
> > With pin_all not set, only a subset of memory, I think multiple 1MB
> > chunks, are pinned at any one time.
> > 
> >> At least with special-cases of vfio, it's acceptable if some memory is
> >> temporarily pinned - we assume it's only the working set of the driver,
> >> which guests will not inflate as long as they don't want to shoot
> >> themselves in the foot.
> >>
> >> This here sounds like the guest does not know the pinned memory is
> >> special, right?
> > 
> > Right - for RDMA it's all of memory that's being transferred, and the
> > guest doesn't see when each part is transferred.
> 
> 
> Okay, so all memory will eventually be pinned, just not at the same
> time, correct?
> 
> I think this implies that any memory that was previously discarded will
> be backed by new pages, meaning we will consume more memory than intended.
> 
> If so, always disabling discarding of RAM seems to be the right thing to do.

Yeah, that's probably true, although there's a check for 'buffer_is_zero'
in the !rdma->pin_all case: if the entire area is zero (or probably if
unmapped), it sends a notification rather than registering; see
qemu_rdma_write_one and search for 'This chunk has not yet been
registered, so first check to see'
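
As a rough illustration of that shortcut (a simplified sketch, not the code in
qemu_rdma_write_one): toy_buffer_is_zero() is a naive stand-in for
buffer_is_zero(), and ToyChunkAction and toy_chunk_action() are hypothetical
names for the decision described above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    TOY_SEND_ZERO_NOTIFICATION, /* tell the destination the chunk is zero */
    TOY_REGISTER_AND_WRITE,     /* register (pin) the chunk, then transfer */
} ToyChunkAction;

/* Naive byte-wise zero check. */
static bool toy_buffer_is_zero(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i]) {
            return false;
        }
    }
    return true;
}

/*
 * A chunk that has not been registered yet and contains only zeroes is
 * announced via a notification and never registered, so it is not pinned.
 */
static ToyChunkAction toy_chunk_action(const uint8_t *chunk, size_t len,
                                       bool already_registered)
{
    if (!already_registered && toy_buffer_is_zero(chunk, len)) {
        return TOY_SEND_ZERO_NOTIFICATION;
    }
    return TOY_REGISTER_AND_WRITE;
}
```

Whether reading a discarded range for the zero check itself allocates memory
depends on the backing and is not modeled here.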

Dave

> 
> -- 
> Thanks,
> 
> David / dhildenb
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v1 11/17] virtio-pci: Proxy for virtio-mem
  2020-05-06 18:57     ` Pankaj Gupta
@ 2020-05-18 13:34       ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-18 13:34 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: qemu-devel, kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Dr . David Alan Gilbert, Eduardo Habkost, Michael S . Tsirkin,
	Marcel Apfelbaum, Igor Mammedov

On 06.05.20 20:57, Pankaj Gupta wrote:
>> Let's add a proxy for virtio-mem, make it a memory device, and
>> pass-through the properties.
>>
>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
>> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>> Cc: Igor Mammedov <imammedo@redhat.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  hw/virtio/Makefile.objs    |   1 +
>>  hw/virtio/virtio-mem-pci.c | 131 +++++++++++++++++++++++++++++++++++++
>>  hw/virtio/virtio-mem-pci.h |  33 ++++++++++
>>  include/hw/pci/pci.h       |   1 +
>>  4 files changed, 166 insertions(+)
>>  create mode 100644 hw/virtio/virtio-mem-pci.c
>>  create mode 100644 hw/virtio/virtio-mem-pci.h
>>
>> diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs
>> index 7df70e977e..b9661f9c01 100644
>> --- a/hw/virtio/Makefile.objs
>> +++ b/hw/virtio/Makefile.objs
>> @@ -19,6 +19,7 @@ obj-$(call land,$(CONFIG_VHOST_USER_FS),$(CONFIG_VIRTIO_PCI)) += vhost-user-fs-p
>>  obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
>>  obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock.o
>>  obj-$(CONFIG_VIRTIO_MEM) += virtio-mem.o
>> +common-obj-$(call land,$(CONFIG_VIRTIO_MEM),$(CONFIG_VIRTIO_PCI)) += virtio-mem-pci.o
>>
>>  ifeq ($(CONFIG_VIRTIO_PCI),y)
>>  obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock-pci.o
>> diff --git a/hw/virtio/virtio-mem-pci.c b/hw/virtio/virtio-mem-pci.c
>> new file mode 100644
>> index 0000000000..a47d21c81f
>> --- /dev/null
>> +++ b/hw/virtio/virtio-mem-pci.c
>> @@ -0,0 +1,131 @@
>> +/*
>> + * Virtio MEM PCI device
>> + *
>> + * Copyright (C) 2020 Red Hat, Inc.
>> + *
>> + * Authors:
>> + *  David Hildenbrand <david@redhat.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +
> Don't think we need the blank line here.
> 

Right, thanks!

[...]

>> --
>> 2.25.3
> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
> 

Thanks!

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v1 11/17] virtio-pci: Proxy for virtio-mem
@ 2020-05-18 13:34       ` David Hildenbrand
  0 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-18 13:34 UTC (permalink / raw)
  To: Pankaj Gupta
  Cc: Eduardo Habkost, kvm, Michael S . Tsirkin,
	Dr . David Alan Gilbert, qemu-devel, qemu-s390x, Igor Mammedov,
	Paolo Bonzini, Richard Henderson

On 06.05.20 20:57, Pankaj Gupta wrote:
>> Let's add a proxy for virtio-mem, make it a memory device, and
>> pass-through the properties.
>>
>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
>> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>> Cc: Igor Mammedov <imammedo@redhat.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  hw/virtio/Makefile.objs    |   1 +
>>  hw/virtio/virtio-mem-pci.c | 131 +++++++++++++++++++++++++++++++++++++
>>  hw/virtio/virtio-mem-pci.h |  33 ++++++++++
>>  include/hw/pci/pci.h       |   1 +
>>  4 files changed, 166 insertions(+)
>>  create mode 100644 hw/virtio/virtio-mem-pci.c
>>  create mode 100644 hw/virtio/virtio-mem-pci.h
>>
>> diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs
>> index 7df70e977e..b9661f9c01 100644
>> --- a/hw/virtio/Makefile.objs
>> +++ b/hw/virtio/Makefile.objs
>> @@ -19,6 +19,7 @@ obj-$(call land,$(CONFIG_VHOST_USER_FS),$(CONFIG_VIRTIO_PCI)) += vhost-user-fs-p
>>  obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
>>  obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock.o
>>  obj-$(CONFIG_VIRTIO_MEM) += virtio-mem.o
>> +common-obj-$(call land,$(CONFIG_VIRTIO_MEM),$(CONFIG_VIRTIO_PCI)) += virtio-mem-pci.o
>>
>>  ifeq ($(CONFIG_VIRTIO_PCI),y)
>>  obj-$(CONFIG_VHOST_VSOCK) += vhost-vsock-pci.o
>> diff --git a/hw/virtio/virtio-mem-pci.c b/hw/virtio/virtio-mem-pci.c
>> new file mode 100644
>> index 0000000000..a47d21c81f
>> --- /dev/null
>> +++ b/hw/virtio/virtio-mem-pci.c
>> @@ -0,0 +1,131 @@
>> +/*
>> + * Virtio MEM PCI device
>> + *
>> + * Copyright (C) 2020 Red Hat, Inc.
>> + *
>> + * Authors:
>> + *  David Hildenbrand <david@redhat.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +
> Don't think we need the blank line here.
> 

Right, thanks!

[...]

>> --
>> 2.25.3
> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
> 

Thanks!

-- 
Thanks,

David / dhildenb




* Re: [PATCH v1 07/17] migration/rdma: Use ram_block_discard_set_broken()
  2020-05-15 18:36             ` Dr. David Alan Gilbert
@ 2020-05-18 13:52               ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-18 13:52 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Eduardo Habkost, Michael S . Tsirkin, Juan Quintela

On 15.05.20 20:36, Dr. David Alan Gilbert wrote:
> * David Hildenbrand (david@redhat.com) wrote:
>> On 15.05.20 19:51, Dr. David Alan Gilbert wrote:
>>> * David Hildenbrand (david@redhat.com) wrote:
>>>> On 15.05.20 14:45, Dr. David Alan Gilbert wrote:
>>>>> * David Hildenbrand (david@redhat.com) wrote:
>>>>>> RDMA will pin all guest memory (as documented in docs/rdma.txt). We want
>>>>>> to mark RAM block discards to be broken - however, to keep it simple
>>>>>> use ram_block_discard_is_required() instead of inhibiting.
>>>>>
>>>>> Should this be dependent on whether rdma->pin_all is set?
>>>>> Even with !pin_all some will be pinned at any given time
>>>>> (when it's registered with the rdma stack).
>>>>
>>>> Do you know how much memory this is? Is such memory only temporarily pinned?
>>>
>>> With pin_all not set, only a subset of memory, I think multiple 1MB
>>> chunks, are pinned at any one time.
>>>
>>>> At least with special-cases of vfio, it's acceptable if some memory is
>>>> temporarily pinned - we assume it's only the working set of the driver,
>>>> which guests will not inflate as long as they don't want to shoot
>>>> themselves in the foot.
>>>>
>>>> This here sounds like the guest does not know the pinned memory is
>>>> special, right?
>>>
>>> Right - for RDMA it's all of memory that's being transferred, and the
>>> guest doesn't see when each part is transferred.
>>
>>
>> Okay, so all memory will eventually be pinned, just not at the same
>> time, correct?
>>
>> I think this implies that any memory that was previously discarded will
>> be backed by new pages, meaning we will consume more memory than intended.
>>
>> If so, always disabling discarding of RAM seems to be the right thing to do.
> 
> Yeh that's probably true, although there's a check for 'buffer_is_zero'
> in the !rdma->pin_all case, if the entire area is zero (or probably if
> unmapped) then it sends a notification rather than registering; see
> qemu_rdma_write_one and search for 'This chunk has not yet been
> registered, so first check to see'

Right, if the whole chunk is zero, it will send a "compressed" zero
chunk to the target. That will result in a memset() in case the
destination is not already zero. So, both the source and the destination
will at least be read.

But this only works if a complete chunk (1MB) is zero IIUC. If only one
page within a chunk is not zero (e.g., not inflated), the whole chunk
will be pinned. Also, "disabled chunk registration" seems to be another
case.

https://wiki.qemu.org/Features/RDMALiveMigration

"Finally, zero pages are only checked if a page has not yet been
registered using chunk registration (or not checked at all and
unconditionally written if chunk registration is disabled. This is
accomplished using the "Compress" command listed above. If the page
*has* been registered then we check the entire chunk for zero. Only if
the entire chunk is zero, then we send a compress command to zap the
page on the other side."
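
To make the chunk granularity concrete, here is a minimal standalone
sketch. It is not the code from qemu_rdma_write_one; every name in it
(DEMO_CHUNK_SIZE, demo_range_is_zero, demo_count_pinned_chunks) is
hypothetical, and it assumes the 1MB chunk size and the "zero chunks are
not registered" behavior as described above. It only counts how many
chunks would have to be registered (pinned) as a whole:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 1 MiB chunks, as discussed in this thread. */
#define DEMO_CHUNK_SIZE (1024u * 1024u)

/* Return true if all len bytes starting at buf are zero. */
static bool demo_range_is_zero(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0) {
            return false;
        }
    }
    return true;
}

/*
 * Count the chunks that contain at least one non-zero byte. In this
 * simplified model, such a chunk cannot be sent as a "compressed" zero
 * chunk and would be registered (pinned) in full.
 */
static size_t demo_count_pinned_chunks(const uint8_t *ram, size_t ram_size)
{
    size_t pinned = 0;

    for (size_t off = 0; off < ram_size; off += DEMO_CHUNK_SIZE) {
        size_t remaining = ram_size - off;
        size_t len = remaining < DEMO_CHUNK_SIZE ? remaining : DEMO_CHUNK_SIZE;

        if (!demo_range_is_zero(ram + off, len)) {
            pinned++;
        }
    }
    return pinned;
}
```

In this model a single non-zero byte is enough to mark its entire 1MB
chunk as pinned, including any discarded pages that share the chunk.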

-- 
Thanks,

David / dhildenb



* Re: [PATCH v1 10/17] virtio-mem: Paravirtualized memory hot(un)plug
  2020-05-15 16:48       ` David Hildenbrand
@ 2020-05-18 14:23         ` David Hildenbrand
  -1 siblings, 0 replies; 94+ messages in thread
From: David Hildenbrand @ 2020-05-18 14:23 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, kvm, qemu-s390x, Richard Henderson, Paolo Bonzini,
	Eduardo Habkost, Michael S . Tsirkin, Eric Blake,
	Markus Armbruster, Igor Mammedov

>>> +
>>> +static int virtio_mem_pre_save(void *opaque)
>>> +{
>>> +    VirtIOMEM *vmem = VIRTIO_MEM(opaque);
>>> +
>>> +    vmem->migration_addr = vmem->addr;
>>> +    vmem->migration_block_size = vmem->block_size;
>>
>> You might look at VMSTATE_WITH_TMP; it could avoid you having the dummy
>> fields.
> 
> Thanks, will have a look.

VMSTATE_WITH_TMP looks too complicated for this simple use case. I'll
just drop these migration sanity checks for now.
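
For reference, the idea behind the suggestion is to keep the
compare-only values in a temporary record that exists just for the
duration of a save or load, instead of in dummy fields of the device
struct. Below is a minimal standalone sketch of that pattern. It uses no
QEMU API, and all names (DemoMemDev, DemoMigTmp, demo_tmp_pre_save,
demo_tmp_post_load) are hypothetical; the back-pointer named parent
mirrors, as far as I understand it, what VMSTATE_WITH_TMP expects of its
temporary type:

```c
#include <stdint.h>

/* Hypothetical stand-in for the device state (not QEMU's VirtIOMEM). */
typedef struct DemoMemDev {
    uint64_t addr;
    uint64_t block_size;
} DemoMemDev;

/*
 * Temporary record that lives only while saving or loading. It holds a
 * back-pointer to the device plus the values to be compared.
 */
typedef struct DemoMigTmp {
    DemoMemDev *parent;
    uint64_t addr;
    uint64_t block_size;
} DemoMigTmp;

/* Source side: snapshot the configuration into the temporary record. */
static void demo_tmp_pre_save(DemoMigTmp *tmp)
{
    tmp->addr = tmp->parent->addr;
    tmp->block_size = tmp->parent->block_size;
}

/* Destination side: return -1 if the migrated configuration differs. */
static int demo_tmp_post_load(const DemoMigTmp *tmp)
{
    if (tmp->addr != tmp->parent->addr ||
        tmp->block_size != tmp->parent->block_size) {
        return -1;
    }
    return 0;
}
```

The device struct stays free of migration-only fields, at the price of
an extra type and two callbacks, which is the overhead weighed above.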

Thanks!

-- 
Thanks,

David / dhildenb



end of thread, other threads:[~2020-05-18 14:24 UTC | newest]

Thread overview: 94+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-06  9:49 [PATCH v1 00/17] virtio-mem: Paravirtualized memory hot(un)plug David Hildenbrand
2020-05-06  9:49 ` David Hildenbrand
2020-05-06  9:49 ` [PATCH v1 01/17] exec: Introduce ram_block_discard_set_(unreliable|required)() David Hildenbrand
2020-05-06  9:49   ` David Hildenbrand
2020-05-15  9:54   ` Dr. David Alan Gilbert
2020-05-15  9:54     ` Dr. David Alan Gilbert
2020-05-15 14:40     ` David Hildenbrand
2020-05-15 14:40       ` David Hildenbrand
2020-05-15 14:54   ` David Hildenbrand
2020-05-15 14:54     ` David Hildenbrand
2020-05-15 16:15     ` Dr. David Alan Gilbert
2020-05-15 16:15       ` Dr. David Alan Gilbert
2020-05-06  9:49 ` [PATCH v1 02/17] vfio: Convert to ram_block_discard_set_broken() David Hildenbrand
2020-05-06  9:49   ` David Hildenbrand
2020-05-15 12:01   ` David Hildenbrand
2020-05-15 12:01     ` David Hildenbrand
2020-05-06  9:49 ` [PATCH v1 03/17] accel/kvm: " David Hildenbrand
2020-05-06  9:49   ` David Hildenbrand
2020-05-15 11:57   ` Dr. David Alan Gilbert
2020-05-15 11:57     ` Dr. David Alan Gilbert
2020-05-06  9:49 ` [PATCH v1 04/17] s390x/pv: " David Hildenbrand
2020-05-06  9:49   ` David Hildenbrand
2020-05-06  9:49 ` [PATCH v1 05/17] virtio-balloon: Rip out qemu_balloon_inhibit() David Hildenbrand
2020-05-06  9:49   ` David Hildenbrand
2020-05-15 12:09   ` Dr. David Alan Gilbert
2020-05-15 12:09     ` Dr. David Alan Gilbert
2020-05-15 12:12     ` David Hildenbrand
2020-05-15 12:12       ` David Hildenbrand
2020-05-06  9:49 ` [PATCH v1 06/17] target/i386: sev: Use ram_block_discard_set_broken() David Hildenbrand
2020-05-06  9:49   ` David Hildenbrand
2020-05-15 15:51   ` Dr. David Alan Gilbert
2020-05-15 15:51     ` Dr. David Alan Gilbert
2020-05-06  9:49 ` [PATCH v1 07/17] migration/rdma: " David Hildenbrand
2020-05-06  9:49   ` David Hildenbrand
2020-05-15 12:45   ` Dr. David Alan Gilbert
2020-05-15 12:45     ` Dr. David Alan Gilbert
2020-05-15 14:09     ` David Hildenbrand
2020-05-15 14:09       ` David Hildenbrand
2020-05-15 17:51       ` Dr. David Alan Gilbert
2020-05-15 17:51         ` Dr. David Alan Gilbert
2020-05-15 17:59         ` David Hildenbrand
2020-05-15 17:59           ` David Hildenbrand
2020-05-15 18:36           ` Dr. David Alan Gilbert
2020-05-15 18:36             ` Dr. David Alan Gilbert
2020-05-18 13:52             ` David Hildenbrand
2020-05-18 13:52               ` David Hildenbrand
2020-05-06  9:49 ` [PATCH v1 08/17] migration/colo: " David Hildenbrand
2020-05-06  9:49   ` David Hildenbrand
2020-05-15 13:58   ` Dr. David Alan Gilbert
2020-05-15 13:58     ` Dr. David Alan Gilbert
2020-05-15 14:05     ` David Hildenbrand
2020-05-15 14:05       ` David Hildenbrand
2020-05-06  9:49 ` [PATCH v1 09/17] linux-headers: update to contain virtio-mem David Hildenbrand
2020-05-06  9:49   ` David Hildenbrand
2020-05-06  9:49 ` [PATCH v1 10/17] virtio-mem: Paravirtualized memory hot(un)plug David Hildenbrand
2020-05-06  9:49   ` David Hildenbrand
2020-05-06 16:12   ` Eric Blake
2020-05-06 16:12     ` Eric Blake
2020-05-06 16:14     ` David Hildenbrand
2020-05-06 16:14       ` David Hildenbrand
2020-05-15 15:37   ` Dr. David Alan Gilbert
2020-05-15 15:37     ` Dr. David Alan Gilbert
2020-05-15 16:48     ` David Hildenbrand
2020-05-15 16:48       ` David Hildenbrand
2020-05-18 14:23       ` David Hildenbrand
2020-05-18 14:23         ` David Hildenbrand
2020-05-06  9:49 ` [PATCH v1 11/17] virtio-pci: Proxy for virtio-mem David Hildenbrand
2020-05-06  9:49   ` David Hildenbrand
2020-05-06 18:57   ` Pankaj Gupta
2020-05-06 18:57     ` Pankaj Gupta
2020-05-18 13:34     ` David Hildenbrand
2020-05-18 13:34       ` David Hildenbrand
2020-05-06  9:49 ` [PATCH v1 12/17] MAINTAINERS: Add myself as virtio-mem maintainer David Hildenbrand
2020-05-06  9:49   ` David Hildenbrand
2020-05-15 15:55   ` Dr. David Alan Gilbert
2020-05-15 15:55     ` Dr. David Alan Gilbert
2020-05-06  9:49 ` [PATCH v1 13/17] hmp: Handle virtio-mem when printing memory device info David Hildenbrand
2020-05-06  9:49   ` David Hildenbrand
2020-05-06 19:03   ` Pankaj Gupta
2020-05-06 19:03     ` Pankaj Gupta
2020-05-06  9:49 ` [PATCH v1 14/17] numa: Handle virtio-mem in NUMA stats David Hildenbrand
2020-05-06  9:49   ` David Hildenbrand
2020-05-06  9:49 ` [PATCH v1 15/17] pc: Support for virtio-mem-pci David Hildenbrand
2020-05-06  9:49   ` David Hildenbrand
2020-05-06 12:19   ` Pankaj Gupta
2020-05-06 12:19     ` Pankaj Gupta
2020-05-06  9:49 ` [PATCH v1 16/17] virtio-mem: Allow notifiers for size changes David Hildenbrand
2020-05-06  9:49   ` David Hildenbrand
2020-05-15 16:46   ` Dr. David Alan Gilbert
2020-05-15 16:46     ` Dr. David Alan Gilbert
2020-05-06  9:49 ` [PATCH v1 17/17] virtio-pci: Send qapi events when the virtio-mem " David Hildenbrand
2020-05-06  9:49   ` David Hildenbrand
2020-05-15 15:18   ` David Hildenbrand
2020-05-15 15:18     ` David Hildenbrand
