* [PATCH RFC 00/15] virtio-mem: Expose device memory via separate memslots
From: David Hildenbrand @ 2021-10-13 10:33 UTC
  To: qemu-devel
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	David Hildenbrand, Stefan Hajnoczi, Dr . David Alan Gilbert,
	Peter Xu, Philippe Mathieu-Daudé,
	Igor Mammedov, Ani Sinha, Paolo Bonzini

Based-on: 20211011175346.15499-1-david@redhat.com

A virtio-mem device is represented by a single large RAM memory region
backed by a single large mmap.

Right now, we map that complete memory region into guest physical address
space, resulting in a very large memory mapping, KVM memory slot, etc.,
although only a small amount of memory might actually be exposed to the VM.

For example, when starting a VM with a 1 TiB virtio-mem device that
initially exposes only a small amount of device memory (e.g., 1 GiB) to the
VM, in order to hotplug more memory later, we waste a lot of memory on
metadata for the KVM memory slot (> 2 GiB!) and the accompanying bitmaps.
Although some KVM optimizations to reduce this metadata overhead on x86-64
are being worked on for some cases, it remains a problem with nested VMs,
and there are other reasons why we would want to reduce the total memslot
size to a reasonable minimum.

We want to:
a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
   inside QEMU KVM code where possible.
b) Not always expose all device memory to the VM, to reduce the attack
   surface of malicious VMs without using userfaultfd.

So instead, expose the RAM memory region not via a single large mapping
(consuming one memslot) but via multiple mappings, each consuming one
memslot. To do that, we divide the RAM memory region into separate parts
via aliases and map only the aliases we actually need into a device
container. We have to make sure that QEMU won't silently merge the memory
sections corresponding to the aliases (and thereby also the memslots);
otherwise we would lose the atomic updates with KVM and vhost-user that we
deeply care about when adding/removing memory. Further, such merging is
better avoided to get memslot accounting right.

Within the memslots, virtio-mem can still (un)plug memory dynamically at a
smaller granularity. So memslots are a pure optimization to tackle a) and
b) above.

Right now, memslots are mapped once they fall into the usable device
region (which grows/shrinks on demand, either when more memory is
requested for hotplug or during/after reboots). In the future, with
VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, we'll be able to (un)map aliases even
more dynamically when (un)plugging device blocks.


Adding a 500 GiB virtio-mem device and not hotplugging any memory results in:
    0000000140000000-000001047fffffff (prio 0, i/o): device-memory
      0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots

Requesting the VM to consume 2 GiB results in (note: the usable region size
is bigger than 2 GiB, so 3 * 1 GiB memslots are required):
    0000000140000000-000001047fffffff (prio 0, i/o): device-memory
      0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
        0000000140000000-000000017fffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
        0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
        00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff

Requesting the VM to consume 20 GiB results in:
    0000000140000000-000001047fffffff (prio 0, i/o): device-memory
      0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
        0000000140000000-000000017fffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
        0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
        00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff
        0000000200000000-000000023fffffff (prio 0, ram): alias virtio-mem-memslot-3 @mem0 00000000c0000000-00000000ffffffff
        0000000240000000-000000027fffffff (prio 0, ram): alias virtio-mem-memslot-4 @mem0 0000000100000000-000000013fffffff
        0000000280000000-00000002bfffffff (prio 0, ram): alias virtio-mem-memslot-5 @mem0 0000000140000000-000000017fffffff
        00000002c0000000-00000002ffffffff (prio 0, ram): alias virtio-mem-memslot-6 @mem0 0000000180000000-00000001bfffffff
        0000000300000000-000000033fffffff (prio 0, ram): alias virtio-mem-memslot-7 @mem0 00000001c0000000-00000001ffffffff
        0000000340000000-000000037fffffff (prio 0, ram): alias virtio-mem-memslot-8 @mem0 0000000200000000-000000023fffffff
        0000000380000000-00000003bfffffff (prio 0, ram): alias virtio-mem-memslot-9 @mem0 0000000240000000-000000027fffffff
        00000003c0000000-00000003ffffffff (prio 0, ram): alias virtio-mem-memslot-10 @mem0 0000000280000000-00000002bfffffff
        0000000400000000-000000043fffffff (prio 0, ram): alias virtio-mem-memslot-11 @mem0 00000002c0000000-00000002ffffffff
        0000000440000000-000000047fffffff (prio 0, ram): alias virtio-mem-memslot-12 @mem0 0000000300000000-000000033fffffff
        0000000480000000-00000004bfffffff (prio 0, ram): alias virtio-mem-memslot-13 @mem0 0000000340000000-000000037fffffff
        00000004c0000000-00000004ffffffff (prio 0, ram): alias virtio-mem-memslot-14 @mem0 0000000380000000-00000003bfffffff
        0000000500000000-000000053fffffff (prio 0, ram): alias virtio-mem-memslot-15 @mem0 00000003c0000000-00000003ffffffff
        0000000540000000-000000057fffffff (prio 0, ram): alias virtio-mem-memslot-16 @mem0 0000000400000000-000000043fffffff
        0000000580000000-00000005bfffffff (prio 0, ram): alias virtio-mem-memslot-17 @mem0 0000000440000000-000000047fffffff
        00000005c0000000-00000005ffffffff (prio 0, ram): alias virtio-mem-memslot-18 @mem0 0000000480000000-00000004bfffffff
        0000000600000000-000000063fffffff (prio 0, ram): alias virtio-mem-memslot-19 @mem0 00000004c0000000-00000004ffffffff
        0000000640000000-000000067fffffff (prio 0, ram): alias virtio-mem-memslot-20 @mem0 0000000500000000-000000053fffffff

Requesting the VM to consume 5 GiB and rebooting (note: usable region size
will change during reboots) results in:
    0000000140000000-000001047fffffff (prio 0, i/o): device-memory
      0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
        0000000140000000-000000017fffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
        0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
        00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff
        0000000200000000-000000023fffffff (prio 0, ram): alias virtio-mem-memslot-3 @mem0 00000000c0000000-00000000ffffffff
        0000000240000000-000000027fffffff (prio 0, ram): alias virtio-mem-memslot-4 @mem0 0000000100000000-000000013fffffff
        0000000280000000-00000002bfffffff (prio 0, ram): alias virtio-mem-memslot-5 @mem0 0000000140000000-000000017fffffff


In addition to other factors, we limit the number of memslots to 1024 per
device and require each memslot to be at least 1 GiB in size. So only a
virtio-mem device of 1 TiB or larger could consume 1024 memslots in the
"worst" case. To calculate the memslot limit for a device, we use a
heuristic based on all memslots available for memory devices and the ratio
of "device size" to "total memory device area size". Users can further
limit the maximum number of memslots a device will use by setting the
"max-memslots" property; it's expected to be set to "0" (auto) in most
setups.
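
As a rough worked example of that heuristic (numbers made up; patch 8 has
the actual computation): with 4096 memslots usable for memory devices and
a 256 GiB device in a 1 TiB total memory device area, the ratio is 25%, so
the device would be assigned roughly 0.9 * 0.25 * 4096 = ~921 memslots; at
a minimum of 1 GiB per memslot, such a device could use at most 256 of
them anyway.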

With recent setups (e.g., KVM with ~32k memslots, vhost-user with ~4k
memslots after this series), we'll get the biggest benefit. With special
setups (e.g., older KVM, vhost kernel with 64 memslots), we'll still get
some benefit -- the individual memslots will simply be bigger.

Future work:
- vhost-user and libvhost-user optimizations for handling more memslots
  more efficiently.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Ani Sinha <ani@anisinha.ca>
Cc: Peter Xu <peterx@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Philippe Mathieu-Daudé <f4bug@amsat.org>
Cc: kvm@vger.kernel.org

David Hildenbrand (15):
  memory: Drop mapping check from
    memory_region_get_ram_discard_manager()
  kvm: Return number of free memslots
  vhost: Return number of free memslots
  memory: Allow for marking memory region aliases unmergeable
  vhost: Don't merge unmergeable memory sections
  memory-device: Move memory_device_check_addable() directly into
    memory_device_pre_plug()
  memory-device: Generalize memory_device_used_region_size()
  memory-device: Support memory devices that consume a variable number
    of memslots
  vhost: Respect reserved memslots for memory devices when realizing a
    vhost device
  virtio-mem: Set the RamDiscardManager for the RAM memory region
    earlier
  virtio-mem: Fix typo in virito_mem_intersect_memory_section() function
    name
  virtio-mem: Expose device memory via separate memslots
  vhost-user: Increase VHOST_USER_MAX_RAM_SLOTS to 496 with
    CONFIG_VIRTIO_MEM
  libvhost-user: Increase VHOST_USER_MAX_RAM_SLOTS to 4096
  virtio-mem: Set "max-memslots" to 0 (auto) for the 6.2 machine

 accel/kvm/kvm-all.c                       |  24 ++-
 accel/stubs/kvm-stub.c                    |   4 +-
 hw/core/machine.c                         |   1 +
 hw/mem/memory-device.c                    | 167 +++++++++++++++---
 hw/virtio/vhost-stub.c                    |   4 ++--
 hw/virtio/vhost-user.c                    |   7 +-
 hw/virtio/vhost.c                         |  17 +-
 hw/virtio/virtio-mem-pci.c                |  22 +++
 hw/virtio/virtio-mem.c                    | 202 ++++++++++++++++++++--
 include/exec/memory.h                     |  23 +++
 include/hw/mem/memory-device.h            |  32 ++++
 include/hw/virtio/vhost.h                 |   2 +-
 include/hw/virtio/virtio-mem.h            |  29 +++-
 include/sysemu/kvm.h                      |   2 +-
 softmmu/memory.c                          |  35 +++-
 stubs/qmp_memory_device.c                 |   5 +
 subprojects/libvhost-user/libvhost-user.h |   7 +-
 17 files changed, 500 insertions(+), 83 deletions(-)

-- 
2.31.1




* [PATCH RFC 01/15] memory: Drop mapping check from memory_region_get_ram_discard_manager()
From: David Hildenbrand @ 2021-10-13 10:33 UTC
  To: qemu-devel
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	David Hildenbrand, Stefan Hajnoczi, Dr . David Alan Gilbert,
	Peter Xu, Philippe Mathieu-Daudé,
	Igor Mammedov, Ani Sinha, Paolo Bonzini

It's sufficient to check whether a memory region is RAM; the region
doesn't necessarily have to be mapped into another memory region.
For example, RAM memory regions mapped via an alias will never make
"memory_region_is_mapped()" succeed.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 softmmu/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/softmmu/memory.c b/softmmu/memory.c
index 3bcfc3899b..8669f78395 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -2038,7 +2038,7 @@ int memory_region_iommu_num_indexes(IOMMUMemoryRegion *iommu_mr)
 
 RamDiscardManager *memory_region_get_ram_discard_manager(MemoryRegion *mr)
 {
-    if (!memory_region_is_mapped(mr) || !memory_region_is_ram(mr)) {
+    if (!memory_region_is_ram(mr)) {
         return NULL;
     }
     return mr->rdm;
-- 
2.31.1




* [PATCH RFC 02/15] kvm: Return number of free memslots
From: David Hildenbrand @ 2021-10-13 10:33 UTC
  To: qemu-devel
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	David Hildenbrand, Stefan Hajnoczi, Dr . David Alan Gilbert,
	Peter Xu, Philippe Mathieu-Daudé,
	Igor Mammedov, Ani Sinha, Paolo Bonzini

Let's return the number of free slots instead of only checking whether
there is a free slot. This is required to support memory devices that
consume multiple memslots.
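
For illustration, the kind of caller check this enables (the "required"
count and the error message are made up; a later patch in this series adds
the real check to the memory device code):

    unsigned int required = 3; /* memslots the new device would consume */

    if (kvm_enabled() && kvm_get_free_memslots() < required) {
        error_setg(errp, "KVM does not have enough free memory slots left");
        return;
    }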

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 accel/kvm/kvm-all.c    | 24 +++++++++++-------------
 accel/stubs/kvm-stub.c |  4 ++--
 hw/mem/memory-device.c |  2 +-
 include/sysemu/kvm.h   |  2 +-
 4 files changed, 15 insertions(+), 17 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index db8d83b137..0846be835e 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -103,6 +103,7 @@ struct KVMState
     AccelState parent_obj;
 
     int nr_slots;
+    int nr_free_slots;
     int fd;
     int vmfd;
     int coalesced_mmio;
@@ -245,6 +246,13 @@ int kvm_get_max_memslots(void)
     return s->nr_slots;
 }
 
+unsigned int kvm_get_free_memslots(void)
+{
+    KVMState *s = kvm_state;
+
+    return s->nr_free_slots;
+}
+
 /* Called with KVMMemoryListener.slots_lock held */
 static KVMSlot *kvm_get_free_slot(KVMMemoryListener *kml)
 {
@@ -260,19 +268,6 @@ static KVMSlot *kvm_get_free_slot(KVMMemoryListener *kml)
     return NULL;
 }
 
-bool kvm_has_free_slot(MachineState *ms)
-{
-    KVMState *s = KVM_STATE(ms->accelerator);
-    bool result;
-    KVMMemoryListener *kml = &s->memory_listener;
-
-    kvm_slots_lock();
-    result = !!kvm_get_free_slot(kml);
-    kvm_slots_unlock();
-
-    return result;
-}
-
 /* Called with KVMMemoryListener.slots_lock held */
 static KVMSlot *kvm_alloc_slot(KVMMemoryListener *kml)
 {
@@ -1410,6 +1405,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
             }
             start_addr += slot_size;
             size -= slot_size;
+            kvm_state->nr_free_slots++;
         } while (size);
         goto out;
     }
@@ -1435,6 +1431,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
         ram_start_offset += slot_size;
         ram += slot_size;
         size -= slot_size;
+        kvm_state->nr_free_slots--;
     } while (size);
 
 out:
@@ -2364,6 +2361,7 @@ static int kvm_init(MachineState *ms)
     if (!s->nr_slots) {
         s->nr_slots = 32;
     }
+    s->nr_free_slots = s->nr_slots;
 
     s->nr_as = kvm_check_extension(s, KVM_CAP_MULTI_ADDRESS_SPACE);
     if (s->nr_as <= 1) {
diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
index 5b1d00a222..cbaeb7c656 100644
--- a/accel/stubs/kvm-stub.c
+++ b/accel/stubs/kvm-stub.c
@@ -133,9 +133,9 @@ int kvm_irqchip_remove_irqfd_notifier_gsi(KVMState *s, EventNotifier *n,
     return -ENOSYS;
 }
 
-bool kvm_has_free_slot(MachineState *ms)
+unsigned int kvm_get_free_memslots(void)
 {
-    return false;
+    return 0;
 }
 
 void kvm_init_cpu_signals(CPUState *cpu)
diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
index d9f8301711..9045ead33e 100644
--- a/hw/mem/memory-device.c
+++ b/hw/mem/memory-device.c
@@ -73,7 +73,7 @@ static void memory_device_check_addable(MachineState *ms, uint64_t size,
     uint64_t used_region_size = 0;
 
     /* we will need a new memory slot for kvm and vhost */
-    if (kvm_enabled() && !kvm_has_free_slot(ms)) {
+    if (kvm_enabled() && !kvm_get_free_memslots()) {
         error_setg(errp, "hypervisor has no free memory slots left");
         return;
     }
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index a1ab1ee12d..c18be3cbd5 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -211,7 +211,7 @@ typedef struct Notifier Notifier;
 
 /* external API */
 
-bool kvm_has_free_slot(MachineState *ms);
+unsigned int kvm_get_free_memslots(void);
 bool kvm_has_sync_mmu(void);
 int kvm_has_vcpu_events(void);
 int kvm_has_robust_singlestep(void);
-- 
2.31.1




* [PATCH RFC 03/15] vhost: Return number of free memslots
From: David Hildenbrand @ 2021-10-13 10:33 UTC
  To: qemu-devel
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	David Hildenbrand, Stefan Hajnoczi, Dr . David Alan Gilbert,
	Peter Xu, Philippe Mathieu-Daudé,
	Igor Mammedov, Ani Sinha, Paolo Bonzini

Let's return the number of free slots instead of only checking whether
there is a free slot. This is required to support memory devices that
consume multiple memslots.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/mem/memory-device.c    | 2 +-
 hw/virtio/vhost-stub.c    | 4 ++--
 hw/virtio/vhost.c         | 4 ++--
 include/hw/virtio/vhost.h | 2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
index 9045ead33e..7f76a09e57 100644
--- a/hw/mem/memory-device.c
+++ b/hw/mem/memory-device.c
@@ -77,7 +77,7 @@ static void memory_device_check_addable(MachineState *ms, uint64_t size,
         error_setg(errp, "hypervisor has no free memory slots left");
         return;
     }
-    if (!vhost_has_free_slot()) {
+    if (!vhost_get_free_memslots()) {
         error_setg(errp, "a used vhost backend has no free memory slots left");
         return;
     }
diff --git a/hw/virtio/vhost-stub.c b/hw/virtio/vhost-stub.c
index c175148fce..fe111e5e45 100644
--- a/hw/virtio/vhost-stub.c
+++ b/hw/virtio/vhost-stub.c
@@ -2,7 +2,7 @@
 #include "hw/virtio/vhost.h"
 #include "hw/virtio/vhost-user.h"
 
-bool vhost_has_free_slot(void)
+unsigned int vhost_get_free_memslots(void)
 {
-    return true;
+    return UINT_MAX;
 }
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 437347ad01..2707972870 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -48,7 +48,7 @@ static unsigned int used_memslots;
 static QLIST_HEAD(, vhost_dev) vhost_devices =
     QLIST_HEAD_INITIALIZER(vhost_devices);
 
-bool vhost_has_free_slot(void)
+unsigned int vhost_get_free_memslots(void)
 {
     unsigned int slots_limit = ~0U;
     struct vhost_dev *hdev;
@@ -57,7 +57,7 @@ bool vhost_has_free_slot(void)
         unsigned int r = hdev->vhost_ops->vhost_backend_memslots_limit(hdev);
         slots_limit = MIN(slots_limit, r);
     }
-    return slots_limit > used_memslots;
+    return slots_limit - used_memslots;
 }
 
 static void vhost_dev_sync_region(struct vhost_dev *dev,
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 1a9fc65089..1613af8855 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -128,7 +128,7 @@ uint64_t vhost_get_features(struct vhost_dev *hdev, const int *feature_bits,
                             uint64_t features);
 void vhost_ack_features(struct vhost_dev *hdev, const int *feature_bits,
                         uint64_t features);
-bool vhost_has_free_slot(void);
+unsigned int vhost_get_free_memslots(void);
 
 int vhost_net_set_backend(struct vhost_dev *hdev,
                           struct vhost_vring_file *file);
-- 
2.31.1




* [PATCH RFC 04/15] memory: Allow for marking memory region aliases unmergeable
From: David Hildenbrand @ 2021-10-13 10:33 UTC
  To: qemu-devel
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	David Hildenbrand, Stefan Hajnoczi, Dr . David Alan Gilbert,
	Peter Xu, Philippe Mathieu-Daudé,
	Igor Mammedov, Ani Sinha, Paolo Bonzini

Let's allow for marking memory region aliases unmergeable, to teach
flatview code (and memory listeners like vhost-user) to not merge adjacent
aliases to the same memory region into a larger memory section; instead, we
want separate aliases to stay separate, such that we can atomically
map/unmap aliases without affecting other aliases.

This is a preparation for virtio-mem mapping device memory located
on a RAM memory region via separate aliases into a memory region container,
resulting in separate memslots that can get (un)mapped atomically.

As an example with virtio-mem, the layout looks something like this:
  [...]
  0000000180000000-000002007fffffff (prio 0, i/o): device-memory
     0000000180000000-000001017fffffff (prio 0, i/o): virtio-mem-memslots
       0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
       00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
       0000000200000000-000000023fffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff
  [...]

Right now, flatview code would merge all 3 aliases into a single memory
section. When mapping another alias (e.g., virtio-mem-memslot-3) or when
unmapping any of the mapped aliases, memory listeners would first get
notified about the removal of the big memory section, only to then get
notified about the re-adding of the new (differently merged) memory
section(s).

In an ideal world, memory listeners would be able to deal with that
atomically; however, that is not the case for the most important memory
listeners used in the context of virtio-mem (KVM, vhost-user, vfio), and
supporting atomic updates is quite hard (e.g., for KVM, where we cannot
simply resize or split memory slots due to metadata allocated per slot, or
in virtiofsd, where we cannot simply resize or split an active mmap
mapping). While memslots are temporarily removed, active users (e.g., KVM
VCPUs) can stumble over the missing memslot and essentially crash the
VM.

Further, merged chunks consume fewer memslots, but we might end up
consuming more later, when unmapping chunks splits the bigger chunks into
smaller ones -- which makes memslot accounting for memory devices
problematic as well.

Let's allow for marking a memory region alias unmergeable, such that we
can atomically (un)map aliases to the same memory region, similar to
(un)mapping individual DIMMs.
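
A minimal usage sketch of the new function (assuming an existing RAM
memory region "ram_mr" and container "container"; GiB comes from
"qemu/units.h"):

    MemoryRegion *alias = g_new0(MemoryRegion, 1);

    memory_region_init_alias(alias, OBJECT(dev), "memslot-0", ram_mr,
                             0, 1 * GiB);
    /*
     * Stays a separate FlatRange/memory section, even when adjacent
     * aliases to the same RAM memory region get mapped next to it.
     */
    memory_region_set_alias_unmergeable(alias, true);
    memory_region_add_subregion(container, 0, alias);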

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/exec/memory.h | 23 +++++++++++++++++++++++
 softmmu/memory.c      | 33 +++++++++++++++++++++++++++------
 2 files changed, 50 insertions(+), 6 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 75b4f600e3..d877b80e6e 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -82,6 +82,7 @@ struct ReservedRegion {
  *     relative to the region's address space
  * @readonly: writes to this section are ignored
  * @nonvolatile: this section is non-volatile
+ * @unmergeable: this section should not get merged with adjacent sections
  */
 struct MemoryRegionSection {
     Int128 size;
@@ -91,6 +92,7 @@ struct MemoryRegionSection {
     hwaddr offset_within_address_space;
     bool readonly;
     bool nonvolatile;
+    bool unmergeable;
 };
 
 typedef struct IOMMUTLBEntry IOMMUTLBEntry;
@@ -720,6 +722,7 @@ struct MemoryRegion {
     bool nonvolatile;
     bool rom_device;
     bool flush_coalesced_mmio;
+    bool unmergeable;
     uint8_t dirty_log_mask;
     bool is_iommu;
     RAMBlock *ram_block;
@@ -2272,6 +2275,26 @@ void memory_region_set_size(MemoryRegion *mr, uint64_t size);
 void memory_region_set_alias_offset(MemoryRegion *mr,
                                     hwaddr offset);
 
+/*
+ * memory_region_set_alias_unmergeable: Mark a memory region alias unmergeable
+ *
+ * Mark a memory region alias unmergeable, resulting in multiple adjacent
+ * aliases to the same memory region not getting merged into one memory section
+ * when simplifying the address space and notifying memory listeners.
+ *
+ * Primarily useful on aliases to RAM regions; the target use case is
+ * splitting a RAM memory region via aliases into multiple memslots and
+ * dynamically (un)mapping the aliases into another container memory region.
+ * As resulting memory sections don't cover multiple aliases, memory listeners
+ * will be notified about adding/removing separate aliases, resulting in
+ * individual memslots in KVM, vhost, vfio,... that can be added/removed
+ * atomically when mapping/unmapping the corresponding alias.
+ *
+ * @mr: the #MemoryRegion to be updated
+ * @unmergeable: whether to mark the #MemoryRegion unmergeable
+ */
+void memory_region_set_alias_unmergeable(MemoryRegion *mr, bool unmergeable);
+
 /**
  * memory_region_present: checks if an address relative to a @container
  * translates into #MemoryRegion within @container
diff --git a/softmmu/memory.c b/softmmu/memory.c
index 8669f78395..c92ec8372f 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -223,6 +223,7 @@ struct FlatRange {
     bool romd_mode;
     bool readonly;
     bool nonvolatile;
+    bool unmergeable;
 };
 
 #define FOR_EACH_FLAT_RANGE(var, view)          \
@@ -239,6 +240,7 @@ section_from_flat_range(FlatRange *fr, FlatView *fv)
         .offset_within_address_space = int128_get64(fr->addr.start),
         .readonly = fr->readonly,
         .nonvolatile = fr->nonvolatile,
+        .unmergeable = fr->unmergeable,
     };
 }
 
@@ -249,7 +251,8 @@ static bool flatrange_equal(FlatRange *a, FlatRange *b)
         && a->offset_in_region == b->offset_in_region
         && a->romd_mode == b->romd_mode
         && a->readonly == b->readonly
-        && a->nonvolatile == b->nonvolatile;
+        && a->nonvolatile == b->nonvolatile
+        && a->unmergeable == b->unmergeable;
 }
 
 static FlatView *flatview_new(MemoryRegion *mr_root)
@@ -322,7 +325,8 @@ static bool can_merge(FlatRange *r1, FlatRange *r2)
         && r1->dirty_log_mask == r2->dirty_log_mask
         && r1->romd_mode == r2->romd_mode
         && r1->readonly == r2->readonly
-        && r1->nonvolatile == r2->nonvolatile;
+        && r1->nonvolatile == r2->nonvolatile
+        && !r1->unmergeable && !r2->unmergeable;
 }
 
 /* Attempt to simplify a view by merging adjacent ranges */
@@ -581,7 +585,8 @@ static void render_memory_region(FlatView *view,
                                  Int128 base,
                                  AddrRange clip,
                                  bool readonly,
-                                 bool nonvolatile)
+                                 bool nonvolatile,
+                                 bool unmergeable)
 {
     MemoryRegion *subregion;
     unsigned i;
@@ -598,6 +603,7 @@ static void render_memory_region(FlatView *view,
     int128_addto(&base, int128_make64(mr->addr));
     readonly |= mr->readonly;
     nonvolatile |= mr->nonvolatile;
+    unmergeable |= mr->unmergeable;
 
     tmp = addrrange_make(base, mr->size);
 
@@ -611,14 +617,14 @@ static void render_memory_region(FlatView *view,
         int128_subfrom(&base, int128_make64(mr->alias->addr));
         int128_subfrom(&base, int128_make64(mr->alias_offset));
         render_memory_region(view, mr->alias, base, clip,
-                             readonly, nonvolatile);
+                             readonly, nonvolatile, unmergeable);
         return;
     }
 
     /* Render subregions in priority order. */
     QTAILQ_FOREACH(subregion, &mr->subregions, subregions_link) {
         render_memory_region(view, subregion, base, clip,
-                             readonly, nonvolatile);
+                             readonly, nonvolatile, unmergeable);
     }
 
     if (!mr->terminates) {
@@ -634,6 +640,7 @@ static void render_memory_region(FlatView *view,
     fr.romd_mode = mr->romd_mode;
     fr.readonly = readonly;
     fr.nonvolatile = nonvolatile;
+    fr.unmergeable = unmergeable;
 
     /* Render the region itself into any gaps left by the current view. */
     for (i = 0; i < view->nr && int128_nz(remain); ++i) {
@@ -735,7 +742,7 @@ static FlatView *generate_memory_topology(MemoryRegion *mr)
     if (mr) {
         render_memory_region(view, mr, int128_zero(),
                              addrrange_make(int128_zero(), int128_2_64()),
-                             false, false);
+                             false, false, false);
     }
     flatview_simplify(view);
 
@@ -2634,6 +2641,20 @@ void memory_region_set_alias_offset(MemoryRegion *mr, hwaddr offset)
     memory_region_transaction_commit();
 }
 
+void memory_region_set_alias_unmergeable(MemoryRegion *mr, bool unmergeable)
+{
+    assert(mr->alias);
+
+    if (unmergeable == mr->unmergeable) {
+        return;
+    }
+
+    memory_region_transaction_begin();
+    mr->unmergeable = unmergeable;
+    memory_region_update_pending |= mr->enabled;
+    memory_region_transaction_commit();
+}
+
 uint64_t memory_region_get_alignment(const MemoryRegion *mr)
 {
     return mr->align;
-- 
2.31.1




* [PATCH RFC 05/15] vhost: Don't merge unmergeable memory sections
From: David Hildenbrand @ 2021-10-13 10:33 UTC
  To: qemu-devel
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	David Hildenbrand, Stefan Hajnoczi, Dr . David Alan Gilbert,
	Peter Xu, Philippe Mathieu-Daudé,
	Igor Mammedov, Ani Sinha, Paolo Bonzini

Memory sections that are marked unmergeable should not be merged, to
allow for atomic removal later.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/vhost.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2707972870..49a1074097 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -620,7 +620,7 @@ static void vhost_region_add_section(struct vhost_dev *dev,
                                                mrs_size, mrs_host);
     }
 
-    if (dev->n_tmp_sections) {
+    if (dev->n_tmp_sections && !section->unmergeable) {
         /* Since we already have at least one section, lets see if
          * this extends it; since we're scanning in order, we only
          * have to look at the last one, and the FlatView that calls
@@ -653,7 +653,7 @@ static void vhost_region_add_section(struct vhost_dev *dev,
             size_t offset = mrs_gpa - prev_gpa_start;
 
             if (prev_host_start + offset == mrs_host &&
-                section->mr == prev_sec->mr &&
+                section->mr == prev_sec->mr && !prev_sec->unmergeable &&
                 (!dev->vhost_ops->vhost_backend_can_merge ||
                  dev->vhost_ops->vhost_backend_can_merge(dev,
                     mrs_host, mrs_size,
-- 
2.31.1




* [PATCH RFC 06/15] memory-device: Move memory_device_check_addable() directly into memory_device_pre_plug()
From: David Hildenbrand @ 2021-10-13 10:33 UTC
  To: qemu-devel
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	David Hildenbrand, Stefan Hajnoczi, Dr . David Alan Gilbert,
	Peter Xu, Philippe Mathieu-Daudé,
	Igor Mammedov, Ani Sinha, Paolo Bonzini

Move it out of memory_device_get_free_addr(); this is cleaner and
prepares for future changes.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/mem/memory-device.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
index 7f76a09e57..68a2c3dbcc 100644
--- a/hw/mem/memory-device.c
+++ b/hw/mem/memory-device.c
@@ -67,9 +67,10 @@ static int memory_device_used_region_size(Object *obj, void *opaque)
     return 0;
 }
 
-static void memory_device_check_addable(MachineState *ms, uint64_t size,
+static void memory_device_check_addable(MachineState *ms, MemoryRegion *mr,
                                         Error **errp)
 {
+    const uint64_t size = memory_region_size(mr);
     uint64_t used_region_size = 0;
 
     /* we will need a new memory slot for kvm and vhost */
@@ -99,7 +100,6 @@ static uint64_t memory_device_get_free_addr(MachineState *ms,
                                             uint64_t align, uint64_t size,
                                             Error **errp)
 {
-    Error *err = NULL;
     GSList *list = NULL, *item;
     Range as, new = range_empty;
 
@@ -125,12 +125,6 @@ static uint64_t memory_device_get_free_addr(MachineState *ms,
                     align);
     }
 
-    memory_device_check_addable(ms, size, &err);
-    if (err) {
-        error_propagate(errp, err);
-        return 0;
-    }
-
     if (hint && !QEMU_IS_ALIGNED(*hint, align)) {
         error_setg(errp, "address must be aligned to 0x%" PRIx64 " bytes",
                    align);
@@ -259,6 +253,11 @@ void memory_device_pre_plug(MemoryDeviceState *md, MachineState *ms,
         goto out;
     }
 
+    memory_device_check_addable(ms, mr, &local_err);
+    if (local_err) {
+        goto out;
+    }
+
     if (legacy_align) {
         align = *legacy_align;
     } else {
-- 
2.31.1




* [PATCH RFC 07/15] memory-device: Generalize memory_device_used_region_size()
From: David Hildenbrand @ 2021-10-13 10:33 UTC
  To: qemu-devel
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	David Hildenbrand, Stefan Hajnoczi, Dr . David Alan Gilbert,
	Peter Xu, Philippe Mathieu-Daudé,
	Igor Mammedov, Ani Sinha, Paolo Bonzini

Let's generalize the traversal of all plugged memory devices that collects
information, to prepare for future changes.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/mem/memory-device.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
index 68a2c3dbcc..a915894819 100644
--- a/hw/mem/memory-device.c
+++ b/hw/mem/memory-device.c
@@ -50,20 +50,24 @@ static int memory_device_build_list(Object *obj, void *opaque)
     return 0;
 }
 
-static int memory_device_used_region_size(Object *obj, void *opaque)
+struct memory_devices_info {
+    uint64_t region_size;
+};
+
+static int memory_devices_collect_info(Object *obj, void *opaque)
 {
-    uint64_t *size = opaque;
+    struct memory_devices_info *i = opaque;
 
     if (object_dynamic_cast(obj, TYPE_MEMORY_DEVICE)) {
         const DeviceState *dev = DEVICE(obj);
         const MemoryDeviceState *md = MEMORY_DEVICE(obj);
 
         if (dev->realized) {
-            *size += memory_device_get_region_size(md, &error_abort);
+            i->region_size += memory_device_get_region_size(md, &error_abort);
         }
     }
 
-    object_child_foreach(obj, memory_device_used_region_size, opaque);
+    object_child_foreach(obj, memory_devices_collect_info, opaque);
     return 0;
 }
 
@@ -71,7 +75,7 @@ static void memory_device_check_addable(MachineState *ms, MemoryRegion *mr,
                                         Error **errp)
 {
     const uint64_t size = memory_region_size(mr);
-    uint64_t used_region_size = 0;
+    struct memory_devices_info info = {};
 
     /* we will need a new memory slot for kvm and vhost */
     if (kvm_enabled() && !kvm_get_free_memslots()) {
@@ -84,12 +88,12 @@ static void memory_device_check_addable(MachineState *ms, MemoryRegion *mr,
     }
 
     /* will we exceed the total amount of memory specified */
-    memory_device_used_region_size(OBJECT(ms), &used_region_size);
-    if (used_region_size + size < used_region_size ||
-        used_region_size + size > ms->maxram_size - ms->ram_size) {
+    memory_devices_collect_info(OBJECT(ms), &info);
+    if (info.region_size + size < info.region_size ||
+        info.region_size + size > ms->maxram_size - ms->ram_size) {
         error_setg(errp, "not enough space, currently 0x%" PRIx64
                    " in use of total space for memory devices 0x" RAM_ADDR_FMT,
-                   used_region_size, ms->maxram_size - ms->ram_size);
+                   info.region_size, ms->maxram_size - ms->ram_size);
         return;
     }
 
-- 
2.31.1




* [PATCH RFC 08/15] memory-device: Support memory devices that consume a variable number of memslots
From: David Hildenbrand @ 2021-10-13 10:33 UTC
  To: qemu-devel
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	David Hildenbrand, Stefan Hajnoczi, Dr . David Alan Gilbert,
	Peter Xu, Philippe Mathieu-Daudé,
	Igor Mammedov, Ani Sinha, Paolo Bonzini

We want to support memory devices that have a container as their device
memory region and (dynamically) map individual chunks into that container,
resulting in multiple memslots being consumed by such a device.

We already have one such device: the NVDIMM. However, an NVDIMM also ends
up consuming exactly one memslot.

The target use case will be virtio-mem, which will dynamically map
parts of a source RAM memory region into the container device region
using aliases, consuming one memslot per alias.

We need a way to query from a memory device:
* The number of memslots it currently uses.
* The total number of memslots that might get used across the device's
  lifetime.

Expose some helper functions that will be used by vhost code to respect
the current memslot reservation when realizing vhost devices, and by
virtio-mem to dynamically figure out how many memslots it can use.
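
A sketch of how a memory device might implement the new hooks (the device
type and field names are made up for illustration):

    static unsigned int foo_md_get_used_memslots(const MemoryDeviceState *md,
                                                 Error **errp)
    {
        /* e.g., the number of aliases currently mapped into the container */
        return FOO_MEMORY_DEVICE(md)->nr_mapped_memslots;
    }

    static unsigned int foo_md_get_memslots(const MemoryDeviceState *md,
                                            Error **errp)
    {
        /* the maximum across the device's lifetime; must never change */
        return FOO_MEMORY_DEVICE(md)->nr_memslots;
    }

    static void foo_md_class_init(ObjectClass *oc, void *data)
    {
        MemoryDeviceClass *mdc = MEMORY_DEVICE_CLASS(oc);

        mdc->get_used_memslots = foo_md_get_used_memslots;
        mdc->get_memslots = foo_md_get_memslots;
    }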

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/mem/memory-device.c         | 136 ++++++++++++++++++++++++++++++---
 include/hw/mem/memory-device.h |  32 ++++++++
 stubs/qmp_memory_device.c      |   5 ++
 3 files changed, 164 insertions(+), 9 deletions(-)

diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
index a915894819..5876c90a59 100644
--- a/hw/mem/memory-device.c
+++ b/hw/mem/memory-device.c
@@ -50,8 +50,28 @@ static int memory_device_build_list(Object *obj, void *opaque)
     return 0;
 }
 
+static unsigned int memory_device_get_used_memslots(const MemoryDeviceState *md)
+{
+    const MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(md);
+
+    if (!mdc->get_used_memslots)
+        return 1;
+    return mdc->get_used_memslots(md, &error_abort);
+}
+
+static unsigned int memory_device_get_memslots(const MemoryDeviceState *md)
+{
+    const MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(md);
+
+    if (!mdc->get_memslots)
+        return 1;
+    return mdc->get_memslots(md, &error_abort);
+}
+
 struct memory_devices_info {
     uint64_t region_size;
+    unsigned int used_memslots;
+    unsigned int reserved_memslots;
 };
 
 static int memory_devices_collect_info(Object *obj, void *opaque)
@@ -61,9 +81,15 @@ static int memory_devices_collect_info(Object *obj, void *opaque)
     if (object_dynamic_cast(obj, TYPE_MEMORY_DEVICE)) {
         const DeviceState *dev = DEVICE(obj);
         const MemoryDeviceState *md = MEMORY_DEVICE(obj);
+        unsigned int used, total;
 
         if (dev->realized) {
             i->region_size += memory_device_get_region_size(md, &error_abort);
+
+            used = memory_device_get_used_memslots(md);
+            total = memory_device_get_memslots(md);
+            i->used_memslots += used;
+            i->reserved_memslots += total - used;
         }
     }
 
@@ -71,24 +97,116 @@ static int memory_devices_collect_info(Object *obj, void *opaque)
     return 0;
 }
 
-static void memory_device_check_addable(MachineState *ms, MemoryRegion *mr,
-                                        Error **errp)
+/*
+ * Get the number of memslots that are reserved (not used yet but will get used
+ * dynamically in the future without further checks) by all memory devices.
+ */
+unsigned int memory_devices_get_reserved_memslots(void)
+{
+    struct memory_devices_info info = {};
+
+    object_child_foreach(qdev_get_machine(), memory_devices_collect_info, &info);
+    return info.reserved_memslots;
+}
+
+/*
+ * Calculate, using a heuristic, the maximum number of memslots a memory
+ * device with the given region size may use. Called before/while plugging and
+ * realizing a memory device that can determine the number of memslots to use
+ * dynamically depending on the actual number of available memslots.
+ */
+unsigned int memory_devices_calc_memslot_limit(uint64_t region_size)
+{
+    struct memory_devices_info info = {};
+    MachineState *ms = current_machine;
+    unsigned int total, free, limit;
+    double percent;
+
+    free = vhost_get_free_memslots();
+    if (kvm_enabled()) {
+        free = MIN(free, kvm_get_free_memslots());
+    }
+    object_child_foreach(OBJECT(ms), memory_devices_collect_info, &info);
+
+    /*
+     * Consider all memslots that are used+reserved by memory devices and
+     * can be used for memory devices. This leaves any memslots used for
+     * something else (e.g., initial memory) out of the picture.
+     */
+    total = info.used_memslots + info.reserved_memslots + free;
+
+    /*
+     * Cap the total to something reasonable for now. We don't want to have
+     * infinite memslots or max out the KVM limit ...
+     */
+    total = MIN(4096, total);
+    if (total > info.used_memslots + info.reserved_memslots) {
+        free = total - (info.used_memslots + info.reserved_memslots);
+    } else {
+        free = 0;
+    }
+
+    /*
+     * Simple heuristic: equally distribute the total slots over the whole
+     * device region.
+     */
+    percent = (double)region_size / (ms->maxram_size - ms->ram_size);
+    limit = total * percent;
+
+    /*
+     * However, let's be conservative and prepare for some smaller devices
+     * that consume more memslots-per-byte. Only use 90% of the assigned
+     * percentage.
+     */
+    limit = 0.9 * limit;
+
+    /*
+     * In rare corner cases (especially, appearance of vhost devices after
+     * already plugging memory devices), we might still run into trouble.
+     * Let's try to leave 16 slots around "just in case".
+     */
+    if (limit > free) {
+        if (free > 16) {
+            free = free - 16;
+        } else {
+            free = 0;
+        }
+        limit = MIN(limit, free);
+    }
+    return !limit ? 1 : limit;
+}
+
+static void memory_device_check_addable(MachineState *ms, MemoryDeviceState *md,
+                                        MemoryRegion *mr, Error **errp)
 {
     const uint64_t size = memory_region_size(mr);
     struct memory_devices_info info = {};
+    unsigned int required, reserved;
+
+    memory_devices_collect_info(OBJECT(ms), &info);
+    reserved = info.reserved_memslots;
+    required = memory_device_get_memslots(md);
 
-    /* we will need a new memory slot for kvm and vhost */
-    if (kvm_enabled() && !kvm_get_free_memslots()) {
-        error_setg(errp, "hypervisor has no free memory slots left");
+    /*
+     * All memslots used by memory devices are already subtracted from
+     * the free memslots as reported by kvm and vhost. Memory devices that
+     * use multiple memslots are expected to take proper care (disabling
+     * merging of memory regions) such that used memslots don't end up
+     * actually consuming less right now and might consume more later.
+     */
+    if (kvm_enabled() && kvm_get_free_memslots() < reserved + required) {
+        error_setg(errp, "KVM does not have enough free memory slots left (%u vs. %u)",
+                   required, kvm_get_free_memslots() - reserved);
         return;
     }
-    if (!vhost_get_free_memslots()) {
-        error_setg(errp, "a used vhost backend has no free memory slots left");
+    if (vhost_get_free_memslots() < reserved + required) {
+        error_setg(errp,
+                   "a used vhost backend does not have enough free memory slots left (%u vs. %u)",
+                   required, vhost_get_free_memslots() - reserved);
         return;
     }
 
     /* will we exceed the total amount of memory specified */
-    memory_devices_collect_info(OBJECT(ms), &info);
     if (info.region_size + size < info.region_size ||
         info.region_size + size > ms->maxram_size - ms->ram_size) {
         error_setg(errp, "not enough space, currently 0x%" PRIx64
@@ -257,7 +375,7 @@ void memory_device_pre_plug(MemoryDeviceState *md, MachineState *ms,
         goto out;
     }
 
-    memory_device_check_addable(ms, mr, &local_err);
+    memory_device_check_addable(ms, md, mr, &local_err);
     if (local_err) {
         goto out;
     }
diff --git a/include/hw/mem/memory-device.h b/include/hw/mem/memory-device.h
index 48d2611fc5..fe4387438c 100644
--- a/include/hw/mem/memory-device.h
+++ b/include/hw/mem/memory-device.h
@@ -98,6 +98,36 @@ struct MemoryDeviceClass {
      */
     uint64_t (*get_min_alignment)(const MemoryDeviceState *md);
 
+    /*
+     * Optional: Return the number of used individual memslots (i.e.,
+     * individual RAM mappings) the device has created in the memory region of
+     * the device. The device has to make sure that memslots won't get merged
+     * internally (e.g., by disabling merging of memory region aliases) if the
+     * memory region layout could allow for that.
+     *
+     * If this function is not implemented, we assume the device memory region
+     * is not a container and that there is exactly one memslot.
+     *
+     * Called when plugging the memory device or when iterating over
+     * all realized memory devices to calculate used/reserved/available
+     * memslots.
+     */
+    unsigned int (*get_used_memslots)(const MemoryDeviceState *md, Error **errp);
+
+    /*
+     * Optional: Return the total number of individual memslots
+     * (i.e., individual RAM mappings) the device may create in the memory
+     * region of the device over its lifetime. The result must never change.
+     *
+     * If this function is not implemented, we assume the device memory region
+     * is not a container and that there will be exactly one memslot.
+     *
+     * Called when plugging the memory device or when iterating over
+     * all realized memory devices to calculate used/reserved/available
+     * memslots.
+     */
+    unsigned int (*get_memslots)(const MemoryDeviceState *md, Error **errp);
+
     /*
      * Translate the memory device into #MemoryDeviceInfo.
      */
@@ -113,5 +143,7 @@ void memory_device_plug(MemoryDeviceState *md, MachineState *ms);
 void memory_device_unplug(MemoryDeviceState *md, MachineState *ms);
 uint64_t memory_device_get_region_size(const MemoryDeviceState *md,
                                        Error **errp);
+unsigned int memory_devices_get_reserved_memslots(void);
+unsigned int memory_devices_calc_memslot_limit(uint64_t region_size);
 
 #endif
diff --git a/stubs/qmp_memory_device.c b/stubs/qmp_memory_device.c
index e75cac62dc..318a5d4187 100644
--- a/stubs/qmp_memory_device.c
+++ b/stubs/qmp_memory_device.c
@@ -10,3 +10,8 @@ uint64_t get_plugged_memory_size(void)
 {
     return (uint64_t)-1;
 }
+
+unsigned int memory_devices_get_reserved_memslots(void)
+{
+    return 0;
+}
-- 
2.31.1




* [PATCH RFC 09/15] vhost: Respect reserved memslots for memory devices when realizing a vhost device
From: David Hildenbrand @ 2021-10-13 10:33 UTC
  To: qemu-devel
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	David Hildenbrand, Stefan Hajnoczi, Dr . David Alan Gilbert,
	Peter Xu, Philippe Mathieu-Daudé,
	Igor Mammedov, Ani Sinha, Paolo Bonzini

Make sure that the current reservations can be fulfilled; otherwise, we
might run out of memslots later, once memory devices actually start using
the reserved memslots, and crash.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/vhost.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 49a1074097..b3fa814393 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -23,6 +23,7 @@
 #include "standard-headers/linux/vhost_types.h"
 #include "hw/virtio/virtio-bus.h"
 #include "hw/virtio/virtio-access.h"
+#include "hw/mem/memory-device.h"
 #include "migration/blocker.h"
 #include "migration/qemu-file-types.h"
 #include "sysemu/dma.h"
@@ -1319,7 +1320,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
                    Error **errp)
 {
     uint64_t features;
-    int i, r, n_initialized_vqs = 0;
+    int i, r, reserved_slots, n_initialized_vqs = 0;
 
     hdev->vdev = NULL;
     hdev->migration_blocker = NULL;
@@ -1415,9 +1416,11 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
     memory_listener_register(&hdev->memory_listener, &address_space_memory);
     QLIST_INSERT_HEAD(&vhost_devices, hdev, entry);
 
-    if (used_memslots > hdev->vhost_ops->vhost_backend_memslots_limit(hdev)) {
+    reserved_slots = memory_devices_get_reserved_memslots();
+    if (used_memslots + reserved_slots >
+        hdev->vhost_ops->vhost_backend_memslots_limit(hdev)) {
         error_setg(errp, "vhost backend memory slots limit is less"
-                   " than current number of present memory slots");
+                   " than current number of used and reserved memory slots");
         r = -EINVAL;
         goto fail_busyloop;
     }
-- 
2.31.1




* [PATCH RFC 10/15] virtio-mem: Set the RamDiscardManager for the RAM memory region earlier
From: David Hildenbrand @ 2021-10-13 10:33 UTC
  To: qemu-devel
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	David Hildenbrand, Stefan Hajnoczi, Dr . David Alan Gilbert,
	Peter Xu, Philippe Mathieu-Daudé,
	Igor Mammedov, Ani Sinha, Paolo Bonzini

Let's set the RamDiscardManager earlier, logically before we expose the
RAM memory region to the system. This is a preparation for further changes
and is logically cleaner: before we expose the RAM memory region to
migration code, make sure the RamDiscardManager is set up.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/virtio-mem.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index d5a578142b..b2ad27ed7f 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -773,16 +773,17 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
                 sizeof(struct virtio_mem_config));
     vmem->vq = virtio_add_queue(vdev, 128, virtio_mem_handle_request);
 
-    host_memory_backend_set_mapped(vmem->memdev, true);
-    vmstate_register_ram(&vmem->memdev->mr, DEVICE(vmem));
-    qemu_register_reset(virtio_mem_system_reset, vmem);
-
     /*
-     * Set ourselves as RamDiscardManager before the plug handler maps the
-     * memory region and exposes it via an address space.
+     * Set ourselves as RamDiscardManager before we expose the memory region
+     * to the system (e.g., marking the RAMBlock migratable, mapping the
+     * region).
      */
     memory_region_set_ram_discard_manager(&vmem->memdev->mr,
                                           RAM_DISCARD_MANAGER(vmem));
+
+    host_memory_backend_set_mapped(vmem->memdev, true);
+    vmstate_register_ram(&vmem->memdev->mr, DEVICE(vmem));
+    qemu_register_reset(virtio_mem_system_reset, vmem);
 }
 
 static void virtio_mem_device_unrealize(DeviceState *dev)
@@ -790,14 +791,10 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VirtIOMEM *vmem = VIRTIO_MEM(dev);
 
-    /*
-     * The unplug handler unmapped the memory region, it cannot be
-     * found via an address space anymore. Unset ourselves.
-     */
-    memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
     qemu_unregister_reset(virtio_mem_system_reset, vmem);
     vmstate_unregister_ram(&vmem->memdev->mr, DEVICE(vmem));
     host_memory_backend_set_mapped(vmem->memdev, false);
+    memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
     virtio_del_queue(vdev, 0);
     virtio_cleanup(vdev);
     g_free(vmem->bitmap);
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH RFC 11/15] virtio-mem: Fix typo in virito_mem_intersect_memory_section() function name
  2021-10-13 10:33 [PATCH RFC 00/15] virtio-mem: Expose device memory via separate memslots David Hildenbrand
                   ` (9 preceding siblings ...)
  2021-10-13 10:33 ` [PATCH RFC 10/15] virtio-mem: Set the RamDiscardManager for the RAM memory region earlier David Hildenbrand
@ 2021-10-13 10:33 ` David Hildenbrand
  2021-10-13 10:33 ` [PATCH RFC 12/15] virtio-mem: Expose device memory via separate memslots David Hildenbrand
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2021-10-13 10:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	David Hildenbrand, Stefan Hajnoczi, Dr . David Alan Gilbert,
	Peter Xu, Philippe Mathieu-Daudé,
	Igor Mammedov, Ani Sinha, Paolo Bonzini

It's "virtio", not "virito".

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/virtio-mem.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index b2ad27ed7f..1e29706798 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -177,7 +177,7 @@ static int virtio_mem_for_each_unplugged_range(const VirtIOMEM *vmem, void *arg,
  *
  * Returns false if the intersection is empty, otherwise returns true.
  */
-static bool virito_mem_intersect_memory_section(MemoryRegionSection *s,
+static bool virtio_mem_intersect_memory_section(MemoryRegionSection *s,
                                                 uint64_t offset, uint64_t size)
 {
     uint64_t start = MAX(s->offset_within_region, offset);
@@ -215,7 +215,7 @@ static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
                                       first_bit + 1) - 1;
         size = (last_bit - first_bit + 1) * vmem->block_size;
 
-        if (!virito_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
             break;
         }
         ret = cb(&tmp, arg);
@@ -247,7 +247,7 @@ static int virtio_mem_for_each_unplugged_section(const VirtIOMEM *vmem,
                                  first_bit + 1) - 1;
         size = (last_bit - first_bit + 1) * vmem->block_size;
 
-        if (!virito_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
             break;
         }
         ret = cb(&tmp, arg);
@@ -283,7 +283,7 @@ static void virtio_mem_notify_unplug(VirtIOMEM *vmem, uint64_t offset,
     QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
         MemoryRegionSection tmp = *rdl->section;
 
-        if (!virito_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
             continue;
         }
         rdl->notify_discard(rdl, &tmp);
@@ -299,7 +299,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
     QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
         MemoryRegionSection tmp = *rdl->section;
 
-        if (!virito_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
             continue;
         }
         ret = rdl->notify_populate(rdl, &tmp);
@@ -316,7 +316,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
             if (rdl2 == rdl) {
                 break;
             }
-            if (!virito_mem_intersect_memory_section(&tmp, offset, size)) {
+            if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
                 continue;
             }
             rdl2->notify_discard(rdl2, &tmp);
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH RFC 12/15] virtio-mem: Expose device memory via separate memslots
  2021-10-13 10:33 [PATCH RFC 00/15] virtio-mem: Expose device memory via separate memslots David Hildenbrand
                   ` (10 preceding siblings ...)
  2021-10-13 10:33 ` [PATCH RFC 11/15] virtio-mem: Fix typo in virito_mem_intersect_memory_section() function name David Hildenbrand
@ 2021-10-13 10:33 ` David Hildenbrand
  2021-10-14 11:45   ` Dr. David Alan Gilbert
  2021-10-13 10:33 ` [PATCH RFC 13/15] vhost-user: Increase VHOST_USER_MAX_RAM_SLOTS to 4096 with CONFIG_VIRTIO_MEM David Hildenbrand
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2021-10-13 10:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	David Hildenbrand, Stefan Hajnoczi, Dr . David Alan Gilbert,
	Peter Xu, Philippe Mathieu-Daudé,
	Igor Mammedov, Ani Sinha, Paolo Bonzini

KVM nowadays supports a lot of memslots. We want to exploit that in
virtio-mem, exposing device memory via separate memslots to the guest
on demand. This essentially reduces the total size of KVM slots
significantly (and thereby the metadata in KVM and in QEMU for KVM
memory slots), especially when initially exposing only a small amount
of memory via a virtio-mem device to the guest, to hotplug more later.
Further, not always exposing the full device memory region to the
guest reduces the attack surface in many setups without requiring
other mechanisms like uffd for protection of unplugged memory.

So split the original RAM region via memory region aliases into separate
chunks (ending up as individual memslots), and dynamically map the
required chunks (falling into the usable region) into the container.
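
A toy sketch of that splitting (simplified from the code below; "vmem",
GiB and the memory region API names are as used/introduced by this
series, and the two 1 GiB aliases are chosen arbitrarily):

    MemoryRegion container, alias[2];
    int i;

    memory_region_init(&container, OBJECT(vmem), "virtio-mem-memslots",
                       2 * GiB);
    for (i = 0; i < 2; i++) {
        g_autofree char *name = g_strdup_printf("virtio-mem-memslot-%d", i);

        memory_region_init_alias(&alias[i], OBJECT(vmem), name,
                                 &vmem->memdev->mr, i * GiB, 1 * GiB);
        /* Keep QEMU from merging adjacent sections into one memslot. */
        memory_region_set_alias_unmergeable(&alias[i], true);
        memory_region_add_subregion(&container, i * GiB, &alias[i]);
    }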

For now, we always map the memslots covered by the usable region. In the
future, with VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, we'll be able to map
memslots on actual demand and optimize further.

Users can specify via the "max-memslots" property the maximum number
of memslots the virtio-mem device is allowed to use. "0" translates to
"auto, no limit", in which case the number is determined automatically
using a heuristic. When a maximum (> 1) is specified, that
auto-determined value is capped. The parameter doesn't have to be
migrated and can differ between source and destination. The only
reason the parameter exists is to make some corner-case setups
(multiple large virtio-mem devices assigned to a single virtual NUMA
node with only very limited available memslots, hotplug of vhost
devices) work. The parameter will default to "0" soon, while remaining
"1" for compat machines.

The properties "memslots" and "used-memslots" are read-only.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/virtio-mem-pci.c     |  22 +++++
 hw/virtio/virtio-mem.c         | 173 ++++++++++++++++++++++++++++++++-
 include/hw/virtio/virtio-mem.h |  29 +++++-
 3 files changed, 220 insertions(+), 4 deletions(-)

diff --git a/hw/virtio/virtio-mem-pci.c b/hw/virtio/virtio-mem-pci.c
index be2383b0c5..2c1be2afb7 100644
--- a/hw/virtio/virtio-mem-pci.c
+++ b/hw/virtio/virtio-mem-pci.c
@@ -82,6 +82,20 @@ static uint64_t virtio_mem_pci_get_min_alignment(const MemoryDeviceState *md)
                                     &error_abort);
 }
 
+static unsigned int virtio_mem_pci_get_used_memslots(const MemoryDeviceState *md,
+                                                     Error **errp)
+{
+    return object_property_get_uint(OBJECT(md), VIRTIO_MEM_USED_MEMSLOTS_PROP,
+                                    &error_abort);
+}
+
+static unsigned int virtio_mem_pci_get_memslots(const MemoryDeviceState *md,
+                                                Error **errp)
+{
+    return object_property_get_uint(OBJECT(md), VIRTIO_MEM_MEMSLOTS_PROP,
+                                    &error_abort);
+}
+
 static void virtio_mem_pci_size_change_notify(Notifier *notifier, void *data)
 {
     VirtIOMEMPCI *pci_mem = container_of(notifier, VirtIOMEMPCI,
@@ -115,6 +129,8 @@ static void virtio_mem_pci_class_init(ObjectClass *klass, void *data)
     mdc->get_memory_region = virtio_mem_pci_get_memory_region;
     mdc->fill_device_info = virtio_mem_pci_fill_device_info;
     mdc->get_min_alignment = virtio_mem_pci_get_min_alignment;
+    mdc->get_used_memslots = virtio_mem_pci_get_used_memslots;
+    mdc->get_memslots = virtio_mem_pci_get_memslots;
 }
 
 static void virtio_mem_pci_instance_init(Object *obj)
@@ -142,6 +158,12 @@ static void virtio_mem_pci_instance_init(Object *obj)
     object_property_add_alias(obj, VIRTIO_MEM_REQUESTED_SIZE_PROP,
                               OBJECT(&dev->vdev),
                               VIRTIO_MEM_REQUESTED_SIZE_PROP);
+    object_property_add_alias(obj, VIRTIO_MEM_MEMSLOTS_PROP,
+                              OBJECT(&dev->vdev),
+                              VIRTIO_MEM_MEMSLOTS_PROP);
+    object_property_add_alias(obj, VIRTIO_MEM_USED_MEMSLOTS_PROP,
+                              OBJECT(&dev->vdev),
+                              VIRTIO_MEM_USED_MEMSLOTS_PROP);
 }
 
 static const VirtioPCIDeviceTypeInfo virtio_mem_pci_info = {
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 1e29706798..f7e8f1db83 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -23,6 +23,7 @@
 #include "hw/virtio/virtio-bus.h"
 #include "hw/virtio/virtio-access.h"
 #include "hw/virtio/virtio-mem.h"
+#include "hw/mem/memory-device.h"
 #include "qapi/error.h"
 #include "qapi/visitor.h"
 #include "exec/ram_addr.h"
@@ -500,6 +501,7 @@ static void virtio_mem_resize_usable_region(VirtIOMEM *vmem,
 {
     uint64_t newsize = MIN(memory_region_size(&vmem->memdev->mr),
                            requested_size + VIRTIO_MEM_USABLE_EXTENT);
+    int i;
 
     /* The usable region size always has to be multiples of the block size. */
     newsize = QEMU_ALIGN_UP(newsize, vmem->block_size);
@@ -514,6 +516,25 @@ static void virtio_mem_resize_usable_region(VirtIOMEM *vmem,
 
     trace_virtio_mem_resized_usable_region(vmem->usable_region_size, newsize);
     vmem->usable_region_size = newsize;
+
+    /*
+     * Map all unmapped memslots that cover the usable region and unmap all
+     * remaining mapped ones.
+     */
+    for (i = 0; i < vmem->nb_memslots; i++) {
+        if (vmem->memslot_size * i < vmem->usable_region_size) {
+            if (!memory_region_is_mapped(&vmem->memslots[i])) {
+                memory_region_add_subregion(vmem->mr, vmem->memslot_size * i,
+                                            &vmem->memslots[i]);
+                vmem->nb_used_memslots++;
+            }
+        } else {
+            if (memory_region_is_mapped(&vmem->memslots[i])) {
+                memory_region_del_subregion(vmem->mr, &vmem->memslots[i]);
+                vmem->nb_used_memslots--;
+            }
+        }
+    }
 }
 
 static int virtio_mem_unplug_all(VirtIOMEM *vmem)
@@ -674,6 +695,92 @@ static void virtio_mem_system_reset(void *opaque)
     virtio_mem_unplug_all(vmem);
 }
 
+static void virtio_mem_prepare_mr(VirtIOMEM *vmem)
+{
+    const uint64_t region_size = memory_region_size(&vmem->memdev->mr);
+
+    if (vmem->mr) {
+        return;
+    }
+
+    vmem->mr = g_malloc0(sizeof(*vmem->mr));
+    memory_region_init(vmem->mr, OBJECT(vmem), "virtio-mem-memslots",
+                       region_size);
+    vmem->mr->align = memory_region_get_alignment(&vmem->memdev->mr);
+}
+
+/*
+ * Calculate the number of memslots we'll use based on device properties and
+ * available + already used+reserved memslots for other devices.
+ *
+ * Must not get called after realizing the device.
+ */
+static unsigned int virtio_mem_calc_nb_memslots(uint64_t region_size,
+                                                uint64_t block_size,
+                                                unsigned int user_limit)
+{
+    unsigned int limit = memory_devices_calc_memslot_limit(region_size);
+    uint64_t memslot_size;
+
+    /*
+     * We never use more than 1024 memslots for a single device (relevant only
+     * for devices > 1 TiB).
+     */
+    limit = MIN(limit, 1024);
+
+    /*
+     * We'll never use memslots that are smaller than 1 GiB or smaller than
+     * the block size (and thereby the page size). memslots are always a power
+     * of two.
+     */
+    memslot_size = MAX(1 * GiB, block_size);
+    while (ROUND_UP(region_size, memslot_size) / memslot_size > limit) {
+        memslot_size *= 2;
+    }
+    limit = ROUND_UP(region_size, memslot_size) / memslot_size;
+
+    return !user_limit ? limit : MIN(user_limit, limit);
+}
+
+static void virtio_mem_prepare_memslots(VirtIOMEM *vmem)
+{
+    const uint64_t region_size = memory_region_size(&vmem->memdev->mr);
+    int i;
+
+    if (!vmem->nb_memslots) {
+        vmem->nb_memslots = virtio_mem_calc_nb_memslots(region_size,
+                                                        vmem->block_size,
+                                                        vmem->nb_max_memslots);
+    }
+    if (vmem->nb_memslots == 1) {
+        vmem->memslot_size = region_size;
+    } else {
+        vmem->memslot_size = 1 * GiB;
+        while (ROUND_UP(region_size, vmem->memslot_size) / vmem->memslot_size >
+               vmem->nb_memslots) {
+            vmem->memslot_size *= 2;
+        }
+    }
+
+    /* Create our memslots but don't map them yet -- we'll map dynamically. */
+    vmem->memslots = g_malloc0_n(vmem->nb_memslots, sizeof(*vmem->memslots));
+    for (i = 0; i < vmem->nb_memslots; i++) {
+        const uint64_t size = MIN(vmem->memslot_size,
+                                  region_size - i * vmem->memslot_size);
+        char name[80];
+
+        snprintf(name, sizeof(name), "virtio-mem-memslot-%u", i);
+        memory_region_init_alias(&vmem->memslots[i], OBJECT(vmem), name,
+                                 &vmem->memdev->mr, vmem->memslot_size * i,
+                                 size);
+        /*
+         * We want our aliases to result in separate memory sections and thereby
+         * separate memslots.
+         */
+        memory_region_set_alias_unmergeable(&vmem->memslots[i], true);
+    }
+}
+
 static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
 {
     MachineState *ms = MACHINE(qdev_get_machine());
@@ -763,7 +870,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    virtio_mem_resize_usable_region(vmem, vmem->requested_size, true);
+    virtio_mem_prepare_mr(vmem);
 
     vmem->bitmap_size = memory_region_size(&vmem->memdev->mr) /
                         vmem->block_size;
@@ -780,9 +887,11 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
      */
     memory_region_set_ram_discard_manager(&vmem->memdev->mr,
                                           RAM_DISCARD_MANAGER(vmem));
+    virtio_mem_prepare_memslots(vmem);
 
-    host_memory_backend_set_mapped(vmem->memdev, true);
+    virtio_mem_resize_usable_region(vmem, vmem->requested_size, true);
     vmstate_register_ram(&vmem->memdev->mr, DEVICE(vmem));
+    host_memory_backend_set_mapped(vmem->memdev, true);
     qemu_register_reset(virtio_mem_system_reset, vmem);
 }
 
@@ -794,10 +903,14 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
     qemu_unregister_reset(virtio_mem_system_reset, vmem);
     vmstate_unregister_ram(&vmem->memdev->mr, DEVICE(vmem));
     host_memory_backend_set_mapped(vmem->memdev, false);
+    /* Unmap all memslots. */
+    virtio_mem_resize_usable_region(vmem, 0, true);
+    g_free(vmem->memslots);
     memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
     virtio_del_queue(vdev, 0);
     virtio_cleanup(vdev);
     g_free(vmem->bitmap);
+    g_free(vmem->mr);
     ram_block_coordinated_discard_require(false);
 }
 
@@ -955,7 +1068,8 @@ static MemoryRegion *virtio_mem_get_memory_region(VirtIOMEM *vmem, Error **errp)
         return NULL;
     }
 
-    return &vmem->memdev->mr;
+    virtio_mem_prepare_mr(vmem);
+    return vmem->mr;
 }
 
 static void virtio_mem_add_size_change_notifier(VirtIOMEM *vmem,
@@ -1084,6 +1198,53 @@ static void virtio_mem_set_block_size(Object *obj, Visitor *v, const char *name,
     vmem->block_size = value;
 }
 
+static void virtio_mem_get_used_memslots(Object *obj, Visitor *v,
+                                          const char *name,
+                                          void *opaque, Error **errp)
+{
+    const VirtIOMEM *vmem = VIRTIO_MEM(obj);
+    uint16_t value = vmem->nb_used_memslots;
+
+    visit_type_uint16(v, name, &value, errp);
+}
+
+static void virtio_mem_get_memslots(Object *obj, Visitor *v, const char *name,
+                                    void *opaque, Error **errp)
+{
+    VirtIOMEM *vmem = VIRTIO_MEM(obj);
+    uint16_t value = vmem->nb_memslots;
+
+    /* Determine the final value now, we don't want it to change later.  */
+    if (!vmem->nb_memslots) {
+        uint64_t block_size = vmem->block_size;
+        uint64_t region_size;
+        RAMBlock *rb;
+
+        if (!vmem->memdev || !memory_region_is_ram(&vmem->memdev->mr)) {
+            /* We'll fail realizing later ... */
+            vmem->nb_memslots = 1;
+            goto out;
+        }
+        region_size = memory_region_size(&vmem->memdev->mr);
+        rb = vmem->memdev->mr.ram_block;
+
+        if (!block_size) {
+            block_size = virtio_mem_default_block_size(rb);
+        } else if (block_size < qemu_ram_pagesize(rb)) {
+            /* We'll fail realizing later ... */
+            vmem->nb_memslots = 1;
+            goto out;
+        }
+
+        vmem->nb_memslots = virtio_mem_calc_nb_memslots(region_size,
+                                                        block_size,
+                                                        vmem->nb_max_memslots);
+    }
+out:
+    value = vmem->nb_memslots;
+    visit_type_uint16(v, name, &value, errp);
+}
+
 static void virtio_mem_instance_init(Object *obj)
 {
     VirtIOMEM *vmem = VIRTIO_MEM(obj);
@@ -1099,6 +1260,10 @@ static void virtio_mem_instance_init(Object *obj)
     object_property_add(obj, VIRTIO_MEM_BLOCK_SIZE_PROP, "size",
                         virtio_mem_get_block_size, virtio_mem_set_block_size,
                         NULL, NULL);
+    object_property_add(obj, VIRTIO_MEM_MEMSLOTS_PROP, "uint16",
+                        virtio_mem_get_memslots, NULL, NULL, NULL);
+    object_property_add(obj, VIRTIO_MEM_USED_MEMSLOTS_PROP, "uint16",
+                        virtio_mem_get_used_memslots, NULL, NULL, NULL);
 }
 
 static Property virtio_mem_properties[] = {
@@ -1106,6 +1271,8 @@ static Property virtio_mem_properties[] = {
     DEFINE_PROP_UINT32(VIRTIO_MEM_NODE_PROP, VirtIOMEM, node, 0),
     DEFINE_PROP_LINK(VIRTIO_MEM_MEMDEV_PROP, VirtIOMEM, memdev,
                      TYPE_MEMORY_BACKEND, HostMemoryBackend *),
+    DEFINE_PROP_UINT16(VIRTIO_MEM_MAX_MEMSLOTS_PROP, VirtIOMEM, nb_max_memslots,
+                       1),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/include/hw/virtio/virtio-mem.h b/include/hw/virtio/virtio-mem.h
index a5dd6a493b..3589865871 100644
--- a/include/hw/virtio/virtio-mem.h
+++ b/include/hw/virtio/virtio-mem.h
@@ -30,6 +30,9 @@ OBJECT_DECLARE_TYPE(VirtIOMEM, VirtIOMEMClass,
 #define VIRTIO_MEM_REQUESTED_SIZE_PROP "requested-size"
 #define VIRTIO_MEM_BLOCK_SIZE_PROP "block-size"
 #define VIRTIO_MEM_ADDR_PROP "memaddr"
+#define VIRTIO_MEM_MEMSLOTS_PROP "memslots"
+#define VIRTIO_MEM_USED_MEMSLOTS_PROP "used-memslots"
+#define VIRTIO_MEM_MAX_MEMSLOTS_PROP "max-memslots"
 
 struct VirtIOMEM {
     VirtIODevice parent_obj;
@@ -41,9 +44,33 @@ struct VirtIOMEM {
     int32_t bitmap_size;
     unsigned long *bitmap;
 
-    /* assigned memory backend and memory region */
+    /* Device memory region in which we dynamically map memslots */
+    MemoryRegion *mr;
+
+    /*
+     * Assigned memory backend with the RAM memory region we will split
+     * into memslots to dynamically map them into the device memory region.
+     */
     HostMemoryBackend *memdev;
 
+    /*
+     * Individual memslots we dynamically map that are aliases to the
+     * assigned RAM memory region
+     */
+    MemoryRegion *memslots;
+
+    /* User defined maximum number of memslots we may ever use. */
+    uint16_t nb_max_memslots;
+
+    /* Total number of memslots we're going to use. */
+    uint16_t nb_memslots;
+
+    /* Current number of memslots we're using. */
+    uint16_t nb_used_memslots;
+
+    /* Size of one memslot (the last one might be smaller) */
+    uint64_t memslot_size;
+
     /* NUMA node */
     uint32_t node;
 
-- 
2.31.1
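
The new read-only properties can be read like any other QOM property;
a minimal sketch from C, mirroring what the virtio-mem-pci proxy in
this patch does ("vmem" is assumed to point to a realized virtio-mem
device):

    unsigned int memslots, used_memslots;

    memslots = object_property_get_uint(OBJECT(vmem),
                                        VIRTIO_MEM_MEMSLOTS_PROP,
                                        &error_abort);
    used_memslots = object_property_get_uint(OBJECT(vmem),
                                             VIRTIO_MEM_USED_MEMSLOTS_PROP,
                                             &error_abort);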



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH RFC 13/15] vhost-user: Increase VHOST_USER_MAX_RAM_SLOTS to 4096 with CONFIG_VIRTIO_MEM
  2021-10-13 10:33 [PATCH RFC 00/15] virtio-mem: Expose device memory via separate memslots David Hildenbrand
                   ` (11 preceding siblings ...)
  2021-10-13 10:33 ` [PATCH RFC 12/15] virtio-mem: Expose device memory via separate memslots David Hildenbrand
@ 2021-10-13 10:33 ` David Hildenbrand
  2021-10-13 10:33 ` [PATCH RFC 14/15] libvhost-user: Increase VHOST_USER_MAX_RAM_SLOTS to 4096 David Hildenbrand
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2021-10-13 10:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	David Hildenbrand, Stefan Hajnoczi, Dr . David Alan Gilbert,
	Peter Xu, Philippe Mathieu-Daudé,
	Igor Mammedov, Ani Sinha, Paolo Bonzini

Let's increase the number of slots to 4096 to allow for more
flexibility when dealing with large virtio-mem devices that start out
small.

In the future, we might want to look into some performance improvements,
but for now there isn't really anything stopping us from raising the
limit.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/vhost-user.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 2c8556237f..1c6a720728 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -24,6 +24,7 @@
 #include "sysemu/cryptodev.h"
 #include "migration/migration.h"
 #include "migration/postcopy-ram.h"
+#include CONFIG_DEVICES
 #include "trace.h"
 
 #include <sys/ioctl.h>
@@ -45,8 +46,10 @@
  * the maximum number supported by the target
  * hardware plaform.
  */
-#if defined(TARGET_X86) || defined(TARGET_X86_64) || \
-    defined(TARGET_ARM) || defined(TARGET_ARM_64)
+#if defined(CONFIG_VIRTIO_MEM)
+#define VHOST_USER_MAX_RAM_SLOTS 4096
+#elif defined(TARGET_X86) || defined(TARGET_X86_64) || \
+      defined(TARGET_ARM) || defined(TARGET_ARM_64)
 #include "hw/acpi/acpi.h"
 #define VHOST_USER_MAX_RAM_SLOTS ACPI_MAX_RAM_SLOTS
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH RFC 14/15] libvhost-user: Increase VHOST_USER_MAX_RAM_SLOTS to 4096
  2021-10-13 10:33 [PATCH RFC 00/15] virtio-mem: Expose device memory via separate memslots David Hildenbrand
                   ` (12 preceding siblings ...)
  2021-10-13 10:33 ` [PATCH RFC 13/15] vhost-user: Increase VHOST_USER_MAX_RAM_SLOTS to 4096 with CONFIG_VIRTIO_MEM David Hildenbrand
@ 2021-10-13 10:33 ` David Hildenbrand
  2021-10-13 10:33 ` [PATCH RFC 15/15] virtio-mem: Set "max-memslots" to 0 (auto) for the 6.2 machine David Hildenbrand
  2021-10-13 19:03 ` [PATCH RFC 00/15] virtio-mem: Expose device memory via separate memslots Dr. David Alan Gilbert
  15 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2021-10-13 10:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	David Hildenbrand, Stefan Hajnoczi, Dr . David Alan Gilbert,
	Peter Xu, Philippe Mathieu-Daudé,
	Igor Mammedov, Ani Sinha, Paolo Bonzini

4096 is the maximum we can have right now in QEMU with vhost-user, so
increase the libvhost-user limit as well.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 subprojects/libvhost-user/libvhost-user.h | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.h b/subprojects/libvhost-user/libvhost-user.h
index 3d13dfadde..d9628ed9f0 100644
--- a/subprojects/libvhost-user/libvhost-user.h
+++ b/subprojects/libvhost-user/libvhost-user.h
@@ -30,11 +30,8 @@
 
 #define VHOST_MEMORY_BASELINE_NREGIONS 8
 
-/*
- * Set a reasonable maximum number of ram slots, which will be supported by
- * any architecture.
- */
-#define VHOST_USER_MAX_RAM_SLOTS 32
+/* Set the RAM slots based on the maximum supported by QEMU vhost-user. */
+#define VHOST_USER_MAX_RAM_SLOTS 4096
 
 #define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, payload.u64)
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH RFC 15/15] virtio-mem: Set "max-memslots" to 0 (auto) for the 6.2 machine
  2021-10-13 10:33 [PATCH RFC 00/15] virtio-mem: Expose device memory via separate memslots David Hildenbrand
                   ` (13 preceding siblings ...)
  2021-10-13 10:33 ` [PATCH RFC 14/15] libvhost-user: Increase VHOST_USER_MAX_RAM_SLOTS to 4096 David Hildenbrand
@ 2021-10-13 10:33 ` David Hildenbrand
  2021-10-13 19:03 ` [PATCH RFC 00/15] virtio-mem: Expose device memory via separate memslots Dr. David Alan Gilbert
  15 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2021-10-13 10:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	David Hildenbrand, Stefan Hajnoczi, Dr . David Alan Gilbert,
	Peter Xu, Philippe Mathieu-Daudé,
	Igor Mammedov, Ani Sinha, Paolo Bonzini

Let's enable automatic detection of the number of memslots to use for
the 6.2 machine, leaving the behavior of compat machines unchanged.
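
As a usage sketch (hypothetical IDs): on a 6.2 machine, "-device
virtio-mem-pci,id=vm0,memdev=mem0" now implies "max-memslots=0" (auto),
while a "max-memslots" value given explicitly on the command line
should still override the compat default.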

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/core/machine.c      | 1 +
 hw/virtio/virtio-mem.c | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index b8d95eec32..25aa42cf9f 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -39,6 +39,7 @@
 
 GlobalProperty hw_compat_6_1[] = {
     { "vhost-user-vsock-device", "seqpacket", "off" },
+    { "virtio-mem", "max-memslots", "1" },
 };
 const size_t hw_compat_6_1_len = G_N_ELEMENTS(hw_compat_6_1);
 
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index f7e8f1db83..3de8ed94e6 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -1272,7 +1272,7 @@ static Property virtio_mem_properties[] = {
     DEFINE_PROP_LINK(VIRTIO_MEM_MEMDEV_PROP, VirtIOMEM, memdev,
                      TYPE_MEMORY_BACKEND, HostMemoryBackend *),
     DEFINE_PROP_UINT16(VIRTIO_MEM_MAX_MEMSLOTS_PROP, VirtIOMEM, nb_max_memslots,
-                       1),
+                       0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 00/15] virtio-mem: Expose device memory via separate memslots
  2021-10-13 10:33 [PATCH RFC 00/15] virtio-mem: Expose device memory via separate memslots David Hildenbrand
                   ` (14 preceding siblings ...)
  2021-10-13 10:33 ` [PATCH RFC 15/15] virtio-mem: Set "max-memslots" to 0 (auto) for the 6.2 machine David Hildenbrand
@ 2021-10-13 19:03 ` Dr. David Alan Gilbert
  2021-10-14  7:01   ` David Hildenbrand
  15 siblings, 1 reply; 21+ messages in thread
From: Dr. David Alan Gilbert @ 2021-10-13 19:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	Stefan Hajnoczi, qemu-devel, Peter Xu,
	Philippe Mathieu-Daudé,
	Igor Mammedov, Ani Sinha, Paolo Bonzini

* David Hildenbrand (david@redhat.com) wrote:
> Based-on: 20211011175346.15499-1-david@redhat.com
> 
> A virtio-mem device is represented by a single large RAM memory region
> backed by a single large mmap.
> 
> Right now, we map that complete memory region into guest physical address
> space, resulting in a very large memory mapping, KVM memory slot, ...
> although only a small amount of memory might actually be exposed to the VM.
> 
> For example, when starting a VM with a 1 TiB virtio-mem device that only
> exposes little device memory (e.g., 1 GiB) towards the VM initially,
> in order to hotplug more memory later, we waste a lot of memory on metadata
> for KVM memory slots (> 2 GiB!) and accompanied bitmaps. Although some
> optimizations in KVM are being worked on to reduce this metadata overhead
> on x86-64 in some cases, it remains a problem with nested VMs and there are
> other reasons why we would want to reduce the total memory slot to a
> reasonable minimum.
> 
> We want to:
> a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
>    inside QEMU KVM code where possible.
> b) Not always expose all device-memory to the VM, to reduce the attack
>    surface of malicious VMs without using userfaultfd.
> 
> So instead, expose the RAM memory region not by a single large mapping
> (consuming one memslot) but instead by multiple mappings, each consuming
> one memslot. To do that, we divide the RAM memory region via aliases into
> separate parts and only map the aliases into a device container we actually
> need. We have to make sure that QEMU won't silently merge the memory
> sections corresponding to the aliases (and thereby also memslots),
> otherwise we lose atomic updates with KVM and vhost-user, which we deeply
> care about when adding/removing memory. Further, to get memslot accounting
> right, such merging is better avoided.
> 
> Within the memslots, virtio-mem can (un)plug memory in smaller granularity
> dynamically. So memslots are a pure optimization to tackle a) and b) above.
> 
> Memslots are right now mapped once they fall into the usable device region
> (which grows/shrinks on demand right now either when requesting to
>  hotplug more memory or during/after reboots). In the future, with
> VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, we'll be able to (un)map aliases even
> more dynamically when (un)plugging device blocks.
> 
> 
> Adding a 500GiB virtio-mem device and not hotplugging any memory results in:
>     0000000140000000-000001047fffffff (prio 0, i/o): device-memory
>       0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
> 
> Requesting the VM to consume 2 GiB results in (note: the usable region size
> is bigger than 2 GiB, so 3 * 1 GiB memslots are required):
>     0000000140000000-000001047fffffff (prio 0, i/o): device-memory
>       0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
>         0000000140000000-000000017fffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
>         0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
>         00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff

I've got a vague memory that there were some devices that didn't like
doing split IO across a memory region (or something) - some virtio
devices?  Do you know if that's still true and if that causes a problem?

Dave

> Requesting the VM to consume 20 GiB results in:
>     0000000140000000-000001047fffffff (prio 0, i/o): device-memory
>       0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
>         0000000140000000-000000017fffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
>         0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
>         00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff
>         0000000200000000-000000023fffffff (prio 0, ram): alias virtio-mem-memslot-3 @mem0 00000000c0000000-00000000ffffffff
>         0000000240000000-000000027fffffff (prio 0, ram): alias virtio-mem-memslot-4 @mem0 0000000100000000-000000013fffffff
>         0000000280000000-00000002bfffffff (prio 0, ram): alias virtio-mem-memslot-5 @mem0 0000000140000000-000000017fffffff
>         00000002c0000000-00000002ffffffff (prio 0, ram): alias virtio-mem-memslot-6 @mem0 0000000180000000-00000001bfffffff
>         0000000300000000-000000033fffffff (prio 0, ram): alias virtio-mem-memslot-7 @mem0 00000001c0000000-00000001ffffffff
>         0000000340000000-000000037fffffff (prio 0, ram): alias virtio-mem-memslot-8 @mem0 0000000200000000-000000023fffffff
>         0000000380000000-00000003bfffffff (prio 0, ram): alias virtio-mem-memslot-9 @mem0 0000000240000000-000000027fffffff
>         00000003c0000000-00000003ffffffff (prio 0, ram): alias virtio-mem-memslot-10 @mem0 0000000280000000-00000002bfffffff
>         0000000400000000-000000043fffffff (prio 0, ram): alias virtio-mem-memslot-11 @mem0 00000002c0000000-00000002ffffffff
>         0000000440000000-000000047fffffff (prio 0, ram): alias virtio-mem-memslot-12 @mem0 0000000300000000-000000033fffffff
>         0000000480000000-00000004bfffffff (prio 0, ram): alias virtio-mem-memslot-13 @mem0 0000000340000000-000000037fffffff
>         00000004c0000000-00000004ffffffff (prio 0, ram): alias virtio-mem-memslot-14 @mem0 0000000380000000-00000003bfffffff
>         0000000500000000-000000053fffffff (prio 0, ram): alias virtio-mem-memslot-15 @mem0 00000003c0000000-00000003ffffffff
>         0000000540000000-000000057fffffff (prio 0, ram): alias virtio-mem-memslot-16 @mem0 0000000400000000-000000043fffffff
>         0000000580000000-00000005bfffffff (prio 0, ram): alias virtio-mem-memslot-17 @mem0 0000000440000000-000000047fffffff
>         00000005c0000000-00000005ffffffff (prio 0, ram): alias virtio-mem-memslot-18 @mem0 0000000480000000-00000004bfffffff
>         0000000600000000-000000063fffffff (prio 0, ram): alias virtio-mem-memslot-19 @mem0 00000004c0000000-00000004ffffffff
>         0000000640000000-000000067fffffff (prio 0, ram): alias virtio-mem-memslot-20 @mem0 0000000500000000-000000053fffffff
> 
> Requesting the VM to consume 5 GiB and rebooting (note: usable region size
> will change during reboots) results in:
>     0000000140000000-000001047fffffff (prio 0, i/o): device-memory
>       0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
>         0000000140000000-000000017fffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
>         0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
>         00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff
>         0000000200000000-000000023fffffff (prio 0, ram): alias virtio-mem-memslot-3 @mem0 00000000c0000000-00000000ffffffff
>         0000000240000000-000000027fffffff (prio 0, ram): alias virtio-mem-memslot-4 @mem0 0000000100000000-000000013fffffff
>         0000000280000000-00000002bfffffff (prio 0, ram): alias virtio-mem-memslot-5 @mem0 0000000140000000-000000017fffffff
> 
> 
> In addition to other factors, we limit the number of memslots to 1024 per
> device and the size of one memslot to at least 1 GiB. So only a 1 TiB
> virtio-mem device could consume 1024 memslots in the "worst" case. To
> calculate a memslot limit for a device, we use a heuristic based on all
> available memslots for memory devices and the percentage of
> "device size":"total memory device area size". Users can further limit
> the maximum number of memslots that will be used by a device by setting
> the "max-memslots" property. It's expected to be set to "0" (auto) in most
> setups.
> 
> In recent setups (e.g., KVM with ~32k memslots, vhost-user with ~4k
> memslots after this series), we'll get the biggest benefit. In special
> setups (e.g., older KVM, vhost kernel with 64 memslots), we'll get some
> benefit -- the individual memslots will be bigger.
> 
> Future work:
> - vhost-user and libvhost-user optimizations for handling more memslots
>   more efficiently.
> 
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: Ani Sinha <ani@anisinha.ca>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> Cc: Richard Henderson <richard.henderson@linaro.org>
> Cc: Philippe Mathieu-Daudé <f4bug@amsat.org>
> Cc: kvm@vger.kernel.org
> 
> David Hildenbrand (15):
>   memory: Drop mapping check from
>     memory_region_get_ram_discard_manager()
>   kvm: Return number of free memslots
>   vhost: Return number of free memslots
>   memory: Allow for marking memory region aliases unmergeable
>   vhost: Don't merge unmergeable memory sections
>   memory-device: Move memory_device_check_addable() directly into
>     memory_device_pre_plug()
>   memory-device: Generalize memory_device_used_region_size()
>   memory-device: Support memory devices that consume a variable number
>     of memslots
>   vhost: Respect reserved memslots for memory devices when realizing a
>     vhost device
>   virtio-mem: Set the RamDiscardManager for the RAM memory region
>     earlier
>   virtio-mem: Fix typo in virito_mem_intersect_memory_section() function
>     name
>   virtio-mem: Expose device memory via separate memslots
>   vhost-user: Increase VHOST_USER_MAX_RAM_SLOTS to 4096 with
>     CONFIG_VIRTIO_MEM
>   libvhost-user: Increase VHOST_USER_MAX_RAM_SLOTS to 4096
>   virtio-mem: Set "max-memslots" to 0 (auto) for the 6.2 machine
> 
>  accel/kvm/kvm-all.c                       |  24 ++-
>  accel/stubs/kvm-stub.c                    |   4 +-
>  hw/core/machine.c                         |   1 +
>  hw/mem/memory-device.c                    | 167 +++++++++++++++---
>  hw/virtio/vhost-stub.c                    |   2 +-
>  hw/virtio/vhost-user.c                    |   7 +-
>  hw/virtio/vhost.c                         |  17 +-
>  hw/virtio/virtio-mem-pci.c                |  22 +++
>  hw/virtio/virtio-mem.c                    | 202 ++++++++++++++++++++--
>  include/exec/memory.h                     |  23 +++
>  include/hw/mem/memory-device.h            |  32 ++++
>  include/hw/virtio/vhost.h                 |   2 +-
>  include/hw/virtio/virtio-mem.h            |  29 +++-
>  include/sysemu/kvm.h                      |   2 +-
>  softmmu/memory.c                          |  35 +++-
>  stubs/qmp_memory_device.c                 |   5 +
>  subprojects/libvhost-user/libvhost-user.h |   7 +-
>  17 files changed, 499 insertions(+), 82 deletions(-)
> 
> -- 
> 2.31.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 00/15] virtio-mem: Expose device memory via separate memslots
  2021-10-13 19:03 ` [PATCH RFC 00/15] virtio-mem: Expose device memory via separate memslots Dr. David Alan Gilbert
@ 2021-10-14  7:01   ` David Hildenbrand
  0 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2021-10-14  7:01 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	Stefan Hajnoczi, qemu-devel, Peter Xu,
	Philippe Mathieu-Daudé,
	Igor Mammedov, Ani Sinha, Paolo Bonzini

On 13.10.21 21:03, Dr. David Alan Gilbert wrote:
> * David Hildenbrand (david@redhat.com) wrote:
>> Based-on: 20211011175346.15499-1-david@redhat.com
>>
>> A virtio-mem device is represented by a single large RAM memory region
>> backed by a single large mmap.
>>
>> Right now, we map that complete memory region into guest physical address
>> space, resulting in a very large memory mapping, KVM memory slot, ...
>> although only a small amount of memory might actually be exposed to the VM.
>>
>> For example, when starting a VM with a 1 TiB virtio-mem device that only
>> exposes little device memory (e.g., 1 GiB) towards the VM initially,
>> in order to hotplug more memory later, we waste a lot of memory on metadata
>> for KVM memory slots (> 2 GiB!) and accompanied bitmaps. Although some
>> optimizations in KVM are being worked on to reduce this metadata overhead
>> on x86-64 in some cases, it remains a problem with nested VMs and there are
>> other reasons why we would want to reduce the total memory slot to a
>> reasonable minimum.
>>
>> We want to:
>> a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
>>    inside QEMU KVM code where possible.
>> b) Not always expose all device-memory to the VM, to reduce the attack
>>    surface of malicious VMs without using userfaultfd.
>>
>> So instead, expose the RAM memory region not by a single large mapping
>> (consuming one memslot) but instead by multiple mappings, each consuming
>> one memslot. To do that, we divide the RAM memory region via aliases into
>> separate parts and only map the aliases into a device container we actually
>> need. We have to make sure that QEMU won't silently merge the memory
>> sections corresponding to the aliases (and thereby also memslots),
>> otherwise we lose atomic updates with KVM and vhost-user, which we deeply
>> care about when adding/removing memory. Further, to get memslot accounting
>> right, such merging is better avoided.
>>
>> Within the memslots, virtio-mem can (un)plug memory in smaller granularity
>> dynamically. So memslots are a pure optimization to tackle a) and b) above.
>>
>> Memslots are right now mapped once they fall into the usable device region
>> (which grows/shrinks on demand right now either when requesting to
>>  hotplug more memory or during/after reboots). In the future, with
>> VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, we'll be able to (un)map aliases even
>> more dynamically when (un)plugging device blocks.
>>
>>
>> Adding a 500GiB virtio-mem device and not hotplugging any memory results in:
>>     0000000140000000-000001047fffffff (prio 0, i/o): device-memory
>>       0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
>>
>> Requesting the VM to consume 2 GiB results in (note: the usable region size
>> is bigger than 2 GiB, so 3 * 1 GiB memslots are required):
>>     0000000140000000-000001047fffffff (prio 0, i/o): device-memory
>>       0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
>>         0000000140000000-000000017fffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
>>         0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
>>         00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff
> 
> I've got a vague memory that there were some devices that didn't like
> doing split IO across a memory region (or something) - some virtio
> devices?  Do you know if that's still true and if that causes a problem?

Interesting point! I am not aware of any such issues, and I'd be
surprised if we'd still have such buggy devices, because the layout
virtio-mem now creates is just very similar to the layout we'll
automatically create with ordinary DIMMs.

If we hotplug DIMMs, they will end up consecutive in guest physical
address space while still having separate memory regions and requiring
separate memory slots. So, very similar to a virtio-mem device now.

Maybe the catch is that it's hard to cross memory regions that are,
e.g., >= 128 MiB aligned, because ordinary allocations (e.g., via the
buddy allocator in Linux, which supports <= 4 MiB pages) won't cross
these blocks.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 12/15] virtio-mem: Expose device memory via separate memslots
  2021-10-13 10:33 ` [PATCH RFC 12/15] virtio-mem: Expose device memory via separate memslots David Hildenbrand
@ 2021-10-14 11:45   ` Dr. David Alan Gilbert
  2021-10-14 13:17     ` David Hildenbrand
  0 siblings, 1 reply; 21+ messages in thread
From: Dr. David Alan Gilbert @ 2021-10-14 11:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	Stefan Hajnoczi, qemu-devel, Peter Xu,
	Philippe Mathieu-Daudé,
	Igor Mammedov, Ani Sinha, Paolo Bonzini

* David Hildenbrand (david@redhat.com) wrote:
> KVM nowadays supports a lot of memslots. We want to exploit that in
> virtio-mem, exposing device memory via separate memslots to the guest
> on demand. This essentially reduces the total size of KVM slots
> significantly (and thereby the metadata in KVM and in QEMU for KVM
> memory slots), especially when initially exposing only a small amount
> of memory via a virtio-mem device to the guest, to hotplug more later.
> Further, not always exposing the full device memory region to the
> guest reduces the attack surface in many setups without requiring
> other mechanisms like uffd for protection of unplugged memory.
> 
> So split the original RAM region via memory region aliases into separate
> chunks (ending up as individual memslots), and dynamically map the
> required chunks (falling into the usable region) into the container.
> 
> For now, we always map the memslots covered by the usable region. In the
> future, with VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, we'll be able to map
> memslots on actual demand and optimize further.
> 
> Users can specify via the "max-memslots" property the maximum number
> of memslots the virtio-mem device is allowed to use. "0" translates to
> "auto, no limit", in which case the number is determined automatically
> using a heuristic. When a maximum (> 1) is specified, that
> auto-determined value is capped. The parameter doesn't have to be
> migrated and can differ between source and destination. The only
> reason the parameter exists is to make some corner-case setups
> (multiple large virtio-mem devices assigned to a single virtual NUMA
> node with only very limited available memslots, hotplug of vhost
> devices) work. The parameter will default to "0" soon, while remaining
> "1" for compat machines.
> 
> The properties "memslots" and "used-memslots" are read-only.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

I think you need to move this patch after the vhost-user patches so that
you don't break a bisect including vhost-user.

But I do worry about the effect on vhost-user:
  a) What about external programs like dpdk?
  b) I worry that if you end up with a LOT of slots you end up with a lot
of mmap's and fd's in vhost-user; I'm not quite sure what all the effects
of that will be.

Dave

> ---
>  hw/virtio/virtio-mem-pci.c     |  22 +++++
>  hw/virtio/virtio-mem.c         | 173 ++++++++++++++++++++++++++++++++-
>  include/hw/virtio/virtio-mem.h |  29 +++++-
>  3 files changed, 220 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/virtio/virtio-mem-pci.c b/hw/virtio/virtio-mem-pci.c
> index be2383b0c5..2c1be2afb7 100644
> --- a/hw/virtio/virtio-mem-pci.c
> +++ b/hw/virtio/virtio-mem-pci.c
> @@ -82,6 +82,20 @@ static uint64_t virtio_mem_pci_get_min_alignment(const MemoryDeviceState *md)
>                                      &error_abort);
>  }
>  
> +static unsigned int virtio_mem_pci_get_used_memslots(const MemoryDeviceState *md,
> +                                                     Error **errp)
> +{
> +    return object_property_get_uint(OBJECT(md), VIRTIO_MEM_USED_MEMSLOTS_PROP,
> +                                    &error_abort);
> +}
> +
> +static unsigned int virtio_mem_pci_get_memslots(const MemoryDeviceState *md,
> +                                                Error **errp)
> +{
> +    return object_property_get_uint(OBJECT(md), VIRTIO_MEM_MEMSLOTS_PROP,
> +                                    &error_abort);
> +}
> +
>  static void virtio_mem_pci_size_change_notify(Notifier *notifier, void *data)
>  {
>      VirtIOMEMPCI *pci_mem = container_of(notifier, VirtIOMEMPCI,
> @@ -115,6 +129,8 @@ static void virtio_mem_pci_class_init(ObjectClass *klass, void *data)
>      mdc->get_memory_region = virtio_mem_pci_get_memory_region;
>      mdc->fill_device_info = virtio_mem_pci_fill_device_info;
>      mdc->get_min_alignment = virtio_mem_pci_get_min_alignment;
> +    mdc->get_used_memslots = virtio_mem_pci_get_used_memslots;
> +    mdc->get_memslots = virtio_mem_pci_get_memslots;
>  }
>  
>  static void virtio_mem_pci_instance_init(Object *obj)
> @@ -142,6 +158,12 @@ static void virtio_mem_pci_instance_init(Object *obj)
>      object_property_add_alias(obj, VIRTIO_MEM_REQUESTED_SIZE_PROP,
>                                OBJECT(&dev->vdev),
>                                VIRTIO_MEM_REQUESTED_SIZE_PROP);
> +    object_property_add_alias(obj, VIRTIO_MEM_MEMSLOTS_PROP,
> +                              OBJECT(&dev->vdev),
> +                              VIRTIO_MEM_MEMSLOTS_PROP);
> +    object_property_add_alias(obj, VIRTIO_MEM_USED_MEMSLOTS_PROP,
> +                              OBJECT(&dev->vdev),
> +                              VIRTIO_MEM_USED_MEMSLOTS_PROP);
>  }
>  
>  static const VirtioPCIDeviceTypeInfo virtio_mem_pci_info = {
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index 1e29706798..f7e8f1db83 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -23,6 +23,7 @@
>  #include "hw/virtio/virtio-bus.h"
>  #include "hw/virtio/virtio-access.h"
>  #include "hw/virtio/virtio-mem.h"
> +#include "hw/mem/memory-device.h"
>  #include "qapi/error.h"
>  #include "qapi/visitor.h"
>  #include "exec/ram_addr.h"
> @@ -500,6 +501,7 @@ static void virtio_mem_resize_usable_region(VirtIOMEM *vmem,
>  {
>      uint64_t newsize = MIN(memory_region_size(&vmem->memdev->mr),
>                             requested_size + VIRTIO_MEM_USABLE_EXTENT);
> +    int i;
>  
>      /* The usable region size always has to be multiples of the block size. */
>      newsize = QEMU_ALIGN_UP(newsize, vmem->block_size);
> @@ -514,6 +516,25 @@ static void virtio_mem_resize_usable_region(VirtIOMEM *vmem,
>  
>      trace_virtio_mem_resized_usable_region(vmem->usable_region_size, newsize);
>      vmem->usable_region_size = newsize;
> +
> +    /*
> +     * Map all unmapped memslots that cover the usable region and unmap all
> +     * remaining mapped ones.
> +     */
> +    for (i = 0; i < vmem->nb_memslots; i++) {
> +        if (vmem->memslot_size * i < vmem->usable_region_size) {
> +            if (!memory_region_is_mapped(&vmem->memslots[i])) {
> +                memory_region_add_subregion(vmem->mr, vmem->memslot_size * i,
> +                                            &vmem->memslots[i]);
> +                vmem->nb_used_memslots++;
> +            }
> +        } else {
> +            if (memory_region_is_mapped(&vmem->memslots[i])) {
> +                memory_region_del_subregion(vmem->mr, &vmem->memslots[i]);
> +                vmem->nb_used_memslots--;
> +            }
> +        }
> +    }
>  }
>  
>  static int virtio_mem_unplug_all(VirtIOMEM *vmem)
> @@ -674,6 +695,92 @@ static void virtio_mem_system_reset(void *opaque)
>      virtio_mem_unplug_all(vmem);
>  }
>  
> +static void virtio_mem_prepare_mr(VirtIOMEM *vmem)
> +{
> +    const uint64_t region_size = memory_region_size(&vmem->memdev->mr);
> +
> +    if (vmem->mr) {
> +        return;
> +    }
> +
> +    vmem->mr = g_malloc0(sizeof(*vmem->mr));
> +    memory_region_init(vmem->mr, OBJECT(vmem), "virtio-mem-memslots",
> +                       region_size);
> +    vmem->mr->align = memory_region_get_alignment(&vmem->memdev->mr);
> +}
> +
> +/*
> + * Calculate the number of memslots we'll use based on device properties and
> + * available + already used+reserved memslots for other devices.
> + *
> + * Must not get called after realizing the device.
> + */
> +static unsigned int virtio_mem_calc_nb_memslots(uint64_t region_size,
> +                                                uint64_t block_size,
> +                                                unsigned int user_limit)
> +{
> +    unsigned int limit = memory_devices_calc_memslot_limit(region_size);
> +    uint64_t memslot_size;
> +
> +    /*
> +     * We never use more than 1024 memslots for a single device (relevant only
> +     * for devices > 1 TiB).
> +     */
> +    limit = MIN(limit, 1024);
> +
> +    /*
> +     * We'll never use memslots that are smaller than 1 GiB or smaller than
> +     * the block size (and thereby the page size). memslots are always a power
> +     * of two.
> +     */
> +    memslot_size = MAX(1 * GiB, block_size);
> +    while (ROUND_UP(region_size, memslot_size) / memslot_size > limit) {
> +        memslot_size *= 2;
> +    }
> +    limit = ROUND_UP(region_size, memslot_size) / memslot_size;
> +
> +    return !user_limit ? limit : MIN(user_limit, limit);
> +}
> +
> +static void virtio_mem_prepare_memslots(VirtIOMEM *vmem)
> +{
> +    const uint64_t region_size = memory_region_size(&vmem->memdev->mr);
> +    int i;
> +
> +    if (!vmem->nb_memslots) {
> +        vmem->nb_memslots = virtio_mem_calc_nb_memslots(region_size,
> +                                                        vmem->block_size,
> +                                                        vmem->nb_max_memslots);
> +    }
> +    if (vmem->nb_memslots == 1) {
> +        vmem->memslot_size = region_size;
> +    } else {
> +        vmem->memslot_size = 1 * GiB;
> +        while (ROUND_UP(region_size, vmem->memslot_size) / vmem->memslot_size >
> +               vmem->nb_memslots) {
> +            vmem->memslot_size *= 2;
> +        }
> +    }
> +
> +    /* Create our memslots but don't map them yet -- we'll map dynamically. */
> +    vmem->memslots = g_malloc0_n(vmem->nb_memslots, sizeof(*vmem->memslots));
> +    for (i = 0; i < vmem->nb_memslots; i++) {
> +        const uint64_t size = MIN(vmem->memslot_size,
> +                                  region_size - i * vmem->memslot_size);
> +        char name[80];
> +
> +        snprintf(name, sizeof(name), "virtio-mem-memslot-%u", i);
> +        memory_region_init_alias(&vmem->memslots[i], OBJECT(vmem), name,
> +                                 &vmem->memdev->mr, vmem->memslot_size * i,
> +                                 size);
> +        /*
> +         * We want our aliases to result in separate memory sections and thereby
> +         * separate memslots.
> +         */
> +        memory_region_set_alias_unmergeable(&vmem->memslots[i], true);
> +    }
> +}
> +
>  static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
>  {
>      MachineState *ms = MACHINE(qdev_get_machine());
> @@ -763,7 +870,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>  
> -    virtio_mem_resize_usable_region(vmem, vmem->requested_size, true);
> +    virtio_mem_prepare_mr(vmem);
>  
>      vmem->bitmap_size = memory_region_size(&vmem->memdev->mr) /
>                          vmem->block_size;
> @@ -780,9 +887,11 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
>       */
>      memory_region_set_ram_discard_manager(&vmem->memdev->mr,
>                                            RAM_DISCARD_MANAGER(vmem));
> +    virtio_mem_prepare_memslots(vmem);
>  
> -    host_memory_backend_set_mapped(vmem->memdev, true);
> +    virtio_mem_resize_usable_region(vmem, vmem->requested_size, true);
>      vmstate_register_ram(&vmem->memdev->mr, DEVICE(vmem));
> +    host_memory_backend_set_mapped(vmem->memdev, true);
>      qemu_register_reset(virtio_mem_system_reset, vmem);
>  }
>  
> @@ -794,10 +903,14 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
>      qemu_unregister_reset(virtio_mem_system_reset, vmem);
>      vmstate_unregister_ram(&vmem->memdev->mr, DEVICE(vmem));
>      host_memory_backend_set_mapped(vmem->memdev, false);
> +    /* Unmap all memslots. */
> +    virtio_mem_resize_usable_region(vmem, 0, true);
> +    g_free(vmem->memslots);
>      memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
>      virtio_del_queue(vdev, 0);
>      virtio_cleanup(vdev);
>      g_free(vmem->bitmap);
> +    g_free(vmem->mr);
>      ram_block_coordinated_discard_require(false);
>  }
>  
> @@ -955,7 +1068,8 @@ static MemoryRegion *virtio_mem_get_memory_region(VirtIOMEM *vmem, Error **errp)
>          return NULL;
>      }
>  
> -    return &vmem->memdev->mr;
> +    virtio_mem_prepare_mr(vmem);
> +    return vmem->mr;
>  }
>  
>  static void virtio_mem_add_size_change_notifier(VirtIOMEM *vmem,
> @@ -1084,6 +1198,53 @@ static void virtio_mem_set_block_size(Object *obj, Visitor *v, const char *name,
>      vmem->block_size = value;
>  }
>  
> +static void virtio_mem_get_used_memslots(Object *obj, Visitor *v,
> +                                          const char *name,
> +                                          void *opaque, Error **errp)
> +{
> +    const VirtIOMEM *vmem = VIRTIO_MEM(obj);
> +    uint16_t value = vmem->nb_used_memslots;
> +
> +    visit_type_uint16(v, name, &value, errp);
> +}
> +
> +static void virtio_mem_get_memslots(Object *obj, Visitor *v, const char *name,
> +                                    void *opaque, Error **errp)
> +{
> +    VirtIOMEM *vmem = VIRTIO_MEM(obj);
> +    uint16_t value;
> +
> +    /* Determine the final value now, we don't want it to change later.  */
> +    if (!vmem->nb_memslots) {
> +        uint64_t block_size = vmem->block_size;
> +        uint64_t region_size;
> +        RAMBlock *rb;
> +
> +        if (!vmem->memdev || !memory_region_is_ram(&vmem->memdev->mr)) {
> +            /* We'll fail realizing later ... */
> +            vmem->nb_memslots = 1;
> +            goto out;
> +        }
> +        region_size = memory_region_size(&vmem->memdev->mr);
> +        rb = vmem->memdev->mr.ram_block;
> +
> +        if (!block_size) {
> +            block_size = virtio_mem_default_block_size(rb);
> +        } else if (block_size < qemu_ram_pagesize(rb)) {
> +            /* We'll fail realizing later ... */
> +            vmem->nb_memslots = 1;
> +            goto out;
> +        }
> +
> +        vmem->nb_memslots = virtio_mem_calc_nb_memslots(region_size,
> +                                                        block_size,
> +                                                        vmem->nb_max_memslots);
> +    }
> +out:
> +    value = vmem->nb_memslots;
> +    visit_type_uint16(v, name, &value, errp);
> +}
> +
>  static void virtio_mem_instance_init(Object *obj)
>  {
>      VirtIOMEM *vmem = VIRTIO_MEM(obj);
> @@ -1099,6 +1260,10 @@ static void virtio_mem_instance_init(Object *obj)
>      object_property_add(obj, VIRTIO_MEM_BLOCK_SIZE_PROP, "size",
>                          virtio_mem_get_block_size, virtio_mem_set_block_size,
>                          NULL, NULL);
> +    object_property_add(obj, VIRTIO_MEM_MEMSLOTS_PROP, "uint16",
> +                        virtio_mem_get_memslots, NULL, NULL, NULL);
> +    object_property_add(obj, VIRTIO_MEM_USED_MEMSLOTS_PROP, "uint16",
> +                        virtio_mem_get_used_memslots, NULL, NULL, NULL);
>  }
>  
>  static Property virtio_mem_properties[] = {
> @@ -1106,6 +1271,8 @@ static Property virtio_mem_properties[] = {
>      DEFINE_PROP_UINT32(VIRTIO_MEM_NODE_PROP, VirtIOMEM, node, 0),
>      DEFINE_PROP_LINK(VIRTIO_MEM_MEMDEV_PROP, VirtIOMEM, memdev,
>                       TYPE_MEMORY_BACKEND, HostMemoryBackend *),
> +    DEFINE_PROP_UINT16(VIRTIO_MEM_MAX_MEMSLOTS_PROP, VirtIOMEM, nb_max_memslots,
> +                       1),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> diff --git a/include/hw/virtio/virtio-mem.h b/include/hw/virtio/virtio-mem.h
> index a5dd6a493b..3589865871 100644
> --- a/include/hw/virtio/virtio-mem.h
> +++ b/include/hw/virtio/virtio-mem.h
> @@ -30,6 +30,9 @@ OBJECT_DECLARE_TYPE(VirtIOMEM, VirtIOMEMClass,
>  #define VIRTIO_MEM_REQUESTED_SIZE_PROP "requested-size"
>  #define VIRTIO_MEM_BLOCK_SIZE_PROP "block-size"
>  #define VIRTIO_MEM_ADDR_PROP "memaddr"
> +#define VIRTIO_MEM_MEMSLOTS_PROP "memslots"
> +#define VIRTIO_MEM_USED_MEMSLOTS_PROP "used-memslots"
> +#define VIRTIO_MEM_MAX_MEMSLOTS_PROP "max-memslots"
>  
>  struct VirtIOMEM {
>      VirtIODevice parent_obj;
> @@ -41,9 +44,33 @@ struct VirtIOMEM {
>      int32_t bitmap_size;
>      unsigned long *bitmap;
>  
> -    /* assigned memory backend and memory region */
> +    /* Device memory region in which we dynamically map memslots */
> +    MemoryRegion *mr;
> +
> +    /*
> +     * Assigned memory backend with the RAM memory region we will split
> +     * into memslots to dynamically map them into the device memory region.
> +     */
>      HostMemoryBackend *memdev;
>  
> +    /*
> +     * Individual memslots we dynamically map that are aliases to the
> +     * assigned RAM memory region
> +     */
> +    MemoryRegion *memslots;
> +
> +    /* User defined maximum number of memslots we may ever use. */
> +    uint16_t nb_max_memslots;
> +
> +    /* Total number of memslots we're going to use. */
> +    uint16_t nb_memslots;
> +
> +    /* Current number of memslots we're using. */
> +    uint16_t nb_used_memslots;
> +
> +    /* Size of one memslot (the last one might be smaller) */
> +    uint64_t memslot_size;
> +
>      /* NUMA node */
>      uint32_t node;
>  
> -- 
> 2.31.1
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 12/15] virtio-mem: Expose device memory via separate memslots
  2021-10-14 11:45   ` Dr. David Alan Gilbert
@ 2021-10-14 13:17     ` David Hildenbrand
  2021-10-20 12:17       ` David Hildenbrand
  0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2021-10-14 13:17 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	Stefan Hajnoczi, qemu-devel, Peter Xu,
	Philippe Mathieu-Daudé,
	Igor Mammedov, Ani Sinha, Paolo Bonzini

On 14.10.21 13:45, Dr. David Alan Gilbert wrote:
> * David Hildenbrand (david@redhat.com) wrote:
>> KVM nowadays supports a lot of memslots. We want to exploit that in
>> virtio-mem, exposing device memory via separate memslots to the guest
>> on demand, essentially reducing the total size of KVM slots
>> significantly (and thereby metadata in KVM and in QEMU for KVM memory
>> slots) especially when exposing initially only a small amount of memory
>> via a virtio-mem device to the guest, to hotplug more later. Further,
>> not always exposing the full device memory region to the guest reduces
>> the attack surface in many setups without requiring other mechanisms
>> like uffd for protection of unplugged memory.
>>
>> So split the original RAM region via memory region aliases into separate
>> chunks (ending up as individual memslots), and dynamically map the
>> required chunks (falling into the usable region) into the container.
>>
>> For now, we always map the memslots covered by the usable region. In the
>> future, with VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, we'll be able to map
>> memslots on actual demand and optimize further.
>>
>> Users can specify via the "max-memslots" property how many memslots the
>> virtio-mem device is allowed to use at most. "0" translates to "auto, no
>> limit" and is determined automatically using a heuristic. When a maximum
>> (> 1) is specified, that auto-determined value is capped. The parameter
>> doesn't have to be migrated and can differ between source and destination.
>> The only reason the parameter exists is to make some corner case setups
>> (multiple large virtio-mem devices assigned to a single virtual NUMA node
>>  with only very limited available memslots, hotplug of vhost devices) work.
>> The parameter will default to "0" soon, while it will remain "1" for
>> compat machines.
>>
>> The properties "memslots" and "used-memslots" are read-only.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> I think you need to move this patch after the vhost-user patches so that
> you don't break a bisect including vhost-user.

As the default is 1 and is only changed to 0 ("auto") in the last patch
of this series, there should be (almost) no difference regarding
vhost-user.

> 
> But I do worry about the effect on vhost-user:

The 4096 limit was certainly more of a "let's make it extreme so we
raise some eyebrows and can talk about the implications". I'd be
perfectly happy with 256 or, better, 512. Anything bigger than 32 in
the case of virtiofsd :)

>   a) What about external programs like dpdk?

At least initially, virtio-mem won't apply to dpdk and similar (RT)
workloads. For example, virtio-mem is incompatible with mlock. So I
think the most important use case to optimize for is
virtio-mem+virtiofsd (especially kata).

>   b) I worry if you end up with a LOT of slots you end up with a lot of
> mmap's and fd's in vhost-user; I'm not quite sure what all the effects
> of that will be.

At least for virtio-mem, there will be a small number of fd's, as many
memslots share the same fd, so with virtio-mem it's not an issue.
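
To illustrate that point, a minimal sketch (not code from this series;
assumes Linux with memfd_create()): many memslot-style mappings can be
backed by a single fd at different offsets, so the number of mappings
grows while the number of open files does not.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t slot_size = 2 * 1024 * 1024;  /* 2 MiB, arbitrary */
    const int nb_slots = 16;
    int fd = memfd_create("backing", 0);

    if (fd < 0 || ftruncate(fd, (off_t)nb_slots * slot_size)) {
        perror("setup");
        return 1;
    }
    for (int i = 0; i < nb_slots; i++) {
        /* Each "memslot" is a separate mapping of the same fd. */
        if (mmap(NULL, slot_size, PROT_READ | PROT_WRITE, MAP_SHARED,
                 fd, (off_t)i * slot_size) == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
    }
    close(fd);  /* all mappings remain valid */
    pause();    /* compare /proc/$pid/maps with /proc/$pid/fd */
    return 0;
}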

#VMAs is indeed worth discussing. Usually we can have up to 64k VMAs in
a process. The downside of having many is somewhat reduced page fault
performance. It really also depends on the target application. Maybe
there should be some libvhost-user toggle, where the application can
opt in to allow more?

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 12/15] virtio-mem: Expose device memory via separate memslots
  2021-10-14 13:17     ` David Hildenbrand
@ 2021-10-20 12:17       ` David Hildenbrand
  0 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2021-10-20 12:17 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	Stefan Hajnoczi, qemu-devel, Peter Xu,
	Philippe Mathieu-Daudé,
	Igor Mammedov, Ani Sinha, Paolo Bonzini

On 14.10.21 15:17, David Hildenbrand wrote:
> On 14.10.21 13:45, Dr. David Alan Gilbert wrote:
>> * David Hildenbrand (david@redhat.com) wrote:
>>> KVM nowadays supports a lot of memslots. We want to exploit that in
>>> virtio-mem, exposing device memory via separate memslots to the guest
>>> on demand, essentially reducing the total size of KVM slots
>>> significantly (and thereby metadata in KVM and in QEMU for KVM memory
>>> slots) especially when exposing initially only a small amount of memory
>>> via a virtio-mem device to the guest, to hotplug more later. Further,
>>> not always exposing the full device memory region to the guest reduces
>>> the attack surface in many setups without requiring other mechanisms
>>> like uffd for protection of unplugged memory.
>>>
>>> So split the original RAM region via memory region aliases into separate
>>> chunks (ending up as individual memslots), and dynamically map the
>>> required chunks (falling into the usable region) into the container.
>>>
>>> For now, we always map the memslots covered by the usable region. In the
>>> future, with VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, we'll be able to map
>>> memslots on actual demand and optimize further.
>>>
>>> Users can specify via the "max-memslots" property how many memslots the
>>> virtio-mem device is allowed to use at most. "0" translates to "auto, no
>>> limit" and is determined automatically using a heuristic. When a maximum
>>> (> 1) is specified, that auto-determined value is capped. The parameter
>>> doesn't have to be migrated and can differ between source and destination.
>>> The only reason the parameter exists is to make some corner case setups
>>> (multiple large virtio-mem devices assigned to a single virtual NUMA node
>>>  with only very limited available memslots, hotplug of vhost devices) work.
>>> The parameter will default to "0" soon, while it will remain "1" for
>>> compat machines.
>>>
>>> The properties "memslots" and "used-memslots" are read-only.
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>
>> I think you need to move this patch after the vhost-user patches so that
>> you don't break a bisect including vhost-user.
> 
> As the default is 1 and is only changed to 0 ("auto") in the last patch
> of this series, there should be (almost) no difference regarding
> vhost-user.
> 
>>
>> But I do worry about the effect on vhost-user:
> 
> The 4096 limit was certainly more of a "let's make it extreme so we
> raise some eyebrows and can talk about the implications". I'd be
> perfectly happy with 256 or, better, 512. Anything bigger than 32 in
> the case of virtiofsd :)
> 
>>   a) What about external programs like dpdk?
> 
> At least initially, virtio-mem won't apply to dpdk and similar (RT)
> workloads. For example, virtio-mem is incompatible with mlock. So I
> think the most important use case to optimize for is
> virtio-mem+virtiofsd (especially kata).
> 
>>   b) I worry if you end up with a LOT of slots you end up with a lot of
>> mmap's and fd's in vhost-user; I'm not quite sure what all the effects
>> of that will be.
> 
> At least for virtio-mem, there will be a small number of fd's, as many
> memslots share the same fd, so with virtio-mem it's not an issue.
> 
> #VMAs is indeed worth discussing. Usually we can have up to 64k VMAs in
> a process. The downside of having many is somewhat reduced page fault
> performance. It really also depends on the target application. Maybe
> there should be some libvhost-user toggle, where the application can
> opt in to allow more?
> 

I just did a simple test with memfds. The 1024 open-fd limit
(RLIMIT_NOFILE) does not apply to fds that we have already closed.

So the 1024 limit does not apply when using the pattern:

fd = open()
addr = mmap(fd)
close(fd)

For example, I did a simple test by creating 4096 memfds, mapping each
one, and then closing its fd. The end result is:

$ ls -la /proc/38113/map_files/ | wc -l
4115
$ ls -la /proc/38113/fd/ | wc -l
6

Meaning there are many individual mappings, but only a very limited
number of open files.

That should be precisely what we are doing in the libvhost-user code
(and should be doing in any other vhost-user code -- once we have done
the mmap(), we should let go of the fd).
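
For reference, a rough reconstruction of that test (the actual test
program isn't part of this mail, so the names and sizes here are made
up):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t size = 2 * 1024 * 1024;  /* 2 MiB per memfd */

    for (int i = 0; i < 4096; i++) {
        /* Create a memfd, map it, and close the fd right away. */
        int fd = memfd_create("test", 0);

        if (fd < 0 || ftruncate(fd, size)) {
            perror("setup");
            return 1;
        }
        if (mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                 fd, 0) == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        close(fd);  /* the mapping remains valid after close() */
    }
    /* Pause so /proc/$pid/map_files and /proc/$pid/fd can be inspected. */
    pause();
    return 0;
}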

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2021-10-20 12:19 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-13 10:33 [PATCH RFC 00/15] virtio-mem: Expose device memory via separate memslots David Hildenbrand
2021-10-13 10:33 ` [PATCH RFC 01/15] memory: Drop mapping check from memory_region_get_ram_discard_manager() David Hildenbrand
2021-10-13 10:33 ` [PATCH RFC 02/15] kvm: Return number of free memslots David Hildenbrand
2021-10-13 10:33 ` [PATCH RFC 03/15] vhost: " David Hildenbrand
2021-10-13 10:33 ` [PATCH RFC 04/15] memory: Allow for marking memory region aliases unmergeable David Hildenbrand
2021-10-13 10:33 ` [PATCH RFC 05/15] vhost: Don't merge unmergeable memory sections David Hildenbrand
2021-10-13 10:33 ` [PATCH RFC 06/15] memory-device: Move memory_device_check_addable() directly into memory_device_pre_plug() David Hildenbrand
2021-10-13 10:33 ` [PATCH RFC 07/15] memory-device: Generalize memory_device_used_region_size() David Hildenbrand
2021-10-13 10:33 ` [PATCH RFC 08/15] memory-device: Support memory devices that consume a variable number of memslots David Hildenbrand
2021-10-13 10:33 ` [PATCH RFC 09/15] vhost: Respect reserved memslots for memory devices when realizing a vhost device David Hildenbrand
2021-10-13 10:33 ` [PATCH RFC 10/15] virtio-mem: Set the RamDiscardManager for the RAM memory region earlier David Hildenbrand
2021-10-13 10:33 ` [PATCH RFC 11/15] virtio-mem: Fix typo in virito_mem_intersect_memory_section() function name David Hildenbrand
2021-10-13 10:33 ` [PATCH RFC 12/15] virtio-mem: Expose device memory via separate memslots David Hildenbrand
2021-10-14 11:45   ` Dr. David Alan Gilbert
2021-10-14 13:17     ` David Hildenbrand
2021-10-20 12:17       ` David Hildenbrand
2021-10-13 10:33 ` [PATCH RFC 13/15] vhost-user: Increase VHOST_USER_MAX_RAM_SLOTS to 496 with CONFIG_VIRTIO_MEM David Hildenbrand
2021-10-13 10:33 ` [PATCH RFC 14/15] libvhost-user: Increase VHOST_USER_MAX_RAM_SLOTS to 4096 David Hildenbrand
2021-10-13 10:33 ` [PATCH RFC 15/15] virtio-mem: Set "max-memslots" to 0 (auto) for the 6.2 machine David Hildenbrand
2021-10-13 19:03 ` [PATCH RFC 00/15] virtio-mem: Expose device memory via separate memslots Dr. David Alan Gilbert
2021-10-14  7:01   ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).