* [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots
@ 2021-10-27 12:45 David Hildenbrand
  2021-10-27 12:45 ` [PATCH v1 01/12] kvm: Return number of free memslots David Hildenbrand
                   ` (12 more replies)
  0 siblings, 13 replies; 34+ messages in thread
From: David Hildenbrand @ 2021-10-27 12:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Paolo Bonzini, Eduardo Habkost,
	Marcel Apfelbaum, Michael S. Tsirkin, Igor Mammedov, Ani Sinha,
	Peter Xu, Dr . David Alan Gilbert, Stefan Hajnoczi,
	Richard Henderson, Philippe Mathieu-Daudé,
	Hui Zhu, Sebastien Boeuf, kvm

This is the follow-up of [1], dropping auto-detection and vhost-user
changes from the initial RFC.

Based-on: 20211011175346.15499-1-david@redhat.com

A virtio-mem device is represented by a single large RAM memory region
backed by a single large mmap.

Right now, we map that complete memory region into guest physical address
space, resulting in a very large memory mapping, KVM memory slot, ...
although only a small amount of memory might actually be exposed to the VM.

For example, when starting a VM with a 1 TiB virtio-mem device that only
exposes little device memory (e.g., 1 GiB) towards the VM initially, in
order to hotplug more memory later, we waste a lot of memory on metadata
for KVM memory slots (> 2 GiB!) and accompanying bitmaps. Although some
optimizations in KVM are being worked on to reduce this metadata overhead
on x86-64 in some cases, it remains a problem with nested VMs, and there
are other reasons why we would want to reduce the total memory slot size
to a reasonable minimum.

We want to:
a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
   inside QEMU KVM code where possible.
b) Not always expose all device-memory to the VM, to reduce the attack
   surface of malicious VMs without using userfaultfd.

So instead, expose the RAM memory region not via a single large mapping
(consuming one memslot) but via multiple mappings, each consuming one
memslot. To do that, we divide the RAM memory region into separate parts
via aliases and only map the aliases we actually need into a device
container. We have to make sure that QEMU won't silently merge the memory
sections corresponding to the aliases (and thereby also the memslots);
otherwise we would lose atomic updates with KVM and vhost-user, which we
deeply care about when adding/removing memory. Further, to get memslot
accounting right, such merging is better avoided.
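
As a rough illustration only (this is a sketch, not the actual patch code;
the helper name and its parameters are made up), mapping the idx-th
memslot-sized chunk of the backing RAM region into the device container
could look like this, using the alias API plus the new "unmergeable"
marking introduced later in this series:

    /* Sketch only; would live next to the virtio-mem device code. */
    static void virtio_mem_map_memslot_alias(MemoryRegion *container,
                                             MemoryRegion *ram,
                                             MemoryRegion *alias,
                                             unsigned int idx,
                                             uint64_t memslot_size)
    {
        const hwaddr offset = (hwaddr)idx * memslot_size;

        /* Alias the idx-th chunk of the big backing RAM region ... */
        memory_region_init_alias(alias, OBJECT(container),
                                 "virtio-mem-memslot", ram, offset,
                                 memslot_size);
        /* ... keep it a separate memory section (and thereby memslot) ... */
        memory_region_set_alias_unmergeable(alias, true);
        /* ... and map it into the device container at the same offset. */
        memory_region_add_subregion(container, offset, alias);
    }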

Within the memslots, virtio-mem can dynamically (un)plug memory at a
smaller granularity. So memslots are a pure optimization to tackle a) and
b) above.

The user configures how many memslots a virtio-mem device should use; the
default is "1" -- essentially corresponding to the old behavior.
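
For example, a device as in the examples below could be configured roughly
like this (only the "memslots=" property is introduced by this series; the
remaining backend/device options are just an illustrative setup):

    qemu-system-x86_64 ... \
      -m 4G,maxmem=504G \
      -object memory-backend-ram,id=mem0,size=500G \
      -device virtio-mem-pci,id=vmem0,memdev=mem0,memslots=500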

Right now, memslots are mapped once they fall into the usable device
region (which grows/shrinks on demand, either when requesting to hotplug
more memory or during/after reboots). In the future, with
VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, we'll be able to (un)map aliases even
more dynamically when (un)plugging device blocks.


Adding a 500 GiB virtio-mem device with "memslots=500" and not hotplugging
any memory results in:
    0000000140000000-000001047fffffff (prio 0, i/o): device-memory
      0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots

Requesting the VM to consume 2 GiB results in (note: the usable region size
is bigger than 2 GiB, so 3 * 1 GiB memslots are required):
    0000000140000000-000001047fffffff (prio 0, i/o): device-memory
      0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
        0000000140000000-000000017fffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
        0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
        00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff

Requesting the VM to consume 20 GiB results in:
    0000000140000000-000001047fffffff (prio 0, i/o): device-memory
      0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
        0000000140000000-000000017fffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
        0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
        00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff
        0000000200000000-000000023fffffff (prio 0, ram): alias virtio-mem-memslot-3 @mem0 00000000c0000000-00000000ffffffff
        0000000240000000-000000027fffffff (prio 0, ram): alias virtio-mem-memslot-4 @mem0 0000000100000000-000000013fffffff
        0000000280000000-00000002bfffffff (prio 0, ram): alias virtio-mem-memslot-5 @mem0 0000000140000000-000000017fffffff
        00000002c0000000-00000002ffffffff (prio 0, ram): alias virtio-mem-memslot-6 @mem0 0000000180000000-00000001bfffffff
        0000000300000000-000000033fffffff (prio 0, ram): alias virtio-mem-memslot-7 @mem0 00000001c0000000-00000001ffffffff
        0000000340000000-000000037fffffff (prio 0, ram): alias virtio-mem-memslot-8 @mem0 0000000200000000-000000023fffffff
        0000000380000000-00000003bfffffff (prio 0, ram): alias virtio-mem-memslot-9 @mem0 0000000240000000-000000027fffffff
        00000003c0000000-00000003ffffffff (prio 0, ram): alias virtio-mem-memslot-10 @mem0 0000000280000000-00000002bfffffff
        0000000400000000-000000043fffffff (prio 0, ram): alias virtio-mem-memslot-11 @mem0 00000002c0000000-00000002ffffffff
        0000000440000000-000000047fffffff (prio 0, ram): alias virtio-mem-memslot-12 @mem0 0000000300000000-000000033fffffff
        0000000480000000-00000004bfffffff (prio 0, ram): alias virtio-mem-memslot-13 @mem0 0000000340000000-000000037fffffff
        00000004c0000000-00000004ffffffff (prio 0, ram): alias virtio-mem-memslot-14 @mem0 0000000380000000-00000003bfffffff
        0000000500000000-000000053fffffff (prio 0, ram): alias virtio-mem-memslot-15 @mem0 00000003c0000000-00000003ffffffff
        0000000540000000-000000057fffffff (prio 0, ram): alias virtio-mem-memslot-16 @mem0 0000000400000000-000000043fffffff
        0000000580000000-00000005bfffffff (prio 0, ram): alias virtio-mem-memslot-17 @mem0 0000000440000000-000000047fffffff
        00000005c0000000-00000005ffffffff (prio 0, ram): alias virtio-mem-memslot-18 @mem0 0000000480000000-00000004bfffffff
        0000000600000000-000000063fffffff (prio 0, ram): alias virtio-mem-memslot-19 @mem0 00000004c0000000-00000004ffffffff
        0000000640000000-000000067fffffff (prio 0, ram): alias virtio-mem-memslot-20 @mem0 0000000500000000-000000053fffffff

Requesting the VM to consume 5 GiB and rebooting (note: usable region size
will change during reboots) results in:
    0000000140000000-000001047fffffff (prio 0, i/o): device-memory
      0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
        0000000140000000-000000017fffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
        0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
        00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff
        0000000200000000-000000023fffffff (prio 0, ram): alias virtio-mem-memslot-3 @mem0 00000000c0000000-00000000ffffffff
        0000000240000000-000000027fffffff (prio 0, ram): alias virtio-mem-memslot-4 @mem0 0000000100000000-000000013fffffff
        0000000280000000-00000002bfffffff (prio 0, ram): alias virtio-mem-memslot-5 @mem0 0000000140000000-000000017fffffff


In addition to other factors (e.g., device block size), we limit the number
of memslots to 1024 per device and the size of one memslot to at least
128 MiB. Further, we internally align the memslot size to at least
128 MiB. For now, we limit the total number of memslots that can be used
by memory devices to 2048, so as not to go crazy on individual RAM
mappings in our address spaces.

Future work:
- vhost-user and libvhost-user/vhost-user-backend changes to support more than
  32 memslots.
- "memslots=0" mode to allow for auto-determining the number of memslots to
  use.
- Eventually have an interface to query the memslot limit for a QEMU
  instance. But vhost-* devices complicate that matter.

RFC -> v1:
- Dropped "max-memslots=" parameter and converted to "memslots=" parameter
- Dropped auto-determining the number of memslots to use
- Dropped vhost* memslot changes
- Improved error messages regarding memory slot limits
- Reshuffled, cleaned up patches, rewrote patch descriptions

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Ani Sinha <ani@anisinha.ca>
Cc: Peter Xu <peterx@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Philippe Mathieu-Daudé <f4bug@amsat.org>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Sebastien Boeuf <sebastien.boeuf@intel.com>
Cc: kvm@vger.kernel.org

[1] https://lkml.kernel.org/r/20211013103330.26869-1-david@redhat.com

David Hildenbrand (12):
  kvm: Return number of free memslots
  vhost: Return number of free memslots
  memory: Allow for marking memory region aliases unmergeable
  vhost: Don't merge unmergeable memory sections
  memory-device: Move memory_device_check_addable() directly into
    memory_device_pre_plug()
  memory-device: Generalize memory_device_used_region_size()
  memory-device: Support memory devices that dynamically consume
    multiple memslots
  vhost: Respect reserved memslots for memory devices when realizing a
    vhost device
  memory: Drop mapping check from
    memory_region_get_ram_discard_manager()
  virtio-mem: Fix typo in virito_mem_intersect_memory_section() function
    name
  virtio-mem: Set the RamDiscardManager for the RAM memory region
    earlier
  virtio-mem: Expose device memory via multiple memslots

 accel/kvm/kvm-all.c            |  24 ++--
 accel/stubs/kvm-stub.c         |   4 +-
 hw/mem/memory-device.c         | 115 ++++++++++++++----
 hw/virtio/vhost-stub.c         |   2 +-
 hw/virtio/vhost.c              |  21 ++--
 hw/virtio/virtio-mem-pci.c     |  23 ++++
 hw/virtio/virtio-mem.c         | 212 +++++++++++++++++++++++++++++----
 include/exec/memory.h          |  23 ++++
 include/hw/mem/memory-device.h |  33 +++++
 include/hw/virtio/vhost.h      |   2 +-
 include/hw/virtio/virtio-mem.h |  25 +++-
 include/sysemu/kvm.h           |   2 +-
 softmmu/memory.c               |  35 ++++--
 stubs/qmp_memory_device.c      |   5 +
 14 files changed, 449 insertions(+), 77 deletions(-)

-- 
2.31.1



* [PATCH v1 01/12] kvm: Return number of free memslots
  2021-10-27 12:45 [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots David Hildenbrand
@ 2021-10-27 12:45 ` David Hildenbrand
  2021-10-27 12:45 ` [PATCH v1 02/12] vhost: " David Hildenbrand
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2021-10-27 12:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Paolo Bonzini, Eduardo Habkost,
	Marcel Apfelbaum, Michael S. Tsirkin, Igor Mammedov, Ani Sinha,
	Peter Xu, Dr . David Alan Gilbert, Stefan Hajnoczi,
	Richard Henderson, Philippe Mathieu-Daudé,
	Hui Zhu, Sebastien Boeuf, kvm

Let's return the number of free slots instead of only checking if there
is a free slot. Required to support memory devices that consume multiple
memslots.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 accel/kvm/kvm-all.c    | 24 +++++++++++-------------
 accel/stubs/kvm-stub.c |  4 ++--
 hw/mem/memory-device.c |  2 +-
 include/sysemu/kvm.h   |  2 +-
 4 files changed, 15 insertions(+), 17 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index db8d83b137..0846be835e 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -103,6 +103,7 @@ struct KVMState
     AccelState parent_obj;
 
     int nr_slots;
+    int nr_free_slots;
     int fd;
     int vmfd;
     int coalesced_mmio;
@@ -245,6 +246,13 @@ int kvm_get_max_memslots(void)
     return s->nr_slots;
 }
 
+unsigned int kvm_get_free_memslots(void)
+{
+    KVMState *s = kvm_state;
+
+    return s->nr_free_slots;
+}
+
 /* Called with KVMMemoryListener.slots_lock held */
 static KVMSlot *kvm_get_free_slot(KVMMemoryListener *kml)
 {
@@ -260,19 +268,6 @@ static KVMSlot *kvm_get_free_slot(KVMMemoryListener *kml)
     return NULL;
 }
 
-bool kvm_has_free_slot(MachineState *ms)
-{
-    KVMState *s = KVM_STATE(ms->accelerator);
-    bool result;
-    KVMMemoryListener *kml = &s->memory_listener;
-
-    kvm_slots_lock();
-    result = !!kvm_get_free_slot(kml);
-    kvm_slots_unlock();
-
-    return result;
-}
-
 /* Called with KVMMemoryListener.slots_lock held */
 static KVMSlot *kvm_alloc_slot(KVMMemoryListener *kml)
 {
@@ -1410,6 +1405,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
             }
             start_addr += slot_size;
             size -= slot_size;
+            kvm_state->nr_free_slots++;
         } while (size);
         goto out;
     }
@@ -1435,6 +1431,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
         ram_start_offset += slot_size;
         ram += slot_size;
         size -= slot_size;
+        kvm_state->nr_free_slots--;
     } while (size);
 
 out:
@@ -2364,6 +2361,7 @@ static int kvm_init(MachineState *ms)
     if (!s->nr_slots) {
         s->nr_slots = 32;
     }
+    s->nr_free_slots = s->nr_slots;
 
     s->nr_as = kvm_check_extension(s, KVM_CAP_MULTI_ADDRESS_SPACE);
     if (s->nr_as <= 1) {
diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
index 5b1d00a222..cbaeb7c656 100644
--- a/accel/stubs/kvm-stub.c
+++ b/accel/stubs/kvm-stub.c
@@ -133,9 +133,9 @@ int kvm_irqchip_remove_irqfd_notifier_gsi(KVMState *s, EventNotifier *n,
     return -ENOSYS;
 }
 
-bool kvm_has_free_slot(MachineState *ms)
+unsigned int kvm_get_free_memslots(void)
 {
-    return false;
+    return 0;
 }
 
 void kvm_init_cpu_signals(CPUState *cpu)
diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
index d9f8301711..9045ead33e 100644
--- a/hw/mem/memory-device.c
+++ b/hw/mem/memory-device.c
@@ -73,7 +73,7 @@ static void memory_device_check_addable(MachineState *ms, uint64_t size,
     uint64_t used_region_size = 0;
 
     /* we will need a new memory slot for kvm and vhost */
-    if (kvm_enabled() && !kvm_has_free_slot(ms)) {
+    if (kvm_enabled() && !kvm_get_free_memslots()) {
         error_setg(errp, "hypervisor has no free memory slots left");
         return;
     }
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index a1ab1ee12d..c18be3cbd5 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -211,7 +211,7 @@ typedef struct Notifier Notifier;
 
 /* external API */
 
-bool kvm_has_free_slot(MachineState *ms);
+unsigned int kvm_get_free_memslots(void);
 bool kvm_has_sync_mmu(void);
 int kvm_has_vcpu_events(void);
 int kvm_has_robust_singlestep(void);
-- 
2.31.1



* [PATCH v1 02/12] vhost: Return number of free memslots
  2021-10-27 12:45 [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots David Hildenbrand
  2021-10-27 12:45 ` [PATCH v1 01/12] kvm: Return number of free memslots David Hildenbrand
@ 2021-10-27 12:45 ` David Hildenbrand
  2021-10-27 13:36   ` Philippe Mathieu-Daudé
  2021-10-27 12:45 ` [PATCH v1 03/12] memory: Allow for marking memory region aliases unmergeable David Hildenbrand
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2021-10-27 12:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Paolo Bonzini, Eduardo Habkost,
	Marcel Apfelbaum, Michael S. Tsirkin, Igor Mammedov, Ani Sinha,
	Peter Xu, Dr . David Alan Gilbert, Stefan Hajnoczi,
	Richard Henderson, Philippe Mathieu-Daudé,
	Hui Zhu, Sebastien Boeuf, kvm

Let's return the number of free slots instead of only checking if there
is a free slot. Required to support memory devices that consume multiple
memslots.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/mem/memory-device.c    | 2 +-
 hw/virtio/vhost-stub.c    | 2 +-
 hw/virtio/vhost.c         | 4 ++--
 include/hw/virtio/vhost.h | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
index 9045ead33e..7f76a09e57 100644
--- a/hw/mem/memory-device.c
+++ b/hw/mem/memory-device.c
@@ -77,7 +77,7 @@ static void memory_device_check_addable(MachineState *ms, uint64_t size,
         error_setg(errp, "hypervisor has no free memory slots left");
         return;
     }
-    if (!vhost_has_free_slot()) {
+    if (!vhost_get_free_memslots()) {
         error_setg(errp, "a used vhost backend has no free memory slots left");
         return;
     }
diff --git a/hw/virtio/vhost-stub.c b/hw/virtio/vhost-stub.c
index c175148fce..fe111e5e45 100644
--- a/hw/virtio/vhost-stub.c
+++ b/hw/virtio/vhost-stub.c
@@ -2,7 +2,7 @@
 #include "hw/virtio/vhost.h"
 #include "hw/virtio/vhost-user.h"
 
-bool vhost_has_free_slot(void)
+unsigned int vhost_get_free_memslots(void)
 {
     return true;
 }
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 437347ad01..2707972870 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -48,7 +48,7 @@ static unsigned int used_memslots;
 static QLIST_HEAD(, vhost_dev) vhost_devices =
     QLIST_HEAD_INITIALIZER(vhost_devices);
 
-bool vhost_has_free_slot(void)
+unsigned int vhost_get_free_memslots(void)
 {
     unsigned int slots_limit = ~0U;
     struct vhost_dev *hdev;
@@ -57,7 +57,7 @@ bool vhost_has_free_slot(void)
         unsigned int r = hdev->vhost_ops->vhost_backend_memslots_limit(hdev);
         slots_limit = MIN(slots_limit, r);
     }
-    return slots_limit > used_memslots;
+    return slots_limit - used_memslots;
 }
 
 static void vhost_dev_sync_region(struct vhost_dev *dev,
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 3fa0b554ef..9d59fc1404 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -130,7 +130,7 @@ uint64_t vhost_get_features(struct vhost_dev *hdev, const int *feature_bits,
                             uint64_t features);
 void vhost_ack_features(struct vhost_dev *hdev, const int *feature_bits,
                         uint64_t features);
-bool vhost_has_free_slot(void);
+unsigned int vhost_get_free_memslots(void);
 
 int vhost_net_set_backend(struct vhost_dev *hdev,
                           struct vhost_vring_file *file);
-- 
2.31.1



* [PATCH v1 03/12] memory: Allow for marking memory region aliases unmergeable
  2021-10-27 12:45 [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots David Hildenbrand
  2021-10-27 12:45 ` [PATCH v1 01/12] kvm: Return number of free memslots David Hildenbrand
  2021-10-27 12:45 ` [PATCH v1 02/12] vhost: " David Hildenbrand
@ 2021-10-27 12:45 ` David Hildenbrand
  2021-10-27 12:45 ` [PATCH v1 04/12] vhost: Don't merge unmergeable memory sections David Hildenbrand
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2021-10-27 12:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Paolo Bonzini, Eduardo Habkost,
	Marcel Apfelbaum, Michael S. Tsirkin, Igor Mammedov, Ani Sinha,
	Peter Xu, Dr . David Alan Gilbert, Stefan Hajnoczi,
	Richard Henderson, Philippe Mathieu-Daudé,
	Hui Zhu, Sebastien Boeuf, kvm

Let's allow for marking memory region aliases unmergeable, to teach
flatview code (and memory listeners like vhost-user) to not merge adjacent
aliases to the same memory region into a larger memory section; instead, we
want separate aliases to stay separate, such that we can atomically
map/unmap individual aliases without affecting the other aliases.

This is a preparation for virtio-mem mapping device memory located
on a RAM memory region via separate aliases into a memory region container,
resulting in separate memslots that can get (un)mapped atomically.

As an example with virtio-mem, the layout looks something like this:
  [...]
  0000000180000000-000002007fffffff (prio 0, i/o): device-memory
     0000000180000000-000001017fffffff (prio 0, i/o): virtio-mem-memslots
       0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
       00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
       0000000200000000-000000023fffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff
  [...]

Right now, flatview code would merge all 3 aliases into a single memory
section. When mapping another alias (e.g., virtio-mem-memslot-3) or when
unmapping any of the mapped aliases, memory listeners would first get
notified about the removal of the big memory section and then get notified
about re-adding the new (differently merged) memory section(s).

In an ideal world, memory listeners would be able to deal with that
atomically; however, that is not the case for the most important memory
listeners used in the context of virtio-mem (KVM, vhost-user, vfio), and
supporting atomic updates is quite hard (e.g., for KVM, where we cannot
simply resize or split memory slots due to metadata allocated per slot,
or in virtiofsd, where we cannot simply resize or split an active mmap
mapping). While memslots are temporarily removed, active users (e.g., KVM
VCPUs) can stumble over the missing memslot and essentially crash the
VM.

Further, merged chunks will consume fewer memslots, but we might end up
consuming more later, when unmapping chunks and splitting the bigger
chunks into smaller ones -- making memslot accounting for memory devices
problematic as well.

Let's allow for marking a memory region alias unmergeable, such that we
can atomically (un)map aliases to the same memory region, similar to
(un)mapping individual DIMMs.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/exec/memory.h | 23 +++++++++++++++++++++++
 softmmu/memory.c      | 33 +++++++++++++++++++++++++++------
 2 files changed, 50 insertions(+), 6 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 75b4f600e3..d877b80e6e 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -82,6 +82,7 @@ struct ReservedRegion {
  *     relative to the region's address space
  * @readonly: writes to this section are ignored
  * @nonvolatile: this section is non-volatile
+ * @unmergeable: this section should not get merged with adjacent sections
  */
 struct MemoryRegionSection {
     Int128 size;
@@ -91,6 +92,7 @@ struct MemoryRegionSection {
     hwaddr offset_within_address_space;
     bool readonly;
     bool nonvolatile;
+    bool unmergeable;
 };
 
 typedef struct IOMMUTLBEntry IOMMUTLBEntry;
@@ -720,6 +722,7 @@ struct MemoryRegion {
     bool nonvolatile;
     bool rom_device;
     bool flush_coalesced_mmio;
+    bool unmergeable;
     uint8_t dirty_log_mask;
     bool is_iommu;
     RAMBlock *ram_block;
@@ -2272,6 +2275,26 @@ void memory_region_set_size(MemoryRegion *mr, uint64_t size);
 void memory_region_set_alias_offset(MemoryRegion *mr,
                                     hwaddr offset);
 
+/*
+ * memory_region_set_alias_unmergeable: Turn a memory region alias unmergeable
+ *
+ * Mark a memory region alias unmergeable, resulting in multiple adjacent
+ * aliases to the same memory region not getting merged into one memory section
+ * when simplifying the address space and notifying memory listeners.
+ *
+ * Primarily useful on aliases to RAM regions; the target use case is
+ * splitting a RAM memory region via aliases into multiple memslots and
+ * dynamically (un)mapping the aliases into another container memory region.
+ * As resulting memory sections don't cover multiple aliases, memory listeners
+ * will be notified about adding/removing separate aliases, resulting in
+ * individual memslots in KVM, vhost, vfio,... that can be added/removed
+ * atomically when mapping/unmapping the corresponding alias.
+ *
+ * @mr: the #MemoryRegion to be updated
+ * @unmergeable: whether to mark the #MemoryRegion unmergeable
+ */
+void memory_region_set_alias_unmergeable(MemoryRegion *mr, bool unmergeable);
+
 /**
  * memory_region_present: checks if an address relative to a @container
  * translates into #MemoryRegion within @container
diff --git a/softmmu/memory.c b/softmmu/memory.c
index 3a2613bd3c..d8a584ca80 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -223,6 +223,7 @@ struct FlatRange {
     bool romd_mode;
     bool readonly;
     bool nonvolatile;
+    bool unmergeable;
 };
 
 #define FOR_EACH_FLAT_RANGE(var, view)          \
@@ -239,6 +240,7 @@ section_from_flat_range(FlatRange *fr, FlatView *fv)
         .offset_within_address_space = int128_get64(fr->addr.start),
         .readonly = fr->readonly,
         .nonvolatile = fr->nonvolatile,
+        .unmergeable = fr->unmergeable,
     };
 }
 
@@ -249,7 +251,8 @@ static bool flatrange_equal(FlatRange *a, FlatRange *b)
         && a->offset_in_region == b->offset_in_region
         && a->romd_mode == b->romd_mode
         && a->readonly == b->readonly
-        && a->nonvolatile == b->nonvolatile;
+        && a->nonvolatile == b->nonvolatile
+        && a->unmergeable == b->unmergeable;
 }
 
 static FlatView *flatview_new(MemoryRegion *mr_root)
@@ -322,7 +325,8 @@ static bool can_merge(FlatRange *r1, FlatRange *r2)
         && r1->dirty_log_mask == r2->dirty_log_mask
         && r1->romd_mode == r2->romd_mode
         && r1->readonly == r2->readonly
-        && r1->nonvolatile == r2->nonvolatile;
+        && r1->nonvolatile == r2->nonvolatile
+        && !r1->unmergeable && !r2->unmergeable;
 }
 
 /* Attempt to simplify a view by merging adjacent ranges */
@@ -581,7 +585,8 @@ static void render_memory_region(FlatView *view,
                                  Int128 base,
                                  AddrRange clip,
                                  bool readonly,
-                                 bool nonvolatile)
+                                 bool nonvolatile,
+                                 bool unmergeable)
 {
     MemoryRegion *subregion;
     unsigned i;
@@ -598,6 +603,7 @@ static void render_memory_region(FlatView *view,
     int128_addto(&base, int128_make64(mr->addr));
     readonly |= mr->readonly;
     nonvolatile |= mr->nonvolatile;
+    unmergeable |= mr->unmergeable;
 
     tmp = addrrange_make(base, mr->size);
 
@@ -611,14 +617,14 @@ static void render_memory_region(FlatView *view,
         int128_subfrom(&base, int128_make64(mr->alias->addr));
         int128_subfrom(&base, int128_make64(mr->alias_offset));
         render_memory_region(view, mr->alias, base, clip,
-                             readonly, nonvolatile);
+                             readonly, nonvolatile, unmergeable);
         return;
     }
 
     /* Render subregions in priority order. */
     QTAILQ_FOREACH(subregion, &mr->subregions, subregions_link) {
         render_memory_region(view, subregion, base, clip,
-                             readonly, nonvolatile);
+                             readonly, nonvolatile, unmergeable);
     }
 
     if (!mr->terminates) {
@@ -634,6 +640,7 @@ static void render_memory_region(FlatView *view,
     fr.romd_mode = mr->romd_mode;
     fr.readonly = readonly;
     fr.nonvolatile = nonvolatile;
+    fr.unmergeable = unmergeable;
 
     /* Render the region itself into any gaps left by the current view. */
     for (i = 0; i < view->nr && int128_nz(remain); ++i) {
@@ -735,7 +742,7 @@ static FlatView *generate_memory_topology(MemoryRegion *mr)
     if (mr) {
         render_memory_region(view, mr, int128_zero(),
                              addrrange_make(int128_zero(), int128_2_64()),
-                             false, false);
+                             false, false, false);
     }
     flatview_simplify(view);
 
@@ -2634,6 +2641,20 @@ void memory_region_set_alias_offset(MemoryRegion *mr, hwaddr offset)
     memory_region_transaction_commit();
 }
 
+void memory_region_set_alias_unmergeable(MemoryRegion *mr, bool unmergeable)
+{
+    assert(mr->alias);
+
+    if (unmergeable == mr->unmergeable) {
+        return;
+    }
+
+    memory_region_transaction_begin();
+    mr->unmergeable = unmergeable;
+    memory_region_update_pending |= mr->enabled;
+    memory_region_transaction_commit();
+}
+
 uint64_t memory_region_get_alignment(const MemoryRegion *mr)
 {
     return mr->align;
-- 
2.31.1



* [PATCH v1 04/12] vhost: Don't merge unmergeable memory sections
  2021-10-27 12:45 [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots David Hildenbrand
                   ` (2 preceding siblings ...)
  2021-10-27 12:45 ` [PATCH v1 03/12] memory: Allow for marking memory region aliases unmergeable David Hildenbrand
@ 2021-10-27 12:45 ` David Hildenbrand
  2021-10-27 12:45 ` [PATCH v1 05/12] memory-device: Move memory_device_check_addable() directly into memory_device_pre_plug() David Hildenbrand
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2021-10-27 12:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Paolo Bonzini, Eduardo Habkost,
	Marcel Apfelbaum, Michael S. Tsirkin, Igor Mammedov, Ani Sinha,
	Peter Xu, Dr . David Alan Gilbert, Stefan Hajnoczi,
	Richard Henderson, Philippe Mathieu-Daudé,
	Hui Zhu, Sebastien Boeuf, kvm

Memory sections that are marked unmergeable should not be merged, to
allow for atomic removal later.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/vhost.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2707972870..49a1074097 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -620,7 +620,7 @@ static void vhost_region_add_section(struct vhost_dev *dev,
                                                mrs_size, mrs_host);
     }
 
-    if (dev->n_tmp_sections) {
+    if (dev->n_tmp_sections && !section->unmergeable) {
         /* Since we already have at least one section, lets see if
          * this extends it; since we're scanning in order, we only
          * have to look at the last one, and the FlatView that calls
@@ -653,7 +653,7 @@ static void vhost_region_add_section(struct vhost_dev *dev,
             size_t offset = mrs_gpa - prev_gpa_start;
 
             if (prev_host_start + offset == mrs_host &&
-                section->mr == prev_sec->mr &&
+                section->mr == prev_sec->mr && !prev_sec->unmergeable &&
                 (!dev->vhost_ops->vhost_backend_can_merge ||
                  dev->vhost_ops->vhost_backend_can_merge(dev,
                     mrs_host, mrs_size,
-- 
2.31.1



* [PATCH v1 05/12] memory-device: Move memory_device_check_addable() directly into memory_device_pre_plug()
  2021-10-27 12:45 [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots David Hildenbrand
                   ` (3 preceding siblings ...)
  2021-10-27 12:45 ` [PATCH v1 04/12] vhost: Don't merge unmergeable memory sections David Hildenbrand
@ 2021-10-27 12:45 ` David Hildenbrand
  2021-10-27 12:45 ` [PATCH v1 06/12] memory-device: Generalize memory_device_used_region_size() David Hildenbrand
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2021-10-27 12:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Paolo Bonzini, Eduardo Habkost,
	Marcel Apfelbaum, Michael S. Tsirkin, Igor Mammedov, Ani Sinha,
	Peter Xu, Dr . David Alan Gilbert, Stefan Hajnoczi,
	Richard Henderson, Philippe Mathieu-Daudé,
	Hui Zhu, Sebastien Boeuf, kvm

Move it out of memory_device_get_free_addr(), which is cleaner and
prepares for future changes.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/mem/memory-device.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
index 7f76a09e57..68a2c3dbcc 100644
--- a/hw/mem/memory-device.c
+++ b/hw/mem/memory-device.c
@@ -67,9 +67,10 @@ static int memory_device_used_region_size(Object *obj, void *opaque)
     return 0;
 }
 
-static void memory_device_check_addable(MachineState *ms, uint64_t size,
+static void memory_device_check_addable(MachineState *ms, MemoryRegion *mr,
                                         Error **errp)
 {
+    const uint64_t size = memory_region_size(mr);
     uint64_t used_region_size = 0;
 
     /* we will need a new memory slot for kvm and vhost */
@@ -99,7 +100,6 @@ static uint64_t memory_device_get_free_addr(MachineState *ms,
                                             uint64_t align, uint64_t size,
                                             Error **errp)
 {
-    Error *err = NULL;
     GSList *list = NULL, *item;
     Range as, new = range_empty;
 
@@ -125,12 +125,6 @@ static uint64_t memory_device_get_free_addr(MachineState *ms,
                     align);
     }
 
-    memory_device_check_addable(ms, size, &err);
-    if (err) {
-        error_propagate(errp, err);
-        return 0;
-    }
-
     if (hint && !QEMU_IS_ALIGNED(*hint, align)) {
         error_setg(errp, "address must be aligned to 0x%" PRIx64 " bytes",
                    align);
@@ -259,6 +253,11 @@ void memory_device_pre_plug(MemoryDeviceState *md, MachineState *ms,
         goto out;
     }
 
+    memory_device_check_addable(ms, mr, &local_err);
+    if (local_err) {
+        goto out;
+    }
+
     if (legacy_align) {
         align = *legacy_align;
     } else {
-- 
2.31.1



* [PATCH v1 06/12] memory-device: Generalize memory_device_used_region_size()
  2021-10-27 12:45 [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots David Hildenbrand
                   ` (4 preceding siblings ...)
  2021-10-27 12:45 ` [PATCH v1 05/12] memory-device: Move memory_device_check_addable() directly into memory_device_pre_plug() David Hildenbrand
@ 2021-10-27 12:45 ` David Hildenbrand
  2021-10-27 12:45 ` [PATCH v1 07/12] memory-device: Support memory devices that dynamically consume multiple memslots David Hildenbrand
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2021-10-27 12:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Paolo Bonzini, Eduardo Habkost,
	Marcel Apfelbaum, Michael S. Tsirkin, Igor Mammedov, Ani Sinha,
	Peter Xu, Dr . David Alan Gilbert, Stefan Hajnoczi,
	Richard Henderson, Philippe Mathieu-Daudé,
	Hui Zhu, Sebastien Boeuf, kvm

Let's generalize traversal of all plugged memory devices to collect
information in the context of memory_device_check_addable() to prepare for
future changes.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/mem/memory-device.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
index 68a2c3dbcc..a915894819 100644
--- a/hw/mem/memory-device.c
+++ b/hw/mem/memory-device.c
@@ -50,20 +50,24 @@ static int memory_device_build_list(Object *obj, void *opaque)
     return 0;
 }
 
-static int memory_device_used_region_size(Object *obj, void *opaque)
+struct memory_devices_info {
+    uint64_t region_size;
+};
+
+static int memory_devices_collect_info(Object *obj, void *opaque)
 {
-    uint64_t *size = opaque;
+    struct memory_devices_info *i = opaque;
 
     if (object_dynamic_cast(obj, TYPE_MEMORY_DEVICE)) {
         const DeviceState *dev = DEVICE(obj);
         const MemoryDeviceState *md = MEMORY_DEVICE(obj);
 
         if (dev->realized) {
-            *size += memory_device_get_region_size(md, &error_abort);
+            i->region_size += memory_device_get_region_size(md, &error_abort);
         }
     }
 
-    object_child_foreach(obj, memory_device_used_region_size, opaque);
+    object_child_foreach(obj, memory_devices_collect_info, opaque);
     return 0;
 }
 
@@ -71,7 +75,7 @@ static void memory_device_check_addable(MachineState *ms, MemoryRegion *mr,
                                         Error **errp)
 {
     const uint64_t size = memory_region_size(mr);
-    uint64_t used_region_size = 0;
+    struct memory_devices_info info = {};
 
     /* we will need a new memory slot for kvm and vhost */
     if (kvm_enabled() && !kvm_get_free_memslots()) {
@@ -84,12 +88,12 @@ static void memory_device_check_addable(MachineState *ms, MemoryRegion *mr,
     }
 
     /* will we exceed the total amount of memory specified */
-    memory_device_used_region_size(OBJECT(ms), &used_region_size);
-    if (used_region_size + size < used_region_size ||
-        used_region_size + size > ms->maxram_size - ms->ram_size) {
+    memory_devices_collect_info(OBJECT(ms), &info);
+    if (info.region_size + size < info.region_size ||
+        info.region_size + size > ms->maxram_size - ms->ram_size) {
         error_setg(errp, "not enough space, currently 0x%" PRIx64
                    " in use of total space for memory devices 0x" RAM_ADDR_FMT,
-                   used_region_size, ms->maxram_size - ms->ram_size);
+                   info.region_size, ms->maxram_size - ms->ram_size);
         return;
     }
 
-- 
2.31.1



* [PATCH v1 07/12] memory-device: Support memory devices that dynamically consume multiple memslots
  2021-10-27 12:45 [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots David Hildenbrand
                   ` (5 preceding siblings ...)
  2021-10-27 12:45 ` [PATCH v1 06/12] memory-device: Generalize memory_device_used_region_size() David Hildenbrand
@ 2021-10-27 12:45 ` David Hildenbrand
  2021-10-27 12:45 ` [PATCH v1 08/12] vhost: Respect reserved memslots for memory devices when realizing a vhost device David Hildenbrand
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2021-10-27 12:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Paolo Bonzini, Eduardo Habkost,
	Marcel Apfelbaum, Michael S. Tsirkin, Igor Mammedov, Ani Sinha,
	Peter Xu, Dr . David Alan Gilbert, Stefan Hajnoczi,
	Richard Henderson, Philippe Mathieu-Daudé,
	Hui Zhu, Sebastien Boeuf, kvm

We want to support memory devices that have a container as their device
memory region and (dynamically) map individual chunks into that container,
resulting in multiple memslots getting consumed by such a device. So we
want to support memory devices that a) require multiple memslots and
b) make use of these memslots dynamically.

We already have one device that uses a container as device memory region:
NVDIMM. However, an NVDIMM also ends up consuming exactly one memslot.

The target use case will be virtio-mem, which will dynamically map
parts of a source RAM memory region into the container device region
using aliases, consuming one memslot per alias.

We need a way to query from a memory device:
* The number of memslots currently used.
* The total number of memslots that might get used across the device's
  lifetime.

Expose one helper function -- memory_devices_get_reserved_memslots() --
that will be used by vhost code to respect the current memslot reservations
when realizing a vhost device.
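
As a rough sketch of how a device might implement the new hooks (the
virtio-mem wiring happens later in this series; the "nb_memslots" and
"nb_used_memslots" fields below are purely hypothetical placeholders):

    static unsigned int
    virtio_mem_md_get_memslots(const MemoryDeviceState *md, Error **errp)
    {
        /* Total memslots the device may ever use; fixed over its lifetime. */
        return VIRTIO_MEM(md)->nb_memslots;
    }

    static unsigned int
    virtio_mem_md_get_used_memslots(const MemoryDeviceState *md, Error **errp)
    {
        /* Memslots (aliases) currently mapped into the device container. */
        return VIRTIO_MEM(md)->nb_used_memslots;
    }

    static void virtio_mem_class_init(ObjectClass *klass, void *data)
    {
        MemoryDeviceClass *mdc = MEMORY_DEVICE_CLASS(klass);

        mdc->get_memslots = virtio_mem_md_get_memslots;
        mdc->get_used_memslots = virtio_mem_md_get_used_memslots;
    }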

Limit the number of memslots usable by memory devices to something sane
-- for now 2048 -- as we really don't want to create tens of thousands of
memslots that can degrade performance when traversing an address space,
just because it would be possible (for example, without KVM and without
vhost there is no real limit ...). This does not affect existing setups
(for example, more than 256 DIMMs+NVDIMMs is not supported).

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/mem/memory-device.c         | 84 ++++++++++++++++++++++++++++++----
 include/hw/mem/memory-device.h | 33 +++++++++++++
 stubs/qmp_memory_device.c      |  5 ++
 3 files changed, 113 insertions(+), 9 deletions(-)

diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
index a915894819..bb02eb410a 100644
--- a/hw/mem/memory-device.c
+++ b/hw/mem/memory-device.c
@@ -18,6 +18,8 @@
 #include "sysemu/kvm.h"
 #include "trace.h"
 
+#define MEMORY_DEVICES_MAX_MEMSLOTS     2048
+
 static gint memory_device_addr_sort(gconstpointer a, gconstpointer b)
 {
     const MemoryDeviceState *md_a = MEMORY_DEVICE(a);
@@ -50,8 +52,28 @@ static int memory_device_build_list(Object *obj, void *opaque)
     return 0;
 }
 
+static unsigned int memory_device_get_used_memslots(const MemoryDeviceState *md)
+{
+    const MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(md);
+
+    if (!mdc->get_used_memslots)
+        return 1;
+    return mdc->get_used_memslots(md, &error_abort);
+}
+
+static unsigned int memory_device_get_memslots(const MemoryDeviceState *md)
+{
+    const MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(md);
+
+    if (!mdc->get_memslots)
+        return 1;
+    return mdc->get_memslots(md, &error_abort);
+}
+
 struct memory_devices_info {
     uint64_t region_size;
+    unsigned int used_memslots;
+    unsigned int reserved_memslots;
 };
 
 static int memory_devices_collect_info(Object *obj, void *opaque)
@@ -61,9 +83,15 @@ static int memory_devices_collect_info(Object *obj, void *opaque)
     if (object_dynamic_cast(obj, TYPE_MEMORY_DEVICE)) {
         const DeviceState *dev = DEVICE(obj);
         const MemoryDeviceState *md = MEMORY_DEVICE(obj);
+        unsigned int used, total;
 
         if (dev->realized) {
             i->region_size += memory_device_get_region_size(md, &error_abort);
+
+            used = memory_device_get_used_memslots(md);
+            total = memory_device_get_memslots(md);
+            i->used_memslots += used;
+            i->reserved_memslots += total - used;
         }
     }
 
@@ -71,24 +99,62 @@ static int memory_devices_collect_info(Object *obj, void *opaque)
     return 0;
 }
 
-static void memory_device_check_addable(MachineState *ms, MemoryRegion *mr,
-                                        Error **errp)
+/*
+ * Get the number of memslots that are reserved (not used yet but will get used
+ * dynamically in the future without further checks) by all memory devices.
+ */
+unsigned int memory_devices_get_reserved_memslots(void)
+{
+    struct memory_devices_info info = {};
+
+    memory_devices_collect_info(qdev_get_machine(), &info);
+    return info.reserved_memslots;
+}
+
+static void memory_device_check_addable(MachineState *ms, MemoryDeviceState *md,
+                                        MemoryRegion *mr, Error **errp)
 {
     const uint64_t size = memory_region_size(mr);
     struct memory_devices_info info = {};
+    unsigned int required, reserved;
+
+    memory_devices_collect_info(OBJECT(ms), &info);
+    reserved = info.reserved_memslots;
+    required = memory_device_get_memslots(md);
 
-    /* we will need a new memory slot for kvm and vhost */
-    if (kvm_enabled() && !kvm_get_free_memslots()) {
-        error_setg(errp, "hypervisor has no free memory slots left");
+    /*
+     * Limit the maximum number of memslots used by memory devices to something
+     * sane.
+     */
+    if (info.used_memslots + reserved + required >
+        MEMORY_DEVICES_MAX_MEMSLOTS) {
+        error_setg(errp, "The maximum number of memory slots to be consumed by"
+                   " memory devices (%u) would be exceeded. Used: %u,"
+                   " Reserved: %u, Required: %u",
+                   MEMORY_DEVICES_MAX_MEMSLOTS, info.used_memslots,
+                   reserved, required);
         return;
     }
-    if (!vhost_get_free_memslots()) {
-        error_setg(errp, "a used vhost backend has no free memory slots left");
+
+    /*
+     * All memslots used by memory devices are already subtracted from
+     * the free memslots as reported by kvm and vhost.
+     */
+    if (kvm_enabled() && kvm_get_free_memslots() < reserved + required) {
+        error_setg(errp, "KVM does not have enough free, unreserved memory"
+                   " slots left. Free: %u, Reserved: %u, Required: %u",
+                   kvm_get_free_memslots(), reserved, required);
+        return;
+    }
+    if (vhost_get_free_memslots() < reserved + required) {
+        error_setg(errp, "a used vhost backend does not have enough free,"
+                   " unreserved memory slots left. Free: %u, Reserved: %u,"
+                   " Required: %u", vhost_get_free_memslots(), reserved,
+                   required);
         return;
     }
 
     /* will we exceed the total amount of memory specified */
-    memory_devices_collect_info(OBJECT(ms), &info);
     if (info.region_size + size < info.region_size ||
         info.region_size + size > ms->maxram_size - ms->ram_size) {
         error_setg(errp, "not enough space, currently 0x%" PRIx64
@@ -257,7 +323,7 @@ void memory_device_pre_plug(MemoryDeviceState *md, MachineState *ms,
         goto out;
     }
 
-    memory_device_check_addable(ms, mr, &local_err);
+    memory_device_check_addable(ms, md, mr, &local_err);
     if (local_err) {
         goto out;
     }
diff --git a/include/hw/mem/memory-device.h b/include/hw/mem/memory-device.h
index 48d2611fc5..c811415ce6 100644
--- a/include/hw/mem/memory-device.h
+++ b/include/hw/mem/memory-device.h
@@ -98,6 +98,38 @@ struct MemoryDeviceClass {
      */
     uint64_t (*get_min_alignment)(const MemoryDeviceState *md);
 
+    /*
+     * Optional: Return the number of used individual memslots (i.e.,
+     * individual RAM mappings) the device has created in the memory region of
+     * the device. The device has to make sure that memslots won't get merged
+     * internally (e.g., by disabling merging of memory region aliases) if the
+     * memory region layout could allow for that.
+     *
+     * If this function is not implemented, we assume the device memory region
+     * is not a container and that there is exactly one memslot getting used
+     * after realizing and plugging the device.
+     *
+     * Called when plugging the memory device or when iterating over
+     * all realized memory devices to calculate used/reserved/available
+     * memslots.
+     */
+    unsigned int (*get_used_memslots)(const MemoryDeviceState *md, Error **errp);
+
+    /*
+     * Optional: Return the total number of individual memslots
+     * (i.e., individual RAM mappings) the device may create in the memory
+     * region of the device over its lifetime. The result must never change.
+     *
+     * If this function is not implemented, we assume the device memory region
+     * is not a container and that there will be exactly one memslot getting
+     * used after realizing and plugging the device.
+     *
+     * Called when plugging the memory device or when iterating over
+     * all realized memory devices to calculate used/reserved/available
+     * memslots.
+     */
+    unsigned int (*get_memslots)(const MemoryDeviceState *md, Error **errp);
+
     /*
      * Translate the memory device into #MemoryDeviceInfo.
      */
@@ -113,5 +145,6 @@ void memory_device_plug(MemoryDeviceState *md, MachineState *ms);
 void memory_device_unplug(MemoryDeviceState *md, MachineState *ms);
 uint64_t memory_device_get_region_size(const MemoryDeviceState *md,
                                        Error **errp);
+unsigned int memory_devices_get_reserved_memslots(void);
 
 #endif
diff --git a/stubs/qmp_memory_device.c b/stubs/qmp_memory_device.c
index e75cac62dc..318a5d4187 100644
--- a/stubs/qmp_memory_device.c
+++ b/stubs/qmp_memory_device.c
@@ -10,3 +10,8 @@ uint64_t get_plugged_memory_size(void)
 {
     return (uint64_t)-1;
 }
+
+unsigned int memory_devices_get_reserved_memslots(void)
+{
+    return 0;
+}
-- 
2.31.1



* [PATCH v1 08/12] vhost: Respect reserved memslots for memory devices when realizing a vhost device
  2021-10-27 12:45 [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots David Hildenbrand
                   ` (6 preceding siblings ...)
  2021-10-27 12:45 ` [PATCH v1 07/12] memory-device: Support memory devices that dynamically consume multiple memslots David Hildenbrand
@ 2021-10-27 12:45 ` David Hildenbrand
  2021-10-27 12:45 ` [PATCH v1 09/12] memory: Drop mapping check from memory_region_get_ram_discard_manager() David Hildenbrand
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2021-10-27 12:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Paolo Bonzini, Eduardo Habkost,
	Marcel Apfelbaum, Michael S. Tsirkin, Igor Mammedov, Ani Sinha,
	Peter Xu, Dr . David Alan Gilbert, Stefan Hajnoczi,
	Richard Henderson, Philippe Mathieu-Daudé,
	Hui Zhu, Sebastien Boeuf, kvm

Make sure that the current reservations can be fulfilled; otherwise, we
might run out of memslots later, when memory devices actually start using
the reserved memslots, and crash.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/vhost.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 49a1074097..a8f0d0bdf7 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -23,6 +23,7 @@
 #include "standard-headers/linux/vhost_types.h"
 #include "hw/virtio/virtio-bus.h"
 #include "hw/virtio/virtio-access.h"
+#include "hw/mem/memory-device.h"
 #include "migration/blocker.h"
 #include "migration/qemu-file-types.h"
 #include "sysemu/dma.h"
@@ -1319,7 +1320,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
                    Error **errp)
 {
     uint64_t features;
-    int i, r, n_initialized_vqs = 0;
+    int i, r, reserved_memslots, backend_memslots, n_initialized_vqs = 0;
 
     hdev->vdev = NULL;
     hdev->migration_blocker = NULL;
@@ -1415,9 +1416,13 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
     memory_listener_register(&hdev->memory_listener, &address_space_memory);
     QLIST_INSERT_HEAD(&vhost_devices, hdev, entry);
 
-    if (used_memslots > hdev->vhost_ops->vhost_backend_memslots_limit(hdev)) {
-        error_setg(errp, "vhost backend memory slots limit is less"
-                   " than current number of present memory slots");
+    reserved_memslots = memory_devices_get_reserved_memslots();
+    backend_memslots = hdev->vhost_ops->vhost_backend_memslots_limit(hdev);
+    if (used_memslots + reserved_memslots > backend_memslots) {
+        error_setg(errp, "vhost backend memory slots limit (%d) is less"
+                   " than current number of used (%d) and reserved (%d)"
+                   " memory slots", backend_memslots, used_memslots,
+                   reserved_memslots);
         r = -EINVAL;
         goto fail_busyloop;
     }
-- 
2.31.1



* [PATCH v1 09/12] memory: Drop mapping check from memory_region_get_ram_discard_manager()
  2021-10-27 12:45 [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots David Hildenbrand
                   ` (7 preceding siblings ...)
  2021-10-27 12:45 ` [PATCH v1 08/12] vhost: Respect reserved memslots for memory devices when realizing a vhost device David Hildenbrand
@ 2021-10-27 12:45 ` David Hildenbrand
  2021-10-27 12:45 ` [PATCH v1 10/12] virtio-mem: Fix typo in virito_mem_intersect_memory_section() function name David Hildenbrand
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2021-10-27 12:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Paolo Bonzini, Eduardo Habkost,
	Marcel Apfelbaum, Michael S. Tsirkin, Igor Mammedov, Ani Sinha,
	Peter Xu, Dr . David Alan Gilbert, Stefan Hajnoczi,
	Richard Henderson, Philippe Mathieu-Daudé,
	Hui Zhu, Sebastien Boeuf, kvm

It's sufficient to check whether a memory region is RAM; the region
doesn't necessarily have to be mapped into another memory region.
For example, RAM memory regions mapped via an alias will never make
"memory_region_is_mapped()" succeed.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 softmmu/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/softmmu/memory.c b/softmmu/memory.c
index d8a584ca80..2496bff729 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -2045,7 +2045,7 @@ int memory_region_iommu_num_indexes(IOMMUMemoryRegion *iommu_mr)
 
 RamDiscardManager *memory_region_get_ram_discard_manager(MemoryRegion *mr)
 {
-    if (!memory_region_is_mapped(mr) || !memory_region_is_ram(mr)) {
+    if (!memory_region_is_ram(mr)) {
         return NULL;
     }
     return mr->rdm;
-- 
2.31.1



* [PATCH v1 10/12] virtio-mem: Fix typo in virito_mem_intersect_memory_section() function name
  2021-10-27 12:45 [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots David Hildenbrand
                   ` (8 preceding siblings ...)
  2021-10-27 12:45 ` [PATCH v1 09/12] memory: Drop mapping check from memory_region_get_ram_discard_manager() David Hildenbrand
@ 2021-10-27 12:45 ` David Hildenbrand
  2022-12-28 14:05   ` Philippe Mathieu-Daudé
  2021-10-27 12:45 ` [PATCH v1 11/12] virtio-mem: Set the RamDiscardManager for the RAM memory region earlier David Hildenbrand
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2021-10-27 12:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Paolo Bonzini, Eduardo Habkost,
	Marcel Apfelbaum, Michael S. Tsirkin, Igor Mammedov, Ani Sinha,
	Peter Xu, Dr . David Alan Gilbert, Stefan Hajnoczi,
	Richard Henderson, Philippe Mathieu-Daudé,
	Hui Zhu, Sebastien Boeuf, kvm

It's "virtio", not "virito".

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/virtio-mem.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index d5a578142b..0f5eae4a7e 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -177,7 +177,7 @@ static int virtio_mem_for_each_unplugged_range(const VirtIOMEM *vmem, void *arg,
  *
  * Returns false if the intersection is empty, otherwise returns true.
  */
-static bool virito_mem_intersect_memory_section(MemoryRegionSection *s,
+static bool virtio_mem_intersect_memory_section(MemoryRegionSection *s,
                                                 uint64_t offset, uint64_t size)
 {
     uint64_t start = MAX(s->offset_within_region, offset);
@@ -215,7 +215,7 @@ static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
                                       first_bit + 1) - 1;
         size = (last_bit - first_bit + 1) * vmem->block_size;
 
-        if (!virito_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
             break;
         }
         ret = cb(&tmp, arg);
@@ -247,7 +247,7 @@ static int virtio_mem_for_each_unplugged_section(const VirtIOMEM *vmem,
                                  first_bit + 1) - 1;
         size = (last_bit - first_bit + 1) * vmem->block_size;
 
-        if (!virito_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
             break;
         }
         ret = cb(&tmp, arg);
@@ -283,7 +283,7 @@ static void virtio_mem_notify_unplug(VirtIOMEM *vmem, uint64_t offset,
     QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
         MemoryRegionSection tmp = *rdl->section;
 
-        if (!virito_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
             continue;
         }
         rdl->notify_discard(rdl, &tmp);
@@ -299,7 +299,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
     QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
         MemoryRegionSection tmp = *rdl->section;
 
-        if (!virito_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
             continue;
         }
         ret = rdl->notify_populate(rdl, &tmp);
@@ -316,7 +316,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
             if (rdl2 == rdl) {
                 break;
             }
-            if (!virito_mem_intersect_memory_section(&tmp, offset, size)) {
+            if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
                 continue;
             }
             rdl2->notify_discard(rdl2, &tmp);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v1 11/12] virtio-mem: Set the RamDiscardManager for the RAM memory region earlier
  2021-10-27 12:45 [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots David Hildenbrand
                   ` (9 preceding siblings ...)
  2021-10-27 12:45 ` [PATCH v1 10/12] virtio-mem: Fix typo in virito_mem_intersect_memory_section() function name David Hildenbrand
@ 2021-10-27 12:45 ` David Hildenbrand
  2021-10-27 12:45 ` [PATCH v1 12/12] virtio-mem: Expose device memory via multiple memslots David Hildenbrand
  2021-11-01 22:15 ` [PATCH v1 00/12] " Michael S. Tsirkin
  12 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2021-10-27 12:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Paolo Bonzini, Eduardo Habkost,
	Marcel Apfelbaum, Michael S. Tsirkin, Igor Mammedov, Ani Sinha,
	Peter Xu, Dr . David Alan Gilbert, Stefan Hajnoczi,
	Richard Henderson, Philippe Mathieu-Daudé,
	Hui Zhu, Sebastien Boeuf, kvm

Let's set the RamDiscardManager earlier, before we expose the RAM memory
region to the system. This prepares for further changes and is logically
cleaner: before exposing the RAM memory region to migration code, make
sure the RamDiscardManager is set up.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/virtio-mem.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 0f5eae4a7e..1e29706798 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -773,16 +773,17 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
                 sizeof(struct virtio_mem_config));
     vmem->vq = virtio_add_queue(vdev, 128, virtio_mem_handle_request);
 
-    host_memory_backend_set_mapped(vmem->memdev, true);
-    vmstate_register_ram(&vmem->memdev->mr, DEVICE(vmem));
-    qemu_register_reset(virtio_mem_system_reset, vmem);
-
     /*
-     * Set ourselves as RamDiscardManager before the plug handler maps the
-     * memory region and exposes it via an address space.
+     * Set ourselves as RamDiscardManager before we expose the memory region
+     * to the system (e.g., marking the RAMBlock migratable, mapping the
+     * region).
      */
     memory_region_set_ram_discard_manager(&vmem->memdev->mr,
                                           RAM_DISCARD_MANAGER(vmem));
+
+    host_memory_backend_set_mapped(vmem->memdev, true);
+    vmstate_register_ram(&vmem->memdev->mr, DEVICE(vmem));
+    qemu_register_reset(virtio_mem_system_reset, vmem);
 }
 
 static void virtio_mem_device_unrealize(DeviceState *dev)
@@ -790,14 +791,10 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VirtIOMEM *vmem = VIRTIO_MEM(dev);
 
-    /*
-     * The unplug handler unmapped the memory region, it cannot be
-     * found via an address space anymore. Unset ourselves.
-     */
-    memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
     qemu_unregister_reset(virtio_mem_system_reset, vmem);
     vmstate_unregister_ram(&vmem->memdev->mr, DEVICE(vmem));
     host_memory_backend_set_mapped(vmem->memdev, false);
+    memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
     virtio_del_queue(vdev, 0);
     virtio_cleanup(vdev);
     g_free(vmem->bitmap);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v1 12/12] virtio-mem: Expose device memory via multiple memslots
  2021-10-27 12:45 [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots David Hildenbrand
                   ` (10 preceding siblings ...)
  2021-10-27 12:45 ` [PATCH v1 11/12] virtio-mem: Set the RamDiscardManager for the RAM memory region earlier David Hildenbrand
@ 2021-10-27 12:45 ` David Hildenbrand
  2021-11-01 22:15 ` [PATCH v1 00/12] " Michael S. Tsirkin
  12 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2021-10-27 12:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Paolo Bonzini, Eduardo Habkost,
	Marcel Apfelbaum, Michael S. Tsirkin, Igor Mammedov, Ani Sinha,
	Peter Xu, Dr . David Alan Gilbert, Stefan Hajnoczi,
	Richard Henderson, Philippe Mathieu-Daudé,
	Hui Zhu, Sebastien Boeuf, kvm

We want to expose virtio-mem device memory to the guest via multiple
memslots on demand, essentially reducing the total size of KVM memory
slots significantly (and thereby the metadata in KVM and in QEMU for KVM
memory slots), especially when a virtio-mem device initially exposes only
a small amount of memory to the guest and more memory is to be hotplugged
later. Further, not always exposing the full device memory region to the
guest reduces the attack surface in many setups without requiring other
mechanisms, like userfaultfd, for protecting unplugged memory.

So split the original RAM region via memory region aliases into separate
memory slots, and dynamically map the required memory slots into the
container.

For now, we always map the memslots covered by the usable region. In the
future, with VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, we'll be able to map
memslots on actual demand and optimize further.

The user is in charge of setting the number of memslots the device
should use. In the future, we might add an auto mode, specified via
"memslots=0" for user convenience.

There are two new properties:

1) "used-memslots" contains how many memslots are currently used
 and is read-only. Used internally, but can also be used for debugging/
 introspection purposes.

2) "memslots" specifies how many memslots the device is should use.
 * "1" is the default and corresponds mostly to the old behavior. The
   only exception is that with a usable region size of 0, the single
   memslot won't get used and nothing will get mapped.
 * "> 1" tells the device to use the given number of memslots. There are
   a couple of restrictions:
   * Cannot be bigger than 1024 or equal to 0.
   * Cannot be bigger than the number of device blocks.
   * Must not result in memslots that are smaller than the minimum
     memslot size.
 This parameter doesn't have to be migrated and can differ between
 source and destination.

The minimum memslot size (and thereby the alignment of memslots in
guest physical address space) is currently defined to be 128 MiB --
a memory slot size we know works (due to x86-64 DIMMs) without confusing
devices that might not be able to handle crossing memory regions in
address spaces when performing I/O.
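
As a rough illustration of the sizing rules above, here is a standalone
sketch (not the actual QEMU code; the helper name, the local macros and
the 2 MiB block size are assumptions made for this example, the real
logic lives in virtio_mem_prepare_memslots()):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define MiB (1024ULL * 1024)
#define GiB (1024 * MiB)
#define MIN_MEMSLOT_SIZE (128 * MiB)

#define ALIGN_DOWN(n, m) ((n) / (m) * (m))
#define ALIGN_UP(n, m)   ALIGN_DOWN((n) + (m) - 1, (m))

/* Split the device region into nb_memslots equally sized memslots. */
static uint64_t memslot_size(uint64_t region_size, uint64_t nb_memslots,
                             uint64_t block_size)
{
    uint64_t size;

    if (nb_memslots == 1) {
        return region_size;
    }
    /* Equal split, shrunk to a multiple of the block size and of 128 MiB. */
    size = ALIGN_UP(region_size, nb_memslots) / nb_memslots;
    size = ALIGN_DOWN(size, block_size);
    return ALIGN_DOWN(size, MIN_MEMSLOT_SIZE);
}

int main(void)
{
    /* 1 TiB device, 2 MiB blocks, memslots=1024 -> 1024 MiB per memslot. */
    printf("%" PRIu64 " MiB\n", memslot_size(1024 * GiB, 1024, 2 * MiB) / MiB);
    /* 4 GiB device, 2 MiB blocks, memslots=1024 -> 0 MiB, i.e., rejected. */
    printf("%" PRIu64 " MiB\n", memslot_size(4 * GiB, 1024, 2 * MiB) / MiB);
    return 0;
}

A result of 0 corresponds to the error case: the requested number of
memslots would create memslots smaller than the minimum memslot size.
As in the actual code, the last memslot of a real device may end up
larger than the computed size to cover the remainder of the region.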

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/virtio-mem-pci.c     |  23 +++++
 hw/virtio/virtio-mem.c         | 183 ++++++++++++++++++++++++++++++++-
 include/hw/virtio/virtio-mem.h |  25 ++++-
 3 files changed, 226 insertions(+), 5 deletions(-)

diff --git a/hw/virtio/virtio-mem-pci.c b/hw/virtio/virtio-mem-pci.c
index be2383b0c5..1dc4078941 100644
--- a/hw/virtio/virtio-mem-pci.c
+++ b/hw/virtio/virtio-mem-pci.c
@@ -82,6 +82,21 @@ static uint64_t virtio_mem_pci_get_min_alignment(const MemoryDeviceState *md)
                                     &error_abort);
 }
 
+static unsigned int virtio_mem_pci_get_used_memslots(
+                                                    const MemoryDeviceState *md,
+                                                     Error **errp)
+{
+    return object_property_get_uint(OBJECT(md), VIRTIO_MEM_USED_MEMSLOTS_PROP,
+                                    &error_abort);
+}
+
+static unsigned int virtio_mem_pci_get_memslots(const MemoryDeviceState *md,
+                                                Error **errp)
+{
+    return object_property_get_uint(OBJECT(md), VIRTIO_MEM_MEMSLOTS_PROP,
+                                    &error_abort);
+}
+
 static void virtio_mem_pci_size_change_notify(Notifier *notifier, void *data)
 {
     VirtIOMEMPCI *pci_mem = container_of(notifier, VirtIOMEMPCI,
@@ -115,6 +130,8 @@ static void virtio_mem_pci_class_init(ObjectClass *klass, void *data)
     mdc->get_memory_region = virtio_mem_pci_get_memory_region;
     mdc->fill_device_info = virtio_mem_pci_fill_device_info;
     mdc->get_min_alignment = virtio_mem_pci_get_min_alignment;
+    mdc->get_used_memslots = virtio_mem_pci_get_used_memslots;
+    mdc->get_memslots = virtio_mem_pci_get_memslots;
 }
 
 static void virtio_mem_pci_instance_init(Object *obj)
@@ -142,6 +159,12 @@ static void virtio_mem_pci_instance_init(Object *obj)
     object_property_add_alias(obj, VIRTIO_MEM_REQUESTED_SIZE_PROP,
                               OBJECT(&dev->vdev),
                               VIRTIO_MEM_REQUESTED_SIZE_PROP);
+    object_property_add_alias(obj, VIRTIO_MEM_MEMSLOTS_PROP,
+                              OBJECT(&dev->vdev),
+                              VIRTIO_MEM_MEMSLOTS_PROP);
+    object_property_add_alias(obj, VIRTIO_MEM_USED_MEMSLOTS_PROP,
+                              OBJECT(&dev->vdev),
+                              VIRTIO_MEM_USED_MEMSLOTS_PROP);
 }
 
 static const VirtioPCIDeviceTypeInfo virtio_mem_pci_info = {
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 1e29706798..f0ad365b91 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -23,6 +23,7 @@
 #include "hw/virtio/virtio-bus.h"
 #include "hw/virtio/virtio-access.h"
 #include "hw/virtio/virtio-mem.h"
+#include "hw/mem/memory-device.h"
 #include "qapi/error.h"
 #include "qapi/visitor.h"
 #include "exec/ram_addr.h"
@@ -46,6 +47,13 @@
 #define VIRTIO_MEM_DEFAULT_THP_SIZE VIRTIO_MEM_MIN_BLOCK_SIZE
 #endif
 
+/*
+ * Let's not allow a crazy number of memslots for a single virtio-mem device
+ * and try to size memslots reasonably large.
+ */
+#define VIRTIO_MEM_MAX_MEMSLOTS 1024
+#define VIRTIO_MEM_MIN_MEMSLOT_SIZE (128 * MiB)
+
 /*
  * We want to have a reasonable default block size such that
  * 1. We avoid splitting THPs when unplugging memory, which degrades
@@ -500,6 +508,7 @@ static void virtio_mem_resize_usable_region(VirtIOMEM *vmem,
 {
     uint64_t newsize = MIN(memory_region_size(&vmem->memdev->mr),
                            requested_size + VIRTIO_MEM_USABLE_EXTENT);
+    int i;
 
     /* The usable region size always has to be multiples of the block size. */
     newsize = QEMU_ALIGN_UP(newsize, vmem->block_size);
@@ -514,6 +523,25 @@ static void virtio_mem_resize_usable_region(VirtIOMEM *vmem,
 
     trace_virtio_mem_resized_usable_region(vmem->usable_region_size, newsize);
     vmem->usable_region_size = newsize;
+
+    /*
+     * Map all unmapped memslots that cover the usable region and unmap all
+     * remaining mapped ones.
+     */
+    for (i = 0; i < vmem->nb_memslots; i++) {
+        if (vmem->memslot_size * i < vmem->usable_region_size) {
+            if (!memory_region_is_mapped(&vmem->memslots[i])) {
+                memory_region_add_subregion(vmem->mr, vmem->memslot_size * i,
+                                            &vmem->memslots[i]);
+                vmem->nb_used_memslots++;
+            }
+        } else {
+            if (memory_region_is_mapped(&vmem->memslots[i])) {
+                memory_region_del_subregion(vmem->mr, &vmem->memslots[i]);
+                vmem->nb_used_memslots--;
+            }
+        }
+    }
 }
 
 static int virtio_mem_unplug_all(VirtIOMEM *vmem)
@@ -674,6 +702,83 @@ static void virtio_mem_system_reset(void *opaque)
     virtio_mem_unplug_all(vmem);
 }
 
+static void virtio_mem_alloc_mr(VirtIOMEM *vmem)
+{
+    const uint64_t region_size = memory_region_size(&vmem->memdev->mr);
+
+    vmem->mr = g_new0(MemoryRegion, 1);
+    memory_region_init(vmem->mr, OBJECT(vmem), "virtio-mem-memslots",
+                       region_size);
+    vmem->mr->align = memory_region_get_alignment(&vmem->memdev->mr);
+}
+
+static int virtio_mem_prepare_memslots(VirtIOMEM *vmem, Error **errp)
+{
+    const uint64_t region_size = memory_region_size(&vmem->memdev->mr);
+    const uint64_t device_blocks = region_size / vmem->block_size;
+
+    if (vmem->nb_memslots == 1) {
+        vmem->memslot_size = region_size;
+        return 0;
+    }
+
+    /* We cannot have more memslots than device blocks. */
+    if (vmem->nb_memslots > device_blocks) {
+        error_setg(errp, "'%s' property exceeds the total number of device"
+                   " blocks (%" PRIu64 ")", VIRTIO_MEM_MEMSLOTS_PROP,
+                   device_blocks);
+        return -EINVAL;
+    }
+
+    /*
+     * We'll make sure the memslots are multiple of the minimum memslot size
+     * and multiple of the device block size; This can make the last memslot
+     * larger than the others.
+     */
+    vmem->memslot_size = QEMU_ALIGN_UP(region_size, vmem->nb_memslots) /
+                         vmem->nb_memslots;
+    vmem->memslot_size = QEMU_ALIGN_DOWN(vmem->memslot_size, vmem->block_size);
+    vmem->memslot_size = QEMU_ALIGN_DOWN(vmem->memslot_size,
+                                         VIRTIO_MEM_MIN_MEMSLOT_SIZE);
+    if (!vmem->memslot_size) {
+        error_setg(errp, "'%s' property would create memory slots smaller than"
+                   " the minimum supported memory slot size (%lu MiB)",
+                   VIRTIO_MEM_MEMSLOTS_PROP, VIRTIO_MEM_MIN_MEMSLOT_SIZE / MiB);
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+static void virtio_mem_alloc_memslots(VirtIOMEM *vmem)
+{
+    const uint64_t region_size = memory_region_size(&vmem->memdev->mr);
+    int i;
+
+    /* Create our memslots but don't map them yet -- we'll map dynamically. */
+    vmem->memslots = g_new0(MemoryRegion, vmem->nb_memslots);
+    for (i = 0; i < vmem->nb_memslots; i++) {
+        uint64_t size;
+        char name[80];
+
+        /* The size of the last memslot might differ. */
+        size = vmem->memslot_size;
+        if (i + 1 == vmem->nb_memslots) {
+            size = region_size - i * vmem->memslot_size;
+        }
+
+        snprintf(name, sizeof(name), "virtio-mem-memslot-%u", i);
+        memory_region_init_alias(&vmem->memslots[i], OBJECT(vmem), name,
+                                 &vmem->memdev->mr, vmem->memslot_size * i,
+                                 size);
+        /*
+         * We want our aliases to result in separate memory sections and thereby
+         * separate memslots.
+         */
+        memory_region_set_alias_unmergeable(&vmem->memslots[i], true);
+    }
+}
+
 static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
 {
     MachineState *ms = MACHINE(qdev_get_machine());
@@ -751,6 +856,10 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
         return;
     }
 
+    if (virtio_mem_prepare_memslots(vmem, errp)) {
+        return;
+    }
+
     if (ram_block_coordinated_discard_require(true)) {
         error_setg(errp, "Discarding RAM is disabled");
         return;
@@ -763,12 +872,15 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    virtio_mem_resize_usable_region(vmem, vmem->requested_size, true);
-
     vmem->bitmap_size = memory_region_size(&vmem->memdev->mr) /
                         vmem->block_size;
     vmem->bitmap = bitmap_new(vmem->bitmap_size);
 
+    if (!vmem->mr) {
+        virtio_mem_alloc_mr(vmem);
+    }
+    virtio_mem_alloc_memslots(vmem);
+
     virtio_init(vdev, TYPE_VIRTIO_MEM, VIRTIO_ID_MEM,
                 sizeof(struct virtio_mem_config));
     vmem->vq = virtio_add_queue(vdev, 128, virtio_mem_handle_request);
@@ -780,7 +892,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
      */
     memory_region_set_ram_discard_manager(&vmem->memdev->mr,
                                           RAM_DISCARD_MANAGER(vmem));
-
+    virtio_mem_resize_usable_region(vmem, vmem->requested_size, true);
     host_memory_backend_set_mapped(vmem->memdev, true);
     vmstate_register_ram(&vmem->memdev->mr, DEVICE(vmem));
     qemu_register_reset(virtio_mem_system_reset, vmem);
@@ -794,9 +906,12 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
     qemu_unregister_reset(virtio_mem_system_reset, vmem);
     vmstate_unregister_ram(&vmem->memdev->mr, DEVICE(vmem));
     host_memory_backend_set_mapped(vmem->memdev, false);
+    virtio_mem_resize_usable_region(vmem, 0, true);
     memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
     virtio_del_queue(vdev, 0);
     virtio_cleanup(vdev);
+    g_free(vmem->memslots);
+    g_free(vmem->mr);
     g_free(vmem->bitmap);
     ram_block_coordinated_discard_require(false);
 }
@@ -955,7 +1070,10 @@ static MemoryRegion *virtio_mem_get_memory_region(VirtIOMEM *vmem, Error **errp)
         return NULL;
     }
 
-    return &vmem->memdev->mr;
+    if (!vmem->mr) {
+        virtio_mem_alloc_mr(vmem);
+    }
+    return vmem->mr;
 }
 
 static void virtio_mem_add_size_change_notifier(VirtIOMEM *vmem,
@@ -1084,10 +1202,62 @@ static void virtio_mem_set_block_size(Object *obj, Visitor *v, const char *name,
     vmem->block_size = value;
 }
 
+static void virtio_mem_get_used_memslots(Object *obj, Visitor *v,
+                                          const char *name,
+                                          void *opaque, Error **errp)
+{
+    const VirtIOMEM *vmem = VIRTIO_MEM(obj);
+    uint16_t value = vmem->nb_used_memslots;
+
+    visit_type_uint16(v, name, &value, errp);
+}
+
+static void virtio_mem_get_memslots(Object *obj, Visitor *v, const char *name,
+                                    void *opaque, Error **errp)
+{
+    VirtIOMEM *vmem = VIRTIO_MEM(obj);
+    uint16_t value = vmem->nb_memslots;
+
+    visit_type_uint16(v, name, &value, errp);
+}
+
+static void virtio_mem_set_memslots(Object *obj, Visitor *v, const char *name,
+                                    void *opaque, Error **errp)
+{
+    VirtIOMEM *vmem = VIRTIO_MEM(obj);
+    Error *err = NULL;
+    uint16_t value;
+
+    if (DEVICE(obj)->realized) {
+        error_setg(errp, "'%s' cannot be changed", name);
+        return;
+    }
+
+    visit_type_uint16(v, name, &value, &err);
+    if (err) {
+        error_propagate(errp, err);
+        return;
+    }
+    if (value > VIRTIO_MEM_MAX_MEMSLOTS) {
+        error_setg(errp, "'%s' property must not exceed '%d'", name,
+                   VIRTIO_MEM_MAX_MEMSLOTS);
+        return;
+    } else if (!value) {
+        error_setg(errp, "'%s' property must not be '0'", name);
+        return;
+    }
+    vmem->nb_memslots = value;
+}
+
 static void virtio_mem_instance_init(Object *obj)
 {
     VirtIOMEM *vmem = VIRTIO_MEM(obj);
 
+    /*
+     * Default to a single memslot, the old default; users have to opt in for
+     * more by setting the "memslots" property accordingly.
+     */
+    vmem->nb_memslots = 1;
     notifier_list_init(&vmem->size_change_notifiers);
     QLIST_INIT(&vmem->rdl_list);
 
@@ -1099,6 +1269,11 @@ static void virtio_mem_instance_init(Object *obj)
     object_property_add(obj, VIRTIO_MEM_BLOCK_SIZE_PROP, "size",
                         virtio_mem_get_block_size, virtio_mem_set_block_size,
                         NULL, NULL);
+    object_property_add(obj, VIRTIO_MEM_MEMSLOTS_PROP, "uint16",
+                        virtio_mem_get_memslots, virtio_mem_set_memslots, NULL,
+                        NULL);
+    object_property_add(obj, VIRTIO_MEM_USED_MEMSLOTS_PROP, "uint16",
+                        virtio_mem_get_used_memslots, NULL, NULL, NULL);
 }
 
 static Property virtio_mem_properties[] = {
diff --git a/include/hw/virtio/virtio-mem.h b/include/hw/virtio/virtio-mem.h
index a5dd6a493b..8d72427be2 100644
--- a/include/hw/virtio/virtio-mem.h
+++ b/include/hw/virtio/virtio-mem.h
@@ -30,6 +30,8 @@ OBJECT_DECLARE_TYPE(VirtIOMEM, VirtIOMEMClass,
 #define VIRTIO_MEM_REQUESTED_SIZE_PROP "requested-size"
 #define VIRTIO_MEM_BLOCK_SIZE_PROP "block-size"
 #define VIRTIO_MEM_ADDR_PROP "memaddr"
+#define VIRTIO_MEM_MEMSLOTS_PROP "memslots"
+#define VIRTIO_MEM_USED_MEMSLOTS_PROP "used-memslots"
 
 struct VirtIOMEM {
     VirtIODevice parent_obj;
@@ -41,9 +43,30 @@ struct VirtIOMEM {
     int32_t bitmap_size;
     unsigned long *bitmap;
 
-    /* assigned memory backend and memory region */
+    /* Device memory region in which we dynamically map memslots */
+    MemoryRegion *mr;
+
+    /*
+     * Assigned memory backend with the RAM memory region we will split
+     * into memslots to dynamically map them into the device memory region.
+     */
     HostMemoryBackend *memdev;
 
+    /*
+     * Individual memslots we dynamically map that are aliases to the
+     * assigned RAM memory region
+     */
+    MemoryRegion *memslots;
+
+    /* Total number of memslots we're going to use. */
+    uint16_t nb_memslots;
+
+    /* Current number of memslots we're using. */
+    uint16_t nb_used_memslots;
+
+    /* Size of one memslot (the last one can differ) */
+    uint64_t memslot_size;
+
     /* NUMA node */
     uint32_t node;
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 02/12] vhost: Return number of free memslots
  2021-10-27 12:45 ` [PATCH v1 02/12] vhost: " David Hildenbrand
@ 2021-10-27 13:36   ` Philippe Mathieu-Daudé
  2021-10-27 13:37     ` David Hildenbrand
  2021-10-27 14:04     ` David Hildenbrand
  0 siblings, 2 replies; 34+ messages in thread
From: Philippe Mathieu-Daudé @ 2021-10-27 13:36 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	Stefan Hajnoczi, Dr . David Alan Gilbert, Peter Xu,
	Sebastien Boeuf, Igor Mammedov, Ani Sinha, Paolo Bonzini,
	Hui Zhu

On 10/27/21 14:45, David Hildenbrand wrote:
> Let's return the number of free slots instead of only checking if there
> is a free slot. Required to support memory devices that consume multiple
> memslots.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  hw/mem/memory-device.c    | 2 +-
>  hw/virtio/vhost-stub.c    | 2 +-
>  hw/virtio/vhost.c         | 4 ++--
>  include/hw/virtio/vhost.h | 2 +-
>  4 files changed, 5 insertions(+), 5 deletions(-)

> --- a/hw/virtio/vhost-stub.c
> +++ b/hw/virtio/vhost-stub.c
> @@ -2,7 +2,7 @@
>  #include "hw/virtio/vhost.h"
>  #include "hw/virtio/vhost-user.h"
>  
> -bool vhost_has_free_slot(void)
> +unsigned int vhost_get_free_memslots(void)
>  {
>      return true;

       return 0;

>  }


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 02/12] vhost: Return number of free memslots
  2021-10-27 13:36   ` Philippe Mathieu-Daudé
@ 2021-10-27 13:37     ` David Hildenbrand
  2021-10-27 14:04     ` David Hildenbrand
  1 sibling, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2021-10-27 13:37 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, qemu-devel
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	Stefan Hajnoczi, Dr . David Alan Gilbert, Peter Xu,
	Sebastien Boeuf, Igor Mammedov, Ani Sinha, Paolo Bonzini,
	Hui Zhu

On 27.10.21 15:36, Philippe Mathieu-Daudé wrote:
> On 10/27/21 14:45, David Hildenbrand wrote:
>> Let's return the number of free slots instead of only checking if there
>> is a free slot. Required to support memory devices that consume multiple
>> memslots.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  hw/mem/memory-device.c    | 2 +-
>>  hw/virtio/vhost-stub.c    | 2 +-
>>  hw/virtio/vhost.c         | 4 ++--
>>  include/hw/virtio/vhost.h | 2 +-
>>  4 files changed, 5 insertions(+), 5 deletions(-)
> 
>> --- a/hw/virtio/vhost-stub.c
>> +++ b/hw/virtio/vhost-stub.c
>> @@ -2,7 +2,7 @@
>>  #include "hw/virtio/vhost.h"
>>  #include "hw/virtio/vhost-user.h"
>>  
>> -bool vhost_has_free_slot(void)
>> +unsigned int vhost_get_free_memslots(void)
>>  {
>>      return true;
> 
>        return 0;

Thanks, nice catch!

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 02/12] vhost: Return number of free memslots
  2021-10-27 13:36   ` Philippe Mathieu-Daudé
  2021-10-27 13:37     ` David Hildenbrand
@ 2021-10-27 14:04     ` David Hildenbrand
  2021-10-27 14:11       ` Philippe Mathieu-Daudé
  1 sibling, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2021-10-27 14:04 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, qemu-devel
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	Stefan Hajnoczi, Dr . David Alan Gilbert, Peter Xu,
	Sebastien Boeuf, Igor Mammedov, Ani Sinha, Paolo Bonzini,
	Hui Zhu

On 27.10.21 15:36, Philippe Mathieu-Daudé wrote:
> On 10/27/21 14:45, David Hildenbrand wrote:
>> Let's return the number of free slots instead of only checking if there
>> is a free slot. Required to support memory devices that consume multiple
>> memslots.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  hw/mem/memory-device.c    | 2 +-
>>  hw/virtio/vhost-stub.c    | 2 +-
>>  hw/virtio/vhost.c         | 4 ++--
>>  include/hw/virtio/vhost.h | 2 +-
>>  4 files changed, 5 insertions(+), 5 deletions(-)
> 
>> --- a/hw/virtio/vhost-stub.c
>> +++ b/hw/virtio/vhost-stub.c
>> @@ -2,7 +2,7 @@
>>  #include "hw/virtio/vhost.h"
>>  #include "hw/virtio/vhost-user.h"
>>  
>> -bool vhost_has_free_slot(void)
>> +unsigned int vhost_get_free_memslots(void)
>>  {
>>      return true;
> 
>        return 0;

Oh wait, no. This actually has to be

"return ~0U;" (see real vhost_get_free_memslots())

... because there is no vhost and consequently no limit applies.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 02/12] vhost: Return number of free memslots
  2021-10-27 14:04     ` David Hildenbrand
@ 2021-10-27 14:11       ` Philippe Mathieu-Daudé
  2021-10-27 15:33         ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: Philippe Mathieu-Daudé @ 2021-10-27 14:11 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Eduardo Habkost, kvm, Michael S. Tsirkin, Richard Henderson,
	Stefan Hajnoczi, Dr . David Alan Gilbert, Peter Xu,
	Sebastien Boeuf, Igor Mammedov, Ani Sinha, Paolo Bonzini,
	Hui Zhu

On 10/27/21 16:04, David Hildenbrand wrote:
> On 27.10.21 15:36, Philippe Mathieu-Daudé wrote:
>> On 10/27/21 14:45, David Hildenbrand wrote:
>>> Let's return the number of free slots instead of only checking if there
>>> is a free slot. Required to support memory devices that consume multiple
>>> memslots.
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>  hw/mem/memory-device.c    | 2 +-
>>>  hw/virtio/vhost-stub.c    | 2 +-
>>>  hw/virtio/vhost.c         | 4 ++--
>>>  include/hw/virtio/vhost.h | 2 +-
>>>  4 files changed, 5 insertions(+), 5 deletions(-)

>>> -bool vhost_has_free_slot(void)
>>> +unsigned int vhost_get_free_memslots(void)
>>>  {
>>>      return true;
>>
>>        return 0;
> 
> Oh wait, no. This actually has to be
> 
> "return ~0U;" (see real vhost_get_free_memslots())
> 
> ... because there is no vhost and consequently no limit applies.

Indeed.

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 02/12] vhost: Return number of free memslots
  2021-10-27 14:11       ` Philippe Mathieu-Daudé
@ 2021-10-27 15:33         ` Michael S. Tsirkin
  2021-10-27 15:45           ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2021-10-27 15:33 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: David Hildenbrand, qemu-devel, Eduardo Habkost, kvm,
	Richard Henderson, Stefan Hajnoczi, Dr . David Alan Gilbert,
	Peter Xu, Sebastien Boeuf, Igor Mammedov, Ani Sinha,
	Paolo Bonzini, Hui Zhu

On Wed, Oct 27, 2021 at 04:11:38PM +0200, Philippe Mathieu-Daudé wrote:
> On 10/27/21 16:04, David Hildenbrand wrote:
> > On 27.10.21 15:36, Philippe Mathieu-Daudé wrote:
> >> On 10/27/21 14:45, David Hildenbrand wrote:
> >>> Let's return the number of free slots instead of only checking if there
> >>> is a free slot. Required to support memory devices that consume multiple
> >>> memslots.
> >>>
> >>> Signed-off-by: David Hildenbrand <david@redhat.com>
> >>> ---
> >>>  hw/mem/memory-device.c    | 2 +-
> >>>  hw/virtio/vhost-stub.c    | 2 +-
> >>>  hw/virtio/vhost.c         | 4 ++--
> >>>  include/hw/virtio/vhost.h | 2 +-
> >>>  4 files changed, 5 insertions(+), 5 deletions(-)
> 
> >>> -bool vhost_has_free_slot(void)
> >>> +unsigned int vhost_get_free_memslots(void)
> >>>  {
> >>>      return true;
> >>
> >>        return 0;
> > 
> > Oh wait, no. This actually has to be
> > 
> > "return ~0U;" (see real vhost_get_free_memslots())
> > 
> > ... because there is no vhost and consequently no limit applies.
> 
> Indeed.
> 
> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>

confused. are you acking the theoretical patch with ~0 here?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 02/12] vhost: Return number of free memslots
  2021-10-27 15:33         ` Michael S. Tsirkin
@ 2021-10-27 15:45           ` David Hildenbrand
  2021-10-27 16:11             ` Philippe Mathieu-Daudé
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2021-10-27 15:45 UTC (permalink / raw)
  To: Michael S. Tsirkin, Philippe Mathieu-Daudé
  Cc: qemu-devel, Eduardo Habkost, kvm, Richard Henderson,
	Stefan Hajnoczi, Dr . David Alan Gilbert, Peter Xu,
	Sebastien Boeuf, Igor Mammedov, Ani Sinha, Paolo Bonzini,
	Hui Zhu

On 27.10.21 17:33, Michael S. Tsirkin wrote:
> On Wed, Oct 27, 2021 at 04:11:38PM +0200, Philippe Mathieu-Daudé wrote:
>> On 10/27/21 16:04, David Hildenbrand wrote:
>>> On 27.10.21 15:36, Philippe Mathieu-Daudé wrote:
>>>> On 10/27/21 14:45, David Hildenbrand wrote:
>>>>> Let's return the number of free slots instead of only checking if there
>>>>> is a free slot. Required to support memory devices that consume multiple
>>>>> memslots.
>>>>>
>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>> ---
>>>>>  hw/mem/memory-device.c    | 2 +-
>>>>>  hw/virtio/vhost-stub.c    | 2 +-
>>>>>  hw/virtio/vhost.c         | 4 ++--
>>>>>  include/hw/virtio/vhost.h | 2 +-
>>>>>  4 files changed, 5 insertions(+), 5 deletions(-)
>>
>>>>> -bool vhost_has_free_slot(void)
>>>>> +unsigned int vhost_get_free_memslots(void)
>>>>>  {
>>>>>      return true;
>>>>
>>>>        return 0;
>>>
>>> Oh wait, no. This actually has to be
>>>
>>> "return ~0U;" (see real vhost_get_free_memslots())
>>>
>>> ... because there is no vhost and consequently no limit applies.
>>
>> Indeed.
>>
>> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
> 
> confused. are you acking the theoretical patch with ~0 here?
> 

That's how I interpreted it.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 02/12] vhost: Return number of free memslots
  2021-10-27 15:45           ` David Hildenbrand
@ 2021-10-27 16:11             ` Philippe Mathieu-Daudé
  2021-10-27 16:51               ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Philippe Mathieu-Daudé @ 2021-10-27 16:11 UTC (permalink / raw)
  To: David Hildenbrand, Michael S. Tsirkin
  Cc: qemu-devel, Eduardo Habkost, kvm, Richard Henderson,
	Stefan Hajnoczi, Dr . David Alan Gilbert, Peter Xu,
	Sebastien Boeuf, Igor Mammedov, Ani Sinha, Paolo Bonzini,
	Hui Zhu

On 10/27/21 17:45, David Hildenbrand wrote:
> On 27.10.21 17:33, Michael S. Tsirkin wrote:
>> On Wed, Oct 27, 2021 at 04:11:38PM +0200, Philippe Mathieu-Daudé wrote:
>>> On 10/27/21 16:04, David Hildenbrand wrote:
>>>> On 27.10.21 15:36, Philippe Mathieu-Daudé wrote:
>>>>> On 10/27/21 14:45, David Hildenbrand wrote:
>>>>>> Let's return the number of free slots instead of only checking if there
>>>>>> is a free slot. Required to support memory devices that consume multiple
>>>>>> memslots.
>>>>>>
>>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>>> ---
>>>>>>  hw/mem/memory-device.c    | 2 +-
>>>>>>  hw/virtio/vhost-stub.c    | 2 +-
>>>>>>  hw/virtio/vhost.c         | 4 ++--
>>>>>>  include/hw/virtio/vhost.h | 2 +-
>>>>>>  4 files changed, 5 insertions(+), 5 deletions(-)
>>>
>>>>>> -bool vhost_has_free_slot(void)
>>>>>> +unsigned int vhost_get_free_memslots(void)
>>>>>>  {
>>>>>>      return true;
>>>>>
>>>>>        return 0;
>>>>
>>>> Oh wait, no. This actually has to be
>>>>
>>>> "return ~0U;" (see real vhost_get_free_memslots())
>>>>
>>>> ... because there is no vhost and consequently no limit applies.
>>>
>>> Indeed.
>>>
>>> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
>>
>> confused. are you acking the theoretical patch with ~0 here?
>>
> 
> That's how I interpreted it.

~0U doesn't seem harmful when comparing. However I haven't tested
nor looked at the big picture. I wonder if vhost_has_free_slot()
shouldn't take the Error* as argument and each implementation set
the error message ("virtio/vhost support disabled" would be more
explicit in the stub case). But I still don't understand why when
built without virtio/vhost we return vhost_get_free_memslots() > 0.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 02/12] vhost: Return number of free memslots
  2021-10-27 16:11             ` Philippe Mathieu-Daudé
@ 2021-10-27 16:51               ` David Hildenbrand
  0 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2021-10-27 16:51 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, Michael S. Tsirkin
  Cc: qemu-devel, Eduardo Habkost, kvm, Richard Henderson,
	Stefan Hajnoczi, Dr . David Alan Gilbert, Peter Xu,
	Sebastien Boeuf, Igor Mammedov, Ani Sinha, Paolo Bonzini,
	Hui Zhu

On 27.10.21 18:11, Philippe Mathieu-Daudé wrote:
> On 10/27/21 17:45, David Hildenbrand wrote:
>> On 27.10.21 17:33, Michael S. Tsirkin wrote:
>>> On Wed, Oct 27, 2021 at 04:11:38PM +0200, Philippe Mathieu-Daudé wrote:
>>>> On 10/27/21 16:04, David Hildenbrand wrote:
>>>>> On 27.10.21 15:36, Philippe Mathieu-Daudé wrote:
>>>>>> On 10/27/21 14:45, David Hildenbrand wrote:
>>>>>>> Let's return the number of free slots instead of only checking if there
>>>>>>> is a free slot. Required to support memory devices that consume multiple
>>>>>>> memslots.
>>>>>>>
>>>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>>>> ---
>>>>>>>  hw/mem/memory-device.c    | 2 +-
>>>>>>>  hw/virtio/vhost-stub.c    | 2 +-
>>>>>>>  hw/virtio/vhost.c         | 4 ++--
>>>>>>>  include/hw/virtio/vhost.h | 2 +-
>>>>>>>  4 files changed, 5 insertions(+), 5 deletions(-)
>>>>
>>>>>>> -bool vhost_has_free_slot(void)
>>>>>>> +unsigned int vhost_get_free_memslots(void)
>>>>>>>  {
>>>>>>>      return true;
>>>>>>
>>>>>>        return 0;
>>>>>
>>>>> Oh wait, no. This actually has to be
>>>>>
>>>>> "return ~0U;" (see real vhost_get_free_memslots())
>>>>>
>>>>> ... because there is no vhost and consequently no limit applies.
>>>>
>>>> Indeed.
>>>>
>>>> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
>>>
>>> confused. are you acking the theoretical patch with ~0 here?
>>>
>>
>> That's how I interpreted it.
> 
> ~0U doesn't seem harmful when comparing. However I haven't tested
> nor looked at the big picture. I wonder if vhost_has_free_slot()
> shouldn't take the Error* as argument and each implementation set
> the error message ("virtio/vhost support disabled" would be more
> explicit in the stub case). But I still don't understand why when
> built without virtio/vhost we return vhost_get_free_memslots() > 0.

For the same reason we faked infinite slots via
vhost_has_free_slot()->true for now. We call it unconditionally from
memory device code.

Sure, we could add a stub "vhost_available()-> false" (or
vhost_enabled() ?) instead and do

if (vhost_available())
	... vhost_get_free_memslots()

similar to how we have

if (kvm_enabled())
	... kvm_get_free_memslots()

Not sure if it's worth it, though.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots
  2021-10-27 12:45 [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots David Hildenbrand
                   ` (11 preceding siblings ...)
  2021-10-27 12:45 ` [PATCH v1 12/12] virtio-mem: Expose device memory via multiple memslots David Hildenbrand
@ 2021-11-01 22:15 ` Michael S. Tsirkin
  2021-11-02  8:33   ` David Hildenbrand
  12 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2021-11-01 22:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, Paolo Bonzini, Eduardo Habkost, Marcel Apfelbaum,
	Igor Mammedov, Ani Sinha, Peter Xu, Dr . David Alan Gilbert,
	Stefan Hajnoczi, Richard Henderson, Philippe Mathieu-Daudé,
	Hui Zhu, Sebastien Boeuf, kvm

On Wed, Oct 27, 2021 at 02:45:19PM +0200, David Hildenbrand wrote:
> This is the follow-up of [1], dropping auto-detection and vhost-user
> changes from the initial RFC.
> 
> Based-on: 20211011175346.15499-1-david@redhat.com
> 
> A virtio-mem device is represented by a single large RAM memory region
> backed by a single large mmap.
> 
> Right now, we map that complete memory region into guest physical addres
> space, resulting in a very large memory mapping, KVM memory slot, ...
> although only a small amount of memory might actually be exposed to the VM.
> 
> For example, when starting a VM with a 1 TiB virtio-mem device that only
> exposes little device memory (e.g., 1 GiB) towards the VM initialliy,
> in order to hotplug more memory later, we waste a lot of memory on metadata
> for KVM memory slots (> 2 GiB!) and accompanied bitmaps. Although some
> optimizations in KVM are being worked on to reduce this metadata overhead
> on x86-64 in some cases, it remains a problem with nested VMs and there are
> other reasons why we would want to reduce the total memory slot to a
> reasonable minimum.
> 
> We want to:
> a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
>    inside QEMU KVM code where possible.
> b) Not always expose all device-memory to the VM, to reduce the attack
>    surface of malicious VMs without using userfaultfd.

I'm confused by the mention of these security considerations,
and I expect users will be just as confused.
So let's say user wants to not be exposed. What value for
the option should be used? What if a lower option is used?
Is there still some security advantage?

> So instead, expose the RAM memory region not by a single large mapping
> (consuming one memslot) but instead by multiple mappings, each consuming
> one memslot. To do that, we divide the RAM memory region via aliases into
> separate parts and only map the aliases into a device container we actually
> need. We have to make sure that QEMU won't silently merge the memory
> sections corresponding to the aliases (and thereby also memslots),
> otherwise we lose atomic updates with KVM and vhost-user, which we deeply
> care about when adding/removing memory. Further, to get memslot accounting
> right, such merging is better avoided.
> 
> Within the memslots, virtio-mem can (un)plug memory in smaller granularity
> dynamically. So memslots are a pure optimization to tackle a) and b) above.
> 
> The user configures how many memslots a virtio-mem device should use, the
> default is "1" -- essentially corresponding to the old behavior.
> 
> Memslots are right now mapped once they fall into the usable device region
> (which grows/shrinks on demand right now either when requesting to
>  hotplug more memory or during/after reboots). In the future, with
> VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, we'll be able to (un)map aliases even
> more dynamically when (un)plugging device blocks.
> 
> 
> Adding a 500GiB virtio-mem device with "memslots=500" and not hotplugging
> any memory results in:
>     0000000140000000-000001047fffffff (prio 0, i/o): device-memory
>       0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
> 
> Requesting the VM to consume 2 GiB results in (note: the usable region size
> is bigger than 2 GiB, so 3 * 1 GiB memslots are required):
>     0000000140000000-000001047fffffff (prio 0, i/o): device-memory
>       0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
>         0000000140000000-000000017fffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
>         0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
>         00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff
> 
> Requesting the VM to consume 20 GiB results in:
>     0000000140000000-000001047fffffff (prio 0, i/o): device-memory
>       0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
>         0000000140000000-000000017fffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
>         0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
>         00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff
>         0000000200000000-000000023fffffff (prio 0, ram): alias virtio-mem-memslot-3 @mem0 00000000c0000000-00000000ffffffff
>         0000000240000000-000000027fffffff (prio 0, ram): alias virtio-mem-memslot-4 @mem0 0000000100000000-000000013fffffff
>         0000000280000000-00000002bfffffff (prio 0, ram): alias virtio-mem-memslot-5 @mem0 0000000140000000-000000017fffffff
>         00000002c0000000-00000002ffffffff (prio 0, ram): alias virtio-mem-memslot-6 @mem0 0000000180000000-00000001bfffffff
>         0000000300000000-000000033fffffff (prio 0, ram): alias virtio-mem-memslot-7 @mem0 00000001c0000000-00000001ffffffff
>         0000000340000000-000000037fffffff (prio 0, ram): alias virtio-mem-memslot-8 @mem0 0000000200000000-000000023fffffff
>         0000000380000000-00000003bfffffff (prio 0, ram): alias virtio-mem-memslot-9 @mem0 0000000240000000-000000027fffffff
>         00000003c0000000-00000003ffffffff (prio 0, ram): alias virtio-mem-memslot-10 @mem0 0000000280000000-00000002bfffffff
>         0000000400000000-000000043fffffff (prio 0, ram): alias virtio-mem-memslot-11 @mem0 00000002c0000000-00000002ffffffff
>         0000000440000000-000000047fffffff (prio 0, ram): alias virtio-mem-memslot-12 @mem0 0000000300000000-000000033fffffff
>         0000000480000000-00000004bfffffff (prio 0, ram): alias virtio-mem-memslot-13 @mem0 0000000340000000-000000037fffffff
>         00000004c0000000-00000004ffffffff (prio 0, ram): alias virtio-mem-memslot-14 @mem0 0000000380000000-00000003bfffffff
>         0000000500000000-000000053fffffff (prio 0, ram): alias virtio-mem-memslot-15 @mem0 00000003c0000000-00000003ffffffff
>         0000000540000000-000000057fffffff (prio 0, ram): alias virtio-mem-memslot-16 @mem0 0000000400000000-000000043fffffff
>         0000000580000000-00000005bfffffff (prio 0, ram): alias virtio-mem-memslot-17 @mem0 0000000440000000-000000047fffffff
>         00000005c0000000-00000005ffffffff (prio 0, ram): alias virtio-mem-memslot-18 @mem0 0000000480000000-00000004bfffffff
>         0000000600000000-000000063fffffff (prio 0, ram): alias virtio-mem-memslot-19 @mem0 00000004c0000000-00000004ffffffff
>         0000000640000000-000000067fffffff (prio 0, ram): alias virtio-mem-memslot-20 @mem0 0000000500000000-000000053fffffff
> 
> Requesting the VM to consume 5 GiB and rebooting (note: usable region size
> will change during reboots) results in:
>     0000000140000000-000001047fffffff (prio 0, i/o): device-memory
>       0000000140000000-0000007e3fffffff (prio 0, i/o): virtio-mem-memslots
>         0000000140000000-000000017fffffff (prio 0, ram): alias virtio-mem-memslot-0 @mem0 0000000000000000-000000003fffffff
>         0000000180000000-00000001bfffffff (prio 0, ram): alias virtio-mem-memslot-1 @mem0 0000000040000000-000000007fffffff
>         00000001c0000000-00000001ffffffff (prio 0, ram): alias virtio-mem-memslot-2 @mem0 0000000080000000-00000000bfffffff
>         0000000200000000-000000023fffffff (prio 0, ram): alias virtio-mem-memslot-3 @mem0 00000000c0000000-00000000ffffffff
>         0000000240000000-000000027fffffff (prio 0, ram): alias virtio-mem-memslot-4 @mem0 0000000100000000-000000013fffffff
>         0000000280000000-00000002bfffffff (prio 0, ram): alias virtio-mem-memslot-5 @mem0 0000000140000000-000000017fffffff
> 
> 
> In addition to other factors (e.g., device block size), we limit the number
> of memslots to 1024 per devices and the size of one memslot to at least
> 128 MiB. Further, we make sure internally to align the memslot size to at
> least 128 MiB. For now, we limit the total number of memslots that can
> be used by memory devices to 2048, to no go crazy on individual RAM
> mappings in our address spaces.
> 
> Future work:
> - vhost-user and libvhost-user/vhost-user-backend changes to support more than
>   32 memslots.
> - "memslots=0" mode to allow for auto-determining the number of memslots to
>   use.
> - Eventually have an interface to query the memslot limit for a QEMU
>   instance. But vhost-* devices complicate that matter.
> 
> RCF -> v1:
> - Dropped "max-memslots=" parameter and converted to "memslots=" parameter
> - Dropped auto-determining the number of memslots to use
> - Dropped vhost* memslot changes
> - Improved error messages regarding memory slot limits
> - Reshuffled, cleaned up patches, rewrote patch descriptions
> 
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: Ani Sinha <ani@anisinha.ca>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> Cc: Richard Henderson <richard.henderson@linaro.org>
> Cc: Philippe Mathieu-Daudé <f4bug@amsat.org>
> Cc: Hui Zhu <teawater@gmail.com>
> Cc: Sebastien Boeuf <sebastien.boeuf@intel.com>
> Cc: kvm@vger.kernel.org
> 
> [1] https://lkml.kernel.org/r/20211013103330.26869-1-david@redhat.com
> 
> David Hildenbrand (12):
>   kvm: Return number of free memslots
>   vhost: Return number of free memslots
>   memory: Allow for marking memory region aliases unmergeable
>   vhost: Don't merge unmergeable memory sections
>   memory-device: Move memory_device_check_addable() directly into
>     memory_device_pre_plug()
>   memory-device: Generalize memory_device_used_region_size()
>   memory-device: Support memory devices that dynamically consume
>     multiple memslots
>   vhost: Respect reserved memslots for memory devices when realizing a
>     vhost device
>   memory: Drop mapping check from
>     memory_region_get_ram_discard_manager()
>   virtio-mem: Fix typo in virito_mem_intersect_memory_section() function
>     name
>   virtio-mem: Set the RamDiscardManager for the RAM memory region
>     earlier
>   virtio-mem: Expose device memory via multiple memslots
> 
>  accel/kvm/kvm-all.c            |  24 ++--
>  accel/stubs/kvm-stub.c         |   4 +-
>  hw/mem/memory-device.c         | 115 ++++++++++++++----
>  hw/virtio/vhost-stub.c         |   2 +-
>  hw/virtio/vhost.c              |  21 ++--
>  hw/virtio/virtio-mem-pci.c     |  23 ++++
>  hw/virtio/virtio-mem.c         | 212 +++++++++++++++++++++++++++++----
>  include/exec/memory.h          |  23 ++++
>  include/hw/mem/memory-device.h |  33 +++++
>  include/hw/virtio/vhost.h      |   2 +-
>  include/hw/virtio/virtio-mem.h |  25 +++-
>  include/sysemu/kvm.h           |   2 +-
>  softmmu/memory.c               |  35 ++++--
>  stubs/qmp_memory_device.c      |   5 +
>  14 files changed, 449 insertions(+), 77 deletions(-)
> 
> -- 
> 2.31.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots
  2021-11-01 22:15 ` [PATCH v1 00/12] " Michael S. Tsirkin
@ 2021-11-02  8:33   ` David Hildenbrand
  2021-11-02 11:35     ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2021-11-02  8:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, Paolo Bonzini, Eduardo Habkost, Marcel Apfelbaum,
	Igor Mammedov, Ani Sinha, Peter Xu, Dr . David Alan Gilbert,
	Stefan Hajnoczi, Richard Henderson, Philippe Mathieu-Daudé,
	Hui Zhu, Sebastien Boeuf, kvm

On 01.11.21 23:15, Michael S. Tsirkin wrote:
> On Wed, Oct 27, 2021 at 02:45:19PM +0200, David Hildenbrand wrote:
>> This is the follow-up of [1], dropping auto-detection and vhost-user
>> changes from the initial RFC.
>>
>> Based-on: 20211011175346.15499-1-david@redhat.com
>>
>> A virtio-mem device is represented by a single large RAM memory region
>> backed by a single large mmap.
>>
>> Right now, we map that complete memory region into guest physical addres
>> space, resulting in a very large memory mapping, KVM memory slot, ...
>> although only a small amount of memory might actually be exposed to the VM.
>>
>> For example, when starting a VM with a 1 TiB virtio-mem device that only
>> exposes little device memory (e.g., 1 GiB) towards the VM initialliy,
>> in order to hotplug more memory later, we waste a lot of memory on metadata
>> for KVM memory slots (> 2 GiB!) and accompanied bitmaps. Although some
>> optimizations in KVM are being worked on to reduce this metadata overhead
>> on x86-64 in some cases, it remains a problem with nested VMs and there are
>> other reasons why we would want to reduce the total memory slot to a
>> reasonable minimum.
>>
>> We want to:
>> a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
>>    inside QEMU KVM code where possible.
>> b) Not always expose all device-memory to the VM, to reduce the attack
>>    surface of malicious VMs without using userfaultfd.
> 
> I'm confused by the mention of these security considerations,
> and I expect users will be just as confused.

Malicious VMs wanting to consume more memory than desired is only
relevant when running untrusted VMs in some environments, and it can be
caught differently, for example, by carefully monitoring and limiting
the maximum memory consumption of a VM. We have the same issue already
when using virtio-balloon to logically unplug memory. For me, it's a
secondary concern (optimizing a) is much more important).

Some users showed interest in having QEMU disallow access to unplugged
memory, because coming up with a maximum memory consumption for a VM is
hard. This is one step in that direction without having to run with
uffd enabled all of the time.

("security is somewhat the wrong word. we won't be able to steal any
information from the hypervisor.)


> So let's say user wants to not be exposed. What value for
> the option should be used? What if a lower option is used?
> Is there still some security advantage?

My recommendation will be to use 1 memslot per gigabyte as default if
possible in the configuration. If we have a virtio-mem device with a
maximum size of 128 GiB, the suggestion will be to use memslots=128.
Some setups will require less (e.g., vhost-user until adjusted, old
KVM), some setups can allow for more. I assume that most users will
later set "memslots=0", to enable auto-detection mode.


Assume we have a virtio-mem device with a maximum size of 1 TiB and we
hotplugged 1 GiB to the VM. With "memslots=1", the malicious VM could
actually access the whole 1 TiB. With "memslots=1024", the malicious VM
could only access additional ~ 1 GiB. With "memslots=512", ~ 2 GiB.
That's the reduced attack surface.

Of course, it's different after we hotunplugged memory, before we have
VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE support in QEMU, because all memory
inside the usable region has to be accessible and we cannot "unplug" the
memslots.


Note: With upcoming VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE changes in QEMU,
one will be able to disallow any access for malicious VMs by setting the
memslot size just as big as the device block size.

So with a 128 GiB virtio-mem device with memslots=128,block-size=1G, or
with memslots=1024,block-size=128M we could make it impossible for a
malicious VM to consume more memory than intended. But we lose
flexibility due to the block size and the limited number of available
memslots.

But again, for "full protection against malicious VMs" I consider
userfaultfd protection more flexible. The approach here gives some
advantage, especially with large virtio-mem devices that start
out small.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots
  2021-11-02  8:33   ` David Hildenbrand
@ 2021-11-02 11:35     ` Michael S. Tsirkin
  2021-11-02 11:55       ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2021-11-02 11:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, Paolo Bonzini, Eduardo Habkost, Marcel Apfelbaum,
	Igor Mammedov, Ani Sinha, Peter Xu, Dr . David Alan Gilbert,
	Stefan Hajnoczi, Richard Henderson, Philippe Mathieu-Daudé,
	Hui Zhu, Sebastien Boeuf, kvm

On Tue, Nov 02, 2021 at 09:33:55AM +0100, David Hildenbrand wrote:
> On 01.11.21 23:15, Michael S. Tsirkin wrote:
> > On Wed, Oct 27, 2021 at 02:45:19PM +0200, David Hildenbrand wrote:
> >> This is the follow-up of [1], dropping auto-detection and vhost-user
> >> changes from the initial RFC.
> >>
> >> Based-on: 20211011175346.15499-1-david@redhat.com
> >>
> >> A virtio-mem device is represented by a single large RAM memory region
> >> backed by a single large mmap.
> >>
> >> Right now, we map that complete memory region into guest physical addres
> >> space, resulting in a very large memory mapping, KVM memory slot, ...
> >> although only a small amount of memory might actually be exposed to the VM.
> >>
> >> For example, when starting a VM with a 1 TiB virtio-mem device that only
> >> exposes little device memory (e.g., 1 GiB) towards the VM initialliy,
> >> in order to hotplug more memory later, we waste a lot of memory on metadata
> >> for KVM memory slots (> 2 GiB!) and accompanied bitmaps. Although some
> >> optimizations in KVM are being worked on to reduce this metadata overhead
> >> on x86-64 in some cases, it remains a problem with nested VMs and there are
> >> other reasons why we would want to reduce the total memory slot to a
> >> reasonable minimum.
> >>
> >> We want to:
> >> a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
> >>    inside QEMU KVM code where possible.
> >> b) Not always expose all device-memory to the VM, to reduce the attack
> >>    surface of malicious VMs without using userfaultfd.
> > 
> > I'm confused by the mention of these security considerations,
> > and I expect users will be just as confused.
> 
> Malicious VMs wanting to consume more memory than desired is only
> relevant when running untrusted VMs in some environments, and it can be
> caught differently, for example, by carefully monitoring and limiting
> the maximum memory consumption of a VM. We have the same issue already
> when using virtio-balloon to logically unplug memory. For me, it's a
> secondary concern (optimizing a) is much more important).
> 
> Some users showed interest in having QEMU disallow access to unplugged
> memory, because coming up with a maximum memory consumption for a VM is
> hard. This is one step in that direction without having to run with
> uffd enabled all of the time.

Sorry about missing the memo - is there a lot of overhead associated
with uffd then?

> ("security is somewhat the wrong word. we won't be able to steal any
> information from the hypervisor.)

Right. Let's just spell it out.
Further, removing memory still requires guest cooperation.

> 
> > So let's say user wants to not be exposed. What value for
> > the option should be used? What if a lower option is used?
> > Is there still some security advantage?
> 
> My recommendation will be to use 1 memslot per gigabyte as default if
> possible in the configuration. If we have a virtio-mem device with a
> maximum size of 128 GiB, the suggestion will be to use memslots=128.
> Some setups will require less (e.g., vhost-user until adjusted, old
> KVM), some setups can allow for more. I assume that most users will
> later set "memslots=0", to enable auto-detection mode.
> 
> 
> Assume we have a virtio-mem device with a maximum size of 1 TiB and we
> hotplugged 1 GiB to the VM. With "memslots=1", the malicious VM could
> actually access the whole 1 TiB. With "memslots=1024", the malicious VM
> could only access additional ~ 1 GiB. With "memslots=512", ~ 2 GiB.
> That's the reduced attack surface.
> 
> Of course, it's different after we hotunplugged memory, before we have
> VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE support in QEMU, because all memory
> inside the usable region has to be accessible and we cannot "unplug" the
> memslots.
> 
> 
> Note: With upcoming VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE changes in QEMU,
> one will be able to disallow any access for malicious VMs by setting the
> memblock size just as big as the device block size.
> 
> So with a 128 GiB virtio-mem device with memslots=128,block-size=1G, or
> with memslots=1024,block-size=128M we could make it impossible for a
> malicious VM to consume more memory than intended. But we lose
> flexibility due to the block size and the limited number of available
> memslots.
> 
> But again, for "full protection against malicious VMs" I consider
> userfaultfd protection more flexible. This approach here gives some
> advantage, especially when having large virtio-mem devices that start
> out small.
> 
> -- 
> Thanks,
> 
> David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots
  2021-11-02 11:35     ` Michael S. Tsirkin
@ 2021-11-02 11:55       ` David Hildenbrand
  2021-11-02 17:06         ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2021-11-02 11:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, Paolo Bonzini, Eduardo Habkost, Marcel Apfelbaum,
	Igor Mammedov, Ani Sinha, Peter Xu, Dr . David Alan Gilbert,
	Stefan Hajnoczi, Richard Henderson, Philippe Mathieu-Daudé,
	Hui Zhu, Sebastien Boeuf, kvm

On 02.11.21 12:35, Michael S. Tsirkin wrote:
> On Tue, Nov 02, 2021 at 09:33:55AM +0100, David Hildenbrand wrote:
>> On 01.11.21 23:15, Michael S. Tsirkin wrote:
>>> On Wed, Oct 27, 2021 at 02:45:19PM +0200, David Hildenbrand wrote:
>>>> This is the follow-up of [1], dropping auto-detection and vhost-user
>>>> changes from the initial RFC.
>>>>
>>>> Based-on: 20211011175346.15499-1-david@redhat.com
>>>>
>>>> A virtio-mem device is represented by a single large RAM memory region
>>>> backed by a single large mmap.
>>>>
>>>> Right now, we map that complete memory region into guest physical addres
>>>> space, resulting in a very large memory mapping, KVM memory slot, ...
>>>> although only a small amount of memory might actually be exposed to the VM.
>>>>
>>>> For example, when starting a VM with a 1 TiB virtio-mem device that only
>>>> exposes little device memory (e.g., 1 GiB) towards the VM initialliy,
>>>> in order to hotplug more memory later, we waste a lot of memory on metadata
>>>> for KVM memory slots (> 2 GiB!) and accompanied bitmaps. Although some
>>>> optimizations in KVM are being worked on to reduce this metadata overhead
>>>> on x86-64 in some cases, it remains a problem with nested VMs and there are
>>>> other reasons why we would want to reduce the total memory slot to a
>>>> reasonable minimum.
>>>>
>>>> We want to:
>>>> a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
>>>>    inside QEMU KVM code where possible.
>>>> b) Not always expose all device-memory to the VM, to reduce the attack
>>>>    surface of malicious VMs without using userfaultfd.
>>>
>>> I'm confused by the mention of these security considerations,
>>> and I expect users will be just as confused.
>>
>> Malicious VMs wanting to consume more memory than desired is only
>> relevant when running untrusted VMs in some environments, and it can be
>> caught differently, for example, by carefully monitoring and limiting
>> the maximum memory consumption of a VM. We have the same issue already
>> when using virtio-balloon to logically unplug memory. For me, it's a
>> secondary concern ( optimizing a is much more important ).
>>
>> Some users showed interest in having QEMU disallow access to unplugged
>> memory, because coming up with a maximum memory consumption for a VM is
>> hard. This is one step into that direction without having to run with
>> uffd enabled all of the time.
> 
> Sorry about missing the memo - is there a lot of overhead associated
> with uffd then?

When used with huge/gigantic pages, we don't particularly care.

For other memory backends, we'll have to route any population via the
uffd handler: guest accesses a 4k page -> place a 4k page from user
space. Instead of the kernel automatically placing a THP, we'd be
placing single 4k pages and have to hope the kernel will collapse them
into a THP later.

khugepagd will only collapse into a THP if all affected page table
entries are present and don't map the zero page, though.

So we'll most certainly use less THP for our VM and VM startup time
("first memory access after plugging memory") can be slower.

I have prototypes for it, with some optimizations (e.g., on 4k guest
access, populate the whole THP area), but we might not want to enable it
all of the time. (interaction with postcopy has to be fixed, but it's
not a fundamental issue)


Extending uffd-based protection for virtio-mem to other processes
(vhost-user), is a bit more complicated, and I am not 100% sure if it's
worth the trouble for now. memslots provide at least some high-level
protection for the important case of having a virtio-mem device to
eventually hotplug a lot of memory later.

> 
>> ("security is somewhat the wrong word. we won't be able to steal any
>> information from the hypervisor.)
> 
> Right. Let's just spell it out.
> Further, removing memory still requires guest cooperation.

Right.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots
  2021-11-02 11:55       ` David Hildenbrand
@ 2021-11-02 17:06         ` Michael S. Tsirkin
  2021-11-02 17:10           ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2021-11-02 17:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, Paolo Bonzini, Eduardo Habkost, Marcel Apfelbaum,
	Igor Mammedov, Ani Sinha, Peter Xu, Dr . David Alan Gilbert,
	Stefan Hajnoczi, Richard Henderson, Philippe Mathieu-Daudé,
	Hui Zhu, Sebastien Boeuf, kvm

On Tue, Nov 02, 2021 at 12:55:17PM +0100, David Hildenbrand wrote:
> On 02.11.21 12:35, Michael S. Tsirkin wrote:
> > On Tue, Nov 02, 2021 at 09:33:55AM +0100, David Hildenbrand wrote:
> >> On 01.11.21 23:15, Michael S. Tsirkin wrote:
> >>> On Wed, Oct 27, 2021 at 02:45:19PM +0200, David Hildenbrand wrote:
> >>>> This is the follow-up of [1], dropping auto-detection and vhost-user
> >>>> changes from the initial RFC.
> >>>>
> >>>> Based-on: 20211011175346.15499-1-david@redhat.com
> >>>>
> >>>> A virtio-mem device is represented by a single large RAM memory region
> >>>> backed by a single large mmap.
> >>>>
> >>>> Right now, we map that complete memory region into guest physical addres
> >>>> space, resulting in a very large memory mapping, KVM memory slot, ...
> >>>> although only a small amount of memory might actually be exposed to the VM.
> >>>>
> >>>> For example, when starting a VM with a 1 TiB virtio-mem device that only
> >>>> exposes little device memory (e.g., 1 GiB) towards the VM initialliy,
> >>>> in order to hotplug more memory later, we waste a lot of memory on metadata
> >>>> for KVM memory slots (> 2 GiB!) and accompanied bitmaps. Although some
> >>>> optimizations in KVM are being worked on to reduce this metadata overhead
> >>>> on x86-64 in some cases, it remains a problem with nested VMs and there are
> >>>> other reasons why we would want to reduce the total memory slot to a
> >>>> reasonable minimum.
> >>>>
> >>>> We want to:
> >>>> a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
> >>>>    inside QEMU KVM code where possible.
> >>>> b) Not always expose all device-memory to the VM, to reduce the attack
> >>>>    surface of malicious VMs without using userfaultfd.
> >>>
> >>> I'm confused by the mention of these security considerations,
> >>> and I expect users will be just as confused.
> >>
> >> Malicious VMs wanting to consume more memory than desired is only
> >> relevant when running untrusted VMs in some environments, and it can be
> >> caught differently, for example, by carefully monitoring and limiting
> >> the maximum memory consumption of a VM. We have the same issue already
> >> when using virtio-balloon to logically unplug memory. For me, it's a
> >> secondary concern ( optimizing a is much more important ).
> >>
> >> Some users showed interest in having QEMU disallow access to unplugged
> >> memory, because coming up with a maximum memory consumption for a VM is
> >> hard. This is one step into that direction without having to run with
> >> uffd enabled all of the time.
> > 
> > Sorry about missing the memo - is there a lot of overhead associated
> > with uffd then?
> 
> When used with huge/gigantic pages, we don't particularly care.
> 
> For other memory backends, we'll have to route any population via the
> uffd handler: guest accesses a 4k page -> place a 4k page from user
> space. Instead of the kernel automatically placing a THP, we'd be
> placing single 4k pages and have to hope the kernel will collapse them
> into a THP later.

How much value there is in a THP given it's not present?


> khugepagd will only collapse into a THP if all affected page table
> entries are present and don't map the zero page, though.
> 
> So we'll most certainly use less THP for our VM and VM startup time
> ("first memory access after plugging memory") can be slower.
> 
> I have prototypes for it, with some optimizations (e.g., on 4k guest
> access, populate the whole THP area), but we might not want to enable it
> all of the time. (interaction with postcopy has to be fixed, but it's
> not a fundamental issue)
> 
> 
> Extending uffd-based protection for virtio-mem to other processes
> (vhost-user), is a bit more complicated, and I am not 100% sure if it's
> worth the trouble for now. memslots provide at least some high-level
> protection for the important case of having a virtio-mem device to
> eventually hotplug a lot of memory later.
> 
> > 
> >> ("security is somewhat the wrong word. we won't be able to steal any
> >> information from the hypervisor.)
> > 
> > Right. Let's just spell it out.
> > Further, removing memory still requires guest cooperation.
> 
> Right.
> 
> 
> -- 
> Thanks,
> 
> David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots
  2021-11-02 17:06         ` Michael S. Tsirkin
@ 2021-11-02 17:10           ` David Hildenbrand
  2021-11-07  8:14             ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2021-11-02 17:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, Paolo Bonzini, Eduardo Habkost, Marcel Apfelbaum,
	Igor Mammedov, Ani Sinha, Peter Xu, Dr . David Alan Gilbert,
	Stefan Hajnoczi, Richard Henderson, Philippe Mathieu-Daudé,
	Hui Zhu, Sebastien Boeuf, kvm

On 02.11.21 18:06, Michael S. Tsirkin wrote:
> On Tue, Nov 02, 2021 at 12:55:17PM +0100, David Hildenbrand wrote:
>> On 02.11.21 12:35, Michael S. Tsirkin wrote:
>>> On Tue, Nov 02, 2021 at 09:33:55AM +0100, David Hildenbrand wrote:
>>>> On 01.11.21 23:15, Michael S. Tsirkin wrote:
>>>>> On Wed, Oct 27, 2021 at 02:45:19PM +0200, David Hildenbrand wrote:
>>>>>> This is the follow-up of [1], dropping auto-detection and vhost-user
>>>>>> changes from the initial RFC.
>>>>>>
>>>>>> Based-on: 20211011175346.15499-1-david@redhat.com
>>>>>>
>>>>>> A virtio-mem device is represented by a single large RAM memory region
>>>>>> backed by a single large mmap.
>>>>>>
>>>>>> Right now, we map that complete memory region into guest physical addres
>>>>>> space, resulting in a very large memory mapping, KVM memory slot, ...
>>>>>> although only a small amount of memory might actually be exposed to the VM.
>>>>>>
>>>>>> For example, when starting a VM with a 1 TiB virtio-mem device that only
>>>>>> exposes little device memory (e.g., 1 GiB) towards the VM initialliy,
>>>>>> in order to hotplug more memory later, we waste a lot of memory on metadata
>>>>>> for KVM memory slots (> 2 GiB!) and accompanied bitmaps. Although some
>>>>>> optimizations in KVM are being worked on to reduce this metadata overhead
>>>>>> on x86-64 in some cases, it remains a problem with nested VMs and there are
>>>>>> other reasons why we would want to reduce the total memory slot to a
>>>>>> reasonable minimum.
>>>>>>
>>>>>> We want to:
>>>>>> a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
>>>>>>    inside QEMU KVM code where possible.
>>>>>> b) Not always expose all device-memory to the VM, to reduce the attack
>>>>>>    surface of malicious VMs without using userfaultfd.
>>>>>
>>>>> I'm confused by the mention of these security considerations,
>>>>> and I expect users will be just as confused.
>>>>
>>>> Malicious VMs wanting to consume more memory than desired is only
>>>> relevant when running untrusted VMs in some environments, and it can be
>>>> caught differently, for example, by carefully monitoring and limiting
>>>> the maximum memory consumption of a VM. We have the same issue already
>>>> when using virtio-balloon to logically unplug memory. For me, it's a
>>>> secondary concern ( optimizing a is much more important ).
>>>>
>>>> Some users showed interest in having QEMU disallow access to unplugged
>>>> memory, because coming up with a maximum memory consumption for a VM is
>>>> hard. This is one step into that direction without having to run with
>>>> uffd enabled all of the time.
>>>
>>> Sorry about missing the memo - is there a lot of overhead associated
>>> with uffd then?
>>
>> When used with huge/gigantic pages, we don't particularly care.
>>
>> For other memory backends, we'll have to route any population via the
>> uffd handler: guest accesses a 4k page -> place a 4k page from user
>> space. Instead of the kernel automatically placing a THP, we'd be
>> placing single 4k pages and have to hope the kernel will collapse them
>> into a THP later.
> 
> How much value there is in a THP given it's not present?

If you don't place a THP right during the first page fault inside the
THP region, you'll have to rely on khugepagd to eventually place a huge
page later -- and manually fault in each and every 4k page. I haven't
done any performance measurements so far. Going via userspace on every
4k fault will most certainly hurt performance when first touching memory.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots
  2021-11-02 17:10           ` David Hildenbrand
@ 2021-11-07  8:14             ` Michael S. Tsirkin
  2021-11-07  9:21               ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2021-11-07  8:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, Paolo Bonzini, Eduardo Habkost, Marcel Apfelbaum,
	Igor Mammedov, Ani Sinha, Peter Xu, Dr . David Alan Gilbert,
	Stefan Hajnoczi, Richard Henderson, Philippe Mathieu-Daudé,
	Hui Zhu, Sebastien Boeuf, kvm

On Tue, Nov 02, 2021 at 06:10:13PM +0100, David Hildenbrand wrote:
> On 02.11.21 18:06, Michael S. Tsirkin wrote:
> > On Tue, Nov 02, 2021 at 12:55:17PM +0100, David Hildenbrand wrote:
> >> On 02.11.21 12:35, Michael S. Tsirkin wrote:
> >>> On Tue, Nov 02, 2021 at 09:33:55AM +0100, David Hildenbrand wrote:
> >>>> On 01.11.21 23:15, Michael S. Tsirkin wrote:
> >>>>> On Wed, Oct 27, 2021 at 02:45:19PM +0200, David Hildenbrand wrote:
> >>>>>> This is the follow-up of [1], dropping auto-detection and vhost-user
> >>>>>> changes from the initial RFC.
> >>>>>>
> >>>>>> Based-on: 20211011175346.15499-1-david@redhat.com
> >>>>>>
> >>>>>> A virtio-mem device is represented by a single large RAM memory region
> >>>>>> backed by a single large mmap.
> >>>>>>
> >>>>>> Right now, we map that complete memory region into guest physical addres
> >>>>>> space, resulting in a very large memory mapping, KVM memory slot, ...
> >>>>>> although only a small amount of memory might actually be exposed to the VM.
> >>>>>>
> >>>>>> For example, when starting a VM with a 1 TiB virtio-mem device that only
> >>>>>> exposes little device memory (e.g., 1 GiB) towards the VM initialliy,
> >>>>>> in order to hotplug more memory later, we waste a lot of memory on metadata
> >>>>>> for KVM memory slots (> 2 GiB!) and accompanied bitmaps. Although some
> >>>>>> optimizations in KVM are being worked on to reduce this metadata overhead
> >>>>>> on x86-64 in some cases, it remains a problem with nested VMs and there are
> >>>>>> other reasons why we would want to reduce the total memory slot to a
> >>>>>> reasonable minimum.
> >>>>>>
> >>>>>> We want to:
> >>>>>> a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
> >>>>>>    inside QEMU KVM code where possible.
> >>>>>> b) Not always expose all device-memory to the VM, to reduce the attack
> >>>>>>    surface of malicious VMs without using userfaultfd.
> >>>>>
> >>>>> I'm confused by the mention of these security considerations,
> >>>>> and I expect users will be just as confused.
> >>>>
> >>>> Malicious VMs wanting to consume more memory than desired is only
> >>>> relevant when running untrusted VMs in some environments, and it can be
> >>>> caught differently, for example, by carefully monitoring and limiting
> >>>> the maximum memory consumption of a VM. We have the same issue already
> >>>> when using virtio-balloon to logically unplug memory. For me, it's a
> >>>> secondary concern ( optimizing a is much more important ).
> >>>>
> >>>> Some users showed interest in having QEMU disallow access to unplugged
> >>>> memory, because coming up with a maximum memory consumption for a VM is
> >>>> hard. This is one step into that direction without having to run with
> >>>> uffd enabled all of the time.
> >>>
> >>> Sorry about missing the memo - is there a lot of overhead associated
> >>> with uffd then?
> >>
> >> When used with huge/gigantic pages, we don't particularly care.
> >>
> >> For other memory backends, we'll have to route any population via the
> >> uffd handler: guest accesses a 4k page -> place a 4k page from user
> >> space. Instead of the kernel automatically placing a THP, we'd be
> >> placing single 4k pages and have to hope the kernel will collapse them
> >> into a THP later.
> > 
> > How much value there is in a THP given it's not present?
> 
> If you don't place a THP right during the first page fault inside the
> THP region, you'll have to rely on khugepagd to eventually place a huge
> page later -- and manually fault in each and every 4k page. I haven't
> done any performance measurements so far. Going via userspace on every
> 4k fault will most certainly hurt performance when first touching memory.

So, if the focus is performance improvement, maybe show the speedup?


> -- 
> Thanks,
> 
> David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots
  2021-11-07  8:14             ` Michael S. Tsirkin
@ 2021-11-07  9:21               ` David Hildenbrand
  2021-11-07 10:21                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2021-11-07  9:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, Paolo Bonzini, Eduardo Habkost, Marcel Apfelbaum,
	Igor Mammedov, Ani Sinha, Peter Xu, Dr . David Alan Gilbert,
	Stefan Hajnoczi, Richard Henderson, Philippe Mathieu-Daudé,
	Hui Zhu, Sebastien Boeuf, kvm

On 07.11.21 09:14, Michael S. Tsirkin wrote:
> On Tue, Nov 02, 2021 at 06:10:13PM +0100, David Hildenbrand wrote:
>> On 02.11.21 18:06, Michael S. Tsirkin wrote:
>>> On Tue, Nov 02, 2021 at 12:55:17PM +0100, David Hildenbrand wrote:
>>>> On 02.11.21 12:35, Michael S. Tsirkin wrote:
>>>>> On Tue, Nov 02, 2021 at 09:33:55AM +0100, David Hildenbrand wrote:
>>>>>> On 01.11.21 23:15, Michael S. Tsirkin wrote:
>>>>>>> On Wed, Oct 27, 2021 at 02:45:19PM +0200, David Hildenbrand wrote:
>>>>>>>> This is the follow-up of [1], dropping auto-detection and vhost-user
>>>>>>>> changes from the initial RFC.
>>>>>>>>
>>>>>>>> Based-on: 20211011175346.15499-1-david@redhat.com
>>>>>>>>
>>>>>>>> A virtio-mem device is represented by a single large RAM memory region
>>>>>>>> backed by a single large mmap.
>>>>>>>>
>>>>>>>> Right now, we map that complete memory region into guest physical addres
>>>>>>>> space, resulting in a very large memory mapping, KVM memory slot, ...
>>>>>>>> although only a small amount of memory might actually be exposed to the VM.
>>>>>>>>
>>>>>>>> For example, when starting a VM with a 1 TiB virtio-mem device that only
>>>>>>>> exposes little device memory (e.g., 1 GiB) towards the VM initialliy,
>>>>>>>> in order to hotplug more memory later, we waste a lot of memory on metadata
>>>>>>>> for KVM memory slots (> 2 GiB!) and accompanied bitmaps. Although some
>>>>>>>> optimizations in KVM are being worked on to reduce this metadata overhead
>>>>>>>> on x86-64 in some cases, it remains a problem with nested VMs and there are
>>>>>>>> other reasons why we would want to reduce the total memory slot to a
>>>>>>>> reasonable minimum.
>>>>>>>>
>>>>>>>> We want to:
>>>>>>>> a) Reduce the metadata overhead, including bitmap sizes inside KVM but also
>>>>>>>>    inside QEMU KVM code where possible.
>>>>>>>> b) Not always expose all device-memory to the VM, to reduce the attack
>>>>>>>>    surface of malicious VMs without using userfaultfd.
>>>>>>>
>>>>>>> I'm confused by the mention of these security considerations,
>>>>>>> and I expect users will be just as confused.
>>>>>>
>>>>>> Malicious VMs wanting to consume more memory than desired is only
>>>>>> relevant when running untrusted VMs in some environments, and it can be
>>>>>> caught differently, for example, by carefully monitoring and limiting
>>>>>> the maximum memory consumption of a VM. We have the same issue already
>>>>>> when using virtio-balloon to logically unplug memory. For me, it's a
>>>>>> secondary concern ( optimizing a is much more important ).
>>>>>>
>>>>>> Some users showed interest in having QEMU disallow access to unplugged
>>>>>> memory, because coming up with a maximum memory consumption for a VM is
>>>>>> hard. This is one step into that direction without having to run with
>>>>>> uffd enabled all of the time.
>>>>>
>>>>> Sorry about missing the memo - is there a lot of overhead associated
>>>>> with uffd then?
>>>>
>>>> When used with huge/gigantic pages, we don't particularly care.
>>>>
>>>> For other memory backends, we'll have to route any population via the
>>>> uffd handler: guest accesses a 4k page -> place a 4k page from user
>>>> space. Instead of the kernel automatically placing a THP, we'd be
>>>> placing single 4k pages and have to hope the kernel will collapse them
>>>> into a THP later.
>>>
>>> How much value there is in a THP given it's not present?
>>
>> If you don't place a THP right during the first page fault inside the
>> THP region, you'll have to rely on khugepagd to eventually place a huge
>> page later -- and manually fault in each and every 4k page. I haven't
>> done any performance measurements so far. Going via userspace on every
>> 4k fault will most certainly hurt performance when first touching memory.
> 
> So, if the focus is performance improvement, maybe show the speedup?

Let's not focus on b), a) is the primary goal of this series:

"
a) Reduce the metadata overhead, including bitmap sizes inside KVM but
also inside QEMU KVM code where possible.
"

Because:

"
For example, when starting a VM with a 1 TiB virtio-mem device that only
exposes little device memory (e.g., 1 GiB) towards the VM initialliy,
in order to hotplug more memory later, we waste a lot of memory on
metadata for KVM memory slots (> 2 GiB!) and accompanied bitmaps.
"

Partially tackling b) is just a nice side effect of this series. In the
long term, we'll want userfaultfd-based protection, and I'll do a
performance evaluation then, how userfaultf vs. !userfaultfd compares
(boot time, run time, THP consumption).

I'll adjust the cover letter for the next version to make this clearer.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots
  2021-11-07  9:21               ` David Hildenbrand
@ 2021-11-07 10:21                 ` Michael S. Tsirkin
  2021-11-07 10:53                   ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2021-11-07 10:21 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, Paolo Bonzini, Eduardo Habkost, Marcel Apfelbaum,
	Igor Mammedov, Ani Sinha, Peter Xu, Dr . David Alan Gilbert,
	Stefan Hajnoczi, Richard Henderson, Philippe Mathieu-Daudé,
	Hui Zhu, Sebastien Boeuf, kvm

On Sun, Nov 07, 2021 at 10:21:33AM +0100, David Hildenbrand wrote:
> Let's not focus on b), a) is the primary goal of this series:
> 
> "
> a) Reduce the metadata overhead, including bitmap sizes inside KVM but
> also inside QEMU KVM code where possible.
> "
> 
> Because:
> 
> "
> For example, when starting a VM with a 1 TiB virtio-mem device that only
> exposes little device memory (e.g., 1 GiB) towards the VM initialliy,
> in order to hotplug more memory later, we waste a lot of memory on
> metadata for KVM memory slots (> 2 GiB!) and accompanied bitmaps.
> "
> 
> Partially tackling b) is just a nice side effect of this series. In the
> long term, we'll want userfaultfd-based protection, and I'll do a
> performance evaluation then, how userfaultf vs. !userfaultfd compares
> (boot time, run time, THP consumption).
> 
> I'll adjust the cover letter for the next version to make this clearer.

So given this is short-term, and long term we'll use uffd possibly with
some extension (a syscall to populate 1G in one go?) isn't there some
way to hide this from management? It's a one way street: once we get
management involved in playing with memory slots we no longer can go
back and control them ourselves. Not to mention it's a lot of
complexity to push out to management.

-- 
MST


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots
  2021-11-07 10:21                 ` Michael S. Tsirkin
@ 2021-11-07 10:53                   ` David Hildenbrand
  0 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2021-11-07 10:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, Paolo Bonzini, Eduardo Habkost, Marcel Apfelbaum,
	Igor Mammedov, Ani Sinha, Peter Xu, Dr . David Alan Gilbert,
	Stefan Hajnoczi, Richard Henderson, Philippe Mathieu-Daudé,
	Hui Zhu, Sebastien Boeuf, kvm

On 07.11.21 11:21, Michael S. Tsirkin wrote:
> On Sun, Nov 07, 2021 at 10:21:33AM +0100, David Hildenbrand wrote:
>> Let's not focus on b), a) is the primary goal of this series:
>>
>> "
>> a) Reduce the metadata overhead, including bitmap sizes inside KVM but
>> also inside QEMU KVM code where possible.
>> "
>>
>> Because:
>>
>> "
>> For example, when starting a VM with a 1 TiB virtio-mem device that only
>> exposes little device memory (e.g., 1 GiB) towards the VM initialliy,
>> in order to hotplug more memory later, we waste a lot of memory on
>> metadata for KVM memory slots (> 2 GiB!) and accompanied bitmaps.
>> "
>>
>> Partially tackling b) is just a nice side effect of this series. In the
>> long term, we'll want userfaultfd-based protection, and I'll do a
>> performance evaluation then, how userfaultf vs. !userfaultfd compares
>> (boot time, run time, THP consumption).
>>
>> I'll adjust the cover letter for the next version to make this clearer.
> 
> So given this is short-term, and long term we'll use uffd possibly with
> some extension (a syscall to populate 1G in one go?) isn't there some
> way to hide this from management? It's a one way street: once we get
> management involved in playing with memory slots we no longer can go
> back and control them ourselves. Not to mention it's a lot of
> complexity to push out to management.

For b) userfaultfd + optimizatons is the way to go long term.
For a) userfaultfd does not help in any way, and that's what I currently
care about most.

1) For the management layer it will be as simple as providing a
"memslots" parameter to the user. I don't expect management to do manual
memslot detection+calculation -- management layer is the wrong place
because it has limited insight. Either QEMU will do it automatically or
the user will do it manually. For QEMU to do it reliably, we'll have to
teach the management layer to specify any vhost* devices before
virtio-mem* devices on the QEMU cmdline -- that is the only real
complexity I see.

2) "control them ourselves" will essentially be enabled via "memslots=0"
(auto-detect mode". The user has to opt in.

"memslots" is a pure optimization mechanism. While I'd love to hide this
complexity from user space and always use the auto-detect mode,
especially hotplug of vhost devices is a real problem and requires users
to opt-in.

I assume once we have "memslots=0" (auto-detect) mode, most people will:
* Set "memslots=0" to enable the optimization and essentially let QEMU
  control it. Will work in most cases and we can document perfectly
  where it won't. We'll always fail gracefully.
* Leave "memslots=1" if they don't care about the optimization or run a
  problematic setup.
* Set "memslots=X if they run a problemantic setup in still care about
  the optimization.


To be precise, we could have a "memslots-optimiation=true|false" toggle
instead. IMHO that could be limiting for these corner case setups where
auto-detection is problematic and users still want to optimize --
especially eventually hotplugging vhost devices. But as I assume
99.9999% of all setups will enable auto-detect mode, I don't have a
strong opinion.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 10/12] virtio-mem: Fix typo in virito_mem_intersect_memory_section() function name
  2021-10-27 12:45 ` [PATCH v1 10/12] virtio-mem: Fix typo in virito_mem_intersect_memory_section() function name David Hildenbrand
@ 2022-12-28 14:05   ` Philippe Mathieu-Daudé
  2022-12-28 14:06     ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Philippe Mathieu-Daudé @ 2022-12-28 14:05 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Paolo Bonzini, Eduardo Habkost, Marcel Apfelbaum,
	Michael S. Tsirkin, Igor Mammedov, Ani Sinha, Peter Xu,
	Dr . David Alan Gilbert, Stefan Hajnoczi, Richard Henderson,
	Hui Zhu, Sebastien Boeuf, kvm, QEMU Trivial

On 27/10/21 14:45, David Hildenbrand wrote:
> It's "virtio", not "virito".
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   hw/virtio/virtio-mem.c | 12 ++++++------
>   1 file changed, 6 insertions(+), 6 deletions(-)

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>

> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index d5a578142b..0f5eae4a7e 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -177,7 +177,7 @@ static int virtio_mem_for_each_unplugged_range(const VirtIOMEM *vmem, void *arg,
>    *
>    * Returns false if the intersection is empty, otherwise returns true.
>    */
> -static bool virito_mem_intersect_memory_section(MemoryRegionSection *s,
> +static bool virtio_mem_intersect_memory_section(MemoryRegionSection *s,
>                                                   uint64_t offset, uint64_t size)
>   {
>       uint64_t start = MAX(s->offset_within_region, offset);
> @@ -215,7 +215,7 @@ static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
>                                         first_bit + 1) - 1;
>           size = (last_bit - first_bit + 1) * vmem->block_size;
>   
> -        if (!virito_mem_intersect_memory_section(&tmp, offset, size)) {
> +        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
>               break;
>           }
>           ret = cb(&tmp, arg);
> @@ -247,7 +247,7 @@ static int virtio_mem_for_each_unplugged_section(const VirtIOMEM *vmem,
>                                    first_bit + 1) - 1;
>           size = (last_bit - first_bit + 1) * vmem->block_size;
>   
> -        if (!virito_mem_intersect_memory_section(&tmp, offset, size)) {
> +        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
>               break;
>           }
>           ret = cb(&tmp, arg);
> @@ -283,7 +283,7 @@ static void virtio_mem_notify_unplug(VirtIOMEM *vmem, uint64_t offset,
>       QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
>           MemoryRegionSection tmp = *rdl->section;
>   
> -        if (!virito_mem_intersect_memory_section(&tmp, offset, size)) {
> +        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
>               continue;
>           }
>           rdl->notify_discard(rdl, &tmp);
> @@ -299,7 +299,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
>       QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
>           MemoryRegionSection tmp = *rdl->section;
>   
> -        if (!virito_mem_intersect_memory_section(&tmp, offset, size)) {
> +        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
>               continue;
>           }
>           ret = rdl->notify_populate(rdl, &tmp);
> @@ -316,7 +316,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
>               if (rdl2 == rdl) {
>                   break;
>               }
> -            if (!virito_mem_intersect_memory_section(&tmp, offset, size)) {
> +            if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
>                   continue;
>               }
>               rdl2->notify_discard(rdl2, &tmp);


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 10/12] virtio-mem: Fix typo in virito_mem_intersect_memory_section() function name
  2022-12-28 14:05   ` Philippe Mathieu-Daudé
@ 2022-12-28 14:06     ` David Hildenbrand
  2022-12-28 14:07       ` Philippe Mathieu-Daudé
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2022-12-28 14:06 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, qemu-devel
  Cc: Paolo Bonzini, Eduardo Habkost, Marcel Apfelbaum,
	Michael S. Tsirkin, Igor Mammedov, Ani Sinha, Peter Xu,
	Dr . David Alan Gilbert, Stefan Hajnoczi, Richard Henderson,
	Hui Zhu, Sebastien Boeuf, kvm, QEMU Trivial

On 28.12.22 15:05, Philippe Mathieu-Daudé wrote:
> On 27/10/21 14:45, David Hildenbrand wrote:
>> It's "virtio", not "virito".
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>    hw/virtio/virtio-mem.c | 12 ++++++------
>>    1 file changed, 6 insertions(+), 6 deletions(-)
> 
> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>

I already queued your patch :)

https://github.com/davidhildenbrand/qemu.git mem-next

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 10/12] virtio-mem: Fix typo in virito_mem_intersect_memory_section() function name
  2022-12-28 14:06     ` David Hildenbrand
@ 2022-12-28 14:07       ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 34+ messages in thread
From: Philippe Mathieu-Daudé @ 2022-12-28 14:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, Paolo Bonzini, Eduardo Habkost, Marcel Apfelbaum,
	Michael S. Tsirkin, Igor Mammedov, Ani Sinha, Peter Xu,
	Dr . David Alan Gilbert, Stefan Hajnoczi, Richard Henderson,
	Hui Zhu, Sebastien Boeuf, kvm, QEMU Trivial

On Wed, 28 Dec 2022 at 15:06, David Hildenbrand <david@redhat.com> wrote:
> On 28.12.22 15:05, Philippe Mathieu-Daudé wrote:
> > On 27/10/21 14:45, David Hildenbrand wrote:
> >> It's "virtio", not "virito".
> >>
> >> Signed-off-by: David Hildenbrand <david@redhat.com>
> >> ---
> >>    hw/virtio/virtio-mem.c | 12 ++++++------
> >>    1 file changed, 6 insertions(+), 6 deletions(-)
> >
> > Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
>
> I already queued your patch :)

Thank you then! :)

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2022-12-28 14:07 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-27 12:45 [PATCH v1 00/12] virtio-mem: Expose device memory via multiple memslots David Hildenbrand
2021-10-27 12:45 ` [PATCH v1 01/12] kvm: Return number of free memslots David Hildenbrand
2021-10-27 12:45 ` [PATCH v1 02/12] vhost: " David Hildenbrand
2021-10-27 13:36   ` Philippe Mathieu-Daudé
2021-10-27 13:37     ` David Hildenbrand
2021-10-27 14:04     ` David Hildenbrand
2021-10-27 14:11       ` Philippe Mathieu-Daudé
2021-10-27 15:33         ` Michael S. Tsirkin
2021-10-27 15:45           ` David Hildenbrand
2021-10-27 16:11             ` Philippe Mathieu-Daudé
2021-10-27 16:51               ` David Hildenbrand
2021-10-27 12:45 ` [PATCH v1 03/12] memory: Allow for marking memory region aliases unmergeable David Hildenbrand
2021-10-27 12:45 ` [PATCH v1 04/12] vhost: Don't merge unmergeable memory sections David Hildenbrand
2021-10-27 12:45 ` [PATCH v1 05/12] memory-device: Move memory_device_check_addable() directly into memory_device_pre_plug() David Hildenbrand
2021-10-27 12:45 ` [PATCH v1 06/12] memory-device: Generalize memory_device_used_region_size() David Hildenbrand
2021-10-27 12:45 ` [PATCH v1 07/12] memory-device: Support memory devices that dynamically consume multiple memslots David Hildenbrand
2021-10-27 12:45 ` [PATCH v1 08/12] vhost: Respect reserved memslots for memory devices when realizing a vhost device David Hildenbrand
2021-10-27 12:45 ` [PATCH v1 09/12] memory: Drop mapping check from memory_region_get_ram_discard_manager() David Hildenbrand
2021-10-27 12:45 ` [PATCH v1 10/12] virtio-mem: Fix typo in virito_mem_intersect_memory_section() function name David Hildenbrand
2022-12-28 14:05   ` Philippe Mathieu-Daudé
2022-12-28 14:06     ` David Hildenbrand
2022-12-28 14:07       ` Philippe Mathieu-Daudé
2021-10-27 12:45 ` [PATCH v1 11/12] virtio-mem: Set the RamDiscardManager for the RAM memory region earlier David Hildenbrand
2021-10-27 12:45 ` [PATCH v1 12/12] virtio-mem: Expose device memory via multiple memslots David Hildenbrand
2021-11-01 22:15 ` [PATCH v1 00/12] " Michael S. Tsirkin
2021-11-02  8:33   ` David Hildenbrand
2021-11-02 11:35     ` Michael S. Tsirkin
2021-11-02 11:55       ` David Hildenbrand
2021-11-02 17:06         ` Michael S. Tsirkin
2021-11-02 17:10           ` David Hildenbrand
2021-11-07  8:14             ` Michael S. Tsirkin
2021-11-07  9:21               ` David Hildenbrand
2021-11-07 10:21                 ` Michael S. Tsirkin
2021-11-07 10:53                   ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).