* [PATCH PROTOTYPE 0/6] virtio-mem: vfio support
@ 2020-09-24 16:04 David Hildenbrand
  2020-09-24 16:04 ` [PATCH PROTOTYPE 1/6] memory: Introduce sparse RAM handler for memory regions David Hildenbrand
                   ` (7 more replies)
  0 siblings, 8 replies; 27+ messages in thread
From: David Hildenbrand @ 2020-09-24 16:04 UTC (permalink / raw)
  To: qemu-devel
  Cc: Pankaj Gupta, David Hildenbrand, Michael S. Tsirkin,
	Dr. David Alan Gilbert, Peter Xu, Luiz Capitulino, Auger Eric,
	Alex Williamson, Wei Yang, Paolo Bonzini, Igor Mammedov

This is a quick and dirty (1.5 days of hacking) prototype to make
vfio and virtio-mem play together. The basic idea was the result of Alex
brainstorming with me on how to tackle this.

A virtio-mem device manages a memory region in guest physical address
space, represented as a single (currently large) memory region in QEMU.
Before the guest is allowed to use memory blocks, it must coordinate with
the hypervisor (plug blocks). After a reboot, all memory is usually
unplugged - when the guest comes up, it detects the virtio-mem device and
selects memory blocks to plug (based on requests from the hypervisor).

Memory hot(un)plug consists of (un)plugging memory blocks via a virtio-mem
device (triggered by the guest). When unplugging blocks, we discard the
memory. In contrast to memory ballooning, we always know which memory
blocks a guest may use - especially during a reboot, after a crash, or
after kexec.

The issue with vfio is that it cannot deal with random discards - for this
reason, virtio-mem and vfio can currently only run mutually exclusively.
In particular, vfio would currently map the whole memory region (with possibly
only few/no plugged blocks), resulting in all pages getting pinned and
therefore in a higher memory consumption than expected (rendering
virtio-mem basically useless in these environments).

To make vfio work nicely with virtio-mem, we have to map only the plugged
blocks, and map/unmap properly when plugging/unplugging blocks (including
discarding of RAM when unplugging). We achieve that by using a new notifier
mechanism that communicates changes.
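
As a rough sketch (simplified from patch 3 below; rollback and error
handling omitted, the local helper names are made up), the vfio side
essentially does:

    static int my_notify_map(SparseRAMNotifier *n, const MemoryRegion *mr,
                             uint64_t mr_offset, uint64_t size)
    {
        /* vfio_dma_map() the now-plugged piece, in block-size granularity */
        return 0;
    }

    static void my_notify_unmap(SparseRAMNotifier *n, const MemoryRegion *mr,
                                uint64_t mr_offset, uint64_t size)
    {
        /* vfio_dma_unmap() the now-unplugged piece with a single call */
    }

    /* when the sparse region shows up in the vfio memory listener */
    sparse_ram_notifier_init(&sram->notifier, my_notify_map, my_notify_unmap);
    memory_region_register_sparse_ram_notifier(section->mr, &sram->notifier);
    /* catch up on everything that is already plugged */
    memory_region_sparse_ram_replay_mapped(section->mr, &sram->notifier);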

It's important to map memory at the granularity at which we could see
unmaps again (-> the virtio-mem block size) - so when, e.g., plugging 100 MB
of consecutive memory with a block size of 2 MB, we need 50 mappings. When
unmapping, we can use a single vfio_unmap call for the applicable range.
We expect that the block size of virtio-mem devices will be fairly large
when used with vfio in the future (e.g., 128 MB, 1 GB, ...), configured by
the user, to not run out of mappings and to improve hot(un)plug
performance - Linux guests will still have to be optimized for that.
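
Just to illustrate the math (hypothetical helper, not part of this series):

    /* How many vfio mappings does a fully plugged range need? */
    static uint64_t sparse_ram_nb_mappings(uint64_t plugged_size,
                                           uint64_t block_size)
    {
        return plugged_size / block_size; /* e.g., 100 MB / 2 MB -> 50 */
    }

With 256 GB and an 8 MB block size that's already 32768 mappings - if I
remember correctly, close to the default DMA mapping limit (64k entries)
of vfio_iommu_type1.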

We try to handle errors when plugging memory (mapping in VFIO) gracefully
- especially to cope with too many mappings in VFIO.


As I basically have no experience with vfio, all I did for testing was
pass through a secondary GPU (NVIDIA GK208B) via vfio-pci to my guest
and see it pop up in dmesg. I did *not* actually try to use it (I know
...), so there might still be plenty of BUGs regarding the actual mappings
in the code. When I resize virtio-mem devices (resulting in
memory hot(un)plug), I can spot the memory consumption of my host adjusting
accordingly - in contrast to before, when my machine would always
consume the maximum size of my VM, as if all memory provided by
virtio-mem devices were fully plugged.

I even tested it with 2MB huge pages (sadly for the first time with
virtio-mem ever) - and it worked like a charm on the hypervisor side as
well. The number of free hugepages adjusted accordingly. (again, did not
properly test the device in the guest ...).

If anybody wants to play with it and needs some guidance, please feel
free to ask. I might add some vfio-related documentation to
https://virtio-mem.gitlab.io/ (but it really isn't that special - only
the block size limitations have to be considered).
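
For reference, roughly the kind of command line I'd expect to use (untested
in this exact form; the exact device/property names might differ slightly):

    qemu-system-x86_64 ... \
      -m 4G,maxmem=260G \
      -object memory-backend-ram,id=mem0,size=256G \
      -device virtio-mem-pci,id=vmem0,memdev=mem0,requested-size=0,block-size=128M \
      -device vfio-pci,host=0000:01:00.0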

David Hildenbrand (6):
  memory: Introduce sparse RAM handler for memory regions
  virtio-mem: Implement SparseRAMHandler interface
  vfio: Implement support for sparse RAM memory regions
  memory: Extend ram_block_discard_(require|disable) by two discard
    types
  virtio-mem: Require only RAM_BLOCK_DISCARD_T_COORDINATED discards
  vfio: Disable only RAM_BLOCK_DISCARD_T_UNCOORDINATED discards

 exec.c                         | 109 +++++++++++++++++----
 hw/vfio/common.c               | 169 ++++++++++++++++++++++++++++++++-
 hw/virtio/virtio-mem.c         | 164 +++++++++++++++++++++++++++++++-
 include/exec/memory.h          | 151 ++++++++++++++++++++++++++++-
 include/hw/vfio/vfio-common.h  |  12 +++
 include/hw/virtio/virtio-mem.h |   3 +
 softmmu/memory.c               |   7 ++
 7 files changed, 583 insertions(+), 32 deletions(-)

-- 
2.26.2




* [PATCH PROTOTYPE 1/6] memory: Introduce sparse RAM handler for memory regions
  2020-09-24 16:04 [PATCH PROTOTYPE 0/6] virtio-mem: vfio support David Hildenbrand
@ 2020-09-24 16:04 ` David Hildenbrand
  2020-10-20 19:24   ` Peter Xu
  2020-09-24 16:04 ` [PATCH PROTOTYPE 2/6] virtio-mem: Implement SparseRAMHandler interface David Hildenbrand
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2020-09-24 16:04 UTC (permalink / raw)
  To: qemu-devel
  Cc: Pankaj Gupta, David Hildenbrand, Michael S. Tsirkin,
	Dr . David Alan Gilbert, Peter Xu, Luiz Capitulino, Auger Eric,
	Alex Williamson, Wei Yang, Igor Mammedov, Paolo Bonzini

We have some special RAM memory regions (managed by paravirtualized memory
devices - virtio-mem), for which the guest agreed to only use selected memory
ranges. This results in "sparse" mmaps, "sparse" RAMBlocks and "sparse"
RAM memory regions.

In most cases, we don't currently care about that - e.g., in KVM, we simply
have a single KVM memory slot (and as the number is fairly limited, we'll
have to keep it like that). However, in the case of vfio, registering the
whole region with the kernel results in all pages getting pinned, and
therefore in an unexpectedly high memory consumption. This is the main
reason why vfio is incompatible with memory ballooning.

Let's introduce a way to communicate the actually accessible/mapped (meaning,
not discarded) pieces of such a sparse memory region, and to get notified on
changes (e.g., a virtio-mem device plugging/unplugging memory).

We expect that the SparseRAMHandler is set for a memory region before it
is mapped into guest physical address space (so before any memory
listeners get notified about the addition), and that the SparseRAMHandler
isn't unset before the memory region has been unmapped from guest physical
address space (so after any memory listener got notified about the removal).

This is somewhat similar to the iommu memory region notifier mechanism.
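
In other words, a minimal sketch of the expected lifecycle (mirroring what
the virtio-mem patch in this series does):

    /* in the device realize function, before the region gets mapped */
    memory_region_set_sparse_ram_handler(&vmem->memdev->mr,
                                         SPARSE_RAM_HANDLER(vmem));

    /* in the unrealize function, after the region was unmapped */
    memory_region_set_sparse_ram_handler(&vmem->memdev->mr, NULL);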

TODO:
- Better documentation.
- Better naming?
- Handle it on RAMBlocks?
- SPAPR special handling required (virtio-mem only supports x86-64 for now)?
- Catch mapping errors during hotplug in a nice way
- Fail early when a certain number of mappings would be exceeded
  (instead of eventually consuming too many, leaving none for others)
- Resizeable memory region handling (future).
- Callback to check the state of a block.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/exec/memory.h | 115 ++++++++++++++++++++++++++++++++++++++++++
 softmmu/memory.c      |   7 +++
 2 files changed, 122 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index f1bb2a7df5..2931ead730 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -42,6 +42,12 @@ typedef struct IOMMUMemoryRegionClass IOMMUMemoryRegionClass;
 DECLARE_OBJ_CHECKERS(IOMMUMemoryRegion, IOMMUMemoryRegionClass,
                      IOMMU_MEMORY_REGION, TYPE_IOMMU_MEMORY_REGION)
 
+#define TYPE_SPARSE_RAM_HANDLER "sparse-ram-handler"
+typedef struct SparseRAMHandlerClass SparseRAMHandlerClass;
+typedef struct SparseRAMHandler SparseRAMHandler;
+DECLARE_OBJ_CHECKERS(SparseRAMHandler, SparseRAMHandlerClass,
+                     SPARSE_RAM_HANDLER, TYPE_SPARSE_RAM_HANDLER)
+
 extern bool global_dirty_log;
 
 typedef struct MemoryRegionOps MemoryRegionOps;
@@ -136,6 +142,28 @@ static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
     n->iommu_idx = iommu_idx;
 }
 
+struct SparseRAMNotifier;
+typedef int (*SparseRAMNotifyMap)(struct SparseRAMNotifier *notifier,
+                                  const MemoryRegion *mr, uint64_t mr_offset,
+                                  uint64_t size);
+typedef void (*SparseRAMNotifyUnmap)(struct SparseRAMNotifier *notifier,
+                                     const MemoryRegion *mr, uint64_t mr_offset,
+                                     uint64_t size);
+
+typedef struct SparseRAMNotifier {
+    SparseRAMNotifyMap notify_map;
+    SparseRAMNotifyUnmap notify_unmap;
+    QLIST_ENTRY(SparseRAMNotifier) next;
+} SparseRAMNotifier;
+
+static inline void sparse_ram_notifier_init(SparseRAMNotifier *notifier,
+                                            SparseRAMNotifyMap map_fn,
+                                            SparseRAMNotifyUnmap unmap_fn)
+{
+    notifier->notify_map = map_fn;
+    notifier->notify_unmap = unmap_fn;
+}
+
 /*
  * Memory region callbacks
  */
@@ -352,6 +380,36 @@ struct IOMMUMemoryRegionClass {
     int (*num_indexes)(IOMMUMemoryRegion *iommu);
 };
 
+struct SparseRAMHandlerClass {
+    /* private */
+    InterfaceClass parent_class;
+
+    /*
+     * Returns the minimum granularity at which (granularity-aligned) pieces
+     * within the memory region can become either mapped or unmapped.
+     */
+    uint64_t (*get_granularity)(const SparseRAMHandler *srh,
+                                const MemoryRegion *mr);
+
+    /*
+     * Register a listener for mapping changes.
+     */
+    void (*register_listener)(SparseRAMHandler *srh, const MemoryRegion *mr,
+                              SparseRAMNotifier *notifier);
+
+    /*
+     * Unregister a listener for mapping changes.
+     */
+    void (*unregister_listener)(SparseRAMHandler *srh, const MemoryRegion *mr,
+                                SparseRAMNotifier *notifier);
+
+    /*
+     * Replay notifications for mapped RAM.
+     */
+    int (*replay_mapped)(SparseRAMHandler *srh, const MemoryRegion *mr,
+                         SparseRAMNotifier *notifier);
+};
+
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
 typedef struct MemoryRegionIoeventfd MemoryRegionIoeventfd;
 
@@ -399,6 +457,7 @@ struct MemoryRegion {
     const char *name;
     unsigned ioeventfd_nb;
     MemoryRegionIoeventfd *ioeventfds;
+    SparseRAMHandler *srh; /* For RAM only */
 };
 
 struct IOMMUMemoryRegion {
@@ -1889,6 +1948,62 @@ bool memory_region_present(MemoryRegion *container, hwaddr addr);
  */
 bool memory_region_is_mapped(MemoryRegion *mr);
 
+
+static inline SparseRAMHandler* memory_region_get_sparse_ram_handler(
+                                                               MemoryRegion *mr)
+{
+    return mr->srh;
+}
+
+static inline bool memory_region_is_sparse_ram(MemoryRegion *mr)
+{
+    return memory_region_get_sparse_ram_handler(mr) != NULL;
+}
+
+static inline void memory_region_set_sparse_ram_handler(MemoryRegion *mr,
+                                                        SparseRAMHandler *srh)
+{
+    g_assert(memory_region_is_ram(mr));
+    mr->srh = srh;
+}
+
+static inline void memory_region_register_sparse_ram_notifier(MemoryRegion *mr,
+                                                           SparseRAMNotifier *n)
+{
+    SparseRAMHandler *srh = memory_region_get_sparse_ram_handler(mr);
+    SparseRAMHandlerClass *srhc = SPARSE_RAM_HANDLER_GET_CLASS(srh);
+
+    srhc->register_listener(srh, mr, n);
+}
+
+static inline void memory_region_unregister_sparse_ram_notifier(
+                                                               MemoryRegion *mr,
+                                                           SparseRAMNotifier *n)
+{
+    SparseRAMHandler *srh = memory_region_get_sparse_ram_handler(mr);
+    SparseRAMHandlerClass *srhc = SPARSE_RAM_HANDLER_GET_CLASS(srh);
+
+    srhc->unregister_listener(srh, mr, n);
+}
+
+static inline uint64_t memory_region_sparse_ram_get_granularity(
+                                                               MemoryRegion *mr)
+{
+    SparseRAMHandler *srh = memory_region_get_sparse_ram_handler(mr);
+    SparseRAMHandlerClass *srhc = SPARSE_RAM_HANDLER_GET_CLASS(srh);
+
+    return srhc->get_granularity(srh, mr);
+}
+
+static inline int memory_region_sparse_ram_replay_mapped(MemoryRegion *mr,
+                                                         SparseRAMNotifier *n)
+{
+    SparseRAMHandler *srh = memory_region_get_sparse_ram_handler(mr);
+    SparseRAMHandlerClass *srhc = SPARSE_RAM_HANDLER_GET_CLASS(srh);
+
+    return srhc->replay_mapped(srh, mr, n);
+}
+
 /**
  * memory_region_find: translate an address/size relative to a
  * MemoryRegion into a #MemoryRegionSection.
diff --git a/softmmu/memory.c b/softmmu/memory.c
index d030eb6f7c..89649f52f7 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -3241,10 +3241,17 @@ static const TypeInfo iommu_memory_region_info = {
     .abstract           = true,
 };
 
+static const TypeInfo sparse_ram_handler_info = {
+    .parent             = TYPE_INTERFACE,
+    .name               = TYPE_SPARSE_RAM_HANDLER,
+    .class_size         = sizeof(SparseRAMHandlerClass),
+};
+
 static void memory_register_types(void)
 {
     type_register_static(&memory_region_info);
     type_register_static(&iommu_memory_region_info);
+    type_register_static(&sparse_ram_handler_info);
 }
 
 type_init(memory_register_types)
-- 
2.26.2




* [PATCH PROTOTYPE 2/6] virtio-mem: Implement SparseRAMHandler interface
  2020-09-24 16:04 [PATCH PROTOTYPE 0/6] virtio-mem: vfio support David Hildenbrand
  2020-09-24 16:04 ` [PATCH PROTOTYPE 1/6] memory: Introduce sparse RAM handler for memory regions David Hildenbrand
@ 2020-09-24 16:04 ` David Hildenbrand
  2020-09-24 16:04 ` [PATCH PROTOTYPE 3/6] vfio: Implement support for sparse RAM memory regions David Hildenbrand
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2020-09-24 16:04 UTC (permalink / raw)
  To: qemu-devel
  Cc: Pankaj Gupta, David Hildenbrand, Michael S. Tsirkin,
	Dr . David Alan Gilbert, Peter Xu, Luiz Capitulino, Auger Eric,
	Alex Williamson, Wei Yang, Igor Mammedov, Paolo Bonzini

Let's properly notify when (un)plugging blocks. Handle errors from
notifiers gracefully when mapping, rolling back the change and telling
the guest that the VM is busy.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/virtio-mem.c         | 158 ++++++++++++++++++++++++++++++++-
 include/hw/virtio/virtio-mem.h |   3 +
 2 files changed, 160 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 8fbec77ccc..e23969eaed 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -72,6 +72,64 @@ static bool virtio_mem_is_busy(void)
     return migration_in_incoming_postcopy() || !migration_is_idle();
 }
 
+static void virtio_mem_srh_notify_unmap(VirtIOMEM *vmem, uint64_t offset,
+                                        uint64_t size)
+{
+    SparseRAMNotifier *notifier;
+
+    QLIST_FOREACH(notifier, &vmem->sram_notify, next) {
+        notifier->notify_unmap(notifier, &vmem->memdev->mr, offset, size);
+    }
+}
+
+static int virtio_mem_srh_notify_map(VirtIOMEM *vmem, uint64_t offset,
+                                     uint64_t size)
+{
+    SparseRAMNotifier *notifier, *notifier2;
+    int ret = 0;
+
+    QLIST_FOREACH(notifier, &vmem->sram_notify, next) {
+        ret = notifier->notify_map(notifier, &vmem->memdev->mr, offset, size);
+        if (ret) {
+            break;
+        }
+    }
+
+    /* In case any notifier failed, undo the whole operation. */
+    if (ret) {
+        QLIST_FOREACH(notifier2, &vmem->sram_notify, next) {
+            if (notifier2 == notifier) {
+                break;
+            }
+            notifier2->notify_unmap(notifier2, &vmem->memdev->mr, offset, size);
+        }
+    }
+    return ret;
+}
+
+/*
+ * TODO: Maybe we could notify directly that everything is unmapped/discarded;
+ * at least vfio should be able to deal with that.
+ */
+static void virtio_mem_srh_notify_unplug_all(VirtIOMEM *vmem)
+{
+    unsigned long first_zero_bit, last_zero_bit;
+    uint64_t offset, length;
+
+    /* Find consecutive unplugged blocks and notify */
+    first_zero_bit = find_first_zero_bit(vmem->bitmap, vmem->bitmap_size);
+    while (first_zero_bit < vmem->bitmap_size) {
+        offset = first_zero_bit * vmem->block_size;
+        last_zero_bit = find_next_bit(vmem->bitmap, vmem->bitmap_size,
+                                      first_zero_bit + 1) - 1;
+        length = (last_zero_bit - first_zero_bit + 1) * vmem->block_size;
+
+        virtio_mem_srh_notify_unmap(vmem, offset, length);
+        first_zero_bit = find_next_zero_bit(vmem->bitmap, vmem->bitmap_size,
+                                            last_zero_bit + 2);
+    }
+}
+
 static bool virtio_mem_test_bitmap(VirtIOMEM *vmem, uint64_t start_gpa,
                                    uint64_t size, bool plugged)
 {
@@ -146,7 +204,7 @@ static int virtio_mem_set_block_state(VirtIOMEM *vmem, uint64_t start_gpa,
                                       uint64_t size, bool plug)
 {
     const uint64_t offset = start_gpa - vmem->addr;
-    int ret;
+    int ret, ret2;
 
     if (virtio_mem_is_busy()) {
         return -EBUSY;
@@ -159,6 +217,23 @@ static int virtio_mem_set_block_state(VirtIOMEM *vmem, uint64_t start_gpa,
                          strerror(-ret));
             return -EBUSY;
         }
+        /*
+         * We'll notify *after* discarding succeeded, because we might not be
+         * able to map again ...
+         */
+        virtio_mem_srh_notify_unmap(vmem, offset, size);
+    } else if (virtio_mem_srh_notify_map(vmem, offset, size)) {
+        /*
+         * Could be that a previous mapping attempt already resulted in
+         * memory getting populated.
+         */
+        ret2 = ram_block_discard_range(vmem->memdev->mr.ram_block, offset,
+                                       size);
+        if (ret2) {
+            error_report("Unexpected error discarding RAM: %s",
+                         strerror(-ret2));
+        }
+        return -EBUSY;
     }
     virtio_mem_set_bitmap(vmem, start_gpa, size, plug);
     return 0;
@@ -253,6 +328,8 @@ static int virtio_mem_unplug_all(VirtIOMEM *vmem)
         error_report("Unexpected error discarding RAM: %s", strerror(-ret));
         return -EBUSY;
     }
+    virtio_mem_srh_notify_unplug_all(vmem);
+
     bitmap_clear(vmem->bitmap, 0, vmem->bitmap_size);
     if (vmem->size) {
         vmem->size = 0;
@@ -480,6 +557,13 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
     vmstate_register_ram(&vmem->memdev->mr, DEVICE(vmem));
     qemu_register_reset(virtio_mem_system_reset, vmem);
     precopy_add_notifier(&vmem->precopy_notifier);
+
+    /*
+     * Set it to sparse, so everybody is aware of it before the plug handler
+     * exposes the region to the system.
+     */
+    memory_region_set_sparse_ram_handler(&vmem->memdev->mr,
+                                         SPARSE_RAM_HANDLER(vmem));
 }
 
 static void virtio_mem_device_unrealize(DeviceState *dev)
@@ -487,6 +571,7 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VirtIOMEM *vmem = VIRTIO_MEM(dev);
 
+    memory_region_set_sparse_ram_handler(&vmem->memdev->mr, NULL);
     precopy_remove_notifier(&vmem->precopy_notifier);
     qemu_unregister_reset(virtio_mem_system_reset, vmem);
     vmstate_unregister_ram(&vmem->memdev->mr, DEVICE(vmem));
@@ -813,6 +898,7 @@ static void virtio_mem_instance_init(Object *obj)
     vmem->block_size = VIRTIO_MEM_MIN_BLOCK_SIZE;
     notifier_list_init(&vmem->size_change_notifiers);
     vmem->precopy_notifier.notify = virtio_mem_precopy_notify;
+    QLIST_INIT(&vmem->sram_notify);
 
     object_property_add(obj, VIRTIO_MEM_SIZE_PROP, "size", virtio_mem_get_size,
                         NULL, NULL, NULL);
@@ -832,11 +918,72 @@ static Property virtio_mem_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
+static uint64_t virtio_mem_srh_get_granularity(const SparseRAMHandler *srh,
+                                               const MemoryRegion *mr)
+{
+    const VirtIOMEM *vmem = VIRTIO_MEM(srh);
+
+    g_assert(mr == &vmem->memdev->mr);
+    return vmem->block_size;
+}
+
+static void virtio_mem_srh_register_listener(SparseRAMHandler *srh,
+                                             const MemoryRegion *mr,
+                                             SparseRAMNotifier *notifier)
+{
+    VirtIOMEM *vmem = VIRTIO_MEM(srh);
+
+    g_assert(mr == &vmem->memdev->mr);
+    QLIST_INSERT_HEAD(&vmem->sram_notify, notifier, next);
+}
+
+static void virtio_mem_srh_unregister_listener(SparseRAMHandler *srh,
+                                               const MemoryRegion *mr,
+                                               SparseRAMNotifier *notifier)
+{
+    VirtIOMEM *vmem = VIRTIO_MEM(srh);
+
+    g_assert(mr == &vmem->memdev->mr);
+    QLIST_REMOVE(notifier, next);
+}
+
+static int virtio_mem_srh_replay_mapped(SparseRAMHandler *srh,
+                                        const MemoryRegion *mr,
+                                        SparseRAMNotifier *notifier)
+{
+    VirtIOMEM *vmem = VIRTIO_MEM(srh);
+    unsigned long first_bit, last_bit;
+    uint64_t offset, length;
+    int ret = 0;
+
+    g_assert(mr == &vmem->memdev->mr);
+
+    /* Find consecutive plugged blocks and notify */
+    first_bit = find_first_bit(vmem->bitmap, vmem->bitmap_size);
+    while (first_bit < vmem->bitmap_size) {
+        offset = first_bit * vmem->block_size;
+        last_bit = find_next_zero_bit(vmem->bitmap, vmem->bitmap_size,
+                                      first_bit + 1) - 1;
+        length = (last_bit - first_bit + 1) * vmem->block_size;
+
+        ret = notifier->notify_map(notifier, mr, offset, length);
+        if (ret) {
+            break;
+        }
+        first_bit = find_next_bit(vmem->bitmap, vmem->bitmap_size,
+                                  last_bit + 2);
+    }
+
+    /* TODO: cleanup on error if necessary. */
+    return ret;
+}
+
 static void virtio_mem_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
     VirtioDeviceClass *vdc = VIRTIO_DEVICE_CLASS(klass);
     VirtIOMEMClass *vmc = VIRTIO_MEM_CLASS(klass);
+    SparseRAMHandlerClass *srhc = SPARSE_RAM_HANDLER_CLASS(klass);
 
     device_class_set_props(dc, virtio_mem_properties);
     dc->vmsd = &vmstate_virtio_mem;
@@ -852,6 +999,11 @@ static void virtio_mem_class_init(ObjectClass *klass, void *data)
     vmc->get_memory_region = virtio_mem_get_memory_region;
     vmc->add_size_change_notifier = virtio_mem_add_size_change_notifier;
     vmc->remove_size_change_notifier = virtio_mem_remove_size_change_notifier;
+
+    srhc->get_granularity = virtio_mem_srh_get_granularity;
+    srhc->register_listener = virtio_mem_srh_register_listener;
+    srhc->unregister_listener = virtio_mem_srh_unregister_listener;
+    srhc->replay_mapped = virtio_mem_srh_replay_mapped;
 }
 
 static const TypeInfo virtio_mem_info = {
@@ -861,6 +1013,10 @@ static const TypeInfo virtio_mem_info = {
     .instance_init = virtio_mem_instance_init,
     .class_init = virtio_mem_class_init,
     .class_size = sizeof(VirtIOMEMClass),
+    .interfaces = (InterfaceInfo[]) {
+        { TYPE_SPARSE_RAM_HANDLER },
+        { }
+    },
 };
 
 static void virtio_register_types(void)
diff --git a/include/hw/virtio/virtio-mem.h b/include/hw/virtio/virtio-mem.h
index 4eeb82d5dd..91d9b48ba0 100644
--- a/include/hw/virtio/virtio-mem.h
+++ b/include/hw/virtio/virtio-mem.h
@@ -67,6 +67,9 @@ struct VirtIOMEM {
 
     /* don't migrate unplugged memory */
     NotifierWithReturn precopy_notifier;
+
+    /* SparseRAMNotifier list to be notified on plug/unplug events. */
+    QLIST_HEAD(, SparseRAMNotifier) sram_notify;
 };
 
 struct VirtIOMEMClass {
-- 
2.26.2




* [PATCH PROTOTYPE 3/6] vfio: Implement support for sparse RAM memory regions
  2020-09-24 16:04 [PATCH PROTOTYPE 0/6] virtio-mem: vfio support David Hildenbrand
  2020-09-24 16:04 ` [PATCH PROTOTYPE 1/6] memory: Introduce sparse RAM handler for memory regions David Hildenbrand
  2020-09-24 16:04 ` [PATCH PROTOTYPE 2/6] virtio-mem: Implement SparseRAMHandler interface David Hildenbrand
@ 2020-09-24 16:04 ` David Hildenbrand
  2020-10-20 19:44   ` Peter Xu
  2020-09-24 16:04 ` [PATCH PROTOTYPE 4/6] memory: Extend ram_block_discard_(require|disable) by two discard types David Hildenbrand
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2020-09-24 16:04 UTC (permalink / raw)
  To: qemu-devel
  Cc: Pankaj Gupta, David Hildenbrand, Michael S. Tsirkin,
	Dr . David Alan Gilbert, Peter Xu, Luiz Capitulino, Auger Eric,
	Alex Williamson, Wei Yang, Igor Mammedov, Paolo Bonzini

Implement support for sparse RAM, to be used by virtio-mem. Handling
is somewhat similar to memory_region_is_iommu() handling, which also
notifies on changes.

Instead of mapping the whole region, we only map selected pieces (and
unmap previously selected pieces) when notified by the SparseRAMHandler.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/vfio/common.c              | 155 ++++++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-common.h |  12 +++
 2 files changed, 167 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 13471ae294..a3aaf70dd8 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -37,6 +37,7 @@
 #include "sysemu/reset.h"
 #include "trace.h"
 #include "qapi/error.h"
+#include "qemu/units.h"
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -498,6 +499,143 @@ out:
     rcu_read_unlock();
 }
 
+static int vfio_sparse_ram_notify(SparseRAMNotifier *n, const MemoryRegion *mr,
+                                  uint64_t mr_offset, uint64_t size,
+                                  bool map)
+{
+    VFIOSparseRAM *sram = container_of(n, VFIOSparseRAM, notifier);
+    const hwaddr mr_start = MAX(mr_offset, sram->offset_within_region);
+    const hwaddr mr_end = MIN(mr_offset + size,
+                              sram->offset_within_region + sram->size);
+    const hwaddr iova_start = mr_start + sram->offset_within_address_space;
+    const hwaddr iova_end = mr_end + sram->offset_within_address_space;
+    hwaddr mr_cur, iova_cur, mr_next;
+    void *vaddr;
+    int ret, ret2;
+
+    g_assert(mr == sram->mr);
+
+    /* We get notified about everything, ignore ranges we don't care about. */
+    if (mr_start >= mr_end) {
+        return 0;
+    }
+
+    /* Unmap everything with a single call. */
+    if (!map) {
+        ret = vfio_dma_unmap(sram->container, iova_start,
+                             iova_end - iova_start);
+        if (ret) {
+            error_report("%s: vfio_dma_unmap() failed: %s", __func__,
+                         strerror(-ret));
+        }
+        return 0;
+    }
+
+    /* TODO: fail early if we would exceed a specified number of mappings. */
+
+    /* Map in (aligned within MR) granularity, so we can unmap later. */
+    for (mr_cur = mr_start; mr_cur < mr_end; mr_cur = mr_next) {
+        iova_cur = mr_cur + sram->offset_within_address_space;
+        mr_next = QEMU_ALIGN_UP(mr_cur + 1, sram->granularity);
+        mr_next = MIN(mr_next, mr_end);
+
+        vaddr = memory_region_get_ram_ptr(sram->mr) + mr_cur;
+        ret = vfio_dma_map(sram->container, iova_cur, mr_next - mr_cur,
+                           vaddr, mr->readonly);
+        if (ret) {
+            /* Rollback in case of error. */
+            if (mr_cur != mr_start) {
+                ret2 = vfio_dma_unmap(sram->container, iova_start,
+                                      iova_end - iova_start);
+                if (ret2) {
+                    error_report("%s: vfio_dma_unmap() failed: %s", __func__,
+                                  strerror(-ret2));
+                }
+            }
+            return ret;
+        }
+    }
+    return 0;
+}
+
+static int vfio_sparse_ram_notify_map(SparseRAMNotifier *n,
+                                      const MemoryRegion *mr,
+                                      uint64_t mr_offset, uint64_t size)
+{
+    return vfio_sparse_ram_notify(n, mr, mr_offset, size, true);
+}
+
+static void vfio_sparse_ram_notify_unmap(SparseRAMNotifier *n,
+                                         const MemoryRegion *mr,
+                                         uint64_t mr_offset, uint64_t size)
+{
+    vfio_sparse_ram_notify(n, mr, mr_offset, size, false);
+}
+
+static void vfio_register_sparse_ram(VFIOContainer *container,
+                                     MemoryRegionSection *section)
+{
+    VFIOSparseRAM *sram;
+    int ret;
+
+    sram = g_new0(VFIOSparseRAM, 1);
+    sram->container = container;
+    sram->mr = section->mr;
+    sram->offset_within_region = section->offset_within_region;
+    sram->offset_within_address_space = section->offset_within_address_space;
+    sram->size = int128_get64(section->size);
+    sram->granularity = memory_region_sparse_ram_get_granularity(section->mr);
+
+    /*
+     * TODO: We usually want a bigger granularity (for a lot of added memory,
+     * as we need quite a lot of mappings) - however, this has to be configured
+     * by the user.
+     */
+    g_assert(sram->granularity >= 1 * MiB &&
+             is_power_of_2(sram->granularity));
+
+    /* Register the notifier */
+    sparse_ram_notifier_init(&sram->notifier, vfio_sparse_ram_notify_map,
+                             vfio_sparse_ram_notify_unmap);
+    memory_region_register_sparse_ram_notifier(section->mr, &sram->notifier);
+    QLIST_INSERT_HEAD(&container->sram_list, sram, next);
+    /*
+     * Replay mapped blocks - if anything goes wrong (only when hotplugging
+     * vfio devices), report the error for now.
+     *
+     * TODO: Can we catch this earlier?
+     */
+    ret = memory_region_sparse_ram_replay_mapped(section->mr, &sram->notifier);
+    if (ret) {
+        error_report("%s: failed to replay mappings: %s", __func__,
+                     strerror(-ret));
+    }
+}
+
+static void vfio_unregister_sparse_ram(VFIOContainer *container,
+                                       MemoryRegionSection *section)
+{
+    VFIOSparseRAM *sram = NULL;
+
+    QLIST_FOREACH(sram, &container->sram_list, next) {
+        if (sram->mr == section->mr &&
+            sram->offset_within_region == section->offset_within_region &&
+            sram->offset_within_address_space ==
+            section->offset_within_address_space) {
+            break;
+        }
+    }
+
+    if (!sram) {
+        hw_error("vfio: Trying to unregister non-existent sparse RAM");
+    }
+
+    memory_region_unregister_sparse_ram_notifier(section->mr, &sram->notifier);
+    QLIST_REMOVE(sram, next);
+    g_free(sram);
+    /* The caller is expected to vfio_dma_unmap(). */
+}
+
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
@@ -650,6 +788,15 @@ static void vfio_listener_region_add(MemoryListener *listener,
 
     /* Here we assume that memory_region_is_ram(section->mr)==true */
 
+    /*
+     * For sparse RAM, we only want to register the actually mapped
+     * pieces - and update the mapping whenever we're notified about changes.
+     */
+    if (memory_region_is_sparse_ram(section->mr)) {
+        vfio_register_sparse_ram(container, section);
+        return;
+    }
+
     vaddr = memory_region_get_ram_ptr(section->mr) +
             section->offset_within_region +
             (iova - section->offset_within_address_space);
@@ -786,6 +933,13 @@ static void vfio_listener_region_del(MemoryListener *listener,
 
         pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
         try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask));
+    } else if (memory_region_is_sparse_ram(section->mr)) {
+        vfio_unregister_sparse_ram(container, section);
+        /*
+         * We rely on a single vfio_dma_unmap() call below to clean the whole
+         * region.
+         */
+        try_unmap = true;
     }
 
     if (try_unmap) {
@@ -1275,6 +1429,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     container->error = NULL;
     QLIST_INIT(&container->giommu_list);
     QLIST_INIT(&container->hostwin_list);
+    QLIST_INIT(&container->sram_list);
 
     ret = vfio_init_container(container, group->fd, errp);
     if (ret) {
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index c78f3ff559..dfa18dbd8e 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -77,6 +77,7 @@ typedef struct VFIOContainer {
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
     QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
     QLIST_HEAD(, VFIOGroup) group_list;
+    QLIST_HEAD(, VFIOSparseRAM) sram_list;
     QLIST_ENTRY(VFIOContainer) next;
 } VFIOContainer;
 
@@ -88,6 +89,17 @@ typedef struct VFIOGuestIOMMU {
     QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
 } VFIOGuestIOMMU;
 
+typedef struct VFIOSparseRAM {
+    VFIOContainer *container;
+    MemoryRegion *mr;
+    hwaddr offset_within_region;
+    hwaddr offset_within_address_space;
+    hwaddr size;
+    uint64_t granularity;
+    SparseRAMNotifier notifier;
+    QLIST_ENTRY(VFIOSparseRAM) next;
+} VFIOSparseRAM;
+
 typedef struct VFIOHostDMAWindow {
     hwaddr min_iova;
     hwaddr max_iova;
-- 
2.26.2




* [PATCH PROTOTYPE 4/6] memory: Extend ram_block_discard_(require|disable) by two discard types
  2020-09-24 16:04 [PATCH PROTOTYPE 0/6] virtio-mem: vfio support David Hildenbrand
                   ` (2 preceding siblings ...)
  2020-09-24 16:04 ` [PATCH PROTOTYPE 3/6] vfio: Implement support for sparse RAM memory regions David Hildenbrand
@ 2020-09-24 16:04 ` David Hildenbrand
  2020-10-20 19:17   ` Peter Xu
  2020-09-24 16:04 ` [PATCH PROTOTYPE 5/6] virtio-mem: Require only RAM_BLOCK_DISCARD_T_COORDINATED discards David Hildenbrand
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2020-09-24 16:04 UTC (permalink / raw)
  To: qemu-devel
  Cc: Pankaj Gupta, David Hildenbrand, Michael S. Tsirkin,
	Dr . David Alan Gilbert, Peter Xu, Luiz Capitulino, Auger Eric,
	Alex Williamson, Wei Yang, Igor Mammedov, Paolo Bonzini

We want to separate two cases:
- ballooning drivers do random discards on random guest memory (e.g.,
  virtio-balloon) - uncoordinated discards
- paravirtualized memory devices do discards at a well-known granularity,
  and always know which blocks are currently accessible or inaccessible by
  the guest - coordinated discards

This will be required to get virtio-mem + vfio running - vfio still
wants to block random memory ballooning.
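
Roughly, with the new interface (see the last two patches of this series),
the two sides will then do:

    /* virtio-mem: we only rely on coordinated discards to work */
    if (ram_block_discard_type_require(RAM_BLOCK_DISCARD_T_COORDINATED, true)) {
        error_setg(errp, "Discarding RAM is disabled");
        return;
    }

    /* vfio: keep blocking only uncoordinated discards (e.g., virtio-balloon) */
    ret = ram_block_discard_type_disable(RAM_BLOCK_DISCARD_T_UNCOORDINATED,
                                         true);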

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 exec.c                | 109 ++++++++++++++++++++++++++++++++++--------
 include/exec/memory.h |  36 ++++++++++++--
 2 files changed, 121 insertions(+), 24 deletions(-)

diff --git a/exec.c b/exec.c
index e34b602bdf..83098e9230 100644
--- a/exec.c
+++ b/exec.c
@@ -4098,52 +4098,121 @@ void mtree_print_dispatch(AddressSpaceDispatch *d, MemoryRegion *root)
  * If positive, discarding RAM is disabled. If negative, discarding RAM is
  * required to work and cannot be disabled.
  */
-static int ram_block_discard_disabled;
+static int uncoordinated_discard_disabled;
+static int coordinated_discard_disabled;
 
-int ram_block_discard_disable(bool state)
+static int __ram_block_discard_disable(int *counter)
 {
     int old;
 
-    if (!state) {
-        atomic_dec(&ram_block_discard_disabled);
-        return 0;
-    }
-
     do {
-        old = atomic_read(&ram_block_discard_disabled);
+        old = atomic_read(counter);
         if (old < 0) {
             return -EBUSY;
         }
-    } while (atomic_cmpxchg(&ram_block_discard_disabled, old, old + 1) != old);
+    } while (atomic_cmpxchg(counter, old, old + 1) != old);
+
     return 0;
 }
 
-int ram_block_discard_require(bool state)
+int ram_block_discard_type_disable(RamBlockDiscardType type, bool state)
 {
-    int old;
+    int ret;
 
-    if (!state) {
-        atomic_inc(&ram_block_discard_disabled);
-        return 0;
+    if (type & RAM_BLOCK_DISCARD_T_UNCOORDINATED) {
+        if (!state) {
+            atomic_dec(&uncoordinated_discard_disabled);
+        } else {
+            ret = __ram_block_discard_disable(&uncoordinated_discard_disabled);
+            if (ret) {
+                return ret;
+            }
+        }
     }
+    if (type & RAM_BLOCK_DISCARD_T_COORDINATED) {
+        if (!state) {
+            atomic_dec(&coordinated_discard_disabled);
+        } else {
+            ret = __ram_block_discard_disable(&coordinated_discard_disabled);
+            if (ret) {
+                /* Rollback the previous change. */
+                if (type & RAM_BLOCK_DISCARD_T_UNCOORDINATED) {
+                    atomic_dec(&uncoordinated_discard_disabled);
+                }
+                return ret;
+            }
+        }
+    }
+    return 0;
+}
+
+static int __ram_block_discard_require(int *counter)
+{
+    int old;
 
     do {
-        old = atomic_read(&ram_block_discard_disabled);
+        old = atomic_read(counter);
         if (old > 0) {
             return -EBUSY;
         }
-    } while (atomic_cmpxchg(&ram_block_discard_disabled, old, old - 1) != old);
+    } while (atomic_cmpxchg(counter, old, old - 1) != old);
+
+    return 0;
+}
+
+int ram_block_discard_type_require(RamBlockDiscardType type, bool state)
+{
+    int ret;
+
+    if (type & RAM_BLOCK_DISCARD_T_UNCOORDINATED) {
+        if (!state) {
+            atomic_inc(&uncoordinated_discard_disabled);
+        } else {
+            ret = __ram_block_discard_require(&uncoordinated_discard_disabled);
+            if (ret) {
+                return ret;
+            }
+        }
+    }
+    if (type & RAM_BLOCK_DISCARD_T_COORDINATED) {
+        if (!state) {
+            atomic_inc(&coordinated_discard_disabled);
+        } else {
+            ret = __ram_block_discard_require(&coordinated_discard_disabled);
+            if (ret) {
+                /* Rollback the previous change. */
+                if (type & RAM_BLOCK_DISCARD_T_UNCOORDINATED) {
+                    atomic_inc(&uncoordinated_discard_disabled);
+                }
+                return ret;
+            }
+        }
+    }
     return 0;
 }
 
-bool ram_block_discard_is_disabled(void)
+bool ram_block_discard_type_is_disabled(RamBlockDiscardType type)
 {
-    return atomic_read(&ram_block_discard_disabled) > 0;
+    if (type & RAM_BLOCK_DISCARD_T_UNCOORDINATED &&
+        atomic_read(&uncoordinated_discard_disabled) > 0) {
+        return true;
+    } else if (type & RAM_BLOCK_DISCARD_T_COORDINATED &&
+               atomic_read(&coordinated_discard_disabled) > 0) {
+        return true;
+    }
+    return false;
 }
 
-bool ram_block_discard_is_required(void)
+bool ram_block_discard_type_is_required(RamBlockDiscardType type)
 {
-    return atomic_read(&ram_block_discard_disabled) < 0;
+    if (type & RAM_BLOCK_DISCARD_T_UNCOORDINATED &&
+        atomic_read(&uncoordinated_discard_disabled) < 0) {
+        return true;
+    } else if (type & RAM_BLOCK_DISCARD_T_COORDINATED &&
+               atomic_read(&coordinated_discard_disabled) < 0) {
+        return true;
+    }
+    return false;
 }
 
 #endif
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 2931ead730..3169ebc3d9 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -2588,6 +2588,18 @@ static inline MemOp devend_memop(enum device_endian end)
 }
 #endif
 
+typedef enum RamBlockDiscardType {
+    /* Uncoordinated discards (e.g., virtio-balloon) */
+    RAM_BLOCK_DISCARD_T_UNCOORDINATED = 1,
+    /*
+     * Coordinated discards on selected memory regions (e.g., virtio-mem via
+     * SparseRAMNotifier).
+     */
+    RAM_BLOCK_DISCARD_T_COORDINATED =   2,
+    /* Any type of discards */
+    RAM_BLOCK_DISCARD_T_ANY =           3,
+} RamBlockDiscardType;
+
 /*
  * Inhibit technologies that require discarding of pages in RAM blocks, e.g.,
  * to manage the actual amount of memory consumed by the VM (then, the memory
@@ -2609,7 +2621,11 @@ static inline MemOp devend_memop(enum device_endian end)
  * Returns 0 if successful. Returns -EBUSY if a technology that relies on
  * discards to work reliably is active.
  */
-int ram_block_discard_disable(bool state);
+int ram_block_discard_type_disable(RamBlockDiscardType type, bool state);
+static inline int ram_block_discard_disable(bool state)
+{
+    return ram_block_discard_type_disable(RAM_BLOCK_DISCARD_T_ANY, state);
+}
 
 /*
  * Inhibit technologies that disable discarding of pages in RAM blocks.
@@ -2617,17 +2633,29 @@ int ram_block_discard_disable(bool state);
  * Returns 0 if successful. Returns -EBUSY if discards are already set to
  * broken.
  */
-int ram_block_discard_require(bool state);
+int ram_block_discard_type_require(RamBlockDiscardType type, bool state);
+static inline int ram_block_discard_require(bool state)
+{
+    return ram_block_discard_type_require(RAM_BLOCK_DISCARD_T_ANY, state);
+}
 
 /*
  * Test if discarding of memory in ram blocks is disabled.
  */
-bool ram_block_discard_is_disabled(void);
+bool ram_block_discard_type_is_disabled(RamBlockDiscardType type);
+static inline bool ram_block_discard_is_disabled(void)
+{
+    return ram_block_discard_type_is_disabled(RAM_BLOCK_DISCARD_T_ANY);
+}
 
 /*
  * Test if discarding of memory in ram blocks is required to work reliably.
  */
-bool ram_block_discard_is_required(void);
+bool ram_block_discard_type_is_required(RamBlockDiscardType type);
+static inline bool ram_block_discard_is_required(void)
+{
+    return ram_block_discard_type_is_required(RAM_BLOCK_DISCARD_T_ANY);
+}
 
 #endif
 
-- 
2.26.2




* [PATCH PROTOTYPE 5/6] virtio-mem: Require only RAM_BLOCK_DISCARD_T_COORDINATED discards
  2020-09-24 16:04 [PATCH PROTOTYPE 0/6] virtio-mem: vfio support David Hildenbrand
                   ` (3 preceding siblings ...)
  2020-09-24 16:04 ` [PATCH PROTOTYPE 4/6] memory: Extend ram_block_discard_(require|disable) by two discard types David Hildenbrand
@ 2020-09-24 16:04 ` David Hildenbrand
  2020-09-24 16:04 ` [PATCH PROTOTYPE 6/6] vfio: Disable only RAM_BLOCK_DISCARD_T_UNCOORDINATED discards David Hildenbrand
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2020-09-24 16:04 UTC (permalink / raw)
  To: qemu-devel
  Cc: Pankaj Gupta, David Hildenbrand, Michael S. Tsirkin,
	Dr . David Alan Gilbert, Peter Xu, Luiz Capitulino, Auger Eric,
	Alex Williamson, Wei Yang, Igor Mammedov, Paolo Bonzini

We implement the SparseRAMHandler interface and properly communicate
changes by notifying listeners - especially in all scenarios:
- when memory becomes usable by the guest
- when memory becomes unusable by the guest (and we discard the memory)

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/virtio-mem.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index e23969eaed..efeff7c64c 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -531,7 +531,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    if (ram_block_discard_require(true)) {
+    if (ram_block_discard_type_require(RAM_BLOCK_DISCARD_T_COORDINATED, true)) {
         error_setg(errp, "Discarding RAM is disabled");
         return;
     }
@@ -539,7 +539,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
     ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb));
     if (ret) {
         error_setg_errno(errp, -ret, "Unexpected error discarding RAM");
-        ram_block_discard_require(false);
+        ram_block_discard_type_require(RAM_BLOCK_DISCARD_T_COORDINATED, false);
         return;
     }
 
@@ -579,7 +579,7 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
     virtio_del_queue(vdev, 0);
     virtio_cleanup(vdev);
     g_free(vmem->bitmap);
-    ram_block_discard_require(false);
+    ram_block_discard_type_require(RAM_BLOCK_DISCARD_T_COORDINATED, false);
 }
 
 static int virtio_mem_restore_unplugged(VirtIOMEM *vmem)
-- 
2.26.2




* [PATCH PROTOTYPE 6/6] vfio: Disable only RAM_BLOCK_DISCARD_T_UNCOORDINATED discards
  2020-09-24 16:04 [PATCH PROTOTYPE 0/6] virtio-mem: vfio support David Hildenbrand
                   ` (4 preceding siblings ...)
  2020-09-24 16:04 ` [PATCH PROTOTYPE 5/6] virtio-mem: Require only RAM_BLOCK_DISCARD_T_COORDINATED discards David Hildenbrand
@ 2020-09-24 16:04 ` David Hildenbrand
  2020-09-24 19:30 ` [PATCH PROTOTYPE 0/6] virtio-mem: vfio support no-reply
  2020-09-29 17:02 ` Dr. David Alan Gilbert
  7 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2020-09-24 16:04 UTC (permalink / raw)
  To: qemu-devel
  Cc: Pankaj Gupta, David Hildenbrand, Michael S. Tsirkin,
	Dr . David Alan Gilbert, Peter Xu, Luiz Capitulino, Auger Eric,
	Alex Williamson, Wei Yang, Igor Mammedov, Paolo Bonzini

This unlocks virtio-mem with vfio. A virtio-mem device properly notifies
about all accessible/mapped blocks inside a managed memory region -
whenever blocks become accessible and whenever blocks become inaccessible.

Note: The block size of a virtio-mem device has to be set to a sane size,
depending on the maximum hotplug size, to not run out of vfio mappings.
The default virtio-mem block size is usually in the range of a couple of
MBs. Linux kernels (x86-64) don't support block sizes > 128MB
with an initial memory size of < 64 MB - and above that, only 2GB in some
cases. The larger the blocks, the less likely a lot of memory can get
unplugged again. The smaller the blocks, the slower memory hot(un)plug
will be.

Assume you want to hotplug 256GB - the block size would have to be at
least 8 MB (resulting in 32768 distinct mappings).

It's expected that the block size will be comparatively large when
virtio-mem is used with vfio in the future (e.g., 128MB, 1G, 2G) -
something Linux guests will have to be optimized for.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/vfio/common.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a3aaf70dd8..4d82296967 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1392,8 +1392,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
      * new memory, it will not yet set ram_block_discard_set_required() and
      * therefore, neither stops us here or deals with the sudden memory
      * consumption of inflated memory.
+     *
+     * We do support discarding for memory regions where accessible pieces
+     * are coordinated via the SparseRAMNotifier.
      */
-    ret = ram_block_discard_disable(true);
+    ret = ram_block_discard_type_disable(RAM_BLOCK_DISCARD_T_UNCOORDINATED,
+                                         true);
     if (ret) {
         error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
         return ret;
@@ -1564,7 +1568,7 @@ close_fd_exit:
     close(fd);
 
 put_space_exit:
-    ram_block_discard_disable(false);
+    ram_block_discard_type_disable(RAM_BLOCK_DISCARD_T_UNCOORDINATED, false);
     vfio_put_address_space(space);
 
     return ret;
@@ -1686,7 +1690,8 @@ void vfio_put_group(VFIOGroup *group)
     }
 
     if (!group->ram_block_discard_allowed) {
-        ram_block_discard_disable(false);
+        ram_block_discard_type_disable(RAM_BLOCK_DISCARD_T_UNCOORDINATED,
+                                       false);
     }
     vfio_kvm_device_del_group(group);
     vfio_disconnect_container(group);
@@ -1740,7 +1745,8 @@ int vfio_get_device(VFIOGroup *group, const char *name,
 
         if (!group->ram_block_discard_allowed) {
             group->ram_block_discard_allowed = true;
-            ram_block_discard_disable(false);
+            ram_block_discard_type_disable(RAM_BLOCK_DISCARD_T_UNCOORDINATED,
+                                           false);
         }
     }
 
-- 
2.26.2




* Re: [PATCH PROTOTYPE 0/6] virtio-mem: vfio support
  2020-09-24 16:04 [PATCH PROTOTYPE 0/6] virtio-mem: vfio support David Hildenbrand
                   ` (5 preceding siblings ...)
  2020-09-24 16:04 ` [PATCH PROTOTYPE 6/6] vfio: Disable only RAM_BLOCK_DISCARD_T_UNCOORDINATED discards David Hildenbrand
@ 2020-09-24 19:30 ` no-reply
  2020-09-29 17:02 ` Dr. David Alan Gilbert
  7 siblings, 0 replies; 27+ messages in thread
From: no-reply @ 2020-09-24 19:30 UTC (permalink / raw)
  To: david
  Cc: pankaj.gupta.linux, mst, david, qemu-devel, peterx, dgilbert,
	eric.auger, alex.williamson, richardw.yang, imammedo, pbonzini,
	lcapitulino

Patchew URL: https://patchew.org/QEMU/20200924160423.106747-1-david@redhat.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20200924160423.106747-1-david@redhat.com
Subject: [PATCH PROTOTYPE 0/6] virtio-mem: vfio support

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 - [tag update]      patchew/20200922210101.4081073-1-jsnow@redhat.com -> patchew/20200922210101.4081073-1-jsnow@redhat.com
 - [tag update]      patchew/20200924185414.28642-1-vsementsov@virtuozzo.com -> patchew/20200924185414.28642-1-vsementsov@virtuozzo.com
Switched to a new branch 'test'
8afd1df vfio: Disable only RAM_BLOCK_DISCARD_T_UNCOORDINATED discards
d676b32 virtio-mem: Require only RAM_BLOCK_DISCARD_T_COORDINATED discards
9492f67 memory: Extend ram_block_discard_(require|disable) by two discard types
9eeec69 vfio: Implement support for sparse RAM memory regions
3e21d3f virtio-mem: Implement SparseRAMHandler interface
2cfc417 memory: Introduce sparse RAM handler for memory regions

=== OUTPUT BEGIN ===
1/6 Checking commit 2cfc4176fbf5 (memory: Introduce sparse RAM handler for memory regions)
ERROR: "foo* bar" should be "foo *bar"
#149: FILE: include/exec/memory.h:1952:
+static inline SparseRAMHandler* memory_region_get_sparse_ram_handler(

total: 1 errors, 0 warnings, 162 lines checked

Patch 1/6 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

2/6 Checking commit 3e21d3f59244 (virtio-mem: Implement SparseRAMHandler interface)
3/6 Checking commit 9eeec69031b0 (vfio: Implement support for sparse RAM memory regions)
4/6 Checking commit 9492f6715512 (memory: Extend ram_block_discard_(require|disable) by two discard types)
5/6 Checking commit d676b32336b5 (virtio-mem: Require only RAM_BLOCK_DISCARD_T_COORDINATED discards)
6/6 Checking commit 8afd1df27b99 (vfio: Disable only RAM_BLOCK_DISCARD_T_UNCOORDINATED discards)
=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/20200924160423.106747-1-david@redhat.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com


* Re: [PATCH PROTOTYPE 0/6] virtio-mem: vfio support
  2020-09-24 16:04 [PATCH PROTOTYPE 0/6] virtio-mem: vfio support David Hildenbrand
                   ` (6 preceding siblings ...)
  2020-09-24 19:30 ` [PATCH PROTOTYPE 0/6] virtio-mem: vfio support no-reply
@ 2020-09-29 17:02 ` Dr. David Alan Gilbert
  2020-09-29 17:05   ` David Hildenbrand
  7 siblings, 1 reply; 27+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-29 17:02 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Gupta, Michael S. Tsirkin, qemu-devel, Peter Xu,
	Luiz Capitulino, Auger Eric, Alex Williamson, Wei Yang,
	Paolo Bonzini, Igor Mammedov

* David Hildenbrand (david@redhat.com) wrote:
> This is a quick and dirty (1.5 days of hacking) prototype to make
> vfio and virtio-mem play together. The basic idea was the result of Alex
> brainstorming with me on how to tackle this.
> 
> A virtio-mem device manages a memory region in guest physical address
> space, represented as a single (currently large) memory region in QEMU.
> Before the guest is allowed to use memory blocks, it must coordinate with
> the hypervisor (plug blocks). After a reboot, all memory is usually
> unplugged - when the guest comes up, it detects the virtio-mem device and
> selects memory blocks to plug (based on requests from the hypervisor).
> 
> Memory hot(un)plug consists of (un)plugging memory blocks via a virtio-mem
> device (triggered by the guest). When unplugging blocks, we discard the
> memory. In contrast to memory ballooning, we always know which memory
> blocks a guest may use - especially during a reboot, after a crash, or
> after kexec.
> 
> The issue with vfio is that it cannot deal with random discards - for this
> reason, virtio-mem and vfio can currently only run mutually exclusively.
> In particular, vfio would currently map the whole memory region (with possibly
> only few/no plugged blocks), resulting in all pages getting pinned and
> therefore in a higher memory consumption than expected (rendering
> virtio-mem basically useless in these environments).
> 
> To make vfio work nicely with virtio-mem, we have to map only the plugged
> blocks, and map/unmap properly when plugging/unplugging blocks (including
> discarding of RAM when unplugging). We achieve that by using a new notifier
> mechanism that communicates changes.
> 
> It's important to map memory in the granularity in which we could see
> unmaps again (-> virtio-mem block size) - so when e.g., plugging
> consecutive 100 MB with a block size of 2MB, we need 50 mappings. When
> unmapping, we can use a single vfio_unmap call for the applicable range.
> We expect that the block size of virtio-mem devices will be fairly large
> in the future (to not run out of mappings and to improve hot(un)plug
> performance), configured by the user, when used with vfio (e.g., 128MB,
> 1G, ...) - Linux guests will still have to be optimized for that.

This seems pretty painful for those few TB mappings.
Also the calls seem pretty painful; maybe it'll be possible to have
calls that are optimised for making multiple consecutive mappings.
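Just as a rough idea of the numbers: at a 2 MiB block size, a single TiB of
plugged memory already needs 524288 mappings, while at 128 MiB it drops to
8192; IIRC the vfio type1 backend defaults to 65535 user mappings per
container (dma_entry_limit), so the block size really has to be scaled up
with the device size.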

Dave

> We try to handle errors when plugging memory (mapping in VFIO) gracefully
> - especially to cope with too many mappings in VFIO.
> 
> 
> As I basically have no experience with vfio, all I did for testing is
> passthrough a secondary GPU (NVIDIA GK208B) via vfio-pci to my guest
> and saw it pop up in dmesg. I did *not* actually try to use it (I know
> ...), so there might still be plenty of BUGs regarding the actual mappings
> in the code. When I resize virtio-mem devices (resulting in
> memory hot(un)plug), I can spot the memory consumption of my host adjusting
> accordingly - in contrast to before, wehreby my machine would always
> consume the maximum size of my VM, as if all memory provided by
> virtio-mem devices were fully plugged.
> 
> I even tested it with 2MB huge pages (sadly for the first time with
> virtio-mem ever) - and it worked like a charm on the hypervisor side as
> well. The number of free hugepages adjusted accordingly. (again, did not
> properly test the device in the guest ...).
> 
> If anybody wants to play with it and needs some guidance, please feel
> free to ask. I might add some vfio-related documentation to
> https://virtio-mem.gitlab.io/ (but it really isn't that special - only
> the block size limitations have to be considered).
> 
> David Hildenbrand (6):
>   memory: Introduce sparse RAM handler for memory regions
>   virtio-mem: Implement SparseRAMHandler interface
>   vfio: Implement support for sparse RAM memory regions
>   memory: Extend ram_block_discard_(require|disable) by two discard
>     types
>   virtio-mem: Require only RAM_BLOCK_DISCARD_T_COORDINATED discards
>   vfio: Disable only RAM_BLOCK_DISCARD_T_UNCOORDINATED discards
> 
>  exec.c                         | 109 +++++++++++++++++----
>  hw/vfio/common.c               | 169 ++++++++++++++++++++++++++++++++-
>  hw/virtio/virtio-mem.c         | 164 +++++++++++++++++++++++++++++++-
>  include/exec/memory.h          | 151 ++++++++++++++++++++++++++++-
>  include/hw/vfio/vfio-common.h  |  12 +++
>  include/hw/virtio/virtio-mem.h |   3 +
>  softmmu/memory.c               |   7 ++
>  7 files changed, 583 insertions(+), 32 deletions(-)
> 
> -- 
> 2.26.2
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH PROTOTYPE 0/6] virtio-mem: vfio support
  2020-09-29 17:02 ` Dr. David Alan Gilbert
@ 2020-09-29 17:05   ` David Hildenbrand
  0 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2020-09-29 17:05 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Pankaj Gupta, Michael S. Tsirkin, qemu-devel, Peter Xu,
	Luiz Capitulino, Auger Eric, Alex Williamson, Wei Yang,
	Paolo Bonzini, Igor Mammedov

On 29.09.20 19:02, Dr. David Alan Gilbert wrote:
> * David Hildenbrand (david@redhat.com) wrote:
>> This is a quick and dirty (1.5 days of hacking) prototype to make
>> vfio and virtio-mem play together. The basic idea was the result of Alex
>> brainstorming with me on how to tackle this.
>>
>> A virtio-mem device manages a memory region in guest physical address
>> space, represented as a single (currently large) memory region in QEMU.
>> Before the guest is allowed to use memory blocks, it must coordinate with
>> the hypervisor (plug blocks). After a reboot, all memory is usually
>> unplugged - when the guest comes up, it detects the virtio-mem device and
>> selects memory blocks to plug (based on requests from the hypervisor).
>>
>> Memory hot(un)plug consists of (un)plugging memory blocks via a virtio-mem
>> device (triggered by the guest). When unplugging blocks, we discard the
>> memory. In contrast to memory ballooning, we always know which memory
>> blocks a guest may use - especially during a reboot, after a crash, or
>> after kexec.
>>
>> The issue with vfio is, that it cannot deal with random discards - for this
>> reason, virtio-mem and vfio can currently only run mutually exclusive.
>> Especially, vfio would currently map the whole memory region (with possible
>> only little/no plugged blocks), resulting in all pages getting pinned and
>> therefore resulting in a higher memory consumption than expected (turning
>> virtio-mem basically useless in these environments).
>>
>> To make vfio work nicely with virtio-mem, we have to map only the plugged
>> blocks, and map/unmap properly when plugging/unplugging blocks (including
>> discarding of RAM when unplugging). We achieve that by using a new notifier
>> mechanism that communicates changes.
>>
>> It's important to map memory in the granularity in which we could see
>> unmaps again (-> virtio-mem block size) - so when e.g., plugging
>> consecutive 100 MB with a block size of 2MB, we need 50 mappings. When
>> unmapping, we can use a single vfio_unmap call for the applicable range.
>> We expect that the block size of virtio-mem devices will be fairly large
>> in the future (to not run out of mappings and to improve hot(un)plug
>> performance), configured by the user, when used with vfio (e.g., 128MB,
>> 1G, ...) - Linux guests will still have to be optimized for that.
> 
> This seems pretty painful for those few TB mappings.
> Also the calls seem pretty painful; maybe it'll be possible to have
> calls that are optimised for making multiple consecutive mappings.

Exactly the future I imagine. This patchset comes with no kernel interface
additions - once we have an optimized interface that understands
consecutive mappings (and the granularity), we can use that instead. The
prototype is already prepared for that by notifying about consecutive ranges.

Thanks!

-- 
Thanks,

David / dhildenb




* Re: [PATCH PROTOTYPE 4/6] memory: Extend ram_block_discard_(require|disable) by two discard types
  2020-09-24 16:04 ` [PATCH PROTOTYPE 4/6] memory: Extend ram_block_discard_(require|disable) by two discard types David Hildenbrand
@ 2020-10-20 19:17   ` Peter Xu
  2020-10-20 19:58     ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Xu @ 2020-10-20 19:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Gupta, Michael S. Tsirkin, qemu-devel, Luiz Capitulino,
	Auger Eric, Alex Williamson, Wei Yang, Igor Mammedov,
	Paolo Bonzini, Dr . David Alan Gilbert

On Thu, Sep 24, 2020 at 06:04:21PM +0200, David Hildenbrand wrote:
> We want to separate the two cases whereby
> - ballooning drivers do random discards on random guest memory (e.g.,
>   virtio-balloon) - uncoordinated discards
> - paravirtualized memory devices do discards in well-known granularity,
>   and always know which block is currently accessible or inaccessible by
>   a guest. - coordinated discards
> 
> This will be required to get virtio_mem + vfio running - vfio still
> wants to block random memory ballooning.
> 
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Wei Yang <richardw.yang@linux.intel.com>
> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
> Cc: Peter Xu <peterx@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  exec.c                | 109 ++++++++++++++++++++++++++++++++++--------
>  include/exec/memory.h |  36 ++++++++++++--
>  2 files changed, 121 insertions(+), 24 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index e34b602bdf..83098e9230 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -4098,52 +4098,121 @@ void mtree_print_dispatch(AddressSpaceDispatch *d, MemoryRegion *root)
>   * If positive, discarding RAM is disabled. If negative, discarding RAM is
>   * required to work and cannot be disabled.
>   */
> -static int ram_block_discard_disabled;
> +static int uncoordinated_discard_disabled;
> +static int coordinated_discard_disabled;

Instead of duplicating the code, how about starting to make it an array?

Btw, iiuc these flags do not need atomic operations at all, because all callers
should hold the BQL and are mostly called during machine start/reset.  Even if
not, I think we could also use a mutex; maybe it would make things simpler.  No
strong opinion, though.
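
Roughly what I have in mind (completely untested; RAM_BLOCK_DISCARD_T_MAX and
the helper name below are just placeholders, not taken from your patch):

/* one counter per discard type; assumes all callers hold the BQL */
static int ram_block_discard_disabled[RAM_BLOCK_DISCARD_T_MAX];

int ram_block_discard_type_disable(unsigned int type, bool state)
{
    if (!state) {
        ram_block_discard_disabled[type]--;
        return 0;
    }
    if (ram_block_discard_disabled[type] < 0) {
        /* this discard type is currently required by some device */
        return -EBUSY;
    }
    ram_block_discard_disabled[type]++;
    return 0;
}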

-- 
Peter Xu




* Re: [PATCH PROTOTYPE 1/6] memory: Introduce sparse RAM handler for memory regions
  2020-09-24 16:04 ` [PATCH PROTOTYPE 1/6] memory: Introduce sparse RAM handler for memory regions David Hildenbrand
@ 2020-10-20 19:24   ` Peter Xu
  2020-10-20 20:13     ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Xu @ 2020-10-20 19:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Gupta, Michael S. Tsirkin, qemu-devel, Luiz Capitulino,
	Auger Eric, Alex Williamson, Wei Yang, Igor Mammedov,
	Paolo Bonzini, Dr . David Alan Gilbert

On Thu, Sep 24, 2020 at 06:04:18PM +0200, David Hildenbrand wrote:
> +static inline void memory_region_set_sparse_ram_handler(MemoryRegion *mr,
> +                                                        SparseRAMHandler *srh)
> +{
> +    g_assert(memory_region_is_ram(mr));

Nit: Maybe assert mr->srh==NULL here?  If sparse ram handler is exclusive,
which afaiu, is a yes.

> +    mr->srh = srh;
> +}
> +
> +static inline void memory_region_register_sparse_ram_notifier(MemoryRegion *mr,
> +                                                           SparseRAMNotifier *n)
> +{
> +    SparseRAMHandler *srh = memory_region_get_sparse_ram_handler(mr);
> +    SparseRAMHandlerClass *srhc = SPARSE_RAM_HANDLER_GET_CLASS(srh);
> +
> +    srhc->register_listener(srh, mr, n);

I feel like you need to check srhc != NULL first, or vfio may start crashing
without virtio-mem...  Same question for the other ones (at least unregister).
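
E.g. (just a sketch of what I mean, reusing the names from your patch):

static inline void memory_region_register_sparse_ram_notifier(MemoryRegion *mr,
                                                               SparseRAMNotifier *n)
{
    SparseRAMHandler *srh = memory_region_get_sparse_ram_handler(mr);
    SparseRAMHandlerClass *srhc;

    if (!srh) {
        /* no sparse RAM handler set for this region (e.g., no virtio-mem) */
        return;
    }
    srhc = SPARSE_RAM_HANDLER_GET_CLASS(srh);
    srhc->register_listener(srh, mr, n);
}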

-- 
Peter Xu




* Re: [PATCH PROTOTYPE 3/6] vfio: Implement support for sparse RAM memory regions
  2020-09-24 16:04 ` [PATCH PROTOTYPE 3/6] vfio: Implement support for sparse RAM memory regions David Hildenbrand
@ 2020-10-20 19:44   ` Peter Xu
  2020-10-20 20:01     ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Xu @ 2020-10-20 19:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Gupta, Michael S. Tsirkin, qemu-devel, Luiz Capitulino,
	Auger Eric, Alex Williamson, Wei Yang, Igor Mammedov,
	Paolo Bonzini, Dr . David Alan Gilbert

On Thu, Sep 24, 2020 at 06:04:20PM +0200, David Hildenbrand wrote:
> Implement support for sparse RAM, to be used by virtio-mem. Handling
> is somewhat-similar to memory_region_is_iommu() handling, which also
> notifies on changes.
> 
> Instead of mapping the whole region, we only map selected pieces (and
> unmap previously selected pieces) when notified by the SparseRAMHandler.

It works with vIOMMU too, right? :) As long as vfio_get_vaddr() works as
expected with virtio-mem plugging new memories, then I think the answer should
be yes.

If it's true, maybe worth mention it somewhere either in the commit message or
in the code comment, because it seems not that obvious.

So if you have plan to do some real IOs in your future tests, may also worth
try with the "-device intel-iommu" and intel_iommu=on in the guest against the
same test.

Thanks,

-- 
Peter Xu




* Re: [PATCH PROTOTYPE 4/6] memory: Extend ram_block_discard_(require|disable) by two discard types
  2020-10-20 19:17   ` Peter Xu
@ 2020-10-20 19:58     ` David Hildenbrand
  2020-10-20 20:49       ` Peter Xu
  0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2020-10-20 19:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: Pankaj Gupta, Michael S. Tsirkin, qemu-devel, Luiz Capitulino,
	Auger Eric, Alex Williamson, Wei Yang, Igor Mammedov,
	Paolo Bonzini, Dr . David Alan Gilbert

On 20.10.20 21:17, Peter Xu wrote:
> On Thu, Sep 24, 2020 at 06:04:21PM +0200, David Hildenbrand wrote:
>> We want to separate the two cases whereby
>> - ballooning drivers do random discards on random guest memory (e.g.,
>>   virtio-balloon) - uncoordinated discards
>> - paravirtualized memory devices do discards in well-known granularity,
>>   and always know which block is currently accessible or inaccessible by
>>   a guest. - coordinated discards
>>
>> This will be required to get virtio_mem + vfio running - vfio still
>> wants to block random memory ballooning.
>>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>> Cc: Alex Williamson <alex.williamson@redhat.com>
>> Cc: Wei Yang <richardw.yang@linux.intel.com>
>> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
>> Cc: Igor Mammedov <imammedo@redhat.com>
>> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
>> Cc: Peter Xu <peterx@redhat.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  exec.c                | 109 ++++++++++++++++++++++++++++++++++--------
>>  include/exec/memory.h |  36 ++++++++++++--
>>  2 files changed, 121 insertions(+), 24 deletions(-)
>>
>> diff --git a/exec.c b/exec.c
>> index e34b602bdf..83098e9230 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -4098,52 +4098,121 @@ void mtree_print_dispatch(AddressSpaceDispatch *d, MemoryRegion *root)
>>   * If positive, discarding RAM is disabled. If negative, discarding RAM is
>>   * required to work and cannot be disabled.
>>   */
>> -static int ram_block_discard_disabled;
>> +static int uncoordinated_discard_disabled;
>> +static int coordinated_discard_disabled;
> 
> Instead of duplicating the code, how about starting to make it an array?
> 
> Btw, iiuc these flags do not need atomic operations at all, because all callers
> should hold the BQL and are mostly called during machine start/reset.  Even if
> not, I think we could also use a mutex; maybe it would make things simpler.  No
> strong opinion, though.
> 

I remember there were some !BQL users (but I might be confusing it with
postcopy code that once used to inhibit the balloon without BQL). Will
double-check. Simplifying it is certainly a good idea.

(we want to be able to check from virtio-balloon code repeatedly without
taking a mutex over and over again :) )

Thanks!

-- 
Thanks,

David / dhildenb




* Re: [PATCH PROTOTYPE 3/6] vfio: Implement support for sparse RAM memory regions
  2020-10-20 19:44   ` Peter Xu
@ 2020-10-20 20:01     ` David Hildenbrand
  2020-10-20 20:44       ` Peter Xu
  0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2020-10-20 20:01 UTC (permalink / raw)
  To: Peter Xu
  Cc: Pankaj Gupta, Michael S. Tsirkin, qemu-devel, Luiz Capitulino,
	Auger Eric, Alex Williamson, Wei Yang, Igor Mammedov,
	Paolo Bonzini, Dr . David Alan Gilbert

On 20.10.20 21:44, Peter Xu wrote:
> On Thu, Sep 24, 2020 at 06:04:20PM +0200, David Hildenbrand wrote:
>> Implement support for sparse RAM, to be used by virtio-mem. Handling
>> is somewhat-similar to memory_region_is_iommu() handling, which also
>> notifies on changes.
>>
>> Instead of mapping the whole region, we only map selected pieces (and
>> unmap previously selected pieces) when notified by the SparseRAMHandler.
> 
> It works with vIOMMU too, right? :) As long as vfio_get_vaddr() works as
> expected with virtio-mem plugging new memories, then I think the answer should
> be yes.
> 

I haven't tried, but I guess as it's simply mapped into
&address_space_memory, it should work just fine.

> If it's true, maybe worth mention it somewhere either in the commit message or
> in the code comment, because it seems not that obvious.

I will test and add, thanks for the hint!

> 
> So if you have plan to do some real IOs in your future tests, may also worth
> try with the "-device intel-iommu" and intel_iommu=on in the guest against the
> same test.

Thanks ... but I have an AMD system. Will try to find out how to get
that running with AMD :)

-- 
Thanks,

David / dhildenb




* Re: [PATCH PROTOTYPE 1/6] memory: Introduce sparse RAM handler for memory regions
  2020-10-20 19:24   ` Peter Xu
@ 2020-10-20 20:13     ` David Hildenbrand
  0 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2020-10-20 20:13 UTC (permalink / raw)
  To: Peter Xu
  Cc: Pankaj Gupta, Michael S. Tsirkin, qemu-devel, Luiz Capitulino,
	Auger Eric, Alex Williamson, Wei Yang, Igor Mammedov,
	Paolo Bonzini, Dr . David Alan Gilbert

On 20.10.20 21:24, Peter Xu wrote:
> On Thu, Sep 24, 2020 at 06:04:18PM +0200, David Hildenbrand wrote:
>> +static inline void memory_region_set_sparse_ram_handler(MemoryRegion *mr,
>> +                                                        SparseRAMHandler *srh)
>> +{
>> +    g_assert(memory_region_is_ram(mr));
> 
> Nit: Maybe assert mr->srh==NULL here?  If sparse ram handler is exclusive,
> which afaiu, is a yes.

Indeed. The owner of the memory region.

> 
>> +    mr->srh = srh;
>> +}
>> +
>> +static inline void memory_region_register_sparse_ram_notifier(MemoryRegion *mr,
>> +                                                           SparseRAMNotifier *n)
>> +{
>> +    SparseRAMHandler *srh = memory_region_get_sparse_ram_handler(mr);
>> +    SparseRAMHandlerClass *srhc = SPARSE_RAM_HANDLER_GET_CLASS(srh);
>> +
>> +    srhc->register_listener(srh, mr, n);
> 
> I feel like you need to check srhc != NULL first, or vfio may start crashing
> without virtio-mem...  Same question for the other ones (at least unregister).

I think nobody should be calling this function unless
memory_region_is_sparse_ram() returns true.

I'm still considering moving it down to the RAMBlock level. Feels more
natural, and would allow other RAMBlock users to figure out what's going
on. If only my list of TODO items weren't that long ... :)

Thanks Peter!

-- 
Thanks,

David / dhildenb




* Re: [PATCH PROTOTYPE 3/6] vfio: Implement support for sparse RAM memory regions
  2020-10-20 20:01     ` David Hildenbrand
@ 2020-10-20 20:44       ` Peter Xu
  2020-11-12 10:11         ` David Hildenbrand
  2020-11-18 13:04         ` David Hildenbrand
  0 siblings, 2 replies; 27+ messages in thread
From: Peter Xu @ 2020-10-20 20:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Gupta, Michael S. Tsirkin, qemu-devel, Luiz Capitulino,
	Auger Eric, Alex Williamson, Wei Yang, Igor Mammedov,
	Paolo Bonzini, Dr . David Alan Gilbert

On Tue, Oct 20, 2020 at 10:01:12PM +0200, David Hildenbrand wrote:
> Thanks ... but I have an AMD system. Will try to find out how to get
> that running with AMD :)

May still start with trying intel-iommu first. :) I think it should work for
amd hosts too.

Just another FYI - Wei is working on amd-iommu for vfio [1], but it's still
during review.

[1] https://lore.kernel.org/qemu-devel/20201002145907.1294353-1-wei.huang2@amd.com/

-- 
Peter Xu




* Re: [PATCH PROTOTYPE 4/6] memory: Extend ram_block_discard_(require|disable) by two discard types
  2020-10-20 19:58     ` David Hildenbrand
@ 2020-10-20 20:49       ` Peter Xu
  2020-10-20 21:30         ` Peter Xu
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Xu @ 2020-10-20 20:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Gupta, Michael S. Tsirkin, qemu-devel, Luiz Capitulino,
	Auger Eric, Alex Williamson, Wei Yang, Igor Mammedov,
	Paolo Bonzini, Dr . David Alan Gilbert

On Tue, Oct 20, 2020 at 09:58:34PM +0200, David Hildenbrand wrote:
> I remember there were some !BQL users (but I might be confusing it with
> postcopy code that once used to inhibit the balloon without BQL). Will
> double-check. Simplifying it is certainly a good idea.
> 
> (we want to be able to check from virtio-balloon code repeatedly without
> taking a mutex over and over again :) )

Right.  Again I've no strong opinion, so feel free to keep the code as you wish.
However, if we'd go the other way (either BQL or another mutex), IMHO we can
simply read that flag directly without taking the mutex.  IMHO the mutex here
only needs to protect write concurrency, and that should be enough. Considering
that this flag should rarely change and should never flip (e.g., positive would
never go negative, and vice versa), a READ_ONCE() would be safe enough on the
read side (e.g., balloon).
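
I.e., roughly something like this (sketch only; qatomic_read() standing in for
READ_ONCE(), and the mutex would of course need to be initialized somewhere):

static QemuMutex discard_lock;              /* serializes writers only */
static int uncoordinated_discard_disabled;  /* > 0 disabled, < 0 required */

int ram_block_discard_disable(bool state)
{
    int ret = 0;

    qemu_mutex_lock(&discard_lock);
    if (state && uncoordinated_discard_disabled < 0) {
        ret = -EBUSY;
    } else {
        uncoordinated_discard_disabled += state ? 1 : -1;
    }
    qemu_mutex_unlock(&discard_lock);
    return ret;
}

bool ram_block_discard_is_disabled(void)
{
    /* lockless reader (e.g., virtio-balloon); the counter never flips sign */
    return qatomic_read(&uncoordinated_discard_disabled) > 0;
}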

Thanks,

-- 
Peter Xu




* Re: [PATCH PROTOTYPE 4/6] memory: Extend ram_block_discard_(require|disable) by two discard types
  2020-10-20 20:49       ` Peter Xu
@ 2020-10-20 21:30         ` Peter Xu
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Xu @ 2020-10-20 21:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Gupta, Michael S. Tsirkin, qemu-devel, Luiz Capitulino,
	Auger Eric, Alex Williamson, Wei Yang, Igor Mammedov,
	Paolo Bonzini, Dr . David Alan Gilbert

On Tue, Oct 20, 2020 at 04:49:29PM -0400, Peter Xu wrote:
> On Tue, Oct 20, 2020 at 09:58:34PM +0200, David Hildenbrand wrote:
> > I remember there were some !BQL users (but I might be confusing it with
> > postcopy code that once used to inhibit the balloon without BQL). Will
> > double-check. Simplifying it is certainly a good idea.
> > 
> > (we want to be able to check from virtio-balloon code repeatedly without
> > taking a mutex over and over again :) )
> 
> Right.  Again I've no strong opinion, so feel free to keep the code as you wish.
> However, if we'd go the other way (either BQL or another mutex), IMHO we can
> simply read that flag directly without taking the mutex.  IMHO the mutex here
> only needs to protect write concurrency, and that should be enough. Considering
> that this flag should rarely change and should never flip (e.g., positive would
> never go negative, and vice versa), a READ_ONCE() would be safe enough on the
> read side (e.g., balloon).

Btw, what we've discussed is all about serialization of the flag.  I'm also
thinking about whether we can make the flag clearer on what it means.  Frankly
speaking, at first glance it confused me quite a bit...

IMHO what we may want is not some complicated counter on "disablement", but
some simple counters on "someone that provides the capability to discard pages"
and "someone that won't work if we _might_ discard pages".  I'm thinking: what
if we split the "disable" counter into two:

  - ram_discard_providers ("Who allows ram discard"): balloon, virtio-mem

  - ram_discard_opposers ("Who forbids ram discard"): vfio, rdma, ...

The major benefit is that the counters should always be >= 0, as a usual
counter would be.  Each device/caller should only/mostly increase the counter.
Also, we wouldn't need the cmpxchg() loops anymore, since the check is easier -
just read the other side of the counters to know whether we should fail now.

So after this patch to introduce "coordinated discards", the counters would also
grow into four (we can still define some arrays):

        |---------------------------+------------+-----------|
        | counters                  | provider   | opposer   |
        |---------------------------+------------+-----------|
        | RAM_DISCARD_COORDINATED   | virtio-mem | rdma, ... |
        | RAM_DISCARD_UNCOORDINATED | balloon    | vfio      |
        |---------------------------+------------+-----------|

Some examples: vfio only needs to make sure there are no UNCOORDINATED
providers.  rdma needs to make sure there are no COORDINATED/UNCOORDINATED
providers.  The check helper could simply be done using a similar ANY bitmask
as introduced in the current patch.
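
As a sketch (only the two counter arrays correspond to what I described above;
the enum and function names here are made up):

typedef enum RamDiscardType {
    RAM_DISCARD_COORDINATED = 0,   /* provider: virtio-mem */
    RAM_DISCARD_UNCOORDINATED,     /* provider: balloon */
    RAM_DISCARD_TYPE_MAX,
} RamDiscardType;

static int ram_discard_providers[RAM_DISCARD_TYPE_MAX];
static int ram_discard_opposers[RAM_DISCARD_TYPE_MAX];

/* a provider (virtio-mem, balloon, ...) registers the one type it implements */
int ram_discard_provider_register(RamDiscardType type)
{
    if (ram_discard_opposers[type]) {
        return -EBUSY;    /* someone (vfio, rdma, ...) forbids this type */
    }
    ram_discard_providers[type]++;
    return 0;
}

/* an opposer (vfio: UNCOORDINATED only; rdma: both) passes a mask of types */
int ram_discard_opposer_register(unsigned int type_mask)
{
    RamDiscardType t;

    for (t = 0; t < RAM_DISCARD_TYPE_MAX; t++) {
        if ((type_mask & (1u << t)) && ram_discard_providers[t]) {
            return -EBUSY;    /* an active provider of this type exists */
        }
    }
    for (t = 0; t < RAM_DISCARD_TYPE_MAX; t++) {
        if (type_mask & (1u << t)) {
            ram_discard_opposers[t]++;
        }
    }
    return 0;
}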

Not sure whether this may help.

Thanks,

-- 
Peter Xu




* Re: [PATCH PROTOTYPE 3/6] vfio: Implement support for sparse RAM memory regions
  2020-10-20 20:44       ` Peter Xu
@ 2020-11-12 10:11         ` David Hildenbrand
  2020-11-18 13:04         ` David Hildenbrand
  1 sibling, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2020-11-12 10:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: Pankaj Gupta, Michael S. Tsirkin, qemu-devel, Luiz Capitulino,
	Auger Eric, Alex Williamson, Wei Yang, Igor Mammedov,
	Paolo Bonzini, Dr . David Alan Gilbert

On 20.10.20 22:44, Peter Xu wrote:
> On Tue, Oct 20, 2020 at 10:01:12PM +0200, David Hildenbrand wrote:
>> Thanks ... but I have an AMD system. Will try to find out how to get
>> that running with AMD :)

I just did some more testing with the oldish GPU I have for that
purpose. Seems to work, at least I get video output that keeps
on working - did not try advanced things yet.

I use
-device vfio-pci,host=05:00.0,x-vga=on
-device vfio-pci,host=05:00.1

when adding "-device intel-iommu", I got

"qemu-system-x86_64: -device vfio-pci,host=05:00.1: vfio 0000:05:00.1: group 23 used in multiple address spaces"

... so I poked around the internet a bit and got it running with

-device intel-iommu,caching-mode=on \
-device pcie-pci-bridge,addr=1e.0,id=pci.1 \
-device vfio-pci,host=05:00.0,x-vga=on,bus=pci.1,addr=1.0,multifunction=on \
-device vfio-pci,host=05:00.1,bus=pci.1,addr=1.1 \

Things still seem to be working, so I assume it works
(I guess ?!).

-- 
Thanks,

David / dhildenb




* Re: [PATCH PROTOTYPE 3/6] vfio: Implement support for sparse RAM memory regions
  2020-10-20 20:44       ` Peter Xu
  2020-11-12 10:11         ` David Hildenbrand
@ 2020-11-18 13:04         ` David Hildenbrand
  2020-11-18 15:23           ` Peter Xu
  1 sibling, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2020-11-18 13:04 UTC (permalink / raw)
  To: Peter Xu
  Cc: Le Tan, Pankaj Gupta, Michael S. Tsirkin, wei.huang2, qemu-devel,
	Luiz Capitulino, Auger Eric, Alex Williamson, Wei Yang,
	Igor Mammedov, Paolo Bonzini, Dr . David Alan Gilbert

On 20.10.20 22:44, Peter Xu wrote:
> On Tue, Oct 20, 2020 at 10:01:12PM +0200, David Hildenbrand wrote:
>> Thanks ... but I have an AMD system. Will try to find out how to get
>> that running with AMD :)
> 
> May still start with trying intel-iommu first. :) I think it should work for
> amd hosts too.
> 
> Just another FYI - Wei is working on amd-iommu for vfio [1], but it's still
> during review.
> 
> [1] https://lore.kernel.org/qemu-devel/20201002145907.1294353-1-wei.huang2@amd.com/
> 

I'm trying to get an iommu setup running (without virtio-mem!),
but it's a big mess.

Essential parts of my QEMU cmdline are:

sudo build/qemu-system-x86_64 \
    -accel kvm,kernel-irqchip=split \
    ...
    -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
    -device vfio-pci,host=0c:00.0,x-vga=on,bus=pci.1,addr=1.0,multifunction=on \
    -device vfio-pci,host=0c:00.1,bus=pci.1,addr=1.1 \
    -device intel-iommu,caching-mode=on,intremap=on \

I am running upstream QEMU + Linux -next kernel inside the
guest on an AMD Ryzen 9 3900X 12-Core Processor.
I am using SeaBios.

I tried faking an Intel CPU without luck.
("-cpu Skylake-Client,kvm=off,vendor=GenuineIntel")

As soon as I enable "intel_iommu=on" in my guest kernel, graphics
stop working (random mess on graphics output) and I get
  vfio-pci 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0023 address=0xff924000 flags=0x0000]
in the hypervisor, along with other nice messages.

I can spot no vfio DMA mappings coming from an iommu, just as if the
guest wouldn't even try to setup the iommu.

I tried with
1. AMD Radeon RX Vega 56
2. Nvidia GT220
resulting in similar issues.

I also tried with "-device amd-iommu", with other issues
(guest won't even boot up). Are my graphics cards missing some support, or
is there a fundamental flaw in my setup?

Any clues appreciated.


-- 
Thanks,

David / dhildenb




* Re: [PATCH PROTOTYPE 3/6] vfio: Implement support for sparse RAM memory regions
  2020-11-18 13:04         ` David Hildenbrand
@ 2020-11-18 15:23           ` Peter Xu
  2020-11-18 16:14             ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Xu @ 2020-11-18 15:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Le Tan, Pankaj Gupta, Michael S. Tsirkin, wei.huang2, qemu-devel,
	Luiz Capitulino, Auger Eric, Alex Williamson, Wei Yang,
	Igor Mammedov, Paolo Bonzini, Dr . David Alan Gilbert

David,

On Wed, Nov 18, 2020 at 02:04:00PM +0100, David Hildenbrand wrote:
> On 20.10.20 22:44, Peter Xu wrote:
> > On Tue, Oct 20, 2020 at 10:01:12PM +0200, David Hildenbrand wrote:
> >> Thanks ... but I have an AMD system. Will try to find out how to get
> >> that running with AMD :)
> > 
> > May still start with trying intel-iommu first. :) I think it should work for
> > amd hosts too.
> > 
> > Just another FYI - Wei is working on amd-iommu for vfio [1], but it's still
> > during review.
> > 
> > [1] https://lore.kernel.org/qemu-devel/20201002145907.1294353-1-wei.huang2@amd.com/
> > 
> 
> I'm trying to get an iommu setup running (without virtio-mem!),
> but it's a big mess.
> 
> Essential parts of my QEMU cmdline are:
> 
> sudo build/qemu-system-x86_64 \
>     -accel kvm,kernel-irqchip=split \
>     ...
>     -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
>     -device vfio-pci,host=0c:00.0,x-vga=on,bus=pci.1,addr=1.0,multifunction=on \
>     -device vfio-pci,host=0c:00.1,bus=pci.1,addr=1.1 \
>     -device intel-iommu,caching-mode=on,intremap=on \

The intel-iommu device needs to be created before the rest of the devices.  I
forgot the reason behind it; it should be related to how the device address
spaces are created.  This rule should apply to all the other vIOMMUs, afaiu.

Libvirt guarantees that ordering when VT-d is enabled, though when using the
qemu cmdline that's indeed hard to spot at first glance... iirc we tried to
fix this, but I forgot the details; it's just not trivial.

I noticed that this ordering constraint was also missing in the qemu wiki page
for VT-d, so I updated it there too, hopefully...

https://wiki.qemu.org/Features/VT-d#Command_Line_Example
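
I.e., for the command line above, that would be (only the ordering of the
-device options changes, everything else stays the same):

sudo build/qemu-system-x86_64 \
    -accel kvm,kernel-irqchip=split \
    -device intel-iommu,caching-mode=on,intremap=on \
    ...
    -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
    -device vfio-pci,host=0c:00.0,x-vga=on,bus=pci.1,addr=1.0,multifunction=on \
    -device vfio-pci,host=0c:00.1,bus=pci.1,addr=1.1 \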

> 
> I am running upstream QEMU + Linux -next kernel inside the
> guest on an AMD Ryzen 9 3900X 12-Core Processor.
> I am using SeaBios.
> 
> I tried faking an Intel CPU without luck.
> ("-cpu Skylake-Client,kvm=off,vendor=GenuineIntel")
> 
> As soon as I enable "intel_iommu=on" in my guest kernel, graphics
> stop working (random mess on graphics output) and I get
>   vfio-pci 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0023 address=0xff924000 flags=0x0000]
> in the hypervisor, along with other nice messages.
> 
> I can spot no vfio DMA mappings coming from an iommu, just as if the
> guest wouldn't even try to setup the iommu.
> 
> I tried with
> 1. AMD Radeon RX Vega 56
> 2. Nvidia GT220
> resulting in similar issues.
> 
> I also tried with "-device amd-iommu" with other issues
> (guest won't even boot up). Are my graphics card missing some support or
> is there a fundamental flaw in my setup?

I guess amd-iommu won't work without Wei Huang's series applied.

> 
> Any clues appreciated.

Please try with above and see whether it works.  Thanks,

-- 
Peter Xu




* Re: [PATCH PROTOTYPE 3/6] vfio: Implement support for sparse RAM memory regions
  2020-11-18 15:23           ` Peter Xu
@ 2020-11-18 16:14             ` David Hildenbrand
  2020-11-18 17:01               ` Peter Xu
  0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2020-11-18 16:14 UTC (permalink / raw)
  To: Peter Xu
  Cc: Le Tan, Pankaj Gupta, Michael S. Tsirkin, wei.huang2, qemu-devel,
	Luiz Capitulino, Auger Eric, Alex Williamson, Wei Yang,
	Igor Mammedov, Paolo Bonzini, Dr . David Alan Gilbert

On 18.11.20 16:23, Peter Xu wrote:
> David,
> 
> On Wed, Nov 18, 2020 at 02:04:00PM +0100, David Hildenbrand wrote:
>> On 20.10.20 22:44, Peter Xu wrote:
>>> On Tue, Oct 20, 2020 at 10:01:12PM +0200, David Hildenbrand wrote:
>>>> Thanks ... but I have an AMD system. Will try to find out how to get
>>>> that running with AMD :)
>>>
>>> May still start with trying intel-iommu first. :) I think it should work for
>>> amd hosts too.
>>>
>>> Just another FYI - Wei is working on amd-iommu for vfio [1], but it's still
>>> during review.
>>>
>>> [1] https://lore.kernel.org/qemu-devel/20201002145907.1294353-1-wei.huang2@amd.com/
>>>
>>
>> I'm trying to get an iommu setup running (without virtio-mem!),
>> but it's a big mess.
>>
>> Essential parts of my QEMU cmdline are:
>>
>> sudo build/qemu-system-x86_64 \
>>      -accel kvm,kernel-irqchip=split \
>>      ...
>>       device pcie-pci-bridge,addr=1e.0,id=pci.1 \
>>      -device vfio-pci,host=0c:00.0,x-vga=on,bus=pci.1,addr=1.0,multifunction=on \
>>      -device vfio-pci,host=0c:00.1,bus=pci.1,addr=1.1 \
>>      -device intel-iommu,caching-mode=on,intremap=on \
> 
> The intel-iommu device needs to be created before the rest of the devices.  I
> forgot the reason behind it; it should be related to how the device address
> spaces are created.  This rule should apply to all the other vIOMMUs, afaiu.
> 
> Libvirt guarantees that ordering when VT-d is enabled, though when using the
> qemu cmdline that's indeed hard to spot at first glance... iirc we tried to
> fix this, but I forgot the details; it's just not trivial.
> 
> I noticed that this ordering constraint was also missing in the qemu wiki page
> for VT-d, so I updated it there too, hopefully...
> 
> https://wiki.qemu.org/Features/VT-d#Command_Line_Example
> 

That did the trick! Thanks!!!

virtio-mem + vfio + iommu seems to work. More testing to be done.

However, malicious guests can play nasty tricks like

a) Unplugging plugged virtio-mem blocks while they are mapped via an
    IOMMU

1. Guest: map memory location X located on a virtio-mem device inside a
    plugged block into the IOMMU
    -> QEMU IOMMU notifier: create vfio DMA mapping
    -> VFIO pins memory of unplugged blocks (populating memory)
2. Guest: Request to unplug memory location X via virtio-mem device
    -> QEMU virtio-mem: discards the memory.
    -> VFIO still has the memory pinned

We consume more memory than intended. In case the virtio-mem memory would get 
replugged and used, we would have an inconsistency. An IOMMU device reset fixes 
it (whereby all VFIO mappings are removed via the IOMMU notifier).


b) Mapping unplugged virtio-mem blocks via an IOMMU

1. Guest: map memory location X located on a virtio-mem device inside an
    unplugged block
    -> QEMU IOMMU notifier: create vfio DMA mapping
    -> VFIO pins memory of unplugged blocks (populating memory)

Memory that's supposed to be discarded now consumes memory. This is 
similar to a malicious guest simply writing to unplugged memory blocks 
(to be tackled with "protection of unplugged memory" in the future) - 
however memory will also get pinned.


To prohibit b) from happening, we would have to disallow creating the 
VFIO mapping (fairly easy).

To prohibit a), there would have to be some notification to IOMMU 
implementations to unmap/refresh whenever an IOMMU entry still points at 
memory that is getting discarded (and the VM is doing something it's not 
supposed to do).


>> As soon as I enable "intel_iommu=on" in my guest kernel, graphics
>> stop working (random mess on graphics output) and I get
>>    vfio-pci 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0023 address=0xff924000 flags=0x0000]
>> in the hypervisor, along with other nice messages.
>>
>> I can spot no vfio DMA mappings coming from an iommu, just as if the
>> guest wouldn't even try to setup the iommu.
>>
>> I tried with
>> 1. AMD Radeon RX Vega 56
>> 2. Nvidia GT220
>> resulting in similar issues.
>>
>> I also tried with "-device amd-iommu" with other issues
>> (guest won't even boot up). Are my graphics card missing some support or
>> is there a fundamental flaw in my setup?
> 
> I guess amd-iommu won't work if without Wei Huang's series applied.

Oh, okay - I spotted it in QEMU and thought this was already working :)

-- 
Thanks,

David / dhildenb




* Re: [PATCH PROTOTYPE 3/6] vfio: Implement support for sparse RAM memory regions
  2020-11-18 16:14             ` David Hildenbrand
@ 2020-11-18 17:01               ` Peter Xu
  2020-11-18 17:37                 ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Xu @ 2020-11-18 17:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Le Tan, Pankaj Gupta, Michael S. Tsirkin, wei.huang2, qemu-devel,
	Luiz Capitulino, Auger Eric, Alex Williamson, Wei Yang,
	Igor Mammedov, Paolo Bonzini, Dr . David Alan Gilbert

On Wed, Nov 18, 2020 at 05:14:22PM +0100, David Hildenbrand wrote:
> That did the trick! Thanks!!!

Great!  In the meantime, I've a few questions below, mostly about memory
unplugging, which could be naive - I know little about that, so please bear
with me... :)

> 
> virtio-mem + vfio + iommu seems to work. More testing to be done.
> 
> However, malicious guests can play nasty tricks like
> 
> a) Unplugging plugged virtio-mem blocks while they are mapped via an
>    IOMMU
> 
> 1. Guest: map memory location X located on a virtio-mem device inside a
>    plugged block into the IOMMU
>    -> QEMU IOMMU notifier: create vfio DMA mapping
>    -> VFIO pins memory of unplugged blocks (populating memory)
> 2. Guest: Request to unplug memory location X via virtio-mem device
>    -> QEMU virtio-mem: discards the memory.
>    -> VFIO still has the memory pinned

When unplugging some memory, does the user need to first do something to notify
the guest kernel that "this memory is going to be unplugged soon" (assuming an
echo "offline" to some dev file)?  Then the kernel should be responsible for
preparing for that before it really happens, e.g., migrating anonymous pages out
of this memory block.  I don't know what would happen if some pages on the
memblock were used for DMA like this and we want to unplug it.  Ideally I'd
expect it to fail the "echo offline" operation with something like EBUSY if it
can't notify the device driver about this, or if that's hard to do.

IMHO this question is not really related to vIOMMU, but is a general question
about unplugging. Say, what would happen if we unplug some memory with DMA
buffers without vIOMMU at all?  The buffer will be invalid right after
unplugging, so the guest kernel should either fail the operation trying to
unplug, or at least tell the device drivers about this somehow?

> 
> We consume more memory than intended. In case virtio-memory would get
> replugged and used, we would have an inconsistency. IOMMU device resets/ fix
> it (whereby all VFIO mappings are removed via the IOMMU notifier).
> 
> 
> b) Mapping unplugged virtio-mem blocks via an IOMMU
> 
> 1. Guest: map memory location X located on a virtio-mem device inside an
>    unplugged block
>    -> QEMU IOMMU notifier: create vfio DMA mapping
>    -> VFIO pins memory of unplugged blocks (populating memory)

For this case, I would expect vfio_get_xlat_addr() to fail directly if the
guest driver forces a mapping of some IOVA onto an invalid range of the
virtio-mem device.  Even before that - since the guest should know that this
region of virtio-mem is not valid, as it is unplugged - shouldn't the guest
kernel directly fail the dma_map() on such a region even before the mapping
request reaches QEMU?

Thanks,

> 
> Memory that's supposed to be discarded now consumes memory. This is similar
> to a malicious guest simply writing to unplugged memory blocks (to be
> tackled with "protection of unplugged memory" in the future) - however
> memory will also get pinned.
> 
> 
> To prohibit b) from happening, we would have to disallow creating the VFIO
> mapping (fairly easy).
> 
> To prohibit a), there would have to be some notification to IOMMU
> implementations to unmap/refresh whenever an IOMMU entry still points at
> memory that is getting discarded (and the VM is doing something it's not
> supposed to do).

-- 
Peter Xu




* Re: [PATCH PROTOTYPE 3/6] vfio: Implement support for sparse RAM memory regions
  2020-11-18 17:01               ` Peter Xu
@ 2020-11-18 17:37                 ` David Hildenbrand
  2020-11-18 19:05                   ` Peter Xu
  0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2020-11-18 17:37 UTC (permalink / raw)
  To: Peter Xu
  Cc: Le Tan, Pankaj Gupta, Michael S. Tsirkin, wei.huang2, qemu-devel,
	Luiz Capitulino, Auger Eric, Alex Williamson, Wei Yang,
	Igor Mammedov, Paolo Bonzini, Dr . David Alan Gilbert

>> virtio-mem + vfio + iommu seems to work. More testing to be done.
>>
>> However, malicious guests can play nasty tricks like
>>
>> a) Unplugging plugged virtio-mem blocks while they are mapped via an
>>     IOMMU
>>
>> 1. Guest: map memory location X located on a virtio-mem device inside a
>>     plugged block into the IOMMU
>>     -> QEMU IOMMU notifier: create vfio DMA mapping
>>     -> VFIO pins memory of unplugged blocks (populating memory)
>> 2. Guest: Request to unplug memory location X via virtio-mem device
>>     -> QEMU virtio-mem: discards the memory.
>>     -> VFIO still has the memory pinned
> 
> When unplug some memory, does the user need to first do something to notify the
> guest kernel that "this memory is going to be unplugged soon" (assuming echo
> "offline" to some dev file)?  Then the kernel should be responsible to prepare
> for that before it really happens, e.g., migrate anonymous pages out from this
> memory block.  I don't know what would happen if some pages on the memblock
> were used for DMA like this and we want to unplug it.  Ideally I thought it
> should fail the "echo offline" operation with something like EBUSY if it can't
> notify the device driver about this, or it's hard to.

In the very simple case (without resizable RAMBlocks / allocations / memory 
regions) as implemented right now, a virtio-mem device really just 
consists of a static RAM memory region that's mapped into guest physical 
address space. The size of that region corresponds to the "maximum" size 
a virtio-mem device can provide.

How much memory the VM should consume via such a device is expressed via 
a "requested size". So, for the hypervisor to request that the VM consume 
less/more memory, it adjusts the "requested size".

It is up to the device driver in the guest to plug/unplug memory blocks 
(e.g., 4 MiB granularity), in order to reach the requested size. The 
device driver selects memory blocks within the device-assigned memory 
region and requests the hypervisor to (un)plug them - think of it as 
something similar (but different) to memory ballooning.

When requested to unplug memory by the hypervisor, the device driver in 
Linux will try to find memory blocks (e.g., 4 MiB) within the 
device-assigned memory region it can free up. This involves migrating 
pages away etc. Once that succeeded - nobody in the guest is using that 
memory anymore; the guest requests the hypervisor to unplug that block, 
resulting in QEMU discarding the memory. The guest agreed to not touch 
that memory anymore before officially requesting to "plug" it via the 
virtio-mem device.

There is no further action inside the guest required. A sane guest will 
never request to unplug memory that is still in use (similar to memory 
ballooning, where we don't inflate memory that is still in use).

But of course, a malicious guest could try doing that just to cause 
trouble.

> 
> IMHO this question not really related to vIOMMU, but a general question for
> unplugging. Say, what would happen if we unplug some memory with DMA buffers
> without vIOMMU at all?  The buffer will be invalid right after unplugging, so
> the guest kernel should either fail the operation trying to unplug, or at least
> tell the device drivers about this somehow?

A sane guest will never do that. The way memory is removed from Linux 
makes sure that there are no remaining users, otherwise it would be a BUG.

> 
>>
>> We consume more memory than intended. In case virtio-memory would get
>> replugged and used, we would have an inconsistency. IOMMU device resets/ fix
>> it (whereby all VFIO mappings are removed via the IOMMU notifier).
>>
>>
>> b) Mapping unplugged virtio-mem blocks via an IOMMU
>>
>> 1. Guest: map memory location X located on a virtio-mem device inside an
>>     unplugged block
>>     -> QEMU IOMMU notifier: create vfio DMA mapping
>>     -> VFIO pins memory of unplugged blocks (populating memory)
> 
> For this case, I would expect vfio_get_xlat_addr() to fail directly if the
> guest driver force to map some IOVA onto an invalid range of the virtio-mem
> device.  Even before that, since the guest should know that this region of
> virtio-mem is not valid since unplugged, so shouldn't the guest kernel directly
> fail the dma_map() upon such a region even before the mapping message reaching
> QEMU?

Again, sane guests will never do that, for the very reason you mentioned 
"the guest should know that this region of virtio-mem is not valid since 
unplugged,". But a malicious guest could try doing that to cause trouble :)

The memory region managed by a virtio-mem device is always fully mapped 
into the system address space: one reason being, that fragmenting it in 
2 MiB granularity or similar would not be feasible (e.g., KVM memory 
slots limit, migration, ...), but there are other reasons. (Again, 
similar to how memory ballooning works).

vfio_get_xlat_addr() only checks whether that mapping exists. It would be 
easy to ask the virtio-mem device (similar to what is done in this prototype) 
whether that part of the identified memory region may be mapped by VFIO right now.
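
Something like this (sketch only; the is_plugged() callback is made up here
just to illustrate the idea):

/* to be called from vfio_get_xlat_addr() before creating the DMA mapping */
static bool vfio_sparse_ram_section_is_plugged(MemoryRegion *mr, hwaddr offset,
                                               hwaddr size)
{
    SparseRAMHandler *srh;
    SparseRAMHandlerClass *srhc;

    if (!memory_region_is_sparse_ram(mr)) {
        return true;    /* ordinary RAM, nothing to check */
    }
    srh = memory_region_get_sparse_ram_handler(mr);
    srhc = SPARSE_RAM_HANDLER_GET_CLASS(srh);
    /* made-up hook: is [offset, offset + size) currently plugged? */
    return srhc->is_plugged(srh, mr, offset, size);
}

If that returns false, vfio_get_xlat_addr() would simply refuse the
translation instead of ending up pinning discarded memory.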

-- 
Thanks,

David / dhildenb




* Re: [PATCH PROTOTYPE 3/6] vfio: Implement support for sparse RAM memory regions
  2020-11-18 17:37                 ` David Hildenbrand
@ 2020-11-18 19:05                   ` Peter Xu
  2020-11-18 19:20                     ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Xu @ 2020-11-18 19:05 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Le Tan, Pankaj Gupta, Michael S. Tsirkin, wei.huang2, qemu-devel,
	Luiz Capitulino, Auger Eric, Alex Williamson, Wei Yang,
	Igor Mammedov, Paolo Bonzini, Dr . David Alan Gilbert

On Wed, Nov 18, 2020 at 06:37:42PM +0100, David Hildenbrand wrote:
> > > a) Unplugging plugged virtio-mem blocks while they are mapped via an
> > >     IOMMU
> > > 
> > > 1. Guest: map memory location X located on a virtio-mem device inside a
> > >     plugged block into the IOMMU
> > >     -> QEMU IOMMU notifier: create vfio DMA mapping
> > >     -> VFIO pins memory of unplugged blocks (populating memory)
> > > 2. Guest: Request to unplug memory location X via virtio-mem device
> > >     -> QEMU virtio-mem: discards the memory.
> > >     -> VFIO still has the memory pinned

[...]

> > > b) Mapping unplugged virtio-mem blocks via an IOMMU
> > > 
> > > 1. Guest: map memory location X located on a virtio-mem device inside an
> > >     unplugged block
> > >     -> QEMU IOMMU notifier: create vfio DMA mapping
> > >     -> VFIO pins memory of unplugged blocks (populating memory)

[...]

> Again, sane guests will never do that, for the very reason you mentioned
> "the guest should know that this region of virtio-mem is not valid since
> unplugged,". But a malicious guest could try doing that to cause trouble :)

Oh I think I see your point now. :) And thanks for the write-up about how
virtio-mem works.

So it's about the malicious guests.

I agree with you that we can try to prevent the above from happening, e.g. by
teaching vfio_get_xlat_addr() to fail when it's going to map some unplugged
range of a virtio-mem device.

Or... imho, we may not even need to worry too much about those misuses of
virtio-mem? As long as the issue is self-contained within the buggy VM/process.
E.g., the worst case of such a malicious guest is to fiddle around the maximum
allowed memory size granted to the virtio-mem device to either have pages
incorrectly pinned, or some strange IOVA mapped to unplugged pages within that
range.  As long as it won't affect other VMs and the host, and qemu won't crash
with that, then it seems ok.

-- 
Peter Xu




* Re: [PATCH PROTOTYPE 3/6] vfio: Implement support for sparse RAM memory regions
  2020-11-18 19:05                   ` Peter Xu
@ 2020-11-18 19:20                     ` David Hildenbrand
  0 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2020-11-18 19:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: Le Tan, Pankaj Gupta, Michael S. Tsirkin, wei.huang2, qemu-devel,
	Luiz Capitulino, Auger Eric, Alex Williamson, Wei Yang,
	Igor Mammedov, Paolo Bonzini, Dr . David Alan Gilbert

> So it's about the malicious guests.
> 
> I agree with you that we can try to limit above from happening, e.g. by
> teaching vfio_get_xlat_addr() to fail when it's going to map some unplugged
> range of virtio-mem device.

Exactly.

> 
> Or... imho, we may not even need to worry too much on those misuses of
> virtio-mem? As long as the issue is self-contained within the buggy VM/process.
> E.g., the worst case of such a malicious guest is to fiddle around the maximum
> allowed memory size granted to the virtio-mem device to either have pages
> incorrectly pinned, or some strange IOVA mapped to unplugged pages within that
> range.  As long as it won't affect other VMs and the host, and qemu won't crash
> with that, then it seems ok.

Indeed, I have the same thoughts.

The only "ugly" thing is that our VM might not only consume more memory 
than intended, but also pin that memory (unmovable, unswappable ...). So 
I'm thinking about at least doing a warn_report_once() until we have 
some approach to handle that - for environments that might care about this.

Thanks for looking into this!

-- 
Thanks,

David / dhildenb



