[RFC PATCH 0/2] accel/kvm: extend kvm memory listener to support

kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH 0/2] accel/kvm: extend kvm memory listener to support
@ 2022-08-16 10:12 Emanuele Giuseppe Esposito
  2022-08-16 10:12 ` [RFC PATCH 1/2] softmmu/memory: add missing begin/commit callback calls Emanuele Giuseppe Esposito
  2022-08-16 10:12 ` [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase Emanuele Giuseppe Esposito
  0 siblings, 2 replies; 23+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-08-16 10:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Michael S. Tsirkin, Cornelia Huck, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Maxim Levitsky, kvm, Emanuele Giuseppe Esposito

The aim of this serie is to prepare kvm memory listener to support atomic
memslots update. In order to do that, QEMU should take care of sending all
memslot updates in a single ioctl, so that they can all be processed
atomically.

In order to do that, implement kml->begin() and kml->commit() callbacks, and
change the logic by replacing every ioctl invocation in ->region_* and ->log_*
so that the struct kvm_userspace_memory_region are queued in a linked list that
is then traversed and processed in ->commit.

Patch 1 ensures that ->region_* and ->log_* are always wrapped by ->begin and
->commit.

Emanuele Giuseppe Esposito (2):
  softmmu/memory: add missing begin/commit callback calls
  kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase

 accel/kvm/kvm-all.c       | 99 ++++++++++++++++++++++++++++-----------
 include/sysemu/kvm_int.h  |  6 +++
 linux-headers/linux/kvm.h |  9 ++++
 softmmu/memory.c          |  2 +
 4 files changed, 89 insertions(+), 27 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH 1/2] softmmu/memory: add missing begin/commit callback calls
  2022-08-16 10:12 [RFC PATCH 0/2] accel/kvm: extend kvm memory listener to support Emanuele Giuseppe Esposito
@ 2022-08-16 10:12 ` Emanuele Giuseppe Esposito
  2022-08-18 19:34   ` Peter Xu
  2022-08-16 10:12 ` [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase Emanuele Giuseppe Esposito
  1 sibling, 1 reply; 23+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-08-16 10:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Michael S. Tsirkin, Cornelia Huck, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Maxim Levitsky, kvm, Emanuele Giuseppe Esposito

kvm listeners now need ->commit callback in order to actually send
the ioctl to the hypervisor. Therefore, add missing callers around
address_space_set_flatview(), which in turn calls
address_space_update_topology_pass() which calls ->region_* and
->log_* callbacks.

Using MEMORY_LISTENER_CALL_GLOBAL is a little bit an overkill,
but it is harmless, considering that other listeners that are not
invoked in address_space_update_topology_pass() won't do anything,
since they won't have anything to commit.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 softmmu/memory.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/softmmu/memory.c b/softmmu/memory.c
index 7ba2048836..1afd3f9703 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1076,7 +1076,9 @@ static void address_space_update_topology(AddressSpace *as)
     if (!g_hash_table_lookup(flat_views, physmr)) {
         generate_memory_topology(physmr);
     }
+    MEMORY_LISTENER_CALL_GLOBAL(begin, Forward);
     address_space_set_flatview(as);
+    MEMORY_LISTENER_CALL_GLOBAL(commit, Forward);
 }
 
 void memory_region_transaction_begin(void)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase
  2022-08-16 10:12 [RFC PATCH 0/2] accel/kvm: extend kvm memory listener to support Emanuele Giuseppe Esposito
  2022-08-16 10:12 ` [RFC PATCH 1/2] softmmu/memory: add missing begin/commit callback calls Emanuele Giuseppe Esposito
@ 2022-08-16 10:12 ` Emanuele Giuseppe Esposito
  2022-08-18 20:04   ` Peter Xu
                     ` (2 more replies)
  1 sibling, 3 replies; 23+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-08-16 10:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Michael S. Tsirkin, Cornelia Huck, Peter Xu,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Maxim Levitsky, kvm, Emanuele Giuseppe Esposito

Instead of sending a single ioctl every time ->region_* or ->log_*
callbacks are called, "queue" all memory regions in a list that will
be emptied only when committing.

This allow the KVM kernel API to be extended and support multiple
memslots updates in a single call.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 accel/kvm/kvm-all.c       | 99 ++++++++++++++++++++++++++++-----------
 include/sysemu/kvm_int.h  |  6 +++
 linux-headers/linux/kvm.h |  9 ++++
 3 files changed, 87 insertions(+), 27 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 645f0a249a..3afa46b2ef 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -357,39 +357,40 @@ int kvm_physical_memory_addr_from_host(KVMState *s, void *ram,
     return ret;
 }
 
+static void kvm_memory_region_node_add(KVMMemoryListener *kml,
+                                       struct kvm_userspace_memory_region *mem)
+{
+    MemoryRegionNode *node;
+
+    node = g_malloc(sizeof(MemoryRegionNode));
+    *node = (MemoryRegionNode) {
+        .mem = mem,
+    };
+    QTAILQ_INSERT_TAIL(&kml->mem_list, node, list);
+}
+
 static int kvm_set_user_memory_region(KVMMemoryListener *kml, KVMSlot *slot, bool new)
 {
-    KVMState *s = kvm_state;
-    struct kvm_userspace_memory_region mem;
-    int ret;
+    struct kvm_userspace_memory_region *mem;
 
-    mem.slot = slot->slot | (kml->as_id << 16);
-    mem.guest_phys_addr = slot->start_addr;
-    mem.userspace_addr = (unsigned long)slot->ram;
-    mem.flags = slot->flags;
+    mem = g_malloc(sizeof(struct kvm_userspace_memory_region));
 
-    if (slot->memory_size && !new && (mem.flags ^ slot->old_flags) & KVM_MEM_READONLY) {
+    mem->slot = slot->slot | (kml->as_id << 16);
+    mem->guest_phys_addr = slot->start_addr;
+    mem->userspace_addr = (unsigned long)slot->ram;
+    mem->flags = slot->flags;
+
+    if (slot->memory_size && !new && (mem->flags ^ slot->old_flags) &
+        KVM_MEM_READONLY) {
         /* Set the slot size to 0 before setting the slot to the desired
          * value. This is needed based on KVM commit 75d61fbc. */
-        mem.memory_size = 0;
-        ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
-        if (ret < 0) {
-            goto err;
-        }
-    }
-    mem.memory_size = slot->memory_size;
-    ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
-    slot->old_flags = mem.flags;
-err:
-    trace_kvm_set_user_memory(mem.slot, mem.flags, mem.guest_phys_addr,
-                              mem.memory_size, mem.userspace_addr, ret);
-    if (ret < 0) {
-        error_report("%s: KVM_SET_USER_MEMORY_REGION failed, slot=%d,"
-                     " start=0x%" PRIx64 ", size=0x%" PRIx64 ": %s",
-                     __func__, mem.slot, slot->start_addr,
-                     (uint64_t)mem.memory_size, strerror(errno));
+        mem->memory_size = 0;
+        kvm_memory_region_node_add(kml, mem);
     }
-    return ret;
+    mem->memory_size = slot->memory_size;
+    kvm_memory_region_node_add(kml, mem);
+    slot->old_flags = mem->flags;
+    return 0;
 }
 
 static int do_kvm_destroy_vcpu(CPUState *cpu)
@@ -1517,12 +1518,52 @@ static void kvm_region_add(MemoryListener *listener,
 static void kvm_region_del(MemoryListener *listener,
                            MemoryRegionSection *section)
 {
-    KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
+    KVMMemoryListener *kml = container_of(listener, KVMMemoryListener,
+                                          listener);
 
     kvm_set_phys_mem(kml, section, false);
     memory_region_unref(section->mr);
 }
 
+static void kvm_begin(MemoryListener *listener)
+{
+    KVMMemoryListener *kml = container_of(listener, KVMMemoryListener,
+                                          listener);
+    assert(QTAILQ_EMPTY(&kml->mem_list));
+}
+
+static void kvm_commit(MemoryListener *listener)
+{
+    KVMMemoryListener *kml = container_of(listener, KVMMemoryListener,
+                                          listener);
+    MemoryRegionNode *node, *next;
+    KVMState *s = kvm_state;
+
+    QTAILQ_FOREACH_SAFE(node, &kml->mem_list, list, next) {
+        struct kvm_userspace_memory_region *mem = node->mem;
+        int ret;
+
+        ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, mem);
+
+        trace_kvm_set_user_memory(mem->slot, mem->flags, mem->guest_phys_addr,
+                                  mem->memory_size, mem->userspace_addr, 0);
+
+        if (ret < 0) {
+            error_report("%s: KVM_SET_USER_MEMORY_REGION failed, slot=%d,"
+                         " start=0x%" PRIx64 ": %s",
+                         __func__, mem->slot,
+                         (uint64_t)mem->memory_size, strerror(errno));
+        }
+
+        QTAILQ_REMOVE(&kml->mem_list, node, list);
+        g_free(mem);
+        g_free(node);
+    }
+
+
+
+}
+
 static void kvm_log_sync(MemoryListener *listener,
                          MemoryRegionSection *section)
 {
@@ -1664,8 +1705,12 @@ void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
         kml->slots[i].slot = i;
     }
 
+    QTAILQ_INIT(&kml->mem_list);
+
     kml->listener.region_add = kvm_region_add;
     kml->listener.region_del = kvm_region_del;
+    kml->listener.begin = kvm_begin;
+    kml->listener.commit = kvm_commit;
     kml->listener.log_start = kvm_log_start;
     kml->listener.log_stop = kvm_log_stop;
     kml->listener.priority = 10;
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index 1f5487d9b7..eab8598007 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -30,9 +30,15 @@ typedef struct KVMSlot
     ram_addr_t ram_start_offset;
 } KVMSlot;
 
+typedef struct MemoryRegionNode {
+    struct kvm_userspace_memory_region *mem;
+    QTAILQ_ENTRY(MemoryRegionNode) list;
+} MemoryRegionNode;
+
 typedef struct KVMMemoryListener {
     MemoryListener listener;
     KVMSlot *slots;
+    QTAILQ_HEAD(, MemoryRegionNode) mem_list;
     int as_id;
 } KVMMemoryListener;
 
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index f089349149..f215efdaa8 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -103,6 +103,13 @@ struct kvm_userspace_memory_region {
 	__u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
+/* for KVM_SET_USER_MEMORY_REGION_LIST */
+struct kvm_userspace_memory_region_list {
+	__u32 nent;
+	__u32 flags;
+	struct kvm_userspace_memory_region entries[0];
+};
+
 /*
  * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
  * other bits are reserved for kvm internal use which are defined in
@@ -1426,6 +1433,8 @@ struct kvm_vfio_spapr_tce {
 					struct kvm_userspace_memory_region)
 #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
 #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
+#define KVM_SET_USER_MEMORY_REGION_LIST _IOW(KVMIO, 0x49, \
+					struct kvm_userspace_memory_region_list)
 
 /* enable ucontrol for s390 */
 struct kvm_s390_ucas_mapping {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 1/2] softmmu/memory: add missing begin/commit callback calls
  2022-08-16 10:12 ` [RFC PATCH 1/2] softmmu/memory: add missing begin/commit callback calls Emanuele Giuseppe Esposito
@ 2022-08-18 19:34   ` Peter Xu
  2022-08-26 13:53     ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Xu @ 2022-08-18 19:34 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito
  Cc: qemu-devel, Paolo Bonzini, Michael S. Tsirkin, Cornelia Huck,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Maxim Levitsky, kvm

On Tue, Aug 16, 2022 at 06:12:49AM -0400, Emanuele Giuseppe Esposito wrote:
> kvm listeners now need ->commit callback in order to actually send
> the ioctl to the hypervisor. Therefore, add missing callers around
> address_space_set_flatview(), which in turn calls
> address_space_update_topology_pass() which calls ->region_* and
> ->log_* callbacks.
> 
> Using MEMORY_LISTENER_CALL_GLOBAL is a little bit an overkill,
> but it is harmless, considering that other listeners that are not
> invoked in address_space_update_topology_pass() won't do anything,
> since they won't have anything to commit.
> 
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
>  softmmu/memory.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/softmmu/memory.c b/softmmu/memory.c
> index 7ba2048836..1afd3f9703 100644
> --- a/softmmu/memory.c
> +++ b/softmmu/memory.c
> @@ -1076,7 +1076,9 @@ static void address_space_update_topology(AddressSpace *as)
>      if (!g_hash_table_lookup(flat_views, physmr)) {
>          generate_memory_topology(physmr);
>      }
> +    MEMORY_LISTENER_CALL_GLOBAL(begin, Forward);
>      address_space_set_flatview(as);
> +    MEMORY_LISTENER_CALL_GLOBAL(commit, Forward);

Should the pair be with MEMORY_LISTENER_CALL() rather than the global
version?  Since it's only updating one address space.

Besides the perf implication (walking per-as list should be faster than
walking global memory listener list?), I think it feels broken too since
we'll call begin() then commit() (with no region_add()/region_del()/..) for
all the listeners that are not registered against this AS.  IIUC it will
empty all regions with those listeners?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase
  2022-08-16 10:12 ` [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase Emanuele Giuseppe Esposito
@ 2022-08-18 20:04   ` Peter Xu
  2022-08-19  0:55     ` Leonardo Bras Soares Passos
  2022-08-22  9:08   ` Cornelia Huck
  2022-08-26 14:15   ` David Hildenbrand
  2 siblings, 1 reply; 23+ messages in thread
From: Peter Xu @ 2022-08-18 20:04 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito
  Cc: qemu-devel, Paolo Bonzini, Michael S. Tsirkin, Cornelia Huck,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Maxim Levitsky, kvm, Leonardo Bras Soares Passos

On Tue, Aug 16, 2022 at 06:12:50AM -0400, Emanuele Giuseppe Esposito wrote:
> +static void kvm_memory_region_node_add(KVMMemoryListener *kml,
> +                                       struct kvm_userspace_memory_region *mem)
> +{
> +    MemoryRegionNode *node;
> +
> +    node = g_malloc(sizeof(MemoryRegionNode));
> +    *node = (MemoryRegionNode) {
> +        .mem = mem,
> +    };

Nit: direct assignment of struct looks okay, but maybe pointer assignment
is clearer (with g_malloc0?  Or iirc we're suggested to always use g_new0):

  node = g_new0(MemoryRegionNode, 1);
  node->mem = mem;

[...]

> +/* for KVM_SET_USER_MEMORY_REGION_LIST */
> +struct kvm_userspace_memory_region_list {
> +	__u32 nent;
> +	__u32 flags;
> +	struct kvm_userspace_memory_region entries[0];
> +};
> +
>  /*
>   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
>   * other bits are reserved for kvm internal use which are defined in
> @@ -1426,6 +1433,8 @@ struct kvm_vfio_spapr_tce {
>  					struct kvm_userspace_memory_region)
>  #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
>  #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
> +#define KVM_SET_USER_MEMORY_REGION_LIST _IOW(KVMIO, 0x49, \
> +					struct kvm_userspace_memory_region_list)

I think this is probably good enough, but just to provide the other small
(but may not be important) piece of puzzle here.  I wanted to think through
to understand better but I never did..

For a quick look, please read the comment in kvm_set_phys_mem().

                /*
                 * NOTE: We should be aware of the fact that here we're only
                 * doing a best effort to sync dirty bits.  No matter whether
                 * we're using dirty log or dirty ring, we ignored two facts:
                 *
                 * (1) dirty bits can reside in hardware buffers (PML)
                 *
                 * (2) after we collected dirty bits here, pages can be dirtied
                 * again before we do the final KVM_SET_USER_MEMORY_REGION to
                 * remove the slot.
                 *
                 * Not easy.  Let's cross the fingers until it's fixed.
                 */

One example is if we have 16G mem, we enable dirty tracking and we punch a
hole of 1G at offset 1G, it'll change from this:

                     (a)
  |----------------- 16G -------------------|

To this:

     (b)    (c)              (d)
  |--1G--|XXXXXX|------------14G------------|

Here (c) will be a 1G hole.

With current code, the hole punching will del region (a) and add back
region (b) and (d).  After the new _LIST ioctl it'll be atomic and nicer.

Here the question is if we're with dirty tracking it means for each region
we have a dirty bitmap.  Currently we do the best effort of doing below
sequence:

  (1) fetching dirty bmap of (a)
  (2) delete region (a)
  (3) add region (b) (d)

Here (a)'s dirty bmap is mostly kept as best effort, but still we'll lose
dirty pages written between step (1) and (2) (and actually if the write
comes within (2) and (3) I think it'll crash qemu, and iiuc that's what
we're going to fix..).

So ideally the atomic op can be:

  "atomically fetch dirty bmap for removed regions, remove regions, and add
   new regions"

Rather than only:

  "atomically remove regions, and add new regions"

as what the new _LIST ioctl do.

But... maybe that's not a real problem, at least I didn't know any report
showing issue with current code yet caused by losing of dirty bits during
step (1) and (2).  Neither do I know how to trigger an issue with it.

I'm just trying to still provide this information so that you should be
aware of this problem too, at the meantime when proposing the new ioctl
change for qemu we should also keep in mind that we won't easily lose the
dirty bmap of (a) here, which I think this patch does the right thing.

Thanks!

--
Peter Xu

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase
  2022-08-18 20:04   ` Peter Xu
@ 2022-08-19  0:55     ` Leonardo Bras Soares Passos
  2022-08-22 14:10       ` Peter Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Leonardo Bras Soares Passos @ 2022-08-19  0:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: Emanuele Giuseppe Esposito, qemu-devel, Paolo Bonzini,
	Michael S. Tsirkin, Cornelia Huck, David Hildenbrand,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, kvm

On Thu, Aug 18, 2022 at 5:05 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Aug 16, 2022 at 06:12:50AM -0400, Emanuele Giuseppe Esposito wrote:
> > +static void kvm_memory_region_node_add(KVMMemoryListener *kml,
> > +                                       struct kvm_userspace_memory_region *mem)
> > +{
> > +    MemoryRegionNode *node;
> > +
> > +    node = g_malloc(sizeof(MemoryRegionNode));
> > +    *node = (MemoryRegionNode) {
> > +        .mem = mem,
> > +    };
>
> Nit: direct assignment of struct looks okay, but maybe pointer assignment
> is clearer (with g_malloc0?  Or iirc we're suggested to always use g_new0):
>
>   node = g_new0(MemoryRegionNode, 1);
>   node->mem = mem;
>
> [...]
>
> > +/* for KVM_SET_USER_MEMORY_REGION_LIST */
> > +struct kvm_userspace_memory_region_list {
> > +     __u32 nent;
> > +     __u32 flags;
> > +     struct kvm_userspace_memory_region entries[0];
> > +};
> > +
> >  /*
> >   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> >   * other bits are reserved for kvm internal use which are defined in
> > @@ -1426,6 +1433,8 @@ struct kvm_vfio_spapr_tce {
> >                                       struct kvm_userspace_memory_region)
> >  #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
> >  #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
> > +#define KVM_SET_USER_MEMORY_REGION_LIST _IOW(KVMIO, 0x49, \
> > +                                     struct kvm_userspace_memory_region_list)
>
> I think this is probably good enough, but just to provide the other small
> (but may not be important) piece of puzzle here.  I wanted to think through
> to understand better but I never did..
>
> For a quick look, please read the comment in kvm_set_phys_mem().
>
>                 /*
>                  * NOTE: We should be aware of the fact that here we're only
>                  * doing a best effort to sync dirty bits.  No matter whether
>                  * we're using dirty log or dirty ring, we ignored two facts:
>                  *
>                  * (1) dirty bits can reside in hardware buffers (PML)
>                  *
>                  * (2) after we collected dirty bits here, pages can be dirtied
>                  * again before we do the final KVM_SET_USER_MEMORY_REGION to
>                  * remove the slot.
>                  *
>                  * Not easy.  Let's cross the fingers until it's fixed.
>                  */
>
> One example is if we have 16G mem, we enable dirty tracking and we punch a
> hole of 1G at offset 1G, it'll change from this:
>
>                      (a)
>   |----------------- 16G -------------------|
>
> To this:
>
>      (b)    (c)              (d)
>   |--1G--|XXXXXX|------------14G------------|
>
> Here (c) will be a 1G hole.
>
> With current code, the hole punching will del region (a) and add back
> region (b) and (d).  After the new _LIST ioctl it'll be atomic and nicer.
>
> Here the question is if we're with dirty tracking it means for each region
> we have a dirty bitmap.  Currently we do the best effort of doing below
> sequence:
>
>   (1) fetching dirty bmap of (a)
>   (2) delete region (a)
>   (3) add region (b) (d)
>
> Here (a)'s dirty bmap is mostly kept as best effort, but still we'll lose
> dirty pages written between step (1) and (2) (and actually if the write
> comes within (2) and (3) I think it'll crash qemu, and iiuc that's what
> we're going to fix..).
>
> So ideally the atomic op can be:
>
>   "atomically fetch dirty bmap for removed regions, remove regions, and add
>    new regions"
>
> Rather than only:
>
>   "atomically remove regions, and add new regions"
>
> as what the new _LIST ioctl do.
>
> But... maybe that's not a real problem, at least I didn't know any report
> showing issue with current code yet caused by losing of dirty bits during
> step (1) and (2).  Neither do I know how to trigger an issue with it.
>
> I'm just trying to still provide this information so that you should be
> aware of this problem too, at the meantime when proposing the new ioctl
> change for qemu we should also keep in mind that we won't easily lose the
> dirty bmap of (a) here, which I think this patch does the right thing.
>

Thanks for bringing these details Peter!

What do you think of adding?
(4) Copy the corresponding part of (a)'s dirty bitmap to (b) and (d)'s
dirty bitmaps.


Best regards,
Leo

> Thanks!
>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase
  2022-08-16 10:12 ` [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase Emanuele Giuseppe Esposito
  2022-08-18 20:04   ` Peter Xu
@ 2022-08-22  9:08   ` Cornelia Huck
  2022-08-26 13:53     ` Emanuele Giuseppe Esposito
  2022-08-26 14:15   ` David Hildenbrand
  2 siblings, 1 reply; 23+ messages in thread
From: Cornelia Huck @ 2022-08-22  9:08 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-devel
  Cc: Paolo Bonzini, Michael S. Tsirkin, Peter Xu, David Hildenbrand,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, kvm, Emanuele Giuseppe Esposito

On Tue, Aug 16 2022, Emanuele Giuseppe Esposito <eesposit@redhat.com> wrote:

> Instead of sending a single ioctl every time ->region_* or ->log_*
> callbacks are called, "queue" all memory regions in a list that will
> be emptied only when committing.
>
> This allow the KVM kernel API to be extended and support multiple
> memslots updates in a single call.
>
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
>  accel/kvm/kvm-all.c       | 99 ++++++++++++++++++++++++++++-----------
>  include/sysemu/kvm_int.h  |  6 +++
>  linux-headers/linux/kvm.h |  9 ++++

Meta comment: Please split out any linux-headers changes into a [dummy,
if not yet accepted in the kernel] headers update patch.

>  3 files changed, 87 insertions(+), 27 deletions(-)


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase
  2022-08-19  0:55     ` Leonardo Bras Soares Passos
@ 2022-08-22 14:10       ` Peter Xu
  2022-08-26 14:07         ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Xu @ 2022-08-22 14:10 UTC (permalink / raw)
  To: Leonardo Bras Soares Passos
  Cc: Emanuele Giuseppe Esposito, qemu-devel, Paolo Bonzini,
	Michael S. Tsirkin, Cornelia Huck, David Hildenbrand,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, kvm

On Thu, Aug 18, 2022 at 09:55:20PM -0300, Leonardo Bras Soares Passos wrote:
> On Thu, Aug 18, 2022 at 5:05 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Aug 16, 2022 at 06:12:50AM -0400, Emanuele Giuseppe Esposito wrote:
> > > +static void kvm_memory_region_node_add(KVMMemoryListener *kml,
> > > +                                       struct kvm_userspace_memory_region *mem)
> > > +{
> > > +    MemoryRegionNode *node;
> > > +
> > > +    node = g_malloc(sizeof(MemoryRegionNode));
> > > +    *node = (MemoryRegionNode) {
> > > +        .mem = mem,
> > > +    };
> >
> > Nit: direct assignment of struct looks okay, but maybe pointer assignment
> > is clearer (with g_malloc0?  Or iirc we're suggested to always use g_new0):
> >
> >   node = g_new0(MemoryRegionNode, 1);
> >   node->mem = mem;
> >
> > [...]
> >
> > > +/* for KVM_SET_USER_MEMORY_REGION_LIST */
> > > +struct kvm_userspace_memory_region_list {
> > > +     __u32 nent;
> > > +     __u32 flags;
> > > +     struct kvm_userspace_memory_region entries[0];
> > > +};
> > > +
> > >  /*
> > >   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> > >   * other bits are reserved for kvm internal use which are defined in
> > > @@ -1426,6 +1433,8 @@ struct kvm_vfio_spapr_tce {
> > >                                       struct kvm_userspace_memory_region)
> > >  #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
> > >  #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
> > > +#define KVM_SET_USER_MEMORY_REGION_LIST _IOW(KVMIO, 0x49, \
> > > +                                     struct kvm_userspace_memory_region_list)
> >
> > I think this is probably good enough, but just to provide the other small
> > (but may not be important) piece of puzzle here.  I wanted to think through
> > to understand better but I never did..
> >
> > For a quick look, please read the comment in kvm_set_phys_mem().
> >
> >                 /*
> >                  * NOTE: We should be aware of the fact that here we're only
> >                  * doing a best effort to sync dirty bits.  No matter whether
> >                  * we're using dirty log or dirty ring, we ignored two facts:
> >                  *
> >                  * (1) dirty bits can reside in hardware buffers (PML)
> >                  *
> >                  * (2) after we collected dirty bits here, pages can be dirtied
> >                  * again before we do the final KVM_SET_USER_MEMORY_REGION to
> >                  * remove the slot.
> >                  *
> >                  * Not easy.  Let's cross the fingers until it's fixed.
> >                  */
> >
> > One example is if we have 16G mem, we enable dirty tracking and we punch a
> > hole of 1G at offset 1G, it'll change from this:
> >
> >                      (a)
> >   |----------------- 16G -------------------|
> >
> > To this:
> >
> >      (b)    (c)              (d)
> >   |--1G--|XXXXXX|------------14G------------|
> >
> > Here (c) will be a 1G hole.
> >
> > With current code, the hole punching will del region (a) and add back
> > region (b) and (d).  After the new _LIST ioctl it'll be atomic and nicer.
> >
> > Here the question is if we're with dirty tracking it means for each region
> > we have a dirty bitmap.  Currently we do the best effort of doing below
> > sequence:
> >
> >   (1) fetching dirty bmap of (a)
> >   (2) delete region (a)
> >   (3) add region (b) (d)
> >
> > Here (a)'s dirty bmap is mostly kept as best effort, but still we'll lose
> > dirty pages written between step (1) and (2) (and actually if the write
> > comes within (2) and (3) I think it'll crash qemu, and iiuc that's what
> > we're going to fix..).
> >
> > So ideally the atomic op can be:
> >
> >   "atomically fetch dirty bmap for removed regions, remove regions, and add
> >    new regions"
> >
> > Rather than only:
> >
> >   "atomically remove regions, and add new regions"
> >
> > as what the new _LIST ioctl do.
> >
> > But... maybe that's not a real problem, at least I didn't know any report
> > showing issue with current code yet caused by losing of dirty bits during
> > step (1) and (2).  Neither do I know how to trigger an issue with it.
> >
> > I'm just trying to still provide this information so that you should be
> > aware of this problem too, at the meantime when proposing the new ioctl
> > change for qemu we should also keep in mind that we won't easily lose the
> > dirty bmap of (a) here, which I think this patch does the right thing.
> >
> 
> Thanks for bringing these details Peter!
> 
> What do you think of adding?
> (4) Copy the corresponding part of (a)'s dirty bitmap to (b) and (d)'s
> dirty bitmaps.

Sounds good to me, but may not cover dirty ring?  Maybe we could move on
with the simple but clean scheme first and think about a comprehensive
option only if very necessary.  The worst case is we need one more kvm cap
but we should still have enough.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 1/2] softmmu/memory: add missing begin/commit callback calls
  2022-08-18 19:34   ` Peter Xu
@ 2022-08-26 13:53     ` Emanuele Giuseppe Esposito
  2022-08-26 14:13       ` Peter Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-08-26 13:53 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Paolo Bonzini, Michael S. Tsirkin, Cornelia Huck,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Maxim Levitsky, kvm



Am 18/08/2022 um 21:34 schrieb Peter Xu:
> On Tue, Aug 16, 2022 at 06:12:49AM -0400, Emanuele Giuseppe Esposito wrote:
>> kvm listeners now need ->commit callback in order to actually send
>> the ioctl to the hypervisor. Therefore, add missing callers around
>> address_space_set_flatview(), which in turn calls
>> address_space_update_topology_pass() which calls ->region_* and
>> ->log_* callbacks.
>>
>> Using MEMORY_LISTENER_CALL_GLOBAL is a little bit an overkill,
>> but it is harmless, considering that other listeners that are not
>> invoked in address_space_update_topology_pass() won't do anything,
>> since they won't have anything to commit.
>>
>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>> ---
>>  softmmu/memory.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/softmmu/memory.c b/softmmu/memory.c
>> index 7ba2048836..1afd3f9703 100644
>> --- a/softmmu/memory.c
>> +++ b/softmmu/memory.c
>> @@ -1076,7 +1076,9 @@ static void address_space_update_topology(AddressSpace *as)
>>      if (!g_hash_table_lookup(flat_views, physmr)) {
>>          generate_memory_topology(physmr);
>>      }
>> +    MEMORY_LISTENER_CALL_GLOBAL(begin, Forward);
>>      address_space_set_flatview(as);
>> +    MEMORY_LISTENER_CALL_GLOBAL(commit, Forward);
> 
> Should the pair be with MEMORY_LISTENER_CALL() rather than the global
> version?  Since it's only updating one address space.

Ideally yes, we want to call the memory listener only for this address
space. Practically I don't know how to do it, as MEMORY_LISTENER_CALL 1)
takes additional parameters like memory region section, and 2) calls
_listener->_callback(_listener, _section, ##_args)
whereas begin and commit need (_listener, ##args) only, which is what
MEMORY_LISTENER_CALL_GLOBAL does.

> 
> Besides the perf implication (walking per-as list should be faster than
> walking global memory listener list?), I think it feels broken too since
> we'll call begin() then commit() (with no region_add()/region_del()/..) for
> all the listeners that are not registered against this AS.  IIUC it will
> empty all regions with those listeners?

What do you mean "will empty all regions with those listeners"?
But yes theoretically vhost-vdpa and physmem have commit callbacks that
are independent from whether region_add or other callbacks have been called.
For kvm and probably vhost it would be no problem, since there won't be
any list to iterate on.

I'll implement a new macro to handle this.

Emanuele
> 
> Thanks,
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase
  2022-08-22  9:08   ` Cornelia Huck
@ 2022-08-26 13:53     ` Emanuele Giuseppe Esposito
  0 siblings, 0 replies; 23+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-08-26 13:53 UTC (permalink / raw)
  To: Cornelia Huck, qemu-devel
  Cc: Paolo Bonzini, Michael S. Tsirkin, Peter Xu, David Hildenbrand,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, kvm



Am 22/08/2022 um 11:08 schrieb Cornelia Huck:
> On Tue, Aug 16 2022, Emanuele Giuseppe Esposito <eesposit@redhat.com> wrote:
> 
>> Instead of sending a single ioctl every time ->region_* or ->log_*
>> callbacks are called, "queue" all memory regions in a list that will
>> be emptied only when committing.
>>
>> This allow the KVM kernel API to be extended and support multiple
>> memslots updates in a single call.
>>
>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>> ---
>>  accel/kvm/kvm-all.c       | 99 ++++++++++++++++++++++++++++-----------
>>  include/sysemu/kvm_int.h  |  6 +++
>>  linux-headers/linux/kvm.h |  9 ++++
> 
> Meta comment: Please split out any linux-headers changes into a [dummy,
> if not yet accepted in the kernel] headers update patch.

Thank you for pointing that out, will do.

Emanuele

> 
>>  3 files changed, 87 insertions(+), 27 deletions(-)
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase
  2022-08-22 14:10       ` Peter Xu
@ 2022-08-26 14:07         ` Emanuele Giuseppe Esposito
  2022-08-27 20:58           ` Peter Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-08-26 14:07 UTC (permalink / raw)
  To: Peter Xu, Leonardo Bras Soares Passos
  Cc: qemu-devel, Paolo Bonzini, Michael S. Tsirkin, Cornelia Huck,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Maxim Levitsky, kvm



Am 22/08/2022 um 16:10 schrieb Peter Xu:
> On Thu, Aug 18, 2022 at 09:55:20PM -0300, Leonardo Bras Soares Passos wrote:
>> On Thu, Aug 18, 2022 at 5:05 PM Peter Xu <peterx@redhat.com> wrote:
>>>
>>> On Tue, Aug 16, 2022 at 06:12:50AM -0400, Emanuele Giuseppe Esposito wrote:
>>>> +static void kvm_memory_region_node_add(KVMMemoryListener *kml,
>>>> +                                       struct kvm_userspace_memory_region *mem)
>>>> +{
>>>> +    MemoryRegionNode *node;
>>>> +
>>>> +    node = g_malloc(sizeof(MemoryRegionNode));
>>>> +    *node = (MemoryRegionNode) {
>>>> +        .mem = mem,
>>>> +    };
>>>
>>> Nit: direct assignment of struct looks okay, but maybe pointer assignment
>>> is clearer (with g_malloc0?  Or iirc we're suggested to always use g_new0):
>>>
>>>   node = g_new0(MemoryRegionNode, 1);
>>>   node->mem = mem;
>>>
>>> [...]

Makes sense

>>>
>>>> +/* for KVM_SET_USER_MEMORY_REGION_LIST */
>>>> +struct kvm_userspace_memory_region_list {
>>>> +     __u32 nent;
>>>> +     __u32 flags;
>>>> +     struct kvm_userspace_memory_region entries[0];
>>>> +};
>>>> +
>>>>  /*
>>>>   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
>>>>   * other bits are reserved for kvm internal use which are defined in
>>>> @@ -1426,6 +1433,8 @@ struct kvm_vfio_spapr_tce {
>>>>                                       struct kvm_userspace_memory_region)
>>>>  #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
>>>>  #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
>>>> +#define KVM_SET_USER_MEMORY_REGION_LIST _IOW(KVMIO, 0x49, \
>>>> +                                     struct kvm_userspace_memory_region_list)
>>>
>>> I think this is probably good enough, but just to provide the other small
>>> (but may not be important) piece of puzzle here.  I wanted to think through
>>> to understand better but I never did..
>>>
>>> For a quick look, please read the comment in kvm_set_phys_mem().
>>>
>>>                 /*
>>>                  * NOTE: We should be aware of the fact that here we're only
>>>                  * doing a best effort to sync dirty bits.  No matter whether
>>>                  * we're using dirty log or dirty ring, we ignored two facts:
>>>                  *
>>>                  * (1) dirty bits can reside in hardware buffers (PML)
>>>                  *
>>>                  * (2) after we collected dirty bits here, pages can be dirtied
>>>                  * again before we do the final KVM_SET_USER_MEMORY_REGION to
>>>                  * remove the slot.
>>>                  *
>>>                  * Not easy.  Let's cross the fingers until it's fixed.
>>>                  */
>>>
>>> One example is if we have 16G mem, we enable dirty tracking and we punch a
>>> hole of 1G at offset 1G, it'll change from this:
>>>
>>>                      (a)
>>>   |----------------- 16G -------------------|
>>>
>>> To this:
>>>
>>>      (b)    (c)              (d)
>>>   |--1G--|XXXXXX|------------14G------------|
>>>
>>> Here (c) will be a 1G hole.
>>>
>>> With current code, the hole punching will del region (a) and add back
>>> region (b) and (d).  After the new _LIST ioctl it'll be atomic and nicer.
>>>
>>> Here the question is if we're with dirty tracking it means for each region
>>> we have a dirty bitmap.  Currently we do the best effort of doing below
>>> sequence:
>>>
>>>   (1) fetching dirty bmap of (a)
>>>   (2) delete region (a)
>>>   (3) add region (b) (d)
>>>
>>> Here (a)'s dirty bmap is mostly kept as best effort, but still we'll lose
>>> dirty pages written between step (1) and (2) (and actually if the write
>>> comes within (2) and (3) I think it'll crash qemu, and iiuc that's what
>>> we're going to fix..).
>>>
>>> So ideally the atomic op can be:
>>>
>>>   "atomically fetch dirty bmap for removed regions, remove regions, and add
>>>    new regions"
>>>
>>> Rather than only:
>>>
>>>   "atomically remove regions, and add new regions"
>>>
>>> as what the new _LIST ioctl do.
>>>
>>> But... maybe that's not a real problem, at least I didn't know any report
>>> showing issue with current code yet caused by losing of dirty bits during
>>> step (1) and (2).  Neither do I know how to trigger an issue with it.
>>>
>>> I'm just trying to still provide this information so that you should be
>>> aware of this problem too, at the meantime when proposing the new ioctl
>>> change for qemu we should also keep in mind that we won't easily lose the
>>> dirty bmap of (a) here, which I think this patch does the right thing.
>>>
>>
>> Thanks for bringing these details Peter!
>>
>> What do you think of adding?
>> (4) Copy the corresponding part of (a)'s dirty bitmap to (b) and (d)'s
>> dirty bitmaps.
> 
> Sounds good to me, but may not cover dirty ring?  Maybe we could move on
> with the simple but clean scheme first and think about a comprehensive
> option only if very necessary.  The worst case is we need one more kvm cap
> but we should still have enough.

Ok then, I will not consider this for now.

Might or might not be relevant, but I was also considering to
pre-process the list of memslots passed to the ioctl and merge
operations when necessary, to avoid unnecessary operations.

For example, if we are creating an area and punching a hole (assuming
this is a valid example), we would have

transaction_begin()
CREATE(offset=0, memory area)
DELETE(memory area)
CREATE(offset=0, memory area / 2 - 1)
CREATE(offset=memory_area/2, memory area / 2)
transaction_commmit()

Instead, if we pre-process the memory regions and detect overlaps with
an interval tree, we could simplify the above with:
CREATE(offset=0, memory area / 2 - 1)
CREATE(offset=memory_area/2, memory area / 2)

In addition, I was thinking to also provide the operation type (called
enum kvm_mr_change) from userspace directly, and not "figure" it
ourselves in KVM.

The reason for these two changes come from KVM implementation. I know
this is no KVM place, but a background explanation might be necessary.
Basically KVM 1) figures the request type by looking at the fields
passed by userspace (for example mem_size == 0 means DELETE) and the
current status of the memslot (id not found means CREATE, found means
MOVE/UPDATE_FLAGS), and 2) has 2 memslot lists per address space: the
active (visible) and inactive. Any operation is performed in the
inactive list, then we "swap" the two so that the change is visible.

The "atomic" goal of this serie just means that we want to apply
multiple memslot operation and then perform a single "swap".
The problem is that each DELETE and MOVE request perform 2 swaps: first
substitute current memslot with an invalid one (so that page faults are
retried), and then remove the invalid one (in case of MOVE, substitute
with new memslot).

Therefore:
- KVM should ideally pre-process all DELETE/MOVE memslots and perform a
first swap(s) to replace the current memslot with an invalid one, and
then perform all other operations in order (deletion/move included).
This is why QEMU should just send pre-merged MR, so that we don't have
CREATE(x)- DELETE(x). Otherwise we would process a DELETE on a MR that
doesn't exist yet.

- Doing the above is still not enough, as KVM figures what operation to
do depending on the current state of the memslots.
Assuming we already have an already existing MR y, and now we get the
list DELETE(y) CREATE(y/2) (ie reducing y to half its size).
In this case the interval tree can't do anything, but it's very hard to
understand that the second request in the list is a CREATE, because when
KVM performs the check to understand which type of operation it is
(before doing any modification at all in the memslot list), it finds
that y (memslot with same id) exist, but has a different size than what
provided from userspace, therefore it could either fail, or misinterpret
it as another operation (currently fails -EINVALID).
If we instead already provide the labels, then we can:
	1. substitute the memslots pointed by DELETE/MOVE with invalid & swap
(so it is visible, non-atomic. But we don't care, as we are not deleting
anything)
	2. delete the invalid memslot (in the inactive memslot list)
	3. process the other requests (in the inactive memslot list)
	4. single and atomic swap (step 2 and 3 are now visible).

What do you think?

Bonus question: with this atomic operation, do we really need the
invalid memslot logic for MOVE/DELETE?

Thank you,
Emanuele


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 1/2] softmmu/memory: add missing begin/commit callback calls
  2022-08-26 13:53     ` Emanuele Giuseppe Esposito
@ 2022-08-26 14:13       ` Peter Xu
  2022-08-27 21:03         ` Peter Xu
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Xu @ 2022-08-26 14:13 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito
  Cc: qemu-devel, Paolo Bonzini, Michael S. Tsirkin, Cornelia Huck,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Maxim Levitsky, kvm

On Fri, Aug 26, 2022 at 03:53:09PM +0200, Emanuele Giuseppe Esposito wrote:
> What do you mean "will empty all regions with those listeners"?
> But yes theoretically vhost-vdpa and physmem have commit callbacks that
> are independent from whether region_add or other callbacks have been called.
> For kvm and probably vhost it would be no problem, since there won't be
> any list to iterate on.

Right, begin()/commit() is for address space update, so it should be fine
to have nothing to commit, sorry.

> 
> I'll implement a new macro to handle this.

Sounds good.  Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase
  2022-08-16 10:12 ` [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase Emanuele Giuseppe Esposito
  2022-08-18 20:04   ` Peter Xu
  2022-08-22  9:08   ` Cornelia Huck
@ 2022-08-26 14:15   ` David Hildenbrand
  2022-08-26 14:32     ` Emanuele Giuseppe Esposito
  2 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2022-08-26 14:15 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-devel
  Cc: Paolo Bonzini, Michael S. Tsirkin, Cornelia Huck, Peter Xu,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, kvm

On 16.08.22 12:12, Emanuele Giuseppe Esposito wrote:
> Instead of sending a single ioctl every time ->region_* or ->log_*
> callbacks are called, "queue" all memory regions in a list that will
> be emptied only when committing.
> 

Out of interest, how many such regions does the ioctl support? As many
as KVM theoretically supports? (32k IIRC)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase
  2022-08-26 14:15   ` David Hildenbrand
@ 2022-08-26 14:32     ` Emanuele Giuseppe Esposito
  2022-08-26 14:44       ` David Hildenbrand
  0 siblings, 1 reply; 23+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-08-26 14:32 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Paolo Bonzini, Michael S. Tsirkin, Cornelia Huck, Peter Xu,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, kvm

Am 26/08/2022 um 16:15 schrieb David Hildenbrand:
> On 16.08.22 12:12, Emanuele Giuseppe Esposito wrote:
>> Instead of sending a single ioctl every time ->region_* or ->log_*
>> callbacks are called, "queue" all memory regions in a list that will
>> be emptied only when committing.
>>
> 
> Out of interest, how many such regions does the ioctl support? As many
> as KVM theoretically supports? (32k IIRC)
> 

I assume you mean for the new ioctl, but yes that's a good question.

The problem here is that we could have more than a single update per
memory region. So we are not limited anymore to the number of regions,
but the number of operations * number of region.

I was thinking, maybe when pre-processing QEMU could divide a single
transaction into multiple atomic operations (ie operations on the same
memory region)? That way avoid sending a single ioctl with 32k *
#operation elements. Is that what you mean?

Emanuele

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase
  2022-08-26 14:32     ` Emanuele Giuseppe Esposito
@ 2022-08-26 14:44       ` David Hildenbrand
  2022-09-09  8:04         ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2022-08-26 14:44 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-devel
  Cc: Paolo Bonzini, Michael S. Tsirkin, Cornelia Huck, Peter Xu,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, kvm

On 26.08.22 16:32, Emanuele Giuseppe Esposito wrote:
> 
> 
> Am 26/08/2022 um 16:15 schrieb David Hildenbrand:
>> On 16.08.22 12:12, Emanuele Giuseppe Esposito wrote:
>>> Instead of sending a single ioctl every time ->region_* or ->log_*
>>> callbacks are called, "queue" all memory regions in a list that will
>>> be emptied only when committing.
>>>
>>
>> Out of interest, how many such regions does the ioctl support? As many
>> as KVM theoretically supports? (32k IIRC)
>>
> 
> I assume you mean for the new ioctl, but yes that's a good question.
> 
> The problem here is that we could have more than a single update per
> memory region. So we are not limited anymore to the number of regions,
> but the number of operations * number of region.
> 
> I was thinking, maybe when pre-processing QEMU could divide a single
> transaction into multiple atomic operations (ie operations on the same
> memory region)? That way avoid sending a single ioctl with 32k *
> #operation elements. Is that what you mean?

Oh, so we're effectively collecting slot updates and not the complete
"slot" view, got it. Was the kernel series already sent so I can have a
look?

Note that there are some possible slot updates (like a split, or a
merge) that involve multiple slots and that would have to be part of the
same "transaction" to be atomic.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase
  2022-08-26 14:07         ` Emanuele Giuseppe Esposito
@ 2022-08-27 20:58           ` Peter Xu
  2022-08-30 10:59             ` David Hildenbrand
  2022-09-09  8:00             ` Emanuele Giuseppe Esposito
  0 siblings, 2 replies; 23+ messages in thread
From: Peter Xu @ 2022-08-27 20:58 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito
  Cc: Leonardo Bras Soares Passos, qemu-devel, Paolo Bonzini,
	Michael S. Tsirkin, Cornelia Huck, David Hildenbrand,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, kvm

Hi, Emanuele,

On Fri, Aug 26, 2022 at 04:07:01PM +0200, Emanuele Giuseppe Esposito wrote:
> 
> 
> Am 22/08/2022 um 16:10 schrieb Peter Xu:
> > On Thu, Aug 18, 2022 at 09:55:20PM -0300, Leonardo Bras Soares Passos wrote:
> >> On Thu, Aug 18, 2022 at 5:05 PM Peter Xu <peterx@redhat.com> wrote:
> >>>
> >>> On Tue, Aug 16, 2022 at 06:12:50AM -0400, Emanuele Giuseppe Esposito wrote:
> >>>> +static void kvm_memory_region_node_add(KVMMemoryListener *kml,
> >>>> +                                       struct kvm_userspace_memory_region *mem)
> >>>> +{
> >>>> +    MemoryRegionNode *node;
> >>>> +
> >>>> +    node = g_malloc(sizeof(MemoryRegionNode));
> >>>> +    *node = (MemoryRegionNode) {
> >>>> +        .mem = mem,
> >>>> +    };
> >>>
> >>> Nit: direct assignment of struct looks okay, but maybe pointer assignment
> >>> is clearer (with g_malloc0?  Or iirc we're suggested to always use g_new0):
> >>>
> >>>   node = g_new0(MemoryRegionNode, 1);
> >>>   node->mem = mem;
> >>>
> >>> [...]
> 
> Makes sense
> 
> >>>
> >>>> +/* for KVM_SET_USER_MEMORY_REGION_LIST */
> >>>> +struct kvm_userspace_memory_region_list {
> >>>> +     __u32 nent;
> >>>> +     __u32 flags;
> >>>> +     struct kvm_userspace_memory_region entries[0];
> >>>> +};
> >>>> +
> >>>>  /*
> >>>>   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
> >>>>   * other bits are reserved for kvm internal use which are defined in
> >>>> @@ -1426,6 +1433,8 @@ struct kvm_vfio_spapr_tce {
> >>>>                                       struct kvm_userspace_memory_region)
> >>>>  #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
> >>>>  #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
> >>>> +#define KVM_SET_USER_MEMORY_REGION_LIST _IOW(KVMIO, 0x49, \
> >>>> +                                     struct kvm_userspace_memory_region_list)
> >>>
> >>> I think this is probably good enough, but just to provide the other small
> >>> (but may not be important) piece of puzzle here.  I wanted to think through
> >>> to understand better but I never did..
> >>>
> >>> For a quick look, please read the comment in kvm_set_phys_mem().
> >>>
> >>>                 /*
> >>>                  * NOTE: We should be aware of the fact that here we're only
> >>>                  * doing a best effort to sync dirty bits.  No matter whether
> >>>                  * we're using dirty log or dirty ring, we ignored two facts:
> >>>                  *
> >>>                  * (1) dirty bits can reside in hardware buffers (PML)
> >>>                  *
> >>>                  * (2) after we collected dirty bits here, pages can be dirtied
> >>>                  * again before we do the final KVM_SET_USER_MEMORY_REGION to
> >>>                  * remove the slot.
> >>>                  *
> >>>                  * Not easy.  Let's cross the fingers until it's fixed.
> >>>                  */
> >>>
> >>> One example is if we have 16G mem, we enable dirty tracking and we punch a
> >>> hole of 1G at offset 1G, it'll change from this:
> >>>
> >>>                      (a)
> >>>   |----------------- 16G -------------------|
> >>>
> >>> To this:
> >>>
> >>>      (b)    (c)              (d)
> >>>   |--1G--|XXXXXX|------------14G------------|
> >>>
> >>> Here (c) will be a 1G hole.
> >>>
> >>> With current code, the hole punching will del region (a) and add back
> >>> region (b) and (d).  After the new _LIST ioctl it'll be atomic and nicer.
> >>>
> >>> Here the question is if we're with dirty tracking it means for each region
> >>> we have a dirty bitmap.  Currently we do the best effort of doing below
> >>> sequence:
> >>>
> >>>   (1) fetching dirty bmap of (a)
> >>>   (2) delete region (a)
> >>>   (3) add region (b) (d)
> >>>
> >>> Here (a)'s dirty bmap is mostly kept as best effort, but still we'll lose
> >>> dirty pages written between step (1) and (2) (and actually if the write
> >>> comes within (2) and (3) I think it'll crash qemu, and iiuc that's what
> >>> we're going to fix..).
> >>>
> >>> So ideally the atomic op can be:
> >>>
> >>>   "atomically fetch dirty bmap for removed regions, remove regions, and add
> >>>    new regions"
> >>>
> >>> Rather than only:
> >>>
> >>>   "atomically remove regions, and add new regions"
> >>>
> >>> as what the new _LIST ioctl do.
> >>>
> >>> But... maybe that's not a real problem, at least I didn't know any report
> >>> showing issue with current code yet caused by losing of dirty bits during
> >>> step (1) and (2).  Neither do I know how to trigger an issue with it.
> >>>
> >>> I'm just trying to still provide this information so that you should be
> >>> aware of this problem too, at the meantime when proposing the new ioctl
> >>> change for qemu we should also keep in mind that we won't easily lose the
> >>> dirty bmap of (a) here, which I think this patch does the right thing.
> >>>
> >>
> >> Thanks for bringing these details Peter!
> >>
> >> What do you think of adding?
> >> (4) Copy the corresponding part of (a)'s dirty bitmap to (b) and (d)'s
> >> dirty bitmaps.
> > 
> > Sounds good to me, but may not cover dirty ring?  Maybe we could move on
> > with the simple but clean scheme first and think about a comprehensive
> > option only if very necessary.  The worst case is we need one more kvm cap
> > but we should still have enough.
> 
> Ok then, I will not consider this for now.
> 
> Might or might not be relevant, but I was also considering to
> pre-process the list of memslots passed to the ioctl and merge
> operations when necessary, to avoid unnecessary operations.
> 
> For example, if we are creating an area and punching a hole (assuming
> this is a valid example), we would have
> 
> transaction_begin()
> CREATE(offset=0, memory area)
> DELETE(memory area)
> CREATE(offset=0, memory area / 2 - 1)
> CREATE(offset=memory_area/2, memory area / 2)
> transaction_commmit()
> 
> Instead, if we pre-process the memory regions and detect overlaps with
> an interval tree, we could simplify the above with:
> CREATE(offset=0, memory area / 2 - 1)
> CREATE(offset=memory_area/2, memory area / 2)

As I replied in the private email, I don't think the pre-process here is
needed, because afaict flat views already handle that.

See generate_memory_topology() and especially render_memory_region().

In above example, the 1st CREATE + DELETE shouldn't reach any of the memory
listners, including the KVM one, because the flatview only contains the
final layout of the address space when commit() happens.

> 
> In addition, I was thinking to also provide the operation type (called
> enum kvm_mr_change) from userspace directly, and not "figure" it
> ourselves in KVM.
> 
> The reason for these two changes come from KVM implementation. I know
> this is no KVM place, but a background explanation might be necessary.
> Basically KVM 1) figures the request type by looking at the fields
> passed by userspace (for example mem_size == 0 means DELETE) and the
> current status of the memslot (id not found means CREATE, found means
> MOVE/UPDATE_FLAGS), and 2) has 2 memslot lists per address space: the
> active (visible) and inactive. Any operation is performed in the
> inactive list, then we "swap" the two so that the change is visible.
> 
> The "atomic" goal of this serie just means that we want to apply
> multiple memslot operation and then perform a single "swap".
> The problem is that each DELETE and MOVE request perform 2 swaps: first
> substitute current memslot with an invalid one (so that page faults are
> retried), and then remove the invalid one (in case of MOVE, substitute
> with new memslot).
> 
> Therefore:
> - KVM should ideally pre-process all DELETE/MOVE memslots and perform a
> first swap(s) to replace the current memslot with an invalid one, and
> then perform all other operations in order (deletion/move included).

This sounds correct.

> This is why QEMU should just send pre-merged MR, so that we don't have
> CREATE(x)- DELETE(x). Otherwise we would process a DELETE on a MR that
> doesn't exist yet.

As discussed above, IMHO pre-merge isn't an issue here and I think the
sequence in your example won't happen.  But you raised a great question on
whether we should allow the new ioctl (if there's going to be one) to do
whatever it wants by batching any sequence of memslot operations.

More below.

> 
> - Doing the above is still not enough, as KVM figures what operation to
> do depending on the current state of the memslots.
> Assuming we already have an already existing MR y, and now we get the
> list DELETE(y) CREATE(y/2) (ie reducing y to half its size).
> In this case the interval tree can't do anything, but it's very hard to
> understand that the second request in the list is a CREATE, because when
> KVM performs the check to understand which type of operation it is
> (before doing any modification at all in the memslot list), it finds
> that y (memslot with same id) exist, but has a different size than what
> provided from userspace, therefore it could either fail, or misinterpret
> it as another operation (currently fails -EINVALID).

Another good question..  I think that can be partly solved by not allowing
ID duplication in the same batched ioctl, or maybe we can fail it.  From
qemu perspective, we may need to change the memslot id allocation to not
reuse IDs of being-deleted memslots until the batch is processed.

Something like adding similar INVALID tags to kvm memslots in QEMU when we
are preparing the batch, then we should only reset memory_size==0 and clear
INVALID tag after the ioctl returns.  Then during the batch, any new slots
to be added from kvm_get_free_slot() will not return any duplication of a
deleting one.

> If we instead already provide the labels, then we can:
> 	1. substitute the memslots pointed by DELETE/MOVE with invalid & swap
> (so it is visible, non-atomic. But we don't care, as we are not deleting
> anything)
> 	2. delete the invalid memslot (in the inactive memslot list)
> 	3. process the other requests (in the inactive memslot list)
> 	4. single and atomic swap (step 2 and 3 are now visible).
> 
> What do you think?

Adding some limitation to the new ioctl makes sense to me, especially
moving the DELETE/MOVE to be handled earlier, rather than a complete
mixture of all of them.

I'm wondering whether the batch ioctl can be layed out separately on the
operations, then it can be clearly documented on the sequence of handling
each op:

  struct kvm_userspace_memory_region_batch {
         struct kvm_userspace_memory_region deletes[];
         struct kvm_userspace_memory_region moves[];
         struct kvm_userspace_memory_region creates[];
         struct kvm_userspace_memory_region flags_only[];
  };

So that the ordering can be very obvious.  But I didn't really think deeper
than that.

Side note: do we use MOVE at all in QEMU?

> 
> Bonus question: with this atomic operation, do we really need the
> invalid memslot logic for MOVE/DELETE?

I think so.  Not an expert on that, but iiuc that's to make sure we'll zap
all shadow paging that maps to the slots being deleted/moved.

Paolo can always help to keep me honest above.

One thing I forgot to ask: iirc we used to have a workaround to kick all
vcpus out, update memory slots, then continue all vcpus.  Would that work
for us too for the problem you're working on?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 1/2] softmmu/memory: add missing begin/commit callback calls
  2022-08-26 14:13       ` Peter Xu
@ 2022-08-27 21:03         ` Peter Xu
  2022-09-09  8:02           ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Xu @ 2022-08-27 21:03 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito
  Cc: qemu-devel, Paolo Bonzini, Michael S. Tsirkin, Cornelia Huck,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Maxim Levitsky, kvm

On Fri, Aug 26, 2022 at 10:13:47AM -0400, Peter Xu wrote:
> On Fri, Aug 26, 2022 at 03:53:09PM +0200, Emanuele Giuseppe Esposito wrote:
> > What do you mean "will empty all regions with those listeners"?
> > But yes theoretically vhost-vdpa and physmem have commit callbacks that
> > are independent from whether region_add or other callbacks have been called.
> > For kvm and probably vhost it would be no problem, since there won't be
> > any list to iterate on.
> 
> Right, begin()/commit() is for address space update, so it should be fine
> to have nothing to commit, sorry.

Hold on..

When I was replying to your patch 2 and reading the code around, I fount
that this patch does affect vhost..  see region_nop() hook and also vhost's
version of vhost_region_addnop().  I think vhost will sync its memory
layout for each of the commit(), and any newly created AS could emptify
vhost memory list even if registered on address_space_memory.

The other thing is address_space_update_topology() seems to be only used by
address_space_init().  It means I don't think there should have any
listener registered to this AS anyway.. :) So iiuc this patch (even if
converting to loop over per-as memory listeners) is not needed.

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase
  2022-08-27 20:58           ` Peter Xu
@ 2022-08-30 10:59             ` David Hildenbrand
  2022-09-09  8:02               ` Emanuele Giuseppe Esposito
  2022-09-09  8:00             ` Emanuele Giuseppe Esposito
  1 sibling, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2022-08-30 10:59 UTC (permalink / raw)
  To: Peter Xu, Emanuele Giuseppe Esposito
  Cc: Leonardo Bras Soares Passos, qemu-devel, Paolo Bonzini,
	Michael S. Tsirkin, Cornelia Huck, Philippe Mathieu-Daudé,
	Maxim Levitsky, kvm

On 27.08.22 22:58, Peter Xu wrote:
> Hi, Emanuele,
> 
> On Fri, Aug 26, 2022 at 04:07:01PM +0200, Emanuele Giuseppe Esposito wrote:
>>
>>
>> Am 22/08/2022 um 16:10 schrieb Peter Xu:
>>> On Thu, Aug 18, 2022 at 09:55:20PM -0300, Leonardo Bras Soares Passos wrote:
>>>> On Thu, Aug 18, 2022 at 5:05 PM Peter Xu <peterx@redhat.com> wrote:
>>>>>
>>>>> On Tue, Aug 16, 2022 at 06:12:50AM -0400, Emanuele Giuseppe Esposito wrote:
>>>>>> +static void kvm_memory_region_node_add(KVMMemoryListener *kml,
>>>>>> +                                       struct kvm_userspace_memory_region *mem)
>>>>>> +{
>>>>>> +    MemoryRegionNode *node;
>>>>>> +
>>>>>> +    node = g_malloc(sizeof(MemoryRegionNode));
>>>>>> +    *node = (MemoryRegionNode) {
>>>>>> +        .mem = mem,
>>>>>> +    };
>>>>>
>>>>> Nit: direct assignment of struct looks okay, but maybe pointer assignment
>>>>> is clearer (with g_malloc0?  Or iirc we're suggested to always use g_new0):
>>>>>
>>>>>   node = g_new0(MemoryRegionNode, 1);
>>>>>   node->mem = mem;
>>>>>
>>>>> [...]
>>
>> Makes sense
>>
>>>>>
>>>>>> +/* for KVM_SET_USER_MEMORY_REGION_LIST */
>>>>>> +struct kvm_userspace_memory_region_list {
>>>>>> +     __u32 nent;
>>>>>> +     __u32 flags;
>>>>>> +     struct kvm_userspace_memory_region entries[0];
>>>>>> +};
>>>>>> +
>>>>>>  /*
>>>>>>   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
>>>>>>   * other bits are reserved for kvm internal use which are defined in
>>>>>> @@ -1426,6 +1433,8 @@ struct kvm_vfio_spapr_tce {
>>>>>>                                       struct kvm_userspace_memory_region)
>>>>>>  #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
>>>>>>  #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
>>>>>> +#define KVM_SET_USER_MEMORY_REGION_LIST _IOW(KVMIO, 0x49, \
>>>>>> +                                     struct kvm_userspace_memory_region_list)
>>>>>
>>>>> I think this is probably good enough, but just to provide the other small
>>>>> (but may not be important) piece of puzzle here.  I wanted to think through
>>>>> to understand better but I never did..
>>>>>
>>>>> For a quick look, please read the comment in kvm_set_phys_mem().
>>>>>
>>>>>                 /*
>>>>>                  * NOTE: We should be aware of the fact that here we're only
>>>>>                  * doing a best effort to sync dirty bits.  No matter whether
>>>>>                  * we're using dirty log or dirty ring, we ignored two facts:
>>>>>                  *
>>>>>                  * (1) dirty bits can reside in hardware buffers (PML)
>>>>>                  *
>>>>>                  * (2) after we collected dirty bits here, pages can be dirtied
>>>>>                  * again before we do the final KVM_SET_USER_MEMORY_REGION to
>>>>>                  * remove the slot.
>>>>>                  *
>>>>>                  * Not easy.  Let's cross the fingers until it's fixed.
>>>>>                  */
>>>>>
>>>>> One example is if we have 16G mem, we enable dirty tracking and we punch a
>>>>> hole of 1G at offset 1G, it'll change from this:
>>>>>
>>>>>                      (a)
>>>>>   |----------------- 16G -------------------|
>>>>>
>>>>> To this:
>>>>>
>>>>>      (b)    (c)              (d)
>>>>>   |--1G--|XXXXXX|------------14G------------|
>>>>>
>>>>> Here (c) will be a 1G hole.
>>>>>
>>>>> With current code, the hole punching will del region (a) and add back
>>>>> region (b) and (d).  After the new _LIST ioctl it'll be atomic and nicer.
>>>>>
>>>>> Here the question is if we're with dirty tracking it means for each region
>>>>> we have a dirty bitmap.  Currently we do the best effort of doing below
>>>>> sequence:
>>>>>
>>>>>   (1) fetching dirty bmap of (a)
>>>>>   (2) delete region (a)
>>>>>   (3) add region (b) (d)
>>>>>
>>>>> Here (a)'s dirty bmap is mostly kept as best effort, but still we'll lose
>>>>> dirty pages written between step (1) and (2) (and actually if the write
>>>>> comes within (2) and (3) I think it'll crash qemu, and iiuc that's what
>>>>> we're going to fix..).
>>>>>
>>>>> So ideally the atomic op can be:
>>>>>
>>>>>   "atomically fetch dirty bmap for removed regions, remove regions, and add
>>>>>    new regions"
>>>>>
>>>>> Rather than only:
>>>>>
>>>>>   "atomically remove regions, and add new regions"
>>>>>
>>>>> as what the new _LIST ioctl do.
>>>>>
>>>>> But... maybe that's not a real problem, at least I didn't know any report
>>>>> showing issue with current code yet caused by losing of dirty bits during
>>>>> step (1) and (2).  Neither do I know how to trigger an issue with it.
>>>>>
>>>>> I'm just trying to still provide this information so that you should be
>>>>> aware of this problem too, at the meantime when proposing the new ioctl
>>>>> change for qemu we should also keep in mind that we won't easily lose the
>>>>> dirty bmap of (a) here, which I think this patch does the right thing.
>>>>>
>>>>
>>>> Thanks for bringing these details Peter!
>>>>
>>>> What do you think of adding?
>>>> (4) Copy the corresponding part of (a)'s dirty bitmap to (b) and (d)'s
>>>> dirty bitmaps.
>>>
>>> Sounds good to me, but may not cover dirty ring?  Maybe we could move on
>>> with the simple but clean scheme first and think about a comprehensive
>>> option only if very necessary.  The worst case is we need one more kvm cap
>>> but we should still have enough.
>>
>> Ok then, I will not consider this for now.
>>
>> Might or might not be relevant, but I was also considering to
>> pre-process the list of memslots passed to the ioctl and merge
>> operations when necessary, to avoid unnecessary operations.
>>
>> For example, if we are creating an area and punching a hole (assuming
>> this is a valid example), we would have
>>
>> transaction_begin()
>> CREATE(offset=0, memory area)
>> DELETE(memory area)
>> CREATE(offset=0, memory area / 2 - 1)
>> CREATE(offset=memory_area/2, memory area / 2)
>> transaction_commmit()
>>
>> Instead, if we pre-process the memory regions and detect overlaps with
>> an interval tree, we could simplify the above with:
>> CREATE(offset=0, memory area / 2 - 1)
>> CREATE(offset=memory_area/2, memory area / 2)
> 
> As I replied in the private email, I don't think the pre-process here is
> needed, because afaict flat views already handle that.
> 
> See generate_memory_topology() and especially render_memory_region().
> 
> In above example, the 1st CREATE + DELETE shouldn't reach any of the memory
> listners, including the KVM one, because the flatview only contains the
> final layout of the address space when commit() happens.
> 
>>
>> In addition, I was thinking to also provide the operation type (called
>> enum kvm_mr_change) from userspace directly, and not "figure" it
>> ourselves in KVM.
>>
>> The reason for these two changes come from KVM implementation. I know
>> this is no KVM place, but a background explanation might be necessary.
>> Basically KVM 1) figures the request type by looking at the fields
>> passed by userspace (for example mem_size == 0 means DELETE) and the
>> current status of the memslot (id not found means CREATE, found means
>> MOVE/UPDATE_FLAGS), and 2) has 2 memslot lists per address space: the
>> active (visible) and inactive. Any operation is performed in the
>> inactive list, then we "swap" the two so that the change is visible.
>>
>> The "atomic" goal of this serie just means that we want to apply
>> multiple memslot operation and then perform a single "swap".
>> The problem is that each DELETE and MOVE request perform 2 swaps: first
>> substitute current memslot with an invalid one (so that page faults are
>> retried), and then remove the invalid one (in case of MOVE, substitute
>> with new memslot).
>>
>> Therefore:
>> - KVM should ideally pre-process all DELETE/MOVE memslots and perform a
>> first swap(s) to replace the current memslot with an invalid one, and
>> then perform all other operations in order (deletion/move included).
> 
> This sounds correct.
> 
>> This is why QEMU should just send pre-merged MR, so that we don't have
>> CREATE(x)- DELETE(x). Otherwise we would process a DELETE on a MR that
>> doesn't exist yet.
> 
> As discussed above, IMHO pre-merge isn't an issue here and I think the
> sequence in your example won't happen.  But you raised a great question on
> whether we should allow the new ioctl (if there's going to be one) to do
> whatever it wants by batching any sequence of memslot operations.
> 
> More below.
> 
>>
>> - Doing the above is still not enough, as KVM figures what operation to
>> do depending on the current state of the memslots.
>> Assuming we already have an already existing MR y, and now we get the
>> list DELETE(y) CREATE(y/2) (ie reducing y to half its size).
>> In this case the interval tree can't do anything, but it's very hard to
>> understand that the second request in the list is a CREATE, because when
>> KVM performs the check to understand which type of operation it is
>> (before doing any modification at all in the memslot list), it finds
>> that y (memslot with same id) exist, but has a different size than what
>> provided from userspace, therefore it could either fail, or misinterpret
>> it as another operation (currently fails -EINVALID).
> 
> Another good question..  I think that can be partly solved by not allowing
> ID duplication in the same batched ioctl, or maybe we can fail it.  From
> qemu perspective, we may need to change the memslot id allocation to not
> reuse IDs of being-deleted memslots until the batch is processed.
> 
> Something like adding similar INVALID tags to kvm memslots in QEMU when we
> are preparing the batch, then we should only reset memory_size==0 and clear
> INVALID tag after the ioctl returns.  Then during the batch, any new slots
> to be added from kvm_get_free_slot() will not return any duplication of a
> deleting one.
> 
>> If we instead already provide the labels, then we can:
>> 	1. substitute the memslots pointed by DELETE/MOVE with invalid & swap
>> (so it is visible, non-atomic. But we don't care, as we are not deleting
>> anything)
>> 	2. delete the invalid memslot (in the inactive memslot list)
>> 	3. process the other requests (in the inactive memslot list)
>> 	4. single and atomic swap (step 2 and 3 are now visible).
>>
>> What do you think?
> 
> Adding some limitation to the new ioctl makes sense to me, especially
> moving the DELETE/MOVE to be handled earlier, rather than a complete
> mixture of all of them.
> 
> I'm wondering whether the batch ioctl can be layed out separately on the
> operations, then it can be clearly documented on the sequence of handling
> each op:
> 
>   struct kvm_userspace_memory_region_batch {
>          struct kvm_userspace_memory_region deletes[];
>          struct kvm_userspace_memory_region moves[];
>          struct kvm_userspace_memory_region creates[];
>          struct kvm_userspace_memory_region flags_only[];
>   };
> 
> So that the ordering can be very obvious.  But I didn't really think deeper
> than that.
> 
> Side note: do we use MOVE at all in QEMU?
> 
>>
>> Bonus question: with this atomic operation, do we really need the
>> invalid memslot logic for MOVE/DELETE?
> 
> I think so.  Not an expert on that, but iiuc that's to make sure we'll zap
> all shadow paging that maps to the slots being deleted/moved.
> 
> Paolo can always help to keep me honest above.
> 
> One thing I forgot to ask: iirc we used to have a workaround to kick all
> vcpus out, update memory slots, then continue all vcpus.  Would that work
> for us too for the problem you're working on?

As reference, here is one such approach for region resizes only:

https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/

which notes:

"Instead of inhibiting during the region_resize(), we could inhibit for
the hole memory transaction (from begin() to commit()). This could be
nice, because also splitting of memory regions would be atomic (I
remember there was a BUG report regarding that), however, I am not sure
if that might impact any RT users."


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase
  2022-08-27 20:58           ` Peter Xu
  2022-08-30 10:59             ` David Hildenbrand
@ 2022-09-09  8:00             ` Emanuele Giuseppe Esposito
  1 sibling, 0 replies; 23+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-09  8:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: Leonardo Bras Soares Passos, qemu-devel, Paolo Bonzini,
	Michael S. Tsirkin, Cornelia Huck, David Hildenbrand,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, kvm

Hi Peter,

Apologies for the delay to answer you.

[...]
>>
>> - Doing the above is still not enough, as KVM figures what operation to
>> do depending on the current state of the memslots.
>> Assuming we already have an already existing MR y, and now we get the
>> list DELETE(y) CREATE(y/2) (ie reducing y to half its size).
>> In this case the interval tree can't do anything, but it's very hard to
>> understand that the second request in the list is a CREATE, because when
>> KVM performs the check to understand which type of operation it is
>> (before doing any modification at all in the memslot list), it finds
>> that y (memslot with same id) exist, but has a different size than what
>> provided from userspace, therefore it could either fail, or misinterpret
>> it as another operation (currently fails -EINVALID).
> 
> Another good question..  I think that can be partly solved by not allowing
> ID duplication in the same batched ioctl, or maybe we can fail it.  From
> qemu perspective, we may need to change the memslot id allocation to not
> reuse IDs of being-deleted memslots until the batch is processed.
> 
> Something like adding similar INVALID tags to kvm memslots in QEMU when we
> are preparing the batch, then we should only reset memory_size==0 and clear
> INVALID tag after the ioctl returns.  Then during the batch, any new slots
> to be added from kvm_get_free_slot() will not return any duplication of a
> deleting one.

First of all, you're right. No interval tree is needed.

I think the approach I am currently using is something like what you
described above: when a DELETE operation is created in QEMU (there is no
MOVE), I set the invalid_slot flag to 1.
Then KVM will firstly process the requests with invalid_slot == 1, mark
the memslot invalid, and then process all the others (invalid included,
as they need the actual DELETE/MOVE operation).

> 
>> If we instead already provide the labels, then we can:
>> 	1. substitute the memslots pointed by DELETE/MOVE with invalid & swap
>> (so it is visible, non-atomic. But we don't care, as we are not deleting
>> anything)
>> 	2. delete the invalid memslot (in the inactive memslot list)
>> 	3. process the other requests (in the inactive memslot list)
>> 	4. single and atomic swap (step 2 and 3 are now visible).
>>
>> What do you think?
> 
> Adding some limitation to the new ioctl makes sense to me, especially
> moving the DELETE/MOVE to be handled earlier, rather than a complete
> mixture of all of them.
> 
> I'm wondering whether the batch ioctl can be layed out separately on the
> operations, then it can be clearly documented on the sequence of handling
> each op:
> 
>   struct kvm_userspace_memory_region_batch {
>          struct kvm_userspace_memory_region deletes[];
>          struct kvm_userspace_memory_region moves[];
>          struct kvm_userspace_memory_region creates[];
>          struct kvm_userspace_memory_region flags_only[];
>   };
> 
> So that the ordering can be very obvious.  But I didn't really think deeper
> than that.

No, I think it doesn't work. Oder needs to be preserved, I see many
DELETE+CREATE operations on the same slot id.
> 
> Side note: do we use MOVE at all in QEMU?

Nope :)

> 
>>
>> Bonus question: with this atomic operation, do we really need the
>> invalid memslot logic for MOVE/DELETE?
> 
> I think so.  Not an expert on that, but iiuc that's to make sure we'll zap
> all shadow paging that maps to the slots being deleted/moved.
> 
> Paolo can always help to keep me honest above.

Yes, we need to keep that logic.

v2 is coming soon.

Thank you,
Emanuele


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase
  2022-08-30 10:59             ` David Hildenbrand
@ 2022-09-09  8:02               ` Emanuele Giuseppe Esposito
  2022-09-09 11:02                 ` David Hildenbrand
  0 siblings, 1 reply; 23+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-09  8:02 UTC (permalink / raw)
  To: David Hildenbrand, Peter Xu
  Cc: Leonardo Bras Soares Passos, qemu-devel, Paolo Bonzini,
	Michael S. Tsirkin, Cornelia Huck, Philippe Mathieu-Daudé,
	Maxim Levitsky, kvm


>> One thing I forgot to ask: iirc we used to have a workaround to kick all
>> vcpus out, update memory slots, then continue all vcpus.  Would that work
>> for us too for the problem you're working on?
> 
> As reference, here is one such approach for region resizes only:
> 
> https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/
> 
> which notes:
> 
> "Instead of inhibiting during the region_resize(), we could inhibit for
> the hole memory transaction (from begin() to commit()). This could be
> nice, because also splitting of memory regions would be atomic (I
> remember there was a BUG report regarding that), however, I am not sure
> if that might impact any RT users."
> 
> 
I read:

"Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it will
temporarily drop the BQL - something most callers can't handle (esp.
when called from vcpu context e.g., in virtio code)."

Thank you,
Emanuele


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 1/2] softmmu/memory: add missing begin/commit callback calls
  2022-08-27 21:03         ` Peter Xu
@ 2022-09-09  8:02           ` Emanuele Giuseppe Esposito
  0 siblings, 0 replies; 23+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-09  8:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Paolo Bonzini, Michael S. Tsirkin, Cornelia Huck,
	David Hildenbrand, Philippe Mathieu-Daudé,
	Maxim Levitsky, kvm



Am 27/08/2022 um 23:03 schrieb Peter Xu:
> On Fri, Aug 26, 2022 at 10:13:47AM -0400, Peter Xu wrote:
>> On Fri, Aug 26, 2022 at 03:53:09PM +0200, Emanuele Giuseppe Esposito wrote:
>>> What do you mean "will empty all regions with those listeners"?
>>> But yes theoretically vhost-vdpa and physmem have commit callbacks that
>>> are independent from whether region_add or other callbacks have been called.
>>> For kvm and probably vhost it would be no problem, since there won't be
>>> any list to iterate on.
>>
>> Right, begin()/commit() is for address space update, so it should be fine
>> to have nothing to commit, sorry.
> 
> Hold on..
> 
> When I was replying to your patch 2 and reading the code around, I fount
> that this patch does affect vhost..  see region_nop() hook and also vhost's
> version of vhost_region_addnop().  I think vhost will sync its memory
> layout for each of the commit(), and any newly created AS could emptify
> vhost memory list even if registered on address_space_memory.
> 
> The other thing is address_space_update_topology() seems to be only used by
> address_space_init().  It means I don't think there should have any
> listener registered to this AS anyway.. :) So iiuc this patch (even if
> converting to loop over per-as memory listeners) is not needed.
> 
Agree, dropping this patch :)

Emanuele


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase
  2022-08-26 14:44       ` David Hildenbrand
@ 2022-09-09  8:04         ` Emanuele Giuseppe Esposito
  0 siblings, 0 replies; 23+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-09-09  8:04 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Paolo Bonzini, Michael S. Tsirkin, Cornelia Huck, Peter Xu,
	Philippe Mathieu-Daudé,
	Maxim Levitsky, kvm



Am 26/08/2022 um 16:44 schrieb David Hildenbrand:
> On 26.08.22 16:32, Emanuele Giuseppe Esposito wrote:
>>
>>
>> Am 26/08/2022 um 16:15 schrieb David Hildenbrand:
>>> On 16.08.22 12:12, Emanuele Giuseppe Esposito wrote:
>>>> Instead of sending a single ioctl every time ->region_* or ->log_*
>>>> callbacks are called, "queue" all memory regions in a list that will
>>>> be emptied only when committing.
>>>>
>>>
>>> Out of interest, how many such regions does the ioctl support? As many
>>> as KVM theoretically supports? (32k IIRC)
>>>
>>
>> I assume you mean for the new ioctl, but yes that's a good question.
>>
>> The problem here is that we could have more than a single update per
>> memory region. So we are not limited anymore to the number of regions,
>> but the number of operations * number of region.
>>
>> I was thinking, maybe when pre-processing QEMU could divide a single
>> transaction into multiple atomic operations (ie operations on the same
>> memory region)? That way avoid sending a single ioctl with 32k *
>> #operation elements. Is that what you mean?
> 
> Oh, so we're effectively collecting slot updates and not the complete
> "slot" view, got it. Was the kernel series already sent so I can have a
> look?

I am going to send it today. I got something working, but it's a little
bit messy on the invalid slots part.

> 
> Note that there are some possible slot updates (like a split, or a
> merge) that involve multiple slots and that would have to be part of the
> same "transaction" to be atomic.
> 
> 

Limiting the size of operations in the IOCTL can be something for the
future. Currently it's already pretty complicated as it is (in the KVM
side), plus I don't see ioctls with more than 8 requests.

Thank you,
Emanuele


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase
  2022-09-09  8:02               ` Emanuele Giuseppe Esposito
@ 2022-09-09 11:02                 ` David Hildenbrand
  0 siblings, 0 replies; 23+ messages in thread
From: David Hildenbrand @ 2022-09-09 11:02 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, Peter Xu
  Cc: Leonardo Bras Soares Passos, qemu-devel, Paolo Bonzini,
	Michael S. Tsirkin, Cornelia Huck, Philippe Mathieu-Daudé,
	Maxim Levitsky, kvm

On 09.09.22 10:02, Emanuele Giuseppe Esposito wrote:
> 
>>> One thing I forgot to ask: iirc we used to have a workaround to kick all
>>> vcpus out, update memory slots, then continue all vcpus.  Would that work
>>> for us too for the problem you're working on?
>>
>> As reference, here is one such approach for region resizes only:
>>
>> https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/
>>
>> which notes:
>>
>> "Instead of inhibiting during the region_resize(), we could inhibit for
>> the hole memory transaction (from begin() to commit()). This could be
>> nice, because also splitting of memory regions would be atomic (I
>> remember there was a BUG report regarding that), however, I am not sure
>> if that might impact any RT users."
>>
>>
> I read:
> 
> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it will
> temporarily drop the BQL - something most callers can't handle (esp.
> when called from vcpu context e.g., in virtio code)."

... that's why the patch takes a different approach? :)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2022-09-09 11:02 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-16 10:12 [RFC PATCH 0/2] accel/kvm: extend kvm memory listener to support Emanuele Giuseppe Esposito
2022-08-16 10:12 ` [RFC PATCH 1/2] softmmu/memory: add missing begin/commit callback calls Emanuele Giuseppe Esposito
2022-08-18 19:34   ` Peter Xu
2022-08-26 13:53     ` Emanuele Giuseppe Esposito
2022-08-26 14:13       ` Peter Xu
2022-08-27 21:03         ` Peter Xu
2022-09-09  8:02           ` Emanuele Giuseppe Esposito
2022-08-16 10:12 ` [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase Emanuele Giuseppe Esposito
2022-08-18 20:04   ` Peter Xu
2022-08-19  0:55     ` Leonardo Bras Soares Passos
2022-08-22 14:10       ` Peter Xu
2022-08-26 14:07         ` Emanuele Giuseppe Esposito
2022-08-27 20:58           ` Peter Xu
2022-08-30 10:59             ` David Hildenbrand
2022-09-09  8:02               ` Emanuele Giuseppe Esposito
2022-09-09 11:02                 ` David Hildenbrand
2022-09-09  8:00             ` Emanuele Giuseppe Esposito
2022-08-22  9:08   ` Cornelia Huck
2022-08-26 13:53     ` Emanuele Giuseppe Esposito
2022-08-26 14:15   ` David Hildenbrand
2022-08-26 14:32     ` Emanuele Giuseppe Esposito
2022-08-26 14:44       ` David Hildenbrand
2022-09-09  8:04         ` Emanuele Giuseppe Esposito

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).