* [PATCH v3 0/3] KVM: allow listener to stop all vcpus before
@ 2022-11-11 15:47 Emanuele Giuseppe Esposito
  2022-11-11 15:47 ` [PATCH v3 1/3] accel: introduce accelerator blocker API Emanuele Giuseppe Esposito
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-11-11 15:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, kvm, Emanuele Giuseppe Esposito

QEMU needs to perform memslots operations like merging and splitting,
and each operation requires more than a single ioctl.
Therefore, if a vcpu is concurrently reading the same memslots,
it could end up reading something that was temporarily deleted.
For example, merging two memslots into one would imply:
DELETE(m1)
DELETE(m2)
CREATE(m1+m2)

And a vcpu could attempt to read m2 right after it is deleted, but
before the new one is created.

This approach is 100% QEMU-based. No KVM API modification is involved,
but it implies that QEMU must make sure no new ioctl is running and all
vcpus are stopped.
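
With the blocker API introduced in patch 1, the listener can wrap the
non-atomic update roughly like this (a simplified sketch; the real code
is kvm_region_commit() in patch 3, and DELETE/CREATE stand for the
corresponding memslot ioctls):

    kvm_slots_lock();
    accel_ioctl_inhibit_begin();  /* wait for running ioctls, block new ones */
    DELETE(m1);
    DELETE(m2);
    CREATE(m1+m2);
    accel_ioctl_inhibit_end();    /* let blocked ioctls continue */
    kvm_slots_unlock();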

The logic and code are basically taken from a proposal that David
Hildenbrand made a while ago while reviewing a previous attempt, in
which I suggested solving the above problem directly in KVM by
extending its API.

This is the original code:
https://github.com/davidhildenbrand/qemu/commit/86b1bf546a8d00908e33f7362b0b61e2be8dbb7a

I just split the patch into three smaller patches, and used a
QemuLockCnt instead of a counter + mutex.

RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1979276

Emanuele
---
v3:
* minor fixes on comments and docs
* improved accel_ioctl_inhibit_begin
* drop QSIMPLEQ_REMOVE from kvm_commit

v2:
* use QemuEvent instead of spinning in ioctl_inhibit_begin
* move the blocker API in a separate file and header, so that other accel can
  use it if they want.

David Hildenbrand (1):
  kvm: Atomic memslot updates

Emanuele Giuseppe Esposito (2):
  accel: introduce accelerator blocker API
  KVM: keep track of running ioctls

 accel/accel-blocker.c          | 154 +++++++++++++++++++++++++++++++++
 accel/kvm/kvm-all.c            | 108 ++++++++++++++++++++---
 accel/meson.build              |   2 +-
 hw/core/cpu-common.c           |   2 +
 include/hw/core/cpu.h          |   3 +
 include/sysemu/accel-blocker.h |  56 ++++++++++++
 include/sysemu/kvm_int.h       |   8 ++
 7 files changed, 321 insertions(+), 12 deletions(-)
 create mode 100644 accel/accel-blocker.c
 create mode 100644 include/sysemu/accel-blocker.h

-- 
2.31.1



* [PATCH v3 1/3] accel: introduce accelerator blocker API
  2022-11-11 15:47 [PATCH v3 0/3] KVM: allow listener to stop all vcpus before Emanuele Giuseppe Esposito
@ 2022-11-11 15:47 ` Emanuele Giuseppe Esposito
  2022-11-18  7:32   ` Philippe Mathieu-Daudé
  2022-12-02  6:56   ` Robert Hoo
  2022-11-11 15:47 ` [PATCH v3 2/3] KVM: keep track of running ioctls Emanuele Giuseppe Esposito
  2022-11-11 15:47 ` [PATCH v3 3/3] kvm: Atomic memslot updates Emanuele Giuseppe Esposito
  2 siblings, 2 replies; 13+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-11-11 15:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, kvm, Emanuele Giuseppe Esposito

This API allows the accelerators to prevent vcpus from issuing
new ioctls while execting a critical section marked with the
accel_ioctl_inhibit_begin/end functions.

Note that all functions submitting ioctls must mark where the
ioctl is being called with accel_{cpu_}ioctl_begin/end().

This API requires the caller to always hold the BQL.
API documentation is in sysemu/accel-blocker.h

Internally, it uses a QemuLockCnt together with a per-CPU QemuLockCnt
(to minimize cache line bouncing) to ensure that no new ioctls run
once the critical section starts, and a QemuEvent to wait for all
running ioctls to finish.
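
For illustration, the expected usage on both sides is roughly the
following (a simplified sketch; patch 2 adds the actual begin/end
call sites around the KVM ioctl wrappers):

    /* vcpu thread, around every ioctl */
    accel_cpu_ioctl_begin(cpu);
    ret = ioctl(cpu->kvm_fd, type, arg);
    accel_cpu_ioctl_end(cpu);

    /* inhibitor, with the BQL held */
    accel_ioctl_inhibit_begin();
    /* ... no new vcpu ioctl can run here ... */
    accel_ioctl_inhibit_end();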

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 accel/accel-blocker.c          | 154 +++++++++++++++++++++++++++++++++
 accel/meson.build              |   2 +-
 hw/core/cpu-common.c           |   2 +
 include/hw/core/cpu.h          |   3 +
 include/sysemu/accel-blocker.h |  56 ++++++++++++
 5 files changed, 216 insertions(+), 1 deletion(-)
 create mode 100644 accel/accel-blocker.c
 create mode 100644 include/sysemu/accel-blocker.h

diff --git a/accel/accel-blocker.c b/accel/accel-blocker.c
new file mode 100644
index 0000000000..1e7f423462
--- /dev/null
+++ b/accel/accel-blocker.c
@@ -0,0 +1,154 @@
+/*
+ * Lock to inhibit accelerator ioctls
+ *
+ * Copyright (c) 2022 Red Hat Inc.
+ *
+ * Author: Emanuele Giuseppe Esposito       <eesposit@redhat.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/thread.h"
+#include "qemu/main-loop.h"
+#include "hw/core/cpu.h"
+#include "sysemu/accel-blocker.h"
+
+static QemuLockCnt accel_in_ioctl_lock;
+static QemuEvent accel_in_ioctl_event;
+
+void accel_blocker_init(void)
+{
+    qemu_lockcnt_init(&accel_in_ioctl_lock);
+    qemu_event_init(&accel_in_ioctl_event, false);
+}
+
+void accel_ioctl_begin(void)
+{
+    if (likely(qemu_mutex_iothread_locked())) {
+        return;
+    }
+
+    /* block if lock is taken in accel_ioctl_inhibit_begin() */
+    qemu_lockcnt_inc(&accel_in_ioctl_lock);
+}
+
+void accel_ioctl_end(void)
+{
+    if (likely(qemu_mutex_iothread_locked())) {
+        return;
+    }
+
+    qemu_lockcnt_dec(&accel_in_ioctl_lock);
+    /* change event to SET. If event was BUSY, wake up all waiters */
+    qemu_event_set(&accel_in_ioctl_event);
+}
+
+void accel_cpu_ioctl_begin(CPUState *cpu)
+{
+    if (unlikely(qemu_mutex_iothread_locked())) {
+        return;
+    }
+
+    /* block if lock is taken in accel_ioctl_inhibit_begin() */
+    qemu_lockcnt_inc(&cpu->in_ioctl_lock);
+}
+
+void accel_cpu_ioctl_end(CPUState *cpu)
+{
+    if (unlikely(qemu_mutex_iothread_locked())) {
+        return;
+    }
+
+    qemu_lockcnt_dec(&cpu->in_ioctl_lock);
+    /* change event to SET. If event was BUSY, wake up all waiters */
+    qemu_event_set(&accel_in_ioctl_event);
+}
+
+static bool accel_has_to_wait(void)
+{
+    CPUState *cpu;
+    bool needs_to_wait = false;
+
+    CPU_FOREACH(cpu) {
+        if (qemu_lockcnt_count(&cpu->in_ioctl_lock)) {
+            /* exit the ioctl, if vcpu is running it */
+            qemu_cpu_kick(cpu);
+            needs_to_wait = true;
+        }
+    }
+
+    return needs_to_wait || qemu_lockcnt_count(&accel_in_ioctl_lock);
+}
+
+void accel_ioctl_inhibit_begin(void)
+{
+    CPUState *cpu;
+
+    /*
+     * We allow to inhibit only when holding the BQL, so we can identify
+     * when an inhibitor wants to issue an ioctl easily.
+     */
+    g_assert(qemu_mutex_iothread_locked());
+
+    /* Block further invocations of the ioctls outside the BQL.  */
+    CPU_FOREACH(cpu) {
+        qemu_lockcnt_lock(&cpu->in_ioctl_lock);
+    }
+    qemu_lockcnt_lock(&accel_in_ioctl_lock);
+
+    /* Keep waiting while there are running ioctls */
+    while (true) {
+
+        /* Reset event to FREE. */
+        qemu_event_reset(&accel_in_ioctl_event);
+
+        if (accel_has_to_wait()) {
+            /*
+             * If event is still FREE, and there are ioctls still in progress,
+             * wait.
+             *
+             *  If an ioctl finishes before qemu_event_wait(), it will change
+             * the event state to SET. This will prevent qemu_event_wait() from
+             * blocking, but it's not a problem because if other ioctls are
+             * still running the loop will iterate once more and reset the event
+             * status to FREE so that it can wait properly.
+             *
+             * If an ioctl finishes while qemu_event_wait() is blocking, then
+             * it will be woken up, but also here the while loop makes sure
+             * to re-enter the wait if there are other running ioctls.
+             */
+            qemu_event_wait(&accel_in_ioctl_event);
+        } else {
+            /* No ioctl is running */
+            return;
+        }
+    }
+}
+
+void accel_ioctl_inhibit_end(void)
+{
+    CPUState *cpu;
+
+    qemu_lockcnt_unlock(&accel_in_ioctl_lock);
+    CPU_FOREACH(cpu) {
+        qemu_lockcnt_unlock(&cpu->in_ioctl_lock);
+    }
+}
+
diff --git a/accel/meson.build b/accel/meson.build
index b9a963cf80..a0d49c4f31 100644
--- a/accel/meson.build
+++ b/accel/meson.build
@@ -1,4 +1,4 @@
-specific_ss.add(files('accel-common.c'))
+specific_ss.add(files('accel-common.c', 'accel-blocker.c'))
 softmmu_ss.add(files('accel-softmmu.c'))
 user_ss.add(files('accel-user.c'))
 
diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c
index f9fdd46b9d..8d6a4b1b65 100644
--- a/hw/core/cpu-common.c
+++ b/hw/core/cpu-common.c
@@ -237,6 +237,7 @@ static void cpu_common_initfn(Object *obj)
     cpu->nr_threads = 1;
 
     qemu_mutex_init(&cpu->work_mutex);
+    qemu_lockcnt_init(&cpu->in_ioctl_lock);
     QSIMPLEQ_INIT(&cpu->work_list);
     QTAILQ_INIT(&cpu->breakpoints);
     QTAILQ_INIT(&cpu->watchpoints);
@@ -248,6 +249,7 @@ static void cpu_common_finalize(Object *obj)
 {
     CPUState *cpu = CPU(obj);
 
+    qemu_lockcnt_destroy(&cpu->in_ioctl_lock);
     qemu_mutex_destroy(&cpu->work_mutex);
 }
 
diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index f9b58773f7..15053663bc 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -397,6 +397,9 @@ struct CPUState {
     uint32_t kvm_fetch_index;
     uint64_t dirty_pages;
 
+    /* Used by accel-blocker: CPU is executing an ioctl() */
+    QemuLockCnt in_ioctl_lock;
+
     /* Used for events with 'vcpu' and *without* the 'disabled' properties */
     DECLARE_BITMAP(trace_dstate_delayed, CPU_TRACE_DSTATE_MAX_EVENTS);
     DECLARE_BITMAP(trace_dstate, CPU_TRACE_DSTATE_MAX_EVENTS);
diff --git a/include/sysemu/accel-blocker.h b/include/sysemu/accel-blocker.h
new file mode 100644
index 0000000000..72020529ef
--- /dev/null
+++ b/include/sysemu/accel-blocker.h
@@ -0,0 +1,56 @@
+/*
+ * Accelerator blocking API, to prevent new ioctls from starting and to wait
+ * for the running ones to finish.
+ * This mechanism differs from pause/resume_all_vcpus() in that it does not
+ * release the BQL.
+ *
+ *  Copyright (c) 2022 Red Hat Inc.
+ *
+ * Author: Emanuele Giuseppe Esposito       <eesposit@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+#ifndef ACCEL_BLOCKER_H
+#define ACCEL_BLOCKER_H
+
+#include "qemu/osdep.h"
+#include "sysemu/cpus.h"
+
+extern void accel_blocker_init(void);
+
+/*
+ * accel_{cpu_}ioctl_begin/end:
+ * Mark when ioctl is about to run or just finished.
+ *
+ * accel_{cpu_}ioctl_begin will block after accel_ioctl_inhibit_begin() is
+ * called, preventing new ioctls from running. They will continue only after
+ * accel_ioctl_inibith_end().
+ */
+extern void accel_ioctl_begin(void);
+extern void accel_ioctl_end(void);
+extern void accel_cpu_ioctl_begin(CPUState *cpu);
+extern void accel_cpu_ioctl_end(CPUState *cpu);
+
+/*
+ * accel_ioctl_inhibit_begin: start critical section
+ *
+ * This function makes sure that:
+ * 1) incoming accel_{cpu_}ioctl_begin() calls block
+ * 2) all ioctls that were already running reach
+ *    accel_{cpu_}ioctl_end(), kicking vcpus if necessary.
+ *
+ * This allows the caller to access shared data or perform operations without
+ * worrying about concurrent vcpu accesses.
+ */
+extern void accel_ioctl_inhibit_begin(void);
+
+/*
+ * accel_ioctl_inhibit_end: end critical section started by
+ * accel_ioctl_inhibit_begin()
+ *
+ * This function allows blocked accel_{cpu_}ioctl_begin() to continue.
+ */
+extern void accel_ioctl_inhibit_end(void);
+
+#endif /* ACCEL_BLOCKER_H */
-- 
2.31.1



* [PATCH v3 2/3] KVM: keep track of running ioctls
  2022-11-11 15:47 [PATCH v3 0/3] KVM: allow listener to stop all vcpus before Emanuele Giuseppe Esposito
  2022-11-11 15:47 ` [PATCH v3 1/3] accel: introduce accelerator blocker API Emanuele Giuseppe Esposito
@ 2022-11-11 15:47 ` Emanuele Giuseppe Esposito
  2022-11-17 19:27   ` David Hildenbrand
  2022-12-02  6:54   ` Robert Hoo
  2022-11-11 15:47 ` [PATCH v3 3/3] kvm: Atomic memslot updates Emanuele Giuseppe Esposito
  2 siblings, 2 replies; 13+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-11-11 15:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, kvm, Emanuele Giuseppe Esposito, David Hildenbrand

Using the new accel-blocker API, mark where ioctls are being called
in KVM. Next, we will implement the critical section that will take
care of performing memslots modifications atomically, therefore
preventing any new ioctl from running and allowing the running ones
to finish.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 accel/kvm/kvm-all.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index f99b0becd8..ff660fd469 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2310,6 +2310,7 @@ static int kvm_init(MachineState *ms)
     assert(TARGET_PAGE_SIZE <= qemu_real_host_page_size());
 
     s->sigmask_len = 8;
+    accel_blocker_init();
 
 #ifdef KVM_CAP_SET_GUEST_DEBUG
     QTAILQ_INIT(&s->kvm_sw_breakpoints);
@@ -3014,7 +3015,9 @@ int kvm_vm_ioctl(KVMState *s, int type, ...)
     va_end(ap);
 
     trace_kvm_vm_ioctl(type, arg);
+    accel_ioctl_begin();
     ret = ioctl(s->vmfd, type, arg);
+    accel_ioctl_end();
     if (ret == -1) {
         ret = -errno;
     }
@@ -3032,7 +3035,9 @@ int kvm_vcpu_ioctl(CPUState *cpu, int type, ...)
     va_end(ap);
 
     trace_kvm_vcpu_ioctl(cpu->cpu_index, type, arg);
+    accel_cpu_ioctl_begin(cpu);
     ret = ioctl(cpu->kvm_fd, type, arg);
+    accel_cpu_ioctl_end(cpu);
     if (ret == -1) {
         ret = -errno;
     }
@@ -3050,7 +3055,9 @@ int kvm_device_ioctl(int fd, int type, ...)
     va_end(ap);
 
     trace_kvm_device_ioctl(fd, type, arg);
+    accel_ioctl_begin();
     ret = ioctl(fd, type, arg);
+    accel_ioctl_end();
     if (ret == -1) {
         ret = -errno;
     }
-- 
2.31.1



* [PATCH v3 3/3] kvm: Atomic memslot updates
  2022-11-11 15:47 [PATCH v3 0/3] KVM: allow listener to stop all vcpus before Emanuele Giuseppe Esposito
  2022-11-11 15:47 ` [PATCH v3 1/3] accel: introduce accelerator blocker API Emanuele Giuseppe Esposito
  2022-11-11 15:47 ` [PATCH v3 2/3] KVM: keep track of running ioctls Emanuele Giuseppe Esposito
@ 2022-11-11 15:47 ` Emanuele Giuseppe Esposito
  2 siblings, 0 replies; 13+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-11-11 15:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, kvm, David Hildenbrand, Emanuele Giuseppe Esposito

From: David Hildenbrand <david@redhat.com>

If we update an existing memslot (e.g., resize, split), we temporarily
remove the memslot to re-add it immediately afterwards. These updates
are not atomic, especially not for KVM VCPU threads, such that we can
get spurious faults.

Let's inhibit most KVM ioctls while performing relevant updates, such
that we can perform the update just as if it would happen atomically
without additional kernel support.

We capture the add/del changes and apply them in the notifier commit
stage instead. There, we can check for overlaps and perform the ioctl
inhibiting only if really required (-> overlap).

To keep things simple we don't perform additional checks that wouldn't
actually result in an overlap -- such as !RAM memory regions in some
cases (see kvm_set_phys_mem()).

To minimize cache-line bouncing, use a separate indicator
(in_ioctl_lock) per CPU.  Also, make sure to hold the kvm_slots_lock
while performing both actions (removing+re-adding).

We have to wait until all IOCTLs have exited and block new ones from
getting executed.

This approach cannot result in a deadlock as long as the inhibitor does
not hold any locks that might hinder an IOCTL from getting finished and
exited - something fairly unusual. The inhibitor will always hold the BQL.

AFAIKs, one possible candidate would be userfaultfd. If a page cannot be
placed (e.g., during postcopy), because we're waiting for a lock, or if the
userfaultfd thread cannot process a fault, because it is waiting for a
lock, there could be a deadlock. However, the BQL is not applicable here,
because any other guest memory access while holding the BQL would already
result in a deadlock.

Nothing else in the kernel should block forever and wait for userspace
intervention.

Note: pause_all_vcpus()/resume_all_vcpus() or
start_exclusive()/end_exclusive() cannot be used, as they either drop
the BQL or require to be called without the BQL - something inhibitors
cannot handle. We need a low-level locking mechanism that is
deadlock-free even when not releasing the BQL.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Tested-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 accel/kvm/kvm-all.c      | 101 ++++++++++++++++++++++++++++++++++-----
 include/sysemu/kvm_int.h |   8 ++++
 2 files changed, 98 insertions(+), 11 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index ff660fd469..39ed30ab59 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -31,6 +31,7 @@
 #include "sysemu/kvm_int.h"
 #include "sysemu/runstate.h"
 #include "sysemu/cpus.h"
+#include "sysemu/accel-blocker.h"
 #include "qemu/bswap.h"
 #include "exec/memory.h"
 #include "exec/ram_addr.h"
@@ -46,6 +47,7 @@
 #include "sysemu/hw_accel.h"
 #include "kvm-cpus.h"
 #include "sysemu/dirtylimit.h"
+#include "qemu/range.h"
 
 #include "hw/boards.h"
 #include "monitor/stats.h"
@@ -1292,6 +1294,7 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size)
     kvm_max_slot_size = max_slot_size;
 }
 
+/* Called with KVMMemoryListener.slots_lock held */
 static void kvm_set_phys_mem(KVMMemoryListener *kml,
                              MemoryRegionSection *section, bool add)
 {
@@ -1326,14 +1329,12 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
     ram = memory_region_get_ram_ptr(mr) + mr_offset;
     ram_start_offset = memory_region_get_ram_addr(mr) + mr_offset;
 
-    kvm_slots_lock();
-
     if (!add) {
         do {
             slot_size = MIN(kvm_max_slot_size, size);
             mem = kvm_lookup_matching_slot(kml, start_addr, slot_size);
             if (!mem) {
-                goto out;
+                return;
             }
             if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
                 /*
@@ -1371,7 +1372,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
             start_addr += slot_size;
             size -= slot_size;
         } while (size);
-        goto out;
+        return;
     }
 
     /* register the new slot */
@@ -1396,9 +1397,6 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
         ram += slot_size;
         size -= slot_size;
     } while (size);
-
-out:
-    kvm_slots_unlock();
 }
 
 static void *kvm_dirty_ring_reaper_thread(void *data)
@@ -1455,18 +1453,95 @@ static void kvm_region_add(MemoryListener *listener,
                            MemoryRegionSection *section)
 {
     KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
+    KVMMemoryUpdate *update;
+
+    update = g_new0(KVMMemoryUpdate, 1);
+    update->section = *section;
 
-    memory_region_ref(section->mr);
-    kvm_set_phys_mem(kml, section, true);
+    QSIMPLEQ_INSERT_TAIL(&kml->transaction_add, update, next);
 }
 
 static void kvm_region_del(MemoryListener *listener,
                            MemoryRegionSection *section)
 {
     KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
+    KVMMemoryUpdate *update;
+
+    update = g_new0(KVMMemoryUpdate, 1);
+    update->section = *section;
+
+    QSIMPLEQ_INSERT_TAIL(&kml->transaction_del, update, next);
+}
+
+static void kvm_region_commit(MemoryListener *listener)
+{
+    KVMMemoryListener *kml = container_of(listener, KVMMemoryListener,
+                                          listener);
+    KVMMemoryUpdate *u1, *u2;
+    bool need_inhibit = false;
+
+    if (QSIMPLEQ_EMPTY(&kml->transaction_add) &&
+        QSIMPLEQ_EMPTY(&kml->transaction_del)) {
+        return;
+    }
+
+    /*
+     * We have to be careful when regions to add overlap with ranges to remove.
+     * We have to simulate atomic KVM memslot updates by making sure no ioctl()
+     * is currently active.
+     *
+     * The lists are ordered by address, so it's easy to find overlaps.
+     */
+    u1 = QSIMPLEQ_FIRST(&kml->transaction_del);
+    u2 = QSIMPLEQ_FIRST(&kml->transaction_add);
+    while (u1 && u2) {
+        Range r1, r2;
+
+        range_init_nofail(&r1, u1->section.offset_within_address_space,
+                          int128_get64(u1->section.size));
+        range_init_nofail(&r2, u2->section.offset_within_address_space,
+                          int128_get64(u2->section.size));
+
+        if (range_overlaps_range(&r1, &r2)) {
+            need_inhibit = true;
+            break;
+        }
+        if (range_lob(&r1) < range_lob(&r2)) {
+            u1 = QSIMPLEQ_NEXT(u1, next);
+        } else {
+            u2 = QSIMPLEQ_NEXT(u2, next);
+        }
+    }
+
+    kvm_slots_lock();
+    if (need_inhibit) {
+        accel_ioctl_inhibit_begin();
+    }
+
+    /* Remove all memslots before adding the new ones. */
+    while (!QSIMPLEQ_EMPTY(&kml->transaction_del)) {
+        u1 = QSIMPLEQ_FIRST(&kml->transaction_del);
+        QSIMPLEQ_REMOVE_HEAD(&kml->transaction_del, next);
 
-    kvm_set_phys_mem(kml, section, false);
-    memory_region_unref(section->mr);
+        kvm_set_phys_mem(kml, &u1->section, false);
+        memory_region_unref(u1->section.mr);
+
+        g_free(u1);
+    }
+    while (!QSIMPLEQ_EMPTY(&kml->transaction_add)) {
+        u1 = QSIMPLEQ_FIRST(&kml->transaction_add);
+        QSIMPLEQ_REMOVE_HEAD(&kml->transaction_add, next);
+
+        memory_region_ref(u1->section.mr);
+        kvm_set_phys_mem(kml, &u1->section, true);
+
+        g_free(u1);
+    }
+
+    if (need_inhibit) {
+        accel_ioctl_inhibit_end();
+    }
+    kvm_slots_unlock();
 }
 
 static void kvm_log_sync(MemoryListener *listener,
@@ -1610,8 +1685,12 @@ void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
         kml->slots[i].slot = i;
     }
 
+    QSIMPLEQ_INIT(&kml->transaction_add);
+    QSIMPLEQ_INIT(&kml->transaction_del);
+
     kml->listener.region_add = kvm_region_add;
     kml->listener.region_del = kvm_region_del;
+    kml->listener.commit = kvm_region_commit;
     kml->listener.log_start = kvm_log_start;
     kml->listener.log_stop = kvm_log_stop;
     kml->listener.priority = 10;
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index 3b4adcdc10..60b520a13e 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -12,6 +12,7 @@
 #include "exec/memory.h"
 #include "qapi/qapi-types-common.h"
 #include "qemu/accel.h"
+#include "qemu/queue.h"
 #include "sysemu/kvm.h"
 
 typedef struct KVMSlot
@@ -31,10 +32,17 @@ typedef struct KVMSlot
     ram_addr_t ram_start_offset;
 } KVMSlot;
 
+typedef struct KVMMemoryUpdate {
+    QSIMPLEQ_ENTRY(KVMMemoryUpdate) next;
+    MemoryRegionSection section;
+} KVMMemoryUpdate;
+
 typedef struct KVMMemoryListener {
     MemoryListener listener;
     KVMSlot *slots;
     int as_id;
+    QSIMPLEQ_HEAD(, KVMMemoryUpdate) transaction_add;
+    QSIMPLEQ_HEAD(, KVMMemoryUpdate) transaction_del;
 } KVMMemoryListener;
 
 #define KVM_MSI_HASHTAB_SIZE    256
-- 
2.31.1



* Re: [PATCH v3 2/3] KVM: keep track of running ioctls
  2022-11-11 15:47 ` [PATCH v3 2/3] KVM: keep track of running ioctls Emanuele Giuseppe Esposito
@ 2022-11-17 19:27   ` David Hildenbrand
  2022-11-18  9:53     ` Emanuele Giuseppe Esposito
  2022-12-02  6:54   ` Robert Hoo
  1 sibling, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2022-11-17 19:27 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-devel
  Cc: Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, kvm

On 11.11.22 16:47, Emanuele Giuseppe Esposito wrote:
> Using the new accel-blocker API, mark where ioctls are being called
> in KVM. Next, we will implement the critical section that will take
> care of performing memslots modifications atomically, therefore
> preventing any new ioctl from running and allowing the running ones
> to finish.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

You might want to drop that and instead mention something like "This 
patch is based on a prototype patch by David Hildenbrand".

> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
>   accel/kvm/kvm-all.c | 7 +++++++
>   1 file changed, 7 insertions(+)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index f99b0becd8..ff660fd469 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -2310,6 +2310,7 @@ static int kvm_init(MachineState *ms)
>       assert(TARGET_PAGE_SIZE <= qemu_real_host_page_size());
>   
>       s->sigmask_len = 8;
> +    accel_blocker_init();
>   
>   #ifdef KVM_CAP_SET_GUEST_DEBUG
>       QTAILQ_INIT(&s->kvm_sw_breakpoints);
> @@ -3014,7 +3015,9 @@ int kvm_vm_ioctl(KVMState *s, int type, ...)
>       va_end(ap);
>   
>       trace_kvm_vm_ioctl(type, arg);
> +    accel_ioctl_begin();
>       ret = ioctl(s->vmfd, type, arg);
> +    accel_ioctl_end();
>       if (ret == -1) {
>           ret = -errno;
>       }
> @@ -3032,7 +3035,9 @@ int kvm_vcpu_ioctl(CPUState *cpu, int type, ...)
>       va_end(ap);
>   
>       trace_kvm_vcpu_ioctl(cpu->cpu_index, type, arg);
> +    accel_cpu_ioctl_begin(cpu);
>       ret = ioctl(cpu->kvm_fd, type, arg);
> +    accel_cpu_ioctl_end(cpu);
>       if (ret == -1) {
>           ret = -errno;
>       }
> @@ -3050,7 +3055,9 @@ int kvm_device_ioctl(int fd, int type, ...)
>       va_end(ap);
>   
>       trace_kvm_device_ioctl(fd, type, arg);
> +    accel_ioctl_begin();
>       ret = ioctl(fd, type, arg);
> +    accel_ioctl_end();
>       if (ret == -1) {
>           ret = -errno;
>       }

I recall that I had some additional patches that tried to catch some
more calls:

https://lore.kernel.org/qemu-devel/20200312161217.3590-2-david@redhat.com/

https://lore.kernel.org/qemu-devel/20200312161217.3590-3-david@redhat.com/

Do they still apply? Is there more?

-- 
Thanks,

David / dhildenb



* Re: [PATCH v3 1/3] accel: introduce accelerator blocker API
  2022-11-11 15:47 ` [PATCH v3 1/3] accel: introduce accelerator blocker API Emanuele Giuseppe Esposito
@ 2022-11-18  7:32   ` Philippe Mathieu-Daudé
  2022-11-18  7:35     ` Philippe Mathieu-Daudé
  2022-12-02  6:56   ` Robert Hoo
  1 sibling, 1 reply; 13+ messages in thread
From: Philippe Mathieu-Daudé @ 2022-11-18  7:32 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-devel
  Cc: Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum, Yanan Wang, kvm

On 11/11/22 16:47, Emanuele Giuseppe Esposito wrote:
> This API allows the accelerators to prevent vcpus from issuing
> new ioctls while execting a critical section marked with the
> accel_ioctl_inhibit_begin/end functions.
> 
> Note that all functions submitting ioctls must mark where the
> ioctl is being called with accel_{cpu_}ioctl_begin/end().
> 
> This API requires the caller to always hold the BQL.
> API documentation is in sysemu/accel-blocker.h
> 
> Internally, it uses a QemuLockCnt together with a per-CPU QemuLockCnt
> (to minimize cache line bouncing) to keep avoid that new ioctls
> run when the critical section starts, and a QemuEvent to wait
> that all running ioctls finish.
> 
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
>   accel/accel-blocker.c          | 154 +++++++++++++++++++++++++++++++++
>   accel/meson.build              |   2 +-
>   hw/core/cpu-common.c           |   2 +
>   include/hw/core/cpu.h          |   3 +
>   include/sysemu/accel-blocker.h |  56 ++++++++++++
>   5 files changed, 216 insertions(+), 1 deletion(-)
>   create mode 100644 accel/accel-blocker.c
>   create mode 100644 include/sysemu/accel-blocker.h

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>



* Re: [PATCH v3 1/3] accel: introduce accelerator blocker API
  2022-11-18  7:32   ` Philippe Mathieu-Daudé
@ 2022-11-18  7:35     ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 13+ messages in thread
From: Philippe Mathieu-Daudé @ 2022-11-18  7:35 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-devel
  Cc: Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum, Yanan Wang, kvm

On 18/11/22 08:32, Philippe Mathieu-Daudé wrote:
> On 11/11/22 16:47, Emanuele Giuseppe Esposito wrote:
>> This API allows the accelerators to prevent vcpus from issuing
>> new ioctls while execting a critical section marked with the

Typo "executing".

>> accel_ioctl_inhibit_begin/end functions.
>>
>> Note that all functions submitting ioctls must mark where the
>> ioctl is being called with accel_{cpu_}ioctl_begin/end().
>>
>> This API requires the caller to always hold the BQL.
>> API documentation is in sysemu/accel-blocker.h
>>
>> Internally, it uses a QemuLockCnt together with a per-CPU QemuLockCnt
>> (to minimize cache line bouncing) to keep avoid that new ioctls
>> run when the critical section starts, and a QemuEvent to wait
>> that all running ioctls finish.
>>
>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>> ---
>>   accel/accel-blocker.c          | 154 +++++++++++++++++++++++++++++++++
>>   accel/meson.build              |   2 +-
>>   hw/core/cpu-common.c           |   2 +
>>   include/hw/core/cpu.h          |   3 +
>>   include/sysemu/accel-blocker.h |  56 ++++++++++++
>>   5 files changed, 216 insertions(+), 1 deletion(-)
>>   create mode 100644 accel/accel-blocker.c
>>   create mode 100644 include/sysemu/accel-blocker.h
> 
> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
> 



* Re: [PATCH v3 2/3] KVM: keep track of running ioctls
  2022-11-17 19:27   ` David Hildenbrand
@ 2022-11-18  9:53     ` Emanuele Giuseppe Esposito
  0 siblings, 0 replies; 13+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-11-18  9:53 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, kvm



Am 17/11/2022 um 20:27 schrieb David Hildenbrand:
> On 11.11.22 16:47, Emanuele Giuseppe Esposito wrote:
>> Using the new accel-blocker API, mark where ioctls are being called
>> in KVM. Next, we will implement the critical section that will take
>> care of performing memslots modifications atomically, therefore
>> preventing any new ioctl from running and allowing the running ones
>> to finish.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> You might want to drop that and instead mention something like "This
> patch is based on a protoype patch by David Hildenbrand".
> 
>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>> ---
>>   accel/kvm/kvm-all.c | 7 +++++++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index f99b0becd8..ff660fd469 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -2310,6 +2310,7 @@ static int kvm_init(MachineState *ms)
>>       assert(TARGET_PAGE_SIZE <= qemu_real_host_page_size());
>>         s->sigmask_len = 8;
>> +    accel_blocker_init();
>>     #ifdef KVM_CAP_SET_GUEST_DEBUG
>>       QTAILQ_INIT(&s->kvm_sw_breakpoints);
>> @@ -3014,7 +3015,9 @@ int kvm_vm_ioctl(KVMState *s, int type, ...)
>>       va_end(ap);
>>         trace_kvm_vm_ioctl(type, arg);
>> +    accel_ioctl_begin();
>>       ret = ioctl(s->vmfd, type, arg);
>> +    accel_ioctl_end();
>>       if (ret == -1) {
>>           ret = -errno;
>>       }
>> @@ -3032,7 +3035,9 @@ int kvm_vcpu_ioctl(CPUState *cpu, int type, ...)
>>       va_end(ap);
>>         trace_kvm_vcpu_ioctl(cpu->cpu_index, type, arg);
>> +    accel_cpu_ioctl_begin(cpu);
>>       ret = ioctl(cpu->kvm_fd, type, arg);
>> +    accel_cpu_ioctl_end(cpu);
>>       if (ret == -1) {
>>           ret = -errno;
>>       }
>> @@ -3050,7 +3055,9 @@ int kvm_device_ioctl(int fd, int type, ...)
>>       va_end(ap);
>>         trace_kvm_device_ioctl(fd, type, arg);
>> +    accel_ioctl_begin();
>>       ret = ioctl(fd, type, arg);
>> +    accel_ioctl_end();
>>       if (ret == -1) {
>>           ret = -errno;
>>       }
> 
> I recall that I had some additional patches that tried to catch some of
> more calls:
> 
> https://lore.kernel.org/qemu-devel/20200312161217.3590-2-david@redhat.com/
> 
> https://lore.kernel.org/qemu-devel/20200312161217.3590-3-david@redhat.com/
> 
> Do they still apply? Is there more?
> 

Apologies, what do you mean by "do they still apply"?
Looks fine to me

Emanuele



* Re: [PATCH v3 2/3] KVM: keep track of running ioctls
  2022-11-11 15:47 ` [PATCH v3 2/3] KVM: keep track of running ioctls Emanuele Giuseppe Esposito
  2022-11-17 19:27   ` David Hildenbrand
@ 2022-12-02  6:54   ` Robert Hoo
  2022-12-02 12:03     ` Emanuele Giuseppe Esposito
  1 sibling, 1 reply; 13+ messages in thread
From: Robert Hoo @ 2022-12-02  6:54 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-devel
  Cc: Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, kvm, David Hildenbrand

On Fri, 2022-11-11 at 10:47 -0500, Emanuele Giuseppe Esposito wrote:
> Using the new accel-blocker API, mark where ioctls are being called
> in KVM. Next, we will implement the critical section that will take
> care of performing memslots modifications atomically, therefore
> preventing any new ioctl from running and allowing the running ones
> to finish.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
>  accel/kvm/kvm-all.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index f99b0becd8..ff660fd469 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -2310,6 +2310,7 @@ static int kvm_init(MachineState *ms)
>      assert(TARGET_PAGE_SIZE <= qemu_real_host_page_size());
>  
>      s->sigmask_len = 8;
> +    accel_blocker_init();
>  
>  #ifdef KVM_CAP_SET_GUEST_DEBUG
>      QTAILQ_INIT(&s->kvm_sw_breakpoints);
> @@ -3014,7 +3015,9 @@ int kvm_vm_ioctl(KVMState *s, int type, ...)
>      va_end(ap);
>  
>      trace_kvm_vm_ioctl(type, arg);
> +    accel_ioctl_begin();
>      ret = ioctl(s->vmfd, type, arg);
> +    accel_ioctl_end();
>      if (ret == -1) {
>          ret = -errno;
>      }
> @@ -3032,7 +3035,9 @@ int kvm_vcpu_ioctl(CPUState *cpu, int type,
> ...)
>      va_end(ap);
>  
>      trace_kvm_vcpu_ioctl(cpu->cpu_index, type, arg);
> +    accel_cpu_ioctl_begin(cpu);

Does this mean that kvm_region_commit() can inhibit any other vcpus
doing any ioctls?

>      ret = ioctl(cpu->kvm_fd, type, arg);
> +    accel_cpu_ioctl_end(cpu);
>      if (ret == -1) {
>          ret = -errno;
>      }
> @@ -3050,7 +3055,9 @@ int kvm_device_ioctl(int fd, int type, ...)
>      va_end(ap);
>  
>      trace_kvm_device_ioctl(fd, type, arg);
> +    accel_ioctl_begin();
>      ret = ioctl(fd, type, arg);
> +    accel_ioctl_end();
>      if (ret == -1) {
>          ret = -errno;
>      }



* Re: [PATCH v3 1/3] accel: introduce accelerator blocker API
  2022-11-11 15:47 ` [PATCH v3 1/3] accel: introduce accelerator blocker API Emanuele Giuseppe Esposito
  2022-11-18  7:32   ` Philippe Mathieu-Daudé
@ 2022-12-02  6:56   ` Robert Hoo
  1 sibling, 0 replies; 13+ messages in thread
From: Robert Hoo @ 2022-12-02  6:56 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-devel
  Cc: Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, kvm

On Fri, 2022-11-11 at 10:47 -0500, Emanuele Giuseppe Esposito wrote:
> This API allows the accelerators to prevent vcpus from issuing
> new ioctls while execting a critical section marked with the
> accel_ioctl_inhibit_begin/end functions.
> 
> Note that all functions submitting ioctls must mark where the
> ioctl is being called with accel_{cpu_}ioctl_begin/end().
> 
> This API requires the caller to always hold the BQL.
> API documentation is in sysemu/accel-blocker.h
> 
> Internally, it uses a QemuLockCnt together with a per-CPU QemuLockCnt
> (to minimize cache line bouncing) to keep avoid that new ioctls
> run when the critical section starts, and a QemuEvent to wait
> that all running ioctls finish.
> 
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
>  accel/accel-blocker.c          | 154
> +++++++++++++++++++++++++++++++++
>  accel/meson.build              |   2 +-
>  hw/core/cpu-common.c           |   2 +
>  include/hw/core/cpu.h          |   3 +
>  include/sysemu/accel-blocker.h |  56 ++++++++++++
>  5 files changed, 216 insertions(+), 1 deletion(-)
>  create mode 100644 accel/accel-blocker.c
>  create mode 100644 include/sysemu/accel-blocker.h
> 
> diff --git a/accel/accel-blocker.c b/accel/accel-blocker.c
> new file mode 100644
> index 0000000000..1e7f423462
> --- /dev/null
> +++ b/accel/accel-blocker.c
> @@ -0,0 +1,154 @@
> +/*
> + * Lock to inhibit accelerator ioctls
> + *
> + * Copyright (c) 2022 Red Hat Inc.
> + *
> + * Author: Emanuele Giuseppe Esposito       <eesposit@redhat.com>
> + *
> + * Permission is hereby granted, free of charge, to any person
> obtaining a copy
> + * of this software and associated documentation files (the
> "Software"), to deal
> + * in the Software without restriction, including without limitation
> the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense,
> and/or sell
> + * copies of the Software, and to permit persons to whom the
> Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be
> included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
> SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES
> OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
> DEALINGS IN
> + * THE SOFTWARE.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/thread.h"
> +#include "qemu/main-loop.h"
> +#include "hw/core/cpu.h"
> +#include "sysemu/accel-blocker.h"
> +
> +static QemuLockCnt accel_in_ioctl_lock;
> +static QemuEvent accel_in_ioctl_event;
> +
> +void accel_blocker_init(void)
> +{
> +    qemu_lockcnt_init(&accel_in_ioctl_lock);
> +    qemu_event_init(&accel_in_ioctl_event, false);
> +}
> +
> +void accel_ioctl_begin(void)
> +{
> +    if (likely(qemu_mutex_iothread_locked())) {
> +        return;
> +    }
> +
> +    /* block if lock is taken in kvm_ioctl_inhibit_begin() */
> +    qemu_lockcnt_inc(&accel_in_ioctl_lock);
> +}
> +
> +void accel_ioctl_end(void)
> +{
> +    if (likely(qemu_mutex_iothread_locked())) {
> +        return;
> +    }
> +
> +    qemu_lockcnt_dec(&accel_in_ioctl_lock);
> +    /* change event to SET. If event was BUSY, wake up all waiters
> */
> +    qemu_event_set(&accel_in_ioctl_event);
> +}
> +
> +void accel_cpu_ioctl_begin(CPUState *cpu)
> +{
> +    if (unlikely(qemu_mutex_iothread_locked())) {
> +        return;
> +    }
> +
> +    /* block if lock is taken in kvm_ioctl_inhibit_begin() */
> +    qemu_lockcnt_inc(&cpu->in_ioctl_lock);
> +}
> +
> +void accel_cpu_ioctl_end(CPUState *cpu)
> +{
> +    if (unlikely(qemu_mutex_iothread_locked())) {
> +        return;
> +    }
> +
> +    qemu_lockcnt_dec(&cpu->in_ioctl_lock);
> +    /* change event to SET. If event was BUSY, wake up all waiters
> */
> +    qemu_event_set(&accel_in_ioctl_event);
> +}
> +
> +static bool accel_has_to_wait(void)
> +{
> +    CPUState *cpu;
> +    bool needs_to_wait = false;
> +
> +    CPU_FOREACH(cpu) {
> +        if (qemu_lockcnt_count(&cpu->in_ioctl_lock)) {
> +            /* exit the ioctl, if vcpu is running it */
> +            qemu_cpu_kick(cpu);
> +            needs_to_wait = true;
> +        }
> +    }
> +
> +    return needs_to_wait ||
> qemu_lockcnt_count(&accel_in_ioctl_lock);
> +}
> +
> +void accel_ioctl_inhibit_begin(void)
> +{
> +    CPUState *cpu;
> +
> +    /*
> +     * We allow to inhibit only when holding the BQL, so we can
> identify
> +     * when an inhibitor wants to issue an ioctl easily.
> +     */
> +    g_assert(qemu_mutex_iothread_locked());
> +
> +    /* Block further invocations of the ioctls outside the BQL.  */
> +    CPU_FOREACH(cpu) {
> +        qemu_lockcnt_lock(&cpu->in_ioctl_lock);
> +    }
> +    qemu_lockcnt_lock(&accel_in_ioctl_lock);
> +
> +    /* Keep waiting until there are running ioctls */
> +    while (true) {
> +
> +        /* Reset event to FREE. */
> +        qemu_event_reset(&accel_in_ioctl_event);
> +
> +        if (accel_has_to_wait()) {
> +            /*
> +             * If event is still FREE, and there are ioctls still in
> progress,
> +             * wait.
> +             *
> +             *  If an ioctl finishes before qemu_event_wait(), it
> will change
> +             * the event state to SET. This will prevent
> qemu_event_wait() from
> +             * blocking, but it's not a problem because if other
> ioctls are
> +             * still running the loop will iterate once more and
> reset the event
> +             * status to FREE so that it can wait properly.
> +             *
> +             * If an ioctls finishes while qemu_event_wait() is
> blocking, then
> +             * it will be waken up, but also here the while loop
> makes sure
> +             * to re-enter the wait if there are other running
> ioctls.
> +             */
> +            qemu_event_wait(&accel_in_ioctl_event);
> +        } else {
> +            /* No ioctl is running */
> +            return;
> +        }
> +    }
> +}
> +
> +void accel_ioctl_inhibit_end(void)
> +{
> +    CPUState *cpu;
> +
> +    qemu_lockcnt_unlock(&accel_in_ioctl_lock);
> +    CPU_FOREACH(cpu) {
> +        qemu_lockcnt_unlock(&cpu->in_ioctl_lock);
> +    }
> +}
> +
> diff --git a/accel/meson.build b/accel/meson.build
> index b9a963cf80..a0d49c4f31 100644
> --- a/accel/meson.build
> +++ b/accel/meson.build
> @@ -1,4 +1,4 @@
> -specific_ss.add(files('accel-common.c'))
> +specific_ss.add(files('accel-common.c', 'accel-blocker.c'))
>  softmmu_ss.add(files('accel-softmmu.c'))
>  user_ss.add(files('accel-user.c'))
>  
> diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c
> index f9fdd46b9d..8d6a4b1b65 100644
> --- a/hw/core/cpu-common.c
> +++ b/hw/core/cpu-common.c
> @@ -237,6 +237,7 @@ static void cpu_common_initfn(Object *obj)
>      cpu->nr_threads = 1;
>  
>      qemu_mutex_init(&cpu->work_mutex);
> +    qemu_lockcnt_init(&cpu->in_ioctl_lock);
>      QSIMPLEQ_INIT(&cpu->work_list);
>      QTAILQ_INIT(&cpu->breakpoints);
>      QTAILQ_INIT(&cpu->watchpoints);
> @@ -248,6 +249,7 @@ static void cpu_common_finalize(Object *obj)
>  {
>      CPUState *cpu = CPU(obj);
>  
> +    qemu_lockcnt_destroy(&cpu->in_ioctl_lock);
>      qemu_mutex_destroy(&cpu->work_mutex);
>  }
>  
> diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
> index f9b58773f7..15053663bc 100644
> --- a/include/hw/core/cpu.h
> +++ b/include/hw/core/cpu.h
> @@ -397,6 +397,9 @@ struct CPUState {
>      uint32_t kvm_fetch_index;
>      uint64_t dirty_pages;
>  
> +    /* Use by accel-block: CPU is executing an ioctl() */
> +    QemuLockCnt in_ioctl_lock;
> +
>      /* Used for events with 'vcpu' and *without* the 'disabled'
> properties */
>      DECLARE_BITMAP(trace_dstate_delayed,
> CPU_TRACE_DSTATE_MAX_EVENTS);
>      DECLARE_BITMAP(trace_dstate, CPU_TRACE_DSTATE_MAX_EVENTS);
> diff --git a/include/sysemu/accel-blocker.h b/include/sysemu/accel-
> blocker.h
> new file mode 100644
> index 0000000000..72020529ef
> --- /dev/null
> +++ b/include/sysemu/accel-blocker.h
> @@ -0,0 +1,56 @@
> +/*
> + * Accelerator blocking API, to prevent new ioctls from starting and
> wait the
> + * running ones finish.
> + * This mechanism differs from pause/resume_all_vcpus() in that it
> does not
> + * release the BQL.
> + *
> + *  Copyright (c) 2022 Red Hat Inc.
> + *
> + * Author: Emanuele Giuseppe Esposito       <eesposit@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2
> or later.
> + * See the COPYING file in the top-level directory.
> + */
> +#ifndef ACCEL_BLOCKER_H
> +#define ACCEL_BLOCKER_H
> +
> +#include "qemu/osdep.h"
> +#include "sysemu/cpus.h"
> +
> +extern void accel_blocker_init(void);
> +
> +/*
> + * accel_{cpu_}ioctl_begin/end:
> + * Mark when ioctl is about to run or just finished.
> + *
> + * accel_{cpu_}ioctl_begin will block after
> accel_ioctl_inhibit_begin() is
> + * called, preventing new ioctls to run. They will continue only
> after
> + * accel_ioctl_inibith_end().

Typo inibith --> inhibit

> + */
> +extern void accel_ioctl_begin(void);
> +extern void accel_ioctl_end(void);
> +extern void accel_cpu_ioctl_begin(CPUState *cpu);
> +extern void accel_cpu_ioctl_end(CPUState *cpu);
> +
> +/*
> + * accel_ioctl_inhibit_begin: start critical section
> + *
> + * This function makes sure that:
> + * 1) incoming accel_{cpu_}ioctl_begin() calls block
> + * 2) wait that all ioctls that were already running reach
> + *    accel_{cpu_}ioctl_end(), kicking vcpus if necessary.
> + *
> + * This allows the caller to access shared data or perform
> operations without
> + * worrying of concurrent vcpus accesses.
> + */
> +extern void accel_ioctl_inhibit_begin(void);
> +
> +/*
> + * accel_ioctl_inhibit_end: end critical section started by
> + * accel_ioctl_inhibit_begin()
> + *
> + * This function allows blocked accel_{cpu_}ioctl_begin() to
> continue.
> + */
> +extern void accel_ioctl_inhibit_end(void);
> +

git am complains ".git/rebase-apply/patch:170: new blank line at EOF."

> +#endif /* ACCEL_BLOCKER_H */



* Re: [PATCH v3 2/3] KVM: keep track of running ioctls
  2022-12-02  6:54   ` Robert Hoo
@ 2022-12-02 12:03     ` Emanuele Giuseppe Esposito
  2022-12-02 13:32       ` Robert Hoo
  0 siblings, 1 reply; 13+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-12-02 12:03 UTC (permalink / raw)
  To: Robert Hoo, qemu-devel
  Cc: Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, kvm, David Hildenbrand



Am 02/12/2022 um 07:54 schrieb Robert Hoo:
> On Fri, 2022-11-11 at 10:47 -0500, Emanuele Giuseppe Esposito wrote:
>> Using the new accel-blocker API, mark where ioctls are being called
>> in KVM. Next, we will implement the critical section that will take
>> care of performing memslots modifications atomically, therefore
>> preventing any new ioctl from running and allowing the running ones
>> to finish.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>> ---
>>  accel/kvm/kvm-all.c | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index f99b0becd8..ff660fd469 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -2310,6 +2310,7 @@ static int kvm_init(MachineState *ms)
>>      assert(TARGET_PAGE_SIZE <= qemu_real_host_page_size());
>>  
>>      s->sigmask_len = 8;
>> +    accel_blocker_init();
>>  
>>  #ifdef KVM_CAP_SET_GUEST_DEBUG
>>      QTAILQ_INIT(&s->kvm_sw_breakpoints);
>> @@ -3014,7 +3015,9 @@ int kvm_vm_ioctl(KVMState *s, int type, ...)
>>      va_end(ap);
>>  
>>      trace_kvm_vm_ioctl(type, arg);
>> +    accel_ioctl_begin();
>>      ret = ioctl(s->vmfd, type, arg);
>> +    accel_ioctl_end();
>>      if (ret == -1) {
>>          ret = -errno;
>>      }
>> @@ -3032,7 +3035,9 @@ int kvm_vcpu_ioctl(CPUState *cpu, int type,
>> ...)
>>      va_end(ap);
>>  
>>      trace_kvm_vcpu_ioctl(cpu->cpu_index, type, arg);
>> +    accel_cpu_ioctl_begin(cpu);
> 
> Does this mean that kvm_region_commit() can inhibit any other vcpus
> doing any ioctls?

Yes, because we must prevent any vcpu from reading memslots while we are
updating them.

> 
>>      ret = ioctl(cpu->kvm_fd, type, arg);
>> +    accel_cpu_ioctl_end(cpu);
>>      if (ret == -1) {
>>          ret = -errno;
>>      }
>> @@ -3050,7 +3055,9 @@ int kvm_device_ioctl(int fd, int type, ...)
>>      va_end(ap);
>>  
>>      trace_kvm_device_ioctl(fd, type, arg);
>> +    accel_ioctl_begin();
>>      ret = ioctl(fd, type, arg);
>> +    accel_ioctl_end();
>>      if (ret == -1) {
>>          ret = -errno;
>>      }
> 



* Re: [PATCH v3 2/3] KVM: keep track of running ioctls
  2022-12-02 12:03     ` Emanuele Giuseppe Esposito
@ 2022-12-02 13:32       ` Robert Hoo
  2022-12-02 14:32         ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 13+ messages in thread
From: Robert Hoo @ 2022-12-02 13:32 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-devel
  Cc: Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, kvm, David Hildenbrand

On Fri, 2022-12-02 at 13:03 +0100, Emanuele Giuseppe Esposito wrote:
...
> > > @@ -3032,7 +3035,9 @@ int kvm_vcpu_ioctl(CPUState *cpu, int type,
> > > ...)
> > >      va_end(ap);
> > >  
> > >      trace_kvm_vcpu_ioctl(cpu->cpu_index, type, arg);
> > > +    accel_cpu_ioctl_begin(cpu);
> > 
> > Does this mean that kvm_region_commit() can inhibit any other vcpus
> > doing any ioctls?
> 
> Yes, because we must prevent any vcpu from reading memslots while we
> are
> updating them.
> 
But do most other vm/vcpu ioctls contend with memslot operations?



* Re: [PATCH v3 2/3] KVM: keep track of running ioctls
  2022-12-02 13:32       ` Robert Hoo
@ 2022-12-02 14:32         ` Emanuele Giuseppe Esposito
  0 siblings, 0 replies; 13+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-12-02 14:32 UTC (permalink / raw)
  To: Robert Hoo, qemu-devel
  Cc: Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, kvm, David Hildenbrand



Am 02/12/2022 um 14:32 schrieb Robert Hoo:
> On Fri, 2022-12-02 at 13:03 +0100, Emanuele Giuseppe Esposito wrote:
> ...
>>>> @@ -3032,7 +3035,9 @@ int kvm_vcpu_ioctl(CPUState *cpu, int type,
>>>> ...)
>>>>      va_end(ap);
>>>>  
>>>>      trace_kvm_vcpu_ioctl(cpu->cpu_index, type, arg);
>>>> +    accel_cpu_ioctl_begin(cpu);
>>>
>>> Does this mean that kvm_region_commit() can inhibit any other vcpus
>>> doing any ioctls?
>>
>> Yes, because we must prevent any vcpu from reading memslots while we
>> are
>> updating them.
>>
> But do most other vm/vcpu ioctls contend with memslot operations?
> 

I think this is the simplest way. I agree that not all ioctls contend
with memslot operations, but there are also not that many memslot
operations. Instead of going through all possible ioctls one by one,
covering all of them is the simplest approach, and it also covers the
case of a new ioctl that reads memslots being added in the future
(otherwise we would always have to keep the list of blocked ioctls up
to date).


