* [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext
@ 2022-07-21 12:07 David Hildenbrand
  2022-07-21 12:07 ` [PATCH RFC 1/7] util: Cleanup and rename os_mem_prealloc() David Hildenbrand
                   ` (10 more replies)
  0 siblings, 11 replies; 21+ messages in thread
From: David Hildenbrand @ 2022-07-21 12:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Michal Privoznik, Igor Mammedov,
	Michael S. Tsirkin, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Dr . David Alan Gilbert, Eric Blake,
	Markus Armbruster, Richard Henderson, Stefan Weil

This is a follow-up on "util: NUMA aware memory preallocation" [1] by
Michal.

Setting the CPU affinity of threads from inside QEMU usually isn't
easily possible, because we don't want QEMU -- once started and running
guest code -- to be able to mess up the system. QEMU disallows relevant
syscalls using seccomp, such that any such invocation will fail.

Especially for memory preallocation in memory backends, the CPU affinity
of preallocation threads can significantly affect guest startup time, for
example, when running large VMs backed by huge/gigantic pages, because of
NUMA effects. For NUMA-aware preallocation, we have to set the CPU
affinity; however:

(1) Preallocation threads are created automatically on demand; once they
    exist, management tools can no longer intercept them to change their
    affinity.
(2) QEMU cannot easily set the CPU affinity itself.
(3) The CPU affinity derived from the NUMA bindings of the memory backend
    might not necessarily be exactly the CPUs we actually want to use
    (e.g., CPU-less NUMA nodes, CPUs that are pinned/used for other VMs).

There is an easy "workaround": if we have a thread with the right CPU
affinity, we can simply create new threads on demand via that prepared
context, because new threads inherit the CPU affinity of their creator.
So all we have to do is set up such a context ahead of time and then
configure preallocation to create its new threads via that environment.
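
A minimal standalone sketch (not QEMU code; plain POSIX) of the
mechanism this relies on -- threads inheriting the CPU affinity of
their creator:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        cpu_set_t set;

        /* New threads start with the CPU affinity of their creator. */
        pthread_getaffinity_np(pthread_self(), sizeof(set), &set);
        printf("worker: allowed to run on %d host CPUs\n", CPU_COUNT(&set));
        return NULL;
    }

    int main(void)
    {
        cpu_set_t set;
        pthread_t thread;

        /* Pin the "context" (here: the main thread) to host CPU 0 ... */
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        /* ... any thread created from now on inherits that affinity. */
        pthread_create(&thread, NULL, worker, NULL);
        pthread_join(thread, NULL);
        return 0;
    }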

So, let's introduce a user-creatable "thread-context" object that
essentially consists of a context thread used to create new threads.
QEMU can either try setting the CPU affinity itself ("cpu-affinity",
"node-affinity" property), or upper layers can extract the thread id
("thread-id" property) to configure it externally.

Make memory-backends consume a thread-context object
(via the "prealloc-context" property) and use it when preallocating to
create new threads with the desired CPU affinity. Further, to make it
easier to use, allow creation of "thread-context" objects, including
setting the CPU affinity directly from QEMU, *before* enabling the
sandbox option.
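
For the externally-configured variant, the flow could look like this
(sketch; monitor commands shown, thread id value is illustrative):

    (qemu) object_add thread-context,id=tc1
    (qemu) qom-get tc1 thread-id
    123456

    # on the host, pin the context thread, e.g., to host CPUs 0-3
    $ taskset -pc 0-3 123456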


Quick test on a system with 2 NUMA nodes:

Without CPU affinity:
    time qemu-system-x86_64 \
        -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind \
        -nographic -monitor stdio

    real    0m5.383s
    real    0m3.499s
    real    0m5.129s
    real    0m4.232s
    real    0m5.220s
    real    0m4.288s
    real    0m3.582s
    real    0m4.305s
    real    0m5.421s
    real    0m4.502s

    -> The runtime heavily depends on which CPUs the scheduler selects

With CPU affinity:
    time qemu-system-x86_64 \
        -object thread-context,id=tc1,node-affinity=0 \
        -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind,prealloc-context=tc1 \
        -sandbox enable=on,resourcecontrol=deny \
        -nographic -monitor stdio

    real    0m1.959s
    real    0m1.942s
    real    0m1.943s
    real    0m1.941s
    real    0m1.948s
    real    0m1.964s
    real    0m1.949s
    real    0m1.948s
    real    0m1.941s
    real    0m1.937s

On reasonably large VMs, the speedup can be quite significant.

While this concept is currently only used for short-lived preallocation
threads, nothing fundamental prevents reusing it for other threads that
are harder to identify/configure -- at the cost of additional (idle)
context threads that would otherwise be left unused.

[1] https://lkml.kernel.org/r/ffdcd118d59b379ede2b64745144165a40f6a813.1652165704.git.mprivozn@redhat.com

Cc: Michal Privoznik <mprivozn@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: Eduardo Habkost <eduardo@habkost.net>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Eric Blake <eblake@redhat.com>
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Stefan Weil <sw@weilnetz.de>

David Hildenbrand (7):
  util: Cleanup and rename os_mem_prealloc()
  util: Introduce qemu_thread_set_affinity() and
    qemu_thread_get_affinity()
  util: Introduce ThreadContext user-creatable object
  util: Add write-only "node-affinity" property for ThreadContext
  util: Make qemu_prealloc_mem() optionally consume a ThreadContext
  hostmem: Allow for specifying a ThreadContext for preallocation
  vl: Allow ThreadContext objects to be created before the sandbox
    option

 backends/hostmem.c            |  13 +-
 hw/virtio/virtio-mem.c        |   2 +-
 include/qemu/osdep.h          |  19 +-
 include/qemu/thread-context.h |  58 ++++++
 include/qemu/thread.h         |   4 +
 include/sysemu/hostmem.h      |   2 +
 meson.build                   |  16 ++
 qapi/qom.json                 |  25 +++
 softmmu/cpus.c                |   2 +-
 softmmu/vl.c                  |  30 ++-
 util/meson.build              |   1 +
 util/oslib-posix.c            |  39 ++--
 util/oslib-win32.c            |   8 +-
 util/qemu-thread-posix.c      |  70 +++++++
 util/qemu-thread-win32.c      |  12 ++
 util/thread-context.c         | 363 ++++++++++++++++++++++++++++++++++
 16 files changed, 637 insertions(+), 27 deletions(-)
 create mode 100644 include/qemu/thread-context.h
 create mode 100644 util/thread-context.c

-- 
2.35.3




* [PATCH RFC 1/7] util: Cleanup and rename os_mem_prealloc()
  2022-07-21 12:07 [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext David Hildenbrand
@ 2022-07-21 12:07 ` David Hildenbrand
  2022-07-21 12:07 ` [PATCH RFC 2/7] util: Introduce qemu_thread_set_affinity() and qemu_thread_get_affinity() David Hildenbrand
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2022-07-21 12:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Michal Privoznik, Igor Mammedov,
	Michael S. Tsirkin, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Dr . David Alan Gilbert, Eric Blake,
	Markus Armbruster, Richard Henderson, Stefan Weil

Let's
* give the function a "qemu_*" style name
* make sure the parameters in the implementation match the prototype
* rename smp_cpus to max_threads, which makes the semantics of that
  parameter clearer

... and add function documentation.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 backends/hostmem.c     |  6 +++---
 hw/virtio/virtio-mem.c |  2 +-
 include/qemu/osdep.h   | 17 +++++++++++++++--
 softmmu/cpus.c         |  2 +-
 util/oslib-posix.c     | 24 ++++++++++++------------
 util/oslib-win32.c     |  8 ++++----
 6 files changed, 36 insertions(+), 23 deletions(-)

diff --git a/backends/hostmem.c b/backends/hostmem.c
index 624bb7ecd3..caff42d3a5 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -232,7 +232,7 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value,
         void *ptr = memory_region_get_ram_ptr(&backend->mr);
         uint64_t sz = memory_region_size(&backend->mr);
 
-        os_mem_prealloc(fd, ptr, sz, backend->prealloc_threads, &local_err);
+        qemu_prealloc_mem(fd, ptr, sz, backend->prealloc_threads, &local_err);
         if (local_err) {
             error_propagate(errp, local_err);
             return;
@@ -393,8 +393,8 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
          * specified NUMA policy in place.
          */
         if (backend->prealloc) {
-            os_mem_prealloc(memory_region_get_fd(&backend->mr), ptr, sz,
-                            backend->prealloc_threads, &local_err);
+            qemu_prealloc_mem(memory_region_get_fd(&backend->mr), ptr, sz,
+                              backend->prealloc_threads, &local_err);
             if (local_err) {
                 goto out;
             }
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 30d03e987a..0e9ef4ff19 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -467,7 +467,7 @@ static int virtio_mem_set_block_state(VirtIOMEM *vmem, uint64_t start_gpa,
             int fd = memory_region_get_fd(&vmem->memdev->mr);
             Error *local_err = NULL;
 
-            os_mem_prealloc(fd, area, size, 1, &local_err);
+            qemu_prealloc_mem(fd, area, size, 1, &local_err);
             if (local_err) {
                 static bool warned;
 
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index b1c161c035..e556e45143 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -568,8 +568,21 @@ unsigned long qemu_getauxval(unsigned long type);
 
 void qemu_set_tty_echo(int fd, bool echo);
 
-void os_mem_prealloc(int fd, char *area, size_t sz, int smp_cpus,
-                     Error **errp);
+/**
+ * qemu_prealloc_mem:
+ * @fd: the fd mapped into the area, -1 for anonymous memory
+ * @area: start address of the area to preallocate
+ * @sz: the size of the area to preallocate
+ * @max_threads: maximum number of threads to use
+ * @errp: returns an error if this function fails
+ *
+ * Preallocate memory (populate/prefault page tables writable) for the virtual
+ * memory area starting at @area with the size of @sz. After a successful call,
+ * each page in the area was faulted in writable at least once, for example,
+ * after allocating file blocks for mapped files.
+ */
+void qemu_prealloc_mem(int fd, char *area, size_t sz, int max_threads,
+                       Error **errp);
 
 /**
  * qemu_get_pid_name:
diff --git a/softmmu/cpus.c b/softmmu/cpus.c
index 23b30484b2..cf8aa70ca5 100644
--- a/softmmu/cpus.c
+++ b/softmmu/cpus.c
@@ -354,7 +354,7 @@ static void qemu_init_sigbus(void)
 
     /*
      * ALERT: when modifying this, take care that SIGBUS forwarding in
-     * os_mem_prealloc() will continue working as expected.
+     * qemu_prealloc_mem() will continue working as expected.
      */
     memset(&action, 0, sizeof(action));
     action.sa_flags = SA_SIGINFO;
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index bffec18869..75e8cdcf3e 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -314,7 +314,7 @@ static void sigbus_handler(int signal)
         return;
     }
 #endif /* CONFIG_LINUX */
-    warn_report("os_mem_prealloc: unrelated SIGBUS detected and ignored");
+    warn_report("qemu_prealloc_mem: unrelated SIGBUS detected and ignored");
 }
 
 static void *do_touch_pages(void *arg)
@@ -384,13 +384,13 @@ static void *do_madv_populate_write_pages(void *arg)
 }
 
 static inline int get_memset_num_threads(size_t hpagesize, size_t numpages,
-                                         int smp_cpus)
+                                         int max_threads)
 {
     long host_procs = sysconf(_SC_NPROCESSORS_ONLN);
     int ret = 1;
 
     if (host_procs > 0) {
-        ret = MIN(MIN(host_procs, MAX_MEM_PREALLOC_THREAD_COUNT), smp_cpus);
+        ret = MIN(MIN(host_procs, MAX_MEM_PREALLOC_THREAD_COUNT), max_threads);
     }
 
     /* Especially with gigantic pages, don't create more threads than pages. */
@@ -403,11 +403,11 @@ static inline int get_memset_num_threads(size_t hpagesize, size_t numpages,
 }
 
 static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
-                           int smp_cpus, bool use_madv_populate_write)
+                           int max_threads, bool use_madv_populate_write)
 {
     static gsize initialized = 0;
     MemsetContext context = {
-        .num_threads = get_memset_num_threads(hpagesize, numpages, smp_cpus),
+        .num_threads = get_memset_num_threads(hpagesize, numpages, max_threads),
     };
     size_t numpages_per_thread, leftover;
     void *(*touch_fn)(void *);
@@ -479,13 +479,13 @@ static bool madv_populate_write_possible(char *area, size_t pagesize)
            errno != EINVAL;
 }
 
-void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
-                     Error **errp)
+void qemu_prealloc_mem(int fd, char *area, size_t sz, int max_threads,
+                       Error **errp)
 {
     static gsize initialized;
     int ret;
     size_t hpagesize = qemu_fd_getpagesize(fd);
-    size_t numpages = DIV_ROUND_UP(memory, hpagesize);
+    size_t numpages = DIV_ROUND_UP(sz, hpagesize);
     bool use_madv_populate_write;
     struct sigaction act;
 
@@ -515,24 +515,24 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
         if (ret) {
             qemu_mutex_unlock(&sigbus_mutex);
             error_setg_errno(errp, errno,
-                "os_mem_prealloc: failed to install signal handler");
+                "qemu_prealloc_mem: failed to install signal handler");
             return;
         }
     }
 
     /* touch pages simultaneously */
-    ret = touch_all_pages(area, hpagesize, numpages, smp_cpus,
+    ret = touch_all_pages(area, hpagesize, numpages, max_threads,
                           use_madv_populate_write);
     if (ret) {
         error_setg_errno(errp, -ret,
-                         "os_mem_prealloc: preallocating memory failed");
+                         "qemu_prealloc_mem: preallocating memory failed");
     }
 
     if (!use_madv_populate_write) {
         ret = sigaction(SIGBUS, &sigbus_oldact, NULL);
         if (ret) {
             /* Terminate QEMU since it can't recover from error */
-            perror("os_mem_prealloc: failed to reinstall signal handler");
+            perror("qemu_prealloc_mem: failed to reinstall signal handler");
             exit(1);
         }
         qemu_mutex_unlock(&sigbus_mutex);
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index 5723d3eb4c..e1cb725ecc 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -268,14 +268,14 @@ int getpagesize(void)
     return system_info.dwPageSize;
 }
 
-void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
-                     Error **errp)
+void qemu_prealloc_mem(int fd, char *area, size_t sz, int max_threads,
+                       Error **errp)
 {
     int i;
     size_t pagesize = qemu_real_host_page_size();
 
-    memory = (memory + pagesize - 1) & -pagesize;
-    for (i = 0; i < memory / pagesize; i++) {
+    sz = (sz + pagesize - 1) & -pagesize;
+    for (i = 0; i < sz / pagesize; i++) {
         memset(area + pagesize * i, 0, 1);
     }
 }
-- 
2.35.3




* [PATCH RFC 2/7] util: Introduce qemu_thread_set_affinity() and qemu_thread_get_affinity()
  2022-07-21 12:07 [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext David Hildenbrand
  2022-07-21 12:07 ` [PATCH RFC 1/7] util: Cleanup and rename os_mem_prealloc() David Hildenbrand
@ 2022-07-21 12:07 ` David Hildenbrand
  2022-07-21 12:07 ` [PATCH RFC 3/7] util: Introduce ThreadContext user-creatable object David Hildenbrand
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2022-07-21 12:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Michal Privoznik, Igor Mammedov,
	Michael S. Tsirkin, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Dr . David Alan Gilbert, Eric Blake,
	Markus Armbruster, Richard Henderson, Stefan Weil

Usually, we let upper layers handle CPU pinning, because
pthread_setaffinity_np() (-> sched_setaffinity()) is blocked via
seccomp when starting QEMU with
    -sandbox enable=on,resourcecontrol=deny

However, we want to configure and observe the CPU affinity of threads
from QEMU directly in some cases when the sandbox option is either not
enabled or not active yet.

So let's add a way to configure CPU pinning via
qemu_thread_set_affinity() and obtain CPU affinity via
qemu_thread_get_affinity() and implement them under POSIX using
pthread_setaffinity_np() + pthread_getaffinity_np().
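
A minimal usage sketch (assuming an already-created QemuThread "thread"
and the bitmap helpers from qemu/bitmap.h; error handling trimmed):

    unsigned long *cpus = bitmap_new(8);   /* bits 0..7 <-> host CPUs 0..7 */
    int ret;

    set_bit(2, cpus);                      /* allow host CPU 2 */
    set_bit(3, cpus);                      /* allow host CPU 3 */
    ret = qemu_thread_set_affinity(&thread, cpus, 8);
    if (ret) {
        /* e.g., pthread_*affinity_np() unavailable, or blocked by seccomp */
    }
    g_free(cpus);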

Implementation under Windows is possible using SetProcessAffinityMask()
+ GetProcessAffinityMask(), however, that is left as future work.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/qemu/thread.h    |  4 +++
 meson.build              | 16 +++++++++
 util/qemu-thread-posix.c | 70 ++++++++++++++++++++++++++++++++++++++++
 util/qemu-thread-win32.c | 12 +++++++
 4 files changed, 102 insertions(+)

diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index af19f2b3fc..79e507c7f0 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -185,6 +185,10 @@ void qemu_event_destroy(QemuEvent *ev);
 void qemu_thread_create(QemuThread *thread, const char *name,
                         void *(*start_routine)(void *),
                         void *arg, int mode);
+int qemu_thread_set_affinity(QemuThread *thread, unsigned long *host_cpus,
+                             unsigned long nbits);
+int qemu_thread_get_affinity(QemuThread *thread, unsigned long **host_cpus,
+                             unsigned long *nbits);
 void *qemu_thread_join(QemuThread *thread);
 void qemu_thread_get_self(QemuThread *thread);
 bool qemu_thread_is_self(QemuThread *thread);
diff --git a/meson.build b/meson.build
index 8a8c415fc1..961b3c99ac 100644
--- a/meson.build
+++ b/meson.build
@@ -2072,7 +2072,23 @@ config_host_data.set('CONFIG_PTHREAD_CONDATTR_SETCLOCK', cc.links(gnu_source_pre
     pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);
     return 0;
   }''', dependencies: threads))
+config_host_data.set('CONFIG_PTHREAD_AFFINITY_NP', cc.links(gnu_source_prefix + '''
+  #include <pthread.h>
 
+  static void *f(void *p) { return NULL; }
+  int main(void)
+  {
+    int setsize = CPU_ALLOC_SIZE(64);
+    pthread_t thread;
+    cpu_set_t *cpuset;
+    pthread_create(&thread, 0, f, 0);
+    cpuset = CPU_ALLOC(64);
+    CPU_ZERO_S(setsize, cpuset);
+    pthread_setaffinity_np(thread, setsize, cpuset);
+    pthread_getaffinity_np(thread, setsize, cpuset);
+    CPU_FREE(cpuset);
+    return 0;
+  }''', dependencies: threads))
 config_host_data.set('CONFIG_SIGNALFD', cc.links(gnu_source_prefix + '''
   #include <sys/signalfd.h>
   #include <stddef.h>
diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index ac1d56e673..bae938c670 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -16,6 +16,7 @@
 #include "qemu/notify.h"
 #include "qemu-thread-common.h"
 #include "qemu/tsan.h"
+#include "qemu/bitmap.h"
 
 static bool name_threads;
 
@@ -552,6 +553,75 @@ void qemu_thread_create(QemuThread *thread, const char *name,
     pthread_attr_destroy(&attr);
 }
 
+int qemu_thread_set_affinity(QemuThread *thread, unsigned long *host_cpus,
+                             unsigned long nbits)
+{
+#if defined(CONFIG_PTHREAD_AFFINITY_NP)
+    const size_t setsize = CPU_ALLOC_SIZE(nbits);
+    unsigned long value;
+    cpu_set_t *cpuset;
+    int err;
+
+    cpuset = CPU_ALLOC(nbits);
+    g_assert(cpuset);
+
+    CPU_ZERO_S(setsize, cpuset);
+    value = find_first_bit(host_cpus, nbits);
+    while (value < nbits) {
+        CPU_SET_S(value, setsize, cpuset);
+        value = find_next_bit(host_cpus, nbits, value + 1);
+    }
+
+    err = pthread_setaffinity_np(thread->thread, setsize, cpuset);
+    CPU_FREE(cpuset);
+    return err;
+#else
+    return -ENOSYS;
+#endif
+}
+
+int qemu_thread_get_affinity(QemuThread *thread, unsigned long **host_cpus,
+                             unsigned long *nbits)
+{
+#if defined(CONFIG_PTHREAD_AFFINITY_NP)
+    unsigned long tmpbits;
+    cpu_set_t *cpuset;
+    size_t setsize;
+    int i, err;
+
+    tmpbits = CPU_SETSIZE;
+    while (true) {
+        setsize = CPU_ALLOC_SIZE(tmpbits);
+        cpuset = CPU_ALLOC(tmpbits);
+        g_assert(cpuset);
+
+        err = pthread_getaffinity_np(thread->thread, setsize, cpuset);
+        if (err) {
+            CPU_FREE(cpuset);
+            if (err != EINVAL) {
+                return err;
+            }
+            tmpbits *= 2;
+        } else {
+            break;
+        }
+    }
+
+    /* Convert the result into a proper bitmap. */
+    *nbits = tmpbits;
+    *host_cpus = bitmap_new(tmpbits);
+    for (i = 0; i < tmpbits; i++) {
+        if (CPU_ISSET(i, cpuset)) {
+            set_bit(i, *host_cpus);
+        }
+    }
+    CPU_FREE(cpuset);
+    return 0;
+#else
+    return -ENOSYS;
+#endif
+}
+
 void qemu_thread_get_self(QemuThread *thread)
 {
     thread->thread = pthread_self();
diff --git a/util/qemu-thread-win32.c b/util/qemu-thread-win32.c
index a2d5a6e825..72338148bd 100644
--- a/util/qemu-thread-win32.c
+++ b/util/qemu-thread-win32.c
@@ -427,6 +427,18 @@ void qemu_thread_create(QemuThread *thread, const char *name,
     thread->data = data;
 }
 
+int qemu_thread_set_affinity(QemuThread *thread, unsigned long *host_cpus,
+                             unsigned long nbits)
+{
+    return -ENOSYS;
+}
+
+int qemu_thread_get_affinity(QemuThread *thread, unsigned long **host_cpus,
+                             unsigned long *nbits)
+{
+    return -ENOSYS;
+}
+
 void qemu_thread_get_self(QemuThread *thread)
 {
     thread->data = qemu_thread_data;
-- 
2.35.3




* [PATCH RFC 3/7] util: Introduce ThreadContext user-creatable object
  2022-07-21 12:07 [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext David Hildenbrand
  2022-07-21 12:07 ` [PATCH RFC 1/7] util: Cleanup and rename os_mem_prealloc() David Hildenbrand
  2022-07-21 12:07 ` [PATCH RFC 2/7] util: Introduce qemu_thread_set_affinity() and qemu_thread_get_affinity() David Hildenbrand
@ 2022-07-21 12:07 ` David Hildenbrand
  2022-08-05 11:01   ` Michal Prívozník
  2022-07-21 12:07 ` [PATCH RFC 4/7] util: Add write-only "node-affinity" property for ThreadContext David Hildenbrand
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2022-07-21 12:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Michal Privoznik, Igor Mammedov,
	Michael S. Tsirkin, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Dr . David Alan Gilbert, Eric Blake,
	Markus Armbruster, Richard Henderson, Stefan Weil

Setting the CPU affinity of QEMU threads is a bit problematic, because
QEMU doesn't always have permissions to set the CPU affinity itself,
for example, with seccomp after QEMU was initialized with:
    -sandbox enable=on,resourcecontrol=deny

While upper layers already know how to handle CPU affinities for
long-lived threads like iothreads or vcpu threads, short-lived threads,
as used for memory-backend preallocation, are more involved to handle:
these threads are created on demand, and upper layers are not even able
to identify and configure them.

Introduce the concept of a ThreadContext, which is essentially a thread
used for creating new threads. All threads created via that context
thread inherit the configured CPU affinity. Consequently, it's
sufficient to create a ThreadContext and configure it once, and have all
threads created via that ThreadContext inherit the same CPU affinity.

The CPU affinity of a ThreadContext can be configured two ways:

(1) Obtaining the thread id via the "thread-id" property and setting the
    CPU affinity manually.

(2) Setting the "cpu-affinity" property and letting QEMU try to set the
    CPU affinity itself. This will fail if QEMU doesn't have permissions
    to do so anymore after seccomp was initialized.

A ThreadContext can be reused simply by reconfiguring the CPU affinity.
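
For example, (2) could look like this on a QEMU instance started without
the sandbox option (sketch; ids and CPU lists are illustrative):

    (qemu) object_add thread-context,id=tc1,cpu-affinity=0-1
    (qemu) qom-set tc1 cpu-affinity 2-3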

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/qemu/thread-context.h |  58 +++++++
 qapi/qom.json                 |  16 ++
 util/meson.build              |   1 +
 util/oslib-posix.c            |   1 +
 util/thread-context.c         | 279 ++++++++++++++++++++++++++++++++++
 5 files changed, 355 insertions(+)
 create mode 100644 include/qemu/thread-context.h
 create mode 100644 util/thread-context.c

diff --git a/include/qemu/thread-context.h b/include/qemu/thread-context.h
new file mode 100644
index 0000000000..c799cbe7a1
--- /dev/null
+++ b/include/qemu/thread-context.h
@@ -0,0 +1,58 @@
+/*
+ * QEMU Thread Context
+ *
+ * Copyright Red Hat Inc., 2022
+ *
+ * Authors:
+ *  David Hildenbrand <david@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef SYSEMU_THREAD_CONTEXT_H
+#define SYSEMU_THREAD_CONTEXT_H
+
+#include "qapi/qapi-types-machine.h"
+#include "qemu/thread.h"
+#include "qom/object.h"
+
+#define TYPE_THREAD_CONTEXT "thread-context"
+OBJECT_DECLARE_TYPE(ThreadContext, ThreadContextClass,
+                    THREAD_CONTEXT)
+
+struct ThreadContextClass {
+    ObjectClass parent_class;
+};
+
+struct ThreadContext {
+    /* private */
+    Object parent;
+
+    /* private */
+    unsigned int thread_id;
+    QemuThread thread;
+
+    /* Semaphore to wait for context thread action. */
+    QemuSemaphore sem;
+    /* Semaphore to wait for action in context thread. */
+    QemuSemaphore sem_thread;
+    /* Mutex to synchronize requests. */
+    QemuMutex mutex;
+
+    /* Commands for the thread to execute. */
+    int thread_cmd;
+    void *thread_cmd_data;
+
+    /* CPU affinity bitmap used for initialization. */
+    unsigned long *init_cpu_bitmap;
+    int init_cpu_nbits;
+};
+
+void thread_context_create_thread(ThreadContext *tc, QemuThread *thread,
+                                  const char *name,
+                                  void *(*start_routine)(void *), void *arg,
+                                  int mode);
+
+#endif /* SYSEMU_THREAD_CONTEXT_H */
diff --git a/qapi/qom.json b/qapi/qom.json
index 80dd419b39..4775a333ed 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -830,6 +830,20 @@
             'reduced-phys-bits': 'uint32',
             '*kernel-hashes': 'bool' } }
 
+##
+# @ThreadContextProperties:
+#
+# Properties for thread context objects.
+#
+# @cpu-affinity: the CPU affinity for all threads created in the thread
+#                context (default: QEMU main thread affinity)
+#
+# Since: 7.2
+##
+{ 'struct': 'ThreadContextProperties',
+  'data': { '*cpu-affinity': ['uint16'] } }
+
+
 ##
 # @ObjectType:
 #
@@ -882,6 +896,7 @@
     { 'name': 'secret_keyring',
       'if': 'CONFIG_SECRET_KEYRING' },
     'sev-guest',
+    'thread-context',
     's390-pv-guest',
     'throttle-group',
     'tls-creds-anon',
@@ -948,6 +963,7 @@
       'secret_keyring':             { 'type': 'SecretKeyringProperties',
                                       'if': 'CONFIG_SECRET_KEYRING' },
       'sev-guest':                  'SevGuestProperties',
+      'thread-context':             'ThreadContextProperties',
       'throttle-group':             'ThrottleGroupProperties',
       'tls-creds-anon':             'TlsCredsAnonProperties',
       'tls-creds-psk':              'TlsCredsPskProperties',
diff --git a/util/meson.build b/util/meson.build
index 5e282130df..e97cd2d779 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -1,4 +1,5 @@
 util_ss.add(files('osdep.c', 'cutils.c', 'unicode.c', 'qemu-timer-common.c'))
+util_ss.add(files('thread-context.c'))
 if not config_host_data.get('CONFIG_ATOMIC64')
   util_ss.add(files('atomic64.c'))
 endif
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 75e8cdcf3e..fa66f73bf8 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -42,6 +42,7 @@
 #include "qemu/cutils.h"
 #include "qemu/compiler.h"
 #include "qemu/units.h"
+#include "qemu/thread-context.h"
 
 #ifdef CONFIG_LINUX
 #include <sys/syscall.h>
diff --git a/util/thread-context.c b/util/thread-context.c
new file mode 100644
index 0000000000..dcd607c532
--- /dev/null
+++ b/util/thread-context.c
@@ -0,0 +1,279 @@
+/*
+ * QEMU Thread Context
+ *
+ * Copyright Red Hat Inc., 2022
+ *
+ * Authors:
+ *  David Hildenbrand <david@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/thread-context.h"
+#include "qapi/error.h"
+#include "qapi/qapi-builtin-visit.h"
+#include "qapi/visitor.h"
+#include "qemu/config-file.h"
+#include "qapi/qapi-builtin-visit.h"
+#include "qom/object_interfaces.h"
+#include "qemu/module.h"
+#include "qemu/bitmap.h"
+
+enum {
+    TC_CMD_NONE = 0,
+    TC_CMD_STOP,
+    TC_CMD_NEW,
+};
+
+typedef struct ThreadContextCmdNew {
+    QemuThread *thread;
+    const char *name;
+    void *(*start_routine)(void *);
+    void *arg;
+    int mode;
+} ThreadContextCmdNew;
+
+static void *thread_context_run(void *opaque)
+{
+    ThreadContext *tc = opaque;
+
+    tc->thread_id = qemu_get_thread_id();
+    qemu_sem_post(&tc->sem);
+
+    while (true) {
+        /*
+         * Threads inherit the CPU affinity of the creating thread. For this
+         * reason, we create new (especially short-lived) threads from our
+         * persistent context thread.
+         *
+         * Especially when QEMU is not allowed to set the affinity itself,
+         * management tools can simply set the affinity of the context thread
+         * after creating the context, to have new threads created via
+         * the context inherit the CPU affinity automatically.
+         */
+        switch (tc->thread_cmd) {
+        case TC_CMD_NONE:
+            break;
+        case TC_CMD_STOP:
+            tc->thread_cmd = TC_CMD_NONE;
+            qemu_sem_post(&tc->sem);
+            return NULL;
+        case TC_CMD_NEW: {
+            ThreadContextCmdNew *cmd_new = tc->thread_cmd_data;
+
+            qemu_thread_create(cmd_new->thread, cmd_new->name,
+                               cmd_new->start_routine, cmd_new->arg,
+                               cmd_new->mode);
+            tc->thread_cmd = TC_CMD_NONE;
+            tc->thread_cmd_data = NULL;
+            qemu_sem_post(&tc->sem);
+            break;
+        }
+        default:
+            g_assert_not_reached();
+        }
+        qemu_sem_wait(&tc->sem_thread);
+    }
+}
+
+static void thread_context_set_cpu_affinity(Object *obj, Visitor *v,
+                                            const char *name, void *opaque,
+                                            Error **errp)
+{
+    ThreadContext *tc = THREAD_CONTEXT(obj);
+    uint16List *l, *host_cpus = NULL;
+    unsigned long *bitmap = NULL;
+    int nbits = 0, ret;
+    Error *err = NULL;
+
+    visit_type_uint16List(v, name, &host_cpus, &err);
+    if (err) {
+        error_propagate(errp, err);
+        return;
+    }
+
+    if (!host_cpus) {
+        error_setg(errp, "CPU list is empty");
+        goto out;
+    }
+
+    for (l = host_cpus; l; l = l->next) {
+        nbits = MAX(nbits, l->value + 1);
+    }
+    bitmap = bitmap_new(nbits);
+    for (l = host_cpus; l; l = l->next) {
+        set_bit(l->value, bitmap);
+    }
+
+    if (tc->thread_id != -1) {
+        /*
+         * Note: we won't be adjusting the affinity of any thread that is still
+         * around, but only the affinity of the context thread.
+         */
+        ret = qemu_thread_set_affinity(&tc->thread, bitmap, nbits);
+        if (ret) {
+            error_setg(errp, "Setting CPU affinity failed: %s", strerror(ret));
+        }
+    } else {
+        tc->init_cpu_bitmap = bitmap;
+        bitmap = NULL;
+        tc->init_cpu_nbits = nbits;
+    }
+out:
+    g_free(bitmap);
+    qapi_free_uint16List(host_cpus);
+}
+
+static void thread_context_get_cpu_affinity(Object *obj, Visitor *v,
+                                            const char *name, void *opaque,
+                                            Error **errp)
+{
+    unsigned long *bitmap, nbits, value;
+    ThreadContext *tc = THREAD_CONTEXT(obj);
+    uint16List *host_cpus = NULL;
+    uint16List **tail = &host_cpus;
+    int ret;
+
+    if (tc->thread_id == -1) {
+        error_setg(errp, "Object not initialized yet");
+        return;
+    }
+
+    ret = qemu_thread_get_affinity(&tc->thread, &bitmap, &nbits);
+    if (ret) {
+        error_setg(errp, "Getting CPU affinity failed: %s", strerror(ret));
+        return;
+    }
+
+    value = find_first_bit(bitmap, nbits);
+    while (value < nbits) {
+        QAPI_LIST_APPEND(tail, value);
+
+        value = find_next_bit(bitmap, nbits, value + 1);
+    }
+    g_free(bitmap);
+
+    visit_type_uint16List(v, name, &host_cpus, errp);
+    qapi_free_uint16List(host_cpus);
+}
+
+static void thread_context_get_thread_id(Object *obj, Visitor *v,
+                                         const char *name, void *opaque,
+                                         Error **errp)
+{
+    ThreadContext *tc = THREAD_CONTEXT(obj);
+    uint64_t value = tc->thread_id;
+
+    visit_type_uint64(v, name, &value, errp);
+}
+
+static void thread_context_instance_complete(UserCreatable *uc, Error **errp)
+{
+    ThreadContext *tc = THREAD_CONTEXT(uc);
+    char *thread_name;
+    int ret;
+
+    thread_name = g_strdup_printf("TC %s",
+                               object_get_canonical_path_component(OBJECT(uc)));
+    qemu_thread_create(&tc->thread, thread_name, thread_context_run, tc,
+                       QEMU_THREAD_JOINABLE);
+    g_free(thread_name);
+
+    /* Wait until initialization of the thread is done. */
+    while (tc->thread_id == -1) {
+        qemu_sem_wait(&tc->sem);
+    }
+
+    if (tc->init_cpu_bitmap) {
+        ret = qemu_thread_set_affinity(&tc->thread, tc->init_cpu_bitmap,
+                                       tc->init_cpu_nbits);
+        if (ret) {
+            error_setg(errp, "Setting CPU affinity failed: %s", strerror(ret));
+        }
+        g_free(tc->init_cpu_bitmap);
+        tc->init_cpu_bitmap = NULL;
+    }
+}
+
+static void thread_context_class_init(ObjectClass *oc, void *data)
+{
+    UserCreatableClass *ucc = USER_CREATABLE_CLASS(oc);
+
+    ucc->complete = thread_context_instance_complete;
+    object_class_property_add(oc, "thread-id", "int",
+                              thread_context_get_thread_id, NULL, NULL,
+                              NULL);
+    object_class_property_add(oc, "cpu-affinity", "int",
+                              thread_context_get_cpu_affinity,
+                              thread_context_set_cpu_affinity, NULL, NULL);
+}
+
+static void thread_context_instance_init(Object *obj)
+{
+    ThreadContext *tc = THREAD_CONTEXT(obj);
+
+    tc->thread_id = -1;
+    qemu_sem_init(&tc->sem, 0);
+    qemu_sem_init(&tc->sem_thread, 0);
+    qemu_mutex_init(&tc->mutex);
+}
+
+static void thread_context_instance_finalize(Object *obj)
+{
+    ThreadContext *tc = THREAD_CONTEXT(obj);
+
+    if (tc->thread_id != -1) {
+        tc->thread_cmd = TC_CMD_STOP;
+        qemu_sem_post(&tc->sem_thread);
+        qemu_thread_join(&tc->thread);
+    }
+    qemu_sem_destroy(&tc->sem);
+    qemu_sem_destroy(&tc->sem_thread);
+    qemu_mutex_destroy(&tc->mutex);
+}
+
+static const TypeInfo thread_context_info = {
+    .name = TYPE_THREAD_CONTEXT,
+    .parent = TYPE_OBJECT,
+    .class_init = thread_context_class_init,
+    .instance_size = sizeof(ThreadContext),
+    .instance_init = thread_context_instance_init,
+    .instance_finalize = thread_context_instance_finalize,
+    .interfaces = (InterfaceInfo[]) {
+        { TYPE_USER_CREATABLE },
+        { }
+    }
+};
+
+static void thread_context_register_types(void)
+{
+    type_register_static(&thread_context_info);
+}
+type_init(thread_context_register_types)
+
+void thread_context_create_thread(ThreadContext *tc, QemuThread *thread,
+                                  const char *name,
+                                  void *(*start_routine)(void *), void *arg,
+                                  int mode)
+{
+    ThreadContextCmdNew data = {
+        .thread = thread,
+        .name = name,
+        .start_routine = start_routine,
+        .arg = arg,
+        .mode = mode,
+    };
+
+    qemu_mutex_lock(&tc->mutex);
+    tc->thread_cmd = TC_CMD_NEW;
+    tc->thread_cmd_data = &data;
+    qemu_sem_post(&tc->sem_thread);
+
+    while (tc->thread_cmd != TC_CMD_NONE) {
+        qemu_sem_wait(&tc->sem);
+    }
+    qemu_mutex_unlock(&tc->mutex);
+}
-- 
2.35.3




* [PATCH RFC 4/7] util: Add write-only "node-affinity" property for ThreadContext
  2022-07-21 12:07 [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext David Hildenbrand
                   ` (2 preceding siblings ...)
  2022-07-21 12:07 ` [PATCH RFC 3/7] util: Introduce ThreadContext user-creatable object David Hildenbrand
@ 2022-07-21 12:07 ` David Hildenbrand
  2022-07-21 12:07 ` [PATCH RFC 5/7] util: Make qemu_prealloc_mem() optionally consume a ThreadContext David Hildenbrand
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2022-07-21 12:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Michal Privoznik, Igor Mammedov,
	Michael S. Tsirkin, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Dr . David Alan Gilbert, Eric Blake,
	Markus Armbruster, Richard Henderson, Stefan Weil

Let's make it easier to pin threads created via a ThreadContext to
all current CPUs belonging to given NUMA nodes.

As "node-affinity" is simply a shortcut for setting "cpu-affinity", that
property cannot be read and if the CPUs for a node change due do CPU
hotplug, the CPU affinity will not get updated.
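
For example, assuming host NUMA node 1 currently spans host CPUs 4-7
(illustrative values), the following (sketch):

    -object thread-context,id=tc1,node-affinity=1

behaves like manually specifying the node's current CPUs:

    -object thread-context,id=tc1,cpu-affinity=4-7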

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 qapi/qom.json         |  7 +++-
 util/meson.build      |  2 +-
 util/thread-context.c | 84 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 91 insertions(+), 2 deletions(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 4775a333ed..d36bf3355f 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -838,10 +838,15 @@
 # @cpu-affinity: the CPU affinity for all threads created in the thread
 #                context (default: QEMU main thread affinity)
 #
+# @node-affinity: shortcut for looking up the current CPUs for the given nodes
+#                 and setting @cpu-affinity (default: QEMU main thread
+#                 affinity)
+#
 # Since: 7.2
 ##
 { 'struct': 'ThreadContextProperties',
-  'data': { '*cpu-affinity': ['uint16'] } }
+  'data': { '*cpu-affinity': ['uint16'],
+            '*node-affinity': ['uint16'] } }
 
 
 ##
diff --git a/util/meson.build b/util/meson.build
index e97cd2d779..c0a7bc54d4 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -1,5 +1,5 @@
 util_ss.add(files('osdep.c', 'cutils.c', 'unicode.c', 'qemu-timer-common.c'))
-util_ss.add(files('thread-context.c'))
+util_ss.add(files('thread-context.c'), numa)
 if not config_host_data.get('CONFIG_ATOMIC64')
   util_ss.add(files('atomic64.c'))
 endif
diff --git a/util/thread-context.c b/util/thread-context.c
index dcd607c532..880f0441be 100644
--- a/util/thread-context.c
+++ b/util/thread-context.c
@@ -22,6 +22,10 @@
 #include "qemu/module.h"
 #include "qemu/bitmap.h"
 
+#ifdef CONFIG_NUMA
+#include <numa.h>
+#endif
+
 enum {
     TC_CMD_NONE = 0,
     TC_CMD_STOP,
@@ -89,6 +93,11 @@ static void thread_context_set_cpu_affinity(Object *obj, Visitor *v,
     int nbits = 0, ret;
     Error *err = NULL;
 
+    if (tc->init_cpu_bitmap) {
+        error_setg(errp, "Mixing CPU and node affinity not supported");
+        return;
+    }
+
     visit_type_uint16List(v, name, &host_cpus, &err);
     if (err) {
         error_propagate(errp, err);
@@ -160,6 +169,79 @@ static void thread_context_get_cpu_affinity(Object *obj, Visitor *v,
     qapi_free_uint16List(host_cpus);
 }
 
+static void thread_context_set_node_affinity(Object *obj, Visitor *v,
+                                             const char *name, void *opaque,
+                                             Error **errp)
+{
+#ifdef CONFIG_NUMA
+    const int nbits = numa_num_possible_cpus();
+    ThreadContext *tc = THREAD_CONTEXT(obj);
+    uint16List *l, *host_nodes = NULL;
+    unsigned long *bitmap = NULL;
+    struct bitmask *tmp_cpus;
+    Error *err = NULL;
+    int ret, i;
+
+    if (tc->init_cpu_bitmap) {
+        error_setg(errp, "Mixing CPU and node affinity not supported");
+        return;
+    }
+
+    visit_type_uint16List(v, name, &host_nodes, &err);
+    if (err) {
+        error_propagate(errp, err);
+        return;
+    }
+
+    if (!host_nodes) {
+        error_setg(errp, "Node list is empty");
+        goto out;
+    }
+
+    bitmap = bitmap_new(nbits);
+    tmp_cpus = numa_allocate_cpumask();
+    for (l = host_nodes; l; l = l->next) {
+        numa_bitmask_clearall(tmp_cpus);
+        ret = numa_node_to_cpus(l->value, tmp_cpus);
+        if (ret) {
+            /* We ignore any errors, such as impossible nodes. */
+            continue;
+        }
+        for (i = 0; i < nbits; i++) {
+            if (numa_bitmask_isbitset(tmp_cpus, i)) {
+                set_bit(i, bitmap);
+            }
+        }
+    }
+    numa_free_cpumask(tmp_cpus);
+
+    if (bitmap_empty(bitmap, nbits)) {
+        error_setg(errp, "The nodes select no CPUs");
+        goto out;
+    }
+
+    if (tc->thread_id != -1) {
+        /*
+         * Note: we won't be adjusting the affinity of any thread that is still
+         * around for now, but only the affinity of the context thread.
+         */
+        ret = qemu_thread_set_affinity(&tc->thread, bitmap, nbits);
+        if (ret) {
+            error_setg(errp, "Setting CPU affinity failed: %s", strerror(ret));
+        }
+    } else {
+        tc->init_cpu_bitmap = bitmap;
+        bitmap = NULL;
+        tc->init_cpu_nbits = nbits;
+    }
+out:
+    g_free(bitmap);
+    qapi_free_uint16List(host_nodes);
+#else
+    error_setg(errp, "NUMA node affinity is not supported by this QEMU");
+#endif
+}
+
 static void thread_context_get_thread_id(Object *obj, Visitor *v,
                                          const char *name, void *opaque,
                                          Error **errp)
@@ -209,6 +291,8 @@ static void thread_context_class_init(ObjectClass *oc, void *data)
     object_class_property_add(oc, "cpu-affinity", "int",
                               thread_context_get_cpu_affinity,
                               thread_context_set_cpu_affinity, NULL, NULL);
+    object_class_property_add(oc, "node-affinity", "int", NULL,
+                              thread_context_set_node_affinity, NULL, NULL);
 }
 
 static void thread_context_instance_init(Object *obj)
-- 
2.35.3




* [PATCH RFC 5/7] util: Make qemu_prealloc_mem() optionally consume a ThreadContext
  2022-07-21 12:07 [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext David Hildenbrand
                   ` (3 preceding siblings ...)
  2022-07-21 12:07 ` [PATCH RFC 4/7] util: Add write-only "node-affinity" property for ThreadContext David Hildenbrand
@ 2022-07-21 12:07 ` David Hildenbrand
  2022-07-21 12:07 ` [PATCH RFC 6/7] hostmem: Allow for specifying a ThreadContext for preallocation David Hildenbrand
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2022-07-21 12:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Michal Privoznik, Igor Mammedov,
	Michael S. Tsirkin, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Dr . David Alan Gilbert, Eric Blake,
	Markus Armbruster, Richard Henderson, Stefan Weil

... and implement it under POSIX. When a ThreadContext is provided,
create new threads via the context such that these new threads obtain a
properly configured CPU affinity.
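
Callers that don't need a context simply pass NULL and get the old
behavior (sketch, mirroring the hunks below):

    qemu_prealloc_mem(fd, area, sz, max_threads, NULL, &local_err); /* as before */
    qemu_prealloc_mem(fd, area, sz, max_threads, tc, &local_err);   /* via ThreadContext */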

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 backends/hostmem.c     |  5 +++--
 hw/virtio/virtio-mem.c |  2 +-
 include/qemu/osdep.h   |  4 +++-
 util/oslib-posix.c     | 20 ++++++++++++++------
 util/oslib-win32.c     |  2 +-
 5 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/backends/hostmem.c b/backends/hostmem.c
index caff42d3a5..46bd4cc494 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -232,7 +232,8 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value,
         void *ptr = memory_region_get_ram_ptr(&backend->mr);
         uint64_t sz = memory_region_size(&backend->mr);
 
-        qemu_prealloc_mem(fd, ptr, sz, backend->prealloc_threads, &local_err);
+        qemu_prealloc_mem(fd, ptr, sz, backend->prealloc_threads, NULL,
+                          &local_err);
         if (local_err) {
             error_propagate(errp, local_err);
             return;
@@ -394,7 +395,7 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
          */
         if (backend->prealloc) {
             qemu_prealloc_mem(memory_region_get_fd(&backend->mr), ptr, sz,
-                              backend->prealloc_threads, &local_err);
+                              backend->prealloc_threads, NULL, &local_err);
             if (local_err) {
                 goto out;
             }
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 0e9ef4ff19..ed170def48 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -467,7 +467,7 @@ static int virtio_mem_set_block_state(VirtIOMEM *vmem, uint64_t start_gpa,
             int fd = memory_region_get_fd(&vmem->memdev->mr);
             Error *local_err = NULL;
 
-            qemu_prealloc_mem(fd, area, size, 1, &local_err);
+            qemu_prealloc_mem(fd, area, size, 1, NULL, &local_err);
             if (local_err) {
                 static bool warned;
 
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index e556e45143..625298c8bc 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -568,6 +568,8 @@ unsigned long qemu_getauxval(unsigned long type);
 
 void qemu_set_tty_echo(int fd, bool echo);
 
+typedef struct ThreadContext ThreadContext;
+
 /**
  * qemu_prealloc_mem:
  * @fd: the fd mapped into the area, -1 for anonymous memory
@@ -582,7 +584,7 @@ void qemu_set_tty_echo(int fd, bool echo);
  * after allocating file blocks for mapped files.
  */
 void qemu_prealloc_mem(int fd, char *area, size_t sz, int max_threads,
-                       Error **errp);
+                       ThreadContext *tc, Error **errp);
 
 /**
  * qemu_get_pid_name:
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index fa66f73bf8..ce5fea3126 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -404,7 +404,8 @@ static inline int get_memset_num_threads(size_t hpagesize, size_t numpages,
 }
 
 static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
-                           int max_threads, bool use_madv_populate_write)
+                           int max_threads, ThreadContext *tc,
+                           bool use_madv_populate_write)
 {
     static gsize initialized = 0;
     MemsetContext context = {
@@ -443,9 +444,16 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
         context.threads[i].numpages = numpages_per_thread + (i < leftover);
         context.threads[i].hpagesize = hpagesize;
         context.threads[i].context = &context;
-        qemu_thread_create(&context.threads[i].pgthread, "touch_pages",
-                           touch_fn, &context.threads[i],
-                           QEMU_THREAD_JOINABLE);
+        if (tc) {
+            thread_context_create_thread(tc, &context.threads[i].pgthread,
+                                         "touch_pages",
+                                         touch_fn, &context.threads[i],
+                                         QEMU_THREAD_JOINABLE);
+        } else {
+            qemu_thread_create(&context.threads[i].pgthread, "touch_pages",
+                               touch_fn, &context.threads[i],
+                               QEMU_THREAD_JOINABLE);
+        }
         addr += context.threads[i].numpages * hpagesize;
     }
 
@@ -481,7 +489,7 @@ static bool madv_populate_write_possible(char *area, size_t pagesize)
 }
 
 void qemu_prealloc_mem(int fd, char *area, size_t sz, int max_threads,
-                       Error **errp)
+                       ThreadContext *tc, Error **errp)
 {
     static gsize initialized;
     int ret;
@@ -522,7 +530,7 @@ void qemu_prealloc_mem(int fd, char *area, size_t sz, int max_threads,
     }
 
     /* touch pages simultaneously */
-    ret = touch_all_pages(area, hpagesize, numpages, max_threads,
+    ret = touch_all_pages(area, hpagesize, numpages, max_threads, tc,
                           use_madv_populate_write);
     if (ret) {
         error_setg_errno(errp, -ret,
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index e1cb725ecc..a67cb3822e 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -269,7 +269,7 @@ int getpagesize(void)
 }
 
 void qemu_prealloc_mem(int fd, char *area, size_t sz, int max_threads,
-                       Error **errp)
+                       ThreadContext *tc, Error **errp)
 {
     int i;
     size_t pagesize = qemu_real_host_page_size();
-- 
2.35.3




* [PATCH RFC 6/7] hostmem: Allow for specifying a ThreadContext for preallocation
  2022-07-21 12:07 [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext David Hildenbrand
                   ` (4 preceding siblings ...)
  2022-07-21 12:07 ` [PATCH RFC 5/7] util: Make qemu_prealloc_mem() optionally consume a ThreadContext David Hildenbrand
@ 2022-07-21 12:07 ` David Hildenbrand
  2022-07-21 12:07 ` [PATCH RFC 7/7] vl: Allow ThreadContext objects to be created before the sandbox option David Hildenbrand
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2022-07-21 12:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Michal Privoznik, Igor Mammedov,
	Michael S. Tsirkin, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Dr . David Alan Gilbert, Eric Blake,
	Markus Armbruster, Richard Henderson, Stefan Weil

Let's allow for specifying a thread context via the "prealloc-context"
property. When set, preallocation threads will be created via the
thread context -- inheriting the same CPU affinity as the thread
context.

Pinning preallocation threads to CPUs can heavily increase performance
in NUMA setups, because preallocation from a CPU close to the target
NUMA node(s) is faster than preallocation from a CPU further away,
simply because of the memory bandwidth available for initializing
memory with zeroes. This is especially relevant for very large VMs
backed by huge/gigantic pages, where preallocation is mandatory.
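
A minimal example of gluing both objects together (sketch; ids and
sizes are illustrative):

    qemu-system-x86_64 ... \
        -object thread-context,id=tc1,node-affinity=0 \
        -object memory-backend-memfd,id=md1,size=64G,prealloc=on,prealloc-threads=12,prealloc-context=tc1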

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 backends/hostmem.c       | 12 +++++++++---
 include/sysemu/hostmem.h |  2 ++
 qapi/qom.json            |  4 ++++
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/backends/hostmem.c b/backends/hostmem.c
index 46bd4cc494..5e730fd7c0 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -232,8 +232,8 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value,
         void *ptr = memory_region_get_ram_ptr(&backend->mr);
         uint64_t sz = memory_region_size(&backend->mr);
 
-        qemu_prealloc_mem(fd, ptr, sz, backend->prealloc_threads, NULL,
-                          &local_err);
+        qemu_prealloc_mem(fd, ptr, sz, backend->prealloc_threads,
+                          backend->prealloc_context, &local_err);
         if (local_err) {
             error_propagate(errp, local_err);
             return;
@@ -395,7 +395,8 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
          */
         if (backend->prealloc) {
             qemu_prealloc_mem(memory_region_get_fd(&backend->mr), ptr, sz,
-                              backend->prealloc_threads, NULL, &local_err);
+                              backend->prealloc_threads,
+                              backend->prealloc_context, &local_err);
             if (local_err) {
                 goto out;
             }
@@ -503,6 +504,11 @@ host_memory_backend_class_init(ObjectClass *oc, void *data)
         NULL, NULL);
     object_class_property_set_description(oc, "prealloc-threads",
         "Number of CPU threads to use for prealloc");
+    object_class_property_add_link(oc, "prealloc-context",
+        TYPE_THREAD_CONTEXT, offsetof(HostMemoryBackend, prealloc_context),
+        object_property_allow_set_link, OBJ_PROP_LINK_STRONG);
+    object_class_property_set_description(oc, "prealloc-context",
+        "Context to use for creating CPU threads for preallocation");
     object_class_property_add(oc, "size", "int",
         host_memory_backend_get_size,
         host_memory_backend_set_size,
diff --git a/include/sysemu/hostmem.h b/include/sysemu/hostmem.h
index 9ff5c16963..39326f1d4f 100644
--- a/include/sysemu/hostmem.h
+++ b/include/sysemu/hostmem.h
@@ -18,6 +18,7 @@
 #include "qom/object.h"
 #include "exec/memory.h"
 #include "qemu/bitmap.h"
+#include "qemu/thread-context.h"
 
 #define TYPE_MEMORY_BACKEND "memory-backend"
 OBJECT_DECLARE_TYPE(HostMemoryBackend, HostMemoryBackendClass,
@@ -66,6 +67,7 @@ struct HostMemoryBackend {
     bool merge, dump, use_canonical_path;
     bool prealloc, is_mapped, share, reserve;
     uint32_t prealloc_threads;
+    ThreadContext *prealloc_context;
     DECLARE_BITMAP(host_nodes, MAX_NODES + 1);
     HostMemPolicy policy;
 
diff --git a/qapi/qom.json b/qapi/qom.json
index d36bf3355f..9caa1a60e3 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -578,6 +578,9 @@
 #
 # @prealloc-threads: number of CPU threads to use for prealloc (default: 1)
 #
+# @prealloc-context: context to use for creation of preallocation threads
+#                    (default: none) (since 7.2)
+#
 # @share: if false, the memory is private to QEMU; if true, it is shared
 #         (default: false)
 #
@@ -608,6 +611,7 @@
             '*policy': 'HostMemPolicy',
             '*prealloc': 'bool',
             '*prealloc-threads': 'uint32',
+            '*prealloc-context': 'str',
             '*share': 'bool',
             '*reserve': 'bool',
             'size': 'size',
-- 
2.35.3




* [PATCH RFC 7/7] vl: Allow ThreadContext objects to be created before the sandbox option
  2022-07-21 12:07 [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext David Hildenbrand
                   ` (5 preceding siblings ...)
  2022-07-21 12:07 ` [PATCH RFC 6/7] hostmem: Allow for specifying a ThreadContext for preallocation David Hildenbrand
@ 2022-07-21 12:07 ` David Hildenbrand
  2022-08-05 11:01   ` Michal Prívozník
  2022-07-25 13:59 ` [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext Michal Prívozník
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2022-07-21 12:07 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Michal Privoznik, Igor Mammedov,
	Michael S. Tsirkin, Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Dr . David Alan Gilbert, Eric Blake,
	Markus Armbruster, Richard Henderson, Stefan Weil

Currently, there is no way to configure a CPU affinity inside QEMU when
the sandbox option disables it for QEMU as a whole, for example, via:
    -sandbox enable=on,resourcecontrol=deny

While ThreadContext objects can be created on the QEMU commandline and
the CPU affinity can be configured externally via the thread-id, this is
insufficient if a ThreadContext with a certain CPU affinity is already
required during QEMU startup, before we can intercept QEMU and
configure the CPU affinity.

Blocking sched_setaffinity() was introduced in 24f8cdc57224 ("seccomp:
add resourcecontrol argument to command line"), to avoid any bigger
impact of the QEMU process on the system. However, we only care about
that once QEMU is running, not when the instance starting QEMU
explicitly requests a certain CPU affinity on the QEMU commandline.

Right now, for NUMA-aware preallocation of memory backends used for initial
machine RAM, one has to:

1) Start QEMU with the memory-backend with "prealloc=off"
2) Pause QEMU before it starts the guest (-S)
3) Create ThreadContext, configure the CPU affinity using the thread-id
4) Configure the ThreadContext as "prealloc-context" of the memory
   backend
5) Trigger preallocation by setting "prealloc=on"
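
For illustration, steps 3)-5) could look roughly like the following on the
HMP monitor (a sketch only: the thread id, the CPU list and the "md1"
backend id are made up, and it assumes that "prealloc-context" and
"prealloc" can be set at runtime as described above):

    (qemu) object_add thread-context,id=tc1
    (qemu) qom-get tc1 thread-id
    123456
    ... from a host shell: taskset -p -c 3,4 123456 ...
    (qemu) qom-set md1 prealloc-context tc1
    (qemu) qom-set md1 prealloc on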

To simplify this handling, especially for initial machine RAM,
allow creation of ThreadContext objects before parsing sandbox options,
such that the CPU affinity requested on the QEMU commandline alongside the
sandbox option can be set. As ThreadContext objects essentially only create
a persistent context thread and set the CPU affinity, this is easily
possible.

With this change, we can create a ThreadContext with a CPU affinity on
the QEMU commandline and use it for preallocation of memory backends
glued to the machine (simplified example):

qemu-system-x86_64 -m 1G \
 -object thread-context,id=tc1,cpu-affinity=3-4 \
 -object memory-backend-ram,id=pc.ram,size=1G,prealloc=on,prealloc-threads=2,prealloc-context=tc1 \
 -machine memory-backend=pc.ram \
 -S -monitor stdio -sandbox enable=on,resourcecontrol=deny

And while we can query the current CPU affinity:
  (qemu) qom-get tc1 cpu-affinity
  [
      3,
      4
  ]

We can no longer change it from QEMU directly:
  (qemu) qom-set tc1 cpu-affinity 1-2
  Error: Setting CPU affinity failed: Operation not permitted

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 softmmu/vl.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/softmmu/vl.c b/softmmu/vl.c
index aabd82e09a..252732cf5d 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -1761,6 +1761,27 @@ static void object_option_parse(const char *optarg)
     visit_free(v);
 }
 
+/*
+ * Very early object creation, before the sandbox options have been activated.
+ */
+static bool object_create_pre_sandbox(const char *type)
+{
+    /*
+     * Objects should in general not get initialized "too early" without
+     * a reason. If you add one, state the reason in a comment!
+     */
+
+    /*
+     * Reason: -sandbox on,resourcecontrol=deny disallows setting CPU
+     * affinity of threads.
+     */
+    if (g_str_equal(type, "thread-context")) {
+        return true;
+    }
+
+    return false;
+}
+
 /*
  * Initial object creation happens before all other
  * QEMU data types are created. The majority of objects
@@ -1775,6 +1796,11 @@ static bool object_create_early(const char *type)
      * add one, state the reason in a comment!
      */
 
+    /* Reason: already created. */
+    if (object_create_pre_sandbox(type)) {
+        return false;
+    }
+
     /* Reason: property "chardev" */
     if (g_str_equal(type, "rng-egd") ||
         g_str_equal(type, "qtest")) {
@@ -1895,7 +1921,7 @@ static void qemu_create_early_backends(void)
  */
 static bool object_create_late(const char *type)
 {
-    return !object_create_early(type);
+    return !object_create_early(type) && !object_create_pre_sandbox(type);
 }
 
 static void qemu_create_late_backends(void)
@@ -2365,6 +2391,8 @@ static int process_runstate_actions(void *opaque, QemuOpts *opts, Error **errp)
 
 static void qemu_process_early_options(void)
 {
+    object_option_foreach_add(object_create_pre_sandbox);
+
 #ifdef CONFIG_SECCOMP
     QemuOptsList *olist = qemu_find_opts_err("sandbox", NULL);
     if (olist) {
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext
  2022-07-21 12:07 [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext David Hildenbrand
                   ` (6 preceding siblings ...)
  2022-07-21 12:07 ` [PATCH RFC 7/7] vl: Allow ThreadContext objects to be created before the sandbox option David Hildenbrand
@ 2022-07-25 13:59 ` Michal Prívozník
  2022-08-05 11:01 ` Michal Prívozník
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Michal Prívozník @ 2022-07-25 13:59 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Igor Mammedov, Michael S. Tsirkin, Paolo Bonzini,
	Daniel P. Berrangé,
	Eduardo Habkost, Dr . David Alan Gilbert, Eric Blake,
	Markus Armbruster, Richard Henderson, Stefan Weil

On 7/21/22 14:07, David Hildenbrand wrote:
> This is a follow-up on "util: NUMA aware memory preallocation" [1] by
> Michal.

I've skimmed through patches and haven't spotted anything obviously
wrong. I'll test these more once I write libvirt support for them (which
I plan to do soon).

Michal



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 7/7] vl: Allow ThreadContext objects to be created before the sandbox option
  2022-07-21 12:07 ` [PATCH RFC 7/7] vl: Allow ThreadContext objects to be created before the sandbox option David Hildenbrand
@ 2022-08-05 11:01   ` Michal Prívozník
  2022-08-05 15:50     ` David Hildenbrand
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Prívozník @ 2022-08-05 11:01 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Igor Mammedov, Michael S. Tsirkin, Paolo Bonzini,
	Daniel P. Berrangé,
	Eduardo Habkost, Dr . David Alan Gilbert, Eric Blake,
	Markus Armbruster, Richard Henderson, Stefan Weil

On 7/21/22 14:07, David Hildenbrand wrote:
> Currently, there is no way to configure a CPU affinity inside QEMU when
> the sandbox option disables it for QEMU as a whole, for example, via:
>     -sandbox enable=on,resourcecontrol=deny
> 
> While ThreadContext objects can be created on the QEMU commandline and
> the CPU affinity can be configured externally via the thread-id, this is
> insufficient if a ThreadContext with a certain CPU affinity is already
> required during QEMU startup, before we can intercept QEMU and
> configure the CPU affinity.
> 
> Blocking sched_setaffinity() was introduced in 24f8cdc57224 ("seccomp:
> add resourcecontrol argument to command line") to keep a QEMU process
> that is already running guest code from messing with the system. However,
> that only matters once QEMU is running, not when the instance starting
> QEMU explicitly requests a certain CPU affinity on the QEMU commandline.
> 
> Right now, for NUMA-aware preallocation of memory backends used for initial
> machine RAM, one has to:
> 
> 1) Start QEMU with the memory-backend with "prealloc=off"
> 2) Pause QEMU before it starts the guest (-S)
> 3) Create ThreadContext, configure the CPU affinity using the thread-id
> 4) Configure the ThreadContext as "prealloc-context" of the memory
>    backend
> 5) Trigger preallocation by setting "prealloc=on"
> 
> To simplify this handling, especially for initial machine RAM,
> allow creation of ThreadContext objects before parsing sandbox options,
> such that the CPU affinity requested on the QEMU commandline alongside the
> sandbox option can be set. As ThreadContext objects essentially only create
> a persistent context thread and set the CPU affinity, this is easily
> possible.
> 
> With this change, we can create a ThreadContext with a CPU affinity on
> the QEMU commandline and use it for preallocation of memory backends
> glued to the machine (simplified example):
> 
> qemu-system-x86_64 -m 1G \
>  -object thread-context,id=tc1,cpu-affinity=3-4 \
>  -object memory-backend-ram,id=pc.ram,size=1G,prealloc=on,prealloc-threads=2,prealloc-context=tc1 \
>  -machine memory-backend=pc.ram \
>  -S -monitor stdio -sandbox enable=on,resourcecontrol=deny
> 
> And while we can query the current CPU affinity:
>   (qemu) qom-get tc1 cpu-affinity
>   [
>       3,
>       4
>   ]
> 
> We can no longer change it from QEMU directly:
>   (qemu) qom-set tc1 cpu-affinity 1-2
>   Error: Setting CPU affinity failed: Operation not permitted
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  softmmu/vl.c | 30 +++++++++++++++++++++++++++++-
>  1 file changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/softmmu/vl.c b/softmmu/vl.c
> index aabd82e09a..252732cf5d 100644
> --- a/softmmu/vl.c
> +++ b/softmmu/vl.c
> @@ -1761,6 +1761,27 @@ static void object_option_parse(const char *optarg)
>      visit_free(v);
>  }
>  
> +/*
> + * Very early object creation, before the sandbox options have been activated.
> + */
> +static bool object_create_pre_sandbox(const char *type)
> +{
> +    /*
> +     * Objects should in general not get initialized "too early" without
> +     * a reason. If you add one, state the reason in a comment!
> +     */
> +
> +    /*
> +     * Reason: -sandbox on,resourcecontrol=deny disallows setting CPU
> +     * affinity of threads.
> +     */
> +    if (g_str_equal(type, "thread-context")) {
> +        return true;
> +    }
> +
> +    return false;
> +}
> +
>  /*
>   * Initial object creation happens before all other
>   * QEMU data types are created. The majority of objects
> @@ -1775,6 +1796,11 @@ static bool object_create_early(const char *type)
>       * add one, state the reason in a comment!
>       */
>  
> +    /* Reason: already created. */
> +    if (object_create_pre_sandbox(type)) {
> +        return false;
> +    }
> +
>      /* Reason: property "chardev" */
>      if (g_str_equal(type, "rng-egd") ||
>          g_str_equal(type, "qtest")) {
> @@ -1895,7 +1921,7 @@ static void qemu_create_early_backends(void)
>   */
>  static bool object_create_late(const char *type)
>  {
> -    return !object_create_early(type);
> +    return !object_create_early(type) && !object_create_pre_sandbox(type);
>  }
>  
>  static void qemu_create_late_backends(void)
> @@ -2365,6 +2391,8 @@ static int process_runstate_actions(void *opaque, QemuOpts *opts, Error **errp)
>  
>  static void qemu_process_early_options(void)
>  {
> +    object_option_foreach_add(object_create_pre_sandbox);
> +
>  #ifdef CONFIG_SECCOMP
>      QemuOptsList *olist = qemu_find_opts_err("sandbox", NULL);
>      if (olist) {

Cool, this is processed before -sandbox, so threads can have their
affinity set. However, it's also processed before -name debug-threads=on,
which means that even though we try to set a thread name in 3/7, it's
effectively dead code because name_threads from
util/qemu-thread-posix.c is still false. Could we shift things a bit?
E.g. like this:

static void qemu_process_early_options(void)
{
    qemu_opts_foreach(qemu_find_opts("name"),
                      parse_name, NULL, &error_fatal);

    object_option_foreach_add(object_create_pre_sandbox);
..


Michal



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext
  2022-07-21 12:07 [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext David Hildenbrand
                   ` (7 preceding siblings ...)
  2022-07-25 13:59 ` [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext Michal Prívozník
@ 2022-08-05 11:01 ` Michal Prívozník
  2022-08-05 15:47   ` David Hildenbrand
  2022-08-09 10:56 ` Joao Martins
  2022-09-21 14:44 ` Michal Prívozník
  10 siblings, 1 reply; 21+ messages in thread
From: Michal Prívozník @ 2022-08-05 11:01 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Igor Mammedov, Michael S. Tsirkin, Paolo Bonzini,
	Daniel P. Berrangé,
	Eduardo Habkost, Dr . David Alan Gilbert, Eric Blake,
	Markus Armbruster, Richard Henderson, Stefan Weil

On 7/21/22 14:07, David Hildenbrand wrote:
> This is a follow-up on "util: NUMA aware memory preallocation" [1] by
> Michal.
> 
> Setting the CPU affinity of threads from inside QEMU usually isn't
> easily possible, because we don't want QEMU -- once started and running
> guest code -- to be able to mess up the system. QEMU disallows relevant
> syscalls using seccomp, such that any such invocation will fail.
> 
> Especially for memory preallocation in memory backends, the CPU affinity
> can significantly increase guest startup time, for example, when running
> large VMs backed by huge/gigantic pages, because of NUMA effects. For
> NUMA-aware preallocation, we have to set the CPU affinity, however:
> 
> (1) Once preallocation threads are created during preallocation, management
>     tools cannot intercept anymore to change the affinity. These threads
>     are created automatically on demand.
> (2) QEMU cannot easily set the CPU affinity itself.
> (3) The CPU affinity derived from the NUMA bindings of the memory backend
>     might not necessarily be exactly the CPUs we actually want to use
>     (e.g., CPU-less NUMA nodes, CPUs that are pinned/used for other VMs).
> 
> There is an easy "workaround". If we have a thread with the right CPU
> affinity, we can simply create new threads on demand via that prepared
> context. So, all we have to do is setup and create such a context ahead
> of time, to then configure preallocation to create new threads via that
> environment.
> 
> So, let's introduce a user-creatable "thread-context" object that
> essentially consists of a context thread used to create new threads.
> QEMU can either try setting the CPU affinity itself ("cpu-affinity",
> "node-affinity" property), or upper layers can extract the thread id
> ("thread-id" property) to configure it externally.
> 
> Make memory-backends consume a thread-context object
> (via the "prealloc-context" property) and use it when preallocating to
> create new threads with the desired CPU affinity. Further, to make it
> easier to use, allow creation of "thread-context" objects, including
> setting the CPU affinity directly from QEMU, *before* enabling the
> sandbox option.
> 
> 
> Quick test on a system with 2 NUMA nodes:
> 
> Without CPU affinity:
>     time qemu-system-x86_64 \
>         -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind \
>         -nographic -monitor stdio
> 
>     real    0m5.383s
>     real    0m3.499s
>     real    0m5.129s
>     real    0m4.232s
>     real    0m5.220s
>     real    0m4.288s
>     real    0m3.582s
>     real    0m4.305s
>     real    0m5.421s
>     real    0m4.502s
> 
>     -> It heavily depends on the scheduler CPU selection
> 
> With CPU affinity:
>     time qemu-system-x86_64 \
>         -object thread-context,id=tc1,node-affinity=0 \
>         -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind,prealloc-context=tc1 \
>         -sandbox enable=on,resourcecontrol=deny \
>         -nographic -monitor stdio
> 
>     real    0m1.959s
>     real    0m1.942s
>     real    0m1.943s
>     real    0m1.941s
>     real    0m1.948s
>     real    0m1.964s
>     real    0m1.949s
>     real    0m1.948s
>     real    0m1.941s
>     real    0m1.937s
> 
> On reasonably large VMs, the speedup can be quite significant.
> 

I've timed 'virsh start' with a guest that has 47GB worth of 1GB
hugepages and seen the startup time basically halved (from 10.5s to
5.6s). The host has 4 NUMA nodes and I'm pinning the guest onto two nodes.

I've written the libvirt counterpart (which I'll post as soon as these are
merged). The way it works is that whenever .prealloc-threads= is to be
used AND qemu is capable of thread-context, a thread-context object is
generated before every memory-backend-*, like this:

-object '{"qom-type":"thread-context","id":"tc-ram-node0","node-affinity":[2]}' \
-object '{"qom-type":"memory-backend-memfd","id":"ram-node0","hugetlb":true,"hugetlbsize":1073741824,"share":true,"prealloc":true,"prealloc-threads":16,"size":21474836480,"host-nodes":[2],"policy":"bind","prealloc-context":"tc-ram-node0"}' \
-numa node,nodeid=0,cpus=0,cpus=2,memdev=ram-node0 \
-object '{"qom-type":"thread-context","id":"tc-ram-node1","node-affinity":[3]}' \
-object '{"qom-type":"memory-backend-memfd","id":"ram-node1","hugetlb":true,"hugetlbsize":1073741824,"share":true,"prealloc":true,"prealloc-threads":16,"size":28991029248,"host-nodes":[3],"policy":"bind","prealloc-context":"tc-ram-node1"}' \


Now, it's not visible in this snippet, but my code does not reuse
thread-context objects. So if there's another memfd, it'll get its own TC:

-object '{"qom-type":"thread-context","id":"tc-memdimm0","node-affinity":[1]}' \
-object '{"qom-type":"memory-backend-memfd","id":"memdimm0","hugetlb":true,"hugetlbsize":1073741824,"share":true,"prealloc":true,"prealloc-threads":16,"size":1073741824,"host-nodes":[1],"policy":"bind","prealloc-context":"tc-memdimm0"}' \

The reason is that the logic generating memory-backends is very complex,
and separating out parts of it so that thread-context objects can be
generated first and reused by those backends would inevitably lead to
regressions. I guess my question is whether it's a problem that libvirt
would leave one additional thread, sleeping on a semaphore, for each
memory-backend (iff prealloc-threads are used).
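
As an aside, a rough sketch of how probing for the capability could look
over QMP (return value abbreviated; the exact detection logic is of
course up to libvirt):

-> { "execute": "qom-list-types", "arguments": { "abstract": false } }
<- { "return": [ ..., { "name": "thread-context" }, ... ] }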

Although, if I read the code correctly, a thread-context object can be
specified AFTER memory backends, because thread-contexts are parsed and
created before backends anyway. Well, something to think over the weekend.
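
E.g., presumably (an untested sketch with made-up ids) this ordering
would work just as well:

-object memory-backend-memfd,id=md1,size=1G,prealloc=on,prealloc-context=tc1 \
-object thread-context,id=tc1,node-affinity=0 \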


> While this concept is currently only used for short-lived preallocation
> threads, nothing major speaks against reusing the concept for other
> threads that are harder to identify/configure -- except that
> we need additional (idle) context threads that are otherwise left unused.
> 
> [1] https://lkml.kernel.org/r/ffdcd118d59b379ede2b64745144165a40f6a813.1652165704.git.mprivozn@redhat.com
> 
> Cc: Michal Privoznik <mprivozn@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Daniel P. Berrangé" <berrange@redhat.com>
> Cc: Eduardo Habkost <eduardo@habkost.net>
> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Markus Armbruster <armbru@redhat.com>
> Cc: Richard Henderson <richard.henderson@linaro.org>
> Cc: Stefan Weil <sw@weilnetz.de>
> 
> David Hildenbrand (7):
>   util: Cleanup and rename os_mem_prealloc()
>   util: Introduce qemu_thread_set_affinity() and
>     qemu_thread_get_affinity()
>   util: Introduce ThreadContext user-creatable object
>   util: Add write-only "node-affinity" property for ThreadContext
>   util: Make qemu_prealloc_mem() optionally consume a ThreadContext
>   hostmem: Allow for specifying a ThreadContext for preallocation
>   vl: Allow ThreadContext objects to be created before the sandbox
>     option
> 
>  backends/hostmem.c            |  13 +-
>  hw/virtio/virtio-mem.c        |   2 +-
>  include/qemu/osdep.h          |  19 +-
>  include/qemu/thread-context.h |  58 ++++++
>  include/qemu/thread.h         |   4 +
>  include/sysemu/hostmem.h      |   2 +
>  meson.build                   |  16 ++
>  qapi/qom.json                 |  25 +++
>  softmmu/cpus.c                |   2 +-
>  softmmu/vl.c                  |  30 ++-
>  util/meson.build              |   1 +
>  util/oslib-posix.c            |  39 ++--
>  util/oslib-win32.c            |   8 +-
>  util/qemu-thread-posix.c      |  70 +++++++
>  util/qemu-thread-win32.c      |  12 ++
>  util/thread-context.c         | 363 ++++++++++++++++++++++++++++++++++
>  16 files changed, 637 insertions(+), 27 deletions(-)
>  create mode 100644 include/qemu/thread-context.h
>  create mode 100644 util/thread-context.c
> 

Reviewed-by: Michal Privoznik <mprivozn@redhat.com>

Michal



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 3/7] util: Introduce ThreadContext user-creatable object
  2022-07-21 12:07 ` [PATCH RFC 3/7] util: Introduce ThreadContext user-creatable object David Hildenbrand
@ 2022-08-05 11:01   ` Michal Prívozník
  0 siblings, 0 replies; 21+ messages in thread
From: Michal Prívozník @ 2022-08-05 11:01 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Igor Mammedov, Michael S. Tsirkin, Paolo Bonzini,
	Daniel P. Berrangé,
	Eduardo Habkost, Dr . David Alan Gilbert, Eric Blake,
	Markus Armbruster, Richard Henderson, Stefan Weil

On 7/21/22 14:07, David Hildenbrand wrote:
> Setting the CPU affinity of QEMU threads is a bit problematic, because
> QEMU doesn't always have permissions to set the CPU affinity itself,
> for example, with seccomp after it has been initialized by QEMU:
>     -sandbox enable=on,resourcecontrol=deny
> 
> While upper layers are already aware of how to handle CPU affinities for
> long-lived threads like iothreads or vcpu threads, short-lived
> threads, as used for memory-backend preallocation, are more involved to
> handle. These threads are created on demand and upper layers are not even
> able to identify and configure them.
> 
> Introduce the concept of a ThreadContext, that is essentially a thread
> used for creating new threads. All threads created via that context
> thread inherit the configured CPU affinity. Consequently, it's
> sufficient to create a ThreadContext and configure it once, and have all
> threads created via that ThreadContext inherit the same CPU affinity.
> 
> The CPU affinity of a ThreadContext can be configured two ways:
> 
> (1) Obtaining the thread id via the "thread-id" property and setting the
>     CPU affinity manually.
> 
> (2) Setting the "cpu-affinity" property and letting QEMU try to set the
>     CPU affinity itself. This will fail if QEMU doesn't have permissions
>     to do so anymore after seccomp was initialized.
> 
> A ThreadContext can be reused, simply by reconfiguring the CPU affinity.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/qemu/thread-context.h |  58 +++++++
>  qapi/qom.json                 |  16 ++
>  util/meson.build              |   1 +
>  util/oslib-posix.c            |   1 +
>  util/thread-context.c         | 279 ++++++++++++++++++++++++++++++++++
>  5 files changed, 355 insertions(+)
>  create mode 100644 include/qemu/thread-context.h
>  create mode 100644 util/thread-context.c
> 
> diff --git a/include/qemu/thread-context.h b/include/qemu/thread-context.h
> new file mode 100644
> index 0000000000..c799cbe7a1
> --- /dev/null
> +++ b/include/qemu/thread-context.h
> @@ -0,0 +1,58 @@
> +/*
> + * QEMU Thread Context
> + *
> + * Copyright Red Hat Inc., 2022
> + *
> + * Authors:
> + *  David Hildenbrand <david@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef SYSEMU_THREAD_CONTEXT_H
> +#define SYSEMU_THREAD_CONTEXT_H
> +
> +#include "qapi/qapi-types-machine.h"
> +#include "qemu/thread.h"
> +#include "qom/object.h"
> +
> +#define TYPE_THREAD_CONTEXT "thread-context"
> +OBJECT_DECLARE_TYPE(ThreadContext, ThreadContextClass,
> +                    THREAD_CONTEXT)
> +
> +struct ThreadContextClass {
> +    ObjectClass parent_class;
> +};
> +
> +struct ThreadContext {
> +    /* private */
> +    Object parent;
> +
> +    /* private */
> +    unsigned int thread_id;
> +    QemuThread thread;
> +
> +    /* Semaphore to wait for context thread action. */
> +    QemuSemaphore sem;
> +    /* Semaphore to wait for action in context thread. */
> +    QemuSemaphore sem_thread;
> +    /* Mutex to synchronize requests. */
> +    QemuMutex mutex;
> +
> +    /* Commands for the thread to execute. */
> +    int thread_cmd;
> +    void *thread_cmd_data;
> +
> +    /* CPU affinity bitmap used for initialization. */
> +    unsigned long *init_cpu_bitmap;
> +    int init_cpu_nbits;
> +};
> +
> +void thread_context_create_thread(ThreadContext *tc, QemuThread *thread,
> +                                  const char *name,
> +                                  void *(*start_routine)(void *), void *arg,
> +                                  int mode);
> +
> +#endif /* SYSEMU_THREAD_CONTEXT_H */
> diff --git a/qapi/qom.json b/qapi/qom.json
> index 80dd419b39..4775a333ed 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -830,6 +830,20 @@
>              'reduced-phys-bits': 'uint32',
>              '*kernel-hashes': 'bool' } }
>  
> +##
> +# @ThreadContextProperties:
> +#
> +# Properties for thread context objects.
> +#
> +# @cpu-affinity: the CPU affinity for all threads created in the thread
> +#                context (default: QEMU main thread affinity)
> +#
> +# Since: 7.2
> +##
> +{ 'struct': 'ThreadContextProperties',
> +  'data': { '*cpu-affinity': ['uint16'] } }
> +
> +
>  ##
>  # @ObjectType:
>  #
> @@ -882,6 +896,7 @@
>      { 'name': 'secret_keyring',
>        'if': 'CONFIG_SECRET_KEYRING' },
>      'sev-guest',
> +    'thread-context',
>      's390-pv-guest',
>      'throttle-group',
>      'tls-creds-anon',
> @@ -948,6 +963,7 @@
>        'secret_keyring':             { 'type': 'SecretKeyringProperties',
>                                        'if': 'CONFIG_SECRET_KEYRING' },
>        'sev-guest':                  'SevGuestProperties',
> +      'thread-context':             'ThreadContextProperties',
>        'throttle-group':             'ThrottleGroupProperties',
>        'tls-creds-anon':             'TlsCredsAnonProperties',
>        'tls-creds-psk':              'TlsCredsPskProperties',
> diff --git a/util/meson.build b/util/meson.build
> index 5e282130df..e97cd2d779 100644
> --- a/util/meson.build
> +++ b/util/meson.build
> @@ -1,4 +1,5 @@
>  util_ss.add(files('osdep.c', 'cutils.c', 'unicode.c', 'qemu-timer-common.c'))
> +util_ss.add(files('thread-context.c'))
>  if not config_host_data.get('CONFIG_ATOMIC64')
>    util_ss.add(files('atomic64.c'))
>  endif
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index 75e8cdcf3e..fa66f73bf8 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -42,6 +42,7 @@
>  #include "qemu/cutils.h"
>  #include "qemu/compiler.h"
>  #include "qemu/units.h"
> +#include "qemu/thread-context.h"
>  
>  #ifdef CONFIG_LINUX
>  #include <sys/syscall.h>
> diff --git a/util/thread-context.c b/util/thread-context.c
> new file mode 100644
> index 0000000000..dcd607c532
> --- /dev/null
> +++ b/util/thread-context.c
> @@ -0,0 +1,279 @@
> +/*
> + * QEMU Thread Context
> + *
> + * Copyright Red Hat Inc., 2022
> + *
> + * Authors:
> + *  David Hildenbrand <david@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/thread-context.h"
> +#include "qapi/error.h"
> +#include "qapi/qapi-builtin-visit.h"
> +#include "qapi/visitor.h"
> +#include "qemu/config-file.h"
> +#include "qapi/qapi-builtin-visit.h"
> +#include "qom/object_interfaces.h"
> +#include "qemu/module.h"
> +#include "qemu/bitmap.h"
> +
> +enum {
> +    TC_CMD_NONE = 0,
> +    TC_CMD_STOP,
> +    TC_CMD_NEW,
> +};
> +
> +typedef struct ThreadContextCmdNew {
> +    QemuThread *thread;
> +    const char *name;
> +    void *(*start_routine)(void *);
> +    void *arg;
> +    int mode;
> +} ThreadContextCmdNew;
> +
> +static void *thread_context_run(void *opaque)
> +{
> +    ThreadContext *tc = opaque;
> +
> +    tc->thread_id = qemu_get_thread_id();
> +    qemu_sem_post(&tc->sem);
> +
> +    while (true) {
> +        /*
> +         * Threads inherit the CPU affinity of the creating thread. For this
> +         * reason, we create new (especially short-lived) threads from our
> +         * persistent context thread.
> +         *
> +         * Especially when QEMU is not allowed to set the affinity itself,
> +         * management tools can simply set the affinity of the context thread
> +         * after creating the context, to have new threads created via
> +         * the context inherit the CPU affinity automatically.
> +         */
> +        switch (tc->thread_cmd) {
> +        case TC_CMD_NONE:
> +            break;
> +        case TC_CMD_STOP:
> +            tc->thread_cmd = TC_CMD_NONE;
> +            qemu_sem_post(&tc->sem);
> +            return NULL;
> +        case TC_CMD_NEW: {
> +            ThreadContextCmdNew *cmd_new = tc->thread_cmd_data;
> +
> +            qemu_thread_create(cmd_new->thread, cmd_new->name,
> +                               cmd_new->start_routine, cmd_new->arg,
> +                               cmd_new->mode);
> +            tc->thread_cmd = TC_CMD_NONE;
> +            tc->thread_cmd_data = NULL;
> +            qemu_sem_post(&tc->sem);
> +            break;
> +        }
> +        default:
> +            g_assert_not_reached();
> +        }
> +        qemu_sem_wait(&tc->sem_thread);
> +    }
> +}
> +
> +static void thread_context_set_cpu_affinity(Object *obj, Visitor *v,
> +                                            const char *name, void *opaque,
> +                                            Error **errp)
> +{
> +    ThreadContext *tc = THREAD_CONTEXT(obj);
> +    uint16List *l, *host_cpus = NULL;
> +    unsigned long *bitmap = NULL;
> +    int nbits = 0, ret;
> +    Error *err = NULL;
> +
> +    visit_type_uint16List(v, name, &host_cpus, &err);
> +    if (err) {
> +        error_propagate(errp, err);
> +        return;
> +    }
> +
> +    if (!host_cpus) {
> +        error_setg(errp, "CPU list is empty");
> +        goto out;
> +    }
> +
> +    for (l = host_cpus; l; l = l->next) {
> +        nbits = MAX(nbits, l->value + 1);
> +    }
> +    bitmap = bitmap_new(nbits);
> +    for (l = host_cpus; l; l = l->next) {
> +        set_bit(l->value, bitmap);
> +    }
> +
> +    if (tc->thread_id != -1) {
> +        /*
> +         * Note: we won't be adjusting the affinity of any thread that is still
> +         * around, but only the affinity of the context thread.
> +         */
> +        ret = qemu_thread_set_affinity(&tc->thread, bitmap, nbits);
> +        if (ret) {
> +            error_setg(errp, "Setting CPU affinity failed: %s", strerror(ret));
> +        }
> +    } else {
> +        tc->init_cpu_bitmap = bitmap;
> +        bitmap = NULL;
> +        tc->init_cpu_nbits = nbits;
> +    }
> +out:
> +    g_free(bitmap);
> +    qapi_free_uint16List(host_cpus);
> +}
> +
> +static void thread_context_get_cpu_affinity(Object *obj, Visitor *v,
> +                                            const char *name, void *opaque,
> +                                            Error **errp)
> +{
> +    unsigned long *bitmap, nbits, value;
> +    ThreadContext *tc = THREAD_CONTEXT(obj);
> +    uint16List *host_cpus = NULL;
> +    uint16List **tail = &host_cpus;
> +    int ret;
> +
> +    if (tc->thread_id == -1) {
> +        error_setg(errp, "Object not initialized yet");
> +        return;
> +    }
> +
> +    ret = qemu_thread_get_affinity(&tc->thread, &bitmap, &nbits);
> +    if (ret) {
> +        error_setg(errp, "Getting CPU affinity failed: %s", strerror(ret));
> +        return;
> +    }
> +
> +    value = find_first_bit(bitmap, nbits);
> +    while (value < nbits) {
> +        QAPI_LIST_APPEND(tail, value);
> +
> +        value = find_next_bit(bitmap, nbits, value + 1);
> +    }
> +    g_free(bitmap);
> +
> +    visit_type_uint16List(v, name, &host_cpus, errp);
> +    qapi_free_uint16List(host_cpus);
> +}
> +
> +static void thread_context_get_thread_id(Object *obj, Visitor *v,
> +                                         const char *name, void *opaque,
> +                                         Error **errp)
> +{
> +    ThreadContext *tc = THREAD_CONTEXT(obj);
> +    uint64_t value = tc->thread_id;
> +
> +    visit_type_uint64(v, name, &value, errp);
> +}
> +
> +static void thread_context_instance_complete(UserCreatable *uc, Error **errp)
> +{
> +    ThreadContext *tc = THREAD_CONTEXT(uc);
> +    char *thread_name;
> +    int ret;
> +
> +    thread_name = g_strdup_printf("TC %s",
> +                               object_get_canonical_path_component(OBJECT(uc)));
> +    qemu_thread_create(&tc->thread, thread_name, thread_context_run, tc,
> +                       QEMU_THREAD_JOINABLE);


This ^^^^ (see my comment to 7/7)

> +    g_free(thread_name);
> +
> +    /* Wait until initialization of the thread is done. */
> +    while (tc->thread_id == -1) {
> +        qemu_sem_wait(&tc->sem);
> +    }
> +
> +    if (tc->init_cpu_bitmap) {
> +        ret = qemu_thread_set_affinity(&tc->thread, tc->init_cpu_bitmap,
> +                                       tc->init_cpu_nbits);
> +        if (ret) {
> +            error_setg(errp, "Setting CPU affinity failed: %s", strerror(ret));
> +        }
> +        g_free(tc->init_cpu_bitmap);
> +        tc->init_cpu_bitmap = NULL;
> +    }
> +}

Michal



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext
  2022-08-05 11:01 ` Michal Prívozník
@ 2022-08-05 15:47   ` David Hildenbrand
  0 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2022-08-05 15:47 UTC (permalink / raw)
  To: Michal Prívozník, qemu-devel
  Cc: Igor Mammedov, Michael S. Tsirkin, Paolo Bonzini,
	Daniel P. Berrangé,
	Eduardo Habkost, Dr . David Alan Gilbert, Eric Blake,
	Markus Armbruster, Richard Henderson, Stefan Weil

> 
> I've timed 'virsh start' with a guest that has 47GB worth of 1GB
> hugepages and seen the startup time basically halved (from 10.5s to
> 5.6s). The host has 4 NUMA nodes and I'm pinning the guest onto two nodes.
> 
> I've written the libvirt counterpart (which I'll post as soon as these are
> merged). The way it works is that whenever .prealloc-threads= is to be
> used AND qemu is capable of thread-context, a thread-context object is
> generated before every memory-backend-*, like this:

One interesting corner case might be CPU-less NUMA nodes. Setting
the node-affinity would fail because there are no CPUs. Libvirt could
figure that out by testing whether the selected node(s) have CPUs.
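
For example (an illustrative shell sketch; the node number is made up),
sysfs exposes that information:

  $ cat /sys/devices/system/node/node2/cpulist
  (empty output -> node 2 is CPU-less, so node-affinity=[2] would fail)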

> 
> -object '{"qom-type":"thread-context","id":"tc-ram-node0","node-affinity":[2]}' \
> -object '{"qom-type":"memory-backend-memfd","id":"ram-node0","hugetlb":true,"hugetlbsize":1073741824,"share":true,"prealloc":true,"prealloc-threads":16,"size":21474836480,"host-nodes":[2],"policy":"bind","prealloc-context":"tc-ram-node0"}' \
> -numa node,nodeid=0,cpus=0,cpus=2,memdev=ram-node0 \
> -object '{"qom-type":"thread-context","id":"tc-ram-node1","node-affinity":[3]}' \
> -object '{"qom-type":"memory-backend-memfd","id":"ram-node1","hugetlb":true,"hugetlbsize":1073741824,"share":true,"prealloc":true,"prealloc-threads":16,"size":28991029248,"host-nodes":[3],"policy":"bind","prealloc-context":"tc-ram-node1"}' \
> 
> 
> Now, it's not visible in this snippet, but my code does not reuse
> thread-context objects. So if there's another memfd, it'll get its own TC:
> 
> -object '{"qom-type":"thread-context","id":"tc-memdimm0","node-affinity":[1]}' \
> -object '{"qom-type":"memory-backend-memfd","id":"memdimm0","hugetlb":true,"hugetlbsize":1073741824,"share":true,"prealloc":true,"prealloc-threads":16,"size":1073741824,"host-nodes":[1],"policy":"bind","prealloc-context":"tc-memdimm0"}' \
> 
> The reason is that the logic generating memory-backends is very complex,
> and separating out parts of it so that thread-context objects can be
> generated first and reused by those backends would inevitably lead to

Sounds like something we can work on later.

> regressions. I guess my question is whether it's a problem that libvirt
> would leave one additional thread, sleeping on a semaphore, for each
> memory-backend (iff prealloc-threads are used).

I guess in most setups we just don't care. Of course, with 256 DIMMs or
an endless number of nodes, we *might* care.


One optimization for some ordinary setups (not caring about NUMA-aware
preallocation during DIMM hotplug) would be to assign some dummy thread
context once prealloc has finished (e.g., once QEMU has initialized after
prealloc) and delete the original thread context along with the thread.
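
Roughly like this on the HMP monitor (a hypothetical sketch with made-up
ids; it assumes "prealloc-context" remains settable at runtime):

  (qemu) qom-set md1 prealloc-context tc-dummy
  (qemu) object_del tc1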

> 
> Although, if I read the code correctly, a thread-context object can be
> specified AFTER memory backends, because thread-contexts are parsed and
> created before backends anyway. Well, something to think over the weekend.

Yes, the command line order does not matter.

[...]

> 
> Reviewed-by: Michal Privoznik <mprivozn@redhat.com>

Thanks!

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 7/7] vl: Allow ThreadContext objects to be created before the sandbox option
  2022-08-05 11:01   ` Michal Prívozník
@ 2022-08-05 15:50     ` David Hildenbrand
  0 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2022-08-05 15:50 UTC (permalink / raw)
  To: Michal Prívozník, qemu-devel
  Cc: Igor Mammedov, Michael S. Tsirkin, Paolo Bonzini,
	Daniel P. Berrangé,
	Eduardo Habkost, Dr . David Alan Gilbert, Eric Blake,
	Markus Armbruster, Richard Henderson, Stefan Weil

On 05.08.22 13:01, Michal Prívozník wrote:
> On 7/21/22 14:07, David Hildenbrand wrote:
>> Currently, there is no way to configure a CPU affinity inside QEMU when
>> the sandbox option disables it for QEMU as a whole, for example, via:
>>     -sandbox enable=on,resourcecontrol=deny
>>
>> While ThreadContext objects can be created on the QEMU commandline and
>> the CPU affinity can be configured externally via the thread-id, this is
>> insufficient if a ThreadContext with a certain CPU affinity is already
>> required during QEMU startup, before we can intercept QEMU and
>> configure the CPU affinity.
>>
>> Blocking sched_setaffinity() was introduced in 24f8cdc57224 ("seccomp:
>> add resourcecontrol argument to command line") to keep a QEMU process
>> that is already running guest code from messing with the system. However,
>> that only matters once QEMU is running, not when the instance starting
>> QEMU explicitly requests a certain CPU affinity on the QEMU commandline.
>>
>> Right now, for NUMA-aware preallocation of memory backends used for initial
>> machine RAM, one has to:
>>
>> 1) Start QEMU with the memory-backend with "prealloc=off"
>> 2) Pause QEMU before it starts the guest (-S)
>> 3) Create ThreadContext, configure the CPU affinity using the thread-id
>> 4) Configure the ThreadContext as "prealloc-context" of the memory
>>    backend
>> 5) Trigger preallocation by setting "prealloc=on"
>>
>> To simplify this handling, especially for initial machine RAM,
>> allow creation of ThreadContext objects before parsing sandbox options,
>> such that the CPU affinity requested on the QEMU commandline alongside the
>> sandbox option can be set. As ThreadContext objects essentially only create
>> a persistent context thread and set the CPU affinity, this is easily
>> possible.
>>
>> With this change, we can create a ThreadContext with a CPU affinity on
>> the QEMU commandline and use it for preallocation of memory backends
>> glued to the machine (simplified example):
>>
>> qemu-system-x86_64 -m 1G \
>>  -object thread-context,id=tc1,cpu-affinity=3-4 \
>>  -object memory-backend-ram,id=pc.ram,size=1G,prealloc=on,prealloc-threads=2,prealloc-context=tc1 \
>>  -machine memory-backend=pc.ram \
>>  -S -monitor stdio -sandbox enable=on,resourcecontrol=deny
>>
>> And while we can query the current CPU affinity:
>>   (qemu) qom-get tc1 cpu-affinity
>>   [
>>       3,
>>       4
>>   ]
>>
>> We can no longer change it from QEMU directly:
>>   (qemu) qom-set tc1 cpu-affinity 1-2
>>   Error: Setting CPU affinity failed: Operation not permitted
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  softmmu/vl.c | 30 +++++++++++++++++++++++++++++-
>>  1 file changed, 29 insertions(+), 1 deletion(-)
>>
>> diff --git a/softmmu/vl.c b/softmmu/vl.c
>> index aabd82e09a..252732cf5d 100644
>> --- a/softmmu/vl.c
>> +++ b/softmmu/vl.c
>> @@ -1761,6 +1761,27 @@ static void object_option_parse(const char *optarg)
>>      visit_free(v);
>>  }
>>  
>> +/*
>> + * Very early object creation, before the sandbox options have been activated.
>> + */
>> +static bool object_create_pre_sandbox(const char *type)
>> +{
>> +    /*
>> +     * Objects should in general not get initialized "too early" without
>> +     * a reason. If you add one, state the reason in a comment!
>> +     */
>> +
>> +    /*
>> +     * Reason: -sandbox on,resourcecontrol=deny disallows setting CPU
>> +     * affinity of threads.
>> +     */
>> +    if (g_str_equal(type, "thread-context")) {
>> +        return true;
>> +    }
>> +
>> +    return false;
>> +}
>> +
>>  /*
>>   * Initial object creation happens before all other
>>   * QEMU data types are created. The majority of objects
>> @@ -1775,6 +1796,11 @@ static bool object_create_early(const char *type)
>>       * add one, state the reason in a comment!
>>       */
>>  
>> +    /* Reason: already created. */
>> +    if (object_create_pre_sandbox(type)) {
>> +        return false;
>> +    }
>> +
>>      /* Reason: property "chardev" */
>>      if (g_str_equal(type, "rng-egd") ||
>>          g_str_equal(type, "qtest")) {
>> @@ -1895,7 +1921,7 @@ static void qemu_create_early_backends(void)
>>   */
>>  static bool object_create_late(const char *type)
>>  {
>> -    return !object_create_early(type);
>> +    return !object_create_early(type) && !object_create_pre_sandbox(type);
>>  }
>>  
>>  static void qemu_create_late_backends(void)
>> @@ -2365,6 +2391,8 @@ static int process_runstate_actions(void *opaque, QemuOpts *opts, Error **errp)
>>  
>>  static void qemu_process_early_options(void)
>>  {
>> +    object_option_foreach_add(object_create_pre_sandbox);
>> +
>>  #ifdef CONFIG_SECCOMP
>>      QemuOptsList *olist = qemu_find_opts_err("sandbox", NULL);
>>      if (olist) {
> 
> Cool, this is processed before -sandbox, so threads can have their
> affinity set. However, it's also processed before -name debug-threads=on,
> which means that even though we try to set a thread name in 3/7, it's
> effectively dead code because name_threads from
> util/qemu-thread-posix.c is still false. Could we shift things a bit?
> E.g. like this:
> 
> static void qemu_process_early_options(void)
> {
>     qemu_opts_foreach(qemu_find_opts("name"),
>                       parse_name, NULL, &error_fatal);
> 
>     object_option_foreach_add(object_create_pre_sandbox);


Thanks for pointing that out. IMHO yes, there isn't too much magic to
parse_name().

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext
  2022-07-21 12:07 [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext David Hildenbrand
                   ` (8 preceding siblings ...)
  2022-08-05 11:01 ` Michal Prívozník
@ 2022-08-09 10:56 ` Joao Martins
  2022-08-09 18:06   ` David Hildenbrand
  2022-09-21 14:44 ` Michal Prívozník
  10 siblings, 1 reply; 21+ messages in thread
From: Joao Martins @ 2022-08-09 10:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Privoznik, Igor Mammedov, Michael S. Tsirkin,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Dr . David Alan Gilbert, Eric Blake,
	Markus Armbruster, Mark Kanda, Richard Henderson, Stefan Weil,
	qemu-devel

On 7/21/22 13:07, David Hildenbrand wrote:
> This is a follow-up on "util: NUMA aware memory preallocation" [1] by
> Michal.
> 
> Setting the CPU affinity of threads from inside QEMU usually isn't
> easily possible, because we don't want QEMU -- once started and running
> guest code -- to be able to mess up the system. QEMU disallows relevant
> syscalls using seccomp, such that any such invocation will fail.
> 
> Especially for memory preallocation in memory backends, the CPU affinity
> can significantly increase guest startup time, for example, when running
> large VMs backed by huge/gigantic pages, because of NUMA effects. For
> NUMA-aware preallocation, we have to set the CPU affinity, however:
> 
> (1) Once preallocation threads are created during preallocation, management
>     tools cannot intercept anymore to change the affinity. These threads
>     are created automatically on demand.
> (2) QEMU cannot easily set the CPU affinity itself.
> (3) The CPU affinity derived from the NUMA bindings of the memory backend
>     might not necessarily be exactly the CPUs we actually want to use
>     (e.g., CPU-less NUMA nodes, CPUs that are pinned/used for other VMs).
> 
> There is an easy "workaround". If we have a thread with the right CPU
> affinity, we can simply create new threads on demand via that prepared
> context. So, all we have to do is setup and create such a context ahead
> of time, to then configure preallocation to create new threads via that
> environment.
> 
> So, let's introduce a user-creatable "thread-context" object that
> essentially consists of a context thread used to create new threads.
> QEMU can either try setting the CPU affinity itself ("cpu-affinity",
> "node-affinity" property), or upper layers can extract the thread id
> ("thread-id" property) to configure it externally.
> 
> Make memory-backends consume a thread-context object
> (via the "prealloc-context" property) and use it when preallocating to
> create new threads with the desired CPU affinity. Further, to make it
> easier to use, allow creation of "thread-context" objects, including
> setting the CPU affinity directly from QEMU, *before* enabling the
> sandbox option.
> 
> 
> Quick test on a system with 2 NUMA nodes:
> 
> Without CPU affinity:
>     time qemu-system-x86_64 \
>         -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind \
>         -nographic -monitor stdio
> 
>     real    0m5.383s
>     real    0m3.499s
>     real    0m5.129s
>     real    0m4.232s
>     real    0m5.220s
>     real    0m4.288s
>     real    0m3.582s
>     real    0m4.305s
>     real    0m5.421s
>     real    0m4.502s
> 
>     -> It heavily depends on the scheduler CPU selection
> 
> With CPU affinity:
>     time qemu-system-x86_64 \
>         -object thread-context,id=tc1,node-affinity=0 \
>         -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind,prealloc-context=tc1 \
>         -sandbox enable=on,resourcecontrol=deny \
>         -nographic -monitor stdio
> 
>     real    0m1.959s
>     real    0m1.942s
>     real    0m1.943s
>     real    0m1.941s
>     real    0m1.948s
>     real    0m1.964s
>     real    0m1.949s
>     real    0m1.948s
>     real    0m1.941s
>     real    0m1.937s
> 
> On reasonably large VMs, the speedup can be quite significant.
> 
Really awesome work!

I am not sure I picked this up well while reading the series, but it seems to me that
prealloc is still serialized per memory-backend when solely configured via the command
line, right?

Meaning when we start prealloc, we wait until the memory-backend thread-context action is
completed (per memory-backend), even if other to-be-configured memory-backends will use a
thread-context on a separate set of pinned CPUs on another node ... and wouldn't in theory
"need" to wait until the former prealloc finishes?

Unless, as you alluded to in one of the last patches, we can pass these thread-contexts with
prealloc=off (and prealloc-context=NNN) while qemu is paused (-S) and have different QMP
clients set prealloc=on, and thus prealloc would happen concurrently per node?

We were thinking of extending it to leverage per-socket bandwidth, essentially to parallelize
this even further (we saw improvements with something like that but haven't tried this
series yet). Likely this is already possible with your work and I didn't pick up on it,
hence just making sure this is the case :)


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext
  2022-08-09 10:56 ` Joao Martins
@ 2022-08-09 18:06   ` David Hildenbrand
  2022-08-10  6:55     ` Michal Prívozník
  2022-08-11 10:50     ` Joao Martins
  0 siblings, 2 replies; 21+ messages in thread
From: David Hildenbrand @ 2022-08-09 18:06 UTC (permalink / raw)
  To: Joao Martins
  Cc: Michal Privoznik, Igor Mammedov, Michael S. Tsirkin,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Dr . David Alan Gilbert, Eric Blake,
	Markus Armbruster, Mark Kanda, Richard Henderson, Stefan Weil,
	qemu-devel

On 09.08.22 12:56, Joao Martins wrote:
> On 7/21/22 13:07, David Hildenbrand wrote:
>> This is a follow-up on "util: NUMA aware memory preallocation" [1] by
>> Michal.
>>
>> Setting the CPU affinity of threads from inside QEMU usually isn't
>> easily possible, because we don't want QEMU -- once started and running
>> guest code -- to be able to mess up the system. QEMU disallows relevant
>> syscalls using seccomp, such that any such invocation will fail.
>>
>> Especially for memory preallocation in memory backends, the CPU affinity
>> can significantly increase guest startup time, for example, when running
>> large VMs backed by huge/gigantic pages, because of NUMA effects. For
>> NUMA-aware preallocation, we have to set the CPU affinity, however:
>>
>> (1) Once preallocation threads are created during preallocation, management
>>     tools cannot intercept anymore to change the affinity. These threads
>>     are created automatically on demand.
>> (2) QEMU cannot easily set the CPU affinity itself.
>> (3) The CPU affinity derived from the NUMA bindings of the memory backend
>>     might not necessarily be exactly the CPUs we actually want to use
>>     (e.g., CPU-less NUMA nodes, CPUs that are pinned/used for other VMs).
>>
>> There is an easy "workaround". If we have a thread with the right CPU
>> affinity, we can simply create new threads on demand via that prepared
>> context. So, all we have to do is setup and create such a context ahead
>> of time, to then configure preallocation to create new threads via that
>> environment.
>>
>> So, let's introduce a user-creatable "thread-context" object that
>> essentially consists of a context thread used to create new threads.
>> QEMU can either try setting the CPU affinity itself ("cpu-affinity",
>> "node-affinity" property), or upper layers can extract the thread id
>> ("thread-id" property) to configure it externally.
>>
>> Make memory-backends consume a thread-context object
>> (via the "prealloc-context" property) and use it when preallocating to
>> create new threads with the desired CPU affinity. Further, to make it
>> easier to use, allow creation of "thread-context" objects, including
>> setting the CPU affinity directly from QEMU, *before* enabling the
>> sandbox option.
>>
>>
>> Quick test on a system with 2 NUMA nodes:
>>
>> Without CPU affinity:
>>     time qemu-system-x86_64 \
>>         -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind \
>>         -nographic -monitor stdio
>>
>>     real    0m5.383s
>>     real    0m3.499s
>>     real    0m5.129s
>>     real    0m4.232s
>>     real    0m5.220s
>>     real    0m4.288s
>>     real    0m3.582s
>>     real    0m4.305s
>>     real    0m5.421s
>>     real    0m4.502s
>>
>>     -> It heavily depends on the scheduler CPU selection
>>
>> With CPU affinity:
>>     time qemu-system-x86_64 \
>>         -object thread-context,id=tc1,node-affinity=0 \
>>         -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind,prealloc-context=tc1 \
>>         -sandbox enable=on,resourcecontrol=deny \
>>         -nographic -monitor stdio
>>
>>     real    0m1.959s
>>     real    0m1.942s
>>     real    0m1.943s
>>     real    0m1.941s
>>     real    0m1.948s
>>     real    0m1.964s
>>     real    0m1.949s
>>     real    0m1.948s
>>     real    0m1.941s
>>     real    0m1.937s
>>
>> On reasonably large VMs, the speedup can be quite significant.
>>
> Really awesome work!

Thanks!

> 
> I am not sure I picked this up well while reading the series, but it seems to me that
> prealloc is still serialized per memory-backend when solely configured via the command
> line, right?

I think it's serialized in any case, even when preallocation is
triggered manually using prealloc=on. I might be wrong, but any kind of
object creation or property changes should be serialized by the BQL.

In theory, we can "easily" preallocate in our helper --
qemu_prealloc_mem() -- concurrently when we don't have to bother about
handling SIGBUS -- that is, when the kernel supports
MADV_POPULATE_WRITE. Without MADV_POPULATE_WRITE on older kernels, we'll
serialize in there as well.

> 
> Meaning when we start prealloc, we wait until the memory-backend thread-context action is
> completed (per memory-backend), even if other to-be-configured memory-backends will use a
> thread-context on a separate set of pinned CPUs on another node ... and wouldn't in theory
> "need" to wait until the former prealloc finishes?

Yes. This series only takes care of NUMA-aware preallocation, but
doesn't preallocate multiple memory backends in parallel.

In theory, it would be quite easy to preallocate concurrently: simply
create the memory backend objects passed on the QEMU cmdline
concurrently from multiple threads.

In practice, we have to be careful with the BQL, I think. But it doesn't
sound horribly complicated to achieve that. We can perform all setup
synchronized under the BQL and only trigger the actual expensive
preallocation (-> qemu_prealloc_mem()), which we know is MT-safe, with
the BQL released.

> 
> Unless, as you alluded to in one of the last patches, we can pass these thread-contexts with
> prealloc=off (and prealloc-context=NNN) while qemu is paused (-S) and have different QMP
> clients set prealloc=on, and thus prealloc would happen concurrently per node?

I think we will serialize in any case when modifying properties. Can you
give it a shot and see if it would work as of now? I doubt it, but I
might be wrong.

> 
> We were thinking of extending it to leverage per-socket bandwidth, essentially to parallelize
> this even further (we saw improvements with something like that but haven't tried this
> series yet). Likely this is already possible with your work and I didn't pick up on it,
> hence just making sure this is the case :)

With this series, you can essentially tell QEMU which physical CPUs to
use for preallocating a given memory backend. But memory backends are
not created+preallocated concurrently yet.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext
  2022-08-09 18:06   ` David Hildenbrand
@ 2022-08-10  6:55     ` Michal Prívozník
  2022-08-11 10:50     ` Joao Martins
  1 sibling, 0 replies; 21+ messages in thread
From: Michal Prívozník @ 2022-08-10  6:55 UTC (permalink / raw)
  To: David Hildenbrand, Joao Martins
  Cc: Igor Mammedov, Michael S. Tsirkin, Paolo Bonzini,
	Daniel P. Berrangé,
	Eduardo Habkost, Dr . David Alan Gilbert, Eric Blake,
	Markus Armbruster, Mark Kanda, Richard Henderson, Stefan Weil,
	qemu-devel

On 8/9/22 20:06, David Hildenbrand wrote:
> On 09.08.22 12:56, Joao Martins wrote:
>> On 7/21/22 13:07, David Hildenbrand wrote:
>>> [... cover letter snipped; see the start of the thread ...]
>>>
>> Really awesome work!
> 
> Thanks!
> 
>>
>> I am not sure I picked up this well while reading the series, but it seems to me that
>> prealloc is still serialized on per memory-backend when solely configured by command-line
>> right?
> 
> I think it's serialized in any case, even when preallocation is
> triggered manually using prealloc=on. I might be wrong, but any kind of
> object creation or property changes should be serialized by the BQL.
> 
> In theory, we can "easily" preallocate in our helper --
> qemu_prealloc_mem() -- concurrently when we don't have to bother about
> handling SIGBUS -- that is, when the kernel supports
> MADV_POPULATE_WRITE. Without MADV_POPULATE_WRITE on older kernels, we'll
> serialize in there as well.
> 
>>
>> Meaning when we start prealloc we wait until the memory-backend thread-context action is
>> completed (per-memory-backend) even if other to-be-configured memory-backends will use a
>> thread-context on a separate set of pinned CPUs on another node ... and wouldn't in theory
>> "need" to wait until the former prealloc finishes?
> 
> Yes. This series only takes care of NUMA-aware preallocation, but
> doesn't preallocate multiple memory backends in parallel.
> 
> In theory, it would be quite easy to preallocate concurrently: simply
> create the memory backend objects passed on the QEMU cmdline
> concurrently from multiple threads.
> 
> In practice, we have to be careful I think with the BQL. But it doesn't
> sound horribly complicated to achieve that. We can perform all
> synchronized under the BQL and only trigger actual expensive
> preallocation (-> qemu_prealloc_mem()) , which we know is MT-safe, with
> released BQL.
> 
>>
>> Unless as you alluded in one of the last patches: we can pass these thread-contexts with
>> prealloc=off (and prealloc-context=NNN) while qemu is paused (-S) and have different QMP
>> clients set prealloc=on, and thus prealloc would happen concurrently per node?
> 
> I think we will serialize in any case when modifying properties. Can you
> give it a shot and see if it would work as of now? I doubt it, but I
> might be wrong.

Disclaimer: I don't know QEMU internals that well, so I might be wrong,
but even if libvirt went with -preconfig, wouldn't the monitor be stuck
for every 'set prealloc=on' call? In the end, the same set of functions
(e.g. touch_all_pages()) is called as if the configuration had been
provided via the cmd line, so I don't see why there should be any
difference between the cmd line and -preconfig.



<offtopic>
In the near future, as the number of cmd line arguments that libvirt
generates grows, libvirt might need to switch to -preconfig; it might
also need to if it has to query some values first and generate
configuration based on them. But for now, there are no plans.
</offtopic>

Michal



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext
  2022-08-09 18:06   ` David Hildenbrand
  2022-08-10  6:55     ` Michal Prívozník
@ 2022-08-11 10:50     ` Joao Martins
  1 sibling, 0 replies; 21+ messages in thread
From: Joao Martins @ 2022-08-11 10:50 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Privoznik, Igor Mammedov, Michael S. Tsirkin,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, Dr . David Alan Gilbert, Eric Blake,
	Markus Armbruster, Mark Kanda, Richard Henderson, Stefan Weil,
	qemu-devel

On 8/9/22 19:06, David Hildenbrand wrote:
> On 09.08.22 12:56, Joao Martins wrote:
>> On 7/21/22 13:07, David Hildenbrand wrote:
>>> [... cover letter snipped; see the start of the thread ...]
>>>
>> Really awesome work!
> 
> Thanks!
> 
>>
>> I am not sure I picked up this well while reading the series, but it seems to me that
>> prealloc is still serialized on per memory-backend when solely configured by command-line
>> right?
> 
> I think it's serialized in any case, even when preallocation is
> triggered manually using prealloc=on. I might be wrong, but any kind of
> object creation or property changes should be serialized by the BQL.
> 
> In theory, we can "easily" preallocate in our helper --
> qemu_prealloc_mem() -- concurrently when we don't have to bother about
> handling SIGBUS -- that is, when the kernel supports
> MADV_POPULATE_WRITE. Without MADV_POPULATE_WRITE on older kernels, we'll
> serialize in there as well.
> 
/me nods matches my understanding
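
Concretely, I picture something like the below per backend -- a rough
stand-alone sketch assuming a 5.14+ kernel, with made-up sizes and no
error handling (so not QEMU code):

#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23    /* Linux 5.14+ */
#endif

struct range {
    void *addr;
    size_t len;
};

/* With MADV_POPULATE_WRITE, failure is reported via the return value,
 * so per-backend threads can run concurrently without sharing a
 * SIGBUS handler. */
static void *populate(void *opaque)
{
    struct range *r = opaque;

    madvise(r->addr, r->len, MADV_POPULATE_WRITE);
    return NULL;
}

int main(void)
{
    size_t len = 1UL << 21;   /* made up; stands in for a backend */
    struct range r1 = {
        mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0), len
    };
    struct range r2 = {
        mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0), len
    };
    pthread_t t1, t2;

    pthread_create(&t1, NULL, populate, &r1);
    pthread_create(&t2, NULL, populate, &r2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}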

>>
>> Meaning when we start prealloc we wait until the memory-backend thread-context action is
>> completed (per-memory-backend) even if other to-be-configured memory-backends will use a
>> thread-context on a separate set of pinned CPUs on another node ... and wouldn't in theory
>> "need" to wait until the former prealloc finishes?
> 
> Yes. This series only takes care of NUMA-aware preallocation, but
> doesn't preallocate multiple memory backends in parallel.
> 
> In theory, it would be quite easy to preallocate concurrently: simply
> create the memory backend objects passed on the QEMU cmdline
> concurrently from multiple threads.
> 
Right

> In practice, we have to be careful I think with the BQL. But it doesn't
> sound horribly complicated to achieve that. We can perform all
> synchronized under the BQL and only trigger actual expensive
> preallocation (-> qemu_prealloc_mem()) , which we know is MT-safe, with
> released BQL.
> 
Right.

The small bit to take care of (AFAIU from the code) is deferring the
wait for all the memset threads to finish. The problem, on the command
line at least, is that we start memsetting but then immediately wait
for all the threads to finish; and because the context passed to memset
is allocated on the stack, we have to wait right there or we would lose
that state. So it's mainly a matter of making the tracking global and
deferring the point at which we join all the threads. With
MADV_POPULATE_WRITE we know we are OK, but I wonder whether SIGBUS
could be made to work too: register the handler only once, have it look
up the owning thread based on which address range covers the just-MCEd
address, and unregister it only once, after all prealloc threads have
finished.
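
Roughly what I have in mind for that (hypothetical sketch, not QEMU
code -- the table and function names are made up):

#include <signal.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_RANGES 64

/* Global instead of on each thread's stack, so the one handler can
 * find the owner of a faulting address. */
static struct {
    _Atomic(char *) addr;
    size_t len;
} prealloc_ranges[MAX_RANGES];

static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
    char *fault = info->si_addr;
    int i;

    for (i = 0; i < MAX_RANGES; i++) {
        char *start = atomic_load(&prealloc_ranges[i].addr);

        if (start && fault >= start &&
            fault < start + prealloc_ranges[i].len) {
            /* flag the owning thread so it can bail out cleanly */
            break;
        }
    }
}

int main(void)
{
    struct sigaction act = {
        .sa_sigaction = sigbus_handler,
        .sa_flags = SA_SIGINFO,
    };

    /* Registered once, before the first prealloc thread starts ... */
    sigaction(SIGBUS, &act, NULL);
    /* ... spawn threads, fill prealloc_ranges[], join them all ... */
    /* ... and only then restore the old SIGBUS disposition. */
    return 0;
}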

Via QMP, I am not sure the BQL is the only "problem"; there might be a
monitor lock there too, or some request-handling serialization whereby
only one thread processes and dispatches QMP requests. Simply releasing
the BQL prior to prealloc doesn't do much by itself, though it may
allow other work to proceed while the preallocation is happening.
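
For the cmdline path, the shape you describe above would look something
like this (stand-alone sketch, with a plain mutex standing in for the
BQL and made-up function names):

#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER; /* BQL stand-in */

static void populate_range(void *addr, size_t len)
{
    /* stands in for qemu_prealloc_mem(); MT-safe, expensive */
    (void)addr;
    (void)len;
}

static void create_and_prealloc_backend(void *addr, size_t len)
{
    pthread_mutex_lock(&big_lock);
    /* object creation and property setup stay serialized */
    pthread_mutex_unlock(&big_lock);

    populate_range(addr, len);          /* concurrent across backends */

    pthread_mutex_lock(&big_lock);
    /* publish completion/error state back under the lock */
    pthread_mutex_unlock(&big_lock);
}

int main(void)
{
    create_and_prealloc_backend(NULL, 0);   /* made-up call site */
    return 0;
}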

>>
>> Unless as you alluded in one of the last patches: we can pass these thread-contexts with
>> prealloc=off (and prealloc-context=NNN) while qemu is paused (-S) and have different QMP
>> clients set prealloc=on, and thus prealloc would happen concurrently per node?
> 
> I think we will serialize in any case when modifying properties. Can you
> give it a shot and see if it would work as of now? I doubt it, but I
> might be wrong.
> 

In a quick experiment with two monitors, each attempting to prealloc
one node in parallel, it takes the same 7 secs (on a small 2-node 128G
test) regardless. Your expectation indeed looks correct.

>>
>> We were thinking to extend it to leverage per socket bandwidth essentially to parallel
>> this even further (we saw improvements with something like that but haven't tried this
>> series yet). Likely this is already possible with your work and I didn't pick up on it,
>> hence just making sure this is the case :)
> 
> With this series, you can essentially tell QEMU which physical CPUs to
> use for preallocating a given memory backend. But memory backends are
> not created+preallocated concurrently yet.
> 
Yeap, thanks for the context/info.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext
  2022-07-21 12:07 [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext David Hildenbrand
                   ` (9 preceding siblings ...)
  2022-08-09 10:56 ` Joao Martins
@ 2022-09-21 14:44 ` Michal Prívozník
  2022-09-21 14:54   ` David Hildenbrand
  10 siblings, 1 reply; 21+ messages in thread
From: Michal Prívozník @ 2022-09-21 14:44 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Igor Mammedov, Michael S. Tsirkin, Paolo Bonzini,
	Daniel P. Berrangé,
	Eduardo Habkost, Dr . David Alan Gilbert, Eric Blake,
	Markus Armbruster, Richard Henderson, Stefan Weil

On 7/21/22 14:07, David Hildenbrand wrote:
>

Ping? Is there any plan for how to move forward? I have libvirt patches
ready to consume this, and I'd like to prune my old local branches :-)

Michal



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext
  2022-09-21 14:44 ` Michal Prívozník
@ 2022-09-21 14:54   ` David Hildenbrand
  2022-09-22  6:45     ` Michal Prívozník
  0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2022-09-21 14:54 UTC (permalink / raw)
  To: Michal Prívozník, qemu-devel
  Cc: Igor Mammedov, Michael S. Tsirkin, Paolo Bonzini,
	Daniel P. Berrangé,
	Eduardo Habkost, Dr . David Alan Gilbert, Eric Blake,
	Markus Armbruster, Richard Henderson, Stefan Weil

On 21.09.22 16:44, Michal Prívozník wrote:
> On 7/21/22 14:07, David Hildenbrand wrote:
>>
> 
> Ping? Is there any plan for how to move forward? I have libvirt patches
> ready to consume this, and I'd like to prune my old local branches :-)

Heh, I was thinking about this series just today. I was distracted with
all other kinds of stuff.

I'll move forward with this series later this week/early next week.

Thanks!

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext
  2022-09-21 14:54   ` David Hildenbrand
@ 2022-09-22  6:45     ` Michal Prívozník
  0 siblings, 0 replies; 21+ messages in thread
From: Michal Prívozník @ 2022-09-22  6:45 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Igor Mammedov, Michael S. Tsirkin, Paolo Bonzini,
	Daniel P. Berrangé,
	Eduardo Habkost, Dr . David Alan Gilbert, Eric Blake,
	Markus Armbruster, Richard Henderson, Stefan Weil

On 9/21/22 16:54, David Hildenbrand wrote:
> On 21.09.22 16:44, Michal Prívozník wrote:
>> On 7/21/22 14:07, David Hildenbrand wrote:
>>>
>>
>> Ping? Is there any plan for how to move forward? I have libvirt patches
>> ready to consume this, and I'd like to prune my old local branches :-)
> 
> Heh, I was thinking about this series just today. I was distracted with
> all other kinds of stuff.
> 
> I'll move forward with this series later this week/early next week.

No rush, it's only that I don't want this to fall into the void. Let me
know if I can help somehow. Meanwhile, here's my aforementioned branch:

https://gitlab.com/MichalPrivoznik/libvirt/-/tree/qemu_thread_context

I've made it so that a ThreadContext is generated whenever
.prealloc-threads AND .host-nodes are used (i.e. no XML-visible config
knob). And I'm generating a separate ThreadContext object for each
memory backend even though they could be reused, but IMO that's an
optimization that can be done later.

Michal



^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2022-09-22  6:48 UTC | newest]

Thread overview: 21+ messages
2022-07-21 12:07 [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext David Hildenbrand
2022-07-21 12:07 ` [PATCH RFC 1/7] util: Cleanup and rename os_mem_prealloc() David Hildenbrand
2022-07-21 12:07 ` [PATCH RFC 2/7] util: Introduce qemu_thread_set_affinity() and qemu_thread_get_affinity() David Hildenbrand
2022-07-21 12:07 ` [PATCH RFC 3/7] util: Introduce ThreadContext user-creatable object David Hildenbrand
2022-08-05 11:01   ` Michal Prívozník
2022-07-21 12:07 ` [PATCH RFC 4/7] util: Add write-only "node-affinity" property for ThreadContext David Hildenbrand
2022-07-21 12:07 ` [PATCH RFC 5/7] util: Make qemu_prealloc_mem() optionally consume a ThreadContext David Hildenbrand
2022-07-21 12:07 ` [PATCH RFC 6/7] hostmem: Allow for specifying a ThreadContext for preallocation David Hildenbrand
2022-07-21 12:07 ` [PATCH RFC 7/7] vl: Allow ThreadContext objects to be created before the sandbox option David Hildenbrand
2022-08-05 11:01   ` Michal Prívozník
2022-08-05 15:50     ` David Hildenbrand
2022-07-25 13:59 ` [PATCH RFC 0/7] hostmem: NUMA-aware memory preallocation using ThreadContext Michal Prívozník
2022-08-05 11:01 ` Michal Prívozník
2022-08-05 15:47   ` David Hildenbrand
2022-08-09 10:56 ` Joao Martins
2022-08-09 18:06   ` David Hildenbrand
2022-08-10  6:55     ` Michal Prívozník
2022-08-11 10:50     ` Joao Martins
2022-09-21 14:44 ` Michal Prívozník
2022-09-21 14:54   ` David Hildenbrand
2022-09-22  6:45     ` Michal Prívozník
