qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/6] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
@ 2021-07-22 12:36 David Hildenbrand
  2021-07-22 12:36 ` [PATCH v2 1/6] " David Hildenbrand
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: David Hildenbrand @ 2021-07-22 12:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: Pankaj Gupta, Daniel P . Berrangé,
	Eduardo Habkost, Michael S. Tsirkin, David Hildenbrand,
	Dr . David Alan Gilbert, Igor Mammedov, Paolo Bonzini,
	Marek Kedzierski

#1 adds support for MADV_POPULATE_WRITE, #2 cleans up the code to avoid
global variables and prepare for concurrency, #3 and #4 optimize thread
handling, #5 makes os_mem_prealloc() safe to be called from multiple
threads concurrently and #6 makes the SIGBUS handler coexist cleanly with
the MCE SIGBUS handler under Linux.

Details regarding MADV_POPULATE_WRITE can be found in introducing upstream
Linux commit 4ca9b3859dac ("mm/madvise: introduce
MADV_POPULATE_(READ|WRITE) to prefault page tables") and in the latest man
page patch [1].

v1 -> v2:
- "util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()"
-- Handle thread with no data to initialize
-- Always set use_madv_populate_write properly
-- Add comment regarding future fallocate() optimization
- "util/oslib-posix: Don't create too many threads with small memory or
   little pages"
-- Added
- "util/oslib-posix: Avoid creating a single thread with
   MADV_POPULATE_WRITE"
-- Added
- "util/oslib-posix: Support concurrent os_mem_prealloc() invocation"
-- Add missing g_once_init_leave()
-- Move g_once_init_enter() to the place where it is actually needed
- "util/oslib-posix: Forward SIGBUS to MCE handler under Linux"
-- Added

[1] https://lkml.kernel.org/r/20210712083917.16361-1-david@redhat.com

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Daniel P. Berrangé <berrange@redhat.com>

David Hildenbrand (6):
  util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
  util/oslib-posix: Introduce and use MemsetContext for
    touch_all_pages()
  util/oslib-posix: Don't create too many threads with small memory or
    little pages
  util/oslib-posix: Avoid creating a single thread with
    MADV_POPULATE_WRITE
  util/oslib-posix: Support concurrent os_mem_prealloc() invocation
  util/oslib-posix: Forward SIGBUS to MCE handler under Linux

 include/qemu/osdep.h |   7 ++
 softmmu/cpus.c       |   4 +
 util/oslib-posix.c   | 220 +++++++++++++++++++++++++++++++++----------
 3 files changed, 181 insertions(+), 50 deletions(-)

-- 
2.31.1



^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v2 1/6] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
  2021-07-22 12:36 [PATCH v2 0/6] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc() David Hildenbrand
@ 2021-07-22 12:36 ` David Hildenbrand
  2021-07-22 13:31   ` Daniel P. Berrangé
  2021-07-22 12:36 ` [PATCH v2 2/6] util/oslib-posix: Introduce and use MemsetContext for touch_all_pages() David Hildenbrand
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2021-07-22 12:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: Pankaj Gupta, Daniel P . Berrangé,
	Eduardo Habkost, Michael S. Tsirkin, David Hildenbrand,
	Dr . David Alan Gilbert, Pankaj Gupta, Igor Mammedov,
	Paolo Bonzini, Marek Kedzierski

Let's sense support and use it for preallocation. MADV_POPULATE_WRITE
does not require a SIGBUS handler, doesn't actually touch page content,
and avoids context switches; it is, therefore, faster and easier to handle
than our current approach.

While MADV_POPULATE_WRITE is, in general, faster than manual
prefaulting, and especially faster with 4k pages, there is still value in
prefaulting using multiple threads to speed up preallocation.

More details on MADV_POPULATE_WRITE can be found in the Linux commit
4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault
page tables") and in the man page proposal [1].

[1] https://lkml.kernel.org/r/20210712083917.16361-1-david@redhat.com

This resolves the TODO in do_touch_pages().

In the future, we might want to look into using fallocate(), eventually
combined with MADV_POPULATE_READ, when dealing with shared file
mappings.

Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/qemu/osdep.h |  7 ++++
 util/oslib-posix.c   | 88 +++++++++++++++++++++++++++++++++-----------
 2 files changed, 74 insertions(+), 21 deletions(-)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 60718fc342..d1660d67fa 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -471,6 +471,11 @@ static inline void qemu_cleanup_generic_vfree(void *p)
 #else
 #define QEMU_MADV_REMOVE QEMU_MADV_DONTNEED
 #endif
+#ifdef MADV_POPULATE_WRITE
+#define QEMU_MADV_POPULATE_WRITE MADV_POPULATE_WRITE
+#else
+#define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID
+#endif
 
 #elif defined(CONFIG_POSIX_MADVISE)
 
@@ -484,6 +489,7 @@ static inline void qemu_cleanup_generic_vfree(void *p)
 #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_REMOVE QEMU_MADV_DONTNEED
+#define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID
 
 #else /* no-op */
 
@@ -497,6 +503,7 @@ static inline void qemu_cleanup_generic_vfree(void *p)
 #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
+#define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID
 
 #endif
 
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index e8bdb02e1d..1cb80bf94c 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -484,10 +484,6 @@ static void *do_touch_pages(void *arg)
              *
              * 'volatile' to stop compiler optimizing this away
              * to a no-op
-             *
-             * TODO: get a better solution from kernel so we
-             * don't need to write at all so we don't cause
-             * wear on the storage backing the region...
              */
             *(volatile char *)addr = *addr;
             addr += hpagesize;
@@ -497,6 +493,31 @@ static void *do_touch_pages(void *arg)
     return NULL;
 }
 
+static void *do_madv_populate_write_pages(void *arg)
+{
+    MemsetThread *memset_args = (MemsetThread *)arg;
+    const size_t size = memset_args->numpages * memset_args->hpagesize;
+    char * const addr = memset_args->addr;
+    int ret;
+
+    if (!size) {
+        return NULL;
+    }
+
+    /* See do_touch_pages(). */
+    qemu_mutex_lock(&page_mutex);
+    while (!threads_created_flag) {
+        qemu_cond_wait(&page_cond, &page_mutex);
+    }
+    qemu_mutex_unlock(&page_mutex);
+
+    ret = qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE);
+    if (ret) {
+        memset_thread_failed = true;
+    }
+    return NULL;
+}
+
 static inline int get_memset_num_threads(int smp_cpus)
 {
     long host_procs = sysconf(_SC_NPROCESSORS_ONLN);
@@ -510,10 +531,11 @@ static inline int get_memset_num_threads(int smp_cpus)
 }
 
 static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
-                            int smp_cpus)
+                            int smp_cpus, bool use_madv_populate_write)
 {
     static gsize initialized = 0;
     size_t numpages_per_thread, leftover;
+    void *(*touch_fn)(void *);
     char *addr = area;
     int i = 0;
 
@@ -523,6 +545,12 @@ static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
         g_once_init_leave(&initialized, 1);
     }
 
+    if (use_madv_populate_write) {
+        touch_fn = do_madv_populate_write_pages;
+    } else {
+        touch_fn = do_touch_pages;
+    }
+
     memset_thread_failed = false;
     threads_created_flag = false;
     memset_num_threads = get_memset_num_threads(smp_cpus);
@@ -534,7 +562,7 @@ static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
         memset_thread[i].numpages = numpages_per_thread + (i < leftover);
         memset_thread[i].hpagesize = hpagesize;
         qemu_thread_create(&memset_thread[i].pgthread, "touch_pages",
-                           do_touch_pages, &memset_thread[i],
+                           touch_fn, &memset_thread[i],
                            QEMU_THREAD_JOINABLE);
         addr += memset_thread[i].numpages * hpagesize;
     }
@@ -553,6 +581,12 @@ static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
     return memset_thread_failed;
 }
 
+static bool madv_populate_write_possible(char *area, size_t pagesize)
+{
+    return !qemu_madvise(area, pagesize, QEMU_MADV_POPULATE_WRITE) ||
+           errno != EINVAL;
+}
+
 void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
                      Error **errp)
 {
@@ -560,29 +594,41 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
     struct sigaction act, oldact;
     size_t hpagesize = qemu_fd_getpagesize(fd);
     size_t numpages = DIV_ROUND_UP(memory, hpagesize);
+    bool use_madv_populate_write;
 
-    memset(&act, 0, sizeof(act));
-    act.sa_handler = &sigbus_handler;
-    act.sa_flags = 0;
-
-    ret = sigaction(SIGBUS, &act, &oldact);
-    if (ret) {
-        error_setg_errno(errp, errno,
-            "os_mem_prealloc: failed to install signal handler");
-        return;
+    /*
+     * Sense on every invocation, as MADV_POPULATE_WRITE cannot be used for
+     * some special mappings, such as mapping /dev/mem.
+     */
+    use_madv_populate_write = madv_populate_write_possible(area, hpagesize);
+
+    if (!use_madv_populate_write) {
+        memset(&act, 0, sizeof(act));
+        act.sa_handler = &sigbus_handler;
+        act.sa_flags = 0;
+
+        ret = sigaction(SIGBUS, &act, &oldact);
+        if (ret) {
+            error_setg_errno(errp, errno,
+                "os_mem_prealloc: failed to install signal handler");
+            return;
+        }
     }
 
     /* touch pages simultaneously */
-    if (touch_all_pages(area, hpagesize, numpages, smp_cpus)) {
+    if (touch_all_pages(area, hpagesize, numpages, smp_cpus,
+                        use_madv_populate_write)) {
         error_setg(errp, "os_mem_prealloc: Insufficient free host memory "
             "pages available to allocate guest RAM");
     }
 
-    ret = sigaction(SIGBUS, &oldact, NULL);
-    if (ret) {
-        /* Terminate QEMU since it can't recover from error */
-        perror("os_mem_prealloc: failed to reinstall signal handler");
-        exit(1);
+    if (!use_madv_populate_write) {
+        ret = sigaction(SIGBUS, &oldact, NULL);
+        if (ret) {
+            /* Terminate QEMU since it can't recover from error */
+            perror("os_mem_prealloc: failed to reinstall signal handler");
+            exit(1);
+        }
     }
 }
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 2/6] util/oslib-posix: Introduce and use MemsetContext for touch_all_pages()
  2021-07-22 12:36 [PATCH v2 0/6] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc() David Hildenbrand
  2021-07-22 12:36 ` [PATCH v2 1/6] " David Hildenbrand
@ 2021-07-22 12:36 ` David Hildenbrand
  2021-07-22 12:36 ` [PATCH v2 3/6] util/oslib-posix: Don't create too many threads with small memory or little pages David Hildenbrand
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: David Hildenbrand @ 2021-07-22 12:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: Pankaj Gupta, Daniel P . Berrangé,
	Eduardo Habkost, Michael S. Tsirkin, David Hildenbrand,
	Dr . David Alan Gilbert, Igor Mammedov, Paolo Bonzini,
	Marek Kedzierski

Let's minimize the number of global variables to prepare for
os_mem_prealloc() getting called concurrently and make the code a bit
easier to read.

The only consumer that really needs a global variable is the sigbus
handler, which will require protection via a mutex in the future either way
as we cannot concurrently mess with the SIGBUS handler.

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 util/oslib-posix.c | 81 ++++++++++++++++++++++++++++------------------
 1 file changed, 50 insertions(+), 31 deletions(-)

diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 1cb80bf94c..2ae6c3aaab 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -73,22 +73,30 @@
 
 #define MAX_MEM_PREALLOC_THREAD_COUNT 16
 
+struct MemsetThread;
+
+typedef struct MemsetContext {
+    bool all_threads_created;
+    bool any_thread_failed;
+    struct MemsetThread *threads;
+    int num_threads;
+} MemsetContext;
+
 struct MemsetThread {
     char *addr;
     size_t numpages;
     size_t hpagesize;
     QemuThread pgthread;
     sigjmp_buf env;
+    MemsetContext *context;
 };
 typedef struct MemsetThread MemsetThread;
 
-static MemsetThread *memset_thread;
-static int memset_num_threads;
-static bool memset_thread_failed;
+/* used by sigbus_handler() */
+static MemsetContext *sigbus_memset_context;
 
 static QemuMutex page_mutex;
 static QemuCond page_cond;
-static bool threads_created_flag;
 
 int qemu_get_thread_id(void)
 {
@@ -439,10 +447,13 @@ const char *qemu_get_exec_dir(void)
 static void sigbus_handler(int signal)
 {
     int i;
-    if (memset_thread) {
-        for (i = 0; i < memset_num_threads; i++) {
-            if (qemu_thread_is_self(&memset_thread[i].pgthread)) {
-                siglongjmp(memset_thread[i].env, 1);
+
+    if (sigbus_memset_context) {
+        for (i = 0; i < sigbus_memset_context->num_threads; i++) {
+            MemsetThread *thread = &sigbus_memset_context->threads[i];
+
+            if (qemu_thread_is_self(&thread->pgthread)) {
+                siglongjmp(thread->env, 1);
             }
         }
     }
@@ -459,7 +470,7 @@ static void *do_touch_pages(void *arg)
      * clearing until all threads have been created.
      */
     qemu_mutex_lock(&page_mutex);
-    while(!threads_created_flag){
+    while (!memset_args->context->all_threads_created) {
         qemu_cond_wait(&page_cond, &page_mutex);
     }
     qemu_mutex_unlock(&page_mutex);
@@ -470,7 +481,7 @@ static void *do_touch_pages(void *arg)
     pthread_sigmask(SIG_UNBLOCK, &set, &oldset);
 
     if (sigsetjmp(memset_args->env, 1)) {
-        memset_thread_failed = true;
+        memset_args->context->any_thread_failed = true;
     } else {
         char *addr = memset_args->addr;
         size_t numpages = memset_args->numpages;
@@ -506,14 +517,14 @@ static void *do_madv_populate_write_pages(void *arg)
 
     /* See do_touch_pages(). */
     qemu_mutex_lock(&page_mutex);
-    while (!threads_created_flag) {
+    while (!memset_args->context->all_threads_created) {
         qemu_cond_wait(&page_cond, &page_mutex);
     }
     qemu_mutex_unlock(&page_mutex);
 
     ret = qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE);
     if (ret) {
-        memset_thread_failed = true;
+        memset_args->context->any_thread_failed = true;
     }
     return NULL;
 }
@@ -534,6 +545,9 @@ static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
                             int smp_cpus, bool use_madv_populate_write)
 {
     static gsize initialized = 0;
+    MemsetContext context = {
+        .num_threads = get_memset_num_threads(smp_cpus),
+    };
     size_t numpages_per_thread, leftover;
     void *(*touch_fn)(void *);
     char *addr = area;
@@ -551,34 +565,39 @@ static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
         touch_fn = do_touch_pages;
     }
 
-    memset_thread_failed = false;
-    threads_created_flag = false;
-    memset_num_threads = get_memset_num_threads(smp_cpus);
-    memset_thread = g_new0(MemsetThread, memset_num_threads);
-    numpages_per_thread = numpages / memset_num_threads;
-    leftover = numpages % memset_num_threads;
-    for (i = 0; i < memset_num_threads; i++) {
-        memset_thread[i].addr = addr;
-        memset_thread[i].numpages = numpages_per_thread + (i < leftover);
-        memset_thread[i].hpagesize = hpagesize;
-        qemu_thread_create(&memset_thread[i].pgthread, "touch_pages",
-                           touch_fn, &memset_thread[i],
+    context.threads = g_new0(MemsetThread, context.num_threads);
+    numpages_per_thread = numpages / context.num_threads;
+    leftover = numpages % context.num_threads;
+    for (i = 0; i < context.num_threads; i++) {
+        context.threads[i].addr = addr;
+        context.threads[i].numpages = numpages_per_thread + (i < leftover);
+        context.threads[i].hpagesize = hpagesize;
+        context.threads[i].context = &context;
+        qemu_thread_create(&context.threads[i].pgthread, "touch_pages",
+                           touch_fn, &context.threads[i],
                            QEMU_THREAD_JOINABLE);
-        addr += memset_thread[i].numpages * hpagesize;
+        addr += context.threads[i].numpages * hpagesize;
+    }
+
+    if (!use_madv_populate_write) {
+        sigbus_memset_context = &context;
     }
 
     qemu_mutex_lock(&page_mutex);
-    threads_created_flag = true;
+    context.all_threads_created = true;
     qemu_cond_broadcast(&page_cond);
     qemu_mutex_unlock(&page_mutex);
 
-    for (i = 0; i < memset_num_threads; i++) {
-        qemu_thread_join(&memset_thread[i].pgthread);
+    for (i = 0; i < context.num_threads; i++) {
+        qemu_thread_join(&context.threads[i].pgthread);
+    }
+
+    if (!use_madv_populate_write) {
+        sigbus_memset_context = NULL;
     }
-    g_free(memset_thread);
-    memset_thread = NULL;
+    g_free(context.threads);
 
-    return memset_thread_failed;
+    return context.any_thread_failed;
 }
 
 static bool madv_populate_write_possible(char *area, size_t pagesize)
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 3/6] util/oslib-posix: Don't create too many threads with small memory or little pages
  2021-07-22 12:36 [PATCH v2 0/6] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc() David Hildenbrand
  2021-07-22 12:36 ` [PATCH v2 1/6] " David Hildenbrand
  2021-07-22 12:36 ` [PATCH v2 2/6] util/oslib-posix: Introduce and use MemsetContext for touch_all_pages() David Hildenbrand
@ 2021-07-22 12:36 ` David Hildenbrand
  2021-07-27 19:01   ` Dr. David Alan Gilbert
  2021-07-28 11:23   ` Pankaj Gupta
  2021-07-22 12:36 ` [PATCH v2 4/6] util/oslib-posix: Avoid creating a single thread with MADV_POPULATE_WRITE David Hildenbrand
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 16+ messages in thread
From: David Hildenbrand @ 2021-07-22 12:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: Pankaj Gupta, Daniel P . Berrangé,
	Eduardo Habkost, Michael S. Tsirkin, David Hildenbrand,
	Dr . David Alan Gilbert, Igor Mammedov, Paolo Bonzini,
	Marek Kedzierski

Let's limit the number of threads to something sane, especially that
- We don't have more threads than the number of pages we have
- We don't have threads that initialize small (< 64 MiB) memory

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 util/oslib-posix.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 2ae6c3aaab..a1d309d495 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -40,6 +40,7 @@
 #include <libgen.h>
 #include "qemu/cutils.h"
 #include "qemu/compiler.h"
+#include "qemu/units.h"
 
 #ifdef CONFIG_LINUX
 #include <sys/syscall.h>
@@ -529,7 +530,8 @@ static void *do_madv_populate_write_pages(void *arg)
     return NULL;
 }
 
-static inline int get_memset_num_threads(int smp_cpus)
+static inline int get_memset_num_threads(size_t hpagesize, size_t numpages,
+                                         int smp_cpus)
 {
     long host_procs = sysconf(_SC_NPROCESSORS_ONLN);
     int ret = 1;
@@ -537,6 +539,12 @@ static inline int get_memset_num_threads(int smp_cpus)
     if (host_procs > 0) {
         ret = MIN(MIN(host_procs, MAX_MEM_PREALLOC_THREAD_COUNT), smp_cpus);
     }
+
+    /* Especially with gigantic pages, don't create more threads than pages. */
+    ret = MIN(ret, numpages);
+    /* Don't start threads to prealloc comparatively little memory. */
+    ret = MIN(ret, MAX(1, hpagesize * numpages / (64 * MiB)));
+
     /* In case sysconf() fails, we fall back to single threaded */
     return ret;
 }
@@ -546,7 +554,7 @@ static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
 {
     static gsize initialized = 0;
     MemsetContext context = {
-        .num_threads = get_memset_num_threads(smp_cpus),
+        .num_threads = get_memset_num_threads(hpagesize, numpages, smp_cpus),
     };
     size_t numpages_per_thread, leftover;
     void *(*touch_fn)(void *);
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 4/6] util/oslib-posix: Avoid creating a single thread with MADV_POPULATE_WRITE
  2021-07-22 12:36 [PATCH v2 0/6] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc() David Hildenbrand
                   ` (2 preceding siblings ...)
  2021-07-22 12:36 ` [PATCH v2 3/6] util/oslib-posix: Don't create too many threads with small memory or little pages David Hildenbrand
@ 2021-07-22 12:36 ` David Hildenbrand
  2021-07-27 19:04   ` Dr. David Alan Gilbert
  2021-07-28 11:32   ` Pankaj Gupta
  2021-07-22 12:36 ` [PATCH v2 5/6] util/oslib-posix: Support concurrent os_mem_prealloc() invocation David Hildenbrand
  2021-07-22 12:36 ` [PATCH v2 6/6] util/oslib-posix: Forward SIGBUS to MCE handler under Linux David Hildenbrand
  5 siblings, 2 replies; 16+ messages in thread
From: David Hildenbrand @ 2021-07-22 12:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: Pankaj Gupta, Daniel P . Berrangé,
	Eduardo Habkost, Michael S. Tsirkin, David Hildenbrand,
	Dr . David Alan Gilbert, Igor Mammedov, Paolo Bonzini,
	Marek Kedzierski

Let's simplify the case when we only want a single thread and don't have
to mess with signal handlers.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 util/oslib-posix.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index a1d309d495..1483e985c6 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -568,6 +568,14 @@ static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
     }
 
     if (use_madv_populate_write) {
+        /* Avoid creating a single thread for MADV_POPULATE_WRITE */
+        if (context.num_threads == 1) {
+            if (qemu_madvise(area, hpagesize * numpages,
+                             QEMU_MADV_POPULATE_WRITE)) {
+                return true;
+            }
+            return false;
+        }
         touch_fn = do_madv_populate_write_pages;
     } else {
         touch_fn = do_touch_pages;
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 5/6] util/oslib-posix: Support concurrent os_mem_prealloc() invocation
  2021-07-22 12:36 [PATCH v2 0/6] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc() David Hildenbrand
                   ` (3 preceding siblings ...)
  2021-07-22 12:36 ` [PATCH v2 4/6] util/oslib-posix: Avoid creating a single thread with MADV_POPULATE_WRITE David Hildenbrand
@ 2021-07-22 12:36 ` David Hildenbrand
  2021-07-22 12:36 ` [PATCH v2 6/6] util/oslib-posix: Forward SIGBUS to MCE handler under Linux David Hildenbrand
  5 siblings, 0 replies; 16+ messages in thread
From: David Hildenbrand @ 2021-07-22 12:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: Pankaj Gupta, Daniel P . Berrangé,
	Eduardo Habkost, Michael S. Tsirkin, David Hildenbrand,
	Dr . David Alan Gilbert, Pankaj Gupta, Igor Mammedov,
	Paolo Bonzini, Marek Kedzierski

Add a mutex to protect the SIGBUS case, as we cannot mess concurrently
with the sigbus handler and we have to manage the global variable
sigbus_memset_context. The MADV_POPULATE_WRITE path can run
concurrently.

Note that page_mutex and page_cond are shared between concurrent
invocations, which shouldn't be a problem.

This is a preparation for future virtio-mem prealloc code, which will call
os_mem_prealloc() asynchronously from an iothread when handling guest
requests.

Add a comment that messing with the SIGBUS handler is frowned upon and
can result in problems we fortunately haven't seen so far. Note that
forwarding signals to the already installed SIGBUS handler isn't clean
either, as that one might modify the SIGBUS handler again.

Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 util/oslib-posix.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 1483e985c6..7c75848a67 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -95,6 +95,7 @@ typedef struct MemsetThread MemsetThread;
 
 /* used by sigbus_handler() */
 static MemsetContext *sigbus_memset_context;
+static QemuMutex sigbus_mutex;
 
 static QemuMutex page_mutex;
 static QemuCond page_cond;
@@ -625,6 +626,7 @@ static bool madv_populate_write_possible(char *area, size_t pagesize)
 void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
                      Error **errp)
 {
+    static gsize initialized;
     int ret;
     struct sigaction act, oldact;
     size_t hpagesize = qemu_fd_getpagesize(fd);
@@ -638,6 +640,12 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
     use_madv_populate_write = madv_populate_write_possible(area, hpagesize);
 
     if (!use_madv_populate_write) {
+        if (g_once_init_enter(&initialized)) {
+            qemu_mutex_init(&sigbus_mutex);
+            g_once_init_leave(&initialized, 1);
+        }
+
+        qemu_mutex_lock(&sigbus_mutex);
         memset(&act, 0, sizeof(act));
         act.sa_handler = &sigbus_handler;
         act.sa_flags = 0;
@@ -664,6 +672,7 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
             perror("os_mem_prealloc: failed to reinstall signal handler");
             exit(1);
         }
+        qemu_mutex_unlock(&sigbus_mutex);
     }
 }
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 6/6] util/oslib-posix: Forward SIGBUS to MCE handler under Linux
  2021-07-22 12:36 [PATCH v2 0/6] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc() David Hildenbrand
                   ` (4 preceding siblings ...)
  2021-07-22 12:36 ` [PATCH v2 5/6] util/oslib-posix: Support concurrent os_mem_prealloc() invocation David Hildenbrand
@ 2021-07-22 12:36 ` David Hildenbrand
  5 siblings, 0 replies; 16+ messages in thread
From: David Hildenbrand @ 2021-07-22 12:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: Pankaj Gupta, Daniel P . Berrangé,
	Eduardo Habkost, Michael S. Tsirkin, David Hildenbrand,
	Dr . David Alan Gilbert, Igor Mammedov, Paolo Bonzini,
	Marek Kedzierski

Temporarily modifying the SIGBUS handler is really nasty, as we might be
unlucky and receive an MCE SIGBUS while having our handler registered.
Unfortunately, there is no way around messing with SIGBUS when
MADV_POPULATE_WRITE is not applicable or not around.

Let's forward SIGBUS that don't belong to us to the already registered
handler and document the situation.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 softmmu/cpus.c     |  4 ++++
 util/oslib-posix.c | 36 +++++++++++++++++++++++++++++++++---
 2 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/softmmu/cpus.c b/softmmu/cpus.c
index 071085f840..23bca46b07 100644
--- a/softmmu/cpus.c
+++ b/softmmu/cpus.c
@@ -352,6 +352,10 @@ static void qemu_init_sigbus(void)
 {
     struct sigaction action;
 
+    /*
+     * ALERT: when modifying this, take care that SIGBUS forwarding in
+     * os_mem_prealloc() will continue working as expected.
+     */
     memset(&action, 0, sizeof(action));
     action.sa_flags = SA_SIGINFO;
     action.sa_sigaction = sigbus_handler;
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 7c75848a67..4f10108600 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -95,6 +95,7 @@ typedef struct MemsetThread MemsetThread;
 
 /* used by sigbus_handler() */
 static MemsetContext *sigbus_memset_context;
+struct sigaction sigbus_oldact;
 static QemuMutex sigbus_mutex;
 
 static QemuMutex page_mutex;
@@ -446,7 +447,11 @@ const char *qemu_get_exec_dir(void)
     return exec_dir;
 }
 
+#ifdef CONFIG_LINUX
+static void sigbus_handler(int signal, siginfo_t *siginfo, void *ctx)
+#else /* CONFIG_LINUX */
 static void sigbus_handler(int signal)
+#endif /* CONFIG_LINUX */
 {
     int i;
 
@@ -459,6 +464,26 @@ static void sigbus_handler(int signal)
             }
         }
     }
+
+#ifdef CONFIG_LINUX
+    /*
+     * We assume that the MCE SIGBUS handler could have been registered. We
+     * should never receive BUS_MCEERR_AO on any of our threads, but only on
+     * the main thread registered for PR_MCE_KILL_EARLY. Further, we should not
+     * receive BUS_MCEERR_AR triggered by action of other threads on one of
+     * our threads. So, no need to check for unrelated SIGBUS when seeing one
+     * for our threads.
+     *
+     * We will forward to the MCE handler, which will either handle the SIGBUS
+     * or reinstall the default SIGBUS handler and reraise the SIGBUS. The
+     * default SIGBUS handler will crash the process, so we don't care.
+     */
+    if (sigbus_oldact.sa_flags & SA_SIGINFO) {
+        sigbus_oldact.sa_sigaction(signal, siginfo, ctx);
+        return;
+    }
+#endif /* CONFIG_LINUX */
+    warn_report("os_mem_prealloc: unrelated SIGBUS detected and ignored");
 }
 
 static void *do_touch_pages(void *arg)
@@ -628,10 +653,10 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
 {
     static gsize initialized;
     int ret;
-    struct sigaction act, oldact;
     size_t hpagesize = qemu_fd_getpagesize(fd);
     size_t numpages = DIV_ROUND_UP(memory, hpagesize);
     bool use_madv_populate_write;
+    struct sigaction act;
 
     /*
      * Sense on every invocation, as MADV_POPULATE_WRITE cannot be used for
@@ -647,10 +672,15 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
 
         qemu_mutex_lock(&sigbus_mutex);
         memset(&act, 0, sizeof(act));
+#ifdef CONFIG_LINUX
+        act.sa_sigaction = &sigbus_handler;
+        act.sa_flags = SA_SIGINFO;
+#else /* CONFIG_LINUX */
         act.sa_handler = &sigbus_handler;
         act.sa_flags = 0;
+#endif /* CONFIG_LINUX */
 
-        ret = sigaction(SIGBUS, &act, &oldact);
+        ret = sigaction(SIGBUS, &act, &sigbus_oldact);
         if (ret) {
             error_setg_errno(errp, errno,
                 "os_mem_prealloc: failed to install signal handler");
@@ -666,7 +696,7 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
     }
 
     if (!use_madv_populate_write) {
-        ret = sigaction(SIGBUS, &oldact, NULL);
+        ret = sigaction(SIGBUS, &sigbus_oldact, NULL);
         if (ret) {
             /* Terminate QEMU since it can't recover from error */
             perror("os_mem_prealloc: failed to reinstall signal handler");
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/6] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
  2021-07-22 12:36 ` [PATCH v2 1/6] " David Hildenbrand
@ 2021-07-22 13:31   ` Daniel P. Berrangé
  2021-07-22 13:39     ` David Hildenbrand
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel P. Berrangé @ 2021-07-22 13:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S. Tsirkin,
	Dr . David Alan Gilbert, qemu-devel, Pankaj Gupta, Igor Mammedov,
	Paolo Bonzini, Marek Kedzierski

On Thu, Jul 22, 2021 at 02:36:30PM +0200, David Hildenbrand wrote:
> Let's sense support and use it for preallocation. MADV_POPULATE_WRITE
> does not require a SIGBUS handler, doesn't actually touch page content,
> and avoids context switches; it is, therefore, faster and easier to handle
> than our current approach.
> 
> While MADV_POPULATE_WRITE is, in general, faster than manual
> prefaulting, and especially faster with 4k pages, there is still value in
> prefaulting using multiple threads to speed up preallocation.
> 
> More details on MADV_POPULATE_WRITE can be found in the Linux commit
> 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault
> page tables") and in the man page proposal [1].
> 
> [1] https://lkml.kernel.org/r/20210712083917.16361-1-david@redhat.com
> 
> This resolves the TODO in do_touch_pages().
> 
> In the future, we might want to look into using fallocate(), eventually
> combined with MADV_POPULATE_READ, when dealing with shared file
> mappings.
> 
> Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/qemu/osdep.h |  7 ++++
>  util/oslib-posix.c   | 88 +++++++++++++++++++++++++++++++++-----------
>  2 files changed, 74 insertions(+), 21 deletions(-)


> @@ -497,6 +493,31 @@ static void *do_touch_pages(void *arg)
>      return NULL;
>  }
>  
> +static void *do_madv_populate_write_pages(void *arg)
> +{
> +    MemsetThread *memset_args = (MemsetThread *)arg;
> +    const size_t size = memset_args->numpages * memset_args->hpagesize;
> +    char * const addr = memset_args->addr;
> +    int ret;
> +
> +    if (!size) {
> +        return NULL;
> +    }
> +
> +    /* See do_touch_pages(). */
> +    qemu_mutex_lock(&page_mutex);
> +    while (!threads_created_flag) {
> +        qemu_cond_wait(&page_cond, &page_mutex);
> +    }
> +    qemu_mutex_unlock(&page_mutex);
> +
> +    ret = qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE);
> +    if (ret) {
> +        memset_thread_failed = true;

I'm wondering if this use of memset_thread_failed is sufficient.

This is pre-existing from the current impl, and ends up being
used to set the bool result of 'touch_all_pages'. The caller
of that then does

    if (touch_all_pages(area, hpagesize, numpages, smp_cpus)) {
        error_setg(errp, "os_mem_prealloc: Insufficient free host memory "
            "pages available to allocate guest RAM");
    }

this was reasonable with the old impl, because the only reason
we ever see 'memset_thread_failed==true' is if we got SIGBUS
due to ENOMEM.

My concern is that madvise() has a bunch of possible errno
codes returned on failure, and we're not distinguishing
them. In the past this kind of thing has burnt us making
failures hard to debug.

Could we turn 'bool memset_thread_failed' into 'int memset_thread_errno'

Then, we can make 'touch_all_pages' have an 'Error **errp'
parameter, and it can directly call

 error_setg_errno(errp, memset_thead_errno, ....some message...)

when memset_thread_errno is non-zero, and thus we can remove
the generic message from the caller of touch_all_pages.

If you agree, it'd be best to refactor the existing code to
use this pattern in an initial patch.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/6] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
  2021-07-22 13:31   ` Daniel P. Berrangé
@ 2021-07-22 13:39     ` David Hildenbrand
  2021-07-22 13:47       ` Daniel P. Berrangé
  0 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2021-07-22 13:39 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S. Tsirkin,
	Dr . David Alan Gilbert, qemu-devel, Pankaj Gupta, Igor Mammedov,
	Paolo Bonzini, Marek Kedzierski

On 22.07.21 15:31, Daniel P. Berrangé wrote:
> On Thu, Jul 22, 2021 at 02:36:30PM +0200, David Hildenbrand wrote:
>> Let's sense support and use it for preallocation. MADV_POPULATE_WRITE
>> does not require a SIGBUS handler, doesn't actually touch page content,
>> and avoids context switches; it is, therefore, faster and easier to handle
>> than our current approach.
>>
>> While MADV_POPULATE_WRITE is, in general, faster than manual
>> prefaulting, and especially faster with 4k pages, there is still value in
>> prefaulting using multiple threads to speed up preallocation.
>>
>> More details on MADV_POPULATE_WRITE can be found in the Linux commit
>> 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault
>> page tables") and in the man page proposal [1].
>>
>> [1] https://lkml.kernel.org/r/20210712083917.16361-1-david@redhat.com
>>
>> This resolves the TODO in do_touch_pages().
>>
>> In the future, we might want to look into using fallocate(), eventually
>> combined with MADV_POPULATE_READ, when dealing with shared file
>> mappings.
>>
>> Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   include/qemu/osdep.h |  7 ++++
>>   util/oslib-posix.c   | 88 +++++++++++++++++++++++++++++++++-----------
>>   2 files changed, 74 insertions(+), 21 deletions(-)
> 
> 
>> @@ -497,6 +493,31 @@ static void *do_touch_pages(void *arg)
>>       return NULL;
>>   }
>>   
>> +static void *do_madv_populate_write_pages(void *arg)
>> +{
>> +    MemsetThread *memset_args = (MemsetThread *)arg;
>> +    const size_t size = memset_args->numpages * memset_args->hpagesize;
>> +    char * const addr = memset_args->addr;
>> +    int ret;
>> +
>> +    if (!size) {
>> +        return NULL;
>> +    }
>> +
>> +    /* See do_touch_pages(). */
>> +    qemu_mutex_lock(&page_mutex);
>> +    while (!threads_created_flag) {
>> +        qemu_cond_wait(&page_cond, &page_mutex);
>> +    }
>> +    qemu_mutex_unlock(&page_mutex);
>> +
>> +    ret = qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE);
>> +    if (ret) {
>> +        memset_thread_failed = true;
> 
> I'm wondering if this use of memset_thread_failed is sufficient.
> 
> This is pre-existing from the current impl, and ends up being
> used to set the bool result of 'touch_all_pages'. The caller
> of that then does
> 
>      if (touch_all_pages(area, hpagesize, numpages, smp_cpus)) {
>          error_setg(errp, "os_mem_prealloc: Insufficient free host memory "
>              "pages available to allocate guest RAM");
>      }
> 
> this was reasonable with the old impl, because the only reason
> we ever see 'memset_thread_failed==true' is if we got SIGBUS
> due to ENOMEM.
> 
> My concern is that madvise() has a bunch of possible errno
> codes returned on failure, and we're not distinguishing
> them. In the past this kind of thing has burnt us making
> failures hard to debug.
> 
> Could we turn 'bool memset_thread_failed' into 'int memset_thread_errno'
> 
> Then, we can make 'touch_all_pages' have an 'Error **errp'
> parameter, and it can directly call
> 
>   error_setg_errno(errp, memset_thead_errno, ....some message...)
> 
> when memset_thread_errno is non-zero, and thus we can remove
> the generic message from the caller of touch_all_pages.
> 
> If you agree, it'd be best to refactor the existing code to
> use this pattern in an initial patch.

We could also simply trace the return value, which should be 
comparatively easy to add. We should be getting either -ENOMEM or 
-EHWPOISON. And the latter is highly unlikely to happen when actually 
preallocating.

We made sure that we don't end up with -EINVAL as we're sensing of 
MADV_POPULATE_WRITE works on the mapping.

So when it comes to debugging, I'd actually prefer tracing -errno, as 
the real error will be of little help to end users.

Makes sense?

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/6] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
  2021-07-22 13:39     ` David Hildenbrand
@ 2021-07-22 13:47       ` Daniel P. Berrangé
  2021-07-22 14:13         ` David Hildenbrand
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel P. Berrangé @ 2021-07-22 13:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S. Tsirkin,
	Dr . David Alan Gilbert, qemu-devel, Pankaj Gupta, Igor Mammedov,
	Paolo Bonzini, Marek Kedzierski

On Thu, Jul 22, 2021 at 03:39:50PM +0200, David Hildenbrand wrote:
> On 22.07.21 15:31, Daniel P. Berrangé wrote:
> > On Thu, Jul 22, 2021 at 02:36:30PM +0200, David Hildenbrand wrote:
> > > Let's sense support and use it for preallocation. MADV_POPULATE_WRITE
> > > does not require a SIGBUS handler, doesn't actually touch page content,
> > > and avoids context switches; it is, therefore, faster and easier to handle
> > > than our current approach.
> > > 
> > > While MADV_POPULATE_WRITE is, in general, faster than manual
> > > prefaulting, and especially faster with 4k pages, there is still value in
> > > prefaulting using multiple threads to speed up preallocation.
> > > 
> > > More details on MADV_POPULATE_WRITE can be found in the Linux commit
> > > 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault
> > > page tables") and in the man page proposal [1].
> > > 
> > > [1] https://lkml.kernel.org/r/20210712083917.16361-1-david@redhat.com
> > > 
> > > This resolves the TODO in do_touch_pages().
> > > 
> > > In the future, we might want to look into using fallocate(), eventually
> > > combined with MADV_POPULATE_READ, when dealing with shared file
> > > mappings.
> > > 
> > > Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > > ---
> > >   include/qemu/osdep.h |  7 ++++
> > >   util/oslib-posix.c   | 88 +++++++++++++++++++++++++++++++++-----------
> > >   2 files changed, 74 insertions(+), 21 deletions(-)
> > 
> > 
> > > @@ -497,6 +493,31 @@ static void *do_touch_pages(void *arg)
> > >       return NULL;
> > >   }
> > > +static void *do_madv_populate_write_pages(void *arg)
> > > +{
> > > +    MemsetThread *memset_args = (MemsetThread *)arg;
> > > +    const size_t size = memset_args->numpages * memset_args->hpagesize;
> > > +    char * const addr = memset_args->addr;
> > > +    int ret;
> > > +
> > > +    if (!size) {
> > > +        return NULL;
> > > +    }
> > > +
> > > +    /* See do_touch_pages(). */
> > > +    qemu_mutex_lock(&page_mutex);
> > > +    while (!threads_created_flag) {
> > > +        qemu_cond_wait(&page_cond, &page_mutex);
> > > +    }
> > > +    qemu_mutex_unlock(&page_mutex);
> > > +
> > > +    ret = qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE);
> > > +    if (ret) {
> > > +        memset_thread_failed = true;
> > 
> > I'm wondering if this use of memset_thread_failed is sufficient.
> > 
> > This is pre-existing from the current impl, and ends up being
> > used to set the bool result of 'touch_all_pages'. The caller
> > of that then does
> > 
> >      if (touch_all_pages(area, hpagesize, numpages, smp_cpus)) {
> >          error_setg(errp, "os_mem_prealloc: Insufficient free host memory "
> >              "pages available to allocate guest RAM");
> >      }
> > 
> > this was reasonable with the old impl, because the only reason
> > we ever see 'memset_thread_failed==true' is if we got SIGBUS
> > due to ENOMEM.
> > 
> > My concern is that madvise() has a bunch of possible errno
> > codes returned on failure, and we're not distinguishing
> > them. In the past this kind of thing has burnt us making
> > failures hard to debug.
> > 
> > Could we turn 'bool memset_thread_failed' into 'int memset_thread_errno'
> > 
> > Then, we can make 'touch_all_pages' have an 'Error **errp'
> > parameter, and it can directly call
> > 
> >   error_setg_errno(errp, memset_thead_errno, ....some message...)
> > 
> > when memset_thread_errno is non-zero, and thus we can remove
> > the generic message from the caller of touch_all_pages.
> > 
> > If you agree, it'd be best to refactor the existing code to
> > use this pattern in an initial patch.
> 
> We could also simply trace the return value, which should be comparatively
> easy to add. We should be getting either -ENOMEM or -EHWPOISON. And the
> latter is highly unlikely to happen when actually preallocating.
> 
> We made sure that we don't end up with -EINVAL as we're sensing of
> MADV_POPULATE_WRITE works on the mapping.

Those are in the "normal" usage scenarios. I'm wondering about the
abnormal scenarios where QEMU code is mistakenly screwed up or
libvirt / mgmt app makes some config mistake. eg we can get
things like EPERM if selinux or seccomp block the madvise
syscall by mistake (common if EQMU is inside docker for example),
or can we get EINVAL if the 'addr' is not page aligned, and so on.

> So when it comes to debugging, I'd actually prefer tracing -errno, as the
> real error will be of little help to end users.

I don't care about the end users interpreting it, rather us as maintainers
who get a bug report containing insufficient info to diagnose the root
cause.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/6] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
  2021-07-22 13:47       ` Daniel P. Berrangé
@ 2021-07-22 14:13         ` David Hildenbrand
  0 siblings, 0 replies; 16+ messages in thread
From: David Hildenbrand @ 2021-07-22 14:13 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S. Tsirkin,
	Dr . David Alan Gilbert, qemu-devel, Pankaj Gupta, Igor Mammedov,
	Paolo Bonzini, Marek Kedzierski

On 22.07.21 15:47, Daniel P. Berrangé wrote:
> On Thu, Jul 22, 2021 at 03:39:50PM +0200, David Hildenbrand wrote:
>> On 22.07.21 15:31, Daniel P. Berrangé wrote:
>>> On Thu, Jul 22, 2021 at 02:36:30PM +0200, David Hildenbrand wrote:
>>>> Let's sense support and use it for preallocation. MADV_POPULATE_WRITE
>>>> does not require a SIGBUS handler, doesn't actually touch page content,
>>>> and avoids context switches; it is, therefore, faster and easier to handle
>>>> than our current approach.
>>>>
>>>> While MADV_POPULATE_WRITE is, in general, faster than manual
>>>> prefaulting, and especially faster with 4k pages, there is still value in
>>>> prefaulting using multiple threads to speed up preallocation.
>>>>
>>>> More details on MADV_POPULATE_WRITE can be found in the Linux commit
>>>> 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault
>>>> page tables") and in the man page proposal [1].
>>>>
>>>> [1] https://lkml.kernel.org/r/20210712083917.16361-1-david@redhat.com
>>>>
>>>> This resolves the TODO in do_touch_pages().
>>>>
>>>> In the future, we might want to look into using fallocate(), eventually
>>>> combined with MADV_POPULATE_READ, when dealing with shared file
>>>> mappings.
>>>>
>>>> Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>    include/qemu/osdep.h |  7 ++++
>>>>    util/oslib-posix.c   | 88 +++++++++++++++++++++++++++++++++-----------
>>>>    2 files changed, 74 insertions(+), 21 deletions(-)
>>>
>>>
>>>> @@ -497,6 +493,31 @@ static void *do_touch_pages(void *arg)
>>>>        return NULL;
>>>>    }
>>>> +static void *do_madv_populate_write_pages(void *arg)
>>>> +{
>>>> +    MemsetThread *memset_args = (MemsetThread *)arg;
>>>> +    const size_t size = memset_args->numpages * memset_args->hpagesize;
>>>> +    char * const addr = memset_args->addr;
>>>> +    int ret;
>>>> +
>>>> +    if (!size) {
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>> +    /* See do_touch_pages(). */
>>>> +    qemu_mutex_lock(&page_mutex);
>>>> +    while (!threads_created_flag) {
>>>> +        qemu_cond_wait(&page_cond, &page_mutex);
>>>> +    }
>>>> +    qemu_mutex_unlock(&page_mutex);
>>>> +
>>>> +    ret = qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE);
>>>> +    if (ret) {
>>>> +        memset_thread_failed = true;
>>>
>>> I'm wondering if this use of memset_thread_failed is sufficient.
>>>
>>> This is pre-existing from the current impl, and ends up being
>>> used to set the bool result of 'touch_all_pages'. The caller
>>> of that then does
>>>
>>>       if (touch_all_pages(area, hpagesize, numpages, smp_cpus)) {
>>>           error_setg(errp, "os_mem_prealloc: Insufficient free host memory "
>>>               "pages available to allocate guest RAM");
>>>       }
>>>
>>> this was reasonable with the old impl, because the only reason
>>> we ever see 'memset_thread_failed==true' is if we got SIGBUS
>>> due to ENOMEM.
>>>
>>> My concern is that madvise() has a bunch of possible errno
>>> codes returned on failure, and we're not distinguishing
>>> them. In the past this kind of thing has burnt us making
>>> failures hard to debug.
>>>
>>> Could we turn 'bool memset_thread_failed' into 'int memset_thread_errno'
>>>
>>> Then, we can make 'touch_all_pages' have an 'Error **errp'
>>> parameter, and it can directly call
>>>
>>>    error_setg_errno(errp, memset_thead_errno, ....some message...)
>>>
>>> when memset_thread_errno is non-zero, and thus we can remove
>>> the generic message from the caller of touch_all_pages.
>>>
>>> If you agree, it'd be best to refactor the existing code to
>>> use this pattern in an initial patch.
>>
>> We could also simply trace the return value, which should be comparatively
>> easy to add. We should be getting either -ENOMEM or -EHWPOISON. And the
>> latter is highly unlikely to happen when actually preallocating.
>>
>> We made sure that we don't end up with -EINVAL as we're sensing of
>> MADV_POPULATE_WRITE works on the mapping.
> 
> Those are in the "normal" usage scenarios. I'm wondering about the
> abnormal scenarios where QEMU code is mistakenly screwed up or
> libvirt / mgmt app makes some config mistake. eg we can get
> things like EPERM if selinux or seccomp block the madvise
> syscall by mistake (common if EQMU is inside docker for example),
> or can we get EINVAL if the 'addr' is not page aligned, and so on.
> 
>> So when it comes to debugging, I'd actually prefer tracing -errno, as the
>> real error will be of little help to end users.
> 
> I don't care about the end users interpreting it, rather us as maintainers
> who get a bug report containing insufficient info to diagnose the root
> cause.

Well, okay. I'll have a look how this turns out.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 3/6] util/oslib-posix: Don't create too many threads with small memory or little pages
  2021-07-22 12:36 ` [PATCH v2 3/6] util/oslib-posix: Don't create too many threads with small memory or little pages David Hildenbrand
@ 2021-07-27 19:01   ` Dr. David Alan Gilbert
  2021-07-28 11:23   ` Pankaj Gupta
  1 sibling, 0 replies; 16+ messages in thread
From: Dr. David Alan Gilbert @ 2021-07-27 19:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Gupta, Daniel P . Berrangé,
	Eduardo Habkost, Michael S. Tsirkin, qemu-devel, Igor Mammedov,
	Paolo Bonzini, Marek Kedzierski

* David Hildenbrand (david@redhat.com) wrote:
> Let's limit the number of threads to something sane, especially that
> - We don't have more threads than the number of pages we have
> - We don't have threads that initialize small (< 64 MiB) memory
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  util/oslib-posix.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index 2ae6c3aaab..a1d309d495 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -40,6 +40,7 @@
>  #include <libgen.h>
>  #include "qemu/cutils.h"
>  #include "qemu/compiler.h"
> +#include "qemu/units.h"
>  
>  #ifdef CONFIG_LINUX
>  #include <sys/syscall.h>
> @@ -529,7 +530,8 @@ static void *do_madv_populate_write_pages(void *arg)
>      return NULL;
>  }
>  
> -static inline int get_memset_num_threads(int smp_cpus)
> +static inline int get_memset_num_threads(size_t hpagesize, size_t numpages,
> +                                         int smp_cpus)
>  {
>      long host_procs = sysconf(_SC_NPROCESSORS_ONLN);
>      int ret = 1;
> @@ -537,6 +539,12 @@ static inline int get_memset_num_threads(int smp_cpus)
>      if (host_procs > 0) {
>          ret = MIN(MIN(host_procs, MAX_MEM_PREALLOC_THREAD_COUNT), smp_cpus);
>      }
> +
> +    /* Especially with gigantic pages, don't create more threads than pages. */
> +    ret = MIN(ret, numpages);
> +    /* Don't start threads to prealloc comparatively little memory. */
> +    ret = MIN(ret, MAX(1, hpagesize * numpages / (64 * MiB)));
> +
>      /* In case sysconf() fails, we fall back to single threaded */
>      return ret;
>  }
> @@ -546,7 +554,7 @@ static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
>  {
>      static gsize initialized = 0;
>      MemsetContext context = {
> -        .num_threads = get_memset_num_threads(smp_cpus),
> +        .num_threads = get_memset_num_threads(hpagesize, numpages, smp_cpus),
>      };
>      size_t numpages_per_thread, leftover;
>      void *(*touch_fn)(void *);
> -- 
> 2.31.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 4/6] util/oslib-posix: Avoid creating a single thread with MADV_POPULATE_WRITE
  2021-07-22 12:36 ` [PATCH v2 4/6] util/oslib-posix: Avoid creating a single thread with MADV_POPULATE_WRITE David Hildenbrand
@ 2021-07-27 19:04   ` Dr. David Alan Gilbert
  2021-07-30 15:13     ` David Hildenbrand
  2021-07-28 11:32   ` Pankaj Gupta
  1 sibling, 1 reply; 16+ messages in thread
From: Dr. David Alan Gilbert @ 2021-07-27 19:04 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Gupta, Daniel P . Berrangé,
	Eduardo Habkost, Michael S. Tsirkin, qemu-devel, Igor Mammedov,
	Paolo Bonzini, Marek Kedzierski

* David Hildenbrand (david@redhat.com) wrote:
> Let's simplify the case when we only want a single thread and don't have
> to mess with signal handlers.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  util/oslib-posix.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index a1d309d495..1483e985c6 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -568,6 +568,14 @@ static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
>      }
>  
>      if (use_madv_populate_write) {
> +        /* Avoid creating a single thread for MADV_POPULATE_WRITE */
> +        if (context.num_threads == 1) {
> +            if (qemu_madvise(area, hpagesize * numpages,
> +                             QEMU_MADV_POPULATE_WRITE)) {

Do you never have to fall back if this particular memory region is the
one that can't do madv?

Dave

> +                return true;
> +            }
> +            return false;
> +        }
>          touch_fn = do_madv_populate_write_pages;
>      } else {
>          touch_fn = do_touch_pages;
> -- 
> 2.31.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 3/6] util/oslib-posix: Don't create too many threads with small memory or little pages
  2021-07-22 12:36 ` [PATCH v2 3/6] util/oslib-posix: Don't create too many threads with small memory or little pages David Hildenbrand
  2021-07-27 19:01   ` Dr. David Alan Gilbert
@ 2021-07-28 11:23   ` Pankaj Gupta
  1 sibling, 0 replies; 16+ messages in thread
From: Pankaj Gupta @ 2021-07-28 11:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S. Tsirkin, Dr . David Alan Gilbert,
	Qemu Developers, Igor Mammedov, Paolo Bonzini, Marek Kedzierski

> Let's limit the number of threads to something sane, especially that
> - We don't have more threads than the number of pages we have
> - We don't have threads that initialize small (< 64 MiB) memory
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  util/oslib-posix.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index 2ae6c3aaab..a1d309d495 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -40,6 +40,7 @@
>  #include <libgen.h>
>  #include "qemu/cutils.h"
>  #include "qemu/compiler.h"
> +#include "qemu/units.h"
>
>  #ifdef CONFIG_LINUX
>  #include <sys/syscall.h>
> @@ -529,7 +530,8 @@ static void *do_madv_populate_write_pages(void *arg)
>      return NULL;
>  }
>
> -static inline int get_memset_num_threads(int smp_cpus)
> +static inline int get_memset_num_threads(size_t hpagesize, size_t numpages,
> +                                         int smp_cpus)
>  {
>      long host_procs = sysconf(_SC_NPROCESSORS_ONLN);
>      int ret = 1;
> @@ -537,6 +539,12 @@ static inline int get_memset_num_threads(int smp_cpus)
>      if (host_procs > 0) {
>          ret = MIN(MIN(host_procs, MAX_MEM_PREALLOC_THREAD_COUNT), smp_cpus);
>      }
> +
> +    /* Especially with gigantic pages, don't create more threads than pages. */
> +    ret = MIN(ret, numpages);
> +    /* Don't start threads to prealloc comparatively little memory. */
> +    ret = MIN(ret, MAX(1, hpagesize * numpages / (64 * MiB)));
> +
>      /* In case sysconf() fails, we fall back to single threaded */
>      return ret;
>  }
> @@ -546,7 +554,7 @@ static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
>  {
>      static gsize initialized = 0;
>      MemsetContext context = {
> -        .num_threads = get_memset_num_threads(smp_cpus),
> +        .num_threads = get_memset_num_threads(hpagesize, numpages, smp_cpus),
>      };
>      size_t numpages_per_thread, leftover;
>      void *(*touch_fn)(void *);

Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 4/6] util/oslib-posix: Avoid creating a single thread with MADV_POPULATE_WRITE
  2021-07-22 12:36 ` [PATCH v2 4/6] util/oslib-posix: Avoid creating a single thread with MADV_POPULATE_WRITE David Hildenbrand
  2021-07-27 19:04   ` Dr. David Alan Gilbert
@ 2021-07-28 11:32   ` Pankaj Gupta
  1 sibling, 0 replies; 16+ messages in thread
From: Pankaj Gupta @ 2021-07-28 11:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Daniel P . Berrangé,
	Eduardo Habkost, Michael S. Tsirkin, Dr . David Alan Gilbert,
	Qemu Developers, Igor Mammedov, Paolo Bonzini, Marek Kedzierski

> Let's simplify the case when we only want a single thread and don't have
> to mess with signal handlers.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  util/oslib-posix.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index a1d309d495..1483e985c6 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -568,6 +568,14 @@ static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
>      }
>
>      if (use_madv_populate_write) {
> +        /* Avoid creating a single thread for MADV_POPULATE_WRITE */
> +        if (context.num_threads == 1) {
> +            if (qemu_madvise(area, hpagesize * numpages,
> +                             QEMU_MADV_POPULATE_WRITE)) {
> +                return true;
> +            }
> +            return false;
> +        }
>          touch_fn = do_madv_populate_write_pages;
>      } else {
>          touch_fn = do_touch_pages;

Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 4/6] util/oslib-posix: Avoid creating a single thread with MADV_POPULATE_WRITE
  2021-07-27 19:04   ` Dr. David Alan Gilbert
@ 2021-07-30 15:13     ` David Hildenbrand
  0 siblings, 0 replies; 16+ messages in thread
From: David Hildenbrand @ 2021-07-30 15:13 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Pankaj Gupta, Daniel P. Berrangé,
	Eduardo Habkost, Michael S. Tsirkin, qemu-devel, Igor Mammedov,
	Paolo Bonzini, Marek Kedzierski

On 27.07.21 21:04, Dr. David Alan Gilbert wrote:
> * David Hildenbrand (david@redhat.com) wrote:
>> Let's simplify the case when we only want a single thread and don't have
>> to mess with signal handlers.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   util/oslib-posix.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
>> index a1d309d495..1483e985c6 100644
>> --- a/util/oslib-posix.c
>> +++ b/util/oslib-posix.c
>> @@ -568,6 +568,14 @@ static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
>>       }
>>   
>>       if (use_madv_populate_write) {
>> +        /* Avoid creating a single thread for MADV_POPULATE_WRITE */
>> +        if (context.num_threads == 1) {
>> +            if (qemu_madvise(area, hpagesize * numpages,
>> +                             QEMU_MADV_POPULATE_WRITE)) {
> 
> Do you never have to fall back if this particular memory region is the
> one that can't do madv?

We sense upfront, when detecting use_madv_populate_write, whether it's 
supported on this very memory type. So, no need to fallback here.

> 
> Dave

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-07-30 15:14 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-22 12:36 [PATCH v2 0/6] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc() David Hildenbrand
2021-07-22 12:36 ` [PATCH v2 1/6] " David Hildenbrand
2021-07-22 13:31   ` Daniel P. Berrangé
2021-07-22 13:39     ` David Hildenbrand
2021-07-22 13:47       ` Daniel P. Berrangé
2021-07-22 14:13         ` David Hildenbrand
2021-07-22 12:36 ` [PATCH v2 2/6] util/oslib-posix: Introduce and use MemsetContext for touch_all_pages() David Hildenbrand
2021-07-22 12:36 ` [PATCH v2 3/6] util/oslib-posix: Don't create too many threads with small memory or little pages David Hildenbrand
2021-07-27 19:01   ` Dr. David Alan Gilbert
2021-07-28 11:23   ` Pankaj Gupta
2021-07-22 12:36 ` [PATCH v2 4/6] util/oslib-posix: Avoid creating a single thread with MADV_POPULATE_WRITE David Hildenbrand
2021-07-27 19:04   ` Dr. David Alan Gilbert
2021-07-30 15:13     ` David Hildenbrand
2021-07-28 11:32   ` Pankaj Gupta
2021-07-22 12:36 ` [PATCH v2 5/6] util/oslib-posix: Support concurrent os_mem_prealloc() invocation David Hildenbrand
2021-07-22 12:36 ` [PATCH v2 6/6] util/oslib-posix: Forward SIGBUS to MCE handler under Linux David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).