* [PATCH v1 0/3] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
@ 2021-07-14 11:23 David Hildenbrand
  2021-07-14 11:23 ` [PATCH v1 1/3] " David Hildenbrand
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: David Hildenbrand @ 2021-07-14 11:23 UTC (permalink / raw)
  To: qemu-devel
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S. Tsirkin,
	David Hildenbrand, Dr . David Alan Gilbert, Igor Mammedov,
	Paolo Bonzini, Marek Kedzierski

Patch #1 adds support for MADV_POPULATE_WRITE, #2 cleans up the code to avoid
global variables and prepares for concurrency, and #3 makes os_mem_prealloc()
safe to call from multiple threads concurrently.

Details regarding MADV_POPULATE_WRITE can be found in the introducing upstream
Linux commit 4ca9b3859dac ("mm/madvise: introduce
MADV_POPULATE_(READ|WRITE) to prefault page tables") and in the latest man
page patch [1].

[1] https://lkml.kernel.org/r/20210712083917.16361-1-david@redhat.com

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>

David Hildenbrand (3):
  util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
  util/oslib-posix: Introduce and use MemsetContext for
    touch_all_pages()
  util/oslib-posix: Support concurrent os_mem_prealloc() invocation

 include/qemu/osdep.h |   7 ++
 util/oslib-posix.c   | 167 ++++++++++++++++++++++++++++++-------------
 2 files changed, 126 insertions(+), 48 deletions(-)

-- 
2.31.1



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v1 1/3] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
  2021-07-14 11:23 [PATCH v1 0/3] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc() David Hildenbrand
@ 2021-07-14 11:23 ` David Hildenbrand
  2021-07-20 14:08   ` Daniel P. Berrangé
  2021-07-14 11:23 ` [PATCH v1 2/3] util/oslib-posix: Introduce and use MemsetContext for touch_all_pages() David Hildenbrand
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2021-07-14 11:23 UTC (permalink / raw)
  To: qemu-devel
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S. Tsirkin,
	David Hildenbrand, Dr . David Alan Gilbert, Igor Mammedov,
	Paolo Bonzini, Marek Kedzierski

Let's sense support and use it for preallocation. MADV_POPULATE_WRITE
does not require a SIGBUS handler, doesn't actually touch page content,
and avoids context switches; it is, therefore, faster and easier to handle
than our current approach.

This resolves the TODO in do_touch_pages().

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/qemu/osdep.h |  7 ++++
 util/oslib-posix.c   | 84 +++++++++++++++++++++++++++++++++-----------
 2 files changed, 71 insertions(+), 20 deletions(-)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 60718fc342..d1660d67fa 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -471,6 +471,11 @@ static inline void qemu_cleanup_generic_vfree(void *p)
 #else
 #define QEMU_MADV_REMOVE QEMU_MADV_DONTNEED
 #endif
+#ifdef MADV_POPULATE_WRITE
+#define QEMU_MADV_POPULATE_WRITE MADV_POPULATE_WRITE
+#else
+#define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID
+#endif
 
 #elif defined(CONFIG_POSIX_MADVISE)
 
@@ -484,6 +489,7 @@ static inline void qemu_cleanup_generic_vfree(void *p)
 #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_REMOVE QEMU_MADV_DONTNEED
+#define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID
 
 #else /* no-op */
 
@@ -497,6 +503,7 @@ static inline void qemu_cleanup_generic_vfree(void *p)
 #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
+#define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID
 
 #endif
 
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index e8bdb02e1d..679796ac1f 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -484,10 +484,6 @@ static void *do_touch_pages(void *arg)
              *
              * 'volatile' to stop compiler optimizing this away
              * to a no-op
-             *
-             * TODO: get a better solution from kernel so we
-             * don't need to write at all so we don't cause
-             * wear on the storage backing the region...
              */
             *(volatile char *)addr = *addr;
             addr += hpagesize;
@@ -497,6 +493,27 @@ static void *do_touch_pages(void *arg)
     return NULL;
 }
 
+static void *do_madv_populate_write_pages(void *arg)
+{
+    MemsetThread *memset_args = (MemsetThread *)arg;
+    const size_t size = memset_args->numpages * memset_args->hpagesize;
+    char * const addr = memset_args->addr;
+    int ret;
+
+    /* See do_touch_pages(). */
+    qemu_mutex_lock(&page_mutex);
+    while (!threads_created_flag) {
+        qemu_cond_wait(&page_cond, &page_mutex);
+    }
+    qemu_mutex_unlock(&page_mutex);
+
+    ret = qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE);
+    if (ret) {
+        memset_thread_failed = true;
+    }
+    return NULL;
+}
+
 static inline int get_memset_num_threads(int smp_cpus)
 {
     long host_procs = sysconf(_SC_NPROCESSORS_ONLN);
@@ -510,10 +527,11 @@ static inline int get_memset_num_threads(int smp_cpus)
 }
 
 static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
-                            int smp_cpus)
+                            int smp_cpus, bool use_madv_populate_write)
 {
     static gsize initialized = 0;
     size_t numpages_per_thread, leftover;
+    void *(*touch_fn)(void *);
     char *addr = area;
     int i = 0;
 
@@ -523,6 +541,12 @@ static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
         g_once_init_leave(&initialized, 1);
     }
 
+    if (use_madv_populate_write) {
+        touch_fn = do_madv_populate_write_pages;
+    } else {
+        touch_fn = do_touch_pages;
+    }
+
     memset_thread_failed = false;
     threads_created_flag = false;
     memset_num_threads = get_memset_num_threads(smp_cpus);
@@ -534,7 +558,7 @@ static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
         memset_thread[i].numpages = numpages_per_thread + (i < leftover);
         memset_thread[i].hpagesize = hpagesize;
         qemu_thread_create(&memset_thread[i].pgthread, "touch_pages",
-                           do_touch_pages, &memset_thread[i],
+                           touch_fn, &memset_thread[i],
                            QEMU_THREAD_JOINABLE);
         addr += memset_thread[i].numpages * hpagesize;
     }
@@ -553,6 +577,12 @@ static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
     return memset_thread_failed;
 }
 
+static bool madv_populate_write_possible(char *area, size_t pagesize)
+{
+    return !qemu_madvise(area, pagesize, QEMU_MADV_POPULATE_WRITE) ||
+           errno != EINVAL;
+}
+
 void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
                      Error **errp)
 {
@@ -560,29 +590,43 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
     struct sigaction act, oldact;
     size_t hpagesize = qemu_fd_getpagesize(fd);
     size_t numpages = DIV_ROUND_UP(memory, hpagesize);
+    bool use_madv_populate_write;
+
+    /*
+     * Sense on every invocation, as MADV_POPULATE_WRITE cannot be used for
+     * some special mappings, such as mapping /dev/mem.
+     */
+    if (madv_populate_write_possible(area, hpagesize)) {
+        use_madv_populate_write = true;
+    }
 
-    memset(&act, 0, sizeof(act));
-    act.sa_handler = &sigbus_handler;
-    act.sa_flags = 0;
+    if (!use_madv_populate_write) {
+        memset(&act, 0, sizeof(act));
+        act.sa_handler = &sigbus_handler;
+        act.sa_flags = 0;
 
-    ret = sigaction(SIGBUS, &act, &oldact);
-    if (ret) {
-        error_setg_errno(errp, errno,
-            "os_mem_prealloc: failed to install signal handler");
-        return;
+        ret = sigaction(SIGBUS, &act, &oldact);
+        if (ret) {
+            error_setg_errno(errp, errno,
+                "os_mem_prealloc: failed to install signal handler");
+            return;
+        }
     }
 
     /* touch pages simultaneously */
-    if (touch_all_pages(area, hpagesize, numpages, smp_cpus)) {
+    if (touch_all_pages(area, hpagesize, numpages, smp_cpus,
+                        use_madv_populate_write)) {
         error_setg(errp, "os_mem_prealloc: Insufficient free host memory "
             "pages available to allocate guest RAM");
     }
 
-    ret = sigaction(SIGBUS, &oldact, NULL);
-    if (ret) {
-        /* Terminate QEMU since it can't recover from error */
-        perror("os_mem_prealloc: failed to reinstall signal handler");
-        exit(1);
+    if (!use_madv_populate_write) {
+        ret = sigaction(SIGBUS, &oldact, NULL);
+        if (ret) {
+            /* Terminate QEMU since it can't recover from error */
+            perror("os_mem_prealloc: failed to reinstall signal handler");
+            exit(1);
+        }
     }
 }
 
-- 
2.31.1




* [PATCH v1 2/3] util/oslib-posix: Introduce and use MemsetContext for touch_all_pages()
  2021-07-14 11:23 [PATCH v1 0/3] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc() David Hildenbrand
  2021-07-14 11:23 ` [PATCH v1 1/3] " David Hildenbrand
@ 2021-07-14 11:23 ` David Hildenbrand
  2021-07-20 14:27   ` Daniel P. Berrangé
  2021-07-14 11:23 ` [PATCH v1 3/3] util/oslib-posix: Support concurrent os_mem_prealloc() invocation David Hildenbrand
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2021-07-14 11:23 UTC (permalink / raw)
  To: qemu-devel
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S. Tsirkin,
	David Hildenbrand, Dr . David Alan Gilbert, Igor Mammedov,
	Paolo Bonzini, Marek Kedzierski

Let's minimize the number of global variables to prepare for
os_mem_prealloc() getting called concurrently and make the code a bit
easier to read.

The only consumer that really needs a global variable is the SIGBUS
handler, which will require protection via a mutex in the future either way,
as we cannot mess with the SIGBUS handler concurrently.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 util/oslib-posix.c | 81 ++++++++++++++++++++++++++++------------------
 1 file changed, 50 insertions(+), 31 deletions(-)

diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 679796ac1f..60d1da2d6c 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -73,22 +73,30 @@
 
 #define MAX_MEM_PREALLOC_THREAD_COUNT 16
 
+struct MemsetThread;
+
+typedef struct MemsetContext {
+    bool all_threads_created;
+    bool any_thread_failed;
+    struct MemsetThread *threads;
+    int num_threads;
+} MemsetContext;
+
 struct MemsetThread {
     char *addr;
     size_t numpages;
     size_t hpagesize;
     QemuThread pgthread;
     sigjmp_buf env;
+    MemsetContext *context;
 };
 typedef struct MemsetThread MemsetThread;
 
-static MemsetThread *memset_thread;
-static int memset_num_threads;
-static bool memset_thread_failed;
+/* used by sigbus_handler() */
+static MemsetContext *sigbus_memset_context;
 
 static QemuMutex page_mutex;
 static QemuCond page_cond;
-static bool threads_created_flag;
 
 int qemu_get_thread_id(void)
 {
@@ -439,10 +447,13 @@ const char *qemu_get_exec_dir(void)
 static void sigbus_handler(int signal)
 {
     int i;
-    if (memset_thread) {
-        for (i = 0; i < memset_num_threads; i++) {
-            if (qemu_thread_is_self(&memset_thread[i].pgthread)) {
-                siglongjmp(memset_thread[i].env, 1);
+
+    if (sigbus_memset_context) {
+        for (i = 0; i < sigbus_memset_context->num_threads; i++) {
+            MemsetThread *thread = &sigbus_memset_context->threads[i];
+
+            if (qemu_thread_is_self(&thread->pgthread)) {
+                siglongjmp(thread->env, 1);
             }
         }
     }
@@ -459,7 +470,7 @@ static void *do_touch_pages(void *arg)
      * clearing until all threads have been created.
      */
     qemu_mutex_lock(&page_mutex);
-    while(!threads_created_flag){
+    while (!memset_args->context->all_threads_created) {
         qemu_cond_wait(&page_cond, &page_mutex);
     }
     qemu_mutex_unlock(&page_mutex);
@@ -470,7 +481,7 @@ static void *do_touch_pages(void *arg)
     pthread_sigmask(SIG_UNBLOCK, &set, &oldset);
 
     if (sigsetjmp(memset_args->env, 1)) {
-        memset_thread_failed = true;
+        memset_args->context->any_thread_failed = true;
     } else {
         char *addr = memset_args->addr;
         size_t numpages = memset_args->numpages;
@@ -502,14 +513,14 @@ static void *do_madv_populate_write_pages(void *arg)
 
     /* See do_touch_pages(). */
     qemu_mutex_lock(&page_mutex);
-    while (!threads_created_flag) {
+    while (!memset_args->context->all_threads_created) {
         qemu_cond_wait(&page_cond, &page_mutex);
     }
     qemu_mutex_unlock(&page_mutex);
 
     ret = qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE);
     if (ret) {
-        memset_thread_failed = true;
+        memset_args->context->any_thread_failed = true;
     }
     return NULL;
 }
@@ -530,6 +541,9 @@ static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
                             int smp_cpus, bool use_madv_populate_write)
 {
     static gsize initialized = 0;
+    MemsetContext context = {
+        .num_threads = get_memset_num_threads(smp_cpus),
+    };
     size_t numpages_per_thread, leftover;
     void *(*touch_fn)(void *);
     char *addr = area;
@@ -547,34 +561,39 @@ static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
         touch_fn = do_touch_pages;
     }
 
-    memset_thread_failed = false;
-    threads_created_flag = false;
-    memset_num_threads = get_memset_num_threads(smp_cpus);
-    memset_thread = g_new0(MemsetThread, memset_num_threads);
-    numpages_per_thread = numpages / memset_num_threads;
-    leftover = numpages % memset_num_threads;
-    for (i = 0; i < memset_num_threads; i++) {
-        memset_thread[i].addr = addr;
-        memset_thread[i].numpages = numpages_per_thread + (i < leftover);
-        memset_thread[i].hpagesize = hpagesize;
-        qemu_thread_create(&memset_thread[i].pgthread, "touch_pages",
-                           touch_fn, &memset_thread[i],
+    context.threads = g_new0(MemsetThread, context.num_threads);
+    numpages_per_thread = numpages / context.num_threads;
+    leftover = numpages % context.num_threads;
+    for (i = 0; i < context.num_threads; i++) {
+        context.threads[i].addr = addr;
+        context.threads[i].numpages = numpages_per_thread + (i < leftover);
+        context.threads[i].hpagesize = hpagesize;
+        context.threads[i].context = &context;
+        qemu_thread_create(&context.threads[i].pgthread, "touch_pages",
+                           touch_fn, &context.threads[i],
                            QEMU_THREAD_JOINABLE);
-        addr += memset_thread[i].numpages * hpagesize;
+        addr += context.threads[i].numpages * hpagesize;
+    }
+
+    if (!use_madv_populate_write) {
+        sigbus_memset_context = &context;
     }
 
     qemu_mutex_lock(&page_mutex);
-    threads_created_flag = true;
+    context.all_threads_created = true;
     qemu_cond_broadcast(&page_cond);
     qemu_mutex_unlock(&page_mutex);
 
-    for (i = 0; i < memset_num_threads; i++) {
-        qemu_thread_join(&memset_thread[i].pgthread);
+    for (i = 0; i < context.num_threads; i++) {
+        qemu_thread_join(&context.threads[i].pgthread);
+    }
+
+    if (!use_madv_populate_write) {
+        sigbus_memset_context = NULL;
     }
-    g_free(memset_thread);
-    memset_thread = NULL;
+    g_free(context.threads);
 
-    return memset_thread_failed;
+    return context.any_thread_failed;
 }
 
 static bool madv_populate_write_possible(char *area, size_t pagesize)
-- 
2.31.1




* [PATCH v1 3/3] util/oslib-posix: Support concurrent os_mem_prealloc() invocation
  2021-07-14 11:23 [PATCH v1 0/3] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc() David Hildenbrand
  2021-07-14 11:23 ` [PATCH v1 1/3] " David Hildenbrand
  2021-07-14 11:23 ` [PATCH v1 2/3] util/oslib-posix: Introduce and use MemsetContext for touch_all_pages() David Hildenbrand
@ 2021-07-14 11:23 ` David Hildenbrand
  2021-07-20 14:22   ` Daniel P. Berrangé
  2021-07-20 13:55 ` [PATCH v1 0/3] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc() Pankaj Gupta
  2021-07-20 14:45 ` Daniel P. Berrangé
  4 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2021-07-14 11:23 UTC (permalink / raw)
  To: qemu-devel
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S. Tsirkin,
	David Hildenbrand, Dr . David Alan Gilbert, Igor Mammedov,
	Paolo Bonzini, Marek Kedzierski

Add a mutext to protect the SIGBUS case, as we cannot mess concurrently
with the sigbus handler and we have to manage the global variable
sigbus_memset_context. The MADV_POPULATE_WRITE path can run
concurrently.

Note that page_mutex and page_cond are shared between concurrent
invocations, which shouldn't be a problem.

This is a preparation for future virtio-mem prealloc code, which will call
os_mem_prealloc() asynchronously from an iothread when handling guest
requests.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 util/oslib-posix.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 60d1da2d6c..181f6bbf1a 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -94,6 +94,7 @@ typedef struct MemsetThread MemsetThread;
 
 /* used by sigbus_handler() */
 static MemsetContext *sigbus_memset_context;
+static QemuMutex sigbus_mutex;
 
 static QemuMutex page_mutex;
 static QemuCond page_cond;
@@ -605,12 +606,17 @@ static bool madv_populate_write_possible(char *area, size_t pagesize)
 void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
                      Error **errp)
 {
+    static gsize initialized;
     int ret;
     struct sigaction act, oldact;
     size_t hpagesize = qemu_fd_getpagesize(fd);
     size_t numpages = DIV_ROUND_UP(memory, hpagesize);
     bool use_madv_populate_write;
 
+    if (g_once_init_enter(&initialized)) {
+        qemu_mutex_init(&sigbus_mutex);
+    }
+
     /*
      * Sense on every invocation, as MADV_POPULATE_WRITE cannot be used for
      * some special mappings, such as mapping /dev/mem.
@@ -620,6 +626,7 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
     }
 
     if (!use_madv_populate_write) {
+        qemu_mutex_lock(&sigbus_mutex);
         memset(&act, 0, sizeof(act));
         act.sa_handler = &sigbus_handler;
         act.sa_flags = 0;
@@ -646,6 +653,7 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
             perror("os_mem_prealloc: failed to reinstall signal handler");
             exit(1);
         }
+        qemu_mutex_unlock(&sigbus_mutex);
     }
 }
 
-- 
2.31.1




* Re: [PATCH v1 0/3] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
  2021-07-14 11:23 [PATCH v1 0/3] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc() David Hildenbrand
                   ` (2 preceding siblings ...)
  2021-07-14 11:23 ` [PATCH v1 3/3] util/oslib-posix: Support concurrent os_mem_prealloc() invocation David Hildenbrand
@ 2021-07-20 13:55 ` Pankaj Gupta
  2021-07-20 13:58   ` Pankaj Gupta
  2021-07-20 14:45 ` Daniel P. Berrangé
  4 siblings, 1 reply; 15+ messages in thread
From: Pankaj Gupta @ 2021-07-20 13:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eduardo Habkost, Michael S. Tsirkin, Dr . David Alan Gilbert,
	Qemu Developers, Igor Mammedov, Paolo Bonzini, Marek Kedzierski

> #1 adds support for MADV_POPULATE_WRITE, #2 cleans up the code to avoid
> global variables and prepare for concurrency and #3 makes os_mem_prealloc()
> safe to be called from multiple threads concurrently.
>
> Details regarding MADV_POPULATE_WRITE can be found in introducing upstream
> Linux commit 4ca9b3859dac ("mm/madvise: introduce
> MADV_POPULATE_(READ|WRITE) to prefault page tables") and in the latest man
> page patch [1].
>
> [1] https://lkml.kernel.org/r/20210712083917.16361-1-david@redhat.com
>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Cc: Marek Kedzierski <mkedzier@redhat.com>
> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
>
> David Hildenbrand (3):
>   util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
>   util/oslib-posix: Introduce and use MemsetContext for
>     touch_all_pages()
>   util/oslib-posix: Support concurrent os_mem_prealloc() invocation
>
>  include/qemu/osdep.h |   7 ++
>  util/oslib-posix.c   | 167 ++++++++++++++++++++++++++++++-------------
>  2 files changed, 126 insertions(+), 48 deletions(-)
>

Nice implementation: it avoids wear on the memory device in the prealloc
case, avoids touching all of the memory, and prevents an abrupt exit of the
VM due to lack of memory. Populating the page tables with madvise is the
better approach.

Plan is to use this infrastructure for virtio-mem, I guess?

For the patches 1 & 3:
Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>


Thanks,
Pankaj



* Re: [PATCH v1 0/3] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
  2021-07-20 13:55 ` [PATCH v1 0/3] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc() Pankaj Gupta
@ 2021-07-20 13:58   ` Pankaj Gupta
  0 siblings, 0 replies; 15+ messages in thread
From: Pankaj Gupta @ 2021-07-20 13:58 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eduardo Habkost, Michael S. Tsirkin, Dr . David Alan Gilbert,
	Qemu Developers, Igor Mammedov, Paolo Bonzini, Marek Kedzierski

> > #1 adds support for MADV_POPULATE_WRITE, #2 cleans up the code to avoid
> > global variables and prepare for concurrency and #3 makes os_mem_prealloc()
> > safe to be called from multiple threads concurrently.
> >
> > Details regarding MADV_POPULATE_WRITE can be found in introducing upstream
> > Linux commit 4ca9b3859dac ("mm/madvise: introduce
> > MADV_POPULATE_(READ|WRITE) to prefault page tables") and in the latest man
> > page patch [1].
> >
> > [1] https://lkml.kernel.org/r/20210712083917.16361-1-david@redhat.com
> >
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > Cc: Igor Mammedov <imammedo@redhat.com>
> > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > Cc: Marek Kedzierski <mkedzier@redhat.com>
> > Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
> >
> > David Hildenbrand (3):
> >   util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
> >   util/oslib-posix: Introduce and use MemsetContext for
> >     touch_all_pages()
> >   util/oslib-posix: Support concurrent os_mem_prealloc() invocation
> >
> >  include/qemu/osdep.h |   7 ++
> >  util/oslib-posix.c   | 167 ++++++++++++++++++++++++++++++-------------
> >  2 files changed, 126 insertions(+), 48 deletions(-)
> >
>
> Nice implementation to avoid wear of memory device for prealloc case

For prealloc case I mean.

> and to avoid touching of
> all the memory and abrupt exit of VM because of lack of memory. Instead better
> way to populate the page tables with madvise.
>
> Plan is to use this infrastructure for virtio-mem, I guess?
>
> For the patches 1 & 3:
> Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
>
>
> Thanks,
> Pankaj



* Re: [PATCH v1 1/3] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
  2021-07-14 11:23 ` [PATCH v1 1/3] " David Hildenbrand
@ 2021-07-20 14:08   ` Daniel P. Berrangé
  2021-07-20 14:34     ` David Hildenbrand
  0 siblings, 1 reply; 15+ messages in thread
From: Daniel P. Berrangé @ 2021-07-20 14:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S. Tsirkin, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Igor Mammedov,
	Marek Kedzierski

On Wed, Jul 14, 2021 at 01:23:04PM +0200, David Hildenbrand wrote:
> Let's sense support and use it for preallocation. MADV_POPULATE_WRITE
> does not require a SIGBUS handler, doesn't actually touch page content,
> and avoids context switches; it is, therefore, faster and easier to handle
> than our current approach.
> 
> This resolves the TODO in do_touch_pages().
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/qemu/osdep.h |  7 ++++
>  util/oslib-posix.c   | 84 +++++++++++++++++++++++++++++++++-----------
>  2 files changed, 71 insertions(+), 20 deletions(-)
> 

> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index e8bdb02e1d..679796ac1f 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -484,10 +484,6 @@ static void *do_touch_pages(void *arg)
>               *
>               * 'volatile' to stop compiler optimizing this away
>               * to a no-op
> -             *
> -             * TODO: get a better solution from kernel so we
> -             * don't need to write at all so we don't cause
> -             * wear on the storage backing the region...
>               */
>              *(volatile char *)addr = *addr;
>              addr += hpagesize;
> @@ -497,6 +493,27 @@ static void *do_touch_pages(void *arg)
>      return NULL;
>  }
>  
> +static void *do_madv_populate_write_pages(void *arg)
> +{
> +    MemsetThread *memset_args = (MemsetThread *)arg;
> +    const size_t size = memset_args->numpages * memset_args->hpagesize;
> +    char * const addr = memset_args->addr;
> +    int ret;
> +
> +    /* See do_touch_pages(). */
> +    qemu_mutex_lock(&page_mutex);
> +    while (!threads_created_flag) {
> +        qemu_cond_wait(&page_cond, &page_mutex);
> +    }
> +    qemu_mutex_unlock(&page_mutex);
> +
> +    ret = qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE);
> +    if (ret) {
> +        memset_thread_failed = true;
> +    }
> +    return NULL;
> +}
> +
>  static inline int get_memset_num_threads(int smp_cpus)
>  {
>      long host_procs = sysconf(_SC_NPROCESSORS_ONLN);
> @@ -510,10 +527,11 @@ static inline int get_memset_num_threads(int smp_cpus)
>  }
>  
>  static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
> -                            int smp_cpus)
> +                            int smp_cpus, bool use_madv_populate_write)
>  {
>      static gsize initialized = 0;
>      size_t numpages_per_thread, leftover;
> +    void *(*touch_fn)(void *);
>      char *addr = area;
>      int i = 0;
>  
> @@ -523,6 +541,12 @@ static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
>          g_once_init_leave(&initialized, 1);
>      }
>  
> +    if (use_madv_populate_write) {
> +        touch_fn = do_madv_populate_write_pages;
> +    } else {
> +        touch_fn = do_touch_pages;
> +    }
> +
>      memset_thread_failed = false;
>      threads_created_flag = false;
>      memset_num_threads = get_memset_num_threads(smp_cpus);
> @@ -534,7 +558,7 @@ static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
>          memset_thread[i].numpages = numpages_per_thread + (i < leftover);
>          memset_thread[i].hpagesize = hpagesize;
>          qemu_thread_create(&memset_thread[i].pgthread, "touch_pages",
> -                           do_touch_pages, &memset_thread[i],
> +                           touch_fn, &memset_thread[i],
>                             QEMU_THREAD_JOINABLE);
>          addr += memset_thread[i].numpages * hpagesize;
>      }

Do you have an indication of what the speed differential is for the
old read/write dance vs the kernel madvise? We needed to use threads
previously because the read/write dance is pretty terribly slow.

Is that still a problem with the madvise approach ? I would (perhaps
naively), expect that the kernel would be able to do this efficiently
for arbitrarily large memory regions, such that QEMU would not need
to play games with threads.


> @@ -553,6 +577,12 @@ static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
>      return memset_thread_failed;
>  }
>  
> +static bool madv_populate_write_possible(char *area, size_t pagesize)
> +{
> +    return !qemu_madvise(area, pagesize, QEMU_MADV_POPULATE_WRITE) ||
> +           errno != EINVAL;
> +}
> +
>  void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
>                       Error **errp)
>  {
> @@ -560,29 +590,43 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
>      struct sigaction act, oldact;
>      size_t hpagesize = qemu_fd_getpagesize(fd);
>      size_t numpages = DIV_ROUND_UP(memory, hpagesize);
> +    bool use_madv_populate_write;

Initialized with random garbage from the stack

> +
> +    /*
> +     * Sense on every invocation, as MADV_POPULATE_WRITE cannot be used for
> +     * some special mappings, such as mapping /dev/mem.
> +     */
> +    if (madv_populate_write_possible(area, hpagesize)) {
> +        use_madv_populate_write = true;
> +    }

but this implicitly assumes it was initialized to false.

>  
> -    memset(&act, 0, sizeof(act));
> -    act.sa_handler = &sigbus_handler;
> -    act.sa_flags = 0;
> +    if (!use_madv_populate_write) {
> +        memset(&act, 0, sizeof(act));
> +        act.sa_handler = &sigbus_handler;
> +        act.sa_flags = 0;
>  
> -    ret = sigaction(SIGBUS, &act, &oldact);
> -    if (ret) {
> -        error_setg_errno(errp, errno,
> -            "os_mem_prealloc: failed to install signal handler");
> -        return;
> +        ret = sigaction(SIGBUS, &act, &oldact);
> +        if (ret) {
> +            error_setg_errno(errp, errno,
> +                "os_mem_prealloc: failed to install signal handler");
> +            return;
> +        }
>      }
>  
>      /* touch pages simultaneously */
> -    if (touch_all_pages(area, hpagesize, numpages, smp_cpus)) {
> +    if (touch_all_pages(area, hpagesize, numpages, smp_cpus,
> +                        use_madv_populate_write)) {
>          error_setg(errp, "os_mem_prealloc: Insufficient free host memory "
>              "pages available to allocate guest RAM");
>      }
>  
> -    ret = sigaction(SIGBUS, &oldact, NULL);
> -    if (ret) {
> -        /* Terminate QEMU since it can't recover from error */
> -        perror("os_mem_prealloc: failed to reinstall signal handler");
> -        exit(1);
> +    if (!use_madv_populate_write) {
> +        ret = sigaction(SIGBUS, &oldact, NULL);
> +        if (ret) {
> +            /* Terminate QEMU since it can't recover from error */
> +            perror("os_mem_prealloc: failed to reinstall signal handler");
> +            exit(1);
> +        }
>      }
>  }
>  
> -- 
> 2.31.1
> 
> 

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH v1 3/3] util/oslib-posix: Support concurrent os_mem_prealloc() invocation
  2021-07-14 11:23 ` [PATCH v1 3/3] util/oslib-posix: Support concurrent os_mem_prealloc() invocation David Hildenbrand
@ 2021-07-20 14:22   ` Daniel P. Berrangé
  2021-07-20 14:27     ` David Hildenbrand
  0 siblings, 1 reply; 15+ messages in thread
From: Daniel P. Berrangé @ 2021-07-20 14:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S. Tsirkin, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Igor Mammedov,
	Marek Kedzierski

On Wed, Jul 14, 2021 at 01:23:06PM +0200, David Hildenbrand wrote:
> Add a mutext to protect the SIGBUS case, as we cannot mess concurrently

typo  s/mutext/mutex/

> with the sigbus handler and we have to manage the global variable
> sigbus_memset_context. The MADV_POPULATE_WRITE path can run
> concurrently.
> 
> Note that page_mutex and page_cond are shared between concurrent
> invocations, which shouldn't be a problem.
> 
> This is a preparation for future virtio-mem prealloc code, which will call
> os_mem_prealloc() asynchronously from an iothread when handling guest
> requests.

Hmm, I'm wondering how the need to temporarily play with SIGBUS
at runtime for mem preallocation will interact with the SIGBUS
handler installed by softmmu/cpus.c.

The SIGBUS handler the preallocation code installs just
blindly assumes the SIGBUS is related to the preallocation
work being done. This is a fine assumption during initial
startup, where we're single-threaded and not running guest
CPUs. I'm less clear on whether that's a valid assumption
at runtime once guest CPUs are running.

If the sigbus_handler method in softmmu/cpus.c is doing
something important for QEMU, then why is it ok for us to
periodically disable that handler and replace it with
something else that takes a completely different action ?

Of course with the madvise impl we're bypassing the SIGBUS
dance entirely. This is good for people with new kernels,
but is this SIGBUS stuff safe for older kernels ?

> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  util/oslib-posix.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index 60d1da2d6c..181f6bbf1a 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -94,6 +94,7 @@ typedef struct MemsetThread MemsetThread;
>  
>  /* used by sigbus_handler() */
>  static MemsetContext *sigbus_memset_context;
> +static QemuMutex sigbus_mutex;
>  
>  static QemuMutex page_mutex;
>  static QemuCond page_cond;
> @@ -605,12 +606,17 @@ static bool madv_populate_write_possible(char *area, size_t pagesize)
>  void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
>                       Error **errp)
>  {
> +    static gsize initialized;
>      int ret;
>      struct sigaction act, oldact;
>      size_t hpagesize = qemu_fd_getpagesize(fd);
>      size_t numpages = DIV_ROUND_UP(memory, hpagesize);
>      bool use_madv_populate_write;
>  
> +    if (g_once_init_enter(&initialized)) {
> +        qemu_mutex_init(&sigbus_mutex);
> +    }
> +
>      /*
>       * Sense on every invocation, as MADV_POPULATE_WRITE cannot be used for
>       * some special mappings, such as mapping /dev/mem.
> @@ -620,6 +626,7 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
>      }
>  
>      if (!use_madv_populate_write) {
> +        qemu_mutex_lock(&sigbus_mutex);
>          memset(&act, 0, sizeof(act));
>          act.sa_handler = &sigbus_handler;
>          act.sa_flags = 0;
> @@ -646,6 +653,7 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
>              perror("os_mem_prealloc: failed to reinstall signal handler");
>              exit(1);
>          }
> +        qemu_mutex_unlock(&sigbus_mutex);
>      }
>  }



>  
> -- 
> 2.31.1
> 
> 

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1 2/3] util/oslib-posix: Introduce and use MemsetContext for touch_all_pages()
  2021-07-14 11:23 ` [PATCH v1 2/3] util/oslib-posix: Introduce and use MemsetContext for touch_all_pages() David Hildenbrand
@ 2021-07-20 14:27   ` Daniel P. Berrangé
  0 siblings, 0 replies; 15+ messages in thread
From: Daniel P. Berrangé @ 2021-07-20 14:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S. Tsirkin, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Igor Mammedov,
	Marek Kedzierski

On Wed, Jul 14, 2021 at 01:23:05PM +0200, David Hildenbrand wrote:
> Let's minimize the number of global variables to prepare for
> os_mem_prealloc() getting called concurrently and make the code a bit
> easier to read.
> 
> The only consumer that really needs a global variable is the sigbus
> handler, which will require protection via a mutex in the future either way
> as we cannot concurrently mess with the SIGBUS handler.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  util/oslib-posix.c | 81 ++++++++++++++++++++++++++++------------------
>  1 file changed, 50 insertions(+), 31 deletions(-)

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1 3/3] util/oslib-posix: Support concurrent os_mem_prealloc() invocation
  2021-07-20 14:22   ` Daniel P. Berrangé
@ 2021-07-20 14:27     ` David Hildenbrand
  2021-07-20 14:31       ` Daniel P. Berrangé
  0 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2021-07-20 14:27 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S. Tsirkin, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Igor Mammedov,
	Marek Kedzierski

On 20.07.21 16:22, Daniel P. Berrangé wrote:
> On Wed, Jul 14, 2021 at 01:23:06PM +0200, David Hildenbrand wrote:
>> Add a mutext to protect the SIGBUS case, as we cannot mess concurrently
> 
> typo  s/mutext/mutex/
> 
>> with the sigbus handler and we have to manage the global variable
>> sigbus_memset_context. The MADV_POPULATE_WRITE path can run
>> concurrently.
>>
>> Note that page_mutex and page_cond are shared between concurrent
>> invocations, which shouldn't be a problem.
>>
>> This is a preparation for future virtio-mem prealloc code, which will call
>> os_mem_prealloc() asynchronously from an iothread when handling guest
>> requests.
> 
> Hmm, I'm wondering how the need to temporarily play with SIGBUS
> at runtime for mem preallocation will interact with the SIGBUS
> handler installed by softmmu/cpus.c.

That's exactly why I came up with MADV_POPULATE_WRITE, to avoid having 
to mess with different kinds of sigbus at the same time. You can only 
get it wrong.

> 
> The SIGBUS handler the preallocation code installs just
> blindly assumes the SIGBUS is related to the preallocation
> work being done. This is a fine assumption during initial
> startup, where we're single-threaded and not running guest
> CPUs. I'm less clear on whether that's a valid assumption
> at runtime once guest CPUs are running.

I assume it's quite broken, for example, already when hotplugging a DIMM 
and preallocating memory for the memory backend.

> 
> If the sigbus_handler method in softmmu/cpus.c is doing
> something important for QEMU, then why is it ok for us to
> periodically disable that handler and replace it with
> something else that takes a completely different action ?

I don't think it is ok. I assume it has been broken for a long time, 
just nobody ever ran into that race.

> 
> Of course with the madvise impl we're bypassing the SIGBUS
> dance entirely. This is good for people with new kernels,
> but is this SIGBUS stuff safe for older kernels ?

It remains broken with old kernels I guess. There isn't too much that we 
can do: disabling prealloc=on once the VM is running breaks existing use 
cases.

Fortunately, running into that race seems to be rare, at least I never 
hear reports.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1 3/3] util/oslib-posix: Support concurrent os_mem_prealloc() invocation
  2021-07-20 14:27     ` David Hildenbrand
@ 2021-07-20 14:31       ` Daniel P. Berrangé
  2021-07-20 14:35         ` David Hildenbrand
  0 siblings, 1 reply; 15+ messages in thread
From: Daniel P. Berrangé @ 2021-07-20 14:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S. Tsirkin, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Igor Mammedov,
	Marek Kedzierski

On Tue, Jul 20, 2021 at 04:27:32PM +0200, David Hildenbrand wrote:
> On 20.07.21 16:22, Daniel P. Berrangé wrote:
> > On Wed, Jul 14, 2021 at 01:23:06PM +0200, David Hildenbrand wrote:
> > > Add a mutext to protect the SIGBUS case, as we cannot mess concurrently
> > 
> > typo  s/mutext/mutex/
> > 
> > > with the sigbus handler and we have to manage the global variable
> > > sigbus_memset_context. The MADV_POPULATE_WRITE path can run
> > > concurrently.
> > > 
> > > Note that page_mutex and page_cond are shared between concurrent
> > > invocations, which shouldn't be a problem.
> > > 
> > > This is a preparation for future virtio-mem prealloc code, which will call
> > > os_mem_prealloc() asynchronously from an iothread when handling guest
> > > requests.
> > 
> > Hmm, I'm wondering how the need to temporarily play with SIGBUS
> > at runtime for mem preallocation will interact with the SIGBUS
> > handler installed by softmmu/cpus.c.
> 
> That's exactly why I came up with MADV_POPULATE_WRITE, to avoid having to
> mess with different kinds of sigbus at the same time. You can only get it
> wrong.
> 
> > 
> > The SIGBUS handler the preallocation code installs just
> > blindly assumes the SIGBUS is related to the preallocation
> > work being done. This is a fine assumption during initial
> > startup, where we're single-threaded and not running guest
> > CPUs. I'm less clear on whether that's a valid assumption
> > at runtime once guest CPUs are running.
> 
> I assume it's quite broken, for example, already when hotplugging a DIMM and
> preallocating memory for the memory backend.
> 
> > 
> > If the sigbus_handler method in softmmu/cpus.c is doing
> > something important for QEMU, then why is it ok for us to
> > periodically disable that handler and replace it with
> > something else that takes a completely different action ?
> 
> I don't think it is ok. I assume it has been broken for a long time, just
> nobody ever ran into that race.
> 
> > 
> > Of course with the madvise impl we're bypassing the SIGBUS
> > dance entirely. This is good for people with new kernels,
> > but is this SIGBUS stuff safe for older kernels ?
> 
> It remains broken with old kernels I guess. There isn't too much that we can
> do: disabling prealloc=on once the VM is running breaks existing use cases.

Ok, while refactoring this, could you add a scary warning next to the
sigaction calls mentioning that this code is not likely to play well
with qemu's other handling of sigbus, as a reminder to future reviewers.

> Fortunately, running into that race seems to be rare, at least I never hear
> reports.

The failure mode is likely to be silent or easily mis-interpreted.

Is there any value in emitting a one-time per process warning message
on stderr if we take the old codepath post-startup ?

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1 1/3] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
  2021-07-20 14:08   ` Daniel P. Berrangé
@ 2021-07-20 14:34     ` David Hildenbrand
  0 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2021-07-20 14:34 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S. Tsirkin, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Igor Mammedov,
	Marek Kedzierski

>>       memset_thread_failed = false;
>>       threads_created_flag = false;
>>       memset_num_threads = get_memset_num_threads(smp_cpus);
>> @@ -534,7 +558,7 @@ static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
>>           memset_thread[i].numpages = numpages_per_thread + (i < leftover);
>>           memset_thread[i].hpagesize = hpagesize;
>>           qemu_thread_create(&memset_thread[i].pgthread, "touch_pages",
>> -                           do_touch_pages, &memset_thread[i],
>> +                           touch_fn, &memset_thread[i],
>>                              QEMU_THREAD_JOINABLE);
>>           addr += memset_thread[i].numpages * hpagesize;
>>       }
> 
> Do you have an indication of what the speed differential is for the
> old read/write dance vs the kernel madvise. We needed to use threads
> previously because the read/write dance is pretty terribly slow.

The kernel patch has some performance numbers: 
https://lkml.kernel.org/r/20210712083917.16361-1-david@redhat.com

For example (compressed),

     **************************************************
     4096 MiB MAP_PRIVATE:
     **************************************************
     Anon 4 KiB     : Read/Write               :  1054.041 ms
     Anon 4 KiB     : POPULATE_WRITE           :   572.582 ms
     Memfd 4 KiB    : Read/Write               :  1106.561 ms
     Memfd 4 KiB    : POPULATE_WRITE           :   805.881 ms
     Memfd 2 MiB    : Read/Write               :   357.606 ms
     Memfd 2 MiB    : POPULATE_WRITE           :   356.937 ms
     tmpfs          : Read/Write               :  1105.954 ms
     tmpfs          : POPULATE_WRITE           :   822.826 ms
     file           : Read/Write               :  1107.439 ms
     file           : POPULATE_WRITE           :   857.622 ms
     hugetlbfs      : Read/Write               :   356.127 ms
     hugetlbfs      : POPULATE_WRITE           :   355.138 ms


     **************************************************
     4096 MiB MAP_SHARED:
     **************************************************
     Anon 4 KiB     : Read/Write               :  1060.350 ms
     Anon 4 KiB     : POPULATE_WRITE           :   782.885 ms
     Anon 2 MiB     : Read/Write               :   357.992 ms
     Anon 2 MiB     : POPULATE_WRITE           :   357.808 ms
     Memfd 4 KiB    : Read/Write               :  1100.391 ms
     Memfd 4 KiB    : POPULATE_WRITE           :   804.394 ms
     Memfd 2 MiB    : Read/Write               :   358.250 ms
     Memfd 2 MiB    : POPULATE_WRITE           :   357.334 ms
     tmpfs          : Read/Write               :  1107.567 ms
     tmpfs          : POPULATE_WRITE           :   810.094 ms
     file           : Read/Write               :  1289.509 ms
     file           : POPULATE_WRITE           :  1106.816 ms
     hugetlbfs      : Read/Write               :   357.120 ms
     hugetlbfs      : POPULATE_WRITE           :   356.693 ms

For huge pages, it barely makes a difference with smallish VMs. In the 
other cases, it speeds things up, but not so dramatically that it would allow 
for dropping multi-threading.

The original MADV_POPULATE from 2016 
https://lore.kernel.org/patchwork/patch/389581/ mentions that it 
especially helps speed up multi-threaded pre-faulting, due to reduced 
mmap_lock contention. I did not do any multi-threading benchmarks, though.

[...]

> 
> Initialized with random garbage from the stack
> 
>> +
>> +    /*
>> +     * Sense on every invocation, as MADV_POPULATE_WRITE cannot be used for
>> +     * some special mappings, such as mapping /dev/mem.
>> +     */
>> +    if (madv_populate_write_possible(area, hpagesize)) {
>> +        use_madv_populate_write = true;
>> +    }
> 
> but this implicitly assumes it was initialized to false.

Indeed, thanks for catching that!


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1 3/3] util/oslib-posix: Support concurrent os_mem_prealloc() invocation
  2021-07-20 14:31       ` Daniel P. Berrangé
@ 2021-07-20 14:35         ` David Hildenbrand
  0 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2021-07-20 14:35 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S. Tsirkin, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Igor Mammedov,
	Marek Kedzierski

>>>
>>> Of course with the madvise impl we're bypassing the SIGBUS
>>> dance entirely. This is good for people with new kernels,
>>> but is this SIGBUS stuff safe for older kernels ?
>>
>> It remains broken with old kernels I guess. There isn't too much that we can
>> do: disabling prealloc=on once the VM is running breaks existing use cases.
> 
> Ok, while refactoring this, could you add a scary warning next to the
> sigaction calls mentioning that this code is not likely to play well
> with qemu's other handling of sigbus, as a reminder to future reviewers.

Sure thing!

> 
>> Fortunately, running into that race seems to be rare, at least I never hear
>> reports.
> 
> The failure mode is likely to be silent or easily mis-interpreted
> 
> Is there any value in emitting a one-time per process warning message
> on stderr if we take the old codepath post-startup ?

Will look into emitting a warning when running this code while the VM is 
already running. (hope it won't be too ugly to have vm state checks in 
util/oslib-posix).

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1 0/3] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
  2021-07-14 11:23 [PATCH v1 0/3] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc() David Hildenbrand
                   ` (3 preceding siblings ...)
  2021-07-20 13:55 ` [PATCH v1 0/3] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc() Pankaj Gupta
@ 2021-07-20 14:45 ` Daniel P. Berrangé
  2021-07-21  8:23   ` David Hildenbrand
  4 siblings, 1 reply; 15+ messages in thread
From: Daniel P. Berrangé @ 2021-07-20 14:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S. Tsirkin, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Igor Mammedov,
	Marek Kedzierski

On Wed, Jul 14, 2021 at 01:23:03PM +0200, David Hildenbrand wrote:
> #1 adds support for MADV_POPULATE_WRITE, #2 cleans up the code to avoid
> global variables and prepare for concurrency and #3 makes os_mem_prealloc()
> safe to be called from multiple threads concurrently.
> 
> Details regarding MADV_POPULATE_WRITE can be found in introducing upstream
> Linux commit 4ca9b3859dac ("mm/madvise: introduce
> MADV_POPULATE_(READ|WRITE) to prefault page tables") and in the latest man
> page patch [1].

Looking at that commit message, I see your caveat about POPULATE_WRITE
used together with shared file mappings, causing an undesirable glut
of dirty pages that needs to be flushed back to the underlying storage.

Is this something we need to be concerned with for the hostmem-file.c
implementation ? While it is mostly used to point to files on tmpfs
or hugetlbfs, I think users do something point it to a plain file
on a normal filesystem.  So will we need to optimize to use the
fallocate+POPULATE_READ combination at some point ?


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1 0/3] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
  2021-07-20 14:45 ` Daniel P. Berrangé
@ 2021-07-21  8:23   ` David Hildenbrand
  0 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2021-07-21  8:23 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Pankaj Gupta, Eduardo Habkost, Michael S. Tsirkin, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Igor Mammedov,
	Marek Kedzierski

On 20.07.21 16:45, Daniel P. Berrangé wrote:
> On Wed, Jul 14, 2021 at 01:23:03PM +0200, David Hildenbrand wrote:
>> #1 adds support for MADV_POPULATE_WRITE, #2 cleans up the code to avoid
>> global variables and prepare for concurrency and #3 makes os_mem_prealloc()
>> safe to be called from multiple threads concurrently.
>>
>> Details regarding MADV_POPULATE_WRITE can be found in introducing upstream
>> Linux commit 4ca9b3859dac ("mm/madvise: introduce
>> MADV_POPULATE_(READ|WRITE) to prefault page tables") and in the latest man
>> page patch [1].
> 
> Looking at that commit message, I see your caveat about POPULATE_WRITE
> used together with shared file mappings, causing an undesirable glut
> of dirty pages that needs to be flushed back to the underlying storage.
> 
> Is this something we need to be concerned with for the hostmem-file.c
> implementation ? While it is mostly used to point to files on tmpfs
or hugetlbfs, I think users do sometimes point it to a plain file
> on a normal filesystem.  So will we need to optimize to use the
> fallocate+POPULATE_READ combination at some point ?

In the future, it might make sense to use fallocate() only when it comes 
to shared file mappings.

AFAIKS os_mem_prealloc() currently serves the following purposes:

1) Preallocate anonymous memory or backend storage (file, hugetlbfs, ...)
2) Apply mbind() policy, preallocating it from the right node when 
applicable.
3) Prefault page tables

For shared mappings, it's a little bit difficult, though: mbind() does 
not seem to work on shared mappings (which to some degree makes 
logically sense, but I don't think QEMU users are aware that it is like 
that): "The specified policy will be ignored for any  MAP_SHARED 
mappings  in  the specified  memory range. Rather the pages will be 
allocated according to the memory policy of the thread that caused the 
page to be allocated. Again, this may not be the thread that called 
mbind()."

So 2) does not apply. A simple fallocate() can get 1) done more efficiently.

So whether we want to use MADV_POPULATE_READ depends entirely on whether 
we want 3). It can make sense to prefault page tables for RT workloads; 
however, there is usually nothing stopping the OS from clearing the page 
cache and requiring a refault later -- except with mlock.

So whether we want fallocate() or fallocate()+MADV_POPULATE_READ for 
shared file mappings really depends on the use case, and on the system 
setup. If the system won't immediately free up the page cache and undo 
what MADV_POPULATE_READ did, it might make sense to use it.

Long story short: it's complicated :)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2021-07-21  8:24 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-14 11:23 [PATCH v1 0/3] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc() David Hildenbrand
2021-07-14 11:23 ` [PATCH v1 1/3] " David Hildenbrand
2021-07-20 14:08   ` Daniel P. Berrangé
2021-07-20 14:34     ` David Hildenbrand
2021-07-14 11:23 ` [PATCH v1 2/3] util/oslib-posix: Introduce and use MemsetContext for touch_all_pages() David Hildenbrand
2021-07-20 14:27   ` Daniel P. Berrangé
2021-07-14 11:23 ` [PATCH v1 3/3] util/oslib-posix: Support concurrent os_mem_prealloc() invocation David Hildenbrand
2021-07-20 14:22   ` Daniel P. Berrangé
2021-07-20 14:27     ` David Hildenbrand
2021-07-20 14:31       ` Daniel P. Berrangé
2021-07-20 14:35         ` David Hildenbrand
2021-07-20 13:55 ` [PATCH v1 0/3] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc() Pankaj Gupta
2021-07-20 13:58   ` Pankaj Gupta
2021-07-20 14:45 ` Daniel P. Berrangé
2021-07-21  8:23   ` David Hildenbrand
