From: "Michael S. Tsirkin" <mst@redhat.com>
To: qemu-devel@nongnu.org
Cc: "Peter Maydell" <peter.maydell@linaro.org>,
	"Daniel P . Berrangé" <berrange@redhat.com>,
	"David Hildenbrand" <david@redhat.com>,
	"Michal Privoznik" <mprivozn@redhat.com>,
	"Pankaj Gupta" <pankaj.gupta@ionos.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>
Subject: [PULL 30/52] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
Date: Thu, 6 Jan 2022 08:17:39 -0500
Message-ID: <20220106131534.423671-31-mst@redhat.com>
In-Reply-To: <20220106131534.423671-1-mst@redhat.com>

From: David Hildenbrand <david@redhat.com>

Let's sense support for MADV_POPULATE_WRITE and use it for preallocation
when available. MADV_POPULATE_WRITE does not require a SIGBUS handler,
doesn't actually touch page content, and avoids context switches; it is
therefore faster and easier to handle than our current approach of
manually touching every page.
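
For illustration only, here is a minimal standalone sketch of the
sense-then-populate idea outside of QEMU. It is not the code from this
patch: the names populate_write_possible(), prealloc_region() and
demo_prealloc_anon() are made up for the example, the fallback define of
23 is the Linux UAPI value for headers that predate the constant, and the
manual-touch fallback omits the SIGBUS handler and the threading that
QEMU uses.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE         /* expose madvise() and MAP_ANONYMOUS */
#endif
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Older libc headers may lack the constant; 23 is the Linux UAPI value. */
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

/*
 * Probe a single page. Only EINVAL is taken to mean "not usable here";
 * any other error is left for the real populate call to report.
 * Note that a successful probe already populates that one page.
 */
static bool populate_write_possible(void *area, size_t pagesize)
{
    return madvise(area, pagesize, MADV_POPULATE_WRITE) == 0 ||
           errno != EINVAL;
}

/* Prefault [area, area + size). Returns 0 or a negative errno value. */
static int prealloc_region(char *area, size_t size, size_t pagesize)
{
    if (populate_write_possible(area, pagesize)) {
        return madvise(area, size, MADV_POPULATE_WRITE) ? -errno : 0;
    }

    /*
     * Fallback (assumption: no SIGBUS can occur for this mapping):
     * write the first byte of every page back to itself.
     */
    for (size_t off = 0; off < size; off += pagesize) {
        volatile char *p = area + off;
        *p = *p;
    }
    return 0;
}

/* Usage example: preallocate npages of private anonymous memory. */
static int demo_prealloc_anon(size_t npages)
{
    const size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
    const size_t size = npages * pagesize;
    char *area = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int ret;

    if (area == MAP_FAILED) {
        return -errno;
    }
    ret = prealloc_region(area, size, pagesize);
    munmap(area, size);
    return ret;
}
```

On kernels without MADV_POPULATE_WRITE the probe fails with EINVAL and
the sketch silently takes the manual path, which mirrors the sensing
logic described above.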

While MADV_POPULATE_WRITE is, in general, faster than manual
prefaulting, and especially faster with 4k pages, there is still value in
prefaulting using multiple threads to speed up preallocation.
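
As a rough sketch of how a region can be divided among threads: the diff
below assigns thread i "numpages_per_thread + (i < leftover)" pages. The
helpers here are hypothetical and assume that numpages_per_thread and
leftover are the quotient and remainder of numpages divided by the thread
count, which is not visible in the hunks of this patch.

```c
#include <stddef.h>

/* Hypothetical helper: number of pages handled by thread i. */
static size_t pages_for_thread(size_t numpages, size_t nthreads, size_t i)
{
    const size_t per_thread = numpages / nthreads;
    const size_t leftover = numpages % nthreads;

    /* The first 'leftover' threads each take one extra page. */
    return per_thread + (i < leftover);
}

/* Hypothetical helper: byte offset of the chunk handled by thread i. */
static size_t offset_for_thread(size_t numpages, size_t nthreads, size_t i,
                                size_t pagesize)
{
    const size_t per_thread = numpages / nthreads;
    const size_t leftover = numpages % nthreads;
    /* Extra pages already handed out to the threads before thread i. */
    const size_t extra = i < leftover ? i : leftover;

    return (i * per_thread + extra) * pagesize;
}
```

With 10 pages and 4 threads, the threads get 3, 3, 2 and 2 pages, so each
populate call covers one contiguous chunk and no page is left out.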

More details on MADV_POPULATE_WRITE can be found in the Linux commits
4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault
page tables") and eb2faa513c24 ("mm/madvise: report SIGBUS as -EFAULT for
MADV_POPULATE_(READ|WRITE)"), and in the man page proposal [1].

This resolves the TODO in do_touch_pages().

In the future, we might want to look into using fallocate(), possibly
combined with MADV_POPULATE_READ, when dealing with shared file/fd
mappings and not caring about memory bindings.

[1] https://lkml.kernel.org/r/20210816081922.5155-1-david@redhat.com

Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20211217134611.31172-3-david@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/qemu/osdep.h |  7 ++++
 util/oslib-posix.c   | 81 +++++++++++++++++++++++++++++++++-----------
 2 files changed, 68 insertions(+), 20 deletions(-)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 60718fc342..d1660d67fa 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -471,6 +471,11 @@ static inline void qemu_cleanup_generic_vfree(void *p)
 #else
 #define QEMU_MADV_REMOVE QEMU_MADV_DONTNEED
 #endif
+#ifdef MADV_POPULATE_WRITE
+#define QEMU_MADV_POPULATE_WRITE MADV_POPULATE_WRITE
+#else
+#define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID
+#endif
 
 #elif defined(CONFIG_POSIX_MADVISE)
 
@@ -484,6 +489,7 @@ static inline void qemu_cleanup_generic_vfree(void *p)
 #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_REMOVE QEMU_MADV_DONTNEED
+#define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID
 
 #else /* no-op */
 
@@ -497,6 +503,7 @@ static inline void qemu_cleanup_generic_vfree(void *p)
 #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
+#define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID
 
 #endif
 
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index b146beef78..cb89e07770 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -484,10 +484,6 @@ static void *do_touch_pages(void *arg)
              *
              * 'volatile' to stop compiler optimizing this away
              * to a no-op
-             *
-             * TODO: get a better solution from kernel so we
-             * don't need to write at all so we don't cause
-             * wear on the storage backing the region...
              */
             *(volatile char *)addr = *addr;
             addr += hpagesize;
@@ -497,6 +493,26 @@ static void *do_touch_pages(void *arg)
     return (void *)(uintptr_t)ret;
 }
 
+static void *do_madv_populate_write_pages(void *arg)
+{
+    MemsetThread *memset_args = (MemsetThread *)arg;
+    const size_t size = memset_args->numpages * memset_args->hpagesize;
+    char * const addr = memset_args->addr;
+    int ret = 0;
+
+    /* See do_touch_pages(). */
+    qemu_mutex_lock(&page_mutex);
+    while (!threads_created_flag) {
+        qemu_cond_wait(&page_cond, &page_mutex);
+    }
+    qemu_mutex_unlock(&page_mutex);
+
+    if (size && qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE)) {
+        ret = -errno;
+    }
+    return (void *)(uintptr_t)ret;
+}
+
 static inline int get_memset_num_threads(int smp_cpus)
 {
     long host_procs = sysconf(_SC_NPROCESSORS_ONLN);
@@ -510,10 +526,11 @@ static inline int get_memset_num_threads(int smp_cpus)
 }
 
 static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
-                           int smp_cpus)
+                           int smp_cpus, bool use_madv_populate_write)
 {
     static gsize initialized = 0;
     size_t numpages_per_thread, leftover;
+    void *(*touch_fn)(void *);
     int ret = 0, i = 0;
     char *addr = area;
 
@@ -523,6 +540,12 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
         g_once_init_leave(&initialized, 1);
     }
 
+    if (use_madv_populate_write) {
+        touch_fn = do_madv_populate_write_pages;
+    } else {
+        touch_fn = do_touch_pages;
+    }
+
     threads_created_flag = false;
     memset_num_threads = get_memset_num_threads(smp_cpus);
     memset_thread = g_new0(MemsetThread, memset_num_threads);
@@ -533,7 +556,7 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
         memset_thread[i].numpages = numpages_per_thread + (i < leftover);
         memset_thread[i].hpagesize = hpagesize;
         qemu_thread_create(&memset_thread[i].pgthread, "touch_pages",
-                           do_touch_pages, &memset_thread[i],
+                           touch_fn, &memset_thread[i],
                            QEMU_THREAD_JOINABLE);
         addr += memset_thread[i].numpages * hpagesize;
     }
@@ -556,6 +579,12 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
     return ret;
 }
 
+static bool madv_populate_write_possible(char *area, size_t pagesize)
+{
+    return !qemu_madvise(area, pagesize, QEMU_MADV_POPULATE_WRITE) ||
+           errno != EINVAL;
+}
+
 void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
                      Error **errp)
 {
@@ -563,30 +592,42 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
     struct sigaction act, oldact;
     size_t hpagesize = qemu_fd_getpagesize(fd);
     size_t numpages = DIV_ROUND_UP(memory, hpagesize);
+    bool use_madv_populate_write;
 
-    memset(&act, 0, sizeof(act));
-    act.sa_handler = &sigbus_handler;
-    act.sa_flags = 0;
+    /*
+     * Sense on every invocation, as MADV_POPULATE_WRITE cannot be used for
+     * some special mappings, such as mapping /dev/mem.
+     */
+    use_madv_populate_write = madv_populate_write_possible(area, hpagesize);
 
-    ret = sigaction(SIGBUS, &act, &oldact);
-    if (ret) {
-        error_setg_errno(errp, errno,
-            "os_mem_prealloc: failed to install signal handler");
-        return;
+    if (!use_madv_populate_write) {
+        memset(&act, 0, sizeof(act));
+        act.sa_handler = &sigbus_handler;
+        act.sa_flags = 0;
+
+        ret = sigaction(SIGBUS, &act, &oldact);
+        if (ret) {
+            error_setg_errno(errp, errno,
+                "os_mem_prealloc: failed to install signal handler");
+            return;
+        }
     }
 
     /* touch pages simultaneously */
-    ret = touch_all_pages(area, hpagesize, numpages, smp_cpus);
+    ret = touch_all_pages(area, hpagesize, numpages, smp_cpus,
+                          use_madv_populate_write);
     if (ret) {
         error_setg_errno(errp, -ret,
                          "os_mem_prealloc: preallocating memory failed");
     }
 
-    ret = sigaction(SIGBUS, &oldact, NULL);
-    if (ret) {
-        /* Terminate QEMU since it can't recover from error */
-        perror("os_mem_prealloc: failed to reinstall signal handler");
-        exit(1);
+    if (!use_madv_populate_write) {
+        ret = sigaction(SIGBUS, &oldact, NULL);
+        if (ret) {
+            /* Terminate QEMU since it can't recover from error */
+            perror("os_mem_prealloc: failed to reinstall signal handler");
+            exit(1);
+        }
     }
 }
 
-- 
MST


