* [PATCH v4 0/6] UFFD write-tracking migration/snapshots
@ 2020-11-26 15:17 Andrey Gruzdev via
  2020-11-26 15:17 ` [PATCH v4 1/6] introduce 'background-snapshot' migration capability Andrey Gruzdev via
                   ` (8 more replies)
  0 siblings, 9 replies; 45+ messages in thread
From: Andrey Gruzdev via @ 2020-11-26 15:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

This patch series revisits the ideas Denis Plotnikov implemented in his
series '[PATCH v0 0/4] migration: add background snapshot'.

Currently the only way to make an (external) live VM snapshot is to use the
existing dirty-page-logging migration mechanism. The main problem is that it
tends to produce many duplicate pages while the running VM keeps updating
pages that have already been saved. As a result, the vmstate image size is
commonly several times bigger than the non-zero part of the virtual machine's
RSS. The time required to converge RAM migration and the size of the snapshot
image depend heavily on the guest memory write rate, sometimes resulting in
unacceptably long snapshot creation times and huge image sizes.

This series proposes a way to solve the aforementioned problems, using a
different RAM migration mechanism based on the UFFD write-protection
management introduced in the v5.7 kernel. The migration strategy is to
'freeze' guest RAM content using write protection and to iteratively release
the protection for memory ranges that have already been saved to the
migration stream. At the same time, pending UFFD write-fault events are read
and the faulting pages are saved out of order with higher priority.

How to use:
1. Enable the background-snapshot migration capability:
   virsh qemu-monitor-command <domain> --hmp \
       migrate_set_capability background-snapshot on

2. Start the external migration to a file
   virsh qemu-monitor-command <domain> --hmp migrate exec:'cat > ./vm_state'

3. Wait for the migration to finish and check that it has reached the
   'completed' state.

Changes v3->v4:

* 1. Renamed migrate capability 'track-writes-ram'->'background-snapshot'.
* 2. Use array of incompatible caps to replace bulky 'if' constructs.
* 3. Moved UFFD low-level code to the separate module ('util/userfaultfd.c').
* 4. Always do UFFD write-unprotect on cleanup; just closing the file
*    descriptor won't clean up PTEs anyhow - it releases registration
*    ranges, wait queues etc. but won't clean up the process MM context
*    at the MMU level.
* 5. Allow the 'background-snapshot' capability to be enabled on Linux
*    hosts only.
* 6. Put UFFD code usage under '#ifdef CONFIG_LINUX' prerequisite.
* 7. Removed 'wt_' from RAMState struct.
* 8. Refactored ram_find_and_save_block() to make it cleaner - poll UFFD
*    wr-fault events in get_queued_page(), use ram_save_host_page_pre()/
*    ram_save_host_page_post() notifiers around ram_save_host_page()
*    instead of bulky inline write-unprotect code.

Andrey Gruzdev (6):
  introduce 'background-snapshot' migration capability
  introduce UFFD-WP low-level interface helpers
  support UFFD write fault processing in ram_save_iterate()
  implementation of background snapshot thread
  the rest of write tracking migration code
  introduce simple linear scan rate limiting mechanism

 include/exec/memory.h      |   7 +
 include/qemu/userfaultfd.h |  29 ++++
 migration/migration.c      | 314 +++++++++++++++++++++++++++++++++-
 migration/migration.h      |   4 +
 migration/ram.c            | 334 ++++++++++++++++++++++++++++++++++++-
 migration/ram.h            |   4 +
 migration/savevm.c         |   1 -
 migration/savevm.h         |   2 +
 qapi/migration.json        |   7 +-
 util/meson.build           |   1 +
 util/userfaultfd.c         | 215 ++++++++++++++++++++++++
 11 files changed, 908 insertions(+), 10 deletions(-)
 create mode 100644 include/qemu/userfaultfd.h
 create mode 100644 util/userfaultfd.c

-- 
2.25.1



^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v4 1/6] introduce 'background-snapshot' migration capability
  2020-11-26 15:17 [PATCH v4 0/6] UFFD write-tracking migration/snapshots Andrey Gruzdev via
@ 2020-11-26 15:17 ` Andrey Gruzdev via
  2020-11-27 19:55   ` Peter Xu
  2020-11-26 15:17 ` [PATCH v4 2/6] introduce UFFD-WP low-level interface helpers Andrey Gruzdev via
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 45+ messages in thread
From: Andrey Gruzdev via @ 2020-11-26 15:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 migration/migration.c | 67 +++++++++++++++++++++++++++++++++++++++++++
 migration/migration.h |  1 +
 qapi/migration.json   |  7 ++++-
 3 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index 87a9b59f83..8be3bd2b8c 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -56,6 +56,7 @@
 #include "net/announce.h"
 #include "qemu/queue.h"
 #include "multifd.h"
+#include "sysemu/cpus.h"
 
 #ifdef CONFIG_VFIO
 #include "hw/vfio/vfio-common.h"
@@ -134,6 +135,38 @@ enum mig_rp_message_type {
     MIG_RP_MSG_MAX
 };
 
+/* Migration capabilities set */
+struct MigrateCapsSet {
+    int size;                       /* Capability set size */
+    MigrationCapability caps[];     /* Variadic array of capabilities */
+};
+typedef struct MigrateCapsSet MigrateCapsSet;
+
+/* Define and initialize MigrateCapsSet */
+#define INITIALIZE_MIGRATE_CAPS_SET(_name, ...)   \
+    MigrateCapsSet _name = {    \
+        .size = sizeof((int []) { __VA_ARGS__ }) / sizeof(int), \
+        .caps = { __VA_ARGS__ } \
+    }
+
+/* Background-snapshot compatibility check list */
+static const
+INITIALIZE_MIGRATE_CAPS_SET(check_caps_background_snapshot,
+    MIGRATION_CAPABILITY_POSTCOPY_RAM,
+    MIGRATION_CAPABILITY_DIRTY_BITMAPS,
+    MIGRATION_CAPABILITY_POSTCOPY_BLOCKTIME,
+    MIGRATION_CAPABILITY_LATE_BLOCK_ACTIVATE,
+    MIGRATION_CAPABILITY_RETURN_PATH,
+    MIGRATION_CAPABILITY_MULTIFD,
+    MIGRATION_CAPABILITY_PAUSE_BEFORE_SWITCHOVER,
+    MIGRATION_CAPABILITY_AUTO_CONVERGE,
+    MIGRATION_CAPABILITY_RELEASE_RAM,
+    MIGRATION_CAPABILITY_RDMA_PIN_ALL,
+    MIGRATION_CAPABILITY_COMPRESS,
+    MIGRATION_CAPABILITY_XBZRLE,
+    MIGRATION_CAPABILITY_X_COLO,
+    MIGRATION_CAPABILITY_VALIDATE_UUID);
+
 /* When we add fault tolerance, we could have several
    migrations at once.  For now we don't need to add
    dynamic creation of migration */
@@ -1165,6 +1198,29 @@ static bool migrate_caps_check(bool *cap_list,
         }
     }
 
+    if (cap_list[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT]) {
+#ifdef CONFIG_LINUX
+        int idx;
+        /*
+         * Check if there are any migration capabilities
+         * incompatible with 'background-snapshot'.
+         */
+        for (idx = 0; idx < check_caps_background_snapshot.size; idx++) {
+            int incomp_cap = check_caps_background_snapshot.caps[idx];
+            if (cap_list[incomp_cap]) {
+                error_setg(errp,
+                        "Background-snapshot is not compatible with %s",
+                        MigrationCapability_str(incomp_cap));
+                return false;
+            }
+        }
+#else
+        error_setg(errp,
+                "Background-snapshot is not supported on non-Linux hosts");
+        return false;
+#endif
+    }
+
     return true;
 }
 
@@ -2490,6 +2546,15 @@ bool migrate_use_block_incremental(void)
     return s->parameters.block_incremental;
 }
 
+bool migrate_background_snapshot(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT];
+}
+
 /* migration thread support */
 /*
  * Something bad happened to the RP stream, mark an error
@@ -3783,6 +3848,8 @@ static Property migration_properties[] = {
     DEFINE_PROP_MIG_CAP("x-block", MIGRATION_CAPABILITY_BLOCK),
     DEFINE_PROP_MIG_CAP("x-return-path", MIGRATION_CAPABILITY_RETURN_PATH),
     DEFINE_PROP_MIG_CAP("x-multifd", MIGRATION_CAPABILITY_MULTIFD),
+    DEFINE_PROP_MIG_CAP("x-background-snapshot",
+            MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT),
 
     DEFINE_PROP_END_OF_LIST(),
 };
diff --git a/migration/migration.h b/migration/migration.h
index d096b77f74..f40338cfbf 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -341,6 +341,7 @@ int migrate_compress_wait_thread(void);
 int migrate_decompress_threads(void);
 bool migrate_use_events(void);
 bool migrate_postcopy_blocktime(void);
+bool migrate_background_snapshot(void);
 
 /* Sending on the return path - generic and then for each message type */
 void migrate_send_rp_shut(MigrationIncomingState *mis,
diff --git a/qapi/migration.json b/qapi/migration.json
index 3c75820527..6291143678 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -442,6 +442,11 @@
 # @validate-uuid: Send the UUID of the source to allow the destination
 #                 to ensure it is the same. (since 4.2)
 #
+# @background-snapshot: If enabled, the migration stream will be a snapshot
+#                       of the VM exactly at the point when the migration
+#                       procedure starts. VM RAM is saved while the VM
+#                       is running. (since 6.0)
+#
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
@@ -449,7 +454,7 @@
            'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
            'block', 'return-path', 'pause-before-switchover', 'multifd',
            'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate',
-           'x-ignore-shared', 'validate-uuid' ] }
+           'x-ignore-shared', 'validate-uuid', 'background-snapshot'] }
 
 ##
 # @MigrationCapabilityStatus:
-- 
2.25.1




* [PATCH v4 2/6] introduce UFFD-WP low-level interface helpers
  2020-11-26 15:17 [PATCH v4 0/6] UFFD write-tracking migration/snapshots Andrey Gruzdev via
  2020-11-26 15:17 ` [PATCH v4 1/6] introduce 'background-snapshot' migration capability Andrey Gruzdev via
@ 2020-11-26 15:17 ` Andrey Gruzdev via
  2020-11-27 21:04   ` Peter Xu
  2020-12-01 12:24   ` Dr. David Alan Gilbert
  2020-11-26 15:17 ` [PATCH v4 3/6] support UFFD write fault processing in ram_save_iterate() Andrey Gruzdev via
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 45+ messages in thread
From: Andrey Gruzdev via @ 2020-11-26 15:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

Implemented support for write protection/un-protection of whole RAM
block memory. Introduced the higher-level ram_write_tracking_start()
and ram_write_tracking_stop() helpers to start/stop tracking guest
memory writes.

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 include/exec/memory.h      |   7 ++
 include/qemu/userfaultfd.h |  29 +++++
 migration/ram.c            | 120 +++++++++++++++++++++
 migration/ram.h            |   4 +
 util/meson.build           |   1 +
 util/userfaultfd.c         | 215 +++++++++++++++++++++++++++++++++++++
 6 files changed, 376 insertions(+)
 create mode 100644 include/qemu/userfaultfd.h
 create mode 100644 util/userfaultfd.c

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 0f3e6bcd5e..3d798fce16 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -139,6 +139,13 @@ typedef struct IOMMUNotifier IOMMUNotifier;
 /* RAM is a persistent kind memory */
 #define RAM_PMEM (1 << 5)
 
+/*
+ * UFFDIO_WRITEPROTECT is used on this RAMBlock to
+ * support 'write-tracking' migration type.
+ * Implies ram_state->ram_wt_enabled.
+ */
+#define RAM_UF_WRITEPROTECT (1 << 6)
+
 static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
                                        IOMMUNotifierFlag flags,
                                        hwaddr start, hwaddr end,
diff --git a/include/qemu/userfaultfd.h b/include/qemu/userfaultfd.h
new file mode 100644
index 0000000000..fb843c76db
--- /dev/null
+++ b/include/qemu/userfaultfd.h
@@ -0,0 +1,29 @@
+/*
+ * Linux UFFD-WP support
+ *
+ * Copyright Virtuozzo GmbH, 2020
+ *
+ * Authors:
+ *  Andrey Gruzdev   <andrey.gruzdev@virtuozzo.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#ifndef USERFAULTFD_H
+#define USERFAULTFD_H
+
+#include "qemu/osdep.h"
+#include "exec/hwaddr.h"
+#include <linux/userfaultfd.h>
+
+int uffd_create_fd(void);
+void uffd_close_fd(int uffd);
+int uffd_register_memory(int uffd, hwaddr start, hwaddr length,
+        bool track_missing, bool track_wp);
+int uffd_unregister_memory(int uffd, hwaddr start, hwaddr length);
+int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp);
+int uffd_read_events(int uffd, struct uffd_msg *msgs, int count);
+bool uffd_poll_events(int uffd, int tmo);
+
+#endif /* USERFAULTFD_H */
diff --git a/migration/ram.c b/migration/ram.c
index 7811cde643..3adfd1948d 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -56,6 +56,11 @@
 #include "savevm.h"
 #include "qemu/iov.h"
 #include "multifd.h"
+#include "sysemu/runstate.h"
+
+#ifdef CONFIG_LINUX
+#include "qemu/userfaultfd.h"
+#endif
 
 /***********************************************************/
 /* ram save/restore */
@@ -298,6 +303,8 @@ struct RAMSrcPageRequest {
 struct RAMState {
     /* QEMUFile used for this migration */
     QEMUFile *f;
+    /* UFFD file descriptor, used in 'write-tracking' migration */
+    int uffdio_fd;
     /* Last block that we have visited searching for dirty pages */
     RAMBlock *last_seen_block;
     /* Last block from where we have sent data */
@@ -3788,6 +3795,119 @@ static int ram_resume_prepare(MigrationState *s, void *opaque)
     return 0;
 }
 
+/*
+ * ram_write_tracking_start: start UFFD-WP memory tracking
+ *
+ * Returns 0 for success or negative value in case of error
+ *
+ */
+int ram_write_tracking_start(void)
+{
+#ifdef CONFIG_LINUX
+    int uffd;
+    RAMState *rs = ram_state;
+    RAMBlock *bs;
+
+    /* Open UFFD file descriptor */
+    uffd = uffd_create_fd();
+    if (uffd < 0) {
+        return uffd;
+    }
+    rs->uffdio_fd = uffd;
+
+    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
+        /* Nothing to do with read-only and MMIO-writable regions */
+        if (bs->mr->readonly || bs->mr->rom_device) {
+            continue;
+        }
+
+        bs->flags |= RAM_UF_WRITEPROTECT;
+        /* Register block memory with UFFD to track writes */
+        if (uffd_register_memory(rs->uffdio_fd, (hwaddr) bs->host,
+                bs->max_length, false, true)) {
+            goto fail;
+        }
+        /* Apply UFFD write protection to the block memory range */
+        if (uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host,
+                bs->max_length, true)) {
+            goto fail;
+        }
+
+        info_report("UFFD-WP write-tracking enabled: "
+                "block_id=%s page_size=%zu start=%p length=%lu "
+                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
+                bs->idstr, bs->page_size, bs->host, bs->max_length,
+                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
+                bs->mr->nonvolatile, bs->mr->rom_device);
+    }
+
+    return 0;
+
+fail:
+    error_report("ram_write_tracking_start() failed: restoring initial memory state");
+
+    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
+        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
+            continue;
+        }
+        /*
+         * In case some memory block failed to be write-protected
+         * remove protection and unregister all succeeded RAM blocks
+         */
+        uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host, bs->max_length, false);
+        uffd_unregister_memory(rs->uffdio_fd, (hwaddr) bs->host, bs->max_length);
+        /* Cleanup flags */
+        bs->flags &= ~RAM_UF_WRITEPROTECT;
+    }
+
+    uffd_close_fd(uffd);
+    rs->uffdio_fd = -1;
+    return -1;
+#else
+    rs->uffdio_fd = -1;
+    error_setg(&migrate_get_current()->error,
+            "Background-snapshot not supported on non-Linux hosts");
+    return -1;
+#endif /* CONFIG_LINUX */
+}
+
+/**
+ * ram_write_tracking_stop: stop UFFD-WP memory tracking and remove protection
+ */
+void ram_write_tracking_stop(void)
+{
+#ifdef CONFIG_LINUX
+    RAMState *rs = ram_state;
+    RAMBlock *bs;
+    assert(rs->uffdio_fd >= 0);
+
+    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
+        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
+            continue;
+        }
+        /* Remove protection and unregister all affected RAM blocks */
+        uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host, bs->max_length, false);
+        uffd_unregister_memory(rs->uffdio_fd, (hwaddr) bs->host, bs->max_length);
+        /* Cleanup flags */
+        bs->flags &= ~RAM_UF_WRITEPROTECT;
+
+        info_report("UFFD-WP write-tracking disabled: "
+                "block_id=%s page_size=%zu start=%p length=%lu "
+                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
+                bs->idstr, bs->page_size, bs->host, bs->max_length,
+                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
+                bs->mr->nonvolatile, bs->mr->rom_device);
+    }
+
+    /* Finally close UFFD file descriptor */
+    uffd_close_fd(rs->uffdio_fd);
+    rs->uffdio_fd = -1;
+#else
+    error_setg(&migrate_get_current()->error,
+            "Background-snapshot not supported on non-Linux hosts");
+#endif /* CONFIG_LINUX */
+}
+
 static SaveVMHandlers savevm_ram_handlers = {
     .save_setup = ram_save_setup,
     .save_live_iterate = ram_save_iterate,
diff --git a/migration/ram.h b/migration/ram.h
index 011e85414e..0ec63e27ee 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -79,4 +79,8 @@ void colo_flush_ram_cache(void);
 void colo_release_ram_cache(void);
 void colo_incoming_start_dirty_log(void);
 
+/* Background snapshots */
+int ram_write_tracking_start(void);
+void ram_write_tracking_stop(void);
+
 #endif
diff --git a/util/meson.build b/util/meson.build
index f359af0d46..c64bfe94b3 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -50,6 +50,7 @@ endif
 
 if have_system
   util_ss.add(when: 'CONFIG_GIO', if_true: [files('dbus.c'), gio])
+  util_ss.add(when: 'CONFIG_LINUX', if_true: files('userfaultfd.c'))
 endif
 
 if have_block
diff --git a/util/userfaultfd.c b/util/userfaultfd.c
new file mode 100644
index 0000000000..038953d7ed
--- /dev/null
+++ b/util/userfaultfd.c
@@ -0,0 +1,215 @@
+/*
+ * Linux UFFD-WP support
+ *
+ * Copyright Virtuozzo GmbH, 2020
+ *
+ * Authors:
+ *  Andrey Gruzdev   <andrey.gruzdev@virtuozzo.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/bitops.h"
+#include "qemu/error-report.h"
+#include "qemu/userfaultfd.h"
+#include <poll.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+
+/**
+ * uffd_create_fd: create UFFD file descriptor
+ *
+ * Returns non-negative file descriptor or negative value in case of an error
+ */
+int uffd_create_fd(void)
+{
+    int uffd;
+    struct uffdio_api api_struct;
+    uint64_t ioctl_mask = BIT(_UFFDIO_REGISTER) | BIT(_UFFDIO_UNREGISTER);
+
+    uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+    if (uffd < 0) {
+        error_report("uffd_create_fd() failed: UFFD not supported");
+        return -1;
+    }
+
+    api_struct.api = UFFD_API;
+    api_struct.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
+    if (ioctl(uffd, UFFDIO_API, &api_struct)) {
+        error_report("uffd_create_fd() failed: "
+                     "API version not supported version=%llx errno=%i",
+                api_struct.api, errno);
+        goto fail;
+    }
+
+    if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) {
+        error_report("uffd_create_fd() failed: "
+                     "PAGEFAULT_FLAG_WP feature missing");
+        goto fail;
+    }
+
+    return uffd;
+
+fail:
+    close(uffd);
+    return -1;
+}
+
+/**
+ * uffd_close_fd: close UFFD file descriptor
+ *
+ * @uffd: UFFD file descriptor
+ */
+void uffd_close_fd(int uffd)
+{
+    assert(uffd >= 0);
+    close(uffd);
+}
+
+/**
+ * uffd_register_memory: register memory range with UFFD
+ *
+ * Returns 0 in case of success, negative value on error
+ *
+ * @uffd: UFFD file descriptor
+ * @start: starting virtual address of memory range
+ * @length: length of memory range
+ * @track_missing: generate events on missing-page faults
+ * @track_wp: generate events on write-protected-page faults
+ */
+int uffd_register_memory(int uffd, hwaddr start, hwaddr length,
+        bool track_missing, bool track_wp)
+{
+    struct uffdio_register uffd_register;
+
+    uffd_register.range.start = start;
+    uffd_register.range.len = length;
+    uffd_register.mode = (track_missing ? UFFDIO_REGISTER_MODE_MISSING : 0) |
+                         (track_wp ? UFFDIO_REGISTER_MODE_WP : 0);
+
+    if (ioctl(uffd, UFFDIO_REGISTER, &uffd_register)) {
+        error_report("uffd_register_memory() failed: "
+                     "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
+                start, length, uffd_register.mode, errno);
+        return -1;
+    }
+
+    return 0;
+}
+
+/**
+ * uffd_unregister_memory: un-register memory range with UFFD
+ *
+ * Returns 0 in case of success, negative value on error
+ *
+ * @uffd: UFFD file descriptor
+ * @start: starting virtual address of memory range
+ * @length: length of memory range
+ */
+int uffd_unregister_memory(int uffd, hwaddr start, hwaddr length)
+{
+    struct uffdio_range uffd_range;
+
+    uffd_range.start = start;
+    uffd_range.len = length;
+
+    if (ioctl(uffd, UFFDIO_UNREGISTER, &uffd_range)) {
+        error_report("uffd_unregister_memory() failed: "
+                     "start=%0"PRIx64" len=%"PRIu64" errno=%i",
+                start, length, errno);
+        return -1;
+    }
+
+    return 0;
+}
+
+/**
+ * uffd_protect_memory: protect/unprotect memory range for writes with UFFD
+ *
+ * Returns 0 on success or negative value in case of error
+ *
+ * @uffd: UFFD file descriptor
+ * @start: starting virtual address of memory range
+ * @length: length of memory range
+ * @wp: write-protect/unprotect
+ */
+int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp)
+{
+    struct uffdio_writeprotect uffd_writeprotect;
+    int res;
+
+    uffd_writeprotect.range.start = start;
+    uffd_writeprotect.range.len = length;
+    uffd_writeprotect.mode = (wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0);
+
+    do {
+        res = ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect);
+    } while (res < 0 && errno == EINTR);
+    if (res < 0) {
+        error_report("uffd_protect_memory() failed: "
+                     "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
+                start, length, uffd_writeprotect.mode, errno);
+        return -1;
+    }
+
+    return 0;
+}
+
+/**
+ * uffd_read_events: read pending UFFD events
+ *
+ * Returns the number of fetched messages, 0 if none are available, or
+ * negative value in case of an error
+ *
+ * @uffd: UFFD file descriptor
+ * @msgs: pointer to message buffer
+ * @count: number of messages that can fit in the buffer
+ */
+int uffd_read_events(int uffd, struct uffd_msg *msgs, int count)
+{
+    ssize_t res;
+    do {
+        res = read(uffd, msgs, count * sizeof(struct uffd_msg));
+    } while (res < 0 && errno == EINTR);
+
+    if ((res < 0 && errno == EAGAIN)) {
+        return 0;
+    }
+    if (res < 0) {
+        error_report("uffd_read_events() failed: errno=%i", errno);
+        return -1;
+    }
+
+    return (int) (res / sizeof(struct uffd_msg));
+}
+
+/**
+ * uffd_poll_events: poll UFFD file descriptor for read
+ *
+ * Returns true if events are available for read, false otherwise
+ *
+ * @uffd: UFFD file descriptor
+ * @tmo: timeout in milliseconds, 0 for non-blocking operation,
+ *       negative value for infinite wait
+ */
+bool uffd_poll_events(int uffd, int tmo)
+{
+    int res;
+    struct pollfd poll_fd = { .fd = uffd, .events = POLLIN, .revents = 0 };
+
+    do {
+        res = poll(&poll_fd, 1, tmo);
+    } while (res < 0 && errno == EINTR);
+
+    if (res == 0) {
+        return false;
+    }
+    if (res < 0) {
+        error_report("uffd_poll_events() failed: errno=%i", errno);
+        return false;
+    }
+
+    return (poll_fd.revents & POLLIN) != 0;
+}
-- 
2.25.1




* [PATCH v4 3/6] support UFFD write fault processing in ram_save_iterate()
  2020-11-26 15:17 [PATCH v4 0/6] UFFD write-tracking migration/snapshots Andrey Gruzdev via
  2020-11-26 15:17 ` [PATCH v4 1/6] introduce 'background-snapshot' migration capability Andrey Gruzdev via
  2020-11-26 15:17 ` [PATCH v4 2/6] introduce UFFD-WP low-level interface helpers Andrey Gruzdev via
@ 2020-11-26 15:17 ` Andrey Gruzdev via
  2020-11-27 21:49   ` Peter Xu
  2020-11-26 15:17 ` [PATCH v4 4/6] implementation of background snapshot thread Andrey Gruzdev via
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 45+ messages in thread
From: Andrey Gruzdev via @ 2020-11-26 15:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

In this particular implementation the same single migration
thread is responsible for both normal linear dirty page
migration and processing UFFD page fault events.

Processing write faults includes reading the UFFD file descriptor,
finding the respective RAM block and saving the faulting page to
the migration stream. After the page has been saved, write protection
can be removed. Since the asynchronous version of qemu_put_buffer()
is expected to be used to save pages, we also have to flush the
migration stream prior to un-protecting the saved memory range.

Write protection is removed for any previously protected
memory chunk that has hit the migration stream. That holds
for pages from the linear page scan as well as for write-fault pages.

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 migration/ram.c | 155 +++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 147 insertions(+), 8 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 3adfd1948d..bcdccdaef7 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1441,6 +1441,76 @@ static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset)
     return block;
 }
 
+#ifdef CONFIG_LINUX
+/**
+ * ram_find_block_by_host_address: find RAM block containing host page
+ *
+ * Returns pointer to RAMBlock if found, NULL otherwise
+ *
+ * @rs: current RAM state
+ * @page_address: host page address
+ */
+static RAMBlock *ram_find_block_by_host_address(RAMState *rs, hwaddr page_address)
+{
+    RAMBlock *bs = rs->last_seen_block;
+
+    do {
+        if (page_address >= (hwaddr) bs->host && (page_address + TARGET_PAGE_SIZE) <=
+                ((hwaddr) bs->host + bs->max_length)) {
+            return bs;
+        }
+
+        bs = QLIST_NEXT_RCU(bs, next);
+        if (!bs) {
+            /* Hit the end of the list */
+            bs = QLIST_FIRST_RCU(&ram_list.blocks);
+        }
+    } while (bs != rs->last_seen_block);
+
+    return NULL;
+}
+
+/**
+ * poll_fault_page: try to get next UFFD write fault page and, if pending fault
+ *   is found, return RAM block pointer and page offset
+ *
+ * Returns pointer to the RAMBlock containing faulting page,
+ *   NULL if no write faults are pending
+ *
+ * @rs: current RAM state
+ * @offset: page offset from the beginning of the block
+ */
+static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset)
+{
+    struct uffd_msg uffd_msg;
+    hwaddr page_address;
+    RAMBlock *bs;
+    int res;
+
+    if (!migrate_background_snapshot()) {
+        return NULL;
+    }
+
+    res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1);
+    if (res <= 0) {
+        return NULL;
+    }
+
+    page_address = uffd_msg.arg.pagefault.address;
+    bs = ram_find_block_by_host_address(rs, page_address);
+    if (!bs) {
+        /* If we couldn't find the respective block, just unprotect the faulting page. */
+        uffd_protect_memory(rs->uffdio_fd, page_address, TARGET_PAGE_SIZE, false);
+        error_report("ram_find_block_by_host_address() failed: address=0x%0lx",
+                page_address);
+        return NULL;
+    }
+
+    *offset = (ram_addr_t) (page_address - (hwaddr) bs->host);
+    return bs;
+}
+#endif /* CONFIG_LINUX */
+
 /**
  * get_queued_page: unqueue a page from the postcopy requests
  *
@@ -1480,6 +1550,16 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
 
     } while (block && !dirty);
 
+#ifdef CONFIG_LINUX
+    if (!block) {
+        /*
+         * Poll write faults too if background snapshot is enabled; that's
+         * when vCPUs may have been blocked by write-protected pages.
+         */
+        block = poll_fault_page(rs, &offset);
+    }
+#endif /* CONFIG_LINUX */
+
     if (block) {
         /*
          * As soon as we start servicing pages out of order, then we have
@@ -1753,6 +1833,55 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
     return pages;
 }
 
+/**
+ * ram_save_host_page_pre: ram_save_host_page() pre-notifier
+ *
+ * @rs: current RAM state
+ * @pss: page-search-status structure
+ * @opaque: pointer to receive opaque context value
+ */
+static inline
+void ram_save_host_page_pre(RAMState *rs, PageSearchStatus *pss, void **opaque)
+{
+    *(ram_addr_t *) opaque = pss->page;
+}
+
+/**
+ * ram_save_host_page_post: ram_save_host_page() post-notifier
+ *
+ * @rs: current RAM state
+ * @pss: page-search-status structure
+ * @opaque: opaque context value
+ * @res_override: pointer to the return value of ram_save_host_page(),
+ *   overwritten in case of an error
+ */
+static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss,
+        void *opaque, int *res_override)
+{
+    /* Check if page is from UFFD-managed region. */
+    if (pss->block->flags & RAM_UF_WRITEPROTECT) {
+#ifdef CONFIG_LINUX
+        ram_addr_t page_from = (ram_addr_t) opaque;
+        hwaddr page_address = (hwaddr) pss->block->host +
+                              ((hwaddr) page_from << TARGET_PAGE_BITS);
+        hwaddr run_length = (hwaddr) (pss->page - page_from + 1) << TARGET_PAGE_BITS;
+        int res;
+
+        /* Flush async buffers before un-protect. */
+        qemu_fflush(rs->f);
+        /* Un-protect memory range. */
+        res = uffd_protect_memory(rs->uffdio_fd, page_address, run_length, false);
+        /* We don't want to override existing error from ram_save_host_page(). */
+        if (res < 0 && *res_override >= 0) {
+            *res_override = res;
+        }
+#else
+        /* Should never happen */
+        qemu_file_set_error(rs->f, -ENOSYS);
+#endif /* CONFIG_LINUX */
+    }
+}
+
 /**
  * ram_find_and_save_block: finds a dirty page and sends it to f
  *
@@ -1779,14 +1908,14 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
         return pages;
     }
 
+    if (!rs->last_seen_block) {
+        rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks);
+    }
+
     pss.block = rs->last_seen_block;
     pss.page = rs->last_page;
     pss.complete_round = false;
 
-    if (!pss.block) {
-        pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
-    }
-
     do {
         again = true;
         found = get_queued_page(rs, &pss);
@@ -1797,7 +1926,11 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
         }
 
         if (found) {
+            void *opaque;
+
+            ram_save_host_page_pre(rs, &pss, &opaque);
             pages = ram_save_host_page(rs, &pss, last_stage);
+            ram_save_host_page_post(rs, &pss, opaque, &pages);
         }
     } while (!pages && again);
 
@@ -3864,9 +3997,12 @@ fail:
     rs->uffdio_fd = -1;
     return -1;
 #else
+    /*
+     * Should never happen since we prohibit 'background-snapshot'
+     * capability on non-Linux hosts.
+     */
     rs->uffdio_fd = -1;
-    error_setg(&migrate_get_current()->error,
-            "Background-snapshot not supported on non-Linux hosts");
+    error_setg(&migrate_get_current()->error, QERR_UNDEFINED_ERROR);
     return -1;
 #endif /* CONFIG_LINUX */
 }
@@ -3903,8 +4039,11 @@ void ram_write_tracking_stop(void)
     uffd_close_fd(rs->uffdio_fd);
     rs->uffdio_fd = -1;
 #else
-    error_setg(&migrate_get_current()->error,
-            "Background-snapshot not supported on non-Linux hosts");
+    /*
+     * Should never happen since we prohibit 'background-snapshot'
+     * capability on non-Linux hosts.
+     */
+    error_setg(&migrate_get_current()->error, QERR_UNDEFINED_ERROR);
 #endif /* CONFIG_LINUX */
 }
 
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v4 4/6] implementation of background snapshot thread
  2020-11-26 15:17 [PATCH v4 0/6] UFFD write-tracking migration/snapshots Andrey Gruzdev via
                   ` (2 preceding siblings ...)
  2020-11-26 15:17 ` [PATCH v4 3/6] support UFFD write fault processing in ram_save_iterate() Andrey Gruzdev via
@ 2020-11-26 15:17 ` Andrey Gruzdev via
  2020-11-26 15:17 ` [PATCH v4 5/6] the rest of write tracking migration code Andrey Gruzdev via
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Andrey Gruzdev via @ 2020-11-26 15:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

Introduce the implementation of a 'background' snapshot thread,
which overall follows the logic of precopy migration while
internally utilizing a completely different mechanism to 'freeze'
the vmstate at the start of snapshot creation.

This mechanism is based on userfaultfd with write-protection
support and is Linux-specific.
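
The resulting stream layout (header, then RAM saved while the guest runs,
then the device state stashed into a temporary buffer at the stop point) can
be modeled with a toy sketch; all names and byte strings below are
illustrative only, not QEMU APIs:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Toy model of the snapshot stream ordering used by the background
 * snapshot thread: device state is captured into a temporary buffer
 * while the VM is stopped, RAM is streamed afterwards with vCPUs
 * running, and the stashed device state is flushed last.
 */
static size_t build_stream(char *out, size_t out_size)
{
    char devbuf[32];
    size_t pos = 0;

    /* 1. Stash non-RAM (device) state while the machine is stopped. */
    size_t devlen = (size_t)snprintf(devbuf, sizeof(devbuf), "DEVSTATE");

    /* 2. Stream RAM content first (vCPUs already running again). */
    pos += (size_t)snprintf(out + pos, out_size - pos, "RAMPAGES");

    /* 3. Append the stashed device state after the RAM part. */
    memcpy(out + pos, devbuf, devlen + 1);
    pos += devlen;

    return pos;
}
```

The point of the ordering is that the device state, although captured first,
appears last in the stream, so RAM can be written out lazily.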

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 migration/migration.c | 177 +++++++++++++++++++++++++++++++++++++++++-
 migration/migration.h |   3 +
 migration/savevm.c    |   1 -
 migration/savevm.h    |   2 +
 4 files changed, 180 insertions(+), 3 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 8be3bd2b8c..c2efaf72f2 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1983,6 +1983,7 @@ void migrate_init(MigrationState *s)
      * locks.
      */
     s->cleanup_bh = 0;
+    s->vm_start_bh = 0;
     s->to_dst_file = NULL;
     s->state = MIGRATION_STATUS_NONE;
     s->rp_state.from_dst_file = NULL;
@@ -3521,6 +3522,21 @@ static void migration_iteration_finish(MigrationState *s)
     qemu_mutex_unlock_iothread();
 }
 
+static void bg_migration_iteration_finish(MigrationState *s)
+{
+    /* TODO: implement */
+}
+
+/*
+ * Return true if continue to the next iteration directly, false
+ * otherwise.
+ */
+static MigIterateState bg_migration_iteration_run(MigrationState *s)
+{
+    /* TODO: implement */
+    return MIG_ITERATE_RESUME;
+}
+
 void migration_make_urgent_request(void)
 {
     qemu_sem_post(&migrate_get_current()->rate_limit_sem);
@@ -3668,6 +3684,157 @@ static void *migration_thread(void *opaque)
     return NULL;
 }
 
+static void bg_migration_vm_start_bh(void *opaque)
+{
+    MigrationState *s = opaque;
+
+    qemu_bh_delete(s->vm_start_bh);
+    s->vm_start_bh = NULL;
+
+    vm_start();
+    s->downtime = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - s->downtime_start;
+}
+
+/**
+ * Background snapshot thread, based on live migration code.
+ * This is an alternative implementation of live migration mechanism
+ * introduced specifically to support background snapshots.
+ *
+ * It takes advantage of the userfaultfd write-protection mechanism
+ * introduced in the v5.7 kernel. Compared to the existing dirty page
+ * logging migration, much less stream traffic is produced, resulting in
+ * smaller snapshot images, simply because no page duplicates can get
+ * into the stream.
+ *
+ * Another key point is that the generated vmstate stream reflects the
+ * machine state 'frozen' at the beginning of snapshot creation, unlike
+ * the dirty page logging mechanism, where the saved snapshot effectively
+ * is the state of the VM at the end of the process.
+ */
+static void *bg_migration_thread(void *opaque)
+{
+    MigrationState *s = opaque;
+    int64_t setup_start;
+    MigThrError thr_error;
+    QEMUFile *fb;
+    bool early_fail = true;
+
+    rcu_register_thread();
+    object_ref(OBJECT(s));
+
+    qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
+
+    setup_start = qemu_clock_get_ms(QEMU_CLOCK_HOST);
+    /*
+     * We want to save the vmstate for the moment when migration has been
+     * initiated, but we also want to save RAM content while the VM is
+     * running. The RAM content should appear first in the vmstate.
+     * So, we first stash the non-RAM part of the vmstate to a temporary
+     * buffer, then write the RAM part of the vmstate to the migration
+     * stream with vCPUs running and, finally, write the stashed non-RAM
+     * part of the vmstate from the buffer to the migration stream.
+     */
+    s->bioc = qio_channel_buffer_new(128 * 1024);
+    qio_channel_set_name(QIO_CHANNEL(s->bioc), "vmstate-buffer");
+    fb = qemu_fopen_channel_output(QIO_CHANNEL(s->bioc));
+    object_unref(OBJECT(s->bioc));
+
+    update_iteration_initial_status(s);
+
+    qemu_savevm_state_header(s->to_dst_file);
+    qemu_savevm_state_setup(s->to_dst_file);
+
+    if (qemu_savevm_state_guest_unplug_pending()) {
+        migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
+                          MIGRATION_STATUS_WAIT_UNPLUG);
+
+        while (s->state == MIGRATION_STATUS_WAIT_UNPLUG &&
+               qemu_savevm_state_guest_unplug_pending()) {
+            qemu_sem_timedwait(&s->wait_unplug_sem, 250);
+        }
+
+        migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG,
+                          MIGRATION_STATUS_ACTIVE);
+    }
+    s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start;
+
+    migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
+                      MIGRATION_STATUS_ACTIVE);
+    trace_migration_thread_setup_complete();
+    s->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+
+    qemu_mutex_lock_iothread();
+
+    if (global_state_store()) {
+        goto fail;
+    }
+    /* Forcibly stop VM before saving state of vCPUs and devices */
+    if (vm_stop_force_state(RUN_STATE_PAUSED)) {
+        goto fail;
+    }
+    /*
+     * Put vCPUs in sync with shadow context structures, then
+     * save their state to channel-buffer along with devices.
+     */
+    cpu_synchronize_all_states();
+    if (qemu_savevm_state_complete_precopy_non_iterable(fb, false, false)) {
+        goto fail;
+    }
+    /* Now initialize UFFD context and start tracking RAM writes */
+    if (ram_write_tracking_start()) {
+        goto fail;
+    }
+    early_fail = false;
+
+    /*
+     * Start VM from BH handler to avoid write-fault lock here.
+     * UFFD-WP protection for the whole RAM is already enabled so
+     * calling VM state change notifiers from vm_start() would initiate
+     * writes to virtio VQs memory which is in write-protected region.
+     */
+    s->vm_start_bh = qemu_bh_new(bg_migration_vm_start_bh, s);
+    qemu_bh_schedule(s->vm_start_bh);
+
+    qemu_mutex_unlock_iothread();
+
+    while (migration_is_active(s)) {
+        MigIterateState iter_state = bg_migration_iteration_run(s);
+        if (iter_state == MIG_ITERATE_SKIP) {
+            continue;
+        } else if (iter_state == MIG_ITERATE_BREAK) {
+            break;
+        }
+
+        /*
+         * Try to detect any kind of failures, and see whether we
+         * should stop the migration now.
+         */
+        thr_error = migration_detect_error(s);
+        if (thr_error == MIG_THR_ERR_FATAL) {
+            /* Stop migration */
+            break;
+        }
+
+        migration_update_counters(s, qemu_clock_get_ms(QEMU_CLOCK_REALTIME));
+    }
+
+    trace_migration_thread_after_loop();
+
+fail:
+    if (early_fail) {
+        migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
+                MIGRATION_STATUS_FAILED);
+        qemu_mutex_unlock_iothread();
+    }
+
+    bg_migration_iteration_finish(s);
+
+    qemu_fclose(fb);
+    object_unref(OBJECT(s));
+    rcu_unregister_thread();
+
+    return NULL;
+}
+
 void migrate_fd_connect(MigrationState *s, Error *error_in)
 {
     Error *local_err = NULL;
@@ -3731,8 +3898,14 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
         migrate_fd_cleanup(s);
         return;
     }
-    qemu_thread_create(&s->thread, "live_migration", migration_thread, s,
-                       QEMU_THREAD_JOINABLE);
+
+    if (migrate_background_snapshot()) {
+        qemu_thread_create(&s->thread, "background_snapshot",
+                bg_migration_thread, s, QEMU_THREAD_JOINABLE);
+    } else {
+        qemu_thread_create(&s->thread, "live_migration",
+                migration_thread, s, QEMU_THREAD_JOINABLE);
+    }
     s->migration_thread_running = true;
 }
 
diff --git a/migration/migration.h b/migration/migration.h
index f40338cfbf..0723955cd7 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -20,6 +20,7 @@
 #include "qemu/thread.h"
 #include "qemu/coroutine_int.h"
 #include "io/channel.h"
+#include "io/channel-buffer.h"
 #include "net/announce.h"
 #include "qom/object.h"
 
@@ -147,8 +148,10 @@ struct MigrationState {
 
     /*< public >*/
     QemuThread thread;
+    QEMUBH *vm_start_bh;
     QEMUBH *cleanup_bh;
     QEMUFile *to_dst_file;
+    QIOChannelBuffer *bioc;
     /*
      * Protects to_dst_file pointer.  We need to make sure we won't
      * yield or hang during the critical section, since this lock will
diff --git a/migration/savevm.c b/migration/savevm.c
index 5f937a2762..62d5f8a869 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1352,7 +1352,6 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
     return 0;
 }
 
-static
 int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
                                                     bool in_postcopy,
                                                     bool inactivate_disks)
diff --git a/migration/savevm.h b/migration/savevm.h
index ba64a7e271..aaee2528ed 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -64,5 +64,7 @@ int qemu_loadvm_state(QEMUFile *f);
 void qemu_loadvm_state_cleanup(void);
 int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
 int qemu_load_device_state(QEMUFile *f);
+int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
+        bool in_postcopy, bool inactivate_disks);
 
 #endif
-- 
2.25.1




* [PATCH v4 5/6] the rest of write tracking migration code
  2020-11-26 15:17 [PATCH v4 0/6] UFFD write-tracking migration/snapshots Andrey Gruzdev via
                   ` (3 preceding siblings ...)
  2020-11-26 15:17 ` [PATCH v4 4/6] implementation of background snapshot thread Andrey Gruzdev via
@ 2020-11-26 15:17 ` Andrey Gruzdev via
  2020-11-27 22:26   ` Peter Xu
  2020-11-26 15:17 ` [PATCH v4 6/6] introduce simple linear scan rate limiting mechanism Andrey Gruzdev via
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 45+ messages in thread
From: Andrey Gruzdev via @ 2020-11-26 15:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

Implementation of the bg_migration iteration/completion/finish functions.

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 migration/migration.c | 74 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 72 insertions(+), 2 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index c2efaf72f2..ef1522761c 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3182,6 +3182,50 @@ fail:
                       MIGRATION_STATUS_FAILED);
 }
 
+/**
+ * bg_migration_completion: Used by bg_migration_thread after all the
+ *   RAM has been saved. The caller 'breaks' the loop when this returns.
+ *
+ * @s: Current migration state
+ */
+static void bg_migration_completion(MigrationState *s)
+{
+    int current_active_state = s->state;
+
+    /*
+     * Stop tracking RAM writes - un-protect memory, un-register UFFD
+     * memory ranges, flush kernel wait queues and wake up threads
+     * waiting for write fault to be resolved.
+     */
+    ram_write_tracking_stop();
+
+    if (s->state == MIGRATION_STATUS_ACTIVE) {
+        /*
+         * By this moment we have RAM content saved into the migration stream.
+         * The next step is to flush the non-RAM content (device state)
+         * right after the ram content. The device state has been stored into
+         * the temporary buffer before RAM saving started.
+         */
+        qemu_put_buffer(s->to_dst_file, s->bioc->data, s->bioc->usage);
+        qemu_fflush(s->to_dst_file);
+    } else if (s->state == MIGRATION_STATUS_CANCELLING) {
+        goto fail;
+    }
+
+    if (qemu_file_get_error(s->to_dst_file)) {
+        trace_migration_completion_file_err();
+        goto fail;
+    }
+
+    migrate_set_state(&s->state, current_active_state,
+                      MIGRATION_STATUS_COMPLETED);
+    return;
+
+fail:
+    migrate_set_state(&s->state, current_active_state,
+                      MIGRATION_STATUS_FAILED);
+}
+
 bool migrate_colo_enabled(void)
 {
     MigrationState *s = migrate_get_current();
@@ -3524,7 +3568,26 @@ static void migration_iteration_finish(MigrationState *s)
 
 static void bg_migration_iteration_finish(MigrationState *s)
 {
-    /* TODO: implement */
+    qemu_mutex_lock_iothread();
+    switch (s->state) {
+    case MIGRATION_STATUS_COMPLETED:
+        migration_calculate_complete(s);
+        break;
+
+    case MIGRATION_STATUS_ACTIVE:
+    case MIGRATION_STATUS_FAILED:
+    case MIGRATION_STATUS_CANCELLED:
+    case MIGRATION_STATUS_CANCELLING:
+        break;
+
+    default:
+        /* Should not reach here, but if so, forgive the VM. */
+        error_report("%s: Unknown ending state %d", __func__, s->state);
+        break;
+    }
+
+    migrate_fd_cleanup_schedule(s);
+    qemu_mutex_unlock_iothread();
 }
 
 /*
@@ -3533,7 +3596,14 @@ static void bg_migration_iteration_finish(MigrationState *s)
  */
 static MigIterateState bg_migration_iteration_run(MigrationState *s)
 {
-    /* TODO: implement */
+    int res;
+
+    res = qemu_savevm_state_iterate(s->to_dst_file, false);
+    if (res > 0) {
+        bg_migration_completion(s);
+        return MIG_ITERATE_BREAK;
+    }
+
     return MIG_ITERATE_RESUME;
 }
 
-- 
2.25.1




* [PATCH v4 6/6] introduce simple linear scan rate limiting mechanism
  2020-11-26 15:17 [PATCH v4 0/6] UFFD write-tracking migration/snapshots Andrey Gruzdev via
                   ` (4 preceding siblings ...)
  2020-11-26 15:17 ` [PATCH v4 5/6] the rest of write tracking migration code Andrey Gruzdev via
@ 2020-11-26 15:17 ` Andrey Gruzdev via
  2020-11-27 22:28   ` Peter Xu
  2020-11-26 15:47 ` [PATCH v4 0/6] UFFD write-tracking migration/snapshots Peter Krempa
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 45+ messages in thread
From: Andrey Gruzdev via @ 2020-11-26 15:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

Since reading UFFD events and saving page data are performed
from the same thread, write fault latencies are sensitive to
migration stream stalls. Limiting the total page saving rate is a
method to reduce the number of noticeable fault resolution latencies.

Migration bandwidth limiting is achieved by detecting cases of
out-of-threshold write fault latencies and temporarily disabling
(strictly speaking, severely throttling) the saving of non-faulting pages.
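
The throttling decision can be sketched as a stand-alone model; the struct
and function names here are hypothetical, simplified from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FAULT_HIGH_LATENCY_NS 5000000LL /* 5 ms */
#define SLOW_FAULT_SKIP_PAGES 200

/* Hypothetical stand-alone model of the rate-limiting state. */
typedef struct {
    int64_t last_fault_ns;     /* time of the last write fault, 0 if none */
    int throttle_skip_counter; /* pages left to skip in the linear scan */
} ThrottleState;

/*
 * Returns true when the caller should poll for write faults instead of
 * saving the next linearly scanned page.
 */
static bool throttle_step(ThrottleState *ts, int64_t now_ns)
{
    if (ts->last_fault_ns &&
        now_ns - ts->last_fault_ns > FAULT_HIGH_LATENCY_NS) {
        /* Slow fault observed: suppress the next N scanned pages. */
        ts->throttle_skip_counter = SLOW_FAULT_SKIP_PAGES;
    }
    ts->last_fault_ns = 0;

    if (ts->throttle_skip_counter > 0) {
        ts->throttle_skip_counter--;
        return true;
    }
    return false;
}
```

Each slow fault re-arms the counter, so sustained high latencies keep the
linear scan suppressed while fault-driven pages continue to flow.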

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 migration/ram.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 65 insertions(+), 2 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index bcdccdaef7..d5b50b7804 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -322,6 +322,10 @@ struct RAMState {
     /* these variables are used for bitmap sync */
     /* last time we did a full bitmap_sync */
     int64_t time_last_bitmap_sync;
+    /* last time a UFFD fault occurred */
+    int64_t last_fault_ns;
+    /* linear scan throttling counter */
+    int throttle_skip_counter;
     /* bytes transferred at start_time */
     uint64_t bytes_xfer_prev;
     /* number of dirty pages since start_time */
@@ -1506,6 +1510,8 @@ static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset)
         return NULL;
     }
 
+    rs->last_fault_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+
     *offset = (ram_addr_t) (page_address - (hwaddr) bs->host);
     return bs;
 }
@@ -1882,6 +1888,55 @@ static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss,
     }
 }
 
+#define FAULT_HIGH_LATENCY_NS   5000000     /* 5 ms */
+#define SLOW_FAULT_POLL_TMO     5           /* 5 ms */
+#define SLOW_FAULT_SKIP_PAGES   200
+
+/**
+ * limit_scan_rate: limit RAM linear scan rate in case of growing write fault
+ *   latencies, used in write-tracking migration implementation
+ *
+ * @rs: current RAM state
+ */
+static void limit_scan_rate(RAMState *rs)
+{
+    if (!migrate_background_snapshot()) {
+        return;
+    }
+
+#ifdef CONFIG_LINUX
+    int64_t last_fault_latency_ns = 0;
+
+    /* Check if last write fault time is available */
+    if (rs->last_fault_ns) {
+        last_fault_latency_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) -
+                rs->last_fault_ns;
+        rs->last_fault_ns = 0;
+    }
+
+    /*
+     * If the last fault time was available and we have a
+     * latency value, check whether it is too high
+     */
+    if (last_fault_latency_ns > FAULT_HIGH_LATENCY_NS) {
+        /* Reset counter after each slow write fault */
+        rs->throttle_skip_counter = SLOW_FAULT_SKIP_PAGES;
+    }
+    /*
+     * Delay thread execution until the next write fault occurs or the
+     * timeout expires. The next SLOW_FAULT_SKIP_PAGES saved pages can then
+     * come from write faults only, not from the linear scan logic. Thus we
+     * moderate the migration stream rate to reduce latencies.
+     */
+    if (rs->throttle_skip_counter > 0) {
+        uffd_poll_events(rs->uffdio_fd, SLOW_FAULT_POLL_TMO);
+        rs->throttle_skip_counter--;
+    }
+#else
+    /* Should never happen */
+    qemu_file_set_error(rs->f, -ENOSYS);
+#endif /* CONFIG_LINUX */
+}
+
 /**
  * ram_find_and_save_block: finds a dirty page and sends it to f
  *
@@ -1931,6 +1986,9 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
             ram_save_host_page_pre(rs, &pss, &opaque);
             pages = ram_save_host_page(rs, &pss, last_stage);
             ram_save_host_page_post(rs, &pss, opaque, &pages);
+
+            /* Linear scan rate limiting */
+            limit_scan_rate(rs);
         }
     } while (!pages && again);
 
@@ -2043,11 +2101,14 @@ static void ram_state_reset(RAMState *rs)
     rs->last_sent_block = NULL;
     rs->last_page = 0;
     rs->last_version = ram_list.version;
+    rs->last_fault_ns = 0;
+    rs->throttle_skip_counter = 0;
     rs->ram_bulk_stage = true;
     rs->fpo_enabled = false;
 }
 
-#define MAX_WAIT 50 /* ms, half buffered_file limit */
+#define MAX_WAIT    50      /* ms, half buffered_file limit */
+#define BG_MAX_WAIT 1000    /* 1000 ms, need bigger limit for background snapshot */
 
 /*
  * 'expected' is the value you expect the bitmap mostly to be full
@@ -2723,7 +2784,9 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
             if ((i & 63) == 0) {
                 uint64_t t1 = (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - t0) /
                               1000000;
-                if (t1 > MAX_WAIT) {
+                uint64_t max_wait = migrate_background_snapshot() ?
+                        BG_MAX_WAIT : MAX_WAIT;
+                if (t1 > max_wait) {
                     trace_ram_save_iterate_big_wait(t1, i);
                     break;
                 }
-- 
2.25.1




* Re: [PATCH v4 0/6] UFFD write-tracking migration/snapshots
  2020-11-26 15:17 [PATCH v4 0/6] UFFD write-tracking migration/snapshots Andrey Gruzdev via
                   ` (5 preceding siblings ...)
  2020-11-26 15:17 ` [PATCH v4 6/6] introduce simple linear scan rate limiting mechanism Andrey Gruzdev via
@ 2020-11-26 15:47 ` Peter Krempa
  2020-11-27  8:21   ` Andrey Gruzdev
  2020-11-27 22:04 ` Peter Xu
  2020-12-01  7:08 ` Peter Krempa
  8 siblings, 1 reply; 45+ messages in thread
From: Peter Krempa @ 2020-11-26 15:47 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel, Peter Xu,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
> implemented in his series '[PATCH v0 0/4] migration: add background snapshot'.
> 
> Currently the only way to make (external) live VM snapshot is using existing
> dirty page logging migration mechanism. The main problem is that it tends to
> produce a lot of page duplicates while running VM goes on updating already
> saved pages. That leads to the fact that vmstate image size is commonly several
> times bigger then non-zero part of virtual machine's RSS. Time required to
> converge RAM migration and the size of snapshot image severely depend on the
> guest memory write rate, sometimes resulting in unacceptably long snapshot
> creation time and huge image size.
> 
> This series propose a way to solve the aforementioned problems. This is done
> by using different RAM migration mechanism based on UFFD write protection
> management introduced in v5.7 kernel. The migration strategy is to 'freeze'
> guest RAM content using write-protection and iteratively release protection
> for memory ranges that have already been saved to the migration stream.
> At the same time we read in pending UFFD write fault events and save those
> pages out-of-order with higher priority.

This sounds amazing! Based on your description I assume that the final
memory image contains state image from the beginning of the migration.

This would make it desirable for the 'migrate' qmp command to be used as
part of a 'transaction' qmp command so that we can do an instant disk
and memory snapshot without any extraneous pausing of the VM.

I'll have a look at using this mechanism in libvirt natively.




* Re: [PATCH v4 0/6] UFFD write-tracking migration/snapshots
  2020-11-26 15:47 ` [PATCH v4 0/6] UFFD write-tracking migration/snapshots Peter Krempa
@ 2020-11-27  8:21   ` Andrey Gruzdev
  2020-11-27  9:49     ` Peter Krempa
  0 siblings, 1 reply; 45+ messages in thread
From: Andrey Gruzdev @ 2020-11-27  8:21 UTC (permalink / raw)
  To: Peter Krempa
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu

On 26.11.2020 18:47, Peter Krempa wrote:
> On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
>> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
>> implemented in his series '[PATCH v0 0/4] migration: add background snapshot'.
>>
>> Currently the only way to make (external) live VM snapshot is using existing
>> dirty page logging migration mechanism. The main problem is that it tends to
>> produce a lot of page duplicates while running VM goes on updating already
>> saved pages. That leads to the fact that vmstate image size is commonly several
>> times bigger then non-zero part of virtual machine's RSS. Time required to
>> converge RAM migration and the size of snapshot image severely depend on the
>> guest memory write rate, sometimes resulting in unacceptably long snapshot
>> creation time and huge image size.
>>
>> This series propose a way to solve the aforementioned problems. This is done
>> by using different RAM migration mechanism based on UFFD write protection
>> management introduced in v5.7 kernel. The migration strategy is to 'freeze'
>> guest RAM content using write-protection and iteratively release protection
>> for memory ranges that have already been saved to the migration stream.
>> At the same time we read in pending UFFD write fault events and save those
>> pages out-of-order with higher priority.
> 
> This sounds amazing! Based on your description I assume that the final
> memory image contains state image from the beginning of the migration.
> 
> This would make it desirable for the 'migrate' qmp command to be used as
> part of a 'transaction' qmp command so that we can do an instant disk
> and memory snapshot without any extraneous pausing of the VM.
> 
> I'll have a look at using this mechanism in libvirt natively.
> 

Correct, the final image contains state at the beginning of migration.

So far, if I'm not missing something about libvirt, for external 
snapshot creation it performs a sequence like that: 
migrate(fd)->transaction(blockdev-snapshot-all)->cont.

So, in case 'background-snapshot' capability is enabled, the sequence 
would change to:
stop->transaction(blockdev-snapshot-all)->migrate(fd).
With background snapshot migration it will finish with the VM running, so
there's no need to 'cont' here.

Thanks, I'd appreciate a lot if you look how to integrate the new 
mechanism to libvirt!

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com



* Re: [PATCH v4 0/6] UFFD write-tracking migration/snapshots
  2020-11-27  8:21   ` Andrey Gruzdev
@ 2020-11-27  9:49     ` Peter Krempa
  2020-11-27 10:00       ` Andrey Gruzdev
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Krempa @ 2020-11-27  9:49 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel, Peter Xu,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Fri, Nov 27, 2020 at 11:21:39 +0300, Andrey Gruzdev wrote:
> On 26.11.2020 18:47, Peter Krempa wrote:
> > On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
> > > This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
> > > implemented in his series '[PATCH v0 0/4] migration: add background snapshot'.
> > > 
> > > Currently the only way to make (external) live VM snapshot is using existing
> > > dirty page logging migration mechanism. The main problem is that it tends to
> > > produce a lot of page duplicates while running VM goes on updating already
> > > saved pages. That leads to the fact that vmstate image size is commonly several
> > > times bigger then non-zero part of virtual machine's RSS. Time required to
> > > converge RAM migration and the size of snapshot image severely depend on the
> > > guest memory write rate, sometimes resulting in unacceptably long snapshot
> > > creation time and huge image size.
> > > 
> > > This series propose a way to solve the aforementioned problems. This is done
> > > by using different RAM migration mechanism based on UFFD write protection
> > > management introduced in v5.7 kernel. The migration strategy is to 'freeze'
> > > guest RAM content using write-protection and iteratively release protection
> > > for memory ranges that have already been saved to the migration stream.
> > > At the same time we read in pending UFFD write fault events and save those
> > > pages out-of-order with higher priority.
> > 
> > This sounds amazing! Based on your description I assume that the final
> > memory image contains state image from the beginning of the migration.
> > 
> > This would make it desirable for the 'migrate' qmp command to be used as
> > part of a 'transaction' qmp command so that we can do an instant disk
> > and memory snapshot without any extraneous pausing of the VM.
> > 
> > I'll have a look at using this mechanism in libvirt natively.
> > 
> 
> Correct, the final image contains state at the beginning of migration.
> 
> So far, if I'm not missing something about libvirt, for external snapshot
> creation it performs a sequence like that:
> migrate(fd)->transaction(blockdev-snapshot-all)->cont.
> 
> So, in case 'background-snapshot' capability is enabled, the sequence would
> change to:
> stop->transaction(blockdev-snapshot-all)->migrate(fd).
> With background snapshot migration it will finish with VM running so there's
> not need to 'cont' here.

Yes, that's correct.

The reason I've suggested that 'migrate' being part of a 'transaction'
is that it would remove the need to stop it for the disk snapshot part. 




* Re: [PATCH v4 0/6] UFFD write-tracking migration/snapshots
  2020-11-27  9:49     ` Peter Krempa
@ 2020-11-27 10:00       ` Andrey Gruzdev
  2020-11-27 15:45         ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Andrey Gruzdev @ 2020-11-27 10:00 UTC (permalink / raw)
  To: Peter Krempa
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu

On 27.11.2020 12:49, Peter Krempa wrote:
> On Fri, Nov 27, 2020 at 11:21:39 +0300, Andrey Gruzdev wrote:
>> On 26.11.2020 18:47, Peter Krempa wrote:
>>> On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
>>>> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
>>>> implemented in his series '[PATCH v0 0/4] migration: add background snapshot'.
>>>>
>>>> Currently the only way to make (external) live VM snapshot is using existing
>>>> dirty page logging migration mechanism. The main problem is that it tends to
>>>> produce a lot of page duplicates while running VM goes on updating already
>>>> saved pages. That leads to the fact that vmstate image size is commonly several
>>>> times bigger then non-zero part of virtual machine's RSS. Time required to
>>>> converge RAM migration and the size of snapshot image severely depend on the
>>>> guest memory write rate, sometimes resulting in unacceptably long snapshot
>>>> creation time and huge image size.
>>>>
>>>> This series propose a way to solve the aforementioned problems. This is done
>>>> by using different RAM migration mechanism based on UFFD write protection
>>>> management introduced in v5.7 kernel. The migration strategy is to 'freeze'
>>>> guest RAM content using write-protection and iteratively release protection
>>>> for memory ranges that have already been saved to the migration stream.
>>>> At the same time we read in pending UFFD write fault events and save those
>>>> pages out-of-order with higher priority.
>>>
>>> This sounds amazing! Based on your description I assume that the final
>>> memory image contains state image from the beginning of the migration.
>>>
>>> This would make it desirable for the 'migrate' qmp command to be used as
>>> part of a 'transaction' qmp command so that we can do an instant disk
>>> and memory snapshot without any extraneous pausing of the VM.
>>>
>>> I'll have a look at using this mechanism in libvirt natively.
>>>
>>
>> Correct, the final image contains the state at the beginning of migration.
>>
>> So far, if I'm not missing something about libvirt, for external snapshot
>> creation it performs a sequence like this:
>> migrate(fd)->transaction(blockdev-snapshot-all)->cont.
>>
>> So, in case the 'background-snapshot' capability is enabled, the sequence
>> would change to:
>> stop->transaction(blockdev-snapshot-all)->migrate(fd).
>> With background snapshot migration it will finish with the VM running, so
>> there's no need to 'cont' here.
> 
> Yes, that's correct.
> 
> The reason I've suggested that 'migrate' being part of a 'transaction'
> is that it would remove the need to stop it for the disk snapshot part.
> 

Hmm, I believe stopping the VM for a short time is unavoidable to keep the 
saved device state consistent with the blockdev snapshot base. Maybe I've 
missed something, but it seems logical.
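For anyone who wants to experiment, the sequence above might look roughly like
this over QMP (the 'background-snapshot' capability name is from this series;
the device id and file paths are made up for illustration):

```json
{ "execute": "migrate-set-capabilities",
  "arguments": { "capabilities": [
      { "capability": "background-snapshot", "state": true } ] } }
{ "execute": "stop" }
{ "execute": "transaction",
  "arguments": { "actions": [
      { "type": "blockdev-snapshot-sync",
        "data": { "device": "drive0",
                  "snapshot-file": "/tmp/overlay.qcow2" } } ] } }
{ "execute": "migrate",
  "arguments": { "uri": "exec:cat > /tmp/vm_state" } }
```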

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 0/6] UFFD write-tracking migration/snapshots
  2020-11-27 10:00       ` Andrey Gruzdev
@ 2020-11-27 15:45         ` Peter Xu
  2020-11-27 17:19           ` Andrey Gruzdev
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2020-11-27 15:45 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Peter Krempa, Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Fri, Nov 27, 2020 at 01:00:48PM +0300, Andrey Gruzdev wrote:
> On 27.11.2020 12:49, Peter Krempa wrote:
> > On Fri, Nov 27, 2020 at 11:21:39 +0300, Andrey Gruzdev wrote:
> > > On 26.11.2020 18:47, Peter Krempa wrote:
> > > > On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
> > > [...]
> > 
> > Yes, that's correct.
> > 
> > The reason I've suggested that 'migrate' being part of a 'transaction'
> > is that it would remove the need to stop it for the disk snapshot part.
> > 
> 
> Hmm, I believe stopping the VM for a short time is unavoidable to keep the
> saved device state consistent with the blockdev snapshot base. Maybe I've
> missed something, but it seems logical.

I guess PeterK meant an explicit 'stop' VM command, rather than the stop in this
series, which should be undetectable from the guest.  It would definitely be
great if PeterK could quickly follow up on this in libvirt. :)

One overall comment (before going into details): please remember to start
using "migration: " prefixes in patch subjects from the next version.

Thanks,

-- 
Peter Xu




* Re: [PATCH v4 0/6] UFFD write-tracking migration/snapshots
  2020-11-27 15:45         ` Peter Xu
@ 2020-11-27 17:19           ` Andrey Gruzdev
  0 siblings, 0 replies; 45+ messages in thread
From: Andrey Gruzdev @ 2020-11-27 17:19 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Krempa, qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini,
	Juan Quintela, Dr . David Alan Gilbert, Markus Armbruster

On 27.11.2020 18:45, Peter Xu wrote:
> On Fri, Nov 27, 2020 at 01:00:48PM +0300, Andrey Gruzdev wrote:
>> On 27.11.2020 12:49, Peter Krempa wrote:
>>> [...]
>>>
>>> Yes, that's correct.
>>>
>>> The reason I've suggested that 'migrate' being part of a 'transaction'
>>> is that it would remove the need to stop it for the disk snapshot part.
>>>
>>
>> Hmm, I believe stopping the VM for a short time is unavoidable to keep the
>> saved device state consistent with the blockdev snapshot base. Maybe I've
>> missed something, but it seems logical.
> 
> I guess PeterK meant an explicit 'stop' VM command, rather than the stop in this
> series, which should be undetectable from the guest.  It would definitely be
> great if PeterK could quickly follow up on this in libvirt. :)
> 

Sorry if I misunderstood; sure, it would be great if PeterK is really 
interested in following up on this feature!

> One overall comment (before going into details): please remember to start
> using "migration: " prefixes in patch subjects from the next version.
> 

Yes, I really missed it everywhere :)

> Thanks,
> 


-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com



* Re: [PATCH v4 1/6] introduce 'background-snapshot' migration capability
  2020-11-26 15:17 ` [PATCH v4 1/6] introduce 'background-snapshot' migration capability Andrey Gruzdev via
@ 2020-11-27 19:55   ` Peter Xu
  2020-11-28 16:35     ` Andrey Gruzdev
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2020-11-27 19:55 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Thu, Nov 26, 2020 at 06:17:29PM +0300, Andrey Gruzdev wrote:
> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>

The patch looks good to me, but again, any commit message would be more than
welcome... as long as it's not empty. :)

-- 
Peter Xu




* Re: [PATCH v4 2/6] introduce UFFD-WP low-level interface helpers
  2020-11-26 15:17 ` [PATCH v4 2/6] introduce UFFD-WP low-level interface helpers Andrey Gruzdev via
@ 2020-11-27 21:04   ` Peter Xu
  2020-11-29 20:12     ` Andrey Gruzdev
  2020-12-01 12:24   ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 45+ messages in thread
From: Peter Xu @ 2020-11-27 21:04 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Thu, Nov 26, 2020 at 06:17:30PM +0300, Andrey Gruzdev wrote:
> Implemented support for the whole RAM block memory
> protection/un-protection. Introduced higher level
> ram_write_tracking_start() and ram_write_tracking_stop()
> to start/stop tracking guest memory writes.

The whole patch looks good to me in general.  A few nitpicks below...

> 
> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> ---
>  include/exec/memory.h      |   7 ++
>  include/qemu/userfaultfd.h |  29 +++++
>  migration/ram.c            | 120 +++++++++++++++++++++
>  migration/ram.h            |   4 +
>  util/meson.build           |   1 +
>  util/userfaultfd.c         | 215 +++++++++++++++++++++++++++++++++++++
>  6 files changed, 376 insertions(+)
>  create mode 100644 include/qemu/userfaultfd.h
>  create mode 100644 util/userfaultfd.c
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 0f3e6bcd5e..3d798fce16 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -139,6 +139,13 @@ typedef struct IOMMUNotifier IOMMUNotifier;
>  /* RAM is a persistent kind memory */
>  #define RAM_PMEM (1 << 5)
>  
> +/*
> + * UFFDIO_WRITEPROTECT is used on this RAMBlock to
> + * support 'write-tracking' migration type.
> + * Implies ram_state->ram_wt_enabled.
> + */
> +#define RAM_UF_WRITEPROTECT (1 << 6)
> +
>  static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
>                                         IOMMUNotifierFlag flags,
>                                         hwaddr start, hwaddr end,
> diff --git a/include/qemu/userfaultfd.h b/include/qemu/userfaultfd.h
> new file mode 100644
> index 0000000000..fb843c76db
> --- /dev/null
> +++ b/include/qemu/userfaultfd.h
> @@ -0,0 +1,29 @@
> +/*
> + * Linux UFFD-WP support
> + *
> + * Copyright Virtuozzo GmbH, 2020
> + *
> + * Authors:
> + *  Andrey Gruzdev   <andrey.gruzdev@virtuozzo.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * later.  See the COPYING file in the top-level directory.
> + */
> +
> +#ifndef USERFAULTFD_H
> +#define USERFAULTFD_H
> +
> +#include "qemu/osdep.h"
> +#include "exec/hwaddr.h"
> +#include <linux/userfaultfd.h>
> +
> +int uffd_create_fd(void);
> +void uffd_close_fd(int uffd);
> +int uffd_register_memory(int uffd, hwaddr start, hwaddr length,
> +        bool track_missing, bool track_wp);
> +int uffd_unregister_memory(int uffd, hwaddr start, hwaddr length);
> +int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp);
> +int uffd_read_events(int uffd, struct uffd_msg *msgs, int count);
> +bool uffd_poll_events(int uffd, int tmo);
> +
> +#endif /* USERFAULTFD_H */
> diff --git a/migration/ram.c b/migration/ram.c
> index 7811cde643..3adfd1948d 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -56,6 +56,11 @@
>  #include "savevm.h"
>  #include "qemu/iov.h"
>  #include "multifd.h"
> +#include "sysemu/runstate.h"
> +
> +#ifdef CONFIG_LINUX
> +#include "qemu/userfaultfd.h"
> +#endif
>  
>  /***********************************************************/
>  /* ram save/restore */
> @@ -298,6 +303,8 @@ struct RAMSrcPageRequest {
>  struct RAMState {
>      /* QEMUFile used for this migration */
>      QEMUFile *f;
> +    /* UFFD file descriptor, used in 'write-tracking' migration */
> +    int uffdio_fd;
>      /* Last block that we have visited searching for dirty pages */
>      RAMBlock *last_seen_block;
>      /* Last block from where we have sent data */
> @@ -3788,6 +3795,119 @@ static int ram_resume_prepare(MigrationState *s, void *opaque)
>      return 0;
>  }
>  
> +/*
> + * ram_write_tracking_start: start UFFD-WP memory tracking
> + *
> + * Returns 0 for success or negative value in case of error
> + *

(extra new line)

> + */
> +int ram_write_tracking_start(void)
> +{
> +#ifdef CONFIG_LINUX
> +    int uffd;
> +    RAMState *rs = ram_state;
> +    RAMBlock *bs;
> +
> +    /* Open UFFD file descriptor */
> +    uffd = uffd_create_fd();
> +    if (uffd < 0) {
> +        return uffd;
> +    }
> +    rs->uffdio_fd = uffd;

May need a rcu_read_lock() here to guarantee safe access to
RAMBLOCK_FOREACH_NOT_IGNORED.

> +
> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> +        /* Nothing to do with read-only and MMIO-writable regions */
> +        if (bs->mr->readonly || bs->mr->rom_device) {
> +            continue;
> +        }
> +
> +        bs->flags |= RAM_UF_WRITEPROTECT;
> +        /* Register block memory with UFFD to track writes */
> +        if (uffd_register_memory(rs->uffdio_fd, (hwaddr) bs->host,
> +                bs->max_length, false, true)) {
> +            goto fail;
> +        }
> +        /* Apply UFFD write protection to the block memory range */
> +        if (uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host,
> +                bs->max_length, true)) {
> +            goto fail;
> +        }
> +
> +        info_report("UFFD-WP write-tracking enabled: "
> +                "block_id=%s page_size=%zu start=%p length=%lu "
> +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
> +                bs->idstr, bs->page_size, bs->host, bs->max_length,
> +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
> +                bs->mr->nonvolatile, bs->mr->rom_device);

Is info_report() printed by default?  Consider changing this into a tracepoint;
please refer to the functions named trace_*().

> +    }
> +
> +    return 0;
> +
> +fail:
> +    error_report("ram_write_tracking_start() failed: restoring initial memory state");
> +
> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
> +            continue;
> +        }
> +        /*
> +         * In case some memory block failed to be write-protected
> +         * remove protection and unregister all succeeded RAM blocks
> +         */
> +        uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host, bs->max_length, false);
> +        uffd_unregister_memory(rs->uffdio_fd, (hwaddr) bs->host, bs->max_length);
> +        /* Cleanup flags */
> +        bs->flags &= ~RAM_UF_WRITEPROTECT;
> +    }
> +
> +    uffd_close_fd(uffd);
> +    rs->uffdio_fd = -1;
> +    return -1;
> +#else
> +    rs->uffdio_fd = -1;
> +    error_setg(&migrate_get_current()->error,
> +            "Background-snapshot not supported on non-Linux hosts");

Accessing the global var seems overkill to me.  I'd simply return an error,
since IIUC if we gate the capability bit well, these paths won't really
trigger at all.  So it's just there to pass compilation.

> +    return -1;
> +#endif /* CONFIG_LINUX */
> +}
> +
> +/**
> + * ram_write_tracking_stop: stop UFFD-WP memory tracking and remove protection
> + */
> +void ram_write_tracking_stop(void)
> +{
> +#ifdef CONFIG_LINUX
> +    RAMState *rs = ram_state;
> +    RAMBlock *bs;
> +    assert(rs->uffdio_fd >= 0);

Maybe too harsh - we can return if it's invalid.

Meanwhile, better rcu_read_lock(), as well?

> +
> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
> +            continue;
> +        }
> +        /* Remove protection and unregister all affected RAM blocks */
> +        uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host, bs->max_length, false);
> +        uffd_unregister_memory(rs->uffdio_fd, (hwaddr) bs->host, bs->max_length);
> +        /* Cleanup flags */
> +        bs->flags &= ~RAM_UF_WRITEPROTECT;
> +
> +        info_report("UFFD-WP write-tracking disabled: "
> +                "block_id=%s page_size=%zu start=%p length=%lu "
> +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
> +                bs->idstr, bs->page_size, bs->host, bs->max_length,
> +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
> +                bs->mr->nonvolatile, bs->mr->rom_device);

Use tracepoint?

> +    }
> +
> +    /* Finally close UFFD file descriptor */
> +    uffd_close_fd(rs->uffdio_fd);
> +    rs->uffdio_fd = -1;
> +#else
> +    error_setg(&migrate_get_current()->error,
> +            "Background-snapshot not supported on non-Linux hosts");

Ditto.

> +#endif /* CONFIG_LINUX */
> +}
> +
>  static SaveVMHandlers savevm_ram_handlers = {
>      .save_setup = ram_save_setup,
>      .save_live_iterate = ram_save_iterate,
> diff --git a/migration/ram.h b/migration/ram.h
> index 011e85414e..0ec63e27ee 100644
> --- a/migration/ram.h
> +++ b/migration/ram.h
> @@ -79,4 +79,8 @@ void colo_flush_ram_cache(void);
>  void colo_release_ram_cache(void);
>  void colo_incoming_start_dirty_log(void);
>  
> +/* Background snapshots */
> +int ram_write_tracking_start(void);
> +void ram_write_tracking_stop(void);
> +
>  #endif
> diff --git a/util/meson.build b/util/meson.build
> index f359af0d46..c64bfe94b3 100644
> --- a/util/meson.build
> +++ b/util/meson.build
> @@ -50,6 +50,7 @@ endif
>  
>  if have_system
>    util_ss.add(when: 'CONFIG_GIO', if_true: [files('dbus.c'), gio])
> +  util_ss.add(when: 'CONFIG_LINUX', if_true: files('userfaultfd.c'))
>  endif
>  
>  if have_block
> diff --git a/util/userfaultfd.c b/util/userfaultfd.c
> new file mode 100644
> index 0000000000..038953d7ed
> --- /dev/null
> +++ b/util/userfaultfd.c
> @@ -0,0 +1,215 @@
> +/*
> + * Linux UFFD-WP support
> + *
> + * Copyright Virtuozzo GmbH, 2020
> + *
> + * Authors:
> + *  Andrey Gruzdev   <andrey.gruzdev@virtuozzo.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * later.  See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/bitops.h"
> +#include "qemu/error-report.h"
> +#include "qemu/userfaultfd.h"
> +#include <poll.h>
> +#include <sys/syscall.h>
> +#include <sys/ioctl.h>
> +
> +/**
> + * uffd_create_fd: create UFFD file descriptor
> + *
> + * Returns non-negative file descriptor or negative value in case of an error
> + */
> +int uffd_create_fd(void)
> +{
> +    int uffd;
> +    struct uffdio_api api_struct;
> +    uint64_t ioctl_mask = BIT(_UFFDIO_REGISTER) | BIT(_UFFDIO_UNREGISTER);
> +
> +    uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> +    if (uffd < 0) {
> +        error_report("uffd_create_fd() failed: UFFD not supported");
> +        return -1;
> +    }
> +
> +    api_struct.api = UFFD_API;
> +    api_struct.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;

This might be too strict if we want to reuse this code with postcopy to run on
old kernels that only support missing faults.  Not a big problem for now; we can
work on top.

> +    if (ioctl(uffd, UFFDIO_API, &api_struct)) {
> +        error_report("uffd_create_fd() failed: "
> +                     "API version not supported version=%llx errno=%i",
> +                api_struct.api, errno);

(A few strange alignments... won't complain but fixing would be even better)

> +        goto fail;
> +    }
> +
> +    if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) {
> +        error_report("uffd_create_fd() failed: "
> +                     "PAGEFAULT_FLAG_WP feature missing");
> +        goto fail;
> +    }
> +
> +    return uffd;
> +
> +fail:
> +    close(uffd);
> +    return -1;
> +}
> +
> +/**
> + * uffd_close_fd: close UFFD file descriptor
> + *
> + * @uffd: UFFD file descriptor
> + */
> +void uffd_close_fd(int uffd)
> +{
> +    assert(uffd >= 0);
> +    close(uffd);
> +}
> +
> +/**
> + * uffd_register_memory: register memory range with UFFD
> + *
> + * Returns 0 in case of success, negative value on error
> + *
> + * @uffd: UFFD file descriptor
> + * @start: starting virtual address of memory range
> + * @length: length of memory range
> + * @track_missing: generate events on missing-page faults
> + * @track_wp: generate events on write-protected-page faults
> + */
> +int uffd_register_memory(int uffd, hwaddr start, hwaddr length,

My understanding is that hwaddr is for the guest physical address space.  Here
I think either uint64_t or "void *" would suit better.

> +        bool track_missing, bool track_wp)
> +{
> +    struct uffdio_register uffd_register;
> +
> +    uffd_register.range.start = start;
> +    uffd_register.range.len = length;
> +    uffd_register.mode = (track_missing ? UFFDIO_REGISTER_MODE_MISSING : 0) |
> +                         (track_wp ? UFFDIO_REGISTER_MODE_WP : 0);
> +
> +    if (ioctl(uffd, UFFDIO_REGISTER, &uffd_register)) {
> +        error_report("uffd_register_memory() failed: "
> +                     "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
> +                start, length, uffd_register.mode, errno);
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +/**
> + * uffd_unregister_memory: un-register memory range with UFFD
> + *
> + * Returns 0 in case of success, negative value on error
> + *
> + * @uffd: UFFD file descriptor
> + * @start: starting virtual address of memory range
> + * @length: length of memory range
> + */
> +int uffd_unregister_memory(int uffd, hwaddr start, hwaddr length)

Same question on using hwaddr as above.

> +{
> +    struct uffdio_range uffd_range;
> +
> +    uffd_range.start = start;
> +    uffd_range.len = length;
> +
> +    if (ioctl(uffd, UFFDIO_UNREGISTER, &uffd_range)) {
> +        error_report("uffd_unregister_memory() failed: "
> +                     "start=%0"PRIx64" len=%"PRIu64" errno=%i",
> +                start, length, errno);
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +/**
> + * uffd_protect_memory: protect/unprotect memory range for writes with UFFD

It may look strange if uffd_protect_memory() can also unprotect stuff...

Maybe rename it to uffd_change_protection()?

> + *
> + * Returns 0 on success or negative value in case of error
> + *
> + * @uffd: UFFD file descriptor
> + * @start: starting virtual address of memory range
> + * @length: length of memory range
> + * @wp: write-protect/unprotect
> + */
> +int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp)
> +{
> +    struct uffdio_writeprotect uffd_writeprotect;
> +    int res;
> +
> +    uffd_writeprotect.range.start = start;
> +    uffd_writeprotect.range.len = length;
> +    uffd_writeprotect.mode = (wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0);
> +
> +    do {
> +        res = ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect);
> +    } while (res < 0 && errno == EINTR);
> +    if (res < 0) {
> +        error_report("uffd_protect_memory() failed: "
> +                     "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
> +                start, length, uffd_writeprotect.mode, errno);
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +/**
> + * uffd_read_events: read pending UFFD events
> + *
> + * Returns number of fetched messages, 0 if non is available or
> + * negative value in case of an error
> + *
> + * @uffd: UFFD file descriptor
> + * @msgs: pointer to message buffer
> + * @count: number of messages that can fit in the buffer
> + */
> +int uffd_read_events(int uffd, struct uffd_msg *msgs, int count)
> +{
> +    ssize_t res;
> +    do {
> +        res = read(uffd, msgs, count * sizeof(struct uffd_msg));
> +    } while (res < 0 && errno == EINTR);
> +
> +    if ((res < 0 && errno == EAGAIN)) {
> +        return 0;
> +    }
> +    if (res < 0) {
> +        error_report("uffd_read_events() failed: errno=%i", errno);
> +        return -1;
> +    }
> +
> +    return (int) (res / sizeof(struct uffd_msg));
> +}
> +
> +/**
> + * uffd_poll_events: poll UFFD file descriptor for read
> + *
> + * Returns true if events are available for read, false otherwise
> + *
> + * @uffd: UFFD file descriptor
> + * @tmo: timeout in milliseconds, 0 for non-blocking operation,
> + *       negative value for infinite wait
> + */
> +bool uffd_poll_events(int uffd, int tmo)

Shall we spell "tmo" out?

Thanks,

> +{
> +    int res;
> +    struct pollfd poll_fd = { .fd = uffd, .events = POLLIN, .revents = 0 };
> +
> +    do {
> +        res = poll(&poll_fd, 1, tmo);
> +    } while (res < 0 && errno == EINTR);
> +
> +    if (res == 0) {
> +        return false;
> +    }
> +    if (res < 0) {
> +        error_report("uffd_poll_events() failed: errno=%i", errno);
> +        return false;
> +    }
> +
> +    return (poll_fd.revents & POLLIN) != 0;
> +}
> -- 
> 2.25.1
> 

-- 
Peter Xu




* Re: [PATCH v4 3/6] support UFFD write fault processing in ram_save_iterate()
  2020-11-26 15:17 ` [PATCH v4 3/6] support UFFD write fault processing in ram_save_iterate() Andrey Gruzdev via
@ 2020-11-27 21:49   ` Peter Xu
  2020-11-29 21:14     ` Andrey Gruzdev
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2020-11-27 21:49 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Thu, Nov 26, 2020 at 06:17:31PM +0300, Andrey Gruzdev wrote:
> In this particular implementation the same single migration
> thread is responsible for both normal linear dirty-page
> migration and processing UFFD page fault events.
> 
> Processing write faults includes reading the UFFD file descriptor,
> finding the respective RAM block and saving the faulting page to
> the migration stream. After the page has been saved, write protection
> can be removed. Since the asynchronous version of qemu_put_buffer()
> is expected to be used to save pages, we also have to flush the
> migration stream prior to un-protecting the saved memory range.
> 
> Write protection is removed for any previously protected
> memory chunk that has hit the migration stream. That holds
> for pages from the linear page scan as well as write-fault pages.

Thanks for working on this version, it looks much cleaner.

> 
> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> ---
>  migration/ram.c | 155 +++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 147 insertions(+), 8 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 3adfd1948d..bcdccdaef7 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1441,6 +1441,76 @@ static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset)
>      return block;
>  }
>  
> +#ifdef CONFIG_LINUX
> +/**
> + * ram_find_block_by_host_address: find RAM block containing host page
> + *
> + * Returns pointer to RAMBlock if found, NULL otherwise
> + *
> + * @rs: current RAM state
> + * @page_address: host page address
> + */
> +static RAMBlock *ram_find_block_by_host_address(RAMState *rs, hwaddr page_address)

Reuse qemu_ram_block_from_host() somehow?

> +{
> +    RAMBlock *bs = rs->last_seen_block;
> +
> +    do {
> +        if (page_address >= (hwaddr) bs->host && (page_address + TARGET_PAGE_SIZE) <=
> +                ((hwaddr) bs->host + bs->max_length)) {
> +            return bs;
> +        }
> +
> +        bs = QLIST_NEXT_RCU(bs, next);
> +        if (!bs) {
> +            /* Hit the end of the list */
> +            bs = QLIST_FIRST_RCU(&ram_list.blocks);
> +        }
> +    } while (bs != rs->last_seen_block);
> +
> +    return NULL;
> +}
> +
> +/**
> + * poll_fault_page: try to get next UFFD write fault page and, if pending fault
> + *   is found, return RAM block pointer and page offset
> + *
> + * Returns pointer to the RAMBlock containing faulting page,
> + *   NULL if no write faults are pending
> + *
> + * @rs: current RAM state
> + * @offset: page offset from the beginning of the block
> + */
> +static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset)
> +{
> +    struct uffd_msg uffd_msg;
> +    hwaddr page_address;
> +    RAMBlock *bs;
> +    int res;
> +
> +    if (!migrate_background_snapshot()) {
> +        return NULL;
> +    }
> +
> +    res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1);
> +    if (res <= 0) {
> +        return NULL;
> +    }
> +
> +    page_address = uffd_msg.arg.pagefault.address;
> +    bs = ram_find_block_by_host_address(rs, page_address);
> +    if (!bs) {
> +        /* In case we couldn't find respective block, just unprotect faulting page. */
> +        uffd_protect_memory(rs->uffdio_fd, page_address, TARGET_PAGE_SIZE, false);
> +        error_report("ram_find_block_by_host_address() failed: address=0x%0lx",
> +                page_address);

Looks ok to error_report() instead of assert(), but I'd suggest dropping the call
to uffd_protect_memory() at least.  The only reason not to use assert() is
that we try our best to avoid crashing the vm; however I really doubt
whether uffd_protect_memory() is the right thing to do even if this happens - we
may end up trying to unprotect some strange page whose location we don't even
know...

> +        return NULL;
> +    }
> +
> +    *offset = (ram_addr_t) (page_address - (hwaddr) bs->host);
> +    return bs;
> +}
> +#endif /* CONFIG_LINUX */
> +
>  /**
>   * get_queued_page: unqueue a page from the postcopy requests
>   *
> @@ -1480,6 +1550,16 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
>  
>      } while (block && !dirty);
>  
> +#ifdef CONFIG_LINUX
> +    if (!block) {
> +        /*
> +         * Poll write faults too if background snapshot is enabled; that's
> +         * when we have vcpus got blocked by the write protected pages.
> +         */
> +        block = poll_fault_page(rs, &offset);
> +    }
> +#endif /* CONFIG_LINUX */
> +
>      if (block) {
>          /*
>           * As soon as we start servicing pages out of order, then we have
> @@ -1753,6 +1833,55 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
>      return pages;
>  }
>  
> +/**
> + * ram_save_host_page_pre: ram_save_host_page() pre-notifier
> + *
> + * @rs: current RAM state
> + * @pss: page-search-status structure
> + * @opaque: pointer to receive opaque context value
> + */
> +static inline
> +void ram_save_host_page_pre(RAMState *rs, PageSearchStatus *pss, void **opaque)
> +{
> +    *(ram_addr_t *) opaque = pss->page;
> +}
> +
> +/**
> + * ram_save_host_page_post: ram_save_host_page() post-notifier
> + *
> + * @rs: current RAM state
> + * @pss: page-search-status structure
> + * @opaque: opaque context value
> + * @res_override: pointer to the return value of ram_save_host_page(),
> + *   overwritten in case of an error
> + */
> +static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss,
> +        void *opaque, int *res_override)
> +{
> +    /* Check if page is from UFFD-managed region. */
> +    if (pss->block->flags & RAM_UF_WRITEPROTECT) {
> +#ifdef CONFIG_LINUX
> +        ram_addr_t page_from = (ram_addr_t) opaque;
> +        hwaddr page_address = (hwaddr) pss->block->host +
> +                              ((hwaddr) page_from << TARGET_PAGE_BITS);

I feel like most of the new uses of hwaddr are not correct...  As I also commented in
the other patch.  We would save a lot of casts by switching.

> +        hwaddr run_length = (hwaddr) (pss->page - page_from + 1) << TARGET_PAGE_BITS;
> +        int res;
> +
> +        /* Flush async buffers before un-protect. */
> +        qemu_fflush(rs->f);
> +        /* Un-protect memory range. */
> +        res = uffd_protect_memory(rs->uffdio_fd, page_address, run_length, false);
> +        /* We don't want to override existing error from ram_save_host_page(). */
> +        if (res < 0 && *res_override >= 0) {
> +            *res_override = res;

What is this used for?  If res<0, I thought we should just fail the snapshot.

Meanwhile, res_override points to "pages", and then it'll be rewritten to the
errno returned by uffd_protect_memory().  Smells strange.

Can this ever be triggered anyway?

> +        }
> +#else
> +        /* Should never happen */
> +        qemu_file_set_error(rs->f, -ENOSYS);
> +#endif /* CONFIG_LINUX */
> +    }
> +}
> +
>  /**
>   * ram_find_and_save_block: finds a dirty page and sends it to f
>   *
> @@ -1779,14 +1908,14 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
>          return pages;
>      }
>  
> +    if (!rs->last_seen_block) {
> +        rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks);

Why set up the last seen block to be the first if null?

> +    }
> +
>      pss.block = rs->last_seen_block;
>      pss.page = rs->last_page;
>      pss.complete_round = false;
>  
> -    if (!pss.block) {
> -        pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
> -    }
> -
>      do {
>          again = true;
>          found = get_queued_page(rs, &pss);
> @@ -1797,7 +1926,11 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
>          }
>  
>          if (found) {
> +            void *opaque;
> +
> +            ram_save_host_page_pre(rs, &pss, &opaque);
>              pages = ram_save_host_page(rs, &pss, last_stage);
> +            ram_save_host_page_post(rs, &pss, opaque, &pages);

So the pre/post idea is kind of an overkill to me...

How about we do the unprotect in ram_save_host_page() in the simple way, like:

  ram_save_host_page()
    start_addr = pss->page;
    do {
      ...
      (migrate pages)
      ...
    } while (...);
    if (background_snapshot_enabled()) {
      unprotect pages within start_addr to pss->page;
    }
    ...

>          }
>      } while (!pages && again);
>  
> @@ -3864,9 +3997,12 @@ fail:
>      rs->uffdio_fd = -1;
>      return -1;
>  #else
> +    /*
> +     * Should never happen since we prohibit 'background-snapshot'
> +     * capability on non-Linux hosts.

Yeah, yeah. So let's drop these irrelevant changes? :)

> +     */
>      rs->uffdio_fd = -1;
> -    error_setg(&migrate_get_current()->error,
> -            "Background-snapshot not supported on non-Linux hosts");
> +    error_setg(&migrate_get_current()->error, QERR_UNDEFINED_ERROR);
>      return -1;
>  #endif /* CONFIG_LINUX */
>  }
> @@ -3903,8 +4039,11 @@ void ram_write_tracking_stop(void)
>      uffd_close_fd(rs->uffdio_fd);
>      rs->uffdio_fd = -1;
>  #else
> -    error_setg(&migrate_get_current()->error,
> -            "Background-snapshot not supported on non-Linux hosts");
> +    /*
> +     * Should never happen since we prohibit 'background-snapshot'
> +     * capability on non-Linux hosts.
> +     */
> +    error_setg(&migrate_get_current()->error, QERR_UNDEFINED_ERROR);

Same here.

Thanks,

>  #endif /* CONFIG_LINUX */
>  }
>  
> -- 
> 2.25.1
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 0/6] UFFD write-tracking migration/snapshots
  2020-11-26 15:17 [PATCH v4 0/6] UFFD write-tracking migration/snapshots Andrey Gruzdev via
                   ` (6 preceding siblings ...)
  2020-11-26 15:47 ` [PATCH v4 0/6] UFFD write-tracking migration/snapshots Peter Krempa
@ 2020-11-27 22:04 ` Peter Xu
  2020-11-30  8:07   ` Andrey Gruzdev
  2020-12-01  7:08 ` Peter Krempa
  8 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2020-11-27 22:04 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Thu, Nov 26, 2020 at 06:17:28PM +0300, Andrey Gruzdev wrote:
> Changes v3->v4:
> 
> * 1. Renamed migrate capability 'track-writes-ram'->'background-snapshot'.
> * 2. Use array of incompatible caps to replace bulky 'if' constructs.
> * 3. Moved UFFD low-level code to the separate module ('util/userfaultfd.c').
> * 4. Always do UFFD wr-unprotect on cleanup; just closing the file descriptor
> *    won't clean up PTEs anyhow; it will release registration ranges, wait
> *    queues etc., but won't clean up the process MM context at the MMU level.
> * 5. Allow to enable 'background-snapshot' capability on Linux-only hosts.
> * 6. Put UFFD code usage under '#ifdef CONFIG_LINUX' prerequisite.
> * 7. Removed 'wt_' from RAMState struct.
> * 8. Refactored ram_find_and_save_block() to make it cleaner - poll UFFD
> *    wr-fault events in get_queued_page(), use ram_save_host_page_pre(),
> *    ram_save_host_page_post() notifiers around ram_save_host_page()
> *    instead of bulky inline write-unprotect code.

One thing I mentioned previously but which seems to have got lost is that we don't
need dirty tracking for live snapshots.

A few pointers for reference:

  memory_global_dirty_log_start()
  migration_bitmap_sync_precopy()
  memory_region_clear_dirty_bitmap()
  ...

These should not be needed.  But this can also be done on top.

Thanks,

-- 
Peter Xu




* Re: [PATCH v4 5/6] the rest of write tracking migration code
  2020-11-26 15:17 ` [PATCH v4 5/6] the rest of write tracking migration code Andrey Gruzdev via
@ 2020-11-27 22:26   ` Peter Xu
  2020-11-30  8:09     ` Andrey Gruzdev
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2020-11-27 22:26 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Thu, Nov 26, 2020 at 06:17:33PM +0300, Andrey Gruzdev wrote:
> Implementation of bg_migration iteration/completion/finish.
> 
> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>

Both patch 4 & 5 look good to me in general.  Maybe you can even squash this
patch into 4?  It's not scarily huge as a single patch, but a complete one.

-- 
Peter Xu




* Re: [PATCH v4 6/6] introduce simple linear scan rate limiting mechanism
  2020-11-26 15:17 ` [PATCH v4 6/6] introduce simple linear scan rate limiting mechanism Andrey Gruzdev via
@ 2020-11-27 22:28   ` Peter Xu
  2020-11-30  8:11     ` Andrey Gruzdev
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2020-11-27 22:28 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Thu, Nov 26, 2020 at 06:17:34PM +0300, Andrey Gruzdev wrote:
> Since reading UFFD events and saving paged data are performed
> from the same thread, write fault latencies are sensitive to
> migration stream stalls. Limiting total page saving rate is a
> method to reduce the amount of noticeable fault resolution latencies.
> 
> Migration bandwidth limiting is achieved via noticing cases of
> out-of-threshold write fault latencies and temporarily disabling
> (strictly speaking, severely throttling) saving non-faulting pages.

So have you done any measurements out of it, as we've talked in previous
version?  Thanks,

-- 
Peter Xu




* Re: [PATCH v4 1/6] introduce 'background-snapshot' migration capability
  2020-11-27 19:55   ` Peter Xu
@ 2020-11-28 16:35     ` Andrey Gruzdev
  0 siblings, 0 replies; 45+ messages in thread
From: Andrey Gruzdev @ 2020-11-28 16:35 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster

On 27.11.2020 22:55, Peter Xu wrote:
> On Thu, Nov 26, 2020 at 06:17:29PM +0300, Andrey Gruzdev wrote:
>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> 
> The patch looks good to me, but again, any commit message would be more than
> welcomed... as long as it's not empty. :)
> 

Yep, agree. :)

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com



* Re: [PATCH v4 2/6] introduce UFFD-WP low-level interface helpers
  2020-11-27 21:04   ` Peter Xu
@ 2020-11-29 20:12     ` Andrey Gruzdev
  2020-11-30 15:34       ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Andrey Gruzdev @ 2020-11-29 20:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster

On 28.11.2020 00:04, Peter Xu wrote:
> On Thu, Nov 26, 2020 at 06:17:30PM +0300, Andrey Gruzdev wrote:
>> Implemented support for the whole RAM block memory
>> protection/un-protection. Introduced higher level
>> ram_write_tracking_start() and ram_write_tracking_stop()
>> to start/stop tracking guest memory writes.
> 
> The whole patch looks good to me in general.  A few nitpickings below..
> 
>>
>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
>> ---
>>   include/exec/memory.h      |   7 ++
>>   include/qemu/userfaultfd.h |  29 +++++
>>   migration/ram.c            | 120 +++++++++++++++++++++
>>   migration/ram.h            |   4 +
>>   util/meson.build           |   1 +
>>   util/userfaultfd.c         | 215 +++++++++++++++++++++++++++++++++++++
>>   6 files changed, 376 insertions(+)
>>   create mode 100644 include/qemu/userfaultfd.h
>>   create mode 100644 util/userfaultfd.c
>>
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index 0f3e6bcd5e..3d798fce16 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -139,6 +139,13 @@ typedef struct IOMMUNotifier IOMMUNotifier;
>>   /* RAM is a persistent kind memory */
>>   #define RAM_PMEM (1 << 5)
>>   
>> +/*
>> + * UFFDIO_WRITEPROTECT is used on this RAMBlock to
>> + * support 'write-tracking' migration type.
>> + * Implies ram_state->ram_wt_enabled.
>> + */
>> +#define RAM_UF_WRITEPROTECT (1 << 6)
>> +
>>   static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
>>                                          IOMMUNotifierFlag flags,
>>                                          hwaddr start, hwaddr end,
>> diff --git a/include/qemu/userfaultfd.h b/include/qemu/userfaultfd.h
>> new file mode 100644
>> index 0000000000..fb843c76db
>> --- /dev/null
>> +++ b/include/qemu/userfaultfd.h
>> @@ -0,0 +1,29 @@
>> +/*
>> + * Linux UFFD-WP support
>> + *
>> + * Copyright Virtuozzo GmbH, 2020
>> + *
>> + * Authors:
>> + *  Andrey Gruzdev   <andrey.gruzdev@virtuozzo.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>> + * later.  See the COPYING file in the top-level directory.
>> + */
>> +
>> +#ifndef USERFAULTFD_H
>> +#define USERFAULTFD_H
>> +
>> +#include "qemu/osdep.h"
>> +#include "exec/hwaddr.h"
>> +#include <linux/userfaultfd.h>
>> +
>> +int uffd_create_fd(void);
>> +void uffd_close_fd(int uffd);
>> +int uffd_register_memory(int uffd, hwaddr start, hwaddr length,
>> +        bool track_missing, bool track_wp);
>> +int uffd_unregister_memory(int uffd, hwaddr start, hwaddr length);
>> +int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp);
>> +int uffd_read_events(int uffd, struct uffd_msg *msgs, int count);
>> +bool uffd_poll_events(int uffd, int tmo);
>> +
>> +#endif /* USERFAULTFD_H */
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 7811cde643..3adfd1948d 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -56,6 +56,11 @@
>>   #include "savevm.h"
>>   #include "qemu/iov.h"
>>   #include "multifd.h"
>> +#include "sysemu/runstate.h"
>> +
>> +#ifdef CONFIG_LINUX
>> +#include "qemu/userfaultfd.h"
>> +#endif
>>   
>>   /***********************************************************/
>>   /* ram save/restore */
>> @@ -298,6 +303,8 @@ struct RAMSrcPageRequest {
>>   struct RAMState {
>>       /* QEMUFile used for this migration */
>>       QEMUFile *f;
>> +    /* UFFD file descriptor, used in 'write-tracking' migration */
>> +    int uffdio_fd;
>>       /* Last block that we have visited searching for dirty pages */
>>       RAMBlock *last_seen_block;
>>       /* Last block from where we have sent data */
>> @@ -3788,6 +3795,119 @@ static int ram_resume_prepare(MigrationState *s, void *opaque)
>>       return 0;
>>   }
>>   
>> +/*
>> + * ram_write_tracking_start: start UFFD-WP memory tracking
>> + *
>> + * Returns 0 for success or negative value in case of error
>> + *
> 
> (extra new line)
> 

Ok, I see.

>> + */
>> +int ram_write_tracking_start(void)
>> +{
>> +#ifdef CONFIG_LINUX
>> +    int uffd;
>> +    RAMState *rs = ram_state;
>> +    RAMBlock *bs;
>> +
>> +    /* Open UFFD file descriptor */
>> +    uffd = uffd_create_fd();
>> +    if (uffd < 0) {
>> +        return uffd;
>> +    }
>> +    rs->uffdio_fd = uffd;
> 
> May need a rcu_read_lock() here to guarantee safe access to
> RAMBLOCK_FOREACH_NOT_IGNORED.
> 

Yeah, really better to add RCU read lock here.

>> +
>> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
>> +        /* Nothing to do with read-only and MMIO-writable regions */
>> +        if (bs->mr->readonly || bs->mr->rom_device) {
>> +            continue;
>> +        }
>> +
>> +        bs->flags |= RAM_UF_WRITEPROTECT;
>> +        /* Register block memory with UFFD to track writes */
>> +        if (uffd_register_memory(rs->uffdio_fd, (hwaddr) bs->host,
>> +                bs->max_length, false, true)) {
>> +            goto fail;
>> +        }
>> +        /* Apply UFFD write protection to the block memory range */
>> +        if (uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host,
>> +                bs->max_length, true)) {
>> +            goto fail;
>> +        }
>> +
>> +        info_report("UFFD-WP write-tracking enabled: "
>> +                "block_id=%s page_size=%zu start=%p length=%lu "
>> +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
>> +                bs->idstr, bs->page_size, bs->host, bs->max_length,
>> +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
>> +                bs->mr->nonvolatile, bs->mr->rom_device);
> 
> Is info_report() by default printed?  Change this into a tracepoint?  Please
> refer to functions named with trace_*().
> 

Agree, I'll change to tracepoints.

>> +    }
>> +
>> +    return 0;
>> +
>> +fail:
>> +    error_report("ram_write_tracking_start() failed: restoring initial memory state");
>> +
>> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
>> +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
>> +            continue;
>> +        }
>> +        /*
>> +         * In case some memory block failed to be write-protected
>> +         * remove protection and unregister all succeeded RAM blocks
>> +         */
>> +        uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host, bs->max_length, false);
>> +        uffd_unregister_memory(rs->uffdio_fd, (hwaddr) bs->host, bs->max_length);
>> +        /* Cleanup flags */
>> +        bs->flags &= ~RAM_UF_WRITEPROTECT;
>> +    }
>> +
>> +    uffd_close_fd(uffd);
>> +    rs->uffdio_fd = -1;
>> +    return -1;
>> +#else
>> +    rs->uffdio_fd = -1;
>> +    error_setg(&migrate_get_current()->error,
>> +            "Background-snapshot not supported on non-Linux hosts");
> 
> Accessing the global var seems an overkill to me.  I'd simply return an error,
> since iiuc if we gate the capability bit well, then these paths won't really
trigger at all.  So it's just there to make the code compile.
> 

IMHO it's not too terrible, but it's really not needed, since it won't ever be
triggered. I'll remove the error_setg() call.

>> +    return -1;
>> +#endif /* CONFIG_LINUX */
>> +}
>> +
>> +/**
>> + * ram_write_tracking_stop: stop UFFD-WP memory tracking and remove protection
>> + */
>> +void ram_write_tracking_stop(void)
>> +{
>> +#ifdef CONFIG_LINUX
>> +    RAMState *rs = ram_state;
>> +    RAMBlock *bs;
>> +    assert(rs->uffdio_fd >= 0);
> 
> Maybe too harsh - we can return if it's invalid.
> 
> Meanwhile, better rcu_read_lock(), as well?
> 

Yep, RCU lock, I'll add. Why too harsh? Just a debug assertion.
>> +
>> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
>> +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
>> +            continue;
>> +        }
>> +        /* Remove protection and unregister all affected RAM blocks */
>> +        uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host, bs->max_length, false);
>> +        uffd_unregister_memory(rs->uffdio_fd, (hwaddr) bs->host, bs->max_length);
>> +        /* Cleanup flags */
>> +        bs->flags &= ~RAM_UF_WRITEPROTECT;
>> +
>> +        info_report("UFFD-WP write-tracking disabled: "
>> +                "block_id=%s page_size=%zu start=%p length=%lu "
>> +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
>> +                bs->idstr, bs->page_size, bs->host, bs->max_length,
>> +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
>> +                bs->mr->nonvolatile, bs->mr->rom_device);
> 
> Use tracepoint?
> 

Yep.

>> +    }
>> +
>> +    /* Finally close UFFD file descriptor */
>> +    uffd_close_fd(rs->uffdio_fd);
>> +    rs->uffdio_fd = -1;
>> +#else
>> +    error_setg(&migrate_get_current()->error,
>> +            "Background-snapshot not supported on non-Linux hosts");
> 
> Ditto.
> 

Aha, the same.

>> +#endif /* CONFIG_LINUX */
>> +}
>> +
>>   static SaveVMHandlers savevm_ram_handlers = {
>>       .save_setup = ram_save_setup,
>>       .save_live_iterate = ram_save_iterate,
>> diff --git a/migration/ram.h b/migration/ram.h
>> index 011e85414e..0ec63e27ee 100644
>> --- a/migration/ram.h
>> +++ b/migration/ram.h
>> @@ -79,4 +79,8 @@ void colo_flush_ram_cache(void);
>>   void colo_release_ram_cache(void);
>>   void colo_incoming_start_dirty_log(void);
>>   
>> +/* Background snapshots */
>> +int ram_write_tracking_start(void);
>> +void ram_write_tracking_stop(void);
>> +
>>   #endif
>> diff --git a/util/meson.build b/util/meson.build
>> index f359af0d46..c64bfe94b3 100644
>> --- a/util/meson.build
>> +++ b/util/meson.build
>> @@ -50,6 +50,7 @@ endif
>>   
>>   if have_system
>>     util_ss.add(when: 'CONFIG_GIO', if_true: [files('dbus.c'), gio])
>> +  util_ss.add(when: 'CONFIG_LINUX', if_true: files('userfaultfd.c'))
>>   endif
>>   
>>   if have_block
>> diff --git a/util/userfaultfd.c b/util/userfaultfd.c
>> new file mode 100644
>> index 0000000000..038953d7ed
>> --- /dev/null
>> +++ b/util/userfaultfd.c
>> @@ -0,0 +1,215 @@
>> +/*
>> + * Linux UFFD-WP support
>> + *
>> + * Copyright Virtuozzo GmbH, 2020
>> + *
>> + * Authors:
>> + *  Andrey Gruzdev   <andrey.gruzdev@virtuozzo.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>> + * later.  See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/bitops.h"
>> +#include "qemu/error-report.h"
>> +#include "qemu/userfaultfd.h"
>> +#include <poll.h>
>> +#include <sys/syscall.h>
>> +#include <sys/ioctl.h>
>> +
>> +/**
>> + * uffd_create_fd: create UFFD file descriptor
>> + *
>> + * Returns non-negative file descriptor or negative value in case of an error
>> + */
>> +int uffd_create_fd(void)
>> +{
>> +    int uffd;
>> +    struct uffdio_api api_struct;
>> +    uint64_t ioctl_mask = BIT(_UFFDIO_REGISTER) | BIT(_UFFDIO_UNREGISTER);
>> +
>> +    uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
>> +    if (uffd < 0) {
>> +        error_report("uffd_create_fd() failed: UFFD not supported");
>> +        return -1;
>> +    }
>> +
>> +    api_struct.api = UFFD_API;
>> +    api_struct.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
> 
This might be too strict if we want to reuse the code with postcopy to run on
old kernels that only support missing.  Not a big problem for now; we can work
on top.
> 

Aha, ok.

>> +    if (ioctl(uffd, UFFDIO_API, &api_struct)) {
>> +        error_report("uffd_create_fd() failed: "
>> +                     "API version not supported version=%llx errno=%i",
>> +                api_struct.api, errno);
> 
> (A few strange alignments... won't complain but fixing would be even better)
> 

Mm, yes!

>> +        goto fail;
>> +    }
>> +
>> +    if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) {
>> +        error_report("uffd_create_fd() failed: "
>> +                     "PAGEFAULT_FLAG_WP feature missing");
>> +        goto fail;
>> +    }
>> +
>> +    return uffd;
>> +
>> +fail:
>> +    close(uffd);
>> +    return -1;
>> +}
>> +
>> +/**
>> + * uffd_close_fd: close UFFD file descriptor
>> + *
>> + * @uffd: UFFD file descriptor
>> + */
>> +void uffd_close_fd(int uffd)
>> +{
>> +    assert(uffd >= 0);
>> +    close(uffd);
>> +}
>> +
>> +/**
>> + * uffd_register_memory: register memory range with UFFD
>> + *
>> + * Returns 0 in case of success, negative value on error
>> + *
>> + * @uffd: UFFD file descriptor
>> + * @start: starting virtual address of memory range
>> + * @length: length of memory range
>> + * @track_missing: generate events on missing-page faults
>> + * @track_wp: generate events on write-protected-page faults
>> + */
>> +int uffd_register_memory(int uffd, hwaddr start, hwaddr length,
> 
> My understanding is that hwaddr is for guest physical address space.  Here
> I think either uint64_t or "void *" may suite better.
> 

Agree, uint64_t will be better here.

>> +        bool track_missing, bool track_wp)
>> +{
>> +    struct uffdio_register uffd_register;
>> +
>> +    uffd_register.range.start = start;
>> +    uffd_register.range.len = length;
>> +    uffd_register.mode = (track_missing ? UFFDIO_REGISTER_MODE_MISSING : 0) |
>> +                         (track_wp ? UFFDIO_REGISTER_MODE_WP : 0);
>> +
>> +    if (ioctl(uffd, UFFDIO_REGISTER, &uffd_register)) {
>> +        error_report("uffd_register_memory() failed: "
>> +                     "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
>> +                start, length, uffd_register.mode, errno);
>> +        return -1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +/**
>> + * uffd_unregister_memory: un-register memory range with UFFD
>> + *
>> + * Returns 0 in case of success, negative value on error
>> + *
>> + * @uffd: UFFD file descriptor
>> + * @start: starting virtual address of memory range
>> + * @length: length of memory range
>> + */
>> +int uffd_unregister_memory(int uffd, hwaddr start, hwaddr length)
> 
> Same question on using hwaddr as above.
> 

Aha.

>> +{
>> +    struct uffdio_range uffd_range;
>> +
>> +    uffd_range.start = start;
>> +    uffd_range.len = length;
>> +
>> +    if (ioctl(uffd, UFFDIO_UNREGISTER, &uffd_range)) {
>> +        error_report("uffd_unregister_memory() failed: "
>> +                     "start=%0"PRIx64" len=%"PRIu64" errno=%i",
>> +                start, length, errno);
>> +        return -1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +/**
>> + * uffd_protect_memory: protect/unprotect memory range for writes with UFFD
> 
> It may look strange if uffd_protect_memory() can also unprotect stuff...
> 
> Maybe rename it to uffd_change_protection()?
> 

Yes, this name is better.

>> + *
>> + * Returns 0 on success or negative value in case of error
>> + *
>> + * @uffd: UFFD file descriptor
>> + * @start: starting virtual address of memory range
>> + * @length: length of memory range
>> + * @wp: write-protect/unprotect
>> + */
>> +int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp)
>> +{
>> +    struct uffdio_writeprotect uffd_writeprotect;
>> +    int res;
>> +
>> +    uffd_writeprotect.range.start = start;
>> +    uffd_writeprotect.range.len = length;
>> +    uffd_writeprotect.mode = (wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0);
>> +
>> +    do {
>> +        res = ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect);
>> +    } while (res < 0 && errno == EINTR);
>> +    if (res < 0) {
>> +        error_report("uffd_protect_memory() failed: "
>> +                     "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
>> +                start, length, uffd_writeprotect.mode, errno);
>> +        return -1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +/**
>> + * uffd_read_events: read pending UFFD events
>> + *
>> + * Returns the number of fetched messages, 0 if none are available or
>> + * negative value in case of an error
>> + *
>> + * @uffd: UFFD file descriptor
>> + * @msgs: pointer to message buffer
>> + * @count: number of messages that can fit in the buffer
>> + */
>> +int uffd_read_events(int uffd, struct uffd_msg *msgs, int count)
>> +{
>> +    ssize_t res;
>> +    do {
>> +        res = read(uffd, msgs, count * sizeof(struct uffd_msg));
>> +    } while (res < 0 && errno == EINTR);
>> +
>> +    if ((res < 0 && errno == EAGAIN)) {
>> +        return 0;
>> +    }
>> +    if (res < 0) {
>> +        error_report("uffd_read_events() failed: errno=%i", errno);
>> +        return -1;
>> +    }
>> +
>> +    return (int) (res / sizeof(struct uffd_msg));
>> +}
>> +
>> +/**
>> + * uffd_poll_events: poll UFFD file descriptor for read
>> + *
>> + * Returns true if events are available for read, false otherwise
>> + *
>> + * @uffd: UFFD file descriptor
>> + * @tmo: timeout in milliseconds, 0 for non-blocking operation,
>> + *       negative value for infinite wait
>> + */
>> +bool uffd_poll_events(int uffd, int tmo)
> 
> Shall we spell "tmo" out?
> 
> Thanks,
> 

In the comment? I think it's ok.

>> +{
>> +    int res;
>> +    struct pollfd poll_fd = { .fd = uffd, .events = POLLIN, .revents = 0 };
>> +
>> +    do {
>> +        res = poll(&poll_fd, 1, tmo);
>> +    } while (res < 0 && errno == EINTR);
>> +
>> +    if (res == 0) {
>> +        return false;
>> +    }
>> +    if (res < 0) {
>> +        error_report("uffd_poll_events() failed: errno=%i", errno);
>> +        return false;
>> +    }
>> +
>> +    return (poll_fd.revents & POLLIN) != 0;
>> +}
>> -- 
>> 2.25.1
>>
> 


-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com



* Re: [PATCH v4 3/6] support UFFD write fault processing in ram_save_iterate()
  2020-11-27 21:49   ` Peter Xu
@ 2020-11-29 21:14     ` Andrey Gruzdev
  2020-11-30 16:32       ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Andrey Gruzdev @ 2020-11-29 21:14 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster

On 28.11.2020 00:49, Peter Xu wrote:
> On Thu, Nov 26, 2020 at 06:17:31PM +0300, Andrey Gruzdev wrote:
>> In this particular implementation the same single migration
>> thread is responsible for both normal linear dirty page
>> migration and processing UFFD page fault events.
>>
>> Processing write faults includes reading UFFD file descriptor,
>> finding respective RAM block and saving faulting page to
>> the migration stream. After page has been saved, write protection
>> can be removed. Since asynchronous version of qemu_put_buffer()
>> is expected to be used to save pages, we also have to flush
>> migration stream prior to un-protecting saved memory range.
>>
>> Write protection is being removed for any previously protected
>> memory chunk that has hit the migration stream. That's valid
>> for pages from linear page scan along with write fault pages.
> 
> Thanks for working on this version, it looks much cleaner.
> >>
>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
>> ---
>>   migration/ram.c | 155 +++++++++++++++++++++++++++++++++++++++++++++---
>>   1 file changed, 147 insertions(+), 8 deletions(-)
>>
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 3adfd1948d..bcdccdaef7 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -1441,6 +1441,76 @@ static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset)
>>       return block;
>>   }
>>   
>> +#ifdef CONFIG_LINUX
>> +/**
>> + * ram_find_block_by_host_address: find RAM block containing host page
>> + *
>> + * Returns pointer to RAMBlock if found, NULL otherwise
>> + *
>> + * @rs: current RAM state
>> + * @page_address: host page address
>> + */
>> +static RAMBlock *ram_find_block_by_host_address(RAMState *rs, hwaddr page_address)
> 
> Reuse qemu_ram_block_from_host() somehow?
> 

Seems not very suitable here, since we use rs->last_seen_block to
restart the search.

>> +{
>> +    RAMBlock *bs = rs->last_seen_block;
>> +
>> +    do {
>> +        if (page_address >= (hwaddr) bs->host && (page_address + TARGET_PAGE_SIZE) <=
>> +                ((hwaddr) bs->host + bs->max_length)) {
>> +            return bs;
>> +        }
>> +
>> +        bs = QLIST_NEXT_RCU(bs, next);
>> +        if (!bs) {
>> +            /* Hit the end of the list */
>> +            bs = QLIST_FIRST_RCU(&ram_list.blocks);
>> +        }
>> +    } while (bs != rs->last_seen_block);
>> +
>> +    return NULL;
>> +}
>> +
>> +/**
>> + * poll_fault_page: try to get next UFFD write fault page and, if pending fault
>> + *   is found, return RAM block pointer and page offset
>> + *
>> + * Returns pointer to the RAMBlock containing faulting page,
>> + *   NULL if no write faults are pending
>> + *
>> + * @rs: current RAM state
>> + * @offset: page offset from the beginning of the block
>> + */
>> +static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset)
>> +{
>> +    struct uffd_msg uffd_msg;
>> +    hwaddr page_address;
>> +    RAMBlock *bs;
>> +    int res;
>> +
>> +    if (!migrate_background_snapshot()) {
>> +        return NULL;
>> +    }
>> +
>> +    res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1);
>> +    if (res <= 0) {
>> +        return NULL;
>> +    }
>> +
>> +    page_address = uffd_msg.arg.pagefault.address;
>> +    bs = ram_find_block_by_host_address(rs, page_address);
>> +    if (!bs) {
>> +        /* In case we couldn't find respective block, just unprotect faulting page. */
>> +        uffd_protect_memory(rs->uffdio_fd, page_address, TARGET_PAGE_SIZE, false);
>> +        error_report("ram_find_block_by_host_address() failed: address=0x%0lx",
>> +                page_address);
> 
> Looks ok to error_report() instead of assert(), but I'll suggest drop the call
> to uffd_protect_memory() at least.  The only reason to not use assert() is
> because we try our best to avoid crashing the vm, however I really doubt
> whether uffd_protect_memory() is the right thing to do even if it happens - we
> may at last try to unprotect some strange pages that we don't even know where
> it is...
> 

IMHO better to unprotect these strange pages than to leave them 
protected by UFFD, to avoid getting the VM completely inoperable.
At least we know the page generated a wr-fault, maybe due to incorrect 
write-tracking initialization, or the RAMBlock has somehow gone. 
Nevertheless, if we leave the page as is, the VM would certainly lock up.

Hmm, I wonder about assert(). Does it actually do anything in QEMU 
release builds?

>> +        return NULL;
>> +    }
>> +
>> +    *offset = (ram_addr_t) (page_address - (hwaddr) bs->host);
>> +    return bs;
>> +}
>> +#endif /* CONFIG_LINUX */
>> +
>>   /**
>>    * get_queued_page: unqueue a page from the postcopy requests
>>    *
>> @@ -1480,6 +1550,16 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
>>   
>>       } while (block && !dirty);
>>   
>> +#ifdef CONFIG_LINUX
>> +    if (!block) {
>> +        /*
>> +         * Poll write faults too if background snapshot is enabled; that's
>> +         * when we have vcpus got blocked by the write protected pages.
>> +         */
>> +        block = poll_fault_page(rs, &offset);
>> +    }
>> +#endif /* CONFIG_LINUX */
>> +
>>       if (block) {
>>           /*
>>            * As soon as we start servicing pages out of order, then we have
>> @@ -1753,6 +1833,55 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
>>       return pages;
>>   }
>>   
>> +/**
>> + * ram_save_host_page_pre: ram_save_host_page() pre-notifier
>> + *
>> + * @rs: current RAM state
>> + * @pss: page-search-status structure
>> + * @opaque: pointer to receive opaque context value
>> + */
>> +static inline
>> +void ram_save_host_page_pre(RAMState *rs, PageSearchStatus *pss, void **opaque)
>> +{
>> +    *(ram_addr_t *) opaque = pss->page;
>> +}
>> +
>> +/**
>> + * ram_save_host_page_post: ram_save_host_page() post-notifier
>> + *
>> + * @rs: current RAM state
>> + * @pss: page-search-status structure
>> + * @opaque: opaque context value
>> + * @res_override: pointer to the return value of ram_save_host_page(),
>> + *   overwritten in case of an error
>> + */
>> +static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss,
>> +        void *opaque, int *res_override)
>> +{
>> +    /* Check if page is from UFFD-managed region. */
>> +    if (pss->block->flags & RAM_UF_WRITEPROTECT) {
>> +#ifdef CONFIG_LINUX
>> +        ram_addr_t page_from = (ram_addr_t) opaque;
>> +        hwaddr page_address = (hwaddr) pss->block->host +
>> +                              ((hwaddr) page_from << TARGET_PAGE_BITS);
> 
> I feel like most new uses of hwaddr is not correct...  As I also commented in
> the other patch.  We should save a lot of castings if switched.
> 

Need to check. hwaddr is typedef'ed as uint64_t while ram_addr_t is 
uintptr_t. In any case, the UFFD interface relies on the u64 type.
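For what it's worth, the expressions in question reduce to plain 64-bit arithmetic; a minimal sketch with made-up constants (TARGET_PAGE_BITS here is just an assumed 4K page, not the real QEMU macro):

```c
#include <stdint.h>

/* Assumed 4K target page; stand-in for the real QEMU macros. */
#define TARGET_PAGE_BITS 12
#define TARGET_PAGE_SIZE (1ULL << TARGET_PAGE_BITS)

/* Host address of a given page inside a block mapped at block_host. */
static inline uint64_t page_host_address(uint64_t block_host, uint64_t page_index)
{
    return block_host + (page_index << TARGET_PAGE_BITS);
}

/* Byte length of the run of pages [first_page, last_page], mirroring
 * the (pss->page - page_from + 1) << TARGET_PAGE_BITS expression. */
static inline uint64_t run_length_bytes(uint64_t first_page, uint64_t last_page)
{
    return (last_page - first_page + 1) << TARGET_PAGE_BITS;
}
```

With uint64_t throughout, none of the hwaddr/ram_addr_t castings would be needed.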

>> +        hwaddr run_length = (hwaddr) (pss->page - page_from + 1) << TARGET_PAGE_BITS;
>> +        int res;
>> +
>> +        /* Flush async buffers before un-protect. */
>> +        qemu_fflush(rs->f);
>> +        /* Un-protect memory range. */
>> +        res = uffd_protect_memory(rs->uffdio_fd, page_address, run_length, false);
>> +        /* We don't want to override existing error from ram_save_host_page(). */
>> +        if (res < 0 && *res_override >= 0) {
>> +            *res_override = res;
> 
> What is this used for?  If res<0, I thought we should just fail the snapshot.
> 
> Meanwhile, res_override points to "pages", and then it'll be rewrite to the
> errno returned by uffd_protect_memory().  Smells strange.
> 
> Can this ever be triggered anyway?
> 

Yes, since "pages" is also used for error return, when negative. If we have a 
problem with un-protecting, promote the error to the loop in 
ram_find_and_save_block() so it exits early ("pages" is guaranteed to be 
non-zero). In the outer routines the retcode would go to qemu_file_set_error().
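To make that convention concrete, a stripped-down sketch of the propagation (mock functions with hypothetical names, not the QEMU code):

```c
#include <errno.h>

/* Mock of the save routine: returns pages saved (> 0) on success. */
static int save_host_page_mock(void)
{
    return 4;
}

/* Mock of the un-protect step: returns 0 or a negative errno. */
static int unprotect_mock(int fail)
{
    return fail ? -EIO : 0;
}

/* Sketch of the loop body: a negative "pages" value, whether from the
 * save routine or from the post-notifier, makes the caller exit early. */
static int find_and_save_block_mock(int unprotect_fails)
{
    int pages = save_host_page_mock();
    int res = unprotect_mock(unprotect_fails);

    /* Don't override an existing error from the save routine. */
    if (res < 0 && pages >= 0) {
        pages = res;
    }
    return pages;
}
```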

>> +        }
>> +#else
>> +        /* Should never happen */
>> +        qemu_file_set_error(rs->f, -ENOSYS);
>> +#endif /* CONFIG_LINUX */
>> +    }
>> +}
>> +
>>   /**
>>    * ram_find_and_save_block: finds a dirty page and sends it to f
>>    *
>> @@ -1779,14 +1908,14 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
>>           return pages;
>>       }
>>   
>> +    if (!rs->last_seen_block) {
>> +        rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks);
> 
> Why setup the last seen block to be the first if null?
> 

Because ram_find_block_by_host_address() relies on rs->last_seen_block.
Pss is not passed to that routine.

>> +    }
>> +
>>       pss.block = rs->last_seen_block;
>>       pss.page = rs->last_page;
>>       pss.complete_round = false;
>>   
>> -    if (!pss.block) {
>> -        pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
>> -    }
>> -
>>       do {
>>           again = true;
>>           found = get_queued_page(rs, &pss);
>> @@ -1797,7 +1926,11 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
>>           }
>>   
>>           if (found) {
>> +            void *opaque;
>> +
>> +            ram_save_host_page_pre(rs, &pss, &opaque);
>>               pages = ram_save_host_page(rs, &pss, last_stage);
>> +            ram_save_host_page_post(rs, &pss, opaque, &pages);
> 
> So the pre/post idea is kind of an overkill to me...
> 
> How about we do the unprotect in ram_save_host_page() in the simple way, like:
> 
>    ram_save_host_page()
>      start_addr = pss->page;
>      do {
>        ...
>        (migrate pages)
>        ...
>      } while (...);
>      if (background_snapshot_enabled()) {
>        unprotect pages within start_addr to pss->page;
>      }
>      ...
> 

Personally I prefer not to modify existing code. Maybe adding two simple 
calls would finally look cleaner?

>>           }
>>       } while (!pages && again);
>>   
>> @@ -3864,9 +3997,12 @@ fail:
>>       rs->uffdio_fd = -1;
>>       return -1;
>>   #else
>> +    /*
>> +     * Should never happen since we prohibit 'background-snapshot'
>> +     * capability on non-Linux hosts.
> 
> Yeah, yeah. So let's drop these irrelevant changes? :)
> 

Yes.

>> +     */
>>       rs->uffdio_fd = -1;
>> -    error_setg(&migrate_get_current()->error,
>> -            "Background-snapshot not supported on non-Linux hosts");
>> +    error_setg(&migrate_get_current()->error, QERR_UNDEFINED_ERROR);
>>       return -1;
>>   #endif /* CONFIG_LINUX */
>>   }
>> @@ -3903,8 +4039,11 @@ void ram_write_tracking_stop(void)
>>       uffd_close_fd(rs->uffdio_fd);
>>       rs->uffdio_fd = -1;
>>   #else
>> -    error_setg(&migrate_get_current()->error,
>> -            "Background-snapshot not supported on non-Linux hosts");
>> +    /*
>> +     * Should never happen since we prohibit 'background-snapshot'
>> +     * capability on non-Linux hosts.
>> +     */
>> +    error_setg(&migrate_get_current()->error, QERR_UNDEFINED_ERROR);
> 
> Same here.
> 
> Thanks,
> 
>>   #endif /* CONFIG_LINUX */
>>   }
>>   
>> -- 
>> 2.25.1
>>
> 


-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 0/6] UFFD write-tracking migration/snapshots
  2020-11-27 22:04 ` Peter Xu
@ 2020-11-30  8:07   ` Andrey Gruzdev
  0 siblings, 0 replies; 45+ messages in thread
From: Andrey Gruzdev @ 2020-11-30  8:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster

On 28.11.2020 01:04, Peter Xu wrote:
> On Thu, Nov 26, 2020 at 06:17:28PM +0300, Andrey Gruzdev wrote:
>> Changes v3->v4:
>>
>> * 1. Renamed migrate capability 'track-writes-ram'->'background-snapshot'.
>> * 2. Use array of incompatible caps to replace bulky 'if' constructs.
>> * 3. Moved UFFD low-level code to the separate module ('util/userfaultfd.c').
>> * 4. Always do UFFD wr-unprotect on cleanup; just closing file descriptor
>> *    won't cleanup PTEs anyhow, it will release registration ranges, wait
>> *    queues etc. but won't cleanup process MM context on MMU level.
>> * 5. Allow to enable 'background-snapshot' capability on Linux-only hosts.
>> * 6. Put UFFD code usage under '#ifdef CONFIG_LINUX' prerequisite.
>> * 7. Removed 'wt_' from RAMState struct.
>> * 8. Refactored ram_find_and_save_block() to make more clean - poll UFFD
>> *    wr-fault events in get_queued_page(), use ram_save_host_page_pre(),
>> *    ram_save_host_page_post() notifiers around ram_save_host_page()
>> *    instead of bulky inline write-unprotect code.
> 
> One thing I mentioned previously but it seems still got lost is that we don't
> need dirty tracking for live snapshot.
> 
> A few pointers for reference:
> 
>    memory_global_dirty_log_start()
>    migration_bitmap_sync_precopy()
>    memory_region_clear_dirty_bitmap()
>    ...
> 
> These should not be needed.  But this can also be done on top.
> 
> Thanks,
> 

Yes, all of this is really not needed. At least it seems that enabled 
dirty page logging in KVM doesn't add too much overhead and 
migration_bitmap_sync_precopy() is called only once. Let's do it on top.

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 5/6] the rest of write tracking migration code
  2020-11-27 22:26   ` Peter Xu
@ 2020-11-30  8:09     ` Andrey Gruzdev
  0 siblings, 0 replies; 45+ messages in thread
From: Andrey Gruzdev @ 2020-11-30  8:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster

On 28.11.2020 01:26, Peter Xu wrote:
> On Thu, Nov 26, 2020 at 06:17:33PM +0300, Andrey Gruzdev wrote:
>> Implementation of bg_migration- iteration/completion/finish.
>>
>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> 
> Both patch 4 & 5 look good to me in general.  Maybe you can even squash this
> patch into 4?  It's not scarily huge as a single patch, but a complete one.
> 

Yep, agree.

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 6/6] introduce simple linear scan rate limiting mechanism
  2020-11-27 22:28   ` Peter Xu
@ 2020-11-30  8:11     ` Andrey Gruzdev
  2020-11-30 16:40       ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Andrey Gruzdev @ 2020-11-30  8:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster

On 28.11.2020 01:28, Peter Xu wrote:
> On Thu, Nov 26, 2020 at 06:17:34PM +0300, Andrey Gruzdev wrote:
>> Since reading UFFD events and saving paged data are performed
>> from the same thread, write fault latencies are sensitive to
>> migration stream stalls. Limiting total page saving rate is a
>> method to reduce amount of noticeable fault resolution latencies.
>>
>> Migration bandwidth limiting is achieved via noticing cases of
>> out-of-threshold write fault latencies and temporarily disabling
>> (strictly speaking, severely throttling) saving non-faulting pages.
> 
> So have you done any measurements out of it, as we've talked in previous
> version?  Thanks,
> 

Sorry, not done yet.
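For the record, the mechanism described in the commit message can be sketched roughly like this (all names and thresholds are invented for illustration, not the patch's actual values):

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the rate-limiting idea: when a write fault took longer than
 * a threshold to resolve, stop saving non-faulting pages for a while so
 * the migration stream can drain and faults resolve quickly again. */
typedef struct {
    uint64_t latency_threshold_us;  /* acceptable fault resolution time */
    uint64_t throttle_until_us;     /* timestamp until which we throttle */
    uint64_t throttle_period_us;    /* duration of one throttling episode */
} ScanLimiterMock;

/* Record how long the last fault took to resolve. */
static void limiter_note_fault(ScanLimiterMock *l, uint64_t now_us,
                               uint64_t fault_latency_us)
{
    if (fault_latency_us > l->latency_threshold_us) {
        l->throttle_until_us = now_us + l->throttle_period_us;
    }
}

/* True if linear (non-faulting) page saving should be skipped now. */
static bool limiter_throttled(const ScanLimiterMock *l, uint64_t now_us)
{
    return now_us < l->throttle_until_us;
}
```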

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 2/6] introduce UFFD-WP low-level interface helpers
  2020-11-29 20:12     ` Andrey Gruzdev
@ 2020-11-30 15:34       ` Peter Xu
  2020-11-30 18:41         ` Andrey Gruzdev
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2020-11-30 15:34 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Sun, Nov 29, 2020 at 11:12:10PM +0300, Andrey Gruzdev wrote:
> > > +void ram_write_tracking_stop(void)
> > > +{
> > > +#ifdef CONFIG_LINUX
> > > +    RAMState *rs = ram_state;
> > > +    RAMBlock *bs;
> > > +    assert(rs->uffdio_fd >= 0);
> > 
> > Maybe too harsh - we can return if it's invalid.
> > 
> > Meanwhile, better rcu_read_lock(), as well?
> > 
> 
> Yep, RCU lock, I'll add. Why too harsh? Just a debug assertion.

I was afraid some special path could trigger ram_write_tracking_stop() being
called before ram_write_tracking_start(), then vm could crash.  If we can
guarantee that not happening, then it's also ok with assert().

[...]

> > > +/**
> > > + * uffd_poll_events: poll UFFD file descriptor for read
> > > + *
> > > + * Returns true if events are available for read, false otherwise
> > > + *
> > > + * @uffd: UFFD file descriptor
> > > + * @tmo: timeout in milliseconds, 0 for non-blocking operation,
> > > + *       negative value for infinite wait
> > > + */
> > > +bool uffd_poll_events(int uffd, int tmo)
> > 
> > Shall we spell "tmo" out?
> In the comment? I think it's ok.

I'd suggest to spell it out everywhere, especially in the codes.  But feel free
to take your own preference.  Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 3/6] support UFFD write fault processing in ram_save_iterate()
  2020-11-29 21:14     ` Andrey Gruzdev
@ 2020-11-30 16:32       ` Peter Xu
  2020-11-30 19:27         ` Andrey Gruzdev
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2020-11-30 16:32 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Mon, Nov 30, 2020 at 12:14:03AM +0300, Andrey Gruzdev wrote:
> > > +#ifdef CONFIG_LINUX
> > > +/**
> > > + * ram_find_block_by_host_address: find RAM block containing host page
> > > + *
> > > + * Returns pointer to RAMBlock if found, NULL otherwise
> > > + *
> > > + * @rs: current RAM state
> > > + * @page_address: host page address
> > > + */
> > > +static RAMBlock *ram_find_block_by_host_address(RAMState *rs, hwaddr page_address)
> > 
> > Reuse qemu_ram_block_from_host() somehow?
> > 
> 
> Seems not very suitable here, since we use rs->last_seen_block to restart
> search..

The search logic is actually still the same, it's just about which block to
start searching (rs->last_seen_block, ram_list.mru_block, or the 1st one).  So
an internal helper to pass in that information would be nice.  Though that'll
require rcu lock taken before hand to keep the ramblock ptr valid.

No worry - we can keep this and work on top too, I think.

[...]

> > > +static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset)
> > > +{
> > > +    struct uffd_msg uffd_msg;
> > > +    hwaddr page_address;
> > > +    RAMBlock *bs;
> > > +    int res;
> > > +
> > > +    if (!migrate_background_snapshot()) {
> > > +        return NULL;
> > > +    }
> > > +
> > > +    res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1);
> > > +    if (res <= 0) {
> > > +        return NULL;
> > > +    }
> > > +
> > > +    page_address = uffd_msg.arg.pagefault.address;
> > > +    bs = ram_find_block_by_host_address(rs, page_address);
> > > +    if (!bs) {
> > > +        /* In case we couldn't find respective block, just unprotect faulting page. */
> > > +        uffd_protect_memory(rs->uffdio_fd, page_address, TARGET_PAGE_SIZE, false);
> > > +        error_report("ram_find_block_by_host_address() failed: address=0x%0lx",
> > > +                page_address);
> > 
> > Looks ok to error_report() instead of assert(), but I'll suggest drop the call
> > to uffd_protect_memory() at least.  The only reason to not use assert() is
> > because we try our best to avoid crashing the vm, however I really doubt
> > whether uffd_protect_memory() is the right thing to do even if it happens - we
> > may at last try to unprotect some strange pages that we don't even know where
> > it is...
> > 
> 
> IMHO better to unprotect these strange pages than to leave them protected by
> UFFD, to avoid getting the VM completely inoperable.
> At least we know the page generated a wr-fault, maybe due to incorrect
> write-tracking initialization, or the RAMBlock has somehow gone. Nevertheless,
> if we leave the page as is, the VM would certainly lock up.

> Yes makes some sense too.  However it's definitely a severe programming error,
even if the VM is unblocked, it can be in a very weird state...

Maybe we just assert? Then we can see how unlucky we'll be. :)

> 
> Hmm, I wonder about assert(). Does it actually do anything in QEMU release
> builds?

I guess yes, at least for now.  Because osdep.h has this:

/*
 * We have a lot of unaudited code that may fail in strange ways, or
 * even be a security risk during migration, if you disable assertions
 * at compile-time.  You may comment out these safety checks if you
 * absolutely want to disable assertion overhead, but it is not
 * supported upstream so the risk is all yours.  Meanwhile, please
 * submit patches to remove any side-effects inside an assertion, or
 * fixing error handling that should use Error instead of assert.
 */
#ifdef NDEBUG
#error building with NDEBUG is not supported
#endif
#ifdef G_DISABLE_ASSERT
#error building with G_DISABLE_ASSERT is not supported
#endif

[...]

> > > +/**
> > > + * ram_save_host_page_post: ram_save_host_page() post-notifier
> > > + *
> > > + * @rs: current RAM state
> > > + * @pss: page-search-status structure
> > > + * @opaque: opaque context value
> > > + * @res_override: pointer to the return value of ram_save_host_page(),
> > > + *   overwritten in case of an error
> > > + */
> > > +static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss,
> > > +        void *opaque, int *res_override)
> > > +{
> > > +    /* Check if page is from UFFD-managed region. */
> > > +    if (pss->block->flags & RAM_UF_WRITEPROTECT) {
> > > +#ifdef CONFIG_LINUX
> > > +        ram_addr_t page_from = (ram_addr_t) opaque;
> > > +        hwaddr page_address = (hwaddr) pss->block->host +
> > > +                              ((hwaddr) page_from << TARGET_PAGE_BITS);
> > 
> > I feel like most new uses of hwaddr is not correct...  As I also commented in
> > the other patch.  We should save a lot of castings if switched.
> > 
> 
> Need to check. hwaddr is typedef'ed as uint64_t while ram_addr_t is
> uintptr_t. In any case, the UFFD interface relies on the u64 type.

For example, page_from should be a host address, we can use unsigned long, or
uint64_t, but ram_addr_t is not proper, which is only used in ramblock address
space.

I think it's fine that e.g. we use common types like uint64_t directly.

> 
> > > +        hwaddr run_length = (hwaddr) (pss->page - page_from + 1) << TARGET_PAGE_BITS;
> > > +        int res;
> > > +
> > > +        /* Flush async buffers before un-protect. */
> > > +        qemu_fflush(rs->f);
> > > +        /* Un-protect memory range. */
> > > +        res = uffd_protect_memory(rs->uffdio_fd, page_address, run_length, false);
> > > +        /* We don't want to override existing error from ram_save_host_page(). */
> > > +        if (res < 0 && *res_override >= 0) {
> > > +            *res_override = res;
> > 
> > What is this used for?  If res<0, I thought we should just fail the snapshot.
> > 
> > Meanwhile, res_override points to "pages", and then it'll be rewrite to the
> > errno returned by uffd_protect_memory().  Smells strange.
> > 
> > Can this ever be triggered anyway?
> > 
> 
> Yes, since "pages" is also used for error return, when negative. If we have a
> problem with un-protecting, promote the error to the loop in
> ram_find_and_save_block() so it exits early ("pages" is guaranteed to be
> non-zero). In the outer routines the retcode would go to qemu_file_set_error().

Ok I see your point.

> 
> > > +        }
> > > +#else
> > > +        /* Should never happen */
> > > +        qemu_file_set_error(rs->f, -ENOSYS);
> > > +#endif /* CONFIG_LINUX */
> > > +    }
> > > +}
> > > +
> > >   /**
> > >    * ram_find_and_save_block: finds a dirty page and sends it to f
> > >    *
> > > @@ -1779,14 +1908,14 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
> > >           return pages;
> > >       }
> > > +    if (!rs->last_seen_block) {
> > > +        rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks);
> > 
> > Why setup the last seen block to be the first if null?
> > 
> 
> Because ram_find_block_by_host_address() relies on rs->last_seen_block.
> Pss is not passed to that routine.

Ok, but it's odd - because that's not the real "last seen block".

So now I'm rethinking maybe we could still reuse qemu_ram_block_from_host() by
providing a new helper to do the search logic as mentioned above, which could
take a pointer of RAMBlock as the 1st ramblock to search.  Then
ram_find_block_by_host_address() should pass in rs->last_seen_block (which
could be NULL - then we'll start with the 1st ramblock), or ram_list.mru_block
for qemu_ram_block_from_host().  That could be a standalone new patch, then we
don't need this slightly strange change.  What do you think?
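A rough sketch of such a helper, with the block list mocked as a hand-linked ring (names here are hypothetical stand-ins, not the QEMU types):

```c
#include <stddef.h>
#include <stdint.h>

/* Mock RAMBlock: a circularly linked list of host ranges. */
typedef struct BlockMock {
    uint64_t host;              /* start of the mapped range */
    uint64_t max_length;        /* length of the range in bytes */
    struct BlockMock *next;
} BlockMock;

/* Find the block containing [address, address + page_size), starting
 * the scan at a caller-chosen block (e.g. last_seen_block or mru_block)
 * and wrapping around once before giving up. */
static BlockMock *find_block_from(BlockMock *start, uint64_t address,
                                  uint64_t page_size)
{
    BlockMock *b = start;

    do {
        if (address >= b->host &&
            address + page_size <= b->host + b->max_length) {
            return b;
        }
        b = b->next;
    } while (b != start);

    return NULL;                /* wrapped around without a match */
}
```

The caller would pick the starting block, so both the last-seen-block and the mru-block use cases share one search loop (RCU locking left out of the sketch).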

> 
> > > +    }
> > > +
> > >       pss.block = rs->last_seen_block;
> > >       pss.page = rs->last_page;
> > >       pss.complete_round = false;
> > > -    if (!pss.block) {
> > > -        pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
> > > -    }
> > > -
> > >       do {
> > >           again = true;
> > >           found = get_queued_page(rs, &pss);
> > > @@ -1797,7 +1926,11 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
> > >           }
> > >           if (found) {
> > > +            void *opaque;
> > > +
> > > +            ram_save_host_page_pre(rs, &pss, &opaque);
> > >               pages = ram_save_host_page(rs, &pss, last_stage);
> > > +            ram_save_host_page_post(rs, &pss, opaque, &pages);
> > 
> > So the pre/post idea is kind of an overkill to me...
> > 
> > How about we do the unprotect in ram_save_host_page() in the simple way, like:
> > 
> >    ram_save_host_page()
> >      start_addr = pss->page;
> >      do {
> >        ...
> >        (migrate pages)
> >        ...
> >      } while (...);
> >      if (background_snapshot_enabled()) {
> >        unprotect pages within start_addr to pss->page;
> >      }
> >      ...
> > 
> 
> Personally I prefer not to modify existing code. Maybe adding two simple
> calls would finally look cleaner?

Let's wait a 3rd opinion from Juan/Dave or others.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 6/6] introduce simple linear scan rate limiting mechanism
  2020-11-30  8:11     ` Andrey Gruzdev
@ 2020-11-30 16:40       ` Peter Xu
  2020-11-30 19:30         ` Andrey Gruzdev
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2020-11-30 16:40 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Mon, Nov 30, 2020 at 11:11:00AM +0300, Andrey Gruzdev wrote:
> On 28.11.2020 01:28, Peter Xu wrote:
> > On Thu, Nov 26, 2020 at 06:17:34PM +0300, Andrey Gruzdev wrote:
> > > Since reading UFFD events and saving paged data are performed
> > > from the same thread, write fault latencies are sensitive to
> > > migration stream stalls. Limiting total page saving rate is a
> > > method to reduce amount of noticeable fault resolution latencies.
> > > 
> > > Migration bandwidth limiting is achieved via noticing cases of
> > > out-of-threshold write fault latencies and temporarily disabling
> > > (strictly speaking, severely throttling) saving non-faulting pages.
> > 
> > So have you done any measurements out of it, as we've talked in previous
> > version?  Thanks,
> > 
> 
> Sorry, not done yet.

So do you still plan to? :)

And if not, could you describe the rational behind this patch?  For example,
what's the problem behind (e.g., guest hangs for xxx seconds, maybe?) and
what's the outcome (guest doesn't hang any more)?

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 2/6] introduce UFFD-WP low-level interface helpers
  2020-11-30 15:34       ` Peter Xu
@ 2020-11-30 18:41         ` Andrey Gruzdev
  0 siblings, 0 replies; 45+ messages in thread
From: Andrey Gruzdev @ 2020-11-30 18:41 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster

On 30.11.2020 18:34, Peter Xu wrote:
> On Sun, Nov 29, 2020 at 11:12:10PM +0300, Andrey Gruzdev wrote:
>>>> +void ram_write_tracking_stop(void)
>>>> +{
>>>> +#ifdef CONFIG_LINUX
>>>> +    RAMState *rs = ram_state;
>>>> +    RAMBlock *bs;
>>>> +    assert(rs->uffdio_fd >= 0);
>>>
>>> Maybe too harsh - we can return if it's invalid.
>>>
>>> Meanwhile, better rcu_read_lock(), as well?
>>>
>>
>> Yep, RCU lock, I'll add. Why too harsh? Just a debug assertion.
> 
> I was afraid some special path could trigger ram_write_tracking_stop() being
> called before ram_write_tracking_start(), then vm could crash.  If we can
> guarantee that not happening, then it's also ok with assert().
> 
> [...]
> 

It can't really happen, but I agree that assert() is a bit excessive here.

>>>> +/**
>>>> + * uffd_poll_events: poll UFFD file descriptor for read
>>>> + *
>>>> + * Returns true if events are available for read, false otherwise
>>>> + *
>>>> + * @uffd: UFFD file descriptor
>>>> + * @tmo: timeout in milliseconds, 0 for non-blocking operation,
>>>> + *       negative value for infinite wait
>>>> + */
>>>> +bool uffd_poll_events(int uffd, int tmo)
>>>
>>> Shall we spell "tmo" out?
>> In the comment? I think it's ok.
> 
> I'd suggest to spell it out everywhere, especially in the codes.  But feel free
> to take your own preference.  Thanks,
> 

Thanks,

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 3/6] support UFFD write fault processing in ram_save_iterate()
  2020-11-30 16:32       ` Peter Xu
@ 2020-11-30 19:27         ` Andrey Gruzdev
  0 siblings, 0 replies; 45+ messages in thread
From: Andrey Gruzdev @ 2020-11-30 19:27 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster

On 30.11.2020 19:32, Peter Xu wrote:
> On Mon, Nov 30, 2020 at 12:14:03AM +0300, Andrey Gruzdev wrote:
>>>> +#ifdef CONFIG_LINUX
>>>> +/**
>>>> + * ram_find_block_by_host_address: find RAM block containing host page
>>>> + *
>>>> + * Returns pointer to RAMBlock if found, NULL otherwise
>>>> + *
>>>> + * @rs: current RAM state
>>>> + * @page_address: host page address
>>>> + */
>>>> +static RAMBlock *ram_find_block_by_host_address(RAMState *rs, hwaddr page_address)
>>>
>>> Reuse qemu_ram_block_from_host() somehow?
>>>
>>
>> Seems not very suitable here, since we use rs->last_seen_block to restart
>> search..
> 
> The search logic is actually still the same, it's just about which block to
> start searching (rs->last_seen_block, ram_list.mru_block, or the 1st one).  So
> an internal helper to pass in that information would be nice.  Though that'll
> require rcu lock taken before hand to keep the ramblock ptr valid.
> 
> No worry - we can keep this and work on top too, I think.
> 
> [...]
>  >>>> +static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset)
>>>> +{
>>>> +    struct uffd_msg uffd_msg;
>>>> +    hwaddr page_address;
>>>> +    RAMBlock *bs;
>>>> +    int res;
>>>> +
>>>> +    if (!migrate_background_snapshot()) {
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>> +    res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1);
>>>> +    if (res <= 0) {
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>> +    page_address = uffd_msg.arg.pagefault.address;
>>>> +    bs = ram_find_block_by_host_address(rs, page_address);
>>>> +    if (!bs) {
>>>> +        /* In case we couldn't find respective block, just unprotect faulting page. */
>>>> +        uffd_protect_memory(rs->uffdio_fd, page_address, TARGET_PAGE_SIZE, false);
>>>> +        error_report("ram_find_block_by_host_address() failed: address=0x%0lx",
>>>> +                page_address);
>>>
>>> Looks ok to error_report() instead of assert(), but I'll suggest drop the call
>>> to uffd_protect_memory() at least.  The only reason to not use assert() is
>>> because we try our best to avoid crashing the vm, however I really doubt
>>> whether uffd_protect_memory() is the right thing to do even if it happens - we
>>> may at last try to unprotect some strange pages that we don't even know where
>>> it is...
>>>
>>
>> IMHO better to unprotect these strange pages than to leave them protected by
>> UFFD, to avoid getting the VM completely inoperable.
>> At least we know the page generated a wr-fault, maybe due to incorrect
>> write-tracking initialization, or the RAMBlock has somehow gone. Nevertheless,
>> if we leave the page as is, the VM would certainly lock up.
> 
> Yes makes some sense too.  However it's definitely a severe programming error,
> even if the VM is unblocked, it can be in a very weird state...
> 
> Maybe we just assert? Then we can see how unlucky we'll be. :)
> 

Yeah, seems assert is the right thing here :)

>>
>> Hmm, I wonder about assert(). Does it actually do anything in QEMU release
>> builds?
> 
> I guess yes, at least for now.  Because osdep.h has this:
> 
> /*
>   * We have a lot of unaudited code that may fail in strange ways, or
>   * even be a security risk during migration, if you disable assertions
>   * at compile-time.  You may comment out these safety checks if you
>   * absolutely want to disable assertion overhead, but it is not
>   * supported upstream so the risk is all yours.  Meanwhile, please
>   * submit patches to remove any side-effects inside an assertion, or
>   * fixing error handling that should use Error instead of assert.
>   */
> #ifdef NDEBUG
> #error building with NDEBUG is not supported
> #endif
> #ifdef G_DISABLE_ASSERT
> #error building with G_DISABLE_ASSERT is not supported
> #endif
> 
> [...]
> 

Ooops, didn't know that, thanks for info!

>>>> +/**
>>>> + * ram_save_host_page_post: ram_save_host_page() post-notifier
>>>> + *
>>>> + * @rs: current RAM state
>>>> + * @pss: page-search-status structure
>>>> + * @opaque: opaque context value
>>>> + * @res_override: pointer to the return value of ram_save_host_page(),
>>>> + *   overwritten in case of an error
>>>> + */
>>>> +static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss,
>>>> +        void *opaque, int *res_override)
>>>> +{
>>>> +    /* Check if page is from UFFD-managed region. */
>>>> +    if (pss->block->flags & RAM_UF_WRITEPROTECT) {
>>>> +#ifdef CONFIG_LINUX
>>>> +        ram_addr_t page_from = (ram_addr_t) opaque;
>>>> +        hwaddr page_address = (hwaddr) pss->block->host +
>>>> +                              ((hwaddr) page_from << TARGET_PAGE_BITS);
>>>
>>> I feel like most new uses of hwaddr is not correct...  As I also commented in
>>> the other patch.  We should save a lot of castings if switched.
>>>
>>
>> Need to check. hwaddr is typedef'ed as uint64_t while ram_addr_t is
>> uintptr_t. In any case, the UFFD interface relies on the u64 type.
> 
> For example, page_from should be a host address, we can use unsigned long, or
> uint64_t, but ram_addr_t is not proper, which is only used in ramblock address
> space.
> 
> I think it's fine that e.g. we use common types like uint64_t directly.
> 

Now I also think we'd better use common types - mostly uint64_t 
directly, so as not to confuse anybody with those specific type names.

>>
>>>> +        hwaddr run_length = (hwaddr) (pss->page - page_from + 1) << TARGET_PAGE_BITS;
>>>> +        int res;
>>>> +
>>>> +        /* Flush async buffers before un-protect. */
>>>> +        qemu_fflush(rs->f);
>>>> +        /* Un-protect memory range. */
>>>> +        res = uffd_protect_memory(rs->uffdio_fd, page_address, run_length, false);
>>>> +        /* We don't want to override existing error from ram_save_host_page(). */
>>>> +        if (res < 0 && *res_override >= 0) {
>>>> +            *res_override = res;
>>>
>>> What is this used for?  If res<0, I thought we should just fail the snapshot.
>>>
>>> Meanwhile, res_override points to "pages", and then it'll be rewrite to the
>>> errno returned by uffd_protect_memory().  Smells strange.
>>>
>>> Can this ever be triggered anyway?
>>>
>>
>> Yes, since "pages" is also used for error return, if negative. If we have a
>> problem with un-protecting, promote the error to the loop in
>> ram_find_and_save_block() so it exits early ("pages" is guaranteed to be
>> non-zero). In the outer routines the retcode would go to qemu_file_set_error().
> 
> Ok I see your point.
> 
>>
>>>> +        }
>>>> +#else
>>>> +        /* Should never happen */
>>>> +        qemu_file_set_error(rs->f, -ENOSYS);
>>>> +#endif /* CONFIG_LINUX */
>>>> +    }
>>>> +}
>>>> +
>>>>    /**
>>>>     * ram_find_and_save_block: finds a dirty page and sends it to f
>>>>     *
>>>> @@ -1779,14 +1908,14 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
>>>>            return pages;
>>>>        }
>>>> +    if (!rs->last_seen_block) {
>>>> +        rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks);
>>>
>>> Why setup the last seen block to be the first if null?
>>>
>>
>> Because ram_find_block_by_host_address() relies on rs->last_seen_block.
>> Pss is not passed to that routine.
> 
> Ok, but it's odd - because that's not the real "last seen block".
> 
> So now I'm rethinking maybe we could still reuse qemu_ram_block_from_host() by
> providing a new helper to do the search logic as mentioned above, which could
> take a pointer of RAMBlock as the 1st ramblock to search.  Then
> ram_find_block_by_host_address() should pass in rs->last_seen_block (which
> could be NULL - then we'll start with the 1st ramblock), or ram_list.mru_block
> for qemu_ram_block_from_host().  That could be a standalone new patch, then we
> don't need this slightly strange change.  What do you think?
> 

Mm, need to think about that..

>>
>>>> +    }
>>>> +
>>>>        pss.block = rs->last_seen_block;
>>>>        pss.page = rs->last_page;
>>>>        pss.complete_round = false;
>>>> -    if (!pss.block) {
>>>> -        pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
>>>> -    }
>>>> -
>>>>        do {
>>>>            again = true;
>>>>            found = get_queued_page(rs, &pss);
>>>> @@ -1797,7 +1926,11 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
>>>>            }
>>>>            if (found) {
>>>> +            void *opaque;
>>>> +
>>>> +            ram_save_host_page_pre(rs, &pss, &opaque);
>>>>                pages = ram_save_host_page(rs, &pss, last_stage);
>>>> +            ram_save_host_page_post(rs, &pss, opaque, &pages);
>>>
>>> So the pre/post idea is kind of an overkill to me...
>>>
>>> How about we do the unprotect in ram_save_host_page() in the simple way, like:
>>>
>>>     ram_save_host_page()
>>>       start_addr = pss->page;
>>>       do {
>>>         ...
>>>         (migrate pages)
>>>         ...
>>>       } while (...);
>>>       if (background_snapshot_enabled()) {
>>>         unprotect pages within start_addr to pss->page;
>>>       }
>>>       ...
>>>
>>
>> Personally I prefer not to modify existing code. Maybe adding two simple
>> calls would finally look cleaner?
> 
> Let's wait a 3rd opinion from Juan/Dave or others.
> 
> Thanks,
> 

Agree.

Thanks,

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 6/6] introduce simple linear scan rate limiting mechanism
  2020-11-30 16:40       ` Peter Xu
@ 2020-11-30 19:30         ` Andrey Gruzdev
  0 siblings, 0 replies; 45+ messages in thread
From: Andrey Gruzdev @ 2020-11-30 19:30 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster

On 30.11.2020 19:40, Peter Xu wrote:
> On Mon, Nov 30, 2020 at 11:11:00AM +0300, Andrey Gruzdev wrote:
>> On 28.11.2020 01:28, Peter Xu wrote:
>>> On Thu, Nov 26, 2020 at 06:17:34PM +0300, Andrey Gruzdev wrote:
>>>> Since reading UFFD events and saving paged data are performed
>>>> from the same thread, write fault latencies are sensitive to
>>>> migration stream stalls. Limiting the total page saving rate is a
>>>> method to reduce the amount of noticeable fault resolution latencies.
>>>>
>>>> Migration bandwidth limiting is achieved via noticing cases of
>>>> out-of-threshold write fault latencies and temporarily disabling
>>>> (strictly speaking, severely throttling) saving non-faulting pages.
>>>
>>> So have you done any measurements out of it, as we've talked in previous
>>> version?  Thanks,
>>>
>>
>> Sorry, not done yet.
> 
> So do you still plan to? :)
> 
> And if not, could you describe the rational behind this patch?  For example,
> what's the problem behind (e.g., guest hangs for xxx seconds, maybe?) and
> what's the outcome (guest doesn't hang any more)?
> 
> Thanks,
> 

Yes, I think providing a latency histogram with a description of the test 
case is the right thing.

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 0/6] UFFD write-tracking migration/snapshots
  2020-11-26 15:17 [PATCH v4 0/6] UFFD write-tracking migration/snapshots Andrey Gruzdev via
                   ` (7 preceding siblings ...)
  2020-11-27 22:04 ` Peter Xu
@ 2020-12-01  7:08 ` Peter Krempa
  2020-12-01  8:42   ` Andrey Gruzdev
  8 siblings, 1 reply; 45+ messages in thread
From: Peter Krempa @ 2020-12-01  7:08 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel, Peter Xu,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
> implemented in his series '[PATCH v0 0/4] migration: add background snapshot'.

Hi,

I gave this a try when attempting to implement the libvirt code for this
feature. I ran into a problem of the migration failing right away. The
VM's CPUs were running at that point.

QEMU logged the following to stdout/err:

2020-12-01T06:50:42.334062Z qemu-system-x86_64: uffd_register_memory() failed: start=7f2007fff000 len=33554432000 mode=2 errno=22
2020-12-01T06:50:42.334072Z qemu-system-x86_64: ram_write_tracking_start() failed: restoring initial memory state
2020-12-01T06:50:42.334074Z qemu-system-x86_64: uffd_protect_memory() failed: start=7f2007fff000 len=33554432000 mode=0 errno=2
2020-12-01T06:50:42.334076Z qemu-system-x86_64: uffd_unregister_memory() failed: start=7f2007fff000 len=33554432000 errno=22


The migration was started by the following QMP conversation:


QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"xbzrle","state":false},{"capability":"auto-converge","state":false},{"capability":"rdma-pin-all","state":false},{"capability":"postcopy-ram","state":false},{"capability":"compress","state":false},{"capability":"pause-before-switchover","state":false},{"capability":"late-block-activate","state":false},{"capability":"multifd","state":false},{"capability":"background-snapshot","state":true}]},"id":"libvirt-14"}

QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": "libvirt-14"}

QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"migrate-set-parameters","arguments":{"max-bandwidth":9223372036853727232},"id":"libvirt-15"}

QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": "libvirt-15"}

QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"getfd","arguments":{"fdname":"migrate"},"id":"libvirt-16"}

QEMU_MONITOR_IO_SEND_FD: mon=0x7fff9c20c610 fd=44 ret=72 errno=0
QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": "libvirt-16"}

QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"migrate","arguments":{"detach":true,"blk":false,"inc":false,"uri":"fd:migrate"},"id":"libvirt-17"}

QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 962424}, "event": "MIGRATION", "data": {"status": "setup"}}
QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": "libvirt-17"}
QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 966306}, "event": "MIGRATION_PASS", "data": {"pass": 1}}
QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 966355}, "event": "MIGRATION", "data": {"status": "active"}}
QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 966488}, "event": "STOP"}
QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 970326}, "event": "MIGRATION", "data": {"status": "failed"}}

QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"query-migrate","id":"libvirt-18"}

QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {"status": "failed"}, "id": "libvirt-18"}
qemuMigrationJobCheckStatus:1685 : operation failed: snapshot job: unexpectedly failed

 $ uname -r
5.8.18-300.fc33.x86_64


The VM was created by libvirt with the following patchset applied:

https://gitlab.com/pipo.sk/libvirt/-/commits/background-snapshot

git fetch https://gitlab.com/pipo.sk/libvirt.git background-snapshot

Start the snapshot via:

virsh snapshot-create-as --memspec /tmp/snap.mem --diskspec sdb,snapshot=no --diskspec sda,snapshot=no --no-metadata upstream

Note you can omit --diskspec if you have a diskless VM.

The patches are VERY work in progress as I need to figure out the proper
sequencing to ensure a consistent snapshot.

Note that in cases when qemu can't guarantee that the
background_snapshot feature will work it should not advertise it. We
need a way to check whether it's possible to use it, so we can replace
the existing --live flag with it rather than adding a new one and
shifting the problem of checking whether the feature works to the user.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 0/6] UFFD write-tracking migration/snapshots
  2020-12-01  7:08 ` Peter Krempa
@ 2020-12-01  8:42   ` Andrey Gruzdev
  2020-12-01 10:53     ` Peter Krempa
  0 siblings, 1 reply; 45+ messages in thread
From: Andrey Gruzdev @ 2020-12-01  8:42 UTC (permalink / raw)
  To: Peter Krempa
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu

On 01.12.2020 10:08, Peter Krempa wrote:
> On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
>> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
>> implemented in his series '[PATCH v0 0/4] migration: add background snapshot'.
> 
> Hi,
> 
> I gave this a try when attempting to implement the libvirt code for this
> feature. I ran into a problem of the migration failing right away. The
> VM's CPUs were running at that point.
> 
> QEMU logged the following to stdout/err:
> 
> 2020-12-01T06:50:42.334062Z qemu-system-x86_64: uffd_register_memory() failed: start=7f2007fff000 len=33554432000 mode=2 errno=22
> 2020-12-01T06:50:42.334072Z qemu-system-x86_64: ram_write_tracking_start() failed: restoring initial memory state
> 2020-12-01T06:50:42.334074Z qemu-system-x86_64: uffd_protect_memory() failed: start=7f2007fff000 len=33554432000 mode=0 errno=2
> 2020-12-01T06:50:42.334076Z qemu-system-x86_64: uffd_unregister_memory() failed: start=7f2007fff000 len=33554432000 errno=22
> 
> 
> The migration was started by the following QMP conversation:
> 
> 
> QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"xbzrle","state":false},{"capability":"auto-converge","state":false},{"capability":"rdma-pin-all","state":false},{"capability":"postcopy-ram","state":false},{"capability":"compress","state":false},{"capability":"pause-before-switchover","state":false},{"capability":"late-block-activate","state":false},{"capability":"multifd","state":false},{"capability":"background-snapshot","state":true}]},"id":"libvirt-14"}
> 
> QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": "libvirt-14"}
> 
> QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"migrate-set-parameters","arguments":{"max-bandwidth":9223372036853727232},"id":"libvirt-15"}
> 
> QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": "libvirt-15"}
> 
> QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"getfd","arguments":{"fdname":"migrate"},"id":"libvirt-16"}
> 
> QEMU_MONITOR_IO_SEND_FD: mon=0x7fff9c20c610 fd=44 ret=72 errno=0
> QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": "libvirt-16"}
> 
> QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"migrate","arguments":{"detach":true,"blk":false,"inc":false,"uri":"fd:migrate"},"id":"libvirt-17"}
> 
> QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 962424}, "event": "MIGRATION", "data": {"status": "setup"}}
> QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {}, "id": "libvirt-17"}
> QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 966306}, "event": "MIGRATION_PASS", "data": {"pass": 1}}
> QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 966355}, "event": "MIGRATION", "data": {"status": "active"}}
> QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 966488}, "event": "STOP"}
> QEMU_MONITOR_RECV_EVENT: mon=0x7fff9c20c610 event={"timestamp": {"seconds": 1606805733, "microseconds": 970326}, "event": "MIGRATION", "data": {"status": "failed"}}
> 
> QEMU_MONITOR_IO_WRITE: mon=0x7fff9c20c610 buf={"execute":"query-migrate","id":"libvirt-18"}
> 
> QEMU_MONITOR_RECV_REPLY: mon=0x7fff9c20c610 reply={"return": {"status": "failed"}, "id": "libvirt-18"}
> qemuMigrationJobCheckStatus:1685 : operation failed: snapshot job: unexpectedly failed
> 
>   $ uname -r
> 5.8.18-300.fc33.x86_64
> 
> 
> The VM was created by libvirt with the following patchset applied:
> 
> https://gitlab.com/pipo.sk/libvirt/-/commits/background-snapshot
> 
> git fetch https://gitlab.com/pipo.sk/libvirt.git background-snapshot
> 
> Start the snapshot via:
> 
> virsh snapshot-create-as --memspec /tmp/snap.mem --diskspec sdb,snapshot=no --diskspec sda,snapshot=no --no-metadata upstream
> 
> Note you can omit --diskspec if you have a diskless VM.
> 
> The patches are VERY work in progress as I need to figure out the proper
> sequencing to ensure a consistent snapshot.
> 
> Note that in cases when qemu can't guarantee that the
> background_snapshot feature will work it should not advertise it. We
> need a way to check whether it's possible to use it, so we can replace
> the existing --live flag with it rather than adding a new one and
> shifting the problem of checking whether the feature works to the user.
> 

Hi,

Maybe you are using hugetlbfs as the memory backend?

I totally agree that we need to somehow check that the kernel and the VM 
memory backend support the feature before one can enable the capability.
Need to think about that..

Thanks,

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 0/6] UFFD write-tracking migration/snapshots
  2020-12-01  8:42   ` Andrey Gruzdev
@ 2020-12-01 10:53     ` Peter Krempa
  2020-12-01 11:24       ` Andrey Gruzdev
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Krempa @ 2020-12-01 10:53 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel, Peter Xu,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Tue, Dec 01, 2020 at 11:42:18 +0300, Andrey Gruzdev wrote:
> On 01.12.2020 10:08, Peter Krempa wrote:
> > On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
> > > This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's

[...]

> > Note that in cases when qemu can't guarantee that the
> > background_snapshot feature will work it should not advertise it. We
> > need a way to check whether it's possible to use it, so we can replace
> > the existing --live flag with it rather than adding a new one and
> > shifting the problem of checking whether the feature works to the user.
> > 
> 
> Hi,
> 
> Maybe you are using hugetlbfs as the memory backend?

Not exactly hugepages, but I had:

  <memoryBacking>
    <access mode='shared'/>
  </memoryBacking>

which resulted into the following commandline to instantiate memory:

-object memory-backend-file,id=pc.ram,mem-path=/var/lib/libvirt/qemu/ram/6-upstream-bj/pc.ram,share=yes,size=33554432000,host-nodes=0,policy=bind \

When I've removed it I got:

-object memory-backend-ram,id=pc.ram,size=33554432000,host-nodes=0,policy=bind \

And the migration didn't fail in my quick test. I'll have a more
detailed look later, thanks for the pointer.

> I totally agree that we need to somehow check that the kernel and the VM
> memory backend support the feature before one can enable the capability.
> Need to think about that..

Definitely. Also note that memory backed by memory-backend-file will be
more and more common, for cases such as virtiofs DAX sharing and
similar.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 0/6] UFFD write-tracking migration/snapshots
  2020-12-01 10:53     ` Peter Krempa
@ 2020-12-01 11:24       ` Andrey Gruzdev
  2020-12-01 18:40         ` Dr. David Alan Gilbert
  2020-12-01 18:54         ` Peter Xu
  0 siblings, 2 replies; 45+ messages in thread
From: Andrey Gruzdev @ 2020-12-01 11:24 UTC (permalink / raw)
  To: Peter Krempa
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu

On 01.12.2020 13:53, Peter Krempa wrote:
> On Tue, Dec 01, 2020 at 11:42:18 +0300, Andrey Gruzdev wrote:
>> On 01.12.2020 10:08, Peter Krempa wrote:
>>> On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
>>>> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
> 
> [...]
> 
>>> Note that in cases when qemu can't guarantee that the
>>> background_snapshot feature will work it should not advertise it. We
>>> need a way to check whether it's possible to use it, so we can replace
>>> the existing --live flag with it rather than adding a new one and
>>> shifting the problem of checking whether the feature works to the user.
>>>
>>
>> Hi,
>>
>> Maybe you are using hugetlbfs as the memory backend?
> 
> Not exactly hugepages, but I had:
> 
>    <memoryBacking>
>      <access mode='shared'/>
>    </memoryBacking>
> 
> which resulted into the following commandline to instantiate memory:
> 
> -object memory-backend-file,id=pc.ram,mem-path=/var/lib/libvirt/qemu/ram/6-upstream-bj/pc.ram,share=yes,size=33554432000,host-nodes=0,policy=bind \
> 
> When I've removed it I got:
> 
> -object memory-backend-ram,id=pc.ram,size=33554432000,host-nodes=0,policy=bind \
> 
> And the migration didn't fail in my quick test. I'll have a more
> detailed look later, thanks for the pointer.
> 

Yep, it seems that current userfaultfd supports hugetlbfs and shared memory 
for missing-page tracking but not for write-protection..

>> I totally agree that we need to somehow check that the kernel and the VM
>> memory backend support the feature before one can enable the capability.
>> Need to think about that..
> 
> Definitely. Also note that memory backed by memory-backend-file will be
> more and more common, for cases such as virtiofs DAX sharing and
> similar.
> 

I see.. That needs support from the kernel side; so far 
'background-snapshot' is incompatible with memory-backend-file sharing.

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 2/6] introduce UFFD-WP low-level interface helpers
  2020-11-26 15:17 ` [PATCH v4 2/6] introduce UFFD-WP low-level interface helpers Andrey Gruzdev via
  2020-11-27 21:04   ` Peter Xu
@ 2020-12-01 12:24   ` Dr. David Alan Gilbert
  2020-12-01 19:32     ` Andrey Gruzdev
  1 sibling, 1 reply; 45+ messages in thread
From: Dr. David Alan Gilbert @ 2020-12-01 12:24 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, qemu-devel, Peter Xu, Markus Armbruster,
	Paolo Bonzini, Den Lunev

* Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> Implemented support for the whole RAM block memory
> protection/un-protection. Introduced higher level
> ram_write_tracking_start() and ram_write_tracking_stop()
> to start/stop tracking guest memory writes.
> 
> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> ---
>  include/exec/memory.h      |   7 ++
>  include/qemu/userfaultfd.h |  29 +++++
>  migration/ram.c            | 120 +++++++++++++++++++++
>  migration/ram.h            |   4 +
>  util/meson.build           |   1 +
>  util/userfaultfd.c         | 215 +++++++++++++++++++++++++++++++++++++
>  6 files changed, 376 insertions(+)
>  create mode 100644 include/qemu/userfaultfd.h
>  create mode 100644 util/userfaultfd.c
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 0f3e6bcd5e..3d798fce16 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -139,6 +139,13 @@ typedef struct IOMMUNotifier IOMMUNotifier;
>  /* RAM is a persistent kind memory */
>  #define RAM_PMEM (1 << 5)
>  
> +/*
> + * UFFDIO_WRITEPROTECT is used on this RAMBlock to
> + * support 'write-tracking' migration type.
> + * Implies ram_state->ram_wt_enabled.
> + */
> +#define RAM_UF_WRITEPROTECT (1 << 6)
> +
>  static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
>                                         IOMMUNotifierFlag flags,
>                                         hwaddr start, hwaddr end,
> diff --git a/include/qemu/userfaultfd.h b/include/qemu/userfaultfd.h
> new file mode 100644
> index 0000000000..fb843c76db
> --- /dev/null
> +++ b/include/qemu/userfaultfd.h
> @@ -0,0 +1,29 @@
> +/*
> + * Linux UFFD-WP support
> + *
> + * Copyright Virtuozzo GmbH, 2020
> + *
> + * Authors:
> + *  Andrey Gruzdev   <andrey.gruzdev@virtuozzo.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * later.  See the COPYING file in the top-level directory.
> + */
> +
> +#ifndef USERFAULTFD_H
> +#define USERFAULTFD_H
> +
> +#include "qemu/osdep.h"
> +#include "exec/hwaddr.h"
> +#include <linux/userfaultfd.h>
> +
> +int uffd_create_fd(void);
> +void uffd_close_fd(int uffd);
> +int uffd_register_memory(int uffd, hwaddr start, hwaddr length,
> +        bool track_missing, bool track_wp);
> +int uffd_unregister_memory(int uffd, hwaddr start, hwaddr length);
> +int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp);
> +int uffd_read_events(int uffd, struct uffd_msg *msgs, int count);
> +bool uffd_poll_events(int uffd, int tmo);
> +
> +#endif /* USERFAULTFD_H */
> diff --git a/migration/ram.c b/migration/ram.c
> index 7811cde643..3adfd1948d 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -56,6 +56,11 @@
>  #include "savevm.h"
>  #include "qemu/iov.h"
>  #include "multifd.h"
> +#include "sysemu/runstate.h"
> +
> +#ifdef CONFIG_LINUX
> +#include "qemu/userfaultfd.h"
> +#endif
>  
>  /***********************************************************/
>  /* ram save/restore */
> @@ -298,6 +303,8 @@ struct RAMSrcPageRequest {
>  struct RAMState {
>      /* QEMUFile used for this migration */
>      QEMUFile *f;
> +    /* UFFD file descriptor, used in 'write-tracking' migration */
> +    int uffdio_fd;
>      /* Last block that we have visited searching for dirty pages */
>      RAMBlock *last_seen_block;
>      /* Last block from where we have sent data */
> @@ -3788,6 +3795,119 @@ static int ram_resume_prepare(MigrationState *s, void *opaque)
>      return 0;
>  }
>  
> +/*
> + * ram_write_tracking_start: start UFFD-WP memory tracking
> + *
> + * Returns 0 for success or negative value in case of error
> + *
> + */
> +int ram_write_tracking_start(void)

Please split this stuff off into a separate patch; just make this one
the userfaultfd.[ch] and then a separate one for gluing it into ram.c

> +{
> +#ifdef CONFIG_LINUX
> +    int uffd;
> +    RAMState *rs = ram_state;
> +    RAMBlock *bs;
> +
> +    /* Open UFFD file descriptor */
> +    uffd = uffd_create_fd();
> +    if (uffd < 0) {
> +        return uffd;
> +    }
> +    rs->uffdio_fd = uffd;
> +
> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> +        /* Nothing to do with read-only and MMIO-writable regions */
> +        if (bs->mr->readonly || bs->mr->rom_device) {
> +            continue;
> +        }
> +
> +        bs->flags |= RAM_UF_WRITEPROTECT;
> +        /* Register block memory with UFFD to track writes */
> +        if (uffd_register_memory(rs->uffdio_fd, (hwaddr) bs->host,
> +                bs->max_length, false, true)) {
> +            goto fail;
> +        }
> +        /* Apply UFFD write protection to the block memory range */
> +        if (uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host,
> +                bs->max_length, true)) {
> +            goto fail;
> +        }
> +
> +        info_report("UFFD-WP write-tracking enabled: "
> +                "block_id=%s page_size=%zu start=%p length=%lu "
> +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
> +                bs->idstr, bs->page_size, bs->host, bs->max_length,
> +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
> +                bs->mr->nonvolatile, bs->mr->rom_device);
> +    }
> +
> +    return 0;
> +
> +fail:
> +    error_report("ram_write_tracking_start() failed: restoring initial memory state");
> +
> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
> +            continue;
> +        }
> +        /*
> +         * In case some memory block failed to be write-protected
> +         * remove protection and unregister all succeeded RAM blocks
> +         */
> +        uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host, bs->max_length, false);
> +        uffd_unregister_memory(rs->uffdio_fd, (hwaddr) bs->host, bs->max_length);
> +        /* Cleanup flags */
> +        bs->flags &= ~RAM_UF_WRITEPROTECT;
> +    }
> +
> +    uffd_close_fd(uffd);
> +    rs->uffdio_fd = -1;
> +    return -1;
> +#else
> +    rs->uffdio_fd = -1;
> +    error_setg(&migrate_get_current()->error,
> +            "Background-snapshot not supported on non-Linux hosts");
> +    return -1;
> +#endif /* CONFIG_LINUX */
> +}
> +
> +/**
> + * ram_write_tracking_stop: stop UFFD-WP memory tracking and remove protection
> + */
> +void ram_write_tracking_stop(void)
> +{
> +#ifdef CONFIG_LINUX
> +    RAMState *rs = ram_state;
> +    RAMBlock *bs;
> +    assert(rs->uffdio_fd >= 0);
> +
> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
> +            continue;
> +        }
> +        /* Remove protection and unregister all affected RAM blocks */
> +        uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host, bs->max_length, false);
> +        uffd_unregister_memory(rs->uffdio_fd, (hwaddr) bs->host, bs->max_length);
> +        /* Cleanup flags */
> +        bs->flags &= ~RAM_UF_WRITEPROTECT;
> +
> +        info_report("UFFD-WP write-tracking disabled: "
> +                "block_id=%s page_size=%zu start=%p length=%lu "
> +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
> +                bs->idstr, bs->page_size, bs->host, bs->max_length,
> +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
> +                bs->mr->nonvolatile, bs->mr->rom_device);
> +    }
> +
> +    /* Finally close UFFD file descriptor */
> +    uffd_close_fd(rs->uffdio_fd);
> +    rs->uffdio_fd = -1;
> +#else
> +    error_setg(&migrate_get_current()->error,
> +            "Background-snapshot not supported on non-Linux hosts");
> +#endif /* CONFIG_LINUX */
> +}
> +
>  static SaveVMHandlers savevm_ram_handlers = {
>      .save_setup = ram_save_setup,
>      .save_live_iterate = ram_save_iterate,
> diff --git a/migration/ram.h b/migration/ram.h
> index 011e85414e..0ec63e27ee 100644
> --- a/migration/ram.h
> +++ b/migration/ram.h
> @@ -79,4 +79,8 @@ void colo_flush_ram_cache(void);
>  void colo_release_ram_cache(void);
>  void colo_incoming_start_dirty_log(void);
>  
> +/* Background snapshots */
> +int ram_write_tracking_start(void);
> +void ram_write_tracking_stop(void);
> +
>  #endif
> diff --git a/util/meson.build b/util/meson.build
> index f359af0d46..c64bfe94b3 100644
> --- a/util/meson.build
> +++ b/util/meson.build
> @@ -50,6 +50,7 @@ endif
>  
>  if have_system
>    util_ss.add(when: 'CONFIG_GIO', if_true: [files('dbus.c'), gio])
> +  util_ss.add(when: 'CONFIG_LINUX', if_true: files('userfaultfd.c'))
>  endif
>  
>  if have_block
> diff --git a/util/userfaultfd.c b/util/userfaultfd.c
> new file mode 100644
> index 0000000000..038953d7ed
> --- /dev/null
> +++ b/util/userfaultfd.c
> @@ -0,0 +1,215 @@
> +/*
> + * Linux UFFD-WP support
> + *
> + * Copyright Virtuozzo GmbH, 2020
> + *
> + * Authors:
> + *  Andrey Gruzdev   <andrey.gruzdev@virtuozzo.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * later.  See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/bitops.h"
> +#include "qemu/error-report.h"
> +#include "qemu/userfaultfd.h"
> +#include <poll.h>
> +#include <sys/syscall.h>
> +#include <sys/ioctl.h>
> +
> +/**
> + * uffd_create_fd: create UFFD file descriptor
> + *
> + * Returns non-negative file descriptor or negative value in case of an error
> + */
> +int uffd_create_fd(void)
> +{
> +    int uffd;
> +    struct uffdio_api api_struct;
> +    uint64_t ioctl_mask = BIT(_UFFDIO_REGISTER) | BIT(_UFFDIO_UNREGISTER);
> +
> +    uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> +    if (uffd < 0) {
> +        error_report("uffd_create_fd() failed: UFFD not supported");

Please include the errno (or strerror(errno)) 

> +        return -1;
> +    }
> +
> +    api_struct.api = UFFD_API;
> +    api_struct.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
> +    if (ioctl(uffd, UFFDIO_API, &api_struct)) {
> +        error_report("uffd_create_fd() failed: "
> +                     "API version not supported version=%llx errno=%i",
> +                api_struct.api, errno);
> +        goto fail;
> +    }
> +
> +    if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) {
> +        error_report("uffd_create_fd() failed: "
> +                     "PAGEFAULT_FLAG_WP feature missing");
> +        goto fail;
> +    }
> +
> +    return uffd;
> +
> +fail:
> +    close(uffd);
> +    return -1;
> +}

It would be great if we could find a way to share this with
migration/postcopy-ram.c - this duplicates a lor of
request_ufd_features.

> +
> +/**
> + * uffd_close_fd: close UFFD file descriptor
> + *
> + * @uffd: UFFD file descriptor
> + */
> +void uffd_close_fd(int uffd)
> +{
> +    assert(uffd >= 0);
> +    close(uffd);
> +}
> +
> +/**
> + * uffd_register_memory: register memory range with UFFD
> + *
> + * Returns 0 in case of success, negative value on error
> + *
> + * @uffd: UFFD file descriptor
> + * @start: starting virtual address of memory range
> + * @length: length of memory range
> + * @track_missing: generate events on missing-page faults
> + * @track_wp: generate events on write-protected-page faults
> + */
> +int uffd_register_memory(int uffd, hwaddr start, hwaddr length,
> +        bool track_missing, bool track_wp)
> +{
> +    struct uffdio_register uffd_register;
> +
> +    uffd_register.range.start = start;
> +    uffd_register.range.len = length;
> +    uffd_register.mode = (track_missing ? UFFDIO_REGISTER_MODE_MISSING : 0) |
> +                         (track_wp ? UFFDIO_REGISTER_MODE_WP : 0);
> +
> +    if (ioctl(uffd, UFFDIO_REGISTER, &uffd_register)) {
> +        error_report("uffd_register_memory() failed: "
> +                     "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",

PRIx64 for mode as well.

> +                start, length, uffd_register.mode, errno);
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +/**
> + * uffd_unregister_memory: un-register memory range with UFFD
> + *
> + * Returns 0 in case of success, negative value on error
> + *
> + * @uffd: UFFD file descriptor
> + * @start: starting virtual address of memory range
> + * @length: length of memory range
> + */
> +int uffd_unregister_memory(int uffd, hwaddr start, hwaddr length)
> +{
> +    struct uffdio_range uffd_range;
> +
> +    uffd_range.start = start;
> +    uffd_range.len = length;
> +
> +    if (ioctl(uffd, UFFDIO_UNREGISTER, &uffd_range)) {
> +        error_report("uffd_unregister_memory() failed: "
> +                     "start=%0"PRIx64" len=%"PRIu64" errno=%i",
> +                start, length, errno);
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +/**
> + * uffd_protect_memory: protect/unprotect memory range for writes with UFFD
> + *
> + * Returns 0 on success or negative value in case of error
> + *
> + * @uffd: UFFD file descriptor
> + * @start: starting virtual address of memory range
> + * @length: length of memory range
> + * @wp: write-protect/unprotect
> + */
> +int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp)
> +{
> +    struct uffdio_writeprotect uffd_writeprotect;
> +    int res;
> +
> +    uffd_writeprotect.range.start = start;
> +    uffd_writeprotect.range.len = length;
> +    uffd_writeprotect.mode = (wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0);
> +
> +    do {
> +        res = ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect);
> +    } while (res < 0 && errno == EINTR);
> +    if (res < 0) {
> +        error_report("uffd_protect_memory() failed: "
> +                     "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
> +                start, length, uffd_writeprotect.mode, errno);
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +/**
> + * uffd_read_events: read pending UFFD events
> + *
> + * Returns number of fetched messages, 0 if none are available or
> + * negative value in case of an error
> + *
> + * @uffd: UFFD file descriptor
> + * @msgs: pointer to message buffer
> + * @count: number of messages that can fit in the buffer
> + */
> +int uffd_read_events(int uffd, struct uffd_msg *msgs, int count)
> +{
> +    ssize_t res;
> +    do {
> +        res = read(uffd, msgs, count * sizeof(struct uffd_msg));
> +    } while (res < 0 && errno == EINTR);
> +
> +    if ((res < 0 && errno == EAGAIN)) {
> +        return 0;
> +    }
> +    if (res < 0) {
> +        error_report("uffd_read_events() failed: errno=%i", errno);
> +        return -1;
> +    }
> +
> +    return (int) (res / sizeof(struct uffd_msg));
> +}
> +
> +/**
> + * uffd_poll_events: poll UFFD file descriptor for read
> + *
> + * Returns true if events are available for read, false otherwise
> + *
> + * @uffd: UFFD file descriptor
> + * @tmo: timeout in milliseconds, 0 for non-blocking operation,
> + *       negative value for infinite wait
> + */
> +bool uffd_poll_events(int uffd, int tmo)
> +{
> +    int res;
> +    struct pollfd poll_fd = { .fd = uffd, .events = POLLIN, .revents = 0 };
> +
> +    do {
> +        res = poll(&poll_fd, 1, tmo);
> +    } while (res < 0 && errno == EINTR);
> +
> +    if (res == 0) {
> +        return false;
> +    }
> +    if (res < 0) {
> +        error_report("uffd_poll_events() failed: errno=%i", errno);
> +        return false;
> +    }
> +
> +    return (poll_fd.revents & POLLIN) != 0;
> +}
> -- 
> 2.25.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 0/6] UFFD write-tracking migration/snapshots
  2020-12-01 11:24       ` Andrey Gruzdev
@ 2020-12-01 18:40         ` Dr. David Alan Gilbert
  2020-12-01 19:22           ` Peter Xu
  2020-12-01 20:11           ` Andrey Gruzdev
  2020-12-01 18:54         ` Peter Xu
  1 sibling, 2 replies; 45+ messages in thread
From: Dr. David Alan Gilbert @ 2020-12-01 18:40 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Peter Krempa, Juan Quintela, qemu-devel, Peter Xu,
	Markus Armbruster, Paolo Bonzini, Den Lunev

* Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> On 01.12.2020 13:53, Peter Krempa wrote:
> > On Tue, Dec 01, 2020 at 11:42:18 +0300, Andrey Gruzdev wrote:
> > > On 01.12.2020 10:08, Peter Krempa wrote:
> > > > On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
> > > > > This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
> > 
> > [...]
> > 
> > > > Note that in cases when qemu can't guarantee that the
> > > > background_snapshot feature will work it should not advertise it. We
> > > > need a way to check whether it's possible to use it, so we can replace
> > > > the existing --live flag with it rather than adding a new one and
> > > > shifting the problem of checking whether the feature works to the user.
> > > > 
> > > 
> > > Hi,
> > > 
> > > May be you are using hugetlbfs as memory backend?
> > 
> > Not exactly hugepages, but I had:
> > 
> >    <memoryBacking>
> >      <access mode='shared'/>
> >    </memoryBacking>
> > 
> > which resulted in the following command line to instantiate memory:
> > 
> > -object memory-backend-file,id=pc.ram,mem-path=/var/lib/libvirt/qemu/ram/6-upstream-bj/pc.ram,share=yes,size=33554432000,host-nodes=0,policy=bind \
> > 
> > When I've removed it I got:
> > 
> > -object memory-backend-ram,id=pc.ram,size=33554432000,host-nodes=0,policy=bind \
> > 
> > And the migration didn't fail in my quick test. I'll have a more
> > detailed look later, thanks for the pointer.
> > 
> 
> Yep, seems that current userfaultfd supports hugetlbfs and shared memory for
> missing pages but not for wr-protected..

For hugepages, you'd need kernel support - but also you'd want to make
sure you write the whole hugepage at once.

For shared memory, there's a harder question: what happens if RAM is
written by the other process?  For postcopy, we get the other process
to send us a userfaultfd that they have registered with their VM.

Dave

> > > I totally agree that we need somehow check that kernel and VM memory backend
> > > support the feature before one can enable the capability.
> > > Need to think about that..
> > 
> > Definitely. Also note that memory backed by memory-backend-file will be
> > more and more common, for cases such as virtiofs DAX sharing and
> > similar.
> > 
> 
> I see.. That needs support from kernel side, so far 'background-snapshots'
> are incompatible with memory-backend-file sharing.
> 
> -- 
> Andrey Gruzdev, Principal Engineer
> Virtuozzo GmbH  +7-903-247-6397
>                 virtuzzo.com
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v4 0/6] UFFD write-tracking migration/snapshots
  2020-12-01 11:24       ` Andrey Gruzdev
  2020-12-01 18:40         ` Dr. David Alan Gilbert
@ 2020-12-01 18:54         ` Peter Xu
  2020-12-01 20:00           ` Dr. David Alan Gilbert
  2020-12-01 20:26           ` Andrey Gruzdev
  1 sibling, 2 replies; 45+ messages in thread
From: Peter Xu @ 2020-12-01 18:54 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Andrea Arcangeli, Peter Krempa, Juan Quintela, Markus Armbruster,
	qemu-devel, Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Tue, Dec 01, 2020 at 02:24:12PM +0300, Andrey Gruzdev wrote:
> On 01.12.2020 13:53, Peter Krempa wrote:
> > On Tue, Dec 01, 2020 at 11:42:18 +0300, Andrey Gruzdev wrote:
> > > On 01.12.2020 10:08, Peter Krempa wrote:
> > > > On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
> > > > > This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
> > 
> > [...]
> > 
> > > > Note that in cases when qemu can't guarantee that the
> > > > background_snapshot feature will work it should not advertise it. We
> > > > need a way to check whether it's possible to use it, so we can replace
> > > > the existing --live flag with it rather than adding a new one and
> > > > shifting the problem of checking whether the feature works to the user.

Would it be fine if libvirt just tried the new way first anyway?  If it
fails, it'll fail right away on any unsupported memory type, so
logically the libvirt user may not even notice we've retried.

Previously I thought it was enough, because so far the kernel does not have a
specific flag showing whether such type of memory is supported.  But I don't
know whether it would be non-trivial for libvirt to retry like that.

Another solution is to let qemu test the uffd ioctls right after QEMU memory
setup, so we know whether background/live snapshot is supported or not with
current memory backends.  We would need to try this for every ramblock because
I think we can have different types across all the qemu ramblocks.

> > > > 
> > > 
> > > Hi,
> > > 
> > > May be you are using hugetlbfs as memory backend?
> > 
> > Not exactly hugepages, but I had:
> > 
> >    <memoryBacking>
> >      <access mode='shared'/>
> >    </memoryBacking>
> > 
> > which resulted in the following command line to instantiate memory:
> > 
> > -object memory-backend-file,id=pc.ram,mem-path=/var/lib/libvirt/qemu/ram/6-upstream-bj/pc.ram,share=yes,size=33554432000,host-nodes=0,policy=bind \
> > 
> > When I've removed it I got:
> > 
> > -object memory-backend-ram,id=pc.ram,size=33554432000,host-nodes=0,policy=bind \
> > 
> > And the migration didn't fail in my quick test. I'll have a more
> > detailed look later, thanks for the pointer.
> > 
> 
> Yep, seems that current userfaultfd supports hugetlbfs and shared memory for
> missing pages but not for wr-protected..

Correct.  Btw, I'm working on both of them recently.  I have a testing kernel
branch, but I don't think it should affect our qemu work, though, since qemu
should do the same irrelevant of the memory type.  We can just test with
anonymous memories, and as long as it works, it should work perfectly on all
the rest of backends (maybe even for other file-backed memory, more below).

> 
> > > I totally agree that we need somehow check that kernel and VM memory backend
> > > support the feature before one can enable the capability.
> > > Need to think about that..
> > 
> > Definitely. Also note that memory backed by memory-backend-file will be
> > more and more common, for cases such as virtiofs DAX sharing and
> > similar.
> > 
> 
> I see.. That needs support from kernel side, so far 'background-snapshots'
> are incompatible with memory-backend-file sharing.

Yes.  So as mentioned, shmem/hugetlbfs should be WIP, but I haven't thought
about the rest yet.  Maybe... it's not hard to add uffd-wp for most of the
file-backed memory?  Since afaict the kernel handles wr-protect in a quite
straightforward way (do_wp_page() for whatever backend), and uffd-wp can be the
first to trap all of them.  I'm not sure whether Andrea has thought about that
or even on how to spread the missing usage to more types of backends (maybe
missing is more special in that it needs to consider page caches).  So I'm
copying Andrea too just in case there's further input.

Thanks,

-- 
Peter Xu




* Re: [PATCH v4 0/6] UFFD write-tracking migration/snapshots
  2020-12-01 18:40         ` Dr. David Alan Gilbert
@ 2020-12-01 19:22           ` Peter Xu
  2020-12-01 20:01             ` Dr. David Alan Gilbert
  2020-12-01 20:11           ` Andrey Gruzdev
  1 sibling, 1 reply; 45+ messages in thread
From: Peter Xu @ 2020-12-01 19:22 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Krempa, Juan Quintela, qemu-devel, Markus Armbruster,
	Paolo Bonzini, Den Lunev, Andrey Gruzdev

On Tue, Dec 01, 2020 at 06:40:55PM +0000, Dr. David Alan Gilbert wrote:
> > Yep, seems that current userfaultfd supports hugetlbfs and shared memory for
> > missing pages but not for wr-protected..
> 
> For hugepages, you'd need kernel support - but also you'd want to make
> sure you write the whole hugepage at once.

Or we can do similar things by splitting the huge pages just like when we
migrate.

I must have overlooked these facts when I replied previously - we do need the
same logic, but also special care for these special memory types.

> 
> For shared memory, there's a harder question: what happens if RAM is
> written by the other process?  For postcopy, we get the other process
> to send us a userfaultfd that they have registered with their VM.

Good point... so we would need similar things too.

Looks like we'd better explicitly disable shmem/hugetlbfs in qemu background
snapshots for now, before we have prepared these utilities, just in case the
feature gets run on some "future" kernel and accidentally enabled, which could
leave the snapshot files corrupted.

Is shmem used a lot in libvirt, or is it even a default configuration?

-- 
Peter Xu




* Re: [PATCH v4 2/6] introduce UFFD-WP low-level interface helpers
  2020-12-01 12:24   ` Dr. David Alan Gilbert
@ 2020-12-01 19:32     ` Andrey Gruzdev
  0 siblings, 0 replies; 45+ messages in thread
From: Andrey Gruzdev @ 2020-12-01 19:32 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Markus Armbruster, Peter Xu

On 01.12.2020 15:24, Dr. David Alan Gilbert wrote:
> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>> Implemented support for the whole RAM block memory
>> protection/un-protection. Introduced higher level
>> ram_write_tracking_start() and ram_write_tracking_stop()
>> to start/stop tracking guest memory writes.
>>
>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
>> ---
>>   include/exec/memory.h      |   7 ++
>>   include/qemu/userfaultfd.h |  29 +++++
>>   migration/ram.c            | 120 +++++++++++++++++++++
>>   migration/ram.h            |   4 +
>>   util/meson.build           |   1 +
>>   util/userfaultfd.c         | 215 +++++++++++++++++++++++++++++++++++++
>>   6 files changed, 376 insertions(+)
>>   create mode 100644 include/qemu/userfaultfd.h
>>   create mode 100644 util/userfaultfd.c
>>
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index 0f3e6bcd5e..3d798fce16 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -139,6 +139,13 @@ typedef struct IOMMUNotifier IOMMUNotifier;
>>   /* RAM is a persistent kind memory */
>>   #define RAM_PMEM (1 << 5)
>>   
>> +/*
>> + * UFFDIO_WRITEPROTECT is used on this RAMBlock to
>> + * support 'write-tracking' migration type.
>> + * Implies ram_state->ram_wt_enabled.
>> + */
>> +#define RAM_UF_WRITEPROTECT (1 << 6)
>> +
>>   static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
>>                                          IOMMUNotifierFlag flags,
>>                                          hwaddr start, hwaddr end,
>> diff --git a/include/qemu/userfaultfd.h b/include/qemu/userfaultfd.h
>> new file mode 100644
>> index 0000000000..fb843c76db
>> --- /dev/null
>> +++ b/include/qemu/userfaultfd.h
>> @@ -0,0 +1,29 @@
>> +/*
>> + * Linux UFFD-WP support
>> + *
>> + * Copyright Virtuozzo GmbH, 2020
>> + *
>> + * Authors:
>> + *  Andrey Gruzdev   <andrey.gruzdev@virtuozzo.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>> + * later.  See the COPYING file in the top-level directory.
>> + */
>> +
>> +#ifndef USERFAULTFD_H
>> +#define USERFAULTFD_H
>> +
>> +#include "qemu/osdep.h"
>> +#include "exec/hwaddr.h"
>> +#include <linux/userfaultfd.h>
>> +
>> +int uffd_create_fd(void);
>> +void uffd_close_fd(int uffd);
>> +int uffd_register_memory(int uffd, hwaddr start, hwaddr length,
>> +        bool track_missing, bool track_wp);
>> +int uffd_unregister_memory(int uffd, hwaddr start, hwaddr length);
>> +int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp);
>> +int uffd_read_events(int uffd, struct uffd_msg *msgs, int count);
>> +bool uffd_poll_events(int uffd, int tmo);
>> +
>> +#endif /* USERFAULTFD_H */
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 7811cde643..3adfd1948d 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -56,6 +56,11 @@
>>   #include "savevm.h"
>>   #include "qemu/iov.h"
>>   #include "multifd.h"
>> +#include "sysemu/runstate.h"
>> +
>> +#ifdef CONFIG_LINUX
>> +#include "qemu/userfaultfd.h"
>> +#endif
>>   
>>   /***********************************************************/
>>   /* ram save/restore */
>> @@ -298,6 +303,8 @@ struct RAMSrcPageRequest {
>>   struct RAMState {
>>       /* QEMUFile used for this migration */
>>       QEMUFile *f;
>> +    /* UFFD file descriptor, used in 'write-tracking' migration */
>> +    int uffdio_fd;
>>       /* Last block that we have visited searching for dirty pages */
>>       RAMBlock *last_seen_block;
>>       /* Last block from where we have sent data */
>> @@ -3788,6 +3795,119 @@ static int ram_resume_prepare(MigrationState *s, void *opaque)
>>       return 0;
>>   }
>>   
>> +/*
>> + * ram_write_tracking_start: start UFFD-WP memory tracking
>> + *
>> + * Returns 0 for success or negative value in case of error
>> + *
>> + */
>> +int ram_write_tracking_start(void)
> 
> Please split this stuff off into a separate patch; just make this one
> the userfaultfd.[ch] and then a separate one for gluing it into ram.c
> 

Yep, I'm separating it now - userfaultfd.c and the ram_write_tracking stuff
go into different patches.

>> +{
>> +#ifdef CONFIG_LINUX
>> +    int uffd;
>> +    RAMState *rs = ram_state;
>> +    RAMBlock *bs;
>> +
>> +    /* Open UFFD file descriptor */
>> +    uffd = uffd_create_fd();
>> +    if (uffd < 0) {
>> +        return uffd;
>> +    }
>> +    rs->uffdio_fd = uffd;
>> +
>> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
>> +        /* Nothing to do with read-only and MMIO-writable regions */
>> +        if (bs->mr->readonly || bs->mr->rom_device) {
>> +            continue;
>> +        }
>> +
>> +        bs->flags |= RAM_UF_WRITEPROTECT;
>> +        /* Register block memory with UFFD to track writes */
>> +        if (uffd_register_memory(rs->uffdio_fd, (hwaddr) bs->host,
>> +                bs->max_length, false, true)) {
>> +            goto fail;
>> +        }
>> +        /* Apply UFFD write protection to the block memory range */
>> +        if (uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host,
>> +                bs->max_length, true)) {
>> +            goto fail;
>> +        }
>> +
>> +        info_report("UFFD-WP write-tracking enabled: "
>> +                "block_id=%s page_size=%zu start=%p length=%lu "
>> +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
>> +                bs->idstr, bs->page_size, bs->host, bs->max_length,
>> +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
>> +                bs->mr->nonvolatile, bs->mr->rom_device);
>> +    }
>> +
>> +    return 0;
>> +
>> +fail:
>> +    error_report("ram_write_tracking_start() failed: restoring initial memory state");
>> +
>> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
>> +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
>> +            continue;
>> +        }
>> +        /*
>> +         * In case some memory block failed to be write-protected
>> +         * remove protection and unregister all succeeded RAM blocks
>> +         */
>> +        uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host, bs->max_length, false);
>> +        uffd_unregister_memory(rs->uffdio_fd, (hwaddr) bs->host, bs->max_length);
>> +        /* Cleanup flags */
>> +        bs->flags &= ~RAM_UF_WRITEPROTECT;
>> +    }
>> +
>> +    uffd_close_fd(uffd);
>> +    rs->uffdio_fd = -1;
>> +    return -1;
>> +#else
>> +    rs->uffdio_fd = -1;
>> +    error_setg(&migrate_get_current()->error,
>> +            "Background-snapshot not supported on non-Linux hosts");
>> +    return -1;
>> +#endif /* CONFIG_LINUX */
>> +}
>> +
>> +/**
>> + * ram_write_tracking_stop: stop UFFD-WP memory tracking and remove protection
>> + */
>> +void ram_write_tracking_stop(void)
>> +{
>> +#ifdef CONFIG_LINUX
>> +    RAMState *rs = ram_state;
>> +    RAMBlock *bs;
>> +    assert(rs->uffdio_fd >= 0);
>> +
>> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
>> +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
>> +            continue;
>> +        }
>> +        /* Remove protection and unregister all affected RAM blocks */
>> +        uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host, bs->max_length, false);
>> +        uffd_unregister_memory(rs->uffdio_fd, (hwaddr) bs->host, bs->max_length);
>> +        /* Cleanup flags */
>> +        bs->flags &= ~RAM_UF_WRITEPROTECT;
>> +
>> +        info_report("UFFD-WP write-tracking disabled: "
>> +                "block_id=%s page_size=%zu start=%p length=%lu "
>> +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
>> +                bs->idstr, bs->page_size, bs->host, bs->max_length,
>> +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
>> +                bs->mr->nonvolatile, bs->mr->rom_device);
>> +    }
>> +
>> +    /* Finally close UFFD file descriptor */
>> +    uffd_close_fd(rs->uffdio_fd);
>> +    rs->uffdio_fd = -1;
>> +#else
>> +    error_setg(&migrate_get_current()->error,
>> +            "Background-snapshot not supported on non-Linux hosts");
>> +#endif /* CONFIG_LINUX */
>> +}
>> +
>>   static SaveVMHandlers savevm_ram_handlers = {
>>       .save_setup = ram_save_setup,
>>       .save_live_iterate = ram_save_iterate,
>> diff --git a/migration/ram.h b/migration/ram.h
>> index 011e85414e..0ec63e27ee 100644
>> --- a/migration/ram.h
>> +++ b/migration/ram.h
>> @@ -79,4 +79,8 @@ void colo_flush_ram_cache(void);
>>   void colo_release_ram_cache(void);
>>   void colo_incoming_start_dirty_log(void);
>>   
>> +/* Background snapshots */
>> +int ram_write_tracking_start(void);
>> +void ram_write_tracking_stop(void);
>> +
>>   #endif
>> diff --git a/util/meson.build b/util/meson.build
>> index f359af0d46..c64bfe94b3 100644
>> --- a/util/meson.build
>> +++ b/util/meson.build
>> @@ -50,6 +50,7 @@ endif
>>   
>>   if have_system
>>     util_ss.add(when: 'CONFIG_GIO', if_true: [files('dbus.c'), gio])
>> +  util_ss.add(when: 'CONFIG_LINUX', if_true: files('userfaultfd.c'))
>>   endif
>>   
>>   if have_block
>> diff --git a/util/userfaultfd.c b/util/userfaultfd.c
>> new file mode 100644
>> index 0000000000..038953d7ed
>> --- /dev/null
>> +++ b/util/userfaultfd.c
>> @@ -0,0 +1,215 @@
>> +/*
>> + * Linux UFFD-WP support
>> + *
>> + * Copyright Virtuozzo GmbH, 2020
>> + *
>> + * Authors:
>> + *  Andrey Gruzdev   <andrey.gruzdev@virtuozzo.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or
>> + * later.  See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/bitops.h"
>> +#include "qemu/error-report.h"
>> +#include "qemu/userfaultfd.h"
>> +#include <poll.h>
>> +#include <sys/syscall.h>
>> +#include <sys/ioctl.h>
>> +
>> +/**
>> + * uffd_create_fd: create UFFD file descriptor
>> + *
>> + * Returns non-negative file descriptor or negative value in case of an error
>> + */
>> +int uffd_create_fd(void)
>> +{
>> +    int uffd;
>> +    struct uffdio_api api_struct;
>> +    uint64_t ioctl_mask = BIT(_UFFDIO_REGISTER) | BIT(_UFFDIO_UNREGISTER);
>> +
>> +    uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
>> +    if (uffd < 0) {
>> +        error_report("uffd_create_fd() failed: UFFD not supported");
> 
> Please include the errno (or strerror(errno))
> 

Ok.

>> +        return -1;
>> +    }
>> +
>> +    api_struct.api = UFFD_API;
>> +    api_struct.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
>> +    if (ioctl(uffd, UFFDIO_API, &api_struct)) {
>> +        error_report("uffd_create_fd() failed: "
>> +                     "API version not supported version=%llx errno=%i",
>> +                api_struct.api, errno);
>> +        goto fail;
>> +    }
>> +
>> +    if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) {
>> +        error_report("uffd_create_fd() failed: "
>> +                     "PAGEFAULT_FLAG_WP feature missing");
>> +        goto fail;
>> +    }
>> +
>> +    return uffd;
>> +
>> +fail:
>> +    close(uffd);
>> +    return -1;
>> +}
> 
> It would be great if we could find a way to share this with
> migration/postcopy-ram.c - this duplicates a lot of
> request_ufd_features.
> 

Now I'm changing the userfaultfd.c glue code to also support the postcopy
requirements - adding functionality to query UFFD feature support, glue
for COPY_PAGE/ZERO_PAGE IOCTLs, etc.

>> +
>> +/**
>> + * uffd_close_fd: close UFFD file descriptor
>> + *
>> + * @uffd: UFFD file descriptor
>> + */
>> +void uffd_close_fd(int uffd)
>> +{
>> +    assert(uffd >= 0);
>> +    close(uffd);
>> +}
>> +
>> +/**
>> + * uffd_register_memory: register memory range with UFFD
>> + *
>> + * Returns 0 in case of success, negative value on error
>> + *
>> + * @uffd: UFFD file descriptor
>> + * @start: starting virtual address of memory range
>> + * @length: length of memory range
>> + * @track_missing: generate events on missing-page faults
>> + * @track_wp: generate events on write-protected-page faults
>> + */
>> +int uffd_register_memory(int uffd, hwaddr start, hwaddr length,
>> +        bool track_missing, bool track_wp)
>> +{
>> +    struct uffdio_register uffd_register;
>> +
>> +    uffd_register.range.start = start;
>> +    uffd_register.range.len = length;
>> +    uffd_register.mode = (track_missing ? UFFDIO_REGISTER_MODE_MISSING : 0) |
>> +                         (track_wp ? UFFDIO_REGISTER_MODE_WP : 0);
>> +
>> +    if (ioctl(uffd, UFFDIO_REGISTER, &uffd_register)) {
>> +        error_report("uffd_register_memory() failed: "
>> +                     "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
> 
> PRIx64 for mode as well.
> 

Ok.

>> +                start, length, uffd_register.mode, errno);
>> +        return -1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +/**
>> + * uffd_unregister_memory: un-register memory range with UFFD
>> + *
>> + * Returns 0 in case of success, negative value on error
>> + *
>> + * @uffd: UFFD file descriptor
>> + * @start: starting virtual address of memory range
>> + * @length: length of memory range
>> + */
>> +int uffd_unregister_memory(int uffd, hwaddr start, hwaddr length)
>> +{
>> +    struct uffdio_range uffd_range;
>> +
>> +    uffd_range.start = start;
>> +    uffd_range.len = length;
>> +
>> +    if (ioctl(uffd, UFFDIO_UNREGISTER, &uffd_range)) {
>> +        error_report("uffd_unregister_memory() failed: "
>> +                     "start=%0"PRIx64" len=%"PRIu64" errno=%i",
>> +                start, length, errno);
>> +        return -1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +/**
>> + * uffd_protect_memory: protect/unprotect memory range for writes with UFFD
>> + *
>> + * Returns 0 on success or negative value in case of error
>> + *
>> + * @uffd: UFFD file descriptor
>> + * @start: starting virtual address of memory range
>> + * @length: length of memory range
>> + * @wp: write-protect/unprotect
>> + */
>> +int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp)
>> +{
>> +    struct uffdio_writeprotect uffd_writeprotect;
>> +    int res;
>> +
>> +    uffd_writeprotect.range.start = start;
>> +    uffd_writeprotect.range.len = length;
>> +    uffd_writeprotect.mode = (wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0);
>> +
>> +    do {
>> +        res = ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect);
>> +    } while (res < 0 && errno == EINTR);
>> +    if (res < 0) {
>> +        error_report("uffd_protect_memory() failed: "
>> +                     "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
>> +                start, length, uffd_writeprotect.mode, errno);
>> +        return -1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +/**
>> + * uffd_read_events: read pending UFFD events
>> + *
>> + * Returns number of fetched messages, 0 if none are available or
>> + * negative value in case of an error
>> + *
>> + * @uffd: UFFD file descriptor
>> + * @msgs: pointer to message buffer
>> + * @count: number of messages that can fit in the buffer
>> + */
>> +int uffd_read_events(int uffd, struct uffd_msg *msgs, int count)
>> +{
>> +    ssize_t res;
>> +    do {
>> +        res = read(uffd, msgs, count * sizeof(struct uffd_msg));
>> +    } while (res < 0 && errno == EINTR);
>> +
>> +    if ((res < 0 && errno == EAGAIN)) {
>> +        return 0;
>> +    }
>> +    if (res < 0) {
>> +        error_report("uffd_read_events() failed: errno=%i", errno);
>> +        return -1;
>> +    }
>> +
>> +    return (int) (res / sizeof(struct uffd_msg));
>> +}
>> +
>> +/**
>> + * uffd_poll_events: poll UFFD file descriptor for read
>> + *
>> + * Returns true if events are available for read, false otherwise
>> + *
>> + * @uffd: UFFD file descriptor
>> + * @tmo: timeout in milliseconds, 0 for non-blocking operation,
>> + *       negative value for infinite wait
>> + */
>> +bool uffd_poll_events(int uffd, int tmo)
>> +{
>> +    int res;
>> +    struct pollfd poll_fd = { .fd = uffd, .events = POLLIN, .revents = 0 };
>> +
>> +    do {
>> +        res = poll(&poll_fd, 1, tmo);
>> +    } while (res < 0 && errno == EINTR);
>> +
>> +    if (res == 0) {
>> +        return false;
>> +    }
>> +    if (res < 0) {
>> +        error_report("uffd_poll_events() failed: errno=%i", errno);
>> +        return false;
>> +    }
>> +
>> +    return (poll_fd.revents & POLLIN) != 0;
>> +}
>> -- 
>> 2.25.1
>>


-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com



* Re: [PATCH v4 0/6] UFFD write-tracking migration/snapshots
  2020-12-01 18:54         ` Peter Xu
@ 2020-12-01 20:00           ` Dr. David Alan Gilbert
  2020-12-01 20:26           ` Andrey Gruzdev
  1 sibling, 0 replies; 45+ messages in thread
From: Dr. David Alan Gilbert @ 2020-12-01 20:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: Andrea Arcangeli, Peter Krempa, Juan Quintela, qemu-devel,
	Markus Armbruster, Paolo Bonzini, Den Lunev, Andrey Gruzdev

* Peter Xu (peterx@redhat.com) wrote:
> On Tue, Dec 01, 2020 at 02:24:12PM +0300, Andrey Gruzdev wrote:
> > On 01.12.2020 13:53, Peter Krempa wrote:
> > > On Tue, Dec 01, 2020 at 11:42:18 +0300, Andrey Gruzdev wrote:
> > > > On 01.12.2020 10:08, Peter Krempa wrote:
> > > > > On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
> > > > > > This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
> > > 
> > > [...]
> > > 
> > > > > Note that in cases when qemu can't guarantee that the
> > > > > background_snapshot feature will work it should not advertise it. We
> > > > > need a way to check whether it's possible to use it, so we can replace
> > > > > the existing --live flag with it rather than adding a new one and
> > > > > shifting the problem of checking whether the feature works to the user.
> 
> Would it be fine if libvirt just try the new way first anyways?  Since if it
> will fail, it'll fail right away on any unsupported memory types, then
> logically the libvirt user may not even notice we've retried.
> 
> Previously I thought it was enough, because so far the kernel does not have a
> specific flag showing whether such type of memory is supported.  But I don't
> know whether it would be non-trivial for libvirt to retry like that.
> 
> Another solution is to let qemu test the uffd ioctls right after QEMU memory
> setup, so we know whether background/live snapshot is supported or not with
> current memory backends.  We would need to try this for every ramblock because
> I think we can have different types across all the qemu ramblocks.

I don't think we actually do that for postcopy; we do some checks like
checking if we have any hugepages, and if so checking for its flags.
But note that we do tie it into migrate_caps_check to fail if you try
to set the capability.

> 
> > > > > 
> > > > 
> > > > Hi,
> > > > 
> > > > Maybe you are using hugetlbfs as a memory backend?
> > > 
> > > Not exactly hugepages, but I had:
> > > 
> > >    <memoryBacking>
> > >      <access mode='shared'/>
> > >    </memoryBacking>
> > > 
> > > which resulted in the following command line to instantiate memory:
> > > 
> > > -object memory-backend-file,id=pc.ram,mem-path=/var/lib/libvirt/qemu/ram/6-upstream-bj/pc.ram,share=yes,size=33554432000,host-nodes=0,policy=bind \
> > > 
> > > When I've removed it I got:
> > > 
> > > -object memory-backend-ram,id=pc.ram,size=33554432000,host-nodes=0,policy=bind \
> > > 
> > > And the migration didn't fail in my quick test. I'll have a more
> > > detailed look later, thanks for the pointer.
> > > 
> > 
> > Yep, seems that current userfaultfd supports hugetlbfs and shared memory for
> > missing pages but not for wr-protected..
> 
> Correct.  Btw, I'm working on both of them recently.  I have a testing kernel
> branch, but I don't think it should affect our qemu work, though, since qemu
> should do the same irrelevant of the memory type.  We can just test with
> anonymous memories, and as long as it works, it should work perfectly on all
> the rest of backends (maybe even for other file-backed memory, more below).
> 
> > 
> > > > I totally agree that we need somehow check that kernel and VM memory backend
> > > > support the feature before one can enable the capability.
> > > > Need to think about that..
> > > 
> > > Definitely. Also note that memory backed by memory-backend-file will be
> > > more and more common, for cases such as virtiofs DAX sharing and
> > > similar.
> > > 
> > 
> > I see.. That needs support from kernel side, so far 'background-snapshots'
> > are incompatible with memory-backend-file sharing.
> 
> Yes.  So as mentioned, shmem/hugetlbfs should be WIP, but I haven't thought
> about the rest yet.  Maybe... it's not hard to add uffd-wp for most of the
> file-backed memory?  Since afaict the kernel handles wr-protect in a quite
> straightforward way (do_wp_page() for whatever backend), and uffd-wp can be the
> first to trap all of them.  I'm not sure whether Andrea has thought about that
> or even on how to spread the missing usage to more types of backends (maybe
> missing is more special in that it needs to consider page caches).  So I'm
> copying Andrea too just in case there's further input.

Some would be good; we've got requests for it to work on pmem mmaped
devices.   You do have to be a little careful about semantics though;
I'm not sure it's that big of a problem for the wp case, but for the
absent case you need to worry about finding an equivalent of madvise or
fallocate that can punch a hole.

Dave

> Thanks,
> 
> -- 
> Peter Xu
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v4 0/6] UFFD write-tracking migration/snapshots
  2020-12-01 19:22           ` Peter Xu
@ 2020-12-01 20:01             ` Dr. David Alan Gilbert
  2020-12-01 20:29               ` Andrey Gruzdev
  0 siblings, 1 reply; 45+ messages in thread
From: Dr. David Alan Gilbert @ 2020-12-01 20:01 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Krempa, Juan Quintela, qemu-devel, Markus Armbruster,
	Paolo Bonzini, Den Lunev, Andrey Gruzdev

* Peter Xu (peterx@redhat.com) wrote:
> On Tue, Dec 01, 2020 at 06:40:55PM +0000, Dr. David Alan Gilbert wrote:
> > > Yep, seems that current userfaultfd supports hugetlbfs and shared memory for
> > > missing pages but not for wr-protected..
> > 
> > For hugepages, you'd need kernel support - but also you'd want to make
> > sure you write the whole hugepage at once.
> 
> Or we can do similar things by splitting the huge pages just like when we
> migrate.
> 
> I must have overlooked these facts when I replied previously - we do need the
> same logic, but also special care for these special memory types.
> 
> > 
> > For shared, there's a harder problem to ask; what happens if RAM is
> > written by the other process - for postcopy, we get the other process
> > to send us a userfaultfd that they have registered with their VM.
> 
> Good point... so we should need similar things too.
> 
> Looks like we'd better explicitly disable shmem/hugetlbfs for now in qemu
> background snapshots before we have these utilities prepared, just in case it
> gets run on some "future" kernel and accidentally gets enabled, in which case
> the snapshot files could end up corrupted.
> 
> Is shmem used a lot in libvirt, or is it even a default configuration?

No, but it's used with vhost-user applications; like dpdk.

Dave

> -- 
> Peter Xu
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v4 0/6] UFFD write-tracking migration/snapshots
  2020-12-01 18:40         ` Dr. David Alan Gilbert
  2020-12-01 19:22           ` Peter Xu
@ 2020-12-01 20:11           ` Andrey Gruzdev
  1 sibling, 0 replies; 45+ messages in thread
From: Andrey Gruzdev @ 2020-12-01 20:11 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Krempa, qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini,
	Juan Quintela, Markus Armbruster, Peter Xu

On 01.12.2020 21:40, Dr. David Alan Gilbert wrote:
> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>> On 01.12.2020 13:53, Peter Krempa wrote:
>>> On Tue, Dec 01, 2020 at 11:42:18 +0300, Andrey Gruzdev wrote:
>>>> On 01.12.2020 10:08, Peter Krempa wrote:
>>>>> On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
>>>>>> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
>>>
>>> [...]
>>>
>>>>> Note that in cases when qemu can't guarantee that the
>>>>> background_snapshot feature will work it should not advertise it. We
>>>>> need a way to check whether it's possible to use it, so we can replace
>>>>> the existing --live flag with it rather than adding a new one and
>>>>> shifting the problem of checking whether the feature works to the user.
>>>>>
>>>>
>>>> Hi,
>>>>
>>>> Maybe you are using hugetlbfs as a memory backend?
>>>
>>> Not exactly hugepages, but I had:
>>>
>>>     <memoryBacking>
>>>       <access mode='shared'/>
>>>     </memoryBacking>
>>>
>>> which resulted in the following command line to instantiate memory:
>>>
>>> -object memory-backend-file,id=pc.ram,mem-path=/var/lib/libvirt/qemu/ram/6-upstream-bj/pc.ram,share=yes,size=33554432000,host-nodes=0,policy=bind \
>>>
>>> When I've removed it I got:
>>>
>>> -object memory-backend-ram,id=pc.ram,size=33554432000,host-nodes=0,policy=bind \
>>>
>>> And the migration didn't fail in my quick test. I'll have a more
>>> detailed look later, thanks for the pointer.
>>>
>>
>> Yep, seems that current userfaultfd supports hugetlbfs and shared memory for
>> missing pages but not for wr-protected..
> 
> For hugepages, you'd need kernel support - but also you'd want to make
> sure you write the whole hugepage at once.
> 
> For shared, there's a harder problem to ask; what happens if RAM is
> written by the other process - for postcopy, we get the other process
> to send us a userfaultfd that they have registered with their VM.
> 
> Dave
> 

Yep. It may be problematic. But if used with a vhost-user external
backend, it seems like it should work.

>>>> I totally agree that we need somehow check that kernel and VM memory backend
>>>> support the feature before one can enable the capability.
>>>> Need to think about that..
>>>
>>> Definitely. Also note that memory backed by memory-backend-file will be
>>> more and more common, for cases such as virtiofs DAX sharing and
>>> similar.
>>>
>>
>> I see.. That needs support from kernel side, so far 'background-snapshots'
>> are incompatible with memory-backend-file sharing.
>>
>> -- 
>> Andrey Gruzdev, Principal Engineer
>> Virtuozzo GmbH  +7-903-247-6397
>>                  virtuzzo.com
>>


-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com



* Re: [PATCH v4 0/6] UFFD write-tracking migration/snapshots
  2020-12-01 18:54         ` Peter Xu
  2020-12-01 20:00           ` Dr. David Alan Gilbert
@ 2020-12-01 20:26           ` Andrey Gruzdev
  1 sibling, 0 replies; 45+ messages in thread
From: Andrey Gruzdev @ 2020-12-01 20:26 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Krempa, qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini,
	Juan Quintela, Dr . David Alan Gilbert, Markus Armbruster,
	Andrea Arcangeli

On 01.12.2020 21:54, Peter Xu wrote:
> On Tue, Dec 01, 2020 at 02:24:12PM +0300, Andrey Gruzdev wrote:
>> On 01.12.2020 13:53, Peter Krempa wrote:
>>> On Tue, Dec 01, 2020 at 11:42:18 +0300, Andrey Gruzdev wrote:
>>>> On 01.12.2020 10:08, Peter Krempa wrote:
>>>>> On Thu, Nov 26, 2020 at 18:17:28 +0300, Andrey Gruzdev via wrote:
>>>>>> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
>>>
>>> [...]
>>>
>>>>> Note that in cases when qemu can't guarantee that the
>>>>> background_snapshot feature will work it should not advertise it. We
>>>>> need a way to check whether it's possible to use it, so we can replace
>>>>> the existing --live flag with it rather than adding a new one and
>>>>> shifting the problem of checking whether the feature works to the user.
> 
> Would it be fine if libvirt just tried the new way first anyway?  Since if it
> fails, it'll fail right away on any unsupported memory type, so
> logically the libvirt user may not even notice we've retried.
> 
> Previously I thought it was enough, because so far the kernel does not have a
> specific flag showing whether such type of memory is supported.  But I don't
> know whether it would be non-trivial for libvirt to retry like that.
> 
> Another solution is to let qemu test the uffd ioctls right after QEMU memory
> setup, so we know whether background/live snapshot is supported or not with
> current memory backends.  We would need to try this for every ramblock because
> I think we can have different types across all the qemu ramblocks.
> 

I've already added some code to perform a raw check of UFFD feature
support on the existing QEMU ram_list. That's fast: just try the REGISTER
ioctl with the required features for each migratable RAMBlock, then clean up.
So we can do it on migrate_set_capability. If some RAMBlock is not
compatible (several memory slots with different backends), we just give
up enabling the capability.

>>>>>
>>>>
>>>> Hi,
>>>>
>>>> Maybe you are using hugetlbfs as a memory backend?
>>>
>>> Not exactly hugepages, but I had:
>>>
>>>     <memoryBacking>
>>>       <access mode='shared'/>
>>>     </memoryBacking>
>>>
>>> which resulted in the following command line to instantiate memory:
>>>
>>> -object memory-backend-file,id=pc.ram,mem-path=/var/lib/libvirt/qemu/ram/6-upstream-bj/pc.ram,share=yes,size=33554432000,host-nodes=0,policy=bind \
>>>
>>> When I've removed it I got:
>>>
>>> -object memory-backend-ram,id=pc.ram,size=33554432000,host-nodes=0,policy=bind \
>>>
>>> And the migration didn't fail in my quick test. I'll have a more
>>> detailed look later, thanks for the pointer.
>>>
>>
>> Yep, seems that current userfaultfd supports hugetlbfs and shared memory for
>> missing pages but not for wr-protected..
> 
> Correct.  Btw, I'm working on both of them recently.  I have a testing kernel
> branch, but I don't think it should affect our qemu work, though, since qemu
> should do the same irrelevant of the memory type.  We can just test with
> anonymous memories, and as long as it works, it should work perfectly on all
> the rest of backends (maybe even for other file-backed memory, more below).
> 

Yes, I know. Nice improvement of userfaultfd feature, IMO.

>>
>>>> I totally agree that we need somehow check that kernel and VM memory backend
>>>> support the feature before one can enable the capability.
>>>> Need to think about that..
>>>
>>> Definitely. Also note that memory backed by memory-backend-file will be
>>> more and more common, for cases such as virtiofs DAX sharing and
>>> similar.
>>>
>>
>> I see.. That needs support from kernel side, so far 'background-snapshots'
>> are incompatible with memory-backend-file sharing.
> 
> Yes.  So as mentioned, shmem/hugetlbfs should be WIP, but I haven't thought
> about the rest yet.  Maybe... it's not hard to add uffd-wp for most of the
> file-backed memory?  Since afaict the kernel handles wr-protect in a quite
> straightforward way (do_wp_page() for whatever backend), and uffd-wp can be the
> first to trap all of them.  I'm not sure whether Andrea has thought about that
> or even on how to spread the missing usage to more types of backends (maybe
> missing is more special in that it needs to consider page caches).  So I'm
> copying Andrea too just in case there's further input.
> 
> Thanks,
> 

Maybe for file-backed memory it works from the start, but many
configurations come with shared hugetlbfs, which is not compatible with
background snapshotting now. We can't support even anonymous hugetlbfs at
the moment.

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com



* Re: [PATCH v4 0/6] UFFD write-tracking migration/snapshots
  2020-12-01 20:01             ` Dr. David Alan Gilbert
@ 2020-12-01 20:29               ` Andrey Gruzdev
  0 siblings, 0 replies; 45+ messages in thread
From: Andrey Gruzdev @ 2020-12-01 20:29 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Peter Xu
  Cc: Peter Krempa, qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini,
	Juan Quintela, Markus Armbruster

On 01.12.2020 23:01, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
>> On Tue, Dec 01, 2020 at 06:40:55PM +0000, Dr. David Alan Gilbert wrote:
>>>> Yep, seems that current userfaultfd supports hugetlbfs and shared memory for
>>>> missing pages but not for wr-protected..
>>>
>>> For hugepages, you'd need kernel support - but also you'd want to make
>>> sure you write the whole hugepage at once.
>>
>> Or we can do similar things by splitting the huge pages just like when we
>> migrate.
>>
>> I must have overlooked these facts when I replied previously - we do need the
>> same logic, but also special care for these special memory types.
>>
>>>
>>> For shared, there's a harder problem to ask; what happens if RAM is
>>> written by the other process - for postcopy, we get the other process
>>> to send us a userfaultfd that they have registered with their VM.
>>
>> Good point... so we should need similar things too.
>>
>> Looks like we'd better explicitly disable shmem/hugetlbfs for now in qemu
>> background snapshots before we have these utilities prepared, just in case it
>> gets run on some "future" kernel and accidentally gets enabled, in which case
>> the snapshot files could end up corrupted.
>>
>> Is shmem used a lot in libvirt, or is it even a default configuration?
> 
> No, but it's used with vhost-user applications; like dpdk.
> 
> Dave
> 
>> -- 
>> Peter Xu
>>

Yep.

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com



end of thread, other threads:[~2020-12-01 20:36 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-26 15:17 [PATCH v4 0/6] UFFD write-tracking migration/snapshots Andrey Gruzdev via
2020-11-26 15:17 ` [PATCH v4 1/6] introduce 'background-snapshot' migration capability Andrey Gruzdev via
2020-11-27 19:55   ` Peter Xu
2020-11-28 16:35     ` Andrey Gruzdev
2020-11-26 15:17 ` [PATCH v4 2/6] introduce UFFD-WP low-level interface helpers Andrey Gruzdev via
2020-11-27 21:04   ` Peter Xu
2020-11-29 20:12     ` Andrey Gruzdev
2020-11-30 15:34       ` Peter Xu
2020-11-30 18:41         ` Andrey Gruzdev
2020-12-01 12:24   ` Dr. David Alan Gilbert
2020-12-01 19:32     ` Andrey Gruzdev
2020-11-26 15:17 ` [PATCH v4 3/6] support UFFD write fault processing in ram_save_iterate() Andrey Gruzdev via
2020-11-27 21:49   ` Peter Xu
2020-11-29 21:14     ` Andrey Gruzdev
2020-11-30 16:32       ` Peter Xu
2020-11-30 19:27         ` Andrey Gruzdev
2020-11-26 15:17 ` [PATCH v4 4/6] implementation of background snapshot thread Andrey Gruzdev via
2020-11-26 15:17 ` [PATCH v4 5/6] the rest of write tracking migration code Andrey Gruzdev via
2020-11-27 22:26   ` Peter Xu
2020-11-30  8:09     ` Andrey Gruzdev
2020-11-26 15:17 ` [PATCH v4 6/6] introduce simple linear scan rate limiting mechanism Andrey Gruzdev via
2020-11-27 22:28   ` Peter Xu
2020-11-30  8:11     ` Andrey Gruzdev
2020-11-30 16:40       ` Peter Xu
2020-11-30 19:30         ` Andrey Gruzdev
2020-11-26 15:47 ` [PATCH v4 0/6] UFFD write-tracking migration/snapshots Peter Krempa
2020-11-27  8:21   ` Andrey Gruzdev
2020-11-27  9:49     ` Peter Krempa
2020-11-27 10:00       ` Andrey Gruzdev
2020-11-27 15:45         ` Peter Xu
2020-11-27 17:19           ` Andrey Gruzdev
2020-11-27 22:04 ` Peter Xu
2020-11-30  8:07   ` Andrey Gruzdev
2020-12-01  7:08 ` Peter Krempa
2020-12-01  8:42   ` Andrey Gruzdev
2020-12-01 10:53     ` Peter Krempa
2020-12-01 11:24       ` Andrey Gruzdev
2020-12-01 18:40         ` Dr. David Alan Gilbert
2020-12-01 19:22           ` Peter Xu
2020-12-01 20:01             ` Dr. David Alan Gilbert
2020-12-01 20:29               ` Andrey Gruzdev
2020-12-01 20:11           ` Andrey Gruzdev
2020-12-01 18:54         ` Peter Xu
2020-12-01 20:00           ` Dr. David Alan Gilbert
2020-12-01 20:26           ` Andrey Gruzdev

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.