* [PATCH v11 0/5] UFFD write-tracking migration/snapshots
@ 2021-01-06 15:21 Andrey Gruzdev via
  2021-01-06 15:21 ` [PATCH v11 1/5] migration: introduce 'background-snapshot' migration capability Andrey Gruzdev via
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: Andrey Gruzdev via @ 2021-01-06 15:21 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, Peter Xu,
	Markus Armbruster, Paolo Bonzini, Den Lunev, Andrey Gruzdev

This patch series is a rethinking of the ideas Denis Plotnikov implemented
in his series '[PATCH v0 0/4] migration: add background snapshot'.

Currently the only way to make an (external) live VM snapshot is to use the
existing dirty page logging migration mechanism. The main problem is that it
tends to produce many page duplicates while the running VM keeps updating
already saved pages. As a result, the vmstate image is commonly several times
bigger than the non-zero part of the virtual machine's RSS. The time required
to converge RAM migration and the size of the snapshot image depend heavily on
the guest memory write rate, sometimes resulting in unacceptably long snapshot
creation times and huge image sizes.

This series proposes a way to solve the aforementioned problems. It uses a
different RAM migration mechanism based on the UFFD write-protection
management introduced in the v5.7 kernel. The migration strategy is to 'freeze'
guest RAM content using write protection and iteratively release the protection
for memory ranges that have already been saved to the migration stream.
At the same time, pending UFFD write fault events are read in and the faulting
pages are saved out-of-order, with higher priority.

How to use:
1. Enable the background-snapshot migration capability:
   virsh qemu-monitor-command <domain> --hmp migrate_set_capability
   background-snapshot on

2. Start the external migration to a file
   virsh qemu-monitor-command <domain> --hmp migrate exec:'cat > ./vm_state'

3. Wait for the migration to finish and check that it has reached the
'completed' state.


Changes v10->v11:

* 1. Updated commit messages.

Changes v9->v10:

* 1. Fixed commit message for [PATCH v9 1/5].

Changes v8->v9:

* 1. Fixed wrong cover letter subject.

Changes v7->v8:

* 1. Fixed coding style problems to pass checkpatch.

Changes v6->v7:

* 1. Fixed background snapshot on suspended guest: call qemu_system_wakeup_request()
*    before stopping VM to make runstate transition valid.
* 2. Disabled dirty page logging and log sync when 'background-snapshot' is enabled.
* 3. Introduced 'userfaultfd-wrlat.py' script to analyze UFFD write fault latencies.

Changes v5->v6:

* 1. Account for possible hot plugging/unplugging of memory devices: don't cache
*    the write-tracking support level in migrate_query_write_tracking(); check it
*    each time one tries to enable the 'background-snapshot' capability.

Changes v4->v5:

* 1. Refactored util/userfaultfd.c code to support features required by postcopy.
* 2. Introduced checks for host kernel and guest memory backend compatibility
*    to 'background-snapshot' branch in migrate_caps_check().
* 3. Switched to using trace_xxx instead of info_report()/error_report() for
*    cases when the error message must be hidden (probing UFFD-IO) or the info
*    would litter the output if sent to stderr.
* 4. Added RCU_READ_LOCK_GUARDs to the code dealing with RAM block list.
* 5. Added memory_region_ref() for each RAM block being wr-protected.
* 6. Reused qemu_ram_block_from_host() instead of custom RAM block lookup routine.
* 7. Stopped using specific hwaddr/ram_addr_t in favour of void */uint64_t.
* 8. Dropped the 'linear-scan-rate-limiting' patch for now. The reason is that
*    the chosen criterion for high-latency fault detection (i.e. timestamp of
*    the UFFD event fetch) is not representative enough for this task.
*    At the moment it looks like a premature optimization effort.
* 9. Dropped some unnecessary/unused code.

Andrey Gruzdev (5):
  migration: introduce 'background-snapshot' migration capability
  migration: introduce UFFD-WP low-level interface helpers
  migration: support UFFD write fault processing in ram_save_iterate()
  migration: implementation of background snapshot thread
  migration: introduce 'userfaultfd-wrlat.py' script

 include/exec/memory.h        |   8 +
 include/qemu/userfaultfd.h   |  35 ++++
 migration/migration.c        | 365 ++++++++++++++++++++++++++++++++++-
 migration/migration.h        |   4 +
 migration/ram.c              | 288 ++++++++++++++++++++++++++-
 migration/ram.h              |   6 +
 migration/savevm.c           |   1 -
 migration/savevm.h           |   2 +
 migration/trace-events       |   2 +
 qapi/migration.json          |   7 +-
 scripts/userfaultfd-wrlat.py | 148 ++++++++++++++
 util/meson.build             |   1 +
 util/trace-events            |   9 +
 util/userfaultfd.c           | 345 +++++++++++++++++++++++++++++++++
 14 files changed, 1211 insertions(+), 10 deletions(-)
 create mode 100644 include/qemu/userfaultfd.h
 create mode 100755 scripts/userfaultfd-wrlat.py
 create mode 100644 util/userfaultfd.c

-- 
2.25.1



* [PATCH v11 1/5] migration: introduce 'background-snapshot' migration capability
  2021-01-06 15:21 [PATCH v11 0/5] UFFD write-tracking migration/snapshots Andrey Gruzdev via
@ 2021-01-06 15:21 ` Andrey Gruzdev via
  2021-01-06 15:21 ` [PATCH v11 2/5] migration: introduce UFFD-WP low-level interface helpers Andrey Gruzdev via
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Andrey Gruzdev via @ 2021-01-06 15:21 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, Peter Xu,
	Markus Armbruster, Paolo Bonzini, Den Lunev, Andrey Gruzdev

Add the new capability to the 'qapi/migration.json' schema.
Update migrate_caps_check() to validate the enabled capability set
against the newly introduced one. Perform checks for required kernel
features and compatibility with guest memory backends.

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
---
 migration/migration.c | 110 ++++++++++++++++++++++++++++++++++++++++++
 migration/migration.h |   1 +
 migration/ram.c       |  21 ++++++++
 migration/ram.h       |   4 ++
 qapi/migration.json   |   7 ++-
 5 files changed, 142 insertions(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index a5da718baa..2c2cb9ef01 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -57,6 +57,7 @@
 #include "net/announce.h"
 #include "qemu/queue.h"
 #include "multifd.h"
+#include "sysemu/cpus.h"
 
 #ifdef CONFIG_VFIO
 #include "hw/vfio/vfio-common.h"
@@ -133,6 +134,38 @@ enum mig_rp_message_type {
     MIG_RP_MSG_MAX
 };
 
+/* Migration capabilities set */
+struct MigrateCapsSet {
+    int size;                       /* Capability set size */
+    MigrationCapability caps[];     /* Variadic array of capabilities */
+};
+typedef struct MigrateCapsSet MigrateCapsSet;
+
+/* Define and initialize MigrateCapsSet */
+#define INITIALIZE_MIGRATE_CAPS_SET(_name, ...)   \
+    MigrateCapsSet _name = {    \
+        .size = sizeof((int []) { __VA_ARGS__ }) / sizeof(int), \
+        .caps = { __VA_ARGS__ } \
+    }
+
+/* Background-snapshot compatibility check list */
+static const
+INITIALIZE_MIGRATE_CAPS_SET(check_caps_background_snapshot,
+    MIGRATION_CAPABILITY_POSTCOPY_RAM,
+    MIGRATION_CAPABILITY_DIRTY_BITMAPS,
+    MIGRATION_CAPABILITY_POSTCOPY_BLOCKTIME,
+    MIGRATION_CAPABILITY_LATE_BLOCK_ACTIVATE,
+    MIGRATION_CAPABILITY_RETURN_PATH,
+    MIGRATION_CAPABILITY_MULTIFD,
+    MIGRATION_CAPABILITY_PAUSE_BEFORE_SWITCHOVER,
+    MIGRATION_CAPABILITY_AUTO_CONVERGE,
+    MIGRATION_CAPABILITY_RELEASE_RAM,
+    MIGRATION_CAPABILITY_RDMA_PIN_ALL,
+    MIGRATION_CAPABILITY_COMPRESS,
+    MIGRATION_CAPABILITY_XBZRLE,
+    MIGRATION_CAPABILITY_X_COLO,
+    MIGRATION_CAPABILITY_VALIDATE_UUID);
+
 /* When we add fault tolerance, we could have several
    migrations at once.  For now we don't need to add
    dynamic creation of migration */
@@ -1089,6 +1122,33 @@ static void fill_source_migration_info(MigrationInfo *info)
     info->status = s->state;
 }
 
+#ifdef CONFIG_LINUX
+typedef enum WriteTrackingSupport {
+    WT_SUPPORT_UNKNOWN = 0,
+    WT_SUPPORT_ABSENT,
+    WT_SUPPORT_AVAILABLE,
+    WT_SUPPORT_COMPATIBLE
+} WriteTrackingSupport;
+
+static
+WriteTrackingSupport migrate_query_write_tracking(void)
+{
+    /* Check if kernel supports required UFFD features */
+    if (!ram_write_tracking_available()) {
+        return WT_SUPPORT_ABSENT;
+    }
+    /*
+     * Check if current memory configuration is
+     * compatible with required UFFD features.
+     */
+    if (!ram_write_tracking_compatible()) {
+        return WT_SUPPORT_AVAILABLE;
+    }
+
+    return WT_SUPPORT_COMPATIBLE;
+}
+#endif /* CONFIG_LINUX */
+
 /**
  * @migration_caps_check - check capability validity
  *
@@ -1150,6 +1210,45 @@ static bool migrate_caps_check(bool *cap_list,
         }
     }
 
+    if (cap_list[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT]) {
+#ifdef CONFIG_LINUX
+        WriteTrackingSupport wt_support;
+        int idx;
+        /*
+         * Check if 'background-snapshot' capability is supported by
+         * host kernel and compatible with guest memory configuration.
+         */
+        wt_support = migrate_query_write_tracking();
+        if (wt_support < WT_SUPPORT_AVAILABLE) {
+            error_setg(errp, "Background-snapshot is not supported by host kernel");
+            return false;
+        }
+        if (wt_support < WT_SUPPORT_COMPATIBLE) {
+            error_setg(errp, "Background-snapshot is not compatible "
+                    "with guest memory configuration");
+            return false;
+        }
+
+        /*
+         * Check if there are any migration capabilities
+         * incompatible with 'background-snapshot'.
+         */
+        for (idx = 0; idx < check_caps_background_snapshot.size; idx++) {
+            int incomp_cap = check_caps_background_snapshot.caps[idx];
+            if (cap_list[incomp_cap]) {
+                error_setg(errp,
+                        "Background-snapshot is not compatible with %s",
+                        MigrationCapability_str(incomp_cap));
+                return false;
+            }
+        }
+#else
+        error_setg(errp,
+                "Background-snapshot is not supported on non-Linux hosts");
+        return false;
+#endif
+    }
+
     return true;
 }
 
@@ -2477,6 +2576,15 @@ bool migrate_use_block_incremental(void)
     return s->parameters.block_incremental;
 }
 
+bool migrate_background_snapshot(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT];
+}
+
 /* migration thread support */
 /*
  * Something bad happened to the RP stream, mark an error
@@ -3770,6 +3878,8 @@ static Property migration_properties[] = {
     DEFINE_PROP_MIG_CAP("x-block", MIGRATION_CAPABILITY_BLOCK),
     DEFINE_PROP_MIG_CAP("x-return-path", MIGRATION_CAPABILITY_RETURN_PATH),
     DEFINE_PROP_MIG_CAP("x-multifd", MIGRATION_CAPABILITY_MULTIFD),
+    DEFINE_PROP_MIG_CAP("x-background-snapshot",
+            MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT),
 
     DEFINE_PROP_END_OF_LIST(),
 };
diff --git a/migration/migration.h b/migration/migration.h
index d096b77f74..f40338cfbf 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -341,6 +341,7 @@ int migrate_compress_wait_thread(void);
 int migrate_decompress_threads(void);
 bool migrate_use_events(void);
 bool migrate_postcopy_blocktime(void);
+bool migrate_background_snapshot(void);
 
 /* Sending on the return path - generic and then for each message type */
 void migrate_send_rp_shut(MigrationIncomingState *mis,
diff --git a/migration/ram.c b/migration/ram.c
index 7811cde643..ae8de17153 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3788,6 +3788,27 @@ static int ram_resume_prepare(MigrationState *s, void *opaque)
     return 0;
 }
 
+/* ram_write_tracking_available: check if kernel supports required UFFD features
+ *
+ * Returns true if supports, false otherwise
+ */
+bool ram_write_tracking_available(void)
+{
+    /* TODO: implement */
+    return false;
+}
+
+/* ram_write_tracking_compatible: check if guest configuration is
+ *   compatible with 'write-tracking'
+ *
+ * Returns true if compatible, false otherwise
+ */
+bool ram_write_tracking_compatible(void)
+{
+    /* TODO: implement */
+    return false;
+}
+
 static SaveVMHandlers savevm_ram_handlers = {
     .save_setup = ram_save_setup,
     .save_live_iterate = ram_save_iterate,
diff --git a/migration/ram.h b/migration/ram.h
index 011e85414e..1a9ff90304 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -79,4 +79,8 @@ void colo_flush_ram_cache(void);
 void colo_release_ram_cache(void);
 void colo_incoming_start_dirty_log(void);
 
+/* Background snapshot */
+bool ram_write_tracking_available(void);
+bool ram_write_tracking_compatible(void);
+
 #endif
diff --git a/qapi/migration.json b/qapi/migration.json
index d1d9632c2a..6c12b368aa 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -442,6 +442,11 @@
 # @validate-uuid: Send the UUID of the source to allow the destination
 #                 to ensure it is the same. (since 4.2)
 #
+# @background-snapshot: If enabled, the migration stream will be a snapshot
+#                       of the VM exactly at the point when the migration
+#                       procedure starts. The VM RAM is saved with running VM.
+#                       (since 6.0)
+#
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
@@ -449,7 +454,7 @@
            'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
            'block', 'return-path', 'pause-before-switchover', 'multifd',
            'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate',
-           'x-ignore-shared', 'validate-uuid' ] }
+           'x-ignore-shared', 'validate-uuid', 'background-snapshot'] }
 
 ##
 # @MigrationCapabilityStatus:
-- 
2.25.1




* [PATCH v11 2/5] migration: introduce UFFD-WP low-level interface helpers
  2021-01-06 15:21 [PATCH v11 0/5] UFFD write-tracking migration/snapshots Andrey Gruzdev via
  2021-01-06 15:21 ` [PATCH v11 1/5] migration: introduce 'background-snapshot' migration capability Andrey Gruzdev via
@ 2021-01-06 15:21 ` Andrey Gruzdev via
  2021-01-06 15:21 ` [PATCH v11 3/5] migration: support UFFD write fault processing in ram_save_iterate() Andrey Gruzdev via
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Andrey Gruzdev via @ 2021-01-06 15:21 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, Peter Xu,
	Markus Armbruster, Paolo Bonzini, Den Lunev, Andrey Gruzdev

Glue code to the userfaultfd kernel implementation:
querying feature support, creating the file descriptor, feature control,
memory region registration, and IOCTLs on registered regions.

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h      |   1 +
 include/qemu/userfaultfd.h |  35 ++++
 util/meson.build           |   1 +
 util/trace-events          |   9 +
 util/userfaultfd.c         | 345 +++++++++++++++++++++++++++++++++++++
 5 files changed, 391 insertions(+)
 create mode 100644 include/qemu/userfaultfd.h
 create mode 100644 util/userfaultfd.c

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 521d9901d7..b76b1256bf 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -149,6 +149,7 @@ typedef struct IOMMUTLBEvent {
 /* RAM is a persistent kind memory */
 #define RAM_PMEM (1 << 5)
 
+
 static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
                                        IOMMUNotifierFlag flags,
                                        hwaddr start, hwaddr end,
diff --git a/include/qemu/userfaultfd.h b/include/qemu/userfaultfd.h
new file mode 100644
index 0000000000..6b74f92792
--- /dev/null
+++ b/include/qemu/userfaultfd.h
@@ -0,0 +1,35 @@
+/*
+ * Linux UFFD-WP support
+ *
+ * Copyright Virtuozzo GmbH, 2020
+ *
+ * Authors:
+ *  Andrey Gruzdev   <andrey.gruzdev@virtuozzo.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#ifndef USERFAULTFD_H
+#define USERFAULTFD_H
+
+#include "qemu/osdep.h"
+#include "exec/hwaddr.h"
+#include <linux/userfaultfd.h>
+
+int uffd_query_features(uint64_t *features);
+int uffd_create_fd(uint64_t features, bool non_blocking);
+void uffd_close_fd(int uffd_fd);
+int uffd_register_memory(int uffd_fd, void *addr, uint64_t length,
+        uint64_t mode, uint64_t *ioctls);
+int uffd_unregister_memory(int uffd_fd, void *addr, uint64_t length);
+int uffd_change_protection(int uffd_fd, void *addr, uint64_t length,
+        bool wp, bool dont_wake);
+int uffd_copy_page(int uffd_fd, void *dst_addr, void *src_addr,
+        uint64_t length, bool dont_wake);
+int uffd_zero_page(int uffd_fd, void *addr, uint64_t length, bool dont_wake);
+int uffd_wakeup(int uffd_fd, void *addr, uint64_t length);
+int uffd_read_events(int uffd_fd, struct uffd_msg *msgs, int count);
+bool uffd_poll_events(int uffd_fd, int tmo);
+
+#endif /* USERFAULTFD_H */
diff --git a/util/meson.build b/util/meson.build
index f359af0d46..c64bfe94b3 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -50,6 +50,7 @@ endif
 
 if have_system
   util_ss.add(when: 'CONFIG_GIO', if_true: [files('dbus.c'), gio])
+  util_ss.add(when: 'CONFIG_LINUX', if_true: files('userfaultfd.c'))
 endif
 
 if have_block
diff --git a/util/trace-events b/util/trace-events
index 61e0d4bcdf..bac0924899 100644
--- a/util/trace-events
+++ b/util/trace-events
@@ -91,3 +91,12 @@ qemu_vfio_pci_read_config(void *buf, int ofs, int size, uint64_t region_ofs, uin
 qemu_vfio_pci_write_config(void *buf, int ofs, int size, uint64_t region_ofs, uint64_t region_size) "write cfg ptr %p ofs 0x%x size 0x%x (region addr 0x%"PRIx64" size 0x%"PRIx64")"
 qemu_vfio_region_info(const char *desc, uint64_t region_ofs, uint64_t region_size, uint32_t cap_offset) "region '%s' addr 0x%"PRIx64" size 0x%"PRIx64" cap_ofs 0x%"PRIx32
 qemu_vfio_pci_map_bar(int index, uint64_t region_ofs, uint64_t region_size, int ofs, void *host) "map region bar#%d addr 0x%"PRIx64" size 0x%"PRIx64" ofs 0x%x host %p"
+
+#userfaultfd.c
+uffd_query_features_nosys(int err) "errno: %i"
+uffd_query_features_api_failed(int err) "errno: %i"
+uffd_create_fd_nosys(int err) "errno: %i"
+uffd_create_fd_api_failed(int err) "errno: %i"
+uffd_create_fd_api_noioctl(uint64_t ioctl_req, uint64_t ioctl_supp) "ioctl_req: 0x%" PRIx64 "ioctl_supp: 0x%" PRIx64
+uffd_register_memory_failed(void *addr, uint64_t length, uint64_t mode, int err) "addr: %p length: %" PRIu64 " mode: 0x%" PRIx64 " errno: %i"
+uffd_unregister_memory_failed(void *addr, uint64_t length, int err) "addr: %p length: %" PRIu64 " errno: %i"
diff --git a/util/userfaultfd.c b/util/userfaultfd.c
new file mode 100644
index 0000000000..def50675b1
--- /dev/null
+++ b/util/userfaultfd.c
@@ -0,0 +1,345 @@
+/*
+ * Linux UFFD-WP support
+ *
+ * Copyright Virtuozzo GmbH, 2020
+ *
+ * Authors:
+ *  Andrey Gruzdev   <andrey.gruzdev@virtuozzo.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/bitops.h"
+#include "qemu/error-report.h"
+#include "qemu/userfaultfd.h"
+#include "trace.h"
+#include <poll.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+
+/**
+ * uffd_query_features: query UFFD features
+ *
+ * Returns: 0 on success, negative value in case of an error
+ *
+ * @features: parameter to receive 'uffdio_api.features'
+ */
+int uffd_query_features(uint64_t *features)
+{
+    int uffd_fd;
+    struct uffdio_api api_struct = { 0 };
+    int ret = -1;
+
+    uffd_fd = syscall(__NR_userfaultfd, O_CLOEXEC);
+    if (uffd_fd < 0) {
+        trace_uffd_query_features_nosys(errno);
+        return -1;
+    }
+
+    api_struct.api = UFFD_API;
+    api_struct.features = 0;
+
+    if (ioctl(uffd_fd, UFFDIO_API, &api_struct)) {
+        trace_uffd_query_features_api_failed(errno);
+        goto out;
+    }
+    *features = api_struct.features;
+    ret = 0;
+
+out:
+    close(uffd_fd);
+    return ret;
+}
+
+/**
+ * uffd_create_fd: create UFFD file descriptor
+ *
+ * Returns non-negative file descriptor or negative value in case of an error
+ *
+ * @features: UFFD features to request
+ * @non_blocking: create UFFD file descriptor for non-blocking operation
+ */
+int uffd_create_fd(uint64_t features, bool non_blocking)
+{
+    int uffd_fd;
+    int flags;
+    struct uffdio_api api_struct = { 0 };
+    uint64_t ioctl_mask = BIT(_UFFDIO_REGISTER) | BIT(_UFFDIO_UNREGISTER);
+
+    flags = O_CLOEXEC | (non_blocking ? O_NONBLOCK : 0);
+    uffd_fd = syscall(__NR_userfaultfd, flags);
+    if (uffd_fd < 0) {
+        trace_uffd_create_fd_nosys(errno);
+        return -1;
+    }
+
+    api_struct.api = UFFD_API;
+    api_struct.features = features;
+    if (ioctl(uffd_fd, UFFDIO_API, &api_struct)) {
+        trace_uffd_create_fd_api_failed(errno);
+        goto fail;
+    }
+    if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) {
+        trace_uffd_create_fd_api_noioctl(ioctl_mask, api_struct.ioctls);
+        goto fail;
+    }
+
+    return uffd_fd;
+
+fail:
+    close(uffd_fd);
+    return -1;
+}
+
+/**
+ * uffd_close_fd: close UFFD file descriptor
+ *
+ * @uffd_fd: UFFD file descriptor
+ */
+void uffd_close_fd(int uffd_fd)
+{
+    assert(uffd_fd >= 0);
+    close(uffd_fd);
+}
+
+/**
+ * uffd_register_memory: register memory range via UFFD-IO
+ *
+ * Returns 0 in case of success, negative value in case of an error
+ *
+ * @uffd_fd: UFFD file descriptor
+ * @addr: base address of memory range
+ * @length: length of memory range
+ * @mode: UFFD register mode (UFFDIO_REGISTER_MODE_MISSING, ...)
+ * @ioctls: optional pointer to receive supported IOCTL mask
+ */
+int uffd_register_memory(int uffd_fd, void *addr, uint64_t length,
+        uint64_t mode, uint64_t *ioctls)
+{
+    struct uffdio_register uffd_register;
+
+    uffd_register.range.start = (uint64_t) addr;
+    uffd_register.range.len = length;
+    uffd_register.mode = mode;
+
+    if (ioctl(uffd_fd, UFFDIO_REGISTER, &uffd_register)) {
+        trace_uffd_register_memory_failed(addr, length, mode, errno);
+        return -1;
+    }
+    if (ioctls) {
+        *ioctls = uffd_register.ioctls;
+    }
+
+    return 0;
+}
+
+/**
+ * uffd_unregister_memory: un-register memory range with UFFD-IO
+ *
+ * Returns 0 in case of success, negative value in case of an error
+ *
+ * @uffd_fd: UFFD file descriptor
+ * @addr: base address of memory range
+ * @length: length of memory range
+ */
+int uffd_unregister_memory(int uffd_fd, void *addr, uint64_t length)
+{
+    struct uffdio_range uffd_range;
+
+    uffd_range.start = (uint64_t) addr;
+    uffd_range.len = length;
+
+    if (ioctl(uffd_fd, UFFDIO_UNREGISTER, &uffd_range)) {
+        trace_uffd_unregister_memory_failed(addr, length, errno);
+        return -1;
+    }
+
+    return 0;
+}
+
+/**
+ * uffd_change_protection: protect/un-protect memory range for writes via UFFD-IO
+ *
+ * Returns 0 on success, negative value in case of error
+ *
+ * @uffd_fd: UFFD file descriptor
+ * @addr: base address of memory range
+ * @length: length of memory range
+ * @wp: write-protect/unprotect
+ * @dont_wake: do not wake threads waiting on wr-protected page
+ */
+int uffd_change_protection(int uffd_fd, void *addr, uint64_t length,
+        bool wp, bool dont_wake)
+{
+    struct uffdio_writeprotect uffd_writeprotect;
+
+    uffd_writeprotect.range.start = (uint64_t) addr;
+    uffd_writeprotect.range.len = length;
+    if (!wp && dont_wake) {
+        /* DONTWAKE is meaningful only on protection release */
+        uffd_writeprotect.mode = UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
+    } else {
+        uffd_writeprotect.mode = (wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0);
+    }
+
+    if (ioctl(uffd_fd, UFFDIO_WRITEPROTECT, &uffd_writeprotect)) {
+        error_report("uffd_change_protection() failed: addr=%p len=%" PRIu64
+                " mode=%" PRIx64 " errno=%i", addr, length,
+                (uint64_t) uffd_writeprotect.mode, errno);
+        return -1;
+    }
+
+    return 0;
+}
+
+/**
+ * uffd_copy_page: copy range of pages to destination via UFFD-IO
+ *
+ * Copy range of source pages to the destination to resolve
+ * missing page fault somewhere in the destination range.
+ *
+ * Returns 0 on success, negative value in case of an error
+ *
+ * @uffd_fd: UFFD file descriptor
+ * @dst_addr: destination base address
+ * @src_addr: source base address
+ * @length: length of the range to copy
+ * @dont_wake: do not wake threads waiting on missing page
+ */
+int uffd_copy_page(int uffd_fd, void *dst_addr, void *src_addr,
+        uint64_t length, bool dont_wake)
+{
+    struct uffdio_copy uffd_copy;
+
+    uffd_copy.dst = (uint64_t) dst_addr;
+    uffd_copy.src = (uint64_t) src_addr;
+    uffd_copy.len = length;
+    uffd_copy.mode = dont_wake ? UFFDIO_COPY_MODE_DONTWAKE : 0;
+
+    if (ioctl(uffd_fd, UFFDIO_COPY, &uffd_copy)) {
+        error_report("uffd_copy_page() failed: dst_addr=%p src_addr=%p length=%" PRIu64
+                " mode=%" PRIx64 " errno=%i", dst_addr, src_addr,
+                length, (uint64_t) uffd_copy.mode, errno);
+        return -1;
+    }
+
+    return 0;
+}
+
+/**
+ * uffd_zero_page: fill range of pages with zeroes via UFFD-IO
+ *
+ * Fill a range of pages with zeroes to resolve missing page faults within the range.
+ *
+ * Returns 0 on success, negative value in case of an error
+ *
+ * @uffd_fd: UFFD file descriptor
+ * @addr: base address
+ * @length: length of the range to fill with zeroes
+ * @dont_wake: do not wake threads waiting on missing page
+ */
+int uffd_zero_page(int uffd_fd, void *addr, uint64_t length, bool dont_wake)
+{
+    struct uffdio_zeropage uffd_zeropage;
+
+    uffd_zeropage.range.start = (uint64_t) addr;
+    uffd_zeropage.range.len = length;
+    uffd_zeropage.mode = dont_wake ? UFFDIO_ZEROPAGE_MODE_DONTWAKE : 0;
+
+    if (ioctl(uffd_fd, UFFDIO_ZEROPAGE, &uffd_zeropage)) {
+        error_report("uffd_zero_page() failed: addr=%p length=%" PRIu64
+                " mode=%" PRIx64 " errno=%i", addr, length,
+                (uint64_t) uffd_zeropage.mode, errno);
+        return -1;
+    }
+
+    return 0;
+}
+
+/**
+ * uffd_wakeup: wake up threads waiting on UFFD-managed page fault resolution
+ *
+ * Wake up threads waiting on any page/pages from the designated range.
+ * The main use case is when during some period, page faults are resolved
+ * via UFFD-IO IOCTLs with MODE_DONTWAKE flag set, then after that all waits
+ * for the whole memory range are satisfied in a single call to uffd_wakeup().
+ *
+ * Returns 0 on success, negative value in case of an error
+ *
+ * @uffd_fd: UFFD file descriptor
+ * @addr: base address
+ * @length: length of the range
+ */
+int uffd_wakeup(int uffd_fd, void *addr, uint64_t length)
+{
+    struct uffdio_range uffd_range;
+
+    uffd_range.start = (uint64_t) addr;
+    uffd_range.len = length;
+
+    if (ioctl(uffd_fd, UFFDIO_WAKE, &uffd_range)) {
+        error_report("uffd_wakeup() failed: addr=%p length=%" PRIu64 " errno=%i",
+                addr, length, errno);
+        return -1;
+    }
+
+    return 0;
+}
+
+/**
+ * uffd_read_events: read pending UFFD events
+ *
+ * Returns number of fetched messages, 0 if none are available or
+ * negative value in case of an error
+ *
+ * @uffd_fd: UFFD file descriptor
+ * @msgs: pointer to message buffer
+ * @count: number of messages that can fit in the buffer
+ */
+int uffd_read_events(int uffd_fd, struct uffd_msg *msgs, int count)
+{
+    ssize_t res;
+    do {
+        res = read(uffd_fd, msgs, count * sizeof(struct uffd_msg));
+    } while (res < 0 && errno == EINTR);
+
+    if ((res < 0 && errno == EAGAIN)) {
+        return 0;
+    }
+    if (res < 0) {
+        error_report("uffd_read_events() failed: errno=%i", errno);
+        return -1;
+    }
+
+    return (int) (res / sizeof(struct uffd_msg));
+}
+
+/**
+ * uffd_poll_events: poll UFFD file descriptor for read
+ *
+ * Returns true if events are available for read, false otherwise
+ *
+ * @uffd_fd: UFFD file descriptor
+ * @tmo: timeout value
+ */
+bool uffd_poll_events(int uffd_fd, int tmo)
+{
+    int res;
+    struct pollfd poll_fd = { .fd = uffd_fd, .events = POLLIN, .revents = 0 };
+
+    do {
+        res = poll(&poll_fd, 1, tmo);
+    } while (res < 0 && errno == EINTR);
+
+    if (res == 0) {
+        return false;
+    }
+    if (res < 0) {
+        error_report("uffd_poll_events() failed: errno=%i", errno);
+        return false;
+    }
+
+    return (poll_fd.revents & POLLIN) != 0;
+}
-- 
2.25.1




* [PATCH v11 3/5] migration: support UFFD write fault processing in ram_save_iterate()
  2021-01-06 15:21 [PATCH v11 0/5] UFFD write-tracking migration/snapshots Andrey Gruzdev via
  2021-01-06 15:21 ` [PATCH v11 1/5] migration: introduce 'background-snapshot' migration capability Andrey Gruzdev via
  2021-01-06 15:21 ` [PATCH v11 2/5] migration: introduce UFFD-WP low-level interface helpers Andrey Gruzdev via
@ 2021-01-06 15:21 ` Andrey Gruzdev via
  2021-01-19 17:49   ` Dr. David Alan Gilbert
  2021-01-06 15:21 ` [PATCH v11 4/5] migration: implementation of background snapshot thread Andrey Gruzdev via
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Andrey Gruzdev via @ 2021-01-06 15:21 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, Peter Xu,
	Markus Armbruster, Paolo Bonzini, Den Lunev, Andrey Gruzdev

In this particular implementation the same single migration
thread is responsible for both normal linear dirty page
migration and processing UFFD page fault events.

Processing write faults includes reading the UFFD file descriptor,
finding the respective RAM block and saving the faulting page to
the migration stream. After the page has been saved, write protection
can be removed. Since the asynchronous version of qemu_put_buffer()
is expected to be used to save pages, we also have to flush the
migration stream prior to un-protecting the saved memory range.

Write protection is removed for any previously protected
memory chunk that has hit the migration stream. That holds
for pages from the linear page scan as well as for write fault pages.

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
Acked-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h  |   7 ++
 migration/ram.c        | 269 +++++++++++++++++++++++++++++++++++++++--
 migration/ram.h        |   2 +
 migration/trace-events |   2 +
 4 files changed, 272 insertions(+), 8 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index b76b1256bf..1aa1c6a3f4 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -150,6 +150,13 @@ typedef struct IOMMUTLBEvent {
 #define RAM_PMEM (1 << 5)
 
 
+/*
+ * UFFDIO_WRITEPROTECT is used on this RAMBlock to
+ * support 'write-tracking' migration type.
+ * Implies ram_state->ram_wt_enabled.
+ */
+#define RAM_UF_WRITEPROTECT (1 << 6)
+
 static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
                                        IOMMUNotifierFlag flags,
                                        hwaddr start, hwaddr end,
diff --git a/migration/ram.c b/migration/ram.c
index ae8de17153..5707382db1 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -56,6 +56,11 @@
 #include "savevm.h"
 #include "qemu/iov.h"
 #include "multifd.h"
+#include "sysemu/runstate.h"
+
+#ifdef CONFIG_LINUX
+#include "qemu/userfaultfd.h"
+#endif
 
 /***********************************************************/
 /* ram save/restore */
@@ -298,6 +303,8 @@ struct RAMSrcPageRequest {
 struct RAMState {
     /* QEMUFile used for this migration */
     QEMUFile *f;
+    /* UFFD file descriptor, used in 'write-tracking' migration */
+    int uffdio_fd;
     /* Last block that we have visited searching for dirty pages */
     RAMBlock *last_seen_block;
     /* Last block from where we have sent data */
@@ -1434,6 +1441,40 @@ static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset)
     return block;
 }
 
+#ifdef CONFIG_LINUX
+/**
+ * poll_fault_page: try to get next UFFD write fault page and, if pending fault
+ *   is found, return RAM block pointer and page offset
+ *
+ * Returns pointer to the RAMBlock containing faulting page,
+ *   NULL if no write faults are pending
+ *
+ * @rs: current RAM state
+ * @offset: page offset from the beginning of the block
+ */
+static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset)
+{
+    struct uffd_msg uffd_msg;
+    void *page_address;
+    RAMBlock *bs;
+    int res;
+
+    if (!migrate_background_snapshot()) {
+        return NULL;
+    }
+
+    res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1);
+    if (res <= 0) {
+        return NULL;
+    }
+
+    page_address = (void *) uffd_msg.arg.pagefault.address;
+    bs = qemu_ram_block_from_host(page_address, false, offset);
+    assert(bs && (bs->flags & RAM_UF_WRITEPROTECT) != 0);
+    return bs;
+}
+#endif /* CONFIG_LINUX */
+
 /**
  * get_queued_page: unqueue a page from the postcopy requests
  *
@@ -1473,6 +1514,16 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
 
     } while (block && !dirty);
 
+#ifdef CONFIG_LINUX
+    if (!block) {
+        /*
+         * Poll write faults too if background snapshot is enabled; that's
+         * when vCPUs can get blocked on write-protected pages.
+         */
+        block = poll_fault_page(rs, &offset);
+    }
+#endif /* CONFIG_LINUX */
+
     if (block) {
         /*
          * As soon as we start servicing pages out of order, then we have
@@ -1746,6 +1797,53 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
     return pages;
 }
 
+/**
+ * ram_save_host_page_pre: ram_save_host_page() pre-notifier
+ *
+ * @rs: current RAM state
+ * @pss: page-search-status structure
+ * @opaque: pointer to receive opaque context value
+ */
+static inline
+void ram_save_host_page_pre(RAMState *rs, PageSearchStatus *pss, void **opaque)
+{
+    *(unsigned long *) opaque = pss->page;
+}
+
+/**
+ * ram_save_host_page_post: ram_save_host_page() post-notifier
+ *
+ * @rs: current RAM state
+ * @pss: page-search-status structure
+ * @opaque: opaque context value
+ * @res_override: pointer to the return value of ram_save_host_page(),
+ *   overwritten in case of an error
+ */
+static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss,
+        void *opaque, int *res_override)
+{
+#ifdef CONFIG_LINUX
+    /* Check if page is from UFFD-managed region. */
+    if (pss->block->flags & RAM_UF_WRITEPROTECT) {
+        unsigned long page_from = (unsigned long) opaque;
+
+        void *page_address = pss->block->host + (page_from << TARGET_PAGE_BITS);
+        uint64_t run_length = (pss->page - page_from + 1) << TARGET_PAGE_BITS;
+        int res;
+
+        /* Flush async buffers before un-protect. */
+        qemu_fflush(rs->f);
+        /* Un-protect memory range. */
+        res = uffd_change_protection(rs->uffdio_fd, page_address, run_length,
+                false, false);
+        /* We don't want to override existing error from ram_save_host_page(). */
+        if (res < 0 && *res_override >= 0) {
+            *res_override = res;
+        }
+    }
+#endif /* CONFIG_LINUX */
+}
+
 /**
  * ram_find_and_save_block: finds a dirty page and sends it to f
  *
@@ -1790,7 +1888,11 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
         }
 
         if (found) {
+            void *opaque;
+
+            ram_save_host_page_pre(rs, &pss, &opaque);
             pages = ram_save_host_page(rs, &pss, last_stage);
+            ram_save_host_page_post(rs, &pss, opaque, &pages);
         }
     } while (!pages && again);
 
@@ -1880,10 +1982,13 @@ static void ram_save_cleanup(void *opaque)
     RAMState **rsp = opaque;
     RAMBlock *block;
 
-    /* caller have hold iothread lock or is in a bh, so there is
-     * no writing race against the migration bitmap
-     */
-    memory_global_dirty_log_stop();
+    /* We don't use dirty log with background snapshots */
+    if (!migrate_background_snapshot()) {
+        /* caller have hold iothread lock or is in a bh, so there is
+         * no writing race against the migration bitmap
+         */
+        memory_global_dirty_log_stop();
+    }
 
     RAMBLOCK_FOREACH_NOT_IGNORED(block) {
         g_free(block->clear_bmap);
@@ -2343,8 +2448,11 @@ static void ram_init_bitmaps(RAMState *rs)
 
     WITH_RCU_READ_LOCK_GUARD() {
         ram_list_init_bitmaps();
-        memory_global_dirty_log_start();
-        migration_bitmap_sync_precopy(rs);
+        /* We don't use dirty log with background snapshots */
+        if (!migrate_background_snapshot()) {
+            memory_global_dirty_log_start();
+            migration_bitmap_sync_precopy(rs);
+        }
     }
     qemu_mutex_unlock_ramlist();
     qemu_mutex_unlock_iothread();
@@ -3794,7 +3902,14 @@ static int ram_resume_prepare(MigrationState *s, void *opaque)
  */
 bool ram_write_tracking_available(void)
 {
-    /* TODO: implement */
+#ifdef CONFIG_LINUX
+    uint64_t uffd_features;
+    int res;
+
+    res = uffd_query_features(&uffd_features);
+    return (res == 0 &&
+            (uffd_features & UFFD_FEATURE_PAGEFAULT_FLAG_WP) != 0);
+#endif
     return false;
 }
 
@@ -3805,10 +3920,148 @@ bool ram_write_tracking_available(void)
  */
 bool ram_write_tracking_compatible(void)
 {
-    /* TODO: implement */
+#ifdef CONFIG_LINUX
+    const uint64_t uffd_ioctls_mask = BIT(_UFFDIO_WRITEPROTECT);
+    int uffd_fd;
+    RAMBlock *bs;
+    bool ret = false;
+
+    /* Open UFFD file descriptor */
+    uffd_fd = uffd_create_fd(UFFD_FEATURE_PAGEFAULT_FLAG_WP, false);
+    if (uffd_fd < 0) {
+        return false;
+    }
+
+    RCU_READ_LOCK_GUARD();
+
+    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
+        uint64_t uffd_ioctls;
+
+        /* Nothing to do with read-only and MMIO-writable regions */
+        if (bs->mr->readonly || bs->mr->rom_device) {
+            continue;
+        }
+        /* Try to register block memory via UFFD-IO to track writes */
+        if (uffd_register_memory(uffd_fd, bs->host, bs->max_length,
+                UFFDIO_REGISTER_MODE_WP, &uffd_ioctls)) {
+            goto out;
+        }
+        if ((uffd_ioctls & uffd_ioctls_mask) != uffd_ioctls_mask) {
+            goto out;
+        }
+    }
+    ret = true;
+
+out:
+    uffd_close_fd(uffd_fd);
+    return ret;
+#endif
     return false;
 }
 
+/*
+ * ram_write_tracking_start: start UFFD-WP memory tracking
+ *
+ * Returns 0 for success or negative value in case of error
+ */
+int ram_write_tracking_start(void)
+{
+#ifdef CONFIG_LINUX
+    int uffd_fd;
+    RAMState *rs = ram_state;
+    RAMBlock *bs;
+
+    /* Open UFFD file descriptor */
+    uffd_fd = uffd_create_fd(UFFD_FEATURE_PAGEFAULT_FLAG_WP, true);
+    if (uffd_fd < 0) {
+        return uffd_fd;
+    }
+    rs->uffdio_fd = uffd_fd;
+
+    RCU_READ_LOCK_GUARD();
+
+    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
+        /* Nothing to do with read-only and MMIO-writable regions */
+        if (bs->mr->readonly || bs->mr->rom_device) {
+            continue;
+        }
+
+        /* Register block memory with UFFD to track writes */
+        if (uffd_register_memory(rs->uffdio_fd, bs->host,
+                bs->max_length, UFFDIO_REGISTER_MODE_WP, NULL)) {
+            goto fail;
+        }
+        /* Apply UFFD write protection to the block memory range */
+        if (uffd_change_protection(rs->uffdio_fd, bs->host,
+                bs->max_length, true, false)) {
+            goto fail;
+        }
+        bs->flags |= RAM_UF_WRITEPROTECT;
+        memory_region_ref(bs->mr);
+
+        trace_ram_write_tracking_ramblock_start(bs->idstr, bs->page_size,
+                bs->host, bs->max_length);
+    }
+
+    return 0;
+
+fail:
+    error_report("ram_write_tracking_start() failed: restoring initial memory state");
+
+    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
+        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
+            continue;
+        }
+        /*
+         * In case some memory block failed to be write-protected
+         * remove protection and unregister all succeeded RAM blocks
+         */
+        uffd_change_protection(rs->uffdio_fd, bs->host, bs->max_length, false, false);
+        uffd_unregister_memory(rs->uffdio_fd, bs->host, bs->max_length);
+        /* Cleanup flags and remove reference */
+        bs->flags &= ~RAM_UF_WRITEPROTECT;
+        memory_region_unref(bs->mr);
+    }
+
+    uffd_close_fd(uffd_fd);
+    rs->uffdio_fd = -1;
+#endif /* CONFIG_LINUX */
+    return -1;
+}
+
+/**
+ * ram_write_tracking_stop: stop UFFD-WP memory tracking and remove protection
+ */
+void ram_write_tracking_stop(void)
+{
+#ifdef CONFIG_LINUX
+    RAMState *rs = ram_state;
+    RAMBlock *bs;
+
+    RCU_READ_LOCK_GUARD();
+
+    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
+        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
+            continue;
+        }
+        /* Remove protection and unregister all affected RAM blocks */
+        uffd_change_protection(rs->uffdio_fd, bs->host, bs->max_length, false, false);
+        uffd_unregister_memory(rs->uffdio_fd, bs->host, bs->max_length);
+
+        trace_ram_write_tracking_ramblock_stop(bs->idstr, bs->page_size,
+                bs->host, bs->max_length);
+
+        /* Cleanup flags and remove reference */
+        bs->flags &= ~RAM_UF_WRITEPROTECT;
+        memory_region_unref(bs->mr);
+    }
+
+    /* Finally close UFFD file descriptor */
+    uffd_close_fd(rs->uffdio_fd);
+    rs->uffdio_fd = -1;
+#endif /* CONFIG_LINUX */
+}
+
 static SaveVMHandlers savevm_ram_handlers = {
     .save_setup = ram_save_setup,
     .save_live_iterate = ram_save_iterate,
diff --git a/migration/ram.h b/migration/ram.h
index 1a9ff90304..c25540cb93 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -82,5 +82,7 @@ void colo_incoming_start_dirty_log(void);
 /* Background snapshot */
 bool ram_write_tracking_available(void);
 bool ram_write_tracking_compatible(void);
+int ram_write_tracking_start(void);
+void ram_write_tracking_stop(void);
 
 #endif
diff --git a/migration/trace-events b/migration/trace-events
index 75de5004ac..668c562fed 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -111,6 +111,8 @@ save_xbzrle_page_skipping(void) ""
 save_xbzrle_page_overflow(void) ""
 ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
 ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
+ram_write_tracking_ramblock_start(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
+ram_write_tracking_ramblock_stop(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
 
 # multifd.c
 multifd_new_send_channel_async(uint8_t id) "channel %d"
-- 
2.25.1




* [PATCH v11 4/5] migration: implementation of background snapshot thread
From: Andrey Gruzdev via @ 2021-01-06 15:21 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, Peter Xu,
	Markus Armbruster, Paolo Bonzini, Den Lunev, Andrey Gruzdev

This patch introduces the implementation of the 'background' snapshot
thread, which overall follows the logic of precopy migration
while internally using a completely different mechanism
to 'freeze' the vmstate at the start of snapshot creation.

This mechanism is based on userfault_fd with write-protection
support and is Linux-specific.
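The key ordering constraint — device state is captured into a side buffer before RAM saving starts, but emitted to the stream only after all RAM has been written — can be sketched as follows (illustrative Python, not QEMU code; the real implementation stashes the non-RAM vmstate in a QIOChannelBuffer):

```python
import io

def build_snapshot_stream(ram_pages, device_state: bytes) -> bytes:
    """Assemble a snapshot stream in background-snapshot order.

    The non-RAM vmstate is stashed into a side buffer while vCPUs are
    stopped; RAM content is then streamed with vCPUs running again, and
    the stashed device state is appended after all RAM has been written.
    """
    side_buffer = io.BytesIO()
    side_buffer.write(device_state)           # stash non-RAM vmstate first

    stream = io.BytesIO()
    for page in ram_pages:                    # RAM saved while the VM runs
        stream.write(page)
    stream.write(side_buffer.getvalue())      # flush stashed state last
    return stream.getvalue()
```

The resulting stream thus reflects the machine state at the moment the snapshot was initiated, even though RAM is written out while the guest keeps running.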

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
Acked-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 255 +++++++++++++++++++++++++++++++++++++++++-
 migration/migration.h |   3 +
 migration/ram.c       |   2 +
 migration/savevm.c    |   1 -
 migration/savevm.h    |   2 +
 5 files changed, 260 insertions(+), 3 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 2c2cb9ef01..0901a15ac5 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2007,6 +2007,7 @@ void migrate_init(MigrationState *s)
      * locks.
      */
     s->cleanup_bh = 0;
+    s->vm_start_bh = 0;
     s->to_dst_file = NULL;
     s->state = MIGRATION_STATUS_NONE;
     s->rp_state.from_dst_file = NULL;
@@ -3211,6 +3212,50 @@ fail:
                       MIGRATION_STATUS_FAILED);
 }
 
+/**
+ * bg_migration_completion: Used by bg_migration_thread after all the
+ *   RAM has been saved. The caller 'breaks' the loop when this returns.
+ *
+ * @s: Current migration state
+ */
+static void bg_migration_completion(MigrationState *s)
+{
+    int current_active_state = s->state;
+
+    /*
+     * Stop tracking RAM writes - un-protect memory, un-register UFFD
+     * memory ranges, flush kernel wait queues and wake up threads
+     * waiting for write fault to be resolved.
+     */
+    ram_write_tracking_stop();
+
+    if (s->state == MIGRATION_STATUS_ACTIVE) {
+        /*
+         * By this moment we have RAM content saved into the migration stream.
+         * The next step is to flush the non-RAM content (device state)
+         * right after the ram content. The device state has been stored into
+         * the temporary buffer before RAM saving started.
+         */
+        qemu_put_buffer(s->to_dst_file, s->bioc->data, s->bioc->usage);
+        qemu_fflush(s->to_dst_file);
+    } else if (s->state == MIGRATION_STATUS_CANCELLING) {
+        goto fail;
+    }
+
+    if (qemu_file_get_error(s->to_dst_file)) {
+        trace_migration_completion_file_err();
+        goto fail;
+    }
+
+    migrate_set_state(&s->state, current_active_state,
+                      MIGRATION_STATUS_COMPLETED);
+    return;
+
+fail:
+    migrate_set_state(&s->state, current_active_state,
+                      MIGRATION_STATUS_FAILED);
+}
+
 bool migrate_colo_enabled(void)
 {
     MigrationState *s = migrate_get_current();
@@ -3551,6 +3596,47 @@ static void migration_iteration_finish(MigrationState *s)
     qemu_mutex_unlock_iothread();
 }
 
+static void bg_migration_iteration_finish(MigrationState *s)
+{
+    qemu_mutex_lock_iothread();
+    switch (s->state) {
+    case MIGRATION_STATUS_COMPLETED:
+        migration_calculate_complete(s);
+        break;
+
+    case MIGRATION_STATUS_ACTIVE:
+    case MIGRATION_STATUS_FAILED:
+    case MIGRATION_STATUS_CANCELLED:
+    case MIGRATION_STATUS_CANCELLING:
+        break;
+
+    default:
+        /* Should not reach here, but if so, forgive the VM. */
+        error_report("%s: Unknown ending state %d", __func__, s->state);
+        break;
+    }
+
+    migrate_fd_cleanup_schedule(s);
+    qemu_mutex_unlock_iothread();
+}
+
+/*
+ * Return MIG_ITERATE_RESUME to continue with the next iteration,
+ * MIG_ITERATE_BREAK once RAM saving has completed.
+ */
+static MigIterateState bg_migration_iteration_run(MigrationState *s)
+{
+    int res;
+
+    res = qemu_savevm_state_iterate(s->to_dst_file, false);
+    if (res > 0) {
+        bg_migration_completion(s);
+        return MIG_ITERATE_BREAK;
+    }
+
+    return MIG_ITERATE_RESUME;
+}
+
 void migration_make_urgent_request(void)
 {
     qemu_sem_post(&migrate_get_current()->rate_limit_sem);
@@ -3698,6 +3784,165 @@ static void *migration_thread(void *opaque)
     return NULL;
 }
 
+static void bg_migration_vm_start_bh(void *opaque)
+{
+    MigrationState *s = opaque;
+
+    qemu_bh_delete(s->vm_start_bh);
+    s->vm_start_bh = NULL;
+
+    vm_start();
+    s->downtime = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - s->downtime_start;
+}
+
+/**
+ * Background snapshot thread, based on live migration code.
+ * This is an alternative implementation of live migration mechanism
+ * introduced specifically to support background snapshots.
+ *
+ * It takes advantage of userfault_fd write protection mechanism introduced
+ * in the v5.7 kernel. Compared to the existing dirty page logging migration,
+ * much less stream traffic is produced, resulting in smaller snapshot images,
+ * simply because no page duplicates can get into the stream.
+ *
+ * Another key point is that the generated vmstate stream reflects the machine
+ * state 'frozen' at the beginning of snapshot creation, whereas with the dirty
+ * page logging mechanism the saved snapshot is effectively the state of the VM
+ * at the end of the process.
+ */
+static void *bg_migration_thread(void *opaque)
+{
+    MigrationState *s = opaque;
+    int64_t setup_start;
+    MigThrError thr_error;
+    QEMUFile *fb;
+    bool early_fail = true;
+
+    rcu_register_thread();
+    object_ref(OBJECT(s));
+
+    qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
+
+    setup_start = qemu_clock_get_ms(QEMU_CLOCK_HOST);
+    /*
+     * We want to save the vmstate for the moment when migration has been
+     * initiated, but we also want to save RAM content while the VM is running.
+     * The RAM content should appear first in the vmstate. So, we first
+     * stash the non-RAM part of the vmstate to the temporary buffer,
+     * then write RAM part of the vmstate to the migration stream
+     * with vCPUs running and, finally, write stashed non-RAM part of
+     * the vmstate from the buffer to the migration stream.
+     */
+    s->bioc = qio_channel_buffer_new(128 * 1024);
+    qio_channel_set_name(QIO_CHANNEL(s->bioc), "vmstate-buffer");
+    fb = qemu_fopen_channel_output(QIO_CHANNEL(s->bioc));
+    object_unref(OBJECT(s->bioc));
+
+    update_iteration_initial_status(s);
+
+    qemu_savevm_state_header(s->to_dst_file);
+    qemu_savevm_state_setup(s->to_dst_file);
+
+    if (qemu_savevm_state_guest_unplug_pending()) {
+        migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
+                          MIGRATION_STATUS_WAIT_UNPLUG);
+
+        while (s->state == MIGRATION_STATUS_WAIT_UNPLUG &&
+               qemu_savevm_state_guest_unplug_pending()) {
+            qemu_sem_timedwait(&s->wait_unplug_sem, 250);
+        }
+
+        migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG,
+                          MIGRATION_STATUS_ACTIVE);
+    } else {
+        migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
+                MIGRATION_STATUS_ACTIVE);
+    }
+    s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start;
+
+    trace_migration_thread_setup_complete();
+    s->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+
+    qemu_mutex_lock_iothread();
+
+    /*
+     * If the VM is currently in a suspended state, we need to wake it up
+     * first to make a valid runstate transition in vm_stop_force_state().
+     */
+    qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
+    s->vm_was_running = runstate_is_running();
+
+    if (global_state_store()) {
+        goto fail;
+    }
+    /* Forcibly stop VM before saving state of vCPUs and devices */
+    if (vm_stop_force_state(RUN_STATE_PAUSED)) {
+        goto fail;
+    }
+    /*
+     * Put vCPUs in sync with shadow context structures, then
+     * save their state to channel-buffer along with devices.
+     */
+    cpu_synchronize_all_states();
+    if (qemu_savevm_state_complete_precopy_non_iterable(fb, false, false)) {
+        goto fail;
+    }
+    /* Now initialize UFFD context and start tracking RAM writes */
+    if (ram_write_tracking_start()) {
+        goto fail;
+    }
+    early_fail = false;
+
+    /*
+     * Start VM from BH handler to avoid write-fault lock here.
+     * UFFD-WP protection for the whole RAM is already enabled so
+     * calling VM state change notifiers from vm_start() would initiate
+     * writes to virtio VQ memory, which is in the write-protected region.
+     */
+    s->vm_start_bh = qemu_bh_new(bg_migration_vm_start_bh, s);
+    qemu_bh_schedule(s->vm_start_bh);
+
+    qemu_mutex_unlock_iothread();
+
+    while (migration_is_active(s)) {
+        MigIterateState iter_state = bg_migration_iteration_run(s);
+        if (iter_state == MIG_ITERATE_SKIP) {
+            continue;
+        } else if (iter_state == MIG_ITERATE_BREAK) {
+            break;
+        }
+
+        /*
+         * Try to detect any kind of failures, and see whether we
+         * should stop the migration now.
+         */
+        thr_error = migration_detect_error(s);
+        if (thr_error == MIG_THR_ERR_FATAL) {
+            /* Stop migration */
+            break;
+        }
+
+        migration_update_counters(s, qemu_clock_get_ms(QEMU_CLOCK_REALTIME));
+    }
+
+    trace_migration_thread_after_loop();
+
+fail:
+    if (early_fail) {
+        migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
+                MIGRATION_STATUS_FAILED);
+        qemu_mutex_unlock_iothread();
+    }
+
+    bg_migration_iteration_finish(s);
+
+    qemu_fclose(fb);
+    object_unref(OBJECT(s));
+    rcu_unregister_thread();
+
+    return NULL;
+}
+
 void migrate_fd_connect(MigrationState *s, Error *error_in)
 {
     Error *local_err = NULL;
@@ -3761,8 +4006,14 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
         migrate_fd_cleanup(s);
         return;
     }
-    qemu_thread_create(&s->thread, "live_migration", migration_thread, s,
-                       QEMU_THREAD_JOINABLE);
+
+    if (migrate_background_snapshot()) {
+        qemu_thread_create(&s->thread, "background_snapshot",
+                bg_migration_thread, s, QEMU_THREAD_JOINABLE);
+    } else {
+        qemu_thread_create(&s->thread, "live_migration",
+                migration_thread, s, QEMU_THREAD_JOINABLE);
+    }
     s->migration_thread_running = true;
 }
 
diff --git a/migration/migration.h b/migration/migration.h
index f40338cfbf..0723955cd7 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -20,6 +20,7 @@
 #include "qemu/thread.h"
 #include "qemu/coroutine_int.h"
 #include "io/channel.h"
+#include "io/channel-buffer.h"
 #include "net/announce.h"
 #include "qom/object.h"
 
@@ -147,8 +148,10 @@ struct MigrationState {
 
     /*< public >*/
     QemuThread thread;
+    QEMUBH *vm_start_bh;
     QEMUBH *cleanup_bh;
     QEMUFile *to_dst_file;
+    QIOChannelBuffer *bioc;
     /*
      * Protects to_dst_file pointer.  We need to make sure we won't
      * yield or hang during the critical section, since this lock will
diff --git a/migration/ram.c b/migration/ram.c
index 5707382db1..05fe0c8592 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1471,6 +1471,7 @@ static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset)
     page_address = (void *) uffd_msg.arg.pagefault.address;
     bs = qemu_ram_block_from_host(page_address, false, offset);
     assert(bs && (bs->flags & RAM_UF_WRITEPROTECT) != 0);
+
     return bs;
 }
 #endif /* CONFIG_LINUX */
@@ -1836,6 +1837,7 @@ static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss,
         /* Un-protect memory range. */
         res = uffd_change_protection(rs->uffdio_fd, page_address, run_length,
                 false, false);
+
         /* We don't want to override existing error from ram_save_host_page(). */
         if (res < 0 && *res_override >= 0) {
             *res_override = res;
diff --git a/migration/savevm.c b/migration/savevm.c
index 27e842812e..dd4ad0aaaf 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1354,7 +1354,6 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
     return 0;
 }
 
-static
 int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
                                                     bool in_postcopy,
                                                     bool inactivate_disks)
diff --git a/migration/savevm.h b/migration/savevm.h
index ba64a7e271..aaee2528ed 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -64,5 +64,7 @@ int qemu_loadvm_state(QEMUFile *f);
 void qemu_loadvm_state_cleanup(void);
 int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
 int qemu_load_device_state(QEMUFile *f);
+int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
+        bool in_postcopy, bool inactivate_disks);
 
 #endif
-- 
2.25.1




* [PATCH v11 5/5] migration: introduce 'userfaultfd-wrlat.py' script
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, Peter Xu,
	Markus Armbruster, Paolo Bonzini, Den Lunev, Andrey Gruzdev

Add BCC/eBPF script to analyze userfaultfd write fault latency distribution.
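The script buckets each fault's latency with bpf_log2l() after converting nanoseconds to milliseconds. Assuming BCC's bpf_log2l() semantics (floor(log2 v) + 1 for 32-bit values greater than zero, slot 1 for zero), the bucketing can be sketched in plain Python as:

```python
def wrlat_slot(delta_ns: int) -> int:
    """Histogram slot for one write-fault latency, as the script computes it:
    nanoseconds -> whole milliseconds, then a log2 bucket (bpf_log2l-style).
    """
    msecs = delta_ns // 1_000_000
    # bpf_log2l(v) == floor(log2(v)) + 1 for 32-bit v > 0, and 1 for v == 0,
    # which for non-negative integers equals max(v.bit_length(), 1)
    return max(msecs.bit_length(), 1)
```

So all sub-millisecond faults land in the first slot, and each subsequent slot covers a doubling of the latency range, matching what print_log2_hist() renders.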

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
Acked-by: Peter Xu <peterx@redhat.com>
---
 scripts/userfaultfd-wrlat.py | 148 +++++++++++++++++++++++++++++++++++
 1 file changed, 148 insertions(+)
 create mode 100755 scripts/userfaultfd-wrlat.py

diff --git a/scripts/userfaultfd-wrlat.py b/scripts/userfaultfd-wrlat.py
new file mode 100755
index 0000000000..5ffd3c6c9a
--- /dev/null
+++ b/scripts/userfaultfd-wrlat.py
@@ -0,0 +1,148 @@
+#!/usr/bin/python3
+#
+# userfaultfd-wrlat Summarize userfaultfd write fault latencies.
+#                   Events are continuously accumulated for the
+#                   run, while the latency distribution histogram is
+#                   dumped every 'interval' seconds.
+#
+#                   For Linux, uses BCC, eBPF.
+#
+# USAGE: userfaultfd-wrlat [interval [count]]
+#
+# Copyright Virtuozzo GmbH, 2020
+#
+# Authors:
+#   Andrey Gruzdev   <andrey.gruzdev@virtuozzo.com>
+#
+# This work is licensed under the terms of the GNU GPL, version 2 or
+# later.  See the COPYING file in the top-level directory.
+
+from __future__ import print_function
+from bcc import BPF
+from ctypes import c_ushort, c_int, c_ulonglong
+from time import sleep
+from sys import argv
+
+def usage():
+    print("USAGE: %s [interval [count]]" % argv[0])
+    exit()
+
+# define BPF program
+bpf_text = """
+#include <uapi/linux/ptrace.h>
+#include <linux/mm.h>
+
+/*
+ * UFFD page fault event descriptor.
+ * Used as a key to BPF_HASH table.
+ */
+struct ev_desc {
+    u64 pid;
+    u64 addr;
+};
+
+BPF_HASH(ev_start, struct ev_desc, u64);
+BPF_HASH(ctx_handle_userfault, u64, u64);
+BPF_HISTOGRAM(ev_delta_hist, u64);
+
+/* Trace UFFD page fault start event. */
+static void do_event_start(u64 pid, u64 address)
+{
+    struct ev_desc desc = { .pid = pid, .addr = address };
+    u64 ts = bpf_ktime_get_ns();
+
+    ev_start.insert(&desc, &ts);
+}
+
+/* Trace UFFD page fault end event. */
+static void do_event_end(u64 pid, u64 address)
+{
+    struct ev_desc desc = { .pid = pid, .addr = address };
+    u64 ts = bpf_ktime_get_ns();
+    u64 *tsp;
+
+    tsp = ev_start.lookup(&desc);
+    if (tsp) {
+        u64 delta = ts - (*tsp);
+        /* Transform time delta to milliseconds */
+        ev_delta_hist.increment(bpf_log2l(delta / 1000000));
+        ev_start.delete(&desc);
+    }
+}
+
+/* KPROBE for handle_userfault(). */
+int probe_handle_userfault(struct pt_regs *ctx, struct vm_fault *vmf,
+        unsigned long reason)
+{
+    /* Trace only UFFD write faults. */
+    if (reason & VM_UFFD_WP) {
+        u64 pid = (u32) bpf_get_current_pid_tgid();
+        u64 addr = vmf->address;
+
+        do_event_start(pid, addr);
+        ctx_handle_userfault.update(&pid, &addr);
+    }
+    return 0;
+}
+
+/* KRETPROBE for handle_userfault(). */
+int retprobe_handle_userfault(struct pt_regs *ctx)
+{
+    u64 pid = (u32) bpf_get_current_pid_tgid();
+    u64 *addr_p;
+
+    /*
+     * Here we just ignore the return value. In case of spurious wakeup
+     * or pending signal we'll still get (at least for v5.8.0 kernel)
+     * VM_FAULT_RETRY or (VM_FAULT_RETRY | VM_FAULT_MAJOR) here.
+     * Anyhow, handle_userfault() would be re-entered if such case happens,
+     * keeping initial timestamp unchanged for the faulting thread.
+     */
+    addr_p = ctx_handle_userfault.lookup(&pid);
+    if (addr_p) {
+        do_event_end(pid, *addr_p);
+        ctx_handle_userfault.delete(&pid);
+    }
+    return 0;
+}
+"""
+
+# arguments
+interval = 10
+count = -1
+if len(argv) > 1:
+    try:
+        interval = int(argv[1])
+        if interval == 0:
+            raise
+        if len(argv) > 2:
+            count = int(argv[2])
+    except:    # also catches -h, --help
+        usage()
+
+# load BPF program
+b = BPF(text=bpf_text)
+# attach KPROBEs
+b.attach_kprobe(event="handle_userfault", fn_name="probe_handle_userfault")
+b.attach_kretprobe(event="handle_userfault", fn_name="retprobe_handle_userfault")
+
+# header
+print("Tracing UFFD-WP write fault latency... Hit Ctrl-C to end.")
+
+# output
+loop = 0
+do_exit = 0
+while (1):
+    if count > 0:
+        loop += 1
+        if loop > count:
+            exit()
+    try:
+        sleep(interval)
+    except KeyboardInterrupt:
+        pass; do_exit = 1
+
+    print()
+    b["ev_delta_hist"].print_log2_hist("msecs")
+    if do_exit:
+        exit()
-- 
2.25.1




* Re: [PATCH v11 0/5] UFFD write-tracking migration/snapshots
From: Andrey Gruzdev @ 2021-01-15 11:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Juan Quintela, Dr . David Alan Gilbert, Peter Xu,
	Markus Armbruster, Paolo Bonzini, Den Lunev


Ping

On 06.01.2021 18:21, Andrey Gruzdev wrote:

> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
> implemented in his series '[PATCH v0 0/4] migration: add background snapshot'.
>
> Currently the only way to make (external) live VM snapshot is using existing
> dirty page logging migration mechanism. The main problem is that it tends to
> produce a lot of page duplicates while running VM goes on updating already
> saved pages. That leads to the fact that vmstate image size is commonly several
> times bigger than the non-zero part of the virtual machine's RSS. Time required to
> converge RAM migration and the size of snapshot image severely depend on the
> guest memory write rate, sometimes resulting in unacceptably long snapshot
> creation time and huge image size.
>
> This series proposes a way to solve the aforementioned problems. This is done
> by using a different RAM migration mechanism based on the UFFD write-protection
> management introduced in the v5.7 kernel. The migration strategy is to 'freeze'
> guest RAM content using write protection and iteratively release protection
> for memory ranges that have already been saved to the migration stream.
> At the same time we read in pending UFFD write fault events and save those
> pages out of order with higher priority.
>
> How to use:
> 1. Enable write-tracking migration capability
>     virsh qemu-monitor-command <domain> --hmp migrate_set_capability
> track-writes-ram on
>
> 2. Start the external migration to a file
>     virsh qemu-monitor-command <domain> --hmp migrate exec:'cat > ./vm_state'
>
> 3. Wait for the migration to finish and check that the migration has reached
> the 'completed' state.
>
>
> Changes v10->v11:
>
> * 1. Updated commit messages.
>
> Changes v9->v10:
>
> * 1. Fixed commit message for [PATCH v9 1/5].
>
> Changes v8->v9:
>
> * 1. Fixed wrong cover letter subject.
>
> Changes v7->v8:
>
> * 1. Fixed coding style problems to pass checkpatch.
>
> Changes v6->v7:
>
> * 1. Fixed background snapshot on suspended guest: call qemu_system_wakeup_request()
> *    before stopping VM to make runstate transition valid.
> * 2. Disabled dirty page logging and log sync when 'background-snapshot' is enabled.
> * 3. Introduced 'userfaultfd-wrlat.py' script to analyze UFFD write fault latencies.
>
> Changes v5->v6:
>
> * 1. Consider possible hot-plugging/unplugging of memory device - don't use static
> *    for write-tracking support level in migrate_query_write_tracking(), check
> *    each time when one tries to enable 'background-snapshot' capability.
>
> Changes v4->v5:
>
> * 1. Refactored util/userfaultfd.c code to support features required by postcopy.
> * 2. Introduced checks for host kernel and guest memory backend compatibility
> *    to 'background-snapshot' branch in migrate_caps_check().
> * 3. Switched to using trace_xxx instead of info_report()/error_report() for
> *    cases when error message must be hidden (probing UFFD-IO) or info may be
> *    really littering output if goes to stderr.
> * 4  Added RCU_READ_LOCK_GUARDs to the code dealing with RAM block list.
> * 5. Added memory_region_ref() for each RAM block being wr-protected.
> * 6. Reused qemu_ram_block_from_host() instead of custom RAM block lookup routine.
> * 7. Stopped using specific hwaddr/ram_addr_t types in favour of void */uint64_t.
> * 8. Currently dropped 'linear-scan-rate-limiting' patch. The reason is that
> *    the chosen criterion for high-latency fault detection (i.e. timestamp of
> *    UFFD event fetch) is not representative enough for this task.
> *    At the moment it looks somewhat like a premature optimization effort.
> * 9. Dropped some unnecessary/unused code.
>
> Andrey Gruzdev (5):
>    migration: introduce 'background-snapshot' migration capability
>    migration: introduce UFFD-WP low-level interface helpers
>    migration: support UFFD write fault processing in ram_save_iterate()
>    migration: implementation of background snapshot thread
>    migration: introduce 'userfaultfd-wrlat.py' script
>
>   include/exec/memory.h        |   8 +
>   include/qemu/userfaultfd.h   |  35 ++++
>   migration/migration.c        | 365 ++++++++++++++++++++++++++++++++++-
>   migration/migration.h        |   4 +
>   migration/ram.c              | 288 ++++++++++++++++++++++++++-
>   migration/ram.h              |   6 +
>   migration/savevm.c           |   1 -
>   migration/savevm.h           |   2 +
>   migration/trace-events       |   2 +
>   qapi/migration.json          |   7 +-
>   scripts/userfaultfd-wrlat.py | 148 ++++++++++++++
>   util/meson.build             |   1 +
>   util/trace-events            |   9 +
>   util/userfaultfd.c           | 345 +++++++++++++++++++++++++++++++++
>   14 files changed, 1211 insertions(+), 10 deletions(-)
>   create mode 100644 include/qemu/userfaultfd.h
>   create mode 100755 scripts/userfaultfd-wrlat.py
>   create mode 100644 util/userfaultfd.c
>

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com




* Re: [PATCH v11 0/5] UFFD write-tracking migration/snapshots
  2021-01-15 11:34 ` [PATCH v11 0/5] UFFD write-tracking migration/snapshots Andrey Gruzdev
@ 2021-01-15 14:54   ` Peter Xu
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Xu @ 2021-01-15 14:54 UTC (permalink / raw)
  To: Andrey Gruzdev, Dr. David Alan Gilbert, Juan Quintela
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Fri, Jan 15, 2021 at 02:34:40PM +0300, Andrey Gruzdev wrote:
> Ping

Dave, Juan,

I really think this series needs some attention...

It'll be great, at least to me, if QEMU can start to take live snapshots.
Andrey (and also Denis with the older versions) did a great job working on it
and moving it forward.  IMHO it's something beneficial for the QEMU community,
and we should really encourage this kind of effort: working on useful things,
working collaboratively, and being consistent.

Is there any concern about this series?  Let me know if there's anything I can
help with it.

Thanks!

-- 
Peter Xu




* Re: [PATCH v11 3/5] migration: support UFFD write fault processing in ram_save_iterate()
  2021-01-06 15:21 ` [PATCH v11 3/5] migration: support UFFD write fault processing in ram_save_iterate() Andrey Gruzdev via
@ 2021-01-19 17:49   ` Dr. David Alan Gilbert
  2021-01-21  8:48     ` Andrey Gruzdev
  0 siblings, 1 reply; 24+ messages in thread
From: Dr. David Alan Gilbert @ 2021-01-19 17:49 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, qemu-devel, Peter Xu, Markus Armbruster,
	Paolo Bonzini, Den Lunev

* Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> In this particular implementation the same single migration
> thread is responsible for both normal linear dirty page
> migration and processing UFFD page fault events.
> 
> Processing write faults includes reading UFFD file descriptor,
> finding respective RAM block and saving faulting page to
> the migration stream. After page has been saved, write protection
> can be removed. Since asynchronous version of qemu_put_buffer()
> is expected to be used to save pages, we also have to flush
> migration stream prior to un-protecting saved memory range.
> 
> Write protection is being removed for any previously protected
> memory chunk that has hit the migration stream. That's valid
> for pages from linear page scan along with write fault pages.
> 
> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> Acked-by: Peter Xu <peterx@redhat.com>

I think this is OK, so:

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

I think I'd have preferred to have kept the #ifdef LINUX's out of there,
and I suspect using the opaques for the pre/post hook is overly
complex; but otherwise OK.

Dave

> ---
>  include/exec/memory.h  |   7 ++
>  migration/ram.c        | 269 +++++++++++++++++++++++++++++++++++++++--
>  migration/ram.h        |   2 +
>  migration/trace-events |   2 +
>  4 files changed, 272 insertions(+), 8 deletions(-)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index b76b1256bf..1aa1c6a3f4 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -150,6 +150,13 @@ typedef struct IOMMUTLBEvent {
>  #define RAM_PMEM (1 << 5)
>  
>  
> +/*
> + * UFFDIO_WRITEPROTECT is used on this RAMBlock to
> + * support 'write-tracking' migration type.
> + * Implies ram_state->ram_wt_enabled.
> + */
> +#define RAM_UF_WRITEPROTECT (1 << 6)
> +
>  static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
>                                         IOMMUNotifierFlag flags,
>                                         hwaddr start, hwaddr end,
> diff --git a/migration/ram.c b/migration/ram.c
> index ae8de17153..5707382db1 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -56,6 +56,11 @@
>  #include "savevm.h"
>  #include "qemu/iov.h"
>  #include "multifd.h"
> +#include "sysemu/runstate.h"
> +
> +#ifdef CONFIG_LINUX
> +#include "qemu/userfaultfd.h"
> +#endif
>  
>  /***********************************************************/
>  /* ram save/restore */
> @@ -298,6 +303,8 @@ struct RAMSrcPageRequest {
>  struct RAMState {
>      /* QEMUFile used for this migration */
>      QEMUFile *f;
> +    /* UFFD file descriptor, used in 'write-tracking' migration */
> +    int uffdio_fd;
>      /* Last block that we have visited searching for dirty pages */
>      RAMBlock *last_seen_block;
>      /* Last block from where we have sent data */
> @@ -1434,6 +1441,40 @@ static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset)
>      return block;
>  }
>  
> +#ifdef CONFIG_LINUX
> +/**
> + * poll_fault_page: try to get next UFFD write fault page and, if pending fault
> + *   is found, return RAM block pointer and page offset
> + *
> + * Returns pointer to the RAMBlock containing faulting page,
> + *   NULL if no write faults are pending
> + *
> + * @rs: current RAM state
> + * @offset: page offset from the beginning of the block
> + */
> +static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset)
> +{
> +    struct uffd_msg uffd_msg;
> +    void *page_address;
> +    RAMBlock *bs;
> +    int res;
> +
> +    if (!migrate_background_snapshot()) {
> +        return NULL;
> +    }
> +
> +    res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1);
> +    if (res <= 0) {
> +        return NULL;
> +    }
> +
> +    page_address = (void *) uffd_msg.arg.pagefault.address;
> +    bs = qemu_ram_block_from_host(page_address, false, offset);
> +    assert(bs && (bs->flags & RAM_UF_WRITEPROTECT) != 0);
> +    return bs;
> +}
> +#endif /* CONFIG_LINUX */
> +
>  /**
>   * get_queued_page: unqueue a page from the postcopy requests
>   *
> @@ -1473,6 +1514,16 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
>  
>      } while (block && !dirty);
>  
> +#ifdef CONFIG_LINUX
> +    if (!block) {
> +        /*
> +         * Poll write faults too if background snapshot is enabled; that's
> +         * when we have vcpus got blocked by the write protected pages.
> +         */
> +        block = poll_fault_page(rs, &offset);
> +    }
> +#endif /* CONFIG_LINUX */
> +
>      if (block) {
>          /*
>           * As soon as we start servicing pages out of order, then we have
> @@ -1746,6 +1797,53 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
>      return pages;
>  }
>  
> +/**
> + * ram_save_host_page_pre: ram_save_host_page() pre-notifier
> + *
> + * @rs: current RAM state
> + * @pss: page-search-status structure
> + * @opaque: pointer to receive opaque context value
> + */
> +static inline
> +void ram_save_host_page_pre(RAMState *rs, PageSearchStatus *pss, void **opaque)
> +{
> +    *(unsigned long *) opaque = pss->page;
> +}
> +
> +/**
> + * ram_save_host_page_post: ram_save_host_page() post-notifier
> + *
> + * @rs: current RAM state
> + * @pss: page-search-status structure
> + * @opaque: opaque context value
> + * @res_override: pointer to the return value of ram_save_host_page(),
> + *   overwritten in case of an error
> + */
> +static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss,
> +        void *opaque, int *res_override)
> +{
> +#ifdef CONFIG_LINUX
> +    /* Check if page is from UFFD-managed region. */
> +    if (pss->block->flags & RAM_UF_WRITEPROTECT) {
> +        unsigned long page_from = (unsigned long) opaque;
> +
> +        void *page_address = pss->block->host + (page_from << TARGET_PAGE_BITS);
> +        uint64_t run_length = (pss->page - page_from + 1) << TARGET_PAGE_BITS;
> +        int res;
> +
> +        /* Flush async buffers before un-protect. */
> +        qemu_fflush(rs->f);
> +        /* Un-protect memory range. */
> +        res = uffd_change_protection(rs->uffdio_fd, page_address, run_length,
> +                false, false);
> +        /* We don't want to override existing error from ram_save_host_page(). */
> +        if (res < 0 && *res_override >= 0) {
> +            *res_override = res;
> +        }
> +    }
> +#endif /* CONFIG_LINUX */
> +}
> +
>  /**
>   * ram_find_and_save_block: finds a dirty page and sends it to f
>   *
> @@ -1790,7 +1888,11 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
>          }
>  
>          if (found) {
> +            void *opaque;
> +
> +            ram_save_host_page_pre(rs, &pss, &opaque);
>              pages = ram_save_host_page(rs, &pss, last_stage);
> +            ram_save_host_page_post(rs, &pss, opaque, &pages);
>          }
>      } while (!pages && again);
>  
> @@ -1880,10 +1982,13 @@ static void ram_save_cleanup(void *opaque)
>      RAMState **rsp = opaque;
>      RAMBlock *block;
>  
> -    /* caller have hold iothread lock or is in a bh, so there is
> -     * no writing race against the migration bitmap
> -     */
> -    memory_global_dirty_log_stop();
> +    /* We don't use dirty log with background snapshots */
> +    if (!migrate_background_snapshot()) {
> +        /* caller have hold iothread lock or is in a bh, so there is
> +         * no writing race against the migration bitmap
> +         */
> +        memory_global_dirty_log_stop();
> +    }
>  
>      RAMBLOCK_FOREACH_NOT_IGNORED(block) {
>          g_free(block->clear_bmap);
> @@ -2343,8 +2448,11 @@ static void ram_init_bitmaps(RAMState *rs)
>  
>      WITH_RCU_READ_LOCK_GUARD() {
>          ram_list_init_bitmaps();
> -        memory_global_dirty_log_start();
> -        migration_bitmap_sync_precopy(rs);
> +        /* We don't use dirty log with background snapshots */
> +        if (!migrate_background_snapshot()) {
> +            memory_global_dirty_log_start();
> +            migration_bitmap_sync_precopy(rs);
> +        }
>      }
>      qemu_mutex_unlock_ramlist();
>      qemu_mutex_unlock_iothread();
> @@ -3794,7 +3902,14 @@ static int ram_resume_prepare(MigrationState *s, void *opaque)
>   */
>  bool ram_write_tracking_available(void)
>  {
> -    /* TODO: implement */
> +#ifdef CONFIG_LINUX
> +    uint64_t uffd_features;
> +    int res;
> +
> +    res = uffd_query_features(&uffd_features);
> +    return (res == 0 &&
> +            (uffd_features & UFFD_FEATURE_PAGEFAULT_FLAG_WP) != 0);
> +#endif
>      return false;
>  }
>  
> @@ -3805,10 +3920,148 @@ bool ram_write_tracking_available(void)
>   */
>  bool ram_write_tracking_compatible(void)
>  {
> -    /* TODO: implement */
> +#ifdef CONFIG_LINUX
> +    const uint64_t uffd_ioctls_mask = BIT(_UFFDIO_WRITEPROTECT);
> +    int uffd_fd;
> +    RAMBlock *bs;
> +    bool ret = false;
> +
> +    /* Open UFFD file descriptor */
> +    uffd_fd = uffd_create_fd(UFFD_FEATURE_PAGEFAULT_FLAG_WP, false);
> +    if (uffd_fd < 0) {
> +        return false;
> +    }
> +
> +    RCU_READ_LOCK_GUARD();
> +
> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> +        uint64_t uffd_ioctls;
> +
> +        /* Nothing to do with read-only and MMIO-writable regions */
> +        if (bs->mr->readonly || bs->mr->rom_device) {
> +            continue;
> +        }
> +        /* Try to register block memory via UFFD-IO to track writes */
> +        if (uffd_register_memory(uffd_fd, bs->host, bs->max_length,
> +                UFFDIO_REGISTER_MODE_WP, &uffd_ioctls)) {
> +            goto out;
> +        }
> +        if ((uffd_ioctls & uffd_ioctls_mask) != uffd_ioctls_mask) {
> +            goto out;
> +        }
> +    }
> +    ret = true;
> +
> +out:
> +    uffd_close_fd(uffd_fd);
> +    return ret;
> +#endif
>      return false;
>  }
>  
> +/*
> + * ram_write_tracking_start: start UFFD-WP memory tracking
> + *
> + * Returns 0 for success or negative value in case of error
> + */
> +int ram_write_tracking_start(void)
> +{
> +#ifdef CONFIG_LINUX
> +    int uffd_fd;
> +    RAMState *rs = ram_state;
> +    RAMBlock *bs;
> +
> +    /* Open UFFD file descriptor */
> +    uffd_fd = uffd_create_fd(UFFD_FEATURE_PAGEFAULT_FLAG_WP, true);
> +    if (uffd_fd < 0) {
> +        return uffd_fd;
> +    }
> +    rs->uffdio_fd = uffd_fd;
> +
> +    RCU_READ_LOCK_GUARD();
> +
> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> +        /* Nothing to do with read-only and MMIO-writable regions */
> +        if (bs->mr->readonly || bs->mr->rom_device) {
> +            continue;
> +        }
> +
> +        /* Register block memory with UFFD to track writes */
> +        if (uffd_register_memory(rs->uffdio_fd, bs->host,
> +                bs->max_length, UFFDIO_REGISTER_MODE_WP, NULL)) {
> +            goto fail;
> +        }
> +        /* Apply UFFD write protection to the block memory range */
> +        if (uffd_change_protection(rs->uffdio_fd, bs->host,
> +                bs->max_length, true, false)) {
> +            goto fail;
> +        }
> +        bs->flags |= RAM_UF_WRITEPROTECT;
> +        memory_region_ref(bs->mr);
> +
> +        trace_ram_write_tracking_ramblock_start(bs->idstr, bs->page_size,
> +                bs->host, bs->max_length);
> +    }
> +
> +    return 0;
> +
> +fail:
> +    error_report("ram_write_tracking_start() failed: restoring initial memory state");
> +
> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
> +            continue;
> +        }
> +        /*
> +         * In case some memory block failed to be write-protected
> +         * remove protection and unregister all succeeded RAM blocks
> +         */
> +        uffd_change_protection(rs->uffdio_fd, bs->host, bs->max_length, false, false);
> +        uffd_unregister_memory(rs->uffdio_fd, bs->host, bs->max_length);
> +        /* Cleanup flags and remove reference */
> +        bs->flags &= ~RAM_UF_WRITEPROTECT;
> +        memory_region_unref(bs->mr);
> +    }
> +
> +    uffd_close_fd(uffd_fd);
> +#endif /* CONFIG_LINUX */
> +    rs->uffdio_fd = -1;
> +    return -1;
> +}
> +
> +/**
> + * ram_write_tracking_stop: stop UFFD-WP memory tracking and remove protection
> + */
> +void ram_write_tracking_stop(void)
> +{
> +#ifdef CONFIG_LINUX
> +    RAMState *rs = ram_state;
> +    RAMBlock *bs;
> +
> +    RCU_READ_LOCK_GUARD();
> +
> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
> +            continue;
> +        }
> +        /* Remove protection and unregister all affected RAM blocks */
> +        uffd_change_protection(rs->uffdio_fd, bs->host, bs->max_length, false, false);
> +        uffd_unregister_memory(rs->uffdio_fd, bs->host, bs->max_length);
> +
> +        trace_ram_write_tracking_ramblock_stop(bs->idstr, bs->page_size,
> +                bs->host, bs->max_length);
> +
> +        /* Cleanup flags and remove reference */
> +        bs->flags &= ~RAM_UF_WRITEPROTECT;
> +        memory_region_unref(bs->mr);
> +    }
> +
> +    /* Finally close UFFD file descriptor */
> +    uffd_close_fd(rs->uffdio_fd);
> +    rs->uffdio_fd = -1;
> +#endif /* CONFIG_LINUX */
> +}
> +
>  static SaveVMHandlers savevm_ram_handlers = {
>      .save_setup = ram_save_setup,
>      .save_live_iterate = ram_save_iterate,
> diff --git a/migration/ram.h b/migration/ram.h
> index 1a9ff90304..c25540cb93 100644
> --- a/migration/ram.h
> +++ b/migration/ram.h
> @@ -82,5 +82,7 @@ void colo_incoming_start_dirty_log(void);
>  /* Background snapshot */
>  bool ram_write_tracking_available(void);
>  bool ram_write_tracking_compatible(void);
> +int ram_write_tracking_start(void);
> +void ram_write_tracking_stop(void);
>  
>  #endif
> diff --git a/migration/trace-events b/migration/trace-events
> index 75de5004ac..668c562fed 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -111,6 +111,8 @@ save_xbzrle_page_skipping(void) ""
>  save_xbzrle_page_overflow(void) ""
>  ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
>  ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
> +ram_write_tracking_ramblock_start(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
> +ram_write_tracking_ramblock_stop(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
>  
>  # multifd.c
>  multifd_new_send_channel_async(uint8_t id) "channel %d"
> -- 
> 2.25.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v11 4/5] migration: implementation of background snapshot thread
  2021-01-06 15:21 ` [PATCH v11 4/5] migration: implementation of background snapshot thread Andrey Gruzdev via
@ 2021-01-19 18:49   ` Dr. David Alan Gilbert
  2021-01-21  8:58     ` Andrey Gruzdev
  0 siblings, 1 reply; 24+ messages in thread
From: Dr. David Alan Gilbert @ 2021-01-19 18:49 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, qemu-devel, Peter Xu, Markus Armbruster,
	Paolo Bonzini, Den Lunev

* Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> Introducing implementation of 'background' snapshot thread
> which in overall follows the logic of precopy migration
> while internally utilizes completely different mechanism
> to 'freeze' vmstate at the start of snapshot creation.
> 
> This mechanism is based on userfault_fd with wr-protection
> support and is Linux-specific.

I noticed there weren't any bdrv_ calls in here; I guess with a snapshot
you still have the source running so still have it accessing the disk;
do you do anything to try and wire the ram snapshotting up to disk
snapshotting?

> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> Acked-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/migration.c | 255 +++++++++++++++++++++++++++++++++++++++++-
>  migration/migration.h |   3 +
>  migration/ram.c       |   2 +
>  migration/savevm.c    |   1 -
>  migration/savevm.h    |   2 +
>  5 files changed, 260 insertions(+), 3 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 2c2cb9ef01..0901a15ac5 100644

<snip>

> -    qemu_thread_create(&s->thread, "live_migration", migration_thread, s,
> -                       QEMU_THREAD_JOINABLE);
> +
> +    if (migrate_background_snapshot()) {
> +        qemu_thread_create(&s->thread, "background_snapshot",

Unfortunately that won't work - there's a 14-character limit on
the thread name length; I guess we just shorten that to bg_snapshot
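As a quick sanity check of the name lengths - this is only a sketch, and the 14-character limit is taken from the review comment above rather than verified against the kernel:

```python
# Hypothetical check of candidate QEMU thread names against the
# 14-character limit mentioned in the review above (assumed value).
THREAD_NAME_LIMIT = 14

names = ["live_migration", "background_snapshot", "bg_snapshot"]
for name in names:
    fits = len(name) <= THREAD_NAME_LIMIT
    print(f"{name}: {len(name)} chars, fits: {fits}")
# "background_snapshot" (19 chars) exceeds the limit, while the
# suggested "bg_snapshot" (11 chars) fits.
```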

Other than that,



Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> +                bg_migration_thread, s, QEMU_THREAD_JOINABLE);
> +    } else {
> +        qemu_thread_create(&s->thread, "live_migration",
> +                migration_thread, s, QEMU_THREAD_JOINABLE);
> +    }
>      s->migration_thread_running = true;
>  }
>  
> diff --git a/migration/migration.h b/migration/migration.h
> index f40338cfbf..0723955cd7 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -20,6 +20,7 @@
>  #include "qemu/thread.h"
>  #include "qemu/coroutine_int.h"
>  #include "io/channel.h"
> +#include "io/channel-buffer.h"
>  #include "net/announce.h"
>  #include "qom/object.h"
>  
> @@ -147,8 +148,10 @@ struct MigrationState {
>  
>      /*< public >*/
>      QemuThread thread;
> +    QEMUBH *vm_start_bh;
>      QEMUBH *cleanup_bh;
>      QEMUFile *to_dst_file;
> +    QIOChannelBuffer *bioc;
>      /*
>       * Protects to_dst_file pointer.  We need to make sure we won't
>       * yield or hang during the critical section, since this lock will
> diff --git a/migration/ram.c b/migration/ram.c
> index 5707382db1..05fe0c8592 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1471,6 +1471,7 @@ static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset)
>      page_address = (void *) uffd_msg.arg.pagefault.address;
>      bs = qemu_ram_block_from_host(page_address, false, offset);
>      assert(bs && (bs->flags & RAM_UF_WRITEPROTECT) != 0);
> +
>      return bs;
>  }
>  #endif /* CONFIG_LINUX */
> @@ -1836,6 +1837,7 @@ static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss,
>          /* Un-protect memory range. */
>          res = uffd_change_protection(rs->uffdio_fd, page_address, run_length,
>                  false, false);
> +
>          /* We don't want to override existing error from ram_save_host_page(). */
>          if (res < 0 && *res_override >= 0) {
>              *res_override = res;
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 27e842812e..dd4ad0aaaf 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1354,7 +1354,6 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
>      return 0;
>  }
>  
> -static
>  int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
>                                                      bool in_postcopy,
>                                                      bool inactivate_disks)
> diff --git a/migration/savevm.h b/migration/savevm.h
> index ba64a7e271..aaee2528ed 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -64,5 +64,7 @@ int qemu_loadvm_state(QEMUFile *f);
>  void qemu_loadvm_state_cleanup(void);
>  int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
>  int qemu_load_device_state(QEMUFile *f);
> +int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
> +        bool in_postcopy, bool inactivate_disks);
>  
>  #endif
> -- 
> 2.25.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v11 5/5] migration: introduce 'userfaultfd-wrlat.py' script
  2021-01-06 15:21 ` [PATCH v11 5/5] migration: introduce 'userfaultfd-wrlat.py' script Andrey Gruzdev via
@ 2021-01-19 21:01   ` Peter Xu
  2021-01-21 13:12     ` Andrey Gruzdev
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2021-01-19 21:01 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Wed, Jan 06, 2021 at 06:21:20PM +0300, Andrey Gruzdev wrote:
> Add BCC/eBPF script to analyze userfaultfd write fault latency distribution.
> 
> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> Acked-by: Peter Xu <peterx@redhat.com>

(This seems to be the last patch that lacks a r-b ... Let's see whether I could
 convert my a-b into an r-b... :)

> +BPF_HASH(ev_start, struct ev_desc, u64);
> +BPF_HASH(ctx_handle_userfault, u64, u64);

IMHO we only need one hash here instead of two:

  BPF_HASH(ev_start, u32, u64);

Where we use the tid as the key (u32), and timestamp as the value (u64).  The
thing is we don't really need the address for current statistics, IMHO.
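A user-space sketch of the single-hash scheme being suggested, with a plain Python dict standing in for the BPF_HASH; the helper names and timestamps below are illustrative, not from the patch:

```python
# One map keyed by tid, storing the fault-entry timestamp; the fault
# address is not tracked at all, since only latency matters here.
ev_start = {}          # tid (u32) -> entry timestamp in ns
latencies_ns = []      # collected samples for the histogram

def on_handle_userfault_entry(tid, now_ns):
    # kprobe side: remember when this thread entered handle_userfault()
    ev_start[tid] = now_ns

def on_handle_userfault_return(tid, now_ns):
    # kretprobe side: compute the delta and drop the map entry
    start = ev_start.pop(tid, None)
    if start is not None:
        latencies_ns.append(now_ns - start)

# Simulated fault: thread 42 enters at t=1000ns, returns at t=6000ns.
on_handle_userfault_entry(42, 1000)
on_handle_userfault_return(42, 6000)
print(latencies_ns)  # [5000]
```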

> +/* KPROBE for handle_userfault(). */
> +int probe_handle_userfault(struct pt_regs *ctx, struct vm_fault *vmf,
> +        unsigned long reason)
> +{
> +    /* Trace only UFFD write faults. */
> +    if (reason & VM_UFFD_WP) {

Better with comment:

           /* Using "(u32)" to drop group ID which is upper 32 bits */

Even better, we'd want a get_current_tid() helper and call it here and below
(bpf_get_current_pid_tgid() will return tid|gid<<32 I think, so I'm a bit
confused why the bcc people called it pid in the first place...).
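To illustrate why the "(u32)" cast extracts the thread ID - assuming the tid-in-low-bits packing described above:

```python
# bpf_get_current_pid_tgid() packs the thread ID into the low 32 bits
# and the thread-group ID (the "process" PID) into the high 32 bits.
def pack_pid_tgid(tgid, tid):
    return (tgid << 32) | tid

def extract_tid(pid_tgid):
    return pid_tgid & 0xFFFFFFFF   # the "(u32)" cast in the BPF program

def extract_tgid(pid_tgid):
    return pid_tgid >> 32

v = pack_pid_tgid(1234, 5678)
print(extract_tid(v), extract_tgid(v))  # 5678 1234
```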

> +        u64 pid = (u32) bpf_get_current_pid_tgid();
> +        u64 addr = vmf->address;
> +
> +        do_event_start(pid, addr);
> +        ctx_handle_userfault.update(&pid, &addr);
> +    }
> +    return 0;
> +}
> +
> +/* KRETPROBE for handle_userfault(). */
> +int retprobe_handle_userfault(struct pt_regs *ctx)
> +{
> +    u64 pid = (u32) bpf_get_current_pid_tgid();
> +    u64 *addr_p;
> +
> +    /*
> +     * Here we just ignore the return value. In case of spurious wakeup
> +     * or pending signal we'll still get (at least for v5.8.0 kernel)
> +     * VM_FAULT_RETRY or (VM_FAULT_RETRY | VM_FAULT_MAJOR) here.
> +     * Anyhow, handle_userfault() would be re-entered if such case happens,
> +     * keeping initial timestamp unchanged for the faulting thread.

AFAIU this comment is not matching what the code does.  But I agree it's not a
big problem because we won't miss any long delays (because the one long delayed
sample will just be split into two or multiple delays, which will still be
reflected in the histogram at last).  Or am I wrong?
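The splitting effect described above can be sketched numerically - the latency values are made up, and log2 bucketing mimics what print_log2_hist() does:

```python
import math

def log2_bucket(msecs):
    """Index of the log2 histogram slot a latency sample falls into."""
    return max(0, int(math.floor(math.log2(msecs)))) if msecs > 0 else 0

# One 40ms delay recorded as a single sample ...
single = [log2_bucket(40)]
# ... versus the same delay split by a re-entry into 25ms + 15ms parts.
split = [log2_bucket(25), log2_bucket(15)]

print(single, split)  # [5] [4, 3]
# The long delay is never lost: its parts still land in high buckets,
# just spread over two slots instead of one.
```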

> +     */
> +    addr_p = ctx_handle_userfault.lookup(&pid);
> +    if (addr_p) {
> +        do_event_end(pid, *addr_p);
> +        ctx_handle_userfault.delete(&pid);
> +    }
> +    return 0;
> +}
> +"""

Other than that, the rest looks good to me.

I'd think it's fine to even merge the current version since it actually works
nicely.  Andrey, if you agree with any of my above comments, feel free to
repost this patch (since I see Dave provided the rest r-bs).  Then I think I
can r-b this one too.  Thanks!

-- 
Peter Xu




* Re: [PATCH v11 3/5] migration: support UFFD write fault processing in ram_save_iterate()
  2021-01-19 17:49   ` Dr. David Alan Gilbert
@ 2021-01-21  8:48     ` Andrey Gruzdev
  2021-01-21 10:09       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 24+ messages in thread
From: Andrey Gruzdev @ 2021-01-21  8:48 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Juan Quintela, qemu-devel, Peter Xu, Markus Armbruster,
	Paolo Bonzini, Den Lunev


On 19.01.2021 20:49, Dr. David Alan Gilbert wrote:
> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>> In this particular implementation the same single migration
>> thread is responsible for both normal linear dirty page
>> migration and processing UFFD page fault events.
>>
>> Processing write faults includes reading UFFD file descriptor,
>> finding respective RAM block and saving faulting page to
>> the migration stream. After page has been saved, write protection
>> can be removed. Since asynchronous version of qemu_put_buffer()
>> is expected to be used to save pages, we also have to flush
>> migration stream prior to un-protecting saved memory range.
>>
>> Write protection is being removed for any previously protected
>> memory chunk that has hit the migration stream. That's valid
>> for pages from linear page scan along with write fault pages.
>>
>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
>> Acked-by: Peter Xu <peterx@redhat.com>
> I think this is OK, so:
>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>
> I think I'd have preferred to have kept the #ifdef LINUX's out of there,
> and I suspect using the opaques for the pre/post hook is overly
> complex; but otherwise OK.
>
> Dave

Mm, I think it's impossible to move the #ifdef LINUX's out of there completely,
but I suspect all the #ifdef LINUX code can be moved to a single place with
stubs for the #else case. I suppose that would be better than now.
As for the pre/post hooks - they are not too complex, but to end up with a single
#ifdef section it really looks better to move this stuff into ram_save_host_page().

Andrey

>> ---
>>   include/exec/memory.h  |   7 ++
>>   migration/ram.c        | 269 +++++++++++++++++++++++++++++++++++++++--
>>   migration/ram.h        |   2 +
>>   migration/trace-events |   2 +
>>   4 files changed, 272 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index b76b1256bf..1aa1c6a3f4 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -150,6 +150,13 @@ typedef struct IOMMUTLBEvent {
>>   #define RAM_PMEM (1 << 5)
>>   
>>   
>> +/*
>> + * UFFDIO_WRITEPROTECT is used on this RAMBlock to
>> + * support 'write-tracking' migration type.
>> + * Implies ram_state->ram_wt_enabled.
>> + */
>> +#define RAM_UF_WRITEPROTECT (1 << 6)
>> +
>>   static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
>>                                          IOMMUNotifierFlag flags,
>>                                          hwaddr start, hwaddr end,
>> diff --git a/migration/ram.c b/migration/ram.c
>> index ae8de17153..5707382db1 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -56,6 +56,11 @@
>>   #include "savevm.h"
>>   #include "qemu/iov.h"
>>   #include "multifd.h"
>> +#include "sysemu/runstate.h"
>> +
>> +#ifdef CONFIG_LINUX
>> +#include "qemu/userfaultfd.h"
>> +#endif
>>   
>>   /***********************************************************/
>>   /* ram save/restore */
>> @@ -298,6 +303,8 @@ struct RAMSrcPageRequest {
>>   struct RAMState {
>>       /* QEMUFile used for this migration */
>>       QEMUFile *f;
>> +    /* UFFD file descriptor, used in 'write-tracking' migration */
>> +    int uffdio_fd;
>>       /* Last block that we have visited searching for dirty pages */
>>       RAMBlock *last_seen_block;
>>       /* Last block from where we have sent data */
>> @@ -1434,6 +1441,40 @@ static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset)
>>       return block;
>>   }
>>   
>> +#ifdef CONFIG_LINUX
>> +/**
>> + * poll_fault_page: try to get next UFFD write fault page and, if pending fault
>> + *   is found, return RAM block pointer and page offset
>> + *
>> + * Returns pointer to the RAMBlock containing faulting page,
>> + *   NULL if no write faults are pending
>> + *
>> + * @rs: current RAM state
>> + * @offset: page offset from the beginning of the block
>> + */
>> +static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset)
>> +{
>> +    struct uffd_msg uffd_msg;
>> +    void *page_address;
>> +    RAMBlock *bs;
>> +    int res;
>> +
>> +    if (!migrate_background_snapshot()) {
>> +        return NULL;
>> +    }
>> +
>> +    res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1);
>> +    if (res <= 0) {
>> +        return NULL;
>> +    }
>> +
>> +    page_address = (void *) uffd_msg.arg.pagefault.address;
>> +    bs = qemu_ram_block_from_host(page_address, false, offset);
>> +    assert(bs && (bs->flags & RAM_UF_WRITEPROTECT) != 0);
>> +    return bs;
>> +}
>> +#endif /* CONFIG_LINUX */
>> +
>>   /**
>>    * get_queued_page: unqueue a page from the postcopy requests
>>    *
>> @@ -1473,6 +1514,16 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
>>   
>>       } while (block && !dirty);
>>   
>> +#ifdef CONFIG_LINUX
>> +    if (!block) {
>> +        /*
>> +         * Poll write faults too if background snapshot is enabled; that's
>> +         * when vcpus can get blocked by write-protected pages.
>> +         */
>> +        block = poll_fault_page(rs, &offset);
>> +    }
>> +#endif /* CONFIG_LINUX */
>> +
>>       if (block) {
>>           /*
>>            * As soon as we start servicing pages out of order, then we have
>> @@ -1746,6 +1797,53 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
>>       return pages;
>>   }
>>   
>> +/**
>> + * ram_save_host_page_pre: ram_save_host_page() pre-notifier
>> + *
>> + * @rs: current RAM state
>> + * @pss: page-search-status structure
>> + * @opaque: pointer to receive opaque context value
>> + */
>> +static inline
>> +void ram_save_host_page_pre(RAMState *rs, PageSearchStatus *pss, void **opaque)
>> +{
>> +    *(unsigned long *) opaque = pss->page;
>> +}
>> +
>> +/**
>> + * ram_save_host_page_post: ram_save_host_page() post-notifier
>> + *
>> + * @rs: current RAM state
>> + * @pss: page-search-status structure
>> + * @opaque: opaque context value
>> + * @res_override: pointer to the return value of ram_save_host_page(),
>> + *   overwritten in case of an error
>> + */
>> +static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss,
>> +        void *opaque, int *res_override)
>> +{
>> +#ifdef CONFIG_LINUX
>> +    /* Check if page is from UFFD-managed region. */
>> +    if (pss->block->flags & RAM_UF_WRITEPROTECT) {
>> +        unsigned long page_from = (unsigned long) opaque;
>> +
>> +        void *page_address = pss->block->host + (page_from << TARGET_PAGE_BITS);
>> +        uint64_t run_length = (pss->page - page_from + 1) << TARGET_PAGE_BITS;
>> +        int res;
>> +
>> +        /* Flush async buffers before un-protect. */
>> +        qemu_fflush(rs->f);
>> +        /* Un-protect memory range. */
>> +        res = uffd_change_protection(rs->uffdio_fd, page_address, run_length,
>> +                false, false);
>> +        /* We don't want to override existing error from ram_save_host_page(). */
>> +        if (res < 0 && *res_override >= 0) {
>> +            *res_override = res;
>> +        }
>> +    }
>> +#endif /* CONFIG_LINUX */
>> +}
>> +
>>   /**
>>    * ram_find_and_save_block: finds a dirty page and sends it to f
>>    *
>> @@ -1790,7 +1888,11 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
>>           }
>>   
>>           if (found) {
>> +            void *opaque;
>> +
>> +            ram_save_host_page_pre(rs, &pss, &opaque);
>>               pages = ram_save_host_page(rs, &pss, last_stage);
>> +            ram_save_host_page_post(rs, &pss, opaque, &pages);
>>           }
>>       } while (!pages && again);
>>   
>> @@ -1880,10 +1982,13 @@ static void ram_save_cleanup(void *opaque)
>>       RAMState **rsp = opaque;
>>       RAMBlock *block;
>>   
>> -    /* caller have hold iothread lock or is in a bh, so there is
>> -     * no writing race against the migration bitmap
>> -     */
>> -    memory_global_dirty_log_stop();
>> +    /* We don't use dirty log with background snapshots */
>> +    if (!migrate_background_snapshot()) {
>> +        /* caller have hold iothread lock or is in a bh, so there is
>> +         * no writing race against the migration bitmap
>> +         */
>> +        memory_global_dirty_log_stop();
>> +    }
>>   
>>       RAMBLOCK_FOREACH_NOT_IGNORED(block) {
>>           g_free(block->clear_bmap);
>> @@ -2343,8 +2448,11 @@ static void ram_init_bitmaps(RAMState *rs)
>>   
>>       WITH_RCU_READ_LOCK_GUARD() {
>>           ram_list_init_bitmaps();
>> -        memory_global_dirty_log_start();
>> -        migration_bitmap_sync_precopy(rs);
>> +        /* We don't use dirty log with background snapshots */
>> +        if (!migrate_background_snapshot()) {
>> +            memory_global_dirty_log_start();
>> +            migration_bitmap_sync_precopy(rs);
>> +        }
>>       }
>>       qemu_mutex_unlock_ramlist();
>>       qemu_mutex_unlock_iothread();
>> @@ -3794,7 +3902,14 @@ static int ram_resume_prepare(MigrationState *s, void *opaque)
>>    */
>>   bool ram_write_tracking_available(void)
>>   {
>> -    /* TODO: implement */
>> +#ifdef CONFIG_LINUX
>> +    uint64_t uffd_features;
>> +    int res;
>> +
>> +    res = uffd_query_features(&uffd_features);
>> +    return (res == 0 &&
>> +            (uffd_features & UFFD_FEATURE_PAGEFAULT_FLAG_WP) != 0);
>> +#endif
>>       return false;
>>   }
>>   
>> @@ -3805,10 +3920,148 @@ bool ram_write_tracking_available(void)
>>    */
>>   bool ram_write_tracking_compatible(void)
>>   {
>> -    /* TODO: implement */
>> +#ifdef CONFIG_LINUX
>> +    const uint64_t uffd_ioctls_mask = BIT(_UFFDIO_WRITEPROTECT);
>> +    int uffd_fd;
>> +    RAMBlock *bs;
>> +    bool ret = false;
>> +
>> +    /* Open UFFD file descriptor */
>> +    uffd_fd = uffd_create_fd(UFFD_FEATURE_PAGEFAULT_FLAG_WP, false);
>> +    if (uffd_fd < 0) {
>> +        return false;
>> +    }
>> +
>> +    RCU_READ_LOCK_GUARD();
>> +
>> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
>> +        uint64_t uffd_ioctls;
>> +
>> +        /* Nothing to do with read-only and MMIO-writable regions */
>> +        if (bs->mr->readonly || bs->mr->rom_device) {
>> +            continue;
>> +        }
>> +        /* Try to register block memory via UFFD-IO to track writes */
>> +        if (uffd_register_memory(uffd_fd, bs->host, bs->max_length,
>> +                UFFDIO_REGISTER_MODE_WP, &uffd_ioctls)) {
>> +            goto out;
>> +        }
>> +        if ((uffd_ioctls & uffd_ioctls_mask) != uffd_ioctls_mask) {
>> +            goto out;
>> +        }
>> +    }
>> +    ret = true;
>> +
>> +out:
>> +    uffd_close_fd(uffd_fd);
>> +    return ret;
>> +#endif
>>       return false;
>>   }
>>   
>> +/*
>> + * ram_write_tracking_start: start UFFD-WP memory tracking
>> + *
>> + * Returns 0 for success or negative value in case of error
>> + */
>> +int ram_write_tracking_start(void)
>> +{
>> +#ifdef CONFIG_LINUX
>> +    int uffd_fd;
>> +    RAMState *rs = ram_state;
>> +    RAMBlock *bs;
>> +
>> +    /* Open UFFD file descriptor */
>> +    uffd_fd = uffd_create_fd(UFFD_FEATURE_PAGEFAULT_FLAG_WP, true);
>> +    if (uffd_fd < 0) {
>> +        return uffd_fd;
>> +    }
>> +    rs->uffdio_fd = uffd_fd;
>> +
>> +    RCU_READ_LOCK_GUARD();
>> +
>> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
>> +        /* Nothing to do with read-only and MMIO-writable regions */
>> +        if (bs->mr->readonly || bs->mr->rom_device) {
>> +            continue;
>> +        }
>> +
>> +        /* Register block memory with UFFD to track writes */
>> +        if (uffd_register_memory(rs->uffdio_fd, bs->host,
>> +                bs->max_length, UFFDIO_REGISTER_MODE_WP, NULL)) {
>> +            goto fail;
>> +        }
>> +        /* Apply UFFD write protection to the block memory range */
>> +        if (uffd_change_protection(rs->uffdio_fd, bs->host,
>> +                bs->max_length, true, false)) {
>> +            goto fail;
>> +        }
>> +        bs->flags |= RAM_UF_WRITEPROTECT;
>> +        memory_region_ref(bs->mr);
>> +
>> +        trace_ram_write_tracking_ramblock_start(bs->idstr, bs->page_size,
>> +                bs->host, bs->max_length);
>> +    }
>> +
>> +    return 0;
>> +
>> +fail:
>> +    error_report("ram_write_tracking_start() failed: restoring initial memory state");
>> +
>> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
>> +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
>> +            continue;
>> +        }
>> +        /*
>> +         * In case some memory block failed to be write-protected
>> +         * remove protection and unregister all succeeded RAM blocks
>> +         */
>> +        uffd_change_protection(rs->uffdio_fd, bs->host, bs->max_length, false, false);
>> +        uffd_unregister_memory(rs->uffdio_fd, bs->host, bs->max_length);
>> +        /* Cleanup flags and remove reference */
>> +        bs->flags &= ~RAM_UF_WRITEPROTECT;
>> +        memory_region_unref(bs->mr);
>> +    }
>> +
>> +    uffd_close_fd(uffd_fd);
>> +    rs->uffdio_fd = -1;
>> +#endif /* CONFIG_LINUX */
>> +    return -1;
>> +}
>> +
>> +/**
>> + * ram_write_tracking_stop: stop UFFD-WP memory tracking and remove protection
>> + */
>> +void ram_write_tracking_stop(void)
>> +{
>> +#ifdef CONFIG_LINUX
>> +    RAMState *rs = ram_state;
>> +    RAMBlock *bs;
>> +
>> +    RCU_READ_LOCK_GUARD();
>> +
>> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
>> +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
>> +            continue;
>> +        }
>> +        /* Remove protection and unregister all affected RAM blocks */
>> +        uffd_change_protection(rs->uffdio_fd, bs->host, bs->max_length, false, false);
>> +        uffd_unregister_memory(rs->uffdio_fd, bs->host, bs->max_length);
>> +
>> +        trace_ram_write_tracking_ramblock_stop(bs->idstr, bs->page_size,
>> +                bs->host, bs->max_length);
>> +
>> +        /* Cleanup flags and remove reference */
>> +        bs->flags &= ~RAM_UF_WRITEPROTECT;
>> +        memory_region_unref(bs->mr);
>> +    }
>> +
>> +    /* Finally close UFFD file descriptor */
>> +    uffd_close_fd(rs->uffdio_fd);
>> +    rs->uffdio_fd = -1;
>> +#endif /* CONFIG_LINUX */
>> +}
>> +
>>   static SaveVMHandlers savevm_ram_handlers = {
>>       .save_setup = ram_save_setup,
>>       .save_live_iterate = ram_save_iterate,
>> diff --git a/migration/ram.h b/migration/ram.h
>> index 1a9ff90304..c25540cb93 100644
>> --- a/migration/ram.h
>> +++ b/migration/ram.h
>> @@ -82,5 +82,7 @@ void colo_incoming_start_dirty_log(void);
>>   /* Background snapshot */
>>   bool ram_write_tracking_available(void);
>>   bool ram_write_tracking_compatible(void);
>> +int ram_write_tracking_start(void);
>> +void ram_write_tracking_stop(void);
>>   
>>   #endif
>> diff --git a/migration/trace-events b/migration/trace-events
>> index 75de5004ac..668c562fed 100644
>> --- a/migration/trace-events
>> +++ b/migration/trace-events
>> @@ -111,6 +111,8 @@ save_xbzrle_page_skipping(void) ""
>>   save_xbzrle_page_overflow(void) ""
>>   ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
>>   ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
>> +ram_write_tracking_ramblock_start(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
>> +ram_write_tracking_ramblock_stop(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
>>   
>>   # multifd.c
>>   multifd_new_send_channel_async(uint8_t id) "channel %d"
>> -- 
>> 2.25.1
>>

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com




* Re: [PATCH v11 4/5] migration: implementation of background snapshot thread
  2021-01-19 18:49   ` Dr. David Alan Gilbert
@ 2021-01-21  8:58     ` Andrey Gruzdev
  2021-01-21  9:56       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 24+ messages in thread
From: Andrey Gruzdev @ 2021-01-21  8:58 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Juan Quintela, qemu-devel, Peter Xu, Markus Armbruster,
	Paolo Bonzini, Den Lunev


On 19.01.2021 21:49, Dr. David Alan Gilbert wrote:
> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>> Introducing an implementation of a 'background' snapshot thread,
>> which overall follows the logic of precopy migration
>> while internally utilizing a completely different mechanism
>> to 'freeze' the vmstate at the start of snapshot creation.
>>
>> This mechanism is based on userfault_fd with wr-protection
>> support and is Linux-specific.
> I noticed there weren't any bdrv_ calls in here; I guess with a snapshot
> you still have the source running so still have it accessing the disk;
> do you do anything to try and wire the ram snapshotting up to disk
> snapshotting?

Block-related manipulations should be done externally, I think.
So create backing images for RW nodes, then stop the VM, switch the block
graph and start the background snapshot. Something like what
'virsh snapshot-create-as' does, but in a different sequence.

//

>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
>> Acked-by: Peter Xu <peterx@redhat.com>
>> ---
>>   migration/migration.c | 255 +++++++++++++++++++++++++++++++++++++++++-
>>   migration/migration.h |   3 +
>>   migration/ram.c       |   2 +
>>   migration/savevm.c    |   1 -
>>   migration/savevm.h    |   2 +
>>   5 files changed, 260 insertions(+), 3 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 2c2cb9ef01..0901a15ac5 100644
> <snip>
>
>> -    qemu_thread_create(&s->thread, "live_migration", migration_thread, s,
>> -                       QEMU_THREAD_JOINABLE);
>> +
>> +    if (migrate_background_snapshot()) {
>> +        qemu_thread_create(&s->thread, "background_snapshot",
> Unfortunately that won't work - there's a 14 character limit on
> the thread name length; I guess we just shorten that to bg_snapshot

Yep, I missed that pthread_setname_np() has a length limit.

> Other than that,
>
>
>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>
>> +                bg_migration_thread, s, QEMU_THREAD_JOINABLE);
>> +    } else {
>> +        qemu_thread_create(&s->thread, "live_migration",
>> +                migration_thread, s, QEMU_THREAD_JOINABLE);
>> +    }
>>       s->migration_thread_running = true;
>>   }
>>   
>> diff --git a/migration/migration.h b/migration/migration.h
>> index f40338cfbf..0723955cd7 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -20,6 +20,7 @@
>>   #include "qemu/thread.h"
>>   #include "qemu/coroutine_int.h"
>>   #include "io/channel.h"
>> +#include "io/channel-buffer.h"
>>   #include "net/announce.h"
>>   #include "qom/object.h"
>>   
>> @@ -147,8 +148,10 @@ struct MigrationState {
>>   
>>       /*< public >*/
>>       QemuThread thread;
>> +    QEMUBH *vm_start_bh;
>>       QEMUBH *cleanup_bh;
>>       QEMUFile *to_dst_file;
>> +    QIOChannelBuffer *bioc;
>>       /*
>>        * Protects to_dst_file pointer.  We need to make sure we won't
>>        * yield or hang during the critical section, since this lock will
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 5707382db1..05fe0c8592 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -1471,6 +1471,7 @@ static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset)
>>       page_address = (void *) uffd_msg.arg.pagefault.address;
>>       bs = qemu_ram_block_from_host(page_address, false, offset);
>>       assert(bs && (bs->flags & RAM_UF_WRITEPROTECT) != 0);
>> +
>>       return bs;
>>   }
>>   #endif /* CONFIG_LINUX */
>> @@ -1836,6 +1837,7 @@ static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss,
>>           /* Un-protect memory range. */
>>           res = uffd_change_protection(rs->uffdio_fd, page_address, run_length,
>>                   false, false);
>> +
>>           /* We don't want to override existing error from ram_save_host_page(). */
>>           if (res < 0 && *res_override >= 0) {
>>               *res_override = res;
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index 27e842812e..dd4ad0aaaf 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -1354,7 +1354,6 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
>>       return 0;
>>   }
>>   
>> -static
>>   int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
>>                                                       bool in_postcopy,
>>                                                       bool inactivate_disks)
>> diff --git a/migration/savevm.h b/migration/savevm.h
>> index ba64a7e271..aaee2528ed 100644
>> --- a/migration/savevm.h
>> +++ b/migration/savevm.h
>> @@ -64,5 +64,7 @@ int qemu_loadvm_state(QEMUFile *f);
>>   void qemu_loadvm_state_cleanup(void);
>>   int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
>>   int qemu_load_device_state(QEMUFile *f);
>> +int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
>> +        bool in_postcopy, bool inactivate_disks);
>>   
>>   #endif
>> -- 
>> 2.25.1
>>

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com




* Re: [PATCH v11 4/5] migration: implementation of background snapshot thread
  2021-01-21  8:58     ` Andrey Gruzdev
@ 2021-01-21  9:56       ` Dr. David Alan Gilbert
  2021-01-21 11:08         ` Andrey Gruzdev
  0 siblings, 1 reply; 24+ messages in thread
From: Dr. David Alan Gilbert @ 2021-01-21  9:56 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, qemu-devel, Peter Xu, Markus Armbruster,
	Paolo Bonzini, Den Lunev

* Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> On 19.01.2021 21:49, Dr. David Alan Gilbert wrote:
> > * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> > > Introducing an implementation of a 'background' snapshot thread,
> > > which overall follows the logic of precopy migration
> > > while internally utilizing a completely different mechanism
> > > to 'freeze' the vmstate at the start of snapshot creation.
> > > 
> > > This mechanism is based on userfault_fd with wr-protection
> > > support and is Linux-specific.
> > I noticed there weren't any bdrv_ calls in here; I guess with a snapshot
> > you still have the source running so still have it accessing the disk;
> > do you do anything to try and wire the ram snapshotting up to disk
> > snapshotting?
> 
> Block-related manipulations should be done externally, I think.
> So create backing images for RW nodes, then stop the VM, switch the block
> graph and start the background snapshot. Something like what
> 'virsh snapshot-create-as' does, but in a different sequence.

If you get a chance it would be great if you could put together an
example of doing the combined RAM+block snapshot; that way we find out
if there's anything silly missing.

Dave

> //
> 
> > > Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> > > Acked-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >   migration/migration.c | 255 +++++++++++++++++++++++++++++++++++++++++-
> > >   migration/migration.h |   3 +
> > >   migration/ram.c       |   2 +
> > >   migration/savevm.c    |   1 -
> > >   migration/savevm.h    |   2 +
> > >   5 files changed, 260 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index 2c2cb9ef01..0901a15ac5 100644
> > <snip>
> > 
> > > -    qemu_thread_create(&s->thread, "live_migration", migration_thread, s,
> > > -                       QEMU_THREAD_JOINABLE);
> > > +
> > > +    if (migrate_background_snapshot()) {
> > > +        qemu_thread_create(&s->thread, "background_snapshot",
> > Unfortunately that won't work - there's a 14 character limit on
> > the thread name length; I guess we just shorten that to bg_snapshot
> 
> Yep, I missed that pthread_setname_np() has a length limit.
> 
> > Other than that,
> > 
> > 
> > 
> > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > 
> > > +                bg_migration_thread, s, QEMU_THREAD_JOINABLE);
> > > +    } else {
> > > +        qemu_thread_create(&s->thread, "live_migration",
> > > +                migration_thread, s, QEMU_THREAD_JOINABLE);
> > > +    }
> > >       s->migration_thread_running = true;
> > >   }
> > > diff --git a/migration/migration.h b/migration/migration.h
> > > index f40338cfbf..0723955cd7 100644
> > > --- a/migration/migration.h
> > > +++ b/migration/migration.h
> > > @@ -20,6 +20,7 @@
> > >   #include "qemu/thread.h"
> > >   #include "qemu/coroutine_int.h"
> > >   #include "io/channel.h"
> > > +#include "io/channel-buffer.h"
> > >   #include "net/announce.h"
> > >   #include "qom/object.h"
> > > @@ -147,8 +148,10 @@ struct MigrationState {
> > >       /*< public >*/
> > >       QemuThread thread;
> > > +    QEMUBH *vm_start_bh;
> > >       QEMUBH *cleanup_bh;
> > >       QEMUFile *to_dst_file;
> > > +    QIOChannelBuffer *bioc;
> > >       /*
> > >        * Protects to_dst_file pointer.  We need to make sure we won't
> > >        * yield or hang during the critical section, since this lock will
> > > diff --git a/migration/ram.c b/migration/ram.c
> > > index 5707382db1..05fe0c8592 100644
> > > --- a/migration/ram.c
> > > +++ b/migration/ram.c
> > > @@ -1471,6 +1471,7 @@ static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset)
> > >       page_address = (void *) uffd_msg.arg.pagefault.address;
> > >       bs = qemu_ram_block_from_host(page_address, false, offset);
> > >       assert(bs && (bs->flags & RAM_UF_WRITEPROTECT) != 0);
> > > +
> > >       return bs;
> > >   }
> > >   #endif /* CONFIG_LINUX */
> > > @@ -1836,6 +1837,7 @@ static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss,
> > >           /* Un-protect memory range. */
> > >           res = uffd_change_protection(rs->uffdio_fd, page_address, run_length,
> > >                   false, false);
> > > +
> > >           /* We don't want to override existing error from ram_save_host_page(). */
> > >           if (res < 0 && *res_override >= 0) {
> > >               *res_override = res;
> > > diff --git a/migration/savevm.c b/migration/savevm.c
> > > index 27e842812e..dd4ad0aaaf 100644
> > > --- a/migration/savevm.c
> > > +++ b/migration/savevm.c
> > > @@ -1354,7 +1354,6 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
> > >       return 0;
> > >   }
> > > -static
> > >   int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
> > >                                                       bool in_postcopy,
> > >                                                       bool inactivate_disks)
> > > diff --git a/migration/savevm.h b/migration/savevm.h
> > > index ba64a7e271..aaee2528ed 100644
> > > --- a/migration/savevm.h
> > > +++ b/migration/savevm.h
> > > @@ -64,5 +64,7 @@ int qemu_loadvm_state(QEMUFile *f);
> > >   void qemu_loadvm_state_cleanup(void);
> > >   int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
> > >   int qemu_load_device_state(QEMUFile *f);
> > > +int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
> > > +        bool in_postcopy, bool inactivate_disks);
> > >   #endif
> > > -- 
> > > 2.25.1
> > > 
> 
> -- 
> Andrey Gruzdev, Principal Engineer
> Virtuozzo GmbH  +7-903-247-6397
>                 virtuzzo.com
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v11 3/5] migration: support UFFD write fault processing in ram_save_iterate()
  2021-01-21  8:48     ` Andrey Gruzdev
@ 2021-01-21 10:09       ` Dr. David Alan Gilbert
  2021-01-21 11:12         ` Andrey Gruzdev
  0 siblings, 1 reply; 24+ messages in thread
From: Dr. David Alan Gilbert @ 2021-01-21 10:09 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, qemu-devel, Peter Xu, Markus Armbruster,
	Paolo Bonzini, Den Lunev

* Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> On 19.01.2021 20:49, Dr. David Alan Gilbert wrote:
> > * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> > > In this particular implementation the same single migration
> > > thread is responsible for both normal linear dirty page
> > > migration and processing UFFD page fault events.
> > > 
> > > Processing write faults includes reading UFFD file descriptor,
> > > finding respective RAM block and saving faulting page to
> > > the migration stream. After a page has been saved, write protection
> > > can be removed. Since the asynchronous version of qemu_put_buffer()
> > > is expected to be used to save pages, we also have to flush the
> > > migration stream prior to un-protecting the saved memory range.
> > > 
> > > Write protection is removed for any previously protected
> > > memory chunk that has hit the migration stream. That applies
> > > both to pages from the linear page scan and to write-fault pages.
> > > 
> > > Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> > > Acked-by: Peter Xu <peterx@redhat.com>
> > I think this is OK, so:
> > 
> > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > 
> > I think I'd have preferred to have kept the #ifdef LINUX's out of there,
> > and I suspect using the opaques for the pre/post hook is overly
> > complex; but otherwise OK.
> > 
> > Dave
> 
> Mm, I think it's impossible to completely move #ifdef LINUX out of there,
> but I suspect all #ifdef LINUX code can be moved to a single place with
> stubs for the #else case. I suppose that would be better than it is now.
> As for the pre/post hooks - they're not too complex, but to keep a single
> #ifdef section it really looks better to move this stuff into
> ram_save_host_page().

An add-on patch to consolidate the ifdef'ing to one place would be
interesting; I'll look at taking this series as-is (other than the
thread name fix-up etc.) when I get around to my next pull (although I
need to review some other people's patches first!)

Dave

> Andrey
> 
> > > ---
> > >   include/exec/memory.h  |   7 ++
> > >   migration/ram.c        | 269 +++++++++++++++++++++++++++++++++++++++--
> > >   migration/ram.h        |   2 +
> > >   migration/trace-events |   2 +
> > >   4 files changed, 272 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/include/exec/memory.h b/include/exec/memory.h
> > > index b76b1256bf..1aa1c6a3f4 100644
> > > --- a/include/exec/memory.h
> > > +++ b/include/exec/memory.h
> > > @@ -150,6 +150,13 @@ typedef struct IOMMUTLBEvent {
> > >   #define RAM_PMEM (1 << 5)
> > > +/*
> > > + * UFFDIO_WRITEPROTECT is used on this RAMBlock to
> > > + * support 'write-tracking' migration type.
> > > + * Implies ram_state->ram_wt_enabled.
> > > + */
> > > +#define RAM_UF_WRITEPROTECT (1 << 6)
> > > +
> > >   static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
> > >                                          IOMMUNotifierFlag flags,
> > >                                          hwaddr start, hwaddr end,
> > > diff --git a/migration/ram.c b/migration/ram.c
> > > index ae8de17153..5707382db1 100644
> > > --- a/migration/ram.c
> > > +++ b/migration/ram.c
> > > @@ -56,6 +56,11 @@
> > >   #include "savevm.h"
> > >   #include "qemu/iov.h"
> > >   #include "multifd.h"
> > > +#include "sysemu/runstate.h"
> > > +
> > > +#ifdef CONFIG_LINUX
> > > +#include "qemu/userfaultfd.h"
> > > +#endif
> > >   /***********************************************************/
> > >   /* ram save/restore */
> > > @@ -298,6 +303,8 @@ struct RAMSrcPageRequest {
> > >   struct RAMState {
> > >       /* QEMUFile used for this migration */
> > >       QEMUFile *f;
> > > +    /* UFFD file descriptor, used in 'write-tracking' migration */
> > > +    int uffdio_fd;
> > >       /* Last block that we have visited searching for dirty pages */
> > >       RAMBlock *last_seen_block;
> > >       /* Last block from where we have sent data */
> > > @@ -1434,6 +1441,40 @@ static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset)
> > >       return block;
> > >   }
> > > +#ifdef CONFIG_LINUX
> > > +/**
> > > + * poll_fault_page: try to get next UFFD write fault page and, if pending fault
> > > + *   is found, return RAM block pointer and page offset
> > > + *
> > > + * Returns pointer to the RAMBlock containing faulting page,
> > > + *   NULL if no write faults are pending
> > > + *
> > > + * @rs: current RAM state
> > > + * @offset: page offset from the beginning of the block
> > > + */
> > > +static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset)
> > > +{
> > > +    struct uffd_msg uffd_msg;
> > > +    void *page_address;
> > > +    RAMBlock *bs;
> > > +    int res;
> > > +
> > > +    if (!migrate_background_snapshot()) {
> > > +        return NULL;
> > > +    }
> > > +
> > > +    res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1);
> > > +    if (res <= 0) {
> > > +        return NULL;
> > > +    }
> > > +
> > > +    page_address = (void *) uffd_msg.arg.pagefault.address;
> > > +    bs = qemu_ram_block_from_host(page_address, false, offset);
> > > +    assert(bs && (bs->flags & RAM_UF_WRITEPROTECT) != 0);
> > > +    return bs;
> > > +}
> > > +#endif /* CONFIG_LINUX */
> > > +
> > >   /**
> > >    * get_queued_page: unqueue a page from the postcopy requests
> > >    *
> > > @@ -1473,6 +1514,16 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
> > >       } while (block && !dirty);
> > > +#ifdef CONFIG_LINUX
> > > +    if (!block) {
> > > +        /*
> > > +         * Poll write faults too if background snapshot is enabled; that's
> > > +         * when vcpus can get blocked by write-protected pages.
> > > +         */
> > > +        block = poll_fault_page(rs, &offset);
> > > +    }
> > > +#endif /* CONFIG_LINUX */
> > > +
> > >       if (block) {
> > >           /*
> > >            * As soon as we start servicing pages out of order, then we have
> > > @@ -1746,6 +1797,53 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
> > >       return pages;
> > >   }
> > > +/**
> > > + * ram_save_host_page_pre: ram_save_host_page() pre-notifier
> > > + *
> > > + * @rs: current RAM state
> > > + * @pss: page-search-status structure
> > > + * @opaque: pointer to receive opaque context value
> > > + */
> > > +static inline
> > > +void ram_save_host_page_pre(RAMState *rs, PageSearchStatus *pss, void **opaque)
> > > +{
> > > +    *(unsigned long *) opaque = pss->page;
> > > +}
> > > +
> > > +/**
> > > + * ram_save_host_page_post: ram_save_host_page() post-notifier
> > > + *
> > > + * @rs: current RAM state
> > > + * @pss: page-search-status structure
> > > + * @opaque: opaque context value
> > > + * @res_override: pointer to the return value of ram_save_host_page(),
> > > + *   overwritten in case of an error
> > > + */
> > > +static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss,
> > > +        void *opaque, int *res_override)
> > > +{
> > > +#ifdef CONFIG_LINUX
> > > +    /* Check if page is from UFFD-managed region. */
> > > +    if (pss->block->flags & RAM_UF_WRITEPROTECT) {
> > > +        unsigned long page_from = (unsigned long) opaque;
> > > +
> > > +        void *page_address = pss->block->host + (page_from << TARGET_PAGE_BITS);
> > > +        uint64_t run_length = (pss->page - page_from + 1) << TARGET_PAGE_BITS;
> > > +        int res;
> > > +
> > > +        /* Flush async buffers before un-protect. */
> > > +        qemu_fflush(rs->f);
> > > +        /* Un-protect memory range. */
> > > +        res = uffd_change_protection(rs->uffdio_fd, page_address, run_length,
> > > +                false, false);
> > > +        /* We don't want to override existing error from ram_save_host_page(). */
> > > +        if (res < 0 && *res_override >= 0) {
> > > +            *res_override = res;
> > > +        }
> > > +    }
> > > +#endif /* CONFIG_LINUX */
> > > +}
> > > +
> > >   /**
> > >    * ram_find_and_save_block: finds a dirty page and sends it to f
> > >    *
> > > @@ -1790,7 +1888,11 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
> > >           }
> > >           if (found) {
> > > +            void *opaque;
> > > +
> > > +            ram_save_host_page_pre(rs, &pss, &opaque);
> > >               pages = ram_save_host_page(rs, &pss, last_stage);
> > > +            ram_save_host_page_post(rs, &pss, opaque, &pages);
> > >           }
> > >       } while (!pages && again);
> > > @@ -1880,10 +1982,13 @@ static void ram_save_cleanup(void *opaque)
> > >       RAMState **rsp = opaque;
> > >       RAMBlock *block;
> > > -    /* caller have hold iothread lock or is in a bh, so there is
> > > -     * no writing race against the migration bitmap
> > > -     */
> > > -    memory_global_dirty_log_stop();
> > > +    /* We don't use dirty log with background snapshots */
> > > +    if (!migrate_background_snapshot()) {
> > > +        /* the caller holds the iothread lock or is in a bh, so there
> > > +         * is no writing race against the migration bitmap
> > > +         */
> > > +        memory_global_dirty_log_stop();
> > > +    }
> > >       RAMBLOCK_FOREACH_NOT_IGNORED(block) {
> > >           g_free(block->clear_bmap);
> > > @@ -2343,8 +2448,11 @@ static void ram_init_bitmaps(RAMState *rs)
> > >       WITH_RCU_READ_LOCK_GUARD() {
> > >           ram_list_init_bitmaps();
> > > -        memory_global_dirty_log_start();
> > > -        migration_bitmap_sync_precopy(rs);
> > > +        /* We don't use dirty log with background snapshots */
> > > +        if (!migrate_background_snapshot()) {
> > > +            memory_global_dirty_log_start();
> > > +            migration_bitmap_sync_precopy(rs);
> > > +        }
> > >       }
> > >       qemu_mutex_unlock_ramlist();
> > >       qemu_mutex_unlock_iothread();
> > > @@ -3794,7 +3902,14 @@ static int ram_resume_prepare(MigrationState *s, void *opaque)
> > >    */
> > >   bool ram_write_tracking_available(void)
> > >   {
> > > -    /* TODO: implement */
> > > +#ifdef CONFIG_LINUX
> > > +    uint64_t uffd_features;
> > > +    int res;
> > > +
> > > +    res = uffd_query_features(&uffd_features);
> > > +    return (res == 0 &&
> > > +            (uffd_features & UFFD_FEATURE_PAGEFAULT_FLAG_WP) != 0);
> > > +#endif
> > >       return false;
> > >   }
> > > @@ -3805,10 +3920,148 @@ bool ram_write_tracking_available(void)
> > >    */
> > >   bool ram_write_tracking_compatible(void)
> > >   {
> > > -    /* TODO: implement */
> > > +#ifdef CONFIG_LINUX
> > > +    const uint64_t uffd_ioctls_mask = BIT(_UFFDIO_WRITEPROTECT);
> > > +    int uffd_fd;
> > > +    RAMBlock *bs;
> > > +    bool ret = false;
> > > +
> > > +    /* Open UFFD file descriptor */
> > > +    uffd_fd = uffd_create_fd(UFFD_FEATURE_PAGEFAULT_FLAG_WP, false);
> > > +    if (uffd_fd < 0) {
> > > +        return false;
> > > +    }
> > > +
> > > +    RCU_READ_LOCK_GUARD();
> > > +
> > > +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> > > +        uint64_t uffd_ioctls;
> > > +
> > > +        /* Nothing to do with read-only and MMIO-writable regions */
> > > +        if (bs->mr->readonly || bs->mr->rom_device) {
> > > +            continue;
> > > +        }
> > > +        /* Try to register block memory via UFFD-IO to track writes */
> > > +        if (uffd_register_memory(uffd_fd, bs->host, bs->max_length,
> > > +                UFFDIO_REGISTER_MODE_WP, &uffd_ioctls)) {
> > > +            goto out;
> > > +        }
> > > +        if ((uffd_ioctls & uffd_ioctls_mask) != uffd_ioctls_mask) {
> > > +            goto out;
> > > +        }
> > > +    }
> > > +    ret = true;
> > > +
> > > +out:
> > > +    uffd_close_fd(uffd_fd);
> > > +    return ret;
> > > +#endif
> > >       return false;
> > >   }
> > > +/*
> > > + * ram_write_tracking_start: start UFFD-WP memory tracking
> > > + *
> > > + * Returns 0 for success or negative value in case of error
> > > + */
> > > +int ram_write_tracking_start(void)
> > > +{
> > > +#ifdef CONFIG_LINUX
> > > +    int uffd_fd;
> > > +    RAMState *rs = ram_state;
> > > +    RAMBlock *bs;
> > > +
> > > +    /* Open UFFD file descriptor */
> > > +    uffd_fd = uffd_create_fd(UFFD_FEATURE_PAGEFAULT_FLAG_WP, true);
> > > +    if (uffd_fd < 0) {
> > > +        return uffd_fd;
> > > +    }
> > > +    rs->uffdio_fd = uffd_fd;
> > > +
> > > +    RCU_READ_LOCK_GUARD();
> > > +
> > > +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> > > +        /* Nothing to do with read-only and MMIO-writable regions */
> > > +        if (bs->mr->readonly || bs->mr->rom_device) {
> > > +            continue;
> > > +        }
> > > +
> > > +        /* Register block memory with UFFD to track writes */
> > > +        if (uffd_register_memory(rs->uffdio_fd, bs->host,
> > > +                bs->max_length, UFFDIO_REGISTER_MODE_WP, NULL)) {
> > > +            goto fail;
> > > +        }
> > > +        /* Apply UFFD write protection to the block memory range */
> > > +        if (uffd_change_protection(rs->uffdio_fd, bs->host,
> > > +                bs->max_length, true, false)) {
> > > +            goto fail;
> > > +        }
> > > +        bs->flags |= RAM_UF_WRITEPROTECT;
> > > +        memory_region_ref(bs->mr);
> > > +
> > > +        trace_ram_write_tracking_ramblock_start(bs->idstr, bs->page_size,
> > > +                bs->host, bs->max_length);
> > > +    }
> > > +
> > > +    return 0;
> > > +
> > > +fail:
> > > +    error_report("ram_write_tracking_start() failed: restoring initial memory state");
> > > +
> > > +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> > > +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
> > > +            continue;
> > > +        }
> > > +        /*
> > > +         * In case some memory block failed to be write-protected
> > > +         * remove protection and unregister all succeeded RAM blocks
> > > +         */
> > > +        uffd_change_protection(rs->uffdio_fd, bs->host, bs->max_length, false, false);
> > > +        uffd_unregister_memory(rs->uffdio_fd, bs->host, bs->max_length);
> > > +        /* Cleanup flags and remove reference */
> > > +        bs->flags &= ~RAM_UF_WRITEPROTECT;
> > > +        memory_region_unref(bs->mr);
> > > +    }
> > > +
> > > +    uffd_close_fd(uffd_fd);
> > > +    rs->uffdio_fd = -1;
> > > +#endif /* CONFIG_LINUX */
> > > +    return -1;
> > > +}
> > > +
> > > +/**
> > > + * ram_write_tracking_stop: stop UFFD-WP memory tracking and remove protection
> > > + */
> > > +void ram_write_tracking_stop(void)
> > > +{
> > > +#ifdef CONFIG_LINUX
> > > +    RAMState *rs = ram_state;
> > > +    RAMBlock *bs;
> > > +
> > > +    RCU_READ_LOCK_GUARD();
> > > +
> > > +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> > > +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
> > > +            continue;
> > > +        }
> > > +        /* Remove protection and unregister all affected RAM blocks */
> > > +        uffd_change_protection(rs->uffdio_fd, bs->host, bs->max_length, false, false);
> > > +        uffd_unregister_memory(rs->uffdio_fd, bs->host, bs->max_length);
> > > +
> > > +        trace_ram_write_tracking_ramblock_stop(bs->idstr, bs->page_size,
> > > +                bs->host, bs->max_length);
> > > +
> > > +        /* Cleanup flags and remove reference */
> > > +        bs->flags &= ~RAM_UF_WRITEPROTECT;
> > > +        memory_region_unref(bs->mr);
> > > +    }
> > > +
> > > +    /* Finally close UFFD file descriptor */
> > > +    uffd_close_fd(rs->uffdio_fd);
> > > +    rs->uffdio_fd = -1;
> > > +#endif /* CONFIG_LINUX */
> > > +}
> > > +
> > >   static SaveVMHandlers savevm_ram_handlers = {
> > >       .save_setup = ram_save_setup,
> > >       .save_live_iterate = ram_save_iterate,
> > > diff --git a/migration/ram.h b/migration/ram.h
> > > index 1a9ff90304..c25540cb93 100644
> > > --- a/migration/ram.h
> > > +++ b/migration/ram.h
> > > @@ -82,5 +82,7 @@ void colo_incoming_start_dirty_log(void);
> > >   /* Background snapshot */
> > >   bool ram_write_tracking_available(void);
> > >   bool ram_write_tracking_compatible(void);
> > > +int ram_write_tracking_start(void);
> > > +void ram_write_tracking_stop(void);
> > >   #endif
> > > diff --git a/migration/trace-events b/migration/trace-events
> > > index 75de5004ac..668c562fed 100644
> > > --- a/migration/trace-events
> > > +++ b/migration/trace-events
> > > @@ -111,6 +111,8 @@ save_xbzrle_page_skipping(void) ""
> > >   save_xbzrle_page_overflow(void) ""
> > >   ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
> > >   ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
> > > +ram_write_tracking_ramblock_start(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
> > > +ram_write_tracking_ramblock_stop(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
> > >   # multifd.c
> > >   multifd_new_send_channel_async(uint8_t id) "channel %d"
> > > -- 
> > > 2.25.1
> > > 
> 
> -- 
> Andrey Gruzdev, Principal Engineer
> Virtuozzo GmbH  +7-903-247-6397
>                 virtuzzo.com
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v11 4/5] migration: implementation of background snapshot thread
  2021-01-21  9:56       ` Dr. David Alan Gilbert
@ 2021-01-21 11:08         ` Andrey Gruzdev
  2021-01-21 16:11           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 24+ messages in thread
From: Andrey Gruzdev @ 2021-01-21 11:08 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Juan Quintela, qemu-devel, Peter Xu, Markus Armbruster,
	Paolo Bonzini, Den Lunev


On 21.01.2021 12:56, Dr. David Alan Gilbert wrote:
> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>> On 19.01.2021 21:49, Dr. David Alan Gilbert wrote:
>>> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>>>> Introducing the implementation of the 'background' snapshot thread,
>>>> which overall follows the logic of precopy migration
>>>> while internally utilizing a completely different mechanism
>>>> to 'freeze' the vmstate at the start of snapshot creation.
>>>>
>>>> This mechanism is based on userfaultfd with write-protection
>>>> support and is Linux-specific.
>>> I noticed there weren't any bdrv_ calls in here; I guess with a snapshot
>>> you still have the source running so still have it accessing the disk;
>>> do you do anything to try and wire the ram snapshotting up to disk
>>> snapshotting?
>> Block-related manipulations should be done externally, I think.
>> So create backing images for RW nodes, then stop the VM, switch the block
>> graph and start the background snapshot. Something like what
>> 'virsh snapshot-create-as' does, but in a different sequence.
> If you get a chance it would be great if you could put together an
> example of doing the combination RAM+block; that way we find out if there's
> anything silly missing.
>
> Dave

Yep, I'll take a look at the QMP command sequences, figure out how they should
look in our case and prepare an example; hope we are not missing anything serious.
At least we know that block setup data won't go into the snapshot.
I've also checked starting a background snapshot from the stopped VM state -
it looks OK: the VM resumes operation, the snapshot is saved, no apparent problems.
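For the record, one possible sequence for the combined block+RAM snapshot might look roughly like the following (an untested sketch: the domain name, snapshot name, disk spec and file paths are made up, and validating the exact ordering is precisely the open question above):

```sh
# Untested sketch: external disk snapshot first, then the background RAM
# snapshot, so the saved RAM matches the new backing images.
DOM=myvm

# Stop vCPUs so disk and RAM state stay consistent while switching
virsh suspend "$DOM"

# Create external backing images and switch the block graph (no RAM here)
virsh snapshot-create-as "$DOM" snap1 --disk-only --no-metadata \
      --diskspec vda,file=/var/lib/libvirt/images/vm-vda.snap1.qcow2

# Start the background snapshot; it write-protects RAM, after which the
# VM can keep running while the vmstate is written out
virsh qemu-monitor-command "$DOM" --hmp \
      'migrate_set_capability background-snapshot on'
virsh qemu-monitor-command "$DOM" --hmp "migrate exec:'cat > ./vm_state'"
virsh resume "$DOM"
```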

It may take some time, since I'm now working on a tool to store snapshots
with RAM indexable by GPFNs, together with the rest of the VMSTATE, based on
the QCOW2 format. It should also support snapshot revert in postcopy mode.

Andrey

>> //
>>
>>>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
>>>> Acked-by: Peter Xu <peterx@redhat.com>
>>>> ---
>>>>    migration/migration.c | 255 +++++++++++++++++++++++++++++++++++++++++-
>>>>    migration/migration.h |   3 +
>>>>    migration/ram.c       |   2 +
>>>>    migration/savevm.c    |   1 -
>>>>    migration/savevm.h    |   2 +
>>>>    5 files changed, 260 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>> index 2c2cb9ef01..0901a15ac5 100644
>>> <snip>
>>>
>>>> -    qemu_thread_create(&s->thread, "live_migration", migration_thread, s,
>>>> -                       QEMU_THREAD_JOINABLE);
>>>> +
>>>> +    if (migrate_background_snapshot()) {
>>>> +        qemu_thread_create(&s->thread, "background_snapshot",
>>> Unfortunately that won't work - there's a 15-character limit on
>>> the thread name length; I guess we just shorten that to bg_snapshot
>> Yep, I missed that pthread_setname_np() has a length limit.
>>
>>> Other than that,
>>>
>>>
>>>
>>> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>>>
>>>> +                bg_migration_thread, s, QEMU_THREAD_JOINABLE);
>>>> +    } else {
>>>> +        qemu_thread_create(&s->thread, "live_migration",
>>>> +                migration_thread, s, QEMU_THREAD_JOINABLE);
>>>> +    }
>>>>        s->migration_thread_running = true;
>>>>    }
>>>> diff --git a/migration/migration.h b/migration/migration.h
>>>> index f40338cfbf..0723955cd7 100644
>>>> --- a/migration/migration.h
>>>> +++ b/migration/migration.h
>>>> @@ -20,6 +20,7 @@
>>>>    #include "qemu/thread.h"
>>>>    #include "qemu/coroutine_int.h"
>>>>    #include "io/channel.h"
>>>> +#include "io/channel-buffer.h"
>>>>    #include "net/announce.h"
>>>>    #include "qom/object.h"
>>>> @@ -147,8 +148,10 @@ struct MigrationState {
>>>>        /*< public >*/
>>>>        QemuThread thread;
>>>> +    QEMUBH *vm_start_bh;
>>>>        QEMUBH *cleanup_bh;
>>>>        QEMUFile *to_dst_file;
>>>> +    QIOChannelBuffer *bioc;
>>>>        /*
>>>>         * Protects to_dst_file pointer.  We need to make sure we won't
>>>>         * yield or hang during the critical section, since this lock will
>>>> diff --git a/migration/ram.c b/migration/ram.c
>>>> index 5707382db1..05fe0c8592 100644
>>>> --- a/migration/ram.c
>>>> +++ b/migration/ram.c
>>>> @@ -1471,6 +1471,7 @@ static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset)
>>>>        page_address = (void *) uffd_msg.arg.pagefault.address;
>>>>        bs = qemu_ram_block_from_host(page_address, false, offset);
>>>>        assert(bs && (bs->flags & RAM_UF_WRITEPROTECT) != 0);
>>>> +
>>>>        return bs;
>>>>    }
>>>>    #endif /* CONFIG_LINUX */
>>>> @@ -1836,6 +1837,7 @@ static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss,
>>>>            /* Un-protect memory range. */
>>>>            res = uffd_change_protection(rs->uffdio_fd, page_address, run_length,
>>>>                    false, false);
>>>> +
>>>>            /* We don't want to override existing error from ram_save_host_page(). */
>>>>            if (res < 0 && *res_override >= 0) {
>>>>                *res_override = res;
>>>> diff --git a/migration/savevm.c b/migration/savevm.c
>>>> index 27e842812e..dd4ad0aaaf 100644
>>>> --- a/migration/savevm.c
>>>> +++ b/migration/savevm.c
>>>> @@ -1354,7 +1354,6 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
>>>>        return 0;
>>>>    }
>>>> -static
>>>>    int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
>>>>                                                        bool in_postcopy,
>>>>                                                        bool inactivate_disks)
>>>> diff --git a/migration/savevm.h b/migration/savevm.h
>>>> index ba64a7e271..aaee2528ed 100644
>>>> --- a/migration/savevm.h
>>>> +++ b/migration/savevm.h
>>>> @@ -64,5 +64,7 @@ int qemu_loadvm_state(QEMUFile *f);
>>>>    void qemu_loadvm_state_cleanup(void);
>>>>    int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
>>>>    int qemu_load_device_state(QEMUFile *f);
>>>> +int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
>>>> +        bool in_postcopy, bool inactivate_disks);
>>>>    #endif
>>>> -- 
>>>> 2.25.1
>>>>
>> -- 
>> Andrey Gruzdev, Principal Engineer
>> Virtuozzo GmbH  +7-903-247-6397
>>                  virtuzzo.com
>>
-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com




* Re: [PATCH v11 3/5] migration: support UFFD write fault processing in ram_save_iterate()
  2021-01-21 10:09       ` Dr. David Alan Gilbert
@ 2021-01-21 11:12         ` Andrey Gruzdev
  0 siblings, 0 replies; 24+ messages in thread
From: Andrey Gruzdev @ 2021-01-21 11:12 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Juan Quintela, qemu-devel, Peter Xu, Markus Armbruster,
	Paolo Bonzini, Den Lunev


On 21.01.2021 13:09, Dr. David Alan Gilbert wrote:
> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>> On 19.01.2021 20:49, Dr. David Alan Gilbert wrote:
>>> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>>>> In this particular implementation the same single migration
>>>> thread is responsible for both normal linear dirty page
>>>> migration and processing UFFD page fault events.
>>>>
>>>> Processing write faults includes reading the UFFD file descriptor,
>>>> finding the respective RAM block and saving the faulting page to
>>>> the migration stream. After a page has been saved, write protection
>>>> can be removed. Since the asynchronous version of qemu_put_buffer()
>>>> is expected to be used to save pages, we also have to flush the
>>>> migration stream prior to un-protecting the saved memory range.
>>>>
>>>> Write protection is removed for any previously protected
>>>> memory chunk that has hit the migration stream. That holds for
>>>> pages from the linear page scan as well as for write-fault pages.
>>>>
>>>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
>>>> Acked-by: Peter Xu <peterx@redhat.com>
>>> I think this is OK, so:
>>>
>>> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>>>
>>> I think I'd have preferred to have kept the #ifdef LINUX's out of there,
>>> and I suspect using the opaques for the pre/post hooks is overly
>>> complex; but otherwise OK.
>>>
>>> Dave
>> Mm, I think it's impossible to move the #ifdef LINUX's out of there
>> completely, but I suspect all of the #ifdef LINUX code can be gathered in a
>> single place with stubs for the #else case. I suppose that would be better
>> than now. As for the pre/post hooks - they're not too complex, but to keep a
>> single #ifdef section it really looks better to move this stuff into ram_save_host_page().
> An add-on patch to consolidate the ifdef'ing in one place would be
> interesting; I'll look at taking this series as-is (other than the
> thread name fixup etc.) when I get around to my next pull (although I
> need to review some other people's patches first!)
>
> Dave

I'll do that and prepare the next series; I hope to finish in 1-2 days.
Thanks!

Andrey

>> Andrey
>>
>>>> ---
>>>>    include/exec/memory.h  |   7 ++
>>>>    migration/ram.c        | 269 +++++++++++++++++++++++++++++++++++++++--
>>>>    migration/ram.h        |   2 +
>>>>    migration/trace-events |   2 +
>>>>    4 files changed, 272 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>>>> index b76b1256bf..1aa1c6a3f4 100644
>>>> --- a/include/exec/memory.h
>>>> +++ b/include/exec/memory.h
>>>> @@ -150,6 +150,13 @@ typedef struct IOMMUTLBEvent {
>>>>    #define RAM_PMEM (1 << 5)
>>>> +/*
>>>> + * UFFDIO_WRITEPROTECT is used on this RAMBlock to
>>>> + * support 'write-tracking' migration type.
>>>> + * Implies ram_state->ram_wt_enabled.
>>>> + */
>>>> +#define RAM_UF_WRITEPROTECT (1 << 6)
>>>> +
>>>>    static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
>>>>                                           IOMMUNotifierFlag flags,
>>>>                                           hwaddr start, hwaddr end,
>>>> diff --git a/migration/ram.c b/migration/ram.c
>>>> index ae8de17153..5707382db1 100644
>>>> --- a/migration/ram.c
>>>> +++ b/migration/ram.c
>>>> @@ -56,6 +56,11 @@
>>>>    #include "savevm.h"
>>>>    #include "qemu/iov.h"
>>>>    #include "multifd.h"
>>>> +#include "sysemu/runstate.h"
>>>> +
>>>> +#ifdef CONFIG_LINUX
>>>> +#include "qemu/userfaultfd.h"
>>>> +#endif
>>>>    /***********************************************************/
>>>>    /* ram save/restore */
>>>> @@ -298,6 +303,8 @@ struct RAMSrcPageRequest {
>>>>    struct RAMState {
>>>>        /* QEMUFile used for this migration */
>>>>        QEMUFile *f;
>>>> +    /* UFFD file descriptor, used in 'write-tracking' migration */
>>>> +    int uffdio_fd;
>>>>        /* Last block that we have visited searching for dirty pages */
>>>>        RAMBlock *last_seen_block;
>>>>        /* Last block from where we have sent data */
>>>> @@ -1434,6 +1441,40 @@ static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset)
>>>>        return block;
>>>>    }
>>>> +#ifdef CONFIG_LINUX
>>>> +/**
>>>> + * poll_fault_page: try to get next UFFD write fault page and, if pending fault
>>>> + *   is found, return RAM block pointer and page offset
>>>> + *
>>>> + * Returns pointer to the RAMBlock containing faulting page,
>>>> + *   NULL if no write faults are pending
>>>> + *
>>>> + * @rs: current RAM state
>>>> + * @offset: page offset from the beginning of the block
>>>> + */
>>>> +static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset)
>>>> +{
>>>> +    struct uffd_msg uffd_msg;
>>>> +    void *page_address;
>>>> +    RAMBlock *bs;
>>>> +    int res;
>>>> +
>>>> +    if (!migrate_background_snapshot()) {
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>> +    res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1);
>>>> +    if (res <= 0) {
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>> +    page_address = (void *) uffd_msg.arg.pagefault.address;
>>>> +    bs = qemu_ram_block_from_host(page_address, false, offset);
>>>> +    assert(bs && (bs->flags & RAM_UF_WRITEPROTECT) != 0);
>>>> +    return bs;
>>>> +}
>>>> +#endif /* CONFIG_LINUX */
>>>> +
>>>>    /**
>>>>     * get_queued_page: unqueue a page from the postcopy requests
>>>>     *
>>>> @@ -1473,6 +1514,16 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
>>>>        } while (block && !dirty);
>>>> +#ifdef CONFIG_LINUX
>>>> +    if (!block) {
>>>> +        /*
>>>> +         * Poll write faults too if background snapshot is enabled; that's
>>>> +         * when vCPUs may get blocked by write-protected pages.
>>>> +         */
>>>> +        block = poll_fault_page(rs, &offset);
>>>> +    }
>>>> +#endif /* CONFIG_LINUX */
>>>> +
>>>>        if (block) {
>>>>            /*
>>>>             * As soon as we start servicing pages out of order, then we have
>>>> @@ -1746,6 +1797,53 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
>>>>        return pages;
>>>>    }
>>>> +/**
>>>> + * ram_save_host_page_pre: ram_save_host_page() pre-notifier
>>>> + *
>>>> + * @rs: current RAM state
>>>> + * @pss: page-search-status structure
>>>> + * @opaque: pointer to receive opaque context value
>>>> + */
>>>> +static inline
>>>> +void ram_save_host_page_pre(RAMState *rs, PageSearchStatus *pss, void **opaque)
>>>> +{
>>>> +    *(unsigned long *) opaque = pss->page;
>>>> +}
>>>> +
>>>> +/**
>>>> + * ram_save_host_page_post: ram_save_host_page() post-notifier
>>>> + *
>>>> + * @rs: current RAM state
>>>> + * @pss: page-search-status structure
>>>> + * @opaque: opaque context value
>>>> + * @res_override: pointer to the return value of ram_save_host_page(),
>>>> + *   overwritten in case of an error
>>>> + */
>>>> +static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss,
>>>> +        void *opaque, int *res_override)
>>>> +{
>>>> +#ifdef CONFIG_LINUX
>>>> +    /* Check if page is from UFFD-managed region. */
>>>> +    if (pss->block->flags & RAM_UF_WRITEPROTECT) {
>>>> +        unsigned long page_from = (unsigned long) opaque;
>>>> +
>>>> +        void *page_address = pss->block->host + (page_from << TARGET_PAGE_BITS);
>>>> +        uint64_t run_length = (pss->page - page_from + 1) << TARGET_PAGE_BITS;
>>>> +        int res;
>>>> +
>>>> +        /* Flush async buffers before un-protect. */
>>>> +        qemu_fflush(rs->f);
>>>> +        /* Un-protect memory range. */
>>>> +        res = uffd_change_protection(rs->uffdio_fd, page_address, run_length,
>>>> +                false, false);
>>>> +        /* We don't want to override existing error from ram_save_host_page(). */
>>>> +        if (res < 0 && *res_override >= 0) {
>>>> +            *res_override = res;
>>>> +        }
>>>> +    }
>>>> +#endif /* CONFIG_LINUX */
>>>> +}
>>>> +
>>>>    /**
>>>>     * ram_find_and_save_block: finds a dirty page and sends it to f
>>>>     *
>>>> @@ -1790,7 +1888,11 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
>>>>            }
>>>>            if (found) {
>>>> +            void *opaque;
>>>> +
>>>> +            ram_save_host_page_pre(rs, &pss, &opaque);
>>>>                pages = ram_save_host_page(rs, &pss, last_stage);
>>>> +            ram_save_host_page_post(rs, &pss, opaque, &pages);
>>>>            }
>>>>        } while (!pages && again);
>>>> @@ -1880,10 +1982,13 @@ static void ram_save_cleanup(void *opaque)
>>>>        RAMState **rsp = opaque;
>>>>        RAMBlock *block;
>>>> -    /* caller have hold iothread lock or is in a bh, so there is
>>>> -     * no writing race against the migration bitmap
>>>> -     */
>>>> -    memory_global_dirty_log_stop();
>>>> +    /* We don't use dirty log with background snapshots */
>>>> +    if (!migrate_background_snapshot()) {
>>>> +        /* the caller holds the iothread lock or is in a bh, so there
>>>> +         * is no writing race against the migration bitmap
>>>> +         */
>>>> +        memory_global_dirty_log_stop();
>>>> +    }
>>>>        RAMBLOCK_FOREACH_NOT_IGNORED(block) {
>>>>            g_free(block->clear_bmap);
>>>> @@ -2343,8 +2448,11 @@ static void ram_init_bitmaps(RAMState *rs)
>>>>        WITH_RCU_READ_LOCK_GUARD() {
>>>>            ram_list_init_bitmaps();
>>>> -        memory_global_dirty_log_start();
>>>> -        migration_bitmap_sync_precopy(rs);
>>>> +        /* We don't use dirty log with background snapshots */
>>>> +        if (!migrate_background_snapshot()) {
>>>> +            memory_global_dirty_log_start();
>>>> +            migration_bitmap_sync_precopy(rs);
>>>> +        }
>>>>        }
>>>>        qemu_mutex_unlock_ramlist();
>>>>        qemu_mutex_unlock_iothread();
>>>> @@ -3794,7 +3902,14 @@ static int ram_resume_prepare(MigrationState *s, void *opaque)
>>>>     */
>>>>    bool ram_write_tracking_available(void)
>>>>    {
>>>> -    /* TODO: implement */
>>>> +#ifdef CONFIG_LINUX
>>>> +    uint64_t uffd_features;
>>>> +    int res;
>>>> +
>>>> +    res = uffd_query_features(&uffd_features);
>>>> +    return (res == 0 &&
>>>> +            (uffd_features & UFFD_FEATURE_PAGEFAULT_FLAG_WP) != 0);
>>>> +#endif
>>>>        return false;
>>>>    }
>>>> @@ -3805,10 +3920,148 @@ bool ram_write_tracking_available(void)
>>>>     */
>>>>    bool ram_write_tracking_compatible(void)
>>>>    {
>>>> -    /* TODO: implement */
>>>> +#ifdef CONFIG_LINUX
>>>> +    const uint64_t uffd_ioctls_mask = BIT(_UFFDIO_WRITEPROTECT);
>>>> +    int uffd_fd;
>>>> +    RAMBlock *bs;
>>>> +    bool ret = false;
>>>> +
>>>> +    /* Open UFFD file descriptor */
>>>> +    uffd_fd = uffd_create_fd(UFFD_FEATURE_PAGEFAULT_FLAG_WP, false);
>>>> +    if (uffd_fd < 0) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    RCU_READ_LOCK_GUARD();
>>>> +
>>>> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
>>>> +        uint64_t uffd_ioctls;
>>>> +
>>>> +        /* Nothing to do with read-only and MMIO-writable regions */
>>>> +        if (bs->mr->readonly || bs->mr->rom_device) {
>>>> +            continue;
>>>> +        }
>>>> +        /* Try to register block memory via UFFD-IO to track writes */
>>>> +        if (uffd_register_memory(uffd_fd, bs->host, bs->max_length,
>>>> +                UFFDIO_REGISTER_MODE_WP, &uffd_ioctls)) {
>>>> +            goto out;
>>>> +        }
>>>> +        if ((uffd_ioctls & uffd_ioctls_mask) != uffd_ioctls_mask) {
>>>> +            goto out;
>>>> +        }
>>>> +    }
>>>> +    ret = true;
>>>> +
>>>> +out:
>>>> +    uffd_close_fd(uffd_fd);
>>>> +    return ret;
>>>> +#endif
>>>>        return false;
>>>>    }
>>>> +/*
>>>> + * ram_write_tracking_start: start UFFD-WP memory tracking
>>>> + *
>>>> + * Returns 0 for success or negative value in case of error
>>>> + */
>>>> +int ram_write_tracking_start(void)
>>>> +{
>>>> +#ifdef CONFIG_LINUX
>>>> +    int uffd_fd;
>>>> +    RAMState *rs = ram_state;
>>>> +    RAMBlock *bs;
>>>> +
>>>> +    /* Open UFFD file descriptor */
>>>> +    uffd_fd = uffd_create_fd(UFFD_FEATURE_PAGEFAULT_FLAG_WP, true);
>>>> +    if (uffd_fd < 0) {
>>>> +        return uffd_fd;
>>>> +    }
>>>> +    rs->uffdio_fd = uffd_fd;
>>>> +
>>>> +    RCU_READ_LOCK_GUARD();
>>>> +
>>>> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
>>>> +        /* Nothing to do with read-only and MMIO-writable regions */
>>>> +        if (bs->mr->readonly || bs->mr->rom_device) {
>>>> +            continue;
>>>> +        }
>>>> +
>>>> +        /* Register block memory with UFFD to track writes */
>>>> +        if (uffd_register_memory(rs->uffdio_fd, bs->host,
>>>> +                bs->max_length, UFFDIO_REGISTER_MODE_WP, NULL)) {
>>>> +            goto fail;
>>>> +        }
>>>> +        /* Apply UFFD write protection to the block memory range */
>>>> +        if (uffd_change_protection(rs->uffdio_fd, bs->host,
>>>> +                bs->max_length, true, false)) {
>>>> +            goto fail;
>>>> +        }
>>>> +        bs->flags |= RAM_UF_WRITEPROTECT;
>>>> +        memory_region_ref(bs->mr);
>>>> +
>>>> +        trace_ram_write_tracking_ramblock_start(bs->idstr, bs->page_size,
>>>> +                bs->host, bs->max_length);
>>>> +    }
>>>> +
>>>> +    return 0;
>>>> +
>>>> +fail:
>>>> +    error_report("ram_write_tracking_start() failed: restoring initial memory state");
>>>> +
>>>> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
>>>> +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
>>>> +            continue;
>>>> +        }
>>>> +        /*
>>>> +         * In case some memory block failed to be write-protected
>>>> +         * remove protection and unregister all succeeded RAM blocks
>>>> +         */
>>>> +        uffd_change_protection(rs->uffdio_fd, bs->host, bs->max_length, false, false);
>>>> +        uffd_unregister_memory(rs->uffdio_fd, bs->host, bs->max_length);
>>>> +        /* Cleanup flags and remove reference */
>>>> +        bs->flags &= ~RAM_UF_WRITEPROTECT;
>>>> +        memory_region_unref(bs->mr);
>>>> +    }
>>>> +
>>>> +    uffd_close_fd(uffd_fd);
>>>> +    rs->uffdio_fd = -1;
>>>> +    return -1;
>>>> +#else
>>>> +    return -1;
>>>> +#endif /* CONFIG_LINUX */
>>>> +}
>>>> +
>>>> +/**
>>>> + * ram_write_tracking_stop: stop UFFD-WP memory tracking and remove protection
>>>> + */
>>>> +void ram_write_tracking_stop(void)
>>>> +{
>>>> +#ifdef CONFIG_LINUX
>>>> +    RAMState *rs = ram_state;
>>>> +    RAMBlock *bs;
>>>> +
>>>> +    RCU_READ_LOCK_GUARD();
>>>> +
>>>> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
>>>> +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
>>>> +            continue;
>>>> +        }
>>>> +        /* Remove protection and unregister all affected RAM blocks */
>>>> +        uffd_change_protection(rs->uffdio_fd, bs->host, bs->max_length, false, false);
>>>> +        uffd_unregister_memory(rs->uffdio_fd, bs->host, bs->max_length);
>>>> +
>>>> +        trace_ram_write_tracking_ramblock_stop(bs->idstr, bs->page_size,
>>>> +                bs->host, bs->max_length);
>>>> +
>>>> +        /* Cleanup flags and remove reference */
>>>> +        bs->flags &= ~RAM_UF_WRITEPROTECT;
>>>> +        memory_region_unref(bs->mr);
>>>> +    }
>>>> +
>>>> +    /* Finally close UFFD file descriptor */
>>>> +    uffd_close_fd(rs->uffdio_fd);
>>>> +    rs->uffdio_fd = -1;
>>>> +#endif /* CONFIG_LINUX */
>>>> +}
>>>> +
>>>>    static SaveVMHandlers savevm_ram_handlers = {
>>>>        .save_setup = ram_save_setup,
>>>>        .save_live_iterate = ram_save_iterate,
>>>> diff --git a/migration/ram.h b/migration/ram.h
>>>> index 1a9ff90304..c25540cb93 100644
>>>> --- a/migration/ram.h
>>>> +++ b/migration/ram.h
>>>> @@ -82,5 +82,7 @@ void colo_incoming_start_dirty_log(void);
>>>>    /* Background snapshot */
>>>>    bool ram_write_tracking_available(void);
>>>>    bool ram_write_tracking_compatible(void);
>>>> +int ram_write_tracking_start(void);
>>>> +void ram_write_tracking_stop(void);
>>>>    #endif
>>>> diff --git a/migration/trace-events b/migration/trace-events
>>>> index 75de5004ac..668c562fed 100644
>>>> --- a/migration/trace-events
>>>> +++ b/migration/trace-events
>>>> @@ -111,6 +111,8 @@ save_xbzrle_page_skipping(void) ""
>>>>    save_xbzrle_page_overflow(void) ""
>>>>    ram_save_iterate_big_wait(uint64_t milliconds, int iterations) "big wait: %" PRIu64 " milliseconds, %d iterations"
>>>>    ram_load_complete(int ret, uint64_t seq_iter) "exit_code %d seq iteration %" PRIu64
>>>> +ram_write_tracking_ramblock_start(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
>>>> +ram_write_tracking_ramblock_stop(const char *block_id, size_t page_size, void *addr, size_t length) "%s: page_size: %zu addr: %p length: %zu"
>>>>    # multifd.c
>>>>    multifd_new_send_channel_async(uint8_t id) "channel %d"
>>>> -- 
>>>> 2.25.1
>>>>
>> -- 
>> Andrey Gruzdev, Principal Engineer
>> Virtuozzo GmbH  +7-903-247-6397
>>                  virtuzzo.com
>>

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com




* Re: [PATCH v11 5/5] migration: introduce 'userfaultfd-wrlat.py' script
  2021-01-19 21:01   ` Peter Xu
@ 2021-01-21 13:12     ` Andrey Gruzdev
  2021-01-21 15:37       ` Peter Xu
  0 siblings, 1 reply; 24+ messages in thread
From: Andrey Gruzdev @ 2021-01-21 13:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev


On 20.01.2021 00:01, Peter Xu wrote:
> On Wed, Jan 06, 2021 at 06:21:20PM +0300, Andrey Gruzdev wrote:
>> Add BCC/eBPF script to analyze userfaultfd write fault latency distribution.
>>
>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
>> Acked-by: Peter Xu <peterx@redhat.com>
> (This seems to be the last patch that lacks a r-b ... Let's see whether I could
>   convert my a-b into an r-b... :)
>
>> +BPF_HASH(ev_start, struct ev_desc, u64);
>> +BPF_HASH(ctx_handle_userfault, u64, u64);
> IMHO we only need one hash here instead of two:
>
>    BPF_HASH(ev_start, u32, u64);
>
> Where we use the tid as the key (u32), and timestamp as the value (u64).  The
> thing is we don't really need the address for current statistics, IMHO.

Agree, that's a more appropriate way to do that. The address here is not really needed.
Thanks!

>> +/* KPROBE for handle_userfault(). */
>> +int probe_handle_userfault(struct pt_regs *ctx, struct vm_fault *vmf,
>> +        unsigned long reason)
>> +{
>> +    /* Trace only UFFD write faults. */
>> +    if (reason & VM_UFFD_WP) {
> Better with comment:
>
>             /* Using "(u32)" to drop group ID which is upper 32 bits */

Yep.

>
> If even better, we'd want a get_current_tid() helper and call it here and below
> (bpf_get_current_pid_tgid() will return tid|gid<<32 I think, so I'm a bit
> confused why bcc people called it pid at the first place...).
>
>> +        u64 pid = (u32) bpf_get_current_pid_tgid();
>> +        u64 addr = vmf->address;
>> +
>> +        do_event_start(pid, addr);
>> +        ctx_handle_userfault.update(&pid, &addr);
>> +    }
>> +    return 0;
>> +}
>> +
>> +/* KRETPROBE for handle_userfault(). */
>> +int retprobe_handle_userfault(struct pt_regs *ctx)
>> +{
>> +    u64 pid = (u32) bpf_get_current_pid_tgid();
>> +    u64 *addr_p;
>> +
>> +    /*
>> +     * Here we just ignore the return value. In case of spurious wakeup
>> +     * or pending signal we'll still get (at least for v5.8.0 kernel)
>> +     * VM_FAULT_RETRY or (VM_FAULT_RETRY | VM_FAULT_MAJOR) here.
>> +     * Anyhow, handle_userfault() would be re-entered if such case happens,
>> +     * keeping initial timestamp unchanged for the faulting thread.
> AFAIU this comment is not matching what the code does.  But I agree it's not a
> big problem because we won't miss any long delays (because the one long delayed
> sample will just be split into two or multiple delays, which will still be
> reflected in the histogram at last).  Or am I wrong?

Mm, not really sure about the comment... I need to read the kernel code again.

>> +     */
>> +    addr_p = ctx_handle_userfault.lookup(&pid);
>> +    if (addr_p) {
>> +        do_event_end(pid, *addr_p);
>> +        ctx_handle_userfault.delete(&pid);
>> +    }
>> +    return 0;
>> +}
>> +"""
> Other than that, the rest looks good to me.
>
> I'd think it's fine to even merge the current version since it actually works
> nicely.  Andrey, if you agree with any of my above comments, feel free to
> repost this patch (since I see Dave provided the rest r-bs).  Then I think I
> can r-b this one too.  Thanks!
>
Thanks!

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com




* Re: [PATCH v11 5/5] migration: introduce 'userfaultfd-wrlat.py' script
  2021-01-21 13:12     ` Andrey Gruzdev
@ 2021-01-21 15:37       ` Peter Xu
  2021-01-21 17:15         ` Andrey Gruzdev
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2021-01-21 15:37 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Thu, Jan 21, 2021 at 04:12:23PM +0300, Andrey Gruzdev wrote:
> > > +/* KRETPROBE for handle_userfault(). */
> > > +int retprobe_handle_userfault(struct pt_regs *ctx)
> > > +{
> > > +    u64 pid = (u32) bpf_get_current_pid_tgid();
> > > +    u64 *addr_p;
> > > +
> > > +    /*
> > > +     * Here we just ignore the return value. In case of spurious wakeup
> > > +     * or pending signal we'll still get (at least for v5.8.0 kernel)
> > > +     * VM_FAULT_RETRY or (VM_FAULT_RETRY | VM_FAULT_MAJOR) here.
> > > +     * Anyhow, handle_userfault() would be re-entered if such case happens,
> > > +     * keeping initial timestamp unchanged for the faulting thread.
> > AFAIU this comment is not matching what the code does.  But I agree it's not a
> > big problem because we won't miss any long delays (because the one long delayed
> > sample will just be split into two or multiple delays, which will still be
> > reflected in the histogram at last).  Or am I wrong?
> 
> Mm, not really sure about comment.. I need to read kernel code again.

Not relevant to the kernel; I was only talking about the last sentence, where we
won't be "keeping initial timestamp unchanged" but will do the statistics anyway.
Because, exactly as you said, we'll get VM_FAULT_RETRY unconditionally, so we
won't be able to identify whether the page fault request is resolved or not.

-- 
Peter Xu




* Re: [PATCH v11 4/5] migration: implementation of background snapshot thread
  2021-01-21 11:08         ` Andrey Gruzdev
@ 2021-01-21 16:11           ` Dr. David Alan Gilbert
  2021-01-21 17:28             ` Andrey Gruzdev
  0 siblings, 1 reply; 24+ messages in thread
From: Dr. David Alan Gilbert @ 2021-01-21 16:11 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, qemu-devel, Peter Xu, Markus Armbruster,
	Paolo Bonzini, Den Lunev

* Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> On 21.01.2021 12:56, Dr. David Alan Gilbert wrote:
> > * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> > > On 19.01.2021 21:49, Dr. David Alan Gilbert wrote:
> > > > * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> > > > > Introducing implementation of 'background' snapshot thread
> > > > > which in overall follows the logic of precopy migration
> > > > > while internally utilizes completely different mechanism
> > > > > to 'freeze' vmstate at the start of snapshot creation.
> > > > > 
> > > > > This mechanism is based on userfault_fd with wr-protection
> > > > > support and is Linux-specific.
> > > > I noticed there weren't any bdrv_ calls in here; I guess with a snapshot
> > > > you still have the source running so still have it accessing the disk;
> > > > do you do anything to try and wire the ram snapshotting up to disk
> > > > snapshotting?
> > > Block-related manipulations should be done externally, I think.
> > > So create backing images for RW nodes, then stop VM, switch block graph
> > > and start background snapshot. Something like create 'virsh snapshot-create-as'
> > > does, but in other sequence.
> > If you get a chance it would be great if you could put together an
> > example of doing the combination RAM+block; that way we find out if there's
> > anything silly missing.
> > 
> > Dave
> 
> Yep, I'll take a look at the QMP command sequences, how it should look
> like in our case and prepare an example, hope we are not missing something serious.
> At least we know that block setup data won't go to snapshot.
> I've also checked starting background snapshot from the stopped VM state -
> looks OK, VM resumes operation, snapshot is saved, no apparent problems.
> 
> Maybe it will take some time, since now I'm on task to create tool to store
> snapshots with RAM indexable by GPFNs, together with the rest of VMSTATE.

If you want to make it indexable, why not just do a simple write(2) call
for the whole of RAM rather than doing the thing like normal migration?

Dave

> Based on QCOW2 format. Also it should support snapshot revert in postcopy mode.
> 
> Andrey
> 
> > > //
> > > 
> > > > > Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> > > > > Acked-by: Peter Xu <peterx@redhat.com>
> > > > > ---
> > > > >    migration/migration.c | 255 +++++++++++++++++++++++++++++++++++++++++-
> > > > >    migration/migration.h |   3 +
> > > > >    migration/ram.c       |   2 +
> > > > >    migration/savevm.c    |   1 -
> > > > >    migration/savevm.h    |   2 +
> > > > >    5 files changed, 260 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > > index 2c2cb9ef01..0901a15ac5 100644
> > > > <snip>
> > > > 
> > > > > -    qemu_thread_create(&s->thread, "live_migration", migration_thread, s,
> > > > > -                       QEMU_THREAD_JOINABLE);
> > > > > +
> > > > > +    if (migrate_background_snapshot()) {
> > > > > +        qemu_thread_create(&s->thread, "background_snapshot",
> > > > Unfortunately that wont work - there's a 14 character limit on
> > > > the thread name length; I guess we just shorten that to bg_snapshot
> > > Yep, missed that pthread_set_name_np() has a length limit)
> > > 
> > > > Other than that,
> > > > 
> > > > 
> > > > 
> > > > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > > 
> > > > > +                bg_migration_thread, s, QEMU_THREAD_JOINABLE);
> > > > > +    } else {
> > > > > +        qemu_thread_create(&s->thread, "live_migration",
> > > > > +                migration_thread, s, QEMU_THREAD_JOINABLE);
> > > > > +    }
> > > > >        s->migration_thread_running = true;
> > > > >    }
> > > > > diff --git a/migration/migration.h b/migration/migration.h
> > > > > index f40338cfbf..0723955cd7 100644
> > > > > --- a/migration/migration.h
> > > > > +++ b/migration/migration.h
> > > > > @@ -20,6 +20,7 @@
> > > > >    #include "qemu/thread.h"
> > > > >    #include "qemu/coroutine_int.h"
> > > > >    #include "io/channel.h"
> > > > > +#include "io/channel-buffer.h"
> > > > >    #include "net/announce.h"
> > > > >    #include "qom/object.h"
> > > > > @@ -147,8 +148,10 @@ struct MigrationState {
> > > > >        /*< public >*/
> > > > >        QemuThread thread;
> > > > > +    QEMUBH *vm_start_bh;
> > > > >        QEMUBH *cleanup_bh;
> > > > >        QEMUFile *to_dst_file;
> > > > > +    QIOChannelBuffer *bioc;
> > > > >        /*
> > > > >         * Protects to_dst_file pointer.  We need to make sure we won't
> > > > >         * yield or hang during the critical section, since this lock will
> > > > > diff --git a/migration/ram.c b/migration/ram.c
> > > > > index 5707382db1..05fe0c8592 100644
> > > > > --- a/migration/ram.c
> > > > > +++ b/migration/ram.c
> > > > > @@ -1471,6 +1471,7 @@ static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset)
> > > > >        page_address = (void *) uffd_msg.arg.pagefault.address;
> > > > >        bs = qemu_ram_block_from_host(page_address, false, offset);
> > > > >        assert(bs && (bs->flags & RAM_UF_WRITEPROTECT) != 0);
> > > > > +
> > > > >        return bs;
> > > > >    }
> > > > >    #endif /* CONFIG_LINUX */
> > > > > @@ -1836,6 +1837,7 @@ static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss,
> > > > >            /* Un-protect memory range. */
> > > > >            res = uffd_change_protection(rs->uffdio_fd, page_address, run_length,
> > > > >                    false, false);
> > > > > +
> > > > >            /* We don't want to override existing error from ram_save_host_page(). */
> > > > >            if (res < 0 && *res_override >= 0) {
> > > > >                *res_override = res;
> > > > > diff --git a/migration/savevm.c b/migration/savevm.c
> > > > > index 27e842812e..dd4ad0aaaf 100644
> > > > > --- a/migration/savevm.c
> > > > > +++ b/migration/savevm.c
> > > > > @@ -1354,7 +1354,6 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
> > > > >        return 0;
> > > > >    }
> > > > > -static
> > > > >    int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
> > > > >                                                        bool in_postcopy,
> > > > >                                                        bool inactivate_disks)
> > > > > diff --git a/migration/savevm.h b/migration/savevm.h
> > > > > index ba64a7e271..aaee2528ed 100644
> > > > > --- a/migration/savevm.h
> > > > > +++ b/migration/savevm.h
> > > > > @@ -64,5 +64,7 @@ int qemu_loadvm_state(QEMUFile *f);
> > > > >    void qemu_loadvm_state_cleanup(void);
> > > > >    int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
> > > > >    int qemu_load_device_state(QEMUFile *f);
> > > > > +int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
> > > > > +        bool in_postcopy, bool inactivate_disks);
> > > > >    #endif
> > > > > -- 
> > > > > 2.25.1
> > > > > 
> > > -- 
> > > Andrey Gruzdev, Principal Engineer
> > > Virtuozzo GmbH  +7-903-247-6397
> > >                  virtuzzo.com
> > > 
> -- 
> Andrey Gruzdev, Principal Engineer
> Virtuozzo GmbH  +7-903-247-6397
>                 virtuzzo.com
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v11 5/5] migration: introduce 'userfaultfd-wrlat.py' script
  2021-01-21 15:37       ` Peter Xu
@ 2021-01-21 17:15         ` Andrey Gruzdev
  0 siblings, 0 replies; 24+ messages in thread
From: Andrey Gruzdev @ 2021-01-21 17:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev


On 21.01.2021 18:37, Peter Xu wrote:
> On Thu, Jan 21, 2021 at 04:12:23PM +0300, Andrey Gruzdev wrote:
>>>> +/* KRETPROBE for handle_userfault(). */
>>>> +int retprobe_handle_userfault(struct pt_regs *ctx)
>>>> +{
>>>> +    u64 pid = (u32) bpf_get_current_pid_tgid();
>>>> +    u64 *addr_p;
>>>> +
>>>> +    /*
>>>> +     * Here we just ignore the return value. In case of spurious wakeup
>>>> +     * or pending signal we'll still get (at least for v5.8.0 kernel)
>>>> +     * VM_FAULT_RETRY or (VM_FAULT_RETRY | VM_FAULT_MAJOR) here.
>>>> +     * Anyhow, handle_userfault() would be re-entered if such case happens,
>>>> +     * keeping initial timestamp unchanged for the faulting thread.
>>> AFAIU this comment is not matching what the code does.  But I agree it's not a
>>> big problem because we won't miss any long delays (because the one long delayed
>>> sample will just be split into two or multiple delays, which will still be
>>> reflected in the histogram at last).  Or am I wrong?
>> Mm, not really sure about comment.. I need to read kernel code again.
> Not relevant to kernel; I was only talking about the last sentence where we
> won't "keeping initial timestamp unchanged" but we'll do the statistic anyways.
> Because exactly as you said we'll get VM_FAULT_RETRY unconditionally while we
> won't be able to identify whether the page fault request is resolved or not.
>
Yep, agree. My point is also that trying to get timestamp of real PF resolution is
complicated and not very reasonable since a sequence of softbit and hardbit modifications
to the in-memory paging structures occur, not immediately reflected in particular TLB entry.
But for our statistics this level of accuracy is OK, I think.

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com




* Re: [PATCH v11 4/5] migration: implementation of background snapshot thread
  2021-01-21 16:11           ` Dr. David Alan Gilbert
@ 2021-01-21 17:28             ` Andrey Gruzdev
  2021-01-21 17:48               ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 24+ messages in thread
From: Andrey Gruzdev @ 2021-01-21 17:28 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Juan Quintela, qemu-devel, Peter Xu, Markus Armbruster,
	Paolo Bonzini, Den Lunev


On 21.01.2021 19:11, Dr. David Alan Gilbert wrote:
> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>> On 21.01.2021 12:56, Dr. David Alan Gilbert wrote:
>>> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>>>> On 19.01.2021 21:49, Dr. David Alan Gilbert wrote:
>>>>> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>>>>>> Introducing implementation of 'background' snapshot thread
>>>>>> which in overall follows the logic of precopy migration
>>>>>> while internally utilizes completely different mechanism
>>>>>> to 'freeze' vmstate at the start of snapshot creation.
>>>>>>
>>>>>> This mechanism is based on userfault_fd with wr-protection
>>>>>> support and is Linux-specific.
>>>>> I noticed there weren't any bdrv_ calls in here; I guess with a snapshot
>>>>> you still have the source running so still have it accessing the disk;
>>>>> do you do anything to try and wire the ram snapshotting up to disk
>>>>> snapshotting?
>>>> Block-related manipulations should be done externally, I think.
>>>> So create backing images for RW nodes, then stop VM, switch block graph
>>>> and start background snapshot. Something like create 'virsh snapshot-create-as'
>>>> does, but in other sequence.
>>> If you get a chance it would be great if you could put together an
>>> example of doing the combination RAM+block; that way we find out if there's
>>> anything silly missing.
>>>
>>> Dave
>> Yep, I'll take a look at the QMP command sequences, how it should look
>> like in our case and prepare an example, hope we are not missing something serious.
>> At least we know that block setup data won't go to snapshot.
>> I've also checked starting background snapshot from the stopped VM state -
>> looks OK, VM resumes operation, snapshot is saved, no apparent problems.
>>
>> Maybe it will take some time, since now I'm on task to create tool to store
>> snapshots with RAM indexable by GPFNs, together with the rest of VMSTATE.
> If you want to make it indexable, why not just do a simple write(2) call
> for the whole of RAM rather than doing the thing like normal migration?
>
> Dave

For me the main reason is the apparent file size. While we can get the same allocation
size when saving via write(2) on Linux, in many cases the apparent file size will
be much bigger than if we use QCOW2.

Andrey

>> Based on QCOW2 format. Also it should support snapshot revert in postcopy mode.
>>
>> Andrey
>>
>>>> //
>>>>
>>>>>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
>>>>>> Acked-by: Peter Xu <peterx@redhat.com>
>>>>>> ---
>>>>>>     migration/migration.c | 255 +++++++++++++++++++++++++++++++++++++++++-
>>>>>>     migration/migration.h |   3 +
>>>>>>     migration/ram.c       |   2 +
>>>>>>     migration/savevm.c    |   1 -
>>>>>>     migration/savevm.h    |   2 +
>>>>>>     5 files changed, 260 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>>>> index 2c2cb9ef01..0901a15ac5 100644
>>>>> <snip>
>>>>>
>>>>>> -    qemu_thread_create(&s->thread, "live_migration", migration_thread, s,
>>>>>> -                       QEMU_THREAD_JOINABLE);
>>>>>> +
>>>>>> +    if (migrate_background_snapshot()) {
>>>>>> +        qemu_thread_create(&s->thread, "background_snapshot",
>>>>> Unfortunately that wont work - there's a 14 character limit on
>>>>> the thread name length; I guess we just shorten that to bg_snapshot
>>>> Yep, missed that pthread_set_name_np() has a length limit)
>>>>
>>>>> Other than that,
>>>>>
>>>>>
>>>>>
>>>>> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>>>>>
>>>>>> +                bg_migration_thread, s, QEMU_THREAD_JOINABLE);
>>>>>> +    } else {
>>>>>> +        qemu_thread_create(&s->thread, "live_migration",
>>>>>> +                migration_thread, s, QEMU_THREAD_JOINABLE);
>>>>>> +    }
>>>>>>         s->migration_thread_running = true;
>>>>>>     }
>>>>>> diff --git a/migration/migration.h b/migration/migration.h
>>>>>> index f40338cfbf..0723955cd7 100644
>>>>>> --- a/migration/migration.h
>>>>>> +++ b/migration/migration.h
>>>>>> @@ -20,6 +20,7 @@
>>>>>>     #include "qemu/thread.h"
>>>>>>     #include "qemu/coroutine_int.h"
>>>>>>     #include "io/channel.h"
>>>>>> +#include "io/channel-buffer.h"
>>>>>>     #include "net/announce.h"
>>>>>>     #include "qom/object.h"
>>>>>> @@ -147,8 +148,10 @@ struct MigrationState {
>>>>>>         /*< public >*/
>>>>>>         QemuThread thread;
>>>>>> +    QEMUBH *vm_start_bh;
>>>>>>         QEMUBH *cleanup_bh;
>>>>>>         QEMUFile *to_dst_file;
>>>>>> +    QIOChannelBuffer *bioc;
>>>>>>         /*
>>>>>>          * Protects to_dst_file pointer.  We need to make sure we won't
>>>>>>          * yield or hang during the critical section, since this lock will
>>>>>> diff --git a/migration/ram.c b/migration/ram.c
>>>>>> index 5707382db1..05fe0c8592 100644
>>>>>> --- a/migration/ram.c
>>>>>> +++ b/migration/ram.c
>>>>>> @@ -1471,6 +1471,7 @@ static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset)
>>>>>>         page_address = (void *) uffd_msg.arg.pagefault.address;
>>>>>>         bs = qemu_ram_block_from_host(page_address, false, offset);
>>>>>>         assert(bs && (bs->flags & RAM_UF_WRITEPROTECT) != 0);
>>>>>> +
>>>>>>         return bs;
>>>>>>     }
>>>>>>     #endif /* CONFIG_LINUX */
>>>>>> @@ -1836,6 +1837,7 @@ static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss,
>>>>>>             /* Un-protect memory range. */
>>>>>>             res = uffd_change_protection(rs->uffdio_fd, page_address, run_length,
>>>>>>                     false, false);
>>>>>> +
>>>>>>             /* We don't want to override existing error from ram_save_host_page(). */
>>>>>>             if (res < 0 && *res_override >= 0) {
>>>>>>                 *res_override = res;
>>>>>> diff --git a/migration/savevm.c b/migration/savevm.c
>>>>>> index 27e842812e..dd4ad0aaaf 100644
>>>>>> --- a/migration/savevm.c
>>>>>> +++ b/migration/savevm.c
>>>>>> @@ -1354,7 +1354,6 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
>>>>>>         return 0;
>>>>>>     }
>>>>>> -static
>>>>>>     int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
>>>>>>                                                         bool in_postcopy,
>>>>>>                                                         bool inactivate_disks)
>>>>>> diff --git a/migration/savevm.h b/migration/savevm.h
>>>>>> index ba64a7e271..aaee2528ed 100644
>>>>>> --- a/migration/savevm.h
>>>>>> +++ b/migration/savevm.h
>>>>>> @@ -64,5 +64,7 @@ int qemu_loadvm_state(QEMUFile *f);
>>>>>>     void qemu_loadvm_state_cleanup(void);
>>>>>>     int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
>>>>>>     int qemu_load_device_state(QEMUFile *f);
>>>>>> +int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
>>>>>> +        bool in_postcopy, bool inactivate_disks);
>>>>>>     #endif
>>>>>> -- 
>>>>>> 2.25.1
>>>>>>
>>>> -- 
>>>> Andrey Gruzdev, Principal Engineer
>>>> Virtuozzo GmbH  +7-903-247-6397
>>>>                   virtuzzo.com
>>>>
>> -- 
>> Andrey Gruzdev, Principal Engineer
>> Virtuozzo GmbH  +7-903-247-6397
>>                  virtuzzo.com
>>

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com




* Re: [PATCH v11 4/5] migration: implementation of background snapshot thread
  2021-01-21 17:28             ` Andrey Gruzdev
@ 2021-01-21 17:48               ` Dr. David Alan Gilbert
  2021-01-21 18:29                 ` Andrey Gruzdev
  0 siblings, 1 reply; 24+ messages in thread
From: Dr. David Alan Gilbert @ 2021-01-21 17:48 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, qemu-devel, Peter Xu, Markus Armbruster,
	Paolo Bonzini, Den Lunev

* Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> On 21.01.2021 19:11, Dr. David Alan Gilbert wrote:
> > * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> > > On 21.01.2021 12:56, Dr. David Alan Gilbert wrote:
> > > > * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> > > > > On 19.01.2021 21:49, Dr. David Alan Gilbert wrote:
> > > > > > * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> > > > > > > Introducing implementation of 'background' snapshot thread
> > > > > > > which in overall follows the logic of precopy migration
> > > > > > > while internally utilizes completely different mechanism
> > > > > > > to 'freeze' vmstate at the start of snapshot creation.
> > > > > > > 
> > > > > > > This mechanism is based on userfault_fd with wr-protection
> > > > > > > support and is Linux-specific.
> > > > > > I noticed there weren't any bdrv_ calls in here; I guess with a snapshot
> > > > > > you still have the source running so still have it accessing the disk;
> > > > > > do you do anything to try and wire the ram snapshotting up to disk
> > > > > > snapshotting?
> > > > > Block-related manipulations should be done externally, I think.
> > > > > So create backing images for RW nodes, then stop VM, switch block graph
> > > > > and start background snapshot. Something like create 'virsh snapshot-create-as'
> > > > > does, but in other sequence.
> > > > If you get a chance it would be great if you could put together an
> > > > example of doing the combination RAM+block; that way we find out if there's
> > > > anything silly missing.
> > > > 
> > > > Dave
> > > Yep, I'll take a look at the QMP command sequences, how it should look
> > > like in our case and prepare an example, hope we are not missing something serious.
> > > At least we know that block setup data won't go to snapshot.
> > > I've also checked starting background snapshot from the stopped VM state -
> > > looks OK, VM resumes operation, snapshot is saved, no apparent problems.
> > > 
> > > Maybe it will take some time, since now I'm on task to create tool to store
> > > snapshots with RAM indexable by GPFNs, together with the rest of VMSTATE.
> > If you want to make it indexable, why not just do a simple write(2) call
> > for the whole of RAM rather than doing the thing like normal migration?
> > 
> > Dave
> 
> For me the main reason is apparent file size.. While we can get the same allocation
> size when saving via write(2) on Linux, in many cases the apparent file size will
> be much bigger then if use QCOW2.

Do you mean because of zero pages or for some other reason?

Dave

> Andrey
> 
> > > Based on QCOW2 format. Also it should support snapshot revert in postcopy mode.
> > > 
> > > Andrey
> > > 
> > > > > //
> > > > > 
> > > > > > > Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> > > > > > > Acked-by: Peter Xu <peterx@redhat.com>
> > > > > > > ---
> > > > > > >     migration/migration.c | 255 +++++++++++++++++++++++++++++++++++++++++-
> > > > > > >     migration/migration.h |   3 +
> > > > > > >     migration/ram.c       |   2 +
> > > > > > >     migration/savevm.c    |   1 -
> > > > > > >     migration/savevm.h    |   2 +
> > > > > > >     5 files changed, 260 insertions(+), 3 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > > > > index 2c2cb9ef01..0901a15ac5 100644
> > > > > > <snip>
> > > > > > 
> > > > > > > -    qemu_thread_create(&s->thread, "live_migration", migration_thread, s,
> > > > > > > -                       QEMU_THREAD_JOINABLE);
> > > > > > > +
> > > > > > > +    if (migrate_background_snapshot()) {
> > > > > > > +        qemu_thread_create(&s->thread, "background_snapshot",
> > > > > > Unfortunately that wont work - there's a 14 character limit on
> > > > > > the thread name length; I guess we just shorten that to bg_snapshot
> > > > > Yep, missed that pthread_set_name_np() has a length limit)
> > > > > 
> > > > > > Other than that,
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > > > > 
> > > > > > > +                bg_migration_thread, s, QEMU_THREAD_JOINABLE);
> > > > > > > +    } else {
> > > > > > > +        qemu_thread_create(&s->thread, "live_migration",
> > > > > > > +                migration_thread, s, QEMU_THREAD_JOINABLE);
> > > > > > > +    }
> > > > > > >         s->migration_thread_running = true;
> > > > > > >     }
> > > > > > > diff --git a/migration/migration.h b/migration/migration.h
> > > > > > > index f40338cfbf..0723955cd7 100644
> > > > > > > --- a/migration/migration.h
> > > > > > > +++ b/migration/migration.h
> > > > > > > @@ -20,6 +20,7 @@
> > > > > > >     #include "qemu/thread.h"
> > > > > > >     #include "qemu/coroutine_int.h"
> > > > > > >     #include "io/channel.h"
> > > > > > > +#include "io/channel-buffer.h"
> > > > > > >     #include "net/announce.h"
> > > > > > >     #include "qom/object.h"
> > > > > > > @@ -147,8 +148,10 @@ struct MigrationState {
> > > > > > >         /*< public >*/
> > > > > > >         QemuThread thread;
> > > > > > > +    QEMUBH *vm_start_bh;
> > > > > > >         QEMUBH *cleanup_bh;
> > > > > > >         QEMUFile *to_dst_file;
> > > > > > > +    QIOChannelBuffer *bioc;
> > > > > > >         /*
> > > > > > >          * Protects to_dst_file pointer.  We need to make sure we won't
> > > > > > >          * yield or hang during the critical section, since this lock will
> > > > > > > diff --git a/migration/ram.c b/migration/ram.c
> > > > > > > index 5707382db1..05fe0c8592 100644
> > > > > > > --- a/migration/ram.c
> > > > > > > +++ b/migration/ram.c
> > > > > > > @@ -1471,6 +1471,7 @@ static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset)
> > > > > > >         page_address = (void *) uffd_msg.arg.pagefault.address;
> > > > > > >         bs = qemu_ram_block_from_host(page_address, false, offset);
> > > > > > >         assert(bs && (bs->flags & RAM_UF_WRITEPROTECT) != 0);
> > > > > > > +
> > > > > > >         return bs;
> > > > > > >     }
> > > > > > >     #endif /* CONFIG_LINUX */
> > > > > > > @@ -1836,6 +1837,7 @@ static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss,
> > > > > > >             /* Un-protect memory range. */
> > > > > > >             res = uffd_change_protection(rs->uffdio_fd, page_address, run_length,
> > > > > > >                     false, false);
> > > > > > > +
> > > > > > >             /* We don't want to override existing error from ram_save_host_page(). */
> > > > > > >             if (res < 0 && *res_override >= 0) {
> > > > > > >                 *res_override = res;
> > > > > > > diff --git a/migration/savevm.c b/migration/savevm.c
> > > > > > > index 27e842812e..dd4ad0aaaf 100644
> > > > > > > --- a/migration/savevm.c
> > > > > > > +++ b/migration/savevm.c
> > > > > > > @@ -1354,7 +1354,6 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
> > > > > > >         return 0;
> > > > > > >     }
> > > > > > > -static
> > > > > > >     int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
> > > > > > >                                                         bool in_postcopy,
> > > > > > >                                                         bool inactivate_disks)
> > > > > > > diff --git a/migration/savevm.h b/migration/savevm.h
> > > > > > > index ba64a7e271..aaee2528ed 100644
> > > > > > > --- a/migration/savevm.h
> > > > > > > +++ b/migration/savevm.h
> > > > > > > @@ -64,5 +64,7 @@ int qemu_loadvm_state(QEMUFile *f);
> > > > > > >     void qemu_loadvm_state_cleanup(void);
> > > > > > >     int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
> > > > > > >     int qemu_load_device_state(QEMUFile *f);
> > > > > > > +int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
> > > > > > > +        bool in_postcopy, bool inactivate_disks);
> > > > > > >     #endif
> > > > > > > -- 
> > > > > > > 2.25.1
> > > > > > > 
> > > > > -- 
> > > > > Andrey Gruzdev, Principal Engineer
> > > > > Virtuozzo GmbH  +7-903-247-6397
> > > > >                   virtuzzo.com
> > > > > 
> > > -- 
> > > Andrey Gruzdev, Principal Engineer
> > > Virtuozzo GmbH  +7-903-247-6397
> > >                  virtuzzo.com
> > > 
> 
> -- 
> Andrey Gruzdev, Principal Engineer
> Virtuozzo GmbH  +7-903-247-6397
>                 virtuzzo.com
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v11 4/5] migration: implementation of background snapshot thread
  2021-01-21 17:48               ` Dr. David Alan Gilbert
@ 2021-01-21 18:29                 ` Andrey Gruzdev
  0 siblings, 0 replies; 24+ messages in thread
From: Andrey Gruzdev @ 2021-01-21 18:29 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Juan Quintela, qemu-devel, Peter Xu, Markus Armbruster,
	Paolo Bonzini, Den Lunev


On 21.01.2021 20:48, Dr. David Alan Gilbert wrote:
> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>> On 21.01.2021 19:11, Dr. David Alan Gilbert wrote:
>>> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>>>> On 21.01.2021 12:56, Dr. David Alan Gilbert wrote:
>>>>> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>>>>>> On 19.01.2021 21:49, Dr. David Alan Gilbert wrote:
>>>>>>> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>>>>>>>> Introduce the implementation of the 'background' snapshot thread,
>>>>>>>> which overall follows the logic of precopy migration while
>>>>>>>> internally utilizing a completely different mechanism to 'freeze'
>>>>>>>> vmstate at the start of snapshot creation.
>>>>>>>>
>>>>>>>> This mechanism is based on userfaultfd with write-protection
>>>>>>>> support and is Linux-specific.
>>>>>>> I noticed there weren't any bdrv_ calls in here; I guess with a snapshot
>>>>>>> you still have the source running so still have it accessing the disk;
>>>>>>> do you do anything to try and wire the ram snapshotting up to disk
>>>>>>> snapshotting?
>>>>>> Block-related manipulations should be done externally, I think.
>>>>>> So create backing images for the RW nodes, then stop the VM, switch the
>>>>>> block graph, and start the background snapshot. Something like what
>>>>>> 'virsh snapshot-create-as' does, but in a different sequence.
>>>>> If you get a chance it would be great if you could put together an
>>>>> example of doing the combination RAM+block; that way we find out if there's
>>>>> anything silly missing.
>>>>>
>>>>> Dave
>>>> Yep, I'll take a look at the QMP command sequences, how they should look
>>>> in our case, and prepare an example; hopefully we are not missing anything serious.
>>>> At least we know that block setup data won't go into the snapshot.
>>>> I've also checked starting a background snapshot from the stopped VM state -
>>>> looks OK: the VM resumes operation, the snapshot is saved, no apparent problems.
>>>>
>>>> It may take some time, since I'm now tasked with creating a tool to store
>>>> snapshots with RAM indexable by GPFNs, together with the rest of the VMSTATE.
>>> If you want to make it indexable, why not just do a simple write(2) call
>>> for the whole of RAM rather than doing the thing like normal migration?
>>>
>>> Dave
>> For me the main reason is apparent file size. While we can get the same allocation
>> size when saving via write(2) on Linux, in many cases the apparent file size will
>> be much bigger than when using QCOW2.
> Do you mean because of zero pages or for some other reason?
>
> Dave

Yes. A plain sparse file on ext4 would grow to an apparent size equal to the
highest non-zero GPA, while a QCOW2 image won't. It's important from the point
of view of user experience, since desktop workloads often show a very small
non-zero RSS. When I start Win10 on QEMU with a single Firefox tab playing some
YouTube HD video, I have only 2-5GB of migration data on a 16GB VM.

Andrey

>> Andrey
>>
>>>> Based on QCOW2 format. Also it should support snapshot revert in postcopy mode.
>>>>
>>>> Andrey
>>>>
>>>>>> //
>>>>>>
>>>>>>>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
>>>>>>>> Acked-by: Peter Xu <peterx@redhat.com>
>>>>>>>> ---
>>>>>>>>      migration/migration.c | 255 +++++++++++++++++++++++++++++++++++++++++-
>>>>>>>>      migration/migration.h |   3 +
>>>>>>>>      migration/ram.c       |   2 +
>>>>>>>>      migration/savevm.c    |   1 -
>>>>>>>>      migration/savevm.h    |   2 +
>>>>>>>>      5 files changed, 260 insertions(+), 3 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>>>>>> index 2c2cb9ef01..0901a15ac5 100644
>>>>>>> <snip>
>>>>>>>
>>>>>>>> -    qemu_thread_create(&s->thread, "live_migration", migration_thread, s,
>>>>>>>> -                       QEMU_THREAD_JOINABLE);
>>>>>>>> +
>>>>>>>> +    if (migrate_background_snapshot()) {
>>>>>>>> +        qemu_thread_create(&s->thread, "background_snapshot",
>>>>>>> Unfortunately that won't work - there's a 14 character limit on
>>>>>>> the thread name length; I guess we just shorten that to bg_snapshot
>>>>>> Yep, missed that pthread_setname_np() has a length limit.
>>>>>>
>>>>>>> Other than that,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>>>>>>>
>>>>>>>> +                bg_migration_thread, s, QEMU_THREAD_JOINABLE);
>>>>>>>> +    } else {
>>>>>>>> +        qemu_thread_create(&s->thread, "live_migration",
>>>>>>>> +                migration_thread, s, QEMU_THREAD_JOINABLE);
>>>>>>>> +    }
>>>>>>>>          s->migration_thread_running = true;
>>>>>>>>      }
>>>>>>>> diff --git a/migration/migration.h b/migration/migration.h
>>>>>>>> index f40338cfbf..0723955cd7 100644
>>>>>>>> --- a/migration/migration.h
>>>>>>>> +++ b/migration/migration.h
>>>>>>>> @@ -20,6 +20,7 @@
>>>>>>>>      #include "qemu/thread.h"
>>>>>>>>      #include "qemu/coroutine_int.h"
>>>>>>>>      #include "io/channel.h"
>>>>>>>> +#include "io/channel-buffer.h"
>>>>>>>>      #include "net/announce.h"
>>>>>>>>      #include "qom/object.h"
>>>>>>>> @@ -147,8 +148,10 @@ struct MigrationState {
>>>>>>>>          /*< public >*/
>>>>>>>>          QemuThread thread;
>>>>>>>> +    QEMUBH *vm_start_bh;
>>>>>>>>          QEMUBH *cleanup_bh;
>>>>>>>>          QEMUFile *to_dst_file;
>>>>>>>> +    QIOChannelBuffer *bioc;
>>>>>>>>          /*
>>>>>>>>           * Protects to_dst_file pointer.  We need to make sure we won't
>>>>>>>>           * yield or hang during the critical section, since this lock will
>>>>>>>> diff --git a/migration/ram.c b/migration/ram.c
>>>>>>>> index 5707382db1..05fe0c8592 100644
>>>>>>>> --- a/migration/ram.c
>>>>>>>> +++ b/migration/ram.c
>>>>>>>> @@ -1471,6 +1471,7 @@ static RAMBlock *poll_fault_page(RAMState *rs, ram_addr_t *offset)
>>>>>>>>          page_address = (void *) uffd_msg.arg.pagefault.address;
>>>>>>>>          bs = qemu_ram_block_from_host(page_address, false, offset);
>>>>>>>>          assert(bs && (bs->flags & RAM_UF_WRITEPROTECT) != 0);
>>>>>>>> +
>>>>>>>>          return bs;
>>>>>>>>      }
>>>>>>>>      #endif /* CONFIG_LINUX */
>>>>>>>> @@ -1836,6 +1837,7 @@ static void ram_save_host_page_post(RAMState *rs, PageSearchStatus *pss,
>>>>>>>>              /* Un-protect memory range. */
>>>>>>>>              res = uffd_change_protection(rs->uffdio_fd, page_address, run_length,
>>>>>>>>                      false, false);
>>>>>>>> +
>>>>>>>>              /* We don't want to override existing error from ram_save_host_page(). */
>>>>>>>>              if (res < 0 && *res_override >= 0) {
>>>>>>>>                  *res_override = res;
>>>>>>>> diff --git a/migration/savevm.c b/migration/savevm.c
>>>>>>>> index 27e842812e..dd4ad0aaaf 100644
>>>>>>>> --- a/migration/savevm.c
>>>>>>>> +++ b/migration/savevm.c
>>>>>>>> @@ -1354,7 +1354,6 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
>>>>>>>>          return 0;
>>>>>>>>      }
>>>>>>>> -static
>>>>>>>>      int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
>>>>>>>>                                                          bool in_postcopy,
>>>>>>>>                                                          bool inactivate_disks)
>>>>>>>> diff --git a/migration/savevm.h b/migration/savevm.h
>>>>>>>> index ba64a7e271..aaee2528ed 100644
>>>>>>>> --- a/migration/savevm.h
>>>>>>>> +++ b/migration/savevm.h
>>>>>>>> @@ -64,5 +64,7 @@ int qemu_loadvm_state(QEMUFile *f);
>>>>>>>>      void qemu_loadvm_state_cleanup(void);
>>>>>>>>      int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
>>>>>>>>      int qemu_load_device_state(QEMUFile *f);
>>>>>>>> +int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
>>>>>>>> +        bool in_postcopy, bool inactivate_disks);
>>>>>>>>      #endif
>>>>>>>> -- 
>>>>>>>> 2.25.1
>>>>>>>>
>>>>>> -- 
>>>>>> Andrey Gruzdev, Principal Engineer
>>>>>> Virtuozzo GmbH  +7-903-247-6397
>>>>>>                    virtuzzo.com
>>>>>>
>>>> -- 
>>>> Andrey Gruzdev, Principal Engineer
>>>> Virtuozzo GmbH  +7-903-247-6397
>>>>                   virtuzzo.com
>>>>
>> -- 
>> Andrey Gruzdev, Principal Engineer
>> Virtuozzo GmbH  +7-903-247-6397
>>                  virtuzzo.com
>>

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com




end of thread, other threads:[~2021-01-21 18:37 UTC | newest]

Thread overview: 24+ messages
2021-01-06 15:21 [PATCH v11 0/5] UFFD write-tracking migration/snapshots Andrey Gruzdev via
2021-01-06 15:21 ` [PATCH v11 1/5] migration: introduce 'background-snapshot' migration capability Andrey Gruzdev via
2021-01-06 15:21 ` [PATCH v11 2/5] migration: introduce UFFD-WP low-level interface helpers Andrey Gruzdev via
2021-01-06 15:21 ` [PATCH v11 3/5] migration: support UFFD write fault processing in ram_save_iterate() Andrey Gruzdev via
2021-01-19 17:49   ` Dr. David Alan Gilbert
2021-01-21  8:48     ` Andrey Gruzdev
2021-01-21 10:09       ` Dr. David Alan Gilbert
2021-01-21 11:12         ` Andrey Gruzdev
2021-01-06 15:21 ` [PATCH v11 4/5] migration: implementation of background snapshot thread Andrey Gruzdev via
2021-01-19 18:49   ` Dr. David Alan Gilbert
2021-01-21  8:58     ` Andrey Gruzdev
2021-01-21  9:56       ` Dr. David Alan Gilbert
2021-01-21 11:08         ` Andrey Gruzdev
2021-01-21 16:11           ` Dr. David Alan Gilbert
2021-01-21 17:28             ` Andrey Gruzdev
2021-01-21 17:48               ` Dr. David Alan Gilbert
2021-01-21 18:29                 ` Andrey Gruzdev
2021-01-06 15:21 ` [PATCH v11 5/5] migration: introduce 'userfaultfd-wrlat.py' script Andrey Gruzdev via
2021-01-19 21:01   ` Peter Xu
2021-01-21 13:12     ` Andrey Gruzdev
2021-01-21 15:37       ` Peter Xu
2021-01-21 17:15         ` Andrey Gruzdev
2021-01-15 11:34 ` [PATCH v11 0/5] UFFD write-tracking migration/snapshots Andrey Gruzdev
2021-01-15 14:54   ` Peter Xu
