* [PATCH v3 0/7] UFFD write-tracking migration/snapshots
@ 2020-11-19 12:59 Andrey Gruzdev via
  2020-11-19 12:59 ` [PATCH v3 1/7] introduce 'track-writes-ram' migration capability Andrey Gruzdev via
                   ` (7 more replies)
  0 siblings, 8 replies; 44+ messages in thread
From: Andrey Gruzdev via @ 2020-11-19 12:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

Changes with v3:
* coding style fixes to pass checkpatch test
* qapi/migration.json: change 'track-writes-ram' capability
*                      supported version to 6.0
* fixes to commit message format


This patch series is a rethinking of the ideas Denis Plotnikov
implemented in his series '[PATCH v0 0/4] migration: add background snapshot'.

Currently the only way to make an (external) live VM snapshot is to use the
existing dirty page logging migration mechanism. The main problem is that it
tends to produce many page duplicates while the running VM keeps updating
already saved pages. As a result, the vmstate image size is commonly several
times bigger than the non-zero part of the virtual machine's RSS. The time
required for RAM migration to converge and the size of the snapshot image
depend heavily on the guest memory write rate, sometimes resulting in
unacceptably long snapshot creation times and huge image sizes.

This series proposes a way to solve the aforementioned problems. It uses a
different RAM migration mechanism based on the UFFD write-protection
management introduced in the v5.7 kernel. The migration strategy is to
'freeze' guest RAM content using write protection and iteratively release
the protection for memory ranges that have already been saved to the
migration stream. At the same time we read pending UFFD write fault events
and save those pages out of order with higher priority.

How to use:
1. Enable the write-tracking migration capability
   virsh qemu-monitor-command <domain> --hmp migrate_set_capability track-writes-ram on

2. Start the external migration to a file
   virsh qemu-monitor-command <domain> --hmp migrate exec:'cat > ./vm_state'

3. Wait for the migration to finish and check that it has reached the
   'completed' state.

Andrey Gruzdev (7):
  introduce 'track-writes-ram' migration capability
  introduce UFFD-WP low-level interface helpers
  support UFFD write fault processing in ram_save_iterate()
  implementation of write-tracking migration thread
  implementation of vm_start() BH
  the rest of write tracking migration code
  introduce simple linear scan rate limiting mechanism

 include/exec/memory.h |   7 +
 migration/migration.c | 338 +++++++++++++++++++++++++++++++-
 migration/migration.h |   4 +
 migration/ram.c       | 439 +++++++++++++++++++++++++++++++++++++++++-
 migration/ram.h       |   4 +
 migration/savevm.c    |   1 -
 migration/savevm.h    |   2 +
 qapi/migration.json   |   7 +-
 8 files changed, 790 insertions(+), 12 deletions(-)

-- 
2.25.1



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v3 1/7] introduce 'track-writes-ram' migration capability
  2020-11-19 12:59 [PATCH v3 0/7] UFFD write-tracking migration/snapshots Andrey Gruzdev via
@ 2020-11-19 12:59 ` Andrey Gruzdev via
  2020-11-19 18:51   ` Peter Xu
  2020-11-19 12:59 ` [PATCH v3 2/7] introduce UFFD-WP low-level interface helpers Andrey Gruzdev via
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 44+ messages in thread
From: Andrey Gruzdev via @ 2020-11-19 12:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 migration/migration.c | 96 +++++++++++++++++++++++++++++++++++++++++++
 migration/migration.h |  1 +
 qapi/migration.json   |  7 +++-
 3 files changed, 103 insertions(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index 87a9b59f83..ff0364dde0 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -56,6 +56,7 @@
 #include "net/announce.h"
 #include "qemu/queue.h"
 #include "multifd.h"
+#include "sysemu/cpus.h"
 
 #ifdef CONFIG_VFIO
 #include "hw/vfio/vfio-common.h"
@@ -1165,6 +1166,91 @@ static bool migrate_caps_check(bool *cap_list,
         }
     }
 
+    if (cap_list[MIGRATION_CAPABILITY_TRACK_WRITES_RAM]) {
+        if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
+            error_setg(errp,
+                    "Track-writes is not compatible with postcopy-ram");
+            return false;
+        }
+
+        if (cap_list[MIGRATION_CAPABILITY_DIRTY_BITMAPS]) {
+            error_setg(errp,
+                    "Track-writes is not compatible with dirty-bitmaps");
+            return false;
+        }
+
+        if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_BLOCKTIME]) {
+            error_setg(errp,
+                    "Track-writes is not compatible with postcopy-blocktime");
+            return false;
+        }
+
+        if (cap_list[MIGRATION_CAPABILITY_LATE_BLOCK_ACTIVATE]) {
+            error_setg(errp,
+                    "Track-writes is not compatible with late-block-activate");
+            return false;
+        }
+
+        if (cap_list[MIGRATION_CAPABILITY_RETURN_PATH]) {
+            error_setg(errp,
+                    "Track-writes is not compatible with return-path");
+            return false;
+        }
+
+        if (cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
+            error_setg(errp, "Track-writes is not compatible with multifd");
+            return false;
+        }
+
+        if (cap_list[MIGRATION_CAPABILITY_PAUSE_BEFORE_SWITCHOVER]) {
+            error_setg(errp,
+                    "Track-writes is not compatible with pause-before-switchover");
+            return false;
+        }
+
+        if (cap_list[MIGRATION_CAPABILITY_AUTO_CONVERGE]) {
+            error_setg(errp,
+                    "Track-writes is not compatible with auto-converge");
+            return false;
+        }
+
+        if (cap_list[MIGRATION_CAPABILITY_RELEASE_RAM]) {
+            error_setg(errp,
+                    "Track-writes is not compatible with release-ram");
+            return false;
+        }
+
+        if (cap_list[MIGRATION_CAPABILITY_RDMA_PIN_ALL]) {
+            error_setg(errp,
+                    "Track-writes is not compatible with rdma-pin-all");
+            return false;
+        }
+
+        if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
+            error_setg(errp,
+                    "Track-writes is not compatible with compression");
+            return false;
+        }
+
+        if (cap_list[MIGRATION_CAPABILITY_XBZRLE]) {
+            error_setg(errp,
                    "Track-writes is not compatible with XBZRLE");
+            return false;
+        }
+
+        if (cap_list[MIGRATION_CAPABILITY_X_COLO]) {
+            error_setg(errp,
+                    "Track-writes is not compatible with x-colo");
+            return false;
+        }
+
+        if (cap_list[MIGRATION_CAPABILITY_VALIDATE_UUID]) {
+            error_setg(errp,
+                    "Track-writes is not compatible with validate-uuid");
+            return false;
+        }
+    }
+
     return true;
 }
 
@@ -2490,6 +2576,15 @@ bool migrate_use_block_incremental(void)
     return s->parameters.block_incremental;
 }
 
+bool migrate_track_writes_ram(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_TRACK_WRITES_RAM];
+}
+
 /* migration thread support */
 /*
  * Something bad happened to the RP stream, mark an error
@@ -3783,6 +3878,7 @@ static Property migration_properties[] = {
     DEFINE_PROP_MIG_CAP("x-block", MIGRATION_CAPABILITY_BLOCK),
     DEFINE_PROP_MIG_CAP("x-return-path", MIGRATION_CAPABILITY_RETURN_PATH),
     DEFINE_PROP_MIG_CAP("x-multifd", MIGRATION_CAPABILITY_MULTIFD),
+    DEFINE_PROP_MIG_CAP("x-track-writes-ram", MIGRATION_CAPABILITY_TRACK_WRITES_RAM),
 
     DEFINE_PROP_END_OF_LIST(),
 };
diff --git a/migration/migration.h b/migration/migration.h
index d096b77f74..339ae720e0 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -341,6 +341,7 @@ int migrate_compress_wait_thread(void);
 int migrate_decompress_threads(void);
 bool migrate_use_events(void);
 bool migrate_postcopy_blocktime(void);
+bool migrate_track_writes_ram(void);
 
 /* Sending on the return path - generic and then for each message type */
 void migrate_send_rp_shut(MigrationIncomingState *mis,
diff --git a/qapi/migration.json b/qapi/migration.json
index 3c75820527..a28d8b7ee8 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -442,6 +442,11 @@
 # @validate-uuid: Send the UUID of the source to allow the destination
 #                 to ensure it is the same. (since 4.2)
 #
+# @track-writes-ram: If enabled, the migration stream will be a snapshot
+#                    of the VM exactly at the point when the migration
+#                    procedure starts. RAM is saved while the VM keeps
+#                    running.
+#                    (since 6.0)
+#
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
@@ -449,7 +454,7 @@
            'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
            'block', 'return-path', 'pause-before-switchover', 'multifd',
            'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate',
-           'x-ignore-shared', 'validate-uuid' ] }
+           'x-ignore-shared', 'validate-uuid', 'track-writes-ram'] }
 
 ##
 # @MigrationCapabilityStatus:
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 2/7] introduce UFFD-WP low-level interface helpers
  2020-11-19 12:59 [PATCH v3 0/7] UFFD write-tracking migration/snapshots Andrey Gruzdev via
  2020-11-19 12:59 ` [PATCH v3 1/7] introduce 'track-writes-ram' migration capability Andrey Gruzdev via
@ 2020-11-19 12:59 ` Andrey Gruzdev via
  2020-11-19 18:39   ` Peter Xu
  2020-11-24 17:57   ` Dr. David Alan Gilbert
  2020-11-19 12:59 ` [PATCH v3 3/7] support UFFD write fault processing in ram_save_iterate() Andrey Gruzdev via
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 44+ messages in thread
From: Andrey Gruzdev via @ 2020-11-19 12:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

Implemented support for write-protecting and un-protecting the
memory of a whole RAM block. Introduced the higher-level
ram_write_tracking_start() and ram_write_tracking_stop() helpers
to start/stop tracking guest memory writes.

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 include/exec/memory.h |   7 ++
 migration/ram.c       | 267 ++++++++++++++++++++++++++++++++++++++++++
 migration/ram.h       |   4 +
 3 files changed, 278 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 0f3e6bcd5e..3d798fce16 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -139,6 +139,13 @@ typedef struct IOMMUNotifier IOMMUNotifier;
 /* RAM is a persistent kind memory */
 #define RAM_PMEM (1 << 5)
 
+/*
+ * UFFDIO_WRITEPROTECT is used on this RAMBlock to
+ * support 'write-tracking' migration type.
+ * Implies ram_state->ram_wt_enabled.
+ */
+#define RAM_UF_WRITEPROTECT (1 << 6)
+
 static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
                                        IOMMUNotifierFlag flags,
                                        hwaddr start, hwaddr end,
diff --git a/migration/ram.c b/migration/ram.c
index 7811cde643..7f273c9996 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -56,6 +56,12 @@
 #include "savevm.h"
 #include "qemu/iov.h"
 #include "multifd.h"
+#include <inttypes.h>
+#include <poll.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <linux/userfaultfd.h>
+#include "sysemu/runstate.h"
 
 /***********************************************************/
 /* ram save/restore */
@@ -298,6 +304,8 @@ struct RAMSrcPageRequest {
 struct RAMState {
     /* QEMUFile used for this migration */
     QEMUFile *f;
+    /* UFFD file descriptor, used in 'write-tracking' migration */
+    int uffdio_fd;
     /* Last block that we have visited searching for dirty pages */
     RAMBlock *last_seen_block;
     /* Last block from where we have sent data */
@@ -453,6 +461,181 @@ static QemuThread *decompress_threads;
 static QemuMutex decomp_done_lock;
 static QemuCond decomp_done_cond;
 
+/**
+ * uffd_create_fd: create UFFD file descriptor
+ *
+ * Returns non-negative file descriptor or negative value in case of an error
+ */
+static int uffd_create_fd(void)
+{
+    int uffd;
+    struct uffdio_api api_struct;
+    uint64_t ioctl_mask = BIT(_UFFDIO_REGISTER) | BIT(_UFFDIO_UNREGISTER);
+
+    uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+    if (uffd < 0) {
+        error_report("uffd_create_fd() failed: UFFD not supported");
+        return -1;
+    }
+
+    api_struct.api = UFFD_API;
+    api_struct.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
+    if (ioctl(uffd, UFFDIO_API, &api_struct)) {
+        error_report("uffd_create_fd() failed: "
+                "API version not supported version=%llx errno=%i",
+                api_struct.api, errno);
+        goto fail;
+    }
+
+    if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) {
+        error_report("uffd_create_fd() failed: "
+                "PAGEFAULT_FLAG_WP feature missing");
+        goto fail;
+    }
+
+    return uffd;
+
+fail:
+    close(uffd);
+    return -1;
+}
+
+/**
+ * uffd_close_fd: close UFFD file descriptor
+ *
+ * @uffd: UFFD file descriptor
+ */
+static void uffd_close_fd(int uffd)
+{
+    assert(uffd >= 0);
+    close(uffd);
+}
+
+/**
+ * uffd_register_memory: register memory range with UFFD
+ *
+ * Returns 0 in case of success, negative value on error
+ *
+ * @uffd: UFFD file descriptor
+ * @start: starting virtual address of memory range
+ * @length: length of memory range
+ * @track_missing: generate events on missing-page faults
+ * @track_wp: generate events on write-protected-page faults
+ */
+static int uffd_register_memory(int uffd, hwaddr start, hwaddr length,
+        bool track_missing, bool track_wp)
+{
+    struct uffdio_register uffd_register;
+
+    uffd_register.range.start = start;
+    uffd_register.range.len = length;
+    uffd_register.mode = (track_missing ? UFFDIO_REGISTER_MODE_MISSING : 0) |
+                         (track_wp ? UFFDIO_REGISTER_MODE_WP : 0);
+
+    if (ioctl(uffd, UFFDIO_REGISTER, &uffd_register)) {
+        error_report("uffd_register_memory() failed: "
+                "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
+                start, length, uffd_register.mode, errno);
+        return -1;
+    }
+
+    return 0;
+}
+
+/**
+ * uffd_protect_memory: protect/unprotect memory range for writes with UFFD
+ *
+ * Returns 0 on success or negative value in case of error
+ *
+ * @uffd: UFFD file descriptor
+ * @start: starting virtual address of memory range
+ * @length: length of memory range
+ * @wp: write-protect/unprotect
+ */
+static int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp)
+{
+    struct uffdio_writeprotect uffd_writeprotect;
+    int res;
+
+    uffd_writeprotect.range.start = start;
+    uffd_writeprotect.range.len = length;
+    uffd_writeprotect.mode = (wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0);
+
+    do {
+        res = ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect);
+    } while (res < 0 && errno == EINTR);
+    if (res < 0) {
+        error_report("uffd_protect_memory() failed: "
+                "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
+                start, length, uffd_writeprotect.mode, errno);
+        return -1;
+    }
+
+    return 0;
+}
+
+__attribute__ ((unused))
+static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count);
+__attribute__ ((unused))
+static bool uffd_poll_events(int uffd, int tmo);
+
+/**
+ * uffd_read_events: read pending UFFD events
+ *
+ * Returns the number of fetched messages, 0 if none are available, or a
+ * negative value in case of an error
+ *
+ * @uffd: UFFD file descriptor
+ * @msgs: pointer to message buffer
+ * @count: number of messages that can fit in the buffer
+ */
+static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count)
+{
+    ssize_t res;
+    do {
+        res = read(uffd, msgs, count * sizeof(struct uffd_msg));
+    } while (res < 0 && errno == EINTR);
+
+    if ((res < 0 && errno == EAGAIN)) {
+        return 0;
+    }
+    if (res < 0) {
+        error_report("uffd_read_events() failed: errno=%i", errno);
+        return -1;
+    }
+
+    return (int) (res / sizeof(struct uffd_msg));
+}
+
+/**
+ * uffd_poll_events: poll UFFD file descriptor for read
+ *
+ * Returns true if events are available for read, false otherwise
+ *
+ * @uffd: UFFD file descriptor
+ * @tmo: timeout in milliseconds, 0 for non-blocking operation,
+ *       negative value for infinite wait
+ */
+static bool uffd_poll_events(int uffd, int tmo)
+{
+    int res;
+    struct pollfd poll_fd = { .fd = uffd, .events = POLLIN, .revents = 0 };
+
+    do {
+        res = poll(&poll_fd, 1, tmo);
+    } while (res < 0 && errno == EINTR);
+
+    if (res == 0) {
+        return false;
+    }
+    if (res < 0) {
+        error_report("uffd_poll_events() failed: errno=%i", errno);
+        return false;
+    }
+
+    return (poll_fd.revents & POLLIN) != 0;
+}
+
 static bool do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block,
                                  ram_addr_t offset, uint8_t *source_buf);
 
@@ -3788,6 +3971,90 @@ static int ram_resume_prepare(MigrationState *s, void *opaque)
     return 0;
 }
 
+/**
+ * ram_write_tracking_start: start UFFD-WP memory tracking
+ *
+ * Returns 0 for success or negative value in case of error
+ *
+ */
+int ram_write_tracking_start(void)
+{
+    int uffd;
+    RAMState *rs = ram_state;
+    RAMBlock *bs;
+
+    /* Open UFFD file descriptor */
+    uffd = uffd_create_fd();
+    if (uffd < 0) {
+        return uffd;
+    }
+    rs->uffdio_fd = uffd;
+
+    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
+        /* Nothing to do with read-only and MMIO-writable regions */
+        if (bs->mr->readonly || bs->mr->rom_device) {
+            continue;
+        }
+
+        /* Register block memory with UFFD to track writes */
+        if (uffd_register_memory(rs->uffdio_fd, (hwaddr) bs->host,
+                bs->max_length, false, true)) {
+            goto fail;
+        }
+        /* Apply UFFD write protection to the block memory range */
+        if (uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host,
+                bs->max_length, true)) {
+            goto fail;
+        }
+        bs->flags |= RAM_UF_WRITEPROTECT;
+
+        info_report("UFFD-WP write-tracking enabled: "
+                "block_id=%s page_size=%zu start=%p length=%lu "
+                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
+                bs->idstr, bs->page_size, bs->host, bs->max_length,
+                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
+                bs->mr->nonvolatile, bs->mr->rom_device);
+    }
+
+    return 0;
+
+fail:
+    uffd_close_fd(uffd);
+    rs->uffdio_fd = -1;
+    return -1;
+}
+
+/**
+ * ram_write_tracking_stop: stop UFFD-WP memory tracking and remove protection
+ */
+void ram_write_tracking_stop(void)
+{
+    RAMState *rs = ram_state;
+    RAMBlock *bs;
+    assert(rs->uffdio_fd >= 0);
+
+    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
+        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
+            continue;
+        }
+        info_report("UFFD-WP write-tracking disabled: "
+                "block_id=%s page_size=%zu start=%p length=%lu "
+                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
+                bs->idstr, bs->page_size, bs->host, bs->max_length,
+                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
+                bs->mr->nonvolatile, bs->mr->rom_device);
+        /* Cleanup flags */
+        bs->flags &= ~RAM_UF_WRITEPROTECT;
+    }
+
+    /*
+     * Close UFFD file descriptor to remove protection,
+     * release registered memory regions and flush wait queues
+     */
+    uffd_close_fd(rs->uffdio_fd);
+    rs->uffdio_fd = -1;
+}
+
 static SaveVMHandlers savevm_ram_handlers = {
     .save_setup = ram_save_setup,
     .save_live_iterate = ram_save_iterate,
diff --git a/migration/ram.h b/migration/ram.h
index 011e85414e..3611cb51de 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -79,4 +79,8 @@ void colo_flush_ram_cache(void);
 void colo_release_ram_cache(void);
 void colo_incoming_start_dirty_log(void);
 
+/* Live snapshots */
+int ram_write_tracking_start(void);
+void ram_write_tracking_stop(void);
+
 #endif
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 3/7] support UFFD write fault processing in ram_save_iterate()
  2020-11-19 12:59 [PATCH v3 0/7] UFFD write-tracking migration/snapshots Andrey Gruzdev via
  2020-11-19 12:59 ` [PATCH v3 1/7] introduce 'track-writes-ram' migration capability Andrey Gruzdev via
  2020-11-19 12:59 ` [PATCH v3 2/7] introduce UFFD-WP low-level interface helpers Andrey Gruzdev via
@ 2020-11-19 12:59 ` Andrey Gruzdev via
  2020-11-19 18:25   ` Peter Xu
  2020-11-25 13:08   ` Dr. David Alan Gilbert
  2020-11-19 12:59 ` [PATCH v3 4/7] implementation of write-tracking migration thread Andrey Gruzdev via
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 44+ messages in thread
From: Andrey Gruzdev via @ 2020-11-19 12:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

In this implementation the same single migration thread is
responsible for both normal linear dirty page migration and
processing UFFD page fault events.

Processing write faults includes reading the UFFD file descriptor,
finding the respective RAM block and saving the faulting page to
the migration stream. After the page has been saved, write protection
can be removed. Since an asynchronous version of qemu_put_buffer()
is expected to be used to save pages, we also have to flush the
migration stream prior to un-protecting the saved memory range.

Write protection is removed for any previously protected memory
chunk that has hit the migration stream. That holds for pages from
the linear page scan as well as for write fault pages.

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 migration/ram.c | 124 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 115 insertions(+), 9 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 7f273c9996..08a1d7a252 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -314,6 +314,8 @@ struct RAMState {
     ram_addr_t last_page;
     /* last ram version we have seen */
     uint32_t last_version;
+    /* 'write-tracking' migration is enabled */
+    bool ram_wt_enabled;
     /* We are in the first round */
     bool ram_bulk_stage;
     /* The free page optimization is enabled */
@@ -574,8 +576,6 @@ static int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp)
     return 0;
 }
 
-__attribute__ ((unused))
-static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count);
 __attribute__ ((unused))
 static bool uffd_poll_events(int uffd, int tmo);
 
@@ -1929,6 +1929,86 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
     return pages;
 }
 
+/**
+ * ram_find_block_by_host_address: find RAM block containing host page
+ *
+ * Returns true if RAM block is found and pss->block/page are
+ * pointing to the given host page, false in case of an error
+ *
+ * @rs: current RAM state
+ * @pss: page-search-status structure
+ */
+static bool ram_find_block_by_host_address(RAMState *rs, PageSearchStatus *pss,
+        hwaddr page_address)
+{
+    bool found = false;
+
+    pss->block = rs->last_seen_block;
+    do {
+        if (page_address >= (hwaddr) pss->block->host &&
+            (page_address + TARGET_PAGE_SIZE) <=
+                    ((hwaddr) pss->block->host + pss->block->used_length)) {
+            pss->page = (unsigned long)
+                    ((page_address - (hwaddr) pss->block->host) >> TARGET_PAGE_BITS);
+            found = true;
+            break;
+        }
+
+        pss->block = QLIST_NEXT_RCU(pss->block, next);
+        if (!pss->block) {
+            /* Hit the end of the list */
+            pss->block = QLIST_FIRST_RCU(&ram_list.blocks);
+        }
+    } while (pss->block != rs->last_seen_block);
+
+    rs->last_seen_block = pss->block;
+    /*
+     * Since we are in the same loop with ram_find_and_save_block(),
+     * need to reset pss->complete_round after switching to
+     * other block/page in pss.
+     */
+    pss->complete_round = false;
+
+    return found;
+}
+
+/**
+ * get_fault_page: try to get next UFFD write fault page and, if pending fault
+ *   is found, put info about RAM block and block page into pss structure
+ *
+ * Returns true in case of UFFD write fault detected, false otherwise
+ *
+ * @rs: current RAM state
+ * @pss: page-search-status structure
+ *
+ */
+static bool get_fault_page(RAMState *rs, PageSearchStatus *pss)
+{
+    struct uffd_msg uffd_msg;
+    hwaddr page_address;
+    int res;
+
+    if (!rs->ram_wt_enabled) {
+        return false;
+    }
+
+    res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1);
+    if (res <= 0) {
+        return false;
+    }
+
+    page_address = uffd_msg.arg.pagefault.address;
+    if (!ram_find_block_by_host_address(rs, pss, page_address)) {
+        /* In case we couldn't find respective block, just unprotect faulting page */
+        uffd_protect_memory(rs->uffdio_fd, page_address, TARGET_PAGE_SIZE, false);
+        error_report("ram_find_block_by_host_address() failed: address=0x%0lx",
+                     page_address);
+        return false;
+    }
+
+    return true;
+}
+
 /**
  * ram_find_and_save_block: finds a dirty page and sends it to f
  *
@@ -1955,25 +2035,50 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
         return pages;
     }
 
+    if (!rs->last_seen_block) {
+        rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks);
+    }
     pss.block = rs->last_seen_block;
     pss.page = rs->last_page;
     pss.complete_round = false;
 
-    if (!pss.block) {
-        pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
-    }
-
     do {
+        ram_addr_t page;
+        ram_addr_t page_to;
+
         again = true;
-        found = get_queued_page(rs, &pss);
-
+        /* In case of 'write-tracking' migration we first try
+         * to poll UFFD and get write page fault event */
+        found = get_fault_page(rs, &pss);
+        if (!found) {
+            /* Now try to fetch something from the priority queue */
+            found = get_queued_page(rs, &pss);
+        }
         if (!found) {
             /* priority queue empty, so just search for something dirty */
             found = find_dirty_block(rs, &pss, &again);
         }
 
         if (found) {
+            page = pss.page;
             pages = ram_save_host_page(rs, &pss, last_stage);
+            page_to = pss.page;
+
+            /* Check if page is from UFFD-managed region */
+            if (pss.block->flags & RAM_UF_WRITEPROTECT) {
+                hwaddr page_address = (hwaddr) pss.block->host +
+                        ((hwaddr) page << TARGET_PAGE_BITS);
+                hwaddr run_length = (hwaddr) (page_to - page + 1) << TARGET_PAGE_BITS;
+                int res;
+
+                /* Flush async buffers before un-protect */
+                qemu_fflush(rs->f);
+                /* Un-protect memory range */
+                res = uffd_protect_memory(rs->uffdio_fd, page_address, run_length, false);
+                if (res < 0) {
+                    break;
+                }
+            }
         }
     } while (!pages && again);
 
@@ -2086,7 +2191,8 @@ static void ram_state_reset(RAMState *rs)
     rs->last_sent_block = NULL;
     rs->last_page = 0;
     rs->last_version = ram_list.version;
-    rs->ram_bulk_stage = true;
+    rs->ram_wt_enabled = migrate_track_writes_ram();
+    rs->ram_bulk_stage = !rs->ram_wt_enabled;
     rs->fpo_enabled = false;
 }
 
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 4/7] implementation of write-tracking migration thread
  2020-11-19 12:59 [PATCH v3 0/7] UFFD write-tracking migration/snapshots Andrey Gruzdev via
                   ` (2 preceding siblings ...)
  2020-11-19 12:59 ` [PATCH v3 3/7] support UFFD write fault processing in ram_save_iterate() Andrey Gruzdev via
@ 2020-11-19 12:59 ` Andrey Gruzdev via
  2020-11-19 18:47   ` Peter Xu
  2020-11-19 12:59 ` [PATCH v3 5/7] implementation of vm_start() BH Andrey Gruzdev via
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 44+ messages in thread
From: Andrey Gruzdev via @ 2020-11-19 12:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 migration/migration.c | 168 +++++++++++++++++++++++++++++++++++++++++-
 migration/migration.h |   3 +
 migration/savevm.c    |   1 -
 migration/savevm.h    |   2 +
 4 files changed, 171 insertions(+), 3 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index ff0364dde0..158e5441ec 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2013,6 +2013,7 @@ void migrate_init(MigrationState *s)
      * locks.
      */
     s->cleanup_bh = 0;
+    s->wt_vm_start_bh = 0;
     s->to_dst_file = NULL;
     s->state = MIGRATION_STATUS_NONE;
     s->rp_state.from_dst_file = NULL;
@@ -3551,6 +3552,21 @@ static void migration_iteration_finish(MigrationState *s)
     qemu_mutex_unlock_iothread();
 }
 
+static void wt_migration_iteration_finish(MigrationState *s)
+{
+    /* TODO: implement */
+}
+
+/*
+ * Return true if continue to the next iteration directly, false
+ * otherwise.
+ */
+static MigIterateState wt_migration_iteration_run(MigrationState *s)
+{
+    /* TODO: implement */
+    return MIG_ITERATE_RESUME;
+}
+
 void migration_make_urgent_request(void)
 {
     qemu_sem_post(&migrate_get_current()->rate_limit_sem);
@@ -3698,6 +3714,148 @@ static void *migration_thread(void *opaque)
     return NULL;
 }
 
+static void wt_migration_vm_start_bh(void *opaque)
+{
+    /* TODO: implement */
+}
+
+/*
+ * Master migration thread on the source VM.
+ * This is an alternative implementation of live migration
+ * which uses the userfaultfd write protection mechanism introduced
+ * in the 5.7 kernel. Compared to the existing dirty page logging
+ * migration it produces much less traffic and smaller snapshot
+ * images, since no page duplicates can get into the stream. Another
+ * key point is that the generated vmstate stream reflects the
+ * machine state 'frozen' at the beginning of migration, whereas
+ * dirty page logging effectively results in the snapshot being the
+ * state at the end of the migration process.
+ */
+static void *wt_migration_thread(void *opaque)
+{
+    MigrationState *s = opaque;
+    int64_t setup_start;
+    MigThrError thr_error;
+    QEMUFile *fb;
+    bool early_fail = true;
+
+    rcu_register_thread();
+    object_ref(OBJECT(s));
+
+    qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
+
+    setup_start = qemu_clock_get_ms(QEMU_CLOCK_HOST);
+    /*
+     * We want the vmstate to reflect the moment when migration was
+     * initiated, but we also want to save RAM content while the VM is
+     * running. The RAM content should appear first in the vmstate. So,
+     * we first stash the non-RAM part of the vmstate in a temporary
+     * buffer, then write the RAM part of the vmstate to the migration
+     * stream with vCPUs running and, finally, write the stashed non-RAM
+     * part of the vmstate from the buffer to the migration stream.
+     */
+    s->bioc = qio_channel_buffer_new(128 * 1024);
+    qio_channel_set_name(QIO_CHANNEL(s->bioc), "vmstate-buffer");
+    fb = qemu_fopen_channel_output(QIO_CHANNEL(s->bioc));
+    object_unref(OBJECT(s->bioc));
+
+    update_iteration_initial_status(s);
+
+    qemu_savevm_state_header(s->to_dst_file);
+    qemu_savevm_state_setup(s->to_dst_file);
+
+    if (qemu_savevm_state_guest_unplug_pending()) {
+        migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
+                          MIGRATION_STATUS_WAIT_UNPLUG);
+
+        while (s->state == MIGRATION_STATUS_WAIT_UNPLUG &&
+               qemu_savevm_state_guest_unplug_pending()) {
+            qemu_sem_timedwait(&s->wait_unplug_sem, 250);
+        }
+
+        migrate_set_state(&s->state, MIGRATION_STATUS_WAIT_UNPLUG,
+                          MIGRATION_STATUS_ACTIVE);
+    }
+    s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start;
+
+    migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
+                      MIGRATION_STATUS_ACTIVE);
+    trace_migration_thread_setup_complete();
+    s->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+
+    qemu_mutex_lock_iothread();
+
+    if (global_state_store()) {
+        goto fail;
+    }
+    /* Forcibly stop VM before saving state of vCPUs and devices */
+    if (vm_stop_force_state(RUN_STATE_PAUSED)) {
+        goto fail;
+    }
+    /*
+     * Put vCPUs in sync with shadow context structures, then
+     * save their state to channel-buffer along with devices.
+     */
+    cpu_synchronize_all_states();
+    if (qemu_savevm_state_complete_precopy_non_iterable(fb, false, false)) {
+        goto fail;
+    }
+    /* Now initialize UFFD context and start tracking RAM writes */
+    if (ram_write_tracking_start()) {
+        goto fail;
+    }
+    early_fail = false;
+
+    /*
+     * Start the VM from a BH handler to avoid a write-fault deadlock here.
+     * UFFD-WP protection for the whole RAM is already enabled, so calling
+     * VM state change notifiers from vm_start() would initiate writes to
+     * virtio VQ memory, which is in a write-protected region.
+     */
+    s->wt_vm_start_bh = qemu_bh_new(wt_migration_vm_start_bh, s);
+    qemu_bh_schedule(s->wt_vm_start_bh);
+
+    qemu_mutex_unlock_iothread();
+
+    while (migration_is_active(s)) {
+        MigIterateState iter_state = wt_migration_iteration_run(s);
+        if (iter_state == MIG_ITERATE_SKIP) {
+            continue;
+        } else if (iter_state == MIG_ITERATE_BREAK) {
+            break;
+        }
+
+        /*
+         * Try to detect any kind of failures, and see whether we
+         * should stop the migration now.
+         */
+        thr_error = migration_detect_error(s);
+        if (thr_error == MIG_THR_ERR_FATAL) {
+            /* Stop migration */
+            break;
+        }
+
+        migration_update_counters(s, qemu_clock_get_ms(QEMU_CLOCK_REALTIME));
+    }
+
+    trace_migration_thread_after_loop();
+
+fail:
+    if (early_fail) {
+        migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
+                MIGRATION_STATUS_FAILED);
+        qemu_mutex_unlock_iothread();
+    }
+
+    wt_migration_iteration_finish(s);
+
+    qemu_fclose(fb);
+    object_unref(OBJECT(s));
+    rcu_unregister_thread();
+
+    return NULL;
+}
+
 void migrate_fd_connect(MigrationState *s, Error *error_in)
 {
     Error *local_err = NULL;
@@ -3761,8 +3919,14 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
         migrate_fd_cleanup(s);
         return;
     }
-    qemu_thread_create(&s->thread, "live_migration", migration_thread, s,
-                       QEMU_THREAD_JOINABLE);
+
+    if (migrate_track_writes_ram()) {
+        qemu_thread_create(&s->thread, "wt_live_migration",
+                wt_migration_thread, s, QEMU_THREAD_JOINABLE);
+    } else {
+        qemu_thread_create(&s->thread, "live_migration",
+                migration_thread, s, QEMU_THREAD_JOINABLE);
+    }
     s->migration_thread_running = true;
 }
 
diff --git a/migration/migration.h b/migration/migration.h
index 339ae720e0..c3b4c7f2fd 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -20,6 +20,7 @@
 #include "qemu/thread.h"
 #include "qemu/coroutine_int.h"
 #include "io/channel.h"
+#include "io/channel-buffer.h"
 #include "net/announce.h"
 #include "qom/object.h"
 
@@ -147,8 +148,10 @@ struct MigrationState {
 
     /*< public >*/
     QemuThread thread;
+    QEMUBH *wt_vm_start_bh;
     QEMUBH *cleanup_bh;
     QEMUFile *to_dst_file;
+    QIOChannelBuffer *bioc;
     /*
      * Protects to_dst_file pointer.  We need to make sure we won't
      * yield or hang during the critical section, since this lock will
diff --git a/migration/savevm.c b/migration/savevm.c
index 5f937a2762..62d5f8a869 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1352,7 +1352,6 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
     return 0;
 }
 
-static
 int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
                                                     bool in_postcopy,
                                                     bool inactivate_disks)
diff --git a/migration/savevm.h b/migration/savevm.h
index ba64a7e271..aaee2528ed 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -64,5 +64,7 @@ int qemu_loadvm_state(QEMUFile *f);
 void qemu_loadvm_state_cleanup(void);
 int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
 int qemu_load_device_state(QEMUFile *f);
+int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
+        bool in_postcopy, bool inactivate_disks);
 
 #endif
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread
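The stash-buffer ordering that wt_migration_thread relies on (capture the non-RAM device state into a side buffer first, stream RAM while vCPUs run, then append the stashed buffer) can be sketched with plain in-memory buffers. This is an illustrative stand-in only: `Buffer`, `buf_write` and `snapshot` are hypothetical simplifications of QEMUFile/QIOChannelBuffer, not QEMU APIs.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical in-memory stand-in for QEMUFile / QIOChannelBuffer. */
typedef struct {
    char data[256];
    size_t usage;
} Buffer;

static void buf_write(Buffer *b, const char *s)
{
    size_t n = strlen(s);
    memcpy(b->data + b->usage, s, n);
    b->usage += n;
}

/*
 * Mirror of the snapshot ordering: device state is captured into a side
 * buffer at the moment the snapshot starts, RAM is streamed afterwards
 * while the VM runs, and the stashed device state is appended last, so
 * the stream layout is header, RAM, devices.
 */
static void snapshot(Buffer *stream)
{
    Buffer side = { .usage = 0 };

    buf_write(stream, "HDR|");       /* qemu_savevm_state_header() */
    buf_write(&side, "DEVS|");       /* non-iterable state -> channel-buffer */
    buf_write(stream, "RAM1|RAM2|"); /* iterative RAM saving, vCPUs running */
    buf_write(stream, side.data);    /* qemu_put_buffer(s->to_dst_file, ...) */
}
```

The key property, which the sketch makes testable, is that device state captured first still lands after the RAM content in the stream.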

* [PATCH v3 5/7] implementation of vm_start() BH
  2020-11-19 12:59 [PATCH v3 0/7] UFFD write-tracking migration/snapshots Andrey Gruzdev via
                   ` (3 preceding siblings ...)
  2020-11-19 12:59 ` [PATCH v3 4/7] implementation of write-tracking migration thread Andrey Gruzdev via
@ 2020-11-19 12:59 ` Andrey Gruzdev via
  2020-11-19 18:46   ` Peter Xu
  2020-11-19 12:59 ` [PATCH v3 6/7] the rest of write tracking migration code Andrey Gruzdev via
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 44+ messages in thread
From: Andrey Gruzdev via @ 2020-11-19 12:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

To avoid saving updated versions of memory pages we need
to start tracking RAM writes before we resume operation of
vCPUs. This sequence is especially critical for virtio device
backends whose VQs are mapped to main memory and accessed
directly, not through MMIO callbacks.

One problem is that the vm_start() routine calls state
change notifier callbacks directly from itself. Virtio drivers
do some syncing/flushing of VQs in their notifier routines.
Since we poll UFFD and process faults on the same thread,
the thread deadlocks in vm_start() if we try to call it from
the migration thread.

The solution is to call ram_write_tracking_start() directly
from the migration thread and then schedule a BH for vm_start().

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 migration/migration.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index 158e5441ec..dba388f8bd 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3716,7 +3716,13 @@ static void *migration_thread(void *opaque)
 
 static void wt_migration_vm_start_bh(void *opaque)
 {
-    /* TODO: implement */
+    MigrationState *s = opaque;
+
+    qemu_bh_delete(s->wt_vm_start_bh);
+    s->wt_vm_start_bh = NULL;
+
+    vm_start();
+    s->downtime = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - s->downtime_start;
 }
 
 /*
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread
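The deadlock-avoidance pattern in this patch, deferring vm_start() to a bottom half instead of invoking it on the same thread that resolves UFFD write faults, can be sketched as below. All names here (`qemu_bh_schedule_sketch`, `main_loop_run_bhs`) are hypothetical stand-ins for QEMU's BH machinery, not its real API.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for QEMU's bottom-half mechanism (illustrative only). */
typedef void (*BHFunc)(void *opaque);

static BHFunc pending_bh;
static void *pending_opaque;

static void qemu_bh_schedule_sketch(BHFunc fn, void *opaque)
{
    pending_bh = fn;         /* deferred: runs later from the main loop, */
    pending_opaque = opaque; /* not from the caller's stack */
}

static void main_loop_run_bhs(void)
{
    if (pending_bh) {
        BHFunc fn = pending_bh;
        pending_bh = NULL;
        fn(pending_opaque);
    }
}

static int vm_running;

static void wt_vm_start_bh_sketch(void *opaque)
{
    (void)opaque;
    vm_running = 1; /* in the patch: vm_start() + downtime accounting */
}
```

The point of the indirection is visible in the sketch: scheduling does not run the callback, so the scheduling thread never re-enters code (virtio notifier callbacks) that could fault on write-protected memory it must itself service.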

* [PATCH v3 6/7] the rest of write tracking migration code
  2020-11-19 12:59 [PATCH v3 0/7] UFFD write-tracking migration/snapshots Andrey Gruzdev via
                   ` (4 preceding siblings ...)
  2020-11-19 12:59 ` [PATCH v3 5/7] implementation of vm_start() BH Andrey Gruzdev via
@ 2020-11-19 12:59 ` Andrey Gruzdev via
  2020-11-19 12:59 ` [PATCH v3 7/7] introduce simple linear scan rate limiting mechanism Andrey Gruzdev via
  2020-11-24 16:41 ` [PATCH v3 0/7] UFFD write-tracking migration/snapshots Dr. David Alan Gilbert
  7 siblings, 0 replies; 44+ messages in thread
From: Andrey Gruzdev via @ 2020-11-19 12:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 migration/migration.c | 72 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 70 insertions(+), 2 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index dba388f8bd..ccb451b238 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3212,6 +3212,48 @@ fail:
                       MIGRATION_STATUS_FAILED);
 }
 
+/**
+ * wt_migration_completion: Used by wt_migration_thread after all the
+ *   RAM has been saved. The caller 'breaks' the loop when this returns.
+ *
+ * @s: Current migration state
+ */
+static void wt_migration_completion(MigrationState *s)
+{
+    int current_active_state = s->state;
+
+    /*
+     * Stop tracking RAM writes - un-protect memory, un-register UFFD
+     * memory ranges, flush kernel wait queues and wake up threads waiting
+     * for a write fault to be resolved. All of this is essentially done
+     * by closing the UFFD file descriptor.
+     */
+    ram_write_tracking_stop();
+
+    if (s->state == MIGRATION_STATUS_ACTIVE) {
+        /*
+         * By this moment we have RAM content saved into the migration stream.
+         * The next step is to flush the non-RAM content (device state)
+         * right after the ram content. The device state has been stored into
+         * the temporary buffer before RAM saving started.
+         */
+        qemu_put_buffer(s->to_dst_file, s->bioc->data, s->bioc->usage);
+        qemu_fflush(s->to_dst_file);
+    } else if (s->state == MIGRATION_STATUS_CANCELLING) {
+        goto fail;
+    }
+
+    if (qemu_file_get_error(s->to_dst_file)) {
+        trace_migration_completion_file_err();
+        goto fail;
+    }
+
+    migrate_set_state(&s->state, current_active_state,
+                      MIGRATION_STATUS_COMPLETED);
+    return;
+
+fail:
+    migrate_set_state(&s->state, current_active_state,
+                      MIGRATION_STATUS_FAILED);
+}
+
 bool migrate_colo_enabled(void)
 {
     MigrationState *s = migrate_get_current();
@@ -3554,7 +3596,26 @@ static void migration_iteration_finish(MigrationState *s)
 
 static void wt_migration_iteration_finish(MigrationState *s)
 {
-    /* TODO: implement */
+    qemu_mutex_lock_iothread();
+    switch (s->state) {
+    case MIGRATION_STATUS_COMPLETED:
+        migration_calculate_complete(s);
+        break;
+
+    case MIGRATION_STATUS_ACTIVE:
+    case MIGRATION_STATUS_FAILED:
+    case MIGRATION_STATUS_CANCELLED:
+    case MIGRATION_STATUS_CANCELLING:
+        break;
+
+    default:
+        /* Should not reach here, but if so, forgive the VM. */
+        error_report("%s: Unknown ending state %d", __func__, s->state);
+        break;
+    }
+
+    migrate_fd_cleanup_schedule(s);
+    qemu_mutex_unlock_iothread();
 }
 
 /*
@@ -3563,7 +3624,14 @@ static void wt_migration_iteration_finish(MigrationState *s)
  */
 static MigIterateState wt_migration_iteration_run(MigrationState *s)
 {
-    /* TODO: implement */
+    int res;
+
+    res = qemu_savevm_state_iterate(s->to_dst_file, false);
+    if (res) {
+        wt_migration_completion(s);
+        return MIG_ITERATE_BREAK;
+    }
+
     return MIG_ITERATE_RESUME;
 }
 
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 7/7] introduce simple linear scan rate limiting mechanism
  2020-11-19 12:59 [PATCH v3 0/7] UFFD write-tracking migration/snapshots Andrey Gruzdev via
                   ` (5 preceding siblings ...)
  2020-11-19 12:59 ` [PATCH v3 6/7] the rest of write tracking migration code Andrey Gruzdev via
@ 2020-11-19 12:59 ` Andrey Gruzdev via
  2020-11-19 20:02   ` Peter Xu
  2020-11-24 16:41 ` [PATCH v3 0/7] UFFD write-tracking migration/snapshots Dr. David Alan Gilbert
  7 siblings, 1 reply; 44+ messages in thread
From: Andrey Gruzdev via @ 2020-11-19 12:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

Since reading UFFD events and saving page data are performed
from the same thread, write fault latencies are sensitive to
migration stream stalls. Limiting the total page saving rate is
a method to reduce the number of noticeable fault resolution
latencies.

Migration bandwidth limiting is achieved by detecting cases of
out-of-threshold write fault latencies and temporarily disabling
(strictly speaking, severely throttling) saving of non-faulting pages.

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 migration/ram.c | 58 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 54 insertions(+), 4 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 08a1d7a252..89fe106585 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -325,6 +325,10 @@ struct RAMState {
     /* these variables are used for bitmap sync */
     /* last time we did a full bitmap_sync */
     int64_t time_last_bitmap_sync;
+    /* last time a UFFD fault occurred */
+    int64_t last_fault_ns;
+    /* linear scan throttling counter */
+    int throttle_skip_counter;
     /* bytes transferred at start_time */
     uint64_t bytes_xfer_prev;
     /* number of dirty pages since start_time */
@@ -576,9 +580,6 @@ static int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp)
     return 0;
 }
 
-__attribute__ ((unused))
-static bool uffd_poll_events(int uffd, int tmo);
-
 /**
  * uffd_read_events: read pending UFFD events
  *
@@ -2006,9 +2007,51 @@ static bool get_fault_page(RAMState *rs, PageSearchStatus *pss)
         return false;
     }
 
+    rs->last_fault_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
     return true;
 }
 
+#define FAULT_HIGH_LATENCY_NS   5000000     /* 5 ms */
+#define SLOW_FAULT_POLL_TMO     5           /* 5 ms */
+#define SLOW_FAULT_SKIP_PAGES   200
+
+/**
+ * limit_scan_rate: limit RAM linear scan rate in case of growing write fault
+ *  latencies, used in write-tracking migration implementation
+ *
+ * @rs: current RAM state
+ *
+ */
+static void limit_scan_rate(RAMState *rs)
+{
+    int64_t last_fault_latency_ns = 0;
+
+    if (!rs->ram_wt_enabled) {
+        return;
+    }
+
+    /* Check if last write fault time is available */
+    if (rs->last_fault_ns) {
+        last_fault_latency_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) -
+                rs->last_fault_ns;
+        rs->last_fault_ns = 0;
+    }
+
+    /*
+     * If the last fault time was available and we have a latency
+     * value, check whether it is too high.
+     */
+    if (last_fault_latency_ns > FAULT_HIGH_LATENCY_NS) {
+        /* Reset counter after each slow write fault */
+        rs->throttle_skip_counter = SLOW_FAULT_SKIP_PAGES;
+    }
+    /*
+     * Delay thread execution until the next write fault occurs or the
+     * timeout expires. The next SLOW_FAULT_SKIP_PAGES saved pages can then
+     * only be write-fault pages, not pages coming from the linear scan
+     * logic. Thus we moderate the migration stream rate to reduce latencies.
+     */
+    if (rs->throttle_skip_counter > 0) {
+        uffd_poll_events(rs->uffdio_fd, SLOW_FAULT_POLL_TMO);
+        rs->throttle_skip_counter--;
+    }
+}
+
 /**
  * ram_find_and_save_block: finds a dirty page and sends it to f
  *
@@ -2078,6 +2121,9 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
                 if (res < 0) {
                     break;
                 }
+
+                /* Linear scan rate limiting */
+                limit_scan_rate(rs);
             }
         }
     } while (!pages && again);
@@ -2191,12 +2237,15 @@ static void ram_state_reset(RAMState *rs)
     rs->last_sent_block = NULL;
     rs->last_page = 0;
     rs->last_version = ram_list.version;
+    rs->last_fault_ns = 0;
+    rs->throttle_skip_counter = 0;
     rs->ram_wt_enabled = migrate_track_writes_ram();
     rs->ram_bulk_stage = !rs->ram_wt_enabled;
     rs->fpo_enabled = false;
 }
 
 #define MAX_WAIT 50 /* ms, half buffered_file limit */
+#define WT_MAX_WAIT 1000 /* 1000 ms, a bigger limit is needed for 'write-tracking' migration */
 
 /*
  * 'expected' is the value you expect the bitmap mostly to be full
@@ -2872,7 +2921,8 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
             if ((i & 63) == 0) {
                 uint64_t t1 = (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - t0) /
                               1000000;
-                if (t1 > MAX_WAIT) {
+                uint64_t max_wait = rs->ram_wt_enabled ? WT_MAX_WAIT : MAX_WAIT;
+                if (t1 > max_wait) {
                     trace_ram_save_iterate_big_wait(t1, i);
                     break;
                 }
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread
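The throttling decision in limit_scan_rate() can be sketched as a pure function with injected timestamps, which makes the two-part logic testable: a slow fault arms a skip counter, and while the counter drains the linear scan yields (in the patch, by polling UFFD with a timeout instead of saving a scanned page). `ScanState` and `limit_scan_rate_sketch` are simplified stand-ins, not the patch's actual structures.

```c
#include <assert.h>
#include <stdint.h>

#define FAULT_HIGH_LATENCY_NS 5000000 /* 5 ms, as in the patch */
#define SLOW_FAULT_SKIP_PAGES 200

typedef struct {
    int64_t last_fault_ns;     /* timestamp of last write fault, 0 if none */
    int throttle_skip_counter; /* pages for which linear scan is throttled */
} ScanState;

/*
 * Sketch of limit_scan_rate(): if the latest write fault was resolved
 * with a latency above the threshold, throttle the linear scan for the
 * next SLOW_FAULT_SKIP_PAGES pages. Returns 1 when the caller should
 * poll UFFD instead of saving a linearly-scanned page.
 */
static int limit_scan_rate_sketch(ScanState *s, int64_t now_ns)
{
    if (s->last_fault_ns) {
        int64_t latency = now_ns - s->last_fault_ns;
        s->last_fault_ns = 0;
        if (latency > FAULT_HIGH_LATENCY_NS) {
            /* Reset the counter after each slow write fault */
            s->throttle_skip_counter = SLOW_FAULT_SKIP_PAGES;
        }
    }
    if (s->throttle_skip_counter > 0) {
        s->throttle_skip_counter--;
        return 1;
    }
    return 0;
}
```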

* Re: [PATCH v3 3/7] support UFFD write fault processing in ram_save_iterate()
  2020-11-19 12:59 ` [PATCH v3 3/7] support UFFD write fault processing in ram_save_iterate() Andrey Gruzdev via
@ 2020-11-19 18:25   ` Peter Xu
  2020-11-20 10:44     ` Andrey Gruzdev
  2020-11-25 13:08   ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 44+ messages in thread
From: Peter Xu @ 2020-11-19 18:25 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Thu, Nov 19, 2020 at 03:59:36PM +0300, Andrey Gruzdev via wrote:

[...]

> +/**
> + * ram_find_block_by_host_address: find RAM block containing host page
> + *
> + * Returns true if RAM block is found and pss->block/page are
> + * pointing to the given host page, false in case of an error
> + *
> + * @rs: current RAM state
> + * @pss: page-search-status structure
> + */
> +static bool ram_find_block_by_host_address(RAMState *rs, PageSearchStatus *pss,
> +        hwaddr page_address)
> +{
> +    bool found = false;
> +
> +    pss->block = rs->last_seen_block;
> +    do {
> +        if (page_address >= (hwaddr) pss->block->host &&
> +            (page_address + TARGET_PAGE_SIZE) <=
> +                    ((hwaddr) pss->block->host + pss->block->used_length)) {
> +            pss->page = (unsigned long)
> +                    ((page_address - (hwaddr) pss->block->host) >> TARGET_PAGE_BITS);
> +            found = true;
> +            break;
> +        }
> +
> +        pss->block = QLIST_NEXT_RCU(pss->block, next);
> +        if (!pss->block) {
> +            /* Hit the end of the list */
> +            pss->block = QLIST_FIRST_RCU(&ram_list.blocks);
> +        }
> +    } while (pss->block != rs->last_seen_block);
> +
> +    rs->last_seen_block = pss->block;
> +    /*
> +     * Since we are in the same loop with ram_find_and_save_block(),
> +     * need to reset pss->complete_round after switching to
> +     * other block/page in pss.
> +     */
> +    pss->complete_round = false;
> +
> +    return found;
> +}

I forgot whether Denis and I have discussed this, but I'll try anyways... do
you think we can avoid touching PageSearchStatus at all?

PageSearchStatus is used to track a single migration iteration for precopy, so
that we scan from the 1st ramblock until the last one.  Then we finish one
iteration.

Snapshot is really something, imho, that can easily leverage this structure
without touching it - basically we want to do two things:

  - Do the 1st iteration of precopy (when ram_bulk_stage==true), and do that
    only.  We never need the 2nd, 3rd, ... iterations because we're snapshotting.

  - Leverage the postcopy queue mechanism so that when some page got written,
    queue that page.  We should have this queue higher priority than the
    precopy scanning mentioned above.

As long as we follow above rules, then after the above 1st round precopy, we're
simply done...  If that works, the whole logic of precopy and PageSearchStatus
does not need to be touched, iiuc.

[...]

> @@ -2086,7 +2191,8 @@ static void ram_state_reset(RAMState *rs)
>      rs->last_sent_block = NULL;
>      rs->last_page = 0;
>      rs->last_version = ram_list.version;
> -    rs->ram_bulk_stage = true;
> +    rs->ram_wt_enabled = migrate_track_writes_ram();

Maybe we don't need ram_wt_enabled, but just call migrate_track_writes_ram()
anywhere needed (actually, only in get_fault_page, once).

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 44+ messages in thread
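The wrap-around search that the quoted ram_find_block_by_host_address() performs, starting from the last matched block as a locality optimization and circling back to the list head, can be sketched over a flat array. `Block` and `find_block` are hypothetical simplifications of the patch's RAMBlock list walk, not QEMU code.

```c
#include <stdint.h>

/* Hypothetical flattened view of the RAM block list from the patch. */
typedef struct {
    uintptr_t host;        /* start of the block's host mapping */
    uintptr_t used_length; /* bytes in use */
} Block;

/*
 * Circular search starting from the last block that matched (mirroring
 * rs->last_seen_block), wrapping to the beginning when the end of the
 * list is reached. Returns the matching index, or -1 when the address
 * range [addr, addr + page_size) fits in no block.
 */
static int find_block(const Block *blocks, int nblocks, int last_seen,
                      uintptr_t addr, uintptr_t page_size)
{
    int i = last_seen;
    do {
        if (addr >= blocks[i].host &&
            addr + page_size <= blocks[i].host + blocks[i].used_length) {
            return i;
        }
        i = (i + 1) % nblocks; /* QLIST_NEXT_RCU, or wrap to the list head */
    } while (i != last_seen);
    return -1;
}
```

Because the loop condition compares against the starting index, the search visits every block exactly once even when it begins mid-list, which is what keeps the miss case terminating.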

* Re: [PATCH v3 2/7] introduce UFFD-WP low-level interface helpers
  2020-11-19 12:59 ` [PATCH v3 2/7] introduce UFFD-WP low-level interface helpers Andrey Gruzdev via
@ 2020-11-19 18:39   ` Peter Xu
  2020-11-20 11:04     ` Andrey Gruzdev
  2020-11-24 17:57   ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 44+ messages in thread
From: Peter Xu @ 2020-11-19 18:39 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Thu, Nov 19, 2020 at 03:59:35PM +0300, Andrey Gruzdev via wrote:
> +/**
> + * uffd_register_memory: register memory range with UFFD
> + *
> + * Returns 0 in case of success, negative value on error
> + *
> + * @uffd: UFFD file descriptor
> + * @start: starting virtual address of memory range
> + * @length: length of memory range
> + * @track_missing: generate events on missing-page faults
> + * @track_wp: generate events on write-protected-page faults
> + */
> +static int uffd_register_memory(int uffd, hwaddr start, hwaddr length,
> +        bool track_missing, bool track_wp)
> +{
> +    struct uffdio_register uffd_register;
> +
> +    uffd_register.range.start = start;
> +    uffd_register.range.len = length;
> +    uffd_register.mode = (track_missing ? UFFDIO_REGISTER_MODE_MISSING : 0) |
> +                         (track_wp ? UFFDIO_REGISTER_MODE_WP : 0);
> +
> +    if (ioctl(uffd, UFFDIO_REGISTER, &uffd_register)) {
> +        error_report("uffd_register_memory() failed: "
> +                "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
> +                start, length, uffd_register.mode, errno);
> +        return -1;
> +    }
> +
> +    return 0;
> +}

These functions look good; we should even be able to refactor the existing
ones, e.g., ram_block_enable_notify(), but we can also do that later.  As a
start, we can move these helpers into some common files under util/.

[...]

> +/**
> + * ram_write_tracking_start: start UFFD-WP memory tracking
> + *
> + * Returns 0 for success or negative value in case of error
> + *
> + */
> +int ram_write_tracking_start(void)
> +{

Need to be slightly careful on unwind on this function, because if it fails
somehow we don't want to crash the existing running good vm... more below.

> +    int uffd;
> +    RAMState *rs = ram_state;
> +    RAMBlock *bs;
> +
> +    /* Open UFFD file descriptor */
> +    uffd = uffd_create_fd();
> +    if (uffd < 0) {
> +        return uffd;
> +    }
> +    rs->uffdio_fd = uffd;
> +
> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> +        /* Nothing to do with read-only and MMIO-writable regions */
> +        if (bs->mr->readonly || bs->mr->rom_device) {
> +            continue;
> +        }
> +
> +        /* Register block memory with UFFD to track writes */
> +        if (uffd_register_memory(rs->uffdio_fd, (hwaddr) bs->host,
> +                bs->max_length, false, true)) {
> +            goto fail;
> +        }
> +        /* Apply UFFD write protection to the block memory range */
> +        if (uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host,
> +                bs->max_length, true)) {

Here logically we need to undo the previous register first, however userfaultfd
will auto-clean these when close(fd), so it's ok.  However still better to
unwind the protection of pages, I think.  So...

> +            goto fail;
> +        }
> +        bs->flags |= RAM_UF_WRITEPROTECT;
> +
> +        info_report("UFFD-WP write-tracking enabled: "
> +                "block_id=%s page_size=%zu start=%p length=%lu "
> +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
> +                bs->idstr, bs->page_size, bs->host, bs->max_length,
> +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
> +                bs->mr->nonvolatile, bs->mr->rom_device);
> +    }
> +
> +    return 0;
> +
> +fail:
> +    uffd_close_fd(uffd);

... maybe do the unprotect here together, that as long as any of the above step
failed, we need to remember to unprotect all the protected pages (or just
unprotect everything).  And also the RAM_UF_WRITEPROTECT flags being set.

> +    rs->uffdio_fd = -1;
> +    return -1;
> +}
> +
> +/**
> + * ram_write_tracking_stop: stop UFFD-WP memory tracking and remove protection

Didn't remove protections, yet?

We should remove those.  For a succeeded snapshot we can avoid that (if we want
such optimization), or imho we'd better unprotect all just in case the user
interrupted the snapshot.

> + */
> +void ram_write_tracking_stop(void)
> +{
> +    RAMState *rs = ram_state;
> +    RAMBlock *bs;
> +    assert(rs->uffdio_fd >= 0);
> +
> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
> +            continue;
> +        }
> +        info_report("UFFD-WP write-tracking disabled: "
> +                "block_id=%s page_size=%zu start=%p length=%lu "
> +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
> +                bs->idstr, bs->page_size, bs->host, bs->max_length,
> +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
> +                bs->mr->nonvolatile, bs->mr->rom_device);
> +        /* Cleanup flags */
> +        bs->flags &= ~RAM_UF_WRITEPROTECT;
> +    }
> +
> +    /*
> +     * Close UFFD file descriptor to remove protection,
> +     * release registered memory regions and flush wait queues
> +     */
> +    uffd_close_fd(rs->uffdio_fd);
> +    rs->uffdio_fd = -1;
> +}

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 44+ messages in thread
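The unwind concern raised in this review, that a failure while protecting block N should undo the protection already applied to blocks 0..N-1 rather than leave them write-protected with no tracker to resolve their faults, is a standard partial-failure pattern. A minimal sketch, with `protect_block`/`tracking_start` as hypothetical stand-ins for the patch's per-block register/protect loop:

```c
#include <stdbool.h>

#define NBLOCKS 4

/* Simulated per-block tracking state (stand-in for RAM_UF_WRITEPROTECT). */
static bool protected_blk[NBLOCKS];

/* Stub that fails for one chosen block to exercise the unwind path. */
static int protect_block(int i, int failing)
{
    if (i == failing) {
        return -1;
    }
    protected_blk[i] = true;
    return 0;
}

/*
 * Register/protect all blocks; on failure, unwind the protection applied
 * so far (and the flag), as the review suggests, before reporting the
 * error to the caller.
 */
static int tracking_start(int failing)
{
    for (int i = 0; i < NBLOCKS; i++) {
        if (protect_block(i, failing)) {
            for (int j = 0; j < i; j++) {
                protected_blk[j] = false; /* un-protect + clear flag */
            }
            return -1;
        }
    }
    return 0;
}
```

As the review notes, closing the userfaultfd also drops registrations in the real code, so the sketch models only the explicit un-protect/flag-clearing half of the cleanup.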

* Re: [PATCH v3 5/7] implementation of vm_start() BH
  2020-11-19 12:59 ` [PATCH v3 5/7] implementation of vm_start() BH Andrey Gruzdev via
@ 2020-11-19 18:46   ` Peter Xu
  2020-11-20 11:13     ` Andrey Gruzdev
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Xu @ 2020-11-19 18:46 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Thu, Nov 19, 2020 at 03:59:38PM +0300, Andrey Gruzdev wrote:
> To avoid saving updated versions of memory pages we need
> to start tracking RAM writes before we resume operation of
> vCPUs. This sequence is especially critical for virtio device
> backends whos VQs are mapped to main memory and accessed
> directly not using MMIO callbacks.
> 
> One problem is that vm_start() routine makes calls state
> change notifier callbacks directly from itself. Virtio drivers
> do some stuff with syncing/flusing VQs in its notifier routines.
> Since we poll UFFD and process faults on the same thread, that
> leads to the situation when the thread locks in vm_start()
> if we try to call it from the migration thread.

There's a nice comment in previous patch about this before the bottom half
created, thanks, that's helpful.  Though IMHO this patch can directly be
squashed into previous one, since it's confusing with the comment there but
without doing anything about it.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 4/7] implementation of write-tracking migration thread
  2020-11-19 12:59 ` [PATCH v3 4/7] implementation of write-tracking migration thread Andrey Gruzdev via
@ 2020-11-19 18:47   ` Peter Xu
  2020-11-20 11:41     ` Andrey Gruzdev
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Xu @ 2020-11-19 18:47 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Thu, Nov 19, 2020 at 03:59:37PM +0300, Andrey Gruzdev via wrote:
> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>

Some commit message would always be appreciated...  Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 1/7] introduce 'track-writes-ram' migration capability
  2020-11-19 12:59 ` [PATCH v3 1/7] introduce 'track-writes-ram' migration capability Andrey Gruzdev via
@ 2020-11-19 18:51   ` Peter Xu
  2020-11-19 19:07     ` Peter Xu
  2020-11-20 11:32     ` Andrey Gruzdev
  0 siblings, 2 replies; 44+ messages in thread
From: Peter Xu @ 2020-11-19 18:51 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Thu, Nov 19, 2020 at 03:59:34PM +0300, Andrey Gruzdev via wrote:
> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> ---
>  migration/migration.c | 96 +++++++++++++++++++++++++++++++++++++++++++
>  migration/migration.h |  1 +
>  qapi/migration.json   |  7 +++-
>  3 files changed, 103 insertions(+), 1 deletion(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 87a9b59f83..ff0364dde0 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -56,6 +56,7 @@
>  #include "net/announce.h"
>  #include "qemu/queue.h"
>  #include "multifd.h"
> +#include "sysemu/cpus.h"
>  
>  #ifdef CONFIG_VFIO
>  #include "hw/vfio/vfio-common.h"
> @@ -1165,6 +1166,91 @@ static bool migrate_caps_check(bool *cap_list,
>          }
>      }
>  
> +    if (cap_list[MIGRATION_CAPABILITY_TRACK_WRITES_RAM]) {
> +        if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
> +            error_setg(errp,
> +                    "Track-writes is not compatible with postcopy-ram");
> +            return false;
> +        }
> +
> +        if (cap_list[MIGRATION_CAPABILITY_DIRTY_BITMAPS]) {
> +            error_setg(errp,
> +                    "Track-writes is not compatible with dirty-bitmaps");
> +            return false;
> +        }
> +
> +        if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_BLOCKTIME]) {
> +            error_setg(errp,
> +                    "Track-writes is not compatible with postcopy-blocktime");
> +            return false;
> +        }
> +
> +        if (cap_list[MIGRATION_CAPABILITY_LATE_BLOCK_ACTIVATE]) {
> +            error_setg(errp,
> +                    "Track-writes is not compatible with late-block-activate");
> +            return false;
> +        }
> +
> +        if (cap_list[MIGRATION_CAPABILITY_RETURN_PATH]) {
> +            error_setg(errp,
> +                    "Track-writes is not compatible with return-path");
> +            return false;
> +        }
> +
> +        if (cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
> +            error_setg(errp, "Track-writes is not compatible with multifd");
> +            return false;
> +        }
> +
> +        if (cap_list[MIGRATION_CAPABILITY_PAUSE_BEFORE_SWITCHOVER]) {
> +            error_setg(errp,
> +                    "Track-writes is not compatible with pause-before-switchover");
> +            return false;
> +        }
> +
> +        if (cap_list[MIGRATION_CAPABILITY_AUTO_CONVERGE]) {
> +            error_setg(errp,
> +                    "Track-writes is not compatible with auto-converge");
> +            return false;
> +        }
> +
> +        if (cap_list[MIGRATION_CAPABILITY_RELEASE_RAM]) {
> +            error_setg(errp,
> +                    "Track-writes is not compatible with release-ram");
> +            return false;
> +        }
> +
> +        if (cap_list[MIGRATION_CAPABILITY_RDMA_PIN_ALL]) {
> +            error_setg(errp,
> +                    "Track-writes is not compatible with rdma-pin-all");
> +            return false;
> +        }
> +
> +        if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
> +            error_setg(errp,
> +                    "Track-writes is not compatible with compression");
> +            return false;
> +        }
> +
> +        if (cap_list[MIGRATION_CAPABILITY_XBZRLE]) {
> +            error_setg(errp,
> +                    "Track-writes is not compatible with XBZRLE");
> +            return false;
> +        }
> +
> +        if (cap_list[MIGRATION_CAPABILITY_X_COLO]) {
> +            error_setg(errp,
> +                    "Track-writes is not compatible with x-colo");
> +            return false;
> +        }
> +
> +        if (cap_list[MIGRATION_CAPABILITY_VALIDATE_UUID]) {
> +            error_setg(errp,
> +                    "Track-writes is not compatible with validate-uuid");
> +            return false;
> +        }
> +    }
> +
>      return true;
>  }
>  
> @@ -2490,6 +2576,15 @@ bool migrate_use_block_incremental(void)
>      return s->parameters.block_incremental;
>  }
>  
> +bool migrate_track_writes_ram(void)
> +{
> +    MigrationState *s;
> +
> +    s = migrate_get_current();
> +
> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_TRACK_WRITES_RAM];
> +}
> +
>  /* migration thread support */
>  /*
>   * Something bad happened to the RP stream, mark an error
> @@ -3783,6 +3878,7 @@ static Property migration_properties[] = {
>      DEFINE_PROP_MIG_CAP("x-block", MIGRATION_CAPABILITY_BLOCK),
>      DEFINE_PROP_MIG_CAP("x-return-path", MIGRATION_CAPABILITY_RETURN_PATH),
>      DEFINE_PROP_MIG_CAP("x-multifd", MIGRATION_CAPABILITY_MULTIFD),
> +    DEFINE_PROP_MIG_CAP("x-track-writes-ram", MIGRATION_CAPABILITY_TRACK_WRITES_RAM),
>  
>      DEFINE_PROP_END_OF_LIST(),
>  };
> diff --git a/migration/migration.h b/migration/migration.h
> index d096b77f74..339ae720e0 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -341,6 +341,7 @@ int migrate_compress_wait_thread(void);
>  int migrate_decompress_threads(void);
>  bool migrate_use_events(void);
>  bool migrate_postcopy_blocktime(void);
> +bool migrate_track_writes_ram(void);
>  
>  /* Sending on the return path - generic and then for each message type */
>  void migrate_send_rp_shut(MigrationIncomingState *mis,
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 3c75820527..a28d8b7ee8 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -442,6 +442,11 @@
>  # @validate-uuid: Send the UUID of the source to allow the destination
>  #                 to ensure it is the same. (since 4.2)
>  #
> +# @track-writes-ram: If enabled, the migration stream will be a snapshot
> +#                    of the VM exactly at the point when the migration
> +#                    procedure starts. The VM RAM is saved while the VM is
> +#                    running. (since 6.0)
> +#

The name is slightly confusing to me.  Could I ask why it was changed from the
previous one?  "snapshot" sounds like a very necessary keyword here, telling
exactly what we do...  Because we could do quite a few things with
"track-writes-ram" other than snapshotting, e.g., calculating per-VM dirty rates.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 1/7] introduce 'track-writes-ram' migration capability
  2020-11-19 18:51   ` Peter Xu
@ 2020-11-19 19:07     ` Peter Xu
  2020-11-20 11:35       ` Andrey Gruzdev
  2020-11-24 16:55       ` Dr. David Alan Gilbert
  2020-11-20 11:32     ` Andrey Gruzdev
  1 sibling, 2 replies; 44+ messages in thread
From: Peter Xu @ 2020-11-19 19:07 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Thu, Nov 19, 2020 at 01:51:50PM -0500, Peter Xu wrote:
> On Thu, Nov 19, 2020 at 03:59:34PM +0300, Andrey Gruzdev via wrote:
> > Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> > ---
> >  migration/migration.c | 96 +++++++++++++++++++++++++++++++++++++++++++
> >  migration/migration.h |  1 +
> >  qapi/migration.json   |  7 +++-
> >  3 files changed, 103 insertions(+), 1 deletion(-)
> > 
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 87a9b59f83..ff0364dde0 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -56,6 +56,7 @@
> >  #include "net/announce.h"
> >  #include "qemu/queue.h"
> >  #include "multifd.h"
> > +#include "sysemu/cpus.h"
> >  
> >  #ifdef CONFIG_VFIO
> >  #include "hw/vfio/vfio-common.h"
> > @@ -1165,6 +1166,91 @@ static bool migrate_caps_check(bool *cap_list,
> >          }
> >      }
> >  
> > +    if (cap_list[MIGRATION_CAPABILITY_TRACK_WRITES_RAM]) {
> > +        if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
> > +            error_setg(errp,
> > +                    "Track-writes is not compatible with postcopy-ram");
> > +            return false;
> > +        }
> > +
> > +        if (cap_list[MIGRATION_CAPABILITY_DIRTY_BITMAPS]) {
> > +            error_setg(errp,
> > +                    "Track-writes is not compatible with dirty-bitmaps");
> > +            return false;
> > +        }
> > +
> > +        if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_BLOCKTIME]) {
> > +            error_setg(errp,
> > +                    "Track-writes is not compatible with postcopy-blocktime");
> > +            return false;
> > +        }
> > +
> > +        if (cap_list[MIGRATION_CAPABILITY_LATE_BLOCK_ACTIVATE]) {
> > +            error_setg(errp,
> > +                    "Track-writes is not compatible with late-block-activate");
> > +            return false;
> > +        }
> > +
> > +        if (cap_list[MIGRATION_CAPABILITY_RETURN_PATH]) {
> > +            error_setg(errp,
> > +                    "Track-writes is not compatible with return-path");
> > +            return false;
> > +        }
> > +
> > +        if (cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
> > +            error_setg(errp, "Track-writes is not compatible with multifd");
> > +            return false;
> > +        }
> > +
> > +        if (cap_list[MIGRATION_CAPABILITY_PAUSE_BEFORE_SWITCHOVER]) {
> > +            error_setg(errp,
> > +                    "Track-writes is not compatible with pause-before-switchover");
> > +            return false;
> > +        }
> > +
> > +        if (cap_list[MIGRATION_CAPABILITY_AUTO_CONVERGE]) {
> > +            error_setg(errp,
> > +                    "Track-writes is not compatible with auto-converge");
> > +            return false;
> > +        }
> > +
> > +        if (cap_list[MIGRATION_CAPABILITY_RELEASE_RAM]) {
> > +            error_setg(errp,
> > +                    "Track-writes is not compatible with release-ram");
> > +            return false;
> > +        }
> > +
> > +        if (cap_list[MIGRATION_CAPABILITY_RDMA_PIN_ALL]) {
> > +            error_setg(errp,
> > +                    "Track-writes is not compatible with rdma-pin-all");
> > +            return false;
> > +        }
> > +
> > +        if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
> > +            error_setg(errp,
> > +                    "Track-writes is not compatible with compression");
> > +            return false;
> > +        }
> > +
> > +        if (cap_list[MIGRATION_CAPABILITY_XBZRLE]) {
> > +            error_setg(errp,
> > +                    "Track-writes is not compatible with XBZRLE");
> > +            return false;
> > +        }
> > +
> > +        if (cap_list[MIGRATION_CAPABILITY_X_COLO]) {
> > +            error_setg(errp,
> > +                    "Track-writes is not compatible with x-colo");
> > +            return false;
> > +        }
> > +
> > +        if (cap_list[MIGRATION_CAPABILITY_VALIDATE_UUID]) {
> > +            error_setg(errp,
> > +                    "Track-writes is not compatible with validate-uuid");
> > +            return false;
> > +        }

Another thing I forgot to mention - we can at least define an array of the
capabilities incompatible with live snapshot, so we just loop over it instead
of copy-pasting these lines...
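Peter's suggestion boils down to a data-driven check. A minimal standalone
sketch of the idea (the toy enum is trimmed to a few members, and
`snapshot_caps_check` is a hypothetical name, not QEMU code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for MigrationCapability; only a few members shown. */
typedef enum {
    CAP_POSTCOPY_RAM,
    CAP_MULTIFD,
    CAP_XBZRLE,
    CAP_TRACK_WRITES_RAM,
    CAP__MAX
} Capability;

/* Capabilities incompatible with track-writes-ram, as in the patch. */
static const Capability incompatible_with_snapshot[] = {
    CAP_POSTCOPY_RAM, CAP_MULTIFD, CAP_XBZRLE,
};

/* One loop instead of one if-block per capability. */
static bool snapshot_caps_check(const bool *cap_list)
{
    if (!cap_list[CAP_TRACK_WRITES_RAM]) {
        return true;
    }
    for (size_t i = 0; i < sizeof(incompatible_with_snapshot) /
                           sizeof(incompatible_with_snapshot[0]); i++) {
        if (cap_list[incompatible_with_snapshot[i]]) {
            return false;   /* real code would error_setg() here */
        }
    }
    return true;
}
```

The real patch would keep a per-capability `error_setg()` message, e.g. by
storing the capability name strings in the same table.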

> > +    }
> > +
> >      return true;
> >  }
> >  
> > @@ -2490,6 +2576,15 @@ bool migrate_use_block_incremental(void)
> >      return s->parameters.block_incremental;
> >  }
> >  
> > +bool migrate_track_writes_ram(void)
> > +{
> > +    MigrationState *s;
> > +
> > +    s = migrate_get_current();
> > +
> > +    return s->enabled_capabilities[MIGRATION_CAPABILITY_TRACK_WRITES_RAM];
> > +}
> > +
> >  /* migration thread support */
> >  /*
> >   * Something bad happened to the RP stream, mark an error
> > @@ -3783,6 +3878,7 @@ static Property migration_properties[] = {
> >      DEFINE_PROP_MIG_CAP("x-block", MIGRATION_CAPABILITY_BLOCK),
> >      DEFINE_PROP_MIG_CAP("x-return-path", MIGRATION_CAPABILITY_RETURN_PATH),
> >      DEFINE_PROP_MIG_CAP("x-multifd", MIGRATION_CAPABILITY_MULTIFD),
> > +    DEFINE_PROP_MIG_CAP("x-track-writes-ram", MIGRATION_CAPABILITY_TRACK_WRITES_RAM),
> >  
> >      DEFINE_PROP_END_OF_LIST(),
> >  };
> > diff --git a/migration/migration.h b/migration/migration.h
> > index d096b77f74..339ae720e0 100644
> > --- a/migration/migration.h
> > +++ b/migration/migration.h
> > @@ -341,6 +341,7 @@ int migrate_compress_wait_thread(void);
> >  int migrate_decompress_threads(void);
> >  bool migrate_use_events(void);
> >  bool migrate_postcopy_blocktime(void);
> > +bool migrate_track_writes_ram(void);
> >  
> >  /* Sending on the return path - generic and then for each message type */
> >  void migrate_send_rp_shut(MigrationIncomingState *mis,
> > diff --git a/qapi/migration.json b/qapi/migration.json
> > index 3c75820527..a28d8b7ee8 100644
> > --- a/qapi/migration.json
> > +++ b/qapi/migration.json
> > @@ -442,6 +442,11 @@
> >  # @validate-uuid: Send the UUID of the source to allow the destination
> >  #                 to ensure it is the same. (since 4.2)
> >  #
> > +# @track-writes-ram: If enabled, the migration stream will be a snapshot
> > +#                    of the VM exactly at the point when the migration
> > +#                    procedure starts. The VM RAM is saved while the VM is
> > +#                    running. (since 6.0)
> > +#
> 
> The name is slightly confusing to me.  Could I ask why it was changed from the
> previous one?  "snapshot" sounds like a very necessary keyword here, telling
> exactly what we do...  Because we could do quite a few things with
> "track-writes-ram" other than snapshotting, e.g., calculating per-VM dirty rates.
> 
> -- 
> Peter Xu

-- 
Peter Xu




* Re: [PATCH v3 7/7] introduce simple linear scan rate limiting mechanism
  2020-11-19 12:59 ` [PATCH v3 7/7] introduce simple linear scan rate limiting mechanism Andrey Gruzdev via
@ 2020-11-19 20:02   ` Peter Xu
  2020-11-20 12:06     ` Andrey Gruzdev
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Xu @ 2020-11-19 20:02 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Thu, Nov 19, 2020 at 03:59:40PM +0300, Andrey Gruzdev wrote:
> Since reading UFFD events and saving paged data are performed
> from the same thread, write fault latencies are sensitive to
> migration stream stalls. Limiting the total page saving rate is a
> method to reduce the number of noticeable fault resolution latencies.
> 
> Migration bandwidth limiting is achieved by detecting cases of
> out-of-threshold write fault latencies and temporarily disabling
> (strictly speaking, severely throttling) the saving of non-faulting pages.

Just curious: have you measured the average/max latency of wr-protected page
requests, or better, even their distribution?

I believe it should also be relevant to where the snapshot is stored, say, the
backend disk of your tests.  Is that a file on some fs?

I would expect the latency to still be good if, e.g., the throughput of the
backend file system is decent, even without a patch like this, but I might have
missed something..

In any case, it would be very nice if this patch could include a histogram or
the average/max latency, measured and compared before/after the patch is applied.
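The bookkeeping Peter asks about can be as small as a running max/average
accumulator per fault; a standalone sketch (all names hypothetical, latencies
in nanoseconds):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical accumulator for write-fault resolution latencies. */
typedef struct {
    uint64_t count;
    uint64_t sum_ns;
    uint64_t max_ns;
} FaultLatencyStats;

/* Record one fault's resolution latency. */
static void latency_record(FaultLatencyStats *st, uint64_t ns)
{
    st->count++;
    st->sum_ns += ns;
    if (ns > st->max_ns) {
        st->max_ns = ns;
    }
}

/* Average latency so far, 0 if nothing recorded. */
static uint64_t latency_avg_ns(const FaultLatencyStats *st)
{
    return st->count ? st->sum_ns / st->count : 0;
}
```

A histogram would additionally bucket `ns` by power-of-two ranges, but even
avg/max before and after the throttling patch would make the comparison
concrete.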

Thanks,

-- 
Peter Xu




* Re: [PATCH v3 3/7] support UFFD write fault processing in ram_save_iterate()
  2020-11-19 18:25   ` Peter Xu
@ 2020-11-20 10:44     ` Andrey Gruzdev
  2020-11-20 15:07       ` Peter Xu
  0 siblings, 1 reply; 44+ messages in thread
From: Andrey Gruzdev @ 2020-11-20 10:44 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster

On 19.11.2020 21:25, Peter Xu wrote:
> On Thu, Nov 19, 2020 at 03:59:36PM +0300, Andrey Gruzdev via wrote:
> 
> [...]
> 
>> +/**
>> + * ram_find_block_by_host_address: find RAM block containing host page
>> + *
>> + * Returns true if RAM block is found and pss->block/page are
>> + * pointing to the given host page, false in case of an error
>> + *
>> + * @rs: current RAM state
>> + * @pss: page-search-status structure
>> + */
>> +static bool ram_find_block_by_host_address(RAMState *rs, PageSearchStatus *pss,
>> +        hwaddr page_address)
>> +{
>> +    bool found = false;
>> +
>> +    pss->block = rs->last_seen_block;
>> +    do {
>> +        if (page_address >= (hwaddr) pss->block->host &&
>> +            (page_address + TARGET_PAGE_SIZE) <=
>> +                    ((hwaddr) pss->block->host + pss->block->used_length)) {
>> +            pss->page = (unsigned long)
>> +                    ((page_address - (hwaddr) pss->block->host) >> TARGET_PAGE_BITS);
>> +            found = true;
>> +            break;
>> +        }
>> +
>> +        pss->block = QLIST_NEXT_RCU(pss->block, next);
>> +        if (!pss->block) {
>> +            /* Hit the end of the list */
>> +            pss->block = QLIST_FIRST_RCU(&ram_list.blocks);
>> +        }
>> +    } while (pss->block != rs->last_seen_block);
>> +
>> +    rs->last_seen_block = pss->block;
>> +    /*
>> +     * Since we are in the same loop with ram_find_and_save_block(),
>> +     * need to reset pss->complete_round after switching to
>> +     * other block/page in pss.
>> +     */
>> +    pss->complete_round = false;
>> +
>> +    return found;
>> +}
> 
> I forgot whether Denis and I have discussed this, but I'll try anyway... do
> you think we can avoid touching PageSearchStatus at all?
> 
> PageSearchStatus is used to track a single migration iteration for precopy, so
> that we scan from the 1st ramblock until the last one.  Then we finish one
> iteration.
> 

Yes, my first idea also was to separate the normal iteration from the
write-fault page source completely and leave pss for the normal scan.. But
the other idea is to keep some locality with respect to the last write fault.
I mean, it seems more optimal to restart the normal scan on the page next to
the faulting one. In this case we can save and un-protect the neighborhood
faster and prevent many write faults.

> Snapshot is really something, imho, that can easily leverage this structure
> without touching it - basically we want to do two things:
> 
>    - Do the 1st iteration of precopy (when ram_bulk_stage==true), and do that
>      only.  We never need the 2nd, 3rd, ... iterations because we're snapshotting.
> 
>    - Leverage the postcopy queue mechanism so that when some page got written,
>      queue that page.  We should have this queue higher priority than the
>      precopy scanning mentioned above.
> 
> As long as we follow the above rules, then after that 1st precopy round, we're
> simply done...  If that works, the whole logic of precopy and PageSearchStatus
> does not need to be touched, iiuc.
> 
> [...]
> 

It's quite a good alternative, and I thought about using the postcopy page
queue, but that implementation won't consider the locality of writes..

What do you think?
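The locality argument above amounts to resuming the linear scan at the page
index of the last fault; a standalone sketch of that arithmetic, mirroring the
shift in the quoted ram_find_block_by_host_address() (a 4 KiB target page is
assumed):

```c
#include <assert.h>
#include <stdint.h>

#define TARGET_PAGE_BITS 12                         /* assumed: 4 KiB pages */
#define TARGET_PAGE_SIZE ((uint64_t)1 << TARGET_PAGE_BITS)

/* Given a block's host base address and a faulting host address inside
 * that block, compute the page index where the normal scan would resume,
 * so neighbouring pages get saved and un-protected first. */
static uint64_t fault_page_index(uint64_t block_host, uint64_t fault_addr)
{
    return (fault_addr - block_host) >> TARGET_PAGE_BITS;
}
```

With the postcopy-queue approach, the same effect could be approximated by
moving pss to this index whenever a queued fault page is popped.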

>> @@ -2086,7 +2191,8 @@ static void ram_state_reset(RAMState *rs)
>>       rs->last_sent_block = NULL;
>>       rs->last_page = 0;
>>       rs->last_version = ram_list.version;
>> -    rs->ram_bulk_stage = true;
>> +    rs->ram_wt_enabled = migrate_track_writes_ram();
> 
> Maybe we don't need ram_wt_enabled, but can just call migrate_track_writes_ram()
> wherever needed (actually, only in get_fault_page, once).
> 
> Thanks,
> 

Yes, I think you are right; we can avoid this additional field.

Thanks,

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com



* Re: [PATCH v3 2/7] introduce UFFD-WP low-level interface helpers
  2020-11-19 18:39   ` Peter Xu
@ 2020-11-20 11:04     ` Andrey Gruzdev
  2020-11-20 15:01       ` Peter Xu
  0 siblings, 1 reply; 44+ messages in thread
From: Andrey Gruzdev @ 2020-11-20 11:04 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster

On 19.11.2020 21:39, Peter Xu wrote:
> On Thu, Nov 19, 2020 at 03:59:35PM +0300, Andrey Gruzdev via wrote:
>> +/**
>> + * uffd_register_memory: register memory range with UFFD
>> + *
>> + * Returns 0 in case of success, negative value on error
>> + *
>> + * @uffd: UFFD file descriptor
>> + * @start: starting virtual address of memory range
>> + * @length: length of memory range
>> + * @track_missing: generate events on missing-page faults
>> + * @track_wp: generate events on write-protected-page faults
>> + */
>> +static int uffd_register_memory(int uffd, hwaddr start, hwaddr length,
>> +        bool track_missing, bool track_wp)
>> +{
>> +    struct uffdio_register uffd_register;
>> +
>> +    uffd_register.range.start = start;
>> +    uffd_register.range.len = length;
>> +    uffd_register.mode = (track_missing ? UFFDIO_REGISTER_MODE_MISSING : 0) |
>> +                         (track_wp ? UFFDIO_REGISTER_MODE_WP : 0);
>> +
>> +    if (ioctl(uffd, UFFDIO_REGISTER, &uffd_register)) {
>> +        error_report("uffd_register_memory() failed: "
>> +                "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
>> +                start, length, uffd_register.mode, errno);
>> +        return -1;
>> +    }
>> +
>> +    return 0;
>> +}
> 
> These functions look good; we should even be able to refactor the existing
> ones, e.g., ram_block_enable_notify(), but we can also do that later.  As a
> start, we can move these helpers into some common files under util/.
> 
> [...]
> 

Yep, agree.
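For reference, the core of the quoted uffd_register_memory() is the mode-mask
computation; it can be sketched standalone (mode-bit values duplicated from
the kernel UAPI header linux/userfaultfd.h so the sketch compiles anywhere;
the helper name is hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mode bits as defined by the kernel UAPI (linux/userfaultfd.h),
 * duplicated here so the sketch stands alone. */
#define UFFDIO_REGISTER_MODE_MISSING ((uint64_t)1 << 0)
#define UFFDIO_REGISTER_MODE_WP      ((uint64_t)1 << 1)

/* Translate the two bool parameters of uffd_register_memory()
 * into the UFFDIO_REGISTER ioctl mode mask. */
static uint64_t uffd_register_mode(bool track_missing, bool track_wp)
{
    return (track_missing ? UFFDIO_REGISTER_MODE_MISSING : 0) |
           (track_wp ? UFFDIO_REGISTER_MODE_WP : 0);
}
```

A common helper under util/ could take these bools, build the mask, and issue
the ioctl, letting both postcopy (missing mode) and snapshotting (wp mode)
share one code path.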

>> +/**
>> + * ram_write_tracking_start: start UFFD-WP memory tracking
>> + *
>> + * Returns 0 for success or negative value in case of error
>> + *
>> + */
>> +int ram_write_tracking_start(void)
>> +{
> 
> Need to be slightly careful on unwind on this function, because if it fails
> somehow we don't want to crash the existing running good vm... more below.
> 
>> +    int uffd;
>> +    RAMState *rs = ram_state;
>> +    RAMBlock *bs;
>> +
>> +    /* Open UFFD file descriptor */
>> +    uffd = uffd_create_fd();
>> +    if (uffd < 0) {
>> +        return uffd;
>> +    }
>> +    rs->uffdio_fd = uffd;
>> +
>> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
>> +        /* Nothing to do with read-only and MMIO-writable regions */
>> +        if (bs->mr->readonly || bs->mr->rom_device) {
>> +            continue;
>> +        }
>> +
>> +        /* Register block memory with UFFD to track writes */
>> +        if (uffd_register_memory(rs->uffdio_fd, (hwaddr) bs->host,
>> +                bs->max_length, false, true)) {
>> +            goto fail;
>> +        }
>> +        /* Apply UFFD write protection to the block memory range */
>> +        if (uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host,
>> +                bs->max_length, true)) {
> 
> Here logically we need to undo the previous register first; however, userfaultfd
> will auto-clean these on close(fd), so it's OK.  Still, it's better to
> unwind the page protection, I think.  So...
> 

It should auto-clean, but as an additional safety measure - yes.

>> +            goto fail;
>> +        }
>> +        bs->flags |= RAM_UF_WRITEPROTECT;
>> +
>> +        info_report("UFFD-WP write-tracking enabled: "
>> +                "block_id=%s page_size=%zu start=%p length=%lu "
>> +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
>> +                bs->idstr, bs->page_size, bs->host, bs->max_length,
>> +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
>> +                bs->mr->nonvolatile, bs->mr->rom_device);
>> +    }
>> +
>> +    return 0;
>> +
>> +fail:
>> +    uffd_close_fd(uffd);
> 
> ... maybe do the unprotect here together, so that if any of the above steps
> failed, we remember to unprotect all the protected pages (or just
> unprotect everything).  And also reset the RAM_UF_WRITEPROTECT flags that were set.
> 

Not resetting RAM_UF_WRITEPROTECT is certainly a bug here.

But it seems that simply calling close() on the UFFD file descriptor does
all the rest of the cleanup for us - un-protects registered memory regions,
removes all extra state from the kernel, etc. I never had a problem with a
simple close(uffd) for cleanup.. But maybe it's really safer to do the
unprotect/unregister explicitly.

>> +    rs->uffdio_fd = -1;
>> +    return -1;
>> +}
>> +
>> +/**
>> + * ram_write_tracking_stop: stop UFFD-WP memory tracking and remove protection
> 
> Didn't remove protections, yet?
> 
> We should remove those.  For a successful snapshot we can avoid that (if we want
> such an optimization), but imho we'd better unprotect everything just in case
> the user interrupted the snapshot.
> 

It seems that closing the UFFD descriptor does the unprotect for us implicitly..
Am I wrong?
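An explicit unwind, as discussed above, would loop over the blocks before
closing the descriptor; a toy sketch (the QEMU uffd calls are replaced by
comments, the struct and flag value are illustrative stand-ins):

```c
#include <assert.h>

#define RAM_UF_WRITEPROTECT 0x1   /* value is illustrative */

/* Toy stand-in for RAMBlock; only the flags member matters here. */
typedef struct {
    int flags;
} Block;

/* Explicit unwind: clear per-block protection state before closing the
 * uffd. close() drops the kernel-side registrations anyway, but the
 * QEMU-side flags must be reset by hand either way. */
static void write_tracking_unwind(Block *blocks, int n)
{
    for (int i = 0; i < n; i++) {
        if (blocks[i].flags & RAM_UF_WRITEPROTECT) {
            /* real code: uffd_protect_memory(fd, host, len, false); */
            blocks[i].flags &= ~RAM_UF_WRITEPROTECT;
        }
    }
    /* real code: uffd_close_fd(fd); */
}
```

Running this from both the failure path of ram_write_tracking_start() and
from ram_write_tracking_stop() would keep the two paths symmetric.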

>> + */
>> +void ram_write_tracking_stop(void)
>> +{
>> +    RAMState *rs = ram_state;
>> +    RAMBlock *bs;
>> +    assert(rs->uffdio_fd >= 0);
>> +
>> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
>> +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
>> +            continue;
>> +        }
>> +        info_report("UFFD-WP write-tracking disabled: "
>> +                "block_id=%s page_size=%zu start=%p length=%lu "
>> +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
>> +                bs->idstr, bs->page_size, bs->host, bs->max_length,
>> +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
>> +                bs->mr->nonvolatile, bs->mr->rom_device);
>> +        /* Cleanup flags */
>> +        bs->flags &= ~RAM_UF_WRITEPROTECT;
>> +    }
>> +
>> +    /*
>> +     * Close UFFD file descriptor to remove protection,
>> +     * release registered memory regions and flush wait queues
>> +     */
>> +    uffd_close_fd(rs->uffdio_fd);
>> +    rs->uffdio_fd = -1;
>> +}
> 


-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com



* Re: [PATCH v3 5/7] implementation of vm_start() BH
  2020-11-19 18:46   ` Peter Xu
@ 2020-11-20 11:13     ` Andrey Gruzdev
  0 siblings, 0 replies; 44+ messages in thread
From: Andrey Gruzdev @ 2020-11-20 11:13 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster

On 19.11.2020 21:46, Peter Xu wrote:
> On Thu, Nov 19, 2020 at 03:59:38PM +0300, Andrey Gruzdev wrote:
>> To avoid saving updated versions of memory pages we need
>> to start tracking RAM writes before we resume operation of
>> vCPUs. This sequence is especially critical for virtio device
>> backends whose VQs are mapped to main memory and accessed
>> directly, not through MMIO callbacks.
>>
>> One problem is that the vm_start() routine calls the state
>> change notifier callbacks directly from itself. Virtio drivers
>> sync/flush their VQs in their notifier routines.
>> Since we poll UFFD and process faults on the same thread, that
>> leads to a situation where the thread deadlocks in vm_start()
>> if we try to call it from the migration thread.
> 
> There's a nice comment in the previous patch about this, before the bottom half
> is created - thanks, that's helpful.  Though IMHO this patch can directly be
> squashed into the previous one, since it's confusing to have the comment there
> without doing anything about it.
> 

Yes, agree, better to squash this small commit.

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com



* Re: [PATCH v3 1/7] introduce 'track-writes-ram' migration capability
  2020-11-19 18:51   ` Peter Xu
  2020-11-19 19:07     ` Peter Xu
@ 2020-11-20 11:32     ` Andrey Gruzdev
  1 sibling, 0 replies; 44+ messages in thread
From: Andrey Gruzdev @ 2020-11-20 11:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster

On 19.11.2020 21:51, Peter Xu wrote:
> On Thu, Nov 19, 2020 at 03:59:34PM +0300, Andrey Gruzdev via wrote:
>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
>> ---
>>   migration/migration.c | 96 +++++++++++++++++++++++++++++++++++++++++++
>>   migration/migration.h |  1 +
>>   qapi/migration.json   |  7 +++-
>>   3 files changed, 103 insertions(+), 1 deletion(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 87a9b59f83..ff0364dde0 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -56,6 +56,7 @@
>>   #include "net/announce.h"
>>   #include "qemu/queue.h"
>>   #include "multifd.h"
>> +#include "sysemu/cpus.h"
>>   
>>   #ifdef CONFIG_VFIO
>>   #include "hw/vfio/vfio-common.h"
>> @@ -1165,6 +1166,91 @@ static bool migrate_caps_check(bool *cap_list,
>>           }
>>       }
>>   
>> +    if (cap_list[MIGRATION_CAPABILITY_TRACK_WRITES_RAM]) {
>> +        if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
>> +            error_setg(errp,
>> +                    "Track-writes is not compatible with postcopy-ram");
>> +            return false;
>> +        }
>> +
>> +        if (cap_list[MIGRATION_CAPABILITY_DIRTY_BITMAPS]) {
>> +            error_setg(errp,
>> +                    "Track-writes is not compatible with dirty-bitmaps");
>> +            return false;
>> +        }
>> +
>> +        if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_BLOCKTIME]) {
>> +            error_setg(errp,
>> +                    "Track-writes is not compatible with postcopy-blocktime");
>> +            return false;
>> +        }
>> +
>> +        if (cap_list[MIGRATION_CAPABILITY_LATE_BLOCK_ACTIVATE]) {
>> +            error_setg(errp,
>> +                    "Track-writes is not compatible with late-block-activate");
>> +            return false;
>> +        }
>> +
>> +        if (cap_list[MIGRATION_CAPABILITY_RETURN_PATH]) {
>> +            error_setg(errp,
>> +                    "Track-writes is not compatible with return-path");
>> +            return false;
>> +        }
>> +
>> +        if (cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
>> +            error_setg(errp, "Track-writes is not compatible with multifd");
>> +            return false;
>> +        }
>> +
>> +        if (cap_list[MIGRATION_CAPABILITY_PAUSE_BEFORE_SWITCHOVER]) {
>> +            error_setg(errp,
>> +                    "Track-writes is not compatible with pause-before-switchover");
>> +            return false;
>> +        }
>> +
>> +        if (cap_list[MIGRATION_CAPABILITY_AUTO_CONVERGE]) {
>> +            error_setg(errp,
>> +                    "Track-writes is not compatible with auto-converge");
>> +            return false;
>> +        }
>> +
>> +        if (cap_list[MIGRATION_CAPABILITY_RELEASE_RAM]) {
>> +            error_setg(errp,
>> +                    "Track-writes is not compatible with release-ram");
>> +            return false;
>> +        }
>> +
>> +        if (cap_list[MIGRATION_CAPABILITY_RDMA_PIN_ALL]) {
>> +            error_setg(errp,
>> +                    "Track-writes is not compatible with rdma-pin-all");
>> +            return false;
>> +        }
>> +
>> +        if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
>> +            error_setg(errp,
>> +                    "Track-writes is not compatible with compression");
>> +            return false;
>> +        }
>> +
>> +        if (cap_list[MIGRATION_CAPABILITY_XBZRLE]) {
>> +            error_setg(errp,
>> +                    "Track-writes is not compatible with XBZRLE");
>> +            return false;
>> +        }
>> +
>> +        if (cap_list[MIGRATION_CAPABILITY_X_COLO]) {
>> +            error_setg(errp,
>> +                    "Track-writes is not compatible with x-colo");
>> +            return false;
>> +        }
>> +
>> +        if (cap_list[MIGRATION_CAPABILITY_VALIDATE_UUID]) {
>> +            error_setg(errp,
>> +                    "Track-writes is not compatible with validate-uuid");
>> +            return false;
>> +        }
>> +    }
>> +
>>       return true;
>>   }
>>   
>> @@ -2490,6 +2576,15 @@ bool migrate_use_block_incremental(void)
>>       return s->parameters.block_incremental;
>>   }
>>   
>> +bool migrate_track_writes_ram(void)
>> +{
>> +    MigrationState *s;
>> +
>> +    s = migrate_get_current();
>> +
>> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_TRACK_WRITES_RAM];
>> +}
>> +
>>   /* migration thread support */
>>   /*
>>    * Something bad happened to the RP stream, mark an error
>> @@ -3783,6 +3878,7 @@ static Property migration_properties[] = {
>>       DEFINE_PROP_MIG_CAP("x-block", MIGRATION_CAPABILITY_BLOCK),
>>       DEFINE_PROP_MIG_CAP("x-return-path", MIGRATION_CAPABILITY_RETURN_PATH),
>>       DEFINE_PROP_MIG_CAP("x-multifd", MIGRATION_CAPABILITY_MULTIFD),
>> +    DEFINE_PROP_MIG_CAP("x-track-writes-ram", MIGRATION_CAPABILITY_TRACK_WRITES_RAM),
>>   
>>       DEFINE_PROP_END_OF_LIST(),
>>   };
>> diff --git a/migration/migration.h b/migration/migration.h
>> index d096b77f74..339ae720e0 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -341,6 +341,7 @@ int migrate_compress_wait_thread(void);
>>   int migrate_decompress_threads(void);
>>   bool migrate_use_events(void);
>>   bool migrate_postcopy_blocktime(void);
>> +bool migrate_track_writes_ram(void);
>>   
>>   /* Sending on the return path - generic and then for each message type */
>>   void migrate_send_rp_shut(MigrationIncomingState *mis,
>> diff --git a/qapi/migration.json b/qapi/migration.json
>> index 3c75820527..a28d8b7ee8 100644
>> --- a/qapi/migration.json
>> +++ b/qapi/migration.json
>> @@ -442,6 +442,11 @@
>>   # @validate-uuid: Send the UUID of the source to allow the destination
>>   #                 to ensure it is the same. (since 4.2)
>>   #
>> +# @track-writes-ram: If enabled, the migration stream will be a snapshot
>> +#                    of the VM exactly at the point when the migration
>> +#                    procedure starts. The VM RAM is saved while the VM is
>> +#                    running. (since 6.0)
>> +#
> 
> The name is slightly confusing to me.  Could I ask why it was changed from the
> previous one?  "snapshot" sounds like a very necessary keyword here, telling
> exactly what we do...  Because we could do quite a few things with
> "track-writes-ram" other than snapshotting, e.g., calculating per-VM dirty rates.
> 

Mm, the idea was that we introduce an alternative migration mechanism..
But it's really intended for snapshots only, correct. Yes, let's change
the name back to the original one.

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com



* Re: [PATCH v3 1/7] introduce 'track-writes-ram' migration capability
  2020-11-19 19:07     ` Peter Xu
@ 2020-11-20 11:35       ` Andrey Gruzdev
  2020-11-24 16:55       ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 44+ messages in thread
From: Andrey Gruzdev @ 2020-11-20 11:35 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster

On 19.11.2020 22:07, Peter Xu wrote:
> On Thu, Nov 19, 2020 at 01:51:50PM -0500, Peter Xu wrote:
>> On Thu, Nov 19, 2020 at 03:59:34PM +0300, Andrey Gruzdev via wrote:
>>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
>>> ---
>>>   migration/migration.c | 96 +++++++++++++++++++++++++++++++++++++++++++
>>>   migration/migration.h |  1 +
>>>   qapi/migration.json   |  7 +++-
>>>   3 files changed, 103 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index 87a9b59f83..ff0364dde0 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -56,6 +56,7 @@
>>>   #include "net/announce.h"
>>>   #include "qemu/queue.h"
>>>   #include "multifd.h"
>>> +#include "sysemu/cpus.h"
>>>   
>>>   #ifdef CONFIG_VFIO
>>>   #include "hw/vfio/vfio-common.h"
>>> @@ -1165,6 +1166,91 @@ static bool migrate_caps_check(bool *cap_list,
>>>           }
>>>       }
>>>   
>>> +    if (cap_list[MIGRATION_CAPABILITY_TRACK_WRITES_RAM]) {
>>> +        if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
>>> +            error_setg(errp,
>>> +                    "Track-writes is not compatible with postcopy-ram");
>>> +            return false;
>>> +        }
>>> +
>>> +        if (cap_list[MIGRATION_CAPABILITY_DIRTY_BITMAPS]) {
>>> +            error_setg(errp,
>>> +                    "Track-writes is not compatible with dirty-bitmaps");
>>> +            return false;
>>> +        }
>>> +
>>> +        if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_BLOCKTIME]) {
>>> +            error_setg(errp,
>>> +                    "Track-writes is not compatible with postcopy-blocktime");
>>> +            return false;
>>> +        }
>>> +
>>> +        if (cap_list[MIGRATION_CAPABILITY_LATE_BLOCK_ACTIVATE]) {
>>> +            error_setg(errp,
>>> +                    "Track-writes is not compatible with late-block-activate");
>>> +            return false;
>>> +        }
>>> +
>>> +        if (cap_list[MIGRATION_CAPABILITY_RETURN_PATH]) {
>>> +            error_setg(errp,
>>> +                    "Track-writes is not compatible with return-path");
>>> +            return false;
>>> +        }
>>> +
>>> +        if (cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
>>> +            error_setg(errp, "Track-writes is not compatible with multifd");
>>> +            return false;
>>> +        }
>>> +
>>> +        if (cap_list[MIGRATION_CAPABILITY_PAUSE_BEFORE_SWITCHOVER]) {
>>> +            error_setg(errp,
>>> +                    "Track-writes is not compatible with pause-before-switchover");
>>> +            return false;
>>> +        }
>>> +
>>> +        if (cap_list[MIGRATION_CAPABILITY_AUTO_CONVERGE]) {
>>> +            error_setg(errp,
>>> +                    "Track-writes is not compatible with auto-converge");
>>> +            return false;
>>> +        }
>>> +
>>> +        if (cap_list[MIGRATION_CAPABILITY_RELEASE_RAM]) {
>>> +            error_setg(errp,
>>> +                    "Track-writes is not compatible with release-ram");
>>> +            return false;
>>> +        }
>>> +
>>> +        if (cap_list[MIGRATION_CAPABILITY_RDMA_PIN_ALL]) {
>>> +            error_setg(errp,
>>> +                    "Track-writes is not compatible with rdma-pin-all");
>>> +            return false;
>>> +        }
>>> +
>>> +        if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
>>> +            error_setg(errp,
>>> +                    "Track-writes is not compatible with compression");
>>> +            return false;
>>> +        }
>>> +
>>> +        if (cap_list[MIGRATION_CAPABILITY_XBZRLE]) {
>>> +            error_setg(errp,
>>> +                    "Track-writes is not compatible with XBZRLE");
>>> +            return false;
>>> +        }
>>> +
>>> +        if (cap_list[MIGRATION_CAPABILITY_X_COLO]) {
>>> +            error_setg(errp,
>>> +                    "Track-writes is not compatible with x-colo");
>>> +            return false;
>>> +        }
>>> +
>>> +        if (cap_list[MIGRATION_CAPABILITY_VALIDATE_UUID]) {
>>> +            error_setg(errp,
>>> +                    "Track-writes is not compatible with validate-uuid");
>>> +            return false;
>>> +        }
> 
> Another thing forgot to mention - we can at least define an array for live
> snapshot now so we just loop over that one instead of copy-paste these lines...
> 

Yes, too many lines here, better to use a 'compatibility array'.
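For illustration, a standalone sketch of that idea might look like the following (the enum values, names and helper below are stand-ins for this sketch, not QEMU's actual MIGRATION_CAPABILITY_* definitions):

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in capability IDs (hypothetical, not QEMU's real enum) */
enum {
    CAP_POSTCOPY_RAM,
    CAP_DIRTY_BITMAPS,
    CAP_MULTIFD,
    CAP_TRACK_WRITES_RAM,
    CAP_MAX
};

static const char *const cap_names[CAP_MAX] = {
    [CAP_POSTCOPY_RAM]     = "postcopy-ram",
    [CAP_DIRTY_BITMAPS]    = "dirty-bitmaps",
    [CAP_MULTIFD]          = "multifd",
    [CAP_TRACK_WRITES_RAM] = "track-writes-ram",
};

/* The 'compatibility array': capabilities that conflict with
 * track-writes-ram, looped over instead of copy-pasted checks. */
static const int track_writes_incompatible[] = {
    CAP_POSTCOPY_RAM,
    CAP_DIRTY_BITMAPS,
    CAP_MULTIFD,
};

/* Return the name of the first conflicting capability, or NULL if the
 * combination is valid. */
static const char *track_writes_caps_check(const bool cap_list[CAP_MAX])
{
    if (!cap_list[CAP_TRACK_WRITES_RAM]) {
        return NULL;
    }
    for (size_t i = 0;
         i < sizeof(track_writes_incompatible) /
             sizeof(track_writes_incompatible[0]);
         i++) {
        int cap = track_writes_incompatible[i];
        if (cap_list[cap]) {
            return cap_names[cap];
        }
    }
    return NULL;
}
```

migrate_caps_check() could then presumably emit a single error_setg() with the returned name instead of a dozen near-identical blocks.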

>>> +    }
>>> +
>>>       return true;
>>>   }
>>>   
>>> @@ -2490,6 +2576,15 @@ bool migrate_use_block_incremental(void)
>>>       return s->parameters.block_incremental;
>>>   }
>>>   
>>> +bool migrate_track_writes_ram(void)
>>> +{
>>> +    MigrationState *s;
>>> +
>>> +    s = migrate_get_current();
>>> +
>>> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_TRACK_WRITES_RAM];
>>> +}
>>> +
>>>   /* migration thread support */
>>>   /*
>>>    * Something bad happened to the RP stream, mark an error
>>> @@ -3783,6 +3878,7 @@ static Property migration_properties[] = {
>>>       DEFINE_PROP_MIG_CAP("x-block", MIGRATION_CAPABILITY_BLOCK),
>>>       DEFINE_PROP_MIG_CAP("x-return-path", MIGRATION_CAPABILITY_RETURN_PATH),
>>>       DEFINE_PROP_MIG_CAP("x-multifd", MIGRATION_CAPABILITY_MULTIFD),
>>> +    DEFINE_PROP_MIG_CAP("x-track-writes-ram", MIGRATION_CAPABILITY_TRACK_WRITES_RAM),
>>>   
>>>       DEFINE_PROP_END_OF_LIST(),
>>>   };
>>> diff --git a/migration/migration.h b/migration/migration.h
>>> index d096b77f74..339ae720e0 100644
>>> --- a/migration/migration.h
>>> +++ b/migration/migration.h
>>> @@ -341,6 +341,7 @@ int migrate_compress_wait_thread(void);
>>>   int migrate_decompress_threads(void);
>>>   bool migrate_use_events(void);
>>>   bool migrate_postcopy_blocktime(void);
>>> +bool migrate_track_writes_ram(void);
>>>   
>>>   /* Sending on the return path - generic and then for each message type */
>>>   void migrate_send_rp_shut(MigrationIncomingState *mis,
>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>> index 3c75820527..a28d8b7ee8 100644
>>> --- a/qapi/migration.json
>>> +++ b/qapi/migration.json
>>> @@ -442,6 +442,11 @@
>>>   # @validate-uuid: Send the UUID of the source to allow the destination
>>>   #                 to ensure it is the same. (since 4.2)
>>>   #
>>> +# @track-writes-ram: If enabled, the migration stream will be a snapshot
>>> +#                    of the VM exactly at the point when the migration
>>> +#                    procedure starts. The VM RAM is saved while the VM is running.
>>> +#                    (since 6.0)
>>> +#
>>
>> The name is slightly confusing to me.  Could I ask why changed from previous
>> one?  "snapshot" sounds a very necessary keyword to me here and tells exactly
>> on what we do...  Because we can do quite a few things with "track-writes-ram"
>> but not snapshotting, e.g., to calculate per-vm dirty rates.
>>
>> -- 
>> Peter Xu
> 


-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com



* Re: [PATCH v3 4/7] implementation of write-tracking migration thread
  2020-11-19 18:47   ` Peter Xu
@ 2020-11-20 11:41     ` Andrey Gruzdev
  0 siblings, 0 replies; 44+ messages in thread
From: Andrey Gruzdev @ 2020-11-20 11:41 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster

On 19.11.2020 21:47, Peter Xu wrote:
> On Thu, Nov 19, 2020 at 03:59:37PM +0300, Andrey Gruzdev via wrote:
>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> 
> Some commit message would always be appreciated...  Thanks,
> 

Yep, missed it..

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com



* Re: [PATCH v3 7/7] introduce simple linear scan rate limiting mechanism
  2020-11-19 20:02   ` Peter Xu
@ 2020-11-20 12:06     ` Andrey Gruzdev
  2020-11-20 15:23       ` Peter Xu
  0 siblings, 1 reply; 44+ messages in thread
From: Andrey Gruzdev @ 2020-11-20 12:06 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster

On 19.11.2020 23:02, Peter Xu wrote:
> On Thu, Nov 19, 2020 at 03:59:40PM +0300, Andrey Gruzdev wrote:
>> Since reading UFFD events and saving paged data are performed
>> from the same thread, write fault latencies are sensitive to
>> migration stream stalls. Limiting total page saving rate is a
>> method to reduce the amount of noticeable fault resolution latencies.
>>
>> Migration bandwidth limiting is achieved via noticing cases of
>> out-of-threshold write fault latencies and temporarily disabling
>> (strictly speaking, severely throttling) saving non-faulting pages.
> 
> Just curious: have you measured aver/max latency of wr-protected page requests,
> or better, even its distribution?
> 
> I believe it should also be relevant to where the snapshot is stored, say, the
> backend disk of your tests.  Is that a file on some fs?
> 
> I would expect the latency should be still good if e.g. the throughput of the
> backend file system is decent even without a patch like this, but I might have
> missed something..
> 
> In all cases, it would be very nice if this patch can have the histogram or
> aver or max latency measured and compared before/after this patch applied.
> 
> Thanks,
> 
So far I have no objective latency measurements, but that's really 
interesting. For testing I commonly used a SATA HDD, not a recent one - a 
1.5TB Seagate 7200.11 series - chosen specifically not to have a large 
hardware cache or flash buffer. And yes, the backend is a file on ext4.

I tried mostly with 'migrate exec:streamdump_utility', a very simple 
tool which writes the stream to a file. It doesn't even use AIO - so reads 
from STDIN and file writes don't overlap. It was made so intentionally, to 
reduce throughput and produce some random high-latency writes.

I think snapshotting performance may be severely degraded by I/O 
happening in parallel on the same storage media; that's a problem we 
need to consider.

And yes, taking a latency histogram before/after the patch is a nice idea!
Also I need to make the stream dumping tool use AIO, to test with full 
storage throughput.
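For reference, the core of such a non-AIO dump tool is just a blocking read/write loop; a minimal standalone sketch (the function name is made up here, and the actual streamdump utility may differ):

```c
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>

/* Blocking copy loop: read the migration stream from in_fd and write it
 * to out_fd.  Reads and file writes never overlap, which deliberately
 * caps throughput and exposes backend write latency.
 * Returns total bytes copied, or -1 on error. */
static ssize_t dump_stream(int in_fd, int out_fd)
{
    char buf[64 * 1024];
    ssize_t total = 0;

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof(buf));
        if (n == 0) {
            return total;            /* EOF: stream fully dumped */
        }
        if (n < 0) {
            if (errno == EINTR) {
                continue;            /* interrupted, retry the read */
            }
            return -1;
        }
        /* Handle short writes: loop until the whole chunk is on disk */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out_fd, buf + off, n - off);
            if (w < 0) {
                if (errno == EINTR) {
                    continue;
                }
                return -1;
            }
            off += w;
            total += w;
        }
    }
}
```

An AIO variant would instead keep a read in flight while the previous chunk is being written, overlapping the two latencies.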

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com



* Re: [PATCH v3 2/7] introduce UFFD-WP low-level interface helpers
  2020-11-20 11:04     ` Andrey Gruzdev
@ 2020-11-20 15:01       ` Peter Xu
  2020-11-20 15:43         ` Andrey Gruzdev
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Xu @ 2020-11-20 15:01 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Fri, Nov 20, 2020 at 02:04:46PM +0300, Andrey Gruzdev wrote:
> > > +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> > > +        /* Nothing to do with read-only and MMIO-writable regions */
> > > +        if (bs->mr->readonly || bs->mr->rom_device) {
> > > +            continue;
> > > +        }
> > > +
> > > +        /* Register block memory with UFFD to track writes */
> > > +        if (uffd_register_memory(rs->uffdio_fd, (hwaddr) bs->host,
> > > +                bs->max_length, false, true)) {
> > > +            goto fail;
> > > +        }
> > > +        /* Apply UFFD write protection to the block memory range */
> > > +        if (uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host,
> > > +                bs->max_length, true)) {
> > 
> > Here logically we need to undo the previous register first, however userfaultfd
> > will auto-clean these when close(fd), so it's ok.  However still better to
> > unwind the protection of pages, I think.  So...
> > 
> 
> It should auto-clean, but as an additional safety measure - yes.

I'm afraid it will only clean up the registers, but not the page table updates;
at least that should be what we do now in the kernel. I'm not sure whether we
should always force the kernel to unprotect those when close(). The problem is
the registered range is normally quite large while the wr-protect range can be
very small (page-based), so that could take a lot of time, which can be
unnecessary, since the userspace is the one that knows the best on which range
was protected.

Indeed I can't think of anything really bad even if we don't unprotect the
pages as you do right now - what will happen is that the wr-protected pages
will have UFFD_WP set and PAGE_RW cleared in the page tables even after the
close(fd).  It means that after the snapshot got cancelled those wr-protected
pages could still trigger a page fault again when being written, though since
they're no longer covered by uffd-wp protected vmas, it'll become a "normal
cow" fault, and the write bit will be recovered.  However, the UFFD_WP bit in
the page tables could get left over there...  So maybe it's still best to
unprotect from userspace.

There's an idea that maybe we can auto-remove the UFFD_WP bit in the kernel
when cow happens for a page, but that's definitely off topic (and we must make
sure things like "enforced cow for read-only get_user_pages()" won't happen
again).  Not hard to do that in userspace, anyway.

-- 
Peter Xu




* Re: [PATCH v3 3/7] support UFFD write fault processing in ram_save_iterate()
  2020-11-20 10:44     ` Andrey Gruzdev
@ 2020-11-20 15:07       ` Peter Xu
  2020-11-20 16:15         ` Andrey Gruzdev
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Xu @ 2020-11-20 15:07 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Fri, Nov 20, 2020 at 01:44:53PM +0300, Andrey Gruzdev wrote:
> On 19.11.2020 21:25, Peter Xu wrote:
> > On Thu, Nov 19, 2020 at 03:59:36PM +0300, Andrey Gruzdev via wrote:
> > 
> > [...]
> > 
> > > +/**
> > > + * ram_find_block_by_host_address: find RAM block containing host page
> > > + *
> > > + * Returns true if RAM block is found and pss->block/page are
> > > + * pointing to the given host page, false in case of an error
> > > + *
> > > + * @rs: current RAM state
> > > + * @pss: page-search-status structure
> > > + */
> > > +static bool ram_find_block_by_host_address(RAMState *rs, PageSearchStatus *pss,
> > > +        hwaddr page_address)
> > > +{
> > > +    bool found = false;
> > > +
> > > +    pss->block = rs->last_seen_block;
> > > +    do {
> > > +        if (page_address >= (hwaddr) pss->block->host &&
> > > +            (page_address + TARGET_PAGE_SIZE) <=
> > > +                    ((hwaddr) pss->block->host + pss->block->used_length)) {
> > > +            pss->page = (unsigned long)
> > > +                    ((page_address - (hwaddr) pss->block->host) >> TARGET_PAGE_BITS);
> > > +            found = true;
> > > +            break;
> > > +        }
> > > +
> > > +        pss->block = QLIST_NEXT_RCU(pss->block, next);
> > > +        if (!pss->block) {
> > > +            /* Hit the end of the list */
> > > +            pss->block = QLIST_FIRST_RCU(&ram_list.blocks);
> > > +        }
> > > +    } while (pss->block != rs->last_seen_block);
> > > +
> > > +    rs->last_seen_block = pss->block;
> > > +    /*
> > > +     * Since we are in the same loop with ram_find_and_save_block(),
> > > +     * need to reset pss->complete_round after switching to
> > > +     * other block/page in pss.
> > > +     */
> > > +    pss->complete_round = false;
> > > +
> > > +    return found;
> > > +}
> > 
> > I forgot whether Denis and I have discussed this, but I'll try anyways... do
> > you think we can avoid touching PageSearchStatus at all?
> > 
> > PageSearchStatus is used to track a single migration iteration for precopy, so
> > that we scan from the 1st ramblock until the last one.  Then we finish one
> > iteration.
> > 
> 
> Yes, my first idea also was to separate the normal iteration from the
> write-fault page source completely and leave pss for the normal scan.. But
> the other idea is to keep some locality with respect to the last write fault.
> I mean, it seems more optimal to re-start the normal scan on the page next to
> the faulting one. In this case we can save and un-protect the neighborhood
> faster and prevent many write faults.

Yeah, locality sounds reasonable, and you just reminded me of the fact that
postcopy has that already, I think. :) Just see get_queued_page():

    if (block) {
        /*
         * As soon as we start servicing pages out of order, then we have
         * to kill the bulk stage, since the bulk stage assumes
         * in (migration_bitmap_find_and_reset_dirty) that every page is
         * dirty, that's no longer true.
         */
        rs->ram_bulk_stage = false;

        /*
         * We want the background search to continue from the queued page
         * since the guest is likely to want other pages near to the page
         * it just requested.
         */
        pss->block = block;
        pss->page = offset >> TARGET_PAGE_BITS;

        /*
         * This unqueued page would break the "one round" check, even is
         * really rare.
         */
        pss->complete_round = false;
    }

So as long as we queue the pages onto the src_page_requests queue, it'll take
care of write locality already, iiuc.

> 
> > Snapshot is really something, imho, that can easily leverage this structure
> > without touching it - basically we want to do two things:
> > 
> >    - Do the 1st iteration of precopy (when ram_bulk_stage==true), and do that
> >      only.  We never need the 2nd, 3rd, ... iterations because we're snapshoting.
> > 
> >    - Leverage the postcopy queue mechanism so that when some page got written,
> >      queue that page.  We should have this queue higher priority than the
> >      precopy scanning mentioned above.
> > 
> > As long as we follow above rules, then after the above 1st round precopy, we're
> > simply done...  If that works, the whole logic of precopy and PageSearchStatus
> > does not need to be touched, iiuc.
> > 
> > [...]
> > 
> 
> It's quite a good alternative, and I thought about using the postcopy page
> queue, but that implementation won't consider the locality of writes..
> 
> What do you think?

So now I think the "Do the 1st iteration of precopy only" idea won't work, but
still please consider whether it's natural to just reuse postcopy's queue
mechanism.  IOW, to see whether we can avoid most of the pss logic changes in
this patch.

Thanks,

-- 
Peter Xu




* Re: [PATCH v3 7/7] introduce simple linear scan rate limiting mechanism
  2020-11-20 12:06     ` Andrey Gruzdev
@ 2020-11-20 15:23       ` Peter Xu
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Xu @ 2020-11-20 15:23 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Fri, Nov 20, 2020 at 03:06:56PM +0300, Andrey Gruzdev wrote:
> On 19.11.2020 23:02, Peter Xu wrote:
> > On Thu, Nov 19, 2020 at 03:59:40PM +0300, Andrey Gruzdev wrote:
> > > Since reading UFFD events and saving paged data are performed
> > > from the same thread, write fault latencies are sensitive to
> > > migration stream stalls. Limiting total page saving rate is a
> > > method to reduce the amount of noticeable fault resolution latencies.
> > > 
> > > Migration bandwidth limiting is achieved via noticing cases of
> > > out-of-threshold write fault latencies and temporarily disabling
> > > (strictly speaking, severely throttling) saving non-faulting pages.
> > 
> > Just curious: have you measured aver/max latency of wr-protected page requests,
> > or better, even its distribution?
> > 
> > I believe it should also be relevant to where the snapshot is stored, say, the
> > backend disk of your tests.  Is that a file on some fs?
> > 
> > I would expect the latency should be still good if e.g. the throughput of the
> > backend file system is decent even without a patch like this, but I might have
> > missed something..
> > 
> > In all cases, it would be very nice if this patch can have the histogram or
> > aver or max latency measured and compared before/after this patch applied.
> > 
> > Thanks,
> > 
> So far I have no objective latency measurements, that really interesting.
> For testing I commonly used SATA HDD, not too fresh one, 1.5TB Seagate
> 7200.11 series, specially not to have large hardware cache or flash buffer.
> And yes, backend is a file on ext4.
> 
> I tried mostly with 'migrate exec:streamdump_utility', a very simple tool
> which writes stream to a file. It even doesn't use AIO - so reads from STDIN
> and file writes don't overlap. Made so intentionally to reduce throughput
> and give some random high-latency writes.
> 
> I think snapshotting performance may be severely degraded by I/O happening
> in parallel on the same storage media, that's the problem we need to
> consider.

Yeah this is a good point.  So ideally I think here we have a similar
requirement to postcopy, in that we really want the postcopy request stream to
be separate from the main migration stream, just like what we'd probably
prefer here too when vcpus are writing to wr-protected pages rather than
missing ones.  So that the urgent requests get handled faster than the
background task.

For postcopy, we may still have some chance if someday we can separate the
migration channels, so that we can try to provide a standalone channel for
postcopy requests.  For example if tcp based, that can be a separate socket.
Then we can try to tune the priority of these sockets to make sure postcopy
stream were sent earlier.

So I guess it's not always easy to separate these two channels - in your case,
if the disk must be shared, it seems not easy.  Maybe we can prioritize disk
I/Os somehow too, just like tcp channels?  Not sure.  Hmm, it would be good if
QEMUFile could provide some mechanism to prioritize IOs as a common interface,
but that's probably too far beyond our discussion. :)

> 
> And yes, to take latency histogram before/after the patch is nice idea!
> Also I need to make stream dumping tool with AIO, to test with full storage
> throughput.

Yeah, I would appreciate it if you can provide those.

Note that I think it can even be done after the functionality got merged
(Dave/Juan could correct me), since it can be work on top.  Having them
together would be even nicer, of course.  Your call.  It's already much better
than stopping the whole vm, imho.

Thanks,

-- 
Peter Xu




* Re: [PATCH v3 2/7] introduce UFFD-WP low-level interface helpers
  2020-11-20 15:01       ` Peter Xu
@ 2020-11-20 15:43         ` Andrey Gruzdev
  0 siblings, 0 replies; 44+ messages in thread
From: Andrey Gruzdev @ 2020-11-20 15:43 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster

On 20.11.2020 18:01, Peter Xu wrote:
> On Fri, Nov 20, 2020 at 02:04:46PM +0300, Andrey Gruzdev wrote:
>>>> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
>>>> +        /* Nothing to do with read-only and MMIO-writable regions */
>>>> +        if (bs->mr->readonly || bs->mr->rom_device) {
>>>> +            continue;
>>>> +        }
>>>> +
>>>> +        /* Register block memory with UFFD to track writes */
>>>> +        if (uffd_register_memory(rs->uffdio_fd, (hwaddr) bs->host,
>>>> +                bs->max_length, false, true)) {
>>>> +            goto fail;
>>>> +        }
>>>> +        /* Apply UFFD write protection to the block memory range */
>>>> +        if (uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host,
>>>> +                bs->max_length, true)) {
>>>
>>> Here logically we need to undo the previous register first, however userfaultfd
>>> will auto-clean these when close(fd), so it's ok.  However still better to
>>> unwind the protection of pages, I think.  So...
>>>
>>
>> It should auto-clean, but as an additional safety measure - yes.
> 
> I'm afraid it will only clean up the registers, but not the page table updates;
> at least that should be what we do now in the kernel. I'm not sure whether we
> should always force the kernel to unprotect those when close(). The problem is
> the registered range is normally quite large while the wr-protect range can be
> very small (page-based), so that could take a lot of time, which can be
> unnecessary, since the userspace is the one that knows the best on which range
> was protected.
> 
> Indeed I can't think if anything really bad even if not unprotect the pages as
> you do right now - what will happen is that the wr-protected pages will have
> UFFD_WP set and PAGE_RW cleared in the page tables even after the close(fd).
> It means after the snapshot got cancelled those wr-protected pages could still
> trigger page fault again when being written, though since it's not covered by
> uffd-wp protected vmas, it'll become a "normal cow" fault, and the write bit
> will be recovered.  However the UFFD_WP bit in the page tables could got
> leftover there...  So maybe it's still best to unprotect from userspace.
> 
> There's an idea that maybe we can auto-remove the UFFD_WP bit in kernel when
> cow happens for a page, but that's definitely out of topic (and we must make
> sure things like "enforced cow for read-only get_user_pages() won't happen
> again").  No hard to do that in userspace, anyways.
> 

Oh, I've got the point. Sure, I need to add un-protect to the cleanup code.
Thanks for the clarification of the kernel implementation details!

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com



* Re: [PATCH v3 3/7] support UFFD write fault processing in ram_save_iterate()
  2020-11-20 15:07       ` Peter Xu
@ 2020-11-20 16:15         ` Andrey Gruzdev
  2020-11-20 16:43           ` Peter Xu
  0 siblings, 1 reply; 44+ messages in thread
From: Andrey Gruzdev @ 2020-11-20 16:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster

On 20.11.2020 18:07, Peter Xu wrote:
> On Fri, Nov 20, 2020 at 01:44:53PM +0300, Andrey Gruzdev wrote:
>> On 19.11.2020 21:25, Peter Xu wrote:
>>> On Thu, Nov 19, 2020 at 03:59:36PM +0300, Andrey Gruzdev via wrote:
>>>
>>> [...]
>>>
>>>> +/**
>>>> + * ram_find_block_by_host_address: find RAM block containing host page
>>>> + *
>>>> + * Returns true if RAM block is found and pss->block/page are
>>>> + * pointing to the given host page, false in case of an error
>>>> + *
>>>> + * @rs: current RAM state
>>>> + * @pss: page-search-status structure
>>>> + */
>>>> +static bool ram_find_block_by_host_address(RAMState *rs, PageSearchStatus *pss,
>>>> +        hwaddr page_address)
>>>> +{
>>>> +    bool found = false;
>>>> +
>>>> +    pss->block = rs->last_seen_block;
>>>> +    do {
>>>> +        if (page_address >= (hwaddr) pss->block->host &&
>>>> +            (page_address + TARGET_PAGE_SIZE) <=
>>>> +                    ((hwaddr) pss->block->host + pss->block->used_length)) {
>>>> +            pss->page = (unsigned long)
>>>> +                    ((page_address - (hwaddr) pss->block->host) >> TARGET_PAGE_BITS);
>>>> +            found = true;
>>>> +            break;
>>>> +        }
>>>> +
>>>> +        pss->block = QLIST_NEXT_RCU(pss->block, next);
>>>> +        if (!pss->block) {
>>>> +            /* Hit the end of the list */
>>>> +            pss->block = QLIST_FIRST_RCU(&ram_list.blocks);
>>>> +        }
>>>> +    } while (pss->block != rs->last_seen_block);
>>>> +
>>>> +    rs->last_seen_block = pss->block;
>>>> +    /*
>>>> +     * Since we are in the same loop with ram_find_and_save_block(),
>>>> +     * need to reset pss->complete_round after switching to
>>>> +     * other block/page in pss.
>>>> +     */
>>>> +    pss->complete_round = false;
>>>> +
>>>> +    return found;
>>>> +}
>>>
>>> I forgot whether Denis and I have discussed this, but I'll try anyways... do
>>> you think we can avoid touching PageSearchStatus at all?
>>>
>>> PageSearchStatus is used to track a single migration iteration for precopy, so
>>> that we scan from the 1st ramblock until the last one.  Then we finish one
>>> iteration.
>>>
>>
>> Yes, my first idea also was to separate normal iteration from write-fault
>> page source completely and leave pss for normal scan.. But, the other idea
>> is to keep some locality in respect to last write fault. I mean it seems to
>> be more optimal to re-start normal scan on the page that is next to faulting
>> one. In this case we can save and un-protect
>> the neighborhood faster and prevent many write faults.
> 
> Yeah locality sounds reasonable, and you just reminded me the fact that
> postcopy has that already I think. :) Just see get_queued_page():
> 
>      if (block) {
>          /*
>           * As soon as we start servicing pages out of order, then we have
>           * to kill the bulk stage, since the bulk stage assumes
>           * in (migration_bitmap_find_and_reset_dirty) that every page is
>           * dirty, that's no longer true.
>           */
>          rs->ram_bulk_stage = false;
> 
>          /*
>           * We want the background search to continue from the queued page
>           * since the guest is likely to want other pages near to the page
>           * it just requested.
>           */
>          pss->block = block;
>          pss->page = offset >> TARGET_PAGE_BITS;
> 
>          /*
>           * This unqueued page would break the "one round" check, even is
>           * really rare.
>           */
>          pss->complete_round = false;
>      }
> 
> So as long as we queue the pages onto the src_page_requests queue, it'll take
> care of write locality already, iiuc.
> 
>>
>>> Snapshot is really something, imho, that can easily leverage this structure
>>> without touching it - basically we want to do two things:
>>>
>>>     - Do the 1st iteration of precopy (when ram_bulk_stage==true), and do that
>>>       only.  We never need the 2nd, 3rd, ... iterations because we're snapshoting.
>>>
>>>     - Leverage the postcopy queue mechanism so that when some page got written,
>>>       queue that page.  We should have this queue higher priority than the
>>>       precopy scanning mentioned above.
>>>
>>> As long as we follow above rules, then after the above 1st round precopy, we're
>>> simply done...  If that works, the whole logic of precopy and PageSearchStatus
>>> does not need to be touched, iiuc.
>>>
>>> [...]
>>>
>>
>> It's quite good alternative and I thought about using postcopy page queue,
>> but this implementation won't consider the locality of writes..
>>
>> What do you think?
> 
> So now I think "Do the 1st iteration of precopy only" idea won't work, but
> still please consider whether it's natural to just reuse postcopy's queue
> mechanism.  IOW, to see whether we can avoid most of the pss logic changes in
> this patch.
> 
> Thanks,
> 

Yeah, I think we can re-use the postcopy queue code for faulting pages. 
I'm worrying a little about some additional overhead of dealing with the 
urgent request semaphore. Also, the code won't change a lot, something like:

[...]
         /* In case of 'write-tracking' migration we first try
          * to poll UFFD and see if we have a write page fault event */
         poll_fault_page(rs);

         again = true;
         found = get_queued_page(rs, &pss);

         if (!found) {
             /* priority queue empty, so just search for something dirty */
             found = find_dirty_block(rs, &pss, &again);
         }
[...]

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 3/7] support UFFD write fault processing in ram_save_iterate()
  2020-11-20 16:15         ` Andrey Gruzdev
@ 2020-11-20 16:43           ` Peter Xu
  2020-11-20 16:53             ` Andrey Gruzdev
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Xu @ 2020-11-20 16:43 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Fri, Nov 20, 2020 at 07:15:07PM +0300, Andrey Gruzdev wrote:
> Yeah, I think we can re-use the postcopy queue code for faulting pages. I'm
> worrying a little about some additional overhead of dealing with the urgent
> request semaphore. Also, the code won't change a lot, something like:
> 
> [...]
>         /* In case of 'write-tracking' migration we first try
>          * to poll UFFD and see if we have a write page fault event */
>         poll_fault_page(rs);
> 
>         again = true;
>         found = get_queued_page(rs, &pss);
> 
>         if (!found) {
>             /* priority queue empty, so just search for something dirty */
>             found = find_dirty_block(rs, &pss, &again);
>         }
> [...]

Could I ask what's the "urgent request semaphore"?  Thanks,

-- 
Peter Xu




* Re: [PATCH v3 3/7] support UFFD write fault processing in ram_save_iterate()
  2020-11-20 16:43           ` Peter Xu
@ 2020-11-20 16:53             ` Andrey Gruzdev
  2020-11-23 21:34               ` Peter Xu
  0 siblings, 1 reply; 44+ messages in thread
From: Andrey Gruzdev @ 2020-11-20 16:53 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster

On 20.11.2020 19:43, Peter Xu wrote:
> On Fri, Nov 20, 2020 at 07:15:07PM +0300, Andrey Gruzdev wrote:
>> Yeah, I think we can re-use the postcopy queue code for faulting pages. I'm
>> worrying a little about some additional overhead of dealing with the urgent
>> request semaphore. Also, the code won't change a lot, something like:
>>
>> [...]
>>          /* In case of 'write-tracking' migration we first try
>>           * to poll UFFD and see if we have a write page fault event */
>>          poll_fault_page(rs);
>>
>>          again = true;
>>          found = get_queued_page(rs, &pss);
>>
>>          if (!found) {
>>              /* priority queue empty, so just search for something dirty */
>>              found = find_dirty_block(rs, &pss, &again);
>>          }
>> [...]
> 
> Could I ask what's the "urgent request semaphore"?  Thanks,
> 

These functions use it (the correct name is 'rate_limit_sem'):

void migration_make_urgent_request(void)
{
     qemu_sem_post(&migrate_get_current()->rate_limit_sem);
}

void migration_consume_urgent_request(void)
{
     qemu_sem_wait(&migrate_get_current()->rate_limit_sem);
}

They are called from ram_save_queue_pages() and unqueue_page(), respectively,
to control the migration rate limiter.

bool migration_rate_limit(void)
{
[...]
         /*
          * Wait for a delay to do rate limiting OR
          * something urgent to post the semaphore.
          */
         int ms = s->iteration_start_time + BUFFER_DELAY - now;
         trace_migration_rate_limit_pre(ms);
         if (qemu_sem_timedwait(&s->rate_limit_sem, ms) == 0) {
             /*
              * We were woken by one or more urgent things but
              * the timedwait will have consumed one of them.
              * The service routine for the urgent wake will dec
              * the semaphore itself for each item it consumes,
              * so add this one we just eat back.
              */
             qemu_sem_post(&s->rate_limit_sem);
             urgent = true;
         }
[...]
}

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com



* Re: [PATCH v3 3/7] support UFFD write fault processing in ram_save_iterate()
  2020-11-20 16:53             ` Andrey Gruzdev
@ 2020-11-23 21:34               ` Peter Xu
  2020-11-24  8:02                 ` Andrey Gruzdev
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Xu @ 2020-11-23 21:34 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Fri, Nov 20, 2020 at 07:53:34PM +0300, Andrey Gruzdev wrote:
> On 20.11.2020 19:43, Peter Xu wrote:
> > On Fri, Nov 20, 2020 at 07:15:07PM +0300, Andrey Gruzdev wrote:
> > > Yeah, I think we can re-use the postcopy queue code for faulting pages. I'm
> > > worrying a little about some additional overhead of dealing with the urgent
> > > request semaphore. Also, the code won't change a lot, something like:
> > > 
> > > [...]
> > >          /* In case of 'write-tracking' migration we first try
> > >           * to poll UFFD and see if we have a write page fault event */
> > >          poll_fault_page(rs);
> > > 
> > >          again = true;
> > >          found = get_queued_page(rs, &pss);
> > > 
> > >          if (!found) {
> > >              /* priority queue empty, so just search for something dirty */
> > >              found = find_dirty_block(rs, &pss, &again);
> > >          }
> > > [...]
> > 
> > Could I ask what's the "urgent request semaphore"?  Thanks,
> > 
> 
> These functions use it (the correct name is 'rate_limit_sem'):
> 
> void migration_make_urgent_request(void)
> {
>     qemu_sem_post(&migrate_get_current()->rate_limit_sem);
> }
> 
> void migration_consume_urgent_request(void)
> {
>     qemu_sem_wait(&migrate_get_current()->rate_limit_sem);
> }
> 
> They are called from ram_save_queue_pages() and unqueue_page(), respectively,
> to control the migration rate limiter.
> 
> bool migration_rate_limit(void)
> {
> [...]
>         /*
>          * Wait for a delay to do rate limiting OR
>          * something urgent to post the semaphore.
>          */
>         int ms = s->iteration_start_time + BUFFER_DELAY - now;
>         trace_migration_rate_limit_pre(ms);
>         if (qemu_sem_timedwait(&s->rate_limit_sem, ms) == 0) {
>             /*
>              * We were woken by one or more urgent things but
>              * the timedwait will have consumed one of them.
>              * The service routine for the urgent wake will dec
>              * the semaphore itself for each item it consumes,
>              * so add this one we just eat back.
>              */
>             qemu_sem_post(&s->rate_limit_sem);
>             urgent = true;
>         }
> [...]
> }
> 

Hmm... Why would its overhead be a problem?  If it's an overhead that can be
avoided, then postcopy might want that too.

The thing is I really feel like the snapshot logic can leverage the whole idea
of existing postcopy (like get_queued_page/unqueue_page; it's probably due to
the fact that both of them want to "migrate some more urgent pages than the
background migration, due to either missing/wrprotected pages"), but I might
be missing something.

Thanks,

-- 
Peter Xu




* Re: [PATCH v3 3/7] support UFFD write fault processing in ram_save_iterate()
  2020-11-23 21:34               ` Peter Xu
@ 2020-11-24  8:02                 ` Andrey Gruzdev
  2020-11-24 15:17                   ` Peter Xu
  0 siblings, 1 reply; 44+ messages in thread
From: Andrey Gruzdev @ 2020-11-24  8:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster

On 24.11.2020 00:34, Peter Xu wrote:
> On Fri, Nov 20, 2020 at 07:53:34PM +0300, Andrey Gruzdev wrote:
>> On 20.11.2020 19:43, Peter Xu wrote:
>>> On Fri, Nov 20, 2020 at 07:15:07PM +0300, Andrey Gruzdev wrote:
>>>> Yeah, I think we can re-use the postcopy queue code for faulting pages. I'm
>>>> worrying a little about some additional overhead of dealing with the urgent
>>>> request semaphore. Also, the code won't change a lot, something like:
>>>>
>>>> [...]
>>>>           /* In case of 'write-tracking' migration we first try
>>>>            * to poll UFFD and see if we have a write page fault event */
>>>>           poll_fault_page(rs);
>>>>
>>>>           again = true;
>>>>           found = get_queued_page(rs, &pss);
>>>>
>>>>           if (!found) {
>>>>               /* priority queue empty, so just search for something dirty */
>>>>               found = find_dirty_block(rs, &pss, &again);
>>>>           }
>>>> [...]
>>>
>>> Could I ask what's the "urgent request semaphore"?  Thanks,
>>>
>>
>> These functions use it (the correct name is 'rate_limit_sem'):
>>
>> void migration_make_urgent_request(void)
>> {
>>      qemu_sem_post(&migrate_get_current()->rate_limit_sem);
>> }
>>
>> void migration_consume_urgent_request(void)
>> {
>>      qemu_sem_wait(&migrate_get_current()->rate_limit_sem);
>> }
>>
>> They are called from ram_save_queue_pages() and unqueue_page(), respectively,
>> to control the migration rate limiter.
>>
>> bool migration_rate_limit(void)
>> {
>> [...]
>>          /*
>>           * Wait for a delay to do rate limiting OR
>>           * something urgent to post the semaphore.
>>           */
>>          int ms = s->iteration_start_time + BUFFER_DELAY - now;
>>          trace_migration_rate_limit_pre(ms);
>>          if (qemu_sem_timedwait(&s->rate_limit_sem, ms) == 0) {
>>              /*
>>               * We were woken by one or more urgent things but
>>               * the timedwait will have consumed one of them.
>>               * The service routine for the urgent wake will dec
>>               * the semaphore itself for each item it consumes,
>>               * so add this one we just eat back.
>>               */
>>              qemu_sem_post(&s->rate_limit_sem);
>>              urgent = true;
>>          }
>> [...]
>> }
>>
> 
> Hmm... Why would its overhead be a problem?  If it's an overhead that can be
> avoided, then postcopy might want that too.
> 
> The thing is I really feel like the snapshot logic can leverage the whole idea
> of existing postcopy (like get_queued_page/unqueue_page; it's probably due to
> the fact that both of them want to "migrate some more urgent pages than the
> background migration, due to either missing/wrprotected pages"), but I might
> be missing something.
> 
> Thanks,
> 

I don't think this overhead itself is a problem since it's negligible
compared to other stuff. You're undoubtedly correct about using the postcopy
idea in the case when wr-fault pages arrive from a separate thread or an
external source. Then we really need to decouple that separate thread or
external requester from the migration thread.

In this patch series wr-faults arrive in the same migration loop as the
normal scan, so I don't see a direct reason to pass them through the queue,
but I might have missed something from your proposal.

Do you mean that if we use the postcopy logic, we'll allocate memory and
make copies of faulting pages and then immediately un-protect them?
Or do we just put the faulting page address into the queued item and release
protection after the page content has been saved?

Thanks,

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com



* Re: [PATCH v3 3/7] support UFFD write fault processing in ram_save_iterate()
  2020-11-24  8:02                 ` Andrey Gruzdev
@ 2020-11-24 15:17                   ` Peter Xu
  2020-11-24 17:40                     ` Andrey Gruzdev
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Xu @ 2020-11-24 15:17 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Tue, Nov 24, 2020 at 11:02:09AM +0300, Andrey Gruzdev wrote:
> On 24.11.2020 00:34, Peter Xu wrote:
> > On Fri, Nov 20, 2020 at 07:53:34PM +0300, Andrey Gruzdev wrote:
> > > On 20.11.2020 19:43, Peter Xu wrote:
> > > > On Fri, Nov 20, 2020 at 07:15:07PM +0300, Andrey Gruzdev wrote:
> > > > > Yeah, I think we can re-use the postcopy queue code for faulting pages. I'm
> > > > > worrying a little about some additional overhead of dealing with the urgent
> > > > > request semaphore. Also, the code won't change a lot, something like:
> > > > > 
> > > > > [...]
> > > > >           /* In case of 'write-tracking' migration we first try
> > > > >            * to poll UFFD and see if we have a write page fault event */
> > > > >           poll_fault_page(rs);
> > > > > 
> > > > >           again = true;
> > > > >           found = get_queued_page(rs, &pss);
> > > > > 
> > > > >           if (!found) {
> > > > >               /* priority queue empty, so just search for something dirty */
> > > > >               found = find_dirty_block(rs, &pss, &again);
> > > > >           }
> > > > > [...]
> > > > 
> > > > Could I ask what's the "urgent request semaphore"?  Thanks,
> > > > 
> > > 
> > > These functions use it (the correct name is 'rate_limit_sem'):
> > > 
> > > void migration_make_urgent_request(void)
> > > {
> > >      qemu_sem_post(&migrate_get_current()->rate_limit_sem);
> > > }
> > > 
> > > void migration_consume_urgent_request(void)
> > > {
> > >      qemu_sem_wait(&migrate_get_current()->rate_limit_sem);
> > > }
> > > 
> > > They are called from ram_save_queue_pages() and unqueue_page(), respectively,
> > > to control the migration rate limiter.
> > > 
> > > bool migration_rate_limit(void)
> > > {
> > > [...]
> > >          /*
> > >           * Wait for a delay to do rate limiting OR
> > >           * something urgent to post the semaphore.
> > >           */
> > >          int ms = s->iteration_start_time + BUFFER_DELAY - now;
> > >          trace_migration_rate_limit_pre(ms);
> > >          if (qemu_sem_timedwait(&s->rate_limit_sem, ms) == 0) {
> > >              /*
> > >               * We were woken by one or more urgent things but
> > >               * the timedwait will have consumed one of them.
> > >               * The service routine for the urgent wake will dec
> > >               * the semaphore itself for each item it consumes,
> > >               * so add this one we just eat back.
> > >               */
> > >              qemu_sem_post(&s->rate_limit_sem);
> > >              urgent = true;
> > >          }
> > > [...]
> > > }
> > > 
> > 
> > Hmm... Why would its overhead be a problem?  If it's an overhead that can be
> > avoided, then postcopy might want that too.
> > 
> > The thing is I really feel like the snapshot logic can leverage the whole idea
> > of existing postcopy (like get_queued_page/unqueue_page; it's probably due to
> > the fact that both of them want to "migrate some more urgent pages than the
> > background migration, due to either missing/wrprotected pages"), but I might
> > be missing something.
> > 
> > Thanks,
> > 
> 
> I don't think this overhead itself is a problem since it's negligible
> compared to other stuff. You're undoubtedly correct about using the postcopy
> idea in the case when wr-fault pages arrive from a separate thread or an
> external source. Then we really need to decouple that separate thread or
> external requester from the migration thread.
> 
> In this patch series wr-faults arrive in the same migration loop as the
> normal scan, so I don't see a direct reason to pass them through the queue,
> but I might have missed something from your proposal.

I see your point.  For me, whether to use a thread is not extremely important -
actually we can have a standalone thread to handle the vcpu faults too, just
like postcopy; it would just run on the src host.  My major point is that we
should avoid introducing the extra pss logic, because fundamentally it's doing
the same thing as get_queued_page() right now, unless I'm wrong...

So I think your previous poll_fault_page() call is okay; assuming the same poll
model as you used, I mean something like this:

------8<-------
diff --git a/migration/ram.c b/migration/ram.c
index 7811cde643..718dd276c7 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1473,6 +1473,14 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
 
     } while (block && !dirty);
 
+    if (!block) {
+        /*
+         * Set the block too if live snapshot is enabled; that's when we have
+         * vcpus got blocked by the wr-protected pages.
+         */
+        block = poll_fault_page(rs, &offset);
+    }
+
     if (block) {
         /*
          * As soon as we start servicing pages out of order, then we have
------8<-------

So as long as we set the block and offset, pss will be updated just like
postcopy.  That's exactly what we want, am I right?
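To make the suggested control flow concrete, here is a small toy model of the resulting priority: faulted pages are consumed through the same queued-page path, and the linear dirty scan is only consulted when nothing is pending. All names are illustrative stand-ins, not the actual QEMU code:

```c
#include <assert.h>

/* Illustrative stand-ins for QEMU's RAMBlock* / PageSearchStatus. */
typedef struct {
    int  block;   /* -1 means "none" */
    long page;
} PageSearchStatus;

/* Simulated page sources; -1 means "nothing pending". */
static int  fault_block = -1, scan_block = -1;
static long fault_off, scan_off;

/* Like the suggested poll_fault_page(): hand back a wr-faulted page. */
static int poll_fault_page(long *offset)
{
    int b = fault_block;

    if (b != -1) {
        *offset = fault_off;
        fault_block = -1;
    }
    return b;
}

/* Mirrors the suggested get_queued_page() change: when the explicit
 * request queue is empty (always, in this sketch), fall back to the
 * fault poll, then update pss in the one existing place. */
static int get_queued_page(PageSearchStatus *pss)
{
    long offset = 0;
    int  block = -1;                 /* request queue empty here */

    if (block == -1) {
        block = poll_fault_page(&offset);
    }
    if (block != -1) {
        pss->block = block;
        pss->page  = offset;         /* offset >> TARGET_PAGE_BITS in QEMU */
        return 1;
    }
    return 0;
}

/* Main loop shape: queued/faulted pages first, linear scan second. */
static int find_next_page(PageSearchStatus *pss)
{
    if (get_queued_page(pss)) {
        return 1;
    }
    if (scan_block != -1) {
        pss->block = scan_block;
        pss->page  = scan_off;
        scan_block = -1;
        return 1;
    }
    return 0;
}
```

With this shape, a pending write fault always wins over the background scan, which is exactly the "more urgent pages first" behavior postcopy already relies on.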

> 
> Do you mean that if we use postcopy logic, we'll allocate memory and make
> copies of faulting pages and then immediately un-protect them?
> Or we just put faulting page address to the queued item and release
> protection after page content has been saved.

I think the current approach would be fine: we don't copy the page, and we
unprotect at a single place after the page is dumped to the precopy stream.
If you could measure the latencies, that would be even better; then we'll know
what to expect.

IIRC the "copy page first" idea came from Denis's initial version, which I
thought was too aggressive since potentially it can eat twice the memory on the
host for a single guest, not to mention when a live snapshot is taken for
multiple guests on the same host.  It's just not something that can be directly
expected when the user wants to take a snapshot.  That's something we can do
on top of this work, imho, if we see very high latency on resolving the page
faults.  But that's still the last thing to try (or at least an off-by-default
option), so we can still think about some other solution before trying to eat
too much host memory.
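A toy model of the ordering guarantee being discussed, with hypothetical helpers rather than QEMU code: because the page is still write-protected while it is copied into the stream, the snapshot stays consistent without ever holding a shadow copy, and protection is lifted only after the save (in QEMU this lift would be the UFFDIO_WRITEPROTECT ioctl with the WP mode bit cleared).

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 8

static char guest_page[PAGE_SIZE];
static char stream[PAGE_SIZE];
static int  wr_protected;

/* The series' ordering: copy the still-protected page into the
 * migration stream, then lift protection.  No shadow copy is held. */
static void save_then_unprotect(void)
{
    assert(wr_protected);              /* no vCPU write can race this copy */
    memcpy(stream, guest_page, PAGE_SIZE);
    wr_protected = 0;                  /* stand-in for un-write-protecting */
}

/* A guest write: only proceeds once the page has been unprotected,
 * i.e. after it was saved; before that it would fault and get queued. */
static void guest_write(char c)
{
    assert(!wr_protected);
    memset(guest_page, c, PAGE_SIZE);
}
```

The "copy page first" alternative would insert a memcpy into a scratch buffer before the unprotect, lowering fault latency at the cost of up to a second copy of guest RAM in flight.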

Thanks,

-- 
Peter Xu




* Re: [PATCH v3 0/7] UFFD write-tracking migration/snapshots
  2020-11-19 12:59 [PATCH v3 0/7] UFFD write-tracking migration/snapshots Andrey Gruzdev via
                   ` (6 preceding siblings ...)
  2020-11-19 12:59 ` [PATCH v3 7/7] introduce simple linear scan rate limiting mechanism Andrey Gruzdev via
@ 2020-11-24 16:41 ` Dr. David Alan Gilbert
  7 siblings, 0 replies; 44+ messages in thread
From: Dr. David Alan Gilbert @ 2020-11-24 16:41 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: zhang.zhanghailiang, Juan Quintela, qemu-devel, Peter Xu,
	Markus Armbruster, chen.zhang, Paolo Bonzini, Den Lunev

* Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> Changes with v3:
> * coding style fixes to pass checkpatch test
> * qapi/migration.json: change 'track-writes-ram' capability
> *                      supported version to 6.0
> * fixes to commit message format
> 

cc'ing in COLO people, since they could use this as well. 
> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he's
> implemented in his series '[PATCH v0 0/4] migration: add background snapshot'.
> 
> Currently the only way to make (external) live VM snapshot is using existing
> dirty page logging migration mechanism. The main problem is that it tends to
> produce a lot of page duplicates while running VM goes on updating already
> saved pages. That leads to the fact that vmstate image size is commonly several
> times bigger then non-zero part of virtual machine's RSS. Time required to
> converge RAM migration and the size of snapshot image severely depend on the
> guest memory write rate, sometimes resulting in unacceptably long snapshot
> creation time and huge image size.
> 
> This series propose a way to solve the aforementioned problems. This is done
> by using different RAM migration mechanism based on UFFD write protection
> management introduced in v5.7 kernel. The migration strategy is to 'freeze'
> guest RAM content using write-protection and iteratively release protection
> for memory ranges that have already been saved to the migration stream.
> At the same time we read in pending UFFD write fault events and save those
> pages out-of-order with higher priority.
> 
> How to use:
> 1. Enable write-tracking migration capability
>    virsh qemu-monitor-command <domain> --hmp migrate_set_capability
> track-writes-ram on
> 
> 2. Start the external migration to a file
>    virsh qemu-monitor-command <domain> --hmp migrate exec:'cat > ./vm_state'
> 
> 3. Wait for the migration to finish and check that the migration has reached
> the 'completed' state.
> 
> Andrey Gruzdev (7):
>   introduce 'track-writes-ram' migration capability
>   introduce UFFD-WP low-level interface helpers
>   support UFFD write fault processing in ram_save_iterate()
>   implementation of write-tracking migration thread
>   implementation of vm_start() BH
>   the rest of write tracking migration code
>   introduce simple linear scan rate limiting mechanism
> 
>  include/exec/memory.h |   7 +
>  migration/migration.c | 338 +++++++++++++++++++++++++++++++-
>  migration/migration.h |   4 +
>  migration/ram.c       | 439 +++++++++++++++++++++++++++++++++++++++++-
>  migration/ram.h       |   4 +
>  migration/savevm.c    |   1 -
>  migration/savevm.h    |   2 +
>  qapi/migration.json   |   7 +-
>  8 files changed, 790 insertions(+), 12 deletions(-)
> 
> -- 
> 2.25.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v3 1/7] introduce 'track-writes-ram' migration capability
  2020-11-19 19:07     ` Peter Xu
  2020-11-20 11:35       ` Andrey Gruzdev
@ 2020-11-24 16:55       ` Dr. David Alan Gilbert
  2020-11-24 17:25         ` Andrey Gruzdev
  1 sibling, 1 reply; 44+ messages in thread
From: Dr. David Alan Gilbert @ 2020-11-24 16:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: Juan Quintela, qemu-devel, Markus Armbruster, Paolo Bonzini,
	Den Lunev, Andrey Gruzdev

* Peter Xu (peterx@redhat.com) wrote:
> On Thu, Nov 19, 2020 at 01:51:50PM -0500, Peter Xu wrote:
> > On Thu, Nov 19, 2020 at 03:59:34PM +0300, Andrey Gruzdev via wrote:
> > > Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> > > ---
> > >  migration/migration.c | 96 +++++++++++++++++++++++++++++++++++++++++++
> > >  migration/migration.h |  1 +
> > >  qapi/migration.json   |  7 +++-
> > >  3 files changed, 103 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index 87a9b59f83..ff0364dde0 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -56,6 +56,7 @@
> > >  #include "net/announce.h"
> > >  #include "qemu/queue.h"
> > >  #include "multifd.h"
> > > +#include "sysemu/cpus.h"
> > >  
> > >  #ifdef CONFIG_VFIO
> > >  #include "hw/vfio/vfio-common.h"
> > > @@ -1165,6 +1166,91 @@ static bool migrate_caps_check(bool *cap_list,
> > >          }
> > >      }
> > >  
> > > +    if (cap_list[MIGRATION_CAPABILITY_TRACK_WRITES_RAM]) {
> > > +        if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
> > > +            error_setg(errp,
> > > +                    "Track-writes is not compatible with postcopy-ram");
> > > +            return false;
> > > +        }
> > > +
> > > +        if (cap_list[MIGRATION_CAPABILITY_DIRTY_BITMAPS]) {
> > > +            error_setg(errp,
> > > +                    "Track-writes is not compatible with dirty-bitmaps");
> > > +            return false;
> > > +        }
> > > +
> > > +        if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_BLOCKTIME]) {
> > > +            error_setg(errp,
> > > +                    "Track-writes is not compatible with postcopy-blocktime");
> > > +            return false;
> > > +        }
> > > +
> > > +        if (cap_list[MIGRATION_CAPABILITY_LATE_BLOCK_ACTIVATE]) {
> > > +            error_setg(errp,
> > > +                    "Track-writes is not compatible with late-block-activate");
> > > +            return false;
> > > +        }
> > > +
> > > +        if (cap_list[MIGRATION_CAPABILITY_RETURN_PATH]) {
> > > +            error_setg(errp,
> > > +                    "Track-writes is not compatible with return-path");
> > > +            return false;
> > > +        }
> > > +
> > > +        if (cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
> > > +            error_setg(errp, "Track-writes is not compatible with multifd");
> > > +            return false;
> > > +        }
> > > +
> > > +        if (cap_list[MIGRATION_CAPABILITY_PAUSE_BEFORE_SWITCHOVER]) {
> > > +            error_setg(errp,
> > > +                    "Track-writes is not compatible with pause-before-switchover");
> > > +            return false;
> > > +        }
> > > +
> > > +        if (cap_list[MIGRATION_CAPABILITY_AUTO_CONVERGE]) {
> > > +            error_setg(errp,
> > > +                    "Track-writes is not compatible with auto-converge");
> > > +            return false;
> > > +        }
> > > +
> > > +        if (cap_list[MIGRATION_CAPABILITY_RELEASE_RAM]) {
> > > +            error_setg(errp,
> > > +                    "Track-writes is not compatible with release-ram");
> > > +            return false;
> > > +        }
> > > +
> > > +        if (cap_list[MIGRATION_CAPABILITY_RDMA_PIN_ALL]) {
> > > +            error_setg(errp,
> > > +                    "Track-writes is not compatible with rdma-pin-all");
> > > +            return false;
> > > +        }
> > > +
> > > +        if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
> > > +            error_setg(errp,
> > > +                    "Track-writes is not compatible with compression");
> > > +            return false;
> > > +        }
> > > +
> > > +        if (cap_list[MIGRATION_CAPABILITY_XBZRLE]) {
> > > +            error_setg(errp,
> > > +                    "Track-writes is not compatible with XBZLRE");
> > > +            return false;
> > > +        }
> > > +
> > > +        if (cap_list[MIGRATION_CAPABILITY_X_COLO]) {
> > > +            error_setg(errp,
> > > +                    "Track-writes is not compatible with x-colo");
> > > +            return false;
> > > +        }
> > > +
> > > +        if (cap_list[MIGRATION_CAPABILITY_VALIDATE_UUID]) {
> > > +            error_setg(errp,
> > > +                    "Track-writes is not compatible with validate-uuid");
> > > +            return false;
> > > +        }
> 
> Another thing I forgot to mention - we can at least define an array for live
> snapshot now, so we just loop over that one instead of copy-pasting these lines...

Yes, I think we've already got a name lookup
(MigrationCapability_lookup - that's generated during build), so if you
just have an array of MigrationCapability's you should be able to loop
over them.
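A sketch of the table-driven check being suggested, with illustrative names standing in for QEMU's MigrationCapability enum and its generated lookup table (only a subset of the incompatible capabilities is shown):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum Cap {
    CAP_POSTCOPY_RAM,
    CAP_DIRTY_BITMAPS,
    CAP_MULTIFD,
    CAP_XBZRLE,
    CAP_TRACK_WRITES_RAM,
    CAP__MAX,
};

/* Stand-in for the generated MigrationCapability_lookup table. */
static const char *cap_name[CAP__MAX] = {
    [CAP_POSTCOPY_RAM]     = "postcopy-ram",
    [CAP_DIRTY_BITMAPS]    = "dirty-bitmaps",
    [CAP_MULTIFD]          = "multifd",
    [CAP_XBZRLE]           = "xbzrle",
    [CAP_TRACK_WRITES_RAM] = "track-writes-ram",
};

/* Capabilities incompatible with write-tracking snapshots (subset). */
static const enum Cap snapshot_blockers[] = {
    CAP_POSTCOPY_RAM, CAP_DIRTY_BITMAPS, CAP_MULTIFD, CAP_XBZRLE,
};

/* Returns NULL if compatible, else the name of the offending
 * capability, so the caller can build one error_setg() message. */
static const char *check_snapshot_caps(const bool cap_list[CAP__MAX])
{
    size_t i;

    if (!cap_list[CAP_TRACK_WRITES_RAM]) {
        return NULL;
    }
    for (i = 0;
         i < sizeof(snapshot_blockers) / sizeof(snapshot_blockers[0]);
         i++) {
        if (cap_list[snapshot_blockers[i]]) {
            return cap_name[snapshot_blockers[i]];
        }
    }
    return NULL;
}
```

The dozen copy-pasted if blocks then collapse into one loop plus one error message template.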

Dave

> > > +    }
> > > +
> > >      return true;
> > >  }
> > >  
> > > @@ -2490,6 +2576,15 @@ bool migrate_use_block_incremental(void)
> > >      return s->parameters.block_incremental;
> > >  }
> > >  
> > > +bool migrate_track_writes_ram(void)
> > > +{
> > > +    MigrationState *s;
> > > +
> > > +    s = migrate_get_current();
> > > +
> > > +    return s->enabled_capabilities[MIGRATION_CAPABILITY_TRACK_WRITES_RAM];
> > > +}
> > > +
> > >  /* migration thread support */
> > >  /*
> > >   * Something bad happened to the RP stream, mark an error
> > > @@ -3783,6 +3878,7 @@ static Property migration_properties[] = {
> > >      DEFINE_PROP_MIG_CAP("x-block", MIGRATION_CAPABILITY_BLOCK),
> > >      DEFINE_PROP_MIG_CAP("x-return-path", MIGRATION_CAPABILITY_RETURN_PATH),
> > >      DEFINE_PROP_MIG_CAP("x-multifd", MIGRATION_CAPABILITY_MULTIFD),
> > > +    DEFINE_PROP_MIG_CAP("x-track-writes-ram", MIGRATION_CAPABILITY_TRACK_WRITES_RAM),
> > >  
> > >      DEFINE_PROP_END_OF_LIST(),
> > >  };
> > > diff --git a/migration/migration.h b/migration/migration.h
> > > index d096b77f74..339ae720e0 100644
> > > --- a/migration/migration.h
> > > +++ b/migration/migration.h
> > > @@ -341,6 +341,7 @@ int migrate_compress_wait_thread(void);
> > >  int migrate_decompress_threads(void);
> > >  bool migrate_use_events(void);
> > >  bool migrate_postcopy_blocktime(void);
> > > +bool migrate_track_writes_ram(void);
> > >  
> > >  /* Sending on the return path - generic and then for each message type */
> > >  void migrate_send_rp_shut(MigrationIncomingState *mis,
> > > diff --git a/qapi/migration.json b/qapi/migration.json
> > > index 3c75820527..a28d8b7ee8 100644
> > > --- a/qapi/migration.json
> > > +++ b/qapi/migration.json
> > > @@ -442,6 +442,11 @@
> > >  # @validate-uuid: Send the UUID of the source to allow the destination
> > >  #                 to ensure it is the same. (since 4.2)
> > >  #
> > > +# @track-writes-ram: If enabled, the migration stream will be a snapshot
> > > +#                    of the VM exactly at the point when the migration
> > > +#                    procedure starts. The VM RAM is saved with running VM.
> > > +#                    (since 6.0)
> > > +#
> > 
> > The name is slightly confusing to me.  Could I ask why it changed from the
> > previous one?  "snapshot" sounds like a very necessary keyword to me here and
> > says exactly what we do...  Because we can do quite a few things with
> > "track-writes-ram" other than snapshotting, e.g., calculate per-vm dirty rates.
> > 
> > -- 
> > Peter Xu
> 
> -- 
> Peter Xu
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v3 1/7] introduce 'track-writes-ram' migration capability
  2020-11-24 16:55       ` Dr. David Alan Gilbert
@ 2020-11-24 17:25         ` Andrey Gruzdev
  0 siblings, 0 replies; 44+ messages in thread
From: Andrey Gruzdev @ 2020-11-24 17:25 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Peter Xu
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Markus Armbruster

On 24.11.2020 19:55, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
>> On Thu, Nov 19, 2020 at 01:51:50PM -0500, Peter Xu wrote:
>>> On Thu, Nov 19, 2020 at 03:59:34PM +0300, Andrey Gruzdev via wrote:
>>>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
>>>> ---
>>>>   migration/migration.c | 96 +++++++++++++++++++++++++++++++++++++++++++
>>>>   migration/migration.h |  1 +
>>>>   qapi/migration.json   |  7 +++-
>>>>   3 files changed, 103 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>> index 87a9b59f83..ff0364dde0 100644
>>>> --- a/migration/migration.c
>>>> +++ b/migration/migration.c
>>>> @@ -56,6 +56,7 @@
>>>>   #include "net/announce.h"
>>>>   #include "qemu/queue.h"
>>>>   #include "multifd.h"
>>>> +#include "sysemu/cpus.h"
>>>>   
>>>>   #ifdef CONFIG_VFIO
>>>>   #include "hw/vfio/vfio-common.h"
>>>> @@ -1165,6 +1166,91 @@ static bool migrate_caps_check(bool *cap_list,
>>>>           }
>>>>       }
>>>>   
>>>> +    if (cap_list[MIGRATION_CAPABILITY_TRACK_WRITES_RAM]) {
>>>> +        if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
>>>> +            error_setg(errp,
>>>> +                    "Track-writes is not compatible with postcopy-ram");
>>>> +            return false;
>>>> +        }
>>>> +
>>>> +        if (cap_list[MIGRATION_CAPABILITY_DIRTY_BITMAPS]) {
>>>> +            error_setg(errp,
>>>> +                    "Track-writes is not compatible with dirty-bitmaps");
>>>> +            return false;
>>>> +        }
>>>> +
>>>> +        if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_BLOCKTIME]) {
>>>> +            error_setg(errp,
>>>> +                    "Track-writes is not compatible with postcopy-blocktime");
>>>> +            return false;
>>>> +        }
>>>> +
>>>> +        if (cap_list[MIGRATION_CAPABILITY_LATE_BLOCK_ACTIVATE]) {
>>>> +            error_setg(errp,
>>>> +                    "Track-writes is not compatible with late-block-activate");
>>>> +            return false;
>>>> +        }
>>>> +
>>>> +        if (cap_list[MIGRATION_CAPABILITY_RETURN_PATH]) {
>>>> +            error_setg(errp,
>>>> +                    "Track-writes is not compatible with return-path");
>>>> +            return false;
>>>> +        }
>>>> +
>>>> +        if (cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
>>>> +            error_setg(errp, "Track-writes is not compatible with multifd");
>>>> +            return false;
>>>> +        }
>>>> +
>>>> +        if (cap_list[MIGRATION_CAPABILITY_PAUSE_BEFORE_SWITCHOVER]) {
>>>> +            error_setg(errp,
>>>> +                    "Track-writes is not compatible with pause-before-switchover");
>>>> +            return false;
>>>> +        }
>>>> +
>>>> +        if (cap_list[MIGRATION_CAPABILITY_AUTO_CONVERGE]) {
>>>> +            error_setg(errp,
>>>> +                    "Track-writes is not compatible with auto-converge");
>>>> +            return false;
>>>> +        }
>>>> +
>>>> +        if (cap_list[MIGRATION_CAPABILITY_RELEASE_RAM]) {
>>>> +            error_setg(errp,
>>>> +                    "Track-writes is not compatible with release-ram");
>>>> +            return false;
>>>> +        }
>>>> +
>>>> +        if (cap_list[MIGRATION_CAPABILITY_RDMA_PIN_ALL]) {
>>>> +            error_setg(errp,
>>>> +                    "Track-writes is not compatible with rdma-pin-all");
>>>> +            return false;
>>>> +        }
>>>> +
>>>> +        if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
>>>> +            error_setg(errp,
>>>> +                    "Track-writes is not compatible with compression");
>>>> +            return false;
>>>> +        }
>>>> +
>>>> +        if (cap_list[MIGRATION_CAPABILITY_XBZRLE]) {
>>>> +            error_setg(errp,
>>>> +                    "Track-writes is not compatible with XBZRLE");
>>>> +            return false;
>>>> +        }
>>>> +
>>>> +        if (cap_list[MIGRATION_CAPABILITY_X_COLO]) {
>>>> +            error_setg(errp,
>>>> +                    "Track-writes is not compatible with x-colo");
>>>> +            return false;
>>>> +        }
>>>> +
>>>> +        if (cap_list[MIGRATION_CAPABILITY_VALIDATE_UUID]) {
>>>> +            error_setg(errp,
>>>> +                    "Track-writes is not compatible with validate-uuid");
>>>> +            return false;
>>>> +        }
>>
>> Another thing forgot to mention - we can at least define an array for live
>> snapshot now so we just loop over that one instead of copy-paste these lines...
> 
> Yes, I think we've already got a name lookup
> (MigrationCapability_lookup - that's generated during build), so if you
> just have an array of MigrationCapability's you should be able to loop
> over them.
> 
> Dave
> 

Yes, totally agree; I've already changed the code to loop through an 
array of incompatible capabilities. The names are easy to look up, found them.

Andrey

>>>> +    }
>>>> +
>>>>       return true;
>>>>   }
>>>>   
>>>> @@ -2490,6 +2576,15 @@ bool migrate_use_block_incremental(void)
>>>>       return s->parameters.block_incremental;
>>>>   }
>>>>   
>>>> +bool migrate_track_writes_ram(void)
>>>> +{
>>>> +    MigrationState *s;
>>>> +
>>>> +    s = migrate_get_current();
>>>> +
>>>> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_TRACK_WRITES_RAM];
>>>> +}
>>>> +
>>>>   /* migration thread support */
>>>>   /*
>>>>    * Something bad happened to the RP stream, mark an error
>>>> @@ -3783,6 +3878,7 @@ static Property migration_properties[] = {
>>>>       DEFINE_PROP_MIG_CAP("x-block", MIGRATION_CAPABILITY_BLOCK),
>>>>       DEFINE_PROP_MIG_CAP("x-return-path", MIGRATION_CAPABILITY_RETURN_PATH),
>>>>       DEFINE_PROP_MIG_CAP("x-multifd", MIGRATION_CAPABILITY_MULTIFD),
>>>> +    DEFINE_PROP_MIG_CAP("x-track-writes-ram", MIGRATION_CAPABILITY_TRACK_WRITES_RAM),
>>>>   
>>>>       DEFINE_PROP_END_OF_LIST(),
>>>>   };
>>>> diff --git a/migration/migration.h b/migration/migration.h
>>>> index d096b77f74..339ae720e0 100644
>>>> --- a/migration/migration.h
>>>> +++ b/migration/migration.h
>>>> @@ -341,6 +341,7 @@ int migrate_compress_wait_thread(void);
>>>>   int migrate_decompress_threads(void);
>>>>   bool migrate_use_events(void);
>>>>   bool migrate_postcopy_blocktime(void);
>>>> +bool migrate_track_writes_ram(void);
>>>>   
>>>>   /* Sending on the return path - generic and then for each message type */
>>>>   void migrate_send_rp_shut(MigrationIncomingState *mis,
>>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>>> index 3c75820527..a28d8b7ee8 100644
>>>> --- a/qapi/migration.json
>>>> +++ b/qapi/migration.json
>>>> @@ -442,6 +442,11 @@
>>>>   # @validate-uuid: Send the UUID of the source to allow the destination
>>>>   #                 to ensure it is the same. (since 4.2)
>>>>   #
>>>> +# @track-writes-ram: If enabled, the migration stream will be a snapshot
>>>> +#                    of the VM exactly at the point when the migration
>>>> +#                    procedure starts. The VM RAM is saved with running VM.
>>>> +#                    (since 6.0)
>>>> +#
>>>
>>> The name is slightly confusing to me.  Could I ask why changed from previous
>>> one?  "snapshot" sounds a very necessary keyword to me here and tells exactly
>>> on what we do...  Because we can do quite a few things with "track-writes-ram"
>>> but not snapshotting, e.g., to calculate per-vm dirty rates.
>>>
>>> -- 
>>> Peter Xu
>>
>> -- 
>> Peter Xu
>>


-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 3/7] support UFFD write fault processing in ram_save_iterate()
  2020-11-24 15:17                   ` Peter Xu
@ 2020-11-24 17:40                     ` Andrey Gruzdev
  0 siblings, 0 replies; 44+ messages in thread
From: Andrey Gruzdev @ 2020-11-24 17:40 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster

On 24.11.2020 18:17, Peter Xu wrote:
> On Tue, Nov 24, 2020 at 11:02:09AM +0300, Andrey Gruzdev wrote:
>> On 24.11.2020 00:34, Peter Xu wrote:
>>> On Fri, Nov 20, 2020 at 07:53:34PM +0300, Andrey Gruzdev wrote:
>>>> On 20.11.2020 19:43, Peter Xu wrote:
>>>>> On Fri, Nov 20, 2020 at 07:15:07PM +0300, Andrey Gruzdev wrote:
>>>>>> Yeah, I think we can re-use the postcopy queue code for faulting pages. I'm
>>>>>> worrying a little about some additional overhead dealing with urgent request
>>>>>> semaphore. Also, the code won't change a lot, something like:
>>>>>>
>>>>>> [...]
>>>>>>            /* In case of 'write-tracking' migration we first try
>>>>>>             * to poll UFFD and see if we have a write page fault event */
>>>>>>            poll_fault_page(rs);
>>>>>>
>>>>>>            again = true;
>>>>>>            found = get_queued_page(rs, &pss);
>>>>>>
>>>>>>            if (!found) {
>>>>>>                /* priority queue empty, so just search for something dirty */
>>>>>>                found = find_dirty_block(rs, &pss, &again);
>>>>>>            }
>>>>>> [...]
>>>>>
>>>>> Could I ask what's the "urgent request semaphore"?  Thanks,
>>>>>
>>>>
>>>> These functions use it (the correct name is 'rate_limit_sem'):
>>>>
>>>> void migration_make_urgent_request(void)
>>>> {
>>>>       qemu_sem_post(&migrate_get_current()->rate_limit_sem);
>>>> }
>>>>
>>>> void migration_consume_urgent_request(void)
>>>> {
>>>>       qemu_sem_wait(&migrate_get_current()->rate_limit_sem);
>>>> }
>>>>
>>>> They are called from ram_save_queue_pages and unqueue_page, respectively, to
>>>> control the migration rate limiter.
>>>>
>>>> bool migration_rate_limit(void)
>>>> {
>>>> [...]
>>>>           /*
>>>>            * Wait for a delay to do rate limiting OR
>>>>            * something urgent to post the semaphore.
>>>>            */
>>>>           int ms = s->iteration_start_time + BUFFER_DELAY - now;
>>>>           trace_migration_rate_limit_pre(ms);
>>>>           if (qemu_sem_timedwait(&s->rate_limit_sem, ms) == 0) {
>>>>               /*
>>>>                * We were woken by one or more urgent things but
>>>>                * the timedwait will have consumed one of them.
>>>>                * The service routine for the urgent wake will dec
>>>>                * the semaphore itself for each item it consumes,
>>>>                * so add this one we just eat back.
>>>>                */
>>>>               qemu_sem_post(&s->rate_limit_sem);
>>>>               urgent = true;
>>>>           }
>>>> [...]
>>>> }
>>>>
>>>
>>> Hmm... Why its overhead could be a problem?  If it's an overhead that can be
>>> avoided, then postcopy might want that too.
>>>
>>> The thing is I really feel like the snapshot logic can leverage the whole idea
>>> of existing postcopy (like get_queued_page/unqueue_page; it's probably due to
>>> the fact that both of them want to "migrate some more urgent pages than the
>>> background migration, due to either missing/wrprotected pages"), but I might
>>> have something missing.
>>>
>>> Thanks,
>>>
>>
>> I don't think this overhead itself is a problem since it's negligible
>> compared to other stuff.. You're undoubtly correct about using postcopy idea
>> in case when wr-fault pages arrive from the separate thread or external
>> source. Then we really need to decouple that separate thread or external
>> requestor from the migration thread.
>>
>> In this patch series wr-faults arrive in the same migration loop with normal
scan, so I don't see a direct reason to pass it through the queue, but I might
>> have missed something from your proposal.
> 
> I see your point.  For me whether using a thread is not extremely important -
> actually we can have a standalone thread to handle the vcpu faults too just
> like postcopy; it's just run on the src host.  My major point is that we should
> avoid introducing the extra pss logic because fundamentally it's doing the same
> thing as get_queued_page() right now, unless I'm wrong...
> 
> So I think your previous poll_fault_page() call is okay; assuming the same poll
> model as you used, I mean something like this:
> 
> ------8<-------
> diff --git a/migration/ram.c b/migration/ram.c
> index 7811cde643..718dd276c7 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1473,6 +1473,14 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
>   
>       } while (block && !dirty);
>   
> +    if (!block) {
> +        /*
> +         * Set the block too if live snapshot is enabled; that's when we have
> +         * vcpus got blocked by the wr-protected pages.
> +         */
> +        block = poll_fault_page(rs, &offset);
> +    }
> +
>       if (block) {
>           /*
>            * As soon as we start servicing pages out of order, then we have
> ------8<-------
> 
> So as long as we set the block and offset, pss will be updated just like
> postcopy.  That's exactly what we want, am I right?
> 

I also think this is the best place to put the polling call, so we don't 
have to touch pss yet again.

>>
>> Do you mean that if we use postcopy logic, we'll allocate memory and make
>> copies of faulting pages and then immediately un-protect them?
>> Or we just put faulting page address to the queued item and release
>> protection after page content has been saved.
> 
> I think current approach would be fine, so we don't copy page and unprotect at
> a single place after the page is dumped to the precopy stream.  If you could
> measure the latencies then it'll be even better, then we'll know what to expect.
> 
> IIRC the "copy page first" idea came from Denis's initial version, in which I
> thought as too aggressive since potentially it can eat twice the memory on the
> host for the single guest, not to mention when live snapshot is taken for
> multiple guests on the same host.  It's just not something that can be directly
> expected when the user wants to take a snapshot.  That's something we can do
> upon this work, imho, if we'll have very high latency on resolving the page
> faults. But that's still the last thing to try (or at least an "by default off"
> option) so we can still think about some other solution before trying to eat
> too much host mem.
> 
> Thanks,
> 

I totally agree that creating shadow copies for each faulting page is 
not a very robust idea, because we can never know the 'proper' limit 
for the buffer memory we should allocate and carve out of the host; it 
depends on hardware, guest workload, host I/O intensity, too many things.

So I believe the file/block backend side is a better place to provide 
buffering and keep the memory/latency tradeoff in check. It also allows 
us to be less 'invasive' to the QEMU code itself.

Thanks,

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 2/7] introduce UFFD-WP low-level interface helpers
  2020-11-19 12:59 ` [PATCH v3 2/7] introduce UFFD-WP low-level interface helpers Andrey Gruzdev via
  2020-11-19 18:39   ` Peter Xu
@ 2020-11-24 17:57   ` Dr. David Alan Gilbert
  2020-11-25  8:11     ` Andrey Gruzdev
  1 sibling, 1 reply; 44+ messages in thread
From: Dr. David Alan Gilbert @ 2020-11-24 17:57 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, qemu-devel, Peter Xu, Markus Armbruster,
	Paolo Bonzini, Den Lunev

* Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> Implemented support for the whole RAM block memory
> protection/un-protection. Introduced higher level
> ram_write_tracking_start() and ram_write_tracking_stop()
> to start/stop tracking guest memory writes.
> 
> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> ---
>  include/exec/memory.h |   7 ++
>  migration/ram.c       | 267 ++++++++++++++++++++++++++++++++++++++++++
>  migration/ram.h       |   4 +
>  3 files changed, 278 insertions(+)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 0f3e6bcd5e..3d798fce16 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -139,6 +139,13 @@ typedef struct IOMMUNotifier IOMMUNotifier;
>  /* RAM is a persistent kind memory */
>  #define RAM_PMEM (1 << 5)
>  
> +/*
> + * UFFDIO_WRITEPROTECT is used on this RAMBlock to
> + * support 'write-tracking' migration type.
> + * Implies ram_state->ram_wt_enabled.
> + */
> +#define RAM_UF_WRITEPROTECT (1 << 6)
> +
>  static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
>                                         IOMMUNotifierFlag flags,
>                                         hwaddr start, hwaddr end,
> diff --git a/migration/ram.c b/migration/ram.c
> index 7811cde643..7f273c9996 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -56,6 +56,12 @@
>  #include "savevm.h"
>  #include "qemu/iov.h"
>  #include "multifd.h"
> +#include <inttypes.h>
> +#include <poll.h>
> +#include <sys/syscall.h>
> +#include <sys/ioctl.h>
> +#include <linux/userfaultfd.h>
> +#include "sysemu/runstate.h"
>  
>  /***********************************************************/
>  /* ram save/restore */
> @@ -298,6 +304,8 @@ struct RAMSrcPageRequest {
>  struct RAMState {
>      /* QEMUFile used for this migration */
>      QEMUFile *f;
> +    /* UFFD file descriptor, used in 'write-tracking' migration */
> +    int uffdio_fd;
>      /* Last block that we have visited searching for dirty pages */
>      RAMBlock *last_seen_block;
>      /* Last block from where we have sent data */
> @@ -453,6 +461,181 @@ static QemuThread *decompress_threads;
>  static QemuMutex decomp_done_lock;
>  static QemuCond decomp_done_cond;
>  
> +/**
> + * uffd_create_fd: create UFFD file descriptor
> + *
> + * Returns non-negative file descriptor or negative value in case of an error
> + */
> +static int uffd_create_fd(void)
> +{
> +    int uffd;
> +    struct uffdio_api api_struct;
> +    uint64_t ioctl_mask = BIT(_UFFDIO_REGISTER) | BIT(_UFFDIO_UNREGISTER);

You need to be a bit careful about doing this in migration/ram.c - it's
generic code; at minimum it needs ifdef'ing for Linux.

> +    uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> +    if (uffd < 0) {
> +        error_report("uffd_create_fd() failed: UFFD not supported");
> +        return -1;
> +    }
> +
> +    api_struct.api = UFFD_API;
> +    api_struct.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
> +    if (ioctl(uffd, UFFDIO_API, &api_struct)) {
> +        error_report("uffd_create_fd() failed: "
> +                "API version not supported version=%llx errno=%i",
> +                api_struct.api, errno);
> +        goto fail;
> +    }
> +
> +    if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) {
> +        error_report("uffd_create_fd() failed: "
> +                "PAGEFAULT_FLAG_WP feature missing");
> +        goto fail;
> +    }
> +
> +    return uffd;

Should we be putting that somewher that we can share with postcopy?

> +fail:
> +    close(uffd);
> +    return -1;
> +}
> +
> +/**
> + * uffd_close_fd: close UFFD file descriptor
> + *
> + * @uffd: UFFD file descriptor
> + */
> +static void uffd_close_fd(int uffd)
> +{
> +    assert(uffd >= 0);
> +    close(uffd);
> +}
> +
> +/**
> + * uffd_register_memory: register memory range with UFFD
> + *
> + * Returns 0 in case of success, negative value on error
> + *
> + * @uffd: UFFD file descriptor
> + * @start: starting virtual address of memory range
> + * @length: length of memory range
> + * @track_missing: generate events on missing-page faults
> + * @track_wp: generate events on write-protected-page faults
> + */
> +static int uffd_register_memory(int uffd, hwaddr start, hwaddr length,
> +        bool track_missing, bool track_wp)
> +{
> +    struct uffdio_register uffd_register;
> +
> +    uffd_register.range.start = start;
> +    uffd_register.range.len = length;
> +    uffd_register.mode = (track_missing ? UFFDIO_REGISTER_MODE_MISSING : 0) |
> +                         (track_wp ? UFFDIO_REGISTER_MODE_WP : 0);
> +
> +    if (ioctl(uffd, UFFDIO_REGISTER, &uffd_register)) {
> +        error_report("uffd_register_memory() failed: "
> +                "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
> +                start, length, uffd_register.mode, errno);
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +/**
> + * uffd_protect_memory: protect/unprotect memory range for writes with UFFD
> + *
> + * Returns 0 on success or negative value in case of error
> + *
> + * @uffd: UFFD file descriptor
> + * @start: starting virtual address of memory range
> + * @length: length of memory range
> + * @wp: write-protect/unprotect
> + */
> +static int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp)
> +{
> +    struct uffdio_writeprotect uffd_writeprotect;
> +    int res;
> +
> +    uffd_writeprotect.range.start = start;
> +    uffd_writeprotect.range.len = length;
> +    uffd_writeprotect.mode = (wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0);
> +
> +    do {
> +        res = ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect);
> +    } while (res < 0 && errno == EINTR);
> +    if (res < 0) {
> +        error_report("uffd_protect_memory() failed: "
> +                "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
> +                start, length, uffd_writeprotect.mode, errno);
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +__attribute__ ((unused))
> +static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count);
> +__attribute__ ((unused))
> +static bool uffd_poll_events(int uffd, int tmo);
> +
> +/**
> + * uffd_read_events: read pending UFFD events
> + *
> + * Returns number of fetched messages, 0 if none are available or
> + * negative value in case of an error
> + *
> + * @uffd: UFFD file descriptor
> + * @msgs: pointer to message buffer
> + * @count: number of messages that can fit in the buffer
> + */
> +static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count)
> +{
> +    ssize_t res;
> +    do {
> +        res = read(uffd, msgs, count * sizeof(struct uffd_msg));
> +    } while (res < 0 && errno == EINTR);
> +
> +    if ((res < 0 && errno == EAGAIN)) {
> +        return 0;
> +    }
> +    if (res < 0) {
> +        error_report("uffd_read_events() failed: errno=%i", errno);
> +        return -1;
> +    }
> +
> +    return (int) (res / sizeof(struct uffd_msg));
> +}
> +
> +/**
> + * uffd_poll_events: poll UFFD file descriptor for read
> + *
> + * Returns true if events are available for read, false otherwise
> + *
> + * @uffd: UFFD file descriptor
> + * @tmo: timeout in milliseconds, 0 for non-blocking operation,
> + *       negative value for infinite wait
> + */
> +static bool uffd_poll_events(int uffd, int tmo)
> +{
> +    int res;
> +    struct pollfd poll_fd = { .fd = uffd, .events = POLLIN, .revents = 0 };
> +
> +    do {
> +        res = poll(&poll_fd, 1, tmo);
> +    } while (res < 0 && errno == EINTR);
> +
> +    if (res == 0) {
> +        return false;
> +    }
> +    if (res < 0) {
> +        error_report("uffd_poll_events() failed: errno=%i", errno);
> +        return false;
> +    }
> +
> +    return (poll_fd.revents & POLLIN) != 0;
> +}
> +
>  static bool do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block,
>                                   ram_addr_t offset, uint8_t *source_buf);
>  
> @@ -3788,6 +3971,90 @@ static int ram_resume_prepare(MigrationState *s, void *opaque)
>      return 0;
>  }
>  
> +/**
> + * ram_write_tracking_start: start UFFD-WP memory tracking
> + *
> + * Returns 0 for success or negative value in case of error
> + *
> + */
> +int ram_write_tracking_start(void)
> +{
> +    int uffd;
> +    RAMState *rs = ram_state;
> +    RAMBlock *bs;
> +
> +    /* Open UFFD file descriptor */
> +    uffd = uffd_create_fd();
> +    if (uffd < 0) {
> +        return uffd;
> +    }
> +    rs->uffdio_fd = uffd;
> +
> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> +        /* Nothing to do with read-only and MMIO-writable regions */
> +        if (bs->mr->readonly || bs->mr->rom_device) {
> +            continue;
> +        }
> +
> +        /* Register block memory with UFFD to track writes */
> +        if (uffd_register_memory(rs->uffdio_fd, (hwaddr) bs->host,
> +                bs->max_length, false, true)) {
> +            goto fail;
> +        }
> +        /* Apply UFFD write protection to the block memory range */
> +        if (uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host,
> +                bs->max_length, true)) {
> +            goto fail;
> +        }
> +        bs->flags |= RAM_UF_WRITEPROTECT;
> +
> +        info_report("UFFD-WP write-tracking enabled: "
> +                "block_id=%s page_size=%zu start=%p length=%lu "
> +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
> +                bs->idstr, bs->page_size, bs->host, bs->max_length,
> +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
> +                bs->mr->nonvolatile, bs->mr->rom_device);
> +    }
> +
> +    return 0;
> +
> +fail:
> +    uffd_close_fd(uffd);
> +    rs->uffdio_fd = -1;
> +    return -1;
> +}
> +
> +/**
> + * ram_write_tracking_stop: stop UFFD-WP memory tracking and remove protection
> + */
> +void ram_write_tracking_stop(void)
> +{
> +    RAMState *rs = ram_state;
> +    RAMBlock *bs;
> +    assert(rs->uffdio_fd >= 0);
> +
> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
> +            continue;
> +        }
> +        info_report("UFFD-WP write-tracking disabled: "
> +                "block_id=%s page_size=%zu start=%p length=%lu "
> +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
> +                bs->idstr, bs->page_size, bs->host, bs->max_length,
> +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
> +                bs->mr->nonvolatile, bs->mr->rom_device);
> +        /* Cleanup flags */
> +        bs->flags &= ~RAM_UF_WRITEPROTECT;
> +    }
> +
> +    /*
> +     * Close UFFD file descriptor to remove protection,
> +     * release registered memory regions and flush wait queues
> +     */
> +    uffd_close_fd(rs->uffdio_fd);
> +    rs->uffdio_fd = -1;
> +}
> +
>  static SaveVMHandlers savevm_ram_handlers = {
>      .save_setup = ram_save_setup,
>      .save_live_iterate = ram_save_iterate,
> diff --git a/migration/ram.h b/migration/ram.h
> index 011e85414e..3611cb51de 100644
> --- a/migration/ram.h
> +++ b/migration/ram.h
> @@ -79,4 +79,8 @@ void colo_flush_ram_cache(void);
>  void colo_release_ram_cache(void);
>  void colo_incoming_start_dirty_log(void);
>  
> +/* Live snapshots */
> +int ram_write_tracking_start(void);
> +void ram_write_tracking_stop(void);
> +
>  #endif
> -- 
> 2.25.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 2/7] introduce UFFD-WP low-level interface helpers
  2020-11-24 17:57   ` Dr. David Alan Gilbert
@ 2020-11-25  8:11     ` Andrey Gruzdev
  2020-11-25 18:43       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 44+ messages in thread
From: Andrey Gruzdev @ 2020-11-25  8:11 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Markus Armbruster, Peter Xu

On 24.11.2020 20:57, Dr. David Alan Gilbert wrote:
> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>> Implemented support for the whole RAM block memory
>> protection/un-protection. Introduced higher level
>> ram_write_tracking_start() and ram_write_tracking_stop()
>> to start/stop tracking guest memory writes.
>>
>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
>> ---
>>   include/exec/memory.h |   7 ++
>>   migration/ram.c       | 267 ++++++++++++++++++++++++++++++++++++++++++
>>   migration/ram.h       |   4 +
>>   3 files changed, 278 insertions(+)
>>
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index 0f3e6bcd5e..3d798fce16 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -139,6 +139,13 @@ typedef struct IOMMUNotifier IOMMUNotifier;
>>   /* RAM is a persistent kind memory */
>>   #define RAM_PMEM (1 << 5)
>>   
>> +/*
>> + * UFFDIO_WRITEPROTECT is used on this RAMBlock to
>> + * support 'write-tracking' migration type.
>> + * Implies ram_state->ram_wt_enabled.
>> + */
>> +#define RAM_UF_WRITEPROTECT (1 << 6)
>> +
>>   static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
>>                                          IOMMUNotifierFlag flags,
>>                                          hwaddr start, hwaddr end,
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 7811cde643..7f273c9996 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -56,6 +56,12 @@
>>   #include "savevm.h"
>>   #include "qemu/iov.h"
>>   #include "multifd.h"
>> +#include <inttypes.h>
>> +#include <poll.h>
>> +#include <sys/syscall.h>
>> +#include <sys/ioctl.h>
>> +#include <linux/userfaultfd.h>
>> +#include "sysemu/runstate.h"
>>   
>>   /***********************************************************/
>>   /* ram save/restore */
>> @@ -298,6 +304,8 @@ struct RAMSrcPageRequest {
>>   struct RAMState {
>>       /* QEMUFile used for this migration */
>>       QEMUFile *f;
>> +    /* UFFD file descriptor, used in 'write-tracking' migration */
>> +    int uffdio_fd;
>>       /* Last block that we have visited searching for dirty pages */
>>       RAMBlock *last_seen_block;
>>       /* Last block from where we have sent data */
>> @@ -453,6 +461,181 @@ static QemuThread *decompress_threads;
>>   static QemuMutex decomp_done_lock;
>>   static QemuCond decomp_done_cond;
>>   
>> +/**
>> + * uffd_create_fd: create UFFD file descriptor
>> + *
>> + * Returns non-negative file descriptor or negative value in case of an error
>> + */
>> +static int uffd_create_fd(void)
>> +{
>> +    int uffd;
>> +    struct uffdio_api api_struct;
>> +    uint64_t ioctl_mask = BIT(_UFFDIO_REGISTER) | BIT(_UFFDIO_UNREGISTER);
> 
> You need to be a bit careful about doing this in migration/ram.c - it's
> generic code; at minimum it needs ifdef'ing for Linux.
> 

Yes, it's totally Linux-specific; I think it's better to move this code 
out of migration/ram.c, as Peter proposed.

>> +    uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
>> +    if (uffd < 0) {
>> +        error_report("uffd_create_fd() failed: UFFD not supported");
>> +        return -1;
>> +    }
>> +
>> +    api_struct.api = UFFD_API;
>> +    api_struct.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
>> +    if (ioctl(uffd, UFFDIO_API, &api_struct)) {
>> +        error_report("uffd_create_fd() failed: "
>> +                "API version not supported version=%llx errno=%i",
>> +                api_struct.api, errno);
>> +        goto fail;
>> +    }
>> +
>> +    if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) {
>> +        error_report("uffd_create_fd() failed: "
>> +                "PAGEFAULT_FLAG_WP feature missing");
>> +        goto fail;
>> +    }
>> +
>> +    return uffd;
> 
> Should we be putting that somewher that we can share with postcopy?
> 

Sure, maybe to util/uffd-wp.c + include/qemu/uffd-wp.h.
What do you think?

>> +fail:
>> +    close(uffd);
>> +    return -1;
>> +}
>> +
>> +/**
>> + * uffd_close_fd: close UFFD file descriptor
>> + *
>> + * @uffd: UFFD file descriptor
>> + */
>> +static void uffd_close_fd(int uffd)
>> +{
>> +    assert(uffd >= 0);
>> +    close(uffd);
>> +}
>> +
>> +/**
>> + * uffd_register_memory: register memory range with UFFD
>> + *
>> + * Returns 0 in case of success, negative value on error
>> + *
>> + * @uffd: UFFD file descriptor
>> + * @start: starting virtual address of memory range
>> + * @length: length of memory range
>> + * @track_missing: generate events on missing-page faults
>> + * @track_wp: generate events on write-protected-page faults
>> + */
>> +static int uffd_register_memory(int uffd, hwaddr start, hwaddr length,
>> +        bool track_missing, bool track_wp)
>> +{
>> +    struct uffdio_register uffd_register;
>> +
>> +    uffd_register.range.start = start;
>> +    uffd_register.range.len = length;
>> +    uffd_register.mode = (track_missing ? UFFDIO_REGISTER_MODE_MISSING : 0) |
>> +                         (track_wp ? UFFDIO_REGISTER_MODE_WP : 0);
>> +
>> +    if (ioctl(uffd, UFFDIO_REGISTER, &uffd_register)) {
>> +        error_report("uffd_register_memory() failed: "
>> +                "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
>> +                start, length, uffd_register.mode, errno);
>> +        return -1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +/**
>> + * uffd_protect_memory: protect/unprotect memory range for writes with UFFD
>> + *
>> + * Returns 0 on success or negative value in case of error
>> + *
>> + * @uffd: UFFD file descriptor
>> + * @start: starting virtual address of memory range
>> + * @length: length of memory range
>> + * @wp: write-protect/unprotect
>> + */
>> +static int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp)
>> +{
>> +    struct uffdio_writeprotect uffd_writeprotect;
>> +    int res;
>> +
>> +    uffd_writeprotect.range.start = start;
>> +    uffd_writeprotect.range.len = length;
>> +    uffd_writeprotect.mode = (wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0);
>> +
>> +    do {
>> +        res = ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect);
>> +    } while (res < 0 && errno == EINTR);
>> +    if (res < 0) {
>> +        error_report("uffd_protect_memory() failed: "
>> +                "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
>> +                start, length, uffd_writeprotect.mode, errno);
>> +        return -1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +__attribute__ ((unused))
>> +static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count);
>> +__attribute__ ((unused))
>> +static bool uffd_poll_events(int uffd, int tmo);
>> +
>> +/**
>> + * uffd_read_events: read pending UFFD events
>> + *
>> + * Returns number of fetched messages, 0 if none are available or
>> + * negative value in case of an error
>> + *
>> + * @uffd: UFFD file descriptor
>> + * @msgs: pointer to message buffer
>> + * @count: number of messages that can fit in the buffer
>> + */
>> +static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count)
>> +{
>> +    ssize_t res;
>> +    do {
>> +        res = read(uffd, msgs, count * sizeof(struct uffd_msg));
>> +    } while (res < 0 && errno == EINTR);
>> +
>> +    if ((res < 0 && errno == EAGAIN)) {
>> +        return 0;
>> +    }
>> +    if (res < 0) {
>> +        error_report("uffd_read_events() failed: errno=%i", errno);
>> +        return -1;
>> +    }
>> +
>> +    return (int) (res / sizeof(struct uffd_msg));
>> +}
>> +
>> +/**
>> + * uffd_poll_events: poll UFFD file descriptor for read
>> + *
>> + * Returns true if events are available for read, false otherwise
>> + *
>> + * @uffd: UFFD file descriptor
>> + * @tmo: timeout in milliseconds, 0 for non-blocking operation,
>> + *       negative value for infinite wait
>> + */
>> +static bool uffd_poll_events(int uffd, int tmo)
>> +{
>> +    int res;
>> +    struct pollfd poll_fd = { .fd = uffd, .events = POLLIN, .revents = 0 };
>> +
>> +    do {
>> +        res = poll(&poll_fd, 1, tmo);
>> +    } while (res < 0 && errno == EINTR);
>> +
>> +    if (res == 0) {
>> +        return false;
>> +    }
>> +    if (res < 0) {
>> +        error_report("uffd_poll_events() failed: errno=%i", errno);
>> +        return false;
>> +    }
>> +
>> +    return (poll_fd.revents & POLLIN) != 0;
>> +}
>> +
>>   static bool do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block,
>>                                    ram_addr_t offset, uint8_t *source_buf);
>>   
>> @@ -3788,6 +3971,90 @@ static int ram_resume_prepare(MigrationState *s, void *opaque)
>>       return 0;
>>   }
>>   
>> +/**
>> + * ram_write_tracking_start: start UFFD-WP memory tracking
>> + *
>> + * Returns 0 for success or negative value in case of error
>> + *
>> + */
>> +int ram_write_tracking_start(void)
>> +{
>> +    int uffd;
>> +    RAMState *rs = ram_state;
>> +    RAMBlock *bs;
>> +
>> +    /* Open UFFD file descriptor */
>> +    uffd = uffd_create_fd();
>> +    if (uffd < 0) {
>> +        return uffd;
>> +    }
>> +    rs->uffdio_fd = uffd;
>> +
>> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
>> +        /* Nothing to do with read-only and MMIO-writable regions */
>> +        if (bs->mr->readonly || bs->mr->rom_device) {
>> +            continue;
>> +        }
>> +
>> +        /* Register block memory with UFFD to track writes */
>> +        if (uffd_register_memory(rs->uffdio_fd, (hwaddr) bs->host,
>> +                bs->max_length, false, true)) {
>> +            goto fail;
>> +        }
>> +        /* Apply UFFD write protection to the block memory range */
>> +        if (uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host,
>> +                bs->max_length, true)) {
>> +            goto fail;
>> +        }
>> +        bs->flags |= RAM_UF_WRITEPROTECT;
>> +
>> +        info_report("UFFD-WP write-tracking enabled: "
>> +                "block_id=%s page_size=%zu start=%p length=%lu "
>> +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
>> +                bs->idstr, bs->page_size, bs->host, bs->max_length,
>> +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
>> +                bs->mr->nonvolatile, bs->mr->rom_device);
>> +    }
>> +
>> +    return 0;
>> +
>> +fail:
>> +    uffd_close_fd(uffd);
>> +    rs->uffdio_fd = -1;
>> +    return -1;
>> +}
>> +
>> +/**
>> + * ram_write_tracking_stop: stop UFFD-WP memory tracking and remove protection
>> + */
>> +void ram_write_tracking_stop(void)
>> +{
>> +    RAMState *rs = ram_state;
>> +    RAMBlock *bs;
>> +    assert(rs->uffdio_fd >= 0);
>> +
>> +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
>> +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
>> +            continue;
>> +        }
>> +        info_report("UFFD-WP write-tracking disabled: "
>> +                "block_id=%s page_size=%zu start=%p length=%lu "
>> +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
>> +                bs->idstr, bs->page_size, bs->host, bs->max_length,
>> +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
>> +                bs->mr->nonvolatile, bs->mr->rom_device);
>> +        /* Cleanup flags */
>> +        bs->flags &= ~RAM_UF_WRITEPROTECT;
>> +    }
>> +
>> +    /*
>> +     * Close UFFD file descriptor to remove protection,
>> +     * release registered memory regions and flush wait queues
>> +     */
>> +    uffd_close_fd(rs->uffdio_fd);
>> +    rs->uffdio_fd = -1;
>> +}
>> +
>>   static SaveVMHandlers savevm_ram_handlers = {
>>       .save_setup = ram_save_setup,
>>       .save_live_iterate = ram_save_iterate,
>> diff --git a/migration/ram.h b/migration/ram.h
>> index 011e85414e..3611cb51de 100644
>> --- a/migration/ram.h
>> +++ b/migration/ram.h
>> @@ -79,4 +79,8 @@ void colo_flush_ram_cache(void);
>>   void colo_release_ram_cache(void);
>>   void colo_incoming_start_dirty_log(void);
>>   
>> +/* Live snapshots */
>> +int ram_write_tracking_start(void);
>> +void ram_write_tracking_stop(void);
>> +
>>   #endif
>> -- 
>> 2.25.1
>>


-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 3/7] support UFFD write fault processing in ram_save_iterate()
  2020-11-19 12:59 ` [PATCH v3 3/7] support UFFD write fault processing in ram_save_iterate() Andrey Gruzdev via
  2020-11-19 18:25   ` Peter Xu
@ 2020-11-25 13:08   ` Dr. David Alan Gilbert
  2020-11-25 14:40     ` Andrey Gruzdev
  1 sibling, 1 reply; 44+ messages in thread
From: Dr. David Alan Gilbert @ 2020-11-25 13:08 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, qemu-devel, Peter Xu, Markus Armbruster,
	Paolo Bonzini, Den Lunev

* Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> In this particular implementation the same single migration
> thread is responsible for both normal linear dirty page
> migration and processing UFFD page fault events.
> 
> Processing write faults includes reading UFFD file descriptor,
> finding respective RAM block and saving faulting page to
> the migration stream. After page has been saved, write protection
> can be removed. Since asynchronous version of qemu_put_buffer()
> is expected to be used to save pages, we also have to flush
> migration stream prior to un-protecting saved memory range.
> 
> Write protection is being removed for any previously protected
> memory chunk that has hit the migration stream. That's valid
> for pages from linear page scan along with write fault pages.
> 
> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> ---
>  migration/ram.c | 124 ++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 115 insertions(+), 9 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 7f273c9996..08a1d7a252 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -314,6 +314,8 @@ struct RAMState {
>      ram_addr_t last_page;
>      /* last ram version we have seen */
>      uint32_t last_version;
> +    /* 'write-tracking' migration is enabled */
> +    bool ram_wt_enabled;
>      /* We are in the first round */
>      bool ram_bulk_stage;
>      /* The free page optimization is enabled */
> @@ -574,8 +576,6 @@ static int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp)
>      return 0;
>  }
>  
> -__attribute__ ((unused))
> -static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count);
>  __attribute__ ((unused))
>  static bool uffd_poll_events(int uffd, int tmo);
>  
> @@ -1929,6 +1929,86 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
>      return pages;
>  }
>  
> +/**
> + * ram_find_block_by_host_address: find RAM block containing host page
> + *
> + * Returns true if RAM block is found and pss->block/page are
> + * pointing to the given host page, false in case of an error
> + *
> + * @rs: current RAM state
> + * @pss: page-search-status structure
> + */
> +static bool ram_find_block_by_host_address(RAMState *rs, PageSearchStatus *pss,
> +        hwaddr page_address)
> +{
> +    bool found = false;
> +
> +    pss->block = rs->last_seen_block;
> +    do {
> +        if (page_address >= (hwaddr) pss->block->host &&
> +            (page_address + TARGET_PAGE_SIZE) <=
> +                    ((hwaddr) pss->block->host + pss->block->used_length)) {
> +            pss->page = (unsigned long)
> +                    ((page_address - (hwaddr) pss->block->host) >> TARGET_PAGE_BITS);
> +            found = true;
> +            break;
> +        }
> +
> +        pss->block = QLIST_NEXT_RCU(pss->block, next);
> +        if (!pss->block) {
> +            /* Hit the end of the list */
> +            pss->block = QLIST_FIRST_RCU(&ram_list.blocks);
> +        }
> +    } while (pss->block != rs->last_seen_block);
> +
> +    rs->last_seen_block = pss->block;
> +    /*
> +     * Since we are in the same loop with ram_find_and_save_block(),
> +     * need to reset pss->complete_round after switching to
> +     * other block/page in pss.
> +     */
> +    pss->complete_round = false;
> +
> +    return found;
> +}
> +
> +/**
> + * get_fault_page: try to get next UFFD write fault page and, if pending fault
> + *   is found, put info about RAM block and block page into pss structure
> + *
> + * Returns true in case of UFFD write fault detected, false otherwise
> + *
> + * @rs: current RAM state
> + * @pss: page-search-status structure
> + *
> + */
> +static bool get_fault_page(RAMState *rs, PageSearchStatus *pss)
> +{
> +    struct uffd_msg uffd_msg;
> +    hwaddr page_address;
> +    int res;
> +
> +    if (!rs->ram_wt_enabled) {
> +        return false;
> +    }
> +
> +    res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1);
> +    if (res <= 0) {
> +        return false;
> +    }
> +
> +    page_address = uffd_msg.arg.pagefault.address;
> +    if (!ram_find_block_by_host_address(rs, pss, page_address)) {
> +        /* In case we couldn't find respective block, just unprotect faulting page */
> +        uffd_protect_memory(rs->uffdio_fd, page_address, TARGET_PAGE_SIZE, false);
> +        error_report("ram_find_block_by_host_address() failed: address=0x%0lx",
> +                     page_address);
> +        return false;
> +    }
> +
> +    return true;
> +}
> +
>  /**
>   * ram_find_and_save_block: finds a dirty page and sends it to f
>   *
> @@ -1955,25 +2035,50 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
>          return pages;
>      }
>  
> +    if (!rs->last_seen_block) {
> +        rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks);
> +    }
>      pss.block = rs->last_seen_block;
>      pss.page = rs->last_page;
>      pss.complete_round = false;
>  
> -    if (!pss.block) {
> -        pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
> -    }
> -
>      do {
> +        ram_addr_t page;
> +        ram_addr_t page_to;
> +
>          again = true;
> -        found = get_queued_page(rs, &pss);
> -
> +        /* In case of 'write-tracking' migration we first try
> +         * to poll UFFD and get write page fault event */
> +        found = get_fault_page(rs, &pss);
> +        if (!found) {
> +            /* Now try to fetch something from the priority queue */
> +            found = get_queued_page(rs, &pss);
> +        }
>          if (!found) {
>              /* priority queue empty, so just search for something dirty */
>              found = find_dirty_block(rs, &pss, &again);
>          }
>  
>          if (found) {
> +            page = pss.page;
>              pages = ram_save_host_page(rs, &pss, last_stage);
> +            page_to = pss.page;
> +
> +            /* Check if page is from UFFD-managed region */
> +            if (pss.block->flags & RAM_UF_WRITEPROTECT) {
> +                hwaddr page_address = (hwaddr) pss.block->host +
> +                        ((hwaddr) page << TARGET_PAGE_BITS);
> +                hwaddr run_length = (hwaddr) (page_to - page + 1) << TARGET_PAGE_BITS;
> +                int res;
> +
> +                /* Flush async buffers before un-protect */
> +                qemu_fflush(rs->f);
> +                /* Un-protect memory range */
> +                res = uffd_protect_memory(rs->uffdio_fd, page_address, run_length, false);
> +                if (res < 0) {
> +                    break;
> +                }
> +            }

Please separate that out into a separate function.

Dave

>          }
>      } while (!pages && again);
>  
> @@ -2086,7 +2191,8 @@ static void ram_state_reset(RAMState *rs)
>      rs->last_sent_block = NULL;
>      rs->last_page = 0;
>      rs->last_version = ram_list.version;
> -    rs->ram_bulk_stage = true;
> +    rs->ram_wt_enabled = migrate_track_writes_ram();
> +    rs->ram_bulk_stage = !rs->ram_wt_enabled;
>      rs->fpo_enabled = false;
>  }
>  
> -- 
> 2.25.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 3/7] support UFFD write fault processing in ram_save_iterate()
  2020-11-25 13:08   ` Dr. David Alan Gilbert
@ 2020-11-25 14:40     ` Andrey Gruzdev
  2020-11-25 18:41       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 44+ messages in thread
From: Andrey Gruzdev @ 2020-11-25 14:40 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Markus Armbruster, Peter Xu

On 25.11.2020 16:08, Dr. David Alan Gilbert wrote:
> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>> In this particular implementation the same single migration
>> thread is responsible for both normal linear dirty page
>> migration and processing UFFD page fault events.
>>
>> Processing write faults includes reading UFFD file descriptor,
>> finding respective RAM block and saving faulting page to
>> the migration stream. After page has been saved, write protection
>> can be removed. Since asynchronous version of qemu_put_buffer()
>> is expected to be used to save pages, we also have to flush
>> migration stream prior to un-protecting saved memory range.
>>
>> Write protection is being removed for any previously protected
>> memory chunk that has hit the migration stream. That's valid
>> for pages from linear page scan along with write fault pages.
>>
>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
>> ---
>>   migration/ram.c | 124 ++++++++++++++++++++++++++++++++++++++++++++----
>>   1 file changed, 115 insertions(+), 9 deletions(-)
>>
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 7f273c9996..08a1d7a252 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -314,6 +314,8 @@ struct RAMState {
>>       ram_addr_t last_page;
>>       /* last ram version we have seen */
>>       uint32_t last_version;
>> +    /* 'write-tracking' migration is enabled */
>> +    bool ram_wt_enabled;
>>       /* We are in the first round */
>>       bool ram_bulk_stage;
>>       /* The free page optimization is enabled */
>> @@ -574,8 +576,6 @@ static int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp)
>>       return 0;
>>   }
>>   
>> -__attribute__ ((unused))
>> -static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count);
>>   __attribute__ ((unused))
>>   static bool uffd_poll_events(int uffd, int tmo);
>>   
>> @@ -1929,6 +1929,86 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
>>       return pages;
>>   }
>>   
>> +/**
>> + * ram_find_block_by_host_address: find RAM block containing host page
>> + *
>> + * Returns true if RAM block is found and pss->block/page are
>> + * pointing to the given host page, false in case of an error
>> + *
>> + * @rs: current RAM state
>> + * @pss: page-search-status structure
>> + */
>> +static bool ram_find_block_by_host_address(RAMState *rs, PageSearchStatus *pss,
>> +        hwaddr page_address)
>> +{
>> +    bool found = false;
>> +
>> +    pss->block = rs->last_seen_block;
>> +    do {
>> +        if (page_address >= (hwaddr) pss->block->host &&
>> +            (page_address + TARGET_PAGE_SIZE) <=
>> +                    ((hwaddr) pss->block->host + pss->block->used_length)) {
>> +            pss->page = (unsigned long)
>> +                    ((page_address - (hwaddr) pss->block->host) >> TARGET_PAGE_BITS);
>> +            found = true;
>> +            break;
>> +        }
>> +
>> +        pss->block = QLIST_NEXT_RCU(pss->block, next);
>> +        if (!pss->block) {
>> +            /* Hit the end of the list */
>> +            pss->block = QLIST_FIRST_RCU(&ram_list.blocks);
>> +        }
>> +    } while (pss->block != rs->last_seen_block);
>> +
>> +    rs->last_seen_block = pss->block;
>> +    /*
>> +     * Since we are in the same loop with ram_find_and_save_block(),
>> +     * need to reset pss->complete_round after switching to
>> +     * other block/page in pss.
>> +     */
>> +    pss->complete_round = false;
>> +
>> +    return found;
>> +}
>> +
>> +/**
>> + * get_fault_page: try to get next UFFD write fault page and, if pending fault
>> + *   is found, put info about RAM block and block page into pss structure
>> + *
>> + * Returns true in case of UFFD write fault detected, false otherwise
>> + *
>> + * @rs: current RAM state
>> + * @pss: page-search-status structure
>> + *
>> + */
>> +static bool get_fault_page(RAMState *rs, PageSearchStatus *pss)
>> +{
>> +    struct uffd_msg uffd_msg;
>> +    hwaddr page_address;
>> +    int res;
>> +
>> +    if (!rs->ram_wt_enabled) {
>> +        return false;
>> +    }
>> +
>> +    res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1);
>> +    if (res <= 0) {
>> +        return false;
>> +    }
>> +
>> +    page_address = uffd_msg.arg.pagefault.address;
>> +    if (!ram_find_block_by_host_address(rs, pss, page_address)) {
>> +        /* In case we couldn't find respective block, just unprotect faulting page */
>> +        uffd_protect_memory(rs->uffdio_fd, page_address, TARGET_PAGE_SIZE, false);
>> +        error_report("ram_find_block_by_host_address() failed: address=0x%0lx",
>> +                     page_address);
>> +        return false;
>> +    }
>> +
>> +    return true;
>> +}
>> +
>>   /**
>>    * ram_find_and_save_block: finds a dirty page and sends it to f
>>    *
>> @@ -1955,25 +2035,50 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
>>           return pages;
>>       }
>>   
>> +    if (!rs->last_seen_block) {
>> +        rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks);
>> +    }
>>       pss.block = rs->last_seen_block;
>>       pss.page = rs->last_page;
>>       pss.complete_round = false;
>>   
>> -    if (!pss.block) {
>> -        pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
>> -    }
>> -
>>       do {
>> +        ram_addr_t page;
>> +        ram_addr_t page_to;
>> +
>>           again = true;
>> -        found = get_queued_page(rs, &pss);
>> -
>> +        /* In case of 'write-tracking' migration we first try
>> +         * to poll UFFD and get write page fault event */
>> +        found = get_fault_page(rs, &pss);
>> +        if (!found) {
>> +            /* Now try to fetch something from the priority queue */
>> +            found = get_queued_page(rs, &pss);
>> +        }
>>           if (!found) {
>>               /* priority queue empty, so just search for something dirty */
>>               found = find_dirty_block(rs, &pss, &again);
>>           }
>>   
>>           if (found) {
>> +            page = pss.page;
>>               pages = ram_save_host_page(rs, &pss, last_stage);
>> +            page_to = pss.page;
>> +
>> +            /* Check if page is from UFFD-managed region */
>> +            if (pss.block->flags & RAM_UF_WRITEPROTECT) {
>> +                hwaddr page_address = (hwaddr) pss.block->host +
>> +                        ((hwaddr) page << TARGET_PAGE_BITS);
>> +                hwaddr run_length = (hwaddr) (page_to - page + 1) << TARGET_PAGE_BITS;
>> +                int res;
>> +
>> +                /* Flush async buffers before un-protect */
>> +                qemu_fflush(rs->f);
>> +                /* Un-protect memory range */
>> +                res = uffd_protect_memory(rs->uffdio_fd, page_address, run_length, false);
>> +                if (res < 0) {
>> +                    break;
>> +                }
>> +            }
> 
> Please separate that out into a separate function.
> 
> Dave
> 

You mean a separate implementation of ram_find_and_save_block()?

Andrey

>>           }
>>       } while (!pages && again);
>>   
>> @@ -2086,7 +2191,8 @@ static void ram_state_reset(RAMState *rs)
>>       rs->last_sent_block = NULL;
>>       rs->last_page = 0;
>>       rs->last_version = ram_list.version;
>> -    rs->ram_bulk_stage = true;
>> +    rs->ram_wt_enabled = migrate_track_writes_ram();
>> +    rs->ram_bulk_stage = !rs->ram_wt_enabled;
>>       rs->fpo_enabled = false;
>>   }
>>   
>> -- 
>> 2.25.1
>>


-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 3/7] support UFFD write fault processing in ram_save_iterate()
  2020-11-25 14:40     ` Andrey Gruzdev
@ 2020-11-25 18:41       ` Dr. David Alan Gilbert
  2020-11-25 19:12         ` Andrey Gruzdev
  0 siblings, 1 reply; 44+ messages in thread
From: Dr. David Alan Gilbert @ 2020-11-25 18:41 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, qemu-devel, Peter Xu, Markus Armbruster,
	Paolo Bonzini, Den Lunev

* Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> On 25.11.2020 16:08, Dr. David Alan Gilbert wrote:
> > * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> > > In this particular implementation the same single migration
> > > thread is responsible for both normal linear dirty page
> > > migration and processing UFFD page fault events.
> > > 
> > > Processing write faults includes reading UFFD file descriptor,
> > > finding respective RAM block and saving faulting page to
> > > the migration stream. After page has been saved, write protection
> > > can be removed. Since asynchronous version of qemu_put_buffer()
> > > is expected to be used to save pages, we also have to flush
> > > migration stream prior to un-protecting saved memory range.
> > > 
> > > Write protection is being removed for any previously protected
> > > memory chunk that has hit the migration stream. That's valid
> > > for pages from linear page scan along with write fault pages.
> > > 
> > > Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> > > ---
> > >   migration/ram.c | 124 ++++++++++++++++++++++++++++++++++++++++++++----
> > >   1 file changed, 115 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/migration/ram.c b/migration/ram.c
> > > index 7f273c9996..08a1d7a252 100644
> > > --- a/migration/ram.c
> > > +++ b/migration/ram.c
> > > @@ -314,6 +314,8 @@ struct RAMState {
> > >       ram_addr_t last_page;
> > >       /* last ram version we have seen */
> > >       uint32_t last_version;
> > > +    /* 'write-tracking' migration is enabled */
> > > +    bool ram_wt_enabled;
> > >       /* We are in the first round */
> > >       bool ram_bulk_stage;
> > >       /* The free page optimization is enabled */
> > > @@ -574,8 +576,6 @@ static int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp)
> > >       return 0;
> > >   }
> > > -__attribute__ ((unused))
> > > -static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count);
> > >   __attribute__ ((unused))
> > >   static bool uffd_poll_events(int uffd, int tmo);
> > > @@ -1929,6 +1929,86 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
> > >       return pages;
> > >   }
> > > +/**
> > > + * ram_find_block_by_host_address: find RAM block containing host page
> > > + *
> > > + * Returns true if RAM block is found and pss->block/page are
> > > + * pointing to the given host page, false in case of an error
> > > + *
> > > + * @rs: current RAM state
> > > + * @pss: page-search-status structure
> > > + */
> > > +static bool ram_find_block_by_host_address(RAMState *rs, PageSearchStatus *pss,
> > > +        hwaddr page_address)
> > > +{
> > > +    bool found = false;
> > > +
> > > +    pss->block = rs->last_seen_block;
> > > +    do {
> > > +        if (page_address >= (hwaddr) pss->block->host &&
> > > +            (page_address + TARGET_PAGE_SIZE) <=
> > > +                    ((hwaddr) pss->block->host + pss->block->used_length)) {
> > > +            pss->page = (unsigned long)
> > > +                    ((page_address - (hwaddr) pss->block->host) >> TARGET_PAGE_BITS);
> > > +            found = true;
> > > +            break;
> > > +        }
> > > +
> > > +        pss->block = QLIST_NEXT_RCU(pss->block, next);
> > > +        if (!pss->block) {
> > > +            /* Hit the end of the list */
> > > +            pss->block = QLIST_FIRST_RCU(&ram_list.blocks);
> > > +        }
> > > +    } while (pss->block != rs->last_seen_block);
> > > +
> > > +    rs->last_seen_block = pss->block;
> > > +    /*
> > > +     * Since we are in the same loop with ram_find_and_save_block(),
> > > +     * need to reset pss->complete_round after switching to
> > > +     * other block/page in pss.
> > > +     */
> > > +    pss->complete_round = false;
> > > +
> > > +    return found;
> > > +}
> > > +
> > > +/**
> > > + * get_fault_page: try to get next UFFD write fault page and, if pending fault
> > > + *   is found, put info about RAM block and block page into pss structure
> > > + *
> > > + * Returns true in case of UFFD write fault detected, false otherwise
> > > + *
> > > + * @rs: current RAM state
> > > + * @pss: page-search-status structure
> > > + *
> > > + */
> > > +static bool get_fault_page(RAMState *rs, PageSearchStatus *pss)
> > > +{
> > > +    struct uffd_msg uffd_msg;
> > > +    hwaddr page_address;
> > > +    int res;
> > > +
> > > +    if (!rs->ram_wt_enabled) {
> > > +        return false;
> > > +    }
> > > +
> > > +    res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1);
> > > +    if (res <= 0) {
> > > +        return false;
> > > +    }
> > > +
> > > +    page_address = uffd_msg.arg.pagefault.address;
> > > +    if (!ram_find_block_by_host_address(rs, pss, page_address)) {
> > > +        /* In case we couldn't find respective block, just unprotect faulting page */
> > > +        uffd_protect_memory(rs->uffdio_fd, page_address, TARGET_PAGE_SIZE, false);
> > > +        error_report("ram_find_block_by_host_address() failed: address=0x%0lx",
> > > +                     page_address);
> > > +        return false;
> > > +    }
> > > +
> > > +    return true;
> > > +}
> > > +
> > >   /**
> > >    * ram_find_and_save_block: finds a dirty page and sends it to f
> > >    *
> > > @@ -1955,25 +2035,50 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
> > >           return pages;
> > >       }
> > > +    if (!rs->last_seen_block) {
> > > +        rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks);
> > > +    }
> > >       pss.block = rs->last_seen_block;
> > >       pss.page = rs->last_page;
> > >       pss.complete_round = false;
> > > -    if (!pss.block) {
> > > -        pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
> > > -    }
> > > -
> > >       do {
> > > +        ram_addr_t page;
> > > +        ram_addr_t page_to;
> > > +
> > >           again = true;
> > > -        found = get_queued_page(rs, &pss);
> > > -
> > > +        /* In case of 'write-tracking' migration we first try
> > > +         * to poll UFFD and get write page fault event */
> > > +        found = get_fault_page(rs, &pss);
> > > +        if (!found) {
> > > +            /* Now try to fetch something from the priority queue */
> > > +            found = get_queued_page(rs, &pss);
> > > +        }
> > >           if (!found) {
> > >               /* priority queue empty, so just search for something dirty */
> > >               found = find_dirty_block(rs, &pss, &again);
> > >           }
> > >           if (found) {
> > > +            page = pss.page;
> > >               pages = ram_save_host_page(rs, &pss, last_stage);
> > > +            page_to = pss.page;
> > > +
> > > +            /* Check if page is from UFFD-managed region */
> > > +            if (pss.block->flags & RAM_UF_WRITEPROTECT) {
> > > +                hwaddr page_address = (hwaddr) pss.block->host +
> > > +                        ((hwaddr) page << TARGET_PAGE_BITS);
> > > +                hwaddr run_length = (hwaddr) (page_to - page + 1) << TARGET_PAGE_BITS;
> > > +                int res;
> > > +
> > > +                /* Flush async buffers before un-protect */
> > > +                qemu_fflush(rs->f);
> > > +                /* Un-protect memory range */
> > > +                res = uffd_protect_memory(rs->uffdio_fd, page_address, run_length, false);
> > > +                if (res < 0) {
> > > +                    break;
> > > +                }
> > > +            }
> > 
> > Please separate that out into a separate function.
> > 
> > Dave
> > 
> 
> You mean a separate implementation of ram_find_and_save_block()?

No, I mean take that if (...WRITEPROTECT) { ..} block and put it in a
function and call it from there, just to keep ram_find_and_save_block
clean.

Dave

> Andrey
> 
> > >           }
> > >       } while (!pages && again);
> > > @@ -2086,7 +2191,8 @@ static void ram_state_reset(RAMState *rs)
> > >       rs->last_sent_block = NULL;
> > >       rs->last_page = 0;
> > >       rs->last_version = ram_list.version;
> > > -    rs->ram_bulk_stage = true;
> > > +    rs->ram_wt_enabled = migrate_track_writes_ram();
> > > +    rs->ram_bulk_stage = !rs->ram_wt_enabled;
> > >       rs->fpo_enabled = false;
> > >   }
> > > -- 
> > > 2.25.1
> > > 
> 
> 
> -- 
> Andrey Gruzdev, Principal Engineer
> Virtuozzo GmbH  +7-903-247-6397
>                 virtuzzo.com
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 2/7] introduce UFFD-WP low-level interface helpers
  2020-11-25  8:11     ` Andrey Gruzdev
@ 2020-11-25 18:43       ` Dr. David Alan Gilbert
  2020-11-25 19:17         ` Andrey Gruzdev
  0 siblings, 1 reply; 44+ messages in thread
From: Dr. David Alan Gilbert @ 2020-11-25 18:43 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, qemu-devel, Peter Xu, Markus Armbruster,
	Paolo Bonzini, Den Lunev

* Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> On 24.11.2020 20:57, Dr. David Alan Gilbert wrote:
> > * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
> > > Implemented support for the whole RAM block memory
> > > protection/un-protection. Introduced higher level
> > > ram_write_tracking_start() and ram_write_tracking_stop()
> > > to start/stop tracking guest memory writes.
> > > 
> > > Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
> > > ---
> > >   include/exec/memory.h |   7 ++
> > >   migration/ram.c       | 267 ++++++++++++++++++++++++++++++++++++++++++
> > >   migration/ram.h       |   4 +
> > >   3 files changed, 278 insertions(+)
> > > 
> > > diff --git a/include/exec/memory.h b/include/exec/memory.h
> > > index 0f3e6bcd5e..3d798fce16 100644
> > > --- a/include/exec/memory.h
> > > +++ b/include/exec/memory.h
> > > @@ -139,6 +139,13 @@ typedef struct IOMMUNotifier IOMMUNotifier;
> > >   /* RAM is a persistent kind memory */
> > >   #define RAM_PMEM (1 << 5)
> > > +/*
> > > + * UFFDIO_WRITEPROTECT is used on this RAMBlock to
> > > + * support 'write-tracking' migration type.
> > > + * Implies ram_state->ram_wt_enabled.
> > > + */
> > > +#define RAM_UF_WRITEPROTECT (1 << 6)
> > > +
> > >   static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
> > >                                          IOMMUNotifierFlag flags,
> > >                                          hwaddr start, hwaddr end,
> > > diff --git a/migration/ram.c b/migration/ram.c
> > > index 7811cde643..7f273c9996 100644
> > > --- a/migration/ram.c
> > > +++ b/migration/ram.c
> > > @@ -56,6 +56,12 @@
> > >   #include "savevm.h"
> > >   #include "qemu/iov.h"
> > >   #include "multifd.h"
> > > +#include <inttypes.h>
> > > +#include <poll.h>
> > > +#include <sys/syscall.h>
> > > +#include <sys/ioctl.h>
> > > +#include <linux/userfaultfd.h>
> > > +#include "sysemu/runstate.h"
> > >   /***********************************************************/
> > >   /* ram save/restore */
> > > @@ -298,6 +304,8 @@ struct RAMSrcPageRequest {
> > >   struct RAMState {
> > >       /* QEMUFile used for this migration */
> > >       QEMUFile *f;
> > > +    /* UFFD file descriptor, used in 'write-tracking' migration */
> > > +    int uffdio_fd;
> > >       /* Last block that we have visited searching for dirty pages */
> > >       RAMBlock *last_seen_block;
> > >       /* Last block from where we have sent data */
> > > @@ -453,6 +461,181 @@ static QemuThread *decompress_threads;
> > >   static QemuMutex decomp_done_lock;
> > >   static QemuCond decomp_done_cond;
> > > +/**
> > > + * uffd_create_fd: create UFFD file descriptor
> > > + *
> > > + * Returns non-negative file descriptor or negative value in case of an error
> > > + */
> > > +static int uffd_create_fd(void)
> > > +{
> > > +    int uffd;
> > > +    struct uffdio_api api_struct;
> > > +    uint64_t ioctl_mask = BIT(_UFFDIO_REGISTER) | BIT(_UFFDIO_UNREGISTER);
> > 
> > You need to be a bit careful about doing this in migration/ram.c - it's
> > generic code; at minimum it needs ifdef'ing for Linux.
> > 
> 
> Yes, it's totally Linux-specific; I think it's better to move this code out of
> migration/ram.c, as Peter proposed.
> 
> > > +    uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> > > +    if (uffd < 0) {
> > > +        error_report("uffd_create_fd() failed: UFFD not supported");
> > > +        return -1;
> > > +    }
> > > +
> > > +    api_struct.api = UFFD_API;
> > > +    api_struct.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
> > > +    if (ioctl(uffd, UFFDIO_API, &api_struct)) {
> > > +        error_report("uffd_create_fd() failed: "
> > > +                "API version not supported version=%llx errno=%i",
> > > +                api_struct.api, errno);
> > > +        goto fail;
> > > +    }
> > > +
> > > +    if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) {
> > > +        error_report("uffd_create_fd() failed: "
> > > +                "PAGEFAULT_FLAG_WP feature missing");
> > > +        goto fail;
> > > +    }
> > > +
> > > +    return uffd;
> > 
> > Should we be putting that somewhere that we can share with postcopy?
> > 
> 
> Sure, maybe to util/uffd-wp.c + include/qemu/uffd-wp.h.
> What do you think?

Or how about a userfaultfd.c somewhere?

Dave

> > > +fail:
> > > +    close(uffd);
> > > +    return -1;
> > > +}
> > > +
> > > +/**
> > > + * uffd_close_fd: close UFFD file descriptor
> > > + *
> > > + * @uffd: UFFD file descriptor
> > > + */
> > > +static void uffd_close_fd(int uffd)
> > > +{
> > > +    assert(uffd >= 0);
> > > +    close(uffd);
> > > +}
> > > +
> > > +/**
> > > + * uffd_register_memory: register memory range with UFFD
> > > + *
> > > + * Returns 0 in case of success, negative value on error
> > > + *
> > > + * @uffd: UFFD file descriptor
> > > + * @start: starting virtual address of memory range
> > > + * @length: length of memory range
> > > + * @track_missing: generate events on missing-page faults
> > > + * @track_wp: generate events on write-protected-page faults
> > > + */
> > > +static int uffd_register_memory(int uffd, hwaddr start, hwaddr length,
> > > +        bool track_missing, bool track_wp)
> > > +{
> > > +    struct uffdio_register uffd_register;
> > > +
> > > +    uffd_register.range.start = start;
> > > +    uffd_register.range.len = length;
> > > +    uffd_register.mode = (track_missing ? UFFDIO_REGISTER_MODE_MISSING : 0) |
> > > +                         (track_wp ? UFFDIO_REGISTER_MODE_WP : 0);
> > > +
> > > +    if (ioctl(uffd, UFFDIO_REGISTER, &uffd_register)) {
> > > +        error_report("uffd_register_memory() failed: "
> > > +                "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
> > > +                start, length, uffd_register.mode, errno);
> > > +        return -1;
> > > +    }
> > > +
> > > +    return 0;
> > > +}
> > > +
> > > +/**
> > > + * uffd_protect_memory: protect/unprotect memory range for writes with UFFD
> > > + *
> > > + * Returns 0 on success or negative value in case of error
> > > + *
> > > + * @uffd: UFFD file descriptor
> > > + * @start: starting virtual address of memory range
> > > + * @length: length of memory range
> > > + * @wp: write-protect/unprotect
> > > + */
> > > +static int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp)
> > > +{
> > > +    struct uffdio_writeprotect uffd_writeprotect;
> > > +    int res;
> > > +
> > > +    uffd_writeprotect.range.start = start;
> > > +    uffd_writeprotect.range.len = length;
> > > +    uffd_writeprotect.mode = (wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0);
> > > +
> > > +    do {
> > > +        res = ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect);
> > > +    } while (res < 0 && errno == EINTR);
> > > +    if (res < 0) {
> > > +        error_report("uffd_protect_memory() failed: "
> > > +                "start=%0"PRIx64" len=%"PRIu64" mode=%llu errno=%i",
> > > +                start, length, uffd_writeprotect.mode, errno);
> > > +        return -1;
> > > +    }
> > > +
> > > +    return 0;
> > > +}
> > > +
> > > +__attribute__ ((unused))
> > > +static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count);
> > > +__attribute__ ((unused))
> > > +static bool uffd_poll_events(int uffd, int tmo);
> > > +
> > > +/**
> > > + * uffd_read_events: read pending UFFD events
> > > + *
> > > + * Returns number of fetched messages, 0 if none are available or
> > > + * negative value in case of an error
> > > + *
> > > + * @uffd: UFFD file descriptor
> > > + * @msgs: pointer to message buffer
> > > + * @count: number of messages that can fit in the buffer
> > > + */
> > > +static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count)
> > > +{
> > > +    ssize_t res;
> > > +    do {
> > > +        res = read(uffd, msgs, count * sizeof(struct uffd_msg));
> > > +    } while (res < 0 && errno == EINTR);
> > > +
> > > +    if ((res < 0 && errno == EAGAIN)) {
> > > +        return 0;
> > > +    }
> > > +    if (res < 0) {
> > > +        error_report("uffd_read_events() failed: errno=%i", errno);
> > > +        return -1;
> > > +    }
> > > +
> > > +    return (int) (res / sizeof(struct uffd_msg));
> > > +}
> > > +
> > > +/**
> > > + * uffd_poll_events: poll UFFD file descriptor for read
> > > + *
> > > + * Returns true if events are available for read, false otherwise
> > > + *
> > > + * @uffd: UFFD file descriptor
> > > + * @tmo: timeout in milliseconds, 0 for non-blocking operation,
> > > + *       negative value for infinite wait
> > > + */
> > > +static bool uffd_poll_events(int uffd, int tmo)
> > > +{
> > > +    int res;
> > > +    struct pollfd poll_fd = { .fd = uffd, .events = POLLIN, .revents = 0 };
> > > +
> > > +    do {
> > > +        res = poll(&poll_fd, 1, tmo);
> > > +    } while (res < 0 && errno == EINTR);
> > > +
> > > +    if (res == 0) {
> > > +        return false;
> > > +    }
> > > +    if (res < 0) {
> > > +        error_report("uffd_poll_events() failed: errno=%i", errno);
> > > +        return false;
> > > +    }
> > > +
> > > +    return (poll_fd.revents & POLLIN) != 0;
> > > +}
> > > +
> > >   static bool do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block,
> > >                                    ram_addr_t offset, uint8_t *source_buf);
> > > @@ -3788,6 +3971,90 @@ static int ram_resume_prepare(MigrationState *s, void *opaque)
> > >       return 0;
> > >   }
> > > +/**
> > > + * ram_write_tracking_start: start UFFD-WP memory tracking
> > > + *
> > > + * Returns 0 for success or negative value in case of error
> > > + *
> > > + */
> > > +int ram_write_tracking_start(void)
> > > +{
> > > +    int uffd;
> > > +    RAMState *rs = ram_state;
> > > +    RAMBlock *bs;
> > > +
> > > +    /* Open UFFD file descriptor */
> > > +    uffd = uffd_create_fd();
> > > +    if (uffd < 0) {
> > > +        return uffd;
> > > +    }
> > > +    rs->uffdio_fd = uffd;
> > > +
> > > +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> > > +        /* Nothing to do with read-only and MMIO-writable regions */
> > > +        if (bs->mr->readonly || bs->mr->rom_device) {
> > > +            continue;
> > > +        }
> > > +
> > > +        /* Register block memory with UFFD to track writes */
> > > +        if (uffd_register_memory(rs->uffdio_fd, (hwaddr) bs->host,
> > > +                bs->max_length, false, true)) {
> > > +            goto fail;
> > > +        }
> > > +        /* Apply UFFD write protection to the block memory range */
> > > +        if (uffd_protect_memory(rs->uffdio_fd, (hwaddr) bs->host,
> > > +                bs->max_length, true)) {
> > > +            goto fail;
> > > +        }
> > > +        bs->flags |= RAM_UF_WRITEPROTECT;
> > > +
> > > +        info_report("UFFD-WP write-tracking enabled: "
> > > +                "block_id=%s page_size=%zu start=%p length=%lu "
> > > +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
> > > +                bs->idstr, bs->page_size, bs->host, bs->max_length,
> > > +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
> > > +                bs->mr->nonvolatile, bs->mr->rom_device);
> > > +    }
> > > +
> > > +    return 0;
> > > +
> > > +fail:
> > > +    uffd_close_fd(uffd);
> > > +    rs->uffdio_fd = -1;
> > > +    return -1;
> > > +}
> > > +
> > > +/**
> > > + * ram_write_tracking_stop: stop UFFD-WP memory tracking and remove protection
> > > + */
> > > +void ram_write_tracking_stop(void)
> > > +{
> > > +    RAMState *rs = ram_state;
> > > +    RAMBlock *bs;
> > > +    assert(rs->uffdio_fd >= 0);
> > > +
> > > +    RAMBLOCK_FOREACH_NOT_IGNORED(bs) {
> > > +        if ((bs->flags & RAM_UF_WRITEPROTECT) == 0) {
> > > +            continue;
> > > +        }
> > > +        info_report("UFFD-WP write-tracking disabled: "
> > > +                "block_id=%s page_size=%zu start=%p length=%lu "
> > > +                "romd_mode=%i ram=%i readonly=%i nonvolatile=%i rom_device=%i",
> > > +                bs->idstr, bs->page_size, bs->host, bs->max_length,
> > > +                bs->mr->romd_mode, bs->mr->ram, bs->mr->readonly,
> > > +                bs->mr->nonvolatile, bs->mr->rom_device);
> > > +        /* Cleanup flags */
> > > +        bs->flags &= ~RAM_UF_WRITEPROTECT;
> > > +    }
> > > +
> > > +    /*
> > > +     * Close UFFD file descriptor to remove protection,
> > > +     * release registered memory regions and flush wait queues
> > > +     */
> > > +    uffd_close_fd(rs->uffdio_fd);
> > > +    rs->uffdio_fd = -1;
> > > +}
> > > +
> > >   static SaveVMHandlers savevm_ram_handlers = {
> > >       .save_setup = ram_save_setup,
> > >       .save_live_iterate = ram_save_iterate,
> > > diff --git a/migration/ram.h b/migration/ram.h
> > > index 011e85414e..3611cb51de 100644
> > > --- a/migration/ram.h
> > > +++ b/migration/ram.h
> > > @@ -79,4 +79,8 @@ void colo_flush_ram_cache(void);
> > >   void colo_release_ram_cache(void);
> > >   void colo_incoming_start_dirty_log(void);
> > > +/* Live snapshots */
> > > +int ram_write_tracking_start(void);
> > > +void ram_write_tracking_stop(void);
> > > +
> > >   #endif
> > > -- 
> > > 2.25.1
> > > 
> 
> 
> -- 
> Andrey Gruzdev, Principal Engineer
> Virtuozzo GmbH  +7-903-247-6397
>                 virtuzzo.com
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 3/7] support UFFD write fault processing in ram_save_iterate()
  2020-11-25 18:41       ` Dr. David Alan Gilbert
@ 2020-11-25 19:12         ` Andrey Gruzdev
  0 siblings, 0 replies; 44+ messages in thread
From: Andrey Gruzdev @ 2020-11-25 19:12 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Markus Armbruster, Peter Xu

On 25.11.2020 21:41, Dr. David Alan Gilbert wrote:
> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>> On 25.11.2020 16:08, Dr. David Alan Gilbert wrote:
>>> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>>>> In this particular implementation the same single migration
>>>> thread is responsible for both normal linear dirty page
>>>> migration and processing UFFD page fault events.
>>>>
>>>> Processing write faults includes reading the UFFD file descriptor,
>>>> finding the respective RAM block and saving the faulting page to
>>>> the migration stream. After the page has been saved, write protection
>>>> can be removed. Since an asynchronous version of qemu_put_buffer()
>>>> is expected to be used to save pages, we also have to flush the
>>>> migration stream prior to un-protecting the saved memory range.
>>>>
>>>> Write protection is removed for any previously protected
>>>> memory chunk that has hit the migration stream. This applies both
>>>> to pages from the linear scan and to write-fault pages.
>>>>
>>>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
>>>> ---
>>>>    migration/ram.c | 124 ++++++++++++++++++++++++++++++++++++++++++++----
>>>>    1 file changed, 115 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/migration/ram.c b/migration/ram.c
>>>> index 7f273c9996..08a1d7a252 100644
>>>> --- a/migration/ram.c
>>>> +++ b/migration/ram.c
>>>> @@ -314,6 +314,8 @@ struct RAMState {
>>>>        ram_addr_t last_page;
>>>>        /* last ram version we have seen */
>>>>        uint32_t last_version;
>>>> +    /* 'write-tracking' migration is enabled */
>>>> +    bool ram_wt_enabled;
>>>>        /* We are in the first round */
>>>>        bool ram_bulk_stage;
>>>>        /* The free page optimization is enabled */
>>>> @@ -574,8 +576,6 @@ static int uffd_protect_memory(int uffd, hwaddr start, hwaddr length, bool wp)
>>>>        return 0;
>>>>    }
>>>> -__attribute__ ((unused))
>>>> -static int uffd_read_events(int uffd, struct uffd_msg *msgs, int count);
>>>>    __attribute__ ((unused))
>>>>    static bool uffd_poll_events(int uffd, int tmo);
>>>> @@ -1929,6 +1929,86 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
>>>>        return pages;
>>>>    }
>>>> +/**
>>>> + * ram_find_block_by_host_address: find RAM block containing host page
>>>> + *
>>>> + * Returns true if RAM block is found and pss->block/page are
>>>> + * pointing to the given host page, false in case of an error
>>>> + *
>>>> + * @rs: current RAM state
>>>> + * @pss: page-search-status structure
>>>> + */
>>>> +static bool ram_find_block_by_host_address(RAMState *rs, PageSearchStatus *pss,
>>>> +        hwaddr page_address)
>>>> +{
>>>> +    bool found = false;
>>>> +
>>>> +    pss->block = rs->last_seen_block;
>>>> +    do {
>>>> +        if (page_address >= (hwaddr) pss->block->host &&
>>>> +            (page_address + TARGET_PAGE_SIZE) <=
>>>> +                    ((hwaddr) pss->block->host + pss->block->used_length)) {
>>>> +            pss->page = (unsigned long)
>>>> +                    ((page_address - (hwaddr) pss->block->host) >> TARGET_PAGE_BITS);
>>>> +            found = true;
>>>> +            break;
>>>> +        }
>>>> +
>>>> +        pss->block = QLIST_NEXT_RCU(pss->block, next);
>>>> +        if (!pss->block) {
>>>> +            /* Hit the end of the list */
>>>> +            pss->block = QLIST_FIRST_RCU(&ram_list.blocks);
>>>> +        }
>>>> +    } while (pss->block != rs->last_seen_block);
>>>> +
>>>> +    rs->last_seen_block = pss->block;
>>>> +    /*
>>>> +     * Since we are in the same loop with ram_find_and_save_block(),
>>>> +     * need to reset pss->complete_round after switching to
>>>> +     * other block/page in pss.
>>>> +     */
>>>> +    pss->complete_round = false;
>>>> +
>>>> +    return found;
>>>> +}
>>>> +
>>>> +/**
>>>> + * get_fault_page: try to get next UFFD write fault page and, if pending fault
>>>> + *   is found, put info about RAM block and block page into pss structure
>>>> + *
>>>> + * Returns true in case of UFFD write fault detected, false otherwise
>>>> + *
>>>> + * @rs: current RAM state
>>>> + * @pss: page-search-status structure
>>>> + *
>>>> + */
>>>> +static bool get_fault_page(RAMState *rs, PageSearchStatus *pss)
>>>> +{
>>>> +    struct uffd_msg uffd_msg;
>>>> +    hwaddr page_address;
>>>> +    int res;
>>>> +
>>>> +    if (!rs->ram_wt_enabled) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    res = uffd_read_events(rs->uffdio_fd, &uffd_msg, 1);
>>>> +    if (res <= 0) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    page_address = uffd_msg.arg.pagefault.address;
>>>> +    if (!ram_find_block_by_host_address(rs, pss, page_address)) {
>>>> +        /* In case we couldn't find the respective block, just unprotect the faulting page */
>>>> +        uffd_protect_memory(rs->uffdio_fd, page_address, TARGET_PAGE_SIZE, false);
>>>> +        error_report("ram_find_block_by_host_address() failed: address=0x%0lx",
>>>> +                     page_address);
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    return true;
>>>> +}
>>>> +
>>>>    /**
>>>>     * ram_find_and_save_block: finds a dirty page and sends it to f
>>>>     *
>>>> @@ -1955,25 +2035,50 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
>>>>            return pages;
>>>>        }
>>>> +    if (!rs->last_seen_block) {
>>>> +        rs->last_seen_block = QLIST_FIRST_RCU(&ram_list.blocks);
>>>> +    }
>>>>        pss.block = rs->last_seen_block;
>>>>        pss.page = rs->last_page;
>>>>        pss.complete_round = false;
>>>> -    if (!pss.block) {
>>>> -        pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
>>>> -    }
>>>> -
>>>>        do {
>>>> +        ram_addr_t page;
>>>> +        ram_addr_t page_to;
>>>> +
>>>>            again = true;
>>>> -        found = get_queued_page(rs, &pss);
>>>> -
>>>> +        /* In case of 'write-tracking' migration we first try
>>>> +         * to poll UFFD and fetch a write page fault event */
>>>> +        found = get_fault_page(rs, &pss);
>>>> +        if (!found) {
>>>> +            /* Now try to fetch something from the priority queue */
>>>> +            found = get_queued_page(rs, &pss);
>>>> +        }
>>>>            if (!found) {
>>>>                /* priority queue empty, so just search for something dirty */
>>>>                found = find_dirty_block(rs, &pss, &again);
>>>>            }
>>>>            if (found) {
>>>> +            page = pss.page;
>>>>                pages = ram_save_host_page(rs, &pss, last_stage);
>>>> +            page_to = pss.page;
>>>> +
>>>> +            /* Check if page is from UFFD-managed region */
>>>> +            if (pss.block->flags & RAM_UF_WRITEPROTECT) {
>>>> +                hwaddr page_address = (hwaddr) pss.block->host +
>>>> +                        ((hwaddr) page << TARGET_PAGE_BITS);
>>>> +                hwaddr run_length = (hwaddr) (page_to - page + 1) << TARGET_PAGE_BITS;
>>>> +                int res;
>>>> +
>>>> +                /* Flush async buffers before un-protect */
>>>> +                qemu_fflush(rs->f);
>>>> +                /* Un-protect memory range */
>>>> +                res = uffd_protect_memory(rs->uffdio_fd, page_address, run_length, false);
>>>> +                if (res < 0) {
>>>> +                    break;
>>>> +                }
>>>> +            }
>>>
>>> Please separate that out into a separate function.
>>>
>>> Dave
>>>
>>
>> You mean separate implementation of ram_find_and_save_block()?
> 
> No, I mean take that if (...WRITEPROTECT) { ..} block and put it in a
> function and call it from there, just to keep ram_find_and_save_block
> clean.
> 
> Dave
> 

Got it, ok.

Andrey

>> Andrey
>>
>>>>            }
>>>>        } while (!pages && again);
>>>> @@ -2086,7 +2191,8 @@ static void ram_state_reset(RAMState *rs)
>>>>        rs->last_sent_block = NULL;
>>>>        rs->last_page = 0;
>>>>        rs->last_version = ram_list.version;
>>>> -    rs->ram_bulk_stage = true;
>>>> +    rs->ram_wt_enabled = migrate_track_writes_ram();
>>>> +    rs->ram_bulk_stage = !rs->ram_wt_enabled;
>>>>        rs->fpo_enabled = false;
>>>>    }
>>>> -- 
>>>> 2.25.1
>>>>
>>
>>
>> -- 
>> Andrey Gruzdev, Principal Engineer
>> Virtuozzo GmbH  +7-903-247-6397
>>                  virtuzzo.com
>>


-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 2/7] introduce UFFD-WP low-level interface helpers
  2020-11-25 18:43       ` Dr. David Alan Gilbert
@ 2020-11-25 19:17         ` Andrey Gruzdev
  0 siblings, 0 replies; 44+ messages in thread
From: Andrey Gruzdev @ 2020-11-25 19:17 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Markus Armbruster, Peter Xu

On 25.11.2020 21:43, Dr. David Alan Gilbert wrote:
> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>> On 24.11.2020 20:57, Dr. David Alan Gilbert wrote:
>>> * Andrey Gruzdev (andrey.gruzdev@virtuozzo.com) wrote:
>>>> Implemented support for the whole RAM block memory
>>>> protection/un-protection. Introduced higher level
>>>> ram_write_tracking_start() and ram_write_tracking_stop()
>>>> to start/stop tracking guest memory writes.
>>>>
>>>> Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
>>>> ---
>>>>    include/exec/memory.h |   7 ++
>>>>    migration/ram.c       | 267 ++++++++++++++++++++++++++++++++++++++++++
>>>>    migration/ram.h       |   4 +
>>>>    3 files changed, 278 insertions(+)
>>>>
>>>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>>>> index 0f3e6bcd5e..3d798fce16 100644
>>>> --- a/include/exec/memory.h
>>>> +++ b/include/exec/memory.h
>>>> @@ -139,6 +139,13 @@ typedef struct IOMMUNotifier IOMMUNotifier;
>>>>    /* RAM is a persistent kind memory */
>>>>    #define RAM_PMEM (1 << 5)
>>>> +/*
>>>> + * UFFDIO_WRITEPROTECT is used on this RAMBlock to
>>>> + * support 'write-tracking' migration type.
>>>> + * Implies ram_state->ram_wt_enabled.
>>>> + */
>>>> +#define RAM_UF_WRITEPROTECT (1 << 6)
>>>> +
>>>>    static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
>>>>                                           IOMMUNotifierFlag flags,
>>>>                                           hwaddr start, hwaddr end,
>>>> diff --git a/migration/ram.c b/migration/ram.c
>>>> index 7811cde643..7f273c9996 100644
>>>> --- a/migration/ram.c
>>>> +++ b/migration/ram.c
>>>> @@ -56,6 +56,12 @@
>>>>    #include "savevm.h"
>>>>    #include "qemu/iov.h"
>>>>    #include "multifd.h"
>>>> +#include <inttypes.h>
>>>> +#include <poll.h>
>>>> +#include <sys/syscall.h>
>>>> +#include <sys/ioctl.h>
>>>> +#include <linux/userfaultfd.h>
>>>> +#include "sysemu/runstate.h"
>>>>    /***********************************************************/
>>>>    /* ram save/restore */
>>>> @@ -298,6 +304,8 @@ struct RAMSrcPageRequest {
>>>>    struct RAMState {
>>>>        /* QEMUFile used for this migration */
>>>>        QEMUFile *f;
>>>> +    /* UFFD file descriptor, used in 'write-tracking' migration */
>>>> +    int uffdio_fd;
>>>>        /* Last block that we have visited searching for dirty pages */
>>>>        RAMBlock *last_seen_block;
>>>>        /* Last block from where we have sent data */
>>>> @@ -453,6 +461,181 @@ static QemuThread *decompress_threads;
>>>>    static QemuMutex decomp_done_lock;
>>>>    static QemuCond decomp_done_cond;
>>>> +/**
>>>> + * uffd_create_fd: create UFFD file descriptor
>>>> + *
>>>> + * Returns non-negative file descriptor or negative value in case of an error
>>>> + */
>>>> +static int uffd_create_fd(void)
>>>> +{
>>>> +    int uffd;
>>>> +    struct uffdio_api api_struct;
>>>> +    uint64_t ioctl_mask = BIT(_UFFDIO_REGISTER) | BIT(_UFFDIO_UNREGISTER);
>>>
>>> You need to be a bit careful about doing this in migration/ram.c - it's
>>> generic code; at minimum it needs ifdef'ing for Linux.
>>>
>>
>> Yes, it's totally Linux-specific; I think it's better to move this code out of
>> migration/ram.c, as Peter proposed.
>>
>>>> +    uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
>>>> +    if (uffd < 0) {
>>>> +        error_report("uffd_create_fd() failed: UFFD not supported");
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    api_struct.api = UFFD_API;
>>>> +    api_struct.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
>>>> +    if (ioctl(uffd, UFFDIO_API, &api_struct)) {
>>>> +        error_report("uffd_create_fd() failed: "
>>>> +                "API version not supported version=%llx errno=%i",
>>>> +                api_struct.api, errno);
>>>> +        goto fail;
>>>> +    }
>>>> +
>>>> +    if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) {
>>>> +        error_report("uffd_create_fd() failed: "
>>>> +                "PAGEFAULT_FLAG_WP feature missing");
>>>> +        goto fail;
>>>> +    }
>>>> +
>>>> +    return uffd;
>>>
>>> Should we be putting that somewhere that we can share with postcopy?
>>>
>>
>> Sure, maybe to util/uffd-wp.c + include/qemu/uffd-wp.h.
>> What do you think?
> 
> Or how about a userfaultfd.c somewhere?
> 
> Dave
> 

For userfaultfd.c I'm also ok.

Andrey



-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com




Thread overview: 44+ messages
2020-11-19 12:59 [PATCH v3 0/7] UFFD write-tracking migration/snapshots Andrey Gruzdev via
2020-11-19 12:59 ` [PATCH v3 1/7] introduce 'track-writes-ram' migration capability Andrey Gruzdev via
2020-11-19 18:51   ` Peter Xu
2020-11-19 19:07     ` Peter Xu
2020-11-20 11:35       ` Andrey Gruzdev
2020-11-24 16:55       ` Dr. David Alan Gilbert
2020-11-24 17:25         ` Andrey Gruzdev
2020-11-20 11:32     ` Andrey Gruzdev
2020-11-19 12:59 ` [PATCH v3 2/7] introduce UFFD-WP low-level interface helpers Andrey Gruzdev via
2020-11-19 18:39   ` Peter Xu
2020-11-20 11:04     ` Andrey Gruzdev
2020-11-20 15:01       ` Peter Xu
2020-11-20 15:43         ` Andrey Gruzdev
2020-11-24 17:57   ` Dr. David Alan Gilbert
2020-11-25  8:11     ` Andrey Gruzdev
2020-11-25 18:43       ` Dr. David Alan Gilbert
2020-11-25 19:17         ` Andrey Gruzdev
2020-11-19 12:59 ` [PATCH v3 3/7] support UFFD write fault processing in ram_save_iterate() Andrey Gruzdev via
2020-11-19 18:25   ` Peter Xu
2020-11-20 10:44     ` Andrey Gruzdev
2020-11-20 15:07       ` Peter Xu
2020-11-20 16:15         ` Andrey Gruzdev
2020-11-20 16:43           ` Peter Xu
2020-11-20 16:53             ` Andrey Gruzdev
2020-11-23 21:34               ` Peter Xu
2020-11-24  8:02                 ` Andrey Gruzdev
2020-11-24 15:17                   ` Peter Xu
2020-11-24 17:40                     ` Andrey Gruzdev
2020-11-25 13:08   ` Dr. David Alan Gilbert
2020-11-25 14:40     ` Andrey Gruzdev
2020-11-25 18:41       ` Dr. David Alan Gilbert
2020-11-25 19:12         ` Andrey Gruzdev
2020-11-19 12:59 ` [PATCH v3 4/7] implementation of write-tracking migration thread Andrey Gruzdev via
2020-11-19 18:47   ` Peter Xu
2020-11-20 11:41     ` Andrey Gruzdev
2020-11-19 12:59 ` [PATCH v3 5/7] implementation of vm_start() BH Andrey Gruzdev via
2020-11-19 18:46   ` Peter Xu
2020-11-20 11:13     ` Andrey Gruzdev
2020-11-19 12:59 ` [PATCH v3 6/7] the rest of write tracking migration code Andrey Gruzdev via
2020-11-19 12:59 ` [PATCH v3 7/7] introduce simple linear scan rate limiting mechanism Andrey Gruzdev via
2020-11-19 20:02   ` Peter Xu
2020-11-20 12:06     ` Andrey Gruzdev
2020-11-20 15:23       ` Peter Xu
2020-11-24 16:41 ` [PATCH v3 0/7] UFFD write-tracking migration/snapshots Dr. David Alan Gilbert
