* [Qemu-devel] [PATCH v0 0/7] Background snapshots
@ 2018-06-29  8:03 Denis Plotnikov
  2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 1/7] migration: add background snapshot capability Denis Plotnikov
                   ` (9 more replies)
  0 siblings, 10 replies; 31+ messages in thread
From: Denis Plotnikov @ 2018-06-29  8:03 UTC (permalink / raw)
  To: dgilbert, quintela, pbonzini; +Cc: qemu-devel

The patch set adds the ability to make external snapshots while the VM is running.

The workflow to make a snapshot is the following:
1. Pause the vm
2. Make a snapshot of block devices using the scheme of your choice
3. Turn on background-snapshot migration capability
4. Start the migration using the destination (migration stream) of your choice.
   The migration code resumes VM execution on its own once the device states
   have been saved and it is ready to start writing RAM to the migration
   stream.
5. Listen for the migration completion event

The feature relies on a not-yet-merged KVM ability to report the faulting address.
Please find below the KVM patch snippet needed to make the patch set work:

+++ b/arch/x86/kvm/vmx.c
@@ -XXXX,X +XXXX,XX @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 
        vcpu->arch.exit_qualification = exit_qualification;
 
-       return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+       r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+        if (r == -EFAULT) {
+               unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gpa >> PAGE_SHIFT);
+
+               vcpu->run->exit_reason = KVM_EXIT_FAIL_MEM_ACCESS;
+               vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_VIOLATION;
+               vcpu->run->fail_mem_access.hva = hva | (gpa & (PAGE_SIZE-1));
+               r = 0;
+
+       }
+       return r;

The KVM patch can be sent once this patch set is approved.

Denis Plotnikov (7):
  migration: add background snapshot capability
  bitops: add some atomic versions of bitmap operations
  threads: add infrastructure to process sigsegv
  migration: add background snapshot infrastructure
  kvm: add failed memory access exit reason
  kvm: add vCPU failed memory access processing
  migration: add background snapshotting

 include/exec/ram_addr.h   |   7 +
 include/exec/ramlist.h    |   4 +-
 include/qemu/bitops.h     |  24 +++
 include/qemu/thread.h     |   5 +
 linux-headers/linux/kvm.h |   5 +
 migration/migration.c     | 141 +++++++++++++++-
 migration/migration.h     |   1 +
 migration/ram.c           | 333 ++++++++++++++++++++++++++++++++++++--
 migration/ram.h           |  11 +-
 migration/savevm.c        |  91 ++++++-----
 migration/savevm.h        |   2 +
 qapi/migration.json       |   6 +-
 target/i386/kvm.c         |  18 +++
 util/qemu-thread-posix.c  |  50 ++++++
 14 files changed, 635 insertions(+), 63 deletions(-)

-- 
2.17.0

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCH v0 1/7] migration: add background snapshot capability
  2018-06-29  8:03 [Qemu-devel] [PATCH v0 0/7] Background snapshots Denis Plotnikov
@ 2018-06-29  8:03 ` Denis Plotnikov
  2018-06-29 16:02   ` Eric Blake
  2018-07-12  9:03   ` Dr. David Alan Gilbert
  2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 2/7] bitops: add some atomic versions of bitmap operations Denis Plotnikov
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 31+ messages in thread
From: Denis Plotnikov @ 2018-06-29  8:03 UTC (permalink / raw)
  To: dgilbert, quintela, pbonzini; +Cc: qemu-devel

The capability is used for background vmstate saving
using the migration infrastructure.
Background vmstate saving means that the bulk of the vmstate
(RAM) is saved in the background while the VM's vCPUs are running.
This helps to reduce VM downtime during snapshotting.

Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 migration/migration.c | 35 +++++++++++++++++++++++++++++++++++
 migration/migration.h |  1 +
 qapi/migration.json   |  6 +++++-
 3 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index d780601f0c..87096d23ef 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -697,6 +697,12 @@ static bool migrate_caps_check(bool *cap_list,
             return false;
         }
 
+        if (cap_list[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT]) {
+            error_setg(errp, "Postcopy is not compatible "
+                        "with background snapshot");
+            return false;
+        }
+
         /* This check is reasonably expensive, so only when it's being
          * set the first time, also it's only the destination that needs
          * special support.
@@ -711,6 +717,26 @@ static bool migrate_caps_check(bool *cap_list,
         }
     }
 
+    if (cap_list[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT]) {
+        if (cap_list[MIGRATION_CAPABILITY_RELEASE_RAM]) {
+            error_setg(errp, "Background snapshot is not compatible "
+                        "with release ram capability");
+            return false;
+        }
+
+        if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
+            error_setg(errp, "Background snapshot is not "
+                        "currently compatible with compression");
+            return false;
+        }
+
+        if (cap_list[MIGRATION_CAPABILITY_XBZRLE]) {
+            error_setg(errp, "Background snapshot is not "
+                        "currently compatible with XBZRLE");
+            return false;
+        }
+    }
+
     return true;
 }
 
@@ -1635,6 +1661,15 @@ bool migrate_use_block_incremental(void)
     return s->parameters.block_incremental;
 }
 
+bool migrate_background_snapshot(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT];
+}
+
 /* migration thread support */
 /*
  * Something bad happened to the RP stream, mark an error
diff --git a/migration/migration.h b/migration/migration.h
index 663415fe48..21babbc008 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -201,6 +201,7 @@ int migrate_compress_level(void);
 int migrate_compress_threads(void);
 int migrate_decompress_threads(void);
 bool migrate_use_events(void);
+bool migrate_background_snapshot(void);
 
 /* Sending on the return path - generic and then for each message type */
 void migrate_send_rp_shut(MigrationIncomingState *mis,
diff --git a/qapi/migration.json b/qapi/migration.json
index 03f57c9616..eb5f52b7ff 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -352,12 +352,16 @@
 #
 # @x-multifd: Use more than one fd for migration (since 2.11)
 #
+# @background-snapshot: Use the migration infrastructure to take a VM
+#         snapshot, saving its RAM in the background. This reduces VM
+#         downtime. (since 2.12)
+#
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
   'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
            'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
-           'block', 'return-path', 'pause-before-switchover', 'x-multifd' ] }
+           'block', 'return-path', 'pause-before-switchover', 'x-multifd',
+           'background-snapshot' ] }
 
 ##
 # @MigrationCapabilityStatus:
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCH v0 2/7] bitops: add some atomic versions of bitmap operations
  2018-06-29  8:03 [Qemu-devel] [PATCH v0 0/7] Background snapshots Denis Plotnikov
  2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 1/7] migration: add background snapshot capability Denis Plotnikov
@ 2018-06-29  8:03 ` Denis Plotnikov
  2018-07-12  9:21   ` Dr. David Alan Gilbert
  2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 3/7] threads: add infrastructure to process sigsegv Denis Plotnikov
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Denis Plotnikov @ 2018-06-29  8:03 UTC (permalink / raw)
  To: dgilbert, quintela, pbonzini; +Cc: qemu-devel

1. test_bit
2. test_and_set_bit

Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 include/qemu/bitops.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/include/qemu/bitops.h b/include/qemu/bitops.h
index 3f0926cf40..7b1c8c7baf 100644
--- a/include/qemu/bitops.h
+++ b/include/qemu/bitops.h
@@ -94,6 +94,21 @@ static inline int test_and_set_bit(long nr, unsigned long *addr)
     return (old & mask) != 0;
 }
 
+/**
+ * test_and_set_bit_atomic - Set a bit atomically and return its old value
+ * @nr: Bit to set
+ * @addr: Address to count from
+ */
+static inline int test_and_set_bit_atomic(long nr, unsigned long *addr)
+{
+    unsigned long mask = BIT_MASK(nr);
+    unsigned long *p = addr + BIT_WORD(nr);
+    unsigned long old;
+
+    old = atomic_fetch_or(p, mask);
+    return (old & mask) != 0;
+}
+
 /**
  * test_and_clear_bit - Clear a bit and return its old value
  * @nr: Bit to clear
@@ -134,6 +149,15 @@ static inline int test_bit(long nr, const unsigned long *addr)
     return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG-1)));
 }
 
+/**
+ * test_bit_atomic - Determine whether a bit is set atomically
+ * @nr: bit number to test
+ * @addr: Address to start counting from
+ */
+static inline int test_bit_atomic(long nr, const unsigned long *addr)
+{
+    return 1UL & (atomic_read(&addr[BIT_WORD(nr)]) >> (nr & (BITS_PER_LONG-1)));
+}
+
 /**
  * find_last_bit - find the last set bit in a memory region
  * @addr: The address to start the search at
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCH v0 3/7] threads: add infrastructure to process sigsegv
  2018-06-29  8:03 [Qemu-devel] [PATCH v0 0/7] Background snapshots Denis Plotnikov
  2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 1/7] migration: add background snapshot capability Denis Plotnikov
  2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 2/7] bitops: add some atomic versions of bitmap operations Denis Plotnikov
@ 2018-06-29  8:03 ` Denis Plotnikov
  2018-07-12  9:53   ` Dr. David Alan Gilbert
  2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 4/7] migration: add background snapshot infrastructure Denis Plotnikov
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Denis Plotnikov @ 2018-06-29  8:03 UTC (permalink / raw)
  To: dgilbert, quintela, pbonzini; +Cc: qemu-devel

Allows a temporary SIGSEGV handler to be set for all threads.
This is useful for implementing copy-on-write logic while
Linux userfaultfd doesn't support write-protected faults.
In the future, this can switch to using WP userfaultfd when
it becomes available.

It is going to be used for background snapshotting.

Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 include/qemu/thread.h    |  5 ++++
 util/qemu-thread-posix.c | 50 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+)

diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index 9910f49b3a..886985d289 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -210,4 +210,9 @@ void qemu_lockcnt_inc_and_unlock(QemuLockCnt *lockcnt);
  */
 unsigned qemu_lockcnt_count(QemuLockCnt *lockcnt);
 
+
+typedef void (*sigsegv_handler)(int v0, siginfo_t *v1, void *v2);
+void sigsegv_user_handler_set(sigsegv_handler handler);
+void sigsegv_user_handler_reset(void);
+
 #endif
diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index 7306475899..e51abc9275 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -489,6 +489,45 @@ static void qemu_thread_set_name(QemuThread *thread, const char *name)
 #endif
 }
 
+static sigsegv_handler sigsegv_user_handler;
+
+void sigsegv_user_handler_set(sigsegv_handler handler)
+{
+    assert(handler);
+    atomic_set(&sigsegv_user_handler, handler);
+}
+
+static sigsegv_handler sigsegv_user_handler_get(void)
+{
+    return atomic_read(&sigsegv_user_handler);
+}
+
+void sigsegv_user_handler_reset(void)
+{
+    atomic_set(&sigsegv_user_handler, NULL);
+}
+
+static void sigsegv_default_handler(int v0, siginfo_t *v1, void *v2)
+{
+    sigsegv_handler handler = sigsegv_user_handler_get();
+
+    if (!handler) {
+        /* Remove the SIGSEGV handler if it's not set by the user.
+         * This leads to re-raising the fault without a handler and
+         * exiting the program with "Segmentation fault". */
+        int err;
+        struct sigaction act;
+        memset(&act, 0, sizeof(act));
+        act.sa_flags = SA_RESETHAND;
+        err = sigaction(SIGSEGV, &act, NULL);
+        if (err) {
+            error_exit(err, __func__);
+        }
+    } else {
+        handler(v0, v1, v2);
+    }
+}
+
 void qemu_thread_create(QemuThread *thread, const char *name,
                        void *(*start_routine)(void*),
                        void *arg, int mode)
@@ -496,14 +535,25 @@ void qemu_thread_create(QemuThread *thread, const char *name,
     sigset_t set, oldset;
     int err;
     pthread_attr_t attr;
+    struct sigaction act;
 
     err = pthread_attr_init(&attr);
     if (err) {
         error_exit(err, __func__);
     }
 
+    memset(&act, 0, sizeof(act));
+    act.sa_flags = SA_SIGINFO;
+    act.sa_sigaction = sigsegv_default_handler;
+    err = sigaction(SIGSEGV, &act, NULL);
+    if (err) {
+        error_exit(err, __func__);
+    }
+
     /* Leave signal handling to the iothread.  */
     sigfillset(&set);
+    /* ...all but SIGSEGV */
+    sigdelset(&set, SIGSEGV);
     pthread_sigmask(SIG_SETMASK, &set, &oldset);
     err = pthread_create(&thread->thread, &attr, start_routine, arg);
     if (err)
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCH v0 4/7] migration: add background snapshot infrastructure
  2018-06-29  8:03 [Qemu-devel] [PATCH v0 0/7] Background snapshots Denis Plotnikov
                   ` (2 preceding siblings ...)
  2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 3/7] threads: add infrastructure to process sigsegv Denis Plotnikov
@ 2018-06-29  8:03 ` Denis Plotnikov
  2018-07-12 11:46   ` Dr. David Alan Gilbert
  2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 5/7] kvm: add failed memory access exit reason Denis Plotnikov
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Denis Plotnikov @ 2018-06-29  8:03 UTC (permalink / raw)
  To: dgilbert, quintela, pbonzini; +Cc: qemu-devel

It allows interception of writes to the VM's RAM so that the
original pages can be written into the snapshot.

Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 include/exec/ram_addr.h |   7 +
 include/exec/ramlist.h  |   4 +-
 migration/migration.c   |   2 +-
 migration/ram.c         | 333 ++++++++++++++++++++++++++++++++++++++--
 migration/ram.h         |  11 +-
 5 files changed, 338 insertions(+), 19 deletions(-)

diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 6cbc02aa0f..5b403d537d 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -36,6 +36,8 @@ struct RAMBlock {
     char idstr[256];
     /* RCU-enabled, writes protected by the ramlist lock */
     QLIST_ENTRY(RAMBlock) next;
+    /* blocks used for background snapshot */
+    QLIST_ENTRY(RAMBlock) bgs_next;
     QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
     int fd;
     size_t page_size;
@@ -49,6 +51,11 @@ struct RAMBlock {
     unsigned long *unsentmap;
     /* bitmap of already received pages in postcopy */
     unsigned long *receivedmap;
+    /* The following 2 are for background snapshot */
+    /* Pages currently being copied */
+    unsigned long *touched_map;
+    /* Pages has been copied already */
+    unsigned long *copied_map;
 };
 
 static inline bool offset_in_ramblock(RAMBlock *b, ram_addr_t offset)
diff --git a/include/exec/ramlist.h b/include/exec/ramlist.h
index 2e2ac6cb99..e0231d3bec 100644
--- a/include/exec/ramlist.h
+++ b/include/exec/ramlist.h
@@ -44,11 +44,13 @@ typedef struct {
     unsigned long *blocks[];
 } DirtyMemoryBlocks;
 
+typedef QLIST_HEAD(, RAMBlock) RamBlockList;
+
 typedef struct RAMList {
     QemuMutex mutex;
     RAMBlock *mru_block;
     /* RCU-enabled, writes protected by the ramlist lock. */
-    QLIST_HEAD(, RAMBlock) blocks;
+    RamBlockList blocks;
     DirtyMemoryBlocks *dirty_memory[DIRTY_MEMORY_NUM];
     uint32_t version;
     QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
diff --git a/migration/migration.c b/migration/migration.c
index 87096d23ef..131d0904e4 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1716,7 +1716,7 @@ static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname,
         return;
     }
 
-    if (ram_save_queue_pages(rbname, start, len)) {
+    if (ram_save_queue_pages(NULL, rbname, start, len, NULL)) {
         mark_source_rp_bad(ms);
     }
 }
diff --git a/migration/ram.c b/migration/ram.c
index 021d583b9b..286b79ad51 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -188,10 +188,21 @@ struct RAMSrcPageRequest {
     RAMBlock *rb;
     hwaddr    offset;
     hwaddr    len;
+    void*     page_copy;
 
     QSIMPLEQ_ENTRY(RAMSrcPageRequest) next_req;
 };
 
+/* Page buffer used for background snapshot */
+typedef struct RAMPageBuffer {
+    /* Page buffer capacity in host pages */
+    int capacity;
+    /* Current number of pages in the buffer */
+    int used;
+    /* Event to notify that buffer usage is under capacity */
+    QemuEvent used_decreased;
+} RAMPageBuffer;
+
 /* State of RAM for migration */
 struct RAMState {
     /* QEMUFile used for this migration */
@@ -230,6 +241,11 @@ struct RAMState {
     /* Queue of outstanding page requests from the destination */
     QemuMutex src_page_req_mutex;
     QSIMPLEQ_HEAD(src_page_requests, RAMSrcPageRequest) src_page_requests;
+    /* The following 2 are for background snapshot */
+    /* Buffer data to store copies of ram pages while async vm saving */
+    RAMPageBuffer page_buffer;
+    /* Event to notify that a page copy has just finished */
+    QemuEvent page_coping_done;
 };
 typedef struct RAMState RAMState;
 
@@ -250,6 +266,8 @@ struct PageSearchStatus {
     unsigned long page;
     /* Set once we wrap around */
     bool         complete_round;
+    /* Pointer to the cached page */
+    void* page_copy;
 };
 typedef struct PageSearchStatus PageSearchStatus;
 
@@ -958,7 +976,11 @@ static int ram_save_page(RAMState *rs, PageSearchStatus *pss, bool last_stage)
     RAMBlock *block = pss->block;
     ram_addr_t offset = pss->page << TARGET_PAGE_BITS;
 
-    p = block->host + offset;
+    if (pss->page_copy) {
+        p = pss->page_copy;
+    } else {
+        p = block->host + offset;
+    }
     trace_ram_save_page(block->idstr, (uint64_t)offset, p);
 
     /* In doubt sent page as normal */
@@ -989,9 +1011,12 @@ static int ram_save_page(RAMState *rs, PageSearchStatus *pss, bool last_stage)
              * page would be stale
              */
             xbzrle_cache_zero_page(rs, current_addr);
-            ram_release_pages(block->idstr, offset, pages);
+            if (pss->page_copy) {
+                qemu_madvise(p, TARGET_PAGE_SIZE, MADV_DONTNEED);
+            }
         } else if (!rs->ram_bulk_stage &&
-                   !migration_in_postcopy() && migrate_use_xbzrle()) {
+                   !migration_in_postcopy() && migrate_use_xbzrle() &&
+                   !migrate_background_snapshot()) {
             pages = save_xbzrle_page(rs, &p, current_addr, block,
                                      offset, last_stage);
             if (!last_stage) {
@@ -1008,9 +1033,10 @@ static int ram_save_page(RAMState *rs, PageSearchStatus *pss, bool last_stage)
         ram_counters.transferred +=
             save_page_header(rs, rs->f, block, offset | RAM_SAVE_FLAG_PAGE);
         if (send_async) {
-            qemu_put_buffer_async(rs->f, p, TARGET_PAGE_SIZE,
-                                  migrate_release_ram() &
-                                  migration_in_postcopy());
+            bool may_free = migrate_background_snapshot() ||
+                            (migrate_release_ram() &&
+                             migration_in_postcopy());
+            qemu_put_buffer_async(rs->f, p, TARGET_PAGE_SIZE, may_free);
         } else {
             qemu_put_buffer(rs->f, p, TARGET_PAGE_SIZE);
         }
@@ -1251,7 +1277,7 @@ static bool find_dirty_block(RAMState *rs, PageSearchStatus *pss, bool *again)
  * @rs: current RAM state
  * @offset: used to return the offset within the RAMBlock
  */
-static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset)
+static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset, void **page_copy)
 {
     RAMBlock *block = NULL;
 
@@ -1261,10 +1287,14 @@ static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset)
                                 QSIMPLEQ_FIRST(&rs->src_page_requests);
         block = entry->rb;
         *offset = entry->offset;
+        *page_copy = entry->page_copy;
 
         if (entry->len > TARGET_PAGE_SIZE) {
             entry->len -= TARGET_PAGE_SIZE;
             entry->offset += TARGET_PAGE_SIZE;
+            if (entry->page_copy) {
+                entry->page_copy = (char *)entry->page_copy +
+                                   TARGET_PAGE_SIZE;
+            }
         } else {
             memory_region_unref(block->mr);
             QSIMPLEQ_REMOVE_HEAD(&rs->src_page_requests, next_req);
@@ -1291,9 +1321,10 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
     RAMBlock  *block;
     ram_addr_t offset;
     bool dirty;
+    void *page_copy;
 
     do {
-        block = unqueue_page(rs, &offset);
+        block = unqueue_page(rs, &offset, &page_copy);
         /*
          * We're sending this page, and since it's postcopy nothing else
          * will dirty it, and we must make sure it doesn't get sent again
@@ -1331,6 +1362,7 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
          */
         pss->block = block;
         pss->page = offset >> TARGET_PAGE_BITS;
+        pss->page_copy = page_copy;
     }
 
     return !!block;
@@ -1368,17 +1400,25 @@ static void migration_page_queue_free(RAMState *rs)
  *
  * @rbname: Name of the RAMBLock of the request. NULL means the
  *          same that last one.
+ * @block: RAMBlock to use. block and rbname are mutually exclusive,
+ *         with block taking priority when both are given.
  * @start: starting address from the start of the RAMBlock
  * @len: length (in bytes) to send
+ * @page_copy: the address the page should be written from
  */
-int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len)
+int ram_save_queue_pages(RAMBlock *block, const char *rbname,
+                         ram_addr_t start, ram_addr_t len, void* page_copy)
 {
     RAMBlock *ramblock;
     RAMState *rs = ram_state;
 
     ram_counters.postcopy_requests++;
+
     rcu_read_lock();
-    if (!rbname) {
+
+    if (block) {
+        ramblock = block;
+    } else if (!rbname) {
         /* Reuse last RAMBlock */
         ramblock = rs->last_req_rb;
 
@@ -1413,6 +1453,7 @@ int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len)
     new_entry->rb = ramblock;
     new_entry->offset = start;
     new_entry->len = len;
+    new_entry->page_copy = page_copy;
 
     memory_region_ref(ramblock->mr);
     qemu_mutex_lock(&rs->src_page_req_mutex);
@@ -1450,7 +1491,8 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss,
          * xbzrle can do better than compression.
          */
         if (migrate_use_compression() &&
-            (rs->ram_bulk_stage || !migrate_use_xbzrle())) {
+            (rs->ram_bulk_stage || !migrate_use_xbzrle()) &&
+            !migrate_background_snapshot()) {
             res = ram_save_compressed_page(rs, pss, last_stage);
         } else {
             res = ram_save_page(rs, pss, last_stage);
@@ -1508,6 +1550,226 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
     return pages;
 }
 
+static bool ram_has_postcopy(void *opaque)
+{
+    return migrate_postcopy_ram();
+}
+
+static int mem_protect(void *addr, uint64_t length, int prot)
+{
+    int ret = mprotect(addr, length, prot);
+
+    if (ret < 0) {
+        error_report("%s: Can't change protection on ram block at %p (len: %lu)",
+                     __func__, addr, length);
+    }
+
+    /* 0 on success */
+    return ret;
+}
+
+static int ram_set_ro(void* addr, uint64_t length)
+{
+    return mem_protect(addr, length, PROT_READ);
+}
+
+static int ram_set_rw(void* addr, uint64_t length)
+{
+    return mem_protect(addr, length, PROT_READ | PROT_WRITE);
+}
+
+static RamBlockList ram_blocks;
+
+RamBlockList *ram_blocks_get(void)
+{
+    return &ram_blocks;
+}
+
+void ram_blocks_fill(RamBlockList *blocks)
+{
+    RAMBlock *block = NULL;
+
+    qemu_mutex_lock_ramlist();
+    QLIST_FOREACH(block, &ram_list.blocks, next) {
+        memory_region_ref(block->mr);
+        QLIST_INSERT_HEAD(blocks, block, bgs_next);
+    }
+    qemu_mutex_unlock_ramlist();
+}
+
+void ram_blocks_clear(RamBlockList *blocks)
+{
+    RAMBlock *block = NULL;
+
+    QLIST_FOREACH(block, blocks, bgs_next) {
+        QLIST_REMOVE(block, bgs_next);
+        memory_region_unref(block->mr);
+    }
+}
+
+int ram_blocks_set_ro(RamBlockList *blocks)
+{
+    RAMBlock *block = NULL;
+    int ret = 0;
+
+    QLIST_FOREACH(block, blocks, bgs_next) {
+        ret = ram_set_ro(block->host, block->used_length);
+        if (ret) {
+            break;
+        }
+    }
+
+    return ret;
+}
+
+int ram_blocks_set_rw(RamBlockList *blocks)
+{
+    RAMBlock *block = NULL;
+    int ret = 0;
+
+    QLIST_FOREACH(block, blocks, bgs_next) {
+        ret = ram_set_rw(block->host, block->used_length);
+        if (ret) {
+            break;
+        }
+    }
+
+    return ret;
+}
+
+static void ram_page_buffer_decrease_used(void)
+{
+    qemu_event_reset(&ram_state->page_buffer.used_decreased);
+    atomic_dec(&ram_state->page_buffer.used);
+    qemu_event_set(&ram_state->page_buffer.used_decreased);
+}
+
+static void ram_page_buffer_increase_used_wait(void)
+{
+    int ret, used, *used_ptr;
+    RAMState *rs = ram_state;
+    used_ptr = &rs->page_buffer.used;
+    do {
+        used = atomic_read(used_ptr);
+        if (rs->page_buffer.capacity > used) {
+            if ((ret = atomic_cmpxchg(used_ptr, used, used + 1)) == used) {
+                return;
+            } else {
+                continue;
+            }
+        } else {
+            qemu_event_wait(&rs->page_buffer.used_decreased);
+        }
+    } while (true);
+}
+
+static void *ram_page_buffer_get(void)
+{
+    void *page;
+    ram_page_buffer_increase_used_wait();
+    page = mmap(0, TARGET_PAGE_SIZE, PROT_READ|PROT_WRITE,
+                    MAP_PRIVATE|MAP_ANONYMOUS,
+                    -1, 0);
+    if (page == MAP_FAILED) {
+        ram_page_buffer_decrease_used();
+        page = NULL;
+    }
+    return page;
+}
+
+static int ram_page_buffer_free(void *buffer)
+{
+    ram_page_buffer_decrease_used();
+    return qemu_madvise(buffer, TARGET_PAGE_SIZE, MADV_DONTNEED);
+}
+
+static int ram_try_copy_page(RAMBlock *block, unsigned long page_nr,
+                             void** page_copy)
+{
+    void *host_page;
+
+    if (test_and_set_bit_atomic(page_nr, block->touched_map)) {
+        while (!test_bit_atomic(page_nr, block->copied_map)) {
+            /* The page is being copied elsewhere -- wait for the copy
+             * to finish, then check once again. */
+            qemu_event_reset(&ram_state->page_coping_done);
+            qemu_event_wait(&ram_state->page_coping_done);
+        }
+        return 0;
+    }
+
+    *page_copy = ram_page_buffer_get();
+    if (!*page_copy) {
+        return -1;
+    }
+
+    host_page = block->host + (page_nr << TARGET_PAGE_BITS);
+    memcpy(*page_copy, host_page, TARGET_PAGE_SIZE);
+
+    if (ram_set_rw(host_page, TARGET_PAGE_SIZE)) {
+        ram_page_buffer_free(*page_copy);
+        *page_copy = NULL;
+        return -1;
+    }
+
+    smp_mb();
+    set_bit_atomic(page_nr, block->copied_map);
+    qemu_event_set(&ram_state->page_coping_done);
+
+    return 1;
+}
+
+static RAMBlock *find_ram_block(uint8_t *address, ram_addr_t *page_offset)
+{
+    RAMBlock *block = NULL;
+
+    QLIST_FOREACH(block, ram_blocks_get(), bgs_next) {
+        /* This can happen when the block is not mapped. */
+        if (block->host == NULL) {
+            continue;
+        }
+
+        if (address - block->host < block->max_length) {
+            *page_offset = (address - block->host) & TARGET_PAGE_MASK;
+            return block;
+        }
+    }
+
+    return NULL;
+}
+
+/* Returns 0 on success, < 0 on error */
+int ram_process_page_fault(void *address)
+{
+    int ret;
+    void *page_copy = NULL;
+    unsigned long page_nr;
+    ram_addr_t offset;
+
+    RAMBlock *block = find_ram_block(address, &offset);
+
+    if (!block) {
+        return -1;
+    }
+
+    page_nr = offset >> TARGET_PAGE_BITS;
+
+    ret = ram_try_copy_page(block, page_nr, &page_copy);
+
+    if (ret < 0) {
+        return ret;
+    } else if (ret > 0) {
+        if (ram_save_queue_pages(block, NULL, offset,
+                                 TARGET_PAGE_SIZE, page_copy)) {
+            ram_page_buffer_free(page_copy);
+            return -1;
+        }
+    }
+
+    return 0;
+}
+
 /**
  * ram_find_and_save_block: finds a dirty page and sends it to f
  *
@@ -1536,6 +1798,7 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
     pss.block = rs->last_seen_block;
     pss.page = rs->last_page;
     pss.complete_round = false;
+    pss.page_copy = NULL;
 
     if (!pss.block) {
         pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
@@ -1548,11 +1811,27 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
         if (!found) {
             /* priority queue empty, so just search for something dirty */
             found = find_dirty_block(rs, &pss, &again);
+
+            if (found && migrate_background_snapshot()) {
+                /* Make a copy of the page and pass it to the page
+                 * search status. */
+                int ret;
+                ret = ram_try_copy_page(pss.block, pss.page, &pss.page_copy);
+                if (ret == 0) {
+                    found = false;
+                    pages = 0;
+                } else if (ret < 0) {
+                    return ret;
+                }
+            }
         }
 
         if (found) {
             pages = ram_save_host_page(rs, &pss, last_stage);
         }
+
+        if (pss.page_copy) {
+            ram_page_buffer_decrease_used();
+        }
     } while (!pages && again);
 
     rs->last_seen_block = pss.block;
@@ -1600,9 +1879,15 @@ static void xbzrle_load_cleanup(void)
 
 static void ram_state_cleanup(RAMState **rsp)
 {
+    if (migrate_background_snapshot()) {
+        qemu_event_destroy(&(*rsp)->page_buffer.used_decreased);
+        qemu_event_destroy(&(*rsp)->page_coping_done);
+    }
+
     migration_page_queue_free(*rsp);
     qemu_mutex_destroy(&(*rsp)->bitmap_mutex);
     qemu_mutex_destroy(&(*rsp)->src_page_req_mutex);
+
     g_free(*rsp);
     *rsp = NULL;
 }
@@ -1638,6 +1923,13 @@ static void ram_save_cleanup(void *opaque)
         block->bmap = NULL;
         g_free(block->unsentmap);
         block->unsentmap = NULL;
+
+        if (migrate_background_snapshot()) {
+            g_free(block->touched_map);
+            block->touched_map = NULL;
+            g_free(block->copied_map);
+            block->copied_map = NULL;
+        }
     }
 
     xbzrle_cleanup();
@@ -1652,6 +1944,9 @@ static void ram_state_reset(RAMState *rs)
     rs->last_page = 0;
     rs->last_version = ram_list.version;
     rs->ram_bulk_stage = true;
+
+    rs->page_buffer.capacity = 1000; /* in number of host pages */
+    rs->page_buffer.used = 0;
 }
 
 #define MAX_WAIT 50 /* ms, half buffered_file limit */
@@ -2129,6 +2424,11 @@ static int ram_state_init(RAMState **rsp)
      */
     (*rsp)->migration_dirty_pages = ram_bytes_total() >> TARGET_PAGE_BITS;
 
+    if (migrate_background_snapshot()) {
+        qemu_event_init(&(*rsp)->page_buffer.used_decreased, false);
+        qemu_event_init(&(*rsp)->page_coping_done, false);
+    }
+
     ram_state_reset(*rsp);
 
     return 0;
@@ -2145,10 +2445,16 @@ static void ram_list_init_bitmaps(void)
             pages = block->max_length >> TARGET_PAGE_BITS;
             block->bmap = bitmap_new(pages);
             bitmap_set(block->bmap, 0, pages);
+
             if (migrate_postcopy_ram()) {
                 block->unsentmap = bitmap_new(pages);
                 bitmap_set(block->unsentmap, 0, pages);
             }
+
+            if (migrate_background_snapshot()) {
+                block->touched_map = bitmap_new(pages);
+                block->copied_map = bitmap_new(pages);
+            }
         }
     }
 }
@@ -2974,11 +3280,6 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
     return ret;
 }
 
-static bool ram_has_postcopy(void *opaque)
-{
-    return migrate_postcopy_ram();
-}
-
 static SaveVMHandlers savevm_ram_handlers = {
     .save_setup = ram_save_setup,
     .save_live_iterate = ram_save_iterate,
diff --git a/migration/ram.h b/migration/ram.h
index 64d81e9f1d..627c2efb51 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -31,6 +31,7 @@
 
 #include "qemu-common.h"
 #include "exec/cpu-common.h"
+#include "exec/ramlist.h"
 
 extern MigrationStats ram_counters;
 extern XBZRLECacheStats xbzrle_counters;
@@ -45,7 +46,9 @@ int multifd_load_setup(void);
 int multifd_load_cleanup(Error **errp);
 
 uint64_t ram_pagesize_summary(void);
-int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len);
+int ram_save_queue_pages(RAMBlock *block, const char *rbname,
+                         ram_addr_t start, ram_addr_t len,
+                         void* cached_page);
 void acct_update_position(QEMUFile *f, size_t size, bool zero);
 void ram_debug_dump_bitmap(unsigned long *todump, bool expected,
                            unsigned long pages);
@@ -61,5 +64,11 @@ void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
 int ramblock_recv_bitmap_test(RAMBlock *rb, void *host_addr);
 void ramblock_recv_bitmap_set(RAMBlock *rb, void *host_addr);
 void ramblock_recv_bitmap_set_range(RAMBlock *rb, void *host_addr, size_t nr);
+int ram_process_page_fault(void *address);
+RamBlockList *ram_blocks_get(void);
+void ram_blocks_fill(RamBlockList *blocks);
+void ram_blocks_clear(RamBlockList *blocks);
+int ram_blocks_set_ro(RamBlockList *blocks);
+int ram_blocks_set_rw(RamBlockList *blocks);
 
 #endif
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCH v0 5/7] kvm: add failed memory access exit reason
  2018-06-29  8:03 [Qemu-devel] [PATCH v0 0/7] Background snapshots Denis Plotnikov
                   ` (3 preceding siblings ...)
  2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 4/7] migration: add background snapshot infrastructure Denis Plotnikov
@ 2018-06-29  8:03 ` Denis Plotnikov
  2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 6/7] kvm: add vCPU failed memory access processing Denis Plotnikov
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Denis Plotnikov @ 2018-06-29  8:03 UTC (permalink / raw)
  To: dgilbert, quintela, pbonzini; +Cc: qemu-devel

The patch allows QEMU to read the KVM failed memory
access exit.

This is a temporary patch. It should be removed once the
KVM failed memory access exit is imported from the Linux branch.

Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 linux-headers/linux/kvm.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index d92c9b2f0e..ea63b681d8 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -235,6 +235,7 @@ struct kvm_hyperv_exit {
 #define KVM_EXIT_S390_STSI        25
 #define KVM_EXIT_IOAPIC_EOI       26
 #define KVM_EXIT_HYPERV           27
+#define KVM_EXIT_FAIL_MEM_ACCESS  28
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -392,6 +393,10 @@ struct kvm_run {
 		} eoi;
 		/* KVM_EXIT_HYPERV */
 		struct kvm_hyperv_exit hyperv;
+		/* KVM_EXIT_FAIL_MEM_ACCESS */
+		struct {
+			__u64 hva;
+		} fail_mem_access;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCH v0 6/7] kvm: add vCPU failed memory access processing
  2018-06-29  8:03 [Qemu-devel] [PATCH v0 0/7] Background snapshots Denis Plotnikov
                   ` (4 preceding siblings ...)
  2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 5/7] kvm: add failed memory access exit reason Denis Plotnikov
@ 2018-06-29  8:03 ` Denis Plotnikov
  2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 7/7] migration: add background snapshotting Denis Plotnikov
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Denis Plotnikov @ 2018-06-29  8:03 UTC (permalink / raw)
  To: dgilbert, quintela, pbonzini; +Cc: qemu-devel

This is done with the support of the KVM patch that returns the
faulting address.

Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 target/i386/kvm.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/target/i386/kvm.c b/target/i386/kvm.c
index 3ac5302bc5..b87881a8f1 100644
--- a/target/i386/kvm.c
+++ b/target/i386/kvm.c
@@ -45,6 +45,8 @@
 #include "hw/pci/msi.h"
 #include "hw/pci/msix.h"
 #include "migration/blocker.h"
+#include "migration/savevm.h"
+#include "migration/ram.h"
 #include "exec/memattrs.h"
 #include "trace.h"
 
@@ -3130,6 +3132,18 @@ static bool host_supports_vmx(void)
     return ecx & CPUID_EXT_VMX;
 }
 
+static int kvm_handle_fail_mem_access(CPUState *cpu)
+{
+    struct kvm_run *run = cpu->kvm_run;
+    int ret = ram_process_page_fault((void*)run->fail_mem_access.hva);
+
+    if (ret >= 0) {
+        cpu_resume(cpu);
+    }
+
+    return ret;
+}
+
 #define VMX_INVALID_GUEST_STATE 0x80000021
 
 int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run)
@@ -3188,6 +3202,10 @@ int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run)
         ioapic_eoi_broadcast(run->eoi.vector);
         ret = 0;
         break;
+    case KVM_EXIT_FAIL_MEM_ACCESS:
+        ret = kvm_handle_fail_mem_access(cs);
+        /* ret < 0 stops further execution; ret == 0 continues without errors */
+        break;
     default:
         fprintf(stderr, "KVM: unknown exit reason %d\n", run->exit_reason);
         ret = -1;
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [Qemu-devel] [PATCH v0 7/7] migration: add background snapshotting
  2018-06-29  8:03 [Qemu-devel] [PATCH v0 0/7] Background snapshots Denis Plotnikov
                   ` (5 preceding siblings ...)
  2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 6/7] kvm: add vCPU failed memory access processing Denis Plotnikov
@ 2018-06-29  8:03 ` Denis Plotnikov
  2018-07-12 18:59   ` Dr. David Alan Gilbert
  2018-06-29 11:53 ` [Qemu-devel] [PATCH v0 0/7] Background snapshots Dr. David Alan Gilbert
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Denis Plotnikov @ 2018-06-29  8:03 UTC (permalink / raw)
  To: dgilbert, quintela, pbonzini; +Cc: qemu-devel

The patch enables saving the vmstate to the migration stream
in the background: RAM is saved while the vCPUs are running.
This is done to reduce the downtime of VM snapshotting: the majority
of the vmstate is RAM, while the rest of the devices' state consumes
only a few MB of memory on a typical VM.
Until now, there was no way to make a snapshot without stopping
the VM completely. From now on, one can save a VM state with the VM
stopped only for the couple of seconds needed to save the devices'
states. Then the VM is resumed and RAM is written out without fully
stopping the VM.

To use the background snapshot, enable the "background-snapshot"
migration capability and start the migration.
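
For illustration, enabling the capability and starting the snapshot
over QMP might look like the following sketch (the capability name is
the one added by this series; the destination URI is just an example):

```json
{ "execute": "migrate-set-capabilities",
  "arguments": { "capabilities": [
      { "capability": "background-snapshot", "state": true } ] } }
{ "execute": "migrate",
  "arguments": { "uri": "exec:cat > /tmp/vm.state" } }
```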

Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 migration/migration.c | 104 +++++++++++++++++++++++++++++++++++++++++-
 migration/savevm.c    |  91 +++++++++++++++++++-----------------
 migration/savevm.h    |   2 +
 3 files changed, 154 insertions(+), 43 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 131d0904e4..b062bd8ddc 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2188,6 +2188,100 @@ bool migrate_colo_enabled(void)
     return s->enabled_capabilities[MIGRATION_CAPABILITY_X_COLO];
 }
 
+static void background_snapshot_sigsegv_handler(int unused0, siginfo_t *info, void *unused1)
+{
+    if (ram_process_page_fault(info->si_addr)) {
+        assert(false);
+    }
+}
+
+static void *background_snapshot_thread(void *opaque)
+{
+    MigrationState *m = opaque;
+    QIOChannelBuffer *bioc;
+    QEMUFile *fb;
+    RamBlockList *ram_blocks;
+    int res = 0;
+
+    rcu_register_thread();
+
+    qemu_file_set_rate_limit(m->to_dst_file, INT64_MAX);
+
+    qemu_mutex_lock_iothread();
+    vm_stop(RUN_STATE_PAUSED);
+
+    qemu_savevm_state_header(m->to_dst_file);
+    qemu_mutex_unlock_iothread();
+    qemu_savevm_state_setup(m->to_dst_file);
+    qemu_mutex_lock_iothread();
+
+    migrate_set_state(&m->state, MIGRATION_STATUS_SETUP,
+                      MIGRATION_STATUS_ACTIVE);
+
+    /* save the device state to the temporary buffer */
+    bioc = qio_channel_buffer_new(4096);
+    qio_channel_set_name(QIO_CHANNEL(bioc), "vmstate-buffer");
+    fb = qemu_fopen_channel_output(QIO_CHANNEL(bioc));
+    object_unref(OBJECT(bioc));
+
+    ram_blocks = ram_blocks_get();
+    ram_blocks_fill(ram_blocks);
+    ram_blocks_set_ro(ram_blocks);
+
+    if (global_state_store()) {
+        goto exit;
+    }
+
+    if (qemu_save_all_device_states(fb, false, true) < 0) {
+        migrate_set_state(&m->state, MIGRATION_STATUS_ACTIVE,
+                          MIGRATION_STATUS_FAILED);
+        goto exit;
+    }
+
+    sigsegv_user_handler_set(background_snapshot_sigsegv_handler);
+
+    smp_mb();
+    vm_start();
+    qemu_mutex_unlock_iothread();
+
+    while (!res) {
+        res = qemu_savevm_state_iterate(m->to_dst_file, false);
+
+        if (res < 0 || qemu_file_get_error(m->to_dst_file)) {
+            migrate_set_state(&m->state, MIGRATION_STATUS_ACTIVE,
+                              MIGRATION_STATUS_FAILED);
+            goto exit;
+        }
+    }
+
+    /* By this moment we have RAM saved into the stream. The next step
+     * is to flush the device state right after the RAM. The rest of
+     * the device state is stored in the temporary buffer. Now flush
+     * the buffer into the stream. */
+    qemu_put_buffer(m->to_dst_file, bioc->data, bioc->usage);
+    qemu_fflush(m->to_dst_file);
+
+    if (qemu_file_get_error(m->to_dst_file)) {
+        migrate_set_state(&m->state, MIGRATION_STATUS_ACTIVE,
+                          MIGRATION_STATUS_FAILED);
+        goto exit;
+    }
+
+    migrate_set_state(&m->state, MIGRATION_STATUS_ACTIVE,
+                                 MIGRATION_STATUS_COMPLETED);
+exit:
+    ram_blocks_set_rw(ram_blocks);
+    ram_blocks_clear(ram_blocks);
+    smp_mb();
+    sigsegv_user_handler_reset();
+    qemu_fclose(fb);
+    qemu_mutex_lock_iothread();
+    qemu_savevm_state_cleanup();
+    qemu_mutex_unlock_iothread();
+    rcu_unregister_thread();
+    return NULL;
+}
+
 /*
  * Master migration thread on the source VM.
  * It drives the migration and pumps the data down the outgoing channel.
@@ -2402,8 +2496,14 @@ void migrate_fd_connect(MigrationState *s)
         migrate_fd_cleanup(s);
         return;
     }
-    qemu_thread_create(&s->thread, "live_migration", migration_thread, s,
-                       QEMU_THREAD_JOINABLE);
+    if (migrate_background_snapshot()) {
+        qemu_thread_create(&s->thread, "bg_snapshot",
+                           background_snapshot_thread, s,
+                           QEMU_THREAD_JOINABLE);
+    } else {
+        qemu_thread_create(&s->thread, "live_migration", migration_thread, s,
+                           QEMU_THREAD_JOINABLE);
+    }
     s->migration_thread_running = true;
 }
 
diff --git a/migration/savevm.c b/migration/savevm.c
index f202c3de3a..641b897ff5 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1107,51 +1107,15 @@ void qemu_savevm_state_complete_postcopy(QEMUFile *f)
     qemu_fflush(f);
 }
 
-int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only,
-                                       bool inactivate_disks)
+int qemu_save_all_device_states(QEMUFile *f, bool inactivate_disks,
+                                bool send_eof)
 {
-    QJSON *vmdesc;
     int vmdesc_len;
     SaveStateEntry *se;
     int ret;
-    bool in_postcopy = migration_in_postcopy();
-
-    trace_savevm_state_complete_precopy();
+    QJSON *vmdesc = qjson_new();
 
     cpu_synchronize_all_states();
-
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
-        if (!se->ops ||
-            (in_postcopy && se->ops->has_postcopy &&
-             se->ops->has_postcopy(se->opaque)) ||
-            (in_postcopy && !iterable_only) ||
-            !se->ops->save_live_complete_precopy) {
-            continue;
-        }
-
-        if (se->ops && se->ops->is_active) {
-            if (!se->ops->is_active(se->opaque)) {
-                continue;
-            }
-        }
-        trace_savevm_section_start(se->idstr, se->section_id);
-
-        save_section_header(f, se, QEMU_VM_SECTION_END);
-
-        ret = se->ops->save_live_complete_precopy(f, se->opaque);
-        trace_savevm_section_end(se->idstr, se->section_id, ret);
-        save_section_footer(f, se);
-        if (ret < 0) {
-            qemu_file_set_error(f, ret);
-            return -1;
-        }
-    }
-
-    if (iterable_only) {
-        return 0;
-    }
-
-    vmdesc = qjson_new();
     json_prop_int(vmdesc, "page_size", qemu_target_page_size());
     json_start_array(vmdesc, "devices");
     QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
@@ -1193,8 +1157,8 @@ int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only,
             return ret;
         }
     }
-    if (!in_postcopy) {
-        /* Postcopy stream will still be going */
+
+    if (send_eof) {
         qemu_put_byte(f, QEMU_VM_EOF);
     }
 
@@ -1213,6 +1177,51 @@ int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only,
     return 0;
 }
 
+int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only,
+                                       bool inactivate_disks)
+{
+    SaveStateEntry *se;
+    int ret;
+    bool in_postcopy = migration_in_postcopy();
+
+    trace_savevm_state_complete_precopy();
+
+    cpu_synchronize_all_states();
+
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        if (!se->ops ||
+            (in_postcopy && se->ops->has_postcopy &&
+             se->ops->has_postcopy(se->opaque)) ||
+            (in_postcopy && !iterable_only) ||
+            !se->ops->save_live_complete_precopy) {
+            continue;
+        }
+
+        if (se->ops && se->ops->is_active) {
+            if (!se->ops->is_active(se->opaque)) {
+                continue;
+            }
+        }
+        trace_savevm_section_start(se->idstr, se->section_id);
+
+        save_section_header(f, se, QEMU_VM_SECTION_END);
+
+        ret = se->ops->save_live_complete_precopy(f, se->opaque);
+        trace_savevm_section_end(se->idstr, se->section_id, ret);
+        save_section_footer(f, se);
+        if (ret < 0) {
+            qemu_file_set_error(f, ret);
+            return -1;
+        }
+    }
+
+    if (iterable_only) {
+        return 0;
+    }
+
+    return qemu_save_all_device_states(f, inactivate_disks, !in_postcopy);
+}
+
 /* Give an estimate of the amount left to be transferred,
  * the result is split into the amount for units that can and
  * for units that can't do postcopy.
diff --git a/migration/savevm.h b/migration/savevm.h
index 295c4a1f2c..87b278142d 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -55,4 +55,6 @@ void qemu_savevm_send_postcopy_ram_discard(QEMUFile *f, const char *name,
 int qemu_loadvm_state(QEMUFile *f);
 void qemu_loadvm_state_cleanup(void);
 
+int qemu_save_all_device_states(QEMUFile *f, bool inactivate_disks,
+                                bool send_eof);
 #endif
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH v0 0/7] Background snapshots
  2018-06-29  8:03 [Qemu-devel] [PATCH v0 0/7] Background snapshots Denis Plotnikov
                   ` (6 preceding siblings ...)
  2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 7/7] migration: add background snapshotting Denis Plotnikov
@ 2018-06-29 11:53 ` Dr. David Alan Gilbert
  2018-07-25 10:18   ` Peter Xu
  2018-07-02 11:23 ` Peter Xu
  2018-07-13  5:20 ` Peter Xu
  9 siblings, 1 reply; 31+ messages in thread
From: Dr. David Alan Gilbert @ 2018-06-29 11:53 UTC (permalink / raw)
  To: Denis Plotnikov, aarcange; +Cc: quintela, pbonzini, qemu-devel

* Denis Plotnikov (dplotnikov@virtuozzo.com) wrote:
> The patch set adds the ability to make external snapshots while VM is running.

cc'ing in Andrea since this uses sigsegv's to avoid userfault-wp that
isn't there yet.

Hi Denis,
  How robust are you finding this SEGV based trick; for example what
about things like the kernel walking vhost queues or similar kernel
nasties?

Dave

> The workflow to make a snapshot is the following:
> 1. Pause the vm
> 2. Make a snapshot of block devices using the scheme of your choice
> 3. Turn on background-snapshot migration capability
> 4. Start the migration using the destination (migration stream) of your choice.
>    The migration will resume the vm execution by itself
>    when it has the devices' states saved  and is ready to start ram writing
>    to the migration stream.
> 5. Listen to the migration finish event
> 
> The feature relies on KVM unapplied ability to report the faulting address.
> Please find the KVM patch snippet to make the patchset work below:
> 
> +++ b/arch/x86/kvm/vmx.c
> @@ -XXXX,X +XXXX,XX @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>  
>         vcpu->arch.exit_qualification = exit_qualification;
>  
> -       return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> +       r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> +        if (r == -EFAULT) {
> +               unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gpa >> PAGE_SHIFT);
> +
> +               vcpu->run->exit_reason = KVM_EXIT_FAIL_MEM_ACCESS;
> +               vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_VIOLATION;
> +               vcpu->run->fail_mem_access.hva = hva | (gpa & (PAGE_SIZE-1));
> +               r = 0;
> +
> +       }
> +       return r;
> 
> The patch to KVM can be sent if the patch set approved
> 
> Denis Plotnikov (7):
>   migration: add background snapshot capability
>   bitops: add some atomic versions of bitmap operations
>   threads: add infrastructure to process sigsegv
>   migration: add background snapshot infrastructure
>   kvm: add failed memeory access exit reason
>   kvm: add vCPU failed memeory access processing
>   migration: add background snapshotting
> 
>  include/exec/ram_addr.h   |   7 +
>  include/exec/ramlist.h    |   4 +-
>  include/qemu/bitops.h     |  24 +++
>  include/qemu/thread.h     |   5 +
>  linux-headers/linux/kvm.h |   5 +
>  migration/migration.c     | 141 +++++++++++++++-
>  migration/migration.h     |   1 +
>  migration/ram.c           | 333 ++++++++++++++++++++++++++++++++++++--
>  migration/ram.h           |  11 +-
>  migration/savevm.c        |  91 ++++++-----
>  migration/savevm.h        |   2 +
>  qapi/migration.json       |   6 +-
>  target/i386/kvm.c         |  18 +++
>  util/qemu-thread-posix.c  |  50 ++++++
>  14 files changed, 635 insertions(+), 63 deletions(-)
> 
> -- 
> 2.17.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH v0 1/7] migration: add background snapshot capability
  2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 1/7] migration: add background snapshot capability Denis Plotnikov
@ 2018-06-29 16:02   ` Eric Blake
  2018-07-12  9:03   ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 31+ messages in thread
From: Eric Blake @ 2018-06-29 16:02 UTC (permalink / raw)
  To: Denis Plotnikov, dgilbert, quintela, pbonzini; +Cc: qemu-devel

On 06/29/2018 03:03 AM, Denis Plotnikov wrote:
> The capability is used for the background vmstate saving
> using the migration infrastructure.
> Background vmstate saving means that the majority of vmstate
> (RAM) is saved in the background when VM's vCPUS are running.
> This helps to reduce the VM downtime on VM snapshotting.
> 
> Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>
> ---
>   migration/migration.c | 35 +++++++++++++++++++++++++++++++++++
>   migration/migration.h |  1 +
>   qapi/migration.json   |  6 +++++-
>   3 files changed, 41 insertions(+), 1 deletion(-)
> 

> +++ b/qapi/migration.json
> @@ -352,12 +352,16 @@
>   #
>   # @x-multifd: Use more than one fd for migration (since 2.11)
>   #
> +# @background-snapshot: Using migration infrastructure makes VM snapshot
> +#         saving its RAM in background. This reduces  VM downtime. (since 2.12)

You've missed 2.12; the next release is 3.0 (and even then, we're fast 
coming up on soft freeze, so a maintainer must be willing to take this 
feature soon, or it will become 3.1 material).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH v0 0/7] Background snapshots
  2018-06-29  8:03 [Qemu-devel] [PATCH v0 0/7] Background snapshots Denis Plotnikov
                   ` (7 preceding siblings ...)
  2018-06-29 11:53 ` [Qemu-devel] [PATCH v0 0/7] Background snapshots Dr. David Alan Gilbert
@ 2018-07-02 11:23 ` Peter Xu
  2018-07-02 12:40   ` Denis Plotnikov
  2018-07-13  5:20 ` Peter Xu
  9 siblings, 1 reply; 31+ messages in thread
From: Peter Xu @ 2018-07-02 11:23 UTC (permalink / raw)
  To: Denis Plotnikov; +Cc: dgilbert, quintela, pbonzini, qemu-devel

On Fri, Jun 29, 2018 at 11:03:13AM +0300, Denis Plotnikov wrote:
> The patch set adds the ability to make external snapshots while VM is running.

Hi, Denis,

This work is interesting, though I have a few questions to ask in
general below.

> 
> The workflow to make a snapshot is the following:
> 1. Pause the vm
> 2. Make a snapshot of block devices using the scheme of your choice

Here you explicitly took the snapshot for the block device, then...

> 3. Turn on background-snapshot migration capability
> 4. Start the migration using the destination (migration stream) of your choice.

... here you started the VM snapshot.  How did you make sure that the
VM snapshot (e.g., the RAM data) and the block snapshot will be
aligned?

For example, in current save_snapshot() we'll quiesce disk IOs before
migrating the last pieces of RAM data to make sure they are aligned.
I didn't figure out myself on how that's done in this work.

>    The migration will resume the vm execution by itself
>    when it has the devices' states saved  and is ready to start ram writing
>    to the migration stream.
> 5. Listen to the migration finish event
> 
> The feature relies on KVM unapplied ability to report the faulting address.
> Please find the KVM patch snippet to make the patchset work below:
> 
> +++ b/arch/x86/kvm/vmx.c
> @@ -XXXX,X +XXXX,XX @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>  
>         vcpu->arch.exit_qualification = exit_qualification;
>  
> -       return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> +       r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> +        if (r == -EFAULT) {
> +               unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gpa >> PAGE_SHIFT);
> +
> +               vcpu->run->exit_reason = KVM_EXIT_FAIL_MEM_ACCESS;
> +               vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_VIOLATION;
> +               vcpu->run->fail_mem_access.hva = hva | (gpa & (PAGE_SIZE-1));
> +               r = 0;
> +
> +       }
> +       return r;

Just to make sure I fully understand here: so this is some extra KVM
work just to make sure the mprotect() trick will work even for KVM
vcpu threads, am I right?

Meanwhile, I see that you only modified EPT violation code, then how
about the legacy hardwares and softmmu case?

Thanks,

> 
> The patch to KVM can be sent if the patch set approved
> 
> Denis Plotnikov (7):
>   migration: add background snapshot capability
>   bitops: add some atomic versions of bitmap operations
>   threads: add infrastructure to process sigsegv
>   migration: add background snapshot infrastructure
>   kvm: add failed memeory access exit reason
>   kvm: add vCPU failed memeory access processing
>   migration: add background snapshotting
> 
>  include/exec/ram_addr.h   |   7 +
>  include/exec/ramlist.h    |   4 +-
>  include/qemu/bitops.h     |  24 +++
>  include/qemu/thread.h     |   5 +
>  linux-headers/linux/kvm.h |   5 +
>  migration/migration.c     | 141 +++++++++++++++-
>  migration/migration.h     |   1 +
>  migration/ram.c           | 333 ++++++++++++++++++++++++++++++++++++--
>  migration/ram.h           |  11 +-
>  migration/savevm.c        |  91 ++++++-----
>  migration/savevm.h        |   2 +
>  qapi/migration.json       |   6 +-
>  target/i386/kvm.c         |  18 +++
>  util/qemu-thread-posix.c  |  50 ++++++
>  14 files changed, 635 insertions(+), 63 deletions(-)
> 
> -- 
> 2.17.0
> 
> 

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH v0 0/7] Background snapshots
  2018-07-02 11:23 ` Peter Xu
@ 2018-07-02 12:40   ` Denis Plotnikov
  2018-07-03  5:54     ` Peter Xu
  0 siblings, 1 reply; 31+ messages in thread
From: Denis Plotnikov @ 2018-07-02 12:40 UTC (permalink / raw)
  To: Peter Xu; +Cc: dgilbert, quintela, pbonzini, qemu-devel



On 02.07.2018 14:23, Peter Xu wrote:
> On Fri, Jun 29, 2018 at 11:03:13AM +0300, Denis Plotnikov wrote:
>> The patch set adds the ability to make external snapshots while VM is running.
> 
> Hi, Denis,
> 
> This work is interesting, though I have a few questions to ask in
> general below.
> 
>>
>> The workflow to make a snapshot is the following:
>> 1. Pause the vm
>> 2. Make a snapshot of block devices using the scheme of your choice
> 
> Here you explicitly took the snapshot for the block device, then...
> 
>> 3. Turn on background-snapshot migration capability
>> 4. Start the migration using the destination (migration stream) of your choice.
> 
> ... here you started the VM snapshot.  How did you make sure that the
> VM snapshot (e.g., the RAM data) and the block snapshot will be
> aligned?
As the VM has been paused before making the image (disk) snapshot, no 
requests should go to the original image from then on. All later 
requests go to the disk snapshot.

At the point we have a disk image and its snapshot.
In the image we have a kind of checkpointed state which won't (shouldn't) 
change, because all write requests go to the image snapshot.

Then we start the background snapshot, which marks all the memory 
read-only and writes the non-RAM part of the VM state to the VM 
snapshot file. By making the memory read-only we effectively freeze 
the state of the RAM.

At that point we have the original image and the VM memory content, 
which correspond to each other because the VM isn't running.

Then the background snapshot thread resumes VM execution with the 
memory marked read-only while it is being written to the external VM 
snapshot file. All write accesses to the memory are intercepted, and 
the pages being accessed are written to the VM snapshot (VM state) 
file with priority. Since each page is marked read-write again right 
after being written, it is not tracked for later accesses.

This is how we guarantee that the VM snapshot (state) file has the 
memory content corresponding to the moment when the disk snapshot was 
created.

When the writing ends, we have the VM snapshot (VM state) file which has 
the memory content saved as of the moment of the image snapshot creation.

So, to restore the VM from "the snapshot" we need to use the original 
disk image (not the disk snapshot) and the VM snapshot (VM state with 
saved memory) file.

> 
> For example, in current save_snapshot() we'll quiesce disk IOs before
> migrating the last pieces of RAM data to make sure they are aligned.
> I didn't figure out myself on how that's done in this work.
> 
>>     The migration will resume the vm execution by itself
>>     when it has the devices' states saved  and is ready to start ram writing
>>     to the migration stream.
>> 5. Listen to the migration finish event
>>
>> The feature relies on KVM unapplied ability to report the faulting address.
>> Please find the KVM patch snippet to make the patchset work below:
>>
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -XXXX,X +XXXX,XX @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>>   
>>          vcpu->arch.exit_qualification = exit_qualification;
>>   
>> -       return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
>> +       r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
>> +        if (r == -EFAULT) {
>> +               unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gpa >> PAGE_SHIFT);
>> +
>> +               vcpu->run->exit_reason = KVM_EXIT_FAIL_MEM_ACCESS;
>> +               vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_VIOLATION;
>> +               vcpu->run->fail_mem_access.hva = hva | (gpa & (PAGE_SIZE-1));
>> +               r = 0;
>> +
>> +       }
>> +       return r;
> 
> Just to make sure I fully understand here: so this is some extra KVM
> work just to make sure the mprotect() trick will work even for KVM
> vcpu threads, am I right?

That's correct!
> 
> Meanwhile, I see that you only modified EPT violation code, then how
> about the legacy hardwares and softmmu case?

Didn't check thoroughly but the scheme works in TCG mode.


Thanks,
Denis

> 
> Thanks,
> 
>>
>> The patch to KVM can be sent if the patch set approved
>>
>> Denis Plotnikov (7):
>>    migration: add background snapshot capability
>>    bitops: add some atomic versions of bitmap operations
>>    threads: add infrastructure to process sigsegv
>>    migration: add background snapshot infrastructure
>>    kvm: add failed memeory access exit reason
>>    kvm: add vCPU failed memeory access processing
>>    migration: add background snapshotting
>>
>>   include/exec/ram_addr.h   |   7 +
>>   include/exec/ramlist.h    |   4 +-
>>   include/qemu/bitops.h     |  24 +++
>>   include/qemu/thread.h     |   5 +
>>   linux-headers/linux/kvm.h |   5 +
>>   migration/migration.c     | 141 +++++++++++++++-
>>   migration/migration.h     |   1 +
>>   migration/ram.c           | 333 ++++++++++++++++++++++++++++++++++++--
>>   migration/ram.h           |  11 +-
>>   migration/savevm.c        |  91 ++++++-----
>>   migration/savevm.h        |   2 +
>>   qapi/migration.json       |   6 +-
>>   target/i386/kvm.c         |  18 +++
>>   util/qemu-thread-posix.c  |  50 ++++++
>>   14 files changed, 635 insertions(+), 63 deletions(-)
>>
>> -- 
>> 2.17.0
>>
>>
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH v0 0/7] Background snapshots
  2018-07-02 12:40   ` Denis Plotnikov
@ 2018-07-03  5:54     ` Peter Xu
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2018-07-03  5:54 UTC (permalink / raw)
  To: Denis Plotnikov; +Cc: dgilbert, quintela, pbonzini, qemu-devel

On Mon, Jul 02, 2018 at 03:40:31PM +0300, Denis Plotnikov wrote:
> 
> 
> On 02.07.2018 14:23, Peter Xu wrote:
> > On Fri, Jun 29, 2018 at 11:03:13AM +0300, Denis Plotnikov wrote:
> > > The patch set adds the ability to make external snapshots while VM is running.
> > 
> > Hi, Denis,
> > 
> > This work is interesting, though I have a few questions to ask in
> > general below.
> > 
> > > 
> > > The workflow to make a snapshot is the following:
> > > 1. Pause the vm
> > > 2. Make a snapshot of block devices using the scheme of your choice
> > 
> > Here you explicitly took the snapshot for the block device, then...
> > 
> > > 3. Turn on background-snapshot migration capability
> > > 4. Start the migration using the destination (migration stream) of your choice.
> > 
> > ... here you started the VM snapshot.  How did you make sure that the
> > VM snapshot (e.g., the RAM data) and the block snapshot will be
> > aligned?
> As the VM has been paused before making the image (disk) snapshot, there
> should be no further requests to the original image. All the later
> requests go to the disk snapshot.
> 
> At that point we have a disk image and its snapshot.
> The image holds a checkpointed state which won't (shouldn't) be
> changed, because all write requests go to the image snapshot.
> 
> Then we start the background snapshot, which marks all the memory as
> read-only and writes the devices' part of the VM state to the VM snapshot
> file. By making the memory read-only we effectively freeze the RAM state.
> 
> At that point we have an original image and the VM memory content which
> corresponds to each other because the VM isn't running.
> 
> Then the background snapshot thread resumes VM execution with the
> read-only-marked memory, which is being written to the external VM snapshot
> file. All write accesses to the memory are intercepted, and the memory
> pages being accessed are written to the VM snapshot (VM state) file with
> priority. A page is marked read-write again right after being written, so
> it is no longer tracked for later accesses.
> 
> This is how we guarantee that the VM snapshot (state) file has the memory
> content corresponding to the moment when the disk snapshot is created.
> 
> When the writing ends, we have the VM snapshot (VM state) file with the
> memory content saved as of the moment of the disk snapshot's creation.
> 
> So, to restore the VM from "the snapshot" we need to use the original disk
> image (not the disk snapshot) and the VM snapshot (VM state with saved
> memory) file.

My bad for not having noticed the implication of vm_stop() as the
first step.  Your explanation is clear.  Thank you!

> 
> > 
> > For example, in current save_snapshot() we'll quiesce disk IOs before
> > migrating the last pieces of RAM data to make sure they are aligned.
> > I didn't figure out myself on how that's done in this work.
> > 
> > >     The migration will resume the vm execution by itself
> > >     when it has the devices' states saved  and is ready to start ram writing
> > >     to the migration stream.
> > > 5. Listen to the migration finish event
> > > 
> > > The feature relies on KVM unapplied ability to report the faulting address.
> > > Please find the KVM patch snippet to make the patchset work below:
> > > 
> > > +++ b/arch/x86/kvm/vmx.c
> > > @@ -XXXX,X +XXXX,XX @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> > >          vcpu->arch.exit_qualification = exit_qualification;
> > > -       return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> > > +       r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> > > +        if (r == -EFAULT) {
> > > +               unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gpa >> PAGE_SHIFT);
> > > +
> > > +               vcpu->run->exit_reason = KVM_EXIT_FAIL_MEM_ACCESS;
> > > +               vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_VIOLATION;
> > > +               vcpu->run->fail_mem_access.hva = hva | (gpa & (PAGE_SIZE-1));
> > > +               r = 0;
> > > +
> > > +       }
> > > +       return r;
> > 
> > Just to make sure I fully understand here: so this is some extra KVM
> > work just to make sure the mprotect() trick will work even for KVM
> > vcpu threads, am I right?
> 
> That's correct!
> > 
> > Meanwhile, I see that you only modified EPT violation code, then how
> > about the legacy hardwares and softmmu case?
> 
> Didn't check thoroughly but the scheme works in TCG mode.

Yeah I guess TCG will work since the SIGSEGV handler will work with
that.  I meant the shadow MMU implementation in KVM when
kvm_intel.ept=0 is set on the host.  But of course that's not a big
deal for now since that can be discussed in the kvm counterpart of the
work.  Meanwhile, considering that this series seems to provide a
general framework for live snapshot, this work is meaningful no matter
what backend magic is used (either mprotect, or userfaultfd in the
future).

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH v0 1/7] migration: add background snapshot capability
  2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 1/7] migration: add background snapshot capability Denis Plotnikov
  2018-06-29 16:02   ` Eric Blake
@ 2018-07-12  9:03   ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 31+ messages in thread
From: Dr. David Alan Gilbert @ 2018-07-12  9:03 UTC (permalink / raw)
  To: Denis Plotnikov; +Cc: quintela, pbonzini, qemu-devel

* Denis Plotnikov (dplotnikov@virtuozzo.com) wrote:
> The capability is used for the background vmstate saving
> using the migration infrastructure.
> Background vmstate saving means that the majority of the vmstate
> (RAM) is saved in the background while the VM's vCPUs are running.
> This helps to reduce the VM downtime on VM snapshotting.
> 
> Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>

(As normal the 'since' will need bumping)

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  migration/migration.c | 35 +++++++++++++++++++++++++++++++++++
>  migration/migration.h |  1 +
>  qapi/migration.json   |  6 +++++-
>  3 files changed, 41 insertions(+), 1 deletion(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index d780601f0c..87096d23ef 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -697,6 +697,12 @@ static bool migrate_caps_check(bool *cap_list,
>              return false;
>          }
>  
> +        if (cap_list[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT]) {
> +            error_setg(errp, "Postcopy is not compatible "
> +                        "with background snapshot");
> +            return false;
> +        }
> +
>          /* This check is reasonably expensive, so only when it's being
>           * set the first time, also it's only the destination that needs
>           * special support.
> @@ -711,6 +717,26 @@ static bool migrate_caps_check(bool *cap_list,
>          }
>      }
>  
> +    if (cap_list[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT]) {
> +        if (cap_list[MIGRATION_CAPABILITY_RELEASE_RAM]) {
> +            error_setg(errp, "Background snapshot is not compatible "
> +                        "with release ram capability");
> +            return false;
> +        }
> +
> +        if (cap_list[MIGRATION_CAPABILITY_COMPRESS]) {
> +            error_setg(errp, "Background snapshot is not "
> +                        "currently compatible with compression");
> +            return false;
> +        }
> +
> +        if (cap_list[MIGRATION_CAPABILITY_XBZRLE]) {
> +            error_setg(errp, "Background snapshot is not "
> +                        "currently compatible with XBZRLE");
> +            return false;
> +        }
> +    }
> +
>      return true;
>  }
>  
> @@ -1635,6 +1661,15 @@ bool migrate_use_block_incremental(void)
>      return s->parameters.block_incremental;
>  }
>  
> +bool migrate_background_snapshot(void)
> +{
> +    MigrationState *s;
> +
> +    s = migrate_get_current();
> +
> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT];
> +}
> +
>  /* migration thread support */
>  /*
>   * Something bad happened to the RP stream, mark an error
> diff --git a/migration/migration.h b/migration/migration.h
> index 663415fe48..21babbc008 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -201,6 +201,7 @@ int migrate_compress_level(void);
>  int migrate_compress_threads(void);
>  int migrate_decompress_threads(void);
>  bool migrate_use_events(void);
> +bool migrate_background_snapshot(void);
>  
>  /* Sending on the return path - generic and then for each message type */
>  void migrate_send_rp_shut(MigrationIncomingState *mis,
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 03f57c9616..eb5f52b7ff 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -352,12 +352,16 @@
>  #
>  # @x-multifd: Use more than one fd for migration (since 2.11)
>  #
> +# @background-snapshot: Use the migration infrastructure to take a VM snapshot,
> +#         saving its RAM in the background. This reduces VM downtime. (since 2.12)
> +#
>  # Since: 1.2
>  ##
>  { 'enum': 'MigrationCapability',
>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
>             'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
> -           'block', 'return-path', 'pause-before-switchover', 'x-multifd' ] }
> +           'block', 'return-path', 'pause-before-switchover', 'x-multifd',
> +           'background-snapshot' ] }
>  
>  ##
>  # @MigrationCapabilityStatus:
> -- 
> 2.17.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH v0 2/7] bitops: add some atomic versions of bitmap operations
  2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 2/7] bitops: add some atomic versions of bitmap operations Denis Plotnikov
@ 2018-07-12  9:21   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 31+ messages in thread
From: Dr. David Alan Gilbert @ 2018-07-12  9:21 UTC (permalink / raw)
  To: Denis Plotnikov; +Cc: quintela, pbonzini, qemu-devel

* Denis Plotnikov (dplotnikov@virtuozzo.com) wrote:
> 1. test bit
> 2. test and set bit
> 
> Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>
> ---
>  include/qemu/bitops.h | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
> 
> diff --git a/include/qemu/bitops.h b/include/qemu/bitops.h
> index 3f0926cf40..7b1c8c7baf 100644
> --- a/include/qemu/bitops.h
> +++ b/include/qemu/bitops.h
> @@ -94,6 +94,21 @@ static inline int test_and_set_bit(long nr, unsigned long *addr)
>      return (old & mask) != 0;
>  }
>  
> +/**
> + * test_and_set_bit_atomic - Set a bit atomically and return its old value
> + * @nr: Bit to set
> + * @addr: Address to count from
> + */
> +static inline int test_and_set_bit_atomic(long nr, unsigned long *addr)
> +{
> +    unsigned long mask = BIT_MASK(nr);
> +    unsigned long *p = addr + BIT_WORD(nr);
> +    unsigned long old;
> +
> +    old = atomic_fetch_or(p, mask);
> +    return (old & mask) != 0;
> +}
> +
>  /**
>   * test_and_clear_bit - Clear a bit and return its old value
>   * @nr: Bit to clear
> @@ -134,6 +149,15 @@ static inline int test_bit(long nr, const unsigned long *addr)
>      return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG-1)));
>  }
>  
> +/**
> + * test_bit_atomic - Determine whether a bit is set atomically
> + * @nr: bit number to test
> + * @addr: Address to start counting from
> + */
> +static inline int test_bit_atomic(long nr, const unsigned long *addr)
> +{
> +    return 1UL & (atomic_read(&addr[BIT_WORD(nr)]) >> (nr & (BITS_PER_LONG-1)));

Note that this fails the style check because it's expecting spaces around
the '-' (which would then take you over 80 chars); your patch is
consistent with the other places in the file though.
It's probably best if you make the style checker happy.

So,

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> +}
>  /**
>   * find_last_bit - find the last set bit in a memory region
>   * @addr: The address to start the search at
> -- 
> 2.17.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH v0 3/7] threads: add infrastructure to process sigsegv
  2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 3/7] threads: add infrastructure to process sigsegv Denis Plotnikov
@ 2018-07-12  9:53   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 31+ messages in thread
From: Dr. David Alan Gilbert @ 2018-07-12  9:53 UTC (permalink / raw)
  To: Denis Plotnikov; +Cc: quintela, pbonzini, qemu-devel

* Denis Plotnikov (dplotnikov@virtuozzo.com) wrote:
> Allows a sigsegv handler to be defined temporarily for all threads.
> This is useful for implementing copy-on-write logic while
> Linux userfaultfd doesn't support write-protect faults.
> In the future, switch to using WP userfaultfd when it's
> available.
> 
> It's going to be used for background snapshotting.

I'll leave the details of signal handling to someone else
(does anyone know how this would interact with qemu-user, or the
signalfds in util/main-loop.c?)

But also, I'd still like to understand how this works when the
kernel makes guest accesses for things like vhost.

> Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>
> ---
>  include/qemu/thread.h    |  5 ++++
>  util/qemu-thread-posix.c | 50 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 55 insertions(+)
> 
> diff --git a/include/qemu/thread.h b/include/qemu/thread.h
> index 9910f49b3a..886985d289 100644
> --- a/include/qemu/thread.h
> +++ b/include/qemu/thread.h
> @@ -210,4 +210,9 @@ void qemu_lockcnt_inc_and_unlock(QemuLockCnt *lockcnt);
>   */
>  unsigned qemu_lockcnt_count(QemuLockCnt *lockcnt);
>  
> +
> +typedef void (*sigsegv_handler)(int v0, siginfo_t *v1, void *v2);
> +void sigsegv_user_handler_set(sigsegv_handler handler);
> +void sigsegv_user_handler_reset(void);
> +
>  #endif
> diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
> index 7306475899..e51abc9275 100644
> --- a/util/qemu-thread-posix.c
> +++ b/util/qemu-thread-posix.c
> @@ -489,6 +489,45 @@ static void qemu_thread_set_name(QemuThread *thread, const char *name)
>  #endif
>  }
>  
> +static sigsegv_handler sigsegv_user_handler;
> +
> +void sigsegv_user_handler_set(sigsegv_handler handler)
> +{
> +    assert(handler);
> +    atomic_set(&sigsegv_user_handler, handler);
> +}
> +
> +static sigsegv_handler sigsegv_user_handler_get(void)
> +{
> +    return atomic_read(&sigsegv_user_handler);
> +}
> +
> +void sigsegv_user_handler_reset(void)
> +{
> +    atomic_set(&sigsegv_user_handler, NULL);
> +}
> +
> +static void sigsegv_default_handler(int v0, siginfo_t *v1, void *v2)

v0/v1/v2 aren't great names for the parameters.

> +{
> +    sigsegv_handler handler = sigsegv_user_handler_get();
> +
> +    if (!handler) {
> +        // remove the sigsegv handler if it's not set by user
> +        // this will lead to re-raising the error without a handler
> +        // and exiting from the program with "Sigmentation fault"

Style guide doesn't allow C99 comments.
(And typo: Sig->Seg)

Dave

> +        int err;
> +        struct sigaction act;
> +        memset(&act, 0, sizeof(act));
> +        act.sa_flags = SA_RESETHAND;
> +        err = sigaction(SIGSEGV, &act, NULL);
> +        if (err) {
> +            error_exit(err, __func__);
> +        }
> +    } else {
> +        handler(v0, v1, v2);
> +    }
> +}
> +
>  void qemu_thread_create(QemuThread *thread, const char *name,
>                         void *(*start_routine)(void*),
>                         void *arg, int mode)
> @@ -496,14 +535,25 @@ void qemu_thread_create(QemuThread *thread, const char *name,
>      sigset_t set, oldset;
>      int err;
>      pthread_attr_t attr;
> +    struct sigaction act;
>  
>      err = pthread_attr_init(&attr);
>      if (err) {
>          error_exit(err, __func__);
>      }
>  
> +    memset(&act, 0, sizeof(act));
> +    act.sa_flags = SA_SIGINFO;
> +    act.sa_sigaction = sigsegv_default_handler;
> +    err = sigaction(SIGSEGV, &act, NULL);
> +    if (err) {
> +        error_exit(err, __func__);
> +    }
> +
>      /* Leave signal handling to the iothread.  */
>      sigfillset(&set);
> +    // ...all but SIGSEGV
> +    sigdelset(&set, SIGSEGV);
>      pthread_sigmask(SIG_SETMASK, &set, &oldset);
>      err = pthread_create(&thread->thread, &attr, start_routine, arg);
>      if (err)
> -- 
> 2.17.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH v0 4/7] migration: add background snapshot infrastructure
  2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 4/7] migration: add background snapshot infrastructure Denis Plotnikov
@ 2018-07-12 11:46   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 31+ messages in thread
From: Dr. David Alan Gilbert @ 2018-07-12 11:46 UTC (permalink / raw)
  To: Denis Plotnikov; +Cc: quintela, pbonzini, qemu-devel

* Denis Plotnikov (dplotnikov@virtuozzo.com) wrote:
> It allows the VM's RAM accesses to be intercepted and the touched
> pages to be written into the snapshot.

This is too big for me to properly review; it needs splitting
into smaller chunks.
However, there are some comments below.

> Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>
> ---
>  include/exec/ram_addr.h |   7 +
>  include/exec/ramlist.h  |   4 +-
>  migration/migration.c   |   2 +-
>  migration/ram.c         | 333 ++++++++++++++++++++++++++++++++++++++--
>  migration/ram.h         |  11 +-
>  5 files changed, 338 insertions(+), 19 deletions(-)
> 
> diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
> index 6cbc02aa0f..5b403d537d 100644
> --- a/include/exec/ram_addr.h
> +++ b/include/exec/ram_addr.h
> @@ -36,6 +36,8 @@ struct RAMBlock {
>      char idstr[256];
>      /* RCU-enabled, writes protected by the ramlist lock */
>      QLIST_ENTRY(RAMBlock) next;
> +    /* blocks used for background snapshot */
> +    QLIST_ENTRY(RAMBlock) bgs_next;
>      QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
>      int fd;
>      size_t page_size;
> @@ -49,6 +51,11 @@ struct RAMBlock {
>      unsigned long *unsentmap;
>      /* bitmap of already received pages in postcopy */
>      unsigned long *receivedmap;
> +    /* The following 2 are for background snapshot */
> +    /* Pages currently being copied */
> +    unsigned long *touched_map;
> +    /* Pages has been copied already */
> +    unsigned long *copied_map;

Does this need to touch exec/ram_addr.h or can we keep this
all private to the migration code?

>  };
>  
>  static inline bool offset_in_ramblock(RAMBlock *b, ram_addr_t offset)
> diff --git a/include/exec/ramlist.h b/include/exec/ramlist.h
> index 2e2ac6cb99..e0231d3bec 100644
> --- a/include/exec/ramlist.h
> +++ b/include/exec/ramlist.h
> @@ -44,11 +44,13 @@ typedef struct {
>      unsigned long *blocks[];
>  } DirtyMemoryBlocks;
>  
> +typedef QLIST_HEAD(, RAMBlock) RamBlockList;
> +
>  typedef struct RAMList {
>      QemuMutex mutex;
>      RAMBlock *mru_block;
>      /* RCU-enabled, writes protected by the ramlist lock. */
> -    QLIST_HEAD(, RAMBlock) blocks;
> +    RamBlockList blocks;
>      DirtyMemoryBlocks *dirty_memory[DIRTY_MEMORY_NUM];
>      uint32_t version;
>      QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
> diff --git a/migration/migration.c b/migration/migration.c
> index 87096d23ef..131d0904e4 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1716,7 +1716,7 @@ static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname,
>          return;
>      }
>  
> -    if (ram_save_queue_pages(rbname, start, len)) {
> +    if (ram_save_queue_pages(NULL, rbname, start, len, NULL)) {
>          mark_source_rp_bad(ms);
>      }
>  }
> diff --git a/migration/ram.c b/migration/ram.c
> index 021d583b9b..286b79ad51 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -188,10 +188,21 @@ struct RAMSrcPageRequest {
>      RAMBlock *rb;
>      hwaddr    offset;
>      hwaddr    len;
> +    void*     page_copy;
>  
>      QSIMPLEQ_ENTRY(RAMSrcPageRequest) next_req;
>  };
>  
> +/* Page buffer used for background snapshot */
> +typedef struct RAMPageBuffer {
> +    /* Page buffer capacity in host pages */
> +    int capacity;
> +    /* Current number of pages in the buffer */
> +    int used;
> +    /* Event to notify that buffer usage is under capacity */
> +    QemuEvent used_decreased;
> +} RAMPageBuffer;
> +
>  /* State of RAM for migration */
>  struct RAMState {
>      /* QEMUFile used for this migration */
> @@ -230,6 +241,11 @@ struct RAMState {
>      /* Queue of outstanding page requests from the destination */
>      QemuMutex src_page_req_mutex;
>      QSIMPLEQ_HEAD(src_page_requests, RAMSrcPageRequest) src_page_requests;
> +    /* The following 2 are for background snapshot */
> +    /* Buffer data to store copies of ram pages while async vm saving */
> +    RAMPageBuffer page_buffer;
> +    /* Event to notify that a page coping just has finished*/
> +    QemuEvent page_coping_done;

typo: coping->copying (you've got that in a few places)

>  };
>  typedef struct RAMState RAMState;
>  
> @@ -250,6 +266,8 @@ struct PageSearchStatus {
>      unsigned long page;
>      /* Set once we wrap around */
>      bool         complete_round;
> +    /* Pointer to the cached page */
> +    void* page_copy;
>  };
>  typedef struct PageSearchStatus PageSearchStatus;
>  
> @@ -958,7 +976,11 @@ static int ram_save_page(RAMState *rs, PageSearchStatus *pss, bool last_stage)
>      RAMBlock *block = pss->block;
>      ram_addr_t offset = pss->page << TARGET_PAGE_BITS;
>  
> -    p = block->host + offset;
> +    if (pss->page_copy) {
> +        p = pss->page_copy;
> +    } else {
> +        p = block->host + offset;
> +    }
>      trace_ram_save_page(block->idstr, (uint64_t)offset, p);
>  
>      /* In doubt sent page as normal */
> @@ -989,9 +1011,12 @@ static int ram_save_page(RAMState *rs, PageSearchStatus *pss, bool last_stage)
>               * page would be stale
>               */
>              xbzrle_cache_zero_page(rs, current_addr);
> -            ram_release_pages(block->idstr, offset, pages);
> +            if (pss->page_copy) {
> +                qemu_madvise(p, TARGET_PAGE_SIZE, MADV_DONTNEED);
> +            }

Is it easier/better just to change the logic in ram_release_pages?
Note you're using TARGET_PAGE_SIZE here - that's often smaller than the
host page size; that's probably a problem on non-x86, since you can't
free at that granularity.

>          } else if (!rs->ram_bulk_stage &&
> -                   !migration_in_postcopy() && migrate_use_xbzrle()) {
> +                   !migration_in_postcopy() && migrate_use_xbzrle() &&
> +                   !migrate_background_snapshot()) {
>              pages = save_xbzrle_page(rs, &p, current_addr, block,
>                                       offset, last_stage);
>              if (!last_stage) {
> @@ -1008,9 +1033,10 @@ static int ram_save_page(RAMState *rs, PageSearchStatus *pss, bool last_stage)
>          ram_counters.transferred +=
>              save_page_header(rs, rs->f, block, offset | RAM_SAVE_FLAG_PAGE);
>          if (send_async) {
> -            qemu_put_buffer_async(rs->f, p, TARGET_PAGE_SIZE,
> -                                  migrate_release_ram() &
> -                                  migration_in_postcopy());
> +            bool may_free = migrate_background_snapshot() ||
> +                            (migrate_release_ram() &&
> +                             migration_in_postcopy());
> +            qemu_put_buffer_async(rs->f, p, TARGET_PAGE_SIZE, may_free);
>          } else {
>              qemu_put_buffer(rs->f, p, TARGET_PAGE_SIZE);
>          }
> @@ -1251,7 +1277,7 @@ static bool find_dirty_block(RAMState *rs, PageSearchStatus *pss, bool *again)
>   * @rs: current RAM state
>   * @offset: used to return the offset within the RAMBlock
>   */
> -static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset)
> +static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset, void **page_copy)
>  {
>      RAMBlock *block = NULL;
>  
> @@ -1261,10 +1287,14 @@ static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset)
>                                  QSIMPLEQ_FIRST(&rs->src_page_requests);
>          block = entry->rb;
>          *offset = entry->offset;
> +        *page_copy = entry->page_copy;
>  
>          if (entry->len > TARGET_PAGE_SIZE) {
>              entry->len -= TARGET_PAGE_SIZE;
>              entry->offset += TARGET_PAGE_SIZE;
> +            if (entry->page_copy) {
> +                entry->page_copy += TARGET_PAGE_SIZE/sizeof(void*);

I don't think I understand why it's doing that.

> +            }
>          } else {
>              memory_region_unref(block->mr);
>              QSIMPLEQ_REMOVE_HEAD(&rs->src_page_requests, next_req);
> @@ -1291,9 +1321,10 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
>      RAMBlock  *block;
>      ram_addr_t offset;
>      bool dirty;
> +    void *page_copy;
>  
>      do {
> -        block = unqueue_page(rs, &offset);
> +        block = unqueue_page(rs, &offset, &page_copy);
>          /*
>           * We're sending this page, and since it's postcopy nothing else
>           * will dirty it, and we must make sure it doesn't get sent again
> @@ -1331,6 +1362,7 @@ static bool get_queued_page(RAMState *rs, PageSearchStatus *pss)
>           */
>          pss->block = block;
>          pss->page = offset >> TARGET_PAGE_BITS;
> +        pss->page_copy = page_copy;
>      }
>  
>      return !!block;
> @@ -1368,17 +1400,25 @@ static void migration_page_queue_free(RAMState *rs)
>   *
>   * @rbname: Name of the RAMBLock of the request. NULL means the
>   *          same that last one.
> + * @block: RAMBlock to use. block and rbname have mutually exclusive
> + *         semantics, with the block taking priority.
>   * @start: starting address from the start of the RAMBlock
>   * @len: length (in bytes) to send
> + * @page_copy: the address the page should be written from
>   */
> -int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len)
> +int ram_save_queue_pages(RAMBlock *block, const char *rbname,
> +                         ram_addr_t start, ram_addr_t len, void* page_copy)

This is just used for postcopy at the moment - is it relevant to you?

>  {
>      RAMBlock *ramblock;
>      RAMState *rs = ram_state;
>  
>      ram_counters.postcopy_requests++;
> +
>      rcu_read_lock();
> -    if (!rbname) {
> +
> +    if (block) {
> +        ramblock = block;
> +    } else if (!rbname) {
>          /* Reuse last RAMBlock */
>          ramblock = rs->last_req_rb;
>  
> @@ -1413,6 +1453,7 @@ int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len)
>      new_entry->rb = ramblock;
>      new_entry->offset = start;
>      new_entry->len = len;
> +    new_entry->page_copy = page_copy;
>  
>      memory_region_ref(ramblock->mr);
>      qemu_mutex_lock(&rs->src_page_req_mutex);
> @@ -1450,7 +1491,8 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss,
>           * xbzrle can do better than compression.
>           */
>          if (migrate_use_compression() &&
> -            (rs->ram_bulk_stage || !migrate_use_xbzrle())) {
> +            (rs->ram_bulk_stage || !migrate_use_xbzrle()) &&
> +            !migrate_background_snapshot()) {
>              res = ram_save_compressed_page(rs, pss, last_stage);
>          } else {
>              res = ram_save_page(rs, pss, last_stage);
> @@ -1508,6 +1550,226 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
>      return pages;
>  }
>  
> +static bool ram_has_postcopy(void *opaque)
> +{
> +    return migrate_postcopy_ram();
> +}
> +
> +static int mem_protect(void *addr, uint64_t length, int prot)
> +{
> +    int ret = mprotect(addr, length, prot);
> +
> +    if (ret < 0) {
> +        error_report("%s: Can't change protection on ram block at %p (len: %lu)",
> +                     __func__, addr, length);
> +    }
> +
> +    // 0 on success
> +    return ret;
> +}
> +
> +static int ram_set_ro(void* addr, uint64_t length)
> +{
> +    return mem_protect(addr, length, PROT_READ);
> +}
> +
> +static int ram_set_rw(void* addr, uint64_t length)
> +{
> +    return mem_protect(addr, length, PROT_READ | PROT_WRITE);
> +}

We need to keep these together so we know which bits to change for
userfault-wp

> +static RamBlockList ram_blocks;
> +
> +RamBlockList *ram_blocks_get(void)
> +{
> +    return &ram_blocks;
> +}
> +
> +void ram_blocks_fill(RamBlockList *blocks)
> +{
> +    RAMBlock *block = NULL;
> +
> +    qemu_mutex_lock_ramlist();
> +    QLIST_FOREACH(block, &ram_list.blocks, next) {

Look at how we use FOREACH_MIGRATABLE now

> +        memory_region_ref(block->mr);
> +        QLIST_INSERT_HEAD(blocks, block, bgs_next);
> +    }
> +    qemu_mutex_unlock_ramlist();
> +}
> +
> +void ram_blocks_clear(RamBlockList *blocks)
> +{
> +    RAMBlock *block = NULL;
> +
> +    QLIST_FOREACH(block, blocks, bgs_next) {
> +        QLIST_REMOVE(block, bgs_next);
> +        memory_region_unref(block->mr);
> +    }
> +}
> +
> +int ram_blocks_set_ro(RamBlockList *blocks)
> +{
> +    RAMBlock *block = NULL;
> +    int ret = 0;
> +
> +    QLIST_FOREACH(block, blocks, bgs_next) {
> +        ret = ram_set_ro(block->host, block->used_length);
> +        if (ret) {
> +            break;
> +        }
> +    }
> +
> +    return ret;
> +}
> +
> +int ram_blocks_set_rw(RamBlockList *blocks)
> +{
> +    RAMBlock *block = NULL;
> +    int ret = 0;
> +
> +    QLIST_FOREACH(block, blocks, bgs_next) {
> +        ret = ram_set_rw(block->host, block->used_length);
> +        if (ret) {
> +            break;
> +        }
> +    }
> +
> +    return ret;
> +}
> +
> +static void ram_page_buffer_decrease_used(void)
> +{
> +    qemu_event_reset(&ram_state->page_buffer.used_decreased);
> +    atomic_dec(&ram_state->page_buffer.used);
> +    qemu_event_set(&ram_state->page_buffer.used_decreased);
> +}
> +
> +static void ram_page_buffer_increase_used_wait(void)
> +{
> +    int ret, used, *used_ptr;
> +    RAMState *rs = ram_state;
> +    used_ptr = &rs->page_buffer.used;
> +    do {
> +        used = atomic_read(used_ptr);
> +        if (rs->page_buffer.capacity > used) {
> +            if ((ret = atomic_cmpxchg(used_ptr, used, used + 1)) == used) {
> +                return;
> +            } else {
> +                continue;
> +            }
> +        } else {
> +            qemu_event_wait(&rs->page_buffer.used_decreased);
> +        }
> +    } while(true);
> +}
> +
> +static void *ram_page_buffer_get(void)
> +{
> +    void *page;
> +    ram_page_buffer_increase_used_wait();
> +    page = mmap(0, TARGET_PAGE_SIZE, PROT_READ|PROT_WRITE,
> +                    MAP_PRIVATE|MAP_ANONYMOUS,
> +                    -1, 0);

Again TARGET_PAGE_SIZE is below host page size often.

> +   if (page == MAP_FAILED) {
> +       ram_page_buffer_decrease_used();
> +       page = NULL;
> +   }
> +   return page;
> +}
> +
> +static int ram_page_buffer_free(void *buffer)
> +{
> +    ram_page_buffer_decrease_used();
> +    return qemu_madvise(buffer, TARGET_PAGE_SIZE, MADV_DONTNEED);
> +}
> +
> +static int ram_try_copy_page(RAMBlock *block, unsigned long page_nr,
> +                             void** page_copy)
> +{
> +    void *host_page;

We could do with some more comments here;  is this called when you
notice a page is modified and need to start the copy?

> +    if (test_and_set_bit_atomic(page_nr, block->touched_map)) {

So you're marking it as written to, but it's already marked as
written to? (By another vCPU?)

> +        while (!test_bit_atomic(page_nr, block->copied_map)) {
> +            // the page is being copied -- wait for the end of the coping
> +            // and check once again
> +            qemu_event_reset(&ram_state->page_coping_done);
> +            qemu_event_wait(&ram_state->page_coping_done);
> +        }
> +        return 0;
> +    }
> +
> +    *page_copy = ram_page_buffer_get();
> +    if (!*page_copy) {
> +        return -1;
> +    }
> +
> +    host_page = block->host + (page_nr << TARGET_PAGE_BITS);
> +    memcpy(*page_copy, host_page, TARGET_PAGE_SIZE);
> +
> +    if (ram_set_rw(host_page, TARGET_PAGE_SIZE)) {
> +        ram_page_buffer_free(*page_copy);
> +        *page_copy = NULL;

Print some error messages - you've got quite a few places where you
fail with -1 but don't print anything to help diagnose what happened.

> +        return -1;
> +    }
> +
> +    smp_mb();
> +    set_bit_atomic(page_nr, block->copied_map);
> +    qemu_event_set(&ram_state->page_coping_done);
> +
> +    return 1;
> +}
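
The touched_map/copied_map protocol above amounts to a copy-once gate: the first thread to win the touched bit performs the copy, everyone else waits for the copied bit to be published. A stand-alone sketch with C11 atomics (toy names and sizes; a spin loop stands in for the qemu_event_reset()/qemu_event_wait() pair):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

enum { PAGE = 64 };                  /* toy "page" size for the sketch */
static char guest_page[PAGE];        /* stands in for block->host page */
static char page_copy[PAGE];         /* stands in for the buffer page  */
static atomic_bool touched, copied;  /* one bit of each bitmap         */
static int copies_made;

/* Returns 1 if this caller made the copy, 0 if it was already copied. */
static int try_copy_page(void)
{
    if (atomic_exchange(&touched, true)) {
        while (!atomic_load(&copied)) {
            /* copy in progress on another thread */
        }
        return 0;
    }
    memcpy(page_copy, guest_page, PAGE);
    copies_made++;
    atomic_store(&copied, true);     /* publish only after the memcpy */
    return 1;
}
```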
> +
> +static RAMBlock *find_ram_block(uint8_t *address, ram_addr_t *page_offset)
> +{
> +    RAMBlock *block = NULL;
> +
> +
> +    QLIST_FOREACH(block, ram_blocks_get(), bgs_next) {
> +        /* This case happens when the block is not mapped. */
> +        if (block->host == NULL) {
> +            continue;
> +        }
> +
> +        if (address - block->host < block->max_length) {
> +            *page_offset = (address - block->host) & TARGET_PAGE_MASK;
> +            return block;
> +        }
> +    }
> +
> +    return NULL;
> +}

Isn't this the same as qemu_ram_block_from_host?

> +
> +// returns 0 on success, < 0 on error
> +int ram_process_page_fault(void *address)
> +{
> +    int ret;
> +    void *page_copy = NULL;
> +    unsigned long page_nr;
> +    ram_addr_t offset;
> +
> +    RAMBlock *block = find_ram_block(address, &offset);
> +
> +    if (!block) {
> +        return -1;
> +    }
> +
> +    page_nr = offset >> TARGET_PAGE_BITS;
> +
> +    ret = ram_try_copy_page(block, page_nr, &page_copy);
> +
> +    if (ret < 0) {
> +        return ret;
> +    } else if (ret > 0) {
> +        if (ram_save_queue_pages(block, NULL, offset,
> +                                 TARGET_PAGE_SIZE, page_copy)) {

Ah, so I guess you're using the queue for something other than postcopy
now - be careful, it's probably got other postcopy dependencies?

> +            ram_page_buffer_free(page_copy);
> +            return -1;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
>  /**
>   * ram_find_and_save_block: finds a dirty page and sends it to f
>   *
> @@ -1536,6 +1798,7 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
>      pss.block = rs->last_seen_block;
>      pss.page = rs->last_page;
>      pss.complete_round = false;
> +    pss.page_copy = NULL;
>  
>      if (!pss.block) {
>          pss.block = QLIST_FIRST_RCU(&ram_list.blocks);
> @@ -1548,11 +1811,27 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
>          if (!found) {
>              /* priority queue empty, so just search for something dirty */
>              found = find_dirty_block(rs, &pss, &again);
> +
> +            if (found && migrate_background_snapshot()) {
> +                // make a copy of the page and pass it to the page search status
> +                int ret;
> +                ret = ram_try_copy_page(pss.block, pss.page, &pss.page_copy);
> +                if (ret == 0) {
> +                    found = false;
> +                    pages = 0;
> +                } else if (ret < 0) {
> +                    return ret;
> +                }
> +            }
>          }
>  
>          if (found) {
>              pages = ram_save_host_page(rs, &pss, last_stage);
>          }
> +
> +        if (pss.page_copy) {
> +            ram_page_buffer_decrease_used();
> +        }
>      } while (!pages && again);
>  
>      rs->last_seen_block = pss.block;
> @@ -1600,9 +1879,15 @@ static void xbzrle_load_cleanup(void)
>  
>  static void ram_state_cleanup(RAMState **rsp)
>  {
> +    if (migrate_background_snapshot()) {
> +        qemu_event_destroy(&(*rsp)->page_buffer.used_decreased);
> +        qemu_event_destroy(&(*rsp)->page_coping_done);
> +    }
> +
>      migration_page_queue_free(*rsp);
>      qemu_mutex_destroy(&(*rsp)->bitmap_mutex);
>      qemu_mutex_destroy(&(*rsp)->src_page_req_mutex);
> +
>      g_free(*rsp);
>      *rsp = NULL;
>  }
> @@ -1638,6 +1923,13 @@ static void ram_save_cleanup(void *opaque)
>          block->bmap = NULL;
>          g_free(block->unsentmap);
>          block->unsentmap = NULL;
> +
> +        if (migrate_background_snapshot()) {
> +            g_free(block->touched_map);
> +            block->touched_map = NULL;
> +            g_free(block->copied_map);
> +            block->copied_map = NULL;
> +        }
>      }
>  
>      xbzrle_cleanup();
> @@ -1652,6 +1944,9 @@ static void ram_state_reset(RAMState *rs)
>      rs->last_page = 0;
>      rs->last_version = ram_list.version;
>      rs->ram_bulk_stage = true;
> +
> +    rs->page_buffer.capacity = 1000; // in number of pages
> +    rs->page_buffer.used = 0;
>  }
>  
>  #define MAX_WAIT 50 /* ms, half buffered_file limit */
> @@ -2129,6 +2424,11 @@ static int ram_state_init(RAMState **rsp)
>       */
>      (*rsp)->migration_dirty_pages = ram_bytes_total() >> TARGET_PAGE_BITS;
>  
> +    if (migrate_background_snapshot()) {
> +        qemu_event_init(&ram_state->page_buffer.used_decreased, false);
> +        qemu_event_init(&ram_state->page_coping_done, false);
> +    }
> +
>      ram_state_reset(*rsp);
>  
>      return 0;
> @@ -2145,10 +2445,16 @@ static void ram_list_init_bitmaps(void)
>              pages = block->max_length >> TARGET_PAGE_BITS;
>              block->bmap = bitmap_new(pages);
>              bitmap_set(block->bmap, 0, pages);
> +
>              if (migrate_postcopy_ram()) {
>                  block->unsentmap = bitmap_new(pages);
>                  bitmap_set(block->unsentmap, 0, pages);
>              }
> +
> +            if (migrate_background_snapshot()) {
> +                block->touched_map = bitmap_new(pages);
> +                block->copied_map = bitmap_new(pages);
> +            }
>          }
>      }
>  }
> @@ -2974,11 +3280,6 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>      return ret;
>  }
>  
> -static bool ram_has_postcopy(void *opaque)
> -{
> -    return migrate_postcopy_ram();
> -}
> -
>  static SaveVMHandlers savevm_ram_handlers = {
>      .save_setup = ram_save_setup,
>      .save_live_iterate = ram_save_iterate,
> diff --git a/migration/ram.h b/migration/ram.h
> index 64d81e9f1d..627c2efb51 100644
> --- a/migration/ram.h
> +++ b/migration/ram.h
> @@ -31,6 +31,7 @@
>  
>  #include "qemu-common.h"
>  #include "exec/cpu-common.h"
> +#include "exec/ramlist.h"
>  
>  extern MigrationStats ram_counters;
>  extern XBZRLECacheStats xbzrle_counters;
> @@ -45,7 +46,9 @@ int multifd_load_setup(void);
>  int multifd_load_cleanup(Error **errp);
>  
>  uint64_t ram_pagesize_summary(void);
> -int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len);
> +int ram_save_queue_pages(RAMBlock *block, const char *rbname,
> +                         ram_addr_t start, ram_addr_t len,
> +                         void* cached_page);
>  void acct_update_position(QEMUFile *f, size_t size, bool zero);
>  void ram_debug_dump_bitmap(unsigned long *todump, bool expected,
>                             unsigned long pages);
> @@ -61,5 +64,11 @@ void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
>  int ramblock_recv_bitmap_test(RAMBlock *rb, void *host_addr);
>  void ramblock_recv_bitmap_set(RAMBlock *rb, void *host_addr);
>  void ramblock_recv_bitmap_set_range(RAMBlock *rb, void *host_addr, size_t nr);
> +int ram_process_page_fault(void *address);
> +RamBlockList *ram_blocks_get(void);
> +void ram_blocks_fill(RamBlockList *blocks);
> +void ram_blocks_clear(RamBlockList *blocks);
> +int ram_blocks_set_ro(RamBlockList *blocks);
> +int ram_blocks_set_rw(RamBlockList *blocks);
>  
>  #endif
> -- 
> 2.17.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] [PATCH v0 7/7] migration: add background snapshotting
  2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 7/7] migration: add background snapshotting Denis Plotnikov
@ 2018-07-12 18:59   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 31+ messages in thread
From: Dr. David Alan Gilbert @ 2018-07-12 18:59 UTC (permalink / raw)
  To: Denis Plotnikov; +Cc: quintela, pbonzini, qemu-devel

* Denis Plotnikov (dplotnikov@virtuozzo.com) wrote:
> The patch enables saving the vmstate in a background migration
> thread: ram is saved while the vCPUs are running. This is done to
> reduce the downtime of vm snapshotting: the majority of the vmstate is
> ram; the rest of the devices consume only a few MB of memory on a
> typical vm.
> Until now there was no way to make a snapshot without stopping the vm
> completely. Now one can save a vm state with the vm stopped only for a
> couple of seconds to save the devices' states. Then the vm is resumed
> and ram is written out without fully stopping the vm.
> 
> To use the async vm state saving, enable the "background-snapshot"
> migration capability and start the migration.
> 
> Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>
> ---
>  migration/migration.c | 104 +++++++++++++++++++++++++++++++++++++++++-
>  migration/savevm.c    |  91 +++++++++++++++++++-----------------
>  migration/savevm.h    |   2 +
>  3 files changed, 154 insertions(+), 43 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 131d0904e4..b062bd8ddc 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2188,6 +2188,100 @@ bool migrate_colo_enabled(void)
>      return s->enabled_capabilities[MIGRATION_CAPABILITY_X_COLO];
>  }
>  
> +static void background_snapshot_sigsegv_handler(int unused0, siginfo_t *info, void *unused1)
> +{
> +    if (ram_process_page_fault(info->si_addr)) {
> +        assert(false);
> +    }
> +}
> +
> +static void *background_snapshot_thread(void *opaque)
> +{
> +    MigrationState *m = opaque;
> +    QIOChannelBuffer *bioc;
> +    QEMUFile *fb;
> +    RamBlockList *ram_blocks;
> +    int res = 0;
> +
> +    rcu_register_thread();
> +
> +    qemu_file_set_rate_limit(m->to_dst_file, INT64_MAX);
> +
> +    qemu_mutex_lock_iothread();
> +    vm_stop(RUN_STATE_PAUSED);
> +
> +    qemu_savevm_state_header(m->to_dst_file);
> +    qemu_mutex_unlock_iothread();
> +    qemu_savevm_state_setup(m->to_dst_file);
> +    qemu_mutex_lock_iothread();

That's curiously different locking from that in migration.c and
savevm.c (qemu_savevm_state)

> +    migrate_set_state(&m->state, MIGRATION_STATUS_SETUP,
> +                      MIGRATION_STATUS_ACTIVE);
> +
> +    // save the device state to the temporary buffer
> +    bioc = qio_channel_buffer_new(4096);
> +    qio_channel_set_name(QIO_CHANNEL(bioc), "vmstate-buffer");
> +    fb = qemu_fopen_channel_output(QIO_CHANNEL(bioc));
> +    object_unref(OBJECT(bioc));
> +
> +    ram_blocks = ram_blocks_get();
> +    ram_blocks_fill(ram_blocks);
> +    ram_blocks_set_ro(ram_blocks);
> +
> +    if(global_state_store()) {
> +        goto exit;
> +    }
> +
> +    if (qemu_save_all_device_states(fb, false, true) < 0) {
> +        migrate_set_state(&m->state, MIGRATION_STATUS_ACTIVE,
> +                          MIGRATION_STATUS_FAILED);
> +        goto exit;
> +    }
> +
> +    sigsegv_user_handler_set(background_snapshot_sigsegv_handler);

That's actually a good test if you're write protecting the RAM before
saving the device state; they *shouldn't* write to ram doing that
but I don't trust them.

> +    smp_mb();

Comment why that mb is there?

> +    vm_start();
> +    qemu_mutex_unlock_iothread();
> +
> +    while (!res) {
> +        res = qemu_savevm_state_iterate(m->to_dst_file, false);

I'm surprised it goes around this loop much, given that you've
set the bandwidth to max and I assume you never see dirty pages.

> +        if (res < 0 || qemu_file_get_error(m->to_dst_file)) {
> +            migrate_set_state(&m->state, MIGRATION_STATUS_ACTIVE,
> +                              MIGRATION_STATUS_FAILED);
> +            goto exit;
> +        }
> +    }
> +
> +    // At this point we have the RAM saved into the stream.
> +    // The next step is to flush the device state right after the
> +    // saved RAM. The rest of the device state is stored in
> +    // the temporary buffer. Let's flush the buffer into the stream.
> +    qemu_put_buffer(m->to_dst_file, bioc->data, bioc->usage);
> +    qemu_fflush(m->to_dst_file);
> +
> +    if (qemu_file_get_error(m->to_dst_file)) {
> +        migrate_set_state(&m->state, MIGRATION_STATUS_ACTIVE,
> +                          MIGRATION_STATUS_FAILED);
> +        goto exit;
> +    }
> +
> +    migrate_set_state(&m->state, MIGRATION_STATUS_ACTIVE,
> +                                 MIGRATION_STATUS_COMPLETED);
> +exit:
> +    ram_blocks_set_rw(ram_blocks);
> +    ram_blocks_clear(ram_blocks);
> +    smp_mb();
> +    sigsegv_user_handler_reset();
> +    qemu_fclose(fb);
> +    qemu_mutex_lock_iothread();
> +    qemu_savevm_state_cleanup();
> +    qemu_mutex_unlock_iothread();
> +    rcu_unregister_thread();
> +    return NULL;
> +}
> +
>  /*
>   * Master migration thread on the source VM.
>   * It drives the migration and pumps the data down the outgoing channel.
> @@ -2402,8 +2496,14 @@ void migrate_fd_connect(MigrationState *s)
>          migrate_fd_cleanup(s);
>          return;
>      }
> -    qemu_thread_create(&s->thread, "live_migration", migration_thread, s,
> -                       QEMU_THREAD_JOINABLE);
> +    if (migrate_background_snapshot()) {
> +        qemu_thread_create(&s->thread, "bg_snapshot",
> +                           background_snapshot_thread, s,
> +                           QEMU_THREAD_JOINABLE);
> +    } else {
> +        qemu_thread_create(&s->thread, "live_migration", migration_thread, s,
> +                           QEMU_THREAD_JOINABLE);
> +    }
>      s->migration_thread_running = true;
>  }
>  
> diff --git a/migration/savevm.c b/migration/savevm.c
> index f202c3de3a..641b897ff5 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1107,51 +1107,15 @@ void qemu_savevm_state_complete_postcopy(QEMUFile *f)
>      qemu_fflush(f);
>  }
>  
> -int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only,
> -                                       bool inactivate_disks)
> +int qemu_save_all_device_states(QEMUFile *f, bool inactivate_disks,
> +                                bool send_eof)

We've got qemu_save_device_state as well; lots of variations.

Dave

>  {
> -    QJSON *vmdesc;
>      int vmdesc_len;
>      SaveStateEntry *se;
>      int ret;
> -    bool in_postcopy = migration_in_postcopy();
> -
> -    trace_savevm_state_complete_precopy();
> +    QJSON *vmdesc = qjson_new();
>  
>      cpu_synchronize_all_states();
> -
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> -        if (!se->ops ||
> -            (in_postcopy && se->ops->has_postcopy &&
> -             se->ops->has_postcopy(se->opaque)) ||
> -            (in_postcopy && !iterable_only) ||
> -            !se->ops->save_live_complete_precopy) {
> -            continue;
> -        }
> -
> -        if (se->ops && se->ops->is_active) {
> -            if (!se->ops->is_active(se->opaque)) {
> -                continue;
> -            }
> -        }
> -        trace_savevm_section_start(se->idstr, se->section_id);
> -
> -        save_section_header(f, se, QEMU_VM_SECTION_END);
> -
> -        ret = se->ops->save_live_complete_precopy(f, se->opaque);
> -        trace_savevm_section_end(se->idstr, se->section_id, ret);
> -        save_section_footer(f, se);
> -        if (ret < 0) {
> -            qemu_file_set_error(f, ret);
> -            return -1;
> -        }
> -    }
> -
> -    if (iterable_only) {
> -        return 0;
> -    }
> -
> -    vmdesc = qjson_new();
>      json_prop_int(vmdesc, "page_size", qemu_target_page_size());
>      json_start_array(vmdesc, "devices");
>      QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> @@ -1193,8 +1157,8 @@ int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only,
>              return ret;
>          }
>      }
> -    if (!in_postcopy) {
> -        /* Postcopy stream will still be going */
> +
> +    if (send_eof) {
>          qemu_put_byte(f, QEMU_VM_EOF);
>      }
>  
> @@ -1213,6 +1177,51 @@ int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only,
>      return 0;
>  }
>  
> +int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only,
> +                                       bool inactivate_disks)
> +{
> +    SaveStateEntry *se;
> +    int ret;
> +    bool in_postcopy = migration_in_postcopy();
> +
> +    trace_savevm_state_complete_precopy();
> +
> +    cpu_synchronize_all_states();
> +
> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +        if (!se->ops ||
> +            (in_postcopy && se->ops->has_postcopy &&
> +             se->ops->has_postcopy(se->opaque)) ||
> +            (in_postcopy && !iterable_only) ||
> +            !se->ops->save_live_complete_precopy) {
> +            continue;
> +        }
> +
> +        if (se->ops && se->ops->is_active) {
> +            if (!se->ops->is_active(se->opaque)) {
> +                continue;
> +            }
> +        }
> +        trace_savevm_section_start(se->idstr, se->section_id);
> +
> +        save_section_header(f, se, QEMU_VM_SECTION_END);
> +
> +        ret = se->ops->save_live_complete_precopy(f, se->opaque);
> +        trace_savevm_section_end(se->idstr, se->section_id, ret);
> +        save_section_footer(f, se);
> +        if (ret < 0) {
> +            qemu_file_set_error(f, ret);
> +            return -1;
> +        }
> +    }
> +
> +    if (iterable_only) {
> +        return 0;
> +    }
> +
> +    return qemu_save_all_device_states(f, inactivate_disks, !in_postcopy);
> +}
> +
>  /* Give an estimate of the amount left to be transferred,
>   * the result is split into the amount for units that can and
>   * for units that can't do postcopy.
> diff --git a/migration/savevm.h b/migration/savevm.h
> index 295c4a1f2c..87b278142d 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -55,4 +55,6 @@ void qemu_savevm_send_postcopy_ram_discard(QEMUFile *f, const char *name,
>  int qemu_loadvm_state(QEMUFile *f);
>  void qemu_loadvm_state_cleanup(void);
>  
> +int qemu_save_all_device_states(QEMUFile *f, bool inactivate_disks,
> +                                bool send_eof);
>  #endif
> -- 
> 2.17.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v0 0/7] Background snapshots
  2018-06-29  8:03 [Qemu-devel] [PATCH v0 0/7] Background snapshots Denis Plotnikov
                   ` (8 preceding siblings ...)
  2018-07-02 11:23 ` Peter Xu
@ 2018-07-13  5:20 ` Peter Xu
  2018-07-16 15:00   ` Denis Plotnikov
  9 siblings, 1 reply; 31+ messages in thread
From: Peter Xu @ 2018-07-13  5:20 UTC (permalink / raw)
  To: Denis Plotnikov; +Cc: dgilbert, quintela, pbonzini, qemu-devel

On Fri, Jun 29, 2018 at 11:03:13AM +0300, Denis Plotnikov wrote:
> The patch set adds the ability to make external snapshots while VM is running.
> 
> The workflow to make a snapshot is the following:
> 1. Pause the vm
> 2. Make a snapshot of block devices using the scheme of your choice
> 3. Turn on background-snapshot migration capability
> 4. Start the migration using the destination (migration stream) of your choice.
>    The migration will resume the vm execution by itself
>    when it has the devices' states saved  and is ready to start ram writing
>    to the migration stream.
> 5. Listen to the migration finish event
> 
> The feature relies on KVM unapplied ability to report the faulting address.
> Please find the KVM patch snippet to make the patchset work below:
> 
> +++ b/arch/x86/kvm/vmx.c
> @@ -XXXX,X +XXXX,XX @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>  
>         vcpu->arch.exit_qualification = exit_qualification;
>  
> -       return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> +       r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> +        if (r == -EFAULT) {
> +               unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gpa >> PAGE_SHIFT);
> +
> +               vcpu->run->exit_reason = KVM_EXIT_FAIL_MEM_ACCESS;
> +               vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_VIOLATION;
> +               vcpu->run->fail_mem_access.hva = hva | (gpa & (PAGE_SIZE-1));
> +               r = 0;
> +
> +       }
> +       return r;
> 
> The patch to KVM can be sent if the patch set approved

Hi, Denis,

If the work will definitely require KVM to cooperate, AFAIU what we
normally do is first propose the kernel counterpart on the kvm list;
then it'll be easier to review the QEMU counterpart (or propose both
kvm/qemu changes at the same time; the QEMU changes can always be an
RFC, as a reference to prove the kvm change is valid and useful).  Not
sure whether you should do this as well for this live snapshot work.

Since we might have two backends in the future, my major question for
that counterpart series would be whether we need to support both
(mprotect and userfaultfd), and what the differences between the two
methods are from the kernel's point of view.  I would vaguely guess
that we can first have mprotect working, then userfaultfd, and then
automatically choose the backend when both are provided, but I guess
that discussion might still better happen on the kvm list.  I would
also guess that in that work you'd better consider the no-ept case as
well for Intel, and even AMD.  But perhaps we can at least start an
RFC with the simplest scenario and prove its validity.

Regards,

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH v0 0/7] Background snapshots
  2018-07-13  5:20 ` Peter Xu
@ 2018-07-16 15:00   ` Denis Plotnikov
  0 siblings, 0 replies; 31+ messages in thread
From: Denis Plotnikov @ 2018-07-16 15:00 UTC (permalink / raw)
  To: Peter Xu; +Cc: dgilbert, quintela, pbonzini, qemu-devel



On 13.07.2018 08:20, Peter Xu wrote:
> On Fri, Jun 29, 2018 at 11:03:13AM +0300, Denis Plotnikov wrote:
>> The patch set adds the ability to make external snapshots while VM is running.
>>
>> The workflow to make a snapshot is the following:
>> 1. Pause the vm
>> 2. Make a snapshot of block devices using the scheme of your choice
>> 3. Turn on background-snapshot migration capability
>> 4. Start the migration using the destination (migration stream) of your choice.
>>     The migration will resume the vm execution by itself
>>     when it has the devices' states saved  and is ready to start ram writing
>>     to the migration stream.
>> 5. Listen to the migration finish event
>>
>> The feature relies on KVM unapplied ability to report the faulting address.
>> Please find the KVM patch snippet to make the patchset work below:
>>
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -XXXX,X +XXXX,XX @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>>   
>>          vcpu->arch.exit_qualification = exit_qualification;
>>   
>> -       return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
>> +       r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
>> +        if (r == -EFAULT) {
>> +               unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gpa >> PAGE_SHIFT);
>> +
>> +               vcpu->run->exit_reason = KVM_EXIT_FAIL_MEM_ACCESS;
>> +               vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_VIOLATION;
>> +               vcpu->run->fail_mem_access.hva = hva | (gpa & (PAGE_SIZE-1));
>> +               r = 0;
>> +
>> +       }
>> +       return r;
>>
>> The patch to KVM can be sent if the patch set approved
> 
> Hi, Denis,
> 
> If the work will definitely require KVM to cooperate, AFAIU the thing
> we normally do is that we first propose the kernel counterpart on kvm
> list, then it'll be easier to review the QEMU counterpart (or, propose
> both kvm/qemu changes at the same time, always the QEMU changes can be
> RFC, as a reference to prove the kvm change is valid and useful).  Not
> sure whether you should do this as well for this live snapshot work.
> 
> Since we might have two backends in the future, my major question for
> that counterpart series would be whether we need to support both in
> the future (mprotect, and userfaultfd), and the differences between
> the two methods from kernel's point of view.  I would vaguely guess
> that we can at least firstly have mprotect work then userfaultfd then
> we can automatically choose the backend when both are provided, but I
> guess that discussion might still better happen on the kvm list.  Also
> I would also guess that in that work you'd better consider no-ept case
> as well for Intel, even for AMD.  But not sure we can at least start a
> RFC with the simplest scenario and prove its validity.
> 
> Regards,
> 
Hi, Peter,
I think it is a good idea to go through the KVM path first.
When the discussion comes to some conclusion, the further steps may
become clearer.
I'll send the patch there shortly to start the discussion.

Thanks!

Best, Denis


* Re: [Qemu-devel] [PATCH v0 0/7] Background snapshots
  2018-06-29 11:53 ` [Qemu-devel] [PATCH v0 0/7] Background snapshots Dr. David Alan Gilbert
@ 2018-07-25 10:18   ` Peter Xu
  2018-07-25 19:17     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Xu @ 2018-07-25 10:18 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Denis Plotnikov, aarcange, pbonzini, qemu-devel, quintela

On Fri, Jun 29, 2018 at 12:53:59PM +0100, Dr. David Alan Gilbert wrote:
> * Denis Plotnikov (dplotnikov@virtuozzo.com) wrote:
> > The patch set adds the ability to make external snapshots while VM is running.
> 
> cc'ing in Andrea since this uses sigsegv's to avoid userfault-wp that
> isn't there yet.
> 
> Hi Denis,
>   How robust are you finding this SEGV based trick; for example what
> about things like the kernel walking vhost queues or similar kernel
> nasties?

(I'm commenting on this old series to keep the discussion together)

If we want to make this series really work for people, we probably
need to know whether it can work with vhost (otherwise we might need
to go back to userfaultfd write-protection).

I dug a bit into the vhost-net IO; it should be using two ways to
write to guest memory:

- copy_to_user(): this should possibly still be able to be captured by
  mprotect() (after some confirmation from Paolo, but still we'd
  better try it out)

- kmap_atomic(): this is used to log dirty pages; this seems to be
  incompatible with mprotect(), but I think we can just make sure dirty
  page tracking is disabled when live snapshot is enabled (please
  refer to the code clip below as a reference)

So IMHO this series should work with vhost if with logging off, but we
need to confirm...  Denis, do you want to do some more test with vhost
when you do your next post (possibly with changes like below)?

=====================

diff --git a/migration/ram.c b/migration/ram.c
index 24dea2730c..a3fa256143 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1605,6 +1605,10 @@ static void migration_bitmap_sync(RAMState *rs)
     int64_t end_time;
     uint64_t bytes_xfer_now;
 
+    if (live_snapshot) {
+        return;
+    }
+
     ram_counters.dirty_sync_count++;
 
     if (!rs->time_last_bitmap_sync) {
@@ -2952,7 +2956,10 @@ static void ram_init_bitmaps(RAMState *rs)
     rcu_read_lock();
 
     ram_list_init_bitmaps();
-    memory_global_dirty_log_start();
+
+    if (!live_snapshot) {
+        memory_global_dirty_log_start();
+    }
     migration_bitmap_sync(rs);
 
     rcu_read_unlock();


> 
> Dave
> 
> > The workflow to make a snapshot is the following:
> > 1. Pause the vm
> > 2. Make a snapshot of block devices using the scheme of your choice
> > 3. Turn on background-snapshot migration capability
> > 4. Start the migration using the destination (migration stream) of your choice.
> >    The migration will resume the vm execution by itself
> >    when it has the devices' states saved  and is ready to start ram writing
> >    to the migration stream.
> > 5. Listen to the migration finish event
> > 
> > The feature relies on KVM unapplied ability to report the faulting address.
> > Please find the KVM patch snippet to make the patchset work below:
> > 
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -XXXX,X +XXXX,XX @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> >  
> >         vcpu->arch.exit_qualification = exit_qualification;
> >  
> > -       return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> > +       r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> > +        if (r == -EFAULT) {
> > +               unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gpa >> PAGE_SHIFT);
> > +
> > +               vcpu->run->exit_reason = KVM_EXIT_FAIL_MEM_ACCESS;
> > +               vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_VIOLATION;
> > +               vcpu->run->fail_mem_access.hva = hva | (gpa & (PAGE_SIZE-1));
> > +               r = 0;
> > +
> > +       }
> > +       return r;
> > 
> > The patch to KVM can be sent if the patch set approved
> > 
> > Denis Plotnikov (7):
> >   migration: add background snapshot capability
> >   bitops: add some atomic versions of bitmap operations
> >   threads: add infrastructure to process sigsegv
> >   migration: add background snapshot infrastructure
> >   kvm: add failed memory access exit reason
> >   kvm: add vCPU failed memory access processing
> >   migration: add background snapshotting
> > 
> >  include/exec/ram_addr.h   |   7 +
> >  include/exec/ramlist.h    |   4 +-
> >  include/qemu/bitops.h     |  24 +++
> >  include/qemu/thread.h     |   5 +
> >  linux-headers/linux/kvm.h |   5 +
> >  migration/migration.c     | 141 +++++++++++++++-
> >  migration/migration.h     |   1 +
> >  migration/ram.c           | 333 ++++++++++++++++++++++++++++++++++++--
> >  migration/ram.h           |  11 +-
> >  migration/savevm.c        |  91 ++++++-----
> >  migration/savevm.h        |   2 +
> >  qapi/migration.json       |   6 +-
> >  target/i386/kvm.c         |  18 +++
> >  util/qemu-thread-posix.c  |  50 ++++++
> >  14 files changed, 635 insertions(+), 63 deletions(-)
> > 
> > -- 
> > 2.17.0
> > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 

Regards,

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH v0 0/7] Background snapshots
  2018-07-25 10:18   ` Peter Xu
@ 2018-07-25 19:17     ` Dr. David Alan Gilbert
  2018-07-25 20:04       ` Andrea Arcangeli
  0 siblings, 1 reply; 31+ messages in thread
From: Dr. David Alan Gilbert @ 2018-07-25 19:17 UTC (permalink / raw)
  To: Peter Xu; +Cc: Denis Plotnikov, aarcange, pbonzini, qemu-devel, quintela

* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Jun 29, 2018 at 12:53:59PM +0100, Dr. David Alan Gilbert wrote:
> > * Denis Plotnikov (dplotnikov@virtuozzo.com) wrote:
> > > The patch set adds the ability to make external snapshots while VM is running.
> > 
> > cc'ing in Andrea since this uses sigsegv's to avoid userfault-wp that
> > isn't there yet.
> > 
> > Hi Denis,
> >   How robust are you finding this SEGV based trick; for example what
> > about things like the kernel walking vhost queues or similar kernel
> > nasties?
> 
> (I'm commenting on this old series to keep the discussion together)
> 
> If we want to make this series really work for people, we should
> possibly need to know whether it could work with vhost (otherwise we
> might need to go back to userfaultfd write-protection).
> 
> I digged a bit on the vhost-net IO, it should be using two ways to
> write to guest memory:
> 
> - copy_to_user(): this should possibly still be able to be captured by
>   mprotect() (after some confirmation from Paolo, but still we'd
>   better try it out)

What confuses me here is who is going to get the signal from this and
how we recover from the signal - or does it come back as an error
on the vhost fd somehow?

Dave

> - kmap_atomic(): this is used to log dirty pages, this seems to be
>   incompatible with mprotect() but I think we can just make sure dirty
>   page tracking is disabled when live snapshot is enabled (please
>   refer to the code clip below as a referece)
> 
> So IMHO this series should work with vhost if with logging off, but we
> need to confirm...  Denis, do you want to do some more test with vhost
> when you do your next post (possibly with changes like below)?
> 
> =====================
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 24dea2730c..a3fa256143 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1605,6 +1605,10 @@ static void migration_bitmap_sync(RAMState *rs)
>      int64_t end_time;
>      uint64_t bytes_xfer_now;
>  
> +    if (live_snapshot) {
> +        return;
> +    }
> +
>      ram_counters.dirty_sync_count++;
>  
>      if (!rs->time_last_bitmap_sync) {
> @@ -2952,7 +2956,10 @@ static void ram_init_bitmaps(RAMState *rs)
>      rcu_read_lock();
>  
>      ram_list_init_bitmaps();
> -    memory_global_dirty_log_start();
> +
> +    if (!live_snapshot) {
> +        memory_global_dirty_log_start();
> +    }
>      migration_bitmap_sync(rs);
>  
>      rcu_read_unlock();
> 
> 
> > 
> > Dave
> > 
> > > The workflow to make a snapshot is the following:
> > > 1. Pause the vm
> > > 2. Make a snapshot of block devices using the scheme of your choice
> > > 3. Turn on background-snapshot migration capability
> > > 4. Start the migration using the destination (migration stream) of your choice.
> > >    The migration will resume the vm execution by itself
> > >    when it has the devices' states saved  and is ready to start ram writing
> > >    to the migration stream.
> > > 5. Listen to the migration finish event
> > > 
> > > The feature relies on KVM unapplied ability to report the faulting address.
> > > Please find the KVM patch snippet to make the patchset work below:
> > > 
> > > +++ b/arch/x86/kvm/vmx.c
> > > @@ -XXXX,X +XXXX,XX @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> > >  
> > >         vcpu->arch.exit_qualification = exit_qualification;
> > >  
> > > -       return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> > > +       r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> > > +        if (r == -EFAULT) {
> > > +               unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gpa >> PAGE_SHIFT);
> > > +
> > > +               vcpu->run->exit_reason = KVM_EXIT_FAIL_MEM_ACCESS;
> > > +               vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_VIOLATION;
> > > +               vcpu->run->fail_mem_access.hva = hva | (gpa & (PAGE_SIZE-1));
> > > +               r = 0;
> > > +
> > > +       }
> > > +       return r;
> > > 
> > > The patch to KVM can be sent if the patch set approved
> > > 
> > > Denis Plotnikov (7):
> > >   migration: add background snapshot capability
> > >   bitops: add some atomic versions of bitmap operations
> > >   threads: add infrastructure to process sigsegv
> > >   migration: add background snapshot infrastructure
> > >   kvm: add failed memory access exit reason
> > >   kvm: add vCPU failed memory access processing
> > >   migration: add background snapshotting
> > > 
> > >  include/exec/ram_addr.h   |   7 +
> > >  include/exec/ramlist.h    |   4 +-
> > >  include/qemu/bitops.h     |  24 +++
> > >  include/qemu/thread.h     |   5 +
> > >  linux-headers/linux/kvm.h |   5 +
> > >  migration/migration.c     | 141 +++++++++++++++-
> > >  migration/migration.h     |   1 +
> > >  migration/ram.c           | 333 ++++++++++++++++++++++++++++++++++++--
> > >  migration/ram.h           |  11 +-
> > >  migration/savevm.c        |  91 ++++++-----
> > >  migration/savevm.h        |   2 +
> > >  qapi/migration.json       |   6 +-
> > >  target/i386/kvm.c         |  18 +++
> > >  util/qemu-thread-posix.c  |  50 ++++++
> > >  14 files changed, 635 insertions(+), 63 deletions(-)
> > > 
> > > -- 
> > > 2.17.0
> > > 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> 
> Regards,
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v0 0/7] Background snapshots
  2018-07-25 19:17     ` Dr. David Alan Gilbert
@ 2018-07-25 20:04       ` Andrea Arcangeli
  2018-07-26  8:51         ` Paolo Bonzini
  0 siblings, 1 reply; 31+ messages in thread
From: Andrea Arcangeli @ 2018-07-25 20:04 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Xu, Denis Plotnikov, pbonzini, qemu-devel, quintela

On Wed, Jul 25, 2018 at 08:17:37PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Fri, Jun 29, 2018 at 12:53:59PM +0100, Dr. David Alan Gilbert wrote:
> > > * Denis Plotnikov (dplotnikov@virtuozzo.com) wrote:
> > > > The patch set adds the ability to make external snapshots while VM is running.
> > > 
> > > cc'ing in Andrea since this uses sigsegv's to avoid userfault-wp that
> > > isn't there yet.
> > > 
> > > Hi Denis,
> > >   How robust are you finding this SEGV based trick; for example what
> > > about things like the kernel walking vhost queues or similar kernel
> > > nasties?
> > 
> > (I'm commenting on this old series to keep the discussion together)
> > 
> > If we want to make this series really work for people, we should
> > possibly need to know whether it could work with vhost (otherwise we
> > might need to go back to userfaultfd write-protection).
> > 
> > I digged a bit on the vhost-net IO, it should be using two ways to
> > write to guest memory:
> > 
> > - copy_to_user(): this should possibly still be able to be captured by
> >   mprotect() (after some confirmation from Paolo, but still we'd
> >   better try it out)
> 
> What confuses me here is who is going to get the signal from this and
> how we recover from the signal - or does it come back as an error
> on the vhost fd somehow?

The problem is that we would have to start handling all sigsegvs manually
in vhost-net, by trapping copy_to_user() returning less than the full
buffer size or put_user() returning -EFAULT.

Those errors would need to be forwarded by vhost-net to qemu userland
to call mprotect after copying the data.

That's not conceptually different from having uffd-wp sending the
message except that will then require zero changes to vhost-net and
every other piece of kernel code that may have to write to the write
protected memory.

It may look like the uffd-wp model is a wish-list feature, similar to an
optimization, but without the uffd-wp model, when the WP fault is
triggered by kernel code, the sigsegv model falls apart and requires
all kinds of ad-hoc changes just for this single feature. Plus uffd-wp
has other benefits: it makes it all reliable in terms of not
increasing the number of vmas in use during the snapshot. Finally it
makes it faster too, with no mmap_sem taken for reading and no sigsegv
signals.

The non-cooperative features got merged first because there was much
activity on the kernel side on that front, but now is an ideal
time to nail down the remaining issues in uffd-wp, I think. That, I
believe, is time better spent than trying to emulate it with sigsegv
and changing all drivers to send new, sigsegv-specific events down to
qemu. We considered this before doing uffd for
postcopy too, but overall it's unreliable and more work (not a single
change to KVM code was needed with uffd to handle postcopy, and here it
should be the same).

Thanks,
Andrea


* Re: [Qemu-devel] [PATCH v0 0/7] Background snapshots
  2018-07-25 20:04       ` Andrea Arcangeli
@ 2018-07-26  8:51         ` Paolo Bonzini
  2018-07-26  9:23           ` Peter Xu
  2018-07-26 15:13           ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 31+ messages in thread
From: Paolo Bonzini @ 2018-07-26  8:51 UTC (permalink / raw)
  To: Andrea Arcangeli, Dr. David Alan Gilbert
  Cc: Peter Xu, Denis Plotnikov, qemu-devel, quintela

On 25/07/2018 22:04, Andrea Arcangeli wrote:
> 
> It may look like the uffd-wp model is wish-feature similar to an
> optimization, but without the uffd-wp model when the WP fault is
> triggered by kernel code, the sigsegv model falls apart and requires
> all kind of ad-hoc changes just for this single feature. Plus uffd-wp
> has other benefits: it makes it all reliable in terms of not
> increasing the number of vmas in use during the snapshot. Finally it
> makes it faster too with no mmap_sem for reading and no sigsegv
> signals.
> 
> The non cooperative features got merged first because there was much
> activity on the kernel side on that front, but this is just an ideal
> time to nail down the remaining issues in uffd-wp I think. That I
> believe is time better spent than trying to emulate it with sigsegv
> and changing all drivers to send new events down to qemu specific to
> the sigsegv handling. We considered this before doing uffd for
> postcopy too but overall it's unreliable and more work (no single
> change was then needed to KVM code with uffd to handle postcopy and
> here it should be the same).

I totally agree.  The hard part in userfaultfd was the changes to the
kernel get_user_pages API, but the payback was huge because _all_ kernel
uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd.  Going
back to mprotect would be a huge mistake.

Paolo


* Re: [Qemu-devel] [PATCH v0 0/7] Background snapshots
  2018-07-26  8:51         ` Paolo Bonzini
@ 2018-07-26  9:23           ` Peter Xu
  2018-08-13 12:55             ` Denis Plotnikov
  2018-07-26 15:13           ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 31+ messages in thread
From: Peter Xu @ 2018-07-26  9:23 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Andrea Arcangeli, Dr. David Alan Gilbert, Denis Plotnikov,
	qemu-devel, quintela

On Thu, Jul 26, 2018 at 10:51:33AM +0200, Paolo Bonzini wrote:
> On 25/07/2018 22:04, Andrea Arcangeli wrote:
> > 
> > It may look like the uffd-wp model is wish-feature similar to an
> > optimization, but without the uffd-wp model when the WP fault is
> > triggered by kernel code, the sigsegv model falls apart and requires
> > all kind of ad-hoc changes just for this single feature. Plus uffd-wp
> > has other benefits: it makes it all reliable in terms of not
> > increasing the number of vmas in use during the snapshot. Finally it
> > makes it faster too with no mmap_sem for reading and no sigsegv
> > signals.
> > 
> > The non cooperative features got merged first because there was much
> > activity on the kernel side on that front, but this is just an ideal
> > time to nail down the remaining issues in uffd-wp I think. That I
> > believe is time better spent than trying to emulate it with sigsegv
> > and changing all drivers to send new events down to qemu specific to
> > the sigsegv handling. We considered this before doing uffd for
> > postcopy too but overall it's unreliable and more work (no single
> > change was then needed to KVM code with uffd to handle postcopy and
> > here it should be the same).
> 
> I totally agree.  The hard part in userfaultfd was the changes to the
> kernel get_user_pages API, but the payback was huge because _all_ kernel
> uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd.  Going
> back to mprotect would be a huge mistake.

Thanks for explaining the bits.  I wasn't aware of the
difference before I started the investigation (and only now have I
noticed that major difference between mprotect and userfaultfd).  I'm
really glad that it's now much clearer (at least to me) which way we
should choose.

Now I'm thinking about how we can move the userfaultfd write-protect
work forward.  The latest discussion I have seen is from 2016, when
someone from Huawei tried to use the write-protect feature for that old
version of live snapshot but reported an issue:

  https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg01127.html

Is that the latest status for userfaultfd wr-protect?

If so, I'm thinking of trying to re-verify the work (I tried his QEMU
repository but somehow failed to compile it, so I plan to write some
even simpler code instead) to see whether I can reproduce the KVM
error he encountered.

Thoughts?

Regards,

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH v0 0/7] Background snapshots
  2018-07-26  8:51         ` Paolo Bonzini
  2018-07-26  9:23           ` Peter Xu
@ 2018-07-26 15:13           ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 31+ messages in thread
From: Dr. David Alan Gilbert @ 2018-07-26 15:13 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Andrea Arcangeli, Peter Xu, Denis Plotnikov, qemu-devel, quintela

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> On 25/07/2018 22:04, Andrea Arcangeli wrote:
> > 
> > It may look like the uffd-wp model is wish-feature similar to an
> > optimization, but without the uffd-wp model when the WP fault is
> > triggered by kernel code, the sigsegv model falls apart and requires
> > all kind of ad-hoc changes just for this single feature. Plus uffd-wp
> > has other benefits: it makes it all reliable in terms of not
> > increasing the number of vmas in use during the snapshot. Finally it
> > makes it faster too with no mmap_sem for reading and no sigsegv
> > signals.
> > 
> > The non cooperative features got merged first because there was much
> > activity on the kernel side on that front, but this is just an ideal
> > time to nail down the remaining issues in uffd-wp I think. That I
> > believe is time better spent than trying to emulate it with sigsegv
> > and changing all drivers to send new events down to qemu specific to
> > the sigsegv handling. We considered this before doing uffd for
> > postcopy too but overall it's unreliable and more work (no single
> > change was then needed to KVM code with uffd to handle postcopy and
> > here it should be the same).
> 
> I totally agree.  The hard part in userfaultfd was the changes to the
> kernel get_user_pages API, but the payback was huge because _all_ kernel
> uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd.  Going
> back to mprotect would be a huge mistake.

TBF I think Denis just wanted to get things moving, knowing that we've
not got userfault-wp yet;  what I'm curious about though is how Denis
has anything stable given that the faults can land in syscalls and
vhost etc.

Dave

> Paolo
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v0 0/7] Background snapshots
  2018-07-26  9:23           ` Peter Xu
@ 2018-08-13 12:55             ` Denis Plotnikov
  2018-08-13 19:00               ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 31+ messages in thread
From: Denis Plotnikov @ 2018-08-13 12:55 UTC (permalink / raw)
  To: Peter Xu, Paolo Bonzini
  Cc: Andrea Arcangeli, Dr. David Alan Gilbert, qemu-devel, quintela



On 26.07.2018 12:23, Peter Xu wrote:
> On Thu, Jul 26, 2018 at 10:51:33AM +0200, Paolo Bonzini wrote:
>> On 25/07/2018 22:04, Andrea Arcangeli wrote:
>>>
>>> It may look like the uffd-wp model is wish-feature similar to an
>>> optimization, but without the uffd-wp model when the WP fault is
>>> triggered by kernel code, the sigsegv model falls apart and requires
>>> all kind of ad-hoc changes just for this single feature. Plus uffd-wp
>>> has other benefits: it makes it all reliable in terms of not
>>> increasing the number of vmas in use during the snapshot. Finally it
>>> makes it faster too with no mmap_sem for reading and no sigsegv
>>> signals.
>>>
>>> The non cooperative features got merged first because there was much
>>> activity on the kernel side on that front, but this is just an ideal
>>> time to nail down the remaining issues in uffd-wp I think. That I
>>> believe is time better spent than trying to emulate it with sigsegv
>>> and changing all drivers to send new events down to qemu specific to
>>> the sigsegv handling. We considered this before doing uffd for
>>> postcopy too but overall it's unreliable and more work (no single
>>> change was then needed to KVM code with uffd to handle postcopy and
>>> here it should be the same).
>>
>> I totally agree.  The hard part in userfaultfd was the changes to the
>> kernel get_user_pages API, but the payback was huge because _all_ kernel
>> uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd.  Going
>> back to mprotect would be a huge mistake.
> 
> Thanks for explaining the bits.  I'd say I wasn't aware of the
> difference before I started the investigation (and only until now I
> noticed that major difference between mprotect and userfaultfd).  I'm
> really glad that it's much clear (at least for me) on which way we
> should choose.
> 
> Now I'm thinking whether we can move the userfault write protect work
> forward.  The latest discussion I saw so far is in 2016, when someone
> from Huawei tried to use the write protect feature for that old
> version of live snapshot but reported issue:
> 
>    https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg01127.html
> 
> Is that the latest status for userfaultfd wr-protect?
> 
> If so, I'm thinking whether I can try to re-verify the work (I tried
> his QEMU repository but I failed to compile somehow, so I plan to
> write some even simpler code to try) to see whether I can get the same
> KVM error he encountered.
> 
> Thoughts?

Just to sum up everything said before.

Using mprotect is a bad idea because the VM's memory can be accessed 
from a number of places (KVM, vhost, ...), each of which would need its 
own special handling to track memory accesses and notify QEMU, which 
makes using mprotect unacceptable.

Tracking accesses to the protected memory can be done via userfaultfd's 
WP mode, which isn't available right now.

So, the reasonable conclusion is to wait until the WP mode is available 
and build the background snapshot on top of userfaultfd-wp.
But work on adding the WP mode has been pending for quite a long time 
already.

Is there any way to estimate when it could be available?
> 
> Regards,
> 

-- 
Best,
Denis
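
For reference, whatever write-protection mechanism wins, the overall
background-snapshot flow the cover letter describes stays the same. As a
rough sketch (pseudocode, illustrative names, not the series' actual code):

```
pause_vm()
snapshot_block_devices()          # any scheme of your choice
save_device_state(stream)         # devices' states go out first
write_protect_all_ram()           # sigsegv trick today, uffd-wp ideally
resume_vm()                       # guest runs while RAM is written out

# background thread: linear copy of still-protected pages
for page in ram:
    if not saved[page]:
        write(stream, page); unprotect(page); saved[page] = true

# fault path: copy-before-write keeps the snapshot consistent
on write_protect_fault(page):
    write(stream, page); unprotect(page); saved[page] = true
```

Only the fault-path trigger differs between the two mechanisms, which is
why the series could move to uffd-wp without changing its structure.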


* Re: [Qemu-devel] [PATCH v0 0/7] Background snapshots
  2018-08-13 12:55             ` Denis Plotnikov
@ 2018-08-13 19:00               ` Dr. David Alan Gilbert
  2018-08-14  5:45                 ` Peter Xu
                                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Dr. David Alan Gilbert @ 2018-08-13 19:00 UTC (permalink / raw)
  To: Denis Plotnikov
  Cc: Peter Xu, Paolo Bonzini, Andrea Arcangeli, qemu-devel, quintela,
	rppt, mike.kravetz

cc'ing in Mike*2
* Denis Plotnikov (dplotnikov@virtuozzo.com) wrote:
> 
> 
> On 26.07.2018 12:23, Peter Xu wrote:
> > On Thu, Jul 26, 2018 at 10:51:33AM +0200, Paolo Bonzini wrote:
> > > On 25/07/2018 22:04, Andrea Arcangeli wrote:
> > > > 
> > > > It may look like the uffd-wp model is wish-feature similar to an
> > > > optimization, but without the uffd-wp model when the WP fault is
> > > > triggered by kernel code, the sigsegv model falls apart and requires
> > > > all kind of ad-hoc changes just for this single feature. Plus uffd-wp
> > > > has other benefits: it makes it all reliable in terms of not
> > > > increasing the number of vmas in use during the snapshot. Finally it
> > > > makes it faster too with no mmap_sem for reading and no sigsegv
> > > > signals.
> > > > 
> > > > The non cooperative features got merged first because there was much
> > > > activity on the kernel side on that front, but this is just an ideal
> > > > time to nail down the remaining issues in uffd-wp I think. That I
> > > > believe is time better spent than trying to emulate it with sigsegv
> > > > and changing all drivers to send new events down to qemu specific to
> > > > the sigsegv handling. We considered this before doing uffd for
> > > > postcopy too but overall it's unreliable and more work (no single
> > > > change was then needed to KVM code with uffd to handle postcopy and
> > > > here it should be the same).
> > > 
> > > I totally agree.  The hard part in userfaultfd was the changes to the
> > > kernel get_user_pages API, but the payback was huge because _all_ kernel
> > > uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd.  Going
> > > back to mprotect would be a huge mistake.
> > 
> > Thanks for explaining the bits.  I'd say I wasn't aware of the
> > difference before I started the investigation (and only until now I
> > noticed that major difference between mprotect and userfaultfd).  I'm
> > really glad that it's much clear (at least for me) on which way we
> > should choose.
> > 
> > Now I'm thinking whether we can move the userfault write protect work
> > forward.  The latest discussion I saw so far is in 2016, when someone
> > from Huawei tried to use the write protect feature for that old
> > version of live snapshot but reported issue:
> > 
> >    https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg01127.html
> > 
> > Is that the latest status for userfaultfd wr-protect?
> > 
> > If so, I'm thinking whether I can try to re-verify the work (I tried
> > his QEMU repository but I failed to compile somehow, so I plan to
> > write some even simpler code to try) to see whether I can get the same
> > KVM error he encountered.
> > 
> > Thoughts?
> 
> Just to sum up all being said before.
> 
> Using mprotect is a bad idea because VM's memory can be accessed from the
> number of places (KVM, vhost, ...) which need their own special care
> of tracking memory accesses and notifying QEMU which makes the mprotect
> using unacceptable.
> 
> Protected memory accesses tracking can be done via userfaultfd's WP mode
> which isn't available right now.
> 
> So, the reasonable conclusion is to wait until the WP mode is available and
> build the background snapshot on top of userfaultfd-wp.
> But, works on adding the WP-mode is pending for a quite a long time already.
> 
> Is there any way to estimate when it could be available?

I think a question is whether anyone is actively working on it; I
suspect really it's on a TODO list rather than moving at the moment.

What I don't really understand is what stage the last version got up to.

Dave

> > 
> > Regards,
> > 
> 
> -- 
> Best,
> Denis
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [PATCH v0 0/7] Background snapshots
  2018-08-13 19:00               ` Dr. David Alan Gilbert
@ 2018-08-14  5:45                 ` Peter Xu
  2018-08-14  6:13                 ` Mike Rapoport
  2018-08-14 23:16                 ` Mike Kravetz
  2 siblings, 0 replies; 31+ messages in thread
From: Peter Xu @ 2018-08-14  5:45 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Denis Plotnikov, Paolo Bonzini, Andrea Arcangeli, qemu-devel,
	quintela, rppt, mike.kravetz

On Mon, Aug 13, 2018 at 08:00:19PM +0100, Dr. David Alan Gilbert wrote:
> cc'ing in Mike*2
> * Denis Plotnikov (dplotnikov@virtuozzo.com) wrote:
> > 
> > 
> > On 26.07.2018 12:23, Peter Xu wrote:
> > > On Thu, Jul 26, 2018 at 10:51:33AM +0200, Paolo Bonzini wrote:
> > > > On 25/07/2018 22:04, Andrea Arcangeli wrote:
> > > > > 
> > > > > It may look like the uffd-wp model is wish-feature similar to an
> > > > > optimization, but without the uffd-wp model when the WP fault is
> > > > > triggered by kernel code, the sigsegv model falls apart and requires
> > > > > all kind of ad-hoc changes just for this single feature. Plus uffd-wp
> > > > > has other benefits: it makes it all reliable in terms of not
> > > > > increasing the number of vmas in use during the snapshot. Finally it
> > > > > makes it faster too with no mmap_sem for reading and no sigsegv
> > > > > signals.
> > > > > 
> > > > > The non cooperative features got merged first because there was much
> > > > > activity on the kernel side on that front, but this is just an ideal
> > > > > time to nail down the remaining issues in uffd-wp I think. That I
> > > > > believe is time better spent than trying to emulate it with sigsegv
> > > > > and changing all drivers to send new events down to qemu specific to
> > > > > the sigsegv handling. We considered this before doing uffd for
> > > > > postcopy too but overall it's unreliable and more work (no single
> > > > > change was then needed to KVM code with uffd to handle postcopy and
> > > > > here it should be the same).
> > > > 
> > > > I totally agree.  The hard part in userfaultfd was the changes to the
> > > > kernel get_user_pages API, but the payback was huge because _all_ kernel
> > > > uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd.  Going
> > > > back to mprotect would be a huge mistake.
> > > 
> > > Thanks for explaining the bits.  I'd say I wasn't aware of the
> > > difference before I started the investigation (and only until now I
> > > noticed that major difference between mprotect and userfaultfd).  I'm
> > > really glad that it's much clear (at least for me) on which way we
> > > should choose.
> > > 
> > > Now I'm thinking whether we can move the userfault write protect work
> > > forward.  The latest discussion I saw so far is in 2016, when someone
> > > from Huawei tried to use the write protect feature for that old
> > > version of live snapshot but reported issue:
> > > 
> > >    https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg01127.html
> > > 
> > > Is that the latest status for userfaultfd wr-protect?
> > > 
> > > If so, I'm thinking whether I can try to re-verify the work (I tried
> > > his QEMU repository but I failed to compile somehow, so I plan to
> > > write some even simpler code to try) to see whether I can get the same
> > > KVM error he encountered.
> > > 
> > > Thoughts?
> > 
> > Just to sum up all being said before.
> > 
> > Using mprotect is a bad idea because VM's memory can be accessed from the
> > number of places (KVM, vhost, ...) which need their own special care
> > of tracking memory accesses and notifying QEMU which makes the mprotect
> > using unacceptable.
> > 
> > Protected memory accesses tracking can be done via userfaultfd's WP mode
> > which isn't available right now.
> > 
> > So, the reasonable conclusion is to wait until the WP mode is available and
> > build the background snapshot on top of userfaultfd-wp.
> > But, works on adding the WP-mode is pending for a quite a long time already.
> > 
> > Is there any way to estimate when it could be available?
> 
> I think a question is whether anyone is actively working on it; I
> suspect really it's on a TODO list rather than moving at the moment.
> 
> What I don't really understand is what stage the last version got upto.

I'm still testing that tree (though for various reasons I haven't yet
continued... but I will; currently WP is still not working in my test).
My plan is to dig into the WP work if it does not behave as expected
(once I'm sure there's nothing wrong with the userspace side), though
that of course won't be a trivial task.  I'll see how far I can go with
it.  Anyone who would like to help with that would be greatly welcome
too.

Regards,

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH v0 0/7] Background snapshots
  2018-08-13 19:00               ` Dr. David Alan Gilbert
  2018-08-14  5:45                 ` Peter Xu
@ 2018-08-14  6:13                 ` Mike Rapoport
  2018-08-14 23:16                 ` Mike Kravetz
  2 siblings, 0 replies; 31+ messages in thread
From: Mike Rapoport @ 2018-08-14  6:13 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Denis Plotnikov, Peter Xu, Paolo Bonzini, Andrea Arcangeli,
	qemu-devel, quintela, mike.kravetz

On Mon, Aug 13, 2018 at 08:00:19PM +0100, Dr. David Alan Gilbert wrote:
> cc'ing in Mike*2
> * Denis Plotnikov (dplotnikov@virtuozzo.com) wrote:
> > 
> > 
> > On 26.07.2018 12:23, Peter Xu wrote:
> > > On Thu, Jul 26, 2018 at 10:51:33AM +0200, Paolo Bonzini wrote:
> > > > On 25/07/2018 22:04, Andrea Arcangeli wrote:
> > > > > 
> > > > > It may look like the uffd-wp model is wish-feature similar to an
> > > > > optimization, but without the uffd-wp model when the WP fault is
> > > > > triggered by kernel code, the sigsegv model falls apart and requires
> > > > > all kind of ad-hoc changes just for this single feature. Plus uffd-wp
> > > > > has other benefits: it makes it all reliable in terms of not
> > > > > increasing the number of vmas in use during the snapshot. Finally it
> > > > > makes it faster too with no mmap_sem for reading and no sigsegv
> > > > > signals.
> > > > > 
> > > > > The non cooperative features got merged first because there was much
> > > > > activity on the kernel side on that front, but this is just an ideal
> > > > > time to nail down the remaining issues in uffd-wp I think. That I
> > > > > believe is time better spent than trying to emulate it with sigsegv
> > > > > and changing all drivers to send new events down to qemu specific to
> > > > > the sigsegv handling. We considered this before doing uffd for
> > > > > postcopy too but overall it's unreliable and more work (no single
> > > > > change was then needed to KVM code with uffd to handle postcopy and
> > > > > here it should be the same).
> > > > 
> > > > I totally agree.  The hard part in userfaultfd was the changes to the
> > > > kernel get_user_pages API, but the payback was huge because _all_ kernel
> > > > uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd.  Going
> > > > back to mprotect would be a huge mistake.
> > > 
> > > Thanks for explaining the bits.  I'd say I wasn't aware of the
> > > difference before I started the investigation (and only until now I
> > > noticed that major difference between mprotect and userfaultfd).  I'm
> > > really glad that it's much clear (at least for me) on which way we
> > > should choose.
> > > 
> > > Now I'm thinking whether we can move the userfault write protect work
> > > forward.  The latest discussion I saw so far is in 2016, when someone
> > > from Huawei tried to use the write protect feature for that old
> > > version of live snapshot but reported issue:
> > > 
> > >    https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg01127.html
> > > 
> > > Is that the latest status for userfaultfd wr-protect?
> > > 
> > > If so, I'm thinking whether I can try to re-verify the work (I tried
> > > his QEMU repository but I failed to compile somehow, so I plan to
> > > write some even simpler code to try) to see whether I can get the same
> > > KVM error he encountered.
> > > 
> > > Thoughts?
> > 
> > Just to sum up all being said before.
> > 
> > Using mprotect is a bad idea because VM's memory can be accessed from the
> > number of places (KVM, vhost, ...) which need their own special care
> > of tracking memory accesses and notifying QEMU which makes the mprotect
> > using unacceptable.
> > 
> > Protected memory accesses tracking can be done via userfaultfd's WP mode
> > which isn't available right now.
> > 
> > So, the reasonable conclusion is to wait until the WP mode is available and
> > build the background snapshot on top of userfaultfd-wp.
> > But, works on adding the WP-mode is pending for a quite a long time already.
> > 
> > Is there any way to estimate when it could be available?
> 
> I think a question is whether anyone is actively working on it; I
> suspect really it's on a TODO list rather than moving at the moment.

I thought Andrea was working on it :)
 
> What I don't really understand is what stage the last version got upto.
> 
> Dave
> 
> > > 
> > > Regards,
> > > 
> > 
> > -- 
> > Best,
> > Denis
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 

-- 
Sincerely yours,
Mike.


* Re: [Qemu-devel] [PATCH v0 0/7] Background snapshots
  2018-08-13 19:00               ` Dr. David Alan Gilbert
  2018-08-14  5:45                 ` Peter Xu
  2018-08-14  6:13                 ` Mike Rapoport
@ 2018-08-14 23:16                 ` Mike Kravetz
  2 siblings, 0 replies; 31+ messages in thread
From: Mike Kravetz @ 2018-08-14 23:16 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Denis Plotnikov
  Cc: Peter Xu, Paolo Bonzini, Andrea Arcangeli, qemu-devel, quintela, rppt

On 08/13/2018 12:00 PM, Dr. David Alan Gilbert wrote:
> cc'ing in Mike*2
> * Denis Plotnikov (dplotnikov@virtuozzo.com) wrote:
>>
>>
>> On 26.07.2018 12:23, Peter Xu wrote:
>>> On Thu, Jul 26, 2018 at 10:51:33AM +0200, Paolo Bonzini wrote:
>>>> On 25/07/2018 22:04, Andrea Arcangeli wrote:
>>>>>
>>>>> It may look like the uffd-wp model is a wish-feature similar to an
>>>>> optimization, but without the uffd-wp model, when the WP fault is
>>>>> triggered by kernel code, the sigsegv model falls apart and requires
>>>>> all kinds of ad-hoc changes just for this single feature. Plus uffd-wp
>>>>> has other benefits: it makes it all reliable in terms of not
>>>>> increasing the number of vmas in use during the snapshot. Finally it
>>>>> makes it faster too with no mmap_sem for reading and no sigsegv
>>>>> signals.
>>>>>
>>>>> The non-cooperative features got merged first because there was much
>>>>> activity on the kernel side on that front, but this is just an ideal
>>>>> time to nail down the remaining issues in uffd-wp I think. That I
>>>>> believe is time better spent than trying to emulate it with sigsegv
>>>>> and changing all drivers to send new events down to qemu specific to
>>>>> the sigsegv handling. We considered this before doing uffd for
>>>>> postcopy too but overall it's unreliable and more work (no single
>>>>> change was then needed to KVM code with uffd to handle postcopy and
>>>>> here it should be the same).
>>>>
>>>> I totally agree.  The hard part in userfaultfd was the changes to the
>>>> kernel get_user_pages API, but the payback was huge because _all_ kernel
>>>> uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd.  Going
>>>> back to mprotect would be a huge mistake.
>>>
>>> Thanks for explaining the bits.  I'd say I wasn't aware of the
>>> difference before I started the investigation (and only until now I
>>> noticed that major difference between mprotect and userfaultfd).  I'm
>>> really glad that it's much clearer (at least to me) which way we
>>> should choose.
>>>
>>> Now I'm thinking whether we can move the userfault write protect work
>>> forward.  The latest discussion I saw so far is in 2016, when someone
>>> from Huawei tried to use the write protect feature for that old
>>> version of live snapshot but reported an issue:
>>>
>>>    https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg01127.html
>>>
>>> Is that the latest status for userfaultfd wr-protect?
>>>
>>> If so, I'm thinking whether I can try to re-verify the work (I tried
>>> his QEMU repository but failed to compile it somehow, so I plan to
>>> write some even simpler code to try) to see whether I can get the same
>>> KVM error he encountered.
>>>
>>> Thoughts?
>>
>> Just to sum up all that has been said before.
>>
>> Using mprotect is a bad idea because the VM's memory can be accessed from
>> a number of places (KVM, vhost, ...), each of which would need its own
>> special handling to track memory accesses and notify QEMU, which makes
>> using mprotect unacceptable.
>>
>> Tracking accesses to protected memory can be done via userfaultfd's WP
>> mode, which isn't available right now.
>>
>> So, the reasonable conclusion is to wait until the WP mode is available
>> and build the background snapshot on top of userfaultfd-wp.
>> But work on adding the WP mode has been pending for quite a long time
>> already.
>>
>> Is there any way to estimate when it could be available?
> 
> I think a question is whether anyone is actively working on it; I
> suspect really it's on a TODO list rather than moving at the moment.
> 

I am not working on it, and it is not on my TODO list.

However, if someone starts making progress I will jump in and work on
hugetlbfs support.  My intention would be to not let hugetlbfs support
'fall behind' general uffd support.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2018-08-14 23:18 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-29  8:03 [Qemu-devel] [PATCH v0 0/7] Background snapshots Denis Plotnikov
2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 1/7] migration: add background snapshot capability Denis Plotnikov
2018-06-29 16:02   ` Eric Blake
2018-07-12  9:03   ` Dr. David Alan Gilbert
2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 2/7] bitops: add some atomic versions of bitmap operations Denis Plotnikov
2018-07-12  9:21   ` Dr. David Alan Gilbert
2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 3/7] threads: add infrastructure to process sigsegv Denis Plotnikov
2018-07-12  9:53   ` Dr. David Alan Gilbert
2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 4/7] migration: add background snapshot infrastructure Denis Plotnikov
2018-07-12 11:46   ` Dr. David Alan Gilbert
2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 5/7] kvm: add failed memeory access exit reason Denis Plotnikov
2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 6/7] kvm: add vCPU failed memeory access processing Denis Plotnikov
2018-06-29  8:03 ` [Qemu-devel] [PATCH v0 7/7] migration: add background snapshotting Denis Plotnikov
2018-07-12 18:59   ` Dr. David Alan Gilbert
2018-06-29 11:53 ` [Qemu-devel] [PATCH v0 0/7] Background snapshots Dr. David Alan Gilbert
2018-07-25 10:18   ` Peter Xu
2018-07-25 19:17     ` Dr. David Alan Gilbert
2018-07-25 20:04       ` Andrea Arcangeli
2018-07-26  8:51         ` Paolo Bonzini
2018-07-26  9:23           ` Peter Xu
2018-08-13 12:55             ` Denis Plotnikov
2018-08-13 19:00               ` Dr. David Alan Gilbert
2018-08-14  5:45                 ` Peter Xu
2018-08-14  6:13                 ` Mike Rapoport
2018-08-14 23:16                 ` Mike Kravetz
2018-07-26 15:13           ` Dr. David Alan Gilbert
2018-07-02 11:23 ` Peter Xu
2018-07-02 12:40   ` Denis Plotnikov
2018-07-03  5:54     ` Peter Xu
2018-07-13  5:20 ` Peter Xu
2018-07-16 15:00   ` Denis Plotnikov
