* [RFC PATCH 0/9] migration/snap-tool: External snapshot utility
@ 2021-03-17 16:32 Andrey Gruzdev
  2021-03-17 16:32 ` [RFC PATCH 1/9] migration/snap-tool: Introduce qemu-snap tool Andrey Gruzdev
                   ` (10 more replies)
  0 siblings, 11 replies; 13+ messages in thread
From: Andrey Gruzdev @ 2021-03-17 16:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

This series is a PoC for asynchronous snapshot reverting. It is about
external snapshots only and doesn't involve block devices. Thus, it's
mainly intended to be used with the new 'background-snapshot' migration
capability and the otherwise standard QEMU migration mechanism.

The major ideas behind this first version were:
  * Make it compatible with 'exec:'-style migration - the options were to
    create a separate tool or to integrate into qemu-system.
  * Support an asynchronous revert stage by using unaltered postcopy logic
    at the destination. To do this, we should be capable of saving RAM pages
    so that any particular page can be directly addressed by its block ID
    and page offset (a sketch follows this list). Possible solutions here
    seem to be:
      - use a separate index (and store it somewhere)
      - create a sparse file on the host FS and address pages by file offset
      - use a QCOW2 (or other) image container with inherent sparsity support
  * Make the snapshot image file dense on the host FS so we don't depend on
    copy/backup tools and how they deal with sparse files. Of course,
    there's some performance cost to this choice.
  * Keep the code that parses the unstructured migration stream format
    reasonably simple, and try to have minimum dependencies on QEMU
    migration code, both RAM and device parts.
  * Try to keep page save latencies small while not degrading migration
    bandwidth too much.
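
To make the direct page addressing above concrete, here is a minimal sketch
(RAMBlockDesc and INVALID_OFFSET follow the definitions introduced by later
patches in this series; find_block_by_idstr() is a hypothetical lookup over
the saved RAM block list):

    /* Sketch: address a RAM page directly in the snapshot image */
    static int64_t page_bdrv_offset(const char *idstr, int64_t page_offset)
    {
        RAMBlockDesc *block = find_block_by_idstr(idstr);

        if (!block || page_offset >= block->length) {
            return INVALID_OFFSET;
        }
        /* Each block occupies a contiguous extent of the image file,
         * so no separate per-page index is needed */
        return block->bdrv_offset + page_offset;
    }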

For this first version I decided not to integrate into the main QEMU code but
to create a separate tool. The main reason is that not much of the migration
code is target-specific, so it can be reused in its unmodified form. Also,
it's still not very clear how a 'qemu-system' integration should extend the
command-line (or monitor/QMP?) interface.

For the storage format, QCOW2 as a container with a large (1MB) cluster size
seems to be an optimal choice. A larger cluster size is beneficial for
performance, particularly when image preallocation is disabled. Such a cluster
size does not result in too high an internal fragmentation level (~10% of
space wasted in most cases), yet it significantly reduces the number of
expensive cluster allocations.
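
For reference, an image with the same geometry could be created by hand with
standard qemu-img options (a sketch only - qemu-snap creates the file itself
with exactly these options, see patch 2; this assumes a QEMU recent enough to
accept the 'v3' compat alias):

    qemu-img create -f qcow2 \
        -o preallocation=off,lazy_refcounts=on,extended_l2=off,compat=v3,cluster_size=1M,refcount_bits=8 \
        image-file.qcow2 <virtual-size>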

A somewhat tricky part is dispatching the QEMU migration stream, because it is
mostly unstructured and depends on configuration parameters like
'send-configuration' and 'send-section-footer'. But for the case with default
values of the migration globals, the implemented dispatching code seems to
work well and should not have compatibility issues within a reasonably long
time frame.
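
For orientation, here is a rough outline of the framing the dispatcher has to
handle, reconstructed from migration/savevm.c (simplified; the optional pieces
correspond to the capabilities named above):

    /*
     * QEMU_VM_FILE_MAGIC, QEMU_VM_FILE_VERSION
     * [configuration section              - if 'send-configuration' is on]
     * repeated sections:
     *   QEMU_VM_SECTION_START|FULL: be32 section_id,
     *       idstr (u8 length + bytes), be32 instance_id, be32 version_id,
     *       then section data
     *   QEMU_VM_SECTION_PART|END:   be32 section_id, then section data
     *   [QEMU_VM_SECTION_FOOTER, be32 section_id
     *                                       - if 'send-section-footer' is on]
     * QEMU_VM_EOF
     */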

I decided to keep the RAM save path synchronous; in any case, it's better to
use writeback cache mode for live snapshots because of their interleaved page
address pattern. A page coalescing buffer is used to merge contiguous pages
and optimize block layer writes.
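
A minimal sketch of that coalescing logic (simplified; the SnapSaveState
fields follow patch 7, while flush_pbuf() and pbuf_append() are hypothetical
helpers, and last_bdrv_offset is assumed to track the end of the currently
buffered region):

    static void save_page(SnapSaveState *sn, uint8_t *page,
                          int64_t bdrv_offset, int64_t page_size)
    {
        if (bdrv_offset != sn->last_bdrv_offset ||
            sn->ioc_pbuf->usage >= PAGE_COALESC_MAX) {
            /* Not contiguous (or buffer full): write out buffered pages */
            flush_pbuf(sn);                 /* writes at sn->bdrv_offset */
            sn->bdrv_offset = bdrv_offset;
        }
        pbuf_append(sn, page, page_size);   /* append to sn->ioc_pbuf */
        sn->last_bdrv_offset = bdrv_offset + page_size;
    }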

Since opening the image file in cached mode would not do any good for snapshot
loading, Linux native AIO and O_DIRECT mode are used in the common scenario.
AIO support in the RAM loading path is implemented using a ring of preallocated
fixed-size buffers, in such a way that a number of block requests is
outstanding at any time. It also ensures in-order request completion.
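
With the buffer pool API introduced in patch 5, the loading loop boils down to
something like the following sketch (load_buffer_co() stands in for the actual
AioBufferFunc that reads one buffer from the image):

    pool = aio_pool_new(BDRV_SECTOR_SIZE, buf_size, buf_count);
    while (offset < limit) {
        AioBuffer *buf;

        /* Keep the ring full: submit as many reads as the pool allows */
        while (offset < limit &&
               (buf = aio_pool_try_acquire_next(pool))) {
            aio_buffer_start_task(buf, load_buffer_co, offset, buf->size);
            offset += buf->size;
        }
        /* Consume completions strictly in submission order */
        buf = aio_pool_wait_compl_next(pool);
        /* ... send buf->data to the outgoing stream here ... */
        aio_buffer_release(buf);
    }
    aio_pool_free(pool);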

How to use:

**Save:**
* qemu> migrate_set_capability background-snapshot on
* qemu> migrate "exec:<qemu-bin-path>/qemu-snap -s <virtual-size>
           --cache=writeback --aio=threads save <image-file.qcow2>"

**Load:**
* Use 'qemu-system-* -incoming defer'
* qemu> migrate_incoming "exec:<qemu-bin-path>/qemu-snap
          --cache=none --aio=native load <image-file.qcow2>"

**Load with postcopy:**
* Use 'qemu-system-* -incoming defer'
* qemu> migrate_set_capability postcopy-ram on
* qemu> migrate_incoming "exec:<qemu-bin-path>/qemu-snap --postcopy=60
          --cache=none --aio=native load <image-file.qcow2>"

And yes, asynchronous revert works well only with SSDs, not with rotational
disks.

Some performance stats:
* SATA SSD drive with ~500/450 MB/s sequential read/write and ~60K IOPS max.
* 220 MB/s average save rate (depends on workload)
* 440 MB/s average load rate in precopy
* 260 MB/s average load rate in postcopy

Andrey Gruzdev (9):
  migration/snap-tool: Introduce qemu-snap tool
  migration/snap-tool: Snapshot image create/open routines for qemu-snap
    tool
  migration/snap-tool: Preparations to run code in main loop context
  migration/snap-tool: Introduce qemu_ftell2() routine to qemu-file.c
  migration/snap-tool: Block layer AIO support and file utility routines
  migration/snap-tool: Move RAM_SAVE_FLAG_xxx defines to migration/ram.h
  migration/snap-tool: Complete implementation of snapshot saving
  migration/snap-tool: Implementation of snapshot loading in precopy
  migration/snap-tool: Implementation of snapshot loading in postcopy

 include/qemu-snap.h   |  163 ++++
 meson.build           |    2 +
 migration/qemu-file.c |    6 +
 migration/qemu-file.h |    1 +
 migration/ram.c       |   16 -
 migration/ram.h       |   16 +
 qemu-snap-handlers.c  | 1801 +++++++++++++++++++++++++++++++++++++++++
 qemu-snap-io.c        |  325 ++++++++
 qemu-snap.c           |  673 +++++++++++++++
 9 files changed, 2987 insertions(+), 16 deletions(-)
 create mode 100644 include/qemu-snap.h
 create mode 100644 qemu-snap-handlers.c
 create mode 100644 qemu-snap-io.c
 create mode 100644 qemu-snap.c

-- 
2.25.1




* [RFC PATCH 1/9] migration/snap-tool: Introduce qemu-snap tool
  2021-03-17 16:32 [RFC PATCH 0/9] migration/snap-tool: External snapshot utility Andrey Gruzdev
@ 2021-03-17 16:32 ` Andrey Gruzdev
  2021-03-17 16:32 ` [RFC PATCH 2/9] migration/snap-tool: Snapshot image create/open routines for " Andrey Gruzdev
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Andrey Gruzdev @ 2021-03-17 16:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

Initial commit with code to set up the execution environment, parse
command-line arguments, show usage/version info, and so on.

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 include/qemu-snap.h |  35 ++++
 meson.build         |   2 +
 qemu-snap.c         | 414 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 451 insertions(+)
 create mode 100644 include/qemu-snap.h
 create mode 100644 qemu-snap.c

diff --git a/include/qemu-snap.h b/include/qemu-snap.h
new file mode 100644
index 0000000000..b8e48bfcbb
--- /dev/null
+++ b/include/qemu-snap.h
@@ -0,0 +1,35 @@
+/*
+ * QEMU External Snapshot Utility
+ *
+ * Copyright Virtuozzo GmbH, 2021
+ *
+ * Authors:
+ *  Andrey Gruzdev   <andrey.gruzdev@virtuozzo.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later. See the COPYING file in the top-level directory.
+ */
+
+#ifndef QEMU_SNAP_H
+#define QEMU_SNAP_H
+
+/* Target page size, if not specified explicitly in command-line */
+#define DEFAULT_PAGE_SIZE       4096
+/*
+ * Maximum supported target page size, because we use QEMUFile and
+ * qemu_get_buffer_in_place(). IO_BUF_SIZE is currently 32KB.
+ */
+#define PAGE_SIZE_MAX           16384
+
+typedef struct SnapSaveState {
+    BlockBackend *blk;          /* Block backend */
+} SnapSaveState;
+
+typedef struct SnapLoadState {
+    BlockBackend *blk;          /* Block backend */
+} SnapLoadState;
+
+SnapSaveState *snap_save_get_state(void);
+SnapLoadState *snap_load_get_state(void);
+
+#endif /* QEMU_SNAP_H */
diff --git a/meson.build b/meson.build
index a7d2dd429d..11564165ba 100644
--- a/meson.build
+++ b/meson.build
@@ -2324,6 +2324,8 @@ if have_tools
              dependencies: [block, qemuutil], install: true)
   qemu_nbd = executable('qemu-nbd', files('qemu-nbd.c'),
                dependencies: [blockdev, qemuutil, gnutls], install: true)
+  qemu_snap = executable('qemu-snap', files('qemu-snap.c'),
+               dependencies: [blockdev, qemuutil, migration], install: true)
 
   subdir('storage-daemon')
   subdir('contrib/rdmacm-mux')
diff --git a/qemu-snap.c b/qemu-snap.c
new file mode 100644
index 0000000000..c7118927f7
--- /dev/null
+++ b/qemu-snap.c
@@ -0,0 +1,414 @@
+/*
+ * QEMU External Snapshot Utility
+ *
+ * Copyright Virtuozzo GmbH, 2021
+ *
+ * Authors:
+ *  Andrey Gruzdev   <andrey.gruzdev@virtuozzo.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later. See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <getopt.h>
+
+#include "qemu-common.h"
+#include "qemu-version.h"
+#include "qapi/error.h"
+#include "qapi/qmp/qdict.h"
+#include "sysemu/block-backend.h"
+#include "sysemu/runstate.h" /* for qemu_system_killed() prototype */
+#include "qemu/cutils.h"
+#include "qemu/coroutine.h"
+#include "qemu/error-report.h"
+#include "qemu/config-file.h"
+#include "qemu/log.h"
+#include "trace/control.h"
+#include "io/channel-util.h"
+#include "io/channel-buffer.h"
+#include "migration/qemu-file-channel.h"
+#include "migration/qemu-file.h"
+#include "qemu-snap.h"
+
+#define OPT_CACHE   256
+#define OPT_AIO     257
+
+/* Parameters for snapshot saving */
+typedef struct SnapSaveParams {
+    const char *filename;       /* QCOW2 image file name */
+    int64_t image_size;         /* QCOW2 virtual image size */
+    int bdrv_flags;             /* BDRV flags (cache/AIO mode) */
+    bool writethrough;          /* BDRV writes in FUA mode */
+
+    int64_t page_size;          /* Target page size to use */
+
+    int fd;                     /* Migration stream input FD */
+} SnapSaveParams;
+
+/* Parameters for snapshot loading */
+typedef struct SnapLoadParams {
+    const char *filename;       /* QCOW2 image file name */
+    int bdrv_flags;             /* BDRV flags (cache/AIO mode) */
+
+    int64_t page_size;          /* Target page size to use */
+    bool postcopy;              /* Use postcopy */
+    /* Switch to postcopy after postcopy_percent% of RAM loaded */
+    int postcopy_percent;
+
+    int fd;                     /* Migration stream output FD */
+    int rp_fd;                  /* Return-path FD (for postcopy) */
+} SnapLoadParams;
+
+static SnapSaveState save_state;
+static SnapLoadState load_state;
+
+#ifdef CONFIG_POSIX
+void qemu_system_killed(int signum, pid_t pid)
+{
+}
+#endif /* CONFIG_POSIX */
+
+static void snap_shutdown(void)
+{
+    bdrv_close_all();
+}
+
+SnapSaveState *snap_save_get_state(void)
+{
+    return &save_state;
+}
+
+SnapLoadState *snap_load_get_state(void)
+{
+    return &load_state;
+}
+
+static void snap_save_init_state(void)
+{
+    memset(&save_state, 0, sizeof(save_state));
+}
+
+static void snap_save_destroy_state(void)
+{
+    /* TODO: implement */
+}
+
+static void snap_load_init_state(void)
+{
+    memset(&load_state, 0, sizeof(load_state));
+}
+
+static void snap_load_destroy_state(void)
+{
+    /* TODO: implement */
+}
+
+static int snap_save(const SnapSaveParams *params)
+{
+    SnapSaveState *sn;
+
+    snap_save_init_state();
+    sn = snap_save_get_state();
+    (void) sn;
+
+    snap_save_destroy_state();
+
+    return 0;
+}
+
+static int snap_load(SnapLoadParams *params)
+{
+    SnapLoadState *sn;
+
+    snap_load_init_state();
+    sn = snap_load_get_state();
+    (void) sn;
+
+    snap_load_destroy_state();
+
+    return 0;
+}
+
+static int64_t cvtnum_full(const char *name, const char *value,
+        int64_t min, int64_t max)
+{
+    uint64_t res;
+    int err;
+
+    err = qemu_strtosz(value, NULL, &res);
+    if (err < 0 && err != -ERANGE) {
+        error_report("Invalid %s specified. You may use "
+                     "k, M, G, T, P or E suffixes for", name);
+        error_report("kilobytes, megabytes, gigabytes, terabytes, "
+                     "petabytes and exabytes.");
+        return err;
+    }
+    if (err == -ERANGE || res > max || res < min) {
+        error_report("Invalid %s specified. Must be between %" PRId64
+                     " and %" PRId64 ".", name, min, max);
+        return -ERANGE;
+    }
+
+    return res;
+}
+
+static int64_t cvtnum(const char *name, const char *value)
+{
+    return cvtnum_full(name, value, 0, INT64_MAX);
+}
+
+static bool is_2power(int64_t val)
+{
+    return val && ((val & (val - 1)) == 0);
+}
+
+static void usage(const char *name)
+{
+    printf(
+        "Usage: %s [OPTIONS] save|load FILE\n"
+        "QEMU External Snapshot Utility\n"
+        "\n"
+        "  -h, --help                display this help and exit\n"
+        "  -V, --version             output version information and exit\n"
+        "\n"
+        "General purpose options:\n"
+        "  -t, --trace [[enable=]<pattern>][,events=<file>][,file=<file>]\n"
+        "                            specify tracing options\n"
+        "\n"
+        "Image options:\n"
+        "  -s, --image-size=SIZE     size of image to create for 'save'\n"
+        "  -n, --nocache             disable host cache\n"
+        "      --cache=MODE          set cache mode (none, writeback, ...)\n"
+        "      --aio=MODE            set AIO mode (native, io_uring or threads)\n"
+        "\n"
+        "Snapshot options:\n"
+        "  -S, --page-size=SIZE      target page size\n"
+        "  -p, --postcopy=%%RAM       switch to postcopy after '%%RAM' loaded\n"
+        "\n"
+        QEMU_HELP_BOTTOM "\n", name);
+}
+
+static void version(const char *name)
+{
+    printf(
+        "%s " QEMU_FULL_VERSION "\n"
+        "Written by Andrey Gruzdev.\n"
+        "\n"
+        QEMU_COPYRIGHT "\n"
+        "This is free software; see the source for copying conditions.  There is NO\n"
+        "warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.\n",
+        name);
+}
+
+int main(int argc, char **argv)
+{
+    static const struct option l_opt[] = {
+        { "help", no_argument, NULL, 'h' },
+        { "version", no_argument, NULL, 'V' },
+        { "image-size", required_argument, NULL, 's' },
+        { "page-size", required_argument, NULL, 'S' },
+        { "postcopy", required_argument, NULL, 'p' },
+        { "nocache", no_argument, NULL, 'n' },
+        { "cache", required_argument, NULL, OPT_CACHE },
+        { "aio", required_argument, NULL, OPT_AIO },
+        { "trace", required_argument, NULL, 't' },
+        { NULL, 0, NULL, 0 }
+    };
+    static const char *s_opt = "hVs:S:p:nt:";
+
+    int ch;
+    int l_ind = 0;
+
+    bool seen_image_size = false;
+    bool seen_page_size = false;
+    bool seen_postcopy = false;
+    bool seen_cache = false;
+    bool seen_aio = false;
+    int64_t image_size = 0;
+    int64_t page_size = DEFAULT_PAGE_SIZE;
+    int64_t postcopy_percent = 0;
+    int bdrv_flags = 0;
+    bool writethrough = false;
+    bool postcopy = false;
+    const char *cmd_name;
+    const char *file_name;
+    Error *local_err = NULL;
+
+#ifdef CONFIG_POSIX
+    signal(SIGPIPE, SIG_IGN);
+#endif
+    error_init(argv[0]);
+    module_call_init(MODULE_INIT_TRACE);
+    module_call_init(MODULE_INIT_QOM);
+
+    qemu_add_opts(&qemu_trace_opts);
+    qemu_init_exec_dir(argv[0]);
+
+    while ((ch = getopt_long(argc, argv, s_opt, l_opt, &l_ind)) != -1) {
+        switch (ch) {
+        case '?':
+            error_report("Try `%s --help' for more information", argv[0]);
+            return EXIT_FAILURE;
+
+        case 's':
+            if (seen_image_size) {
+                error_report("-s and --image-size can only be specified once");
+                return EXIT_FAILURE;
+            }
+            seen_image_size = true;
+
+            image_size = cvtnum(l_opt[l_ind].name, optarg);
+            if (image_size <= 0) {
+                error_report("Invalid image size parameter '%s'", optarg);
+                return EXIT_FAILURE;
+            }
+            break;
+
+        case 'S':
+            if (seen_page_size) {
+                error_report("-S and --page-size can only be specified once");
+                return EXIT_FAILURE;
+            }
+            seen_page_size = true;
+
+            page_size = cvtnum(l_opt[l_ind].name, optarg);
+            if (page_size <= 0 || !is_2power(page_size) ||
+                    page_size > PAGE_SIZE_MAX) {
+                error_report("Invalid target page size parameter '%s'", optarg);
+                return EXIT_FAILURE;
+            }
+            break;
+
+        case 'p':
+            if (seen_postcopy) {
+                error_report("-p and --postcopy can only be specified once");
+                return EXIT_FAILURE;
+            }
+            seen_postcopy = true;
+
+            postcopy_percent = cvtnum(l_opt[l_ind].name, optarg);
+            if (!(postcopy_percent > 0 && postcopy_percent < 100)) {
+                error_report("Invalid postcopy %%RAM parameter '%s'", optarg);
+                return EXIT_FAILURE;
+            }
+            postcopy = true;
+            break;
+
+        case 'n':
+            optarg = (char *) "none";
+            /* fallthrough */
+
+        case OPT_CACHE:
+            if (seen_cache) {
+                error_report("-n and --cache can only be specified once");
+                return EXIT_FAILURE;
+            }
+            seen_cache = true;
+
+            if (bdrv_parse_cache_mode(optarg, &bdrv_flags, &writethrough)) {
+                error_report("Invalid cache mode '%s'", optarg);
+                return EXIT_FAILURE;
+            }
+            break;
+
+        case OPT_AIO:
+            if (seen_aio) {
+                error_report("--aio can only be specified once");
+                return EXIT_FAILURE;
+            }
+            seen_aio = true;
+
+            if (bdrv_parse_aio(optarg, &bdrv_flags)) {
+                error_report("Invalid AIO mode '%s'", optarg);
+                return EXIT_FAILURE;
+            }
+            break;
+
+        case 'V':
+            version(argv[0]);
+            return EXIT_SUCCESS;
+
+        case 'h':
+            usage(argv[0]);
+            return EXIT_SUCCESS;
+
+        case 't':
+            trace_opt_parse(optarg);
+            break;
+
+        }
+    }
+
+    if ((argc - optind) != 2) {
+        error_report("Invalid number of arguments");
+        return EXIT_FAILURE;
+    }
+
+    if (!trace_init_backends()) {
+        return EXIT_FAILURE;
+    }
+    trace_init_file();
+    qemu_set_log(LOG_TRACE);
+
+    if (qemu_init_main_loop(&local_err)) {
+        error_report_err(local_err);
+        return EXIT_FAILURE;
+    }
+
+    bdrv_init();
+    atexit(snap_shutdown);
+
+    cmd_name = argv[optind];
+    file_name = argv[optind + 1];
+
+    if (!strcmp(cmd_name, "save")) {
+        SnapSaveParams params = {
+            .filename = file_name,
+            .image_size = image_size,
+            .page_size = page_size,
+            .bdrv_flags = (bdrv_flags | BDRV_O_RDWR),
+            .writethrough = writethrough,
+            .fd = STDIN_FILENO };
+        int res;
+
+        if (seen_postcopy) {
+            error_report("-p and --postcopy cannot be used for 'save'");
+            return EXIT_FAILURE;
+        }
+        if (!seen_image_size) {
+            error_report("-s or --image-size is required for 'save'");
+            return EXIT_FAILURE;
+        }
+
+        res = snap_save(&params);
+        if (res < 0) {
+            return EXIT_FAILURE;
+        }
+        return EXIT_SUCCESS;
+    } else if (!strcmp(cmd_name, "load")) {
+        SnapLoadParams params = {
+            .filename = file_name,
+            .bdrv_flags = bdrv_flags,
+            .postcopy = postcopy,
+            .postcopy_percent = postcopy_percent,
+            .page_size = page_size,
+            .fd = STDOUT_FILENO,
+            .rp_fd = STDIN_FILENO };
+        int res;
+
+        if (seen_image_size) {
+            error_report("-s and --image-size cannot be used for 'load'");
+            return EXIT_FAILURE;
+        }
+
+        res = snap_load(&params);
+        if (res < 0) {
+            return EXIT_FAILURE;
+        }
+        return EXIT_SUCCESS;
+    }
+
+    error_report("Invalid command");
+    return EXIT_FAILURE;
+}
-- 
2.25.1




* [RFC PATCH 2/9] migration/snap-tool: Snapshot image create/open routines for qemu-snap tool
  2021-03-17 16:32 [RFC PATCH 0/9] migration/snap-tool: External snapshot utility Andrey Gruzdev
  2021-03-17 16:32 ` [RFC PATCH 1/9] migration/snap-tool: Introduce qemu-snap tool Andrey Gruzdev
@ 2021-03-17 16:32 ` Andrey Gruzdev
  2021-03-17 16:32 ` [RFC PATCH 3/9] migration/snap-tool: Preparations to run code in main loop context Andrey Gruzdev
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Andrey Gruzdev @ 2021-03-17 16:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

Implementation of routines for QCOW2 image creation and opening. Some
predefined parameters for image creation and opening are introduced that
provide a reasonable tradeoff between performance, file size and usability.

Thus, preallocation is disabled and the image file is kept dense on the
host file system; the apparent file size then equals the allocated size,
which is in any case beneficial for the user experience.
A larger 1MB cluster size is adopted to reduce allocation overhead and
improve I/O performance while keeping internal fragmentation of the
snapshot image reasonably small.
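
As a sanity check of the accompanying L2 cache settings: with 1MB clusters,
a single L2 table (one cluster of 8-byte entries) maps 128K * 1MB = 128GB,
so the 16MB L2 cache holds 16 such tables and covers 16M / 8 * 1MB = 2TB of
virtual image; the 1MB cache entry size makes each cache entry span exactly
one L2 table.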

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 qemu-snap.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 90 insertions(+), 4 deletions(-)

diff --git a/qemu-snap.c b/qemu-snap.c
index c7118927f7..c9f8d7166a 100644
--- a/qemu-snap.c
+++ b/qemu-snap.c
@@ -31,6 +31,16 @@
 #include "migration/qemu-file.h"
 #include "qemu-snap.h"
 
+/* QCOW2 image options */
+#define BLK_FORMAT_DRIVER       "qcow2"
+#define BLK_CREATE_OPT_STRING   "preallocation=off,lazy_refcounts=on,"  \
+                                        "extended_l2=off,compat=v3,cluster_size=1M,"    \
+                                        "refcount_bits=8"
+/* L2 cache size to cover 2TB of memory */
+#define BLK_L2_CACHE_SIZE       "16M"
+/* Single L2 cache entry for the whole L2 table */
+#define BLK_L2_CACHE_ENTRY_SIZE "1M"
+
 #define OPT_CACHE   256
 #define OPT_AIO     257
 
@@ -104,30 +114,106 @@ static void snap_load_destroy_state(void)
     /* TODO: implement */
 }
 
+static BlockBackend *snap_create(const char *filename, int64_t image_size,
+        int flags, bool writethrough)
+{
+    char *create_opt_string;
+    QDict *blk_opts;
+    BlockBackend *blk;
+    Error *local_err = NULL;
+
+    /* Create QCOW2 image with given parameters */
+    create_opt_string = g_strdup(BLK_CREATE_OPT_STRING);
+    bdrv_img_create(filename, BLK_FORMAT_DRIVER, NULL, NULL,
+            create_opt_string, image_size, flags, true, &local_err);
+    g_free(create_opt_string);
+
+    if (local_err) {
+        error_reportf_err(local_err, "Could not create '%s': ", filename);
+        goto fail;
+    }
+
+    /* Block backend open options */
+    blk_opts = qdict_new();
+    qdict_put_str(blk_opts, "driver", BLK_FORMAT_DRIVER);
+    qdict_put_str(blk_opts, "l2-cache-size", BLK_L2_CACHE_SIZE);
+    qdict_put_str(blk_opts, "l2-cache-entry-size", BLK_L2_CACHE_ENTRY_SIZE);
+
+    /* Open block backend instance for the created image */
+    blk = blk_new_open(filename, NULL, blk_opts, flags, &local_err);
+    if (!blk) {
+        error_reportf_err(local_err, "Could not open '%s': ", filename);
+        /* Delete image file */
+        qemu_unlink(filename);
+        goto fail;
+    }
+
+    blk_set_enable_write_cache(blk, !writethrough);
+    return blk;
+
+fail:
+    return NULL;
+}
+
+static BlockBackend *snap_open(const char *filename, int flags)
+{
+    QDict *blk_opts;
+    BlockBackend *blk;
+    Error *local_err = NULL;
+
+    /* Block backend open options */
+    blk_opts = qdict_new();
+    qdict_put_str(blk_opts, "driver", BLK_FORMAT_DRIVER);
+    qdict_put_str(blk_opts, "l2-cache-size", BLK_L2_CACHE_SIZE);
+    qdict_put_str(blk_opts, "l2-cache-entry-size", BLK_L2_CACHE_ENTRY_SIZE);
+
+    /* Open block backend instance */
+    blk = blk_new_open(filename, NULL, blk_opts, flags, &local_err);
+    if (!blk) {
+        error_reportf_err(local_err, "Could not open '%s': ", filename);
+        return NULL;
+    }
+
+    return blk;
+}
+
 static int snap_save(const SnapSaveParams *params)
 {
     SnapSaveState *sn;
+    int res = -1;
 
     snap_save_init_state();
     sn = snap_save_get_state();
-    (void) sn;
 
+    sn->blk = snap_create(params->filename, params->image_size,
+            params->bdrv_flags, params->writethrough);
+    if (!sn->blk) {
+        goto fail;
+    }
+
+fail:
     snap_save_destroy_state();
 
-    return 0;
+    return res;
 }
 
 static int snap_load(SnapLoadParams *params)
 {
     SnapLoadState *sn;
+    int res = -1;
 
     snap_load_init_state();
     sn = snap_load_get_state();
-    (void) sn;
 
+    sn->blk = snap_open(params->filename, params->bdrv_flags);
+    if (!sn->blk) {
+        goto fail;
+    }
+
+fail:
     snap_load_destroy_state();
 
-    return 0;
+    return res;
 }
 
 static int64_t cvtnum_full(const char *name, const char *value,
-- 
2.25.1




* [RFC PATCH 3/9] migration/snap-tool: Preparations to run code in main loop context
  2021-03-17 16:32 [RFC PATCH 0/9] migration/snap-tool: External snapshot utility Andrey Gruzdev
  2021-03-17 16:32 ` [RFC PATCH 1/9] migration/snap-tool: Introduce qemu-snap tool Andrey Gruzdev
  2021-03-17 16:32 ` [RFC PATCH 2/9] migration/snap-tool: Snapshot image create/open routines for " Andrey Gruzdev
@ 2021-03-17 16:32 ` Andrey Gruzdev
  2021-03-17 16:32 ` [RFC PATCH 4/9] migration/snap-tool: Introduce qemu_ftell2() routine to qemu-file.c Andrey Gruzdev
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Andrey Gruzdev @ 2021-03-17 16:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

The major part of the code uses QEMUFile and block layer routines, so to
take advantage of concurrent I/O operations we need to use coroutines and
run in the main loop context.

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 include/qemu-snap.h  |  3 +++
 meson.build          |  2 +-
 qemu-snap-handlers.c | 38 ++++++++++++++++++++++++++
 qemu-snap.c          | 63 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 105 insertions(+), 1 deletion(-)
 create mode 100644 qemu-snap-handlers.c

diff --git a/include/qemu-snap.h b/include/qemu-snap.h
index b8e48bfcbb..b6fd779b13 100644
--- a/include/qemu-snap.h
+++ b/include/qemu-snap.h
@@ -32,4 +32,7 @@ typedef struct SnapLoadState {
 SnapSaveState *snap_save_get_state(void);
 SnapLoadState *snap_load_get_state(void);
 
+int coroutine_fn snap_save_state_main(SnapSaveState *sn);
+int coroutine_fn snap_load_state_main(SnapLoadState *sn);
+
 #endif /* QEMU_SNAP_H */
diff --git a/meson.build b/meson.build
index 11564165ba..252c55d6a3 100644
--- a/meson.build
+++ b/meson.build
@@ -2324,7 +2324,7 @@ if have_tools
              dependencies: [block, qemuutil], install: true)
   qemu_nbd = executable('qemu-nbd', files('qemu-nbd.c'),
                dependencies: [blockdev, qemuutil, gnutls], install: true)
-  qemu_snap = executable('qemu-snap', files('qemu-snap.c'),
+  qemu_snap = executable('qemu-snap', files('qemu-snap.c', 'qemu-snap-handlers.c'),
                dependencies: [blockdev, qemuutil, migration], install: true)
 
   subdir('storage-daemon')
diff --git a/qemu-snap-handlers.c b/qemu-snap-handlers.c
new file mode 100644
index 0000000000..bdc1911909
--- /dev/null
+++ b/qemu-snap-handlers.c
@@ -0,0 +1,38 @@
+/*
+ * QEMU External Snapshot Utility
+ *
+ * Copyright Virtuozzo GmbH, 2021
+ *
+ * Authors:
+ *  Andrey Gruzdev   <andrey.gruzdev@virtuozzo.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later. See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "sysemu/block-backend.h"
+#include "qemu/coroutine.h"
+#include "qemu/cutils.h"
+#include "qemu/bitmap.h"
+#include "qemu/error-report.h"
+#include "io/channel-buffer.h"
+#include "migration/qemu-file-channel.h"
+#include "migration/qemu-file.h"
+#include "migration/savevm.h"
+#include "migration/ram.h"
+#include "qemu-snap.h"
+
+/* Save snapshot data from incoming migration stream */
+int coroutine_fn snap_save_state_main(SnapSaveState *sn)
+{
+    /* TODO: implement */
+    return 0;
+}
+
+/* Load snapshot data and send it with outgoing migration stream */
+int coroutine_fn snap_load_state_main(SnapLoadState *sn)
+{
+    /* TODO: implement */
+    return 0;
+}
diff --git a/qemu-snap.c b/qemu-snap.c
index c9f8d7166a..ec56aa55d2 100644
--- a/qemu-snap.c
+++ b/qemu-snap.c
@@ -44,6 +44,14 @@
 #define OPT_CACHE   256
 #define OPT_AIO     257
 
+/* Snapshot task execution state */
+typedef struct SnapTaskState {
+    QEMUBH *bh;                 /* BH to enter task's coroutine */
+    Coroutine *co;              /* Coroutine to execute task */
+
+    int ret;                    /* Return code, -EINPROGRESS until complete */
+} SnapTaskState;
+
 /* Parameters for snapshot saving */
 typedef struct SnapSaveParams {
     const char *filename;       /* QCOW2 image file name */
@@ -177,6 +185,51 @@ static BlockBackend *snap_open(const char *filename, int flags)
     return blk;
 }
 
+static void coroutine_fn do_snap_save_co(void *opaque)
+{
+    SnapTaskState *task_state = opaque;
+    SnapSaveState *sn = snap_save_get_state();
+
+    /* Enter main routine */
+    task_state->ret = snap_save_state_main(sn);
+}
+
+static void coroutine_fn do_snap_load_co(void *opaque)
+{
+    SnapTaskState *task_state = opaque;
+    SnapLoadState *sn = snap_load_get_state();
+
+    /* Enter main routine */
+    task_state->ret = snap_load_state_main(sn);
+}
+
+/* We use BH to enter coroutine from the main loop context */
+static void enter_co_bh(void *opaque)
+{
+    SnapTaskState *task_state = opaque;
+
+    qemu_coroutine_enter(task_state->co);
+    /* Delete BH once we entered coroutine from the main loop */
+    qemu_bh_delete(task_state->bh);
+    task_state->bh = NULL;
+}
+
+static int run_snap_task(CoroutineEntry *entry)
+{
+    SnapTaskState task_state;
+
+    task_state.bh = qemu_bh_new(enter_co_bh, &task_state);
+    task_state.co = qemu_coroutine_create(entry, &task_state);
+    task_state.ret = -EINPROGRESS;
+
+    qemu_bh_schedule(task_state.bh);
+    while (task_state.ret == -EINPROGRESS) {
+        main_loop_wait(false);
+    }
+
+    return task_state.ret;
+}
+
 static int snap_save(const SnapSaveParams *params)
 {
     SnapSaveState *sn;
@@ -191,6 +244,11 @@ static int snap_save(const SnapSaveParams *params)
         goto fail;
     }
 
+    res = run_snap_task(do_snap_save_co);
+    if (res) {
+        error_report("Failed to save snapshot: error=%d", res);
+    }
+
 fail:
     snap_save_destroy_state();
 
@@ -210,6 +268,11 @@ static int snap_load(SnapLoadParams *params)
         goto fail;
     }
 
+    res = run_snap_task(do_snap_load_co);
+    if (res) {
+        error_report("Failed to load snapshot: error=%d", res);
+    }
+
 fail:
     snap_load_destroy_state();
 
-- 
2.25.1




* [RFC PATCH 4/9] migration/snap-tool: Introduce qemu_ftell2() routine to qemu-file.c
  2021-03-17 16:32 [RFC PATCH 0/9] migration/snap-tool: External snapshot utility Andrey Gruzdev
                   ` (2 preceding siblings ...)
  2021-03-17 16:32 ` [RFC PATCH 3/9] migration/snap-tool: Preparations to run code in main loop context Andrey Gruzdev
@ 2021-03-17 16:32 ` Andrey Gruzdev
  2021-03-17 16:32 ` [RFC PATCH 5/9] migration/snap-tool: Block layer AIO support and file utility routines Andrey Gruzdev
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Andrey Gruzdev @ 2021-03-17 16:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

In several places we need to get the QEMUFile input position in terms of
the number of bytes read by the qemu_get_byte()/qemu_get_buffer() routines.

The existing qemu_ftell() returns the offset in terms of the number of bytes
read from the underlying IOChannel object, which is not suitable here.
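
For example, with the default 32KB I/O buffer, after a single qemu_get_byte()
on a fresh input file the buffered read has already fetched 32768 bytes from
the channel: qemu_ftell() reports 32768, while qemu_ftell2() reports
f->pos + f->buf_index - f->buf_size = 32768 + 1 - 32768 = 1, i.e. the number
of bytes actually consumed by the reader.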

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 migration/qemu-file.c | 6 ++++++
 migration/qemu-file.h | 1 +
 2 files changed, 7 insertions(+)

diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index d6e03dbc0e..66be5e6460 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -657,6 +657,12 @@ int64_t qemu_ftell(QEMUFile *f)
     return f->pos;
 }
 
+int64_t qemu_ftell2(QEMUFile *f)
+{
+    qemu_fflush(f);
+    return f->pos + f->buf_index - f->buf_size;
+}
+
 int qemu_file_rate_limit(QEMUFile *f)
 {
     if (f->shutdown) {
diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index a9b6d6ccb7..bd1a6def02 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -124,6 +124,7 @@ void qemu_file_set_hooks(QEMUFile *f, const QEMUFileHooks *hooks);
 int qemu_get_fd(QEMUFile *f);
 int qemu_fclose(QEMUFile *f);
 int64_t qemu_ftell(QEMUFile *f);
+int64_t qemu_ftell2(QEMUFile *f);
 int64_t qemu_ftell_fast(QEMUFile *f);
 /*
  * put_buffer without copying the buffer.
-- 
2.25.1




* [RFC PATCH 5/9] migration/snap-tool: Block layer AIO support and file utility routines
  2021-03-17 16:32 [RFC PATCH 0/9] migration/snap-tool: External snapshot utility Andrey Gruzdev
                   ` (3 preceding siblings ...)
  2021-03-17 16:32 ` [RFC PATCH 4/9] migration/snap-tool: Introduce qemu_ftell2() routine to qemu-file.c Andrey Gruzdev
@ 2021-03-17 16:32 ` Andrey Gruzdev
  2021-03-17 16:32 ` [RFC PATCH 6/9] migration/snap-tool: Move RAM_SAVE_FLAG_xxx defines to migration/ram.h Andrey Gruzdev
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Andrey Gruzdev @ 2021-03-17 16:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

Introduce support for asynchronous block layer requests with an in-order
completion guarantee, using a simple buffer descriptor ring and coroutines.

Also add support for opening a QEMUFile backed by the VMSTATE area of a
QCOW2 image, along with several file utility routines.
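
A sketch of the intended usage of the vmstate helpers (names as declared in
this patch; 'blk' is assumed to be an open BlockBackend and 'f_src' an
incoming-stream QEMUFile):

    /* Open the image's VMSTATE area for writing and stash the stream tail */
    QEMUFile *f_vmstate = qemu_fopen_bdrv_vmstate(blk_bs(blk), true);

    file_transfer_to_eof(f_vmstate, f_src);
    qemu_fclose(f_vmstate);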

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 include/qemu-snap.h |  48 +++++++
 meson.build         |   2 +-
 qemu-snap-io.c      | 325 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 374 insertions(+), 1 deletion(-)
 create mode 100644 qemu-snap-io.c

diff --git a/include/qemu-snap.h b/include/qemu-snap.h
index b6fd779b13..f4b38d6442 100644
--- a/include/qemu-snap.h
+++ b/include/qemu-snap.h
@@ -13,6 +13,11 @@
 #ifndef QEMU_SNAP_H
 #define QEMU_SNAP_H
 
+/* Synthetic value for invalid offset */
+#define INVALID_OFFSET          ((int64_t) -1)
+/* Max. byte count for QEMUFile inplace read */
+#define INPLACE_READ_MAX        (32768 - 4096)
+
 /* Target page size, if not specified explicitly in command-line */
 #define DEFAULT_PAGE_SIZE       4096
 /*
@@ -21,6 +26,31 @@
  */
 #define PAGE_SIZE_MAX           16384
 
+typedef struct AioBufferPool AioBufferPool;
+
+typedef struct AioBufferStatus {
+    /* BDRV operation start offset */
+    int64_t offset;
+    /* BDRV operation byte count or negative error code */
+    int count;
+} AioBufferStatus;
+
+typedef struct AioBuffer {
+    void *data;                 /* Data buffer */
+    int size;                   /* Size of data buffer */
+
+    AioBufferStatus status;     /* Status returned by task->func() */
+} AioBuffer;
+
+typedef struct AioBufferTask {
+    AioBuffer *buffer;          /* AIO buffer */
+
+    int64_t offset;             /* BDRV operation start offset */
+    int size;                   /* BDRV requested transfer size */
+} AioBufferTask;
+
+typedef AioBufferStatus coroutine_fn (*AioBufferFunc)(AioBufferTask *task);
+
 typedef struct SnapSaveState {
     BlockBackend *blk;          /* Block backend */
 } SnapSaveState;
@@ -35,4 +65,22 @@ SnapLoadState *snap_load_get_state(void);
 int coroutine_fn snap_save_state_main(SnapSaveState *sn);
 int coroutine_fn snap_load_state_main(SnapLoadState *sn);
 
+QEMUFile *qemu_fopen_bdrv_vmstate(BlockDriverState *bs, int is_writable);
+
+AioBufferPool *coroutine_fn aio_pool_new(int buf_align, int buf_size, int buf_count);
+void aio_pool_free(AioBufferPool *pool);
+void aio_pool_set_max_in_flight(AioBufferPool *pool, int max_in_flight);
+int aio_pool_status(AioBufferPool *pool);
+
+bool coroutine_fn aio_pool_can_acquire_next(AioBufferPool *pool);
+AioBuffer *coroutine_fn aio_pool_try_acquire_next(AioBufferPool *pool);
+AioBuffer *coroutine_fn aio_pool_wait_compl_next(AioBufferPool *pool);
+void coroutine_fn aio_buffer_release(AioBuffer *buffer);
+
+void coroutine_fn aio_buffer_start_task(AioBuffer *buffer, AioBufferFunc func,
+        int64_t offset, int size);
+
+void file_transfer_to_eof(QEMUFile *f_dst, QEMUFile *f_src);
+void file_transfer_bytes(QEMUFile *f_dst, QEMUFile *f_src, size_t size);
+
 #endif /* QEMU_SNAP_H */
diff --git a/meson.build b/meson.build
index 252c55d6a3..48f2367a5a 100644
--- a/meson.build
+++ b/meson.build
@@ -2324,7 +2324,7 @@ if have_tools
              dependencies: [block, qemuutil], install: true)
   qemu_nbd = executable('qemu-nbd', files('qemu-nbd.c'),
                dependencies: [blockdev, qemuutil, gnutls], install: true)
-  qemu_snap = executable('qemu-snap', files('qemu-snap.c', 'qemu-snap-handlers.c'),
+  qemu_snap = executable('qemu-snap', files('qemu-snap.c', 'qemu-snap-handlers.c', 'qemu-snap-io.c'),
                dependencies: [blockdev, qemuutil, migration], install: true)
 
   subdir('storage-daemon')
diff --git a/qemu-snap-io.c b/qemu-snap-io.c
new file mode 100644
index 0000000000..972c353255
--- /dev/null
+++ b/qemu-snap-io.c
@@ -0,0 +1,325 @@
+/*
+ * QEMU External Snapshot Utility
+ *
+ * Copyright Virtuozzo GmbH, 2021
+ *
+ * Authors:
+ *  Andrey Gruzdev   <andrey.gruzdev@virtuozzo.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later. See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/coroutine.h"
+#include "qemu/error-report.h"
+#include "sysemu/block-backend.h"
+#include "qapi/error.h"
+#include "migration/qemu-file.h"
+#include "qemu-snap.h"
+
+/*
+ * AIO buffer pool.
+ *
+ * Coroutine-based environment to support concurrent block layer operations
+ * providing pre-allocated data buffers and in-order completion guarantee.
+ *
+ * All routines (with the exception of aio_pool_free()) are required to be
+ * called from the same coroutine in main loop context.
+ *
+ * Call sequence to keep several pending block layer requests:
+ *
+ *   aio_pool_new()                 !
+ *                                  !
+ *   aio_pool_try_acquire_next()    !<------!<------!
+ *   aio_buffer_start_task()        !------>!       !
+ *                                  !               !
+ *   aio_pool_wait_compl_next()     !               !
+ *   aio_buffer_release()           !-------------->!
+ *                                  !
+ *   aio_pool_free()                !
+ *
+ */
+
+/* AIO buffer private struct */
+typedef struct AioBufferImpl {
+    AioBuffer user;             /* Public part */
+    AioBufferPool *pool;        /* Buffer pool */
+
+    bool acquired;              /* Buffer is acquired */
+    bool busy;                  /* Task not complete */
+} AioBufferImpl;
+
+/* AIO task private struct */
+typedef struct AioBufferTaskImpl {
+    AioBufferTask user;         /* Public part */
+    AioBufferFunc func;         /* Task func */
+} AioBufferTaskImpl;
+
+/* AIO buffer pool */
+typedef struct AioBufferPool {
+    int count;                  /* Number of AioBuffer's */
+
+    Coroutine *main_co;         /* Parent coroutine */
+    int status;                 /* Overall pool status */
+
+    /* Index of next buffer to await in-order */
+    int wait_head;
+    /* Index of next buffer to acquire in-order */
+    int acquire_tail;
+
+    /* AioBuffer that is currently awaited for task completion, or NULL */
+    AioBufferImpl *wait_on_buffer;
+
+    int in_flight;              /* AIO requests in-flight */
+    int max_in_flight;          /* Max. AIO in-flight requests */
+
+    AioBufferImpl buffers[];    /* Flex-array of AioBuffer's */
+} AioBufferPool;
+
+/* Wrapper for task->func() to maintain private state */
+static void coroutine_fn aio_buffer_co(void *opaque)
+{
+    AioBufferTaskImpl *task = (AioBufferTaskImpl *) opaque;
+    AioBufferImpl *buffer = (AioBufferImpl *) task->user.buffer;
+    AioBufferPool *pool = buffer->pool;
+
+    buffer->busy = true;
+    buffer->user.status = task->func((AioBufferTask *) task);
+    /* Update pool status in case of an error */
+    if (buffer->user.status.count < 0 && pool->status == 0) {
+        pool->status = buffer->user.status.count;
+    }
+    buffer->busy = false;
+
+    g_free(task);
+
+    if (buffer == pool->wait_on_buffer) {
+        pool->wait_on_buffer = NULL;
+        aio_co_wake(pool->main_co);
+    }
+}
+
+/* Check that aio_pool_try_acquire_next() shall succeed */
+bool coroutine_fn aio_pool_can_acquire_next(AioBufferPool *pool)
+{
+    assert(qemu_coroutine_self() == pool->main_co);
+
+    return (pool->in_flight < pool->max_in_flight) &&
+            !pool->buffers[pool->acquire_tail].acquired;
+}
+
+/* Try to acquire next buffer from the pool */
+AioBuffer *coroutine_fn aio_pool_try_acquire_next(AioBufferPool *pool)
+{
+    AioBufferImpl *buffer;
+
+    assert(qemu_coroutine_self() == pool->main_co);
+
+    if (pool->in_flight >= pool->max_in_flight) {
+        return NULL;
+    }
+
+    buffer = &pool->buffers[pool->acquire_tail];
+    if (!buffer->acquired) {
+        assert(!buffer->busy);
+
+        buffer->acquired = true;
+        pool->acquire_tail = (pool->acquire_tail + 1) % pool->count;
+
+        pool->in_flight++;
+        return (AioBuffer *) buffer;
+    }
+
+    return NULL;
+}
+
+/* Start BDRV task on acquired buffer */
+void coroutine_fn aio_buffer_start_task(AioBuffer *buffer,
+        AioBufferFunc func, int64_t offset, int size)
+{
+    AioBufferImpl *buffer_impl = (AioBufferImpl *) buffer;
+    AioBufferTaskImpl *task;
+
+    assert(qemu_coroutine_self() == buffer_impl->pool->main_co);
+    assert(buffer_impl->acquired && !buffer_impl->busy);
+    assert(size <= buffer->size);
+
+    task = g_new0(AioBufferTaskImpl, 1);
+    task->user.buffer = buffer;
+    task->user.offset = offset;
+    task->user.size = size;
+    task->func = func;
+
+    qemu_coroutine_enter(qemu_coroutine_create(aio_buffer_co, task));
+}
+
+/* Wait for buffer task completion in-order */
+AioBuffer *coroutine_fn aio_pool_wait_compl_next(AioBufferPool *pool)
+{
+    AioBufferImpl *buffer;
+
+    assert(qemu_coroutine_self() == pool->main_co);
+
+    buffer = &pool->buffers[pool->wait_head];
+    if (!buffer->acquired) {
+        return NULL;
+    }
+
+    if (buffer->busy) {
+        pool->wait_on_buffer = buffer;
+        qemu_coroutine_yield();
+
+        assert(!pool->wait_on_buffer);
+        assert(!buffer->busy);
+    }
+
+    pool->wait_head = (pool->wait_head + 1) % pool->count;
+    return (AioBuffer *) buffer;
+}
+
+/* Release buffer */
+void coroutine_fn aio_buffer_release(AioBuffer *buffer)
+{
+    AioBufferImpl *buffer_impl = (AioBufferImpl *) buffer;
+
+    assert(qemu_coroutine_self() == buffer_impl->pool->main_co);
+    assert(buffer_impl->acquired && !buffer_impl->busy);
+
+    buffer_impl->acquired = false;
+    buffer_impl->pool->in_flight--;
+}
+
+/* Create new AIO buffer pool */
+AioBufferPool *coroutine_fn aio_pool_new(int buf_align,
+        int buf_size, int buf_count)
+{
+    AioBufferPool *pool = g_malloc0(sizeof(AioBufferPool) +
+            buf_count * sizeof(pool->buffers[0]));
+
+    pool->main_co = qemu_coroutine_self();
+
+    pool->count = buf_count;
+    pool->max_in_flight = pool->count;
+
+    for (int i = 0; i < buf_count; i++) {
+        pool->buffers[i].pool = pool;
+        pool->buffers[i].user.data = qemu_memalign(buf_align, buf_size);
+        pool->buffers[i].user.size = buf_size;
+    }
+
+    return pool;
+}
+
+/* Free AIO buffer pool */
+void aio_pool_free(AioBufferPool *pool)
+{
+    for (int i = 0; i < pool->count; i++) {
+        qemu_vfree(pool->buffers[i].user.data);
+    }
+    g_free(pool);
+}
+
+/* Limit the max. number of in-flight BDRV tasks/requests */
+void aio_pool_set_max_in_flight(AioBufferPool *pool, int max_in_flight)
+{
+    assert(max_in_flight > 0);
+    pool->max_in_flight = MIN(max_in_flight, pool->count);
+}
+
+/* Get overall pool operation status */
+int aio_pool_status(AioBufferPool *pool)
+{
+    return pool->status;
+}
+
+static ssize_t bdrv_vmstate_writev_buffer(void *opaque, struct iovec *iov,
+        int iovcnt, int64_t pos, Error **errp)
+{
+    int ret;
+    QEMUIOVector qiov;
+
+    qemu_iovec_init_external(&qiov, iov, iovcnt);
+    ret = bdrv_writev_vmstate((BlockDriverState *) opaque, &qiov, pos);
+    if (ret < 0) {
+        return ret;
+    }
+
+    return qiov.size;
+}
+
+static ssize_t bdrv_vmstate_get_buffer(void *opaque, uint8_t *buf,
+        int64_t pos, size_t size, Error **errp)
+{
+    return bdrv_load_vmstate((BlockDriverState *) opaque, buf, pos, size);
+}
+
+static int bdrv_vmstate_fclose(void *opaque, Error **errp)
+{
+    return bdrv_flush((BlockDriverState *) opaque);
+}
+
+static const QEMUFileOps bdrv_vmstate_read_ops = {
+    .get_buffer     = bdrv_vmstate_get_buffer,
+    .close          = bdrv_vmstate_fclose,
+};
+
+static const QEMUFileOps bdrv_vmstate_write_ops = {
+    .writev_buffer  = bdrv_vmstate_writev_buffer,
+    .close          = bdrv_vmstate_fclose,
+};
+
+/* Create QEMUFile object to access vmstate area of the image */
+QEMUFile *qemu_fopen_bdrv_vmstate(BlockDriverState *bs, int is_writable)
+{
+    if (is_writable) {
+        return qemu_fopen_ops(bs, &bdrv_vmstate_write_ops);
+    }
+    return qemu_fopen_ops(bs, &bdrv_vmstate_read_ops);
+}
+
+/*
+ * Transfer data from the source QEMUFile to the destination
+ * until we reach EOF on the source.
+ */
+void file_transfer_to_eof(QEMUFile *f_dst, QEMUFile *f_src)
+{
+    bool eof = false;
+
+    while (!eof) {
+        const size_t size = INPLACE_READ_MAX;
+        uint8_t *buffer = NULL;
+        size_t count;
+
+        count = qemu_peek_buffer(f_src, &buffer, size, 0);
+        qemu_file_skip(f_src, count);
+        /* Reached stream EOF? */
+        if (count != size) {
+            eof = true;
+        }
+
+        qemu_put_buffer(f_dst, buffer, count);
+    }
+}
+
+/* Transfer given number of bytes from source QEMUFile to destination */
+void file_transfer_bytes(QEMUFile *f_dst, QEMUFile *f_src, size_t size)
+{
+    size_t rest = size;
+
+    while (rest) {
+        uint8_t *ptr = NULL;
+        size_t req_size;
+        size_t count;
+
+        req_size = MIN(rest, INPLACE_READ_MAX);
+        count = qemu_peek_buffer(f_src, &ptr, req_size, 0);
+        qemu_file_skip(f_src, count);
+
+        qemu_put_buffer(f_dst, ptr, count);
+        rest -= count;
+    }
+}
-- 
2.25.1




* [RFC PATCH 6/9] migration/snap-tool: Move RAM_SAVE_FLAG_xxx defines to migration/ram.h
  2021-03-17 16:32 [RFC PATCH 0/9] migration/snap-tool: External snapshot utility Andrey Gruzdev
                   ` (4 preceding siblings ...)
  2021-03-17 16:32 ` [RFC PATCH 5/9] migration/snap-tool: Block layer AIO support and file utility routines Andrey Gruzdev
@ 2021-03-17 16:32 ` Andrey Gruzdev
  2021-03-17 16:32 ` [RFC PATCH 7/9] migration/snap-tool: Complete implementation of snapshot saving Andrey Gruzdev
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Andrey Gruzdev @ 2021-03-17 16:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

Move RAM_SAVE_FLAG_xxx defines from migration/ram.c to migration/ram.h

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 migration/ram.c | 16 ----------------
 migration/ram.h | 16 ++++++++++++++++
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 52537f14ac..d3da0c8208 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -65,22 +65,6 @@
 /***********************************************************/
 /* ram save/restore */
 
-/* RAM_SAVE_FLAG_ZERO used to be named RAM_SAVE_FLAG_COMPRESS, it
- * worked for pages that where filled with the same char.  We switched
- * it to only search for the zero value.  And to avoid confusion with
- * RAM_SSAVE_FLAG_COMPRESS_PAGE just rename it.
- */
-
-#define RAM_SAVE_FLAG_FULL     0x01 /* Obsolete, not used anymore */
-#define RAM_SAVE_FLAG_ZERO     0x02
-#define RAM_SAVE_FLAG_MEM_SIZE 0x04
-#define RAM_SAVE_FLAG_PAGE     0x08
-#define RAM_SAVE_FLAG_EOS      0x10
-#define RAM_SAVE_FLAG_CONTINUE 0x20
-#define RAM_SAVE_FLAG_XBZRLE   0x40
-/* 0x80 is reserved in migration.h start with 0x100 next */
-#define RAM_SAVE_FLAG_COMPRESS_PAGE    0x100
-
 static inline bool is_zero_range(uint8_t *p, uint64_t size)
 {
     return buffer_is_zero(p, size);
diff --git a/migration/ram.h b/migration/ram.h
index 6378bb3ebc..c6bad8bbdf 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -33,6 +33,22 @@
 #include "exec/cpu-common.h"
 #include "io/channel.h"
 
+/* RAM_SAVE_FLAG_ZERO used to be named RAM_SAVE_FLAG_COMPRESS, it
+ * worked for pages that where filled with the same char.  We switched
+ * it to only search for the zero value.  And to avoid confusion with
+ * RAM_SSAVE_FLAG_COMPRESS_PAGE just rename it.
+ */
+
+#define RAM_SAVE_FLAG_FULL     0x01 /* Obsolete, not used anymore */
+#define RAM_SAVE_FLAG_ZERO     0x02
+#define RAM_SAVE_FLAG_MEM_SIZE 0x04
+#define RAM_SAVE_FLAG_PAGE     0x08
+#define RAM_SAVE_FLAG_EOS      0x10
+#define RAM_SAVE_FLAG_CONTINUE 0x20
+#define RAM_SAVE_FLAG_XBZRLE   0x40
+/* 0x80 is reserved in migration.h start with 0x100 next */
+#define RAM_SAVE_FLAG_COMPRESS_PAGE    0x100
+
 extern MigrationStats ram_counters;
 extern XBZRLECacheStats xbzrle_counters;
 extern CompressionStats compression_counters;
-- 
2.25.1




* [RFC PATCH 7/9] migration/snap-tool: Complete implementation of snapshot saving
  2021-03-17 16:32 [RFC PATCH 0/9] migration/snap-tool: External snapshot utility Andrey Gruzdev
                   ` (5 preceding siblings ...)
  2021-03-17 16:32 ` [RFC PATCH 6/9] migration/snap-tool: Move RAM_SAVE_FLAG_xxx defines to migration/ram.h Andrey Gruzdev
@ 2021-03-17 16:32 ` Andrey Gruzdev
  2021-03-17 16:32 ` [RFC PATCH 8/9] migration/snap-tool: Implementation of snapshot loading in precopy Andrey Gruzdev
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Andrey Gruzdev @ 2021-03-17 16:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

Includes code to parse the incoming migration stream, dispatch data to
section handlers and deal with the complications of the open-coded migration
format, without introducing strong dependencies on QEMU migration code.
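
In outline, the dispatching works roughly as in the sketch below (a simplified
reconstruction, not the literal code from this patch: error paths and section
header stashing are omitted, and with 'send-section-footer' enabled each
iteration would also call check_section_footer()):

    static int dispatch_sections(QEMUFile *f)
    {
        SectionHandlersEntry *se;
        int res;

        for (;;) {
            uint8_t type = qemu_get_byte(f);

            if (type == QEMU_VM_EOF) {
                return 0;
            }
            if (type == QEMU_VM_SECTION_START || type == QEMU_VM_SECTION_FULL) {
                int section_id = qemu_get_be32(f);
                uint8_t len = qemu_get_byte(f);
                char idstr[256];
                int instance_id, version_id;

                qemu_get_buffer(f, (uint8_t *) idstr, len);
                idstr[len] = '\0';
                instance_id = qemu_get_be32(f);
                version_id = qemu_get_be32(f);

                se = find_se(idstr, instance_id);
                if (!se) {
                    se = &section_handlers.default_entry;
                }
                se->state_section_id = section_id;
                se->state_version_id = version_id;
            } else {
                /* QEMU_VM_SECTION_PART or QEMU_VM_SECTION_END */
                se = find_se_by_section_id(qemu_get_be32(f));
                if (!se) {
                    return -EINVAL;
                }
            }

            res = se->ops->save_section(f, se, se->state_version_id);
            if (res) {
                return res;
            }
        }
    }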

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 include/qemu-snap.h  |  42 +++
 qemu-snap-handlers.c | 717 ++++++++++++++++++++++++++++++++++++++++++-
 qemu-snap.c          |  54 +++-
 3 files changed, 810 insertions(+), 3 deletions(-)

diff --git a/include/qemu-snap.h b/include/qemu-snap.h
index f4b38d6442..570f200c9d 100644
--- a/include/qemu-snap.h
+++ b/include/qemu-snap.h
@@ -51,8 +51,47 @@ typedef struct AioBufferTask {
 
 typedef AioBufferStatus coroutine_fn (*AioBufferFunc)(AioBufferTask *task);
 
+typedef struct QIOChannelBuffer QIOChannelBuffer;
+
 typedef struct SnapSaveState {
+    const char *filename;       /* Image file name */
     BlockBackend *blk;          /* Block backend */
+
+    QEMUFile *f_fd;             /* Incoming migration stream QEMUFile */
+    QEMUFile *f_vmstate;        /* Block backend vmstate area QEMUFile */
+    /*
+     * Buffer to stash the first few KBs of the incoming stream; part of it
+     * will later go to the VMSTATE area of the image file. Specifically,
+     * these are the VM state header, the configuration section and the
+     * section which contains the RAM block list.
+     */
+    QIOChannelBuffer *ioc_lbuf;
+    /* Page coalescing buffer channel */
+    QIOChannelBuffer *ioc_pbuf;
+
+    /* BDRV offset matching start of ioc_pbuf */
+    int64_t bdrv_offset;
+    /* Last BDRV offset saved to ioc_pbuf */
+    int64_t last_bdrv_offset;
+
+    /* Stream read position, updated at the beginning of each new section */
+    int64_t stream_pos;
+
+    /* Stream read position at the beginning of RAM block list section */
+    int64_t ram_list_pos;
+    /* Stream read position at the beginning of the first RAM data section */
+    int64_t ram_pos;
+    /* Stream read position at the beginning of the first device state section */
+    int64_t device_pos;
+
+    /* Final status */
+    int status;
+
+    /*
+     * Keep the first few bytes from the beginning of each section for the
+     * case when we meet a device state section and go into 'default_handler'.
+     */
+    uint8_t section_header[512];
 } SnapSaveState;
 
 typedef struct SnapLoadState {
@@ -65,6 +104,9 @@ SnapLoadState *snap_load_get_state(void);
 int coroutine_fn snap_save_state_main(SnapSaveState *sn);
 int coroutine_fn snap_load_state_main(SnapLoadState *sn);
 
+void snap_ram_init_state(int page_bits);
+void snap_ram_destroy_state(void);
+
 QEMUFile *qemu_fopen_bdrv_vmstate(BlockDriverState *bs, int is_writable);
 
 AioBufferPool *coroutine_fn aio_pool_new(int buf_align, int buf_size, int buf_count);
diff --git a/qemu-snap-handlers.c b/qemu-snap-handlers.c
index bdc1911909..4b63d42a29 100644
--- a/qemu-snap-handlers.c
+++ b/qemu-snap-handlers.c
@@ -23,11 +23,704 @@
 #include "migration/ram.h"
 #include "qemu-snap.h"
 
+/* BDRV vmstate area MAGIC for state header */
+#define VMSTATE_MAGIC               0x5354564d
+/* BDRV vmstate area header size */
+#define VMSTATE_HEADER_SIZE         28
+/* BDRV vmstate area header eof_pos field offset */
+#define VMSTATE_HEADER_EOF_OFFSET   24
+
+/* Alignment of QEMU RAM block on backing storage */
+#define BLK_RAM_BLOCK_ALIGN         (1024 * 1024)
+/* Max. byte count for page coalescing buffer */
+#define PAGE_COALESC_MAX            (512 * 1024)
+
+/* RAM block descriptor */
+typedef struct RAMBlockDesc {
+    int64_t bdrv_offset;        /* Offset on backing storage */
+    int64_t length;             /* RAM block used_length */
+    int64_t nr_pages;           /* RAM block page count (length >> page_bits) */
+
+    char idstr[256];            /* RAM block id string */
+
+    /* Link into ram_list */
+    QSIMPLEQ_ENTRY(RAMBlockDesc) next;
+} RAMBlockDesc;
+
+/* State reflecting RAM part of snapshot */
+typedef struct RAMState {
+    int64_t page_size;          /* Page size */
+    int64_t page_mask;          /* Page mask */
+    int page_bits;              /* Page size bits */
+
+    int64_t normal_pages;       /* Total number of normal (non-zero) pages */
+
+    /* List of RAM blocks */
+    QSIMPLEQ_HEAD(, RAMBlockDesc) ram_block_list;
+} RAMState;
+
+/* Section handler ops */
+typedef struct SectionHandlerOps {
+    int (*save_section)(QEMUFile *f, void *opaque, int version_id);
+    int (*load_section)(QEMUFile *f, void *opaque, int version_id);
+} SectionHandlerOps;
+
+/* Section handler */
+typedef struct SectionHandlersEntry {
+    const char *idstr;          /* Section id string */
+    const int instance_id;      /* Section instance id */
+    const int version_id;       /* Max. supported section version id */
+
+    int state_section_id;       /* Section id from migration stream */
+    int state_version_id;       /* Version id from migration stream */
+
+    SectionHandlerOps *ops;     /* Section handler callbacks */
+} SectionHandlersEntry;
+
+/* Available section handlers */
+typedef struct SectionHandlers {
+    /* Handler for sections not matched in the 'entries' array */
+    SectionHandlersEntry default_entry;
+    /* Array of section save/load handlers */
+    SectionHandlersEntry entries[];
+} SectionHandlers;
+
+#define SECTION_HANDLERS_ENTRY(_idstr, _instance_id, _version_id, _ops) {   \
+    .idstr          = _idstr,           \
+    .instance_id    = (_instance_id),   \
+    .version_id     = (_version_id),    \
+    .ops            = (_ops),           \
+}
+
+#define SECTION_HANDLERS_END()  { NULL, }
+
+/* Forward declarations */
+static int default_save(QEMUFile *f, void *opaque, int version_id);
+static int ram_save(QEMUFile *f, void *opaque, int version_id);
+static int save_state_complete(SnapSaveState *sn);
+
+static RAMState ram_state;
+
+static SectionHandlerOps default_handler_ops = {
+    .save_section = default_save,
+    .load_section = NULL,
+};
+
+static SectionHandlerOps ram_handler_ops = {
+    .save_section = ram_save,
+    .load_section = NULL,
+};
+
+static SectionHandlers section_handlers = {
+    .default_entry =
+        SECTION_HANDLERS_ENTRY("default", 0, 0, &default_handler_ops),
+    .entries = {
+        SECTION_HANDLERS_ENTRY("ram", 0, 4, &ram_handler_ops),
+        SECTION_HANDLERS_END(),
+    },
+};
+
+static SectionHandlersEntry *find_se(const char *idstr, int instance_id)
+{
+    SectionHandlersEntry *se;
+
+    for (se = section_handlers.entries; se->idstr; se++) {
+        if (!strcmp(se->idstr, idstr) && (instance_id == se->instance_id)) {
+            return se;
+        }
+    }
+    return NULL;
+}
+
+static SectionHandlersEntry *find_se_by_section_id(int section_id)
+{
+    SectionHandlersEntry *se;
+
+    for (se = section_handlers.entries; se->idstr; se++) {
+        if (section_id == se->state_section_id) {
+            return se;
+        }
+    }
+    return NULL;
+}
+
+static bool check_section_footer(QEMUFile *f, SectionHandlersEntry *se)
+{
+    uint8_t token;
+    int section_id;
+
+    token = qemu_get_byte(f);
+    if (token != QEMU_VM_SECTION_FOOTER) {
+        error_report("Missing footer for section '%s'", se->idstr);
+        return false;
+    }
+
+    section_id = qemu_get_be32(f);
+    if (section_id != se->state_section_id) {
+        error_report("Mismatched section_id in footer for section '%s':"
+                     " read_id=%d expected_id=%d",
+                se->idstr, section_id, se->state_section_id);
+        return false;
+    }
+    return true;
+}
+
+static inline
+bool ram_offset_in_block(RAMBlockDesc *block, int64_t offset)
+{
+    return (block && (offset < block->length));
+}
+
+static inline
+bool ram_bdrv_offset_in_block(RAMBlockDesc *block, int64_t bdrv_offset)
+{
+    return (block && (bdrv_offset >= block->bdrv_offset) &&
+            (bdrv_offset < block->bdrv_offset + block->length));
+}
+
+static inline
+int64_t ram_bdrv_from_block_offset(RAMBlockDesc *block, int64_t offset)
+{
+    if (!ram_offset_in_block(block, offset)) {
+        return INVALID_OFFSET;
+    }
+
+    return block->bdrv_offset + offset;
+}
+
+static inline
+int64_t ram_block_offset_from_bdrv(RAMBlockDesc *block, int64_t bdrv_offset)
+{
+    int64_t offset;
+
+    if (!block) {
+        return INVALID_OFFSET;
+    }
+    offset = bdrv_offset - block->bdrv_offset;
+    return offset >= 0 ? offset : INVALID_OFFSET;
+}
+
+static RAMBlockDesc *ram_block_by_idstr(const char *idstr)
+{
+    RAMBlockDesc *block;
+
+    QSIMPLEQ_FOREACH(block, &ram_state.ram_block_list, next) {
+        if (!strcmp(idstr, block->idstr)) {
+            return block;
+        }
+    }
+    return NULL;
+}
+
+static RAMBlockDesc *ram_block_from_stream(QEMUFile *f, int flags)
+{
+    static RAMBlockDesc *block;
+    char idstr[256];
+
+    if (flags & RAM_SAVE_FLAG_CONTINUE) {
+        if (!block) {
+            error_report("Corrupted 'ram' section: offset=0x%" PRIx64,
+                    qemu_ftell2(f));
+            return NULL;
+        }
+        return block;
+    }
+
+    if (!qemu_get_counted_string(f, idstr)) {
+        return NULL;
+    }
+    block = ram_block_by_idstr(idstr);
+    if (!block) {
+        error_report("Can't find RAM block '%s'", idstr);
+        return NULL;
+    }
+    return block;
+}
+
+static int64_t ram_block_next_bdrv_offset(void)
+{
+    RAMBlockDesc *last_block;
+    int64_t offset;
+
+    last_block = QSIMPLEQ_LAST(&ram_state.ram_block_list, RAMBlockDesc, next);
+    if (!last_block) {
+        return 0;
+    }
+    offset = last_block->bdrv_offset + last_block->length;
+    return ROUND_UP(offset, BLK_RAM_BLOCK_ALIGN);
+}
+
+static void ram_block_add(const char *idstr, int64_t size)
+{
+    RAMBlockDesc *block;
+
+    block = g_new0(RAMBlockDesc, 1);
+    block->length = size;
+    block->bdrv_offset = ram_block_next_bdrv_offset();
+    strcpy(block->idstr, idstr);
+
+    QSIMPLEQ_INSERT_TAIL(&ram_state.ram_block_list, block, next);
+}
+
+static void ram_block_list_from_stream(QEMUFile *f, int64_t mem_size)
+{
+    int64_t total_ram_bytes;
+
+    total_ram_bytes = mem_size;
+    while (total_ram_bytes > 0) {
+        char idstr[256];
+        int64_t size;
+
+        if (!qemu_get_counted_string(f, idstr)) {
+            error_report("Can't get RAM block id string in 'ram' "
+                         "MEM_SIZE: offset=0x%" PRIx64 " error=%d",
+                    qemu_ftell2(f), qemu_file_get_error(f));
+            return;
+        }
+        size = (int64_t) qemu_get_be64(f);
+
+        ram_block_add(idstr, size);
+        total_ram_bytes -= size;
+    }
+    if (total_ram_bytes != 0) {
+        error_report("Mismatched MEM_SIZE vs sum of RAM block lengths:"
+                     " mem_size=%" PRId64 " block_sum=%" PRId64,
+                mem_size, (mem_size - total_ram_bytes));
+    }
+}
+
+static void save_check_file_errors(SnapSaveState *sn, int *res)
+{
+    /* Check for -EIO that indicates plain EOF */
+    if (*res == -EIO) {
+        *res = 0;
+    }
+    /* Check file errors for success and -EINVAL retcodes */
+    if (*res >= 0 || *res == -EINVAL) {
+        int f_res;
+
+        f_res = qemu_file_get_error(sn->f_fd);
+        f_res = (f_res == -EIO) ? 0 : f_res;
+        if (!f_res) {
+            f_res = qemu_file_get_error(sn->f_vmstate);
+        }
+        *res = f_res ? f_res : *res;
+    }
+}
+
+static int ram_save_page(SnapSaveState *sn, uint8_t *page_ptr, int64_t bdrv_offset)
+{
+    size_t pbuf_usage = sn->ioc_pbuf->usage;
+    int page_size = ram_state.page_size;
+    int res = 0;
+
+    if (bdrv_offset != sn->last_bdrv_offset ||
+        (pbuf_usage + page_size) >= PAGE_COALESC_MAX) {
+        if (pbuf_usage) {
+            /* Flush coalesced pages to block device */
+            res = blk_pwrite(sn->blk, sn->bdrv_offset,
+                    sn->ioc_pbuf->data, pbuf_usage, 0);
+            res = res < 0 ? res : 0;
+        }
+
+        /* Reset coalescing buffer state */
+        sn->ioc_pbuf->usage = 0;
+        sn->ioc_pbuf->offset = 0;
+        /* Switch to new starting bdrv_offset */
+        sn->bdrv_offset = bdrv_offset;
+    }
+
+    qio_channel_write(QIO_CHANNEL(sn->ioc_pbuf),
+            (char *) page_ptr, page_size, NULL);
+    sn->last_bdrv_offset = bdrv_offset + page_size;
+    return res;
+}
+
+static int ram_save_page_flush(SnapSaveState *sn)
+{
+    size_t pbuf_usage = sn->ioc_pbuf->usage;
+    int res = 0;
+
+    if (pbuf_usage) {
+        /* Flush coalesced pages to block device */
+        res = blk_pwrite(sn->blk, sn->bdrv_offset,
+                sn->ioc_pbuf->data, pbuf_usage, 0);
+        res = res < 0 ? res : 0;
+    }
+
+    /* Reset coalescing buffer state */
+    sn->ioc_pbuf->usage = 0;
+    sn->ioc_pbuf->offset = 0;
+
+    sn->last_bdrv_offset = INVALID_OFFSET;
+    return res;
+}
+
+static int ram_save(QEMUFile *f, void *opaque, int version_id)
+{
+    SnapSaveState *sn = (SnapSaveState *) opaque;
+    RAMState *rs = &ram_state;
+    int incompat_flags = (RAM_SAVE_FLAG_COMPRESS_PAGE | RAM_SAVE_FLAG_XBZRLE);
+    int page_size = rs->page_size;
+    int flags = 0;
+    int res = 0;
+
+    if (version_id != 4) {
+        error_report("Unsupported version %d for 'ram' handler v4", version_id);
+        return -EINVAL;
+    }
+
+    while (!res && !(flags & RAM_SAVE_FLAG_EOS)) {
+        RAMBlockDesc *block;
+        int64_t bdrv_offset = INVALID_OFFSET;
+        uint8_t *page_ptr;
+        ssize_t count;
+        int64_t addr;
+
+        addr = qemu_get_be64(f);
+        flags = addr & ~rs->page_mask;
+        addr &= rs->page_mask;
+
+        if (flags & incompat_flags) {
+            error_report("RAM page with incompatible flags: offset=0x%" PRIx64
+                         " flags=0x%x", qemu_ftell2(f), flags);
+            res = -EINVAL;
+            break;
+        }
+
+        if (flags & (RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_PAGE)) {
+            block = ram_block_from_stream(f, flags);
+            bdrv_offset = ram_bdrv_from_block_offset(block, addr);
+            if (bdrv_offset == INVALID_OFFSET) {
+                error_report("Corrupted RAM page: offset=0x%" PRIx64
+                             " page_addr=0x%" PRIx64, qemu_ftell2(f), addr);
+                res = -EINVAL;
+                break;
+            }
+        }
+
+        switch (flags & ~RAM_SAVE_FLAG_CONTINUE) {
+        case RAM_SAVE_FLAG_MEM_SIZE:
+            /* Save position of the section containing list of RAM blocks */
+            if (sn->ram_list_pos) {
+                error_report("Unexpected RAM page with FLAG_MEM_SIZE:"
+                             " offset=0x%" PRIx64 " page_addr=0x%" PRIx64
+                             " flags=0x%x", qemu_ftell2(f), addr, flags);
+                res = -EINVAL;
+                break;
+            }
+            sn->ram_list_pos = sn->stream_pos;
+
+            /* Fill RAM block list */
+            ram_block_list_from_stream(f, addr);
+            break;
+
+        case RAM_SAVE_FLAG_ZERO:
+            /* Nothing to do with zero page */
+            qemu_get_byte(f);
+            break;
+
+        case RAM_SAVE_FLAG_PAGE:
+            count = qemu_peek_buffer(f, &page_ptr, page_size, 0);
+            qemu_file_skip(f, count);
+            if (count != page_size) {
+                /* I/O error */
+                break;
+            }
+
+            res = ram_save_page(sn, page_ptr, bdrv_offset);
+            /* Update normal page count */
+            ram_state.normal_pages++;
+            break;
+
+        case RAM_SAVE_FLAG_EOS:
+            /* Normal exit */
+            break;
+
+        default:
+            error_report("RAM page with unknown combination of flags:"
+                         " offset=0x%" PRIx64 " page_addr=0x%" PRIx64
+                         " flags=0x%x", qemu_ftell2(f), addr, flags);
+            res = -EINVAL;
+        }
+
+        /* Make additional check for file errors */
+        if (!res) {
+            res = qemu_file_get_error(f);
+        }
+    }
+
+    /* Flush page coalescing buffer at RAM_SAVE_FLAG_EOS */
+    if (!res) {
+        res = ram_save_page_flush(sn);
+    }
+    return res;
+}
+
+static int default_save(QEMUFile *f, void *opaque, int version_id)
+{
+    SnapSaveState *sn = (SnapSaveState *) opaque;
+
+    if (!sn->ram_pos) {
+        error_report("Section with unknown ID before first 'ram' section:"
+                     " offset=0x%" PRIx64, sn->stream_pos);
+        return -EINVAL;
+    }
+    if (!sn->device_pos) {
+        sn->device_pos = sn->stream_pos;
+        /*
+         * Save the rest of migration data needed to restore VM state.
+         * It is the header, configuration section, first 'ram' section
+         * with the list of RAM blocks and device state data.
+         */
+        return save_state_complete(sn);
+    }
+
+    /* Should never get here */
+    assert(false);
+    return -EINVAL;
+}
+
+static int save_state_complete(SnapSaveState *sn)
+{
+    QEMUFile *f = sn->f_fd;
+    int64_t pos;
+    int64_t eof_pos;
+
+    /* Current read position */
+    pos = qemu_ftell2(f);
+
+    /* Put specific MAGIC at the beginning of saved BDRV vmstate stream */
+    qemu_put_be32(sn->f_vmstate, VMSTATE_MAGIC);
+    /* Target page size */
+    qemu_put_be32(sn->f_vmstate, ram_state.page_size);
+    /* Number of normal (non-zero) pages in snapshot */
+    qemu_put_be64(sn->f_vmstate, ram_state.normal_pages);
+    /* Offset of RAM block list section relative to QEMU_VM_FILE_MAGIC */
+    qemu_put_be32(sn->f_vmstate, sn->ram_list_pos);
+    /* Offset of first device state section relative to QEMU_VM_FILE_MAGIC */
+    qemu_put_be32(sn->f_vmstate, sn->ram_pos);
+    /*
+     * Reserve a slot here for the EOF offset since we don't yet know
+     * how long the rest of the migration stream is.
+     */
+    qemu_put_be32(sn->f_vmstate, 0);
+
+    /*
+     * At the completion stage we save the leading part of the migration
+     * stream, which contains the header, the configuration section and the
+     * 'ram' section of QEMU_VM_SECTION_FULL type with the list of RAM blocks.
+     *
+     * All of this comes before the first QEMU_VM_SECTION_PART token for 'ram';
+     * that token's position is kept in sn->ram_pos.
+     */
+    qemu_put_buffer(sn->f_vmstate, sn->ioc_lbuf->data, sn->ram_pos);
+    /*
+     * And then we save the trailing part with the device state.
+     *
+     * First we take the section header, which has already been consumed
+     * from QEMUFile but is still available in sn->section_header.
+     */
+    qemu_put_buffer(sn->f_vmstate, sn->section_header, (pos - sn->device_pos));
+
+    /* Forward the rest of stream data to the BDRV vmstate file */
+    file_transfer_to_eof(sn->f_vmstate, f);
+    /* It does qemu_fflush() internally */
+    eof_pos = qemu_ftell(sn->f_vmstate);
+
+    /* Hack: simulate negative seek() */
+    qemu_update_position(sn->f_vmstate,
+            (size_t)(ssize_t) (VMSTATE_HEADER_EOF_OFFSET - eof_pos));
+    qemu_put_be32(sn->f_vmstate, eof_pos - VMSTATE_HEADER_SIZE);
+    /* Final flush to deliver eof_offset header field */
+    qemu_fflush(sn->f_vmstate);
+
+    return 1;
+}
+
+static int save_section_config(SnapSaveState *sn)
+{
+    QEMUFile *f = sn->f_fd;
+    uint32_t id_len;
+
+    id_len = qemu_get_be32(f);
+    if (id_len > 255) {
+        error_report("Corrupted QEMU_VM_CONFIGURATION section");
+        return -EINVAL;
+    }
+    qemu_file_skip(f, id_len);
+    return 0;
+}
+
+static int save_section_start_full(SnapSaveState *sn)
+{
+    QEMUFile *f = sn->f_fd;
+    SectionHandlersEntry *se;
+    int section_id;
+    int instance_id;
+    int version_id;
+    char id_str[256];
+    int res;
+
+    /* Read section start */
+    section_id = qemu_get_be32(f);
+    if (!qemu_get_counted_string(f, id_str)) {
+        return qemu_file_get_error(f);
+    }
+    instance_id = qemu_get_be32(f);
+    version_id = qemu_get_be32(f);
+
+    se = find_se(id_str, instance_id);
+    if (!se) {
+        se = &section_handlers.default_entry;
+    } else if (version_id > se->version_id) {
+        /* Validate version */
+        error_report("Unsupported version %d for '%s' v%d",
+                version_id, id_str, se->version_id);
+        return -EINVAL;
+    }
+
+    se->state_section_id = section_id;
+    se->state_version_id = version_id;
+
+    res = se->ops->save_section(f, sn, se->state_version_id);
+    /*
+     * Positive return value indicates save completion,
+     * no need to check section footer.
+     */
+    if (res) {
+        return res;
+    }
+
+    /* Finally check section footer */
+    if (!check_section_footer(f, se)) {
+        return -EINVAL;
+    }
+    return 0;
+}
+
+static int save_section_part_end(SnapSaveState *sn)
+{
+    QEMUFile *f = sn->f_fd;
+    SectionHandlersEntry *se;
+    int section_id;
+    int res;
+
+    /* First section with QEMU_VM_SECTION_PART type must be the 'ram' section */
+    if (!sn->ram_pos) {
+        sn->ram_pos = sn->stream_pos;
+    }
+
+    section_id = qemu_get_be32(f);
+    se = find_se_by_section_id(section_id);
+    if (!se) {
+        error_report("Unknown section ID: %d", section_id);
+        return -EINVAL;
+    }
+
+    res = se->ops->save_section(f, sn, se->state_version_id);
+    if (res) {
+        error_report("Error while saving section: id_str='%s' section_id=%d",
+                se->idstr, section_id);
+        return res;
+    }
+
+    /* Finally check section footer */
+    if (!check_section_footer(f, se)) {
+        return -EINVAL;
+    }
+    return 0;
+}
+
+static int save_state_header(SnapSaveState *sn)
+{
+    QEMUFile *f = sn->f_fd;
+    uint32_t v;
+
+    /* Validate QEMU MAGIC */
+    v = qemu_get_be32(f);
+    if (v != QEMU_VM_FILE_MAGIC) {
+        error_report("Not a migration stream");
+        return -EINVAL;
+    }
+    v = qemu_get_be32(f);
+    if (v == QEMU_VM_FILE_VERSION_COMPAT) {
+        error_report("SaveVM v2 format is obsolete");
+        return -EINVAL;
+    }
+    if (v != QEMU_VM_FILE_VERSION) {
+        error_report("Unsupported migration stream version");
+        return -EINVAL;
+    }
+    return 0;
+}
+
 /* Save snapshot data from incoming migration stream */
 int coroutine_fn snap_save_state_main(SnapSaveState *sn)
 {
-    /* TODO: implement */
-    return 0;
+    QEMUFile *f = sn->f_fd;
+    uint8_t *buf;
+    uint8_t section_type;
+    int res = 0;
+
+    res = save_state_header(sn);
+    if (res) {
+        save_check_file_errors(sn, &res);
+        return res;
+    }
+
+    while (!res) {
+        /* Update current stream position so it points to the section type token */
+        sn->stream_pos = qemu_ftell2(f);
+
+        /*
+         * Keep some data from the start of the section so we can replay it if
+         * this turns out to be device state and we go into the default handler.
+         */
+        size_t peek_count = qemu_peek_buffer(f, &buf, sizeof(sn->section_header), 0);
+        memcpy(sn->section_header, buf, peek_count);
+
+        /* Read section type token */
+        section_type = qemu_get_byte(f);
+
+        switch (section_type) {
+        case QEMU_VM_CONFIGURATION:
+            res = save_section_config(sn);
+            break;
+
+        case QEMU_VM_SECTION_FULL:
+        case QEMU_VM_SECTION_START:
+            res = save_section_start_full(sn);
+            break;
+
+        case QEMU_VM_SECTION_PART:
+        case QEMU_VM_SECTION_END:
+            res = save_section_part_end(sn);
+            break;
+
+        case QEMU_VM_EOF:
+            /*
+             * End of migration stream. Normally we never really get here, since
+             * the final part of the stream is a series of QEMU_VM_SECTION_FULL
+             * sections holding non-iterable device state. In our case all of that
+             * state is saved with a single call to save_section_start_full()
+             * when we first meet an unknown section id string.
+             */
+            res = -EINVAL;
+            break;
+
+        default:
+            error_report("Unknown section type %d", section_type);
+            res = -EINVAL;
+        }
+
+        /* Additional check for file errors on success and -EINVAL */
+        save_check_file_errors(sn, &res);
+    }
+
+    /* Replace positive exit code with 0 */
+    sn->status = res < 0 ? res : 0;
+    return sn->status;
 }
 
 /* Load snapshot data and send it with outgoing migration stream */
@@ -36,3 +729,23 @@ int coroutine_fn snap_load_state_main(SnapLoadState *sn)
     /* TODO: implement */
     return 0;
 }
+
+/* Initialize snapshot RAM state */
+void snap_ram_init_state(int page_bits)
+{
+    RAMState *rs = &ram_state;
+
+    memset(rs, 0, sizeof(ram_state));
+
+    rs->page_bits = page_bits;
+    rs->page_size = (int64_t) 1 << page_bits;
+    rs->page_mask = ~(rs->page_size - 1);
+
+    /* Initialize RAM block list head */
+    QSIMPLEQ_INIT(&rs->ram_block_list);
+}
+
+/* Destroy snapshot RAM state */
+void snap_ram_destroy_state(void)
+{
+}
diff --git a/qemu-snap.c b/qemu-snap.c
index ec56aa55d2..a337a7667b 100644
--- a/qemu-snap.c
+++ b/qemu-snap.c
@@ -105,11 +105,31 @@ SnapLoadState *snap_load_get_state(void)
 static void snap_save_init_state(void)
 {
     memset(&save_state, 0, sizeof(save_state));
+    save_state.status = -1;
 }
 
 static void snap_save_destroy_state(void)
 {
-    /* TODO: implement */
+    SnapSaveState *sn = snap_save_get_state();
+
+    if (sn->ioc_lbuf) {
+        object_unref(OBJECT(sn->ioc_lbuf));
+    }
+    if (sn->ioc_pbuf) {
+        object_unref(OBJECT(sn->ioc_pbuf));
+    }
+    if (sn->f_vmstate) {
+        qemu_fclose(sn->f_vmstate);
+    }
+    if (sn->blk) {
+        blk_flush(sn->blk);
+        blk_unref(sn->blk);
+
+        /* Delete image file in case of failure */
+        if (sn->status) {
+            qemu_unlink(sn->filename);
+        }
+    }
 }
 
 static void snap_load_init_state(void)
@@ -190,6 +210,8 @@ static void coroutine_fn do_snap_save_co(void *opaque)
     SnapTaskState *task_state = opaque;
     SnapSaveState *sn = snap_save_get_state();
 
+    /* Switch to non-blocking mode in coroutine context */
+    qemu_file_set_blocking(sn->f_fd, false);
     /* Enter main routine */
     task_state->ret = snap_save_state_main(sn);
 }
@@ -233,17 +255,46 @@ static int run_snap_task(CoroutineEntry *entry)
 static int snap_save(const SnapSaveParams *params)
 {
     SnapSaveState *sn;
+    QIOChannel *ioc_fd;
+    uint8_t *buf;
+    size_t count;
     int res = -1;
 
+    snap_ram_init_state(ctz64(params->page_size));
     snap_save_init_state();
     sn = snap_save_get_state();
 
+    sn->filename = params->filename;
+
+    ioc_fd = qio_channel_new_fd(params->fd, NULL);
+    qio_channel_set_name(QIO_CHANNEL(ioc_fd), "snap-channel-incoming");
+    sn->f_fd = qemu_fopen_channel_input(ioc_fd);
+    object_unref(OBJECT(ioc_fd));
+
+    /* Create buffer channel to store leading part of incoming stream */
+    sn->ioc_lbuf = qio_channel_buffer_new(INPLACE_READ_MAX);
+    qio_channel_set_name(QIO_CHANNEL(sn->ioc_lbuf), "snap-leader-buffer");
+
+    count = qemu_peek_buffer(sn->f_fd, &buf, INPLACE_READ_MAX, 0);
+    res = qemu_file_get_error(sn->f_fd);
+    if (res < 0) {
+        goto fail;
+    }
+    qio_channel_write(QIO_CHANNEL(sn->ioc_lbuf), (char *) buf, count, NULL);
+
+    /* Used for incoming page coalescing */
+    sn->ioc_pbuf = qio_channel_buffer_new(128 * 1024);
+    qio_channel_set_name(QIO_CHANNEL(sn->ioc_pbuf), "snap-page-buffer");
+
     sn->blk = snap_create(params->filename, params->image_size,
             params->bdrv_flags, params->writethrough);
     if (!sn->blk) {
         goto fail;
     }
 
+    /* Open QEMUFile so we can write to BDRV vmstate area */
+    sn->f_vmstate = qemu_fopen_bdrv_vmstate(blk_bs(sn->blk), 1);
+
     res = run_snap_task(do_snap_save_co);
     if (res) {
         error_report("Failed to save snapshot: error=%d", res);
@@ -251,6 +302,7 @@ static int snap_save(const SnapSaveParams *params)
 
 fail:
     snap_save_destroy_state();
+    snap_ram_destroy_state();
 
     return res;
 }
-- 
2.25.1
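
For orientation, the 28-byte header that save_state_complete() above writes
into the BDRV vmstate area can be pictured as the following packed struct.
This is an illustrative sketch only: the patch emits each field individually
with qemu_put_be32()/qemu_put_be64(), so all fields are big-endian on disk.

  #include <stdint.h>

  /* Sketch of the BDRV vmstate area header; sizeof() matches
   * VMSTATE_HEADER_SIZE (28) and the last field sits at
   * VMSTATE_HEADER_EOF_OFFSET (24). */
  typedef struct __attribute__((packed)) VMStateAreaHeader {
      uint32_t magic;           /* VMSTATE_MAGIC                            */
      uint32_t page_size;       /* target page size                         */
      uint64_t normal_pages;    /* non-zero page count in snapshot          */
      uint32_t ram_list_offset; /* RAM block list section, relative to
                                 * QEMU_VM_FILE_MAGIC                       */
      uint32_t device_offset;   /* where device state starts in the saved
                                 * stream (sn->ram_pos)                     */
      uint32_t eof_offset;      /* stream EOF, fixed up at completion       */
  } VMStateAreaHeader;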



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH 8/9] migration/snap-tool: Implementation of snapshot loading in precopy
  2021-03-17 16:32 [RFC PATCH 0/9] migration/snap-tool: External snapshot utility Andrey Gruzdev
                   ` (6 preceding siblings ...)
  2021-03-17 16:32 ` [RFC PATCH 7/9] migration/snap-tool: Complete implementation of snapshot saving Andrey Gruzdev
@ 2021-03-17 16:32 ` Andrey Gruzdev
  2021-03-17 16:32 ` [RFC PATCH 9/9] migration/snap-tool: Implementation of snapshot loading in postcopy Andrey Gruzdev
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Andrey Gruzdev @ 2021-03-17 16:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

This part implements snapshot loading in precopy mode.
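
In outline, the load side added here proceeds as follows (a sketch of
snap_load_state_main() and its helpers introduced below, not literal code;
the real code checks for errors at every step):

  int res;

  load_state_header(sn);      /* validate VMSTATE_MAGIC, read area offsets */
  load_setup_ramlist(sn);     /* parse RAM block list, set up page bitmaps */
  load_send_leader(sn);       /* replay stashed header/config sections     */
  do {
      res = load_send_ram_iterate(sn);  /* AIO-read pages, stream them out */
  } while (!res);
  load_send_complete(sn);     /* forward the trailing device state         */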

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 include/qemu-snap.h  |  24 ++
 qemu-snap-handlers.c | 586 ++++++++++++++++++++++++++++++++++++++++++-
 qemu-snap.c          |  44 +++-
 3 files changed, 649 insertions(+), 5 deletions(-)

diff --git a/include/qemu-snap.h b/include/qemu-snap.h
index 570f200c9d..f9f6db529f 100644
--- a/include/qemu-snap.h
+++ b/include/qemu-snap.h
@@ -26,6 +26,11 @@
  */
 #define PAGE_SIZE_MAX           16384
 
+/* Buffer size for RAM chunk loads via AIO buffer_pool */
+#define AIO_BUFFER_SIZE         (1024 * 1024)
+/* Max. concurrent AIO tasks */
+#define AIO_TASKS_MAX           8
+
 typedef struct AioBufferPool AioBufferPool;
 
 typedef struct AioBufferStatus {
@@ -96,6 +101,25 @@ typedef struct SnapSaveState {
 
 typedef struct SnapLoadState {
     BlockBackend *blk;          /* Block backend */
+
+    QEMUFile *f_fd;             /* Outgoing migration stream QEMUFile */
+    QEMUFile *f_vmstate;        /* Block backend vmstate area QEMUFile */
+    /*
+     * Buffer to keep the first few KBs of BDRV vmstate stashed at the
+     * start. Within this buffer are the VM state header, the configuration
+     * section and the first 'ram' section with the RAM block list.
+     */
+    QIOChannelBuffer *ioc_lbuf;
+
+    /* AIO buffer pool */
+    AioBufferPool *aio_pool;
+
+    /* BDRV vmstate offset of RAM block list section */
+    int64_t state_ram_list_offset;
+    /* BDRV vmstate offset of the first device section */
+    int64_t state_device_offset;
+    /* BDRV vmstate End-Of-File */
+    int64_t state_eof;
 } SnapLoadState;
 
 SnapSaveState *snap_save_get_state(void);
diff --git a/qemu-snap-handlers.c b/qemu-snap-handlers.c
index 4b63d42a29..7dfe950829 100644
--- a/qemu-snap-handlers.c
+++ b/qemu-snap-handlers.c
@@ -41,12 +41,22 @@ typedef struct RAMBlockDesc {
     int64_t length;             /* RAM block used_length */
     int64_t nr_pages;           /* RAM block page count (length >> page_bits) */
 
+    int64_t last_offset;        /* Last offset sent in precopy */
+
     char idstr[256];            /* RAM block id string */
 
+    unsigned long *bitmap;      /* Loaded pages bitmap */
+
     /* Link into ram_list */
     QSIMPLEQ_ENTRY(RAMBlockDesc) next;
 } RAMBlockDesc;
 
+/* Reference to a RAM page as a (block, page) tuple */
+typedef struct RAMPageRef {
+    RAMBlockDesc *block;        /* RAM block containing page */
+    int64_t page;               /* Page index in RAM block */
+} RAMPageRef;
+
 /* State reflecting RAM part of snapshot */
 typedef struct RAMState {
     int64_t page_size;          /* Page size */
@@ -54,6 +64,15 @@ typedef struct RAMState {
     int page_bits;              /* Page size bits */
 
     int64_t normal_pages;       /* Total number of normal (non-zero) pages */
+    int64_t loaded_pages;       /* Current number of normal pages loaded */
+
+    /* Last RAM block touched by load_send_ram_iterate() */
+    RAMBlockDesc *last_block;
+    /* Last page touched by load_send_ram_iterate() */
+    int64_t last_page;
+
+    /* Last RAM block sent by load_send_ram_iterate() */
+    RAMBlockDesc *last_sent_block;
 
     /* List of RAM blocks */
     QSIMPLEQ_HEAD(, RAMBlockDesc) ram_block_list;
@@ -96,19 +115,22 @@ typedef struct SectionHandlers {
 
 /* Forward declarations */
 static int default_save(QEMUFile *f, void *opaque, int version_id);
+static int default_load(QEMUFile *f, void *opaque, int version_id);
 static int ram_save(QEMUFile *f, void *opaque, int version_id);
+static int ram_load(QEMUFile *f, void *opaque, int version_id);
 static int save_state_complete(SnapSaveState *sn);
+static int coroutine_fn load_send_pages_flush(SnapLoadState *sn);
 
 static RAMState ram_state;
 
 static SectionHandlerOps default_handler_ops = {
     .save_section = default_save,
-    .load_section = NULL,
+    .load_section = default_load,
 };
 
 static SectionHandlerOps ram_handler_ops = {
     .save_section = ram_save,
-    .load_section = NULL,
+    .load_section = ram_load,
 };
 
 static SectionHandlers section_handlers = {
@@ -212,6 +234,18 @@ static RAMBlockDesc *ram_block_by_idstr(const char *idstr)
     return NULL;
 }
 
+static RAMBlockDesc *ram_block_by_bdrv_offset(int64_t bdrv_offset)
+{
+    RAMBlockDesc *block;
+
+    QSIMPLEQ_FOREACH(block, &ram_state.ram_block_list, next) {
+        if (ram_bdrv_offset_in_block(block, bdrv_offset)) {
+            return block;
+        }
+    }
+    return NULL;
+}
+
 static RAMBlockDesc *ram_block_from_stream(QEMUFile *f, int flags)
 {
     static RAMBlockDesc *block;
@@ -289,6 +323,36 @@ static void ram_block_list_from_stream(QEMUFile *f, int64_t mem_size)
     }
 }
 
+static void ram_block_list_init_bitmaps(void)
+{
+    RAMBlockDesc *block;
+
+    QSIMPLEQ_FOREACH(block, &ram_state.ram_block_list, next) {
+        block->nr_pages = block->length >> ram_state.page_bits;
+
+        block->bitmap = bitmap_new(block->nr_pages);
+        bitmap_set(block->bitmap, 0, block->nr_pages);
+    }
+}
+
+static inline
+int64_t ram_block_bitmap_find_next(RAMBlockDesc *block, int64_t start)
+{
+    return find_next_bit(block->bitmap, block->nr_pages, start);
+}
+
+static inline
+int64_t ram_block_bitmap_find_next_clear(RAMBlockDesc *block, int64_t start)
+{
+    return find_next_zero_bit(block->bitmap, block->nr_pages, start);
+}
+
+static inline
+void ram_block_bitmap_clear(RAMBlockDesc *block, int64_t start, int64_t count)
+{
+    bitmap_clear(block->bitmap, start, count);
+}
+
 static void save_check_file_errors(SnapSaveState *sn, int *res)
 {
     /* Check for -EIO that indicates plain EOF */
@@ -723,11 +787,517 @@ int coroutine_fn snap_save_state_main(SnapSaveState *sn)
     return sn->status;
 }
 
+static void load_check_file_errors(SnapLoadState *sn, int *res)
+{
+    /* Check file errors even on success */
+    if (*res >= 0 || *res == -EINVAL) {
+        int f_res;
+
+        f_res = qemu_file_get_error(sn->f_fd);
+        if (!f_res) {
+            f_res = qemu_file_get_error(sn->f_vmstate);
+        }
+        *res = f_res ? f_res : *res;
+    }
+}
+
+static int ram_load(QEMUFile *f, void *opaque, int version_id)
+{
+    int compat_flags = (RAM_SAVE_FLAG_MEM_SIZE | RAM_SAVE_FLAG_EOS);
+    int64_t page_mask = ram_state.page_mask;
+    int flags = 0;
+    int res = 0;
+
+    if (version_id != 4) {
+        error_report("Unsupported version %d for 'ram' handler v4", version_id);
+        return -EINVAL;
+    }
+
+    while (!res && !(flags & RAM_SAVE_FLAG_EOS)) {
+        int64_t addr;
+
+        addr = qemu_get_be64(f);
+        flags = addr & ~page_mask;
+        addr &= page_mask;
+
+        if (flags & ~compat_flags) {
+            error_report("RAM page with incompatible flags: offset=0x%" PRIx64
+                         " flags=0x%x", qemu_ftell2(f), flags);
+            res = -EINVAL;
+            break;
+        }
+
+        switch (flags) {
+        case RAM_SAVE_FLAG_MEM_SIZE:
+            /* Fill RAM block list */
+            ram_block_list_from_stream(f, addr);
+            break;
+
+        case RAM_SAVE_FLAG_EOS:
+            /* Normal exit */
+            break;
+
+        default:
+            error_report("RAM page with unknown combination of flags:"
+                         " offset=0x%" PRIx64 " page_addr=0x%" PRIx64
+                         " flags=0x%x", qemu_ftell2(f), addr, flags);
+            res = -EINVAL;
+        }
+
+        /* Check for file errors even if all looks good */
+        if (!res) {
+            res = qemu_file_get_error(f);
+        }
+    }
+    return res;
+}
+
+static int default_load(QEMUFile *f, void *opaque, int version_id)
+{
+    error_report("Section with unknown ID: offset=0x%" PRIx64,
+            qemu_ftell2(f));
+    return -EINVAL;
+}
+
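+/*
+ * Write a 'ram' page header: the page offset with flags in the low bits,
+ * followed by the block id string unless RAM_SAVE_FLAG_CONTINUE is set.
+ */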
+static void send_page_header(QEMUFile *f, RAMBlockDesc *block, int64_t offset)
+{
+    uint8_t hdr_buf[512];
+    int hdr_len = 8;
+
+    stq_be_p(hdr_buf, offset);
+    if (!(offset & RAM_SAVE_FLAG_CONTINUE)) {
+        int id_len;
+
+        id_len = strlen(block->idstr);
+        assert(id_len < 256);
+
+        hdr_buf[hdr_len] = id_len;
+        memcpy((hdr_buf + hdr_len + 1), block->idstr, id_len);
+
+        hdr_len += 1 + id_len;
+    }
+
+    qemu_put_buffer(f, hdr_buf, hdr_len);
+}
+
+static void send_zeropage(QEMUFile *f, RAMBlockDesc *block, int64_t offset)
+{
+    send_page_header(f, block, offset | RAM_SAVE_FLAG_ZERO);
+    qemu_put_byte(f, 0);
+}
+
+static int send_pages_from_buffer(QEMUFile *f, AioBuffer *buffer)
+{
+    RAMState *rs = &ram_state;
+    int page_size = rs->page_size;
+    RAMBlockDesc *block = rs->last_sent_block;
+    int64_t bdrv_offset = buffer->status.offset;
+    int64_t flags = RAM_SAVE_FLAG_CONTINUE;
+    int pages = 0;
+
+    /* Need to switch to another RAM block? */
+    if (!ram_bdrv_offset_in_block(block, bdrv_offset)) {
+        /*
+         * Look up the RAM block by BDRV offset since in postcopy we
+         * can issue AIO buffer loads from non-contiguous blocks.
+         */
+        block = ram_block_by_bdrv_offset(bdrv_offset);
+        rs->last_sent_block = block;
+        /* Reset RAM_SAVE_FLAG_CONTINUE */
+        flags = 0;
+    }
+
+    for (int offset = 0; offset < buffer->status.count;
+            offset += page_size, bdrv_offset += page_size) {
+        void *page_buf = buffer->data + offset;
+        int64_t addr;
+
+        addr = ram_block_offset_from_bdrv(block, bdrv_offset);
+
+        if (buffer_is_zero(page_buf, page_size)) {
+            send_zeropage(f, block, (addr | flags));
+        } else {
+            send_page_header(f, block,
+                    (addr | RAM_SAVE_FLAG_PAGE | flags));
+            qemu_put_buffer_async(f, page_buf, page_size, false);
+
+            /* Update non-zero page count */
+            rs->loaded_pages++;
+        }
+        /*
+         * An AioBuffer never crosses a RAM block boundary, so we can
+         * set RAM_SAVE_FLAG_CONTINUE here unconditionally.
+         */
+        flags = RAM_SAVE_FLAG_CONTINUE;
+        pages++;
+    }
+
+    /* Need to flush since we use qemu_put_buffer_async() */
+    qemu_fflush(f);
+    return pages;
+}
+
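+/*
+ * Find the next page still marked unsent in the block bitmaps, continuing
+ * from the last visited position and wrapping around to the list head once;
+ * returns false when no unsent pages remain.
+ */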
+static bool find_next_unsent_page(RAMPageRef *p_ref)
+{
+    RAMState *rs = &ram_state;
+    RAMBlockDesc *block = rs->last_block;
+    int64_t page = rs->last_page;
+    bool found = false;
+    bool full_round = false;
+
+    if (!block) {
+restart:
+        block = QSIMPLEQ_FIRST(&rs->ram_block_list);
+        page = 0;
+        full_round = true;
+    }
+
+    while (!found && block) {
+        page = ram_block_bitmap_find_next(block, page);
+        if (page >= block->nr_pages) {
+            block = QSIMPLEQ_NEXT(block, next);
+            page = 0;
+            continue;
+        }
+        found = true;
+    }
+
+    if (!found && !full_round) {
+        goto restart;
+    }
+
+    if (found) {
+        p_ref->block = block;
+        p_ref->page = page;
+    }
+    return found;
+}
+
+static inline
+void get_unsent_page_range(RAMPageRef *p_ref, RAMBlockDesc **block,
+        int64_t *offset, int64_t *limit)
+{
+    int64_t page_limit;
+
+    *block = p_ref->block;
+    *offset = p_ref->page << ram_state.page_bits;
+    page_limit = ram_block_bitmap_find_next_clear(p_ref->block, (p_ref->page + 1));
+    *limit = page_limit << ram_state.page_bits;
+}
+
+static AioBufferStatus coroutine_fn load_buffers_task_co(AioBufferTask *task)
+{
+    SnapLoadState *sn = snap_load_get_state();
+    AioBufferStatus ret;
+    int count;
+
+    count = blk_pread(sn->blk, task->offset, task->buffer->data, task->size);
+
+    ret.offset = task->offset;
+    ret.count = count;
+
+    return ret;
+}
+
+static void coroutine_fn load_buffers_fill_queue(SnapLoadState *sn)
+{
+    RAMState *rs = &ram_state;
+    RAMPageRef p_ref;
+    RAMBlockDesc *block;
+    int64_t offset;
+    int64_t limit;
+    int64_t pages;
+
+    if (!find_next_unsent_page(&p_ref)) {
+        return;
+    }
+
+    get_unsent_page_range(&p_ref, &block, &offset, &limit);
+
+    do {
+        AioBuffer *buffer;
+        int64_t bdrv_offset;
+        int size;
+
+        /* Try to acquire next buffer from the pool */
+        buffer = aio_pool_try_acquire_next(sn->aio_pool);
+        if (!buffer) {
+            break;
+        }
+
+        bdrv_offset = ram_bdrv_from_block_offset(block, offset);
+        assert(bdrv_offset != INVALID_OFFSET);
+
+        /* Get maximum transfer size for current RAM block and offset */
+        size = MIN((limit - offset), buffer->size);
+        aio_buffer_start_task(buffer, load_buffers_task_co, bdrv_offset, size);
+
+        offset += size;
+    } while (offset < limit);
+
+    rs->last_block = block;
+    rs->last_page = offset >> rs->page_bits;
+
+    block->last_offset = offset;
+
+    pages = rs->last_page - p_ref.page;
+    ram_block_bitmap_clear(block, p_ref.page, pages);
+}
+
+static int coroutine_fn load_send_pages(SnapLoadState *sn)
+{
+    AioBuffer *compl_buffer;
+    int pages = 0;
+
+    load_buffers_fill_queue(sn);
+
+    compl_buffer = aio_pool_wait_compl_next(sn->aio_pool);
+    if (compl_buffer) {
+        /* Check AIO completion status */
+        pages = aio_pool_status(sn->aio_pool);
+        if (pages < 0) {
+            return pages;
+        }
+
+        pages = send_pages_from_buffer(sn->f_fd, compl_buffer);
+        aio_buffer_release(compl_buffer);
+    }
+
+    return pages;
+}
+
+static int coroutine_fn load_send_pages_flush(SnapLoadState *sn)
+{
+    AioBuffer *compl_buffer;
+
+    while ((compl_buffer = aio_pool_wait_compl_next(sn->aio_pool))) {
+        int res = aio_pool_status(sn->aio_pool);
+        /* Check AIO completion status */
+        if (res < 0) {
+            return res;
+        }
+
+        send_pages_from_buffer(sn->f_fd, compl_buffer);
+        aio_buffer_release(compl_buffer);
+    }
+
+    return 0;
+}
+
+static void send_section_header_part_end(QEMUFile *f, SectionHandlersEntry *se,
+        uint8_t section_type)
+{
+    assert(section_type == QEMU_VM_SECTION_PART ||
+           section_type == QEMU_VM_SECTION_END);
+
+    qemu_put_byte(f, section_type);
+    qemu_put_be32(f, se->state_section_id);
+}
+
+static void send_section_footer(QEMUFile *f, SectionHandlersEntry *se)
+{
+    qemu_put_byte(f, QEMU_VM_SECTION_FOOTER);
+    qemu_put_be32(f, se->state_section_id);
+}
+
+#define YIELD_AFTER_MS  500 /* ms */
+
+static int coroutine_fn load_send_ram_iterate(SnapLoadState *sn)
+{
+    SectionHandlersEntry *se;
+    int64_t t_start;
+    int tmp_res;
+    int res = 1;
+
+    /* Send 'ram' section header with QEMU_VM_SECTION_PART type */
+    se = find_se("ram", 0);
+    send_section_header_part_end(sn->f_fd, se, QEMU_VM_SECTION_PART);
+
+    t_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+    for (int iter = 0; res > 0; iter++) {
+        res = load_send_pages(sn);
+
+        if (!(iter & 7)) {
+            int64_t t_cur = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+            if ((t_cur - t_start) > YIELD_AFTER_MS) {
+                break;
+            }
+        }
+    }
+
+    /* Zero return code means that there are no more pages to send */
+    if (res >= 0) {
+        res = res ? 0 : 1;
+    }
+
+    /* Flush AIO buffers since some may still remain unsent */
+    tmp_res = load_send_pages_flush(sn);
+    res = tmp_res ? tmp_res : res;
+
+    /* Send EOS flag before section footer */
+    qemu_put_be64(sn->f_fd, RAM_SAVE_FLAG_EOS);
+    send_section_footer(sn->f_fd, se);
+
+    qemu_fflush(sn->f_fd);
+    return res;
+}
+
+static int load_send_leader(SnapLoadState *sn)
+{
+    qemu_put_buffer(sn->f_fd, (sn->ioc_lbuf->data + VMSTATE_HEADER_SIZE),
+            sn->state_device_offset);
+    return qemu_file_get_error(sn->f_fd);
+}
+
+static int load_send_complete(SnapLoadState *sn)
+{
+    /* Transfer device state to the output pipe */
+    file_transfer_bytes(sn->f_fd, sn->f_vmstate,
+            (sn->state_eof - sn->state_device_offset));
+    qemu_fflush(sn->f_fd);
+    return 1;
+}
+
+static int load_section_start_full(SnapLoadState *sn)
+{
+    QEMUFile *f = sn->f_vmstate;
+    int section_id;
+    int instance_id;
+    int version_id;
+    char idstr[256];
+    SectionHandlersEntry *se;
+    int res;
+
+    /* Read section start */
+    section_id = qemu_get_be32(f);
+    if (!qemu_get_counted_string(f, idstr)) {
+        return qemu_file_get_error(f);
+    }
+    instance_id = qemu_get_be32(f);
+    version_id = qemu_get_be32(f);
+
+    se = find_se(idstr, instance_id);
+    if (!se) {
+        se = &section_handlers.default_entry;
+    } else if (version_id > se->version_id) {
+        /* Validate version */
+        error_report("Unsupported version %d for '%s' v%d",
+                version_id, idstr, se->version_id);
+        return -EINVAL;
+    }
+
+    se->state_section_id = section_id;
+    se->state_version_id = version_id;
+
+    res = se->ops->load_section(f, sn, se->state_version_id);
+    if (res) {
+        return res;
+    }
+
+    /* Finally check section footer */
+    if (!check_section_footer(f, se)) {
+        return -EINVAL;
+    }
+    return 0;
+}
+
+static int load_setup_ramlist(SnapLoadState *sn)
+{
+    QEMUFile *f = sn->f_vmstate;
+    uint8_t section_type;
+    int64_t section_pos;
+    int res;
+
+    section_pos = qemu_ftell2(f);
+
+    /* Read section type token */
+    section_type = qemu_get_byte(f);
+    if (section_type == QEMU_VM_EOF) {
+        error_report("Unexpected EOF token: offset=0x%" PRIx64, section_pos);
+        return -EINVAL;
+    } else if (section_type != QEMU_VM_SECTION_FULL &&
+               section_type != QEMU_VM_SECTION_START) {
+        error_report("Unexpected section type %d: offset=0x%" PRIx64,
+                section_type, section_pos);
+        return -EINVAL;
+    }
+
+    res = load_section_start_full(sn);
+    if (!res) {
+        ram_block_list_init_bitmaps();
+    }
+    return res;
+}
+
+static int load_state_header(SnapLoadState *sn)
+{
+    QEMUFile *f = sn->f_vmstate;
+    uint32_t v;
+
+    /* Validate specific MAGIC in vmstate area */
+    v = qemu_get_be32(f);
+    if (v != VMSTATE_MAGIC) {
+        error_report("Not a valid VMSTATE");
+        return -EINVAL;
+    }
+    v = qemu_get_be32(f);
+    if (v != ram_state.page_size) {
+        error_report("VMSTATE page size not matching target");
+        return -EINVAL;
+    }
+
+    /* Number of non-zero pages in all RAM blocks */
+    ram_state.normal_pages = qemu_get_be64(f);
+
+    /* VMSTATE area offsets, counted from QEMU_FILE_MAGIC */
+    sn->state_ram_list_offset = qemu_get_be32(f);
+    sn->state_device_offset = qemu_get_be32(f);
+    sn->state_eof = qemu_get_be32(f);
+
+    /* Check that offsets are within the limits */
+    if ((VMSTATE_HEADER_SIZE + sn->state_device_offset) > INPLACE_READ_MAX ||
+        sn->state_device_offset <= sn->state_ram_list_offset) {
+        error_report("Corrupted VMSTATE header");
+        return -EINVAL;
+    }
+
+    /* Skip up to RAM block list section */
+    qemu_file_skip(f, sn->state_ram_list_offset);
+    return 0;
+}
+
 /* Load snapshot data and send it with outgoing migration stream */
 int coroutine_fn snap_load_state_main(SnapLoadState *sn)
 {
-    /* TODO: implement */
-    return 0;
+    int res;
+
+    res = load_state_header(sn);
+    if (res) {
+        goto fail;
+    }
+    res = load_setup_ramlist(sn);
+    if (res) {
+        goto fail;
+    }
+    res = load_send_leader(sn);
+    if (res) {
+        goto fail;
+    }
+
+    do {
+        res = load_send_ram_iterate(sn);
+        /* Make additional check for file errors */
+        load_check_file_errors(sn, &res);
+    } while (!res);
+
+    if (res == 1) {
+        res = load_send_complete(sn);
+    }
+
+fail:
+    load_check_file_errors(sn, &res);
+    /* Replace positive exit code with 0 */
+    return res < 0 ? res : 0;
 }
 
 /* Initialize snapshot RAM state */
@@ -748,4 +1318,12 @@ void snap_ram_init_state(int page_bits)
 /* Destroy snapshot RAM state */
 void snap_ram_destroy_state(void)
 {
+    RAMBlockDesc *block;
+    RAMBlockDesc *next_block;
+
+    /* Free RAM blocks */
+    QSIMPLEQ_FOREACH_SAFE(block, &ram_state.ram_block_list, next, next_block) {
+        g_free(block->bitmap);
+        g_free(block);
+    }
 }
diff --git a/qemu-snap.c b/qemu-snap.c
index a337a7667b..c5efbd6803 100644
--- a/qemu-snap.c
+++ b/qemu-snap.c
@@ -139,7 +139,20 @@ static void snap_load_init_state(void)
 
 static void snap_load_destroy_state(void)
 {
-    /* TODO: implement */
+    SnapLoadState *sn = snap_load_get_state();
+
+    if (sn->aio_pool) {
+        aio_pool_free(sn->aio_pool);
+    }
+    if (sn->ioc_lbuf) {
+        object_unref(OBJECT(sn->ioc_lbuf));
+    }
+    if (sn->f_vmstate) {
+        qemu_fclose(sn->f_vmstate);
+    }
+    if (sn->blk) {
+        blk_unref(sn->blk);
+    }
 }
 
 static BlockBackend *snap_create(const char *filename, int64_t image_size,
@@ -221,6 +234,12 @@ static void coroutine_fn do_snap_load_co(void *opaque)
     SnapTaskState *task_state = opaque;
     SnapLoadState *sn = snap_load_get_state();
 
+    /* Switch to non-blocking mode in coroutine context */
+    qemu_file_set_blocking(sn->f_vmstate, false);
+    qemu_file_set_blocking(sn->f_fd, false);
+    /* Initialize AIO buffer pool in coroutine context */
+    sn->aio_pool = aio_pool_new(DEFAULT_PAGE_SIZE, AIO_BUFFER_SIZE,
+            AIO_TASKS_MAX);
     /* Enter main routine */
     task_state->ret = snap_load_state_main(sn);
 }
@@ -310,15 +329,37 @@ fail:
 static int snap_load(SnapLoadParams *params)
 {
     SnapLoadState *sn;
+    QIOChannel *ioc_fd;
+    uint8_t *buf;
+    size_t count;
     int res = -1;
 
+    snap_ram_init_state(ctz64(params->page_size));
     snap_load_init_state();
     sn = snap_load_get_state();
 
+    ioc_fd = qio_channel_new_fd(params->fd, NULL);
+    qio_channel_set_name(QIO_CHANNEL(ioc_fd), "snap-channel-outgoing");
+    sn->f_fd = qemu_fopen_channel_output(ioc_fd);
+    object_unref(OBJECT(ioc_fd));
+
     sn->blk = snap_open(params->filename, params->bdrv_flags);
     if (!sn->blk) {
         goto fail;
     }
+    /* Open QEMUFile for BDRV vmstate area */
+    sn->f_vmstate = qemu_fopen_bdrv_vmstate(blk_bs(sn->blk), 0);
+
+    /* Create buffer channel to store leading part of VMSTATE stream */
+    sn->ioc_lbuf = qio_channel_buffer_new(INPLACE_READ_MAX);
+    qio_channel_set_name(QIO_CHANNEL(sn->ioc_lbuf), "snap-leader-buffer");
+
+    count = qemu_peek_buffer(sn->f_vmstate, &buf, INPLACE_READ_MAX, 0);
+    res = qemu_file_get_error(sn->f_vmstate);
+    if (res < 0) {
+        goto fail;
+    }
+    qio_channel_write(QIO_CHANNEL(sn->ioc_lbuf), (char *) buf, count, NULL);
 
     res = run_snap_task(do_snap_load_co);
     if (res) {
@@ -327,6 +368,7 @@ static int snap_load(SnapLoadParams *params)
 
 fail:
     snap_load_destroy_state();
+    snap_ram_destroy_state();
 
     return res;
 }
-- 
2.25.1
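
A side note on the AIO machinery used above: load_buffers_fill_queue() and
load_send_pages()/load_send_pages_flush() together form a simple
producer/consumer pattern around the AioBufferPool introduced earlier in the
series. Schematically (a sketch assembled from the call sites in this patch;
the wrapper name load_pump_pages is hypothetical):

  /* Schematic only: types and helpers come from qemu-snap.h / this patch */
  static void coroutine_fn load_pump_pages(SnapLoadState *sn,
                                           int64_t bdrv_offset, int size)
  {
      AioBuffer *buffer;

      /* Producer: queue block-device reads while free buffers remain */
      while ((buffer = aio_pool_try_acquire_next(sn->aio_pool)) != NULL) {
          aio_buffer_start_task(buffer, load_buffers_task_co, bdrv_offset, size);
          bdrv_offset += size;
      }

      /* Consumer: wait for a completed read, stream it out, recycle buffer */
      while ((buffer = aio_pool_wait_compl_next(sn->aio_pool)) != NULL) {
          if (aio_pool_status(sn->aio_pool) < 0) {
              break;              /* I/O error reported by the pool */
          }
          send_pages_from_buffer(sn->f_fd, buffer);
          aio_buffer_release(buffer);
      }
  }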



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH 9/9] migration/snap-tool: Implementation of snapshot loading in postcopy
  2021-03-17 16:32 [RFC PATCH 0/9] migration/snap-tool: External snapshot utility Andrey Gruzdev
                   ` (7 preceding siblings ...)
  2021-03-17 16:32 ` [RFC PATCH 8/9] migration/snap-tool: Implementation of snapshot loading in precopy Andrey Gruzdev
@ 2021-03-17 16:32 ` Andrey Gruzdev
  2021-03-29  8:11 ` [RFC PATCH 0/9] migration/snap-tool: External snapshot utility Andrey Gruzdev
  2021-04-15 23:50 ` Peter Xu
  10 siblings, 0 replies; 13+ messages in thread
From: Andrey Gruzdev @ 2021-03-17 16:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu,
	Andrey Gruzdev

Implementation of asynchronous snapshot loading using the standard
postcopy migration mechanism on the destination VM.

The point of switchover to postcopy is selected trivially, based on
the percentage of non-zero pages already loaded in precopy.
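
Concretely, the precopy stage stops after a fixed share of the snapshot's
non-zero pages has been sent; the threshold arithmetic (as in
load_prepare_postcopy() below, example figures hypothetical) is just:

  /* pages to load in precopy before switching over to postcopy */
  int64_t precopy_pages = normal_pages * postcopy_percent / 100;

  /* e.g. ~700000 non-zero 4 KiB pages with a postcopy percentage of 30
   * means switchover after ~210000 pages have been sent */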

Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
---
 include/qemu-snap.h  |  11 +
 qemu-snap-handlers.c | 482 ++++++++++++++++++++++++++++++++++++++++++-
 qemu-snap.c          |  16 ++
 3 files changed, 504 insertions(+), 5 deletions(-)

diff --git a/include/qemu-snap.h b/include/qemu-snap.h
index f9f6db529f..4bf79e0964 100644
--- a/include/qemu-snap.h
+++ b/include/qemu-snap.h
@@ -30,6 +30,8 @@
 #define AIO_BUFFER_SIZE         (1024 * 1024)
 /* Max. concurrent AIO tasks */
 #define AIO_TASKS_MAX           8
+/* Max. concurrent AIO tasks in postcopy */
+#define AIO_TASKS_POSTCOPY_MAX  4
 
 typedef struct AioBufferPool AioBufferPool;
 
@@ -103,6 +105,7 @@ typedef struct SnapLoadState {
     BlockBackend *blk;          /* Block backend */
 
     QEMUFile *f_fd;             /* Outgoing migration stream QEMUFile */
+    QEMUFile *f_rp_fd;          /* Return path stream QEMUFile */
     QEMUFile *f_vmstate;        /* Block backend vmstate area QEMUFile */
     /*
      * Buffer to keep the first few KBs of BDRV vmstate stashed at the
@@ -114,6 +117,14 @@ typedef struct SnapLoadState {
     /* AIO buffer pool */
     AioBufferPool *aio_pool;
 
+    bool postcopy;              /* From command-line --postcopy */
+    int postcopy_percent;       /* From command-line --postcopy */
+    bool in_postcopy;           /* Switched to postcopy mode */
+
+    /* Return path listening thread */
+    QemuThread rp_listen_thread;
+    bool has_rp_listen_thread;
+
     /* BDRV vmstate offset of RAM block list section */
     int64_t state_ram_list_offset;
     /* BDRV vmstate offset of the first device section */
diff --git a/qemu-snap-handlers.c b/qemu-snap-handlers.c
index 7dfe950829..ae581b3178 100644
--- a/qemu-snap-handlers.c
+++ b/qemu-snap-handlers.c
@@ -57,6 +57,16 @@ typedef struct RAMPageRef {
     int64_t page;               /* Page index in RAM block */
 } RAMPageRef;
 
+/* Page request from destination in postcopy */
+typedef struct RAMPageRequest {
+    RAMBlockDesc *block;        /* RAM block */
+    int64_t offset;             /* Offset within RAM block */
+    unsigned size;              /* Size of request */
+
+    /* Link into ram_state.page_req */
+    QSIMPLEQ_ENTRY(RAMPageRequest) next;
+} RAMPageRequest;
+
 /* State reflecting RAM part of snapshot */
 typedef struct RAMState {
     int64_t page_size;          /* Page size */
@@ -64,6 +74,7 @@ typedef struct RAMState {
     int page_bits;              /* Page size bits */
 
     int64_t normal_pages;       /* Total number of normal (non-zero) pages */
+    int64_t precopy_pages;      /* Normal pages to load in precopy */
     int64_t loaded_pages;       /* Current number of normal pages loaded */
 
     /* Last RAM block touched by load_send_ram_iterate() */
@@ -73,9 +84,15 @@ typedef struct RAMState {
 
     /* Last RAM block sent by load_send_ram_iterate() */
     RAMBlockDesc *last_sent_block;
+    /* RAM block from last enqueued load-postcopy page request */
+    RAMBlockDesc *last_req_block;
 
     /* List of RAM blocks */
     QSIMPLEQ_HEAD(, RAMBlockDesc) ram_block_list;
+
+    /* Page request queue for load-postcopy */
+    QemuMutex page_req_mutex;
+    QSIMPLEQ_HEAD(, RAMPageRequest) page_req;
 } RAMState;
 
 /* Section handler ops */
@@ -801,6 +818,422 @@ static void load_check_file_errors(SnapLoadState *sn, int *res)
     }
 }
 
+static bool get_queued_page(RAMPageRef *p_ref)
+{
+    RAMState *rs = &ram_state;
+    RAMBlockDesc *block = NULL;
+    int64_t offset;
+
+    if (QSIMPLEQ_EMPTY_ATOMIC(&rs->page_req)) {
+        return false;
+    }
+
+    QEMU_LOCK_GUARD(&rs->page_req_mutex);
+
+    while (!QSIMPLEQ_EMPTY(&rs->page_req)) {
+        RAMPageRequest *entry = QSIMPLEQ_FIRST(&rs->page_req);
+
+        block = entry->block;
+        offset = entry->offset;
+
+        if (entry->size > rs->page_size) {
+            entry->size -= rs->page_size;
+            entry->offset += rs->page_size;
+        } else {
+            QSIMPLEQ_REMOVE_HEAD(&rs->page_req, next);
+            g_free(entry);
+        }
+
+        if (test_bit((offset >> rs->page_bits), block->bitmap)) {
+            p_ref->block = block;
+            p_ref->page = offset >> rs->page_bits;
+            return true;
+        }
+    }
+    return false;
+}
+
+static int queue_page_request(const char *id_str, int64_t offset, unsigned size)
+{
+    RAMState *rs = &ram_state;
+    RAMBlockDesc *block;
+    RAMPageRequest *new_entry;
+
+    if (!id_str) {
+        block = rs->last_req_block;
+        if (!block) {
+            error_report("RP-REQ_PAGES: no previous block");
+            return -EINVAL;
+        }
+    } else {
+        block = ram_block_by_idstr(id_str);
+        if (!block) {
+            error_report("RP-REQ_PAGES: cannot find block '%s'", id_str);
+            return -EINVAL;
+        }
+        rs->last_req_block = block;
+    }
+
+    if (offset + size > block->length) {
+        error_report("RP-REQ_PAGES: offset/size out RAM block end_offset=0x%" PRIx64
+                     " limit=0x%" PRIx64, (offset + size), block->length);
+        return -EINVAL;
+    }
+
+    new_entry = g_new0(RAMPageRequest, 1);
+    new_entry->block = block;
+    new_entry->offset = offset;
+    new_entry->size = size;
+
+    qemu_mutex_lock(&rs->page_req_mutex);
+    QSIMPLEQ_INSERT_TAIL(&rs->page_req, new_entry, next);
+    qemu_mutex_unlock(&rs->page_req_mutex);
+
+    return 0;
+}
+
+/* QEMU_VM_COMMAND sub-commands */
+typedef enum VmSubCmd {
+    MIG_CMD_OPEN_RETURN_PATH = 1,
+    MIG_CMD_POSTCOPY_ADVISE = 3,
+    MIG_CMD_POSTCOPY_LISTEN = 4,
+    MIG_CMD_POSTCOPY_RUN = 5,
+    MIG_CMD_POSTCOPY_RAM_DISCARD = 6,
+    MIG_CMD_PACKAGED = 7,
+} VmSubCmd;
+
+/* Return-path message types */
+typedef enum RpMsgType {
+    MIG_RP_MSG_INVALID = 0,
+    MIG_RP_MSG_SHUT = 1,
+    MIG_RP_MSG_REQ_PAGES_ID = 3,
+    MIG_RP_MSG_REQ_PAGES = 4,
+    MIG_RP_MSG_MAX = 7,
+} RpMsgType;
+
+typedef struct RpMsgArgs {
+    int len;
+    const char *name;
+} RpMsgArgs;
+
+/*
+ * Return-path message length/name indexed by message type.
+ * -1 value stands for variable message length.
+ */
+static RpMsgArgs rp_msg_args[] = {
+    [MIG_RP_MSG_INVALID]        = { .len = -1, .name = "INVALID" },
+    [MIG_RP_MSG_SHUT]           = { .len =  4, .name = "SHUT" },
+    [MIG_RP_MSG_REQ_PAGES_ID]   = { .len = -1, .name = "REQ_PAGES_ID" },
+    [MIG_RP_MSG_REQ_PAGES]      = { .len = 12, .name = "REQ_PAGES" },
+    [MIG_RP_MSG_MAX]            = { .len = -1, .name = "MAX" },
+};
+
+/* Return-path message processing thread */
+static void *rp_listen_thread(void *opaque)
+{
+    SnapLoadState *sn = (SnapLoadState *) opaque;
+    QEMUFile *f = sn->f_rp_fd;
+    int res = 0;
+
+    while (!res) {
+        uint8_t h_buf[512];
+        const int h_max_len = sizeof(h_buf);
+        int h_type;
+        int h_len;
+        size_t count;
+
+        h_type = qemu_get_be16(f);
+        h_len = qemu_get_be16(f);
+        /* Make early check for input errors */
+        res = qemu_file_get_error(f);
+        if (res) {
+            break;
+        }
+
+        /* Check message type */
+        if (h_type >= MIG_RP_MSG_MAX || h_type == MIG_RP_MSG_INVALID) {
+            error_report("RP: received invalid message type=%d length=%d",
+                    h_type, h_len);
+            res = -EINVAL;
+            break;
+        }
+
+        /* Check message length */
+        if (rp_msg_args[h_type].len != -1 && h_len != rp_msg_args[h_type].len) {
+            error_report("RP: received '%s' message len=%d expected=%d",
+                    rp_msg_args[h_type].name, h_len, rp_msg_args[h_type].len);
+            res = -EINVAL;
+            break;
+        } else if (h_len > h_max_len) {
+            error_report("RP: received '%s' message len=%d max_len=%d",
+                    rp_msg_args[h_type].name, h_len, h_max_len);
+            res = -EINVAL;
+            break;
+        }
+
+        count = qemu_get_buffer(f, h_buf, h_len);
+        if (count != h_len) {
+            break;
+        }
+
+        switch (h_type) {
+        case MIG_RP_MSG_SHUT:
+        {
+            int shut_error;
+
+            shut_error = be32_to_cpu(*(uint32_t *) h_buf);
+            if (shut_error) {
+                error_report("RP: sibling shutdown error=%d", shut_error);
+            }
+            /* Exit processing loop */
+            res = 1;
+            break;
+        }
+
+        case MIG_RP_MSG_REQ_PAGES:
+        case MIG_RP_MSG_REQ_PAGES_ID:
+        {
+            uint64_t offset;
+            uint32_t size;
+            char *id_str = NULL;
+
+            /* Use unaligned-safe accessors for the raw message buffer */
+            offset = ldq_be_p(h_buf + 0);
+            size = ldl_be_p(h_buf + 8);
+
+            if (h_type == MIG_RP_MSG_REQ_PAGES_ID) {
+                int h_parsed_len = rp_msg_args[MIG_RP_MSG_REQ_PAGES].len;
+
+                if (h_len > h_parsed_len) {
+                    int id_len;
+
+                    /* RAM block ID string */
+                    id_len = h_buf[h_parsed_len];
+                    id_str = (char *) &h_buf[h_parsed_len + 1];
+                    id_str[id_len] = 0;
+
+                    h_parsed_len += id_len + 1;
+                }
+
+                if (h_parsed_len != h_len) {
+                    error_report("RP: received '%s' message len=%d expected=%d",
+                            rp_msg_args[MIG_RP_MSG_REQ_PAGES_ID].name,
+                            h_len, h_parsed_len);
+                    res = -EINVAL;
+                    break;
+                }
+            }
+
+            res = queue_page_request(id_str, offset, size);
+            break;
+        }
+
+        default:
+            error_report("RP: received unexpected message type=%d len=%d",
+                    h_type, h_len);
+            res = -EINVAL;
+        }
+    }
+
+    if (res >= 0) {
+        res = qemu_file_get_error(f);
+    }
+    if (res) {
+        error_report("RP: listen thread exit error=%d", res);
+    }
+    return NULL;
+}
+
+static void send_command(QEMUFile *f, int cmd, uint16_t len, uint8_t *data)
+{
+    qemu_put_byte(f, QEMU_VM_COMMAND);
+    qemu_put_be16(f, (uint16_t) cmd);
+    qemu_put_be16(f, len);
+
+    qemu_put_buffer_async(f, data, len, false);
+    qemu_fflush(f);
+}
+
+static void send_ram_block_discard(QEMUFile *f, RAMBlockDesc *block)
+{
+    int id_len;
+    int msg_len;
+    uint8_t msg_buf[512];
+
+    id_len = strlen(block->idstr);
+    assert(id_len < 256);
+
+    /* Version, always 0 */
+    msg_buf[0] = 0;
+    /* RAM block ID string length, not including terminating 0 */
+    msg_buf[1] = id_len;
+    /* RAM block ID string with terminating zero */
+    memcpy(msg_buf + 2, block->idstr, (id_len + 1));
+    msg_len = 2 + id_len + 1;
+    /*
+     * We take discard range start offset for the RAM block
+     * from precopy-mode block->last_offset.
+     */
+    stq_be_p(msg_buf + msg_len, block->last_offset);
+    msg_len += 8;
+    /* Discard range length */
+    stq_be_p(msg_buf + msg_len, (block->length - block->last_offset));
+    msg_len += 8;
+
+    send_command(f, MIG_CMD_POSTCOPY_RAM_DISCARD, msg_len, msg_buf);
+}
+
+static int send_ram_each_block_discard(QEMUFile *f)
+{
+    RAMBlockDesc *block;
+    int res = 0;
+
+    QSIMPLEQ_FOREACH(block, &ram_state.ram_block_list, next) {
+        send_ram_block_discard(f, block);
+        res = qemu_file_get_error(f);
+        if (res) {
+            break;
+        }
+    }
+    return res;
+}
+
+static int load_prepare_postcopy(SnapLoadState *sn)
+{
+    QEMUFile *f = sn->f_fd;
+    uint64_t tmp[2];
+    int res;
+
+    /* Set number of pages to load in precopy before switching to postcopy */
+    ram_state.precopy_pages = ram_state.normal_pages *
+                              sn->postcopy_percent / 100;
+
+    /* Send POSTCOPY_ADVISE */
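+    /*
+     * The ADVISE payload is two be64 page size values (in QEMU's
+     * postcopy protocol: the RAM page size summary and the target
+     * page size); the snapshot uses a single page size, so both
+     * fields are set to it.
+     */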
+    tmp[0] = cpu_to_be64(ram_state.page_size);
+    tmp[1] = cpu_to_be64(ram_state.page_size);
+    send_command(f, MIG_CMD_POSTCOPY_ADVISE, 16, (uint8_t *) tmp);
+    /* Open return path on destination */
+    send_command(f, MIG_CMD_OPEN_RETURN_PATH, 0, NULL);
+
+    /*
+     * Check file errors after the POSTCOPY_ADVISE command, since the
+     * destination may have already closed its input pipe if postcopy
+     * has not been enabled in advance.
+     */
+    res = qemu_file_get_error(f);
+    if (!res) {
+        qemu_thread_create(&sn->rp_listen_thread, "return-path-thread",
+                rp_listen_thread, sn, QEMU_THREAD_JOINABLE);
+        sn->has_rp_listen_thread = true;
+    }
+    return res;
+}
+
+static int load_start_postcopy(SnapLoadState *sn)
+{
+    QIOChannelBuffer *bioc;
+    QEMUFile *fb;
+    int eof_pos;
+    uint32_t length;
+    int res = 0;
+
+    /*
+     * Send RAM discards for each block's unsent part. Without discards,
+     * the userfault_fd code on the destination will not trigger page
+     * requests as expected. Also, the UFFDIO_COPY/ZEROPAGE ioctls used
+     * to place incoming pages in postcopy would fail if the target page
+     * has not faulted with the userfault_fd MISSING reason.
+     */
+    res = send_ram_each_block_discard(sn->f_fd);
+    if (res) {
+        return res;
+    }
+
+    /*
+     * To perform a switch to postcopy on destination, we need to send
+     * commands and the device state data in the following order:
+     *   * MIG_CMD_POSTCOPY_LISTEN
+     *   * device state sections
+     *   * MIG_CMD_POSTCOPY_RUN
+     * All of this has to be packaged into a single blob using
+     * MIG_CMD_PACKAGED command. While loading the device state we may
+     * trigger page transfer requests and the fd must be free to process
+     * those, thus the destination must read the whole device state off
+     * the fd before it starts processing it. To wrap it up in a package,
+     * we use QEMU buffer channel object.
+     */
+    bioc = qio_channel_buffer_new(512 * 1024);
+    qio_channel_set_name(QIO_CHANNEL(bioc), "snap-postcopy-buffer");
+    fb = qemu_fopen_channel_output(QIO_CHANNEL(bioc));
+    object_unref(OBJECT(bioc));
+
+    /* First goes MIG_CMD_POSTCOPY_LISTEN command */
+    send_command(fb, MIG_CMD_POSTCOPY_LISTEN, 0, NULL);
+
+    /* Then the rest of device state with optional VMDESC section.. */
+    file_transfer_bytes(fb, sn->f_vmstate,
+            (sn->state_eof - sn->state_device_offset));
+    qemu_fflush(fb);
+
+    /*
+     * A VMDESC JSON section may be present at the end of the stream,
+     * so we try to locate it and truncate the trailer for postcopy.
+     */
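+    /*
+     * The trailer being matched is 11 bytes preceding the JSON data:
+     * a QEMU_VM_SECTION_FOOTER byte with its be32 section id, the
+     * QEMU_VM_EOF byte, and a QEMU_VM_VMDESCRIPTION byte followed by
+     * the be32 JSON length.
+     */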
+    eof_pos = bioc->usage - 1;
+    for (int offset = (bioc->usage - 11); offset >= 0; offset--) {
+        if (bioc->data[offset] == QEMU_VM_SECTION_FOOTER &&
+            bioc->data[offset + 5] == QEMU_VM_EOF &&
+            bioc->data[offset + 6] == QEMU_VM_VMDESCRIPTION) {
+            uint32_t json_length;
+            uint32_t expected_length = bioc->usage - (offset + 11);
+
+            json_length = ldl_be_p(&bioc->data[offset + 7]);
+            if (json_length != expected_length) {
+                error_report("Corrupted VMDESC trailer: length=%" PRIu32
+                             " expected=%" PRIu32, json_length, expected_length);
+                res = -EINVAL;
+                goto fail;
+            }
+
+            eof_pos = offset + 5;
+            break;
+        }
+    }
+
+    /*
+     * In postcopy we need to remove the QEMU_VM_EOF token which normally
+     * goes after the last non-iterable device state section, before the
+     * (optional) VMDESC JSON section. This is required to allow snapshot
+     * loading to continue in postcopy after the rest of the device state
+     * has been sent. The VMDESC section also has to be removed from the
+     * stream if present.
+     */
+    if (eof_pos >= 0 && bioc->data[eof_pos] == QEMU_VM_EOF) {
+        bioc->usage = eof_pos;
+        bioc->offset = eof_pos;
+    }
+
+    /* And the final MIG_CMD_POSTCOPY_RUN */
+    send_command(fb, MIG_CMD_POSTCOPY_RUN, 0, NULL);
+
+    /* Now send that blob */
+    length = cpu_to_be32(bioc->usage);
+    send_command(sn->f_fd, MIG_CMD_PACKAGED, sizeof(length),
+            (uint8_t *) &length);
+    qemu_put_buffer_async(sn->f_fd, bioc->data, bioc->usage, false);
+    qemu_fflush(sn->f_fd);
+
+    /*
+     * Lower the limit on the number of in-flight AIO requests to reduce
+     * return path PAGE_REQ processing latencies.
+     */
+    aio_pool_set_max_in_flight(sn->aio_pool, AIO_TASKS_POSTCOPY_MAX);
+    /* Now in postcopy */
+    sn->in_postcopy = true;
+
+fail:
+    qemu_fclose(fb);
+    load_check_file_errors(sn, &res);
+    return res;
+}
+
 static int ram_load(QEMUFile *f, void *opaque, int version_id)
 {
     int compat_flags = (RAM_SAVE_FLAG_MEM_SIZE | RAM_SAVE_FLAG_EOS);
@@ -1007,11 +1440,23 @@ static void coroutine_fn load_buffers_fill_queue(SnapLoadState *sn)
     int64_t offset;
     int64_t limit;
     int64_t pages;
+    bool urgent;
 
-    if (!find_next_unsent_page(&p_ref)) {
+    /*
+     * We first need to check that aio_pool_try_acquire_next() will
+     * succeed at least once, since get_queued_page() cannot be reverted.
+     */
+    if (!aio_pool_can_acquire_next(sn->aio_pool)) {
         return;
     }
 
+    urgent = get_queued_page(&p_ref);
+    if (!urgent) {
+        if (!find_next_unsent_page(&p_ref)) {
+            return;
+        }
+    }
+
     get_unsent_page_range(&p_ref, &block, &offset, &limit);
 
     do {
@@ -1033,7 +1478,7 @@ static void coroutine_fn load_buffers_fill_queue(SnapLoadState *sn)
         aio_buffer_start_task(buffer, load_buffers_task_co, bdrv_offset, size);
 
         offset += size;
-    } while (offset < limit);
+    } while (!urgent && (offset < limit));
 
     rs->last_block = block;
     rs->last_page = offset >> rs->page_bits;
@@ -1151,9 +1596,14 @@ static int load_send_leader(SnapLoadState *sn)
 
 static int load_send_complete(SnapLoadState *sn)
 {
-    /* Transfer device state to the output pipe */
-    file_transfer_bytes(sn->f_fd, sn->f_vmstate,
-            (sn->state_eof - sn->state_device_offset));
+    if (!sn->in_postcopy) {
+        /* Transfer device state to the output pipe */
+        file_transfer_bytes(sn->f_fd, sn->f_vmstate,
+                (sn->state_eof - sn->state_device_offset));
+    } else {
+        /* In postcopy send final QEMU_VM_EOF token */
+        qemu_put_byte(sn->f_fd, QEMU_VM_EOF);
+    }
     qemu_fflush(sn->f_fd);
     return 1;
 }
@@ -1266,6 +1716,11 @@ static int load_state_header(SnapLoadState *sn)
     return 0;
 }
 
+static bool load_switch_to_postcopy(SnapLoadState *sn)
+{
+    return ram_state.loaded_pages > ram_state.precopy_pages;
+}
+
 /* Load snapshot data and send it with outgoing migration stream */
 int coroutine_fn snap_load_state_main(SnapLoadState *sn)
 {
@@ -1283,11 +1738,22 @@ int coroutine_fn snap_load_state_main(SnapLoadState *sn)
     if (res) {
         goto fail;
     }
+    if (sn->postcopy) {
+        res = load_prepare_postcopy(sn);
+        if (res) {
+            goto fail;
+        }
+    }
 
     do {
         res = load_send_ram_iterate(sn);
         /* Make additional check for file errors */
         load_check_file_errors(sn, &res);
+
+        if (!res && sn->postcopy && !sn->in_postcopy &&
+                load_switch_to_postcopy(sn)) {
+            res = load_start_postcopy(sn);
+        }
     } while (!res);
 
     if (res == 1) {
@@ -1313,6 +1779,10 @@ void snap_ram_init_state(int page_bits)
 
     /* Initialize RAM block list head */
     QSIMPLEQ_INIT(&rs->ram_block_list);
+
+    /* Initialize load-postcopy page request queue */
+    qemu_mutex_init(&rs->page_req_mutex);
+    QSIMPLEQ_INIT(&rs->page_req);
 }
 
 /* Destroy snapshot RAM state */
@@ -1326,4 +1796,6 @@ void snap_ram_destroy_state(void)
         g_free(block->bitmap);
         g_free(block);
     }
+    /* Destroy page request mutex */
+    qemu_mutex_destroy(&ram_state.page_req_mutex);
 }
diff --git a/qemu-snap.c b/qemu-snap.c
index c5efbd6803..89fb918cfc 100644
--- a/qemu-snap.c
+++ b/qemu-snap.c
@@ -141,6 +141,10 @@ static void snap_load_destroy_state(void)
 {
     SnapLoadState *sn = snap_load_get_state();
 
+    if (sn->has_rp_listen_thread) {
+        qemu_thread_join(&sn->rp_listen_thread);
+    }
+
     if (sn->aio_pool) {
         aio_pool_free(sn->aio_pool);
     }
@@ -330,6 +334,7 @@ static int snap_load(SnapLoadParams *params)
 {
     SnapLoadState *sn;
     QIOChannel *ioc_fd;
+    QIOChannel *ioc_rp_fd;
     uint8_t *buf;
     size_t count;
     int res = -1;
@@ -338,11 +343,22 @@ static int snap_load(SnapLoadParams *params)
     snap_load_init_state();
     sn = snap_load_get_state();
 
+    sn->postcopy = params->postcopy;
+    sn->postcopy_percent = params->postcopy_percent;
+
     ioc_fd = qio_channel_new_fd(params->fd, NULL);
     qio_channel_set_name(QIO_CHANNEL(ioc_fd), "snap-channel-outgoing");
     sn->f_fd = qemu_fopen_channel_output(ioc_fd);
     object_unref(OBJECT(ioc_fd));
 
+    /* Open the return path QEMUFile if we are going to use postcopy */
+    if (params->postcopy) {
+        ioc_rp_fd = qio_channel_new_fd(params->rp_fd, NULL);
+        qio_channel_set_name(QIO_CHANNEL(ioc_rp_fd), "snap-channel-rp");
+        sn->f_rp_fd = qemu_fopen_channel_input(ioc_rp_fd);
+        object_unref(OBJECT(ioc_rp_fd));
+    }
+
     sn->blk = snap_open(params->filename, params->bdrv_flags);
     if (!sn->blk) {
         goto fail;
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 0/9] migration/snap-tool: External snapshot utility
  2021-03-17 16:32 [RFC PATCH 0/9] migration/snap-tool: External snapshot utility Andrey Gruzdev
                   ` (8 preceding siblings ...)
  2021-03-17 16:32 ` [RFC PATCH 9/9] migration/snap-tool: Implementation of snapshot loading in postcopy Andrey Gruzdev
@ 2021-03-29  8:11 ` Andrey Gruzdev
  2021-04-15 23:50 ` Peter Xu
  10 siblings, 0 replies; 13+ messages in thread
From: Andrey Gruzdev @ 2021-03-29  8:11 UTC (permalink / raw)
  To: qemu-devel
  Cc: Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster, Peter Xu

[-- Attachment #1: Type: text/plain, Size: 5729 bytes --]

Ping


On 17.03.2021 19:32, Andrey Gruzdev wrote:

> This series is a kind of PoC for asynchronous snapshot reverting. This is
> about external snapshots only and doesn't involve block devices. Thus, it's
> mainly intended to be used with the new 'background-snapshot' migration
> capability and otherwise standard QEMU migration mechanism.
>
> The major ideas behind this first version were:
>    * Make it compatible with 'exec:'-style migration - options can be create
>      some separate tool or integrate into qemu-system.
>    * Support asynchronous revert stage by using unaltered postcopy logic
>      at destination. To do this, we should be capable of saving RAM pages
>      so that any particular page can be directly addressed by it's block ID
>      and page offset. Possible solutions here seem to be:
>        use separate index (and storing it somewhere)
>        create sparse file on host FS and address pages with file offset
>        use QCOW2 (or other) image container with inherent sparsity support
>    * Make snapshot image file dense on the host FS so we don't depend on
>      copy/backup tools and how they deal with sparse files. Off course,
>      there's some performance cost for this choice.
>    * Make the code which is parsing unstructered format of migration stream,
>      at least, not very sophisticated. Also, try to have minimum dependencies
>      on QEMU migration code, both RAM and device.
>    * Try to keep page save latencies small while not degrading migration
>      bandwidth too much.
>
> For this first version I decided not to integrate into main QEMU code but
> create a separate tool. The main reason is that there's not too much migration
> code that is target-specific and can be used in it's unmodified form. Also,
> it's still not very clear how to make 'qemu-system' integration in terms of
> command-line (or monitor/QMP?) interface extension.
>
> For the storage format, QCOW2 as a container and large (1MB) cluster size seem
> to be an optimal choice. Larger cluster is beneficial for performance particularly
> in the case when image preallocation is disabled. Such cluster size does not result
> in too high internal fragmentation level (~10% of space waste in most cases) yet
> allows to reduce significantly the number of expensive cluster allocations.
>
> A bit tricky part is dispatching QEMU migration stream cause it is mostly
> unstructered and depends on configuration parameters like 'send-configuration'
> and 'send-section-footer'. But, for the case with default values in migration
> globals it seems that implemented dispatching code works well and won't have
> compatibility issues in a reasonably long time frame.
>
> I decided to keep RAM save path synchronous, anyhow it's better to use writeback
> cache mode for the live snapshots cause of it's interleaving page address pattern.
> Page coalescing buffer is used to merge contiguous pages to optimize block layer
> writes.
>
> Since for snapshot loading opening image file in cached mode would not do any good,
> it implies that Linux native AIO and O_DIRECT mode is used in a common scenario.
> AIO support in RAM loading path is implemented by using a ring of preallocated
> fixed-sized buffers in such a way that there's always a number of outstanding block
> requests anytime. It also ensures in-order request completion.
>
> How to use:
>
> **Save:**
> * qemu> migrate_set_capability background-snapshot on
> * qemu> migrate "exec:<qemu-bin-path>/qemu-snap -s <virtual-size>
>             --cache=writeback --aio=threads save <image-file.qcow2>"
>
> **Load:**
> * Use 'qemu-system-* -incoming defer'
> * qemu> migrate_incoming "exec:<qemu-bin-path>/qemu-snap
>            --cache=none --aio=native load <image-file.qcow2>"
>
> **Load with postcopy:**
> * Use 'qemu-system-* -incoming defer'
> * qemu> migrate_set_capability postcopy-ram on
> * qemu> migrate_incoming "exec:<qemu-bin-path>/qemu-snap --postcopy=60
>            --cache=none --aio=native load <image-file.qcow2>"
>
> And yes, asynchronous revert works well only with SSD, not with rotational disk..
>
> Some performance stats:
> * SATA SSD drive with ~500/450 MB/s sequantial read/write and ~60K IOPS max.
> * 220 MB/s average save rate (depends on workload)
> * 440 MB/s average load rate in precopy
> * 260 MB/s average load rate in postcopy
>
> Andrey Gruzdev (9):
>    migration/snap-tool: Introduce qemu-snap tool
>    migration/snap-tool: Snapshot image create/open routines for qemu-snap
>      tool
>    migration/snap-tool: Preparations to run code in main loop context
>    migration/snap-tool: Introduce qemu_ftell2() routine to qemu-file.c
>    migration/snap-tool: Block layer AIO support and file utility routines
>    migration/snap-tool: Move RAM_SAVE_FLAG_xxx defines to migration/ram.h
>    migration/snap-tool: Complete implementation of snapshot saving
>    migration/snap-tool: Implementation of snapshot loading in precopy
>    migration/snap-tool: Implementation of snapshot loading in postcopy
>
>   include/qemu-snap.h   |  163 ++++
>   meson.build           |    2 +
>   migration/qemu-file.c |    6 +
>   migration/qemu-file.h |    1 +
>   migration/ram.c       |   16 -
>   migration/ram.h       |   16 +
>   qemu-snap-handlers.c  | 1801 +++++++++++++++++++++++++++++++++++++++++
>   qemu-snap-io.c        |  325 ++++++++
>   qemu-snap.c           |  673 +++++++++++++++
>   9 files changed, 2987 insertions(+), 16 deletions(-)
>   create mode 100644 include/qemu-snap.h
>   create mode 100644 qemu-snap-handlers.c
>   create mode 100644 qemu-snap-io.c
>   create mode 100644 qemu-snap.c
>


-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com


[-- Attachment #2: Type: text/html, Size: 6002 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 0/9] migration/snap-tool: External snapshot utility
  2021-03-17 16:32 [RFC PATCH 0/9] migration/snap-tool: External snapshot utility Andrey Gruzdev
                   ` (9 preceding siblings ...)
  2021-03-29  8:11 ` [RFC PATCH 0/9] migration/snap-tool: External snapshot utility Andrey Gruzdev
@ 2021-04-15 23:50 ` Peter Xu
  2021-04-16 12:27   ` Andrey Gruzdev
  10 siblings, 1 reply; 13+ messages in thread
From: Peter Xu @ 2021-04-15 23:50 UTC (permalink / raw)
  To: Andrey Gruzdev
  Cc: Juan Quintela, Markus Armbruster, qemu-devel,
	Dr . David Alan Gilbert, Paolo Bonzini, Den Lunev

On Wed, Mar 17, 2021 at 07:32:13PM +0300, Andrey Gruzdev wrote:
> This series is a kind of PoC for asynchronous snapshot reverting. This is
> about external snapshots only and doesn't involve block devices. Thus, it's
> mainly intended to be used with the new 'background-snapshot' migration
> capability and otherwise standard QEMU migration mechanism.
> 
> The major ideas behind this first version were:
>   * Make it compatible with 'exec:'-style migration - options can be create
>     some separate tool or integrate into qemu-system.
>   * Support asynchronous revert stage by using unaltered postcopy logic
>     at destination. To do this, we should be capable of saving RAM pages
>     so that any particular page can be directly addressed by it's block ID
>     and page offset. Possible solutions here seem to be:
>       use separate index (and storing it somewhere)
>       create sparse file on host FS and address pages with file offset
>       use QCOW2 (or other) image container with inherent sparsity support
>   * Make snapshot image file dense on the host FS so we don't depend on
>     copy/backup tools and how they deal with sparse files. Off course,
>     there's some performance cost for this choice.
>   * Make the code which is parsing unstructered format of migration stream,
>     at least, not very sophisticated. Also, try to have minimum dependencies
>     on QEMU migration code, both RAM and device.
>   * Try to keep page save latencies small while not degrading migration
>     bandwidth too much.
> 
> For this first version I decided not to integrate into main QEMU code but
> create a separate tool. The main reason is that there's not too much migration
> code that is target-specific and can be used in it's unmodified form. Also,
> it's still not very clear how to make 'qemu-system' integration in terms of
> command-line (or monitor/QMP?) interface extension.
> 
> For the storage format, QCOW2 as a container and large (1MB) cluster size seem
> to be an optimal choice. Larger cluster is beneficial for performance particularly
> in the case when image preallocation is disabled. Such cluster size does not result
> in too high internal fragmentation level (~10% of space waste in most cases) yet
> allows to reduce significantly the number of expensive cluster allocations.
> 
> A bit tricky part is dispatching QEMU migration stream cause it is mostly
> unstructered and depends on configuration parameters like 'send-configuration'
> and 'send-section-footer'. But, for the case with default values in migration
> globals it seems that implemented dispatching code works well and won't have
> compatibility issues in a reasonably long time frame.
> 
> I decided to keep RAM save path synchronous, anyhow it's better to use writeback
> cache mode for the live snapshots cause of it's interleaving page address pattern.
> Page coalescing buffer is used to merge contiguous pages to optimize block layer
> writes.
> 
> Since for snapshot loading opening image file in cached mode would not do any good,
> it implies that Linux native AIO and O_DIRECT mode is used in a common scenario.
> AIO support in RAM loading path is implemented by using a ring of preallocated
> fixed-sized buffers in such a way that there's always a number of outstanding block
> requests anytime. It also ensures in-order request completion.
> 
> How to use:
> 
> **Save:**
> * qemu> migrate_set_capability background-snapshot on
> * qemu> migrate "exec:<qemu-bin-path>/qemu-snap -s <virtual-size>
>            --cache=writeback --aio=threads save <image-file.qcow2>"
> 
> **Load:**
> * Use 'qemu-system-* -incoming defer'
> * qemu> migrate_incoming "exec:<qemu-bin-path>/qemu-snap
>           --cache=none --aio=native load <image-file.qcow2>"
> 
> **Load with postcopy:**
> * Use 'qemu-system-* -incoming defer'
> * qemu> migrate_set_capability postcopy-ram on
> * qemu> migrate_incoming "exec:<qemu-bin-path>/qemu-snap --postcopy=60
>           --cache=none --aio=native load <image-file.qcow2>"
> 
> And yes, asynchronous revert works well only with SSD, not with rotational disk..
> 
> Some performance stats:
> * SATA SSD drive with ~500/450 MB/s sequantial read/write and ~60K IOPS max.
> * 220 MB/s average save rate (depends on workload)
> * 440 MB/s average load rate in precopy
> * 260 MB/s average load rate in postcopy

Andrey,

Before I try to read it (since I'm probably not the best person to review
it)... would you remind me of the major difference of external snapshots
compared to the existing ones, and the problems to solve?

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 0/9] migration/snap-tool: External snapshot utility
  2021-04-15 23:50 ` Peter Xu
@ 2021-04-16 12:27   ` Andrey Gruzdev
  0 siblings, 0 replies; 13+ messages in thread
From: Andrey Gruzdev @ 2021-04-16 12:27 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Den Lunev, Eric Blake, Paolo Bonzini, Juan Quintela,
	Dr . David Alan Gilbert, Markus Armbruster

[-- Attachment #1: Type: text/plain, Size: 6256 bytes --]

On 16.04.2021 02:50, Peter Xu wrote:
> On Wed, Mar 17, 2021 at 07:32:13PM +0300, Andrey Gruzdev wrote:
>> This series is a kind of PoC for asynchronous snapshot reverting. This is
>> about external snapshots only and doesn't involve block devices. Thus, it's
>> mainly intended to be used with the new 'background-snapshot' migration
>> capability and otherwise standard QEMU migration mechanism.
>>
>> The major ideas behind this first version were:
>>    * Make it compatible with 'exec:'-style migration - options can be create
>>      some separate tool or integrate into qemu-system.
>>    * Support asynchronous revert stage by using unaltered postcopy logic
>>      at destination. To do this, we should be capable of saving RAM pages
>>      so that any particular page can be directly addressed by it's block ID
>>      and page offset. Possible solutions here seem to be:
>>        use separate index (and storing it somewhere)
>>        create sparse file on host FS and address pages with file offset
>>        use QCOW2 (or other) image container with inherent sparsity support
>>    * Make snapshot image file dense on the host FS so we don't depend on
>>      copy/backup tools and how they deal with sparse files. Off course,
>>      there's some performance cost for this choice.
>>    * Make the code which is parsing unstructered format of migration stream,
>>      at least, not very sophisticated. Also, try to have minimum dependencies
>>      on QEMU migration code, both RAM and device.
>>    * Try to keep page save latencies small while not degrading migration
>>      bandwidth too much.
>>
>> For this first version I decided not to integrate into main QEMU code but
>> create a separate tool. The main reason is that there's not too much migration
>> code that is target-specific and can be used in it's unmodified form. Also,
>> it's still not very clear how to make 'qemu-system' integration in terms of
>> command-line (or monitor/QMP?) interface extension.
>>
>> For the storage format, QCOW2 as a container and large (1MB) cluster size seem
>> to be an optimal choice. Larger cluster is beneficial for performance particularly
>> in the case when image preallocation is disabled. Such cluster size does not result
>> in too high internal fragmentation level (~10% of space waste in most cases) yet
>> allows to reduce significantly the number of expensive cluster allocations.
>>
>> A bit tricky part is dispatching QEMU migration stream cause it is mostly
>> unstructered and depends on configuration parameters like 'send-configuration'
>> and 'send-section-footer'. But, for the case with default values in migration
>> globals it seems that implemented dispatching code works well and won't have
>> compatibility issues in a reasonably long time frame.
>>
>> I decided to keep RAM save path synchronous, anyhow it's better to use writeback
>> cache mode for the live snapshots cause of it's interleaving page address pattern.
>> Page coalescing buffer is used to merge contiguous pages to optimize block layer
>> writes.
>>
>> Since for snapshot loading opening image file in cached mode would not do any good,
>> it implies that Linux native AIO and O_DIRECT mode is used in a common scenario.
>> AIO support in RAM loading path is implemented by using a ring of preallocated
>> fixed-sized buffers in such a way that there's always a number of outstanding block
>> requests anytime. It also ensures in-order request completion.
>>
>> How to use:
>>
>> **Save:**
>> * qemu> migrate_set_capability background-snapshot on
>> * qemu> migrate "exec:<qemu-bin-path>/qemu-snap -s <virtual-size>
>>             --cache=writeback --aio=threads save <image-file.qcow2>"
>>
>> **Load:**
>> * Use 'qemu-system-* -incoming defer'
>> * qemu> migrate_incoming "exec:<qemu-bin-path>/qemu-snap
>>            --cache=none --aio=native load <image-file.qcow2>"
>>
>> **Load with postcopy:**
>> * Use 'qemu-system-* -incoming defer'
>> * qemu> migrate_set_capability postcopy-ram on
>> * qemu> migrate_incoming "exec:<qemu-bin-path>/qemu-snap --postcopy=60
>>            --cache=none --aio=native load <image-file.qcow2>"
>>
>> And yes, asynchronous revert works well only with SSD, not with rotational disk..
>>
>> Some performance stats:
>> * SATA SSD drive with ~500/450 MB/s sequantial read/write and ~60K IOPS max.
>> * 220 MB/s average save rate (depends on workload)
>> * 440 MB/s average load rate in precopy
>> * 260 MB/s average load rate in postcopy
> Andrey,
>
> Before I try to read it (since I'm probably not the best person to review
> it..).. Would you remind me on the major difference of external snapshots
> comparing to the existing one, and problems to solve?
>
> Thanks,
>
Hi Peter,

For external snapshots, the difference (compared to internal ones) is that the
snapshot data goes to storage objects which are not part of the VM
configuration. For internal snapshots we use the VM instance's configured
storage to keep both the vmstate and the blockdev snapshot data, while for
external snapshots we save the vmstate and blockdev snapshots to separate
files on the host. External snapshots are also not managed by QEMU.
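
To illustrate the difference (a rough sketch - the external command follows
the cover letter, with the same placeholders):

  # internal snapshot: vmstate and disk state go into the VM's own images
  qemu> savevm snap1

  # external snapshot: vmstate goes to a separate host file, unmanaged by QEMU
  qemu> migrate_set_capability background-snapshot on
  qemu> migrate "exec:qemu-snap -s <virtual-size> --cache=writeback
            --aio=threads save <image-file.qcow2>"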

One of the problems is that the vmstate part of an external snapshot is
essentially the migration stream, which is schema-less and whose structure
depends on the QEMU target. That means we can currently perform a
revert-to-snapshot operation with a sequence of QMP commands, but only in a
synchronous way, i.e. vCPUs can't be started until all of the vmstate data has
been transferred. The reason for this synchronous behavior is that we cannot
locate an arbitrary RAM page in the raw migration stream if we start vCPUs
early and get faults for pages that are missing on the destination VM.

So the major goal of this PoC is to demonstrate asynchronous snapshot reverting
in QEMU while keeping the migration code mostly unchanged. To do that we need
to split the migration stream into two parts: the RAM pages and the rest of
the vmstate. Once split, RAM pages can be dispatched directly to a block
device, with block offsets deduced from page GPAs (a rough sketch follows).
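
As a sketch of that dispatch idea (illustrative types and names only, not the
actual qemu-snap code):

  #include <stdint.h>

  /* Each RAM block maps to a contiguous region of the snapshot image */
  typedef struct BlockMap {
      const char *idstr;   /* RAM block ID from the migration stream */
      uint64_t base;       /* block's base offset within the image */
      uint64_t length;     /* RAM block length in bytes */
  } BlockMap;

  /*
   * A page is addressed by (block, page offset) alone, so no separate
   * index is needed; QCOW2 cluster allocation provides the sparsity.
   */
  static int64_t page_image_offset(const BlockMap *map, uint64_t page_offset)
  {
      if (page_offset >= map->length) {
          return -1; /* page request out of block range */
      }
      return (int64_t)(map->base + page_offset);
  }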


-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuzzo.com


[-- Attachment #2: Type: text/html, Size: 6604 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-04-16 12:44 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-17 16:32 [RFC PATCH 0/9] migration/snap-tool: External snapshot utility Andrey Gruzdev
2021-03-17 16:32 ` [RFC PATCH 1/9] migration/snap-tool: Introduce qemu-snap tool Andrey Gruzdev
2021-03-17 16:32 ` [RFC PATCH 2/9] migration/snap-tool: Snapshot image create/open routines for " Andrey Gruzdev
2021-03-17 16:32 ` [RFC PATCH 3/9] migration/snap-tool: Preparations to run code in main loop context Andrey Gruzdev
2021-03-17 16:32 ` [RFC PATCH 4/9] migration/snap-tool: Introduce qemu_ftell2() routine to qemu-file.c Andrey Gruzdev
2021-03-17 16:32 ` [RFC PATCH 5/9] migration/snap-tool: Block layer AIO support and file utility routines Andrey Gruzdev
2021-03-17 16:32 ` [RFC PATCH 6/9] migration/snap-tool: Move RAM_SAVE_FLAG_xxx defines to migration/ram.h Andrey Gruzdev
2021-03-17 16:32 ` [RFC PATCH 7/9] migration/snap-tool: Complete implementation of snapshot saving Andrey Gruzdev
2021-03-17 16:32 ` [RFC PATCH 8/9] migration/snap-tool: Implementation of snapshot loading in precopy Andrey Gruzdev
2021-03-17 16:32 ` [RFC PATCH 9/9] migration/snap-tool: Implementation of snapshot loading in postcopy Andrey Gruzdev
2021-03-29  8:11 ` [RFC PATCH 0/9] migration/snap-tool: External snapshot utility Andrey Gruzdev
2021-04-15 23:50 ` Peter Xu
2021-04-16 12:27   ` Andrey Gruzdev

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).