* [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
From: Dietmar Maurer @ 2012-11-21  9:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, Dietmar Maurer

This series provides a way to efficiently back up VMs.

* Backup to a single archive file
* Backup contains all data needed to restore the VM (full backup)
* Do not depend on storage type or image format
* Avoid use of temporary storage
* Store sparse images efficiently

The file docs/backup-rfc.txt contains more details.

Signed-off-by: Dietmar Maurer <dietmar@proxmox.com>
---
 docs/backup-rfc.txt |  119 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 119 insertions(+), 0 deletions(-)
 create mode 100644 docs/backup-rfc.txt

diff --git a/docs/backup-rfc.txt b/docs/backup-rfc.txt
new file mode 100644
index 0000000..5b4b3df
--- /dev/null
+++ b/docs/backup-rfc.txt
@@ -0,0 +1,119 @@
+RFC: Efficient VM backup for qemu
+
+=Requirements=
+
+* Backup to a single archive file
+* Backup needs to contain all data to restore the VM (full backup)
+* Do not depend on storage type or image format
+* Avoid use of temporary storage
+* Store sparse images efficiently
+
+=Introduction=
+
+Most VM backup solutions use some kind of snapshot to get a consistent
+VM view at a specific point in time. For example, we previously used
+LVM to create a snapshot of all used VM images, which were then copied
+into a tar file.
+
+That basically means that any data written during backup involves
+considerable overhead. With LVM we get the following steps:
+
+1.) read original data (VM write)
+2.) write original data into snapshot (VM write)
+3.) write new data (VM write)
+4.) read data from snapshot (backup)
+5.) write data from snapshot into tar file (backup)
+
+Another approach to backing up VM images is to create a new qcow2 image
+which uses the old image as its base. During backup, writes are redirected
+to the new image, so the old image represents a 'snapshot'. After
+backup, data needs to be copied back from the new image into the old
+one (commit). So a simple write during backup triggers the following
+steps:
+
+1.) write new data to new image (VM write)
+2.) read data from old image (backup)
+3.) write data from old image into tar file (backup)
+
+4.) read data from new image (commit)
+5.) write data to old image (commit)
+
+This is in fact the same overhead as before. Other tools like qemu
+livebackup produce similar overhead (2 reads, 3 writes).
+
+Some storage types/formats support internal snapshots using some kind
+of reference counting (rados, sheepdog, dm-thin, qcow2). It would be possible
+to use that for backups, but for now we want to be storage-independent.
+
+Note: It turned out that taking a qcow2 snapshot can take a very long
+time on larger files.
+
+=Make it more efficient=
+
+To be more efficient, we simply need to avoid unnecessary steps. The
+following steps are always required:
+
+1.) read old data before it gets overwritten
+2.) write that data into the backup archive
+3.) write new data (VM write)
+
+As you can see, this involves only one read and two writes.
+
+To make that work, our backup archive needs to be able to store image
+data 'out of order'. It is important to notice that this will not work
+with traditional archive formats like tar.
+
+During backup we simply intercept writes, then read existing data and
+store that directly into the archive. After that we can continue the
+write.
+
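+Below is a minimal, self-contained sketch of this copy-before-write
+idea as a toy in-memory model (names and sizes are illustrative; this
+is not the actual qemu code):
+
+  #include <stdint.h>
+  #include <string.h>
+  #include <stdio.h>
+
+  #define CLUSTER_SIZE 65536
+  #define NUM_CLUSTERS 16
+
+  static uint8_t disk[NUM_CLUSTERS][CLUSTER_SIZE];   /* the VM image */
+  static uint8_t backup[NUM_CLUSTERS][CLUSTER_SIZE]; /* archive (out of order) */
+  static unsigned long bitmap;                       /* one bit per cluster */
+
+  /* Intercepted VM write: archive the old cluster once, then let the
+   * write pass through - one read and two writes in total. */
+  static void guest_write(int cluster, const uint8_t *data)
+  {
+      if (!(bitmap & (1UL << cluster))) {
+          bitmap |= 1UL << cluster;
+          memcpy(backup[cluster], disk[cluster], CLUSTER_SIZE);
+      }
+      memcpy(disk[cluster], data, CLUSTER_SIZE);
+  }
+
+  /* Background job: sweep all not-yet-saved clusters into the archive. */
+  static void backup_run(void)
+  {
+      int c;
+      for (c = 0; c < NUM_CLUSTERS; c++) {
+          if (!(bitmap & (1UL << c))) {
+              bitmap |= 1UL << c;
+              memcpy(backup[c], disk[c], CLUSTER_SIZE);
+          }
+      }
+  }
+
+  int main(void)
+  {
+      uint8_t data[CLUSTER_SIZE];
+      memset(data, 0xab, sizeof(data));
+      guest_write(3, data); /* old content of cluster 3 is archived first */
+      backup_run();         /* everything else is archived afterwards */
+      printf("archived %d clusters\n", NUM_CLUSTERS);
+      return 0;
+  }
+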
+==Advantages==
+
+* very good performance (1 read, 2 writes)
+* works on any storage type and image format.
+* avoids usage of temporary storage
+* we can define a new and simple archive format, which is able to
+  store sparse files efficiently.
+
+Note: Storing sparse files is a mess with existing archive
+formats. For example, tar requires information about holes at the
+beginning of the archive.
+
+==Disadvantages==
+
+* we need to define a new archive format
+
+Note: Most existing archive formats are optimized to store small files
+including file attributes. We simply do not need that for VM archives.
+
+* archive contains data 'out of order'
+
+If you want to access image data in sequential order, you need to
+re-order the archive data. It would be possible to do that on the fly,
+using temporary files.
+
+Fortunately, a normal restore/extract works perfectly with 'out of
+order' data, because the target files are seekable.
+
+* slow backup storage can slow down the VM during backup
+
+It is important to note that we only do sequential writes to the
+backup storage. Furthermore, one can compress the backup stream. IMHO,
+it is better to slow down the VM a bit. All other solutions create
+large amounts of temporary data during backup.
+
+=Archive format requirements=
+
+The basic requirement for such a new format is that we can store image
+data 'out of order'. It is also very likely that we have fewer than 256
+drives/images per VM, and we want to be able to store VM configuration
+files.
+
+We have defined a very simple format with those properties, see:
+
+docs/specs/vma_spec.txt
+
+Please let us know if you know an existing format which provides the
+same functionality.
+
+
-- 
1.7.2.5


* [Qemu-devel] [PATCH 2/5] add basic backup support to block driver
From: Dietmar Maurer @ 2012-11-21  9:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, Dietmar Maurer

Function bdrv_backup_init() creates a block job to back up a block device.

We call brdv_co_backup_cow() for each write during backup. That function
reads the original data and passes it to backup_dump_cb().

The tracked_request infrastructure is used to serialize access.

Currently, the backup cluster size is hardcoded to 65536 bytes.
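
A hedged sketch of how a caller might drive this API (callback names and
the archive state are illustrative; the real caller arrives with the
monitor commands in a later patch):

    /* Called once per 64 KiB cluster with the old data to archive. */
    static int my_dump_cb(void *opaque, BlockDriverState *bs,
                          int64_t cluster_num, unsigned char *buf)
    {
        /* write BACKUP_CLUSTER_SIZE bytes for cluster_num, out of order */
        return 0; /* returning < 0 aborts the backup */
    }

    static void my_complete_cb(void *opaque, int ret)
    {
        /* ret is 0 on success, < 0 on error or cancel */
    }

    BackupInfo *info = bdrv_backup_init(bs, my_dump_cb, my_complete_cb,
                                        archive_state);
    if (!info) {
        /* device is anonymous or a backup is already running */
    }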

Signed-off-by: Dietmar Maurer <dietmar@proxmox.com>
---
 Makefile.objs |    1 +
 backup.c      |  296 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 block.c       |   60 +++++++++++-
 block.h       |    6 +
 block_int.h   |   28 ++++++
 5 files changed, 388 insertions(+), 3 deletions(-)
 create mode 100644 backup.c

diff --git a/Makefile.objs b/Makefile.objs
index 3c7abca..cb46be5 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -48,6 +48,7 @@ coroutine-obj-$(CONFIG_WIN32) += coroutine-win32.o
 block-obj-y = iov.o cache-utils.o qemu-option.o module.o async.o
 block-obj-y += nbd.o block.o blockjob.o aes.o qemu-config.o
 block-obj-y += thread-pool.o qemu-progress.o qemu-sockets.o uri.o notify.o
+block-obj-y += backup.o
 block-obj-y += $(coroutine-obj-y) $(qobject-obj-y) $(version-obj-y)
 block-obj-$(CONFIG_POSIX) += event_notifier-posix.o aio-posix.o
 block-obj-$(CONFIG_WIN32) += event_notifier-win32.o aio-win32.o
diff --git a/backup.c b/backup.c
new file mode 100644
index 0000000..f8c21f8
--- /dev/null
+++ b/backup.c
@@ -0,0 +1,296 @@
+/*
+ * QEMU backup
+ *
+ * Copyright (C) Proxmox Server Solutions
+ *
+ * Authors:
+ *  Dietmar Maurer (dietmar@proxmox.com)
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include <stdio.h>
+#include <errno.h>
+#include <unistd.h>
+
+#include "block.h"
+#include "block_int.h"
+#include "blockjob.h"
+
+/* #define DEBUG_BACKUP */
+
+#ifdef DEBUG_BACKUP
+#define DPRINTF(fmt, ...) \
+    do { printf("backup: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) \
+    do { } while (0)
+#endif
+
+
+#define BITS_PER_LONG  (sizeof(unsigned long) * 8)
+
+static int backup_get_bitmap(BlockDriverState *bs, int64_t cluster_num)
+{
+    assert(bs);
+    assert(bs->backup_info);
+    assert(bs->backup_info->bitmap);
+
+    unsigned long val, idx, bit;
+
+    idx = cluster_num / BITS_PER_LONG;
+
+    assert(bs->backup_info->bitmap_size > idx);
+
+    bit = cluster_num % BITS_PER_LONG;
+    val = bs->backup_info->bitmap[idx];
+
+    return !!(val & (1UL << bit));
+}
+
+static void backup_set_bitmap(BlockDriverState *bs, int64_t cluster_num,
+                              int dirty)
+{
+    assert(bs);
+    assert(bs->backup_info);
+    assert(bs->backup_info->bitmap);
+
+    unsigned long val, idx, bit;
+
+    idx = cluster_num / BITS_PER_LONG;
+
+    assert(bs->backup_info->bitmap_size > idx);
+
+    bit = cluster_num % BITS_PER_LONG;
+    val = bs->backup_info->bitmap[idx];
+    if (dirty) {
+        if (!(val & (1UL << bit))) {
+            val |= 1UL << bit;
+        }
+    } else {
+        if (val & (1UL << bit)) {
+            val &= ~(1UL << bit);
+        }
+    }
+    bs->backup_info->bitmap[idx] = val;
+}
+
+static int backup_in_progress_count;
+
+int coroutine_fn brdv_co_backup_cow(BlockDriverState *bs,
+                                     int64_t sector_num, int nb_sectors)
+{
+    assert(bs);
+
+    BackupInfo *info = bs->backup_info;
+    BlockDriver *drv = bs->drv;
+    struct iovec iov;
+    QEMUIOVector bounce_qiov;
+    void *bounce_buffer = NULL;
+    int ret = 0;
+
+    assert(info);
+
+    backup_in_progress_count++;
+
+    int64_t start, end;
+
+    start = sector_num / BACKUP_BLOCKS_PER_CLUSTER;
+    end = (sector_num + nb_sectors + BACKUP_BLOCKS_PER_CLUSTER - 1) /
+        BACKUP_BLOCKS_PER_CLUSTER;
+
+    DPRINTF("brdv_co_backup_cow enter %s C%zd %zd %d\n",
+            bdrv_get_device_name(bs), start, sector_num, nb_sectors);
+
+    for (; start < end; start++) {
+        if (backup_get_bitmap(bs, start)) {
+            DPRINTF("brdv_co_backup_cow skip C%zd\n", start);
+            continue; /* already copied */
+        }
+
+        /* immediately set bitmap (avoid coroutine race) */
+        backup_set_bitmap(bs, start, 1);
+
+        DPRINTF("brdv_co_backup_cow C%zd\n", start);
+
+        if (!bounce_buffer) {
+            iov.iov_len = BACKUP_CLUSTER_SIZE;
+            iov.iov_base = bounce_buffer = qemu_blockalign(bs, iov.iov_len);
+            qemu_iovec_init_external(&bounce_qiov, &iov, 1);
+        }
+
+        ret = drv->bdrv_co_readv(bs, start * BACKUP_BLOCKS_PER_CLUSTER,
+                                 BACKUP_BLOCKS_PER_CLUSTER,
+                                 &bounce_qiov);
+        if (ret < 0) {
+            DPRINTF("brdv_co_backup_cow bdrv_read C%zd failed\n", start);
+            goto out;
+        }
+
+        ret = info->backup_dump_cb(info->opaque, bs, start, bounce_buffer);
+        if (ret < 0) {
+            DPRINTF("brdv_co_backup_cow dump_cluster_cb C%zd failed\n", start);
+            goto out;
+        }
+
+        DPRINTF("brdv_co_backup_cow done C%zd\n", start);
+    }
+
+out:
+    if (bounce_buffer) {
+        qemu_vfree(bounce_buffer);
+    }
+
+    backup_in_progress_count--;
+
+    return ret;
+}
+
+void bdrv_backup_deinit(BlockDriverState *bs)
+{
+    BackupInfo *info = bs->backup_info;
+
+    if (!info) {
+        return;
+    }
+
+    DPRINTF("backup_deinit %s\n", bdrv_get_device_name(bs));
+
+    g_free(info->bitmap);
+    g_free(info);
+
+    bs->backup_info = NULL;
+}
+
+typedef struct BackupBlockJob {
+    BlockJob common;
+} BackupBlockJob;
+
+static BlockJobType backup_job_type = {
+    .instance_size = sizeof(BackupBlockJob),
+    .job_type      = "backup",
+};
+
+static void coroutine_fn backup_run(void *opaque)
+{
+    BackupBlockJob *job = opaque;
+    BlockDriverState *bs = job->common.bs;
+    assert(bs);
+    assert(bs->backup_info);
+
+    int64_t start, end;
+
+    start = 0;
+    end = (bs->total_sectors + BACKUP_BLOCKS_PER_CLUSTER - 1) /
+        BACKUP_BLOCKS_PER_CLUSTER;
+
+    DPRINTF("backup_run start %s %zd %zd\n", bdrv_get_device_name(bs),
+            start, end);
+
+    int ret = 0;
+
+    for (; start < end; start++) {
+        if (block_job_is_cancelled(&job->common)) {
+            ret = -1;
+            break;
+        }
+
+        if (backup_get_bitmap(bs, start)) {
+            continue; /* already copied */
+        }
+
+        /* we need to yield so that qemu_aio_flush() returns
+         * (without it, the VM does not reboot).
+         * todo: can we avoid that?
+         */
+        co_sleep_ns(rt_clock, 0);
+        if (block_job_is_cancelled(&job->common)) {
+            ret = -1;
+            break;
+        }
+        DPRINTF("backup_run loop C%zd\n", start);
+
+        /**
+         * This triggers a cluster copy.
+         * Note: avoid calling brdv_co_backup_cow() directly, because
+         * that would not call tracked_request_begin().
+         */
+        ret = bdrv_co_backup(bs, start*BACKUP_BLOCKS_PER_CLUSTER, 1);
+        if (ret < 0) {
+            break;
+        }
+        /* Publish progress */
+        job->common.offset += BACKUP_CLUSTER_SIZE;
+    }
+
+    while (backup_in_progress_count > 0) {
+        DPRINTF("backup_run backup_in_progress_count != 0 (%d)",
+                backup_in_progress_count);
+        co_sleep_ns(rt_clock, 10000);
+    }
+
+    DPRINTF("backup_run complete %d\n", ret);
+    block_job_completed(&job->common, ret);
+}
+
+static void backup_job_cleanup_cb(void *opaque, int ret)
+{
+    BlockDriverState *bs = opaque;
+
+    DPRINTF("backup_job_cleanup_cb start %d\n", ret);
+
+    bs->backup_info->backup_complete_cb(bs->backup_info->opaque, ret);
+
+    DPRINTF("backup_job_cleanup_cb end\n");
+
+    bdrv_backup_deinit(bs);
+}
+
+BackupInfo *
+bdrv_backup_init(BlockDriverState *bs, BackupDumpFunc *backup_dump_cb,
+                 BlockDriverCompletionFunc *backup_complete_cb,
+                 void *opaque)
+{
+    assert(bs);
+    assert(backup_dump_cb);
+    assert(backup_complete_cb);
+
+    if (bs->backup_info) {
+        DPRINTF("bdrv_backup_init already initialized %s\n",
+                bdrv_get_device_name(bs));
+        return NULL;
+    }
+
+    int64_t bitmap_size;
+    const char *devname = bdrv_get_device_name(bs);
+
+    if (!devname || !devname[0]) {
+        return NULL;
+    }
+
+    BackupInfo *info = g_malloc0(sizeof(BackupInfo));
+
+    DPRINTF("bdrv_backup_init %s\n", bdrv_get_device_name(bs));
+
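+    /* one bit per cluster, rounded up to a whole number of unsigned longs */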
+    bitmap_size = bs->total_sectors +
+        BACKUP_BLOCKS_PER_CLUSTER * BITS_PER_LONG - 1;
+    bitmap_size /= BACKUP_BLOCKS_PER_CLUSTER * BITS_PER_LONG;
+
+    info->backup_dump_cb = backup_dump_cb;
+    info->backup_complete_cb = backup_complete_cb;
+    info->opaque = opaque;
+    info->bitmap_size = bitmap_size;
+    info->bitmap = g_new0(unsigned long, bitmap_size);
+
+    Error *errp = NULL;
+    BackupBlockJob *job = block_job_create(&backup_job_type, bs, 0,
+                                           backup_job_cleanup_cb, bs, &errp);
+    if (!job) {
+        error_free(errp);
+        g_free(info->bitmap);
+        g_free(info);
+        return NULL;
+    }
+
+    bs->backup_info = info;
+
+    job->common.len = bs->total_sectors*BDRV_SECTOR_SIZE;
+    job->common.co = qemu_coroutine_create(backup_run);
+    qemu_coroutine_enter(job->common.co, job);
+
+    return info;
+}
diff --git a/block.c b/block.c
index 854ebd6..48e9b68 100644
--- a/block.c
+++ b/block.c
@@ -54,6 +54,7 @@
 typedef enum {
     BDRV_REQ_COPY_ON_READ = 0x1,
     BDRV_REQ_ZERO_WRITE   = 0x2,
+    BDRV_REQ_BACKUP_ONLY  = 0x4,
 } BdrvRequestFlags;
 
 static void bdrv_dev_change_media_cb(BlockDriverState *bs, bool load);
@@ -1132,6 +1133,11 @@ void bdrv_close(BlockDriverState *bs)
         block_job_cancel_sync(bs->job);
     }
     bdrv_drain_all();
+
+    if (bs->backup_info) {
+        bdrv_backup_deinit(bs);
+    }
+
     notifier_list_notify(&bs->close_notifiers, bs);
 
     if (bs->drv) {
@@ -1541,7 +1547,7 @@ int bdrv_commit(BlockDriverState *bs)
 
     if (!drv)
         return -ENOMEDIUM;
-    
+
     if (!bs->backing_hd) {
         return -ENOTSUP;
     }
@@ -1678,6 +1684,20 @@ static void round_to_clusters(BlockDriverState *bs,
     }
 }
 
+/**
+ * Round a region to backup cluster boundaries
+ */
+static void round_to_backup_clusters(BlockDriverState *bs,
+                                     int64_t sector_num, int nb_sectors,
+                                     int64_t *cluster_sector_num,
+                                     int *cluster_nb_sectors)
+{
+    int64_t c = BACKUP_BLOCKS_PER_CLUSTER;
+    *cluster_sector_num = QEMU_ALIGN_DOWN(sector_num, c);
+    *cluster_nb_sectors = QEMU_ALIGN_UP(sector_num - *cluster_sector_num +
+                                        nb_sectors, c);
+}
+
 static bool tracked_request_overlaps(BdrvTrackedRequest *req,
                                      int64_t sector_num, int nb_sectors) {
     /*        aaaa   bbbb */
@@ -1708,6 +1728,11 @@ static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
     round_to_clusters(bs, sector_num, nb_sectors,
                       &cluster_sector_num, &cluster_nb_sectors);
 
+    if (bs->backup_info) {
+        round_to_backup_clusters(bs, sector_num, nb_sectors,
+                                 &cluster_sector_num, &cluster_nb_sectors);
+    }
+
     do {
         retry = false;
         QLIST_FOREACH(req, &bs->tracked_requests, list) {
@@ -2277,12 +2302,22 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
         bs->copy_on_read_in_flight++;
     }
 
-    if (bs->copy_on_read_in_flight) {
+    if (bs->copy_on_read_in_flight || bs->backup_info) {
         wait_for_overlapping_requests(bs, sector_num, nb_sectors);
     }
 
     tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
 
+    if (flags & BDRV_REQ_BACKUP_ONLY) {
+        /* Note: We do not return any data to the caller */
+        if (bs->backup_info) {
+            ret = brdv_co_backup_cow(bs, sector_num, nb_sectors);
+        } else {
+            ret = -1;
+        }
+        goto out;
+    }
+
     if (flags & BDRV_REQ_COPY_ON_READ) {
         int pnum;
 
@@ -2326,6 +2361,17 @@ int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
                             BDRV_REQ_COPY_ON_READ);
 }
 
+int coroutine_fn bdrv_co_backup(BlockDriverState *bs,
+    int64_t sector_num, int nb_sectors)
+{
+    if (!bs->backup_info) {
+        return -ENOTSUP;
+    }
+
+    return bdrv_co_do_readv(bs, sector_num, nb_sectors, NULL,
+                            BDRV_REQ_BACKUP_ONLY);
+}
+
 static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
     int64_t sector_num, int nb_sectors)
 {
@@ -2383,12 +2429,19 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
         bdrv_io_limits_intercept(bs, true, nb_sectors);
     }
 
-    if (bs->copy_on_read_in_flight) {
+    if (bs->copy_on_read_in_flight || bs->backup_info) {
         wait_for_overlapping_requests(bs, sector_num, nb_sectors);
     }
 
     tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
 
+    if (bs->backup_info) {
+        ret = brdv_co_backup_cow(bs, sector_num, nb_sectors);
+        if (ret < 0) {
+            goto out;
+        }
+    }
+
     if (flags & BDRV_REQ_ZERO_WRITE) {
         ret = bdrv_co_do_write_zeroes(bs, sector_num, nb_sectors);
     } else {
@@ -2407,6 +2460,7 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
         bs->wr_highest_sector = sector_num + nb_sectors - 1;
     }
 
+out:
     tracked_request_end(&req);
 
     return ret;
diff --git a/block.h b/block.h
index 722c620..630fabd 100644
--- a/block.h
+++ b/block.h
@@ -90,6 +90,10 @@ typedef struct BlockDevOps {
 #define BDRV_SECTOR_SIZE   (1ULL << BDRV_SECTOR_BITS)
 #define BDRV_SECTOR_MASK   ~(BDRV_SECTOR_SIZE - 1)
 
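+/* backup copies data in clusters of 64 KiB (128 sectors of 512 bytes) */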
+#define BACKUP_CLUSTER_BITS 16
+#define BACKUP_CLUSTER_SIZE (1<<BACKUP_CLUSTER_BITS)
+#define BACKUP_BLOCKS_PER_CLUSTER (BACKUP_CLUSTER_SIZE/BDRV_SECTOR_SIZE)
+
 typedef enum {
     BDRV_ACTION_REPORT, BDRV_ACTION_IGNORE, BDRV_ACTION_STOP
 } BlockErrorAction;
@@ -172,6 +176,8 @@ int coroutine_fn bdrv_co_readv(BlockDriverState *bs, int64_t sector_num,
     int nb_sectors, QEMUIOVector *qiov);
 int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
+int coroutine_fn bdrv_co_backup(BlockDriverState *bs,
+    int64_t sector_num, int nb_sectors);
 int coroutine_fn bdrv_co_writev(BlockDriverState *bs, int64_t sector_num,
     int nb_sectors, QEMUIOVector *qiov);
 /*
diff --git a/block_int.h b/block_int.h
index 9deedb8..08122ba 100644
--- a/block_int.h
+++ b/block_int.h
@@ -68,6 +68,32 @@ typedef struct BlockIOBaseValue {
     uint64_t ios[2];
 } BlockIOBaseValue;
 
+/**
+ * Backup related definitions
+ */
+
+typedef int BackupDumpFunc(void *opaque, BlockDriverState *bs,
+                           int64_t cluster_num, unsigned char *buf);
+
+typedef struct BackupInfo {
+    unsigned long *bitmap;
+    int bitmap_size;
+    BackupDumpFunc *backup_dump_cb;
+    BlockDriverCompletionFunc *backup_complete_cb;
+    void *opaque;
+} BackupInfo;
+
+BackupInfo *bdrv_backup_init(BlockDriverState *bs,
+                             BackupDumpFunc *backup_dump_cb,
+                             BlockDriverCompletionFunc *backup_complete_cb,
+                             void *opaque);
+
+void bdrv_backup_deinit(BlockDriverState *bs);
+
+int coroutine_fn brdv_co_backup_cow(BlockDriverState *bs,
+                                    int64_t sector_num, int nb_sectors);
+
+
 struct BlockDriver {
     const char *format_name;
     int instance_size;
@@ -276,6 +302,8 @@ struct BlockDriverState {
 
     QLIST_HEAD(, BdrvTrackedRequest) tracked_requests;
 
+    BackupInfo *backup_info;
+
     /* long-running background operation */
     BlockJob *job;
 
-- 
1.7.2.5


* [Qemu-devel] [PATCH 3/5] introduce new vma archive format
From: Dietmar Maurer @ 2012-11-21  9:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, Dietmar Maurer

This is a very simple archive format; see docs/specs/vma_spec.txt.

Signed-off-by: Dietmar Maurer <dietmar@proxmox.com>
---
 Makefile                |    3 +-
 docs/specs/vma_spec.txt |   24 ++
 vma-reader.c            |  720 +++++++++++++++++++++++++++++++++++++++++
 vma-writer.c            |  824 +++++++++++++++++++++++++++++++++++++++++++++++
 vma.c                   |  526 ++++++++++++++++++++++++++++++
 vma.h                   |  139 ++++++++
 6 files changed, 2235 insertions(+), 1 deletions(-)
 create mode 100644 docs/specs/vma_spec.txt
 create mode 100644 vma-reader.c
 create mode 100644 vma-writer.c
 create mode 100644 vma.c
 create mode 100644 vma.h

diff --git a/Makefile b/Makefile
index 3e8d441..3433dc2 100644
--- a/Makefile
+++ b/Makefile
@@ -100,7 +100,7 @@ defconfig:
 
 -include config-all-devices.mak
 
-all: $(DOCS) $(TOOLS) $(HELPERS-y) recurse-all
+all: $(DOCS) $(TOOLS) vma$(EXESUF) $(HELPERS-y) recurse-all
 
 config-host.h: config-host.h-timestamp
 config-host.h-timestamp: config-host.mak
@@ -194,6 +194,7 @@ tools-obj-$(CONFIG_POSIX) += compatfd.o
 qemu-img$(EXESUF): qemu-img.o $(tools-obj-y) $(block-obj-y) libqemustub.a
 qemu-nbd$(EXESUF): qemu-nbd.o $(tools-obj-y) $(block-obj-y) libqemustub.a
 qemu-io$(EXESUF): qemu-io.o cmd.o $(tools-obj-y) $(block-obj-y) libqemustub.a
+vma$(EXESUF): vma.o vma-writer.o vma-reader.o $(tools-obj-y) $(block-obj-y) libqemustub.a
 
 qemu-bridge-helper$(EXESUF): qemu-bridge-helper.o
 
diff --git a/docs/specs/vma_spec.txt b/docs/specs/vma_spec.txt
new file mode 100644
index 0000000..28eb292
--- /dev/null
+++ b/docs/specs/vma_spec.txt
@@ -0,0 +1,24 @@
+=Virtual Machine Archive format (VMA)=
+
+This format contains a header which includes the VM configuration as
+binary blobs, and a list of devices (dev_id, name).
+
+The actual VM image data is stored inside extends. An extend contains
+up to 64 clusters, and starts with a 512 byte header containing
+additional information for those clusters.
+
+We use a cluster size of 65536 bytes, and 8 bytes for each
+cluster in the extend header to store the following information:
+
+* 1 byte dev_id (to identify the drive)
+* 2 bytes zero indicator (marks zero regions (16x4096))
+* 4 bytes cluster number
+* 1 byte not used (reserved)
+
+We only store non-zero blocks (each such block is 4096 bytes).
+
+Each archive is marked with a unique UUID. The archive header and all
+extend headers include that UUID and an MD5 checksum (over the header
+data).
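+
+For illustration, each 8-byte entry can be decoded the way vma-reader.c
+does (entries are stored big-endian; this is a sketch, not normative):
+
+  uint64_t info        = GUINT64_FROM_BE(blockinfo[i]);
+  uint32_t cluster_num = info & 0xffffffff;   /* 4 bytes: cluster number */
+  uint8_t  dev_id      = (info >> 32) & 0xff; /* 1 byte: drive id */
+  uint16_t mask        = info >> 48;          /* 2 bytes: bit n clear means
+                                                 4096-byte block n is zero
+                                                 and therefore not stored */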
+
+
diff --git a/vma-reader.c b/vma-reader.c
new file mode 100644
index 0000000..7a54de5
--- /dev/null
+++ b/vma-reader.c
@@ -0,0 +1,720 @@
+/*
+ * VMA: Virtual Machine Archive
+ *
+ * Copyright (C) Proxmox Server Solutions
+ *
+ * Authors:
+ *  Dietmar Maurer (dietmar@proxmox.com)
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include <stdio.h>
+#include <errno.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <glib.h>
+#include <uuid/uuid.h>
+
+#include "qemu-common.h"
+#include "qemu_socket.h"
+#include "qemu-coroutine.h"
+#include "qemu-aio.h"
+#include "qemu/ratelimit.h"
+#include "vma.h"
+#include "block.h"
+
+/* TODO:
+
+   check if we restored all data (use bitmaps?)
+
+ */
+
+#define BITS_PER_LONG  (sizeof(unsigned long) * 8)
+
+static unsigned char zero_vma_block[VMA_BLOCK_SIZE];
+
+typedef struct VmaRestoreState {
+    BlockDriverState *bs;
+    bool write_zeroes;
+    unsigned long *bitmap;
+    int bitmap_size;
+} VmaRestoreState;
+
+struct VmaReader {
+    int fd;
+    GChecksum *md5csum;
+    GHashTable *blob_hash;
+    unsigned char *head_data;
+    VmaDeviceInfo devinfo[256];
+    VmaRestoreState rstate[256];
+    GList *cdata_list;
+};
+
+static guint
+g_int32_hash(gconstpointer v)
+{
+    return *(const uint32_t *)v;
+}
+
+static gboolean
+g_int32_equal(gconstpointer v1, gconstpointer v2)
+{
+    return *((const uint32_t *)v1) == *((const uint32_t *)v2);
+}
+
+static int vma_reader_get_bitmap(VmaRestoreState *rstate, int64_t cluster_num)
+{
+    assert(rstate);
+    assert(rstate->bitmap);
+
+    unsigned long val, idx, bit;
+
+    idx = cluster_num / BITS_PER_LONG;
+
+    assert(rstate->bitmap_size > idx);
+
+    bit = cluster_num % BITS_PER_LONG;
+    val = rstate->bitmap[idx];
+
+    return !!(val & (1UL << bit));
+}
+
+static void vma_reader_set_bitmap(VmaRestoreState *rstate, int64_t cluster_num,
+                                  int dirty)
+{
+    assert(rstate);
+    assert(rstate->bitmap);
+
+    unsigned long val, idx, bit;
+
+    idx = cluster_num / BITS_PER_LONG;
+
+    assert(rstate->bitmap_size > idx);
+
+    bit = cluster_num % BITS_PER_LONG;
+    val = rstate->bitmap[idx];
+    if (dirty) {
+        if (!(val & (1UL << bit))) {
+            val |= 1UL << bit;
+        }
+    } else {
+        if (val & (1UL << bit)) {
+            val &= ~(1UL << bit);
+        }
+    }
+    rstate->bitmap[idx] = val;
+}
+
+typedef struct VmaBlob {
+    uint32_t start;
+    uint32_t len;
+    unsigned char *data;
+} VmaBlob;
+
+static const VmaBlob *get_header_blob(VmaReader *vmar, uint32_t pos)
+{
+    assert(vmar);
+    assert(vmar->blob_hash);
+
+    return g_hash_table_lookup(vmar->blob_hash, &pos);
+}
+
+static const unsigned char *get_header_str(VmaReader *vmar, uint32_t pos)
+{
+    const VmaBlob *blob = get_header_blob(vmar, pos);
+    if (!blob || (blob->data[blob->len-1] != '\0')) {
+        return NULL;
+    }
+    return blob->data;
+}
+
+static ssize_t
+safe_read(int fd, unsigned char *buf, size_t count)
+{
+    ssize_t n;
+
+    do {
+        n = read(fd, buf, count);
+    } while (n < 0 && errno == EINTR);
+
+    return n;
+}
+
+static ssize_t
+full_read(int fd, unsigned char *buf, size_t len)
+{
+    ssize_t n;
+    size_t total;
+
+    total = 0;
+
+    while (len > 0) {
+        n = safe_read(fd, buf, len);
+
+        if (n == 0) {
+            return total;
+        }
+
+        if (n <= 0) {
+            break;
+        }
+
+        buf += n;
+        total += n;
+        len -= n;
+    }
+
+    if (len) {
+        return -1;
+    }
+
+    return total;
+}
+
+void vma_reader_destroy(VmaReader *vmar)
+{
+    assert(vmar);
+
+    if (vmar->fd >= 0) {
+        close(vmar->fd);
+    }
+
+    if (vmar->cdata_list) {
+        g_list_free(vmar->cdata_list);
+    }
+
+    int i;
+    for (i = 1; i < 256; i++) {
+        if (vmar->rstate[i].bitmap) {
+            g_free(vmar->rstate[i].bitmap);
+        }
+    }
+
+    if (vmar->md5csum) {
+        g_checksum_free(vmar->md5csum);
+    }
+
+    if (vmar->blob_hash) {
+        g_hash_table_destroy(vmar->blob_hash);
+    }
+
+    if (vmar->head_data) {
+        g_free(vmar->head_data);
+    }
+
+    g_free(vmar);
+
+}
+
+static int vma_reader_read_head(VmaReader *vmar, Error **errp)
+{
+    assert(vmar);
+    assert(errp);
+    assert(*errp == NULL);
+
+    unsigned char md5sum[16];
+    int i;
+    int ret = 0;
+
+    vmar->head_data = g_malloc(sizeof(VmaHeader));
+
+    if (full_read(vmar->fd, vmar->head_data, sizeof(VmaHeader)) !=
+        sizeof(VmaHeader)) {
+        error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                  "can't read vma header - %s",
+                  errno ? strerror(errno) : "got EOF");
+        return -1;
+    }
+
+    VmaHeader *h = (VmaHeader *)vmar->head_data;
+
+    if (h->magic != VMA_MAGIC) {
+        error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                  "not a vma file - wrong magic number");
+        return -1;
+    }
+
+    uint32_t header_size = GUINT32_FROM_BE(h->header_size);
+    int need = header_size - sizeof(VmaHeader);
+    if (need <= 0) {
+        error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                  "wrong vma header size %d", header_size);
+        return -1;
+    }
+
+    vmar->head_data = g_realloc(vmar->head_data, header_size);
+    h = (VmaHeader *)vmar->head_data;
+
+    if (full_read(vmar->fd, vmar->head_data + sizeof(VmaHeader), need) !=
+        need) {
+        error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                  "can't read vma header data - %s",
+                  errno ? strerror(errno) : "got EOF");
+        return -1;
+    }
+
+    memcpy(md5sum, h->md5sum, 16);
+    memset(h->md5sum, 0, 16);
+
+    g_checksum_reset(vmar->md5csum);
+    g_checksum_update(vmar->md5csum, vmar->head_data, header_size);
+    gsize csize = 16;
+    g_checksum_get_digest(vmar->md5csum, (guint8 *)(h->md5sum), &csize);
+
+    if (memcmp(md5sum, h->md5sum, 16) != 0) {
+        error_set(errp, ERROR_CLASS_GENERIC_ERROR, "wrong vma header checksum");
+        return -1;
+    }
+
+    /* we can modify header data after checksum verify */
+    h->header_size = header_size;
+
+    h->version = GUINT32_FROM_BE(h->version);
+    if (h->version != 1) {
+        error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                  "wrong vma version %d", h->version);
+        return -1;
+    }
+
+    h->ctime = GUINT64_FROM_BE(h->ctime);
+    h->blob_buffer_offset = GUINT32_FROM_BE(h->blob_buffer_offset);
+    h->blob_buffer_size = GUINT32_FROM_BE(h->blob_buffer_size);
+
+    uint32_t bstart = h->blob_buffer_offset + 1;
+    uint32_t bend = h->blob_buffer_offset + h->blob_buffer_size;
+
+    if (bstart <= sizeof(VmaHeader)) {
+        error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                  "wrong vma blob buffer offset %d", h->blob_buffer_offset);
+        return -1;
+    }
+
+    if (bend > header_size) {
+        error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                  "wrong vma blob buffer size %d/%d", h->blob_buffer_offset,
+                  h->blob_buffer_size);
+        return -1;
+    }
+
+    while ((bstart + 2) <= bend) {
+        uint32_t size = vmar->head_data[bstart] +
+            (vmar->head_data[bstart+1] << 8);
+        if ((bstart + size + 2) <= bend) {
+            VmaBlob *blob = g_new0(VmaBlob, 1);
+            blob->start = bstart - h->blob_buffer_offset;
+            blob->len = size;
+            blob->data = vmar->head_data + bstart + 2;
+            g_hash_table_insert(vmar->blob_hash, &blob->start, blob);
+        }
+        bstart += size + 2;
+    }
+
+
+    int count = 0;
+    for (i = 1; i < 256; i++) {
+        VmaDeviceInfoHeader *dih = &h->dev_info[i];
+        uint32_t devname_ptr = GUINT32_FROM_BE(dih->devname_ptr);
+        uint64_t size = GUINT64_FROM_BE(dih->size);
+        const unsigned char *devname = get_header_str(vmar, devname_ptr);
+
+        if (size && devname) {
+            count++;
+            vmar->devinfo[i].size = size;
+            vmar->devinfo[i].devname = devname;
+        }
+    }
+
+    if (!count) {
+        error_set(errp, ERROR_CLASS_GENERIC_ERROR, "vma does not contain data");
+        return -1;
+    }
+
+    for (i = 0; i < VMA_MAX_CONFIGS; i++) {
+        uint32_t name_ptr = GUINT32_FROM_BE(h->config_names[i]);
+        uint32_t data_ptr = GUINT32_FROM_BE(h->config_data[i]);
+
+        if (!(name_ptr && data_ptr)) {
+            continue;
+        }
+        const unsigned char *name = get_header_str(vmar, name_ptr);
+        const VmaBlob *blob = get_header_blob(vmar, data_ptr);
+
+        if (!(name && blob)) {
+            error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                      "vma contains invalid data pointers");
+            return -1;
+        }
+
+        VmaConfigData *cdata = g_new0(VmaConfigData, 1);
+        cdata->name = name;
+        cdata->data = blob->data;
+        cdata->len = blob->len;
+
+        vmar->cdata_list = g_list_append(vmar->cdata_list, cdata);
+    }
+
+    return ret;
+}
+
+VmaReader *vma_reader_create(const char *filename, Error **errp)
+{
+    assert(filename);
+    assert(errp);
+
+    VmaReader *vmar = g_new0(VmaReader, 1);
+
+    vmar->fd = open(filename, O_RDONLY);
+
+    if (vmar->fd < 0) {
+        error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                  "can't open file %s - %s", filename, strerror(errno));
+        goto err;
+    }
+
+    vmar->md5csum = g_checksum_new(G_CHECKSUM_MD5);
+    if (!vmar->md5csum) {
+        error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                  "can't allocate checksum");
+        goto err;
+    }
+
+    vmar->blob_hash = g_hash_table_new_full(g_int32_hash, g_int32_equal,
+                                            NULL, g_free);
+
+    if (vma_reader_read_head(vmar, errp) < 0) {
+        goto err;
+    }
+
+    return vmar;
+
+err:
+    if (vmar) {
+        vma_reader_destroy(vmar);
+    }
+
+    return NULL;
+}
+
+VmaHeader *vma_reader_get_header(VmaReader *vmar)
+{
+    assert(vmar);
+    assert(vmar->head_data);
+
+    return (VmaHeader *)(vmar->head_data);
+}
+
+GList *vma_reader_get_config_data(VmaReader *vmar)
+{
+    assert(vmar);
+    assert(vmar->head_data);
+
+    return vmar->cdata_list;
+}
+
+VmaDeviceInfo *vma_reader_get_device_info(VmaReader *vmar, guint8 dev_id)
+{
+    assert(vmar);
+    assert(dev_id);
+
+    if (vmar->devinfo[dev_id].size && vmar->devinfo[dev_id].devname) {
+        return &vmar->devinfo[dev_id];
+    }
+
+    return NULL;
+}
+
+int vma_reader_register_bs(VmaReader *vmar, guint8 dev_id, BlockDriverState *bs,
+                           bool write_zeroes, Error **errp)
+{
+    assert(vmar);
+    assert(bs != NULL);
+    assert(dev_id);
+    assert(vmar->rstate[dev_id].bs == NULL);
+
+    int64_t size = bdrv_getlength(bs);
+    if (size != vmar->devinfo[dev_id].size) {
+        error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                  "vma_reader_register_bs for stream %s failed - "
+                  "unexpected size %zd != %zd", vmar->devinfo[dev_id].devname,
+                  size, vmar->devinfo[dev_id].size);
+        return -1;
+    }
+
+    vmar->rstate[dev_id].bs = bs;
+    vmar->rstate[dev_id].write_zeroes = write_zeroes;
+
+    int64_t bitmap_size = (size/BDRV_SECTOR_SIZE) +
+        (VMA_CLUSTER_SIZE/BDRV_SECTOR_SIZE) * BITS_PER_LONG - 1;
+    bitmap_size /= (VMA_CLUSTER_SIZE/BDRV_SECTOR_SIZE) * BITS_PER_LONG;
+
+    vmar->rstate[dev_id].bitmap_size = bitmap_size;
+    vmar->rstate[dev_id].bitmap = g_new0(unsigned long, bitmap_size);
+
+    return 0;
+}
+
+static int restore_extend(VmaReader *vmar, unsigned char *buf,
+                          int extend_size, Error **errp)
+{
+    assert(vmar);
+    assert(buf);
+
+    VmaExtendHeader *ehead = (VmaExtendHeader *)buf;
+    int start = VMA_EXTEND_HEADER_SIZE;
+    int i;
+
+    for (i = 0; i < VMA_BLOCKS_PER_EXTEND; i++) {
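+        /* blockinfo layout: bits 0-31 cluster number, bits 32-39 dev_id,
+         * bits 40-47 reserved, bits 48-63 non-zero block mask */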
+        uint64_t block_info = GUINT64_FROM_BE(ehead->blockinfo[i]);
+        uint32_t cluster_num = block_info & 0xffffffff;
+        uint8_t dev_id = (block_info >> 32) & 0xff;
+        uint16_t mask = block_info >> (32+16);
+
+        if (!dev_id) {
+            continue;
+        }
+
+        VmaRestoreState *rstate = &vmar->rstate[dev_id];
+        BlockDriverState *bs = rstate->bs;
+        if (!bs) {
+            error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                      "got wrong dev id %d", dev_id);
+            return -1;
+        }
+
+        if (vma_reader_get_bitmap(rstate, cluster_num)) {
+            error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                      "found duplicated cluster %d for stream %s", cluster_num,
+                      vmar->devinfo[dev_id].devname);
+            return -1;
+        }
+
+        vma_reader_set_bitmap(rstate, cluster_num, 1);
+
+        int64_t max_sector = vmar->devinfo[dev_id].size/BDRV_SECTOR_SIZE;
+
+        /* try to write whole clusters to speedup restore */
+        if (mask == 0xffff) {
+            if ((start + VMA_CLUSTER_SIZE) > extend_size) {
+                error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                          "short vma extend - too many blocks");
+                return -1;
+            }
+            int64_t sector_num = (cluster_num * VMA_CLUSTER_SIZE) /
+                BDRV_SECTOR_SIZE;
+            int64_t end_sector = sector_num +
+                VMA_CLUSTER_SIZE/BDRV_SECTOR_SIZE;
+
+            if (end_sector > max_sector) {
+                end_sector = max_sector;
+            }
+
+            if (end_sector <= sector_num) {
+                error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                          "got wrong block address - write bejond end");
+                return -1;
+            }
+
+            int res = bdrv_write(bs, sector_num, buf + start,
+                                 end_sector - sector_num);
+            if (res < 0) {
+                error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                          "bdrv_write to %s failed (%d)",
+                          bdrv_get_device_name(bs), res);
+                return -1;
+            }
+            start += VMA_CLUSTER_SIZE;
+        } else {
+            int j;
+            int bit = 1;
+
+            for (j = 0; j < 16; j++) {
+                int64_t sector_num = (cluster_num*VMA_CLUSTER_SIZE +
+                                      j*VMA_BLOCK_SIZE)/BDRV_SECTOR_SIZE;
+
+                int64_t end_sector = sector_num +
+                    VMA_BLOCK_SIZE/BDRV_SECTOR_SIZE;
+                if (end_sector > max_sector) {
+                    end_sector = max_sector;
+                }
+
+                int res;
+                if (mask & bit) {
+                    if ((start + VMA_BLOCK_SIZE) > extend_size) {
+                        error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                                  "short vma extend - too many blocks");
+                        return -1;
+                    }
+
+                    if (end_sector <= sector_num) {
+                        error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                                  "got wrong block address - write bejond end");
+                        return -1;
+                    } else {
+                        res = bdrv_write(bs, sector_num, buf + start,
+                                         end_sector - sector_num);
+                        if (res < 0) {
+                            error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                                      "bdrv_write to %s failed (%d)",
+                                      bdrv_get_device_name(bs), res);
+                            return -1;
+                        }
+                    }
+
+                    start += VMA_BLOCK_SIZE;
+
+                } else {
+
+                    if (rstate->write_zeroes && (end_sector > sector_num)) {
+                        /* Todo: use bdrv_co_write_zeroes (but that need to
+                         * be run inside coroutine?)
+                         */
+                        res = bdrv_write(bs, sector_num, zero_vma_block,
+                                         end_sector - sector_num);
+                        if (res < 0) {
+                            error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                                      "bdrv_write zeros to %s failed (%d)",
+                                      bdrv_get_device_name(bs), res);
+                            return -1;
+                        }
+                    }
+                }
+
+                bit = bit << 1;
+            }
+        }
+    }
+
+    if (start != extend_size) {
+        error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                  "vma extend error - missing blocks");
+        return -1;
+    }
+
+    return 0;
+}
+
+int vma_reader_restore(VmaReader *vmar, Error **errp)
+{
+    assert(vmar);
+    assert(vmar->head_data);
+
+    int ret = 0;
+    unsigned char buf[VMA_MAX_EXTEND_SIZE];
+    int buf_pos = 0;
+    unsigned char md5sum[16];
+    VmaHeader *h = (VmaHeader *)vmar->head_data;
+
+
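+    /* stream the archive extend by extend: refill buf, verify the extend
+     * header (checksum, uuid, magic), restore its blocks, then move any
+     * leftover bytes to the front of buf */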
+    while (1) {
+        int bytes = full_read(vmar->fd, buf + buf_pos, sizeof(buf) - buf_pos);
+        if (bytes < 0) {
+            error_set(errp, ERROR_CLASS_GENERIC_ERROR, "read failed - %s",
+                      strerror(errno));
+            return -1;
+        }
+
+        buf_pos += bytes;
+
+        if (!buf_pos) {
+            break; /* EOF */
+        }
+
+        if (buf_pos < VMA_EXTEND_HEADER_SIZE) {
+            error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                      "read short extend (%d bytes)", buf_pos);
+            return -1;
+        }
+
+        VmaExtendHeader *ehead = (VmaExtendHeader *)buf;
+
+        /* extract md5sum */
+        memcpy(md5sum, ehead->md5sum, sizeof(ehead->md5sum));
+        memset(ehead->md5sum, 0, sizeof(ehead->md5sum));
+
+        g_checksum_reset(vmar->md5csum);
+        g_checksum_update(vmar->md5csum, buf, VMA_EXTEND_HEADER_SIZE);
+        gsize csize = 16;
+        g_checksum_get_digest(vmar->md5csum, ehead->md5sum, &csize);
+
+        if (memcmp(md5sum, ehead->md5sum, 16) != 0) {
+            error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                      "wrong vma extend header chechsum");
+            return -1;
+        }
+
+        if (memcmp(h->uuid, ehead->uuid, sizeof(ehead->uuid)) != 0) {
+            error_set(errp, ERROR_CLASS_GENERIC_ERROR, "wrong vma extend uuid");
+            return -1;
+        }
+
+        if (ehead->magic != VMA_EXTEND_MAGIC || ehead->reserved1 != 0) {
+            error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                      "wrong vma extend header magic");
+            return -1;
+        }
+
+        int block_count = GUINT16_FROM_BE(ehead->block_count);
+        int extend_size = VMA_EXTEND_HEADER_SIZE + block_count*VMA_BLOCK_SIZE;
+
+        if (buf_pos < extend_size) {
+            error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                      "short vma extend (%d < %d)", buf_pos, extend_size);
+            return -1;
+        }
+
+        if (restore_extend(vmar, buf, extend_size, errp) < 0) {
+            return -1;
+        }
+
+        if (buf_pos > extend_size) {
+            memmove(buf, buf + extend_size, buf_pos - extend_size);
+            buf_pos = buf_pos - extend_size;
+        } else {
+            buf_pos = 0;
+        }
+    }
+
+    bdrv_drain_all();
+
+    int i;
+    for (i = 1; i < 256; i++) {
+        VmaRestoreState *rstate = &vmar->rstate[i];
+        if (!rstate->bs) {
+            continue;
+        }
+
+        if (bdrv_flush(rstate->bs) < 0) {
+            error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                      "vma bdrv_flush %s failed", vmar->devinfo[i].devname);
+            return -1;
+        }
+
+        if (vmar->devinfo[i].size) {
+            assert(rstate->bitmap);
+
+            int64_t cluster_num, end;
+
+            end = (vmar->devinfo[i].size + VMA_CLUSTER_SIZE - 1) /
+                VMA_CLUSTER_SIZE;
+
+            for (cluster_num = 0; cluster_num < end; cluster_num++) {
+                if (!vma_reader_get_bitmap(rstate, cluster_num)) {
+                    error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                              "detected missing cluster %zd for stream %s",
+                              cluster_num, vmar->devinfo[i].devname);
+                    return -1;
+                }
+            }
+        }
+    }
+
+    return ret;
+}
+
diff --git a/vma-writer.c b/vma-writer.c
new file mode 100644
index 0000000..6aaa45d
--- /dev/null
+++ b/vma-writer.c
@@ -0,0 +1,824 @@
+/*
+ * VMA: Virtual Machine Archive
+ *
+ * Copyright (C) Proxmox Server Solutions
+ *
+ * Authors:
+ *  Dietmar Maurer (dietmar@proxmox.com)
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include <stdio.h>
+#include <errno.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <glib.h>
+#include <uuid/uuid.h>
+
+#include "qemu-common.h"
+#include "qemu_socket.h"
+#include "qemu-coroutine.h"
+#include "qemu-aio.h"
+#include "qemu/ratelimit.h"
+#include "vma.h"
+#include "block.h"
+
+/* #define DEBUG_VMA */
+
+#ifdef DEBUG_VMA
+#define DPRINTF(fmt, ...) \
+    do { printf("vma: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) \
+    do { } while (0)
+#endif
+
+#define WRITE_BUFFERS 5
+
+typedef struct VmaAIOCB VmaAIOCB;
+struct VmaAIOCB {
+    VmaWriter *vmaw;
+    unsigned char buffer[VMA_MAX_EXTEND_SIZE];
+    size_t bytes;
+    Coroutine *co;
+};
+
+struct VmaWriter {
+    int fd;
+    FILE *cmd;
+    int status;
+    char errmsg[8192];
+    uuid_t uuid;
+    bool header_written;
+    bool closed;
+
+    /* we always write extends */
+    unsigned char outbuf[VMA_MAX_EXTEND_SIZE];
+    int outbuf_pos; /* in bytes */
+    int outbuf_count; /* in VMA_BLOCKS */
+    uint64_t outbuf_block_info[VMA_BLOCKS_PER_EXTEND];
+
+    VmaAIOCB aiocbs[WRITE_BUFFERS];
+    CoQueue wqueue;
+
+    GChecksum *md5csum;
+    CoMutex writer_lock;
+    CoMutex flush_lock;
+    Coroutine *co_writer;
+    RateLimit limit;
+    uint64_t delay_ns;
+
+    /* drive information */
+    VmaStreamInfo stream_info[256];
+    guint stream_count;
+
+    /* header blob table */
+    char *header_blob_table;
+    uint32_t header_blob_table_size;
+    uint32_t header_blob_table_pos;
+
+    /* store for config blobs */
+    uint32_t config_names[VMA_MAX_CONFIGS]; /* offset into blob_buffer table */
+    uint32_t config_data[VMA_MAX_CONFIGS];  /* offset into blob_buffer table */
+    uint32_t config_count;
+
+    /* statistics */
+    uint64_t zero_block_count;
+    uint64_t total_bytes_written;
+};
+
+void vma_writer_set_error(VmaWriter *vmaw, const char *fmt, ...)
+{
+    va_list ap;
+
+    if (vmaw->status < 0) {
+        return;
+    }
+
+    vmaw->status = -1;
+
+    va_start(ap, fmt);
+    g_vsnprintf(vmaw->errmsg, sizeof(vmaw->errmsg), fmt, ap);
+    va_end(ap);
+
+    DPRINTF("vma_writer_set_error: %s\n", vmaw->errmsg);
+}
+
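+/* Append a blob to the header blob table: a 2-byte little-endian length
+ * followed by the data. Returns the blob's offset in the table; offset 0
+ * is reserved and doubles as the failure indicator. */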
+static uint32_t allocate_header_blob(VmaWriter *vmaw, const char *data,
+                                     size_t len)
+{
+    if (len > 65535) {
+        return 0;
+    }
+
+    if (!vmaw->header_blob_table ||
+        (vmaw->header_blob_table_size <
+         (vmaw->header_blob_table_pos + len + 2))) {
+        int newsize = vmaw->header_blob_table_size + ((len + 2 + 511)/512)*512;
+
+        vmaw->header_blob_table = g_realloc(vmaw->header_blob_table, newsize);
+        memset(vmaw->header_blob_table + vmaw->header_blob_table_size,
+               0, newsize - vmaw->header_blob_table_size);
+        vmaw->header_blob_table_size = newsize;
+    }
+
+    uint32_t cpos = vmaw->header_blob_table_pos;
+    vmaw->header_blob_table[cpos] = len & 255;
+    vmaw->header_blob_table[cpos+1] = (len >> 8) & 255;
+    memcpy(vmaw->header_blob_table + cpos + 2, data, len);
+    vmaw->header_blob_table_pos += len + 2;
+    return cpos;
+}
+
+static uint32_t allocate_header_string(VmaWriter *vmaw, const char *str)
+{
+    assert(vmaw);
+
+    size_t len = strlen(str) + 1;
+
+    return allocate_header_blob(vmaw, str, len);
+}
+
+int vma_writer_add_config(VmaWriter *vmaw, const char *name, gpointer data,
+                          gsize len)
+{
+    assert(vmaw);
+    assert(!vmaw->header_written);
+    assert(vmaw->config_count < VMA_MAX_CONFIGS);
+    assert(name);
+    assert(data);
+    assert(len);
+
+    uint32_t name_ptr = allocate_header_string(vmaw, name);
+    if (!name_ptr) {
+        return -1;
+    }
+
+    uint32_t data_ptr = allocate_header_blob(vmaw, data, len);
+    if (!data_ptr) {
+        return -1;
+    }
+
+    vmaw->config_names[vmaw->config_count] = name_ptr;
+    vmaw->config_data[vmaw->config_count] = data_ptr;
+
+    vmaw->config_count++;
+
+    return 0;
+}
+
+int vma_writer_register_stream(VmaWriter *vmaw, const char *devname,
+                               size_t size)
+{
+    assert(vmaw);
+    assert(!vmaw->status);
+
+    if (vmaw->header_written) {
+        vma_writer_set_error(vmaw, "vma_writer_register_drive: header "
+                             "already written");
+        return -1;
+    }
+
+    guint n = vmaw->stream_count + 1;
+
+    /* we can have dev_ids from 1 to 255 (0 is reserved);
+     * 254 is reserved for the VM state and
+     * 255 (-1) is reserved for safety
+     */
+    if (n > 253) {
+        vma_writer_set_error(vmaw, "vma_writer_register_drive: "
+                             "too many drives");
+        return -1;
+    }
+
+    if (size <= 0) {
+        vma_writer_set_error(vmaw, "vma_writer_register_drive: "
+                             "got strange size %zd", size);
+        return -1;
+    }
+
+    DPRINTF("vma_writer_register_drive %s %zd %d", devname, size, n);
+
+    vmaw->stream_info[n].devname = g_strdup(devname);
+    vmaw->stream_info[n].size = size;
+
+    vmaw->stream_info[n].cluster_count = (size + VMA_CLUSTER_SIZE - 1) /
+        VMA_CLUSTER_SIZE;
+
+    vmaw->stream_count = n;
+
+    return n;
+}
+
+static void vma_co_continue_write(void *opaque)
+{
+    VmaWriter *vmaw = opaque;
+
+    qemu_aio_set_fd_handler(vmaw->fd, NULL, NULL, NULL, NULL);
+
+    DPRINTF("vma_co_continue_write\n");
+    qemu_coroutine_enter(vmaw->co_writer, NULL);
+}
+
+static ssize_t coroutine_fn
+vma_co_write(VmaWriter *vmaw, const void *buf, size_t bytes)
+{
+    size_t done = 0;
+    ssize_t ret;
+
+    /* atomic writes (we cannot interleave writes) */
+    qemu_co_mutex_lock(&vmaw->writer_lock);
+
+    DPRINTF("vma_co_write enter %zd\n", bytes);
+
+    while (done < bytes) {
+        ret = write(vmaw->fd, buf + done, bytes - done);
+        if (ret > 0) {
+            done += ret;
+            vmaw->total_bytes_written += ret;
+            DPRINTF("vma_co_write written %zd %zd\n", done, ret);
+        } else if (ret < 0) {
+            if (errno == EAGAIN || errno == EWOULDBLOCK) {
+                DPRINTF("vma_co_write yield %zd\n", done);
+
+                vmaw->co_writer = qemu_coroutine_self();
+                qemu_aio_set_fd_handler(vmaw->fd, NULL, vma_co_continue_write,
+                                        NULL, vmaw);
+
+                qemu_coroutine_yield();
+                DPRINTF("vma_co_write restart %zd\n", done);
+            } else {
+                vma_writer_set_error(vmaw, "vma_co_write write error - %s",
+                                     strerror(errno));
+                done = -1; /* always return failure for partial writes */
+                break;
+            }
+        } else if (ret == 0) {
+            /* should not happen - simply try again */
+        }
+    }
+
+    qemu_co_mutex_unlock(&vmaw->writer_lock);
+
+    DPRINTF("vma_co_write leave %zd\n", done);
+    return done;
+}
+
+static void coroutine_fn vma_co_writer_task(void *opaque)
+{
+    VmaAIOCB *cb = opaque;
+
+    DPRINTF("vma_co_writer_task start\n");
+
+    int64_t done = vma_co_write(cb->vmaw, cb->buffer, cb->bytes);
+    DPRINTF("vma_co_writer_task write done %zd\n", done);
+
+    if (done != cb->bytes) {
+        DPRINTF("vma_co_writer_task failed write %zd %zd", cb->bytes, done);
+        vma_writer_set_error(cb->vmaw, "vma_co_writer_task failed write %zd",
+                             done);
+    }
+
+    cb->bytes = 0;
+
+    qemu_co_queue_next(&cb->vmaw->wqueue);
+
+    DPRINTF("vma_co_writer_task end\n");
+}
+
+static void coroutine_fn vma_queue_flush(VmaWriter *vmaw)
+{
+    DPRINTF("vma_queue_flush enter\n");
+
+    assert(vmaw);
+
+    while (1) {
+        int i;
+        VmaAIOCB *cb = NULL;
+        for (i = 0; i < WRITE_BUFFERS; i++) {
+            if (vmaw->aiocbs[i].bytes) {
+                cb = &vmaw->aiocbs[i];
+                DPRINTF("FOUND USED AIO BUFFER %d %zd\n", i,
+                        vmaw->aiocbs[i].bytes);
+                break;
+            }
+        }
+        if (!cb) {
+            break;
+        }
+        qemu_co_queue_wait(&vmaw->wqueue);
+    }
+
+    DPRINTF("vma_queue_flush leave\n");
+}
+
+/**
+ * NOTE: the pipe buffer size is only 4096 bytes on Linux (see 'ulimit -a'),
+ * so we need to create a coroutine to allow 'parallel' execution.
+ */
+static ssize_t coroutine_fn
+vma_queue_write(VmaWriter *vmaw, const void *buf, size_t bytes)
+{
+    DPRINTF("vma_queue_write enter %zd\n", bytes);
+
+    assert(vmaw);
+    assert(buf);
+    assert(bytes <= VMA_MAX_EXTEND_SIZE);
+
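+    /* grab a free write buffer; if all WRITE_BUFFERS are in flight, wait
+     * on wqueue until a writer task finishes and frees one */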
+    VmaAIOCB *cb = NULL;
+    while (!cb) {
+        int i;
+        for (i = 0; i < WRITE_BUFFERS; i++) {
+            if (!vmaw->aiocbs[i].bytes) {
+                cb = &vmaw->aiocbs[i];
+                break;
+            }
+        }
+        if (!cb) {
+            qemu_co_queue_wait(&vmaw->wqueue);
+        }
+    }
+
+    memcpy(cb->buffer, buf, bytes);
+    cb->bytes = bytes;
+    cb->vmaw = vmaw;
+
+    DPRINTF("vma_queue_write start %zd\n", bytes);
+    cb->co = qemu_coroutine_create(vma_co_writer_task);
+    qemu_coroutine_enter(cb->co, cb);
+
+    DPRINTF("vma_queue_write leave\n");
+
+    return bytes;
+}
+
+VmaWriter *vma_writer_create(const char *filename, int64_t speed, Error **errp)
+{
+    const char *p;
+
+    assert(sizeof(VmaHeader) == (4096 + 8192));
+    assert(sizeof(VmaExtendHeader) == 512);
+
+    VmaWriter *vmaw = g_new0(VmaWriter, 1);
+    vmaw->fd = -1;
+
+    vmaw->md5csum = g_checksum_new(G_CHECKSUM_MD5);
+    if (!vmaw->md5csum) {
+        error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                  "can't allocate cmsum\n");
+        goto err;
+    }
+
+    if (strstart(filename, "exec:", &p)) {
+        vmaw->cmd = popen(p, "w");
+        if (vmaw->cmd == NULL) {
+            error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                      "can't popen command '%s' - %s\n", p, strerror(errno));
+            goto err;
+        }
+        vmaw->fd = fileno(vmaw->cmd);
+        socket_set_nonblock(vmaw->fd);
+
+    } else {
+        vmaw->fd = open(filename, O_NONBLOCK|O_WRONLY|O_CREAT|O_EXCL, 0644);
+        if (vmaw->fd < 0) {
+            error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                      "can't open file %s - %s\n", filename, strerror(errno));
+            goto err;
+        }
+    }
+
+    vmaw->outbuf_count = 0;
+    vmaw->outbuf_pos = VMA_EXTEND_HEADER_SIZE;
+
+    vmaw->header_blob_table_pos = 1; /* start at pos 1 */
+
+    qemu_co_mutex_init(&vmaw->writer_lock);
+    qemu_co_mutex_init(&vmaw->flush_lock);
+    qemu_co_queue_init(&vmaw->wqueue);
+
+    uuid_generate(vmaw->uuid);
+
+    if (speed <= 0) {
+        speed = 10*1024*1024*1024LLU; /* default 10GB/s */
+    }
+
+    ratelimit_set_speed(&vmaw->limit, speed, 100000000ULL /* 0.1 sec */);
+
+    return vmaw;
+
+err:
+    if (vmaw) {
+        if (vmaw->cmd) {
+            pclose(vmaw->cmd);
+        } else if (vmaw->fd >= 0) {
+            close(vmaw->fd);
+        }
+
+        if (vmaw->md5csum) {
+            g_checksum_free(vmaw->md5csum);
+        }
+
+        g_free(vmaw);
+    }
+
+    return NULL;
+}
+
+static int coroutine_fn vma_write_header(VmaWriter *vmaw)
+{
+    assert(vmaw);
+    int header_clusters = 8;
+    char buf[65536*header_clusters];
+    VmaHeader *head = (VmaHeader *)buf;
+
+    int i;
+
+    DPRINTF("VMA WRITE HEADER\n");
+
+    if (vmaw->status < 0) {
+        return vmaw->status;
+    }
+
+    memset(buf, 0, sizeof(buf));
+
+    head->magic = VMA_MAGIC;
+    head->version = GUINT32_TO_BE(1); /* v1 */
+    memcpy(head->uuid, vmaw->uuid, 16);
+
+    time_t ctime = time(NULL);
+    head->ctime = GUINT64_TO_BE(ctime);
+
+    if (!vmaw->stream_count) {
+        return -1;
+    }
+
+    for (i = 0; i < VMA_MAX_CONFIGS; i++) {
+        head->config_names[i] = GUINT32_TO_BE(vmaw->config_names[i]);
+        head->config_data[i] = GUINT32_TO_BE(vmaw->config_data[i]);
+    }
+
+    /* 32 bytes per device (12 used currently) = 8192 bytes max */
+    for (i = 1; i <= 254; i++) {
+        VmaStreamInfo *si = &vmaw->stream_info[i];
+        if (si->size) {
+            assert(si->devname);
+            uint32_t devname_ptr = allocate_header_string(vmaw, si->devname);
+            if (!devname_ptr) {
+                return -1;
+            }
+            head->dev_info[i].devname_ptr = GUINT32_TO_BE(devname_ptr);
+            head->dev_info[i].size = GUINT64_TO_BE(si->size);
+        }
+    }
+
+    uint32_t header_size = sizeof(VmaHeader) + vmaw->header_blob_table_size;
+    head->header_size = GUINT32_TO_BE(header_size);
+
+    if (header_size > sizeof(buf)) {
+        return -1; /* just to be sure */
+    }
+
+    uint32_t blob_buffer_offset = sizeof(VmaHeader);
+    memcpy(buf + blob_buffer_offset, vmaw->header_blob_table,
+           vmaw->header_blob_table_size);
+    head->blob_buffer_offset = GUINT32_TO_BE(blob_buffer_offset);
+    head->blob_buffer_size = GUINT32_TO_BE(vmaw->header_blob_table_pos);
+
+    g_checksum_reset(vmaw->md5csum);
+    g_checksum_update(vmaw->md5csum, (const guchar *)buf, header_size);
+    gsize csize = 16;
+    g_checksum_get_digest(vmaw->md5csum, (guint8 *)(head->md5sum), &csize);
+
+    return vma_queue_write(vmaw, buf, header_size);
+}
+
+static int coroutine_fn vma_writer_flush(VmaWriter *vmaw)
+{
+    assert(vmaw);
+
+    int ret;
+    int i;
+
+    if (vmaw->status < 0) {
+        return vmaw->status;
+    }
+
+    if (!vmaw->header_written) {
+        vmaw->header_written = true;
+        ret = vma_write_header(vmaw);
+        if (ret < 0) {
+            vma_writer_set_error(vmaw, "vma_writer_flush: write header failed");
+            return ret;
+        }
+    }
+
+    DPRINTF("VMA WRITE FLUSH %d %d %zd\n", vmaw->outbuf_count,
+           vmaw->outbuf_pos,
+           vmaw->zero_block_count);
+
+    VmaExtendHeader *ehead = (VmaExtendHeader *)vmaw->outbuf;
+
+    ehead->magic = VMA_EXTEND_MAGIC;
+    ehead->reserved1 = 0;
+
+    for (i = 0; i < VMA_BLOCKS_PER_EXTEND; i++) {
+        ehead->blockinfo[i] = GUINT64_TO_BE(vmaw->outbuf_block_info[i]);
+    }
+
+    guint16 block_count = (vmaw->outbuf_pos - VMA_EXTEND_HEADER_SIZE) /
+        VMA_BLOCK_SIZE;
+
+    ehead->block_count = GUINT16_TO_BE(block_count);
+
+    memcpy(ehead->uuid, vmaw->uuid, sizeof(ehead->uuid));
+    memset(ehead->md5sum, 0, sizeof(ehead->md5sum));
+
+    g_checksum_reset(vmaw->md5csum);
+    g_checksum_update(vmaw->md5csum, vmaw->outbuf, VMA_EXTEND_HEADER_SIZE);
+    gsize csize = 16;
+    g_checksum_get_digest(vmaw->md5csum, ehead->md5sum, &csize);
+
+    int bytes = vmaw->outbuf_pos;
+    ret = vma_queue_write(vmaw, vmaw->outbuf, bytes);
+    if (ret != bytes) {
+        vma_writer_set_error(vmaw, "vma_writer_flush: failed write");
+    }
+
+    vmaw->outbuf_count = 0;
+    vmaw->outbuf_pos = VMA_EXTEND_HEADER_SIZE;
+
+    for (i = 0; i < VMA_BLOCKS_PER_EXTEND; i++) {
+        vmaw->outbuf_block_info[i] = 0;
+    }
+
+    return vmaw->status;
+}
+
+static int vma_count_open_streams(VmaWriter *vmaw)
+{
+    g_assert(vmaw != NULL);
+
+    int i;
+    int open_drives = 0;
+    for (i = 0; i <= 255; i++) {
+        if (vmaw->stream_info[i].size && !vmaw->stream_info[i].finished) {
+            open_drives++;
+        }
+    }
+
+    return open_drives;
+}
+
+/**
+ * All jobs should call this when there is no more data.
+ * Returns: number of remaining streams (0 ==> finished)
+ */
+int coroutine_fn
+vma_writer_close_stream(VmaWriter *vmaw, uint8_t dev_id)
+{
+    g_assert(vmaw != NULL);
+
+    DPRINTF("vma_writer_set_status %d\n", dev_id);
+    if (!vmaw->stream_info[dev_id].size) {
+        vma_writer_set_error(vmaw, "vma_writer_close_stream: "
+                             "no such stream %d", dev_id);
+        return -1;
+    }
+    if (vmaw->stream_info[dev_id].finished) {
+        vma_writer_set_error(vmaw, "vma_writer_close_stream: "
+                             "stream already closed %d", dev_id);
+        return -1;
+    }
+
+    vmaw->stream_info[dev_id].finished = true;
+
+    int open_drives = vma_count_open_streams(vmaw);
+
+    if (open_drives <= 0) {
+        DPRINTF("vma_writer_set_status all drives completed\n");
+        qemu_co_mutex_lock(&vmaw->flush_lock);
+        int ret = vma_writer_flush(vmaw);
+        qemu_co_mutex_unlock(&vmaw->flush_lock);
+        if (ret < 0) {
+            vma_writer_set_error(vmaw, "vma_writer_close_stream: flush failed");
+        }
+    }
+
+    return open_drives;
+}
+
+int vma_writer_get_status(VmaWriter *vmaw, VmaStatus *status)
+{
+    int i;
+
+    g_assert(vmaw != NULL);
+
+    if (status) {
+        status->status = vmaw->status;
+        g_strlcpy(status->errmsg, vmaw->errmsg, sizeof(status->errmsg));
+        for (i = 0; i <= 255; i++) {
+            status->stream_info[i] = vmaw->stream_info[i];
+        }
+
+        uuid_unparse_lower(vmaw->uuid, status->uuid_str);
+        status->closed = vmaw->closed;
+    }
+
+
+    return vmaw->status;
+}
+
+static int vma_writer_get_buffer(VmaWriter *vmaw)
+{
+    /* wait until buffer is available */
+    while (vmaw->outbuf_count >= (VMA_BLOCKS_PER_EXTEND - 1)) {
+        int res = 0;
+
+        qemu_co_mutex_lock(&vmaw->flush_lock);
+        res = vma_writer_flush(vmaw);
+        qemu_co_mutex_unlock(&vmaw->flush_lock);
+
+        if (res < 0) {
+            vma_writer_set_error(vmaw, "vma_writer_get_buffer: flush failed");
+            return -1;
+        }
+    }
+
+    return 0;
+}
+
+int64_t coroutine_fn
+vma_writer_write(VmaWriter *vmaw, uint8_t dev_id, int64_t cluster_num,
+                 unsigned char *buf)
+{
+    g_assert(vmaw != NULL);
+
+    if (vmaw->status < 0) {
+        return vmaw->status;
+    }
+
+    if (!dev_id || !vmaw->stream_info[dev_id].size) {
+        vma_writer_set_error(vmaw, "vma_writer_write: "
+                             "no such stream %d", dev_id);
+        return -1;
+    }
+
+    if (vmaw->stream_info[dev_id].finished) {
+        vma_writer_set_error(vmaw, "vma_writer_write: "
+                             "stream already closed %d", dev_id);
+        return -1;
+    }
+
+    if (cluster_num >= (((uint64_t)1)<<32)) {
+        vma_writer_set_error(vmaw, "vma_writer_write: "
+                             "cluster number out of range");
+        return -1;
+    }
+
+    if (cluster_num >= vmaw->stream_info[dev_id].cluster_count) {
+        vma_writer_set_error(vmaw, "vma_writer_write: cluster number too big");
+        return -1;
+    }
+
+    /* wait until buffer is available */
+    if (vma_writer_get_buffer(vmaw) < 0) {
+        vma_writer_set_error(vmaw, "vma_writer_write: "
+                             "vma_writer_get_buffer failed");
+        return -1;
+    }
+
+    DPRINTF("VMA WRITE %zd\n", cluster_num);
+
+    int i;
+    int bit = 1;
+    uint16_t mask = 0;
+    for (i = 0; i < 16; i++) {
+        unsigned char *vmablock = buf + (i*VMA_BLOCK_SIZE);
+        if (buffer_is_zero(vmablock, VMA_BLOCK_SIZE)) {
+            DPRINTF("VMA WRITE %zd ZERO BLOCK %d\n", cluster_num, i);
+            vmaw->zero_block_count++;
+            vmaw->stream_info[dev_id].zero_bytes += VMA_BLOCK_SIZE;
+        } else {
+            mask |= bit;
+            memcpy(vmaw->outbuf + vmaw->outbuf_pos, vmablock, VMA_BLOCK_SIZE);
+            vmaw->outbuf_pos += VMA_BLOCK_SIZE;
+
+            vmaw->delay_ns = ratelimit_calculate_delay(&vmaw->limit,
+                                                       VMA_BLOCK_SIZE);
+            if (vmaw->delay_ns) {
+                co_sleep_ns(rt_clock, vmaw->delay_ns);
+            }
+        }
+
+        bit = bit << 1;
+    }
+
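+    /* blockinfo layout (one 64-bit word per cluster, stored in the
+     * extend header):
+     *   bits 48-63: mask of non-zero 4K blocks within this cluster
+     *   bits 32-39: dev_id of the stream
+     *   bits  0-31: cluster number within the device
+     */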
+    uint64_t block_info = ((uint64_t)mask) << (32+16);
+    block_info |= ((uint64_t)dev_id) << 32;
+    block_info |= (cluster_num & 0xffffffff);
+    vmaw->outbuf_block_info[vmaw->outbuf_count] =  block_info;
+
+    DPRINTF("VMA WRITE MASK %zd %zx\n", cluster_num, block_info);
+
+    vmaw->outbuf_count++;
+
+    vmaw->stream_info[dev_id].transferred += VMA_CLUSTER_SIZE;
+
+    /** NOTE: We always write whole clusters, but we correctly account
+     * transferred bytes. So transferred == size when everything
+     * went OK.
+     */
+    uint64_t last = (cluster_num + 1) * VMA_CLUSTER_SIZE;
+    if (last > vmaw->stream_info[dev_id].size) {
+        uint64_t diff = last - vmaw->stream_info[dev_id].size;
+        if (diff >= VMA_CLUSTER_SIZE) {
+            vma_writer_set_error(vmaw, "vma_writer_write: "
+                                 "read after last cluster");
+            return -1;
+        }
+        vmaw->stream_info[dev_id].transferred -= diff;
+    }
+
+    return vmaw->total_bytes_written;
+}
+
+int vma_writer_close(VmaWriter *vmaw)
+{
+    g_assert(vmaw != NULL);
+
+    int i;
+
+    vma_queue_flush(vmaw);
+
+    /* this should not happen - just to be sure */
+    while (!qemu_co_queue_empty(&vmaw->wqueue)) {
+        DPRINTF("vma_writer_close wait\n");
+        co_sleep_ns(rt_clock, 1000000);
+    }
+
+    if (vmaw->cmd) {
+        if (pclose(vmaw->cmd) < 0) {
+            vma_writer_set_error(vmaw, "vma_writer_close: "
+                                 "pclose failed - %s", strerror(errno));
+        }
+    } else {
+        if (close(vmaw->fd) < 0) {
+            vma_writer_set_error(vmaw, "vma_writer_close: "
+                                 "close failed - %s", strerror(errno));
+        }
+    }
+
+    for (i = 0; i <= 255; i++) {
+        VmaStreamInfo *si = &vmaw->stream_info[i];
+        if (si->size) {
+            if (!si->finished) {
+                vma_writer_set_error(vmaw, "vma_writer_close: "
+                                     "detected open stream '%s'", si->devname);
+            } else if (si->transferred != si->size) {
+                vma_writer_set_error(vmaw, "vma_writer_close: "
+                                     "incomplete stream '%s' (%zd != %zd)",
+                                     si->devname, si->transferred, si->size);
+            }
+        }
+    }
+
+    for (i = 0; i <= 255; i++) {
+        vmaw->stream_info[i].finished = 1; /* mark as closed */
+    }
+
+    vmaw->closed = 1;
+
+    return vmaw->status;
+}
+
+void vma_writer_destroy(VmaWriter *vmaw)
+{
+    assert(vmaw);
+
+    int i;
+
+    for (i = 0; i <= 255; i++) {
+        if (vmaw->stream_info[i].devname) {
+            g_free(vmaw->stream_info[i].devname);
+        }
+    }
+
+    if (vmaw->md5csum) {
+        g_checksum_free(vmaw->md5csum);
+    }
+
+    g_free(vmaw);
+}
+
diff --git a/vma.c b/vma.c
new file mode 100644
index 0000000..f8336bf
--- /dev/null
+++ b/vma.c
@@ -0,0 +1,526 @@
+/*
+ * VMA: Virtual Machine Archive
+ *
+ * Copyright (C) Proxmox Server Solutions
+ *
+ * Authors:
+ *  Dietmar Maurer (dietmar@proxmox.com)
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include <stdio.h>
+#include <errno.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <glib.h>
+
+#include "qemu-common.h"
+#include "qemu-option.h"
+#include "qemu-error.h"
+#include "osdep.h"
+#include "sysemu.h"
+#include "block_int.h"
+#include "vma.h"
+
+static void help(void)
+{
+    const char *help_msg =
+        "usage: vma command [command options]\n"
+        "\n"
+        "vma list <filename>\n"
+        "vma create <filename> [-c config] <archive> pathname ...\n"
+        "vma extract <filename> [-r] <targetdir>\n"
+        ;
+
+    printf("%s", help_msg);
+    exit(1);
+}
+
+static const char *extract_devname(const char *path, char **devname, int index)
+{
+    assert(path);
+
+    const char *sep = strchr(path, '=');
+
+    if (sep) {
+        *devname = g_strndup(path, sep - path);
+        path = sep + 1;
+    } else {
+        if (index >= 0) {
+            *devname = g_strdup_printf("disk%d", index);
+        } else {
+            *devname = NULL;
+        }
+    }
+
+    return path;
+}
+
+static void print_content(VmaReader *vmar)
+{
+    assert(vmar);
+
+    VmaHeader *head = vma_reader_get_header(vmar);
+
+    printf("CTIME: %s", ctime(&head->ctime));
+
+    GList *l = vma_reader_get_config_data(vmar);
+    while (l && l->data) {
+        VmaConfigData *cdata = (VmaConfigData *)l->data;
+        l = g_list_next(l);
+        printf("CFG: size: %d name: %s\n", cdata->len, cdata->name);
+    }
+
+    int i;
+    VmaDeviceInfo *di;
+    for (i = 1; i < 255; i++) {
+        di = vma_reader_get_device_info(vmar, i);
+        if (di) {
+            printf("DEV: dev_id=%d size: %zd devname: %s\n",
+                   i, di->size, di->devname);
+        }
+    }
+}
+
+static int list_content(int argc, char **argv)
+{
+    int c, ret = 0;
+    const char *filename;
+
+    for (;;) {
+        c = getopt(argc, argv, "h");
+        if (c == -1) {
+            break;
+        }
+        switch (c) {
+        case '?':
+        case 'h':
+            help();
+            break;
+        default:
+            g_assert_not_reached();
+        }
+    }
+
+    /* Get the filename */
+    if ((optind + 1) != argc) {
+        help();
+    }
+    filename = argv[optind++];
+
+    Error *errp = NULL;
+    VmaReader *vmar = vma_reader_create(filename, &errp);
+
+    if (!vmar) {
+        g_error("%s", error_get_pretty(errp));
+    }
+
+    print_content(vmar);
+
+    vma_reader_destroy(vmar);
+
+    return ret;
+}
+
+typedef struct RestoreMap {
+    char *devname;
+    char *path;
+    bool write_zero;
+} RestoreMap;
+
+static int extract_content(int argc, char **argv)
+{
+    int c, ret = 0;
+    const char *filename;
+    const char *dirname;
+    int readmap = 0;
+
+    for (;;) {
+        c = getopt(argc, argv, "hr");
+        if (c == -1) {
+            break;
+        }
+        switch (c) {
+        case '?':
+        case 'h':
+            help();
+            break;
+        case 'r':
+            readmap = 1;
+            break;
+        default:
+            help();
+        }
+    }
+
+    /* Get the filename */
+    if ((optind + 2) != argc) {
+        help();
+    }
+    filename = argv[optind++];
+    dirname = argv[optind++];
+
+    Error *errp = NULL;
+    VmaReader *vmar = vma_reader_create(filename, &errp);
+
+    if (!vmar) {
+        g_error("%s", error_get_pretty(errp));
+    }
+
+    if (mkdir(dirname, 0777) < 0) {
+        g_error("unable to create target directory %s - %s",
+                dirname, strerror(errno));
+    }
+
+    GList *l = vma_reader_get_config_data(vmar);
+    while (l && l->data) {
+        VmaConfigData *cdata = (VmaConfigData *)l->data;
+        l = g_list_next(l);
+        char *cfgfn = g_strdup_printf("%s/%s", dirname, cdata->name);
+        GError *err = NULL;
+        if (!g_file_set_contents(cfgfn, (gchar *)cdata->data, cdata->len,
+                                 &err)) {
+            g_error("Unable to write file: %s", err->message);
+        }
+    }
+
+    GHashTable *devmap = g_hash_table_new(g_str_hash, g_str_equal);
+
+    if (readmap) {
+        print_content(vmar);
+
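+        /* Read the device map from stdin, one entry per line:
+         *   "0:<devname>=<path>"  - restore, skip writing zero blocks
+         *   "1:<devname>=<path>"  - restore, write zero blocks as well
+         * An empty line or "done" terminates the map. */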
+        while (1) {
+            char inbuf[8192];
+            char *line = fgets(inbuf, sizeof(inbuf), stdin);
+            if (!line || line[0] == '\0' || !strcmp(line, "done\n")) {
+                break;
+            }
+            int len = strlen(line);
+            if (line[len - 1] == '\n') {
+                line[len - 1] = '\0';
+                if (len == 1) {
+                    break;
+                }
+            }
+
+            const char *path;
+            bool write_zero;
+            if (line[0] == '0' && line[1] == ':') {
+                path = inbuf + 2;
+                write_zero = false;
+            } else if (line[0] == '1' && line[1] == ':') {
+                path = inbuf + 2;
+                write_zero = true;
+            } else {
+                g_error("read map failed - parse error ('%s')", inbuf);
+            }
+
+            char *devname = NULL;
+            path = extract_devname(path, &devname, -1);
+            if (!devname) {
+                g_error("read map failed - no dev name specified ('%s')",
+                        inbuf);
+            }
+
+            printf("TEST %s %s\n", path, devname);
+
+            RestoreMap *map = g_new0(RestoreMap, 1);
+            map->devname = g_strdup(devname);
+            map->path = g_strdup(path);
+            map->write_zero = write_zero;
+
+            g_hash_table_insert(devmap, map->devname, map);
+
+        }
+    }
+
+    int i;
+    for (i = 1; i < 255; i++) {
+        VmaDeviceInfo *di = vma_reader_get_device_info(vmar, i);
+        if (di) {
+            char *devfn = NULL;
+            int flags = BDRV_O_RDWR|BDRV_O_CACHE_WB;
+            bool write_zero = true;
+
+            if (readmap) {
+                RestoreMap *map;
+                map = (RestoreMap *)g_hash_table_lookup(devmap, di->devname);
+                if (map == NULL) {
+                    g_error("no device name mapping for %s", di->devname);
+                }
+                devfn = map->path;
+                write_zero = map->write_zero;
+            } else {
+                devfn = g_strdup_printf("%s/tmp-disk-%s.raw",
+                                        dirname, di->devname);
+                printf("DEVINFO %s %zd\n", devfn, di->size);
+
+                if (bdrv_img_create(devfn, "raw", NULL, NULL, NULL,
+                                    di->size, flags)) {
+                    g_error("can't create file %s", devfn);
+                }
+                /* Note: if we created an empty file above, there is no need
+                 * to write zeroes (so we generate a sparse file)
+                 */
+                write_zero = false;
+            }
+
+            BlockDriverState *bs = NULL;
+            if (bdrv_file_open(&bs, devfn, flags)) {
+                g_error("can't open file %s", devfn);
+            }
+            if (vma_reader_register_bs(vmar, i, bs, write_zero, &errp) < 0) {
+                g_error("%s", error_get_pretty(errp));
+            }
+
+            if (!readmap) {
+                g_free(devfn);
+            }
+        }
+    }
+
+    if (vma_reader_restore(vmar, &errp) < 0) {
+        g_error("restore failed - %s", error_get_pretty(errp));
+    }
+
+    if (!readmap) {
+        for (i = 1; i < 255; i++) {
+            VmaDeviceInfo *di = vma_reader_get_device_info(vmar, i);
+            if (di) {
+                char *tmpfn = g_strdup_printf("%s/tmp-disk-%s.raw",
+                                              dirname, di->devname);
+                char *fn = g_strdup_printf("%s/disk-%s.raw",
+                                           dirname, di->devname);
+                if (rename(tmpfn, fn) != 0) {
+                    g_error("rename %s to %s failed - %s",
+                            tmpfn, fn, strerror(errno));
+                }
+            }
+        }
+    }
+
+    vma_reader_destroy(vmar);
+
+    bdrv_close_all();
+
+    return ret;
+}
+
+typedef struct BackupCB {
+    VmaWriter *vmaw;
+    uint8_t dev_id;
+} BackupCB;
+
+static int backup_dump_cb(void *opaque, BlockDriverState *bs,
+                          int64_t cluster_num, unsigned char *buf)
+{
+    BackupCB *bcb = opaque;
+
+    if (vma_writer_write(bcb->vmaw, bcb->dev_id, cluster_num, buf) < 0) {
+        g_warning("backup_dump_cb vma_writer_write failed");
+        return -1;
+    }
+
+    return 0;
+}
+
+static void backup_complete_cb(void *opaque, int ret)
+{
+    BackupCB *bcb = opaque;
+
+    if (ret < 0) {
+        vma_writer_set_error(bcb->vmaw, "backup_complete_cb %d", ret);
+    }
+
+    if (vma_writer_close_stream(bcb->vmaw, bcb->dev_id) <= 0) {
+        if (vma_writer_close(bcb->vmaw) != 0) {
+            g_warning("vma_writer_close failed");
+        }
+    }
+}
+
+static int create_archive(int argc, char **argv)
+{
+    int i, c, res;
+    int verbose = 0;
+    const char *archivename;
+    GList *config_files = NULL;
+
+    for (;;) {
+        c = getopt(argc, argv, "hvc:");
+        if (c == -1) {
+            break;
+        }
+        switch (c) {
+        case '?':
+        case 'h':
+            help();
+            break;
+        case 'c':
+            config_files = g_list_append(config_files, optarg);
+            break;
+        case 'v':
+            verbose = 1;
+            break;
+        default:
+            g_assert_not_reached();
+        }
+    }
+
+    /* make sure we have archive name and at least one path */
+    if ((optind + 2) > argc) {
+        help();
+    }
+
+    archivename = argv[optind++];
+
+    Error *local_err = NULL;
+    VmaWriter *vmaw = vma_writer_create(archivename, 0, &local_err);
+
+    if (vmaw == NULL) {
+        g_error("%s", error_get_pretty(local_err));
+    }
+
+    GList *l = config_files;
+    while (l && l->data) {
+        char *name = l->data;
+        char *cdata = NULL;
+        gsize clen = 0;
+        GError *err = NULL;
+        if (!g_file_get_contents(name, &cdata, &clen, &err)) {
+            unlink(archivename);
+            g_error("Unable to read file: %s", err->message);
+        }
+
+        if (vma_writer_add_config(vmaw, name, cdata, clen) != 0) {
+            unlink(archivename);
+            g_error("Unable to append config data %s (len = %zd)",
+                    name, clen);
+        }
+        l = g_list_next(l);
+    }
+
+    int ind = 0;
+    while (optind < argc) {
+        const char *path = argv[optind++];
+        char *devname = NULL;
+        path = extract_devname(path, &devname, ind++);
+
+        BlockDriver *drv = NULL;
+        BlockDriverState *bs = bdrv_new(devname);
+
+        res = bdrv_open(bs, path, BDRV_O_CACHE_WB , drv);
+        if (res < 0) {
+            unlink(archivename);
+            g_error("bdrv_open '%s' failed", path);
+        }
+        int64_t size = bdrv_getlength(bs);
+        int dev_id = vma_writer_register_stream(vmaw, devname, size);
+        if (dev_id <= 0) {
+            unlink(archivename);
+            g_error("vma_writer_register_stream '%s' failed", devname);
+        }
+
+        BackupCB *bcb = g_new0(BackupCB, 1);
+        bcb->vmaw = vmaw;
+        bcb->dev_id = dev_id;
+
+        bdrv_backup_init(bs, backup_dump_cb, backup_complete_cb, bcb);
+    }
+
+    VmaStatus vmastat;
+    int percent = 0;
+    int last_percent = -1;
+
+    while (1) {
+        main_loop_wait(false);
+        vma_writer_get_status(vmaw, &vmastat);
+
+        if (verbose) {
+            uint64_t total = 0;
+            uint64_t transferred = 0;
+            uint64_t zero_bytes = 0;
+
+            int i;
+            for (i = 0; i < 256; i++) {
+                if (vmastat.stream_info[i].size) {
+                    total += vmastat.stream_info[i].size;
+                    transferred += vmastat.stream_info[i].transferred;
+                    zero_bytes += vmastat.stream_info[i].zero_bytes;
+                }
+            }
+            percent = (transferred*100)/total;
+            if (percent != last_percent) {
+                printf("progress %d%% %zd/%zd %zd\n", percent,
+                       transferred, total, zero_bytes);
+
+                last_percent = percent;
+            }
+        }
+
+        if (vmastat.closed) {
+            break;
+        }
+    }
+
+    bdrv_drain_all();
+
+    vma_writer_get_status(vmaw, &vmastat);
+
+    if (verbose) {
+        for (i = 0; i < 256; i++) {
+            VmaStreamInfo *si = &vmastat.stream_info[i];
+            if (si->size) {
+                printf("image %s: size=%zd zeros=%zd saved=%zd\n", si->devname,
+                       si->size, si->zero_bytes, si->size - si->zero_bytes);
+            }
+        }
+    }
+
+    if (vmastat.status < 0) {
+        unlink(archivename);
+        g_error("creating vma archive failed");
+    }
+
+    return 0;
+}
+
+int main(int argc, char **argv)
+{
+    const char *cmdname;
+
+    error_set_progname(argv[0]);
+
+    qemu_init_main_loop();
+
+    bdrv_init();
+
+    if (argc < 2) {
+        help();
+    }
+
+    cmdname = argv[1];
+    argc--; argv++;
+
+    if (!strcmp(cmdname, "list")) {
+        return list_content(argc, argv);
+    } else if (!strcmp(cmdname, "create")) {
+        return create_archive(argc, argv);
+    } else if (!strcmp(cmdname, "extract")) {
+        return extract_content(argc, argv);
+    }
+
+    help();
+    return 0;
+}
diff --git a/vma.h b/vma.h
new file mode 100644
index 0000000..7c40d1b
--- /dev/null
+++ b/vma.h
@@ -0,0 +1,139 @@
+/*
+ * VMA: Virtual Machine Archive
+ *
+ * Copyright (C) Proxmox Server Solutions
+ *
+ * Authors:
+ *  Dietmar Maurer (dietmar@proxmox.com)
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef BACKUP_VMA_H
+#define BACKUP_VMA_H
+
+#include "error.h"
+
+#define VMA_BLOCK_BITS 12
+#define VMA_BLOCK_SIZE (1<<VMA_BLOCK_BITS)
+#define VMA_CLUSTER_BITS (VMA_BLOCK_BITS+4)
+#define VMA_CLUSTER_SIZE (1<<VMA_CLUSTER_BITS)
+
+#if VMA_CLUSTER_SIZE != 65536
+#error unexpected cluster size
+#endif
+
+#define VMA_EXTEND_HEADER_SIZE 512
+#define VMA_BLOCKS_PER_EXTEND 59
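+/* 40 bytes of fixed fields plus 59 blockinfo entries (8 bytes each)
+ * fill the 512 byte extend header exactly: 40 + 59*8 = 512 */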
+#define VMA_MAX_CONFIGS 256
+
+#define VMA_MAX_EXTEND_SIZE \
+    (VMA_EXTEND_HEADER_SIZE+VMA_CLUSTER_SIZE*VMA_BLOCKS_PER_EXTEND)
+#if VMA_MAX_EXTEND_SIZE != 3867136
+#error unexpected VMA_MAX_EXTEND_SIZE
+#endif
+
+/* File Format Definitions */
+
+#define VMA_MAGIC (GUINT32_TO_BE(('V'<<24)|('M'<<16)|('A'<<8)|0x00))
+#define VMA_EXTEND_MAGIC (GUINT32_TO_BE(('V'<<24)|('M'<<16)|('A'<<8)|'E'))
+
+typedef struct VmaDeviceInfoHeader {
+    uint32_t devname_ptr; /* offset into blob_buffer table */
+    uint32_t reserved0;
+    uint64_t size; /* device size in bytes */
+    uint64_t reserved1;
+    uint64_t reserved2;
+} VmaDeviceInfoHeader;
+
+typedef struct VmaHeader {
+    uint32_t magic;
+    uint32_t version;
+    unsigned char uuid[16];
+    int64_t ctime;
+    unsigned char md5sum[16];
+
+    uint32_t blob_buffer_offset;
+    uint32_t blob_buffer_size;
+    uint32_t header_size;
+
+    unsigned char reserved[1984];
+
+    uint32_t config_names[VMA_MAX_CONFIGS]; /* offset into blob_buffer table */
+    uint32_t config_data[VMA_MAX_CONFIGS];  /* offset into blob_buffer table */
+
+    VmaDeviceInfoHeader dev_info[256];
+} VmaHeader;
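+/* Note: sizeof(VmaHeader) is asserted to be 4096 + 8192 in the writer:
+ * the fields before dev_info occupy 4092 bytes and are padded to 4096
+ * by the 8-byte alignment of VmaDeviceInfoHeader; the 256 * 32-byte
+ * dev_info entries add the remaining 8192 bytes. */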
+
+typedef struct VmaExtendHeader {
+    uint32_t magic;
+    uint16_t reserved1;
+    uint16_t block_count;
+    unsigned char uuid[16];
+    unsigned char md5sum[16];
+    uint64_t blockinfo[VMA_BLOCKS_PER_EXTEND];
+} VmaExtendHeader;
+
+/* functions/definitions to read/write vma files */
+
+typedef struct VmaReader VmaReader;
+
+typedef struct VmaWriter VmaWriter;
+
+typedef struct VmaConfigData {
+    const unsigned char *name;
+    const unsigned char *data;
+    uint32_t len;
+} VmaConfigData;
+
+typedef struct VmaStreamInfo {
+    uint64_t size;
+    uint64_t cluster_count;
+    uint64_t transferred;
+    uint64_t zero_bytes;
+    int finished;
+    char *devname;
+} VmaStreamInfo;
+
+typedef struct VmaStatus {
+    int status;
+    bool closed;
+    char errmsg[8192];
+    char uuid_str[37];
+    VmaStreamInfo stream_info[256];
+} VmaStatus;
+
+typedef struct VmaDeviceInfo {
+    uint64_t size; /* device size in bytes */
+    const unsigned char *devname;
+} VmaDeviceInfo;
+
+VmaWriter *vma_writer_create(const char *filename, int64_t speed, Error **errp);
+int vma_writer_close(VmaWriter *vmaw);
+void vma_writer_destroy(VmaWriter *vmaw);
+int vma_writer_add_config(VmaWriter *vmaw, const char *name, gpointer data,
+                          size_t len);
+int vma_writer_register_stream(VmaWriter *vmaw, const char *devname,
+                               size_t size);
+
+int64_t coroutine_fn vma_writer_write(VmaWriter *vmaw, uint8_t dev_id,
+                                      int64_t cluster_num, unsigned char *buf);
+int coroutine_fn vma_writer_close_stream(VmaWriter *vmaw, uint8_t dev_id);
+
+int vma_writer_get_status(VmaWriter *vmaw, VmaStatus *status);
+void vma_writer_set_error(VmaWriter *vmaw, const char *fmt, ...);
+
+VmaReader *vma_reader_create(const char *filename, Error **errp);
+void vma_reader_destroy(VmaReader *vmar);
+VmaHeader *vma_reader_get_header(VmaReader *vmar);
+GList *vma_reader_get_config_data(VmaReader *vmar);
+VmaDeviceInfo *vma_reader_get_device_info(VmaReader *vmar, guint8 dev_id);
+int vma_reader_register_bs(VmaReader *vmar, guint8 dev_id,
+                           BlockDriverState *bs, bool write_zeroes,
+                           Error **errp);
+int vma_reader_restore(VmaReader *vmar, Error **errp);
+
+#endif /* BACKUP_VMA_H */
-- 
1.7.2.5


* [Qemu-devel] [PATCH 4/5] add backup related monitor commands
  2012-11-21  9:01 [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1) Dietmar Maurer
  2012-11-21  9:01 ` [Qemu-devel] [PATCH 2/5] add basic backup support to block driver Dietmar Maurer
  2012-11-21  9:01 ` [Qemu-devel] [PATCH 3/5] introduce new vma archive format Dietmar Maurer
@ 2012-11-21  9:01 ` Dietmar Maurer
  2012-11-21 16:16   ` Eric Blake
  2012-11-21  9:01 ` [Qemu-devel] [PATCH 5/5] add regression tests for backup Dietmar Maurer
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-21  9:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, Dietmar Maurer

We currently create 'vma' archives without any configuration inside.
Future versions may support other formats...

Another option would be to simply dump <devid,cluster_num,cluster_data>
tuples to the output fh (pipe), and have an external binary save the
data. That way we could move all archive-format-related code out of qemu.
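
For illustration, such an external-consumer protocol could be as simple
as a fixed record header followed by the raw cluster payload. This is
only a sketch of that idea, not part of the patch; the record layout
and the dump_cluster() helper are hypothetical:

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    #define CLUSTER_SIZE 65536   /* assumed to match VMA_CLUSTER_SIZE */

    /* write one <devid,cluster_num,cluster_data> record to the pipe */
    static int dump_cluster(int fd, uint8_t dev_id, uint64_t cluster_num,
                            const unsigned char *buf)
    {
        unsigned char hdr[1 + 8];

        hdr[0] = dev_id;
        memcpy(hdr + 1, &cluster_num, 8); /* host byte order for brevity */
        if (write(fd, hdr, sizeof(hdr)) != sizeof(hdr) ||
            write(fd, buf, CLUSTER_SIZE) != CLUSTER_SIZE) {
            return -1;
        }
        return 0;
    }

The consumer reads fixed-size records from the pipe and implements
whatever archive format it likes, keeping qemu itself format-agnostic.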

Signed-off-by: Dietmar Maurer <dietmar@proxmox.com>
---
 Makefile.objs    |    2 +-
 blockdev.c       |  263 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 hmp-commands.hx  |   31 +++++++
 hmp.c            |   63 +++++++++++++
 hmp.h            |    3 +
 monitor.c        |    7 ++
 qapi-schema.json |   46 ++++++++++
 qmp-commands.hx  |   27 ++++++
 8 files changed, 441 insertions(+), 1 deletions(-)

diff --git a/Makefile.objs b/Makefile.objs
index cb46be5..b5732e2 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -48,7 +48,7 @@ coroutine-obj-$(CONFIG_WIN32) += coroutine-win32.o
 block-obj-y = iov.o cache-utils.o qemu-option.o module.o async.o
 block-obj-y += nbd.o block.o blockjob.o aes.o qemu-config.o
 block-obj-y += thread-pool.o qemu-progress.o qemu-sockets.o uri.o notify.o
-block-obj-y += backup.o
+block-obj-y += vma-writer.o backup.o
 block-obj-y += $(coroutine-obj-y) $(qobject-obj-y) $(version-obj-y)
 block-obj-$(CONFIG_POSIX) += event_notifier-posix.o aio-posix.o
 block-obj-$(CONFIG_WIN32) += event_notifier-win32.o aio-win32.o
diff --git a/blockdev.c b/blockdev.c
index e73fd6e..de5db5e 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -20,6 +20,7 @@
 #include "qmp-commands.h"
 #include "trace.h"
 #include "arch_init.h"
+#include "vma.h"
 
 static QTAILQ_HEAD(drivelist, DriveInfo) drives = QTAILQ_HEAD_INITIALIZER(drives);
 
@@ -1321,6 +1322,268 @@ void qmp_drive_mirror(const char *device, const char *target,
     drive_get_ref(drive_get_by_blockdev(bs));
 }
 
+/* Backup related functions */
+
+static struct VmaBackupState {
+    time_t start_time;
+    time_t end_time;
+    char *backupfile;
+    VmaWriter *vmaw;
+    VmaStatus status;
+} backup_state;
+
+typedef struct BackupCB {
+    BlockDriverState *bs;
+    VmaWriter *vmaw;
+    uint8_t dev_id;
+} BackupCB;
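+/* One BackupCB is allocated per device: backup_dump_cb() feeds each
+ * cluster to the vma writer, backup_complete_cb() closes that device's
+ * stream, and the writer itself is closed once the last stream has
+ * finished. */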
+
+static int backup_dump_cb(void *opaque, BlockDriverState *bs,
+                          int64_t cluster_num, unsigned char *buf)
+{
+    BackupCB *bcb = opaque;
+
+    if (vma_writer_write(bcb->vmaw, bcb->dev_id, cluster_num, buf) < 0) {
+        return -1;
+    }
+    return 0;
+}
+
+static void backup_complete_cb(void *opaque, int ret)
+{
+    BackupCB *bcb = opaque;
+
+    printf("backup_complete_cb start %d\n", ret);
+    drive_put_ref_bh_schedule(drive_get_by_blockdev(bcb->bs));
+
+    if (ret < 0) {
+        vma_writer_set_error(bcb->vmaw, "backup_complete_cb %d", ret);
+    }
+
+    if (vma_writer_close_stream(bcb->vmaw, bcb->dev_id) <= 0) {
+        printf("all backup jobs completed\n");
+
+        backup_state.end_time = time(NULL);
+
+        int res = vma_writer_close(bcb->vmaw);
+        vma_writer_get_status(bcb->vmaw, &backup_state.status);
+        vma_writer_destroy(bcb->vmaw);
+
+        backup_state.vmaw = NULL;
+        if (res < 0) {
+            printf("backup failed\n");
+        } else {
+            printf("backup successful\n");
+        }
+
+    }
+
+    g_free(bcb);
+}
+
+void qmp_backup_cancel(Error **errp)
+{
+    if (backup_state.vmaw) {
+        vma_writer_set_error(backup_state.vmaw, "backup canceled");
+    }
+}
+
+char *qmp_backup(const char * backupfile, bool has_devlist, const char *devlist,
+                 bool has_speed, int64_t speed, Error **errp)
+{
+    BlockDriverState *bs;
+    Error *local_err = NULL;
+    VmaWriter *vmaw = NULL;
+    gchar **devs = NULL;
+    GList *bcblist = NULL;
+
+    if (has_devlist) {
+        devs = g_strsplit(devlist, ",;:", -1);
+
+        gchar **d = devs;
+        while (d && *d) {
+            bs = bdrv_find(*d);
+            if (bs) {
+                if (bdrv_is_read_only(bs)) {
+                    error_set(errp, QERR_DEVICE_IS_READ_ONLY, *d);
+                    goto err;
+                }
+                if (!bdrv_is_inserted(bs)) {
+                    error_set(errp, QERR_DEVICE_HAS_NO_MEDIUM, *d);
+                    goto err;
+                }
+                BackupCB *bcb = g_new0(BackupCB, 1);
+                bcb->bs = bs;
+                bcblist = g_list_append(bcblist, bcb);
+            } else {
+                error_set(errp, QERR_DEVICE_NOT_FOUND, *d);
+                goto err;
+            }
+            d++;
+        }
+
+    } else {
+
+        bs = NULL;
+        while ((bs = bdrv_next(bs))) {
+
+            if (!bdrv_is_inserted(bs) || bdrv_is_read_only(bs)) {
+                continue;
+            }
+
+            BackupCB *bcb = g_new0(BackupCB, 1);
+            bcb->bs = bs;
+            bcblist = g_list_append(bcblist, bcb);
+        }
+    }
+
+    if (!bcblist) {
+        error_set(errp, ERROR_CLASS_GENERIC_ERROR, "empty device list");
+        goto err;
+    }
+
+    GList *l = bcblist;
+    while (l) {
+        BackupCB *bcb = l->data;
+        l = g_list_next(l);
+        if (bcb->bs->job) {
+            error_set(errp, QERR_DEVICE_IN_USE, bdrv_get_device_name(bcb->bs));
+            goto err;
+        }
+    }
+
+    vmaw = vma_writer_create(backupfile, speed, &local_err);
+    if (!vmaw) {
+        if (error_is_set(&local_err)) {
+            error_propagate(errp, local_err);
+        }
+        goto err;
+    }
+
+    /* register all devices for vma writer */
+    l = bcblist;
+    while (l) {
+        BackupCB *bcb = l->data;
+        l = g_list_next(l);
+        bcb->vmaw = vmaw;
+        bcb->dev_id = vma_writer_register_stream(
+            vmaw, bdrv_get_device_name(bcb->bs), bdrv_getlength(bcb->bs));
+        if (bcb->dev_id <= 0) {
+            error_set(errp, ERROR_CLASS_GENERIC_ERROR,
+                      "vma_writer_register_stream failed");
+            goto err;
+        }
+    }
+
+    backup_state.start_time = time(NULL);
+    backup_state.end_time = 0;
+    backup_state.backupfile = g_strdup(backupfile);
+    backup_state.vmaw = vmaw;
+
+    vma_writer_get_status(vmaw, &backup_state.status);
+
+    /* start all jobs (one for each device) */
+    l = bcblist;
+    while (l) {
+        BackupCB *bcb = l->data;
+        l = g_list_next(l);
+
+        if (bdrv_backup_init(bcb->bs, backup_dump_cb,
+                             backup_complete_cb, bcb)) {
+            /* Grab a reference so hotplug does not delete the
+             * BlockDriverState from underneath us.
+             */
+            drive_get_ref(drive_get_by_blockdev(bcb->bs));
+        }
+    }
+
+    return g_strdup(backup_state.status.uuid_str);
+
+err:
+
+    l = bcblist;
+    while (l) {
+        g_free(l->data);
+        l = g_list_next(l);
+    }
+    g_list_free(bcblist);
+
+    if (devs) {
+        g_strfreev(devs);
+    }
+
+    if (vmaw) {
+        unlink(backupfile);
+        vma_writer_close(vmaw);
+        vma_writer_destroy(vmaw);
+    }
+
+    return NULL;
+}
+
+BackupStatus *qmp_query_backup(Error **errp)
+{
+    int i;
+    BackupStatus *info = g_malloc0(sizeof(*info));
+
+    if (!backup_state.start_time) {
+        /* not started, return {} */
+        return info;
+    }
+
+    info->has_status = true;
+    info->has_start_time = true;
+    info->start_time = backup_state.start_time;
+
+    if (backup_state.backupfile) {
+        info->has_backupfile = true;
+        info->backupfile = g_strdup(backup_state.backupfile);
+    }
+
+    info->has_uuid = true;
+    info->uuid = g_strdup(backup_state.status.uuid_str);
+
+    if (backup_state.end_time) {
+        if (backup_state.status.status >= 0) {
+            info->status = g_strdup("done");
+        } else {
+            info->status = g_strdup("error");
+            if (backup_state.status.errmsg[0]) {
+                info->has_errmsg = true;
+                info->errmsg = g_strdup(backup_state.status.errmsg);
+            }
+        }
+        info->has_end_time = true;
+        info->end_time = backup_state.end_time;
+    } else {
+        if (backup_state.vmaw) {
+            vma_writer_get_status(backup_state.vmaw, &backup_state.status);
+        }
+        info->status = g_strdup("active");
+    }
+
+    uint64_t total = 0;
+    uint64_t zero_bytes = 0;
+    uint64_t transferred = 0;
+
+    for (i = 0; i <= 255; i++) {
+        if (backup_state.status.stream_info[i].size) {
+            total += backup_state.status.stream_info[i].size;
+            zero_bytes += backup_state.status.stream_info[i].zero_bytes;
+            transferred += backup_state.status.stream_info[i].transferred;
+        }
+    }
+
+    info->has_total = true;
+    info->total = total;
+    info->has_zero_bytes = true;
+    info->zero_bytes = zero_bytes;
+    info->has_transferred = true;
+    info->transferred = transferred;
+
+    return info;
+}
+
 static BlockJob *find_block_job(const char *device)
 {
     BlockDriverState *bs;
diff --git a/hmp-commands.hx b/hmp-commands.hx
index 010b8c9..57be357 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -83,6 +83,35 @@ STEXI
 Copy data from a backing file into a block device.
 ETEXI
 
+   {
+        .name       = "backup",
+        .args_type  = "backupfile:s,speed:o?,devlist:s?",
+        .params     = "backupfile [speed [devlist]]",
+        .help       = "create a VM backup.",
+        .mhandler.cmd = hmp_backup,
+    },
+
+STEXI
+@item backup
+@findex backup
+Create a VM backup.
+ETEXI
+
+    {
+        .name       = "backup_cancel",
+        .args_type  = "",
+        .params     = "",
+        .help       = "cancel the current VM backup",
+        .mhandler.cmd = hmp_backup_cancel,
+    },
+
+STEXI
+@item backup_cancel
+@findex backup_cancel
+Cancel the current VM backup.
+ETEXI
+
     {
         .name       = "block_job_set_speed",
         .args_type  = "device:B,speed:o",
@@ -1558,6 +1587,8 @@ show CPU statistics
 show user network stack connection states
 @item info migrate
 show migration status
+@item info backup
+show backup status
 @item info migrate_capabilities
 show current migration capabilities
 @item info migrate_cache_size
diff --git a/hmp.c b/hmp.c
index 180ba2b..be84d01 100644
--- a/hmp.c
+++ b/hmp.c
@@ -130,6 +130,38 @@ void hmp_info_mice(Monitor *mon)
     qapi_free_MouseInfoList(mice_list);
 }
 
+void hmp_info_backup(Monitor *mon)
+{
+    BackupStatus *info;
+
+    info = qmp_query_backup(NULL);
+    if (info->has_status) {
+        if (info->has_errmsg) {
+            monitor_printf(mon, "Backup status: %s - %s\n",
+                           info->status, info->errmsg);
+        } else {
+            monitor_printf(mon, "Backup status: %s\n", info->status);
+        }
+    }
+    if (info->has_backupfile) {
+        int per = (info->has_total && info->total &&
+            info->has_transferred && info->transferred) ?
+            (info->transferred * 100)/info->total : 0;
+        int zero_per = (info->has_total && info->total &&
+                        info->has_zero_bytes && info->zero_bytes) ?
+            (info->zero_bytes * 100)/info->total : 0;
+        monitor_printf(mon, "Backup file: %s\n", info->backupfile);
+        monitor_printf(mon, "Backup uuid: %s\n", info->uuid);
+        monitor_printf(mon, "Total size: %zd\n", info->total);
+        monitor_printf(mon, "Transferred bytes: %zd (%d%%)\n",
+                       info->transferred, per);
+        monitor_printf(mon, "Zero bytes: %zd (%d%%)\n",
+                       info->zero_bytes, zero_per);
+    }
+
+    qapi_free_BackupStatus(info);
+}
+
 void hmp_info_migrate(Monitor *mon)
 {
     MigrationInfo *info;
@@ -977,6 +1009,37 @@ void hmp_block_stream(Monitor *mon, const QDict *qdict)
     hmp_handle_error(mon, &error);
 }
 
+void hmp_backup_cancel(Monitor *mon, const QDict *qdict)
+{
+    Error *errp = NULL;
+
+    qmp_backup_cancel(&errp);
+
+    if (error_is_set(&errp)) {
+        monitor_printf(mon, "%s\n", error_get_pretty(errp));
+        error_free(errp);
+        return;
+    }
+}
+
+void hmp_backup(Monitor *mon, const QDict *qdict)
+{
+    const char *backupfile = qdict_get_str(qdict, "backupfile");
+    const char *devlist = qdict_get_try_str(qdict, "devlist");
+    int64_t speed = qdict_get_try_int(qdict, "speed", 0);
+
+    Error *errp = NULL;
+
+    qmp_backup(backupfile, !!devlist, devlist,
+               qdict_haskey(qdict, "speed"), speed, &errp);
+
+    if (error_is_set(&errp)) {
+        monitor_printf(mon, "%s\n", error_get_pretty(errp));
+        error_free(errp);
+        return;
+    }
+}
+
 void hmp_block_job_set_speed(Monitor *mon, const QDict *qdict)
 {
     Error *error = NULL;
diff --git a/hmp.h b/hmp.h
index 0ab03be..20c9a62 100644
--- a/hmp.h
+++ b/hmp.h
@@ -28,6 +28,7 @@ void hmp_info_mice(Monitor *mon);
 void hmp_info_migrate(Monitor *mon);
 void hmp_info_migrate_capabilities(Monitor *mon);
 void hmp_info_migrate_cache_size(Monitor *mon);
+void hmp_info_backup(Monitor *mon);
 void hmp_info_cpus(Monitor *mon);
 void hmp_info_block(Monitor *mon);
 void hmp_info_blockstats(Monitor *mon);
@@ -63,6 +64,8 @@ void hmp_eject(Monitor *mon, const QDict *qdict);
 void hmp_change(Monitor *mon, const QDict *qdict);
 void hmp_block_set_io_throttle(Monitor *mon, const QDict *qdict);
 void hmp_block_stream(Monitor *mon, const QDict *qdict);
+void hmp_backup(Monitor *mon, const QDict *qdict);
+void hmp_backup_cancel(Monitor *mon, const QDict *qdict);
 void hmp_block_job_set_speed(Monitor *mon, const QDict *qdict);
 void hmp_block_job_cancel(Monitor *mon, const QDict *qdict);
 void hmp_block_job_pause(Monitor *mon, const QDict *qdict);
diff --git a/monitor.c b/monitor.c
index c0e32d6..85cf47e 100644
--- a/monitor.c
+++ b/monitor.c
@@ -2680,6 +2680,13 @@ static mon_cmd_t info_cmds[] = {
     },
 #endif
     {
+        .name       = "backup",
+        .args_type  = "",
+        .params     = "",
+        .help       = "show backup status",
+        .mhandler.info = hmp_info_backup,
+    },
+    {
         .name       = "migrate",
         .args_type  = "",
         .params     = "",
diff --git a/qapi-schema.json b/qapi-schema.json
index 5dfa052..5536ddc 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -357,6 +357,13 @@
 ##
 { 'type': 'EventInfo', 'data': {'name': 'str'} }
 
+
+##
+# @BackupStatus:
+#
+# Detailed backup status (all fields are optional).
+#
+##
+{ 'type': 'BackupStatus',
+  'data': {'*status': 'str', '*errmsg': 'str', '*total': 'int',
+           '*transferred': 'int', '*zero-bytes': 'int',
+           '*start-time': 'int', '*end-time': 'int',
+           '*backupfile': 'str', '*uuid': 'str' } }
+
 ##
 # @query-events:
 #
@@ -1764,6 +1771,45 @@
   'data': { 'path': 'str' },
   'returns': [ 'ObjectPropertyInfo' ] }
 
+
+##
+# @backup:
+#
+# Starts a VM backup.
+#
+# @backupfile: the backup file name
+#
+# @devlist: #optional list of block device names to back up (separated
+#           by ',', ';' or ':'); all writable devices are used if omitted
+#
+# @speed: #optional the maximum speed, in bytes per second
+#
+# Returns: the uuid of the backup job
+#
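+# Example (values are illustrative only):
+#
+# -> { "execute": "backup",
+#      "arguments": { "backupfile": "backup.vma" } }
+# <- { "return": "67a834a7-5e4f-4c9b-9e63-5f6e0c7a3b21" }
+#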
+##
+{ 'command': 'backup', 'data': { 'backupfile': 'str', '*devlist': 'str',
+                                 '*speed': 'int' },
+  'returns': 'str' }
+
+##
+# @query-backup:
+#
+# Returns information about current/last backup task.
+#
+# Returns: @BackupStatus
+#
+##
+{ 'command': 'query-backup', 'returns': 'BackupStatus' }
+
+##
+# @backup_cancel:
+#
+# Cancel the currently executing backup process.
+#
+# Returns: nothing on success
+#
+# Notes: This command succeeds even if there is no backup process running.
+#
+##
+{ 'command': 'backup_cancel' }
+
 ##
 # @qom-get:
 #
diff --git a/qmp-commands.hx b/qmp-commands.hx
index 5c692d0..e83c275 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -822,6 +822,18 @@ EQMP
     },
 
     {
+        .name       = "backup",
+        .args_type  = "backupfile:s,speed:o?,devlist:s?",
+        .mhandler.cmd_new = qmp_marshal_input_backup,
+    },
+
+    {
+        .name       = "backup_cancel",
+        .args_type  = "",
+        .mhandler.cmd_new = qmp_marshal_input_backup_cancel,
+    },
+
+    {
         .name       = "block-job-set-speed",
         .args_type  = "device:B,speed:o",
         .mhandler.cmd_new = qmp_marshal_input_block_job_set_speed,
@@ -2491,6 +2503,21 @@ EQMP
     },
 
 SQMP
+
+query-backup
+------------
+
+Backup status.
+
+EQMP
+
+    {
+        .name       = "query-backup",
+        .args_type  = "",
+        .mhandler.cmd_new = qmp_marshal_input_query_backup,
+    },
+
+SQMP
 migrate-set-capabilities
 -------
 
-- 
1.7.2.5


* [Qemu-devel] [PATCH 5/5] add regression tests for backup
  2012-11-21  9:01 [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1) Dietmar Maurer
                   ` (2 preceding siblings ...)
  2012-11-21  9:01 ` [Qemu-devel] [PATCH 4/5] add backup related monitor commands Dietmar Maurer
@ 2012-11-21  9:01 ` Dietmar Maurer
  2012-11-21 10:48 ` [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1) Kevin Wolf
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-21  9:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, Dietmar Maurer

Also added the ratelimit.h fix here, because it is still not upstream.
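
The old formula computed the quota per second rather than per slice.
As a worked example with the 100 ms slice the vma writer uses
(slice_ns = 100000000) and speed = 10485760 bytes/s (10 MiB/s):

    slice_quota = 10485760 * 100000000 / 1000000000 = 1048576

i.e. 1 MiB allowed per 0.1 s slice. The old formula yielded 104857600,
a factor of 100 too large for this slice length.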

Signed-off-by: Dietmar Maurer <dietmar@proxmox.com>
---
 include/qemu/ratelimit.h |    2 +-
 tests/Makefile           |   11 +-
 tests/backup-test.c      |  505 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 515 insertions(+), 3 deletions(-)
 create mode 100644 tests/backup-test.c

diff --git a/include/qemu/ratelimit.h b/include/qemu/ratelimit.h
index c6ac281..d1610f1 100644
--- a/include/qemu/ratelimit.h
+++ b/include/qemu/ratelimit.h
@@ -42,7 +42,7 @@ static inline void ratelimit_set_speed(RateLimit *limit, uint64_t speed,
                                        uint64_t slice_ns)
 {
     limit->slice_ns = slice_ns;
-    limit->slice_quota = ((double)speed * 1000000000ULL) / slice_ns;
+    limit->slice_quota = ((double)speed * slice_ns)/1000000000ULL;
 }
 
 #endif
diff --git a/tests/Makefile b/tests/Makefile
index ca680e5..284dc0c 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -18,6 +18,8 @@ check-unit-y += tests/test-iov$(EXESUF)
 
 check-block-$(CONFIG_POSIX) += tests/qemu-iotests-quick.sh
 
+check-backup-y = tests/backup-test$(EXESUF)
+
 # All QTests for now are POSIX-only, but the dependencies are
 # really in libqtest, not in the testcases themselves.
 check-qtest-i386-y = tests/fdc-test$(EXESUF)
@@ -50,6 +52,7 @@ tests/check-qfloat$(EXESUF): tests/check-qfloat.o qfloat.o
 tests/check-qjson$(EXESUF): tests/check-qjson.o $(qobject-obj-y) qemu-tool.o
 tests/test-coroutine$(EXESUF): tests/test-coroutine.o $(coroutine-obj-y) $(tools-obj-y) $(block-obj-y) iov.o libqemustub.a
 tests/test-iov$(EXESUF): tests/test-iov.o iov.o
+tests/backup-test$(EXESUF): tests/backup-test.o vma-reader.o $(tools-obj-y) $(block-obj-y) libqemustub.a
 
 tests/test-qapi-types.c tests/test-qapi-types.h :\
 $(SRC_PATH)/qapi-schema-test.json $(SRC_PATH)/scripts/qapi-types.py
@@ -142,10 +145,14 @@ check-tests/qemu-iotests-quick.sh: tests/qemu-iotests-quick.sh qemu-img$(EXESUF)
 
 # Consolidated targets
 
-.PHONY: check-qtest check-unit check
+.PHONY: check-backup check-qtest check-unit check
 check-qtest: $(patsubst %,check-qtest-%, $(QTEST_TARGETS))
 check-unit: $(patsubst %,check-%, $(check-unit-y))
 check-block: $(patsubst %,check-%, $(check-block-y))
-check: check-unit check-qtest
+
+check-backup: tests/backup-test$(EXESUF)
+	$<
+
+check: check-unit check-qtest check-backup
 
 -include $(wildcard tests/*.d)
diff --git a/tests/backup-test.c b/tests/backup-test.c
new file mode 100644
index 0000000..26ad346
--- /dev/null
+++ b/tests/backup-test.c
@@ -0,0 +1,505 @@
+/*
+ * QEMU backup test suite
+ *
+ * Copyright (C) Proxmox Server Solutions
+ *
+ * Authors:
+ *  Dietmar Maurer (dietmar@proxmox.com)
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Fixme: running 'backup-test -l' triggers a bug in g_slice_alloc()
+ * Note: 'G_SLICE=always-malloc ./tests/backup-test -l' works
+ *
+ */
+
+#include <sys/time.h>
+#include <sys/types.h>
+#include <stdarg.h>
+#include <stdio.h>
+#include <getopt.h>
+#include <libgen.h>
+
+#include "qemu-common.h"
+#include "main-loop.h"
+#include "block_int.h"
+
+#include "vma.h"
+
+static int opt_debug;
+static int opt_loop;
+
+#define DPRINTF(fmt, ...) \
+    do { if (opt_debug) { printf(fmt, ## __VA_ARGS__); } } while (0)
+
+#define CLUSTER(x) (x*BACKUP_CLUSTER_SIZE)
+
+#define RUN_TEST(testfunc, speed) \
+    backup_test(#testfunc " speed " #speed, speed, testfunc);
+
+
+static unsigned char buf_sec_pattern_cd[BDRV_SECTOR_SIZE];
+static unsigned char buf_sec_pattern_32[BDRV_SECTOR_SIZE];
+
+#define TEST_IMG_SIZE (6*1024*1024+BDRV_SECTOR_SIZE)
+#define TEST_IMG_NAME "backuptest.raw"
+#define TEST_IMG_RESTORE_NAME "backuptest.raw.restore"
+#define TEST_VMA_NAME "backuptest.vma"
+
+typedef struct BackupCB {
+    VmaWriter *vmaw;
+    uint8_t dev_id;
+} BackupCB;
+
+static int backup_dump_cb(void *opaque, BlockDriverState *bs,
+                          int64_t cluster_num, unsigned char *buf)
+{
+    BackupCB *bcb = opaque;
+
+    DPRINTF("backup_dump_cb C%zd %d\n", cluster_num, bcb->dev_id);
+
+    if (vma_writer_write(bcb->vmaw, bcb->dev_id, cluster_num, buf) < 0) {
+        printf("backup_dump_cb vma_writer_write failed\n");
+        return -1;
+    }
+
+    return 0;
+}
+
+static void backup_complete_cb(void *opaque, int ret)
+{
+    BackupCB *bcb = opaque;
+
+    DPRINTF("backup_complete_cb %d %d\n", bcb->dev_id, ret);
+
+    if (ret < 0) {
+        vma_writer_set_error(bcb->vmaw, "backup_complete_cb %d", ret);
+    }
+
+    if (vma_writer_close_stream(bcb->vmaw, bcb->dev_id) <= 0) {
+        if (vma_writer_close(bcb->vmaw) != 0) {
+            g_error("vma_writer_close failed");
+        }
+    }
+    DPRINTF("backup_complete_cb finish\n");
+}
+
+static void write_sec_pattern_cd(BlockDriverState *bs, int64_t offset)
+{
+    int ret;
+
+    DPRINTF("write_sec_pattern_cd %zd\n", offset);
+
+    if (offset & 0x1ff) {
+        g_error("write_sec_pattern_cd offset %zd is not sector aligned\n",
+                offset);
+    }
+
+    ret = bdrv_write(bs, offset >> 9, buf_sec_pattern_cd, 1);
+    if (ret < 0) {
+        g_error("write_sec_pattern_cd %zd failed", offset);
+    }
+
+}
+
+static void read_sec(BlockDriverState *bs, int64_t offset, unsigned char *buf)
+{
+    DPRINTF("read_sec C%zd start %zd\n", offset>>VMA_CLUSTER_BITS, offset);
+
+    if (offset & 0x1ff) {
+        g_error("read_sec offset %zd is not sector aligned\n", offset);
+    }
+
+    if (bdrv_read(bs, offset >> 9, buf, 1) < 0) {
+        g_error("bdrv_read failed");
+    }
+}
+
+static bool request_term;
+
+typedef struct TestCB {
+    Coroutine *co;
+    BlockDriverState *bs;
+    bool finished;
+} TestCB;
+
+static TestCB *enter_test_co(BlockDriverState *bs, CoroutineEntry *entry)
+{
+    TestCB *cb = g_new0(TestCB, 1);
+    cb->bs = bs;
+    cb->co = qemu_coroutine_create(entry);
+    qemu_coroutine_enter(cb->co, cb);
+    return cb;
+}
+
+static void test_co_sleep(double sec)
+{
+    co_sleep_ns(rt_clock, (int64_t)(sec*1000000000));
+}
+
+static void test_co_yield(void)
+{
+    co_sleep_ns(rt_clock, (int64_t)(1000));
+}
+
+static void coroutine_fn run_co_test1(void *opaque)
+{
+    assert(opaque);
+    TestCB *cb = (TestCB *)opaque;
+
+    test_co_sleep(0.2);
+    write_sec_pattern_cd(cb->bs, 5*BACKUP_CLUSTER_SIZE);
+    test_co_sleep(0.2);
+    write_sec_pattern_cd(cb->bs, 10*BACKUP_CLUSTER_SIZE);
+    test_co_sleep(0.2);
+    write_sec_pattern_cd(cb->bs, 10*BACKUP_CLUSTER_SIZE);
+
+    cb->finished = true;
+}
+
+static void coroutine_fn run_co_test2(void *opaque)
+{
+    assert(opaque);
+    TestCB *cb = (TestCB *)opaque;
+    unsigned char buf[512];
+
+    test_co_sleep(0.2);
+    read_sec(cb->bs, 5*BACKUP_CLUSTER_SIZE, buf);
+    write_sec_pattern_cd(cb->bs, 6*BACKUP_CLUSTER_SIZE);
+
+    cb->finished = true;
+}
+
+static void coroutine_fn run_co_random_read(void *opaque)
+{
+    assert(opaque);
+    TestCB *cb = (TestCB *)opaque;
+    int64_t sectors = bdrv_getlength(cb->bs)/BDRV_SECTOR_SIZE - 1;
+    unsigned char buf[512];
+
+    while (1) {
+        test_co_yield();
+        if (request_term) {
+            DPRINTF("finish run_co_random_read\n");
+            break;
+        }
+        int64_t s = (rand()*sectors)/RAND_MAX;
+        read_sec(cb->bs, s*BDRV_SECTOR_SIZE, buf);
+    }
+
+    cb->finished = true;
+}
+
+static void coroutine_fn run_co_random_write(void *opaque)
+{
+    assert(opaque);
+    TestCB *cb = (TestCB *)opaque;
+    int64_t sectors = bdrv_getlength(cb->bs)/BDRV_SECTOR_SIZE;
+
+    while (1) {
+        test_co_yield();
+        if (request_term) {
+            DPRINTF("finish run_co_random_write\n");
+            break;
+        }
+        int64_t s = (rand()*sectors)/RAND_MAX;
+        write_sec_pattern_cd(cb->bs, s*BDRV_SECTOR_SIZE);
+    }
+
+    cb->finished = true;
+}
+
+static void fill_test_sector(void *buf, size_t sector_num)
+{
+    int64_t *i64buf = (int64_t *)buf;
+    int i;
+
+    int data = sector_num;
+    if (sector_num >= 8 && sector_num < 8*(2*16+2)) {
+        data = 0;  /* add zero region for testing */
+    }
+
+    for (i = 0; i < (512/sizeof(int64_t)); i++) {
+        i64buf[i] = data;
+    }
+}
+
+static void verify_archive(const char *archive, size_t size)
+{
+    Error *errp = NULL;
+
+    VmaReader *vmar = vma_reader_create(archive, &errp);
+
+    if (!vmar) {
+        g_error("%s", error_get_pretty(errp));
+    }
+
+    VmaDeviceInfo *di = vma_reader_get_device_info(vmar, 1);
+    if (!di || strcmp((char *)di->devname, "hda") || di->size != size) {
+        g_error("got wrong device info");
+    }
+
+    unlink(TEST_IMG_RESTORE_NAME);
+
+    int flags = BDRV_O_NATIVE_AIO|BDRV_O_RDWR|BDRV_O_CACHE_WB;
+
+    if (bdrv_img_create(TEST_IMG_RESTORE_NAME, "raw", NULL, NULL, NULL,
+                        size, flags)) {
+        g_error("can't create file %s", TEST_IMG_RESTORE_NAME);
+    }
+
+    BlockDriverState *bs = NULL;
+    if (bdrv_file_open(&bs, TEST_IMG_RESTORE_NAME, flags)) {
+        g_error("can't open file %s", TEST_IMG_RESTORE_NAME);
+    }
+    if (vma_reader_register_bs(vmar, 1, bs, false, &errp) < 0) {
+        g_error("%s", error_get_pretty(errp));
+    }
+
+    if (vma_reader_restore(vmar, &errp) < 0) {
+        g_error("restore failed - %s", error_get_pretty(errp));
+    }
+
+    size_t i;
+    size_t sectors = size/BDRV_SECTOR_SIZE;
+    int64_t buf[512/sizeof(int64_t)];
+    int64_t buf2[512/sizeof(int64_t)];
+
+    for (i = 0; i < sectors; i++) {
+        if (bdrv_read(bs, i, (uint8_t *)buf, 1) != 0) {
+            g_error("bdrv_read failed");
+        }
+        fill_test_sector(buf2, i);
+        if (memcmp(buf, buf2, sizeof(buf)) != 0) {
+            g_error("data is different at sector %zd", i);
+        }
+    }
+
+    vma_reader_destroy(vmar);
+
+    unlink(TEST_IMG_RESTORE_NAME);
+}
+
+static void prepare_vm_image(const char *filename, size_t sectors)
+{
+    int fd = open(filename, O_RDWR|O_CREAT|O_TRUNC, 0644);
+    if (fd < 0) {
+        g_error("can't open file %s\n", filename);
+    }
+
+    size_t i;
+    int64_t buf[512/sizeof(int64_t)];
+
+    for (i = 0; i < sectors; i++) {
+        fill_test_sector(buf, i);
+
+        int res = 0;
+        while (1) {
+            res = write(fd, buf, sizeof(buf));
+            if (!(res < 0 && errno == EINTR)) {
+                break;
+            }
+        }
+        if (res != sizeof(buf)) {
+            g_error("can't initialize file %s - %s %d\n",
+                    filename, strerror(errno), res);
+        }
+    }
+
+    if (close(fd) != 0) {
+        g_error("close failed");
+    }
+}
+
+static GList *simple_test(BlockDriverState *bs)
+{
+    GList *cb_list = NULL;
+
+    cb_list = g_list_append(cb_list, enter_test_co(bs, run_co_test1));
+    cb_list = g_list_append(cb_list, enter_test_co(bs, run_co_test2));
+
+    return cb_list;
+}
+
+static GList *random_read_write_test(BlockDriverState *bs)
+{
+    GList *cb_list = NULL;
+
+    cb_list = g_list_append(cb_list, enter_test_co(bs, run_co_random_read));
+    cb_list = g_list_append(cb_list, enter_test_co(bs, run_co_random_read));
+    cb_list = g_list_append(cb_list, enter_test_co(bs, run_co_random_write));
+    cb_list = g_list_append(cb_list, enter_test_co(bs, run_co_random_write));
+
+    return cb_list;
+}
+
+static void backup_test(const char *testname, int64_t speed,
+                        GList *(*start_test_cb)(BlockDriverState *bs))
+{
+    BlockDriverState *bs = bdrv_new("hda");
+
+    static int test_counter;
+
+    test_counter++;
+
+    printf("starting test #%d '%s'\n", test_counter, testname);
+
+    const char *filename = TEST_IMG_NAME;
+
+    prepare_vm_image(filename, TEST_IMG_SIZE/BDRV_SECTOR_SIZE);
+
+    int flags = BDRV_O_NATIVE_AIO|BDRV_O_RDWR|BDRV_O_CACHE_WB;
+
+    if (bdrv_open(bs, filename, flags, NULL) < 0) {
+        g_error("can't open device %s\n", filename);
+    }
+
+    Error *err = NULL;
+
+    unlink(TEST_VMA_NAME);
+
+    VmaWriter *vmaw = vma_writer_create(TEST_VMA_NAME, speed, &err);
+    if (!vmaw) {
+        g_error("%s", error_get_pretty(err));
+    }
+
+    BackupCB bcb;
+    bcb.vmaw = vmaw;
+    bcb.dev_id = vma_writer_register_stream(vmaw, bdrv_get_device_name(bs),
+                                            bdrv_getlength(bs));
+    bdrv_backup_init(bs, backup_dump_cb, backup_complete_cb, &bcb);
+
+    request_term = false;
+
+    GList *cb_list = start_test_cb(bs);
+
+    while (1) {
+        main_loop_wait(false);
+
+        VmaStatus vmastat;
+        vma_writer_get_status(vmaw, &vmastat);
+        if (vmastat.closed) {
+            break;
+        }
+    }
+
+    request_term = true;
+
+    while (1) {
+        GList *l = cb_list;
+        bool active = false;
+        while (l && l->data) {
+            TestCB *cb = (TestCB *)l->data;
+            l = g_list_next(l);
+            if (!cb->finished) {
+                active = true;
+                break;
+            }
+        }
+        if (!active) {
+            DPRINTF("All test coroutines finished\n");
+            break;
+        }
+        main_loop_wait(false);
+    }
+
+    /* Make sure all outstanding requests complete */
+    bdrv_drain_all();
+
+    VmaStatus vmastat;
+    vma_writer_get_status(vmaw, &vmastat);
+    DPRINTF("statistic %zd %zd\n", vmastat.stream_info[1].size,
+            vmastat.stream_info[1].transferred);
+    assert(vmastat.stream_info[1].size == vmastat.stream_info[1].transferred);
+
+    vma_writer_destroy(vmaw);
+
+    bdrv_delete(bs);
+
+    /* start verification */
+    verify_archive(TEST_VMA_NAME, TEST_IMG_SIZE);
+
+    bdrv_close_all();
+
+    unlink(TEST_IMG_NAME);
+    unlink(TEST_VMA_NAME);
+
+    printf("finish test #%d '%s' OK\n", test_counter, testname);
+}
+
+static void help(void)
+{
+    const char *help_msg =
+        "usage: backup-test [options]\n"
+        "\n"
+        "backup-test        run default regression test (fast)\n"
+        "backup-test -l     run long running test loop (endless)\n"
+        "\n"
+        "use option -d to turn on verbose debug output\n"
+        ;
+
+    printf("%s", help_msg);
+    exit(1);
+}
+
+int main(int argc, char **argv)
+{
+    int c;
+
+    for (;;) {
+        c = getopt(argc, argv, "hdl");
+        if (c == -1) {
+            break;
+        }
+        switch (c) {
+        case '?':
+        case 'h':
+            help();
+            break;
+        case 'd':
+            opt_debug = 1;
+            break;
+        case 'l':
+            opt_loop = 1;
+            break;
+        default:
+            g_assert_not_reached();
+        }
+    }
+
+    memset(buf_sec_pattern_cd, 0xcd, sizeof(buf_sec_pattern_cd));
+    memset(buf_sec_pattern_32, 0x32, sizeof(buf_sec_pattern_32));
+
+    srand(1234);
+
+    /* ignore SIGPIPE */
+    struct sigaction act;
+    sigfillset(&act.sa_mask);
+    act.sa_flags = 0;
+    act.sa_handler = SIG_IGN;
+    sigaction(SIGPIPE, &act, NULL);
+
+    qemu_init_main_loop();
+
+    bdrv_init();
+
+    if (opt_loop) { /* endless test loop */
+        while (1) {
+            RUN_TEST(random_read_write_test, 0);
+        }
+        return 0;
+    }
+
+    if (opt_debug) { /* run simple test (rate limited) */
+        RUN_TEST(simple_test, 1024*1024);
+        return 0;
+    }
+
+    /* run default regression tests at full speed */
+
+    RUN_TEST(simple_test, 0);
+    RUN_TEST(random_read_write_test, 0);
+
+    return 0;
+}
-- 
1.7.2.5

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-21  9:01 [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1) Dietmar Maurer
                   ` (3 preceding siblings ...)
  2012-11-21  9:01 ` [Qemu-devel] [PATCH 5/5] add regression tests for backup Dietmar Maurer
@ 2012-11-21 10:48 ` Kevin Wolf
  2012-11-21 11:10   ` Dietmar Maurer
  2012-11-21 11:23   ` Dietmar Maurer
  2012-11-22 11:12 ` Stefan Hajnoczi
  2012-11-27 10:09 ` Wenchao Xia
  6 siblings, 2 replies; 73+ messages in thread
From: Kevin Wolf @ 2012-11-21 10:48 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: Paolo Bonzini, qemu-devel

Am 21.11.2012 10:01, schrieb Dietmar Maurer:
> +Some storage types/formats supports internal snapshots using some kind
> +of reference counting (rados, sheepdog, dm-thin, qcow2). It would be possible
> +to use that for backups, but for now we want to be storage-independent.
> +
> +Note: It turned out that taking a qcow2 snapshot can take a very long
> +time on larger files.

Hm, really? What are "larger files"? It has always been relatively quick
when I tested it, though internal snapshots are not my focus, so that
need not mean much.

If this is really an important use case for someone, I think qcow2
internal snapshots still have some potential for relatively easy
performance optimisations.

But that just as an aside...

> +
> +=Make it more efficient=
> +
> +The be more efficient, we simply need to avoid unnecessary steps. The
> +following steps are always required:
> +
> +1.) read old data before it gets overwritten
> +2.) write that data into the backup archive
> +3.) write new data (VM write)
> +
> +As you can see, this involves only one read, an two writes.

Looks like a nice approach to backup indeed.

The question is how to fit this into the big picture of qemu's live
block operations. Much of it looks like an active mirror (which is still
to be implemented), with the difference that it doesn't write the new,
but the old data, and that it keeps a bitmap of clusters that should not
be mirrored.

I'm not sure if this means that code should be shared between these two
or if the differences are too big. However, both of them have things in
common regarding the design. For example, both have a background part
(copying the existing data) and an active part (mirroring/backing up
data on writes). Block jobs are the right tool for the background part.

The active part is a bit more tricky. You're putting some code into
block.c to achieve it, which is kind of ugly. We have been talking about
"block filters" previously that would provide a generic infrastructure,
and at least in the mid term the additions to block.c must disappear.
(Same for block.h and block_int.h - keep things as separated from the
core as possible) Maybe we should introduce this infrastructure now.

Another interesting point is how (or whether) to link block jobs with
block filters. I think when the job is started, the filter should be
inserted automatically, and when you cancel it, it should be stopped.
When you pause the job... no idea. :-)

> +
> +To make that work, our backup archive need to be able to store image
> +data 'out of order'. It is important to notice that this will not work
> +with traditional archive formats like tar.

> +* works on any storage type and image format.
> +* we can define a new and simple archive format, which is able to
> +  store sparse files efficiently.

> +
> +Note: Storing sparse files is a mess with existing archive
> +formats. For example, tar requires information about holes at the
> +beginning of the archive.

> +* we need to define a new archive format
> +
> +Note: Most existing archive formats are optimized to store small files
> +including file attributes. We simply do not need that for VM archives.
> +
> +* archive contains data 'out of order'
> +
> +If you want to access image data in sequential order, you need to
> +re-order archive data. It would be possible to to that on the fly,
> +using temporary files.
> +
> +Fortunately, a normal restore/extract works perfectly with 'out of
> +order' data, because the target files are seekable.

> +=Archive format requirements=
> +
> +The basic requirement for such new format is that we can store image
> +date 'out of order'. It is also very likely that we have less than 256
> +drives/images per VM, and we want to be able to store VM configuration
> +files.
> +
> +We have defined a very simply format with those properties, see:
> +
> +docs/specs/vma_spec.txt
> +
> +Please let us know if you know an existing format which provides the
> +same functionality.

Essentially, what you need is an image format. You want to be
independent from the source image formats, but you're okay with using a
specific format for the backup (or you wouldn't have defined a new
format for it).

The one special thing that you need is storing multiple images in one
file. There's something like this already in qemu: qcow2 with its
internal snapshots is basically a flat file system.

Not saying that this is necessarily the best option, but I think reusing
existing formats and implementation is always a good thing, so it's an
idea to consider.

Kevin

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-21 10:48 ` [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1) Kevin Wolf
@ 2012-11-21 11:10   ` Dietmar Maurer
  2012-11-21 12:37     ` Kevin Wolf
  2012-11-21 11:23   ` Dietmar Maurer
  1 sibling, 1 reply; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-21 11:10 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, qemu-devel

> > +Note: It turned out that taking a qcow2 snapshot can take a very long
> > +time on larger files.
> 
> Hm, really? What are "larger files"? It has always been relatively quick when I
> tested it, though internal snapshots are not my focus, so that need not mean
> much.

300GB or larger
 
> If this is really an important use case for someone, I think qcow2 internal
> snapshots still have some potential for relatively easy performance
> optimisations.

I guess the problem is the small cluster size, so the reference table gets quite large
(for example fvd uses 2GB to minimize table size).
 
> But that just as an aside...
> 
> > +
> > +=Make it more efficient=
> > +
> > +The be more efficient, we simply need to avoid unnecessary steps. The
> > +following steps are always required:
> > +
> > +1.) read old data before it gets overwritten
> > +2.) write that data into the backup archive
> > +3.) write new data (VM write)
> > +
> > +As you can see, this involves only one read, an two writes.
> 
> Looks like a nice approach to backup indeed.
> 
> The question is how to fit this into the big picture of qemu's live block
> operations. Much of it looks like an active mirror (which is still to be
> implemented), with the difference that it doesn't write the new, but the old
> data, and that it keeps a bitmap of clusters that should not be mirrored.
> 
> I'm not sure if this means that code should be shared between these two or
> if the differences are too big. However, both of them have things in common
> regarding the design. For example, both have a background part (copying the
> existing data) and an active part (mirroring/backing up data on writes). Block
> jobs are the right tool for the background part.

I already use block jobs. Or do you want to share more?
 
> The active part is a bit more tricky. You're putting some code into block.c to
> achieve it, which is kind of ugly. 

Yes, but I tried to keep that small ;-)

>We have been talking about "block filters"
> previously that would provide a generic infrastructure, and at least in the mid
> term the additions to block.c must disappear.
> (Same for block.h and block_int.h - keep things as separated from the core as
> possible) Maybe we should introduce this infrastructure now.

I have no idea what you are talking about. Can you point me to the relevant discussion?
 
> Another interesting point is how (or whether) to link block jobs with block
> filters. I think when the job is started, the filter should be inserted
> automatically, and when you cancel it, it should be stopped.
> When you pause the job... no idea. :-)
> 
> > +
> > +To make that work, our backup archive need to be able to store image
> > +data 'out of order'. It is important to notice that this will not
> > +work with traditional archive formats like tar.
> 
> > +* works on any storage type and image format.
> > +* we can define a new and simple archive format, which is able to
> > +  store sparse files efficiently.
> 
> > +
> > +Note: Storing sparse files is a mess with existing archive formats.
> > +For example, tar requires information about holes at the beginning of
> > +the archive.
> 
> > +* we need to define a new archive format
> > +
> > +Note: Most existing archive formats are optimized to store small
> > +files including file attributes. We simply do not need that for VM archives.
> > +
> > +* archive contains data 'out of order'
> > +
> > +If you want to access image data in sequential order, you need to
> > +re-order archive data. It would be possible to to that on the fly,
> > +using temporary files.
> > +
> > +Fortunately, a normal restore/extract works perfectly with 'out of
> > +order' data, because the target files are seekable.
> 
> > +=Archive format requirements=
> > +
> > +The basic requirement for such new format is that we can store image
> > +date 'out of order'. It is also very likely that we have less than
> > +256 drives/images per VM, and we want to be able to store VM
> > +configuration files.
> > +
> > +We have defined a very simply format with those properties, see:
> > +
> > +docs/specs/vma_spec.txt
> > +
> > +Please let us know if you know an existing format which provides the
> > +same functionality.
> 
> Essentially, what you need is an image format. You want to be independent
> from the source image formats, but you're okay with using a specific format
> for the backup (or you wouldn't have defined a new format for it).
> 
> The one special thing that you need is storing multiple images in one file.
> There's something like this already in qemu: qcow2 with its internal
> snapshots is basically a flat file system.
> 
> Not saying that this is necessarily the best option, but I think reusing existing
> formats and implementation is always a good thing, so it's an idea to
> consider.

AFAIK a qcow2 file cannot store data out of order. In general, a backup fd
is not seekable, and we only want to do sequential writes. Don't image
formats always require seekable fds?

Anyways, a qcow2 file is a really complex beast - I am quite unsure if I
would use it for backup even if that were possible.

That would require any external tool to include >=50000 LOC.

The vma reader code is about 700 LOC (quite easy).

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-21 10:48 ` [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1) Kevin Wolf
  2012-11-21 11:10   ` Dietmar Maurer
@ 2012-11-21 11:23   ` Dietmar Maurer
  1 sibling, 0 replies; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-21 11:23 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, qemu-devel

> Not saying that this is necessarily the best option, but I think reusing existing
> formats and implementation is always a good thing, so it's an idea to
> consider.

Yes, I would really like to reuse something. Our current backup software uses 'tar' files,
but that is really inefficient. We also analyzed all other available
archive formats, but none of them is capable of storing sparse files efficiently.

And storing data out of order is beyond the scope of existing formats.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-21 11:10   ` Dietmar Maurer
@ 2012-11-21 12:37     ` Kevin Wolf
  2012-11-21 13:23       ` Paolo Bonzini
                         ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Kevin Wolf @ 2012-11-21 12:37 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: Paolo Bonzini, qemu-devel

Am 21.11.2012 12:10, schrieb Dietmar Maurer:
>>> +Note: It turned out that taking a qcow2 snapshot can take a very long
>>> +time on larger files.
>>
>> Hm, really? What are "larger files"? It has always been relatively quick when I
>> tested it, though internal snapshots are not my focus, so that need not mean
>> much.
> 
> 300GB or larger
>  
>> If this is really an important use case for someone, I think qcow2 internal
>> snapshots still have some potential for relatively easy performance
>> optimisations.
> 
> I guess the problem is the small cluster size, so the reference table gets quite large
> (for example fvd uses 2GB to minimize table size).

qemu-img check gives an idea of what it costs to read in the whole
metadata of an image. Updating some of it should cost no more than a
factor of two on top of that. I'm seeing much bigger differences, so I
suspect there's something wrong.

Somebody should probably try tracing where the performance is lost.
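
For example (a rough way to measure the metadata read cost; the image
name is just a placeholder):

    time qemu-img check big-image.qcow2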

>> But that just as an aside...
>>
>>> +
>>> +=Make it more efficient=
>>> +
>>> +The be more efficient, we simply need to avoid unnecessary steps. The
>>> +following steps are always required:
>>> +
>>> +1.) read old data before it gets overwritten
>>> +2.) write that data into the backup archive
>>> +3.) write new data (VM write)
>>> +
>>> +As you can see, this involves only one read, an two writes.
>>
>> Looks like a nice approach to backup indeed.
>>
>> The question is how to fit this into the big picture of qemu's live block
>> operations. Much of it looks like an active mirror (which is still to be
>> implemented), with the difference that it doesn't write the new, but the old
>> data, and that it keeps a bitmap of clusters that should not be mirrored.
>>
>> I'm not sure if this means that code should be shared between these two or
>> if the differences are too big. However, both of them have things in common
>> regarding the design. For example, both have a background part (copying the
>> existing data) and an active part (mirroring/backing up data on writes). Block
>> jobs are the right tool for the background part.
> 
> I already use block jobs. Or do you want to share more?

I was thinking about sharing code between a future active mirror and the
backup job, which may or may not make sense. I'm mostly hoping for input
from Paolo here.

>> The active part is a bit more tricky. You're putting some code into block.c to
>> achieve it, which is kind of ugly. 
> 
> yes. but I tried to keep that small ;-)

Yup, it's already not too bad. I haven't looked into it in much detail,
but I'd like to reduce it even a bit more. In particular, the
backup_info field in the BlockDriverState feels wrong to me. In the long
term the generic block layer shouldn't know at all what a backup is, and
baking it into BDS couples it very tightly.

>> We have been talking about "block filters"
>> previously that would provide a generic infrastructure, and at least in the mid
>> term the additions to block.c must disappear.
>> (Same for block.h and block_int.h - keep things as separated from the core as
>> possible) Maybe we should introduce this infrastructure now.
> 
> I have no idea what you talk about? Can you point me to the relevant discussion?

Not sure if a single discussion explains it, and I can't even find one
at the moment.

In short, the idea is that you can stick filters on top of a
BlockDriverState, so that any read/writes (and possibly more requests,
if necessary) are routed through the filter before they are passed to
the block driver of this BDS. Filters would be implemented as
BlockDrivers, i.e. you could implement .bdrv_co_writev() in a filter to
intercept all writes to an image.
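
As a rough sketch (hypothetical names - none of this infrastructure
exists yet), such a filter driver could look like:

    /* run a hook, then forward the write to the underlying BDS */
    static int coroutine_fn filter_co_writev(BlockDriverState *bs,
        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
    {
        /* e.g. let a backup job copy out the old data first */
        int ret = backup_before_write(bs->file, sector_num, nb_sectors);
        if (ret < 0) {
            return ret;
        }
        return bdrv_co_writev(bs->file, sector_num, nb_sectors, qiov);
    }

    static BlockDriver bdrv_write_filter = {
        .format_name    = "write-filter",
        .bdrv_co_writev = filter_co_writev,
    };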

>> Another interesting point is how (or whether) to link block jobs with block
>> filters. I think when the job is started, the filter should be inserted
>> automatically, and when you cancel it, it should be stopped.
>> When you pause the job... no idea. :-)

>> Essentially, what you need is an image format. You want to be independent
>> from the source image formats, but you're okay with using a specific format
>> for the backup (or you wouldn't have defined a new format for it).
>>
>> The one special thing that you need is storing multiple images in one file.
>> There's something like this already in qemu: qcow2 with its internal
>> snapshots is basically a flat file system.
>>
>> Not saying that this is necessarily the best option, but I think reusing existing
>> formats and implementation is always a good thing, so it's an idea to
>> consider.
> 
> AFAIK qcow2 file cannot store data out of order. In general, an backup fd is not seekable, 
> and we only want to do sequential writes. Image format always requires seekable fds?

Ah, this is what you mean by "out of order". Just out of curiosity, what
are these non-seekable backup fds usually?

In principle qcow2 could be used as an image format even for this;
however, the existing implementation wouldn't be of much use for you, so
it loses quite a bit of its attractiveness.

> Anyways, a qcow2 file is really complex beast - I am quite unsure if I would use 
> that for backup if it is possible. 
> 
> That would require any external tool to include >=50000 LOC
> 
> The vma reader code is about 700 LOC (quite easy).

So what? qemu-img is already there.

Kevin

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-21 12:37     ` Kevin Wolf
@ 2012-11-21 13:23       ` Paolo Bonzini
  2012-11-23  7:42         ` Dietmar Maurer
  2012-11-23  8:12         ` Dietmar Maurer
  2012-11-21 13:25       ` Dietmar Maurer
  2012-11-23  7:38       ` Dietmar Maurer
  2 siblings, 2 replies; 73+ messages in thread
From: Paolo Bonzini @ 2012-11-21 13:23 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Dietmar Maurer, qemu-devel

Il 21/11/2012 13:37, Kevin Wolf ha scritto:
>>> >> The active part is a bit more tricky. You're putting some code into block.c to
>>> >> achieve it, which is kind of ugly. 
>> > 
>> > yes. but I tried to keep that small ;-)
> Yup, it's already not too bad. I haven't looked into it in much detail,
> but I'd like to reduce it even a bit more. In particular, the
> backup_info field in the BlockDriverState feels wrong to me. In the long
> term the generic block layer shouldn't know at all what a backup is, and
> baking it into BDS couples it very tightly.

My plan was to have something like bs->job->job_type->{before,after}_write.

   int coroutine_fn (*before_write)(BlockDriverState *bs,
        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
        void **cookie);
   int coroutine_fn (*after_write)(BlockDriverState *bs,
        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
        void *cookie);


The before_write could optionally return a "cookie" that is passed back
to the after_write callback.

Actually this was plan B, as a poor-man implementation of the filter
infrastructure.  Plan A was that the block filters would materialize
suddenly in someone's git tree.

Anyway, it should be very easy to convert Dietmar's code to something
like that, and the active mirror could use it as well.
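
For illustration (just a sketch, helper names made up), the backup job's
hook could then be:

    static int coroutine_fn backup_before_write(BlockDriverState *bs,
        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
        void **cookie)
    {
        /* dump the clusters that are about to be overwritten into
         * the archive before the guest write proceeds;
         * cookie is unused in this sketch */
        return backup_do_cow(bs, sector_num, nb_sectors);
    }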

>> > AFAIK qcow2 file cannot store data out of order. In general, an backup fd is not seekable, 
>> > and we only want to do sequential writes. Image format always requires seekable fds?
> Ah, this is what you mean by "out of order". Just out of curiosity, what
> are these non-seekable backup fds usually?

Perhaps I've been reading the SCSI standards too much lately, but tapes
come to mind. :)

Paolo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-21 12:37     ` Kevin Wolf
  2012-11-21 13:23       ` Paolo Bonzini
@ 2012-11-21 13:25       ` Dietmar Maurer
  2012-11-21 13:58         ` Kevin Wolf
  2012-11-23  7:38       ` Dietmar Maurer
  2 siblings, 1 reply; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-21 13:25 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, qemu-devel

> > AFAIK qcow2 file cannot store data out of order. In general, an backup
> > fd is not seekable, and we only want to do sequential writes. Image format
> always requires seekable fds?
> 
> Ah, this is what you mean by "out of order". Just out of curiosity, what are
> these non-seekable backup fds usually?

/dev/nst0 ;-)

But there are better examples. Usually you want to use some kind of
compression, and you do that with existing tools:

# backup to stdout|gzip|...

A common usage scenario is to pipe a backup into a restore (copy)

# backup to stdout|ssh to remote host -c 'restore from stdin'

It is also a performance question. Seeks are terribly slow.

> In principle even for this qcow2 could be used as an image format, however
> the existing implementation wouldn't be of much use for you, so it loses
> quite a bit of its attractiveness.
> 
> > Anyways, a qcow2 file is really complex beast - I am quite unsure if I
> > would use that for backup if it is possible.
> >
> > That would require any external tool to include >=50000 LOC
> >
> > The vma reader code is about 700 LOC (quite easy).
> 
> So what? qemu-img is already there.

Anyways, you already pointed out that the existing implementation does not work.

But I already expected such a discussion. So maybe it is better if we simply
pipe all data to an external binary? We just need to define a minimal protocol.

In the future we could produce different archivers as independent/external binaries?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-21 13:25       ` Dietmar Maurer
@ 2012-11-21 13:58         ` Kevin Wolf
  2012-11-21 15:47           ` Dietmar Maurer
  0 siblings, 1 reply; 73+ messages in thread
From: Kevin Wolf @ 2012-11-21 13:58 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: Paolo Bonzini, qemu-devel

Am 21.11.2012 14:25, schrieb Dietmar Maurer:
>>> AFAIK qcow2 file cannot store data out of order. In general, an backup
>>> fd is not seekable, and we only want to do sequential writes. Image format
>> always requires seekable fds?
>>
>> Ah, this is what you mean by "out of order". Just out of curiosity, what are
>> these non-seekable backup fds usually?
> 
> /dev/nst0 ;-)

Sure. :-)

> But there are better examples. Usually you want to use some kind of
> compression, and you do that with existing tools:
> 
> # backup to stdout|gzip|...

When you use an image/archive format anyway, you could use a compression
mechanism that it already supports.

> A common usage scenario is to pipe a backup into a restore (copy)
> 
> # backup to stdout|ssh to remote host -c 'restore from stdin'

This is a good one. I believe our usual solution would have been to back
up to an NBD server on the remote host instead.

In general I can see that being able to pipe it to other programs could
be nice. I'm not sure if it's an absolute requirement. Would your tools
for taking the backup employ any specific use of pipes?

> It is also a performance question. Seeks are terrible slow.

You wouldn't do it a lot. Only for metadata, and you would only write
out the metadata once the in-memory cache is full.

>> In principle even for this qcow2 could be used as an image format, however
>> the existing implementation wouldn't be of much use for you, so it loses
>> quite a bit of its attractiveness.
>>
>>> Anyways, a qcow2 file is really complex beast - I am quite unsure if I
>>> would use that for backup if it is possible.
>>>
>>> That would require any external tool to include >=50000 LOC
>>>
>>> The vma reader code is about 700 LOC (quite easy).
>>
>> So what? qemu-img is already there.
> 
> Anyways, you already pointed out that the existing implementation does not work.

I'm still trying to figure out the real requirements to think some more
about it. :-)

> But I already expected such discussion. So maybe it is better we simply pipe all data to an external binary?
> We just need to define a minimal protocol. 
> 
> In future we can produce different archivers as independent/external binaries?

You shouldn't look at discussions as a bad thing. We're not trying to
block your changes, but to understand and possibly improve them.

Yes, discussions mean that it takes a bit longer to get things merged,
but they also mean that usually something better is merged in the end
that actually fits well in qemu's design, is maintainable, generic and
so on. Evading the discussions by keeping code externally wouldn't
improve things.

Which doesn't mean that external archivers are completely out of the
question, but I would only consider them if there's a good technical
reason to do so.

So if eventually we come to the conclusion that vma (or for that matter,
anything else in your patches) is the right solution, let's take it. But
first please give us the chance to understand the reasons of why you did
things the way you did them, and to discuss the pros and cons of
alternative solutions.

Kevin

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-21 13:58         ` Kevin Wolf
@ 2012-11-21 15:47           ` Dietmar Maurer
  0 siblings, 0 replies; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-21 15:47 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, qemu-devel

> >> Ah, this is what you mean by "out of order". Just out of curiosity,
> >> what are these non-seekable backup fds usually?
> >
> > /dev/nst0 ;-)
> 
> Sure. :-)
> 
> > But there are better examples. Usually you want to use some kind of
> > compression, and you do that with existing tools:
> >
> > # backup to stdout|gzip|...
> 
> When you use an image/archive format anyway, you could use a
> compression mechanism that it already supports.

Many archive formats do not support compression internally (tar, cpio, ...).
I also avoided including that in the 'vma' format, so you can use any
external tool.

Some users want to compress with bzip2, others with gzip -1, xz, pgzip, ...
Or maybe pipe into some kind of encryption tool ...

> > A common usage scenario is to pipe a backup into a restore (copy)
> >
> > # backup to stdout|ssh to remote host -c 'restore from stdin'
> 
> This is a good one. I believe our usual solution would have been to backup to
> a NBD server on the remote host instead.
> 
> In general I can see that being able to pipe it to other programs could be nice.
> I'm not sure if it's an absolute requirement. Would your tools for taking the
> backup employ any specific use of pipes?

Yes, we currently have that functionality, and I do not want to remove features.

> > It is also a performance question. Seeks are terrible slow.
> 
> You wouldn't do it a lot. Only for metadata, and you would only write out the
> metadata once the in-memory cache is full.

IMHO it is still much better to write sequentially, because that has 'zero' overhead.

Besides, writing data sequentially is so much easier (on the implementation side).

The current VMA code also uses checksums and special 'uuid' markers, which
make it possible to find and recover damaged archives. I guess such things
are quite impossible with qcow2, or very hard to do?

> >> In principle even for this qcow2 could be used as an image format,
> >> however the existing implementation wouldn't be of much use for you,
> >> so it loses quite a bit of its attractiveness.
> >>
> >>> Anyways, a qcow2 file is really complex beast - I am quite unsure if
> >>> I would use that for backup if it is possible.
> >>>
> >>> That would require any external tool to include >=50000 LOC
> >>>
> >>> The vma reader code is about 700 LOC (quite easy).
> >>
> >> So what? qemu-img is already there.
> >
> > Anyways, you already pointed out that the existing implementation does
> not work.
> 
> I'm still trying to figure out the real requirements to think some more about
> it. :-)

Every existing archive format I know of works on pipes (without seeks).
Well, that does not really mean anything.

> > But I already expected such discussion. So maybe it is better we simply pipe
> all data to an external binary?
> > We just need to define a minimal protocol.
> >
> > In future we can produce different archivers as independent/external
> binaries?
> 
> You shouldn't look at discussions as a bad thing. We're not trying to block
> your changes, but to understand and possibly improve them.

I do not consider your comments a 'bad thing' - the above idea was a real suggestion ;-)

I already have plans to use a Content Addressable Storage (instead of 'vma'), so
such a plugin architecture would make it easier to play around with different formats.
 
> Yes, discussions mean that it takes a bit longer to get things merged, but they
> also mean that usually something better is merged in the end that actually
> fits well in qemu's design, is maintainable, generic and so on. Evading the
> discussions by keeping code externally wouldn't improve things.

sure.
 
> Which doesn't mean that external archivers are completely out of the
> question, but I would only consider them if there's a good technical reason to
> do so.

As noted above, I can see room for different formats.

1.) 'vma' is my proof of concept, easy to implement and use.
2.) CAS - very useful to sync backup data across datacenters (this
gives us deduplication and a kind of 'incremental backup')
3.) support for existing archive formats like 'tar' (this is possible if we
use temporary files to store out-of-order data)
4.) backup to some kind of external server
5.) plugins for existing backup tools (bacula, ...)?

> So if eventually we come to the conclusion that vma (or for that matter,
> anything else in your patches) is the right solution, let's take it. But first
> please give us the chance to understand the reasons of why you did things
> the way you did them, and to discuss the pros and cons of alternative
> solutions.

Sure. I was not aware that I wrote something negative in the previous reply - sorry for that.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 3/5] introduce new vma archive format
  2012-11-21  9:01 ` [Qemu-devel] [PATCH 3/5] introduce new vma archive format Dietmar Maurer
@ 2012-11-21 16:06   ` Eric Blake
  2012-11-21 17:56     ` Dietmar Maurer
  0 siblings, 1 reply; 73+ messages in thread
From: Eric Blake @ 2012-11-21 16:06 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: kwolf, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2525 bytes --]

On 11/21/2012 02:01 AM, Dietmar Maurer wrote:
> This is a very simple archive format, see docs/specs/vma_spec.txt
> 
> Signed-off-by: Dietmar Maurer <dietmar@proxmox.com>
> ---

> +++ b/docs/specs/vma_spec.txt
> @@ -0,0 +1,24 @@
> +=Virtual Machine Archive format (VMA)=
> +
> +This format contains a header which includes the VM configuration as
> +binary blobs, and a list of devices (dev_id, name).
> +
> +The actual VM image date is stored inside extends. En extend contains

s/date/data/
s/extends. En extend/extents. An extent/

> +up to 64 clusters, and start with a 512 byte header containing
> +additional information for those clusters.
> +
> +We use a cluster size of 65536, and use 8 bytes for each
> +cluster in the header to store the following information:
> +
> +* 1 byte dev_id (to identity the drive)
> +* 2 bytes zero indicator (mark zero regions (16x4069))

s/4069/4096/

> +* 4 bytes cluster number

Is that sufficient, or is it possible to have an image larger than
64k*4G that would overflow?

> +* 1 byte not used (reserved)
> +
> +We only store non-zero blocks (such block is 4096 bytes).
> +
> +Each archive is marked with an unique uuid. The archive header and all

s/an unique/a/ (by definition, 'uuid' is an acronym that already means
unique; also, it is 'a' and not 'an' before any 'u' pronounced as 'y',
which is true for both 'unique' and a spelled-out 'uuid')

> +extend headers includes that uuid and a MD5 checksum (over header

s/extend/extent/

> +data).
> +
> +
> diff --git a/vma-reader.c b/vma-reader.c
> new file mode 100644
> index 0000000..7a54de5
> --- /dev/null
> +++ b/vma-reader.c
> @@ -0,0 +1,720 @@
> +/*
> + * VMA: Virtual Machine Archive
> + *
> + * Copyright (C) Proxmox Server Solutions

Missing a year.

> + *
> + * Authors:
> + *  Dietmar Maurer (dietmar@proxmox.com)
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See

Can you please use GPLv2+ (the 'or later' clause is essential if you
want your work to be reusable in GPLv3[+] projects)?

I didn't review the code, just the specification.  I have to wonder how
much of your work overlaps with Paolo's 'drive-mirror' and NBD server
work; and it seems to me that it is better to use 'drive-mirror' for
doing backup work into existing disk formats, rather than inventing yet
another archive format.

-- 
Eric Blake   eblake@redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 619 bytes --]

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 4/5] add backup related monitor commands
  2012-11-21  9:01 ` [Qemu-devel] [PATCH 4/5] add backup related monitor commands Dietmar Maurer
@ 2012-11-21 16:16   ` Eric Blake
  2012-11-21 17:59     ` Dietmar Maurer
  2012-11-21 18:49     ` Dietmar Maurer
  0 siblings, 2 replies; 73+ messages in thread
From: Eric Blake @ 2012-11-21 16:16 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: kwolf, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2276 bytes --]

On 11/21/2012 02:01 AM, Dietmar Maurer wrote:
> We currently create 'vma' archives without any configuration inside.
> Future versions may support other formats...
> 
> Another option would be to simply dump <devid,cluster_num,cluster_data> to
> the output fh (pipe), and an external binary saves the data. That way we
> could move the whole archive format related code out of qemu.
> 

> +++ b/qapi-schema.json
> @@ -357,6 +357,13 @@
>  ##
>  { 'type': 'EventInfo', 'data': {'name': 'str'} }
>  
> +
> +{ 'type': 'BackupStatus',
> +  'data': {'*status': 'str', '*errmsg': 'str', '*total': 'int',
> +           '*transferred': 'int', '*zero-bytes': 'int',
> +           '*start-time': 'int', '*end-time': 'int',
> +           '*backupfile': 'str', '*uuid': 'str' } }
> +

Missing documentation for what all these fields mean.

>  ##
>  # @query-events:
>  #
> @@ -1764,6 +1771,45 @@
>    'data': { 'path': 'str' },
>    'returns': [ 'ObjectPropertyInfo' ] }
>  
> +
> +##
> +# @backup:
> +#
> +# Starts a VM backup.
> +#
> +# @backupfile: the backup file name
> +#
> +# @speed:  #optional the maximum speed, in bytes per second
> +#
> +# Returns: the uuid of the backup job
> +#

Needs to mention when it was introduced (at best, it will be since 1.4).

> +##
> +{ 'command': 'backup', 'data': { 'backupfile': 'str', '*devlist': 'str',
> +                                 '*speed': 'int' },
> +  'returns': 'str' }
> +
> +##
> +# @query-backup
> +#
> +# Returns information about current/last backup task.
> +#
> +# Returns: @BackupStatus
> +#
> +##
> +{ 'command': 'query-backup', 'returns': 'BackupStatus' }
> +
> +##
> +# @backup_cancel

s/_/-/, you should prefer dashes in QMP

> +#
> +# Cancel the current executing backup process.
> +#
> +# Returns: nothing on success
> +#
> +# Notes: This command succeeds even if there is no backup process running.
> +#
> +##
> +{ 'command': 'backup_cancel' }
> +

You are basically adding a new asynchronous job.  Do you really need to
add a 'backup-cancel' command, or should we be reusing a generic
framework for canceling arbitrary jobs?

-- 
Eric Blake   eblake@redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 619 bytes --]

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 3/5] introduce new vma archive format
  2012-11-21 16:06   ` Eric Blake
@ 2012-11-21 17:56     ` Dietmar Maurer
  0 siblings, 0 replies; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-21 17:56 UTC (permalink / raw)
  To: Eric Blake; +Cc: kwolf, qemu-devel

> > +* 4 bytes cluster number
> 
> Is that sufficient, or is it possible to have an image larger than 64k*4G that
> would overflow?

Well, with 64KiB clusters and 32-bit cluster numbers that is
2^16 * 2^32 = 2^48 bytes = 256TB per image. This is sufficient for us.
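
For reference, the 8 bytes of per-cluster information from the spec map
to something like this (field names invented for illustration):

    typedef struct __attribute__((packed)) VmaClusterInfo {
        uint8_t  dev_id;       /* identifies the drive */
        uint16_t zero_mask;    /* marks zero regions (16 x 4096 bytes) */
        uint32_t cluster_num;  /* cluster index within the image */
        uint8_t  reserved;
    } VmaClusterInfo;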

> > +* 1 byte not used (reserved)
> > +
> > +We only store non-zero blocks (such block is 4096 bytes).
> > +
> > +Each archive is marked with an unique uuid. The archive header and
> > +all
> 
> s/an unique/a/ (by definition, 'uuid' is an acronym that already means
> unique; also, it is 'a' and not 'an' before any 'u' pronounced as 'y', which is
> true for both 'unique' and a spelled-out 'uuid')

many thanks for your help - will fix those things.

> > + * This work is licensed under the terms of the GNU GPL, version 2.
> > + See
> 
> Can you please use GPLv2+ (the 'or later' clause is essential if you want your
> work to be reusable in GPLv3[+] projects)?

sure, no problem.
 
> I didn't review the code, just the specification.  I have to wonder how much
> of your work overlaps with Paolo's 'drive-mirror' and NBD server work; and it
> seems to me that it is better to use 'drive-mirror' for doing backup work into
> existing disk formats, rather than inventing yet another archive format.

I already tried to explain why that does not work - see my previous posts.

What image format provides that functionality?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 4/5] add backup related monitor commands
  2012-11-21 16:16   ` Eric Blake
@ 2012-11-21 17:59     ` Dietmar Maurer
  2012-11-21 18:49     ` Dietmar Maurer
  1 sibling, 0 replies; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-21 17:59 UTC (permalink / raw)
  To: Eric Blake; +Cc: kwolf, qemu-devel

> > +{ 'type': 'BackupStatus',
> > +  'data': {'*status': 'str', '*errmsg': 'str', '*total': 'int',
> > +           '*transferred': 'int', '*zero-bytes': 'int',
> > +           '*start-time': 'int', '*end-time': 'int',
> > +           '*backupfile': 'str', '*uuid': 'str' } }
> > +
> 
> Missing documentation for what all these fields mean.

Yes, sorry for that. I just wanted to get early feedback from the community
before working out the details.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 4/5] add backup related monitor commands
  2012-11-21 16:16   ` Eric Blake
  2012-11-21 17:59     ` Dietmar Maurer
@ 2012-11-21 18:49     ` Dietmar Maurer
  1 sibling, 0 replies; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-21 18:49 UTC (permalink / raw)
  To: Eric Blake; +Cc: kwolf, qemu-devel

> > +{ 'command': 'backup_cancel' }
> > +
> 
> You are basically adding a new asynchronous job.  Do you really need to add
> a 'backup-cancel' command, or should we be reusing a generic framework
> for canceling arbitrary jobs?

No, we basically add several asynchronous jobs - one for each blockdev. And we want
to cancel them all together.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-21  9:01 [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1) Dietmar Maurer
                   ` (4 preceding siblings ...)
  2012-11-21 10:48 ` [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1) Kevin Wolf
@ 2012-11-22 11:12 ` Stefan Hajnoczi
  2012-11-22 11:26   ` Dietmar Maurer
                     ` (4 more replies)
  2012-11-27 10:09 ` Wenchao Xia
  6 siblings, 5 replies; 73+ messages in thread
From: Stefan Hajnoczi @ 2012-11-22 11:12 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: kwolf, qemu-devel

On Wed, Nov 21, 2012 at 10:01:00AM +0100, Dietmar Maurer wrote:
> +==Disadvantages==
> +
> +* we need to define a new archive format
> +
> +Note: Most existing archive formats are optimized to store small files
> +including file attributes. We simply do not need that for VM archives.

Did you look at the VMDK "Stream-Optimized Compressed" subformat?

http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?src=vmdk

It is a stream of compressed "grains" (data).  They are out-of-order and
each grain comes with the virtual disk lba where the data should be
visible to the guest.

The stream also contains "grain tables" and "grain directories".  This
metadata makes random read access to the file possible once you have
downloaded the entire file (i.e. it is seekable).  Although tools can
choose to consume the stream in sequential order too and ignore the
metadata.

In other words, the format is an out-of-order stream of data chunks plus
random access lookup tables at the end.
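
Roughly (following the technote, simplified), each data grain in the
stream is preceded by a small marker:

    typedef struct __attribute__((packed)) GrainMarker {
        uint64_t lba;     /* guest offset in 512-byte sectors */
        uint32_t size;    /* number of compressed bytes that follow */
        uint8_t  data[];  /* deflate-compressed grain data */
    } GrainMarker;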

QEMU's block/vmdk.c already has some support for this format although I
don't think we generate out-of-order yet.

The benefit of reusing this code is that existing tools can consume
these files.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 2/5] add basic backup support to block driver
  2012-11-21  9:01 ` [Qemu-devel] [PATCH 2/5] add basic backup support to block driver Dietmar Maurer
@ 2012-11-22 11:25   ` Stefan Hajnoczi
  2012-11-22 11:29     ` Dietmar Maurer
  2012-11-23  8:56     ` Dietmar Maurer
  0 siblings, 2 replies; 73+ messages in thread
From: Stefan Hajnoczi @ 2012-11-22 11:25 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: kwolf, qemu-devel

On Wed, Nov 21, 2012 at 10:01:01AM +0100, Dietmar Maurer wrote:
Two comments from skimming the code, not a full review.

> +/* #define DEBUG_BACKUP */
> +
> +#ifdef DEBUG_BACKUP
> +#define DPRINTF(fmt, ...) \
> +    do { printf("backup: " fmt, ## __VA_ARGS__); } while (0)
> +#else
> +#define DPRINTF(fmt, ...) \
> +    do { } while (0)
> +#endif

Please don't #define debug macros like this.  It hides build breakage
because the debug code is #ifdef'd out.

Either use docs/tracing.txt or do the following so the code always gets
parsed by the compiler:

#define DEBUG_BACKUP 0
#define DPRINTF(fmt, ...) \
    do { \
        if (DEBUG_BACKUP) { \
            printf("backup: " fmt, ## __VA_ARGS__); \
        } \
    } while (0)

> +BackupInfo *
> +bdrv_backup_init(BlockDriverState *bs, BackupDumpFunc *backup_dump_cb,
> +                 BlockDriverCompletionFunc *backup_complete_cb,
> +                 void *opaque)
> +{
> +    assert(bs);
> +    assert(backup_dump_cb);
> +    assert(backup_complete_cb);
> +
> +    if (bs->backup_info) {
> +        DPRINTF("bdrv_backup_init already initialized %s\n",
> +                bdrv_get_device_name(bs));
> +        return NULL;
> +    }
> +
> +    BackupInfo *info = g_malloc0(sizeof(BackupInfo));
> +    int64_t bitmap_size;
> +    const char *devname = bdrv_get_device_name(bs);
> +
> +    if (!devname || !devname[0]) {
> +        return NULL;
> +    }
> +
> +    DPRINTF("bdrv_backup_init %s\n", bdrv_get_device_name(bs));
> +
> +    bitmap_size = bs->total_sectors +
> +        BACKUP_BLOCKS_PER_CLUSTER * BITS_PER_LONG - 1;
> +    bitmap_size /= BACKUP_BLOCKS_PER_CLUSTER * BITS_PER_LONG;
> +
> +    info->backup_dump_cb = backup_dump_cb;
> +    info->backup_complete_cb = backup_complete_cb;
> +    info->opaque = opaque;
> +    info->bitmap_size = bitmap_size;
> +    info->bitmap = g_new0(unsigned long, bitmap_size);
> +
> +    Error *errp;
> +    BackupBlockJob *job = block_job_create(&backup_job_type, bs, 0,
> +                                           backup_job_cleanup_cb, bs, &errp);

Is there a 1:1 relationship between BackupInfo and BackupBlockJob?  Then
it would be nicer to move BlockupInfo fields into BackupBlockJob (which
is empty right now).  Then you also don't need to add fields to
BlockDriverState because you know that if your blockjob is running you
can access it via bs->job, if necessary.  You can then drop BackupInfo
and any memory management code for it.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-22 11:12 ` Stefan Hajnoczi
@ 2012-11-22 11:26   ` Dietmar Maurer
  2012-11-22 12:44     ` Stefan Hajnoczi
  2012-11-22 11:40   ` Dietmar Maurer
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-22 11:26 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, qemu-devel

> Did you look at the VMDK "Stream-Optimized Compressed" subformat?
> 
> http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?
> src=vmdk
> 
> It is a stream of compressed "grains" (data).  They are out-of-order and each
> grain comes with the virtual disk lba where the data should be visible to the
> guest.
> 

What kind of license is applied to that specification?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 2/5] add basic backup support to block driver
  2012-11-22 11:25   ` Stefan Hajnoczi
@ 2012-11-22 11:29     ` Dietmar Maurer
  2012-11-23  8:56     ` Dietmar Maurer
  1 sibling, 0 replies; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-22 11:29 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, qemu-devel

> Is there a 1:1 relationship between BackupInfo and BackupBlockJob?  Then it
> would be nicer to move BlockupInfo fields into BackupBlockJob (which is
> empty right now

No, a backup creates several block jobs - one for each blockdev you want to back up. Those
jobs run in parallel.


 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-22 11:12 ` Stefan Hajnoczi
  2012-11-22 11:26   ` Dietmar Maurer
@ 2012-11-22 11:40   ` Dietmar Maurer
  2012-11-22 15:42     ` Stefan Hajnoczi
  2012-11-22 12:00   ` Dietmar Maurer
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-22 11:40 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, qemu-devel

> Did you look at the VMDK "Stream-Optimized Compressed" subformat?
> 
> http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?
> src=vmdk

Max file size 2TB?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-22 11:12 ` Stefan Hajnoczi
  2012-11-22 11:26   ` Dietmar Maurer
  2012-11-22 11:40   ` Dietmar Maurer
@ 2012-11-22 12:00   ` Dietmar Maurer
  2012-11-22 15:45     ` Stefan Hajnoczi
  2012-11-22 12:03   ` Dietmar Maurer
  2012-11-22 17:16   ` Stefan Hajnoczi
  4 siblings, 1 reply; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-22 12:00 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, qemu-devel

> It is a stream of compressed "grains" (data).  They are out-of-order and each
> grain comes with the virtual disk lba where the data should be visible to the
> guest.
> 
> The stream also contains "grain tables" and "grain directories".  This
> metadata makes random read access to the file possible once you have
> downloaded the entire file (i.e. it is seekable).  Although tools can choose to
> consume the stream in sequential order too and ignore the metadata.
> 
> In other words, the format is an out-of-order stream of data chunks plus
> random access lookup tables at the end.
> 
> QEMU's block/vmdk.c already has some support for this format although I
> don't think we generate out-of-order yet.
> 
> The benefit of reusing this code is that existing tools can consume these files.

The compression format is hardcoded to RFC 1951 (deflate). I think this is a major disadvantage,
because it is really slow (compared to lzop).

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-22 11:12 ` Stefan Hajnoczi
                     ` (2 preceding siblings ...)
  2012-11-22 12:00   ` Dietmar Maurer
@ 2012-11-22 12:03   ` Dietmar Maurer
  2012-11-22 17:16   ` Stefan Hajnoczi
  4 siblings, 0 replies; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-22 12:03 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, qemu-devel

> Did you look at the VMDK "Stream-Optimized Compressed" subformat?
> 
> http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?
> src=vmdk

And is that covered by any patents?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-22 11:26   ` Dietmar Maurer
@ 2012-11-22 12:44     ` Stefan Hajnoczi
  2012-11-22 12:55       ` Dietmar Maurer
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Hajnoczi @ 2012-11-22 12:44 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: kwolf, qemu-devel

On Thu, Nov 22, 2012 at 11:26:21AM +0000, Dietmar Maurer wrote:
> > Did you look at the VMDK "Stream-Optimized Compressed" subformat?
> > 
> > http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?
> > src=vmdk
> > 
> > It is a stream of compressed "grains" (data).  They are out-of-order and each
> > grain comes with the virtual disk lba where the data should be visible to the
> > guest.
> > 
> 
> What kind of license is applied to that specification?

The document I linked came straight from Google Search and you don't
need to agree to anything to view it.  The document doesn't seem to
impose restrictions.  QEMU has supported the VMDK format and so have
other open source tools for a number of years.

For anything more specific you could search VMware's website and/or
check with a lawyer.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-22 12:44     ` Stefan Hajnoczi
@ 2012-11-22 12:55       ` Dietmar Maurer
  2012-11-22 15:30         ` Stefan Hajnoczi
  0 siblings, 1 reply; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-22 12:55 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, qemu-devel

> -----Original Message-----
> From: Stefan Hajnoczi [mailto:stefanha@gmail.com]
> Sent: Donnerstag, 22. November 2012 13:45
> To: Dietmar Maurer
> Cc: qemu-devel@nongnu.org; kwolf@redhat.com
> Subject: Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu
> (v1)
> 
> On Thu, Nov 22, 2012 at 11:26:21AM +0000, Dietmar Maurer wrote:
> > > Did you look at the VMDK "Stream-Optimized Compressed" subformat?
> > >
> > >
> http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?
> > > src=vmdk
> > >
> > > It is a stream of compressed "grains" (data).  They are out-of-order
> > > and each grain comes with the virtual disk lba where the data should
> > > be visible to the guest.
> > >
> >
> > What kind of license is applied to that specification?
> 
> The document I linked came straight from Google Search and you don't need
> to agree to anything to view it.  The document doesn't seem to impose
> restrictions.  QEMU has supported the VMDK format and so have other open
> source tools for a number of years.
> 
> For anything more specific you could search VMware's website and/or check
> with a lawyer.

The document says: VMware products are covered by one or more patents listed at http://www.vmware.com/go/patents

I simply do not have the time to check all those things, which makes that format unusable for me.

Anyway, thanks for the link.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-22 12:55       ` Dietmar Maurer
@ 2012-11-22 15:30         ` Stefan Hajnoczi
  2012-11-22 15:58           ` Dietmar Maurer
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Hajnoczi @ 2012-11-22 15:30 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: kwolf, qemu-devel

On Thu, Nov 22, 2012 at 1:55 PM, Dietmar Maurer <dietmar@proxmox.com> wrote:
>> -----Original Message-----
>> From: Stefan Hajnoczi [mailto:stefanha@gmail.com]
>> Sent: Thursday, 22 November 2012 13:45
>> To: Dietmar Maurer
>> Cc: qemu-devel@nongnu.org; kwolf@redhat.com
>> Subject: Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu
>> (v1)
>>
>> On Thu, Nov 22, 2012 at 11:26:21AM +0000, Dietmar Maurer wrote:
>> > > Did you look at the VMDK "Stream-Optimized Compressed" subformat?
>> > >
>> > >
>> http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?
>> > > src=vmdk
>> > >
>> > > It is a stream of compressed "grains" (data).  They are out-of-order
>> > > and each grain comes with the virtual disk lba where the data should
>> > > be visible to the guest.
>> > >
>> >
>> > What kind of license is applied to that specification?
>>
>> The document I linked came straight from Google Search and you don't need
>> to agree to anything to view it.  The document doesn't seem to impose
>> restrictions.  QEMU has supported the VMDK format and so have other open
>> source tools for a number of years.
>>
>> For anything more specific you could search VMware's website and/or check
>> with a lawyer.
>
> The document says: VMware products are covered by one or more patents listed at http://www.vmware.com/go/patents
>
> I simply do not have the time to check all those things, which makes that format unusable for me.

I think Proxmox ships the QEMU vmdk functionality today?  In that
case you should check this :).

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-22 11:40   ` Dietmar Maurer
@ 2012-11-22 15:42     ` Stefan Hajnoczi
  0 siblings, 0 replies; 73+ messages in thread
From: Stefan Hajnoczi @ 2012-11-22 15:42 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: kwolf, qemu-devel

On Thu, Nov 22, 2012 at 12:40 PM, Dietmar Maurer <dietmar@proxmox.com> wrote:
>> Did you look at the VMDK "Stream-Optimized Compressed" subformat?
>>
>> http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?
>> src=vmdk
>
> Max file size 2TB?

That is for a single .vmdk file.  But vmdks can also be split across
multiple files ("extents") so you can get more than 2TB.

QEMU has some support for reading vmdks with extents, but I think we
never create files like this today.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-22 12:00   ` Dietmar Maurer
@ 2012-11-22 15:45     ` Stefan Hajnoczi
  2012-11-22 15:56       ` Dietmar Maurer
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Hajnoczi @ 2012-11-22 15:45 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: kwolf, qemu-devel

On Thu, Nov 22, 2012 at 1:00 PM, Dietmar Maurer <dietmar@proxmox.com> wrote:
>> It is a stream of compressed "grains" (data).  They are out-of-order and each
>> grain comes with the virtual disk lba where the data should be visible to the
>> guest.
>>
>> The stream also contains "grain tables" and "grain directories".  This
>> metadata makes random read access to the file possible once you have
>> downloaded the entire file (i.e. it is seekable).  Although tools can choose to
>> consume the stream in sequential order too and ignore the metadata.
>>
>> In other words, the format is an out-of-order stream of data chunks plus
>> random access lookup tables at the end.
>>
>> QEMU's block/vmdk.c already has some support for this format although I
>> don't think we generate out-of-order yet.
>>
>> The benefit of reusing this code is that existing tools can consume these files.
>
> The compression format is hardcoded to RFC 1951 (deflate). I think this is a major disadvantage,
> because it is really slow (compared to lzop).

It's a naughty thing to do but we could simply pick a new constant and
support LZO as an incompatible option.  The file is then no longer
compatible with existing vmdk tools but at least we then have a choice
of using compatible deflate or the LZO extension.
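
At the call level the difference looks roughly like this (a minimal
sketch, assuming zlib and liblzo2 and that lzo_init() ran at startup;
the 64 KiB grain size and buffer sizes are example values):

    /* Minimal sketch: compressing one grain with deflate vs. LZO.
     * Error handling omitted. */
    #include <zlib.h>
    #include <lzo/lzo1x.h>

    #define GRAIN_SIZE (64 * 1024)    /* example value */

    static void compress_grain_both_ways(const unsigned char *grain)
    {
        static unsigned char out[GRAIN_SIZE + GRAIN_SIZE / 16 + 128];

        /* deflate via zlib; compress2() emits a zlib wrapper, a raw
         * RFC 1951 stream would use deflateInit2() with negative
         * windowBits - shown this way for brevity */
        uLongf zlen = sizeof(out);
        compress2(out, &zlen, grain, GRAIN_SIZE, Z_DEFAULT_COMPRESSION);

        /* LZO1X: much faster, but would need a new (incompatible)
         * compression constant in the file header */
        static unsigned char wrkmem[LZO1X_1_MEM_COMPRESS];
        lzo_uint lzolen = sizeof(out);
        lzo1x_1_compress(grain, GRAIN_SIZE, out, &lzolen, wrkmem);
    }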

VMDK already has 99% of what you need and we already have a bunch of
code to handle this format.  This seems like a good opportunity to
flesh out VMDK support and avoid reinventing the wheel.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-22 15:45     ` Stefan Hajnoczi
@ 2012-11-22 15:56       ` Dietmar Maurer
  2012-11-22 16:37         ` Stefan Hajnoczi
  0 siblings, 1 reply; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-22 15:56 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, qemu-devel

> It's a naughty thing to do but we could simply pick a new constant and
> support LZO as an incompatible option.  The file is then no longer compatible
> with existing vmdk tools but at least we then have a choice of using
> compatible deflate or the LZO extension.

To be 100% incompatible with existing tools? That would remove any advantage.
 
> VMDK already has 99% of what you need and we already have a bunch of
> code to handle this format.  This seems like a good opportunity to flesh out
> VMDK support and avoid reinventing the wheel.

Using 'probably' patented software is a bad idea - I will not go that way.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-22 15:30         ` Stefan Hajnoczi
@ 2012-11-22 15:58           ` Dietmar Maurer
  2012-11-22 17:02             ` Stefan Hajnoczi
  0 siblings, 1 reply; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-22 15:58 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, qemu-devel

> > The document says: VMware products are covered by one or more
> patents
> > listed at http://www.vmware.com/go/patents
> >
> > I simply do not have the time to check all those things, which makes that
> format unusable for me.
> 
> I think Proxmox ships the QEMU vmdk functionality today?  In that case you
> should check this :).

Well, and also remove it from the qemu repository? Such things are not compatible with the GPL?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-22 15:56       ` Dietmar Maurer
@ 2012-11-22 16:37         ` Stefan Hajnoczi
  0 siblings, 0 replies; 73+ messages in thread
From: Stefan Hajnoczi @ 2012-11-22 16:37 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: kwolf, qemu-devel

On Thu, Nov 22, 2012 at 4:56 PM, Dietmar Maurer <dietmar@proxmox.com> wrote:
>> It's a naughty thing to do but we could simply pick a new constant and
>> support LZO as an incompatible option.  The file is then no longer compatible
>> with existing vmdk tools but at least we then have a choice of using
>> compatible deflate or the LZO extension.
>
> To be 100% incompatible with existing tools? That would remove any advantage.

No, it should be an option.  Users who care about compatibility can use deflate.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-22 15:58           ` Dietmar Maurer
@ 2012-11-22 17:02             ` Stefan Hajnoczi
  2012-11-22 17:34               ` Dietmar Maurer
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Hajnoczi @ 2012-11-22 17:02 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: kwolf, qemu-devel

On Thu, Nov 22, 2012 at 4:58 PM, Dietmar Maurer <dietmar@proxmox.com> wrote:
>> > The document says: VMware products are covered by one or more
>> patents
>> > listed at http://www.vmware.com/go/patents
>> >
>> > I simply do not have the time to check all those things, which makes that
>> format unusable for me.
>>
>> I think Proxmox ships the QEMU vmdk functionality today?  In that case you
>> should check this :).
>
> Well, and also remove it from the qemu repository? Such things are not compatible with the GPL?

If you are really concerned about this then submit a patch to add
./configure --disable-vmdk, ship QEMU without VMDK, and drop it from
your documentation/wiki.

If you're not really concerned, then let's accept that VMDK support is okay.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-22 11:12 ` Stefan Hajnoczi
                     ` (3 preceding siblings ...)
  2012-11-22 12:03   ` Dietmar Maurer
@ 2012-11-22 17:16   ` Stefan Hajnoczi
  2012-11-22 17:46     ` Dietmar Maurer
                       ` (3 more replies)
  4 siblings, 4 replies; 73+ messages in thread
From: Stefan Hajnoczi @ 2012-11-22 17:16 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: Kevin Wolf, qemu-devel

On Thu, Nov 22, 2012 at 12:12 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Wed, Nov 21, 2012 at 10:01:00AM +0100, Dietmar Maurer wrote:
>> +==Disadvantages==
>> +
>> +* we need to define a new archive format
>> +
>> +Note: Most existing archive formats are optimized to store small files
>> +including file attributes. We simply do not need that for VM archives.
>
> Did you look at the VMDK "Stream-Optimized Compressed" subformat?

We've gone down several sub-threads discussing whether VMDK is
suitable.  I want to summarize why this is a good approach:

The VMDK format already allows for out-of-order data and is supported
by existing tools - this is very important for backups where people
are (rightfully) paranoid about putting their backups in an obscure
format.  They want to be able to access their data years later,
whether your tool is still around or not.

QEMU's implementation has partial support for Stream-Optimized
Compressed images.  If you complete the code for this subformat, not
only does this benefit the VM Backup feature, but it also makes
qemu-img convert more powerful for everyone.  I hope we can kill two
birds with one stone here.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-22 17:02             ` Stefan Hajnoczi
@ 2012-11-22 17:34               ` Dietmar Maurer
  0 siblings, 0 replies; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-22 17:34 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, qemu-devel



> -----Original Message-----
> From: Stefan Hajnoczi [mailto:stefanha@gmail.com]
> Sent: Thursday, 22 November 2012 18:02
> To: Dietmar Maurer
> Cc: kwolf@redhat.com; qemu-devel@nongnu.org
> Subject: Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu
> (v1)
> 
> On Thu, Nov 22, 2012 at 4:58 PM, Dietmar Maurer <dietmar@proxmox.com>
> wrote:
> >> > The document says: VMware products are covered by one or more
> >> patents
> >> > listed at http://www.vmware.com/go/patents
> >> >
> >> > I simply do not have the time to check all those things, which makes
> >> > that
> >> format unusable for me.
> >>
> >> I think Proxmox ships the QEMU vmdk functionality today?  In that
> >> case you should check this :).
> >
> > Well, and also remove it from the qemu repository? Such things are not
> compatible with the GPL?
> 
> If you are really concerned about this then submit a patch to add ./configure
> --disable-vmdk, ship QEMU without VMDK, and drop it from your
> documentation/wiki.
> 
> If you're not really concerned, then let's accept that VMDK support is okay.

I simply don't want to waste time on something with unclear license issues.
So I will not work on such a format.

Don't get me wrong - that is just my personal opinion.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-22 17:16   ` Stefan Hajnoczi
@ 2012-11-22 17:46     ` Dietmar Maurer
  2012-11-23  5:23       ` Stefan Hajnoczi
  2012-11-22 17:50     ` Dietmar Maurer
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-22 17:46 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, qemu-devel

> > Did you look at the VMDK "Stream-Optimized Compressed" subformat?
> 
> We've gone down several sub-threads discussing whether VMDK is suitable.
> I want to summarize why this is a good approach:
> 
> The VMDK format already allows for out-of-order data and is supported by
> existing tools - this is very important for backups where people are
> (rightfully) paranoid about putting their backups in an obscure format.  They
> want to be able to access their data years later, whether your tool is still
> around or not.

The VMDK format has strong disadvantages:

- unclear License (the spec links to patents)
- they use a very slow compression algorithm (deflate), which makes it unusable for backup 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-22 17:16   ` Stefan Hajnoczi
  2012-11-22 17:46     ` Dietmar Maurer
@ 2012-11-22 17:50     ` Dietmar Maurer
  2012-11-23  5:21       ` Stefan Hajnoczi
  2012-11-22 18:05     ` Dietmar Maurer
  2012-11-22 18:15     ` Dietmar Maurer
  3 siblings, 1 reply; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-22 17:50 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, qemu-devel

> The VMDK format has strong disadvantages:
> 
> - unclear License (the spec links to patents)
> - they use a very slow compression algorithm (deflate), which makes it
> unusable for backup

Seems they do not support multiple configuration files. You can only have
a single text block, and that needs to contain vmware-specific info.
So where do I add my qemu-related config?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-22 17:16   ` Stefan Hajnoczi
  2012-11-22 17:46     ` Dietmar Maurer
  2012-11-22 17:50     ` Dietmar Maurer
@ 2012-11-22 18:05     ` Dietmar Maurer
  2012-11-23  5:19       ` Stefan Hajnoczi
  2012-11-22 18:15     ` Dietmar Maurer
  3 siblings, 1 reply; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-22 18:05 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, qemu-devel

> QEMU's implementation has partial support for Stream-Optimized
> Compressed images.  If you complete the code for this subformat, not only
> does this benefit the VM Backup feature, but it also makes qemu-img convert
> more powerful for everyone.  I hope we can kill two birds with one stone

The doc contains the following link:

http://www.vmware.com/download/patents.html

I simply have no idea how to check all those patents. How can someone tell
that they do not cover things in the specs? I am really curious.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-22 17:16   ` Stefan Hajnoczi
                       ` (2 preceding siblings ...)
  2012-11-22 18:05     ` Dietmar Maurer
@ 2012-11-22 18:15     ` Dietmar Maurer
  3 siblings, 0 replies; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-22 18:15 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, qemu-devel

> The VMDK format already allows for out-of-order data and is supported by
> existing tools - this is very important for backups where people are
> (rightfully) paranoid about putting their backups in an obscure format.  They
> want to be able to access their data years later, whether your tool is still
> around or not.

Anything we add to the qemu source fulfills those properties. Or do you
really think qemu will disappear soon?

Besides, the VMA format is much simpler than the vmdk format. Thus I consider
it safer (and not 'obscure').

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-22 18:05     ` Dietmar Maurer
@ 2012-11-23  5:19       ` Stefan Hajnoczi
  2012-11-23  6:05         ` Dietmar Maurer
  0 siblings, 1 reply; 73+ messages in thread
From: Stefan Hajnoczi @ 2012-11-23  5:19 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: Kevin Wolf, qemu-devel

On Thu, Nov 22, 2012 at 7:05 PM, Dietmar Maurer <dietmar@proxmox.com> wrote:
>> QEMU's implementation has partial support for Stream-Optimized
>> Compressed images.  If you complete the code for this subformat, not only
>> does this benefit the VM Backup feature, but it also makes qemu-img convert
>> more powerful for everyone.  I hope we can kill two birds with one stone
>
> The doc contains the following link:
>
> http://www.vmware.com/download/patents.html
>
> I simply have no idea how to check all those patents. How can someone tell
> that they do not cover things in the specs? I am really curious.

If you want to investigate it then you would look at each one or use
search engines to make it easier (skip all the non-disk image related
patents).

But keep in mind that any other company out there could have a patent
on out-of-order data in an image file or other aspects of what you're
proposing.  Reinventing the wheel may not stop you from infringing on
their patents so the fact that VMware may or may not have patents
doesn't change things if you're really trying to find possible issues.

This is why SQLite has a policy of only using algorithms that are
older than 17 years, see comment by drh:
http://www.sqlite.org/cvstrac/wiki?p=BlueSky

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-22 17:50     ` Dietmar Maurer
@ 2012-11-23  5:21       ` Stefan Hajnoczi
  0 siblings, 0 replies; 73+ messages in thread
From: Stefan Hajnoczi @ 2012-11-23  5:21 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: Kevin Wolf, qemu-devel

On Thu, Nov 22, 2012 at 6:50 PM, Dietmar Maurer <dietmar@proxmox.com> wrote:
>> The VMDK format has strong disadvantages:
>>
>> - unclear License (the spec links to patents)
>> - they use a very slow compression algorithm (deflate), which makes it
>> unusable for backup
>
> Seems they do not support multiple configuration files. You can only have
> a single text block, and that needs to contain vmware-specific info.
> So where do I add my qemu-related config?

This is true.  QEMU uses a VMDK as a single disk image.  To handle
multiple disks there would need to be multiple images plus a vmstate
or config file.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-22 17:46     ` Dietmar Maurer
@ 2012-11-23  5:23       ` Stefan Hajnoczi
  2012-11-23  5:25         ` Stefan Hajnoczi
  2012-11-23  6:13         ` Dietmar Maurer
  0 siblings, 2 replies; 73+ messages in thread
From: Stefan Hajnoczi @ 2012-11-23  5:23 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: Kevin Wolf, qemu-devel

On Thu, Nov 22, 2012 at 6:46 PM, Dietmar Maurer <dietmar@proxmox.com> wrote:
>> > Did you look at the VMDK "Stream-Optimized Compressed" subformat?
>>
>> We've gone down several sub-threads discussing whether VMDK is suitable.
>> I want to summarize why this is a good approach:
>>
>> The VMDK format already allows for out-of-order data and is supported by
>> existing tools - this is very important for backups where people are
>> (rightfully) paranoid about putting their backups in an obscure format.  They
>> want to be able to access their data years later, whether your tool is still
>> around or not.
>
> The VMDK format has strong disadvantages:
>
> - unclear License (the spec links to patents)

I've already pointed out that you're taking an inconsistent position
on this point.  It's FUD.

> - they use a very slow compression algorithm (deflate), which makes it unusable for backup

I've already pointed out that we can optionally support other algorithms.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-23  5:23       ` Stefan Hajnoczi
@ 2012-11-23  5:25         ` Stefan Hajnoczi
  2012-11-23  6:18           ` Dietmar Maurer
  2012-11-23  6:13         ` Dietmar Maurer
  1 sibling, 1 reply; 73+ messages in thread
From: Stefan Hajnoczi @ 2012-11-23  5:25 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: Kevin Wolf, qemu-devel

On Fri, Nov 23, 2012 at 6:23 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Thu, Nov 22, 2012 at 6:46 PM, Dietmar Maurer <dietmar@proxmox.com> wrote:
>>> > Did you look at the VMDK "Stream-Optimized Compressed" subformat?
>>>
>>> We've gone down several sub-threads discussing whether VMDK is suitable.
>>> I want to summarize why this is a good approach:
>>>
>>> The VMDK format already allows for out-of-order data and is supported by
>>> existing tools - this is very important for backups where people are
>>> (rightfully) paranoid about putting their backups in an obscure format.  They
>>> want to be able to access their data years later, whether your tool is still
>>> around or not.
>>
>> The VMDK format has strong disadvantages:
>>
>> - unclear License (the spec links to patents)
>
> I've already pointed out that you're taking an inconsistent position
> on this point.  It's FUD.
>
>> - they use a very slow compression algorithm (deflate), which makes it unusable for backup
>
> I've already pointed out that we can optionally support other algorithms.

To make progress here I'll review the RFC patches.  VMDK or not isn't
the main thing, a backup feature like this looks interesting.

Stefan

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-23  5:19       ` Stefan Hajnoczi
@ 2012-11-23  6:05         ` Dietmar Maurer
  0 siblings, 0 replies; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-23  6:05 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, qemu-devel

> But keep in mind that any other company out there could have a patent on
> out-of-order data in an image file or other aspects of what you're proposing.

Sorry, but the vmware docs explicitly include a pointer to those patents. So this
is something completely different to me.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-23  5:23       ` Stefan Hajnoczi
  2012-11-23  5:25         ` Stefan Hajnoczi
@ 2012-11-23  6:13         ` Dietmar Maurer
  1 sibling, 0 replies; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-23  6:13 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, qemu-devel

> > The VMDK format has strong disadvantages:
> >
> > - unclear License (the spec links to patents)
> 
> I've already pointed out that you're taking an inconsistent position on this
> point.  It's FUD.
> 
> > - they use a very slow compression algorithm (deflate), which makes it
> > unusable for backup
> 
> I've already pointed out that we can optionally support other algorithms.

Well, I guess we both pointed out our opinions. 

I will try to implement some kind of plugin architecture for backup formats.
That way we can implement/support more than one format.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-23  5:25         ` Stefan Hajnoczi
@ 2012-11-23  6:18           ` Dietmar Maurer
  0 siblings, 0 replies; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-23  6:18 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, qemu-devel

> To make progress here I'll review the RFC patches.  VMDK or not isn't the
> main thing, a backup feature like this looks interesting.

Yes, a 'review' would be great - thanks.

- Dietmar

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-21 12:37     ` Kevin Wolf
  2012-11-21 13:23       ` Paolo Bonzini
  2012-11-21 13:25       ` Dietmar Maurer
@ 2012-11-23  7:38       ` Dietmar Maurer
  2012-11-23  9:08         ` Kevin Wolf
  2 siblings, 1 reply; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-23  7:38 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, qemu-devel

> In short, the idea is that you can stick filters on top of a BlockDriverState, so
> that any read/writes (and possibly more requests, if necessary) are routed
> through the filter before they are passed to the block driver of this BDS.
> Filters would be implemented as BlockDrivers, i.e. you could implement
> .bdrv_co_write() in a filter to intercept all writes to an image.

I am quite unsure if that makes things easier.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-21 13:23       ` Paolo Bonzini
@ 2012-11-23  7:42         ` Dietmar Maurer
  2012-11-23  9:18           ` Paolo Bonzini
  2012-11-23  8:12         ` Dietmar Maurer
  1 sibling, 1 reply; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-23  7:42 UTC (permalink / raw)
  To: Paolo Bonzini, Kevin Wolf; +Cc: qemu-devel

> > Yup, it's already not too bad. I haven't looked into it in much
> > detail, but I'd like to reduce it even a bit more. In particular, the
> > backup_info field in the BlockDriverState feels wrong to me. In the
> > long term the generic block layer shouldn't know at all what a backup
> > is, and baking it into BDS couples it very tightly.
> 
> My plan was to have something like bs->job->job_type->{before,after}_write.
> 
>    int coroutine_fn (*before_write)(BlockDriverState *bs,
>         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>         void **cookie);
>    int coroutine_fn (*after_write)(BlockDriverState *bs,
>         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>         void *cookie);
> 
> 
> The before_write could optionally return a "cookie" that is passed back to
> the after_write callback.

I don't really understand why a filter is related to the job. This is sometimes useful,
but not a generic filter infrastructure (maybe someone wants to use filters without a job).
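
To make that concrete: with the proposed signature, a backup hook could
look roughly like the sketch below. This is hypothetical code, and
backup_archive_write() is an invented placeholder for whatever writes to
the archive.

    /* Hypothetical sketch of a backup 'before_write' hook: read the old
     * data before it is overwritten and push it to the archive. */
    static int coroutine_fn backup_before_write(BlockDriverState *bs,
            int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
            void **cookie)
    {
        size_t len = nb_sectors * BDRV_SECTOR_SIZE;
        void *buf = qemu_blockalign(bs, len);
        QEMUIOVector old;
        int ret;

        qemu_iovec_init(&old, 1);
        qemu_iovec_add(&old, buf, len);

        /* read the data that is about to be overwritten */
        ret = bdrv_co_readv(bs, sector_num, nb_sectors, &old);
        if (ret == 0) {
            ret = backup_archive_write(sector_num, buf, nb_sectors);
        }

        qemu_iovec_destroy(&old);
        qemu_vfree(buf);
        *cookie = NULL;    /* nothing for after_write in this sketch */
        return ret;
    }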

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-21 13:23       ` Paolo Bonzini
  2012-11-23  7:42         ` Dietmar Maurer
@ 2012-11-23  8:12         ` Dietmar Maurer
  2012-11-23  9:01           ` Dietmar Maurer
  1 sibling, 1 reply; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-23  8:12 UTC (permalink / raw)
  To: Paolo Bonzini, Kevin Wolf; +Cc: qemu-devel

> > Yup, it's already not too bad. I haven't looked into it in much
> > detail, but I'd like to reduce it even a bit more. In particular, the
> > backup_info field in the BlockDriverState feels wrong to me. In the
> > long term the generic block layer shouldn't know at all what a backup
> > is, and baking it into BDS couples it very tightly.
> 
> My plan was to have something like bs->job->job_type->{before,after}_write.
> 
>    int coroutine_fn (*before_write)(BlockDriverState *bs,
>         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>         void **cookie);
>    int coroutine_fn (*after_write)(BlockDriverState *bs,
>         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>         void *cookie);

I don't think that job is the right place. Instead I would put a list 
of filters into BDS:

typedef struct BlockFilter {
    void *opaque;
    int cluster_size;
    int coroutine_fn (*before_read)(BlockDriverState *bs,
        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
        BdrvRequestFlags flags, void **cookie);
    int coroutine_fn (*after_read)(BlockDriverState *bs,
        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
        BdrvRequestFlags flags, void *cookie);
    int coroutine_fn (*before_write)(BlockDriverState *bs,
        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
        void **cookie);
    int coroutine_fn (*after_write)(BlockDriverState *bs,
        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
        void *cookie);
    QLIST_ENTRY(BlockFilter) next;  /* linkage for the bs->filters list */
} BlockFilter;

struct BlockDriverState {
   ...
    QLIST_HEAD(, BlockFilter) filters;
};

Would that work for you?
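
For illustration, the write path could then walk that list roughly like
this (a hypothetical sketch against the struct above):

    /* Hypothetical sketch: how bdrv_co_do_writev() could walk the filter
     * list around the driver's write. One shared cookie is a simplification;
     * a real implementation would keep one cookie per filter. */
    static int coroutine_fn filtered_writev(BlockDriverState *bs,
            int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
    {
        BlockFilter *filter;
        void *cookie = NULL;
        int ret;

        QLIST_FOREACH(filter, &bs->filters, next) {
            if (filter->before_write) {
                filter->before_write(bs, sector_num, nb_sectors, qiov, &cookie);
            }
        }

        ret = bs->drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);

        QLIST_FOREACH(filter, &bs->filters, next) {
            if (filter->after_write) {
                filter->after_write(bs, sector_num, nb_sectors, qiov, cookie);
            }
        }
        return ret;
    }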

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 2/5] add basic backup support to block driver
  2012-11-22 11:25   ` Stefan Hajnoczi
  2012-11-22 11:29     ` Dietmar Maurer
@ 2012-11-23  8:56     ` Dietmar Maurer
  1 sibling, 0 replies; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-23  8:56 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, qemu-devel

> Is there a 1:1 relationship between BackupInfo and BackupBlockJob?  Then it
> would be nicer to move BackupInfo fields into BackupBlockJob (which is
> empty right now).  Then you also don't need to add fields to BlockDriverState
> because you know that if your blockjob is running you can access it via
> bs->job, if necessary.  You can then drop BackupInfo and any memory
> management code for it.

Oh, finally got your point.

There is an ongoing discussion about generic block filters, so I guess this totally
depends on how we implement them.


- Dietmar

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-23  8:12         ` Dietmar Maurer
@ 2012-11-23  9:01           ` Dietmar Maurer
  2012-11-23  9:05             ` Dietmar Maurer
  0 siblings, 1 reply; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-23  9:01 UTC (permalink / raw)
  To: Dietmar Maurer, Paolo Bonzini, Kevin Wolf; +Cc: qemu-devel

> > My plan was to have something like bs->job->job_type-
> >{before,after}_write.
> >
> >    int coroutine_fn (*before_write)(BlockDriverState *bs,
> >         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> >         void **cookie);
> >    int coroutine_fn (*after_write)(BlockDriverState *bs,
> >         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> >         void *cookie);
> 
> I don't think that job is the right place. Instead I would put a list of filters into
> BDS:

Well, I can also add it to job_type. Just tell me what you prefer, and I will write the patch.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-23  9:01           ` Dietmar Maurer
@ 2012-11-23  9:05             ` Dietmar Maurer
  2012-11-23  9:15               ` Paolo Bonzini
  2012-11-23  9:55               ` Kevin Wolf
  0 siblings, 2 replies; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-23  9:05 UTC (permalink / raw)
  To: Paolo Bonzini, Kevin Wolf; +Cc: qemu-devel

> > > My plan was to have something like bs->job->job_type-
> > >{before,after}_write.
> > >
> > >    int coroutine_fn (*before_write)(BlockDriverState *bs,
> > >         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> > >         void **cookie);
> > >    int coroutine_fn (*after_write)(BlockDriverState *bs,
> > >         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> > >         void *cookie);
> >
> > I don't think that job is the right place. Instead I would put a list
> > of filters into
> > BDS:
> 
> Well, I can also add it to job_type. Just tell me what you prefer, and I will
> write the patch.

BTW, will such filters work with the new virtio-blk-data-plane?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-23  7:38       ` Dietmar Maurer
@ 2012-11-23  9:08         ` Kevin Wolf
  2012-11-23  9:21           ` Dietmar Maurer
  0 siblings, 1 reply; 73+ messages in thread
From: Kevin Wolf @ 2012-11-23  9:08 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: Paolo Bonzini, qemu-devel

On 23.11.2012 08:38, Dietmar Maurer wrote:
>> In short, the idea is that you can stick filters on top of a BlockDriverState, so
>> that any read/writes (and possibly more requests, if necessary) are routed
>> through the filter before they are passed to the block driver of this BDS.
>> Filters would be implemented as BlockDrivers, i.e. you could implement
>> .bdrv_co_write() in a filter to intercept all writes to an image.
> 
> I am quite unsure if that makes things easier.

At least it would make for a much cleaner design compared to putting
code for every feature you can think of into bdrv_co_do_readv/writev().

Kevin

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-23  9:05             ` Dietmar Maurer
@ 2012-11-23  9:15               ` Paolo Bonzini
  2012-11-23  9:17                 ` Dietmar Maurer
  2012-11-23  9:55               ` Kevin Wolf
  1 sibling, 1 reply; 73+ messages in thread
From: Paolo Bonzini @ 2012-11-23  9:15 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: Kevin Wolf, qemu-devel

On 23/11/2012 10:05, Dietmar Maurer wrote:
>>>> My plan was to have something like bs->job->job_type-
>>>> {before,after}_write.
>>>>
>>>>    int coroutine_fn (*before_write)(BlockDriverState *bs,
>>>>         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>>>>         void **cookie);
>>>>    int coroutine_fn (*after_write)(BlockDriverState *bs,
>>>>         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>>>>         void *cookie);
>>>
>>> I don't think that job is the right place. Instead I would put a list
>>> of filters into
>>> BDS:
>>
>> Well, I can also add it to job_type. Just tell me what you prefer, and I will
>> write the patch.
> 
> BTW, will such filters work with the new virtio-blk-data-plane?

No, virtio-blk-data-plane is a hack and will be slowly rewritten to
support all fancy features.

Paolo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-23  9:15               ` Paolo Bonzini
@ 2012-11-23  9:17                 ` Dietmar Maurer
  0 siblings, 0 replies; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-23  9:17 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kevin Wolf, qemu-devel

> > BTW, will such filters work with the new virtio-blk-data-plane?
> 
> No, virtio-blk-data-plane is a hack and will be slowly rewritten to support all
> fancy features.

Ah, good to know ;-) thanks.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-23  7:42         ` Dietmar Maurer
@ 2012-11-23  9:18           ` Paolo Bonzini
  2012-11-23  9:28             ` Dietmar Maurer
  0 siblings, 1 reply; 73+ messages in thread
From: Paolo Bonzini @ 2012-11-23  9:18 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: Kevin Wolf, qemu-devel

On 23/11/2012 08:42, Dietmar Maurer wrote:
>> > My plan was to have something like bs->job->job_type->{before,after}_write.
>> > 
>> >    int coroutine_fn (*before_write)(BlockDriverState *bs,
>> >         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>> >         void **cookie);
>> >    int coroutine_fn (*after_write)(BlockDriverState *bs,
>> >         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>> >         void *cookie);
>> > 
>> > 
>> > The before_write could optionally return a "cookie" that is passed back to
>> > the after_write callback.
> I don't really understand why a filter is related to the job. This is sometimes useful,
> but not a generic filter infrastructure (maybe someone wants to use filters without a job).

See the part you snipped:

Actually this was plan B, as a poor man's implementation of the filter
infrastructure.  Plan A was that the block filters would materialize
suddenly in someone's git tree.

Paolo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-23  9:08         ` Kevin Wolf
@ 2012-11-23  9:21           ` Dietmar Maurer
  2012-11-23  9:31             ` Dietmar Maurer
  0 siblings, 1 reply; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-23  9:21 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, qemu-devel

> >> Filters would be implemented as BlockDrivers, i.e. you could
> >> implement
> >> .bdrv_co_write() in a filter to intercept all writes to an image.
> >
> > I am quite unsure if that makes things easier.
> 
> At least it would make for a much cleaner design compared to putting code
> for every feature you can think of into bdrv_co_do_readv/writev().

So if you want to add a filter, you simply modify bs->drv to point to the filter?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-23  9:18           ` Paolo Bonzini
@ 2012-11-23  9:28             ` Dietmar Maurer
  0 siblings, 0 replies; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-23  9:28 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kevin Wolf, qemu-devel

> Actually this was plan B, as a poor-man implementation of the filter
> infrastructure.  Plan A was that the block filters would materialize suddenly in
> someone's git tree.

OK, so let us summarize the options:

a.) wait until it materializes suddenly in someone's git tree.
b.) add BlockFilter inside BDS
c.) add filter callbacks to block jobs (job_type)
d.) use BlockDriver as filter
e.) use the current BackupInfo unless filters materialize suddenly in someone's git tree.

more ideas?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-23  9:21           ` Dietmar Maurer
@ 2012-11-23  9:31             ` Dietmar Maurer
  2012-11-23 10:29               ` Kevin Wolf
  0 siblings, 1 reply; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-23  9:31 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, qemu-devel

> > >> Filters would be implemented as BlockDrivers, i.e. you could
> > >> implement
> > >> .bdrv_co_write() in a filter to intercept all writes to an image.
> > >
> > > I am quite unsure if that makes things easier.
> >
> > At least it would make for a much cleaner design compared to putting
> > code for every feature you can think of into bdrv_co_do_readv/writev().
> 
> So if you want to add a filter, you simply modify bs->drv to point to the filter?

Seems the BlockDriver struct does not contain any 'state' (I guess that is by design),
so where do you store filter related dynamic data?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-23  9:05             ` Dietmar Maurer
  2012-11-23  9:15               ` Paolo Bonzini
@ 2012-11-23  9:55               ` Kevin Wolf
  2012-11-23 10:55                 ` Markus Armbruster
  1 sibling, 1 reply; 73+ messages in thread
From: Kevin Wolf @ 2012-11-23  9:55 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: Paolo Bonzini, qemu-devel

On 23.11.2012 10:05, Dietmar Maurer wrote:
>>>> My plan was to have something like bs->job->job_type-
>>>> {before,after}_write.
>>>>
>>>>    int coroutine_fn (*before_write)(BlockDriverState *bs,
>>>>         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>>>>         void **cookie);
>>>>    int coroutine_fn (*after_write)(BlockDriverState *bs,
>>>>         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>>>>         void *cookie);
>>>
>>> I don't think that job is the right place. Instead I would put a list
>>> of filters into
>>> BDS:
>>
>> Well, I can also add it to job_type. Just tell me what you prefer, and I will
>> write the patch.

A block filter shouldn't be tied to a job, I think. We have things like
blkdebug that are really filters and aren't coupled with a job, and on
the other hand we want to generalise "block jobs" into just "jobs", so
adding block specific things to job_type would be a step in the wrong
direction.

I also think that before_write/after_write isn't a convenient interface,
it brings back much of the callback-based AIO cruft and passing void*
isn't nice anyway. It's much nicer to have a single .bdrv_co_write
callback that somewhere in the middle calls the layer below with a
simple function call.

Also read/write aren't enough, for a full filter interface you
potentially also need flush, discard and probably most other operations.

This is why I suggested using a regular BlockDriver struct for filters,
it already has all functions that are needed.
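
For illustration, a minimal pass-through filter in that style might look
like this (a rough sketch only; the driver name is invented):

    /* Rough sketch: a pass-through filter as a regular BlockDriver,
     * in the style of blkdebug. bs->file is the layer being filtered. */
    static int coroutine_fn filter_co_writev(BlockDriverState *bs,
            int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
    {
        /* a real filter would inspect or copy the data here */
        return bdrv_co_writev(bs->file, sector_num, nb_sectors, qiov);
    }

    static BlockDriver bdrv_filter_example = {
        .format_name    = "filter-example",    /* invented name */
        .bdrv_co_writev = filter_co_writev,
        /* flush, discard etc. would be forwarded the same way */
    };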

> BTW, will such filters work with the new virtio-blk-data-plane?

Not initially, but I think as soon as data plane gets support for image
formats, filters would work as well.

Kevin

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-23  9:31             ` Dietmar Maurer
@ 2012-11-23 10:29               ` Kevin Wolf
  2012-11-26  5:51                 ` Dietmar Maurer
  0 siblings, 1 reply; 73+ messages in thread
From: Kevin Wolf @ 2012-11-23 10:29 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: Paolo Bonzini, qemu-devel

On 23.11.2012 10:31, Dietmar Maurer wrote:
>>>>> Filters would be implemented as BlockDrivers, i.e. you could
>>>>> implement
>>>>> .bdrv_co_write() in a filter to intercept all writes to an image.
>>>>
>>>> I am quite unsure if that makes things easier.
>>>
>>> At least it would make for a much cleaner design compared to putting
>>> code for every feature you can think of into bdrv_co_do_readv/writev().
>>
>> So if you want to add a filter, you simply modify bs->drv to point to the filter?
> 
> Seems the BlockDriver struct does not contain any 'state' (I guess that is by design),
> so where do you store filter related dynamic data?

You wouldn't change bs->drv of the block device, you still need that one
after having processed the data in the filter.

Instead, you'd have some BlockDriverState *first_filter in bs to which
requests are forwarded. first_filter->file would point to either the
next filter or if there are no more filters to the real BlockDriverState.

Which raises the question of how to distinguish whether it's a new
request to bs that must go through the filters or whether it actually
comes from the last filter in the chain. As you can see, we don't have a
well thought out plan yet, just rough ideas (otherwise it would probably
be implemented already).

Kevin

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-23  9:55               ` Kevin Wolf
@ 2012-11-23 10:55                 ` Markus Armbruster
  0 siblings, 0 replies; 73+ messages in thread
From: Markus Armbruster @ 2012-11-23 10:55 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, Dietmar Maurer, qemu-devel

Kevin Wolf <kwolf@redhat.com> writes:

> On 23.11.2012 10:05, Dietmar Maurer wrote:
>>>>> My plan was to have something like bs->job->job_type-
>>>>> {before,after}_write.
>>>>>
>>>>>    int coroutine_fn (*before_write)(BlockDriverState *bs,
>>>>>         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>>>>>         void **cookie);
>>>>>    int coroutine_fn (*after_write)(BlockDriverState *bs,
>>>>>         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>>>>>         void *cookie);
>>>>
>>>> I don't think that job is the right place. Instead I would put a list
>>>> of filters into
>>>> BDS:
>>>
>>> Well, I can also add it to job_type. Just tell me what you prefer, and I will
>>> write the patch.
>
> A block filter shouldn't be tied to a job, I think. We have things like
> blkdebug that are really filters and aren't coupled with a job, and on
> the other hand we want to generalise "block jobs" into just "jobs", so
> adding block specific things to job_type would be a step in the wrong
> direction.
>
> I also think that before_write/after_write isn't a convenient interface,
> it brings back much of the callback-based AIO cruft and passing void*
> isn't nice anyway. It's much nicer to have a single .bdrv_co_write
> callback that somewhere in the middle calls the layer below with a
> simple function call.
>
> Also read/write aren't enough, for a full filter interface you
> potentially also need flush, discard and probably most other operations.
>
> This is why I suggested using a regular BlockDriver struct for filters,
> it already has all functions that are needed.

Let me elaborate a bit.

A block backend is a tree of block driver instances (BlockDriverState).
Common examples:

        raw             qcow2           qcow2
         |              /   \           /   \
        file         file   raw      file  qcow2
                             |             /   \
                            file         file  raw
                                                |
                                               file

A less common example:

        raw
         |
      blkdebug
         |
        file

Here, "blkdebug" acts as a filter, i.e. a block driver that can be put
between two adjacent tree nodes.  It injects errors by selectively
failing some bdrv_aio_readv() and bdrv_aio_writev() operations.

Actually, "raw" could also be viewed as a degenerate filter that does
nothing[*], but such a filter isn't particularly useful.

Except perhaps to serve as base for real filters, that do stuff.  To do
stuff in your filter, you'd replace raw's operations with your own.

Hmm, blkdebug implements far fewer operations than raw.  Makes me
wonder whether it works only in special places in the tree now.

[...]


[*] Except occasionally inject bugs when somebody adds new BlockDriver
operations without updating "raw" to forward them.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-23 10:29               ` Kevin Wolf
@ 2012-11-26  5:51                 ` Dietmar Maurer
  2012-11-26 12:07                   ` Paolo Bonzini
  0 siblings, 1 reply; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-26  5:51 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, qemu-devel

> Which raises the question of how to distinguish whether it's a new request to
> bs that must go through the filters or whether it actually comes from the last
> filter in the chain. As you can see, we don't have a well thought out plan yet,
> just rough ideas (otherwise it would probably be implemented already).

The question is whether I should modify my backup patch (regarding block filters).

IMHO, the current implementation is quite simple and easy to maintain. We can easily
convert it if someone comes up with a full-featured 'block filter' solution.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-26  5:51                 ` Dietmar Maurer
@ 2012-11-26 12:07                   ` Paolo Bonzini
  2012-11-27  6:20                     ` Dietmar Maurer
  0 siblings, 1 reply; 73+ messages in thread
From: Paolo Bonzini @ 2012-11-26 12:07 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: Kevin Wolf, qemu-devel

On 26/11/2012 06:51, Dietmar Maurer wrote:
>> Which raises the question of how to distinguish whether it's a new request to
>> bs that must go through the filters or whether it actually comes from the last
>> filter in the chain. As you can see, we don't have a well thought out plan yet,
>> just rough ideas (otherwise it would probably be implemented already).
> 
> The question is whether I should modify my backup patch (regarding block filters).

The only solution I came up with is to add before/after hooks in the
block job.  I agree with the criticism, but I think it's general enough
and at the same time easy enough to implement.

> IMHO, the current implementation is quite simple and easy to maintain.

No, "if (bs->backup_info)" simply doesn't belong in bdrv_co_writev.

Paolo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-26 12:07                   ` Paolo Bonzini
@ 2012-11-27  6:20                     ` Dietmar Maurer
  2012-11-27  7:15                       ` Dietmar Maurer
  0 siblings, 1 reply; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-27  6:20 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kevin Wolf, qemu-devel

> The only solution I came up with is to add before/after hooks in the block
> job.  I agree with the criticism, but I think it's general enough and at the same
> time easy enough to implement.
> 
> > IMHO, the current implementation is quite simple and easy to maintain.
> 
> No, "if (bs->backup_info)" simply doesn't belong in bdrv_co_writev.

I do not really understand that argument, because the current COPY_ON_READ
implementation also works that way:

    if (bs->copy_on_read) {
        flags |= BDRV_REQ_COPY_ON_READ;
    }
    if (flags & BDRV_REQ_COPY_ON_READ) {
        bs->copy_on_read_in_flight++;
    }

    if (bs->copy_on_read_in_flight) {
        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
    }

    tracked_request_begin(&req, bs, sector_num, nb_sectors, false);

    if (flags & BDRV_REQ_COPY_ON_READ) {
...

Or do you also want to move that to block job hooks?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-27  6:20                     ` Dietmar Maurer
@ 2012-11-27  7:15                       ` Dietmar Maurer
  2012-11-27  8:48                         ` Kevin Wolf
  0 siblings, 1 reply; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-27  7:15 UTC (permalink / raw)
  To: Dietmar Maurer, Paolo Bonzini; +Cc: Kevin Wolf, qemu-devel

> > The only solution I came up with is to add before/after hooks in the
> > block job.  I agree with the criticism, but I think it's general
> > enough and at the same time easy enough to implement.
> >
> > > IMHO, the current implementation is quite simple and easy to maintain.
> >
> > No, "if (bs->backup_info)" simply doesn't belong in bdrv_co_writev.
> 
> I do not really understand that argument, because the current
> COPY_ON_READ implementation also works that way:
> 
>     if (bs->copy_on_read) {
>         flags |= BDRV_REQ_COPY_ON_READ;
>     }
>     if (flags & BDRV_REQ_COPY_ON_READ) {
>         bs->copy_on_read_in_flight++;
>     }
> 
>     if (bs->copy_on_read_in_flight) {
>         wait_for_overlapping_requests(bs, sector_num, nb_sectors);
>     }
> 
>     tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
> 
>     if (flags & BDRV_REQ_COPY_ON_READ) { ...
> 
> Or do you also want to move that to block job hooks?

Just tried to move that code, but the copy-on-read feature is unrelated to block jobs,
i.e. one can open a bdrv with BDRV_O_COPY_ON_READ, and that does not create
a job.

I already suggested adding those hooks to BDS instead - don't you think that would work?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-27  7:15                       ` Dietmar Maurer
@ 2012-11-27  8:48                         ` Kevin Wolf
  2012-11-27 10:24                           ` Dietmar Maurer
  0 siblings, 1 reply; 73+ messages in thread
From: Kevin Wolf @ 2012-11-27  8:48 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: Paolo Bonzini, qemu-devel

On 27.11.2012 08:15, Dietmar Maurer wrote:
>>> The only solution I came up with is to add before/after hooks in the
>>> block job.  I agree with the criticism, but I think it's general
>>> enough and at the same time easy enough to implement.
>>>
>>>> IMHO, the current implementation is quite simple and easy to maintain.
>>>
>>> No, "if (bs->backup_info)" simply doesn't belong in bdrv_co_writev.
>>
>> I do not really understand that argument, because the current
>> COPY_ON_READ implementation also works that way:
>>
>>     if (bs->copy_on_read) {
>>         flags |= BDRV_REQ_COPY_ON_READ;
>>     }
>>     if (flags & BDRV_REQ_COPY_ON_READ) {
>>         bs->copy_on_read_in_flight++;
>>     }
>>
>>     if (bs->copy_on_read_in_flight) {
>>         wait_for_overlapping_requests(bs, sector_num, nb_sectors);
>>     }
>>
>>     tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
>>
>>     if (flags & BDRV_REQ_COPY_ON_READ) { ...
>>
>> Or do you also want to move that to block job hooks?
> 
> Just tried to move that code, but the copy-on-read feature is unrelated to block jobs,
> i.e. one can open a bdrv with BDRV_O_COPY_ON_READ, and that does not create
> a job.
> 
> I already suggested adding those hooks to BDS instead - don't you think that would work?

To which BDS? If it is the BDS that is being backed up, the problem is
that you could only have one implementation per BDS, i.e. you couldn't
use backup and copy on read or I/O throttling or whatever at the same time.

Kevin

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-21  9:01 [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1) Dietmar Maurer
                   ` (5 preceding siblings ...)
  2012-11-22 11:12 ` Stefan Hajnoczi
@ 2012-11-27 10:09 ` Wenchao Xia
  2012-11-27 10:37   ` Dietmar Maurer
  6 siblings, 1 reply; 73+ messages in thread
From: Wenchao Xia @ 2012-11-27 10:09 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: kwolf, qemu-devel

On 2012-11-21 17:01, Dietmar Maurer wrote:
> This series provides a way to efficiently backup VMs.
> [...]
> 
 Just want to confirm something to understand it better:
you are backing up the block image, not including the VM memory
state, right?  I am considering a way to do a live savevm including
memory and device state, so I wonder whether you already have a
solution for it.

-- 
Best Regards

Wenchao Xia


* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-27  8:48                         ` Kevin Wolf
@ 2012-11-27 10:24                           ` Dietmar Maurer
  0 siblings, 0 replies; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-27 10:24 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Paolo Bonzini, qemu-devel

> >>> The only solution I came up with is to add before/after hooks in the
> >>> block job.  I agree with the criticism, but I think it's general
> >>> enough and at the same time easy enough to implement.

Ok, here is another try - do you think this is better?

Note: This code is not tested - I just want to ask if I am moving in the right direction.

I tried to move BackupInfo into the block job.

From 86c3ca6467e8d220369f5e7b235cdc0e88a79ff9 Mon Sep 17 00:00:00 2001
From: Dietmar Maurer <dietmar@proxmox.com>
Date: Tue, 27 Nov 2012 11:05:23 +0100
Subject: [PATCH] add basic backup support to block driver

Function bdrv_backup_init() creates a block job to back up a block device.

We call bdrv_co_backup_cow() for each write during backup. That function
reads the original data and passes it to backup_dump_cb().

The tracked_request infrastructure is used to serialize access.

Currently the backup cluster size is hardcoded to 65536 bytes (2^16, i.e.
128 sectors of 512 bytes per cluster).

Signed-off-by: Dietmar Maurer <dietmar@proxmox.com>
---
 Makefile.objs |    1 +
 backup.c      |  302 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 backup.h      |   30 ++++++
 block.c       |   71 ++++++++++++-
 block.h       |    2 +
 blockjob.h    |   10 ++
 6 files changed, 410 insertions(+), 6 deletions(-)
 create mode 100644 backup.c
 create mode 100644 backup.h
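
(To make the cluster arithmetic in backup.c concrete, assuming the usual
512-byte BDRV_SECTOR_SIZE:)

    /* BACKUP_CLUSTER_SIZE       = 1 << 16     = 65536 bytes
     * BACKUP_BLOCKS_PER_CLUSTER = 65536 / 512 = 128 sectors
     *
     * A guest write to sectors [100, 300) covers
     *   start = 100 / 128               = 0
     *   end   = (100 + 200 + 127) / 128 = 3,
     * i.e. clusters 0, 1 and 2, each of which bdrv_co_backup_cow()
     * copies into the archive before the write is allowed to proceed. */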

diff --git a/Makefile.objs b/Makefile.objs
index 3c7abca..cb46be5 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -48,6 +48,7 @@ coroutine-obj-$(CONFIG_WIN32) += coroutine-win32.o
 block-obj-y = iov.o cache-utils.o qemu-option.o module.o async.o
 block-obj-y += nbd.o block.o blockjob.o aes.o qemu-config.o
 block-obj-y += thread-pool.o qemu-progress.o qemu-sockets.o uri.o notify.o
+block-obj-y += backup.o
 block-obj-y += $(coroutine-obj-y) $(qobject-obj-y) $(version-obj-y)
 block-obj-$(CONFIG_POSIX) += event_notifier-posix.o aio-posix.o
 block-obj-$(CONFIG_WIN32) += event_notifier-win32.o aio-win32.o
diff --git a/backup.c b/backup.c
new file mode 100644
index 0000000..e86b76e
--- /dev/null
+++ b/backup.c
@@ -0,0 +1,302 @@
+/*
+ * QEMU backup
+ *
+ * Copyright (C) 2012 Proxmox Server Solutions
+ *
+ * Authors:
+ *  Dietmar Maurer (dietmar@proxmox.com)
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include <stdio.h>
+#include <errno.h>
+#include <unistd.h>
+
+#include "block.h"
+#include "block_int.h"
+#include "blockjob.h"
+#include "backup.h"
+
+#define DEBUG_BACKUP 0
+
+#define DPRINTF(fmt, ...) \
+    do { if (DEBUG_BACKUP) { printf("backup: " fmt, ## __VA_ARGS__); } } \
+    while (0)
+
+
+#define BITS_PER_LONG  (sizeof(unsigned long) * 8)
+
+typedef struct BackupBlockJob {
+    BlockJob common;
+    unsigned long *bitmap;
+    int bitmap_size;
+    BackupDumpFunc *backup_dump_cb;
+    BlockDriverCompletionFunc *backup_complete_cb;
+    void *opaque;
+} BackupBlockJob;
+
+static int backup_get_bitmap(BlockDriverState *bs, int64_t cluster_num)
+{
+    assert(bs);
+    BackupBlockJob *job = (BackupBlockJob *)bs->job;
+    assert(job);
+    assert(job->bitmap);
+
+    unsigned long val, idx, bit;
+
+    idx = cluster_num / BITS_PER_LONG;
+
+    assert(job->bitmap_size > idx);
+
+    bit = cluster_num % BITS_PER_LONG;
+    val = job->bitmap[idx];
+
+    return !!(val & (1UL << bit));
+}
+
+static void backup_set_bitmap(BlockDriverState *bs, int64_t cluster_num,
+                              int dirty)
+{
+    assert(bs);
+    BackupBlockJob *job = (BackupBlockJob *)bs->job;
+    assert(job);
+    assert(job->bitmap);
+
+    unsigned long val, idx, bit;
+
+    idx = cluster_num / BITS_PER_LONG;
+
+    assert(job->bitmap_size > idx);
+
+    bit = cluster_num % BITS_PER_LONG;
+    val = job->bitmap[idx];
+    if (dirty) {
+        if (!(val & (1UL << bit))) {
+            val |= 1UL << bit;
+        }
+    } else {
+        if (val & (1UL << bit)) {
+            val &= ~(1UL << bit);
+        }
+    }
+    job->bitmap[idx] = val;
+}
+
+static int backup_in_progress_count;
+
+static int coroutine_fn bdrv_co_backup_cow(BlockDriverState *bs,
+                                           int64_t sector_num, int nb_sectors)
+{
+    assert(bs);
+    BackupBlockJob *job = (BackupBlockJob *)bs->job;
+    assert(job);
+
+    BlockDriver *drv = bs->drv;
+    struct iovec iov;
+    QEMUIOVector bounce_qiov;
+    void *bounce_buffer = NULL;
+    int ret = 0;
+
+    backup_in_progress_count++;
+
+    int64_t start, end;
+
+    start = sector_num / BACKUP_BLOCKS_PER_CLUSTER;
+    end = (sector_num + nb_sectors + BACKUP_BLOCKS_PER_CLUSTER - 1) /
+        BACKUP_BLOCKS_PER_CLUSTER;
+
+    DPRINTF("brdv_co_backup_cow enter %s C%zd %zd %d\n",
+            bdrv_get_device_name(bs), start, sector_num, nb_sectors);
+
+    for (; start < end; start++) {
+        if (backup_get_bitmap(bs, start)) {
+            DPRINTF("brdv_co_backup_cow skip C%zd\n", start);
+            continue; /* already copied */
+        }
+
+        /* immediately set bitmap (avoid coroutine race) */
+        backup_set_bitmap(bs, start, 1);
+
+        DPRINTF("brdv_co_backup_cow C%zd\n", start);
+
+        if (!bounce_buffer) {
+            iov.iov_len = BACKUP_CLUSTER_SIZE;
+            iov.iov_base = bounce_buffer = qemu_blockalign(bs, iov.iov_len);
+            qemu_iovec_init_external(&bounce_qiov, &iov, 1);
+        }
+
+        ret = drv->bdrv_co_readv(bs, start * BACKUP_BLOCKS_PER_CLUSTER,
+                                 BACKUP_BLOCKS_PER_CLUSTER,
+                                 &bounce_qiov);
+        if (ret < 0) {
+            DPRINTF("brdv_co_backup_cow bdrv_read C%zd failed\n", start);
+            goto out;
+        }
+
+        ret = job->backup_dump_cb(job->opaque, bs, start, bounce_buffer);
+        if (ret < 0) {
+            DPRINTF("brdv_co_backup_cow dump_cluster_cb C%zd failed\n", start);
+            goto out;
+        }
+
+        DPRINTF("brdv_co_backup_cow done C%zd\n", start);
+    }
+
+out:
+    if (bounce_buffer) {
+        qemu_vfree(bounce_buffer);
+    }
+
+    backup_in_progress_count--;
+
+    return ret;
+}
+    
+static int coroutine_fn backup_before_read(BlockDriverState *bs, 
+                                           int64_t sector_num, 
+                                           int nb_sectors, QEMUIOVector *qiov)
+{
+    return bdrv_co_backup_cow(bs, sector_num, nb_sectors);
+}
+
+static int coroutine_fn backup_before_write(BlockDriverState *bs,
+                                            int64_t sector_num, 
+                                            int nb_sectors, QEMUIOVector *qiov)
+{
+    return bdrv_co_backup_cow(bs, sector_num, nb_sectors);
+}
+
+
+static BlockJobType backup_job_type = {
+    .instance_size = sizeof(BackupBlockJob),
+    .before_read = backup_before_read,
+    .before_write = backup_before_write,
+    .job_type      = "backup",
+};
+
+static void coroutine_fn backup_run(void *opaque)
+{
+    BackupBlockJob *job = opaque;
+    BlockDriverState *bs = job->common.bs;
+    assert(bs);
+
+    int64_t start, end;
+
+    start = 0;
+    end = (bs->total_sectors + BACKUP_BLOCKS_PER_CLUSTER - 1) /
+        BACKUP_BLOCKS_PER_CLUSTER;
+
+    DPRINTF("backup_run start %s %zd %zd\n", bdrv_get_device_name(bs),
+            start, end);
+
+    int ret = 0;
+
+    for (; start < end; start++) {
+        if (block_job_is_cancelled(&job->common)) {
+            ret = -1;
+            break;
+        }
+
+        if (backup_get_bitmap(bs, start)) {
+            continue; /* already copied */
+        }
+
+        /* we need to yield so that qemu_aio_flush() returns.
+         * (without it, the VM does not reboot)
+         * TODO: can we avoid that?
+         */
+        co_sleep_ns(rt_clock, 0);
+        if (block_job_is_cancelled(&job->common)) {
+            ret = -1;
+            break;
+        }
+        DPRINTF("backup_run loop C%zd\n", start);
+
+        /**
+         * This triggers a cluster copy
+         * Note: avoid calling bdrv_co_backup_cow() directly, because
+         * that would bypass tracked_request_begin()
+         */
+        ret = bdrv_co_backup(bs, start*BACKUP_BLOCKS_PER_CLUSTER, 1);
+        if (ret < 0) {
+            break;
+        }
+        /* Publish progress */
+        job->common.offset += BACKUP_CLUSTER_SIZE;
+    }
+
+    while (backup_in_progress_count > 0) {
+        DPRINTF("backup_run backup_in_progress_count != 0 (%d)",
+                backup_in_progress_count);
+        co_sleep_ns(rt_clock, 10000);
+    }
+
+    DPRINTF("backup_run complete %d\n", ret);
+    block_job_completed(&job->common, ret);
+}
+
+static void backup_job_cleanup_cb(void *opaque, int ret)
+{
+    BlockDriverState *bs = opaque;
+    assert(bs);
+    BackupBlockJob *job = (BackupBlockJob *)bs->job;
+    assert(job);
+
+    DPRINTF("backup_job_cleanup_cb start %d\n", ret);
+
+    job->backup_complete_cb(job->opaque, ret);
+
+    DPRINTF("backup_job_cleanup_cb end\n");
+
+    g_free(job->bitmap);
+}
+
+int
+bdrv_backup_init(BlockDriverState *bs, BackupDumpFunc *backup_dump_cb,
+                 BlockDriverCompletionFunc *backup_complete_cb,
+                 void *opaque)
+{
+    assert(bs);
+    assert(backup_dump_cb);
+    assert(backup_complete_cb);
+
+    if (bs->job) {
+        DPRINTF("bdrv_backup_init failed - running job on %s\n",
+                bdrv_get_device_name(bs));
+        return -1;
+    }
+
+    int64_t bitmap_size;
+    const char *devname = bdrv_get_device_name(bs);
+
+    if (!devname || !devname[0]) {
+        return -1;
+    }
+
+    DPRINTF("bdrv_backup_init %s\n", bdrv_get_device_name(bs));
+
+    Error *errp = NULL;
+    BackupBlockJob *job = block_job_create(&backup_job_type, bs, 0,
+                                           backup_job_cleanup_cb, bs, &errp);
+
+    job->common.cluster_size = BACKUP_CLUSTER_SIZE;
+
+    bitmap_size = bs->total_sectors +
+        BACKUP_BLOCKS_PER_CLUSTER * BITS_PER_LONG - 1;
+    bitmap_size /= BACKUP_BLOCKS_PER_CLUSTER * BITS_PER_LONG;
+
+    job->backup_dump_cb = backup_dump_cb;
+    job->backup_complete_cb = backup_complete_cb;
+    job->opaque = opaque;
+    job->bitmap_size = bitmap_size;
+    job->bitmap = g_new0(unsigned long, bitmap_size);
+
+    job->common.len = bs->total_sectors*BDRV_SECTOR_SIZE;
+    job->common.co = qemu_coroutine_create(backup_run);
+    qemu_coroutine_enter(job->common.co, job);
+
+    return 0;
+}
diff --git a/backup.h b/backup.h
new file mode 100644
index 0000000..577ac19
--- /dev/null
+++ b/backup.h
@@ -0,0 +1,30 @@
+/*
+ * QEMU backup related definitions
+ *
+ * Copyright (C) Proxmox Server Solutions
+ *
+ * Authors:
+ *  Dietmar Maurer (dietmar@proxmox.com)
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef QEMU_BACKUP_H
+#define QEMU_BACKUP_H
+
+#include "block.h"
+
+#define BACKUP_CLUSTER_BITS 16
+#define BACKUP_CLUSTER_SIZE (1<<BACKUP_CLUSTER_BITS)
+#define BACKUP_BLOCKS_PER_CLUSTER (BACKUP_CLUSTER_SIZE/BDRV_SECTOR_SIZE)
+
+typedef int BackupDumpFunc(void *opaque, BlockDriverState *bs,
+                           int64_t cluster_num, unsigned char *buf);
+
+int bdrv_backup_init(BlockDriverState *bs, BackupDumpFunc *backup_dump_cb,
+		     BlockDriverCompletionFunc *backup_complete_cb,
+		     void *opaque);
+
+#endif /* QEMU_BACKUP_H */
diff --git a/block.c b/block.c
index c05875f..2f7c2eb 100644
--- a/block.c
+++ b/block.c
@@ -54,6 +54,7 @@
 typedef enum {
     BDRV_REQ_COPY_ON_READ = 0x1,
     BDRV_REQ_ZERO_WRITE   = 0x2,
+    BDRV_REQ_BACKUP_ONLY  = 0x4,
 } BdrvRequestFlags;
 
 static void bdrv_dev_change_media_cb(BlockDriverState *bs, bool load);
@@ -1542,7 +1543,7 @@ int bdrv_commit(BlockDriverState *bs)
 
     if (!drv)
         return -ENOMEDIUM;
-    
+
     if (!bs->backing_hd) {
         return -ENOTSUP;
     }
@@ -1679,6 +1680,22 @@ static void round_to_clusters(BlockDriverState *bs,
     }
 }
 
+/**
+ * Round a region to job cluster boundaries
+ */
+static void round_to_job_clusters(BlockDriverState *bs,
+                                  int64_t sector_num, int nb_sectors,
+                                  int job_cluster_size,
+                                  int64_t *cluster_sector_num,
+                                  int *cluster_nb_sectors)
+{
+    int64_t c = job_cluster_size/BDRV_SECTOR_SIZE;
+
+    *cluster_sector_num = QEMU_ALIGN_DOWN(sector_num, c);
+    *cluster_nb_sectors = QEMU_ALIGN_UP(sector_num - *cluster_sector_num +
+                                        nb_sectors, c);
+}
+
 static bool tracked_request_overlaps(BdrvTrackedRequest *req,
                                      int64_t sector_num, int nb_sectors) {
     /*        aaaa   bbbb */
@@ -1693,7 +1710,9 @@ static bool tracked_request_overlaps(BdrvTrackedRequest *req,
 }
 
 static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
-        int64_t sector_num, int nb_sectors)
+                                                       int64_t sector_num,
+                                                       int nb_sectors,
+                                                       int job_cluster_size)
 {
     BdrvTrackedRequest *req;
     int64_t cluster_sector_num;
@@ -1709,6 +1728,11 @@ static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
     round_to_clusters(bs, sector_num, nb_sectors,
                       &cluster_sector_num, &cluster_nb_sectors);
 
+    if (job_cluster_size) {
+        round_to_job_clusters(bs, sector_num, nb_sectors, job_cluster_size,
+                              &cluster_sector_num, &cluster_nb_sectors);
+    }
+
     do {
         retry = false;
         QLIST_FOREACH(req, &bs->tracked_requests, list) {
@@ -2278,12 +2302,24 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
         bs->copy_on_read_in_flight++;
     }
 
-    if (bs->copy_on_read_in_flight) {
-        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
+    int job_cluster_size = bs->job && bs->job->cluster_size ?
+        bs->job->cluster_size : 0;
+
+    if (bs->copy_on_read_in_flight || job_cluster_size) {
+        wait_for_overlapping_requests(bs, sector_num, nb_sectors,
+                                      job_cluster_size);
     }
 
     tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
 
+    if (bs->job && bs->job->job_type->before_read) {
+        ret = bs->job->job_type->before_read(bs, sector_num, nb_sectors, qiov);
+        if (flags & BDRV_REQ_BACKUP_ONLY) {
+            /* Note: We do not return any data to the caller */
+            goto out;
+        }
+    }
+
     if (flags & BDRV_REQ_COPY_ON_READ) {
         int pnum;
 
@@ -2327,6 +2363,17 @@ int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
                             BDRV_REQ_COPY_ON_READ);
 }
 
+int coroutine_fn bdrv_co_backup(BlockDriverState *bs,
+    int64_t sector_num, int nb_sectors)
+{
+    if (!bs->job) {
+        return -ENOTSUP;
+    }
+
+    return bdrv_co_do_readv(bs, sector_num, nb_sectors, NULL,
+                            BDRV_REQ_BACKUP_ONLY);
+}
+
 static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
     int64_t sector_num, int nb_sectors)
 {
@@ -2384,12 +2431,23 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
         bdrv_io_limits_intercept(bs, true, nb_sectors);
     }
 
-    if (bs->copy_on_read_in_flight) {
-        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
+    int job_cluster_size = bs->job && bs->job->cluster_size ?
+        bs->job->cluster_size : 0;
+
+    if (bs->copy_on_read_in_flight || job_cluster_size) {
+        wait_for_overlapping_requests(bs, sector_num, nb_sectors,
+                                      job_cluster_size);
     }
 
     tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
 
+    if (bs->job && bs->job->job_type->before_write) {
+        ret = bs->job->job_type->before_write(bs, sector_num, nb_sectors, qiov);
+        if (ret < 0) {
+            goto out;
+        }
+    }
+
     if (flags & BDRV_REQ_ZERO_WRITE) {
         ret = bdrv_co_do_write_zeroes(bs, sector_num, nb_sectors);
     } else {
@@ -2408,6 +2466,7 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
         bs->wr_highest_sector = sector_num + nb_sectors - 1;
     }
 
+out:
     tracked_request_end(&req);
 
     return ret;
diff --git a/block.h b/block.h
index 722c620..94e5903 100644
--- a/block.h
+++ b/block.h
@@ -172,6 +172,8 @@ int coroutine_fn bdrv_co_readv(BlockDriverState *bs, int64_t sector_num,
     int nb_sectors, QEMUIOVector *qiov);
 int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
+int coroutine_fn bdrv_co_backup(BlockDriverState *bs,
+    int64_t sector_num, int nb_sectors);
 int coroutine_fn bdrv_co_writev(BlockDriverState *bs, int64_t sector_num,
     int nb_sectors, QEMUIOVector *qiov);
 /*
diff --git a/blockjob.h b/blockjob.h
index 3792b73..1d3f4fd 100644
--- a/blockjob.h
+++ b/blockjob.h
@@ -50,6 +50,13 @@ typedef struct BlockJobType {
      * manually.
      */
     void (*complete)(BlockJob *job, Error **errp);
+
+    /** tracked requests */
+    int coroutine_fn (*before_read)(BlockDriverState *bs, int64_t sector_num, 
+                                    int nb_sectors, QEMUIOVector *qiov);
+    int coroutine_fn (*before_write)(BlockDriverState *bs, int64_t sector_num, 
+                                     int nb_sectors, QEMUIOVector *qiov);
+
 } BlockJobType;
 
 /**
@@ -103,6 +110,9 @@ struct BlockJob {
     /** Speed that was set with @block_job_set_speed.  */
     int64_t speed;
 
+    /** tracked requests */
+    int cluster_size;
+
     /** The completion function that will be called when the job completes.  */
     BlockDriverCompletionFunc *cb;
 
-- 
1.7.2.5
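
(A minimal, hypothetical caller of the proposed bdrv_backup_init() API;
my_dump_cb, my_complete_cb, archive_write_cluster() and archive_close()
are illustrative names only, not part of this series:)

    /* BackupDumpFunc: receives one BACKUP_CLUSTER_SIZE cluster, possibly
     * out of order; cluster_num says where it belongs in the image. */
    static int my_dump_cb(void *opaque, BlockDriverState *bs,
                          int64_t cluster_num, unsigned char *buf)
    {
        return archive_write_cluster(opaque, cluster_num, buf);
    }

    /* BlockDriverCompletionFunc: runs once when the job completes. */
    static void my_complete_cb(void *opaque, int ret)
    {
        archive_close(opaque, ret);    /* ret < 0 on error or cancel */
    }

    /* 'archive' is whatever state the two callbacks need */
    if (bdrv_backup_init(bs, my_dump_cb, my_complete_cb, archive) < 0) {
        /* a job is already running on bs, or the device has no name */
    }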


* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-27 10:09 ` Wenchao Xia
@ 2012-11-27 10:37   ` Dietmar Maurer
  2012-11-28  9:39     ` Wenchao Xia
  0 siblings, 1 reply; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-27 10:37 UTC (permalink / raw)
  To: Wenchao Xia; +Cc: kwolf, qemu-devel

>  Just want to confirm something to understand it better:
> you are backing up the block image, not including the VM memory state, right?
> I am considering a way to do a live savevm including memory and device state,
> so I wonder whether you already have a solution for it.

Yes, I already have code for that.


* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-27 10:37   ` Dietmar Maurer
@ 2012-11-28  9:39     ` Wenchao Xia
  2012-11-28 11:08       ` Dietmar Maurer
  0 siblings, 1 reply; 73+ messages in thread
From: Wenchao Xia @ 2012-11-28  9:39 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: kwolf, qemu-devel

On 2012-11-27 18:37, Dietmar Maurer wrote:
>>   Just want to confirm something to understand it better:
>> you are backing up the block image, not including the VM memory state, right?
>> I am considering a way to do a live savevm including memory and device state,
>> so I wonder whether you already have a solution for it.
>
> Yes, I already have code for that.
>
>

   Is the code for live save/restore of the VM memory and device state
included in this patch series? I quickly reviewed the patches but
did not find a hook to save the VM memory state. Hope you can
enlighten me on your approach; my thought is to do live migration
into a qcow2 file, but your code does not seem to touch qcow2 images.

-- 
Best Regards

Wenchao Xia


* Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
  2012-11-28  9:39     ` Wenchao Xia
@ 2012-11-28 11:08       ` Dietmar Maurer
  0 siblings, 0 replies; 73+ messages in thread
From: Dietmar Maurer @ 2012-11-28 11:08 UTC (permalink / raw)
  To: Wenchao Xia; +Cc: kwolf, qemu-devel

>    Is the code for live save/restore of the VM memory and device state
> included in this patch series? I quickly reviewed the patches but did not
> find a hook to save the VM memory state. Hope you can enlighten me on your
> approach; my thought is to do live migration into a qcow2 file, but your
> code does not seem to touch qcow2 images.

I will post that code in a few days.


end of thread

Thread overview: 73+ messages
2012-11-21  9:01 [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1) Dietmar Maurer
2012-11-21  9:01 ` [Qemu-devel] [PATCH 2/5] add basic backup support to block driver Dietmar Maurer
2012-11-22 11:25   ` Stefan Hajnoczi
2012-11-22 11:29     ` Dietmar Maurer
2012-11-23  8:56     ` Dietmar Maurer
2012-11-21  9:01 ` [Qemu-devel] [PATCH 3/5] introduce new vma archive format Dietmar Maurer
2012-11-21 16:06   ` Eric Blake
2012-11-21 17:56     ` Dietmar Maurer
2012-11-21  9:01 ` [Qemu-devel] [PATCH 4/5] add backup related monitor commands Dietmar Maurer
2012-11-21 16:16   ` Eric Blake
2012-11-21 17:59     ` Dietmar Maurer
2012-11-21 18:49     ` Dietmar Maurer
2012-11-21  9:01 ` [Qemu-devel] [PATCH 5/5] add regression tests for backup Dietmar Maurer
2012-11-21 10:48 ` [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1) Kevin Wolf
2012-11-21 11:10   ` Dietmar Maurer
2012-11-21 12:37     ` Kevin Wolf
2012-11-21 13:23       ` Paolo Bonzini
2012-11-23  7:42         ` Dietmar Maurer
2012-11-23  9:18           ` Paolo Bonzini
2012-11-23  9:28             ` Dietmar Maurer
2012-11-23  8:12         ` Dietmar Maurer
2012-11-23  9:01           ` Dietmar Maurer
2012-11-23  9:05             ` Dietmar Maurer
2012-11-23  9:15               ` Paolo Bonzini
2012-11-23  9:17                 ` Dietmar Maurer
2012-11-23  9:55               ` Kevin Wolf
2012-11-23 10:55                 ` Markus Armbruster
2012-11-21 13:25       ` Dietmar Maurer
2012-11-21 13:58         ` Kevin Wolf
2012-11-21 15:47           ` Dietmar Maurer
2012-11-23  7:38       ` Dietmar Maurer
2012-11-23  9:08         ` Kevin Wolf
2012-11-23  9:21           ` Dietmar Maurer
2012-11-23  9:31             ` Dietmar Maurer
2012-11-23 10:29               ` Kevin Wolf
2012-11-26  5:51                 ` Dietmar Maurer
2012-11-26 12:07                   ` Paolo Bonzini
2012-11-27  6:20                     ` Dietmar Maurer
2012-11-27  7:15                       ` Dietmar Maurer
2012-11-27  8:48                         ` Kevin Wolf
2012-11-27 10:24                           ` Dietmar Maurer
2012-11-21 11:23   ` Dietmar Maurer
2012-11-22 11:12 ` Stefan Hajnoczi
2012-11-22 11:26   ` Dietmar Maurer
2012-11-22 12:44     ` Stefan Hajnoczi
2012-11-22 12:55       ` Dietmar Maurer
2012-11-22 15:30         ` Stefan Hajnoczi
2012-11-22 15:58           ` Dietmar Maurer
2012-11-22 17:02             ` Stefan Hajnoczi
2012-11-22 17:34               ` Dietmar Maurer
2012-11-22 11:40   ` Dietmar Maurer
2012-11-22 15:42     ` Stefan Hajnoczi
2012-11-22 12:00   ` Dietmar Maurer
2012-11-22 15:45     ` Stefan Hajnoczi
2012-11-22 15:56       ` Dietmar Maurer
2012-11-22 16:37         ` Stefan Hajnoczi
2012-11-22 12:03   ` Dietmar Maurer
2012-11-22 17:16   ` Stefan Hajnoczi
2012-11-22 17:46     ` Dietmar Maurer
2012-11-23  5:23       ` Stefan Hajnoczi
2012-11-23  5:25         ` Stefan Hajnoczi
2012-11-23  6:18           ` Dietmar Maurer
2012-11-23  6:13         ` Dietmar Maurer
2012-11-22 17:50     ` Dietmar Maurer
2012-11-23  5:21       ` Stefan Hajnoczi
2012-11-22 18:05     ` Dietmar Maurer
2012-11-23  5:19       ` Stefan Hajnoczi
2012-11-23  6:05         ` Dietmar Maurer
2012-11-22 18:15     ` Dietmar Maurer
2012-11-27 10:09 ` Wenchao Xia
2012-11-27 10:37   ` Dietmar Maurer
2012-11-28  9:39     ` Wenchao Xia
2012-11-28 11:08       ` Dietmar Maurer
