All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH v13 00/14] qemu-img: Implement commit like QMP
@ 2014-10-22 12:51 Max Reitz
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 01/14] qcow2: Allow "full" discard Max Reitz
                   ` (13 more replies)
  0 siblings, 14 replies; 43+ messages in thread
From: Max Reitz @ 2014-10-22 12:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

qemu-img should use QMP commands whenever possible in order to ensure
feature completeness of both online and offline image operations. For
the "commit" command, this is relatively easy, so implement it first
(in the hope that indeed others will follow).

As qemu-img does not have access to QMP (due to QMP being intertwined
with basically everything in qemu), we cannot directly use QMP, but at
least use the functions the corresponding QMP commands are using (which
would be "block-commit", in this case).


This series depends on my series
"[PATCH v8 00/17] qcow2: Fix image repairing" (or on any later version).


v13:
- Patch 3:
  - Use fallback code not only for images with internal snapshots, but
    also for compat=0.10 [Eric]
  - Use an anonymous struct for l1_ofs_rt_ofs_cls instead of a
    20 byte array [Eric]
- Patch 8:
  - Rebase conflict due to the null block driver
  - Rebase conflicts due to BlockBackend
- Patch 14: Force compat=1.1 [Eric]


git-backport-diff against v12:

Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/14:[----] [--] 'qcow2: Allow "full" discard'
002/14:[----] [--] 'qcow2: Implement bdrv_make_empty()'
003/14:[0020] [FC] 'qcow2: Optimize bdrv_make_empty()'
004/14:[----] [-C] 'blockjob: Introduce block_job_complete_sync()'
005/14:[----] [--] 'blockjob: Add "ready" field'
006/14:[----] [--] 'iotests: Omit length/offset test in 040 and 041'
007/14:[----] [--] 'block/mirror: Improve progress report'
008/14:[0008] [FC] 'qemu-img: Implement commit like QMP'
009/14:[----] [-C] 'qemu-img: Empty image after commit'
010/14:[----] [-C] 'qemu-img: Enable progress output for commit'
011/14:[----] [-C] 'qemu-img: Specify backing file for commit'
012/14:[----] [-C] 'iotests: Add _filter_qemu_img_map'
013/14:[----] [-C] 'iotests: Add test for backing-chain commits'
014/14:[0001] [FC] 'iotests: Add test for qcow2's bdrv_make_empty'


Max Reitz (14):
  qcow2: Allow "full" discard
  qcow2: Implement bdrv_make_empty()
  qcow2: Optimize bdrv_make_empty()
  blockjob: Introduce block_job_complete_sync()
  blockjob: Add "ready" field
  iotests: Omit length/offset test in 040 and 041
  block/mirror: Improve progress report
  qemu-img: Implement commit like QMP
  qemu-img: Empty image after commit
  qemu-img: Enable progress output for commit
  qemu-img: Specify backing file for commit
  iotests: Add _filter_qemu_img_map
  iotests: Add test for backing-chain commits
  iotests: Add test for qcow2's bdrv_make_empty

 block/Makefile.objs              |   3 +-
 block/blkdebug.c                 |   2 +
 block/mirror.c                   |  34 +++++----
 block/qcow2-cluster.c            |  27 ++++---
 block/qcow2-snapshot.c           |   2 +-
 block/qcow2.c                    | 142 ++++++++++++++++++++++++++++++++++++-
 block/qcow2.h                    |   2 +-
 blockjob.c                       |  42 +++++++++--
 include/block/block.h            |   2 +
 include/block/blockjob.h         |  20 ++++++
 qapi/block-core.json             |   4 +-
 qemu-img-cmds.hx                 |   4 +-
 qemu-img.c                       | 148 +++++++++++++++++++++++++++++++++------
 qemu-img.texi                    |  13 +++-
 tests/qemu-iotests/040           |   4 +-
 tests/qemu-iotests/041           |   3 +-
 tests/qemu-iotests/097           | 122 ++++++++++++++++++++++++++++++++
 tests/qemu-iotests/097.out       | 119 +++++++++++++++++++++++++++++++
 tests/qemu-iotests/098           |  79 +++++++++++++++++++++
 tests/qemu-iotests/098.out       |  45 ++++++++++++
 tests/qemu-iotests/common.filter |   7 ++
 tests/qemu-iotests/group         |   2 +
 tests/qemu-iotests/iotests.py    |   3 +-
 23 files changed, 761 insertions(+), 68 deletions(-)
 create mode 100755 tests/qemu-iotests/097
 create mode 100644 tests/qemu-iotests/097.out
 create mode 100755 tests/qemu-iotests/098
 create mode 100644 tests/qemu-iotests/098.out

-- 
1.9.3

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v13 01/14] qcow2: Allow "full" discard
  2014-10-22 12:51 [Qemu-devel] [PATCH v13 00/14] qemu-img: Implement commit like QMP Max Reitz
@ 2014-10-22 12:51 ` Max Reitz
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 02/14] qcow2: Implement bdrv_make_empty() Max Reitz
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 43+ messages in thread
From: Max Reitz @ 2014-10-22 12:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

Normally, discarded sectors should read back as zero. However, there are
cases in which a sector (or rather cluster) should be discarded as if
they were never written in the first place, that is, reading them should
fall through to the backing file again.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2-cluster.c  | 27 +++++++++++++++++----------
 block/qcow2-snapshot.c |  2 +-
 block/qcow2.c          |  2 +-
 block/qcow2.h          |  2 +-
 4 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index f7dd8c0..5dc83e5 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1412,7 +1412,7 @@ int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset)
  * clusters.
  */
 static int discard_single_l2(BlockDriverState *bs, uint64_t offset,
-    unsigned int nb_clusters, enum qcow2_discard_type type)
+    unsigned int nb_clusters, enum qcow2_discard_type type, bool full_discard)
 {
     BDRVQcowState *s = bs->opaque;
     uint64_t *l2_table;
@@ -1434,23 +1434,30 @@ static int discard_single_l2(BlockDriverState *bs, uint64_t offset,
         old_l2_entry = be64_to_cpu(l2_table[l2_index + i]);
 
         /*
-         * Make sure that a discarded area reads back as zeroes for v3 images
-         * (we cannot do it for v2 without actually writing a zero-filled
-         * buffer). We can skip the operation if the cluster is already marked
-         * as zero, or if it's unallocated and we don't have a backing file.
+         * If full_discard is false, make sure that a discarded area reads back
+         * as zeroes for v3 images (we cannot do it for v2 without actually
+         * writing a zero-filled buffer). We can skip the operation if the
+         * cluster is already marked as zero, or if it's unallocated and we
+         * don't have a backing file.
          *
          * TODO We might want to use bdrv_get_block_status(bs) here, but we're
          * holding s->lock, so that doesn't work today.
+         *
+         * If full_discard is true, the sector should not read back as zeroes,
+         * but rather fall through to the backing file.
          */
         switch (qcow2_get_cluster_type(old_l2_entry)) {
             case QCOW2_CLUSTER_UNALLOCATED:
-                if (!bs->backing_hd) {
+                if (full_discard || !bs->backing_hd) {
                     continue;
                 }
                 break;
 
             case QCOW2_CLUSTER_ZERO:
-                continue;
+                if (!full_discard) {
+                    continue;
+                }
+                break;
 
             case QCOW2_CLUSTER_NORMAL:
             case QCOW2_CLUSTER_COMPRESSED:
@@ -1462,7 +1469,7 @@ static int discard_single_l2(BlockDriverState *bs, uint64_t offset,
 
         /* First remove L2 entries */
         qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table);
-        if (s->qcow_version >= 3) {
+        if (!full_discard && s->qcow_version >= 3) {
             l2_table[l2_index + i] = cpu_to_be64(QCOW_OFLAG_ZERO);
         } else {
             l2_table[l2_index + i] = cpu_to_be64(0);
@@ -1481,7 +1488,7 @@ static int discard_single_l2(BlockDriverState *bs, uint64_t offset,
 }
 
 int qcow2_discard_clusters(BlockDriverState *bs, uint64_t offset,
-    int nb_sectors, enum qcow2_discard_type type)
+    int nb_sectors, enum qcow2_discard_type type, bool full_discard)
 {
     BDRVQcowState *s = bs->opaque;
     uint64_t end_offset;
@@ -1504,7 +1511,7 @@ int qcow2_discard_clusters(BlockDriverState *bs, uint64_t offset,
 
     /* Each L2 table is handled by its own loop iteration */
     while (nb_clusters > 0) {
-        ret = discard_single_l2(bs, offset, nb_clusters, type);
+        ret = discard_single_l2(bs, offset, nb_clusters, type, full_discard);
         if (ret < 0) {
             goto fail;
         }
diff --git a/block/qcow2-snapshot.c b/block/qcow2-snapshot.c
index f52d7fd..5b3903c 100644
--- a/block/qcow2-snapshot.c
+++ b/block/qcow2-snapshot.c
@@ -441,7 +441,7 @@ int qcow2_snapshot_create(BlockDriverState *bs, QEMUSnapshotInfo *sn_info)
     qcow2_discard_clusters(bs, qcow2_vm_state_offset(s),
                            align_offset(sn->vm_state_size, s->cluster_size)
                                 >> BDRV_SECTOR_BITS,
-                           QCOW2_DISCARD_NEVER);
+                           QCOW2_DISCARD_NEVER, false);
 
 #ifdef DEBUG_ALLOC
     {
diff --git a/block/qcow2.c b/block/qcow2.c
index 7a2c66f..71c0421 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2089,7 +2089,7 @@ static coroutine_fn int qcow2_co_discard(BlockDriverState *bs,
 
     qemu_co_mutex_lock(&s->lock);
     ret = qcow2_discard_clusters(bs, sector_num << BDRV_SECTOR_BITS,
-        nb_sectors, QCOW2_DISCARD_REQUEST);
+        nb_sectors, QCOW2_DISCARD_REQUEST, false);
     qemu_co_mutex_unlock(&s->lock);
     return ret;
 }
diff --git a/block/qcow2.h b/block/qcow2.h
index 47cd4b3..b491a25 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -536,7 +536,7 @@ uint64_t qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
 
 int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m);
 int qcow2_discard_clusters(BlockDriverState *bs, uint64_t offset,
-    int nb_sectors, enum qcow2_discard_type type);
+    int nb_sectors, enum qcow2_discard_type type, bool full_discard);
 int qcow2_zero_clusters(BlockDriverState *bs, uint64_t offset, int nb_sectors);
 
 int qcow2_expand_zero_clusters(BlockDriverState *bs);
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v13 02/14] qcow2: Implement bdrv_make_empty()
  2014-10-22 12:51 [Qemu-devel] [PATCH v13 00/14] qemu-img: Implement commit like QMP Max Reitz
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 01/14] qcow2: Allow "full" discard Max Reitz
@ 2014-10-22 12:51 ` Max Reitz
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 03/14] qcow2: Optimize bdrv_make_empty() Max Reitz
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 43+ messages in thread
From: Max Reitz @ 2014-10-22 12:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

Implement this function by making all clusters in the image file fall
through to the backing file (by using the recently extended discard).

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index 71c0421..1ef3a5f 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2230,6 +2230,32 @@ fail:
     return ret;
 }
 
+static int qcow2_make_empty(BlockDriverState *bs)
+{
+    int ret = 0;
+    uint64_t start_sector;
+    int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
+
+    for (start_sector = 0; start_sector < bs->total_sectors;
+         start_sector += sector_step)
+    {
+        /* As this function is generally used after committing an external
+         * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
+         * default action for this kind of discard is to pass the discard,
+         * which will ideally result in an actually smaller image file, as
+         * is probably desired. */
+        ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
+                                     MIN(sector_step,
+                                         bs->total_sectors - start_sector),
+                                     QCOW2_DISCARD_SNAPSHOT, true);
+        if (ret < 0) {
+            break;
+        }
+    }
+
+    return ret;
+}
+
 static coroutine_fn int qcow2_co_flush_to_os(BlockDriverState *bs)
 {
     BDRVQcowState *s = bs->opaque;
@@ -2676,6 +2702,7 @@ static BlockDriver bdrv_qcow2 = {
     .bdrv_co_discard        = qcow2_co_discard,
     .bdrv_truncate          = qcow2_truncate,
     .bdrv_write_compressed  = qcow2_write_compressed,
+    .bdrv_make_empty        = qcow2_make_empty,
 
     .bdrv_snapshot_create   = qcow2_snapshot_create,
     .bdrv_snapshot_goto     = qcow2_snapshot_goto,
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v13 03/14] qcow2: Optimize bdrv_make_empty()
  2014-10-22 12:51 [Qemu-devel] [PATCH v13 00/14] qemu-img: Implement commit like QMP Max Reitz
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 01/14] qcow2: Allow "full" discard Max Reitz
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 02/14] qcow2: Implement bdrv_make_empty() Max Reitz
@ 2014-10-22 12:51 ` Max Reitz
  2014-10-22 16:04   ` Eric Blake
  2014-10-22 18:35   ` Kevin Wolf
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 04/14] blockjob: Introduce block_job_complete_sync() Max Reitz
                   ` (10 subsequent siblings)
  13 siblings, 2 replies; 43+ messages in thread
From: Max Reitz @ 2014-10-22 12:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

bdrv_make_empty() is currently only called if the current image
represents an external snapshot that has been committed to its base
image; it is therefore unlikely to have internal snapshots. In this
case, bdrv_make_empty() can be greatly sped up by emptying the L1 and
refcount table (while having the dirty flag set, which only works for
compat=1.1) and creating a trivial refcount structure.

If there are snapshots or for compat=0.10, fall back to the simple
implementation (discard all clusters).

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/blkdebug.c      |   2 +
 block/qcow2.c         | 143 ++++++++++++++++++++++++++++++++++++++++++++------
 include/block/block.h |   2 +
 3 files changed, 132 insertions(+), 15 deletions(-)

diff --git a/block/blkdebug.c b/block/blkdebug.c
index e046b92..862d93b 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -195,6 +195,8 @@ static const char *event_names[BLKDBG_EVENT_MAX] = {
     [BLKDBG_PWRITEV]                        = "pwritev",
     [BLKDBG_PWRITEV_ZERO]                   = "pwritev_zero",
     [BLKDBG_PWRITEV_DONE]                   = "pwritev_done",
+
+    [BLKDBG_EMPTY_IMAGE_PREPARE]            = "empty_image_prepare",
 };
 
 static int get_event_by_name(const char *name, BlkDebugEvent *event)
diff --git a/block/qcow2.c b/block/qcow2.c
index 1ef3a5f..16dece2 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2232,24 +2232,137 @@ fail:
 
 static int qcow2_make_empty(BlockDriverState *bs)
 {
+    BDRVQcowState *s = bs->opaque;
     int ret = 0;
-    uint64_t start_sector;
-    int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
 
-    for (start_sector = 0; start_sector < bs->total_sectors;
-         start_sector += sector_step)
-    {
-        /* As this function is generally used after committing an external
-         * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
-         * default action for this kind of discard is to pass the discard,
-         * which will ideally result in an actually smaller image file, as
-         * is probably desired. */
-        ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
-                                     MIN(sector_step,
-                                         bs->total_sectors - start_sector),
-                                     QCOW2_DISCARD_SNAPSHOT, true);
+    if (s->snapshots || s->qcow_version < 3) {
+        uint64_t start_sector;
+        int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
+
+        /* If there are snapshots, every active cluster has to be discarded; and
+         * because compat=0.10 does not support setting the dirty flag, we have
+         * to use this fallback there as well */
+
+        for (start_sector = 0; start_sector < bs->total_sectors;
+             start_sector += sector_step)
+        {
+            /* As this function is generally used after committing an external
+             * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
+             * default action for this kind of discard is to pass the discard,
+             * which will ideally result in an actually smaller image file, as
+             * is probably desired. */
+            ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
+                                         MIN(sector_step,
+                                             bs->total_sectors - start_sector),
+                                         QCOW2_DISCARD_SNAPSHOT, true);
+            if (ret < 0) {
+                break;
+            }
+        }
+    } else {
+        int l1_clusters;
+        int64_t offset;
+        uint64_t *new_reftable;
+        uint64_t rt_entry;
+        struct {
+            uint64_t l1_offset;
+            uint64_t reftable_offset;
+            uint32_t reftable_clusters;
+        } QEMU_PACKED l1_ofs_rt_ofs_cls;
+
+        ret = qcow2_cache_empty(bs, s->l2_table_cache);
         if (ret < 0) {
-            break;
+            return ret;
+        }
+
+        ret = qcow2_cache_empty(bs, s->refcount_block_cache);
+        if (ret < 0) {
+            return ret;
+        }
+
+        /* Refcounts will be broken utterly */
+        ret = qcow2_mark_dirty(bs);
+        if (ret < 0) {
+            return ret;
+        }
+
+        l1_clusters = DIV_ROUND_UP(s->l1_size,
+                                   s->cluster_size / sizeof(uint64_t));
+        new_reftable = g_try_new0(uint64_t, s->cluster_size / sizeof(uint64_t));
+        if (!new_reftable) {
+            return -ENOMEM;
+        }
+
+        BLKDBG_EVENT(bs->file, BLKDBG_EMPTY_IMAGE_PREPARE);
+
+        /* Overwrite enough clusters at the beginning of the sectors to place
+         * the refcount table, a refcount block and the L1 table in; this may
+         * overwrite parts of the existing refcount and L1 table, which is not
+         * an issue because the dirty flag is set, complete data loss is in fact
+         * desired and partial data loss is consequently fine as well */
+        ret = bdrv_write_zeroes(bs->file, s->cluster_size / BDRV_SECTOR_SIZE,
+                                (2 + l1_clusters) * s->cluster_size /
+                                BDRV_SECTOR_SIZE, 0);
+        if (ret < 0) {
+            g_free(new_reftable);
+            return ret;
+        }
+
+        BLKDBG_EVENT(bs->file, BLKDBG_L1_UPDATE);
+        BLKDBG_EVENT(bs->file, BLKDBG_REFTABLE_UPDATE);
+
+        /* "Create" an empty reftable (one cluster) directly after the image
+         * header and an empty L1 table three clusters after the image header;
+         * the cluster between those two will be used as the first refblock */
+        cpu_to_be64w(&l1_ofs_rt_ofs_cls.l1_offset, 3 * s->cluster_size);
+        cpu_to_be64w(&l1_ofs_rt_ofs_cls.reftable_offset, s->cluster_size);
+        cpu_to_be32w(&l1_ofs_rt_ofs_cls.reftable_clusters, 1);
+        ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, l1_table_offset),
+                               &l1_ofs_rt_ofs_cls, sizeof(l1_ofs_rt_ofs_cls));
+        if (ret < 0) {
+            g_free(new_reftable);
+            return ret;
+        }
+
+        s->l1_table_offset = 3 * s->cluster_size;
+        memset(s->l1_table, 0, s->l1_size * sizeof(uint64_t));
+
+        s->refcount_table_offset = s->cluster_size;
+        s->refcount_table_size   = s->cluster_size / sizeof(uint64_t);
+
+        g_free(s->refcount_table);
+        s->refcount_table = new_reftable;
+
+        BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_ALLOC);
+
+        /* Enter the first refblock into the reftable */
+        rt_entry = cpu_to_be64(2 * s->cluster_size);
+        ret = bdrv_pwrite_sync(bs->file, s->cluster_size,
+                               &rt_entry, sizeof(rt_entry));
+        if (ret < 0) {
+            return ret;
+        }
+
+        s->refcount_table[0] = 2 * s->cluster_size;
+
+        ret = bdrv_truncate(bs->file, (3 + l1_clusters) * s->cluster_size);
+        if (ret < 0) {
+            return ret;
+        }
+
+        s->free_cluster_index = 0;
+        offset = qcow2_alloc_clusters(bs, 3 * s->cluster_size +
+                                      s->l1_size * sizeof(uint64_t));
+        if (offset < 0) {
+            return offset;
+        } else if (offset > 0) {
+            error_report("First cluster in emptied image is in use");
+            abort();
+        }
+
+        ret = qcow2_mark_clean(bs);
+        if (ret < 0) {
+            return ret;
         }
     }
 
diff --git a/include/block/block.h b/include/block/block.h
index 341054d..b1f4385 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -498,6 +498,8 @@ typedef enum {
     BLKDBG_PWRITEV_ZERO,
     BLKDBG_PWRITEV_DONE,
 
+    BLKDBG_EMPTY_IMAGE_PREPARE,
+
     BLKDBG_EVENT_MAX,
 } BlkDebugEvent;
 
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v13 04/14] blockjob: Introduce block_job_complete_sync()
  2014-10-22 12:51 [Qemu-devel] [PATCH v13 00/14] qemu-img: Implement commit like QMP Max Reitz
                   ` (2 preceding siblings ...)
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 03/14] qcow2: Optimize bdrv_make_empty() Max Reitz
@ 2014-10-22 12:51 ` Max Reitz
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 05/14] blockjob: Add "ready" field Max Reitz
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 43+ messages in thread
From: Max Reitz @ 2014-10-22 12:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

Implement block_job_complete_sync() by doing the exact same thing as
block_job_cancel_sync() does, only with calling block_job_complete()
instead of block_job_cancel().

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
---
 blockjob.c               | 39 ++++++++++++++++++++++++++++++++-------
 include/block/blockjob.h | 15 +++++++++++++++
 2 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/blockjob.c b/blockjob.c
index ff0908a..a7d57e3 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -153,7 +153,7 @@ void block_job_iostatus_reset(BlockJob *job)
     }
 }
 
-struct BlockCancelData {
+struct BlockFinishData {
     BlockJob *job;
     BlockCompletionFunc *cb;
     void *opaque;
@@ -161,19 +161,22 @@ struct BlockCancelData {
     int ret;
 };
 
-static void block_job_cancel_cb(void *opaque, int ret)
+static void block_job_finish_cb(void *opaque, int ret)
 {
-    struct BlockCancelData *data = opaque;
+    struct BlockFinishData *data = opaque;
 
     data->cancelled = block_job_is_cancelled(data->job);
     data->ret = ret;
     data->cb(data->opaque, ret);
 }
 
-int block_job_cancel_sync(BlockJob *job)
+static int block_job_finish_sync(BlockJob *job,
+                                 void (*finish)(BlockJob *, Error **errp),
+                                 Error **errp)
 {
-    struct BlockCancelData data;
+    struct BlockFinishData data;
     BlockDriverState *bs = job->bs;
+    Error *local_err = NULL;
 
     assert(bs->job == job);
 
@@ -184,15 +187,37 @@ int block_job_cancel_sync(BlockJob *job)
     data.cb = job->cb;
     data.opaque = job->opaque;
     data.ret = -EINPROGRESS;
-    job->cb = block_job_cancel_cb;
+    job->cb = block_job_finish_cb;
     job->opaque = &data;
-    block_job_cancel(job);
+    finish(job, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return -EBUSY;
+    }
     while (data.ret == -EINPROGRESS) {
         aio_poll(bdrv_get_aio_context(bs), true);
     }
     return (data.cancelled && data.ret == 0) ? -ECANCELED : data.ret;
 }
 
+/* A wrapper around block_job_cancel() taking an Error ** parameter so it may be
+ * used with block_job_finish_sync() without the need for (rather nasty)
+ * function pointer casts there. */
+static void block_job_cancel_err(BlockJob *job, Error **errp)
+{
+    block_job_cancel(job);
+}
+
+int block_job_cancel_sync(BlockJob *job)
+{
+    return block_job_finish_sync(job, &block_job_cancel_err, NULL);
+}
+
+int block_job_complete_sync(BlockJob *job, Error **errp)
+{
+    return block_job_finish_sync(job, &block_job_complete, errp);
+}
+
 void block_job_sleep_ns(BlockJob *job, QEMUClockType type, int64_t ns)
 {
     assert(job->busy);
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index acb399f..ab11a0f 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -273,6 +273,21 @@ bool block_job_is_paused(BlockJob *job);
 int block_job_cancel_sync(BlockJob *job);
 
 /**
+ * block_job_complete_sync:
+ * @job: The job to be completed.
+ * @errp: Error object which may be set by block_job_complete(); this is not
+ *        necessarily set on every error, the job return value has to be
+ *        checked as well.
+ *
+ * Synchronously complete the job.  The completion callback is called before the
+ * function returns, unless it is NULL (which is permissible when using this
+ * function).
+ *
+ * Returns the return value from the job.
+ */
+int block_job_complete_sync(BlockJob *job, Error **errp);
+
+/**
  * block_job_iostatus_reset:
  * @job: The job whose I/O status should be reset.
  *
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v13 05/14] blockjob: Add "ready" field
  2014-10-22 12:51 [Qemu-devel] [PATCH v13 00/14] qemu-img: Implement commit like QMP Max Reitz
                   ` (3 preceding siblings ...)
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 04/14] blockjob: Introduce block_job_complete_sync() Max Reitz
@ 2014-10-22 12:51 ` Max Reitz
  2014-10-23 10:01   ` Kevin Wolf
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 06/14] iotests: Omit length/offset test in 040 and 041 Max Reitz
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 43+ messages in thread
From: Max Reitz @ 2014-10-22 12:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

When a block job signals readiness, this is currently reported only
through QMP. If qemu wants to use block jobs for internal tasks, there
needs to be another way to correctly detect when a block job may be
completed.

For this reason, introduce a bool "ready" which is set when the block
job may be completed.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 blockjob.c               | 3 +++
 include/block/blockjob.h | 5 +++++
 qapi/block-core.json     | 4 +++-
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/blockjob.c b/blockjob.c
index a7d57e3..448b9ce 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -261,6 +261,7 @@ BlockJobInfo *block_job_query(BlockJob *job)
     info->offset    = job->offset;
     info->speed     = job->speed;
     info->io_status = job->iostatus;
+    info->ready     = job->ready;
     return info;
 }
 
@@ -296,6 +297,8 @@ void block_job_event_completed(BlockJob *job, const char *msg)
 
 void block_job_event_ready(BlockJob *job)
 {
+    job->ready = true;
+
     qapi_event_send_block_job_ready(job->driver->job_type,
                                     bdrv_get_device_name(job->bs),
                                     job->len,
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index ab11a0f..9694f13 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -91,6 +91,11 @@ struct BlockJob {
      */
     bool busy;
 
+    /**
+     * Set to true when the job is ready to be completed.
+     */
+    bool ready;
+
     /** Status that is published by the query-block-jobs QMP API */
     BlockDeviceIoStatus iostatus;
 
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 8f7089e..77a0cfb 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -514,12 +514,14 @@
 #
 # @io-status: the status of the job (since 1.3)
 #
+# @ready: true if the job may be completed (since 2.2)
+#
 # Since: 1.1
 ##
 { 'type': 'BlockJobInfo',
   'data': {'type': 'str', 'device': 'str', 'len': 'int',
            'offset': 'int', 'busy': 'bool', 'paused': 'bool', 'speed': 'int',
-           'io-status': 'BlockDeviceIoStatus'} }
+           'io-status': 'BlockDeviceIoStatus', 'ready': 'bool'} }
 
 ##
 # @query-block-jobs:
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v13 06/14] iotests: Omit length/offset test in 040 and 041
  2014-10-22 12:51 [Qemu-devel] [PATCH v13 00/14] qemu-img: Implement commit like QMP Max Reitz
                   ` (4 preceding siblings ...)
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 05/14] blockjob: Add "ready" field Max Reitz
@ 2014-10-22 12:51 ` Max Reitz
  2014-10-23 10:06   ` Kevin Wolf
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 07/14] block/mirror: Improve progress report Max Reitz
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 43+ messages in thread
From: Max Reitz @ 2014-10-22 12:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

As of a follow-up patch to this one, the length of a mirror block job
will no longer directly depend on the size of the block device;
therefore, drop these checks from this test. Instead, just check whether
the final offset equals the block job length.

As 041 uses the wait_until_completed function from iotests.py, the same
applies there as well which in turn affects tests 030, 055 and 056. On
the other hand, a block job's length does not have to be related to the
length of the image file in the first place, so that check was
questionable anyway.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 tests/qemu-iotests/040        | 4 +---
 tests/qemu-iotests/041        | 3 +--
 tests/qemu-iotests/iotests.py | 3 +--
 3 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/tests/qemu-iotests/040 b/tests/qemu-iotests/040
index f1e16c1..2b432ad 100755
--- a/tests/qemu-iotests/040
+++ b/tests/qemu-iotests/040
@@ -43,8 +43,7 @@ class ImageCommitTestCase(iotests.QMPTestCase):
                 if event['event'] == 'BLOCK_JOB_COMPLETED':
                     self.assert_qmp(event, 'data/type', 'commit')
                     self.assert_qmp(event, 'data/device', 'drive0')
-                    self.assert_qmp(event, 'data/offset', self.image_len)
-                    self.assert_qmp(event, 'data/len', self.image_len)
+                    self.assert_qmp(event, 'data/offset', event['data']['len'])
                     if need_ready:
                         self.assertTrue(ready, "Expecting BLOCK_JOB_COMPLETED event")
                     completed = True
@@ -52,7 +51,6 @@ class ImageCommitTestCase(iotests.QMPTestCase):
                     ready = True
                     self.assert_qmp(event, 'data/type', 'commit')
                     self.assert_qmp(event, 'data/device', 'drive0')
-                    self.assert_qmp(event, 'data/len', self.image_len)
                     self.vm.qmp('block-job-complete', device='drive0')
 
         self.assert_no_active_block_jobs()
diff --git a/tests/qemu-iotests/041 b/tests/qemu-iotests/041
index 5dbd4ee..90721cc 100755
--- a/tests/qemu-iotests/041
+++ b/tests/qemu-iotests/041
@@ -52,8 +52,7 @@ class ImageMirroringTestCase(iotests.QMPTestCase):
         event = self.cancel_and_wait(drive=drive)
         self.assertEquals(event['event'], 'BLOCK_JOB_COMPLETED')
         self.assert_qmp(event, 'data/type', 'mirror')
-        self.assert_qmp(event, 'data/offset', self.image_len)
-        self.assert_qmp(event, 'data/len', self.image_len)
+        self.assert_qmp(event, 'data/offset', event['data']['len'])
 
     def complete_and_wait(self, drive='drive0', wait_ready=True):
         '''Complete a block job and wait for it to finish'''
diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 39a4cfc..f57f154 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -267,8 +267,7 @@ class QMPTestCase(unittest.TestCase):
                     self.assert_qmp(event, 'data/device', drive)
                     self.assert_qmp_absent(event, 'data/error')
                     if check_offset:
-                        self.assert_qmp(event, 'data/offset', self.image_len)
-                    self.assert_qmp(event, 'data/len', self.image_len)
+                        self.assert_qmp(event, 'data/offset', event['data']['len'])
                     completed = True
 
         self.assert_no_active_block_jobs()
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v13 07/14] block/mirror: Improve progress report
  2014-10-22 12:51 [Qemu-devel] [PATCH v13 00/14] qemu-img: Implement commit like QMP Max Reitz
                   ` (5 preceding siblings ...)
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 06/14] iotests: Omit length/offset test in 040 and 041 Max Reitz
@ 2014-10-22 12:51 ` Max Reitz
  2014-10-23 10:52   ` Kevin Wolf
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 08/14] qemu-img: Implement commit like QMP Max Reitz
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 43+ messages in thread
From: Max Reitz @ 2014-10-22 12:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

Instead of taking the total length of the block device as the block
job's length, use the number of dirty sectors. The progress is now the
number of sectors mirrored to the target block device. Note that this
may result in the job's length increasing during operation, which is
however in fact desirable.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 block/mirror.c | 34 ++++++++++++++++++++++------------
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index e8a43eb..2a1acfe 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -45,6 +45,7 @@ typedef struct MirrorBlockJob {
     int64_t sector_num;
     int64_t granularity;
     size_t buf_size;
+    int64_t bdev_length;
     unsigned long *cow_bitmap;
     BdrvDirtyBitmap *dirty_bitmap;
     HBitmapIter hbi;
@@ -54,6 +55,7 @@ typedef struct MirrorBlockJob {
 
     unsigned long *in_flight_bitmap;
     int in_flight;
+    int sectors_in_flight;
     int ret;
 } MirrorBlockJob;
 
@@ -87,6 +89,7 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
     trace_mirror_iteration_done(s, op->sector_num, op->nb_sectors, ret);
 
     s->in_flight--;
+    s->sectors_in_flight -= op->nb_sectors;
     iov = op->qiov.iov;
     for (i = 0; i < op->qiov.niov; i++) {
         MirrorBuffer *buf = (MirrorBuffer *) iov[i].iov_base;
@@ -98,8 +101,11 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
     chunk_num = op->sector_num / sectors_per_chunk;
     nb_chunks = op->nb_sectors / sectors_per_chunk;
     bitmap_clear(s->in_flight_bitmap, chunk_num, nb_chunks);
-    if (s->cow_bitmap && ret >= 0) {
-        bitmap_set(s->cow_bitmap, chunk_num, nb_chunks);
+    if (ret >= 0) {
+        if (s->cow_bitmap) {
+            bitmap_set(s->cow_bitmap, chunk_num, nb_chunks);
+        }
+        s->common.offset += (uint64_t)op->nb_sectors * BDRV_SECTOR_SIZE;
     }
 
     qemu_iovec_destroy(&op->qiov);
@@ -172,7 +178,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
     hbitmap_next_sector = s->sector_num;
     sector_num = s->sector_num;
     sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
-    end = s->common.len >> BDRV_SECTOR_BITS;
+    end = s->bdev_length / BDRV_SECTOR_SIZE;
 
     /* Extend the QEMUIOVector to include all adjacent blocks that will
      * be copied in this operation.
@@ -284,6 +290,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 
     /* Copy the dirty cluster.  */
     s->in_flight++;
+    s->sectors_in_flight += nb_sectors;
     trace_mirror_one_iteration(s, sector_num, nb_sectors);
     bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
                    mirror_read_complete, op);
@@ -329,11 +336,11 @@ static void coroutine_fn mirror_run(void *opaque)
         goto immediate_exit;
     }
 
-    s->common.len = bdrv_getlength(bs);
-    if (s->common.len < 0) {
-        ret = s->common.len;
+    s->bdev_length = bdrv_getlength(bs);
+    if (s->bdev_length < 0) {
+        ret = s->bdev_length;
         goto immediate_exit;
-    } else if (s->common.len == 0) {
+    } else if (s->bdev_length == 0) {
         /* Report BLOCK_JOB_READY and wait for complete. */
         block_job_event_ready(&s->common);
         s->synced = true;
@@ -344,7 +351,7 @@ static void coroutine_fn mirror_run(void *opaque)
         goto immediate_exit;
     }
 
-    length = DIV_ROUND_UP(s->common.len, s->granularity);
+    length = DIV_ROUND_UP(s->bdev_length, s->granularity);
     s->in_flight_bitmap = bitmap_new(length);
 
     /* If we have no backing file yet in the destination, we cannot let
@@ -364,7 +371,7 @@ static void coroutine_fn mirror_run(void *opaque)
         }
     }
 
-    end = s->common.len >> BDRV_SECTOR_BITS;
+    end = s->bdev_length / BDRV_SECTOR_SIZE;
     s->buf = qemu_try_blockalign(bs, s->buf_size);
     if (s->buf == NULL) {
         ret = -ENOMEM;
@@ -409,6 +416,12 @@ static void coroutine_fn mirror_run(void *opaque)
         }
 
         cnt = bdrv_get_dirty_count(bs, s->dirty_bitmap);
+        /* s->common.offset contains the number of bytes already processed so
+         * far, cnt is the number of dirty sectors remaining and
+         * s->sectors_in_flight is the number of sectors currently being
+         * processed; together those are the current total operation length */
+        s->common.len = s->common.offset +
+                        (cnt + s->sectors_in_flight) * BDRV_SECTOR_SIZE;
 
         /* Note that even when no rate limit is applied we need to yield
          * periodically with no pending I/O so that qemu_aio_flush() returns.
@@ -445,7 +458,6 @@ static void coroutine_fn mirror_run(void *opaque)
                  * report completion.  This way, block-job-cancel will leave
                  * the target in a consistent state.
                  */
-                s->common.offset = end * BDRV_SECTOR_SIZE;
                 if (!s->synced) {
                     block_job_event_ready(&s->common);
                     s->synced = true;
@@ -474,8 +486,6 @@ static void coroutine_fn mirror_run(void *opaque)
         ret = 0;
         trace_mirror_before_sleep(s, cnt, s->synced, delay_ns);
         if (!s->synced) {
-            /* Publish progress */
-            s->common.offset = (end - cnt) * BDRV_SECTOR_SIZE;
             block_job_sleep_ns(&s->common, QEMU_CLOCK_REALTIME, delay_ns);
             if (block_job_is_cancelled(&s->common)) {
                 break;
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v13 08/14] qemu-img: Implement commit like QMP
  2014-10-22 12:51 [Qemu-devel] [PATCH v13 00/14] qemu-img: Implement commit like QMP Max Reitz
                   ` (6 preceding siblings ...)
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 07/14] block/mirror: Improve progress report Max Reitz
@ 2014-10-22 12:51 ` Max Reitz
  2014-10-22 16:22   ` Eric Blake
  2014-10-23 11:59   ` Kevin Wolf
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 09/14] qemu-img: Empty image after commit Max Reitz
                   ` (5 subsequent siblings)
  13 siblings, 2 replies; 43+ messages in thread
From: Max Reitz @ 2014-10-22 12:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

qemu-img should use QMP commands whenever possible in order to ensure
feature completeness of both online and offline image operations. As
qemu-img itself has no access to QMP (since this would basically require
just everything being linked into qemu-img), imitate QMP's
implementation of block-commit by using commit_active_start() and then
waiting for the block job to finish.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/Makefile.objs |  3 +-
 qemu-img.c          | 82 ++++++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 64 insertions(+), 21 deletions(-)

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 27911b6..04b0e43 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -9,7 +9,7 @@ block-obj-y += block-backend.o snapshot.o qapi.o
 block-obj-$(CONFIG_WIN32) += raw-win32.o win32-aio.o
 block-obj-$(CONFIG_POSIX) += raw-posix.o
 block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
-block-obj-y += null.o
+block-obj-y += null.o mirror.o
 
 block-obj-y += nbd.o nbd-client.o sheepdog.o
 block-obj-$(CONFIG_LIBISCSI) += iscsi.o
@@ -23,7 +23,6 @@ block-obj-y += accounting.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
-common-obj-y += mirror.o
 common-obj-y += backup.o
 
 iscsi.o-cflags     := $(LIBISCSI_CFLAGS)
diff --git a/qemu-img.c b/qemu-img.c
index 09e7e72..f1f2857 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -31,6 +31,7 @@
 #include "sysemu/sysemu.h"
 #include "sysemu/block-backend.h"
 #include "block/block_int.h"
+#include "block/blockjob.h"
 #include "block/qapi.h"
 #include <getopt.h>
 
@@ -715,13 +716,47 @@ fail:
     return ret;
 }
 
+typedef struct CommonBlockJobCBInfo {
+    BlockDriverState *bs;
+    Error **errp;
+} CommonBlockJobCBInfo;
+
+static void common_block_job_cb(void *opaque, int ret)
+{
+    CommonBlockJobCBInfo *cbi = opaque;
+
+    if (ret < 0) {
+        error_setg_errno(cbi->errp, -ret, "Block job failed");
+    }
+
+    /* Drop this block job's reference */
+    bdrv_unref(cbi->bs);
+}
+
+static void run_block_job(BlockJob *job, Error **errp)
+{
+    AioContext *aio_context = bdrv_get_aio_context(job->bs);
+
+    do {
+        aio_poll(aio_context, true);
+
+        if (!job->busy && !job->ready) {
+            block_job_resume(job);
+        }
+    } while (!job->ready);
+
+    block_job_complete_sync(job, errp);
+}
+
 static int img_commit(int argc, char **argv)
 {
     int c, ret, flags;
     const char *filename, *fmt, *cache;
     BlockBackend *blk;
-    BlockDriverState *bs;
+    BlockDriverState *bs, *base_bs;
     bool quiet = false;
+    Error *local_err = NULL;
+    CommonBlockJobCBInfo cbi;
 
     fmt = NULL;
     cache = BDRV_DEFAULT_CACHE;
@@ -764,29 +799,38 @@ static int img_commit(int argc, char **argv)
     }
     bs = blk_bs(blk);
 
-    ret = bdrv_commit(bs);
-    switch(ret) {
-    case 0:
-        qprintf(quiet, "Image committed.\n");
-        break;
-    case -ENOENT:
-        error_report("No disk inserted");
-        break;
-    case -EACCES:
-        error_report("Image is read-only");
-        break;
-    case -ENOTSUP:
-        error_report("Image is already committed");
-        break;
-    default:
-        error_report("Error while committing image");
-        break;
+    /* This is different from QMP, which by default uses the deepest file in the
+     * backing chain (i.e., the very base); however, the traditional behavior of
+     * qemu-img commit is using the immediate backing file. */
+    base_bs = bs->backing_hd;
+    if (!base_bs) {
+        error_set(&local_err, QERR_BASE_NOT_FOUND, "NULL");
+        goto done;
     }
 
+    cbi = (CommonBlockJobCBInfo){
+        .errp = &local_err,
+        .bs   = bs,
+    };
+
+    commit_active_start(bs, base_bs, 0, BLOCKDEV_ON_ERROR_REPORT,
+                        common_block_job_cb, &cbi, &local_err);
+    if (local_err) {
+        goto done;
+    }
+
+    run_block_job(bs->job, &local_err);
+
+done:
     blk_unref(blk);
-    if (ret) {
+
+    if (local_err) {
+        qerror_report_err(local_err);
+        error_free(local_err);
         return 1;
     }
+
+    qprintf(quiet, "Image committed.\n");
     return 0;
 }
 
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v13 09/14] qemu-img: Empty image after commit
  2014-10-22 12:51 [Qemu-devel] [PATCH v13 00/14] qemu-img: Implement commit like QMP Max Reitz
                   ` (7 preceding siblings ...)
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 08/14] qemu-img: Implement commit like QMP Max Reitz
@ 2014-10-22 12:51 ` Max Reitz
  2014-10-23 12:55   ` Kevin Wolf
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 10/14] qemu-img: Enable progress output for commit Max Reitz
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 43+ messages in thread
From: Max Reitz @ 2014-10-22 12:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

After the top image has been committed, it should be emptied unless
specified otherwise.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 qemu-img-cmds.hx |  4 ++--
 qemu-img.c       | 34 +++++++++++++++++++++++++++++++---
 qemu-img.texi    |  6 +++++-
 3 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index 55aec6b..e2ceadf 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -22,9 +22,9 @@ STEXI
 ETEXI
 
 DEF("commit", img_commit,
-    "commit [-q] [-f fmt] [-t cache] filename")
+    "commit [-q] [-f fmt] [-t cache] [-d] filename")
 STEXI
-@item commit [-q] [-f @var{fmt}] [-t @var{cache}] @var{filename}
+@item commit [-q] [-f @var{fmt}] [-t @var{cache}] [-d] @var{filename}
 ETEXI
 
 DEF("compare", img_compare,
diff --git a/qemu-img.c b/qemu-img.c
index f1f2857..5093bfa 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -754,14 +754,14 @@ static int img_commit(int argc, char **argv)
     const char *filename, *fmt, *cache;
     BlockBackend *blk;
     BlockDriverState *bs, *base_bs;
-    bool quiet = false;
+    bool quiet = false, drop = false;
     Error *local_err = NULL;
     CommonBlockJobCBInfo cbi;
 
     fmt = NULL;
     cache = BDRV_DEFAULT_CACHE;
     for(;;) {
-        c = getopt(argc, argv, "f:ht:q");
+        c = getopt(argc, argv, "f:ht:dq");
         if (c == -1) {
             break;
         }
@@ -776,6 +776,9 @@ static int img_commit(int argc, char **argv)
         case 't':
             cache = optarg;
             break;
+        case 'd':
+            drop = true;
+            break;
         case 'q':
             quiet = true;
             break;
@@ -786,7 +789,7 @@ static int img_commit(int argc, char **argv)
     }
     filename = argv[optind++];
 
-    flags = BDRV_O_RDWR;
+    flags = BDRV_O_RDWR | BDRV_O_UNMAP;
     ret = bdrv_parse_cache_flags(cache, &flags);
     if (ret < 0) {
         error_report("Invalid cache option: %s", cache);
@@ -819,7 +822,32 @@ static int img_commit(int argc, char **argv)
         goto done;
     }
 
+    /* The block job will swap base_bs and bs (which is not what we really want
+     * here, but okay) and unref base_bs (after the swap, i.e., the old top
+     * image). In order to still be able to empty that top image afterwards,
+     * increment the reference counter here preemptively. */
+    if (!drop) {
+        bdrv_ref(base_bs);
+    }
+
     run_block_job(bs->job, &local_err);
+    if (local_err) {
+        goto unref_backing;
+    }
+
+    if (!drop && base_bs->drv->bdrv_make_empty) {
+        ret = base_bs->drv->bdrv_make_empty(base_bs);
+        if (ret) {
+            error_setg_errno(&local_err, -ret, "Could not empty %s",
+                             filename);
+            goto unref_backing;
+        }
+    }
+
+unref_backing:
+    if (!drop) {
+        bdrv_unref(base_bs);
+    }
 
 done:
     blk_unref(blk);
diff --git a/qemu-img.texi b/qemu-img.texi
index e3ef4a0..420bd91 100644
--- a/qemu-img.texi
+++ b/qemu-img.texi
@@ -167,7 +167,7 @@ this case. @var{backing_file} will never be modified unless you use the
 The size can also be specified using the @var{size} option with @code{-o},
 it doesn't need to be specified separately in this case.
 
-@item commit [-f @var{fmt}] [-t @var{cache}] @var{filename}
+@item commit [-f @var{fmt}] [-t @var{cache}] [-d] @var{filename}
 
 Commit the changes recorded in @var{filename} in its base image or backing file.
 If the backing file is smaller than the snapshot, then the backing file will be
@@ -176,6 +176,10 @@ the backing file, the backing file will not be truncated.  If you want the
 backing file to match the size of the smaller snapshot, you can safely truncate
 it yourself once the commit operation successfully completes.
 
+The image @var{filename} is emptied after the operation has succeeded. If you do
+not need @var{filename} afterwards and intend to drop it, you may skip emptying
+@var{filename} by specifying the @code{-d} flag.
+
 @item compare [-f @var{fmt}] [-F @var{fmt}] [-T @var{src_cache}] [-p] [-s] [-q] @var{filename1} @var{filename2}
 
 Check if two images have the same content. You can compare images with
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v13 10/14] qemu-img: Enable progress output for commit
  2014-10-22 12:51 [Qemu-devel] [PATCH v13 00/14] qemu-img: Implement commit like QMP Max Reitz
                   ` (8 preceding siblings ...)
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 09/14] qemu-img: Empty image after commit Max Reitz
@ 2014-10-22 12:51 ` Max Reitz
  2014-10-23 13:00   ` Kevin Wolf
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 11/14] qemu-img: Specify backing file " Max Reitz
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 43+ messages in thread
From: Max Reitz @ 2014-10-22 12:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

Implement progress output for the commit command by querying the
progress of the block job.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 qemu-img-cmds.hx |  4 ++--
 qemu-img.c       | 24 ++++++++++++++++++++++--
 qemu-img.texi    |  2 +-
 3 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index e2ceadf..fde33d1 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -22,9 +22,9 @@ STEXI
 ETEXI
 
 DEF("commit", img_commit,
-    "commit [-q] [-f fmt] [-t cache] [-d] filename")
+    "commit [-q] [-f fmt] [-t cache] [-d] [-p] filename")
 STEXI
-@item commit [-q] [-f @var{fmt}] [-t @var{cache}] [-d] @var{filename}
+@item commit [-q] [-f @var{fmt}] [-t @var{cache}] [-d] [-p] @var{filename}
 ETEXI
 
 DEF("compare", img_compare,
diff --git a/qemu-img.c b/qemu-img.c
index 5093bfa..4b0faf3 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -740,12 +740,18 @@ static void run_block_job(BlockJob *job, Error **errp)
     do {
         aio_poll(aio_context, true);
 
+        qemu_progress_print((float)job->offset / job->len * 100.f, 0);
+
         if (!job->busy && !job->ready) {
             block_job_resume(job);
         }
     } while (!job->ready);
 
     block_job_complete_sync(job, errp);
+
+    /* A block job may finish instantaneously without publishing any progress,
+     * so just signal completion here */
+    qemu_progress_print(100.f, 0);
 }
 
 static int img_commit(int argc, char **argv)
@@ -754,14 +760,14 @@ static int img_commit(int argc, char **argv)
     const char *filename, *fmt, *cache;
     BlockBackend *blk;
     BlockDriverState *bs, *base_bs;
-    bool quiet = false, drop = false;
+    bool progress = false, quiet = false, drop = false;
     Error *local_err = NULL;
     CommonBlockJobCBInfo cbi;
 
     fmt = NULL;
     cache = BDRV_DEFAULT_CACHE;
     for(;;) {
-        c = getopt(argc, argv, "f:ht:dq");
+        c = getopt(argc, argv, "f:ht:dpq");
         if (c == -1) {
             break;
         }
@@ -779,11 +785,20 @@ static int img_commit(int argc, char **argv)
         case 'd':
             drop = true;
             break;
+        case 'p':
+            progress = true;
+            break;
         case 'q':
             quiet = true;
             break;
         }
     }
+
+    /* Progress is not shown in Quiet mode */
+    if (quiet) {
+        progress = false;
+    }
+
     if (optind != argc - 1) {
         error_exit("Expecting one image file name");
     }
@@ -802,6 +817,9 @@ static int img_commit(int argc, char **argv)
     }
     bs = blk_bs(blk);
 
+    qemu_progress_init(progress, 1.f);
+    qemu_progress_print(0.f, 100);
+
     /* This is different from QMP, which by default uses the deepest file in the
      * backing chain (i.e., the very base); however, the traditional behavior of
      * qemu-img commit is using the immediate backing file. */
@@ -850,6 +868,8 @@ unref_backing:
     }
 
 done:
+    qemu_progress_end();
+
     blk_unref(blk);
 
     if (local_err) {
diff --git a/qemu-img.texi b/qemu-img.texi
index 420bd91..f82d1b4 100644
--- a/qemu-img.texi
+++ b/qemu-img.texi
@@ -167,7 +167,7 @@ this case. @var{backing_file} will never be modified unless you use the
 The size can also be specified using the @var{size} option with @code{-o},
 it doesn't need to be specified separately in this case.
 
-@item commit [-f @var{fmt}] [-t @var{cache}] [-d] @var{filename}
+@item commit [-q] [-f @var{fmt}] [-t @var{cache}] [-d] [-p] @var{filename}
 
 Commit the changes recorded in @var{filename} in its base image or backing file.
 If the backing file is smaller than the snapshot, then the backing file will be
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v13 11/14] qemu-img: Specify backing file for commit
  2014-10-22 12:51 [Qemu-devel] [PATCH v13 00/14] qemu-img: Implement commit like QMP Max Reitz
                   ` (9 preceding siblings ...)
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 10/14] qemu-img: Enable progress output for commit Max Reitz
@ 2014-10-22 12:51 ` Max Reitz
  2014-10-23 13:05   ` Kevin Wolf
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 12/14] iotests: Add _filter_qemu_img_map Max Reitz
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 43+ messages in thread
From: Max Reitz @ 2014-10-22 12:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

Introduce a new parameter for qemu-img commit which may be used to
explicitly specify the backing file into which an image should be
committed if the backing chain has more than a single layer.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 qemu-img-cmds.hx |  4 ++--
 qemu-img.c       | 24 +++++++++++++++++-------
 qemu-img.texi    |  9 ++++++++-
 3 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index fde33d1..04c207d 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -22,9 +22,9 @@ STEXI
 ETEXI
 
 DEF("commit", img_commit,
-    "commit [-q] [-f fmt] [-t cache] [-d] [-p] filename")
+    "commit [-q] [-f fmt] [-t cache] [-b base] [-d] [-p] filename")
 STEXI
-@item commit [-q] [-f @var{fmt}] [-t @var{cache}] [-d] [-p] @var{filename}
+@item commit [-q] [-f @var{fmt}] [-t @var{cache}] [-b @var{base}] [-d] [-p] @var{filename}
 ETEXI
 
 DEF("compare", img_compare,
diff --git a/qemu-img.c b/qemu-img.c
index 4b0faf3..65f35cb 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -757,7 +757,7 @@ static void run_block_job(BlockJob *job, Error **errp)
 static int img_commit(int argc, char **argv)
 {
     int c, ret, flags;
-    const char *filename, *fmt, *cache;
+    const char *filename, *fmt, *cache, *base;
     BlockBackend *blk;
     BlockDriverState *bs, *base_bs;
     bool progress = false, quiet = false, drop = false;
@@ -766,8 +766,9 @@ static int img_commit(int argc, char **argv)
 
     fmt = NULL;
     cache = BDRV_DEFAULT_CACHE;
+    base = NULL;
     for(;;) {
-        c = getopt(argc, argv, "f:ht:dpq");
+        c = getopt(argc, argv, "f:ht:b:dpq");
         if (c == -1) {
             break;
         }
@@ -782,6 +783,11 @@ static int img_commit(int argc, char **argv)
         case 't':
             cache = optarg;
             break;
+        case 'b':
+            base = optarg;
+            /* -b implies -d */
+            drop = true;
+            break;
         case 'd':
             drop = true;
             break;
@@ -820,12 +826,16 @@ static int img_commit(int argc, char **argv)
     qemu_progress_init(progress, 1.f);
     qemu_progress_print(0.f, 100);
 
-    /* This is different from QMP, which by default uses the deepest file in the
-     * backing chain (i.e., the very base); however, the traditional behavior of
-     * qemu-img commit is using the immediate backing file. */
-    base_bs = bs->backing_hd;
+    if (base) {
+        base_bs = bdrv_find_backing_image(bs, base);
+    } else {
+        /* This is different from QMP, which by default uses the deepest file in
+         * the backing chain (i.e., the very base); however, the traditional
+         * behavior of qemu-img commit is using the immediate backing file. */
+        base_bs = bs->backing_hd;
+    }
     if (!base_bs) {
-        error_set(&local_err, QERR_BASE_NOT_FOUND, "NULL");
+        error_set(&local_err, QERR_BASE_NOT_FOUND, base ?: "NULL");
         goto done;
     }
 
diff --git a/qemu-img.texi b/qemu-img.texi
index f82d1b4..828d522 100644
--- a/qemu-img.texi
+++ b/qemu-img.texi
@@ -167,7 +167,7 @@ this case. @var{backing_file} will never be modified unless you use the
 The size can also be specified using the @var{size} option with @code{-o},
 it doesn't need to be specified separately in this case.
 
-@item commit [-q] [-f @var{fmt}] [-t @var{cache}] [-d] [-p] @var{filename}
+@item commit [-q] [-f @var{fmt}] [-t @var{cache}] [-b @var{base}] [-d] [-p] @var{filename}
 
 Commit the changes recorded in @var{filename} in its base image or backing file.
 If the backing file is smaller than the snapshot, then the backing file will be
@@ -180,6 +180,13 @@ The image @var{filename} is emptied after the operation has succeeded. If you do
 not need @var{filename} afterwards and intend to drop it, you may skip emptying
 @var{filename} by specifying the @code{-d} flag.
 
+If the backing chain of the given image file @var{filename} has more than one
+layer, the backing file into which the changes will be committed may be
+specified as @var{base} (which has to be part of @var{filename}'s backing
+chain). If @var{base} is not specified, the immediate backing file of the top
+image (which is @var{filename}) will be used. For reasons of consistency,
+explicitly specifying @var{base} will always imply @code{-d}.
+
 @item compare [-f @var{fmt}] [-F @var{fmt}] [-T @var{src_cache}] [-p] [-s] [-q] @var{filename1} @var{filename2}
 
 Check if two images have the same content. You can compare images with
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v13 12/14] iotests: Add _filter_qemu_img_map
  2014-10-22 12:51 [Qemu-devel] [PATCH v13 00/14] qemu-img: Implement commit like QMP Max Reitz
                   ` (10 preceding siblings ...)
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 11/14] qemu-img: Specify backing file " Max Reitz
@ 2014-10-22 12:51 ` Max Reitz
  2014-10-23 13:08   ` Kevin Wolf
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 13/14] iotests: Add test for backing-chain commits Max Reitz
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 14/14] iotests: Add test for qcow2's bdrv_make_empty Max Reitz
  13 siblings, 1 reply; 43+ messages in thread
From: Max Reitz @ 2014-10-22 12:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

As different image formats most probably map guest addresses to
different host addresses, add a filter to filter the host addresses out;
also, the image filename should be filtered.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 tests/qemu-iotests/common.filter | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/tests/qemu-iotests/common.filter b/tests/qemu-iotests/common.filter
index f69cb6b..3acdb30 100644
--- a/tests/qemu-iotests/common.filter
+++ b/tests/qemu-iotests/common.filter
@@ -213,5 +213,12 @@ _filter_img_info()
         -e "s/archipelago:a/TEST_DIR\//g"
 }
 
+# filter out offsets and file names from qemu-img map
+_filter_qemu_img_map()
+{
+    sed -e 's/\([0-9a-fx]* *[0-9a-fx]* *\)[0-9a-fx]* */\1/g' \
+        -e 's/Mapped to *//' | _filter_testdir | _filter_imgfmt
+}
+
 # make sure this script returns success
 /bin/true
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v13 13/14] iotests: Add test for backing-chain commits
  2014-10-22 12:51 [Qemu-devel] [PATCH v13 00/14] qemu-img: Implement commit like QMP Max Reitz
                   ` (11 preceding siblings ...)
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 12/14] iotests: Add _filter_qemu_img_map Max Reitz
@ 2014-10-22 12:51 ` Max Reitz
  2014-10-23 13:24   ` Kevin Wolf
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 14/14] iotests: Add test for qcow2's bdrv_make_empty Max Reitz
  13 siblings, 1 reply; 43+ messages in thread
From: Max Reitz @ 2014-10-22 12:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

Add a test for qemu-img commit on backing chains with more than two
images. This test also checks whether the top image is emptied (unless
this is prevented by specifying either -d or -b) and does therefore not
work for qed and vmdk which requires it to be separate from 020.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 tests/qemu-iotests/097     | 122 +++++++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/097.out | 119 +++++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/group   |   1 +
 3 files changed, 242 insertions(+)
 create mode 100755 tests/qemu-iotests/097
 create mode 100644 tests/qemu-iotests/097.out

diff --git a/tests/qemu-iotests/097 b/tests/qemu-iotests/097
new file mode 100755
index 0000000..c7a613b
--- /dev/null
+++ b/tests/qemu-iotests/097
@@ -0,0 +1,122 @@
+#!/bin/bash
+#
+# Commit changes into backing chains and empty the top image if the
+# backing image is not explicitly specified
+#
+# Copyright (C) 2014 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+# creator
+owner=mreitz@redhat.com
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+
+here="$PWD"
+tmp=/tmp/$$
+status=1	# failure is the default!
+
+_cleanup()
+{
+    _cleanup_test_img
+    _rm_test_img "$TEST_IMG.itmd"
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+. ./common.pattern
+
+# Any format supporting backing files and bdrv_make_empty
+_supported_fmt qcow qcow2
+_supported_proto file
+_supported_os Linux
+
+
+# Four passes:
+#  0: Two-layer backing chain, commit to upper backing file (implicitly)
+#     (in this case, the top image will be emptied)
+#  1: Two-layer backing chain, commit to upper backing file (explicitly)
+#     (in this case, the top image will implicitly stay unchanged)
+#  2: Two-layer backing chain, commit to upper backing file (implicitly with -d)
+#     (in this case, the top image will explicitly stay unchanged)
+#  3: Two-layer backing chain, commit to lower backing file
+#     (in this case, the top image will implicitly stay unchanged)
+#
+# 020 already tests committing, so this only tests whether image chains are
+# working properly and that all images above the base are emptied; therefore,
+# no complicated patterns are necessary
+for i in 0 1 2 3; do
+
+echo
+echo "=== Test pass $i ==="
+echo
+
+TEST_IMG="$TEST_IMG.base" _make_test_img 64M
+TEST_IMG="$TEST_IMG.itmd" _make_test_img -b "$TEST_IMG.base" 64M
+_make_test_img -b "$TEST_IMG.itmd" 64M
+
+$QEMU_IO -c 'write -P 1 0 192k' "$TEST_IMG.base" | _filter_qemu_io
+$QEMU_IO -c 'write -P 2 64k 128k' "$TEST_IMG.itmd" | _filter_qemu_io
+$QEMU_IO -c 'write -P 3 128k 64k' "$TEST_IMG" | _filter_qemu_io
+
+if [ $i -lt 3 ]; then
+    if [ $i == 0 ]; then
+        # -b "$TEST_IMG.itmd" should be the default (that is, committing to the
+        # first backing file in the chain)
+        $QEMU_IMG commit "$TEST_IMG"
+    elif [ $i == 1 ]; then
+        # explicitly specify the commit target (this should imply -d)
+        $QEMU_IMG commit -b "$TEST_IMG.itmd" "$TEST_IMG"
+    else
+        # do not explicitly specify the commit target, but use -d to leave the
+        # top image unchanged
+        $QEMU_IMG commit -d "$TEST_IMG"
+    fi
+
+    # Bottom should be unchanged
+    $QEMU_IO -c 'read -P 1 0 192k' "$TEST_IMG.base" | _filter_qemu_io
+
+    # Intermediate should contain changes from top
+    $QEMU_IO -c 'read -P 1 0 64k' "$TEST_IMG.itmd" | _filter_qemu_io
+    $QEMU_IO -c 'read -P 2 64k 64k' "$TEST_IMG.itmd" | _filter_qemu_io
+    $QEMU_IO -c 'read -P 3 128k 64k' "$TEST_IMG.itmd" | _filter_qemu_io
+
+    # And in pass 0, the top image should be empty, whereas in both other passes
+    # it should be unchanged (which is both checked by qemu-img map)
+else
+    $QEMU_IMG commit -b "$TEST_IMG.base" "$TEST_IMG"
+
+    # Bottom should contain all changes
+    $QEMU_IO -c 'read -P 1 0 64k' "$TEST_IMG.base" | _filter_qemu_io
+    $QEMU_IO -c 'read -P 2 64k 64k' "$TEST_IMG.base" | _filter_qemu_io
+    $QEMU_IO -c 'read -P 3 128k 64k' "$TEST_IMG.base" | _filter_qemu_io
+
+    # Both top and intermediate should be unchanged
+fi
+
+$QEMU_IMG map "$TEST_IMG.base" | _filter_qemu_img_map
+$QEMU_IMG map "$TEST_IMG.itmd" | _filter_qemu_img_map
+$QEMU_IMG map "$TEST_IMG" | _filter_qemu_img_map
+
+done
+
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/097.out b/tests/qemu-iotests/097.out
new file mode 100644
index 0000000..1cb7764
--- /dev/null
+++ b/tests/qemu-iotests/097.out
@@ -0,0 +1,119 @@
+QA output created by 097
+
+=== Test pass 0 ===
+
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864 
+Formatting 'TEST_DIR/t.IMGFMT.itmd', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.base' 
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.itmd' 
+wrote 196608/196608 bytes at offset 0
+192 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 131072/131072 bytes at offset 65536
+128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 131072
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Image committed.
+read 196608/196608 bytes at offset 0
+192 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 65536
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 131072
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Offset          Length          File
+0               0x30000         TEST_DIR/t.IMGFMT.base
+Offset          Length          File
+0               0x10000         TEST_DIR/t.IMGFMT.base
+0x10000         0x20000         TEST_DIR/t.IMGFMT.itmd
+Offset          Length          File
+0               0x10000         TEST_DIR/t.IMGFMT.base
+0x10000         0x20000         TEST_DIR/t.IMGFMT.itmd
+
+=== Test pass 1 ===
+
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864 
+Formatting 'TEST_DIR/t.IMGFMT.itmd', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.base' 
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.itmd' 
+wrote 196608/196608 bytes at offset 0
+192 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 131072/131072 bytes at offset 65536
+128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 131072
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Image committed.
+read 196608/196608 bytes at offset 0
+192 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 65536
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 131072
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Offset          Length          File
+0               0x30000         TEST_DIR/t.IMGFMT.base
+Offset          Length          File
+0               0x10000         TEST_DIR/t.IMGFMT.base
+0x10000         0x20000         TEST_DIR/t.IMGFMT.itmd
+Offset          Length          File
+0               0x10000         TEST_DIR/t.IMGFMT.base
+0x10000         0x10000         TEST_DIR/t.IMGFMT.itmd
+0x20000         0x10000         TEST_DIR/t.IMGFMT
+
+=== Test pass 2 ===
+
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864 
+Formatting 'TEST_DIR/t.IMGFMT.itmd', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.base' 
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.itmd' 
+wrote 196608/196608 bytes at offset 0
+192 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 131072/131072 bytes at offset 65536
+128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 131072
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Image committed.
+read 196608/196608 bytes at offset 0
+192 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 65536
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 131072
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Offset          Length          File
+0               0x30000         TEST_DIR/t.IMGFMT.base
+Offset          Length          File
+0               0x10000         TEST_DIR/t.IMGFMT.base
+0x10000         0x20000         TEST_DIR/t.IMGFMT.itmd
+Offset          Length          File
+0               0x10000         TEST_DIR/t.IMGFMT.base
+0x10000         0x10000         TEST_DIR/t.IMGFMT.itmd
+0x20000         0x10000         TEST_DIR/t.IMGFMT
+
+=== Test pass 3 ===
+
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864 
+Formatting 'TEST_DIR/t.IMGFMT.itmd', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.base' 
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.itmd' 
+wrote 196608/196608 bytes at offset 0
+192 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 131072/131072 bytes at offset 65536
+128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 131072
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Image committed.
+read 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 65536
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 131072
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Offset          Length          File
+0               0x30000         TEST_DIR/t.IMGFMT.base
+Offset          Length          File
+0               0x10000         TEST_DIR/t.IMGFMT.base
+0x10000         0x20000         TEST_DIR/t.IMGFMT.itmd
+Offset          Length          File
+0               0x10000         TEST_DIR/t.IMGFMT.base
+0x10000         0x10000         TEST_DIR/t.IMGFMT.itmd
+0x20000         0x10000         TEST_DIR/t.IMGFMT
+*** done
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index be2054f..fab0158 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -100,6 +100,7 @@
 091 rw auto quick
 092 rw auto quick
 095 rw auto quick
+097 rw auto backing
 099 rw auto quick
 100 rw auto quick
 101 rw auto quick
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [Qemu-devel] [PATCH v13 14/14] iotests: Add test for qcow2's bdrv_make_empty
  2014-10-22 12:51 [Qemu-devel] [PATCH v13 00/14] qemu-img: Implement commit like QMP Max Reitz
                   ` (12 preceding siblings ...)
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 13/14] iotests: Add test for backing-chain commits Max Reitz
@ 2014-10-22 12:51 ` Max Reitz
  2014-10-22 17:05   ` Eric Blake
  2014-10-23 13:30   ` Kevin Wolf
  13 siblings, 2 replies; 43+ messages in thread
From: Max Reitz @ 2014-10-22 12:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

Add a test for qcow2's fast bdrv_make_empty implementation on images
without internal snapshots.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 tests/qemu-iotests/098     | 79 ++++++++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/098.out | 45 ++++++++++++++++++++++++++
 tests/qemu-iotests/group   |  1 +
 3 files changed, 125 insertions(+)
 create mode 100755 tests/qemu-iotests/098
 create mode 100644 tests/qemu-iotests/098.out

diff --git a/tests/qemu-iotests/098 b/tests/qemu-iotests/098
new file mode 100755
index 0000000..2aacb1f
--- /dev/null
+++ b/tests/qemu-iotests/098
@@ -0,0 +1,79 @@
+#!/bin/bash
+#
+# Test qcow2's bdrv_make_empty for images without internal snapshots
+#
+# Copyright (C) 2014 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+# creator
+owner=mreitz@redhat.com
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+
+here="$PWD"
+tmp=/tmp/$$
+status=1	# failure is the default!
+
+_cleanup()
+{
+    _cleanup_test_img
+    rm -f "$TEST_DIR/blkdebug.conf"
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+. ./common.pattern
+
+_supported_fmt qcow2
+_supported_proto file
+_supported_os Linux
+
+IMGOPTS="compat=1.1"
+
+for event in empty_image_prepare l1_update reftable_update refblock_alloc; do
+
+echo
+echo "=== $event ==="
+echo
+
+TEST_IMG="$TEST_IMG.base" _make_test_img 64M
+_make_test_img -b "$TEST_IMG.base" 64M
+
+cat > "$TEST_DIR/blkdebug.conf" <<EOF
+[inject-error]
+event = "$event"
+EOF
+
+$QEMU_IMG commit "blkdebug:$TEST_DIR/blkdebug.conf:$TEST_IMG" 2>&1 \
+    | _filter_testdir | _filter_imgfmt
+
+# There may be errors, but they should be fixed by opening the image
+$QEMU_IO -c close "$TEST_IMG"
+
+_check_test_img
+
+done
+
+
+rm -f "$TEST_DIR/blkdebug.conf"
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/098.out b/tests/qemu-iotests/098.out
new file mode 100644
index 0000000..3202d70
--- /dev/null
+++ b/tests/qemu-iotests/098.out
@@ -0,0 +1,45 @@
+QA output created by 098
+
+=== empty_image_prepare ===
+
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864 
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.base' 
+qemu-img: Could not empty blkdebug:TEST_DIR/blkdebug.conf:TEST_DIR/t.IMGFMT: Input/output error
+No errors were found on the image.
+
+=== l1_update ===
+
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864 
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.base' 
+qemu-img: Could not empty blkdebug:TEST_DIR/blkdebug.conf:TEST_DIR/t.IMGFMT: Input/output error
+ERROR cluster 0 refcount=0 reference=1
+ERROR cluster 1 refcount=0 reference=1
+ERROR cluster 3 refcount=0 reference=1
+Rebuilding refcount structure
+Repairing cluster 1 refcount=1 reference=0
+No errors were found on the image.
+
+=== reftable_update ===
+
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864 
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.base' 
+qemu-img: Could not empty blkdebug:TEST_DIR/blkdebug.conf:TEST_DIR/t.IMGFMT: Input/output error
+ERROR cluster 0 refcount=0 reference=1
+ERROR cluster 1 refcount=0 reference=1
+ERROR cluster 3 refcount=0 reference=1
+Rebuilding refcount structure
+Repairing cluster 1 refcount=1 reference=0
+No errors were found on the image.
+
+=== refblock_alloc ===
+
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864 
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.base' 
+qemu-img: Could not empty blkdebug:TEST_DIR/blkdebug.conf:TEST_DIR/t.IMGFMT: Input/output error
+ERROR cluster 0 refcount=0 reference=1
+ERROR cluster 1 refcount=0 reference=1
+ERROR cluster 3 refcount=0 reference=1
+Rebuilding refcount structure
+Repairing cluster 1 refcount=1 reference=0
+No errors were found on the image.
+*** done
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index fab0158..2d745eb 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -101,6 +101,7 @@
 092 rw auto quick
 095 rw auto quick
 097 rw auto backing
+098 rw auto backing quick
 099 rw auto quick
 100 rw auto quick
 101 rw auto quick
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 03/14] qcow2: Optimize bdrv_make_empty()
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 03/14] qcow2: Optimize bdrv_make_empty() Max Reitz
@ 2014-10-22 16:04   ` Eric Blake
  2014-10-22 18:35   ` Kevin Wolf
  1 sibling, 0 replies; 43+ messages in thread
From: Eric Blake @ 2014-10-22 16:04 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 995 bytes --]

On 10/22/2014 06:51 AM, Max Reitz wrote:
> bdrv_make_empty() is currently only called if the current image
> represents an external snapshot that has been committed to its base
> image; it is therefore unlikely to have internal snapshots. In this
> case, bdrv_make_empty() can be greatly sped up by emptying the L1 and
> refcount table (while having the dirty flag set, which only works for
> compat=1.1) and creating a trivial refcount structure.
> 
> If there are snapshots or for compat=0.10, fall back to the simple
> implementation (discard all clusters).
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/blkdebug.c      |   2 +
>  block/qcow2.c         | 143 ++++++++++++++++++++++++++++++++++++++++++++------
>  include/block/block.h |   2 +
>  3 files changed, 132 insertions(+), 15 deletions(-)

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 539 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 08/14] qemu-img: Implement commit like QMP
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 08/14] qemu-img: Implement commit like QMP Max Reitz
@ 2014-10-22 16:22   ` Eric Blake
  2014-10-23 11:59   ` Kevin Wolf
  1 sibling, 0 replies; 43+ messages in thread
From: Eric Blake @ 2014-10-22 16:22 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: Kevin Wolf, Igor Mammedov, Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 1366 bytes --]

On 10/22/2014 06:51 AM, Max Reitz wrote:
> qemu-img should use QMP commands whenever possible in order to ensure
> feature completeness of both online and offline image operations. As
> qemu-img itself has no access to QMP (since this would basically require
> just everything being linked into qemu-img), imitate QMP's
> implementation of block-commit by using commit_active_start() and then
> waiting for the block job to finish.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/Makefile.objs |  3 +-
>  qemu-img.c          | 82 ++++++++++++++++++++++++++++++++++++++++-------------
>  2 files changed, 64 insertions(+), 21 deletions(-)
> 

> +done:
>      blk_unref(blk);
> -    if (ret) {
> +
> +    if (local_err) {
> +        qerror_report_err(local_err);
> +        error_free(local_err);
>          return 1;

Igor has a patch that simplifies this style of cleanup:
https://lists.gnu.org/archive/html/qemu-devel/2014-10/msg01992.html

But it does so by using exit(1) instead of return 1, not to mention that
there are probably a lot of places in qemu-img.c that could be similarly
cleaned at the same time as a separate patch.

So for this patch, as-is,
Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 539 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 14/14] iotests: Add test for qcow2's bdrv_make_empty
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 14/14] iotests: Add test for qcow2's bdrv_make_empty Max Reitz
@ 2014-10-22 17:05   ` Eric Blake
  2014-10-23 13:30   ` Kevin Wolf
  1 sibling, 0 replies; 43+ messages in thread
From: Eric Blake @ 2014-10-22 17:05 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 685 bytes --]

On 10/22/2014 06:51 AM, Max Reitz wrote:
> Add a test for qcow2's fast bdrv_make_empty implementation on images
> without internal snapshots.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  tests/qemu-iotests/098     | 79 ++++++++++++++++++++++++++++++++++++++++++++++
>  tests/qemu-iotests/098.out | 45 ++++++++++++++++++++++++++
>  tests/qemu-iotests/group   |  1 +
>  3 files changed, 125 insertions(+)
>  create mode 100755 tests/qemu-iotests/098
>  create mode 100644 tests/qemu-iotests/098.out
> 

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 539 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 03/14] qcow2: Optimize bdrv_make_empty()
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 03/14] qcow2: Optimize bdrv_make_empty() Max Reitz
  2014-10-22 16:04   ` Eric Blake
@ 2014-10-22 18:35   ` Kevin Wolf
  2014-10-23  7:46     ` Max Reitz
  1 sibling, 1 reply; 43+ messages in thread
From: Kevin Wolf @ 2014-10-22 18:35 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-devel, Stefan Hajnoczi

Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
> bdrv_make_empty() is currently only called if the current image
> represents an external snapshot that has been committed to its base
> image; it is therefore unlikely to have internal snapshots. In this
> case, bdrv_make_empty() can be greatly sped up by emptying the L1 and
> refcount table (while having the dirty flag set, which only works for
> compat=1.1) and creating a trivial refcount structure.
> 
> If there are snapshots or for compat=0.10, fall back to the simple
> implementation (discard all clusters).
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>

Hey, this feels actually reviewable this time. :-)

> diff --git a/block/blkdebug.c b/block/blkdebug.c
> index e046b92..862d93b 100644
> --- a/block/blkdebug.c
> +++ b/block/blkdebug.c
> @@ -195,6 +195,8 @@ static const char *event_names[BLKDBG_EVENT_MAX] = {
>      [BLKDBG_PWRITEV]                        = "pwritev",
>      [BLKDBG_PWRITEV_ZERO]                   = "pwritev_zero",
>      [BLKDBG_PWRITEV_DONE]                   = "pwritev_done",
> +
> +    [BLKDBG_EMPTY_IMAGE_PREPARE]            = "empty_image_prepare",
>  };
>  
>  static int get_event_by_name(const char *name, BlkDebugEvent *event)
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 1ef3a5f..16dece2 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -2232,24 +2232,137 @@ fail:
>  
>  static int qcow2_make_empty(BlockDriverState *bs)
>  {
> +    BDRVQcowState *s = bs->opaque;
>      int ret = 0;
> -    uint64_t start_sector;
> -    int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
>  
> -    for (start_sector = 0; start_sector < bs->total_sectors;
> -         start_sector += sector_step)
> -    {
> -        /* As this function is generally used after committing an external
> -         * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
> -         * default action for this kind of discard is to pass the discard,
> -         * which will ideally result in an actually smaller image file, as
> -         * is probably desired. */
> -        ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
> -                                     MIN(sector_step,
> -                                         bs->total_sectors - start_sector),
> -                                     QCOW2_DISCARD_SNAPSHOT, true);
> +    if (s->snapshots || s->qcow_version < 3) {
> +        uint64_t start_sector;
> +        int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
> +
> +        /* If there are snapshots, every active cluster has to be discarded; and
> +         * because compat=0.10 does not support setting the dirty flag, we have
> +         * to use this fallback there as well */
> +
> +        for (start_sector = 0; start_sector < bs->total_sectors;
> +             start_sector += sector_step)
> +        {
> +            /* As this function is generally used after committing an external
> +             * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
> +             * default action for this kind of discard is to pass the discard,
> +             * which will ideally result in an actually smaller image file, as
> +             * is probably desired. */
> +            ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
> +                                         MIN(sector_step,
> +                                             bs->total_sectors - start_sector),
> +                                         QCOW2_DISCARD_SNAPSHOT, true);
> +            if (ret < 0) {
> +                break;
> +            }
> +        }

My first though was to add a return here, so the indentation level for
the rest is one less. Probably a matter of taste, though.

> +    } else {
> +        int l1_clusters;
> +        int64_t offset;
> +        uint64_t *new_reftable;
> +        uint64_t rt_entry;
> +        struct {
> +            uint64_t l1_offset;
> +            uint64_t reftable_offset;
> +            uint32_t reftable_clusters;
> +        } QEMU_PACKED l1_ofs_rt_ofs_cls;
> +
> +        ret = qcow2_cache_empty(bs, s->l2_table_cache);
>          if (ret < 0) {
> -            break;
> +            return ret;
> +        }
> +
> +        ret = qcow2_cache_empty(bs, s->refcount_block_cache);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +
> +        /* Refcounts will be broken utterly */
> +        ret = qcow2_mark_dirty(bs);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +
> +        l1_clusters = DIV_ROUND_UP(s->l1_size,
> +                                   s->cluster_size / sizeof(uint64_t));
> +        new_reftable = g_try_new0(uint64_t, s->cluster_size / sizeof(uint64_t));
> +        if (!new_reftable) {
> +            return -ENOMEM;
> +        }
> +
> +        BLKDBG_EVENT(bs->file, BLKDBG_EMPTY_IMAGE_PREPARE);

Until here, the failure cases are trivially okay. The worst thing that
could happen is that the image is needlessly marked as dirty.

> +        /* Overwrite enough clusters at the beginning of the sectors to place
> +         * the refcount table, a refcount block and the L1 table in; this may
> +         * overwrite parts of the existing refcount and L1 table, which is not
> +         * an issue because the dirty flag is set, complete data loss is in fact
> +         * desired and partial data loss is consequently fine as well */
> +        ret = bdrv_write_zeroes(bs->file, s->cluster_size / BDRV_SECTOR_SIZE,
> +                                (2 + l1_clusters) * s->cluster_size /
> +                                BDRV_SECTOR_SIZE, 0);

If we crash at this point, we're _not_ okay any more. --verbose follows:

On disk, we may have overwritten a refcount table or a refcount block
with zeros. This is fine, we have the dirty flag set, so destroying any
refcounting structure does no harm.

We may also have overwritten an L1 or L2 table. As the commit correctly
explains, this is doing partially what the function is supposed to do
for the whole image. Affected data clusters are now read from the
backing file. Good.

However, we may also have overwritten data clusters that are still
accessible using an L1/L2 table that hasn't been hit by this write
operation. We're reading corrupted (zeroed out) data now instead of
going to the backing file. Bug!

In my original suggestion I had an item where the L1 table was zeroed
out first before the start of the image is zeroed. This would have
avoided the bug.

> +        if (ret < 0) {
> +            g_free(new_reftable);
> +            return ret;
> +        }

If we fail here (without crashing), the first clusters could be in their
original state or partially zeroed. Assuming that you fixed the above
bug, the on-disk state would be okay if we opened the image now because
the dirty flag would trigger an image repair; but we don't get the
repair when taking this failure path and we may have zeroed a refcount
table/block. This is probably a problem and we may have to make the BDS
unusable.

The in-memory state of the L1 table is hopefully zeroed out, so it's
consistent with what is on disk.

The in-memory state of the refcount table looks like it's not in sync
with the on-disk state. Note that while the dirty flag allows that the
on-disk state can be anything, the in-memory state is what we keep using
after a failure. The in-memory state isn't accurate at this point, but
we only create leaks. Lots of them, because we zeroed the L1 table, but
that's good enough. If refcounts are updated later, the old offsets
should still be valid.

All of the following code shouldn't change what a guest is seeing, but
only fix up the refcounts.

> +        BLKDBG_EVENT(bs->file, BLKDBG_L1_UPDATE);
> +        BLKDBG_EVENT(bs->file, BLKDBG_REFTABLE_UPDATE);
> +
> +        /* "Create" an empty reftable (one cluster) directly after the image
> +         * header and an empty L1 table three clusters after the image header;
> +         * the cluster between those two will be used as the first refblock */
> +        cpu_to_be64w(&l1_ofs_rt_ofs_cls.l1_offset, 3 * s->cluster_size);
> +        cpu_to_be64w(&l1_ofs_rt_ofs_cls.reftable_offset, s->cluster_size);
> +        cpu_to_be32w(&l1_ofs_rt_ofs_cls.reftable_clusters, 1);
> +        ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, l1_table_offset),
> +                               &l1_ofs_rt_ofs_cls, sizeof(l1_ofs_rt_ofs_cls));

If we crash here, we have a new, but still an all-zero L1 table, which
ensures that the guest sees the right thing. We also get a new refcount
table, which is invalid anyway. It is completely zeroed out, image
repair should be able to cope with it.

> +        if (ret < 0) {
> +            g_free(new_reftable);
> +            return ret;
> +        }

Failure without a crash. It should be safe to assume that either the
whole header has been updated (and the flush failed) or it's in its old
state.

Probably the same problem as above with potentially overwritten refcount
blocks while we don't get a repair on this failure path. Without
accurate in-memory refcounts, we can't keep using the BDS.

> +        s->l1_table_offset = 3 * s->cluster_size;
> +        memset(s->l1_table, 0, s->l1_size * sizeof(uint64_t));

This line has been moved to above by the assumed L1 fix.

> +        s->refcount_table_offset = s->cluster_size;
> +        s->refcount_table_size   = s->cluster_size / sizeof(uint64_t);
> +
> +        g_free(s->refcount_table);
> +        s->refcount_table = new_reftable;
> +
> +        BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_ALLOC);
> +
> +        /* Enter the first refblock into the reftable */
> +        rt_entry = cpu_to_be64(2 * s->cluster_size);
> +        ret = bdrv_pwrite_sync(bs->file, s->cluster_size,
> +                               &rt_entry, sizeof(rt_entry));

On-disk status: L1 table empty, refcount table has a refblock, but still
no references. Dirty flag ensures we're safe.

> +        if (ret < 0) {
> +            return ret;
> +        }

Failure path without crash: s->refcount_table_offset/size have been
updated, but we still don't have accurate in-memory information. Still
can't keep using the BDS after failing here.

> +        s->refcount_table[0] = 2 * s->cluster_size;
> +
> +        ret = bdrv_truncate(bs->file, (3 + l1_clusters) * s->cluster_size);
> +        if (ret < 0) {
> +            return ret;
> +        }

Nothing interesting happens here (except that this is the heart of the
functionality :-)). The on-disk state didn't reference any of the
truncated clusters any more. The in-memory state isn't fixed yet.

> +        s->free_cluster_index = 0;
> +        offset = qcow2_alloc_clusters(bs, 3 * s->cluster_size +
> +                                      s->l1_size * sizeof(uint64_t));
> +        if (offset < 0) {
> +            return offset;

BDS still unusable here.

> +        } else if (offset > 0) {
> +            error_report("First cluster in emptied image is in use");
> +            abort();
> +        }

Finally, at this point we have correct in-memory information again.

Not sure if it's worth the effort of trying to move this further up. We
could at least easily move the bdrv_truncate() to below this, which
would make it's error path less bad.

> +        ret = qcow2_mark_clean(bs);
> +        if (ret < 0) {
> +            return ret;
>          }

And this writes them to disk. If we fail, we can keep going on, as the
in-memory information is correct and after a qemu restart, the dirty
flag will trigger a repair.

>      }
>  
> diff --git a/include/block/block.h b/include/block/block.h
> index 341054d..b1f4385 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -498,6 +498,8 @@ typedef enum {
>      BLKDBG_PWRITEV_ZERO,
>      BLKDBG_PWRITEV_DONE,
>  
> +    BLKDBG_EMPTY_IMAGE_PREPARE,
> +
>      BLKDBG_EVENT_MAX,
>  } BlkDebugEvent;

Kevin

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 03/14] qcow2: Optimize bdrv_make_empty()
  2014-10-22 18:35   ` Kevin Wolf
@ 2014-10-23  7:46     ` Max Reitz
  2014-10-23  8:29       ` Kevin Wolf
  0 siblings, 1 reply; 43+ messages in thread
From: Max Reitz @ 2014-10-23  7:46 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, Stefan Hajnoczi

On 2014-10-22 at 20:35, Kevin Wolf wrote:
> Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
>> bdrv_make_empty() is currently only called if the current image
>> represents an external snapshot that has been committed to its base
>> image; it is therefore unlikely to have internal snapshots. In this
>> case, bdrv_make_empty() can be greatly sped up by emptying the L1 and
>> refcount table (while having the dirty flag set, which only works for
>> compat=1.1) and creating a trivial refcount structure.
>>
>> If there are snapshots or for compat=0.10, fall back to the simple
>> implementation (discard all clusters).
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
> Hey, this feels actually reviewable this time. :-)

I'm still unsure which version I like more. If it wasn't for the math, 
I'd prefer the other version.

>> diff --git a/block/blkdebug.c b/block/blkdebug.c
>> index e046b92..862d93b 100644
>> --- a/block/blkdebug.c
>> +++ b/block/blkdebug.c
>> @@ -195,6 +195,8 @@ static const char *event_names[BLKDBG_EVENT_MAX] = {
>>       [BLKDBG_PWRITEV]                        = "pwritev",
>>       [BLKDBG_PWRITEV_ZERO]                   = "pwritev_zero",
>>       [BLKDBG_PWRITEV_DONE]                   = "pwritev_done",
>> +
>> +    [BLKDBG_EMPTY_IMAGE_PREPARE]            = "empty_image_prepare",
>>   };
>>   
>>   static int get_event_by_name(const char *name, BlkDebugEvent *event)
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index 1ef3a5f..16dece2 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -2232,24 +2232,137 @@ fail:
>>   
>>   static int qcow2_make_empty(BlockDriverState *bs)
>>   {
>> +    BDRVQcowState *s = bs->opaque;
>>       int ret = 0;
>> -    uint64_t start_sector;
>> -    int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
>>   
>> -    for (start_sector = 0; start_sector < bs->total_sectors;
>> -         start_sector += sector_step)
>> -    {
>> -        /* As this function is generally used after committing an external
>> -         * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
>> -         * default action for this kind of discard is to pass the discard,
>> -         * which will ideally result in an actually smaller image file, as
>> -         * is probably desired. */
>> -        ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
>> -                                     MIN(sector_step,
>> -                                         bs->total_sectors - start_sector),
>> -                                     QCOW2_DISCARD_SNAPSHOT, true);
>> +    if (s->snapshots || s->qcow_version < 3) {
>> +        uint64_t start_sector;
>> +        int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
>> +
>> +        /* If there are snapshots, every active cluster has to be discarded; and
>> +         * because compat=0.10 does not support setting the dirty flag, we have
>> +         * to use this fallback there as well */
>> +
>> +        for (start_sector = 0; start_sector < bs->total_sectors;
>> +             start_sector += sector_step)
>> +        {
>> +            /* As this function is generally used after committing an external
>> +             * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
>> +             * default action for this kind of discard is to pass the discard,
>> +             * which will ideally result in an actually smaller image file, as
>> +             * is probably desired. */
>> +            ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
>> +                                         MIN(sector_step,
>> +                                             bs->total_sectors - start_sector),
>> +                                         QCOW2_DISCARD_SNAPSHOT, true);
>> +            if (ret < 0) {
>> +                break;
>> +            }
>> +        }
> My first though was to add a return here, so the indentation level for
> the rest is one less. Probably a matter of taste, though.

I'd rather put the second branch into an own function.

>> +    } else {
>> +        int l1_clusters;
>> +        int64_t offset;
>> +        uint64_t *new_reftable;
>> +        uint64_t rt_entry;
>> +        struct {
>> +            uint64_t l1_offset;
>> +            uint64_t reftable_offset;
>> +            uint32_t reftable_clusters;
>> +        } QEMU_PACKED l1_ofs_rt_ofs_cls;
>> +
>> +        ret = qcow2_cache_empty(bs, s->l2_table_cache);
>>           if (ret < 0) {
>> -            break;
>> +            return ret;
>> +        }
>> +
>> +        ret = qcow2_cache_empty(bs, s->refcount_block_cache);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +
>> +        /* Refcounts will be broken utterly */
>> +        ret = qcow2_mark_dirty(bs);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +
>> +        l1_clusters = DIV_ROUND_UP(s->l1_size,
>> +                                   s->cluster_size / sizeof(uint64_t));
>> +        new_reftable = g_try_new0(uint64_t, s->cluster_size / sizeof(uint64_t));
>> +        if (!new_reftable) {
>> +            return -ENOMEM;
>> +        }
>> +
>> +        BLKDBG_EVENT(bs->file, BLKDBG_EMPTY_IMAGE_PREPARE);
> Until here, the failure cases are trivially okay. The worst thing that
> could happen is that the image is needlessly marked as dirty.
>
>> +        /* Overwrite enough clusters at the beginning of the sectors to place
>> +         * the refcount table, a refcount block and the L1 table in; this may
>> +         * overwrite parts of the existing refcount and L1 table, which is not
>> +         * an issue because the dirty flag is set, complete data loss is in fact
>> +         * desired and partial data loss is consequently fine as well */
>> +        ret = bdrv_write_zeroes(bs->file, s->cluster_size / BDRV_SECTOR_SIZE,
>> +                                (2 + l1_clusters) * s->cluster_size /
>> +                                BDRV_SECTOR_SIZE, 0);
> If we crash at this point, we're _not_ okay any more. --verbose follows:
>
> On disk, we may have overwritten a refcount table or a refcount block
> with zeros. This is fine, we have the dirty flag set, so destroying any
> refcounting structure does no harm.
>
> We may also have overwritten an L1 or L2 table. As the commit correctly
> explains, this is doing partially what the function is supposed to do
> for the whole image. Affected data clusters are now read from the
> backing file. Good.
>
> However, we may also have overwritten data clusters that are still
> accessible using an L1/L2 table that hasn't been hit by this write
> operation. We're reading corrupted (zeroed out) data now instead of
> going to the backing file. Bug!

Oh, right, I forgot about the L1 table not always being at the start of 
the file.

> In my original suggestion I had an item where the L1 table was zeroed
> out first before the start of the image is zeroed. This would have
> avoided the bug.
>
>> +        if (ret < 0) {
>> +            g_free(new_reftable);
>> +            return ret;
>> +        }
> If we fail here (without crashing), the first clusters could be in their
> original state or partially zeroed. Assuming that you fixed the above
> bug, the on-disk state would be okay if we opened the image now because
> the dirty flag would trigger an image repair; but we don't get the
> repair when taking this failure path and we may have zeroed a refcount
> table/block. This is probably a problem and we may have to make the BDS
> unusable.
>
> The in-memory state of the L1 table is hopefully zeroed out, so it's
> consistent with what is on disk.
>
> The in-memory state of the refcount table looks like it's not in sync
> with the on-disk state. Note that while the dirty flag allows that the
> on-disk state can be anything, the in-memory state is what we keep using
> after a failure. The in-memory state isn't accurate at this point, but
> we only create leaks. Lots of them, because we zeroed the L1 table, but
> that's good enough. If refcounts are updated later, the old offsets
> should still be valid.

If we set at least parts of the in-memory reftable to zero, everything 
probably breaks. Try to allocate a new cluster while the beginning of 
the reftable is zero. So we cannot take the on-disk reftable into memory.

Doing it the other way around, writing the in-memory reftable to disk on 
error won't work either. The refblocks may have been zeroed out, so we 
have exactly the same problem.

Therefore, to make the BDS usable after error, we have to (in the error 
path) read the on-disk reftable into memory, call the qcow2 repair 
function and hope for the best.

Or, you know, we could go back to v11 which had my other version of this 
patch which always kept everything consistent. :-P

> All of the following code shouldn't change what a guest is seeing, but
> only fix up the refcounts.
>
>> +        BLKDBG_EVENT(bs->file, BLKDBG_L1_UPDATE);
>> +        BLKDBG_EVENT(bs->file, BLKDBG_REFTABLE_UPDATE);
>> +
>> +        /* "Create" an empty reftable (one cluster) directly after the image
>> +         * header and an empty L1 table three clusters after the image header;
>> +         * the cluster between those two will be used as the first refblock */
>> +        cpu_to_be64w(&l1_ofs_rt_ofs_cls.l1_offset, 3 * s->cluster_size);
>> +        cpu_to_be64w(&l1_ofs_rt_ofs_cls.reftable_offset, s->cluster_size);
>> +        cpu_to_be32w(&l1_ofs_rt_ofs_cls.reftable_clusters, 1);
>> +        ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, l1_table_offset),
>> +                               &l1_ofs_rt_ofs_cls, sizeof(l1_ofs_rt_ofs_cls));
> If we crash here, we have a new, but still an all-zero L1 table, which
> ensures that the guest sees the right thing. We also get a new refcount
> table, which is invalid anyway. It is completely zeroed out, image
> repair should be able to cope with it.

It now is, yes.

>> +        if (ret < 0) {
>> +            g_free(new_reftable);
>> +            return ret;
>> +        }
> Failure without a crash. It should be safe to assume that either the
> whole header has been updated (and the flush failed) or it's in its old
> state.
>
> Probably the same problem as above with potentially overwritten refcount
> blocks while we don't get a repair on this failure path. Without
> accurate in-memory refcounts, we can't keep using the BDS.
>
>> +        s->l1_table_offset = 3 * s->cluster_size;
>> +        memset(s->l1_table, 0, s->l1_size * sizeof(uint64_t));
> This line has been moved to above by the assumed L1 fix.
>
>> +        s->refcount_table_offset = s->cluster_size;
>> +        s->refcount_table_size   = s->cluster_size / sizeof(uint64_t);
>> +
>> +        g_free(s->refcount_table);
>> +        s->refcount_table = new_reftable;
>> +
>> +        BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_ALLOC);
>> +
>> +        /* Enter the first refblock into the reftable */
>> +        rt_entry = cpu_to_be64(2 * s->cluster_size);
>> +        ret = bdrv_pwrite_sync(bs->file, s->cluster_size,
>> +                               &rt_entry, sizeof(rt_entry));
> On-disk status: L1 table empty, refcount table has a refblock, but still
> no references. Dirty flag ensures we're safe.
>
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
> Failure path without crash: s->refcount_table_offset/size have been
> updated, but we still don't have accurate in-memory information. Still
> can't keep using the BDS after failing here.
>
>> +        s->refcount_table[0] = 2 * s->cluster_size;
>> +
>> +        ret = bdrv_truncate(bs->file, (3 + l1_clusters) * s->cluster_size);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
> Nothing interesting happens here (except that this is the heart of the
> functionality :-)). The on-disk state didn't reference any of the
> truncated clusters any more. The in-memory state isn't fixed yet.
>
>> +        s->free_cluster_index = 0;
>> +        offset = qcow2_alloc_clusters(bs, 3 * s->cluster_size +
>> +                                      s->l1_size * sizeof(uint64_t));
>> +        if (offset < 0) {
>> +            return offset;
> BDS still unusable here.
>
>> +        } else if (offset > 0) {
>> +            error_report("First cluster in emptied image is in use");
>> +            abort();
>> +        }
> Finally, at this point we have correct in-memory information again.
>
> Not sure if it's worth the effort of trying to move this further up. We
> could at least easily move the bdrv_truncate() to below this, which
> would make it's error path less bad.
>
>> +        ret = qcow2_mark_clean(bs);
>> +        if (ret < 0) {
>> +            return ret;
>>           }
> And this writes them to disk. If we fail, we can keep going on, as the
> in-memory information is correct and after a qemu restart, the dirty
> flag will trigger a repair.
>
>>       }
>>   
>> diff --git a/include/block/block.h b/include/block/block.h
>> index 341054d..b1f4385 100644
>> --- a/include/block/block.h
>> +++ b/include/block/block.h
>> @@ -498,6 +498,8 @@ typedef enum {
>>       BLKDBG_PWRITEV_ZERO,
>>       BLKDBG_PWRITEV_DONE,
>>   
>> +    BLKDBG_EMPTY_IMAGE_PREPARE,
>> +
>>       BLKDBG_EVENT_MAX,
>>   } BlkDebugEvent;
> Kevin

Thank you for your review,

Max

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 03/14] qcow2: Optimize bdrv_make_empty()
  2014-10-23  7:46     ` Max Reitz
@ 2014-10-23  8:29       ` Kevin Wolf
  2014-10-23  8:36         ` Max Reitz
  0 siblings, 1 reply; 43+ messages in thread
From: Kevin Wolf @ 2014-10-23  8:29 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-devel, Stefan Hajnoczi

Am 23.10.2014 um 09:46 hat Max Reitz geschrieben:
> On 2014-10-22 at 20:35, Kevin Wolf wrote:
> >Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
> >>bdrv_make_empty() is currently only called if the current image
> >>represents an external snapshot that has been committed to its base
> >>image; it is therefore unlikely to have internal snapshots. In this
> >>case, bdrv_make_empty() can be greatly sped up by emptying the L1 and
> >>refcount table (while having the dirty flag set, which only works for
> >>compat=1.1) and creating a trivial refcount structure.
> >>
> >>If there are snapshots or for compat=0.10, fall back to the simple
> >>implementation (discard all clusters).
> >>
> >>Signed-off-by: Max Reitz <mreitz@redhat.com>
> >Hey, this feels actually reviewable this time. :-)
> 
> I'm still unsure which version I like more. If it wasn't for the
> math, I'd prefer the other version.
> 
> >>diff --git a/block/blkdebug.c b/block/blkdebug.c
> >>index e046b92..862d93b 100644
> >>--- a/block/blkdebug.c
> >>+++ b/block/blkdebug.c
> >>@@ -195,6 +195,8 @@ static const char *event_names[BLKDBG_EVENT_MAX] = {
> >>      [BLKDBG_PWRITEV]                        = "pwritev",
> >>      [BLKDBG_PWRITEV_ZERO]                   = "pwritev_zero",
> >>      [BLKDBG_PWRITEV_DONE]                   = "pwritev_done",
> >>+
> >>+    [BLKDBG_EMPTY_IMAGE_PREPARE]            = "empty_image_prepare",
> >>  };
> >>  static int get_event_by_name(const char *name, BlkDebugEvent *event)
> >>diff --git a/block/qcow2.c b/block/qcow2.c
> >>index 1ef3a5f..16dece2 100644
> >>--- a/block/qcow2.c
> >>+++ b/block/qcow2.c
> >>@@ -2232,24 +2232,137 @@ fail:
> >>  static int qcow2_make_empty(BlockDriverState *bs)
> >>  {
> >>+    BDRVQcowState *s = bs->opaque;
> >>      int ret = 0;
> >>-    uint64_t start_sector;
> >>-    int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
> >>-    for (start_sector = 0; start_sector < bs->total_sectors;
> >>-         start_sector += sector_step)
> >>-    {
> >>-        /* As this function is generally used after committing an external
> >>-         * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
> >>-         * default action for this kind of discard is to pass the discard,
> >>-         * which will ideally result in an actually smaller image file, as
> >>-         * is probably desired. */
> >>-        ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
> >>-                                     MIN(sector_step,
> >>-                                         bs->total_sectors - start_sector),
> >>-                                     QCOW2_DISCARD_SNAPSHOT, true);
> >>+    if (s->snapshots || s->qcow_version < 3) {
> >>+        uint64_t start_sector;
> >>+        int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
> >>+
> >>+        /* If there are snapshots, every active cluster has to be discarded; and
> >>+         * because compat=0.10 does not support setting the dirty flag, we have
> >>+         * to use this fallback there as well */
> >>+
> >>+        for (start_sector = 0; start_sector < bs->total_sectors;
> >>+             start_sector += sector_step)
> >>+        {
> >>+            /* As this function is generally used after committing an external
> >>+             * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
> >>+             * default action for this kind of discard is to pass the discard,
> >>+             * which will ideally result in an actually smaller image file, as
> >>+             * is probably desired. */
> >>+            ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
> >>+                                         MIN(sector_step,
> >>+                                             bs->total_sectors - start_sector),
> >>+                                         QCOW2_DISCARD_SNAPSHOT, true);
> >>+            if (ret < 0) {
> >>+                break;
> >>+            }
> >>+        }
> >My first though was to add a return here, so the indentation level for
> >the rest is one less. Probably a matter of taste, though.
> 
> I'd rather put the second branch into an own function.

Works for me.

> >>+    } else {
> >>+        int l1_clusters;
> >>+        int64_t offset;
> >>+        uint64_t *new_reftable;
> >>+        uint64_t rt_entry;
> >>+        struct {
> >>+            uint64_t l1_offset;
> >>+            uint64_t reftable_offset;
> >>+            uint32_t reftable_clusters;
> >>+        } QEMU_PACKED l1_ofs_rt_ofs_cls;
> >>+
> >>+        ret = qcow2_cache_empty(bs, s->l2_table_cache);
> >>          if (ret < 0) {
> >>-            break;
> >>+            return ret;
> >>+        }
> >>+
> >>+        ret = qcow2_cache_empty(bs, s->refcount_block_cache);
> >>+        if (ret < 0) {
> >>+            return ret;
> >>+        }
> >>+
> >>+        /* Refcounts will be broken utterly */
> >>+        ret = qcow2_mark_dirty(bs);
> >>+        if (ret < 0) {
> >>+            return ret;
> >>+        }
> >>+
> >>+        l1_clusters = DIV_ROUND_UP(s->l1_size,
> >>+                                   s->cluster_size / sizeof(uint64_t));
> >>+        new_reftable = g_try_new0(uint64_t, s->cluster_size / sizeof(uint64_t));
> >>+        if (!new_reftable) {
> >>+            return -ENOMEM;
> >>+        }
> >>+
> >>+        BLKDBG_EVENT(bs->file, BLKDBG_EMPTY_IMAGE_PREPARE);
> >Until here, the failure cases are trivially okay. The worst thing that
> >could happen is that the image is needlessly marked as dirty.
> >
> >>+        /* Overwrite enough clusters at the beginning of the sectors to place
> >>+         * the refcount table, a refcount block and the L1 table in; this may
> >>+         * overwrite parts of the existing refcount and L1 table, which is not
> >>+         * an issue because the dirty flag is set, complete data loss is in fact
> >>+         * desired and partial data loss is consequently fine as well */
> >>+        ret = bdrv_write_zeroes(bs->file, s->cluster_size / BDRV_SECTOR_SIZE,
> >>+                                (2 + l1_clusters) * s->cluster_size /
> >>+                                BDRV_SECTOR_SIZE, 0);
> >If we crash at this point, we're _not_ okay any more. --verbose follows:
> >
> >On disk, we may have overwritten a refcount table or a refcount block
> >with zeros. This is fine, we have the dirty flag set, so destroying any
> >refcounting structure does no harm.
> >
> >We may also have overwritten an L1 or L2 table. As the commit correctly
> >explains, this is doing partially what the function is supposed to do
> >for the whole image. Affected data clusters are now read from the
> >backing file. Good.
> >
> >However, we may also have overwritten data clusters that are still
> >accessible using an L1/L2 table that hasn't been hit by this write
> >operation. We're reading corrupted (zeroed out) data now instead of
> >going to the backing file. Bug!
> 
> Oh, right, I forgot about the L1 table not always being at the start
> of the file.
> 
> >In my original suggestion I had an item where the L1 table was zeroed
> >out first before the start of the image is zeroed. This would have
> >avoided the bug.
> >
> >>+        if (ret < 0) {
> >>+            g_free(new_reftable);
> >>+            return ret;
> >>+        }
> >If we fail here (without crashing), the first clusters could be in their
> >original state or partially zeroed. Assuming that you fixed the above
> >bug, the on-disk state would be okay if we opened the image now because
> >the dirty flag would trigger an image repair; but we don't get the
> >repair when taking this failure path and we may have zeroed a refcount
> >table/block. This is probably a problem and we may have to make the BDS
> >unusable.
> >
> >The in-memory state of the L1 table is hopefully zeroed out, so it's
> >consistent with what is on disk.
> >
> >The in-memory state of the refcount table looks like it's not in sync
> >with the on-disk state. Note that while the dirty flag allows that the
> >on-disk state can be anything, the in-memory state is what we keep using
> >after a failure. The in-memory state isn't accurate at this point, but
> >we only create leaks. Lots of them, because we zeroed the L1 table, but
> >that's good enough. If refcounts are updated later, the old offsets
> >should still be valid.
> 
> If we set at least parts of the in-memory reftable to zero,
> everything probably breaks. Try to allocate a new cluster while the
> beginning of the reftable is zero. So we cannot take the on-disk
> reftable into memory.
> 
> Doing it the other way around, writing the in-memory reftable to
> disk on error won't work either. The refblocks may have been zeroed
> out, so we have exactly the same problem.
> 
> Therefore, to make the BDS usable after error, we have to (in the
> error path) read the on-disk reftable into memory, call the qcow2
> repair function and hope for the best.
> 
> Or, you know, we could go back to v11 which had my other version of
> this patch which always kept everything consistent. :-P

I'm not sure how important it is to keep the BDS usable after such an
unlikely error.

I started to write up a suggestion that we could do without
qcow2_alloc_clusters(), but just build up the first refcount block
ourselves. Doesn't really help us, because the new state is not what we
need here, but it made me aware that it assumes that the L1 table is
small enough to fit in the first refcount block and we would have to
fall back to discard if it isn't. Come to think of it, I suspect you
already make the same assumption without checking it.


Anyway, if we want to keep the BDS usable what is the right state?
We need references for the header, for the L1 table, for the refcount
table and all refcount blocks. If we decide to build a new in-memory
refcount table, the refcount blocks are only the refcount blocks that
are still referenced there, i.e. we don't have to worry about the
location of refcount blocks on the disk.

In fact, we can also safely change the refcount table offset to
cluster 1 and handle it together with the header. We'll then need one
refcount block for header/reftable and potentially another one for the
L1 table, which still must be in its original place (adding the
assumption that the L1 table doesn't cross a refblock boundary :-/).

Our refblock cache can hold 4 tables at least, so creating two refblocks
purely in memory is doable.

Leaves the question, is it worth the hassle?

Kevin

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 03/14] qcow2: Optimize bdrv_make_empty()
  2014-10-23  8:29       ` Kevin Wolf
@ 2014-10-23  8:36         ` Max Reitz
  2014-10-23  8:41           ` Kevin Wolf
  0 siblings, 1 reply; 43+ messages in thread
From: Max Reitz @ 2014-10-23  8:36 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, Stefan Hajnoczi

On 2014-10-23 at 10:29, Kevin Wolf wrote:
> Am 23.10.2014 um 09:46 hat Max Reitz geschrieben:
>> On 2014-10-22 at 20:35, Kevin Wolf wrote:
>>> Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
>>>> bdrv_make_empty() is currently only called if the current image
>>>> represents an external snapshot that has been committed to its base
>>>> image; it is therefore unlikely to have internal snapshots. In this
>>>> case, bdrv_make_empty() can be greatly sped up by emptying the L1 and
>>>> refcount table (while having the dirty flag set, which only works for
>>>> compat=1.1) and creating a trivial refcount structure.
>>>>
>>>> If there are snapshots or for compat=0.10, fall back to the simple
>>>> implementation (discard all clusters).
>>>>
>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>> Hey, this feels actually reviewable this time. :-)
>> I'm still unsure which version I like more. If it wasn't for the
>> math, I'd prefer the other version.
>>
>>>> diff --git a/block/blkdebug.c b/block/blkdebug.c
>>>> index e046b92..862d93b 100644
>>>> --- a/block/blkdebug.c
>>>> +++ b/block/blkdebug.c
>>>> @@ -195,6 +195,8 @@ static const char *event_names[BLKDBG_EVENT_MAX] = {
>>>>       [BLKDBG_PWRITEV]                        = "pwritev",
>>>>       [BLKDBG_PWRITEV_ZERO]                   = "pwritev_zero",
>>>>       [BLKDBG_PWRITEV_DONE]                   = "pwritev_done",
>>>> +
>>>> +    [BLKDBG_EMPTY_IMAGE_PREPARE]            = "empty_image_prepare",
>>>>   };
>>>>   static int get_event_by_name(const char *name, BlkDebugEvent *event)
>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>> index 1ef3a5f..16dece2 100644
>>>> --- a/block/qcow2.c
>>>> +++ b/block/qcow2.c
>>>> @@ -2232,24 +2232,137 @@ fail:
>>>>   static int qcow2_make_empty(BlockDriverState *bs)
>>>>   {
>>>> +    BDRVQcowState *s = bs->opaque;
>>>>       int ret = 0;
>>>> -    uint64_t start_sector;
>>>> -    int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
>>>> -    for (start_sector = 0; start_sector < bs->total_sectors;
>>>> -         start_sector += sector_step)
>>>> -    {
>>>> -        /* As this function is generally used after committing an external
>>>> -         * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
>>>> -         * default action for this kind of discard is to pass the discard,
>>>> -         * which will ideally result in an actually smaller image file, as
>>>> -         * is probably desired. */
>>>> -        ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
>>>> -                                     MIN(sector_step,
>>>> -                                         bs->total_sectors - start_sector),
>>>> -                                     QCOW2_DISCARD_SNAPSHOT, true);
>>>> +    if (s->snapshots || s->qcow_version < 3) {
>>>> +        uint64_t start_sector;
>>>> +        int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
>>>> +
>>>> +        /* If there are snapshots, every active cluster has to be discarded; and
>>>> +         * because compat=0.10 does not support setting the dirty flag, we have
>>>> +         * to use this fallback there as well */
>>>> +
>>>> +        for (start_sector = 0; start_sector < bs->total_sectors;
>>>> +             start_sector += sector_step)
>>>> +        {
>>>> +            /* As this function is generally used after committing an external
>>>> +             * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
>>>> +             * default action for this kind of discard is to pass the discard,
>>>> +             * which will ideally result in an actually smaller image file, as
>>>> +             * is probably desired. */
>>>> +            ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
>>>> +                                         MIN(sector_step,
>>>> +                                             bs->total_sectors - start_sector),
>>>> +                                         QCOW2_DISCARD_SNAPSHOT, true);
>>>> +            if (ret < 0) {
>>>> +                break;
>>>> +            }
>>>> +        }
>>> My first though was to add a return here, so the indentation level for
>>> the rest is one less. Probably a matter of taste, though.
>> I'd rather put the second branch into an own function.
> Works for me.
>
>>>> +    } else {
>>>> +        int l1_clusters;
>>>> +        int64_t offset;
>>>> +        uint64_t *new_reftable;
>>>> +        uint64_t rt_entry;
>>>> +        struct {
>>>> +            uint64_t l1_offset;
>>>> +            uint64_t reftable_offset;
>>>> +            uint32_t reftable_clusters;
>>>> +        } QEMU_PACKED l1_ofs_rt_ofs_cls;
>>>> +
>>>> +        ret = qcow2_cache_empty(bs, s->l2_table_cache);
>>>>           if (ret < 0) {
>>>> -            break;
>>>> +            return ret;
>>>> +        }
>>>> +
>>>> +        ret = qcow2_cache_empty(bs, s->refcount_block_cache);
>>>> +        if (ret < 0) {
>>>> +            return ret;
>>>> +        }
>>>> +
>>>> +        /* Refcounts will be broken utterly */
>>>> +        ret = qcow2_mark_dirty(bs);
>>>> +        if (ret < 0) {
>>>> +            return ret;
>>>> +        }
>>>> +
>>>> +        l1_clusters = DIV_ROUND_UP(s->l1_size,
>>>> +                                   s->cluster_size / sizeof(uint64_t));
>>>> +        new_reftable = g_try_new0(uint64_t, s->cluster_size / sizeof(uint64_t));
>>>> +        if (!new_reftable) {
>>>> +            return -ENOMEM;
>>>> +        }
>>>> +
>>>> +        BLKDBG_EVENT(bs->file, BLKDBG_EMPTY_IMAGE_PREPARE);
>>> Until here, the failure cases are trivially okay. The worst thing that
>>> could happen is that the image is needlessly marked as dirty.
>>>
>>>> +        /* Overwrite enough clusters at the beginning of the sectors to place
>>>> +         * the refcount table, a refcount block and the L1 table in; this may
>>>> +         * overwrite parts of the existing refcount and L1 table, which is not
>>>> +         * an issue because the dirty flag is set, complete data loss is in fact
>>>> +         * desired and partial data loss is consequently fine as well */
>>>> +        ret = bdrv_write_zeroes(bs->file, s->cluster_size / BDRV_SECTOR_SIZE,
>>>> +                                (2 + l1_clusters) * s->cluster_size /
>>>> +                                BDRV_SECTOR_SIZE, 0);
>>> If we crash at this point, we're _not_ okay any more. --verbose follows:
>>>
>>> On disk, we may have overwritten a refcount table or a refcount block
>>> with zeros. This is fine, we have the dirty flag set, so destroying any
>>> refcounting structure does no harm.
>>>
>>> We may also have overwritten an L1 or L2 table. As the commit correctly
>>> explains, this is doing partially what the function is supposed to do
>>> for the whole image. Affected data clusters are now read from the
>>> backing file. Good.
>>>
>>> However, we may also have overwritten data clusters that are still
>>> accessible using an L1/L2 table that hasn't been hit by this write
>>> operation. We're reading corrupted (zeroed out) data now instead of
>>> going to the backing file. Bug!
>> Oh, right, I forgot about the L1 table not always being at the start
>> of the file.
>>
>>> In my original suggestion I had an item where the L1 table was zeroed
>>> out first before the start of the image is zeroed. This would have
>>> avoided the bug.
>>>
>>>> +        if (ret < 0) {
>>>> +            g_free(new_reftable);
>>>> +            return ret;
>>>> +        }
>>> If we fail here (without crashing), the first clusters could be in their
>>> original state or partially zeroed. Assuming that you fixed the above
>>> bug, the on-disk state would be okay if we opened the image now because
>>> the dirty flag would trigger an image repair; but we don't get the
>>> repair when taking this failure path and we may have zeroed a refcount
>>> table/block. This is probably a problem and we may have to make the BDS
>>> unusable.
>>>
>>> The in-memory state of the L1 table is hopefully zeroed out, so it's
>>> consistent with what is on disk.
>>>
>>> The in-memory state of the refcount table looks like it's not in sync
>>> with the on-disk state. Note that while the dirty flag allows that the
>>> on-disk state can be anything, the in-memory state is what we keep using
>>> after a failure. The in-memory state isn't accurate at this point, but
>>> we only create leaks. Lots of them, because we zeroed the L1 table, but
>>> that's good enough. If refcounts are updated later, the old offsets
>>> should still be valid.
>> If we set at least parts of the in-memory reftable to zero,
>> everything probably breaks. Try to allocate a new cluster while the
>> beginning of the reftable is zero. So we cannot take the on-disk
>> reftable into memory.
>>
>> Doing it the other way around, writing the in-memory reftable to
>> disk on error won't work either. The refblocks may have been zeroed
>> out, so we have exactly the same problem.
>>
>> Therefore, to make the BDS usable after error, we have to (in the
>> error path) read the on-disk reftable into memory, call the qcow2
>> repair function and hope for the best.
>>
>> Or, you know, we could go back to v11 which had my other version of
>> this patch which always kept everything consistent. :-P
> I'm not sure how important it is to keep the BDS usable after such an
> unlikely error.
>
> I started to write up a suggestion that we could do without
> qcow2_alloc_clusters(), but just build up the first refcount block
> ourselves. Doesn't really help us, because the new state is not what we
> need here, but it made me aware that it assumes that the L1 table is
> small enough to fit in the first refcount block and we would have to
> fall back to discard if it isn't. Come to think of it, I suspect you
> already make the same assumption without checking it.
>
>
> Anyway, if we want to keep the BDS usable what is the right state?
> We need references for the header, for the L1 table, for the refcount
> table and all refcount blocks. If we decide to build a new in-memory
> refcount table, the refcount blocks are only the refcount blocks that
> are still referenced there, i.e. we don't have to worry about the
> location of refcount blocks on the disk.
>
> In fact, we can also safely change the refcount table offset to
> cluster 1 and handle it together with the header. We'll then need one
> refcount block for header/reftable and potentially another one for the
> L1 table, which still must be in its original place (adding the
> assumption that the L1 table doesn't cross a refblock boundary :-/).
>
> Our refblock cache can hold 4 tables at least, so creating two refblocks
> purely in memory is doable.
>
> Leaves the question, is it worth the hassle?

Probably not. Would you be fine with setting bs->drv to NULL in the 
problematic error paths?

Max

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 03/14] qcow2: Optimize bdrv_make_empty()
  2014-10-23  8:36         ` Max Reitz
@ 2014-10-23  8:41           ` Kevin Wolf
  2014-10-23  9:11             ` Max Reitz
  0 siblings, 1 reply; 43+ messages in thread
From: Kevin Wolf @ 2014-10-23  8:41 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-devel, Stefan Hajnoczi

Am 23.10.2014 um 10:36 hat Max Reitz geschrieben:
> On 2014-10-23 at 10:29, Kevin Wolf wrote:
> >Am 23.10.2014 um 09:46 hat Max Reitz geschrieben:
> >>On 2014-10-22 at 20:35, Kevin Wolf wrote:
> >>>Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
> >>>>bdrv_make_empty() is currently only called if the current image
> >>>>represents an external snapshot that has been committed to its base
> >>>>image; it is therefore unlikely to have internal snapshots. In this
> >>>>case, bdrv_make_empty() can be greatly sped up by emptying the L1 and
> >>>>refcount table (while having the dirty flag set, which only works for
> >>>>compat=1.1) and creating a trivial refcount structure.
> >>>>
> >>>>If there are snapshots or for compat=0.10, fall back to the simple
> >>>>implementation (discard all clusters).
> >>>>
> >>>>Signed-off-by: Max Reitz <mreitz@redhat.com>
> >>>Hey, this feels actually reviewable this time. :-)
> >>I'm still unsure which version I like more. If it wasn't for the
> >>math, I'd prefer the other version.
> >>
> >>>>diff --git a/block/blkdebug.c b/block/blkdebug.c
> >>>>index e046b92..862d93b 100644
> >>>>--- a/block/blkdebug.c
> >>>>+++ b/block/blkdebug.c
> >>>>@@ -195,6 +195,8 @@ static const char *event_names[BLKDBG_EVENT_MAX] = {
> >>>>      [BLKDBG_PWRITEV]                        = "pwritev",
> >>>>      [BLKDBG_PWRITEV_ZERO]                   = "pwritev_zero",
> >>>>      [BLKDBG_PWRITEV_DONE]                   = "pwritev_done",
> >>>>+
> >>>>+    [BLKDBG_EMPTY_IMAGE_PREPARE]            = "empty_image_prepare",
> >>>>  };
> >>>>  static int get_event_by_name(const char *name, BlkDebugEvent *event)
> >>>>diff --git a/block/qcow2.c b/block/qcow2.c
> >>>>index 1ef3a5f..16dece2 100644
> >>>>--- a/block/qcow2.c
> >>>>+++ b/block/qcow2.c
> >>>>@@ -2232,24 +2232,137 @@ fail:
> >>>>  static int qcow2_make_empty(BlockDriverState *bs)
> >>>>  {
> >>>>+    BDRVQcowState *s = bs->opaque;
> >>>>      int ret = 0;
> >>>>-    uint64_t start_sector;
> >>>>-    int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
> >>>>-    for (start_sector = 0; start_sector < bs->total_sectors;
> >>>>-         start_sector += sector_step)
> >>>>-    {
> >>>>-        /* As this function is generally used after committing an external
> >>>>-         * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
> >>>>-         * default action for this kind of discard is to pass the discard,
> >>>>-         * which will ideally result in an actually smaller image file, as
> >>>>-         * is probably desired. */
> >>>>-        ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
> >>>>-                                     MIN(sector_step,
> >>>>-                                         bs->total_sectors - start_sector),
> >>>>-                                     QCOW2_DISCARD_SNAPSHOT, true);
> >>>>+    if (s->snapshots || s->qcow_version < 3) {
> >>>>+        uint64_t start_sector;
> >>>>+        int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
> >>>>+
> >>>>+        /* If there are snapshots, every active cluster has to be discarded; and
> >>>>+         * because compat=0.10 does not support setting the dirty flag, we have
> >>>>+         * to use this fallback there as well */
> >>>>+
> >>>>+        for (start_sector = 0; start_sector < bs->total_sectors;
> >>>>+             start_sector += sector_step)
> >>>>+        {
> >>>>+            /* As this function is generally used after committing an external
> >>>>+             * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
> >>>>+             * default action for this kind of discard is to pass the discard,
> >>>>+             * which will ideally result in an actually smaller image file, as
> >>>>+             * is probably desired. */
> >>>>+            ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
> >>>>+                                         MIN(sector_step,
> >>>>+                                             bs->total_sectors - start_sector),
> >>>>+                                         QCOW2_DISCARD_SNAPSHOT, true);
> >>>>+            if (ret < 0) {
> >>>>+                break;
> >>>>+            }
> >>>>+        }
> >>>My first though was to add a return here, so the indentation level for
> >>>the rest is one less. Probably a matter of taste, though.
> >>I'd rather put the second branch into an own function.
> >Works for me.
> >
> >>>>+    } else {
> >>>>+        int l1_clusters;
> >>>>+        int64_t offset;
> >>>>+        uint64_t *new_reftable;
> >>>>+        uint64_t rt_entry;
> >>>>+        struct {
> >>>>+            uint64_t l1_offset;
> >>>>+            uint64_t reftable_offset;
> >>>>+            uint32_t reftable_clusters;
> >>>>+        } QEMU_PACKED l1_ofs_rt_ofs_cls;
> >>>>+
> >>>>+        ret = qcow2_cache_empty(bs, s->l2_table_cache);
> >>>>          if (ret < 0) {
> >>>>-            break;
> >>>>+            return ret;
> >>>>+        }
> >>>>+
> >>>>+        ret = qcow2_cache_empty(bs, s->refcount_block_cache);
> >>>>+        if (ret < 0) {
> >>>>+            return ret;
> >>>>+        }
> >>>>+
> >>>>+        /* Refcounts will be broken utterly */
> >>>>+        ret = qcow2_mark_dirty(bs);
> >>>>+        if (ret < 0) {
> >>>>+            return ret;
> >>>>+        }
> >>>>+
> >>>>+        l1_clusters = DIV_ROUND_UP(s->l1_size,
> >>>>+                                   s->cluster_size / sizeof(uint64_t));
> >>>>+        new_reftable = g_try_new0(uint64_t, s->cluster_size / sizeof(uint64_t));
> >>>>+        if (!new_reftable) {
> >>>>+            return -ENOMEM;
> >>>>+        }
> >>>>+
> >>>>+        BLKDBG_EVENT(bs->file, BLKDBG_EMPTY_IMAGE_PREPARE);
> >>>Until here, the failure cases are trivially okay. The worst thing that
> >>>could happen is that the image is needlessly marked as dirty.
> >>>
> >>>>+        /* Overwrite enough clusters at the beginning of the sectors to place
> >>>>+         * the refcount table, a refcount block and the L1 table in; this may
> >>>>+         * overwrite parts of the existing refcount and L1 table, which is not
> >>>>+         * an issue because the dirty flag is set, complete data loss is in fact
> >>>>+         * desired and partial data loss is consequently fine as well */
> >>>>+        ret = bdrv_write_zeroes(bs->file, s->cluster_size / BDRV_SECTOR_SIZE,
> >>>>+                                (2 + l1_clusters) * s->cluster_size /
> >>>>+                                BDRV_SECTOR_SIZE, 0);
> >>>If we crash at this point, we're _not_ okay any more. --verbose follows:
> >>>
> >>>On disk, we may have overwritten a refcount table or a refcount block
> >>>with zeros. This is fine, we have the dirty flag set, so destroying any
> >>>refcounting structure does no harm.
> >>>
> >>>We may also have overwritten an L1 or L2 table. As the commit correctly
> >>>explains, this is doing partially what the function is supposed to do
> >>>for the whole image. Affected data clusters are now read from the
> >>>backing file. Good.
> >>>
> >>>However, we may also have overwritten data clusters that are still
> >>>accessible using an L1/L2 table that hasn't been hit by this write
> >>>operation. We're reading corrupted (zeroed out) data now instead of
> >>>going to the backing file. Bug!
> >>Oh, right, I forgot about the L1 table not always being at the start
> >>of the file.
> >>
> >>>In my original suggestion I had an item where the L1 table was zeroed
> >>>out first before the start of the image is zeroed. This would have
> >>>avoided the bug.
> >>>
> >>>>+        if (ret < 0) {
> >>>>+            g_free(new_reftable);
> >>>>+            return ret;
> >>>>+        }
> >>>If we fail here (without crashing), the first clusters could be in their
> >>>original state or partially zeroed. Assuming that you fixed the above
> >>>bug, the on-disk state would be okay if we opened the image now because
> >>>the dirty flag would trigger an image repair; but we don't get the
> >>>repair when taking this failure path and we may have zeroed a refcount
> >>>table/block. This is probably a problem and we may have to make the BDS
> >>>unusable.
> >>>
> >>>The in-memory state of the L1 table is hopefully zeroed out, so it's
> >>>consistent with what is on disk.
> >>>
> >>>The in-memory state of the refcount table looks like it's not in sync
> >>>with the on-disk state. Note that while the dirty flag allows that the
> >>>on-disk state can be anything, the in-memory state is what we keep using
> >>>after a failure. The in-memory state isn't accurate at this point, but
> >>>we only create leaks. Lots of them, because we zeroed the L1 table, but
> >>>that's good enough. If refcounts are updated later, the old offsets
> >>>should still be valid.
> >>If we set at least parts of the in-memory reftable to zero,
> >>everything probably breaks. Try to allocate a new cluster while the
> >>beginning of the reftable is zero. So we cannot take the on-disk
> >>reftable into memory.
> >>
> >>Doing it the other way around, writing the in-memory reftable to
> >>disk on error won't work either. The refblocks may have been zeroed
> >>out, so we have exactly the same problem.
> >>
> >>Therefore, to make the BDS usable after error, we have to (in the
> >>error path) read the on-disk reftable into memory, call the qcow2
> >>repair function and hope for the best.
> >>
> >>Or, you know, we could go back to v11 which had my other version of
> >>this patch which always kept everything consistent. :-P
> >I'm not sure how important it is to keep the BDS usable after such an
> >unlikely error.
> >
> >I started to write up a suggestion that we could do without
> >qcow2_alloc_clusters(), but just build up the first refcount block
> >ourselves. Doesn't really help us, because the new state is not what we
> >need here, but it made me aware that it assumes that the L1 table is
> >small enough to fit in the first refcount block and we would have to
> >fall back to discard if it isn't. Come to think of it, I suspect you
> >already make the same assumption without checking it.
> >
> >
> >Anyway, if we want to keep the BDS usable what is the right state?
> >We need references for the header, for the L1 table, for the refcount
> >table and all refcount blocks. If we decide to build a new in-memory
> >refcount table, the refcount blocks are only the refcount blocks that
> >are still referenced there, i.e. we don't have to worry about the
> >location of refcount blocks on the disk.
> >
> >In fact, we can also safely change the refcount table offset to
> >cluster 1 and handle it together with the header. We'll then need one
> >refcount block for header/reftable and potentially another one for the
> >L1 table, which still must be in its original place (adding the
> >assumption that the L1 table doesn't cross a refblock boundary :-/).
> >
> >Our refblock cache can hold 4 tables at least, so creating two refblocks
> >purely in memory is doable.
> >
> >Leaves the question, is it worth the hassle?
> 
> Probably not. Would you be fine with setting bs->drv to NULL in the
> problematic error paths?

By calling qcow2_signal_corruption(), right? I think that's fine.

What about the assumption that 3 + l1_size fits in one refcount block?
Do we need to check it even now?

Kevin

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 03/14] qcow2: Optimize bdrv_make_empty()
  2014-10-23  8:41           ` Kevin Wolf
@ 2014-10-23  9:11             ` Max Reitz
  2014-10-23  9:42               ` Kevin Wolf
  0 siblings, 1 reply; 43+ messages in thread
From: Max Reitz @ 2014-10-23  9:11 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, Stefan Hajnoczi

On 2014-10-23 at 10:41, Kevin Wolf wrote:
> Am 23.10.2014 um 10:36 hat Max Reitz geschrieben:
>> On 2014-10-23 at 10:29, Kevin Wolf wrote:
>>> Am 23.10.2014 um 09:46 hat Max Reitz geschrieben:
>>>> On 2014-10-22 at 20:35, Kevin Wolf wrote:
>>>>> Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
>>>>>> bdrv_make_empty() is currently only called if the current image
>>>>>> represents an external snapshot that has been committed to its base
>>>>>> image; it is therefore unlikely to have internal snapshots. In this
>>>>>> case, bdrv_make_empty() can be greatly sped up by emptying the L1 and
>>>>>> refcount table (while having the dirty flag set, which only works for
>>>>>> compat=1.1) and creating a trivial refcount structure.
>>>>>>
>>>>>> If there are snapshots or for compat=0.10, fall back to the simple
>>>>>> implementation (discard all clusters).
>>>>>>
>>>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>>> Hey, this feels actually reviewable this time. :-)
>>>> I'm still unsure which version I like more. If it wasn't for the
>>>> math, I'd prefer the other version.
>>>>
>>>>>> diff --git a/block/blkdebug.c b/block/blkdebug.c
>>>>>> index e046b92..862d93b 100644
>>>>>> --- a/block/blkdebug.c
>>>>>> +++ b/block/blkdebug.c
>>>>>> @@ -195,6 +195,8 @@ static const char *event_names[BLKDBG_EVENT_MAX] = {
>>>>>>       [BLKDBG_PWRITEV]                        = "pwritev",
>>>>>>       [BLKDBG_PWRITEV_ZERO]                   = "pwritev_zero",
>>>>>>       [BLKDBG_PWRITEV_DONE]                   = "pwritev_done",
>>>>>> +
>>>>>> +    [BLKDBG_EMPTY_IMAGE_PREPARE]            = "empty_image_prepare",
>>>>>>   };
>>>>>>   static int get_event_by_name(const char *name, BlkDebugEvent *event)
>>>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>>>> index 1ef3a5f..16dece2 100644
>>>>>> --- a/block/qcow2.c
>>>>>> +++ b/block/qcow2.c
>>>>>> @@ -2232,24 +2232,137 @@ fail:
>>>>>>   static int qcow2_make_empty(BlockDriverState *bs)
>>>>>>   {
>>>>>> +    BDRVQcowState *s = bs->opaque;
>>>>>>       int ret = 0;
>>>>>> -    uint64_t start_sector;
>>>>>> -    int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
>>>>>> -    for (start_sector = 0; start_sector < bs->total_sectors;
>>>>>> -         start_sector += sector_step)
>>>>>> -    {
>>>>>> -        /* As this function is generally used after committing an external
>>>>>> -         * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
>>>>>> -         * default action for this kind of discard is to pass the discard,
>>>>>> -         * which will ideally result in an actually smaller image file, as
>>>>>> -         * is probably desired. */
>>>>>> -        ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
>>>>>> -                                     MIN(sector_step,
>>>>>> -                                         bs->total_sectors - start_sector),
>>>>>> -                                     QCOW2_DISCARD_SNAPSHOT, true);
>>>>>> +    if (s->snapshots || s->qcow_version < 3) {
>>>>>> +        uint64_t start_sector;
>>>>>> +        int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
>>>>>> +
>>>>>> +        /* If there are snapshots, every active cluster has to be discarded; and
>>>>>> +         * because compat=0.10 does not support setting the dirty flag, we have
>>>>>> +         * to use this fallback there as well */
>>>>>> +
>>>>>> +        for (start_sector = 0; start_sector < bs->total_sectors;
>>>>>> +             start_sector += sector_step)
>>>>>> +        {
>>>>>> +            /* As this function is generally used after committing an external
>>>>>> +             * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
>>>>>> +             * default action for this kind of discard is to pass the discard,
>>>>>> +             * which will ideally result in an actually smaller image file, as
>>>>>> +             * is probably desired. */
>>>>>> +            ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
>>>>>> +                                         MIN(sector_step,
>>>>>> +                                             bs->total_sectors - start_sector),
>>>>>> +                                         QCOW2_DISCARD_SNAPSHOT, true);
>>>>>> +            if (ret < 0) {
>>>>>> +                break;
>>>>>> +            }
>>>>>> +        }
>>>>> My first though was to add a return here, so the indentation level for
>>>>> the rest is one less. Probably a matter of taste, though.
>>>> I'd rather put the second branch into an own function.
>>> Works for me.
>>>
>>>>>> +    } else {
>>>>>> +        int l1_clusters;
>>>>>> +        int64_t offset;
>>>>>> +        uint64_t *new_reftable;
>>>>>> +        uint64_t rt_entry;
>>>>>> +        struct {
>>>>>> +            uint64_t l1_offset;
>>>>>> +            uint64_t reftable_offset;
>>>>>> +            uint32_t reftable_clusters;
>>>>>> +        } QEMU_PACKED l1_ofs_rt_ofs_cls;
>>>>>> +
>>>>>> +        ret = qcow2_cache_empty(bs, s->l2_table_cache);
>>>>>>           if (ret < 0) {
>>>>>> -            break;
>>>>>> +            return ret;
>>>>>> +        }
>>>>>> +
>>>>>> +        ret = qcow2_cache_empty(bs, s->refcount_block_cache);
>>>>>> +        if (ret < 0) {
>>>>>> +            return ret;
>>>>>> +        }
>>>>>> +
>>>>>> +        /* Refcounts will be broken utterly */
>>>>>> +        ret = qcow2_mark_dirty(bs);
>>>>>> +        if (ret < 0) {
>>>>>> +            return ret;
>>>>>> +        }
>>>>>> +
>>>>>> +        l1_clusters = DIV_ROUND_UP(s->l1_size,
>>>>>> +                                   s->cluster_size / sizeof(uint64_t));
>>>>>> +        new_reftable = g_try_new0(uint64_t, s->cluster_size / sizeof(uint64_t));
>>>>>> +        if (!new_reftable) {
>>>>>> +            return -ENOMEM;
>>>>>> +        }
>>>>>> +
>>>>>> +        BLKDBG_EVENT(bs->file, BLKDBG_EMPTY_IMAGE_PREPARE);
>>>>> Until here, the failure cases are trivially okay. The worst thing that
>>>>> could happen is that the image is needlessly marked as dirty.
>>>>>
>>>>>> +        /* Overwrite enough clusters at the beginning of the sectors to place
>>>>>> +         * the refcount table, a refcount block and the L1 table in; this may
>>>>>> +         * overwrite parts of the existing refcount and L1 table, which is not
>>>>>> +         * an issue because the dirty flag is set, complete data loss is in fact
>>>>>> +         * desired and partial data loss is consequently fine as well */
>>>>>> +        ret = bdrv_write_zeroes(bs->file, s->cluster_size / BDRV_SECTOR_SIZE,
>>>>>> +                                (2 + l1_clusters) * s->cluster_size /
>>>>>> +                                BDRV_SECTOR_SIZE, 0);
>>>>> If we crash at this point, we're _not_ okay any more. --verbose follows:
>>>>>
>>>>> On disk, we may have overwritten a refcount table or a refcount block
>>>>> with zeros. This is fine, we have the dirty flag set, so destroying any
>>>>> refcounting structure does no harm.
>>>>>
>>>>> We may also have overwritten an L1 or L2 table. As the commit correctly
>>>>> explains, this is doing partially what the function is supposed to do
>>>>> for the whole image. Affected data clusters are now read from the
>>>>> backing file. Good.
>>>>>
>>>>> However, we may also have overwritten data clusters that are still
>>>>> accessible using an L1/L2 table that hasn't been hit by this write
>>>>> operation. We're reading corrupted (zeroed out) data now instead of
>>>>> going to the backing file. Bug!
>>>> Oh, right, I forgot about the L1 table not always being at the start
>>>> of the file.
>>>>
>>>>> In my original suggestion I had an item where the L1 table was zeroed
>>>>> out first before the start of the image is zeroed. This would have
>>>>> avoided the bug.
>>>>>
>>>>>> +        if (ret < 0) {
>>>>>> +            g_free(new_reftable);
>>>>>> +            return ret;
>>>>>> +        }
>>>>> If we fail here (without crashing), the first clusters could be in their
>>>>> original state or partially zeroed. Assuming that you fixed the above
>>>>> bug, the on-disk state would be okay if we opened the image now because
>>>>> the dirty flag would trigger an image repair; but we don't get the
>>>>> repair when taking this failure path and we may have zeroed a refcount
>>>>> table/block. This is probably a problem and we may have to make the BDS
>>>>> unusable.
>>>>>
>>>>> The in-memory state of the L1 table is hopefully zeroed out, so it's
>>>>> consistent with what is on disk.
>>>>>
>>>>> The in-memory state of the refcount table looks like it's not in sync
>>>>> with the on-disk state. Note that while the dirty flag allows that the
>>>>> on-disk state can be anything, the in-memory state is what we keep using
>>>>> after a failure. The in-memory state isn't accurate at this point, but
>>>>> we only create leaks. Lots of them, because we zeroed the L1 table, but
>>>>> that's good enough. If refcounts are updated later, the old offsets
>>>>> should still be valid.
>>>> If we set at least parts of the in-memory reftable to zero,
>>>> everything probably breaks. Try to allocate a new cluster while the
>>>> beginning of the reftable is zero. So we cannot take the on-disk
>>>> reftable into memory.
>>>>
>>>> Doing it the other way around, writing the in-memory reftable to
>>>> disk on error won't work either. The refblocks may have been zeroed
>>>> out, so we have exactly the same problem.
>>>>
>>>> Therefore, to make the BDS usable after error, we have to (in the
>>>> error path) read the on-disk reftable into memory, call the qcow2
>>>> repair function and hope for the best.
>>>>
>>>> Or, you know, we could go back to v11 which had my other version of
>>>> this patch which always kept everything consistent. :-P
>>> I'm not sure how important it is to keep the BDS usable after such an
>>> unlikely error.
>>>
>>> I started to write up a suggestion that we could do without
>>> qcow2_alloc_clusters(), but just build up the first refcount block
>>> ourselves. Doesn't really help us, because the new state is not what we
>>> need here, but it made me aware that it assumes that the L1 table is
>>> small enough to fit in the first refcount block and we would have to
>>> fall back to discard if it isn't. Come to think of it, I suspect you
>>> already make the same assumption without checking it.
>>>
>>>
>>> Anyway, if we want to keep the BDS usable what is the right state?
>>> We need references for the header, for the L1 table, for the refcount
>>> table and all refcount blocks. If we decide to build a new in-memory
>>> refcount table, the refcount blocks are only the refcount blocks that
>>> are still referenced there, i.e. we don't have to worry about the
>>> location of refcount blocks on the disk.
>>>
>>> In fact, we can also safely change the refcount table offset to
>>> cluster 1 and handle it together with the header. We'll then need one
>>> refcount block for header/reftable and potentially another one for the
>>> L1 table, which still must be in its original place (adding the
>>> assumption that the L1 table doesn't cross a refblock boundary :-/).
>>>
>>> Our refblock cache can hold 4 tables at least, so creating two refblocks
>>> purely in memory is doable.
>>>
>>> Leaves the question, is it worth the hassle?
>> Probably not. Would you be fine with setting bs->drv to NULL in the
>> problematic error paths?
> By calling qcow2_signal_corruption(), right? I think that's fine.

I don't know. The image isn't corrupted, it's just dirty. When you open 
it, it'll work fine. I'd just output an own error here and set bs->drv 
to NULL.

> What about the assumption that 3 + l1_size fits in one refcount block?
> Do we need to check it even now?

The alternative would be not to allocate 3 + l1_size, but rather just 3. 
Then we have a working refcount structure and that's the most important 
thing (the L1 table is not yet accounted for, but repair can fix that).

Afterwards, we allocate the L1 table. With this change, we don't know 
its offset yet, so we may have to change it again. But we can easily do 
that by just setting s->l1_size to 0 and call qcow2_grow_l1_table().

Max

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 03/14] qcow2: Optimize bdrv_make_empty()
  2014-10-23  9:11             ` Max Reitz
@ 2014-10-23  9:42               ` Kevin Wolf
  2014-10-23  9:44                 ` Max Reitz
  0 siblings, 1 reply; 43+ messages in thread
From: Kevin Wolf @ 2014-10-23  9:42 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-devel, Stefan Hajnoczi

Am 23.10.2014 um 11:11 hat Max Reitz geschrieben:
> >>>Leaves the question, is it worth the hassle?
> >>Probably not. Would you be fine with setting bs->drv to NULL in the
> >>problematic error paths?
> >By calling qcow2_signal_corruption(), right? I think that's fine.
> 
> I don't know. The image isn't corrupted, it's just dirty. When you
> open it, it'll work fine. I'd just output an own error here and set
> bs->drv to NULL.

Okay.

> >What about the assumption that 3 + l1_size fits in one refcount block?
> >Do we need to check it even now?
> 
> The alternative would be not to allocate 3 + l1_size, but rather
> just 3. Then we have a working refcount structure and that's the
> most important thing (the L1 table is not yet accounted for, but
> repair can fix that).
> 
> Afterwards, we allocate the L1 table. With this change, we don't
> know its offset yet, so we may have to change it again. But we can
> easily do that by just setting s->l1_size to 0 and call
> qcow2_grow_l1_table().

How do you make sure that the L1 table always stays zeroed then?

And seriously, if you have an L1 table that doesn't fit in one refcount
block, you're doing something wrong and deserve the slow fallback.

Or, if we really want, we could zero out 2 + l1_size + num_rb clusters
and hook up multiple refcount blocks. The new limit would then be at the
point where the refcount table exceeds a cluster.

Kevin

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 03/14] qcow2: Optimize bdrv_make_empty()
  2014-10-23  9:42               ` Kevin Wolf
@ 2014-10-23  9:44                 ` Max Reitz
  0 siblings, 0 replies; 43+ messages in thread
From: Max Reitz @ 2014-10-23  9:44 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, Stefan Hajnoczi

On 2014-10-23 at 11:42, Kevin Wolf wrote:
> Am 23.10.2014 um 11:11 hat Max Reitz geschrieben:
>>>>> Leaves the question, is it worth the hassle?
>>>> Probably not. Would you be fine with setting bs->drv to NULL in the
>>>> problematic error paths?
>>> By calling qcow2_signal_corruption(), right? I think that's fine.
>> I don't know. The image isn't corrupted, it's just dirty. When you
>> open it, it'll work fine. I'd just output an own error here and set
>> bs->drv to NULL.
> Okay.
>
>>> What about the assumption that 3 + l1_size fits in one refcount block?
>>> Do we need to check it even now?
>> The alternative would be not to allocate 3 + l1_size, but rather
>> just 3. Then we have a working refcount structure and that's the
>> most important thing (the L1 table is not yet accounted for, but
>> repair can fix that).
>>
>> Afterwards, we allocate the L1 table. With this change, we don't
>> know its offset yet, so we may have to change it again. But we can
>> easily do that by just setting s->l1_size to 0 and call
>> qcow2_grow_l1_table().
> How do you make sure that the L1 table always stays zeroed then?

Details...

> And seriously, if you have an L1 table that doesn't fit in one refcount
> block, you're doing something wrong and deserve the slow fallback.

Probably so, yes.

> Or, if we really want, we could zero out 2 + l1_size + num_rb clusters
> and hook up multiple refcount blocks. The new limit would then be at the
> point where the refcount table exceeds a cluster.

Well, if we're falling back to maths, v11 is always there. :-P

Max

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 05/14] blockjob: Add "ready" field
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 05/14] blockjob: Add "ready" field Max Reitz
@ 2014-10-23 10:01   ` Kevin Wolf
  0 siblings, 0 replies; 43+ messages in thread
From: Kevin Wolf @ 2014-10-23 10:01 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-devel, Stefan Hajnoczi

Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
> When a block job signals readiness, this is currently reported only
> through QMP. If qemu wants to use block jobs for internal tasks, there
> needs to be another way to correctly detect when a block job may be
> completed.
> 
> For this reason, introduce a bool "ready" which is set when the block
> job may be completed.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>

Reviewed-by: Kevin Wolf <kwolf@redhat.com>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 06/14] iotests: Omit length/offset test in 040 and 041
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 06/14] iotests: Omit length/offset test in 040 and 041 Max Reitz
@ 2014-10-23 10:06   ` Kevin Wolf
  2014-10-23 10:07     ` Max Reitz
  0 siblings, 1 reply; 43+ messages in thread
From: Kevin Wolf @ 2014-10-23 10:06 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-devel, Stefan Hajnoczi

Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
> As of a follow-up patch to this one, the length of a mirror block job
> will no longer directly depend on the size of the block device;
> therefore, drop these checks from this test. Instead, just check whether
> the final offset equals the block job length.
> 
> As 041 uses the wait_until_completed function from iotests.py, the same
> applies there as well which in turn affects tests 030, 055 and 056. On
> the other hand, a block job's length does not have to be related to the
> length of the image file in the first place, so that check was
> questionable anyway.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>

Possibly a result of rebasing over a long time, but 041 still has some
checks of len left after this patch.

Kevin

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 06/14] iotests: Omit length/offset test in 040 and 041
  2014-10-23 10:06   ` Kevin Wolf
@ 2014-10-23 10:07     ` Max Reitz
  0 siblings, 0 replies; 43+ messages in thread
From: Max Reitz @ 2014-10-23 10:07 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, Stefan Hajnoczi

On 2014-10-23 at 12:06, Kevin Wolf wrote:
> Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
>> As of a follow-up patch to this one, the length of a mirror block job
>> will no longer directly depend on the size of the block device;
>> therefore, drop these checks from this test. Instead, just check whether
>> the final offset equals the block job length.
>>
>> As 041 uses the wait_until_completed function from iotests.py, the same
>> applies there as well which in turn affects tests 030, 055 and 056. On
>> the other hand, a block job's length does not have to be related to the
>> length of the image file in the first place, so that check was
>> questionable anyway.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> Reviewed-by: Eric Blake <eblake@redhat.com>
> Possibly a result of rebasing over a long time, but 041 still has some
> checks of len left after this patch.

Very likely, yes. I'll fix it.

Max

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 07/14] block/mirror: Improve progress report
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 07/14] block/mirror: Improve progress report Max Reitz
@ 2014-10-23 10:52   ` Kevin Wolf
  2014-10-23 11:09     ` Max Reitz
  2014-10-23 13:10     ` Eric Blake
  0 siblings, 2 replies; 43+ messages in thread
From: Kevin Wolf @ 2014-10-23 10:52 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-devel, Stefan Hajnoczi

Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
> Instead of taking the total length of the block device as the block
> job's length, use the number of dirty sectors. The progress is now the
> number of sectors mirrored to the target block device. Note that this
> may result in the job's length increasing during operation, which is
> however in fact desirable.

More importantly, because it might surprise management tools, is that
the progress (as in offset/len) can actually decrease now.

I can't say whether that creates any problem, I'll rely on Eric's
Reviewed-by for that.

> Signed-off-by: Max Reitz <mreitz@redhat.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>
> ---
>  block/mirror.c | 34 ++++++++++++++++++++++------------
>  1 file changed, 22 insertions(+), 12 deletions(-)

> @@ -409,6 +416,12 @@ static void coroutine_fn mirror_run(void *opaque)
>          }
>  
>          cnt = bdrv_get_dirty_count(bs, s->dirty_bitmap);
> +        /* s->common.offset contains the number of bytes already processed so
> +         * far, cnt is the number of dirty sectors remaining and
> +         * s->sectors_in_flight is the number of sectors currently being
> +         * processed; together those are the current total operation length */
> +        s->common.len = s->common.offset +
> +                        (cnt + s->sectors_in_flight) * BDRV_SECTOR_SIZE;

Isn't s->sectors_in_flight still contained in cnt? If I understand
correctly, sectors are only marked as clean at the same time as
s->sectors_in_flight is decremented again.

Kevin

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 07/14] block/mirror: Improve progress report
  2014-10-23 10:52   ` Kevin Wolf
@ 2014-10-23 11:09     ` Max Reitz
  2014-10-23 12:03       ` Kevin Wolf
  2014-10-23 13:10     ` Eric Blake
  1 sibling, 1 reply; 43+ messages in thread
From: Max Reitz @ 2014-10-23 11:09 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, Stefan Hajnoczi

On 2014-10-23 at 12:52, Kevin Wolf wrote:
> Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
>> Instead of taking the total length of the block device as the block
>> job's length, use the number of dirty sectors. The progress is now the
>> number of sectors mirrored to the target block device. Note that this
>> may result in the job's length increasing during operation, which is
>> however in fact desirable.
> More importantly, because it might surprise management tools, is that
> the progress (as in offset/len) can actually decrease now.
>
> I can't say whether that creates any problem, I'll rely on Eric's
> Reviewed-by for that.

Eric said, it would rather solve problems.

>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> Reviewed-by: Eric Blake <eblake@redhat.com>
>> ---
>>   block/mirror.c | 34 ++++++++++++++++++++++------------
>>   1 file changed, 22 insertions(+), 12 deletions(-)
>> @@ -409,6 +416,12 @@ static void coroutine_fn mirror_run(void *opaque)
>>           }
>>   
>>           cnt = bdrv_get_dirty_count(bs, s->dirty_bitmap);
>> +        /* s->common.offset contains the number of bytes already processed so
>> +         * far, cnt is the number of dirty sectors remaining and
>> +         * s->sectors_in_flight is the number of sectors currently being
>> +         * processed; together those are the current total operation length */
>> +        s->common.len = s->common.offset +
>> +                        (cnt + s->sectors_in_flight) * BDRV_SECTOR_SIZE;
> Isn't s->sectors_in_flight still contained in cnt? If I understand
> correctly, sectors are only marked as clean at the same time as
> s->sectors_in_flight is decremented again.

As far as I see, bdrv_reset_dirty() is called in mirror_iteration() 
right before s->sectors_in_flight is incremented.

Max

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 08/14] qemu-img: Implement commit like QMP
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 08/14] qemu-img: Implement commit like QMP Max Reitz
  2014-10-22 16:22   ` Eric Blake
@ 2014-10-23 11:59   ` Kevin Wolf
  2014-10-23 12:35     ` Max Reitz
  1 sibling, 1 reply; 43+ messages in thread
From: Kevin Wolf @ 2014-10-23 11:59 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-devel, Stefan Hajnoczi

Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
> qemu-img should use QMP commands whenever possible in order to ensure
> feature completeness of both online and offline image operations. As
> qemu-img itself has no access to QMP (since this would basically require
> just everything being linked into qemu-img), imitate QMP's
> implementation of block-commit by using commit_active_start() and then
> waiting for the block job to finish.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/Makefile.objs |  3 +-
>  qemu-img.c          | 82 ++++++++++++++++++++++++++++++++++++++++-------------
>  2 files changed, 64 insertions(+), 21 deletions(-)
> 
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index 27911b6..04b0e43 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -9,7 +9,7 @@ block-obj-y += block-backend.o snapshot.o qapi.o
>  block-obj-$(CONFIG_WIN32) += raw-win32.o win32-aio.o
>  block-obj-$(CONFIG_POSIX) += raw-posix.o
>  block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
> -block-obj-y += null.o
> +block-obj-y += null.o mirror.o
>  
>  block-obj-y += nbd.o nbd-client.o sheepdog.o
>  block-obj-$(CONFIG_LIBISCSI) += iscsi.o
> @@ -23,7 +23,6 @@ block-obj-y += accounting.o
>  
>  common-obj-y += stream.o
>  common-obj-y += commit.o
> -common-obj-y += mirror.o
>  common-obj-y += backup.o
>  
>  iscsi.o-cflags     := $(LIBISCSI_CFLAGS)
> diff --git a/qemu-img.c b/qemu-img.c
> index 09e7e72..f1f2857 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -31,6 +31,7 @@
>  #include "sysemu/sysemu.h"
>  #include "sysemu/block-backend.h"
>  #include "block/block_int.h"
> +#include "block/blockjob.h"
>  #include "block/qapi.h"
>  #include <getopt.h>
>  
> @@ -715,13 +716,47 @@ fail:
>      return ret;
>  }
>  
> +typedef struct CommonBlockJobCBInfo {
> +    BlockDriverState *bs;
> +    Error **errp;
> +} CommonBlockJobCBInfo;
> +
> +static void common_block_job_cb(void *opaque, int ret)
> +{
> +    CommonBlockJobCBInfo *cbi = opaque;
> +
> +    if (ret < 0) {
> +        error_setg_errno(cbi->errp, -ret, "Block job failed");
> +    }
> +
> +    /* Drop this block job's reference */
> +    bdrv_unref(cbi->bs);
> +}
> +
> +static void run_block_job(BlockJob *job, Error **errp)
> +{
> +    AioContext *aio_context = bdrv_get_aio_context(job->bs);
> +
> +    do {
> +        aio_poll(aio_context, true);
> +
> +        if (!job->busy && !job->ready) {
> +            block_job_resume(job);
> +        }

I wasn't quite sure what this is for. With BLOCKDEV_ON_ERROR_REPORT, why
would the job ever be paused?

While trying out what would happen with failing requests (both for
checking this and what error message I would get), my image got
corrupted (as I told you on IRC):

$ qemu-img create -f qcow2 -b /tmp/backing.qcow2 /tmp/test.qcow2 64M
$ qemu-img commit blkdebug::/tmp/test.qcow2
qemu-img: Could not empty blkdebug::/tmp/test.qcow2: Operation not supported
$ qemu-io -c 'write 0 1M' /tmp/test.qcow2
qcow2: Marking image as corrupt: Preventing invalid write on metadata
(overlaps with active L1 table); further corruption events will be
suppressed
write failed: Input/output error

> +    } while (!job->ready);
> +
> +    block_job_complete_sync(job, errp);
> +}
> +
>  static int img_commit(int argc, char **argv)
>  {
>      int c, ret, flags;
>      const char *filename, *fmt, *cache;
>      BlockBackend *blk;
> -    BlockDriverState *bs;
> +    BlockDriverState *bs, *base_bs;
>      bool quiet = false;
> +    Error *local_err = NULL;
> +    CommonBlockJobCBInfo cbi;
>  
>      fmt = NULL;
>      cache = BDRV_DEFAULT_CACHE;
> @@ -764,29 +799,38 @@ static int img_commit(int argc, char **argv)
>      }
>      bs = blk_bs(blk);
>  
> -    ret = bdrv_commit(bs);
> -    switch(ret) {
> -    case 0:
> -        qprintf(quiet, "Image committed.\n");
> -        break;
> -    case -ENOENT:
> -        error_report("No disk inserted");
> -        break;
> -    case -EACCES:
> -        error_report("Image is read-only");
> -        break;
> -    case -ENOTSUP:
> -        error_report("Image is already committed");
> -        break;
> -    default:
> -        error_report("Error while committing image");
> -        break;
> +    /* This is different from QMP, which by default uses the deepest file in the
> +     * backing chain (i.e., the very base); however, the traditional behavior of
> +     * qemu-img commit is using the immediate backing file. */
> +    base_bs = bs->backing_hd;
> +    if (!base_bs) {
> +        error_set(&local_err, QERR_BASE_NOT_FOUND, "NULL");

$ qemu-img create -f qcow2 /tmp/test.qcow2 64M
Formatting '/tmp/test.qcow2', fmt=qcow2 size=67108864 encryption=off cluster_size=65536 lazy_refcounts=off
$ qemu-img commit /tmp/test.qcow2
qemu-img: Base 'NULL' not found

Do we expect any user to understand what we want to tell them? This
should clearly be something along the lines of error_setg(&local_err,
"Image doesn't have a backing file"). (Before this patch, it said
"Image is already committed", which isn't great either, but not as bad
as the new message)

Kevin

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 07/14] block/mirror: Improve progress report
  2014-10-23 11:09     ` Max Reitz
@ 2014-10-23 12:03       ` Kevin Wolf
  0 siblings, 0 replies; 43+ messages in thread
From: Kevin Wolf @ 2014-10-23 12:03 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-devel, Stefan Hajnoczi

Am 23.10.2014 um 13:09 hat Max Reitz geschrieben:
> On 2014-10-23 at 12:52, Kevin Wolf wrote:
> >Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
> >>Instead of taking the total length of the block device as the block
> >>job's length, use the number of dirty sectors. The progress is now the
> >>number of sectors mirrored to the target block device. Note that this
> >>may result in the job's length increasing during operation, which is
> >>however in fact desirable.
> >More importantly, because it might surprise management tools, is that
> >the progress (as in offset/len) can actually decrease now.
> >
> >I can't say whether that creates any problem, I'll rely on Eric's
> >Reviewed-by for that.
> 
> Eric said, it would rather solve problems.
> 
> >>Signed-off-by: Max Reitz <mreitz@redhat.com>
> >>Reviewed-by: Eric Blake <eblake@redhat.com>
> >>---
> >>  block/mirror.c | 34 ++++++++++++++++++++++------------
> >>  1 file changed, 22 insertions(+), 12 deletions(-)
> >>@@ -409,6 +416,12 @@ static void coroutine_fn mirror_run(void *opaque)
> >>          }
> >>          cnt = bdrv_get_dirty_count(bs, s->dirty_bitmap);
> >>+        /* s->common.offset contains the number of bytes already processed so
> >>+         * far, cnt is the number of dirty sectors remaining and
> >>+         * s->sectors_in_flight is the number of sectors currently being
> >>+         * processed; together those are the current total operation length */
> >>+        s->common.len = s->common.offset +
> >>+                        (cnt + s->sectors_in_flight) * BDRV_SECTOR_SIZE;
> >Isn't s->sectors_in_flight still contained in cnt? If I understand
> >correctly, sectors are only marked as clean at the same time as
> >s->sectors_in_flight is decremented again.
> 
> As far as I see, bdrv_reset_dirty() is called in mirror_iteration()
> right before s->sectors_in_flight is incremented.

You're right, I confused it with the other bitmap operations in
mirror_iteration_done(). So:

Reviewed-by: Kevin Wolf <kwolf@redhat.com>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 08/14] qemu-img: Implement commit like QMP
  2014-10-23 11:59   ` Kevin Wolf
@ 2014-10-23 12:35     ` Max Reitz
  2014-10-23 12:40       ` Kevin Wolf
  0 siblings, 1 reply; 43+ messages in thread
From: Max Reitz @ 2014-10-23 12:35 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, Stefan Hajnoczi

On 2014-10-23 at 13:59, Kevin Wolf wrote:
> Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
>> qemu-img should use QMP commands whenever possible in order to ensure
>> feature completeness of both online and offline image operations. As
>> qemu-img itself has no access to QMP (since this would basically require
>> just everything being linked into qemu-img), imitate QMP's
>> implementation of block-commit by using commit_active_start() and then
>> waiting for the block job to finish.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   block/Makefile.objs |  3 +-
>>   qemu-img.c          | 82 ++++++++++++++++++++++++++++++++++++++++-------------
>>   2 files changed, 64 insertions(+), 21 deletions(-)
>>
>> diff --git a/block/Makefile.objs b/block/Makefile.objs
>> index 27911b6..04b0e43 100644
>> --- a/block/Makefile.objs
>> +++ b/block/Makefile.objs
>> @@ -9,7 +9,7 @@ block-obj-y += block-backend.o snapshot.o qapi.o
>>   block-obj-$(CONFIG_WIN32) += raw-win32.o win32-aio.o
>>   block-obj-$(CONFIG_POSIX) += raw-posix.o
>>   block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
>> -block-obj-y += null.o
>> +block-obj-y += null.o mirror.o
>>   
>>   block-obj-y += nbd.o nbd-client.o sheepdog.o
>>   block-obj-$(CONFIG_LIBISCSI) += iscsi.o
>> @@ -23,7 +23,6 @@ block-obj-y += accounting.o
>>   
>>   common-obj-y += stream.o
>>   common-obj-y += commit.o
>> -common-obj-y += mirror.o
>>   common-obj-y += backup.o
>>   
>>   iscsi.o-cflags     := $(LIBISCSI_CFLAGS)
>> diff --git a/qemu-img.c b/qemu-img.c
>> index 09e7e72..f1f2857 100644
>> --- a/qemu-img.c
>> +++ b/qemu-img.c
>> @@ -31,6 +31,7 @@
>>   #include "sysemu/sysemu.h"
>>   #include "sysemu/block-backend.h"
>>   #include "block/block_int.h"
>> +#include "block/blockjob.h"
>>   #include "block/qapi.h"
>>   #include <getopt.h>
>>   
>> @@ -715,13 +716,47 @@ fail:
>>       return ret;
>>   }
>>   
>> +typedef struct CommonBlockJobCBInfo {
>> +    BlockDriverState *bs;
>> +    Error **errp;
>> +} CommonBlockJobCBInfo;
>> +
>> +static void common_block_job_cb(void *opaque, int ret)
>> +{
>> +    CommonBlockJobCBInfo *cbi = opaque;
>> +
>> +    if (ret < 0) {
>> +        error_setg_errno(cbi->errp, -ret, "Block job failed");
>> +    }
>> +
>> +    /* Drop this block job's reference */
>> +    bdrv_unref(cbi->bs);
>> +}
>> +
>> +static void run_block_job(BlockJob *job, Error **errp)
>> +{
>> +    AioContext *aio_context = bdrv_get_aio_context(job->bs);
>> +
>> +    do {
>> +        aio_poll(aio_context, true);
>> +
>> +        if (!job->busy && !job->ready) {
>> +            block_job_resume(job);
>> +        }
> I wasn't quite sure what this is for. With BLOCKDEV_ON_ERROR_REPORT, why
> would the job ever be paused?

I remember you telling that I puzzled this code together until it 
worked. At one point in time, it didn't work without this. It works now. 
I'm going to trust you.

> While trying out what would happen with failing requests (both for
> checking this and what error message I would get), my image got
> corrupted (as I told you on IRC):
>
> $ qemu-img create -f qcow2 -b /tmp/backing.qcow2 /tmp/test.qcow2 64M
> $ qemu-img commit blkdebug::/tmp/test.qcow2
> qemu-img: Could not empty blkdebug::/tmp/test.qcow2: Operation not supported
> $ qemu-io -c 'write 0 1M' /tmp/test.qcow2
> qcow2: Marking image as corrupt: Preventing invalid write on metadata
> (overlaps with active L1 table); further corruption events will be
> suppressed
> write failed: Input/output error

As discussed on IRC, this is due to qcow2 marking every image clean when 
it's closed. Setting bs->drv to NULL solves this issue.

>> +    } while (!job->ready);
>> +
>> +    block_job_complete_sync(job, errp);
>> +}
>> +
>>   static int img_commit(int argc, char **argv)
>>   {
>>       int c, ret, flags;
>>       const char *filename, *fmt, *cache;
>>       BlockBackend *blk;
>> -    BlockDriverState *bs;
>> +    BlockDriverState *bs, *base_bs;
>>       bool quiet = false;
>> +    Error *local_err = NULL;
>> +    CommonBlockJobCBInfo cbi;
>>   
>>       fmt = NULL;
>>       cache = BDRV_DEFAULT_CACHE;
>> @@ -764,29 +799,38 @@ static int img_commit(int argc, char **argv)
>>       }
>>       bs = blk_bs(blk);
>>   
>> -    ret = bdrv_commit(bs);
>> -    switch(ret) {
>> -    case 0:
>> -        qprintf(quiet, "Image committed.\n");
>> -        break;
>> -    case -ENOENT:
>> -        error_report("No disk inserted");
>> -        break;
>> -    case -EACCES:
>> -        error_report("Image is read-only");
>> -        break;
>> -    case -ENOTSUP:
>> -        error_report("Image is already committed");
>> -        break;
>> -    default:
>> -        error_report("Error while committing image");
>> -        break;
>> +    /* This is different from QMP, which by default uses the deepest file in the
>> +     * backing chain (i.e., the very base); however, the traditional behavior of
>> +     * qemu-img commit is using the immediate backing file. */
>> +    base_bs = bs->backing_hd;
>> +    if (!base_bs) {
>> +        error_set(&local_err, QERR_BASE_NOT_FOUND, "NULL");
> $ qemu-img create -f qcow2 /tmp/test.qcow2 64M
> Formatting '/tmp/test.qcow2', fmt=qcow2 size=67108864 encryption=off cluster_size=65536 lazy_refcounts=off
> $ qemu-img commit /tmp/test.qcow2
> qemu-img: Base 'NULL' not found
>
> Do we expect any user to understand what we want to tell them? This
> should clearly be something along the lines of error_setg(&local_err,
> "Image doesn't have a backing file"). (Before this patch, it said
> "Image is already committed", which isn't great either, but not as bad
> as the new message)

Yes, will fix.

Max

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 08/14] qemu-img: Implement commit like QMP
  2014-10-23 12:35     ` Max Reitz
@ 2014-10-23 12:40       ` Kevin Wolf
  2014-10-23 12:44         ` Max Reitz
  0 siblings, 1 reply; 43+ messages in thread
From: Kevin Wolf @ 2014-10-23 12:40 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-devel, Stefan Hajnoczi

Am 23.10.2014 um 14:35 hat Max Reitz geschrieben:
> On 2014-10-23 at 13:59, Kevin Wolf wrote:
> >Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
> >>qemu-img should use QMP commands whenever possible in order to ensure
> >>feature completeness of both online and offline image operations. As
> >>qemu-img itself has no access to QMP (since this would basically require
> >>just everything being linked into qemu-img), imitate QMP's
> >>implementation of block-commit by using commit_active_start() and then
> >>waiting for the block job to finish.
> >>
> >>Signed-off-by: Max Reitz <mreitz@redhat.com>
> >>---
> >>  block/Makefile.objs |  3 +-
> >>  qemu-img.c          | 82 ++++++++++++++++++++++++++++++++++++++++-------------
> >>  2 files changed, 64 insertions(+), 21 deletions(-)
> >>
> >>diff --git a/block/Makefile.objs b/block/Makefile.objs
> >>index 27911b6..04b0e43 100644
> >>--- a/block/Makefile.objs
> >>+++ b/block/Makefile.objs
> >>@@ -9,7 +9,7 @@ block-obj-y += block-backend.o snapshot.o qapi.o
> >>  block-obj-$(CONFIG_WIN32) += raw-win32.o win32-aio.o
> >>  block-obj-$(CONFIG_POSIX) += raw-posix.o
> >>  block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
> >>-block-obj-y += null.o
> >>+block-obj-y += null.o mirror.o
> >>  block-obj-y += nbd.o nbd-client.o sheepdog.o
> >>  block-obj-$(CONFIG_LIBISCSI) += iscsi.o
> >>@@ -23,7 +23,6 @@ block-obj-y += accounting.o
> >>  common-obj-y += stream.o
> >>  common-obj-y += commit.o
> >>-common-obj-y += mirror.o
> >>  common-obj-y += backup.o
> >>  iscsi.o-cflags     := $(LIBISCSI_CFLAGS)
> >>diff --git a/qemu-img.c b/qemu-img.c
> >>index 09e7e72..f1f2857 100644
> >>--- a/qemu-img.c
> >>+++ b/qemu-img.c
> >>@@ -31,6 +31,7 @@
> >>  #include "sysemu/sysemu.h"
> >>  #include "sysemu/block-backend.h"
> >>  #include "block/block_int.h"
> >>+#include "block/blockjob.h"
> >>  #include "block/qapi.h"
> >>  #include <getopt.h>
> >>@@ -715,13 +716,47 @@ fail:
> >>      return ret;
> >>  }
> >>+typedef struct CommonBlockJobCBInfo {
> >>+    BlockDriverState *bs;
> >>+    Error **errp;
> >>+} CommonBlockJobCBInfo;
> >>+
> >>+static void common_block_job_cb(void *opaque, int ret)
> >>+{
> >>+    CommonBlockJobCBInfo *cbi = opaque;
> >>+
> >>+    if (ret < 0) {
> >>+        error_setg_errno(cbi->errp, -ret, "Block job failed");
> >>+    }
> >>+
> >>+    /* Drop this block job's reference */
> >>+    bdrv_unref(cbi->bs);
> >>+}
> >>+
> >>+static void run_block_job(BlockJob *job, Error **errp)
> >>+{
> >>+    AioContext *aio_context = bdrv_get_aio_context(job->bs);
> >>+
> >>+    do {
> >>+        aio_poll(aio_context, true);
> >>+
> >>+        if (!job->busy && !job->ready) {
> >>+            block_job_resume(job);
> >>+        }
> >I wasn't quite sure what this is for. With BLOCKDEV_ON_ERROR_REPORT, why
> >would the job ever be paused?
> 
> I remember you telling that I puzzled this code together until it
> worked. At one point in time, it didn't work without this. It works
> now. I'm going to trust you.

This was an honest question. It's probably risky to trust that if I
don't understand something, the opposite is true...

Kevin

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 08/14] qemu-img: Implement commit like QMP
  2014-10-23 12:40       ` Kevin Wolf
@ 2014-10-23 12:44         ` Max Reitz
  0 siblings, 0 replies; 43+ messages in thread
From: Max Reitz @ 2014-10-23 12:44 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, Stefan Hajnoczi

On 2014-10-23 at 14:40, Kevin Wolf wrote:
> Am 23.10.2014 um 14:35 hat Max Reitz geschrieben:
>> On 2014-10-23 at 13:59, Kevin Wolf wrote:
>>> Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
>>>> qemu-img should use QMP commands whenever possible in order to ensure
>>>> feature completeness of both online and offline image operations. As
>>>> qemu-img itself has no access to QMP (since this would basically require
>>>> just everything being linked into qemu-img), imitate QMP's
>>>> implementation of block-commit by using commit_active_start() and then
>>>> waiting for the block job to finish.
>>>>
>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>>> ---
>>>>   block/Makefile.objs |  3 +-
>>>>   qemu-img.c          | 82 ++++++++++++++++++++++++++++++++++++++++-------------
>>>>   2 files changed, 64 insertions(+), 21 deletions(-)
>>>>
>>>> diff --git a/block/Makefile.objs b/block/Makefile.objs
>>>> index 27911b6..04b0e43 100644
>>>> --- a/block/Makefile.objs
>>>> +++ b/block/Makefile.objs
>>>> @@ -9,7 +9,7 @@ block-obj-y += block-backend.o snapshot.o qapi.o
>>>>   block-obj-$(CONFIG_WIN32) += raw-win32.o win32-aio.o
>>>>   block-obj-$(CONFIG_POSIX) += raw-posix.o
>>>>   block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
>>>> -block-obj-y += null.o
>>>> +block-obj-y += null.o mirror.o
>>>>   block-obj-y += nbd.o nbd-client.o sheepdog.o
>>>>   block-obj-$(CONFIG_LIBISCSI) += iscsi.o
>>>> @@ -23,7 +23,6 @@ block-obj-y += accounting.o
>>>>   common-obj-y += stream.o
>>>>   common-obj-y += commit.o
>>>> -common-obj-y += mirror.o
>>>>   common-obj-y += backup.o
>>>>   iscsi.o-cflags     := $(LIBISCSI_CFLAGS)
>>>> diff --git a/qemu-img.c b/qemu-img.c
>>>> index 09e7e72..f1f2857 100644
>>>> --- a/qemu-img.c
>>>> +++ b/qemu-img.c
>>>> @@ -31,6 +31,7 @@
>>>>   #include "sysemu/sysemu.h"
>>>>   #include "sysemu/block-backend.h"
>>>>   #include "block/block_int.h"
>>>> +#include "block/blockjob.h"
>>>>   #include "block/qapi.h"
>>>>   #include <getopt.h>
>>>> @@ -715,13 +716,47 @@ fail:
>>>>       return ret;
>>>>   }
>>>> +typedef struct CommonBlockJobCBInfo {
>>>> +    BlockDriverState *bs;
>>>> +    Error **errp;
>>>> +} CommonBlockJobCBInfo;
>>>> +
>>>> +static void common_block_job_cb(void *opaque, int ret)
>>>> +{
>>>> +    CommonBlockJobCBInfo *cbi = opaque;
>>>> +
>>>> +    if (ret < 0) {
>>>> +        error_setg_errno(cbi->errp, -ret, "Block job failed");
>>>> +    }
>>>> +
>>>> +    /* Drop this block job's reference */
>>>> +    bdrv_unref(cbi->bs);
>>>> +}
>>>> +
>>>> +static void run_block_job(BlockJob *job, Error **errp)
>>>> +{
>>>> +    AioContext *aio_context = bdrv_get_aio_context(job->bs);
>>>> +
>>>> +    do {
>>>> +        aio_poll(aio_context, true);
>>>> +
>>>> +        if (!job->busy && !job->ready) {
>>>> +            block_job_resume(job);
>>>> +        }
>>> I wasn't quite sure what this is for. With BLOCKDEV_ON_ERROR_REPORT, why
>>> would the job ever be paused?
>> I remember you telling that I puzzled this code together until it
>> worked. At one point in time, it didn't work without this. It works
>> now. I'm going to trust you.
> This was an honest question. It's probably risky to trust that if I
> don't understand something, the opposite is true...

I remember having had trouble with the block job just pausing and then 
qemu-img would sleep indefinitely. Maybe I'm mistaken. Maybe it actually 
was necessary back in April.

Right now I don't see a way how the block job should be paused aside 
from errors and manual QMP intervention.

Max

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 09/14] qemu-img: Empty image after commit
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 09/14] qemu-img: Empty image after commit Max Reitz
@ 2014-10-23 12:55   ` Kevin Wolf
  0 siblings, 0 replies; 43+ messages in thread
From: Kevin Wolf @ 2014-10-23 12:55 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-devel, Stefan Hajnoczi

Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
> After the top image has been committed, it should be emptied unless
> specified otherwise.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>

Reviewed-by: Kevin Wolf <kwolf@redhat.com>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 10/14] qemu-img: Enable progress output for commit
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 10/14] qemu-img: Enable progress output for commit Max Reitz
@ 2014-10-23 13:00   ` Kevin Wolf
  0 siblings, 0 replies; 43+ messages in thread
From: Kevin Wolf @ 2014-10-23 13:00 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-devel, Stefan Hajnoczi

Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
> Implement progress output for the commit command by querying the
> progress of the block job.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>

Reviewed-by: Kevin Wolf <kwolf@redhat.com>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 11/14] qemu-img: Specify backing file for commit
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 11/14] qemu-img: Specify backing file " Max Reitz
@ 2014-10-23 13:05   ` Kevin Wolf
  0 siblings, 0 replies; 43+ messages in thread
From: Kevin Wolf @ 2014-10-23 13:05 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-devel, Stefan Hajnoczi

Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
> Introduce a new parameter for qemu-img commit which may be used to
> explicitly specify the backing file into which an image should be
> committed if the backing chain has more than a single layer.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>

Reviewed-by: Kevin Wolf <kwolf@redhat.com>

> +If the backing chain of the given image file @var{filename} has more than one
> +layer, the backing file into which the changes will be committed may be
> +specified as @var{base} (which has to be part of @var{filename}'s backing
> +chain). If @var{base} is not specified, the immediate backing file of the top
> +image (which is @var{filename}) will be used. For reasons of consistency,
> +explicitly specifying @var{base} will always imply @code{-d}.

Wouldn't hurt, though, to be more explicit about the "reasons of
consistency". Took me a few moments to understand.

Kevin

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 12/14] iotests: Add _filter_qemu_img_map
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 12/14] iotests: Add _filter_qemu_img_map Max Reitz
@ 2014-10-23 13:08   ` Kevin Wolf
  0 siblings, 0 replies; 43+ messages in thread
From: Kevin Wolf @ 2014-10-23 13:08 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-devel, Stefan Hajnoczi

Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
> As different image formats most probably map guest addresses to
> different host addresses, add a filter to filter the host addresses out;
> also, the image filename should be filtered.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>

Reviewed-by: Kevin Wolf <kwolf@redhat.com>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 07/14] block/mirror: Improve progress report
  2014-10-23 10:52   ` Kevin Wolf
  2014-10-23 11:09     ` Max Reitz
@ 2014-10-23 13:10     ` Eric Blake
  1 sibling, 0 replies; 43+ messages in thread
From: Eric Blake @ 2014-10-23 13:10 UTC (permalink / raw)
  To: Kevin Wolf, Max Reitz; +Cc: qemu-devel, Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 1414 bytes --]

On 10/23/2014 04:52 AM, Kevin Wolf wrote:
> Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
>> Instead of taking the total length of the block device as the block
>> job's length, use the number of dirty sectors. The progress is now the
>> number of sectors mirrored to the target block device. Note that this
>> may result in the job's length increasing during operation, which is
>> however in fact desirable.
> 
> More importantly, because it might surprise management tools, is that
> the progress (as in offset/len) can actually decrease now.

Decreased progress is a GOOD thing - it tells management that the guest
is causing I/O faster than the mirroring can keep up with, and that it
might be time to turn on some throttling if the mirror is going to have
a chance at converging.  The old code didn't have any correlation to
progress complete vs work remaining, making it harder to accurately
estimate how long a job still has to go.

> 
> I can't say whether that creates any problem, I'll rely on Eric's
> Reviewed-by for that.

Libvirt has already documented that for all jobs (including block jobs),
progress is estimated by offset/total, but that total need not be
constant between two consecutive calls, precisely for the behavior that
this patch moves toward.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 539 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 13/14] iotests: Add test for backing-chain commits
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 13/14] iotests: Add test for backing-chain commits Max Reitz
@ 2014-10-23 13:24   ` Kevin Wolf
  0 siblings, 0 replies; 43+ messages in thread
From: Kevin Wolf @ 2014-10-23 13:24 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-devel, Stefan Hajnoczi

Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
> Add a test for qemu-img commit on backing chains with more than two
> images. This test also checks whether the top image is emptied (unless
> this is prevented by specifying either -d or -b) and does therefore not
> work for qed and vmdk which requires it to be separate from 020.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>

Reviewed-by: Kevin Wolf <kwolf@redhat.com>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Qemu-devel] [PATCH v13 14/14] iotests: Add test for qcow2's bdrv_make_empty
  2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 14/14] iotests: Add test for qcow2's bdrv_make_empty Max Reitz
  2014-10-22 17:05   ` Eric Blake
@ 2014-10-23 13:30   ` Kevin Wolf
  1 sibling, 0 replies; 43+ messages in thread
From: Kevin Wolf @ 2014-10-23 13:30 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-devel, Stefan Hajnoczi

Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
> Add a test for qcow2's fast bdrv_make_empty implementation on images
> without internal snapshots.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>

We'll probably want to add more test cases, but what's there looks
reasonable.

Reviewed-by: Kevin Wolf <kwolf@redhat.com>

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2014-10-23 14:00 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-22 12:51 [Qemu-devel] [PATCH v13 00/14] qemu-img: Implement commit like QMP Max Reitz
2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 01/14] qcow2: Allow "full" discard Max Reitz
2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 02/14] qcow2: Implement bdrv_make_empty() Max Reitz
2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 03/14] qcow2: Optimize bdrv_make_empty() Max Reitz
2014-10-22 16:04   ` Eric Blake
2014-10-22 18:35   ` Kevin Wolf
2014-10-23  7:46     ` Max Reitz
2014-10-23  8:29       ` Kevin Wolf
2014-10-23  8:36         ` Max Reitz
2014-10-23  8:41           ` Kevin Wolf
2014-10-23  9:11             ` Max Reitz
2014-10-23  9:42               ` Kevin Wolf
2014-10-23  9:44                 ` Max Reitz
2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 04/14] blockjob: Introduce block_job_complete_sync() Max Reitz
2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 05/14] blockjob: Add "ready" field Max Reitz
2014-10-23 10:01   ` Kevin Wolf
2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 06/14] iotests: Omit length/offset test in 040 and 041 Max Reitz
2014-10-23 10:06   ` Kevin Wolf
2014-10-23 10:07     ` Max Reitz
2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 07/14] block/mirror: Improve progress report Max Reitz
2014-10-23 10:52   ` Kevin Wolf
2014-10-23 11:09     ` Max Reitz
2014-10-23 12:03       ` Kevin Wolf
2014-10-23 13:10     ` Eric Blake
2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 08/14] qemu-img: Implement commit like QMP Max Reitz
2014-10-22 16:22   ` Eric Blake
2014-10-23 11:59   ` Kevin Wolf
2014-10-23 12:35     ` Max Reitz
2014-10-23 12:40       ` Kevin Wolf
2014-10-23 12:44         ` Max Reitz
2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 09/14] qemu-img: Empty image after commit Max Reitz
2014-10-23 12:55   ` Kevin Wolf
2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 10/14] qemu-img: Enable progress output for commit Max Reitz
2014-10-23 13:00   ` Kevin Wolf
2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 11/14] qemu-img: Specify backing file " Max Reitz
2014-10-23 13:05   ` Kevin Wolf
2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 12/14] iotests: Add _filter_qemu_img_map Max Reitz
2014-10-23 13:08   ` Kevin Wolf
2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 13/14] iotests: Add test for backing-chain commits Max Reitz
2014-10-23 13:24   ` Kevin Wolf
2014-10-22 12:51 ` [Qemu-devel] [PATCH v13 14/14] iotests: Add test for qcow2's bdrv_make_empty Max Reitz
2014-10-22 17:05   ` Eric Blake
2014-10-23 13:30   ` Kevin Wolf

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.