* [Qemu-devel] [PATCH v11 00/14] qemu-img: Implement commit like QMP
@ 2014-08-20 18:17 Max Reitz
  2014-08-20 18:17 ` [Qemu-devel] [PATCH v11 01/14] qcow2: Allow "full" discard Max Reitz
                   ` (13 more replies)
  0 siblings, 14 replies; 18+ messages in thread
From: Max Reitz @ 2014-08-20 18:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

qemu-img should use QMP commands whenever possible in order to ensure
feature completeness of both online and offline image operations. For
the "commit" command, this is relatively easy, so implement it first
(in the hope that other commands will indeed follow).

As qemu-img does not have access to QMP (QMP being intertwined with
basically everything in qemu), we cannot use QMP directly, but we can at
least use the same functions the corresponding QMP command uses (which
would be "block-commit" in this case).


v11:
- Rebased on Kevin's block branch


git-backport-diff output against v10:

Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/14:[----] [--] 'qcow2: Allow "full" discard'
002/14:[----] [--] 'qcow2: Implement bdrv_make_empty()'
003/14:[----] [--] 'qcow2: Optimize bdrv_make_empty()'
004/14:[----] [--] 'blockjob: Introduce block_job_complete_sync()'
005/14:[----] [--] 'blockjob: Add "ready" field'
006/14:[----] [-C] 'block/mirror: Improve progress report'
007/14:[----] [--] 'qemu-img: Implement commit like QMP'
008/14:[----] [--] 'qemu-img: Empty image after commit'
009/14:[----] [--] 'qemu-img: Enable progress output for commit'
010/14:[----] [--] 'qemu-img: Specify backing file for commit'
011/14:[----] [-C] 'iotests: Add _filter_qemu_img_map'
012/14:[----] [-C] 'iotests: Add test for backing-chain commits'
013/14:[----] [-C] 'iotests: Add test for qcow2's bdrv_make_empty'
014/14:[----] [--] 'iotests: Omit length/offset test in 040 and 041'


Max Reitz (14):
  qcow2: Allow "full" discard
  qcow2: Implement bdrv_make_empty()
  qcow2: Optimize bdrv_make_empty()
  blockjob: Introduce block_job_complete_sync()
  blockjob: Add "ready" field
  block/mirror: Improve progress report
  qemu-img: Implement commit like QMP
  qemu-img: Empty image after commit
  qemu-img: Enable progress output for commit
  qemu-img: Specify backing file for commit
  iotests: Add _filter_qemu_img_map
  iotests: Add test for backing-chain commits
  iotests: Add test for qcow2's bdrv_make_empty
  iotests: Omit length/offset test in 040 and 041

 block/Makefile.objs              |   2 +-
 block/mirror.c                   |  34 ++--
 block/qcow2-cluster.c            |  27 ++-
 block/qcow2-snapshot.c           |   2 +-
 block/qcow2.c                    | 388 ++++++++++++++++++++++++++++++++++++++-
 block/qcow2.h                    |   2 +-
 blockjob.c                       |  42 ++++-
 include/block/blockjob.h         |  20 ++
 qapi/block-core.json             |   4 +-
 qemu-img-cmds.hx                 |   4 +-
 qemu-img.c                       | 149 ++++++++++++---
 qemu-img.texi                    |  13 +-
 tests/qemu-iotests/040           |   4 +-
 tests/qemu-iotests/041           |   3 +-
 tests/qemu-iotests/097           | 122 ++++++++++++
 tests/qemu-iotests/097.out       | 119 ++++++++++++
 tests/qemu-iotests/098           |  75 ++++++++
 tests/qemu-iotests/098.out       |  26 +++
 tests/qemu-iotests/common.filter |   7 +
 tests/qemu-iotests/group         |   2 +
 tests/qemu-iotests/iotests.py    |   3 +-
 21 files changed, 981 insertions(+), 67 deletions(-)
 create mode 100755 tests/qemu-iotests/097
 create mode 100644 tests/qemu-iotests/097.out
 create mode 100755 tests/qemu-iotests/098
 create mode 100644 tests/qemu-iotests/098.out

-- 
2.0.4


* [Qemu-devel] [PATCH v11 01/14] qcow2: Allow "full" discard
  2014-08-20 18:17 [Qemu-devel] [PATCH v11 00/14] qemu-img: Implement commit like QMP Max Reitz
@ 2014-08-20 18:17 ` Max Reitz
  2014-08-20 18:17 ` [Qemu-devel] [PATCH v11 02/14] qcow2: Implement bdrv_make_empty() Max Reitz
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2014-08-20 18:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

Normally, discarded sectors should read back as zero. However, there are
cases in which a sector (or rather, a cluster) should be discarded as if
it had never been written in the first place, that is, reading it should
fall through to the backing file again.
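
As a rough illustration (not part of this patch), the difference shows up
in the boolean added as the last parameter of qcow2_discard_clusters()
below. A hedged sketch of two call sites, with bs, offset, nb_sectors and
ret assumed to be in scope:

    /* Guest-initiated discard: the range must read back as zeroes
     * afterwards, so full_discard is false. */
    ret = qcow2_discard_clusters(bs, offset, nb_sectors,
                                 QCOW2_DISCARD_REQUEST, false);

    /* "Full" discard: drop the L2 entries entirely so that reads fall
     * through to the backing file again (used by bdrv_make_empty()
     * later in this series). */
    ret = qcow2_discard_clusters(bs, offset, nb_sectors,
                                 QCOW2_DISCARD_SNAPSHOT, true);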

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2-cluster.c  | 27 +++++++++++++++++----------
 block/qcow2-snapshot.c |  2 +-
 block/qcow2.c          |  2 +-
 block/qcow2.h          |  2 +-
 4 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 735f687..7a49380 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1375,7 +1375,7 @@ int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset)
  * clusters.
  */
 static int discard_single_l2(BlockDriverState *bs, uint64_t offset,
-    unsigned int nb_clusters, enum qcow2_discard_type type)
+    unsigned int nb_clusters, enum qcow2_discard_type type, bool full_discard)
 {
     BDRVQcowState *s = bs->opaque;
     uint64_t *l2_table;
@@ -1397,23 +1397,30 @@ static int discard_single_l2(BlockDriverState *bs, uint64_t offset,
         old_l2_entry = be64_to_cpu(l2_table[l2_index + i]);
 
         /*
-         * Make sure that a discarded area reads back as zeroes for v3 images
-         * (we cannot do it for v2 without actually writing a zero-filled
-         * buffer). We can skip the operation if the cluster is already marked
-         * as zero, or if it's unallocated and we don't have a backing file.
+         * If full_discard is false, make sure that a discarded area reads back
+         * as zeroes for v3 images (we cannot do it for v2 without actually
+         * writing a zero-filled buffer). We can skip the operation if the
+         * cluster is already marked as zero, or if it's unallocated and we
+         * don't have a backing file.
          *
          * TODO We might want to use bdrv_get_block_status(bs) here, but we're
          * holding s->lock, so that doesn't work today.
+         *
+         * If full_discard is true, the sector should not read back as zeroes,
+         * but rather fall through to the backing file.
          */
         switch (qcow2_get_cluster_type(old_l2_entry)) {
             case QCOW2_CLUSTER_UNALLOCATED:
-                if (!bs->backing_hd) {
+                if (full_discard || !bs->backing_hd) {
                     continue;
                 }
                 break;
 
             case QCOW2_CLUSTER_ZERO:
-                continue;
+                if (!full_discard) {
+                    continue;
+                }
+                break;
 
             case QCOW2_CLUSTER_NORMAL:
             case QCOW2_CLUSTER_COMPRESSED:
@@ -1425,7 +1432,7 @@ static int discard_single_l2(BlockDriverState *bs, uint64_t offset,
 
         /* First remove L2 entries */
         qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table);
-        if (s->qcow_version >= 3) {
+        if (!full_discard && s->qcow_version >= 3) {
             l2_table[l2_index + i] = cpu_to_be64(QCOW_OFLAG_ZERO);
         } else {
             l2_table[l2_index + i] = cpu_to_be64(0);
@@ -1444,7 +1451,7 @@ static int discard_single_l2(BlockDriverState *bs, uint64_t offset,
 }
 
 int qcow2_discard_clusters(BlockDriverState *bs, uint64_t offset,
-    int nb_sectors, enum qcow2_discard_type type)
+    int nb_sectors, enum qcow2_discard_type type, bool full_discard)
 {
     BDRVQcowState *s = bs->opaque;
     uint64_t end_offset;
@@ -1467,7 +1474,7 @@ int qcow2_discard_clusters(BlockDriverState *bs, uint64_t offset,
 
     /* Each L2 table is handled by its own loop iteration */
     while (nb_clusters > 0) {
-        ret = discard_single_l2(bs, offset, nb_clusters, type);
+        ret = discard_single_l2(bs, offset, nb_clusters, type, full_discard);
         if (ret < 0) {
             goto fail;
         }
diff --git a/block/qcow2-snapshot.c b/block/qcow2-snapshot.c
index f52d7fd..5b3903c 100644
--- a/block/qcow2-snapshot.c
+++ b/block/qcow2-snapshot.c
@@ -441,7 +441,7 @@ int qcow2_snapshot_create(BlockDriverState *bs, QEMUSnapshotInfo *sn_info)
     qcow2_discard_clusters(bs, qcow2_vm_state_offset(s),
                            align_offset(sn->vm_state_size, s->cluster_size)
                                 >> BDRV_SECTOR_BITS,
-                           QCOW2_DISCARD_NEVER);
+                           QCOW2_DISCARD_NEVER, false);
 
 #ifdef DEBUG_ALLOC
     {
diff --git a/block/qcow2.c b/block/qcow2.c
index f9e045f..59395d0 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2012,7 +2012,7 @@ static coroutine_fn int qcow2_co_discard(BlockDriverState *bs,
 
     qemu_co_mutex_lock(&s->lock);
     ret = qcow2_discard_clusters(bs, sector_num << BDRV_SECTOR_BITS,
-        nb_sectors, QCOW2_DISCARD_REQUEST);
+        nb_sectors, QCOW2_DISCARD_REQUEST, false);
     qemu_co_mutex_unlock(&s->lock);
     return ret;
 }
diff --git a/block/qcow2.h b/block/qcow2.h
index 6aeb7ea..9963007 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -528,7 +528,7 @@ uint64_t qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
 
 int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m);
 int qcow2_discard_clusters(BlockDriverState *bs, uint64_t offset,
-    int nb_sectors, enum qcow2_discard_type type);
+    int nb_sectors, enum qcow2_discard_type type, bool full_discard);
 int qcow2_zero_clusters(BlockDriverState *bs, uint64_t offset, int nb_sectors);
 
 int qcow2_expand_zero_clusters(BlockDriverState *bs);
-- 
2.0.4


* [Qemu-devel] [PATCH v11 02/14] qcow2: Implement bdrv_make_empty()
  2014-08-20 18:17 [Qemu-devel] [PATCH v11 00/14] qemu-img: Implement commit like QMP Max Reitz
  2014-08-20 18:17 ` [Qemu-devel] [PATCH v11 01/14] qcow2: Allow "full" discard Max Reitz
@ 2014-08-20 18:17 ` Max Reitz
  2014-08-20 18:17 ` [Qemu-devel] [PATCH v11 03/14] qcow2: Optimize bdrv_make_empty() Max Reitz
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2014-08-20 18:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

Implement this function by making all clusters in the image file fall
through to the backing file (by using the recently extended discard).
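
For context, callers reach this implementation through the driver
callback table; a minimal hedged sketch of such a call (mirroring how
qemu-img uses it later in this series; bs is assumed to be an opened
qcow2 BlockDriverState):

    if (bs->drv->bdrv_make_empty) {
        int ret = bs->drv->bdrv_make_empty(bs);
        if (ret < 0) {
            error_report("Could not empty image: %s", strerror(-ret));
        }
    }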

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index 59395d0..eca7656 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2153,6 +2153,32 @@ fail:
     return ret;
 }
 
+static int qcow2_make_empty(BlockDriverState *bs)
+{
+    int ret = 0;
+    uint64_t start_sector;
+    int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
+
+    for (start_sector = 0; start_sector < bs->total_sectors;
+         start_sector += sector_step)
+    {
+        /* As this function is generally used after committing an external
+         * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
+         * default action for this kind of discard is to pass the discard,
+         * which will ideally result in an actually smaller image file, as
+         * is probably desired. */
+        ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
+                                     MIN(sector_step,
+                                         bs->total_sectors - start_sector),
+                                     QCOW2_DISCARD_SNAPSHOT, true);
+        if (ret < 0) {
+            break;
+        }
+    }
+
+    return ret;
+}
+
 static coroutine_fn int qcow2_co_flush_to_os(BlockDriverState *bs)
 {
     BDRVQcowState *s = bs->opaque;
@@ -2549,6 +2575,7 @@ static BlockDriver bdrv_qcow2 = {
     .bdrv_co_discard        = qcow2_co_discard,
     .bdrv_truncate          = qcow2_truncate,
     .bdrv_write_compressed  = qcow2_write_compressed,
+    .bdrv_make_empty        = qcow2_make_empty,
 
     .bdrv_snapshot_create   = qcow2_snapshot_create,
     .bdrv_snapshot_goto     = qcow2_snapshot_goto,
-- 
2.0.4


* [Qemu-devel] [PATCH v11 03/14] qcow2: Optimize bdrv_make_empty()
  2014-08-20 18:17 [Qemu-devel] [PATCH v11 00/14] qemu-img: Implement commit like QMP Max Reitz
  2014-08-20 18:17 ` [Qemu-devel] [PATCH v11 01/14] qcow2: Allow "full" discard Max Reitz
  2014-08-20 18:17 ` [Qemu-devel] [PATCH v11 02/14] qcow2: Implement bdrv_make_empty() Max Reitz
@ 2014-08-20 18:17 ` Max Reitz
  2014-08-21 14:31   ` Kevin Wolf
  2014-08-20 18:17 ` [Qemu-devel] [PATCH v11 04/14] blockjob: Introduce block_job_complete_sync() Max Reitz
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 18+ messages in thread
From: Max Reitz @ 2014-08-20 18:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

bdrv_make_empty() is currently only called if the current image
represents an external snapshot that has been committed to its base
image; such an image is therefore unlikely to have internal snapshots. In
this case, bdrv_make_empty() can be greatly sped up by creating an empty
L1 table and recreating the refcount structure accordingly, dropping all
data clusters at once instead of discarding them range by range.

If there are snapshots, fall back to the simple implementation (discard
all clusters).
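
For reference, the sizing constraints that the new minimal_blob_size()
helper below solves can be restated as follows (same symbols as in its
code comment; this is merely a reformulation, not part of the patch):

    \begin{aligned}
    \mathit{req\_cls} &\ge \mathit{rtc} + \mathit{rbc} + \mathit{l1c} \\
    \mathit{rbc} &= \lceil (\mathit{overhead} + \mathit{req\_cls}) / (cs/2) \rceil \\
    \mathit{rtc} &= \lceil \mathit{rbc} / (cs/8) \rceil \\
    \mathit{tc}  &= \lceil \mathit{ts} / \mathit{spc} \rceil \\
    \mathit{l2c} &= \lceil \mathit{tc} / (cs/8) \rceil \\
    \mathit{l1c} &= \lceil \mathit{l2c} / (cs/8) \rceil
    \end{aligned}

where cs is the cluster size in bytes, spc the number of sectors per
cluster, ts the total number of guest sectors, and overhead the number of
clusters before the blob that must additionally be covered.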

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 block/qcow2.c | 389 +++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 374 insertions(+), 15 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index eca7656..fa8ca66 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2153,27 +2153,386 @@ fail:
     return ret;
 }
 
+/*
+ * Creates a reftable pointing to refblocks following right afterwards and an
+ * empty L1 table at the given @offset. @refblocks is the number of refblocks
+ * to create.
+ *
+ * This combination of structures (reftable+refblocks+L1) is here called a
+ * "blob".
+ */
+static int create_refcount_l1(BlockDriverState *bs, uint64_t offset,
+                              uint64_t refblocks)
+{
+    BDRVQcowState *s = bs->opaque;
+    uint64_t *reftable = NULL;
+    uint16_t *refblock = NULL;
+    uint64_t reftable_clusters;
+    uint64_t rbi;
+    uint64_t blob_start, blob_end;
+    uint64_t l2_tables, l1_clusters, l1_size2;
+    uint8_t l1_size_and_offset[12];
+    uint64_t rt_offset;
+    int ret, i;
+
+    reftable_clusters = DIV_ROUND_UP(refblocks, s->cluster_size / 8);
+    l2_tables = DIV_ROUND_UP(bs->total_sectors / s->cluster_sectors,
+                             s->cluster_size / 8);
+    l1_clusters = DIV_ROUND_UP(l2_tables, s->cluster_size / 8);
+    l1_size2 = l1_clusters << s->cluster_bits;
+
+    blob_start = offset;
+    blob_end = offset + ((reftable_clusters + refblocks + l1_clusters) <<
+                s->cluster_bits);
+
+    ret = qcow2_pre_write_overlap_check(bs, 0, blob_start,
+                                        blob_end - blob_start);
+    if (ret < 0) {
+        return ret;
+    }
+
+    /* Create the reftable pointing with entries pointing consecutively to the
+     * following clusters */
+    reftable = g_malloc0_n(reftable_clusters, s->cluster_size);
+
+    for (rbi = 0; rbi < refblocks; rbi++) {
+        reftable[rbi] = cpu_to_be64(offset + ((reftable_clusters + rbi) <<
+                        s->cluster_bits));
+    }
+
+    ret = bdrv_write(bs->file, offset >> BDRV_SECTOR_BITS, (uint8_t *)reftable,
+                     reftable_clusters * s->cluster_sectors);
+    if (ret < 0) {
+        goto out;
+    }
+
+    offset += reftable_clusters << s->cluster_bits;
+
+    /* Keep the reftable, as we will need it for the BDRVQcowState anyway */
+
+    /* Now, create all the refblocks */
+    refblock = g_malloc(s->cluster_size);
+
+    for (rbi = 0; rbi < refblocks; rbi++) {
+        for (i = 0; i < s->cluster_size / 2; i++) {
+            uint64_t cluster_offset = ((rbi << (s->cluster_bits - 1)) + i)
+                                      << s->cluster_bits;
+
+            /* Only 0 and 1 are possible as refcounts here */
+            refblock[i] = cpu_to_be16(!cluster_offset ||
+                                      (cluster_offset >= blob_start &&
+                                       cluster_offset <  blob_end));
+        }
+
+        ret = bdrv_write(bs->file, offset >> BDRV_SECTOR_BITS,
+                         (uint8_t *)refblock, s->cluster_sectors);
+        if (ret < 0) {
+            goto out;
+        }
+
+        offset += s->cluster_size;
+    }
+
+    g_free(refblock);
+    refblock = NULL;
+
+    /* The L1 table is very simple */
+    ret = bdrv_write_zeroes(bs->file, offset >> BDRV_SECTOR_BITS,
+                            l1_clusters * s->cluster_sectors,
+                            BDRV_REQ_MAY_UNMAP);
+    if (ret < 0) {
+        goto out;
+    }
+
+    /* Now make sure all changes are stable and clear all caches */
+    bdrv_flush(bs);
+
+    /* This is probably not really necessary, but it cannot hurt, either */
+    ret = qcow2_mark_clean(bs);
+    if (ret < 0) {
+        goto out;
+    }
+
+    ret = qcow2_cache_empty(bs, s->l2_table_cache);
+    if (ret < 0) {
+        goto out;
+    }
+
+    ret = qcow2_cache_empty(bs, s->refcount_block_cache);
+    if (ret < 0) {
+        goto out;
+    }
+
+    /* Modify the image header to point to the new blank L1 table. This will
+     * leak all currently existing data clusters, which is fine. */
+    BLKDBG_EVENT(bs->file, BLKDBG_L1_UPDATE);
+
+    assert(l1_size2 / 8 <= UINT32_MAX);
+    cpu_to_be32w((uint32_t *)&l1_size_and_offset[0], l1_size2 / 8);
+    cpu_to_be64w((uint64_t *)&l1_size_and_offset[4], offset);
+    ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, l1_size),
+                           l1_size_and_offset, sizeof(l1_size_and_offset));
+    if (ret < 0) {
+        goto out;
+    }
+
+    /* Adapt the in-memory L1 table accordingly */
+    s->l1_table_offset = offset;
+    s->l1_size         = l1_size2 / 8;
+
+    s->l1_table = g_realloc(s->l1_table, l1_size2);
+    memset(s->l1_table, 0, l1_size2);
+
+    /* Modify the image header to point to the refcount table. This will fix all
+     * leaks (unless an error occurs). */
+    BLKDBG_EVENT(bs->file, BLKDBG_REFTABLE_UPDATE);
+
+    rt_offset = cpu_to_be64(blob_start);
+    ret = bdrv_pwrite_sync(bs->file,
+                           offsetof(QCowHeader, refcount_table_offset),
+                           &rt_offset, sizeof(rt_offset));
+    if (ret < 0) {
+        goto out;
+    }
+
+    /* The image is now clean. The only thing left is to adapt the in-memory
+     * refcount table to match it. */
+    s->refcount_table_offset = blob_start;
+    s->refcount_table_size   = reftable_clusters << (s->cluster_bits - 3);
+
+    for (rbi = 0; rbi < refblocks; rbi++) {
+        be64_to_cpus(&reftable[rbi]);
+    }
+
+    g_free(s->refcount_table);
+    s->refcount_table = reftable;
+    reftable = NULL;
+
+out:
+    g_free(refblock);
+    g_free(reftable);
+    return ret;
+}
+
+/*
+ * Calculates the number of clusters required for an L1 table for an image with
+ * the given parameters plus the reftable and the refblocks required to cover
+ * themselves, the L1 table and a given number of clusters @overhead.
+ *
+ * @ts:       Total number of guest sectors the image provides
+ * @cb:       1 << @cb is the cluster size in bytes
+ * @spcb:     1 << @spcb is the number of clusters per sector
+ * @overhead: Number of clusters which shall additionally be covered by the
+ *            refcount structures
+ */
+static uint64_t minimal_blob_size(uint64_t ts, int cb, int spcb,
+                                  uint64_t overhead)
+{
+    uint64_t cs  = UINT64_C(1) << cb;
+    uint64_t spc = UINT64_C(1) << spcb;
+
+    /* The following statement is a solution for this system of equations:
+     *
+     * Let req_cls be the required number of clusters required for the reftable,
+     * all refblocks and the L1 table.
+     *
+     * rtc be the clusters required for the reftable, rbc the clusters for all
+     * refblocks (i.e., the number of refblocks), l1c the clusters for the L1
+     * table and l2c the clusters for all L2 tables (i.e., the number of L2
+     * tables).
+     *
+     * cs be the cluster size (in bytes), ts the total number of sectors, spc
+     * the number of sectors per cluster and tc the total number of clusters.
+     *
+     * overhead is a number of clusters which should additionally be covered by
+     * the refcount structures (i.e. all clusters before this blob of req_cls
+     * clusters).
+     *
+     * req_cls >= rtc + rbc + l1c
+     *   -- should be obvious
+     * rbc      = ceil((overhead + req_cls) / (cs / 2))
+     *   -- as each refblock holds cs/2 entries
+     * rtc      = ceil(rbc                  / (cs / 8))
+     *   -- as each reftable cluster holds cs/8 entries
+     * tc       = ceil(ts / spc)
+     *   -- should be obvious as well
+     * l2c      = ceil(tc  / (cs / 8))
+     *   -- as each L2 table holds cs/8 entries
+     * l1c      = ceil(l2c / (cs / 8))
+     *   -- as each L1 table cluster holds cs/8 entries
+     *
+     * The following equation yields a result which satisfies the constraint.
+     * The result may be too high, but is never too low.
+     *
+     * The original calculation (without bitshifting) was:
+     *
+     * DIV_ROUND_UP(overhead * (1 + cs / 8) +
+     *              3 * cs * cs / 16 - 5 * cs / 8 - 1 +
+     *              4 * (ts + spc * cs / 8 - 1) / spc,
+     *              cs * cs / 16 - cs / 8 - 1)
+     *
+     */
+
+    return DIV_ROUND_UP(overhead + (overhead << (cb - 3)) +
+                        (3 << (2 * cb - 4)) - (5 << (cb - 3)) - 1 +
+                        (4 * (ts + (spc << (cb - 3)) - 1) >> spcb),
+                        (1 << (2 * cb - 4)) - cs / 8 - 1);
+}
+
 static int qcow2_make_empty(BlockDriverState *bs)
 {
+    BDRVQcowState *s = bs->opaque;
     int ret = 0;
-    uint64_t start_sector;
-    int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
 
-    for (start_sector = 0; start_sector < bs->total_sectors;
-         start_sector += sector_step)
-    {
-        /* As this function is generally used after committing an external
-         * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
-         * default action for this kind of discard is to pass the discard,
-         * which will ideally result in an actually smaller image file, as
-         * is probably desired. */
-        ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
-                                     MIN(sector_step,
-                                         bs->total_sectors - start_sector),
-                                     QCOW2_DISCARD_SNAPSHOT, true);
+    if (s->snapshots) {
+        uint64_t start_sector;
+        int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
+
+        /* If there are snapshots, every active cluster has to be discarded */
+
+        for (start_sector = 0; start_sector < bs->total_sectors;
+             start_sector += sector_step)
+        {
+            /* As this function is generally used after committing an external
+             * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
+             * default action for this kind of discard is to pass the discard,
+             * which will ideally result in an actually smaller image file, as
+             * is probably desired. */
+            ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
+                                         MIN(sector_step,
+                                             bs->total_sectors - start_sector),
+                                         QCOW2_DISCARD_SNAPSHOT, true);
+            if (ret < 0) {
+                break;
+            }
+        }
+    } else {
+        uint64_t min_size_1, min_size_2;
+        int64_t blob_start;
+        uint64_t blob_end, real_blob_size, clusters;
+        uint64_t refblocks, reftable_clusters, l2_tables, l1_clusters;
+
+        /* If there are no snapshots, we basically want to create a new empty
+         * image. This function is however not for creating a new image and
+         * renaming it so it shadows the existing but rather for emptying an
+         * image, so do exactly that.
+         *
+         * Therefore, the image should be valid at all points in time and may
+         * at worst leak clusters.
+         *
+         * Any valid qcow2 image requires an L1 table which is large enough to
+         * cover all of the guest cluster range, therefore it is impossible to
+         * simply drop the L1 table (make its size 0) and create a minimal
+         * refcount structure.
+         *
+         * Instead, allocate a blob of data which is large enough to hold a new
+         * refcount structure (refcount table and all blocks) covering the whole
+         * image until the end of that blob, and an empty L1 table covering the
+         * whole guest cluster range. Then these structures are initialized and
+         * the image header is modified to point to them.
+         *
+         * Generally, this blob will be allocated at the end of the image (that
+         * is, its offset will be greater than its size). If that is indeed the
+         * case, the same operation is repeated, but this time the new blob
+         * starts right after the image header which will then indeed lead to a
+         * minimal image. If this is not the case, the image will be nearly
+         * minimal as well, as long as the underlying protocol supports discard.
+         *
+         * Note that this implementation never frees allocated clusters. This is
+         * because in case of success, the current refcounts are invalid anyway;
+         * and in case of failure, it would be too cumbersome to keep track of
+         * all allocated cluster ranges and free them then.
+         *
+         * min_size_1 will contain the number of clusters minimally required for
+         * a blob that starts right after the image header; min_size_2 will
+         * contain the number of clusters minimally required for the blob which
+         * can be allocated based on the existing refcount structure.
+         */
+
+        /* Repeat until a blob could be allocated which is large enough to
+         * contain all structures necessary for describing itself. Allocated
+         * clusters are not freed, even if they are not suitable, as this would
+         * result in exactly the same cluster range being returned during the
+         * retry (which is obviously not desirable). In case of success, the
+         * current refcounts do not matter anyway; and in case of failure, the
+         * clusters are only leaked (which can be easily repaired). */
+        do {
+            uint64_t fci = s->free_cluster_index;
+
+            /* TODO: Optimize, we do not need refblocks for other parts of the
+             * image than the header and this blob */
+            min_size_2 = minimal_blob_size(bs->total_sectors, s->cluster_bits,
+                                           s->cluster_bits - BDRV_SECTOR_BITS,
+                                           fci);
+
+            blob_start = qcow2_alloc_clusters(bs,
+                                              min_size_2 << s->cluster_bits);
+            if (blob_start < 0) {
+                return blob_start;
+            }
+
+            clusters          = (blob_start >> s->cluster_bits) + min_size_2;
+            refblocks         = DIV_ROUND_UP(clusters, s->cluster_size / 2);
+            reftable_clusters = DIV_ROUND_UP(refblocks, s->cluster_size / 8);
+            l2_tables         = DIV_ROUND_UP(bs->total_sectors /
+                                             s->cluster_sectors,
+                                             s->cluster_size / 8);
+            l1_clusters       = DIV_ROUND_UP(l2_tables, s->cluster_size / 8);
+
+            real_blob_size = reftable_clusters + refblocks + l1_clusters;
+        } while (min_size_2 < real_blob_size);
+
+        ret = create_refcount_l1(bs, blob_start, refblocks);
         if (ret < 0) {
-            break;
+            return ret;
+        }
+
+        /* The only overhead is the image header */
+        min_size_1 = minimal_blob_size(bs->total_sectors, s->cluster_bits,
+                                       s->cluster_bits - BDRV_SECTOR_BITS, 1);
+
+        /* If we cannot create a new blob before the current one, just discard
+         * the sectors in between and return. Even if the discard does nothing,
+         * only up to min_size_1 clusters plus the refcount blocks for those
+         * are unused. The worst case is therefore an image of double the size
+         * it needs to be, which is not too bad. */
+        if ((blob_start >> s->cluster_bits) < 1 + min_size_1) {
+            uint64_t sector, end;
+            int step = INT_MAX / BDRV_SECTOR_SIZE;
+
+            end = blob_start >> (s->cluster_bits - BDRV_SECTOR_SIZE);
+
+            /* skip the image header */
+            for (sector = s->cluster_sectors; sector < end; sector += step) {
+                ret = bdrv_discard(bs->file, sector, MIN(step, end - sector));
+                if (ret < 0) {
+                    return ret;
+                }
+            }
+
+            blob_end = (blob_start + real_blob_size) << s->cluster_bits;
+        } else {
+            clusters          = 1 + min_size_1;
+            refblocks         = DIV_ROUND_UP(clusters, s->cluster_size / 2);
+            reftable_clusters = DIV_ROUND_UP(refblocks, s->cluster_size / 8);
+            l2_tables         = DIV_ROUND_UP(bs->total_sectors /
+                                             s->cluster_sectors,
+                                             s->cluster_size / 8);
+            l1_clusters       = DIV_ROUND_UP(l2_tables, s->cluster_size / 8);
+
+            real_blob_size = reftable_clusters + refblocks + l1_clusters;
+
+            assert(min_size_1 >= real_blob_size);
+
+            ret = create_refcount_l1(bs, s->cluster_size, refblocks);
+            if (ret < 0) {
+                return ret;
+            }
+
+            blob_end = (1 + real_blob_size) << s->cluster_bits;
         }
+
+        ret = bdrv_truncate(bs->file, blob_end);
     }
 
     return ret;
-- 
2.0.4


* [Qemu-devel] [PATCH v11 04/14] blockjob: Introduce block_job_complete_sync()
  2014-08-20 18:17 [Qemu-devel] [PATCH v11 00/14] qemu-img: Implement commit like QMP Max Reitz
                   ` (2 preceding siblings ...)
  2014-08-20 18:17 ` [Qemu-devel] [PATCH v11 03/14] qcow2: Optimize bdrv_make_empty() Max Reitz
@ 2014-08-20 18:17 ` Max Reitz
  2014-08-20 18:17 ` [Qemu-devel] [PATCH v11 05/14] blockjob: Add "ready" field Max Reitz
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2014-08-20 18:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

Implement block_job_complete_sync() by doing the exact same thing as
block_job_cancel_sync() does, only with calling block_job_complete()
instead of block_job_cancel().
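
A hedged sketch of the intended use (error handling abbreviated; this is
roughly how the qemu-img patches later in this series consume it):

    /* Sketch: job is a BlockJob that has already signalled readiness. */
    Error *local_err = NULL;
    int ret = block_job_complete_sync(job, &local_err);
    if (local_err) {
        /* Completion could not even be requested */
        qerror_report_err(local_err);
        error_free(local_err);
    } else if (ret < 0) {
        /* The job itself failed; ret is the job's return value */
    }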

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
---
 blockjob.c               | 39 ++++++++++++++++++++++++++++++++-------
 include/block/blockjob.h | 15 +++++++++++++++
 2 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/blockjob.c b/blockjob.c
index ca0b4e2..ed9927b 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -152,7 +152,7 @@ void block_job_iostatus_reset(BlockJob *job)
     }
 }
 
-struct BlockCancelData {
+struct BlockFinishData {
     BlockJob *job;
     BlockDriverCompletionFunc *cb;
     void *opaque;
@@ -160,19 +160,22 @@ struct BlockCancelData {
     int ret;
 };
 
-static void block_job_cancel_cb(void *opaque, int ret)
+static void block_job_finish_cb(void *opaque, int ret)
 {
-    struct BlockCancelData *data = opaque;
+    struct BlockFinishData *data = opaque;
 
     data->cancelled = block_job_is_cancelled(data->job);
     data->ret = ret;
     data->cb(data->opaque, ret);
 }
 
-int block_job_cancel_sync(BlockJob *job)
+static int block_job_finish_sync(BlockJob *job,
+                                 void (*finish)(BlockJob *, Error **errp),
+                                 Error **errp)
 {
-    struct BlockCancelData data;
+    struct BlockFinishData data;
     BlockDriverState *bs = job->bs;
+    Error *local_err = NULL;
 
     assert(bs->job == job);
 
@@ -183,15 +186,37 @@ int block_job_cancel_sync(BlockJob *job)
     data.cb = job->cb;
     data.opaque = job->opaque;
     data.ret = -EINPROGRESS;
-    job->cb = block_job_cancel_cb;
+    job->cb = block_job_finish_cb;
     job->opaque = &data;
-    block_job_cancel(job);
+    finish(job, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return -EBUSY;
+    }
     while (data.ret == -EINPROGRESS) {
         aio_poll(bdrv_get_aio_context(bs), true);
     }
     return (data.cancelled && data.ret == 0) ? -ECANCELED : data.ret;
 }
 
+/* A wrapper around block_job_cancel() taking an Error ** parameter so it may be
+ * used with block_job_finish_sync() without the need for (rather nasty)
+ * function pointer casts there. */
+static void block_job_cancel_err(BlockJob *job, Error **errp)
+{
+    block_job_cancel(job);
+}
+
+int block_job_cancel_sync(BlockJob *job)
+{
+    return block_job_finish_sync(job, &block_job_cancel_err, NULL);
+}
+
+int block_job_complete_sync(BlockJob *job, Error **errp)
+{
+    return block_job_finish_sync(job, &block_job_complete, errp);
+}
+
 void block_job_sleep_ns(BlockJob *job, QEMUClockType type, int64_t ns)
 {
     assert(job->busy);
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index 60aa835..5f1b1b5 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -273,6 +273,21 @@ bool block_job_is_paused(BlockJob *job);
 int block_job_cancel_sync(BlockJob *job);
 
 /**
+ * block_job_complete_sync:
+ * @job: The job to be completed.
+ * @errp: Error object which may be set by block_job_complete(); this is not
+ *        necessarily set on every error, the job return value has to be
+ *        checked as well.
+ *
+ * Synchronously complete the job.  The completion callback is called before the
+ * function returns, unless it is NULL (which is permissible when using this
+ * function).
+ *
+ * Returns the return value from the job.
+ */
+int block_job_complete_sync(BlockJob *job, Error **errp);
+
+/**
  * block_job_iostatus_reset:
  * @job: The job whose I/O status should be reset.
  *
-- 
2.0.4


* [Qemu-devel] [PATCH v11 05/14] blockjob: Add "ready" field
  2014-08-20 18:17 [Qemu-devel] [PATCH v11 00/14] qemu-img: Implement commit like QMP Max Reitz
                   ` (3 preceding siblings ...)
  2014-08-20 18:17 ` [Qemu-devel] [PATCH v11 04/14] blockjob: Introduce block_job_complete_sync() Max Reitz
@ 2014-08-20 18:17 ` Max Reitz
  2014-08-20 18:17 ` [Qemu-devel] [PATCH v11 06/14] block/mirror: Improve progress report Max Reitz
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2014-08-20 18:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

When a block job signals readiness, this is currently reported only
through QMP. If qemu wants to use block jobs for internal tasks, there
needs to be another way to correctly detect when a block job may be
completed.

For this reason, introduce a bool "ready" which is set when the block
job may be completed.
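
A hedged sketch of how an internal user might rely on the new field
(essentially what qemu-img does later in this series): poll the job's
AioContext until ready is set, then complete the job:

    /* Sketch: wait for the job to become ready, then complete it. */
    while (!job->ready) {
        aio_poll(bdrv_get_aio_context(job->bs), true);
    }
    block_job_complete_sync(job, &local_err);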

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 blockjob.c               | 3 +++
 include/block/blockjob.h | 5 +++++
 qapi/block-core.json     | 4 +++-
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/blockjob.c b/blockjob.c
index ed9927b..56525c4 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -260,6 +260,7 @@ BlockJobInfo *block_job_query(BlockJob *job)
     info->offset    = job->offset;
     info->speed     = job->speed;
     info->io_status = job->iostatus;
+    info->ready     = job->ready;
     return info;
 }
 
@@ -295,6 +296,8 @@ void block_job_event_completed(BlockJob *job, const char *msg)
 
 void block_job_event_ready(BlockJob *job)
 {
+    job->ready = true;
+
     qapi_event_send_block_job_ready(job->driver->job_type,
                                     bdrv_get_device_name(job->bs),
                                     job->len,
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index 5f1b1b5..402769a 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -91,6 +91,11 @@ struct BlockJob {
      */
     bool busy;
 
+    /**
+     * Set to true when the job is ready to be completed.
+     */
+    bool ready;
+
     /** Status that is published by the query-block-jobs QMP API */
     BlockDeviceIoStatus iostatus;
 
diff --git a/qapi/block-core.json b/qapi/block-core.json
index fb74c56..3c11cdb 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -509,12 +509,14 @@
 #
 # @io-status: the status of the job (since 1.3)
 #
+# @ready: true if the job may be completed (since 2.2)
+#
 # Since: 1.1
 ##
 { 'type': 'BlockJobInfo',
   'data': {'type': 'str', 'device': 'str', 'len': 'int',
            'offset': 'int', 'busy': 'bool', 'paused': 'bool', 'speed': 'int',
-           'io-status': 'BlockDeviceIoStatus'} }
+           'io-status': 'BlockDeviceIoStatus', 'ready': 'bool'} }
 
 ##
 # @query-block-jobs:
-- 
2.0.4


* [Qemu-devel] [PATCH v11 06/14] block/mirror: Improve progress report
  2014-08-20 18:17 [Qemu-devel] [PATCH v11 00/14] qemu-img: Implement commit like QMP Max Reitz
                   ` (4 preceding siblings ...)
  2014-08-20 18:17 ` [Qemu-devel] [PATCH v11 05/14] blockjob: Add "ready" field Max Reitz
@ 2014-08-20 18:17 ` Max Reitz
  2014-08-20 18:17 ` [Qemu-devel] [PATCH v11 07/14] qemu-img: Implement commit like QMP Max Reitz
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2014-08-20 18:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

Instead of taking the total length of the block device as the block
job's length, use the number of dirty sectors. The progress is now the
number of sectors mirrored to the target block device. Note that this
may result in the job's length increasing during operation; however,
that is in fact desirable.
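
Expressed as a formula (this is what the code below recomputes on every
iteration of the mirror loop), the reported job length at any point in
time is:

    \mathrm{len} = \mathrm{offset}
                   + (\mathit{cnt} + \mathit{sectors\_in\_flight})
                     \cdot \mathrm{BDRV\_SECTOR\_SIZE}

where offset is the number of bytes already mirrored, cnt the number of
dirty sectors still to be processed, and sectors_in_flight the number of
sectors currently being copied.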

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 block/mirror.c | 34 ++++++++++++++++++++++------------
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 5e7a166..98ba217 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -45,6 +45,7 @@ typedef struct MirrorBlockJob {
     int64_t sector_num;
     int64_t granularity;
     size_t buf_size;
+    int64_t bdev_length;
     unsigned long *cow_bitmap;
     BdrvDirtyBitmap *dirty_bitmap;
     HBitmapIter hbi;
@@ -54,6 +55,7 @@ typedef struct MirrorBlockJob {
 
     unsigned long *in_flight_bitmap;
     int in_flight;
+    int sectors_in_flight;
     int ret;
 } MirrorBlockJob;
 
@@ -87,6 +89,7 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
     trace_mirror_iteration_done(s, op->sector_num, op->nb_sectors, ret);
 
     s->in_flight--;
+    s->sectors_in_flight -= op->nb_sectors;
     iov = op->qiov.iov;
     for (i = 0; i < op->qiov.niov; i++) {
         MirrorBuffer *buf = (MirrorBuffer *) iov[i].iov_base;
@@ -98,8 +101,11 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
     chunk_num = op->sector_num / sectors_per_chunk;
     nb_chunks = op->nb_sectors / sectors_per_chunk;
     bitmap_clear(s->in_flight_bitmap, chunk_num, nb_chunks);
-    if (s->cow_bitmap && ret >= 0) {
-        bitmap_set(s->cow_bitmap, chunk_num, nb_chunks);
+    if (ret >= 0) {
+        if (s->cow_bitmap) {
+            bitmap_set(s->cow_bitmap, chunk_num, nb_chunks);
+        }
+        s->common.offset += (uint64_t)op->nb_sectors * BDRV_SECTOR_SIZE;
     }
 
     qemu_iovec_destroy(&op->qiov);
@@ -172,7 +178,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
     hbitmap_next_sector = s->sector_num;
     sector_num = s->sector_num;
     sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
-    end = s->common.len >> BDRV_SECTOR_BITS;
+    end = s->bdev_length / BDRV_SECTOR_SIZE;
 
     /* Extend the QEMUIOVector to include all adjacent blocks that will
      * be copied in this operation.
@@ -286,6 +292,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 
     /* Copy the dirty cluster.  */
     s->in_flight++;
+    s->sectors_in_flight += nb_sectors;
     trace_mirror_one_iteration(s, sector_num, nb_sectors);
     bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
                    mirror_read_complete, op);
@@ -331,11 +338,11 @@ static void coroutine_fn mirror_run(void *opaque)
         goto immediate_exit;
     }
 
-    s->common.len = bdrv_getlength(bs);
-    if (s->common.len < 0) {
-        ret = s->common.len;
+    s->bdev_length = bdrv_getlength(bs);
+    if (s->bdev_length < 0) {
+        ret = s->bdev_length;
         goto immediate_exit;
-    } else if (s->common.len == 0) {
+    } else if (s->bdev_length == 0) {
         /* Report BLOCK_JOB_READY and wait for complete. */
         block_job_event_ready(&s->common);
         s->synced = true;
@@ -346,7 +353,7 @@ static void coroutine_fn mirror_run(void *opaque)
         goto immediate_exit;
     }
 
-    length = DIV_ROUND_UP(s->common.len, s->granularity);
+    length = DIV_ROUND_UP(s->bdev_length, s->granularity);
     s->in_flight_bitmap = bitmap_new(length);
 
     /* If we have no backing file yet in the destination, we cannot let
@@ -366,7 +373,7 @@ static void coroutine_fn mirror_run(void *opaque)
         }
     }
 
-    end = s->common.len >> BDRV_SECTOR_BITS;
+    end = s->bdev_length / BDRV_SECTOR_SIZE;
     s->buf = qemu_try_blockalign(bs, s->buf_size);
     if (s->buf == NULL) {
         ret = -ENOMEM;
@@ -411,6 +418,12 @@ static void coroutine_fn mirror_run(void *opaque)
         }
 
         cnt = bdrv_get_dirty_count(bs, s->dirty_bitmap);
+        /* s->common.offset contains the number of bytes already processed so
+         * far, cnt is the number of dirty sectors remaining and
+         * s->sectors_in_flight is the number of sectors currently being
+         * processed; together those are the current total operation length */
+        s->common.len = s->common.offset +
+                        (cnt + s->sectors_in_flight) * BDRV_SECTOR_SIZE;
 
         /* Note that even when no rate limit is applied we need to yield
          * periodically with no pending I/O so that qemu_aio_flush() returns.
@@ -447,7 +460,6 @@ static void coroutine_fn mirror_run(void *opaque)
                  * report completion.  This way, block-job-cancel will leave
                  * the target in a consistent state.
                  */
-                s->common.offset = end * BDRV_SECTOR_SIZE;
                 if (!s->synced) {
                     block_job_event_ready(&s->common);
                     s->synced = true;
@@ -476,8 +488,6 @@ static void coroutine_fn mirror_run(void *opaque)
         ret = 0;
         trace_mirror_before_sleep(s, cnt, s->synced, delay_ns);
         if (!s->synced) {
-            /* Publish progress */
-            s->common.offset = (end - cnt) * BDRV_SECTOR_SIZE;
             block_job_sleep_ns(&s->common, QEMU_CLOCK_REALTIME, delay_ns);
             if (block_job_is_cancelled(&s->common)) {
                 break;
-- 
2.0.4


* [Qemu-devel] [PATCH v11 07/14] qemu-img: Implement commit like QMP
  2014-08-20 18:17 [Qemu-devel] [PATCH v11 00/14] qemu-img: Implement commit like QMP Max Reitz
                   ` (5 preceding siblings ...)
  2014-08-20 18:17 ` [Qemu-devel] [PATCH v11 06/14] block/mirror: Improve progress report Max Reitz
@ 2014-08-20 18:17 ` Max Reitz
  2014-08-20 18:17 ` [Qemu-devel] [PATCH v11 08/14] qemu-img: Empty image after commit Max Reitz
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2014-08-20 18:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

qemu-img should use QMP commands whenever possible in order to ensure
feature completeness of both online and offline image operations. As
qemu-img itself has no access to QMP (since that would basically require
linking just about everything into qemu-img), imitate QMP's
implementation of block-commit by using commit_active_start() and then
waiting for the block job to finish.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 block/Makefile.objs |  2 +-
 qemu-img.c          | 83 +++++++++++++++++++++++++++++++++++++++++------------
 2 files changed, 65 insertions(+), 20 deletions(-)

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 858d2b3..92ef4f6 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -9,6 +9,7 @@ block-obj-y += snapshot.o qapi.o
 block-obj-$(CONFIG_WIN32) += raw-win32.o win32-aio.o
 block-obj-$(CONFIG_POSIX) += raw-posix.o
 block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
+block-obj-y += mirror.o
 
 ifeq ($(CONFIG_POSIX),y)
 block-obj-y += nbd.o nbd-client.o sheepdog.o
@@ -23,7 +24,6 @@ endif
 
 common-obj-y += stream.o
 common-obj-y += commit.o
-common-obj-y += mirror.o
 common-obj-y += backup.o
 
 iscsi.o-cflags     := $(LIBISCSI_CFLAGS)
diff --git a/qemu-img.c b/qemu-img.c
index 18caa4b..1b2255b 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -30,6 +30,7 @@
 #include "qemu/osdep.h"
 #include "sysemu/sysemu.h"
 #include "block/block_int.h"
+#include "block/blockjob.h"
 #include "block/qapi.h"
 #include <getopt.h>
 #include <glib.h>
@@ -721,12 +722,46 @@ fail:
     return ret;
 }
 
+typedef struct CommonBlockJobCBInfo {
+    BlockDriverState *bs;
+    Error **errp;
+} CommonBlockJobCBInfo;
+
+static void common_block_job_cb(void *opaque, int ret)
+{
+    CommonBlockJobCBInfo *cbi = opaque;
+
+    if (ret < 0) {
+        error_setg_errno(cbi->errp, -ret, "Block job failed");
+    }
+
+    /* Drop this block job's reference */
+    bdrv_unref(cbi->bs);
+}
+
+static void run_block_job(BlockJob *job, Error **errp)
+{
+    AioContext *aio_context = bdrv_get_aio_context(job->bs);
+
+    do {
+        aio_poll(aio_context, true);
+
+        if (!job->busy && !job->ready) {
+            block_job_resume(job);
+        }
+    } while (!job->ready);
+
+    block_job_complete_sync(job, errp);
+}
+
 static int img_commit(int argc, char **argv)
 {
     int c, ret, flags;
     const char *filename, *fmt, *cache;
-    BlockDriverState *bs;
+    BlockDriverState *bs, *base_bs;
     bool quiet = false;
+    Error *local_err = NULL;
+    CommonBlockJobCBInfo cbi;
 
     fmt = NULL;
     cache = BDRV_DEFAULT_CACHE;
@@ -767,29 +802,39 @@ static int img_commit(int argc, char **argv)
     if (!bs) {
         return 1;
     }
-    ret = bdrv_commit(bs);
-    switch(ret) {
-    case 0:
-        qprintf(quiet, "Image committed.\n");
-        break;
-    case -ENOENT:
-        error_report("No disk inserted");
-        break;
-    case -EACCES:
-        error_report("Image is read-only");
-        break;
-    case -ENOTSUP:
-        error_report("Image is already committed");
-        break;
-    default:
-        error_report("Error while committing image");
-        break;
+
+    /* This is different from QMP, which by default uses the deepest file in the
+     * backing chain (i.e., the very base); however, the traditional behavior of
+     * qemu-img commit is using the immediate backing file. */
+    base_bs = bs->backing_hd;
+    if (!base_bs) {
+        error_set(&local_err, QERR_BASE_NOT_FOUND, "NULL");
+        goto done;
+    }
+
+    cbi = (CommonBlockJobCBInfo){
+        .errp = &local_err,
+        .bs   = bs,
+    };
+
+    commit_active_start(bs, base_bs, 0, BLOCKDEV_ON_ERROR_REPORT,
+                        common_block_job_cb, &cbi, &local_err);
+    if (local_err) {
+        goto done;
     }
 
+    run_block_job(bs->job, &local_err);
+
+done:
     bdrv_unref(bs);
-    if (ret) {
+
+    if (local_err) {
+        qerror_report_err(local_err);
+        error_free(local_err);
         return 1;
     }
+
+    qprintf(quiet, "Image committed.\n");
     return 0;
 }
 
-- 
2.0.4


* [Qemu-devel] [PATCH v11 08/14] qemu-img: Empty image after commit
  2014-08-20 18:17 [Qemu-devel] [PATCH v11 00/14] qemu-img: Implement commit like QMP Max Reitz
                   ` (6 preceding siblings ...)
  2014-08-20 18:17 ` [Qemu-devel] [PATCH v11 07/14] qemu-img: Implement commit like QMP Max Reitz
@ 2014-08-20 18:17 ` Max Reitz
  2014-08-20 18:17 ` [Qemu-devel] [PATCH v11 09/14] qemu-img: Enable progress output for commit Max Reitz
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2014-08-20 18:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

After the top image has been committed, it should be emptied unless
specified otherwise.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 qemu-img-cmds.hx |  4 ++--
 qemu-img.c       | 34 +++++++++++++++++++++++++++++++---
 qemu-img.texi    |  6 +++++-
 3 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index d029609..b31d81c 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -22,9 +22,9 @@ STEXI
 ETEXI
 
 DEF("commit", img_commit,
-    "commit [-q] [-f fmt] [-t cache] filename")
+    "commit [-q] [-f fmt] [-t cache] [-d] filename")
 STEXI
-@item commit [-q] [-f @var{fmt}] [-t @var{cache}] @var{filename}
+@item commit [-q] [-f @var{fmt}] [-t @var{cache}] [-d] @var{filename}
 ETEXI
 
 DEF("compare", img_compare,
diff --git a/qemu-img.c b/qemu-img.c
index 1b2255b..302eb15 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -759,14 +759,14 @@ static int img_commit(int argc, char **argv)
     int c, ret, flags;
     const char *filename, *fmt, *cache;
     BlockDriverState *bs, *base_bs;
-    bool quiet = false;
+    bool quiet = false, drop = false;
     Error *local_err = NULL;
     CommonBlockJobCBInfo cbi;
 
     fmt = NULL;
     cache = BDRV_DEFAULT_CACHE;
     for(;;) {
-        c = getopt(argc, argv, "f:ht:q");
+        c = getopt(argc, argv, "f:ht:dq");
         if (c == -1) {
             break;
         }
@@ -781,6 +781,9 @@ static int img_commit(int argc, char **argv)
         case 't':
             cache = optarg;
             break;
+        case 'd':
+            drop = true;
+            break;
         case 'q':
             quiet = true;
             break;
@@ -791,7 +794,7 @@ static int img_commit(int argc, char **argv)
     }
     filename = argv[optind++];
 
-    flags = BDRV_O_RDWR;
+    flags = BDRV_O_RDWR | BDRV_O_UNMAP;
     ret = bdrv_parse_cache_flags(cache, &flags);
     if (ret < 0) {
         error_report("Invalid cache option: %s", cache);
@@ -823,7 +826,32 @@ static int img_commit(int argc, char **argv)
         goto done;
     }
 
+    /* The block job will swap base_bs and bs (which is not what we really want
+     * here, but okay) and unref base_bs (after the swap, i.e., the old top
+     * image). In order to still be able to empty that top image afterwards,
+     * increment the reference counter here preemptively. */
+    if (!drop) {
+        bdrv_ref(base_bs);
+    }
+
     run_block_job(bs->job, &local_err);
+    if (local_err) {
+        goto unref_backing;
+    }
+
+    if (!drop && base_bs->drv->bdrv_make_empty) {
+        ret = base_bs->drv->bdrv_make_empty(base_bs);
+        if (ret) {
+            error_setg_errno(&local_err, -ret, "Could not empty %s",
+                             filename);
+            goto unref_backing;
+        }
+    }
+
+unref_backing:
+    if (!drop) {
+        bdrv_unref(base_bs);
+    }
 
 done:
     bdrv_unref(bs);
diff --git a/qemu-img.texi b/qemu-img.texi
index 514be90..a709e92 100644
--- a/qemu-img.texi
+++ b/qemu-img.texi
@@ -163,7 +163,7 @@ this case. @var{backing_file} will never be modified unless you use the
 The size can also be specified using the @var{size} option with @code{-o},
 it doesn't need to be specified separately in this case.
 
-@item commit [-f @var{fmt}] [-t @var{cache}] @var{filename}
+@item commit [-f @var{fmt}] [-t @var{cache}] [-d] @var{filename}
 
 Commit the changes recorded in @var{filename} in its base image or backing file.
 If the backing file is smaller than the snapshot, then the backing file will be
@@ -172,6 +172,10 @@ the backing file, the backing file will not be truncated.  If you want the
 backing file to match the size of the smaller snapshot, you can safely truncate
 it yourself once the commit operation successfully completes.
 
+The image @var{filename} is emptied after the operation has succeeded. If you do
+not need @var{filename} afterwards and intend to drop it, you may skip emptying
+@var{filename} by specifying the @code{-d} flag.
+
 @item compare [-f @var{fmt}] [-F @var{fmt}] [-p] [-s] [-q] @var{filename1} @var{filename2}
 
 Check if two images have the same content. You can compare images with
-- 
2.0.4


* [Qemu-devel] [PATCH v11 09/14] qemu-img: Enable progress output for commit
  2014-08-20 18:17 [Qemu-devel] [PATCH v11 00/14] qemu-img: Implement commit like QMP Max Reitz
                   ` (7 preceding siblings ...)
  2014-08-20 18:17 ` [Qemu-devel] [PATCH v11 08/14] qemu-img: Empty image after commit Max Reitz
@ 2014-08-20 18:17 ` Max Reitz
  2014-08-20 18:18 ` [Qemu-devel] [PATCH v11 10/14] qemu-img: Specify backing file " Max Reitz
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2014-08-20 18:17 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

Implement progress output for the commit command by querying the
progress of the block job.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 qemu-img-cmds.hx |  4 ++--
 qemu-img.c       | 24 ++++++++++++++++++++++--
 qemu-img.texi    |  2 +-
 3 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index b31d81c..ea41d4f 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -22,9 +22,9 @@ STEXI
 ETEXI
 
 DEF("commit", img_commit,
-    "commit [-q] [-f fmt] [-t cache] [-d] filename")
+    "commit [-q] [-f fmt] [-t cache] [-d] [-p] filename")
 STEXI
-@item commit [-q] [-f @var{fmt}] [-t @var{cache}] [-d] @var{filename}
+@item commit [-q] [-f @var{fmt}] [-t @var{cache}] [-d] [-p] @var{filename}
 ETEXI
 
 DEF("compare", img_compare,
diff --git a/qemu-img.c b/qemu-img.c
index 302eb15..1a35449 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -746,12 +746,18 @@ static void run_block_job(BlockJob *job, Error **errp)
     do {
         aio_poll(aio_context, true);
 
+        qemu_progress_print((float)job->offset / job->len * 100.f, 0);
+
         if (!job->busy && !job->ready) {
             block_job_resume(job);
         }
     } while (!job->ready);
 
     block_job_complete_sync(job, errp);
+
+    /* A block job may finish instantaneously without publishing any progress,
+     * so just signal completion here */
+    qemu_progress_print(100.f, 0);
 }
 
 static int img_commit(int argc, char **argv)
@@ -759,14 +765,14 @@ static int img_commit(int argc, char **argv)
     int c, ret, flags;
     const char *filename, *fmt, *cache;
     BlockDriverState *bs, *base_bs;
-    bool quiet = false, drop = false;
+    bool progress = false, quiet = false, drop = false;
     Error *local_err = NULL;
     CommonBlockJobCBInfo cbi;
 
     fmt = NULL;
     cache = BDRV_DEFAULT_CACHE;
     for(;;) {
-        c = getopt(argc, argv, "f:ht:dq");
+        c = getopt(argc, argv, "f:ht:dpq");
         if (c == -1) {
             break;
         }
@@ -784,11 +790,20 @@ static int img_commit(int argc, char **argv)
         case 'd':
             drop = true;
             break;
+        case 'p':
+            progress = true;
+            break;
         case 'q':
             quiet = true;
             break;
         }
     }
+
+    /* Progress is not shown in Quiet mode */
+    if (quiet) {
+        progress = false;
+    }
+
     if (optind != argc - 1) {
         error_exit("Expecting one image file name");
     }
@@ -806,6 +821,9 @@ static int img_commit(int argc, char **argv)
         return 1;
     }
 
+    qemu_progress_init(progress, 1.f);
+    qemu_progress_print(0.f, 100);
+
     /* This is different from QMP, which by default uses the deepest file in the
      * backing chain (i.e., the very base); however, the traditional behavior of
      * qemu-img commit is using the immediate backing file. */
@@ -854,6 +872,8 @@ unref_backing:
     }
 
 done:
+    qemu_progress_end();
+
     bdrv_unref(bs);
 
     if (local_err) {
diff --git a/qemu-img.texi b/qemu-img.texi
index a709e92..557b4de 100644
--- a/qemu-img.texi
+++ b/qemu-img.texi
@@ -163,7 +163,7 @@ this case. @var{backing_file} will never be modified unless you use the
 The size can also be specified using the @var{size} option with @code{-o},
 it doesn't need to be specified separately in this case.
 
-@item commit [-f @var{fmt}] [-t @var{cache}] [-d] @var{filename}
+@item commit [-q] [-f @var{fmt}] [-t @var{cache}] [-d] [-p] @var{filename}
 
 Commit the changes recorded in @var{filename} in its base image or backing file.
 If the backing file is smaller than the snapshot, then the backing file will be
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread
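
A short usage sketch of the new flag (hypothetical image name); as the
option handling above shows, -q takes precedence and disables progress
output again:

    # Print progress while committing; the percentage is derived from the
    # block job's offset/len in run_block_job().
    qemu-img commit -p top.qcow2

    # Quiet mode wins over -p, so no progress is printed here.
    qemu-img commit -q -p top.qcow2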

* [Qemu-devel] [PATCH v11 10/14] qemu-img: Specify backing file for commit
  2014-08-20 18:17 [Qemu-devel] [PATCH v11 00/14] qemu-img: Implement commit like QMP Max Reitz
                   ` (8 preceding siblings ...)
  2014-08-20 18:17 ` [Qemu-devel] [PATCH v11 09/14] qemu-img: Enable progress output for commit Max Reitz
@ 2014-08-20 18:18 ` Max Reitz
  2014-08-20 18:18 ` [Qemu-devel] [PATCH v11 11/14] iotests: Add _filter_qemu_img_map Max Reitz
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2014-08-20 18:18 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

Introduce a new parameter for qemu-img commit which may be used to
explicitly specify the backing file into which an image should be
committed if the backing chain has more than a single layer.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 qemu-img-cmds.hx |  4 ++--
 qemu-img.c       | 24 +++++++++++++++++-------
 qemu-img.texi    |  9 ++++++++-
 3 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index ea41d4f..4331949 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -22,9 +22,9 @@ STEXI
 ETEXI
 
 DEF("commit", img_commit,
-    "commit [-q] [-f fmt] [-t cache] [-d] [-p] filename")
+    "commit [-q] [-f fmt] [-t cache] [-b base] [-d] [-p] filename")
 STEXI
-@item commit [-q] [-f @var{fmt}] [-t @var{cache}] [-d] [-p] @var{filename}
+@item commit [-q] [-f @var{fmt}] [-t @var{cache}] [-b @var{base}] [-d] [-p] @var{filename}
 ETEXI
 
 DEF("compare", img_compare,
diff --git a/qemu-img.c b/qemu-img.c
index 1a35449..f330e6d 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -763,7 +763,7 @@ static void run_block_job(BlockJob *job, Error **errp)
 static int img_commit(int argc, char **argv)
 {
     int c, ret, flags;
-    const char *filename, *fmt, *cache;
+    const char *filename, *fmt, *cache, *base;
     BlockDriverState *bs, *base_bs;
     bool progress = false, quiet = false, drop = false;
     Error *local_err = NULL;
@@ -771,8 +771,9 @@ static int img_commit(int argc, char **argv)
 
     fmt = NULL;
     cache = BDRV_DEFAULT_CACHE;
+    base = NULL;
     for(;;) {
-        c = getopt(argc, argv, "f:ht:dpq");
+        c = getopt(argc, argv, "f:ht:b:dpq");
         if (c == -1) {
             break;
         }
@@ -787,6 +788,11 @@ static int img_commit(int argc, char **argv)
         case 't':
             cache = optarg;
             break;
+        case 'b':
+            base = optarg;
+            /* -b implies -d */
+            drop = true;
+            break;
         case 'd':
             drop = true;
             break;
@@ -824,12 +830,16 @@ static int img_commit(int argc, char **argv)
     qemu_progress_init(progress, 1.f);
     qemu_progress_print(0.f, 100);
 
-    /* This is different from QMP, which by default uses the deepest file in the
-     * backing chain (i.e., the very base); however, the traditional behavior of
-     * qemu-img commit is using the immediate backing file. */
-    base_bs = bs->backing_hd;
+    if (base) {
+        base_bs = bdrv_find_backing_image(bs, base);
+    } else {
+        /* This is different from QMP, which by default uses the deepest file in
+         * the backing chain (i.e., the very base); however, the traditional
+         * behavior of qemu-img commit is using the immediate backing file. */
+        base_bs = bs->backing_hd;
+    }
     if (!base_bs) {
-        error_set(&local_err, QERR_BASE_NOT_FOUND, "NULL");
+        error_set(&local_err, QERR_BASE_NOT_FOUND, base ?: "NULL");
         goto done;
     }
 
diff --git a/qemu-img.texi b/qemu-img.texi
index 557b4de..82558b8 100644
--- a/qemu-img.texi
+++ b/qemu-img.texi
@@ -163,7 +163,7 @@ this case. @var{backing_file} will never be modified unless you use the
 The size can also be specified using the @var{size} option with @code{-o},
 it doesn't need to be specified separately in this case.
 
-@item commit [-q] [-f @var{fmt}] [-t @var{cache}] [-d] [-p] @var{filename}
+@item commit [-q] [-f @var{fmt}] [-t @var{cache}] [-b @var{base}] [-d] [-p] @var{filename}
 
 Commit the changes recorded in @var{filename} in its base image or backing file.
 If the backing file is smaller than the snapshot, then the backing file will be
@@ -176,6 +176,13 @@ The image @var{filename} is emptied after the operation has succeeded. If you do
 not need @var{filename} afterwards and intend to drop it, you may skip emptying
 @var{filename} by specifying the @code{-d} flag.
 
+If the backing chain of the given image file @var{filename} has more than one
+layer, the backing file into which the changes will be committed may be
+specified as @var{base} (which has to be part of @var{filename}'s backing
+chain). If @var{base} is not specified, the immediate backing file of the top
+image (which is @var{filename}) will be used. For reasons of consistency,
+explicitly specifying @var{base} will always imply @code{-d}.
+
 @item compare [-f @var{fmt}] [-F @var{fmt}] [-p] [-s] [-q] @var{filename1} @var{filename2}
 
 Check if two images have the same content. You can compare images with
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread
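
A sketch of the new -b option on a three-image chain (all file names
hypothetical):

    # Hypothetical chain: base.qcow2 <- itmd.qcow2 <- top.qcow2
    # Commit top.qcow2 into an explicitly chosen backing image; as
    # documented above, -b implies -d, so top.qcow2 is not emptied.
    qemu-img commit -b itmd.qcow2 top.qcow2

    # Committing into the deepest image moves all data above base.qcow2
    # (from both top.qcow2 and itmd.qcow2) into it.
    qemu-img commit -b base.qcow2 top.qcow2

    # Without -b, the immediate backing file (itmd.qcow2) is used and
    # top.qcow2 is emptied after a successful commit.
    qemu-img commit top.qcow2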

* [Qemu-devel] [PATCH v11 11/14] iotests: Add _filter_qemu_img_map
  2014-08-20 18:17 [Qemu-devel] [PATCH v11 00/14] qemu-img: Implement commit like QMP Max Reitz
                   ` (9 preceding siblings ...)
  2014-08-20 18:18 ` [Qemu-devel] [PATCH v11 10/14] qemu-img: Specify backing file " Max Reitz
@ 2014-08-20 18:18 ` Max Reitz
  2014-08-20 18:18 ` [Qemu-devel] [PATCH v11 12/14] iotests: Add test for backing-chain commits Max Reitz
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2014-08-20 18:18 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

As different image formats most probably map guest addresses to
different host addresses, add a filter that strips the host addresses;
the image filename should be filtered as well.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 tests/qemu-iotests/common.filter | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/tests/qemu-iotests/common.filter b/tests/qemu-iotests/common.filter
index 51192c8..05e92a0 100644
--- a/tests/qemu-iotests/common.filter
+++ b/tests/qemu-iotests/common.filter
@@ -192,5 +192,12 @@ _filter_img_create()
         -e "s/archipelago:a/TEST_DIR\//g"
 }
 
+# filter out offsets and file names from qemu-img map
+_filter_qemu_img_map()
+{
+    sed -e 's/\([0-9a-fx]* *[0-9a-fx]* *\)[0-9a-fx]* */\1/g' \
+        -e 's/Mapped to *//' | _filter_testdir | _filter_imgfmt
+}
+
 # make sure this script returns success
 /bin/true
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread
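
To illustrate, a sketch of the filter's intended effect on one line of
human-readable "qemu-img map" output (offsets and path are made up):

    # Within an iotest (common.filter sourced, TEST_DIR/IMGFMT set), the
    # "Mapped to" column with the host offset and the test-specific parts
    # of the file name are stripped, e.g.
    #
    #   Offset     Length     Mapped to   File
    #   0          0x30000    0x50000     /var/tmp/scratch/t.qcow2.base
    #
    # becomes
    #
    #   Offset     Length     File
    #   0          0x30000    TEST_DIR/t.IMGFMT.base
    $QEMU_IMG map "$TEST_IMG.base" | _filter_qemu_img_map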

* [Qemu-devel] [PATCH v11 12/14] iotests: Add test for backing-chain commits
  2014-08-20 18:17 [Qemu-devel] [PATCH v11 00/14] qemu-img: Implement commit like QMP Max Reitz
                   ` (10 preceding siblings ...)
  2014-08-20 18:18 ` [Qemu-devel] [PATCH v11 11/14] iotests: Add _filter_qemu_img_map Max Reitz
@ 2014-08-20 18:18 ` Max Reitz
  2014-08-20 18:18 ` [Qemu-devel] [PATCH v11 13/14] iotests: Add test for qcow2's bdrv_make_empty Max Reitz
  2014-08-20 18:18 ` [Qemu-devel] [PATCH v11 14/14] iotests: Omit length/offset test in 040 and 041 Max Reitz
  13 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2014-08-20 18:18 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

Add a test for qemu-img commit on backing chains with more than two
images. This test also checks whether the top image is emptied (unless
this is prevented by specifying either -d or -b) and therefore does not
work for qed and vmdk, which requires it to be separate from 020.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 tests/qemu-iotests/097     | 122 +++++++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/097.out | 119 +++++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/group   |   1 +
 3 files changed, 242 insertions(+)
 create mode 100755 tests/qemu-iotests/097
 create mode 100644 tests/qemu-iotests/097.out

diff --git a/tests/qemu-iotests/097 b/tests/qemu-iotests/097
new file mode 100755
index 0000000..c7a613b
--- /dev/null
+++ b/tests/qemu-iotests/097
@@ -0,0 +1,122 @@
+#!/bin/bash
+#
+# Commit changes into backing chains and empty the top image if the
+# backing image is not explicitly specified
+#
+# Copyright (C) 2014 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+# creator
+owner=mreitz@redhat.com
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+
+here="$PWD"
+tmp=/tmp/$$
+status=1	# failure is the default!
+
+_cleanup()
+{
+    _cleanup_test_img
+    _rm_test_img "$TEST_IMG.itmd"
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+. ./common.pattern
+
+# Any format supporting backing files and bdrv_make_empty
+_supported_fmt qcow qcow2
+_supported_proto file
+_supported_os Linux
+
+
+# Four passes:
+#  0: Two-layer backing chain, commit to upper backing file (implicitly)
+#     (in this case, the top image will be emptied)
+#  1: Two-layer backing chain, commit to upper backing file (explicitly)
+#     (in this case, the top image will implicitly stay unchanged)
+#  2: Two-layer backing chain, commit to upper backing file (implicitly with -d)
+#     (in this case, the top image will explicitly stay unchanged)
+#  3: Two-layer backing chain, commit to lower backing file
+#     (in this case, the top image will implicitly stay unchanged)
+#
+# 020 already tests committing, so this only tests whether image chains are
+# working properly and that all images above the base are emptied; therefore,
+# no complicated patterns are necessary
+for i in 0 1 2 3; do
+
+echo
+echo "=== Test pass $i ==="
+echo
+
+TEST_IMG="$TEST_IMG.base" _make_test_img 64M
+TEST_IMG="$TEST_IMG.itmd" _make_test_img -b "$TEST_IMG.base" 64M
+_make_test_img -b "$TEST_IMG.itmd" 64M
+
+$QEMU_IO -c 'write -P 1 0 192k' "$TEST_IMG.base" | _filter_qemu_io
+$QEMU_IO -c 'write -P 2 64k 128k' "$TEST_IMG.itmd" | _filter_qemu_io
+$QEMU_IO -c 'write -P 3 128k 64k' "$TEST_IMG" | _filter_qemu_io
+
+if [ $i -lt 3 ]; then
+    if [ $i == 0 ]; then
+        # -b "$TEST_IMG.itmd" should be the default (that is, committing to the
+        # first backing file in the chain)
+        $QEMU_IMG commit "$TEST_IMG"
+    elif [ $i == 1 ]; then
+        # explicitly specify the commit target (this should imply -d)
+        $QEMU_IMG commit -b "$TEST_IMG.itmd" "$TEST_IMG"
+    else
+        # do not explicitly specify the commit target, but use -d to leave the
+        # top image unchanged
+        $QEMU_IMG commit -d "$TEST_IMG"
+    fi
+
+    # Bottom should be unchanged
+    $QEMU_IO -c 'read -P 1 0 192k' "$TEST_IMG.base" | _filter_qemu_io
+
+    # Intermediate should contain changes from top
+    $QEMU_IO -c 'read -P 1 0 64k' "$TEST_IMG.itmd" | _filter_qemu_io
+    $QEMU_IO -c 'read -P 2 64k 64k' "$TEST_IMG.itmd" | _filter_qemu_io
+    $QEMU_IO -c 'read -P 3 128k 64k' "$TEST_IMG.itmd" | _filter_qemu_io
+
+    # And in pass 0, the top image should be empty, whereas in both other passes
+    # it should be unchanged (which is both checked by qemu-img map)
+else
+    $QEMU_IMG commit -b "$TEST_IMG.base" "$TEST_IMG"
+
+    # Bottom should contain all changes
+    $QEMU_IO -c 'read -P 1 0 64k' "$TEST_IMG.base" | _filter_qemu_io
+    $QEMU_IO -c 'read -P 2 64k 64k' "$TEST_IMG.base" | _filter_qemu_io
+    $QEMU_IO -c 'read -P 3 128k 64k' "$TEST_IMG.base" | _filter_qemu_io
+
+    # Both top and intermediate should be unchanged
+fi
+
+$QEMU_IMG map "$TEST_IMG.base" | _filter_qemu_img_map
+$QEMU_IMG map "$TEST_IMG.itmd" | _filter_qemu_img_map
+$QEMU_IMG map "$TEST_IMG" | _filter_qemu_img_map
+
+done
+
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/097.out b/tests/qemu-iotests/097.out
new file mode 100644
index 0000000..1cb7764
--- /dev/null
+++ b/tests/qemu-iotests/097.out
@@ -0,0 +1,119 @@
+QA output created by 097
+
+=== Test pass 0 ===
+
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864 
+Formatting 'TEST_DIR/t.IMGFMT.itmd', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.base' 
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.itmd' 
+wrote 196608/196608 bytes at offset 0
+192 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 131072/131072 bytes at offset 65536
+128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 131072
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Image committed.
+read 196608/196608 bytes at offset 0
+192 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 65536
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 131072
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Offset          Length          File
+0               0x30000         TEST_DIR/t.IMGFMT.base
+Offset          Length          File
+0               0x10000         TEST_DIR/t.IMGFMT.base
+0x10000         0x20000         TEST_DIR/t.IMGFMT.itmd
+Offset          Length          File
+0               0x10000         TEST_DIR/t.IMGFMT.base
+0x10000         0x20000         TEST_DIR/t.IMGFMT.itmd
+
+=== Test pass 1 ===
+
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864 
+Formatting 'TEST_DIR/t.IMGFMT.itmd', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.base' 
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.itmd' 
+wrote 196608/196608 bytes at offset 0
+192 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 131072/131072 bytes at offset 65536
+128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 131072
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Image committed.
+read 196608/196608 bytes at offset 0
+192 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 65536
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 131072
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Offset          Length          File
+0               0x30000         TEST_DIR/t.IMGFMT.base
+Offset          Length          File
+0               0x10000         TEST_DIR/t.IMGFMT.base
+0x10000         0x20000         TEST_DIR/t.IMGFMT.itmd
+Offset          Length          File
+0               0x10000         TEST_DIR/t.IMGFMT.base
+0x10000         0x10000         TEST_DIR/t.IMGFMT.itmd
+0x20000         0x10000         TEST_DIR/t.IMGFMT
+
+=== Test pass 2 ===
+
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864 
+Formatting 'TEST_DIR/t.IMGFMT.itmd', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.base' 
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.itmd' 
+wrote 196608/196608 bytes at offset 0
+192 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 131072/131072 bytes at offset 65536
+128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 131072
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Image committed.
+read 196608/196608 bytes at offset 0
+192 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 65536
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 131072
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Offset          Length          File
+0               0x30000         TEST_DIR/t.IMGFMT.base
+Offset          Length          File
+0               0x10000         TEST_DIR/t.IMGFMT.base
+0x10000         0x20000         TEST_DIR/t.IMGFMT.itmd
+Offset          Length          File
+0               0x10000         TEST_DIR/t.IMGFMT.base
+0x10000         0x10000         TEST_DIR/t.IMGFMT.itmd
+0x20000         0x10000         TEST_DIR/t.IMGFMT
+
+=== Test pass 3 ===
+
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864 
+Formatting 'TEST_DIR/t.IMGFMT.itmd', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.base' 
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.itmd' 
+wrote 196608/196608 bytes at offset 0
+192 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 131072/131072 bytes at offset 65536
+128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 131072
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Image committed.
+read 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 65536
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 131072
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Offset          Length          File
+0               0x30000         TEST_DIR/t.IMGFMT.base
+Offset          Length          File
+0               0x10000         TEST_DIR/t.IMGFMT.base
+0x10000         0x20000         TEST_DIR/t.IMGFMT.itmd
+Offset          Length          File
+0               0x10000         TEST_DIR/t.IMGFMT.base
+0x10000         0x10000         TEST_DIR/t.IMGFMT.itmd
+0x20000         0x10000         TEST_DIR/t.IMGFMT
+*** done
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index f157864..2727774 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -100,5 +100,6 @@
 091 rw auto quick
 092 rw auto quick
 095 rw auto quick
+097 rw auto backing
 099 rw auto quick
 103 rw auto quick
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH v11 13/14] iotests: Add test for qcow2's bdrv_make_empty
  2014-08-20 18:17 [Qemu-devel] [PATCH v11 00/14] qemu-img: Implement commit like QMP Max Reitz
                   ` (11 preceding siblings ...)
  2014-08-20 18:18 ` [Qemu-devel] [PATCH v11 12/14] iotests: Add test for backing-chain commits Max Reitz
@ 2014-08-20 18:18 ` Max Reitz
  2014-08-20 18:18 ` [Qemu-devel] [PATCH v11 14/14] iotests: Omit length/offset test in 040 and 041 Max Reitz
  13 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2014-08-20 18:18 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

Add a test for qcow2's fast bdrv_make_empty implementation on images
without internal snapshots.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 tests/qemu-iotests/098     | 75 ++++++++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/098.out | 26 ++++++++++++++++
 tests/qemu-iotests/group   |  1 +
 3 files changed, 102 insertions(+)
 create mode 100755 tests/qemu-iotests/098
 create mode 100644 tests/qemu-iotests/098.out

diff --git a/tests/qemu-iotests/098 b/tests/qemu-iotests/098
new file mode 100755
index 0000000..116f935
--- /dev/null
+++ b/tests/qemu-iotests/098
@@ -0,0 +1,75 @@
+#!/bin/bash
+#
+# Test qcow2's bdrv_make_empty for images without internal snapshots
+#
+# Copyright (C) 2014 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+# creator
+owner=mreitz@redhat.com
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+
+here="$PWD"
+tmp=/tmp/$$
+status=1	# failure is the default!
+
+_cleanup()
+{
+    _cleanup_test_img
+    rm -f "$TEST_DIR/blkdebug.conf"
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+. ./common.pattern
+
+_supported_fmt qcow2
+_supported_proto file
+_supported_os Linux
+
+
+for event in l1_update reftable_update; do
+
+echo
+echo "=== $event ==="
+echo
+
+TEST_IMG="$TEST_IMG.base" _make_test_img 64M
+_make_test_img -b "$TEST_IMG.base" 64M
+
+cat > "$TEST_DIR/blkdebug.conf" <<EOF
+[inject-error]
+event = "$event"
+EOF
+
+$QEMU_IMG commit "blkdebug:$TEST_DIR/blkdebug.conf:$TEST_IMG" 2>&1 \
+    | _filter_testdir | _filter_imgfmt
+
+_check_test_img
+
+done
+
+
+rm -f "$TEST_DIR/blkdebug.conf"
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/098.out b/tests/qemu-iotests/098.out
new file mode 100644
index 0000000..5f5af1b
--- /dev/null
+++ b/tests/qemu-iotests/098.out
@@ -0,0 +1,26 @@
+QA output created by 098
+
+=== l1_update ===
+
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864 
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.base' 
+qemu-img: Could not empty blkdebug:TEST_DIR/blkdebug.conf:TEST_DIR/t.IMGFMT: Input/output error
+Leaked cluster 4 refcount=1 reference=0
+Leaked cluster 5 refcount=1 reference=0
+Leaked cluster 6 refcount=1 reference=0
+
+3 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
+
+=== reftable_update ===
+
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864 
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.base' 
+qemu-img: Could not empty blkdebug:TEST_DIR/blkdebug.conf:TEST_DIR/t.IMGFMT: Input/output error
+Leaked cluster 3 refcount=1 reference=0
+Leaked cluster 4 refcount=1 reference=0
+Leaked cluster 5 refcount=1 reference=0
+
+3 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
+*** done
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index 2727774..15c8e6e 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -101,5 +101,6 @@
 092 rw auto quick
 095 rw auto quick
 097 rw auto backing
+098 rw auto backing quick
 099 rw auto quick
 103 rw auto quick
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH v11 14/14] iotests: Omit length/offset test in 040 and 041
  2014-08-20 18:17 [Qemu-devel] [PATCH v11 00/14] qemu-img: Implement commit like QMP Max Reitz
                   ` (12 preceding siblings ...)
  2014-08-20 18:18 ` [Qemu-devel] [PATCH v11 13/14] iotests: Add test for qcow2's bdrv_make_empty Max Reitz
@ 2014-08-20 18:18 ` Max Reitz
  13 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2014-08-20 18:18 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

As the length of a mirror block job no longer directly depends on the
size of the block device, drop those checks from this test. Instead,
just check whether the final offset equals the block job length.

As 041 uses the wait_until_completed function from iotests.py, the same
applies there as well, which in turn affects tests 030, 055 and 056. On
the other hand, a block job's length does not have to be related to the
length of the image file in the first place, so that check was
questionable anyway.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 tests/qemu-iotests/040        | 4 +---
 tests/qemu-iotests/041        | 3 +--
 tests/qemu-iotests/iotests.py | 3 +--
 3 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/tests/qemu-iotests/040 b/tests/qemu-iotests/040
index f1e16c1..2b432ad 100755
--- a/tests/qemu-iotests/040
+++ b/tests/qemu-iotests/040
@@ -43,8 +43,7 @@ class ImageCommitTestCase(iotests.QMPTestCase):
                 if event['event'] == 'BLOCK_JOB_COMPLETED':
                     self.assert_qmp(event, 'data/type', 'commit')
                     self.assert_qmp(event, 'data/device', 'drive0')
-                    self.assert_qmp(event, 'data/offset', self.image_len)
-                    self.assert_qmp(event, 'data/len', self.image_len)
+                    self.assert_qmp(event, 'data/offset', event['data']['len'])
                     if need_ready:
                         self.assertTrue(ready, "Expecting BLOCK_JOB_COMPLETED event")
                     completed = True
@@ -52,7 +51,6 @@ class ImageCommitTestCase(iotests.QMPTestCase):
                     ready = True
                     self.assert_qmp(event, 'data/type', 'commit')
                     self.assert_qmp(event, 'data/device', 'drive0')
-                    self.assert_qmp(event, 'data/len', self.image_len)
                     self.vm.qmp('block-job-complete', device='drive0')
 
         self.assert_no_active_block_jobs()
diff --git a/tests/qemu-iotests/041 b/tests/qemu-iotests/041
index 5dbd4ee..90721cc 100755
--- a/tests/qemu-iotests/041
+++ b/tests/qemu-iotests/041
@@ -52,8 +52,7 @@ class ImageMirroringTestCase(iotests.QMPTestCase):
         event = self.cancel_and_wait(drive=drive)
         self.assertEquals(event['event'], 'BLOCK_JOB_COMPLETED')
         self.assert_qmp(event, 'data/type', 'mirror')
-        self.assert_qmp(event, 'data/offset', self.image_len)
-        self.assert_qmp(event, 'data/len', self.image_len)
+        self.assert_qmp(event, 'data/offset', event['data']['len'])
 
     def complete_and_wait(self, drive='drive0', wait_ready=True):
         '''Complete a block job and wait for it to finish'''
diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 39a4cfc..f57f154 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -267,8 +267,7 @@ class QMPTestCase(unittest.TestCase):
                     self.assert_qmp(event, 'data/device', drive)
                     self.assert_qmp_absent(event, 'data/error')
                     if check_offset:
-                        self.assert_qmp(event, 'data/offset', self.image_len)
-                    self.assert_qmp(event, 'data/len', self.image_len)
+                        self.assert_qmp(event, 'data/offset', event['data']['len'])
                     completed = True
 
         self.assert_no_active_block_jobs()
-- 
2.0.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v11 03/14] qcow2: Optimize bdrv_make_empty()
  2014-08-20 18:17 ` [Qemu-devel] [PATCH v11 03/14] qcow2: Optimize bdrv_make_empty() Max Reitz
@ 2014-08-21 14:31   ` Kevin Wolf
  2014-08-22 13:59     ` Max Reitz
  0 siblings, 1 reply; 18+ messages in thread
From: Kevin Wolf @ 2014-08-21 14:31 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-devel, Stefan Hajnoczi

On 20.08.2014 at 20:17, Max Reitz wrote:
> bdrv_make_empty() is currently only called if the current image
> represents an external snapshot that has been committed to its base
> image; it is therefore unlikely to have internal snapshots. In this
> case, bdrv_make_empty() can be greatly sped up by creating an empty L1
> table and dropping all data clusters at once by recreating the refcount
> structure accordingly instead of normally discarding all clusters.
> 
> If there are snapshots, fall back to the simple implementation (discard
> all clusters).
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>
> ---
>  block/qcow2.c | 389 +++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 374 insertions(+), 15 deletions(-)
> 
> diff --git a/block/qcow2.c b/block/qcow2.c
> index eca7656..fa8ca66 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -2153,27 +2153,386 @@ fail:
>      return ret;
>  }
>  
> +/*
> + * Creates a reftable pointing to refblocks following right afterwards and an
> + * empty L1 table at the given @offset. @refblocks is the number of refblocks
> + * to create.
> + *
> + * This combination of structures (reftable+refblocks+L1) is here called a
> + * "blob".
> + */
> +static int create_refcount_l1(BlockDriverState *bs, uint64_t offset,
> +                              uint64_t refblocks)
> +{
> +    BDRVQcowState *s = bs->opaque;
> +    uint64_t *reftable = NULL;
> +    uint16_t *refblock = NULL;
> +    uint64_t reftable_clusters;
> +    uint64_t rbi;
> +    uint64_t blob_start, blob_end;
> +    uint64_t l2_tables, l1_clusters, l1_size2;
> +    uint8_t l1_size_and_offset[12];
> +    uint64_t rt_offset;
> +    int ret, i;
> +
> +    reftable_clusters = DIV_ROUND_UP(refblocks, s->cluster_size / 8);
> +    l2_tables = DIV_ROUND_UP(bs->total_sectors / s->cluster_sectors,
> +                             s->cluster_size / 8);
> +    l1_clusters = DIV_ROUND_UP(l2_tables, s->cluster_size / 8);
> +    l1_size2 = l1_clusters << s->cluster_bits;
> +
> +    blob_start = offset;
> +    blob_end = offset + ((reftable_clusters + refblocks + l1_clusters) <<
> +                s->cluster_bits);
> +
> +    ret = qcow2_pre_write_overlap_check(bs, 0, blob_start,
> +                                        blob_end - blob_start);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    /* Create the reftable pointing with entries pointing consecutively to the
> +     * following clusters */
> +    reftable = g_malloc0_n(reftable_clusters, s->cluster_size);
> +
> +    for (rbi = 0; rbi < refblocks; rbi++) {
> +        reftable[rbi] = cpu_to_be64(offset + ((reftable_clusters + rbi) <<
> +                        s->cluster_bits));
> +    }
> +
> +    ret = bdrv_write(bs->file, offset >> BDRV_SECTOR_BITS, (uint8_t *)reftable,
> +                     reftable_clusters * s->cluster_sectors);
> +    if (ret < 0) {
> +        goto out;
> +    }
> +
> +    offset += reftable_clusters << s->cluster_bits;
> +
> +    /* Keep the reftable, as we will need it for the BDRVQcowState anyway */
> +
> +    /* Now, create all the refblocks */
> +    refblock = g_malloc(s->cluster_size);
> +
> +    for (rbi = 0; rbi < refblocks; rbi++) {
> +        for (i = 0; i < s->cluster_size / 2; i++) {
> +            uint64_t cluster_offset = ((rbi << (s->cluster_bits - 1)) + i)
> +                                      << s->cluster_bits;
> +
> +            /* Only 0 and 1 are possible as refcounts here */
> +            refblock[i] = cpu_to_be16(!cluster_offset ||
> +                                      (cluster_offset >= blob_start &&
> +                                       cluster_offset <  blob_end));
> +        }
> +
> +        ret = bdrv_write(bs->file, offset >> BDRV_SECTOR_BITS,
> +                         (uint8_t *)refblock, s->cluster_sectors);
> +        if (ret < 0) {
> +            goto out;
> +        }
> +
> +        offset += s->cluster_size;
> +    }
> +
> +    g_free(refblock);
> +    refblock = NULL;
> +
> +    /* The L1 table is very simple */
> +    ret = bdrv_write_zeroes(bs->file, offset >> BDRV_SECTOR_BITS,
> +                            l1_clusters * s->cluster_sectors,
> +                            BDRV_REQ_MAY_UNMAP);

I wouldn't discard the L1 table. The next write operation will need it
again.

> +    if (ret < 0) {
> +        goto out;
> +    }
> +
> +    /* Now make sure all changes are stable and clear all caches */
> +    bdrv_flush(bs);

bdrv_flush() can fail.

> +    /* This is probably not really necessary, but it cannot hurt, either */
> +    ret = qcow2_mark_clean(bs);
> +    if (ret < 0) {
> +        goto out;
> +    }
> +
> +    ret = qcow2_cache_empty(bs, s->l2_table_cache);
> +    if (ret < 0) {
> +        goto out;
> +    }
> +
> +    ret = qcow2_cache_empty(bs, s->refcount_block_cache);
> +    if (ret < 0) {
> +        goto out;
> +    }
> +
> +    /* Modify the image header to point to the new blank L1 table. This will
> +     * leak all currently existing data clusters, which is fine. */
> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_UPDATE);
> +
> +    assert(l1_size2 / 8 <= UINT32_MAX);
> +    cpu_to_be32w((uint32_t *)&l1_size_and_offset[0], l1_size2 / 8);
> +    cpu_to_be64w((uint64_t *)&l1_size_and_offset[4], offset);
> +    ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, l1_size),
> +                           l1_size_and_offset, sizeof(l1_size_and_offset));
> +    if (ret < 0) {
> +        goto out;
> +    }
> +
> +    /* Adapt the in-memory L1 table accordingly */
> +    s->l1_table_offset = offset;
> +    s->l1_size         = l1_size2 / 8;
> +
> +    s->l1_table = g_realloc(s->l1_table, l1_size2);

s->l1_table is allocated with qemu_try_blockalign(). You should probably
use another qemu_try_blockalign() + qemu_vfree(old) here.

> +    memset(s->l1_table, 0, l1_size2);
> +
> +    /* Modify the image header to point to the refcount table. This will fix all
> +     * leaks (unless an error occurs). */
> +    BLKDBG_EVENT(bs->file, BLKDBG_REFTABLE_UPDATE);
> +
> +    rt_offset = cpu_to_be64(blob_start);
> +    ret = bdrv_pwrite_sync(bs->file,
> +                           offsetof(QCowHeader, refcount_table_offset),
> +                           &rt_offset, sizeof(rt_offset));

Don't you need to update refcount_table_clusters for the case that the
blob hits the boundary where the table needs a new cluster?

> +    if (ret < 0) {
> +        goto out;
> +    }
> +
> +    /* The image is now clean. The only thing left is to adapt the in-memory
> +     * refcount table to match it. */
> +    s->refcount_table_offset = blob_start;
> +    s->refcount_table_size   = reftable_clusters << (s->cluster_bits - 3);
> +
> +    for (rbi = 0; rbi < refblocks; rbi++) {
> +        be64_to_cpus(&reftable[rbi]);
> +    }
> +
> +    g_free(s->refcount_table);
> +    s->refcount_table = reftable;
> +    reftable = NULL;
> +
> +out:
> +    g_free(refblock);
> +    g_free(reftable);
> +    return ret;
> +}
> +
> +/*
> + * Calculates the number of clusters required for an L1 table for an image with
> + * the given parameters plus the reftable and the refblocks required to cover
> + * themselves, the L1 table and a given number of clusters @overhead.
> + *
> + * @ts:       Total number of guest sectors the image provides
> + * @cb:       1 << @cb is the cluster size in bytes
> + * @spcb:     1 << @spcb is the number of clusters per sector

Sectors per cluster, I guess.

> + * @overhead: Number of clusters which shall additionally be covered by the
> + *            refcount structures
> + */
> +static uint64_t minimal_blob_size(uint64_t ts, int cb, int spcb,
> +                                  uint64_t overhead)
> +{
> +    uint64_t cs  = UINT64_C(1) << cb;
> +    uint64_t spc = UINT64_C(1) << spcb;
> +
> +    /* The following statement is a solution for this system of equations:
> +     *
> +     * Let req_cls be the required number of clusters required for the reftable,
> +     * all refblocks and the L1 table.
> +     *
> +     * rtc be the clusters required for the reftable, rbc the clusters for all
> +     * refblocks (i.e., the number of refblocks), l1c the clusters for the L1
> +     * table and l2c the clusters for all L2 tables (i.e., the number of L2
> +     * tables).
> +     *
> +     * cs be the cluster size (in bytes), ts the total number of sectors, spc
> +     * the number of sectors per cluster and tc the total number of clusters.
> +     *
> +     * overhead is a number of clusters which should additionally be covered by
> +     * the refcount structures (i.e. all clusters before this blob of req_cls
> +     * clusters).
> +     *
> +     * req_cls >= rtc + rbc + l1c
> +     *   -- should be obvious

The >= isn't, I would have expected =.

> +     * rbc      = ceil((overhead + req_cls) / (cs / 2))
> +     *   -- as each refblock holds cs/2 entries

Hopefully we'll never implement refcount_order != 4...

> +     * rtc      = ceil(rbc                  / (cs / 8))
> +     *   -- as each reftable cluster holds cs/8 entries
> +     * tc       = ceil(ts / spc)
> +     *   -- should be obvious as well
> +     * l2c      = ceil(tc  / (cs / 8))
> +     *   -- as each L2 table holds cs/8 entries
> +     * l1c      = ceil(l2c / (cs / 8))
> +     *   -- as each L1 table cluster holds cs/8 entries
> +     *
> +     * The following equation yields a result which satisfies the constraint.
> +     * The result may be too high, but is never too low.
> +     *
> +     * The original calculation (without bitshifting) was:
> +     *
> +     * DIV_ROUND_UP(overhead * (1 + cs / 8) +
> +     *              3 * cs * cs / 16 - 5 * cs / 8 - 1 +
> +     *              4 * (ts + spc * cs / 8 - 1) / spc,
> +     *              cs * cs / 16 - cs / 8 - 1)

Should be obvious?

I think line 3 is for calculating l1c. That could easily be separated
out because it doesn't depend on any of the other variables.

After doing some maths I'm relatively close to your formula, but I still
have 8 * cs instead of 5 * cs in the second line, and I also don't know
where you got the -1 from.

And this is the reason why I asked for a method without precalculating
these sizes...

> +     *
> +     */
> +
> +    return DIV_ROUND_UP(overhead + (overhead << (cb - 3)) +
> +                        (3 << (2 * cb - 4)) - (5 << (cb - 3)) - 1 +
> +                        (4 * (ts + (spc << (cb - 3)) - 1) >> spcb),
> +                        (1 << (2 * cb - 4)) - cs / 8 - 1);
> +}
> +
>  static int qcow2_make_empty(BlockDriverState *bs)
>  {
> +    BDRVQcowState *s = bs->opaque;
>      int ret = 0;
> -    uint64_t start_sector;
> -    int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
>  
> -    for (start_sector = 0; start_sector < bs->total_sectors;
> -         start_sector += sector_step)
> -    {
> -        /* As this function is generally used after committing an external
> -         * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
> -         * default action for this kind of discard is to pass the discard,
> -         * which will ideally result in an actually smaller image file, as
> -         * is probably desired. */
> -        ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
> -                                     MIN(sector_step,
> -                                         bs->total_sectors - start_sector),
> -                                     QCOW2_DISCARD_SNAPSHOT, true);
> +    if (s->snapshots) {
> +        uint64_t start_sector;
> +        int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
> +
> +        /* If there are snapshots, every active cluster has to be discarded */
> +
> +        for (start_sector = 0; start_sector < bs->total_sectors;
> +             start_sector += sector_step)
> +        {
> +            /* As this function is generally used after committing an external
> +             * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
> +             * default action for this kind of discard is to pass the discard,
> +             * which will ideally result in an actually smaller image file, as
> +             * is probably desired. */
> +            ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
> +                                         MIN(sector_step,
> +                                             bs->total_sectors - start_sector),
> +                                         QCOW2_DISCARD_SNAPSHOT, true);
> +            if (ret < 0) {
> +                break;
> +            }
> +        }
> +    } else {
> +        uint64_t min_size_1, min_size_2;
> +        int64_t blob_start;
> +        uint64_t blob_end, real_blob_size, clusters;
> +        uint64_t refblocks, reftable_clusters, l2_tables, l1_clusters;
> +
> +        /* If there are no snapshots, we basically want to create a new empty
> +         * image. This function is however not for creating a new image and
> +         * renaming it so it shadows the existing but rather for emptying an
> +         * image, so do exactly that.

After killing my brain with the math you can't use English syntax as
convoluted as in this sentence. ;-)

> +         * Therefore, the image should be valid at all points in time and may
> +         * at worst leak clusters.
> +         *
> +         * Any valid qcow2 image requires an L1 table which is large enough to
> +         * cover all of the guest cluster range, therefore it is impossible to
> +         * simply drop the L1 table (make its size 0) and create a minimal
> +         * refcount structure.
> +         *
> +         * Instead, allocate a blob of data which is large enough to hold a new
> +         * refcount structure (refcount table and all blocks) covering the whole
> +         * image until the end of that blob, and an empty L1 table covering the
> +         * whole guest cluster range. Then these structures are initialized and
> +         * the image header is modified to point to them.
> +         *
> +         * Generally, this blob will be allocated at the end of the image (that
> +         * is, its offset will be greater than its size). If that is indeed the
> +         * case, the same operation is repeated, but this time the new blob
> +         * starts right after the image header which will then indeed lead to a
> +         * minimal image. If this is not the case, the image will be nearly
> +         * minimal as well, as long as the underlying protocol supports discard.
> +         *
> +         * Note that this implementation never frees allocated clusters. This is
> +         * because in case of success, the current refcounts are invalid anyway;
> +         * and in case of failure, it would be too cumbersome to keep track of
> +         * all allocated cluster ranges and free them then.
> +         *
> +         * min_size_1 will contain the number of clusters minimally required for
> +         * a blob that starts right after the image header; min_size_2 will
> +         * contain the number of clusters minimally required for the blob which
> +         * can be allocated based on the existing refcount structure.
> +         */
> +
> +        /* Repeat until a blob could be allocated which is large enough to
> +         * contain all structures necessary for describing itself. Allocated
> +         * clusters are not freed, even if they are not suitable, as this would
> +         * result in exactly the same cluster range being returned during the
> +         * retry (which is obviously not desirable). In case of success, the
> +         * current refcounts do not matter anyway; and in case of failure, the
> +         * clusters are only leaked (which can be easily repaired). */
> +        do {
> +            uint64_t fci = s->free_cluster_index;
> +
> +            /* TODO: Optimize, we do not need refblocks for other parts of the
> +             * image than the header and this blob */
> +            min_size_2 = minimal_blob_size(bs->total_sectors, s->cluster_bits,
> +                                           s->cluster_bits - BDRV_SECTOR_BITS,
> +                                           fci);
> +
> +            blob_start = qcow2_alloc_clusters(bs,
> +                                              min_size_2 << s->cluster_bits);
> +            if (blob_start < 0) {
> +                return blob_start;
> +            }
> +
> +            clusters          = (blob_start >> s->cluster_bits) + min_size_2;
> +            refblocks         = DIV_ROUND_UP(clusters, s->cluster_size / 2);
> +            reftable_clusters = DIV_ROUND_UP(refblocks, s->cluster_size / 8);
> +            l2_tables         = DIV_ROUND_UP(bs->total_sectors /
> +                                             s->cluster_sectors,
> +                                             s->cluster_size / 8);
> +            l1_clusters       = DIV_ROUND_UP(l2_tables, s->cluster_size / 8);
> +
> +            real_blob_size = reftable_clusters + refblocks + l1_clusters;
> +        } while (min_size_2 < real_blob_size);
> +
> +        ret = create_refcount_l1(bs, blob_start, refblocks);
>          if (ret < 0) {
> -            break;
> +            return ret;
> +        }
> +
> +        /* The only overhead is the image header */
> +        min_size_1 = minimal_blob_size(bs->total_sectors, s->cluster_bits,
> +                                       s->cluster_bits - BDRV_SECTOR_BITS, 1);
> +
> +        /* If we cannot create a new blob before the current one, just discard
> +         * the sectors in between and return. Even if the discard does nothing,
> +         * only up to min_size_1 clusters plus the refcount blocks for those
> +         * are unused. The worst case is therefore an image of double the size
> +         * it needs to be, which is not too bad. */
> +        if ((blob_start >> s->cluster_bits) < 1 + min_size_1) {
> +            uint64_t sector, end;
> +            int step = INT_MAX / BDRV_SECTOR_SIZE;
> +
> +            end = blob_start >> (s->cluster_bits - BDRV_SECTOR_SIZE);
> +
> +            /* skip the image header */
> +            for (sector = s->cluster_sectors; sector < end; sector += step) {
> +                ret = bdrv_discard(bs->file, sector, MIN(step, end - sector));
> +                if (ret < 0) {
> +                    return ret;
> +                }
> +            }
> +
> +            blob_end = (blob_start + real_blob_size) << s->cluster_bits;
> +        } else {
> +            clusters          = 1 + min_size_1;
> +            refblocks         = DIV_ROUND_UP(clusters, s->cluster_size / 2);
> +            reftable_clusters = DIV_ROUND_UP(refblocks, s->cluster_size / 8);
> +            l2_tables         = DIV_ROUND_UP(bs->total_sectors /
> +                                             s->cluster_sectors,
> +                                             s->cluster_size / 8);
> +            l1_clusters       = DIV_ROUND_UP(l2_tables, s->cluster_size / 8);
> +
> +            real_blob_size = reftable_clusters + refblocks + l1_clusters;
> +
> +            assert(min_size_1 >= real_blob_size);
> +
> +            ret = create_refcount_l1(bs, s->cluster_size, refblocks);
> +            if (ret < 0) {
> +                return ret;
> +            }
> +
> +            blob_end = (1 + real_blob_size) << s->cluster_bits;
>          }
> +
> +        ret = bdrv_truncate(bs->file, blob_end);
>      }

This may or may not work. I don't know. It's most certainly too much
magic for a corner case function that will need an awful lot of testing
to give us some confidence.

If you still want to convince me, you need to do more to prove it's
correct, like improve the explanation of your formula or try to get a
Reviewed-by from Laszlo... Or you can hope for Stefan's week, but I'm
not sure if he'll be much happier with reviewing this. :-)

Kevin

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v11 03/14] qcow2: Optimize bdrv_make_empty()
  2014-08-21 14:31   ` Kevin Wolf
@ 2014-08-22 13:59     ` Max Reitz
  2014-08-22 15:04       ` Kevin Wolf
  0 siblings, 1 reply; 18+ messages in thread
From: Max Reitz @ 2014-08-22 13:59 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, Stefan Hajnoczi

On 21.08.2014 16:31, Kevin Wolf wrote:
> On 20.08.2014 at 20:17, Max Reitz wrote:
>> bdrv_make_empty() is currently only called if the current image
>> represents an external snapshot that has been committed to its base
>> image; it is therefore unlikely to have internal snapshots. In this
>> case, bdrv_make_empty() can be greatly sped up by creating an empty L1
>> table and dropping all data clusters at once by recreating the refcount
>> structure accordingly instead of normally discarding all clusters.
>>
>> If there are snapshots, fall back to the simple implementation (discard
>> all clusters).
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> Reviewed-by: Eric Blake <eblake@redhat.com>
>> ---
>>   block/qcow2.c | 389 +++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>   1 file changed, 374 insertions(+), 15 deletions(-)
>>
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index eca7656..fa8ca66 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -2153,27 +2153,386 @@ fail:
>>       return ret;
>>   }
>>   
>> +/*
>> + * Creates a reftable pointing to refblocks following right afterwards and an
>> + * empty L1 table at the given @offset. @refblocks is the number of refblocks
>> + * to create.
>> + *
>> + * This combination of structures (reftable+refblocks+L1) is here called a
>> + * "blob".
>> + */
>> +static int create_refcount_l1(BlockDriverState *bs, uint64_t offset,
>> +                              uint64_t refblocks)
>> +{
>> +    BDRVQcowState *s = bs->opaque;
>> +    uint64_t *reftable = NULL;
>> +    uint16_t *refblock = NULL;
>> +    uint64_t reftable_clusters;
>> +    uint64_t rbi;
>> +    uint64_t blob_start, blob_end;
>> +    uint64_t l2_tables, l1_clusters, l1_size2;
>> +    uint8_t l1_size_and_offset[12];
>> +    uint64_t rt_offset;
>> +    int ret, i;
>> +
>> +    reftable_clusters = DIV_ROUND_UP(refblocks, s->cluster_size / 8);
>> +    l2_tables = DIV_ROUND_UP(bs->total_sectors / s->cluster_sectors,
>> +                             s->cluster_size / 8);
>> +    l1_clusters = DIV_ROUND_UP(l2_tables, s->cluster_size / 8);
>> +    l1_size2 = l1_clusters << s->cluster_bits;
>> +
>> +    blob_start = offset;
>> +    blob_end = offset + ((reftable_clusters + refblocks + l1_clusters) <<
>> +                s->cluster_bits);
>> +
>> +    ret = qcow2_pre_write_overlap_check(bs, 0, blob_start,
>> +                                        blob_end - blob_start);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    /* Create the reftable pointing with entries pointing consecutively to the
>> +     * following clusters */
>> +    reftable = g_malloc0_n(reftable_clusters, s->cluster_size);
>> +
>> +    for (rbi = 0; rbi < refblocks; rbi++) {
>> +        reftable[rbi] = cpu_to_be64(offset + ((reftable_clusters + rbi) <<
>> +                        s->cluster_bits));
>> +    }
>> +
>> +    ret = bdrv_write(bs->file, offset >> BDRV_SECTOR_BITS, (uint8_t *)reftable,
>> +                     reftable_clusters * s->cluster_sectors);
>> +    if (ret < 0) {
>> +        goto out;
>> +    }
>> +
>> +    offset += reftable_clusters << s->cluster_bits;
>> +
>> +    /* Keep the reftable, as we will need it for the BDRVQcowState anyway */
>> +
>> +    /* Now, create all the refblocks */
>> +    refblock = g_malloc(s->cluster_size);
>> +
>> +    for (rbi = 0; rbi < refblocks; rbi++) {
>> +        for (i = 0; i < s->cluster_size / 2; i++) {
>> +            uint64_t cluster_offset = ((rbi << (s->cluster_bits - 1)) + i)
>> +                                      << s->cluster_bits;
>> +
>> +            /* Only 0 and 1 are possible as refcounts here */
>> +            refblock[i] = cpu_to_be16(!cluster_offset ||
>> +                                      (cluster_offset >= blob_start &&
>> +                                       cluster_offset <  blob_end));
>> +        }
>> +
>> +        ret = bdrv_write(bs->file, offset >> BDRV_SECTOR_BITS,
>> +                         (uint8_t *)refblock, s->cluster_sectors);
>> +        if (ret < 0) {
>> +            goto out;
>> +        }
>> +
>> +        offset += s->cluster_size;
>> +    }
>> +
>> +    g_free(refblock);
>> +    refblock = NULL;
>> +
>> +    /* The L1 table is very simple */
>> +    ret = bdrv_write_zeroes(bs->file, offset >> BDRV_SECTOR_BITS,
>> +                            l1_clusters * s->cluster_sectors,
>> +                            BDRV_REQ_MAY_UNMAP);
> I wouldn't discard the L1 table. The next write operation will need it
> again.

Okay.

>> +    if (ret < 0) {
>> +        goto out;
>> +    }
>> +
>> +    /* Now make sure all changes are stable and clear all caches */
>> +    bdrv_flush(bs);
> bdrv_flush() can fail.

Oh, right.

>> +    /* This is probably not really necessary, but it cannot hurt, either */
>> +    ret = qcow2_mark_clean(bs);
>> +    if (ret < 0) {
>> +        goto out;
>> +    }
>> +
>> +    ret = qcow2_cache_empty(bs, s->l2_table_cache);
>> +    if (ret < 0) {
>> +        goto out;
>> +    }
>> +
>> +    ret = qcow2_cache_empty(bs, s->refcount_block_cache);
>> +    if (ret < 0) {
>> +        goto out;
>> +    }
>> +
>> +    /* Modify the image header to point to the new blank L1 table. This will
>> +     * leak all currently existing data clusters, which is fine. */
>> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_UPDATE);
>> +
>> +    assert(l1_size2 / 8 <= UINT32_MAX);
>> +    cpu_to_be32w((uint32_t *)&l1_size_and_offset[0], l1_size2 / 8);
>> +    cpu_to_be64w((uint64_t *)&l1_size_and_offset[4], offset);
>> +    ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, l1_size),
>> +                           l1_size_and_offset, sizeof(l1_size_and_offset));
>> +    if (ret < 0) {
>> +        goto out;
>> +    }
>> +
>> +    /* Adapt the in-memory L1 table accordingly */
>> +    s->l1_table_offset = offset;
>> +    s->l1_size         = l1_size2 / 8;
>> +
>> +    s->l1_table = g_realloc(s->l1_table, l1_size2);
> s->l1_table is allocated with qemu_try_blockalign(). You should probably
> use another qemu_try_blockalign() + qemu_vfree(old) here.

Right. I missed that during the rebase (de82815d).

>> +    memset(s->l1_table, 0, l1_size2);
>> +
>> +    /* Modify the image header to point to the refcount table. This will fix all
>> +     * leaks (unless an error occurs). */
>> +    BLKDBG_EVENT(bs->file, BLKDBG_REFTABLE_UPDATE);
>> +
>> +    rt_offset = cpu_to_be64(blob_start);
>> +    ret = bdrv_pwrite_sync(bs->file,
>> +                           offsetof(QCowHeader, refcount_table_offset),
>> +                           &rt_offset, sizeof(rt_offset));
> Don't you need to update refcount_table_clusters for the case that the
> blob hits the boundary where the table needs a new cluster?

As I'm updating it in the QcowState, you're most probably right.
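
Something like this should do, I suppose (untested; it relies on
refcount_table_clusters directly following refcount_table_offset in
QCowHeader):

    struct {
        uint64_t reftable_offset;
        uint32_t reftable_clusters;
    } QEMU_PACKED rt_update = {
        .reftable_offset   = cpu_to_be64(blob_start),
        .reftable_clusters = cpu_to_be32(reftable_clusters),
    };

    ret = bdrv_pwrite_sync(bs->file,
                           offsetof(QCowHeader, refcount_table_offset),
                           &rt_update, sizeof(rt_update));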

>> +    if (ret < 0) {
>> +        goto out;
>> +    }
>> +
>> +    /* The image is now clean. The only thing left is to adapt the in-memory
>> +     * refcount table to match it. */
>> +    s->refcount_table_offset = blob_start;
>> +    s->refcount_table_size   = reftable_clusters << (s->cluster_bits - 3);
>> +
>> +    for (rbi = 0; rbi < refblocks; rbi++) {
>> +        be64_to_cpus(&reftable[rbi]);
>> +    }
>> +
>> +    g_free(s->refcount_table);
>> +    s->refcount_table = reftable;
>> +    reftable = NULL;
>> +
>> +out:
>> +    g_free(refblock);
>> +    g_free(reftable);
>> +    return ret;
>> +}
>> +
>> +/*
>> + * Calculates the number of clusters required for an L1 table for an image with
>> + * the given parameters plus the reftable and the refblocks required to cover
>> + * themselves, the L1 table and a given number of clusters @overhead.
>> + *
>> + * @ts:       Total number of guest sectors the image provides
>> + * @cb:       1 << @cb is the cluster size in bytes
>> + * @spcb:     1 << @spcb is the number of clusters per sector
> Sectors per cluster, I guess.

Oops *g*

Now I can see why you were confused. *cough*

>> + * @overhead: Number of clusters which shall additionally be covered by the
>> + *            refcount structures
>> + */
>> +static uint64_t minimal_blob_size(uint64_t ts, int cb, int spcb,
>> +                                  uint64_t overhead)
>> +{
>> +    uint64_t cs  = UINT64_C(1) << cb;
>> +    uint64_t spc = UINT64_C(1) << spcb;
>> +
>> +    /* The following statement is a solution for this system of equations:
>> +     *
>> +     * Let req_cls be the required number of clusters required for the reftable,
>> +     * all refblocks and the L1 table.
>> +     *
>> +     * rtc be the clusters required for the reftable, rbc the clusters for all
>> +     * refblocks (i.e., the number of refblocks), l1c the clusters for the L1
>> +     * table and l2c the clusters for all L2 tables (i.e., the number of L2
>> +     * tables).
>> +     *
>> +     * cs be the cluster size (in bytes), ts the total number of sectors, spc
>> +     * the number of sectors per cluster and tc the total number of clusters.
>> +     *
>> +     * overhead is a number of clusters which should additionally be covered by
>> +     * the refcount structures (i.e. all clusters before this blob of req_cls
>> +     * clusters).
>> +     *
>> +     * req_cls >= rtc + rbc + l1c
>> +     *   -- should be obvious
> The >= isn't, I would have expected =.

Leaking some clusters isn't too bad.


>> +     * rbc      = ceil((overhead + req_cls) / (cs / 2))
>> +     *   -- as each refblock holds cs/2 entries
> Hopefully we'll never implement refcount_order != 4...

Actually, I thought about doing that just to have some fun. :-P

>> +     * rtc      = ceil(rbc                  / (cs / 8))
>> +     *   -- as each reftable cluster holds cs/8 entries
>> +     * tc       = ceil(ts / spc)
>> +     *   -- should be obvious as well
>> +     * l2c      = ceil(tc  / (cs / 8))
>> +     *   -- as each L2 table holds cs/8 entries
>> +     * l1c      = ceil(l2c / (cs / 8))
>> +     *   -- as each L1 table cluster holds cs/8 entries
>> +     *
>> +     * The following equation yields a result which satisfies the constraint.
>> +     * The result may be too high, but is never too low.
>> +     *
>> +     * The original calculation (without bitshifting) was:
>> +     *
>> +     * DIV_ROUND_UP(overhead * (1 + cs / 8) +
>> +     *              3 * cs * cs / 16 - 5 * cs / 8 - 1 +
>> +     *              4 * (ts + spc * cs / 8 - 1) / spc,
>> +     *              cs * cs / 16 - cs / 8 - 1)
> Should be obvious?

I had a PDF in the original version of this patch: 
https://lists.nongnu.org/archive/html/qemu-devel/2014-04/pdf5DmYjL5zXy.pdf

> I think line 3 is for calculating l1c. That could easily be separated
> out because it doesn't depend on any of the other variables.
>
> After doing some maths I'm relatively close to your formula, but I still
> have 8 * cs instead of 5 * cs in the second line, and I also don't know
> where you got the -1 from.

The -1 terms are there because we only have truncating division. As you 
can see in the PDF, I converted ceil(x / y) to floor((x + y - 1) / y). We 
could use floats and ceil(), but that might cause precision problems.
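
For example, ceil(7 / 4) = 2 = floor((7 + 4 - 1) / 4), which is exactly 
what DIV_ROUND_UP does (if I remember the definition in osdep.h 
correctly):

    /* integer ceiling division: ceil(n / d) == floor((n + d - 1) / d) */
    #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))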

> And this is the reason why I asked for a method without precalculating
> these sizes...

Either we do this, or we have something slightly less complicated for the 
preallocation series and still a lot of (but (much?) less complicated) 
code here.

As I said in my response to your response on v8, I still have 
rudimentary code for what you proposed, but doing it right actually 
seemed too complicated to me compared to just going on with this.

>> +     *
>> +     */
>> +
>> +    return DIV_ROUND_UP(overhead + (overhead << (cb - 3)) +
>> +                        (3 << (2 * cb - 4)) - (5 << (cb - 3)) - 1 +
>> +                        (4 * (ts + (spc << (cb - 3)) - 1) >> spcb),
>> +                        (1 << (2 * cb - 4)) - cs / 8 - 1);
>> +}
>> +
>>   static int qcow2_make_empty(BlockDriverState *bs)
>>   {
>> +    BDRVQcowState *s = bs->opaque;
>>       int ret = 0;
>> -    uint64_t start_sector;
>> -    int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
>>   
>> -    for (start_sector = 0; start_sector < bs->total_sectors;
>> -         start_sector += sector_step)
>> -    {
>> -        /* As this function is generally used after committing an external
>> -         * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
>> -         * default action for this kind of discard is to pass the discard,
>> -         * which will ideally result in an actually smaller image file, as
>> -         * is probably desired. */
>> -        ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
>> -                                     MIN(sector_step,
>> -                                         bs->total_sectors - start_sector),
>> -                                     QCOW2_DISCARD_SNAPSHOT, true);
>> +    if (s->snapshots) {
>> +        uint64_t start_sector;
>> +        int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
>> +
>> +        /* If there are snapshots, every active cluster has to be discarded */
>> +
>> +        for (start_sector = 0; start_sector < bs->total_sectors;
>> +             start_sector += sector_step)
>> +        {
>> +            /* As this function is generally used after committing an external
>> +             * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
>> +             * default action for this kind of discard is to pass the discard,
>> +             * which will ideally result in an actually smaller image file, as
>> +             * is probably desired. */
>> +            ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
>> +                                         MIN(sector_step,
>> +                                             bs->total_sectors - start_sector),
>> +                                         QCOW2_DISCARD_SNAPSHOT, true);
>> +            if (ret < 0) {
>> +                break;
>> +            }
>> +        }
>> +    } else {
>> +        uint64_t min_size_1, min_size_2;
>> +        int64_t blob_start;
>> +        uint64_t blob_end, real_blob_size, clusters;
>> +        uint64_t refblocks, reftable_clusters, l2_tables, l1_clusters;
>> +
>> +        /* If there are no snapshots, we basically want to create a new empty
>> +         * image. This function is however not for creating a new image and
>> +         * renaming it so it shadows the existing but rather for emptying an
>> +         * image, so do exactly that.
> After the math has killed my brain, you can't use English syntax as
> convoluted as in this sentence. ;-)

Hm *g*

>> +         * Therefore, the image should be valid at all points in time and may
>> +         * at worst leak clusters.
>> +         *
>> +         * Any valid qcow2 image requires an L1 table which is large enough to
>> +         * cover all of the guest cluster range, therefore it is impossible to
>> +         * simply drop the L1 table (make its size 0) and create a minimal
>> +         * refcount structure.
>> +         *
>> +         * Instead, allocate a blob of data which is large enough to hold a new
>> +         * refcount structure (refcount table and all blocks) covering the whole
>> +         * image until the end of that blob, and an empty L1 table covering the
>> +         * whole guest cluster range. Then these structures are initialized and
>> +         * the image header is modified to point to them.
>> +         *
>> +         * Generally, this blob will be allocated at the end of the image (that
>> +         * is, its offset will be greater than its size). If that is indeed the
>> +         * case, the same operation is repeated, but this time the new blob
>> +         * starts right after the image header which will then indeed lead to a
>> +         * minimal image. If this is not the case, the image will be nearly
>> +         * minimal as well, as long as the underlying protocol supports discard.
>> +         *
>> +         * Note that this implementation never frees allocated clusters. This is
>> +         * because in case of success, the current refcounts are invalid anyway;
>> +         * and in case of failure, it would be too cumbersome to keep track of
>> +         * all allocated cluster ranges and free them then.
>> +         *
>> +         * min_size_1 will contain the number of clusters minimally required for
>> +         * a blob that starts right after the image header; min_size_2 will
>> +         * contain the number of clusters minimally required for the blob which
>> +         * can be allocated based on the existing refcount structure.
>> +         */
>> +
>> +        /* Repeat until a blob could be allocated which is large enough to
>> +         * contain all structures necessary for describing itself. Allocated
>> +         * clusters are not freed, even if they are not suitable, as this would
>> +         * result in exactly the same cluster range being returned during the
>> +         * retry (which is obviously not desirable). In case of success, the
>> +         * current refcounts do not matter anyway; and in case of failure, the
>> +         * clusters are only leaked (which can be easily repaired). */
>> +        do {
>> +            uint64_t fci = s->free_cluster_index;
>> +
>> +            /* TODO: Optimize, we do not need refblocks for other parts of the
>> +             * image than the header and this blob */
>> +            min_size_2 = minimal_blob_size(bs->total_sectors, s->cluster_bits,
>> +                                           s->cluster_bits - BDRV_SECTOR_BITS,
>> +                                           fci);
>> +
>> +            blob_start = qcow2_alloc_clusters(bs,
>> +                                              min_size_2 << s->cluster_bits);
>> +            if (blob_start < 0) {
>> +                return blob_start;
>> +            }
>> +
>> +            clusters          = (blob_start >> s->cluster_bits) + min_size_2;
>> +            refblocks         = DIV_ROUND_UP(clusters, s->cluster_size / 2);
>> +            reftable_clusters = DIV_ROUND_UP(refblocks, s->cluster_size / 8);
>> +            l2_tables         = DIV_ROUND_UP(bs->total_sectors /
>> +                                             s->cluster_sectors,
>> +                                             s->cluster_size / 8);
>> +            l1_clusters       = DIV_ROUND_UP(l2_tables, s->cluster_size / 8);
>> +
>> +            real_blob_size = reftable_clusters + refblocks + l1_clusters;
>> +        } while (min_size_2 < real_blob_size);
>> +
>> +        ret = create_refcount_l1(bs, blob_start, refblocks);
>>           if (ret < 0) {
>> -            break;
>> +            return ret;
>> +        }
>> +
>> +        /* The only overhead is the image header */
>> +        min_size_1 = minimal_blob_size(bs->total_sectors, s->cluster_bits,
>> +                                       s->cluster_bits - BDRV_SECTOR_BITS, 1);
>> +
>> +        /* If we cannot create a new blob before the current one, just discard
>> +         * the sectors in between and return. Even if the discard does nothing,
>> +         * only up to min_size_1 clusters plus the refcount blocks for those
>> +         * are unused. The worst case is therefore an image of double the size
>> +         * it needs to be, which is not too bad. */
>> +        if ((blob_start >> s->cluster_bits) < 1 + min_size_1) {
>> +            uint64_t sector, end;
>> +            int step = INT_MAX / BDRV_SECTOR_SIZE;
>> +
>> +            end = blob_start >> (s->cluster_bits - BDRV_SECTOR_SIZE);
>> +
>> +            /* skip the image header */
>> +            for (sector = s->cluster_sectors; sector < end; sector += step) {
>> +                ret = bdrv_discard(bs->file, sector, MIN(step, end - sector));
>> +                if (ret < 0) {
>> +                    return ret;
>> +                }
>> +            }
>> +
>> +            blob_end = (blob_start + real_blob_size) << s->cluster_bits;
>> +        } else {
>> +            clusters          = 1 + min_size_1;
>> +            refblocks         = DIV_ROUND_UP(clusters, s->cluster_size / 2);
>> +            reftable_clusters = DIV_ROUND_UP(refblocks, s->cluster_size / 8);
>> +            l2_tables         = DIV_ROUND_UP(bs->total_sectors /
>> +                                             s->cluster_sectors,
>> +                                             s->cluster_size / 8);
>> +            l1_clusters       = DIV_ROUND_UP(l2_tables, s->cluster_size / 8);
>> +
>> +            real_blob_size = reftable_clusters + refblocks + l1_clusters;
>> +
>> +            assert(min_size_1 >= real_blob_size);
>> +
>> +            ret = create_refcount_l1(bs, s->cluster_size, refblocks);
>> +            if (ret < 0) {
>> +                return ret;
>> +            }
>> +
>> +            blob_end = (1 + real_blob_size) << s->cluster_bits;
>>           }
>> +
>> +        ret = bdrv_truncate(bs->file, blob_end);
>>       }
> This may or may not work. I don't know. It's most certainly too much
> magic for a corner case function that will need an awful lot of testing
> to give us some confidence.

I just thought about it and reflected about why I did not pursue your 
idea. You basically suggested to mark the image dirty, overwrite the 
first few clusters with zero (one cluster for the reftable, one 
refblock, the rest for the L1 table), reset reftable and L1 table offset 
and size, allocate those clusters and remove the dirty flag.

This did not work because if qemu crashed between overwriting the first 
few clusters and having set up the new (empty) refcount structure, the 
image could not be repaired because the repair function could not 
allocate a refblock without overwriting the image header. However, with 
the patches for qcow2's repair function I sent just recently, this is 
not a problem any longer and the image can in fact be repaired.

So... I could say thanks for reminding me, and sorry that you had to read 
this patch; though Eric found it great fun. ;-)

On the other hand, this doesn't solve the preallocation problem. In Hu 
Tao's original version he had a function which calculated the size of a 
self-refcounting blob as well, but his function was (obviously) wrong. 
As far as I remember, we do need such a function there, and tuning it to 
its purpose there would make it only slightly less complicated. I guess 
I'll take a new look at his series and see whether we can do without it.

> If you still want to convince me, you need to do more to prove it's
> correct, like improve the explanation of your formula

I'm fine with writing LaTeX into the code. :-P

> or try to get a
> Reviewed-by from Laszlo... Or you can hope for Stefan's week, but I'm
> not sure if he'll be much happier with reviewing this. :-)

Now imagine how much fun I had writing this. ;-)

Max


* Re: [Qemu-devel] [PATCH v11 03/14] qcow2: Optimize bdrv_make_empty()
  2014-08-22 13:59     ` Max Reitz
@ 2014-08-22 15:04       ` Kevin Wolf
  0 siblings, 0 replies; 18+ messages in thread
From: Kevin Wolf @ 2014-08-22 15:04 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-devel, Stefan Hajnoczi

On 22.08.2014 at 15:59, Max Reitz wrote:
> On 21.08.2014 16:31, Kevin Wolf wrote:
> >On 20.08.2014 at 20:17, Max Reitz wrote:
> >>+/*
> >>+ * Calculates the number of clusters required for an L1 table for an image with
> >>+ * the given parameters plus the reftable and the refblocks required to cover
> >>+ * themselves, the L1 table and a given number of clusters @overhead.
> >>+ *
> >>+ * @ts:       Total number of guest sectors the image provides
> >>+ * @cb:       1 << @cb is the cluster size in bytes
> >>+ * @spcb:     1 << @spcb is the number of clusters per sector
> >Sectors per cluster, I guess.
> 
> Oops *g*
> 
> Now I can see why you were confused. *cough*
> 
> >>+ * @overhead: Number of clusters which shall additionally be covered by the
> >>+ *            refcount structures
> >>+ */
> >>+static uint64_t minimal_blob_size(uint64_t ts, int cb, int spcb,
> >>+                                  uint64_t overhead)
> >>+{
> >>+    uint64_t cs  = UINT64_C(1) << cb;
> >>+    uint64_t spc = UINT64_C(1) << spcb;
> >>+
> >>+    /* The following statement is a solution for this system of equations:
> >>+     *
> >>+     * Let req_cls be the required number of clusters required for the reftable,
> >>+     * all refblocks and the L1 table.
> >>+     *
> >>+     * rtc be the clusters required for the reftable, rbc the clusters for all
> >>+     * refblocks (i.e., the number of refblocks), l1c the clusters for the L1
> >>+     * table and l2c the clusters for all L2 tables (i.e., the number of L2
> >>+     * tables).
> >>+     *
> >>+     * cs be the cluster size (in bytes), ts the total number of sectors, spc
> >>+     * the number of sectors per cluster and tc the total number of clusters.
> >>+     *
> >>+     * overhead is a number of clusters which should additionally be covered by
> >>+     * the refcount structures (i.e. all clusters before this blob of req_cls
> >>+     * clusters).
> >>+     *
> >>+     * req_cls >= rtc + rbc + l1c
> >>+     *   -- should be obvious
> >The >= isn't, I would have expected =.
> 
> Leaking some clusters isn't too bad.

Fine with me. But you defined it as the "required number of clusters
required" (what?), which sounds like something smaller than the req_cls
that you calculate here.

Yes, nit-picking, but that's the part of this function that I still
understand, so let me make some use of it.

> >>+     * rbc      = ceil((overhead + req_cls) / (cs / 2))
> >>+     *   -- as each refblock holds cs/2 entries
> >Hopefully we'll never implement refcount_order != 4...
> 
> Actually, I thought about doing that just to have some fun. :-P

Sure, go ahead. You might just include it as a variable in this formula
now rather than revisiting all of this in three weeks.
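
Something like this, I mean (just an illustration; there is no
refcount_order field anywhere yet, so take it as a placeholder):

    /* refcount entries are (1 << refcount_order) bits wide, so one
     * refblock holds this many of them; cs / 2 is the special case for
     * refcount_order == 4 */
    uint64_t rb_entries = (cs * 8) >> refcount_order;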

> >>+     * rtc      = ceil(rbc                  / (cs / 8))
> >>+     *   -- as each reftable cluster holds cs/8 entries
> >>+     * tc       = ceil(ts / spc)
> >>+     *   -- should be obvious as well
> >>+     * l2c      = ceil(tc  / (cs / 8))
> >>+     *   -- as each L2 table holds cs/8 entries
> >>+     * l1c      = ceil(l2c / (cs / 8))
> >>+     *   -- as each L1 table cluster holds cs/8 entries
> >>+     *
> >>+     * The following equation yields a result which satisfies the constraint.
> >>+     * The result may be too high, but is never too low.
> >>+     *
> >>+     * The original calculation (without bitshifting) was:
> >>+     *
> >>+     * DIV_ROUND_UP(overhead * (1 + cs / 8) +
> >>+     *              3 * cs * cs / 16 - 5 * cs / 8 - 1 +
> >>+     *              4 * (ts + spc * cs / 8 - 1) / spc,
> >>+     *              cs * cs / 16 - cs / 8 - 1)
> >Should be obvious?
> 
> I had a PDF in the original version of this patch: https://lists.nongnu.org/archive/html/qemu-devel/2014-04/pdf5DmYjL5zXy.pdf

> >I think line 3 is for calculating l1c. That could easily be separated
> >out because it doesn't depend on any of the other variables.
> >
> >After doing some maths I'm relatively close to your formula, but I still
> >have 8 * cs instead of 5 * cs in the second line, and I also don't know
> >where you got the -1 from.
> 
> The -1 terms are there because we only have truncating division. As
> you can see in the PDF, I converted ceil(x / y) to floor((x + y - 1) /
> y). We could use floats and ceil(), but that might cause precision
> problems.

I see. That was too obvious. (I couldn't be bothered doing this
correctly, so I simply put +1 everywhere.)

> >And this is the reason why I asked for a method without precalculating
> >these sizes...
> 
> Either we do this, or we have something slightly less complicated for
> the preallocation series and still a lot of (but (much?) less
> complicated) code here.
> 
> As I said in my response to your response on v8, I still have
> rudimentary code for what you proposed, but doing it right actually
> seemed too complicated to me compared to just going on with this.
> 
> >>+     *
> >>+     */
> >>+
> >>+    return DIV_ROUND_UP(overhead + (overhead << (cb - 3)) +
> >>+                        (3 << (2 * cb - 4)) - (5 << (cb - 3)) - 1 +
> >>+                        (4 * (ts + (spc << (cb - 3)) - 1) >> spcb),
> >>+                        (1 << (2 * cb - 4)) - cs / 8 - 1);
> >>+}
> >>+

> >This may or may not work. I don't know. It's most certainly too much
> >magic for a corner case function that will need an awful lot of testing
> >to give us some confidence.
> 
> I just thought about it and reflected about why I did not pursue
> your idea. You basically suggested to mark the image dirty,
> overwrite the first few clusters with zero (one cluster for the
> reftable, one refblock, the rest for the L1 table), reset reftable
> and L1 table offset and size, allocate those clusters and remove the
> dirty flag.
> 
> This did not work because if qemu crashed between overwriting the
> first few clusters and having set up the new (empty) refcount
> structure, the image could not be repaired because the repair
> function could not allocate a refblock without overwriting the image
> header. However, with the patches for qcow2's repair function I sent
> just recently, this is not a problem any longer and the image can in
> fact be repaired.
> 
> So... I could say thanks for reminding me, and sorry that you had to
> read this patch; though Eric found it great fun. ;-)

It's okay, don't take my frustrated comments too seriously. We just need
to make sure that you can understand the code without reading a PDF
somewhere in the mailing list archives. Using some C variables instead
of doing everything in a single four-line expression would certainly
help, too. You can do it for the l1c part at least.
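
For the l1c part, something along these lines (untested, names as in your
comment block):

    uint64_t tc  = DIV_ROUND_UP(ts, spc);     /* total guest clusters */
    uint64_t l2c = DIV_ROUND_UP(tc, cs / 8);  /* number of L2 tables */
    uint64_t l1c = DIV_ROUND_UP(l2c, cs / 8); /* clusters for the L1 table */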

> On the other hand, this doesn't solve the preallocation problem. In
> Hu Tao's original version he had a function which calculated the
> size of a self-refcounting blob as well, but his function was
> (obviously) wrong. As far as I remember, we do need such a function
> there, and tuning it to its purpose there would make it only slightly
> less complicated. I guess I'll take a new look at his series and
> see whether we can do without it.

There's another similar calculation in alloc_refcount_block(), even
though that one isn't really trying to get things absolutely minimal. I'm
wondering if we can reuse this function there (even though a smaller
number of refblocks is needed), and whether a simpler iterative
calculation like the one there might not be easier even if it makes the
operation minimally slower.
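
By "iterative" I mean something like this fixed-point loop (untested
sketch; refcount_order == 4 assumed, just as in your formula, and the
function name is only for illustration):

    static uint64_t minimal_blob_size_iter(uint64_t ts, uint64_t cs,
                                           uint64_t spc, uint64_t overhead)
    {
        /* the L1/L2 part only depends on the image size */
        uint64_t l2c = DIV_ROUND_UP(DIV_ROUND_UP(ts, spc), cs / 8);
        uint64_t l1c = DIV_ROUND_UP(l2c, cs / 8);
        uint64_t rbc = 0, rtc = 0;

        for (;;) {
            /* clusters the refblocks have to cover, including themselves */
            uint64_t total = overhead + l1c + rbc + rtc;
            uint64_t new_rbc = DIV_ROUND_UP(total, cs / 2);
            uint64_t new_rtc = DIV_ROUND_UP(new_rbc, cs / 8);

            if (new_rbc == rbc && new_rtc == rtc) {
                return rtc + rbc + l1c;
            }
            rbc = new_rbc;
            rtc = new_rtc;
        }
    }

This converges after a few rounds because every additional refblock
covers another cs / 2 clusters.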

Though I'll freely admit that doing the maths, while causing more
headaches, is also more elegant.

> >If you still want to convince me, you need to do more to prove it's
> >correct, like improve the explanation of your formula
> 
> I'm fine with writing LaTeX into the code. :-P

Actually, if you just write the second line of your PDF into the source
code, I think that might already help a lot with understanding what's
going on.

Kevin


