* [PATCH v2 0/2] qcow2: Improve refcount structure rebuilding
@ 2022-03-29  9:19 Hanna Reitz
  2022-03-29  9:19 ` [PATCH v2 1/2] " Hanna Reitz
  2022-03-29  9:19 ` [PATCH v2 2/2] iotests/108: Test new refcount rebuild algorithm Hanna Reitz
  0 siblings, 2 replies; 7+ messages in thread
From: Hanna Reitz @ 2022-03-29  9:19 UTC (permalink / raw)
  To: qemu-block; +Cc: Kevin Wolf, Hanna Reitz, qemu-devel

Hi,

v1 cover letter:
https://lists.nongnu.org/archive/html/qemu-block/2021-03/msg00651.html

Since it’s been a while since v1, let me reproduce parts from that cover
letter here:


When the qcow2 refcount structure is broken to a point where we cannot
rely on any information from it (because it shows clusters as free that
are in use), “qemu-img check -r all” completely rewrites it.

The new reftable is preferably written into the area covered by the last
refblock for the image, but if that refblock is empty (e.g. because the
image is on a block device and there is just nothing near the end of the
block device), then the reftable will be put after the image’s end.
Which is a problem on block devices, because they can’t easily be
resized (also, resizing wouldn’t really help in this case, because the
reftable would still be written past the new end).


Effectively, this means you can’t run `qemu-img check -r all` on qcow2
images that are on block devices when there are clusters that are in
use, but not marked as allocated.
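
For illustration, here is a rough sketch of that failure mode (the device
path and the corruption step are hypothetical; iotest 108 in patch 2
exercises this properly):

    dev=$(sudo -n losetup --show -f "$backing_file")  # fixed-size block device
    # ... create a qcow2 image with some data on $dev, then corrupt its
    # refcount structures, e.g. by marking an in-use cluster as free ...
    qemu-img check -r all "$dev"

Without this series, the rebuilt reftable may end up past the device's end,
so the final write fails; the device cannot simply be grown to make room.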


I recommend reviewing patch 1 not based on the diff, but by applying it
and reviewing the rebuild_refcount_structure() function (and its new
rebuild_refcounts_write_refblocks() helper).


v2:
- Dropped patches 2 and 3 (not really necessary with QSD --daemonize);
  they used to add QSD support to _launch_qemu
- Patch 1: Functionally the same (I hope), but I have added extensive
  comments and put the `for ()` loop into its own dedicated function so
  that we can get rid of the backwards-jumping goto
- Patch 2: Changed so that it no longer needs QSD support in _launch_qemu


git backport-diff against v1:

Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/2:[0370] [FC] 'qcow2: Improve refcount structure rebuilding'
002/2:[0047] [FC] 'iotests/108: Test new refcount rebuild algorithm'


Hanna Reitz (2):
  qcow2: Improve refcount structure rebuilding
  iotests/108: Test new refcount rebuild algorithm

 block/qcow2-refcount.c     | 332 ++++++++++++++++++++++++++-----------
 tests/qemu-iotests/108     | 259 ++++++++++++++++++++++++++++-
 tests/qemu-iotests/108.out |  81 +++++++++
 3 files changed, 574 insertions(+), 98 deletions(-)

-- 
2.35.1




* [PATCH v2 1/2] qcow2: Improve refcount structure rebuilding
  2022-03-29  9:19 [PATCH v2 0/2] qcow2: Improve refcount structure rebuilding Hanna Reitz
@ 2022-03-29  9:19 ` Hanna Reitz
  2022-03-30 14:32   ` Eric Blake
  2022-03-29  9:19 ` [PATCH v2 2/2] iotests/108: Test new refcount rebuild algorithm Hanna Reitz
  1 sibling, 1 reply; 7+ messages in thread
From: Hanna Reitz @ 2022-03-29  9:19 UTC (permalink / raw)
  To: qemu-block; +Cc: Kevin Wolf, Hanna Reitz, qemu-devel

When rebuilding the refcount structures (when qemu-img check -r found
errors with refcount = 0, but reference count > 0), the new refcount
table defaults to being put at the image file end[1].  There is no good
reason for that except that it means we will not have to rewrite any
refblocks we already wrote to disk.

Changing the code to rewrite those refblocks is not too difficult,
though, so let us do that.  That is beneficial for images on block
devices, where we cannot really write beyond the end of the image file.

Use this opportunity to add extensive comments to the code, and refactor
it a bit, getting rid of the backwards-jumping goto.

[1] Unless there is something allocated in the area pointed to by the
    last refblock, so we have to write that refblock.  In that case, we
    try to put the reftable in there.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1519071
Closes: https://gitlab.com/qemu-project/qemu/-/issues/941
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
---
 block/qcow2-refcount.c | 332 +++++++++++++++++++++++++++++------------
 1 file changed, 235 insertions(+), 97 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index b91499410c..bd53c58500 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -2438,111 +2438,140 @@ static int64_t alloc_clusters_imrt(BlockDriverState *bs,
 }
 
 /*
- * Creates a new refcount structure based solely on the in-memory information
- * given through *refcount_table. All necessary allocations will be reflected
- * in that array.
+ * Helper function for rebuild_refcount_structure().
  *
- * On success, the old refcount structure is leaked (it will be covered by the
- * new refcount structure).
+ * Scan the range of clusters [first_cluster, end_cluster) for allocated
+ * clusters and write all corresponding refblocks to disk.  The refblock
+ * and allocation data is taken from the in-memory refcount table
+ * *refcount_table[] (of size *nb_clusters), which is basically one big
+ * (unlimited size) refblock for the whole image.
+ *
+ * For these refblocks, clusters are allocated using said in-memory
+ * refcount table.  Care is taken that these allocations are reflected
+ * in the refblocks written to disk.
+ *
+ * The refblocks' offsets are written into a reftable, which is
+ * *on_disk_reftable_ptr[] (of size *on_disk_reftable_entries_ptr).  If
+ * that reftable is of insufficient size, it will be resized to fit.
+ * This reftable is not written to disk.
+ *
+ * (If *on_disk_reftable_ptr is not NULL, the entries within are assumed
+ * to point to existing valid refblocks that do not need to be allocated
+ * again.)
+ *
+ * Return whether the on-disk reftable array was resized (true/false),
+ * or -errno on error.
  */
-static int rebuild_refcount_structure(BlockDriverState *bs,
-                                      BdrvCheckResult *res,
-                                      void **refcount_table,
-                                      int64_t *nb_clusters)
+static int rebuild_refcounts_write_refblocks(
+        BlockDriverState *bs, void **refcount_table, int64_t *nb_clusters,
+        int64_t first_cluster, int64_t end_cluster,
+        uint64_t **on_disk_reftable_ptr, uint32_t *on_disk_reftable_entries_ptr
+    )
 {
     BDRVQcow2State *s = bs->opaque;
-    int64_t first_free_cluster = 0, reftable_offset = -1, cluster = 0;
+    int64_t cluster;
     int64_t refblock_offset, refblock_start, refblock_index;
-    uint32_t reftable_size = 0;
-    uint64_t *on_disk_reftable = NULL;
+    int64_t first_free_cluster = 0;
+    uint64_t *on_disk_reftable = *on_disk_reftable_ptr;
+    uint32_t on_disk_reftable_entries = *on_disk_reftable_entries_ptr;
     void *on_disk_refblock;
-    int ret = 0;
-    struct {
-        uint64_t reftable_offset;
-        uint32_t reftable_clusters;
-    } QEMU_PACKED reftable_offset_and_clusters;
-
-    qcow2_cache_empty(bs, s->refcount_block_cache);
+    bool reftable_grown = false;
+    int ret;
 
-write_refblocks:
-    for (; cluster < *nb_clusters; cluster++) {
+    for (cluster = first_cluster; cluster < end_cluster; cluster++) {
+        /* Check all clusters to find refblocks that contain non-zero entries */
         if (!s->get_refcount(*refcount_table, cluster)) {
             continue;
         }
 
+        /*
+         * This cluster is allocated, so we need to create a refblock
+         * for it.  The data we will write to disk is just the
+         * respective slice from *refcount_table, so it will contain
+         * accurate refcounts for all clusters belonging to this
+         * refblock.  After we have written it, we will therefore skip
+         * all remaining clusters in this refblock.
+         */
+
         refblock_index = cluster >> s->refcount_block_bits;
         refblock_start = refblock_index << s->refcount_block_bits;
 
-        /* Don't allocate a cluster in a refblock already written to disk */
-        if (first_free_cluster < refblock_start) {
-            first_free_cluster = refblock_start;
-        }
-        refblock_offset = alloc_clusters_imrt(bs, 1, refcount_table,
-                                              nb_clusters, &first_free_cluster);
-        if (refblock_offset < 0) {
-            fprintf(stderr, "ERROR allocating refblock: %s\n",
-                    strerror(-refblock_offset));
-            res->check_errors++;
-            ret = refblock_offset;
-            goto fail;
-        }
+        if (on_disk_reftable_entries > refblock_index &&
+            on_disk_reftable[refblock_index])
+        {
+            /*
+             * We can get here after a `goto write_refblocks`: We have a
+             * reftable from a previous run, and the refblock is already
+             * allocated.  No need to allocate it again.
+             */
+            refblock_offset = on_disk_reftable[refblock_index];
+        } else {
+            int64_t refblock_cluster_index;
 
-        if (reftable_size <= refblock_index) {
-            uint32_t old_reftable_size = reftable_size;
-            uint64_t *new_on_disk_reftable;
+            /* Don't allocate a cluster in a refblock already written to disk */
+            if (first_free_cluster < refblock_start) {
+                first_free_cluster = refblock_start;
+            }
+            refblock_offset = alloc_clusters_imrt(bs, 1, refcount_table,
+                                                  nb_clusters,
+                                                  &first_free_cluster);
+            if (refblock_offset < 0) {
+                fprintf(stderr, "ERROR allocating refblock: %s\n",
+                        strerror(-refblock_offset));
+                return refblock_offset;
+            }
 
-            reftable_size = ROUND_UP((refblock_index + 1) * REFTABLE_ENTRY_SIZE,
-                                     s->cluster_size) / REFTABLE_ENTRY_SIZE;
-            new_on_disk_reftable = g_try_realloc(on_disk_reftable,
-                                                 reftable_size *
-                                                 REFTABLE_ENTRY_SIZE);
-            if (!new_on_disk_reftable) {
-                res->check_errors++;
-                ret = -ENOMEM;
-                goto fail;
+            refblock_cluster_index = refblock_offset / s->cluster_size;
+            if (refblock_cluster_index >= end_cluster) {
+                /*
+                 * We must write the refblock that holds this refblock's
+                 * refcount
+                 */
+                end_cluster = refblock_cluster_index + 1;
             }
-            on_disk_reftable = new_on_disk_reftable;
 
-            memset(on_disk_reftable + old_reftable_size, 0,
-                   (reftable_size - old_reftable_size) * REFTABLE_ENTRY_SIZE);
+            if (on_disk_reftable_entries <= refblock_index) {
+                on_disk_reftable_entries =
+                    ROUND_UP((refblock_index + 1) * REFTABLE_ENTRY_SIZE,
+                             s->cluster_size) / REFTABLE_ENTRY_SIZE;
+                on_disk_reftable =
+                    g_try_realloc(on_disk_reftable,
+                                  on_disk_reftable_entries *
+                                  REFTABLE_ENTRY_SIZE);
+                if (!on_disk_reftable) {
+                    return -ENOMEM;
+                }
 
-            /* The offset we have for the reftable is now no longer valid;
-             * this will leak that range, but we can easily fix that by running
-             * a leak-fixing check after this rebuild operation */
-            reftable_offset = -1;
-        } else {
-            assert(on_disk_reftable);
-        }
-        on_disk_reftable[refblock_index] = refblock_offset;
+                memset(on_disk_reftable + *on_disk_reftable_entries_ptr, 0,
+                       (on_disk_reftable_entries -
+                        *on_disk_reftable_entries_ptr) *
+                       REFTABLE_ENTRY_SIZE);
 
-        /* If this is apparently the last refblock (for now), try to squeeze the
-         * reftable in */
-        if (refblock_index == (*nb_clusters - 1) >> s->refcount_block_bits &&
-            reftable_offset < 0)
-        {
-            uint64_t reftable_clusters = size_to_clusters(s, reftable_size *
-                                                          REFTABLE_ENTRY_SIZE);
-            reftable_offset = alloc_clusters_imrt(bs, reftable_clusters,
-                                                  refcount_table, nb_clusters,
-                                                  &first_free_cluster);
-            if (reftable_offset < 0) {
-                fprintf(stderr, "ERROR allocating reftable: %s\n",
-                        strerror(-reftable_offset));
-                res->check_errors++;
-                ret = reftable_offset;
-                goto fail;
+                *on_disk_reftable_ptr = on_disk_reftable;
+                *on_disk_reftable_entries_ptr = on_disk_reftable_entries;
+
+                reftable_grown = true;
+            } else {
+                assert(on_disk_reftable);
             }
+            on_disk_reftable[refblock_index] = refblock_offset;
         }
 
+        /* Refblock is allocated, write it to disk */
+
         ret = qcow2_pre_write_overlap_check(bs, 0, refblock_offset,
                                             s->cluster_size, false);
         if (ret < 0) {
             fprintf(stderr, "ERROR writing refblock: %s\n", strerror(-ret));
-            goto fail;
+            return ret;
         }
 
-        /* The size of *refcount_table is always cluster-aligned, therefore the
-         * write operation will not overflow */
+        /*
+         * The refblock is simply a slice of *refcount_table.
+         * Note that the size of *refcount_table is always aligned to
+         * whole clusters, so the write operation will not result in
+         * out-of-bounds accesses.
+         */
         on_disk_refblock = (void *)((char *) *refcount_table +
                                     refblock_index * s->cluster_size);
 
@@ -2550,23 +2579,99 @@ write_refblocks:
                           s->cluster_size);
         if (ret < 0) {
             fprintf(stderr, "ERROR writing refblock: %s\n", strerror(-ret));
-            goto fail;
+            return ret;
         }
 
-        /* Go to the end of this refblock */
+        /* This refblock is done, skip to its end */
         cluster = refblock_start + s->refcount_block_size - 1;
     }
 
-    if (reftable_offset < 0) {
-        uint64_t post_refblock_start, reftable_clusters;
+    return reftable_grown;
+}
+
+/*
+ * Creates a new refcount structure based solely on the in-memory information
+ * given through *refcount_table (this in-memory information is basically just
+ * the concatenation of all refblocks).  All necessary allocations will be
+ * reflected in that array.
+ *
+ * On success, the old refcount structure is leaked (it will be covered by the
+ * new refcount structure).
+ */
+static int rebuild_refcount_structure(BlockDriverState *bs,
+                                      BdrvCheckResult *res,
+                                      void **refcount_table,
+                                      int64_t *nb_clusters)
+{
+    BDRVQcow2State *s = bs->opaque;
+    int64_t reftable_offset = -1;
+    int64_t reftable_length = 0;
+    int64_t reftable_clusters;
+    int64_t refblock_index;
+    uint32_t on_disk_reftable_entries = 0;
+    uint64_t *on_disk_reftable = NULL;
+    int ret = 0;
+    int reftable_size_changed = 0;
+    struct {
+        uint64_t reftable_offset;
+        uint32_t reftable_clusters;
+    } QEMU_PACKED reftable_offset_and_clusters;
+
+    qcow2_cache_empty(bs, s->refcount_block_cache);
+
+    /*
+     * For each refblock containing entries, we try to allocate a
+     * cluster (in the in-memory refcount table) and write its offset
+     * into on_disk_reftable[].  We then write the whole refblock to
+     * disk (as a slice of the in-memory refcount table).
+     * This is done by rebuild_refcounts_write_refblocks().
+     *
+     * Once we have scanned all clusters, we try to find space for the
+     * reftable.  This will dirty the in-memory refcount table (i.e.
+     * make it differ from the refblocks we have already written), so we
+     * need to run rebuild_refcounts_write_refblocks() again for the
+     * range of clusters where the reftable has been allocated.
+     *
+     * This second run might make the reftable grow again, in which case
+     * we will need to allocate another space for it, which is why we
+     * repeat all this until the reftable stops growing.
+     *
+     * (This loop will terminate, because with every cluster the
+     * reftable grows, it can accommodate a multitude of more refcounts,
+     * so that at some point this must be able to cover the reftable
+     * and all refblocks describing it.)
+     *
+     * We then convert the reftable to big-endian and write it to disk.
+     *
+     * Note that we never free any reftable allocations.  Doing so would
+     * needlessly complicate the algorithm: The eventual second check
+     * run we do will clean up all leaks we have caused.
+     */
+
+    reftable_size_changed =
+        rebuild_refcounts_write_refblocks(bs, refcount_table, nb_clusters,
+                                          0, *nb_clusters,
+                                          &on_disk_reftable,
+                                          &on_disk_reftable_entries);
+    if (reftable_size_changed < 0) {
+        res->check_errors++;
+        ret = reftable_size_changed;
+        goto fail;
+    }
+
+    /*
+     * There was no reftable before, so rebuild_refcounts_write_refblocks()
+     * must have increased its size (from 0 to something).
+     */
+    assert(reftable_size_changed == true);
+
+    do {
+        int64_t reftable_start_cluster, reftable_end_cluster;
+        int64_t first_free_cluster = 0;
+
+        reftable_length = on_disk_reftable_entries * REFTABLE_ENTRY_SIZE;
+        reftable_clusters = size_to_clusters(s, reftable_length);
 
-        post_refblock_start = ROUND_UP(*nb_clusters, s->refcount_block_size);
-        reftable_clusters =
-            size_to_clusters(s, reftable_size * REFTABLE_ENTRY_SIZE);
-        /* Not pretty but simple */
-        if (first_free_cluster < post_refblock_start) {
-            first_free_cluster = post_refblock_start;
-        }
         reftable_offset = alloc_clusters_imrt(bs, reftable_clusters,
                                               refcount_table, nb_clusters,
                                               &first_free_cluster);
@@ -2578,24 +2683,55 @@ write_refblocks:
             goto fail;
         }
 
-        goto write_refblocks;
-    }
+        /*
+         * We need to update the affected refblocks, so re-run the
+         * write_refblocks loop for the reftable's range of clusters.
+         */
+        assert(offset_into_cluster(s, reftable_offset) == 0);
+        reftable_start_cluster = reftable_offset / s->cluster_size;
+        reftable_end_cluster = reftable_start_cluster + reftable_clusters;
+        reftable_size_changed =
+            rebuild_refcounts_write_refblocks(bs, refcount_table, nb_clusters,
+                                              reftable_start_cluster,
+                                              reftable_end_cluster,
+                                              &on_disk_reftable,
+                                              &on_disk_reftable_entries);
+        if (reftable_size_changed < 0) {
+            res->check_errors++;
+            ret = reftable_size_changed;
+            goto fail;
+        }
+
+        /*
+         * If the reftable size has changed, we will need to find a new
+         * allocation, repeating the loop.
+         */
+    } while (reftable_size_changed);
 
-    for (refblock_index = 0; refblock_index < reftable_size; refblock_index++) {
+    /* The above loop must have run at least once */
+    assert(reftable_offset >= 0);
+
+    /*
+     * All allocations are done, all refblocks are written, convert the
+     * reftable to big-endian and write it to disk.
+     */
+
+    for (refblock_index = 0; refblock_index < on_disk_reftable_entries;
+         refblock_index++)
+    {
         cpu_to_be64s(&on_disk_reftable[refblock_index]);
     }
 
-    ret = qcow2_pre_write_overlap_check(bs, 0, reftable_offset,
-                                        reftable_size * REFTABLE_ENTRY_SIZE,
+    ret = qcow2_pre_write_overlap_check(bs, 0, reftable_offset, reftable_length,
                                         false);
     if (ret < 0) {
         fprintf(stderr, "ERROR writing reftable: %s\n", strerror(-ret));
         goto fail;
     }
 
-    assert(reftable_size < INT_MAX / REFTABLE_ENTRY_SIZE);
+    assert(reftable_length < INT_MAX);
     ret = bdrv_pwrite(bs->file, reftable_offset, on_disk_reftable,
-                      reftable_size * REFTABLE_ENTRY_SIZE);
+                      reftable_length);
     if (ret < 0) {
         fprintf(stderr, "ERROR writing reftable: %s\n", strerror(-ret));
         goto fail;
@@ -2604,7 +2740,7 @@ write_refblocks:
     /* Enter new reftable into the image header */
     reftable_offset_and_clusters.reftable_offset = cpu_to_be64(reftable_offset);
     reftable_offset_and_clusters.reftable_clusters =
-        cpu_to_be32(size_to_clusters(s, reftable_size * REFTABLE_ENTRY_SIZE));
+        cpu_to_be32(reftable_clusters);
     ret = bdrv_pwrite_sync(bs->file,
                            offsetof(QCowHeader, refcount_table_offset),
                            &reftable_offset_and_clusters,
@@ -2614,12 +2750,14 @@ write_refblocks:
         goto fail;
     }
 
-    for (refblock_index = 0; refblock_index < reftable_size; refblock_index++) {
+    for (refblock_index = 0; refblock_index < on_disk_reftable_entries;
+         refblock_index++)
+    {
         be64_to_cpus(&on_disk_reftable[refblock_index]);
     }
     s->refcount_table = on_disk_reftable;
     s->refcount_table_offset = reftable_offset;
-    s->refcount_table_size = reftable_size;
+    s->refcount_table_size = on_disk_reftable_entries;
     update_max_refcount_table_index(s);
 
     return 0;
-- 
2.35.1




* [PATCH v2 2/2] iotests/108: Test new refcount rebuild algorithm
  2022-03-29  9:19 [PATCH v2 0/2] qcow2: Improve refcount structure rebuilding Hanna Reitz
  2022-03-29  9:19 ` [PATCH v2 1/2] " Hanna Reitz
@ 2022-03-29  9:19 ` Hanna Reitz
  2022-03-30 15:07   ` Eric Blake
  1 sibling, 1 reply; 7+ messages in thread
From: Hanna Reitz @ 2022-03-29  9:19 UTC (permalink / raw)
  To: qemu-block; +Cc: Kevin Wolf, Hanna Reitz, qemu-devel

One clear problem with how qcow2's refcount structure rebuild algorithm
worked before "qcow2: Improve refcount structure rebuilding" was that it
was prone to failure for qcow2 images on block devices: There is
generally unused space after the actual image, and if that exceeds what
one refblock covers, the old algorithm would invariably write the
reftable past the block device's end, which cannot work.  The new
algorithm does not have this problem.

Test it with three tests:
(1) Create an image with more empty space at the end than what one
    refblock covers, see whether rebuilding the refcount structures
    results in a change in the image file length.  (It should not.)

(2) Leave precisely enough space somewhere at the beginning of the image
    for the new reftable (and the refblock for that place), see whether
    the new algorithm puts the reftable there.  (It should.)

(3) Test the original problem: Create (something like) a block device
    with a fixed size, then create a qcow2 image in there, write some
    data, and then have qemu-img check rebuild the refcount structures.
    Before HEAD^, the reftable would have been written past the image
    file end, i.e. outside of what the block device provides, which
    cannot work.  HEAD^ should have fixed that.
    ("Something like a block device" means a loop device if we can use
    one ("sudo -n losetup" works), or a FUSE block export with
    growable=false otherwise.)

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
---
 tests/qemu-iotests/108     | 259 ++++++++++++++++++++++++++++++++++++-
 tests/qemu-iotests/108.out |  81 ++++++++++++
 2 files changed, 339 insertions(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/108 b/tests/qemu-iotests/108
index 56339ab2c5..e9e2ced69b 100755
--- a/tests/qemu-iotests/108
+++ b/tests/qemu-iotests/108
@@ -30,13 +30,20 @@ status=1	# failure is the default!
 
 _cleanup()
 {
-	_cleanup_test_img
+    _cleanup_test_img
+    if [ -f "$TEST_DIR/qsd.pid" ]; then
+        qsd_pid=$(cat "$TEST_DIR/qsd.pid")
+        kill -KILL "$qsd_pid"
+        fusermount -u "$TEST_DIR/fuse-export" &>/dev/null
+    fi
+    rm -f "$TEST_DIR/fuse-export"
 }
 trap "_cleanup; exit \$status" 0 1 2 3 15
 
 # get standard environment, filters and checks
 . ./common.rc
 . ./common.filter
+. ./common.qemu
 
 # This tests qcow2-specific low-level functionality
 _supported_fmt qcow2
@@ -47,6 +54,22 @@ _supported_os Linux
 # files
 _unsupported_imgopts 'refcount_bits=\([^1]\|.\([^6]\|$\)\)' data_file
 
+# This test either needs sudo -n losetup or FUSE exports to work
+if sudo -n losetup &>/dev/null; then
+    loopdev=true
+else
+    loopdev=false
+
+    # QSD --export fuse will either yield "Parameter 'id' is missing"
+    # or "Invalid parameter 'fuse'", depending on whether there is
+    # FUSE support or not.
+    error=$($QSD --export fuse 2>&1)
+    if [[ $error = *"Invalid parameter 'fuse'" ]]; then
+        _notrun 'Passwordless sudo for losetup or FUSE support required, but' \
+                'neither is available'
+    fi
+fi
+
 echo
 echo '=== Repairing an image without any refcount table ==='
 echo
@@ -138,6 +161,240 @@ _make_test_img 64M
 poke_file "$TEST_IMG" $((0x10008)) "\xff\xff\xff\xff\xff\xff\x00\x00"
 _check_test_img -r all
 
+echo
+echo '=== Check rebuilt reftable location ==='
+
+# In an earlier version of the refcount rebuild algorithm, the
+# reftable was generally placed at the image end (unless something was
+# allocated in the area covered by the refblock right before the image
+# file end, then we would try to place the reftable in that refblock).
+# This was later changed so the reftable would be placed in the
+# earliest possible location.  Test this.
+
+echo
+echo '--- Does the image size increase? ---'
+echo
+
+# First test: Just create some image, write some data to it, and
+# resize it so there is free space at the end of the image (enough
+# that it spans at least one full refblock, which for cluster_size=512
+# images, spans 128k).  With the old algorithm, the reftable would
+# have then been placed at the end of the image file, but with the new
+# one, it will be put in that free space.
+# We want to check whether the size of the image file increases due to
+# rebuilding the refcount structures (it should not).
+
+_make_test_img -o 'cluster_size=512' 1M
+# Write something
+$QEMU_IO -c 'write 0 64k' "$TEST_IMG" | _filter_qemu_io
+
+# Add free space
+file_len=$(stat -c '%s' "$TEST_IMG")
+truncate -s $((file_len + 256 * 1024)) "$TEST_IMG"
+
+# Corrupt the image by saying the image header were not allocated
+rt_offset=$(peek_file_be "$TEST_IMG" 48 8)
+rb_offset=$(peek_file_be "$TEST_IMG" $rt_offset 8)
+poke_file "$TEST_IMG" $rb_offset "\x00\x00"
+
+# Check whether rebuilding the refcount structures increases the image
+# file size
+file_len=$(stat -c '%s' "$TEST_IMG")
+echo
+# The only leaks there can be are the old refcount structures that are
+# leaked during rebuilding, no need to clutter the output with them
+_check_test_img -r all | grep -v '^Repairing cluster.*refcount=1 reference=0'
+echo
+post_repair_file_len=$(stat -c '%s' "$TEST_IMG")
+
+if [[ $file_len -eq $post_repair_file_len ]]; then
+    echo 'OK: Image size did not change'
+else
+    echo 'ERROR: Image size differs' \
+        "($file_len before, $post_repair_file_len after)"
+fi
+
+echo
+echo '--- Will the reftable occupy a hole specifically left for it?  ---'
+echo
+
+# Note: With cluster_size=512, every refblock covers 128k.
+# The reftable covers 8M per reftable cluster.
+
+# Create an image that requires two reftable clusters (just because
+# this is more interesting than a single-clustered reftable).
+_make_test_img -o 'cluster_size=512' 9M
+$QEMU_IO -c 'write 0 8M' "$TEST_IMG" | _filter_qemu_io
+
+# Writing 8M will have resized the reftable.  Unfortunately, doing so
+# will leave holes in the file, so we need to fill them up so we can
+# be sure the whole file is allocated.  Do that by writing
+# consecutively smaller chunks starting from 8 MB, until the file
+# length increases even with a chunk size of 512.  Then we must have
+# filled all holes.
+ofs=$((8 * 1024 * 1024))
+block_len=$((16 * 1024))
+while [[ $block_len -ge 512 ]]; do
+    file_len=$(stat -c '%s' "$TEST_IMG")
+    while [[ $(stat -c '%s' "$TEST_IMG") -eq $file_len ]]; do
+        # Do not include this in the reference output, it does not
+        # really matter which qemu-io calls we do here exactly
+        $QEMU_IO -c "write $ofs $block_len" "$TEST_IMG" >/dev/null
+        ofs=$((ofs + block_len))
+    done
+    block_len=$((block_len / 2))
+done
+
+# Fill up to 9M (do not include this in the reference output either,
+# $ofs is random for all we know)
+$QEMU_IO -c "write $ofs $((9 * 1024 * 1024 - ofs))" "$TEST_IMG" >/dev/null
+
+# Make space as follows:
+# - For the first refblock: Right at the beginning of the image (this
+#   refblock is placed in the first place possible),
+# - For the reftable somewhere soon afterwards, still near the
+#   beginning of the image (i.e. covered by the first refblock); the
+#   reftable too is placed in the first place possible, but only after
+#   all refblocks have been placed)
+# No space is needed for the other refblocks, because no refblock is
+# put before the space it covers.  In this test case, we do not mind
+# if they are placed at the image file's end.
+
+# Before we make that space, we have to find out the host offset of
+# the area that belonged to the two data clusters at guest offset 4k,
+# because we expect the reftable to be placed there, and we will have
+# to verify that it is.
+
+l1_offset=$(peek_file_be "$TEST_IMG" 40 8)
+l2_offset=$(peek_file_be "$TEST_IMG" $l1_offset 8)
+l2_offset=$((l2_offset & 0x00fffffffffffe00))
+data_4k_offset=$(peek_file_be "$TEST_IMG" \
+                 $((l2_offset + 4096 / 512 * 8)) 8)
+data_4k_offset=$((data_4k_offset & 0x00fffffffffffe00))
+
+$QEMU_IO -c "discard 0 512" -c "discard 4k 1k" "$TEST_IMG" | _filter_qemu_io
+
+# Corrupt the image by saying the image header were not allocated
+rt_offset=$(peek_file_be "$TEST_IMG" 48 8)
+rb_offset=$(peek_file_be "$TEST_IMG" $rt_offset 8)
+poke_file "$TEST_IMG" $rb_offset "\x00\x00"
+
+echo
+# The only leaks there can be are the old refcount structures that are
+# leaked during rebuilding, no need to clutter the output with them
+_check_test_img -r all | grep -v '^Repairing cluster.*refcount=1 reference=0'
+echo
+
+# Check whether the reftable was put where we expected
+rt_offset=$(peek_file_be "$TEST_IMG" 48 8)
+if [[ $rt_offset -eq $data_4k_offset ]]; then
+    echo 'OK: Reftable is where we expect it'
+else
+    echo "ERROR: Reftable is at $rt_offset, but was expected at $data_4k_offset"
+fi
+
+echo
+echo '--- Rebuilding refcount structures on block devices ---'
+echo
+
+# A block device cannot really grow, at least not during qemu-img
+# check.  As mentioned in the above cases, rebuilding the refcount
+# structure may lead to new refcount structures being written after
+# the end of the image, and in the past that happened even if there
+# was more than sufficient space in the image.  Such post-EOF writes
+# will not work on block devices, so test that the new algorithm
+# avoids it.
+
+# If we have passwordless sudo and losetup, we can use those to create
+# a block device.  Otherwise, we can resort to qemu's FUSE export to
+# create a file that isn't growable, which effectively tests the same
+# thing.
+
+_cleanup_test_img
+truncate -s $((64 * 1024 * 1024)) "$TEST_IMG"
+
+if $loopdev; then
+    export_mp=$(sudo -n losetup --show -f "$TEST_IMG")
+    export_mp_driver=host_device
+    sudo -n chmod go+rw "$export_mp"
+else
+    # Create non-growable FUSE export that is a bit like an empty
+    # block device
+    export_mp="$TEST_DIR/fuse-export"
+    export_mp_driver=file
+    touch "$export_mp"
+
+    $QSD \
+        --blockdev file,node-name=export-node,filename="$TEST_IMG" \
+        --export fuse,id=fuse-export,node-name=export-node,mountpoint="$export_mp",writable=on,growable=off \
+        --pidfile "$TEST_DIR/qsd.pid" \
+        --daemonize
+fi
+
+# Now create a qcow2 image on the device -- unfortunately, qemu-img
+# create force-creates the file, so we have to resort to the
+# blockdev-create job.
+_launch_qemu \
+    --blockdev $export_mp_driver,node-name=file,filename="$export_mp"
+
+_send_qemu_cmd \
+    $QEMU_HANDLE \
+    '{ "execute": "qmp_capabilities" }' \
+    'return'
+
+# Small cluster size again, so the image needs multiple refblocks
+_send_qemu_cmd \
+    $QEMU_HANDLE \
+    '{ "execute": "blockdev-create",
+       "arguments": {
+           "job-id": "create",
+           "options": {
+               "driver": "qcow2",
+               "file": "file",
+               "size": '$((64 * 1024 * 1024))',
+               "cluster-size": 512
+           } } }' \
+    '"concluded"'
+
+_send_qemu_cmd \
+    $QEMU_HANDLE \
+    '{ "execute": "job-dismiss", "arguments": { "id": "create" } }' \
+    'return'
+
+_send_qemu_cmd \
+    $QEMU_HANDLE \
+    '{ "execute": "quit" }' \
+    'return'
+
+wait=y _cleanup_qemu
+echo
+
+# Write some data
+$QEMU_IO -c 'write 0 64k' "$export_mp" | _filter_qemu_io
+
+# Corrupt the image by saying the image header were not allocated
+rt_offset=$(peek_file_be "$export_mp" 48 8)
+rb_offset=$(peek_file_be "$export_mp" $rt_offset 8)
+poke_file "$export_mp" $rb_offset "\x00\x00"
+
+# Repairing such a simple case should just work
+# (We used to put the reftable at the end of the image file, which can
+# never work for non-growable devices.)
+echo
+TEST_IMG="$export_mp" _check_test_img -r all \
+    | grep -v '^Repairing cluster.*refcount=1 reference=0'
+
+if $loopdev; then
+    sudo -n losetup -d "$export_mp"
+else
+    qsd_pid=$(cat "$TEST_DIR/qsd.pid")
+    kill -TERM "$qsd_pid"
+    # Wait for process to exit (cannot `wait` because the QSD is daemonized)
+    while [ -f "$TEST_DIR/qsd.pid" ]; do
+        true
+    done
+fi
+
 # success, all done
 echo '*** done'
 rm -f $seq.full
diff --git a/tests/qemu-iotests/108.out b/tests/qemu-iotests/108.out
index 75bab8dc84..b5401d788d 100644
--- a/tests/qemu-iotests/108.out
+++ b/tests/qemu-iotests/108.out
@@ -105,6 +105,87 @@ The following inconsistencies were found and repaired:
     0 leaked clusters
     1 corruptions
 
+Double checking the fixed image now...
+No errors were found on the image.
+
+=== Check rebuilt reftable location ===
+
+--- Does the image size increase? ---
+
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
+wrote 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+ERROR cluster 0 refcount=0 reference=1
+Rebuilding refcount structure
+The following inconsistencies were found and repaired:
+
+    0 leaked clusters
+    1 corruptions
+
+Double checking the fixed image now...
+No errors were found on the image.
+
+OK: Image size did not change
+
+--- Will the reftable occupy a hole specifically left for it?  ---
+
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=9437184
+wrote 8388608/8388608 bytes at offset 0
+8 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+discard 512/512 bytes at offset 0
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+discard 1024/1024 bytes at offset 4096
+1 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+ERROR cluster 0 refcount=0 reference=1
+Rebuilding refcount structure
+The following inconsistencies were found and repaired:
+
+    0 leaked clusters
+    1 corruptions
+
+Double checking the fixed image now...
+No errors were found on the image.
+
+OK: Reftable is where we expect it
+
+--- Rebuilding refcount structures on block devices ---
+
+{ "execute": "qmp_capabilities" }
+{"return": {}}
+{ "execute": "blockdev-create",
+       "arguments": {
+           "job-id": "create",
+           "options": {
+               "driver": "IMGFMT",
+               "file": "file",
+               "size": 67108864,
+               "cluster-size": 512
+           } } }
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "create"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "create"}}
+{"return": {}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "create"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "create"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "create"}}
+{ "execute": "job-dismiss", "arguments": { "id": "create" } }
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "create"}}
+{"return": {}}
+{ "execute": "quit" }
+{"return": {}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
+
+wrote 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+ERROR cluster 0 refcount=0 reference=1
+Rebuilding refcount structure
+The following inconsistencies were found and repaired:
+
+    0 leaked clusters
+    1 corruptions
+
 Double checking the fixed image now...
 No errors were found on the image.
 *** done
-- 
2.35.1




* Re: [PATCH v2 1/2] qcow2: Improve refcount structure rebuilding
  2022-03-29  9:19 ` [PATCH v2 1/2] " Hanna Reitz
@ 2022-03-30 14:32   ` Eric Blake
  2022-04-01 13:49     ` Hanna Reitz
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Blake @ 2022-03-30 14:32 UTC (permalink / raw)
  To: Hanna Reitz; +Cc: Kevin Wolf, qemu-devel, qemu-block

On Tue, Mar 29, 2022 at 11:19:16AM +0200, Hanna Reitz wrote:
> When rebuilding the refcount structures (when qemu-img check -r found
> errors with refcount = 0, but reference count > 0), the new refcount
> table defaults to being put at the image file end[1].  There is no good
> reason for that except that it means we will not have to rewrite any
> refblocks we already wrote to disk.
> 
> Changing the code to rewrite those refblocks is not too difficult,
> though, so let us do that.  That is beneficial for images on block
> devices, where we cannot really write beyond the end of the image file.
> 
> Use this opportunity to add extensive comments to the code, and refactor
> it a bit, getting rid of the backwards-jumping goto.
> 
> [1] Unless there is something allocated in the area pointed to by the
>     last refblock, so we have to write that refblock.  In that case, we
>     try to put the reftable in there.
> 
> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1519071
> Closes: https://gitlab.com/qemu-project/qemu/-/issues/941
> Signed-off-by: Hanna Reitz <hreitz@redhat.com>
> ---
>  block/qcow2-refcount.c | 332 +++++++++++++++++++++++++++++------------
>  1 file changed, 235 insertions(+), 97 deletions(-)
> 

> +static int rebuild_refcounts_write_refblocks(
> +        BlockDriverState *bs, void **refcount_table, int64_t *nb_clusters,
> +        int64_t first_cluster, int64_t end_cluster,
> +        uint64_t **on_disk_reftable_ptr, uint32_t *on_disk_reftable_entries_ptr
> +    )

As you are rewriting this into a helper function, should this function
take Error **errp,...

> +            /* Don't allocate a cluster in a refblock already written to disk */
> +            if (first_free_cluster < refblock_start) {
> +                first_free_cluster = refblock_start;
> +            }
> +            refblock_offset = alloc_clusters_imrt(bs, 1, refcount_table,
> +                                                  nb_clusters,
> +                                                  &first_free_cluster);
> +            if (refblock_offset < 0) {
> +                fprintf(stderr, "ERROR allocating refblock: %s\n",
> +                        strerror(-refblock_offset));
> +                return refblock_offset;

...instead of using fprintf(stderr), where the caller then handles the
error by printing?

Could be done as a separate patch.

>  
> +        /* Refblock is allocated, write it to disk */
> +
>          ret = qcow2_pre_write_overlap_check(bs, 0, refblock_offset,
>                                              s->cluster_size, false);
>          if (ret < 0) {
>              fprintf(stderr, "ERROR writing refblock: %s\n", strerror(-ret));
> -            goto fail;
> +            return ret;
>          }

Another spot where errp conversion might improve the code.

>  
> -        /* The size of *refcount_table is always cluster-aligned, therefore the
> -         * write operation will not overflow */
> +        /*
> +         * The refblock is simply a slice of *refcount_table.
> +         * Note that the size of *refcount_table is always aligned to
> +         * whole clusters, so the write operation will not result in
> +         * out-of-bounds accesses.
> +         */
>          on_disk_refblock = (void *)((char *) *refcount_table +
>                                      refblock_index * s->cluster_size);
>  
> @@ -2550,23 +2579,99 @@ write_refblocks:
>                            s->cluster_size);
>          if (ret < 0) {
>              fprintf(stderr, "ERROR writing refblock: %s\n", strerror(-ret));
> -            goto fail;
> +            return ret;
>          }

and another

>  
> -        /* Go to the end of this refblock */
> +        /* This refblock is done, skip to its end */
>          cluster = refblock_start + s->refcount_block_size - 1;
>      }
>  
> -    if (reftable_offset < 0) {
> -        uint64_t post_refblock_start, reftable_clusters;
> +    return reftable_grown;
> +}

The helper function looks okay.

> +
> +/*
> + * Creates a new refcount structure based solely on the in-memory information
> + * given through *refcount_table (this in-memory information is basically just
> + * the concatenation of all refblocks).  All necessary allocations will be
> + * reflected in that array.
> + *
> + * On success, the old refcount structure is leaked (it will be covered by the
> + * new refcount structure).
> + */
> +static int rebuild_refcount_structure(BlockDriverState *bs,
> +                                      BdrvCheckResult *res,
> +                                      void **refcount_table,
> +                                      int64_t *nb_clusters)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    int64_t reftable_offset = -1;
> +    int64_t reftable_length = 0;
> +    int64_t reftable_clusters;
> +    int64_t refblock_index;
> +    uint32_t on_disk_reftable_entries = 0;
> +    uint64_t *on_disk_reftable = NULL;
> +    int ret = 0;
> +    int reftable_size_changed = 0;
> +    struct {
> +        uint64_t reftable_offset;
> +        uint32_t reftable_clusters;
> +    } QEMU_PACKED reftable_offset_and_clusters;
> +
> +    qcow2_cache_empty(bs, s->refcount_block_cache);
> +
> +    /*
> +     * For each refblock containing entries, we try to allocate a
> +     * cluster (in the in-memory refcount table) and write its offset
> +     * into on_disk_reftable[].  We then write the whole refblock to
> +     * disk (as a slice of the in-memory refcount table).
> +     * This is done by rebuild_refcounts_write_refblocks().
> +     *
> +     * Once we have scanned all clusters, we try to find space for the
> +     * reftable.  This will dirty the in-memory refcount table (i.e.
> +     * make it differ from the refblocks we have already written), so we
> +     * need to run rebuild_refcounts_write_refblocks() again for the
> +     * range of clusters where the reftable has been allocated.
> +     *
> +     * This second run might make the reftable grow again, in which case
> +     * we will need to allocate another space for it, which is why we
> +     * repeat all this until the reftable stops growing.
> +     *
> +     * (This loop will terminate, because with every cluster the
> +     * reftable grows, it can accommodate a multitude of more refcounts,
> +     * so that at some point this must be able to cover the reftable
> +     * and all refblocks describing it.)
> +     *
> +     * We then convert the reftable to big-endian and write it to disk.
> +     *
> +     * Note that we never free any reftable allocations.  Doing so would
> +     * needlessly complicate the algorithm: The eventual second check
> +     * run we do will clean up all leaks we have caused.
> +     */

Freeing reftable allocations from the first run might allow the second
(or third) to find a spot earlier in the image that is large enough to
contain the resized reftable, compared to leaving it leaked and
forcing subsequent runs to look later into the image.  But I agree
that the complication of code needed to handle that is not worth the
minor corner-case savings of a more densely packed overall image (the
leaked clusters will probably remain sparse for any decent cluster
size).

Another approach might be to intentionally over-allocate the reftable
to the point where we know it can't grow, then allocate, then scale it
back down to how much we actually used (possibly reclaiming a few
clusters at the end of the allocation).  But that's an even bigger
rewrite, and again, not worth the headache.

512-byte clusters would be where this is most likely to be noticeable
(that is, hitting a 3rd iteration with 512-byte clusters is probably
easy enough to actually test, but hitting a 3rd iteration with 2M
clusters is probably prohibitively expensive if even possible).
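
For reference, the arithmetic behind those numbers, assuming the 512-byte
clusters and 16-bit refcounts that iotest 108 uses:

    cluster_size=512
    refblock_entries=$((cluster_size / 2))      # 256 16-bit refcounts per refblock
    echo $((refblock_entries * cluster_size))   # 131072: one refblock covers 128k
    reftable_entries=$((cluster_size / 8))      # 64 8-byte entries per reftable cluster
    echo $((reftable_entries * refblock_entries * cluster_size))  # 8388608: 8M

so every additional reftable cluster covers another 8M of image, which is why
the rebuild loop converges after very few iterations.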

> +
> +    reftable_size_changed =
> +        rebuild_refcounts_write_refblocks(bs, refcount_table, nb_clusters,
> +                                          0, *nb_clusters,
> +                                          &on_disk_reftable,
> +                                          &on_disk_reftable_entries);
> +    if (reftable_size_changed < 0) {
> +        res->check_errors++;
> +        ret = reftable_size_changed;
> +        goto fail;
> +    }
> +
> +    /*
> +     * There was no reftable before, so rebuild_refcounts_write_refblocks()
> +     * must have increased its size (from 0 to something).
> +     */
> +    assert(reftable_size_changed == true);

'int == bool'.  Works, but is a bit odd.  I might have done just
assert(reftable_size_changed), since we just ruled out negative values
above.  Particularly since...

> +
> +    do {
> +        int64_t reftable_start_cluster, reftable_end_cluster;
> +        int64_t first_free_cluster = 0;
...
> +
> +        /*
> +         * If the reftable size has changed, we will need to find a new
> +         * allocation, repeating the loop.
> +         */
> +    } while (reftable_size_changed);

...here you ARE using the int as a bool directly.

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH v2 2/2] iotests/108: Test new refcount rebuild algorithm
  2022-03-29  9:19 ` [PATCH v2 2/2] iotests/108: Test new refcount rebuild algorithm Hanna Reitz
@ 2022-03-30 15:07   ` Eric Blake
  2022-04-01 13:55     ` Hanna Reitz
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Blake @ 2022-03-30 15:07 UTC (permalink / raw)
  To: Hanna Reitz; +Cc: Kevin Wolf, qemu-devel, qemu-block

On Tue, Mar 29, 2022 at 11:19:17AM +0200, Hanna Reitz wrote:
> One clear problem with how qcow2's refcount structure rebuild algorithm
> worked before "qcow2: Improve refcount structure rebuilding" was that it
> was prone to failure for qcow2 images on block devices: There is
> generally unused space after the actual image, and if that exceeds what
> one refblock covers, the old algorithm would invariably write the
> reftable past the block device's end, which cannot work.  The new
> algorithm does not have this problem.
> 
> Test it with three tests:
> (1) Create an image with more empty space at the end than what one
>     refblock covers, see whether rebuilding the refcount structures
>     results in a change in the image file length.  (It should not.)
> 
> (2) Leave precisely enough space somewhere at the beginning of the image
>     for the new reftable (and the refblock for that place), see whether
>     the new algorithm puts the reftable there.  (It should.)
> 
> (3) Test the original problem: Create (something like) a block device
>     with a fixed size, then create a qcow2 image in there, write some
>     data, and then have qemu-img check rebuild the refcount structures.
>     Before HEAD^, the reftable would have been written past the image
>     file end, i.e. outside of what the block device provides, which
>     cannot work.  HEAD^ should have fixed that.
>     ("Something like a block device" means a loop device if we can use
>     one ("sudo -n losetup" works), or a FUSE block export with
>     growable=false otherwise.)

NBD might work for that purpose as well.  But no need to rewrite your
test for yet another block-alike behavior ;)
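
Roughly, an NBD-based variant might look like this untested sketch, though
unlike FUSE it would not give the test a plain file path for peek_file and
poke_file:

    $QSD \
        --blockdev file,node-name=export-node,filename="$TEST_IMG" \
        --nbd-server addr.type=unix,addr.path="$TEST_DIR/nbd.sock" \
        --export nbd,id=nbd-export,node-name=export-node,writable=on \
        --pidfile "$TEST_DIR/qsd.pid" \
        --daemonize
    # accessed as nbd+unix:///export-node?socket=$TEST_DIR/nbd.sock

which is presumably one reason the FUSE export is the better fit here.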

> 
> Signed-off-by: Hanna Reitz <hreitz@redhat.com>
> ---
>  tests/qemu-iotests/108     | 259 ++++++++++++++++++++++++++++++++++++-
>  tests/qemu-iotests/108.out |  81 ++++++++++++
>  2 files changed, 339 insertions(+), 1 deletion(-)
> 
> diff --git a/tests/qemu-iotests/108 b/tests/qemu-iotests/108
> index 56339ab2c5..e9e2ced69b 100755
> --- a/tests/qemu-iotests/108
> +++ b/tests/qemu-iotests/108
> @@ -30,13 +30,20 @@ status=1	# failure is the default!
>  
>  _cleanup()
>  {
> -	_cleanup_test_img
> +    _cleanup_test_img

Nice TAB cleanup while at it.

> +    if [ -f "$TEST_DIR/qsd.pid" ]; then
> +        qsd_pid=$(cat "$TEST_DIR/qsd.pid")
> +        kill -KILL "$qsd_pid"

SIGKILL is harsh, skipping cleanup paths.  Do we want to try SIGINT
first, and only resort to SIGKILL if we didn't get a response?  Do we
(should we?) have a helper function for doing process reaping so that
we aren't open-coding a two-layer signal approach in all callers? [1]
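
A hypothetical sketch of such a helper (not an existing common.rc/common.qemu
function), going graceful first and using SIGKILL only as a last resort:

    _stop_daemon()
    {
        local pid=$1
        kill -TERM "$pid" 2>/dev/null
        for _ in $(seq 50); do                     # wait up to ~5 seconds
            kill -0 "$pid" 2>/dev/null || return 0
            sleep 0.1
        done
        kill -KILL "$pid" 2>/dev/null              # still running, force it
    }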

> +        fusermount -u "$TEST_DIR/fuse-export" &>/dev/null
> +    fi
> +    rm -f "$TEST_DIR/fuse-export"
>  }
>  trap "_cleanup; exit \$status" 0 1 2 3 15
>  
>  # get standard environment, filters and checks
>  . ./common.rc
>  . ./common.filter
> +. ./common.qemu
>  
>  # This tests qcow2-specific low-level functionality
>  _supported_fmt qcow2
> @@ -47,6 +54,22 @@ _supported_os Linux
>  # files
>  _unsupported_imgopts 'refcount_bits=\([^1]\|.\([^6]\|$\)\)' data_file
>  
> +# This test either needs sudo -n losetup or FUSE exports to work
> +if sudo -n losetup &>/dev/null; then
> +    loopdev=true
> +else
> +    loopdev=false
> +
> +    # QSD --export fuse will either yield "Parameter 'id' is missing"
> +    # or "Invalid parameter 'fuse'", depending on whether there is
> +    # FUSE support or not.
> +    error=$($QSD --export fuse 2>&1)
> +    if [[ $error = *"Invalid parameter 'fuse'" ]]; then

Good thing iotests are written in bash.

> +        _notrun 'Passwordless sudo for losetup or FUSE support required, but' \
> +                'neither is available'
> +    fi
> +fi
> +
>  echo
>  echo '=== Repairing an image without any refcount table ==='
>  echo
> @@ -138,6 +161,240 @@ _make_test_img 64M
>  poke_file "$TEST_IMG" $((0x10008)) "\xff\xff\xff\xff\xff\xff\x00\x00"
>  _check_test_img -r all
>  
> +echo
> +echo '=== Check rebuilt reftable location ==='
> +
> +# In an earlier version of the refcount rebuild algorithm, the
> +# reftable was generally placed at the image end (unless something was
> +# allocated in the area covered by the refblock right before the image
> +# file end, then we would try to place the reftable in that refblock).
> +# This was later changed so the reftable would be placed in the
> +# earliest possible location.  Test this.
> +
> +echo
> +echo '--- Does the image size increase? ---'
> +echo
> +
> +# First test: Just create some image, write some data to it, and
> +# resize it so there is free space at the end of the image (enough
> +# that it spans at least one full refblock, which for cluster_size=512
> +# images, spans 128k).  With the old algorithm, the reftable would
> +# have then been placed at the end of the image file, but with the new
> +# one, it will be put in that free space.
> +# We want to check whether the size of the image file increases due to
> +# rebuilding the refcount structures (it should not).
> +
> +_make_test_img -o 'cluster_size=512' 1M
> +# Write something
> +$QEMU_IO -c 'write 0 64k' "$TEST_IMG" | _filter_qemu_io
> +
> +# Add free space
> +file_len=$(stat -c '%s' "$TEST_IMG")
> +truncate -s $((file_len + 256 * 1024)) "$TEST_IMG"
> +
> +# Corrupt the image by saying the image header were not allocated

s/were/was/

> +rt_offset=$(peek_file_be "$TEST_IMG" 48 8)
> +rb_offset=$(peek_file_be "$TEST_IMG" $rt_offset 8)
> +poke_file "$TEST_IMG" $rb_offset "\x00\x00"
> +
> +# Check whether rebuilding the refcount structures increases the image
> +# file size
> +file_len=$(stat -c '%s' "$TEST_IMG")
> +echo
> +# The only leaks there can be are the old refcount structures that are
> +# leaked during rebuilding, no need to clutter the output with them
> +_check_test_img -r all | grep -v '^Repairing cluster.*refcount=1 reference=0'
> +echo
> +post_repair_file_len=$(stat -c '%s' "$TEST_IMG")
> +
> +if [[ $file_len -eq $post_repair_file_len ]]; then
> +    echo 'OK: Image size did not change'
> +else
> +    echo 'ERROR: Image size differs' \
> +        "($file_len before, $post_repair_file_len after)"
> +fi
> +
> +echo
> +echo '--- Will the reftable occupy a hole specifically left for it?  ---'
> +echo
> +
> +# Note: With cluster_size=512, every refblock covers 128k.
> +# The reftable covers 8M per reftable cluster.
> +
> +# Create an image that requires two reftable clusters (just because
> +# this is more interesting than a single-clustered reftable).
> +_make_test_img -o 'cluster_size=512' 9M
> +$QEMU_IO -c 'write 0 8M' "$TEST_IMG" | _filter_qemu_io
> +
> +# Writing 8M will have resized the reftable.  Unfortunately, doing so
> +# will leave holes in the file, so we need to fill them up so we can
> +# be sure the whole file is allocated.  Do that by writing
> +# consecutively smaller chunks starting from 8 MB, until the file
> +# length increases even with a chunk size of 512.  Then we must have
> +# filled all holes.
> +ofs=$((8 * 1024 * 1024))
> +block_len=$((16 * 1024))
> +while [[ $block_len -ge 512 ]]; do
> +    file_len=$(stat -c '%s' "$TEST_IMG")
> +    while [[ $(stat -c '%s' "$TEST_IMG") -eq $file_len ]]; do
> +        # Do not include this in the reference output, it does not
> +        # really matter which qemu-io calls we do here exactly
> +        $QEMU_IO -c "write $ofs $block_len" "$TEST_IMG" >/dev/null
> +        ofs=$((ofs + block_len))
> +    done
> +    block_len=$((block_len / 2))
> +done

Interesting approach.

> +
> +# Fill up to 9M (do not include this in the reference output either,
> +# $ofs is random for all we know)
> +$QEMU_IO -c "write $ofs $((9 * 1024 * 1024 - ofs))" "$TEST_IMG" >/dev/null
> +
> +# Make space as follows:
> +# - For the first refblock: Right at the beginning of the image (this
> +#   refblock is placed in the first place possible),
> +# - For the reftable somewhere soon afterwards, still near the
> +#   beginning of the image (i.e. covered by the first refblock); the
> +#   reftable too is placed in the first place possible, but only after
> +#   all refblocks have been placed)
> +# No space is needed for the other refblocks, because no refblock is
> +# put before the space it covers.  In this test case, we do not mind
> +# if they are placed at the image file's end.
> +
> +# Before we make that space, we have to find out the host offset of
> +# the area that belonged to the two data clusters at guest offset 4k,
> +# because we expect the reftable to be placed there, and we will have
> +# to verify that it is.
> +
> +l1_offset=$(peek_file_be "$TEST_IMG" 40 8)
> +l2_offset=$(peek_file_be "$TEST_IMG" $l1_offset 8)
> +l2_offset=$((l2_offset & 0x00fffffffffffe00))
> +data_4k_offset=$(peek_file_be "$TEST_IMG" \
> +                 $((l2_offset + 4096 / 512 * 8)) 8)
> +data_4k_offset=$((data_4k_offset & 0x00fffffffffffe00))
> +
> +$QEMU_IO -c "discard 0 512" -c "discard 4k 1k" "$TEST_IMG" | _filter_qemu_io
> +
> +# Corrupt the image by saying the image header were not allocated

s/were/was/

> +rt_offset=$(peek_file_be "$TEST_IMG" 48 8)
> +rb_offset=$(peek_file_be "$TEST_IMG" $rt_offset 8)
> +poke_file "$TEST_IMG" $rb_offset "\x00\x00"
> +
> +echo
> +# The only leaks there can be are the old refcount structures that are
> +# leaked during rebuilding, no need to clutter the output with them
> +_check_test_img -r all | grep -v '^Repairing cluster.*refcount=1 reference=0'
> +echo
> +
> +# Check whether the reftable was put where we expected
> +rt_offset=$(peek_file_be "$TEST_IMG" 48 8)
> +if [[ $rt_offset -eq $data_4k_offset ]]; then
> +    echo 'OK: Reftable is where we expect it'
> +else
> +    echo "ERROR: Reftable is at $rt_offset, but was expected at $data_4k_offset"
> +fi
> +
> +echo
> +echo '--- Rebuilding refcount structures on block devices ---'
> +echo
> +
> +# A block device cannot really grow, at least not during qemu-img
> +# check.  As mentioned in the above cases, rebuilding the refcount
> +# structure may lead to new refcount structures being written after
> +# the end of the image, and in the past that happened even if there
> +# was more than sufficient space in the image.  Such post-EOF writes
> +# will not work on block devices, so test that the new algorithm
> +# avoids it.
> +
> +# If we have passwordless sudo and losetup, we can use those to create
> +# a block device.  Otherwise, we can resort to qemu's FUSE export to
> +# create a file that isn't growable, which effectively tests the same
> +# thing.
> +
> +_cleanup_test_img
> +truncate -s $((64 * 1024 * 1024)) "$TEST_IMG"
> +
> +if $loopdev; then
> +    export_mp=$(sudo -n losetup --show -f "$TEST_IMG")
> +    export_mp_driver=host_device
> +    sudo -n chmod go+rw "$export_mp"
> +else
> +    # Create non-growable FUSE export that is a bit like an empty
> +    # block device
> +    export_mp="$TEST_DIR/fuse-export"
> +    export_mp_driver=file
> +    touch "$export_mp"
> +
> +    $QSD \
> +        --blockdev file,node-name=export-node,filename="$TEST_IMG" \
> +        --export fuse,id=fuse-export,node-name=export-node,mountpoint="$export_mp",writable=on,growable=off \
> +        --pidfile "$TEST_DIR/qsd.pid" \
> +        --daemonize
> +fi
> +
> +# Now create a qcow2 image on the device -- unfortunately, qemu-img
> +# create force-creates the file, so we have to resort to the
> +# blockdev-create job.

Huh. Why do we have 'qemu-img convert -n' but not 'qemu-img create -n'
for writing a particular destination format into an already existing
storage space?

> +_launch_qemu \
> +    --blockdev $export_mp_driver,node-name=file,filename="$export_mp"
> +
> +_send_qemu_cmd \
> +    $QEMU_HANDLE \
> +    '{ "execute": "qmp_capabilities" }' \
> +    'return'
> +
> +# Small cluster size again, so the image needs multiple refblocks
> +_send_qemu_cmd \
> +    $QEMU_HANDLE \
> +    '{ "execute": "blockdev-create",
> +       "arguments": {
> +           "job-id": "create",
> +           "options": {
> +               "driver": "qcow2",
> +               "file": "file",
> +               "size": '$((64 * 1024 * 1024))',

Oh, the joys of nested quoting ;)

> +               "cluster-size": 512
> +           } } }' \
> +    '"concluded"'
> +
> +_send_qemu_cmd \
> +    $QEMU_HANDLE \
> +    '{ "execute": "job-dismiss", "arguments": { "id": "create" } }' \
> +    'return'
> +
> +_send_qemu_cmd \
> +    $QEMU_HANDLE \
> +    '{ "execute": "quit" }' \
> +    'return'
> +
> +wait=y _cleanup_qemu
> +echo
> +
> +# Write some data
> +$QEMU_IO -c 'write 0 64k' "$export_mp" | _filter_qemu_io
> +
> +# Corrupt the image by saying the image header were not allocated

s/were/was/

> +rt_offset=$(peek_file_be "$export_mp" 48 8)
> +rb_offset=$(peek_file_be "$export_mp" $rt_offset 8)
> +poke_file "$export_mp" $rb_offset "\x00\x00"
> +
> +# Repairing such a simple case should just work
> +# (We used to put the reftable at the end of the image file, which can
> +# never work for non-growable devices.)
> +echo
> +TEST_IMG="$export_mp" _check_test_img -r all \
> +    | grep -v '^Repairing cluster.*refcount=1 reference=0'
> +
> +if $loopdev; then
> +    sudo -n losetup -d "$export_mp"
> +else
> +    qsd_pid=$(cat "$TEST_DIR/qsd.pid")
> +    kill -TERM "$qsd_pid"
> +    # Wait for process to exit (cannot `wait` because the QSD is daemonized)
> +    while [ -f "$TEST_DIR/qsd.pid" ]; do
> +        true
> +    done
> +fi

[1] Okay, when the test normally passes, you ARE doing a safe cleanup,
and saving SIGKILL only for the really bad cases where the test fails,
and we don't care if even more gets stranded.

> +
>  # success, all done
>  echo '*** done'
>  rm -f $seq.full

Nice addition to the testsuite.

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 1/2] qcow2: Improve refcount structure rebuilding
  2022-03-30 14:32   ` Eric Blake
@ 2022-04-01 13:49     ` Hanna Reitz
  0 siblings, 0 replies; 7+ messages in thread
From: Hanna Reitz @ 2022-04-01 13:49 UTC (permalink / raw)
  To: Eric Blake; +Cc: Kevin Wolf, qemu-devel, qemu-block

On 30.03.22 16:32, Eric Blake wrote:
> On Tue, Mar 29, 2022 at 11:19:16AM +0200, Hanna Reitz wrote:
>> When rebuilding the refcount structures (when qemu-img check -r found
>> errors with refcount = 0, but reference count > 0), the new refcount
>> table defaults to being put at the image file end[1].  There is no good
>> reason for that except that it means we will not have to rewrite any
>> refblocks we already wrote to disk.
>>
>> Changing the code to rewrite those refblocks is not too difficult,
>> though, so let us do that.  That is beneficial for images on block
>> devices, where we cannot really write beyond the end of the image file.
>>
>> Use this opportunity to add extensive comments to the code, and refactor
>> it a bit, getting rid of the backwards-jumping goto.
>>
>> [1] Unless there is something allocated in the area pointed to by the
>>      last refblock, so we have to write that refblock.  In that case, we
>>      try to put the reftable in there.
>>
>> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1519071
>> Closes: https://gitlab.com/qemu-project/qemu/-/issues/941
>> Signed-off-by: Hanna Reitz <hreitz@redhat.com>
>> ---
>>   block/qcow2-refcount.c | 332 +++++++++++++++++++++++++++++------------
>>   1 file changed, 235 insertions(+), 97 deletions(-)
>>
>> +static int rebuild_refcounts_write_refblocks(
>> +        BlockDriverState *bs, void **refcount_table, int64_t *nb_clusters,
>> +        int64_t first_cluster, int64_t end_cluster,
>> +        uint64_t **on_disk_reftable_ptr, uint32_t *on_disk_reftable_entries_ptr
>> +    )
> As you are rewriting this into a helper function, should this function
> take Error **errp,...
>
>> +            /* Don't allocate a cluster in a refblock already written to disk */
>> +            if (first_free_cluster < refblock_start) {
>> +                first_free_cluster = refblock_start;
>> +            }
>> +            refblock_offset = alloc_clusters_imrt(bs, 1, refcount_table,
>> +                                                  nb_clusters,
>> +                                                  &first_free_cluster);
>> +            if (refblock_offset < 0) {
>> +                fprintf(stderr, "ERROR allocating refblock: %s\n",
>> +                        strerror(-refblock_offset));
>> +                return refblock_offset;
> ...instead of using fprintf(stderr), where the caller then handles the
> error by printing?
>
> Could be done as a separate patch.

Sounds good!  I don’t know whether a separate patch is better or not, 
but since you gave an R-b on this one as-is, I will send it as a 
separate patch (together in one v3 series, though).

>>   
>> +        /* Refblock is allocated, write it to disk */
>> +
>>           ret = qcow2_pre_write_overlap_check(bs, 0, refblock_offset,
>>                                               s->cluster_size, false);
>>           if (ret < 0) {
>>               fprintf(stderr, "ERROR writing refblock: %s\n", strerror(-ret));
>> -            goto fail;
>> +            return ret;
>>           }
> Another spot where errp conversion might improve the code.
>
>>   
>> -        /* The size of *refcount_table is always cluster-aligned, therefore the
>> -         * write operation will not overflow */
>> +        /*
>> +         * The refblock is simply a slice of *refcount_table.
>> +         * Note that the size of *refcount_table is always aligned to
>> +         * whole clusters, so the write operation will not result in
>> +         * out-of-bounds accesses.
>> +         */
>>           on_disk_refblock = (void *)((char *) *refcount_table +
>>                                       refblock_index * s->cluster_size);
>>   
>> @@ -2550,23 +2579,99 @@ write_refblocks:
>>                             s->cluster_size);
>>           if (ret < 0) {
>>               fprintf(stderr, "ERROR writing refblock: %s\n", strerror(-ret));
>> -            goto fail;
>> +            return ret;
>>           }
> and another
>
>>   
>> -        /* Go to the end of this refblock */
>> +        /* This refblock is done, skip to its end */
>>           cluster = refblock_start + s->refcount_block_size - 1;
>>       }
>>   
>> -    if (reftable_offset < 0) {
>> -        uint64_t post_refblock_start, reftable_clusters;
>> +    return reftable_grown;
>> +}
> The helper function looks okay.
>
>> +
>> +/*
>> + * Creates a new refcount structure based solely on the in-memory information
>> + * given through *refcount_table (this in-memory information is basically just
>> + * the concatenation of all refblocks).  All necessary allocations will be
>> + * reflected in that array.
>> + *
>> + * On success, the old refcount structure is leaked (it will be covered by the
>> + * new refcount structure).
>> + */
>> +static int rebuild_refcount_structure(BlockDriverState *bs,
>> +                                      BdrvCheckResult *res,
>> +                                      void **refcount_table,
>> +                                      int64_t *nb_clusters)
>> +{
>> +    BDRVQcow2State *s = bs->opaque;
>> +    int64_t reftable_offset = -1;
>> +    int64_t reftable_length = 0;
>> +    int64_t reftable_clusters;
>> +    int64_t refblock_index;
>> +    uint32_t on_disk_reftable_entries = 0;
>> +    uint64_t *on_disk_reftable = NULL;
>> +    int ret = 0;
>> +    int reftable_size_changed = 0;
>> +    struct {
>> +        uint64_t reftable_offset;
>> +        uint32_t reftable_clusters;
>> +    } QEMU_PACKED reftable_offset_and_clusters;
>> +
>> +    qcow2_cache_empty(bs, s->refcount_block_cache);
>> +
>> +    /*
>> +     * For each refblock containing entries, we try to allocate a
>> +     * cluster (in the in-memory refcount table) and write its offset
>> +     * into on_disk_reftable[].  We then write the whole refblock to
>> +     * disk (as a slice of the in-memory refcount table).
>> +     * This is done by rebuild_refcounts_write_refblocks().
>> +     *
>> +     * Once we have scanned all clusters, we try to find space for the
>> +     * reftable.  This will dirty the in-memory refcount table (i.e.
>> +     * make it differ from the refblocks we have already written), so we
>> +     * need to run rebuild_refcounts_write_refblocks() again for the
>> +     * range of clusters where the reftable has been allocated.
>> +     *
>> +     * This second run might make the reftable grow again, in which case
>> +     * we will need to allocate another space for it, which is why we
>> +     * repeat all this until the reftable stops growing.
>> +     *
>> +     * (This loop will terminate, because with every cluster the
>> +     * reftable grows, it can accommodate a multitude of more refcounts,
>> +     * so that at some point this must be able to cover the reftable
>> +     * and all refblocks describing it.)
>> +     *
>> +     * We then convert the reftable to big-endian and write it to disk.
>> +     *
>> +     * Note that we never free any reftable allocations.  Doing so would
>> +     * needlessly complicate the algorithm: The eventual second check
>> +     * run we do will clean up all leaks we have caused.
>> +     */
> Freeing reftable allocations from the first run might allow the second
> (or third) to find a spot earlier in the image that is large enough to
> contain the resized reftable, compared to leaving it leaked and
> forcing subsequent runs to look later into the image.  But I agree
> that the complication of code needed to handle that is not worth the
> minor corner-case savings of a more densely packed overall image (the
> leaked clusters will probably remain sparse for any decent cluster
> size).
>
> Another approach might be to intentionally over-allocate the reftable
> to the point where we know it can't grow, then allocate, then scale it
> back down to how much we actually used (possibly reclaiming a few
> clusters at the end of the allocation).  But that's an even bigger
> rewrite, and again, not worth the headache.
>
> 512-byte clusters would be where this is most likely to be noticeable
> (that is, hitting a 3rd iteration with 512-byte clusters is probably
> easy enough to actually test, but hitting a 3rd iteration with 2M
> clusters is probably prohibitively expensive if even possible).

Yes, I figured it’s complex enough as-is.  Perhaps later(tm) someone(tm) 
can try their hand at improving this. O:)
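
(For a rough sense of scale, taking cluster_size=512 with the default
refcount_bits=16, as in the test:

    refcounts per refblock:        512 / 2 = 256  ->  one refblock covers 128k
    reftable entries per cluster:  512 / 8 = 64   ->  one reftable cluster covers 8M

So every cluster by which the reftable grows lets it describe another 8M
of host clusters, of which the reftable and its refblocks themselves only
ever need a tiny fraction; that is why the loop converges after very few
iterations, and why a third iteration gets hard to provoke as the cluster
size grows.)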

>> +
>> +    reftable_size_changed =
>> +        rebuild_refcounts_write_refblocks(bs, refcount_table, nb_clusters,
>> +                                          0, *nb_clusters,
>> +                                          &on_disk_reftable,
>> +                                          &on_disk_reftable_entries);
>> +    if (reftable_size_changed < 0) {
>> +        res->check_errors++;
>> +        ret = reftable_size_changed;
>> +        goto fail;
>> +    }
>> +
>> +    /*
>> +     * There was no reftable before, so rebuild_refcounts_write_refblocks()
>> +     * must have increased its size (from 0 to something).
>> +     */
>> +    assert(reftable_size_changed == true);
> 'int == bool'.  Works, but is a bit odd.  I might have done just
> assert(reftable_size_changed), since we just ruled out negative values
> above.  Particularly since...

OK, sure.  (I guess I wanted to make the point that the return value on 
success is just a bool, hence the explicit `true` here, but if it 
doesn’t help, I’ll abbreviate it.)

>> +
>> +    do {
>> +        int64_t reftable_start_cluster, reftable_end_cluster;
>> +        int64_t first_free_cluster = 0;
> ...
>> +
>> +        /*
>> +         * If the reftable size has changed, we will need to find a new
>> +         * allocation, repeating the loop.
>> +         */
>> +    } while (reftable_size_changed);
> ...here you ARE using the int as a bool directly.
>
> Reviewed-by: Eric Blake <eblake@redhat.com>

Thanks a lot!

Hanna



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 2/2] iotests/108: Test new refcount rebuild algorithm
  2022-03-30 15:07   ` Eric Blake
@ 2022-04-01 13:55     ` Hanna Reitz
  0 siblings, 0 replies; 7+ messages in thread
From: Hanna Reitz @ 2022-04-01 13:55 UTC (permalink / raw)
  To: Eric Blake; +Cc: Kevin Wolf, qemu-devel, qemu-block

On 30.03.22 17:07, Eric Blake wrote:
> On Tue, Mar 29, 2022 at 11:19:17AM +0200, Hanna Reitz wrote:
>> One clear problem with how qcow2's refcount structure rebuild algorithm
>> used to be before "qcow2: Improve refcount structure rebuilding" was
>> that it is prone to failure for qcow2 images on block devices: There is
>> generally unused space after the actual image, and if that exceeds what
>> one refblock covers, the old algorithm would invariably write the
>> reftable past the block device's end, which cannot work.  The new
>> algorithm does not have this problem.
>>
>> Test it with three tests:
>> (1) Create an image with more empty space at the end than what one
>>      refblock covers, see whether rebuilding the refcount structures
>>      results in a change in the image file length.  (It should not.)
>>
>> (2) Leave precisely enough space somewhere at the beginning of the image
>>      for the new reftable (and the refblock for that place), see whether
>>      the new algorithm puts the reftable there.  (It should.)
>>
>> (3) Test the original problem: Create (something like) a block device
>>      with a fixed size, then create a qcow2 image in there, write some
>>      data, and then have qemu-img check rebuild the refcount structures.
>>      Before HEAD^, the reftable would have been written past the image
>>      file end, i.e. outside of what the block device provides, which
>>      cannot work.  HEAD^ should have fixed that.
>>      ("Something like a block device" means a loop device if we can use
>>      one ("sudo -n losetup" works), or a FUSE block export with
>>      growable=false otherwise.)
> NBD might work for that purpose as well.  But no need to rewrite your
> test for yet another block-alike behavior ;)
>
>> Signed-off-by: Hanna Reitz <hreitz@redhat.com>
>> ---
>>   tests/qemu-iotests/108     | 259 ++++++++++++++++++++++++++++++++++++-
>>   tests/qemu-iotests/108.out |  81 ++++++++++++
>>   2 files changed, 339 insertions(+), 1 deletion(-)

[...]

>> +if $loopdev; then
>> +    export_mp=$(sudo -n losetup --show -f "$TEST_IMG")
>> +    export_mp_driver=host_device
>> +    sudo -n chmod go+rw "$export_mp"
>> +else
>> +    # Create non-growable FUSE export that is a bit like an empty
>> +    # block device
>> +    export_mp="$TEST_DIR/fuse-export"
>> +    export_mp_driver=file
>> +    touch "$export_mp"
>> +
>> +    $QSD \
>> +        --blockdev file,node-name=export-node,filename="$TEST_IMG" \
>> +        --export fuse,id=fuse-export,node-name=export-node,mountpoint="$export_mp",writable=on,growable=off \
>> +        --pidfile "$TEST_DIR/qsd.pid" \
>> +        --daemonize
>> +fi
>> +
>> +# Now create a qcow2 image on the device -- unfortunately, qemu-img
>> +# create force-creates the file, so we have to resort to the
>> +# blockdev-create job.
> Huh. Why do we have 'qemu-img convert -n' but not 'qemu-img create -n'
> for writing a particular destination format into an already existing
> storage space?

`qemu-img convert -n` is actually different from what `qemu-img create 
-n` would be: the former does not create the image format layer either, 
so you can’t emulate the latter’s behavior with something like `qemu-img 
convert -n null-co:// target.qcow2`.

I think at some point I dabbled with a `qemu-img create -n`, but didn’t 
really get anywhere.  I don’t remember why; I think it had something to 
do with the fact that if you don’t want to create the protocol layer, 
you have to use the blockdev-create infrastructure and can’t use 
bdrv_img_create().  So it all became a bit more complicated than I’d 
hoped for.

Maybe a `run-job` subcommand for qemu-img would be a reasonable stop-gap.
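
That is, something that would wrap the sequence one currently has to go
through for formatting a pre-existing file (roughly what the test above
does over QMP, with "file" being the node-name of the already-open file
or device):

    { "execute": "blockdev-create",
      "arguments": {
          "job-id": "create",
          "options": {
              "driver": "qcow2",
              "file": "file",
              "size": 67108864,
              "cluster-size": 512
          } } }
    { "execute": "job-dismiss", "arguments": { "id": "create" } }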

(Again, thanks for the review!)

Hanna



^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-04-01 13:59 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-29  9:19 [PATCH v2 0/2] qcow2: Improve refcount structure rebuilding Hanna Reitz
2022-03-29  9:19 ` [PATCH v2 1/2] " Hanna Reitz
2022-03-30 14:32   ` Eric Blake
2022-04-01 13:49     ` Hanna Reitz
2022-03-29  9:19 ` [PATCH v2 2/2] iotests/108: Test new refcount rebuild algorithm Hanna Reitz
2022-03-30 15:07   ` Eric Blake
2022-04-01 13:55     ` Hanna Reitz
