All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH v2 0/4] Add shrink image for qcow2
@ 2017-06-13 12:16 Pavel Butsykin
  2017-06-13 12:16 ` [Qemu-devel] [PATCH v2 1/4] qemu-img: add --shrink flag for resize Pavel Butsykin
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Pavel Butsykin @ 2017-06-13 12:16 UTC (permalink / raw)
  To: qemu-block, qemu-devel; +Cc: kwolf, mreitz, eblake, armbru, pbutsykin, den

This patch add shrinking of the image file for qcow2. As a result, this allows
us to reduce the virtual image size and free up space on the disk without
copying the image. Image can be fragmented and shrink is done by punching holes
in the image file.

# ./qemu-img create -f qcow2 -o size=4G image.qcow2
Formatting 'image.qcow2', fmt=qcow2 size=4294967296 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16

# ./qemu-io -c "write -P 0x22 0 1G" image.qcow2
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 1 ops; 0:00:02.59 (395.180 MiB/sec and 0.3859 ops/sec)

# ./qemu-img resize image.qcow2 512M
Warning: shrinking of the image can lead to data loss. Before performing shrink operation you must make sure that the shrink part of image doesn't contain important data.
If you don't want to see this message use --shrink option.
Image resized.

# ./qemu-img resize --shrink image.qcow2 128M
Image resized.

# ./qemu-img info image.qcow2
image: image.qcow2
file format: qcow2
virtual size: 128M (134217728 bytes)
disk size: 128M
cluster_size: 65536
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

# du -h ./image.qcow2 
129M    ./image.qcow2

Changes from v1:
- add --shrink flag for qemu-img resize
- add qcow2_cache_discard
- simplify qcow2_shrink_l1_table() to reduce the likelihood of image corruption
- add new qemu-iotests for shrinking images

Pavel Butsykin (4):
  qemu-img: add --shrink flag for resize
  qcow2: add qcow2_cache_discard
  qcow2: add shrink image support
  qemu-iotests: add shrinking image test

 block/qcow2-cache.c        |  21 +++++++++
 block/qcow2-cluster.c      |  42 +++++++++++++++++
 block/qcow2-refcount.c     |  70 ++++++++++++++++++++++++++++
 block/qcow2.c              |  40 ++++++++++++----
 block/qcow2.h              |   3 ++
 qapi/block-core.json       |   3 +-
 qemu-img-cmds.hx           |   4 +-
 qemu-img.c                 |  15 ++++++
 qemu-img.texi              |   5 +-
 tests/qemu-iotests/102     |   4 +-
 tests/qemu-iotests/163     | 113 +++++++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/163.out |   5 ++
 tests/qemu-iotests/group   |   1 +
 13 files changed, 310 insertions(+), 16 deletions(-)
 create mode 100644 tests/qemu-iotests/163
 create mode 100644 tests/qemu-iotests/163.out

-- 
2.13.0

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH v2 1/4] qemu-img: add --shrink flag for resize
  2017-06-13 12:16 [Qemu-devel] [PATCH v2 0/4] Add shrink image for qcow2 Pavel Butsykin
@ 2017-06-13 12:16 ` Pavel Butsykin
  2017-06-21 22:17   ` Max Reitz
  2017-06-13 12:16 ` [Qemu-devel] [PATCH v2 2/4] qcow2: add qcow2_cache_discard Pavel Butsykin
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Pavel Butsykin @ 2017-06-13 12:16 UTC (permalink / raw)
  To: qemu-block, qemu-devel; +Cc: kwolf, mreitz, eblake, armbru, pbutsykin, den

The flag as additional precaution of data loss. Perhaps in the future the
operation shrink without this flag will be banned, but while we need to
maintain compatibility.

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 qemu-img-cmds.hx       |  4 ++--
 qemu-img.c             | 15 +++++++++++++++
 qemu-img.texi          |  5 ++++-
 tests/qemu-iotests/102 |  4 ++--
 4 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index a39fcdba71..3b2eab9d20 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -76,9 +76,9 @@ STEXI
 ETEXI
 
 DEF("resize", img_resize,
-    "resize [--object objectdef] [--image-opts] [-q] filename [+ | -]size")
+    "resize [--object objectdef] [--image-opts] [-q] [--shrink] filename [+ | -]size")
 STEXI
-@item resize [--object @var{objectdef}] [--image-opts] [-q] @var{filename} [+ | -]@var{size}
+@item resize [--object @var{objectdef}] [--image-opts] [-q] [--shrink] @var{filename} [+ | -]@var{size}
 ETEXI
 
 DEF("amend", img_amend,
diff --git a/qemu-img.c b/qemu-img.c
index 0ad698d7f1..bfe5f61b0b 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -61,6 +61,7 @@ enum {
     OPTION_FLUSH_INTERVAL = 261,
     OPTION_NO_DRAIN = 262,
     OPTION_TARGET_IMAGE_OPTS = 263,
+    OPTION_SHRINK = 264,
 };
 
 typedef enum OutputFormat {
@@ -3452,6 +3453,7 @@ static int img_resize(int argc, char **argv)
         },
     };
     bool image_opts = false;
+    bool shrink = false;
 
     /* Remove size from argv manually so that negative numbers are not treated
      * as options by getopt. */
@@ -3469,6 +3471,7 @@ static int img_resize(int argc, char **argv)
             {"help", no_argument, 0, 'h'},
             {"object", required_argument, 0, OPTION_OBJECT},
             {"image-opts", no_argument, 0, OPTION_IMAGE_OPTS},
+            {"shrink", no_argument, 0, OPTION_SHRINK},
             {0, 0, 0, 0}
         };
         c = getopt_long(argc, argv, ":f:hq",
@@ -3503,6 +3506,9 @@ static int img_resize(int argc, char **argv)
         case OPTION_IMAGE_OPTS:
             image_opts = true;
             break;
+        case OPTION_SHRINK:
+            shrink = true;
+            break;
         }
     }
     if (optind != argc - 1) {
@@ -3562,6 +3568,15 @@ static int img_resize(int argc, char **argv)
         goto out;
     }
 
+    if (total_size < blk_getlength(blk) && !shrink) {
+        qprintf(quiet, "Warning: shrinking of the image can lead to data loss. "
+                       "Before performing shrink operation you must make sure "
+                       "that the shrink part of image doesn't contain important"
+                       " data.\n");
+        qprintf(quiet,
+                "If you don't want to see this message use --shrink option.\n");
+    }
+
     ret = blk_truncate(blk, total_size, &err);
     if (!ret) {
         qprintf(quiet, "Image resized.\n");
diff --git a/qemu-img.texi b/qemu-img.texi
index 5b925ecf41..c2b694cd00 100644
--- a/qemu-img.texi
+++ b/qemu-img.texi
@@ -499,7 +499,7 @@ qemu-img rebase -b base.img diff.qcow2
 At this point, @code{modified.img} can be discarded, since
 @code{base.img + diff.qcow2} contains the same information.
 
-@item resize @var{filename} [+ | -]@var{size}
+@item resize [--shrink] @var{filename} [+ | -]@var{size}
 
 Change the disk image as if it had been created with @var{size}.
 
@@ -507,6 +507,9 @@ Before using this command to shrink a disk image, you MUST use file system and
 partitioning tools inside the VM to reduce allocated file systems and partition
 sizes accordingly.  Failure to do so will result in data loss!
 
+If @code{--shrink} is specified, warning about data loss doesn't print for
+the shrink operation.
+
 After using this command to grow a disk image, you must use file system and
 partitioning tools inside the VM to actually begin using the new space on the
 device.
diff --git a/tests/qemu-iotests/102 b/tests/qemu-iotests/102
index 87db1bb1bf..d7ad8d9840 100755
--- a/tests/qemu-iotests/102
+++ b/tests/qemu-iotests/102
@@ -54,7 +54,7 @@ _make_test_img $IMG_SIZE
 $QEMU_IO -c 'write 0 64k' "$TEST_IMG" | _filter_qemu_io
 # Remove data cluster from image (first cluster: image header, second: reftable,
 # third: refblock, fourth: L1 table, fifth: L2 table)
-$QEMU_IMG resize -f raw "$TEST_IMG" $((5 * 64 * 1024))
+$QEMU_IMG resize -f raw --shrink "$TEST_IMG" $((5 * 64 * 1024))
 
 $QEMU_IO -c map "$TEST_IMG"
 $QEMU_IMG map "$TEST_IMG"
@@ -69,7 +69,7 @@ $QEMU_IO -c 'write 0 64k' "$TEST_IMG" | _filter_qemu_io
 
 qemu_comm_method=monitor _launch_qemu -drive if=none,file="$TEST_IMG",id=drv0
 
-$QEMU_IMG resize -f raw "$TEST_IMG" $((5 * 64 * 1024))
+$QEMU_IMG resize -f raw --shrink "$TEST_IMG" $((5 * 64 * 1024))
 
 _send_qemu_cmd $QEMU_HANDLE 'qemu-io drv0 map' 'allocated' \
     | sed -e 's/^(qemu).*qemu-io drv0 map...$/(qemu) qemu-io drv0 map/'
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH v2 2/4] qcow2: add qcow2_cache_discard
  2017-06-13 12:16 [Qemu-devel] [PATCH v2 0/4] Add shrink image for qcow2 Pavel Butsykin
  2017-06-13 12:16 ` [Qemu-devel] [PATCH v2 1/4] qemu-img: add --shrink flag for resize Pavel Butsykin
@ 2017-06-13 12:16 ` Pavel Butsykin
  2017-06-21 22:29   ` Max Reitz
  2017-06-13 12:16 ` [Qemu-devel] [PATCH v2 3/4] qcow2: add shrink image support Pavel Butsykin
  2017-06-13 12:16 ` [Qemu-devel] [PATCH v2 4/4] qemu-iotests: add shrinking image test Pavel Butsykin
  3 siblings, 1 reply; 21+ messages in thread
From: Pavel Butsykin @ 2017-06-13 12:16 UTC (permalink / raw)
  To: qemu-block, qemu-devel; +Cc: kwolf, mreitz, eblake, armbru, pbutsykin, den

Whenever l2/refcount table clusters are discarded from the file we can
automatically drop unnecessary content of the cache tables. This reduces
the chance of eviction useful cache data and eliminates inconsistent data
in thecache with the data in the file.

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/qcow2-cache.c    | 21 +++++++++++++++++++++
 block/qcow2-refcount.c |  5 +++++
 block/qcow2.h          |  1 +
 3 files changed, 27 insertions(+)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 1d25147392..7931edf237 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -411,3 +411,24 @@ void qcow2_cache_entry_mark_dirty(BlockDriverState *bs, Qcow2Cache *c,
     assert(c->entries[i].offset != 0);
     c->entries[i].dirty = true;
 }
+
+void qcow2_cache_discard(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset)
+{
+    int i;
+
+    for (i = 0; i < c->size; i++) {
+        if (c->entries[i].offset == offset) {
+            goto found; /* table offset */
+        }
+    }
+    return;
+
+found:
+    assert(c->entries[i].ref == 0);
+
+    c->entries[i].offset = 0;
+    c->entries[i].lru_counter = 0;
+    c->entries[i].dirty = false;
+
+    qcow2_cache_table_release(bs, c, i, 1);
+}
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 7c06061aae..576ab551d6 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -767,6 +767,11 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
         s->set_refcount(refcount_block, block_index, refcount);
 
         if (refcount == 0 && s->discard_passthrough[type]) {
+            qcow2_cache_put(bs, s->refcount_block_cache, &refcount_block);
+            qcow2_cache_discard(bs, s->refcount_block_cache, offset);
+
+            qcow2_cache_discard(bs, s->l2_table_cache, offset);
+
             update_refcount_discard(bs, cluster_offset, s->cluster_size);
         }
     }
diff --git a/block/qcow2.h b/block/qcow2.h
index 1801dc30dc..07faa6dc78 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -597,5 +597,6 @@ int qcow2_cache_get(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset,
 int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset,
     void **table);
 void qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table);
+void qcow2_cache_discard(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset);
 
 #endif
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH v2 3/4] qcow2: add shrink image support
  2017-06-13 12:16 [Qemu-devel] [PATCH v2 0/4] Add shrink image for qcow2 Pavel Butsykin
  2017-06-13 12:16 ` [Qemu-devel] [PATCH v2 1/4] qemu-img: add --shrink flag for resize Pavel Butsykin
  2017-06-13 12:16 ` [Qemu-devel] [PATCH v2 2/4] qcow2: add qcow2_cache_discard Pavel Butsykin
@ 2017-06-13 12:16 ` Pavel Butsykin
  2017-06-21 22:55   ` Max Reitz
  2017-06-13 12:16 ` [Qemu-devel] [PATCH v2 4/4] qemu-iotests: add shrinking image test Pavel Butsykin
  3 siblings, 1 reply; 21+ messages in thread
From: Pavel Butsykin @ 2017-06-13 12:16 UTC (permalink / raw)
  To: qemu-block, qemu-devel; +Cc: kwolf, mreitz, eblake, armbru, pbutsykin, den

This patch add shrinking of the image file for qcow2. As a result, this allows
us to reduce the virtual image size and free up space on the disk without
copying the image. Image can be fragmented and shrink is done by punching holes
in the image file.

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 block/qcow2-cluster.c  | 42 ++++++++++++++++++++++++++++++++
 block/qcow2-refcount.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++
 block/qcow2.c          | 40 +++++++++++++++++++++++--------
 block/qcow2.h          |  2 ++
 qapi/block-core.json   |  3 ++-
 5 files changed, 141 insertions(+), 11 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index d779ea19cf..a84b7e607e 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -32,6 +32,48 @@
 #include "qemu/bswap.h"
 #include "trace.h"
 
+int qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t max_size)
+{
+    BDRVQcow2State *s = bs->opaque;
+    int new_l1_size, i, ret;
+
+    if (max_size >= s->l1_size) {
+        return 0;
+    }
+
+    new_l1_size = max_size;
+
+#ifdef DEBUG_ALLOC2
+    fprintf(stderr, "shrink l1_table from %d to %" PRId64 "\n",
+            s->l1_size, new_l1_size);
+#endif
+
+    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_WRITE_TABLE);
+    ret = bdrv_pwrite_zeroes(bs->file, s->l1_table_offset +
+                                       sizeof(uint64_t) * new_l1_size,
+                             (s->l1_size - new_l1_size) * sizeof(uint64_t), 0);
+    if (ret < 0) {
+        return ret;
+    }
+
+    ret = bdrv_flush(bs->file->bs);
+    if (ret < 0) {
+        return ret;
+    }
+
+    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_FREE_L2_CLUSTERS);
+    for (i = s->l1_size - 1; i > new_l1_size - 1; i--) {
+        if ((s->l1_table[i] & L1E_OFFSET_MASK) == 0) {
+            continue;
+        }
+        qcow2_free_clusters(bs, s->l1_table[i] & L1E_OFFSET_MASK,
+                            s->l2_size * sizeof(uint64_t),
+                            QCOW2_DISCARD_ALWAYS);
+        s->l1_table[i] = 0;
+    }
+    return 0;
+}
+
 int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
                         bool exact_size)
 {
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 576ab551d6..e98306acd8 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -29,6 +29,7 @@
 #include "block/qcow2.h"
 #include "qemu/range.h"
 #include "qemu/bswap.h"
+#include "qemu/cutils.h"
 
 static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size);
 static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
@@ -2936,3 +2937,67 @@ done:
     qemu_vfree(new_refblock);
     return ret;
 }
+
+int qcow2_shrink_reftable(BlockDriverState *bs)
+{
+    BDRVQcow2State *s = bs->opaque;
+    uint64_t *reftable_tmp =
+        g_try_malloc(sizeof(uint64_t) * s->refcount_table_size);
+    int i, ret;
+
+    if (s->refcount_table_size && reftable_tmp == NULL) {
+        return -ENOMEM;
+    }
+
+    for (i = 0; i < s->refcount_table_size; i++) {
+        int64_t refblock_offs = s->refcount_table[i] & REFT_OFFSET_MASK;
+        void *refblock;
+        bool unused_block;
+
+        if (refblock_offs == 0) {
+            reftable_tmp[i] = 0;
+            continue;
+        }
+        ret = qcow2_cache_get(bs, s->refcount_block_cache, refblock_offs,
+                              &refblock);
+        if (ret < 0) {
+            goto out;
+        }
+
+        /* the refblock has own reference */
+        if (i == refblock_offs >> (s->refcount_block_bits + s->cluster_bits)) {
+            uint64_t blk_index = (refblock_offs >> s->cluster_bits) &
+                                 (s->refcount_block_size - 1);
+            uint64_t refcount = s->get_refcount(refblock, blk_index);
+
+            s->set_refcount(refblock, blk_index, 0);
+
+            unused_block = buffer_is_zero(refblock, s->refcount_block_size);
+
+            s->set_refcount(refblock, blk_index, refcount);
+        } else {
+            unused_block = buffer_is_zero(refblock, s->refcount_block_size);
+        }
+        qcow2_cache_put(bs, s->refcount_block_cache, &refblock);
+
+        reftable_tmp[i] = unused_block ? 0 : cpu_to_be64(s->refcount_table[i]);
+    }
+
+    ret = bdrv_pwrite_sync(bs->file, s->refcount_table_offset, reftable_tmp,
+                           sizeof(uint64_t) * s->refcount_table_size);
+    if (ret < 0) {
+        goto out;
+    }
+
+    for (i = 0; i < s->refcount_table_size; i++) {
+        if (s->refcount_table[i] && !reftable_tmp[i]) {
+            qcow2_free_clusters(bs, s->refcount_table[i] & REFT_OFFSET_MASK,
+                                s->cluster_size, QCOW2_DISCARD_ALWAYS);
+            s->refcount_table[i] = 0;
+        }
+    }
+
+out:
+    g_free(reftable_tmp);
+    return ret;
+}
diff --git a/block/qcow2.c b/block/qcow2.c
index b3ba5daa93..0ad46d2776 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2545,6 +2545,7 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset, Error **errp)
 {
     BDRVQcow2State *s = bs->opaque;
     int64_t new_l1_size;
+    uint64_t total_size;
     int ret;
 
     if (offset & 511) {
@@ -2558,17 +2559,36 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset, Error **errp)
         return -ENOTSUP;
     }
 
-    /* shrinking is currently not supported */
-    if (offset < bs->total_sectors * 512) {
-        error_setg(errp, "qcow2 doesn't support shrinking images yet");
-        return -ENOTSUP;
-    }
-
     new_l1_size = size_to_l1(s, offset);
-    ret = qcow2_grow_l1_table(bs, new_l1_size, true);
-    if (ret < 0) {
-        error_setg_errno(errp, -ret, "Failed to grow the L1 table");
-        return ret;
+    total_size = bs->total_sectors << BDRV_SECTOR_BITS;
+
+    if (offset < total_size) {
+        ret = qcow2_cluster_discard(bs, ROUND_UP(offset, s->cluster_size),
+                                    total_size - ROUND_UP(offset,
+                                                          s->cluster_size),
+                                    QCOW2_DISCARD_ALWAYS, true);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "Failed to discard reduced clasters");
+            return ret;
+        }
+
+        ret = qcow2_shrink_l1_table(bs, new_l1_size);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "Failed to reduce the L1 table");
+            return ret;
+        }
+
+        ret = qcow2_shrink_reftable(bs);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "Failed to shrink the refcount table");
+            return ret;
+        }
+    } else {
+        ret = qcow2_grow_l1_table(bs, new_l1_size, true);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "Failed to grow the L1 table");
+            return ret;
+        }
     }
 
     /* write updated header.size */
diff --git a/block/qcow2.h b/block/qcow2.h
index 07faa6dc78..600463bf8e 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -531,10 +531,12 @@ int qcow2_pre_write_overlap_check(BlockDriverState *bs, int ign, int64_t offset,
 int qcow2_change_refcount_order(BlockDriverState *bs, int refcount_order,
                                 BlockDriverAmendStatusCB *status_cb,
                                 void *cb_opaque, Error **errp);
+int qcow2_shrink_reftable(BlockDriverState *bs);
 
 /* qcow2-cluster.c functions */
 int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
                         bool exact_size);
+int qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t max_size);
 int qcow2_write_l1_entry(BlockDriverState *bs, int l1_index);
 int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset);
 int qcow2_encrypt_sectors(BDRVQcow2State *s, int64_t sector_num,
diff --git a/qapi/block-core.json b/qapi/block-core.json
index f85c2235c7..bcbffa3339 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2372,7 +2372,8 @@
             'cluster_alloc_bytes', 'cluster_free', 'flush_to_os',
             'flush_to_disk', 'pwritev_rmw_head', 'pwritev_rmw_after_head',
             'pwritev_rmw_tail', 'pwritev_rmw_after_tail', 'pwritev',
-            'pwritev_zero', 'pwritev_done', 'empty_image_prepare' ] }
+            'pwritev_zero', 'pwritev_done', 'empty_image_prepare',
+            'l1_shrink_write_table', 'l1_shrink_free_l2_clusters' ] }
 
 ##
 # @BlkdebugInjectErrorOptions:
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH v2 4/4] qemu-iotests: add shrinking image test
  2017-06-13 12:16 [Qemu-devel] [PATCH v2 0/4] Add shrink image for qcow2 Pavel Butsykin
                   ` (2 preceding siblings ...)
  2017-06-13 12:16 ` [Qemu-devel] [PATCH v2 3/4] qcow2: add shrink image support Pavel Butsykin
@ 2017-06-13 12:16 ` Pavel Butsykin
  2017-06-23 15:47   ` Max Reitz
  3 siblings, 1 reply; 21+ messages in thread
From: Pavel Butsykin @ 2017-06-13 12:16 UTC (permalink / raw)
  To: qemu-block, qemu-devel; +Cc: kwolf, mreitz, eblake, armbru, pbutsykin, den

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
---
 tests/qemu-iotests/163     | 113 +++++++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/163.out |   5 ++
 tests/qemu-iotests/group   |   1 +
 3 files changed, 119 insertions(+)
 create mode 100644 tests/qemu-iotests/163
 create mode 100644 tests/qemu-iotests/163.out

diff --git a/tests/qemu-iotests/163 b/tests/qemu-iotests/163
new file mode 100644
index 0000000000..2cb0116173
--- /dev/null
+++ b/tests/qemu-iotests/163
@@ -0,0 +1,113 @@
+#!/usr/bin/env python
+#
+# Tests for shrinking images
+#
+# Copyright (c) 2016-2017 Parallels International GmbH
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import os, random, iotests
+from iotests import qemu_img, qemu_io, image_size
+
+test_img = os.path.join(iotests.test_dir, 'test.img')
+check_img = os.path.join(iotests.test_dir, 'check.img')
+
+def size_to_int(str):
+    suff = ['B', 'K', 'M', 'G', 'T']
+    return int(str[:-1]) * 1024**suff.index(str[-1:])
+
+class TestShrink(iotests.QMPTestCase):
+    image_len = '1G'
+    shrink_size = '128M'
+    chank_size = '256M'
+
+    def setUp(self):
+        qemu_img('create', '-f', iotests.imgfmt, test_img, TestShrink.image_len)
+        qemu_img('create', '-f', iotests.imgfmt, check_img,
+                 TestShrink.shrink_size)
+
+    def tearDown(self):
+        os.remove(test_img)
+        os.remove(check_img)
+
+    def image_verify(self):
+        self.assertEqual(image_size(test_img), image_size(check_img),
+                         "Verifying image size")
+
+        if iotests.imgfmt == 'raw':
+            return
+
+        self.assertEqual(qemu_img('check', test_img),
+                         qemu_img('check', check_img),
+                         "Verifying image corruption")
+
+    def test_empty_image(self):
+        qemu_img('resize',  '-f', iotests.imgfmt, '--shrink', test_img,
+                 TestShrink.shrink_size)
+
+        self.assertEqual(
+            qemu_io('-c', 'read -P 0x00 %s'%TestShrink.shrink_size, test_img),
+            qemu_io('-c', 'read -P 0x00 %s'%TestShrink.shrink_size, check_img),
+            "Verifying image content")
+
+        TestShrink.image_verify(self)
+
+    def test_sequential_write(self):
+        for offs in range(0, size_to_int(TestShrink.image_len),
+                          size_to_int(TestShrink.chank_size)):
+            qemu_io('-c', 'write -P 0xff %d %s' % (offs, TestShrink.chank_size),
+                    test_img)
+
+        qemu_img('resize',  '-f', iotests.imgfmt, '--shrink', test_img,
+                 TestShrink.shrink_size)
+
+        self.assertEqual(
+            qemu_io('-c', 'read -P 0xff %s'%TestShrink.image_len, test_img),
+            qemu_io('-c', 'read -P 0xff %s'%TestShrink.image_len, check_img),
+            "Verifying image content")
+
+        self.assertEqual(
+            qemu_io('-c', 'read -P 0xff %s'%TestShrink.shrink_size, test_img),
+            qemu_io('-c', 'read -P 0xff %s'%TestShrink.shrink_size, check_img),
+            "Verifying image content")
+
+        TestShrink.image_verify(self)
+
+    def test_random_write(self):
+        offs_list = range(0, size_to_int(TestShrink.image_len),
+                          size_to_int(TestShrink.chank_size))
+        random.shuffle(offs_list)
+        for offs in offs_list:
+            qemu_io('-c', 'write -P 0xff %d %s' % (offs, TestShrink.chank_size),
+                    test_img)
+
+        qemu_img('resize',  '-f', iotests.imgfmt, '--shrink', test_img,
+                 TestShrink.shrink_size)
+
+        self.assertEqual(
+            qemu_io('-c', 'read -P 0xff %s'%TestShrink.image_len, test_img),
+            qemu_io('-c', 'read -P 0xff %s'%TestShrink.image_len, check_img),
+            "Verifying image content")
+
+        self.assertEqual(
+            qemu_io('-c', 'read -P 0xff %s'%TestShrink.shrink_size, test_img),
+            qemu_io('-c', 'read -P 0xff %s'%TestShrink.shrink_size, check_img),
+            "Verifying image content")
+
+        TestShrink.image_verify(self)
+
+
+if __name__ == '__main__':
+    iotests.main(supported_fmts=['raw', 'qcow2'])
diff --git a/tests/qemu-iotests/163.out b/tests/qemu-iotests/163.out
new file mode 100644
index 0000000000..8d7e996700
--- /dev/null
+++ b/tests/qemu-iotests/163.out
@@ -0,0 +1,5 @@
+...
+----------------------------------------------------------------------
+Ran 3 tests
+
+OK
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index 5c8ea0f95c..a2f42e7165 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -163,6 +163,7 @@
 159 rw auto quick
 160 rw auto quick
 162 auto quick
+163 rw auto quick
 170 rw auto quick
 171 rw auto quick
 172 auto
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH v2 1/4] qemu-img: add --shrink flag for resize
  2017-06-13 12:16 ` [Qemu-devel] [PATCH v2 1/4] qemu-img: add --shrink flag for resize Pavel Butsykin
@ 2017-06-21 22:17   ` Max Reitz
  2017-06-22 13:54     ` Pavel Butsykin
  0 siblings, 1 reply; 21+ messages in thread
From: Max Reitz @ 2017-06-21 22:17 UTC (permalink / raw)
  To: Pavel Butsykin, qemu-block, qemu-devel; +Cc: kwolf, eblake, armbru, den

[-- Attachment #1: Type: text/plain, Size: 5805 bytes --]

On 2017-06-13 14:16, Pavel Butsykin wrote:
> The flag as additional precaution of data loss. Perhaps in the future the
> operation shrink without this flag will be banned, but while we need to
> maintain compatibility.
> 
> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
> ---
>  qemu-img-cmds.hx       |  4 ++--
>  qemu-img.c             | 15 +++++++++++++++
>  qemu-img.texi          |  5 ++++-
>  tests/qemu-iotests/102 |  4 ++--
>  4 files changed, 23 insertions(+), 5 deletions(-)
> 
> diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
> index a39fcdba71..3b2eab9d20 100644
> --- a/qemu-img-cmds.hx
> +++ b/qemu-img-cmds.hx
> @@ -76,9 +76,9 @@ STEXI
>  ETEXI
>  
>  DEF("resize", img_resize,
> -    "resize [--object objectdef] [--image-opts] [-q] filename [+ | -]size")
> +    "resize [--object objectdef] [--image-opts] [-q] [--shrink] filename [+ | -]size")
>  STEXI
> -@item resize [--object @var{objectdef}] [--image-opts] [-q] @var{filename} [+ | -]@var{size}
> +@item resize [--object @var{objectdef}] [--image-opts] [-q] [--shrink] @var{filename} [+ | -]@var{size}
>  ETEXI
>  
>  DEF("amend", img_amend,
> diff --git a/qemu-img.c b/qemu-img.c
> index 0ad698d7f1..bfe5f61b0b 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -61,6 +61,7 @@ enum {
>      OPTION_FLUSH_INTERVAL = 261,
>      OPTION_NO_DRAIN = 262,
>      OPTION_TARGET_IMAGE_OPTS = 263,
> +    OPTION_SHRINK = 264,
>  };
>  
>  typedef enum OutputFormat {
> @@ -3452,6 +3453,7 @@ static int img_resize(int argc, char **argv)
>          },
>      };
>      bool image_opts = false;
> +    bool shrink = false;
>  
>      /* Remove size from argv manually so that negative numbers are not treated
>       * as options by getopt. */
> @@ -3469,6 +3471,7 @@ static int img_resize(int argc, char **argv)
>              {"help", no_argument, 0, 'h'},
>              {"object", required_argument, 0, OPTION_OBJECT},
>              {"image-opts", no_argument, 0, OPTION_IMAGE_OPTS},
> +            {"shrink", no_argument, 0, OPTION_SHRINK},
>              {0, 0, 0, 0}
>          };
>          c = getopt_long(argc, argv, ":f:hq",
> @@ -3503,6 +3506,9 @@ static int img_resize(int argc, char **argv)
>          case OPTION_IMAGE_OPTS:
>              image_opts = true;
>              break;
> +        case OPTION_SHRINK:
> +            shrink = true;
> +            break;
>          }
>      }
>      if (optind != argc - 1) {
> @@ -3562,6 +3568,15 @@ static int img_resize(int argc, char **argv)
>          goto out;
>      }
>  
> +    if (total_size < blk_getlength(blk) && !shrink) {
> +        qprintf(quiet, "Warning: shrinking of the image can lead to data loss. "

I think this should always be printed to stderr, even if quiet is true;
especially considering we (or at least I) plan to make it mandatory.

> +                       "Before performing shrink operation you must make sure "

*Before performaing a shrink operation

Also, I'd rather use the imperative: "Before ... operation, make sure
that the..."

(English isn't my native language either, but as far as I remember
"must" is generally used for external influences. But it's not like a
force of nature is forcing you to confirm there's no important data; you
can just ignore this advice and see all of your non-backed-up childhood
pictures go to /dev/null.)

> +                       "that the shrink part of image doesn't contain important"

Hmm... I don't think "shrink part" really works.

Maybe the following would work better:

  Warning: Shrinking an image will delete all data beyond the shrunken
  image's end. Before performing such an operation, make sure there is
  no important data there.

> +                       " data.\n");
> +        qprintf(quiet,
> +                "If you don't want to see this message use --shrink option.\n");

You should make a note that --shrink may (and I hope it will) become
necessary in the future, like:

  Using the --shrink option will suppress this message. Note that future
  versions of qemu-img may refuse to shrink images without this option!

> +    }
> +
>      ret = blk_truncate(blk, total_size, &err);
>      if (!ret) {
>          qprintf(quiet, "Image resized.\n");
> diff --git a/qemu-img.texi b/qemu-img.texi
> index 5b925ecf41..c2b694cd00 100644
> --- a/qemu-img.texi
> +++ b/qemu-img.texi
> @@ -499,7 +499,7 @@ qemu-img rebase -b base.img diff.qcow2
>  At this point, @code{modified.img} can be discarded, since
>  @code{base.img + diff.qcow2} contains the same information.
>  
> -@item resize @var{filename} [+ | -]@var{size}
> +@item resize [--shrink] @var{filename} [+ | -]@var{size}
>  
>  Change the disk image as if it had been created with @var{size}.
>  
> @@ -507,6 +507,9 @@ Before using this command to shrink a disk image, you MUST use file system and
>  partitioning tools inside the VM to reduce allocated file systems and partition
>  sizes accordingly.  Failure to do so will result in data loss!
>  
> +If @code{--shrink} is specified, warning about data loss doesn't print for
> +the shrink operation.
> +

Maybe rather:

@code{--shrink} informs qemu-img that the user is certain about wanting
to shrink an image and is aware that any data beyond the truncated
image's end will be lost. Trying to shrink an image without this option
results in a warning; future versions may make it an error.

The functional changes look good to me; even though I'd rather have it
an error for qcow2 now (even if that means having to check the image
format in img_resize(), and being inconsistent because you wouldn't need
--shrink for raw, but for qcow2 you would). But, well, I'm not going to
stop this series over that.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 498 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/4] qcow2: add qcow2_cache_discard
  2017-06-13 12:16 ` [Qemu-devel] [PATCH v2 2/4] qcow2: add qcow2_cache_discard Pavel Butsykin
@ 2017-06-21 22:29   ` Max Reitz
  2017-06-22 13:55     ` Pavel Butsykin
  0 siblings, 1 reply; 21+ messages in thread
From: Max Reitz @ 2017-06-21 22:29 UTC (permalink / raw)
  To: Pavel Butsykin, qemu-block, qemu-devel; +Cc: kwolf, eblake, armbru, den

[-- Attachment #1: Type: text/plain, Size: 3249 bytes --]

On 2017-06-13 14:16, Pavel Butsykin wrote:
> Whenever l2/refcount table clusters are discarded from the file we can
> automatically drop unnecessary content of the cache tables. This reduces
> the chance of eviction useful cache data and eliminates inconsistent data
> in thecache with the data in the file.
> 
> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
> ---
>  block/qcow2-cache.c    | 21 +++++++++++++++++++++
>  block/qcow2-refcount.c |  5 +++++
>  block/qcow2.h          |  1 +
>  3 files changed, 27 insertions(+)
> 
> diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
> index 1d25147392..7931edf237 100644
> --- a/block/qcow2-cache.c
> +++ b/block/qcow2-cache.c
> @@ -411,3 +411,24 @@ void qcow2_cache_entry_mark_dirty(BlockDriverState *bs, Qcow2Cache *c,
>      assert(c->entries[i].offset != 0);
>      c->entries[i].dirty = true;
>  }
> +
> +void qcow2_cache_discard(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset)
> +{
> +    int i;
> +
> +    for (i = 0; i < c->size; i++) {
> +        if (c->entries[i].offset == offset) {
> +            goto found; /* table offset */
> +        }
> +    }
> +    return;
> +
> +found:
> +    assert(c->entries[i].ref == 0);
> +
> +    c->entries[i].offset = 0;
> +    c->entries[i].lru_counter = 0;
> +    c->entries[i].dirty = false;
> +
> +    qcow2_cache_table_release(bs, c, i, 1);
> +}
> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
> index 7c06061aae..576ab551d6 100644
> --- a/block/qcow2-refcount.c
> +++ b/block/qcow2-refcount.c
> @@ -767,6 +767,11 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>          s->set_refcount(refcount_block, block_index, refcount);
>  
>          if (refcount == 0 && s->discard_passthrough[type]) {
> +            qcow2_cache_put(bs, s->refcount_block_cache, &refcount_block);

I don't like this very much. It works, but it feels bad.

Would it be possible to store this refblock's offset somewhere and only
put it back if @offset is equal to that?

> +            qcow2_cache_discard(bs, s->refcount_block_cache, offset);
> +
> +            qcow2_cache_discard(bs, s->l2_table_cache, offset);
> +

So you're blindly calling qcow2_cache_discard() on @offset because
@offset may be pointing to a refblock or an L2 table? Right, that works,
but it still deserves a comment, I think (that we have no idea whether
@offset actually points to any of these refcount structures, but that we
also just don't have to care).

Looks OK to me, functionally, but I'd at least like to have that comment.

Max

>              update_refcount_discard(bs, cluster_offset, s->cluster_size);
>          }
>      }
> diff --git a/block/qcow2.h b/block/qcow2.h
> index 1801dc30dc..07faa6dc78 100644
> --- a/block/qcow2.h
> +++ b/block/qcow2.h
> @@ -597,5 +597,6 @@ int qcow2_cache_get(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset,
>  int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset,
>      void **table);
>  void qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table);
> +void qcow2_cache_discard(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset);
>  
>  #endif
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 498 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH v2 3/4] qcow2: add shrink image support
  2017-06-13 12:16 ` [Qemu-devel] [PATCH v2 3/4] qcow2: add shrink image support Pavel Butsykin
@ 2017-06-21 22:55   ` Max Reitz
  2017-06-22 13:57     ` Pavel Butsykin
  0 siblings, 1 reply; 21+ messages in thread
From: Max Reitz @ 2017-06-21 22:55 UTC (permalink / raw)
  To: Pavel Butsykin, qemu-block, qemu-devel; +Cc: kwolf, eblake, armbru, den

[-- Attachment #1: Type: text/plain, Size: 10721 bytes --]

On 2017-06-13 14:16, Pavel Butsykin wrote:
> This patch add shrinking of the image file for qcow2. As a result, this allows
> us to reduce the virtual image size and free up space on the disk without
> copying the image. Image can be fragmented and shrink is done by punching holes
> in the image file.
> 
> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
> ---
>  block/qcow2-cluster.c  | 42 ++++++++++++++++++++++++++++++++
>  block/qcow2-refcount.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  block/qcow2.c          | 40 +++++++++++++++++++++++--------
>  block/qcow2.h          |  2 ++
>  qapi/block-core.json   |  3 ++-
>  5 files changed, 141 insertions(+), 11 deletions(-)
> 
> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index d779ea19cf..a84b7e607e 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -32,6 +32,48 @@
>  #include "qemu/bswap.h"
>  #include "trace.h"
>  
> +int qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t max_size)

It's not really a max_size but always an exact size. You don't want it
to be any smaller than this.

> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    int new_l1_size, i, ret;
> +
> +    if (max_size >= s->l1_size) {
> +        return 0;
> +    }
> +
> +    new_l1_size = max_size;
> +
> +#ifdef DEBUG_ALLOC2
> +    fprintf(stderr, "shrink l1_table from %d to %" PRId64 "\n",
> +            s->l1_size, new_l1_size);

new_l1_size is of type int, not int64_t.

> +#endif
> +
> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_WRITE_TABLE);
> +    ret = bdrv_pwrite_zeroes(bs->file, s->l1_table_offset +
> +                                       sizeof(uint64_t) * new_l1_size,
> +                             (s->l1_size - new_l1_size) * sizeof(uint64_t), 0);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    ret = bdrv_flush(bs->file->bs);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_FREE_L2_CLUSTERS);
> +    for (i = s->l1_size - 1; i > new_l1_size - 1; i--) {
> +        if ((s->l1_table[i] & L1E_OFFSET_MASK) == 0) {
> +            continue;
> +        }
> +        qcow2_free_clusters(bs, s->l1_table[i] & L1E_OFFSET_MASK,
> +                            s->l2_size * sizeof(uint64_t),

I'm more of a fan of s->cluster_size instead of s->l2_size *
sizeof(uint64_t) but it's not like it matters...

> +                            QCOW2_DISCARD_ALWAYS);
> +        s->l1_table[i] = 0;

I'd probably clear the overhanging s->l1_table entries before
bdrv_flush() (before you shouldn't really use them after
bdrv_pwrite_zeroes() has returned, even if bdrv_flush() has failed), but
it's not absolutely necessary. As long as they still have a refcount of
at least one, writing to them will just be useless but not destroy any data.

> +    }
> +    return 0;
> +}
> +
>  int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
>                          bool exact_size)
>  {
> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
> index 576ab551d6..e98306acd8 100644
> --- a/block/qcow2-refcount.c
> +++ b/block/qcow2-refcount.c
> @@ -29,6 +29,7 @@
>  #include "block/qcow2.h"
>  #include "qemu/range.h"
>  #include "qemu/bswap.h"
> +#include "qemu/cutils.h"
>  
>  static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size);
>  static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
> @@ -2936,3 +2937,67 @@ done:
>      qemu_vfree(new_refblock);
>      return ret;
>  }
> +
> +int qcow2_shrink_reftable(BlockDriverState *bs)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    uint64_t *reftable_tmp =
> +        g_try_malloc(sizeof(uint64_t) * s->refcount_table_size);
> +    int i, ret;
> +
> +    if (s->refcount_table_size && reftable_tmp == NULL) {
> +        return -ENOMEM;
> +    }
> +
> +    for (i = 0; i < s->refcount_table_size; i++) {
> +        int64_t refblock_offs = s->refcount_table[i] & REFT_OFFSET_MASK;
> +        void *refblock;
> +        bool unused_block;
> +
> +        if (refblock_offs == 0) {
> +            reftable_tmp[i] = 0;
> +            continue;
> +        }
> +        ret = qcow2_cache_get(bs, s->refcount_block_cache, refblock_offs,
> +                              &refblock);
> +        if (ret < 0) {
> +            goto out;
> +        }
> +
> +        /* the refblock has own reference */
> +        if (i == refblock_offs >> (s->refcount_block_bits + s->cluster_bits)) {
> +            uint64_t blk_index = (refblock_offs >> s->cluster_bits) &
> +                                 (s->refcount_block_size - 1);
> +            uint64_t refcount = s->get_refcount(refblock, blk_index);
> +
> +            s->set_refcount(refblock, blk_index, 0);
> +
> +            unused_block = buffer_is_zero(refblock, s->refcount_block_size);

s/refcount_block_size/cluster_size/

> +
> +            s->set_refcount(refblock, blk_index, refcount);
> +        } else {
> +            unused_block = buffer_is_zero(refblock, s->refcount_block_size);

Same here.

> +        }
> +        qcow2_cache_put(bs, s->refcount_block_cache, &refblock);
> +
> +        reftable_tmp[i] = unused_block ? 0 : cpu_to_be64(s->refcount_table[i]);
> +    }
> +
> +    ret = bdrv_pwrite_sync(bs->file, s->refcount_table_offset, reftable_tmp,
> +                           sizeof(uint64_t) * s->refcount_table_size);
> +    if (ret < 0) {
> +        goto out;
> +    }
> +
> +    for (i = 0; i < s->refcount_table_size; i++) {
> +        if (s->refcount_table[i] && !reftable_tmp[i]) {
> +            qcow2_free_clusters(bs, s->refcount_table[i] & REFT_OFFSET_MASK,
> +                                s->cluster_size, QCOW2_DISCARD_ALWAYS);

This doesn't feel like a very good idea. The bdrv_pwrite_sync() before
has brought the on-disk refcount structures into a different state than
what we have cached.

OTOH, the bdrv_pwrite_sync() has accessed only the reftable and this
should only access refblocks. So I cannot think of any way this might
actually do something bad. But I guess it'll be better for to revisit
this when it's not in the middle of the night (so on Friday).

> +            s->refcount_table[i] = 0;
> +        }
> +    }
> +
> +out:
> +    g_free(reftable_tmp);
> +    return ret;
> +}
> diff --git a/block/qcow2.c b/block/qcow2.c
> index b3ba5daa93..0ad46d2776 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -2545,6 +2545,7 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset, Error **errp)
>  {
>      BDRVQcow2State *s = bs->opaque;
>      int64_t new_l1_size;
> +    uint64_t total_size;
>      int ret;
>  
>      if (offset & 511) {
> @@ -2558,17 +2559,36 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset, Error **errp)
>          return -ENOTSUP;
>      }
>  
> -    /* shrinking is currently not supported */
> -    if (offset < bs->total_sectors * 512) {
> -        error_setg(errp, "qcow2 doesn't support shrinking images yet");
> -        return -ENOTSUP;
> -    }
> -
>      new_l1_size = size_to_l1(s, offset);
> -    ret = qcow2_grow_l1_table(bs, new_l1_size, true);
> -    if (ret < 0) {
> -        error_setg_errno(errp, -ret, "Failed to grow the L1 table");
> -        return ret;
> +    total_size = bs->total_sectors << BDRV_SECTOR_BITS;
> +
> +    if (offset < total_size) {
> +        ret = qcow2_cluster_discard(bs, ROUND_UP(offset, s->cluster_size),
> +                                    total_size - ROUND_UP(offset,
> +                                                          s->cluster_size),
> +                                    QCOW2_DISCARD_ALWAYS, true);
> +        if (ret < 0) {
> +            error_setg_errno(errp, -ret, "Failed to discard reduced clasters");

s/clasters/clusters/

And maybe "truncated", "stripped", or "cropped" instead of "reduced"?

> +            return ret;
> +        }
> +
> +        ret = qcow2_shrink_l1_table(bs, new_l1_size);
> +        if (ret < 0) {
> +            error_setg_errno(errp, -ret, "Failed to reduce the L1 table");

s/reduce/shrink/ (or "truncate"; or "reduce the L1 table size")

Also, to be fair, you're actually reducing the number of L2 tables, not
the size of the L1 table. (But that's a nit pick)

> +            return ret;
> +        }
> +
> +        ret = qcow2_shrink_reftable(bs);
> +        if (ret < 0) {
> +            error_setg_errno(errp, -ret, "Failed to shrink the refcount table");

And this is not really shrinking the reftable but instead discarding
some refblocks (potentially). (This is a nit pick, too)

Max

> +            return ret;
> +        }
> +    } else {
> +        ret = qcow2_grow_l1_table(bs, new_l1_size, true);
> +        if (ret < 0) {
> +            error_setg_errno(errp, -ret, "Failed to grow the L1 table");
> +            return ret;
> +        }
>      }
>  
>      /* write updated header.size */
> diff --git a/block/qcow2.h b/block/qcow2.h
> index 07faa6dc78..600463bf8e 100644
> --- a/block/qcow2.h
> +++ b/block/qcow2.h
> @@ -531,10 +531,12 @@ int qcow2_pre_write_overlap_check(BlockDriverState *bs, int ign, int64_t offset,
>  int qcow2_change_refcount_order(BlockDriverState *bs, int refcount_order,
>                                  BlockDriverAmendStatusCB *status_cb,
>                                  void *cb_opaque, Error **errp);
> +int qcow2_shrink_reftable(BlockDriverState *bs);
>  
>  /* qcow2-cluster.c functions */
>  int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
>                          bool exact_size);
> +int qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t max_size);
>  int qcow2_write_l1_entry(BlockDriverState *bs, int l1_index);
>  int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset);
>  int qcow2_encrypt_sectors(BDRVQcow2State *s, int64_t sector_num,
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index f85c2235c7..bcbffa3339 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -2372,7 +2372,8 @@
>              'cluster_alloc_bytes', 'cluster_free', 'flush_to_os',
>              'flush_to_disk', 'pwritev_rmw_head', 'pwritev_rmw_after_head',
>              'pwritev_rmw_tail', 'pwritev_rmw_after_tail', 'pwritev',
> -            'pwritev_zero', 'pwritev_done', 'empty_image_prepare' ] }
> +            'pwritev_zero', 'pwritev_done', 'empty_image_prepare',
> +            'l1_shrink_write_table', 'l1_shrink_free_l2_clusters' ] }
>  
>  ##
>  # @BlkdebugInjectErrorOptions:
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 498 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH v2 1/4] qemu-img: add --shrink flag for resize
  2017-06-21 22:17   ` Max Reitz
@ 2017-06-22 13:54     ` Pavel Butsykin
  2017-06-22 14:49       ` Kevin Wolf
  0 siblings, 1 reply; 21+ messages in thread
From: Pavel Butsykin @ 2017-06-22 13:54 UTC (permalink / raw)
  To: Max Reitz, qemu-block, qemu-devel; +Cc: kwolf, eblake, armbru, den

On 22.06.2017 01:17, Max Reitz wrote:
> On 2017-06-13 14:16, Pavel Butsykin wrote:
>> The flag as additional precaution of data loss. Perhaps in the future the
>> operation shrink without this flag will be banned, but while we need to
>> maintain compatibility.
>>
>> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
>> ---
>>   qemu-img-cmds.hx       |  4 ++--
>>   qemu-img.c             | 15 +++++++++++++++
>>   qemu-img.texi          |  5 ++++-
>>   tests/qemu-iotests/102 |  4 ++--
>>   4 files changed, 23 insertions(+), 5 deletions(-)
>>
>> diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
>> index a39fcdba71..3b2eab9d20 100644
>> --- a/qemu-img-cmds.hx
>> +++ b/qemu-img-cmds.hx
>> @@ -76,9 +76,9 @@ STEXI
>>   ETEXI
>>   
>>   DEF("resize", img_resize,
>> -    "resize [--object objectdef] [--image-opts] [-q] filename [+ | -]size")
>> +    "resize [--object objectdef] [--image-opts] [-q] [--shrink] filename [+ | -]size")
>>   STEXI
>> -@item resize [--object @var{objectdef}] [--image-opts] [-q] @var{filename} [+ | -]@var{size}
>> +@item resize [--object @var{objectdef}] [--image-opts] [-q] [--shrink] @var{filename} [+ | -]@var{size}
>>   ETEXI
>>   
>>   DEF("amend", img_amend,
>> diff --git a/qemu-img.c b/qemu-img.c
>> index 0ad698d7f1..bfe5f61b0b 100644
>> --- a/qemu-img.c
>> +++ b/qemu-img.c
>> @@ -61,6 +61,7 @@ enum {
>>       OPTION_FLUSH_INTERVAL = 261,
>>       OPTION_NO_DRAIN = 262,
>>       OPTION_TARGET_IMAGE_OPTS = 263,
>> +    OPTION_SHRINK = 264,
>>   };
>>   
>>   typedef enum OutputFormat {
>> @@ -3452,6 +3453,7 @@ static int img_resize(int argc, char **argv)
>>           },
>>       };
>>       bool image_opts = false;
>> +    bool shrink = false;
>>   
>>       /* Remove size from argv manually so that negative numbers are not treated
>>        * as options by getopt. */
>> @@ -3469,6 +3471,7 @@ static int img_resize(int argc, char **argv)
>>               {"help", no_argument, 0, 'h'},
>>               {"object", required_argument, 0, OPTION_OBJECT},
>>               {"image-opts", no_argument, 0, OPTION_IMAGE_OPTS},
>> +            {"shrink", no_argument, 0, OPTION_SHRINK},
>>               {0, 0, 0, 0}
>>           };
>>           c = getopt_long(argc, argv, ":f:hq",
>> @@ -3503,6 +3506,9 @@ static int img_resize(int argc, char **argv)
>>           case OPTION_IMAGE_OPTS:
>>               image_opts = true;
>>               break;
>> +        case OPTION_SHRINK:
>> +            shrink = true;
>> +            break;
>>           }
>>       }
>>       if (optind != argc - 1) {
>> @@ -3562,6 +3568,15 @@ static int img_resize(int argc, char **argv)
>>           goto out;
>>       }
>>   
>> +    if (total_size < blk_getlength(blk) && !shrink) {
>> +        qprintf(quiet, "Warning: shrinking of the image can lead to data loss. "
> 
> I think this should always be printed to stderr, even if quiet is true;
> especially considering we (or at least I) plan to make it mandatory.
>

OK.

>> +                       "Before performing shrink operation you must make sure "
> 
> *Before performaing a shrink operation

s/performaing/performing/

> Also, I'd rather use the imperative: "Before ... operation, make sure
> that the..."
> 
> (English isn't my native language either, but as far as I remember
> "must" is generally used for external influences. But it's not like a
> force of nature is forcing you to confirm there's no important data; you
> can just ignore this advice and see all of your non-backed-up childhood
> pictures go to /dev/null.)
>

Yes, It should be just a recommendation.

(As you might have already guessed, English isn't my native language too
:) I'm glad you are understanding. Really thanks for helping to improve
the text.)

>> +                       "that the shrink part of image doesn't contain important"
> 
> Hmm... I don't think "shrink part" really works.
> 
> Maybe the following would work better:
> 
>    Warning: Shrinking an image will delete all data beyond the shrunken
>    image's end. Before performing such an operation, make sure there is
>    no important data there.
> 
>> +                       " data.\n");
>> +        qprintf(quiet,
>> +                "If you don't want to see this message use --shrink option.\n");
> 
> You should make a note that --shrink may (and I hope it will) become
> necessary in the future, like:
> 
>    Using the --shrink option will suppress this message. Note that future
>    versions of qemu-img may refuse to shrink images without this option!
>

will fix.

>> +    }
>> +
>>       ret = blk_truncate(blk, total_size, &err);
>>       if (!ret) {
>>           qprintf(quiet, "Image resized.\n");
>> diff --git a/qemu-img.texi b/qemu-img.texi
>> index 5b925ecf41..c2b694cd00 100644
>> --- a/qemu-img.texi
>> +++ b/qemu-img.texi
>> @@ -499,7 +499,7 @@ qemu-img rebase -b base.img diff.qcow2
>>   At this point, @code{modified.img} can be discarded, since
>>   @code{base.img + diff.qcow2} contains the same information.
>>   
>> -@item resize @var{filename} [+ | -]@var{size}
>> +@item resize [--shrink] @var{filename} [+ | -]@var{size}
>>   
>>   Change the disk image as if it had been created with @var{size}.
>>   
>> @@ -507,6 +507,9 @@ Before using this command to shrink a disk image, you MUST use file system and
>>   partitioning tools inside the VM to reduce allocated file systems and partition
>>   sizes accordingly.  Failure to do so will result in data loss!
>>   
>> +If @code{--shrink} is specified, warning about data loss doesn't print for
>> +the shrink operation.
>> +
> 
> Maybe rather:
> 
> @code{--shrink} informs qemu-img that the user is certain about wanting
> to shrink an image and is aware that any data beyond the truncated
> image's end will be lost. Trying to shrink an image without this option
> results in a warning; future versions may make it an error.
>

good.

> The functional changes look good to me; even though I'd rather have it
> an error for qcow2 now (even if that means having to check the image
> format in img_resize(), and being inconsistent because you wouldn't need
> --shrink for raw, but for qcow2 you would). But, well, I'm not going to
> stop this series over that.
>

Why shrink for qcow2 image is dangerous, but for raw is not dangerous? I
think we should provide the same behavior for all formats. When --shrink
option will become necessary, it also should be the same for all image
formats.

> Max
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/4] qcow2: add qcow2_cache_discard
  2017-06-21 22:29   ` Max Reitz
@ 2017-06-22 13:55     ` Pavel Butsykin
  0 siblings, 0 replies; 21+ messages in thread
From: Pavel Butsykin @ 2017-06-22 13:55 UTC (permalink / raw)
  To: Max Reitz, qemu-block, qemu-devel; +Cc: kwolf, eblake, armbru, den

On 22.06.2017 01:29, Max Reitz wrote:
> On 2017-06-13 14:16, Pavel Butsykin wrote:
>> Whenever l2/refcount table clusters are discarded from the file we can
>> automatically drop unnecessary content of the cache tables. This reduces
>> the chance of eviction useful cache data and eliminates inconsistent data
>> in thecache with the data in the file.
>>
>> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
>> ---
>>   block/qcow2-cache.c    | 21 +++++++++++++++++++++
>>   block/qcow2-refcount.c |  5 +++++
>>   block/qcow2.h          |  1 +
>>   3 files changed, 27 insertions(+)
>>
>> diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
>> index 1d25147392..7931edf237 100644
>> --- a/block/qcow2-cache.c
>> +++ b/block/qcow2-cache.c
>> @@ -411,3 +411,24 @@ void qcow2_cache_entry_mark_dirty(BlockDriverState *bs, Qcow2Cache *c,
>>       assert(c->entries[i].offset != 0);
>>       c->entries[i].dirty = true;
>>   }
>> +
>> +void qcow2_cache_discard(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset)
>> +{
>> +    int i;
>> +
>> +    for (i = 0; i < c->size; i++) {
>> +        if (c->entries[i].offset == offset) {
>> +            goto found; /* table offset */
>> +        }
>> +    }
>> +    return;
>> +
>> +found:
>> +    assert(c->entries[i].ref == 0);
>> +
>> +    c->entries[i].offset = 0;
>> +    c->entries[i].lru_counter = 0;
>> +    c->entries[i].dirty = false;
>> +
>> +    qcow2_cache_table_release(bs, c, i, 1);
>> +}
>> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
>> index 7c06061aae..576ab551d6 100644
>> --- a/block/qcow2-refcount.c
>> +++ b/block/qcow2-refcount.c
>> @@ -767,6 +767,11 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>           s->set_refcount(refcount_block, block_index, refcount);
>>   
>>           if (refcount == 0 && s->discard_passthrough[type]) {
>> +            qcow2_cache_put(bs, s->refcount_block_cache, &refcount_block);
> 
> I don't like this very much. It works, but it feels bad.
> 
> Would it be possible to store this refblock's offset somewhere and only
> put it back if @offset is equal to that?
> 
>> +            qcow2_cache_discard(bs, s->refcount_block_cache, offset);
>> +
>> +            qcow2_cache_discard(bs, s->l2_table_cache, offset);
>> +
> 
> So you're blindly calling qcow2_cache_discard() on @offset because
> @offset may be pointing to a refblock or an L2 table? Right, that works,
> but it still deserves a comment, I think (that we have no idea whether
> @offset actually points to any of these refcount structures, but that we
> also just don't have to care).
> 
> Looks OK to me, functionally, but I'd at least like to have that comment.
>

Hmm.. We can split qcow2_cache_discard() and kill two birds with one
stone.

void* qcow2_cache_is_table_offset(BlockDriverState *bs, Qcow2Cache *c,
                                   uint64_t offset)
{
     int i;

     for (i = 0; i < c->size; i++) {
         if (c->entries[i].offset == offset) {
             return qcow2_cache_get_table_addr(bs, c, i);
         }
     }
     return NULL;
}

void qcow2_cache_discard(BlockDriverState *bs, Qcow2Cache *c, void table)
{
     int i = qcow2_cache_get_table_idx(bs, c, table);

     assert(c->entries[i].ref == 0);

     c->entries[i].offset = 0;
     c->entries[i].lru_counter = 0;
     c->entries[i].dirty = false;

     qcow2_cache_table_release(bs, c, i, 1);
}

....

if (refcount == 0 && s->discard_passthrough[type]) {
     void *table;

     table = qcow2_cache_is_table_offset(bs, s->refcount_block_cache, 
offset);
     if (table != NULL) {
         qcow2_cache_put(bs, s->refcount_block_cache, &refcount_block);
         qcow2_cache_discard(bs, s->refcount_block_cache, table);
     }

     table = qcow2_cache_is_table_offset(bs, s->l2_table_cache, offset);
     if (table != NULL) {
         qcow2_cache_discard(bs, s->l2_table_cache, table);
     }
...

> Max
> 
>>               update_refcount_discard(bs, cluster_offset, s->cluster_size);
>>           }
>>       }
>> diff --git a/block/qcow2.h b/block/qcow2.h
>> index 1801dc30dc..07faa6dc78 100644
>> --- a/block/qcow2.h
>> +++ b/block/qcow2.h
>> @@ -597,5 +597,6 @@ int qcow2_cache_get(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset,
>>   int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset,
>>       void **table);
>>   void qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table);
>> +void qcow2_cache_discard(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset);
>>   
>>   #endif
>>
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH v2 3/4] qcow2: add shrink image support
  2017-06-21 22:55   ` Max Reitz
@ 2017-06-22 13:57     ` Pavel Butsykin
  2017-06-23 15:46       ` Max Reitz
  0 siblings, 1 reply; 21+ messages in thread
From: Pavel Butsykin @ 2017-06-22 13:57 UTC (permalink / raw)
  To: Max Reitz, qemu-block, qemu-devel; +Cc: kwolf, eblake, armbru, den


On 22.06.2017 01:55, Max Reitz wrote:
> On 2017-06-13 14:16, Pavel Butsykin wrote:
>> This patch add shrinking of the image file for qcow2. As a result, this allows
>> us to reduce the virtual image size and free up space on the disk without
>> copying the image. Image can be fragmented and shrink is done by punching holes
>> in the image file.
>>
>> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
>> ---
>>   block/qcow2-cluster.c  | 42 ++++++++++++++++++++++++++++++++
>>   block/qcow2-refcount.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>   block/qcow2.c          | 40 +++++++++++++++++++++++--------
>>   block/qcow2.h          |  2 ++
>>   qapi/block-core.json   |  3 ++-
>>   5 files changed, 141 insertions(+), 11 deletions(-)
>>
>> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
>> index d779ea19cf..a84b7e607e 100644
>> --- a/block/qcow2-cluster.c
>> +++ b/block/qcow2-cluster.c
>> @@ -32,6 +32,48 @@
>>   #include "qemu/bswap.h"
>>   #include "trace.h"
>>   
>> +int qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t max_size)
> 
> It's not really a max_size but always an exact size. You don't want it
> to be any smaller than this.
> 
>> +{
>> +    BDRVQcow2State *s = bs->opaque;
>> +    int new_l1_size, i, ret;
>> +
>> +    if (max_size >= s->l1_size) {
>> +        return 0;
>> +    }
>> +
>> +    new_l1_size = max_size;
>> +
>> +#ifdef DEBUG_ALLOC2
>> +    fprintf(stderr, "shrink l1_table from %d to %" PRId64 "\n",
>> +            s->l1_size, new_l1_size);
> 
> new_l1_size is of type int, not int64_t.
> 
>> +#endif
>> +
>> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_WRITE_TABLE);
>> +    ret = bdrv_pwrite_zeroes(bs->file, s->l1_table_offset +
>> +                                       sizeof(uint64_t) * new_l1_size,
>> +                             (s->l1_size - new_l1_size) * sizeof(uint64_t), 0);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    ret = bdrv_flush(bs->file->bs);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_FREE_L2_CLUSTERS);
>> +    for (i = s->l1_size - 1; i > new_l1_size - 1; i--) {
>> +        if ((s->l1_table[i] & L1E_OFFSET_MASK) == 0) {
>> +            continue;
>> +        }
>> +        qcow2_free_clusters(bs, s->l1_table[i] & L1E_OFFSET_MASK,
>> +                            s->l2_size * sizeof(uint64_t),
> 
> I'm more of a fan of s->cluster_size instead of s->l2_size *
> sizeof(uint64_t) but it's not like it matters...
> 
>> +                            QCOW2_DISCARD_ALWAYS);
>> +        s->l1_table[i] = 0;
> 
> I'd probably clear the overhanging s->l1_table entries before
> bdrv_flush() (before you shouldn't really use them after
> bdrv_pwrite_zeroes() has returned, even if bdrv_flush() has failed), but
> it's not absolutely necessary. As long as they still have a refcount of
> at least one, writing to them will just be useless but not destroy any data.
>

You're right, but If it's not necessary, I would prefer to leave as is..
Just because overhanging s->l1_table entries used to release clusters :)

>> +    }
>> +    return 0;
>> +}
>> +
>>   int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
>>                           bool exact_size)
>>   {
>> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
>> index 576ab551d6..e98306acd8 100644
>> --- a/block/qcow2-refcount.c
>> +++ b/block/qcow2-refcount.c
>> @@ -29,6 +29,7 @@
>>   #include "block/qcow2.h"
>>   #include "qemu/range.h"
>>   #include "qemu/bswap.h"
>> +#include "qemu/cutils.h"
>>   
>>   static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size);
>>   static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>> @@ -2936,3 +2937,67 @@ done:
>>       qemu_vfree(new_refblock);
>>       return ret;
>>   }
>> +
>> +int qcow2_shrink_reftable(BlockDriverState *bs)
>> +{
>> +    BDRVQcow2State *s = bs->opaque;
>> +    uint64_t *reftable_tmp =
>> +        g_try_malloc(sizeof(uint64_t) * s->refcount_table_size);
>> +    int i, ret;
>> +
>> +    if (s->refcount_table_size && reftable_tmp == NULL) {
>> +        return -ENOMEM;
>> +    }
>> +
>> +    for (i = 0; i < s->refcount_table_size; i++) {
>> +        int64_t refblock_offs = s->refcount_table[i] & REFT_OFFSET_MASK;
>> +        void *refblock;
>> +        bool unused_block;
>> +
>> +        if (refblock_offs == 0) {
>> +            reftable_tmp[i] = 0;
>> +            continue;
>> +        }
>> +        ret = qcow2_cache_get(bs, s->refcount_block_cache, refblock_offs,
>> +                              &refblock);
>> +        if (ret < 0) {
>> +            goto out;
>> +        }
>> +
>> +        /* the refblock has own reference */
>> +        if (i == refblock_offs >> (s->refcount_block_bits + s->cluster_bits)) {
>> +            uint64_t blk_index = (refblock_offs >> s->cluster_bits) &
>> +                                 (s->refcount_block_size - 1);
>> +            uint64_t refcount = s->get_refcount(refblock, blk_index);
>> +
>> +            s->set_refcount(refblock, blk_index, 0);
>> +
>> +            unused_block = buffer_is_zero(refblock, s->refcount_block_size);
> 
> s/refcount_block_size/cluster_size/
> 
>> +
>> +            s->set_refcount(refblock, blk_index, refcount);
>> +        } else {
>> +            unused_block = buffer_is_zero(refblock, s->refcount_block_size);
> 
> Same here.
> 
>> +        }
>> +        qcow2_cache_put(bs, s->refcount_block_cache, &refblock);
>> +
>> +        reftable_tmp[i] = unused_block ? 0 : cpu_to_be64(s->refcount_table[i]);
>> +    }
>> +
>> +    ret = bdrv_pwrite_sync(bs->file, s->refcount_table_offset, reftable_tmp,
>> +                           sizeof(uint64_t) * s->refcount_table_size);
>> +    if (ret < 0) {
>> +        goto out;
>> +    }
>> +
>> +    for (i = 0; i < s->refcount_table_size; i++) {
>> +        if (s->refcount_table[i] && !reftable_tmp[i]) {
>> +            qcow2_free_clusters(bs, s->refcount_table[i] & REFT_OFFSET_MASK,
>> +                                s->cluster_size, QCOW2_DISCARD_ALWAYS);
> 
> This doesn't feel like a very good idea. The bdrv_pwrite_sync() before
> has brought the on-disk refcount structures into a different state than
> what we have cached.

It is for this inside qcow2_free_clusters()->update_refcount() the cache
is discarded by qcow2_cache_discard().

> OTOH, the bdrv_pwrite_sync() has accessed only the reftable and this
> should only access refblocks. So I cannot think of any way this might
> actually do something bad. But I guess it'll be better for to revisit
> this when it's not in the middle of the night (so on Friday).
> 
>> +            s->refcount_table[i] = 0;
>> +        }
>> +    }
>> +
>> +out:
>> +    g_free(reftable_tmp);
>> +    return ret;
>> +}
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index b3ba5daa93..0ad46d2776 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -2545,6 +2545,7 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset, Error **errp)
>>   {
>>       BDRVQcow2State *s = bs->opaque;
>>       int64_t new_l1_size;
>> +    uint64_t total_size;
>>       int ret;
>>   
>>       if (offset & 511) {
>> @@ -2558,17 +2559,36 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset, Error **errp)
>>           return -ENOTSUP;
>>       }
>>   
>> -    /* shrinking is currently not supported */
>> -    if (offset < bs->total_sectors * 512) {
>> -        error_setg(errp, "qcow2 doesn't support shrinking images yet");
>> -        return -ENOTSUP;
>> -    }
>> -
>>       new_l1_size = size_to_l1(s, offset);
>> -    ret = qcow2_grow_l1_table(bs, new_l1_size, true);
>> -    if (ret < 0) {
>> -        error_setg_errno(errp, -ret, "Failed to grow the L1 table");
>> -        return ret;
>> +    total_size = bs->total_sectors << BDRV_SECTOR_BITS;
>> +
>> +    if (offset < total_size) {
>> +        ret = qcow2_cluster_discard(bs, ROUND_UP(offset, s->cluster_size),
>> +                                    total_size - ROUND_UP(offset,
>> +                                                          s->cluster_size),
>> +                                    QCOW2_DISCARD_ALWAYS, true);
>> +        if (ret < 0) {
>> +            error_setg_errno(errp, -ret, "Failed to discard reduced clasters");
> 
> s/clasters/clusters/
> 
> And maybe "truncated", "stripped", or "cropped" instead of "reduced"?
> 
>> +            return ret;
>> +        }
>> +
>> +        ret = qcow2_shrink_l1_table(bs, new_l1_size);
>> +        if (ret < 0) {
>> +            error_setg_errno(errp, -ret, "Failed to reduce the L1 table");
> 
> s/reduce/shrink/ (or "truncate"; or "reduce the L1 table size")
> 
> Also, to be fair, you're actually reducing the number of L2 tables, not
> the size of the L1 table. (But that's a nit pick)

In the previous patch version, there really was reducing the L1 table
size :) I think now it's better to fix the error message.

>> +            return ret;
>> +        }
>> +
>> +        ret = qcow2_shrink_reftable(bs);
>> +        if (ret < 0) {
>> +            error_setg_errno(errp, -ret, "Failed to shrink the refcount table");
> 
> And this is not really shrinking the reftable but instead discarding
> some refblocks (potentially). (This is a nit pick, too)
> 
> Max
> 
>> +            return ret;
>> +        }
>> +    } else {
>> +        ret = qcow2_grow_l1_table(bs, new_l1_size, true);
>> +        if (ret < 0) {
>> +            error_setg_errno(errp, -ret, "Failed to grow the L1 table");
>> +            return ret;
>> +        }
>>       }
>>   
>>       /* write updated header.size */
>> diff --git a/block/qcow2.h b/block/qcow2.h
>> index 07faa6dc78..600463bf8e 100644
>> --- a/block/qcow2.h
>> +++ b/block/qcow2.h
>> @@ -531,10 +531,12 @@ int qcow2_pre_write_overlap_check(BlockDriverState *bs, int ign, int64_t offset,
>>   int qcow2_change_refcount_order(BlockDriverState *bs, int refcount_order,
>>                                   BlockDriverAmendStatusCB *status_cb,
>>                                   void *cb_opaque, Error **errp);
>> +int qcow2_shrink_reftable(BlockDriverState *bs);
>>   
>>   /* qcow2-cluster.c functions */
>>   int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
>>                           bool exact_size);
>> +int qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t max_size);
>>   int qcow2_write_l1_entry(BlockDriverState *bs, int l1_index);
>>   int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset);
>>   int qcow2_encrypt_sectors(BDRVQcow2State *s, int64_t sector_num,
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index f85c2235c7..bcbffa3339 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -2372,7 +2372,8 @@
>>               'cluster_alloc_bytes', 'cluster_free', 'flush_to_os',
>>               'flush_to_disk', 'pwritev_rmw_head', 'pwritev_rmw_after_head',
>>               'pwritev_rmw_tail', 'pwritev_rmw_after_tail', 'pwritev',
>> -            'pwritev_zero', 'pwritev_done', 'empty_image_prepare' ] }
>> +            'pwritev_zero', 'pwritev_done', 'empty_image_prepare',
>> +            'l1_shrink_write_table', 'l1_shrink_free_l2_clusters' ] }
>>   
>>   ##
>>   # @BlkdebugInjectErrorOptions:
>>
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH v2 1/4] qemu-img: add --shrink flag for resize
  2017-06-22 13:54     ` Pavel Butsykin
@ 2017-06-22 14:49       ` Kevin Wolf
  2017-06-22 16:26         ` Pavel Butsykin
  0 siblings, 1 reply; 21+ messages in thread
From: Kevin Wolf @ 2017-06-22 14:49 UTC (permalink / raw)
  To: Pavel Butsykin; +Cc: Max Reitz, qemu-block, qemu-devel, eblake, armbru, den

Am 22.06.2017 um 15:54 hat Pavel Butsykin geschrieben:
> On 22.06.2017 01:17, Max Reitz wrote:
> >On 2017-06-13 14:16, Pavel Butsykin wrote:
> >>The flag as additional precaution of data loss. Perhaps in the future the
> >>operation shrink without this flag will be banned, but while we need to
> >>maintain compatibility.
> >>
> >>Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>

> >The functional changes look good to me; even though I'd rather have it
> >an error for qcow2 now (even if that means having to check the image
> >format in img_resize(), and being inconsistent because you wouldn't need
> >--shrink for raw, but for qcow2 you would). But, well, I'm not going to
> >stop this series over that.
> >
> 
> Why shrink for qcow2 image is dangerous, but for raw is not dangerous? I
> think we should provide the same behavior for all formats. When --shrink
> option will become necessary, it also should be the same for all image
> formats.

It is dangerous for both, but for raw we can't enforce the flag
immediately without a deprecation period. With qcow2 we can (because it
is new functionality), so we might as well enforce it from the start.

Kevin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH v2 1/4] qemu-img: add --shrink flag for resize
  2017-06-22 14:49       ` Kevin Wolf
@ 2017-06-22 16:26         ` Pavel Butsykin
  0 siblings, 0 replies; 21+ messages in thread
From: Pavel Butsykin @ 2017-06-22 16:26 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Max Reitz, qemu-block, qemu-devel, eblake, armbru, den

On 22.06.2017 17:49, Kevin Wolf wrote:
> Am 22.06.2017 um 15:54 hat Pavel Butsykin geschrieben:
>> On 22.06.2017 01:17, Max Reitz wrote:
>>> On 2017-06-13 14:16, Pavel Butsykin wrote:
>>>> The flag as additional precaution of data loss. Perhaps in the future the
>>>> operation shrink without this flag will be banned, but while we need to
>>>> maintain compatibility.
>>>>
>>>> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
> 
>>> The functional changes look good to me; even though I'd rather have it
>>> an error for qcow2 now (even if that means having to check the image
>>> format in img_resize(), and being inconsistent because you wouldn't need
>>> --shrink for raw, but for qcow2 you would). But, well, I'm not going to
>>> stop this series over that.
>>>
>>
>> Why shrink for qcow2 image is dangerous, but for raw is not dangerous? I
>> think we should provide the same behavior for all formats. When --shrink
>> option will become necessary, it also should be the same for all image
>> formats.
> 
> It is dangerous for both, but for raw we can't enforce the flag
> immediately without a deprecation period. With qcow2 we can (because it
> is new functionality), so we might as well enforce it from the start.
>

Ah, exactly. I like the offer to print the warning for raw and enforce
the flag for other formats.

> Kevin
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH v2 3/4] qcow2: add shrink image support
  2017-06-22 13:57     ` Pavel Butsykin
@ 2017-06-23 15:46       ` Max Reitz
  2017-06-26 15:23         ` Pavel Butsykin
  0 siblings, 1 reply; 21+ messages in thread
From: Max Reitz @ 2017-06-23 15:46 UTC (permalink / raw)
  To: Pavel Butsykin, qemu-block, qemu-devel; +Cc: kwolf, eblake, armbru, den

[-- Attachment #1: Type: text/plain, Size: 15295 bytes --]

On 2017-06-22 15:57, Pavel Butsykin wrote:
> 
> On 22.06.2017 01:55, Max Reitz wrote:
>> On 2017-06-13 14:16, Pavel Butsykin wrote:
>>> This patch add shrinking of the image file for qcow2. As a result,
>>> this allows
>>> us to reduce the virtual image size and free up space on the disk
>>> without
>>> copying the image. Image can be fragmented and shrink is done by
>>> punching holes
>>> in the image file.
>>>
>>> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
>>> ---
>>>   block/qcow2-cluster.c  | 42 ++++++++++++++++++++++++++++++++
>>>   block/qcow2-refcount.c | 65
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++
>>>   block/qcow2.c          | 40 +++++++++++++++++++++++--------
>>>   block/qcow2.h          |  2 ++
>>>   qapi/block-core.json   |  3 ++-
>>>   5 files changed, 141 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
>>> index d779ea19cf..a84b7e607e 100644
>>> --- a/block/qcow2-cluster.c
>>> +++ b/block/qcow2-cluster.c
>>> @@ -32,6 +32,48 @@
>>>   #include "qemu/bswap.h"
>>>   #include "trace.h"
>>>   +int qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t max_size)
>>
>> It's not really a max_size but always an exact size. You don't want it
>> to be any smaller than this.
>>
>>> +{
>>> +    BDRVQcow2State *s = bs->opaque;
>>> +    int new_l1_size, i, ret;
>>> +
>>> +    if (max_size >= s->l1_size) {
>>> +        return 0;
>>> +    }
>>> +
>>> +    new_l1_size = max_size;
>>> +
>>> +#ifdef DEBUG_ALLOC2
>>> +    fprintf(stderr, "shrink l1_table from %d to %" PRId64 "\n",
>>> +            s->l1_size, new_l1_size);
>>
>> new_l1_size is of type int, not int64_t.
>>
>>> +#endif
>>> +
>>> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_WRITE_TABLE);
>>> +    ret = bdrv_pwrite_zeroes(bs->file, s->l1_table_offset +
>>> +                                       sizeof(uint64_t) * new_l1_size,
>>> +                             (s->l1_size - new_l1_size) *
>>> sizeof(uint64_t), 0);
>>> +    if (ret < 0) {
>>> +        return ret;
>>> +    }
>>> +
>>> +    ret = bdrv_flush(bs->file->bs);
>>> +    if (ret < 0) {
>>> +        return ret;
>>> +    }
>>> +
>>> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_FREE_L2_CLUSTERS);
>>> +    for (i = s->l1_size - 1; i > new_l1_size - 1; i--) {
>>> +        if ((s->l1_table[i] & L1E_OFFSET_MASK) == 0) {
>>> +            continue;
>>> +        }
>>> +        qcow2_free_clusters(bs, s->l1_table[i] & L1E_OFFSET_MASK,
>>> +                            s->l2_size * sizeof(uint64_t),
>>
>> I'm more of a fan of s->cluster_size instead of s->l2_size *
>> sizeof(uint64_t) but it's not like it matters...
>>
>>> +                            QCOW2_DISCARD_ALWAYS);
>>> +        s->l1_table[i] = 0;
>>
>> I'd probably clear the overhanging s->l1_table entries before
>> bdrv_flush() (before you shouldn't really use them after
>> bdrv_pwrite_zeroes() has returned, even if bdrv_flush() has failed), but
>> it's not absolutely necessary. As long as they still have a refcount of
>> at least one, writing to them will just be useless but not destroy any
>> data.
>>
> 
> You're right, but If it's not necessary, I would prefer to leave as is..
> Just because overhanging s->l1_table entries used to release clusters :)

Hm, yes. The question is, how bad are useless writes?

So the worst case scenario is this: You invoke qmp_block_resize() to
shrink the image; the bdrv_flush() call fails somewhere in the middle
but the data is still kind of pending and basically in the image.

Now when you continue to use the image and write data beyond the
intended new end, that data basically ends up nowhere. You can still
read the data just fine and change it, but when you restart qemu, it
will all be gone. So that's weird.

Admittedly, though, bdrv_flush() isn't the only issue here;
bdrv_pwrite_zeroes() is, too. If that fails somewhere in the middle, we
basically have the same situation.

Now if we were to update s->l1_table before the bdrv_pwrite_zeroes()
call, we might end up with the opposite issue: The data appears to be
gone, but after reopening the image, it's back again. The main
difference is that in this case we'll have to allocate L2 tables anew
and this will require writes to the L1 table, so maybe we can actually
succeed in overwriting the old data then... But that's a big maybe.

So all in all we'll very likely get inconsistencies either way, so yes,
it doesn't actually matter. :-)

>>> +    }
>>> +    return 0;
>>> +}
>>> +
>>>   int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
>>>                           bool exact_size)
>>>   {
>>> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
>>> index 576ab551d6..e98306acd8 100644
>>> --- a/block/qcow2-refcount.c
>>> +++ b/block/qcow2-refcount.c
>>> @@ -29,6 +29,7 @@
>>>   #include "block/qcow2.h"
>>>   #include "qemu/range.h"
>>>   #include "qemu/bswap.h"
>>> +#include "qemu/cutils.h"
>>>     static int64_t alloc_clusters_noref(BlockDriverState *bs,
>>> uint64_t size);
>>>   static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState
>>> *bs,
>>> @@ -2936,3 +2937,67 @@ done:
>>>       qemu_vfree(new_refblock);
>>>       return ret;
>>>   }
>>> +
>>> +int qcow2_shrink_reftable(BlockDriverState *bs)
>>> +{
>>> +    BDRVQcow2State *s = bs->opaque;
>>> +    uint64_t *reftable_tmp =
>>> +        g_try_malloc(sizeof(uint64_t) * s->refcount_table_size);
>>> +    int i, ret;
>>> +
>>> +    if (s->refcount_table_size && reftable_tmp == NULL) {
>>> +        return -ENOMEM;
>>> +    }
>>> +
>>> +    for (i = 0; i < s->refcount_table_size; i++) {
>>> +        int64_t refblock_offs = s->refcount_table[i] &
>>> REFT_OFFSET_MASK;
>>> +        void *refblock;
>>> +        bool unused_block;
>>> +
>>> +        if (refblock_offs == 0) {
>>> +            reftable_tmp[i] = 0;
>>> +            continue;
>>> +        }
>>> +        ret = qcow2_cache_get(bs, s->refcount_block_cache,
>>> refblock_offs,
>>> +                              &refblock);
>>> +        if (ret < 0) {
>>> +            goto out;
>>> +        }
>>> +
>>> +        /* the refblock has own reference */
>>> +        if (i == refblock_offs >> (s->refcount_block_bits +
>>> s->cluster_bits)) {
>>> +            uint64_t blk_index = (refblock_offs >> s->cluster_bits) &
>>> +                                 (s->refcount_block_size - 1);
>>> +            uint64_t refcount = s->get_refcount(refblock, blk_index);
>>> +
>>> +            s->set_refcount(refblock, blk_index, 0);
>>> +
>>> +            unused_block = buffer_is_zero(refblock,
>>> s->refcount_block_size);
>>
>> s/refcount_block_size/cluster_size/
>>
>>> +
>>> +            s->set_refcount(refblock, blk_index, refcount);
>>> +        } else {
>>> +            unused_block = buffer_is_zero(refblock,
>>> s->refcount_block_size);
>>
>> Same here.
>>
>>> +        }
>>> +        qcow2_cache_put(bs, s->refcount_block_cache, &refblock);
>>> +
>>> +        reftable_tmp[i] = unused_block ? 0 :
>>> cpu_to_be64(s->refcount_table[i]);
>>> +    }
>>> +
>>> +    ret = bdrv_pwrite_sync(bs->file, s->refcount_table_offset,
>>> reftable_tmp,
>>> +                           sizeof(uint64_t) * s->refcount_table_size);
>>> +    if (ret < 0) {
>>> +        goto out;
>>> +    }
>>> +
>>> +    for (i = 0; i < s->refcount_table_size; i++) {
>>> +        if (s->refcount_table[i] && !reftable_tmp[i]) {
>>> +            qcow2_free_clusters(bs, s->refcount_table[i] &
>>> REFT_OFFSET_MASK,
>>> +                                s->cluster_size, QCOW2_DISCARD_ALWAYS);
>>
>> This doesn't feel like a very good idea. The bdrv_pwrite_sync() before
>> has brought the on-disk refcount structures into a different state than
>> what we have cached.
> 
> It is for this inside qcow2_free_clusters()->update_refcount() the cache
> is discarded by qcow2_cache_discard().

This doesn't change the fact that the in-memory reftable is different
from the on-disk reftable and that qcow2_free_clusters() may trip up on
that; the main issue is the allocate_refcount_block() call before.

So we need a guarantee that update_refcount() won't touch the reftable
if the refcount is decreased. It will call alloc_refcount_block() and
that should definitely find the respective refblock to already exist
because of course it has a refcount already.

But here's an issue: It tries to read from s->refcount_table[], and you
are slowly overwriting it in the same loop here. So it may not actually
find the refcount (if a refblock is described by an earlier one).
(After more than an hour of debugging, I realized this is not true: You
will only zero reftable entries if the refblock describes nothing or
only themselves. So overwriting one reftable entry cannot have effects
on other refblocks. Or at least it should not.)

Another potential issue is that you're assuming s->refcount_table_size
to be constant. I cannot find a way for it not to be, but investigating
this is painful and I can't claim I know for sure that it is constant.
If it isn't, you may get overflows when accessing reftable_tmp[].

(Yes, it may be constant; but the reader of this code has to read
through qcow2_free_clusters(), allocate_refcount_block() and
update_refcount() to know (or at least to guess) that's the case.)

I don't really want to look deeper into this, but here's an image that I
produced while trying to somehow break all of this. It makes qemu-img
check pass but fails after qemu-img resize --shrink shrink.qcow2 32M:
https://xanclic.moe/shrink.qcow2

(The image has been created with cluster_size=512 and refcount_bits=64;
then I filled the penultimate two entries of the reftable with pointers
to 0x1f0000 and 0x1f0200, respectively (so the first of these refblocks
would describe both), giving me this:
https://xanclic.moe/shrink-template.qcow2
I then put some data onto it with qemu-io -c 'write 0 1457K', which gave
me shrink.qcow2.)

Max

>> OTOH, the bdrv_pwrite_sync() has accessed only the reftable and this
>> should only access refblocks. So I cannot think of any way this might
>> actually do something bad. But I guess it'll be better for to revisit
>> this when it's not in the middle of the night (so on Friday).
>>
>>> +            s->refcount_table[i] = 0;
>>> +        }
>>> +    }
>>> +
>>> +out:
>>> +    g_free(reftable_tmp);
>>> +    return ret;
>>> +}
>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>> index b3ba5daa93..0ad46d2776 100644
>>> --- a/block/qcow2.c
>>> +++ b/block/qcow2.c
>>> @@ -2545,6 +2545,7 @@ static int qcow2_truncate(BlockDriverState *bs,
>>> int64_t offset, Error **errp)
>>>   {
>>>       BDRVQcow2State *s = bs->opaque;
>>>       int64_t new_l1_size;
>>> +    uint64_t total_size;
>>>       int ret;
>>>         if (offset & 511) {
>>> @@ -2558,17 +2559,36 @@ static int qcow2_truncate(BlockDriverState
>>> *bs, int64_t offset, Error **errp)
>>>           return -ENOTSUP;
>>>       }
>>>   -    /* shrinking is currently not supported */
>>> -    if (offset < bs->total_sectors * 512) {
>>> -        error_setg(errp, "qcow2 doesn't support shrinking images yet");
>>> -        return -ENOTSUP;
>>> -    }
>>> -
>>>       new_l1_size = size_to_l1(s, offset);
>>> -    ret = qcow2_grow_l1_table(bs, new_l1_size, true);
>>> -    if (ret < 0) {
>>> -        error_setg_errno(errp, -ret, "Failed to grow the L1 table");
>>> -        return ret;
>>> +    total_size = bs->total_sectors << BDRV_SECTOR_BITS;
>>> +
>>> +    if (offset < total_size) {
>>> +        ret = qcow2_cluster_discard(bs, ROUND_UP(offset,
>>> s->cluster_size),
>>> +                                    total_size - ROUND_UP(offset,
>>> +                                                         
>>> s->cluster_size),
>>> +                                    QCOW2_DISCARD_ALWAYS, true);
>>> +        if (ret < 0) {
>>> +            error_setg_errno(errp, -ret, "Failed to discard reduced
>>> clasters");
>>
>> s/clasters/clusters/
>>
>> And maybe "truncated", "stripped", or "cropped" instead of "reduced"?
>>
>>> +            return ret;
>>> +        }
>>> +
>>> +        ret = qcow2_shrink_l1_table(bs, new_l1_size);
>>> +        if (ret < 0) {
>>> +            error_setg_errno(errp, -ret, "Failed to reduce the L1
>>> table");
>>
>> s/reduce/shrink/ (or "truncate"; or "reduce the L1 table size")
>>
>> Also, to be fair, you're actually reducing the number of L2 tables, not
>> the size of the L1 table. (But that's a nit pick)
> 
> In the previous patch version, there really was reducing the L1 table
> size :) I think now it's better to fix the error message.
> 
>>> +            return ret;
>>> +        }
>>> +
>>> +        ret = qcow2_shrink_reftable(bs);
>>> +        if (ret < 0) {
>>> +            error_setg_errno(errp, -ret, "Failed to shrink the
>>> refcount table");
>>
>> And this is not really shrinking the reftable but instead discarding
>> some refblocks (potentially). (This is a nit pick, too)
>>
>> Max
>>
>>> +            return ret;
>>> +        }
>>> +    } else {
>>> +        ret = qcow2_grow_l1_table(bs, new_l1_size, true);
>>> +        if (ret < 0) {
>>> +            error_setg_errno(errp, -ret, "Failed to grow the L1
>>> table");
>>> +            return ret;
>>> +        }
>>>       }
>>>         /* write updated header.size */
>>> diff --git a/block/qcow2.h b/block/qcow2.h
>>> index 07faa6dc78..600463bf8e 100644
>>> --- a/block/qcow2.h
>>> +++ b/block/qcow2.h
>>> @@ -531,10 +531,12 @@ int
>>> qcow2_pre_write_overlap_check(BlockDriverState *bs, int ign, int64_t
>>> offset,
>>>   int qcow2_change_refcount_order(BlockDriverState *bs, int
>>> refcount_order,
>>>                                   BlockDriverAmendStatusCB *status_cb,
>>>                                   void *cb_opaque, Error **errp);
>>> +int qcow2_shrink_reftable(BlockDriverState *bs);
>>>     /* qcow2-cluster.c functions */
>>>   int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
>>>                           bool exact_size);
>>> +int qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t max_size);
>>>   int qcow2_write_l1_entry(BlockDriverState *bs, int l1_index);
>>>   int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t
>>> cluster_offset);
>>>   int qcow2_encrypt_sectors(BDRVQcow2State *s, int64_t sector_num,
>>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>>> index f85c2235c7..bcbffa3339 100644
>>> --- a/qapi/block-core.json
>>> +++ b/qapi/block-core.json
>>> @@ -2372,7 +2372,8 @@
>>>               'cluster_alloc_bytes', 'cluster_free', 'flush_to_os',
>>>               'flush_to_disk', 'pwritev_rmw_head',
>>> 'pwritev_rmw_after_head',
>>>               'pwritev_rmw_tail', 'pwritev_rmw_after_tail', 'pwritev',
>>> -            'pwritev_zero', 'pwritev_done', 'empty_image_prepare' ] }
>>> +            'pwritev_zero', 'pwritev_done', 'empty_image_prepare',
>>> +            'l1_shrink_write_table', 'l1_shrink_free_l2_clusters' ] }
>>>     ##
>>>   # @BlkdebugInjectErrorOptions:
>>>
>>
>>



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 498 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH v2 4/4] qemu-iotests: add shrinking image test
  2017-06-13 12:16 ` [Qemu-devel] [PATCH v2 4/4] qemu-iotests: add shrinking image test Pavel Butsykin
@ 2017-06-23 15:47   ` Max Reitz
  0 siblings, 0 replies; 21+ messages in thread
From: Max Reitz @ 2017-06-23 15:47 UTC (permalink / raw)
  To: Pavel Butsykin, qemu-block, qemu-devel; +Cc: kwolf, eblake, armbru, den

[-- Attachment #1: Type: text/plain, Size: 648 bytes --]

On 2017-06-13 14:16, Pavel Butsykin wrote:
> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
> ---
>  tests/qemu-iotests/163     | 113 +++++++++++++++++++++++++++++++++++++++++++++
>  tests/qemu-iotests/163.out |   5 ++
>  tests/qemu-iotests/group   |   1 +
>  3 files changed, 119 insertions(+)
>  create mode 100644 tests/qemu-iotests/163
>  create mode 100644 tests/qemu-iotests/163.out

Ideally this test should contain tests for how the L1 and refcount table
shrinking functions actually behave (like what we have for growing
images). Right now we don't know whether they do anything in these test
cases at all.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 498 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH v2 3/4] qcow2: add shrink image support
  2017-06-23 15:46       ` Max Reitz
@ 2017-06-26 15:23         ` Pavel Butsykin
  2017-06-26 17:47           ` Max Reitz
  0 siblings, 1 reply; 21+ messages in thread
From: Pavel Butsykin @ 2017-06-26 15:23 UTC (permalink / raw)
  To: Max Reitz, qemu-block, qemu-devel; +Cc: kwolf, eblake, armbru, den

On 23.06.2017 18:46, Max Reitz wrote:
> On 2017-06-22 15:57, Pavel Butsykin wrote:
>>
>> On 22.06.2017 01:55, Max Reitz wrote:
>>> On 2017-06-13 14:16, Pavel Butsykin wrote:
[]
>>>> +        }
>>>> +        qcow2_cache_put(bs, s->refcount_block_cache, &refblock);
>>>> +
>>>> +        reftable_tmp[i] = unused_block ? 0 :
>>>> cpu_to_be64(s->refcount_table[i]);
>>>> +    }
>>>> +
>>>> +    ret = bdrv_pwrite_sync(bs->file, s->refcount_table_offset,
>>>> reftable_tmp,
>>>> +                           sizeof(uint64_t) * s->refcount_table_size);
>>>> +    if (ret < 0) {
>>>> +        goto out;
>>>> +    }
>>>> +
>>>> +    for (i = 0; i < s->refcount_table_size; i++) {
>>>> +        if (s->refcount_table[i] && !reftable_tmp[i]) {
>>>> +            qcow2_free_clusters(bs, s->refcount_table[i] &
>>>> REFT_OFFSET_MASK,
>>>> +                                s->cluster_size, QCOW2_DISCARD_ALWAYS);
>>>
>>> This doesn't feel like a very good idea. The bdrv_pwrite_sync() before
>>> has brought the on-disk refcount structures into a different state than
>>> what we have cached.
>>
>> It is for this inside qcow2_free_clusters()->update_refcount() the cache
>> is discarded by qcow2_cache_discard().
> 
> This doesn't change the fact that the in-memory reftable is different
> from the on-disk reftable and that qcow2_free_clusters() may trip up on
> that; the main issue is the allocate_refcount_block() call before.

before what?

If we are talking about allocate_refcount_block() calls after
bdrv_pwrite_sync(), then... Inside allocate_refcount_block() will always
be called load_refcount_block(), what actually is not so dangerous even
if refcount_block_cache is empty. Because the refblock offset will
always be taken from s->refcount_table.

> So we need a guarantee that update_refcount() won't touch the reftable
> if the refcount is decreased. It will call alloc_refcount_block() and
> that should definitely find the respective refblock to already exist
> because of course it has a refcount already.

We don't touch the refblocks which contain references to other
refblocks, this ensures that update_refcount() will not try to raise
the discarded refblock.

> But here's an issue: It tries to read from s->refcount_table[], and you
> are slowly overwriting it in the same loop here. So it may not actually
> find the refcount (if a refblock is described by an earlier one).
> (After more than an hour of debugging, I realized this is not true: You
> will only zero reftable entries if the refblock describes nothing or
> only themselves. So overwriting one reftable entry cannot have effects
> on other refblocks. Or at least it should not.)

As you've noticed, here uses a simple approach:
We discard only refblocks that contain nothing or own reference. If we
have a refblock that is actually empty, but contains a reference to
another empty refblock, we don't touch this refblock. Maybe it's not the
best solution, but at least it's simple and secure.

There is another approach that can be applied here:

1. decrease the refcounts for all refblocks
2. find all empty refblocks
3. increase the refcounts for all refblocks
4. rewrite the refcount_table on disk (with the empty reftable entries)
5. release all the emptt reblocks in reverse order (start at the end of 
the s->refcount_table)

This will certainly allow us to get rid of all empty reblocks, but the
code will be less welcoming :) Also the case when the refblock contains
a reference to another refblock is quite rare.

> Another potential issue is that you're assuming s->refcount_table_size
> to be constant. I cannot find a way for it not to be, but investigating
> this is painful and I can't claim I know for sure that it is constant.
> If it isn't, you may get overflows when accessing reftable_tmp[].
> 
> (Yes, it may be constant; but the reader of this code has to read
> through qcow2_free_clusters(), allocate_refcount_block() and
> update_refcount() to know (or at least to guess) that's the case.)

Is there any guarantee that in the future this will not change? Because
in this case it can be a potential danger.

I can add a comment... Or add a new variable with the size of
reftable_tmp, and every time count min(s->refcount_table_size, 
reftable_tmp_size)
before accessing to s->refcount_table[]/reftable_tmp[]

> I don't really want to look deeper into this, but here's an image that I
> produced while trying to somehow break all of this. It makes qemu-img
> check pass but fails after qemu-img resize --shrink shrink.qcow2 32M:
> https://xanclic.moe/shrink.qcow2
> 
> (The image has been created with cluster_size=512 and refcount_bits=64;
> then I filled the penultimate two entries of the reftable with pointers
> to 0x1f0000 and 0x1f0200, respectively (so the first of these refblocks
> would describe both), giving me this:
> https://xanclic.moe/shrink-template.qcow2
> I then put some data onto it with qemu-io -c 'write 0 1457K', which gave
> me shrink.qcow2.)
>

Thank you for the samples! The mistake was quite naive :) , I just
messed up the sizes here:
             unused_block = buffer_is_zero(refblock, ->refcount_block_size);

             s->set_refcount(refblock, blk_index, refcount);
         } else {
             unused_block = buffer_is_zero(refblock, 
s->refcount_block_size);

Should be:
buffer_is_zero(refblock, s->cluster_size);

> Max
> 
>>> OTOH, the bdrv_pwrite_sync() has accessed only the reftable and this
>>> should only access refblocks. So I cannot think of any way this might
>>> actually do something bad. But I guess it'll be better for to revisit
>>> this when it's not in the middle of the night (so on Friday).
>>>
>>>> +            s->refcount_table[i] = 0;
>>>> +        }
>>>> +    }
>>>> +
>>>> +out:
>>>> +    g_free(reftable_tmp);
>>>> +    return ret;
>>>> +}
>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>> index b3ba5daa93..0ad46d2776 100644
>>>> --- a/block/qcow2.c
>>>> +++ b/block/qcow2.c
>>>> @@ -2545,6 +2545,7 @@ static int qcow2_truncate(BlockDriverState *bs,
>>>> int64_t offset, Error **errp)
>>>>    {
>>>>        BDRVQcow2State *s = bs->opaque;
>>>>        int64_t new_l1_size;
>>>> +    uint64_t total_size;
>>>>        int ret;
>>>>          if (offset & 511) {
>>>> @@ -2558,17 +2559,36 @@ static int qcow2_truncate(BlockDriverState
>>>> *bs, int64_t offset, Error **errp)
>>>>            return -ENOTSUP;
>>>>        }
>>>>    -    /* shrinking is currently not supported */
>>>> -    if (offset < bs->total_sectors * 512) {
>>>> -        error_setg(errp, "qcow2 doesn't support shrinking images yet");
>>>> -        return -ENOTSUP;
>>>> -    }
>>>> -
>>>>        new_l1_size = size_to_l1(s, offset);
>>>> -    ret = qcow2_grow_l1_table(bs, new_l1_size, true);
>>>> -    if (ret < 0) {
>>>> -        error_setg_errno(errp, -ret, "Failed to grow the L1 table");
>>>> -        return ret;
>>>> +    total_size = bs->total_sectors << BDRV_SECTOR_BITS;
>>>> +
>>>> +    if (offset < total_size) {
>>>> +        ret = qcow2_cluster_discard(bs, ROUND_UP(offset,
>>>> s->cluster_size),
>>>> +                                    total_size - ROUND_UP(offset,
>>>> +
>>>> s->cluster_size),
>>>> +                                    QCOW2_DISCARD_ALWAYS, true);
>>>> +        if (ret < 0) {
>>>> +            error_setg_errno(errp, -ret, "Failed to discard reduced
>>>> clasters");
>>>
>>> s/clasters/clusters/
>>>
>>> And maybe "truncated", "stripped", or "cropped" instead of "reduced"?
>>>
>>>> +            return ret;
>>>> +        }
>>>> +
>>>> +        ret = qcow2_shrink_l1_table(bs, new_l1_size);
>>>> +        if (ret < 0) {
>>>> +            error_setg_errno(errp, -ret, "Failed to reduce the L1
>>>> table");
>>>
>>> s/reduce/shrink/ (or "truncate"; or "reduce the L1 table size")
>>>
>>> Also, to be fair, you're actually reducing the number of L2 tables, not
>>> the size of the L1 table. (But that's a nit pick)
>>
>> In the previous patch version, there really was reducing the L1 table
>> size :) I think now it's better to fix the error message.
>>
>>>> +            return ret;
>>>> +        }
>>>> +
>>>> +        ret = qcow2_shrink_reftable(bs);
>>>> +        if (ret < 0) {
>>>> +            error_setg_errno(errp, -ret, "Failed to shrink the
>>>> refcount table");
>>>
>>> And this is not really shrinking the reftable but instead discarding
>>> some refblocks (potentially). (This is a nit pick, too)
>>>
>>> Max
>>>
>>>> +            return ret;
>>>> +        }
>>>> +    } else {
>>>> +        ret = qcow2_grow_l1_table(bs, new_l1_size, true);
>>>> +        if (ret < 0) {
>>>> +            error_setg_errno(errp, -ret, "Failed to grow the L1
>>>> table");
>>>> +            return ret;
>>>> +        }
>>>>        }
>>>>          /* write updated header.size */
>>>> diff --git a/block/qcow2.h b/block/qcow2.h
>>>> index 07faa6dc78..600463bf8e 100644
>>>> --- a/block/qcow2.h
>>>> +++ b/block/qcow2.h
>>>> @@ -531,10 +531,12 @@ int
>>>> qcow2_pre_write_overlap_check(BlockDriverState *bs, int ign, int64_t
>>>> offset,
>>>>    int qcow2_change_refcount_order(BlockDriverState *bs, int
>>>> refcount_order,
>>>>                                    BlockDriverAmendStatusCB *status_cb,
>>>>                                    void *cb_opaque, Error **errp);
>>>> +int qcow2_shrink_reftable(BlockDriverState *bs);
>>>>      /* qcow2-cluster.c functions */
>>>>    int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
>>>>                            bool exact_size);
>>>> +int qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t max_size);
>>>>    int qcow2_write_l1_entry(BlockDriverState *bs, int l1_index);
>>>>    int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t
>>>> cluster_offset);
>>>>    int qcow2_encrypt_sectors(BDRVQcow2State *s, int64_t sector_num,
>>>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>>>> index f85c2235c7..bcbffa3339 100644
>>>> --- a/qapi/block-core.json
>>>> +++ b/qapi/block-core.json
>>>> @@ -2372,7 +2372,8 @@
>>>>                'cluster_alloc_bytes', 'cluster_free', 'flush_to_os',
>>>>                'flush_to_disk', 'pwritev_rmw_head',
>>>> 'pwritev_rmw_after_head',
>>>>                'pwritev_rmw_tail', 'pwritev_rmw_after_tail', 'pwritev',
>>>> -            'pwritev_zero', 'pwritev_done', 'empty_image_prepare' ] }
>>>> +            'pwritev_zero', 'pwritev_done', 'empty_image_prepare',
>>>> +            'l1_shrink_write_table', 'l1_shrink_free_l2_clusters' ] }
>>>>      ##
>>>>    # @BlkdebugInjectErrorOptions:
>>>>
>>>
>>>
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH v2 3/4] qcow2: add shrink image support
  2017-06-26 15:23         ` Pavel Butsykin
@ 2017-06-26 17:47           ` Max Reitz
  2017-06-27 15:06             ` Pavel Butsykin
  0 siblings, 1 reply; 21+ messages in thread
From: Max Reitz @ 2017-06-26 17:47 UTC (permalink / raw)
  To: Pavel Butsykin, qemu-block, qemu-devel; +Cc: kwolf, eblake, armbru, den

[-- Attachment #1: Type: text/plain, Size: 7443 bytes --]

On 2017-06-26 17:23, Pavel Butsykin wrote:
> On 23.06.2017 18:46, Max Reitz wrote:
>> On 2017-06-22 15:57, Pavel Butsykin wrote:
>>>
>>> On 22.06.2017 01:55, Max Reitz wrote:
>>>> On 2017-06-13 14:16, Pavel Butsykin wrote:
> []
>>>>> +        }
>>>>> +        qcow2_cache_put(bs, s->refcount_block_cache, &refblock);
>>>>> +
>>>>> +        reftable_tmp[i] = unused_block ? 0 :
>>>>> cpu_to_be64(s->refcount_table[i]);
>>>>> +    }
>>>>> +
>>>>> +    ret = bdrv_pwrite_sync(bs->file, s->refcount_table_offset,
>>>>> reftable_tmp,
>>>>> +                           sizeof(uint64_t) *
>>>>> s->refcount_table_size);
>>>>> +    if (ret < 0) {
>>>>> +        goto out;
>>>>> +    }
>>>>> +
>>>>> +    for (i = 0; i < s->refcount_table_size; i++) {
>>>>> +        if (s->refcount_table[i] && !reftable_tmp[i]) {
>>>>> +            qcow2_free_clusters(bs, s->refcount_table[i] &
>>>>> REFT_OFFSET_MASK,
>>>>> +                                s->cluster_size,
>>>>> QCOW2_DISCARD_ALWAYS);
>>>>
>>>> This doesn't feel like a very good idea. The bdrv_pwrite_sync() before
>>>> has brought the on-disk refcount structures into a different state than
>>>> what we have cached.
>>>
>>> It is for this inside qcow2_free_clusters()->update_refcount() the cache
>>> is discarded by qcow2_cache_discard().
>>
>> This doesn't change the fact that the in-memory reftable is different
>> from the on-disk reftable and that qcow2_free_clusters() may trip up on
>> that; the main issue is the allocate_refcount_block() call before.

*alloc_refcount_block(), sorry.

> before what?

Before qcow2_cache_discard() is called.

> If we are talking about allocate_refcount_block() calls after
> bdrv_pwrite_sync(), then... Inside allocate_refcount_block() will always
> be called load_refcount_block(), what actually is not so dangerous even
> if refcount_block_cache is empty. Because the refblock offset will
> always be taken from s->refcount_table.

Well, yes, and this offset has not been cleared yet, so it still points
to the old refblock (but the on-disk reftable does not, and this worries
me).

>> So we need a guarantee that update_refcount() won't touch the reftable
>> if the refcount is decreased. It will call alloc_refcount_block() and
>> that should definitely find the respective refblock to already exist
>> because of course it has a refcount already.
> 
> We don't touch the refblocks which contain references to other
> refblocks, this ensures that update_refcount() will not try to raise
> the discarded refblock.

It may ensure this in practice, yes, but I found proving this to be
rather difficult.

>> But here's an issue: It tries to read from s->refcount_table[], and you
>> are slowly overwriting it in the same loop here. So it may not actually
>> find the refcount (if a refblock is described by an earlier one).
>> (After more than an hour of debugging, I realized this is not true: You
>> will only zero reftable entries if the refblock describes nothing or
>> only themselves. So overwriting one reftable entry cannot have effects
>> on other refblocks. Or at least it should not.)
> 
> As you've noticed, here uses a simple approach:

Well, if it is a simple approach, I'm just very dense.

> We discard only refblocks that contain nothing or own reference. If we
> have a refblock that is actually empty, but contains a reference to
> another empty refblock, we don't touch this refblock. Maybe it's not the
> best solution, but at least it's simple and secure.

The theory is simple, yes, which is why I found it fine until the point
you decide to call usual refcount management functions with the on-disk
data differing from what is cached as s->refcount_table.

> There is another approach that can be applied here:
> 
> 1. decrease the refcounts for all refblocks

I think this will leave a broken image at this point, so I'd rather
avoid it.

> 2. find all empty refblocks
> 3. increase the refcounts for all refblocks
> 4. rewrite the refcount_table on disk (with the empty reftable entries)
> 5. release all the emptt reblocks in reverse order (start at the end of
> the s->refcount_table)
> 
> This will certainly allow us to get rid of all empty reblocks, but the
> code will be less welcoming :) Also the case when the refblock contains
> a reference to another refblock is quite rare.

Another way would be to invoke the function for dropping empty refblocks
repeatedly until no refblocks can be dropped anymore. This wouldn't
cover cyclic references, but I think that's fine.

In any case, though, I agree that we don't need to put too much work
into this. Refblocks by default just use around 0.03 % of what's needed
for data, so...

>> Another potential issue is that you're assuming s->refcount_table_size
>> to be constant. I cannot find a way for it not to be, but investigating
>> this is painful and I can't claim I know for sure that it is constant.
>> If it isn't, you may get overflows when accessing reftable_tmp[].
>>
>> (Yes, it may be constant; but the reader of this code has to read
>> through qcow2_free_clusters(), allocate_refcount_block() and
>> update_refcount() to know (or at least to guess) that's the case.)
> 
> Is there any guarantee that in the future this will not change? Because
> in this case it can be a potential danger.

Since this behavior is not documented anywhere, there is no guarantee.

> I can add a comment... Or add a new variable with the size of
> reftable_tmp, and every time count min(s->refcount_table_size,
> reftable_tmp_size)
> before accessing to s->refcount_table[]/reftable_tmp[]

Or (1) you add an assertion that refcount_table_size doesn't change
along with a comment why that is the case, which also explains in detail
why the call to qcow2_free_clusters() should be safe: The on-disk
reftable differs from the one in memory. qcow2_free_clusters()and
update_refcount() themselves do not access the reftable, so they are
safe. However, update_refcount() calls alloc_refcount_block(), and that
function does access the reftable: Now, as long as
s->refcount_table_size does not shrink (which I can't see why it would),
refcount_table_index should always be smaller. Now we're accessing
s->refcount_table: This will always return an existing refblock because
this will either be the refblock itself (for self-referencing refblocks)
or another one that is not going to be freed by qcow2_shrink_reftable()
because this function will not free refblocks which cover other clusters
than themselves.
We will then proceed to update the refblock which is either right (if it
is not the refblock to be freed) or won't do anything (if it is the one
to be freed).
In any case, we will never write to the reftable and reading from the
basically outdated cached version will never do anything bad.

Or (2) you copy reftable_tmp into s->refcount_table[] *before* any call
to qcow2_free_clusters(). To make this work, you would need to also
discard all refblocks from the cache in this function here (and not in
update_refcount()) and then only call qcow2_free_clusters() on refblocks
which were not self-referencing. An alternative hack would be to simply
mark the image dirty and just not do any qcow2_free_clusters() call...

Or (3) of course it would be possible to not clean up refcount
structures at all...

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 498 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH v2 3/4] qcow2: add shrink image support
  2017-06-26 17:47           ` Max Reitz
@ 2017-06-27 15:06             ` Pavel Butsykin
  2017-06-28 13:59               ` Max Reitz
  0 siblings, 1 reply; 21+ messages in thread
From: Pavel Butsykin @ 2017-06-27 15:06 UTC (permalink / raw)
  To: Max Reitz, qemu-block, qemu-devel; +Cc: kwolf, eblake, armbru, den

On 26.06.2017 20:47, Max Reitz wrote:
> On 2017-06-26 17:23, Pavel Butsykin wrote:
[]
>>
>> Is there any guarantee that in the future this will not change? Because
>> in this case it can be a potential danger.
> 
> Since this behavior is not documented anywhere, there is no guarantee.
> 
>> I can add a comment... Or add a new variable with the size of
>> reftable_tmp, and every time count min(s->refcount_table_size,
>> reftable_tmp_size)
>> before accessing to s->refcount_table[]/reftable_tmp[]
> 
> Or (1) you add an assertion that refcount_table_size doesn't change
> along with a comment why that is the case, which also explains in detail
> why the call to qcow2_free_clusters() should be safe: The on-disk
> reftable differs from the one in memory. qcow2_free_clusters()and
> update_refcount() themselves do not access the reftable, so they are
> safe. However, update_refcount() calls alloc_refcount_block(), and that
> function does access the reftable: Now, as long as
> s->refcount_table_size does not shrink (which I can't see why it would),
> refcount_table_index should always be smaller. Now we're accessing
> s->refcount_table: This will always return an existing refblock because
> this will either be the refblock itself (for self-referencing refblocks)
> or another one that is not going to be freed by qcow2_shrink_reftable()
> because this function will not free refblocks which cover other clusters
> than themselves.
> We will then proceed to update the refblock which is either right (if it
> is not the refblock to be freed) or won't do anything (if it is the one
> to be freed).
> In any case, we will never write to the reftable and reading from the
> basically outdated cached version will never do anything bad.

OK, SGTM.

> Or (2) you copy reftable_tmp into s->refcount_table[] *before* any call
> to qcow2_free_clusters(). To make this work, you would need to also
> discard all refblocks from the cache in this function here (and not in
> update_refcount()) and then only call qcow2_free_clusters() on refblocks
> which were not self-referencing. An alternative hack would be to simply
> mark the image dirty and just not do any qcow2_free_clusters() call...

The main purpose of qcow2_reftable_shrink() function is discard all
unnecessary refblocks from the file. If we do only rewrite
refcount_table and discard non-self-referencing refblocks (which are
actually very rare), then the meaning of the function is lost.

> Or (3) of course it would be possible to not clean up refcount
> structures at all...

Nice solution :)

> Max
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH v2 3/4] qcow2: add shrink image support
  2017-06-27 15:06             ` Pavel Butsykin
@ 2017-06-28 13:59               ` Max Reitz
  2017-06-28 15:31                 ` Pavel Butsykin
  0 siblings, 1 reply; 21+ messages in thread
From: Max Reitz @ 2017-06-28 13:59 UTC (permalink / raw)
  To: Pavel Butsykin, qemu-block, qemu-devel; +Cc: kwolf, eblake, armbru, den

[-- Attachment #1: Type: text/plain, Size: 3023 bytes --]

On 2017-06-27 17:06, Pavel Butsykin wrote:
> On 26.06.2017 20:47, Max Reitz wrote:
>> On 2017-06-26 17:23, Pavel Butsykin wrote:
> []
>>>
>>> Is there any guarantee that in the future this will not change? Because
>>> in this case it can be a potential danger.
>>
>> Since this behavior is not documented anywhere, there is no guarantee.
>>
>>> I can add a comment... Or add a new variable with the size of
>>> reftable_tmp, and every time count min(s->refcount_table_size,
>>> reftable_tmp_size)
>>> before accessing to s->refcount_table[]/reftable_tmp[]
>>
>> Or (1) you add an assertion that refcount_table_size doesn't change
>> along with a comment why that is the case, which also explains in detail
>> why the call to qcow2_free_clusters() should be safe: The on-disk
>> reftable differs from the one in memory. qcow2_free_clusters()and
>> update_refcount() themselves do not access the reftable, so they are
>> safe. However, update_refcount() calls alloc_refcount_block(), and that
>> function does access the reftable: Now, as long as
>> s->refcount_table_size does not shrink (which I can't see why it would),
>> refcount_table_index should always be smaller. Now we're accessing
>> s->refcount_table: This will always return an existing refblock because
>> this will either be the refblock itself (for self-referencing refblocks)
>> or another one that is not going to be freed by qcow2_shrink_reftable()
>> because this function will not free refblocks which cover other clusters
>> than themselves.
>> We will then proceed to update the refblock which is either right (if it
>> is not the refblock to be freed) or won't do anything (if it is the one
>> to be freed).
>> In any case, we will never write to the reftable and reading from the
>> basically outdated cached version will never do anything bad.
> 
> OK, SGTM.
> 
>> Or (2) you copy reftable_tmp into s->refcount_table[] *before* any call
>> to qcow2_free_clusters(). To make this work, you would need to also
>> discard all refblocks from the cache in this function here (and not in
>> update_refcount()) and then only call qcow2_free_clusters() on refblocks
>> which were not self-referencing. An alternative hack would be to simply
>> mark the image dirty and just not do any qcow2_free_clusters() call...
> 
> The main purpose of qcow2_reftable_shrink() function is discard all
> unnecessary refblocks from the file. If we do only rewrite
> refcount_table and discard non-self-referencing refblocks (which are
> actually very rare), then the meaning of the function is lost.

It would do exactly the same. The idea is that you do not need to call
qcow2_free_clusters() on self-referencing refblocks at all, since they
are freed automatically when their reftable entry is overwritten with 0.

>> Or (3) of course it would be possible to not clean up refcount
>> structures at all...
> 
> Nice solution :)

It is, because as I said refcount structures only have a small overhead.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 498 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH v2 3/4] qcow2: add shrink image support
  2017-06-28 13:59               ` Max Reitz
@ 2017-06-28 15:31                 ` Pavel Butsykin
  2017-06-28 23:36                   ` Max Reitz
  0 siblings, 1 reply; 21+ messages in thread
From: Pavel Butsykin @ 2017-06-28 15:31 UTC (permalink / raw)
  To: Max Reitz, qemu-block, qemu-devel; +Cc: kwolf, eblake, armbru, den

On 28.06.2017 16:59, Max Reitz wrote:
> On 2017-06-27 17:06, Pavel Butsykin wrote:
>> On 26.06.2017 20:47, Max Reitz wrote:
>>> On 2017-06-26 17:23, Pavel Butsykin wrote:
>> []
>>>>
>>>> Is there any guarantee that in the future this will not change? Because
>>>> in this case it can be a potential danger.
>>>
>>> Since this behavior is not documented anywhere, there is no guarantee.
>>>
>>>> I can add a comment... Or add a new variable with the size of
>>>> reftable_tmp, and every time count min(s->refcount_table_size,
>>>> reftable_tmp_size)
>>>> before accessing to s->refcount_table[]/reftable_tmp[]
>>>
>>> Or (1) you add an assertion that refcount_table_size doesn't change
>>> along with a comment why that is the case, which also explains in detail
>>> why the call to qcow2_free_clusters() should be safe: The on-disk
>>> reftable differs from the one in memory. qcow2_free_clusters()and
>>> update_refcount() themselves do not access the reftable, so they are
>>> safe. However, update_refcount() calls alloc_refcount_block(), and that
>>> function does access the reftable: Now, as long as
>>> s->refcount_table_size does not shrink (which I can't see why it would),
>>> refcount_table_index should always be smaller. Now we're accessing
>>> s->refcount_table: This will always return an existing refblock because
>>> this will either be the refblock itself (for self-referencing refblocks)
>>> or another one that is not going to be freed by qcow2_shrink_reftable()
>>> because this function will not free refblocks which cover other clusters
>>> than themselves.
>>> We will then proceed to update the refblock which is either right (if it
>>> is not the refblock to be freed) or won't do anything (if it is the one
>>> to be freed).
>>> In any case, we will never write to the reftable and reading from the
>>> basically outdated cached version will never do anything bad.
>>
>> OK, SGTM.
>>
>>> Or (2) you copy reftable_tmp into s->refcount_table[] *before* any call
>>> to qcow2_free_clusters(). To make this work, you would need to also
>>> discard all refblocks from the cache in this function here (and not in
>>> update_refcount()) and then only call qcow2_free_clusters() on refblocks
>>> which were not self-referencing. An alternative hack would be to simply
>>> mark the image dirty and just not do any qcow2_free_clusters() call...
>>
>> The main purpose of qcow2_reftable_shrink() function is discard all
>> unnecessary refblocks from the file. If we do only rewrite
>> refcount_table and discard non-self-referencing refblocks (which are
>> actually very rare), then the meaning of the function is lost.
> 
> It would do exactly the same. The idea is that you do not need to call
> qcow2_free_clusters() on self-referencing refblocks at all, since they
> are freed automatically when their reftable entry is overwritten with 0.

Not sure.. For self-referencing refblocks, we also need to do:
1. check if refcount > 1
2. update s->free_cluster_index
3. call update_refcount_discard() (to in the end the fallocate
PUNCH_HOLE was called on refblock offset)

It will be practically a copy-paste from qcow2_free_clusters(), so it is
better to avoid it. I think that if it makes sense to do
qcow2_reftable_shrink(), it is only because we can slightly reduce image
size.

>>> Or (3) of course it would be possible to not clean up refcount
>>> structures at all...
>>
>> Nice solution :)
> 
> It is, because as I said refcount structures only have a small overhead.

Yes, I agree.

> Max
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH v2 3/4] qcow2: add shrink image support
  2017-06-28 15:31                 ` Pavel Butsykin
@ 2017-06-28 23:36                   ` Max Reitz
  0 siblings, 0 replies; 21+ messages in thread
From: Max Reitz @ 2017-06-28 23:36 UTC (permalink / raw)
  To: Pavel Butsykin, qemu-block, qemu-devel; +Cc: kwolf, eblake, armbru, den

[-- Attachment #1: Type: text/plain, Size: 4134 bytes --]

On 2017-06-28 17:31, Pavel Butsykin wrote:
> On 28.06.2017 16:59, Max Reitz wrote:
>> On 2017-06-27 17:06, Pavel Butsykin wrote:
>>> On 26.06.2017 20:47, Max Reitz wrote:
>>>> On 2017-06-26 17:23, Pavel Butsykin wrote:
>>> []
>>>>>
>>>>> Is there any guarantee that in the future this will not change?
>>>>> Because
>>>>> in this case it can be a potential danger.
>>>>
>>>> Since this behavior is not documented anywhere, there is no guarantee.
>>>>
>>>>> I can add a comment... Or add a new variable with the size of
>>>>> reftable_tmp, and every time count min(s->refcount_table_size,
>>>>> reftable_tmp_size)
>>>>> before accessing to s->refcount_table[]/reftable_tmp[]
>>>>
>>>> Or (1) you add an assertion that refcount_table_size doesn't change
>>>> along with a comment why that is the case, which also explains in
>>>> detail
>>>> why the call to qcow2_free_clusters() should be safe: The on-disk
>>>> reftable differs from the one in memory. qcow2_free_clusters()and
>>>> update_refcount() themselves do not access the reftable, so they are
>>>> safe. However, update_refcount() calls alloc_refcount_block(), and that
>>>> function does access the reftable: Now, as long as
>>>> s->refcount_table_size does not shrink (which I can't see why it
>>>> would),
>>>> refcount_table_index should always be smaller. Now we're accessing
>>>> s->refcount_table: This will always return an existing refblock because
>>>> this will either be the refblock itself (for self-referencing
>>>> refblocks)
>>>> or another one that is not going to be freed by qcow2_shrink_reftable()
>>>> because this function will not free refblocks which cover other
>>>> clusters
>>>> than themselves.
>>>> We will then proceed to update the refblock which is either right
>>>> (if it
>>>> is not the refblock to be freed) or won't do anything (if it is the one
>>>> to be freed).
>>>> In any case, we will never write to the reftable and reading from the
>>>> basically outdated cached version will never do anything bad.
>>>
>>> OK, SGTM.
>>>
>>>> Or (2) you copy reftable_tmp into s->refcount_table[] *before* any call
>>>> to qcow2_free_clusters(). To make this work, you would need to also
>>>> discard all refblocks from the cache in this function here (and not in
>>>> update_refcount()) and then only call qcow2_free_clusters() on
>>>> refblocks
>>>> which were not self-referencing. An alternative hack would be to simply
>>>> mark the image dirty and just not do any qcow2_free_clusters() call...
>>>
>>> The main purpose of qcow2_reftable_shrink() function is discard all
>>> unnecessary refblocks from the file. If we do only rewrite
>>> refcount_table and discard non-self-referencing refblocks (which are
>>> actually very rare), then the meaning of the function is lost.
>>
>> It would do exactly the same. The idea is that you do not need to call
>> qcow2_free_clusters() on self-referencing refblocks at all, since they
>> are freed automatically when their reftable entry is overwritten with 0.
> 
> Not sure.. For self-referencing refblocks, we also need to do:
> 1. check if refcount > 1

Yes, if that wasn't an error flagged by qemu-img check. :-)

(http://git.qemu.org/?p=qemu.git;a=blob;f=block/qcow2-refcount.c;h=7c06061aae90eb4f091f51df995a9e099178c0ed;hb=HEAD#l1787)

> 2. update s->free_cluster_index
> 3. call update_refcount_discard() (to in the end the fallocate
> PUNCH_HOLE was called on refblock offset)

These, yes, you'd have to do here.

> It will be practically a copy-paste from qcow2_free_clusters(), so it is
> better to avoid it. I think that if it makes sense to do
> qcow2_reftable_shrink(), it is only because we can slightly reduce image
> size.

But it would be a small copy-paste (although I may very well be wrong)
and it would help me sleep better because I could actually understand it.

Max

>>>> Or (3) of course it would be possible to not clean up refcount
>>>> structures at all...
>>>
>>> Nice solution :)
>>
>> It is, because as I said refcount structures only have a small overhead.
> 
> Yes, I agree.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 498 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2017-06-28 23:36 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-13 12:16 [Qemu-devel] [PATCH v2 0/4] Add shrink image for qcow2 Pavel Butsykin
2017-06-13 12:16 ` [Qemu-devel] [PATCH v2 1/4] qemu-img: add --shrink flag for resize Pavel Butsykin
2017-06-21 22:17   ` Max Reitz
2017-06-22 13:54     ` Pavel Butsykin
2017-06-22 14:49       ` Kevin Wolf
2017-06-22 16:26         ` Pavel Butsykin
2017-06-13 12:16 ` [Qemu-devel] [PATCH v2 2/4] qcow2: add qcow2_cache_discard Pavel Butsykin
2017-06-21 22:29   ` Max Reitz
2017-06-22 13:55     ` Pavel Butsykin
2017-06-13 12:16 ` [Qemu-devel] [PATCH v2 3/4] qcow2: add shrink image support Pavel Butsykin
2017-06-21 22:55   ` Max Reitz
2017-06-22 13:57     ` Pavel Butsykin
2017-06-23 15:46       ` Max Reitz
2017-06-26 15:23         ` Pavel Butsykin
2017-06-26 17:47           ` Max Reitz
2017-06-27 15:06             ` Pavel Butsykin
2017-06-28 13:59               ` Max Reitz
2017-06-28 15:31                 ` Pavel Butsykin
2017-06-28 23:36                   ` Max Reitz
2017-06-13 12:16 ` [Qemu-devel] [PATCH v2 4/4] qemu-iotests: add shrinking image test Pavel Butsykin
2017-06-23 15:47   ` Max Reitz

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.