All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH 0/8] block/qcow2: Improve (?) zero cluster expansion
@ 2014-07-25 18:07 Max Reitz
  2014-07-25 18:07 ` [Qemu-devel] [PATCH 1/8] block: Add status callback to bdrv_amend_options() Max Reitz
                   ` (7 more replies)
  0 siblings, 8 replies; 20+ messages in thread
From: Max Reitz @ 2014-07-25 18:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

The main purpose of this series is to add a progress report to
qemu-img amend. This is achieved by adding a callback function to
bdrv_amend_options() - the reasons for this choice are explained in
patch 1.

While adapting qcow2's expand_zero_clusters_in_l1() accordingly, I
noticed a way to simplify it and get rid of the rather ugly bitmap used
there (patch 6) and even found a way to optimize it (patch 7).

However, Kevin already expressed strong dislike about that patch 7, as
it complicates things in a function which should rarely ever be used.

I personally don't have a strong opinion on the topic. The optimization
should significantly increase the expansion performance; on the other
hand, it makes the code equally more complicated.

Patch 8 does not really depend on patch 7; it contains a test case
specifically for patch 7, but of course it works without that patch just
as well.

Therefore, we can simply drop patch 7 if we don't need it.


In this version, the total size of the "zero cluster expansion job" is
the number of zero cluster entries in all L2 tables multiplied by the
cluster size. This of course requires that number to be calculated
beforehand, which is complicated as well. An easier way (suggested by
Kevin) would be to simply use the number of all L2 entries as a basis.
This would not be as exact, but much easier to implement.

Since I already implemented the more complicated version of everything,
I'm just sending it as-is. I'll write an alternative to patch 5 which
uses the simpler variant and send it for comparison later.


This series depends on v2 of my "qemu-img: Allow source cache mode
specification" seires.


Max Reitz (8):
  block: Add status callback to bdrv_amend_options()
  qemu-img: Add progress output for amend
  qemu-img: Fix insignifcant memleak
  block/qcow2: Make get_refcount() global
  block/qcow2: Implement status CB for amend
  block/qcow2: Simplify shared L2 handling in amend
  block/qcow2: Speed up zero cluster expansion
  iotests: Expand test 061

 block.c                    |   5 +-
 block/qcow2-cluster.c      | 319 ++++++++++++++++++++++++++++++++++-----------
 block/qcow2-refcount.c     |  23 ++--
 block/qcow2.c              |  10 +-
 block/qcow2.h              |   5 +-
 include/block/block.h      |   5 +-
 include/block/block_int.h  |   3 +-
 qemu-img.c                 |  30 ++++-
 tests/qemu-iotests/061     |  41 ++++++
 tests/qemu-iotests/061.out |  40 ++++++
 tests/qemu-iotests/group   |   2 +-
 11 files changed, 379 insertions(+), 104 deletions(-)

-- 
2.0.1

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH 1/8] block: Add status callback to bdrv_amend_options()
  2014-07-25 18:07 [Qemu-devel] [PATCH 0/8] block/qcow2: Improve (?) zero cluster expansion Max Reitz
@ 2014-07-25 18:07 ` Max Reitz
  2014-07-30 14:50   ` Eric Blake
  2014-07-25 18:07 ` [Qemu-devel] [PATCH 2/8] qemu-img: Add progress output for amend Max Reitz
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Max Reitz @ 2014-07-25 18:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

Depending on the changed options and the image format,
bdrv_amend_options() may take a significant amount of time. In these
cases, a way to be informed about the operation's status is desirable.

Since the operation is rather complex and may fundamentally change the
image, implementing it as AIO or a couroutine does not seem feasible. On
the other hand, implementing it as a block job would be significantly
more difficult than a simple callback and would not add benefits other
than progress report to the amending operation, because it should not
actually be run as a block job at all.

A callback may not be very pretty, but it's very easy to implement and
perfectly fits its purpose here.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block.c                   | 5 +++--
 block/qcow2.c             | 5 ++++-
 include/block/block.h     | 5 ++++-
 include/block/block_int.h | 3 ++-
 qemu-img.c                | 2 +-
 5 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/block.c b/block.c
index 3e252a2..4c8d4d8 100644
--- a/block.c
+++ b/block.c
@@ -5708,12 +5708,13 @@ void bdrv_add_before_write_notifier(BlockDriverState *bs,
     notifier_with_return_list_add(&bs->before_write_notifiers, notifier);
 }
 
-int bdrv_amend_options(BlockDriverState *bs, QemuOpts *opts)
+int bdrv_amend_options(BlockDriverState *bs, QemuOpts *opts,
+                       BlockDriverAmendStatusCB *status_cb)
 {
     if (!bs->drv->bdrv_amend_options) {
         return -ENOTSUP;
     }
-    return bs->drv->bdrv_amend_options(bs, opts);
+    return bs->drv->bdrv_amend_options(bs, opts, status_cb);
 }
 
 /* This function will be called by the bdrv_recurse_is_first_non_filter method
diff --git a/block/qcow2.c b/block/qcow2.c
index b0faa69..757f890 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2210,7 +2210,8 @@ static int qcow2_downgrade(BlockDriverState *bs, int target_version)
     return 0;
 }
 
-static int qcow2_amend_options(BlockDriverState *bs, QemuOpts *opts)
+static int qcow2_amend_options(BlockDriverState *bs, QemuOpts *opts,
+                               BlockDriverAmendStatusCB *status_cb)
 {
     BDRVQcowState *s = bs->opaque;
     int old_version = s->qcow_version, new_version = old_version;
@@ -2223,6 +2224,8 @@ static int qcow2_amend_options(BlockDriverState *bs, QemuOpts *opts)
     int ret;
     QemuOptDesc *desc = opts->list->desc;
 
+    (void)status_cb;
+
     while (desc && desc->name) {
         if (!qemu_opt_find(opts, desc->name)) {
             /* only change explicitly defined options */
diff --git a/include/block/block.h b/include/block/block.h
index 32d3676..f2b1799 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -309,7 +309,10 @@ typedef enum {
 
 int bdrv_check(BlockDriverState *bs, BdrvCheckResult *res, BdrvCheckMode fix);
 
-int bdrv_amend_options(BlockDriverState *bs_new, QemuOpts *opts);
+typedef void BlockDriverAmendStatusCB(BlockDriverState *bs, int64_t offset,
+                                      int64_t total_work_size);
+int bdrv_amend_options(BlockDriverState *bs_new, QemuOpts *opts,
+                       BlockDriverAmendStatusCB *status_cb);
 
 /* external snapshots */
 bool bdrv_recurse_is_first_non_filter(BlockDriverState *bs,
diff --git a/include/block/block_int.h b/include/block/block_int.h
index f6c3bef..5c5798d 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -228,7 +228,8 @@ struct BlockDriver {
     int (*bdrv_check)(BlockDriverState* bs, BdrvCheckResult *result,
         BdrvCheckMode fix);
 
-    int (*bdrv_amend_options)(BlockDriverState *bs, QemuOpts *opts);
+    int (*bdrv_amend_options)(BlockDriverState *bs, QemuOpts *opts,
+                              BlockDriverAmendStatusCB *status_cb);
 
     void (*bdrv_debug_event)(BlockDriverState *bs, BlkDebugEvent event);
 
diff --git a/qemu-img.c b/qemu-img.c
index ef74ae4..90d6b79 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2827,7 +2827,7 @@ static int img_amend(int argc, char **argv)
         goto out;
     }
 
-    ret = bdrv_amend_options(bs, opts);
+    ret = bdrv_amend_options(bs, opts, NULL);
     if (ret < 0) {
         error_report("Error while amending options: %s", strerror(-ret));
         goto out;
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH 2/8] qemu-img: Add progress output for amend
  2014-07-25 18:07 [Qemu-devel] [PATCH 0/8] block/qcow2: Improve (?) zero cluster expansion Max Reitz
  2014-07-25 18:07 ` [Qemu-devel] [PATCH 1/8] block: Add status callback to bdrv_amend_options() Max Reitz
@ 2014-07-25 18:07 ` Max Reitz
  2014-07-30 14:55   ` Eric Blake
  2014-07-25 18:07 ` [Qemu-devel] [PATCH 3/8] qemu-img: Fix insignifcant memleak Max Reitz
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Max Reitz @ 2014-07-25 18:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

Now that bdrv_amend_options() supports a status callback, use it to
display a progress report.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 qemu-img.c | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 90d6b79..a06f425 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2732,6 +2732,13 @@ out:
     return 0;
 }
 
+static void amend_status_cb(BlockDriverState *bs,
+                            int64_t offset, int64_t total_work_size)
+{
+    (void)bs;
+    qemu_progress_print(100.f * offset / total_work_size, 0);
+}
+
 static int img_amend(int argc, char **argv)
 {
     int c, ret = 0;
@@ -2740,12 +2747,12 @@ static int img_amend(int argc, char **argv)
     QemuOpts *opts = NULL;
     const char *fmt = NULL, *filename, *cache;
     int flags;
-    bool quiet = false;
+    bool quiet = false, progress = false;
     BlockDriverState *bs = NULL;
 
     cache = BDRV_DEFAULT_CACHE;
     for (;;) {
-        c = getopt(argc, argv, "ho:f:t:q");
+        c = getopt(argc, argv, "ho:f:t:pq");
         if (c == -1) {
             break;
         }
@@ -2775,6 +2782,9 @@ static int img_amend(int argc, char **argv)
             case 't':
                 cache = optarg;
                 break;
+            case 'p':
+                progress = true;
+                break;
             case 'q':
                 quiet = true;
                 break;
@@ -2785,6 +2795,11 @@ static int img_amend(int argc, char **argv)
         error_exit("Must specify options (-o)");
     }
 
+    if (quiet) {
+        progress = false;
+    }
+    qemu_progress_init(progress, 1.0);
+
     filename = (optind == argc - 1) ? argv[argc - 1] : NULL;
     if (fmt && has_help_option(options)) {
         /* If a format is explicitly specified (and possibly no filename is
@@ -2827,13 +2842,18 @@ static int img_amend(int argc, char **argv)
         goto out;
     }
 
-    ret = bdrv_amend_options(bs, opts, NULL);
+    /* In case the driver does not call amend_status_cb() */
+    qemu_progress_print(0.f, 0);
+    ret = bdrv_amend_options(bs, opts, &amend_status_cb);
+    qemu_progress_print(100.f, 0);
     if (ret < 0) {
         error_report("Error while amending options: %s", strerror(-ret));
         goto out;
     }
 
 out:
+    qemu_progress_end();
+
     if (bs) {
         bdrv_unref(bs);
     }
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH 3/8] qemu-img: Fix insignifcant memleak
  2014-07-25 18:07 [Qemu-devel] [PATCH 0/8] block/qcow2: Improve (?) zero cluster expansion Max Reitz
  2014-07-25 18:07 ` [Qemu-devel] [PATCH 1/8] block: Add status callback to bdrv_amend_options() Max Reitz
  2014-07-25 18:07 ` [Qemu-devel] [PATCH 2/8] qemu-img: Add progress output for amend Max Reitz
@ 2014-07-25 18:07 ` Max Reitz
  2014-07-30 14:56   ` Eric Blake
  2014-07-25 18:07 ` [Qemu-devel] [PATCH 4/8] block/qcow2: Make get_refcount() global Max Reitz
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Max Reitz @ 2014-07-25 18:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

As soon as options is set in img_amend(), it needs to be freed before
the function returns. This leak is rather insignifcant, as qemu-img will
exit subsequently anyway, but there's no point in not fixing it.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 qemu-img.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/qemu-img.c b/qemu-img.c
index a06f425..d73a68a 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2809,7 +2809,9 @@ static int img_amend(int argc, char **argv)
     }
 
     if (optind != argc - 1) {
-        error_exit("Expecting one image file name");
+        error_report("Expecting one image file name");
+        ret = -1;
+        goto out;
     }
 
     flags = BDRV_O_FLAGS | BDRV_O_RDWR;
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH 4/8] block/qcow2: Make get_refcount() global
  2014-07-25 18:07 [Qemu-devel] [PATCH 0/8] block/qcow2: Improve (?) zero cluster expansion Max Reitz
                   ` (2 preceding siblings ...)
  2014-07-25 18:07 ` [Qemu-devel] [PATCH 3/8] qemu-img: Fix insignifcant memleak Max Reitz
@ 2014-07-25 18:07 ` Max Reitz
  2014-07-30 15:04   ` Eric Blake
  2014-07-25 18:07 ` [Qemu-devel] [PATCH 5/8] block/qcow2: Implement status CB for amend Max Reitz
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Max Reitz @ 2014-07-25 18:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

Reading the refcount of a cluster is an operation which can be useful in
all of the qcow2 code, so make that function globally available.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/qcow2-refcount.c | 23 ++++++++++++-----------
 block/qcow2.h          |  2 ++
 2 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index cc6cf74..0c9887b 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -87,7 +87,7 @@ static int load_refcount_block(BlockDriverState *bs,
  * return value is the refcount of the cluster, negative values are -errno
  * and indicate an error.
  */
-static int get_refcount(BlockDriverState *bs, int64_t cluster_index)
+int qcow2_get_refcount(BlockDriverState *bs, int64_t cluster_index)
 {
     BDRVQcowState *s = bs->opaque;
     uint64_t refcount_table_index, block_index;
@@ -625,7 +625,7 @@ int qcow2_update_cluster_refcount(BlockDriverState *bs,
         return ret;
     }
 
-    return get_refcount(bs, cluster_index);
+    return qcow2_get_refcount(bs, cluster_index);
 }
 
 
@@ -646,7 +646,7 @@ static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size)
 retry:
     for(i = 0; i < nb_clusters; i++) {
         uint64_t next_cluster_index = s->free_cluster_index++;
-        refcount = get_refcount(bs, next_cluster_index);
+        refcount = qcow2_get_refcount(bs, next_cluster_index);
 
         if (refcount < 0) {
             return refcount;
@@ -710,7 +710,7 @@ int qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset,
         /* Check how many clusters there are free */
         cluster_index = offset >> s->cluster_bits;
         for(i = 0; i < nb_clusters; i++) {
-            refcount = get_refcount(bs, cluster_index++);
+            refcount = qcow2_get_refcount(bs, cluster_index++);
 
             if (refcount < 0) {
                 return refcount;
@@ -927,7 +927,7 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
                                     cluster_index, addend,
                                     QCOW2_DISCARD_SNAPSHOT);
                         } else {
-                            refcount = get_refcount(bs, cluster_index);
+                            refcount = qcow2_get_refcount(bs, cluster_index);
                         }
 
                         if (refcount < 0) {
@@ -967,7 +967,7 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
                 refcount = qcow2_update_cluster_refcount(bs, l2_offset >>
                         s->cluster_bits, addend, QCOW2_DISCARD_SNAPSHOT);
             } else {
-                refcount = get_refcount(bs, l2_offset >> s->cluster_bits);
+                refcount = qcow2_get_refcount(bs, l2_offset >> s->cluster_bits);
             }
             if (refcount < 0) {
                 ret = refcount;
@@ -1243,8 +1243,8 @@ fail:
  * Checks the OFLAG_COPIED flag for all L1 and L2 entries.
  *
  * This function does not print an error message nor does it increment
- * check_errors if get_refcount fails (this is because such an error will have
- * been already detected and sufficiently signaled by the calling function
+ * check_errors if qcow2_get_refcount fails (this is because such an error will
+ * have been already detected and sufficiently signaled by the calling function
  * (qcow2_check_refcounts) by the time this function is called).
  */
 static int check_oflag_copied(BlockDriverState *bs, BdrvCheckResult *res,
@@ -1265,7 +1265,7 @@ static int check_oflag_copied(BlockDriverState *bs, BdrvCheckResult *res,
             continue;
         }
 
-        refcount = get_refcount(bs, l2_offset >> s->cluster_bits);
+        refcount = qcow2_get_refcount(bs, l2_offset >> s->cluster_bits);
         if (refcount < 0) {
             /* don't print message nor increment check_errors */
             continue;
@@ -1307,7 +1307,8 @@ static int check_oflag_copied(BlockDriverState *bs, BdrvCheckResult *res,
 
             if ((cluster_type == QCOW2_CLUSTER_NORMAL) ||
                 ((cluster_type == QCOW2_CLUSTER_ZERO) && (data_offset != 0))) {
-                refcount = get_refcount(bs, data_offset >> s->cluster_bits);
+                refcount = qcow2_get_refcount(bs,
+                                              data_offset >> s->cluster_bits);
                 if (refcount < 0) {
                     /* don't print message nor increment check_errors */
                     continue;
@@ -1597,7 +1598,7 @@ int qcow2_check_refcounts(BlockDriverState *bs, BdrvCheckResult *res,
 
     /* compare ref counts */
     for (i = 0, highest_cluster = 0; i < nb_clusters; i++) {
-        refcount1 = get_refcount(bs, i);
+        refcount1 = qcow2_get_refcount(bs, i);
         if (refcount1 < 0) {
             fprintf(stderr, "Can't get refcount for cluster %" PRId64 ": %s\n",
                 i, strerror(-refcount1));
diff --git a/block/qcow2.h b/block/qcow2.h
index b49424b..b423b71 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -472,6 +472,8 @@ int qcow2_update_header(BlockDriverState *bs);
 int qcow2_refcount_init(BlockDriverState *bs);
 void qcow2_refcount_close(BlockDriverState *bs);
 
+int qcow2_get_refcount(BlockDriverState *bs, int64_t cluster_index);
+
 int qcow2_update_cluster_refcount(BlockDriverState *bs, int64_t cluster_index,
                                   int addend, enum qcow2_discard_type type);
 
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH 5/8] block/qcow2: Implement status CB for amend
  2014-07-25 18:07 [Qemu-devel] [PATCH 0/8] block/qcow2: Improve (?) zero cluster expansion Max Reitz
                   ` (3 preceding siblings ...)
  2014-07-25 18:07 ` [Qemu-devel] [PATCH 4/8] block/qcow2: Make get_refcount() global Max Reitz
@ 2014-07-25 18:07 ` Max Reitz
  2014-07-30 15:23   ` Eric Blake
  2014-07-25 18:07 ` [Qemu-devel] [PATCH 6/8] block/qcow2: Simplify shared L2 handling in amend Max Reitz
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Max Reitz @ 2014-07-25 18:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

The only really time-consuming operation potentially performed by
qcow2_amend_options() is zero cluster expansion when downgrading qcow2
images from compat=1.1 to compat=0.10, so report status of that
operation and that operation only through the status CB.

For this, count the number of L2 zero entries, use this as the basis for
the total "amend job length" and increase the current offset by the
cluster size multiplied by the refcount of the L2 table when expanding
zero clusters.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/qcow2-cluster.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++++--
 block/qcow2.c         |   9 ++--
 block/qcow2.h         |   3 +-
 3 files changed, 147 insertions(+), 10 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 4208dc0..0f52ef6 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1548,10 +1548,20 @@ fail:
  * zero expansion (i.e., has been filled with zeroes and is referenced from an
  * L2 table). nb_clusters contains the total cluster count of the image file,
  * i.e., the number of bits in expanded_clusters.
+ *
+ * zero_clusters_count and *expanded_count are used to keep track of progress
+ * for status_cb(). zero_clusters_count contains the total number of L2 zero
+ * cluster entries. Accordingly, *expanded_count counts all visited L2 zero
+ * cluster entries; shared L2 tables are counted accordingly, that is, for each
+ * L2 zero cluster entry the count is increased by the refcount of the L2 table
+ * cluster.
  */
 static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
                                       int l1_size, uint8_t **expanded_clusters,
-                                      uint64_t *nb_clusters)
+                                      uint64_t *nb_clusters,
+                                      int64_t *expanded_count,
+                                      int64_t zero_clusters_count,
+                                      BlockDriverAmendStatusCB *status_cb)
 {
     BDRVQcowState *s = bs->opaque;
     bool is_active_l1 = (l1_table == s->l1_table);
@@ -1568,6 +1578,7 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
     for (i = 0; i < l1_size; i++) {
         uint64_t l2_offset = l1_table[i] & L1E_OFFSET_MASK;
         bool l2_dirty = false;
+        int l2_refcount;
 
         if (!l2_offset) {
             /* unallocated */
@@ -1587,6 +1598,12 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
             goto fail;
         }
 
+        l2_refcount = qcow2_get_refcount(bs, l2_offset >> s->cluster_bits);
+        if (l2_refcount < 0) {
+            ret = l2_refcount;
+            goto fail;
+        }
+
         for (j = 0; j < s->l2_size; j++) {
             uint64_t l2_entry = be64_to_cpu(l2_table[j]);
             int64_t offset = l2_entry & L2E_OFFSET_MASK, cluster_index;
@@ -1617,6 +1634,8 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
                 continue;
             }
 
+            *expanded_count += l2_refcount;
+
             if (!preallocated) {
                 if (!bs->backing_hd) {
                     /* not backed; therefore we can simply deallocate the
@@ -1703,6 +1722,11 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
                 }
             }
         }
+
+        if (status_cb) {
+            status_cb(bs, *expanded_count << s->cluster_bits,
+                      zero_clusters_count << s->cluster_bits);
+        }
     }
 
     ret = 0;
@@ -1724,26 +1748,137 @@ fail:
 }
 
 /*
+ * Counts all zero clusters in a specific L1 table.
+ */
+static int64_t count_zero_clusters_in_l1(BlockDriverState *bs,
+                                         uint64_t *l1_table, int l1_size)
+{
+    BDRVQcowState *s = bs->opaque;
+    bool is_active_l1 = (l1_table == s->l1_table);
+    uint64_t *l2_table = NULL;
+    int64_t count = 0;
+    int ret;
+    int i, j;
+
+    if (!is_active_l1) {
+        l2_table = qemu_blockalign(bs, s->cluster_size);
+    }
+
+    for (i = 0; i < l1_size; i++) {
+        uint64_t l2_offset = l1_table[i] & L1E_OFFSET_MASK;
+
+        if (!l2_offset) {
+            continue;
+        }
+
+        if (is_active_l1) {
+            ret = qcow2_cache_get(bs, s->l2_table_cache, l2_offset,
+                                  (void **)&l2_table);
+        } else {
+            ret = bdrv_read(bs->file, l2_offset / BDRV_SECTOR_SIZE,
+                            (void *)l2_table, s->cluster_sectors);
+        }
+        if (ret < 0) {
+            goto fail;
+        }
+
+        for (j = 0; j < s->l2_size; j++) {
+            if (qcow2_get_cluster_type(be64_to_cpu(l2_table[j]))
+                == QCOW2_CLUSTER_ZERO)
+            {
+                count++;
+            }
+        }
+
+        if (is_active_l1) {
+            ret = qcow2_cache_put(bs, s->l2_table_cache, (void **)&l2_table);
+            l2_table = NULL;
+            if (ret < 0) {
+                goto fail;
+            }
+        }
+    }
+
+    ret = 0;
+
+fail:
+    if (l2_table) {
+        if (!is_active_l1) {
+            qemu_vfree(l2_table);
+        } else {
+            if (ret < 0) {
+                qcow2_cache_put(bs, s->l2_table_cache, (void **)&l2_table);
+            } else {
+                ret = qcow2_cache_put(bs, s->l2_table_cache,
+                                      (void **)&l2_table);
+            }
+        }
+    }
+
+    return ret < 0 ? ret : count;
+}
+
+/*
  * For backed images, expands all zero clusters on the image. For non-backed
  * images, deallocates all non-pre-allocated zero clusters (and claims the
  * allocation for pre-allocated ones). This is important for downgrading to a
  * qcow2 version which doesn't yet support metadata zero clusters.
  */
-int qcow2_expand_zero_clusters(BlockDriverState *bs)
+int qcow2_expand_zero_clusters(BlockDriverState *bs,
+                               BlockDriverAmendStatusCB *status_cb)
 {
     BDRVQcowState *s = bs->opaque;
     uint64_t *l1_table = NULL;
     uint64_t nb_clusters;
+    int64_t zero_cluster_count = 0, expanded_count = 0;
     uint8_t *expanded_clusters;
     int ret;
     int i, j;
 
+    if (status_cb) {
+        int64_t l1_zero_count;
+
+        l1_zero_count = count_zero_clusters_in_l1(bs, s->l1_table, s->l1_size);
+        if (l1_zero_count < 0) {
+            ret = l1_zero_count;
+            goto fail;
+        }
+        zero_cluster_count += l1_zero_count;
+
+        for (i = 0; i < s->nb_snapshots; i++) {
+            int l1_sectors = DIV_ROUND_UP(s->snapshots[i].l1_size *
+                                          sizeof(uint64_t), BDRV_SECTOR_SIZE);
+
+            l1_table = g_realloc(l1_table, l1_sectors * BDRV_SECTOR_SIZE);
+
+            ret = bdrv_read(bs->file, s->snapshots[i].l1_table_offset /
+                            BDRV_SECTOR_SIZE, (void *)l1_table, l1_sectors);
+            if (ret < 0) {
+                goto fail;
+            }
+
+            for (j = 0; j < s->snapshots[i].l1_size; j++) {
+                be64_to_cpus(&l1_table[j]);
+            }
+
+            l1_zero_count = count_zero_clusters_in_l1(bs, l1_table,
+                                                      s->snapshots[i].l1_size);
+            if (l1_zero_count < 0) {
+                ret = l1_zero_count;
+                goto fail;
+            }
+            zero_cluster_count += l1_zero_count;
+        }
+    }
+
     nb_clusters = size_to_clusters(s, bs->file->total_sectors *
                                    BDRV_SECTOR_SIZE);
     expanded_clusters = g_malloc0((nb_clusters + 7) / 8);
 
     ret = expand_zero_clusters_in_l1(bs, s->l1_table, s->l1_size,
-                                     &expanded_clusters, &nb_clusters);
+                                     &expanded_clusters, &nb_clusters,
+                                     &expanded_count, zero_cluster_count,
+                                     status_cb);
     if (ret < 0) {
         goto fail;
     }
@@ -1777,7 +1912,9 @@ int qcow2_expand_zero_clusters(BlockDriverState *bs)
         }
 
         ret = expand_zero_clusters_in_l1(bs, l1_table, s->snapshots[i].l1_size,
-                                         &expanded_clusters, &nb_clusters);
+                                         &expanded_clusters, &nb_clusters,
+                                         &expanded_count, zero_cluster_count,
+                                         status_cb);
         if (ret < 0) {
             goto fail;
         }
diff --git a/block/qcow2.c b/block/qcow2.c
index 757f890..6e8c8ab 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2147,7 +2147,8 @@ static int qcow2_load_vmstate(BlockDriverState *bs, uint8_t *buf,
  * Downgrades an image's version. To achieve this, any incompatible features
  * have to be removed.
  */
-static int qcow2_downgrade(BlockDriverState *bs, int target_version)
+static int qcow2_downgrade(BlockDriverState *bs, int target_version,
+                           BlockDriverAmendStatusCB *status_cb)
 {
     BDRVQcowState *s = bs->opaque;
     int current_version = s->qcow_version;
@@ -2196,7 +2197,7 @@ static int qcow2_downgrade(BlockDriverState *bs, int target_version)
     /* clearing autoclear features is trivial */
     s->autoclear_features = 0;
 
-    ret = qcow2_expand_zero_clusters(bs);
+    ret = qcow2_expand_zero_clusters(bs, status_cb);
     if (ret < 0) {
         return ret;
     }
@@ -2224,8 +2225,6 @@ static int qcow2_amend_options(BlockDriverState *bs, QemuOpts *opts,
     int ret;
     QemuOptDesc *desc = opts->list->desc;
 
-    (void)status_cb;
-
     while (desc && desc->name) {
         if (!qemu_opt_find(opts, desc->name)) {
             /* only change explicitly defined options */
@@ -2291,7 +2290,7 @@ static int qcow2_amend_options(BlockDriverState *bs, QemuOpts *opts,
                 return ret;
             }
         } else {
-            ret = qcow2_downgrade(bs, new_version);
+            ret = qcow2_downgrade(bs, new_version, status_cb);
             if (ret < 0) {
                 return ret;
             }
diff --git a/block/qcow2.h b/block/qcow2.h
index b423b71..c0e1b7b 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -524,7 +524,8 @@ int qcow2_discard_clusters(BlockDriverState *bs, uint64_t offset,
     int nb_sectors, enum qcow2_discard_type type);
 int qcow2_zero_clusters(BlockDriverState *bs, uint64_t offset, int nb_sectors);
 
-int qcow2_expand_zero_clusters(BlockDriverState *bs);
+int qcow2_expand_zero_clusters(BlockDriverState *bs,
+                               BlockDriverAmendStatusCB *status_cb);
 
 /* qcow2-snapshot.c functions */
 int qcow2_snapshot_create(BlockDriverState *bs, QEMUSnapshotInfo *sn_info);
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH 6/8] block/qcow2: Simplify shared L2 handling in amend
  2014-07-25 18:07 [Qemu-devel] [PATCH 0/8] block/qcow2: Improve (?) zero cluster expansion Max Reitz
                   ` (4 preceding siblings ...)
  2014-07-25 18:07 ` [Qemu-devel] [PATCH 5/8] block/qcow2: Implement status CB for amend Max Reitz
@ 2014-07-25 18:07 ` Max Reitz
  2014-07-30 15:36   ` Eric Blake
  2014-07-25 18:07 ` [Qemu-devel] [PATCH 7/8] block/qcow2: Speed up zero cluster expansion Max Reitz
  2014-07-25 18:07 ` [Qemu-devel] [PATCH 8/8] iotests: Expand test 061 Max Reitz
  7 siblings, 1 reply; 20+ messages in thread
From: Max Reitz @ 2014-07-25 18:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

Currently, we have a bitmap for keeping track of which clusters have
been created during the zero cluster expansion process. This was
necessary because we need to properly increase the refcount for shared
L2 tables.

However, now we can simply take the L2 refcount and use it for the
cluster allocated for expansion. This will be the correct refcount and
therefore we don't have to remember that cluster having been allocated
any more.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/qcow2-cluster.c | 83 +++++++++++++--------------------------------------
 1 file changed, 21 insertions(+), 62 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 0f52ef6..905beb6 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1543,12 +1543,6 @@ fail:
  * Expands all zero clusters in a specific L1 table (or deallocates them, for
  * non-backed non-pre-allocated zero clusters).
  *
- * expanded_clusters is a bitmap where every bit corresponds to one cluster in
- * the image file; a bit gets set if the corresponding cluster has been used for
- * zero expansion (i.e., has been filled with zeroes and is referenced from an
- * L2 table). nb_clusters contains the total cluster count of the image file,
- * i.e., the number of bits in expanded_clusters.
- *
  * zero_clusters_count and *expanded_count are used to keep track of progress
  * for status_cb(). zero_clusters_count contains the total number of L2 zero
  * cluster entries. Accordingly, *expanded_count counts all visited L2 zero
@@ -1557,9 +1551,7 @@ fail:
  * cluster.
  */
 static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
-                                      int l1_size, uint8_t **expanded_clusters,
-                                      uint64_t *nb_clusters,
-                                      int64_t *expanded_count,
+                                      int l1_size, int64_t *expanded_count,
                                       int64_t zero_clusters_count,
                                       BlockDriverAmendStatusCB *status_cb)
 {
@@ -1606,31 +1598,11 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
 
         for (j = 0; j < s->l2_size; j++) {
             uint64_t l2_entry = be64_to_cpu(l2_table[j]);
-            int64_t offset = l2_entry & L2E_OFFSET_MASK, cluster_index;
+            int64_t offset = l2_entry & L2E_OFFSET_MASK;
             int cluster_type = qcow2_get_cluster_type(l2_entry);
             bool preallocated = offset != 0;
 
-            if (cluster_type == QCOW2_CLUSTER_NORMAL) {
-                cluster_index = offset >> s->cluster_bits;
-                assert((cluster_index >= 0) && (cluster_index < *nb_clusters));
-                if ((*expanded_clusters)[cluster_index / 8] &
-                    (1 << (cluster_index % 8))) {
-                    /* Probably a shared L2 table; this cluster was a zero
-                     * cluster which has been expanded, its refcount
-                     * therefore most likely requires an update. */
-                    ret = qcow2_update_cluster_refcount(bs, cluster_index, 1,
-                                                        QCOW2_DISCARD_NEVER);
-                    if (ret < 0) {
-                        goto fail;
-                    }
-                    /* Since we just increased the refcount, the COPIED flag may
-                     * no longer be set. */
-                    l2_table[j] = cpu_to_be64(l2_entry & ~QCOW_OFLAG_COPIED);
-                    l2_dirty = true;
-                }
-                continue;
-            }
-            else if (qcow2_get_cluster_type(l2_entry) != QCOW2_CLUSTER_ZERO) {
+            if (cluster_type != QCOW2_CLUSTER_ZERO) {
                 continue;
             }
 
@@ -1650,6 +1622,19 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
                     ret = offset;
                     goto fail;
                 }
+
+                if (l2_refcount > 1) {
+                    /* For shared L2 tables, set the refcount accordingly (it is
+                     * already 1 and needs to be l2_refcount) */
+                    ret = qcow2_update_cluster_refcount(bs,
+                            offset >> s->cluster_bits, l2_refcount - 1,
+                            QCOW2_DISCARD_OTHER);
+                    if (ret < 0) {
+                        qcow2_free_clusters(bs, offset, s->cluster_size,
+                                            QCOW2_DISCARD_OTHER);
+                        goto fail;
+                    }
+                }
             }
 
             ret = qcow2_pre_write_overlap_check(bs, 0, offset, s->cluster_size);
@@ -1671,29 +1656,12 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
                 goto fail;
             }
 
-            l2_table[j] = cpu_to_be64(offset | QCOW_OFLAG_COPIED);
-            l2_dirty = true;
-
-            cluster_index = offset >> s->cluster_bits;
-
-            if (cluster_index >= *nb_clusters) {
-                uint64_t old_bitmap_size = (*nb_clusters + 7) / 8;
-                uint64_t new_bitmap_size;
-                /* The offset may lie beyond the old end of the underlying image
-                 * file for growable files only */
-                assert(bs->file->growable);
-                *nb_clusters = size_to_clusters(s, bs->file->total_sectors *
-                                                BDRV_SECTOR_SIZE);
-                new_bitmap_size = (*nb_clusters + 7) / 8;
-                *expanded_clusters = g_realloc(*expanded_clusters,
-                                               new_bitmap_size);
-                /* clear the newly allocated space */
-                memset(&(*expanded_clusters)[old_bitmap_size], 0,
-                       new_bitmap_size - old_bitmap_size);
+            if (l2_refcount == 1) {
+                l2_table[j] = cpu_to_be64(offset | QCOW_OFLAG_COPIED);
+            } else {
+                l2_table[j] = cpu_to_be64(offset);
             }
-
-            assert((cluster_index >= 0) && (cluster_index < *nb_clusters));
-            (*expanded_clusters)[cluster_index / 8] |= 1 << (cluster_index % 8);
+            l2_dirty = true;
         }
 
         if (is_active_l1) {
@@ -1829,9 +1797,7 @@ int qcow2_expand_zero_clusters(BlockDriverState *bs,
 {
     BDRVQcowState *s = bs->opaque;
     uint64_t *l1_table = NULL;
-    uint64_t nb_clusters;
     int64_t zero_cluster_count = 0, expanded_count = 0;
-    uint8_t *expanded_clusters;
     int ret;
     int i, j;
 
@@ -1871,12 +1837,7 @@ int qcow2_expand_zero_clusters(BlockDriverState *bs,
         }
     }
 
-    nb_clusters = size_to_clusters(s, bs->file->total_sectors *
-                                   BDRV_SECTOR_SIZE);
-    expanded_clusters = g_malloc0((nb_clusters + 7) / 8);
-
     ret = expand_zero_clusters_in_l1(bs, s->l1_table, s->l1_size,
-                                     &expanded_clusters, &nb_clusters,
                                      &expanded_count, zero_cluster_count,
                                      status_cb);
     if (ret < 0) {
@@ -1912,7 +1873,6 @@ int qcow2_expand_zero_clusters(BlockDriverState *bs,
         }
 
         ret = expand_zero_clusters_in_l1(bs, l1_table, s->snapshots[i].l1_size,
-                                         &expanded_clusters, &nb_clusters,
                                          &expanded_count, zero_cluster_count,
                                          status_cb);
         if (ret < 0) {
@@ -1923,7 +1883,6 @@ int qcow2_expand_zero_clusters(BlockDriverState *bs,
     ret = 0;
 
 fail:
-    g_free(expanded_clusters);
     g_free(l1_table);
     return ret;
 }
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH 7/8] block/qcow2: Speed up zero cluster expansion
  2014-07-25 18:07 [Qemu-devel] [PATCH 0/8] block/qcow2: Improve (?) zero cluster expansion Max Reitz
                   ` (5 preceding siblings ...)
  2014-07-25 18:07 ` [Qemu-devel] [PATCH 6/8] block/qcow2: Simplify shared L2 handling in amend Max Reitz
@ 2014-07-25 18:07 ` Max Reitz
  2014-07-30 16:14   ` Eric Blake
  2014-07-25 18:07 ` [Qemu-devel] [PATCH 8/8] iotests: Expand test 061 Max Reitz
  7 siblings, 1 reply; 20+ messages in thread
From: Max Reitz @ 2014-07-25 18:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

Actually, we do not need to allocate a new data cluster for every zero
cluster to be expanded: It is completely sufficient to rely on qcow2's
COW part and instead create a single zero cluster and reuse it as much
as possible.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/qcow2-cluster.c | 119 ++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 92 insertions(+), 27 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 905beb6..867db03 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1558,6 +1558,9 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
     BDRVQcowState *s = bs->opaque;
     bool is_active_l1 = (l1_table == s->l1_table);
     uint64_t *l2_table = NULL;
+    int64_t zeroed_cluster_offset = 0;
+    int zeroed_cluster_refcount = 0;
+    int last_zeroed_cluster_l1i = 0, last_zeroed_cluster_l2i = 0;
     int ret;
     int i, j;
 
@@ -1617,47 +1620,79 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
                     continue;
                 }
 
-                offset = qcow2_alloc_clusters(bs, s->cluster_size);
-                if (offset < 0) {
-                    ret = offset;
-                    goto fail;
+                if (zeroed_cluster_offset) {
+                    zeroed_cluster_refcount += l2_refcount;
+                    if (zeroed_cluster_refcount > 0xffff) {
+                        zeroed_cluster_refcount = 0;
+                        zeroed_cluster_offset = 0;
+                    }
                 }
+                if (!zeroed_cluster_offset) {
+                    offset = qcow2_alloc_clusters(bs, s->cluster_size);
+                    if (offset < 0) {
+                        ret = offset;
+                        goto fail;
+                    }
 
-                if (l2_refcount > 1) {
-                    /* For shared L2 tables, set the refcount accordingly (it is
-                     * already 1 and needs to be l2_refcount) */
-                    ret = qcow2_update_cluster_refcount(bs,
-                            offset >> s->cluster_bits, l2_refcount - 1,
-                            QCOW2_DISCARD_OTHER);
+                    ret = qcow2_pre_write_overlap_check(bs, 0, offset,
+                                                        s->cluster_size);
+                    if (ret < 0) {
+                        qcow2_free_clusters(bs, offset, s->cluster_size,
+                                            QCOW2_DISCARD_OTHER);
+                        goto fail;
+                    }
+
+                    ret = bdrv_write_zeroes(bs->file, offset / BDRV_SECTOR_SIZE,
+                                            s->cluster_sectors, 0);
                     if (ret < 0) {
                         qcow2_free_clusters(bs, offset, s->cluster_size,
                                             QCOW2_DISCARD_OTHER);
                         goto fail;
                     }
+
+                    if (l2_refcount > 1) {
+                        ret = qcow2_update_cluster_refcount(bs,
+                                offset >> s->cluster_bits, l2_refcount - 1,
+                                QCOW2_DISCARD_OTHER);
+                        if (ret < 0) {
+                            qcow2_free_clusters(bs, offset, s->cluster_size,
+                                                QCOW2_DISCARD_OTHER);
+                            goto fail;
+                        }
+                    }
+
+                    zeroed_cluster_offset = offset;
+                    zeroed_cluster_refcount = l2_refcount;
+                } else {
+                    ret = qcow2_update_cluster_refcount(bs,
+                            zeroed_cluster_offset >> s->cluster_bits,
+                            l2_refcount, QCOW2_DISCARD_OTHER);
+                    if (ret < 0) {
+                        goto fail;
+                    }
                 }
+
+                offset = zeroed_cluster_offset;
+                last_zeroed_cluster_l1i = i;
+                last_zeroed_cluster_l2i = j;
             }
 
-            ret = qcow2_pre_write_overlap_check(bs, 0, offset, s->cluster_size);
-            if (ret < 0) {
-                if (!preallocated) {
-                    qcow2_free_clusters(bs, offset, s->cluster_size,
-                                        QCOW2_DISCARD_ALWAYS);
+            if (preallocated) {
+                ret = qcow2_pre_write_overlap_check(bs, 0, offset,
+                                                    s->cluster_size);
+                if (ret < 0) {
+                    goto fail;
                 }
-                goto fail;
-            }
 
-            ret = bdrv_write_zeroes(bs->file, offset / BDRV_SECTOR_SIZE,
-                                    s->cluster_sectors, 0);
-            if (ret < 0) {
-                if (!preallocated) {
-                    qcow2_free_clusters(bs, offset, s->cluster_size,
-                                        QCOW2_DISCARD_ALWAYS);
+                ret = bdrv_write_zeroes(bs->file, offset / BDRV_SECTOR_SIZE,
+                                        s->cluster_sectors, 0);
+                if (ret < 0) {
+                    goto fail;
                 }
-                goto fail;
             }
 
-            if (l2_refcount == 1) {
-                l2_table[j] = cpu_to_be64(offset | QCOW_OFLAG_COPIED);
+            if (preallocated) {
+                l2_table[j] = cpu_to_be64(offset | (l2_entry & QCOW_OFLAG_COPIED));
             } else {
                 l2_table[j] = cpu_to_be64(offset);
             }
@@ -1670,8 +1705,8 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
                 qcow2_cache_depends_on_flush(s->l2_table_cache);
             }
             ret = qcow2_cache_put(bs, s->l2_table_cache, (void **)&l2_table);
+            l2_table = NULL;
             if (ret < 0) {
-                l2_table = NULL;
                 goto fail;
             }
         } else {
@@ -1697,6 +1732,36 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
         }
     }
 
+    /* Fix COPIED (only valid for active L2 tables) */
+    if (is_active_l1 && zeroed_cluster_refcount == 1) {
+        uint64_t l2_offset, l2_entry;
+
+        l2_offset = l1_table[last_zeroed_cluster_l1i] & L1E_OFFSET_MASK;
+        assert(l2_offset);
+
+        ret = qcow2_cache_get(bs, s->l2_table_cache, l2_offset,
+                              (void **)&l2_table);
+        if (ret < 0) {
+            goto fail;
+        }
+
+        l2_entry = be64_to_cpu(l2_table[last_zeroed_cluster_l2i]);
+
+        assert(!(l2_entry & QCOW_OFLAG_COPIED));
+        l2_entry |= QCOW_OFLAG_COPIED;
+
+        l2_table[last_zeroed_cluster_l2i] = cpu_to_be64(l2_entry);
+
+        qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table);
+        qcow2_cache_depends_on_flush(s->l2_table_cache);
+
+        ret = qcow2_cache_put(bs, s->l2_table_cache, (void **)&l2_table);
+        l2_table = NULL;
+        if (ret < 0) {
+            goto fail;
+        }
+    }
+
     ret = 0;
 
 fail:
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCH 8/8] iotests: Expand test 061
  2014-07-25 18:07 [Qemu-devel] [PATCH 0/8] block/qcow2: Improve (?) zero cluster expansion Max Reitz
                   ` (6 preceding siblings ...)
  2014-07-25 18:07 ` [Qemu-devel] [PATCH 7/8] block/qcow2: Speed up zero cluster expansion Max Reitz
@ 2014-07-25 18:07 ` Max Reitz
  7 siblings, 0 replies; 20+ messages in thread
From: Max Reitz @ 2014-07-25 18:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Max Reitz

Add some tests for progress output and one test regarding the COPIED
flag for shared zeroed clusters to 061.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 tests/qemu-iotests/061     | 41 +++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/061.out | 40 ++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/group   |  2 +-
 3 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/061 b/tests/qemu-iotests/061
index ab98def..12ea687 100755
--- a/tests/qemu-iotests/061
+++ b/tests/qemu-iotests/061
@@ -209,6 +209,47 @@ $QEMU_IMG amend -o "compat=0.10" "$TEST_IMG"
 _check_test_img
 $QEMU_IO -c "read -P 0 0 64M" "$TEST_IMG" | _filter_qemu_io
 
+echo
+echo "=== Testing multiple zeroed cluster allocations ==="
+echo
+IMGOPTS="compat=1.1" TEST_IMG="$TEST_IMG.base" _make_test_img 64M
+IMGOPTS="compat=1.1,cluster_size=512" make_test_img -b "$TEST_IMG.base" 64M
+$QEMU_IO -c "write -z 0 $(((2 * 65535 + 1) * 512))" "$TEST_IMG" \
+    | _filter_qemu_io
+$QEMU_IMG amend -o "compat=0.10" "$TEST_IMG"
+_check_test_img
+
+echo
+echo "=== Testing progress report without snapshot ==="
+echo
+IMGOPTS="compat=1.1" TEST_IMG="$TEST_IMG.base" _make_test_img 4G
+IMGOPTS="compat=1.1" _make_test_img -b "$TEST_IMG.base" 4G
+$QEMU_IO -c "write -z 0  64k" \
+         -c "write -z 1G 64k" \
+         -c "write -z 2G 64k" \
+         -c "write -z 3G 64k" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IMG amend -p -o "compat=0.10" "$TEST_IMG"
+_check_test_img
+
+echo
+echo "=== Testing progress report with snapshot ==="
+echo
+IMGOPTS="compat=1.1" TEST_IMG="$TEST_IMG.base" _make_test_img 4G
+IMGOPTS="compat=1.1" _make_test_img -b "$TEST_IMG.base" 4G
+$QEMU_IO -c "write -z 0  64k" \
+         -c "write -z 1G 64k" \
+         -c "write -z 2G 64k" \
+         -c "write -z 3G 64k" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IMG snapshot -c foo "$TEST_IMG"
+# COW L2 table 0
+$QEMU_IO -c "write -z 64k 64k" "$TEST_IMG" | _filter_qemu_io
+# The first four steps will advance by 2/9, because in the active L1 table, the
+# first L2 table contains two zero cluster entries and all others contain one
+# but are shared with the other L2 table; the final 1/9 step will be from the
+# zero cluster entry in the L2 table owned exclusively by the snapshot.
+$QEMU_IMG amend -p -o "compat=0.10" "$TEST_IMG"
+_check_test_img
+
 # success, all done
 echo "*** done"
 rm -f $seq.full
diff --git a/tests/qemu-iotests/061.out b/tests/qemu-iotests/061.out
index e372470..63d7bb2 100644
--- a/tests/qemu-iotests/061.out
+++ b/tests/qemu-iotests/061.out
@@ -384,4 +384,44 @@ wrote 67108864/67108864 bytes at offset 0
 No errors were found on the image.
 read 67108864/67108864 bytes at offset 0
 64 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+=== Testing multiple zeroed cluster allocations ===
+
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864 
+./061: line 216: make_test_img: command not found
+wrote 67108352/67108352 bytes at offset 0
+64 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+No errors were found on the image.
+
+=== Testing progress report without snapshot ===
+
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=4294967296 
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=4294967296 backing_file='TEST_DIR/t.IMGFMT.base' 
+wrote 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 1073741824
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 2147483648
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 3221225472
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+    (0.00/100%)
    (25.00/100%)
    (50.00/100%)
    (75.00/100%)
    (100.00/100%)
    (100.00/100%)
+No errors were found on the image.
+
+=== Testing progress report with snapshot ===
+
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=4294967296 
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=4294967296 backing_file='TEST_DIR/t.IMGFMT.base' 
+wrote 65536/65536 bytes at offset 0
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 1073741824
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 2147483648
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 3221225472
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 65536/65536 bytes at offset 65536
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+    (0.00/100%)
    (22.22/100%)
    (44.44/100%)
    (66.67/100%)
    (88.89/100%)
    (100.00/100%)
    (100.00/100%)
    (100.00/100%)
    (100.00/100%)
    (100.00/100%)
+No errors were found on the image.
 *** done
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index 6e67f61..b27e2f9 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -67,7 +67,7 @@
 058 rw auto quick
 059 rw auto quick
 060 rw auto quick
-061 rw auto quick
+061 rw auto
 062 rw auto quick
 063 rw auto quick
 064 rw auto quick
-- 
2.0.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 1/8] block: Add status callback to bdrv_amend_options()
  2014-07-25 18:07 ` [Qemu-devel] [PATCH 1/8] block: Add status callback to bdrv_amend_options() Max Reitz
@ 2014-07-30 14:50   ` Eric Blake
  0 siblings, 0 replies; 20+ messages in thread
From: Eric Blake @ 2014-07-30 14:50 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 1374 bytes --]

On 07/25/2014 12:07 PM, Max Reitz wrote:
> Depending on the changed options and the image format,
> bdrv_amend_options() may take a significant amount of time. In these
> cases, a way to be informed about the operation's status is desirable.
> 
> Since the operation is rather complex and may fundamentally change the
> image, implementing it as AIO or a couroutine does not seem feasible. On
> the other hand, implementing it as a block job would be significantly
> more difficult than a simple callback and would not add benefits other
> than progress report to the amending operation, because it should not
> actually be run as a block job at all.
> 
> A callback may not be very pretty, but it's very easy to implement and
> perfectly fits its purpose here.

Agree with this reasoning.

> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block.c                   | 5 +++--
>  block/qcow2.c             | 5 ++++-
>  include/block/block.h     | 5 ++++-
>  include/block/block_int.h | 3 ++-
>  qemu-img.c                | 2 +-
>  5 files changed, 14 insertions(+), 6 deletions(-)

Good for 2.2, and looks like this patch is the same in both your
original and alternate series.

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 539 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 2/8] qemu-img: Add progress output for amend
  2014-07-25 18:07 ` [Qemu-devel] [PATCH 2/8] qemu-img: Add progress output for amend Max Reitz
@ 2014-07-30 14:55   ` Eric Blake
  2014-07-30 20:20     ` Max Reitz
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Blake @ 2014-07-30 14:55 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 530 bytes --]

On 07/25/2014 12:07 PM, Max Reitz wrote:
> Now that bdrv_amend_options() supports a status callback, use it to
> display a progress report.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  qemu-img.c | 26 +++++++++++++++++++++++---
>  1 file changed, 23 insertions(+), 3 deletions(-)

No documentation of the new option?  The code looks fine, but the texi
needs to mention -p before I can give R-b

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 539 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 3/8] qemu-img: Fix insignifcant memleak
  2014-07-25 18:07 ` [Qemu-devel] [PATCH 3/8] qemu-img: Fix insignifcant memleak Max Reitz
@ 2014-07-30 14:56   ` Eric Blake
  0 siblings, 0 replies; 20+ messages in thread
From: Eric Blake @ 2014-07-30 14:56 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 551 bytes --]

On 07/25/2014 12:07 PM, Max Reitz wrote:
> As soon as options is set in img_amend(), it needs to be freed before
> the function returns. This leak is rather insignifcant, as qemu-img will
> exit subsequently anyway, but there's no point in not fixing it.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  qemu-img.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 539 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 4/8] block/qcow2: Make get_refcount() global
  2014-07-25 18:07 ` [Qemu-devel] [PATCH 4/8] block/qcow2: Make get_refcount() global Max Reitz
@ 2014-07-30 15:04   ` Eric Blake
  0 siblings, 0 replies; 20+ messages in thread
From: Eric Blake @ 2014-07-30 15:04 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 553 bytes --]

On 07/25/2014 12:07 PM, Max Reitz wrote:
> Reading the refcount of a cluster is an operation which can be useful in
> all of the qcow2 code, so make that function globally available.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/qcow2-refcount.c | 23 ++++++++++++-----------
>  block/qcow2.h          |  2 ++
>  2 files changed, 14 insertions(+), 11 deletions(-)
> 

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 539 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 5/8] block/qcow2: Implement status CB for amend
  2014-07-25 18:07 ` [Qemu-devel] [PATCH 5/8] block/qcow2: Implement status CB for amend Max Reitz
@ 2014-07-30 15:23   ` Eric Blake
  0 siblings, 0 replies; 20+ messages in thread
From: Eric Blake @ 2014-07-30 15:23 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 1037 bytes --]

On 07/25/2014 12:07 PM, Max Reitz wrote:
> The only really time-consuming operation potentially performed by
> qcow2_amend_options() is zero cluster expansion when downgrading qcow2
> images from compat=1.1 to compat=0.10, so report status of that
> operation and that operation only through the status CB.
> 
> For this, count the number of L2 zero entries, use this as the basis for
> the total "amend job length" and increase the current offset by the
> cluster size multiplied by the refcount of the L2 table when expanding
> zero clusters.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/qcow2-cluster.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  block/qcow2.c         |   9 ++--
>  block/qcow2.h         |   3 +-
>  3 files changed, 147 insertions(+), 10 deletions(-)
> 

Makes sense, although I'll defer R-b until I've looked at both competing
series in entirety.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 539 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 6/8] block/qcow2: Simplify shared L2 handling in amend
  2014-07-25 18:07 ` [Qemu-devel] [PATCH 6/8] block/qcow2: Simplify shared L2 handling in amend Max Reitz
@ 2014-07-30 15:36   ` Eric Blake
  0 siblings, 0 replies; 20+ messages in thread
From: Eric Blake @ 2014-07-30 15:36 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 858 bytes --]

On 07/25/2014 12:07 PM, Max Reitz wrote:
> Currently, we have a bitmap for keeping track of which clusters have
> been created during the zero cluster expansion process. This was
> necessary because we need to properly increase the refcount for shared
> L2 tables.
> 
> However, now we can simply take the L2 refcount and use it for the
> cluster allocated for expansion. This will be the correct refcount and
> therefore we don't have to remember that cluster having been allocated
> any more.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/qcow2-cluster.c | 83 +++++++++++++--------------------------------------
>  1 file changed, 21 insertions(+), 62 deletions(-)

Slick - less code, and faster operation :)

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 539 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 7/8] block/qcow2: Speed up zero cluster expansion
  2014-07-25 18:07 ` [Qemu-devel] [PATCH 7/8] block/qcow2: Speed up zero cluster expansion Max Reitz
@ 2014-07-30 16:14   ` Eric Blake
  2014-07-30 20:31     ` Max Reitz
  2014-07-30 20:31     ` Eric Blake
  0 siblings, 2 replies; 20+ messages in thread
From: Eric Blake @ 2014-07-30 16:14 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 5044 bytes --]

On 07/25/2014 12:07 PM, Max Reitz wrote:
> Actually, we do not need to allocate a new data cluster for every zero
> cluster to be expanded: It is completely sufficient to rely on qcow2's
> COW part and instead create a single zero cluster and reuse it as much
> as possible.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/qcow2-cluster.c | 119 ++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 92 insertions(+), 27 deletions(-)
> 
> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index 905beb6..867db03 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -1558,6 +1558,9 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
>      BDRVQcowState *s = bs->opaque;
>      bool is_active_l1 = (l1_table == s->l1_table);
>      uint64_t *l2_table = NULL;
> +    int64_t zeroed_cluster_offset = 0;
> +    int zeroed_cluster_refcount = 0;
> +    int last_zeroed_cluster_l1i = 0, last_zeroed_cluster_l2i = 0;
>      int ret;
>      int i, j;
>  
> @@ -1617,47 +1620,79 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
>                      continue;
>                  }
>  
> -                offset = qcow2_alloc_clusters(bs, s->cluster_size);
> -                if (offset < 0) {
> -                    ret = offset;
> -                    goto fail;
> +                if (zeroed_cluster_offset) {
> +                    zeroed_cluster_refcount += l2_refcount;
> +                    if (zeroed_cluster_refcount > 0xffff) {

Doesn't the qcow2 file format allow variable-sized maximum refcount
(bytes 96-99 refcount_order in the header)?  Therefore, you should be
using the value computed from the header rather than hard-coding the
assumption that the header used (the default of) 16-bit refcount. [Yeah,
I know, we don't yet have code that supports non-default size, even
though the file format documents it, but that doesn't mean we should
make it harder to add support down the road...]

> +                        zeroed_cluster_refcount = 0;
> +                        zeroed_cluster_offset = 0;
> +                    }
>                  }

This isn't a maximal packing.  As long as we don't mind complexity to
gain compactness, couldn't we also expand the existing
zeroed_cluster_offset all the way up to full refcount, and decrement
l2_refcount by the difference, before spilling over to allocating the
next zero cluster?

Also, I have to wonder - since the all-zero cluster is the most likely
cluster to have a large refcount, even during normal runtime, should we
special case the normal qcow2 write code to track the current all-zero
cluster (if any), and merely increase its refcount rather than allocate
a new cluster any time it is detected that an all-zero cluster is
needed?  [Of course, the tracking would be runtime only, since
compat=0.10 header doesn't provide any way to track the location of an
all-zero cluster across file reloads.  Each new runtime would probably
settle on a new location for the all-zero cluster used during that run,
rather than trying to find an existing one.  And there's really no point
to adding a header to track an all-zero cluster in compat=1.1 images,
since those images already have the ability to track zero clusters
without needing one allocated.]

> +                if (!zeroed_cluster_offset) {
> +                    offset = qcow2_alloc_clusters(bs, s->cluster_size);
> +                    if (offset < 0) {
> +                        ret = offset;
> +                        goto fail;
> +                    }
>  
> -                if (l2_refcount > 1) {
> -                    /* For shared L2 tables, set the refcount accordingly (it is
> -                     * already 1 and needs to be l2_refcount) */
> -                    ret = qcow2_update_cluster_refcount(bs,
> -                            offset >> s->cluster_bits, l2_refcount - 1,
> -                            QCOW2_DISCARD_OTHER);
> +                    ret = qcow2_pre_write_overlap_check(bs, 0, offset,
> +                                                        s->cluster_size);
> +                    if (ret < 0) {
> +                        qcow2_free_clusters(bs, offset, s->cluster_size,
> +                                            QCOW2_DISCARD_OTHER);
> +                        goto fail;
> +                    }
> +
> +                    ret = bdrv_write_zeroes(bs->file, offset / BDRV_SECTOR_SIZE,
> +                                            s->cluster_sectors, 0);

That is, if bdrv_write_zeroes knows how to take advantage of an already
existing all-zero cluster, it would be less special casing in this code,
but still get the same benefits of maximizing refcount during the amend
operation, if all expanded clusters go through bdrv_write_zeroes.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 539 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 2/8] qemu-img: Add progress output for amend
  2014-07-30 14:55   ` Eric Blake
@ 2014-07-30 20:20     ` Max Reitz
  0 siblings, 0 replies; 20+ messages in thread
From: Max Reitz @ 2014-07-30 20:20 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi

On 30.07.2014 16:55, Eric Blake wrote:
> On 07/25/2014 12:07 PM, Max Reitz wrote:
>> Now that bdrv_amend_options() supports a status callback, use it to
>> display a progress report.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   qemu-img.c | 26 +++++++++++++++++++++++---
>>   1 file changed, 23 insertions(+), 3 deletions(-)
> No documentation of the new option?  The code looks fine, but the texi
> needs to mention -p before I can give R-b

Oops, I completely forgot. I'll fix it.

Max

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 7/8] block/qcow2: Speed up zero cluster expansion
  2014-07-30 16:14   ` Eric Blake
@ 2014-07-30 20:31     ` Max Reitz
  2014-07-30 20:31     ` Eric Blake
  1 sibling, 0 replies; 20+ messages in thread
From: Max Reitz @ 2014-07-30 20:31 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi

On 30.07.2014 18:14, Eric Blake wrote:
> On 07/25/2014 12:07 PM, Max Reitz wrote:
>> Actually, we do not need to allocate a new data cluster for every zero
>> cluster to be expanded: It is completely sufficient to rely on qcow2's
>> COW part and instead create a single zero cluster and reuse it as much
>> as possible.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   block/qcow2-cluster.c | 119 ++++++++++++++++++++++++++++++++++++++------------
>>   1 file changed, 92 insertions(+), 27 deletions(-)
>>
>> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
>> index 905beb6..867db03 100644
>> --- a/block/qcow2-cluster.c
>> +++ b/block/qcow2-cluster.c
>> @@ -1558,6 +1558,9 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
>>       BDRVQcowState *s = bs->opaque;
>>       bool is_active_l1 = (l1_table == s->l1_table);
>>       uint64_t *l2_table = NULL;
>> +    int64_t zeroed_cluster_offset = 0;
>> +    int zeroed_cluster_refcount = 0;
>> +    int last_zeroed_cluster_l1i = 0, last_zeroed_cluster_l2i = 0;
>>       int ret;
>>       int i, j;
>>   
>> @@ -1617,47 +1620,79 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
>>                       continue;
>>                   }
>>   
>> -                offset = qcow2_alloc_clusters(bs, s->cluster_size);
>> -                if (offset < 0) {
>> -                    ret = offset;
>> -                    goto fail;
>> +                if (zeroed_cluster_offset) {
>> +                    zeroed_cluster_refcount += l2_refcount;
>> +                    if (zeroed_cluster_refcount > 0xffff) {
> Doesn't the qcow2 file format allow variable-sized maximum refcount
> (bytes 96-99 refcount_order in the header)?  Therefore, you should be
> using the value computed from the header rather than hard-coding the
> assumption that the header used (the default of) 16-bit refcount.

Oh, you're right, I didn't even think of that.

> [Yeah,
> I know, we don't yet have code that supports non-default size, even
> though the file format documents it, but that doesn't mean we should
> make it harder to add support down the road...]

And this is probably why, yes. ;-)

>> +                        zeroed_cluster_refcount = 0;
>> +                        zeroed_cluster_offset = 0;
>> +                    }
>>                   }
> This isn't a maximal packing.  As long as we don't mind complexity to
> gain compactness, couldn't we also expand the existing
> zeroed_cluster_offset all the way up to full refcount, and decrement
> l2_refcount by the difference, before spilling over to allocating the
> next zero cluster?

Hm, right.

> Also, I have to wonder - since the all-zero cluster is the most likely
> cluster to have a large refcount, even during normal runtime, should we
> special case the normal qcow2 write code to track the current all-zero
> cluster (if any), and merely increase its refcount rather than allocate
> a new cluster any time it is detected that an all-zero cluster is
> needed?  [Of course, the tracking would be runtime only, since
> compat=0.10 header doesn't provide any way to track the location of an
> all-zero cluster across file reloads.  Each new runtime would probably
> settle on a new location for the all-zero cluster used during that run,
> rather than trying to find an existing one.  And there's really no point
> to adding a header to track an all-zero cluster in compat=1.1 images,
> since those images already have the ability to track zero clusters
> without needing one allocated.]

This may improve performance for compat=0.10 images; however, I don't 
think we should care that much about compat=0.10 images to justify 
optimizations specifically for those all over the qcow2 code.

>> +                if (!zeroed_cluster_offset) {
>> +                    offset = qcow2_alloc_clusters(bs, s->cluster_size);
>> +                    if (offset < 0) {
>> +                        ret = offset;
>> +                        goto fail;
>> +                    }
>>   
>> -                if (l2_refcount > 1) {
>> -                    /* For shared L2 tables, set the refcount accordingly (it is
>> -                     * already 1 and needs to be l2_refcount) */
>> -                    ret = qcow2_update_cluster_refcount(bs,
>> -                            offset >> s->cluster_bits, l2_refcount - 1,
>> -                            QCOW2_DISCARD_OTHER);
>> +                    ret = qcow2_pre_write_overlap_check(bs, 0, offset,
>> +                                                        s->cluster_size);
>> +                    if (ret < 0) {
>> +                        qcow2_free_clusters(bs, offset, s->cluster_size,
>> +                                            QCOW2_DISCARD_OTHER);
>> +                        goto fail;
>> +                    }
>> +
>> +                    ret = bdrv_write_zeroes(bs->file, offset / BDRV_SECTOR_SIZE,
>> +                                            s->cluster_sectors, 0);
> That is, if bdrv_write_zeroes knows how to take advantage of an already
> existing all-zero cluster, it would be less special casing in this code,
> but still get the same benefits of maximizing refcount during the amend
> operation, if all expanded clusters go through bdrv_write_zeroes.

The special casing would then be somewhere else and I think only this 
code would really benefit from it. I don't think we should ingrain such 
optimizations in all of the qcow2 code; I personally can live with 
having these optimizations contained and separated from the rest of the 
qcow2 code in functions like this one, but anything else would be a bit 
too much effort (and maybe even too error-prone, since it would probably 
be rather complex) for current qemu. We do have compat=1.1 with zero 
clusters and that is what should be used for performance.

I'll wait with v2 of this patch until you have decided on whether it's 
worth it (in the alternative series, I completely dropped this patch).

Max

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 7/8] block/qcow2: Speed up zero cluster expansion
  2014-07-30 16:14   ` Eric Blake
  2014-07-30 20:31     ` Max Reitz
@ 2014-07-30 20:31     ` Eric Blake
  2014-07-30 20:41       ` Max Reitz
  1 sibling, 1 reply; 20+ messages in thread
From: Eric Blake @ 2014-07-30 20:31 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 2183 bytes --]

On 07/30/2014 10:14 AM, Eric Blake wrote:
> On 07/25/2014 12:07 PM, Max Reitz wrote:
>> Actually, we do not need to allocate a new data cluster for every zero
>> cluster to be expanded: It is completely sufficient to rely on qcow2's
>> COW part and instead create a single zero cluster and reuse it as much
>> as possible.

> Also, I have to wonder - since the all-zero cluster is the most likely
> cluster to have a large refcount, even during normal runtime, should we
> special case the normal qcow2 write code to track the current all-zero
> cluster (if any), and merely increase its refcount rather than allocate
> a new cluster any time it is detected that an all-zero cluster is
> needed?  [Of course, the tracking would be runtime only, since
> compat=0.10 header doesn't provide any way to track the location of an
> all-zero cluster across file reloads.  Each new runtime would probably
> settle on a new location for the all-zero cluster used during that run,
> rather than trying to find an existing one.  And there's really no point
> to adding a header to track an all-zero cluster in compat=1.1 images,
> since those images already have the ability to track zero clusters
> without needing one allocated.]


>> +                    ret = bdrv_write_zeroes(bs->file, offset / BDRV_SECTOR_SIZE,
>> +                                            s->cluster_sectors, 0);
> 
> That is, if bdrv_write_zeroes knows how to take advantage of an already
> existing all-zero cluster, it would be less special casing in this code,
> but still get the same benefits of maximizing refcount during the amend
> operation, if all expanded clusters go through bdrv_write_zeroes.

Now that I've looked through both variants, I'm leaning towards the
simplicity of your alternate series, rather than the complexity of this
one, if we can (independently?) optimize bdrv_write_zeroes to reuse a
known-all-zeroes cluster when possible.  Of course, you may want to get
other opinions than just mine before posting your next round of these
patches.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 539 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCH 7/8] block/qcow2: Speed up zero cluster expansion
  2014-07-30 20:31     ` Eric Blake
@ 2014-07-30 20:41       ` Max Reitz
  0 siblings, 0 replies; 20+ messages in thread
From: Max Reitz @ 2014-07-30 20:41 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi

On 30.07.2014 22:31, Eric Blake wrote:
> On 07/30/2014 10:14 AM, Eric Blake wrote:
>> On 07/25/2014 12:07 PM, Max Reitz wrote:
>>> Actually, we do not need to allocate a new data cluster for every zero
>>> cluster to be expanded: It is completely sufficient to rely on qcow2's
>>> COW part and instead create a single zero cluster and reuse it as much
>>> as possible.
>> Also, I have to wonder - since the all-zero cluster is the most likely
>> cluster to have a large refcount, even during normal runtime, should we
>> special case the normal qcow2 write code to track the current all-zero
>> cluster (if any), and merely increase its refcount rather than allocate
>> a new cluster any time it is detected that an all-zero cluster is
>> needed?  [Of course, the tracking would be runtime only, since
>> compat=0.10 header doesn't provide any way to track the location of an
>> all-zero cluster across file reloads.  Each new runtime would probably
>> settle on a new location for the all-zero cluster used during that run,
>> rather than trying to find an existing one.  And there's really no point
>> to adding a header to track an all-zero cluster in compat=1.1 images,
>> since those images already have the ability to track zero clusters
>> without needing one allocated.]
>
>>> +                    ret = bdrv_write_zeroes(bs->file, offset / BDRV_SECTOR_SIZE,
>>> +                                            s->cluster_sectors, 0);
>> That is, if bdrv_write_zeroes knows how to take advantage of an already
>> existing all-zero cluster, it would be less special casing in this code,
>> but still get the same benefits of maximizing refcount during the amend
>> operation, if all expanded clusters go through bdrv_write_zeroes.
> Now that I've looked through both variants, I'm leaning towards the
> simplicity of your alternate series, rather than the complexity of this
> one, if we can (independently?) optimize bdrv_write_zeroes to reuse a
> known-all-zeroes cluster when possible.  Of course, you may want to get
> other opinions than just mine before posting your next round of these
> patches.

I'm pretty sure Kevin prefers a variant which is as simple as possible, 
so I'll use that (alternative) version for v2, then.

However, I still think we should not optimize bdrv_write_zeroes(). As 
far as I know, qemu should work best with raw and qcow2 in its current 
version. raw will not support things like a common zero cluster anyway; 
and qcow2 in its current version has zero clusters built-in. I don't 
think we should optimize for qcow2 compat=0.10 to make up for things it 
lacks in comparison to compat=1.1 by design.

Also, in regard to this patch: bs->file is most probably a raw file 
which won't support a common zero cluster. If we want to optimize the 
bdrv_write_zeroes() call alone, all we can do is to allow it to discard 
the sectors (which I guess I'll just do in v2 because it doesn't cost 
anything).

In any case, if later on I or somebody else does decide to optimize 
bdrv_write_zeroes() we can still implement this optimization 
independently of this series.

Max

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2014-07-30 20:41 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-25 18:07 [Qemu-devel] [PATCH 0/8] block/qcow2: Improve (?) zero cluster expansion Max Reitz
2014-07-25 18:07 ` [Qemu-devel] [PATCH 1/8] block: Add status callback to bdrv_amend_options() Max Reitz
2014-07-30 14:50   ` Eric Blake
2014-07-25 18:07 ` [Qemu-devel] [PATCH 2/8] qemu-img: Add progress output for amend Max Reitz
2014-07-30 14:55   ` Eric Blake
2014-07-30 20:20     ` Max Reitz
2014-07-25 18:07 ` [Qemu-devel] [PATCH 3/8] qemu-img: Fix insignifcant memleak Max Reitz
2014-07-30 14:56   ` Eric Blake
2014-07-25 18:07 ` [Qemu-devel] [PATCH 4/8] block/qcow2: Make get_refcount() global Max Reitz
2014-07-30 15:04   ` Eric Blake
2014-07-25 18:07 ` [Qemu-devel] [PATCH 5/8] block/qcow2: Implement status CB for amend Max Reitz
2014-07-30 15:23   ` Eric Blake
2014-07-25 18:07 ` [Qemu-devel] [PATCH 6/8] block/qcow2: Simplify shared L2 handling in amend Max Reitz
2014-07-30 15:36   ` Eric Blake
2014-07-25 18:07 ` [Qemu-devel] [PATCH 7/8] block/qcow2: Speed up zero cluster expansion Max Reitz
2014-07-30 16:14   ` Eric Blake
2014-07-30 20:31     ` Max Reitz
2014-07-30 20:31     ` Eric Blake
2014-07-30 20:41       ` Max Reitz
2014-07-25 18:07 ` [Qemu-devel] [PATCH 8/8] iotests: Expand test 061 Max Reitz

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.