* [PATCH 00/17] Improve qcow2 all-zero detection
@ 2020-01-31 17:44 Eric Blake
  2020-01-31 17:44 ` [PATCH 01/17] qcow2: Comment typo fixes Eric Blake
                   ` (18 more replies)
  0 siblings, 19 replies; 73+ messages in thread
From: Eric Blake @ 2020-01-31 17:44 UTC (permalink / raw)
  To: qemu-devel; +Cc: david.edmondson, qemu-block, mreitz

Based-on: <20200124103458.1525982-2-david.edmondson@oracle.com>
([PATCH v2 1/2] qemu-img: Add --target-is-zero to convert)

I'm working on adding an NBD extension that reports whether an image
is already all zero when the client first connects.  I initially
thought I could write the NBD code to just call bdrv_has_zero_init(),
but that turned out to be a bad assumption that instead resulted in
this patch series.  The NBD patch will come later (and will be
cross-posted to the NBD protocol, libnbd, nbdkit, and qemu lists, as it
will affect all four repositories).

I do have an RFC question on patch 13 - as implemented here, I set a
qcow2 bit if the image has all clusters known zero and no backing
image.  But it may be more useful to instead report whether all
clusters _allocated in this layer_ are zero, at which point the
overall image is all-zero only if the backing file also has that
property (or even make it two bits).  The tweaks to subsequent patches
based on what we think makes the most useful semantics shouldn't be
hard.
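
The two candidate semantics in the RFC question can be contrasted with
a small model.  This is only a sketch: struct layer and both helper
functions are hypothetical names for illustration, not QEMU code, and
the per-layer variant assumes the bit would describe only the clusters
allocated in that layer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one image layer; not a QEMU structure. */
struct layer {
    bool all_zero_bit;            /* candidate autoclear bit of this layer */
    const struct layer *backing;  /* NULL when there is no backing image */
};

/*
 * Semantics as posted: the bit is only meaningful when the image has
 * no backing image, so a single check answers the question.
 */
static bool reads_zero_as_posted(const struct layer *top)
{
    return top->all_zero_bit && top->backing == NULL;
}

/*
 * Alternative semantics: the bit covers only this layer, so the whole
 * chain reads as zero only if every layer has the bit set.
 */
static bool reads_zero_per_layer(const struct layer *top)
{
    for (const struct layer *l = top; l != NULL; l = l->backing) {
        if (!l->all_zero_bit) {
            return false;
        }
    }
    return true;
}
```

With the per-layer reading, an overlay on top of an all-zero base can
still be reported as all-zero; with the posted reading it cannot.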

[repo.or.cz appears to be down as I type this; I'll post a link to a
repository later when it comes back up]

Eric Blake (17):
  qcow2: Comment typo fixes
  qcow2: List autoclear bit names in header
  qcow2: Avoid feature name extension on small cluster size
  block: Improve documentation of .bdrv_has_zero_init
  block: Don't advertise zero_init_truncate with encryption
  block: Improve bdrv_has_zero_init_truncate with backing file
  gluster: Drop useless has_zero_init callback
  sheepdog: Consistently set bdrv_has_zero_init_truncate
  block: Refactor bdrv_has_zero_init{,_truncate}
  block: Add new BDRV_ZERO_OPEN flag
  file-posix: Support BDRV_ZERO_OPEN
  gluster: Support BDRV_ZERO_OPEN
  qcow2: Add new autoclear feature for all zero image
  qcow2: Expose all zero bit through .bdrv_known_zeroes
  qcow2: Implement all-zero autoclear bit
  iotests: Add new test for qcow2 all-zero bit
  qcow2: Let qemu-img check cover all-zero bit

 block.c                    |  62 +++++----
 block/file-posix.c         |  16 ++-
 block/file-win32.c         |   3 +-
 block/gluster.c            |  34 +++--
 block/nfs.c                |   7 +-
 block/parallels.c          |   4 +-
 block/qcow.c               |   2 +-
 block/qcow2-refcount.c     |  60 +++++++-
 block/qcow2-snapshot.c     |  11 ++
 block/qcow2.c              | 150 +++++++++++++++++---
 block/qcow2.h              |   6 +-
 block/qed.c                |   3 +-
 block/raw-format.c         |  12 +-
 block/rbd.c                |   3 +-
 block/sheepdog.c           |   7 +-
 block/ssh.c                |   7 +-
 block/vdi.c                |   8 +-
 block/vhdx.c               |  16 +--
 block/vmdk.c               |   9 +-
 block/vpc.c                |   8 +-
 blockdev.c                 |   2 +-
 docs/interop/qcow2.txt     |  15 +-
 include/block/block.h      |  38 ++++-
 include/block/block_int.h  |  14 +-
 qapi/block-core.json       |   4 +
 qemu-img.c                 |   9 +-
 tests/qemu-iotests/031.out |  14 +-
 tests/qemu-iotests/036     |   6 +-
 tests/qemu-iotests/036.out |  10 +-
 tests/qemu-iotests/060.out |   6 +-
 tests/qemu-iotests/061     |   6 +-
 tests/qemu-iotests/061.out |  26 ++--
 tests/qemu-iotests/065     |  12 +-
 tests/qemu-iotests/082.out |   7 +
 tests/qemu-iotests/122     |   2 +-
 tests/qemu-iotests/188     |   2 +-
 tests/qemu-iotests/188.out |   2 +-
 tests/qemu-iotests/206.out |   4 +
 tests/qemu-iotests/242.out |   1 +
 tests/qemu-iotests/285     | 124 +++++++++++++++++
 tests/qemu-iotests/285.out | 277 +++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/group   |   1 +
 42 files changed, 832 insertions(+), 178 deletions(-)
 create mode 100755 tests/qemu-iotests/285
 create mode 100644 tests/qemu-iotests/285.out

-- 
2.24.1




* [PATCH 01/17] qcow2: Comment typo fixes
  2020-01-31 17:44 [PATCH 00/17] Improve qcow2 all-zero detection Eric Blake
@ 2020-01-31 17:44 ` Eric Blake
  2020-02-04 14:12   ` Vladimir Sementsov-Ogievskiy
  2020-02-09 19:34   ` Alberto Garcia
  2020-01-31 17:44 ` [PATCH 02/17] qcow2: List autoclear bit names in header Eric Blake
                   ` (17 subsequent siblings)
  18 siblings, 2 replies; 73+ messages in thread
From: Eric Blake @ 2020-01-31 17:44 UTC (permalink / raw)
  To: qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block, mreitz

Various trivial typos noticed while working on this file.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/qcow2.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index cef9d72b3a16..30fd3d13032a 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -174,7 +174,7 @@ static ssize_t qcow2_crypto_hdr_write_func(QCryptoBlock *block, size_t offset,
 }


-/* 
+/*
  * read qcow2 extension and fill bs
  * start reading from start_offset
  * finish reading upon magic of value 0 or when end_offset reached
@@ -3251,7 +3251,7 @@ qcow2_co_create(BlockdevCreateOptions *create_options, Error **errp)
      * inconsistency later.
      *
      * We do need a refcount table because growing the refcount table means
-     * allocating two new refcount blocks - the seconds of which would be at
+     * allocating two new refcount blocks - the second of which would be at
      * 2 GB for 64k clusters, and we don't want to have a 2 GB initial file
      * size for any qcow2 image.
      */
@@ -3495,7 +3495,7 @@ qcow2_co_create(BlockdevCreateOptions *create_options, Error **errp)
         goto out;
     }

-    /* Want a backing file? There you go.*/
+    /* Want a backing file? There you go. */
     if (qcow2_opts->has_backing_file) {
         const char *backing_format = NULL;

-- 
2.24.1




* [PATCH 02/17] qcow2: List autoclear bit names in header
  2020-01-31 17:44 [PATCH 00/17] Improve qcow2 all-zero detection Eric Blake
  2020-01-31 17:44 ` [PATCH 01/17] qcow2: Comment typo fixes Eric Blake
@ 2020-01-31 17:44 ` Eric Blake
  2020-02-04 14:26   ` Vladimir Sementsov-Ogievskiy
  2020-01-31 17:44 ` [PATCH 03/17] qcow2: Avoid feature name extension on small cluster size Eric Blake
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 73+ messages in thread
From: Eric Blake @ 2020-01-31 17:44 UTC (permalink / raw)
  To: qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block, mreitz

The feature table is supposed to advertise the name of all feature
bits that we support; however, we forgot to update the table for
autoclear bits.  While at it, move the table to read-only memory in
code, and tweak the qcow2 spec to name the second autoclear bit.
Update iotests that are affected by the longer header length.

Fixes: 88ddffae
Fixes: 93c24936
Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/qcow2.c              | 12 +++++++++++-
 docs/interop/qcow2.txt     |  3 ++-
 tests/qemu-iotests/031.out |  8 ++++----
 tests/qemu-iotests/036.out |  4 ++--
 tests/qemu-iotests/061.out | 14 +++++++-------
 5 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 30fd3d13032a..d3e7709ac2b4 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2821,7 +2821,7 @@ int qcow2_update_header(BlockDriverState *bs)

     /* Feature table */
     if (s->qcow_version >= 3) {
-        Qcow2Feature features[] = {
+        static const Qcow2Feature features[] = {
             {
                 .type = QCOW2_FEAT_TYPE_INCOMPATIBLE,
                 .bit  = QCOW2_INCOMPAT_DIRTY_BITNR,
@@ -2842,6 +2842,16 @@ int qcow2_update_header(BlockDriverState *bs)
                 .bit  = QCOW2_COMPAT_LAZY_REFCOUNTS_BITNR,
                 .name = "lazy refcounts",
             },
+            {
+                .type = QCOW2_FEAT_TYPE_AUTOCLEAR,
+                .bit  = QCOW2_AUTOCLEAR_BITMAPS_BITNR,
+                .name = "consistent bitmaps",
+            },
+            {
+                .type = QCOW2_FEAT_TYPE_AUTOCLEAR,
+                .bit  = QCOW2_AUTOCLEAR_DATA_FILE_RAW_BITNR,
+                .name = "raw external data",
+            },
         };

         ret = header_ext_add(buf, QCOW2_EXT_MAGIC_FEATURE_TABLE,
diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
index af5711e53371..8510d74c8079 100644
--- a/docs/interop/qcow2.txt
+++ b/docs/interop/qcow2.txt
@@ -138,7 +138,8 @@ in the description of a field.
                                 bit is unset, the bitmaps extension data must be
                                 considered inconsistent.

-                    Bit 1:      If this bit is set, the external data file can
+                    Bit 1:      Raw external data bit
+                                If this bit is set, the external data file can
                                 be read as a consistent standalone raw image
                                 without looking at the qcow2 metadata.

diff --git a/tests/qemu-iotests/031.out b/tests/qemu-iotests/031.out
index d535e407bc30..46f97c5a4ea4 100644
--- a/tests/qemu-iotests/031.out
+++ b/tests/qemu-iotests/031.out
@@ -117,7 +117,7 @@ header_length             104

 Header extension:
 magic                     0x6803f857
-length                    192
+length                    288
 data                      <binary>

 Header extension:
@@ -150,7 +150,7 @@ header_length             104

 Header extension:
 magic                     0x6803f857
-length                    192
+length                    288
 data                      <binary>

 Header extension:
@@ -164,7 +164,7 @@ No errors were found on the image.

 magic                     0x514649fb
 version                   3
-backing_file_offset       0x178
+backing_file_offset       0x1d8
 backing_file_size         0x17
 cluster_bits              16
 size                      67108864
@@ -188,7 +188,7 @@ data                      'host_device'

 Header extension:
 magic                     0x6803f857
-length                    192
+length                    288
 data                      <binary>

 Header extension:
diff --git a/tests/qemu-iotests/036.out b/tests/qemu-iotests/036.out
index 0b52b934e115..23b699ce0622 100644
--- a/tests/qemu-iotests/036.out
+++ b/tests/qemu-iotests/036.out
@@ -26,7 +26,7 @@ compatible_features       []
 autoclear_features        [63]
 Header extension:
 magic                     0x6803f857
-length                    192
+length                    288
 data                      <binary>


@@ -38,7 +38,7 @@ compatible_features       []
 autoclear_features        []
 Header extension:
 magic                     0x6803f857
-length                    192
+length                    288
 data                      <binary>

 *** done
diff --git a/tests/qemu-iotests/061.out b/tests/qemu-iotests/061.out
index 8b3091a412bc..413cc4e0f4ab 100644
--- a/tests/qemu-iotests/061.out
+++ b/tests/qemu-iotests/061.out
@@ -26,7 +26,7 @@ header_length             104

 Header extension:
 magic                     0x6803f857
-length                    192
+length                    288
 data                      <binary>

 magic                     0x514649fb
@@ -84,7 +84,7 @@ header_length             104

 Header extension:
 magic                     0x6803f857
-length                    192
+length                    288
 data                      <binary>

 magic                     0x514649fb
@@ -140,7 +140,7 @@ header_length             104

 Header extension:
 magic                     0x6803f857
-length                    192
+length                    288
 data                      <binary>

 ERROR cluster 5 refcount=0 reference=1
@@ -195,7 +195,7 @@ header_length             104

 Header extension:
 magic                     0x6803f857
-length                    192
+length                    288
 data                      <binary>

 magic                     0x514649fb
@@ -264,7 +264,7 @@ header_length             104

 Header extension:
 magic                     0x6803f857
-length                    192
+length                    288
 data                      <binary>

 read 65536/65536 bytes at offset 44040192
@@ -298,7 +298,7 @@ header_length             104

 Header extension:
 magic                     0x6803f857
-length                    192
+length                    288
 data                      <binary>

 ERROR cluster 5 refcount=0 reference=1
@@ -327,7 +327,7 @@ header_length             104

 Header extension:
 magic                     0x6803f857
-length                    192
+length                    288
 data                      <binary>

 read 131072/131072 bytes at offset 0
-- 
2.24.1
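
The jump from 192 to 288 in the expected test output follows from the
fixed size of a feature name table entry.  A minimal sketch, assuming
the 48-byte entry layout from the qcow2 spec (1 byte type, 1 byte bit
number, 46 bytes of padded name); struct feature_entry and
feature_table_bytes() are illustrative names, not the QEMU definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative mirror of one feature name table entry. */
struct feature_entry {
    uint8_t type;      /* incompatible, compatible or autoclear field */
    uint8_t bit;       /* bit number within that field */
    char    name[46];  /* feature name, zero padded */
};

/* All members have alignment 1, so no padding is expected. */
_Static_assert(sizeof(struct feature_entry) == 48,
               "feature entry should occupy 48 bytes");

/* Payload length of a feature name table holding n_entries names. */
static size_t feature_table_bytes(size_t n_entries)
{
    return n_entries * sizeof(struct feature_entry);
}
```

Four entries give 192 bytes (which matches the old output, so the
count of four is inferred), and the two added autoclear names bring
that to 288; the backing_file_offset change from 0x178 to 0x1d8 is the
same 96 bytes.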




* [PATCH 03/17] qcow2: Avoid feature name extension on small cluster size
  2020-01-31 17:44 [PATCH 00/17] Improve qcow2 all-zero detection Eric Blake
  2020-01-31 17:44 ` [PATCH 01/17] qcow2: Comment typo fixes Eric Blake
  2020-01-31 17:44 ` [PATCH 02/17] qcow2: List autoclear bit names in header Eric Blake
@ 2020-01-31 17:44 ` Eric Blake
  2020-02-04 14:39   ` Vladimir Sementsov-Ogievskiy
  2020-02-09 19:28   ` Alberto Garcia
  2020-01-31 17:44 ` [PATCH 04/17] block: Improve documentation of .bdrv_has_zero_init Eric Blake
                   ` (15 subsequent siblings)
  18 siblings, 2 replies; 73+ messages in thread
From: Eric Blake @ 2020-01-31 17:44 UTC (permalink / raw)
  To: qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block, mreitz

As the feature name table can be quite large (over 9k if all 64 bits
of all three feature fields have names; a mere 8 features leaves only
8 bytes for a backing file name in a 512-byte cluster), it is unwise
to emit this optional header in images with small cluster sizes.

Update iotest 036 to skip running on small cluster sizes; meanwhile,
note that iotest 061 never passed on alternative cluster sizes
(however, I limited this patch to tests with output affected by adding
feature names, rather than auditing for other tests that are not
robust to alternative cluster sizes).

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/qcow2.c          | 11 +++++++++--
 tests/qemu-iotests/036 |  6 ++++--
 tests/qemu-iotests/061 |  6 ++++--
 3 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index d3e7709ac2b4..6ea06dbdf48a 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2819,8 +2819,15 @@ int qcow2_update_header(BlockDriverState *bs)
         buflen -= ret;
     }

-    /* Feature table */
-    if (s->qcow_version >= 3) {
+    /*
+     * Feature table.  A mere 8 feature names occupies 392 bytes, and
+     * when coupled with the v3 minimum header of 104 bytes plus the
+     * 8-byte end-of-extension marker, that would leave only 8 bytes
+     * for a backing file name in an image with 512-byte clusters.
+     * Thus, we choose to omit this header for cluster sizes 4k and
+     * smaller.
+     */
+    if (s->qcow_version >= 3 && s->cluster_size > 4096) {
         static const Qcow2Feature features[] = {
             {
                 .type = QCOW2_FEAT_TYPE_INCOMPATIBLE,
diff --git a/tests/qemu-iotests/036 b/tests/qemu-iotests/036
index 512598421c20..cf522de7a1aa 100755
--- a/tests/qemu-iotests/036
+++ b/tests/qemu-iotests/036
@@ -44,8 +44,10 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 _supported_fmt qcow2
 _supported_proto file
 # Only qcow2v3 and later supports feature bits;
-# qcow2.py does not support external data files
-_unsupported_imgopts 'compat=0.10' data_file
+# qcow2.py does not support external data files;
+# this test requires a cluster size large enough for the feature table
+_unsupported_imgopts 'compat=0.10' data_file \
+		     'cluster_size=\(512\|1024\|2048\|4096\)'

 echo
 echo === Image with unknown incompatible feature bit ===
diff --git a/tests/qemu-iotests/061 b/tests/qemu-iotests/061
index 36b040491fef..ce285d308408 100755
--- a/tests/qemu-iotests/061
+++ b/tests/qemu-iotests/061
@@ -44,8 +44,10 @@ _supported_os Linux
 # Conversion between different compat versions can only really work
 # with refcount_bits=16;
 # we have explicit tests for data_file here, but the whole test does
-# not work with it
-_unsupported_imgopts 'refcount_bits=\([^1]\|.\([^6]\|$\)\)' data_file
+# not work with it;
+# we have explicit tests for various cluster sizes, the remaining tests
+# require the default 64k cluster
+_unsupported_imgopts 'refcount_bits=\([^1]\|.\([^6]\|$\)\)' data_file cluster_size

 echo
 echo "=== Testing version downgrade with zero expansion ==="
-- 
2.24.1
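
The byte budget quoted in the commit message and in the new comment
can be rechecked with a few lines of arithmetic.  This is a sketch
under the assumptions stated there (104-byte minimum v3 header, 8-byte
end-of-extension marker) plus an 8-byte extension header and 48-byte
feature entries; backing_name_budget() is a hypothetical helper.

```c
#include <stdint.h>

enum {
    V3_HEADER_BYTES     = 104, /* minimum v3 header, per the comment */
    EXT_HEADER_BYTES    = 8,   /* magic + length in front of the table */
    END_MARKER_BYTES    = 8,   /* end-of-extension marker */
    FEATURE_ENTRY_BYTES = 48,  /* one feature name table entry */
};

/*
 * Bytes left in the first cluster for a backing file name once the
 * header, the feature name table and the end marker are accounted
 * for.  A negative result means the table alone does not fit.
 */
static int64_t backing_name_budget(int64_t cluster_size, int n_features)
{
    int64_t table = EXT_HEADER_BYTES +
                    (int64_t)n_features * FEATURE_ENTRY_BYTES;

    return cluster_size - V3_HEADER_BYTES - table - END_MARKER_BYTES;
}
```

For 8 names the table is 8 + 8 * 48 = 392 bytes, so a 512-byte cluster
has 512 - 104 - 392 - 8 = 8 bytes left; with all 192 possible names
the table is 9224 bytes and no longer fits in anything below 16k.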




* [PATCH 04/17] block: Improve documentation of .bdrv_has_zero_init
  2020-01-31 17:44 [PATCH 00/17] Improve qcow2 all-zero detection Eric Blake
                   ` (2 preceding siblings ...)
  2020-01-31 17:44 ` [PATCH 03/17] qcow2: Avoid feature name extension on small cluster size Eric Blake
@ 2020-01-31 17:44 ` Eric Blake
  2020-02-04 15:03   ` Vladimir Sementsov-Ogievskiy
  2020-01-31 17:44 ` [PATCH 05/17] block: Don't advertise zero_init_truncate with encryption Eric Blake
                   ` (14 subsequent siblings)
  18 siblings, 1 reply; 73+ messages in thread
From: Eric Blake @ 2020-01-31 17:44 UTC (permalink / raw)
  To: qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block, mreitz

Several drivers supply .bdrv_has_zero_init that returns 1, but lack
the .bdrv_has_zero_init_truncate callback (parallels and qed outright,
vdi in some scenarios).  A literal reading of the existing
documentation says such drivers are broken, because
bdrv_has_zero_init_truncate() defaults to zero if the callback is
missing; but in practice, the tie between the two functions is only
relevant when truncate is supported.  Clarify the documentation to
make it obvious that this is okay.

Fixes: 1dcaf527
Signed-off-by: Eric Blake <eblake@redhat.com>
---
 include/block/block_int.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 640fb82c789e..77ab45dc87cf 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -444,7 +444,8 @@ struct BlockDriver {
     /*
      * Returns 1 if newly created images are guaranteed to contain only
      * zeros, 0 otherwise.
-     * Must return 0 if .bdrv_has_zero_init_truncate() returns 0.
+     * Must return 0 if .bdrv_co_truncate is set and
+     * .bdrv_has_zero_init_truncate() returns 0.
      */
     int (*bdrv_has_zero_init)(BlockDriverState *bs);

-- 
2.24.1
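
The clarified rule can be written down as a predicate over what a
driver provides.  A minimal sketch: struct driver_model and
zero_init_contract_ok() are hypothetical, and the booleans stand in
for "the callback exists and returns 1".

```c
#include <stdbool.h>

/* Hypothetical summary of the relevant parts of a block driver. */
struct driver_model {
    bool has_truncate;        /* .bdrv_co_truncate is set */
    bool zero_init;           /* .bdrv_has_zero_init returns 1 */
    bool zero_init_truncate;  /* .bdrv_has_zero_init_truncate returns 1 */
};

/*
 * The documented contract: zero_init must be 0 whenever truncate is
 * supported and zero_init_truncate is 0.  Drivers without truncate
 * are unconstrained.
 */
static bool zero_init_contract_ok(const struct driver_model *d)
{
    if (d->has_truncate && !d->zero_init_truncate) {
        return !d->zero_init;
    }
    return true;
}
```

Under this reading a driver that lacks truncate support may return 1
from .bdrv_has_zero_init alone, while one that supports truncate needs
both callbacks to agree.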




* [PATCH 05/17] block: Don't advertise zero_init_truncate with encryption
  2020-01-31 17:44 [PATCH 00/17] Improve qcow2 all-zero detection Eric Blake
                   ` (3 preceding siblings ...)
  2020-01-31 17:44 ` [PATCH 04/17] block: Improve documentation of .bdrv_has_zero_init Eric Blake
@ 2020-01-31 17:44 ` Eric Blake
  2020-02-10 18:12   ` Alberto Garcia
  2020-01-31 17:44 ` [PATCH 06/17] block: Improve bdrv_has_zero_init_truncate with backing file Eric Blake
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 73+ messages in thread
From: Eric Blake @ 2020-01-31 17:44 UTC (permalink / raw)
  To: qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block, mreitz

Commit 38841dcd correctly argued that having qcow2 blindly return 1
for .bdrv_has_zero_init() is wrong for preallocated images built on
block devices, while .bdrv_has_zero_init_truncate() can still return 1
because it is only relied on when changing size with PREALLOC_MODE_OFF
(and this is true even for v2 images which lack the notion of an
explicit zero cluster, since the block layer already filters out the
case of a larger backing file leaking through).  However, it missed
the fact that encrypted images do not default to reading as zero in
any case.

Rather than changing qcow2's .bdrv_has_zero_init_truncate() to
point to a one-off function that special-cases bs->encrypted, it is
smarter to just move the logic about encryption directly to the block
layer (that is, the driver callbacks will never be invoked for
encrypted images, just like they are already not called when a backing
file is present).  This solution fixes the qcow2 issue, has no effect
on the crypto driver (which already lacks .bdrv_has_zero_init*
callbacks), and no other driver currently uses bs->encrypted.

One other reason to fix this at the block layer: any information we
expose about an encrypted image that in turn may alter timing of
algorithms run on that image can be considered a (slight) information
leak; refusing to optimize zero handling of encrypted images thus
avoids the possibility of that being a security concern.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block.c       | 19 ++++++++++++++++---
 block/qcow2.c |  2 --
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/block.c b/block.c
index 6c2b2bd2e292..296845040e59 100644
--- a/block.c
+++ b/block.c
@@ -5077,9 +5077,12 @@ int bdrv_has_zero_init(BlockDriverState *bs)
         return 0;
     }

-    /* If BS is a copy on write image, it is initialized to
-       the contents of the base image, which may not be zeroes.  */
-    if (bs->backing) {
+    /*
+     * If BS is a copy on write image, it is initialized to the
+     * contents of the base image, which may not be zeroes.  Likewise,
+     * encrypted images do not read as zero.
+     */
+    if (bs->backing || bs->encrypted) {
         return 0;
     }
     if (bs->drv->bdrv_has_zero_init) {
@@ -5099,6 +5102,16 @@ int bdrv_has_zero_init_truncate(BlockDriverState *bs)
         return 0;
     }

+    /*
+     * Encrypted images never default to reading all zero; and even if
+     * they did, advertising that fact might lead to an information
+     * leak based on timing comparisons of algorithms that change if
+     * our result were dynamic.
+     */
+    if (bs->encrypted) {
+        return 0;
+    }
+
     if (bs->backing) {
         /* Depends on the backing image length, but better safe than sorry */
         return 0;
diff --git a/block/qcow2.c b/block/qcow2.c
index 6ea06dbdf48a..40aa751d1de7 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -4934,8 +4934,6 @@ static int qcow2_has_zero_init(BlockDriverState *bs)

     if (!preallocated) {
         return 1;
-    } else if (bs->encrypted) {
-        return 0;
     } else {
         return bdrv_has_zero_init(s->data_file->bs);
     }
-- 
2.24.1
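
The effect of moving the check into the block layer can be modelled in
isolation.  This is a simplified sketch, not the block.c code: struct
bs_model, drv_always_one() and model_has_zero_init() are made-up names
that keep only the ordering of the checks shown in the diff.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few fields the check looks at. */
struct bs_model {
    bool has_backing;
    bool encrypted;
    int (*drv_has_zero_init)(const struct bs_model *bs);
};

/* A driver callback that blindly claims zero initialization. */
static int drv_always_one(const struct bs_model *bs)
{
    (void)bs;
    return 1;
}

/*
 * Generic layer: backing files and encryption are filtered out before
 * the driver is consulted, so the callback never sees those cases.
 */
static int model_has_zero_init(const struct bs_model *bs)
{
    if (bs->has_backing || bs->encrypted) {
        return 0;
    }
    if (bs->drv_has_zero_init) {
        return bs->drv_has_zero_init(bs);
    }
    return 0;
}
```

Even a driver that always answers 1 is overridden for encrypted
images, which is why the special case in qcow2_has_zero_init() can go.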




* [PATCH 06/17] block: Improve bdrv_has_zero_init_truncate with backing file
  2020-01-31 17:44 [PATCH 00/17] Improve qcow2 all-zero detection Eric Blake
                   ` (4 preceding siblings ...)
  2020-01-31 17:44 ` [PATCH 05/17] block: Don't advertise zero_init_truncate with encryption Eric Blake
@ 2020-01-31 17:44 ` Eric Blake
  2020-02-10 18:13   ` Alberto Garcia
  2020-01-31 17:44 ` [PATCH 07/17] gluster: Drop useless has_zero_init callback Eric Blake
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 73+ messages in thread
From: Eric Blake @ 2020-01-31 17:44 UTC (permalink / raw)
  To: qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block, mreitz

When we added bdrv_has_zero_init_truncate(), we chose to blindly
return 0 if a backing file was present, because we knew of the corner
case where a backing layer larger than the current layer might leak
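the tail of the backing layer into the resized region.  But that
setup is rare, and the blanket answer penalizes the more common case
of a backing layer smaller than the current layer.  Instead, compare
the two lengths and return 0 only when the current layer is smaller
than the backing layer or the backing length cannot be queried;
otherwise, trust the driver.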

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index 296845040e59..d132662f3103 100644
--- a/block.c
+++ b/block.c
@@ -5112,9 +5112,19 @@ int bdrv_has_zero_init_truncate(BlockDriverState *bs)
         return 0;
     }

+    /*
+     * If the current layer is smaller than the backing layer,
+     * truncation may expose backing data; treat failure to query size
+     * in the same manner. Otherwise, we can trust the driver.
+     */
+
     if (bs->backing) {
-        /* Depends on the backing image length, but better safe than sorry */
-        return 0;
+        int64_t back = bdrv_getlength(bs->backing->bs);
+        int64_t curr = bdrv_getlength(bs);
+
+        if (back < 0 || curr < back) {
+            return 0;
+        }
     }
     if (bs->drv->bdrv_has_zero_init_truncate) {
         return bs->drv->bdrv_has_zero_init_truncate(bs);
-- 
2.24.1
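
The length comparison from the hunk above, pulled out as a pure
function so the cases are easy to enumerate.  A sketch only:
truncate_reads_zero() is a hypothetical name, lengths are passed in
instead of being queried, and a negative length models a failed query.

```c
#include <stdint.h>

/*
 * Returns 0 when growing the image could expose backing data (or when
 * the backing length is unknown), otherwise the driver's own answer.
 */
static int truncate_reads_zero(int64_t curr_len, int64_t backing_len,
                               int driver_answer)
{
    if (backing_len < 0 || curr_len < backing_len) {
        return 0;
    }
    return driver_answer;
}
```

A failed query of the current length is also covered, since a negative
curr_len compares below any valid backing length.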




* [PATCH 07/17] gluster: Drop useless has_zero_init callback
  2020-01-31 17:44 [PATCH 00/17] Improve qcow2 all-zero detection Eric Blake
                   ` (5 preceding siblings ...)
  2020-01-31 17:44 ` [PATCH 06/17] block: Improve bdrv_has_zero_init_truncate with backing file Eric Blake
@ 2020-01-31 17:44 ` Eric Blake
  2020-02-04 15:06   ` Vladimir Sementsov-Ogievskiy
                     ` (2 more replies)
  2020-01-31 17:44 ` [PATCH 08/17] sheepdog: Consistently set bdrv_has_zero_init_truncate Eric Blake
                   ` (11 subsequent siblings)
  18 siblings, 3 replies; 73+ messages in thread
From: Eric Blake @ 2020-01-31 17:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: david.edmondson, Kevin Wolf, open list:GLUSTER, qemu-block, mreitz

block.c already defaults to 0 if we don't provide a callback; there's
no need to write a callback that always returns 0.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/gluster.c | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/block/gluster.c b/block/gluster.c
index 4fa4a77a4777..9d952c70981b 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -1357,12 +1357,6 @@ static int64_t qemu_gluster_allocated_file_size(BlockDriverState *bs)
     }
 }

-static int qemu_gluster_has_zero_init(BlockDriverState *bs)
-{
-    /* GlusterFS volume could be backed by a block device */
-    return 0;
-}
-
 /*
  * Find allocation range in @bs around offset @start.
  * May change underlying file descriptor's file offset.
@@ -1567,8 +1561,6 @@ static BlockDriver bdrv_gluster = {
     .bdrv_co_readv                = qemu_gluster_co_readv,
     .bdrv_co_writev               = qemu_gluster_co_writev,
     .bdrv_co_flush_to_disk        = qemu_gluster_co_flush_to_disk,
-    .bdrv_has_zero_init           = qemu_gluster_has_zero_init,
-    .bdrv_has_zero_init_truncate  = qemu_gluster_has_zero_init,
 #ifdef CONFIG_GLUSTERFS_DISCARD
     .bdrv_co_pdiscard             = qemu_gluster_co_pdiscard,
 #endif
@@ -1599,8 +1591,6 @@ static BlockDriver bdrv_gluster_tcp = {
     .bdrv_co_readv                = qemu_gluster_co_readv,
     .bdrv_co_writev               = qemu_gluster_co_writev,
     .bdrv_co_flush_to_disk        = qemu_gluster_co_flush_to_disk,
-    .bdrv_has_zero_init           = qemu_gluster_has_zero_init,
-    .bdrv_has_zero_init_truncate  = qemu_gluster_has_zero_init,
 #ifdef CONFIG_GLUSTERFS_DISCARD
     .bdrv_co_pdiscard             = qemu_gluster_co_pdiscard,
 #endif
@@ -1631,8 +1621,6 @@ static BlockDriver bdrv_gluster_unix = {
     .bdrv_co_readv                = qemu_gluster_co_readv,
     .bdrv_co_writev               = qemu_gluster_co_writev,
     .bdrv_co_flush_to_disk        = qemu_gluster_co_flush_to_disk,
-    .bdrv_has_zero_init           = qemu_gluster_has_zero_init,
-    .bdrv_has_zero_init_truncate  = qemu_gluster_has_zero_init,
 #ifdef CONFIG_GLUSTERFS_DISCARD
     .bdrv_co_pdiscard             = qemu_gluster_co_pdiscard,
 #endif
@@ -1669,8 +1657,6 @@ static BlockDriver bdrv_gluster_rdma = {
     .bdrv_co_readv                = qemu_gluster_co_readv,
     .bdrv_co_writev               = qemu_gluster_co_writev,
     .bdrv_co_flush_to_disk        = qemu_gluster_co_flush_to_disk,
-    .bdrv_has_zero_init           = qemu_gluster_has_zero_init,
-    .bdrv_has_zero_init_truncate  = qemu_gluster_has_zero_init,
 #ifdef CONFIG_GLUSTERFS_DISCARD
     .bdrv_co_pdiscard             = qemu_gluster_co_pdiscard,
 #endif
-- 
2.24.1




* [PATCH 08/17] sheepdog: Consistently set bdrv_has_zero_init_truncate
  2020-01-31 17:44 [PATCH 00/17] Improve qcow2 all-zero detection Eric Blake
                   ` (6 preceding siblings ...)
  2020-01-31 17:44 ` [PATCH 07/17] gluster: Drop useless has_zero_init callback Eric Blake
@ 2020-01-31 17:44 ` Eric Blake
  2020-02-04 15:09   ` Vladimir Sementsov-Ogievskiy
  2020-01-31 17:44 ` [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate} Eric Blake
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 73+ messages in thread
From: Eric Blake @ 2020-01-31 17:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, open list:Sheepdog, qemu-block, mreitz,
	david.edmondson, Liu Yuan

block_int.h claims that .bdrv_has_zero_init must return 0 if
.bdrv_has_zero_init_truncate does likewise; but this is violated when
only the former callback is provided and .bdrv_co_truncate also exists.
When adding the latter callback, it was mistakenly added to only one
of the three possible sheepdog instantiations.

Fixes: 1dcaf527
Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/sheepdog.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index cfa84338a2d6..522c16a93676 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -3269,6 +3269,7 @@ static BlockDriver bdrv_sheepdog_tcp = {
     .bdrv_co_create               = sd_co_create,
     .bdrv_co_create_opts          = sd_co_create_opts,
     .bdrv_has_zero_init           = bdrv_has_zero_init_1,
+    .bdrv_has_zero_init_truncate  = bdrv_has_zero_init_1,
     .bdrv_getlength               = sd_getlength,
     .bdrv_get_allocated_file_size = sd_get_allocated_file_size,
     .bdrv_co_truncate             = sd_co_truncate,
@@ -3307,6 +3308,7 @@ static BlockDriver bdrv_sheepdog_unix = {
     .bdrv_co_create               = sd_co_create,
     .bdrv_co_create_opts          = sd_co_create_opts,
     .bdrv_has_zero_init           = bdrv_has_zero_init_1,
+    .bdrv_has_zero_init_truncate  = bdrv_has_zero_init_1,
     .bdrv_getlength               = sd_getlength,
     .bdrv_get_allocated_file_size = sd_get_allocated_file_size,
     .bdrv_co_truncate             = sd_co_truncate,
-- 
2.24.1




* [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}
  2020-01-31 17:44 [PATCH 00/17] Improve qcow2 all-zero detection Eric Blake
                   ` (7 preceding siblings ...)
  2020-01-31 17:44 ` [PATCH 08/17] sheepdog: Consistently set bdrv_has_zero_init_truncate Eric Blake
@ 2020-01-31 17:44 ` Eric Blake
  2020-02-04 15:35   ` Vladimir Sementsov-Ogievskiy
  2020-02-04 17:53   ` Max Reitz
  2020-01-31 17:44 ` [PATCH 10/17] block: Add new BDRV_ZERO_OPEN flag Eric Blake
                   ` (9 subsequent siblings)
  18 siblings, 2 replies; 73+ messages in thread
From: Eric Blake @ 2020-01-31 17:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Fam Zheng, open list:Sheepdog, qemu-block, Jeff Cody,
	Stefan Weil, Peter Lieven, Richard W.M. Jones, mreitz,
	david.edmondson, Stefan Hajnoczi, Liu Yuan, Denis V. Lunev,
	Jason Dillaman, Markus Armbruster

Having two slightly-different function names for related purposes is
unwieldy, especially since I envision adding yet another notion of
zero support in an upcoming patch.  It doesn't help that
bdrv_has_zero_init() is a misleading name (I originally thought that a
driver could only return 1 when opening an already-existing image
known to be all zeroes; but in reality many drivers always return 1
because it only applies to a just-created image).  Refactor all uses
to instead have a single function that returns multiple bits of
information, with better naming and documentation.

No semantic change, although some of the changes (such as to qcow2.c)
require a careful reading to see how it remains the same.
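
As a reading aid, the mapping from the old boolean queries to the new
bitmask can be sketched in standalone C.  This is only an illustration
under stated assumptions: the flag values are the ones this patch adds
to include/block/block.h, but every sketch_* name below is hypothetical
and stands in for the real callback, which takes a BlockDriverState.

```c
#include <stdbool.h>

/* Flag values as proposed by this patch in include/block/block.h. */
enum {
    BDRV_ZERO_CREATE   = 0x1,
    BDRV_ZERO_TRUNCATE = 0x2,
};

/* Hypothetical stand-in for a driver's .bdrv_known_zeroes callback. */
typedef int (*sketch_known_zeroes_fn)(void);

/* Models a driver that used to set only .bdrv_has_zero_init. */
static int sketch_known_zeroes_create(void)
{
    return BDRV_ZERO_CREATE;
}

/* Models a driver that used to set both old callbacks. */
static int sketch_known_zeroes_truncate(void)
{
    return BDRV_ZERO_CREATE | BDRV_ZERO_TRUNCATE;
}

/* Equivalent of the old bdrv_has_zero_init() query. */
static bool sketch_has_zero_init(sketch_known_zeroes_fn fn)
{
    return !!(fn() & BDRV_ZERO_CREATE);
}

/* Equivalent of the old bdrv_has_zero_init_truncate() query. */
static bool sketch_has_zero_init_truncate(sketch_known_zeroes_fn fn)
{
    return !!(fn() & BDRV_ZERO_TRUNCATE);
}
```

Callers that previously tested one of the two functions now mask the
single return value with the bit they care about, as the hunks for
parallels, vhdx, blockdev.c and qemu-img.c do.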

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block.c                    | 49 ++++++++++++++------------------------
 block/file-posix.c         |  3 +--
 block/file-win32.c         |  3 +--
 block/nfs.c                |  7 +++---
 block/parallels.c          |  4 ++--
 block/qcow.c               |  2 +-
 block/qcow2.c              | 10 ++++----
 block/qed.c                |  3 +--
 block/raw-format.c         | 12 +++-------
 block/rbd.c                |  3 +--
 block/sheepdog.c           |  9 +++----
 block/ssh.c                |  7 +++---
 block/vdi.c                |  8 +++----
 block/vhdx.c               | 16 ++++++-------
 block/vmdk.c               |  9 +++----
 block/vpc.c                |  8 +++----
 blockdev.c                 |  2 +-
 include/block/block.h      | 28 +++++++++++++++++++---
 include/block/block_int.h  | 15 ++----------
 qemu-img.c                 |  3 ++-
 tests/qemu-iotests/122     |  2 +-
 tests/qemu-iotests/188     |  2 +-
 tests/qemu-iotests/188.out |  2 +-
 23 files changed, 96 insertions(+), 111 deletions(-)

diff --git a/block.c b/block.c
index d132662f3103..fac0813140aa 100644
--- a/block.c
+++ b/block.c
@@ -5066,38 +5066,20 @@ int bdrv_get_flags(BlockDriverState *bs)
     return bs->open_flags;
 }

-int bdrv_has_zero_init_1(BlockDriverState *bs)
+int bdrv_known_zeroes_create(BlockDriverState *bs)
 {
-    return 1;
+    return BDRV_ZERO_CREATE;
 }

-int bdrv_has_zero_init(BlockDriverState *bs)
+int bdrv_known_zeroes_truncate(BlockDriverState *bs)
 {
-    if (!bs->drv) {
-        return 0;
-    }
-
-    /*
-     * If BS is a copy on write image, it is initialized to the
-     * contents of the base image, which may not be zeroes.  Likewise,
-     * encrypted images do not read as zero.
-     */
-    if (bs->backing || bs->encrypted) {
-        return 0;
-    }
-    if (bs->drv->bdrv_has_zero_init) {
-        return bs->drv->bdrv_has_zero_init(bs);
-    }
-    if (bs->file && bs->drv->is_filter) {
-        return bdrv_has_zero_init(bs->file->bs);
-    }
-
-    /* safe default */
-    return 0;
+    return BDRV_ZERO_CREATE | BDRV_ZERO_TRUNCATE;
 }

-int bdrv_has_zero_init_truncate(BlockDriverState *bs)
+int bdrv_known_zeroes(BlockDriverState *bs)
 {
+    int mask = BDRV_ZERO_CREATE | BDRV_ZERO_TRUNCATE;
+
     if (!bs->drv) {
         return 0;
     }
@@ -5113,9 +5095,12 @@ int bdrv_has_zero_init_truncate(BlockDriverState *bs)
     }

     /*
-     * If the current layer is smaller than the backing layer,
-     * truncation may expose backing data; treat failure to query size
-     * in the same manner. Otherwise, we can trust the driver.
+     * If BS is a copy on write image, it is initialized to the
+     * contents of the base image, which may not be zeroes, so
+     * ZERO_CREATE is not viable.  If the current layer is smaller
+     * than the backing layer, truncation may expose backing data,
+     * restricting ZERO_TRUNCATE; treat failure to query size in the
+     * same manner.  Otherwise, we can trust the driver.
      */

     if (bs->backing) {
@@ -5125,12 +5110,14 @@ int bdrv_has_zero_init_truncate(BlockDriverState *bs)
         if (back < 0 || curr < back) {
             return 0;
         }
+        mask = BDRV_ZERO_TRUNCATE;
     }
-    if (bs->drv->bdrv_has_zero_init_truncate) {
-        return bs->drv->bdrv_has_zero_init_truncate(bs);
+
+    if (bs->drv->bdrv_known_zeroes) {
+        return bs->drv->bdrv_known_zeroes(bs) & mask;
     }
     if (bs->file && bs->drv->is_filter) {
-        return bdrv_has_zero_init_truncate(bs->file->bs);
+        return bdrv_known_zeroes(bs->file->bs) & mask;
     }

     /* safe default */
diff --git a/block/file-posix.c b/block/file-posix.c
index ab82ee1a6718..ff9e39ab882f 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -3071,8 +3071,7 @@ BlockDriver bdrv_file = {
     .bdrv_close = raw_close,
     .bdrv_co_create = raw_co_create,
     .bdrv_co_create_opts = raw_co_create_opts,
-    .bdrv_has_zero_init = bdrv_has_zero_init_1,
-    .bdrv_has_zero_init_truncate = bdrv_has_zero_init_1,
+    .bdrv_known_zeroes = bdrv_known_zeroes_truncate,
     .bdrv_co_block_status = raw_co_block_status,
     .bdrv_co_invalidate_cache = raw_co_invalidate_cache,
     .bdrv_co_pwrite_zeroes = raw_co_pwrite_zeroes,
diff --git a/block/file-win32.c b/block/file-win32.c
index 77e8ff7b68ae..e9b8f3b2370b 100644
--- a/block/file-win32.c
+++ b/block/file-win32.c
@@ -635,8 +635,7 @@ BlockDriver bdrv_file = {
     .bdrv_refresh_limits = raw_probe_alignment,
     .bdrv_close         = raw_close,
     .bdrv_co_create_opts = raw_co_create_opts,
-    .bdrv_has_zero_init = bdrv_has_zero_init_1,
-    .bdrv_has_zero_init_truncate = bdrv_has_zero_init_1,
+    .bdrv_known_zeroes  = bdrv_known_zeroes_truncate,

     .bdrv_aio_preadv    = raw_aio_preadv,
     .bdrv_aio_pwritev   = raw_aio_pwritev,
diff --git a/block/nfs.c b/block/nfs.c
index 9a6311e27066..34ebe91d5b39 100644
--- a/block/nfs.c
+++ b/block/nfs.c
@@ -702,10 +702,10 @@ out:
     return ret;
 }

-static int nfs_has_zero_init(BlockDriverState *bs)
+static int nfs_known_zeroes(BlockDriverState *bs)
 {
     NFSClient *client = bs->opaque;
-    return client->has_zero_init;
+    return client->has_zero_init ? BDRV_ZERO_CREATE | BDRV_ZERO_TRUNCATE : 0;
 }

 /* Called (via nfs_service) with QemuMutex held.  */
@@ -869,8 +869,7 @@ static BlockDriver bdrv_nfs = {
     .bdrv_parse_filename            = nfs_parse_filename,
     .create_opts                    = &nfs_create_opts,

-    .bdrv_has_zero_init             = nfs_has_zero_init,
-    .bdrv_has_zero_init_truncate    = nfs_has_zero_init,
+    .bdrv_known_zeroes              = nfs_known_zeroes,
     .bdrv_get_allocated_file_size   = nfs_get_allocated_file_size,
     .bdrv_co_truncate               = nfs_file_co_truncate,

diff --git a/block/parallels.c b/block/parallels.c
index 7a01997659b0..dad6389c8481 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -835,7 +835,7 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags,
         goto fail_options;
     }

-    if (!bdrv_has_zero_init_truncate(bs->file->bs)) {
+    if (!(bdrv_known_zeroes(bs->file->bs) & BDRV_ZERO_TRUNCATE)) {
         s->prealloc_mode = PRL_PREALLOC_MODE_FALLOCATE;
     }

@@ -906,7 +906,7 @@ static BlockDriver bdrv_parallels = {
     .bdrv_close		= parallels_close,
     .bdrv_child_perm          = bdrv_format_default_perms,
     .bdrv_co_block_status     = parallels_co_block_status,
-    .bdrv_has_zero_init       = bdrv_has_zero_init_1,
+    .bdrv_known_zeroes        = bdrv_known_zeroes_create,
     .bdrv_co_flush_to_os      = parallels_co_flush_to_os,
     .bdrv_co_readv  = parallels_co_readv,
     .bdrv_co_writev = parallels_co_writev,
diff --git a/block/qcow.c b/block/qcow.c
index fce89898681f..b0c9e212fdb1 100644
--- a/block/qcow.c
+++ b/block/qcow.c
@@ -1183,7 +1183,7 @@ static BlockDriver bdrv_qcow = {
     .bdrv_reopen_prepare    = qcow_reopen_prepare,
     .bdrv_co_create         = qcow_co_create,
     .bdrv_co_create_opts    = qcow_co_create_opts,
-    .bdrv_has_zero_init     = bdrv_has_zero_init_1,
+    .bdrv_known_zeroes      = bdrv_known_zeroes_create,
     .supports_backing       = true,
     .bdrv_refresh_limits    = qcow_refresh_limits,

diff --git a/block/qcow2.c b/block/qcow2.c
index 40aa751d1de7..9f2371925737 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -4914,10 +4914,11 @@ static ImageInfoSpecific *qcow2_get_specific_info(BlockDriverState *bs,
     return spec_info;
 }

-static int qcow2_has_zero_init(BlockDriverState *bs)
+static int qcow2_known_zeroes(BlockDriverState *bs)
 {
     BDRVQcow2State *s = bs->opaque;
     bool preallocated;
+    int r = BDRV_ZERO_TRUNCATE;

     if (qemu_in_coroutine()) {
         qemu_co_mutex_lock(&s->lock);
@@ -4933,9 +4934,9 @@ static int qcow2_has_zero_init(BlockDriverState *bs)
     }

     if (!preallocated) {
-        return 1;
+        return r | BDRV_ZERO_CREATE;
     } else {
-        return bdrv_has_zero_init(s->data_file->bs);
+        return r | bdrv_known_zeroes(s->data_file->bs);
     }
 }

@@ -5559,8 +5560,7 @@ BlockDriver bdrv_qcow2 = {
     .bdrv_child_perm      = bdrv_format_default_perms,
     .bdrv_co_create_opts  = qcow2_co_create_opts,
     .bdrv_co_create       = qcow2_co_create,
-    .bdrv_has_zero_init   = qcow2_has_zero_init,
-    .bdrv_has_zero_init_truncate = bdrv_has_zero_init_1,
+    .bdrv_known_zeroes    = qcow2_known_zeroes,
     .bdrv_co_block_status = qcow2_co_block_status,

     .bdrv_co_preadv_part    = qcow2_co_preadv_part,
diff --git a/block/qed.c b/block/qed.c
index d8c4e5fb1e85..b00cef2035b3 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -1672,8 +1672,7 @@ static BlockDriver bdrv_qed = {
     .bdrv_child_perm          = bdrv_format_default_perms,
     .bdrv_co_create           = bdrv_qed_co_create,
     .bdrv_co_create_opts      = bdrv_qed_co_create_opts,
-    .bdrv_has_zero_init       = bdrv_has_zero_init_1,
-    .bdrv_has_zero_init_truncate = bdrv_has_zero_init_1,
+    .bdrv_known_zeroes        = bdrv_known_zeroes_truncate,
     .bdrv_co_block_status     = bdrv_qed_co_block_status,
     .bdrv_co_readv            = bdrv_qed_co_readv,
     .bdrv_co_writev           = bdrv_qed_co_writev,
diff --git a/block/raw-format.c b/block/raw-format.c
index 3a76ec7dd21b..1334a7a2c224 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -409,14 +409,9 @@ static int raw_co_ioctl(BlockDriverState *bs, unsigned long int req, void *buf)
     return bdrv_co_ioctl(bs->file->bs, req, buf);
 }

-static int raw_has_zero_init(BlockDriverState *bs)
+static int raw_known_zeroes(BlockDriverState *bs)
 {
-    return bdrv_has_zero_init(bs->file->bs);
-}
-
-static int raw_has_zero_init_truncate(BlockDriverState *bs)
-{
-    return bdrv_has_zero_init_truncate(bs->file->bs);
+    return bdrv_known_zeroes(bs->file->bs);
 }

 static int coroutine_fn raw_co_create_opts(const char *filename, QemuOpts *opts,
@@ -577,8 +572,7 @@ BlockDriver bdrv_raw = {
     .bdrv_lock_medium     = &raw_lock_medium,
     .bdrv_co_ioctl        = &raw_co_ioctl,
     .create_opts          = &raw_create_opts,
-    .bdrv_has_zero_init   = &raw_has_zero_init,
-    .bdrv_has_zero_init_truncate = &raw_has_zero_init_truncate,
+    .bdrv_known_zeroes    = &raw_known_zeroes,
     .strong_runtime_opts  = raw_strong_runtime_opts,
     .mutable_opts         = mutable_opts,
 };
diff --git a/block/rbd.c b/block/rbd.c
index 027cbcc69520..6cd8e86bccec 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1289,8 +1289,7 @@ static BlockDriver bdrv_rbd = {
     .bdrv_reopen_prepare    = qemu_rbd_reopen_prepare,
     .bdrv_co_create         = qemu_rbd_co_create,
     .bdrv_co_create_opts    = qemu_rbd_co_create_opts,
-    .bdrv_has_zero_init     = bdrv_has_zero_init_1,
-    .bdrv_has_zero_init_truncate = bdrv_has_zero_init_1,
+    .bdrv_known_zeroes      = bdrv_known_zeroes_truncate,
     .bdrv_get_info          = qemu_rbd_getinfo,
     .create_opts            = &qemu_rbd_create_opts,
     .bdrv_getlength         = qemu_rbd_getlength,
diff --git a/block/sheepdog.c b/block/sheepdog.c
index 522c16a93676..916e64abdd74 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -3229,8 +3229,7 @@ static BlockDriver bdrv_sheepdog = {
     .bdrv_close                   = sd_close,
     .bdrv_co_create               = sd_co_create,
     .bdrv_co_create_opts          = sd_co_create_opts,
-    .bdrv_has_zero_init           = bdrv_has_zero_init_1,
-    .bdrv_has_zero_init_truncate  = bdrv_has_zero_init_1,
+    .bdrv_known_zeroes            = bdrv_known_zeroes_truncate,
     .bdrv_getlength               = sd_getlength,
     .bdrv_get_allocated_file_size = sd_get_allocated_file_size,
     .bdrv_co_truncate             = sd_co_truncate,
@@ -3268,8 +3267,7 @@ static BlockDriver bdrv_sheepdog_tcp = {
     .bdrv_close                   = sd_close,
     .bdrv_co_create               = sd_co_create,
     .bdrv_co_create_opts          = sd_co_create_opts,
-    .bdrv_has_zero_init           = bdrv_has_zero_init_1,
-    .bdrv_has_zero_init_truncate  = bdrv_has_zero_init_1,
+    .bdrv_known_zeroes            = bdrv_known_zeroes_truncate,
     .bdrv_getlength               = sd_getlength,
     .bdrv_get_allocated_file_size = sd_get_allocated_file_size,
     .bdrv_co_truncate             = sd_co_truncate,
@@ -3307,8 +3305,7 @@ static BlockDriver bdrv_sheepdog_unix = {
     .bdrv_close                   = sd_close,
     .bdrv_co_create               = sd_co_create,
     .bdrv_co_create_opts          = sd_co_create_opts,
-    .bdrv_has_zero_init           = bdrv_has_zero_init_1,
-    .bdrv_has_zero_init_truncate  = bdrv_has_zero_init_1,
+    .bdrv_known_zeroes            = bdrv_known_zeroes_truncate,
     .bdrv_getlength               = sd_getlength,
     .bdrv_get_allocated_file_size = sd_get_allocated_file_size,
     .bdrv_co_truncate             = sd_co_truncate,
diff --git a/block/ssh.c b/block/ssh.c
index b4375cf7d2e5..e89dae39800c 100644
--- a/block/ssh.c
+++ b/block/ssh.c
@@ -1007,14 +1007,14 @@ static void ssh_close(BlockDriverState *bs)
     ssh_state_free(s);
 }

-static int ssh_has_zero_init(BlockDriverState *bs)
+static int ssh_known_zeroes(BlockDriverState *bs)
 {
     BDRVSSHState *s = bs->opaque;
     /* Assume false, unless we can positively prove it's true. */
     int has_zero_init = 0;

     if (s->attrs->type == SSH_FILEXFER_TYPE_REGULAR) {
-        has_zero_init = 1;
+        has_zero_init = BDRV_ZERO_CREATE | BDRV_ZERO_TRUNCATE;
     }

     return has_zero_init;
@@ -1390,8 +1390,7 @@ static BlockDriver bdrv_ssh = {
     .bdrv_co_create               = ssh_co_create,
     .bdrv_co_create_opts          = ssh_co_create_opts,
     .bdrv_close                   = ssh_close,
-    .bdrv_has_zero_init           = ssh_has_zero_init,
-    .bdrv_has_zero_init_truncate  = ssh_has_zero_init,
+    .bdrv_known_zeroes            = ssh_known_zeroes,
     .bdrv_co_readv                = ssh_co_readv,
     .bdrv_co_writev               = ssh_co_writev,
     .bdrv_getlength               = ssh_getlength,
diff --git a/block/vdi.c b/block/vdi.c
index 0142da723315..df8f62624ccf 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -989,14 +989,14 @@ static void vdi_close(BlockDriverState *bs)
     error_free(s->migration_blocker);
 }

-static int vdi_has_zero_init(BlockDriverState *bs)
+static int vdi_known_zeroes(BlockDriverState *bs)
 {
     BDRVVdiState *s = bs->opaque;

     if (s->header.image_type == VDI_TYPE_STATIC) {
-        return bdrv_has_zero_init(bs->file->bs);
+        return bdrv_known_zeroes(bs->file->bs) & BDRV_ZERO_CREATE;
     } else {
-        return 1;
+        return BDRV_ZERO_CREATE;
     }
 }

@@ -1040,7 +1040,7 @@ static BlockDriver bdrv_vdi = {
     .bdrv_child_perm          = bdrv_format_default_perms,
     .bdrv_co_create      = vdi_co_create,
     .bdrv_co_create_opts = vdi_co_create_opts,
-    .bdrv_has_zero_init  = vdi_has_zero_init,
+    .bdrv_known_zeroes   = vdi_known_zeroes,
     .bdrv_co_block_status = vdi_co_block_status,
     .bdrv_make_empty = vdi_make_empty,

diff --git a/block/vhdx.c b/block/vhdx.c
index f02d2611bef8..4e8320c1b855 100644
--- a/block/vhdx.c
+++ b/block/vhdx.c
@@ -1365,7 +1365,7 @@ static coroutine_fn int vhdx_co_writev(BlockDriverState *bs, int64_t sector_num,
                 /* Queue another write of zero buffers if the underlying file
                  * does not zero-fill on file extension */

-                if (bdrv_has_zero_init_truncate(bs->file->bs) == 0) {
+                if (!(bdrv_known_zeroes(bs->file->bs) & BDRV_ZERO_TRUNCATE)) {
                     use_zero_buffers = true;

                     /* zero fill the front, if any */
@@ -1720,8 +1720,8 @@ static int vhdx_create_bat(BlockBackend *blk, BDRVVHDXState *s,
     }

     if (type == VHDX_TYPE_FIXED ||
-                use_zero_blocks ||
-                bdrv_has_zero_init(blk_bs(blk)) == 0) {
+        use_zero_blocks ||
+        !(bdrv_known_zeroes(blk_bs(blk)) & BDRV_ZERO_CREATE)) {
         /* for a fixed file, the default BAT entry is not zero */
         s->bat = g_try_malloc0(length);
         if (length && s->bat == NULL) {
@@ -2162,7 +2162,7 @@ static int coroutine_fn vhdx_co_check(BlockDriverState *bs,
     return 0;
 }

-static int vhdx_has_zero_init(BlockDriverState *bs)
+static int vhdx_known_zeroes(BlockDriverState *bs)
 {
     BDRVVHDXState *s = bs->opaque;
     int state;
@@ -2173,17 +2173,17 @@ static int vhdx_has_zero_init(BlockDriverState *bs)
      * therefore enough to check the first BAT entry.
      */
     if (!s->bat_entries) {
-        return 1;
+        return BDRV_ZERO_CREATE;
     }

     state = s->bat[0] & VHDX_BAT_STATE_BIT_MASK;
     if (state == PAYLOAD_BLOCK_FULLY_PRESENT) {
         /* Fixed subformat */
-        return bdrv_has_zero_init(bs->file->bs);
+        return bdrv_known_zeroes(bs->file->bs) & BDRV_ZERO_CREATE;
     }

     /* Dynamic subformat */
-    return 1;
+    return BDRV_ZERO_CREATE;
 }

 static QemuOptsList vhdx_create_opts = {
@@ -2239,7 +2239,7 @@ static BlockDriver bdrv_vhdx = {
     .bdrv_co_create_opts    = vhdx_co_create_opts,
     .bdrv_get_info          = vhdx_get_info,
     .bdrv_co_check          = vhdx_co_check,
-    .bdrv_has_zero_init     = vhdx_has_zero_init,
+    .bdrv_known_zeroes      = vhdx_known_zeroes,

     .create_opts            = &vhdx_create_opts,
 };
diff --git a/block/vmdk.c b/block/vmdk.c
index 20e909d99794..ca59f50413d2 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -2815,7 +2815,7 @@ static int64_t vmdk_get_allocated_file_size(BlockDriverState *bs)
     return ret;
 }

-static int vmdk_has_zero_init(BlockDriverState *bs)
+static int vmdk_known_zeroes(BlockDriverState *bs)
 {
     int i;
     BDRVVmdkState *s = bs->opaque;
@@ -2824,12 +2824,13 @@ static int vmdk_has_zero_init(BlockDriverState *bs)
      * return 0. */
     for (i = 0; i < s->num_extents; i++) {
         if (s->extents[i].flat) {
-            if (!bdrv_has_zero_init(s->extents[i].file->bs)) {
+            if (!(bdrv_known_zeroes(s->extents[i].file->bs) &
+                  BDRV_ZERO_CREATE)) {
                 return 0;
             }
         }
     }
-    return 1;
+    return BDRV_ZERO_CREATE;
 }

 static ImageInfo *vmdk_get_extent_info(VmdkExtent *extent)
@@ -3052,7 +3053,7 @@ static BlockDriver bdrv_vmdk = {
     .bdrv_co_flush_to_disk        = vmdk_co_flush,
     .bdrv_co_block_status         = vmdk_co_block_status,
     .bdrv_get_allocated_file_size = vmdk_get_allocated_file_size,
-    .bdrv_has_zero_init           = vmdk_has_zero_init,
+    .bdrv_known_zeroes            = vmdk_known_zeroes,
     .bdrv_get_specific_info       = vmdk_get_specific_info,
     .bdrv_refresh_limits          = vmdk_refresh_limits,
     .bdrv_get_info                = vmdk_get_info,
diff --git a/block/vpc.c b/block/vpc.c
index a65550298e19..f4741e07bfb2 100644
--- a/block/vpc.c
+++ b/block/vpc.c
@@ -1173,15 +1173,15 @@ fail:
 }


-static int vpc_has_zero_init(BlockDriverState *bs)
+static int vpc_known_zeroes(BlockDriverState *bs)
 {
     BDRVVPCState *s = bs->opaque;
     VHDFooter *footer =  (VHDFooter *) s->footer_buf;

     if (be32_to_cpu(footer->type) == VHD_FIXED) {
-        return bdrv_has_zero_init(bs->file->bs);
+        return bdrv_known_zeroes(bs->file->bs) & BDRV_ZERO_CREATE;
     } else {
-        return 1;
+        return BDRV_ZERO_CREATE;
     }
 }

@@ -1249,7 +1249,7 @@ static BlockDriver bdrv_vpc = {
     .bdrv_get_info          = vpc_get_info,

     .create_opts            = &vpc_create_opts,
-    .bdrv_has_zero_init     = vpc_has_zero_init,
+    .bdrv_known_zeroes      = vpc_known_zeroes,
     .strong_runtime_opts    = vpc_strong_runtime_opts,
 };

diff --git a/blockdev.c b/blockdev.c
index c6a727cca99d..90a17e7f7bce 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -4001,7 +4001,7 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)

     zero_target = (arg->sync == MIRROR_SYNC_MODE_FULL &&
                    (arg->mode == NEW_IMAGE_MODE_EXISTING ||
-                    !bdrv_has_zero_init(target_bs)));
+                    !(bdrv_known_zeroes(target_bs) & BDRV_ZERO_CREATE)));


     /* Honor bdrv_try_set_aio_context() context acquisition requirements. */
diff --git a/include/block/block.h b/include/block/block.h
index 6cd566324d95..a6a227f50678 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -85,6 +85,28 @@ typedef enum {
     BDRV_REQ_MASK               = 0x3ff,
 } BdrvRequestFlags;

+typedef enum {
+    /*
+     * bdrv_known_zeroes() should include this bit if the contents of
+     * a freshly-created image with no backing file reads as all
+     * zeroes without any additional effort.  If .bdrv_co_truncate is
+     * set, then this must be clear if BDRV_ZERO_TRUNCATE is clear.
+     * Since this bit is only reliable at image creation, a driver may
+     * return this bit even for existing images that do not currently
+     * read as zero.
+     */
+    BDRV_ZERO_CREATE        = 0x1,
+
+    /*
+     * bdrv_known_zeroes() should include this bit if growing an image
+     * with PREALLOC_MODE_OFF (either with no backing file, or beyond
+     * the size of the backing file) will read the new data as all
+     * zeroes without any additional effort.  This bit only matters
+     * for drivers that set .bdrv_co_truncate.
+     */
+    BDRV_ZERO_TRUNCATE      = 0x2,
+} BdrvZeroFlags;
+
 typedef struct BlockSizes {
     uint32_t phys;
     uint32_t log;
@@ -430,9 +452,9 @@ void bdrv_drain_all(void);

 int bdrv_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes);
 int bdrv_co_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes);
-int bdrv_has_zero_init_1(BlockDriverState *bs);
-int bdrv_has_zero_init(BlockDriverState *bs);
-int bdrv_has_zero_init_truncate(BlockDriverState *bs);
+int bdrv_known_zeroes_create(BlockDriverState *bs);
+int bdrv_known_zeroes_truncate(BlockDriverState *bs);
+int bdrv_known_zeroes(BlockDriverState *bs);
 bool bdrv_unallocated_blocks_are_zero(BlockDriverState *bs);
 bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
 int bdrv_block_status(BlockDriverState *bs, int64_t offset,
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 77ab45dc87cf..47b34860bf95 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -441,19 +441,8 @@ struct BlockDriver {

     void (*bdrv_refresh_limits)(BlockDriverState *bs, Error **errp);

-    /*
-     * Returns 1 if newly created images are guaranteed to contain only
-     * zeros, 0 otherwise.
-     * Must return 0 if .bdrv_co_truncate is set and
-     * .bdrv_has_zero_init_truncate() returns 0.
-     */
-    int (*bdrv_has_zero_init)(BlockDriverState *bs);
-
-    /*
-     * Returns 1 if new areas added by growing the image with
-     * PREALLOC_MODE_OFF contain only zeros, 0 otherwise.
-     */
-    int (*bdrv_has_zero_init_truncate)(BlockDriverState *bs);
+    /* Returns bitwise-OR of BdrvZeroFlags. */
+    int (*bdrv_known_zeroes)(BlockDriverState *bs);

     /* Remove fd handlers, timers, and other event loop callbacks so the event
      * loop is no longer in use.  Called with no in-flight requests and in
diff --git a/qemu-img.c b/qemu-img.c
index e0bfc33ef4f6..e60217e6c382 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1987,7 +1987,8 @@ static int convert_do_copy(ImgConvertState *s)
     /* Check whether we have zero initialisation or can get it efficiently */
     if (!s->has_zero_init && s->target_is_new && s->min_sparse &&
         !s->target_has_backing) {
-        s->has_zero_init = bdrv_has_zero_init(blk_bs(s->target));
+        s->has_zero_init = !!(bdrv_known_zeroes(blk_bs(s->target)) &
+                              BDRV_ZERO_CREATE);
     }

     if (!s->has_zero_init && !s->target_has_backing &&
diff --git a/tests/qemu-iotests/122 b/tests/qemu-iotests/122
index dfa350936fe6..7cb09309948f 100755
--- a/tests/qemu-iotests/122
+++ b/tests/qemu-iotests/122
@@ -267,7 +267,7 @@ echo
 # Keep source zero
 _make_test_img 64M

-# Output is not zero, but has bdrv_has_zero_init() == 1
+# Output is not zero, but has bdrv_known_zeroes() including BDRV_ZERO_CREATE
 TEST_IMG="$TEST_IMG".orig _make_test_img 64M
 $QEMU_IO -c "write -P 42 0 64k" "$TEST_IMG".orig | _filter_qemu_io

diff --git a/tests/qemu-iotests/188 b/tests/qemu-iotests/188
index afca44df5427..9656969fef4a 100755
--- a/tests/qemu-iotests/188
+++ b/tests/qemu-iotests/188
@@ -71,7 +71,7 @@ $QEMU_IO --object $SECRETALT -c "read -P 0xa 0 $size" --image-opts $IMGSPEC | _f
 _cleanup_test_img

 echo
-echo "== verify that has_zero_init returns false when preallocating =="
+echo "== verify that known_zeroes returns 0 when preallocating =="

 # Empty source file
 if [ -n "$TEST_IMG_FILE" ]; then
diff --git a/tests/qemu-iotests/188.out b/tests/qemu-iotests/188.out
index c568ef370145..f7da30440c65 100644
--- a/tests/qemu-iotests/188.out
+++ b/tests/qemu-iotests/188.out
@@ -16,7 +16,7 @@ read 16777216/16777216 bytes at offset 0
 == verify open failure with wrong password ==
 qemu-io: can't open: Invalid password, cannot unlock any keyslot

-== verify that has_zero_init returns false when preallocating ==
+== verify that known_zeroes returns 0 when preallocating ==
 Formatting 'TEST_DIR/t.IMGFMT.orig', fmt=IMGFMT size=16777216
 Images are identical.
 *** done
-- 
2.24.1




* [PATCH 10/17] block: Add new BDRV_ZERO_OPEN flag
  2020-01-31 17:44 [PATCH 00/17] Improve qcow2 all-zero detection Eric Blake
                   ` (8 preceding siblings ...)
  2020-01-31 17:44 ` [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate} Eric Blake
@ 2020-01-31 17:44 ` Eric Blake
  2020-01-31 18:03   ` Eric Blake
  2020-02-04 17:34   ` Max Reitz
  2020-01-31 17:44 ` [PATCH 11/17] file-posix: Support BDRV_ZERO_OPEN Eric Blake
                   ` (8 subsequent siblings)
  18 siblings, 2 replies; 73+ messages in thread
From: Eric Blake @ 2020-01-31 17:44 UTC (permalink / raw)
  To: qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block, mreitz

Knowing that a file reads as all zeroes when created is useful, but
limited in scope to drivers that can create images.  However, there
are also situations where pre-existing images can quickly be
determined to read as all zeroes, even when the image was not just
created by the same process.  The optimization used in qemu-img
convert to avoid a pre-zeroing pass on the destination is just as
useful in such a scenario.  As such, it is worth the block layer
adding another bit to bdrv_known_zeroes().

Note that while BDRV_ZERO_CREATE cannot chase through backing layers
(because it only applies at creation time, but the backing layer was
not created at the same time as the active layer being created), it IS
okay for BDRV_ZERO_OPEN to chase through layers (as long as all layers
currently read as zero, the image reads as zero).
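
The layering rule can be modeled in a few lines of standalone C.  This
is a toy sketch, not QEMU code: struct sketch_layer and
sketch_known_zeroes() are hypothetical names, and the model leaves out
encryption, filter drivers, and drivers without a callback.  The mask
logic is meant to follow the block.c hunk in this patch.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    BDRV_ZERO_CREATE   = 0x1,
    BDRV_ZERO_TRUNCATE = 0x2,
    BDRV_ZERO_OPEN     = 0x4,
};

/* Hypothetical, simplified stand-in for one layer of a backing chain. */
struct sketch_layer {
    int driver_bits;                    /* what the driver callback reports */
    int64_t length;                     /* negative models a failed size query */
    const struct sketch_layer *backing; /* NULL when there is no backing file */
};

static int sketch_known_zeroes(const struct sketch_layer *l)
{
    int mask = BDRV_ZERO_CREATE | BDRV_ZERO_TRUNCATE | BDRV_ZERO_OPEN;

    if (l->backing) {
        /*
         * ZERO_CREATE never survives a backing file.  ZERO_OPEN
         * survives only if the backing chain reports it as well.
         */
        mask = sketch_known_zeroes(l->backing) & BDRV_ZERO_OPEN;

        /* ZERO_TRUNCATE needs this layer to be at least as long. */
        if (l->backing->length >= 0 && l->length >= l->backing->length) {
            mask |= BDRV_ZERO_TRUNCATE;
        }
    }
    return l->driver_bits & mask;
}
```

In this model a single backing layer that cannot report ZERO_OPEN
clears the bit for every layer above it, while ZERO_TRUNCATE is decided
per layer from the two lengths.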

Upcoming patches will update the qcow2, file-posix, and nbd drivers to
advertise the new bit when appropriate.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block.c               | 12 ++++++------
 include/block/block.h | 10 ++++++++++
 qemu-img.c            | 10 ++++++----
 3 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/block.c b/block.c
index fac0813140aa..d68f527dc41f 100644
--- a/block.c
+++ b/block.c
@@ -5078,7 +5078,7 @@ int bdrv_known_zeroes_truncate(BlockDriverState *bs)

 int bdrv_known_zeroes(BlockDriverState *bs)
 {
-    int mask = BDRV_ZERO_CREATE | BDRV_ZERO_TRUNCATE;
+    int mask = BDRV_ZERO_CREATE | BDRV_ZERO_TRUNCATE | BDRV_ZERO_OPEN;

     if (!bs->drv) {
         return 0;
@@ -5100,17 +5100,17 @@ int bdrv_known_zeroes(BlockDriverState *bs)
      * ZERO_CREATE is not viable.  If the current layer is smaller
      * than the backing layer, truncation may expose backing data,
      * restricting ZERO_TRUNCATE; treat failure to query size in the
-     * same manner.  Otherwise, we can trust the driver.
+     * same manner.  For ZERO_OPEN, we insist that both backing and
+     * current layer report the bit.
      */
-
     if (bs->backing) {
         int64_t back = bdrv_getlength(bs->backing->bs);
         int64_t curr = bdrv_getlength(bs);

-        if (back < 0 || curr < back) {
-            return 0;
+        mask = bdrv_known_zeroes(bs->backing->bs) & BDRV_ZERO_OPEN;
+        if (back >= 0 && curr >= back) {
+            mask |= BDRV_ZERO_TRUNCATE;
         }
-        mask = BDRV_ZERO_TRUNCATE;
     }

     if (bs->drv->bdrv_known_zeroes) {
diff --git a/include/block/block.h b/include/block/block.h
index a6a227f50678..dafb8cc2bd80 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -105,6 +105,16 @@ typedef enum {
      * for drivers that set .bdrv_co_truncate.
      */
     BDRV_ZERO_TRUNCATE      = 0x2,
+
+    /*
+     * bdrv_known_zeroes() should include this bit if an image is
+     * known to read as all zeroes when first opened; this bit should
+     * not be relied on after any writes to the image.  This can be
+     * set even if BDRV_ZERO_CREATE is clear, but should only be set if
+     * making the determination is more efficient than looping over
+     * block status for the image.
+     */
+    BDRV_ZERO_OPEN          = 0x4,
 } BdrvZeroFlags;

 typedef struct BlockSizes {
diff --git a/qemu-img.c b/qemu-img.c
index e60217e6c382..c8519a74f738 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1985,10 +1985,12 @@ static int convert_do_copy(ImgConvertState *s)
     int64_t sector_num = 0;

     /* Check whether we have zero initialisation or can get it efficiently */
-    if (!s->has_zero_init && s->target_is_new && s->min_sparse &&
-        !s->target_has_backing) {
-        s->has_zero_init = !!(bdrv_known_zeroes(blk_bs(s->target)) &
-                              BDRV_ZERO_CREATE);
+    if (!s->has_zero_init && s->min_sparse && !s->target_has_backing) {
+        ret = bdrv_known_zeroes(blk_bs(s->target));
+        if (ret & BDRV_ZERO_OPEN ||
+            (s->target_is_new && ret & BDRV_ZERO_CREATE)) {
+            s->has_zero_init = true;
+        }
     }

     if (!s->has_zero_init && !s->target_has_backing &&
-- 
2.24.1




* [PATCH 11/17] file-posix: Support BDRV_ZERO_OPEN
  2020-01-31 17:44 [PATCH 00/17] Improve qcow2 all-zero detection Eric Blake
                   ` (9 preceding siblings ...)
  2020-01-31 17:44 ` [PATCH 10/17] block: Add new BDRV_ZERO_OPEN flag Eric Blake
@ 2020-01-31 17:44 ` Eric Blake
  2020-01-31 17:44 ` [PATCH 12/17] gluster: " Eric Blake
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 73+ messages in thread
From: Eric Blake @ 2020-01-31 17:44 UTC (permalink / raw)
  To: qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block, mreitz

A single lseek(SEEK_DATA) is sufficient to tell us if a raw file is
completely sparse, in which case it reads as all zeroes.  Not done
here, but possible extension for the future: when working with block
devices instead of files, there may be various ways with ioctl or
similar to quickly probe if a given block device is known to be
completely unmapped where unmapped regions read as zero.  But for now,
block devices remain without a .bdrv_known_zeroes, because most block
devices have random content without an explicit pre-zeroing pass.
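
For readers unfamiliar with the probe, the idea can be shown outside of
QEMU with a plain file descriptor.  This is a minimal sketch under
stated assumptions: sketch_file_is_all_holes() is a hypothetical helper,
not the find_allocation() function the patch actually calls, and it
assumes a platform where lseek() offers SEEK_DATA (otherwise it reports
"unknown").

```c
#define _GNU_SOURCE             /* glibc exposes SEEK_DATA under this macro */
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper:
 *   1  the file has no data extents at all, so it reads as all zeroes
 *   0  at least one data extent exists
 *  -1  unknown (SEEK_DATA unavailable, or lseek() failed for another reason)
 */
static int sketch_file_is_all_holes(int fd)
{
#ifdef SEEK_DATA
    off_t data = lseek(fd, 0, SEEK_DATA);

    if (data >= 0) {
        return 0;               /* found data at or after offset 0 */
    }
    if (errno == ENXIO) {
        return 1;               /* no data between offset 0 and EOF */
    }
#else
    (void)fd;
#endif
    return -1;
}
```

A filesystem that does not track holes typically reports the whole file
as data, so such a probe can miss an all-zero file but should not claim
zeroes where data extents exist.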

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/file-posix.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index ff9e39ab882f..b4d73dd0363b 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2541,6 +2541,19 @@ static int find_allocation(BlockDriverState *bs, off_t start,
 #endif
 }

+static int raw_known_zeroes(BlockDriverState *bs)
+{
+    /* This callback is only installed for files, not block devices. */
+    int r = BDRV_ZERO_CREATE | BDRV_ZERO_TRUNCATE;
+    off_t data, hole;
+
+    if (find_allocation(bs, 0, &data, &hole) == -ENXIO) {
+        r |= BDRV_ZERO_OPEN;
+    }
+
+    return r;
+}
+
 /*
  * Returns the allocation status of the specified offset.
  *
@@ -3071,7 +3084,7 @@ BlockDriver bdrv_file = {
     .bdrv_close = raw_close,
     .bdrv_co_create = raw_co_create,
     .bdrv_co_create_opts = raw_co_create_opts,
-    .bdrv_known_zeroes = bdrv_known_zeroes_truncate,
+    .bdrv_known_zeroes = raw_known_zeroes,
     .bdrv_co_block_status = raw_co_block_status,
     .bdrv_co_invalidate_cache = raw_co_invalidate_cache,
     .bdrv_co_pwrite_zeroes = raw_co_pwrite_zeroes,
-- 
2.24.1




* [PATCH 12/17] gluster: Support BDRV_ZERO_OPEN
  2020-01-31 17:44 [PATCH 00/17] Improve qcow2 all-zero detection Eric Blake
                   ` (10 preceding siblings ...)
  2020-01-31 17:44 ` [PATCH 11/17] file-posix: Support BDRV_ZERO_OPEN Eric Blake
@ 2020-01-31 17:44 ` Eric Blake
  2020-02-17  8:16   ` [GEDI] " Niels de Vos
  2020-01-31 17:44 ` [PATCH 13/17] qcow2: Add new autoclear feature for all zero image Eric Blake
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 73+ messages in thread
From: Eric Blake @ 2020-01-31 17:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: david.edmondson, Kevin Wolf, open list:GLUSTER, qemu-block, mreitz

Since gluster already copies file-posix for lseek usage in block
status, it also makes sense to copy it for learning if the image
currently reads as all zeroes.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/gluster.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/block/gluster.c b/block/gluster.c
index 9d952c70981b..0417a86547c8 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -1464,6 +1464,22 @@ exit:
     return -ENOTSUP;
 }

+static int qemu_gluster_known_zeroes(BlockDriverState *bs)
+{
+    /*
+     * GlusterFS volume could be backed by a block device, with no way
+     * to query if regions added by creation or truncation will read
+     * as zeroes.  However, we can use lseek(SEEK_DATA) to check if
+     * contents currently read as zero.
+     */
+    off_t data, hole;
+
+    if (find_allocation(bs, 0, &data, &hole) == -ENXIO) {
+        return BDRV_ZERO_OPEN;
+    }
+    return 0;
+}
+
 /*
  * Returns the allocation status of the specified offset.
  *
@@ -1561,6 +1577,7 @@ static BlockDriver bdrv_gluster = {
     .bdrv_co_readv                = qemu_gluster_co_readv,
     .bdrv_co_writev               = qemu_gluster_co_writev,
     .bdrv_co_flush_to_disk        = qemu_gluster_co_flush_to_disk,
+    .bdrv_known_zeroes            = qemu_gluster_known_zeroes,
 #ifdef CONFIG_GLUSTERFS_DISCARD
     .bdrv_co_pdiscard             = qemu_gluster_co_pdiscard,
 #endif
@@ -1591,6 +1608,7 @@ static BlockDriver bdrv_gluster_tcp = {
     .bdrv_co_readv                = qemu_gluster_co_readv,
     .bdrv_co_writev               = qemu_gluster_co_writev,
     .bdrv_co_flush_to_disk        = qemu_gluster_co_flush_to_disk,
+    .bdrv_known_zeroes            = qemu_gluster_known_zeroes,
 #ifdef CONFIG_GLUSTERFS_DISCARD
     .bdrv_co_pdiscard             = qemu_gluster_co_pdiscard,
 #endif
@@ -1621,6 +1639,7 @@ static BlockDriver bdrv_gluster_unix = {
     .bdrv_co_readv                = qemu_gluster_co_readv,
     .bdrv_co_writev               = qemu_gluster_co_writev,
     .bdrv_co_flush_to_disk        = qemu_gluster_co_flush_to_disk,
+    .bdrv_known_zeroes            = qemu_gluster_known_zeroes,
 #ifdef CONFIG_GLUSTERFS_DISCARD
     .bdrv_co_pdiscard             = qemu_gluster_co_pdiscard,
 #endif
@@ -1657,6 +1676,7 @@ static BlockDriver bdrv_gluster_rdma = {
     .bdrv_co_readv                = qemu_gluster_co_readv,
     .bdrv_co_writev               = qemu_gluster_co_writev,
     .bdrv_co_flush_to_disk        = qemu_gluster_co_flush_to_disk,
+    .bdrv_known_zeroes            = qemu_gluster_known_zeroes,
 #ifdef CONFIG_GLUSTERFS_DISCARD
     .bdrv_co_pdiscard             = qemu_gluster_co_pdiscard,
 #endif
-- 
2.24.1




* [PATCH 13/17] qcow2: Add new autoclear feature for all zero image
  2020-01-31 17:44 [PATCH 00/17] Improve qcow2 all-zero detection Eric Blake
                   ` (11 preceding siblings ...)
  2020-01-31 17:44 ` [PATCH 12/17] gluster: " Eric Blake
@ 2020-01-31 17:44 ` Eric Blake
  2020-02-03 17:45   ` Vladimir Sementsov-Ogievskiy
  2020-01-31 17:44 ` [PATCH 14/17] qcow2: Expose all zero bit through .bdrv_known_zeroes Eric Blake
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 73+ messages in thread
From: Eric Blake @ 2020-01-31 17:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: david.edmondson, Kevin Wolf, Markus Armbruster, qemu-block, mreitz

With the recent introduction of BDRV_ZERO_OPEN, we can optimize
various qemu-img operations if we know the destination starts life
with all zero content.  For an image with no cluster allocations and
no backing file, this was already trivial with BDRV_ZERO_CREATE; but
for a fully preallocated image, it does not scale to crawl through the
entire L1/L2 tree to see if every cluster is currently marked as a
zero cluster.  But it is quite easy to add an autoclear bit to the
qcow2 file itself: the bit will be set after newly creating an image
or after qcow2_make_empty, and cleared on any other modification
(including by an older qemu that doesn't recognize the bit).

This patch documents the new bit, independently of implementing the
places in code that should set it (which means that for bisection
purposes, it is safer to still mask the bit out when opening an image
with the bit set).
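
To illustrate the masking described above, here is a minimal sketch
(not QEMU code) of how an autoclear field behaves when an image is
opened for writing.  It assumes the usual qcow2 rule that a writer
drops autoclear bits it does not recognize; the function name
autoclear_after_rw_open() is hypothetical, while the bit positions
follow the block/qcow2.h hunk of this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions as listed in the block/qcow2.h hunk of this patch. */
enum {
    AUTOCLEAR_BITMAPS       = 1 << 0,
    AUTOCLEAR_DATA_FILE_RAW = 1 << 1,
    AUTOCLEAR_ALL_ZERO      = 1 << 2,
};

/*
 * Hypothetical model of a read-write open: any autoclear bit outside
 * known_mask is dropped before the image is modified.
 */
static uint64_t autoclear_after_rw_open(uint64_t on_disk, uint64_t known_mask)
{
    return on_disk & known_mask;
}
```

With a mask of BITMAPS | DATA_FILE_RAW (the mask this patch leaves in
place), an on-disk ALL_ZERO bit is cleared, which is the same outcome
as with an older qemu that has never heard of the bit.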

A few iotests have updated output due to the larger number of named
header features.
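
The size change in the test output can be checked by hand.  The sketch
below assumes the feature name table layout from docs/interop/qcow2.txt
(one byte of type, one byte of bit number, 46 bytes of name); the
struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One feature name table entry, per docs/interop/qcow2.txt. */
struct feature_name_entry {
    uint8_t type;       /* 0 incompatible, 1 compatible, 2 autoclear */
    uint8_t bit;        /* bit number within that feature field */
    char    name[46];   /* feature name, padded with zeroes */
};

/* Length in bytes of a feature name table holding 'entries' names. */
static size_t feature_table_length(size_t entries)
{
    return entries * sizeof(struct feature_name_entry);
}
```

Each entry is 48 bytes, so the old length of 288 corresponds to six
named features and the new length of 336 to seven.  The same 48 bytes
explain backing_file_offset moving from 0x1d8 to 0x208.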

Signed-off-by: Eric Blake <eblake@redhat.com>

---
RFC: As implemented in this patch, the bit must be clear if any
cluster defers to a backing file. But the block layer would handle
things just fine if we instead allowed the bit to be set if all
clusters allocated in this image are zero, even if there are other
clusters not allocated.  Or maybe we want TWO bits: one if all
clusters allocated here are known zero, and a second if we know that
any clusters defer to a backing image.
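
For discussion only, the two-bit alternative could be evaluated over a
backing chain roughly as follows.  Everything here is hypothetical: the
struct, its field names, and chain_reads_all_zero() do not exist in
QEMU, and the second bit is read as "some cluster in this layer defers
to the backing image".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-layer summary of the two proposed bits. */
struct layer {
    bool allocated_all_zero;        /* all clusters allocated here read as zero */
    bool defers_to_backing;         /* some cluster is left to the backing image */
    const struct layer *backing;    /* NULL when there is no backing file */
};

/*
 * The chain reads as all zeroes if every layer that is still visible
 * has only zero clusters.  A layer that defers nothing hides everything
 * beneath it, so the walk can stop there.
 */
static bool chain_reads_all_zero(const struct layer *l)
{
    for (; l != NULL; l = l->backing) {
        if (!l->allocated_all_zero) {
            return false;
        }
        if (!l->defers_to_backing) {
            return true;
        }
    }
    return true;    /* unallocated clusters without backing read as zero */
}
```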
---
 block/qcow2.c              |  9 +++++++++
 block/qcow2.h              |  3 +++
 docs/interop/qcow2.txt     | 12 +++++++++++-
 qapi/block-core.json       |  4 ++++
 tests/qemu-iotests/031.out |  8 ++++----
 tests/qemu-iotests/036.out |  4 ++--
 tests/qemu-iotests/061.out | 14 +++++++-------
 7 files changed, 40 insertions(+), 14 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 9f2371925737..20cce9410c84 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2859,6 +2859,11 @@ int qcow2_update_header(BlockDriverState *bs)
                 .bit  = QCOW2_AUTOCLEAR_DATA_FILE_RAW_BITNR,
                 .name = "raw external data",
             },
+            {
+                .type = QCOW2_FEAT_TYPE_AUTOCLEAR,
+                .bit  = QCOW2_AUTOCLEAR_ALL_ZERO_BITNR,
+                .name = "all zero",
+            },
         };

         ret = header_ext_add(buf, QCOW2_EXT_MAGIC_FEATURE_TABLE,
@@ -4874,6 +4879,10 @@ static ImageInfoSpecific *qcow2_get_specific_info(BlockDriverState *bs,
             .corrupt            = s->incompatible_features &
                                   QCOW2_INCOMPAT_CORRUPT,
             .has_corrupt        = true,
+            .all_zero           = s->autoclear_features &
+                                  QCOW2_AUTOCLEAR_ALL_ZERO,
+            .has_all_zero       = s->autoclear_features &
+                                  QCOW2_AUTOCLEAR_ALL_ZERO,
             .refcount_bits      = s->refcount_bits,
             .has_bitmaps        = !!bitmaps,
             .bitmaps            = bitmaps,
diff --git a/block/qcow2.h b/block/qcow2.h
index 094212623257..6fc2d323d753 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -237,11 +237,14 @@ enum {
 enum {
     QCOW2_AUTOCLEAR_BITMAPS_BITNR       = 0,
     QCOW2_AUTOCLEAR_DATA_FILE_RAW_BITNR = 1,
+    QCOW2_AUTOCLEAR_ALL_ZERO_BITNR      = 2,
     QCOW2_AUTOCLEAR_BITMAPS             = 1 << QCOW2_AUTOCLEAR_BITMAPS_BITNR,
     QCOW2_AUTOCLEAR_DATA_FILE_RAW       = 1 << QCOW2_AUTOCLEAR_DATA_FILE_RAW_BITNR,
+    QCOW2_AUTOCLEAR_ALL_ZERO            = 1 << QCOW2_AUTOCLEAR_ALL_ZERO_BITNR,

     QCOW2_AUTOCLEAR_MASK                = QCOW2_AUTOCLEAR_BITMAPS
                                         | QCOW2_AUTOCLEAR_DATA_FILE_RAW,
+    /* TODO: Add _ALL_ZERO to _MASK once it is handled correctly */
 };

 enum qcow2_discard_type {
diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
index 8510d74c8079..d435363a413c 100644
--- a/docs/interop/qcow2.txt
+++ b/docs/interop/qcow2.txt
@@ -153,7 +153,17 @@ in the description of a field.
                                 File bit (incompatible feature bit 1) is also
                                 set.

-                    Bits 2-63:  Reserved (set to 0)
+                    Bit 2:      All zero image bit
+                                If this bit is set, the entire image reads
+                                as all zeroes. This can be useful for
+                                detecting just-created images even when
+                                clusters are preallocated, which in turn
+                                can be used to optimize image copying.
+
+                                This bit should not be set if any cluster
+                                in the image defers to a backing file.
+
+                    Bits 3-63:  Reserved (set to 0)

          96 -  99:  refcount_order
                     Describes the width of a reference count block entry (width
diff --git a/qapi/block-core.json b/qapi/block-core.json
index ef94a296868f..af837ed5af33 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -71,6 +71,9 @@
 # @corrupt: true if the image has been marked corrupt; only valid for
 #           compat >= 1.1 (since 2.2)
 #
+# @all-zero: present and true only if the image is known to read as all
+#            zeroes (since 5.0)
+#
 # @refcount-bits: width of a refcount entry in bits (since 2.3)
 #
 # @encrypt: details about encryption parameters; only set if image
@@ -87,6 +90,7 @@
       '*data-file-raw': 'bool',
       '*lazy-refcounts': 'bool',
       '*corrupt': 'bool',
+      '*all-zero': 'bool',
       'refcount-bits': 'int',
       '*encrypt': 'ImageInfoSpecificQCow2Encryption',
       '*bitmaps': ['Qcow2BitmapInfo']
diff --git a/tests/qemu-iotests/031.out b/tests/qemu-iotests/031.out
index 46f97c5a4ea4..bb1afa7b87f6 100644
--- a/tests/qemu-iotests/031.out
+++ b/tests/qemu-iotests/031.out
@@ -117,7 +117,7 @@ header_length             104

 Header extension:
 magic                     0x6803f857
-length                    288
+length                    336
 data                      <binary>

 Header extension:
@@ -150,7 +150,7 @@ header_length             104

 Header extension:
 magic                     0x6803f857
-length                    288
+length                    336
 data                      <binary>

 Header extension:
@@ -164,7 +164,7 @@ No errors were found on the image.

 magic                     0x514649fb
 version                   3
-backing_file_offset       0x1d8
+backing_file_offset       0x208
 backing_file_size         0x17
 cluster_bits              16
 size                      67108864
@@ -188,7 +188,7 @@ data                      'host_device'

 Header extension:
 magic                     0x6803f857
-length                    288
+length                    336
 data                      <binary>

 Header extension:
diff --git a/tests/qemu-iotests/036.out b/tests/qemu-iotests/036.out
index 23b699ce0622..e409acf60e2b 100644
--- a/tests/qemu-iotests/036.out
+++ b/tests/qemu-iotests/036.out
@@ -26,7 +26,7 @@ compatible_features       []
 autoclear_features        [63]
 Header extension:
 magic                     0x6803f857
-length                    288
+length                    336
 data                      <binary>


@@ -38,7 +38,7 @@ compatible_features       []
 autoclear_features        []
 Header extension:
 magic                     0x6803f857
-length                    288
+length                    336
 data                      <binary>

 *** done
diff --git a/tests/qemu-iotests/061.out b/tests/qemu-iotests/061.out
index 413cc4e0f4ab..d873f79bb606 100644
--- a/tests/qemu-iotests/061.out
+++ b/tests/qemu-iotests/061.out
@@ -26,7 +26,7 @@ header_length             104

 Header extension:
 magic                     0x6803f857
-length                    288
+length                    336
 data                      <binary>

 magic                     0x514649fb
@@ -84,7 +84,7 @@ header_length             104

 Header extension:
 magic                     0x6803f857
-length                    288
+length                    336
 data                      <binary>

 magic                     0x514649fb
@@ -140,7 +140,7 @@ header_length             104

 Header extension:
 magic                     0x6803f857
-length                    288
+length                    336
 data                      <binary>

 ERROR cluster 5 refcount=0 reference=1
@@ -195,7 +195,7 @@ header_length             104

 Header extension:
 magic                     0x6803f857
-length                    288
+length                    336
 data                      <binary>

 magic                     0x514649fb
@@ -264,7 +264,7 @@ header_length             104

 Header extension:
 magic                     0x6803f857
-length                    288
+length                    336
 data                      <binary>

 read 65536/65536 bytes at offset 44040192
@@ -298,7 +298,7 @@ header_length             104

 Header extension:
 magic                     0x6803f857
-length                    288
+length                    336
 data                      <binary>

 ERROR cluster 5 refcount=0 reference=1
@@ -327,7 +327,7 @@ header_length             104

 Header extension:
 magic                     0x6803f857
-length                    288
+length                    336
 data                      <binary>

 read 131072/131072 bytes at offset 0
-- 
2.24.1




* [PATCH 14/17] qcow2: Expose all zero bit through .bdrv_known_zeroes
  2020-01-31 17:44 [PATCH 00/17] Improve qcow2 all-zero detection Eric Blake
                   ` (12 preceding siblings ...)
  2020-01-31 17:44 ` [PATCH 13/17] qcow2: Add new autoclear feature for all zero image Eric Blake
@ 2020-01-31 17:44 ` Eric Blake
  2020-01-31 17:44 ` [PATCH 15/17] qcow2: Implement all-zero autoclear bit Eric Blake
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 73+ messages in thread
From: Eric Blake @ 2020-01-31 17:44 UTC (permalink / raw)
  To: qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block, mreitz

Now that qcow2 images have a way to track when the contents are known
to be all zero, it is worth exposing this to clients such as qemu-img
convert.  (Of course, until the next patch wires up qcow2 to actually
set the bit, this patch has no immediate effect; however, keeping it
as a separate patch allows for an easier revert when testing if the
bit makes a difference in qemu-img behavior).
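
A simplified view of how a client such as qemu-img convert can consume
the reported flags, modelled on the convert hunk earlier in this
series.  The flag names follow the series, but the numeric values and
the helper target_known_zero() are placeholders, and the real code also
checks min_sparse and the absence of a backing file.

```c
#include <assert.h>
#include <stdbool.h>

/* Names follow the series; these numeric values are placeholders. */
enum {
    ZERO_CREATE   = 1 << 0,     /* reads as zero right after creation */
    ZERO_TRUNCATE = 1 << 1,     /* regions added by truncation read as zero */
    ZERO_OPEN     = 1 << 2,     /* current contents read as zero */
};

/*
 * Hypothetical helper: may the copy skip an explicit zeroing pass?
 * ZERO_OPEN is enough on its own; ZERO_CREATE only helps when the
 * target was created by this very operation.
 */
static bool target_known_zero(int known, bool target_is_new)
{
    return (known & ZERO_OPEN) || (target_is_new && (known & ZERO_CREATE));
}
```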

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/qcow2.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index 20cce9410c84..3f61d806a14b 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -4938,6 +4938,9 @@ static int qcow2_known_zeroes(BlockDriverState *bs)
      * therefore enough to check the first one.
      */
     preallocated = s->l1_size > 0 && s->l1_table[0] != 0;
+    if (s->autoclear_features & QCOW2_AUTOCLEAR_ALL_ZERO) {
+        r |= BDRV_ZERO_OPEN;
+    }
     if (qemu_in_coroutine()) {
         qemu_co_mutex_unlock(&s->lock);
     }
-- 
2.24.1




* [PATCH 15/17] qcow2: Implement all-zero autoclear bit
  2020-01-31 17:44 [PATCH 00/17] Improve qcow2 all-zero detection Eric Blake
                   ` (13 preceding siblings ...)
  2020-01-31 17:44 ` [PATCH 14/17] qcow2: Expose all zero bit through .bdrv_known_zeroes Eric Blake
@ 2020-01-31 17:44 ` Eric Blake
  2020-01-31 17:44 ` [PATCH 16/17] iotests: Add new test for qcow2 all-zero bit Eric Blake
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 73+ messages in thread
From: Eric Blake @ 2020-01-31 17:44 UTC (permalink / raw)
  To: qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block, mreitz

Wire up the autoclear bit defined earlier in this series. When we
create an image or clear it with .bdrv_make_empty, we know that it
reads as all zeroes.  Reading an image does not change the previous
status, nor does writing zeroes, trimming (because we specifically set
trimmed clusters to read as zero), or resize (because the new length
reads as zero).  This leaves normal writes, data copies, snapshot
reverts, and altering the backing file that can change the status.
Furthermore, it is not safe to claim that an encrypted image or an
image with a backing file reads as all zeroes.
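
The classification in the paragraph above can be written down as a
small table.  This is an illustrative sketch, not the patch itself; the
enum and op_keeps_all_zero() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the operations named in the commit message. */
enum img_op {
    OP_READ,
    OP_WRITE_ZEROES,
    OP_DISCARD,
    OP_RESIZE,
    OP_WRITE_DATA,
    OP_COPY_RANGE,
    OP_SNAPSHOT_REVERT,
    OP_CHANGE_BACKING,
};

/* Returns true if the operation leaves an all-zero image all zero. */
static bool op_keeps_all_zero(enum img_op op)
{
    switch (op) {
    case OP_READ:
    case OP_WRITE_ZEROES:
    case OP_DISCARD:        /* trimmed clusters are set to read as zero */
    case OP_RESIZE:         /* the new length reads as zero */
        return true;
    case OP_WRITE_DATA:
    case OP_COPY_RANGE:
    case OP_SNAPSHOT_REVERT:
    case OP_CHANGE_BACKING:
        return false;
    }
    return false;           /* unknown operations are treated as unsafe */
}
```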

Implementation-wise, we clear the bit from the file on the first
modification, and then rewrite it when marking the image clean.  Some
callers want to rewrite it (to either set or clear), while others want
to preserve the current value; the modifications to qcow2_mark_clean
make it easier to consolidate the logic for when setting the bit is
safe.
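
The tri-state argument can be summarised in isolation.  The sketch
below mirrors the first few lines added to qcow2_mark_clean in this
patch, but resolve_all_zero() and its flattened parameters are
hypothetical stand-ins for the driver state.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * requested: 1 to set the bit, 0 to clear it, -1 to keep the cached value.
 * The result is vetoed whenever claiming all-zero would be unsafe:
 * a backing file, encryption, or a version 2 image without autoclear bits.
 */
static bool resolve_all_zero(int requested, bool cached, bool has_backing,
                             bool encrypted, int version)
{
    bool all_zero;

    assert(requested >= -1 && requested <= 1);
    all_zero = (requested == -1) ? cached : (requested == 1);
    if (has_backing || encrypted || version < 3) {
        all_zero = false;
    }
    return all_zero;
}
```

Keeping the veto in one place means callers such as make_completely_empty
can pass 1 unconditionally and still get the safe answer.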

A number of iotests have altered output, in situations where we have a
provably zero image at that point in the test.

Later, we may want to wire further checks into qemu-img check that
validate whether the bit is set correctly, and/or set the bit in images
where it would be valid, but I did not do that here.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/qcow2-snapshot.c     | 11 +++++
 block/qcow2.c              | 97 ++++++++++++++++++++++++++++++++++----
 block/qcow2.h              |  5 +-
 tests/qemu-iotests/031.out |  6 +--
 tests/qemu-iotests/036.out |  6 +--
 tests/qemu-iotests/061.out | 12 +++--
 tests/qemu-iotests/065     | 12 ++---
 tests/qemu-iotests/082.out |  7 +++
 tests/qemu-iotests/206.out |  4 ++
 tests/qemu-iotests/242.out |  1 +
 10 files changed, 134 insertions(+), 27 deletions(-)

diff --git a/block/qcow2-snapshot.c b/block/qcow2-snapshot.c
index 5ab64da1ec36..e19f1b3ef5fa 100644
--- a/block/qcow2-snapshot.c
+++ b/block/qcow2-snapshot.c
@@ -781,6 +781,16 @@ int qcow2_snapshot_goto(BlockDriverState *bs, const char *snapshot_id)
         goto fail;
     }

+    /*
+     * With modification to the qcow2 spec, snapshots could store
+     * whether they are in an all zero state. But for now, we assume
+     * all snapshots are nonzero.
+     */
+    ret = qcow2_mark_nonzero(bs);
+    if (ret < 0) {
+        goto fail;
+    }
+
     /*
      * Make sure that the current L1 table is big enough to contain the whole
      * L1 table of the snapshot. If the snapshot L1 table is smaller, the
@@ -1044,6 +1054,7 @@ int qcow2_snapshot_load_tmp(BlockDriverState *bs,
     s->l1_size = sn->l1_size;
     s->l1_table_offset = sn->l1_table_offset;
     s->l1_table = new_l1_table;
+    s->autoclear_features &= ~QCOW2_AUTOCLEAR_ALL_ZERO;

     for(i = 0;i < s->l1_size; i++) {
         be64_to_cpus(&s->l1_table[i]);
diff --git a/block/qcow2.c b/block/qcow2.c
index 3f61d806a14b..6b1969e4d90a 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -480,6 +480,40 @@ static void report_unsupported_feature(Error **errp, Qcow2Feature *table,
     g_free(features);
 }

+/*
+ * Clears the all zero bit and flushes afterwards if necessary.
+ *
+ * If updating the header fails, it is not safe to proceed with
+ * modifying the image.
+ */
+int qcow2_mark_nonzero(BlockDriverState *bs)
+{
+    BDRVQcow2State *s = bs->opaque;
+    uint64_t val;
+    int ret;
+
+    if (!(s->autoclear_features & QCOW2_AUTOCLEAR_ALL_ZERO)) {
+        return 0; /* already marked non-zero, including version 2 */
+    }
+
+    assert(s->qcow_version >= 3);
+
+    val = cpu_to_be64(s->autoclear_features & ~QCOW2_AUTOCLEAR_ALL_ZERO);
+    ret = bdrv_pwrite(bs->file, offsetof(QCowHeader, autoclear_features),
+                      &val, sizeof(val));
+    if (ret < 0) {
+        return ret;
+    }
+    ret = bdrv_flush(bs->file->bs);
+    if (ret < 0) {
+        return ret;
+    }
+
+    /* Only clear the in-memory flag if the header was updated successfully */
+    s->autoclear_features &= ~QCOW2_AUTOCLEAR_ALL_ZERO;
+    return 0;
+}
+
 /*
  * Sets the dirty bit and flushes afterwards if necessary.
  *
@@ -518,16 +552,27 @@ int qcow2_mark_dirty(BlockDriverState *bs)
 /*
  * Clears the dirty bit and flushes before if necessary.  Only call this
  * function when there are no pending requests, it does not guard against
- * concurrent requests dirtying the image.
+ * concurrent requests dirtying the image. If all_zero is 0 or 1, adjust
+ * s->autoclear_features to match; if -1, preserve the cached value.
  */
-static int qcow2_mark_clean(BlockDriverState *bs)
+static int qcow2_mark_clean(BlockDriverState *bs, int all_zero)
 {
     BDRVQcow2State *s = bs->opaque;

-    if (s->incompatible_features & QCOW2_INCOMPAT_DIRTY) {
+    if (all_zero == -1) {
+        all_zero = !!(s->autoclear_features & QCOW2_AUTOCLEAR_ALL_ZERO);
+    }
+    if (bs->backing || bs->encrypted || s->qcow_version < 3) {
+        all_zero = 0;
+    }
+    if (s->incompatible_features & QCOW2_INCOMPAT_DIRTY ||
+        (all_zero && !(s->autoclear_features & QCOW2_AUTOCLEAR_ALL_ZERO))) {
         int ret;

         s->incompatible_features &= ~QCOW2_INCOMPAT_DIRTY;
+        if (all_zero) {
+            s->autoclear_features |= QCOW2_AUTOCLEAR_ALL_ZERO;
+        }

         ret = qcow2_flush_caches(bs);
         if (ret < 0) {
@@ -616,7 +661,13 @@ static int coroutine_fn qcow2_co_check_locked(BlockDriverState *bs,
     }

     if (fix && result->check_errors == 0 && result->corruptions == 0) {
-        ret = qcow2_mark_clean(bs);
+        /*
+         * In the case of fixing an image, we've actually spent the
+         * time of traversing every cluster, and could thus turn the
+         * all_zero bit on if the check proves it is correct; but for
+         * now, it is easier to just always drop the all_zero bit.
+         */
+        ret = qcow2_mark_clean(bs, 0);
         if (ret < 0) {
             return ret;
         }
@@ -1069,7 +1120,7 @@ static int qcow2_update_options_prepare(BlockDriverState *bs,
     }

     if (s->use_lazy_refcounts && !r->use_lazy_refcounts) {
-        ret = qcow2_mark_clean(bs);
+        ret = qcow2_mark_clean(bs, -1);
         if (ret < 0) {
             error_setg_errno(errp, -ret, "Failed to disable lazy refcounts");
             goto fail;
@@ -1865,7 +1916,7 @@ static int qcow2_reopen_prepare(BDRVReopenState *state,
             goto fail;
         }

-        ret = qcow2_mark_clean(state->bs);
+        ret = qcow2_mark_clean(state->bs, -1);
         if (ret < 0) {
             goto fail;
         }
@@ -2486,6 +2537,11 @@ static coroutine_fn int qcow2_co_pwritev_part(

     trace_qcow2_writev_start_req(qemu_coroutine_self(), offset, bytes);

+    ret = qcow2_mark_nonzero(bs);
+    if (ret < 0) {
+        goto fail_nometa;
+    }
+
     while (bytes != 0 && aio_task_pool_status(aio) == 0) {

         l2meta = NULL;
@@ -2586,7 +2642,7 @@ static int qcow2_inactivate(BlockDriverState *bs)
     }

     if (result == 0) {
-        qcow2_mark_clean(bs);
+        qcow2_mark_clean(bs, -1);
     }

     return result;
@@ -3443,6 +3499,9 @@ qcow2_co_create(BlockdevCreateOptions *create_options, Error **errp)
         header->autoclear_features |=
             cpu_to_be64(QCOW2_AUTOCLEAR_DATA_FILE_RAW);
     }
+    if (version >= 3 && !qcow2_opts->has_backing_file) {
+        header->autoclear_features |= cpu_to_be64(QCOW2_AUTOCLEAR_ALL_ZERO);
+    }

     ret = blk_pwrite(blk, 0, header, cluster_size, 0);
     g_free(header);
@@ -3793,6 +3852,11 @@ static coroutine_fn int qcow2_co_pdiscard(BlockDriverState *bs,
     }

     qemu_co_mutex_lock(&s->lock);
+    /*
+     * No need to call qcow2_mark_nonzero: v2 images lack autoclear
+     * bits and so are already nonzero; v3 images pass full_discard=false
+     * so that discarded clusters still read as zero.
+     */
     ret = qcow2_cluster_discard(bs, offset, bytes, QCOW2_DISCARD_REQUEST,
                                 false);
     qemu_co_mutex_unlock(&s->lock);
@@ -3902,6 +3966,11 @@ qcow2_co_copy_range_to(BlockDriverState *bs,

     qemu_co_mutex_lock(&s->lock);

+    ret = qcow2_mark_nonzero(bs);
+    if (ret < 0) {
+        goto fail;
+    }
+
     while (bytes != 0) {

         l2meta = NULL;
@@ -4334,6 +4403,11 @@ qcow2_co_pwritev_compressed_part(BlockDriverState *bs,
         return -ENOTSUP;
     }

+    ret = qcow2_mark_nonzero(bs);
+    if (ret < 0) {
+        return ret;
+    }
+
     if (bytes == 0) {
         /*
          * align end of file to a sector boundary to ease reading with
@@ -4547,7 +4621,7 @@ static int make_completely_empty(BlockDriverState *bs)

     /* Now finally the in-memory information corresponds to the on-disk
      * structures and is correct */
-    ret = qcow2_mark_clean(bs);
+    ret = qcow2_mark_clean(bs, 1);
     if (ret < 0) {
         goto fail;
     }
@@ -4615,6 +4689,9 @@ static int qcow2_make_empty(BlockDriverState *bs)
             break;
         }
     }
+    if (!bs->backing && !bs->encrypted && s->qcow_version >= 3) {
+        s->autoclear_features |= QCOW2_AUTOCLEAR_ALL_ZERO;
+    }

     return ret;
 }
@@ -5002,7 +5079,7 @@ static int qcow2_downgrade(BlockDriverState *bs, int target_version,

     /* clear incompatible features */
     if (s->incompatible_features & QCOW2_INCOMPAT_DIRTY) {
-        ret = qcow2_mark_clean(bs);
+        ret = qcow2_mark_clean(bs, 0);
         if (ret < 0) {
             error_setg_errno(errp, -ret, "Failed to make the image clean");
             return ret;
@@ -5372,7 +5449,7 @@ static int qcow2_amend_options(BlockDriverState *bs, QemuOpts *opts,
             s->use_lazy_refcounts = true;
         } else {
             /* make image clean first */
-            ret = qcow2_mark_clean(bs);
+            ret = qcow2_mark_clean(bs, -1);
             if (ret < 0) {
                 error_setg_errno(errp, -ret, "Failed to make the image clean");
                 return ret;
diff --git a/block/qcow2.h b/block/qcow2.h
index 6fc2d323d753..7b971ed825ed 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -243,8 +243,8 @@ enum {
     QCOW2_AUTOCLEAR_ALL_ZERO            = 1 << QCOW2_AUTOCLEAR_ALL_ZERO_BITNR,

     QCOW2_AUTOCLEAR_MASK                = QCOW2_AUTOCLEAR_BITMAPS
-                                        | QCOW2_AUTOCLEAR_DATA_FILE_RAW,
-    /* TODO: Add _ALL_ZERO to _MASK once it is handled correctly */
+                                        | QCOW2_AUTOCLEAR_DATA_FILE_RAW
+                                        | QCOW2_AUTOCLEAR_ALL_ZERO,
 };

 enum qcow2_discard_type {
@@ -610,6 +610,7 @@ int64_t qcow2_refcount_metadata_size(int64_t clusters, size_t cluster_size,

 int qcow2_mark_dirty(BlockDriverState *bs);
 int qcow2_mark_corrupt(BlockDriverState *bs);
+int qcow2_mark_nonzero(BlockDriverState *bs);
 int qcow2_mark_consistent(BlockDriverState *bs);
 int qcow2_update_header(BlockDriverState *bs);

diff --git a/tests/qemu-iotests/031.out b/tests/qemu-iotests/031.out
index bb1afa7b87f6..293f67e96bb6 100644
--- a/tests/qemu-iotests/031.out
+++ b/tests/qemu-iotests/031.out
@@ -111,7 +111,7 @@ nb_snapshots              0
 snapshot_offset           0x0
 incompatible_features     []
 compatible_features       []
-autoclear_features        []
+autoclear_features        [2]
 refcount_order            4
 header_length             104

@@ -144,7 +144,7 @@ nb_snapshots              0
 snapshot_offset           0x0
 incompatible_features     []
 compatible_features       []
-autoclear_features        []
+autoclear_features        [2]
 refcount_order            4
 header_length             104

@@ -177,7 +177,7 @@ nb_snapshots              0
 snapshot_offset           0x0
 incompatible_features     []
 compatible_features       []
-autoclear_features        []
+autoclear_features        [2]
 refcount_order            4
 header_length             104

diff --git a/tests/qemu-iotests/036.out b/tests/qemu-iotests/036.out
index e409acf60e2b..5eea8b2bb547 100644
--- a/tests/qemu-iotests/036.out
+++ b/tests/qemu-iotests/036.out
@@ -5,7 +5,7 @@ QA output created by 036
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
 incompatible_features     [63]
 compatible_features       []
-autoclear_features        []
+autoclear_features        [2]
 qemu-img: Could not open 'TEST_DIR/t.IMGFMT': Unsupported IMGFMT feature(s): Unknown incompatible feature: 8000000000000000
 qemu-img: Could not open 'TEST_DIR/t.IMGFMT': Unsupported IMGFMT feature(s): Test feature

@@ -23,7 +23,7 @@ qemu-img: Could not open 'TEST_DIR/t.IMGFMT': Unsupported IMGFMT feature(s): tes
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
 incompatible_features     []
 compatible_features       []
-autoclear_features        [63]
+autoclear_features        [2, 63]
 Header extension:
 magic                     0x6803f857
 length                    336
@@ -35,7 +35,7 @@ data                      <binary>
 No errors were found on the image.
 incompatible_features     []
 compatible_features       []
-autoclear_features        []
+autoclear_features        [2]
 Header extension:
 magic                     0x6803f857
 length                    336
diff --git a/tests/qemu-iotests/061.out b/tests/qemu-iotests/061.out
index d873f79bb606..3d471c2bde14 100644
--- a/tests/qemu-iotests/061.out
+++ b/tests/qemu-iotests/061.out
@@ -20,7 +20,7 @@ nb_snapshots              0
 snapshot_offset           0x0
 incompatible_features     []
 compatible_features       [0]
-autoclear_features        []
+autoclear_features        [2]
 refcount_order            4
 header_length             104

@@ -78,7 +78,7 @@ nb_snapshots              0
 snapshot_offset           0x0
 incompatible_features     []
 compatible_features       [0]
-autoclear_features        []
+autoclear_features        [2]
 refcount_order            4
 header_length             104

@@ -189,7 +189,7 @@ nb_snapshots              0
 snapshot_offset           0x0
 incompatible_features     []
 compatible_features       [42]
-autoclear_features        [42]
+autoclear_features        [2, 42]
 refcount_order            4
 header_length             104

@@ -491,6 +491,7 @@ virtual size: 64 MiB (67108864 bytes)
 cluster_size: 65536
 Format specific information:
     compat: 1.1
+    all zero: true
     lazy refcounts: false
     refcount bits: 16
     data file: TEST_DIR/t.IMGFMT.data
@@ -511,6 +512,7 @@ virtual size: 64 MiB (67108864 bytes)
 cluster_size: 65536
 Format specific information:
     compat: 1.1
+    all zero: true
     lazy refcounts: false
     refcount bits: 16
     data file: foo
@@ -524,6 +526,7 @@ virtual size: 64 MiB (67108864 bytes)
 cluster_size: 65536
 Format specific information:
     compat: 1.1
+    all zero: true
     lazy refcounts: false
     refcount bits: 16
     data file raw: false
@@ -538,6 +541,7 @@ virtual size: 64 MiB (67108864 bytes)
 cluster_size: 65536
 Format specific information:
     compat: 1.1
+    all zero: true
     lazy refcounts: false
     refcount bits: 16
     data file: TEST_DIR/t.IMGFMT.data
@@ -550,6 +554,7 @@ virtual size: 64 MiB (67108864 bytes)
 cluster_size: 65536
 Format specific information:
     compat: 1.1
+    all zero: true
     lazy refcounts: false
     refcount bits: 16
     data file: TEST_DIR/t.IMGFMT.data
@@ -563,6 +568,7 @@ virtual size: 64 MiB (67108864 bytes)
 cluster_size: 65536
 Format specific information:
     compat: 1.1
+    all zero: true
     lazy refcounts: false
     refcount bits: 16
     data file: TEST_DIR/t.IMGFMT.data
diff --git a/tests/qemu-iotests/065 b/tests/qemu-iotests/065
index 5b21eb96bd09..d47b3d30d0de 100755
--- a/tests/qemu-iotests/065
+++ b/tests/qemu-iotests/065
@@ -94,17 +94,17 @@ class TestQCow2(TestQemuImgInfo):
 class TestQCow3NotLazy(TestQemuImgInfo):
     '''Testing a qcow2 version 3 image with lazy refcounts disabled'''
     img_options = 'compat=1.1,lazy_refcounts=off'
-    json_compare = { 'compat': '1.1', 'lazy-refcounts': False,
+    json_compare = { 'compat': '1.1', 'all-zero': True, 'lazy-refcounts': False,
                      'refcount-bits': 16, 'corrupt': False }
-    human_compare = [ 'compat: 1.1', 'lazy refcounts: false',
+    human_compare = [ 'compat: 1.1', 'all zero: true', 'lazy refcounts: false',
                       'refcount bits: 16', 'corrupt: false' ]

 class TestQCow3Lazy(TestQemuImgInfo):
     '''Testing a qcow2 version 3 image with lazy refcounts enabled'''
     img_options = 'compat=1.1,lazy_refcounts=on'
-    json_compare = { 'compat': '1.1', 'lazy-refcounts': True,
+    json_compare = { 'compat': '1.1', 'all-zero': True, 'lazy-refcounts': True,
                      'refcount-bits': 16, 'corrupt': False }
-    human_compare = [ 'compat: 1.1', 'lazy refcounts: true',
+    human_compare = [ 'compat: 1.1', 'all zero: true', 'lazy refcounts: true',
                       'refcount bits: 16', 'corrupt: false' ]

 class TestQCow3NotLazyQMP(TestQMP):
@@ -112,7 +112,7 @@ class TestQCow3NotLazyQMP(TestQMP):
        with lazy refcounts enabled'''
     img_options = 'compat=1.1,lazy_refcounts=off'
     qemu_options = 'lazy-refcounts=on'
-    compare = { 'compat': '1.1', 'lazy-refcounts': False,
+    compare = { 'compat': '1.1', 'all-zero': True, 'lazy-refcounts': False,
                 'refcount-bits': 16, 'corrupt': False }


@@ -121,7 +121,7 @@ class TestQCow3LazyQMP(TestQMP):
        with lazy refcounts disabled'''
     img_options = 'compat=1.1,lazy_refcounts=on'
     qemu_options = 'lazy-refcounts=off'
-    compare = { 'compat': '1.1', 'lazy-refcounts': True,
+    compare = { 'compat': '1.1', 'all-zero': True, 'lazy-refcounts': True,
                 'refcount-bits': 16, 'corrupt': False }

 TestImageInfoSpecific = None
diff --git a/tests/qemu-iotests/082.out b/tests/qemu-iotests/082.out
index 9d4ed4dc9d61..6729a43712f2 100644
--- a/tests/qemu-iotests/082.out
+++ b/tests/qemu-iotests/082.out
@@ -17,6 +17,7 @@ virtual size: 128 MiB (134217728 bytes)
 cluster_size: 4096
 Format specific information:
     compat: 1.1
+    all zero: true
     lazy refcounts: true
     refcount bits: 16
     corrupt: false
@@ -29,6 +30,7 @@ virtual size: 128 MiB (134217728 bytes)
 cluster_size: 8192
 Format specific information:
     compat: 1.1
+    all zero: true
     lazy refcounts: true
     refcount bits: 16
     corrupt: false
@@ -299,6 +301,7 @@ virtual size: 128 MiB (134217728 bytes)
 cluster_size: 4096
 Format specific information:
     compat: 1.1
+    all zero: true
     lazy refcounts: true
     refcount bits: 16
     corrupt: false
@@ -310,6 +313,7 @@ virtual size: 128 MiB (134217728 bytes)
 cluster_size: 8192
 Format specific information:
     compat: 1.1
+    all zero: true
     lazy refcounts: true
     refcount bits: 16
     corrupt: false
@@ -579,6 +583,7 @@ virtual size: 128 MiB (134217728 bytes)
 cluster_size: 65536
 Format specific information:
     compat: 1.1
+    all zero: true
     lazy refcounts: true
     refcount bits: 16
     corrupt: false
@@ -590,6 +595,7 @@ virtual size: 130 MiB (136314880 bytes)
 cluster_size: 65536
 Format specific information:
     compat: 1.1
+    all zero: true
     lazy refcounts: false
     refcount bits: 16
     corrupt: false
@@ -601,6 +607,7 @@ virtual size: 132 MiB (138412032 bytes)
 cluster_size: 65536
 Format specific information:
     compat: 1.1
+    all zero: true
     lazy refcounts: true
     refcount bits: 16
     corrupt: false
diff --git a/tests/qemu-iotests/206.out b/tests/qemu-iotests/206.out
index 61e7241e0bf3..aa27d75d12b1 100644
--- a/tests/qemu-iotests/206.out
+++ b/tests/qemu-iotests/206.out
@@ -18,6 +18,7 @@ virtual size: 128 MiB (134217728 bytes)
 cluster_size: 65536
 Format specific information:
     compat: 1.1
+    all zero: true
     lazy refcounts: false
     refcount bits: 16
     corrupt: false
@@ -40,6 +41,7 @@ virtual size: 64 MiB (67108864 bytes)
 cluster_size: 65536
 Format specific information:
     compat: 1.1
+    all zero: true
     lazy refcounts: false
     refcount bits: 16
     corrupt: false
@@ -62,6 +64,7 @@ virtual size: 32 MiB (33554432 bytes)
 cluster_size: 2097152
 Format specific information:
     compat: 1.1
+    all zero: true
     lazy refcounts: true
     refcount bits: 1
     corrupt: false
@@ -102,6 +105,7 @@ encrypted: yes
 cluster_size: 65536
 Format specific information:
     compat: 1.1
+    all zero: true
     lazy refcounts: false
     refcount bits: 16
     encrypt:
diff --git a/tests/qemu-iotests/242.out b/tests/qemu-iotests/242.out
index 7ac8404d11c8..807f24549e89 100644
--- a/tests/qemu-iotests/242.out
+++ b/tests/qemu-iotests/242.out
@@ -153,6 +153,7 @@ virtual size: 1 MiB (1048576 bytes)
 cluster_size: 65536
 Format specific information:
     compat: 1.1
+    all zero: true
     lazy refcounts: false
     bitmaps:
         [0]:
-- 
2.24.1




* [PATCH 16/17] iotests: Add new test for qcow2 all-zero bit
  2020-01-31 17:44 [PATCH 00/17] Improve qcow2 all-zero detection Eric Blake
                   ` (14 preceding siblings ...)
  2020-01-31 17:44 ` [PATCH 15/17] qcow2: Implement all-zero autoclear bit Eric Blake
@ 2020-01-31 17:44 ` Eric Blake
  2020-01-31 17:44 ` [PATCH 17/17] qcow2: Let qemu-img check cover " Eric Blake
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 73+ messages in thread
From: Eric Blake @ 2020-01-31 17:44 UTC (permalink / raw)
  To: qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block, mreitz

Cover various scenarios to show that the bit gets set even for
fully-allocated images, as well as scenarios where it is properly
cleared.
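
As a rough illustration (not part of the patch), the scenarios the test
exercises can be modeled as a tiny state machine.  The class and method
names are invented for this sketch, and it covers only the cases whose
outcome the comments and expected output in the test make clear:

```python
class AllZeroBitModel:
    """Toy model of the proposed all-zero autoclear bit; not QEMU code."""

    def __init__(self, has_backing=False):
        # A freshly created image gets the bit only without a backing file.
        self.has_backing = has_backing
        self.bit = not has_backing

    def write_zeroes_or_discard(self):
        # 'w -z', 'discard' and a growing resize keep the image reading
        # as zero, so the bit is left untouched.  They never set it,
        # either: rewriting data back to zero does not restore the bit.
        pass

    def write_data(self):
        # Any write of non-zero data clears the bit.
        self.bit = False


img = AllZeroBitModel()
img.write_zeroes_or_discard()
print(img.bit)      # still set
img.write_data()
img.write_zeroes_or_discard()
print(img.bit)      # cleared, and not restored
```

The model deliberately says nothing about rebase, commit or snapshots;
those cases are best read from the test itself.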

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 tests/qemu-iotests/285     | 107 +++++++++++++++
 tests/qemu-iotests/285.out | 257 +++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/group   |   1 +
 3 files changed, 365 insertions(+)
 create mode 100755 tests/qemu-iotests/285
 create mode 100644 tests/qemu-iotests/285.out

diff --git a/tests/qemu-iotests/285 b/tests/qemu-iotests/285
new file mode 100755
index 000000000000..66037af237a1
--- /dev/null
+++ b/tests/qemu-iotests/285
@@ -0,0 +1,107 @@
+#!/usr/bin/env bash
+#
+# Test qcow2 all-zero autoclear bit
+#
+# Copyright (C) 2020 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+seq=$(basename $0)
+echo "QA output created by $seq"
+
+status=1	# failure is the default!
+
+_cleanup()
+{
+    _cleanup_test_img
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+
+_supported_fmt qcow2
+_supported_proto file
+_supported_os Linux
+# Autoclear bit is not available in compat=0.10;
+# encrypted images never advertise all-zero bit
+_unsupported_imgopts 'compat=0.10' encrypt
+
+for mode in off metadata falloc full; do
+
+    echo
+    echo "=== preallocation=$mode ==="
+    echo
+
+    _make_test_img -o "preallocation=$mode" 32M
+
+    # Actions that do not lose the all-zero nature of the image:
+    $QEMU_IO -c 'w -z 0 16M' -c 'discard 8M 16M' "$TEST_IMG" | _filter_qemu_io
+    $QEMU_IMG resize --preallocation=$mode "$TEST_IMG" +8M
+    $QEMU_IO -c 'r -P 0 0 40M' "$TEST_IMG" | _filter_qemu_io
+    $QEMU_IMG info "$TEST_IMG" | _filter_img_info --format-specific
+
+    # Writing data must clear the all-zero bit:
+    $QEMU_IO -c 'w -P 1 32M 1M' "$TEST_IMG" | _filter_qemu_io
+    $QEMU_IMG info "$TEST_IMG" | _filter_img_info --format-specific
+
+    # Alas, rewriting the image back to zero does not restore the bit
+    # (checking if each write gets us back to zero does not scale)
+    $QEMU_IO -c 'w -z 32M 1M' "$TEST_IMG" | _filter_qemu_io
+    $QEMU_IMG info "$TEST_IMG" | _filter_img_info --format-specific
+
+done
+
+echo
+echo "=== backing files ==="
+echo
+
+# Even when a backing file is all zero, we do not set all-zero bit;
+# this is true whether we create with a backing file or rebase later
+TEST_IMG_SAVE=$TEST_IMG
+TEST_IMG=$TEST_IMG.base
+_make_test_img 32M
+TEST_IMG=$TEST_IMG_SAVE
+_make_test_img -b "$TEST_IMG.base" -F qcow2 32M
+$QEMU_IMG info "$TEST_IMG" | _filter_img_info --format-specific
+_make_test_img 32M
+$QEMU_IMG info "$TEST_IMG" | _filter_img_info --format-specific
+$QEMU_IMG rebase -u -F qcow2 -b "$TEST_IMG.base" "$TEST_IMG"
+$QEMU_IMG info "$TEST_IMG" | _filter_img_info --format-specific
+
+# qemu-img commit clears an image, but because it still has a backing file,
+# setting the all-zero bit is not correct
+$QEMU_IO -c 'w -P 1 0 1M' "$TEST_IMG" | _filter_qemu_io
+$QEMU_IMG commit "$TEST_IMG"
+$QEMU_IMG info "$TEST_IMG" | _filter_img_info --format-specific
+
+echo
+echo "=== internal snapshots ==="
+echo
+
+# For now, internal snapshots do not remember the all-zero bit
+_make_test_img 32M
+$QEMU_IMG info "$TEST_IMG" | _filter_img_info --format-specific
+$QEMU_IMG snapshot -c snap "$TEST_IMG"
+$QEMU_IO -c 'w -P 1 0 1M' "$TEST_IMG" | _filter_qemu_io
+$QEMU_IMG snapshot -l snap "$TEST_IMG"
+$QEMU_IMG info "$TEST_IMG" | _filter_img_info --format-specific \
+    | _filter_date | _filter_vmstate_size
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/285.out b/tests/qemu-iotests/285.out
new file mode 100644
index 000000000000..e43ff9906b5f
--- /dev/null
+++ b/tests/qemu-iotests/285.out
@@ -0,0 +1,257 @@
+QA output created by 285
+
+=== preallocation=off ===
+
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=33554432 preallocation=off
+wrote 16777216/16777216 bytes at offset 0
+16 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+discard 16777216/16777216 bytes at offset 8388608
+16 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Image resized.
+read 41943040/41943040 bytes at offset 0
+40 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+image: TEST_DIR/t.IMGFMT
+file format: IMGFMT
+virtual size: 40 MiB (41943040 bytes)
+disk size: 260 KiB
+Format specific information:
+    compat: 1.1
+    all zero: true
+    lazy refcounts: false
+    refcount bits: 16
+    corrupt: false
+wrote 1048576/1048576 bytes at offset 33554432
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+image: TEST_DIR/t.IMGFMT
+file format: IMGFMT
+virtual size: 40 MiB (41943040 bytes)
+disk size: 1.25 MiB
+Format specific information:
+    compat: 1.1
+    lazy refcounts: false
+    refcount bits: 16
+    corrupt: false
+wrote 1048576/1048576 bytes at offset 33554432
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+image: TEST_DIR/t.IMGFMT
+file format: IMGFMT
+virtual size: 40 MiB (41943040 bytes)
+disk size: 1.25 MiB
+Format specific information:
+    compat: 1.1
+    lazy refcounts: false
+    refcount bits: 16
+    corrupt: false
+
+=== preallocation=metadata ===
+
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=33554432 preallocation=metadata
+wrote 16777216/16777216 bytes at offset 0
+16 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+discard 16777216/16777216 bytes at offset 8388608
+16 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Image resized.
+read 41943040/41943040 bytes at offset 0
+40 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+image: TEST_DIR/t.IMGFMT
+file format: IMGFMT
+virtual size: 40 MiB (41943040 bytes)
+disk size: 260 KiB
+Format specific information:
+    compat: 1.1
+    all zero: true
+    lazy refcounts: false
+    refcount bits: 16
+    corrupt: false
+wrote 1048576/1048576 bytes at offset 33554432
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+image: TEST_DIR/t.IMGFMT
+file format: IMGFMT
+virtual size: 40 MiB (41943040 bytes)
+disk size: 1.25 MiB
+Format specific information:
+    compat: 1.1
+    lazy refcounts: false
+    refcount bits: 16
+    corrupt: false
+wrote 1048576/1048576 bytes at offset 33554432
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+image: TEST_DIR/t.IMGFMT
+file format: IMGFMT
+virtual size: 40 MiB (41943040 bytes)
+disk size: 1.25 MiB
+Format specific information:
+    compat: 1.1
+    lazy refcounts: false
+    refcount bits: 16
+    corrupt: false
+
+=== preallocation=falloc ===
+
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=33554432 preallocation=falloc
+wrote 16777216/16777216 bytes at offset 0
+16 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+discard 16777216/16777216 bytes at offset 8388608
+16 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Image resized.
+read 41943040/41943040 bytes at offset 0
+40 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+image: TEST_DIR/t.IMGFMT
+file format: IMGFMT
+virtual size: 40 MiB (41943040 bytes)
+disk size: 24.3 MiB
+Format specific information:
+    compat: 1.1
+    all zero: true
+    lazy refcounts: false
+    refcount bits: 16
+    corrupt: false
+wrote 1048576/1048576 bytes at offset 33554432
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+image: TEST_DIR/t.IMGFMT
+file format: IMGFMT
+virtual size: 40 MiB (41943040 bytes)
+disk size: 24.3 MiB
+Format specific information:
+    compat: 1.1
+    lazy refcounts: false
+    refcount bits: 16
+    corrupt: false
+wrote 1048576/1048576 bytes at offset 33554432
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+image: TEST_DIR/t.IMGFMT
+file format: IMGFMT
+virtual size: 40 MiB (41943040 bytes)
+disk size: 24.3 MiB
+Format specific information:
+    compat: 1.1
+    lazy refcounts: false
+    refcount bits: 16
+    corrupt: false
+
+=== preallocation=full ===
+
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=33554432 preallocation=full
+wrote 16777216/16777216 bytes at offset 0
+16 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+discard 16777216/16777216 bytes at offset 8388608
+16 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Image resized.
+read 41943040/41943040 bytes at offset 0
+40 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+image: TEST_DIR/t.IMGFMT
+file format: IMGFMT
+virtual size: 40 MiB (41943040 bytes)
+disk size: 24.3 MiB
+Format specific information:
+    compat: 1.1
+    all zero: true
+    lazy refcounts: false
+    refcount bits: 16
+    corrupt: false
+wrote 1048576/1048576 bytes at offset 33554432
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+image: TEST_DIR/t.IMGFMT
+file format: IMGFMT
+virtual size: 40 MiB (41943040 bytes)
+disk size: 24.3 MiB
+Format specific information:
+    compat: 1.1
+    lazy refcounts: false
+    refcount bits: 16
+    corrupt: false
+wrote 1048576/1048576 bytes at offset 33554432
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+image: TEST_DIR/t.IMGFMT
+file format: IMGFMT
+virtual size: 40 MiB (41943040 bytes)
+disk size: 24.3 MiB
+Format specific information:
+    compat: 1.1
+    lazy refcounts: false
+    refcount bits: 16
+    corrupt: false
+
+=== backing files ===
+
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=33554432
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=33554432 backing_file=TEST_DIR/t.IMGFMT.base backing_fmt=IMGFMT
+image: TEST_DIR/t.IMGFMT
+file format: IMGFMT
+virtual size: 32 MiB (33554432 bytes)
+disk size: 196 KiB
+backing file: TEST_DIR/t.IMGFMT.base
+backing file format: IMGFMT
+Format specific information:
+    compat: 1.1
+    lazy refcounts: false
+    refcount bits: 16
+    corrupt: false
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=33554432
+image: TEST_DIR/t.IMGFMT
+file format: IMGFMT
+virtual size: 32 MiB (33554432 bytes)
+disk size: 196 KiB
+Format specific information:
+    compat: 1.1
+    all zero: true
+    lazy refcounts: false
+    refcount bits: 16
+    corrupt: false
+image: TEST_DIR/t.IMGFMT
+file format: IMGFMT
+virtual size: 32 MiB (33554432 bytes)
+disk size: 196 KiB
+backing file: TEST_DIR/t.IMGFMT.base
+backing file format: IMGFMT
+Format specific information:
+    compat: 1.1
+    all zero: true
+    lazy refcounts: false
+    refcount bits: 16
+    corrupt: false
+wrote 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Image committed.
+image: TEST_DIR/t.IMGFMT
+file format: IMGFMT
+virtual size: 32 MiB (33554432 bytes)
+disk size: 260 KiB
+backing file: TEST_DIR/t.IMGFMT.base
+backing file format: IMGFMT
+Format specific information:
+    compat: 1.1
+    lazy refcounts: false
+    refcount bits: 16
+    corrupt: false
+
+=== internal snapshots ===
+
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=33554432
+image: TEST_DIR/t.IMGFMT
+file format: IMGFMT
+virtual size: 32 MiB (33554432 bytes)
+disk size: 196 KiB
+Format specific information:
+    compat: 1.1
+    all zero: true
+    lazy refcounts: false
+    refcount bits: 16
+    corrupt: false
+wrote 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+qemu-img: Expecting one image file name
+Try 'qemu-img --help' for more information
+image: TEST_DIR/t.IMGFMT
+file format: IMGFMT
+virtual size: 32 MiB (33554432 bytes)
+disk size:     SIZE
+Snapshot list:
+ID        TAG                 VM SIZE                DATE       VM CLOCK
+1         snap                   SIZE yyyy-mm-dd hh:mm:ss   00:00:00.000
+Format specific information:
+    compat: 1.1
+    lazy refcounts: false
+    refcount bits: 16
+    corrupt: false
+*** done
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index e041cc1ee360..e9b20818fad5 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -289,3 +289,4 @@
 279 rw backing quick
 280 rw migration quick
 281 rw quick
+285 rw quick
-- 
2.24.1




* [PATCH 17/17] qcow2: Let qemu-img check cover all-zero bit
  2020-01-31 17:44 [PATCH 00/17] Improve qcow2 all-zero detection Eric Blake
                   ` (15 preceding siblings ...)
  2020-01-31 17:44 ` [PATCH 16/17] iotests: Add new test for qcow2 all-zero bit Eric Blake
@ 2020-01-31 17:44 ` Eric Blake
  2020-02-04 17:32 ` [PATCH 00/17] Improve qcow2 all-zero detection Max Reitz
  2020-02-05  9:04 ` Vladimir Sementsov-Ogievskiy
  18 siblings, 0 replies; 73+ messages in thread
From: Eric Blake @ 2020-01-31 17:44 UTC (permalink / raw)
  To: qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block, mreitz

Since checking an image's refcounts already visits every cluster, it's
basically free to also check that the all-zero bit is correctly set.
Only check the active L1 table, and only output an error on the
first non-zero cluster found.
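
The per-cluster decision can be sketched outside of C as follows.  This
is only a model of the switch statement in the diff below: the enum
mirrors the QCOW2_CLUSTER_* names, while contradicts_all_zero() and
first_violation() are hypothetical helpers that do not exist in QEMU.

```python
from enum import Enum, auto


class ClusterType(Enum):
    UNALLOCATED = auto()
    ZERO_PLAIN = auto()
    ZERO_ALLOC = auto()
    NORMAL = auto()
    COMPRESSED = auto()


def contradicts_all_zero(ctype, has_backing):
    """Return True if this cluster type is at odds with the all-zero bit."""
    if ctype in (ClusterType.ZERO_PLAIN, ClusterType.ZERO_ALLOC):
        return False            # explicit zero clusters are fine
    if ctype is ClusterType.UNALLOCATED:
        return has_backing      # would defer to the backing file
    return True                 # NORMAL and COMPRESSED may hold data


def first_violation(l2_entries, has_backing):
    """Index of the first offending entry, or None if there is none."""
    for i, ctype in enumerate(l2_entries):
        if contradicts_all_zero(ctype, has_backing):
            return i
    return None
```

Stopping at the first offending entry corresponds to the patch clearing
*all_zero once an error has been reported.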

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/qcow2-refcount.c     | 60 +++++++++++++++++++++++++++++++++++---
 tests/qemu-iotests/060.out |  6 ++--
 tests/qemu-iotests/285     | 17 +++++++++++
 tests/qemu-iotests/285.out | 20 +++++++++++++
 4 files changed, 97 insertions(+), 6 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index f67ac6b2d893..95c8101df365 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1583,6 +1583,7 @@ int qcow2_inc_refcounts_imrt(BlockDriverState *bs, BdrvCheckResult *res,
 /* Flags for check_refcounts_l1() and check_refcounts_l2() */
 enum {
     CHECK_FRAG_INFO = 0x2,      /* update BlockFragInfo counters */
+    CHECK_ALL_ZERO = 0x4,       /* check autoclear all_zero bit */
 };

 /*
@@ -1596,12 +1597,14 @@ enum {
 static int check_refcounts_l2(BlockDriverState *bs, BdrvCheckResult *res,
                               void **refcount_table,
                               int64_t *refcount_table_size, int64_t l2_offset,
-                              int flags, BdrvCheckMode fix, bool active)
+                              int flags, BdrvCheckMode fix, bool active,
+                              bool *all_zero)
 {
     BDRVQcow2State *s = bs->opaque;
     uint64_t *l2_table, l2_entry;
     uint64_t next_contiguous_offset = 0;
     int i, l2_size, nb_csectors, ret;
+    bool check_all_zero;

     /* Read L2 table from disk */
     l2_size = s->l2_size * sizeof(uint64_t);
@@ -1615,8 +1618,9 @@ static int check_refcounts_l2(BlockDriverState *bs, BdrvCheckResult *res,
     }

     /* Do the actual checks */
-    for(i = 0; i < s->l2_size; i++) {
+    for (i = 0; i < s->l2_size; i++) {
         l2_entry = be64_to_cpu(l2_table[i]);
+        check_all_zero = *all_zero;

         switch (qcow2_get_cluster_type(bs, l2_entry)) {
         case QCOW2_CLUSTER_COMPRESSED:
@@ -1662,6 +1666,8 @@ static int check_refcounts_l2(BlockDriverState *bs, BdrvCheckResult *res,
             break;

         case QCOW2_CLUSTER_ZERO_ALLOC:
+            check_all_zero = false;
+            /* fall through */
         case QCOW2_CLUSTER_NORMAL:
         {
             uint64_t offset = l2_entry & L2E_OFFSET_MASK;
@@ -1740,12 +1746,51 @@ static int check_refcounts_l2(BlockDriverState *bs, BdrvCheckResult *res,
         }

         case QCOW2_CLUSTER_ZERO_PLAIN:
+            check_all_zero = false;
+            break;
+
         case QCOW2_CLUSTER_UNALLOCATED:
+            if (!bs->backing) {
+                check_all_zero = false;
+            }
             break;

         default:
             abort();
         }
+
+        if (check_all_zero) {
+            fprintf(stderr, "%s: all zero bit set but L2 table at offset "
+                    "0x%" PRIx64" contains non-zero cluster at entry %d\n",
+                    fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR",
+                    l2_offset, i);
+            *all_zero = false;
+            if (fix & BDRV_FIX_ERRORS) {
+                uint64_t feat;
+
+                ret = bdrv_pread(bs->file,
+                                 offsetof(QCowHeader, autoclear_features),
+                                 &feat, sizeof(feat));
+                if (ret >= 0) {
+                    feat &= ~cpu_to_be64(QCOW2_AUTOCLEAR_ALL_ZERO);
+                    ret = bdrv_pwrite(bs->file,
+                                      offsetof(QCowHeader, autoclear_features),
+                                      &feat, sizeof(feat));
+                }
+                if (ret < 0) {
+                    fprintf(stderr,
+                            "ERROR: Failed to update all zero bit: %s\n",
+                            strerror(-ret));
+                    res->check_errors++;
+                    /* Continue checking the rest of this L2 table */
+                } else {
+                    res->corruptions_fixed++;
+                }
+                s->autoclear_features &= ~QCOW2_AUTOCLEAR_ALL_ZERO;
+            } else {
+                res->corruptions++;
+            }
+        }
     }

     g_free(l2_table);
@@ -1774,6 +1819,12 @@ static int check_refcounts_l1(BlockDriverState *bs,
     BDRVQcow2State *s = bs->opaque;
     uint64_t *l1_table = NULL, l2_offset, l1_size2;
     int i, ret;
+    bool all_zero = false;
+
+    if (flags & CHECK_ALL_ZERO &&
+        s->autoclear_features & QCOW2_AUTOCLEAR_ALL_ZERO) {
+        all_zero = true;
+    }

     l1_size2 = l1_size * sizeof(uint64_t);

@@ -1825,7 +1876,7 @@ static int check_refcounts_l1(BlockDriverState *bs,
             /* Process and check L2 entries */
             ret = check_refcounts_l2(bs, res, refcount_table,
                                      refcount_table_size, l2_offset, flags,
-                                     fix, active);
+                                     fix, active, &all_zero);
             if (ret < 0) {
                 goto fail;
             }
@@ -2114,7 +2165,8 @@ static int calculate_refcounts(BlockDriverState *bs, BdrvCheckResult *res,

     /* current L1 table */
     ret = check_refcounts_l1(bs, res, refcount_table, nb_clusters,
-                             s->l1_table_offset, s->l1_size, CHECK_FRAG_INFO,
+                             s->l1_table_offset, s->l1_size,
+                             CHECK_FRAG_INFO | CHECK_ALL_ZERO,
                              fix, true);
     if (ret < 0) {
         return ret;
diff --git a/tests/qemu-iotests/060.out b/tests/qemu-iotests/060.out
index d27692a33c0d..d82aca458544 100644
--- a/tests/qemu-iotests/060.out
+++ b/tests/qemu-iotests/060.out
@@ -3,9 +3,10 @@ QA output created by 060
 === Testing L2 reference into L1 ===

 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
+ERROR: all zero bit set but L2 table at offset 0x30000 contains non-zero cluster at entry 0
 ERROR cluster 3 refcount=1 reference=3

-1 errors were found on the image.
+2 errors were found on the image.
 Data may be corrupted, or further writes to the image may corrupt it.
 incompatible_features     []
 qcow2: Marking image as corrupt: Preventing invalid write on metadata (overlaps with active L1 table); further corruption events will be suppressed
@@ -28,10 +29,11 @@ read 512/512 bytes at offset 0
 === Testing cluster data reference into refcount block ===

 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
+ERROR: all zero bit set but L2 table at offset 0x40000 contains non-zero cluster at entry 0
 ERROR refcount block 0 refcount=2
 ERROR cluster 2 refcount=1 reference=2

-2 errors were found on the image.
+3 errors were found on the image.
 Data may be corrupted, or further writes to the image may corrupt it.
 incompatible_features     []
 qcow2: Marking image as corrupt: Preventing invalid write on metadata (overlaps with refcount block); further corruption events will be suppressed
diff --git a/tests/qemu-iotests/285 b/tests/qemu-iotests/285
index 66037af237a1..c435bb57d749 100755
--- a/tests/qemu-iotests/285
+++ b/tests/qemu-iotests/285
@@ -101,6 +101,23 @@ $QEMU_IMG snapshot -l snap "$TEST_IMG"
 $QEMU_IMG info "$TEST_IMG" | _filter_img_info --format-specific \
     | _filter_date | _filter_vmstate_size

+echo
+echo "=== qemu-img check ==="
+echo
+
+_make_test_img 32M
+$QEMU_IO -c 'w -P 1 0 1M' "$TEST_IMG" | _filter_qemu_io
+# Image should be clean
+_check_test_img
+# Manually corrupt the image by setting the bit
+$PYTHON qcow2.py "$TEST_IMG" set-feature-bit autoclear 2
+# check should detect the problem
+_check_test_img
+# repair should fix it
+_check_test_img -r all
+# the image should be clean again
+_check_test_img
+
 # success, all done
 echo "*** done"
 rm -f $seq.full
diff --git a/tests/qemu-iotests/285.out b/tests/qemu-iotests/285.out
index e43ff9906b5f..b28c9e266bf6 100644
--- a/tests/qemu-iotests/285.out
+++ b/tests/qemu-iotests/285.out
@@ -254,4 +254,24 @@ Format specific information:
     lazy refcounts: false
     refcount bits: 16
     corrupt: false
+
+=== qemu-img check ===
+
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=33554432
+wrote 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+No errors were found on the image.
+ERROR: all zero bit set but L2 table at offset 0x40000 contains non-zero cluster at entry 0
+
+1 errors were found on the image.
+Data may be corrupted, or further writes to the image may corrupt it.
+Repairing: all zero bit set but L2 table at offset 0x40000 contains non-zero cluster at entry 0
+The following inconsistencies were found and repaired:
+
+    0 leaked clusters
+    1 corruptions
+
+Double checking the fixed image now...
+No errors were found on the image.
+No errors were found on the image.
 *** done
-- 
2.24.1




* Re: [PATCH 10/17] block: Add new BDRV_ZERO_OPEN flag
  2020-01-31 17:44 ` [PATCH 10/17] block: Add new BDRV_ZERO_OPEN flag Eric Blake
@ 2020-01-31 18:03   ` Eric Blake
  2020-02-04 17:34   ` Max Reitz
  1 sibling, 0 replies; 73+ messages in thread
From: Eric Blake @ 2020-01-31 18:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block, mreitz

On 1/31/20 11:44 AM, Eric Blake wrote:
> Knowing that a file reads as all zeroes when created is useful, but
> limited in scope to drivers that can create images.  However, there
> are also situations where pre-existing images can quickly be
> determined to read as all zeroes, even when the image was not just
> created by the same process.  The optimization used in qemu-img
> convert to avoid a pre-zeroing pass on the destination is just as
> useful in such a scenario.  As such, it is worth the block layer
> adding another bit to bdrv_known_zeroes().
> 
> Note that while BDRV_ZERO_CREATE cannot chase through backing layers
> (because it only applies at creation time, but the backing layer was
> not created at the same time as the active layer being created), it IS
> okay for BDRV_ZERO_OPEN to chase through layers (as long as all layers
> currently read as zero, the image reads as zero).
> 
> Upcoming patches will update the qcow2, file-posix, and nbd drivers to
> advertise the new bit when appropriate.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---

[Is it bad when I review my own patches?]

> +++ b/block.c
> @@ -5078,7 +5078,7 @@ int bdrv_known_zeroes_truncate(BlockDriverState *bs)
> 
>   int bdrv_known_zeroes(BlockDriverState *bs)
>   {
> -    int mask = BDRV_ZERO_CREATE | BDRV_ZERO_TRUNCATE;
> +    int mask = BDRV_ZERO_CREATE | BDRV_ZERO_TRUNCATE | BDRV_ZERO_OPEN;
> 
>       if (!bs->drv) {
>           return 0;
> @@ -5100,17 +5100,17 @@ int bdrv_known_zeroes(BlockDriverState *bs)
>        * ZERO_CREATE is not viable.  If the current layer is smaller
>        * than the backing layer, truncation may expose backing data,
>        * restricting ZERO_TRUNCATE; treat failure to query size in the
> -     * same manner.  Otherwise, we can trust the driver.
> +     * same manner.  For ZERO_OPEN, we insist that both backing and
> +     * current layer report the bit.
>        */
> -
>       if (bs->backing) {

Spurious line deletion caused by rebasing.
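
For anyone skimming, the rule described in that comment can be modeled
roughly as below.  This is a sketch, not the code in block.c:
BDRV_ZERO_TRUNCATE = 0x2 is taken from the hunk further down, the other
two flag values are assumptions, and known_zeroes() with its parameters
is a made-up stand-in for the real function.

```python
BDRV_ZERO_CREATE = 0x1      # assumed value
BDRV_ZERO_TRUNCATE = 0x2    # value shown in include/block/block.h hunk
BDRV_ZERO_OPEN = 0x4        # assumed value


def known_zeroes(driver_mask, backing_mask=None, size=None,
                 backing_size=None):
    """Combine a driver's answer with its backing layer (toy model)."""
    mask = driver_mask
    if backing_mask is None:
        # No backing layer: trust the driver.
        return mask
    # ZERO_CREATE never survives a backing layer.
    mask &= ~BDRV_ZERO_CREATE
    # Growing a layer that is smaller than its backing file may expose
    # backing data; an unknown size is treated the same way.
    if size is None or backing_size is None or size < backing_size:
        mask &= ~BDRV_ZERO_TRUNCATE
    # ZERO_OPEN needs both layers to report the bit.
    if not (backing_mask & BDRV_ZERO_OPEN):
        mask &= ~BDRV_ZERO_OPEN
    return mask
```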


> +++ b/include/block/block.h
> @@ -105,6 +105,16 @@ typedef enum {
>        * for drivers that set .bdrv_co_truncate.
>        */
>       BDRV_ZERO_TRUNCATE      = 0x2,
> +
> +    /*
> +     * bdrv_known_zeroes() should include this bit if an image is
> +     * known to read as all zeroes when first opened; this bit should
> +     * not be relied on after any writes to the image.  This can be
> +     * set even if BDRV_ZERO_INIT is clear, but should only be set if

Rebasing snafu - I renamed that bit BDRV_ZERO_CREATE in patch 9.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH 13/17] qcow2: Add new autoclear feature for all zero image
  2020-01-31 17:44 ` [PATCH 13/17] qcow2: Add new autoclear feature for all zero image Eric Blake
@ 2020-02-03 17:45   ` Vladimir Sementsov-Ogievskiy
  2020-02-04 13:12     ` Eric Blake
  0 siblings, 1 reply; 73+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-03 17:45 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: david.edmondson, Kevin Wolf, Markus Armbruster, qemu-block, mreitz

31.01.2020 20:44, Eric Blake wrote:
> With the recent introduction of BDRV_ZERO_OPEN, we can optimize
> various qemu-img operations if we know the destination starts life
> with all zero content.  For an image with no cluster allocations and
> no backing file, this was already trivial with BDRV_ZERO_CREATE; but
> for a fully preallocated image, it does not scale to crawl through the
> entire L1/L2 tree to see if every cluster is currently marked as a
> zero cluster.  But it is quite easy to add an autoclear bit to the
> qcow2 file itself: the bit will be set after newly creating an image
> or after qcow2_make_empty, and cleared on any other modification
> (including by an older qemu that doesn't recognize the bit).
> 
> This patch documents the new bit, independently of implementing the
> places in code that should set it (which means that for bisection
> purposes, it is safer to still mask the bit out when opening an image
> with the bit set).
> 
> A few iotests have updated output due to the larger number of named
> header features.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> 
> ---
> RFC: As defined in this patch, I defined the bit to be clear if any
> cluster defers to a backing file. But the block layer would handle
> things just fine if we instead allowed the bit to be set if all
> clusters allocated in this image are zero, even if there are other
> clusters not allocated.  Or maybe we want TWO bits: one if all
> clusters allocated here are known zero, and a second if we know that
> there are any clusters that defer to a backing image.
> ---
>   block/qcow2.c              |  9 +++++++++
>   block/qcow2.h              |  3 +++
>   docs/interop/qcow2.txt     | 12 +++++++++++-
>   qapi/block-core.json       |  4 ++++
>   tests/qemu-iotests/031.out |  8 ++++----
>   tests/qemu-iotests/036.out |  4 ++--
>   tests/qemu-iotests/061.out | 14 +++++++-------
>   7 files changed, 40 insertions(+), 14 deletions(-)
> 
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 9f2371925737..20cce9410c84 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -2859,6 +2859,11 @@ int qcow2_update_header(BlockDriverState *bs)
>                   .bit  = QCOW2_AUTOCLEAR_DATA_FILE_RAW_BITNR,
>                   .name = "raw external data",
>               },
> +            {
> +                .type = QCOW2_FEAT_TYPE_AUTOCLEAR,
> +                .bit  = QCOW2_AUTOCLEAR_ALL_ZERO_BITNR,
> +                .name = "all zero",
> +            },
>           };
> 
>           ret = header_ext_add(buf, QCOW2_EXT_MAGIC_FEATURE_TABLE,
> @@ -4874,6 +4879,10 @@ static ImageInfoSpecific *qcow2_get_specific_info(BlockDriverState *bs,
>               .corrupt            = s->incompatible_features &
>                                     QCOW2_INCOMPAT_CORRUPT,
>               .has_corrupt        = true,
> +            .all_zero           = s->autoclear_features &
> +                                  QCOW2_AUTOCLEAR_ALL_ZERO,
> +            .has_all_zero       = s->autoclear_features &
> +                                  QCOW2_AUTOCLEAR_ALL_ZERO,
>               .refcount_bits      = s->refcount_bits,
>               .has_bitmaps        = !!bitmaps,
>               .bitmaps            = bitmaps,
> diff --git a/block/qcow2.h b/block/qcow2.h
> index 094212623257..6fc2d323d753 100644
> --- a/block/qcow2.h
> +++ b/block/qcow2.h
> @@ -237,11 +237,14 @@ enum {
>   enum {
>       QCOW2_AUTOCLEAR_BITMAPS_BITNR       = 0,
>       QCOW2_AUTOCLEAR_DATA_FILE_RAW_BITNR = 1,
> +    QCOW2_AUTOCLEAR_ALL_ZERO_BITNR      = 2,
>       QCOW2_AUTOCLEAR_BITMAPS             = 1 << QCOW2_AUTOCLEAR_BITMAPS_BITNR,
>       QCOW2_AUTOCLEAR_DATA_FILE_RAW       = 1 << QCOW2_AUTOCLEAR_DATA_FILE_RAW_BITNR,
> +    QCOW2_AUTOCLEAR_ALL_ZERO            = 1 << QCOW2_AUTOCLEAR_ALL_ZERO_BITNR,
> 
>       QCOW2_AUTOCLEAR_MASK                = QCOW2_AUTOCLEAR_BITMAPS
>                                           | QCOW2_AUTOCLEAR_DATA_FILE_RAW,
> +    /* TODO: Add _ALL_ZERO to _MASK once it is handled correctly */
>   };
> 
>   enum qcow2_discard_type {
> diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
> index 8510d74c8079..d435363a413c 100644
> --- a/docs/interop/qcow2.txt
> +++ b/docs/interop/qcow2.txt
> @@ -153,7 +153,17 @@ in the description of a field.
>                                   File bit (incompatible feature bit 1) is also
>                                   set.
> 
> -                    Bits 2-63:  Reserved (set to 0)
> +                    Bit 2:      All zero image bit
> +                                If this bit is set, the entire image reads
> +                                as all zeroes. This can be useful for
> +                                detecting just-created images even when
> +                                clusters are preallocated, which in turn
> +                                can be used to optimize image copying.
> +
> +                                This bit should not be set if any cluster
> +                                in the image defers to a backing file.

Hmm. The term "defers to a backing file" is not defined in the spec, and, as I
understand it, cannot be defined by design: the backing file may be added,
removed, or changed dynamically, and the qcow2 driver will not know about it.
So the only way to be sure that clusters do not defer to a backing file is to
make them ZERO clusters (not UNALLOCATED). But this is inefficient, as we would
have to allocate all L2 tables.

So I think it is better to define this flag as "all allocated clusters are zero".

Hmm, interesting: in the qcow2 spec, "allocated" means allocated on disk with a
host offset. So a ZERO cluster is actually an unallocated cluster with bit 0 of
its L2 entry set to 1. On the other hand, the qemu block layer considers ZERO
clusters "allocated" (from the point of view of the backing chain).

So, if we define it as "all allocated clusters are zero", we are done: the
other clusters are either unallocated and MAY refer to the backing file, so we
can say nothing about their read-as-zero status at the level of the qcow2
spec, or unallocated with the zero bit set, which are normal ZERO clusters.

So, at the level of the qcow2 driver, I think it is better to consider only
this image. Still, we can implement a generic bdrv_is_all_zeros, which will
check all layers (or at least check that bs->backing is NULL).
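
To make the two meanings of "allocated" concrete, here is a minimal C sketch,
not the qcow2 driver's code. It assumes the standard (non-compressed) L2 entry
layout from the qcow2 v3 spec: bit 0 is the zero flag and bits 9-55 hold the
host cluster offset. The names ClusterKind, classify(), spec_allocated() and
block_layer_allocated() are hypothetical, and the sketch ignores external data
files and version 2 images.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout of a qcow2 v3 L2 entry (standard cluster descriptor). */
#define L2E_ZERO_FLAG    (1ULL << 0)             /* cluster reads as zeroes */
#define L2E_OFFSET_MASK  0x00fffffffffffe00ULL   /* bits 9-55: host offset  */
#define L2E_COMPRESSED   (1ULL << 62)            /* compressed cluster      */

typedef enum {
    CLUSTER_UNALLOCATED,  /* no offset, no zero flag: may defer to backing */
    CLUSTER_ZERO_PLAIN,   /* zero flag set, no host offset                 */
    CLUSTER_ZERO_ALLOC,   /* zero flag set, host offset preallocated       */
    CLUSTER_DATA,         /* contents must be read from this image         */
} ClusterKind;

static ClusterKind classify(uint64_t l2_entry)
{
    /* Bit 0 has a different meaning in compressed entries, so test first. */
    if (l2_entry & L2E_COMPRESSED) {
        return CLUSTER_DATA;
    }
    if (l2_entry & L2E_ZERO_FLAG) {
        return (l2_entry & L2E_OFFSET_MASK) ? CLUSTER_ZERO_ALLOC
                                            : CLUSTER_ZERO_PLAIN;
    }
    return (l2_entry & L2E_OFFSET_MASK) ? CLUSTER_DATA
                                        : CLUSTER_UNALLOCATED;
}

/* "Allocated" in the qcow2 spec sense: the cluster has a host offset. */
static bool spec_allocated(ClusterKind k)
{
    return k == CLUSTER_ZERO_ALLOC || k == CLUSTER_DATA;
}

/* "Allocated" in the block layer sense: contents come from this layer. */
static bool block_layer_allocated(ClusterKind k)
{
    return k != CLUSTER_UNALLOCATED;
}
```

Under this model, an image satisfies "all allocated clusters are zero" when no
entry classifies as CLUSTER_DATA; CLUSTER_UNALLOCATED entries say nothing
either way.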


> +
> +                    Bits 3-63:  Reserved (set to 0)
> 
>            96 -  99:  refcount_order
>                       Describes the width of a reference count block entry (width
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index ef94a296868f..af837ed5af33 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -71,6 +71,9 @@
>   # @corrupt: true if the image has been marked corrupt; only valid for
>   #           compat >= 1.1 (since 2.2)
>   #
> +# @all-zero: present and true only if the image is known to read as all
> +#            zeroes (since 5.0)
> +#
>   # @refcount-bits: width of a refcount entry in bits (since 2.3)
>   #
>   # @encrypt: details about encryption parameters; only set if image
> @@ -87,6 +90,7 @@
>         '*data-file-raw': 'bool',
>         '*lazy-refcounts': 'bool',
>         '*corrupt': 'bool',
> +      '*all-zero': 'bool',
>         'refcount-bits': 'int',
>         '*encrypt': 'ImageInfoSpecificQCow2Encryption',
>         '*bitmaps': ['Qcow2BitmapInfo']
> diff --git a/tests/qemu-iotests/031.out b/tests/qemu-iotests/031.out
> index 46f97c5a4ea4..bb1afa7b87f6 100644
> --- a/tests/qemu-iotests/031.out
> +++ b/tests/qemu-iotests/031.out
> @@ -117,7 +117,7 @@ header_length             104
> 
>   Header extension:
>   magic                     0x6803f857
> -length                    288
> +length                    336
>   data                      <binary>
> 
>   Header extension:
> @@ -150,7 +150,7 @@ header_length             104
> 
>   Header extension:
>   magic                     0x6803f857
> -length                    288
> +length                    336
>   data                      <binary>
> 
>   Header extension:
> @@ -164,7 +164,7 @@ No errors were found on the image.
> 
>   magic                     0x514649fb
>   version                   3
> -backing_file_offset       0x1d8
> +backing_file_offset       0x208
>   backing_file_size         0x17
>   cluster_bits              16
>   size                      67108864
> @@ -188,7 +188,7 @@ data                      'host_device'
> 
>   Header extension:
>   magic                     0x6803f857
> -length                    288
> +length                    336
>   data                      <binary>
> 
>   Header extension:
> diff --git a/tests/qemu-iotests/036.out b/tests/qemu-iotests/036.out
> index 23b699ce0622..e409acf60e2b 100644
> --- a/tests/qemu-iotests/036.out
> +++ b/tests/qemu-iotests/036.out
> @@ -26,7 +26,7 @@ compatible_features       []
>   autoclear_features        [63]
>   Header extension:
>   magic                     0x6803f857
> -length                    288
> +length                    336
>   data                      <binary>
> 
> 
> @@ -38,7 +38,7 @@ compatible_features       []
>   autoclear_features        []
>   Header extension:
>   magic                     0x6803f857
> -length                    288
> +length                    336
>   data                      <binary>
> 
>   *** done
> diff --git a/tests/qemu-iotests/061.out b/tests/qemu-iotests/061.out
> index 413cc4e0f4ab..d873f79bb606 100644
> --- a/tests/qemu-iotests/061.out
> +++ b/tests/qemu-iotests/061.out
> @@ -26,7 +26,7 @@ header_length             104
> 
>   Header extension:
>   magic                     0x6803f857
> -length                    288
> +length                    336
>   data                      <binary>
> 
>   magic                     0x514649fb
> @@ -84,7 +84,7 @@ header_length             104
> 
>   Header extension:
>   magic                     0x6803f857
> -length                    288
> +length                    336
>   data                      <binary>
> 
>   magic                     0x514649fb
> @@ -140,7 +140,7 @@ header_length             104
> 
>   Header extension:
>   magic                     0x6803f857
> -length                    288
> +length                    336
>   data                      <binary>
> 
>   ERROR cluster 5 refcount=0 reference=1
> @@ -195,7 +195,7 @@ header_length             104
> 
>   Header extension:
>   magic                     0x6803f857
> -length                    288
> +length                    336
>   data                      <binary>
> 
>   magic                     0x514649fb
> @@ -264,7 +264,7 @@ header_length             104
> 
>   Header extension:
>   magic                     0x6803f857
> -length                    288
> +length                    336
>   data                      <binary>
> 
>   read 65536/65536 bytes at offset 44040192
> @@ -298,7 +298,7 @@ header_length             104
> 
>   Header extension:
>   magic                     0x6803f857
> -length                    288
> +length                    336
>   data                      <binary>
> 
>   ERROR cluster 5 refcount=0 reference=1
> @@ -327,7 +327,7 @@ header_length             104
> 
>   Header extension:
>   magic                     0x6803f857
> -length                    288
> +length                    336
>   data                      <binary>
> 
>   read 131072/131072 bytes at offset 0
> 


-- 
Best regards,
Vladimir



* Re: [PATCH 13/17] qcow2: Add new autoclear feature for all zero image
  2020-02-03 17:45   ` Vladimir Sementsov-Ogievskiy
@ 2020-02-04 13:12     ` Eric Blake
  2020-02-04 13:29       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 73+ messages in thread
From: Eric Blake @ 2020-02-04 13:12 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: david.edmondson, Kevin Wolf, Markus Armbruster, qemu-block, mreitz

On 2/3/20 11:45 AM, Vladimir Sementsov-Ogievskiy wrote:
> 31.01.2020 20:44, Eric Blake wrote:
>> With the recent introduction of BDRV_ZERO_OPEN, we can optimize
>> various qemu-img operations if we know the destination starts life
>> with all zero content.  For an image with no cluster allocations and
>> no backing file, this was already trivial with BDRV_ZERO_CREATE; but
>> for a fully preallocated image, it does not scale to crawl through the
>> entire L1/L2 tree to see if every cluster is currently marked as a
>> zero cluster.  But it is quite easy to add an autoclear bit to the
>> qcow2 file itself: the bit will be set after newly creating an image
>> or after qcow2_make_empty, and cleared on any other modification
>> (including by an older qemu that doesn't recognize the bit).
>>
>> This patch documents the new bit, independently of implementing the
>> places in code that should set it (which means that for bisection
>> purposes, it is safer to still mask the bit out when opening an image
>> with the bit set).
>>
>> A few iotests have updated output due to the larger number of named
>> header features.
>>
>> Signed-off-by: Eric Blake <eblake@redhat.com>
>>
>> ---
>> RFC: As defined in this patch, I defined the bit to be clear if any
>> cluster defers to a backing file. But the block layer would handle
>> things just fine if we instead allowed the bit to be set if all
>> clusters allocated in this image are zero, even if there are other
>> clusters not allocated.  Or maybe we want TWO bits: one if all
>> clusters allocated here are known zero, and a second if we know that
>> there are any clusters that defer to a backing image.

>> -                    Bits 2-63:  Reserved (set to 0)
>> +                    Bit 2:      All zero image bit
>> +                                If this bit is set, the entire image reads
>> +                                as all zeroes. This can be useful for
>> +                                detecting just-created images even when
>> +                                clusters are preallocated, which in turn
>> +                                can be used to optimize image copying.
>> +
>> +                                This bit should not be set if any cluster
>> +                                in the image defers to a backing file.
> 
> Hmm. The term "defers to a backing file" not defined in the spec. And, as I
> understand, can't be defined by design. Backing file may be added/removed/changed
> dynamically, and qcow2 driver will not know about it. So, the only way to
> be sure that clusters are not defer to backing file is to make them
> ZERO clusters (not UNALLOCATED). But this is inefficient, as we'll have to
> allocated all L2 tables.
> 
> So, I think better to define this flag as "all allocated clusters are zero".

That was precisely the topic of my RFC question.

I _do_ think it is simpler to report that 'all clusters where content 
comes from _this_ image read as zero', leaving unallocated clusters as 
zero only if 1. there is no backing image, or 2. the backing image also 
reads as all zero (recursing as needed).  I'll spin v2 of these patches 
along those lines, although I'm hoping for more review on the rest of 
the series, first.

> 
> Hmm interesting, in qcow2 spec "allocated" means allocated on disk and has
> offset. So, ZERO cluster is actually unallocated cluster, with bit 0 of
> L2 entry set to 1. On the other hand, qemu block layer considers ZERO
> clusters as "allocated" (in POV of backing-chain).

I really want the definition to be 'any cluster whose contents come from 
this layer' (the qemu-io definition of allocated, not necessarily the 
qcow2 definition of allocated), which picks up BOTH types of qcow2 zero 
clusters (those preallocated but marked 0, where the contents of the 
allocated area are indeterminate but never read, and those unallocated 
but marked 0 which do not defer to the backing layer).  Whether or not 
the cluster is allocated is less important than whether the image reads 
as 0 at that cluster.

But I think that you are right that an alternative definition of 'all 
allocated clusters are zero' will give the same results when crawling 
through the backing chain to learn if the overall image reads as zero, 
and that's all the more that we can expect out of this bit.

> 
> So, if we define it as "all allocated clusters are zero", we are done:
> other clusters are either unallocated and MAY refer to backing, so we
> can say nothing about their read-as-zero status at the level of qcow2
> spec, or unallocated with zero-bit set, which are normal ZERO clusters.
> 
> So, on the level of qcow2 driver I think it's better consider only this
> image. Still, we can implement generic bdrv_is_all_zeros, which will
> check or layers (or at least, check that bs->backing is NULL).

The earlier part of this series that renamed bdrv_has_zero_init() to 
bdrv_known_zeroes() does just that: it already handles recursion 
through the backing chain, and insists that an image is all zeroes with 
respect to BDRV_ZERO_OPEN only if all layers of the backing chain agree.
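
As a rough illustration of that recursion, the following C sketch models a
backing chain with a hypothetical Layer struct; it is not the code from the
series. Only the value 0x2 for the truncate flag appears in the quoted hunk;
the other two flag values, the field names, and known_zeroes() itself are
assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    ZERO_CREATE   = 0x1,  /* assumed value */
    ZERO_TRUNCATE = 0x2,  /* matches BDRV_ZERO_TRUNCATE in the quoted hunk */
    ZERO_OPEN     = 0x4,  /* assumed value */
};

typedef struct Layer {
    int driver_zeroes;            /* bits the driver reports for this layer */
    int64_t size;                 /* negative if the size query failed      */
    const struct Layer *backing;  /* NULL when there is no backing file     */
} Layer;

static int known_zeroes(const Layer *l)
{
    int mask = ZERO_CREATE | ZERO_TRUNCATE | ZERO_OPEN;

    if (!l) {
        return 0;
    }
    if (l->backing) {
        /* With a backing file, creation does not yield zero contents. */
        mask &= ~ZERO_CREATE;
        /* Growing a layer smaller than its backing may expose old data;
         * treat a failed size query the same way. */
        if (l->size < 0 || l->backing->size < 0 ||
            l->size < l->backing->size) {
            mask &= ~ZERO_TRUNCATE;
        }
        /* ZERO_OPEN needs both this layer and the backing chain to agree. */
        if (!(known_zeroes(l->backing) & ZERO_OPEN)) {
            mask &= ~ZERO_OPEN;
        }
    }
    return l->driver_zeroes & mask;
}
```

A two-layer chain reports ZERO_OPEN only when both layers set it, so a single
non-zero base image is enough to clear the bit for everything stacked on top.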

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH 13/17] qcow2: Add new autoclear feature for all zero image
  2020-02-04 13:12     ` Eric Blake
@ 2020-02-04 13:29       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 73+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-04 13:29 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: david.edmondson, Kevin Wolf, Markus Armbruster, qemu-block, mreitz

04.02.2020 16:12, Eric Blake wrote:
> On 2/3/20 11:45 AM, Vladimir Sementsov-Ogievskiy wrote:
>> 31.01.2020 20:44, Eric Blake wrote:
>>> With the recent introduction of BDRV_ZERO_OPEN, we can optimize
>>> various qemu-img operations if we know the destination starts life
>>> with all zero content.  For an image with no cluster allocations and
>>> no backing file, this was already trivial with BDRV_ZERO_CREATE; but
>>> for a fully preallocated image, it does not scale to crawl through the
>>> entire L1/L2 tree to see if every cluster is currently marked as a
>>> zero cluster.  But it is quite easy to add an autoclear bit to the
>>> qcow2 file itself: the bit will be set after newly creating an image
>>> or after qcow2_make_empty, and cleared on any other modification
>>> (including by an older qemu that doesn't recognize the bit).
>>>
>>> This patch documents the new bit, independently of implementing the
>>> places in code that should set it (which means that for bisection
>>> purposes, it is safer to still mask the bit out when opening an image
>>> with the bit set).
>>>
>>> A few iotests have updated output due to the larger number of named
>>> header features.
>>>
>>> Signed-off-by: Eric Blake <eblake@redhat.com>
>>>
>>> ---
>>> RFC: As defined in this patch, I defined the bit to be clear if any
>>> cluster defers to a backing file. But the block layer would handle
>>> things just fine if we instead allowed the bit to be set if all
>>> clusters allocated in this image are zero, even if there are other
>>> clusters not allocated.  Or maybe we want TWO bits: one if all
>>> clusters allocated here are known zero, and a second if we know that
>>> there are any clusters that defer to a backing image.
> 
>>> -                    Bits 2-63:  Reserved (set to 0)
>>> +                    Bit 2:      All zero image bit
>>> +                                If this bit is set, the entire image reads
>>> +                                as all zeroes. This can be useful for
>>> +                                detecting just-created images even when
>>> +                                clusters are preallocated, which in turn
>>> +                                can be used to optimize image copying.
>>> +
>>> +                                This bit should not be set if any cluster
>>> +                                in the image defers to a backing file.
>>
>> Hmm. The term "defers to a backing file" not defined in the spec. And, as I
>> understand, can't be defined by design. Backing file may be added/removed/changed
>> dynamically, and qcow2 driver will not know about it. So, the only way to
>> be sure that clusters are not defer to backing file is to make them
>> ZERO clusters (not UNALLOCATED). But this is inefficient, as we'll have to
>> allocated all L2 tables.
>>
>> So, I think better to define this flag as "all allocated clusters are zero".
> 
> That was precisely the topic of my RFC question.

Yes, and this is what I think about it :)  It looks like I worded it as if
I had not seen the RFC and had taken this as the final patch; sorry for
that.

> 
> I _do_ think it is simpler to report that 'all clusters where content comes from _this_ image read as zero', leaving unallocated clusters as zero only if 1. there is no backing image, or 2. the backing image also reads as all zero (recursing as needed).  I'll spin v2 of these patches along those lines, although I'm hoping for more review on the rest of the series, first.

Still, I'm not sure that it makes sense to consider the backing file at all.
In my view, the backing file is up to the user. The user may load the backing
file that is specified in the qcow2 header but, at the same time, may choose
some other backing file. The backing file is an "external" thing, so it may be
better not to rely on it.

> 
>>
>> Hmm interesting, in qcow2 spec "allocated" means allocated on disk and has
>> offset. So, ZERO cluster is actually unallocated cluster, with bit 0 of
>> L2 entry set to 1. On the other hand, qemu block layer considers ZERO
>> clusters as "allocated" (in POV of backing-chain).
> 
> I really want the definition to be 'any cluster whose contents come from this layer' (the qemu-io definition of allocated, not necessarily the qcow2 definition of allocated), which picks up BOTH types of qcow2 zero clusters (those preallocated but marked 0, where the contents of the allocated area are indeterminate but never read, and those unallocated but marked 0 which do not defer to the backing layer).  Whether or not the cluster is allocated is less important than whether the image reads as 0 at that cluster.
> 
> But I think that you are right that an alternative definition of 'all allocated clusters are zero' will give the same results when crawling through the backing chain to learn if the overall image reads as zero, and that's all the more that we can expect out of this bit.

Yes, the two definitions are equivalent, because unallocated clusters marked as ZERO read as zero anyway.

> 
>>
>> So, if we define it as "all allocated clusters are zero", we are done:
>> other clusters are either unallocated and MAY refer to backing, so we
>> can say nothing about their read-as-zero status at the level of qcow2
>> spec, or unallocated with zero-bit set, which are normal ZERO clusters.
>>
>> So, on the level of qcow2 driver I think it's better consider only this
>> image. Still, we can implement generic bdrv_is_all_zeros, which will
>> check or layers (or at least, check that bs->backing is NULL).
> 
> The earlier parts of this series which renamed bdrv_has_zero_init() into bdrv_known_zeroes() does just that - it already handles recursion through the backing chain, and insists that an image is all zeroes with respect to BDRV_ZERO_OPEN only if all layers of the backing chain agree.
> 

Great. I'll look at other patches soon.


-- 
Best regards,
Vladimir



* Re: [PATCH 01/17] qcow2: Comment typo fixes
  2020-01-31 17:44 ` [PATCH 01/17] qcow2: Comment typo fixes Eric Blake
@ 2020-02-04 14:12   ` Vladimir Sementsov-Ogievskiy
  2020-02-09 19:34   ` Alberto Garcia
  1 sibling, 0 replies; 73+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-04 14:12 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block, mreitz

31.01.2020 20:44, Eric Blake wrote:
> Various trivial typos noticed while working on this file.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir



* Re: [PATCH 02/17] qcow2: List autoclear bit names in header
  2020-01-31 17:44 ` [PATCH 02/17] qcow2: List autoclear bit names in header Eric Blake
@ 2020-02-04 14:26   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 73+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-04 14:26 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block, mreitz

31.01.2020 20:44, Eric Blake wrote:
> The feature table is supposed to advertise the name of all feature
> bits that we support; however, we forgot to update the table for
> autoclear bits.  While at it, move the table to read-only memory in
> code, and tweak the qcow2 spec to name the second autoclear bit.
> Update iotests that are affected by the longer header length.
> 
> Fixes: 88ddffae
> Fixes: 93c24936
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>   block/qcow2.c              | 12 +++++++++++-
>   docs/interop/qcow2.txt     |  3 ++-
>   tests/qemu-iotests/031.out |  8 ++++----
>   tests/qemu-iotests/036.out |  4 ++--
>   tests/qemu-iotests/061.out | 14 +++++++-------
>   5 files changed, 26 insertions(+), 15 deletions(-)
> 
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 30fd3d13032a..d3e7709ac2b4 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -2821,7 +2821,7 @@ int qcow2_update_header(BlockDriverState *bs)
> 
>       /* Feature table */
>       if (s->qcow_version >= 3) {
> -        Qcow2Feature features[] = {
> +        static const Qcow2Feature features[] = {
>               {
>                   .type = QCOW2_FEAT_TYPE_INCOMPATIBLE,
>                   .bit  = QCOW2_INCOMPAT_DIRTY_BITNR,
> @@ -2842,6 +2842,16 @@ int qcow2_update_header(BlockDriverState *bs)
>                   .bit  = QCOW2_COMPAT_LAZY_REFCOUNTS_BITNR,
>                   .name = "lazy refcounts",
>               },
> +            {
> +                .type = QCOW2_FEAT_TYPE_AUTOCLEAR,
> +                .bit  = QCOW2_AUTOCLEAR_BITMAPS_BITNR,
> +                .name = "consistent bitmaps",

Hmm, what do you mean by "consistent"? Each bitmap has its own in_use flag, and may be
"inconsistent" on its own.

I'd prefer to name it just "bitmaps", as the extension is named "Bitmaps". With this change:
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

> +            },
> +            {
> +                .type = QCOW2_FEAT_TYPE_AUTOCLEAR,
> +                .bit  = QCOW2_AUTOCLEAR_DATA_FILE_RAW_BITNR,
> +                .name = "raw external data",
> +            },
>           };
> 
>           ret = header_ext_add(buf, QCOW2_EXT_MAGIC_FEATURE_TABLE,
> diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
> index af5711e53371..8510d74c8079 100644
> --- a/docs/interop/qcow2.txt
> +++ b/docs/interop/qcow2.txt
> @@ -138,7 +138,8 @@ in the description of a field.
>                                   bit is unset, the bitmaps extension data must be
>                                   considered inconsistent.
> 
> -                    Bit 1:      If this bit is set, the external data file can
> +                    Bit 1:      Raw external data bit
> +                                If this bit is set, the external data file can
>                                   be read as a consistent standalone raw image
>                                   without looking at the qcow2 metadata.
> 
> diff --git a/tests/qemu-iotests/031.out b/tests/qemu-iotests/031.out
> index d535e407bc30..46f97c5a4ea4 100644
> --- a/tests/qemu-iotests/031.out
> +++ b/tests/qemu-iotests/031.out
> @@ -117,7 +117,7 @@ header_length             104
> 
>   Header extension:
>   magic                     0x6803f857
> -length                    192
> +length                    288
>   data                      <binary>
> 
>   Header extension:
> @@ -150,7 +150,7 @@ header_length             104
> 
>   Header extension:
>   magic                     0x6803f857
> -length                    192
> +length                    288
>   data                      <binary>
> 
>   Header extension:
> @@ -164,7 +164,7 @@ No errors were found on the image.
> 
>   magic                     0x514649fb
>   version                   3
> -backing_file_offset       0x178
> +backing_file_offset       0x1d8
>   backing_file_size         0x17
>   cluster_bits              16
>   size                      67108864
> @@ -188,7 +188,7 @@ data                      'host_device'
> 
>   Header extension:
>   magic                     0x6803f857
> -length                    192
> +length                    288
>   data                      <binary>
> 
>   Header extension:
> diff --git a/tests/qemu-iotests/036.out b/tests/qemu-iotests/036.out
> index 0b52b934e115..23b699ce0622 100644
> --- a/tests/qemu-iotests/036.out
> +++ b/tests/qemu-iotests/036.out
> @@ -26,7 +26,7 @@ compatible_features       []
>   autoclear_features        [63]
>   Header extension:
>   magic                     0x6803f857
> -length                    192
> +length                    288
>   data                      <binary>
> 
> 
> @@ -38,7 +38,7 @@ compatible_features       []
>   autoclear_features        []
>   Header extension:
>   magic                     0x6803f857
> -length                    192
> +length                    288
>   data                      <binary>
> 
>   *** done
> diff --git a/tests/qemu-iotests/061.out b/tests/qemu-iotests/061.out
> index 8b3091a412bc..413cc4e0f4ab 100644
> --- a/tests/qemu-iotests/061.out
> +++ b/tests/qemu-iotests/061.out
> @@ -26,7 +26,7 @@ header_length             104
> 
>   Header extension:
>   magic                     0x6803f857
> -length                    192
> +length                    288
>   data                      <binary>
> 
>   magic                     0x514649fb
> @@ -84,7 +84,7 @@ header_length             104
> 
>   Header extension:
>   magic                     0x6803f857
> -length                    192
> +length                    288
>   data                      <binary>
> 
>   magic                     0x514649fb
> @@ -140,7 +140,7 @@ header_length             104
> 
>   Header extension:
>   magic                     0x6803f857
> -length                    192
> +length                    288
>   data                      <binary>
> 
>   ERROR cluster 5 refcount=0 reference=1
> @@ -195,7 +195,7 @@ header_length             104
> 
>   Header extension:
>   magic                     0x6803f857
> -length                    192
> +length                    288
>   data                      <binary>
> 
>   magic                     0x514649fb
> @@ -264,7 +264,7 @@ header_length             104
> 
>   Header extension:
>   magic                     0x6803f857
> -length                    192
> +length                    288
>   data                      <binary>
> 
>   read 65536/65536 bytes at offset 44040192
> @@ -298,7 +298,7 @@ header_length             104
> 
>   Header extension:
>   magic                     0x6803f857
> -length                    192
> +length                    288
>   data                      <binary>
> 
>   ERROR cluster 5 refcount=0 reference=1
> @@ -327,7 +327,7 @@ header_length             104
> 
>   Header extension:
>   magic                     0x6803f857
> -length                    192
> +length                    288
>   data                      <binary>
> 
>   read 131072/131072 bytes at offset 0
> 


-- 
Best regards,
Vladimir



* Re: [PATCH 03/17] qcow2: Avoid feature name extension on small cluster size
  2020-01-31 17:44 ` [PATCH 03/17] qcow2: Avoid feature name extension on small cluster size Eric Blake
@ 2020-02-04 14:39   ` Vladimir Sementsov-Ogievskiy
  2020-02-09 19:28   ` Alberto Garcia
  1 sibling, 0 replies; 73+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-04 14:39 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block, mreitz

31.01.2020 20:44, Eric Blake wrote:
> As the feature name table can be quite large (over 9k if all 64 bits
> of all three feature fields have names; a mere 8 features leaves only
> 8 bytes for a backing file name in a 512-byte cluster), it is unwise
> to emit this optional header in images with small cluster sizes.
> 
> Update iotest 036 to skip running on small cluster sizes; meanwhile,
> note that iotest 061 never passed on alternative cluster sizes
> (however, I limited this patch to tests with output affected by adding
> feature names, rather than auditing for other tests that are not
> robust to alternative cluster sizes).
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir



* Re: [PATCH 04/17] block: Improve documentation of .bdrv_has_zero_init
  2020-01-31 17:44 ` [PATCH 04/17] block: Improve documentation of .bdrv_has_zero_init Eric Blake
@ 2020-02-04 15:03   ` Vladimir Sementsov-Ogievskiy
  2020-02-04 15:16     ` Eric Blake
  0 siblings, 1 reply; 73+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-04 15:03 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block, mreitz

31.01.2020 20:44, Eric Blake wrote:
> Several drivers supply .bdrv_has_zero_init that returns 1, but lack
> the .bdrv_has_zero_init_truncate callback (parallels and qed outright,
> vdi in some scenarios).  A literal reading of the existing
> documentation says such drivers are broken, because
> bdrv_has_zero_init_truncate() defaults to zero if the callback is
> missing; but in practice, the tie between the two functions is only
> relevant when truncate is supported.  Clarify the documentation to
> make it obvious that this is okay.
> 
> Fixes: 1dcaf527
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>   include/block/block_int.h | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 640fb82c789e..77ab45dc87cf 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -444,7 +444,8 @@ struct BlockDriver {
>       /*
>        * Returns 1 if newly created images are guaranteed to contain only
>        * zeros, 0 otherwise.
> -     * Must return 0 if .bdrv_has_zero_init_truncate() returns 0.
> +     * Must return 0 if .bdrv_co_truncate is set and
> +     * .bdrv_has_zero_init_truncate() returns 0.
>        */
>       int (*bdrv_has_zero_init)(BlockDriverState *bs);
> 

I'm not sure: shouldn't 1dcaf527 instead be fixed by adding all the needed bdrv_has_zero_init_truncate callbacks?

(Sorry, I started digging into the code and patches around bdrv_has_zero_init_truncate and got tired :(
  I hope Max will comment on this.)

-- 
Best regards,
Vladimir



* Re: [PATCH 07/17] gluster: Drop useless has_zero_init callback
  2020-01-31 17:44 ` [PATCH 07/17] gluster: Drop useless has_zero_init callback Eric Blake
@ 2020-02-04 15:06   ` Vladimir Sementsov-Ogievskiy
  2020-02-10 18:21   ` Alberto Garcia
  2020-02-17  8:06   ` [GEDI] " Niels de Vos
  2 siblings, 0 replies; 73+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-04 15:06 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: david.edmondson, Kevin Wolf, open list:GLUSTER, qemu-block, mreitz

31.01.2020 20:44, Eric Blake wrote:
> block.c already defaults to 0 if we don't provide a callback; there's
> no need to write a callback that always fails.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir



* Re: [PATCH 08/17] sheepdog: Consistently set bdrv_has_zero_init_truncate
  2020-01-31 17:44 ` [PATCH 08/17] sheepdog: Consistently set bdrv_has_zero_init_truncate Eric Blake
@ 2020-02-04 15:09   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 73+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-04 15:09 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: Kevin Wolf, open list:Sheepdog, qemu-block, mreitz,
	david.edmondson, Liu Yuan

31.01.2020 20:44, Eric Blake wrote:
> block_int.h claims that .bdrv_has_zero_init must return 0 if
> .bdrv_has_zero_init_truncate does likewise; but this is violated if

Hmm, you changed this in patch 04.

> only the former callback is provided if .bdrv_co_truncate also exists.
> When adding the latter callback, it was mistakenly added to only one
> of the three possible sheepdog instantiations.
> 
> Fixes: 1dcaf527
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>   block/sheepdog.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/block/sheepdog.c b/block/sheepdog.c
> index cfa84338a2d6..522c16a93676 100644
> --- a/block/sheepdog.c
> +++ b/block/sheepdog.c
> @@ -3269,6 +3269,7 @@ static BlockDriver bdrv_sheepdog_tcp = {
>       .bdrv_co_create               = sd_co_create,
>       .bdrv_co_create_opts          = sd_co_create_opts,
>       .bdrv_has_zero_init           = bdrv_has_zero_init_1,
> +    .bdrv_has_zero_init_truncate  = bdrv_has_zero_init_1,
>       .bdrv_getlength               = sd_getlength,
>       .bdrv_get_allocated_file_size = sd_get_allocated_file_size,
>       .bdrv_co_truncate             = sd_co_truncate,
> @@ -3307,6 +3308,7 @@ static BlockDriver bdrv_sheepdog_unix = {
>       .bdrv_co_create               = sd_co_create,
>       .bdrv_co_create_opts          = sd_co_create_opts,
>       .bdrv_has_zero_init           = bdrv_has_zero_init_1,
> +    .bdrv_has_zero_init_truncate  = bdrv_has_zero_init_1,
>       .bdrv_getlength               = sd_getlength,
>       .bdrv_get_allocated_file_size = sd_get_allocated_file_size,
>       .bdrv_co_truncate             = sd_co_truncate,
> 


-- 
Best regards,
Vladimir



* Re: [PATCH 04/17] block: Improve documentation of .bdrv_has_zero_init
  2020-02-04 15:03   ` Vladimir Sementsov-Ogievskiy
@ 2020-02-04 15:16     ` Eric Blake
  0 siblings, 0 replies; 73+ messages in thread
From: Eric Blake @ 2020-02-04 15:16 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: david.edmondson, Kevin Wolf, qemu-block, mreitz

On 2/4/20 9:03 AM, Vladimir Sementsov-Ogievskiy wrote:
> 31.01.2020 20:44, Eric Blake wrote:
>> Several drivers supply .bdrv_has_zero_init that returns 1, but lack
>> the .bdrv_has_zero_init_truncate callback (parallels and qed outright,
>> vdi in some scenarios).  A literal reading of the existing
>> documentation says such drivers are broken, because
>> bdrv_has_zero_init_truncate() defaults to zero if the callback is
>> missing; but in practice, the tie between the two functions is only
>> relevant when truncate is supported.  Clarify the documentation to
>> make it obvious that this is okay.
>>
>> Fixes: 1dcaf527
>> Signed-off-by: Eric Blake <eblake@redhat.com>
>> ---
>>   include/block/block_int.h | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/block/block_int.h b/include/block/block_int.h
>> index 640fb82c789e..77ab45dc87cf 100644
>> --- a/include/block/block_int.h
>> +++ b/include/block/block_int.h
>> @@ -444,7 +444,8 @@ struct BlockDriver {
>>       /*
>>        * Returns 1 if newly created images are guaranteed to contain only
>>        * zeros, 0 otherwise.
>> -     * Must return 0 if .bdrv_has_zero_init_truncate() returns 0.
>> +     * Must return 0 if .bdrv_co_truncate is set and
>> +     * .bdrv_has_zero_init_truncate() returns 0.
>>        */
>>       int (*bdrv_has_zero_init)(BlockDriverState *bs);
>>
> 
> I'm not sure: shouldn't 1dcaf527 instead be fixed by adding all the needed 
> bdrv_has_zero_init_truncate callbacks?

That was my original thought. But looking at callers of 
bdrv_has_zero_init_truncate() shows that they all plan to perform 
bdrv_co_truncate(PREALLOC_MODE_OFF) and want to know if that will leak 
non-zero data; if you can't truncate, it doesn't matter what 
init_truncate() returns, but since init_truncate() defaults to 0, it 
shouldn't invalidate init() returning 1 - fixing the docs was easier 
than adding useless callbacks to parallels, qed, and vdi just to rip 
them back out again in patch 9.

As you noted later, sheepdog was the only driver that violated this rule 
(and it is fixed in patch 8).  I could reorder the series to get the bug 
fix in before the documentation change, if that would help.
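
The clarified rule can be written down as a small standalone check. This is only a sketch: SketchDriver and sketch_driver_is_consistent() are hypothetical names invented for illustration, and the struct reduces each callback to a boolean instead of using QEMU's real BlockDriver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a block driver (not QEMU's struct). */
typedef struct SketchDriver {
    bool can_truncate;        /* models .bdrv_co_truncate being set */
    bool zero_init;           /* models .bdrv_has_zero_init returning 1 */
    bool zero_init_truncate;  /* models .bdrv_has_zero_init_truncate returning 1;
                               * false also covers a missing callback, since
                               * the default is 0 */
} SketchDriver;

/*
 * Returns true if the driver obeys the clarified documentation:
 * zero_init must be 0 if truncation is supported and
 * zero_init_truncate is 0.  Without truncation support, the value of
 * zero_init_truncate does not matter.
 */
static bool sketch_driver_is_consistent(const SketchDriver *drv)
{
    assert(drv != NULL);
    if (!drv->can_truncate) {
        return true;
    }
    return !drv->zero_init || drv->zero_init_truncate;
}
```

Under this model, a driver that cannot truncate passes with zero_init alone, while a driver that can truncate but only sets zero_init (the sheepdog situation fixed in patch 8) fails the check.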

> 
> (Sorry, I started digging into the code and patches around 
> bdrv_has_zero_init_truncate and got tired :(
>   I hope Max will comment on this.)
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}
  2020-01-31 17:44 ` [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate} Eric Blake
@ 2020-02-04 15:35   ` Vladimir Sementsov-Ogievskiy
  2020-02-04 15:49     ` Eric Blake
  2020-02-04 17:42     ` Max Reitz
  2020-02-04 17:53   ` Max Reitz
  1 sibling, 2 replies; 73+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-04 15:35 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, open list:Sheepdog, qemu-block, Jeff Cody,
	Stefan Weil, Peter Lieven, Richard W.M. Jones, mreitz,
	david.edmondson, Stefan Hajnoczi, Denis V. Lunev, Liu Yuan,
	Jason Dillaman, Markus Armbruster

31.01.2020 20:44, Eric Blake wrote:
> Having two slightly-different function names for related purposes is
> unwieldy, especially since I envision adding yet another notion of
> zero support in an upcoming patch.  It doesn't help that
> bdrv_has_zero_init() is a misleading name (I originally thought that a
> driver could only return 1 when opening an already-existing image
> known to be all zeroes; but in reality many drivers always return 1
> because it only applies to a just-created image).  Refactor all uses
> to instead have a single function that returns multiple bits of
> information, with better naming and documentation.

Sounds good

> 
> No semantic change, although some of the changes (such as to qcow2.c)
> require a careful reading to see how it remains the same.
> 

...

> diff --git a/include/block/block.h b/include/block/block.h
> index 6cd566324d95..a6a227f50678 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h

Hmm, the header file is in the middle of the patch; possibly you don't use
[diff]
     orderFile = scripts/git.orderfile

in your git config. Or it is broken.

> @@ -85,6 +85,28 @@ typedef enum {
>       BDRV_REQ_MASK               = 0x3ff,
>   } BdrvRequestFlags;
> 
> +typedef enum {
> +    /*
> +     * bdrv_known_zeroes() should include this bit if the contents of
> +     * a freshly-created image with no backing file reads as all
> +     * zeroes without any additional effort.  If .bdrv_co_truncate is
> +     * set, then this must be clear if BDRV_ZERO_TRUNCATE is clear.

I understand that this is preexisting logic, but could I ask: why? What's wrong
if a driver can guarantee that a created file is all-zero, but is not sure about
file resizing? I agree that it's normal for these flags to have the same value,
but what is the reason for this restriction?

So the only possible combination in which the flags differ is create=0 and
truncate=1. How is that possible?

> +     * Since this bit is only reliable at image creation, a driver may
> +     * return this bit even for existing images that do not currently
> +     * read as zero.
> +     */
> +    BDRV_ZERO_CREATE        = 0x1,
> +
> +    /*
> +     * bdrv_known_zeroes() should include this bit if growing an image
> +     * with PREALLOC_MODE_OFF (either with no backing file, or beyond
> +     * the size of the backing file) will read the new data as all
> +     * zeroes without any additional effort.  This bit only matters
> +     * for drivers that set .bdrv_co_truncate.
> +     */
> +    BDRV_ZERO_TRUNCATE      = 0x2,
> +} BdrvZeroFlags;
> +

...


-- 
Best regards,
Vladimir



* Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}
  2020-02-04 15:35   ` Vladimir Sementsov-Ogievskiy
@ 2020-02-04 15:49     ` Eric Blake
  2020-02-04 16:07       ` Vladimir Sementsov-Ogievskiy
  2020-02-04 17:42     ` Max Reitz
  1 sibling, 1 reply; 73+ messages in thread
From: Eric Blake @ 2020-02-04 15:49 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, open list:Sheepdog, qemu-block, Jeff Cody,
	Stefan Weil, Peter Lieven, Richard W.M. Jones, mreitz,
	david.edmondson, Stefan Hajnoczi, Denis V. Lunev, Liu Yuan,
	Jason Dillaman, Markus Armbruster

On 2/4/20 9:35 AM, Vladimir Sementsov-Ogievskiy wrote:
> 31.01.2020 20:44, Eric Blake wrote:
>> Having two slightly-different function names for related purposes is
>> unwieldy, especially since I envision adding yet another notion of
>> zero support in an upcoming patch.  It doesn't help that
>> bdrv_has_zero_init() is a misleading name (I originally thought that a
>> driver could only return 1 when opening an already-existing image
>> known to be all zeroes; but in reality many drivers always return 1
>> because it only applies to a just-created image).  Refactor all uses
>> to instead have a single function that returns multiple bits of
>> information, with better naming and documentation.
> 
> Sounds good
> 
>>
>> No semantic change, although some of the changes (such as to qcow2.c)
>> require a careful reading to see how it remains the same.
>>
> 
> ...
> 
>> diff --git a/include/block/block.h b/include/block/block.h
>> index 6cd566324d95..a6a227f50678 100644
>> --- a/include/block/block.h
>> +++ b/include/block/block.h
> 
> Hmm, the header file is in the middle of the patch; possibly you don't use
> [diff]
>      orderFile = scripts/git.orderfile
> 
> in your git config. Or it is broken.

I do have it set up, so I'm not sure why it didn't work as planned. 
I'll make sure v2 follows the order I intended.

> 
>> @@ -85,6 +85,28 @@ typedef enum {
>>       BDRV_REQ_MASK               = 0x3ff,
>>   } BdrvRequestFlags;
>>
>> +typedef enum {
>> +    /*
>> +     * bdrv_known_zeroes() should include this bit if the contents of
>> +     * a freshly-created image with no backing file reads as all
>> +     * zeroes without any additional effort.  If .bdrv_co_truncate is
>> +     * set, then this must be clear if BDRV_ZERO_TRUNCATE is clear.
> 
> I understand that this is preexisting logic, but could I ask: why?
> What's wrong if a driver can guarantee that a created file is all-zero,
> but is not sure about file resizing? I agree that it's normal for these
> flags to have the same value, but what is the reason for this
> restriction?

For _this_ patch, my goal is to preserve pre-existing practice. Where we 
think pre-existing practice is wrong, we can then improve it in other 
patches (see patch 6, for example).

I _think_ the reason for this original limitation is as follows: If an 
image can be resized, we could choose to perform 'create(size=0), 
truncate(size=final)' instead of 'create(size=final)', and we want to 
guarantee the same behavior. If truncation can't guarantee a zero read, 
then why is creation doing so?
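
The 'create(size=0), truncate(size=final)' argument can be illustrated with a toy in-memory model. Everything here is an assumption made for illustration: SketchImage, its grow_fill field (what newly grown areas read as), and the sketch_* helpers are hypothetical and do not correspond to QEMU APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory "image" used only for this sketch. */
typedef struct SketchImage {
    unsigned char *data;
    size_t size;
    unsigned char grow_fill;  /* 0 models growth that reads as zero */
} SketchImage;

/* Models truncate(size=new_size) without preallocation; false on failure. */
static bool sketch_truncate(SketchImage *img, size_t new_size)
{
    unsigned char *p;

    assert(img != NULL);
    if (new_size == 0) {
        free(img->data);
        img->data = NULL;
        img->size = 0;
        return true;
    }
    p = realloc(img->data, new_size);
    if (p == NULL) {
        return false;
    }
    if (new_size > img->size) {
        /* Newly exposed bytes read as whatever the "driver" provides. */
        memset(p + img->size, img->grow_fill, new_size - img->size);
    }
    img->data = p;
    img->size = new_size;
    return true;
}

/* Models 'create(size=0), truncate(size=final)'. */
static bool sketch_create_by_growth(SketchImage *img, size_t final_size,
                                    unsigned char grow_fill)
{
    assert(img != NULL);
    img->data = NULL;
    img->size = 0;
    img->grow_fill = grow_fill;
    return sketch_truncate(img, final_size);
}

/* Returns true if every byte of the image reads as zero. */
static bool sketch_reads_as_zero(const SketchImage *img)
{
    size_t i;

    for (i = 0; i < img->size; i++) {
        if (img->data[i] != 0) {
            return false;
        }
    }
    return true;
}
```

In this model, an image created by growth reads as all zeroes exactly when growth itself reads as zero, which is the intuition behind tying the create guarantee to the truncate guarantee.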

But as I did not write the original patch, I would welcome Max's input 
with regards to the thought behind commit ceaca56f.

> 
> So the only possible combination in which the flags differ is
> create=0 and truncate=1. How is that possible?

qcow2 had that mode, at least before patch 5.

> 
>> +     * Since this bit is only reliable at image creation, a driver may
>> +     * return this bit even for existing images that do not currently
>> +     * read as zero.
>> +     */
>> +    BDRV_ZERO_CREATE        = 0x1,
>> +
>> +    /*
>> +     * bdrv_known_zeroes() should include this bit if growing an image
>> +     * with PREALLOC_MODE_OFF (either with no backing file, or beyond
>> +     * the size of the backing file) will read the new data as all
>> +     * zeroes without any additional effort.  This bit only matters
>> +     * for drivers that set .bdrv_co_truncate.
>> +     */
>> +    BDRV_ZERO_TRUNCATE      = 0x2,
>> +} BdrvZeroFlags;
>> +
> 
> ...
> 
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}
  2020-02-04 15:49     ` Eric Blake
@ 2020-02-04 16:07       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 73+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-04 16:07 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, open list:Sheepdog, qemu-block, Jeff Cody,
	Stefan Weil, Peter Lieven, Richard W.M. Jones, mreitz,
	david.edmondson, Stefan Hajnoczi, Denis V. Lunev, Liu Yuan,
	Jason Dillaman, Markus Armbruster

04.02.2020 18:49, Eric Blake wrote:
> On 2/4/20 9:35 AM, Vladimir Sementsov-Ogievskiy wrote:
>> 31.01.2020 20:44, Eric Blake wrote:
>>> Having two slightly-different function names for related purposes is
>>> unwieldy, especially since I envision adding yet another notion of
>>> zero support in an upcoming patch.  It doesn't help that
>>> bdrv_has_zero_init() is a misleading name (I originally thought that a
>>> driver could only return 1 when opening an already-existing image
>>> known to be all zeroes; but in reality many drivers always return 1
>>> because it only applies to a just-created image).  Refactor all uses
>>> to instead have a single function that returns multiple bits of
>>> information, with better naming and documentation.
>>
>> Sounds good
>>
>>>
>>> No semantic change, although some of the changes (such as to qcow2.c)
>>> require a careful reading to see how it remains the same.
>>>
>>
>> ...
>>
>>> diff --git a/include/block/block.h b/include/block/block.h
>>> index 6cd566324d95..a6a227f50678 100644
>>> --- a/include/block/block.h
>>> +++ b/include/block/block.h
>>
>> Hmm, the header file is in the middle of the patch; possibly you don't use
>> [diff]
>>      orderFile = scripts/git.orderfile
>>
>> in your git config. Or it is broken.
> 
> I do have it set up, so I'm not sure why it didn't work as planned. I'll make sure v2 follows the order I intended.
> 
>>
>>> @@ -85,6 +85,28 @@ typedef enum {
>>>       BDRV_REQ_MASK               = 0x3ff,
>>>   } BdrvRequestFlags;
>>>
>>> +typedef enum {
>>> +    /*
>>> +     * bdrv_known_zeroes() should include this bit if the contents of
>>> +     * a freshly-created image with no backing file reads as all
>>> +     * zeroes without any additional effort.  If .bdrv_co_truncate is
>>> +     * set, then this must be clear if BDRV_ZERO_TRUNCATE is clear.
>>
>> I understand that this is preexisting logic, but could I ask: why? What's wrong
>> if a driver can guarantee that a created file is all-zero, but is not sure about
>> file resizing? I agree that it's normal for these flags to have the same value,
>> but what is the reason for this restriction?
> 
> For _this_ patch, my goal is to preserve pre-existing practice. Where we think pre-existing practice is wrong, we can then improve it in other patches (see patch 6, for example).

This is OK, of course, I'm just trying to understand existing logic.

> 
> I _think_ the reason for this original limitation is as follows: If an image can be resized, we could choose to perform 'create(size=0), truncate(size=final)' instead of 'create(size=final)', and we want to guarantee the same behavior. If truncation can't guarantee a zero read, then why is creation doing so?

If we want to guarantee the same behavior, we should forbid any difference between these flags :)

> 
> But as I did not write the original patch, I would welcome Max's input with regards to the thought behind commit ceaca56f.
> 
>>
>> So the only possible combination in which the flags differ is create=0 and
>> truncate=1. How is that possible?
> 
> qcow2 had that mode, at least before patch 5.

Yes, it reported truncate=1 even for encrypted images...

> 
>>
>>> +     * Since this bit is only reliable at image creation, a driver may
>>> +     * return this bit even for existing images that do not currently
>>> +     * read as zero.
>>> +     */
>>> +    BDRV_ZERO_CREATE        = 0x1,
>>> +
>>> +    /*
>>> +     * bdrv_known_zeroes() should include this bit if growing an image
>>> +     * with PREALLOC_MODE_OFF (either with no backing file, or beyond
>>> +     * the size of the backing file) will read the new data as all
>>> +     * zeroes without any additional effort.  This bit only matters
>>> +     * for drivers that set .bdrv_co_truncate.
>>> +     */
>>> +    BDRV_ZERO_TRUNCATE      = 0x2,
>>> +} BdrvZeroFlags;
>>> +
>>
>> ...
>>
>>
> 


-- 
Best regards,
Vladimir



* Re: [PATCH 00/17] Improve qcow2 all-zero detection
  2020-01-31 17:44 [PATCH 00/17] Improve qcow2 all-zero detection Eric Blake
                   ` (16 preceding siblings ...)
  2020-01-31 17:44 ` [PATCH 17/17] qcow2: Let qemu-img check cover " Eric Blake
@ 2020-02-04 17:32 ` Max Reitz
  2020-02-04 18:53   ` Eric Blake
  2020-02-05  9:04 ` Vladimir Sementsov-Ogievskiy
  18 siblings, 1 reply; 73+ messages in thread
From: Max Reitz @ 2020-02-04 17:32 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: david.edmondson, qemu-block



On 31.01.20 18:44, Eric Blake wrote:
> Based-on: <20200124103458.1525982-2-david.edmondson@oracle.com>
> ([PATCH v2 1/2] qemu-img: Add --target-is-zero to convert)
> 
> I'm working on adding an NBD extension that reports whether an image
> is already all zero when the client first connects.  I initially
> thought I could write the NBD code to just call bdrv_has_zero_init(),
> but that turned out to be a bad assumption that instead resulted in
> this patch series.  The NBD patch will come later (and cross-posted to
> the NBD protocol, libnbd, nbdkit, and qemu, as it will affect all four
> repositories).

We had a discussion about this on IRC, and as far as I remember I wasn’t
quite sold on the “why”.  So, again, I wonder why this is needed.

I mean, it does make intuitive sense to want to know whether an image is
fully zero, but if I continue thinking about it I don’t know any case
where we would need to figure it out and where we could accept “We don’t
know” as an answer.  So I’m looking for use cases, but this cover letter
doesn’t mention any.  (And from a quick glance I don’t see this series
using the flag, actually.)

(We have a use case with convert -n to freshly created image files, but
my position on this on IRC was that we want the --target-is-zero flag
for that anyway: Auto-detection may always break, our preferred default
behavior may always change, so if you want convert -n not to touch the
target image except to write non-zero data from the source, we need a
--target-is-zero flag and users need to use it.  Well, management
layers, because I don’t think users would use convert -n anyway.

And with --target-is-zero and users effectively having to use it, I
don’t think that’s a good example of a use case.)

I suppose there is the point of blockdev-create + blockdev-mirror: This
has exactly the same problem as convert -n.  But again, if you really
want blockdev-mirror not just to force-zero the image, you probably need
to tell it so explicitly (i.e., with a --target-is-zero flag for
blockdev-mirror).

(Well, I suppose we could save ourselves a target-is-zero for mirror if we took
this series and had a filter driver that force-reports BDRV_ZERO_OPEN.
But, well, please no.)

But maybe I’m just an idiot and there is no reason not to take this
series and make blockdev-create + blockdev-mirror do the sensible thing
by default in most cases. *shrug*

Max




* Re: [PATCH 10/17] block: Add new BDRV_ZERO_OPEN flag
  2020-01-31 17:44 ` [PATCH 10/17] block: Add new BDRV_ZERO_OPEN flag Eric Blake
  2020-01-31 18:03   ` Eric Blake
@ 2020-02-04 17:34   ` Max Reitz
  2020-02-04 17:50     ` Eric Blake
  1 sibling, 1 reply; 73+ messages in thread
From: Max Reitz @ 2020-02-04 17:34 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block



On 31.01.20 18:44, Eric Blake wrote:
> Knowing that a file reads as all zeroes when created is useful, but
> limited in scope to drivers that can create images.  However, there
> are also situations where pre-existing images can quickly be
> determined to read as all zeroes, even when the image was not just
> created by the same process.  The optimization used in qemu-img
> convert to avoid a pre-zeroing pass on the destination is just as
> useful in such a scenario.  As such, it is worth the block layer
> adding another bit to bdrv_known_zeroes().
> 
> Note that while BDRV_ZERO_CREATE cannot chase through backing layers
> (because it only applies at creation time, but the backing layer was
> not created at the same time as the active layer being created), it IS
> okay for BDRV_ZERO_OPEN to chase through layers (as long as all layers
> currently read as zero, the image reads as zero).
> 
> Upcoming patches will update the qcow2, file-posix, and nbd drivers to
> advertise the new bit when appropriate.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  block.c               | 12 ++++++------
>  include/block/block.h | 10 ++++++++++
>  qemu-img.c            | 10 ++++++----
>  3 files changed, 22 insertions(+), 10 deletions(-)
> 
> diff --git a/block.c b/block.c
> index fac0813140aa..d68f527dc41f 100644
> --- a/block.c
> +++ b/block.c
> @@ -5078,7 +5078,7 @@ int bdrv_known_zeroes_truncate(BlockDriverState *bs)
> 
>  int bdrv_known_zeroes(BlockDriverState *bs)
>  {
> -    int mask = BDRV_ZERO_CREATE | BDRV_ZERO_TRUNCATE;
> +    int mask = BDRV_ZERO_CREATE | BDRV_ZERO_TRUNCATE | BDRV_ZERO_OPEN;
> 
>      if (!bs->drv) {
>          return 0;
> @@ -5100,17 +5100,17 @@ int bdrv_known_zeroes(BlockDriverState *bs)
>       * ZERO_CREATE is not viable.  If the current layer is smaller
>       * than the backing layer, truncation may expose backing data,
>       * restricting ZERO_TRUNCATE; treat failure to query size in the
> -     * same manner.  Otherwise, we can trust the driver.
> +     * same manner.  For ZERO_OPEN, we insist that both backing and
> +     * current layer report the bit.
>       */
> -
>      if (bs->backing) {
>          int64_t back = bdrv_getlength(bs->backing->bs);
>          int64_t curr = bdrv_getlength(bs);
> 
> -        if (back < 0 || curr < back) {
> -            return 0;
> +        mask = bdrv_known_zeroes(bs->backing->bs) & BDRV_ZERO_OPEN;
> +        if (back >= 0 && curr >= back) {
> +            mask |= BDRV_ZERO_TRUNCATE;
>          }
> -        mask = BDRV_ZERO_TRUNCATE;
>      }
> 
>      if (bs->drv->bdrv_known_zeroes) {
> diff --git a/include/block/block.h b/include/block/block.h
> index a6a227f50678..dafb8cc2bd80 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -105,6 +105,16 @@ typedef enum {
>       * for drivers that set .bdrv_co_truncate.
>       */
>      BDRV_ZERO_TRUNCATE      = 0x2,
> +
> +    /*
> +     * bdrv_known_zeroes() should include this bit if an image is
> +     * known to read as all zeroes when first opened; this bit should
> +     * not be relied on after any writes to the image.

Is there a good reason for this?  Because to me this screams like we are
going to check this flag without ensuring that the image has actually
not been written to yet.  So if it’s generally easy for drivers to stop
reporting this flag after a write, then maybe we should do so.

Max
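
One way a driver could stop reporting the flag after a write is sketched below. This is not taken from the series: SketchState and the sketch_* functions are hypothetical, and only the 0x4 value mirrors BDRV_ZERO_OPEN from the patch. The sketch is deliberately conservative, in that any non-empty write clears the bit, even a write of zeroes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { SKETCH_ZERO_OPEN = 0x4 };  /* same value as BDRV_ZERO_OPEN in the patch */

/* Hypothetical per-image driver state used only for this sketch. */
typedef struct SketchState {
    bool zero_at_open;  /* determined once, when the image is opened */
    bool written;       /* set by the first non-empty write */
} SketchState;

/* Models opening an image and recording whether it reads as all zeroes. */
static void sketch_open(SketchState *s, bool image_is_all_zero)
{
    assert(s != NULL);
    s->zero_at_open = image_is_all_zero;
    s->written = false;
}

/* Models a write request; the payload is irrelevant to the flag. */
static void sketch_write(SketchState *s, uint64_t offset, size_t len)
{
    assert(s != NULL);
    (void)offset;
    if (len > 0) {
        s->written = true;
    }
}

/* Models the driver's known-zeroes query for the ZERO_OPEN bit only. */
static int sketch_known_zeroes(const SketchState *s)
{
    assert(s != NULL);
    return (s->zero_at_open && !s->written) ? SKETCH_ZERO_OPEN : 0;
}
```

With tracking like this, a caller that queries the flag late would get 0 instead of a stale answer, at the cost of one extra boolean per open image.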

>                                                          This can be
> +     * set even if BDRV_ZERO_INIT is clear, but should only be set if
> +     * making the determination is more efficient than looping over
> +     * block status for the image.
> +     */
> +    BDRV_ZERO_OPEN          = 0x4,
>  } BdrvZeroFlags;
> 
>  typedef struct BlockSizes {
> diff --git a/qemu-img.c b/qemu-img.c
> index e60217e6c382..c8519a74f738 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -1985,10 +1985,12 @@ static int convert_do_copy(ImgConvertState *s)
>      int64_t sector_num = 0;
> 
>      /* Check whether we have zero initialisation or can get it efficiently */
> -    if (!s->has_zero_init && s->target_is_new && s->min_sparse &&
> -        !s->target_has_backing) {
> -        s->has_zero_init = !!(bdrv_known_zeroes(blk_bs(s->target)) &
> -                              BDRV_ZERO_CREATE);
> +    if (!s->has_zero_init && s->min_sparse && !s->target_has_backing) {
> +        ret = bdrv_known_zeroes(blk_bs(s->target));
> +        if (ret & BDRV_ZERO_OPEN ||
> +            (s->target_is_new && ret & BDRV_ZERO_CREATE)) {
> +            s->has_zero_init = true;
> +        }
>      }
> 
>      if (!s->has_zero_init && !s->target_has_backing &&
> 





* Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}
  2020-02-04 15:35   ` Vladimir Sementsov-Ogievskiy
  2020-02-04 15:49     ` Eric Blake
@ 2020-02-04 17:42     ` Max Reitz
  2020-02-04 17:51       ` Eric Blake
  2020-02-05  7:51       ` Vladimir Sementsov-Ogievskiy
  1 sibling, 2 replies; 73+ messages in thread
From: Max Reitz @ 2020-02-04 17:42 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Eric Blake, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, open list:Sheepdog, qemu-block, Jeff Cody,
	Stefan Weil, Peter Lieven, Richard W.M. Jones, Markus Armbruster,
	david.edmondson, Stefan Hajnoczi, Denis V. Lunev, Liu Yuan,
	Jason Dillaman



On 04.02.20 16:35, Vladimir Sementsov-Ogievskiy wrote:
> 31.01.2020 20:44, Eric Blake wrote:
>> Having two slightly-different function names for related purposes is
>> unwieldy, especially since I envision adding yet another notion of
>> zero support in an upcoming patch.  It doesn't help that
>> bdrv_has_zero_init() is a misleading name (I originally thought that a
>> driver could only return 1 when opening an already-existing image
>> known to be all zeroes; but in reality many drivers always return 1
>> because it only applies to a just-created image).  Refactor all uses
>> to instead have a single function that returns multiple bits of
>> information, with better naming and documentation.
> 
> Sounds good
> 
>>
>> No semantic change, although some of the changes (such as to qcow2.c)
>> require a careful reading to see how it remains the same.
>>
> 
> ...
> 
>> diff --git a/include/block/block.h b/include/block/block.h
>> index 6cd566324d95..a6a227f50678 100644
>> --- a/include/block/block.h
>> +++ b/include/block/block.h
> 
> Hmm, the header file is in the middle of the patch; possibly you don't use
> [diff]
>     orderFile = scripts/git.orderfile
> 
> in your git config. Or it is broken.
> 
>> @@ -85,6 +85,28 @@ typedef enum {
>>       BDRV_REQ_MASK               = 0x3ff,
>>   } BdrvRequestFlags;
>>
>> +typedef enum {
>> +    /*
>> +     * bdrv_known_zeroes() should include this bit if the contents of
>> +     * a freshly-created image with no backing file reads as all
>> +     * zeroes without any additional effort.  If .bdrv_co_truncate is
>> +     * set, then this must be clear if BDRV_ZERO_TRUNCATE is clear.
> 
> I understand that this is preexisting logic, but could I ask: why?
> What's wrong if a driver can guarantee that a created file is all-zero,
> but is not sure about file resizing? I agree that it's normal for these
> flags to have the same value, but what is the reason for this
> restriction?

If areas added by truncation (or growth, rather) are always zero, then
the file can always be created with size 0 and grown from there.  Thus,
images where truncation adds zeroed areas will generally always be zero
after creation.

> So, the only possible combination of flags, when they differs, is
> create=0 and
> truncate=1.. How is it possible?

For preallocated qcow2 images, it depends on the storage whether they
are actually 0 after creation.  Hence qcow2_has_zero_init() then defers
to bdrv_has_zero_init() of s->data_file->bs.

But when you truncate them (with PREALLOC_MODE_OFF, as
BlockDriver.bdrv_has_zero_init_truncate()’s comment explains), the new
area is always going to be 0, regardless of initial preallocation.


I just noticed a bug there, though: Encrypted qcow2 images will not see
areas added through growth as 0.  Hence, qcow2’s
bdrv_has_zero_init_truncate() implementation should not return true
unconditionally, but only for unencrypted images.
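
The create/truncate distinction above can be sketched as follows. This is
only an illustration, not the qcow2 driver code: struct image_state and
sketch_known_zeroes() are hypothetical names, the value 0x1 for the create
bit is an assumption (only BDRV_ZERO_TRUNCATE = 0x2 is quoted in this
thread), and treating an encrypted image as "nothing known" is a
deliberately conservative simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the BDRV_ZERO_* bits; 0x1 for create is an assumption. */
enum {
    ZERO_CREATE   = 0x1,
    ZERO_TRUNCATE = 0x2,
};

/* Hypothetical, simplified view of what a qcow2-like driver would consult. */
struct image_state {
    bool encrypted;          /* image payload is encrypted */
    bool preallocated;       /* clusters were preallocated at creation */
    bool storage_zero_init;  /* underlying data file reads as zero when created */
};

static int sketch_known_zeroes(const struct image_state *s)
{
    int flags = 0;

    assert(s != NULL);

    /* Conservative simplification: promise nothing for encrypted images. */
    if (s->encrypted) {
        return 0;
    }

    /* Growth without preallocation adds clusters that read as zero. */
    flags |= ZERO_TRUNCATE;

    /* A preallocated image is only zero if the storage says so. */
    if (!s->preallocated || s->storage_zero_init) {
        flags |= ZERO_CREATE;
    }
    return flags;
}
```

With these assumptions, a preallocated image on storage that does not
zero-initialize reports only the truncate bit, which is the create=0,
truncate=1 combination asked about above.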

Max




* Re: [PATCH 10/17] block: Add new BDRV_ZERO_OPEN flag
  2020-02-04 17:34   ` Max Reitz
@ 2020-02-04 17:50     ` Eric Blake
  2020-02-05  8:39       ` Vladimir Sementsov-Ogievskiy
  2020-02-05 17:26       ` Max Reitz
  0 siblings, 2 replies; 73+ messages in thread
From: Eric Blake @ 2020-02-04 17:50 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block

On 2/4/20 11:34 AM, Max Reitz wrote:

>> +++ b/include/block/block.h
>> @@ -105,6 +105,16 @@ typedef enum {
>>        * for drivers that set .bdrv_co_truncate.
>>        */
>>       BDRV_ZERO_TRUNCATE      = 0x2,
>> +
>> +    /*
>> +     * bdrv_known_zeroes() should include this bit if an image is
>> +     * known to read as all zeroes when first opened; this bit should
>> +     * not be relied on after any writes to the image.
> 
> Is there a good reason for this?  Because to me this screams like we are
> going to check this flag without ensuring that the image has actually
> not been written to yet.  So if it’s generally easy for drivers to stop
> reporting this flag after a write, then maybe we should do so.

In patch 15 (implementing things in qcow2), I actually wrote the driver 
to return live results, rather than just open-time results, in part 
because writing the bit to persistent storage in qcow2 means that the 
bit must be accurate, without relying on the block layer's help.

But in my pending NBD patch (not posted yet, but coming soon), the
proposal I'm making for the NBD protocol itself is just open-time, not
live, and so it would be more work than necessary to make the NBD driver
report live results.

But it seems like it should be easy enough to also patch the block layer 
itself to guarantee that callers of bdrv_known_zeroes() cannot see this 
bit set if the block layer has been used in any non-zero transaction, by 
repeating the same logic as used in qcow2 to kill the bit (any 
write/write_compressed/bdrv_copy clear the bit, any trim clears the bit 
if the driver does not guarantee trim reads as zero, any truncate clears 
the bit if the driver does not guarantee truncate reads as zero, etc). 
Basically, the block layer would cache the results of .bdrv_known_zeroes 
during .bdrv_co_open, bdrv_co_pwrite() and friends would update that 
cache, and bdrv_known_zeroes() would report the cached value rather 
than a fresh call to .bdrv_known_zeroes.
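
A minimal sketch of that caching idea, under stated assumptions: struct
node and the node_*() helpers are hypothetical names rather than block
layer API, and 0x4 for the open bit is an assumed value. Writes of
explicit zeroes are not modeled; they would leave the cached bit alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { ZERO_OPEN = 0x4 };  /* assumed value for the new bit */

/* Hypothetical per-node state kept by the generic layer. */
struct node {
    int known_zeroes;          /* driver answer cached at open time */
    bool trim_reads_zero;      /* driver guarantees trimmed areas read as zero */
    bool truncate_reads_zero;  /* driver guarantees grown areas read as zero */
};

/* Cache the driver's answer once, when the node is opened. */
static void node_open(struct node *n, int driver_flags,
                      bool trim_reads_zero, bool truncate_reads_zero)
{
    assert(n != NULL);
    n->known_zeroes = driver_flags;
    n->trim_reads_zero = trim_reads_zero;
    n->truncate_reads_zero = truncate_reads_zero;
}

/* Any data write (plain, compressed, copy) invalidates the bit. */
static void node_note_write(struct node *n)
{
    n->known_zeroes &= ~ZERO_OPEN;
}

/* Trim only invalidates the bit if trimmed areas may read as non-zero. */
static void node_note_trim(struct node *n)
{
    if (!n->trim_reads_zero) {
        n->known_zeroes &= ~ZERO_OPEN;
    }
}

/* Likewise for truncate. */
static void node_note_truncate(struct node *n)
{
    if (!n->truncate_reads_zero) {
        n->known_zeroes &= ~ZERO_OPEN;
    }
}

/* Callers see the cached value, never a fresh driver call. */
static int node_known_zeroes(const struct node *n)
{
    return n->known_zeroes;
}
```

The per-write cost in this sketch is a single mask operation, which is the
[slight] penalty weighed in the next paragraph.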

Are we worried enough about clients of this interface to make the block 
layer more robust?  (From the maintenance standpoint, the more the block 
layer guarantees, the easier it is to write code that uses the block 
layer; but there is the counter-argument that making the block layer 
track whether an image has been modified means a [slight] penalty to 
every write request to update the boolean.)

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}
  2020-02-04 17:42     ` Max Reitz
@ 2020-02-04 17:51       ` Eric Blake
  2020-02-05 16:43         ` Max Reitz
  2020-02-05  7:51       ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 73+ messages in thread
From: Eric Blake @ 2020-02-04 17:51 UTC (permalink / raw)
  To: Max Reitz, Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, open list:Sheepdog, qemu-block, Jeff Cody,
	Stefan Weil, Peter Lieven, Richard W.M. Jones, Markus Armbruster,
	david.edmondson, Stefan Hajnoczi, Denis V. Lunev, Liu Yuan,
	Jason Dillaman

On 2/4/20 11:42 AM, Max Reitz wrote:

>>
>> I understand that this is preexisting logic, but could I ask: why?
>> What's wrong
>> if driver can guarantee that created file is all-zero, but is not sure
>> about
>> file resizing? I agree that it's normal for these flags to have the same
>> value,
>> but what is the reason for this restriction?..
> 
> If areas added by truncation (or growth, rather) are always zero, then
> the file can always be created with size 0 and grown from there.  Thus,
> images where truncation adds zeroed areas will generally always be zero
> after creation.
> 
>> So, the only possible combination of flags, when they differs, is
>> create=0 and
>> truncate=1.. How is it possible?
> 
> For preallocated qcow2 images, it depends on the storage whether they
> are actually 0 after creation.  Hence qcow2_has_zero_init() then defers
> to bdrv_has_zero_init() of s->data_file->bs.
> 
> But when you truncate them (with PREALLOC_MODE_OFF, as
> BlockDriver.bdrv_has_zero_init_truncate()’s comment explains), the new
> area is always going to be 0, regardless of initial preallocation.
> 
> 
> I just noticed a bug there, though: Encrypted qcow2 images will not see
> areas added through growth as 0.  Hence, qcow2’s
> bdrv_has_zero_init_truncate() implementation should not return true
> unconditionally, but only for unencrypted images.

Hence patch 5 earlier in the series :)


-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}
  2020-01-31 17:44 ` [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate} Eric Blake
  2020-02-04 15:35   ` Vladimir Sementsov-Ogievskiy
@ 2020-02-04 17:53   ` Max Reitz
  2020-02-04 19:03     ` Eric Blake
  1 sibling, 1 reply; 73+ messages in thread
From: Max Reitz @ 2020-02-04 17:53 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, open list:Sheepdog, qemu-block, Jeff Cody,
	Stefan Weil, Peter Lieven, Richard W.M. Jones, Markus Armbruster,
	david.edmondson, Stefan Hajnoczi, Liu Yuan, Denis V. Lunev,
	Jason Dillaman



On 31.01.20 18:44, Eric Blake wrote:
> Having two slightly-different function names for related purposes is
> unwieldy, especially since I envision adding yet another notion of
> zero support in an upcoming patch.  It doesn't help that
> bdrv_has_zero_init() is a misleading name (I originally thought that a
> driver could only return 1 when opening an already-existing image
> known to be all zeroes; but in reality many drivers always return 1
> because it only applies to a just-created image).

I don’t find it misleading, I just find it meaningless, which then makes
it open to interpretation (or maybe rather s/interpretation/wishful
thinking/).

> Refactor all uses
> to instead have a single function that returns multiple bits of
> information, with better naming and documentation.

It doesn’t make sense to me.  How exactly is it unwieldy?  In the sense
that we have to deal with multiple rather small implementation functions
rather than a big one per driver?  Actually, multiple small functions
sounds better to me – unless the three implementations share common code.

As for the callers, they only want a single flag out of the three, don’t
they?  If so, it doesn’t really matter for them.

In fact, I can imagine that drivers can trivially return
BDRV_ZERO_TRUNCATE information (because the preallocation mode is
fixed), whereas BDRV_ZERO_CREATE can be a bit more involved, and
BDRV_ZERO_OPEN could take even more time because some (constant-time)
inquiries have to be done.

And thus callers which just want the trivially obtainable
BDRV_ZERO_TRUNCATE info have to wait for the BDRV_ZERO_OPEN inquiry,
even though they don’t care about that flag.

So I’d leave it as separate functions so drivers can feel free to have
implementations for BDRV_ZERO_OPEN that take more than mere microseconds
but that are more accurate.

(Or maybe if you really want it to be a single function, callers could
pass the mask of flags they care about.  If all flags are trivially
obtainable, the implementations would then simply create their result
mask and & it with the caller-given mask.  For implementations where
some branches could take a bit more time, those branches are only taken
when the caller cares about the given flag.  But again, I don’t
necessarily think having a single function is more easily handleable
than three smaller ones.)
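
The mask-passing variant could look roughly like this. It is a sketch with
hypothetical names (struct drv_state, known_zeroes_masked(),
expensive_open_check()) and assumed bit values for the create and open
flags; the counter exists only to show that the slow branch is skipped
when the caller does not ask for it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed bit values; only TRUNCATE = 0x2 is quoted in this thread. */
enum {
    ZERO_CREATE   = 0x1,
    ZERO_TRUNCATE = 0x2,
    ZERO_OPEN     = 0x4,
};

/* Hypothetical driver state for the illustration. */
struct drv_state {
    bool zero_on_create;
    bool zero_on_truncate;
    bool all_zero_now;
    int expensive_calls;   /* counts how often the slow path ran */
};

/* Stand-in for an inquiry that takes more than mere microseconds. */
static bool expensive_open_check(struct drv_state *s)
{
    s->expensive_calls++;
    return s->all_zero_now;
}

/* Single callback; the caller passes the flags it cares about. */
static int known_zeroes_masked(struct drv_state *s, int mask)
{
    int result = 0;

    assert(s != NULL);

    /* Trivially obtainable bits are always computed. */
    if (s->zero_on_create) {
        result |= ZERO_CREATE;
    }
    if (s->zero_on_truncate) {
        result |= ZERO_TRUNCATE;
    }

    /* The slow branch is only taken when the caller asked for it. */
    if ((mask & ZERO_OPEN) && expensive_open_check(s)) {
        result |= ZERO_OPEN;
    }
    return result & mask;
}
```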

Max




* Re: [PATCH 00/17] Improve qcow2 all-zero detection
  2020-02-04 17:32 ` [PATCH 00/17] Improve qcow2 all-zero detection Max Reitz
@ 2020-02-04 18:53   ` Eric Blake
  2020-02-05 17:04     ` Max Reitz
  0 siblings, 1 reply; 73+ messages in thread
From: Eric Blake @ 2020-02-04 18:53 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: david.edmondson, qemu-block

On 2/4/20 11:32 AM, Max Reitz wrote:
> On 31.01.20 18:44, Eric Blake wrote:
>> Based-on: <20200124103458.1525982-2-david.edmondson@oracle.com>
>> ([PATCH v2 1/2] qemu-img: Add --target-is-zero to convert)
>>
>> I'm working on adding an NBD extension that reports whether an image
>> is already all zero when the client first connects.  I initially
>> thought I could write the NBD code to just call bdrv_has_zero_init(),
>> but that turned out to be a bad assumption that instead resulted in
>> this patch series.  The NBD patch will come later (and cross-posted to
>> the NBD protocol, libnbd, nbdkit, and qemu, as it will affect all four
>> repositories).
> 
> We had a discussion about this on IRC, and as far as I remember I wasn’t
> quite sold on the “why”.  So, again, I wonder why this is needed.
> 
> I mean, it does make intuitive sense to want to know whether an image is
> fully zero, but if I continue thinking about it I don’t know any case
> where we would need to figure it out and where we could accept “We don’t
> know” as an answer.  So I’m looking for use cases, but this cover letter
> doesn’t mention any.  (And from a quick glance I don’t see this series
> using the flag, actually.)

Patch 10/17 has:

diff --git a/qemu-img.c b/qemu-img.c
index e60217e6c382..c8519a74f738 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1985,10 +1985,12 @@ static int convert_do_copy(ImgConvertState *s)
      int64_t sector_num = 0;

      /* Check whether we have zero initialisation or can get it efficiently */
-    if (!s->has_zero_init && s->target_is_new && s->min_sparse &&
-        !s->target_has_backing) {
-        s->has_zero_init = !!(bdrv_known_zeroes(blk_bs(s->target)) &
-                              BDRV_ZERO_CREATE);
+    if (!s->has_zero_init && s->min_sparse && !s->target_has_backing) {
+        ret = bdrv_known_zeroes(blk_bs(s->target));
+        if (ret & BDRV_ZERO_OPEN ||
+            (s->target_is_new && ret & BDRV_ZERO_CREATE)) {
+            s->has_zero_init = true;
+        }
      }

That's the use case: when copying into a destination file, it's useful 
to know if the destination already reads as all zeroes, before 
attempting a fallback to bdrv_make_zero(BDRV_REQ_NO_FALLBACK) or calls 
to block status checking for holes.

> 
> (We have a use case with convert -n to freshly created image files, but
> my position on this on IRC was that we want the --target-is-zero flag
> for that anyway: Auto-detection may always break, our preferred default
> behavior may always change, so if you want convert -n not to touch the
> target image except to write non-zero data from the source, we need a
> --target-is-zero flag and users need to use it.  Well, management
> layers, because I don’t think users would use convert -n anyway.
> 
> And with --target-is-zero and users effectively having to use it, I
> don’t think that’s a good example of a use case.)

Yes, there will still be cases where you have to use --target-is-zero 
because the image itself couldn't report that it already reads as 
zeroes, but there are also enough cases where the destination is already 
known to read zeroes and it's a shame to tell the user that 'you have to 
add --target-is-zero to get faster copying even though we could have 
inferred it on your behalf'.

> 
> I suppose there is the point of blockdev-create + blockdev-mirror: This
> has exactly the same problem as convert -n.  But again, if you really
> want blockdev-mirror not just to force-zero the image, you probably need
> to tell it so explicitly (i.e., with a --target-is-zero flag for
> blockdev-mirror).
> 
> (Well, I suppose we could save us a target-is-zero for mirror if we took
> this series and had a filter driver that force-reports BDRV_ZERO_OPEN.
> But, well, please no.)
> 
> But maybe I’m just an idiot and there is no reason not to take this
> series and make blockdev-create + blockdev-mirror do the sensible thing
> by default in most cases. *shrug*

My argument for taking the series _is_ that the common case can be made 
more efficient without user effort.  Yes, we still need the knob for 
when the common case isn't already smart enough, but the difference in 
avoiding a pre-zeroing pass is noticeable when copying images around 
(and more than just for qcow2 - my followup series to improve NBD is 
similarly useful given how much work has already been invested in 
mapping NBD into storage access over https in the upper layers like ovirt).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}
  2020-02-04 17:53   ` Max Reitz
@ 2020-02-04 19:03     ` Eric Blake
  2020-02-05 17:22       ` Max Reitz
  0 siblings, 1 reply; 73+ messages in thread
From: Eric Blake @ 2020-02-04 19:03 UTC (permalink / raw)
  To: Max Reitz, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, open list:Sheepdog, qemu-block, Jeff Cody,
	Stefan Weil, Peter Lieven, Richard W.M. Jones, Markus Armbruster,
	david.edmondson, Stefan Hajnoczi, Liu Yuan, Denis V. Lunev,
	Jason Dillaman

On 2/4/20 11:53 AM, Max Reitz wrote:
> On 31.01.20 18:44, Eric Blake wrote:
>> Having two slightly-different function names for related purposes is
>> unwieldy, especially since I envision adding yet another notion of
>> zero support in an upcoming patch.  It doesn't help that
>> bdrv_has_zero_init() is a misleading name (I originally thought that a
>> driver could only return 1 when opening an already-existing image
>> known to be all zeroes; but in reality many drivers always return 1
>> because it only applies to a just-created image).
> 
> I don’t find it misleading, I just find it meaningless, which then makes
> it open to interpretation (or maybe rather s/interpretation/wishful
> thinking/).
> 
>> Refactor all uses
>> to instead have a single function that returns multiple bits of
>> information, with better naming and documentation.
> 
> It doesn’t make sense to me.  How exactly is it unwieldy?  In the sense
> that we have to deal with multiple rather small implementation functions
> rather than a big one per driver?  Actually, multiple small functions
> sounds better to me – unless the three implementations share common code.

Common code for dealing with encryption, backing files, and so on.  It 
felt like I had a lot of code repetition when keeping functions separate.

> 
> As for the callers, they only want a single flag out of the three, don’t
> they?  If so, it doesn’t really matter for them.

The qemu-img.c caller in patch 10 checks ZERO_CREATE | ZERO_OPEN, so we 
DO have situations of checking more than one bit, vs. needing two 
function calls.

> 
> In fact, I can imagine that drivers can trivially return
> BDRV_ZERO_TRUNCATE information (because the preallocation mode is
> fixed), whereas BDRV_ZERO_CREATE can be a bit more involved, and
> BDRV_ZERO_OPEN could take even more time because some (constant-time)
> inquiries have to be done.

In looking at the rest of the series, drivers were either completely 
trivial (in which case, declaring:

.bdrv_has_zero_init = bdrv_has_zero_init_1,
.bdrv_has_zero_init_truncate = bdrv_has_zero_init_1,

was a lot wordier than the new:

.bdrv_known_zeroes = bdrv_known_zeroes_truncate,

), or completely spelled out but where both creation and truncation were 
determined in the same amount of effort.


> 
> And thus callers which just want the trivially obtainable
> BDRV_ZERO_TRUNCATE info have to wait for the BDRV_ZERO_OPEN inquiry,
> even though they don’t care about that flag.

True, but only to a minor extent; and the documentation mentions that 
the BDRV_ZERO_OPEN calculation MUST NOT be as expensive as a blind 
block_status loop.  Meanwhile, callers tend to only care about 
bdrv_known_zeroes() right after opening an image or right before 
resizing (not repeatedly during runtime); and you also argued elsewhere 
in this thread that it may be worth having the block layer cache 
BDRV_ZERO_OPEN and update the cache on any write, at which point, the 
expense in the driver callback really is a one-time call during 
bdrv_co_open().  And in that case, whether the one-time expense is done 
via a single function call or via three driver callbacks, the amount of 
work is the same; but the driver callback interface is easier if there 
is only one callback (similar to how bdrv_unallocated_blocks_are_zero() 
calls bdrv_get_info() only for bdi.unallocated_blocks_are_zero, even 
though BlockDriverInfo tracks much more than that boolean).

In fact, it may be worth consolidating known zeroes support into 
BlockDriverInfo.

> 
> So I’d leave it as separate functions so drivers can feel free to have
> implementations for BDRV_ZERO_OPEN that take more than mere microseconds
> but that are more accurate.
> 
> (Or maybe if you really want it to be a single function, callers could
> pass the mask of flags they care about.  If all flags are trivially
> obtainable, the implementations would then simply create their result
> mask and & it with the caller-given mask.  For implementations where
> some branches could take a bit more time, those branches are only taken
> when the caller cares about the given flag.  But again, I don’t
> necessarily think having a single function is more easily handleable
> than three smaller ones.)

Those are still viable options, but before I repaint the bikeshed along 
those lines, I'd at least like a review of whether the overall idea of 
having a notion of 'reads-all-zeroes' is indeed useful enough, 
regardless of how we implement it as one vs. three driver callbacks.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}
  2020-02-04 17:42     ` Max Reitz
  2020-02-04 17:51       ` Eric Blake
@ 2020-02-05  7:51       ` Vladimir Sementsov-Ogievskiy
  2020-02-05 14:07         ` Eric Blake
  1 sibling, 1 reply; 73+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-05  7:51 UTC (permalink / raw)
  To: Max Reitz, Eric Blake, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, open list:Sheepdog, qemu-block, Jeff Cody,
	Stefan Weil, Peter Lieven, Richard W.M. Jones, Markus Armbruster,
	david.edmondson, Stefan Hajnoczi, Denis V. Lunev, Liu Yuan,
	Jason Dillaman

04.02.2020 20:42, Max Reitz wrote:
> On 04.02.20 16:35, Vladimir Sementsov-Ogievskiy wrote:
>> 31.01.2020 20:44, Eric Blake wrote:
>>> Having two slightly-different function names for related purposes is
>>> unwieldy, especially since I envision adding yet another notion of
>>> zero support in an upcoming patch.  It doesn't help that
>>> bdrv_has_zero_init() is a misleading name (I originally thought that a
>>> driver could only return 1 when opening an already-existing image
>>> known to be all zeroes; but in reality many drivers always return 1
>>> because it only applies to a just-created image).  Refactor all uses
>>> to instead have a single function that returns multiple bits of
>>> information, with better naming and documentation.
>>
>> Sounds good
>>
>>>
>>> No semantic change, although some of the changes (such as to qcow2.c)
>>> require a careful reading to see how it remains the same.
>>>
>>
>> ...
>>
>>> diff --git a/include/block/block.h b/include/block/block.h
>>> index 6cd566324d95..a6a227f50678 100644
>>> --- a/include/block/block.h
>>> +++ b/include/block/block.h
>>
>> Hmm, header file in the middle of the patch, possibly you don't use
>> [diff]
>>      orderFile = scripts/git.orderfile
>>
>> in git config.. Or it is broken.
>>
>>> @@ -85,6 +85,28 @@ typedef enum {
>>>        BDRV_REQ_MASK               = 0x3ff,
>>>    } BdrvRequestFlags;
>>>
>>> +typedef enum {
>>> +    /*
>>> +     * bdrv_known_zeroes() should include this bit if the contents of
>>> +     * a freshly-created image with no backing file reads as all
>>> +     * zeroes without any additional effort.  If .bdrv_co_truncate is
>>> +     * set, then this must be clear if BDRV_ZERO_TRUNCATE is clear.
>>
>> I understand that this is preexisting logic, but could I ask: why?
>> What's wrong
>> if driver can guarantee that created file is all-zero, but is not sure
>> about
>> file resizing? I agree that it's normal for these flags to have the same
>> value,
>> but what is the reason for this restriction?..
> 
> If areas added by truncation (or growth, rather) are always zero, then
> the file can always be created with size 0 and grown from there.  Thus,
> images where truncation adds zeroed areas will generally always be zero
> after creation.

This means that if the truncation bit is set, then the create bit should be
set. But here we say that if the truncation bit is clear, then the create bit
must be clear.
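
To make the difference between the two rules concrete, here is a small
sketch; the function names are hypothetical and only encode the two
statements as boolean implications.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Rule in the header comment: create must be clear when truncate is
 * clear, i.e. create implies truncate.
 */
static bool allowed_by_comment(bool create, bool truncate)
{
    return !create || truncate;
}

/*
 * Rule suggested by the "create at size 0 and grow" argument:
 * truncate implies create.
 */
static bool allowed_by_growth_argument(bool create, bool truncate)
{
    return !truncate || create;
}

/* The rules agree when the flags match and disagree when they differ. */
static void check_rules(void)
{
    assert(allowed_by_comment(false, false) &&
           allowed_by_growth_argument(false, false));
    assert(allowed_by_comment(true, true) &&
           allowed_by_growth_argument(true, true));

    /* create=0, truncate=1: only the header comment permits it. */
    assert(allowed_by_comment(false, true) &&
           !allowed_by_growth_argument(false, true));

    /* create=1, truncate=0: only the growth argument permits it. */
    assert(!allowed_by_comment(true, false) &&
           allowed_by_growth_argument(true, false));
}
```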

> 
>> So, the only possible combination of flags, when they differs, is
>> create=0 and
>> truncate=1.. How is it possible?
> 
> For preallocated qcow2 images, it depends on the storage whether they
> are actually 0 after creation.  Hence qcow2_has_zero_init() then defers
> to bdrv_has_zero_init() of s->data_file->bs.
> 
> But when you truncate them (with PREALLOC_MODE_OFF, as
> BlockDriver.bdrv_has_zero_init_truncate()’s comment explains), the new
> area is always going to be 0, regardless of initial preallocation.

Ah yes, due to qcow2 zero clusters.

> 
> 
> I just noticed a bug there, though: Encrypted qcow2 images will not see
> areas added through growth as 0.  Hence, qcow2’s
> bdrv_has_zero_init_truncate() implementation should not return true
> unconditionally, but only for unencrypted images.
> 
> Max
> 


-- 
Best regards,
Vladimir



* Re: [PATCH 10/17] block: Add new BDRV_ZERO_OPEN flag
  2020-02-04 17:50     ` Eric Blake
@ 2020-02-05  8:39       ` Vladimir Sementsov-Ogievskiy
  2020-02-05 17:26       ` Max Reitz
  1 sibling, 0 replies; 73+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-05  8:39 UTC (permalink / raw)
  To: Eric Blake, Max Reitz, qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block

04.02.2020 20:50, Eric Blake wrote:
> On 2/4/20 11:34 AM, Max Reitz wrote:
> 
>>> +++ b/include/block/block.h
>>> @@ -105,6 +105,16 @@ typedef enum {
>>>        * for drivers that set .bdrv_co_truncate.
>>>        */
>>>       BDRV_ZERO_TRUNCATE      = 0x2,
>>> +
>>> +    /*
>>> +     * bdrv_known_zeroes() should include this bit if an image is
>>> +     * known to read as all zeroes when first opened; this bit should
>>> +     * not be relied on after any writes to the image.
>>
>> Is there a good reason for this?  Because to me this screams like we are
>> going to check this flag without ensuring that the image has actually
>> not been written to yet.  So if it’s generally easy for drivers to stop
>> reporting this flag after a write, then maybe we should do so.
> 
> In patch 15 (implementing things in qcow2), I actually wrote the driver to return live results, rather than just open-time results, in part because writing the bit to persistent storage in qcow2 means that the bit must be accurate, without relying on the block layer's help.
> 
> But my pending NBD patch (not posted yet, but will be soon), the proposal I'm making for the NBD protocol itself is just open-time, not live, and so it would be more work than necessary to make the NBD driver report live results.
> 
> But it seems like it should be easy enough to also patch the block layer itself to guarantee that callers of bdrv_known_zeroes() cannot see this bit set if the block layer has been used in any non-zero transaction, by repeating the same logic as used in qcow2 to kill the bit (any write/write_compressed/bdrv_copy clear the bit, any trim clears the bit if the driver does not guarantee trim reads as zero, any truncate clears the bit if the driver does not guarantee truncate reads as zero, etc). Basically, the block layer would cache the results of .bdrv_known_zeroes during .bdrv_co_open, bdrv_co_pwrite() and friends would update that cache, and bdrv_known_zeroes() would report the cached value rather than a fresh call to .bdrv_known_zeroes.
> 
> Are we worried enough about clients of this interface to make the block layer more robust?  (From the maintenance standpoint, the more the block layer guarantees, the easier it is to write code that uses the block layer; but there is the counter-argument that making the block layer track whether an image has been modified means a [slight] penalty to every write request to update the boolean.)
> 

I'm in favor of an is_all_zero() function rather than is_it_was_all_zeros_when_opened(). I never liked places in the code where is_zero_init() is used like is_disk_zero(), without any check that the drive was not modified, or was even created by us.

-- 
Best regards,
Vladimir



* Re: [PATCH 00/17] Improve qcow2 all-zero detection
  2020-01-31 17:44 [PATCH 00/17] Improve qcow2 all-zero detection Eric Blake
                   ` (17 preceding siblings ...)
  2020-02-04 17:32 ` [PATCH 00/17] Improve qcow2 all-zero detection Max Reitz
@ 2020-02-05  9:04 ` Vladimir Sementsov-Ogievskiy
  2020-02-05  9:25   ` Vladimir Sementsov-Ogievskiy
  2020-02-05 14:22   ` Eric Blake
  18 siblings, 2 replies; 73+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-05  9:04 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: david.edmondson, qemu-block, mreitz

31.01.2020 20:44, Eric Blake wrote:
> Based-on: <20200124103458.1525982-2-david.edmondson@oracle.com>
> ([PATCH v2 1/2] qemu-img: Add --target-is-zero to convert)
> 
> I'm working on adding an NBD extension that reports whether an image
> is already all zero when the client first connects.  I initially
> thought I could write the NBD code to just call bdrv_has_zero_init(),
> but that turned out to be a bad assumption that instead resulted in
> this patch series.  The NBD patch will come later (and cross-posted to
> the NBD protocol, libnbd, nbdkit, and qemu, as it will affect all four
> repositories).
> 
> I do have an RFC question on patch 13 - as implemented here, I set a
> qcow2 bit if the image has all clusters known zero and no backing
> image.  But it may be more useful to instead report whether all
> clusters _allocated in this layer_ are zero, at which point the
> overall image is all-zero only if the backing file also has that
> property (or even make it two bits).  The tweaks to subsequent patches
> based on what we think makes the most useful semantics shouldn't be
> hard.
> 
> [repo.or.cz appears to be down as I type this; I'll post a link to a
> repository later when it comes back up]
> 

I have several ideas around it.

1. For generic block layer.
Did you consider, as an alternative to BDRV_ZERO_OPEN, exporting the
information through normal block_status? So, if we have the
information that the disk is all-zero, we can always add the _ZERO
flag to the block-status result. And in a generic bdrv_is_all_zeroes(),
we can just call block_status(0, disk_size), which will return
ZERO and n=disk_size if the driver supports the all-zero feature and
the image is all-zero now.
I think block-status is a native way to convey such information, and I
think that we want to get to 64bit block-status support for qcow2 and
nbd anyway.

2. For NBD
Again, a possible alternative is BLOCK_STATUS, but we need 64bit
commands for it. I plan to send a proposal anyway. Still, there is
nothing bad in having two possible paths for receiving all-zero
information. And even with your NBD extension, we can export this
information through block-status [1.]

3. For qcow2
Hmm. Here, as I understand it, the main case is a freshly created qcow2
image, which is fully unallocated. To see that it is empty, we
need only check all L1 entries. And for an empty L1 table that is fast.
So we don't need any qcow2 format improvement to check it.
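
The L1 check from point 3 could be sketched like this. Assumptions: the
table has already been read into memory and converted to host byte order,
the image has no backing file (so unallocated clusters read as zero), and
l1_table_is_unallocated() is a hypothetical helper name. The mask covers
the L2 table offset bits of an L1 entry as laid out in the qcow2 format
specification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bits 9..55 of a qcow2 L1 entry hold the offset of the L2 table. */
#define L1_OFFSET_MASK 0x00fffffffffffe00ULL

/*
 * Return true if no L1 entry points at an L2 table, i.e. the image has
 * no allocated clusters at all.  Entries must be in host byte order.
 */
static bool l1_table_is_unallocated(const uint64_t *l1, size_t n_entries)
{
    assert(l1 != NULL || n_entries == 0);

    for (size_t i = 0; i < n_entries; i++) {
        if (l1[i] & L1_OFFSET_MASK) {
            return false;   /* an L2 table exists; need a deeper look */
        }
    }
    return true;
}
```

A false result here does not mean the image has non-zero data, only that
this cheap check cannot tell.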




-- 
Best regards,
Vladimir



* Re: [PATCH 00/17] Improve qcow2 all-zero detection
  2020-02-05  9:04 ` Vladimir Sementsov-Ogievskiy
@ 2020-02-05  9:25   ` Vladimir Sementsov-Ogievskiy
  2020-02-05 14:26     ` Eric Blake
  2020-02-05 14:22   ` Eric Blake
  1 sibling, 1 reply; 73+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-05  9:25 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: david.edmondson, qemu-block, mreitz

05.02.2020 12:04, Vladimir Sementsov-Ogievskiy wrote:
> 31.01.2020 20:44, Eric Blake wrote:
>> Based-on: <20200124103458.1525982-2-david.edmondson@oracle.com>
>> ([PATCH v2 1/2] qemu-img: Add --target-is-zero to convert)
>>
>> I'm working on adding an NBD extension that reports whether an image
>> is already all zero when the client first connects.  I initially
>> thought I could write the NBD code to just call bdrv_has_zero_init(),
>> but that turned out to be a bad assumption that instead resulted in
>> this patch series.  The NBD patch will come later (and cross-posted to
>> the NBD protocol, libnbd, nbdkit, and qemu, as it will affect all four
>> repositories).
>>
>> I do have an RFC question on patch 13 - as implemented here, I set a
>> qcow2 bit if the image has all clusters known zero and no backing
>> image.  But it may be more useful to instead report whether all
>> clusters _allocated in this layer_ are zero, at which point the
>> overall image is all-zero only if the backing file also has that
>> property (or even make it two bits).  The tweaks to subsequent patches
>> based on what we think makes the most useful semantics shouldn't be
>> hard.
>>
>> [repo.or.cz appears to be down as I type this; I'll post a link to a
>> repository later when it comes back up]
>>
> 
> I have several ideas around it.
> 
> 1. For generic block layer.
> Did you consider as alternative to BDRV_ZERO_OPEN, to export the
> information through normal block_status? So, if we have the
> information, that disk is all-zero, we can always add _ZERO
> flag to block-status result. And in generic bdrv_is_all_zeroes(),
> we can just call block_status(0, disk_size), which will return
> ZERO and n=disk_size if driver supports all-zero feature and is
> all-zero now.
> I think block-status is a native way for such information, and I
> think that we anyway want to come to support of 64bit block-status
> for qcow2 and nbd.
> 
> 2. For NBD
> Again, possible alternative is BLOCK_STATUS, but we need 64bit
> commands for it. I plan to send a proposal anyway. Still, nothing
> bad in two possible path of receiving all-zero information.
> And even with your NBD extension, we can export this information
> through block-status [1.]
> 
> 3. For qcow2
> Hmm. Here, as I understand, than main case is freshly created qcow2,
> which is fully-unallocated. To understand that it is empty, we
> need only to check all L1 entries. And for empty L1 table it is fast.
> So we don't need any qcow2 format improvement to check it.
> 

Ah yes, I forgot about the preallocated case. Hmm. For preallocated clusters,
we have zero bits in L2 entries. And with them, we don't even need the
preallocated clusters to be filled with zeros, as we never read them (we just
return zeros on read).

Then, maybe we want a similar flag for L1 entries (this would enable large,
fast write-zero). And maybe we want a flag which marks the whole image
as read-zero (that's your flag). So now I think my previous idea
of "all allocated is zero" is worse: for fully-preallocated images
we are sure that all clusters are allocated, and it is more natural to
have flags similar to the ZERO bit in an L2 entry.


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}
  2020-02-05  7:51       ` Vladimir Sementsov-Ogievskiy
@ 2020-02-05 14:07         ` Eric Blake
  2020-02-05 14:25           ` Vladimir Sementsov-Ogievskiy
  2020-02-05 17:55           ` Max Reitz
  0 siblings, 2 replies; 73+ messages in thread
From: Eric Blake @ 2020-02-05 14:07 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Max Reitz, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, open list:Sheepdog, qemu-block, Jeff Cody,
	Stefan Weil, Peter Lieven, Richard W.M. Jones, Markus Armbruster,
	david.edmondson, Stefan Hajnoczi, Denis V. Lunev, Liu Yuan,
	Jason Dillaman

On 2/5/20 1:51 AM, Vladimir Sementsov-Ogievskiy wrote:

>>>> +typedef enum {
>>>> +    /*
>>>> +     * bdrv_known_zeroes() should include this bit if the contents of
>>>> +     * a freshly-created image with no backing file reads as all
>>>> +     * zeroes without any additional effort.  If .bdrv_co_truncate is
>>>> +     * set, then this must be clear if BDRV_ZERO_TRUNCATE is clear.
>>>
>>> I understand that this is preexisting logic, but could I ask: why?
>>> What's wrong
>>> if driver can guarantee that created file is all-zero, but is not sure
>>> about
>>> file resizing? I agree that it's normal for these flags to have the same
>>> value,
>>> but what is the reason for this restriction?..
>>
>> If areas added by truncation (or growth, rather) are always zero, then
>> the file can always be created with size 0 and grown from there.  Thus,
>> images where truncation adds zeroed areas will generally always be zero
>> after creation.
> 
> This means, that if truncation bit is set, than create bit should be 
> set.. But
> here we say that if truncation is clear, than create bit must be clear.

Max, did we get the logic backwards?

> 
>>
>>> So, the only possible combination of flags, when they differs, is
>>> create=0 and
>>> truncate=1.. How is it possible?
>>
>> For preallocated qcow2 images, it depends on the storage whether they
>> are actually 0 after creation.  Hence qcow2_has_zero_init() then defers
>> to bdrv_has_zero_init() of s->data_file->bs.
>>
>> But when you truncate them (with PREALLOC_MODE_OFF, as
>> BlockDriver.bdrv_has_zero_init_truncate()’s comment explains), the new
>> area is always going to be 0, regardless of initial preallocation.
> 
> ah yes, due to qcow2 zero clusters.

Hmm. Do we actually set the zero flag on unallocated clusters when
resizing a qcow2 image?  That would be an O(n) operation (we have to
visit the L2 entry for each added cluster, even if only to set the zero
cluster bit).  Or do we instead just rely on the fact that qcow2 is
inherently sparse: when you resize the guest-visible size without
writing any new clusters, it is only subsequent guest access to those
addresses that finally allocates clusters, making resize O(1) (update
the qcow2 metadata cluster, but not any L2 tables) while still reading
0 from the new data?  To some extent, that's what the allocation mode
is supposed to control.

What about with external data images, where a resize in guest-visible 
length requires a resize of the underlying data image?  There, we DO 
have to worry about whether the data image resizes with zeroes (as in 
the filesystem) or with random data (as in a block device).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/17] Improve qcow2 all-zero detection
  2020-02-05  9:04 ` Vladimir Sementsov-Ogievskiy
  2020-02-05  9:25   ` Vladimir Sementsov-Ogievskiy
@ 2020-02-05 14:22   ` Eric Blake
  2020-02-05 14:43     ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 73+ messages in thread
From: Eric Blake @ 2020-02-05 14:22 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: david.edmondson, qemu-block, mreitz

On 2/5/20 3:04 AM, Vladimir Sementsov-Ogievskiy wrote:

>> [repo.or.cz appears to be down as I type this; I'll post a link to a
>> repository later when it comes back up]

Now up
https://repo.or.cz/qemu/ericb.git/shortlog/refs/tags/qcow2-all-zero-v1

>>
> 
> I have several ideas around it.
> 
> 1. For generic block layer.
> Did you consider as alternative to BDRV_ZEO_OPEN, to export the
> information through normal block_status? So, if we have the
> information, that disk is all-zero, we can always add _ZERO
> flag to block-status result.

Makes sense.

> And in generic bdrv_is_all_zeroes(),
> we can just call block_status(0, disk_size), which will return
> ZERO and n=disk_size if driver supports all-zero feature and is
> all-zero now.

Less obvious.  block_status is not required to visit the entire disk,
even if the entire disk is all zero.  For example, qcow2 visits at most
one L2 page in a call (if the request spans L1 entries, it will be
truncated at the boundary, even if the regions before and after the
boundary have the same status).  I'm also worried about whether we still
have 32-bit limitations in block_status (ideally, we've fixed things to
support 64-bit status where possible, but I'm not counting on it).

> I think block-status is a native way for such information, and I
> think that we anyway want to come to support of 64bit block-status
> for qcow2 and nbd.

Block status requires an O(n) loop over the disk, where n is the number
of distinct extents possible.  If you get lucky, and
block_status(0,size) returns a single extent, then yes that can feed the
'is_zeroes' request.  Similarly, a single return of non-zero data can
instantly tell you that 'is_zeroes' is false.  But drivers may break up
their response on convenient boundaries, such as qcow2 on L1 entry
granularity, so you cannot blindly assume that a return of zero data
for less than the requested size implies non-zero data.  It only means
there is insufficient information to tell if the disk is all_zeroes
without further block_status calls, and that's where you lose out on
speed compared to just being told up-front by an 'is_zero' call.
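To make the cost concrete, here is a minimal sketch of the loop a caller
would need.  It is not QEMU code: status_fn, ToyDriver and
is_all_zero_by_status are hypothetical names, and the toy driver simply
assumes an all-zero image that never answers across a fixed chunk
boundary, loosely mimicking qcow2 stopping at the end of one L2 table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a driver's block-status callback: returns true
 * if the extent starting at 'offset' reads as zero, and stores in '*pnum'
 * how many bytes that answer covers (possibly fewer than 'bytes'). */
typedef bool (*status_fn)(void *opaque, uint64_t offset, uint64_t bytes,
                          uint64_t *pnum);

/* Toy driver (an assumption for illustration): an all-zero image whose
 * answers never cross a 'chunk' boundary. */
typedef struct {
    uint64_t chunk;   /* maximum bytes covered by one answer */
    unsigned calls;   /* number of status queries issued so far */
} ToyDriver;

static bool toy_status(void *opaque, uint64_t offset, uint64_t bytes,
                       uint64_t *pnum)
{
    ToyDriver *d = opaque;
    uint64_t left_in_chunk = d->chunk - offset % d->chunk;

    d->calls++;
    *pnum = bytes < left_in_chunk ? bytes : left_in_chunk;
    return true;                /* every extent reads as zero */
}

/* Walk the whole disk.  A short zero answer only means "keep asking". */
static bool is_all_zero_by_status(status_fn fn, void *opaque, uint64_t size)
{
    uint64_t offset = 0;

    while (offset < size) {
        uint64_t pnum = 0;

        if (!fn(opaque, offset, size - offset, &pnum)) {
            return false;       /* one non-zero extent settles the question */
        }
        assert(pnum > 0);       /* the driver must make progress */
        offset += pnum;
    }
    return true;
}
```

With a 512 MiB chunk, a 16 GiB all-zero image takes 32 queries in this
sketch before the caller can conclude anything, whereas an up-front
'is_zero' answer would take one.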

> 
> 2. For NBD
> Again, possible alternative is BLOCK_STATUS, but we need 64bit
> commands for it. I plan to send a proposal anyway. Still, nothing
> bad in two possible path of receiving all-zero information.
> And even with your NBD extension, we can export this information
> through block-status [1.]

Yes, having 64-bit BLOCK_STATUS in NBD is orthogonal to this, but both 
ideas are independently useful, and as the level of difficulty in 
implementing things may vary, it is conceivable to have both a server 
that provides 'is_zero' but not BLOCK_STATUS, and a server that provides 
64-bit BLOCK_STATUS but not 'is_zero'.

> 
> 3. For qcow2
> Hmm. Here, as I understand, than main case is freshly created qcow2,
> which is fully-unallocated. To understand that it is empty, we
> need only to check all L1 entries. And for empty L1 table it is fast.
> So we don't need any qcow2 format improvement to check it.

The benefit of this patch series is that it detects preallocated qcow2 
images as all_zero.  What's more, scanning all L1 entries is O(n), but 
detecting an autoclear all_zero bit is O(1).  Your proposed L1 scan is 
accurate for fewer cases, and costs more time.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}
  2020-02-05 14:07         ` Eric Blake
@ 2020-02-05 14:25           ` Vladimir Sementsov-Ogievskiy
  2020-02-05 14:36             ` Eric Blake
  2020-02-05 17:55           ` Max Reitz
  1 sibling, 1 reply; 73+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-05 14:25 UTC (permalink / raw)
  To: Eric Blake, Max Reitz, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, open list:Sheepdog, qemu-block, Jeff Cody,
	Stefan Weil, Peter Lieven, Richard W.M. Jones, Markus Armbruster,
	david.edmondson, Stefan Hajnoczi, Denis V. Lunev, Liu Yuan,
	Jason Dillaman

05.02.2020 17:07, Eric Blake wrote:
> On 2/5/20 1:51 AM, Vladimir Sementsov-Ogievskiy wrote:
> 
>>>>> +typedef enum {
>>>>> +    /*
>>>>> +     * bdrv_known_zeroes() should include this bit if the contents of
>>>>> +     * a freshly-created image with no backing file reads as all
>>>>> +     * zeroes without any additional effort.  If .bdrv_co_truncate is
>>>>> +     * set, then this must be clear if BDRV_ZERO_TRUNCATE is clear.
>>>>
>>>> I understand that this is preexisting logic, but could I ask: why?
>>>> What's wrong
>>>> if driver can guarantee that created file is all-zero, but is not sure
>>>> about
>>>> file resizing? I agree that it's normal for these flags to have the same
>>>> value,
>>>> but what is the reason for this restriction?..
>>>
>>> If areas added by truncation (or growth, rather) are always zero, then
>>> the file can always be created with size 0 and grown from there.  Thus,
>>> images where truncation adds zeroed areas will generally always be zero
>>> after creation.
>>
>> This means, that if truncation bit is set, than create bit should be set.. But
>> here we say that if truncation is clear, than create bit must be clear.
> 
> Max, did we get the logic backwards?
> 
>>
>>>
>>>> So, the only possible combination of flags, when they differs, is
>>>> create=0 and
>>>> truncate=1.. How is it possible?
>>>
>>> For preallocated qcow2 images, it depends on the storage whether they
>>> are actually 0 after creation.  Hence qcow2_has_zero_init() then defers
>>> to bdrv_has_zero_init() of s->data_file->bs.
>>>
>>> But when you truncate them (with PREALLOC_MODE_OFF, as
>>> BlockDriver.bdrv_has_zero_init_truncate()’s comment explains), the new
>>> area is always going to be 0, regardless of initial preallocation.
>>
>> ah yes, due to qcow2 zero clusters.
> 
> Hmm. Do we actually set the zero flag on unallocated clusters when resizing a qcow2 image?  That would be an O(n) operation (we have to visit the L2 entry for each added cluster, even if only to set the zero cluster bit).  Or do we instead just rely on the fact that qcow2 is inherently sparse, and that when you resize the guest-visible size without writing any new clusters, then it is only subsequent guest access to those addresses that finally allocate clusters, making resize O(1) (update the qcow2 metadata cluster, but not any L2 tables) while still reading 0 from the new data.  To some extent, that's what the allocation mode is supposed to control.

We must mark new clusters as ZERO at least if there is a _larger_ backing file, to prevent data from the backing file from becoming visible to the guest. But we don't do it. It's a bug, and there is a series from Kevin fixing it, which I've just pinged:
"[PATCH for-4.2? v3 0/8] block: Fix resize (extending) of short overlays"

> 
> What about with external data images, where a resize in guest-visible length requires a resize of the underlying data image?  There, we DO have to worry about whether the data image resizes with zeroes (as in the filesystem) or with random data (as in a block device).
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/17] Improve qcow2 all-zero detection
  2020-02-05  9:25   ` Vladimir Sementsov-Ogievskiy
@ 2020-02-05 14:26     ` Eric Blake
  2020-02-05 14:47       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 73+ messages in thread
From: Eric Blake @ 2020-02-05 14:26 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: david.edmondson, qemu-block, mreitz

On 2/5/20 3:25 AM, Vladimir Sementsov-Ogievskiy wrote:

>> 3. For qcow2
>> Hmm. Here, as I understand, than main case is freshly created qcow2,
>> which is fully-unallocated. To understand that it is empty, we
>> need only to check all L1 entries. And for empty L1 table it is fast.
>> So we don't need any qcow2 format improvement to check it.
>>
> 
> Ah yes, I forget about preallocated case. Hmm. For preallocated clusters,
> we have zero bits in L2 entries. And with them, we even don't need
> preallocated to be filled by zeros, as we never read them (but just return
> zeros on read)..

Scanning all L2 entries is O(n), while an autoclear bit properly 
maintained is O(1).

> 
> Then, may be we want similar flag for L1 entry (this will enable large
> fast write-zero). And may be we want flag which marks the whole image
> as read-zero (it's your flag). So, now I think, my previous idea
> of "all allocated is zero" is worse. As for fully-preallocated images
> we are sure that all clusters are allocated, and it is more native to
> have flags similar to ZERO bit in L2 entry.

Right now, we don't have any L1 entry flags.  Adding one would require 
adding an incompatible feature flag (if older qemu would choke to see 
unexpected flags in an L1 entry), or at best an autoclear feature flag 
(if the autoclear bit gets cleared because an older qemu opened the 
image and couldn't maintain L1 entry flags correctly, then newer qemu 
knows it cannot trust those L1 entry flags).  But as soon as you are 
talking about adding a feature bit, then why add one that still requires 
O(n) traversal to check (true, the 'n' in an O(n) traversal of L1 tables 
is much smaller than the 'n' in an O(n) traversal of L2 tables), when 
you can instead just add an O(1) autoclear bit that maintains all_zero 
status for the image as a whole?
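As a rough illustration of the two costs (a sketch only, not the code
from this series): SketchImage and both helpers are hypothetical, and
the autoclear bit position used here is made up for the example.  The
L1 offset mask (bits 9-55) follows the qcow2 spec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit position, chosen only for this sketch. */
#define SKETCH_AUTOCLEAR_ALL_ZERO (UINT64_C(1) << 2)
/* Bits 9-55 of an L1 entry hold the L2 table offset (per the qcow2 spec). */
#define SKETCH_L1E_OFFSET_MASK    UINT64_C(0x00fffffffffffe00)

/* Minimal in-memory view of the parts of an image used below. */
typedef struct {
    uint64_t autoclear_features;  /* header field, already loaded */
    const uint64_t *l1_table;     /* L1 entries, already loaded */
    size_t l1_size;               /* number of L1 entries */
} SketchImage;

/* O(1): a single test of the header field. */
static bool all_zero_by_bit(const SketchImage *img)
{
    return (img->autoclear_features & SKETCH_AUTOCLEAR_ALL_ZERO) != 0;
}

/* O(n) in L1 entries, and it only proves "no L2 table is allocated",
 * so a preallocated image that reads as zero is not detected. */
static bool all_unallocated_by_l1_scan(const SketchImage *img)
{
    assert(img->l1_table != NULL || img->l1_size == 0);

    for (size_t i = 0; i < img->l1_size; i++) {
        if (img->l1_table[i] & SKETCH_L1E_OFFSET_MASK) {
            return false;
        }
    }
    return true;
}
```

The scan is both slower and weaker: it answers "nothing allocated",
which is only one of the ways an image can read as all zero.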

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}
  2020-02-05 14:25           ` Vladimir Sementsov-Ogievskiy
@ 2020-02-05 14:36             ` Eric Blake
  0 siblings, 0 replies; 73+ messages in thread
From: Eric Blake @ 2020-02-05 14:36 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Max Reitz, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, open list:Sheepdog, qemu-block, Jeff Cody,
	Stefan Weil, Peter Lieven, Richard W.M. Jones, Markus Armbruster,
	david.edmondson, Stefan Hajnoczi, Denis V. Lunev, Liu Yuan,
	Jason Dillaman

On 2/5/20 8:25 AM, Vladimir Sementsov-Ogievskiy wrote:

>>>> But when you truncate them (with PREALLOC_MODE_OFF, as
>>>> BlockDriver.bdrv_has_zero_init_truncate()’s comment explains), the new
>>>> area is always going to be 0, regardless of initial preallocation.
>>>
>>> ah yes, due to qcow2 zero clusters.
>>
>> Hmm. Do we actually set the zero flag on unallocated clusters when 
>> resizing a qcow2 image?  That would be an O(n) operation (we have to 
>> visit the L2 entry for each added cluster, even if only to set the 
>> zero cluster bit).  Or do we instead just rely on the fact that qcow2 
>> is inherently sparse, and that when you resize the guest-visible size 
>> without writing any new clusters, then it is only subsequent guest 
>> access to those addresses that finally allocate clusters, making 
>> resize O(1) (update the qcow2 metadata cluster, but not any L2 tables) 
>> while still reading 0 from the new data.  To some extent, that's what 
>> the allocation mode is supposed to control.
> 
> We must mark as ZERO new cluster at least if there is a _larger_ backing 
> file, to prevent data from backing file become available for the guest. 
> But we don't do it. It's a bug and there is fixing series from Kevin, 
> I've just pinged it:
> "[PATCH for-4.2? v3 0/8] block: Fix resize (extending) of short overlays"

There's a difference between a backing file larger than the qcow2 file
and a protocol layer larger than the qcow2 file.  Visually, with the
following four nodes:

f1 [qcow2 format] <- f2 [qcow2 format]
   v                        v
p1 [file protocol]    p2 [file protocol]

If f1 is larger than f2, then resizing f2 without writing zero clusters 
leaks the data from f1 into f2.  The block layer knows this: prior to 
this series, bdrv_has_zero_init_truncate() returns 0 if bs->backing is 
present; and even in this series, see patch 6/17 which continues to 
force a 0 return rather than calling into the driver if the sizes are 
suspect.  But that is an uncommon corner case; in short, the qcow2 
callback .bdrv_has_zero_init_truncate is NOT reachable in that scenario, 
whether before or after this series.

On the other hand, if p2 is larger than f2, resizing f2 reads zeroes. 
That's because qcow2 HAS to add L2 mappings into p2 before data from p2 
can leak, but .bdrv_co_truncate(PREALLOC_MODE_OFF) does not add any L2 
mappings.  Thus, qcow2 blindly returning 1 for 
.bdrv_has_zero_init_truncate was correct (other than the anomaly of 
bs->encrypted, also fixed earlier in this series).
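The decision described above can be sketched roughly as follows.  This
is a simplification under stated assumptions, not the block layer code:
SketchNode and sketch_has_zero_init_truncate are hypothetical, and
"sizes are suspect" is modelled here simply as "the backing file is
larger than the overlay".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, stripped-down view of a block node. */
typedef struct SketchNode {
    const struct SketchNode *backing;  /* NULL if there is no backing file */
    uint64_t size;                     /* guest-visible size in bytes */
    bool encrypted;
    bool driver_zero_truncate;         /* what the format driver would say */
} SketchNode;

static bool sketch_has_zero_init_truncate(const SketchNode *bs)
{
    assert(bs != NULL);

    /* f1 larger than f2: growing f2 would expose f1's data, so the
     * generic layer answers "no" without consulting the driver. */
    if (bs->backing && bs->backing->size > bs->size) {
        return false;
    }
    /* Encrypted images do not read back as zeroes after growth. */
    if (bs->encrypted) {
        return false;
    }
    /* Otherwise defer to the driver; for qcow2 with PREALLOC_MODE_OFF no
     * L2 mappings are added, so stale protocol-layer data cannot leak. */
    return bs->driver_zero_truncate;
}
```

In this sketch the driver callback is consulted only after the generic
checks pass, which is why the f1-larger-than-f2 case never reaches it.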

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/17] Improve qcow2 all-zero detection
  2020-02-05 14:22   ` Eric Blake
@ 2020-02-05 14:43     ` Vladimir Sementsov-Ogievskiy
  2020-02-05 14:58       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 73+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-05 14:43 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: david.edmondson, qemu-block, mreitz

05.02.2020 17:22, Eric Blake wrote:
> On 2/5/20 3:04 AM, Vladimir Sementsov-Ogievskiy wrote:
> 
>>> [repo.or.cz appears to be down as I type this; I'll post a link to a
>>> repository later when it comes back up]
> 
> Now up
> https://repo.or.cz/qemu/ericb.git/shortlog/refs/tags/qcow2-all-zero-v1
> 
>>>
>>
>> I have several ideas around it.
>>
>> 1. For generic block layer.
>> Did you consider as alternative to BDRV_ZEO_OPEN, to export the
>> information through normal block_status? So, if we have the
>> information, that disk is all-zero, we can always add _ZERO
>> flag to block-status result.
> 
> Makes sense.
> 
>> And in generic bdrv_is_all_zeroes(),
>> we can just call block_status(0, disk_size), which will return
>> ZERO and n=disk_size if driver supports all-zero feature and is
>> all-zero now.
> 
> Less obvious.  block_status is not required to visit the entire disk, even if the entire disk is all zero.  For example, qcow2 visits at most one L2 page in a call (if the request spans L1 entries, it will be truncated at the boundary, even if the region before and after the boundary have the same status).  I'm also worried if we still have 32-bit limitations in block_status (ideally, we've fixed things to support 64-bit status where possible, but I'm not counting on it).

Not required, but why not do it? If we know that the whole disk has the same ZERO status, there is no reason to reply to block_status(0, disk_size) with a smaller n.

> 
>> I think block-status is a native way for such information, and I
>> think that we anyway want to come to support of 64bit block-status
>> for qcow2 and nbd.
> 
> Block status requires an O(n) loop over the disk, where n is the number of distinct extents possible.  If you get lucky, and block_status(0,size) returns a single extent, then yes that can feed the 'is_zeroes' request.  Similarly, a single return of non-zero data can instantly tell you that 'is_zeroes' is false.  But given that drivers may break up their response on convenient boundaries, such as qcow2 on L1 entry granularity, you cannot blindly assume that a return of zero data for smaller than the requested size implies non-zero data, only that there is insufficient information to tell if the disk is all_zeroes without querying further block_status calls, and that's where you lose out on the speed compared to just being told up-front from an 'is_zero' call.

Yes. But how is it worse than BDRV_ZERO_OPEN? With one block_status call we have the same information. If the driver replies to block_status(0, disk_size) with ZERO but with n smaller than disk_size, it means that either the disk is not all-zero, or the driver doesn't support the 'fast whole-disk zero check' feature, which is equivalent to not supporting BDRV_ZERO_OPEN.

> 
>>
>> 2. For NBD
>> Again, possible alternative is BLOCK_STATUS, but we need 64bit
>> commands for it. I plan to send a proposal anyway. Still, nothing
>> bad in two possible path of receiving all-zero information.
>> And even with your NBD extension, we can export this information
>> through block-status [1.]
> 
> Yes, having 64-bit BLOCK_STATUS in NBD is orthogonal to this, but both ideas are independently useful, and as the level of difficulty in implementing things may vary, it is conceivable to have both a server that provides 'is_zero' but not BLOCK_STATUS, and a server that provides 64-bit BLOCK_STATUS but not 'is_zero'.
> 
>>
>> 3. For qcow2
>> Hmm. Here, as I understand, than main case is freshly created qcow2,
>> which is fully-unallocated. To understand that it is empty, we
>> need only to check all L1 entries. And for empty L1 table it is fast.
>> So we don't need any qcow2 format improvement to check it.
> 
> The benefit of this patch series is that it detects preallocated qcow2 images as all_zero.  What's more, scanning all L1 entries is O(n), but detecting an autoclear all_zero bit is O(1).  Your proposed L1 scan is accurate for fewer cases, and costs more time.

Ah yes, somehow I thought that L1 is not allocated for a fresh image.

Hmm, then possibly we need two new top-level flags: "all-zero" and "all-unallocated".


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/17] Improve qcow2 all-zero detection
  2020-02-05 14:26     ` Eric Blake
@ 2020-02-05 14:47       ` Vladimir Sementsov-Ogievskiy
  2020-02-05 15:14         ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 73+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-05 14:47 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: david.edmondson, qemu-block, mreitz

05.02.2020 17:26, Eric Blake wrote:
> On 2/5/20 3:25 AM, Vladimir Sementsov-Ogievskiy wrote:
> 
>>> 3. For qcow2
>>> Hmm. Here, as I understand, than main case is freshly created qcow2,
>>> which is fully-unallocated. To understand that it is empty, we
>>> need only to check all L1 entries. And for empty L1 table it is fast.
>>> So we don't need any qcow2 format improvement to check it.
>>>
>>
>> Ah yes, I forget about preallocated case. Hmm. For preallocated clusters,
>> we have zero bits in L2 entries. And with them, we even don't need
>> preallocated to be filled by zeros, as we never read them (but just return
>> zeros on read)..
> 
> Scanning all L2 entries is O(n), while an autoclear bit properly maintained is O(1).
> 
>>
>> Then, may be we want similar flag for L1 entry (this will enable large
>> fast write-zero). And may be we want flag which marks the whole image
>> as read-zero (it's your flag). So, now I think, my previous idea
>> of "all allocated is zero" is worse. As for fully-preallocated images
>> we are sure that all clusters are allocated, and it is more native to
>> have flags similar to ZERO bit in L2 entry.
> 
> Right now, we don't have any L1 entry flags.  Adding one would require adding an incompatible feature flag (if older qemu would choke to see unexpected flags in an L1 entry), or at best an autoclear feature flag (if the autoclear bit gets cleared because an older qemu opened the image and couldn't maintain L1 entry flags correctly, then newer qemu knows it cannot trust those L1 entry flags).  But as soon as you are talking about adding a feature bit, then why add one that still requires O(n) traversal to check (true, the 'n' in an O(n) traversal of L1 tables is much smaller than the 'n' in an O(n) traversal of L2 tables), when you can instead just add an O(1) autoclear bit that maintains all_zero status for the image as a whole?
> 

My suggestion about an L1 entry flag is a side issue; I understand the difference between O(n) and O(1) :) Still, an additional L1 entry flag would help make large block-status and write-zero requests efficient.

And I agree that we need a top-level flag. I'm just trying to say that it seems good to make it similar to the existing L2 flag. But yes, that would be an incompatible change, as it marks all clusters as ZERO, and older QEMU can't understand it and may treat all clusters as unallocated.


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/17] Improve qcow2 all-zero detection
  2020-02-05 14:43     ` Vladimir Sementsov-Ogievskiy
@ 2020-02-05 14:58       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 73+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-05 14:58 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: david.edmondson, qemu-block, mreitz

05.02.2020 17:43, Vladimir Sementsov-Ogievskiy wrote:
> 05.02.2020 17:22, Eric Blake wrote:
>> On 2/5/20 3:04 AM, Vladimir Sementsov-Ogievskiy wrote:
>>
>>>> [repo.or.cz appears to be down as I type this; I'll post a link to a
>>>> repository later when it comes back up]
>>
>> Now up
>> https://repo.or.cz/qemu/ericb.git/shortlog/refs/tags/qcow2-all-zero-v1
>>
>>>>
>>>
>>> I have several ideas around it.
>>>
>>> 1. For generic block layer.
>>> Did you consider as alternative to BDRV_ZEO_OPEN, to export the
>>> information through normal block_status? So, if we have the
>>> information, that disk is all-zero, we can always add _ZERO
>>> flag to block-status result.
>>
>> Makes sense.
>>
>>> And in generic bdrv_is_all_zeroes(),
>>> we can just call block_status(0, disk_size), which will return
>>> ZERO and n=disk_size if driver supports all-zero feature and is
>>> all-zero now.
>>
>> Less obvious.  block_status is not required to visit the entire disk, even if the entire disk is all zero.  For example, qcow2 visits at most one L2 page in a call (if the request spans L1 entries, it will be truncated at the boundary, even if the region before and after the boundary have the same status).  I'm also worried if we still have 32-bit limitations in block_status (ideally, we've fixed things to support 64-bit status where possible, but I'm not counting on it).
> 
> Not required, but why not doing it? If we have information that all disk is of the same ZERO status, no reasons to not reply on block_status(0, disk_size) with smaller n.
> 
>>
>>> I think block-status is a native way for such information, and I
>>> think that we anyway want to come to support of 64bit block-status
>>> for qcow2 and nbd.
>>
>> Block status requires an O(n) loop over the disk, where n is the number of distinct extents possible.  If you get lucky, and block_status(0,size) returns a single extent, then yes that can feed the 'is_zeroes' request.  Similarly, a single return of non-zero data can instantly tell you that 'is_zeroes' is false.  But given that drivers may break up their response on convenient boundaries, such as qcow2 on L1 entry granularity, you cannot blindly assume that a return of zero data for smaller than the requested size implies non-zero data, only that there is insufficient information to tell if the disk is all_zeroes without querying further block_status calls, and that's where you lose out on the speed compared to just being told up-front from an 'is_zero' call.
> 
> Yes. But how is it worse than BDRV_ZERO_OPEN? With one block_status call we have the same information. If on block_status(0, disk_size) driver replies with ZERO but smaller than disk_size, it means that either disk is not all-zero, or driver doesn't support 'fast whole-disk zero check' feature, which is equal to not supporting BDRV_ZERO_OPEN.
> 
>>
>>>
>>> 2. For NBD
>>> Again, possible alternative is BLOCK_STATUS, but we need 64bit
>>> commands for it. I plan to send a proposal anyway. Still, nothing
>>> bad in two possible path of receiving all-zero information.
>>> And even with your NBD extension, we can export this information
>>> through block-status [1.]
>>
>> Yes, having 64-bit BLOCK_STATUS in NBD is orthogonal to this, but both ideas are independently useful, and as the level of difficulty in implementing things may vary, it is conceivable to have both a server that provides 'is_zero' but not BLOCK_STATUS, and a server that provides 64-bit BLOCK_STATUS but not 'is_zero'.
>>
>>>
>>> 3. For qcow2
>>> Hmm. Here, as I understand, than main case is freshly created qcow2,
>>> which is fully-unallocated. To understand that it is empty, we
>>> need only to check all L1 entries. And for empty L1 table it is fast.
>>> So we don't need any qcow2 format improvement to check it.
>>
>> The benefit of this patch series is that it detects preallocated qcow2 images as all_zero.  What's more, scanning all L1 entries is O(n), but detecting an autoclear all_zero bit is O(1).  Your proposed L1 scan is accurate for fewer cases, and costs more time.
> 
> Ah yes, somehow I thought that L1 is not allocated for fresh image..
> 
> Hmm, than possibly we need two new top-level flags: "all-zero" and "all-unallocated"..
> 

It makes sense only with incompatible semantics. With autoclear semantics it's better to have a single 'all-allocated-are-zero' flag and not care.


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 00/17] Improve qcow2 all-zero detection
  2020-02-05 14:47       ` Vladimir Sementsov-Ogievskiy
@ 2020-02-05 15:14         ` Vladimir Sementsov-Ogievskiy
  2020-02-05 17:58           ` Max Reitz
  0 siblings, 1 reply; 73+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-02-05 15:14 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: david.edmondson, qemu-block, mreitz

05.02.2020 17:47, Vladimir Sementsov-Ogievskiy wrote:
> 05.02.2020 17:26, Eric Blake wrote:
>> On 2/5/20 3:25 AM, Vladimir Sementsov-Ogievskiy wrote:
>>
>>>> 3. For qcow2
>>>> Hmm. Here, as I understand, than main case is freshly created qcow2,
>>>> which is fully-unallocated. To understand that it is empty, we
>>>> need only to check all L1 entries. And for empty L1 table it is fast.
>>>> So we don't need any qcow2 format improvement to check it.
>>>>
>>>
>>> Ah yes, I forget about preallocated case. Hmm. For preallocated clusters,
>>> we have zero bits in L2 entries. And with them, we even don't need
>>> preallocated to be filled by zeros, as we never read them (but just return
>>> zeros on read)..
>>
>> Scanning all L2 entries is O(n), while an autoclear bit properly maintained is O(1).
>>
>>>
>>> Then, may be we want similar flag for L1 entry (this will enable large
>>> fast write-zero). And may be we want flag which marks the whole image
>>> as read-zero (it's your flag). So, now I think, my previous idea
>>> of "all allocated is zero" is worse. As for fully-preallocated images
>>> we are sure that all clusters are allocated, and it is more native to
>>> have flags similar to ZERO bit in L2 entry.
>>
>> Right now, we don't have any L1 entry flags.  Adding one would require adding an incompatible feature flag (if older qemu would choke to see unexpected flags in an L1 entry), or at best an autoclear feature flag (if the autoclear bit gets cleared because an older qemu opened the image and couldn't maintain L1 entry flags correctly, then newer qemu knows it cannot trust those L1 entry flags).  But as soon as you are talking about adding a feature bit, then why add one that still requires O(n) traversal to check (true, the 'n' in an O(n) traversal of L1 tables is much smaller than the 'n' in an O(n) traversal of L2 tables), when you can instead just add an O(1) autoclear bit that maintains all_zero status for the image as a whole?
>>
> 
> My suggestion about L1 entry flag is side thing, I understand difference between O(n) and O(1) :) Still additional L1 entry will help to make efficient large block-status and write-zero requests.
> 
> And I agree that we need top level flag.. I just try to say, that it seems good to make it similar with existing L2 flag. But yes, it would be incomaptible change, as it marks all clusters as ZERO, and older Qemu can't understand it and may treat all clusters as unallocated.
> 

Still, how long is this O(n)? We load the whole L1 into memory anyway. For example, for a 16 TiB disk with 64K clusters, we'll have 32768 L1 entries. Will we get a sensible performance benefit with an extension? I doubt it now. And anyway, if we have an extension, we should fall back to this O(n) scan if we don't have the flag set.

So, I think the flag is beneficial only for preallocated images.

Hmm. And for such images we could, if we want, define this flag as 'all clusters are allocated zeroes', which would prove that the image reads as zero independently of any backing relations.
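
A rough sketch of the arithmetic above, assuming 8-byte L2 entries and one cluster per L2 table. The helper names are made up for illustration and are not qemu functions; the scan treats any non-zero L1 entry as "points at an L2 table".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Number of L1 entries needed to map disk_size bytes, assuming each L2
 * table occupies exactly one cluster and each L2 entry is 8 bytes.
 */
static uint64_t example_l1_entries_needed(uint64_t disk_size,
                                          uint64_t cluster_size)
{
    uint64_t l2_entries = cluster_size / 8;             /* entries per L2 table */
    uint64_t bytes_per_l1 = l2_entries * cluster_size;  /* guest bytes per L1 entry */

    return (disk_size + bytes_per_l1 - 1) / bytes_per_l1;
}

/*
 * The O(n) fallback: true if no L1 entry refers to an L2 table.
 * This is a simplification; it only recognizes fully unallocated images,
 * not preallocated ones.
 */
static bool example_l1_table_is_empty(const uint64_t *l1, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (l1[i] != 0) {
            return false;
        }
    }
    return true;
}
```

With these assumptions, example_l1_entries_needed(16ULL << 40, 64 * 1024) gives the 32768 entries mentioned above, i.e. a 256 KiB table to walk.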


-- 
Best regards,
Vladimir



* Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}
  2020-02-04 17:51       ` Eric Blake
@ 2020-02-05 16:43         ` Max Reitz
  0 siblings, 0 replies; 73+ messages in thread
From: Max Reitz @ 2020-02-05 16:43 UTC (permalink / raw)
  To: Eric Blake, Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, open list:Sheepdog, qemu-block, Jeff Cody,
	Stefan Weil, Peter Lieven, Richard W.M. Jones, Markus Armbruster,
	david.edmondson, Stefan Hajnoczi, Denis V. Lunev, Liu Yuan,
	Jason Dillaman


[-- Attachment #1.1: Type: text/plain, Size: 1542 bytes --]

On 04.02.20 18:51, Eric Blake wrote:
> On 2/4/20 11:42 AM, Max Reitz wrote:
> 
>>>
>>> I understand that this is preexisting logic, but could I ask: why?
>>> What's wrong
>>> if driver can guarantee that created file is all-zero, but is not sure
>>> about
>>> file resizing? I agree that it's normal for these flags to have the same
>>> value,
>>> but what is the reason for this restriction?..
>>
>> If areas added by truncation (or growth, rather) are always zero, then
>> the file can always be created with size 0 and grown from there.  Thus,
>> images where truncation adds zeroed areas will generally always be zero
>> after creation.
>>
>>> So, the only possible combination of flags, when they differs, is
>>> create=0 and
>>> truncate=1.. How is it possible?
>>
>> For preallocated qcow2 images, it depends on the storage whether they
>> are actually 0 after creation.  Hence qcow2_has_zero_init() then defers
>> to bdrv_has_zero_init() of s->data_file->bs.
>>
>> But when you truncate them (with PREALLOC_MODE_OFF, as
>> BlockDriver.bdrv_has_zero_init_truncate()’s comment explains), the new
>> area is always going to be 0, regardless of initial preallocation.
>>
>>
>> I just noticed a bug there, though: Encrypted qcow2 images will not see
>> areas added through growth as 0.  Hence, qcow2’s
>> bdrv_has_zero_init_truncate() implementation should not return true
>> unconditionally, but only for unencrypted images.
> 
> Hence patch 5 earlier in the series :)

Ah, good. :-)

Max



* Re: [PATCH 00/17] Improve qcow2 all-zero detection
  2020-02-04 18:53   ` Eric Blake
@ 2020-02-05 17:04     ` Max Reitz
  2020-02-05 19:21       ` Eric Blake
  0 siblings, 1 reply; 73+ messages in thread
From: Max Reitz @ 2020-02-05 17:04 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: david.edmondson, qemu-block


[-- Attachment #1.1: Type: text/plain, Size: 6465 bytes --]

On 04.02.20 19:53, Eric Blake wrote:
> On 2/4/20 11:32 AM, Max Reitz wrote:
>> On 31.01.20 18:44, Eric Blake wrote:
>>> Based-on: <20200124103458.1525982-2-david.edmondson@oracle.com>
>>> ([PATCH v2 1/2] qemu-img: Add --target-is-zero to convert)
>>>
>>> I'm working on adding an NBD extension that reports whether an image
>>> is already all zero when the client first connects.  I initially
>>> thought I could write the NBD code to just call bdrv_has_zero_init(),
>>> but that turned out to be a bad assumption that instead resulted in
>>> this patch series.  The NBD patch will come later (and cross-posted to
>>> the NBD protocol, libnbd, nbdkit, and qemu, as it will affect all four
>>> repositories).
>>
>> We had a discussion about this on IRC, and as far as I remember I wasn’t
>> quite sold on the “why”.  So, again, I wonder why this is needed.
>>
>> I mean, it does make intuitive sense to want to know whether an image is
>> fully zero, but if I continue thinking about it I don’t know any case
>> where we would need to figure it out and where we could accept “We don’t
>> know” as an answer.  So I’m looking for use cases, but this cover letter
>> doesn’t mention any.  (And from a quick glance I don’t see this series
>> using the flag, actually.)
> 
> Patch 10/17 has:
> 
> diff --git a/qemu-img.c b/qemu-img.c
> index e60217e6c382..c8519a74f738 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -1985,10 +1985,12 @@ static int convert_do_copy(ImgConvertState *s)
>      int64_t sector_num = 0;
> 
>      /* Check whether we have zero initialisation or can get it
> efficiently */
> -    if (!s->has_zero_init && s->target_is_new && s->min_sparse &&
> -        !s->target_has_backing) {
> -        s->has_zero_init = !!(bdrv_known_zeroes(blk_bs(s->target)) &
> -                              BDRV_ZERO_CREATE);
> +    if (!s->has_zero_init && s->min_sparse && !s->target_has_backing) {
> +        ret = bdrv_known_zeroes(blk_bs(s->target));
> +        if (ret & BDRV_ZERO_OPEN ||
> +            (s->target_is_new && ret & BDRV_ZERO_CREATE)) {
> +            s->has_zero_init = true;
> +        }
>      }

OK, I expected users to come in a separate patch.

> That's the use case: when copying into a destination file, it's useful
> to know if the destination already reads as all zeroes, before
> attempting a fallback to bdrv_make_zero(BDRV_REQ_NO_FALLBACK) or calls
> to block status checking for holes.

But that was my point on IRC.  Is it really more useful if
bdrv_make_zero() is just as quick?  (And the fact that NBD doesn’t have
an implementation looks more like a problem with NBD to me.)

(Considering that at least the code we discussed on IRC didn’t work for
preallocated images, which was the one point where we actually have a
problem in practice.)

>> (We have a use case with convert -n to freshly created image files, but
>> my position on this on IRC was that we want the --target-is-zero flag
>> for that anyway: Auto-detection may always break, our preferred default
>> behavior may always change, so if you want convert -n not to touch the
>> target image except to write non-zero data from the source, we need a
>> --target-is-zero flag and users need to use it.  Well, management
>> layers, because I don’t think users would use convert -n anyway.
>>
>> And with --target-is-zero and users effectively having to use it, I
>> don’t think that’s a good example of a use case.)
> 
> Yes, there will still be cases where you have to use --target-is-zero
> because the image itself couldn't report that it already reads as
> zeroes, but there are also enough cases where the destination is already
> known to read zeroes and it's a shame to tell the user that 'you have to
> add --target-is-zero to get faster copying even though we could have
> inferred it on your behalf'.

How is it a shame?  I think only management tools would use convert -n.
Management tools want reliable behavior.  If you want reliable
behavior, you have to use --target-is-zero anyway.  So I don’t see the
actual benefit for qemu-img convert.

>> I suppose there is the point of blockdev-create + blockdev-mirror: This
>> has exactly the same problem as convert -n.  But again, if you really
>> want blockdev-mirror not just to force-zero the image, you probably need
>> to tell it so explicitly (i.e., with a --target-is-zero flag for
>> blockdev-mirror).
>>
>> (Well, I suppose we could save us a target-is-zero for mirror if we took
>> this series and had a filter driver that force-reports BDRV_ZERO_OPEN.
>> But, well, please no.)
>>
>> But maybe I’m just an idiot and there is no reason not to take this
>> series and make blockdev-create + blockdev-mirror do the sensible thing
>> by default in most cases. *shrug*
> 
> My argument for taking the series _is_ that the common case can be made
> more efficient without user effort.

The thing is, I don’t see the user effort.  I don’t think users use
convert -n or backup manually.  And for management tools, it isn’t
really effort to add another switch.

> Yes, we still need the knob for
> when the common case isn't already smart enough,

But the user can’t know when qemu isn’t smart enough.  So users who care
have to always give the flag.

> but the difference in
> avoiding a pre-zeroing pass is noticeable when copying images around

I’m sure it is, but the question I ask is whether in practice we
wouldn’t get --target-is-zero in all of these cases anyway.


So I’m not sold on “it works most of the time”, because if it’s just
most of the time, then we’ll likely see --target-is-zero all of the time.

OTOH, I suppose that with the new qcow2 extension, it would always work
for the following case:
(1) Create a qcow2 file,
(2) Immediately (with the next qemu-img/QMP invocation) use it as a
target of convert -n or mirror or anything similar.

If so, that means it works reliably all of the time for a common case.
I guess that’d be enough for me.
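
For reference, the check quoted from patch 10 above boils down to something like the following standalone sketch. The flag values and the function name are assumptions for illustration only; the real code works on ImgConvertState and bdrv_known_zeroes().

```c
#include <stdbool.h>

/* Illustrative stand-ins for BDRV_ZERO_CREATE / BDRV_ZERO_OPEN (values assumed). */
enum {
    DEMO_ZERO_CREATE = 0x1,
    DEMO_ZERO_OPEN   = 0x4,
};

/*
 * Decide whether the target can be treated as already zero, so that the
 * pre-zeroing pass can be skipped.  Mirrors the condition in the quoted
 * hunk: the OPEN bit is enough for any target, the CREATE bit only
 * counts for a target that was just created.
 */
static bool example_target_reads_zero(int known, bool target_is_new,
                                      bool min_sparse, bool target_has_backing)
{
    if (!min_sparse || target_has_backing) {
        return false;
    }
    if (known & DEMO_ZERO_OPEN) {
        return true;
    }
    return target_is_new && (known & DEMO_ZERO_CREATE);
}
```

In the create-then-convert -n case, target_is_new is false, so only the OPEN bit can help; that is the case the qcow2 extension would cover.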

Max

> (and more than just for qcow2 - my followup series to improve NBD is
> similarly useful given how much work has already been invested in
> mapping NBD into storage access over https in the upper layers like ovirt).
> 




* Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}
  2020-02-04 19:03     ` Eric Blake
@ 2020-02-05 17:22       ` Max Reitz
  2020-02-05 18:39         ` Eric Blake
  0 siblings, 1 reply; 73+ messages in thread
From: Max Reitz @ 2020-02-05 17:22 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, open list:Sheepdog, qemu-block, Jeff Cody,
	Stefan Weil, Peter Lieven, Richard W.M. Jones, Markus Armbruster,
	david.edmondson, Stefan Hajnoczi, Liu Yuan, Denis V. Lunev,
	Jason Dillaman


[-- Attachment #1.1: Type: text/plain, Size: 6929 bytes --]

On 04.02.20 20:03, Eric Blake wrote:
> On 2/4/20 11:53 AM, Max Reitz wrote:
>> On 31.01.20 18:44, Eric Blake wrote:
>>> Having two slightly-different function names for related purposes is
>>> unwieldy, especially since I envision adding yet another notion of
>>> zero support in an upcoming patch.  It doesn't help that
>>> bdrv_has_zero_init() is a misleading name (I originally thought that a
>>> driver could only return 1 when opening an already-existing image
>>> known to be all zeroes; but in reality many drivers always return 1
>>> because it only applies to a just-created image).
>>
>> I don’t find it misleading, I just find it meaningless, which then makes
>> it open to interpretation (or maybe rather s/interpretation/wishful
>> thinking/).
>>
>>> Refactor all uses
>>> to instead have a single function that returns multiple bits of
>>> information, with better naming and documentation.
>>
>> It doesn’t make sense to me.  How exactly is it unwieldy?  In the sense
>> that we have to deal with multiple rather small implementation functions
>> rather than a big one per driver?  Actually, multiple small functions
>> sounds better to me – unless the three implementations share common code.
> 
> Common code for dealing with encryption, backing files, and so on.  It
> felt like I had a lot of code repetition when keeping functions separate.

Well, I suppose “dealing with” means “if (encrypted || has_backing)”, so
duplicating that doesn’t seem too bad.

>> As for the callers, they only want a single flag out of the three, don’t
>> they?  If so, it doesn’t really matter for them.
> 
> The qemu-img.c caller in patch 10 checks ZERO_CREATE | ZERO_OPEN, so we
> DO have situations of checking more than one bit, vs. needing two
> function calls.

Hm, OK.  Not sure if that place would look worse with two function
calls, but, well.

>> In fact, I can imagine that drivers can trivially return
>> BDRV_ZERO_TRUNCATE information (because the preallocation mode is
>> fixed), whereas BDRV_ZERO_CREATE can be a bit more involved, and
>> BDRV_ZERO_OPEN could take even more time because some (constant-time)
>> inquiries have to be done.
> 
> In looking at the rest of the series, drivers were either completely
> trivial (in which case, declaring:
> 
> .bdrv_has_zero_init = bdrv_has_zero_init_1,
> .bdrv_has_zero_init_truncate = bdrv_has_zero_init_1,
> 
> was a lot wordier than the new:
> 
> .bdrv_known_zeroes = bdrv_known_zeroes_truncate,

Not sure if that’s bad, though.

> ), or completely spelled out but where both creation and truncation were
> determined in the same amount of effort.

Well, usually, the effort is minimal, but OK.

>> And thus callers which just want the trivially obtainable
>> BDRV_ZERO_TRUNCATE info have to wait for the BDRV_ZERO_OPEN inquiry,
>> even though they don’t care about that flag.
> 
> True, but only to a minor extent; and the documentation mentions that
> the BDRV_ZERO_OPEN calculation MUST NOT be as expensive as a blind
> block_status loop.

So it must be less expensive than an arbitrarily complex loop.  I think
a single SEEK_DATA/HOLE call was something like O(n) on tmpfs?

What I’m trying to say is that this is not a good limit and can mean
anything.

I do think this limit definition makes sense for callers that want to
know about ZERO_OPEN.  But I don’t know why we would have to let other
callers wait, too.

> Meanwhile, callers tend to only care about
> bdrv_known_zeroes() right after opening an image or right before
> resizing (not repeatedly during runtime);

Hm, yes.  I was thinking of parallels, but that only checks once in
parallels_open(), so it’s OK.

> and you also argued elsewhere
> in this thread that it may be worth having the block layer cache
> BDRV_ZERO_OPEN and update the cache on any write,

I didn’t say the block layer, but if it makes sense.

> at which point, the
> expense in the driver callback really is a one-time call during
> bdrv_co_open().

It definitely doesn’t make sense to me to do that call unconditionally
in bdrv_co_open().

> And in that case, whether the one-time expense is done
> via a single function call or via three driver callbacks, the amount of
> work is the same; but the driver callback interface is easier if there
> is only one callback (similar to how bdrv_unallocated_blocks_are_zero()
> calls bdrv_get_info() only for bdi.unallocated_blocks_are_zero, even
> though BlockDriverInfo tracks much more than that boolean).
> 
> In fact, it may be worth consolidating known zeroes support into
> BlockDriverInfo.

I’m very skeptical of that.  BDI already has the problem that it doesn’t
know which of the information the caller actually wants and that it is
sometimes used in a quasi-hot path.

Maybe that means it is indeed time to incorporate it into BDI, but the
caller should have a way of specifying what parts of BDI it actually
needs and then drivers can skip anything that isn’t trivially obtainable
that the caller doesn’t need.
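
A minimal sketch of what "the caller says what it needs" could look like, whether for BDI or for a single known-zeroes callback. Everything here (type, field and flag names, flag values) is hypothetical and only meant to show the shape of the interface.

```c
#include <stdbool.h>

/* Illustrative flag bits; the values are assumed, not taken from qemu. */
enum {
    EX_ZERO_CREATE   = 0x1,
    EX_ZERO_TRUNCATE = 0x2,
    EX_ZERO_OPEN     = 0x4,
};

typedef struct ExampleDriver {
    int cheap_flags;          /* CREATE/TRUNCATE bits known without any I/O */
    bool reads_zero_on_open;  /* what the slow probe would find */
    int expensive_probes;     /* how often the slow path actually ran */
} ExampleDriver;

/*
 * Return the subset of 'wanted' that is known to hold.  The potentially
 * slow OPEN probe runs only if the caller asked for that bit.
 */
static int example_known_zeroes(ExampleDriver *drv, int wanted)
{
    int result = drv->cheap_flags & wanted;

    if (wanted & EX_ZERO_OPEN) {
        drv->expensive_probes++;  /* stand-in for a metadata scan */
        if (drv->reads_zero_on_open) {
            result |= EX_ZERO_OPEN;
        }
    }
    return result;
}
```

A caller that only cares about truncation passes EX_ZERO_TRUNCATE and never pays for the OPEN inquiry.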

>> So I’d leave it as separate functions so drivers can feel free to have
>> implementations for BDRV_ZERO_OPEN that take more than mere microseconds
>> but that are more accurate.
>>
>> (Or maybe if you really want it to be a single functions, callers could
>> pass the mask of flags they care about.  If all flags are trivially
>> obtainable, the implementations would then simply create their result
>> mask and & it with the caller-given mask.  For implementations where
>> some branches could take a bit more time, those branches are only taken
>> when the caller cares about the given flag.  But again, I don’t
>> necessarily think having a single function is more easily handleable
>> than three smaller ones.)
> 
> Those are still viable options, but before I repaint the bikeshed along
> those lines, I'd at least like a review of whether the overall idea of
> having a notion of 'reads-all-zeroes' is indeed useful enough,
> regardless of how we implement it as one vs. three driver callbacks.

I’m as hesitant as ever to give a review that this notion is useful,
because I haven’t seen a practical example yet where the problem isn’t
the fact that NBD doesn’t have 64-bit write_zeroes support.

So far, it looks to me like this notion is only really useful for cases
where we expect a management layer on top of qemu anyway.  And then I’m
not sure that this new feature works reliably enough for such a
management layer.

(I’m not saying it isn’t useful.  Again, intuitively it does seem
useful.  Intuition can be enough to merge a sufficiently simple series
that doesn’t increase code complexity too much.  But I’m still asking
for actual practical examples, because that would make a better
argument, of course.)

Max



* Re: [PATCH 10/17] block: Add new BDRV_ZERO_OPEN flag
  2020-02-04 17:50     ` Eric Blake
  2020-02-05  8:39       ` Vladimir Sementsov-Ogievskiy
@ 2020-02-05 17:26       ` Max Reitz
  1 sibling, 0 replies; 73+ messages in thread
From: Max Reitz @ 2020-02-05 17:26 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block


[-- Attachment #1.1: Type: text/plain, Size: 3068 bytes --]

On 04.02.20 18:50, Eric Blake wrote:
> On 2/4/20 11:34 AM, Max Reitz wrote:
> 
>>> +++ b/include/block/block.h
>>> @@ -105,6 +105,16 @@ typedef enum {
>>>        * for drivers that set .bdrv_co_truncate.
>>>        */
>>>       BDRV_ZERO_TRUNCATE      = 0x2,
>>> +
>>> +    /*
>>> +     * bdrv_known_zeroes() should include this bit if an image is
>>> +     * known to read as all zeroes when first opened; this bit should
>>> +     * not be relied on after any writes to the image.
>>
>> Is there a good reason for this?  Because to me this screams like we are
>> going to check this flag without ensuring that the image has actually
>> not been written to yet.  So if it’s generally easy for drivers to stop
>> reporting this flag after a write, then maybe we should do so.
> 
> In patch 15 (implementing things in qcow2), I actually wrote the driver
> to return live results, rather than just open-time results, in part
> because writing the bit to persistent storage in qcow2 means that the
> bit must be accurate, without relying on the block layer's help.
> 
> But my pending NBD patch (not posted yet, but will be soon), the
> proposal I'm making for the NBD protocol itself is just open-time, not
> live, and so it would be more work than necessary to make the NBD driver
> report live results.
> 
> But it seems like it should be easy enough to also patch the block layer
> itself to guarantee that callers of bdrv_known_zeroes() cannot see this
> bit set if the block layer has been used in any non-zero transaction, by
> repeating the same logic as used in qcow2 to kill the bit (any
> write/write_compressed/bdrv_copy clear the bit, any trim clears the bit
> if the driver does not guarantee trim reads as zero, any truncate clears
> the bit if the driver does not guarantee truncate reads as zero, etc).
> Basically, the block layer would cache the results of .bdrv_known_zeroes
> during .bdrv_co_open, bdrv_co_pwrite() and friends would update that
> cache, and and bdrv_known_zeroes() would report the cached value rather
> than a fresh call to .bdrv_known_zeroes.

Sounds reasonable to me in general, but I’d prefer it to be fetched
on-demand rather than unconditionally in bdrv_open().

(I realize that this means a tri-state of “known false”, “known true”,
and “not yet queried”.)
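
Roughly, the tri-state could be sketched like this. The names are made up and this is not block-layer code; the probe callback stands in for the driver's .bdrv_known_zeroes.

```c
#include <stdbool.h>

typedef enum {
    EX_ZERO_NOT_QUERIED,  /* driver has not been asked yet */
    EX_ZERO_KNOWN_FALSE,
    EX_ZERO_KNOWN_TRUE,
} ExampleZeroState;

typedef struct ExampleNode {
    ExampleZeroState cached;
    bool (*probe)(struct ExampleNode *node);  /* possibly slow driver query */
    int probe_calls;
} ExampleNode;

/* Example probe for an image that reads as all zeroes when opened. */
static bool example_probe_fresh_image(ExampleNode *node)
{
    (void)node;
    return true;
}

/* Query on demand: the driver is asked at most once, on first use. */
static bool example_reads_all_zero(ExampleNode *node)
{
    if (node->cached == EX_ZERO_NOT_QUERIED) {
        node->probe_calls++;
        node->cached = node->probe(node) ? EX_ZERO_KNOWN_TRUE
                                         : EX_ZERO_KNOWN_FALSE;
    }
    return node->cached == EX_ZERO_KNOWN_TRUE;
}

/*
 * Called from the write path.  Conservatively drops to "known false" on
 * any write, without checking whether the written data was zero.
 */
static void example_note_write(ExampleNode *node)
{
    node->cached = EX_ZERO_KNOWN_FALSE;
}
```

A write before the first query simply pins the state to "known false", so the driver is never asked at all in that case.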

> Are we worried enough about clients of this interface to make the block
> layer more robust?  (From the maintenance standpoint, the more the block
> layer guarantees, the easier it is to write code that uses the block
> layer; but there is the counter-argument that making the block layer
> track whether an image has been modified means a [slight] penalty to
> every write request to update the boolean.)

Just like Vladimir, I’m worried about repeating the same mistakes we
have made before: That is, most places that called bdrv_has_zero_init() just
did so out of wishful thinking, hoping that it would do what they need
it to.  It didn’t.

Max



* Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}
  2020-02-05 14:07         ` Eric Blake
  2020-02-05 14:25           ` Vladimir Sementsov-Ogievskiy
@ 2020-02-05 17:55           ` Max Reitz
  1 sibling, 0 replies; 73+ messages in thread
From: Max Reitz @ 2020-02-05 17:55 UTC (permalink / raw)
  To: Eric Blake, Vladimir Sementsov-Ogievskiy, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, open list:Sheepdog, qemu-block, Jeff Cody,
	Stefan Weil, Peter Lieven, Richard W.M. Jones, Markus Armbruster,
	david.edmondson, Stefan Hajnoczi, Denis V. Lunev, Liu Yuan,
	Jason Dillaman


[-- Attachment #1.1: Type: text/plain, Size: 4798 bytes --]

On 05.02.20 15:07, Eric Blake wrote:
> On 2/5/20 1:51 AM, Vladimir Sementsov-Ogievskiy wrote:
> 
>>>>> +typedef enum {
>>>>> +    /*
>>>>> +     * bdrv_known_zeroes() should include this bit if the contents of
>>>>> +     * a freshly-created image with no backing file reads as all
>>>>> +     * zeroes without any additional effort.  If .bdrv_co_truncate is
>>>>> +     * set, then this must be clear if BDRV_ZERO_TRUNCATE is clear.
>>>>
>>>> I understand that this is preexisting logic, but could I ask: why?
>>>> What's wrong
>>>> if driver can guarantee that created file is all-zero, but is not sure
>>>> about
>>>> file resizing? I agree that it's normal for these flags to have the
>>>> same
>>>> value,
>>>> but what is the reason for this restriction?..
>>>
>>> If areas added by truncation (or growth, rather) are always zero, then
>>> the file can always be created with size 0 and grown from there.  Thus,
>>> images where truncation adds zeroed areas will generally always be zero
>>> after creation.
>>
>> This means, that if truncation bit is set, than create bit should be
>> set.. But
>> here we say that if truncation is clear, than create bit must be clear.
> 
> Max, did we get the logic backwards?

Or maybe my explanation was just wrong.

Because nobody actually forces a driver to use truncate to ensure that
a newly created file will be 0.  Hm.  And more importantly, you can’t
use truncate with PREALLOC_MODE_OFF when you want to create an image
with preallocation.

Let’s see.  The offending commit message says:

> No .bdrv_has_zero_init() implementation returns 1 if growing the file
> would add non-zero areas (at least with PREALLOC_MODE_OFF), so using it
> in lieu of this new function was always safe.
> 
> But on the other hand, it is possible that growing an image that is not
> zero-initialized would still add a zero-initialized area, like when
> using nonpreallocating truncation on a preallocated image.  For callers
> that care only about truncation, not about creation with potential
> preallocation, this new function is useful.

So I suppose the explanation is just the preallocation mode alone;
has_zero_init() is for the image’s actual preallocation mode, whereas
has_zero_init_truncate() is forced to PREALLOC_MODE_OFF.  As such, the
latter is less strict than the former.  So the former cannot be true
when the latter is false.
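
Spelled out as a check, the rule from the enum comment would look something like this. The helper and the flag values are hypothetical; nothing like it exists in the series as far as this thread shows.

```c
#include <stdbool.h>

/* Illustrative flag bits; the values are assumed. */
enum {
    SKETCH_ZERO_CREATE   = 0x1,
    SKETCH_ZERO_TRUNCATE = 0x2,
};

/*
 * For a driver that implements truncation, the CREATE bit may only be
 * set if the TRUNCATE bit is set as well.  The reverse is allowed:
 * TRUNCATE without CREATE is the preallocated-qcow2 case.
 */
static bool example_zero_flags_consistent(int flags, bool has_truncate)
{
    if (!has_truncate) {
        return true;  /* the rule applies only if .bdrv_co_truncate is set */
    }
    return !(flags & SKETCH_ZERO_CREATE) || (flags & SKETCH_ZERO_TRUNCATE);
}
```

So create=0/truncate=1 passes, while create=1/truncate=0 is the one combination that gets rejected.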

>>>> So, the only possible combination of flags, when they differs, is
>>>> create=0 and
>>>> truncate=1.. How is it possible?
>>>
>>> For preallocated qcow2 images, it depends on the storage whether they
>>> are actually 0 after creation.  Hence qcow2_has_zero_init() then defers
>>> to bdrv_has_zero_init() of s->data_file->bs.
>>>
>>> But when you truncate them (with PREALLOC_MODE_OFF, as
>>> BlockDriver.bdrv_has_zero_init_truncate()’s comment explains), the new
>>> area is always going to be 0, regardless of initial preallocation.
>>
>> ah yes, due to qcow2 zero clusters.
> 
> Hmm. Do we actually set the zero flag on unallocated clusters when
> resizing a qcow2 image?

No.  They are just unallocated, i.e. zero.  (Nodes with backing files
never return true for bdrv_has_zero_init_truncate anyway).

> That would be an O(n) operation (we have to
> visit the L2 entry for each added cluster, even if only to set the zero
> cluster bit).  Or do we instead just rely on the fact that qcow2 is
> inherently sparse, and that when you resize the guest-visible size
> without writing any new clusters, then it is only subsequent guest
> access to those addresses that finally allocate clusters, making resize
> O(1) (update the qcow2 metadata cluster, but not any L2 tables) while
> still reading 0 from the new data.  To some extent, that's what the
> allocation mode is supposed to control.
> 
> What about with external data images, where a resize in guest-visible
> length requires a resize of the underlying data image?  There, we DO
> have to worry about whether the data image resizes with zeroes (as in
> the filesystem) or with random data (as in a block device).

Well, partially: Namely, only with data_file_raw.  Because otherwise the
clusters are still unallocated and thus read as zero.  So yes, then we
do have to worry about that.

With data_file_raw, we have an obligation to make the data file return
the same data as the qcow2 file, so, um.  I wonder whether we actually
take any care of this yet.  If you have some external data file without
zero_init(_truncate), do you get zeroes when reading from the qcow2 node,
but non-zeroes when reading from the raw data file?  That would be OK
without data_file_raw, but not with it.  I suppose I’ll have to test it.

Max



* Re: [PATCH 00/17] Improve qcow2 all-zero detection
  2020-02-05 15:14         ` Vladimir Sementsov-Ogievskiy
@ 2020-02-05 17:58           ` Max Reitz
  0 siblings, 0 replies; 73+ messages in thread
From: Max Reitz @ 2020-02-05 17:58 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Eric Blake, qemu-devel
  Cc: david.edmondson, qemu-block


[-- Attachment #1.1: Type: text/plain, Size: 3269 bytes --]

On 05.02.20 16:14, Vladimir Sementsov-Ogievskiy wrote:
> 05.02.2020 17:47, Vladimir Sementsov-Ogievskiy wrote:
>> 05.02.2020 17:26, Eric Blake wrote:
>>> On 2/5/20 3:25 AM, Vladimir Sementsov-Ogievskiy wrote:
>>>
>>>>> 3. For qcow2
>>>>> Hmm. Here, as I understand, than main case is freshly created qcow2,
>>>>> which is fully-unallocated. To understand that it is empty, we
>>>>> need only to check all L1 entries. And for empty L1 table it is fast.
>>>>> So we don't need any qcow2 format improvement to check it.
>>>>>
>>>>
>>>> Ah yes, I forget about preallocated case. Hmm. For preallocated
>>>> clusters,
>>>> we have zero bits in L2 entries. And with them, we even don't need
>>>> preallocated to be filled by zeros, as we never read them (but just
>>>> return
>>>> zeros on read)..
>>>
>>> Scanning all L2 entries is O(n), while an autoclear bit properly
>>> maintained is O(1).
>>>
>>>>
>>>> Then, may be we want similar flag for L1 entry (this will enable large
>>>> fast write-zero). And may be we want flag which marks the whole image
>>>> as read-zero (it's your flag). So, now I think, my previous idea
>>>> of "all allocated is zero" is worse. As for fully-preallocated images
>>>> we are sure that all clusters are allocated, and it is more native to
>>>> have flags similar to ZERO bit in L2 entry.
>>>
>>> Right now, we don't have any L1 entry flags.  Adding one would
>>> require adding an incompatible feature flag (if older qemu would
>>> choke to see unexpected flags in an L1 entry), or at best an
>>> autoclear feature flag (if the autoclear bit gets cleared because an
>>> older qemu opened the image and couldn't maintain L1 entry flags
>>> correctly, then newer qemu knows it cannot trust those L1 entry
>>> flags).  But as soon as you are talking about adding a feature bit,
>>> then why add one that still requires O(n) traversal to check (true,
>>> the 'n' in an O(n) traversal of L1 tables is much smaller than the
>>> 'n' in an O(n) traversal of L2 tables), when you can instead just add
>>> an O(1) autoclear bit that maintains all_zero status for the image as
>>> a whole?
>>>
>>
>> My suggestion about L1 entry flag is side thing, I understand
>> difference between O(n) and O(1) :) Still additional L1 entry will
>> help to make efficient large block-status and write-zero requests.
>>
>> And I agree that we need top level flag.. I just try to say, that it
>> seems good to make it similar with existing L2 flag. But yes, it would
>> be incomaptible change, as it marks all clusters as ZERO, and older
>> Qemu can't understand it and may treat all clusters as unallocated.
>>
> 
> Still, how long is this O(n) ? We load the whole L1 into memory anyway.
> For example, 16Tb disk with 64K granularity, we'll have 32768 L1
> entries. Will we get sensible performance benefit with an extension? I
> doubt in it now. And anyway, if we have an extension, we should fallback
> to this O(n) if we don't have the flag set.

(Sorry, it’s late and I haven’t followed this particular conversation
too closely, but:)

Keep in mind that the default metadata overlap protection mode causes
all L1 entries to be scanned on each I/O write.  So it can’t be that bad.

Max



* Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}
  2020-02-05 17:22       ` Max Reitz
@ 2020-02-05 18:39         ` Eric Blake
  2020-02-06  9:18           ` Max Reitz
  0 siblings, 1 reply; 73+ messages in thread
From: Eric Blake @ 2020-02-05 18:39 UTC (permalink / raw)
  To: Max Reitz, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, open list:Sheepdog, qemu-block, Jeff Cody,
	Stefan Weil, Peter Lieven, Richard W.M. Jones, Markus Armbruster,
	david.edmondson, Stefan Hajnoczi, Liu Yuan, Denis V. Lunev,
	Jason Dillaman

On 2/5/20 11:22 AM, Max Reitz wrote:

> 
>>> And thus callers which just want the trivially obtainable
>>> BDRV_ZERO_TRUNCATE info have to wait for the BDRV_ZERO_OPEN inquiry,
>>> even though they don’t care about that flag.
>>
>> True, but only to a minor extent; and the documentation mentions that
>> the BDRV_ZERO_OPEN calculation MUST NOT be as expensive as a blind
>> block_status loop.
> 
> So it must be less expensive than an arbitrarily complex loop.  I think
> a single SEEK_DATA/HOLE call was something like O(n) on tmpfs?

If I recall, the tmpfs bug was that it was O(n) where n was based on the 
initial offset and the number of extents prior to that offset.  The 
probe at offset 0 is O(1) (because there are no prior extents), whether 
it reaches the end of the file (the entire image is a hole) or hits data 
beforehand.  It is only probes at later offsets where the speed penalty 
sets in, and where an O(n) loop over all extents turned into an O(n^2) 
traversal time due to the O(n) nature of each later lookup.
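
The single probe at offset 0 described above can be sketched as
follows.  This is only an illustration using lseek(SEEK_DATA), not the
code from this series; the helper names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: true only if one SEEK_DATA probe at offset 0
 * proves there is no data extent anywhere in the file (ENXIO).  Any
 * other outcome, including filesystems that report everything as data,
 * gives false, i.e. "not known to be zero", which is the safe answer.
 * Note that a successful lseek() moves the file offset. */
static bool fd_is_one_big_hole(int fd)
{
    off_t data = lseek(fd, 0, SEEK_DATA);

    return data == (off_t)-1 && errno == ENXIO;
}

/* Convenience wrapper that opens the file read-only for the probe. */
static bool path_is_one_big_hole(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    bool ret;

    if (fd < 0) {
        return false;
    }
    ret = fd_is_one_big_hole(fd);
    close(fd);
    return ret;
}
```

The point is that no loop is involved: the probe either proves the
whole file is a hole, or it gives up without looking at later offsets.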

> 
> What I’m trying to say is that this is not a good limit and can mean
> anything.
> 
> I do think this limit definition makes sense for callers that want to
> know about ZERO_OPEN.  But I don’t know why we would have to let other
> callers wait, too.

Keeping separate functions may still be the right approach for v2, 
although I'd still like to name things better ('has_zero_init' vs. 
'has_zero_init_truncate' did not work well for me).  And if I'm renaming 
things, then I'm touching just as much code whether I rename and keep 
separate functions or rename and consolidate into one.

> 
>> Meanwhile, callers tend to only care about
>> bdrv_known_zeroes() right after opening an image or right before
>> resizing (not repeatedly during runtime);
> 
> Hm, yes.  I was thinking of parallels, but that only checks once in
> parallels_open(), so it’s OK.
> 
>> and you also argued elsewhere
>> in this thread that it may be worth having the block layer cache
>> BDRV_ZERO_OPEN and update the cache on any write,
> 
> I didn’t say the block layer, but if it makes sense.
> 
>> at which point, the
>> expense in the driver callback really is a one-time call during
>> bdrv_co_open().
> 
> It definitely doesn’t make sense to me to do that call unconditionally
> in bdrv_co_open().

Okay, you have a point there - while 'qemu-img convert' cares, not all 
clients of bdrv_co_open() are worried about whether the existing 
contents are zero; so unconditionally priming a cache during 
bdrv_co_open is not as wise as doing things when it will actually be 
useful information.  On the other hand, if it is something that clients 
only use when first opening an image, caching data doesn't make much 
sense either.

So, we know that bdrv_has_zero_init() is only viable on a just-created 
image, and bdrv_has_zero_init_truncate() is only viable if you are about 
to resize an image using bdrv_co_truncate(PREALLOC_MODE_OFF).

Hmm - thinking aloud: our ultimate goal is that we want to make it 
easier for algorithms that can be sped up IF the image is currently 
known to be all zero.  Maybe what this means is that we really want to 
be tweaking bdrv_make_zero() to do all the work, something along the 
lines of:
- if the image is known to already be all zeroes using an O(1) probe 
(this includes if the image was freshly created and creation sees all 
zeroes, or if a block_status at offset 0 shows a hole for the entire 
image, or if an NBD extension advertises all zero at connection 
time...), return success
- if the image has a FAST truncate, and resizing reads zeroes, we can 
truncate to size 0 and back to the desired size, then return success; 
determining if truncate is fast should be similar to how 
BDRV_REQ_NO_FALLBACK determines whether write zeroes is fast
- if the image supports BDRV_REQ_NO_FALLBACK with write zeroes, we can 
request a write zeroes over the whole image, which will either succeed 
(the image is now quickly zero) or fail (writing zeroes as we go is the 
best we can do)
- if the image could report that it is all zeroes, but only at the cost 
of O(n) work such as a loop over block_status (or even O(n^2) with the 
tmpfs lseek bug), it's easier to report failure than to worry about 
making the image read all zeroes

qemu-img would then only ever need to consult --target-is-zero and 
bdrv_make_zero(), and not worry about any other function calls; while 
the block layer would take care of coordinating whatever other call 
sequences make the most sense in reporting success or failure in getting 
the image into an all-zero state if it was not already there.
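
The cascade listed above could look roughly like the sketch below.
Everything in it is hypothetical scaffolding (the Image type, its
fields, the callbacks and the return codes); it is not QEMU's
BlockDriver interface, just the order of the checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a block node; not a QEMU type. */
typedef struct Image Image;
struct Image {
    long long size;
    bool known_zero;            /* result of an O(1) probe, if any */
    bool truncate_is_fast;      /* truncate does not need O(n) work */
    bool truncate_reads_zero;   /* grown area reads back as zeroes */
    int (*truncate)(Image *img, long long new_size);
    int (*write_zeroes_nofallback)(Image *img); /* fails if it'd be slow */
};

enum { MADE_ZERO = 0, NOT_FAST = -1, BROKEN = -2 };

static int make_zero_fast(Image *img)
{
    assert(img != NULL);

    /* 1. Already known to read as zeroes: nothing to do. */
    if (img->known_zero) {
        return MADE_ZERO;
    }
    /* 2. Fast truncate that reads zeroes: shrink to 0, then regrow. */
    if (img->truncate && img->truncate_is_fast && img->truncate_reads_zero) {
        long long size = img->size;

        if (img->truncate(img, 0) == 0) {
            /* A failed regrow leaves the image short: hard error. */
            if (img->truncate(img, size) != 0) {
                return BROKEN;
            }
            img->known_zero = true;
            return MADE_ZERO;
        }
    }
    /* 3. Whole-image write zeroes that refuses to fall back to slow I/O. */
    if (img->write_zeroes_nofallback &&
        img->write_zeroes_nofallback(img) == 0) {
        img->known_zero = true;
        return MADE_ZERO;
    }
    /* 4. Anything else would be O(n); let the caller write zeroes. */
    return NOT_FAST;
}

/* Demo callbacks, only here so the sketch can be exercised. */
static int demo_truncate(Image *img, long long new_size)
{
    img->size = new_size;
    return 0;
}

static int demo_write_zeroes_slow(Image *img)
{
    (void)img;
    return -1;
}
```

A caller such as a copy loop would then skip explicit zero writes only
when the result is MADE_ZERO.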


> 
>> And in that case, whether the one-time expense is done
>> via a single function call or via three driver callbacks, the amount of
>> work is the same; but the driver callback interface is easier if there
>> is only one callback (similar to how bdrv_unallocated_blocks_are_zero()
>> calls bdrv_get_info() only for bdi.unallocated_blocks_are_zero, even
>> though BlockDriverInfo tracks much more than that boolean).
>>
>> In fact, it may be worth consolidating known zeroes support into
>> BlockDriverInfo.
> 
> I’m very skeptical of that.  BDI already has the problem that it doesn’t
> know which of the information the caller actually wants and that it is
> sometimes used in a quasi-hot path.
> 
> Maybe that means it is indeed time to incorporate it into BDI, but the
> caller should have a way of specifying what parts of BDI it actually
> needs and then drivers can skip anything that isn’t trivially obtainable
> that the caller doesn’t need.

I'm reminded of the recent kernel addition of statx(); the traditional 
stat/fstat interfaces really don't know which bits of information you 
care about, so you get everything, but with statx(), you can request 
only what you plan to use, which may indeed result in speedups.


>> Those are still viable options, but before I repaint the bikeshed along
>> those lines, I'd at least like a review of whether the overall idea of
>> having a notion of 'reads-all-zeroes' is indeed useful enough,
>> regardless of how we implement it as one vs. three driver callbacks.
> 
> I’m as hesitant as ever to give a review that this notion is useful,
> because I haven’t seen a practical example yet where the problem isn’t
> the fact that NBD doesn’t have 64-bit write_zeroes support.

Even if the NBD protocol gains 64-bit write_zeroes, not all NBD servers 
will be compliant with that extension.  A known-zeroes notion will still 
speed up operations when talking to older servers which do not support 
64-bit write_zeroes, even if newer qemu is never such a server.

> 
> So far, it looks to me like this notion is only really useful for cases
> where we expect a management layer on top of qemu anyway.  And then I’m
> not sure that this new feature works reliably enough for such a
> management layer.
> 
> (I’m not saying it isn’t useful.  Again, intuitively it does seem
> useful.  Intuition can be enough to merge a sufficiently simple series
> that doesn’t increase code complexity too much.  But I’m still asking
> for actual practical examples, because that would make a better
> argument, of course.)

I'm hoping when I post my NBD patches that I can also provide some 
benchmark timing numbers to make the case a bit more concrete.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH 00/17] Improve qcow2 all-zero detection
  2020-02-05 17:04     ` Max Reitz
@ 2020-02-05 19:21       ` Eric Blake
  2020-02-06  9:12         ` Max Reitz
  0 siblings, 1 reply; 73+ messages in thread
From: Eric Blake @ 2020-02-05 19:21 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: david.edmondson, qemu-block

On 2/5/20 11:04 AM, Max Reitz wrote:
> OK, I expected users to come in a separate patch.

I can refactor that better in v2.

> 
>> That's the use case: when copying into a destination file, it's useful
>> to know if the destination already reads as all zeroes, before
>> attempting a fallback to bdrv_make_zero(BDRV_REQ_NO_FALLBACK) or calls
>> to block status checking for holes.
> 
> But that was my point on IRC.  Is it really more useful if
> bdrv_make_zero() is just as quick?  (And the fact that NBD doesn’t have
> an implementation looks more like a problem with NBD to me.)

That is indeed a thought - why should qemu-img even TRY to call 
bdrv_has_zero_init(); it should instead call bdrv_make_zero(), which 
should be as fast as possible (see my latest reply on 9/17 exploring 
that idea more).  Under the hood, we can then make bdrv_make_zero() use 
whatever tricks it needs, whether keeping the driver's 
.bdrv_has_zero_init/_truncate callbacks, adding another callback, making 
write_zeroes faster, or whatever, but instead of making qemu-img sort 
through multiple ideas, the burden would now be hidden in the block layer.

> 
> (Considering that at least the code we discussed on IRC didn’t work for
> preallocated images, which was the one point where we actually have a
> problem in practice.)

And this series DOES improve the case for preallocated qcow2 images, by 
virtue of a new qcow2 autoclear bit.  But again, that may be something 
we want to hide in the driver callback interfaces, while qemu-img just 
blindly calls bdrv_make_zero() and gets success (the image now reads as 
zeroes, either because it was already that way or we did something 
quick) or failure (it is a waste of time to prezero, whether by 
write_zeroes or by trim or by truncate, so just manually write zeroes as 
part of your image copying).

> 
>>> (We have a use case with convert -n to freshly created image files, but
>>> my position on this on IRC was that we want the --target-is-zero flag
>>> for that anyway: Auto-detection may always break, our preferred default
>>> behavior may always change, so if you want convert -n not to touch the
>>> target image except to write non-zero data from the source, we need a
>>> --target-is-zero flag and users need to use it.  Well, management
>>> layers, because I don’t think users would use convert -n anyway.
>>>
>>> And with --target-is-zero and users effectively having to use it, I
>>> don’t think that’s a good example of a use case.)
>>
>> Yes, there will still be cases where you have to use --target-is-zero
>> because the image itself couldn't report that it already reads as
>> zeroes, but there are also enough cases where the destination is already
>> known to read zeroes and it's a shame to tell the user that 'you have to
>> add --target-is-zero to get faster copying even though we could have
>> inferred it on your behalf'.
> 
> How is it a shame?  I think only management tools would use convert -n.
>   Management tools want reliable behavior.  If you want reliable
> behavior, you have to use --target-is-zero anyway.  So I don’t see the
> actual benefit for qemu-img convert.

qemu-img convert to an NBD destination cannot create the destination, so 
it ALWAYS has to use -n.  I don't know how often users are likely to 
wire up a command line for qemu-img convert with NBD as the destination, 
or whether you are correct that it will be a management app able to 
supply -n (and thus able to supply --target-is-zero).  But at the same 
time, can a management app learn whether it is safe to supply 
--target-is-zero?  With my upcoming NBD patches, 'qemu-nbd --list' will 
show whether the NBD target is known to start life all zero, and a 
management app could use that mechanism to decide when to pass 
--target-is-zero (whether a management app would actually fork qemu-nbd 
--list, or just be an NBD client itself to do the same thing qemu-nbd 
would do, is beside the point).

Similarly, this series includes enhancements to 'qemu-img info' on qcow2 
images known to currently read as zero.  Again, that sort of information 
is necessary somewhere in the chain: whether qemu-img consumes the 
information itself, or the management app consumes it in order to pass 
the --target-is-zero option to qemu-img, the information needs to be 
available for consumption.

> 
>>> I suppose there is the point of blockdev-create + blockdev-mirror: This
>>> has exactly the same problem as convert -n.  But again, if you really
>>> want blockdev-mirror not just to force-zero the image, you probably need
>>> to tell it so explicitly (i.e., with a --target-is-zero flag for
>>> blockdev-mirror).
>>>
>>> (Well, I suppose we could save us a target-is-zero for mirror if we took
>>> this series and had a filter driver that force-reports BDRV_ZERO_OPEN.
>>> But, well, please no.)
>>>
>>> But maybe I’m just an idiot and there is no reason not to take this
>>> series and make blockdev-create + blockdev-mirror do the sensible thing
>>> by default in most cases. *shrug*
>>
>> My argument for taking the series _is_ that the common case can be made
>> more efficient without user effort.
> 
> The thing is, I don’t see the user effort.  I don’t think users use
> convert -n or backup manually.  And for management tools, it isn’t
> really effort to add another switch.

Maybe, but it is just shifting the burden between who consumes the 
information that an image is all zero, and how the consumption of that 
information is passed to qemu-img.

> 
>> Yes, we still need the knob for
>> when the common case isn't already smart enough,
> 
> But the user can’t know when qemu isn’t smart enough.  So users who care
> have to always give the flag.
> 
>> but the difference in
>> avoiding a pre-zeroing pass is noticeable when copying images around
> 
> I’m sure it is, but the question I ask is whether in practice we
> wouldn’t get --target-is-zero in all of these cases anyway.
> 
> 
> So I’m not sold on “it works most of the time”, because if it’s just
> most of the time, then we’ll likely see --target-is-zero all of the time.
> 
> OTOH, I suppose that with the new qcow2 extension, it would always work
> for the following case:
> (1) Create a qcow2 file,
> (2) Immediately (with the next qemu-img/QMP invocation) use it as a
> target of convert -n or mirror or anything similar.

Yes, that is one of the immediately obvious fallouts from this series - 
you can now create a preallocated qcow2 image in one process, and the 
next process using that image can readily tell that it is still 
just-created.
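
To illustrate why an autoclear bit survives from the creating process
to the next one, yet cannot be trusted after an older qemu has written
to the image, here is a small sketch.  The bit number chosen for the
all-zero flag is an assumption for illustration only, and the helper
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit position, for illustration only. */
#define AUTOCLEAR_ALL_ZERO  (UINT64_C(1) << 2)

/* Autoclear rule: a writer drops every bit it does not understand when
 * it opens the image read-write. */
static uint64_t open_rw_autoclear(uint64_t on_disk, uint64_t understood)
{
    return on_disk & understood;
}

/* A writer that does understand the bit must clear it before the image
 * stops reading as all zeroes. */
static uint64_t after_data_write(uint64_t autoclear)
{
    return autoclear & ~AUTOCLEAR_ALL_ZERO;
}

/* O(1) check available to the next process that opens the image. */
static bool image_known_zero(uint64_t autoclear)
{
    return (autoclear & AUTOCLEAR_ALL_ZERO) != 0;
}
```

An image created with the bit set and then opened by a newer qemu keeps
the bit; one that passed through an older qemu (understood == 0 for
this bit) loses it, so the fast path is simply not taken.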

> 
> If so, that means it works reliably all of the time for a common case.
> I guess that’d be enough for me.
> 
> Max
> 
>> (and more than just for qcow2 - my followup series to improve NBD is
>> similarly useful given how much work has already been invested in
>> mapping NBD into storage access over https in the upper layers like ovirt).
>>
> 
> 

At any rate, I think you've convinced me to rethink how I present v2 
(maybe not by refactoring bdrv_known_zeroes usage, but instead 
refactoring bdrv_make_zero), but that the qcow2 autoclear bit is still a 
useful feature to have.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH 00/17] Improve qcow2 all-zero detection
  2020-02-05 19:21       ` Eric Blake
@ 2020-02-06  9:12         ` Max Reitz
  0 siblings, 0 replies; 73+ messages in thread
From: Max Reitz @ 2020-02-06  9:12 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: david.edmondson, qemu-block



On 05.02.20 20:21, Eric Blake wrote:
> On 2/5/20 11:04 AM, Max Reitz wrote:
>> OK, I expected users to come in a separate patch.
> 
> I can refactor that better in v2.
> 
>>
>>> That's the use case: when copying into a destination file, it's useful
>>> to know if the destination already reads as all zeroes, before
>>> attempting a fallback to bdrv_make_zero(BDRV_REQ_NO_FALLBACK) or calls
>>> to block status checking for holes.
>>
>> But that was my point on IRC.  Is it really more useful if
>> bdrv_make_zero() is just as quick?  (And the fact that NBD doesn’t have
>> an implementation looks more like a problem with NBD to me.)
> 
> That is indeed a thought - why should qemu-img even TRY to call
> bdrv_has_zero_init(); it should instead call bdrv_make_zero(), which
> should be as fast as possible (see my latest reply on 9/17 exploring
> that idea more).  Under the hood, we can then make bdrv_make_zero() use
> whatever tricks it needs, whether keeping the driver's
> .bdrv_has_zero_init/_truncate callbacks, adding another callback, making
> write_zeroes faster, or whatever, but instead of making qemu-img sort
> through multiple ideas, the burden would now be hidden in the block layer.

I didn’t even think of that.  More on that at the very bottom of this mail.

>> (Considering that at least the code we discussed on IRC didn’t work for
>> preallocated images, which was the one point where we actually have a
>> problem in practice.)
> 
> And this series DOES improve the case for preallocated qcow2 images, by
> virtue of a new qcow2 autoclear bit.  But again, that may be something
> we want to hide in the driver callback interfaces, while qemu-img just
> blindly calls bdrv_make_zero() and gets success (the image now reads as
> zeroes, either because it was already that way or we did something
> quick) or failure (it is a waste of time to prezero, whether by
> write_zeroes or by trim or by truncate, so just manually write zeroes as
> part of your image copying).

Oh, yes, indeed.  Sorry.

>>>> (We have a use case with convert -n to freshly created image files, but
>>>> my position on this on IRC was that we want the --target-is-zero flag
>>>> for that anyway: Auto-detection may always break, our preferred default
>>>> behavior may always change, so if you want convert -n not to touch the
>>>> target image except to write non-zero data from the source, we need a
>>>> --target-is-zero flag and users need to use it.  Well, management
>>>> layers, because I don’t think users would use convert -n anyway.
>>>>
>>>> And with --target-is-zero and users effectively having to use it, I
>>>> don’t think that’s a good example of a use case.)
>>>
>>> Yes, there will still be cases where you have to use --target-is-zero
>>> because the image itself couldn't report that it already reads as
>>> zeroes, but there are also enough cases where the destination is already
>>> known to read zeroes and it's a shame to tell the user that 'you have to
>>> add --target-is-zero to get faster copying even though we could have
>>> inferred it on your behalf'.
>>
>> How is it a shame?  I think only management tools would use convert -n.
>>   Management tools want reliable behavior.  If you want reliable
>> behavior, you have to use --target-is-zero anyway.  So I don’t see the
>> actual benefit for qemu-img convert.
> 
> qemu-img convert to an NBD destination cannot create the destination, so
> it ALWAYS has to use -n.  I don't know how often users are likely to
> wire up a command line for qemu-img convert with NBD as the destination,
> or whether you are correct that it will be a management app able to
> supply -n (and thus able to supply --target-is-zero).  But at the same
> time, can a management app learn whether it is safe to supply
> --target-is-zero?  With my upcoming NBD patches, 'qemu-nbd --list' will
> show whether the NBD target is known to start life all zero, and a
> management app could use that mechanism to decide when to pass
> --target-is-zero (whether a management app would actually fork qemu-nbd
> --list, or just be an NBD client itself to do the same thing qemu-nbd
> would do, is beside the point).
> 
> Similarly, this series includes enhancements to 'qemu-img info' on qcow2
> images known to currently read as zero.  Again, that sort of information
> is necessary somewhere in the chain: whether qemu-img consumes the
> information itself, or the management app consumes it in order to pass
> the --target-is-zero option to qemu-img, the information needs to be
> available for consumption.

I simply assumed that management applications will just assume that a
newly created image is zero.

I’m aware that may be wrong, but then again, that hasn’t stopped them in
the past or they would have asked for qemu to deliver this information
earlier...  (That doesn’t mean they won’t start to care and ask for it
at some point.)

One problem with delivering this information of course is that it’s
useless.  If qemu knows the image to be zero, then it will do the right
thing by itself, and then the management application doesn’t need to
pass --target-is-zero anymore.

>>>> I suppose there is the point of blockdev-create + blockdev-mirror: This
>>>> has exactly the same problem as convert -n.  But again, if you really
>>>> want blockdev-mirror not just to force-zero the image, you probably
>>>> need
>>>> to tell it so explicitly (i.e., with a --target-is-zero flag for
>>>> blockdev-mirror).
>>>>
>>>> (Well, I suppose we could save us a target-is-zero for mirror if we
>>>> took
>>>> this series and had a filter driver that force-reports BDRV_ZERO_OPEN.
>>>> But, well, please no.)
>>>>
>>>> But maybe I’m just an idiot and there is no reason not to take this
>>>> series and make blockdev-create + blockdev-mirror do the sensible thing
>>>> by default in most cases. *shrug*
>>>
>>> My argument for taking the series _is_ that the common case can be made
>>> more efficient without user effort.
>>
>> The thing is, I don’t see the user effort.  I don’t think users use
>> convert -n or backup manually.  And for management tools, it isn’t
>> really effort to add another switch.
> 
> Maybe, but it is just shifting the burden between who consumes the
> information that an image is all zero, and how the consumption of that
> information is passed to qemu-img.

Sure, but the question is who can take the burden the easiest.

The management layer creates the image and then uses it, so it can
easily retain this information.

When qemu creates an image and then uses it in a separate step, it
cannot retain this information.  It has to be stored somewhere
persistently and we have to fetch it.  In the case of qcow2, that works
with a flag.  In other cases...  Well, in any case it isn’t as trivial
as in a management application.

>>> Yes, we still need the knob for
>>> when the common case isn't already smart enough,
>>
>> But the user can’t know when qemu isn’t smart enough.  So users who care
>> have to always give the flag.
>>
>>> but the difference in
>>> avoiding a pre-zeroing pass is noticeable when copying images around
>>
>> I’m sure it is, but the question I ask is whether in practice we
>> wouldn’t get --target-is-zero in all of these cases anyway.
>>
>>
>> So I’m not sold on “it works most of the time”, because if it’s just
>> most of the time, then we’ll likely see --target-is-zero all of the time.
>>
>> OTOH, I suppose that with the new qcow2 extension, it would always work
>> for the following case:
>> (1) Create a qcow2 file,
>> (2) Immediately (with the next qemu-img/QMP invocation) use it as a
>> target of convert -n or mirror or anything similar.
> 
> Yes, that is one of the immediately obvious fallouts from this series -
> you can now create a preallocated qcow2 image in one process, and the
> next process using that image can readily tell that it is still
> just-created.

And it’s a common case with blockdev-create.

>> If so, that means it works reliably all of the time for a common case.
>> I guess that’d be enough for me.
>>
>> Max
>>
>>> (and more than just for qcow2 - my followup series to improve NBD is
>>> similarly useful given how much work has already been invested in
>>> mapping NBD into storage access over https in the upper layers like
>>> ovirt).
>>>
>>
>>
> 
> At any rate, I think you've convinced me to rethink how I present v2
> (maybe not by refactoring bdrv_known_zeroes usage, but instead
> refactoring bdrv_make_zero), but that the qcow2 autoclear bit is still a
> useful feature to have.

Hm.  So you mean there isn’t any caller that actually cares about
whether an image is zero, only that it is zero.  That is, they are all
“if (!is_zero()) { make_zero(); }” – which is functionally the same as
“make_zero();” alone.  make_zero in turn could and should be a no-op
when the image is known to be zero already.

I actually didn’t think of that.  Sounds good.

Max



* Re: [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate}
  2020-02-05 18:39         ` Eric Blake
@ 2020-02-06  9:18           ` Max Reitz
  0 siblings, 0 replies; 73+ messages in thread
From: Max Reitz @ 2020-02-06  9:18 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: Kevin Wolf, Fam Zheng, open list:Sheepdog, qemu-block, Jeff Cody,
	Stefan Weil, Peter Lieven, Richard W.M. Jones, Markus Armbruster,
	david.edmondson, Stefan Hajnoczi, Liu Yuan, Denis V. Lunev,
	Jason Dillaman



On 05.02.20 19:39, Eric Blake wrote:
> On 2/5/20 11:22 AM, Max Reitz wrote:
> 
>>
>>>> And thus callers which just want the trivially obtainable
>>>> BDRV_ZERO_TRUNCATE info have to wait for the BDRV_ZERO_OPEN inquiry,
>>>> even though they don’t care about that flag.
>>>
>>> True, but only to a minor extent; and the documentation mentions that
>>> the BDRV_ZERO_OPEN calculation MUST NOT be as expensive as a blind
>>> block_status loop.
>>
>> So it must be less expensive than an arbitrarily complex loop.  I think
>> a single SEEK_DATA/HOLE call was something like O(n) on tmpfs?
> 
> If I recall, the tmpfs bug was that it was O(n) where n was based on the
> initial offset and the number of extents prior to that offset.  The
> probe at offset 0 is O(1) (because there are no prior extents), whether
> it reaches the end of the file (the entire image is a hole) or hits data
> beforehand.  It is only probes at later offsets where the speed penalty
> sets in, and where an O(n) loop over all extents turned into an O(n^2)
> traversal time due to the O(n) nature of each later lookup.

So it’s O(n/2) for each lookup on average, which is O(n). O:-)

>> What I’m trying to say is that this is not a good limit and can mean
>> anything.
>>
>> I do think this limit definition makes sense for callers that want to
>> know about ZERO_OPEN.  But I don’t know why we would have to let other
>> callers wait, too.
> 
> Keeping separate functions may still be the right approach for v2,
> although I'd still like to name things better ('has_zero_init' vs.
> 'has_zero_init_truncate' did not work well for me).  And if I'm renaming
> things, then I'm touching just as much code whether I rename and keep
> separate functions or rename and consolidate into one.

I definitely don’t disagree about renaming, and if you think that
consolidating the functions is worth it, then it probably makes sense
(because you have the experience there, given this series).

But I’d still like to throw in that a rename is a more easily doable and
reviewable change than a consolidation, even if you get the same number
of hunks in the end.

>>> Meanwhile, callers tend to only care about
>>> bdrv_known_zeroes() right after opening an image or right before
>>> resizing (not repeatedly during runtime);
>>
>> Hm, yes.  I was thinking of parallels, but that only checks once in
>> parallels_open(), so it’s OK.
>>
>>> and you also argued elsewhere
>>> in this thread that it may be worth having the block layer cache
>>> BDRV_ZERO_OPEN and update the cache on any write,
>>
>> I didn’t say the block layer, but if it makes sense.
>>
>>> at which point, the
>>> expense in the driver callback really is a one-time call during
>>> bdrv_co_open().
>>
>> It definitely doesn’t make sense to me to do that call unconditionally
>> in bdrv_co_open().
> 
> Okay, you have a point there - while 'qemu-img convert' cares, not all
> clients of bdrv_co_open() are worried about whether the existing
> contents are zero; so unconditionally priming a cache during
> bdrv_co_open is not as wise as doing things when it will actually be
> useful information.  On the other hand, if it is something that clients
> only use when first opening an image, caching data doesn't make much
> sense either.
> 
> So, we know that bdrv_has_zero_init() is only viable on a just-created
> image, bdrv_has_zero_init_truncate() is only viable if you are about to
> resize an image using bdrv_co_truncate(PREALLOC_MODE_OFF).
> 
> Hmm - thinking aloud: our ultimate goal is that we want to make it
> easier for algorithms that can be sped up IF the image is currently
> known to be all zero.  Maybe what this means is that we really want to
> be tweaking bdrv_make_zero() to do all the work, something along the
> lines of:
> - if the image is known to already be all zeroes using an O(1) probe
> (this includes if the image was freshly created and creation sees all
> zeroes, or if a block_status at offset 0 shows a hole for the entire
> image, or if an NBD extension advertises all zero at connection
> time...), return success

[Insert case here: If the image has a custom make_zeroes implementation,
use it, and return success]

> - if the image has a FAST truncate, and resizing reads zeroes, we can
> truncate to size 0 and back to the desired size, then return success;
> determining if truncate is fast should be similar to how
> BDRV_REQ_NO_FALLBACK determines whether write zeroes is fast
> - if the image supports BDRV_REQ_NO_FALLBACK with write zeroes, we can
> request a write zeroes over the whole image, which will either succeed
> (the image is now quickly zero) or fail (writing zeroes as we go is the
> best we can do)
> - if the image could report that it is all zeroes, but only at the cost
> of O(n) work such as a loop over block_status (or even O(n^2) with the
> tmpfs lseek bug), it's easier to report failure than to worry about
> making the image read all zeroes
> 
> qemu-img would then only ever need to consult --target-is-zero and
> bdrv_make_zero(), and not worry about any other function calls; while
> the block layer would take care of coordinating whatever other call
> sequences make the most sense in reporting success or failure in getting
> the image into an all-zero state if it was not already there.

(As I just wrote on the cover letter thread:) Sounds good to me.

>>> And in that case, whether the one-time expense is done
>>> via a single function call or via three driver callbacks, the amount of
>>> work is the same; but the driver callback interface is easier if there
>>> is only one callback (similar to how bdrv_unallocated_blocks_are_zero()
>>> calls bdrv_get_info() only for bdi.unallocated_blocks_are_zero, even
>>> though BlockDriverInfo tracks much more than that boolean).
>>>
>>> In fact, it may be worth consolidating known zeroes support into
>>> BlockDriverInfo.
>>
>> I’m very skeptical of that.  BDI already has the problem that it doesn’t
>> know which of the information the caller actually wants and that it is
>> sometimes used in a quasi-hot path.
>>
>> Maybe that means it is indeed time to incorporate it into BDI, but the
>> caller should have a way of specifying what parts of BDI it actually
>> needs and then drivers can skip anything that isn’t trivially obtainable
>> that the caller doesn’t need.
> 
> I'm reminded of the recent kernel addition of statx(); the traditional
> stat/fstat interfaces really don't know which bits of information you
> care about, so you get everything, but with statx(), you can request
> only what you plan to use, which may indeed result in speedups.

I hope we can put off thinking about it if the known-zeroes check can
simply be put into make_zero(). O:-)
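
For illustration only, the "caller says which parts it needs" idea from the exchange above could look like the sketch below. Everything here is hypothetical: `Node`, `Info`, the `INFO_*` flags and `node_get_info()` are invented names, not fields of QEMU's BlockDriverInfo, and `expensive_probes` exists only so the effect of the mask can be observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request flags: one bit per piece of information. */
enum {
    INFO_CLUSTER_SIZE = 1 << 0,   /* trivially obtainable */
    INFO_KNOWN_ZEROES = 1 << 1,   /* may require real work in a driver */
};

/* Hypothetical result structure; 'valid' reports which fields were filled. */
typedef struct Info {
    unsigned valid;
    int cluster_size;
    bool known_zeroes;
} Info;

/* Hypothetical node; expensive_probes counts how often the costly path ran. */
typedef struct Node {
    int cluster_size;
    bool zero;
    unsigned expensive_probes;
} Node;

/* Fill only the fields requested in 'want', skipping costly work otherwise. */
static void node_get_info(Node *n, unsigned want, Info *out)
{
    assert(n != NULL && out != NULL);
    out->valid = 0;
    if (want & INFO_CLUSTER_SIZE) {
        out->cluster_size = n->cluster_size;
        out->valid |= INFO_CLUSTER_SIZE;
    }
    if (want & INFO_KNOWN_ZEROES) {
        n->expensive_probes++;         /* stands in for a non-trivial query */
        out->known_zeroes = n->zero;
        out->valid |= INFO_KNOWN_ZEROES;
    }
}
```

A caller on a quasi-hot path would pass only `INFO_CLUSTER_SIZE` and never pay for the zero probe.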

>>> Those are still viable options, but before I repaint the bikeshed along
>>> those lines, I'd at least like a review of whether the overall idea of
>>> having a notion of 'reads-all-zeroes' is indeed useful enough,
>>> regardless of how we implement it as one vs. three driver callbacks.
>>
>> I’m as hesitant as ever to give a review that this notion is useful,
>> because I haven’t seen a practical example yet where the problem isn’t
>> the fact that NBD doesn’t have 64-bit write_zeroes support.
> 
> Even if the NBD protocol gains 64-bit write_zeroes, not all NBD servers
> will be compliant with that extension.  This will speed up operations
> when talking to older servers which do not support 64-bit writes, even
> if newer qemu is never such a server.

The same applies to reads-all-zeroes, though.  Only if both server and
client provide/understand this flag will it do something.

Max




* Re: [PATCH 03/17] qcow2: Avoid feature name extension on small cluster size
  2020-01-31 17:44 ` [PATCH 03/17] qcow2: Avoid feature name extension on small cluster size Eric Blake
  2020-02-04 14:39   ` Vladimir Sementsov-Ogievskiy
@ 2020-02-09 19:28   ` Alberto Garcia
  1 sibling, 0 replies; 73+ messages in thread
From: Alberto Garcia @ 2020-02-09 19:28 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block, mreitz

On Fri 31 Jan 2020 06:44:22 PM CET, Eric Blake wrote:
> As the feature name table can be quite large (over 9k if all 64 bits
> of all three feature fields have names; a mere 8 features leaves only
> 8 bytes for a backing file name in a 512-byte cluster), it is unwise
> to emit this optional header in images with small cluster sizes.
>
> Update iotest 036 to skip running on small cluster sizes; meanwhile,
> note that iotest 061 never passed on alternative cluster sizes
> (however, I limited this patch to tests with output affected by adding
> feature names, rather than auditing for other tests that are not
> robust to alternative cluster sizes).
>
> Signed-off-by: Eric Blake <eblake@redhat.com>

Reviewed-by: Alberto Garcia <berto@igalia.com>

Berto



* Re: [PATCH 01/17] qcow2: Comment typo fixes
  2020-01-31 17:44 ` [PATCH 01/17] qcow2: Comment typo fixes Eric Blake
  2020-02-04 14:12   ` Vladimir Sementsov-Ogievskiy
@ 2020-02-09 19:34   ` Alberto Garcia
  1 sibling, 0 replies; 73+ messages in thread
From: Alberto Garcia @ 2020-02-09 19:34 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block, mreitz

On Fri 31 Jan 2020 06:44:20 PM CET, Eric Blake wrote:
> Various trivial typos noticed while working on this file.
>
> Signed-off-by: Eric Blake <eblake@redhat.com>

Reviewed-by: Alberto Garcia <berto@igalia.com>

Berto



* Re: [PATCH 05/17] block: Don't advertise zero_init_truncate with encryption
  2020-01-31 17:44 ` [PATCH 05/17] block: Don't advertise zero_init_truncate with encryption Eric Blake
@ 2020-02-10 18:12   ` Alberto Garcia
  0 siblings, 0 replies; 73+ messages in thread
From: Alberto Garcia @ 2020-02-10 18:12 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block, mreitz

On Fri 31 Jan 2020 06:44:24 PM CET, Eric Blake wrote:
> Commit 38841dcd correctly argued that having qcow2 blindly return 1
> for .bdrv_has_zero_init() is wrong for preallocated images built on
> block devices, while .bdrv_has_zero_init_truncate() can still return 1
> because it is only relied on when changing size with PREALLOC_MODE_OFF
> (and this is true even for v2 images which lack the notion of an
> explicit zero cluster, since the block layer already filters out the
> case of a larger backing file leaking through).  However, it missed
> the fact that encrypted images do not default to reading as zero in
> any case.
>
> However, instead of changing qcow2's .bdrv_has_zero_init_truncate() to
> point to a one-off function that special-cases bs->encryption, it is
> smarter to just move the logic about encryption directly to the block
> layer (that is, the driver callbacks will never be invoked for
> encrypted images, just like they are already not called when a backing
> file is present).  This solution fixes the qcow2 issue, has no effect
> on the crypto driver (which already lacks .bdrv_has_zero_init*
> callbacks), and no other driver currently uses bs->encrypted.
>
> One other reason to fix this at the block layer: any information we
> expose about an encrypted image that in turn may alter timing of
> algorithms run on that image can be considered a (slight) information
> leak; refusing to optimize zero handling of encrypted images thus
> avoids the possibility of that being a security concern.
>
> Signed-off-by: Eric Blake <eblake@redhat.com>

Reviewed-by: Alberto Garcia <berto@igalia.com>

Berto
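
A minimal sketch of the approach described in the quoted commit message, where the generic layer refuses to report zero-init for encrypted images before any driver callback is consulted. The types and names (`Node`, `Driver`, `node_has_zero_init_truncate()`, `always_one()`) are hypothetical simplifications, not the actual block.c code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Node Node;

/* Hypothetical driver table with a single optional callback. */
typedef struct Driver {
    int (*has_zero_init_truncate)(Node *n);
} Driver;

/* Hypothetical node: a driver plus an "image is encrypted" flag. */
struct Node {
    const Driver *drv;
    bool encrypted;
};

/* Generic-layer wrapper: filters cases the driver should never see. */
static int node_has_zero_init_truncate(Node *n)
{
    assert(n != NULL);
    if (!n->drv) {
        return 0;
    }
    if (n->encrypted) {
        return 0;   /* encrypted data never reads as zero by default */
    }
    if (n->drv->has_zero_init_truncate) {
        return n->drv->has_zero_init_truncate(n);
    }
    return 0;       /* no callback provided: report the safe default */
}

/* Example driver callback that unconditionally claims zero-init. */
static int always_one(Node *n)
{
    (void)n;
    return 1;
}
```

Because the check lives in the wrapper, a driver that blindly returns 1 still yields 0 for an encrypted node, and no per-driver special case is needed.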



* Re: [PATCH 06/17] block: Improve bdrv_has_zero_init_truncate with backing file
  2020-01-31 17:44 ` [PATCH 06/17] block: Improve bdrv_has_zero_init_truncate with backing file Eric Blake
@ 2020-02-10 18:13   ` Alberto Garcia
  0 siblings, 0 replies; 73+ messages in thread
From: Alberto Garcia @ 2020-02-10 18:13 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: david.edmondson, Kevin Wolf, qemu-block, mreitz

On Fri 31 Jan 2020 06:44:25 PM CET, Eric Blake wrote:
> When we added bdrv_has_zero_init_truncate(), we chose to blindly
> return 0 if a backing file was present, because we knew of the corner
> case where a backing layer larger than the current layer might leak
> the tail of the backing layer into the resized region.  But as this
> setup is rare, it penalizes the more common case of a backing layer
> smaller than the current layer.
>
> Signed-off-by: Eric Blake <eblake@redhat.com>

Reviewed-by: Alberto Garcia <berto@igalia.com>

Berto
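
The corner case in the quoted commit message can be sketched as a length comparison: only a backing layer longer than the current layer can leak its tail into a resized region. This is an assumption-laden illustration, not the patch itself; `Layer` and `layer_has_zero_init_truncate()` are hypothetical names, and a negative length is used here to model a failed size query.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical image layer. */
typedef struct Layer {
    int64_t length;               /* virtual size in bytes, negative on error */
    const struct Layer *backing;  /* NULL if there is no backing layer */
    bool driver_zero_truncate;    /* what the driver itself would report */
} Layer;

/* Returns 1 only if growing the layer is known to expose zeroes. */
static int layer_has_zero_init_truncate(const Layer *l)
{
    assert(l != NULL);
    if (l->backing) {
        if (l->length < 0 || l->backing->length < 0) {
            return 0;   /* size unknown: stay on the safe side */
        }
        if (l->backing->length > l->length) {
            return 0;   /* tail of the backing layer could show through */
        }
    }
    return l->driver_zero_truncate ? 1 : 0;
}
```

The common case of a backing layer no larger than the current layer then falls through to the driver's own answer instead of being penalized.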



* Re: [PATCH 07/17] gluster: Drop useless has_zero_init callback
  2020-01-31 17:44 ` [PATCH 07/17] gluster: Drop useless has_zero_init callback Eric Blake
  2020-02-04 15:06   ` Vladimir Sementsov-Ogievskiy
@ 2020-02-10 18:21   ` Alberto Garcia
  2020-02-17  8:06   ` [GEDI] " Niels de Vos
  2 siblings, 0 replies; 73+ messages in thread
From: Alberto Garcia @ 2020-02-10 18:21 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: david.edmondson, Kevin Wolf, open list:GLUSTER, qemu-block, mreitz

On Fri 31 Jan 2020 06:44:26 PM CET, Eric Blake wrote:
> block.c already defaults to 0 if we don't provide a callback; there's
> no need to write a callback that always fails.
>
> Signed-off-by: Eric Blake <eblake@redhat.com>

Reviewed-by: Alberto Garcia <berto@igalia.com>

Berto



* Re: [GEDI] [PATCH 07/17] gluster: Drop useless has_zero_init callback
  2020-01-31 17:44 ` [PATCH 07/17] gluster: Drop useless has_zero_init callback Eric Blake
  2020-02-04 15:06   ` Vladimir Sementsov-Ogievskiy
  2020-02-10 18:21   ` Alberto Garcia
@ 2020-02-17  8:06   ` Niels de Vos
  2020-02-17 12:03     ` Eric Blake
  2 siblings, 1 reply; 73+ messages in thread
From: Niels de Vos @ 2020-02-17  8:06 UTC (permalink / raw)
  To: Eric Blake
  Cc: Kevin Wolf, open list:GLUSTER, qemu-block, qemu-devel, mreitz,
	david.edmondson

On Fri, Jan 31, 2020 at 11:44:26AM -0600, Eric Blake wrote:
> block.c already defaults to 0 if we don't provide a callback; there's
> no need to write a callback that always fails.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>

Reviewed-by: Niels de Vos <ndevos@redhat.com>

> ---
>  block/gluster.c | 14 --------------
>  1 file changed, 14 deletions(-)
> 
> diff --git a/block/gluster.c b/block/gluster.c
> index 4fa4a77a4777..9d952c70981b 100644
> --- a/block/gluster.c
> +++ b/block/gluster.c
> @@ -1357,12 +1357,6 @@ static int64_t qemu_gluster_allocated_file_size(BlockDriverState *bs)
>      }
>  }
> 
> -static int qemu_gluster_has_zero_init(BlockDriverState *bs)
> -{
> -    /* GlusterFS volume could be backed by a block device */
> -    return 0;
> -}
> -
>  /*
>   * Find allocation range in @bs around offset @start.
>   * May change underlying file descriptor's file offset.
> @@ -1567,8 +1561,6 @@ static BlockDriver bdrv_gluster = {
>      .bdrv_co_readv                = qemu_gluster_co_readv,
>      .bdrv_co_writev               = qemu_gluster_co_writev,
>      .bdrv_co_flush_to_disk        = qemu_gluster_co_flush_to_disk,
> -    .bdrv_has_zero_init           = qemu_gluster_has_zero_init,
> -    .bdrv_has_zero_init_truncate  = qemu_gluster_has_zero_init,
>  #ifdef CONFIG_GLUSTERFS_DISCARD
>      .bdrv_co_pdiscard             = qemu_gluster_co_pdiscard,
>  #endif
> @@ -1599,8 +1591,6 @@ static BlockDriver bdrv_gluster_tcp = {
>      .bdrv_co_readv                = qemu_gluster_co_readv,
>      .bdrv_co_writev               = qemu_gluster_co_writev,
>      .bdrv_co_flush_to_disk        = qemu_gluster_co_flush_to_disk,
> -    .bdrv_has_zero_init           = qemu_gluster_has_zero_init,
> -    .bdrv_has_zero_init_truncate  = qemu_gluster_has_zero_init,
>  #ifdef CONFIG_GLUSTERFS_DISCARD
>      .bdrv_co_pdiscard             = qemu_gluster_co_pdiscard,
>  #endif
> @@ -1631,8 +1621,6 @@ static BlockDriver bdrv_gluster_unix = {
>      .bdrv_co_readv                = qemu_gluster_co_readv,
>      .bdrv_co_writev               = qemu_gluster_co_writev,
>      .bdrv_co_flush_to_disk        = qemu_gluster_co_flush_to_disk,
> -    .bdrv_has_zero_init           = qemu_gluster_has_zero_init,
> -    .bdrv_has_zero_init_truncate  = qemu_gluster_has_zero_init,
>  #ifdef CONFIG_GLUSTERFS_DISCARD
>      .bdrv_co_pdiscard             = qemu_gluster_co_pdiscard,
>  #endif
> @@ -1669,8 +1657,6 @@ static BlockDriver bdrv_gluster_rdma = {
>      .bdrv_co_readv                = qemu_gluster_co_readv,
>      .bdrv_co_writev               = qemu_gluster_co_writev,
>      .bdrv_co_flush_to_disk        = qemu_gluster_co_flush_to_disk,
> -    .bdrv_has_zero_init           = qemu_gluster_has_zero_init,
> -    .bdrv_has_zero_init_truncate  = qemu_gluster_has_zero_init,
>  #ifdef CONFIG_GLUSTERFS_DISCARD
>      .bdrv_co_pdiscard             = qemu_gluster_co_pdiscard,
>  #endif
> -- 
> 2.24.1
> 
> _______________________________________________
> integration mailing list
> integration@gluster.org
> https://lists.gluster.org/mailman/listinfo/integration
> 




* Re: [GEDI] [PATCH 12/17] gluster: Support BDRV_ZERO_OPEN
  2020-01-31 17:44 ` [PATCH 12/17] gluster: " Eric Blake
@ 2020-02-17  8:16   ` Niels de Vos
  0 siblings, 0 replies; 73+ messages in thread
From: Niels de Vos @ 2020-02-17  8:16 UTC (permalink / raw)
  To: Eric Blake
  Cc: Kevin Wolf, open list:GLUSTER, qemu-block, qemu-devel, mreitz,
	david.edmondson

On Fri, Jan 31, 2020 at 11:44:31AM -0600, Eric Blake wrote:
> Since gluster already copies file-posix for lseek usage in block
> status, it also makes sense to copy it for learning if the image
> currently reads as all zeroes.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  block/gluster.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/block/gluster.c b/block/gluster.c
> index 9d952c70981b..0417a86547c8 100644
> --- a/block/gluster.c
> +++ b/block/gluster.c
> @@ -1464,6 +1464,22 @@ exit:
>      return -ENOTSUP;
>  }
> 
> +static int qemu_gluster_known_zeroes(BlockDriverState *bs)
> +{
> +    /*
> +     * GlusterFS volume could be backed by a block device, with no way

Actually, Gluster dropped support for volumes backed by block devices
(LVM) a few releases back. Nobody could be found that used it, and it
could not be combined with other Gluster features. All content on a
Gluster volume is now always backed by 'normal' files on a filesystem.

Creation or truncation should behave just as on a file on a local
filesystem. So maybe qemu_gluster_known_zeroes is not needed at all?

Niels


> +     * to query if regions added by creation or truncation will read
> +     * as zeroes.  However, we can use lseek(SEEK_DATA) to check if
> +     * contents currently read as zero.
> +     */
> +    off_t data, hole;
> +
> +    if (find_allocation(bs, 0, &data, &hole) == -ENXIO) {
> +        return BDRV_ZERO_OPEN;
> +    }
> +    return 0;
> +}
> +
>  /*
>   * Returns the allocation status of the specified offset.
>   *
> @@ -1561,6 +1577,7 @@ static BlockDriver bdrv_gluster = {
>      .bdrv_co_readv                = qemu_gluster_co_readv,
>      .bdrv_co_writev               = qemu_gluster_co_writev,
>      .bdrv_co_flush_to_disk        = qemu_gluster_co_flush_to_disk,
> +    .bdrv_known_zeroes            = qemu_gluster_known_zeroes,
>  #ifdef CONFIG_GLUSTERFS_DISCARD
>      .bdrv_co_pdiscard             = qemu_gluster_co_pdiscard,
>  #endif
> @@ -1591,6 +1608,7 @@ static BlockDriver bdrv_gluster_tcp = {
>      .bdrv_co_readv                = qemu_gluster_co_readv,
>      .bdrv_co_writev               = qemu_gluster_co_writev,
>      .bdrv_co_flush_to_disk        = qemu_gluster_co_flush_to_disk,
> +    .bdrv_known_zeroes            = qemu_gluster_known_zeroes,
>  #ifdef CONFIG_GLUSTERFS_DISCARD
>      .bdrv_co_pdiscard             = qemu_gluster_co_pdiscard,
>  #endif
> @@ -1621,6 +1639,7 @@ static BlockDriver bdrv_gluster_unix = {
>      .bdrv_co_readv                = qemu_gluster_co_readv,
>      .bdrv_co_writev               = qemu_gluster_co_writev,
>      .bdrv_co_flush_to_disk        = qemu_gluster_co_flush_to_disk,
> +    .bdrv_known_zeroes            = qemu_gluster_known_zeroes,
>  #ifdef CONFIG_GLUSTERFS_DISCARD
>      .bdrv_co_pdiscard             = qemu_gluster_co_pdiscard,
>  #endif
> @@ -1657,6 +1676,7 @@ static BlockDriver bdrv_gluster_rdma = {
>      .bdrv_co_readv                = qemu_gluster_co_readv,
>      .bdrv_co_writev               = qemu_gluster_co_writev,
>      .bdrv_co_flush_to_disk        = qemu_gluster_co_flush_to_disk,
> +    .bdrv_known_zeroes            = qemu_gluster_known_zeroes,
>  #ifdef CONFIG_GLUSTERFS_DISCARD
>      .bdrv_co_pdiscard             = qemu_gluster_co_pdiscard,
>  #endif
> -- 
> 2.24.1
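
The lseek(SEEK_DATA) idea from the quoted code comment can be tried out on an ordinary file descriptor. The sketch below is a plain POSIX/Linux illustration and not the gluster driver: `fd_reads_as_zero()` and `probe_tmpfile()` are hypothetical helpers, and the real patch goes through the driver's own find_allocation() rather than calling lseek() directly. It relies on lseek() failing with ENXIO when no data extent exists at or after the given offset.

```c
#define _GNU_SOURCE             /* for SEEK_DATA on glibc */
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if the file has no data extent at or after offset 0, so that
 * it currently reads as all zeroes; returns 0 if data exists or if the
 * answer is unknown (for example when SEEK_DATA is unavailable).
 */
static int fd_reads_as_zero(int fd)
{
#ifdef SEEK_DATA
    off_t r = lseek(fd, 0, SEEK_DATA);

    if (r == (off_t)-1 && errno == ENXIO) {
        return 1;
    }
#else
    (void)fd;
#endif
    return 0;
}

/*
 * Demo helper: create an anonymous temporary file, write data_bytes of
 * non-zero data, and probe it.  Returns -1 if the file cannot be created.
 */
static int probe_tmpfile(size_t data_bytes)
{
    FILE *f = tmpfile();
    int result;

    if (!f) {
        return -1;
    }
    for (size_t i = 0; i < data_bytes; i++) {
        fputc('x', f);
    }
    fflush(f);
    result = fd_reads_as_zero(fileno(f));
    fclose(f);
    return result;
}
```

On filesystems that do not track holes, SEEK_DATA simply reports data everywhere, so the probe degrades to "unknown" rather than giving a wrong positive answer.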




* Re: [GEDI] [PATCH 07/17] gluster: Drop useless has_zero_init callback
  2020-02-17  8:06   ` [GEDI] " Niels de Vos
@ 2020-02-17 12:03     ` Eric Blake
  2020-02-17 12:22       ` Eric Blake
  2020-02-17 14:01       ` Niels de Vos
  0 siblings, 2 replies; 73+ messages in thread
From: Eric Blake @ 2020-02-17 12:03 UTC (permalink / raw)
  To: Niels de Vos
  Cc: Kevin Wolf, open list:GLUSTER, qemu-block, qemu-devel, mreitz,
	david.edmondson

On 2/17/20 2:06 AM, Niels de Vos wrote:
> On Fri, Jan 31, 2020 at 11:44:26AM -0600, Eric Blake wrote:
>> block.c already defaults to 0 if we don't provide a callback; there's
>> no need to write a callback that always fails.
>>
>> Signed-off-by: Eric Blake <eblake@redhat.com>
> 
> Reviewed-by: Niels de Vos <ndevos@redhat.com>
> 

Per your other message,

On 2/17/20 2:16 AM, Niels de Vos wrote:
 > On Fri, Jan 31, 2020 at 11:44:31AM -0600, Eric Blake wrote:
 >> Since gluster already copies file-posix for lseek usage in block
 >> status, it also makes sense to copy it for learning if the image
 >> currently reads as all zeroes.
 >>

 >> +static int qemu_gluster_known_zeroes(BlockDriverState *bs)
 >> +{
 >> +    /*
 >> +     * GlusterFS volume could be backed by a block device, with no way
 >
 > Actually, Gluster dropped support for volumes backed by block devices
 > (LVM) a few releases back. Nobody could be found that used it, and it
 > could not be combined with other Gluster features. All content on a
 > Gluster volume is now always backed by 'normal' files on a filesystem.

That's useful to know.  Thanks!

 >
 > Creation or truncation should behave just as on a file on a local
 > filesystem. So maybe qemu_gluster_known_zeroes is not needed at all?

Which version of gluster first required a regular filesystem backing for 
all gluster files?  Does qemu support older versions (in which case, 
what is the correct version-probing invocation to return 0 prior to that 
point, and 1 after), or do all versions supported by qemu already 
guarantee zero initialization on creation or widening truncation by 
virtue of POSIX file semantics (in which case, patch 7 should instead 
switch to using .bdrv_has_zero_init_1 for both functions)?  Per 
configure, we probe for glusterfs_xlator_opt from gluster 4, which 
implies the code still tries to be portable to even older gluster, but 
I'm not sure if this squares with qemu-doc.texi which mentions our 
minimum distro policy (for example, now that qemu requires python 3 
consistent with our distro policy, that rules out several older systems 
where older gluster was likely to be present).

I'm respinning the series for other reasons, but would like to get this 
right as part of that respin.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [GEDI] [PATCH 07/17] gluster: Drop useless has_zero_init callback
  2020-02-17 12:03     ` Eric Blake
@ 2020-02-17 12:22       ` Eric Blake
  2020-02-17 14:01       ` Niels de Vos
  1 sibling, 0 replies; 73+ messages in thread
From: Eric Blake @ 2020-02-17 12:22 UTC (permalink / raw)
  To: Niels de Vos
  Cc: Kevin Wolf, open list:GLUSTER, qemu-block, qemu-devel, mreitz,
	david.edmondson

On 2/17/20 6:03 AM, Eric Blake wrote:

>  >
>  > Creation or truncation should behave just as on a file on a local
>  > filesystem. So maybe qemu_gluster_known_zeroes is not needed at all?
> 
> Which version of gluster first required a regular filesystem backing for 
> all gluster files?  Does qemu support older versions (in which case, 
> what is the correct version-probing invocation to return 0 prior to that 
> point, and 1 after), or do all versions supported by qemu already 
> guarantee zero initialization on creation or widening truncation by 
> virtue of POSIX file semantics (in which case, patch 7 should instead 
> switch to using .bdrv_has_zero_init_1 for both functions)?  Per 
> configure, we probe for glusterfs_xlator_opt from gluster 4, which 
> implies the code still tries to be portable to even older gluster, but 
> I'm not sure if this squares with qemu-doc.texi which mentions our 
> minimum distro policy (for example, now that qemu requires python 3 
> consistent with our distro policy, that rules out several older systems 
> where older gluster was likely to be present).

For reference, I quickly found commit efc6c070ac as an example of 
bumping minimum versions (however, that commit is from 2018, so I'm sure 
there are even more recent examples, just not with the same keywords 
that I was searching for).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [GEDI] [PATCH 07/17] gluster: Drop useless has_zero_init callback
  2020-02-17 12:03     ` Eric Blake
  2020-02-17 12:22       ` Eric Blake
@ 2020-02-17 14:01       ` Niels de Vos
  1 sibling, 0 replies; 73+ messages in thread
From: Niels de Vos @ 2020-02-17 14:01 UTC (permalink / raw)
  To: Eric Blake
  Cc: Kevin Wolf, open list:GLUSTER, qemu-block, qemu-devel, mreitz,
	david.edmondson

On Mon, Feb 17, 2020 at 06:03:40AM -0600, Eric Blake wrote:
> On 2/17/20 2:06 AM, Niels de Vos wrote:
> > On Fri, Jan 31, 2020 at 11:44:26AM -0600, Eric Blake wrote:
> > > block.c already defaults to 0 if we don't provide a callback; there's
> > > no need to write a callback that always fails.
> > > 
> > > Signed-off-by: Eric Blake <eblake@redhat.com>
> > 
> > Reviewed-by: Niels de Vos <ndevos@redhat.com>
> > 
> 
> Per your other message,
> 
> On 2/17/20 2:16 AM, Niels de Vos wrote:
> > On Fri, Jan 31, 2020 at 11:44:31AM -0600, Eric Blake wrote:
> >> Since gluster already copies file-posix for lseek usage in block
> >> status, it also makes sense to copy it for learning if the image
> >> currently reads as all zeroes.
> >>
> 
> >> +static int qemu_gluster_known_zeroes(BlockDriverState *bs)
> >> +{
> >> +    /*
> >> +     * GlusterFS volume could be backed by a block device, with no way
> >
> > Actually, Gluster dropped support for volumes backed by block devices
> > (LVM) a few releases back. Nobody could be found that used it, and it
> > could not be combined with other Gluster features. All content on a
> > Gluster volume is now always backed by 'normal' files on a filesystem.
> 
> That's useful to know.  Thanks!
> 
> >
> > Creation or truncation should behave just as on a file on a local
> > filesystem. So maybe qemu_gluster_known_zeroes is not needed at all?
> 
> Which version of gluster first required a regular filesystem backing for all
> gluster files?  Does qemu support older versions (in which case, what is the
> correct version-probing invocation to return 0 prior to that point, and 1
> after), or do all versions supported by qemu already guarantee zero
> initialization on creation or widening truncation by virtue of POSIX file
> semantics (in which case, patch 7 should instead switch to using
> .bdrv_has_zero_init_1 for both functions)?  Per configure, we probe for
> glusterfs_xlator_opt from gluster 4, which implies the code still tries to
> be portable to even older gluster, but I'm not sure if this squares with
> qemu-doc.texi which mentions our minimum distro policy (for example, now
> that qemu requires python 3 consistent with our distro policy, that rules
> out several older systems where older gluster was likely to be present).

The block device feature (storage/bd xlator) got deprecated in Gluster
5.0, and was removed with Gluster 6.0. Fedora 29 is the last version
that contained the bd.so xlator (glusterfs-server 5.0, deprecated).

All currently maintained and available Gluster releases should have
glusterfs_xlator_opt (introduced with glusterfs-3.5 in 2014). However, I
am not sure what versions are provided with different distributions. The
expectation is that at least Gluster 5 is provided, as older releases
will not get any updates anymore. See
https://www.gluster.org/release-schedule/ for a more detailed timeline.

Unfortunately there is no reasonable way to probe for the type of
backend (block or filesystem) that is used. So a runtime check that
falls back to treating the backend as a block device, just to be on
the safe side, is not an option.

HTH,
Niels




Thread overview: 73+ messages
2020-01-31 17:44 [PATCH 00/17] Improve qcow2 all-zero detection Eric Blake
2020-01-31 17:44 ` [PATCH 01/17] qcow2: Comment typo fixes Eric Blake
2020-02-04 14:12   ` Vladimir Sementsov-Ogievskiy
2020-02-09 19:34   ` Alberto Garcia
2020-01-31 17:44 ` [PATCH 02/17] qcow2: List autoclear bit names in header Eric Blake
2020-02-04 14:26   ` Vladimir Sementsov-Ogievskiy
2020-01-31 17:44 ` [PATCH 03/17] qcow2: Avoid feature name extension on small cluster size Eric Blake
2020-02-04 14:39   ` Vladimir Sementsov-Ogievskiy
2020-02-09 19:28   ` Alberto Garcia
2020-01-31 17:44 ` [PATCH 04/17] block: Improve documentation of .bdrv_has_zero_init Eric Blake
2020-02-04 15:03   ` Vladimir Sementsov-Ogievskiy
2020-02-04 15:16     ` Eric Blake
2020-01-31 17:44 ` [PATCH 05/17] block: Don't advertise zero_init_truncate with encryption Eric Blake
2020-02-10 18:12   ` Alberto Garcia
2020-01-31 17:44 ` [PATCH 06/17] block: Improve bdrv_has_zero_init_truncate with backing file Eric Blake
2020-02-10 18:13   ` Alberto Garcia
2020-01-31 17:44 ` [PATCH 07/17] gluster: Drop useless has_zero_init callback Eric Blake
2020-02-04 15:06   ` Vladimir Sementsov-Ogievskiy
2020-02-10 18:21   ` Alberto Garcia
2020-02-17  8:06   ` [GEDI] " Niels de Vos
2020-02-17 12:03     ` Eric Blake
2020-02-17 12:22       ` Eric Blake
2020-02-17 14:01       ` Niels de Vos
2020-01-31 17:44 ` [PATCH 08/17] sheepdog: Consistently set bdrv_has_zero_init_truncate Eric Blake
2020-02-04 15:09   ` Vladimir Sementsov-Ogievskiy
2020-01-31 17:44 ` [PATCH 09/17] block: Refactor bdrv_has_zero_init{,_truncate} Eric Blake
2020-02-04 15:35   ` Vladimir Sementsov-Ogievskiy
2020-02-04 15:49     ` Eric Blake
2020-02-04 16:07       ` Vladimir Sementsov-Ogievskiy
2020-02-04 17:42     ` Max Reitz
2020-02-04 17:51       ` Eric Blake
2020-02-05 16:43         ` Max Reitz
2020-02-05  7:51       ` Vladimir Sementsov-Ogievskiy
2020-02-05 14:07         ` Eric Blake
2020-02-05 14:25           ` Vladimir Sementsov-Ogievskiy
2020-02-05 14:36             ` Eric Blake
2020-02-05 17:55           ` Max Reitz
2020-02-04 17:53   ` Max Reitz
2020-02-04 19:03     ` Eric Blake
2020-02-05 17:22       ` Max Reitz
2020-02-05 18:39         ` Eric Blake
2020-02-06  9:18           ` Max Reitz
2020-01-31 17:44 ` [PATCH 10/17] block: Add new BDRV_ZERO_OPEN flag Eric Blake
2020-01-31 18:03   ` Eric Blake
2020-02-04 17:34   ` Max Reitz
2020-02-04 17:50     ` Eric Blake
2020-02-05  8:39       ` Vladimir Sementsov-Ogievskiy
2020-02-05 17:26       ` Max Reitz
2020-01-31 17:44 ` [PATCH 11/17] file-posix: Support BDRV_ZERO_OPEN Eric Blake
2020-01-31 17:44 ` [PATCH 12/17] gluster: " Eric Blake
2020-02-17  8:16   ` [GEDI] " Niels de Vos
2020-01-31 17:44 ` [PATCH 13/17] qcow2: Add new autoclear feature for all zero image Eric Blake
2020-02-03 17:45   ` Vladimir Sementsov-Ogievskiy
2020-02-04 13:12     ` Eric Blake
2020-02-04 13:29       ` Vladimir Sementsov-Ogievskiy
2020-01-31 17:44 ` [PATCH 14/17] qcow2: Expose all zero bit through .bdrv_known_zeroes Eric Blake
2020-01-31 17:44 ` [PATCH 15/17] qcow2: Implement all-zero autoclear bit Eric Blake
2020-01-31 17:44 ` [PATCH 16/17] iotests: Add new test for qcow2 all-zero bit Eric Blake
2020-01-31 17:44 ` [PATCH 17/17] qcow2: Let qemu-img check cover " Eric Blake
2020-02-04 17:32 ` [PATCH 00/17] Improve qcow2 all-zero detection Max Reitz
2020-02-04 18:53   ` Eric Blake
2020-02-05 17:04     ` Max Reitz
2020-02-05 19:21       ` Eric Blake
2020-02-06  9:12         ` Max Reitz
2020-02-05  9:04 ` Vladimir Sementsov-Ogievskiy
2020-02-05  9:25   ` Vladimir Sementsov-Ogievskiy
2020-02-05 14:26     ` Eric Blake
2020-02-05 14:47       ` Vladimir Sementsov-Ogievskiy
2020-02-05 15:14         ` Vladimir Sementsov-Ogievskiy
2020-02-05 17:58           ` Max Reitz
2020-02-05 14:22   ` Eric Blake
2020-02-05 14:43     ` Vladimir Sementsov-Ogievskiy
2020-02-05 14:58       ` Vladimir Sementsov-Ogievskiy
