* [Qemu-devel] [PATCH v7 0/6] minor qcow2 compression improvements
@ 2018-06-28 19:07 Eric Blake
  2018-06-28 19:07 ` [Qemu-devel] [PATCH v7 1/6] qcow2: Prefer byte-based calls into bs->file Eric Blake
                   ` (5 more replies)
  0 siblings, 6 replies; 17+ messages in thread
From: Eric Blake @ 2018-06-28 19:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, mreitz, qemu-block, berto

v5 was here:
https://lists.gnu.org/archive/html/qemu-devel/2018-04/msg04542.html

since then:
- 5/6: update iotest number, add R-b

(git-backport-diff doesn't handle renames well)

001/6:[----] [--] 'qcow2: Prefer byte-based calls into bs->file'
002/6:[----] [--] 'qcow2: Document some maximum size constraints'
003/6:[----] [--] 'qcow2: Reduce REFT_OFFSET_MASK'
004/6:[----] [--] 'qcow2: Don't allow overflow during cluster allocation'
005/6:[down] 'iotests: Add new test 220 for max compressed cluster offset'
006/6:[----] [--] 'qcow2: Avoid memory over-allocation on compressed images'

Eric Blake (6):
  qcow2: Prefer byte-based calls into bs->file
  qcow2: Document some maximum size constraints
  qcow2: Reduce REFT_OFFSET_MASK
  qcow2: Don't allow overflow during cluster allocation
  iotests: Add new test 220 for max compressed cluster offset
  qcow2: Avoid memory over-allocation on compressed images

 docs/interop/qcow2.txt     | 40 +++++++++++++++++--
 block/qcow2.h              |  8 +++-
 block/qcow2-cluster.c      | 32 +++++++++------
 block/qcow2-refcount.c     | 27 ++++++++-----
 block/qcow2.c              |  2 +-
 tests/qemu-iotests/220     | 97 ++++++++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/220.out | 54 ++++++++++++++++++++++++++
 tests/qemu-iotests/group   |  1 +
 8 files changed, 234 insertions(+), 27 deletions(-)
 create mode 100755 tests/qemu-iotests/220
 create mode 100644 tests/qemu-iotests/220.out

-- 
2.14.4


* [Qemu-devel] [PATCH v7 1/6] qcow2: Prefer byte-based calls into bs->file
  2018-06-28 19:07 [Qemu-devel] [PATCH v7 0/6] minor qcow2 compression improvements Eric Blake
@ 2018-06-28 19:07 ` Eric Blake
  2018-06-28 19:07 ` [Qemu-devel] [PATCH v7 2/6] qcow2: Document some maximum size constraints Eric Blake
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 17+ messages in thread
From: Eric Blake @ 2018-06-28 19:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, mreitz, qemu-block, berto

We had only a few sector-based stragglers left; convert them to use
our preferred byte-based accesses.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>

---
v5: commit message tweak
v2: indentation fix
---
 block/qcow2-cluster.c  | 5 ++---
 block/qcow2-refcount.c | 6 +++---
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 0d74584c9b4..4701fbc7a12 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1616,13 +1616,12 @@ int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset)
         }

         BLKDBG_EVENT(bs->file, BLKDBG_READ_COMPRESSED);
-        ret = bdrv_read(bs->file, coffset >> 9, s->cluster_data,
-                        nb_csectors);
+        ret = bdrv_pread(bs->file, coffset, s->cluster_data, csize);
         if (ret < 0) {
             return ret;
         }
         if (decompress_buffer(s->cluster_cache, s->cluster_size,
-                              s->cluster_data + sector_offset, csize) < 0) {
+                              s->cluster_data, csize) < 0) {
             return -EIO;
         }
         s->cluster_cache_offset = coffset;
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 18c729aa276..20e6f2e29fc 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -2368,8 +2368,8 @@ write_refblocks:
         on_disk_refblock = (void *)((char *) *refcount_table +
                                     refblock_index * s->cluster_size);

-        ret = bdrv_write(bs->file, refblock_offset / BDRV_SECTOR_SIZE,
-                         on_disk_refblock, s->cluster_sectors);
+        ret = bdrv_pwrite(bs->file, refblock_offset, on_disk_refblock,
+                          s->cluster_size);
         if (ret < 0) {
             fprintf(stderr, "ERROR writing refblock: %s\n", strerror(-ret));
             goto fail;
@@ -2591,7 +2591,7 @@ fail:
  * - 0 if writing to this offset will not affect the mentioned metadata
  * - a positive QCow2MetadataOverlap value indicating one overlapping section
  * - a negative value (-errno) indicating an error while performing a check,
- *   e.g. when bdrv_read failed on QCOW2_OL_INACTIVE_L2
+ *   e.g. when bdrv_pread failed on QCOW2_OL_INACTIVE_L2
  */
 int qcow2_check_metadata_overlap(BlockDriverState *bs, int ign, int64_t offset,
                                  int64_t size)
-- 
2.14.4


* [Qemu-devel] [PATCH v7 2/6] qcow2: Document some maximum size constraints
  2018-06-28 19:07 [Qemu-devel] [PATCH v7 0/6] minor qcow2 compression improvements Eric Blake
  2018-06-28 19:07 ` [Qemu-devel] [PATCH v7 1/6] qcow2: Prefer byte-based calls into bs->file Eric Blake
@ 2018-06-28 19:07 ` Eric Blake
  2018-06-28 19:07 ` [Qemu-devel] [PATCH v7 3/6] qcow2: Reduce REFT_OFFSET_MASK Eric Blake
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 17+ messages in thread
From: Eric Blake @ 2018-06-28 19:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, mreitz, qemu-block, berto

Although off_t permits up to 63 bits (8EB) of file offsets, in
practice, we're going to hit other limits first.  Document some
of those limits in the qcow2 spec, and how choice of cluster size
can influence some of the limits.

While at it, notice that since we cannot map any virtual cluster
to any address higher than 64 PB (56 bits) (due to the current L1/L2
field encoding stopping at bit 55), it makes little sense to require
the refcount table to access host offsets beyond that point.  Mark
the upper bits of the refcount table entries as reserved to match
the L1/L2 table, with no ill effects, since it is unlikely that there
are any existing images larger than 64PB in the first place, and thus
all existing images already have those bits as 0.  If 64PB proves to
be too small in the future, we could enlarge all three uses of bit
55 into the reserved bits at that time.  (For reference, ext4 with
4k blocks caps files at 16PB.)

However, there is one limit that reserved bits don't help with: for
compressed clusters, the L2 layout requires an ever-smaller maximum
host offset as cluster size gets larger, down to a 512 TB maximum
with 2M clusters.
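
(Illustrative sketch, not part of this patch: that limit falls out of
the compressed-cluster descriptor layout quoted below, where the host
offset occupies bits 0..x-1 and x = 62 - (cluster_bits - 8).  A
standalone recomputation in C, all names hypothetical:)

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    for (int cluster_bits = 9; cluster_bits <= 21; cluster_bits++) {
        int x = 62 - (cluster_bits - 8);    /* width of the offset field */
        uint64_t field_max = (1ULL << x) - 1;
        /* all qcow2 host offsets are additionally capped below 2^56 */
        uint64_t max_offset = field_max < (1ULL << 56) - 1
                              ? field_max : (1ULL << 56) - 1;
        printf("cluster_bits=%2d  max compressed host offset=0x%014" PRIx64 "\n",
               cluster_bits, max_offset);
    }
    return 0;
}

(For 2M clusters, cluster_bits is 21, so x = 49 and the cap is 512 TB;
for small cluster sizes the general 64 PB cap dominates.)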

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>

---
v5: even more wording tweaks
v4: more wording tweaks
v3: new patch
---
 docs/interop/qcow2.txt | 40 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 37 insertions(+), 3 deletions(-)

diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
index 8e1547ded27..bb4898e60bf 100644
--- a/docs/interop/qcow2.txt
+++ b/docs/interop/qcow2.txt
@@ -40,7 +40,17 @@ The first cluster of a qcow2 image contains the file header:
                     with larger cluster sizes.

          24 - 31:   size
-                    Virtual disk size in bytes
+                    Virtual disk size in bytes.
+
+                    Note: with a 2 MB cluster size, the maximum
+                    virtual size is 2 EB (61 bits) for a sparse file,
+                    but other sizing limitations in refcount and L1/L2
+                    tables mean that an image cannot have more than 64
+                    PB (56 bits) of populated clusters (assuming the
+                    image does not first hit other limits such as a
+                    file system's maximum size).  With a 512 byte
+                    cluster size, the maximum virtual size drops to
+                    128 GB (37 bits).

          32 - 35:   crypt_method
                     0 for no encryption
@@ -326,6 +336,15 @@ in the image file.
 It contains pointers to the second level structures which are called refcount
 blocks and are exactly one cluster in size.

+All qcow2 metadata, including the refcount table and refcount blocks,
+must currently reside at host offsets below 64 PB (56 bits) (if the
+underlying protocol can even be sized that large); this limit could be
+enlarged by putting reserved bits into use, but only if a similar
+limit on L1/L2 tables is revisited at the same time.  While a larger
+cluster size theoretically allows the refcount table to cover more
+host offsets, in practice, other inherent limits will constrain the
+maximum image size before the refcount table is full.
+
 Given a offset into the image file, the refcount of its cluster can be obtained
 as follows:

@@ -341,7 +360,7 @@ Refcount table entry:

     Bit  0 -  8:    Reserved (set to 0)

-         9 - 63:    Bits 9-63 of the offset into the image file at which the
+         9 - 55:    Bits 9-55 of the offset into the image file at which the
                     refcount block starts. Must be aligned to a cluster
                     boundary.

@@ -349,6 +368,8 @@ Refcount table entry:
                     been allocated. All refcounts managed by this refcount block
                     are 0.

+        56 - 63:    Reserved (set to 0)
+
 Refcount block entry (x = refcount_bits - 1):

     Bit  0 -  x:    Reference count of the cluster. If refcount_bits implies a
@@ -365,6 +386,17 @@ The L1 table has a variable size (stored in the header) and may use multiple
 clusters, however it must be contiguous in the image file. L2 tables are
 exactly one cluster in size.

+The L1 and L2 tables have implications on the maximum virtual file
+size; a larger cluster size is required for the guest to have access
+to more space.  Furthermore, a virtual cluster must currently map to a
+host offset below 64 PB (56 bits) (this limit could be enlarged by
+putting reserved bits into use, but only if a similar limit on
+refcount tables is revisited at the same time).  Additionally, with
+larger cluster sizes, compressed clusters have a smaller limit on host
+cluster mappings (a 2M cluster size requires compressed clusters to
+reside below 512 TB (49 bits), where enlarging this would require an
+incompatible layout change).
+
 Given a offset into the virtual disk, the offset into the image file can be
 obtained as follows:

@@ -427,7 +459,9 @@ Standard Cluster Descriptor:
 Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)):

     Bit  0 - x-1:   Host cluster offset. This is usually _not_ aligned to a
-                    cluster or sector boundary!
+                    cluster or sector boundary!  If cluster_bits is
+                    small enough that this field includes bits beyond
+                    55, those upper bits must be set to 0.

          x - 61:    Number of additional 512-byte sectors used for the
                     compressed data, beyond the sector containing the offset
-- 
2.14.4


* [Qemu-devel] [PATCH v7 3/6] qcow2: Reduce REFT_OFFSET_MASK
  2018-06-28 19:07 [Qemu-devel] [PATCH v7 0/6] minor qcow2 compression improvements Eric Blake
  2018-06-28 19:07 ` [Qemu-devel] [PATCH v7 1/6] qcow2: Prefer byte-based calls into bs->file Eric Blake
  2018-06-28 19:07 ` [Qemu-devel] [PATCH v7 2/6] qcow2: Document some maximum size constraints Eric Blake
@ 2018-06-28 19:07 ` Eric Blake
  2018-06-29  8:44   ` Kevin Wolf
  2018-06-28 19:07 ` [Qemu-devel] [PATCH v7 4/6] qcow2: Don't allow overflow during cluster allocation Eric Blake
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Eric Blake @ 2018-06-28 19:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, mreitz, qemu-block, berto

Match our code to the spec change in the previous patch - there's
no reason for the refcount table to allow larger offsets than the
L1/L2 tables. In practice, no image has more than 64PB of
allocated clusters anyways, as anything beyond that can't be
expressed via L2 mappings to host offsets.
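
(Quick illustrative check, not part of this patch: the new mask selects
exactly bits 9-55 and thereby matches L2E_OFFSET_MASK.  A hypothetical
standalone assertion:)

#include <assert.h>
#include <stdint.h>

#define L2E_OFFSET_MASK  0x00fffffffffffe00ULL
#define REFT_OFFSET_MASK 0x00fffffffffffe00ULL

int main(void)
{
    /* bits 9-55 set; bits 0-8 (alignment) and 56-63 (reserved) clear */
    uint64_t bits_9_55 = ((1ULL << 56) - 1) & ~((1ULL << 9) - 1);
    assert(REFT_OFFSET_MASK == bits_9_55);
    assert(REFT_OFFSET_MASK == L2E_OFFSET_MASK);
    return 0;
}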

Suggested-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>

---
v4: new patch
---
 block/qcow2.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 01b5250415f..3774229ef9c 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -439,7 +439,7 @@ typedef enum QCow2MetadataOverlap {
 #define L2E_OFFSET_MASK 0x00fffffffffffe00ULL
 #define L2E_COMPRESSED_OFFSET_SIZE_MASK 0x3fffffffffffffffULL

-#define REFT_OFFSET_MASK 0xfffffffffffffe00ULL
+#define REFT_OFFSET_MASK 0x00fffffffffffe00ULL

 static inline int64_t start_of_cluster(BDRVQcow2State *s, int64_t offset)
 {
-- 
2.14.4


* [Qemu-devel] [PATCH v7 4/6] qcow2: Don't allow overflow during cluster allocation
  2018-06-28 19:07 [Qemu-devel] [PATCH v7 0/6] minor qcow2 compression improvements Eric Blake
                   ` (2 preceding siblings ...)
  2018-06-28 19:07 ` [Qemu-devel] [PATCH v7 3/6] qcow2: Reduce REFT_OFFSET_MASK Eric Blake
@ 2018-06-28 19:07 ` Eric Blake
  2018-06-28 19:07 ` [Qemu-devel] [PATCH v7 5/6] iotests: Add new test 220 for max compressed cluster offset Eric Blake
  2018-06-28 19:07 ` [Qemu-devel] [PATCH v7 6/6] qcow2: Avoid memory over-allocation on compressed images Eric Blake
  5 siblings, 0 replies; 17+ messages in thread
From: Eric Blake @ 2018-06-28 19:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, mreitz, qemu-block, berto

Our code was already checking that we did not attempt to
allocate more clusters than what would fit in an INT64 (the
physical maximum if we can access a full off_t's worth of
data).  But this does not catch smaller limits enforced by
various spots in the qcow2 image description: L1 and normal
clusters of L2 are documented as having bits 63-56 reserved
for other purposes, capping our maximum offset at 64PB (bit
55 is the maximum bit set).  And for compressed images with
2M clusters, the cap drops the maximum offset to bit 48, or
a maximum offset of 512TB.  If we overflow that offset, we
would write compressed data into one place, but try to
decompress from another, which won't work.

It's actually possible to prove that overflow can cause image
corruption without this patch; I'll add the iotests separately
in the next commit.
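
(Illustrative sketch, not part of this patch: how the ceiling for
compressed allocations is derived.  The two macros mirror the patch;
the rest is a hypothetical standalone recomputation:)

#include <stdint.h>
#include <stdio.h>

#define QCOW_MAX_CLUSTER_OFFSET ((1ULL << 56) - 1)
#define MIN(a, b) ((a) < (b) ? (a) : (b))

int main(void)
{
    int cluster_bits = 21;                     /* 2M clusters */
    int csize_shift = 62 - (cluster_bits - 8); /* 49 */
    /* the compressed descriptor can only hold offsets below 2^49 */
    uint64_t cluster_offset_mask = (1ULL << csize_shift) - 1;
    uint64_t max = MIN(cluster_offset_mask, QCOW_MAX_CLUSTER_OFFSET);
    printf("compressed allocations must stay below 0x%llx (512 TB)\n",
           (unsigned long long)(max + 1));
    return 0;
}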

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>

---
v6: improve commit message, now that iotests is possible [Max]
v3: use s->cluster_offset_mask instead of open-coding it [Berto],
use MIN() to clamp offset on small cluster size, add spec patch
first to justify clamping even on refcount allocations
---
 block/qcow2.h          |  6 ++++++
 block/qcow2-refcount.c | 21 ++++++++++++++-------
 2 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 3774229ef9c..16f2ce118ce 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -41,6 +41,12 @@
 #define QCOW_MAX_CRYPT_CLUSTERS 32
 #define QCOW_MAX_SNAPSHOTS 65536

+/* Field widths in qcow2 mean normal cluster offsets cannot reach
+ * 64PB; depending on cluster size, compressed clusters can have a
+ * smaller limit (64PB for up to 16k clusters, then ramps down to
+ * 512TB for 2M clusters).  */
+#define QCOW_MAX_CLUSTER_OFFSET ((1ULL << 56) - 1)
+
 /* 8 MB refcount table is enough for 2 PB images at 64k cluster size
  * (128 GB for 512 byte clusters, 2 EB for 2 MB clusters) */
 #define QCOW_MAX_REFTABLE_SIZE 0x800000
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 20e6f2e29fc..d79948742de 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -31,7 +31,8 @@
 #include "qemu/bswap.h"
 #include "qemu/cutils.h"

-static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size);
+static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size,
+                                    uint64_t max);
 static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
                             int64_t offset, int64_t length, uint64_t addend,
                             bool decrease, enum qcow2_discard_type type);
@@ -362,7 +363,8 @@ static int alloc_refcount_block(BlockDriverState *bs,
     }

     /* Allocate the refcount block itself and mark it as used */
-    int64_t new_block = alloc_clusters_noref(bs, s->cluster_size);
+    int64_t new_block = alloc_clusters_noref(bs, s->cluster_size,
+                                             QCOW_MAX_CLUSTER_OFFSET);
     if (new_block < 0) {
         return new_block;
     }
@@ -954,7 +956,8 @@ int qcow2_update_cluster_refcount(BlockDriverState *bs,


 /* return < 0 if error */
-static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size)
+static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size,
+                                    uint64_t max)
 {
     BDRVQcow2State *s = bs->opaque;
     uint64_t i, nb_clusters, refcount;
@@ -979,9 +982,9 @@ retry:
     }

     /* Make sure that all offsets in the "allocated" range are representable
-     * in an int64_t */
+     * in the requested max */
     if (s->free_cluster_index > 0 &&
-        s->free_cluster_index - 1 > (INT64_MAX >> s->cluster_bits))
+        s->free_cluster_index - 1 > (max >> s->cluster_bits))
     {
         return -EFBIG;
     }
@@ -1001,7 +1004,7 @@ int64_t qcow2_alloc_clusters(BlockDriverState *bs, uint64_t size)

     BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC);
     do {
-        offset = alloc_clusters_noref(bs, size);
+        offset = alloc_clusters_noref(bs, size, QCOW_MAX_CLUSTER_OFFSET);
         if (offset < 0) {
             return offset;
         }
@@ -1083,7 +1086,11 @@ int64_t qcow2_alloc_bytes(BlockDriverState *bs, int size)
     free_in_cluster = s->cluster_size - offset_into_cluster(s, offset);
     do {
         if (!offset || free_in_cluster < size) {
-            int64_t new_cluster = alloc_clusters_noref(bs, s->cluster_size);
+            int64_t new_cluster;
+
+            new_cluster = alloc_clusters_noref(bs, s->cluster_size,
+                                               MIN(s->cluster_offset_mask,
+                                                   QCOW_MAX_CLUSTER_OFFSET));
             if (new_cluster < 0) {
                 return new_cluster;
             }
-- 
2.14.4


* [Qemu-devel] [PATCH v7 5/6] iotests: Add new test 220 for max compressed cluster offset
  2018-06-28 19:07 [Qemu-devel] [PATCH v7 0/6] minor qcow2 compression improvements Eric Blake
                   ` (3 preceding siblings ...)
  2018-06-28 19:07 ` [Qemu-devel] [PATCH v7 4/6] qcow2: Don't allow overflow during cluster allocation Eric Blake
@ 2018-06-28 19:07 ` Eric Blake
  2018-06-28 19:07 ` [Qemu-devel] [PATCH v7 6/6] qcow2: Avoid memory over-allocation on compressed images Eric Blake
  5 siblings, 0 replies; 17+ messages in thread
From: Eric Blake @ 2018-06-28 19:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, mreitz, qemu-block, berto

If you have a capable file system (tmpfs is good, ext4 not so much;
run ./check with TEST_DIR pointing to a good location so as not
to skip the test), it's actually possible to create a qcow2 file
that expands to a sparse 512T image with just over 38M of content.
The test is not the world's fastest (qemu crawling through 256M
bits of refcount table to find the next cluster to allocate takes
several seconds, as does qemu-img check reporting millions of
leaked clusters); but it DOES catch the problem that the previous
patch just fixed where writing a compressed cluster to a full
image ended up overwriting the wrong cluster.
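
(Illustrative arithmetic, not part of the test: why 15 extra refblocks
and a 38M file suffice to mark 512T of a 513T image as referenced,
given 2M clusters and refcount_bits=1:)

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t cluster = 2 * 1024 * 1024;             /* 2M clusters */
    uint64_t refcounts_per_block = cluster * 8;     /* 16M 1-bit refcounts */
    uint64_t covered_per_block = refcounts_per_block * cluster; /* 32T */
    uint64_t referenced = 512ULL << 40;             /* 512T to mark in use */
    /* 16 refblocks total: 1 created at format time, 15 added by hand */
    printf("refblocks needed: %" PRIu64 "\n", referenced / covered_per_block);
    /* metadata footprint: additions start at 8M, so 8M + 15*2M = 38M */
    return 0;
}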

Suggested-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>

---
v7: s/214/220/
v6: new patch; took over 90 seconds to run on my setup, using tmpfs
---
 tests/qemu-iotests/220     | 97 ++++++++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/220.out | 54 ++++++++++++++++++++++++++
 tests/qemu-iotests/group   |  1 +
 3 files changed, 152 insertions(+)
 create mode 100755 tests/qemu-iotests/220
 create mode 100644 tests/qemu-iotests/220.out

diff --git a/tests/qemu-iotests/220 b/tests/qemu-iotests/220
new file mode 100755
index 00000000000..b4127facd12
--- /dev/null
+++ b/tests/qemu-iotests/220
@@ -0,0 +1,97 @@
+#!/bin/bash
+#
+# max limits on compression in huge qcow2 files
+#
+# Copyright (C) 2018 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+seq=`basename $0`
+echo "QA output created by $seq"
+
+here=`pwd`
+status=1    # failure is the default!
+
+_cleanup()
+{
+    _cleanup_test_img
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+. ./common.pattern
+
+_supported_fmt qcow2
+_supported_proto file
+_supported_os Linux
+
+echo "== Creating huge file =="
+
+# Sanity check: We require a file system that permits the creation
+# of a HUGE (but very sparse) file.  tmpfs works, ext4 does not.
+if ! truncate --size=513T "$TEST_IMG"; then
+    _notrun "file system on $TEST_DIR does not support large enough files"
+fi
+rm "$TEST_IMG"
+IMGOPTS='cluster_size=2M,refcount_bits=1' _make_test_img 513T
+
+echo "== Populating refcounts =="
+# We want an image with 256M refcounts * 2M clusters = 512T referenced.
+# Each 2M cluster holds 16M refcounts; the refcount table initially uses
+# 1 refblock, so we need to add 15 more.  The refcount table lives at 2M,
+# first refblock at 4M, L2 at 6M, so our remaining additions start at 8M.
+# Then, for each refblock, mark it as fully populated.
+to_hex() {
+    printf %016x\\n $1 | sed 's/\(..\)/\\x\1/g'
+}
+truncate --size=38m "$TEST_IMG"
+entry=$((0x200000))
+$QEMU_IO_PROG -f raw -c "w -P 0xff 4m 2m" "$TEST_IMG" | _filter_qemu_io
+for i in {1..15}; do
+    offs=$((0x600000 + i*0x200000))
+    poke_file "$TEST_IMG" $((i*8 + entry)) $(to_hex $offs)
+    $QEMU_IO_PROG -f raw -c "w -P 0xff $offs 2m" "$TEST_IMG" | _filter_qemu_io
+done
+
+echo "== Checking file before =="
+# FIXME: 'qemu-img check' doesn't diagnose refcounts beyond the end of
+# the file as leaked clusters
+_check_test_img 2>&1 | sed '/^Leaked cluster/d'
+stat -c 'image size %s' "$TEST_IMG"
+
+echo "== Trying to write compressed cluster =="
+# Given our file size, the next available cluster at 512T lies beyond the
+# maximum offset that a compressed 2M cluster can reside in
+$QEMU_IO_PROG -c 'w -c 0 2m' "$TEST_IMG" | _filter_qemu_io
+# The attempt failed, but ended up allocating a new refblock
+stat -c 'image size %s' "$TEST_IMG"
+
+echo "== Writing normal cluster =="
+# The failed write should not corrupt the image, so a normal write succeeds
+$QEMU_IO_PROG -c 'w 0 2m' "$TEST_IMG" | _filter_qemu_io
+
+echo "== Checking file after =="
+# qemu-img now sees the millions of leaked clusters, thanks to the allocations
+# at 512T.  Undo many of our faked references to speed up the check.
+$QEMU_IO_PROG -f raw -c "w -z 5m 1m" -c "w -z 8m 30m" "$TEST_IMG" |
+    _filter_qemu_io
+_check_test_img 2>&1 | sed '/^Leaked cluster/d'
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/220.out b/tests/qemu-iotests/220.out
new file mode 100644
index 00000000000..af3021fd883
--- /dev/null
+++ b/tests/qemu-iotests/220.out
@@ -0,0 +1,54 @@
+QA output created by 220
+== Creating huge file ==
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=564049465049088
+== Populating refcounts ==
+wrote 2097152/2097152 bytes at offset 4194304
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2097152/2097152 bytes at offset 8388608
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2097152/2097152 bytes at offset 10485760
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2097152/2097152 bytes at offset 12582912
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2097152/2097152 bytes at offset 14680064
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2097152/2097152 bytes at offset 16777216
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2097152/2097152 bytes at offset 18874368
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2097152/2097152 bytes at offset 20971520
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2097152/2097152 bytes at offset 23068672
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2097152/2097152 bytes at offset 25165824
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2097152/2097152 bytes at offset 27262976
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2097152/2097152 bytes at offset 29360128
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2097152/2097152 bytes at offset 31457280
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2097152/2097152 bytes at offset 33554432
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2097152/2097152 bytes at offset 35651584
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 2097152/2097152 bytes at offset 37748736
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+== Checking file before ==
+No errors were found on the image.
+image size 39845888
+== Trying to write compressed cluster ==
+write failed: Input/output error
+image size 562949957615616
+== Writing normal cluster ==
+wrote 2097152/2097152 bytes at offset 0
+2 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+== Checking file after ==
+wrote 1048576/1048576 bytes at offset 5242880
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 31457280/31457280 bytes at offset 8388608
+30 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+8388589 leaked clusters were found on the image.
+This means waste of disk space, but no harm to data.
+*** done
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index eea75819d2a..6eed45ff572 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -219,4 +219,5 @@
 217 rw auto quick
 218 rw auto quick
 219 rw auto
+220 rw auto
 221 rw auto quick
-- 
2.14.4


* [Qemu-devel] [PATCH v7 6/6] qcow2: Avoid memory over-allocation on compressed images
  2018-06-28 19:07 [Qemu-devel] [PATCH v7 0/6] minor qcow2 compression improvements Eric Blake
                   ` (4 preceding siblings ...)
  2018-06-28 19:07 ` [Qemu-devel] [PATCH v7 5/6] iotests: Add new test 220 for max compressed cluster offset Eric Blake
@ 2018-06-28 19:07 ` Eric Blake
  2018-06-29  9:03   ` Kevin Wolf
  5 siblings, 1 reply; 17+ messages in thread
From: Eric Blake @ 2018-06-28 19:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, mreitz, qemu-block, berto

When reading a compressed image, we were allocating s->cluster_data
to 32*cluster_size + 512 (possibly over 64 megabytes, for an image
with 2M clusters).  Let's check out the history:

Back when qcow2 was first written, we used s->cluster_data for
everything, including copy_sectors() and encryption, where we want
to operate on more than one cluster at once.  Obviously, at that
point, the buffer had to be aligned for other users, even though
compression itself doesn't require any alignment (the fact that
the compressed data generally starts mid-sector means that aligning
our buffer buys us nothing - either the protocol already supports
byte-based access into whatever offset we want, or we are already
using a bounce buffer to read a full sector, and copying into
our destination no longer requires alignment).

But commit 1b9f1491 (v1.1!) changed things to allocate parallel
buffers on demand rather than sharing a single buffer, for encryption
and COW, leaving compression as the final client of s->cluster_data.
That use was still preserved, because if a single compressed cluster
is read more than once, we reuse the cache instead of decompressing
it a second time (someday, we may come up with better caching to
avoid wasting repeated decompressions while still being more parallel,
but that is a task for another patch; the XXX comment in
qcow2_co_preadv for QCOW2_CLUSTER_COMPRESSED is telling).

Much later, in commit de82815d (v2.2), we noticed that a 64M
allocation is prone to failure, so we switched over to a graceful
memory allocation error message.  Elsewhere in the code, we do
g_malloc(2 * cluster_size) without ever checking for failure, but
even 4M starts to be large enough that trying to be nice is worth
the effort, so we want to keep that aspect.

Then even later, in 3e4c7052 (2.11), we realized that allocating
a large buffer up front for every qcow2 image is expensive, and
switched to lazy allocation only for images that actually had
compressed clusters.  But in the process, we never even bothered
to check whether what we were allocating still made sense in its
new context!

So, it's time to cut back on the waste.  A compressed cluster
written by qemu will NEVER occupy more than an uncompressed
cluster, but based on mid-sector alignment, we may still need
to read 1 cluster + 1 sector in order to recover enough bytes
for the decompression.  Meanwhile, third-party producers of
qcow2 may not be as smart, and gzip DOES document that because
the compression stream adds metadata, and because of the
pigeonhole principle, there are worst case scenarios where
attempts to compress will actually inflate an image, by up to
0.015% (at most 1 sector larger for an unfortunate 2M
compression), meaning we should realistically expect to
sometimes need to read 1 cluster + 2 sectors.

However, the qcow2 spec permits an all-ones sector count, plus
a full sector containing the initial offset, for a maximum read
of 2 full clusters.  Thanks to the way decompression works, even
though such a value is too large for the actual compressed data
(for all but 512-byte clusters), it really doesn't matter if we
read too much (gzip ignores slop, once it has decoded a full
cluster).  So it's easier to just allocate cluster_data to be as
large as we can ever possibly see; even if it still wastes up to
2M on any image created by qcow2, that's still an improvment of
60M less waste than pre-patch.
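
(Illustrative sketch, not part of this patch: the worst-case read the
descriptor can encode, recomputed standalone with hypothetical names:)

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int cluster_bits = 21;                       /* 2M clusters */
    uint64_t cluster_size = 1ULL << cluster_bits;
    int width = cluster_bits - 8;                /* sector-count field width */
    /* all-ones count, plus the sector containing the initial offset */
    uint64_t nb_csectors = ((1ULL << width) - 1) + 1;
    uint64_t worst_read = nb_csectors * 512;
    printf("worst-case read %" PRIu64 " bytes == 2 * cluster_size (%" PRIu64 ")\n",
           worst_read, 2 * cluster_size);
    return 0;
}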

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>

---
v5: fix math error in commit message [Max]
v4: fix off-by-one in assertion and commit message [Berto]
v3: tighten allocation [Berto]
v2: actually check allocation failure (previous version meant
to use g_malloc, but ended up posted with g_try_malloc without
checking); add assertions outside of conditional, improve
commit message to better match reality now that qcow2 spec bug
has been fixed
---
 block/qcow2-cluster.c | 27 ++++++++++++++++++---------
 block/qcow2.c         |  2 +-
 2 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 4701fbc7a12..b98fe683726 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1599,20 +1599,29 @@ int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset)
         sector_offset = coffset & 511;
         csize = nb_csectors * 512 - sector_offset;

-        /* Allocate buffers on first decompress operation, most images are
-         * uncompressed and the memory overhead can be avoided.  The buffers
-         * are freed in .bdrv_close().
+        /* Allocate buffers on the first decompress operation; most
+         * images are uncompressed and the memory overhead can be
+         * avoided.  The buffers are freed in .bdrv_close().  qemu
+         * never writes an inflated cluster, and gzip itself never
+         * inflates a problematic cluster by more than 0.015%, but the
+         * qcow2 format allows up to 1 byte short of 2 full clusters
+         * when including the sector containing offset.  gzip ignores
+         * trailing slop, so just allocate that much up front rather
+         * than reject third-party images with overlarge csize.
          */
+        assert(!!s->cluster_data == !!s->cluster_cache);
+        assert(csize <= 2 * s->cluster_size);
         if (!s->cluster_data) {
-            /* one more sector for decompressed data alignment */
-            s->cluster_data = qemu_try_blockalign(bs->file->bs,
-                    QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size + 512);
+            s->cluster_data = g_try_malloc(2 * s->cluster_size);
             if (!s->cluster_data) {
                 return -ENOMEM;
             }
-        }
-        if (!s->cluster_cache) {
-            s->cluster_cache = g_malloc(s->cluster_size);
+            s->cluster_cache = g_try_malloc(s->cluster_size);
+            if (!s->cluster_cache) {
+                g_free(s->cluster_data);
+                s->cluster_data = NULL;
+                return -ENOMEM;
+            }
         }

         BLKDBG_EVENT(bs->file, BLKDBG_READ_COMPRESSED);
diff --git a/block/qcow2.c b/block/qcow2.c
index da54b2c8f84..e90ef804cdb 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2165,7 +2165,7 @@ static void qcow2_close(BlockDriverState *bs)
     g_free(s->image_backing_format);

     g_free(s->cluster_cache);
-    qemu_vfree(s->cluster_data);
+    g_free(s->cluster_data);
     qcow2_refcount_close(bs);
     qcow2_free_snapshots(bs);
 }
-- 
2.14.4


* Re: [Qemu-devel] [PATCH v7 3/6] qcow2: Reduce REFT_OFFSET_MASK
  2018-06-28 19:07 ` [Qemu-devel] [PATCH v7 3/6] qcow2: Reduce REFT_OFFSET_MASK Eric Blake
@ 2018-06-29  8:44   ` Kevin Wolf
  2018-06-29 15:22     ` Eric Blake
  0 siblings, 1 reply; 17+ messages in thread
From: Kevin Wolf @ 2018-06-29  8:44 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, mreitz, qemu-block, berto

On 28.06.2018 21:07, Eric Blake wrote:
> Match our code to the spec change in the previous patch - there's
> no reason for the refcount table to allow larger offsets than the
> L1/L2 tables.

What about internal snapshots? And anyway, because of the metadata
overhead, the physical image size of a fully allocated image is always
going to be at least minimally larger than the virtual disk size.

I'm not necessarily opposed to making the change if there is a good
reason to make it, but I don't see a real need for it and the
justification used here and also in the previous patch is incorrect.

Kevin

> In practice, no image has more than 64PB of
> allocated clusters anyways, as anything beyond that can't be
> expressed via L2 mappings to host offsets.
> 
> Suggested-by: Alberto Garcia <berto@igalia.com>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> Reviewed-by: Alberto Garcia <berto@igalia.com>
> 
> ---
> v4: new patch
> ---
>  block/qcow2.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/qcow2.h b/block/qcow2.h
> index 01b5250415f..3774229ef9c 100644
> --- a/block/qcow2.h
> +++ b/block/qcow2.h
> @@ -439,7 +439,7 @@ typedef enum QCow2MetadataOverlap {
>  #define L2E_OFFSET_MASK 0x00fffffffffffe00ULL
>  #define L2E_COMPRESSED_OFFSET_SIZE_MASK 0x3fffffffffffffffULL
> 
> -#define REFT_OFFSET_MASK 0xfffffffffffffe00ULL
> +#define REFT_OFFSET_MASK 0x00fffffffffffe00ULL
> 
>  static inline int64_t start_of_cluster(BDRVQcow2State *s, int64_t offset)
>  {
> -- 
> 2.14.4
> 


* Re: [Qemu-devel] [PATCH v7 6/6] qcow2: Avoid memory over-allocation on compressed images
  2018-06-28 19:07 ` [Qemu-devel] [PATCH v7 6/6] qcow2: Avoid memory over-allocation on compressed images Eric Blake
@ 2018-06-29  9:03   ` Kevin Wolf
  2018-06-29 15:16     ` Eric Blake
  0 siblings, 1 reply; 17+ messages in thread
From: Kevin Wolf @ 2018-06-29  9:03 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, mreitz, qemu-block, berto

On 28.06.2018 21:07, Eric Blake wrote:
> When reading a compressed image, we were allocating s->cluster_data
> to 32*cluster_size + 512 (possibly over 64 megabytes, for an image
> with 2M clusters).  Let's check out the history:
> 
> Back when qcow2 was first written, we used s->cluster_data for
> everything, including copy_sectors() and encryption, where we want
> to operate on more than one cluster at once.  Obviously, at that
> point, the buffer had to be aligned for other users, even though
> compression itself doesn't require any alignment (the fact that
> the compressed data generally starts mid-sector means that aligning
> our buffer buys us nothing - either the protocol already supports
> byte-based access into whatever offset we want, or we are already
> using a bounce buffer to read a full sector, and copying into
> our destination no longer requires alignment).
> 
> But commit 1b9f1491 (v1.1!) changed things to allocate parallel
> buffers on demand rather than sharing a single buffer, for encryption
> and COW, leaving compression as the final client of s->cluster_data.
> That use was still preserved, because if a single compressed cluster
> is read more than once, we reuse the cache instead of decompressing
> it a second time (someday, we may come up with better caching to
> avoid wasting repeated decompressions while still being more parallel,
> but that is a task for another patch; the XXX comment in
> qcow2_co_preadv for QCOW2_CLUSTER_COMPRESSED is telling).
> 
> Much later, in commit de82815d (v2.2), we noticed that a 64M
> allocation is prone to failure, so we switched over to a graceful
> memory allocation error message.  Elsewhere in the code, we do
> g_malloc(2 * cluster_size) without ever checking for failure, but
> even 4M starts to be large enough that trying to be nice is worth
> the effort, so we want to keep that aspect.
> 
> Then even later, in 3e4c7052 (2.11), we realized that allocating
> a large buffer up front for every qcow2 image is expensive, and
> switched to lazy allocation only for images that actually had
> compressed clusters.  But in the process, we never even bothered
> to check whether what we were allocating still made sense in its
> new context!
> 
> So, it's time to cut back on the waste.  A compressed cluster
> written by qemu will NEVER occupy more than an uncompressed
> cluster, but based on mid-sector alignment, we may still need
> to read 1 cluster + 1 sector in order to recover enough bytes
> for the decompression.  Meanwhile, third-party producers of
> qcow2 may not be as smart, and gzip DOES document that because
> the compression stream adds metadata, and because of the
> pigeonhole principle, there are worst case scenarios where
> attempts to compress will actually inflate an image, by up to
> 0.015% (at most 1 sector larger for an unfortunate 2M
> compression), meaning we should realistically expect to
> sometimes need to read 1 cluster + 2 sectors.
> 
> However, the qcow2 spec permits an all-ones sector count, plus
> a full sector containing the initial offset, for a maximum read
> of 2 full clusters.  Thanks to the way decompression works, even
> though such a value is too large for the actual compressed data
> (for all but 512-byte clusters), it really doesn't matter if we
> read too much (gzip ignores slop, once it has decoded a full
> cluster).  So it's easier to just allocate cluster_data to be as
> large as we can ever possibly see; even if it still wastes up to
> 2M on any image created by qcow2, that's still an improvment of
> 60M less waste than pre-patch.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> Reviewed-by: Alberto Garcia <berto@igalia.com>
> 
> ---
> v5: fix math error in commit message [Max]
> v4: fix off-by-one in assertion and commit message [Berto]
> v3: tighten allocation [Berto]
> v2: actually check allocation failure (previous version meant
> to use g_malloc, but ended up posted with g_try_malloc without
> checking); add assertions outside of conditional, improve
> commit message to better match reality now that qcow2 spec bug
> has been fixed
> ---
>  block/qcow2-cluster.c | 27 ++++++++++++++++++---------
>  block/qcow2.c         |  2 +-
>  2 files changed, 19 insertions(+), 10 deletions(-)
> 
> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index 4701fbc7a12..b98fe683726 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -1599,20 +1599,29 @@ int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset)
>          sector_offset = coffset & 511;
>          csize = nb_csectors * 512 - sector_offset;
> 
> -        /* Allocate buffers on first decompress operation, most images are
> -         * uncompressed and the memory overhead can be avoided.  The buffers
> -         * are freed in .bdrv_close().
> +        /* Allocate buffers on the first decompress operation; most
> +         * images are uncompressed and the memory overhead can be
> +         * avoided.  The buffers are freed in .bdrv_close().  qemu
> +         * never writes an inflated cluster, and gzip itself never
> +         * inflates a problematic cluster by more than 0.015%, but the
> +         * qcow2 format allows up to 1 byte short of 2 full clusters
> +         * when including the sector containing offset.  gzip ignores
> +         * trailing slop, so just allocate that much up front rather
> +         * than reject third-party images with overlarge csize.
>           */
> +        assert(!!s->cluster_data == !!s->cluster_cache);
> +        assert(csize <= 2 * s->cluster_size);
>          if (!s->cluster_data) {
> -            /* one more sector for decompressed data alignment */
> -            s->cluster_data = qemu_try_blockalign(bs->file->bs,
> -                    QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size + 512);
> +            s->cluster_data = g_try_malloc(2 * s->cluster_size);
>              if (!s->cluster_data) {
>                  return -ENOMEM;
>              }
> -        }
> -        if (!s->cluster_cache) {
> -            s->cluster_cache = g_malloc(s->cluster_size);
> +            s->cluster_cache = g_try_malloc(s->cluster_size);
> +            if (!s->cluster_cache) {
> +                g_free(s->cluster_data);
> +                s->cluster_data = NULL;
> +                return -ENOMEM;
> +            }
>          }

I wonder how much of a difference s->cluster_cache really makes. I
wouldn't expect guests to access the same cluster twice in a row.

Maybe the easiest solution would be switching to temporary buffers that
would have the exact size we need and would be freed at the end of the
request?

Kevin


* Re: [Qemu-devel] [PATCH v7 6/6] qcow2: Avoid memory over-allocation on compressed images
  2018-06-29  9:03   ` Kevin Wolf
@ 2018-06-29 15:16     ` Eric Blake
  2018-06-29 15:47       ` Kevin Wolf
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Blake @ 2018-06-29 15:16 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, mreitz, qemu-block, berto

On 06/29/2018 04:03 AM, Kevin Wolf wrote:
> On 28.06.2018 21:07, Eric Blake wrote:
>> When reading a compressed image, we were allocating s->cluster_data
>> to 32*cluster_size + 512 (possibly over 64 megabytes, for an image
>> with 2M clusters).  Let's check out the history:
>>

>> However, the qcow2 spec permits an all-ones sector count, plus
>> a full sector containing the initial offset, for a maximum read
>> of 2 full clusters.  Thanks to the way decompression works, even
>> though such a value is too large for the actual compressed data
>> (for all but 512-byte clusters), it really doesn't matter if we
>> read too much (gzip ignores slop, once it has decoded a full
>> cluster).  So it's easier to just allocate cluster_data to be as
>> large as we can ever possibly see; even if it still wastes up to
>> 2M on any image created by qcow2, that's still an improvment of

s/improvment/improvement/

>> 60M less waste than pre-patch.
>>
>> Signed-off-by: Eric Blake <eblake@redhat.com>
>> Reviewed-by: Alberto Garcia <berto@igalia.com>
>>
>> ---

>> +++ b/block/qcow2-cluster.c
>> @@ -1599,20 +1599,29 @@ int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset)
>>           sector_offset = coffset & 511;
>>           csize = nb_csectors * 512 - sector_offset;
>>
>> -        /* Allocate buffers on first decompress operation, most images are
>> -         * uncompressed and the memory overhead can be avoided.  The buffers
>> -         * are freed in .bdrv_close().
>> +        /* Allocate buffers on the first decompress operation; most
>> +         * images are uncompressed and the memory overhead can be
>> +         * avoided.  The buffers are freed in .bdrv_close().  qemu
>> +         * never writes an inflated cluster, and gzip itself never
>> +         * inflates a problematic cluster by more than 0.015%, but the
>> +         * qcow2 format allows up to 1 byte short of 2 full clusters
>> +         * when including the sector containing offset.  gzip ignores
>> +         * trailing slop, so just allocate that much up front rather
>> +         * than reject third-party images with overlarge csize.
>>            */
>> +        assert(!!s->cluster_data == !!s->cluster_cache);
>> +        assert(csize <= 2 * s->cluster_size);
>>           if (!s->cluster_data) {
>> -            /* one more sector for decompressed data alignment */
>> -            s->cluster_data = qemu_try_blockalign(bs->file->bs,
>> -                    QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size + 512);
>> +            s->cluster_data = g_try_malloc(2 * s->cluster_size);
>>               if (!s->cluster_data) {
>>                   return -ENOMEM;
>>               }
>> -        }
>> -        if (!s->cluster_cache) {
>> -            s->cluster_cache = g_malloc(s->cluster_size);
>> +            s->cluster_cache = g_try_malloc(s->cluster_size);
>> +            if (!s->cluster_cache) {
>> +                g_free(s->cluster_data);
>> +                s->cluster_data = NULL;
>> +                return -ENOMEM;
>> +            }
>>           }
> 
> I wonder how much of a difference s->cluster_cache really makes. I
> wouldn't expect guests to access the same cluster twice in a row.

I don't know if it makes a difference.  But my patch is not even 
touching that - ALL I'm doing is changing a 64M allocation into a 4M 
allocation, with otherwise no change to the frequency of allocation 
(which is once per image).

> 
> Maybe the easiest solution would be switching to temporary buffers that
> would have the exact size we need and would be freed at the end of the
> request?

The exact size for a qemu-produced image would be at most 2M instead of 
4M - but doing the change you propose WILL cause more frequent 
allocations (once per cluster, rather than once per image).  We'd have 
to benchmark if it makes sense.

But that would be a separate patch from this one.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


* Re: [Qemu-devel] [PATCH v7 3/6] qcow2: Reduce REFT_OFFSET_MASK
  2018-06-29  8:44   ` Kevin Wolf
@ 2018-06-29 15:22     ` Eric Blake
  2018-06-29 15:43       ` Daniel P. Berrangé
  2018-06-29 15:43       ` Kevin Wolf
  0 siblings, 2 replies; 17+ messages in thread
From: Eric Blake @ 2018-06-29 15:22 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, mreitz, qemu-block, berto

On 06/29/2018 03:44 AM, Kevin Wolf wrote:
> On 28.06.2018 21:07, Eric Blake wrote:
>> Match our code to the spec change in the previous patch - there's
>> no reason for the refcount table to allow larger offsets than the
>> L1/L2 tables.
> 
> What about internal snapshots? And anyway, because of the metadata
> overhead, the physical image size of a fully allocated image is always
> going to be at least minimally larger than the virtual disk size.
> 
> I'm not necessarily opposed to making the change if there is a good
> reason to make it, but I don't see a real need for it and the
> justification used here and also in the previous patch is incorrect.

The fact that ext4 cannot hold an image this large is already an 
indication that setting this limit on the refcount table is NOT going to 
bite real users.

Yes, you can argue that with lots of metadata, including internal 
snapshots, and on a capable file system (such as tmpfs) that can even 
hold files this large to begin with, then yes, allowing the refcount to 
exceed this limit will allow slightly more metadata to be crammed into a 
single image.  But will it actually help anyone?

Do I need to respin the series to split patch 2 into the obvious changes 
(stuff unrelated to capping refcount size) vs. the controversial stuff 
(refcount cap and this code change)?

> 
> Kevin
> 
>> In practice, no image has more than 64PB of
>> allocated clusters anyways, as anything beyond that can't be
>> expressed via L2 mappings to host offsets.

If you're opposed to an exact 56-bit limit on the grounds that 56-bit 
guest data plus minimal metadata should still be expressible, would it
be better to cap refcount at 57-bits?

>>
>> Suggested-by: Alberto Garcia <berto@igalia.com>
>> Signed-off-by: Eric Blake <eblake@redhat.com>
>> Reviewed-by: Alberto Garcia <berto@igalia.com>
>>

>> +++ b/block/qcow2.h
>> @@ -439,7 +439,7 @@ typedef enum QCow2MetadataOverlap {
>>   #define L2E_OFFSET_MASK 0x00fffffffffffe00ULL
>>   #define L2E_COMPRESSED_OFFSET_SIZE_MASK 0x3fffffffffffffffULL
>>
>> -#define REFT_OFFSET_MASK 0xfffffffffffffe00ULL
>> +#define REFT_OFFSET_MASK 0x00fffffffffffe00ULL
>>
>>   static inline int64_t start_of_cluster(BDRVQcow2State *s, int64_t offset)
>>   {
>> -- 
>> 2.14.4
>>
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


* Re: [Qemu-devel] [PATCH v7 3/6] qcow2: Reduce REFT_OFFSET_MASK
  2018-06-29 15:22     ` Eric Blake
@ 2018-06-29 15:43       ` Daniel P. Berrangé
  2018-06-29 15:43       ` Kevin Wolf
  1 sibling, 0 replies; 17+ messages in thread
From: Daniel P. Berrangé @ 2018-06-29 15:43 UTC (permalink / raw)
  To: Eric Blake; +Cc: Kevin Wolf, berto, qemu-devel, qemu-block, mreitz

On Fri, Jun 29, 2018 at 10:22:22AM -0500, Eric Blake wrote:
> On 06/29/2018 03:44 AM, Kevin Wolf wrote:
> > On 28.06.2018 21:07, Eric Blake wrote:
> > > Match our code to the spec change in the previous patch - there's
> > > no reason for the refcount table to allow larger offsets than the
> > > L1/L2 tables.
> > 
> > What about internal snapshots? And anyway, because of the metadata
> > overhead, the physical image size of a fully allocated image is always
> > going to be at least minimally larger than the virtual disk size.
> > 
> > I'm not necessarily opposed to making the change if there is a good
> > reason to make it, but I don't see a real need for it and the
> > justification used here and also in the previous patch is incorrect.
> 
> The fact that ext4 cannot hold an image this large is already an indication
> that setting this limit on the refcount table is NOT going to bite real
> users.

NB, RHEL-7 defaults to xfs and this supports file sizes way larger
than ext4 does, so not sure we should consider ext4 as representative
of real world limits anymore.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


* Re: [Qemu-devel] [PATCH v7 3/6] qcow2: Reduce REFT_OFFSET_MASK
  2018-06-29 15:22     ` Eric Blake
  2018-06-29 15:43       ` Daniel P. Berrangé
@ 2018-06-29 15:43       ` Kevin Wolf
  2018-09-14 18:47         ` Eric Blake
  1 sibling, 1 reply; 17+ messages in thread
From: Kevin Wolf @ 2018-06-29 15:43 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, mreitz, qemu-block, berto

On 29.06.2018 17:22, Eric Blake wrote:
> On 06/29/2018 03:44 AM, Kevin Wolf wrote:
> > On 28.06.2018 21:07, Eric Blake wrote:
> > > Match our code to the spec change in the previous patch - there's
> > > no reason for the refcount table to allow larger offsets than the
> > > L1/L2 tables.
> > 
> > What about internal snapshots? And anyway, because of the metadata
> > overhead, the physical image size of a fully allocated image is always
> > going to be at least minimally larger than the virtual disk size.
> > 
> > I'm not necessarily opposed to making the change if there is a good
> > reason to make it, but I don't see a real need for it and the
> > justification used here and also in the previous patch is incorrect.
> 
> The fact that ext4 cannot hold an image this large is already an indication
> that setting this limit on the refcount table is NOT going to bite real
> users.
> 
> Yes, you can argue that with lots of metadata, including internal snapshots,
> and on a capable file system (such as tmpfs) that can even hold files this
> large to begin with, then yes, allowing the refcount to exceed this limit
> will allow slightly more metadata to be crammed into a single image.  But
> will it actually help anyone?
> 
> Do I need to respin the series to split patch 2 into the obvious changes
> (stuff unrelated to capping refcount size) vs. the controversial stuff
> (refcount cap and this code change)?
> 
> > 
> > Kevin
> > 
> > > In practice, no image has more than 64PB of
> > > allocated clusters anyways, as anything beyond that can't be
> > > expressed via L2 mappings to host offsets.
> 
> If you're opposed to an exact 56-bit limit on the grounds that 56-bit guest
> data plus minimal metadata should still be expressable, would it be better
> to cap refcount at 57-bits?

You're arguing that the patch doesn't hurt in practice, but what I'm
really looking for is what do we gain from it?

We generally don't apply patches just because they don't hurt, but
because they are helpful for something. In the simple, fully compatible
cases "helpful" could just mean "we like the code better".

This is a slightly different category: "it's technically incompatible,
but we don't think anyone uses this". We have done such changes before
when they allowed us to improve something substantially, but I think the
requirements for this are at least a little higher than just "because we
like it better". And I'm not even sure yet if I like it better because
it seems like an arbitrary restriction.

So apart from not hurting in practice, what does the restriction buy us?

Kevin


* Re: [Qemu-devel] [PATCH v7 6/6] qcow2: Avoid memory over-allocation on compressed images
  2018-06-29 15:16     ` Eric Blake
@ 2018-06-29 15:47       ` Kevin Wolf
  2018-06-29 16:48         ` [Qemu-devel] [Qemu-block] " Kevin Wolf
  2018-11-13 22:38         ` [Qemu-devel] " Eric Blake
  0 siblings, 2 replies; 17+ messages in thread
From: Kevin Wolf @ 2018-06-29 15:47 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, mreitz, qemu-block, berto

On 29.06.2018 17:16, Eric Blake wrote:
> On 06/29/2018 04:03 AM, Kevin Wolf wrote:
> > On 28.06.2018 21:07, Eric Blake wrote:
> > > When reading a compressed image, we were allocating s->cluster_data
> > > to 32*cluster_size + 512 (possibly over 64 megabytes, for an image
> > > with 2M clusters).  Let's check out the history:
> > > 
> 
> > > However, the qcow2 spec permits an all-ones sector count, plus
> > > a full sector containing the initial offset, for a maximum read
> > > of 2 full clusters.  Thanks to the way decompression works, even
> > > though such a value is too large for the actual compressed data
> > > (for all but 512-byte clusters), it really doesn't matter if we
> > > read too much (gzip ignores slop, once it has decoded a full
> > > cluster).  So it's easier to just allocate cluster_data to be as
> > > large as we can ever possibly see; even if it still wastes up to
> > > 2M on any image created by qcow2, that's still an improvment of
> 
> s/improvment/improvement/
> 
> > > 60M less waste than pre-patch.
> > > 
> > > Signed-off-by: Eric Blake <eblake@redhat.com>
> > > Reviewed-by: Alberto Garcia <berto@igalia.com>
> > > 
> > > ---
> 
> > > +++ b/block/qcow2-cluster.c
> > > @@ -1599,20 +1599,29 @@ int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset)
> > >           sector_offset = coffset & 511;
> > >           csize = nb_csectors * 512 - sector_offset;
> > > 
> > > -        /* Allocate buffers on first decompress operation, most images are
> > > -         * uncompressed and the memory overhead can be avoided.  The buffers
> > > -         * are freed in .bdrv_close().
> > > +        /* Allocate buffers on the first decompress operation; most
> > > +         * images are uncompressed and the memory overhead can be
> > > +         * avoided.  The buffers are freed in .bdrv_close().  qemu
> > > +         * never writes an inflated cluster, and gzip itself never
> > > +         * inflates a problematic cluster by more than 0.015%, but the
> > > +         * qcow2 format allows up to 1 byte short of 2 full clusters
> > > +         * when including the sector containing offset.  gzip ignores
> > > +         * trailing slop, so just allocate that much up front rather
> > > +         * than reject third-party images with overlarge csize.
> > >            */
> > > +        assert(!!s->cluster_data == !!s->cluster_cache);
> > > +        assert(csize <= 2 * s->cluster_size);
> > >           if (!s->cluster_data) {
> > > -            /* one more sector for decompressed data alignment */
> > > -            s->cluster_data = qemu_try_blockalign(bs->file->bs,
> > > -                    QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size + 512);
> > > +            s->cluster_data = g_try_malloc(2 * s->cluster_size);
> > >               if (!s->cluster_data) {
> > >                   return -ENOMEM;
> > >               }
> > > -        }
> > > -        if (!s->cluster_cache) {
> > > -            s->cluster_cache = g_malloc(s->cluster_size);
> > > +            s->cluster_cache = g_try_malloc(s->cluster_size);
> > > +            if (!s->cluster_cache) {
> > > +                g_free(s->cluster_data);
> > > +                s->cluster_data = NULL;
> > > +                return -ENOMEM;
> > > +            }
> > >           }
> > 
> > I wonder how much of a difference s->cluster_cache really makes. I
> > wouldn't expect guests to access the same cluster twice in a row.
> 
> I don't know if it makes a difference.  But my patch is not even touching
> that - ALL I'm doing is changing a 64M allocation into a 4M allocation, with
> otherwise no change to the frequency of allocation (which is once per
> image).
> 
> > 
> > Maybe the easiest solution would be switching to temporary buffers that
> > would have the exact size we need and would be freed at the end of the
> > request?
> 
> The exact size for a qemu-produced image would be at most 2M instead of 4M -
> but doing the change you propose WILL cause more frequent allocations (once
> per cluster, rather than once per image).  We'd have to benchmark if it
> makes sense.

I wouldn't expect that another allocation makes a big difference when
you already have to decompress the whole cluster. In fact, it could
speed things up because we could then parallelise this.

Hmm... Wasn't there a series for using a worker thread for decompression
recently? It might actually already make that change; I don't remember.

> But that would be a separate patch from this one.

Yes, just a thought I had while reviewing your patch.

Kevin

^ permalink raw reply	[flat|nested] 17+ messages in thread
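
To put concrete numbers on the 64M-vs-4M exchange above: with 2M clusters
(cluster_bits = 21), the sector-count field of a compressed-cluster
descriptor is cluster_bits - 8 = 13 bits wide, so csize tops out at
8192 * 512 bytes, i.e. two full clusters.  A sketch of the arithmetic,
following the descriptor layout in docs/interop/qcow2.txt (standalone and
illustrative only, not qemu code):

  #include <stdint.h>
  #include <inttypes.h>
  #include <stdio.h>

  int main(void)
  {
      unsigned cluster_bits = 21;                  /* 2M clusters */
      uint64_t cluster_size = UINT64_C(1) << cluster_bits;
      unsigned count_bits = cluster_bits - 8;      /* sector-count width */
      uint64_t max_nb_csectors = UINT64_C(1) << count_bits;  /* 8192 */
      uint64_t max_csize = max_nb_csectors * 512;  /* sector_offset == 0 */

      printf("worst-case csize: %" PRIu64 "M (two clusters)\n",
             max_csize >> 20);
      /* QCOW_MAX_CRYPT_CLUSTERS is 32, hence the old ~64M buffer. */
      printf("old buffer: ~%" PRIu64 "M, patched buffer: %" PRIu64 "M\n",
             (32 * cluster_size + 512) >> 20, (2 * cluster_size) >> 20);
      return 0;
  }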

* Re: [Qemu-devel] [Qemu-block] [PATCH v7 6/6] qcow2: Avoid memory over-allocation on compressed images
  2018-06-29 15:47       ` Kevin Wolf
@ 2018-06-29 16:48         ` Kevin Wolf
  2018-11-13 22:38         ` [Qemu-devel] " Eric Blake
  1 sibling, 0 replies; 17+ messages in thread
From: Kevin Wolf @ 2018-06-29 16:48 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, qemu-block, mreitz

On 29.06.2018 at 17:47, Kevin Wolf wrote:
> On 29.06.2018 at 17:16, Eric Blake wrote:
> > On 06/29/2018 04:03 AM, Kevin Wolf wrote:
> > > On 28.06.2018 at 21:07, Eric Blake wrote:
> > > > When reading a compressed image, we were allocating s->cluster_data
> > > > to 32*cluster_size + 512 (possibly over 64 megabytes, for an image
> > > > with 2M clusters).  Let's check out the history:
> > > > 
> > 
> > > > However, the qcow2 spec permits an all-ones sector count, plus
> > > > a full sector containing the initial offset, for a maximum read
> > > > of 2 full clusters.  Thanks to the way decompression works, even
> > > > though such a value is too large for the actual compressed data
> > > > (for all but 512-byte clusters), it really doesn't matter if we
> > > > read too much (gzip ignores slop, once it has decoded a full
> > > > cluster).  So it's easier to just allocate cluster_data to be as
> > > > large as we can ever possibly see; even if it still wastes up to
> > > > 2M on any image created by qcow2, that's still an improvment of
> > 
> > s/improvment/improvement/
> > 
> > > > 60M less waste than pre-patch.
> > > > 
> > > > Signed-off-by: Eric Blake <eblake@redhat.com>
> > > > Reviewed-by: Alberto Garcia <berto@igalia.com>
> > > > 
> > > > ---
> > 
> > > > +++ b/block/qcow2-cluster.c
> > > > @@ -1599,20 +1599,29 @@ int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset)
> > > >           sector_offset = coffset & 511;
> > > >           csize = nb_csectors * 512 - sector_offset;
> > > > 
> > > > -        /* Allocate buffers on first decompress operation, most images are
> > > > -         * uncompressed and the memory overhead can be avoided.  The buffers
> > > > -         * are freed in .bdrv_close().
> > > > +        /* Allocate buffers on the first decompress operation; most
> > > > +         * images are uncompressed and the memory overhead can be
> > > > +         * avoided.  The buffers are freed in .bdrv_close().  qemu
> > > > +         * never writes an inflated cluster, and gzip itself never
> > > > +         * inflates a problematic cluster by more than 0.015%, but the
> > > > +         * qcow2 format allows up to 1 byte short of 2 full clusters
> > > > +         * when including the sector containing offset.  gzip ignores
> > > > +         * trailing slop, so just allocate that much up front rather
> > > > +         * than reject third-party images with overlarge csize.
> > > >            */
> > > > +        assert(!!s->cluster_data == !!s->cluster_cache);
> > > > +        assert(csize <= 2 * s->cluster_size);
> > > >           if (!s->cluster_data) {
> > > > -            /* one more sector for decompressed data alignment */
> > > > -            s->cluster_data = qemu_try_blockalign(bs->file->bs,
> > > > -                    QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size + 512);
> > > > +            s->cluster_data = g_try_malloc(2 * s->cluster_size);
> > > >               if (!s->cluster_data) {
> > > >                   return -ENOMEM;
> > > >               }
> > > > -        }
> > > > -        if (!s->cluster_cache) {
> > > > -            s->cluster_cache = g_malloc(s->cluster_size);
> > > > +            s->cluster_cache = g_try_malloc(s->cluster_size);
> > > > +            if (!s->cluster_cache) {
> > > > +                g_free(s->cluster_data);
> > > > +                s->cluster_data = NULL;
> > > > +                return -ENOMEM;
> > > > +            }
> > > >           }
> > > 
> > > I wonder how much of a difference s->cluster_cache really makes. I
> > > wouldn't expect guests to access the same cluster twice in a row.
> > 
> > I don't know if it makes a difference.  But my patch is not even touching
> > that - ALL I'm doing is changing a 64M allocation into a 4M allocation, with
> > otherwise no change to the frequency of allocation (which is once per
> > image).
> > 
> > > 
> > > Maybe the easiest solution would be switching to temporary buffers that
> > > would have the exact size we need and would be freed at the end of the
> > > request?
> > 
> > The exact size for a qemu-produced image would be at most 2M instead of 4M -
> > but doing the change you propose WILL cause more frequent allocations (once
> > per cluster, rather than once per image).  We'd have to benchmark if it
> > makes sense.
> 
> I wouldn't expect that another allocation makes a big difference when
> you already have to decompress the whole cluster. In fact, it could
> speed things up because we could then parallelise this.
> 
> Hmm... Wasn't there a series for using a worker thread for decompression
> recently? It might actually already make that change; I don't remember.

Actually no, it was compression, not decompression.

Kevin

^ permalink raw reply	[flat|nested] 17+ messages in thread
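
For the temporary-buffer idea floated above, the shape would be roughly
the following (hypothetical helper name and signature; bdrv_pread() and
decompress_buffer() are the calls the patch already uses; error handling
trimmed).  A sketch, not the code that was merged:

  static int decompress_cluster_tmpbuf(BlockDriverState *bs,
                                       uint64_t coffset, size_t csize,
                                       uint8_t *out, size_t cluster_size)
  {
      /* Exact-size, per-request buffer instead of a cached
       * per-image one. */
      uint8_t *cdata = g_try_malloc(csize);
      int ret;

      if (!cdata) {
          return -ENOMEM;
      }
      ret = bdrv_pread(bs->file, coffset, cdata, csize);
      if (ret >= 0) {
          ret = decompress_buffer(out, cluster_size, cdata, csize);
      }
      g_free(cdata); /* freed at the end of the request */
      return ret;
  }

Whether the per-cluster allocation is measurable next to the cost of
decompression is exactly the benchmarking question raised above.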

* Re: [Qemu-devel] [PATCH v7 3/6] qcow2: Reduce REFT_OFFSET_MASK
  2018-06-29 15:43       ` Kevin Wolf
@ 2018-09-14 18:47         ` Eric Blake
  0 siblings, 0 replies; 17+ messages in thread
From: Eric Blake @ 2018-09-14 18:47 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, mreitz, qemu-block, berto

[revisiting this thread]

On 6/29/18 10:43 AM, Kevin Wolf wrote:
> On 29.06.2018 at 17:22, Eric Blake wrote:
>> On 06/29/2018 03:44 AM, Kevin Wolf wrote:
>>> On 28.06.2018 at 21:07, Eric Blake wrote:
>>>> Match our code to the spec change in the previous patch - there's
>>>> no reason for the refcount table to allow larger offsets than the
>>>> L1/L2 tables.
>>>
>>> What about internal snapshots? And anyway, because of the metadata
>>> overhead, the physical image size of a fully allocated image is always
>>> going to be at least minimally larger than the virtual disk size.
>>>
>>> I'm not necessarily opposed to making the change if there is a good
>>> reason to make it, but I don't see a real need for it and the
>>> justification used here and also in the previous patch is incorrect.
>>
>> The fact that ext4 cannot hold an image this large is already an indication
>> that setting this limit on the refcount table is NOT going to bite real
>> users.
>>
>> Yes, you can argue that with lots of metadata (including internal
>> snapshots) and a file system capable of holding files this large to
>> begin with (such as tmpfs), allowing the refcount to exceed this limit
>> will let slightly more metadata be crammed into a single image.  But
>> will it actually help anyone?
>>
>> Do I need to respin the series to split patch 2 into the obvious changes
>> (stuff unrelated to capping refcount size) vs. the controversial stuff
>> (refcount cap and this code change)?
>>
>>>
>>> Kevin
>>>
>>>> In practice, no image has more than 64PB of
>>>> allocated clusters anyways, as anything beyond that can't be
>>>> expressed via L2 mappings to host offsets.
>>
>> If you're opposed to an exact 56-bit limit on the grounds that 56-bit guest
>> data plus minimal metadata should still be expressible, would it be better
>> to cap refcount at 57-bits?
> 
> You're arguing that the patch doesn't hurt in practice, but what I'm
> really looking for is what we gain from it.

I guess one thing to be gained by having qemu's implementation of qcow2 
impose a hard stop at 56 bits is that it becomes easier to flag refcount 
entries beyond the end of the file as errant, rather than attempting to 
expand the file during 'qemu-img check' just because a refcount entry 
pointed out that far.  But that's still a pretty weak argument (we've 
already mentioned that qemu-img check should do a better job of 
handling/flagging refcount entries that point way beyond the end of the 
image - and that's true regardless of what limits may be placed on the 
host image size).

> 
> We generally don't apply patches just because they don't hurt, but
> because they are helpful for something. In the simple, fully compatible
> cases "helpful" could just mean "we like the code better".
> 
> This is a slightly different category: "it's technically incompatible,
> but we don't think anyone uses this". We have done such changes before
> when they allowed us to improve something substantially, but I think the
> requirements for this are at least a little higher than just "because we
> like it better". And I'm not even sure yet if I like it better because
> it seems like an arbitrary restriction.
> 
> So apart from not hurting in practice, what does the restriction buy us?

As I'm not coming up with a solid reason for tightening the bounds, I 
will (eventually) respin the series without the spec change.  At this 
point, it's probably easier for me to wait until after the pull request 
queue has started flushing again.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

^ permalink raw reply	[flat|nested] 17+ messages in thread
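
A rough illustration of the "better job of handling/flagging" mentioned
above (hypothetical helper; the real checking logic lives in
block/qcow2-refcount.c and is more involved; REFT_OFFSET_MASK shown at
its pre-series value):

  #include <stdbool.h>
  #include <stdint.h>

  #define REFT_OFFSET_MASK 0xfffffffffffffe00ULL /* bits 9..63 */

  /* Flag a refcount-table entry whose refcount block lies wholly beyond
   * EOF as errant, instead of growing the file to cover it. */
  static bool reftable_entry_is_errant(uint64_t entry, uint64_t file_size)
  {
      uint64_t offset = entry & REFT_OFFSET_MASK;

      return offset >= file_size;
  }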

* Re: [Qemu-devel] [PATCH v7 6/6] qcow2: Avoid memory over-allocation on compressed images
  2018-06-29 15:47       ` Kevin Wolf
  2018-06-29 16:48         ` [Qemu-devel] [Qemu-block] " Kevin Wolf
@ 2018-11-13 22:38         ` Eric Blake
  1 sibling, 0 replies; 17+ messages in thread
From: Eric Blake @ 2018-11-13 22:38 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, mreitz, qemu-block, berto

On 6/29/18 10:47 AM, Kevin Wolf wrote:
> On 29.06.2018 at 17:16, Eric Blake wrote:
>> On 06/29/2018 04:03 AM, Kevin Wolf wrote:
>>> On 28.06.2018 at 21:07, Eric Blake wrote:
>>>> When reading a compressed image, we were allocating s->cluster_data
>>>> to 32*cluster_size + 512 (possibly over 64 megabytes, for an image
>>>> with 2M clusters).  Let's check out the history:
>>>>
>>
>>>> However, the qcow2 spec permits an all-ones sector count, plus
>>>> a full sector containing the initial offset, for a maximum read
>>>> of 2 full clusters.  Thanks to the way decompression works, even
>>>> though such a value is too large for the actual compressed data
>>>> (for all but 512-byte clusters), it really doesn't matter if we
>>>> read too much (gzip ignores slop, once it has decoded a full
>>>> cluster).  So it's easier to just allocate cluster_data to be as
>>>> large as we can ever possibly see; even if it still wastes up to
>>>> 2M on any image created by qcow2, that's still an improvment of
>>
>> s/improvment/improvement/
>>
>>>> 60M less waste than pre-patch.
>>>>

>>> I wonder how much of a difference s->cluster_cache really makes. I
>>> wouldn't expect guests to access the same cluster twice in a row.
>>
>> I don't know if it makes a difference.  But my patch is not even touching
>> that - ALL I'm doing is changing a 64M allocation into a 4M allocation, with
>> otherwise no change to the frequency of allocation (which is once per
>> image).
>>
>>>
>>> Maybe the easiest solution would be switching to temporary buffers that
>>> would have the exact size we need and would be freed at the end of the
>>> request?
>>
>> The exact size for a qemu-produced image would be at most 2M instead of 4M -
>> but doing the change you propose WILL cause more frequent allocations (once
>> per cluster, rather than once per image).  We'd have to benchmark if it
>> makes sense.
> 
> I wouldn't expect that another allocation makes a big difference when
> you already have to decompress the whole cluster. In fact, it could
> speed things up because we could then parallelise this.
> 
> Hmm... Wasn't there a series for using a worker thread for decompression
> recently? It might actually already make that change; I don't remember.
> 
>> But that would be a separate patch from this one.
> 
> Yes, just a thought I had while reviewing your patch.

Well, such a patch has now landed in your block-next queue, so I'm going 
to rebase this patch on top of that (if there's still anything left to 
rebase, that is), and submit the remaining parts of this series that 
still make sense for 3.1 as a v8 posting.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

^ permalink raw reply	[flat|nested] 17+ messages in thread

Thread overview: 17+ messages
2018-06-28 19:07 [Qemu-devel] [PATCH v7 0/6] minor qcow2 compression improvements Eric Blake
2018-06-28 19:07 ` [Qemu-devel] [PATCH v7 1/6] qcow2: Prefer byte-based calls into bs->file Eric Blake
2018-06-28 19:07 ` [Qemu-devel] [PATCH v7 2/6] qcow2: Document some maximum size constraints Eric Blake
2018-06-28 19:07 ` [Qemu-devel] [PATCH v7 3/6] qcow2: Reduce REFT_OFFSET_MASK Eric Blake
2018-06-29  8:44   ` Kevin Wolf
2018-06-29 15:22     ` Eric Blake
2018-06-29 15:43       ` Daniel P. Berrangé
2018-06-29 15:43       ` Kevin Wolf
2018-09-14 18:47         ` Eric Blake
2018-06-28 19:07 ` [Qemu-devel] [PATCH v7 4/6] qcow2: Don't allow overflow during cluster allocation Eric Blake
2018-06-28 19:07 ` [Qemu-devel] [PATCH v7 5/6] iotests: Add new test 220 for max compressed cluster offset Eric Blake
2018-06-28 19:07 ` [Qemu-devel] [PATCH v7 6/6] qcow2: Avoid memory over-allocation on compressed images Eric Blake
2018-06-29  9:03   ` Kevin Wolf
2018-06-29 15:16     ` Eric Blake
2018-06-29 15:47       ` Kevin Wolf
2018-06-29 16:48         ` [Qemu-devel] [Qemu-block] " Kevin Wolf
2018-11-13 22:38         ` [Qemu-devel] " Eric Blake
