* [Qemu-devel] [PATCH v5 0/5] minor qcow2 compression improvements
@ 2018-04-23 22:33 Eric Blake
  2018-04-23 22:33 ` [Qemu-devel] [PATCH v5 1/5] qcow2: Prefer byte-based calls into bs->file Eric Blake
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Eric Blake @ 2018-04-23 22:33 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, qemu-block, berto

Another attempt at this series; v4 was over a month ago:
https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg06780.html

Since then: add R-b, and another round of wording tweaks [Berto, Max],
emphasizing that the refcount table is not the limiting factor and that
other limits, such as ext4's maximum file size, are likely to be hit first.

001/5:[----] [--] 'qcow2: Prefer byte-based calls into bs->file'
002/5:[0020] [FC] 'qcow2: Document some maximum size constraints'
003/5:[----] [--] 'qcow2: Reduce REFT_OFFSET_MASK'
004/5:[----] [--] 'qcow2: Don't allow overflow during cluster allocation'
005/5:[----] [--] 'qcow2: Avoid memory over-allocation on compressed images'

Eric Blake (5):
  qcow2: Prefer byte-based calls into bs->file
  qcow2: Document some maximum size constraints
  qcow2: Reduce REFT_OFFSET_MASK
  qcow2: Don't allow overflow during cluster allocation
  qcow2: Avoid memory over-allocation on compressed images

 docs/interop/qcow2.txt | 40 +++++++++++++++++++++++++++++++++++++---
 block/qcow2.h          |  8 +++++++-
 block/qcow2-cluster.c  | 32 ++++++++++++++++++++------------
 block/qcow2-refcount.c | 27 +++++++++++++++++----------
 block/qcow2.c          |  2 +-
 5 files changed, 82 insertions(+), 27 deletions(-)

-- 
2.14.3

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Qemu-devel] [PATCH v5 1/5] qcow2: Prefer byte-based calls into bs->file
  2018-04-23 22:33 [Qemu-devel] [PATCH v5 0/5] minor qcow2 compression improvements Eric Blake
@ 2018-04-23 22:33 ` Eric Blake
  2018-04-23 22:33 ` [Qemu-devel] [PATCH v5 2/5] qcow2: Document some maximum size constraints Eric Blake
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Eric Blake @ 2018-04-23 22:33 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, qemu-block, berto, Max Reitz

We had only a few sector-based stragglers left; convert them to use
our preferred byte-based accesses.
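The conversion preserves the bytes actually needed: the byte-based
request is always contained in the old sector-aligned one.  A quick
sketch (Python, with a made-up coffset/nb_csectors; the variable names
mirror qcow2_decompress_cluster()):

```python
BDRV_SECTOR_SIZE = 512

# Hypothetical values; in the driver these come from the L2 descriptor.
coffset = 0x12345          # compressed data starts mid-sector
nb_csectors = 5

sector_offset = coffset & 511
csize = nb_csectors * 512 - sector_offset

# Old sector-based read: round down to a sector, read whole sectors.
old_start = (coffset >> 9) * BDRV_SECTOR_SIZE
old_len = nb_csectors * BDRV_SECTOR_SIZE

# New byte-based read: start at coffset, read exactly csize bytes.  It is
# contained in the old request, so no data is lost, and the sector_offset
# bookkeeping in decompress_buffer() goes away.
assert old_start <= coffset
assert coffset + csize == old_start + old_len
```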

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>

---
v5: commit message tweak
v2: indentation fix
---
 block/qcow2-cluster.c  | 5 ++---
 block/qcow2-refcount.c | 6 +++---
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 1aee726c6a4..1dc00ff2110 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1616,13 +1616,12 @@ int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset)
         }

         BLKDBG_EVENT(bs->file, BLKDBG_READ_COMPRESSED);
-        ret = bdrv_read(bs->file, coffset >> 9, s->cluster_data,
-                        nb_csectors);
+        ret = bdrv_pread(bs->file, coffset, s->cluster_data, csize);
         if (ret < 0) {
             return ret;
         }
         if (decompress_buffer(s->cluster_cache, s->cluster_size,
-                              s->cluster_data + sector_offset, csize) < 0) {
+                              s->cluster_data, csize) < 0) {
             return -EIO;
         }
         s->cluster_cache_offset = coffset;
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 6b8b63514af..6bfc11bb48f 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -2359,8 +2359,8 @@ write_refblocks:
         on_disk_refblock = (void *)((char *) *refcount_table +
                                     refblock_index * s->cluster_size);

-        ret = bdrv_write(bs->file, refblock_offset / BDRV_SECTOR_SIZE,
-                         on_disk_refblock, s->cluster_sectors);
+        ret = bdrv_pwrite(bs->file, refblock_offset, on_disk_refblock,
+                          s->cluster_size);
         if (ret < 0) {
             fprintf(stderr, "ERROR writing refblock: %s\n", strerror(-ret));
             goto fail;
@@ -2582,7 +2582,7 @@ fail:
  * - 0 if writing to this offset will not affect the mentioned metadata
  * - a positive QCow2MetadataOverlap value indicating one overlapping section
  * - a negative value (-errno) indicating an error while performing a check,
- *   e.g. when bdrv_read failed on QCOW2_OL_INACTIVE_L2
+ *   e.g. when bdrv_pread failed on QCOW2_OL_INACTIVE_L2
  */
 int qcow2_check_metadata_overlap(BlockDriverState *bs, int ign, int64_t offset,
                                  int64_t size)
-- 
2.14.3

* [Qemu-devel] [PATCH v5 2/5] qcow2: Document some maximum size constraints
  2018-04-23 22:33 [Qemu-devel] [PATCH v5 0/5] minor qcow2 compression improvements Eric Blake
  2018-04-23 22:33 ` [Qemu-devel] [PATCH v5 1/5] qcow2: Prefer byte-based calls into bs->file Eric Blake
@ 2018-04-23 22:33 ` Eric Blake
  2018-04-24  9:13   ` Alberto Garcia
  2018-04-23 22:33 ` [Qemu-devel] [PATCH v5 3/5] qcow2: Reduce REFT_OFFSET_MASK Eric Blake
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Eric Blake @ 2018-04-23 22:33 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, qemu-block, berto, Max Reitz

Although off_t permits up to 63 bits (8EB) of file offsets, in
practice, we're going to hit other limits first.  Document some
of those limits in the qcow2 spec, and how choice of cluster size
can influence some of the limits.

While at it, notice that since we cannot map any virtual cluster
to any address higher than 64 PB (56 bits) (due to the current L1/L2
field encoding stopping at bit 55), it makes little sense to require
the refcount table to access host offsets beyond that point.  Mark
the upper bits of the refcount table entries as reserved to match
the L1/L2 table, with no ill effects, since it is unlikely that there
are any existing images larger than 64PB in the first place, and thus
all existing images already have those bits as 0.  If 64PB proves to
be too small in the future, we could enlarge all three uses of bit
55 into the reserved bits at that time.  (For reference, ext4 with
4k blocks caps files at 16TB.)

However, there is one limit that reserved bits don't help with: for
compressed clusters, the L2 layout requires an ever-smaller maximum
host offset as cluster size gets larger, down to a 512 TB maximum
with 2M clusters.
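The 512 TB figure falls out of the compressed-cluster descriptor layout
(host offset in bits 0..x-1, with x = 62 - (cluster_bits - 8)); an
illustrative calculation:

```python
def compressed_offset_limit(cluster_bits):
    """Exclusive upper bound on the host offset a compressed-cluster L2
    descriptor can encode: the offset occupies bits 0..x-1, where
    x = 62 - (cluster_bits - 8) (the remaining bits hold the sector count)."""
    x = 62 - (cluster_bits - 8)
    return 1 << x

TB = 1 << 40
PB = 1 << 50

assert compressed_offset_limit(21) == 512 * TB  # 2M clusters: the 512 TB cap
assert compressed_offset_limit(14) == 64 * PB   # 16k clusters: meets the 64 PB cap
assert compressed_offset_limit(9) > 64 * PB     # 512B clusters: the field allows
                                                # more, but bits beyond 55 stay 0
```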

Signed-off-by: Eric Blake <eblake@redhat.com>

---
v5: even more wording tweaks
v4: more wording tweaks
v3: new patch
---
 docs/interop/qcow2.txt | 40 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 37 insertions(+), 3 deletions(-)

diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
index feb711fb6a8..f0338f356fe 100644
--- a/docs/interop/qcow2.txt
+++ b/docs/interop/qcow2.txt
@@ -40,7 +40,17 @@ The first cluster of a qcow2 image contains the file header:
                     with larger cluster sizes.

          24 - 31:   size
-                    Virtual disk size in bytes
+                    Virtual disk size in bytes.
+
+                    Note: with a 2 MB cluster size, the maximum
+                    virtual size is 2 EB (61 bits) for a sparse file,
+                    but other sizing limitations in refcount and L1/L2
+                    tables mean that an image cannot have more than 64
+                    PB (56 bits) of populated clusters (assuming the
+                    image does not first hit other limits such as a
+                    file system's maximum size).  With a 512 byte
+                    cluster size, the maximum virtual size drops to
+                    128 GB (37 bits).

          32 - 35:   crypt_method
                     0 for no encryption
@@ -326,6 +336,15 @@ in the image file.
 It contains pointers to the second level structures which are called refcount
 blocks and are exactly one cluster in size.

+All qcow2 metadata, including the refcount table and refcount blocks,
+must currently reside at host offsets below 64 PB (56 bits) (if the
+underlying protocol can even be sized that large); this limit could be
+enlarged by putting reserved bits into use, but only if a similar
+limit on L1/L2 tables is revisited at the same time.  While a larger
+cluster size theoretically allows the refcount table to cover more
+host offsets, in practice, other inherent limits will constrain the
+maximum image size before the refcount table is full.
+
 Given a offset into the image file, the refcount of its cluster can be obtained
 as follows:

@@ -341,7 +360,7 @@ Refcount table entry:

     Bit  0 -  8:    Reserved (set to 0)

-         9 - 63:    Bits 9-63 of the offset into the image file at which the
+         9 - 55:    Bits 9-55 of the offset into the image file at which the
                     refcount block starts. Must be aligned to a cluster
                     boundary.

@@ -349,6 +368,8 @@ Refcount table entry:
                     been allocated. All refcounts managed by this refcount block
                     are 0.

+        56 - 63:    Reserved (set to 0)
+
 Refcount block entry (x = refcount_bits - 1):

     Bit  0 -  x:    Reference count of the cluster. If refcount_bits implies a
@@ -365,6 +386,17 @@ The L1 table has a variable size (stored in the header) and may use multiple
 clusters, however it must be contiguous in the image file. L2 tables are
 exactly one cluster in size.

+The L1 and L2 tables have implications on the maximum virtual file
+size; a larger cluster size is required for the guest to have access
+to more space.  Furthermore, a virtual cluster must currently map to a
+host offset below 64 PB (56 bits) (this limit could be enlarged by
+putting reserved bits into use, but only if a similar limit on
+refcount tables is revisited at the same time).  Additionally, with
+larger cluster sizes, compressed clusters have a smaller limit on host
+cluster mappings (a 2M cluster size requires compressed clusters to
+reside below 512 TB (49 bits), where enlarging this would require an
+incompatible layout change).
+
 Given a offset into the virtual disk, the offset into the image file can be
 obtained as follows:

@@ -427,7 +459,9 @@ Standard Cluster Descriptor:
 Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)):

     Bit  0 - x-1:   Host cluster offset. This is usually _not_ aligned to a
-                    cluster or sector boundary!
+                    cluster or sector boundary!  If cluster_bits is
+                    small enough that this field includes bits beyond
+                    55, those upper bits must be set to 0.

          x - 61:    Number of additional 512-byte sectors used for the
                     compressed data, beyond the sector containing the offset
-- 
2.14.3

* [Qemu-devel] [PATCH v5 3/5] qcow2: Reduce REFT_OFFSET_MASK
  2018-04-23 22:33 [Qemu-devel] [PATCH v5 0/5] minor qcow2 compression improvements Eric Blake
  2018-04-23 22:33 ` [Qemu-devel] [PATCH v5 1/5] qcow2: Prefer byte-based calls into bs->file Eric Blake
  2018-04-23 22:33 ` [Qemu-devel] [PATCH v5 2/5] qcow2: Document some maximum size constraints Eric Blake
@ 2018-04-23 22:33 ` Eric Blake
  2018-04-23 22:33 ` [Qemu-devel] [PATCH v5 4/5] qcow2: Don't allow overflow during cluster allocation Eric Blake
  2018-04-23 22:33 ` [Qemu-devel] [PATCH v5 5/5] qcow2: Avoid memory over-allocation on compressed images Eric Blake
  4 siblings, 0 replies; 12+ messages in thread
From: Eric Blake @ 2018-04-23 22:33 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, qemu-block, berto, Max Reitz

Match our code to the spec change in the previous patch - there's
no reason for the refcount table to allow larger offsets than the
L1/L2 tables. In practice, no image has more than 64PB of
allocated clusters anyways, as anything beyond that can't be
expressed via L2 mappings to host offsets.
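The new mask value can be derived from the stated bit ranges; a quick
check (constants copied from the diff; the old value had bits 56-63 set):

```python
# As in block/qcow2.h after this patch; the L2 mask is unchanged context.
L2E_OFFSET_MASK  = 0x00fffffffffffe00
REFT_OFFSET_MASK = 0x00fffffffffffe00  # previously 0xfffffffffffe00 with
                                       # bits 56-63 also set (bits 9-63)

# Bits 0-8 are reserved (refblocks are cluster-aligned, so at least
# 512-byte aligned) and bits 56-63 are now reserved to match L1/L2:
low_reserved = (1 << 9) - 1
max_56bit = (1 << 56) - 1
assert REFT_OFFSET_MASK == max_56bit & ~low_reserved
assert REFT_OFFSET_MASK == L2E_OFFSET_MASK  # refcount table now matches L2
```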

Suggested-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>

---
v4: new patch
---
 block/qcow2.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index adf5c3950fd..1df15a18aa1 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -443,7 +443,7 @@ typedef enum QCow2MetadataOverlap {
 #define L2E_OFFSET_MASK 0x00fffffffffffe00ULL
 #define L2E_COMPRESSED_OFFSET_SIZE_MASK 0x3fffffffffffffffULL

-#define REFT_OFFSET_MASK 0xfffffffffffffe00ULL
+#define REFT_OFFSET_MASK 0x00fffffffffffe00ULL

 static inline int64_t start_of_cluster(BDRVQcow2State *s, int64_t offset)
 {
-- 
2.14.3

* [Qemu-devel] [PATCH v5 4/5] qcow2: Don't allow overflow during cluster allocation
  2018-04-23 22:33 [Qemu-devel] [PATCH v5 0/5] minor qcow2 compression improvements Eric Blake
                   ` (2 preceding siblings ...)
  2018-04-23 22:33 ` [Qemu-devel] [PATCH v5 3/5] qcow2: Reduce REFT_OFFSET_MASK Eric Blake
@ 2018-04-23 22:33 ` Eric Blake
  2018-04-25 14:44   ` Max Reitz
  2018-04-23 22:33 ` [Qemu-devel] [PATCH v5 5/5] qcow2: Avoid memory over-allocation on compressed images Eric Blake
  4 siblings, 1 reply; 12+ messages in thread
From: Eric Blake @ 2018-04-23 22:33 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, qemu-block, berto, Max Reitz

Our code was already checking that we did not attempt to
allocate more clusters than what would fit in an INT64 (the
physical maximum if we can access a full off_t's worth of
data).  But this does not catch smaller limits enforced by
various spots in the qcow2 image description: L1 and normal
clusters of L2 are documented as having bits 63-56 reserved
for other purposes, capping our maximum offset at 64PB (bit
55 is the maximum bit set).  And for compressed images with
2M clusters, the cap drops the maximum offset to bit 48, or
a maximum offset of 512TB.  If we overflow that offset, we
would write compressed data into one place, but try to
decompress from another, which won't work.

I don't have 512TB handy to prove whether things break if we
compress so much data that we overflow that limit, and don't
think that iotests can (quickly) test it either.  Test 138
comes close (it corrupts an image into thinking something lives
at 32PB, which is half the maximum for L1 sizing - although
it relies on 512-byte clusters).  But that test points out
that we will generally hit other limits first (such as running
out of memory for the refcount table, or exceeding file system
limits like 16TB on ext4, etc), so this is more a theoretical
safety valve than something likely to be hit.
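The before/after behavior of the clamp can be sketched by mirroring the
arithmetic in Python (illustrative only; alloc_clusters_noref() itself
does more work):

```python
INT64_MAX = (1 << 63) - 1
QCOW_MAX_CLUSTER_OFFSET = (1 << 56) - 1   # from the hunk in block/qcow2.h

def overflows(free_cluster_index, cluster_bits, max_offset):
    """Mirror of the clamp in alloc_clusters_noref(): reject when the last
    cluster of the allocated range would start beyond max_offset."""
    return (free_cluster_index > 0 and
            free_cluster_index - 1 > (max_offset >> cluster_bits))

cluster_bits = 21                          # 2M clusters
index_100pb = (100 << 50) >> cluster_bits  # first free cluster at ~100 PB

# The old INT64_MAX clamp let this through even though no L1/L2 entry can
# encode a 100 PB host offset; the new clamp rejects it with -EFBIG.
assert not overflows(index_100pb, cluster_bits, INT64_MAX)
assert overflows(index_100pb, cluster_bits, QCOW_MAX_CLUSTER_OFFSET)
```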

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>

---
v3: use s->cluster_offset_mask instead of open-coding it [Berto],
use MIN() to clamp offset on small cluster size, add spec patch
first to justify clamping even on refcount allocations
---
 block/qcow2.h          |  6 ++++++
 block/qcow2-refcount.c | 21 ++++++++++++++-------
 2 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 1df15a18aa1..a3b9faa9d53 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -41,6 +41,12 @@
 #define QCOW_MAX_CRYPT_CLUSTERS 32
 #define QCOW_MAX_SNAPSHOTS 65536

+/* Field widths in qcow2 mean normal cluster offsets cannot reach
+ * 64PB; depending on cluster size, compressed clusters can have a
+ * smaller limit (64PB for up to 16k clusters, then ramps down to
+ * 512TB for 2M clusters).  */
+#define QCOW_MAX_CLUSTER_OFFSET ((1ULL << 56) - 1)
+
 /* 8 MB refcount table is enough for 2 PB images at 64k cluster size
  * (128 GB for 512 byte clusters, 2 EB for 2 MB clusters) */
 #define QCOW_MAX_REFTABLE_SIZE 0x800000
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 6bfc11bb48f..fcbea3c9644 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -31,7 +31,8 @@
 #include "qemu/bswap.h"
 #include "qemu/cutils.h"

-static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size);
+static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size,
+                                    uint64_t max);
 static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
                             int64_t offset, int64_t length, uint64_t addend,
                             bool decrease, enum qcow2_discard_type type);
@@ -362,7 +363,8 @@ static int alloc_refcount_block(BlockDriverState *bs,
     }

     /* Allocate the refcount block itself and mark it as used */
-    int64_t new_block = alloc_clusters_noref(bs, s->cluster_size);
+    int64_t new_block = alloc_clusters_noref(bs, s->cluster_size,
+                                             QCOW_MAX_CLUSTER_OFFSET);
     if (new_block < 0) {
         return new_block;
     }
@@ -954,7 +956,8 @@ int qcow2_update_cluster_refcount(BlockDriverState *bs,


 /* return < 0 if error */
-static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size)
+static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size,
+                                    uint64_t max)
 {
     BDRVQcow2State *s = bs->opaque;
     uint64_t i, nb_clusters, refcount;
@@ -979,9 +982,9 @@ retry:
     }

     /* Make sure that all offsets in the "allocated" range are representable
-     * in an int64_t */
+     * in the requested max */
     if (s->free_cluster_index > 0 &&
-        s->free_cluster_index - 1 > (INT64_MAX >> s->cluster_bits))
+        s->free_cluster_index - 1 > (max >> s->cluster_bits))
     {
         return -EFBIG;
     }
@@ -1001,7 +1004,7 @@ int64_t qcow2_alloc_clusters(BlockDriverState *bs, uint64_t size)

     BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC);
     do {
-        offset = alloc_clusters_noref(bs, size);
+        offset = alloc_clusters_noref(bs, size, QCOW_MAX_CLUSTER_OFFSET);
         if (offset < 0) {
             return offset;
         }
@@ -1083,7 +1086,11 @@ int64_t qcow2_alloc_bytes(BlockDriverState *bs, int size)
     free_in_cluster = s->cluster_size - offset_into_cluster(s, offset);
     do {
         if (!offset || free_in_cluster < size) {
-            int64_t new_cluster = alloc_clusters_noref(bs, s->cluster_size);
+            int64_t new_cluster;
+
+            new_cluster = alloc_clusters_noref(bs, s->cluster_size,
+                                               MIN(s->cluster_offset_mask,
+                                                   QCOW_MAX_CLUSTER_OFFSET));
             if (new_cluster < 0) {
                 return new_cluster;
             }
-- 
2.14.3

* [Qemu-devel] [PATCH v5 5/5] qcow2: Avoid memory over-allocation on compressed images
  2018-04-23 22:33 [Qemu-devel] [PATCH v5 0/5] minor qcow2 compression improvements Eric Blake
                   ` (3 preceding siblings ...)
  2018-04-23 22:33 ` [Qemu-devel] [PATCH v5 4/5] qcow2: Don't allow overflow during cluster allocation Eric Blake
@ 2018-04-23 22:33 ` Eric Blake
  2018-04-25 15:00   ` Max Reitz
  4 siblings, 1 reply; 12+ messages in thread
From: Eric Blake @ 2018-04-23 22:33 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, qemu-block, berto, Max Reitz

When reading a compressed image, we were allocating s->cluster_data
to 32*cluster_size + 512 (possibly over 64 megabytes, for an image
with 2M clusters).  Let's check out the history:

Back when qcow2 was first written, we used s->cluster_data for
everything, including copy_sectors() and encryption, where we want
to operate on more than one cluster at once.  Obviously, at that
point, the buffer had to be aligned for other users, even though
compression itself doesn't require any alignment (the fact that
the compressed data generally starts mid-sector means that aligning
our buffer buys us nothing - either the protocol already supports
byte-based access into whatever offset we want, or we are already
using a bounce buffer to read a full sector, and copying into
our destination no longer requires alignment).

But commit 1b9f1491 (v1.1!) changed things to allocate parallel
buffers on demand rather than sharing a single buffer, for encryption
and COW, leaving compression as the final client of s->cluster_data.
That use was still preserved, because if a single compressed cluster
is read more than once, we reuse the cache instead of decompressing
it a second time (someday, we may come up with better caching to
avoid wasting repeated decompressions while still being more parallel,
but that is a task for another patch; the XXX comment in
qcow2_co_preadv for QCOW2_CLUSTER_COMPRESSED is telling).

Much later, in commit de82815d (v2.2), we noticed that a 64M
allocation is prone to failure, so we switched over to a graceful
memory allocation error message.  Elsewhere in the code, we do
g_malloc(2 * cluster_size) without ever checking for failure, but
even 4M starts to be large enough that trying to be nice is worth
the effort, so we want to keep that aspect.

Then even later, in 3e4c7052 (2.11), we realized that allocating
a large buffer up front for every qcow2 image is expensive, and
switched to lazy allocation only for images that actually had
compressed clusters.  But in the process, we never even bothered
to check whether what we were allocating still made sense in its
new context!

So, it's time to cut back on the waste.  A compressed cluster
written by qemu will NEVER occupy more than an uncompressed
cluster, but based on mid-sector alignment, we may still need
to read 1 cluster + 1 sector in order to recover enough bytes
for the decompression.  But third-party producers of qcow2 may
not be as smart, and gzip DOES document that because the
compression stream adds metadata, and because of the pigeonhole
principle, there are worst case scenarios where attempts to
compress will actually inflate an image, by up to 0.015% (or 62
sectors larger for an unfortunate 2M compression).  In fact,
the qcow2 spec permits an all-ones sector count, plus a full
sector containing the initial offset, for a maximum read of
2 full clusters; and thanks to the way decompression works,
even though such a value is probably too large for the actual
compressed data, it really doesn't matter if we read too much
(gzip ignores slop, once it has decoded a full cluster).  So
it's easier to just allocate cluster_data to be as large as we
can ever possibly see; even if it still wastes up to 2M on any
image created by qcow2, that's still an improvement of 60M less
waste than pre-patch.
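The worst-case read size, and the savings, can be derived from the
descriptor layout; an illustrative Python sketch (field widths per the
qcow2 spec):

```python
def max_compressed_read(cluster_bits):
    """Worst-case bytes qcow2_decompress_cluster() may read for one
    compressed cluster: an all-ones additional-sector count, plus the
    sector containing the initial offset, with sector_offset == 0."""
    count_bits = cluster_bits - 8      # width of the sector-count field
    nb_csectors = 1 << count_bits      # ((1 << count_bits) - 1) + 1
    return nb_csectors * 512

for cb in (9, 12, 16, 21):
    assert max_compressed_read(cb) <= 2 * (1 << cb)  # the patch's new assertion

# Savings for a 2M-cluster image, matching the commit message:
old_alloc = 32 * (1 << 21) + 512   # QCOW_MAX_CRYPT_CLUSTERS * cluster_size + 512
new_alloc = 2 * (1 << 21)          # 2 * cluster_size
assert old_alloc - new_alloc == 60 * (1 << 20) + 512   # ~60M less waste
```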

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>

---
v4: fix off-by-one in assertion and commit message [Berto]
v3: tighten allocation [Berto]
v2: actually check allocation failure (previous version meant
to use g_malloc, but ended up posted with g_try_malloc without
checking); add assertions outside of conditional, improve
commit message to better match reality now that qcow2 spec bug
has been fixed
---
 block/qcow2-cluster.c | 27 ++++++++++++++++++---------
 block/qcow2.c         |  2 +-
 2 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 1dc00ff2110..6e3eb88a37a 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1599,20 +1599,29 @@ int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset)
         sector_offset = coffset & 511;
         csize = nb_csectors * 512 - sector_offset;

-        /* Allocate buffers on first decompress operation, most images are
-         * uncompressed and the memory overhead can be avoided.  The buffers
-         * are freed in .bdrv_close().
+        /* Allocate buffers on the first decompress operation; most
+         * images are uncompressed and the memory overhead can be
+         * avoided.  The buffers are freed in .bdrv_close().  qemu
+         * never writes an inflated cluster, and gzip itself never
+         * inflates a problematic cluster by more than 0.015%, but the
+         * qcow2 format allows up to 1 byte short of 2 full clusters
+         * when including the sector containing offset.  gzip ignores
+         * trailing slop, so just allocate that much up front rather
+         * than reject third-party images with overlarge csize.
          */
+        assert(!!s->cluster_data == !!s->cluster_cache);
+        assert(csize <= 2 * s->cluster_size);
         if (!s->cluster_data) {
-            /* one more sector for decompressed data alignment */
-            s->cluster_data = qemu_try_blockalign(bs->file->bs,
-                    QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size + 512);
+            s->cluster_data = g_try_malloc(2 * s->cluster_size);
             if (!s->cluster_data) {
                 return -ENOMEM;
             }
-        }
-        if (!s->cluster_cache) {
-            s->cluster_cache = g_malloc(s->cluster_size);
+            s->cluster_cache = g_try_malloc(s->cluster_size);
+            if (!s->cluster_cache) {
+                g_free(s->cluster_data);
+                s->cluster_data = NULL;
+                return -ENOMEM;
+            }
         }

         BLKDBG_EVENT(bs->file, BLKDBG_READ_COMPRESSED);
diff --git a/block/qcow2.c b/block/qcow2.c
index ef68772acad..a8301371ccc 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2151,7 +2151,7 @@ static void qcow2_close(BlockDriverState *bs)
     g_free(s->image_backing_format);

     g_free(s->cluster_cache);
-    qemu_vfree(s->cluster_data);
+    g_free(s->cluster_data);
     qcow2_refcount_close(bs);
     qcow2_free_snapshots(bs);
 }
-- 
2.14.3

* Re: [Qemu-devel] [PATCH v5 2/5] qcow2: Document some maximum size constraints
  2018-04-23 22:33 ` [Qemu-devel] [PATCH v5 2/5] qcow2: Document some maximum size constraints Eric Blake
@ 2018-04-24  9:13   ` Alberto Garcia
  0 siblings, 0 replies; 12+ messages in thread
From: Alberto Garcia @ 2018-04-24  9:13 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: kwolf, qemu-block, Max Reitz

On Tue 24 Apr 2018 12:33:34 AM CEST, Eric Blake wrote:
> Although off_t permits up to 63 bits (8EB) of file offsets, in
> practice, we're going to hit other limits first.  Document some
> of those limits in the qcow2 spec, and how choice of cluster size
> can influence some of the limits.
>
> While at it, notice that since we cannot map any virtual cluster
> to any address higher than 64 PB (56 bits) (due to the current L1/L2
> field encoding stopping at bit 55), it makes little sense to require
> the refcount table to access host offsets beyond that point.  Mark
> the upper bits of the refcount table entries as reserved to match
> the L1/L2 table, with no ill effects, since it is unlikely that there
> are any existing images larger than 64PB in the first place, and thus
> all existing images already have those bits as 0.  If 64PB proves to
> be too small in the future, we could enlarge all three uses of bit
> 55 into the reserved bits at that time.  (For reference, ext4 with
> 4k blocks caps files at 16TB.)
>
> However, there is one limit that reserved bits don't help with: for
> compressed clusters, the L2 layout requires an ever-smaller maximum
> host offset as cluster size gets larger, down to a 512 TB maximum
> with 2M clusters.
>
> Signed-off-by: Eric Blake <eblake@redhat.com>

Reviewed-by: Alberto Garcia <berto@igalia.com>

Berto

* Re: [Qemu-devel] [PATCH v5 4/5] qcow2: Don't allow overflow during cluster allocation
  2018-04-23 22:33 ` [Qemu-devel] [PATCH v5 4/5] qcow2: Don't allow overflow during cluster allocation Eric Blake
@ 2018-04-25 14:44   ` Max Reitz
  2018-04-25 18:26     ` Eric Blake
  2018-04-25 20:31     ` Eric Blake
  0 siblings, 2 replies; 12+ messages in thread
From: Max Reitz @ 2018-04-25 14:44 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: kwolf, qemu-block, berto



On 2018-04-24 00:33, Eric Blake wrote:
> Our code was already checking that we did not attempt to
> allocate more clusters than what would fit in an INT64 (the
> physical maximum if we can access a full off_t's worth of
> data).  But this does not catch smaller limits enforced by
> various spots in the qcow2 image description: L1 and normal
> clusters of L2 are documented as having bits 63-56 reserved
> for other purposes, capping our maximum offset at 64PB (bit
> 55 is the maximum bit set).  And for compressed images with
> 2M clusters, the cap drops the maximum offset to bit 48, or
> a maximum offset of 512TB.  If we overflow that offset, we
> would write compressed data into one place, but try to
> decompress from another, which won't work.
> 
> I don't have 512TB handy to prove whether things break if we
> compress so much data that we overflow that limit, and don't
> think that iotests can (quickly) test it either.  Test 138
> comes close (it corrupts an image into thinking something lives
> at 32PB, which is half the maximum for L1 sizing - although
> it relies on 512-byte clusters).  But that test points out
> that we will generally hit other limits first (such as running
> out of memory for the refcount table, or exceeding file system
> limits like 16TB on ext4, etc), so this is more a theoretical
> safety valve than something likely to be hit.

You don't need 512 TB, though, 36 MB is sufficient.

Here's what you do:
(1) Create a 513 TB image with cluster_size=2M,refcount_bits=1
(2) Take a hex editor and enter 16 refblocks into the reftable
(3) Fill all of those refblocks with 1s

(Funny side note: qemu-img check thinks that image is clean because it
doesn't check refcounts beyond the image end...)

I've attached a compressed test image (unsurprisingly, it compresses
really well).

Before this series:
$ ./qemu-io -c 'write -c 0 2M' test.qcow2
qcow2: Marking image as corrupt: Preventing invalid write on metadata
(overlaps with refcount block); further corruption events will be suppressed
write failed: Input/output error

Aw.

After this series:
$ ./qemu-io -c 'write -c 0 2M' test.qcow2
write failed: Input/output error

(Normal writes just work fine.)


Maybe you want to add a test still -- creating the image is rather quick
(well, you have to write 64 MB of 1s, but other than that).  The only
thing that takes a bit of time is qemu figuring out where the first free
cluster is...  That takes like 15 seconds here.

And another issue of course is...

$ ls -lhs test.qcow2
42M -rw-r--r--. 1 maxx maxx 513T 25. Apr 16:42 test.qcow2

Yeah, that.  Depends on the host file system, of course, whether that is
a real issue. O:-)
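The refblock arithmetic behind this reproducer, sketched in Python
(cluster size and refcount width assumed from the steps above):

```python
cluster_size = 2 << 20     # 2M clusters, as in the reproducer
refcount_bits = 1
TB = 1 << 40

# One refblock is one cluster; with 1-bit refcounts it holds
# cluster_size * 8 entries, each covering one 2M cluster:
entries_per_refblock = cluster_size * 8 // refcount_bits
coverage_per_refblock = entries_per_refblock * cluster_size
assert coverage_per_refblock == 32 * TB

# 16 all-ones refblocks mark the first 512 TB as allocated, so the next
# free cluster lands exactly at the 2M-cluster compressed-offset limit:
assert 16 * coverage_per_refblock == 512 * TB
```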

Max

> Signed-off-by: Eric Blake <eblake@redhat.com>
> Reviewed-by: Alberto Garcia <berto@igalia.com>
> 
> ---
> v3: use s->cluster_offset_mask instead of open-coding it [Berto],
> use MIN() to clamp offset on small cluster size, add spec patch
> first to justify clamping even on refcount allocations
> ---
>  block/qcow2.h          |  6 ++++++
>  block/qcow2-refcount.c | 21 ++++++++++++++-------
>  2 files changed, 20 insertions(+), 7 deletions(-)
> 
> diff --git a/block/qcow2.h b/block/qcow2.h
> index 1df15a18aa1..a3b9faa9d53 100644
> --- a/block/qcow2.h
> +++ b/block/qcow2.h
> @@ -41,6 +41,12 @@
>  #define QCOW_MAX_CRYPT_CLUSTERS 32
>  #define QCOW_MAX_SNAPSHOTS 65536
> 
> +/* Field widths in qcow2 mean normal cluster offsets cannot reach
> + * 64PB; depending on cluster size, compressed clusters can have a
> + * smaller limit (64PB for up to 16k clusters, then ramps down to
> + * 512TB for 2M clusters).  */
> +#define QCOW_MAX_CLUSTER_OFFSET ((1ULL << 56) - 1)
> +
>  /* 8 MB refcount table is enough for 2 PB images at 64k cluster size
>   * (128 GB for 512 byte clusters, 2 EB for 2 MB clusters) */
>  #define QCOW_MAX_REFTABLE_SIZE 0x800000
> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
> index 6bfc11bb48f..fcbea3c9644 100644
> --- a/block/qcow2-refcount.c
> +++ b/block/qcow2-refcount.c
> @@ -31,7 +31,8 @@
>  #include "qemu/bswap.h"
>  #include "qemu/cutils.h"
> 
> -static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size);
> +static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size,
> +                                    uint64_t max);
>  static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>                              int64_t offset, int64_t length, uint64_t addend,
>                              bool decrease, enum qcow2_discard_type type);
> @@ -362,7 +363,8 @@ static int alloc_refcount_block(BlockDriverState *bs,
>      }
> 
>      /* Allocate the refcount block itself and mark it as used */
> -    int64_t new_block = alloc_clusters_noref(bs, s->cluster_size);
> +    int64_t new_block = alloc_clusters_noref(bs, s->cluster_size,
> +                                             QCOW_MAX_CLUSTER_OFFSET);
>      if (new_block < 0) {
>          return new_block;
>      }
> @@ -954,7 +956,8 @@ int qcow2_update_cluster_refcount(BlockDriverState *bs,
> 
> 
>  /* return < 0 if error */
> -static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size)
> +static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size,
> +                                    uint64_t max)
>  {
>      BDRVQcow2State *s = bs->opaque;
>      uint64_t i, nb_clusters, refcount;
> @@ -979,9 +982,9 @@ retry:
>      }
> 
>      /* Make sure that all offsets in the "allocated" range are representable
> -     * in an int64_t */
> +     * in the requested max */
>      if (s->free_cluster_index > 0 &&
> -        s->free_cluster_index - 1 > (INT64_MAX >> s->cluster_bits))
> +        s->free_cluster_index - 1 > (max >> s->cluster_bits))
>      {
>          return -EFBIG;
>      }
> @@ -1001,7 +1004,7 @@ int64_t qcow2_alloc_clusters(BlockDriverState *bs, uint64_t size)
> 
>      BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC);
>      do {
> -        offset = alloc_clusters_noref(bs, size);
> +        offset = alloc_clusters_noref(bs, size, QCOW_MAX_CLUSTER_OFFSET);
>          if (offset < 0) {
>              return offset;
>          }
> @@ -1083,7 +1086,11 @@ int64_t qcow2_alloc_bytes(BlockDriverState *bs, int size)
>      free_in_cluster = s->cluster_size - offset_into_cluster(s, offset);
>      do {
>          if (!offset || free_in_cluster < size) {
> -            int64_t new_cluster = alloc_clusters_noref(bs, s->cluster_size);
> +            int64_t new_cluster;
> +
> +            new_cluster = alloc_clusters_noref(bs, s->cluster_size,
> +                                               MIN(s->cluster_offset_mask,
> +                                                   QCOW_MAX_CLUSTER_OFFSET));
>              if (new_cluster < 0) {
>                  return new_cluster;
>              }
> 


[-- Attachment #1.2: test.qcow2.xz --]
[-- Type: application/x-xz, Size: 6060 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH v5 5/5] qcow2: Avoid memory over-allocation on compressed images
  2018-04-23 22:33 ` [Qemu-devel] [PATCH v5 5/5] qcow2: Avoid memory over-allocation on compressed images Eric Blake
@ 2018-04-25 15:00   ` Max Reitz
  2018-04-25 18:37     ` Eric Blake
  0 siblings, 1 reply; 12+ messages in thread
From: Max Reitz @ 2018-04-25 15:00 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: kwolf, qemu-block, berto

[-- Attachment #1: Type: text/plain, Size: 7121 bytes --]

On 2018-04-24 00:33, Eric Blake wrote:
> When reading a compressed image, we were allocating s->cluster_data
> to 32*cluster_size + 512 (possibly over 64 megabytes, for an image
> with 2M clusters).  Let's check out the history:
> 
> Back when qcow2 was first written, we used s->cluster_data for
> everything, including copy_sectors() and encryption, where we want
> to operate on more than one cluster at once.  Obviously, at that
> point, the buffer had to be aligned for other users, even though
> compression itself doesn't require any alignment (the fact that
> the compressed data generally starts mid-sector means that aligning
> our buffer buys us nothing - either the protocol already supports
> byte-based access into whatever offset we want, or we are already
> using a bounce buffer to read a full sector, and copying into
> our destination no longer requires alignment).
> 
> But commit 1b9f1491 (v1.1!) changed things to allocate parallel
> buffers on demand rather than sharing a single buffer, for encryption
> and COW, leaving compression as the final client of s->cluster_data.
> That use was still preserved, because if a single compressed cluster
> is read more than once, we reuse the cache instead of decompressing
> it a second time (someday, we may come up with better caching to
> avoid wasting repeated decompressions while still being more parallel,
> but that is a task for another patch; the XXX comment in
> qcow2_co_preadv for QCOW2_CLUSTER_COMPRESSED is telling).
> 
> Much later, in commit de82815d (v2.2), we noticed that a 64M
> allocation is prone to failure, so we switched over to a graceful
> memory allocation error message.  Elsewhere in the code, we do
> g_malloc(2 * cluster_size) without ever checking for failure, but
> even 4M starts to be large enough that trying to be nice is worth
> the effort, so we want to keep that aspect.
> 
> Then even later, in 3e4c7052 (2.11), we realized that allocating
> a large buffer up front for every qcow2 image is expensive, and
> switched to lazy allocation only for images that actually had
> compressed clusters.  But in the process, we never even bothered
> to check whether what we were allocating still made sense in its
> new context!
> 
> So, it's time to cut back on the waste.  A compressed cluster
> written by qemu will NEVER occupy more than an uncompressed
> cluster, but based on mid-sector alignment, we may still need
> to read 1 cluster + 1 sector in order to recover enough bytes
> for the decompression.  But third-party producers of qcow2 may
> not be as smart, and gzip DOES document that because the
> compression stream adds metadata, and because of the pigeonhole
> principle, there are worst case scenarios where attempts to
> compress will actually inflate an image, by up to 0.015% (or 62
> sectors larger for an unfortunate 2M compression).

Hm?  2M * 0.00015 < 315 (bytes!), so where does that number come from?

>                                                     In fact,
> the qcow2 spec permits an all-ones sector count, plus a full
> sector containing the initial offset, for a maximum read of
> 2 full clusters; and thanks to the way decompression works,
> even though such a value is probably too large for the actual
> compressed data, it really doesn't matter if we read too much
> (gzip ignores slop, once it has decoded a full cluster).  So
> it's easier to just allocate cluster_data to be as large as we
> can ever possibly see; even if it still wastes up to 2M on any
> image created by qcow2, that's still an improvement of 60M less
> waste than pre-patch.

OK, so from the technical perspective it's irrelevant anyway, but I
suppose the number should still be fixed in the commit message.

Apart from that, the series looks good to me.

Max

> Signed-off-by: Eric Blake <eblake@redhat.com>
> Reviewed-by: Alberto Garcia <berto@igalia.com>
> 
> ---
> v4: fix off-by-one in assertion and commit message [Berto]
> v3: tighten allocation [Berto]
> v2: actually check allocation failure (previous version meant
> to use g_malloc, but ended up posted with g_try_malloc without
> checking); add assertions outside of conditional, improve
> commit message to better match reality now that qcow2 spec bug
> has been fixed
> ---
>  block/qcow2-cluster.c | 27 ++++++++++++++++++---------
>  block/qcow2.c         |  2 +-
>  2 files changed, 19 insertions(+), 10 deletions(-)
> 
> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index 1dc00ff2110..6e3eb88a37a 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -1599,20 +1599,29 @@ int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset)
>          sector_offset = coffset & 511;
>          csize = nb_csectors * 512 - sector_offset;
> 
> -        /* Allocate buffers on first decompress operation, most images are
> -         * uncompressed and the memory overhead can be avoided.  The buffers
> -         * are freed in .bdrv_close().
> +        /* Allocate buffers on the first decompress operation; most
> +         * images are uncompressed and the memory overhead can be
> +         * avoided.  The buffers are freed in .bdrv_close().  qemu
> +         * never writes an inflated cluster, and gzip itself never
> +         * inflates a problematic cluster by more than 0.015%, but the
> +         * qcow2 format allows up to 1 byte short of 2 full clusters
> +         * when including the sector containing offset.  gzip ignores
> +         * trailing slop, so just allocate that much up front rather
> +         * than reject third-party images with overlarge csize.
>           */
> +        assert(!!s->cluster_data == !!s->cluster_cache);
> +        assert(csize <= 2 * s->cluster_size);
>          if (!s->cluster_data) {
> -            /* one more sector for decompressed data alignment */
> -            s->cluster_data = qemu_try_blockalign(bs->file->bs,
> -                    QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size + 512);
> +            s->cluster_data = g_try_malloc(2 * s->cluster_size);
>              if (!s->cluster_data) {
>                  return -ENOMEM;
>              }
> -        }
> -        if (!s->cluster_cache) {
> -            s->cluster_cache = g_malloc(s->cluster_size);
> +            s->cluster_cache = g_try_malloc(s->cluster_size);
> +            if (!s->cluster_cache) {
> +                g_free(s->cluster_data);
> +                s->cluster_data = NULL;
> +                return -ENOMEM;
> +            }
>          }
> 
>          BLKDBG_EVENT(bs->file, BLKDBG_READ_COMPRESSED);
> diff --git a/block/qcow2.c b/block/qcow2.c
> index ef68772acad..a8301371ccc 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -2151,7 +2151,7 @@ static void qcow2_close(BlockDriverState *bs)
>      g_free(s->image_backing_format);
> 
>      g_free(s->cluster_cache);
> -    qemu_vfree(s->cluster_data);
> +    g_free(s->cluster_data);
>      qcow2_refcount_close(bs);
>      qcow2_free_snapshots(bs);
>  }
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]


* Re: [Qemu-devel] [PATCH v5 4/5] qcow2: Don't allow overflow during cluster allocation
  2018-04-25 14:44   ` Max Reitz
@ 2018-04-25 18:26     ` Eric Blake
  2018-04-25 20:31     ` Eric Blake
  1 sibling, 0 replies; 12+ messages in thread
From: Eric Blake @ 2018-04-25 18:26 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: kwolf, qemu-block, berto

[-- Attachment #1: Type: text/plain, Size: 3409 bytes --]

On 04/25/2018 09:44 AM, Max Reitz wrote:
> On 2018-04-24 00:33, Eric Blake wrote:
>> Our code was already checking that we did not attempt to
>> allocate more clusters than what would fit in an INT64 (the
>> physical maximimum if we can access a full off_t's worth of

s/maximimum/maximum/

>> data).  But this does not catch smaller limits enforced by
>> various spots in the qcow2 image description: L1 and normal
>> clusters of L2 are documented as having bits 63-56 reserved
>> for other purposes, capping our maximum offset at 64PB (bit
>> 55 is the maximum bit set).  And for compressed images with
>> 2M clusters, the cap drops the maximum offset to bit 48, or
>> a maximum offset of 512TB.  If we overflow that offset, we
>> would write compressed data into one place, but try to
>> decompress from another, which won't work.
>>
>> I don't have 512TB handy to prove whether things break if we
>> compress so much data that we overflow that limit, and don't
>> think that iotests can (quickly) test it either.  Test 138
>> comes close (it corrupts an image into thinking something lives
>> at 32PB, which is half the maximum for L1 sizing - although
>> it relies on 512-byte clusters).  But that test points out
>> that we will generally hit other limits first (such as running
>> out of memory for the refcount table, or exceeding file system
>> limits like 16TB on ext4, etc), so this is more a theoretical
>> safety valve than something likely to be hit.
> 
> You don't need 512 TB, though, 36 MB is sufficient.

Cool.  I'll have to attempt that as a followup patch.

> 
> Here's what you do:
> (1) Create a 513 TB image with cluster_size=2M,refcount_bits=1
> (2) Take a hex editor and enter 16 refblocks into the reftable
> (3) Fill all of those refblocks with 1s

That's a lot of leaked clusters ;)

> 
> (Funny side note: qemu-img check thinks that image is clean because it
> doesn't check refcounts beyond the image end...)

Eww - yet another bug to fix...

> 
> I've attached a compressed test image (unsurprisingly, it compresses
> really well).
> 
> Before this series:
> $ ./qemu-io -c 'write -c 0 2M' test.qcow2
> qcow2: Marking image as corrupt: Preventing invalid write on metadata
> (overlaps with refcount block); further corruption events will be suppressed
> write failed: Input/output error
> 
> Aw.
> 
> After this series:
> $ ./qemu-io -c 'write -c 0 2M' test.qcow2
> write failed: Input/output error
> 
> (Normal writes just work fine.)
> 
> 
> Maybe you want to add a test still -- creating the image is rather quick
> (well, you have to write 64 MB of 1s, but other than that).  The only
> thing that takes a bit of time is qemu figuring out where the first free
> cluster is...  That takes like 15 seconds here.

Then the test doesn't belong in '-g quick'.

> 
> And another issue of course is...
> 
> $ ls -lhs test.qcow2
> 42M -rw-r--r--. 1 maxx maxx 513T 25. Apr 16:42 test.qcow2
> 
> Yeah, that.  Depends on the host file system, of course, whether that is
> a real issue. O:-)

As long as iotests can gracefully skip if qemu-img fails to create the
image, the test should still run on all remaining filesystems that
support sparse files that large.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 619 bytes --]


* Re: [Qemu-devel] [PATCH v5 5/5] qcow2: Avoid memory over-allocation on compressed images
  2018-04-25 15:00   ` Max Reitz
@ 2018-04-25 18:37     ` Eric Blake
  0 siblings, 0 replies; 12+ messages in thread
From: Eric Blake @ 2018-04-25 18:37 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: kwolf, qemu-block, berto

[-- Attachment #1: Type: text/plain, Size: 2235 bytes --]

On 04/25/2018 10:00 AM, Max Reitz wrote:
> On 2018-04-24 00:33, Eric Blake wrote:
>> When reading a compressed image, we were allocating s->cluster_data
>> to 32*cluster_size + 512 (possibly over 64 megabytes, for an image
>> with 2M clusters).  Let's check out the history:
>>

>>
>> So, it's time to cut back on the waste.  A compressed cluster
>> written by qemu will NEVER occupy more than an uncompressed
>> cluster, but based on mid-sector alignment, we may still need
>> to read 1 cluster + 1 sector in order to recover enough bytes
>> for the decompression.  But third-party producers of qcow2 may
>> not be as smart, and gzip DOES document that because the
>> compression stream adds metadata, and because of the pigeonhole
>> principle, there are worst case scenarios where attempts to
>> compress will actually inflate an image, by up to 0.015% (or 62
>> sectors larger for an unfortunate 2M compression).
> 
> Hm?  2M * 0.00015 < 315 (bytes!), so where does that number come from?

/me puts on the cone of shame...

Umm, it comes from me forgetting that percents need to be scaled before
multiplying.  31.5k bytes is 62 sectors, but you're right that 315 bytes
is only 1 sector.

> 
>>                                                     In fact,
>> the qcow2 spec permits an all-ones sector count, plus a full
>> sector containing the initial offset, for a maximum read of
>> 2 full clusters; and thanks to the way decompression works,
>> even though such a value is probably too large for the actual
>> compressed data, it really doesn't matter if we read too much
>> (gzip ignores slop, once it has decoded a full cluster).  So
>> it's easier to just allocate cluster_data to be as large as we
>> can ever possibly see; even if it still wastes up to 2M on any
>> image created by qcow2, that's still an improvement of 60M less
>> waste than pre-patch.
> 
> OK, so from the technical perspective it's irrelevant anyway, but I
> suppose the number should still be fixed in the commit message.

Hey, at least it's a math error instead of a typo ;)

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 619 bytes --]


* Re: [Qemu-devel] [PATCH v5 4/5] qcow2: Don't allow overflow during cluster allocation
  2018-04-25 14:44   ` Max Reitz
  2018-04-25 18:26     ` Eric Blake
@ 2018-04-25 20:31     ` Eric Blake
  1 sibling, 0 replies; 12+ messages in thread
From: Eric Blake @ 2018-04-25 20:31 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: kwolf, qemu-block, berto

[-- Attachment #1: Type: text/plain, Size: 2327 bytes --]

On 04/25/2018 09:44 AM, Max Reitz wrote:

> Here's what you do:
> (1) Create a 513 TB image with cluster_size=2M,refcount_bits=1
> (2) Take a hex editor and enter 16 refblocks into the reftable
> (3) Fill all of those refblocks with 1s
> 
> (Funny side note: qemu-img check thinks that image is clean because it
> doesn't check refcounts beyond the image end...)
> 
> I've attached a compressed test image (unsurprisingly, it compresses
> really well).
> 
> Before this series:
> $ ./qemu-io -c 'write -c 0 2M' test.qcow2
> qcow2: Marking image as corrupt: Preventing invalid write on metadata
> (overlaps with refcount block); further corruption events will be suppressed
> write failed: Input/output error
> 
> Aw.
> 
> After this series:
> $ ./qemu-io -c 'write -c 0 2M' test.qcow2
> write failed: Input/output error

Hmm, I wonder if ENOSPC is a better failure than EIO if the reason we
can't write a compressed cluster is because the image has grown too large.

> 
> (Normal writes just work fine.)

I also wonder if there should be a knob to control whether attempts to
write a compressed cluster should fail vs. silently fall back to a
normal write when encountering this too-large situation (that is, a
trade-off between keeping guest writes working no matter what, even if
we lose out on compression savings, vs. learning as soon as possible
when compression is no longer possible).  Then again, other than
carefully-crafted super-sparse files, it's much harder to argue that
you'll hit this in real usage scenarios, so it may not be worth the effort.

> 
> 
> Maybe you want to add a test still -- creating the image is rather quick
> (well, you have to write 64 MB of 1s, but other than that).  The only
> thing that takes a bit of time is qemu figuring out where the first free
> cluster is...  That takes like 15 seconds here.
> 
> And another issue of course is...
> 
> $ ls -lhs test.qcow2
> 42M -rw-r--r--. 1 maxx maxx 513T 25. Apr 16:42 test.qcow2
> 
> Yeah, that.  Depends on the host file system, of course, whether that is
> a real issue. O:-)

Well, I'm writing the iotest now, and thus will post a v6 of this series.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 619 bytes --]


end of thread, other threads:[~2018-04-25 20:33 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-23 22:33 [Qemu-devel] [PATCH v5 0/5] minor qcow2 compression improvements Eric Blake
2018-04-23 22:33 ` [Qemu-devel] [PATCH v5 1/5] qcow2: Prefer byte-based calls into bs->file Eric Blake
2018-04-23 22:33 ` [Qemu-devel] [PATCH v5 2/5] qcow2: Document some maximum size constraints Eric Blake
2018-04-24  9:13   ` Alberto Garcia
2018-04-23 22:33 ` [Qemu-devel] [PATCH v5 3/5] qcow2: Reduce REFT_OFFSET_MASK Eric Blake
2018-04-23 22:33 ` [Qemu-devel] [PATCH v5 4/5] qcow2: Don't allow overflow during cluster allocation Eric Blake
2018-04-25 14:44   ` Max Reitz
2018-04-25 18:26     ` Eric Blake
2018-04-25 20:31     ` Eric Blake
2018-04-23 22:33 ` [Qemu-devel] [PATCH v5 5/5] qcow2: Avoid memory over-allocation on compressed images Eric Blake
2018-04-25 15:00   ` Max Reitz
2018-04-25 18:37     ` Eric Blake
