* [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
@ 2017-04-06 15:01 Alberto Garcia
  2017-04-06 16:40 ` Eric Blake
                   ` (4 more replies)
  0 siblings, 5 replies; 64+ messages in thread
From: Alberto Garcia @ 2017-04-06 15:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, Stefan Hajnoczi, Max Reitz, Kevin Wolf

Hi all,

Over the past couple of months I have discussed with some of you the
possibility of extending the qcow2 format in order to improve its
performance and reduce its memory requirements (particularly with very
large images).

After some discussion on the mailing list and the #qemu IRC channel I
decided to write a prototype of a new extension for qcow2 so I could
better understand the scope of the changes and gather some preliminary
data about its effects.

This e-mail is the formal presentation of my proposal to extend the
on-disk qcow2 format. As you can see this is still an RFC. Due to the
nature of the changes I would like to get as much feedback as possible
before going forward.

=== Problem ===

The original problem that I wanted to address is the memory
requirements of qcow2 files if you want to have very large images and
still keep good I/O performance. This is a consequence of its very
structure, which I'm going to describe now.

A qcow2 image is divided into units of constant size called clusters,
and among other things it contains metadata that maps guest addresses
to host addresses (the so-called L1 and L2 tables).

There are two basic problems that result from this:

1) Reading from or writing to a qcow2 image involves reading the
   corresponding entry in the L2 table that maps the guest address to
   the host address. This is very slow because it requires two I/O
   operations: one on the L2 table and another on the actual data
   cluster.

2) A cluster is the smallest unit of allocation. Therefore writing a
   mere 512 bytes to an empty disk requires allocating a complete
   cluster and filling it with zeroes (or with data from the backing
   image if there is one). This wastes more disk space and also has a
   negative impact on I/O.

Problem (1) can be solved by keeping in memory a cache of the L2
tables (QEMU has an "l2_cache_size" parameter for this purpose). The
amount of disk space used by L2 tables depends on two factors: the
disk size and the cluster size.

The cluster size can be configured when the image is created, and it
can be any power of two between 512 bytes and 2 MB (it defaults to 64
KB).

The maximum amount of space needed for the L2 tables can be calculated
with the following formula:

   max_l2_size = virtual_disk_size * 8 / cluster_size

Large images require a large amount of metadata, and therefore a large
amount of memory for the L2 cache. With the default cluster size
(64KB) that's 128MB of L2 cache for a 1TB qcow2 image.
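
As a quick sanity check of that formula, here is a minimal C snippet
(the function name is made up just for this illustration):

    #include <stdio.h>
    #include <inttypes.h>

    /* Worst-case size of the L2 tables (and hence of a full L2 cache). */
    static uint64_t max_l2_size(uint64_t virtual_disk_size,
                                uint64_t cluster_size)
    {
        /* Each cluster of guest data needs one 8-byte L2 entry. */
        return virtual_disk_size * 8 / cluster_size;
    }

    int main(void)
    {
        /* 1 TB image with the default 64 KB clusters -> prints "128 MB". */
        printf("%" PRIu64 " MB\n", max_l2_size(1ULL << 40, 64 * 1024) >> 20);
        return 0;
    }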

The only way to reduce the size of the L2 tables is therefore
increasing the cluster size, but this makes the image less efficient
as described earlier in (2).

=== The proposal ===

The idea of this proposal is to extend the qcow2 format by allowing
subcluster allocation. There would be an optional feature that would
need to be enabled when creating the image. The on-disk format would
remain essentially the same, except that each data cluster would be
internally divided into a number of subclusters of equal size.

What this means in practice is that each entry on an L2 table would be
accompanied by a bitmap indicating the allocation state of each one of
the subclusters for that cluster. There are several alternatives for
storing the bitmap, described below.

Other than L2 entries, all other data structures would remain
unchanged, but for data clusters the smallest unit of allocation would
now be the subcluster. Reference counting would still be at the
cluster level, because there's no way to reference individual
subclusters. Copy-on-write on internal snapshots would need to copy
complete clusters, so that scenario would not benefit from this
change.
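
To illustrate the effect on the write path, here is a much simplified
sketch of how much copy-on-write a small write triggers with and
without subclusters (the function and its parameters are invented for
this illustration, they are not taken from the prototype):

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Bytes that need to be filled from the backing image (or zeroed)
     * when a write lands in an unallocated cluster: without subclusters
     * the whole cluster minus the written area has to be covered, with
     * subclusters only the touched subclusters do.
     */
    static uint64_t bytes_to_cow(uint64_t write_off, uint64_t write_bytes,
                                 uint64_t cluster_size,
                                 uint64_t subcluster_size,
                                 bool subclusters_enabled)
    {
        uint64_t unit = subclusters_enabled ? subcluster_size : cluster_size;
        uint64_t start = write_off & ~(unit - 1);
        uint64_t end = (write_off + write_bytes + unit - 1) & ~(unit - 1);

        return (end - start) - write_bytes;   /* COW head + COW tail */
    }

For a 4KB write into a 64KB cluster with 8KB subclusters this returns
4KB instead of 60KB, which is where the allocation speedup comes from.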

I see two main use cases for this feature:

a) The qcow2 image is not too large / the L2 cache is not a problem,
   but you want to increase the allocation performance. In this case
   you can have something like a 128KB cluster with 4KB subclusters
   (with 4KB being a common block size in ext4 and other filesystems)

b) The qcow2 image is very large and you want to save metadata space
   in order to have a smaller L2 cache. In this case you can go for
   the maximum cluster size (2MB) but you want to have smaller
   subclusters to increase the allocation performance and optimize the
   disk usage. This was actually my original use case.

=== Test results ===

I have a basic working prototype of this. It's still incomplete -and
buggy :)- but it gives an idea of what we can expect from it. In my
implementation each data cluster has 8 subclusters, but that's not set
in stone (see below).

I made all tests on an SSD drive, writing to an empty qcow2 image with
a fully populated 40GB backing image, performing random writes using
fio with a block size of 4KB.

I tried with the default and maximum cluster sizes (64KB and 2MB) and
also with some other sizes. I also made sure to try with 32KB clusters
so the subcluster size matches the 4KB block size used for the I/O.

It's important to point out that once a cluster has been completely
allocated then having subclusters offers no performance benefit. For
this reason the size of the image for these tests (40GB) was chosen to
be large enough to guarantee that there are always new clusters being
allocated. This is therefore a worst-case scenario (or best-case for
this feature, if you want).

Here are the results (subcluster size in brackets):

|-----------------+----------------+-----------------+-------------------|
|  cluster size   | subclusters=on | subclusters=off | Max L2 cache size |
|-----------------+----------------+-----------------+-------------------|
|   2 MB (256 KB) |   440 IOPS     |  100 IOPS       | 160 KB (*)        |
| 512 KB  (64 KB) |  1000 IOPS     |  300 IOPS       | 640 KB            |
|  64 KB   (8 KB) |  3000 IOPS     | 1000 IOPS       |   5 MB            |
|  32 KB   (4 KB) | 12000 IOPS     | 1300 IOPS       |  10 MB            |
|   4 KB  (512 B) |   100 IOPS     |  100 IOPS       |  80 MB            |
|-----------------+----------------+-----------------+-------------------|

                (*) The L2 cache must be a multiple of the cluster
                    size, so in this case it must be 2MB. On the table
                    I chose to show how much of those 2MB are actually
                    used so you can compare it with the other cases.

Some comments about the results:

- For the 64KB, 512KB and 2MB cases, having subclusters increases
  write performance roughly by a factor of three. This happens because
  for each cluster allocation there's less data to copy from the
  backing image. For the same reason, the smaller the cluster, the
  better the performance. As expected, 64KB clusters with no
  subclusters perform roughly the same as 512KB clusters with 64KB
  subclusters.

- The 32KB case is the most interesting one. Without subclusters it's
  not very different from the 64KB case, but having a subcluster of
  the same size as the I/O block eliminates the need for COW entirely
  and the performance skyrockets (10 times faster!).

- 4KB is however very slow. I attribute this to the fact that the
  cluster size is so small that a new cluster needs to be allocated
  for every single write and its refcount updated accordingly. The L2
  and refcount tables are also so small that they are too inefficient
  and need to grow all the time.

Here are the results when writing to an empty 40GB qcow2 image with no
backing file. The numbers are of course different but as you can see
the patterns are similar:

|-----------------+----------------+-----------------+-------------------|
|  cluster size   | subclusters=on | subclusters=off | Max L2 cache size |
|-----------------+----------------+-----------------+-------------------|
|   2 MB (256 KB) |  1200 IOPS     |  255 IOPS       | 160 KB            |
| 512 KB  (64 KB) |  3000 IOPS     |  700 IOPS       | 640 KB            |
|  64 KB   (8 KB) |  7200 IOPS     | 3300 IOPS       |   5 MB            |
|  32 KB   (4 KB) | 12300 IOPS     | 4200 IOPS       |  10 MB            |
|   4 KB  (512 B) |   100 IOPS     |  100 IOPS       |  80 MB            |
|-----------------+----------------+-----------------+-------------------|

=== Changes to the on-disk format ===

The qcow2 on-disk format needs to change so each L2 entry has a bitmap
indicating the allocation status of each subcluster. There are three
possible states (unallocated, allocated, all zeroes), so we need two
bits per subcluster.

An L2 entry is 64 bits wide, and this is the current format (for
uncompressed clusters):

63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
**<----> <--------------------------------------------------><------->*
  Rsrved              host cluster offset of data             Reserved
  (6 bits)                (47 bits)                           (8 bits)

    bit 63: refcount == 1   (QCOW_OFLAG_COPIED)
    bit 62: compressed = 1  (QCOW_OFLAG_COMPRESSED)
    bit 0: all zeros        (QCOW_OFLAG_ZERO)
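
For reference, unpacking such an entry looks roughly like this in C
(the flag names are the ones listed above; the offset mask and the
helpers are only written out here for illustration):

    #include <stdbool.h>
    #include <stdint.h>

    #define QCOW_OFLAG_COPIED     (1ULL << 63)
    #define QCOW_OFLAG_COMPRESSED (1ULL << 62)
    #define QCOW_OFLAG_ZERO       (1ULL << 0)

    /* Bits 9-55: host cluster offset (uncompressed clusters only). */
    #define L2E_HOST_OFFSET_MASK  0x00fffffffffffe00ULL

    static inline uint64_t l2_entry_host_offset(uint64_t entry)
    {
        return entry & L2E_HOST_OFFSET_MASK;
    }

    static inline bool l2_entry_is_zero(uint64_t entry)
    {
        return (entry & QCOW_OFLAG_ZERO) != 0;
    }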

I thought of three alternatives for storing the subcluster bitmaps. I
haven't made my mind completely about which one is the best one, so
I'd like to present all three for discussion. Here they are:

(1) Storing the bitmap inside the 64-bit entry

    This is a simple alternative and is the one that I chose for my
    prototype. There are 14 unused bits plus the "all zeroes" one. If
    we steal one from the host offset we have the 16 bits that we need
    for the bitmap and we have 46 bits left for the host offset, which
    is more than enough.
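
    One possible packing, purely as an illustration (the exact bit
    positions are not fixed by this proposal): the 16 status bits go
    into the reserved areas plus bit 9 (i.e. bits 0-9 and 56-61),
    leaving a 46-bit host offset in bits 10-55.

        #include <stdint.h>

        enum subcluster_state {
            SC_UNALLOCATED = 0,
            SC_ALLOCATED   = 1,
            SC_ALL_ZEROES  = 2,
            /* the fourth value is left unused */
        };

        /* 2-bit state of subcluster 'sc' (0..7) in a 64-bit L2 entry. */
        static inline unsigned l2_subcluster_state(uint64_t entry, unsigned sc)
        {
            /* subclusters 0-4 in the low bits, 5-7 in the high bits */
            unsigned shift = (sc < 5) ? sc * 2 : 56 + (sc - 5) * 2;

            return (entry >> shift) & 0x3;
        }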

    * Pros:
      + Simple. Few changes compared to the current qcow2 format.

    * Cons:
      - Only 8 subclusters per cluster. We would not be making the
        most of this feature.

      - No reserved bits left for the future.

(2) Making L2 entries 128-bit wide.

    In this alternative we would double the size of L2 entries. The
    first half would remain unchanged and the second one would store
    the bitmap. That would leave us with 32 subclusters per cluster.
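
    As a sketch of what a 128-bit entry could look like (the field names
    are invented for this illustration; on disk both words would be
    big-endian, like the current L2 entries):

        #include <stdint.h>

        struct l2_entry_128 {
            uint64_t cluster_descriptor; /* flags + host offset, as today  */
            uint64_t subcluster_bitmap;  /* 32 subclusters x 2 status bits */
        };

        /* 2-bit state of subcluster 'sc' (0..31). */
        static inline unsigned
        l2_128_subcluster_state(const struct l2_entry_128 *e, unsigned sc)
        {
            return (e->subcluster_bitmap >> (sc * 2)) & 0x3;
        }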

    * Pros:
      + More subclusters per cluster. We could have images with
        e.g. 128k clusters with 4k subclusters.

    * Cons:
      - More space needed for L2 entries. The same cluster size would
        require a cache twice as large, although having subcluster
        allocation would compensate for this.

      - More changes to the code to handle 128-bit entries.

      - We would still be wasting the 14 reserved bits that L2 entries
        have.

(3) Storing the bitmap somewhere else

    This would involve storing the bitmap separate from the L2 tables
    (perhaps using the bitmaps extension? I haven't looked much into
    this).

    * Pros:
      + Possibility to make the number of subclusters configurable
        by the user (32, 64, 128, ...)
      + All existing metadata structures would remain untouched
        (although the "all zeroes" bit in L2 entries would probably
        become unused).

    * Cons:
      - As with alternative (2), more space needed for metadata.

      - The bitmap would also need to be cached for performance
        reasons.

      - Possibly one more *_cache_size option.

      - One more metadata structure to be updated for each
        allocation. This would probably impact I/O negatively.

=== Compressed clusters ===

My idea is that compressed clusters would remain the same. They are
read-only anyway so they would not be affected by any of these
changes.

===========================

I think I managed to summarize all the ideas that I had in my mind,
but I'm sure you probably have questions and comments, so I'd be happy
to get as much feedback as possible.

So, looking forward to reading your opinions.

Regards,

Berto


* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-06 15:01 [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation Alberto Garcia
@ 2017-04-06 16:40 ` Eric Blake
  2017-04-07  8:49   ` Alberto Garcia
                     ` (2 more replies)
  2017-04-07 12:20 ` [Qemu-devel] " Stefan Hajnoczi
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 64+ messages in thread
From: Eric Blake @ 2017-04-06 16:40 UTC (permalink / raw)
  To: Alberto Garcia, qemu-devel
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz

On 04/06/2017 10:01 AM, Alberto Garcia wrote:
> Hi all,
> 
> Over the past couple of months I have discussed with some of you the
> possibility of extending the qcow2 format in order to improve its
> performance and reduce its memory requirements (particularly with very
> large images).
> 
> After some discussion on the mailing list and the #qemu IRC channel I
> decided to write a prototype of a new extension for qcow2 so I could
> better understand the scope of the changes and gather some preliminary
> data about its effects.
> 
> This e-mail is the formal presentation of my proposal to extend the
> on-disk qcow2 format. As you can see this is still an RFC. Due to the
> nature of the changes I would like to get as much feedback as possible
> before going forward.

The idea in general makes sense; I can even remember chatting with Kevin
about similar ideas as far back as 2015, where the biggest drawback is
that it is an incompatible image change, and therefore images created
with the flag cannot be read by older tools.

> === Test results ===
> 
> I have a basic working prototype of this. It's still incomplete -and
> buggy :)- but it gives an idea of what we can expect from it. In my
> implementation each data cluster has 8 subclusters, but that's not set
> in stone (see below).
> 
> I made all tests on an SSD drive, writing to an empty qcow2 image with
> a fully populated 40GB backing image, performing random writes using
> fio with a block size of 4KB.
> 
> I tried with the default and maximum cluster sizes (64KB and 2MB) and
> also with some other sizes. I also made sure to try with 32KB clusters
> so the subcluster size matches the 4KB block size used for the I/O.
> 
> It's important to point out that once a cluster has been completely
> allocated then having subclusters offers no performance benefit. For
> this reason the size of the image for these tests (40GB) was chosen to
> be large enough to guarantee that there are always new clusters being
> allocated. This is therefore a worst-case scenario (or best-case for
> this feature, if you want).
> 
> Here are the results (subcluster size in brackets):
> 
> |-----------------+----------------+-----------------+-------------------|
> |  cluster size   | subclusters=on | subclusters=off | Max L2 cache size |
> |-----------------+----------------+-----------------+-------------------|
> |   2 MB (256 KB) |   440 IOPS     |  100 IOPS       | 160 KB (*)        |
> | 512 KB  (64 KB) |  1000 IOPS     |  300 IOPS       | 640 KB            |
> |  64 KB   (8 KB) |  3000 IOPS     | 1000 IOPS       |   5 MB            |
> |  32 KB   (4 KB) | 12000 IOPS     | 1300 IOPS       |  10 MB            |
> |   4 KB  (512 B) |   100 IOPS     |  100 IOPS       |  80 MB            |
> |-----------------+----------------+-----------------+-------------------|

Those are some cool results!


> === Changes to the on-disk format ===
> 
> The qcow2 on-disk format needs to change so each L2 entry has a bitmap
> indicating the allocation status of each subcluster. There are three
> possible states (unallocated, allocated, all zeroes), so we need two
> bits per subcluster.

You also have to add a new incompatible feature bit, so that older tools
know they can't read the new image correctly, and therefore don't
accidentally corrupt it.

> 
> An L2 entry is 64 bits wide, and this is the current format (for
> uncompressed clusters):
> 
> 63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
> 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> **<----> <--------------------------------------------------><------->*
>   Rsrved              host cluster offset of data             Reserved
>   (6 bits)                (47 bits)                           (8 bits)
> 
>     bit 63: refcount == 1   (QCOW_OFLAG_COPIED)
>     bit 62: compressed = 1  (QCOW_OFLAG_COMPRESSED)
>     bit 0: all zeros        (QCOW_OFLAG_ZERO)
> 
> I thought of three alternatives for storing the subcluster bitmaps. I
> haven't made my mind completely about which one is the best one, so
> I'd like to present all three for discussion. Here they are:
> 
> (1) Storing the bitmap inside the 64-bit entry
> 
>     This is a simple alternative and is the one that I chose for my
>     prototype. There are 14 unused bits plus the "all zeroes" one. If
>     we steal one from the host offset we have the 16 bits that we need
>     for the bitmap and we have 46 bits left for the host offset, which
>     is more than enough.

Note that because you are using exactly 8 subclusters, you can require
that the minimum cluster size when subclusters are enabled be 4k (since
we already have a lower-limit of 512-byte sector operation, and don't
want subclusters to be smaller than that); at which case you are
guaranteed that the host cluster offset will be 4k aligned.  So in
reality, once you turn on subclusters, you have:

63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
**<----> <-----------------------------------------------><---------->*
  Rsrved              host cluster offset of data             Reserved
  (6 bits)                (44 bits)                           (11 bits)

where you have 17 bits plus the "all zeroes" bit to play with, thanks to
the three bits of host cluster offset that are now guaranteed to be zero
due to cluster size alignment (but you're also right that the "all
zeroes" bit is now redundant information with the 8 subcluster-is-zero
bits, so repurposing it does not hurt)

> 
>     * Pros:
>       + Simple. Few changes compared to the current qcow2 format.
> 
>     * Cons:
>       - Only 8 subclusters per cluster. We would not be making the
>         most of this feature.
> 
>       - No reserved bits left for the future.

I just argued you have at least one, and probably 2, bits left over for
future in-word expansion.

> 
> (2) Making L2 entries 128-bit wide.
> 
>     In this alternative we would double the size of L2 entries. The
>     first half would remain unchanged and the second one would store
>     the bitmap. That would leave us with 32 subclusters per cluster.

Although for smaller cluster sizes (such as 4k clusters), you'd still
want to restrict that subclusters are at least 512-byte sectors, so
you'd be using fewer than 32 of those subcluster positions until the
cluster size is large enough.

> 
>     * Pros:
>       + More subclusters per cluster. We could have images with
>         e.g. 128k clusters with 4k subclusters.

Could allow variable-sized subclusters (your choice of 32 subclusters of
4k each, or 16 subclusters of 8k each)

> 
>     * Cons:
>       - More space needed for L2 entries. The same cluster size would
>         require a cache twice as large, although having subcluster
>         allocation would compensate for this.
> 
>       - More changes to the code to handle 128-bit entries.

Dealing with variable-sized subclusters, or with unused subcluster
entries when the cluster size is too small (such as a 4k cluster should
not be allowed any subclusters smaller than 512 bytes, but that's at
most 8 out of the 32 slots available), can get tricky.

> 
>       - We would still be wasting the 14 reserved bits that L2 entries
>         have.
> 
> (3) Storing the bitmap somewhere else
> 
>     This would involve storing the bitmap separate from the L2 tables
>     (perhaps using the bitmaps extension? I haven't looked much into
>     this).
> 
>     * Pros:
>       + Possibility to make the number of subclusters configurable
>         by the user (32, 64, 128, ...)
>       + All existing metadata structures would remain untouched
>         (although the "all zeroes" bit in L2 entries would probably
>         become unused).

It might still remain useful for optimization purposes, although then we
get into image consistency questions (if the all zeroes bit is set but
subcluster map claims allocation, or if the all zeroes bit is clear but
all subclusters claim zero, which one wins).

> 
>     * Cons:
>       - As with alternative (2), more space needed for metadata.
> 
>       - The bitmap would also need to be cached for performance
>         reasons.
> 
>       - Possibly one more *_cache_size option.
> 
>       - One more metadata structure to be updated for each
>         allocation. This would probably impact I/O negatively.

Having the subcluster table directly in the L2 means that updating the
L2 table is done with a single write. You are definitely right that
having the subcluster table as a bitmap in a separate cluster means two
writes instead of one, but as always, it's hard to predict how much of
an impact that is without benchmarks.

> 
> === Compressed clusters ===
> 
> My idea is that compressed clusters would remain the same. They are
> read-only anyway so they would not be affected by any of these
> changes.
> 
> ===========================
> 
> I think I managed to summarize all the ideas that I had in my mind,
> but I'm sure you probably have questions and comments, so I'd be happy
> to get as much feedback as possible.
> 
> So, looking forward to reading your opinions.
> 

The fact that you already have numbers proving the speedups that are
possible when first allocating the image makes this sound like a useful
project, even though it is an incompatible image change that old tools
won't be able to recognize. You'll want to make sure 'qemu-img amend'
can rewrite an image with subclusters into an older image.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-06 16:40 ` Eric Blake
@ 2017-04-07  8:49   ` Alberto Garcia
  2017-04-07 12:41   ` Kevin Wolf
  2017-04-21 21:09   ` [Qemu-devel] proposed qcow2 extension: cluster reservations [was: " Eric Blake
  2 siblings, 0 replies; 64+ messages in thread
From: Alberto Garcia @ 2017-04-07  8:49 UTC (permalink / raw)
  To: Eric Blake, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz

On Thu 06 Apr 2017 06:40:41 PM CEST, Eric Blake wrote:

>> This e-mail is the formal presentation of my proposal to extend the
>> on-disk qcow2 format. As you can see this is still an RFC. Due to the
>> nature of the changes I would like to get as much feedback as
>> possible before going forward.
>
> The idea in general makes sense; I can even remember chatting with
> Kevin about similar ideas as far back as 2015, where the biggest
> drawback is that it is an incompatible image change, and therefore
> images created with the flag cannot be read by older tools.

That's correct.

>> === Changes to the on-disk format ===
>> 
>> The qcow2 on-disk format needs to change so each L2 entry has a
>> bitmap indicating the allocation status of each subcluster. There are
>> three possible states (unallocated, allocated, all zeroes), so we
>> need two bits per subcluster.
>
> You also have to add a new incompatible feature bit, so that older
> tools know they can't read the new image correctly, and therefore
> don't accidentally corrupt it.

Yes, of course. The name that I'm considering is something like
QCOW2_INCOMPAT_SUBCLUSTER, and for the creation options either
'subclusters=on/off' or 'subcluster_size=XXX' (depending on whether the
size is configurable or not).

>> (1) Storing the bitmap inside the 64-bit entry
>> 
>>     This is a simple alternative and is the one that I chose for my
>>     prototype. There are 14 unused bits plus the "all zeroes" one. If
>>     we steal one from the host offset we have the 16 bits that we need
>>     for the bitmap and we have 46 bits left for the host offset, which
>>     is more than enough.
>
> Note that because you are using exactly 8 subclusters, you can require
> that the minimum cluster size when subclusters are enabled be 4k (since
> we already have a lower-limit of 512-byte sector operation, and don't
> want subclusters to be smaller than that); at which case you are
> guaranteed that the host cluster offset will be 4k aligned.  So in
> reality, once you turn on subclusters, you have:
>
> 63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
> 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> **<----> <-----------------------------------------------><---------->*
>   Rsrved              host cluster offset of data             Reserved
>   (6 bits)                (44 bits)                           (11 bits)
>
> where you have 17 bits plus the "all zeroes" bit to play with, thanks to
> the three bits of host cluster offset that are now guaranteed to be zero
> due to cluster size alignment (but you're also right that the "all
> zeroes" bit is now redundant information with the 8 subcluster-is-zero
> bits, so repurposing it does not hurt)

The lower bits of the offset field are guaranteed to be zero if the
cluster size is anything other than the minimum (4KB), so alternatively
"host cluster offset" could become "host cluster index", where
host_cluster_offset = host_cluster_index * cluster_size.

With the 56 bits that we have now we can address 64 PB of data. In
theory QEMU can create larger qcow2 files if the cluster size is big
enough. The current hard limit is QCOW_MAX_L1_SIZE = 0x2000000, which
implies:

|--------------+------------------|
| Cluster size | Max virtual size |
|--------------+------------------|
| 512 bytes    | 128 GB           |
|   1 KB       | 512 GB           |
|   2 KB       |   2 TB           |
|   4 KB       |   8 TB           |
|   8 KB       |  32 TB           |
|  16 KB       | 128 TB           |
|  32 KB       | 512 TB           |
|  64 KB       |   2 PB           |
| 128 KB       |   8 PB           |
| 256 KB       |  32 PB           |
| 512 KB       | 128 PB           |
|   1 MB       | 512 PB           |
|   2 MB       |   2 EB           |
|--------------+------------------|

In practice however 64PB ought to be enough for anybody™, so maybe it's
not worth doing this.
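
For reference, here is the arithmetic behind that table (each 8-byte L1
entry points to one L2 cluster, and each 8-byte L2 entry maps one data
cluster; the snippet is just an illustration):

    #include <stdio.h>
    #include <inttypes.h>

    #define QCOW_MAX_L1_SIZE 0x2000000   /* 32 MB of L1 table */

    int main(void)
    {
        for (uint64_t cs = 512; cs <= 2 * 1024 * 1024; cs *= 2) {
            uint64_t max_size = (QCOW_MAX_L1_SIZE / 8) * (cs / 8) * cs;

            printf("cluster %7" PRIu64 " -> max %" PRIu64 " GB\n",
                   cs, max_size >> 30);
        }
        return 0;
    }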

>>     * Cons:
>>       - Only 8 subclusters per cluster. We would not be making the
>>         most of this feature.
>> 
>>       - No reserved bits left for the future.
>
> I just argued you have at least one, and probably 2, bits left over
> for future in-word expansion.

You're correct. In any case, 512 bytes per subcluster should be the
absolute minimum.

>> (2) Making L2 entries 128-bit wide.
>> 
>>     In this alternative we would double the size of L2 entries. The
>>     first half would remain unchanged and the second one would store
>>     the bitmap. That would leave us with 32 subclusters per cluster.
>
> Although for smaller cluster sizes (such as 4k clusters), you'd still
> want to restrict that subclusters are at least 512-byte sectors, so
> you'd be using fewer than 32 of those subcluster positions until the
> cluster size is large enough.

I think in that case what would make sense is to increase the minimum
cluster size instead, or is there any reason why we would want smaller
clusters if we already guarantee that the minimum allocation is 512
bytes?

If we're going to reserve all 64 bits for the bitmap anyway I don't know
if there's a good reason not to use them all.

>> (3) Storing the bitmap somewhere else
>> 
>>     This would involve storing the bitmap separate from the L2 tables
>>     (perhaps using the bitmaps extension? I haven't looked much into
>>     this).
>> 
>>     * Pros:
>>       + Possibility to make the number of subclusters configurable
>>         by the user (32, 64, 128, ...)
>>       + All existing metadata structures would remain untouched
>>         (although the "all zeroes" bit in L2 entries would probably
>>         become unused).
>
> It might still remain useful for optimization purposes, although then
> we get into image consistency questions (if the all zeroes bit is set
> but subcluster map claims allocation, or if the all zeroes bit is
> clear but all subclusters claim zero, which one wins).

If we keep the bit we'd need to define its semantics clearly. We're
going to run into a similar problem with the subcluster state bits (we
have three states but two bits, so four possible values).

> Having the subcluster table directly in the L2 means that updating the
> L2 table is done with a single write. You are definitely right that
> having the subcluster table as a bitmap in a separate cluster means
> two writes instead of one, but as always, it's hard to predict how
> much of an impact that is without benchmarks.

Yes, this one almost feels like the cleanest alternative of the three I
described, but I suspect its effect on I/O performance would be
noticeable.

> The fact that you already have numbers proving the speedups that are
> possible when first allocating the image makes this sound like a useful
> project, even though it is an incompatible image change that old tools
> won't be able to recognize. You'll want to make sure 'qemu-img amend'
> can rewrite an image with subclusters into an older image.

Yes, definitely.

Thanks a lot for your feedback!

Berto


* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-06 15:01 [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation Alberto Garcia
  2017-04-06 16:40 ` Eric Blake
@ 2017-04-07 12:20 ` Stefan Hajnoczi
  2017-04-07 12:24   ` Alberto Garcia
  2017-04-07 13:01   ` Kevin Wolf
  2017-04-07 17:10 ` Max Reitz
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 64+ messages in thread
From: Stefan Hajnoczi @ 2017-04-07 12:20 UTC (permalink / raw)
  To: Alberto Garcia; +Cc: qemu-devel, qemu-block, Max Reitz, Kevin Wolf

On Thu, Apr 06, 2017 at 06:01:48PM +0300, Alberto Garcia wrote:
> Here are the results (subcluster size in brackets):
> 
> |-----------------+----------------+-----------------+-------------------|
> |  cluster size   | subclusters=on | subclusters=off | Max L2 cache size |
> |-----------------+----------------+-----------------+-------------------|
> |   2 MB (256 KB) |   440 IOPS     |  100 IOPS       | 160 KB (*)        |
> | 512 KB  (64 KB) |  1000 IOPS     |  300 IOPS       | 640 KB            |
> |  64 KB   (8 KB) |  3000 IOPS     | 1000 IOPS       |   5 MB            |
> |  32 KB   (4 KB) | 12000 IOPS     | 1300 IOPS       |  10 MB            |
> |   4 KB  (512 B) |   100 IOPS     |  100 IOPS       |  80 MB            |
> |-----------------+----------------+-----------------+-------------------|
> 
>                 (*) The L2 cache must be a multiple of the cluster
>                     size, so in this case it must be 2MB. On the table
>                     I chose to show how much of those 2MB are actually
>                     used so you can compare it with the other cases.
> 
> Some comments about the results:
> 
> - For the 64KB, 512KB and 2MB cases, having subclusters increases
>   write performance roughly by a factor of three. This happens because
>   for each cluster allocation there's less data to copy from the
>   backing image. For the same reason, the smaller the cluster, the
>   better the performance. As expected, 64KB clusters with no
>   subclusters perform roughly the same as 512KB clusters with 64KB
>   subclusters.
> 
> - The 32KB case is the most interesting one. Without subclusters it's
>   not very different from the 64KB case, but having a subcluster of
>   the same size as the I/O block eliminates the need for COW entirely
>   and the performance skyrockets (10 times faster!).
> 
> - 4KB is however very slow. I attribute this to the fact that the
>   cluster size is so small that a new cluster needs to be allocated
>   for every single write and its refcount updated accordingly. The L2
>   and refcount tables are also so small that they are too inefficient
>   and need to grow all the time.
> 
> Here are the results when writing to an empty 40GB qcow2 image with no
> backing file. The numbers are of course different but as you can see
> the patterns are similar:
> 
> |-----------------+----------------+-----------------+-------------------|
> |  cluster size   | subclusters=on | subclusters=off | Max L2 cache size |
> |-----------------+----------------+-----------------+-------------------|
> |   2 MB (256 KB) |  1200 IOPS     |  255 IOPS       | 160 KB            |
> | 512 KB  (64 KB) |  3000 IOPS     |  700 IOPS       | 640 KB            |
> |  64 KB   (8 KB) |  7200 IOPS     | 3300 IOPS       |   5 MB            |
> |  32 KB   (4 KB) | 12300 IOPS     | 4200 IOPS       |  10 MB            |
> |   4 KB  (512 B) |   100 IOPS     |  100 IOPS       |  80 MB            |
> |-----------------+----------------+-----------------+-------------------|

I don't understand why subclusters=on performs so much better when
there's no backing file.  Is qcow2 zeroing out the 64 KB cluster with
subclusters=off?

It ought to just write the 4 KB data when a new cluster is touched.
Therefore the performance should be very similar to subclusters=on.

Stefan


* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-07 12:20 ` [Qemu-devel] " Stefan Hajnoczi
@ 2017-04-07 12:24   ` Alberto Garcia
  2017-04-07 13:01   ` Kevin Wolf
  1 sibling, 0 replies; 64+ messages in thread
From: Alberto Garcia @ 2017-04-07 12:24 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel, qemu-block, Max Reitz, Kevin Wolf

On Fri 07 Apr 2017 02:20:21 PM CEST, Stefan Hajnoczi wrote:
>> Here are the results when writing to an empty 40GB qcow2 image with no
>> backing file. The numbers are of course different but as you can see
>> the patterns are similar:
>> 
>> |-----------------+----------------+-----------------+-------------------|
>> |  cluster size   | subclusters=on | subclusters=off | Max L2 cache size |
>> |-----------------+----------------+-----------------+-------------------|
>> |   2 MB (256 KB) |  1200 IOPS     |  255 IOPS       | 160 KB            |
>> | 512 KB  (64 KB) |  3000 IOPS     |  700 IOPS       | 640 KB            |
>> |  64 KB   (8 KB) |  7200 IOPS     | 3300 IOPS       |   5 MB            |
>> |  32 KB   (4 KB) | 12300 IOPS     | 4200 IOPS       |  10 MB            |
>> |   4 KB  (512 B) |   100 IOPS     |  100 IOPS       |  80 MB            |
>> |-----------------+----------------+-----------------+-------------------|
>
> I don't understand why subclusters=on performs so much better when
> there's no backing file.  Is qcow2 zeroing out the 64 KB cluster with
> subclusters=off?

It is, see https://lists.nongnu.org/archive/html/qemu-devel/2015-07/msg05877.html

Berto


* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-06 16:40 ` Eric Blake
  2017-04-07  8:49   ` Alberto Garcia
@ 2017-04-07 12:41   ` Kevin Wolf
  2017-04-07 14:24     ` Alberto Garcia
  2017-04-21 21:09   ` [Qemu-devel] proposed qcow2 extension: cluster reservations [was: " Eric Blake
  2 siblings, 1 reply; 64+ messages in thread
From: Kevin Wolf @ 2017-04-07 12:41 UTC (permalink / raw)
  To: Eric Blake
  Cc: Alberto Garcia, qemu-devel, Stefan Hajnoczi, qemu-block, Max Reitz

Am 06.04.2017 um 18:40 hat Eric Blake geschrieben:
> On 04/06/2017 10:01 AM, Alberto Garcia wrote:
> > I thought of three alternatives for storing the subcluster bitmaps. I
> > haven't made my mind completely about which one is the best one, so
> > I'd like to present all three for discussion. Here they are:
> > 
> > (1) Storing the bitmap inside the 64-bit entry
> > 
> >     This is a simple alternative and is the one that I chose for my
> >     prototype. There are 14 unused bits plus the "all zeroes" one. If
> >     we steal one from the host offset we have the 16 bits that we need
> >     for the bitmap and we have 46 bits left for the host offset, which
> >     is more than enough.
> 
> Note that because you are using exactly 8 subclusters, you can require
> that the minimum cluster size when subclusters are enabled be 4k (since
> we already have a lower-limit of 512-byte sector operation, and don't
> want subclusters to be smaller than that); at which case you are
> guaranteed that the host cluster offset will be 4k aligned.  So in
> reality, once you turn on subclusters, you have:
> 
> 63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
> 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> **<----> <-----------------------------------------------><---------->*
>   Rsrved              host cluster offset of data             Reserved
>   (6 bits)                (44 bits)                           (11 bits)
> 
> where you have 17 bits plus the "all zeroes" bit to play with, thanks to
> the three bits of host cluster offset that are now guaranteed to be zero
> due to cluster size alignment (but you're also right that the "all
> zeroes" bit is now redundant information with the 8 subcluster-is-zero
> bits, so repurposing it does not hurt)
> 
> > 
> >     * Pros:
> >       + Simple. Few changes compared to the current qcow2 format.
> > 
> >     * Cons:
> >       - Only 8 subclusters per cluster. We would not be making the
> >         most of this feature.
> > 
> >       - No reserved bits left for the future.
> 
> I just argued you have at least one, and probably 2, bits left over for
> future in-word expansion.

I think only 8 subclusters is just too few. That the subcluster status
would be split in two halves doesn't make me like this layout much
better either.

Intuitively one might think that squeezing everything into the existing
data structures will save us memory, but as you correctly state below,
the difference between 8 and 32 subclusters means that we can use a
larger cluster size, and the doubled L2 entry size then actually uses
less space to cover the whole image than the existing layout does.

I can see how few changes compared to the current format can look
attractive, but I also think that there is a danger of forgetting to
support subclusters in some place. When the layout is radically
different, such mistakes would be caught very quickly, so maybe being
more different is actually a plus.

> > 
> > (2) Making L2 entries 128-bit wide.
> > 
> >     In this alternative we would double the size of L2 entries. The
> >     first half would remain unchanged and the second one would store
> >     the bitmap. That would leave us with 32 subclusters per cluster.
> 
> Although for smaller cluster sizes (such as 4k clusters), you'd still
> want to restrict that subclusters are at least 512-byte sectors, so
> you'd be using fewer than 32 of those subcluster positions until the
> cluster size is large enough.
> 
> > 
> >     * Pros:
> >       + More subclusters per cluster. We could have images with
> >         e.g. 128k clusters with 4k subclusters.
> 
> Could allow variable-sized subclusters (your choice of 32 subclusters of
> 4k each, or 16 subclusters of 8k each)

I don't think using fewer subclusters is desirable if it doesn't come
with savings elsewhere. We already need to allocate two clusters for an
L2 table now, so we want to make use of that space.

The more interesting kind of variable-sized subclusters would be if you
could select any multiple of 32, meaning three or more clusters per L2
table (with 192 bits or more per entry).

I'm doubtful if this would be worth the effort, though.

> > 
> >     * Cons:
> >       - More space needed for L2 entries. The same cluster size would
> >         require a cache twice as large, although having subcluster
> >         allocation would compensate for this.
> > 
> >       - More changes to the code to handle 128-bit entries.
> 
> Dealing with variable-sized subclusters, or with unused subcluster
> entries when the cluster size is too small (such as a 4k cluster should
> not be allowed any subclusters smaller than 512 bytes, but that's at
> most 8 out of the 32 slots available), can get tricky.

If it's too tricky, we just don't allow it. 32 * 512 = 16k would be the
minimal cluster size that allows enabling subclusters.

> > 
> >       - We would still be wasting the 14 reserved bits that L2 entries
> >         have.

We discussed the concept of subclusters multiple times in the past, and
I think every time my conclusion was that option (2) is really what we
want if we implement this.

> > (3) Storing the bitmap somewhere else
> > 
> >     This would involve storing the bitmap separate from the L2 tables
> >     (perhaps using the bitmaps extension? I haven't looked much into
> >     this).
> > 
> >     * Pros:
> >       + Possibility to make the number of subclusters configurable
> >         by the user (32, 64, 128, ...)
> >       + All existing metadata structures would remain untouched
> >         (although the "all zeroes" bit in L2 entries would probably
> >         become unused).
> 
> It might still remain useful for optimization purposes, although then we
> get into image consistency questions (if the all zeroes bit is set but
> subcluster map claims allocation, or if the all zeroes bit is clear but
> all subclusters claim zero, which one wins).
> 
> > 
> >     * Cons:
> >       - As with alternative (2), more space needed for metadata.
> > 
> >       - The bitmap would also need to be cached for performance
> >         reasons.
> > 
> >       - Possibly one more *_cache_size option.
> > 
> >       - One more metadata structure to be updated for each
> >         allocation. This would probably impact I/O negatively.
> 
> Having the subcluster table directly in the L2 means that updating the
> L2 table is done with a single write. You are definitely right that
> having the subcluster table as a bitmap in a separate cluster means two
> writes instead of one, but as always, it's hard to predict how much of
> an impact that is without benchmarks.

Note that it's not just additional write requests, but that we can't
update the L2 table entry and the bitmap atomically any more, so we have
to worry about ordering. The ordering between L2 table and refcount
blocks is already painful enough, I'm not sure if I would want to add a
third type. Ordering also means disk flushes, which are a lot slower
than just additional writes.

We wouldn't have these ordering problems with a journal (because we
could then commit things atomically), but I suppose we don't want to
make two major changes at once. :-)

> > === Compressed clusters ===
> > 
> > My idea is that compressed clusters would remain the same. They are
> > read-only anyway so they would not be affected by any of these
> > changes.

Yes, this makes sense to me. Compression already does its own kind of
cluster splitting.

Kevin


* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-07 12:20 ` [Qemu-devel] " Stefan Hajnoczi
  2017-04-07 12:24   ` Alberto Garcia
@ 2017-04-07 13:01   ` Kevin Wolf
  2017-04-10 15:32     ` Stefan Hajnoczi
  1 sibling, 1 reply; 64+ messages in thread
From: Kevin Wolf @ 2017-04-07 13:01 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Alberto Garcia, qemu-devel, qemu-block, Max Reitz

Am 07.04.2017 um 14:20 hat Stefan Hajnoczi geschrieben:
> On Thu, Apr 06, 2017 at 06:01:48PM +0300, Alberto Garcia wrote:
> > Here are the results (subcluster size in brackets):
> > 
> > |-----------------+----------------+-----------------+-------------------|
> > |  cluster size   | subclusters=on | subclusters=off | Max L2 cache size |
> > |-----------------+----------------+-----------------+-------------------|
> > |   2 MB (256 KB) |   440 IOPS     |  100 IOPS       | 160 KB (*)        |
> > | 512 KB  (64 KB) |  1000 IOPS     |  300 IOPS       | 640 KB            |
> > |  64 KB   (8 KB) |  3000 IOPS     | 1000 IOPS       |   5 MB            |
> > |  32 KB   (4 KB) | 12000 IOPS     | 1300 IOPS       |  10 MB            |
> > |   4 KB  (512 B) |   100 IOPS     |  100 IOPS       |  80 MB            |
> > |-----------------+----------------+-----------------+-------------------|
> > 
> >                 (*) The L2 cache must be a multiple of the cluster
> >                     size, so in this case it must be 2MB. On the table
> >                     I chose to show how much of those 2MB are actually
> >                     used so you can compare it with the other cases.
> > 
> > Some comments about the results:
> > 
> > - For the 64KB, 512KB and 2MB cases, having subclusters increases
> >   write performance roughly by a factor of three. This happens because
> >   for each cluster allocation there's less data to copy from the
> >   backing image. For the same reason, the smaller the cluster, the
> >   better the performance. As expected, 64KB clusters with no
> >   subclusters perform roughly the same as 512KB clusters with 64KB
> >   subclusters.
> > 
> > - The 32KB case is the most interesting one. Without subclusters it's
> >   not very different from the 64KB case, but having a subcluster of
> >   the same size as the I/O block eliminates the need for COW entirely
> >   and the performance skyrockets (10 times faster!).
> > 
> > - 4KB is however very slow. I attribute this to the fact that the
> >   cluster size is so small that a new cluster needs to be allocated
> >   for every single write and its refcount updated accordingly. The L2
> >   and refcount tables are also so small that they are too inefficient
> >   and need to grow all the time.
> > 
> > Here are the results when writing to an empty 40GB qcow2 image with no
> > backing file. The numbers are of course different but as you can see
> > the patterns are similar:
> > 
> > |-----------------+----------------+-----------------+-------------------|
> > |  cluster size   | subclusters=on | subclusters=off | Max L2 cache size |
> > |-----------------+----------------+-----------------+-------------------|
> > |   2 MB (256 KB) |  1200 IOPS     |  255 IOPS       | 160 KB            |
> > | 512 KB  (64 KB) |  3000 IOPS     |  700 IOPS       | 640 KB            |
> > |  64 KB   (8 KB) |  7200 IOPS     | 3300 IOPS       |   5 MB            |
> > |  32 KB   (4 KB) | 12300 IOPS     | 4200 IOPS       |  10 MB            |
> > |   4 KB  (512 B) |   100 IOPS     |  100 IOPS       |  80 MB            |
> > |-----------------+----------------+-----------------+-------------------|
> 
> I don't understand why subclusters=on performs so much better when
> there's no backing file.  Is qcow2 zeroing out the 64 KB cluster with
> subclusters=off?
> 
> It ought to just write the 4 KB data when a new cluster is touched.
> Therefore the performance should be very similar to subclusters=on.

No, it can't do that. Nobody guarantees that the cluster contains only
zeros when we don't write them. It could have been used before and then
either freed on a qcow2 level or we could be sitting on a block device
rather than a file.

One optimisation that would be possible even without subclusters is
making only a single I/O request to write the whole cluster instead of
three of them (COW head, guest write, COW tail). Without a backing file,
this improved performance almost to the level of rewrites, but it
couldn't solve the problem when a backing file was used (which is the
main use case for qcow2), so I never got to submitting a patch for it.
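
A rough sketch of that idea (everything below is invented for the sake
of illustration, it is not the actual patch):

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Stand-in for reading COW data from the backing file (or zeroes). */
    typedef void (*fill_fn)(void *buf, uint64_t guest_off, size_t bytes);

    /*
     * Build one cluster-sized buffer containing COW head, guest data and
     * COW tail, so the allocating write can be submitted as a single
     * request instead of three.
     */
    static void *build_cluster_write(uint64_t cluster_off, size_t cluster_size,
                                     uint64_t write_off, const void *data,
                                     size_t data_len, fill_fn fill)
    {
        uint8_t *buf = malloc(cluster_size);
        size_t head = write_off - cluster_off;

        if (!buf) {
            return NULL;
        }
        fill(buf, cluster_off, head);                          /* COW head   */
        memcpy(buf + head, data, data_len);                    /* guest data */
        fill(buf + head + data_len, write_off + data_len,
             cluster_size - head - data_len);                  /* COW tail   */
        return buf;   /* caller writes cluster_size bytes at cluster_off */
    }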

Kevin


* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-07 12:41   ` Kevin Wolf
@ 2017-04-07 14:24     ` Alberto Garcia
  0 siblings, 0 replies; 64+ messages in thread
From: Alberto Garcia @ 2017-04-07 14:24 UTC (permalink / raw)
  To: Kevin Wolf, Eric Blake; +Cc: qemu-devel, Stefan Hajnoczi, qemu-block, Max Reitz

On Fri 07 Apr 2017 02:41:21 PM CEST, Kevin Wolf <kwolf@redhat.com> wrote:
>> 63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
>> 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
>> **<----> <-----------------------------------------------><---------->*
>>   Rsrved              host cluster offset of data             Reserved
>>   (6 bits)                (44 bits)                           (11 bits)
>> 
>> where you have 17 bits plus the "all zeroes" bit to play with, thanks to
>> the three bits of host cluster offset that are now guaranteed to be zero
>> due to cluster size alignment (but you're also right that the "all
>> zeroes" bit is now redundant information with the 8 subcluster-is-zero
>> bits, so repurposing it does not hurt)
>> 
>> > 
>> >     * Pros:
>> >       + Simple. Few changes compared to the current qcow2 format.
>> > 
>> >     * Cons:
>> >       - Only 8 subclusters per cluster. We would not be making the
>> >         most of this feature.
>> > 
>> >       - No reserved bits left for the future.
>> 
>> I just argued you have at least one, and probably 2, bits left over for
>> future in-word expansion.
>
> I think only 8 subclusters is just too few. That the subcluster status
> would be split in two halves doesn't make me like this layout much
> better either.

I also agree that 8 are too few (splitting the subcluster field would
not be strictly necessary, but that's not so important).

>> > (2) Making L2 entries 128-bit wide.
>> > 
>> >     In this alternative we would double the size of L2 entries. The
>> >     first half would remain unchanged and the second one would store
>> >     the bitmap. That would leave us with 32 subclusters per cluster.
>> 
>> Although for smaller cluster sizes (such as 4k clusters), you'd still
>> want to restrict that subclusters are at least 512-byte sectors, so
>> you'd be using fewer than 32 of those subcluster positions until the
>> cluster size is large enough.
>> 
>> > 
>> >     * Pros:
>> >       + More subclusters per cluster. We could have images with
>> >         e.g. 128k clusters with 4k subclusters.
>> 
>> Could allow variable-sized subclusters (your choice of 32 subclusters of
>> 4k each, or 16 subclusters of 8k each)
>
> I don't think using less subclusters is desirable if it doesn't come
> with savings elsewhere. We already need to allocate two clusters for an
> L2 table now, so we want to use it.
>
> The more interesting kind of variable-sized subclusters would be if you
> could select any multiple of 32, meaning three or more clusters per L2
> table (with 192 bits or more per entry).

Yeah, I agree. I think it's worth considering. One more drawback that I
can think of is that if we make L2 entries wider and we have compressed
clusters we'd be wasting space in their entries.

>> >       - One more metadata structure to be updated for each
>> >         allocation. This would probably impact I/O negatively.
>> 
>> Having the subcluster table directly in the L2 means that updating
>> the L2 table is done with a single write. You are definitely right
>> that having the subcluster table as a bitmap in a separate cluster
>> means two writes instead of one, but as always, it's hard to predict
>> how much of an impact that is without benchmarks.
>
> Note that it's not just additional write requests, but that we can't
> update the L2 table entry and the bitmap atomically any more, so we
> have to worry about ordering. The ordering between L2 table and
> refcount blocks is already painful enough, I'm not sure if I would
> want to add a third type. Ordering also means disk flushes, which are
> a lot slower than just additional writes.

You're right, I think you just convinced me that this is a bad idea and
I'm also more inclined towards (2) now.

Berto


* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-06 15:01 [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation Alberto Garcia
  2017-04-06 16:40 ` Eric Blake
  2017-04-07 12:20 ` [Qemu-devel] " Stefan Hajnoczi
@ 2017-04-07 17:10 ` Max Reitz
  2017-04-10  8:42   ` Kevin Wolf
  2017-04-11 12:56   ` Alberto Garcia
  2017-04-12 16:54 ` Denis V. Lunev
  2017-04-12 17:55 ` Denis V. Lunev
  4 siblings, 2 replies; 64+ messages in thread
From: Max Reitz @ 2017-04-07 17:10 UTC (permalink / raw)
  To: Alberto Garcia, qemu-devel; +Cc: qemu-block, Stefan Hajnoczi, Kevin Wolf

On 06.04.2017 17:01, Alberto Garcia wrote:
> Hi all,
> 
> Over the past couple of months I have discussed with some of you the
> possibility of extending the qcow2 format in order to improve its
> performance and reduce its memory requirements (particularly with very
> large images).
> 
> After some discussion on the mailing list and the #qemu IRC channel I
> decided to write a prototype of a new extension for qcow2 so I could
> better understand the scope of the changes and gather some preliminary
> data about its effects.
> 
> This e-mail is the formal presentation of my proposal to extend the
> on-disk qcow2 format. As you can see this is still an RFC. Due to the
> nature of the changes I would like to get as much feedback as possible
> before going forward.
> 
> === Problem ===
> 
> The original problem that I wanted to address is the memory
> requirements of qcow2 files if you want to have very large images and
> still keep good I/O performance. This is a consequence of its very
> structure, which I'm going to describe now.
> 
> A qcow2 image is divided into units of constant size called clusters,
> and among other things it contains metadata that maps guest addresses
> to host addresses (the so-called L1 and L2 tables).
> 
> There are two basic problems that result from this:
> 
> 1) Reading from or writing to a qcow2 image involves reading the
>    corresponding entry in the L2 table that maps the guest address to
>    the host address. This is very slow because it requires two I/O
>    operations: one on the L2 table and another on the actual data
>    cluster.
> 
> 2) A cluster is the smallest unit of allocation. Therefore writing a
>    mere 512 bytes to an empty disk requires allocating a complete
>    cluster and filling it with zeroes (or with data from the backing
>    image if there is one). This wastes more disk space and also has a
>    negative impact on I/O.
> 
> Problem (1) can be solved by keeping in memory a cache of the L2
> tables (QEMU has an "l2_cache_size" parameter for this purpose). The
> amount of disk space used by L2 tables depends on two factors: the
> disk size and the cluster size.
> 
> The cluster size can be configured when the image is created, and it
> can be any power of two between 512 bytes and 2 MB (it defaults to 64
> KB).
> 
> The maximum amount of space needed for the L2 tables can be calculated
> with the following formula:
> 
>    max_l2_size = virtual_disk_size * 8 / cluster_size
> 
> Large images require a large amount of metadata, and therefore a large
> amount of memory for the L2 cache. With the default cluster size
> (64KB) that's 128MB of L2 cache for a 1TB qcow2 image.
> 
> The only way to reduce the size of the L2 tables is therefore
> increasing the cluster size, but this makes the image less efficient
> as described earlier in (2).
> 
> === The proposal ===
> 
> The idea of this proposal is to extend the qcow2 format by allowing
> subcluster allocation. There would be an optional feature that would
> need to be enabled when creating the image. The on-disk format would
> remain essentially the same, except that each data cluster would be
> internally divided into a number of subclusters of equal size.
> 
> What this means in practice is that each entry on an L2 table would be
> accompanied by a bitmap indicating the allocation state of each one of
> the subclusters for that cluster. There are several alternatives for
> storing the bitmap, described below.
> 
> Other than L2 entries, all other data structures would remain
> unchanged, but for data clusters the smallest unit of allocation would
> now be the subcluster. Reference counting would still be at the
> cluster level, because there's no way to reference individual
> subclusters. Copy-on-write on internal snapshots would need to copy
> complete clusters, so that scenario would not benefit from this
> change.
> 
> I see two main use cases for this feature:
> 
> a) The qcow2 image is not too large / the L2 cache is not a problem,
>    but you want to increase the allocation performance. In this case
>    you can have something like a 128KB cluster with 4KB subclusters
>    (with 4KB being a common block size in ext4 and other filesystems)
> 
> b) The qcow2 image is very large and you want to save metadata space
>    in order to have a smaller L2 cache. In this case you can go for
>    the maximum cluster size (2MB) but you want to have smaller
>    subclusters to increase the allocation performance and optimize the
>    disk usage. This was actually my original use case.
> 
> === Test results ===
> 
> I have a basic working prototype of this. It's still incomplete -and
> buggy :)- but it gives an idea of what we can expect from it. In my
> implementation each data cluster has 8 subclusters, but that's not set
> in stone (see below).
> 
> I made all tests on an SSD drive, writing to an empty qcow2 image with
> a fully populated 40GB backing image, performing random writes using
> fio with a block size of 4KB.
> 
> I tried with the default and maximum cluster sizes (64KB and 2MB) and
> also with some other sizes. I also made sure to try with 32KB clusters
> so the subcluster size matches the 4KB block size used for the I/O.
> 
> It's important to point out that once a cluster has been completely
> allocated then having subclusters offers no performance benefit. For
> this reason the size of the image for these tests (40GB) was chosen to
> be large enough to guarantee that there are always new clusters being
> allocated. This is therefore a worst-case scenario (or best-case for
> this feature, if you want).
> 
> Here are the results (subcluster size in brackets):
> 
> |-----------------+----------------+-----------------+-------------------|
> |  cluster size   | subclusters=on | subclusters=off | Max L2 cache size |
> |-----------------+----------------+-----------------+-------------------|
> |   2 MB (256 KB) |   440 IOPS     |  100 IOPS       | 160 KB (*)        |
> | 512 KB  (64 KB) |  1000 IOPS     |  300 IOPS       | 640 KB            |
> |  64 KB   (8 KB) |  3000 IOPS     | 1000 IOPS       |   5 MB            |
> |  32 KB   (4 KB) | 12000 IOPS     | 1300 IOPS       |  10 MB            |
> |   4 KB  (512 B) |   100 IOPS     |  100 IOPS       |  80 MB            |
> |-----------------+----------------+-----------------+-------------------|
> 
>                 (*) The L2 cache must be a multiple of the cluster
>                     size, so in this case it must be 2MB. On the table
>                     I chose to show how much of those 2MB are actually
>                     used so you can compare it with the other cases.
> 
> Some comments about the results:
> 
> - For the 64KB, 512KB and 2MB cases, having subclusters increases
>   write performance roughly by three. This happens because for each
>   cluster allocation there's less data to copy from the backing
>   image. For the same reason, the smaller the cluster, the better the
>   performance. As expected, 64KB clusters with no subclusters perform
>   roughly the same as 512KB clusters with 64KB subclusters.
> 
> - The 32KB case is the most interesting one. Without subclusters it's
>   not very different from the 64KB case, but having a subcluster with
>   the same size of the I/O block eliminates the need for COW entirely
>   and the performance skyrockets (10 times faster!).
> 
> - 4KB is however very slow. I attribute this to the fact that the
>   cluster size is so small that a new cluster needs to be allocated
>   for every single write and its refcount updated accordingly. The L2
>   and refcount tables are also so small that they are too inefficient
>   and need to grow all the time.
> 
> Here are the results when writing to an empty 40GB qcow2 image with no
> backing file. The numbers are of course different but as you can see
> the patterns are similar:
> 
> |-----------------+----------------+-----------------+-------------------|
> |  cluster size   | subclusters=on | subclusters=off | Max L2 cache size |
> |-----------------+----------------+-----------------+-------------------|
> |   2 MB (256 KB) |  1200 IOPS     |  255 IOPS       | 160 KB            |
> | 512 KB  (64 KB) |  3000 IOPS     |  700 IOPS       | 640 KB            |
> |  64 KB   (8 KB) |  7200 IOPS     | 3300 IOPS       |   5 MB            |
> |  32 KB   (4 KB) | 12300 IOPS     | 4200 IOPS       |  10 MB            |
> |   4 KB  (512 B) |   100 IOPS     |  100 IOPS       |  80 MB            |
> |-----------------+----------------+-----------------+-------------------|
> 
> === Changes to the on-disk format ===
> 
> The qcow2 on-disk format needs to change so each L2 entry has a bitmap
> indicating the allocation status of each subcluster. There are three
> possible states (unallocated, allocated, all zeroes), so we need two
> bits per subcluster.

You don't need two bits per subcluster, you need log(3) / log(2) =
ld(3) ≈ 1.58. You can encode the status of eight subclusters (3^8 =
6561 possible states) in 13 bits (ld(6561) ≈ 12.68).

(Why not write it in Malbolge? I guarantee it will be much simpler given
this is ternary information.)
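
Just to make the packing concrete, here is a rough, untested sketch
(mine, not part of Berto's proposal) of packing the ternary state of
eight subclusters into 13 bits; the state names are made up for
illustration:

#include <stdint.h>

enum subcluster_state {              /* illustrative names only */
    SC_UNALLOCATED = 0,
    SC_ALLOCATED   = 1,
    SC_ALL_ZEROES  = 2,
};

/* 3^8 = 6561 <= 2^13 = 8192, so 13 bits are enough for 8 subclusters. */
uint16_t pack_states(const uint8_t states[8])
{
    uint16_t packed = 0;

    for (int i = 7; i >= 0; i--) {
        packed = packed * 3 + states[i];   /* one base-3 digit each */
    }
    return packed;                         /* always < 6561 */
}

void unpack_states(uint16_t packed, uint8_t states[8])
{
    for (int i = 0; i < 8; i++) {
        states[i] = packed % 3;            /* states[0] is the lowest digit */
        packed /= 3;
    }
}

Whether saving three bits justifies the extra encode/decode step is
another question, of course.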

> An L2 entry is 64 bits wide, and this is the current format (for
> uncompressed clusters):
> 
> 63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
> 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> **<----> <--------------------------------------------------><------->*
>   Rsrved              host cluster offset of data             Reserved
>   (6 bits)                (47 bits)                           (8 bits)
> 
>     bit 63: refcount == 1   (QCOW_OFLAG_COPIED)
>     bit 62: compressed = 1  (QCOW_OFLAG_COMPRESSED)
>     bit 0: all zeros        (QCOW_OFLAG_ZERO)
> 
> I thought of three alternatives for storing the subcluster bitmaps. I
> haven't made my mind completely about which one is the best one, so
> I'd like to present all three for discussion. Here they are:
> 
> (1) Storing the bitmap inside the 64-bit entry
> 
>     This is a simple alternative and is the one that I chose for my
>     prototype. There are 14 unused bits plus the "all zeroes" one. If
>     we steal one from the host offset we have the 16 bits that we need
>     for the bitmap and we have 46 bits left for the host offset, which
>     is more than enough.

No need to steal because you only need 13 bits (apart from what Eric wrote).

> 
>     * Pros:
>       + Simple. Few changes compared to the current qcow2 format.
> 
>     * Cons:
>       - Only 8 subclusters per cluster. We would not be making the
>         most of this feature.

I agree on this one, though.

One case I'd be especially interested in are of course 4 kB subclusters
for 64 kB clusters (because 4 kB is a usual page size and can be
configured to be the block size of a guest device; and because 64 kB
simply is the standard cluster size of qcow2 images nowadays[1]...).

Then we'd have 16 subclusters, which would require 26 bits with the
ternary packing above (ld(3^16) ≈ 25.4). Unfortunately, with 64 kB
clusters we'd only have 22 bits available (including the is-zero
flag)... So this is not going to fly.
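
For the record, the 22 comes from counting the bits of the current
entry layout that could be reused. A quick back-of-the-envelope helper
(untested, not QEMU code), with the field positions taken from the
entry format quoted above:

/* Bits of a 64-bit L2 entry that a subcluster bitmap could reuse,
 * for a cluster size of 2^cluster_bits bytes. */
unsigned reusable_l2_bits(unsigned cluster_bits)
{
    unsigned reserved_high = 6;   /* bits 56-61 */
    unsigned reserved_low  = 8;   /* bits 1-8   */
    unsigned zero_flag     = 1;   /* bit 0      */
    /* The host offset is cluster aligned, so entry bits 9 up to
     * cluster_bits - 1 of the offset field are always zero. */
    unsigned aligned_bits  = cluster_bits > 9 ? cluster_bits - 9 : 0;

    return reserved_high + reserved_low + zero_flag + aligned_bits;
}

/* reusable_l2_bits(16) == 22 for 64 kB clusters, as counted above. */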

(We could even get one more bit if we had a subcluster-flag, because I
guess we can always assume subclustered clusters to have OFLAG_COPIED
and be uncompressed. But still, three bits missing.)

Of course, if you'd be willing to give up the all-zeroes state for
subclusters, it would be enough...

>       - No reserved bits left for the future.

One bit left (or more, see Eric). :-)

> (2) Making L2 entries 128-bit wide.
> 
>     In this alternative we would double the size of L2 entries. The
>     first half would remain unchanged and the second one would store
>     the bitmap. That would leave us with 32 subclusters per cluster.
> 
>     * Pros:
>       + More subclusters per cluster. We could have images with
>         e.g. 128k clusters with 4k subclusters.
> 
>     * Cons:
>       - More space needed for L2 entries. The same cluster size would
>         require a cache twice as large, although having subcluster
>         allocation would compensate for this.
> 
>       - More changes to the code to handle 128-bit entries.
> 
>       - We would still be wasting the 14 reserved bits that L2 entries
>         have.

Not sure I like this too much, but maybe it performs better than (3).

Also, Kevin doesn't like (3) and neither do you anymore...

> (3) Storing the bitmap somewhere else
> 
>     This would involve storing the bitmap separate from the L2 tables
>     (perhaps using the bitmaps extension? I haven't looked much into
>     this).

(not really)

>     * Pros:
>       + Possibility to make the number of subclusters configurable
>         by the user (32, 64, 128, ...)

Yes indeed!

>       + All existing metadata structures would remain untouched
>         (although the "all zeroes" bit in L2 entries would probably
>         become unused).

Well, sounds nice but it's still an incompatible change, so I'm not sure
whether this is really a benefit. I guess it means our code stays simpler.

>     * Cons:
>       - As with alternative (2), more space needed for metadata.

Yeah, but just 1/4 more if you have 8 subclusters (16 extra bits on a
64-bit entry) or 1/2 more if you take my favorite 64 kB/4 kB
configuration. Not too bad.

(This is worse for (2). It'll always be double or more.)

>       - The bitmap would also need to be cached for performance
>         reasons.
> 
>       - Possibly one more *_cache_size option.

I don't think so. There is no reason we shouldn't cache all the entries
whose corresponding L2 table is loaded.

(Or I can't see one, at least. At least once we can cache partial
clusters; if the subcluster information for an L2 table only takes 1/4th
of the space (and thus 1/4th of a cluster) we only need to cache that
much data. We don't need to cache the adjacent subcluster information
blocks if their L2 tables aren't cached, too.)

>       - One more metadata structure to be updated for each
>         allocation. This would probably impact I/O negatively.

Needs benchmarks. :-)

I don't think it would be too bad on an SSD. On an HDD it means
additional seeking, yes. (Unless you're clever: if you have 1/2
overhead, you could put the subcluster information between the
corresponding L2 tables (same for every multiple of 1/2 that isn't also
a multiple of 1). For multiples of 1, you just put it into a cluster
adjacent to the L2 table.)

(OTOH, the fact that you effectively need to at least double the L2 size
for (2) means that you need to write more data, so (2) might actually
perform worse than (3) if you would be all fine with just 16 subclusters
(at least on an SSD).)

By the way, if you'd only allow multiple of 1s overhead (i.e. multiples
of 32 subclusters), I think (3) would be pretty much the same as (2) if
you just always write the subcluster information adjacent to the L2
table. Should be just the same caching-wise and performance-wise.

Max

> === Compressed clusters ===
> 
> My idea is that compressed clusters would remain the same. They are
> read-only anyway so they would not be affected by any of these
> changes.
> 
> ===========================
> 
> I think I managed to summarize all the ideas that I had in my mind,
> but I'm sure you probably have questions and comments, so I'd be happy
> to get as much feedback as possible.
> 
> So, looking forward to reading your opinions.
> 
> Regards,
> 
> Berto
> 




* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-07 17:10 ` Max Reitz
@ 2017-04-10  8:42   ` Kevin Wolf
  2017-04-10 15:03     ` Max Reitz
  2017-04-11 12:56   ` Alberto Garcia
  1 sibling, 1 reply; 64+ messages in thread
From: Kevin Wolf @ 2017-04-10  8:42 UTC (permalink / raw)
  To: Max Reitz; +Cc: Alberto Garcia, qemu-devel, qemu-block, Stefan Hajnoczi

Am 07.04.2017 um 19:10 hat Max Reitz geschrieben:
> One case I'd be especially interested in are of course 4 kB subclusters
> for 64 kB clusters (because 4 kB is a usual page size and can be
> configured to be the block size of a guest device; and because 64 kB
> simply is the standard cluster size of qcow2 images nowadays[1]...).

Why should the current default cluster size be an argument for anything?
64k is a tradeoff between small COW size and large allocation
granularity that seemed to work well (and it used to be the maximum
cluster size originally, so it seemed safer than 128k).

With subclusters, we have a completely different situation and need to
find a new default that works best. I'm relatively sure that 64k is too
small under the new conditions.

Also, undefined reference: [1] (I was hoping to find a better argument
there...)

Kevin


* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-10  8:42   ` Kevin Wolf
@ 2017-04-10 15:03     ` Max Reitz
  0 siblings, 0 replies; 64+ messages in thread
From: Max Reitz @ 2017-04-10 15:03 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Alberto Garcia, qemu-devel, qemu-block, Stefan Hajnoczi

On 10.04.2017 10:42, Kevin Wolf wrote:
> Am 07.04.2017 um 19:10 hat Max Reitz geschrieben:
>> One case I'd be especially interested in are of course 4 kB subclusters
>> for 64 kB clusters (because 4 kB is a usual page size and can be
>> configured to be the block size of a guest device; and because 64 kB
>> simply is the standard cluster size of qcow2 images nowadays[1]...).
> 
> Why should the current default cluster size be an argument for anything?
> 64k is a tradeoff between small COW size and large allocation
> granularity that seemed to work well (and it used to be the maximum
> cluster size originally, so it seemed safer than 128k).
> 
> With subclusters, we have a completely different situation and need to
> find a new default that works best. I'm relatively sure that 64k are too
> small under the new conditions.
> 
> Also, undefined reference: [1] (I was hoping to find a better argument
> there...)

I meant [1] to be basically what you just said.

Well, the thing is that 64 kB is still better than 32 kB. :-P

Max




* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-07 13:01   ` Kevin Wolf
@ 2017-04-10 15:32     ` Stefan Hajnoczi
  0 siblings, 0 replies; 64+ messages in thread
From: Stefan Hajnoczi @ 2017-04-10 15:32 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Alberto Garcia, qemu-devel, qemu-block, Max Reitz

On Fri, Apr 07, 2017 at 03:01:29PM +0200, Kevin Wolf wrote:
> Am 07.04.2017 um 14:20 hat Stefan Hajnoczi geschrieben:
> > On Thu, Apr 06, 2017 at 06:01:48PM +0300, Alberto Garcia wrote:
> > > Here are the results (subcluster size in brackets):
> > > 
> > > |-----------------+----------------+-----------------+-------------------|
> > > |  cluster size   | subclusters=on | subclusters=off | Max L2 cache size |
> > > |-----------------+----------------+-----------------+-------------------|
> > > |   2 MB (256 KB) |   440 IOPS     |  100 IOPS       | 160 KB (*)        |
> > > | 512 KB  (64 KB) |  1000 IOPS     |  300 IOPS       | 640 KB            |
> > > |  64 KB   (8 KB) |  3000 IOPS     | 1000 IOPS       |   5 MB            |
> > > |  32 KB   (4 KB) | 12000 IOPS     | 1300 IOPS       |  10 MB            |
> > > |   4 KB  (512 B) |   100 IOPS     |  100 IOPS       |  80 MB            |
> > > |-----------------+----------------+-----------------+-------------------|
> > > 
> > >                 (*) The L2 cache must be a multiple of the cluster
> > >                     size, so in this case it must be 2MB. On the table
> > >                     I chose to show how much of those 2MB are actually
> > >                     used so you can compare it with the other cases.
> > > 
> > > Some comments about the results:
> > > 
> > > - For the 64KB, 512KB and 2MB cases, having subclusters increases
> > >   write performance roughly by three. This happens because for each
> > >   cluster allocation there's less data to copy from the backing
> > >   image. For the same reason, the smaller the cluster, the better the
> > >   performance. As expected, 64KB clusters with no subclusters perform
> > >   roughly the same as 512KB clusters with 64KB subclusters.
> > > 
> > > - The 32KB case is the most interesting one. Without subclusters it's
> > >   not very different from the 64KB case, but having a subcluster with
> > >   the same size of the I/O block eliminates the need for COW entirely
> > >   and the performance skyrockets (10 times faster!).
> > > 
> > > - 4KB is however very slow. I attribute this to the fact that the
> > >   cluster size is so small that a new cluster needs to be allocated
> > >   for every single write and its refcount updated accordingly. The L2
> > >   and refcount tables are also so small that they are too inefficient
> > >   and need to grow all the time.
> > > 
> > > Here are the results when writing to an empty 40GB qcow2 image with no
> > > backing file. The numbers are of course different but as you can see
> > > the patterns are similar:
> > > 
> > > |-----------------+----------------+-----------------+-------------------|
> > > |  cluster size   | subclusters=on | subclusters=off | Max L2 cache size |
> > > |-----------------+----------------+-----------------+-------------------|
> > > |   2 MB (256 KB) |  1200 IOPS     |  255 IOPS       | 160 KB            |
> > > | 512 KB  (64 KB) |  3000 IOPS     |  700 IOPS       | 640 KB            |
> > > |  64 KB   (8 KB) |  7200 IOPS     | 3300 IOPS       |   5 MB            |
> > > |  32 KB   (4 KB) | 12300 IOPS     | 4200 IOPS       |  10 MB            |
> > > |   4 KB  (512 B) |   100 IOPS     |  100 IOPS       |  80 MB            |
> > > |-----------------+----------------+-----------------+-------------------|
> > 
> > I don't understand why subclusters=on performs so much better when
> > there's no backing file.  Is qcow2 zeroing out the 64 KB cluster with
> > subclusters=off?
> > 
> > It ought to just write the 4 KB data when a new cluster is touched.
> > Therefore the performance should be very similar to subclusters=on.
> 
> No, it can't do that. Nobody guarantees that the cluster contains only
> zeros when we don't write them. It could have been used before and then
> either freed on a qcow2 level or we could be sitting on a block device
> rather than a file.

I thought we had the no-op optimization for clusters allocated at the
end of a POSIX file.  All the more reason to add sub-clusters!

Stefan


* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-07 17:10 ` Max Reitz
  2017-04-10  8:42   ` Kevin Wolf
@ 2017-04-11 12:56   ` Alberto Garcia
  2017-04-11 14:04     ` Max Reitz
  1 sibling, 1 reply; 64+ messages in thread
From: Alberto Garcia @ 2017-04-11 12:56 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: qemu-block, Stefan Hajnoczi, Kevin Wolf

On Fri 07 Apr 2017 07:10:46 PM CEST, Max Reitz wrote:
>> === Changes to the on-disk format ===
>> 
>> The qcow2 on-disk format needs to change so each L2 entry has a bitmap
>> indicating the allocation status of each subcluster. There are three
>> possible states (unallocated, allocated, all zeroes), so we need two
>> bits per subcluster.
>
> You don't need two bits, you need log(3) / log(2) = ld(3) ≈ 1.58. You
> can encode the status of eight subclusters (3^8 = 6561) in 13 bits
> (ld(6561) ≈ 12.68).

Right, although that would make the encoding more cumbersome to use and
to debug. Is it worth it?

> One case I'd be especially interested in are of course 4 kB
> subclusters for 64 kB clusters (because 4 kB is a usual page size and
> can be configured to be the block size of a guest device; and because
> 64 kB simply is the standard cluster size of qcow2 images
> nowadays[1]...).

I think that we should have at least that, but ideally larger
cluster-to-subcluster ratios.

> (We could even get one more bit if we had a subcluster-flag, because I
> guess we can always assume subclustered clusters to have OFLAG_COPIED
> and be uncompressed. But still, three bits missing.)

Why can we always assume OFLAG_COPIED?

> If course, if you'd be willing to give up the all-zeroes state for
> subclusters, it would be enough...

I still think that it looks like a better idea to allow having more
subclusters, but giving up the all-zeroes state is a valid
alternative. Apart from having to overwrite with zeroes when a
subcluster is discarded, is there anything else that we would miss?

> By the way, if you'd only allow multiple of 1s overhead
> (i.e. multiples of 32 subclusters), I think (3) would be pretty much
> the same as (2) if you just always write the subcluster information
> adjacent to the L2 table. Should be just the same caching-wise and
> performance-wise.

Then (3) is effectively the same as (2), just that the subcluster
bitmaps are at the end of the L2 cluster, and not next to each entry.

Berto


* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-11 12:56   ` Alberto Garcia
@ 2017-04-11 14:04     ` Max Reitz
  2017-04-11 14:31       ` Alberto Garcia
  0 siblings, 1 reply; 64+ messages in thread
From: Max Reitz @ 2017-04-11 14:04 UTC (permalink / raw)
  To: Alberto Garcia, qemu-devel; +Cc: qemu-block, Stefan Hajnoczi, Kevin Wolf

On 11.04.2017 14:56, Alberto Garcia wrote:
> On Fri 07 Apr 2017 07:10:46 PM CEST, Max Reitz wrote:
>>> === Changes to the on-disk format ===
>>>
>>> The qcow2 on-disk format needs to change so each L2 entry has a bitmap
>>> indicating the allocation status of each subcluster. There are three
>>> possible states (unallocated, allocated, all zeroes), so we need two
>>> bits per subcluster.
>>
>> You don't need two bits, you need log(3) / log(2) = ld(3) ≈ 1.58. You
>> can encode the status of eight subclusters (3^8 = 6561) in 13 bits
>> (ld(6561) ≈ 12.68).
> 
> Right, although that would make the encoding more cumbersome to use and
> to debug. Is it worth it?

Probably not, considering this is likely not the way we want to go anyway.

>> One case I'd be especially interested in are of course 4 kB
>> subclusters for 64 kB clusters (because 4 kB is a usual page size and
>> can be configured to be the block size of a guest device; and because
>> 64 kB simply is the standard cluster size of qcow2 images
>> nowadays[1]...).
> 
> I think that we should have at least that, but ideally larger
> cluster-to-subcluster ratios.
> 
>> (We could even get one more bit if we had a subcluster-flag, because I
>> guess we can always assume subclustered clusters to have OFLAG_COPIED
>> and be uncompressed. But still, three bits missing.)
> 
> Why can we always assume OFLAG_COPIED?

Because partially allocated clusters cannot be used with internal
snapshots, and that is what OFLAG_COPIED is for.

>> If course, if you'd be willing to give up the all-zeroes state for
>> subclusters, it would be enough...
> 
> I still think that it looks like a better idea to allow having more
> subclusters, but giving up the all-zeroes state is a valid
> alternative. Apart from having to overwrite with zeroes when a
> subcluster is discarded, is there anything else that we would miss?

If it's a real discard you can just discard it (which is what we do
for compat=0.10 images anyway); but zero-writes will then have to
become real writes, yes.

>> By the way, if you'd only allow multiple of 1s overhead
>> (i.e. multiples of 32 subclusters), I think (3) would be pretty much
>> the same as (2) if you just always write the subcluster information
>> adjacent to the L2 table. Should be just the same caching-wise and
>> performance-wise.
> 
> Then (3) is effectively the same as (2), just that the subcluster
> bitmaps are at the end of the L2 cluster, and not next to each entry.

Exactly. But it's a difference in implementation, as you won't have to
worry about having changed the L2 table layout; maybe that's a benefit.
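
For concreteness, a rough sketch (made-up numbers, not QEMU code) of
the offset math inside one L2 cluster under the two variants, assuming
64 kB clusters and a 64-bit bitmap (32 subclusters) per entry:

#include <stdint.h>

#define CLUSTER_SIZE   65536u
#define ENTRY_SIZE     8u                  /* classic 64-bit L2 entry  */
#define BITMAP_SIZE    8u                  /* 64-bit subcluster bitmap */

/* Either way, one L2 cluster describes CLUSTER_SIZE / 16 = 4096
 * guest clusters. */
#define ENTRIES_PER_L2 (CLUSTER_SIZE / (ENTRY_SIZE + BITMAP_SIZE))

/* (2): 128-bit entries, the bitmap sits right after each entry. */
uint32_t entry_off_2(uint32_t i)
{
    return i * (ENTRY_SIZE + BITMAP_SIZE);
}

uint32_t bitmap_off_2(uint32_t i)
{
    return entry_off_2(i) + ENTRY_SIZE;
}

/* (3), stored adjacently: plain 64-bit entries first, all bitmaps
 * packed together at the end of the same L2 cluster. */
uint32_t entry_off_3(uint32_t i)
{
    return i * ENTRY_SIZE;
}

uint32_t bitmap_off_3(uint32_t i)
{
    return ENTRIES_PER_L2 * ENTRY_SIZE + i * BITMAP_SIZE;
}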

Max



* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-11 14:04     ` Max Reitz
@ 2017-04-11 14:31       ` Alberto Garcia
  2017-04-11 14:45         ` [Qemu-devel] [Qemu-block] " Eric Blake
  2017-04-11 14:49         ` [Qemu-devel] " Kevin Wolf
  0 siblings, 2 replies; 64+ messages in thread
From: Alberto Garcia @ 2017-04-11 14:31 UTC (permalink / raw)
  To: Max Reitz, qemu-devel; +Cc: qemu-block, Stefan Hajnoczi, Kevin Wolf

On Tue 11 Apr 2017 04:04:53 PM CEST, Max Reitz wrote:
>>> (We could even get one more bit if we had a subcluster-flag, because I
>>> guess we can always assume subclustered clusters to have OFLAG_COPIED
>>> and be uncompressed. But still, three bits missing.)
>> 
>> Why can we always assume OFLAG_COPIED?
>
> Because partially allocated clusters cannot be used with internal
> snapshots, and that is what OFLAG_COPIED is for.

Why can't they be used?

>>> If course, if you'd be willing to give up the all-zeroes state for
>>> subclusters, it would be enough...
>> 
>> I still think that it looks like a better idea to allow having more
>> subclusters, but giving up the all-zeroes state is a valid
>> alternative. Apart from having to overwrite with zeroes when a
>> subcluster is discarded, is there anything else that we would miss?
>
> It if it's a real discard you can just discard it (which is what we do
> for compat=0.10 images anyway); but zero-writes will then have to be
> come real writes, yes.

Perhaps we can give up that bit for subclusters then; that would allow
us to double their number. We would still have the zero flag at the
cluster level. Opinions on this, anyone?

>>> By the way, if you'd only allow multiple of 1s overhead
>>> (i.e. multiples of 32 subclusters), I think (3) would be pretty much
>>> the same as (2) if you just always write the subcluster information
>>> adjacent to the L2 table. Should be just the same caching-wise and
>>> performance-wise.
>> 
>> Then (3) is effectively the same as (2), just that the subcluster
>> bitmaps are at the end of the L2 cluster, and not next to each entry.
>
> Exactly. But it's a difference in implementation, as you won't have to
> worry about having changed the L2 table layout; maybe that's a
> benefit.

I'm not sure if that would simplify or complicate things, but it's worth
considering.

Berto


* Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-11 14:31       ` Alberto Garcia
@ 2017-04-11 14:45         ` Eric Blake
  2017-04-12 12:41           ` Alberto Garcia
  2017-04-11 14:49         ` [Qemu-devel] " Kevin Wolf
  1 sibling, 1 reply; 64+ messages in thread
From: Eric Blake @ 2017-04-11 14:45 UTC (permalink / raw)
  To: Alberto Garcia, Max Reitz, qemu-devel
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block

On 04/11/2017 09:31 AM, Alberto Garcia wrote:
> On Tue 11 Apr 2017 04:04:53 PM CEST, Max Reitz wrote:
>>>> (We could even get one more bit if we had a subcluster-flag, because I
>>>> guess we can always assume subclustered clusters to have OFLAG_COPIED
>>>> and be uncompressed. But still, three bits missing.)
>>>
>>> Why can we always assume OFLAG_COPIED?
>>
>> Because partially allocated clusters cannot be used with internal
>> snapshots, and that is what OFLAG_COPIED is for.
> 
> Why can't they be used?

An internal snapshot causes a COW to happen if another write happens
anywhere in the cluster. Setting OFLAG_COPIED is a shorthand for whether
the COW must happen, but it is always possible (but slower) to refer
back to the refcount to learn the same information.  If we have a
cluster with missing subclusters, and need to do a COW, we are already
reading from the backing file - so we might as well populate the missing
subclusters of the original cluster at the time we write the new
updated cluster, at which point we no longer need to mark the cluster as
using subclusters.  Or we could state that the action of creating an
internal snapshot takes longer, because it fully populates all
partially-populated clusters (taking an internal snapshot is something
that is not done frequently, after all, as we've gradually been trying
to steer users to external snapshots) - or we could even go so far as to
state that internal snapshots and subclusters are incompatible (you
can't use both features at the same time).

It may be possible to make OFLAG_COPIED and subclusters work usefully
together, but the point being made here is that because we're already
changing design principles, we don't necessarily have to burn a bit on
OFLAG_COPIED if subclusters are in use.

> 
> Perhaps we can give up that bit for subclusters then, that would allow
> us to double their number. We would still have the zero flag at the
> cluster level. Opinions on this, anyone?

I already think we're leaning away from option 1, even though the above
conversation was about how to pack more state into just 64 bits if we
wanted to stick with option 1.

If we use option 2 or 3, it may still be worth burning a bit in the
original 64 bits that says whether the subcluster secondary 64-bits is
valid (it might speed up some operations if we only have to do one
64-bit read and realize that the entire cluster is uniform, compared to
operations where the subcluster flag is set so we have to do a second
64-bit read to learn about the state of each subcluster).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-11 14:31       ` Alberto Garcia
  2017-04-11 14:45         ` [Qemu-devel] [Qemu-block] " Eric Blake
@ 2017-04-11 14:49         ` Kevin Wolf
  2017-04-11 14:58           ` Eric Blake
                             ` (2 more replies)
  1 sibling, 3 replies; 64+ messages in thread
From: Kevin Wolf @ 2017-04-11 14:49 UTC (permalink / raw)
  To: Alberto Garcia; +Cc: Max Reitz, qemu-devel, qemu-block, Stefan Hajnoczi

Am 11.04.2017 um 16:31 hat Alberto Garcia geschrieben:
> On Tue 11 Apr 2017 04:04:53 PM CEST, Max Reitz wrote:
> >>> (We could even get one more bit if we had a subcluster-flag, because I
> >>> guess we can always assume subclustered clusters to have OFLAG_COPIED
> >>> and be uncompressed. But still, three bits missing.)
> >> 
> >> Why can we always assume OFLAG_COPIED?
> >
> > Because partially allocated clusters cannot be used with internal
> > snapshots, and that is what OFLAG_COPIED is for.
> 
> Why can't they be used?

Refcounts are at cluster granularity, so you have to COW the whole
cluster at once. If you copied only a subcluster, you'd lose the
information about where to find the other subclusters.

> >>> If course, if you'd be willing to give up the all-zeroes state for
> >>> subclusters, it would be enough...
> >> 
> >> I still think that it looks like a better idea to allow having more
> >> subclusters, but giving up the all-zeroes state is a valid
> >> alternative. Apart from having to overwrite with zeroes when a
> >> subcluster is discarded, is there anything else that we would miss?
> >
> > It if it's a real discard you can just discard it (which is what we do
> > for compat=0.10 images anyway); but zero-writes will then have to be
> > come real writes, yes.
> 
> Perhaps we can give up that bit for subclusters then, that would allow
> us to double their number. We would still have the zero flag at the
> cluster level. Opinions on this, anyone?

No, making the backing file contents reappear is really bad; we don't
want that. If anything, we'd have to use the cluster-level zero flag and
do COW (i.e. write explicit zeros) on the first write to a subcluster in
it. I'd rather keep the zero flag for subclusters.

> >>> By the way, if you'd only allow multiple of 1s overhead
> >>> (i.e. multiples of 32 subclusters), I think (3) would be pretty much
> >>> the same as (2) if you just always write the subcluster information
> >>> adjacent to the L2 table. Should be just the same caching-wise and
> >>> performance-wise.
> >> 
> >> Then (3) is effectively the same as (2), just that the subcluster
> >> bitmaps are at the end of the L2 cluster, and not next to each entry.
> >
> > Exactly. But it's a difference in implementation, as you won't have to
> > worry about having changed the L2 table layout; maybe that's a
> > benefit.
> 
> I'm not sure if that would simplify or complicate things, but it's worth
> considering.

Note that 64k between an L2 entry and the corresponding bitmap is enough
to make an update not atomic any more. They need to be within the same
sector to get atomicity.

Kevin


* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-11 14:49         ` [Qemu-devel] " Kevin Wolf
@ 2017-04-11 14:58           ` Eric Blake
  2017-04-11 14:59           ` Max Reitz
  2017-04-12 12:47           ` Alberto Garcia
  2 siblings, 0 replies; 64+ messages in thread
From: Eric Blake @ 2017-04-11 14:58 UTC (permalink / raw)
  To: Kevin Wolf, Alberto Garcia
  Cc: Stefan Hajnoczi, qemu-devel, qemu-block, Max Reitz

On 04/11/2017 09:49 AM, Kevin Wolf wrote:

>>>> Then (3) is effectively the same as (2), just that the subcluster
>>>> bitmaps are at the end of the L2 cluster, and not next to each entry.
>>>
>>> Exactly. But it's a difference in implementation, as you won't have to
>>> worry about having changed the L2 table layout; maybe that's a
>>> benefit.
>>
>> I'm not sure if that would simplify or complicate things, but it's worth
>> considering.
> 
> Note that 64k between an L2 entry and the corresponding bitmap is enough
> to make an update not atomic any more. They need to be within the same
> sector to get atomicity.

Furthermore, there is a benefit to cache line packing - alternating
64 bits for the offset and 64 bits for the subclusters will fit all 128
bits of an entry in the same cache line, while having all offsets up
front followed by all subclusters later will not.  Worse, depending on
architecture, if the
64-bits for the offset is at the same relative offset to overall cache
alignment as the 64-bits for the subcluster (for example, with 1M
clusters, if the offset is at 4M and the subcluster info is at 5M), the
alignments of the separate memory pages may cause you to end up with
both values competing for the same cache line, causing ping-pong
evictions and associated slowdowns.  Data locality really does want to
locate stuff that is commonly used together to also appear close
together in memory.
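
In (made-up, simplified) C terms, the two layouts look roughly like
this, assuming a 64-bit bitmap per entry and an illustrative entry
count:

#include <stdint.h>

#define N 4096   /* entries per L2 table; illustrative only */

/* Interleaved (option 2): an entry's offset and bitmap are 8 bytes
 * apart and always share a 64-byte cache line. */
struct l2_interleaved {
    struct {
        uint64_t offset_and_flags;
        uint64_t subcluster_bitmap;
    } entries[N];
};

/* Split: all offsets first, all bitmaps afterwards; an entry's two
 * words are N * 8 bytes apart and may even map to the same cache set. */
struct l2_split {
    uint64_t offset_and_flags[N];
    uint64_t subcluster_bitmap[N];
};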

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-11 14:49         ` [Qemu-devel] " Kevin Wolf
  2017-04-11 14:58           ` Eric Blake
@ 2017-04-11 14:59           ` Max Reitz
  2017-04-11 15:08             ` Eric Blake
  2017-04-12 12:47           ` Alberto Garcia
  2 siblings, 1 reply; 64+ messages in thread
From: Max Reitz @ 2017-04-11 14:59 UTC (permalink / raw)
  To: Kevin Wolf, Alberto Garcia; +Cc: qemu-devel, qemu-block, Stefan Hajnoczi

On 11.04.2017 16:49, Kevin Wolf wrote:

[...]

>>>>> By the way, if you'd only allow multiple of 1s overhead
>>>>> (i.e. multiples of 32 subclusters), I think (3) would be pretty much
>>>>> the same as (2) if you just always write the subcluster information
>>>>> adjacent to the L2 table. Should be just the same caching-wise and
>>>>> performance-wise.
>>>>
>>>> Then (3) is effectively the same as (2), just that the subcluster
>>>> bitmaps are at the end of the L2 cluster, and not next to each entry.
>>>
>>> Exactly. But it's a difference in implementation, as you won't have to
>>> worry about having changed the L2 table layout; maybe that's a
>>> benefit.
>>
>> I'm not sure if that would simplify or complicate things, but it's worth
>> considering.
> 
> Note that 64k between an L2 entry and the corresponding bitmap is enough
> to make an update not atomic any more. They need to be within the same
> sector to get atomicity.

Good point, but that also means that (with (2)) you can only use
subcluster configurations where the L2 entry size grows to a power of
two. Unfortunately, only one of those configurations has a power-of-two
number of subclusters itself, and that is 32.

(With 32 subclusters, you take up 64 bits, which means an L2 entry will
take 128 bits; with any higher 2^n subclusters, you'd take up 2^{n+1}
bits and the L2 entry would take 2^{n+1} + 64 bits, which cannot be a
power of two.)
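
A quick throwaway check of that claim, just for illustration:

#include <stdio.h>

int main(void)
{
    for (int n = 4; n <= 8; n++) {
        unsigned subclusters = 1u << n;
        unsigned bits = 2 * subclusters + 64;   /* bitmap + old entry */
        int pow2 = (bits & (bits - 1)) == 0;

        printf("%3u subclusters -> %3u-bit entry%s\n",
               subclusters, bits, pow2 ? " (power of two)" : "");
    }
    return 0;
}

Only the 32-subcluster line comes out as a power of two.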

I don't know how useful non-power-of-two subcluster configurations are.
Probably not at all.

Since using subclusters would always make the L2 table span more than
512 bytes, you could therefore never guarantee that no entry overlaps a
sector border (except with 32 subclusters, whose 128-bit entries pack
evenly into 512-byte sectors).

Max



* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-11 14:59           ` Max Reitz
@ 2017-04-11 15:08             ` Eric Blake
  2017-04-11 15:18               ` Max Reitz
  0 siblings, 1 reply; 64+ messages in thread
From: Eric Blake @ 2017-04-11 15:08 UTC (permalink / raw)
  To: Max Reitz, Kevin Wolf, Alberto Garcia
  Cc: Stefan Hajnoczi, qemu-devel, qemu-block

On 04/11/2017 09:59 AM, Max Reitz wrote:

> 
> Good point, but that also means that (with (2)) you can only use
> subcluster configurations where the L2 entry size increases by a power
> of two. Unfortunately, only one of those configurations itself is a
> power of two, and that is 32.
> 
> (With 32 subclusters, you take up 64 bits, which means an L2 entry will
> take 128 bits; with any higher 2^n, you'd take up 2^{n+1} bits and the
> L2 entry would take 2^{n+1} + 64 which is impossible to be a power of two.)

Or we add padding. If you want 64 subclusters, you burn 256 bits per
entry, even though only 192 of those bits are used.

> 
> I don't know how useful non-power-of-two subcluster configurations are.
> Probably not at all.
> 
> Since using subcluster would always result in the L2 table taking more
> than 512 bytes, you could therefore never guarantee that there is no
> entry overlapping a sector border (except with 32 subclusters).

Yes, there are definite benefits to keeping whatever structure we end
up with aligned so that it naturally falls on sector boundaries, even
if it means more padding bits.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-11 15:08             ` Eric Blake
@ 2017-04-11 15:18               ` Max Reitz
  2017-04-11 15:29                 ` Kevin Wolf
  2017-04-11 15:30                 ` Eric Blake
  0 siblings, 2 replies; 64+ messages in thread
From: Max Reitz @ 2017-04-11 15:18 UTC (permalink / raw)
  To: Eric Blake, Kevin Wolf, Alberto Garcia
  Cc: Stefan Hajnoczi, qemu-devel, qemu-block

On 11.04.2017 17:08, Eric Blake wrote:
> On 04/11/2017 09:59 AM, Max Reitz wrote:
> 
>>
>> Good point, but that also means that (with (2)) you can only use
>> subcluster configurations where the L2 entry size increases by a power
>> of two. Unfortunately, only one of those configurations itself is a
>> power of two, and that is 32.
>>
>> (With 32 subclusters, you take up 64 bits, which means an L2 entry will
>> take 128 bits; with any higher 2^n, you'd take up 2^{n+1} bits and the
>> L2 entry would take 2^{n+1} + 64 which is impossible to be a power of two.)
> 
> Or we add padding. If you want 64 subclusters, you burn 256 bits per
> entry, even though only 192 of those bits are used.

Hm, yeah, although you have to keep in mind that the padding is almost
as large as the data bits we need, effectively doubling the size of the
L2 tables:

padding = 2^{n+2} - 2^{n+1} - 64 (=2^6)
        = 2^{n+1} - 64

So that's not so nice, but if it's the only thing we can do...

>> I don't know how useful non-power-of-two subcluster configurations are.
>> Probably not at all.
>>
>> Since using subcluster would always result in the L2 table taking more
>> than 512 bytes, you could therefore never guarantee that there is no
>> entry overlapping a sector border (except with 32 subclusters).
> 
> Yes, there's definite benefits to keeping whatever structure we end up
> with aligned so that it naturally falls into sector boundaries, even if
> it means more padding bits.

Then again, I'm not even sure we really need atomicity for L2 entries +
subcluster bits. I don't think you'd ever have to modify both at the
same time (if you just say the subclusters are all unallocated when
allocating the cluster itself, and then you write which subclusters are
actually allocated afterwards).

(This also applies to your remark on caching, I think.)

Atomicity certainly makes things easier, though.

Max



* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-11 15:18               ` Max Reitz
@ 2017-04-11 15:29                 ` Kevin Wolf
  2017-04-11 15:29                   ` Max Reitz
  2017-04-11 15:30                 ` Eric Blake
  1 sibling, 1 reply; 64+ messages in thread
From: Kevin Wolf @ 2017-04-11 15:29 UTC (permalink / raw)
  To: Max Reitz
  Cc: Eric Blake, Alberto Garcia, Stefan Hajnoczi, qemu-devel, qemu-block

Am 11.04.2017 um 17:18 hat Max Reitz geschrieben:
> On 11.04.2017 17:08, Eric Blake wrote:
> > On 04/11/2017 09:59 AM, Max Reitz wrote:
> > 
> >>
> >> Good point, but that also means that (with (2)) you can only use
> >> subcluster configurations where the L2 entry size increases by a power
> >> of two. Unfortunately, only one of those configurations itself is a
> >> power of two, and that is 32.
> >>
> >> (With 32 subclusters, you take up 64 bits, which means an L2 entry will
> >> take 128 bits; with any higher 2^n, you'd take up 2^{n+1} bits and the
> >> L2 entry would take 2^{n+1} + 64 which is impossible to be a power of two.)
> > 
> > Or we add padding. If you want 64 subclusters, you burn 256 bits per
> > entry, even though only 192 of those bits are used.
> 
> Hm, yeah, although you have to keep in mind that the padding is almost
> pretty much the same as the the data bits we need, effectively doubling
> the size of the L2 tables:
> 
> padding = 2^{n+2} - 2^{n+1} - 64 (=2^6)
>         = 2^{n+1} - 64
> 
> So that's not so nice, but if it's the only thing we can do...
> 
> >> I don't know how useful non-power-of-two subcluster configurations are.
> >> Probably not at all.
> >>
> >> Since using subcluster would always result in the L2 table taking more
> >> than 512 bytes, you could therefore never guarantee that there is no
> >> entry overlapping a sector border (except with 32 subclusters).
> > 
> > Yes, there's definite benefits to keeping whatever structure we end up
> > with aligned so that it naturally falls into sector boundaries, even if
> > it means more padding bits.
> 
> Then again, I'm not even sure we really need atomicity for L2 entries +
> subcluster bits. I don't think you'd ever have to modify both at the
> same time (if you just say the subclusters are all unallocated when
> allocating the cluster itself, and then you write which subclusters are
> actually allocated afterwards)).
> 
> (This also applies to your remark on caching, I think.)
> 
> Atomicity certainly makes things easier, though.

Unless you want to deal with ordering (i.e. on cluster allocation, first
update the subcluster bitmap, then flush, then add the L2 entry), I
think you need atomicity.
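
In pseudo-C, the non-atomic path would have to look something like
this; the helper names are made up (not QEMU API) and the stubs only
exist to keep the sketch self-contained:

#include <stdint.h>

static int write_subcluster_bitmap(uint64_t l2_index, uint64_t bitmap)
{
    (void)l2_index; (void)bitmap;
    return 0;
}

static int flush_metadata(void)
{
    return 0;
}

static int write_l2_entry(uint64_t l2_index, uint64_t entry)
{
    (void)l2_index; (void)entry;
    return 0;
}

/* Without atomicity: make the bitmap durable before publishing the L2
 * entry, so a crash can never expose an entry with a stale bitmap. */
int allocate_cluster(uint64_t l2_index, uint64_t entry, uint64_t bitmap)
{
    int ret = write_subcluster_bitmap(l2_index, bitmap);
    if (ret < 0) {
        return ret;
    }
    ret = flush_metadata();                /* the extra flush */
    if (ret < 0) {
        return ret;
    }
    return write_l2_entry(l2_index, entry);
}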

Kevin


* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-11 15:29                 ` Kevin Wolf
@ 2017-04-11 15:29                   ` Max Reitz
  0 siblings, 0 replies; 64+ messages in thread
From: Max Reitz @ 2017-04-11 15:29 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Eric Blake, Alberto Garcia, Stefan Hajnoczi, qemu-devel, qemu-block

On 11.04.2017 17:29, Kevin Wolf wrote:
> Am 11.04.2017 um 17:18 hat Max Reitz geschrieben:
>> On 11.04.2017 17:08, Eric Blake wrote:
>>> On 04/11/2017 09:59 AM, Max Reitz wrote:
>>>
>>>>
>>>> Good point, but that also means that (with (2)) you can only use
>>>> subcluster configurations where the L2 entry size increases by a power
>>>> of two. Unfortunately, only one of those configurations itself is a
>>>> power of two, and that is 32.
>>>>
>>>> (With 32 subclusters, you take up 64 bits, which means an L2 entry will
>>>> take 128 bits; with any higher 2^n, you'd take up 2^{n+1} bits and the
>>>> L2 entry would take 2^{n+1} + 64 which is impossible to be a power of two.)
>>>
>>> Or we add padding. If you want 64 subclusters, you burn 256 bits per
>>> entry, even though only 192 of those bits are used.
>>
>> Hm, yeah, although you have to keep in mind that the padding is almost
>> pretty much the same as the the data bits we need, effectively doubling
>> the size of the L2 tables:
>>
>> padding = 2^{n+2} - 2^{n+1} - 64 (=2^6)
>>         = 2^{n+1} - 64
>>
>> So that's not so nice, but if it's the only thing we can do...
>>
>>>> I don't know how useful non-power-of-two subcluster configurations are.
>>>> Probably not at all.
>>>>
>>>> Since using subcluster would always result in the L2 table taking more
>>>> than 512 bytes, you could therefore never guarantee that there is no
>>>> entry overlapping a sector border (except with 32 subclusters).
>>>
>>> Yes, there's definite benefits to keeping whatever structure we end up
>>> with aligned so that it naturally falls into sector boundaries, even if
>>> it means more padding bits.
>>
>> Then again, I'm not even sure we really need atomicity for L2 entries +
>> subcluster bits. I don't think you'd ever have to modify both at the
>> same time (if you just say the subclusters are all unallocated when
>> allocating the cluster itself, and then you write which subclusters are
>> actually allocated afterwards)).
>>
>> (This also applies to your remark on caching, I think.)
>>
>> Atomicity certainly makes things easier, though.
> 
> Unless you want to deal with ordering (i.e. on cluster allocation, first
> update the subcluster bitmap, then flush, then add the L2 entry), I
> think you need atomicity.

Yes, that's what I meant.

Max



* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-11 15:18               ` Max Reitz
  2017-04-11 15:29                 ` Kevin Wolf
@ 2017-04-11 15:30                 ` Eric Blake
  2017-04-11 15:34                   ` Max Reitz
  1 sibling, 1 reply; 64+ messages in thread
From: Eric Blake @ 2017-04-11 15:30 UTC (permalink / raw)
  To: Max Reitz, Kevin Wolf, Alberto Garcia
  Cc: Stefan Hajnoczi, qemu-devel, qemu-block

On 04/11/2017 10:18 AM, Max Reitz wrote:

> Hm, yeah, although you have to keep in mind that the padding is almost
> pretty much the same as the the data bits we need, effectively doubling
> the size of the L2 tables:
> 
> padding = 2^{n+2} - 2^{n+1} - 64 (=2^6)
>         = 2^{n+1} - 64
> 
> So that's not so nice, but if it's the only thing we can do...

Or we mix-and-match your ideas: since our subcluster states are
ternary, if you want 128 subclusters, instead of asking for 256 bits,
you ask for ld(3^128) = 203 bits, then use 11 padding bits of your
original 64, and you have something that still fits in 256 bits instead
of 512...

Okay, just kidding.  Anything larger than 32 subclusters does get
awkward fast.


-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-11 15:30                 ` Eric Blake
@ 2017-04-11 15:34                   ` Max Reitz
  0 siblings, 0 replies; 64+ messages in thread
From: Max Reitz @ 2017-04-11 15:34 UTC (permalink / raw)
  To: Eric Blake, Kevin Wolf, Alberto Garcia
  Cc: Stefan Hajnoczi, qemu-devel, qemu-block

On 11.04.2017 17:30, Eric Blake wrote:
> On 04/11/2017 10:18 AM, Max Reitz wrote:
> 
>> Hm, yeah, although you have to keep in mind that the padding is almost
>> pretty much the same as the the data bits we need, effectively doubling
>> the size of the L2 tables:
>>
>> padding = 2^{n+2} - 2^{n+1} - 64 (=2^6)
>>         = 2^{n+1} - 64
>>
>> So that's not so nice, but if it's the only thing we can do...
> 
> Or we mix-and-match your ideas: since our subclusters are ternary
> encoding, if you want 128 subclusters, instead of asking for 256 bits,
> you ask for ld(3^128) = 203 bits, then use 11 padding bits of your
> original 64, and you have something that still fits in 256 bits instead
> of 512...
> 
> Okay, just kidding.  Anything larger than 32 subclusters does get
> awkward fast.

If we want to do something a bit complicated that is still kind of
reasonable, though, we could just not use padding except at 512-byte
boundaries. Of course, that makes calculating the index of an entry
within an L2 table a bit more complicated (probably involving a
division by a value that is not a power of two), but it would kind of
work.

Or we use Malbolge, yes. That would probably be the most sensible choice.

Max



* Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-11 14:45         ` [Qemu-devel] [Qemu-block] " Eric Blake
@ 2017-04-12 12:41           ` Alberto Garcia
  2017-04-12 14:10             ` Max Reitz
  0 siblings, 1 reply; 64+ messages in thread
From: Alberto Garcia @ 2017-04-12 12:41 UTC (permalink / raw)
  To: Eric Blake, Max Reitz, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block

On Tue 11 Apr 2017 04:45:29 PM CEST, Eric Blake wrote:
>>>>> (We could even get one more bit if we had a subcluster-flag,
>>>>> because I guess we can always assume subclustered clusters to have
>>>>> OFLAG_COPIED and be uncompressed. But still, three bits missing.)
>>>>
>>>> Why can we always assume OFLAG_COPIED?
>>>
>>> Because partially allocated clusters cannot be used with internal
>>>snapshots, and that is what OFLAG_COPIED is for.
>> 
>> Why can't they be used?
>
> An internal snapshot causes a COW to happen if another write happens
> anywhere in the cluster. Setting OFLAG_COPIED is a shorthand for
> whether the COW must happen, but it is always possible (but slower) to
> refer back to the refcount to learn the same information.  If we have
> a cluster with missing subclusters, and need to do a COW, we are
> already reading from the backing file - so we might as well populate
> the missing subclusters of the original cluster at that time we write
> the new updated cluster, at which point we no longer need to mark the
> cluster as using subclusters.

I still don't see why we can always assume OFLAG_COPIED. Before doing
the COW one cluster can have references from multiple snapshots, and
OFLAG_COPIED is equally valid in that context. We still need to know if
we need to perform COW when writing to it, regardless of whether it has
subclusters or not.

Also, when doing a COW for an internal snapshot we definitely have to
duplicate the whole cluster, but do we really need to read from the
backing file? Can't we leave the missing subclusters unallocated in the
copy?

> If we use option 2 or 3, it may still be worth burning a bit in the
> original 64 bits that says whether the subcluster secondary 64-bits is
> valid (it might speed up some operations if we only have to do one
> 64-bit read and realize that the entire cluster is uniform, compared
> to operations where the subcluster flag is set so we have to do a
> second 64-bit read to learn about the state of each subcluster).

Yeah, that might be a good idea.

Berto


* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-11 14:49         ` [Qemu-devel] " Kevin Wolf
  2017-04-11 14:58           ` Eric Blake
  2017-04-11 14:59           ` Max Reitz
@ 2017-04-12 12:47           ` Alberto Garcia
  2 siblings, 0 replies; 64+ messages in thread
From: Alberto Garcia @ 2017-04-12 12:47 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Max Reitz, qemu-devel, qemu-block, Stefan Hajnoczi

On Tue 11 Apr 2017 04:49:21 PM CEST, Kevin Wolf wrote:
>> >>> (We could even get one more bit if we had a subcluster-flag, because I
>> >>> guess we can always assume subclustered clusters to have OFLAG_COPIED
>> >>> and be uncompressed. But still, three bits missing.)
>> >> 
>> >> Why can we always assume OFLAG_COPIED?
>> >
>> > Because partially allocated clusters cannot be used with internal
>> > snapshots, and that is what OFLAG_COPIED is for.
>> 
>> Why can't they be used?
>
> Refcounts are on a cluster granularity, so you have to COW the whole
> cluster at once. If you copied only a subcluster, you'd lose the
> information where to find the other subclusters.

Individual subclusters don't have reference counts; OFLAG_COPIED would
always be at the cluster level, but otherwise I don't see the problem
(see the reply that I just wrote to Eric).

>> > It if it's a real discard you can just discard it (which is what we
>> > do for compat=0.10 images anyway); but zero-writes will then have
>> > to be come real writes, yes.
>> 
>> Perhaps we can give up that bit for subclusters then, that would
>> allow us to double their number. We would still have the zero flag at
>> the cluster level. Opinions on this, anyone?
>
> No, making the backing file contents reappear is really bad, we don't
> want that.

I'm not talking about making the backing file contents reappear, but
about writing zeroes instead of setting the zero flag.

>> >> Then (3) is effectively the same as (2), just that the subcluster
>> >> bitmaps are at the end of the L2 cluster, and not next to each
>> >> entry.
>> >
>> > Exactly. But it's a difference in implementation, as you won't have
>> > to worry about having changed the L2 table layout; maybe that's a
>> > benefit.
>> 
>> I'm not sure if that would simplify or complicate things, but it's
>> worth considering.
>
> Note that 64k between an L2 entry and the corresponding bitmap is
> enough to make an update not atomic any more. They need to be within
> the same sector to get atomicity.

Good point.
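
To make that concrete, here is a quick (untested) sanity check. It assumes
a hypothetical packing where a 64 KB L2 cluster stores all 64-bit entries
first and all 16-bit subcluster bitmaps in a tail region; the names and
sizes are made up for illustration only:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SECTOR_SIZE   512
    #define L2_CLUSTER    65536   /* 64 KB L2 table cluster (example) */
    #define ENTRY_SIZE    8       /* current 64-bit L2 entries */
    #define BITMAP_SIZE   2       /* hypothetical 16-bit subcluster bitmap */
    #define NUM_ENTRIES   (L2_CLUSTER / (ENTRY_SIZE + BITMAP_SIZE))

    /* Can entry i and its bitmap be updated with a single sector write? */
    static bool same_sector(unsigned i)
    {
        uint64_t entry_off  = (uint64_t)i * ENTRY_SIZE;
        uint64_t bitmap_off = (uint64_t)NUM_ENTRIES * ENTRY_SIZE
                              + (uint64_t)i * BITMAP_SIZE;
        return entry_off / SECTOR_SIZE == bitmap_off / SECTOR_SIZE;
    }

    int main(void)
    {
        unsigned atomic = 0;
        for (unsigned i = 0; i < NUM_ENTRIES; i++) {
            atomic += same_sector(i);
        }
        printf("%u of %u entries share a sector with their bitmap\n",
               atomic, (unsigned)NUM_ENTRIES);
        return 0;
    }

With such a layout the answer is zero, so an entry and its bitmap would
always need two separate writes.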

Berto

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-12 12:41           ` Alberto Garcia
@ 2017-04-12 14:10             ` Max Reitz
  2017-04-13  8:05               ` Alberto Garcia
  0 siblings, 1 reply; 64+ messages in thread
From: Max Reitz @ 2017-04-12 14:10 UTC (permalink / raw)
  To: Alberto Garcia, Eric Blake, qemu-devel
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block


On 12.04.2017 14:41, Alberto Garcia wrote:
> On Tue 11 Apr 2017 04:45:29 PM CEST, Eric Blake wrote:
>>>>>> (We could even get one more bit if we had a subcluster-flag,
>>>>>> because I guess we can always assume subclustered clusters to have
>>>>>> OFLAG_COPIED and be uncompressed. But still, three bits missing.)
>>>>>
>>>>> Why can we always assume OFLAG_COPIED?
>>>>
>>>> Because partially allocated clusters cannot be used with internal
>>>> snapshots, and that is what OFLAG_COPIED is for.
>>>
>>> Why can't they be used?
>>
>> An internal snapshot causes a COW to happen if another write happens
>> anywhere in the cluster. Setting OFLAG_COPIED is a shorthand for
>> whether the COW must happen, but it is always possible (but slower) to
>> refer back to the refcount to learn the same information.  If we have
>> a cluster with missing subclusters, and need to do a COW, we are
>> already reading from the backing file - so we might as well populate
>> the missing subclusters of the original cluster at that time we write
>> the new updated cluster, at which point we no longer need to mark the
>> cluster as using subclusters.
> 
> I still don't see why we can always assume OFLAG_COPIED. Before doing
> the COW one cluster can have references from multiple snapshots,

Yes...

>                                                                  and
> OFLAG_COPIED is equally valid in that context.

In what context? When having subclusters? Why?

>                                                We still need to know if
> we need to perform COW when writing to it, regardless of whether it has
> subclusters or not.

But why would you reference a cluster multiple times if it has
subclusters? Yes, you can do it in theory, but we could just disallow
it, because it doesn't make sense.

As I've said before, there may be uses for clusters to be referenced
multiple times other than internal snapshots; but right now we only use
it for internal snapshots, and I don't think that disallowing multiple
references for subclustered clusters would be horrible for any of the
other use cases.

> Also, when doing a COW for an internal snapshot we definitely have to
> duplicate the whole cluster, but do we really need to read from the
> backing file? Can't we leave the missing subclusters unallocated in the
> copy?

Can't we just disallow !OFLAG_COPIED for subclustered clusters?

To me it seems you implicitly assume we would want to allow
!OFLAG_COPIED, and based on that you argue how we could do so and what
semantics it would bear. However, I fail to see why we would want that
feature at all.

Max



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-06 15:01 [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation Alberto Garcia
                   ` (2 preceding siblings ...)
  2017-04-07 17:10 ` Max Reitz
@ 2017-04-12 16:54 ` Denis V. Lunev
  2017-04-13 11:58   ` Alberto Garcia
  2017-04-12 17:55 ` Denis V. Lunev
  4 siblings, 1 reply; 64+ messages in thread
From: Denis V. Lunev @ 2017-04-12 16:54 UTC (permalink / raw)
  To: Alberto Garcia, qemu-devel
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz

On 04/06/2017 06:01 PM, Alberto Garcia wrote:
> Hi all,
>
> over the past couple of months I discussed with some of you the
> possibility to extend the qcow2 format in order to improve its
> performance and reduce its memory requirements (particularly with very
> large images).
>
> After some discussion in the mailing list and the #qemu IRC channel I
> decided to write a prototype of a new extension for qcow2 so I could
> understand better the scope of the changes and have some preliminary
> data about its effects.
>
> This e-mail is the formal presentation of my proposal to extend the
> on-disk qcow2 format. As you can see this is still an RFC. Due to the
> nature of the changes I would like to get as much feedback as possible
> before going forward.
>
> === Problem ===
>
> The original problem that I wanted to address is the memory
> requirements of qcow2 files if you want to have very large images and
> still keep good I/O performance. This is a consequence of its very
> structure, which I'm going to describe now.
>
> A qcow2 image is divided into units of constant size called clusters,
> and among other things it contains metadata that maps guest addresses
> to host addresses (the so-called L1 and L2 tables).
>
> There are two basic problems that result from this:
>
> 1) Reading from or writing to a qcow2 image involves reading the
>    corresponding entry on the L2 table that maps the guest address to
>    the host address. This is very slow because it involves two I/O
>    operations: one on the L2 table and the other one on the actual
>    data cluster.
>
> 2) A cluster is the smallest unit of allocation. Therefore writing a
>    mere 512 bytes to an empty disk requires allocating a complete
>    cluster and filling it with zeroes (or with data from the backing
>    image if there is one). This wastes more disk space and also has a
>    negative impact on I/O.
>
> Problem (1) can be solved by keeping in memory a cache of the L2
> tables (QEMU has an "l2_cache_size" parameter for this purpose). The
> amount of disk space used by L2 tables depends on two factors: the
> disk size and the cluster size.
>
> The cluster size can be configured when the image is created, and it
> can be any power of two between 512 bytes and 2 MB (it defaults to 64
> KB).
>
> The maximum amount of space needed for the L2 tables can be calculated
> with the following formula:
>
>    max_l2_size = virtual_disk_size * 8 / cluster_size
>
> Large images require a large amount of metadata, and therefore a large
> amount of memory for the L2 cache. With the default cluster size
> (64KB) that's 128MB of L2 cache for a 1TB qcow2 image.
>
> The only way to reduce the size of the L2 tables is therefore
> increasing the cluster size, but this makes the image less efficient
> as described earlier in (2).
>
> === The proposal ===
>
> The idea of this proposal is to extend the qcow2 format by allowing
> subcluster allocation. There would be an optional feature that would
> need to be enabled when creating the image. The on-disk format would
> remain essentially the same, except that each data cluster would be
> internally divided into a number of subclusters of equal size.
>
> What this means in practice is that each entry on an L2 table would be
> accompanied by a bitmap indicating the allocation state of each one of
> the subclusters for that cluster. There are several alternatives for
> storing the bitmap, described below.
>
> Other than L2 entries, all other data structures would remain
> unchanged, but for data clusters the smallest unit of allocation would
> now be the subcluster. Reference counting would still be at the
> cluster level, because there's no way to reference individual
> subclusters. Copy-on-write on internal snapshots would need to copy
> complete clusters, so that scenario would not benefit from this
> change.
>
> I see two main use cases for this feature:
>
> a) The qcow2 image is not too large / the L2 cache is not a problem,
>    but you want to increase the allocation performance. In this case
>    you can have something like a 128KB cluster with 4KB subclusters
>    (with 4KB being a common block size in ext4 and other filesystems)
>
> b) The qcow2 image is very large and you want to save metadata space
>    in order to have a smaller L2 cache. In this case you can go for
>    the maximum cluster size (2MB) but you want to have smaller
>    subclusters to increase the allocation performance and optimize the
>    disk usage. This was actually my original use case.
>
> === Test results ===
>
> I have a basic working prototype of this. It's still incomplete -and
> buggy :)- but it gives an idea of what we can expect from it. In my
> implementation each data cluster has 8 subclusters, but that's not set
> in stone (see below).
>
> I made all tests on an SSD drive, writing to an empty qcow2 image with
> a fully populated 40GB backing image, performing random writes using
> fio with a block size of 4KB.
>
> I tried with the default and maximum cluster sizes (64KB and 2MB) and
> also with some other sizes. I also made sure to try with 32KB clusters
> so the subcluster size matches the 4KB block size used for the I/O.
>
> It's important to point out that once a cluster has been completely
> allocated then having subclusters offers no performance benefit. For
> this reason the size of the image for these tests (40GB) was chosen to
> be large enough to guarantee that there are always new clusters being
> allocated. This is therefore a worst-case scenario (or best-case for
> this feature, if you want).
>
> Here are the results (subcluster size in brackets):
>
> |-----------------+----------------+-----------------+-------------------|
> |  cluster size   | subclusters=on | subclusters=off | Max L2 cache size |
> |-----------------+----------------+-----------------+-------------------|
> |   2 MB (256 KB) |   440 IOPS     |  100 IOPS       | 160 KB (*)        |
> | 512 KB  (64 KB) |  1000 IOPS     |  300 IOPS       | 640 KB            |
> |  64 KB   (8 KB) |  3000 IOPS     | 1000 IOPS       |   5 MB            |
> |  32 KB   (4 KB) | 12000 IOPS     | 1300 IOPS       |  10 MB            |
> |   4 KB  (512 B) |   100 IOPS     |  100 IOPS       |  80 MB            |
> |-----------------+----------------+-----------------+-------------------|
>
>                 (*) The L2 cache must be a multiple of the cluster
>                     size, so in this case it must be 2MB. On the table
>                     I chose to show how much of those 2MB are actually
>                     used so you can compare it with the other cases.
>
> Some comments about the results:
>
> - For the 64KB, 512KB and 2MB cases, having subclusters increases
>   write performance roughly by three. This happens because for each
>   cluster allocation there's less data to copy from the backing
>   image. For the same reason, the smaller the cluster, the better the
>   performance. As expected, 64KB clusters with no subclusters perform
>   roughly the same as 512KB clusters with 64KB subclusters.
>
> - The 32KB case is the most interesting one. Without subclusters it's
>   not very different from the 64KB case, but having a subcluster with
>   the same size of the I/O block eliminates the need for COW entirely
>   and the performance skyrockets (10 times faster!).
>
> - 4KB is however very slow. I attribute this to the fact that the
>   cluster size is so small that a new cluster needs to be allocated
>   for every single write and its refcount updated accordingly. The L2
>   and refcount tables are also so small that they are too inefficient
>   and need to grow all the time.
>
> Here are the results when writing to an empty 40GB qcow2 image with no
> backing file. The numbers are of course different but as you can see
> the patterns are similar:
>
> |-----------------+----------------+-----------------+-------------------|
> |  cluster size   | subclusters=on | subclusters=off | Max L2 cache size |
> |-----------------+----------------+-----------------+-------------------|
> |   2 MB (256 KB) |  1200 IOPS     |  255 IOPS       | 160 KB            |
> | 512 KB  (64 KB) |  3000 IOPS     |  700 IOPS       | 640 KB            |
> |  64 KB   (8 KB) |  7200 IOPS     | 3300 IOPS       |   5 MB            |
> |  32 KB   (4 KB) | 12300 IOPS     | 4200 IOPS       |  10 MB            |
> |   4 KB  (512 B) |   100 IOPS     |  100 IOPS       |  80 MB            |
> |-----------------+----------------+-----------------+-------------------|
>
> === Changes to the on-disk format ===
>
> The qcow2 on-disk format needs to change so each L2 entry has a bitmap
> indicating the allocation status of each subcluster. There are three
> possible states (unallocated, allocated, all zeroes), so we need two
> bits per subcluster.
>
> An L2 entry is 64 bits wide, and this is the current format (for
> uncompressed clusters):
>
> 63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
> 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> **<----> <--------------------------------------------------><------->*
>   Rsrved              host cluster offset of data             Reserved
>   (6 bits)                (47 bits)                           (8 bits)
>
>     bit 63: refcount == 1   (QCOW_OFLAG_COPIED)
>     bit 62: compressed = 1  (QCOW_OFLAG_COMPRESSED)
>     bit 0: all zeros        (QCOW_OFLAG_ZERO)
>
> I thought of three alternatives for storing the subcluster bitmaps. I
> haven't made my mind completely about which one is the best one, so
> I'd like to present all three for discussion. Here they are:
>
> (1) Storing the bitmap inside the 64-bit entry
>
>     This is a simple alternative and is the one that I chose for my
>     prototype. There are 14 unused bits plus the "all zeroes" one. If
>     we steal one from the host offset we have the 16 bits that we need
>     for the bitmap and we have 46 bits left for the host offset, which
>     is more than enough.
>
>     * Pros:
>       + Simple. Few changes compared to the current qcow2 format.
>
>     * Cons:
>       - Only 8 subclusters per cluster. We would not be making the
>         most of this feature.
>
>       - No reserved bits left for the future.
>
> (2) Making L2 entries 128-bit wide.
>
>     In this alternative we would double the size of L2 entries. The
>     first half would remain unchanged and the second one would store
>     the bitmap. That would leave us with 32 subclusters per cluster.
>
>     * Pros:
>       + More subclusters per cluster. We could have images with
>         e.g. 128k clusters with 4k subclusters.
>
>     * Cons:
>       - More space needed for L2 entries. The same cluster size would
>         require a cache twice as large, although having subcluster
>         allocation would compensate for this.
>
>       - More changes to the code to handle 128-bit entries.
>
>       - We would still be wasting the 14 reserved bits that L2 entries
>         have.
>
> (3) Storing the bitmap somewhere else
>
>     This would involve storing the bitmap separate from the L2 tables
>     (perhaps using the bitmaps extension? I haven't looked much into
>     this).
>
>     * Pros:
>       + Possibility to make the number of subclusters configurable
>         by the user (32, 64, 128, ...)
>       + All existing metadata structures would remain untouched
>         (although the "all zeroes" bit in L2 entries would probably
>         become unused).
>
>     * Cons:
>       - As with alternative (2), more space needed for metadata.
>
>       - The bitmap would also need to be cached for performance
>         reasons.
>
>       - Possibly one more *_cache_size option.
>
>       - One more metadata structure to be updated for each
>         allocation. This would probably impact I/O negatively.
>
> === Compressed clusters ===
>
> My idea is that compressed clusters would remain the same. They are
> read-only anyway so they would not be affected by any of these
> changes.
>
> ===========================
>
> I think I managed to summarize all the ideas that I had in my mind,
> but I'm sure you probably have questions and comments, so I'd be happy
> to get as much feedback as possible.
>
> So, looking forward to reading your opinions.
>
> Regards,
>
> Berto
>

My opinion of this approach is very negative, as the problem could be
(at least partially) solved in a much better way.

1) Current L2 cache management seems very wrong to me. Each cache
    miss means that we have to read an entire L2 table cluster. In the
    worst case (when the dataset of the test does not fit in the L2
    cache) we read 64 KB of L2 table for each 4 KB read.

    The situation gets MUCH worse once we start increasing the cluster
    size: with 1 MB clusters we have to read 1 MB on each cache miss.

    The situation can be cured immediately once we start reading the
    L2 cache in 4 or 8 KB chunks. We have a patchset for this in our
    downstream and are preparing it for upstream.
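
The idea is roughly the following (an illustrative sketch using plain
POSIX I/O, not the actual downstream patch; the helper name is made up):

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    #define L2_ENTRY_SIZE   8      /* 64-bit L2 entries */
    #define CACHE_PAGE_SIZE 4096   /* read the L2 table in 4 KB pages on a miss */

    /* On an L2 cache miss, read only the 4 KB page of the L2 table that
     * contains the entry for 'l2_index', instead of the whole 64 KB .. 2 MB
     * L2 table cluster. Returns 0 on read error. */
    static uint64_t l2_load_entry(int fd, uint64_t l2_table_offset,
                                  unsigned l2_index)
    {
        uint64_t entry_offset = l2_table_offset + (uint64_t)l2_index * L2_ENTRY_SIZE;
        uint64_t page_offset  = entry_offset & ~(uint64_t)(CACHE_PAGE_SIZE - 1);
        uint8_t page[CACHE_PAGE_SIZE];
        uint64_t entry;

        if (pread(fd, page, sizeof(page), (off_t)page_offset) != sizeof(page)) {
            return 0;
        }
        /* entries are big-endian on disk; byte swapping is omitted here */
        memcpy(&entry, page + (entry_offset - page_offset), sizeof(entry));
        return entry;
    }

The real patch of course keeps the page in the cache instead of rereading
it on every access.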

The performance results are roughly as follows (from memory) for an
Intel P3700 PCIe SSD, which is capable of 300k IOPS.

Without any modifications we get 20k IOPS on a 16 GB dataset. With (1)
implemented we get 50k, and with a 1 MB data block we get 100k IOPS.
This is still not huge, but it is much better.

2) Yet another terrible thing about cluster allocation is its strategy.
    The current QCOW2 codebase needs 5 (five) I/O operations to complete
    a COW operation: we read the head, write the head, read the tail,
    write the tail, and write the actual data. This could easily be
    reduced to 3 I/O operations.
    Another problem is the amount of data written. We write the entire
    cluster in the write operation, and this is also insane. On a normal
    modern filesystem it is possible to perform fallocate() plus the
    actual data write instead.
    This patch series was also finished but not yet sent.
    It gives around a 20x difference for cluster allocation.
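
    A rough sketch of the 3-operation variant (untested, plain POSIX calls
    rather than QEMU's block layer, single-file case for simplicity, and
    assuming the guest write fits inside the cluster):

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* COW in 3 I/O operations (read head, read tail, one combined write)
     * instead of the current 5 (read head, write head, read tail, write
     * tail, write data). 'guest_off' is the offset of the new data inside
     * the cluster. */
    static int cow_write_cluster(int fd, uint64_t old_off, uint64_t new_off,
                                 size_t cluster_size, size_t guest_off,
                                 const void *data, size_t len)
    {
        size_t head_len = guest_off;
        size_t tail_len = cluster_size - guest_off - len;
        uint8_t *head = malloc(head_len ? head_len : 1);
        uint8_t *tail = malloc(tail_len ? tail_len : 1);
        struct iovec iov[3];
        int ret = -1;

        if (!head || !tail) {
            goto out;
        }
        /* op 1: read the old data in front of the guest write */
        if (head_len &&
            pread(fd, head, head_len, (off_t)old_off) != (ssize_t)head_len) {
            goto out;
        }
        /* op 2: read the old data behind the guest write */
        if (tail_len &&
            pread(fd, tail, tail_len,
                  (off_t)(old_off + guest_off + len)) != (ssize_t)tail_len) {
            goto out;
        }
        /* op 3: write head + new data + tail as a single request */
        iov[0] = (struct iovec){ .iov_base = head,         .iov_len = head_len };
        iov[1] = (struct iovec){ .iov_base = (void *)data, .iov_len = len };
        iov[2] = (struct iovec){ .iov_base = tail,         .iov_len = tail_len };
        if (pwritev(fd, iov, 3, (off_t)new_off) != (ssize_t)cluster_size) {
            goto out;
        }
        ret = 0;
    out:
        free(head);
        free(tail);
        return ret;
    }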

3) Partial allocation is very bad from a later performance point of view.
    The guest can send 4-5 MB requests. In this case a QCOW2 read of data
    that is not contiguous in the image file would be performed completely
    serially, one chunk after another, resulting in HUGE latency. We should
    perform the chunk reads in a coroutine pool, as was done recently by
    Peter Lieven for qemu-img in commit 2d9187bc65.
    We are struggling to finish this and put it into production.

I do think that this set of changes could address almost all of the
initial issues.

Den

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-06 15:01 [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation Alberto Garcia
                   ` (3 preceding siblings ...)
  2017-04-12 16:54 ` Denis V. Lunev
@ 2017-04-12 17:55 ` Denis V. Lunev
  2017-04-12 18:20   ` Eric Blake
  4 siblings, 1 reply; 64+ messages in thread
From: Denis V. Lunev @ 2017-04-12 17:55 UTC (permalink / raw)
  To: Alberto Garcia, qemu-devel
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz

On 04/06/2017 06:01 PM, Alberto Garcia wrote:
> Hi all,
>
> over the past couple of months I discussed with some of you the
> possibility to extend the qcow2 format in order to improve its
> performance and reduce its memory requirements (particularly with very
> large images).
>
> After some discussion in the mailing list and the #qemu IRC channel I
> decided to write a prototype of a new extension for qcow2 so I could
> understand better the scope of the changes and have some preliminary
> data about its effects.
>
> This e-mail is the formal presentation of my proposal to extend the
> on-disk qcow2 format. As you can see this is still an RFC. Due to the
> nature of the changes I would like to get as much feedback as possible
> before going forward.
>
> === Problem ===
>
> The original problem that I wanted to address is the memory
> requirements of qcow2 files if you want to have very large images and
> still keep good I/O performance. This is a consequence of its very
> structure, which I'm going to describe now.
>
> A qcow2 image is divided into units of constant size called clusters,
> and among other things it contains metadata that maps guest addresses
> to host addresses (the so-called L1 and L2 tables).
>
> There are two basic problems that result from this:
>
> 1) Reading from or writing to a qcow2 image involves reading the
>    corresponding entry on the L2 table that maps the guest address to
>    the host address. This is very slow because it involves two I/O
>    operations: one on the L2 table and the other one on the actual
>    data cluster.
>
> 2) A cluster is the smallest unit of allocation. Therefore writing a
>    mere 512 bytes to an empty disk requires allocating a complete
>    cluster and filling it with zeroes (or with data from the backing
>    image if there is one). This wastes more disk space and also has a
>    negative impact on I/O.
>
> Problem (1) can be solved by keeping in memory a cache of the L2
> tables (QEMU has an "l2_cache_size" parameter for this purpose). The
> amount of disk space used by L2 tables depends on two factors: the
> disk size and the cluster size.
>
> The cluster size can be configured when the image is created, and it
> can be any power of two between 512 bytes and 2 MB (it defaults to 64
> KB).
>
> The maximum amount of space needed for the L2 tables can be calculated
> with the following formula:
>
>    max_l2_size = virtual_disk_size * 8 / cluster_size
>
> Large images require a large amount of metadata, and therefore a large
> amount of memory for the L2 cache. With the default cluster size
> (64KB) that's 128MB of L2 cache for a 1TB qcow2 image.
>
> The only way to reduce the size of the L2 tables is therefore
> increasing the cluster size, but this makes the image less efficient
> as described earlier in (2).
>
> === The proposal ===
>
> The idea of this proposal is to extend the qcow2 format by allowing
> subcluster allocation. There would be an optional feature that would
> need to be enabled when creating the image. The on-disk format would
> remain essentially the same, except that each data cluster would be
> internally divided into a number of subclusters of equal size.
>
> What this means in practice is that each entry on an L2 table would be
> accompanied by a bitmap indicating the allocation state of each one of
> the subclusters for that cluster. There are several alternatives for
> storing the bitmap, described below.
>
> Other than L2 entries, all other data structures would remain
> unchanged, but for data clusters the smallest unit of allocation would
> now be the subcluster. Reference counting would still be at the
> cluster level, because there's no way to reference individual
> subclusters. Copy-on-write on internal snapshots would need to copy
> complete clusters, so that scenario would not benefit from this
> change.
>
> I see two main use cases for this feature:
>
> a) The qcow2 image is not too large / the L2 cache is not a problem,
>    but you want to increase the allocation performance. In this case
>    you can have something like a 128KB cluster with 4KB subclusters
>    (with 4KB being a common block size in ext4 and other filesystems)
>
> b) The qcow2 image is very large and you want to save metadata space
>    in order to have a smaller L2 cache. In this case you can go for
>    the maximum cluster size (2MB) but you want to have smaller
>    subclusters to increase the allocation performance and optimize the
>    disk usage. This was actually my original use case.
>
> === Test results ===
>
> I have a basic working prototype of this. It's still incomplete -and
> buggy :)- but it gives an idea of what we can expect from it. In my
> implementation each data cluster has 8 subclusters, but that's not set
> in stone (see below).
>
> I made all tests on an SSD drive, writing to an empty qcow2 image with
> a fully populated 40GB backing image, performing random writes using
> fio with a block size of 4KB.
>
> I tried with the default and maximum cluster sizes (64KB and 2MB) and
> also with some other sizes. I also made sure to try with 32KB clusters
> so the subcluster size matches the 4KB block size used for the I/O.
>
> It's important to point out that once a cluster has been completely
> allocated then having subclusters offers no performance benefit. For
> this reason the size of the image for these tests (40GB) was chosen to
> be large enough to guarantee that there are always new clusters being
> allocated. This is therefore a worst-case scenario (or best-case for
> this feature, if you want).
>
> Here are the results (subcluster size in brackets):
>
> |-----------------+----------------+-----------------+-------------------|
> |  cluster size   | subclusters=on | subclusters=off | Max L2 cache size |
> |-----------------+----------------+-----------------+-------------------|
> |   2 MB (256 KB) |   440 IOPS     |  100 IOPS       | 160 KB (*)        |
> | 512 KB  (64 KB) |  1000 IOPS     |  300 IOPS       | 640 KB            |
> |  64 KB   (8 KB) |  3000 IOPS     | 1000 IOPS       |   5 MB            |
> |  32 KB   (4 KB) | 12000 IOPS     | 1300 IOPS       |  10 MB            |
> |   4 KB  (512 B) |   100 IOPS     |  100 IOPS       |  80 MB            |
> |-----------------+----------------+-----------------+-------------------|
>
>                 (*) The L2 cache must be a multiple of the cluster
>                     size, so in this case it must be 2MB. On the table
>                     I chose to show how much of those 2MB are actually
>                     used so you can compare it with the other cases.
>
> Some comments about the results:
>
> - For the 64KB, 512KB and 2MB cases, having subclusters increases
>   write performance roughly by three. This happens because for each
>   cluster allocation there's less data to copy from the backing
>   image. For the same reason, the smaller the cluster, the better the
>   performance. As expected, 64KB clusters with no subclusters perform
>   roughly the same as 512KB clusters with 64KB subclusters.
>
> - The 32KB case is the most interesting one. Without subclusters it's
>   not very different from the 64KB case, but having a subcluster with
>   the same size of the I/O block eliminates the need for COW entirely
>   and the performance skyrockets (10 times faster!).
>
> - 4KB is however very slow. I attribute this to the fact that the
>   cluster size is so small that a new cluster needs to be allocated
>   for every single write and its refcount updated accordingly. The L2
>   and refcount tables are also so small that they are too inefficient
>   and need to grow all the time.
>
> Here are the results when writing to an empty 40GB qcow2 image with no
> backing file. The numbers are of course different but as you can see
> the patterns are similar:
>
> |-----------------+----------------+-----------------+-------------------|
> |  cluster size   | subclusters=on | subclusters=off | Max L2 cache size |
> |-----------------+----------------+-----------------+-------------------|
> |   2 MB (256 KB) |  1200 IOPS     |  255 IOPS       | 160 KB            |
> | 512 KB  (64 KB) |  3000 IOPS     |  700 IOPS       | 640 KB            |
> |  64 KB   (8 KB) |  7200 IOPS     | 3300 IOPS       |   5 MB            |
> |  32 KB   (4 KB) | 12300 IOPS     | 4200 IOPS       |  10 MB            |
> |   4 KB  (512 B) |   100 IOPS     |  100 IOPS       |  80 MB            |
> |-----------------+----------------+-----------------+-------------------|
>
> === Changes to the on-disk format ===
>
> The qcow2 on-disk format needs to change so each L2 entry has a bitmap
> indicating the allocation status of each subcluster. There are three
> possible states (unallocated, allocated, all zeroes), so we need two
> bits per subcluster.
>
> An L2 entry is 64 bits wide, and this is the current format (for
> uncompressed clusters):
>
> 63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
> 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> **<----> <--------------------------------------------------><------->*
>   Rsrved              host cluster offset of data             Reserved
>   (6 bits)                (47 bits)                           (8 bits)
>
>     bit 63: refcount == 1   (QCOW_OFLAG_COPIED)
>     bit 62: compressed = 1  (QCOW_OFLAG_COMPRESSED)
>     bit 0: all zeros        (QCOW_OFLAG_ZERO)
>
> I thought of three alternatives for storing the subcluster bitmaps. I
> haven't made my mind completely about which one is the best one, so
> I'd like to present all three for discussion. Here they are:
>
> (1) Storing the bitmap inside the 64-bit entry
>
>     This is a simple alternative and is the one that I chose for my
>     prototype. There are 14 unused bits plus the "all zeroes" one. If
>     we steal one from the host offset we have the 16 bits that we need
>     for the bitmap and we have 46 bits left for the host offset, which
>     is more than enough.
>
>     * Pros:
>       + Simple. Few changes compared to the current qcow2 format.
>
>     * Cons:
>       - Only 8 subclusters per cluster. We would not be making the
>         most of this feature.
>
>       - No reserved bits left for the future.
>
> (2) Making L2 entries 128-bit wide.
>
>     In this alternative we would double the size of L2 entries. The
>     first half would remain unchanged and the second one would store
>     the bitmap. That would leave us with 32 subclusters per cluster.
>
>     * Pros:
>       + More subclusters per cluster. We could have images with
>         e.g. 128k clusters with 4k subclusters.
>
>     * Cons:
>       - More space needed for L2 entries. The same cluster size would
>         require a cache twice as large, although having subcluster
>         allocation would compensate for this.
>
>       - More changes to the code to handle 128-bit entries.
>
>       - We would still be wasting the 14 reserved bits that L2 entries
>         have.
>
> (3) Storing the bitmap somewhere else
>
>     This would involve storing the bitmap separate from the L2 tables
>     (perhaps using the bitmaps extension? I haven't looked much into
>     this).
>
>     * Pros:
>       + Possibility to make the number of subclusters configurable
>         by the user (32, 64, 128, ...)
>       + All existing metadata structures would remain untouched
>         (although the "all zeroes" bit in L2 entries would probably
>         become unused).
>
>     * Cons:
>       - As with alternative (2), more space needed for metadata.
>
>       - The bitmap would also need to be cached for performance
>         reasons.
>
>       - Possibly one more *_cache_size option.
>
>       - One more metadata structure to be updated for each
>         allocation. This would probably impact I/O negatively.
>
> === Compressed clusters ===
>
> My idea is that compressed clusters would remain the same. They are
> read-only anyway so they would not be affected by any of these
> changes.
>
> ===========================
>
> I think I managed to summarize all the ideas that I had in my mind,
> but I'm sure you probably have questions and comments, so I'd be happy
> to get as much feedback as possible.
>
> So, looking forward to reading your opinions.
>
> Regards,
>
> Berto
>
Let me rephrase a bit.

The proposal looks very close to the following case:
- a raw sparse file

In that case all writes are very, very fast and from the guest's point
of view everything is OK; sequential data is really sequential. But once
we later start performing sequential I/O on such a disk, the real pain
begins: each sequential guest operation becomes random on the host file
system and the I/O becomes very slow. This does not show up in the
random-write test above, but the performance will degrade very soon.

This is why raw sparse files are not used in real life. The hypervisor
must maintain guest OS invariants: data that is nearby from the guest's
point of view should be kept nearby on the host.

This is actually why 64 KB data blocks are extremely small :) OK, this
is off-topic.

One can easily recreate this case with the following simple
test:
- write every even 4 KB page of the disk, one by one
- write every odd 4 KB page of the disk
- run a sequential read with e.g. a 1 MB block size

Normally we should still get close to native read performance, but with
raw sparse files and (as far as I understand the proposal) subclusters
the host I/O pattern behind that read will be effectively random. A
sketch that reproduces this test follows below.
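
Something like this, run inside the guest against its virtual disk
(untested; the sizes are arbitrary, and fio can express the same thing):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define PAGE      4096
    #define MB        (1024 * 1024)
    #define DISK_SIZE ((uint64_t)1024 * MB)   /* test 1 GB of the disk (example) */

    int main(int argc, char **argv)
    {
        char *buf;
        int fd;

        if (argc < 2 || (fd = open(argv[1], O_RDWR | O_DIRECT)) < 0 ||
            posix_memalign((void **)&buf, PAGE, MB) != 0) {
            return 1;
        }
        memset(buf, 0xaa, MB);

        /* pass 1: write every even 4 KB page, one by one */
        for (uint64_t off = 0; off < DISK_SIZE; off += 2 * PAGE) {
            pwrite(fd, buf, PAGE, (off_t)off);
        }
        /* pass 2: write every odd 4 KB page */
        for (uint64_t off = PAGE; off < DISK_SIZE; off += 2 * PAGE) {
            pwrite(fd, buf, PAGE, (off_t)off);
        }
        /* pass 3: sequential 1 MB reads; with a fragmented allocation the
         * host I/O behind these reads is effectively random */
        for (uint64_t off = 0; off < DISK_SIZE; off += MB) {
            pread(fd, buf, MB, (off_t)off);
        }
        close(fd);
        return 0;
    }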

To me this looks like a big and unavoidable problem of the approach.
We still have the potential to improve the current algorithms without
introducing incompatible changes.

Sorry if this is too emotional. We have learned the above the hard way.

Den

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-12 17:55 ` Denis V. Lunev
@ 2017-04-12 18:20   ` Eric Blake
  2017-04-12 19:02     ` Denis V. Lunev
  0 siblings, 1 reply; 64+ messages in thread
From: Eric Blake @ 2017-04-12 18:20 UTC (permalink / raw)
  To: Denis V. Lunev, Alberto Garcia, qemu-devel
  Cc: Kevin Wolf, qemu-block, Stefan Hajnoczi, Max Reitz


On 04/12/2017 12:55 PM, Denis V. Lunev wrote:
> Let me rephrase a bit.
> 
> The proposal is looking very close to the following case:
> - raw sparse file
> 
> In this case all writes are very-very-very fast and from the
> guest point of view all is OK. Sequential data is really sequential.
> Though once we are starting to perform any sequential IO, we
> have real pain. Each sequential operation becomes random
> on the host file system and the IO becomes very slow. This
> will not be observed with the test, but the performance will
> degrade very soon.
> 
> This is why raw sparse files are not used in the real life.
> Hypervisor must maintain guest OS invariants and the data,
> which is nearby from the guest point of view should be kept
> nearby in host.
> 
> This is why actually that 64kb data blocks are extremely
> small :) OK. This is offtopic.

Not necessarily. Using subclusters may allow you to ramp up to larger
cluster sizes. We can also set up our allocation (and pre-allocation)
schemes so that we always reserve an entire cluster on the host at the
time we allocate the cluster, even if we only plan to write to
particular subclusters within that cluster.  In fact, splitting a 2M
cluster into 32 subclusters results in 64k subclusters, where you are
still writing in 64k data chunks but could now have guaranteed 2M
locality, compared to the current qcow2 with 64k clusters, which writes
in 64k data chunks but with no locality.

Just because we don't write the entire cluster up front does not mean
that we don't have to allocate (or have a mode that allocates) the
entire cluster at the time of the first subcluster use.
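
Roughly like this (a hedged sketch with plain POSIX calls and made-up
parameter names, not a statement about what the final implementation
would look like):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    /* On the first write touching an unallocated cluster, reserve the whole
     * host cluster with fallocate() so it stays physically contiguous, but
     * transfer only the subcluster data the guest actually wrote. */
    static int alloc_cluster_and_write(int fd, uint64_t cluster_off,
                                       size_t cluster_size, size_t sc_off,
                                       const void *data, size_t len)
    {
        /* reserve the whole cluster; no data is transferred here */
        if (fallocate(fd, 0, (off_t)cluster_off, (off_t)cluster_size) < 0) {
            return -1;
        }
        /* write only the touched subcluster; the rest stays marked as
         * unallocated in the subcluster bitmap */
        if (pwrite(fd, data, len, (off_t)(cluster_off + sc_off)) != (ssize_t)len) {
            return -1;
        }
        return 0;
    }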

> 
> One can easily recreate this case using the following simple
> test:
> - write each even 4kb page of the disk, one by one
> - write each odd 4 kb page of the disk
> - run sequential read with f.e. 1 MB data block
> 
> Normally we should still have native performance, but
> with raw sparse files and (as far as understand the
> proposal) sub-clusters we will have the host IO pattern
> exactly like random.

Only if we don't pre-allocate entire clusters at the point that we first
touch the cluster.

> 
> This seems like a big and inevitable problem of the approach
> for me. We still have the potential to improve current
> algorithms and not introduce non-compatible changes.
> 
> Sorry if this is too emotional. We have learned above in a
> very hard way.

And your experience is useful, as a way to fine-tune this proposal.  But
it doesn't mean we should entirely ditch this proposal.  I also
appreciate that you have patches in the works to reduce bottlenecks
(such as turning sub-cluster writes into 3 IOPs rather than 5, by doing
read-head, read-tail, write-cluster, instead of the current read-head,
write-head, write-body, read-tail, write-tail), but think that both
approaches are complementary, not orthogonal.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-12 18:20   ` Eric Blake
@ 2017-04-12 19:02     ` Denis V. Lunev
  2017-04-13  9:44       ` Kevin Wolf
  0 siblings, 1 reply; 64+ messages in thread
From: Denis V. Lunev @ 2017-04-12 19:02 UTC (permalink / raw)
  To: Eric Blake, Alberto Garcia, qemu-devel
  Cc: Kevin Wolf, qemu-block, Stefan Hajnoczi, Max Reitz

On 04/12/2017 09:20 PM, Eric Blake wrote:
> On 04/12/2017 12:55 PM, Denis V. Lunev wrote:
>> Let me rephrase a bit.
>>
>> The proposal is looking very close to the following case:
>> - raw sparse file
>>
>> In this case all writes are very-very-very fast and from the
>> guest point of view all is OK. Sequential data is really sequential.
>> Though once we are starting to perform any sequential IO, we
>> have real pain. Each sequential operation becomes random
>> on the host file system and the IO becomes very slow. This
>> will not be observed with the test, but the performance will
>> degrade very soon.
>>
>> This is why raw sparse files are not used in the real life.
>> Hypervisor must maintain guest OS invariants and the data,
>> which is nearby from the guest point of view should be kept
>> nearby in host.
>>
>> This is why actually that 64kb data blocks are extremely
>> small :) OK. This is offtopic.
> Not necessarily. Using subclusters may allow you to ramp up to larger
> cluster sizes. We can also set up our allocation (and pre-allocation
> schemes) so that we always reserve an entire cluster on the host at the
> time we allocate the cluster, even if we only plan to write to
> particular subclusters within that cluster.  In fact, 32 subclusters to
> a 2M cluster results in 64k subclusters, where you are still writing at
> 64k data chunks but could now have guaranteed 2M locality, compared to
> the current qcow2 with 64k clusters that writes in 64k data chunks but
> with no locality.
>
> Just because we don't write the entire cluster up front does not mean
> that we don't have to allocate (or have a mode that allocates) the
> entire cluster at the time of the first subcluster use.

This is something that I do not understand. We reserve the entire
cluster at allocation time, so why do we need subclusters at cluster
"creation" when there is no COW? fallocate() and preallocation already
cover this stage completely and solve all the bottlenecks we have.
A 4k/8k granularity for the L2 cache solves the metadata write problem,
but IMHO that is not important; normally we sync metadata at guest sync.

The only difference I observe in this case is the "copy-on-write"
pattern of the workload with a backing store or snapshot, where we copy
only a partial cluster. Thus we should clearly state that this is the
only area of improvement and start the discussion from that point.
Simple cluster creation is not the problem anymore. I think this
reduces the scope of the proposal a lot.

The initial proposal starts by stating two problems:

"1) Reading from or writing to a qcow2 image involves reading the
   corresponding entry on the L2 table that maps the guest address to
   the host address. This is very slow because it involves two I/O
   operations: one on the L2 table and the other one on the actual
   data cluster.

2) A cluster is the smallest unit of allocation. Therefore writing a
   mere 512 bytes to an empty disk requires allocating a complete
   cluster and filling it with zeroes (or with data from the backing
   image if there is one). This wastes more disk space and also has a
   negative impact on I/O."

With pre-allocation, (2) would be exactly the same as now, and the gain
from subclusters would be effectively zero, as we would still have to
preallocate the entire cluster.

(1) is also questionable. I think that the root of the problem is the
cost of an L2 cache miss, which is enormous. With a 1 MB or 2 MB cluster
the cost of a cache miss is not acceptable at all. With page granularity
in the L2 cache this problem is seriously reduced, and we can switch to
bigger clusters without much trouble. Again, the only remaining problem
is COW.

Thus I think that the proposal should be seriously re-analyzed and refined
with this input.

>> One can easily recreate this case using the following simple
>> test:
>> - write each even 4kb page of the disk, one by one
>> - write each odd 4 kb page of the disk
>> - run sequential read with f.e. 1 MB data block
>>
>> Normally we should still have native performance, but
>> with raw sparse files and (as far as understand the
>> proposal) sub-clusters we will have the host IO pattern
>> exactly like random.
> Only if we don't pre-allocate entire clusters at the point that we first
> touch the cluster.
>
>> This seems like a big and inevitable problem of the approach
>> for me. We still have the potential to improve current
>> algorithms and not introduce non-compatible changes.
>>
>> Sorry if this is too emotional. We have learned above in a
>> very hard way.
> And your experience is useful, as a way to fine-tune this proposal.  But
> it doesn't mean we should entirely ditch this proposal.  I also
> appreciate that you have patches in the works to reduce bottlenecks
> (such as turning sub-cluster writes into 3 IOPs rather than 5, by doing
> read-head, read-tail, write-cluster, instead of the current read-head,
> write-head, write-body, read-tail, write-tail), but think that both
> approaches are complimentary, not orthogonal.
>
Thank you :) I just prefer to exhaust the compatible changes first and
start on incompatible ones after that.

There are really a lot of other possibilities for viable optimizations
which are not yet done on top of the proposed ones:
- I/O plug/unplug support at the QCOW2 level. Plugging in the controller
  is definitely not enough: it affects only the first I/O operation,
  while we could have a bunch of them.
- Sort and merge the request list on submit.
- Direct AIO read/write support, to avoid extra coroutine creation for
  read/write ops when we do several operations in parallel in
  qcow2_co_readv/writev. Right now AIO operations are emulated via
  coroutines, which has some impact.
- Offload compression/decompression/encryption to a side thread.
- Optimize sequential write operations not aligned to the cluster
  boundary when the cluster is not allocated initially.

Maybe it would be useful to create an intermediate DIO structure for
each I/O operation, carrying the offset/iovec, like it is done in the
kernel. I do think that such compatible changes could improve raw
performance 2-3 times even with the current format, which is the kind
of improvement promised by the proposal.

Den

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-12 14:10             ` Max Reitz
@ 2017-04-13  8:05               ` Alberto Garcia
  2017-04-13  9:02                 ` Kevin Wolf
  0 siblings, 1 reply; 64+ messages in thread
From: Alberto Garcia @ 2017-04-13  8:05 UTC (permalink / raw)
  To: Max Reitz, Eric Blake, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block

On Wed 12 Apr 2017 04:10:46 PM CEST, Max Reitz <mreitz@redhat.com> wrote:
>> I still don't see why we can always assume OFLAG_COPIED. Before doing
>> the COW one cluster can have references from multiple snapshots,
>
> Yes...
>
>> and OFLAG_COPIED is equally valid in that context.
>
> In what context? When having subclusters? Why?
>
>> We still need to know if we need to perform COW when writing to it,
>> regardless of whether it has subclusters or not.
>
> But why would you reference a cluster multiple times if it has
> subclusters? Yes, you can do it in theory, but we could just disallow
> it, because it doesn't make sense.

We discussed this yesterday on IRC, but I'm writing the summary here for
reference:

if you want to be able to create internal snapshots quickly without having
to copy a lot of data, you need to be able to have multiple references
to one cluster.

Anyway, it seems that this discussion is only relevant if we're trying
to save as many bits as possible because we want to store everything in
a 64-bit entry (alternative (1) from my original e-mail). I agree that
for that alternative the usefulness of that bit can be put into
question.

Berto

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-13  8:05               ` Alberto Garcia
@ 2017-04-13  9:02                 ` Kevin Wolf
  2017-04-13  9:05                   ` Alberto Garcia
  0 siblings, 1 reply; 64+ messages in thread
From: Kevin Wolf @ 2017-04-13  9:02 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Max Reitz, Eric Blake, qemu-devel, Stefan Hajnoczi, qemu-block

Am 13.04.2017 um 10:05 hat Alberto Garcia geschrieben:
> On Wed 12 Apr 2017 04:10:46 PM CEST, Max Reitz <mreitz@redhat.com> wrote:
> >> I still don't see why we can always assume OFLAG_COPIED. Before doing
> >> the COW one cluster can have references from multiple snapshots,
> >
> > Yes...
> >
> >> and OFLAG_COPIED is equally valid in that context.
> >
> > In what context? When having subclusters? Why?
> >
> >> We still need to know if we need to perform COW when writing to it,
> >> regardless of whether it has subclusters or not.
> >
> > But why would you reference a cluster multiple times if it has
> > subclusters? Yes, you can do it in theory, but we could just disallow
> > it, because it doesn't make sense.
> 
> We discussed this yesterday on IRC, but I'm writing the summary here for
> reference:
> 
> if you want be able to create internal snapshots quickly without having
> to copy a lot of data, you need to be able to have multiple references
> to one cluster.
> 
> Anyway, it seems that this discussion is only relevant if we're trying
> to save as many bits as possible because we want to store everything in
> a 64-bit entry -alternative (1) from my original e-mail-. I agree that
> for that alternative the usefulness of that bit can be put into
> question.

I think you still need it if you don't want to look at the refcount
blocks for every write. When you take an internal snapshot, you just
increase the refcount of the L2 tables at first and keep the contents
the same, including the subcluster information. On the first write to
the cluster, like with normal images you need to copy the whole cluster,
and whether this is the case is determined with the COPIED flag. (The
copy can actually keep unallocated/zeroed subclusters that way, but data
has to be copied for the whole cluster.)

Of course, in all cases the COPIED flag is just an optimisation because
it's always possible to look at the refcount blocks, but I don't think
there is any difference between subclustered and traditional images in
this respect.
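
Spelled out as a minimal sketch (not the actual qcow2 code, which can
also fall back to the refcount table):

    #include <stdbool.h>
    #include <stdint.h>

    #define QCOW_OFLAG_COPIED     (1ULL << 63)
    #define QCOW_OFLAG_COMPRESSED (1ULL << 62)

    /* Write-path decision, unchanged by subclusters: if COPIED is not set,
     * the cluster may still be referenced by an internal snapshot, so the
     * whole cluster must be copied before the write. The copy can carry the
     * subcluster bitmap over, so unallocated/zero subclusters need not be
     * read from anywhere. */
    static bool needs_full_cluster_cow(uint64_t l2_entry)
    {
        if (l2_entry & QCOW_OFLAG_COMPRESSED) {
            return true;   /* compressed clusters are always rewritten */
        }
        return !(l2_entry & QCOW_OFLAG_COPIED);
    }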

Kevin

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-13  9:02                 ` Kevin Wolf
@ 2017-04-13  9:05                   ` Alberto Garcia
  0 siblings, 0 replies; 64+ messages in thread
From: Alberto Garcia @ 2017-04-13  9:05 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Max Reitz, Eric Blake, qemu-devel, Stefan Hajnoczi, qemu-block

On Thu 13 Apr 2017 11:02:10 AM CEST, Kevin Wolf <kwolf@redhat.com> wrote:
> I think you still need it if you don't want to look at the refcount
> blocks for every write. When you take an internal snapshot, you just
> increase the refcount of the L2 tables at first and keep the contents
> the same, including the subcluster information. On the first write to
> the cluster, like with normal images you need to copy the whole
> cluster, and whether this is the case is determined with the COPIED
> flag. (The copy can actually keep unallocated/zeroed subclusters that
> way, but data has to be copied for the whole cluster.)
>
> Of course, in all cases the COPIED flag is just an optimisation
> because it's always possible to look at the refcount blocks, but I
> don't think there is any difference between subclustered and
> traditional images in this respect.

Exactly :)

Berto

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-12 19:02     ` Denis V. Lunev
@ 2017-04-13  9:44       ` Kevin Wolf
  2017-04-13 10:19         ` Denis V. Lunev
  0 siblings, 1 reply; 64+ messages in thread
From: Kevin Wolf @ 2017-04-13  9:44 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Eric Blake, Alberto Garcia, qemu-devel, qemu-block,
	Stefan Hajnoczi, Max Reitz

Am 12.04.2017 um 21:02 hat Denis V. Lunev geschrieben:
> On 04/12/2017 09:20 PM, Eric Blake wrote:
> > On 04/12/2017 12:55 PM, Denis V. Lunev wrote:
> >> Let me rephrase a bit.
> >>
> >> The proposal is looking very close to the following case:
> >> - raw sparse file
> >>
> >> In this case all writes are very-very-very fast and from the
> >> guest point of view all is OK. Sequential data is really sequential.
> >> Though once we are starting to perform any sequential IO, we
> >> have real pain. Each sequential operation becomes random
> >> on the host file system and the IO becomes very slow. This
> >> will not be observed with the test, but the performance will
> >> degrade very soon.
> >>
> >> This is why raw sparse files are not used in the real life.
> >> Hypervisor must maintain guest OS invariants and the data,
> >> which is nearby from the guest point of view should be kept
> >> nearby in host.
> >>
> >> This is why actually that 64kb data blocks are extremely
> >> small :) OK. This is offtopic.
> > Not necessarily. Using subclusters may allow you to ramp up to larger
> > cluster sizes. We can also set up our allocation (and pre-allocation
> > schemes) so that we always reserve an entire cluster on the host at the
> > time we allocate the cluster, even if we only plan to write to
> > particular subclusters within that cluster.  In fact, 32 subclusters to
> > a 2M cluster results in 64k subclusters, where you are still writing at
> > 64k data chunks but could now have guaranteed 2M locality, compared to
> > the current qcow2 with 64k clusters that writes in 64k data chunks but
> > with no locality.
> >
> > Just because we don't write the entire cluster up front does not mean
> > that we don't have to allocate (or have a mode that allocates) the
> > entire cluster at the time of the first subcluster use.
> 
> this is something that I do not understand. We reserve the entire cluster at
> allocation. Why do we need sub-clusters at cluster "creation" without COW?
> fallocate() and preallocation completely covers this stage for now in
> full and
> solve all botllenecks we have. 4k/8k granularity of L2 cache solves metadata
> write problem. But IMHO it is not important. Normally we sync metadata
> at guest sync.
> 
> The only difference I am observing in this case is "copy-on-write" pattern
> of the load with backing store or snapshot, where we copy only partial
> cluster.
> Thus we should clearly define that this is the only area of improvement and
> start discussion from this point. Simple cluster creation is not the problem
> anymore. I think that this reduces the scope of the proposal a lot.

I think subclusters have two different valid use cases:

1. The first one is what you describe and what I was mostly interested
   in: By reducing the subcluster size to the file system block size, we
   can completely avoid any COW because there won't be partial writes
   any more.

   You attended my KVM Forum talk two years ago where I described how
   COW is the biggest pain point for qcow2 performance, costing about
   50% of performance for initial writes after taking a snapshot. So
   while you're right that many other improvements are possible, I think
   this is one of the most important points to address.

2. The other use case is what Berto had in mind: Keep subclusters at the
   current cluster size (to avoid getting even larger COWs), but
   increase the cluster size. This reduces the metadata size and allows
   to cover a larger range with the same L2 cache size. Additionally, it
   can possibly reduce fragmentation of the image on the file system
   level.

The fundamental observation in both cases is that it's impractical to
use the same granularity for cluster mapping (want larger sizes) and for
status tracking (want small sizes to avoid partial writes).

> Initial proposal starts from stating 2 problems:
> 
> "1) Reading from or writing to a qcow2 image involves reading the
>    corresponding entry on the L2 table that maps the guest address to
>    the host address. This is very slow because it involves two I/O
>    operations: one on the L2 table and the other one on the actual
>    data cluster.
> 
> 2) A cluster is the smallest unit of allocation. Therefore writing a
>    mere 512 bytes to an empty disk requires allocating a complete
>    cluster and filling it with zeroes (or with data from the backing
>    image if there is one). This wastes more disk space and also has a
>    negative impact on I/O."
> 
> With pre-allocation (2) would be exactly the same as now and all
> gain with sub-clusters will be effectively 0 as we will have to
> preallocate entire cluster.

To be honest, I'm not sure if we should do preallocation or whether
that's the file system's job.

In any case, the big improvement is that we don't have to read from the
backing file, so even if we keep preallocating the whole cluster, we'd
gain something there. I also think that preallocating would use
something like fallocate() rather than pwrite(), so it should involve a
lot less I/O.

> (1) is also questionable. I think that the root of the problem
> is the cost of L2 cache miss, which is giant. With 1 Mb or 2 Mb
> cluster the cost of the cache miss is not acceptable at all.
> With page granularity of L2 cache this problem is seriously
> reduced. We can switch to bigger blocks without much problem.
> Again, the only problem is COW.

Yes, with larger cluster sizes, the L2 cache sucks currently. It needs
to be able to cache partial tables.

With the currently common cluster sizes, I don't think it makes a big
difference, though. As long as you end up making a single request, 4k or
64k size isn't that different, and three 4k requests are almost certainly
slower than one 64k request.

So I agree, for use case 2, some work on the cache is required in
addition to increasing the cluster size.

> There are really a lot of other possibilities for viable optimizations,
> which
> are not yet done on top of proposed ones:
> - IO plug/unplug support at QCOW2 level. plug in controller is definitely
>   not enough. This affects only the first IO operation while we could have
>   a bunch of them
> - sort and merge requests list in submit
> - direct AIO read/write support to avoid extra coroutine creation for
>   read-write ops if we are doing several operations in parallel in
>   qcow2_co_readv/writev. Right now AIO operations are emulated
>   via coroutines which have some impact
> - offload compression/decompression/encryption to side thread
> - optimize sequential write operation not aligned to the cluster boundary
>   if cluster is not allocated initially
> 
> May be it would be useful to create intermediate DIO structure for IO
> operation which will carry offset/iovec on it like done in kernel. I do
> think
> that such compatible changes could improve raw performance even
> with the current format 2-3 times, which is brought out by the proposal.

Well, if you can show a concrete optimisation and ideally also numbers,
that would certainly be interesting. I am very skeptical of a 2-3 times
improvement (not least because we would be well over native
performance then...), but I'm happy to be convinced otherwise. Maybe
start a new thread like this one if/when you think you have a detailed
idea for one of them that is ready for discussion.

The one that I actually think could make a big difference for its use
case is the compression/decompression/encryption one, but honestly, if
someone really cared about these, I think there would be lower hanging
fruit (e.g. we're currently reading in data only cluster by cluster for
compressed images).

Kevin

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-13  9:44       ` Kevin Wolf
@ 2017-04-13 10:19         ` Denis V. Lunev
  2017-04-14  1:06           ` [Qemu-devel] [Qemu-block] " John Snow
  0 siblings, 1 reply; 64+ messages in thread
From: Denis V. Lunev @ 2017-04-13 10:19 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Eric Blake, Alberto Garcia, qemu-devel, qemu-block,
	Stefan Hajnoczi, Max Reitz

On 04/13/2017 12:44 PM, Kevin Wolf wrote:
> Am 12.04.2017 um 21:02 hat Denis V. Lunev geschrieben:
>> On 04/12/2017 09:20 PM, Eric Blake wrote:
>>> On 04/12/2017 12:55 PM, Denis V. Lunev wrote:
>>>> Let me rephrase a bit.
>>>>
>>>> The proposal is looking very close to the following case:
>>>> - raw sparse file
>>>>
>>>> In this case all writes are very-very-very fast and from the
>>>> guest point of view all is OK. Sequential data is really sequential.
>>>> Though once we are starting to perform any sequential IO, we
>>>> have real pain. Each sequential operation becomes random
>>>> on the host file system and the IO becomes very slow. This
>>>> will not be observed with the test, but the performance will
>>>> degrade very soon.
>>>>
>>>> This is why raw sparse files are not used in the real life.
>>>> Hypervisor must maintain guest OS invariants and the data,
>>>> which is nearby from the guest point of view should be kept
>>>> nearby in host.
>>>>
>>>> This is why actually that 64kb data blocks are extremely
>>>> small :) OK. This is offtopic.
>>> Not necessarily. Using subclusters may allow you to ramp up to larger
>>> cluster sizes. We can also set up our allocation (and pre-allocation
>>> schemes) so that we always reserve an entire cluster on the host at the
>>> time we allocate the cluster, even if we only plan to write to
>>> particular subclusters within that cluster.  In fact, 32 subclusters to
>>> a 2M cluster results in 64k subclusters, where you are still writing at
>>> 64k data chunks but could now have guaranteed 2M locality, compared to
>>> the current qcow2 with 64k clusters that writes in 64k data chunks but
>>> with no locality.
>>>
>>> Just because we don't write the entire cluster up front does not mean
>>> that we don't have to allocate (or have a mode that allocates) the
>>> entire cluster at the time of the first subcluster use.
>> this is something that I do not understand. We reserve the entire cluster at
>> allocation. Why do we need sub-clusters at cluster "creation" without COW?
>> fallocate() and preallocation completely covers this stage for now in
>> full and
>> solve all botllenecks we have. 4k/8k granularity of L2 cache solves metadata
>> write problem. But IMHO it is not important. Normally we sync metadata
>> at guest sync.
>>
>> The only difference I am observing in this case is "copy-on-write" pattern
>> of the load with backing store or snapshot, where we copy only partial
>> cluster.
>> Thus we should clearly define that this is the only area of improvement and
>> start discussion from this point. Simple cluster creation is not the problem
>> anymore. I think that this reduces the scope of the proposal a lot.
> I think subclusters have two different valid use cases:
>
> 1. The first one is what you describe and what I was mostly interested
>    in: By reducing the subcluster size to the file system block size, we
>    can completely avoid any COW because there won't be partial writes
>    any more.
>
>    You attended my KVM Forum talk two years ago where I described how
>    COW is the biggest pain point for qcow2 performance, costing about
>    50% of performance for initial writes after taking a snapshot. So
>    while you're right that many other improvements are possible, I think
>    this is one of the most important points to address.
>
> 2. The other use case is what Berto had in mind: Keep subclusters at the
>    current cluster size (to avoid getting even larger COWs), but
>    increase the cluster size. This reduces the metadata size and allows
>    to cover a larger range with the same L2 cache size. Additionally, it
>    can possibly reduce fragmentation of the image on the file system
>    level.
>
> The fundamental observation in both cases is that it's impractical to
> use the same granularity for cluster mapping (want larger sizes) and for
> status tracking (want small sizes to avoid partial writes).
Fragmented images look fine at the initial moment, while the I/O comes
in for the first time in small pieces. But they become a real pain
later on, once the guest REUSES this area for sequential data written
in big chunks, or even just reads this data back sequentially.

You will get random reads on the host instead of sequential reads in
the guest, and that is a serious performance problem. The guest really
believes that data with similar LBAs is adjacent and optimizes its I/O
keeping this in mind. With clusters/subclusters/etc you break this
fundamental assumption. This is not visible initially, but it will
show up later on.


>> Initial proposal starts from stating 2 problems:
>>
>> "1) Reading from or writing to a qcow2 image involves reading the
>>    corresponding entry on the L2 table that maps the guest address to
>>    the host address. This is very slow because it involves two I/O
>>    operations: one on the L2 table and the other one on the actual
>>    data cluster.
>>
>> 2) A cluster is the smallest unit of allocation. Therefore writing a
>>    mere 512 bytes to an empty disk requires allocating a complete
>>    cluster and filling it with zeroes (or with data from the backing
>>    image if there is one). This wastes more disk space and also has a
>>    negative impact on I/O."
>>
>> With pre-allocation (2) would be exactly the same as now and all
>> gain with sub-clusters will be effectively 0 as we will have to
>> preallocate entire cluster.
> To be honest, I'm not sure if we should do preallocation or whether
> that's the file system's job.
>
> In any case, the big improvement is that we don't have to read from the
> backing file, so even if we keep preallocating the whole cluster, we'd
> gain something there. I also think that preallocating would use
> something like fallocate() rather than pwrite(), so it should involve a
> lot less I/O.
>
Yes. fallocate() works pretty well and we end up with almost the same
amount of I/O as submitted by the guest. This also helps a lot for
sequential writes that are not aligned to the cluster boundary.
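Roughly what I have in mind for the write path, as a standalone sketch
with made-up names (not the actual qcow2 code), for an image without a
backing file:

----------------------------------------------------------------
/* First write into a previously unallocated cluster: reserve the whole
 * cluster, then write only the bytes the guest submitted.  The
 * untouched parts read back as zeroes without any extra I/O. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define CLUSTER_SIZE (64 * 1024)

int write_to_new_cluster(int fd, off_t host_cluster_off,
                         off_t offset_in_cluster, const void *buf,
                         size_t len)
{
    if (fallocate(fd, 0, host_cluster_off, CLUSTER_SIZE) < 0) {
        return -1;
    }
    /* Only the guest data itself hits the disk. */
    if (pwrite(fd, buf, len, host_cluster_off + offset_in_cluster) < 0) {
        return -1;
    }
    return 0;
}
----------------------------------------------------------------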


>> (1) is also questionable. I think that the root of the problem
>> is the cost of L2 cache miss, which is giant. With 1 Mb or 2 Mb
>> cluster the cost of the cache miss is not acceptable at all.
>> With page granularity of L2 cache this problem is seriously
>> reduced. We can switch to bigger blocks without much problem.
>> Again, the only problem is COW.
> Yes, with larger cluster sizes, the L2 cache sucks currently. It needs
> to be able to cache partial tables.
>
> With the currently common cluster sizes, I don't think it makes a big
> difference, though. As long as you end up making a single request, 4k or
> 64k size isn't that different, and three 4k requests are almost certainly
> slower than one 64k request.
>
> So I agree, for use case 2, some work on the cache is required in
> addition to increasing the cluster size.
This helps even with a 64kb cluster size. With a fragmented guest
filesystem the dataset can be quite fragmented as well, and we can
technically use much less memory to cover it if we read the L2 cache in
page-sized chunks.
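A back-of-the-envelope illustration (assuming 8-byte L2 entries, 64kb
L2 table reads today vs. 4kb cache chunks, and a worst case where every
touched data cluster lives in a different L2 table):

----------------------------------------------------------------
#include <stdio.h>

int main(void)
{
    const unsigned long touched     = 1000;       /* scattered data clusters */
    const unsigned long table_chunk = 64 * 1024;  /* today: whole L2 cluster */
    const unsigned long page_chunk  = 4 * 1024;   /* proposed: 4k cache unit */
    const unsigned long entry_size  = 8;          /* bytes per L2 entry      */

    /* Cache memory needed to keep the metadata of all touched clusters
     * resident, depending on the granularity of a cache entry. */
    printf("cache needed, 64k chunks: %lu KB\n", touched * table_chunk / 1024);
    printf("cache needed,  4k chunks: %lu KB\n", touched * page_chunk / 1024);
    printf("entries actually used:    %lu KB\n", touched * entry_size / 1024);
    return 0;
}
----------------------------------------------------------------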


>> There are really a lot of other possibilities for viable optimizations,
>> which
>> are not yet done on top of proposed ones:
>> - IO plug/unplug support at QCOW2 level. plug in controller is definitely
>>   not enough. This affects only the first IO operation while we could have
>>   a bunch of them
>> - sort and merge requests list in submit
>> - direct AIO read/write support to avoid extra coroutine creation for
>>   read-write ops if we are doing several operations in parallel in
>>   qcow2_co_readv/writev. Right now AIO operations are emulated
>>   via coroutines which have some impact
>> - offload compression/decompression/encryption to side thread
>> - optimize sequential write operation not aligned to the cluster boundary
>>   if cluster is not allocated initially
>>
>> May be it would be useful to create intermediate DIO structure for IO
>> operation which will carry offset/iovec on it like done in kernel. I do
>> think
>> that such compatible changes could improve raw performance even
>> with the current format 2-3 times, which is brought out by the proposal.
> Well, if you can show a concrete optimisation and ideally also numbers,
> that would certainly be interesting. I am very skeptical of 2-3 times
> improvement (not the least because we would be well over native
> performance then...), but I'm happy to be convinced otherwise. Maybe
> start a new thread like this one if/when you think you have a detailled
> idea for one of them that is ready for discussion.
>
> The one that I actually think could make a big difference for its use
> case is the compression/decompression/encryption one, but honestly, if
> someone really cared about these, I think there would be lower hanging
> fruit (e.g. we're currently reading in data only cluster by cluster for
> compressed images).
In progress. Compression has a lot of troubles ;)

Den

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-12 16:54 ` Denis V. Lunev
@ 2017-04-13 11:58   ` Alberto Garcia
  2017-04-13 12:44     ` Denis V. Lunev
  0 siblings, 1 reply; 64+ messages in thread
From: Alberto Garcia @ 2017-04-13 11:58 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-devel
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz

On Wed 12 Apr 2017 06:54:50 PM CEST, Denis V. Lunev wrote:
> My opinion about this approach is very negative as the problem could
> be (partially) solved in a much better way.

Hmm... it seems to me that (some of) the problems you are describing are
different from the ones this proposal tries to address. Not that I
disagree with them! I think you are giving useful feedback :)

> 1) current L2 cache management seems very wrong to me. Each cache
>     miss means that we have to read entire L2 cache block. This means
>     that in the worst case (when dataset of the test does not fit L2
>     cache size we read 64kb of L2 table for each 4 kb read).
>
>     The situation is MUCH worse once we are starting to increase
>     cluster size. For 1 Mb blocks we have to read 1 Mb on each cache
>     miss.
>
>     The situation can be cured immediately once we will start reading
>     L2 cache with 4 or 8kb chunks. We have patchset for this for our
>     downstream and preparing it for upstream.

Correct, although the impact of this depends on whether you are using
an SSD or an HDD.

With an SSD what you want is to minimize the number of unnecessary
reads, so reading small chunks will likely increase the performance when
there's a cache miss.

With an HDD what you want is to minimize the number of seeks. Once you
have moved the disk head to the location where the cluster is, reading
the whole cluster is relatively inexpensive, so (leaving the memory
requirements aside) you generally want to read as much as possible.

> 2) yet another terrible thing in cluster allocation is its allocation
>     strategy.
>     Current QCOW2 codebase implies that we need 5 (five) IOPSes to
>     complete COW operation. We are reading head, writing head, reading
>     tail, writing tail, writing actual data to be written. This could
>     be easily reduced to 3 IOPSes.

That sounds right, but I'm not sure if this is really incompatible with
my proposal :)

>     Another problem is the amount of data written. We are writing
>     entire cluster in write operation and this is also insane. It is
>     possible to perform fallocate() and actual data write on normal
>     modern filesystem.

But that only works when filling the cluster with zeroes, doesn't it? If
there's a backing image you need to bring all the contents from there.

Berto

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-13 11:58   ` Alberto Garcia
@ 2017-04-13 12:44     ` Denis V. Lunev
  2017-04-13 13:05       ` Kevin Wolf
  2017-04-13 13:21       ` Alberto Garcia
  0 siblings, 2 replies; 64+ messages in thread
From: Denis V. Lunev @ 2017-04-13 12:44 UTC (permalink / raw)
  To: Alberto Garcia, qemu-devel
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz

On 04/13/2017 02:58 PM, Alberto Garcia wrote:
> On Wed 12 Apr 2017 06:54:50 PM CEST, Denis V. Lunev wrote:
>> My opinion about this approach is very negative as the problem could
>> be (partially) solved in a much better way.
> Hmm... it seems to me that (some of) the problems you are describing are
> different from the ones this proposal tries to address. Not that I
> disagree with them! I think you are giving useful feedback :)
>
>> 1) current L2 cache management seems very wrong to me. Each cache
>>     miss means that we have to read entire L2 cache block. This means
>>     that in the worst case (when dataset of the test does not fit L2
>>     cache size we read 64kb of L2 table for each 4 kb read).
>>
>>     The situation is MUCH worse once we are starting to increase
>>     cluster size. For 1 Mb blocks we have to read 1 Mb on each cache
>>     miss.
>>
>>     The situation can be cured immediately once we will start reading
>>     L2 cache with 4 or 8kb chunks. We have patchset for this for our
>>     downstream and preparing it for upstream.
> Correct, although the impact of this depends on whether you are using
> SDD or HDD.
>
> With an SSD what you want is to minimize is the number of unnecessary
> reads, so reading small chunks will likely increase the performance when
> there's a cache miss.
>
> With an HDD what you want is to minimize the number of seeks. Once you
> have moved the disk head to the location where the cluster is, reading
> the whole cluster is relatively inexpensive, so (leaving the memory
> requirements aside) you generally want to read as much as possible.
No! This greatly helps for HDD too!

The reason is that you cover areas of the virtual disk much more
precisely. Here is a very simple example. Let us assume that I have
e.g. a 1 TB virtual HDD with a 1 MB block size. As far as I understand,
the L2 cache in that case currently consists of 4 L2 clusters.

So I can exhaust the current cache with only 5 requests, and each actual
read will then cost an L2 table read. This is a real problem, and this
condition can easily occur on a fragmented FS.

With my proposal the situation is MUCH better. All accesses will be
served from the cache after the first run.

>> 2) yet another terrible thing in cluster allocation is its allocation
>>     strategy.
>>     Current QCOW2 codebase implies that we need 5 (five) IOPSes to
>>     complete COW operation. We are reading head, writing head, reading
>>     tail, writing tail, writing actual data to be written. This could
>>     be easily reduced to 3 IOPSes.
> That sounds right, but I'm not sure if this is really incompatible with
> my proposal :)
The problem is code complexity, which is already very high right now.


>>     Another problem is the amount of data written. We are writing
>>     entire cluster in write operation and this is also insane. It is
>>     possible to perform fallocate() and actual data write on normal
>>     modern filesystem.
> But that only works when filling the cluster with zeroes, doesn't it? If
> there's a backing image you need to bring all the contents from there.

Yes, backing images are a problem. Though even with sub-clusters we
will suffer the same number of IOPS, because the head and the tail
still have to be read. And if you are talking about subclusters equal
to the FS block size, so that COW is avoided entirely, this would be
terribly slow later on with sequential reading: in such an approach
sequential reads turn into random reads.

Guest OSes are written with the assumption that adjacent LBAs are
really adjacent and that reading them sequentially is a very good
idea. This invariant will be broken in the case of subclusters.

For today's SSDs we are facing problems somewhere else. Right now I can
achieve only 100k IOPS on an SSD capable of 350-550k. A 1 MB block size
with preallocation and a fragmented L2 cache gives the same 100k. Tests
with an initially empty image give around 80k for us.

Maybe I have too good hardware for these tests. Right now I am giving
numbers from our last run on top Intel SSDs. We will remeasure this on
something slower before submission ;)

Den

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-13 12:44     ` Denis V. Lunev
@ 2017-04-13 13:05       ` Kevin Wolf
  2017-04-13 13:09         ` Denis V. Lunev
  2017-04-13 13:21       ` Alberto Garcia
  1 sibling, 1 reply; 64+ messages in thread
From: Kevin Wolf @ 2017-04-13 13:05 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Alberto Garcia, qemu-devel, Stefan Hajnoczi, qemu-block, Max Reitz

Am 13.04.2017 um 14:44 hat Denis V. Lunev geschrieben:
> On 04/13/2017 02:58 PM, Alberto Garcia wrote:
> > On Wed 12 Apr 2017 06:54:50 PM CEST, Denis V. Lunev wrote:
> >> My opinion about this approach is very negative as the problem could
> >> be (partially) solved in a much better way.
> > Hmm... it seems to me that (some of) the problems you are describing are
> > different from the ones this proposal tries to address. Not that I
> > disagree with them! I think you are giving useful feedback :)
> >
> >> 1) current L2 cache management seems very wrong to me. Each cache
> >>     miss means that we have to read entire L2 cache block. This means
> >>     that in the worst case (when dataset of the test does not fit L2
> >>     cache size we read 64kb of L2 table for each 4 kb read).
> >>
> >>     The situation is MUCH worse once we are starting to increase
> >>     cluster size. For 1 Mb blocks we have to read 1 Mb on each cache
> >>     miss.
> >>
> >>     The situation can be cured immediately once we will start reading
> >>     L2 cache with 4 or 8kb chunks. We have patchset for this for our
> >>     downstream and preparing it for upstream.
> > Correct, although the impact of this depends on whether you are using
> > SDD or HDD.
> >
> > With an SSD what you want is to minimize is the number of unnecessary
> > reads, so reading small chunks will likely increase the performance when
> > there's a cache miss.
> >
> > With an HDD what you want is to minimize the number of seeks. Once you
> > have moved the disk head to the location where the cluster is, reading
> > the whole cluster is relatively inexpensive, so (leaving the memory
> > requirements aside) you generally want to read as much as possible.
> no! This greatly helps for HDD too!
> 
> The reason is that you cover areas of the virtual disk much more precise.
> There is very simple example. Let us assume that I have f.e. 1 TB virtual
> HDD with 1 MB block size. As far as I understand right now L2 cache
> for the case consists of 4 L2 clusters.
> 
> So, I can exhaust current cache only with 5 requests and each actual read
> will costs L2 table read. This is a read problem. This condition could
> happen on fragmented FS without a problem.
> 
> With my proposal the situation is MUCH better. All accesses will be taken
> from the cache after the first run.
> 
> >> 2) yet another terrible thing in cluster allocation is its allocation
> >>     strategy.
> >>     Current QCOW2 codebase implies that we need 5 (five) IOPSes to
> >>     complete COW operation. We are reading head, writing head, reading
> >>     tail, writing tail, writing actual data to be written. This could
> >>     be easily reduced to 3 IOPSes.
> > That sounds right, but I'm not sure if this is really incompatible with
> > my proposal :)
> the problem is code complexity, with is very complex right now.
> 
> 
> >>     Another problem is the amount of data written. We are writing
> >>     entire cluster in write operation and this is also insane. It is
> >>     possible to perform fallocate() and actual data write on normal
> >>     modern filesystem.
> > But that only works when filling the cluster with zeroes, doesn't it? If
> > there's a backing image you need to bring all the contents from there.
> 
> Yes. Backing images are problems. Though, even with sub-clusters, we
> will suffer exactly the same with the amount of IOPSes as even with
> that head and tail have to be read. If you are spoken about
> subclusters equals to FS block size and avoid COW at all, this would
> be terribly slow later on with sequential reading. In such an approach
> sequential reading will result in random read.
> 
> Guest OSes are written keeping in mind that adjacent LBAs are really
> adjacent and reading them sequentially is a very good idea. This
> invariant will be broken for the case of subclusters.

How that?

Given the same cluster size, subclustered and traditional images behave
_exactly_ the same regarding fragmentation. Subclusters only have an
effect on it (and a positive one) when you take them as a reason that
you can now afford to increase the cluster size.

I see subclusters and fragmentation as mostly independent topics.

> For nowadays SSD we are facing problems somewhere else. Right now I
> can achieve only 100k IOPSes on SSD capable of 350-550k. 1 Mb block
> with preallocation and fragmented L2 cache gives same 100k. Tests for
> initially empty image gives around 80k for us.

Preallocated images aren't particularly interesting to me. qcow2 is used
mainly for two reasons. One of them is sparseness (initially small file
size) mostly for desktop use cases with no serious I/O, so not that
interesting either. The other one is snapshots, i.e. backing files,
which doesn't work with preallocation (yet).

Actually, preallocation with backing files is something that subclusters
would automatically enable: You could already reserve the space for a
cluster, but still leave all subclusters marked as unallocated.
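To make that concrete, one could imagine per-L2-entry state roughly like
this (purely illustrative field names and widths, not the proposed
on-disk layout):

----------------------------------------------------------------
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical extended L2 entry: the usual cluster descriptor plus a
 * per-subcluster allocation bitmap. */
typedef struct ExtL2Entry {
    uint64_t cluster_descriptor;    /* host offset and flags, as today  */
    uint32_t subcluster_allocated;  /* bit n set: subcluster n has data */
} ExtL2Entry;

/* Preallocation with a backing file: reserve host space for the whole
 * cluster, but leave every subcluster marked unallocated so that reads
 * still go to the backing file. */
void preallocate_cluster(ExtL2Entry *e, uint64_t host_offset)
{
    e->cluster_descriptor   = host_offset;
    e->subcluster_allocated = 0;
}

/* A write only needs COW from the backing file for subclusters it does
 * not fully overwrite and that are not yet allocated. */
bool subcluster_needs_cow(const ExtL2Entry *e, unsigned n)
{
    return !(e->subcluster_allocated & (1u << n));
}
----------------------------------------------------------------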

Kevin

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-13 13:05       ` Kevin Wolf
@ 2017-04-13 13:09         ` Denis V. Lunev
  2017-04-13 13:36           ` Alberto Garcia
  0 siblings, 1 reply; 64+ messages in thread
From: Denis V. Lunev @ 2017-04-13 13:09 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Alberto Garcia, qemu-devel, Stefan Hajnoczi, qemu-block, Max Reitz

On 04/13/2017 04:05 PM, Kevin Wolf wrote:
> Am 13.04.2017 um 14:44 hat Denis V. Lunev geschrieben:
>> On 04/13/2017 02:58 PM, Alberto Garcia wrote:
>>> On Wed 12 Apr 2017 06:54:50 PM CEST, Denis V. Lunev wrote:
>>>> My opinion about this approach is very negative as the problem could
>>>> be (partially) solved in a much better way.
>>> Hmm... it seems to me that (some of) the problems you are describing are
>>> different from the ones this proposal tries to address. Not that I
>>> disagree with them! I think you are giving useful feedback :)
>>>
>>>> 1) current L2 cache management seems very wrong to me. Each cache
>>>>     miss means that we have to read entire L2 cache block. This means
>>>>     that in the worst case (when dataset of the test does not fit L2
>>>>     cache size we read 64kb of L2 table for each 4 kb read).
>>>>
>>>>     The situation is MUCH worse once we are starting to increase
>>>>     cluster size. For 1 Mb blocks we have to read 1 Mb on each cache
>>>>     miss.
>>>>
>>>>     The situation can be cured immediately once we will start reading
>>>>     L2 cache with 4 or 8kb chunks. We have patchset for this for our
>>>>     downstream and preparing it for upstream.
>>> Correct, although the impact of this depends on whether you are using
>>> SDD or HDD.
>>>
>>> With an SSD what you want is to minimize is the number of unnecessary
>>> reads, so reading small chunks will likely increase the performance when
>>> there's a cache miss.
>>>
>>> With an HDD what you want is to minimize the number of seeks. Once you
>>> have moved the disk head to the location where the cluster is, reading
>>> the whole cluster is relatively inexpensive, so (leaving the memory
>>> requirements aside) you generally want to read as much as possible.
>> no! This greatly helps for HDD too!
>>
>> The reason is that you cover areas of the virtual disk much more precise.
>> There is very simple example. Let us assume that I have f.e. 1 TB virtual
>> HDD with 1 MB block size. As far as I understand right now L2 cache
>> for the case consists of 4 L2 clusters.
>>
>> So, I can exhaust current cache only with 5 requests and each actual read
>> will costs L2 table read. This is a read problem. This condition could
>> happen on fragmented FS without a problem.
>>
>> With my proposal the situation is MUCH better. All accesses will be taken
>> from the cache after the first run.
>>
>>>> 2) yet another terrible thing in cluster allocation is its allocation
>>>>     strategy.
>>>>     Current QCOW2 codebase implies that we need 5 (five) IOPSes to
>>>>     complete COW operation. We are reading head, writing head, reading
>>>>     tail, writing tail, writing actual data to be written. This could
>>>>     be easily reduced to 3 IOPSes.
>>> That sounds right, but I'm not sure if this is really incompatible with
>>> my proposal :)
>> the problem is code complexity, with is very complex right now.
>>
>>
>>>>     Another problem is the amount of data written. We are writing
>>>>     entire cluster in write operation and this is also insane. It is
>>>>     possible to perform fallocate() and actual data write on normal
>>>>     modern filesystem.
>>> But that only works when filling the cluster with zeroes, doesn't it? If
>>> there's a backing image you need to bring all the contents from there.
>> Yes. Backing images are problems. Though, even with sub-clusters, we
>> will suffer exactly the same with the amount of IOPSes as even with
>> that head and tail have to be read. If you are spoken about
>> subclusters equals to FS block size and avoid COW at all, this would
>> be terribly slow later on with sequential reading. In such an approach
>> sequential reading will result in random read.
>>
>> Guest OSes are written keeping in mind that adjacent LBAs are really
>> adjacent and reading them sequentially is a very good idea. This
>> invariant will be broken for the case of subclusters.
> How that?
>
> Given the same cluster size, subclustered and traditional images behave
> _exactly_ the same regarding fragmentation. Subclusters only have an
> effect on it (and a positive one) when you take them as a reason that
> you can now afford to increase the cluster size.
>
> I see subclusters and fragmentation as mostly independent topics.
>
>> For nowadays SSD we are facing problems somewhere else. Right now I
>> can achieve only 100k IOPSes on SSD capable of 350-550k. 1 Mb block
>> with preallocation and fragmented L2 cache gives same 100k. Tests for
>> initially empty image gives around 80k for us.
> Preallocated images aren't particularly interesting to me. qcow2 is used
> mainly for two reasons. One of them is sparseness (initially small file
> size) mostly for desktop use cases with no serious I/O, so not that
> interesting either. The other one is snapshots, i.e. backing files,
> which doesn't work with preallocation (yet).
>
> Actually, preallocation with backing files is something that subclusters
> would automatically enable: You could already reserve the space for a
> cluster, but still leave all subclusters marked as unallocated.

I am talking about calling fallocate() for the entire cluster before
the actual write() on an originally empty image. This increases the
performance of 4k random writes 10+ times. In this case we can just
write those 4k and do nothing else.

Den

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-13 12:44     ` Denis V. Lunev
  2017-04-13 13:05       ` Kevin Wolf
@ 2017-04-13 13:21       ` Alberto Garcia
  2017-04-13 13:30         ` Denis V. Lunev
  2017-04-13 13:51         ` Kevin Wolf
  1 sibling, 2 replies; 64+ messages in thread
From: Alberto Garcia @ 2017-04-13 13:21 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-devel
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz

On Thu 13 Apr 2017 02:44:51 PM CEST, Denis V. Lunev wrote:
>>> 1) current L2 cache management seems very wrong to me. Each cache
>>>     miss means that we have to read entire L2 cache block. This means
>>>     that in the worst case (when dataset of the test does not fit L2
>>>     cache size we read 64kb of L2 table for each 4 kb read).
>>>
>>>     The situation is MUCH worse once we are starting to increase
>>>     cluster size. For 1 Mb blocks we have to read 1 Mb on each cache
>>>     miss.
>>>
>>>     The situation can be cured immediately once we will start reading
>>>     L2 cache with 4 or 8kb chunks. We have patchset for this for our
>>>     downstream and preparing it for upstream.
>> Correct, although the impact of this depends on whether you are using
>> SDD or HDD.
>>
>> With an SSD what you want is to minimize is the number of unnecessary
>> reads, so reading small chunks will likely increase the performance when
>> there's a cache miss.
>>
>> With an HDD what you want is to minimize the number of seeks. Once you
>> have moved the disk head to the location where the cluster is, reading
>> the whole cluster is relatively inexpensive, so (leaving the memory
>> requirements aside) you generally want to read as much as possible.
> no! This greatly helps for HDD too!
>
> The reason is that you cover areas of the virtual disk much more
> precise.  There is very simple example. Let us assume that I have
> f.e. 1 TB virtual HDD with 1 MB block size. As far as I understand
> right now L2 cache for the case consists of 4 L2 clusters.
>
> So, I can exhaust current cache only with 5 requests and each actual
> read will costs L2 table read. This is a read problem. This condition
> could happen on fragmented FS without a problem.

But what you're saying is that this makes a more efficient use of cache
memory.

If the guest OS has a lot of unused space but is very fragmented then
you don't want to fill up your cache with L2 entries that are not going
to be used. It's better to read smaller chunks from the L2 table so
there are fewer chances of having to evict entries from the
cache. Therefore this results in fewer cache misses and better I/O
performance.

Ok, this sounds perfectly reasonable to me.

If, however, the cache is big enough for the whole disk, then you never
need to evict entries, so with an HDD you actually want to make each
disk seek count and read as many L2 entries as possible.

However it's true that in this case this will only affect the initial
reads. Once the cache is full there's no need to read the L2 tables from
disk anymore and the performance will be the same, so your point remains
valid.

Still, one of the goals from my proposal is to reduce the amount of
metadata needed for the image. No matter how efficient you make the
cache, the only way to reduce the amount of L2 entries is to increase
the cluster size. And increasing the cluster size results in slower COW
and less efficient use of disk space.

>>>     Another problem is the amount of data written. We are writing
>>>     entire cluster in write operation and this is also insane. It is
>>>     possible to perform fallocate() and actual data write on normal
>>>     modern filesystem.
>> But that only works when filling the cluster with zeroes, doesn't it? If
>> there's a backing image you need to bring all the contents from there.
>
> Yes. Backing images are problems. Though, even with sub-clusters, we
> will suffer exactly the same with the amount of IOPSes as even with
> that head and tail have to be read. If you are spoken about
> subclusters equals to FS block size and avoid COW at all, this would
> be terribly slow later on with sequential reading. In such an approach
> sequential reading will result in random read.
>
> Guest OSes are written keeping in mind that adjacent LBAs are really
> adjacent and reading them sequentially is a very good idea. This
> invariant will be broken for the case of subclusters.

This invariant is already broken by the very design of the qcow2 format,
subclusters don't really add anything new there. For any given cluster
size you can write 4k in every odd cluster, then do the same in every
even cluster, and you'll get an equally fragmented image.

Berto

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-13 13:21       ` Alberto Garcia
@ 2017-04-13 13:30         ` Denis V. Lunev
  2017-04-13 13:59           ` Kevin Wolf
  2017-04-13 15:04           ` Alberto Garcia
  2017-04-13 13:51         ` Kevin Wolf
  1 sibling, 2 replies; 64+ messages in thread
From: Denis V. Lunev @ 2017-04-13 13:30 UTC (permalink / raw)
  To: Alberto Garcia, qemu-devel
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz

On 04/13/2017 04:21 PM, Alberto Garcia wrote:
> On Thu 13 Apr 2017 02:44:51 PM CEST, Denis V. Lunev wrote:
>>>> 1) current L2 cache management seems very wrong to me. Each cache
>>>>     miss means that we have to read entire L2 cache block. This means
>>>>     that in the worst case (when dataset of the test does not fit L2
>>>>     cache size we read 64kb of L2 table for each 4 kb read).
>>>>
>>>>     The situation is MUCH worse once we are starting to increase
>>>>     cluster size. For 1 Mb blocks we have to read 1 Mb on each cache
>>>>     miss.
>>>>
>>>>     The situation can be cured immediately once we will start reading
>>>>     L2 cache with 4 or 8kb chunks. We have patchset for this for our
>>>>     downstream and preparing it for upstream.
>>> Correct, although the impact of this depends on whether you are using
>>> SDD or HDD.
>>>
>>> With an SSD what you want is to minimize is the number of unnecessary
>>> reads, so reading small chunks will likely increase the performance when
>>> there's a cache miss.
>>>
>>> With an HDD what you want is to minimize the number of seeks. Once you
>>> have moved the disk head to the location where the cluster is, reading
>>> the whole cluster is relatively inexpensive, so (leaving the memory
>>> requirements aside) you generally want to read as much as possible.
>> no! This greatly helps for HDD too!
>>
>> The reason is that you cover areas of the virtual disk much more
>> precise.  There is very simple example. Let us assume that I have
>> f.e. 1 TB virtual HDD with 1 MB block size. As far as I understand
>> right now L2 cache for the case consists of 4 L2 clusters.
>>
>> So, I can exhaust current cache only with 5 requests and each actual
>> read will costs L2 table read. This is a read problem. This condition
>> could happen on fragmented FS without a problem.
> But what you're saying is that this makes a more efficient use of cache
> memory.
>
> If the guest OS has a lot of unused space but is very fragmented then
> you don't want to fill up your cache with L2 entries that are not going
> to be used. It's better to read smaller chunks from the L2 table so
> there are fewer chances of having to evict entries from the
> cache. Therefore this results in less cache misses and better I/O
> performance.
>
> Ok, this sounds perfectly reasonable to me.
>
> If the cache is however big enough for the whole disk then you never
> need to evict entries, so with an HDD you actually want to take
> advantage of disk seeks and read as many L2 entries as possible.
>
> However it's true that in this case this will only affect the initial
> reads. Once the cache is full there's no need to read the L2 tables from
> disk anymore and the performance will be the same, so your point remains
> valid.
>
> Still, one of the goals from my proposal is to reduce the amount of
> metadata needed for the image. No matter how efficient you make the
> cache, the only way to reduce the amount of L2 entries is to increase
> the cluster size. And increasing the cluster size results in slower COW
> and less efficient use of disk space.
Actually, we can still read whole clusters if the cache is empty or
nearly empty. Yes, the block size should be increased; I am perfectly
in agreement with you there. But I think that we could do that with a
plain increase of the cluster size, without any further dances.
Sub-clusters as such will only help if we are able to avoid COW; with
COW I do not see much difference.

But in the absence of COW, further sequential reading will be hurt by
the fragmented file on the host. That is the point. We should try to
avoid host fragmentation altogether.


>>>>     Another problem is the amount of data written. We are writing
>>>>     entire cluster in write operation and this is also insane. It is
>>>>     possible to perform fallocate() and actual data write on normal
>>>>     modern filesystem.
>>> But that only works when filling the cluster with zeroes, doesn't it? If
>>> there's a backing image you need to bring all the contents from there.
>> Yes. Backing images are problems. Though, even with sub-clusters, we
>> will suffer exactly the same with the amount of IOPSes as even with
>> that head and tail have to be read. If you are spoken about
>> subclusters equals to FS block size and avoid COW at all, this would
>> be terribly slow later on with sequential reading. In such an approach
>> sequential reading will result in random read.
>>
>> Guest OSes are written keeping in mind that adjacent LBAs are really
>> adjacent and reading them sequentially is a very good idea. This
>> invariant will be broken for the case of subclusters.
> This invariant is already broken by the very design of the qcow2 format,
> subclusters don't really add anything new there. For any given cluster
> size you can write 4k in every odd cluster, then do the same in every
> even cluster, and you'll get an equally fragmented image.
>
The size of the cluster matters! Our experiments in older Parallels
showed that with a 1 MB continuous (!) cluster this invariant is
"almost" kept, and this works fine for sequential ops.

Den

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-13 13:09         ` Denis V. Lunev
@ 2017-04-13 13:36           ` Alberto Garcia
  2017-04-13 14:06             ` Denis V. Lunev
  0 siblings, 1 reply; 64+ messages in thread
From: Alberto Garcia @ 2017-04-13 13:36 UTC (permalink / raw)
  To: Denis V. Lunev, Kevin Wolf
  Cc: qemu-devel, Stefan Hajnoczi, qemu-block, Max Reitz

On Thu 13 Apr 2017 03:09:53 PM CEST, Denis V. Lunev wrote:
>>> For nowadays SSD we are facing problems somewhere else. Right now I
>>> can achieve only 100k IOPSes on SSD capable of 350-550k. 1 Mb block
>>> with preallocation and fragmented L2 cache gives same 100k. Tests
>>> for initially empty image gives around 80k for us.
>> Preallocated images aren't particularly interesting to me. qcow2 is
>> used mainly for two reasons. One of them is sparseness (initially
>> small file size) mostly for desktop use cases with no serious I/O, so
>> not that interesting either. The other one is snapshots, i.e. backing
>> files, which doesn't work with preallocation (yet).
>>
>> Actually, preallocation with backing files is something that
>> subclusters would automatically enable: You could already reserve the
>> space for a cluster, but still leave all subclusters marked as
>> unallocated.
>
> I am spoken about fallocate() for the entire cluster before actual
> write() for originally empty image. This increases the performance of
> 4k random writes 10+ times. In this case we can just write those 4k
> and do nothing else.

You're talking about using fallocate() for filling a cluster with zeroes
before writing data to it.

As noted earlier in this thread, this works if the image is empty or if
it doesn't have a backing file.

And if the image is not empty you cannot guarantee that the cluster
contains zeroes (you can use FALLOC_FL_ZERO_RANGE, but that won't work
in all cases).
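For reference, a minimal sketch of the FALLOC_FL_ZERO_RANGE variant
(illustrative only; the write-zeroes fallback is left to the caller):

----------------------------------------------------------------
/* Needs _GNU_SOURCE for fallocate() and the FALLOC_FL_* flags. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <errno.h>

/* Make a range read back as zeroes even if it contained data before.
 * Plain fallocate(fd, 0, ...) only reserves blocks and leaves existing
 * data alone, so it is not enough here. */
int zero_cluster(int fd, off_t off, off_t len)
{
    if (fallocate(fd, FALLOC_FL_ZERO_RANGE, off, len) == 0) {
        return 0;
    }
    /* Some kernels/filesystems return EOPNOTSUPP; the caller then has
     * to fall back to writing explicit zeroes. */
    return -errno;
}
----------------------------------------------------------------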

Berto

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-13 13:21       ` Alberto Garcia
  2017-04-13 13:30         ` Denis V. Lunev
@ 2017-04-13 13:51         ` Kevin Wolf
  2017-04-13 14:15           ` Alberto Garcia
  2017-04-13 14:42           ` [Qemu-devel] " Denis V. Lunev
  1 sibling, 2 replies; 64+ messages in thread
From: Kevin Wolf @ 2017-04-13 13:51 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Denis V. Lunev, qemu-devel, Stefan Hajnoczi, qemu-block, Max Reitz

Am 13.04.2017 um 15:21 hat Alberto Garcia geschrieben:
> This invariant is already broken by the very design of the qcow2 format,
> subclusters don't really add anything new there. For any given cluster
> size you can write 4k in every odd cluster, then do the same in every
> even cluster, and you'll get an equally fragmented image.

Because this scenario has appeared repeatedly in this thread: Can we
please use a more realistic one that shows an actual problem? Because
with 8k or more for the cluster size you don't get any qcow2
fragmentation with 4k even/odd writes (which is a pathological case
anyway), and the file systems are clever enough to cope with it, too.

Just to confirm this experimentally, I ran this short script:

----------------------------------------------------------------
#!/bin/bash
./qemu-img create -f qcow2 /tmp/test.qcow2 64M

echo even blocks
for i in $(seq 0 32767); do echo "write $((i * 8))k 4k"; done | ./qemu-io /tmp/test.qcow2 > /dev/null
echo odd blocks
for i in $(seq 0 32767); do echo "write $((i * 8 + 4))k 4k"; done | ./qemu-io /tmp/test.qcow2 > /dev/null

./qemu-img map /tmp/test.qcow2
filefrag -v /tmp/test.qcow2
----------------------------------------------------------------

And sure enough, this is the output:

----------------------------------------------------------------
Formatting '/tmp/test.qcow2', fmt=qcow2 size=67108864 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
even blocks
odd blocks
Offset          Length          Mapped to       File
0               0x4000000       0x50000         /tmp/test.qcow2
Filesystem type is: 58465342
File size of /tmp/test.qcow2 is 67436544 (16464 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      47:     142955..    143002:     48:            
   1:       48..      48:     143016..    143016:      1:     143003:
   2:       64..      79:     142868..    142883:     16:     143017:
   3:       80..     111:     155386..    155417:     32:     142884:
   4:      112..     303:     227558..    227749:    192:     155418:
   5:      304..     559:     228382..    228637:    256:     227750:
   6:      560..    1071:     455069..    455580:    512:     228638:
   7:     1072..    2095:     485544..    486567:   1024:     455581:
   8:     2096..    4143:     497978..    500025:   2048:     486568:
   9:     4144..    8239:     508509..    512604:   4096:     500026:
  10:     8240..   16431:     563122..    571313:   8192:     512605:
  11:    16432..   32815:     632969..    649352:  16384:     571314: eof
/tmp/test.qcow2: 12 extents found
----------------------------------------------------------------

That is, on the qcow2 level we have exactly 0% fragmentation, everything
is completely contiguous in a single chunk. XFS as the container of the
test image creates a few more extents, but as you can see, it uses
fairly large extent sizes in the end (and it would use even larger ones
if I wrote more than 64 MB).

Kevin

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-13 13:30         ` Denis V. Lunev
@ 2017-04-13 13:59           ` Kevin Wolf
  2017-04-13 15:04           ` Alberto Garcia
  1 sibling, 0 replies; 64+ messages in thread
From: Kevin Wolf @ 2017-04-13 13:59 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Alberto Garcia, qemu-devel, Stefan Hajnoczi, qemu-block, Max Reitz

Am 13.04.2017 um 15:30 hat Denis V. Lunev geschrieben:
> On 04/13/2017 04:21 PM, Alberto Garcia wrote:
> > On Thu 13 Apr 2017 02:44:51 PM CEST, Denis V. Lunev wrote:
> >>>> 1) current L2 cache management seems very wrong to me. Each cache
> >>>>     miss means that we have to read entire L2 cache block. This means
> >>>>     that in the worst case (when dataset of the test does not fit L2
> >>>>     cache size we read 64kb of L2 table for each 4 kb read).
> >>>>
> >>>>     The situation is MUCH worse once we are starting to increase
> >>>>     cluster size. For 1 Mb blocks we have to read 1 Mb on each cache
> >>>>     miss.
> >>>>
> >>>>     The situation can be cured immediately once we will start reading
> >>>>     L2 cache with 4 or 8kb chunks. We have patchset for this for our
> >>>>     downstream and preparing it for upstream.
> >>> Correct, although the impact of this depends on whether you are using
> >>> SDD or HDD.
> >>>
> >>> With an SSD what you want is to minimize is the number of unnecessary
> >>> reads, so reading small chunks will likely increase the performance when
> >>> there's a cache miss.
> >>>
> >>> With an HDD what you want is to minimize the number of seeks. Once you
> >>> have moved the disk head to the location where the cluster is, reading
> >>> the whole cluster is relatively inexpensive, so (leaving the memory
> >>> requirements aside) you generally want to read as much as possible.
> >> no! This greatly helps for HDD too!
> >>
> >> The reason is that you cover areas of the virtual disk much more
> >> precise.  There is very simple example. Let us assume that I have
> >> f.e. 1 TB virtual HDD with 1 MB block size. As far as I understand
> >> right now L2 cache for the case consists of 4 L2 clusters.
> >>
> >> So, I can exhaust current cache only with 5 requests and each actual
> >> read will costs L2 table read. This is a read problem. This condition
> >> could happen on fragmented FS without a problem.
> > But what you're saying is that this makes a more efficient use of cache
> > memory.
> >
> > If the guest OS has a lot of unused space but is very fragmented then
> > you don't want to fill up your cache with L2 entries that are not going
> > to be used. It's better to read smaller chunks from the L2 table so
> > there are fewer chances of having to evict entries from the
> > cache. Therefore this results in less cache misses and better I/O
> > performance.
> >
> > Ok, this sounds perfectly reasonable to me.
> >
> > If the cache is however big enough for the whole disk then you never
> > need to evict entries, so with an HDD you actually want to take
> > advantage of disk seeks and read as many L2 entries as possible.
> >
> > However it's true that in this case this will only affect the initial
> > reads. Once the cache is full there's no need to read the L2 tables from
> > disk anymore and the performance will be the same, so your point remains
> > valid.
> >
> > Still, one of the goals from my proposal is to reduce the amount of
> > metadata needed for the image. No matter how efficient you make the
> > cache, the only way to reduce the amount of L2 entries is to increase
> > the cluster size. And increasing the cluster size results in slower COW
> > and less efficient use of disk space.
> actually we can read by clusters if the cache is empty or near empty.
> Yes, block size should be increased. I perfectly in agreement with your.
> But I think that we could do that by plain increase of the cluster size
> without any further dances. Sub-clusters as sub-clusters will help
> if we are able to avoid COW. With COW I do not see much difference.

With COW, it's basically just the same argument as you're making for
reading L2 tables: There's a difference between having to copy 64k to
process a write request and having to copy 2 MB.

> But for the case of the COW absence, further sequential reading will
> be broken by the fragmented file in the host. That is the point. We
> should try to avoid host fragmentation at all.

I still don't understand why you think that subclusters will cause
fragmentation that wouldn't be there without subclusters. The opposite
is true, with subclusters, larger cluster sizes become more realistic to
use.

> >>>>     Another problem is the amount of data written. We are writing
> >>>>     entire cluster in write operation and this is also insane. It is
> >>>>     possible to perform fallocate() and actual data write on normal
> >>>>     modern filesystem.
> >>> But that only works when filling the cluster with zeroes, doesn't it? If
> >>> there's a backing image you need to bring all the contents from there.
> >> Yes. Backing images are problems. Though, even with sub-clusters, we
> >> will suffer exactly the same with the amount of IOPSes as even with
> >> that head and tail have to be read. If you are spoken about
> >> subclusters equals to FS block size and avoid COW at all, this would
> >> be terribly slow later on with sequential reading. In such an approach
> >> sequential reading will result in random read.
> >>
> >> Guest OSes are written keeping in mind that adjacent LBAs are really
> >> adjacent and reading them sequentially is a very good idea. This
> >> invariant will be broken for the case of subclusters.
> > This invariant is already broken by the very design of the qcow2 format,
> > subclusters don't really add anything new there. For any given cluster
> > size you can write 4k in every odd cluster, then do the same in every
> > even cluster, and you'll get an equally fragmented image.
>
> The size of the cluster matters! Our experiments in older Parallels
> shown that with 1 Mb continuous (!) cluster this invariant is "almost"
> kept and this works fine for sequential ops.

So you should be interested in what I called "use case 2" in another
email in this thread: Making use of the subclusters so that you can
increase the cluster size to 2 MB while still maintaining reasonable COW
granularity.

Kevin

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-13 13:36           ` Alberto Garcia
@ 2017-04-13 14:06             ` Denis V. Lunev
  0 siblings, 0 replies; 64+ messages in thread
From: Denis V. Lunev @ 2017-04-13 14:06 UTC (permalink / raw)
  To: Alberto Garcia, Kevin Wolf
  Cc: qemu-devel, Stefan Hajnoczi, qemu-block, Max Reitz

On 04/13/2017 04:36 PM, Alberto Garcia wrote:
> On Thu 13 Apr 2017 03:09:53 PM CEST, Denis V. Lunev wrote:
>>>> For nowadays SSD we are facing problems somewhere else. Right now I
>>>> can achieve only 100k IOPSes on SSD capable of 350-550k. 1 Mb block
>>>> with preallocation and fragmented L2 cache gives same 100k. Tests
>>>> for initially empty image gives around 80k for us.
>>> Preallocated images aren't particularly interesting to me. qcow2 is
>>> used mainly for two reasons. One of them is sparseness (initially
>>> small file size) mostly for desktop use cases with no serious I/O, so
>>> not that interesting either. The other one is snapshots, i.e. backing
>>> files, which doesn't work with preallocation (yet).
>>>
>>> Actually, preallocation with backing files is something that
>>> subclusters would automatically enable: You could already reserve the
>>> space for a cluster, but still leave all subclusters marked as
>>> unallocated.
>> I am spoken about fallocate() for the entire cluster before actual
>> write() for originally empty image. This increases the performance of
>> 4k random writes 10+ times. In this case we can just write those 4k
>> and do nothing else.
> You're talking about using fallocate() for filling a cluster with zeroes
> before writing data to it.
>
> As noted earlier in this thread, this works if the image is empty or if
> it doesn't have a backing file.
>
> And if the image is not empty you cannot guarantee that the cluster
> contains zeroes (you can use FALLOC_FL_ZERO_RANGE, but that won't work
> in all cases).
>
> Berto
yes, I agree here.

But COW operations suffer more from the number of I/O operations
required than from the amount of data transferred. Let us assume that
we have a 64k cluster represented as [--------]. With a 4k write in the
middle of the cluster we currently need 5 IOPS to perform the
operation: read head, write head, write 4k, read tail, write tail.
Normally this should take 2 operations: read the entire cluster (64kb),
write the entire cluster (64kb).
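For illustration, the two patterns side by side as a standalone sketch
(made-up helpers and offsets, not the actual qcow2 code):

----------------------------------------------------------------
#include <unistd.h>
#include <string.h>

#define CLUSTER_SIZE (64 * 1024)

/* Variant A, roughly what happens today: head and tail are copied from
 * the backing file as separate requests, 5 I/Os in total. */
void cow_split(int backing_fd, int img_fd, off_t host_off, off_t guest_off,
               const void *data, size_t in_off, size_t len)
{
    char head[CLUSTER_SIZE], tail[CLUSTER_SIZE];
    size_t tail_len = CLUSTER_SIZE - in_off - len;

    pread(backing_fd, head, in_off, guest_off);                   /* 1 */
    pwrite(img_fd, head, in_off, host_off);                       /* 2 */
    pwrite(img_fd, data, len, host_off + in_off);                 /* 3 */
    pread(backing_fd, tail, tail_len, guest_off + in_off + len);  /* 4 */
    pwrite(img_fd, tail, tail_len, host_off + in_off + len);      /* 5 */
}

/* Variant B: read the whole cluster once, patch it in memory, write it
 * back once -- 2 I/Os, but 2 x 64kb of data moved. */
void cow_whole(int backing_fd, int img_fd, off_t host_off, off_t guest_off,
               const void *data, size_t in_off, size_t len)
{
    char cluster[CLUSTER_SIZE];

    pread(backing_fd, cluster, CLUSTER_SIZE, guest_off);
    memcpy(cluster + in_off, data, len);
    pwrite(img_fd, cluster, CLUSTER_SIZE, host_off);
}
----------------------------------------------------------------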

With this approach a later 64kb read of this cluster will result in
1 host IOP, because the file is contiguous from the point of view of
the host file system.

With 8kb subclusters we would have the same 1 read and 1 write after
this tuning. The only difference is the amount of data moved: an 8kb
read and an 8kb write instead of a 64kb read and a 64kb write.

Sure, the size of the cluster should be increased. Personally I like
1 MB due to my past experience.

In my opinion there is no difference at all in terms of COW performance
on a rotational drive between a 1 MB cluster and a 64 kb cluster
without subclusters. Reading 1 Mb and reading 64 kb cost the same
(100-150 IOPS with 150 Mb/s throughput). Further contiguous reads will
be much better with 1 MB blocks and without subclusters.

Yes, there is a difference on "average" SSD drives, which give
40k-100k IOPS. We will see a slowdown when reading 1 MB instead of
64kb, but the difference is not that big actually. Top-notch PCIe SSDs
cannot be saturated with QEMU nowadays anyway. I am able to reach only
100k IOPS instead of the host's 300k-500k even when the data is written
to already existing clusters. Thus with them we will see no difference
between a 1 MB cluster and a 64 kb cluster in terms of COW.

So, in my opinion, a simple 1 Mb cluster size along with a fragmented
L2 cache is very good from all points of view, even from the COW point
of view ;) In real life the situation will not be worse or better from
the performance point of view, as we also avoid additional metadata
updates.

Den

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-13 13:51         ` Kevin Wolf
@ 2017-04-13 14:15           ` Alberto Garcia
  2017-04-13 14:27             ` Kevin Wolf
  2017-04-13 14:42           ` [Qemu-devel] " Denis V. Lunev
  1 sibling, 1 reply; 64+ messages in thread
From: Alberto Garcia @ 2017-04-13 14:15 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Denis V. Lunev, qemu-devel, Stefan Hajnoczi, qemu-block, Max Reitz

On Thu 13 Apr 2017 03:51:55 PM CEST, Kevin Wolf wrote:
>> This invariant is already broken by the very design of the qcow2
>> format, subclusters don't really add anything new there. For any
>> given cluster size you can write 4k in every odd cluster, then do the
>> same in every even cluster, and you'll get an equally fragmented
>> image.
>
> Because this scenario has appeared repeatedly in this thread: Can we
> please use a more realistic one that shows an actual problem? Because
> with 8k or more for the cluster size you don't get any qcow2
> fragmentation with 4k even/odd writes (which is a pathological case
> anyway), and the file systems are clever enough to cope with it, too.
>
> Just to confirm this experimentally, I ran this short script:
>
> ----------------------------------------------------------------
> #!/bin/bash
> ./qemu-img create -f qcow2 /tmp/test.qcow2 64M
>
> echo even blocks
> for i in $(seq 0 32767); do echo "write $((i * 8))k 4k"; done | ./qemu-io /tmp/test.qcow2 > /dev/null
> echo odd blocks
> for i in $(seq 0 32767); do echo "write $((i * 8 + 4))k 4k"; done | ./qemu-io /tmp/test.qcow2 > /dev/null
>
> ./qemu-img map /tmp/test.qcow2
> filefrag -v /tmp/test.qcow2
> ----------------------------------------------------------------

But that's because while you're writing on every other 4k block the
cluster size is 64k, so you're effectively allocating clusters in
sequential order. That's why you get this:

> Offset          Length          Mapped to       File
> 0               0x4000000       0x50000         /tmp/test.qcow2

You would need to either have 4k clusters, or space the writes even further apart.

Here's a simpler example: running mkfs.ext4 on an empty drive gets you
something like this:

----------------------------------------------------------------
Offset          Length          Mapped to       File
[...]
0x1040000       0x10000         0x150000        test.qcow2
0x1050000       0x50000         0x100000        test.qcow2
0x10a0000       0x10000         0x190000        test.qcow2
0x10b0000       0x30000         0x160000        test.qcow2
0x10e0000       0x10000         0x200000        test.qcow2
0x10f0000       0x30000         0x1a0000        test.qcow2
0x1120000       0x10000         0x210000        test.qcow2
0x1130000       0x30000         0x1d0000        test.qcow2
0x1160000       0x10000         0x280000        test.qcow2
0x1170000       0x30000         0x220000        test.qcow2
[...]
----------------------------------------------------------------
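
A map like the above can be reproduced roughly as follows (a sketch: it
needs root and the nbd kernel module, and the device name is an
assumption):

----------------------------------------------------------------
qemu-img create -f qcow2 test.qcow2 1G
qemu-nbd --connect=/dev/nbd0 test.qcow2   # expose the image as a block device
mkfs.ext4 -q /dev/nbd0                    # let the guest fs scatter its metadata
qemu-nbd --disconnect /dev/nbd0
qemu-img map test.qcow2 | head -n 20      # interleaved guest->host mappings
----------------------------------------------------------------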

Now, I haven't measured the effect of this on I/O performance, but
Denis's point seems in principle valid to me.

Berto

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-13 14:15           ` Alberto Garcia
@ 2017-04-13 14:27             ` Kevin Wolf
  2017-04-13 16:42               ` [Qemu-devel] [Qemu-block] " Roman Kagan
  0 siblings, 1 reply; 64+ messages in thread
From: Kevin Wolf @ 2017-04-13 14:27 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Denis V. Lunev, qemu-devel, Stefan Hajnoczi, qemu-block, Max Reitz

Am 13.04.2017 um 16:15 hat Alberto Garcia geschrieben:
> On Thu 13 Apr 2017 03:51:55 PM CEST, Kevin Wolf wrote:
> >> This invariant is already broken by the very design of the qcow2
> >> format, subclusters don't really add anything new there. For any
> >> given cluster size you can write 4k in every odd cluster, then do the
> >> same in every even cluster, and you'll get an equally fragmented
> >> image.
> >
> > Because this scenario has appeared repeatedly in this thread: Can we
> > please use a more realistic one that shows an actual problem? Because
> > with 8k or more for the cluster size you don't get any qcow2
> > fragmentation with 4k even/odd writes (which is a pathological case
> > anyway), and the file systems are clever enough to cope with it, too.
> >
> > Just to confirm this experimentally, I ran this short script:
> >
> > ----------------------------------------------------------------
> > #!/bin/bash
> > ./qemu-img create -f qcow2 /tmp/test.qcow2 64M
> >
> > echo even blocks
> > for i in $(seq 0 32767); do echo "write $((i * 8))k 4k"; done | ./qemu-io /tmp/test.qcow2 > /dev/null
> > echo odd blocks
> > for i in $(seq 0 32767); do echo "write $((i * 8 + 4))k 4k"; done | ./qemu-io /tmp/test.qcow2 > /dev/null
> >
> > ./qemu-img map /tmp/test.qcow2
> > filefrag -v /tmp/test.qcow2
> > ----------------------------------------------------------------
> 
> But that's because while you're writing on every other 4k block the
> cluster size is 64k, so you're effectively allocating clusters in
> sequential order. That's why you get this:
> 
> > Offset          Length          Mapped to       File
> > 0               0x4000000       0x50000         /tmp/test.qcow2
> 
> You would need to either have 4k clusters, or space writes even more.
> 
> Here's a simpler example, mkfs.ext4 on an empty drive gets you something
> like this:
> [...]

My point wasn't that qcow2 doesn't fragment, but that Denis and you were
both using a really bad example. You were trying to construct an
artificially bad image and you actually ended up constructing a perfect
one.

> Now, I haven't measured the effect of this on I/O performance, but
> Denis's point seems in principle valid to me.

In principle yes, but especially his fear of host file system
fragmentation seems a bit exaggerated. If I use 64k even/odd writes in
the script, I end up with a horribly fragmented qcow2 image, but still
a perfectly contiguous layout of the image file in the file system.
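
For reference, that variant looks like this (a sketch in the style of
the script above; paths assumed):

----------------------------------------------------------------
./qemu-img create -f qcow2 /tmp/test.qcow2 64M
# Allocate all 1024 clusters in even/odd order with full 64k writes:
for i in $(seq 0 511); do echo "write $((i * 128))k 64k"; done | ./qemu-io /tmp/test.qcow2 > /dev/null
for i in $(seq 0 511); do echo "write $((i * 128 + 64))k 64k"; done | ./qemu-io /tmp/test.qcow2 > /dev/null
./qemu-img map /tmp/test.qcow2   # many guest->host extents (fragmented mapping)
filefrag /tmp/test.qcow2         # but typically only a handful of host extents
----------------------------------------------------------------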

We can and probably should do something about the qcow2 fragmentation
eventually (I guess a more intelligent cluster allocation strategy could
go a long way there), but I wouldn't worry too much about the host file
system.

Kevin

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-13 13:51         ` Kevin Wolf
  2017-04-13 14:15           ` Alberto Garcia
@ 2017-04-13 14:42           ` Denis V. Lunev
  1 sibling, 0 replies; 64+ messages in thread
From: Denis V. Lunev @ 2017-04-13 14:42 UTC (permalink / raw)
  To: Kevin Wolf, Alberto Garcia
  Cc: qemu-devel, Stefan Hajnoczi, qemu-block, Max Reitz

On 04/13/2017 04:51 PM, Kevin Wolf wrote:
> Am 13.04.2017 um 15:21 hat Alberto Garcia geschrieben:
>> This invariant is already broken by the very design of the qcow2 format,
>> subclusters don't really add anything new there. For any given cluster
>> size you can write 4k in every odd cluster, then do the same in every
>> even cluster, and you'll get an equally fragmented image.
> Because this scenario has appeared repeatedly in this thread: Can we
> please use a more realistic one that shows an actual problem? Because
> with 8k or more for the cluster size you don't get any qcow2
> fragmentation with 4k even/odd writes (which is a pathological case
> anyway), and the file systems are clever enough to cope with it, too.
>
> Just to confirm this experimentally, I ran this short script:
>
> ----------------------------------------------------------------
> #!/bin/bash
> ./qemu-img create -f qcow2 /tmp/test.qcow2 64M
>
> echo even blocks
> for i in $(seq 0 32767); do echo "write $((i * 8))k 4k"; done | ./qemu-io /tmp/test.qcow2 > /dev/null
> echo odd blocks
> for i in $(seq 0 32767); do echo "write $((i * 8 + 4))k 4k"; done | ./qemu-io /tmp/test.qcow2 > /dev/null
>
> ./qemu-img map /tmp/test.qcow2
> filefrag -v /tmp/test.qcow2
> ----------------------------------------------------------------
>
> And sure enough, this is the output:
>
> ----------------------------------------------------------------
> Formatting '/tmp/test.qcow2', fmt=qcow2 size=67108864 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
> even blocks
> odd blocks
> Offset          Length          Mapped to       File
> 0               0x4000000       0x50000         /tmp/test.qcow2
> Filesystem type is: 58465342
> File size of /tmp/test.qcow2 is 67436544 (16464 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      47:     142955..    143002:     48:            
>    1:       48..      48:     143016..    143016:      1:     143003:
>    2:       64..      79:     142868..    142883:     16:     143017:
>    3:       80..     111:     155386..    155417:     32:     142884:
>    4:      112..     303:     227558..    227749:    192:     155418:
>    5:      304..     559:     228382..    228637:    256:     227750:
>    6:      560..    1071:     455069..    455580:    512:     228638:
>    7:     1072..    2095:     485544..    486567:   1024:     455581:
>    8:     2096..    4143:     497978..    500025:   2048:     486568:
>    9:     4144..    8239:     508509..    512604:   4096:     500026:
>   10:     8240..   16431:     563122..    571313:   8192:     512605:
>   11:    16432..   32815:     632969..    649352:  16384:     571314: eof
> /tmp/test.qcow2: 12 extents found
> ----------------------------------------------------------------
>
> That is, on the qcow2 level we have exactly 0% fragmentation, everything
> is completely contiguous in a single chunk. XFS as the container of the
> test image creates a few more extents, but as you can see, it uses
> fairly large extent sizes in the end (and it would use even larger ones
> if I wrote more than 64 MB).

I am speaking about an image like this:

#!/bin/bash
qemu-img create -f qcow2 /tmp/test.qcow2 64M

# Allocate the clusters in interleaved order: first every even 64k
# cluster (writes at 1k, 129k, ...), then every odd one (65k, 193k, ...).
echo even blocks
for i in $(seq 0 512); do echo "write $((i * 128 + 1))k 4k"; done | \
    qemu-io /tmp/test.qcow2 > /dev/null
echo odd blocks
for i in $(seq 0 512); do echo "write $((i * 128 + 65))k 4k"; done | \
    qemu-io /tmp/test.qcow2 > /dev/null

# Count the host read syscalls needed for one sequential 64M guest read.
echo fragmented
strace -f -e pread64 qemu-io -c "read 0 64M" /tmp/test.qcow2 2>&1 | wc -l
rm -rf 1.img

# Same test with the image written sequentially, for comparison.
qemu-img create -f qcow2 /tmp/test.qcow2 64M
echo sequential
for i in $(seq 0 1024); do echo "write $((i * 64))k 4k"; done | \
    qemu-io /tmp/test.qcow2 > /dev/null
strace -f -e pread64 qemu-io -c "read 0 64M" /tmp/test.qcow2 2>&1 | wc -l

and the difference is important - see the number of read operations
reported: 1032 vs 9.

iris ~/tmp/2 $ ./1.sh
Formatting '/tmp/test.qcow2', fmt=qcow2 size=67108864 encryption=off
cluster_size=65536 lazy_refcounts=off refcount_bits=16
even blocks
odd blocks
fragmented
1032   <------------------------- (1)
Formatting '/tmp/test.qcow2', fmt=qcow2 size=67108864 encryption=off
cluster_size=65536 lazy_refcounts=off refcount_bits=16
sequential
main-loop: WARNING: I/O thread spun for 1000 iterations
9        <-------------------------- (2)
iris ~/tmp/2 $

Subclusters will behave exactly like (1) rather than (2). With a big
cluster (1 MB) one could expect much better performance for sequential
operations, as in (2). The file in case (1) is contiguous from the
host's point of view, correct, but it will be accessed randomly by the
guest. See the difference in the number of reads performed.

Den

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-13 13:30         ` Denis V. Lunev
  2017-04-13 13:59           ` Kevin Wolf
@ 2017-04-13 15:04           ` Alberto Garcia
  2017-04-13 15:17             ` Denis V. Lunev
  1 sibling, 1 reply; 64+ messages in thread
From: Alberto Garcia @ 2017-04-13 15:04 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-devel
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz

On Thu 13 Apr 2017 03:30:43 PM CEST, Denis V. Lunev wrote:
> Yes, block size should be increased. I perfectly in agreement with
> your.  But I think that we could do that by plain increase of the
> cluster size without any further dances. Sub-clusters as sub-clusters
> will help if we are able to avoid COW. With COW I do not see much
> difference.

I'm trying to summarize your position, tell me if I got everything
correctly:

1. We should try to reduce data fragmentation on the qcow2 file,
   because it will have a long term effect on the I/O performance (as
   opposed to an effect on the initial operations on the empty image).

2. The way to do that is to increase the cluster size (to 1MB or
   more).

3. Benefit: increasing the cluster size also decreases the amount of
   metadata (L2 and refcount).

4. Problem: L2 tables become too big and fill up the cache more
   easily. To solve this the cache code should do partial reads
   instead of complete L2 clusters.

5. Problem: larger cluster sizes also mean more data to copy when
   there's a COW. To solve this the COW code should be modified so it
   goes from 5 OPs (read head, write head, read tail, write tail,
   write data) to 2 OPs (read cluster, write modified cluster).

6. Having subclusters adds incompatible changes to the file format,
   and they offer no benefit after allocation.

7. Subclusters are only really useful if they match the guest fs block
   size (because you would avoid doing COW on allocation). Otherwise
   the only thing that you get is a faster COW (because you move less
   data), but the improvement is not dramatic and it's better if we do
   what's proposed in point 5.

8. Even if the subcluster size matches the guest block size, you'll
   get very fast initial allocation but also more chances to end up
   with a very fragmented qcow2 image, which is worse in the long run.

9. Problem: larger clusters make a less efficient use of disk space,
   but that's a drawback you're fine with considering all of the
   above.

Is that a fair summary of what you're trying to say? Anything else
missing?

Berto

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-13 15:04           ` Alberto Garcia
@ 2017-04-13 15:17             ` Denis V. Lunev
  2017-04-18 11:52               ` Alberto Garcia
  0 siblings, 1 reply; 64+ messages in thread
From: Denis V. Lunev @ 2017-04-13 15:17 UTC (permalink / raw)
  To: Alberto Garcia, qemu-devel
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz

On 04/13/2017 06:04 PM, Alberto Garcia wrote:
> On Thu 13 Apr 2017 03:30:43 PM CEST, Denis V. Lunev wrote:
>> Yes, block size should be increased. I perfectly in agreement with
>> your.  But I think that we could do that by plain increase of the
>> cluster size without any further dances. Sub-clusters as sub-clusters
>> will help if we are able to avoid COW. With COW I do not see much
>> difference.
> I'm trying to summarize your position, tell me if I got everything
> correctly:
>
> 1. We should try to reduce data fragmentation on the qcow2 file,
>    because it will have a long term effect on the I/O performance (as
>    opposed to an effect on the initial operations on the empty image).
yes

> 2. The way to do that is to increase the cluster size (to 1MB or
>    more).
yes

> 3. Benefit: increasing the cluster size also decreases the amount of
>    metadata (L2 and refcount).
yes

> 4. Problem: L2 tables become too big and fill up the cache more
>    easily. To solve this the cache code should do partial reads
>    instead of complete L2 clusters.
yes. We can read the full cluster, as we do today, if the L2 cache is empty.

> 5. Problem: larger cluster sizes also mean more data to copy when
>    there's a COW. To solve this the COW code should be modified so it
>    goes from 5 OPs (read head, write head, read tail, write tail,
>    write data) to 2 OPs (read cluster, write modified cluster).
yes, with a small tweak if the head and the tail are in different
clusters. In that case we will end up with 3 OPs.

> 6. Having subclusters adds incompatible changes to the file format,
>    and they offer no benefit after allocation.
yes

> 7. Subclusters are only really useful if they match the guest fs block
>    size (because you would avoid doing COW on allocation). Otherwise
>    the only thing that you get is a faster COW (because you move less
>    data), but the improvement is not dramatic and it's better if we do
>    what's proposed in point 5.
yes

> 8. Even if the subcluster size matches the guest block size, you'll
>    get very fast initial allocation but also more chances to end up
>    with a very fragmented qcow2 image, which is worse in the long run.
yes

> 9. Problem: larger clusters make a less efficient use of disk space,
>    but that's a drawback you're fine with considering all of the
>    above.
yes

> Is that a fair summary of what you're trying to say? Anything else
> missing?
Yes, one thing:

5a. Problem: initial cluster allocation without COW. This could be made
      cluster-size agnostic with the help of an fallocate() call. Big
      clusters are even better, as the number of such allocations is
      reduced.
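
At the syscall level 5a is roughly the following, expressed here with
the fallocate(1) utility (a sketch; the offsets and file name are made
up, and as noted earlier in the thread zero-range support depends on
the file system):

----------------------------------------------------------------
cluster=$((1024 * 1024))            # 1 MB cluster size
host_off=$((42 * cluster))          # host offset of the newly allocated cluster
# Zero the whole cluster in one call instead of writing zeroes for head/tail:
fallocate --zero-range --offset $host_off --length $cluster image.qcow2
----------------------------------------------------------------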

Thank you very much for this cool summary! I am too tongue-tied.

Den

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-13 14:27             ` Kevin Wolf
@ 2017-04-13 16:42               ` Roman Kagan
  0 siblings, 0 replies; 64+ messages in thread
From: Roman Kagan @ 2017-04-13 16:42 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Alberto Garcia, Denis V. Lunev, qemu-block, qemu-devel,
	Stefan Hajnoczi, Max Reitz

On Thu, Apr 13, 2017 at 04:27:35PM +0200, Kevin Wolf wrote:
> Am 13.04.2017 um 16:15 hat Alberto Garcia geschrieben:
> > On Thu 13 Apr 2017 03:51:55 PM CEST, Kevin Wolf wrote:
> > >> This invariant is already broken by the very design of the qcow2
> > >> format, subclusters don't really add anything new there. For any
> > >> given cluster size you can write 4k in every odd cluster, then do the
> > >> same in every even cluster, and you'll get an equally fragmented
> > >> image.
> > >
> > > Because this scenario has appeared repeatedly in this thread: Can we
> > > please use a more realistic one that shows an actual problem? Because
> > > with 8k or more for the cluster size you don't get any qcow2
> > > fragmentation with 4k even/odd writes (which is a pathological case
> > > anyway), and the file systems are clever enough to cope with it, too.
> > >
> > > Just to confirm this experimentally, I ran this short script:
> > >
> > > ----------------------------------------------------------------
> > > #!/bin/bash
> > > ./qemu-img create -f qcow2 /tmp/test.qcow2 64M
> > >
> > > echo even blocks
> > > for i in $(seq 0 32767); do echo "write $((i * 8))k 4k"; done | ./qemu-io /tmp/test.qcow2 > /dev/null
> > > echo odd blocks
> > > for i in $(seq 0 32767); do echo "write $((i * 8 + 4))k 4k"; done | ./qemu-io /tmp/test.qcow2 > /dev/null
> > >
> > > ./qemu-img map /tmp/test.qcow2
> > > filefrag -v /tmp/test.qcow2
> > > ----------------------------------------------------------------
> > 
> > But that's because while you're writing on every other 4k block the
> > cluster size is 64k, so you're effectively allocating clusters in
> > sequential order. That's why you get this:
> > 
> > > Offset          Length          Mapped to       File
> > > 0               0x4000000       0x50000         /tmp/test.qcow2
> > 
> > You would need to either have 4k clusters, or space writes even more.
> > 
> > Here's a simpler example, mkfs.ext4 on an empty drive gets you something
> > like this:
> > [...]
> 
> My point wasn't that qcow2 doesn't fragment, but that Denis and you were
> both using a really bad example. You were trying to construct an
> artificially bad image and you actually ended up constructing a perfect
> one.
> 
> > Now, I haven't measured the effect of this on I/O performance, but
> > Denis's point seems in principle valid to me.
> 
> In principle yes, but especially his fear of host file system
> fragmentation seems a bit exaggerated. If I use 64k even/odd writes in
> the script, I end up with a horribly fragmented qcow2 image, but still
> perfectly contiguous layout of the image file in the file system.
> 
> We can and probably should do something about the qcow2 fragmentation
> eventually (I guess a more intelligent cluster allocation strategy could
> go a long way there), but I wouldn't worry to much about the host file
> system.

I beg to disagree.  I didn't have QEMU with subcluster allocation
enabled (you did, didn't you?) so I went ahead with a raw file:

# truncate --size 64k bbb
# filefrag -v bbb
Filesystem type is: ef53
File size of bbb is 65536 (16 blocks of 4096 bytes)
bbb: 0 extents found
# for i in {0..7}; do echo write $[(i * 2) * 4]k 4k; done | qemu-io bbb
...
# for i in {0..7}; do echo write $[(i * 2 + 1) * 4]k 4k; done | qemu-io bbb
...
# filefrag -v bbb
Filesystem type is: ef53
File size of bbb is 65536 (16 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       1:   65860793..  65860794:      2:
   1:        2..       2:   65859644..  65859644:      1:   65860795:
   2:        3..       3:   65859651..  65859651:      1:   65859645:
   3:        4..       4:   65859645..  65859645:      1:   65859652:
   4:        5..       5:   65859652..  65859652:      1:   65859646:
   5:        6..       6:   65859646..  65859646:      1:   65859653:
   6:        7..       7:   65859653..  65859653:      1:   65859647:
   7:        8..       8:   65859647..  65859647:      1:   65859654:
   8:        9..       9:   65859654..  65859654:      1:   65859648:
   9:       10..      10:   65859648..  65859648:      1:   65859655:
  10:       11..      11:   65859655..  65859655:      1:   65859649:
  11:       12..      12:   65859649..  65859649:      1:   65859656:
  12:       13..      13:   65859656..  65859656:      1:   65859650:
  13:       14..      14:   65859650..  65859650:      1:   65859657:
  14:       15..      15:   65859657..  65859657:      1:   65859651: last,eof
bbb: 15 extents found

So the host filesystem did a very poor job here (ext4 on top of two-way
raid0 on top of rotating disks).

Naturally, replacing truncate with fallocate in the above example gives
no fragmentation:

...
# filefrag -v bbb
Filesystem type is: ef53
File size of bbb is 65536 (16 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      15:  183616784.. 183616799:     16:             last,eof
bbb: 1 extent found

Roman.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-13 10:19         ` Denis V. Lunev
@ 2017-04-14  1:06           ` John Snow
  2017-04-14  4:17             ` Denis V. Lunev
  2017-04-14  7:40             ` Roman Kagan
  0 siblings, 2 replies; 64+ messages in thread
From: John Snow @ 2017-04-14  1:06 UTC (permalink / raw)
  To: Denis V. Lunev, Kevin Wolf
  Cc: qemu-block, qemu-devel, Max Reitz, Stefan Hajnoczi



On 04/13/2017 06:19 AM, Denis V. Lunev wrote:
> On 04/13/2017 12:44 PM, Kevin Wolf wrote:
>> Am 12.04.2017 um 21:02 hat Denis V. Lunev geschrieben:
>>> On 04/12/2017 09:20 PM, Eric Blake wrote:
>>>> On 04/12/2017 12:55 PM, Denis V. Lunev wrote:
>>>>> Let me rephrase a bit.
>>>>>
>>>>> The proposal is looking very close to the following case:
>>>>> - raw sparse file
>>>>>
>>>>> In this case all writes are very-very-very fast and from the
>>>>> guest point of view all is OK. Sequential data is really sequential.
>>>>> Though once we are starting to perform any sequential IO, we
>>>>> have real pain. Each sequential operation becomes random
>>>>> on the host file system and the IO becomes very slow. This
>>>>> will not be observed with the test, but the performance will
>>>>> degrade very soon.
>>>>>
>>>>> This is why raw sparse files are not used in the real life.
>>>>> Hypervisor must maintain guest OS invariants and the data,
>>>>> which is nearby from the guest point of view should be kept
>>>>> nearby in host.
>>>>>
>>>>> This is why actually that 64kb data blocks are extremely
>>>>> small :) OK. This is offtopic.
>>>> Not necessarily. Using subclusters may allow you to ramp up to larger
>>>> cluster sizes. We can also set up our allocation (and pre-allocation
>>>> schemes) so that we always reserve an entire cluster on the host at the
>>>> time we allocate the cluster, even if we only plan to write to
>>>> particular subclusters within that cluster.  In fact, 32 subclusters to
>>>> a 2M cluster results in 64k subclusters, where you are still writing at
>>>> 64k data chunks but could now have guaranteed 2M locality, compared to
>>>> the current qcow2 with 64k clusters that writes in 64k data chunks but
>>>> with no locality.
>>>>
>>>> Just because we don't write the entire cluster up front does not mean
>>>> that we don't have to allocate (or have a mode that allocates) the
>>>> entire cluster at the time of the first subcluster use.
>>> this is something that I do not understand. We reserve the entire cluster at
>>> allocation. Why do we need sub-clusters at cluster "creation" without COW?
>>> fallocate() and preallocation completely covers this stage for now in
>>> full and
>>> solve all botllenecks we have. 4k/8k granularity of L2 cache solves metadata
>>> write problem. But IMHO it is not important. Normally we sync metadata
>>> at guest sync.
>>>
>>> The only difference I am observing in this case is "copy-on-write" pattern
>>> of the load with backing store or snapshot, where we copy only partial
>>> cluster.
>>> Thus we should clearly define that this is the only area of improvement and
>>> start discussion from this point. Simple cluster creation is not the problem
>>> anymore. I think that this reduces the scope of the proposal a lot.
>> I think subclusters have two different valid use cases:
>>
>> 1. The first one is what you describe and what I was mostly interested
>>    in: By reducing the subcluster size to the file system block size, we
>>    can completely avoid any COW because there won't be partial writes
>>    any more.
>>
>>    You attended my KVM Forum talk two years ago where I described how
>>    COW is the biggest pain point for qcow2 performance, costing about
>>    50% of performance for initial writes after taking a snapshot. So
>>    while you're right that many other improvements are possible, I think
>>    this is one of the most important points to address.
>>
>> 2. The other use case is what Berto had in mind: Keep subclusters at the
>>    current cluster size (to avoid getting even larger COWs), but
>>    increase the cluster size. This reduces the metadata size and allows
>>    to cover a larger range with the same L2 cache size. Additionally, it
>>    can possibly reduce fragmentation of the image on the file system
>>    level.
>>
>> The fundamental observation in both cases is that it's impractical to
>> use the same granularity for cluster mapping (want larger sizes) and for
>> status tracking (want small sizes to avoid partial writes).
> Fragmented images are very good at the initial moment once the
> IO is coming first time in small pieces. But it becomes real
> pain later on once guest will REUSE this area for sequential
> data written in big chunks. Or the guest could even read this
> data sequentially later on.
> 
> You will have random read in host instead of sequential read in
> guest. This would be serious performance problem. The guest
> really believes that the data with similar LBAs are adjacent and
> optimize IO keeping this in mind. With clusters/subclusters/etc
> you break this fundamental assumption. This is not visible initially,
> but will trigger later on.
> 

Hi Denis,

I've read this entire thread now and I really like Berto's summary which
I think is one of the best recaps of existing qcow2 problems and this
discussion so far.

I understand your opinion that we should focus on compatible changes
before incompatible ones, and I also understand that you are very
concerned about physical fragmentation for reducing long-term IO.

What I don't understand is why you think that subclusters will increase
fragmentation. If we admit that fragmentation is a problem now, surely
increasing cluster sizes to 1 or 2 MB will only help to *reduce*
physical fragmentation, right?

Subclusters as far as I am understanding them will not actually allow
subclusters to be located at virtually disparate locations, we will
continue to allocate clusters as normal -- we'll just keep track of
which portions of the cluster we've written to to help us optimize COW*.

So if we have a 1MB cluster with 64k subclusters as a hypothetical, if
we write just the first subcluster, we'll have a map like:

X---------------

Whatever actually happens to exist in this space, whether it be a hole
we punched via fallocate or literal zeroes, this space is known to the
filesystem to be contiguous.

If we write to the last subcluster, we'll get:

X--------------X

And again, maybe the dashes are a fallocate hole, maybe they're zeroes.
but the last subcluster is located virtually exactly 15 subclusters
behind the first, they're not physically contiguous. We've saved the
space between them. Future out-of-order writes won't contribute to any
fragmentation, at least at this level.
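
To make the "known to be contiguous" idea concrete at the plain-file
level (a sketch; the file name is made up, and this presumes the
cluster's host space is preallocated with fallocate() rather than left
as a truncate-style hole; compare with the experiment later in this
thread):

----------------------------------------------------------------
fallocate -l 1M cluster.bin        # preallocate one 1 MB "cluster"
for i in 15 0 7 3 11; do           # fill some 64k "subclusters" out of order
    dd if=/dev/zero of=cluster.bin bs=64k seek=$i count=1 conv=notrunc status=none
done
filefrag -v cluster.bin            # typically still a single extent
----------------------------------------------------------------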

You might be able to reduce COW from 5 IOPs to 3 IOPs, but if we tune
the subclusters right, we'll have *zero*, won't we?

As far as I can tell, this lets us do a lot of good things all at once:

(1) We get some COW optimizations (for reasons Berto and Kevin have
detailed)
(2) We can increase our cluster size
(3) L2 cache can cover bigger files with less memory
(4) Fragmentation decreases.

Is this not a win all around? We can improve throughput for a lot of
different reasons all at once, here. Have I misunderstood the discussion
so far, anyone?

Please correct me where I am wrong and _ELI5_.

Thanks,
-John

> 
>>> Initial proposal starts from stating 2 problems:
>>>
>>> "1) Reading from or writing to a qcow2 image involves reading the
>>>    corresponding entry on the L2 table that maps the guest address to
>>>    the host address. This is very slow because it involves two I/O
>>>    operations: one on the L2 table and the other one on the actual
>>>    data cluster.
>>>
>>> 2) A cluster is the smallest unit of allocation. Therefore writing a
>>>    mere 512 bytes to an empty disk requires allocating a complete
>>>    cluster and filling it with zeroes (or with data from the backing
>>>    image if there is one). This wastes more disk space and also has a
>>>    negative impact on I/O."
>>>
>>> With pre-allocation (2) would be exactly the same as now and all
>>> gain with sub-clusters will be effectively 0 as we will have to
>>> preallocate entire cluster.
>> To be honest, I'm not sure if we should do preallocation or whether
>> that's the file system's job.
>>
>> In any case, the big improvement is that we don't have to read from the
>> backing file, so even if we keep preallocating the whole cluster, we'd
>> gain something there. I also think that preallocating would use
>> something like fallocate() rather than pwrite(), so it should involve a
>> lot less I/O.
>>
> yes. fallocate() works pretty well and we will have almost
> the same amount of IO as submitted from the guest. This
> also helps a lot for sequential writes not aligned to the
> cluster boundary.
> 
> 
>>> (1) is also questionable. I think that the root of the problem
>>> is the cost of L2 cache miss, which is giant. With 1 Mb or 2 Mb
>>> cluster the cost of the cache miss is not acceptable at all.
>>> With page granularity of L2 cache this problem is seriously
>>> reduced. We can switch to bigger blocks without much problem.
>>> Again, the only problem is COW.
>> Yes, with larger cluster sizes, the L2 cache sucks currently. It needs
>> to be able to cache partial tables.
>>
>> With the currently common cluster sizes, I don't think it makes a big
>> difference, though. As long as you end up making a single request, 4k or
>> 64k size isn't that different, and three 4k requests are almost certainly
>> slower than one 64k request.
>>
>> So I agree, for use case 2, some work on the cache is required in
>> addition to increasing the cluster size.
> This helps even with 64kb cluster size. For the case of the fragmented guest
> filesystem the dataset could be quite fragmented and we an technically use
> much less memory to cover it if we'll use pages clusters.
> 
> 
>>> There are really a lot of other possibilities for viable optimizations,
>>> which
>>> are not yet done on top of proposed ones:
>>> - IO plug/unplug support at QCOW2 level. plug in controller is definitely
>>>   not enough. This affects only the first IO operation while we could have
>>>   a bunch of them
>>> - sort and merge requests list in submit
>>> - direct AIO read/write support to avoid extra coroutine creation for
>>>   read-write ops if we are doing several operations in parallel in
>>>   qcow2_co_readv/writev. Right now AIO operations are emulated
>>>   via coroutines which have some impact
>>> - offload compression/decompression/encryption to side thread
>>> - optimize sequential write operation not aligned to the cluster boundary
>>>   if cluster is not allocated initially
>>>
>>> May be it would be useful to create intermediate DIO structure for IO
>>> operation which will carry offset/iovec on it like done in kernel. I do
>>> think
>>> that such compatible changes could improve raw performance even
>>> with the current format 2-3 times, which is brought out by the proposal.
>> Well, if you can show a concrete optimisation and ideally also numbers,
>> that would certainly be interesting. I am very skeptical of 2-3 times
>> improvement (not the least because we would be well over native
>> performance then...), but I'm happy to be convinced otherwise. Maybe
>> start a new thread like this one if/when you think you have a detailled
>> idea for one of them that is ready for discussion.
>>
>> The one that I actually think could make a big difference for its use
>> case is the compression/decompression/encryption one, but honestly, if
>> someone really cared about these, I think there would be lower hanging
>> fruit (e.g. we're currently reading in data only cluster by cluster for
>> compressed images).
> In progress. Compression has a lot of troubles ;)
> 
> Den
> 
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-14  1:06           ` [Qemu-devel] [Qemu-block] " John Snow
@ 2017-04-14  4:17             ` Denis V. Lunev
  2017-04-18 11:22               ` Kevin Wolf
  2017-04-14  7:40             ` Roman Kagan
  1 sibling, 1 reply; 64+ messages in thread
From: Denis V. Lunev @ 2017-04-14  4:17 UTC (permalink / raw)
  To: John Snow, Kevin Wolf; +Cc: qemu-block, qemu-devel, Max Reitz, Stefan Hajnoczi

[skipped...]

> Hi Denis,
>
> I've read this entire thread now and I really like Berto's summary which
> I think is one of the best recaps of existing qcow2 problems and this
> discussion so far.
>
> I understand your opinion that we should focus on compatible changes
> before incompatible ones, and I also understand that you are very
> concerned about physical fragmentation for reducing long-term IO.
>
> What I don't understand is why you think that subclusters will increase
> fragmentation. If we admit that fragmentation is a problem now, surely
> increasing cluster sizes to 1 or 2 MB will only help to *reduce*
> physical fragmentation, right?
>
> Subclusters as far as I am understanding them will not actually allow
> subclusters to be located at virtually disparate locations, we will
> continue to allocate clusters as normal -- we'll just keep track of
> which portions of the cluster we've written to to help us optimize COW*.
>
> So if we have a 1MB cluster with 64k subclusters as a hypothetical, if
> we write just the first subcluster, we'll have a map like:
>
> X---------------
>
> Whatever actually happens to exist in this space, whether it be a hole
> we punched via fallocate or literal zeroes, this space is known to the
> filesystem to be contiguous.
>
> If we write to the last subcluster, we'll get:
>
> X--------------X
>
> And again, maybe the dashes are a fallocate hole, maybe they're zeroes.
> but the last subcluster is located virtually exactly 15 subclusters
> behind the first, they're not physically contiguous. We've saved the
> space between them. Future out-of-order writes won't contribute to any
> fragmentation, at least at this level.
>
> You might be able to reduce COW from 5 IOPs to 3 IOPs, but if we tune
> the subclusters right, we'll have *zero*, won't we?
>
> As far as I can tell, this lets us do a lot of good things all at once:
>
> (1) We get some COW optimizations (for reasons Berto and Kevin have
> detailed)
Yes. We are fine with COW. But let us assume that the guest reads the
entire cluster after the COW, in the situation

X--------------X

with a backing store. This is possible even with a 1-2 MB cluster size;
I have seen 4-5 MB requests from the guest in real life. In this case
we will need 3 I/O operations:
    read the left X area, read the backing file, read the right X area.
This is the real drawback of the approach if the sub-cluster size is
small enough, which it should be for optimal COW. Thus we will have
random I/O in the host instead of sequential I/O in the guest. We have
optimized COW at the cost of long-term reads, and this is what worries
me, as we can see a lot of such reads before the data changes again.
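
This can be approximated today with the strace trick from earlier in
the thread, using 64 KB clusters within a 2 MB region as a stand-in
for subclusters within a big cluster (a sketch; image names are made
up and metadata reads inflate the count):

----------------------------------------------------------------
qemu-img create -f qcow2 base.qcow2 64M > /dev/null
qemu-io -c 'write -P 0xbb 0 2M' base.qcow2 > /dev/null
qemu-img create -f qcow2 -b base.qcow2 top.qcow2 > /dev/null
# Allocate only the two ends of the 2 MB region in the overlay:
qemu-io -c 'write 0 64k' -c 'write 1984k 64k' top.qcow2 > /dev/null
# One sequential 2 MB guest read turns into several host reads
# (overlay, backing, overlay), plus metadata reads:
strace -f -e trace=pread64,preadv qemu-io -c 'read 0 2M' top.qcow2 2>&1 \
    | grep -cE 'pread64|preadv'
----------------------------------------------------------------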

With real holes the situation is even worse. If we have a real hole
(obtained with truncate), we are in the case mentioned by Roman:
virtually the space is contiguous, but we get host fragmentation at
sub-cluster granularity.

We are in a tough position even with COW. If the sub-cluster size does
not equal the guest filesystem block size, we still need to do COW. The
only win is the amount of data copied; the cost in I/O operations stays
the same. On the other hand, to get a real win we need properly aligned
guest partitions, we need to know the exact guest filesystem block
size, and so on. This is not as easy as it seems.

> (2) We can increase our cluster size
> (3) L2 cache can cover bigger files with less memory
> (4) Fragmentation decreases.
yes. I like (2)-(4) a lot. But in my opinion we could stick to 1 MB
clusters without sub-clusters and still get (2)-(4).

>
> Is this not a win all around? We can improve throughput for a lot of
> different reasons all at once, here. Have I misunderstood the discussion
> so far, anyone?
>
> Please correct me where I am wrong and _ELI5_.
I have tried to, sorry for my illiteracy ;)

Den

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-14  1:06           ` [Qemu-devel] [Qemu-block] " John Snow
  2017-04-14  4:17             ` Denis V. Lunev
@ 2017-04-14  7:40             ` Roman Kagan
  1 sibling, 0 replies; 64+ messages in thread
From: Roman Kagan @ 2017-04-14  7:40 UTC (permalink / raw)
  To: John Snow
  Cc: Denis V. Lunev, Kevin Wolf, Stefan Hajnoczi, qemu-devel,
	qemu-block, Max Reitz

On Thu, Apr 13, 2017 at 09:06:19PM -0400, John Snow wrote:
> So if we have a 1MB cluster with 64k subclusters as a hypothetical, if
> we write just the first subcluster, we'll have a map like:
> 
> X---------------
> 
> Whatever actually happens to exist in this space, whether it be a hole
> we punched via fallocate or literal zeroes, this space is known to the
> filesystem to be contiguous.
> 
> If we write to the last subcluster, we'll get:
> 
> X--------------X
> 
> And again, maybe the dashes are a fallocate hole, maybe they're zeroes.
> but the last subcluster is located virtually exactly 15 subclusters
> behind the first, they're not physically contiguous. We've saved the
> space between them. Future out-of-order writes won't contribute to any
> fragmentation, at least at this level.

Yeah I think this is where the confusion lies.  You apparently assume
that the filesystem is smart enough to compensate for the subclusters
being sparse within a cluster, and will make them eventually contiguous
on the *media* once they are all written.  Denis is claiming the
opposite.  I posted a simple experiment with a 64kB sparse file written
out of order which ended up being 16 disparate blocks on the platters
(ext4; with xfs this may be different), and this is obviously
detrimental for performance with rotating disks.

Note also that if the filesystem actually is smart enough to keep the
subclusters contiguous even when they are written out of order,
apparently by not allowing blocks from other files to take the as-yet
unused space between sparse subclusters, the disk space saving becomes
not so obvious.

> You might be able to reduce COW from 5 IOPs to 3 IOPs, but if we tune
> the subclusters right, we'll have *zero*, won't we?

Right, this is an attractive advantage.  We still need to test whether
later access to such interleaved clusters isn't degraded, though.

Roman.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-14  4:17             ` Denis V. Lunev
@ 2017-04-18 11:22               ` Kevin Wolf
  2017-04-18 17:30                 ` Denis V. Lunev
  0 siblings, 1 reply; 64+ messages in thread
From: Kevin Wolf @ 2017-04-18 11:22 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: John Snow, qemu-block, qemu-devel, Max Reitz, Stefan Hajnoczi

Am 14.04.2017 um 06:17 hat Denis V. Lunev geschrieben:
> [skipped...]
> 
> > Hi Denis,
> >
> > I've read this entire thread now and I really like Berto's summary which
> > I think is one of the best recaps of existing qcow2 problems and this
> > discussion so far.
> >
> > I understand your opinion that we should focus on compatible changes
> > before incompatible ones, and I also understand that you are very
> > concerned about physical fragmentation for reducing long-term IO.
> >
> > What I don't understand is why you think that subclusters will increase
> > fragmentation. If we admit that fragmentation is a problem now, surely
> > increasing cluster sizes to 1 or 2 MB will only help to *reduce*
> > physical fragmentation, right?
> >
> > Subclusters as far as I am understanding them will not actually allow
> > subclusters to be located at virtually disparate locations, we will
> > continue to allocate clusters as normal -- we'll just keep track of
> > which portions of the cluster we've written to to help us optimize COW*.
> >
> > So if we have a 1MB cluster with 64k subclusters as a hypothetical, if
> > we write just the first subcluster, we'll have a map like:
> >
> > X---------------
> >
> > Whatever actually happens to exist in this space, whether it be a hole
> > we punched via fallocate or literal zeroes, this space is known to the
> > filesystem to be contiguous.
> >
> > If we write to the last subcluster, we'll get:
> >
> > X--------------X
> >
> > And again, maybe the dashes are a fallocate hole, maybe they're zeroes.
> > but the last subcluster is located virtually exactly 15 subclusters
> > behind the first, they're not physically contiguous. We've saved the
> > space between them. Future out-of-order writes won't contribute to any
> > fragmentation, at least at this level.
> >
> > You might be able to reduce COW from 5 IOPs to 3 IOPs, but if we tune
> > the subclusters right, we'll have *zero*, won't we?
> >
> > As far as I can tell, this lets us do a lot of good things all at once:
> >
> > (1) We get some COW optimizations (for reasons Berto and Kevin have
> > detailed)
> Yes. We are fine with COW. Let us assume that we will have issued read
> entire cluster command after the COW, in the situation
> 
> X--------------X
> 
> with a backing store. This is possible even with 1-2 Mb cluster size.
> I have seen 4-5 Mb requests from the guest in the real life. In this
> case we will have 3 IOP:
>     read left X area, read backing, read right X.
> This is the real drawback of the approach, if sub-cluster size is really
> small enough, which should be the case for optimal COW. Thus we
> will have random IO in the host instead of sequential one in guest.
> Thus we have optimized COW at the cost of long term reads. This
> is what I am worrying about as we can have a lot of such reads before
> any further data change.

So just to avoid misunderstandings about what you're comparing here:
You get these 3 iops for 2 MB clusters with 64k subclusters, whereas you
would get only a single operation for 2 MB clusters without subclusters.
Today's 64k clusters without subclusters behave no better than the
2M/64k version, but that's not what you're comparing.

Yes, you are correct about this observation. But it is a tradeoff that
you're intentionally making when using backing files. In the extreme,
there is an alternative that performs so much better: Instead of using a
backing file, use 'qemu-img convert' to copy (and defragment) the whole
image upfront. No COW whatsoever, no fragmentation, fast reads. The
downside is that it takes a while to copy the whole image upfront, and
it also costs quite a bit of disk space.
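
For illustration, that extreme is essentially this (a sketch; image
names assumed):

----------------------------------------------------------------
# Flatten the chain upfront: the result has no backing file and is
# laid out contiguously, at the cost of time and disk space.
qemu-img convert -p -O qcow2 overlay.qcow2 standalone.qcow2
----------------------------------------------------------------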

So once we acknowledge that we're dealing with a tradeoff here, it
becomes obvious that neither the extreme setup for performance (copy the
whole image upfront) nor the extreme setup for sparseness (COW on a
sector level) are the right default for the average case, nor is
optimising one-sidedly a good idea. It is good if we can provide
solutions for extreme cases, but by default we need to cater for the
average case, which cares both about reasonable performance and disk
usage.

Kevin

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-13 15:17             ` Denis V. Lunev
@ 2017-04-18 11:52               ` Alberto Garcia
  2017-04-18 17:27                 ` Denis V. Lunev
  0 siblings, 1 reply; 64+ messages in thread
From: Alberto Garcia @ 2017-04-18 11:52 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-devel
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz

On Thu 13 Apr 2017 05:17:21 PM CEST, Denis V. Lunev wrote:
> On 04/13/2017 06:04 PM, Alberto Garcia wrote:
>> On Thu 13 Apr 2017 03:30:43 PM CEST, Denis V. Lunev wrote:
>>> Yes, block size should be increased. I perfectly in agreement with
>>> your.  But I think that we could do that by plain increase of the
>>> cluster size without any further dances. Sub-clusters as sub-clusters
>>> will help if we are able to avoid COW. With COW I do not see much
>>> difference.
>> I'm trying to summarize your position, tell me if I got everything
>> correctly:
>>
>> 1. We should try to reduce data fragmentation on the qcow2 file,
>>    because it will have a long term effect on the I/O performance (as
>>    opposed to an effect on the initial operations on the empty image).
> yes
>
>> 2. The way to do that is to increase the cluster size (to 1MB or
>>    more).
> yes
>
>> 3. Benefit: increasing the cluster size also decreases the amount of
>>    metadata (L2 and refcount).
> yes
>
>> 4. Problem: L2 tables become too big and fill up the cache more
>>    easily. To solve this the cache code should do partial reads
>>    instead of complete L2 clusters.
> yes. We can read full cluster as originally if L2 cache is empty.
>
>> 5. Problem: larger cluster sizes also mean more data to copy when
>>    there's a COW. To solve this the COW code should be modified so it
>>    goes from 5 OPs (read head, write head, read tail, write tail,
>>    write data) to 2 OPs (read cluster, write modified cluster).
> yes, with small tweak if head and tail are in different clusters. In
> this case we
> will end up with 3 OPs.
>
>> 6. Having subclusters adds incompatible changes to the file format,
>>    and they offer no benefit after allocation.
> yes
>
>> 7. Subclusters are only really useful if they match the guest fs block
>>    size (because you would avoid doing COW on allocation). Otherwise
>>    the only thing that you get is a faster COW (because you move less
>>    data), but the improvement is not dramatic and it's better if we do
>>    what's proposed in point 5.
> yes
>
>> 8. Even if the subcluster size matches the guest block size, you'll
>>    get very fast initial allocation but also more chances to end up
>>    with a very fragmented qcow2 image, which is worse in the long run.
> yes
>
>> 9. Problem: larger clusters make a less efficient use of disk space,
>>    but that's a drawback you're fine with considering all of the
>>    above.
> yes
>
>> Is that a fair summary of what you're trying to say? Anything else
>> missing?
> yes.
>
> 5a. Problem: initial cluster allocation without COW. Could be made
>       cluster-size agnostic with the help of fallocate() call. Big
> clusters are even
>       better as the amount of such allocations is reduced.
>
> Thank you very much for this cool summary! I am too tongue-tied.

Hi Denis,

I don't have the data to verify all your claims here, but in general
what you say makes sense.

Although I'm not sure if I agree with everything (especially on whether
any of this applies to SSD drives at all) it seems that we all agree
that the COW algorithm can be improved, so perhaps I should start by
taking a look at that.

Regards,

Berto

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-18 11:52               ` Alberto Garcia
@ 2017-04-18 17:27                 ` Denis V. Lunev
  0 siblings, 0 replies; 64+ messages in thread
From: Denis V. Lunev @ 2017-04-18 17:27 UTC (permalink / raw)
  To: Alberto Garcia, qemu-devel
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz

On 04/18/2017 02:52 PM, Alberto Garcia wrote:
> On Thu 13 Apr 2017 05:17:21 PM CEST, Denis V. Lunev wrote:
>> On 04/13/2017 06:04 PM, Alberto Garcia wrote:
>>> On Thu 13 Apr 2017 03:30:43 PM CEST, Denis V. Lunev wrote:
>>>> Yes, block size should be increased. I perfectly in agreement with
>>>> your.  But I think that we could do that by plain increase of the
>>>> cluster size without any further dances. Sub-clusters as sub-clusters
>>>> will help if we are able to avoid COW. With COW I do not see much
>>>> difference.
>>> I'm trying to summarize your position, tell me if I got everything
>>> correctly:
>>>
>>> 1. We should try to reduce data fragmentation on the qcow2 file,
>>>    because it will have a long term effect on the I/O performance (as
>>>    opposed to an effect on the initial operations on the empty image).
>> yes
>>
>>> 2. The way to do that is to increase the cluster size (to 1MB or
>>>    more).
>> yes
>>
>>> 3. Benefit: increasing the cluster size also decreases the amount of
>>>    metadata (L2 and refcount).
>> yes
>>
>>> 4. Problem: L2 tables become too big and fill up the cache more
>>>    easily. To solve this the cache code should do partial reads
>>>    instead of complete L2 clusters.
>> yes. We can read full cluster as originally if L2 cache is empty.
>>
>>> 5. Problem: larger cluster sizes also mean more data to copy when
>>>    there's a COW. To solve this the COW code should be modified so it
>>>    goes from 5 OPs (read head, write head, read tail, write tail,
>>>    write data) to 2 OPs (read cluster, write modified cluster).
>> yes, with small tweak if head and tail are in different clusters. In
>> this case we
>> will end up with 3 OPs.
>>
>>> 6. Having subclusters adds incompatible changes to the file format,
>>>    and they offer no benefit after allocation.
>> yes
>>
>>> 7. Subclusters are only really useful if they match the guest fs block
>>>    size (because you would avoid doing COW on allocation). Otherwise
>>>    the only thing that you get is a faster COW (because you move less
>>>    data), but the improvement is not dramatic and it's better if we do
>>>    what's proposed in point 5.
>> yes
>>
>>> 8. Even if the subcluster size matches the guest block size, you'll
>>>    get very fast initial allocation but also more chances to end up
>>>    with a very fragmented qcow2 image, which is worse in the long run.
>> yes
>>
>>> 9. Problem: larger clusters make a less efficient use of disk space,
>>>    but that's a drawback you're fine with considering all of the
>>>    above.
>> yes
>>
>>> Is that a fair summary of what you're trying to say? Anything else
>>> missing?
>> yes.
>>
>> 5a. Problem: initial cluster allocation without COW. Could be made
>>       cluster-size agnostic with the help of fallocate() call. Big
>> clusters are even
>>       better as the amount of such allocations is reduced.
>>
>> Thank you very much for this cool summary! I am too tongue-tied.
> Hi Denis,
>
> I don't have the have data to verify all your claims here, but in
> general what you say makes sense.
>
> Although I'm not sure if I agree with everything (especially on whether
> any of this applies to SSD drives at all) it seems that we all agree
> that the COW algorithm can be improved, so perhaps I should start by
> taking a look at that.
>
> Regards,
>
> Berto
I understand. I just wanted to raise another possible (compatible)
approach, which could help.

Den

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-18 11:22               ` Kevin Wolf
@ 2017-04-18 17:30                 ` Denis V. Lunev
  0 siblings, 0 replies; 64+ messages in thread
From: Denis V. Lunev @ 2017-04-18 17:30 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: John Snow, qemu-block, qemu-devel, Max Reitz, Stefan Hajnoczi

On 04/18/2017 02:22 PM, Kevin Wolf wrote:
> Am 14.04.2017 um 06:17 hat Denis V. Lunev geschrieben:
>> [skipped...]
>>
>>> Hi Denis,
>>>
>>> I've read this entire thread now and I really like Berto's summary which
>>> I think is one of the best recaps of existing qcow2 problems and this
>>> discussion so far.
>>>
>>> I understand your opinion that we should focus on compatible changes
>>> before incompatible ones, and I also understand that you are very
>>> concerned about physical fragmentation for reducing long-term IO.
>>>
>>> What I don't understand is why you think that subclusters will increase
>>> fragmentation. If we admit that fragmentation is a problem now, surely
>>> increasing cluster sizes to 1 or 2 MB will only help to *reduce*
>>> physical fragmentation, right?
>>>
>>> Subclusters as far as I am understanding them will not actually allow
>>> subclusters to be located at virtually disparate locations, we will
>>> continue to allocate clusters as normal -- we'll just keep track of
>>> which portions of the cluster we've written to to help us optimize COW*.
>>>
>>> So if we have a 1MB cluster with 64k subclusters as a hypothetical, if
>>> we write just the first subcluster, we'll have a map like:
>>>
>>> X---------------
>>>
>>> Whatever actually happens to exist in this space, whether it be a hole
>>> we punched via fallocate or literal zeroes, this space is known to the
>>> filesystem to be contiguous.
>>>
>>> If we write to the last subcluster, we'll get:
>>>
>>> X--------------X
>>>
>>> And again, maybe the dashes are a fallocate hole, maybe they're zeroes.
>>> but the last subcluster is located virtually exactly 15 subclusters
>>> behind the first, they're not physically contiguous. We've saved the
>>> space between them. Future out-of-order writes won't contribute to any
>>> fragmentation, at least at this level.
>>>
>>> You might be able to reduce COW from 5 IOPs to 3 IOPs, but if we tune
>>> the subclusters right, we'll have *zero*, won't we?
>>>
>>> As far as I can tell, this lets us do a lot of good things all at once:
>>>
>>> (1) We get some COW optimizations (for reasons Berto and Kevin have
>>> detailed)
>> Yes. We are fine with COW. Let us assume that we will have issued read
>> entire cluster command after the COW, in the situation
>>
>> X--------------X
>>
>> with a backing store. This is possible even with 1-2 Mb cluster size.
>> I have seen 4-5 Mb requests from the guest in the real life. In this
>> case we will have 3 IOP:
>>     read left X area, read backing, read right X.
>> This is the real drawback of the approach, if sub-cluster size is really
>> small enough, which should be the case for optimal COW. Thus we
>> will have random IO in the host instead of sequential one in guest.
>> Thus we have optimized COW at the cost of long term reads. This
>> is what I am worrying about as we can have a lot of such reads before
>> any further data change.
> So just to avoid misunderstandings about what you're comparing here:
> You get these 3 iops for 2 MB clusters with 64k subclusters, whereas you
> would get only a single operation for 2 MB clusters without subclusters.
> Today's 64k clusters without subclusters behave no better than the
> 2M/64k version, but that's not what you're comparing.
>
> Yes, you are correct about this observation. But it is a tradeoff that
> you're intentionally making when using backing files. In the extreme,
> there is an alternative that performs so much better: Instead of using a
> backing file, use 'qemu-img convert' to copy (and defragment) the whole
> image upfront. No COW whatsoever, no fragmentation, fast reads. The
> downside is that it takes a while to copy the whole image upfront, and
> it also costs quite a bit of disk space.
In general, for production environments, this is a total pain. We have
a lot of customers with TB-sized images. Free space is also a real
problem for them.


> So once we acknowledge that we're dealing with a tradeoff here, it
> becomes obvious that neither the extreme setup for performance (copy the
> whole image upfront) nor the extreme setup for sparseness (COW on a
> sector level) are the right default for the average case, nor is
> optimising one-sidedly a good idea. It is good if we can provide
> solutions for extreme cases, but by default we need to cater for the
> average case, which cares both about reasonable performance and disk
> usage.
Yes, I agree. But a 64 KB cluster size by default for big images (not for
backups!) is another extreme ;) With a 1 TB or 10 TB image, who cares
about a few extra MBs?

Please note that 1 MB is the better default cluster size, because at that
request size sequential and random writes perform comparably on non-SSD
disks.
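
A rough back-of-the-envelope supporting that, assuming a typical 7200 rpm
disk with ~8 ms average positioning time (seek + rotational latency) and
~125 MB/s sequential throughput (both numbers are ballpark assumptions):

   64 KB request:  ~0.5 ms transfer vs ~8 ms positioning -> ~94% of the
                   time is spent seeking
    1 MB request:  ~8 ms transfer   vs ~8 ms positioning -> seek cost and
                   transfer cost are about equal

So somewhere around a 1 MB request size the positioning cost stops
dominating and random writes get within a small factor of sequential
throughput.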

Den


* [Qemu-devel] proposed qcow2 extension: cluster reservations [was: [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-06 16:40 ` Eric Blake
  2017-04-07  8:49   ` Alberto Garcia
  2017-04-07 12:41   ` Kevin Wolf
@ 2017-04-21 21:09   ` Eric Blake
  2017-04-22 17:56     ` Max Reitz
  2 siblings, 1 reply; 64+ messages in thread
From: Eric Blake @ 2017-04-21 21:09 UTC (permalink / raw)
  To: Alberto Garcia, qemu-devel
  Cc: Kevin Wolf, qemu-block, Stefan Hajnoczi, Max Reitz


On 04/06/2017 11:40 AM, Eric Blake wrote:

>> === Changes to the on-disk format ===
>>
>> The qcow2 on-disk format needs to change so each L2 entry has a bitmap
>> indicating the allocation status of each subcluster. There are three
>> possible states (unallocated, allocated, all zeroes), so we need two
>> bits per subcluster.
> 
> You also have to add a new incompatible feature bit, so that older tools
> know they can't read the new image correctly, and therefore don't
> accidentally corrupt it.

As long as we are talking about incompatible feature bits, I had some
thoughts about image mapping today.

tl;dr: summary
As long as we are considering incompatible features, maybe we should
make it easier to have an image file that explicitly preserves
guest=>host mapping, such that nothing the guest can do will reorder the
mapping.  This way, it would be possible to fully preallocate an image
such that all guest offsets are adjusted by a constant offset to become
the corresponding host offset (basically, no qcow2 metadata is
interleaved in the middle of guest data).

I don't see any way to do it on current qcow2 images, but with
subclusters, you get it for free by having a cluster with an offset but
with all subclusters marked as unallocated.  But perhaps it is something
orthogonal enough that we would want a separate incompatible feature bit
for just this, without having subclusters at the same time.
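
To make that concrete, here is a minimal standalone C sketch of such a
"reserved but unallocated" extended L2 entry.  The packing is an assumption
for illustration only (a classic 64-bit entry plus a second 64-bit word
holding 32 subclusters x 2 bits each); the real on-disk encoding is exactly
what this proposal still has to nail down:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical 128-bit extended L2 entry: classic entry + subcluster
 * bitmap.  The bitmap layout is made up for illustration. */
#define L2E_OFFSET_MASK 0x00fffffffffffe00ULL   /* bits 9-55: host offset */

enum subcluster_state {
    SC_UNALLOCATED = 0,   /* data comes from the backing file */
    SC_ALLOCATED   = 1,   /* data lives in this cluster       */
    SC_ZERO        = 2,   /* subcluster reads as zeroes       */
};

struct l2_entry_ext {
    uint64_t entry;       /* host cluster offset + the usual flags */
    uint64_t bitmap;      /* 32 subclusters x 2 bits               */
};

static enum subcluster_state sc_state(const struct l2_entry_ext *e, int sc)
{
    return (enum subcluster_state)((e->bitmap >> (2 * sc)) & 3);
}

/* A cluster with a reserved host mapping that still defers entirely to
 * the backing file: host offset present, every subcluster unallocated. */
static struct l2_entry_ext make_reserved(uint64_t host_offset)
{
    struct l2_entry_ext e = {
        .entry  = host_offset & L2E_OFFSET_MASK,
        .bitmap = 0,      /* all subclusters SC_UNALLOCATED */
    };
    return e;
}

int main(void)
{
    struct l2_entry_ext e = make_reserved(0x500000);
    printf("host offset 0x%llx, subcluster 0 state %d\n",
           (unsigned long long)(e.entry & L2E_OFFSET_MASK), sc_state(&e, 0));
    return 0;
}

The point is simply that a non-zero entry reserves the host mapping while
an all-zero bitmap says "read everything from the backing file".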

In the process of exploring the topic, I expose a couple of existing
bugs in our qcow2 handling.

========

Longer version:

If I'm reading qcow2-cluster.c and qcow2-refcount.c correctly, our
current implementation of bdrv_discard() says that except for clusters
already marked QCOW2_CLUSTER_ZERO, we will unconditionally remove the L2
mapping of the address.  Whether we actually punch a hole in the
underlying image, or merely add it to a list of free clusters for use in
subsequent allocations, is later decided based on
s->discard_passthrough[type] (that is, the caller can pass different
type levels that control whether to never punch, always punch, or let
the blockdev parameters of the caller control whether to punch).
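
Just so we are looking at the same shape of logic, here is a much
simplified, self-contained sketch of that behaviour as I read it
(illustrative C; apart from discard_passthrough, all names here are
invented and do not match the real qcow2 code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum cluster_type  { CLUSTER_NORMAL, CLUSTER_ZERO, CLUSTER_UNALLOCATED };
enum discard_reason { DISCARD_REQUEST, DISCARD_SNAPSHOT, DISCARD_OTHER,
                      DISCARD_MAX };

struct qcow2_state {
    bool discard_passthrough[DISCARD_MAX];   /* set from blockdev options */
};

/* Returns true if a hole should also be punched in the underlying file. */
static bool discard_cluster(struct qcow2_state *s, uint64_t *l2_entry,
                            enum cluster_type type, enum discard_reason why)
{
    if (type == CLUSTER_ZERO) {
        return false;    /* explicit zero clusters keep their mapping today */
    }

    *l2_entry = 0;       /* unconditionally drop the guest=>host mapping */

    /* The cluster goes back on the free list either way; whether the space
     * is actually released to the filesystem depends on the per-category
     * passthrough setting. */
    return s->discard_passthrough[why];
}

int main(void)
{
    struct qcow2_state s = { .discard_passthrough = { [DISCARD_REQUEST] = true } };
    uint64_t l2 = 0x8000000000500000ULL;     /* guest cluster -> host 0x500000 */

    bool punch = discard_cluster(&s, &l2, CLUSTER_NORMAL, DISCARD_REQUEST);
    printf("L2 entry now 0x%llx, punch hole: %d\n", (unsigned long long)l2, punch);
    return 0;
}

The point being that the mapping is gone even when the host space is kept,
which is exactly what bites us below.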

Presumably, if I've preallocated my image because I want to guarantee
enough host space, then I would use blockdev parameters that ensure that
guest actions never punch a hole.  But based on my reading, I still have
the situation that if I initially preallocated things so that guest
cluster 0 and 1 are consecutive clusters in the host, and the guest
triggers bdrv_pdiscard() over both clusters, then writes to cluster 1
before cluster 0, then even though I'm not changing the underlying
allocation of the host file, I _am_ resulting in fragmentation in the
qcow2 file, where cluster 1 in the guest now falls prior to cluster 0.

Demonstration:
$ qemu-img create -f qcow2 -o cluster_size=1M,preallocation=full file3 10M
Formatting 'file3', fmt=qcow2 size=10485760 encryption=off
cluster_size=1048576 preallocation=full lazy_refcounts=off refcount_bits=16
$ qemu-img map -f qcow2 file3
Offset          Length          Mapped to       File
0               0x900000        0x500000        file3
0x9ff000        0x1000          0xeff000        file3

Oh, that's weird - our "full allocation" still results in a gap in
guest-allocation (nothing from 0x900000-0x9ff000 appears to be mapped),
even though every offset that IS allocated has a 1:1 mapping (for any
given guest offset, add 5M to get the host offset).  Maybe the JSON
output will shed more light:

$ qemu-img map -f qcow2 file3 --output=json
[{ "start": 0, "length": 9437184, "depth": 0, "zero": false, "data":
true, "offset": 5242880},
{ "start": 9437184, "length": 1044480, "depth": 0, "zero": true, "data":
true, "offset": 14680064},
{ "start": 10481664, "length": 4096, "depth": 0, "zero": false, "data":
true, "offset": 15724544}]

That's a bit better - the hole that didn't show up is still mapped, it's
just the human version omitted it because that portion of the file
explicitly reads as zero.  But I'm still confused - how can only half of
a 1M cluster read as zero, when our QCOW2_CLUSTER_ZERO flag applies a
cluster-at-a-time?  Oh - it's because we defer to the file system, and
according to lseek(SEEK_HOLE), we actually do have a hole in the file...

$ qemu-img map -f raw file3 --output=json
[{ "start": 0, "length": 5242880, "depth": 0, "zero": false, "data":
true, "offset": 0},
{ "start": 5242880, "length": 10481664, "depth": 0, "zero": true,
"data": false, "offset": 5242880},
{ "start": 15724544, "length": 4096, "depth": 0, "zero": false, "data":
true, "offset": 15724544}]

and reading the L2 table shows that we do NOT have a case of
QCOW2_CLUSTER_ZERO set in the qcow2 image:

$ qemu-io -f raw file3
qemu-io> r -v 4m 128
00400000:  80 00 00 00 00 50 00 00 80 00 00 00 00 60 00 00  .....P..........
00400010:  80 00 00 00 00 70 00 00 80 00 00 00 00 80 00 00  .....p..........
00400020:  80 00 00 00 00 90 00 00 80 00 00 00 00 a0 00 00  ................
00400030:  80 00 00 00 00 b0 00 00 80 00 00 00 00 c0 00 00  ................
00400040:  80 00 00 00 00 d0 00 00 80 00 00 00 00 e0 00 00  ................
00400050:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00400060:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00400070:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
read 128/128 bytes at offset 4194304
128 bytes, 1 ops; 0.0001 sec (905.797 KiB/sec and 7246.3768 ops/sec)
qemu-io> q
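
(For anyone decoding these dumps by hand: each L2 entry is a big-endian
64-bit word; per the qcow2 spec, bit 63 is COPIED, bit 62 is COMPRESSED,
bit 0 is the explicit-zero flag, and bits 9-55 hold the host cluster
offset.  A tiny standalone decoder in C:)

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Decode a qcow2 L2 entry that has already been converted to host byte
 * order.  Bit positions follow the published qcow2 specification. */
#define QCOW_OFLAG_COPIED      (1ULL << 63)
#define QCOW_OFLAG_COMPRESSED  (1ULL << 62)
#define QCOW_OFLAG_ZERO        (1ULL << 0)
#define L2E_OFFSET_MASK        0x00fffffffffffe00ULL

static void decode_l2_entry(uint64_t e)
{
    printf("host offset 0x%" PRIx64 "%s%s%s\n",
           e & L2E_OFFSET_MASK,
           (e & QCOW_OFLAG_COPIED)     ? " COPIED"     : "",
           (e & QCOW_OFLAG_COMPRESSED) ? " COMPRESSED" : "",
           (e & QCOW_OFLAG_ZERO)       ? " ZERO"       : "");
}

int main(void)
{
    decode_l2_entry(0x8000000000500000ULL);  /* guest cluster 0 in the dump above  */
    decode_l2_entry(0x8000000000800001ULL);  /* a zero cluster, as seen further down */
    return 0;
}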


But it's still sad that our request for preallocation=full didn't
actually work on the underlying file (we have a hole that we didn't
want).  Oh well, yet another thing to fix someday.
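
(Incidentally, the hole is easy to double-check without qemu-img: a few
lines of C walking the raw file with lseek(SEEK_DATA)/lseek(SEEK_HOLE)
print the same extents as the raw map above, on any filesystem that
supports those flags:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Print data/hole extents of a file, roughly what 'qemu-img map -f raw'
 * shows.  Requires SEEK_HOLE/SEEK_DATA support in the filesystem. */
int main(int argc, char **argv)
{
    int fd = open(argc > 1 ? argv[1] : "file3", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    off_t end = lseek(fd, 0, SEEK_END);
    off_t pos = 0;
    while (pos < end) {
        off_t data = lseek(fd, pos, SEEK_DATA);
        if (data < 0) {
            break;                           /* no more data extents */
        }
        off_t hole = lseek(fd, data, SEEK_HOLE);
        if (data > pos) {
            printf("hole: %lld +%lld\n", (long long)pos, (long long)(data - pos));
        }
        printf("data: %lld +%lld\n", (long long)data, (long long)(hole - data));
        pos = hole;
    }
    if (pos < end) {
        printf("hole: %lld +%lld\n", (long long)pos, (long long)(end - pos));
    }
    close(fd);
    return 0;
}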

Anyway, now have the guest do a discard and a subsequent re-write:

$ qemu-io -f qcow2 file3
qemu-io> discard 0 2m
discard 2097152/2097152 bytes at offset 0
2 MiB, 1 ops; 0.0014 sec (1.395 GiB/sec and 714.2857 ops/sec)
qemu-io> w 1m 1m
wrote 1048576/1048576 bytes at offset 1048576
1 MiB, 1 ops; 0.0465 sec (21.487 MiB/sec and 21.4869 ops/sec)
qemu-io> w 0 1m
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0327 sec (30.497 MiB/sec and 30.4971 ops/sec)
qemu-io> q
$ qemu-img map -f qcow2 file3
Offset          Length          Mapped to       File
0               0x100000        0x600000        file3
0x100000        0x100000        0x500000        file3
0x200000        0x700000        0x700000        file3
0x9ff000        0x1000          0xeff000        file3

We haven't consumed any new host space (good), but we HAVE fragmented
the guest image (clusters 0 and 1 have now swapped places; we no longer
have contiguous guest=>host translations).

But watch what happens with explicit zero clusters:

$ qemu-io -f qcow2 file3
qemu-io> w -z 3m 2m
wrote 2097152/2097152 bytes at offset 3145728
2 MiB, 1 ops; 0.0136 sec (146.563 MiB/sec and 73.2815 ops/sec)
qemu-io> q

Per the L2 map, we have now set the explicit zero bit, but maintained an
allocation:

$ qemu-io -f raw file3
qemu-io> r -v 4m 128
00400000:  80 00 00 00 00 60 00 00 80 00 00 00 00 50 00 00  .............P..
00400010:  80 00 00 00 00 70 00 00 80 00 00 00 00 80 00 01  .....p..........
00400020:  80 00 00 00 00 90 00 01 80 00 00 00 00 a0 00 00  ................
00400030:  80 00 00 00 00 b0 00 00 80 00 00 00 00 c0 00 00  ................
00400040:  80 00 00 00 00 d0 00 00 80 00 00 00 00 e0 00 00  ................
00400050:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00400060:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00400070:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
read 128/128 bytes at offset 4194304
128 bytes, 1 ops; 0.0001 sec (1.197 MiB/sec and 9803.9216 ops/sec)
qemu-io> q


And with that allocation in place, discarding and then rewriting zeroes
does NOT lose the allocation:

$ qemu-io -f qcow2 file3
qemu-io> discard 3m 2m
discard 2097152/2097152 bytes at offset 3145728
2 MiB, 1 ops; 0.0007 sec (2.622 GiB/sec and 1342.2819 ops/sec)
qemu-io> w -z 4m 1m
wrote 1048576/1048576 bytes at offset 4194304
1 MiB, 1 ops; 0.0103 sec (97.003 MiB/sec and 97.0026 ops/sec)
qemu-io> w -z 3m 1m
wrote 1048576/1048576 bytes at offset 3145728
1 MiB, 1 ops; 0.0095 sec (104.319 MiB/sec and 104.3188 ops/sec)
qemu-io> q

$ qemu-io -f raw file3
qemu-io> r -v 4m 128
00400000:  80 00 00 00 00 60 00 00 80 00 00 00 00 50 00 00  .............P..
00400010:  80 00 00 00 00 70 00 00 80 00 00 00 00 80 00 01  .....p..........
00400020:  80 00 00 00 00 90 00 01 80 00 00 00 00 a0 00 00  ................
00400030:  80 00 00 00 00 b0 00 00 80 00 00 00 00 c0 00 00  ................
00400040:  80 00 00 00 00 d0 00 00 80 00 00 00 00 e0 00 00  ................
00400050:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00400060:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00400070:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
read 128/128 bytes at offset 4194304
128 bytes, 1 ops; 0.0001 sec (1.185 MiB/sec and 9708.7379 ops/sec)
qemu-io>

Clusters 3 and 4 are still mapped to host clusters 8 and 9, as they were
originally.  But what happens when we try to write data there:

$ qemu-io -f qcow2 file3
qemu-io> w 4m 1m
wrote 1048576/1048576 bytes at offset 4194304
1 MiB, 1 ops; 0.0402 sec (24.830 MiB/sec and 24.8299 ops/sec)
qemu-io> q
$ qemu-img map -f qcow2 file3
Offset          Length          Mapped to       File
0               0x100000        0x600000        file3
0x100000        0x100000        0x500000        file3
0x200000        0x100000        0x700000        file3
0x400000        0x100000        0xf00000        file3
0x500000        0x400000        0xa00000        file3
0x9ff000        0x1000          0xeff000        file3

Ouch - that cluster write IGNORED the fact that we already had a
reservation at host cluster 9, and instead changed the mapping so that
guest cluster 4 now maps to host cluster 0xf, making our refcount table
larger, and fragmenting things.  So much for pre-allocation saving the day :(

So, we need a bug-fix such that allocating a cluster that will be
replacing an allocated-but-zero cluster would reuse the existing
allocation, rather than create a new mapping.
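
Roughly, I'd expect the fix to add a branch like the following in the
cluster allocation path (illustrative, self-contained C rather than a real
patch; the helper names are made up):

#include <stdint.h>
#include <stdio.h>

#define QCOW_OFLAG_ZERO  (1ULL << 0)
#define L2E_OFFSET_MASK  0x00fffffffffffe00ULL

/* Stand-in for the real free-list allocator. */
static uint64_t alloc_new_cluster(void)
{
    static uint64_t next = 0xf00000;   /* pretend the next free cluster is 0xf00000 */
    uint64_t ret = next;
    next += 0x100000;
    return ret;
}

/* Sketch of the intended fix: when the guest writes into a cluster whose
 * L2 entry is "allocated but reads as zero", reuse the existing host
 * offset instead of allocating a new cluster. */
static uint64_t cluster_for_write(uint64_t old_l2_entry)
{
    uint64_t host_offset = old_l2_entry & L2E_OFFSET_MASK;

    if ((old_l2_entry & QCOW_OFLAG_ZERO) && host_offset != 0) {
        return host_offset;            /* keep the preallocated mapping */
    }
    return alloc_new_cluster();        /* current behaviour */
}

int main(void)
{
    /* Entry for guest cluster 4 after "w -z": host 0x900000 + ZERO flag. */
    printf("reused host offset: 0x%llx\n",
           (unsigned long long)cluster_for_write(0x8000000000900001ULL));
    return 0;
}

With something along those lines, the "w 4m 1m" above would have gone back
to host cluster 9 instead of moving the mapping to cluster 0xf.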

[Oh, and how will all of this interact with refcounts > 1, when internal
snapshots are in use?  Ugh - I don't know that we want to go there]

But if we can preserve mappings of clusters that are explicitly marked
zero, I started to wonder if it would also be reasonable to have a mode
where we could ALWAYS preserve mappings.  Adding another bit that says
that a cluster has a reserved mapping, but still defers to the backing
file for its current data, would let us extend the existing behavior of
map-preservation on write zeros to work with ALL writes, when writing to
a fully pre-allocated image.

Such a bit is incompatible (existing qemu would not know how to present
correct data to the guest when the bit is set), so if we want it, we
have to burn an incompatible feature bit.  The question is whether this
feature is independently useful enough to do it orthogonally from
subclusters, or whether we want to minimize the exponential explosion of
orthogonal bits by only allowing the feature when subclusters are also
enabled.  And in a way, subclusters already provide this feature: any
time you map a cluster to an address, but have ALL of its subclusters
marked as unallocated (read from backing file), you have preallocated a
mapping that you can later use when writing to that cluster, and that
preallocation is no longer tied to whether the guest uses write-zeros.

When I chatted with Max on IRC about the idea, we said this:


<mreitz> I mean, sure, we can add both, but I'd still want them to be
two incompatible bits
<eblake> if you want the features to be orthogonal (with exponentially
more cases to check), then having multiple incompatible bits is okay
<mreitz> Because FEATURE_BIT_UNALLOCATED_AND_SUBCLUSTERS sounds weird
and FEATURE_BIT_EXTENDED_L2_ENTRIES a bit pretentious
<mreitz> Well, orthogonal is a good question. If we want to have an
UNALLOCATED flag we should think so before adding subclusters
<mreitz> Because then we should at least make clear that the ZERO flag
for a subcluster requires the ALLOCATED flag to be set or something
<mreitz> So we can reserve ZERO/!ALLOCATED for the case where you want
to fall through to the backing file

So, if you got this far in reading, the question becomes whether having
a mode where you can mark a cluster as mapping-reserved-but-unallocated
has enough use case to be worth pursuing, knowing that it will burn an
incompatible feature bit; or if it should be rolled into the subcluster
proposal, or whether it's not a feature that anyone needs after all.

And meanwhile, it looks like I have some patches to propose (and
qemu-iotests to write) if I can help fix the bugs I've pointed out.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




* Re: [Qemu-devel] proposed qcow2 extension: cluster reservations [was: [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-21 21:09   ` [Qemu-devel] proposed qcow2 extension: cluster reservations [was: " Eric Blake
@ 2017-04-22 17:56     ` Max Reitz
  2017-04-24 11:45       ` Kevin Wolf
  2017-04-24 12:46       ` Alberto Garcia
  0 siblings, 2 replies; 64+ messages in thread
From: Max Reitz @ 2017-04-22 17:56 UTC (permalink / raw)
  To: Eric Blake, Alberto Garcia, qemu-devel
  Cc: Kevin Wolf, qemu-block, Stefan Hajnoczi


On 21.04.2017 23:09, Eric Blake wrote:
> On 04/06/2017 11:40 AM, Eric Blake wrote:
> 
>>> === Changes to the on-disk format ===
>>>
>>> The qcow2 on-disk format needs to change so each L2 entry has a bitmap
>>> indicating the allocation status of each subcluster. There are three
>>> possible states (unallocated, allocated, all zeroes), so we need two
>>> bits per subcluster.
>>
>> You also have to add a new incompatible feature bit, so that older tools
>> know they can't read the new image correctly, and therefore don't
>> accidentally corrupt it.
> 
> As long as we are talking about incompatible feature bits, I had some
> thoughts about image mapping today.
> 
> tl;dr: summary
> As long as we are considering incompatible features, maybe we should
> make it easier to have an image file that explicitly preserves
> guest=>host mapping, such that nothing the guest can do will reorder the
> mapping.  This way, it would be possible to fully preallocate an image
> such that all guest offsets are adjusted by a constant offset to become
> the corresponding host offset (basically, no qcow2 metadata is
> interleaved in the middle of guest data).
> 
> I don't see any way to do it on current qcow2 images, but with
> subclusters, you get it for free by having a cluster with an offset but
> with all subclusters marked as unallocated.  But perhaps it is something
> orthogonal enough that we would want a separate incompatible feature bit
> for just this, without having subclusters at the same time.
> 
> In the process of exploring the topic, I expose a couple of existing
> bugs in our qcow2 handling.
> 
> ========
> 
> Longer version:
> 
> If I'm reading qcow2-cluster.c and qcow2-refcount.c correctly, our
> current implementation of bdrv_discard() says that except for clusters
> already marked QCOW2_CLUSTER_ZERO, we will unconditionally remove the L2
> mapping of the address.

As I've said, I think special-casing the ZERO bit here is just a bug and
we should free preallocated zero clusters on discard as well.

>                          Whether we actually punch a hole in the
> underlying image, or merely add it to a list of free clusters for use in
> subsequent allocations, is later decided based on
> s->discard_passthrough[type] (that is, the caller can pass different
> type levels that control whether to never punch, always punch, or let
> the blockdev parameters of the caller control whether to punch).
> 
> Presumably, if I've preallocated my image because I want to guarantee
> enough host space, then I would use blockdev parameters that ensure that
> guest actions never punch a hole.  But based on my reading, I still have
> the situation that if I initially preallocated things so that guest
> cluster 0 and 1 are consecutive clusters in the host, and the guest
> triggers bdrv_pdiscard() over both clusters, then writes to cluster 1
> before cluster 0, then even though I'm not changing the underlying
> allocation of the host file, I _am_ resulting in fragmentation in the
> qcow2 file, where cluster 1 in the guest now falls prior to cluster 0.

[...]

> But if we can preserve mappings of clusters that are explicitly marked
> zero, I started to wonder if it would also be reasonable to have a mode
> where we could ALWAYS preserve mappings.  Adding another bit that says
> that a cluster has a reserved mapping, but still defers to the backing
> file for its current data, would let us extend the existing behavior of
> map-preservation on write zeros to work with ALL writes, when writing to
> a fully pre-allocated image.

Yes, and it also means that we may want to think about implementing a
preallocation mode in qemu which puts all of the data into a single
consecutive chunk (as you have hinted at somewhere above).
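
For what it's worth, the nice property of that layout is that the
guest=>host translation degenerates into adding a constant (Eric's "add 5M
to get the host offset" observation for the fully preallocated file3
earlier in the thread).  A toy sketch with made-up sizes:

#include <stdint.h>
#include <stdio.h>

/* Toy model of an image where all metadata (header, refcount tables,
 * L1/L2) comes first and every guest cluster follows in order.  The sizes
 * are invented for illustration; with file3 above the constant happened
 * to be 5 MB. */
#define DATA_START (5 * 1024 * 1024)

static uint64_t host_offset(uint64_t guest_offset)
{
    /* No qcow2 metadata interleaved with guest data: the translation is a
     * constant shift, and it stays that way as long as mappings are never
     * reordered. */
    return DATA_START + guest_offset;
}

int main(void)
{
    printf("guest 0x400000 -> host 0x%llx\n",
           (unsigned long long)host_offset(0x400000));
    return 0;
}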

> When I chatted with Max on IRC about the idea, we said this:
> 
> 
> <mreitz> I mean, sure, we can add both, but I'd still want them to be
> two incompatible bits
> <eblake> if you want the features to be orthogonal (with exponentially
> more cases to check), then having multiple incompatible bits is okay
> <mreitz> Because FEATURE_BIT_UNALLOCATED_AND_SUBCLUSTERS sounds weird
> and FEATURE_BIT_EXTENDED_L2_ENTRIES a bit pretentious
> <mreitz> Well, orthogonal is a good question. If we want to have an
> UNALLOCATED flag we should think so before adding subclusters
> <mreitz> Because then we should at least make clear that the ZERO flag
> for a subcluster requires the ALLOCATED flag to be set or something
> <mreitz> So we can reserve ZERO/!ALLOCATED for the case where you want
> to fall through to the backing file
> 
> So, if you got this far in reading, the question becomes whether having
> a mode where you can mark a cluster as mapping-reserved-but-unallocated
> has enough use case to be worth pursuing, knowing that it will burn an
> incompatible feature bit; or if it should be rolled into the subcluster
> proposal, or whether it's not a feature that anyone needs after all.

I had just forgotten that saying !ALLOCATED alone will be enough,
regardless of the ZERO flag... Yeah, subclusters will give us this for
free, and I think it's therefore reasonable to simply require them if you
want this feature (preallocation with a backing file).

> And meanwhile, it looks like I have some patches to propose (and
> qemu-iotests to write) if I can help fix the bugs I've pointed out.

You mean these?
https://github.com/XanClic/qemu/commits/random-qcow2-stuff-v1

;-)

I'll send them once I've written tests.

Max




* Re: [Qemu-devel] proposed qcow2 extension: cluster reservations [was: [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-22 17:56     ` Max Reitz
@ 2017-04-24 11:45       ` Kevin Wolf
  2017-04-24 12:46       ` Alberto Garcia
  1 sibling, 0 replies; 64+ messages in thread
From: Kevin Wolf @ 2017-04-24 11:45 UTC (permalink / raw)
  To: Max Reitz
  Cc: Eric Blake, Alberto Garcia, qemu-devel, qemu-block, Stefan Hajnoczi


On 22.04.2017 at 19:56, Max Reitz wrote:
> On 21.04.2017 23:09, Eric Blake wrote:
> > And meanwhile, it looks like I have some patches to propose (and
> > qemu-iotests to write) if I can help fix the bugs I've pointed out.
> 
> You mean these?
> https://github.com/XanClic/qemu/commits/random-qcow2-stuff-v1
> 
> ;-)
> 
> I'll send them once I've written tests.

You were quicker writing the patches than I was with this reply: I had
already started a patch to reuse preallocated zero clusters, but never
got around to finishing it:

http://repo.or.cz/qemu/kevin.git/commitdiff/60b9f813bb258978455944b4ace1ab5d93b5e7d4

I seem to remember that it still had a problem (and probably also the
refcount > 0 bug that Eric mentioned), but maybe you can reuse at least
the test case from there.

Kevin



* Re: [Qemu-devel] proposed qcow2 extension: cluster reservations [was: [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-22 17:56     ` Max Reitz
  2017-04-24 11:45       ` Kevin Wolf
@ 2017-04-24 12:46       ` Alberto Garcia
  1 sibling, 0 replies; 64+ messages in thread
From: Alberto Garcia @ 2017-04-24 12:46 UTC (permalink / raw)
  To: Max Reitz, Eric Blake, qemu-devel; +Cc: Kevin Wolf, qemu-block, Stefan Hajnoczi

On Sat 22 Apr 2017 07:56:57 PM CEST, Max Reitz wrote:
>> So, if you got this far in reading, the question becomes whether
>> having a mode where you can mark a cluster as
>> mapping-reserved-but-unallocated has enough use case to be worth
>> pursuing, knowing that it will burn an incompatible feature bit; or
>> if it should be rolled into the subcluster proposal, or whether it's
>> not a feature that anyone needs after all.
>
> I just forgot that just saying !ALLOCATED will be enough, regardless
> of the ZERO flag... Yeah, subclusters will give us this for free, and
> I think it's therefore reasonable to just require them if you want
> this feature (preallocation with a backing file).

I agree with Max here.

Berto



Thread overview: 64+ messages
2017-04-06 15:01 [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation Alberto Garcia
2017-04-06 16:40 ` Eric Blake
2017-04-07  8:49   ` Alberto Garcia
2017-04-07 12:41   ` Kevin Wolf
2017-04-07 14:24     ` Alberto Garcia
2017-04-21 21:09   ` [Qemu-devel] proposed qcow2 extension: cluster reservations [was: " Eric Blake
2017-04-22 17:56     ` Max Reitz
2017-04-24 11:45       ` Kevin Wolf
2017-04-24 12:46       ` Alberto Garcia
2017-04-07 12:20 ` [Qemu-devel] " Stefan Hajnoczi
2017-04-07 12:24   ` Alberto Garcia
2017-04-07 13:01   ` Kevin Wolf
2017-04-10 15:32     ` Stefan Hajnoczi
2017-04-07 17:10 ` Max Reitz
2017-04-10  8:42   ` Kevin Wolf
2017-04-10 15:03     ` Max Reitz
2017-04-11 12:56   ` Alberto Garcia
2017-04-11 14:04     ` Max Reitz
2017-04-11 14:31       ` Alberto Garcia
2017-04-11 14:45         ` [Qemu-devel] [Qemu-block] " Eric Blake
2017-04-12 12:41           ` Alberto Garcia
2017-04-12 14:10             ` Max Reitz
2017-04-13  8:05               ` Alberto Garcia
2017-04-13  9:02                 ` Kevin Wolf
2017-04-13  9:05                   ` Alberto Garcia
2017-04-11 14:49         ` [Qemu-devel] " Kevin Wolf
2017-04-11 14:58           ` Eric Blake
2017-04-11 14:59           ` Max Reitz
2017-04-11 15:08             ` Eric Blake
2017-04-11 15:18               ` Max Reitz
2017-04-11 15:29                 ` Kevin Wolf
2017-04-11 15:29                   ` Max Reitz
2017-04-11 15:30                 ` Eric Blake
2017-04-11 15:34                   ` Max Reitz
2017-04-12 12:47           ` Alberto Garcia
2017-04-12 16:54 ` Denis V. Lunev
2017-04-13 11:58   ` Alberto Garcia
2017-04-13 12:44     ` Denis V. Lunev
2017-04-13 13:05       ` Kevin Wolf
2017-04-13 13:09         ` Denis V. Lunev
2017-04-13 13:36           ` Alberto Garcia
2017-04-13 14:06             ` Denis V. Lunev
2017-04-13 13:21       ` Alberto Garcia
2017-04-13 13:30         ` Denis V. Lunev
2017-04-13 13:59           ` Kevin Wolf
2017-04-13 15:04           ` Alberto Garcia
2017-04-13 15:17             ` Denis V. Lunev
2017-04-18 11:52               ` Alberto Garcia
2017-04-18 17:27                 ` Denis V. Lunev
2017-04-13 13:51         ` Kevin Wolf
2017-04-13 14:15           ` Alberto Garcia
2017-04-13 14:27             ` Kevin Wolf
2017-04-13 16:42               ` [Qemu-devel] [Qemu-block] " Roman Kagan
2017-04-13 14:42           ` [Qemu-devel] " Denis V. Lunev
2017-04-12 17:55 ` Denis V. Lunev
2017-04-12 18:20   ` Eric Blake
2017-04-12 19:02     ` Denis V. Lunev
2017-04-13  9:44       ` Kevin Wolf
2017-04-13 10:19         ` Denis V. Lunev
2017-04-14  1:06           ` [Qemu-devel] [Qemu-block] " John Snow
2017-04-14  4:17             ` Denis V. Lunev
2017-04-18 11:22               ` Kevin Wolf
2017-04-18 17:30                 ` Denis V. Lunev
2017-04-14  7:40             ` Roman Kagan
