* [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
@ 2019-06-27 13:59 Alberto Garcia
  2019-06-27 14:19 ` Denis Lunev
  2019-06-27 16:54 ` Kevin Wolf
  0 siblings, 2 replies; 29+ messages in thread
From: Alberto Garcia @ 2019-06-27 13:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anton Nefedov, Denis V. Lunev, qemu-block, Max Reitz

Hi all,

a couple of years ago I came to the mailing list with a proposal to
extend the qcow2 format to add subcluster allocation.

You can read the original message (and the discussion thread that came
afterwards) here:

   https://lists.gnu.org/archive/html/qemu-block/2017-04/msg00178.html

The description of the problem from the original proposal is still
valid so I won't repeat it here.

Over the past few weeks I have picked up the code that I wrote in
2017, made it work with the latest QEMU and fixed many of its bugs. I
once again have a working prototype which is by no means complete, but
it gives us up-to-date information about what we can expect from this
feature.

My goal with this message is to resume the discussion and re-evaluate
whether this is a feature that we'd like to have in QEMU, in light of
the test results and all the changes that have happened in the past
couple of years.

=== Test results ===

I ran these tests with the same hardware configuration as in 2017: an
SSD drive and random 4KB write requests to an empty 40GB qcow2 image.

Here are the results when the qcow2 file is backed by a fully
populated image. There are 8 subclusters per cluster and the
subcluster size is in brackets:

|-----------------+----------------+-----------------|
|  Cluster size   | subclusters=on | subclusters=off |
|-----------------+----------------+-----------------|
|   2 MB (256 KB) |   571 IOPS     |  124 IOPS       |
|   1 MB (128 KB) |   863 IOPS     |  212 IOPS       |
| 512 KB  (64 KB) |  1678 IOPS     |  365 IOPS       |
| 256 KB  (32 KB) |  2618 IOPS     |  568 IOPS       |
| 128 KB  (16 KB) |  4907 IOPS     |  873 IOPS       |
|  64 KB   (8 KB) | 10613 IOPS     | 1680 IOPS       |
|  32 KB   (4 KB) | 13038 IOPS     | 2476 IOPS       |
|   4 KB (512 B)  |   101 IOPS     |  101 IOPS       |
|-----------------+----------------+-----------------|

Some comments about the results, after comparing them with those from
2017:

- As expected, 32 KB clusters / 4 KB subclusters give the best results
  because that matches the size of the write request and therefore
  there's no copy-on-write involved (a rough sketch of the
  copy-on-write cost follows after this list).

- Allocation is generally faster now in all cases (between 20-90%,
  depending on the case). We have made several optimizations to the
  code since last time, and I suppose that the COW changes made in
  commits b3cf1c7cf8 and ee22a9d869 are probably the main factor
  behind these improvements.

- Apart from the 64KB/8KB case (which is much faster), the patterns
  are generally the same: subcluster allocation offers similar
  performance benefits compared to last time, so there are no
  surprises in this area.
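
To put the copy-on-write cost behind the first point into rough
numbers, here is a back-of-the-envelope sketch (plain C, not QEMU
code). It assumes the 4 KB write is aligned and fully contained in one
allocation unit, and it ignores metadata updates:

#include <stdio.h>

#define KB 1024

/* Bytes that have to be copied from the backing file when a small,
 * aligned write lands in an unallocated allocation unit. */
static long cow_bytes(long alloc_unit, long request)
{
    return alloc_unit > request ? alloc_unit - request : 0;
}

int main(void)
{
    long request = 4 * KB;
    long clusters_kb[] = { 2048, 1024, 512, 256, 128, 64, 32 };

    for (unsigned i = 0; i < sizeof(clusters_kb) / sizeof(clusters_kb[0]); i++) {
        long cluster = clusters_kb[i] * KB;
        long subcluster = cluster / 8;   /* 8 subclusters per cluster */
        printf("%5ld KB cluster: COW %7ld bytes without subclusters, "
               "%6ld bytes with\n", clusters_kb[i],
               cow_bytes(cluster, request), cow_bytes(subcluster, request));
    }
    return 0;
}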

Then I ran the tests again using the same environment but without a
backing image. The goal is to measure the impact of subcluster
allocation on completely empty images.

Here we have an important change: since commit c8bb23cbdb empty
clusters are preallocated and filled with zeroes using an efficient
operation (typically fallocate() with FALLOC_FL_ZERO_RANGE) instead of
writing the zeroes with the usual pwrite() call.
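
For reference, the difference between the two approaches looks roughly
like this when sketched in plain C against a raw file descriptor (the
helper names are made up for illustration; QEMU's real code has more
fallbacks and error handling):

#define _GNU_SOURCE
#include <fcntl.h>   /* older glibc may need <linux/falloc.h> for the flag */
#include <unistd.h>

/* Fast path: ask the filesystem to allocate and zero the range. */
static int zero_range_fallocate(int fd, off_t offset, off_t length)
{
    return fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, length);
}

/* Slow path: actually write a buffer full of zeroes. */
static ssize_t zero_range_pwrite(int fd, off_t offset, size_t length)
{
    static char buf[64 * 1024];          /* zero-initialized */
    size_t done = 0;

    while (done < length) {
        size_t chunk = length - done < sizeof(buf) ? length - done
                                                   : sizeof(buf);
        ssize_t ret = pwrite(fd, buf, chunk, offset + done);
        if (ret < 0) {
            return ret;
        }
        done += ret;
    }
    return done;
}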

The effects of this are dramatic, so I decided to run two sets of
tests: one with this optimization and one without it.

Here are the results:

|-----------------+----------------+-----------------+----------------+-----------------|
|                 | Initialization with fallocate()  |  Initialization with pwritev()   |
|-----------------+----------------+-----------------+----------------+-----------------|
|  Cluster size   | subclusters=on | subclusters=off | subclusters=on | subclusters=off |
|-----------------+----------------+-----------------+----------------+-----------------|
|   2 MB (256 KB) | 14468 IOPS     | 14776 IOPS      |  1181 IOPS     |  260 IOPS       |
|   1 MB (128 KB) | 13752 IOPS     | 14956 IOPS      |  1916 IOPS     |  358 IOPS       |
| 512 KB  (64 KB) | 12961 IOPS     | 14776 IOPS      |  4038 IOPS     |  684 IOPS       |
| 256 KB  (32 KB) | 12790 IOPS     | 14534 IOPS      |  6172 IOPS     | 1213 IOPS       |
| 128 KB  (16 KB) | 12550 IOPS     | 13967 IOPS      |  8700 IOPS     | 1976 IOPS       |
|  64 KB   (8 KB) | 12491 IOPS     | 13432 IOPS      | 11735 IOPS     | 4267 IOPS       |
|  32 KB   (4 KB) | 13203 IOPS     | 11752 IOPS      | 12366 IOPS     | 6306 IOPS       |
|   4 KB (512 B)  |   103 IOPS     |   101 IOPS      |   101 IOPS     |  101 IOPS       |
|-----------------+----------------+-----------------+----------------+-----------------|

Comments:

- With the old-style allocation method using pwritev() we get similar
  benefits as we did last time. The comments from the test with a
  backing image apply to this one as well.

- However the new allocation method is so efficient that having
  subclusters does not offer any performance benefit. It even slows
  things down a bit in most cases, so we'd probably need to fine-tune
  the algorithm in order to get comparable results.

- In light of these numbers I also think that even when there's a
  backing image we could preallocate the full cluster but only do COW
  on the affected subclusters. This would leave the rest of the cluster
  preallocated on disk but unallocated in the bitmap. This would
  probably reduce on-disk fragmentation, which was one of the concerns
  raised during the original discussion.

I also ran some tests on a rotating HDD drive. Here having subclusters
doesn't make a big difference regardless of whether there is a backing
image or not, so we can ignore this scenario.

=== Changes to the on-disk format ===

In my original proposal I described 3 different alternatives for
storing the subcluster bitmaps. I'm naming them here, but refer to
that message for more details.

(1) Storing the bitmap inside the 64-bit entry
(2) Making L2 entries 128-bit wide.
(3) Storing the bitmap somewhere else

I used (1) for this implementation for simplicity, but I think (2) is
probably the best one.
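
To illustrate what (1) means in practice, here is one conceivable
packing sketched in C. The bit positions are illustrative assumptions
only, not necessarily what my prototype uses and certainly not a final
on-disk layout:

#include <stdbool.h>
#include <stdint.h>

/* A standard L2 entry keeps the host cluster offset in bits 9-55.  The
 * idea of option (1) is to reuse some of the reserved bits as an
 * allocation bitmap for the 8 subclusters. */
#define L2E_HOST_OFFSET_MASK   0x00fffffffffffe00ULL
#define L2E_SUBCLUSTER_SHIFT   1                    /* hypothetical */
#define L2E_SUBCLUSTER_MASK    (0xffULL << L2E_SUBCLUSTER_SHIFT)

static bool subcluster_is_allocated(uint64_t l2_entry, unsigned int sc)
{
    return l2_entry & (1ULL << (L2E_SUBCLUSTER_SHIFT + sc));
}

static uint64_t mark_subcluster_allocated(uint64_t l2_entry, unsigned int sc)
{
    return l2_entry | (1ULL << (L2E_SUBCLUSTER_SHIFT + sc));
}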

===========================

And I think that's all. As you can see I didn't want to go much into
the open technical questions (I think the on-disk format would be the
main one); the first goal should be to decide whether this is still an
interesting feature or not.

So, any questions or comments will be much appreciated.

Berto



* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-27 13:59 [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images Alberto Garcia
@ 2019-06-27 14:19 ` Denis Lunev
  2019-06-27 15:38   ` Alberto Garcia
  2019-06-27 16:54 ` Kevin Wolf
  1 sibling, 1 reply; 29+ messages in thread
From: Denis Lunev @ 2019-06-27 14:19 UTC (permalink / raw)
  To: Alberto Garcia, qemu-devel
  Cc: Kevin Wolf, Anton Nefedov, qemu-block, Max Reitz

On 6/27/19 4:59 PM, Alberto Garcia wrote:
> Hi all,
>
> a couple of years ago I came to the mailing list with a proposal to
> extend the qcow2 format to add subcluster allocation.
>
> You can read the original message (and the discussion thread that came
> afterwards) here:
>
>    https://lists.gnu.org/archive/html/qemu-block/2017-04/msg00178.html
>
> The description of the problem from the original proposal is still
> valid so I won't repeat it here.
>
> What I have been doing during the past few weeks was to retake the
> code that I wrote in 2017, make it work with the latest QEMU and fix
> many of its bugs. I have again a working prototype which is by no
> means complete but it allows us to have up-to-date information about
> what we can expect from this feature.
>
> My goal with this message is to retake the discussion and re-evaluate
> whether this is a feature that we'd like for QEMU in light of the test
> results and all the changes that we have had in the past couple of
> years.
>
> === Test results ===
>
> I ran these tests with the same hardware configuration as in 2017: an
> SSD drive and random 4KB write requests to an empty 40GB qcow2 image.
>
> Here are the results when the qcow2 file is backed by a fully
> populated image. There are 8 subclusters per cluster and the
> subcluster size is in brackets:
>
> |-----------------+----------------+-----------------|
> |  Cluster size   | subclusters=on | subclusters=off |
> |-----------------+----------------+-----------------|
> |   2 MB (256 KB) |   571 IOPS     |  124 IOPS       |
> |   1 MB (128 KB) |   863 IOPS     |  212 IOPS       |
> | 512 KB  (64 KB) |  1678 IOPS     |  365 IOPS       |
> | 256 KB  (32 KB) |  2618 IOPS     |  568 IOPS       |
> | 128 KB  (16 KB) |  4907 IOPS     |  873 IOPS       |
> |  64 KB   (8 KB) | 10613 IOPS     | 1680 IOPS       |
> |  32 KB   (4 KB) | 13038 IOPS     | 2476 IOPS       |
> |   4 KB (512 B)  |   101 IOPS     |  101 IOPS       |
> |-----------------+----------------+-----------------|
>
> Some comments about the results, after comparing them with those from
> 2017:
>
> - As expected, 32KB clusters / 4 KB subclusters give the best results
>   because that matches the size of the write request and therefore
>   there's no copy-on-write involved.
>
> - Allocation is generally faster now in all cases (between 20-90%,
>   depending on the case). We have made several optimizations to the
>   code since last time, and I suppose that the COW changes made in
>   commits b3cf1c7cf8 and ee22a9d869 are probably the main factor
>   behind these improvements.
>
> - Apart from the 64KB/8KB case (which is much faster), the patterns are
>   generally the same: subcluster allocation offers similar performance
>   benefits compared to last time, so there are no surprises in this
>   area.
>
> Then I ran the tests again using the same environment but without a
> backing image. The goal is to measure the impact of subcluster
> allocation on completely empty images.
>
> Here we have an important change: since commit c8bb23cbdb empty
> clusters are preallocated and filled with zeroes using an efficient
> operation (typically fallocate() with FALLOC_FL_ZERO_RANGE) instead of
> writing the zeroes with the usual pwrite() call.
>
> The effects of this are dramatic, so I decided to run two sets of
> tests: one with this optimization and one without it.
>
> Here are the results:
>
> |-----------------+----------------+-----------------+----------------+-----------------|
> |                 | Initialization with fallocate()  |  Initialization with pwritev()   |
> |-----------------+----------------+-----------------+----------------+-----------------|
> |  Cluster size   | subclusters=on | subclusters=off | subclusters=on | subclusters=off |
> |-----------------+----------------+-----------------+----------------+-----------------|
> |   2 MB (256 KB) | 14468 IOPS     | 14776 IOPS      |  1181 IOPS     |  260 IOPS       |
> |   1 MB (128 KB) | 13752 IOPS     | 14956 IOPS      |  1916 IOPS     |  358 IOPS       |
> | 512 KB  (64 KB) | 12961 IOPS     | 14776 IOPS      |  4038 IOPS     |  684 IOPS       |
> | 256 KB  (32 KB) | 12790 IOPS     | 14534 IOPS      |  6172 IOPS     | 1213 IOPS       |
> | 128 KB  (16 KB) | 12550 IOPS     | 13967 IOPS      |  8700 IOPS     | 1976 IOPS       |
> |  64 KB   (8 KB) | 12491 IOPS     | 13432 IOPS      | 11735 IOPS     | 4267 IOPS       |
> |  32 KB   (4 KB) | 13203 IOPS     | 11752 IOPS      | 12366 IOPS     | 6306 IOPS       |
> |   4 KB (512 B)  |   103 IOPS     |   101 IOPS      |   101 IOPS     |  101 IOPS       |
> |-----------------+----------------+-----------------+----------------+-----------------|
>
> Comments:
>
> - With the old-style allocation method using pwritev() we get similar
>   benefits as we did last time. The comments from the test with a
>   backing image apply to this one as well.
>
> - However the new allocation method is so efficient that having
>   subclusters does not offer any performance benefit. It even slows
>   down things a bit in most cases, so we'd probably need to fine tune
>   the algorithm in order to get similar results.
>
> - In light of these numbers I also think that even when there's a
>   backing image we could preallocate the full cluster but only do COW
>   on the affected subclusters. This would leave the rest of the cluster
>   preallocated on disk but unallocated in the bitmap. This would
>   probably reduce on-disk fragmentation, which was one of the concerns
>   raised during the original discussion.
>
> I also ran some tests on a rotating HDD drive. Here having subclusters
> doesn't make a big difference regardless of whether there is a backing
> image or not, so we can ignore this scenario.
>
> === Changes to the on-disk format ===
>
> In my original proposal I described 3 different alternatives for
> storing the subcluster bitmaps. I'm naming them here, but refer to
> that message for more details.
>
> (1) Storing the bitmap inside the 64-bit entry
> (2) Making L2 entries 128-bit wide.
> (3) Storing the bitmap somewhere else
>
> I used (1) for this implementation for simplicity, but I think (2) is
> probably the best one.
>
> ===========================
>
> And I think that's all. As you can see I didn't want to go much into
> the open technical questions (I think the on-disk format would be the
> main one), the first goal should be to decide whether this is still an
> interesting feature or not.
>
> So, any questions or comments will be much appreciated.
>
> Berto
I would like to add my $0.02 here from a slightly different point of
view.

Right now QCOW2 is not very efficient with the default cluster size
(64k) when you want fast performance with big disks. Nowadays people
use really BIG images, and 1-2-3-8 TB disks are really common.
Unfortunately people want fast random IO too. Thus the metadata cache
should be entirely in memory, as in any other case IOPS are halved
(one operation for the metadata read and one operation for the real
read). For an 8 TB image this means 1 GB of RAM for the cache. With
1 MB clusters we get 64 MB, which is much more reasonable.
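
These figures follow from simple arithmetic, since each 8-byte L2
entry maps one cluster (a rough sketch; it ignores refcount tables and
other metadata):

#include <stdio.h>

int main(void)
{
    unsigned long long virtual_size = 8ULL << 40;            /* 8 TB */
    unsigned long long cluster_sizes[] = { 64ULL << 10, 1ULL << 20 };

    for (int i = 0; i < 2; i++) {
        unsigned long long entries = virtual_size / cluster_sizes[i];
        printf("cluster %4llu KB: %4llu MB of L2 tables\n",
               cluster_sizes[i] >> 10, (entries * 8) >> 20);
    }
    return 0;
}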

Though with 1 MB clusters the reclaim process becomes much, much
worse. I cannot give exact numbers, unfortunately; AFAIR the image
occupies 30-50% more space. Guys, I would appreciate it if you could
correct me here with real numbers.

Thus with respect to these patterns, subclusters could give us the
benefits of fast random IO and a good reclaim rate. I would consider
64k clusters/8k subclusters as too extreme for me. In reality we would
end up with a completely fragmented image very soon. Sequential reads
would become random VERY soon without preallocation. Though, anyway,
this makes some sense for COW. But, again, in such a case subclusters
should not be left as holes, as required by the scenario I mentioned
first.

Den


* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-27 14:19 ` Denis Lunev
@ 2019-06-27 15:38   ` Alberto Garcia
  2019-06-27 15:42     ` Alberto Garcia
  2019-06-27 16:05     ` Denis Lunev
  0 siblings, 2 replies; 29+ messages in thread
From: Alberto Garcia @ 2019-06-27 15:38 UTC (permalink / raw)
  To: Denis Lunev, qemu-devel; +Cc: Kevin Wolf, Anton Nefedov, qemu-block, Max Reitz

On Thu 27 Jun 2019 04:19:25 PM CEST, Denis Lunev wrote:

> Right now QCOW2 is not very efficient with default cluster size (64k)
> for fast performance with big disks. Nowadays ppl uses really BIG
> images and 1-2-3-8 Tb disks are really common. Unfortunately ppl want
> to get random IO fast too.  Thus metadata cache should be in memory as
> in the any other case we will get IOPSes halved (1 operation for
> metadata cache read and one operation for real read). For 8 Tb image
> this results in 1 Gb RAM for that. For 1 Mb cluster we get 64 Mb which
> is much more reasonable.

Correct, the L2 metadata size is a well-known problem that has been
discussed extensively, and that has received plenty of attention.

> Though with 1 Mb cluster the reclaim process becomes much-much
> worse. I can not give exact number, unfortunately.  AFAIR the image
> occupies 30-50% more space. Guys, I would appreciate if you will
> correct me here with real numbers.

Correct, because the cluster size is the smallest unit of allocation, so
a 16KB write on an empty area of the image will always allocate a
complete 1MB cluster.

> Thus in respect to this patterns subclusters could give us benefits of
> fast random IO and good reclaim rate.

Exactly, but that fast random I/O would only happen when allocating new
clusters. Once the clusters are allocated it doesn't provide any
additional performance benefit.

> I would consider 64k cluster/8k subcluster as too extreme for me.  In
> reality we would end up with completely fragmented image very soon.

You mean because of the 64k cluster size, or because of the 8k
subcluster size? If it's the former, yes. If it's the latter, it can be
solved by preallocating the cluster with fallocate(). But then you would
lose the benefit of the good reclaim rate.

Berto



* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-27 15:38   ` Alberto Garcia
@ 2019-06-27 15:42     ` Alberto Garcia
  2019-06-28  9:20       ` Kevin Wolf
  2019-06-27 16:05     ` Denis Lunev
  1 sibling, 1 reply; 29+ messages in thread
From: Alberto Garcia @ 2019-06-27 15:42 UTC (permalink / raw)
  To: Denis Lunev, qemu-devel; +Cc: Kevin Wolf, Anton Nefedov, qemu-block, Max Reitz

On Thu 27 Jun 2019 05:38:56 PM CEST, Alberto Garcia wrote:
>> I would consider 64k cluster/8k subcluster as too extreme for me.

I forgot to add: this 64k/8k ratio is only with my current prototype.

In practice if we go with the 128-bit L2 entries we would have 64
subclusters per cluster, or 32 if we want to have a separate
QCOW_OFLAG_ZERO for each subcluster (would we need this?).
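
For illustration, such an extended entry could look roughly like this
(the field names and this in-memory layout are only a sketch, not a
committed on-disk format):

#include <stdint.h>

struct l2_entry_extended {
    uint64_t cluster_descriptor;    /* existing 64-bit entry: offset,
                                       COPIED flag, etc. */
    uint32_t subcluster_allocated;  /* bit n set: subcluster n has data */
    uint32_t subcluster_zeroed;     /* bit n set: subcluster n reads as
                                       zeroes */
};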

Berto



* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-27 15:38   ` Alberto Garcia
  2019-06-27 15:42     ` Alberto Garcia
@ 2019-06-27 16:05     ` Denis Lunev
  2019-06-28 14:43       ` Alberto Garcia
  1 sibling, 1 reply; 29+ messages in thread
From: Denis Lunev @ 2019-06-27 16:05 UTC (permalink / raw)
  To: Alberto Garcia, qemu-devel
  Cc: Kevin Wolf, Anton Nefedov, qemu-block, Max Reitz

On 6/27/19 6:38 PM, Alberto Garcia wrote:
> On Thu 27 Jun 2019 04:19:25 PM CEST, Denis Lunev wrote:
>
>> Right now QCOW2 is not very efficient with default cluster size (64k)
>> for fast performance with big disks. Nowadays ppl uses really BIG
>> images and 1-2-3-8 Tb disks are really common. Unfortunately ppl want
>> to get random IO fast too.  Thus metadata cache should be in memory as
>> in the any other case we will get IOPSes halved (1 operation for
>> metadata cache read and one operation for real read). For 8 Tb image
>> this results in 1 Gb RAM for that. For 1 Mb cluster we get 64 Mb which
>> is much more reasonable.
> Correct, the L2 metadata size is a well-known problem that has been
> discussed extensively, and that has received plenty of attention.
>
>> Though with 1 Mb cluster the reclaim process becomes much-much
>> worse. I can not give exact number, unfortunately.  AFAIR the image
>> occupies 30-50% more space. Guys, I would appreciate if you will
>> correct me here with real numbers.
> Correct, because the cluster size is the smallest unit of allocation, so
> a 16KB write on an empty area of the image will always allocate a
> complete 1MB cluster.

>> Thus in respect to this patterns subclusters could give us benefits of
>> fast random IO and good reclaim rate.
> Exactly, but that fast random I/O would only happen when allocating new
> clusters. Once the clusters are allocated it doesn't provide any
> additional performance benefit.

No, I am talking about the situation after the allocation. That is
the main reason why I have a feeling that sub-clusters could provide a
benefit.

OK. The situation (1) is the following:
- the disk is completely allocated
- the QCOW2 image size is 8 TB
- we have an image with 1 MB clusters/64k sub-clusters (for simplicity)
- the L2 metadata cache size is 128 MB (64 MB L2 tables, 64 MB other data)
- holes are punched on a sub-cluster basis, i.e. with 64 KB granularity

In this case a random IO test will give near-native IOPS. Metadata is
in memory, no additional reads are required. Wasted host filesystem
space (due to cluster size) is kept to a minimum, i.e. at the level of
the "pre-subcluster" QCOW2.

Situation (2):
- the 8 TB QCOW2 image is completely allocated
- 1 MB cluster size, 128 MB L2 cache size

Nearly the same performance as (1), but much smaller disk space
savings from holes.

Situation (3):
- 8 TB QCOW2 image, completely allocated
- 64 KB cluster size, 128 MB L2 cache

Random IO performance is halved compared to (1) and (2) due to the
metadata re-read needed for each operation. Same disk space savings
as in case (1).

Please note, I am not talking now about your case with COW. Here the
allocation is performed on a sub-cluster basis, i.e. the absence of a
sub-cluster in the image means a hole at that offset. This is an
important difference.

>> I would consider 64k cluster/8k subcluster as too extreme for me.  In
>> reality we would end up with completely fragmented image very soon.
> You mean because of the 64k cluster size, or because of the 8k
> subcluster size? If it's the former, yes. If it's the latter, it can be
> solved by preallocating the cluster with fallocate(). But then you would
> lose the benefit of the good reclaim rate.

You are optimizing COW speed and your proposal is about that, so you
keep the cluster as the minimal allocation unit. I am talking about a
slightly different pattern of subcluster benefits, where the offset
allocation unit is the cluster while the space allocation unit is the
sub-cluster.

This is an important difference, and that is why I am saying that for
my case an 8 KB space allocation unit is too extreme. These cases
should somehow be kept separate.

Den


* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-27 13:59 [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images Alberto Garcia
  2019-06-27 14:19 ` Denis Lunev
@ 2019-06-27 16:54 ` Kevin Wolf
  2019-06-27 17:08   ` Denis Lunev
  2019-06-28 12:57   ` Alberto Garcia
  1 sibling, 2 replies; 29+ messages in thread
From: Kevin Wolf @ 2019-06-27 16:54 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Anton Nefedov, Denis V. Lunev, qemu-block, qemu-devel, Max Reitz

Am 27.06.2019 um 15:59 hat Alberto Garcia geschrieben:
> Hi all,
> 
> a couple of years ago I came to the mailing list with a proposal to
> extend the qcow2 format to add subcluster allocation.
> 
> You can read the original message (and the discussion thread that came
> afterwards) here:
> 
>    https://lists.gnu.org/archive/html/qemu-block/2017-04/msg00178.html
> 
> The description of the problem from the original proposal is still
> valid so I won't repeat it here.
> 
> What I have been doing during the past few weeks was to retake the
> code that I wrote in 2017, make it work with the latest QEMU and fix
> many of its bugs. I have again a working prototype which is by no
> means complete but it allows us to have up-to-date information about
> what we can expect from this feature.
> 
> My goal with this message is to retake the discussion and re-evaluate
> whether this is a feature that we'd like for QEMU in light of the test
> results and all the changes that we have had in the past couple of
> years.
> 
> === Test results ===
> 
> I ran these tests with the same hardware configuration as in 2017: an
> SSD drive and random 4KB write requests to an empty 40GB qcow2 image.
> 
> Here are the results when the qcow2 file is backed by a fully
> populated image. There are 8 subclusters per cluster and the
> subcluster size is in brackets:
> 
> |-----------------+----------------+-----------------|
> |  Cluster size   | subclusters=on | subclusters=off |
> |-----------------+----------------+-----------------|
> |   2 MB (256 KB) |   571 IOPS     |  124 IOPS       |
> |   1 MB (128 KB) |   863 IOPS     |  212 IOPS       |
> | 512 KB  (64 KB) |  1678 IOPS     |  365 IOPS       |
> | 256 KB  (32 KB) |  2618 IOPS     |  568 IOPS       |
> | 128 KB  (16 KB) |  4907 IOPS     |  873 IOPS       |
> |  64 KB   (8 KB) | 10613 IOPS     | 1680 IOPS       |
> |  32 KB   (4 KB) | 13038 IOPS     | 2476 IOPS       |
> |   4 KB (512 B)  |   101 IOPS     |  101 IOPS       |
> |-----------------+----------------+-----------------|

So at first sight, if you compare the numbers in the same row,
subclusters=on is a clear winner.

But almost more interesting is the observation that, at least for
large cluster sizes, subcluster size X performs almost identically to
cluster size X without subclusters:

                as subcluster size  as cluster size, subclusters=off
    256 KB      571 IOPS            568 IOPS
    128 KB      863 IOPS            873 IOPS
    64 KB       1678 IOPS           1680 IOPS
    32 KB       2618 IOPS           2476 IOPS
    ...
    4 KB        13038 IOPS          101 IOPS

Something interesting happens in the part that you didn't benchmark
between 4 KB and 32 KB (actually, maybe it has already started for the
32 KB case): Performance collapses for small cluster sizes, but it
reaches record highs for small subclusters. I suspect that this is
because L2 tables are becoming very small with 4 KB clusters, but they
are still 32 KB if 4 KB is only the subcluster size. (By the way, did
the L2 cache cover the whole disk in your benchmarks?)
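
A quick back-of-the-envelope calculation of how much guest data a
single L2 table covers (with the current 8-byte entries; each L2 table
occupies exactly one cluster) makes the difference obvious:

#include <stdio.h>

int main(void)
{
    unsigned long cluster_sizes[] = { 4096, 32768 };   /* 4 KB, 32 KB */

    for (int i = 0; i < 2; i++) {
        unsigned long cs = cluster_sizes[i];
        unsigned long entries = cs / 8;        /* entries per L2 table */
        unsigned long covered = entries * cs;  /* bytes mapped per table */
        printf("%2lu KB clusters: %4lu entries per L2 table, %3lu MB covered\n",
               cs >> 10, entries, covered >> 20);
    }
    return 0;
}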

I think this gives us two completely different motivations why
subclusters could be useful, depending on the cluster size you're using:

1. If you use small cluster sizes like 32 KB/4 KB, then obviously you
   can get IOPS rates during cluster allocation that you couldn't come
   even close to before. I think this is a quite strong argument in
   favour of the feature.

2. With larger cluster sizes, you don't get a significant difference in
   the performance during cluster allocation compared to just using the
   subcluster size as the cluster size without having subclusters. Here,
   the motivation could be something along the lines of avoiding
   fragmentation. This would probably need more benchmarks to check how
   fragmentation affects the performance after the initial write.

   This one could possibly be a valid justification, too, but I think it
   would need more work on demonstrating that the effects are real and
   justify the implementation and long-term maintenance effort required
   for subclusters.

> Some comments about the results, after comparing them with those from
> 2017:
> 
> - As expected, 32KB clusters / 4 KB subclusters give the best results
>   because that matches the size of the write request and therefore
>   there's no copy-on-write involved.
> 
> - Allocation is generally faster now in all cases (between 20-90%,
>   depending on the case). We have made several optimizations to the
>   code since last time, and I suppose that the COW changes made in
>   commits b3cf1c7cf8 and ee22a9d869 are probably the main factor
>   behind these improvements.
> 
> - Apart from the 64KB/8KB case (which is much faster), the patterns are
>   generally the same: subcluster allocation offers similar performance
>   benefits compared to last time, so there are no surprises in this
>   area.
> 
> Then I ran the tests again using the same environment but without a
> backing image. The goal is to measure the impact of subcluster
> allocation on completely empty images.
> 
> Here we have an important change: since commit c8bb23cbdb empty
> clusters are preallocated and filled with zeroes using an efficient
> operation (typically fallocate() with FALLOC_FL_ZERO_RANGE) instead of
> writing the zeroes with the usual pwrite() call.
> 
> The effects of this are dramatic, so I decided to run two sets of
> tests: one with this optimization and one without it.
> 
> Here are the results:
> 
> |-----------------+----------------+-----------------+----------------+-----------------|
> |                 | Initialization with fallocate()  |  Initialization with pwritev()   |
> |-----------------+----------------+-----------------+----------------+-----------------|
> |  Cluster size   | subclusters=on | subclusters=off | subclusters=on | subclusters=off |
> |-----------------+----------------+-----------------+----------------+-----------------|
> |   2 MB (256 KB) | 14468 IOPS     | 14776 IOPS      |  1181 IOPS     |  260 IOPS       |
> |   1 MB (128 KB) | 13752 IOPS     | 14956 IOPS      |  1916 IOPS     |  358 IOPS       |
> | 512 KB  (64 KB) | 12961 IOPS     | 14776 IOPS      |  4038 IOPS     |  684 IOPS       |
> | 256 KB  (32 KB) | 12790 IOPS     | 14534 IOPS      |  6172 IOPS     | 1213 IOPS       |
> | 128 KB  (16 KB) | 12550 IOPS     | 13967 IOPS      |  8700 IOPS     | 1976 IOPS       |
> |  64 KB   (8 KB) | 12491 IOPS     | 13432 IOPS      | 11735 IOPS     | 4267 IOPS       |
> |  32 KB   (4 KB) | 13203 IOPS     | 11752 IOPS      | 12366 IOPS     | 6306 IOPS       |
> |   4 KB (512 B)  |   103 IOPS     |   101 IOPS      |   101 IOPS     |  101 IOPS       |
> |-----------------+----------------+-----------------+----------------+-----------------|
> 
> Comments:
> 
> - With the old-style allocation method using pwritev() we get similar
>   benefits as we did last time. The comments from the test with a
>   backing image apply to this one as well.
> 
> - However the new allocation method is so efficient that having
>   subclusters does not offer any performance benefit. It even slows
>   down things a bit in most cases, so we'd probably need to fine tune
>   the algorithm in order to get similar results.
> 
> - In light of these numbers I also think that even when there's a
>   backing image we could preallocate the full cluster but only do COW
>   on the affected subclusters. This would leave the rest of the cluster
>   preallocated on disk but unallocated in the bitmap. This would
>   probably reduce on-disk fragmentation, which was one of the concerns
>   raised during the original discussion.

Yes, especially when we have to do some COW anyway, this would come at
nearly zero cost because we already call fallocate() in that case.

I'm not sure whether it's worth doing when we don't have to do COW. We
will at least avoid qcow2 fragmentation because of the large cluster
size. And file systems are a lot cleverer than qcow2 at avoiding
fragmentation on the file system level. So it might not actually make
a big difference in practice.

This is pure theory, though. We'd have to benchmark things to give a
definite answer.

> I also ran some tests on a rotating HDD drive. Here having subclusters
> doesn't make a big difference regardless of whether there is a backing
> image or not, so we can ignore this scenario.

Interesting, this is kind of unexpected. Why would avoiding COW not
make a difference on rotating HDDs? (All of this is cache=none, right?)

> === Changes to the on-disk format ===
> 
> In my original proposal I described 3 different alternatives for
> storing the subcluster bitmaps. I'm naming them here, but refer to
> that message for more details.
> 
> (1) Storing the bitmap inside the 64-bit entry
> (2) Making L2 entries 128-bit wide.
> (3) Storing the bitmap somewhere else
> 
> I used (1) for this implementation for simplicity, but I think (2) is
> probably the best one.

Which would give us 32 bits for the subclusters, so you'd get 128k/4k or
2M/64k. Or would you intend to use some of these 32 bits for something
different?

I think (3) is the worst because it adds another kind of metadata table
that we have to consider for ordering updates. So it might come with
more frequent cache flushes.

> ===========================
> 
> And I think that's all. As you can see I didn't want to go much into
> the open technical questions (I think the on-disk format would be the
> main one), the first goal should be to decide whether this is still an
> interesting feature or not.
> 
> So, any questions or comments will be much appreciated.

It does look very interesting to me, at least for small subcluster sizes.

For the larger ones, I suspect that the Virtuozzo guys might be
interested in performing more benchmarks to see whether it improves the
fragmentation problems that they have talked about a lot. It might end
up being interesting for these cases, too.

Kevin



* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-27 16:54 ` Kevin Wolf
@ 2019-06-27 17:08   ` Denis Lunev
  2019-06-28 16:32     ` Alberto Garcia
  2019-07-11 14:08     ` Alberto Garcia
  2019-06-28 12:57   ` Alberto Garcia
  1 sibling, 2 replies; 29+ messages in thread
From: Denis Lunev @ 2019-06-27 17:08 UTC (permalink / raw)
  To: Kevin Wolf, Alberto Garcia
  Cc: Anton Nefedov, qemu-devel, qemu-block, Max Reitz

[snip]
>> ===========================
>>
>> And I think that's all. As you can see I didn't want to go much into
>> the open technical questions (I think the on-disk format would be the
>> main one), the first goal should be to decide whether this is still an
>> interesting feature or not.
>>
>> So, any questions or comments will be much appreciated.
> It does look very interesting to me, at least for small subcluster sizes.
>
> For the larger ones, I suspect that the Virtuozzo guys might be
> interested in performing more benchmarks to see whether it improves the
> fragmentation problems that they have talked about a lot. It might end
> up being interesting for these cases, too.
>
> Kevin
There is no difference in terms of data continuity if the space under
the whole cluster is allocated with fallocate(), as noted by Berto.

For large sizes I have posted a different use case, with slightly
different constraints. The subcluster option could be interesting in
terms of random IO on big disks even after the allocation.

But could we get a link to a repo with the current version of the
patches?

Den


* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-27 15:42     ` Alberto Garcia
@ 2019-06-28  9:20       ` Kevin Wolf
  2019-06-28  9:53         ` Alberto Garcia
  0 siblings, 1 reply; 29+ messages in thread
From: Kevin Wolf @ 2019-06-28  9:20 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

Am 27.06.2019 um 17:42 hat Alberto Garcia geschrieben:
> On Thu 27 Jun 2019 05:38:56 PM CEST, Alberto Garcia wrote:
> >> I would consider 64k cluster/8k subcluster as too extreme for me.
> 
> I forgot to add: this 64k/8k ratio is only with my current prototype.
> 
> In practice if we go with the 128-bit L2 entries we would have 64
> subclusters per cluster, or 32 if we want to have a separate
> QCOW_OFLAG_ZERO for each subcluster (would we need this?).

Yes, I think we'd want to have a separate zero flag for each subcluster.
Otherwise, when writing to a zero cluster, you'd have to COW the whole
supercluster again. And you'd have to fall back more often to explicit
writes of a zeroed buffer rather than using efficient operations.

Kevin



* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28  9:20       ` Kevin Wolf
@ 2019-06-28  9:53         ` Alberto Garcia
  2019-06-28 10:04           ` Kevin Wolf
  0 siblings, 1 reply; 29+ messages in thread
From: Alberto Garcia @ 2019-06-28  9:53 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

On Fri 28 Jun 2019 11:20:57 AM CEST, Kevin Wolf wrote:
>> >> I would consider 64k cluster/8k subcluster as too extreme for me.
>> 
>> I forgot to add: this 64k/8k ratio is only with my current prototype.
>> 
>> In practice if we go with the 128-bit L2 entries we would have 64
>> subclusters per cluster, or 32 if we want to have a separate
>> QCOW_OFLAG_ZERO for each subcluster (would we need this?).
>
> Yes, I think we'd want to have a separate zero flag for each
> subcluster.  Otherwise, when writing to a zero cluster, you'd have to
> COW the whole supercluster again.

Yes, if the original cluster had the QCOW_OFLAG_ZERO bit, but not if
it was simply unallocated.

Berto



* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28  9:53         ` Alberto Garcia
@ 2019-06-28 10:04           ` Kevin Wolf
  2019-06-28 13:19             ` Alberto Garcia
  0 siblings, 1 reply; 29+ messages in thread
From: Kevin Wolf @ 2019-06-28 10:04 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

Am 28.06.2019 um 11:53 hat Alberto Garcia geschrieben:
> On Fri 28 Jun 2019 11:20:57 AM CEST, Kevin Wolf wrote:
> >> >> I would consider 64k cluster/8k subcluster as too extreme for me.
> >> 
> >> I forgot to add: this 64k/8k ratio is only with my current prototype.
> >> 
> >> In practice if we go with the 128-bit L2 entries we would have 64
> >> subclusters per cluster, or 32 if we want to have a separate
> >> QCOW_OFLAG_ZERO for each subcluster (would we need this?).
> >
> > Yes, I think we'd want to have a separate zero flag for each
> > subcluster.  Otherwise, when writing to a zero cluster, you'd have to
> > COW the whole supercluster again.
> 
> Yes if the original cluster had the QCOW_OFLAG_ZERO bit, not if it was
> simply unallocated.

Right, but we already noticed before writing the spec for v3 that
leaving clusters simply unallocated doesn't quite cut it. The zero flag
is only really necessary when you have a backing file, of course, but
that's one of the more interesting cases for subclusters anyway.

Kevin



* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-27 16:54 ` Kevin Wolf
  2019-06-27 17:08   ` Denis Lunev
@ 2019-06-28 12:57   ` Alberto Garcia
  2019-06-28 13:03     ` Alberto Garcia
  1 sibling, 1 reply; 29+ messages in thread
From: Alberto Garcia @ 2019-06-28 12:57 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Anton Nefedov, Denis V. Lunev, qemu-block, qemu-devel, Max Reitz

On Thu 27 Jun 2019 06:54:34 PM CEST, Kevin Wolf wrote:
>> |-----------------+----------------+-----------------|
>> |  Cluster size   | subclusters=on | subclusters=off |
>> |-----------------+----------------+-----------------|
>> |   2 MB (256 KB) |   571 IOPS     |  124 IOPS       |
>> |   1 MB (128 KB) |   863 IOPS     |  212 IOPS       |
>> | 512 KB  (64 KB) |  1678 IOPS     |  365 IOPS       |
>> | 256 KB  (32 KB) |  2618 IOPS     |  568 IOPS       |
>> | 128 KB  (16 KB) |  4907 IOPS     |  873 IOPS       |
>> |  64 KB   (8 KB) | 10613 IOPS     | 1680 IOPS       |
>> |  32 KB   (4 KB) | 13038 IOPS     | 2476 IOPS       |
>> |   4 KB (512 B)  |   101 IOPS     |  101 IOPS       |
>> |-----------------+----------------+-----------------|
>
> So at the first sight, if you compare the numbers in the same row,
> subclusters=on is a clear winner.

Yes, as expected.

> But almost more interesting is the observation that at least for large
> cluster sizes, subcluster size X performs almost identical to cluster
> size X without subclusters:

But that's also to be expected, isn't it? The only difference (in terms
of I/O) between allocating a 64KB cluster and a 64KB subcluster is how
the L2 entry is updated. The amount of data that is read and written is
the same.

> Something interesting happens in the part that you didn't benchmark
> between 4 KB and 32 KB (actually, maybe it has already started for the
> 32 KB case): Performance collapses for small cluster sizes, but it
> reaches record highs for small subclusters.

I didn't measure that initially because I thought that having
subclusters < 4KB was not very useful. The 512b case was just to see
how it would perform in the extreme case. I decided to get the rest of
the numbers anyway, so here's the complete table with the missing rows:

|--------------+-----------------+----------------+-----------------|
| Cluster (KB) | Subcluster (KB) | subclusters=on | subclusters=off |
|--------------+-----------------+----------------+-----------------|
|         2048 |             256 |            571 |             124 |
|         1024 |             128 |            863 |             212 |
|          512 |              64 |           1678 |             365 |
|          256 |              32 |           2618 |             568 |
|          128 |              16 |           4907 |             873 |
|           64 |               8 |          10613 |            1680 |
|           32 |               4 |          13038 |            2476 |
|           16 |               2 |           7555 |            3389 |
|            8 |               1 |            299 |             420 |
|            4 |             0.5 |            101 |             101 |
|--------------+-----------------+----------------+-----------------|

> I suspect that this is because L2 tables are becoming very small with
> 4 KB clusters, but they are still 32 KB if 4 KB is only the subcluster
> size.

Yes, I explained that in my original proposal from 2017. I didn't
actually investigate further, but my take is that 4KB clusters require
constant allocations and refcount updates, plus L2 tables fill up very
quickly.

> (By the way, did the L2 cache cover the whole disk in your
> benchmarks?)

Yes, in all cases (I forgot to mention that, sorry).

> I think this gives us two completely different motivations why
> subclusters could be useful, depending on the cluster size you're
> using:
>
> 1. If you use small cluster sizes like 32 KB/4 KB, then obviously you
>    can get IOPS rates during cluster allocation that you couldn't come
>    even close to before. I think this is a quite strong argument in
>    favour of the feature.

Yes, indeed. You would need to select the subcluster size so it matches
the size of guest I/O requests (the size of the filesystem block is
probably the best choice).

> 2. With larger cluster sizes, you don't get a significant difference
>    in the performance during cluster allocation compared to just using
>    the subcluster size as the cluster size without having
>    subclusters. Here, the motivation could be something along the
>    lines of avoiding fragmentation. This would probably need more
>    benchmarks to check how fragmentation affects the performance after
>    the initial write.
>
>    This one could possibly be a valid justification, too, but I think it
>    would need more work on demonstrating that the effects are real and
>    justify the implementation and long-term maintenance effort required
>    for subclusters.

I agree. However another benefit of large cluster sizes is that you
reduce the amount of metadata, so you get the same performance with a
smaller L2 cache.

>> I also ran some tests on a rotating HDD drive. Here having
>> subclusters doesn't make a big difference regardless of whether there
>> is a backing image or not, so we can ignore this scenario.
>
> Interesting, this is kind of unexpected. Why would avoided COW not
> make a difference on rotating HDDs? (All of this is cache=none,
> right?)

, the 32K/4K with no COW is obviously much faster 

>
>> === Changes to the on-disk format ===
>> 
>> In my original proposal I described 3 different alternatives for
>> storing the subcluster bitmaps. I'm naming them here, but refer to
>> that message for more details.
>> 
>> (1) Storing the bitmap inside the 64-bit entry
>> (2) Making L2 entries 128-bit wide.
>> (3) Storing the bitmap somewhere else
>> 
>> I used (1) for this implementation for simplicity, but I think (2) is
>> probably the best one.
>
> Which would give us 32 bits for the subclusters, so you'd get 128k/4k or
> 2M/64k. Or would you intend to use some of these 32 bits for something
> different?
>
> I think (3) is the worst because it adds another kind of metadata table
> that we have to consider for ordering updates. So it might come with
> more frequent cache flushes.
>
>> ===========================
>> 
>> And I think that's all. As you can see I didn't want to go much into
>> the open technical questions (I think the on-disk format would be the
>> main one), the first goal should be to decide whether this is still an
>> interesting feature or not.
>> 
>> So, any questions or comments will be much appreciated.
>
> It does look very interesting to me, at least for small subcluster sizes.
>
> For the larger ones, I suspect that the Virtuozzo guys might be
> interested in performing more benchmarks to see whether it improves the
> fragmentation problems that they have talked about a lot. It might end
> up being interesting for these cases, too.
>
> Kevin



* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 12:57   ` Alberto Garcia
@ 2019-06-28 13:03     ` Alberto Garcia
  0 siblings, 0 replies; 29+ messages in thread
From: Alberto Garcia @ 2019-06-28 13:03 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Anton Nefedov, Denis V. Lunev, qemu-block, qemu-devel, Max Reitz

I pressed "send" too early, here's the last part of my reply:

On Fri 28 Jun 2019 02:57:56 PM CEST, Alberto Garcia wrote:
>> I also ran some tests on a rotating HDD drive. Here having
>> subclusters doesn't make a big difference regardless of whether there
>> is a backing image or not, so we can ignore this scenario.

> Interesting, this is kind of unexpected. Why would avoided COW not
> make a difference on rotating HDDs? (All of this is cache=none,
> right?)

The 32K/4K case with no COW is obviously much faster (it's also
faster with 1MB and 2MB clusters); it's the rest of the scenarios that
show no improvement.

>> === Changes to the on-disk format ===
>> 
>> In my original proposal I described 3 different alternatives for
>> storing the subcluster bitmaps. I'm naming them here, but refer to
>> that message for more details.
>> 
>> (1) Storing the bitmap inside the 64-bit entry
>> (2) Making L2 entries 128-bit wide.
>> (3) Storing the bitmap somewhere else
>> 
>> I used (1) for this implementation for simplicity, but I think (2) is
>> probably the best one.
>
> Which would give us 32 bits for the subclusters, so you'd get 128k/4k
> or 2M/64k. Or would you intend to use some of these 32 bits for
> something different?

No, 32 bits for subclusters, or 64 if we don't have the 'all zeroes' bit
(we discussed this in a separate message).

> I think (3) is the worst because it adds another kind of metadata
> table that we have to consider for ordering updates. So it might come
> with more frequent cache flushes.

Yes I agree.

Berto



* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 10:04           ` Kevin Wolf
@ 2019-06-28 13:19             ` Alberto Garcia
  2019-06-28 14:16               ` Kevin Wolf
  0 siblings, 1 reply; 29+ messages in thread
From: Alberto Garcia @ 2019-06-28 13:19 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

On Fri 28 Jun 2019 12:04:22 PM CEST, Kevin Wolf wrote:
> Am 28.06.2019 um 11:53 hat Alberto Garcia geschrieben:
>> On Fri 28 Jun 2019 11:20:57 AM CEST, Kevin Wolf wrote:
>> >> >> I would consider 64k cluster/8k subcluster as too extreme for me.
>> >> 
>> >> I forgot to add: this 64k/8k ratio is only with my current prototype.
>> >> 
>> >> In practice if we go with the 128-bit L2 entries we would have 64
>> >> subclusters per cluster, or 32 if we want to have a separate
>> >> QCOW_OFLAG_ZERO for each subcluster (would we need this?).
>> >
>> > Yes, I think we'd want to have a separate zero flag for each
>> > subcluster.  Otherwise, when writing to a zero cluster, you'd have to
>> > COW the whole supercluster again.
>> 
>> Yes if the original cluster had the QCOW_OFLAG_ZERO bit, not if it
>> was simply unallocated.
>
> Right, but that leaving clusters simply unallocated doesn't quite cut
> it is something we already noticed before writing the spec for
> v3. Only really necessary when you have a backing file, of course, but
> that's one of the more interesting cases for subclusters anyway.

I wonder if it would be possible to have a hybrid solution:

With 64 bits to indicate the allocation status of each subcluster we
still have the original L2 entry with its QCOW_OFLAG_ZERO bit, so we
need to specify what happens if that bit is set and at the same time
some subclusters are marked as allocated.

One possibility is that allocated subclusters have actual data and the
rest are zero subclusters.

Berto



* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 13:19             ` Alberto Garcia
@ 2019-06-28 14:16               ` Kevin Wolf
  2019-06-28 16:31                 ` Alberto Garcia
  0 siblings, 1 reply; 29+ messages in thread
From: Kevin Wolf @ 2019-06-28 14:16 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

Am 28.06.2019 um 15:19 hat Alberto Garcia geschrieben:
> On Fri 28 Jun 2019 12:04:22 PM CEST, Kevin Wolf wrote:
> > Am 28.06.2019 um 11:53 hat Alberto Garcia geschrieben:
> >> On Fri 28 Jun 2019 11:20:57 AM CEST, Kevin Wolf wrote:
> >> >> >> I would consider 64k cluster/8k subcluster as too extreme for me.
> >> >> 
> >> >> I forgot to add: this 64k/8k ratio is only with my current prototype.
> >> >> 
> >> >> In practice if we go with the 128-bit L2 entries we would have 64
> >> >> subclusters per cluster, or 32 if we want to have a separate
> >> >> QCOW_OFLAG_ZERO for each subcluster (would we need this?).
> >> >
> >> > Yes, I think we'd want to have a separate zero flag for each
> >> > subcluster.  Otherwise, when writing to a zero cluster, you'd have to
> >> > COW the whole supercluster again.
> >> 
> >> Yes if the original cluster had the QCOW_OFLAG_ZERO bit, not if it
> >> was simply unallocated.
> >
> > Right, but that leaving clusters simply unallocated doesn't quite cut
> > it is something we already noticed before writing the spec for
> > v3. Only really necessary when you have a backing file, of course, but
> > that's one of the more interesting cases for subclusters anyway.
> 
> I wonder if it would be possible to have a hybrid solution:
> 
> With 64 bits to indicate the allocation status of each subcluster we
> still have the original L2 entry with its QCOW_OFLAG_ZERO bit, so we
> need to specify what happens if that bit is set and at the same time
> some subclusters are marked as allocated.
> 
> One possibility is that allocated subclusters have actual data and the
> rest are zero subclusters.

Hm, yes, that would be possible.

However, that would require some additional complexity in write_zeroes
then: If the zero flag isn't set yet, then we need to check that no
other subcluster is unallocated before we can turn the zero flag on for
the whole cluster. You couldn't have subclusters referring to the
backing file and zeroed subclusters at the same time.
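
Roughly, that check could look something like this (only a sketch,
assuming a 64-bit allocation bitmap and the range being zeroed
expressed as a subcluster mask):

#include <stdbool.h>
#include <stdint.h>

/* The cluster-level zero flag may only be set if every subcluster
 * outside the range being zeroed already has data of its own;
 * otherwise those subclusters would silently stop reading from the
 * backing file. */
static bool can_set_cluster_zero_flag(uint64_t allocated_bitmap,
                                      uint64_t zeroed_mask)
{
    return (allocated_bitmap | zeroed_mask) == UINT64_MAX;
}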

Maybe doubling the number of bits would actually be worth it to avoid
the additional complexity in write_zeroes.

Another thing to consider is that with two bits per subcluster, we'd
still have one combination left for other purposes (we only have
unallocated, allocated, zero). In previous discussions, we talked about
the possibility of using that for a new "write-through to backing file"
cluster type. It's not clear if this is a good idea, but it came up
multiple times in the past.

Kevin



* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-27 16:05     ` Denis Lunev
@ 2019-06-28 14:43       ` Alberto Garcia
  2019-06-28 14:47         ` Denis Lunev
  2019-06-28 14:57         ` Kevin Wolf
  0 siblings, 2 replies; 29+ messages in thread
From: Alberto Garcia @ 2019-06-28 14:43 UTC (permalink / raw)
  To: Denis Lunev, qemu-devel; +Cc: Kevin Wolf, Anton Nefedov, qemu-block, Max Reitz

On Thu 27 Jun 2019 06:05:55 PM CEST, Denis Lunev wrote:
>>> Thus in respect to this patterns subclusters could give us benefits
>>> of fast random IO and good reclaim rate.
>> Exactly, but that fast random I/O would only happen when allocating
>> new clusters. Once the clusters are allocated it doesn't provide any
>> additional performance benefit.
>
> No, I am talking about the situation after the allocation. That is the
> main point why I have a feeling that sub-cluster could provide a
> benefit.
>
> OK. The situation (1) is the following:
> - the disk is completely allocated
> - QCOW2 image size is 8 Tb
> - we have image with 1 Mb cluster/64k sub-cluster (for simplicity)
> - L2 metadata cache size is 128 Mb (64 Mb L2 tables, 64 Mb other data)
> - holes are made on a sub-cluster bases, i.e. with 64 Kb granularity
>
> In this case random IO test will give near native IOPS
> result. Metadata is in memory, no additional reads are
> required. Wasted host filesystem space (due to cluster size) is kept
> at minimum, i.e. on the level of the "pre-subcluster" QCOW2.
>
> Situation (2):
> - 8 Tb QCOW2 image is completely allocated
> - 1 Mb cluster size, 128 Mb L2 cache size
>
> Near same performance as (1), but much less disk space savings for
> holes.
>
> Situation (3):
> - 8 Tb QCOW2 image, completely allocated
> - 64 Kb cluster size, 128 MB L2 cache
>
> Random IO performance halved from (1) and (2) due to metadata re-read
> for each subsequent operation. Same disk space savings as in case (1).

If I understood correctly what you are trying to say, subclusters allow
us to use larger cluster sizes in order to reduce the amount of L2
metadata (and therefore the cache size) while keeping the same space
benefits as smaller clusters.

> Please note, I am not talking now about your case with COW. Here the
> allocation is performed on the sub-cluster basis, i.e. the absence of
> the sub-cluster in the image means hole on that offset. This is
> important difference.

I mentioned the possibility that if you have a case like 2MB / 64KB and
you write to an empty cluster then you could allocate the necessary
subclusters, and additionally fallocate() the space of the whole cluster
(2MB) in order to try to keep it contiguous.

With this we would lose the space saving advantage of having
subclusters. But perhaps that would work for smaller cluster sizes (it
would mitigate the fragmentation problem).

Berto



* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 14:43       ` Alberto Garcia
@ 2019-06-28 14:47         ` Denis Lunev
  2019-06-28 14:57         ` Kevin Wolf
  1 sibling, 0 replies; 29+ messages in thread
From: Denis Lunev @ 2019-06-28 14:47 UTC (permalink / raw)
  To: Alberto Garcia, qemu-devel
  Cc: Kevin Wolf, Anton Nefedov, qemu-block, Max Reitz

On 6/28/19 5:43 PM, Alberto Garcia wrote:
> On Thu 27 Jun 2019 06:05:55 PM CEST, Denis Lunev wrote:
>>>> Thus in respect to this patterns subclusters could give us benefits
>>>> of fast random IO and good reclaim rate.
>>> Exactly, but that fast random I/O would only happen when allocating
>>> new clusters. Once the clusters are allocated it doesn't provide any
>>> additional performance benefit.
>> No, I am talking about the situation after the allocation. That is the
>> main point why I have a feeling that sub-cluster could provide a
>> benefit.
>>
>> OK. The situation (1) is the following:
>> - the disk is completely allocated
>> - QCOW2 image size is 8 Tb
>> - we have image with 1 Mb cluster/64k sub-cluster (for simplicity)
>> - L2 metadata cache size is 128 Mb (64 Mb L2 tables, 64 Mb other data)
>> - holes are made on a sub-cluster bases, i.e. with 64 Kb granularity
>>
>> In this case random IO test will give near native IOPS
>> result. Metadata is in memory, no additional reads are
>> required. Wasted host filesystem space (due to cluster size) is kept
>> at minimum, i.e. on the level of the "pre-subcluster" QCOW2.
>>
>> Situation (2):
>> - 8 Tb QCOW2 image is completely allocated
>> - 1 Mb cluster size, 128 Mb L2 cache size
>>
>> Near same performance as (1), but much less disk space savings for
>> holes.
>>
>> Situation (3):
>> - 8 Tb QCOW2 image, completely allocated
>> - 64 Kb cluster size, 128 MB L2 cache
>>
>> Random IO performance halved from (1) and (2) due to metadata re-read
>> for each subsequent operation. Same disk space savings as in case (1).
> If I understood correctly what you are trying to say, subclusters allow
> us to use larger cluster sizes in order to reduce the amount of L2
> metadata (and therefore the cache size) while keeping the same space
> benefits as smaller clusters.
yes

>> Please note, I am not talking now about your case with COW. Here the
>> allocation is performed on a sub-cluster basis, i.e. the absence of
>> the sub-cluster in the image means a hole at that offset. This is an
>> important difference.
> I mentioned the possibility that if you have a case like 2MB / 64KB and
> you write to an empty cluster then you could allocate the necessary
> subclusters, and additionally fallocate() the space of the whole cluster
> (2MB) in order to try to keep it contiguous.
>
> With this we would lose the space saving advantage of having
> subclusters. But perhaps that would work for smaller cluster sizes (it
> would mitigate the fragmentation problem).
yes, this is the distinction and a completely different use case.
We have obtained it over time from our customers,
who want very fast performance AND space conservation
at once. This is still the case for SSD users, since SSDs
are fast but small.

Den

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 14:43       ` Alberto Garcia
  2019-06-28 14:47         ` Denis Lunev
@ 2019-06-28 14:57         ` Kevin Wolf
  2019-06-28 15:02           ` Alberto Garcia
  1 sibling, 1 reply; 29+ messages in thread
From: Kevin Wolf @ 2019-06-28 14:57 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

Am 28.06.2019 um 16:43 hat Alberto Garcia geschrieben:
> On Thu 27 Jun 2019 06:05:55 PM CEST, Denis Lunev wrote:
> > Please note, I am not talking now about your case with COW. Here the
> > allocation is performed on a sub-cluster basis, i.e. the absence of
> > the sub-cluster in the image means a hole at that offset. This is an
> > important difference.
> 
> I mentioned the possibility that if you have a case like 2MB / 64KB and
> you write to an empty cluster then you could allocate the necessary
> subclusters, and additionally fallocate() the space of the whole cluster
> (2MB) in order to try to keep it contiguous.
> 
> With this we would lose the space saving advantage of having
> subclusters. But perhaps that would work for smaller cluster sizes (it
> would mitigate the fragmentation problem).

There seem to be use cases for both ways. So does this need to be an
option?

Kevin


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 14:57         ` Kevin Wolf
@ 2019-06-28 15:02           ` Alberto Garcia
  2019-06-28 15:03             ` Denis Lunev
  2019-06-28 15:09             ` Kevin Wolf
  0 siblings, 2 replies; 29+ messages in thread
From: Alberto Garcia @ 2019-06-28 15:02 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

On Fri 28 Jun 2019 04:57:08 PM CEST, Kevin Wolf wrote:
> Am 28.06.2019 um 16:43 hat Alberto Garcia geschrieben:
>> On Thu 27 Jun 2019 06:05:55 PM CEST, Denis Lunev wrote:
>> > Please note, I am not talking now about your case with COW. Here the
>> > allocation is performed on a sub-cluster basis, i.e. the absence of
>> > the sub-cluster in the image means a hole at that offset. This is an
>> > important difference.
>> 
>> I mentioned the possibility that if you have a case like 2MB / 64KB
>> and you write to an empty cluster then you could allocate the
>> necessary subclusters, and additionally fallocate() the space of the
>> whole cluster (2MB) in order to try to keep it contiguous.
>> 
>> With this we would lose the space saving advantage of having
>> subclusters. But perhaps that would work for smaller cluster sizes
>> (it would mitigate the fragmentation problem).
>
> There seem to be use cases for both ways. So does this need to be an
> option?

Probably a runtime option, or a heuristic that decides what to do
depending on the cluster size.

Berto


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 15:02           ` Alberto Garcia
@ 2019-06-28 15:03             ` Denis Lunev
  2019-06-28 15:10               ` Alberto Garcia
  2019-06-28 15:09             ` Kevin Wolf
  1 sibling, 1 reply; 29+ messages in thread
From: Denis Lunev @ 2019-06-28 15:03 UTC (permalink / raw)
  To: Alberto Garcia, Kevin Wolf
  Cc: Anton Nefedov, qemu-devel, qemu-block, Max Reitz

On 6/28/19 6:02 PM, Alberto Garcia wrote:
> On Fri 28 Jun 2019 04:57:08 PM CEST, Kevin Wolf wrote:
>> Am 28.06.2019 um 16:43 hat Alberto Garcia geschrieben:
>>> On Thu 27 Jun 2019 06:05:55 PM CEST, Denis Lunev wrote:
>>>> Please note, I am not talking now about your case with COW. Here the
>>>> allocation is performed on a sub-cluster basis, i.e. the absence of
>>>> the sub-cluster in the image means a hole at that offset. This is an
>>>> important difference.
>>> I mentioned the possibility that if you have a case like 2MB / 64KB
>>> and you write to an empty cluster then you could allocate the
>>> necessary subclusters, and additionally fallocate() the space of the
>>> whole cluster (2MB) in order to try to keep it contiguous.
>>>
>>> With this we would lose the space saving advantage of having
>>> subclusters. But perhaps that would work for smaller cluster sizes
>>> (it would mitigate the fragmentation problem).
>> There seem to be use cases for both ways. So does this need to be an
>> option?
> Probably a runtime option, or a heuristic that decides what to do
> depending on the cluster size.
no, I think that this should be an on-disk option, as it affects the
allocation strategy.

Den

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 15:02           ` Alberto Garcia
  2019-06-28 15:03             ` Denis Lunev
@ 2019-06-28 15:09             ` Kevin Wolf
  2019-06-28 15:12               ` Alberto Garcia
  1 sibling, 1 reply; 29+ messages in thread
From: Kevin Wolf @ 2019-06-28 15:09 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

Am 28.06.2019 um 17:02 hat Alberto Garcia geschrieben:
> On Fri 28 Jun 2019 04:57:08 PM CEST, Kevin Wolf wrote:
> > Am 28.06.2019 um 16:43 hat Alberto Garcia geschrieben:
> >> On Thu 27 Jun 2019 06:05:55 PM CEST, Denis Lunev wrote:
> >> > Please note, I am not talking now about your case with COW. Here the
> >> > allocation is performed on a sub-cluster basis, i.e. the absence of
> >> > the sub-cluster in the image means a hole at that offset. This is an
> >> > important difference.
> >> 
> >> I mentioned the possibility that if you have a case like 2MB / 64KB
> >> and you write to an empty cluster then you could allocate the
> >> necessary subclusters, and additionally fallocate() the space of the
> >> whole cluster (2MB) in order to try to keep it contiguous.
> >> 
> >> With this we would lose the space saving advantage of having
> >> subclusters. But perhaps that would work for smaller cluster sizes
> >> (it would mitigate the fragmentation problem).
> >
> > There seem to be use cases for both ways. So does this need to be an
> > option?
> 
> Probably a runtime option, or a heuristic that decides what to do
> depending on the cluster size.

How would the heuristic decide whether the user wants to save disk space
or whether they consider avoiding fragmentation (i.e. performance) more
important?

Kevin


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 15:03             ` Denis Lunev
@ 2019-06-28 15:10               ` Alberto Garcia
  2019-06-28 15:15                 ` Kevin Wolf
  0 siblings, 1 reply; 29+ messages in thread
From: Alberto Garcia @ 2019-06-28 15:10 UTC (permalink / raw)
  To: Denis Lunev, Kevin Wolf; +Cc: Anton Nefedov, qemu-devel, qemu-block, Max Reitz

On Fri 28 Jun 2019 05:03:13 PM CEST, Denis Lunev wrote:
> On 6/28/19 6:02 PM, Alberto Garcia wrote:
>>>>> Please note, I am not talking now about your case with COW. Here the
>>>>> allocation is performed on a sub-cluster basis, i.e. the absence of
>>>>> the sub-cluster in the image means a hole at that offset. This is an
>>>>> important difference.
>>>> I mentioned the possibility that if you have a case like 2MB / 64KB
>>>> and you write to an empty cluster then you could allocate the
>>>> necessary subclusters, and additionally fallocate() the space of the
>>>> whole cluster (2MB) in order to try to keep it contiguous.
>>>>
>>>> With this we would lose the space saving advantage of having
>>>> subclusters. But perhaps that would work for smaller cluster sizes
>>>> (it would mitigate the fragmentation problem).
>>> There seem to be use cases for both ways. So does this need to be an
>>> option?
>> Probably a runtime option, or a heuristic that decides what to do
>> depending on the cluster size.
> no, I think that this should be on-disk option as this affects
> allocation strategy.

But why does it need to be stored on-disk? It should be theoretically
possible to switch between one strategy and the other at runtime (not
that it would make sense though).

Berto


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 15:09             ` Kevin Wolf
@ 2019-06-28 15:12               ` Alberto Garcia
  2019-07-01  6:22                 ` Kevin Wolf
  0 siblings, 1 reply; 29+ messages in thread
From: Alberto Garcia @ 2019-06-28 15:12 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

On Fri 28 Jun 2019 05:09:11 PM CEST, Kevin Wolf wrote:
> Am 28.06.2019 um 17:02 hat Alberto Garcia geschrieben:
>> On Fri 28 Jun 2019 04:57:08 PM CEST, Kevin Wolf wrote:
>> > Am 28.06.2019 um 16:43 hat Alberto Garcia geschrieben:
>> >> On Thu 27 Jun 2019 06:05:55 PM CEST, Denis Lunev wrote:
>> >> > Please note, I am not talking now about your case with COW. Here the
>> >> > allocation is performed on a sub-cluster basis, i.e. the absence of
>> >> > the sub-cluster in the image means a hole at that offset. This is an
>> >> > important difference.
>> >> 
>> >> I mentioned the possibility that if you have a case like 2MB / 64KB
>> >> and you write to an empty cluster then you could allocate the
>> >> necessary subclusters, and additionally fallocate() the space of the
>> >> whole cluster (2MB) in order to try to keep it contiguous.
>> >> 
>> >> With this we would lose the space saving advantage of having
>> >> subclusters. But perhaps that would work for smaller cluster sizes
>> >> (it would mitigate the fragmentation problem).
>> >
>> > There seem to be use cases for both ways. So does this need to be an
>> > option?
>> 
>> Probably a runtime option, or a heuristic that decides what to do
>> depending on the cluster size.
>
> How would the heuristic decide whether the user wants to save disk space
> or whether they consider avoiding fragmentation (i.e. performance) more
> important?

Well I suppose the fragmentation problem is more important when you have
small clusters and less so when you have large clusters, so that would
be a way to do it.

Of course with an option the user would have the final choice.
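
Just to illustrate what I mean by a heuristic (the threshold here is
completely made up):

    /* Made-up heuristic: preallocate the full cluster only when clusters
     * are small, where fragmentation hurts most; keep holes otherwise. */
    #include <stdbool.h>
    #include <stdint.h>

    static bool prealloc_full_cluster(uint64_t cluster_size)
    {
        return cluster_size <= 128 * 1024;   /* illustrative 128 KB cutoff */
    }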

Berto


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 15:10               ` Alberto Garcia
@ 2019-06-28 15:15                 ` Kevin Wolf
  0 siblings, 0 replies; 29+ messages in thread
From: Kevin Wolf @ 2019-06-28 15:15 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

Am 28.06.2019 um 17:10 hat Alberto Garcia geschrieben:
> On Fri 28 Jun 2019 05:03:13 PM CEST, Denis Lunev wrote:
> > On 6/28/19 6:02 PM, Alberto Garcia wrote:
> >>>>> Please note, I am not talking now about your case with COW. Here the
> >>>>> allocation is performed on a sub-cluster basis, i.e. the absence of
> >>>>> the sub-cluster in the image means a hole at that offset. This is an
> >>>>> important difference.
> >>>> I mentioned the possibility that if you have a case like 2MB / 64KB
> >>>> and you write to an empty cluster then you could allocate the
> >>>> necessary subclusters, and additionally fallocate() the space of the
> >>>> whole cluster (2MB) in order to try to keep it contiguous.
> >>>>
> >>>> With this we would lose the space saving advantage of having
> >>>> subclusters. But perhaps that would work for smaller cluster sizes
> >>>> (it would mitigate the fragmentation problem).
> >>> There seem to be use cases for both ways. So does this need to be an
> >>> option?
> >> Probably a runtime option, or a heuristic that decides what to do
> >> depending on the cluster size.
> > no, I think that this should be on-disk option as this affects
> > allocation strategy.
> 
> But why does it need to be stored on-disk? It should be theoretically
> possible to switch between on strategy and the other at runtime (not
> that it would make sense though).

I think it makes sense to store the default in the image and allow it to
be overridden at runtime, similar to lazy_refcounts.

Kevin


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 14:16               ` Kevin Wolf
@ 2019-06-28 16:31                 ` Alberto Garcia
  0 siblings, 0 replies; 29+ messages in thread
From: Alberto Garcia @ 2019-06-28 16:31 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

On Fri 28 Jun 2019 04:16:28 PM CEST, Kevin Wolf wrote:
>> >> >> >> I would consider 64k cluster/8k subcluster as too extreme
>> >> >> >> for me.
>> >> >> 
>> >> >> I forgot to add: this 64k/8k ratio is only with my current
>> >> >> prototype.
>> >> >> 
>> >> >> In practice if we go with the 128-bit L2 entries we would have
>> >> >> 64 subclusters per cluster, or 32 if we want to have a separate
>> >> >> QCOW_OFLAG_ZERO for each subcluster (would we need this?).
>> >> >
>> >> > Yes, I think we'd want to have a separate zero flag for each
>> >> > subcluster.  Otherwise, when writing to a zero cluster, you'd
>> >> > have to COW the whole supercluster again.
>> >> 
>> >> Yes if the original cluster had the QCOW_OFLAG_ZERO bit, not if it
>> >> was simply unallocated.
>> >
>> > Right, but we already noticed before writing the spec for v3 that
>> > leaving clusters simply unallocated doesn't quite cut it. Only really
>> > necessary when you have a backing file, of course, but that's one of
>> > the more interesting cases for subclusters anyway.
>> 
>> I wonder if it would be possible to have a hybrid solution:
>> 
>> With 64 bits to indicate the allocation status of each subcluster we
>> still have the original L2 entry with its QCOW_OFLAG_ZERO bit, so we
>> need to specify what happens if that bit is set and at the same time
>> some subclusters are marked as allocated.
>> 
>> One possibility is that allocated subclusters have actual data and
>> the rest are zero subclusters.
>
> Hm, yes, that would be possible.
>
> However, that would require some additional complexity in write_zeroes
> then: If the zero flag isn't set yet, then we need to check that no
> other subcluster is unallocated before we can turn the zero flag on
> for the whole cluster. You couldn't have subclusters referring to the
> backing file and zeroed subclusters at the same time.
>
> Maybe doubling the number of bits would actually be worth the
> additional complexity in write_zeroes.

Yes, that's a bit my doubt: how important is it to have those features
that you mention vs being able to halve the minimum unit of allocation?

Berto


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-27 17:08   ` Denis Lunev
@ 2019-06-28 16:32     ` Alberto Garcia
  2019-07-11 14:08     ` Alberto Garcia
  1 sibling, 0 replies; 29+ messages in thread
From: Alberto Garcia @ 2019-06-28 16:32 UTC (permalink / raw)
  To: Denis Lunev, Kevin Wolf; +Cc: Anton Nefedov, qemu-devel, qemu-block, Max Reitz

On Thu 27 Jun 2019 07:08:29 PM CEST, Denis Lunev wrote:
> But can we get a link to the repo with actual version of patches.

It's not in a state that can be published at the moment, but I'll try to
have something available soon.

Berto


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 15:12               ` Alberto Garcia
@ 2019-07-01  6:22                 ` Kevin Wolf
  0 siblings, 0 replies; 29+ messages in thread
From: Kevin Wolf @ 2019-07-01  6:22 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

Am 28.06.2019 um 17:12 hat Alberto Garcia geschrieben:
> On Fri 28 Jun 2019 05:09:11 PM CEST, Kevin Wolf wrote:
> > Am 28.06.2019 um 17:02 hat Alberto Garcia geschrieben:
> >> On Fri 28 Jun 2019 04:57:08 PM CEST, Kevin Wolf wrote:
> >> > Am 28.06.2019 um 16:43 hat Alberto Garcia geschrieben:
> >> >> On Thu 27 Jun 2019 06:05:55 PM CEST, Denis Lunev wrote:
> >> >> > Please note, I am not talking now about your case with COW. Here the
> >> >> > allocation is performed on the sub-cluster basis, i.e. the abscence of
> >> >> > the sub-cluster in the image means hole on that offset. This is
> >> >> > important difference.
> >> >> 
> >> >> I mentioned the possibility that if you have a case like 2MB / 64KB
> >> >> and you write to an empty cluster then you could allocate the
> >> >> necessary subclusters, and additionally fallocate() the space of the
> >> >> whole cluster (2MB) in order to try to keep it contiguous.
> >> >> 
> >> >> With this we would lose the space saving advantage of having
> >> >> subclusters. But perhaps that would work for smaller cluster sizes
> >> >> (it would mitigate the fragmentation problem).
> >> >
> >> > There seem to be use cases for both ways. So does this need to be an
> >> > option?
> >> 
> >> Probably a runtime option, or a heuristic that decides what to do
> >> depending on the cluster size.
> >
> > How would the heuristic decide whether the user wants to save disk space
> > or whether they consider avoiding fragmentation (i.e. performance) more
> > important?
> 
> Well I suppose the fragmentation problem is more important when you have
> small clusters and less so when you have large clusters, so that would
> be a way to do it.

On the other hand, if the user cares about fragmentation, they will
probably use large clusters, and if they care about disk space, they
will probably use smaller clusters.

> Of course with an option the user would have the final choice.

Ah, okay, if it's only a default, then guessing wrong isn't as much of a
problem.

Kevin


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-27 17:08   ` Denis Lunev
  2019-06-28 16:32     ` Alberto Garcia
@ 2019-07-11 14:08     ` Alberto Garcia
  2019-07-11 14:32       ` Kevin Wolf
  1 sibling, 1 reply; 29+ messages in thread
From: Alberto Garcia @ 2019-07-11 14:08 UTC (permalink / raw)
  To: Denis Lunev, Kevin Wolf; +Cc: Anton Nefedov, qemu-devel, qemu-block, Max Reitz

On Thu 27 Jun 2019 07:08:29 PM CEST, Denis Lunev wrote:

> But can we get a link to the repo with actual version of patches.

Hi,

I updated my code to increase the L2 entry size from 64 bits to 128 bits
and thanks to this we now have 32 subclusters per cluster (32 bits for
"subcluster allocated" and 32 for "subcluster is all zeroes").

I also fixed a few bugs along the way and started to clean up the code
a bit so it is more readable. You can get it here:

   https://github.com/bertogg/qemu/releases/tag/subcluster-allocation-prototype-20190711

The idea is that you can test it, evaluate the performance and see
whether the general approach makes sense, but this is obviously not
release-quality code so don't focus too much on the coding style,
variable names, hacks, etc. Many things need to change, other things
still need to be implemented, and I'm already in the process of doing
it.

Some questions that are still open:

- It is possible to configure the number of subclusters per cluster very
  easily. It is currently hardcoded to 32 in qcow2_do_open(), but any
  power of 2 would work (just change the number there if you want to
  test it). Would an option for this be worth adding?

- We could also allow the user to choose 64 subclusters per cluster and
  disable the "all zeroes" bits in that case. It is quite simple in
  terms of lines of code but it would make the qcow2 spec a bit more
  complicated.

- We would now have "all zeroes" bits at the cluster and subcluster
  levels, so there's an ambiguity here that we need to solve. In
  particular, what happens if we have a QCOW2_CLUSTER_ZERO_ALLOC cluster
  but some bits from the bitmap are set? Do we ignore them completely?

I also ran some I/O tests using a similar scenario to last time (SSD
drive, 40GB backing image). Here are the results; you can see the
difference between the previous prototype (8 subclusters per cluster)
and the new one (32):

|--------------+----------------+---------------+-----------------|
| Cluster size | 32 subclusters | 8 subclusters | subclusters=off |
|--------------+----------------+---------------+-----------------|
|         4 KB |        80 IOPS |      101 IOPS |         92 IOPS |
|         8 KB |       108 IOPS |      299 IOPS |        417 IOPS |
|        16 KB |      3440 IOPS |     7555 IOPS |       3347 IOPS |
|        32 KB |     10718 IOPS |    13038 IOPS |       2435 IOPS |
|        64 KB |     12569 IOPS |    10613 IOPS |       1622 IOPS |
|       128 KB |     11444 IOPS |     4907 IOPS |        866 IOPS |
|       256 KB |      9335 IOPS |     2618 IOPS |        561 IOPS |
|       512 KB |       185 IOPS |     1678 IOPS |        353 IOPS |
|      1024 KB |      2477 IOPS |      863 IOPS |        212 IOPS |
|      2048 KB |      1536 IOPS |      571 IOPS |        123 IOPS |
|--------------+----------------+---------------+-----------------|

I'm surprised about the 256 KB cluster / 32 subclusters case (I would
expect ~3300 IOPS), but I ran it a few times and the results are always
the same. I still haven't investigated why that happens. The rest of the
results seem more or less normal.

I will now continue working towards a complete solution, but any
feedback or comments will be very welcome.

Regards,

Berto


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-07-11 14:08     ` Alberto Garcia
@ 2019-07-11 14:32       ` Kevin Wolf
  2019-07-11 14:56         ` Alberto Garcia
  0 siblings, 1 reply; 29+ messages in thread
From: Kevin Wolf @ 2019-07-11 14:32 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

Am 11.07.2019 um 16:08 hat Alberto Garcia geschrieben:
> Some questions that are still open:
> 
> - It is possible to configure very easily the number of subclusters per
>   cluster. It is now hardcoded to 32 in qcow2_do_open() but any power of
>   2 would work (just change the number there if you want to test
>   it). Would an option for this be worth adding?

I think for testing we can just change the constant. Once the feature is
merged and used in production, I don't think there is any reason to
leave bits unused.

> - We could also allow the user to choose 64 subclusters per cluster and
>   disable the "all zeroes" bits in that case. It is quite simple in
>   terms of lines of code but it would make the qcow2 spec a bit more
>   complicated.
> 
> - We would now have "all zeroes" bits at the cluster and subcluster
>   levels, so there's an ambiguity here that we need to solve. In
>   particular, what happens if we have a QCOW2_CLUSTER_ZERO_ALLOC cluster
>   but some bits from the bitmap are set? Do we ignore them completely?

The (super)cluster zero bit should probably always be clear if
subclusters are used. If it's set, we have a corrupted image.

> I also ran some I/O tests using a similar scenario like last time (SSD
> drive, 40GB backing image). Here are the results, you can see the
> difference between the previous prototype (8 subclusters per cluster)
> and the new one (32):

Is the 8 subclusters test run with the old version (64 bit L2 entries)
or the new version (128 bit L2 entries) with bits left unused?

> |--------------+----------------+---------------+-----------------|
> | Cluster size | 32 subclusters | 8 subclusters | subclusters=off |
> |--------------+----------------+---------------+-----------------|
> |         4 KB |        80 IOPS |      101 IOPS |         92 IOPS |
> |         8 KB |       108 IOPS |      299 IOPS |        417 IOPS |
> |        16 KB |      3440 IOPS |     7555 IOPS |       3347 IOPS |
> |        32 KB |     10718 IOPS |    13038 IOPS |       2435 IOPS |
> |        64 KB |     12569 IOPS |    10613 IOPS |       1622 IOPS |
> |       128 KB |     11444 IOPS |     4907 IOPS |        866 IOPS |
> |       256 KB |      9335 IOPS |     2618 IOPS |        561 IOPS |
> |       512 KB |       185 IOPS |     1678 IOPS |        353 IOPS |
> |      1024 KB |      2477 IOPS |      863 IOPS |        212 IOPS |
> |      2048 KB |      1536 IOPS |      571 IOPS |        123 IOPS |
> |--------------+----------------+---------------+-----------------|
> 
> I'm surprised about the 256 KB cluster / 32 subclusters case (I would
> expect ~3300 IOPS), but I ran it a few times and the results are always
> the same. I still haven't investigated why that happens. The rest of the
> results seem more or less normal.

Shouldn't 256k/8k perform similarly to 64k/8k, or maybe a bit better?
Why did you expect ~3300 IOPS?

I found other results more surprising. In particular:

* Why does 64k/2k perform better than 128k/4k when the block size for
  your requests is 4k?

* Why is the maximum for 8 subclusters higher than for 32 subclusters?
  I guess this does make some sense if the 8 subclusters case actually
  used 64 bit L2 entries. If you did use 128 bit entries for both 32 and
  8 subclusters, I don't see why 8 subclusters should perform better in
  any case.

* What causes the minimum at 512k with 32 subclusters? The other two
  setups have a maximum and performance decreases monotonically to both
  sides. This one has a minimum at 512k and larger cluster sizes improve
  performance again.

  In fact, 512k performs really bad compared even to subclusters=off.

Kevin


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-07-11 14:32       ` Kevin Wolf
@ 2019-07-11 14:56         ` Alberto Garcia
  0 siblings, 0 replies; 29+ messages in thread
From: Alberto Garcia @ 2019-07-11 14:56 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

On Thu 11 Jul 2019 04:32:34 PM CEST, Kevin Wolf wrote:

>> - It is possible to configure very easily the number of subclusters per
>>   cluster. It is now hardcoded to 32 in qcow2_do_open() but any power of
>>   2 would work (just change the number there if you want to test
>>   it). Would an option for this be worth adding?
>
> I think for testing we can just change the constant. Once the feature
> is merged and used in production, I don't think there is any reason to
> leave bits unused.

Me neither unless we want to allow the 64 subclusters scenario that I
mentioned.

>> - We would now have "all zeroes" bits at the cluster and subcluster
>> levels, so there's an ambiguity here that we need to solve. In
>> particular, what happens if we have a QCOW2_CLUSTER_ZERO_ALLOC
>> cluster but some bits from the bitmap are set? Do we ignore them
>> completely?
>
> The (super)cluster zero bit should probably always be clear if
> subclusters are used. If it's set, we have a corrupted image.

I mentioned in an earlier e-mail that one possibility is to leave that
bit as it is now and use the bitmap only for the allocation status (so
we'd have 64 subclusters). If QCOW_OFLAG_ZERO is set and the subcluster
is not allocated then it's all zeroes.

With this we'd double the amount of subclusters but we'd lose the
possibility to have zero and unallocated subclusters at the same time.
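
A sketch of how a read would classify one subcluster under that
64-subcluster variant (a hypothetical helper, not code from the
prototype):

    #include <stdbool.h>
    #include <stdint.h>

    #define QCOW_OFLAG_ZERO (1ULL << 0)   /* as in QEMU's block/qcow2.h */

    /* With 64 allocation bits and no per-subcluster zero bits, an
     * unallocated subcluster reads as zeroes if the cluster-level
     * QCOW_OFLAG_ZERO is set, and comes from the backing file otherwise. */
    static bool subcluster_reads_as_zero(uint64_t l2_entry,
                                         uint64_t alloc_bitmap, int sc)
    {
        if (alloc_bitmap & (1ULL << sc)) {
            return false;                 /* data allocated in this image */
        }
        return (l2_entry & QCOW_OFLAG_ZERO) != 0;
    }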

>> I also ran some I/O tests using a similar scenario like last time
>> (SSD drive, 40GB backing image). Here are the results, you can see
>> the difference between the previous prototype (8 subclusters per
>> cluster) and the new one (32):
>
> Is the 8 subclusters test run with the old version (64 bit L2 entries)
> or the new version (128 bit L2 entries) with bits left unused?

It's the old version of the code (I copied & pasted the numbers from the
previous table).

>> |--------------+----------------+---------------+-----------------|
>> | Cluster size | 32 subclusters | 8 subclusters | subclusters=off |
>> |--------------+----------------+---------------+-----------------|
>> |         4 KB |        80 IOPS |      101 IOPS |         92 IOPS |
>> |         8 KB |       108 IOPS |      299 IOPS |        417 IOPS |
>> |        16 KB |      3440 IOPS |     7555 IOPS |       3347 IOPS |
>> |        32 KB |     10718 IOPS |    13038 IOPS |       2435 IOPS |
>> |        64 KB |     12569 IOPS |    10613 IOPS |       1622 IOPS |
>> |       128 KB |     11444 IOPS |     4907 IOPS |        866 IOPS |
>> |       256 KB |      9335 IOPS |     2618 IOPS |        561 IOPS |
>> |       512 KB |       185 IOPS |     1678 IOPS |        353 IOPS |
>> |      1024 KB |      2477 IOPS |      863 IOPS |        212 IOPS |
>> |      2048 KB |      1536 IOPS |      571 IOPS |        123 IOPS |
>> |--------------+----------------+---------------+-----------------|
>> 
>> I'm surprised about the 256 KB cluster / 32 subclusters case (I would
>> expect ~3300 IOPS), but I ran it a few times and the results are always
>> the same. I still haven't investigated why that happens. The rest of the
>> results seem more or less normal.
>
> Shouldn't 256k/8k perform similarly to 64k/8k, or maybe a bit better?
> Why did you expect ~3300 IOPS?

Sorry I meant the 512k/16k case, which is obviously the outlier there.
 
> I found other results more surprising. In particular:
>
> * Why does 64k/2k perform better than 128k/4k when the block size for
>   your requests is 4k?

They should perform similarly, because the only difference in practice
is that in the former case you set two bits in the bitmap and in the
latter only one. The difference is not too big; I could run the tests
again and, if the results are consistent, I can investigate what's
going on.

But yes, I would expect 128k/4k to be the fastest of them all.

> * Why is the maximum for 8 subclusters higher than for 32 subclusters?
>   I guess this does make some sense if the 8 subclusters case actually
>   used 64 bit L2 entries. If you did use 128 bit entries for both 32 and
>   8 subclusters, I don't see why 8 subclusters should perform better in
>   any case.

I used 64-bit entries for the 8 subcluster case. I can try with the new
code and see what happens.

> * What causes the minimum at 512k with 32 subclusters?

That's the case that I meant earlier, and I still don't have a good
hypothesis of why that happens. I'll need to debug it.

Berto


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2019-07-11 14:57 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-27 13:59 [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images Alberto Garcia
2019-06-27 14:19 ` Denis Lunev
2019-06-27 15:38   ` Alberto Garcia
2019-06-27 15:42     ` Alberto Garcia
2019-06-28  9:20       ` Kevin Wolf
2019-06-28  9:53         ` Alberto Garcia
2019-06-28 10:04           ` Kevin Wolf
2019-06-28 13:19             ` Alberto Garcia
2019-06-28 14:16               ` Kevin Wolf
2019-06-28 16:31                 ` Alberto Garcia
2019-06-27 16:05     ` Denis Lunev
2019-06-28 14:43       ` Alberto Garcia
2019-06-28 14:47         ` Denis Lunev
2019-06-28 14:57         ` Kevin Wolf
2019-06-28 15:02           ` Alberto Garcia
2019-06-28 15:03             ` Denis Lunev
2019-06-28 15:10               ` Alberto Garcia
2019-06-28 15:15                 ` Kevin Wolf
2019-06-28 15:09             ` Kevin Wolf
2019-06-28 15:12               ` Alberto Garcia
2019-07-01  6:22                 ` Kevin Wolf
2019-06-27 16:54 ` Kevin Wolf
2019-06-27 17:08   ` Denis Lunev
2019-06-28 16:32     ` Alberto Garcia
2019-07-11 14:08     ` Alberto Garcia
2019-07-11 14:32       ` Kevin Wolf
2019-07-11 14:56         ` Alberto Garcia
2019-06-28 12:57   ` Alberto Garcia
2019-06-28 13:03     ` Alberto Garcia
