* [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
@ 2019-06-27 13:59 Alberto Garcia
  2019-06-27 14:19 ` Denis Lunev
  2019-06-27 16:54 ` Kevin Wolf
  0 siblings, 2 replies; 29+ messages in thread
From: Alberto Garcia @ 2019-06-27 13:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anton Nefedov, Denis V. Lunev, qemu-block, Max Reitz

Hi all,

a couple of years ago I came to the mailing list with a proposal to
extend the qcow2 format to add subcluster allocation.

You can read the original message (and the discussion thread that came
afterwards) here:

   https://lists.gnu.org/archive/html/qemu-block/2017-04/msg00178.html

The description of the problem from the original proposal is still
valid so I won't repeat it here.

Over the past few weeks I have picked up the code that I wrote in
2017, made it work with the latest QEMU and fixed many of its bugs. I
once again have a working prototype which is by no means complete, but
it gives us up-to-date information about what we can expect from this
feature.

My goal with this message is to resume the discussion and re-evaluate
whether this is a feature that we'd like to have in QEMU, in light of
the test results and all the changes that have happened in the past
couple of years.

=== Test results ===

I ran these tests with the same hardware configuration as in 2017: an
SSD drive and random 4KB write requests to an empty 40GB qcow2 image.

Here are the results when the qcow2 file is backed by a fully
populated image. There are 8 subclusters per cluster and the
subcluster size is in brackets:

|-----------------+----------------+-----------------|
|  Cluster size   | subclusters=on | subclusters=off |
|-----------------+----------------+-----------------|
|   2 MB (256 KB) |   571 IOPS     |  124 IOPS       |
|   1 MB (128 KB) |   863 IOPS     |  212 IOPS       |
| 512 KB  (64 KB) |  1678 IOPS     |  365 IOPS       |
| 256 KB  (32 KB) |  2618 IOPS     |  568 IOPS       |
| 128 KB  (16 KB) |  4907 IOPS     |  873 IOPS       |
|  64 KB   (8 KB) | 10613 IOPS     | 1680 IOPS       |
|  32 KB   (4 KB) | 13038 IOPS     | 2476 IOPS       |
|   4 KB (512 B)  |   101 IOPS     |  101 IOPS       |
|-----------------+----------------+-----------------|

Some comments about the results, after comparing them with those from
2017:

- As expected, 32 KB clusters / 4 KB subclusters give the best results
  because that matches the size of the write request and therefore
  there's no copy-on-write involved (a rough sketch of the
  copy-on-write cost follows after this list).

- Allocation is generally faster now in all cases (between 20-90%,
  depending on the case). We have made several optimizations to the
  code since last time, and I suppose that the COW changes made in
  commits b3cf1c7cf8 and ee22a9d869 are probably the main factor
  behind these improvements.

- Apart from the 64KB/8KB case (which is much faster), the patterns
  are generally the same: subcluster allocation offers similar
  performance benefits compared to last time, so there are no
  surprises in this area.
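
To put the copy-on-write cost behind the first point into rough
numbers, here is a back-of-the-envelope sketch (plain C, not QEMU
code). It assumes the 4 KB write is aligned and fully contained in one
allocation unit, and it ignores metadata updates:

#include <stdio.h>

#define KB 1024

/* Bytes that have to be copied from the backing file when a small,
 * aligned write lands in an unallocated allocation unit. */
static long cow_bytes(long alloc_unit, long request)
{
    return alloc_unit > request ? alloc_unit - request : 0;
}

int main(void)
{
    long request = 4 * KB;
    long clusters_kb[] = { 2048, 1024, 512, 256, 128, 64, 32 };

    for (unsigned i = 0; i < sizeof(clusters_kb) / sizeof(clusters_kb[0]); i++) {
        long cluster = clusters_kb[i] * KB;
        long subcluster = cluster / 8;   /* 8 subclusters per cluster */
        printf("%5ld KB cluster: COW %7ld bytes without subclusters, "
               "%6ld bytes with\n", clusters_kb[i],
               cow_bytes(cluster, request), cow_bytes(subcluster, request));
    }
    return 0;
}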

Then I ran the tests again using the same environment but without a
backing image. The goal is to measure the impact of subcluster
allocation on completely empty images.

Here we have an important change: since commit c8bb23cbdb empty
clusters are preallocated and filled with zeroes using an efficient
operation (typically fallocate() with FALLOC_FL_ZERO_RANGE) instead of
writing the zeroes with the usual pwrite() call.
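
For reference, the difference between the two approaches looks roughly
like this when sketched in plain C against a raw file descriptor (the
helper names are made up for illustration; QEMU's real code has more
fallbacks and error handling):

#define _GNU_SOURCE
#include <fcntl.h>   /* older glibc may need <linux/falloc.h> for the flag */
#include <unistd.h>

/* Fast path: ask the filesystem to allocate and zero the range. */
static int zero_range_fallocate(int fd, off_t offset, off_t length)
{
    return fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, length);
}

/* Slow path: actually write a buffer full of zeroes. */
static ssize_t zero_range_pwrite(int fd, off_t offset, size_t length)
{
    static char buf[64 * 1024];          /* zero-initialized */
    size_t done = 0;

    while (done < length) {
        size_t chunk = length - done < sizeof(buf) ? length - done
                                                   : sizeof(buf);
        ssize_t ret = pwrite(fd, buf, chunk, offset + done);
        if (ret < 0) {
            return ret;
        }
        done += ret;
    }
    return done;
}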

The effects of this are dramatic, so I decided to run two sets of
tests: one with this optimization and one without it.

Here are the results:

|-----------------+----------------+-----------------+----------------+-----------------|
|                 | Initialization with fallocate()  |  Initialization with pwritev()   |
|-----------------+----------------+-----------------+----------------+-----------------|
|  Cluster size   | subclusters=on | subclusters=off | subclusters=on | subclusters=off |
|-----------------+----------------+-----------------+----------------+-----------------|
|   2 MB (256 KB) | 14468 IOPS     | 14776 IOPS      |  1181 IOPS     |  260 IOPS       |
|   1 MB (128 KB) | 13752 IOPS     | 14956 IOPS      |  1916 IOPS     |  358 IOPS       |
| 512 KB  (64 KB) | 12961 IOPS     | 14776 IOPS      |  4038 IOPS     |  684 IOPS       |
| 256 KB  (32 KB) | 12790 IOPS     | 14534 IOPS      |  6172 IOPS     | 1213 IOPS       |
| 128 KB  (16 KB) | 12550 IOPS     | 13967 IOPS      |  8700 IOPS     | 1976 IOPS       |
|  64 KB   (8 KB) | 12491 IOPS     | 13432 IOPS      | 11735 IOPS     | 4267 IOPS       |
|  32 KB   (4 KB) | 13203 IOPS     | 11752 IOPS      | 12366 IOPS     | 6306 IOPS       |
|   4 KB (512 B)  |   103 IOPS     |   101 IOPS      |   101 IOPS     |  101 IOPS       |
|-----------------+----------------+-----------------+----------------+-----------------|

Comments:

- With the old-style allocation method using pwritev() we get similar
  benefits as we did last time. The comments from the test with a
  backing image apply to this one as well.

- However the new allocation method is so efficient that having
  subclusters does not offer any performance benefit. It even slows
  things down a bit in most cases, so we'd probably need to fine-tune
  the algorithm in order to get comparable results.

- In light of these numbers I also think that even when there's a
  backing image we could preallocate the full cluster but only do COW
  on the affected subclusters. This would leave the rest of the cluster
  preallocated on disk but unallocated in the bitmap. This would
  probably reduce on-disk fragmentation, which was one of the concerns
  raised during the original discussion.

I also ran some tests on a rotating HDD drive. Here having subclusters
doesn't make a big difference regardless of whether there is a backing
image or not, so we can ignore this scenario.

=== Changes to the on-disk format ===

In my original proposal I described 3 different alternatives for
storing the subcluster bitmaps. I'm naming them here, but refer to
that message for more details.

(1) Storing the bitmap inside the 64-bit entry
(2) Making L2 entries 128-bit wide.
(3) Storing the bitmap somewhere else

I used (1) for this implementation for simplicity, but I think (2) is
probably the best one.
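
To illustrate what (1) means in practice, here is one conceivable
packing sketched in C. The bit positions are illustrative assumptions
only, not necessarily what my prototype uses and certainly not a final
on-disk layout:

#include <stdbool.h>
#include <stdint.h>

/* A standard L2 entry keeps the host cluster offset in bits 9-55.  The
 * idea of option (1) is to reuse some of the reserved bits as an
 * allocation bitmap for the 8 subclusters. */
#define L2E_HOST_OFFSET_MASK   0x00fffffffffffe00ULL
#define L2E_SUBCLUSTER_SHIFT   1                    /* hypothetical */
#define L2E_SUBCLUSTER_MASK    (0xffULL << L2E_SUBCLUSTER_SHIFT)

static bool subcluster_is_allocated(uint64_t l2_entry, unsigned int sc)
{
    return l2_entry & (1ULL << (L2E_SUBCLUSTER_SHIFT + sc));
}

static uint64_t mark_subcluster_allocated(uint64_t l2_entry, unsigned int sc)
{
    return l2_entry | (1ULL << (L2E_SUBCLUSTER_SHIFT + sc));
}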

===========================

And I think that's all. As you can see I didn't want to go much into
the open technical questions (I think the on-disk format would be the
main one); the first goal should be to decide whether this is still an
interesting feature or not.

So, any questions or comments will be much appreciated.

Berto



* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-27 13:59 [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images Alberto Garcia
@ 2019-06-27 14:19 ` Denis Lunev
  2019-06-27 15:38   ` Alberto Garcia
  2019-06-27 16:54 ` Kevin Wolf
  1 sibling, 1 reply; 29+ messages in thread
From: Denis Lunev @ 2019-06-27 14:19 UTC (permalink / raw)
  To: Alberto Garcia, qemu-devel
  Cc: Kevin Wolf, Anton Nefedov, qemu-block, Max Reitz

On 6/27/19 4:59 PM, Alberto Garcia wrote:
> Hi all,
>
> a couple of years ago I came to the mailing list with a proposal to
> extend the qcow2 format to add subcluster allocation.
>
> You can read the original message (and the discussion thread that came
> afterwards) here:
>
>    https://lists.gnu.org/archive/html/qemu-block/2017-04/msg00178.html
>
> The description of the problem from the original proposal is still
> valid so I won't repeat it here.
>
> What I have been doing during the past few weeks was to retake the
> code that I wrote in 2017, make it work with the latest QEMU and fix
> many of its bugs. I have again a working prototype which is by no
> means complete but it allows us to have up-to-date information about
> what we can expect from this feature.
>
> My goal with this message is to retake the discussion and re-evaluate
> whether this is a feature that we'd like for QEMU in light of the test
> results and all the changes that we have had in the past couple of
> years.
>
> === Test results ===
>
> I ran these tests with the same hardware configuration as in 2017: an
> SSD drive and random 4KB write requests to an empty 40GB qcow2 image.
>
> Here are the results when the qcow2 file is backed by a fully
> populated image. There are 8 subclusters per cluster and the
> subcluster size is in brackets:
>
> |-----------------+----------------+-----------------|
> |  Cluster size   | subclusters=on | subclusters=off |
> |-----------------+----------------+-----------------|
> |   2 MB (256 KB) |   571 IOPS     |  124 IOPS       |
> |   1 MB (128 KB) |   863 IOPS     |  212 IOPS       |
> | 512 KB  (64 KB) |  1678 IOPS     |  365 IOPS       |
> | 256 KB  (32 KB) |  2618 IOPS     |  568 IOPS       |
> | 128 KB  (16 KB) |  4907 IOPS     |  873 IOPS       |
> |  64 KB   (8 KB) | 10613 IOPS     | 1680 IOPS       |
> |  32 KB   (4 KB) | 13038 IOPS     | 2476 IOPS       |
> |   4 KB (512 B)  |   101 IOPS     |  101 IOPS       |
> |-----------------+----------------+-----------------|
>
> Some comments about the results, after comparing them with those from
> 2017:
>
> - As expected, 32KB clusters / 4 KB subclusters give the best results
>   because that matches the size of the write request and therefore
>   there's no copy-on-write involved.
>
> - Allocation is generally faster now in all cases (between 20-90%,
>   depending on the case). We have made several optimizations to the
>   code since last time, and I suppose that the COW changes made in
>   commits b3cf1c7cf8 and ee22a9d869 are probably the main factor
>   behind these improvements.
>
> - Apart from the 64KB/8KB case (which is much faster), the patterns are
>   generally the same: subcluster allocation offers similar performance
>   benefits compared to last time, so there are no surprises in this
>   area.
>
> Then I ran the tests again using the same environment but without a
> backing image. The goal is to measure the impact of subcluster
> allocation on completely empty images.
>
> Here we have an important change: since commit c8bb23cbdb empty
> clusters are preallocated and filled with zeroes using an efficient
> operation (typically fallocate() with FALLOC_FL_ZERO_RANGE) instead of
> writing the zeroes with the usual pwrite() call.
>
> The effects of this are dramatic, so I decided to run two sets of
> tests: one with this optimization and one without it.
>
> Here are the results:
>
> |-----------------+----------------+-----------------+----------------+-----------------|
> |                 | Initialization with fallocate()  |  Initialization with pwritev()   |
> |-----------------+----------------+-----------------+----------------+-----------------|
> |  Cluster size   | subclusters=on | subclusters=off | subclusters=on | subclusters=off |
> |-----------------+----------------+-----------------+----------------+-----------------|
> |   2 MB (256 KB) | 14468 IOPS     | 14776 IOPS      |  1181 IOPS     |  260 IOPS       |
> |   1 MB (128 KB) | 13752 IOPS     | 14956 IOPS      |  1916 IOPS     |  358 IOPS       |
> | 512 KB  (64 KB) | 12961 IOPS     | 14776 IOPS      |  4038 IOPS     |  684 IOPS       |
> | 256 KB  (32 KB) | 12790 IOPS     | 14534 IOPS      |  6172 IOPS     | 1213 IOPS       |
> | 128 KB  (16 KB) | 12550 IOPS     | 13967 IOPS      |  8700 IOPS     | 1976 IOPS       |
> |  64 KB   (8 KB) | 12491 IOPS     | 13432 IOPS      | 11735 IOPS     | 4267 IOPS       |
> |  32 KB   (4 KB) | 13203 IOPS     | 11752 IOPS      | 12366 IOPS     | 6306 IOPS       |
> |   4 KB (512 B)  |   103 IOPS     |   101 IOPS      |   101 IOPS     |  101 IOPS       |
> |-----------------+----------------+-----------------+----------------+-----------------|
>
> Comments:
>
> - With the old-style allocation method using pwritev() we get similar
>   benefits as we did last time. The comments from the test with a
>   backing image apply to this one as well.
>
> - However the new allocation method is so efficient that having
>   subclusters does not offer any performance benefit. It even slows
>   down things a bit in most cases, so we'd probably need to fine tune
>   the algorithm in order to get similar results.
>
> - In light of these numbers I also think that even when there's a
>   backing image we could preallocate the full cluster but only do COW
>   on the affected subclusters. This would leave the rest of the cluster
>   preallocated on disk but unallocated in the bitmap. This would
>   probably reduce on-disk fragmentation, which was one of the concerns
>   raised during the original discussion.
>
> I also ran some tests on a rotating HDD drive. Here having subclusters
> doesn't make a big difference regardless of whether there is a backing
> image or not, so we can ignore this scenario.
>
> === Changes to the on-disk format ===
>
> In my original proposal I described 3 different alternatives for
> storing the subcluster bitmaps. I'm naming them here, but refer to
> that message for more details.
>
> (1) Storing the bitmap inside the 64-bit entry
> (2) Making L2 entries 128-bit wide.
> (3) Storing the bitmap somewhere else
>
> I used (1) for this implementation for simplicity, but I think (2) is
> probably the best one.
>
> ===========================
>
> And I think that's all. As you can see I didn't want to go much into
> the open technical questions (I think the on-disk format would be the
> main one), the first goal should be to decide whether this is still an
> interesting feature or not.
>
> So, any questions or comments will be much appreciated.
>
> Berto
I would like to add my $0.02 here from a slightly different point of
view.

Right now QCOW2 is not very efficient with the default cluster size
(64k) when you want fast performance with big disks. Nowadays people
use really BIG images, and 1-2-3-8 TB disks are really common.
Unfortunately people want fast random IO too. Thus the metadata cache
should be entirely in memory, as in any other case IOPS are halved
(one operation for the metadata read and one operation for the real
read). For an 8 TB image this means 1 GB of RAM for the cache. With
1 MB clusters we get 64 MB, which is much more reasonable.
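
These figures follow from simple arithmetic, since each 8-byte L2
entry maps one cluster (a rough sketch; it ignores refcount tables and
other metadata):

#include <stdio.h>

int main(void)
{
    unsigned long long virtual_size = 8ULL << 40;            /* 8 TB */
    unsigned long long cluster_sizes[] = { 64ULL << 10, 1ULL << 20 };

    for (int i = 0; i < 2; i++) {
        unsigned long long entries = virtual_size / cluster_sizes[i];
        printf("cluster %4llu KB: %4llu MB of L2 tables\n",
               cluster_sizes[i] >> 10, (entries * 8) >> 20);
    }
    return 0;
}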

Though with 1 MB clusters the reclaim process becomes much, much
worse. I cannot give exact numbers, unfortunately; AFAIR the image
occupies 30-50% more space. Guys, I would appreciate it if you could
correct me here with real numbers.

Thus with respect to these patterns, subclusters could give us the
benefits of fast random IO and a good reclaim rate. I would consider
64k clusters/8k subclusters as too extreme for me. In reality we would
end up with a completely fragmented image very soon. Sequential reads
would become random VERY soon without preallocation. Though, anyway,
this makes some sense for COW. But, again, in such a case subclusters
should not be left as holes, as required by the scenario I mentioned
first.

Den


* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-27 14:19 ` Denis Lunev
@ 2019-06-27 15:38   ` Alberto Garcia
  2019-06-27 15:42     ` Alberto Garcia
  2019-06-27 16:05     ` Denis Lunev
  0 siblings, 2 replies; 29+ messages in thread
From: Alberto Garcia @ 2019-06-27 15:38 UTC (permalink / raw)
  To: Denis Lunev, qemu-devel; +Cc: Kevin Wolf, Anton Nefedov, qemu-block, Max Reitz

On Thu 27 Jun 2019 04:19:25 PM CEST, Denis Lunev wrote:

> Right now QCOW2 is not very efficient with default cluster size (64k)
> for fast performance with big disks. Nowadays ppl uses really BIG
> images and 1-2-3-8 Tb disks are really common. Unfortunately ppl want
> to get random IO fast too.  Thus metadata cache should be in memory as
> in the any other case we will get IOPSes halved (1 operation for
> metadata cache read and one operation for real read). For 8 Tb image
> this results in 1 Gb RAM for that. For 1 Mb cluster we get 64 Mb which
> is much more reasonable.

Correct, the L2 metadata size is a well-known problem that has been
discussed extensively, and that has received plenty of attention.

> Though with 1 Mb cluster the reclaim process becomes much-much
> worse. I can not give exact number, unfortunately.  AFAIR the image
> occupies 30-50% more space. Guys, I would appreciate if you will
> correct me here with real numbers.

Correct, because the cluster size is the smallest unit of allocation, so
a 16KB write on an empty area of the image will always allocate a
complete 1MB cluster.

> Thus in respect to this patterns subclusters could give us benefits of
> fast random IO and good reclaim rate.

Exactly, but that fast random I/O would only happen when allocating new
clusters. Once the clusters are allocated it doesn't provide any
additional performance benefit.

> I would consider 64k cluster/8k subcluster as too extreme for me.  In
> reality we would end up with completely fragmented image very soon.

You mean because of the 64k cluster size, or because of the 8k
subcluster size? If it's the former, yes. If it's the latter, it can be
solved by preallocating the cluster with fallocate(). But then you would
lose the benefit of the good reclaim rate.

Berto



* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-27 15:38   ` Alberto Garcia
@ 2019-06-27 15:42     ` Alberto Garcia
  2019-06-28  9:20       ` Kevin Wolf
  2019-06-27 16:05     ` Denis Lunev
  1 sibling, 1 reply; 29+ messages in thread
From: Alberto Garcia @ 2019-06-27 15:42 UTC (permalink / raw)
  To: Denis Lunev, qemu-devel; +Cc: Kevin Wolf, Anton Nefedov, qemu-block, Max Reitz

On Thu 27 Jun 2019 05:38:56 PM CEST, Alberto Garcia wrote:
>> I would consider 64k cluster/8k subcluster as too extreme for me.

I forgot to add: this 64k/8k ratio is only with my current prototype.

In practice if we go with the 128-bit L2 entries we would have 64
subclusters per cluster, or 32 if we want to have a separate
QCOW_OFLAG_ZERO for each subcluster (would we need this?).
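
For illustration, such an extended entry could look roughly like this
(the field names and this in-memory layout are only a sketch, not a
committed on-disk format):

#include <stdint.h>

struct l2_entry_extended {
    uint64_t cluster_descriptor;    /* existing 64-bit entry: offset,
                                       COPIED flag, etc. */
    uint32_t subcluster_allocated;  /* bit n set: subcluster n has data */
    uint32_t subcluster_zeroed;     /* bit n set: subcluster n reads as
                                       zeroes */
};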

Berto



* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-27 15:38   ` Alberto Garcia
  2019-06-27 15:42     ` Alberto Garcia
@ 2019-06-27 16:05     ` Denis Lunev
  2019-06-28 14:43       ` Alberto Garcia
  1 sibling, 1 reply; 29+ messages in thread
From: Denis Lunev @ 2019-06-27 16:05 UTC (permalink / raw)
  To: Alberto Garcia, qemu-devel
  Cc: Kevin Wolf, Anton Nefedov, qemu-block, Max Reitz

On 6/27/19 6:38 PM, Alberto Garcia wrote:
> On Thu 27 Jun 2019 04:19:25 PM CEST, Denis Lunev wrote:
>
>> Right now QCOW2 is not very efficient with default cluster size (64k)
>> for fast performance with big disks. Nowadays ppl uses really BIG
>> images and 1-2-3-8 Tb disks are really common. Unfortunately ppl want
>> to get random IO fast too.  Thus metadata cache should be in memory as
>> in the any other case we will get IOPSes halved (1 operation for
>> metadata cache read and one operation for real read). For 8 Tb image
>> this results in 1 Gb RAM for that. For 1 Mb cluster we get 64 Mb which
>> is much more reasonable.
> Correct, the L2 metadata size is a well-known problem that has been
> discussed extensively, and that has received plenty of attention.
>
>> Though with 1 Mb cluster the reclaim process becomes much-much
>> worse. I can not give exact number, unfortunately.  AFAIR the image
>> occupies 30-50% more space. Guys, I would appreciate if you will
>> correct me here with real numbers.
> Correct, because the cluster size is the smallest unit of allocation, so
> a 16KB write on an empty area of the image will always allocate a
> complete 1MB cluster.

>> Thus in respect to this patterns subclusters could give us benefits of
>> fast random IO and good reclaim rate.
> Exactly, but that fast random I/O would only happen when allocating new
> clusters. Once the clusters are allocated it doesn't provide any
> additional performance benefit.

No, I am talking about the situation after the allocation. That is
the main reason why I have a feeling that sub-clusters could provide a
benefit.

OK. The situation (1) is the following:
- the disk is completely allocated
- the QCOW2 image size is 8 TB
- we have an image with 1 MB clusters/64k sub-clusters (for simplicity)
- the L2 metadata cache size is 128 MB (64 MB L2 tables, 64 MB other data)
- holes are punched on a sub-cluster basis, i.e. with 64 KB granularity

In this case a random IO test will give near-native IOPS. Metadata is
in memory, no additional reads are required. Wasted host filesystem
space (due to cluster size) is kept to a minimum, i.e. at the level of
the "pre-subcluster" QCOW2.

Situation (2):
- the 8 TB QCOW2 image is completely allocated
- 1 MB cluster size, 128 MB L2 cache size

Nearly the same performance as (1), but much smaller disk space
savings from holes.

Situation (3):
- 8 TB QCOW2 image, completely allocated
- 64 KB cluster size, 128 MB L2 cache

Random IO performance is halved compared to (1) and (2) due to the
metadata re-read needed for each operation. Same disk space savings
as in case (1).

Please note, I am not talking now about your case with COW. Here the
allocation is performed on a sub-cluster basis, i.e. the absence of a
sub-cluster in the image means a hole at that offset. This is an
important difference.

>> I would consider 64k cluster/8k subcluster as too extreme for me.  In
>> reality we would end up with completely fragmented image very soon.
> You mean because of the 64k cluster size, or because of the 8k
> subcluster size? If it's the former, yes. If it's the latter, it can be
> solved by preallocating the cluster with fallocate(). But then you would
> lose the benefit of the good reclaim rate.

You are optimizing COW speed and your proposal is about that, so you
keep the cluster as the minimal allocation unit. I am talking about a
slightly different pattern of subcluster benefits, where the offset
allocation unit is the cluster while the space allocation unit is the
sub-cluster.

This is an important difference, and that is why I am saying that for
my case an 8 KB space allocation unit is too extreme. These cases
should somehow be kept separate.

Den


* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-27 13:59 [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images Alberto Garcia
  2019-06-27 14:19 ` Denis Lunev
@ 2019-06-27 16:54 ` Kevin Wolf
  2019-06-27 17:08   ` Denis Lunev
  2019-06-28 12:57   ` Alberto Garcia
  1 sibling, 2 replies; 29+ messages in thread
From: Kevin Wolf @ 2019-06-27 16:54 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Anton Nefedov, Denis V. Lunev, qemu-block, qemu-devel, Max Reitz

Am 27.06.2019 um 15:59 hat Alberto Garcia geschrieben:
> Hi all,
> 
> a couple of years ago I came to the mailing list with a proposal to
> extend the qcow2 format to add subcluster allocation.
> 
> You can read the original message (and the discussion thread that came
> afterwards) here:
> 
>    https://lists.gnu.org/archive/html/qemu-block/2017-04/msg00178.html
> 
> The description of the problem from the original proposal is still
> valid so I won't repeat it here.
> 
> What I have been doing during the past few weeks was to retake the
> code that I wrote in 2017, make it work with the latest QEMU and fix
> many of its bugs. I have again a working prototype which is by no
> means complete but it allows us to have up-to-date information about
> what we can expect from this feature.
> 
> My goal with this message is to retake the discussion and re-evaluate
> whether this is a feature that we'd like for QEMU in light of the test
> results and all the changes that we have had in the past couple of
> years.
> 
> === Test results ===
> 
> I ran these tests with the same hardware configuration as in 2017: an
> SSD drive and random 4KB write requests to an empty 40GB qcow2 image.
> 
> Here are the results when the qcow2 file is backed by a fully
> populated image. There are 8 subclusters per cluster and the
> subcluster size is in brackets:
> 
> |-----------------+----------------+-----------------|
> |  Cluster size   | subclusters=on | subclusters=off |
> |-----------------+----------------+-----------------|
> |   2 MB (256 KB) |   571 IOPS     |  124 IOPS       |
> |   1 MB (128 KB) |   863 IOPS     |  212 IOPS       |
> | 512 KB  (64 KB) |  1678 IOPS     |  365 IOPS       |
> | 256 KB  (32 KB) |  2618 IOPS     |  568 IOPS       |
> | 128 KB  (16 KB) |  4907 IOPS     |  873 IOPS       |
> |  64 KB   (8 KB) | 10613 IOPS     | 1680 IOPS       |
> |  32 KB   (4 KB) | 13038 IOPS     | 2476 IOPS       |
> |   4 KB (512 B)  |   101 IOPS     |  101 IOPS       |
> |-----------------+----------------+-----------------|

So at first sight, if you compare the numbers in the same row,
subclusters=on is a clear winner.

But almost more interesting is the observation that, at least for
large cluster sizes, subcluster size X performs almost identically to
cluster size X without subclusters:

                as subcluster size  as cluster size, subclusters=off
    256 KB      571 IOPS            568 IOPS
    128 KB      863 IOPS            873 IOPS
    64 KB       1678 IOPS           1680 IOPS
    32 KB       2618 IOPS           2476 IOPS
    ...
    4 KB        13038 IOPS          101 IOPS

Something interesting happens in the part that you didn't benchmark
between 4 KB and 32 KB (actually, maybe it has already started for the
32 KB case): Performance collapses for small cluster sizes, but it
reaches record highs for small subclusters. I suspect that this is
because L2 tables are becoming very small with 4 KB clusters, but they
are still 32 KB if 4 KB is only the subcluster size. (By the way, did
the L2 cache cover the whole disk in your benchmarks?)
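
A quick back-of-the-envelope calculation of how much guest data a
single L2 table covers (with the current 8-byte entries; each L2 table
occupies exactly one cluster) makes the difference obvious:

#include <stdio.h>

int main(void)
{
    unsigned long cluster_sizes[] = { 4096, 32768 };   /* 4 KB, 32 KB */

    for (int i = 0; i < 2; i++) {
        unsigned long cs = cluster_sizes[i];
        unsigned long entries = cs / 8;        /* entries per L2 table */
        unsigned long covered = entries * cs;  /* bytes mapped per table */
        printf("%2lu KB clusters: %4lu entries per L2 table, %3lu MB covered\n",
               cs >> 10, entries, covered >> 20);
    }
    return 0;
}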

I think this gives us two completely different motivations why
subclusters could be useful, depending on the cluster size you're using:

1. If you use small cluster sizes like 32 KB/4 KB, then obviously you
   can get IOPS rates during cluster allocation that you couldn't come
   even close to before. I think this is a quite strong argument in
   favour of the feature.

2. With larger cluster sizes, you don't get a significant difference in
   the performance during cluster allocation compared to just using the
   subcluster size as the cluster size without having subclusters. Here,
   the motivation could be something along the lines of avoiding
   fragmentation. This would probably need more benchmarks to check how
   fragmentation affects the performance after the initial write.

   This one could possibly be a valid justification, too, but I think it
   would need more work on demonstrating that the effects are real and
   justify the implementation and long-term maintenance effort required
   for subclusters.

> Some comments about the results, after comparing them with those from
> 2017:
> 
> - As expected, 32KB clusters / 4 KB subclusters give the best results
>   because that matches the size of the write request and therefore
>   there's no copy-on-write involved.
> 
> - Allocation is generally faster now in all cases (between 20-90%,
>   depending on the case). We have made several optimizations to the
>   code since last time, and I suppose that the COW changes made in
>   commits b3cf1c7cf8 and ee22a9d869 are probably the main factor
>   behind these improvements.
> 
> - Apart from the 64KB/8KB case (which is much faster), the patterns are
>   generally the same: subcluster allocation offers similar performance
>   benefits compared to last time, so there are no surprises in this
>   area.
> 
> Then I ran the tests again using the same environment but without a
> backing image. The goal is to measure the impact of subcluster
> allocation on completely empty images.
> 
> Here we have an important change: since commit c8bb23cbdb empty
> clusters are preallocated and filled with zeroes using an efficient
> operation (typically fallocate() with FALLOC_FL_ZERO_RANGE) instead of
> writing the zeroes with the usual pwrite() call.
> 
> The effects of this are dramatic, so I decided to run two sets of
> tests: one with this optimization and one without it.
> 
> Here are the results:
> 
> |-----------------+----------------+-----------------+----------------+-----------------|
> |                 | Initialization with fallocate()  |  Initialization with pwritev()   |
> |-----------------+----------------+-----------------+----------------+-----------------|
> |  Cluster size   | subclusters=on | subclusters=off | subclusters=on | subclusters=off |
> |-----------------+----------------+-----------------+----------------+-----------------|
> |   2 MB (256 KB) | 14468 IOPS     | 14776 IOPS      |  1181 IOPS     |  260 IOPS       |
> |   1 MB (128 KB) | 13752 IOPS     | 14956 IOPS      |  1916 IOPS     |  358 IOPS       |
> | 512 KB  (64 KB) | 12961 IOPS     | 14776 IOPS      |  4038 IOPS     |  684 IOPS       |
> | 256 KB  (32 KB) | 12790 IOPS     | 14534 IOPS      |  6172 IOPS     | 1213 IOPS       |
> | 128 KB  (16 KB) | 12550 IOPS     | 13967 IOPS      |  8700 IOPS     | 1976 IOPS       |
> |  64 KB   (8 KB) | 12491 IOPS     | 13432 IOPS      | 11735 IOPS     | 4267 IOPS       |
> |  32 KB   (4 KB) | 13203 IOPS     | 11752 IOPS      | 12366 IOPS     | 6306 IOPS       |
> |   4 KB (512 B)  |   103 IOPS     |   101 IOPS      |   101 IOPS     |  101 IOPS       |
> |-----------------+----------------+-----------------+----------------+-----------------|
> 
> Comments:
> 
> - With the old-style allocation method using pwritev() we get similar
>   benefits as we did last time. The comments from the test with a
>   backing image apply to this one as well.
> 
> - However the new allocation method is so efficient that having
>   subclusters does not offer any performance benefit. It even slows
>   down things a bit in most cases, so we'd probably need to fine tune
>   the algorithm in order to get similar results.
> 
> - In light of these numbers I also think that even when there's a
>   backing image we could preallocate the full cluster but only do COW
>   on the affected subclusters. This would leave the rest of the cluster
>   preallocated on disk but unallocated in the bitmap. This would
>   probably reduce on-disk fragmentation, which was one of the concerns
>   raised during the original discussion.

Yes, especially when we have to do some COW anyway, this would come at
nearly zero cost because we already call fallocate() in that case.

I'm not sure whether it's worth doing when we don't have to do COW. We
will at least avoid qcow2 fragmentation because of the large cluster
size. And file systems are a lot cleverer than qcow2 at avoiding
fragmentation on the file system level. So it might not actually make
a big difference in practice.

This is pure theory, though. We'd have to benchmark things to give a
definite answer.

> I also ran some tests on a rotating HDD drive. Here having subclusters
> doesn't make a big difference regardless of whether there is a backing
> image or not, so we can ignore this scenario.

Interesting, this is kind of unexpected. Why would avoiding COW not
make a difference on rotating HDDs? (All of this is cache=none, right?)

> === Changes to the on-disk format ===
> 
> In my original proposal I described 3 different alternatives for
> storing the subcluster bitmaps. I'm naming them here, but refer to
> that message for more details.
> 
> (1) Storing the bitmap inside the 64-bit entry
> (2) Making L2 entries 128-bit wide.
> (3) Storing the bitmap somewhere else
> 
> I used (1) for this implementation for simplicity, but I think (2) is
> probably the best one.

Which would give us 32 bits for the subclusters, so you'd get 128k/4k or
2M/64k. Or would you intend to use some of these 32 bits for something
different?

I think (3) is the worst because it adds another kind of metadata table
that we have to consider for ordering updates. So it might come with
more frequent cache flushes.

> ===========================
> 
> And I think that's all. As you can see I didn't want to go much into
> the open technical questions (I think the on-disk format would be the
> main one), the first goal should be to decide whether this is still an
> interesting feature or not.
> 
> So, any questions or comments will be much appreciated.

It does look very interesting to me, at least for small subcluster sizes.

For the larger ones, I suspect that the Virtuozzo guys might be
interested in performing more benchmarks to see whether it improves the
fragmentation problems that they have talked about a lot. It might end
up being interesting for these cases, too.

Kevin



* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-27 16:54 ` Kevin Wolf
@ 2019-06-27 17:08   ` Denis Lunev
  2019-06-28 16:32     ` Alberto Garcia
  2019-07-11 14:08     ` Alberto Garcia
  2019-06-28 12:57   ` Alberto Garcia
  1 sibling, 2 replies; 29+ messages in thread
From: Denis Lunev @ 2019-06-27 17:08 UTC (permalink / raw)
  To: Kevin Wolf, Alberto Garcia
  Cc: Anton Nefedov, qemu-devel, qemu-block, Max Reitz

[snip]
>> ===========================
>>
>> And I think that's all. As you can see I didn't want to go much into
>> the open technical questions (I think the on-disk format would be the
>> main one), the first goal should be to decide whether this is still an
>> interesting feature or not.
>>
>> So, any questions or comments will be much appreciated.
> It does look very interesting to me, at least for small subcluster sizes.
>
> For the larger ones, I suspect that the Virtuozzo guys might be
> interested in performing more benchmarks to see whether it improves the
> fragmentation problems that they have talked about a lot. It might end
> up being interesting for these cases, too.
>
> Kevin
There is no difference in terms of data continuity if the space under
the whole cluster is allocated with fallocate(), as noted by Berto.

For large sizes I have posted a different use case, with slightly
different constraints. The subcluster option could be interesting in
terms of random IO on big disks even after the allocation.

But could we get a link to a repo with the current version of the
patches?

Den


* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-27 15:42     ` Alberto Garcia
@ 2019-06-28  9:20       ` Kevin Wolf
  2019-06-28  9:53         ` Alberto Garcia
  0 siblings, 1 reply; 29+ messages in thread
From: Kevin Wolf @ 2019-06-28  9:20 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

Am 27.06.2019 um 17:42 hat Alberto Garcia geschrieben:
> On Thu 27 Jun 2019 05:38:56 PM CEST, Alberto Garcia wrote:
> >> I would consider 64k cluster/8k subcluster as too extreme for me.
> 
> I forgot to add: this 64k/8k ratio is only with my current prototype.
> 
> In practice if we go with the 128-bit L2 entries we would have 64
> subclusters per cluster, or 32 if we want to have a separate
> QCOW_OFLAG_ZERO for each subcluster (would we need this?).

Yes, I think we'd want to have a separate zero flag for each subcluster.
Otherwise, when writing to a zero cluster, you'd have to COW the whole
supercluster again. And you'd have to fall back more often to explicit
writes of a zeroed buffer rather than using efficient operations.

Kevin



* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28  9:20       ` Kevin Wolf
@ 2019-06-28  9:53         ` Alberto Garcia
  2019-06-28 10:04           ` Kevin Wolf
  0 siblings, 1 reply; 29+ messages in thread
From: Alberto Garcia @ 2019-06-28  9:53 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

On Fri 28 Jun 2019 11:20:57 AM CEST, Kevin Wolf wrote:
>> >> I would consider 64k cluster/8k subcluster as too extreme for me.
>> 
>> I forgot to add: this 64k/8k ratio is only with my current prototype.
>> 
>> In practice if we go with the 128-bit L2 entries we would have 64
>> subclusters per cluster, or 32 if we want to have a separate
>> QCOW_OFLAG_ZERO for each subcluster (would we need this?).
>
> Yes, I think we'd want to have a separate zero flag for each
> subcluster.  Otherwise, when writing to a zero cluster, you'd have to
> COW the whole supercluster again.

Yes, if the original cluster had the QCOW_OFLAG_ZERO bit, but not if
it was simply unallocated.

Berto



* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28  9:53         ` Alberto Garcia
@ 2019-06-28 10:04           ` Kevin Wolf
  2019-06-28 13:19             ` Alberto Garcia
  0 siblings, 1 reply; 29+ messages in thread
From: Kevin Wolf @ 2019-06-28 10:04 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

Am 28.06.2019 um 11:53 hat Alberto Garcia geschrieben:
> On Fri 28 Jun 2019 11:20:57 AM CEST, Kevin Wolf wrote:
> >> >> I would consider 64k cluster/8k subcluster as too extreme for me.
> >> 
> >> I forgot to add: this 64k/8k ratio is only with my current prototype.
> >> 
> >> In practice if we go with the 128-bit L2 entries we would have 64
> >> subclusters per cluster, or 32 if we want to have a separate
> >> QCOW_OFLAG_ZERO for each subcluster (would we need this?).
> >
> > Yes, I think we'd want to have a separate zero flag for each
> > subcluster.  Otherwise, when writing to a zero cluster, you'd have to
> > COW the whole supercluster again.
> 
> Yes if the original cluster had the QCOW_OFLAG_ZERO bit, not if it was
> simply unallocated.

Right, but we already noticed before writing the spec for v3 that
leaving clusters simply unallocated doesn't quite cut it. The zero flag
is only really necessary when you have a backing file, of course, but
that's one of the more interesting cases for subclusters anyway.

Kevin



* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-27 16:54 ` Kevin Wolf
  2019-06-27 17:08   ` Denis Lunev
@ 2019-06-28 12:57   ` Alberto Garcia
  2019-06-28 13:03     ` Alberto Garcia
  1 sibling, 1 reply; 29+ messages in thread
From: Alberto Garcia @ 2019-06-28 12:57 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Anton Nefedov, Denis V. Lunev, qemu-block, qemu-devel, Max Reitz

On Thu 27 Jun 2019 06:54:34 PM CEST, Kevin Wolf wrote:
>> |-----------------+----------------+-----------------|
>> |  Cluster size   | subclusters=on | subclusters=off |
>> |-----------------+----------------+-----------------|
>> |   2 MB (256 KB) |   571 IOPS     |  124 IOPS       |
>> |   1 MB (128 KB) |   863 IOPS     |  212 IOPS       |
>> | 512 KB  (64 KB) |  1678 IOPS     |  365 IOPS       |
>> | 256 KB  (32 KB) |  2618 IOPS     |  568 IOPS       |
>> | 128 KB  (16 KB) |  4907 IOPS     |  873 IOPS       |
>> |  64 KB   (8 KB) | 10613 IOPS     | 1680 IOPS       |
>> |  32 KB   (4 KB) | 13038 IOPS     | 2476 IOPS       |
>> |   4 KB (512 B)  |   101 IOPS     |  101 IOPS       |
>> |-----------------+----------------+-----------------|
>
> So at the first sight, if you compare the numbers in the same row,
> subclusters=on is a clear winner.

Yes, as expected.

> But almost more interesting is the observation that at least for large
> cluster sizes, subcluster size X performs almost identical to cluster
> size X without subclusters:

But that's also to be expected, isn't it? The only difference (in terms
of I/O) between allocating a 64KB cluster and a 64KB subcluster is how
the L2 entry is updated. The amount of data that is read and written is
the same.

> Something interesting happens in the part that you didn't benchmark
> between 4 KB and 32 KB (actually, maybe it has already started for the
> 32 KB case): Performance collapses for small cluster sizes, but it
> reaches record highs for small subclusters.

I didn't measure that initially because I thought that having
subclusters < 4KB was not very useful. The 512b case was just to see
how it would perform in the extreme case. I decided to get the rest of
the numbers anyway, so here's the complete table with the missing rows:

|--------------+-----------------+----------------+-----------------|
| Cluster (KB) | Subcluster (KB) | subclusters=on | subclusters=off |
|--------------+-----------------+----------------+-----------------|
|         2048 |             256 |            571 |             124 |
|         1024 |             128 |            863 |             212 |
|          512 |              64 |           1678 |             365 |
|          256 |              32 |           2618 |             568 |
|          128 |              16 |           4907 |             873 |
|           64 |               8 |          10613 |            1680 |
|           32 |               4 |          13038 |            2476 |
|           16 |               2 |           7555 |            3389 |
|            8 |               1 |            299 |             420 |
|            4 |             0.5 |            101 |             101 |
|--------------+-----------------+----------------+-----------------|

> I suspect that this is because L2 tables are becoming very small with
> 4 KB clusters, but they are still 32 KB if 4 KB is only the subcluster
> size.

Yes, I explained that in my original proposal from 2017. I didn't
actually investigate further, but my take is that 4KB clusters require
constant allocations and refcount updates, plus L2 tables fill up very
quickly.

> (By the way, did the L2 cache cover the whole disk in your
> benchmarks?)

Yes, in all cases (I forgot to mention that, sorry).

> I think this gives us two completely different motivations why
> subclusters could be useful, depending on the cluster size you're
> using:
>
> 1. If you use small cluster sizes like 32 KB/4 KB, then obviously you
>    can get IOPS rates during cluster allocation that you couldn't come
>    even close to before. I think this is a quite strong argument in
>    favour of the feature.

Yes, indeed. You would need to select the subcluster size so it matches
the size of guest I/O requests (the size of the filesystem block is
probably the best choice).

> 2. With larger cluster sizes, you don't get a significant difference
>    in the performance during cluster allocation compared to just using
>    the subcluster size as the cluster size without having
>    subclusters. Here, the motivation could be something along the
>    lines of avoiding fragmentation. This would probably need more
>    benchmarks to check how fragmentation affects the performance after
>    the initial write.
>
>    This one could possibly be a valid justification, too, but I think it
>    would need more work on demonstrating that the effects are real and
>    justify the implementation and long-term maintenance effort required
>    for subclusters.

I agree. However another benefit of large cluster sizes is that you
reduce the amount of metadata, so you get the same performance with a
smaller L2 cache.

>> I also ran some tests on a rotating HDD drive. Here having
>> subclusters doesn't make a big difference regardless of whether there
>> is a backing image or not, so we can ignore this scenario.
>
> Interesting, this is kind of unexpected. Why would avoided COW not
> make a difference on rotating HDDs? (All of this is cache=none,
> right?)

, the 32K/4K with no COW is obviously much faster 

>
>> === Changes to the on-disk format ===
>> 
>> In my original proposal I described 3 different alternatives for
>> storing the subcluster bitmaps. I'm naming them here, but refer to
>> that message for more details.
>> 
>> (1) Storing the bitmap inside the 64-bit entry
>> (2) Making L2 entries 128-bit wide.
>> (3) Storing the bitmap somewhere else
>> 
>> I used (1) for this implementation for simplicity, but I think (2) is
>> probably the best one.
>
> Which would give us 32 bits for the subclusters, so you'd get 128k/4k or
> 2M/64k. Or would you intend to use some of these 32 bits for something
> different?
>
> I think (3) is the worst because it adds another kind of metadata table
> that we have to consider for ordering updates. So it might come with
> more frequent cache flushes.
>
>> ===========================
>> 
>> And I think that's all. As you can see I didn't want to go much into
>> the open technical questions (I think the on-disk format would be the
>> main one), the first goal should be to decide whether this is still an
>> interesting feature or not.
>> 
>> So, any questions or comments will be much appreciated.
>
> It does look very interesting to me, at least for small subcluster sizes.
>
> For the larger ones, I suspect that the Virtuozzo guys might be
> interested in performing more benchmarks to see whether it improves the
> fragmentation problems that they have talked about a lot. It might end
> up being interesting for these cases, too.
>
> Kevin



* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 12:57   ` Alberto Garcia
@ 2019-06-28 13:03     ` Alberto Garcia
  0 siblings, 0 replies; 29+ messages in thread
From: Alberto Garcia @ 2019-06-28 13:03 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Anton Nefedov, Denis V. Lunev, qemu-block, qemu-devel, Max Reitz

I pressed "send" too early, here's the last part of my reply:

On Fri 28 Jun 2019 02:57:56 PM CEST, Alberto Garcia wrote:
>> I also ran some tests on a rotating HDD drive. Here having
>> subclusters doesn't make a big difference regardless of whether there
>> is a backing image or not, so we can ignore this scenario.

> Interesting, this is kind of unexpected. Why would avoided COW not
> make a difference on rotating HDDs? (All of this is cache=none,
> right?)

The 32K/4K case with no COW is obviously much faster (it's also
faster with 1MB and 2MB clusters); it's the rest of the scenarios that
show no improvement.

>> === Changes to the on-disk format ===
>> 
>> In my original proposal I described 3 different alternatives for
>> storing the subcluster bitmaps. I'm naming them here, but refer to
>> that message for more details.
>> 
>> (1) Storing the bitmap inside the 64-bit entry
>> (2) Making L2 entries 128-bit wide.
>> (3) Storing the bitmap somewhere else
>> 
>> I used (1) for this implementation for simplicity, but I think (2) is
>> probably the best one.
>
> Which would give us 32 bits for the subclusters, so you'd get 128k/4k
> or 2M/64k. Or would you intend to use some of these 32 bits for
> something different?

No, 32 bits for subclusters, or 64 if we don't have the 'all zeroes' bit
(we discussed this in a separate message).

> I think (3) is the worst because it adds another kind of metadata
> table that we have to consider for ordering updates. So it might come
> with more frequent cache flushes.

Yes I agree.

Berto



* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 10:04           ` Kevin Wolf
@ 2019-06-28 13:19             ` Alberto Garcia
  2019-06-28 14:16               ` Kevin Wolf
  0 siblings, 1 reply; 29+ messages in thread
From: Alberto Garcia @ 2019-06-28 13:19 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

On Fri 28 Jun 2019 12:04:22 PM CEST, Kevin Wolf wrote:
> Am 28.06.2019 um 11:53 hat Alberto Garcia geschrieben:
>> On Fri 28 Jun 2019 11:20:57 AM CEST, Kevin Wolf wrote:
>> >> >> I would consider 64k cluster/8k subcluster as too extreme for me.
>> >> 
>> >> I forgot to add: this 64k/8k ratio is only with my current prototype.
>> >> 
>> >> In practice if we go with the 128-bit L2 entries we would have 64
>> >> subclusters per cluster, or 32 if we want to have a separate
>> >> QCOW_OFLAG_ZERO for each subcluster (would we need this?).
>> >
>> > Yes, I think we'd want to have a separate zero flag for each
>> > subcluster.  Otherwise, when writing to a zero cluster, you'd have to
>> > COW the whole supercluster again.
>> 
>> Yes if the original cluster had the QCOW_OFLAG_ZERO bit, not if it
>> was simply unallocated.
>
> Right, but that leaving clusters simply unallocated doesn't quite cut
> it is something we already noticed before writing the spec for
> v3. Only really necessary when you have a backing file, of course, but
> that's one of the more interesting cases for subclusters anyway.

I wonder if it would be possible to have a hybrid solution:

With 64 bits to indicate the allocation status of each subcluster we
still have the original L2 entry with its QCOW_OFLAG_ZERO bit, so we
need to specify what happens if that bit is set and at the same time
some subclusters are marked as allocated.

One possibility is that allocated subclusters have actual data and the
rest are zero subclusters.

Berto



* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 13:19             ` Alberto Garcia
@ 2019-06-28 14:16               ` Kevin Wolf
  2019-06-28 16:31                 ` Alberto Garcia
  0 siblings, 1 reply; 29+ messages in thread
From: Kevin Wolf @ 2019-06-28 14:16 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

Am 28.06.2019 um 15:19 hat Alberto Garcia geschrieben:
> On Fri 28 Jun 2019 12:04:22 PM CEST, Kevin Wolf wrote:
> > Am 28.06.2019 um 11:53 hat Alberto Garcia geschrieben:
> >> On Fri 28 Jun 2019 11:20:57 AM CEST, Kevin Wolf wrote:
> >> >> >> I would consider 64k cluster/8k subcluster as too extreme for me.
> >> >> 
> >> >> I forgot to add: this 64k/8k ratio is only with my current prototype.
> >> >> 
> >> >> In practice if we go with the 128-bit L2 entries we would have 64
> >> >> subclusters per cluster, or 32 if we want to have a separate
> >> >> QCOW_OFLAG_ZERO for each subcluster (would we need this?).
> >> >
> >> > Yes, I think we'd want to have a separate zero flag for each
> >> > subcluster.  Otherwise, when writing to a zero cluster, you'd have to
> >> > COW the whole supercluster again.
> >> 
> >> Yes if the original cluster had the QCOW_OFLAG_ZERO bit, not if it
> >> was simply unallocated.
> >
> > Right, but that leaving clusters simply unallocated doesn't quite cut
> > it is something we already noticed before writing the spec for
> > v3. Only really necessary when you have a backing file, of course, but
> > that's one of the more interesting cases for subclusters anyway.
> 
> I wonder if it would be possible to have a hybrid solution:
> 
> With 64 bits to indicate the allocation status of each subcluster we
> still have the original L2 entry with its QCOW_OFLAG_ZERO bit, so we
> need to specify what happens if that bit is set and at the same time
> some subclusters are marked as allocated.
> 
> One possibility is that allocated subclusters have actual data and the
> rest are zero subclusters.

Hm, yes, that would be possible.

However, that would require some additional complexity in write_zeroes
then: If the zero flag isn't set yet, then we need to check that no
other subcluster is unallocated before we can turn the zero flag on for
the whole cluster. You couldn't have subclusters referring to the
backing file and zeroed subclusters at the same time.
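
Roughly, that check could look something like this (only a sketch,
assuming a 64-bit allocation bitmap and the range being zeroed
expressed as a subcluster mask):

#include <stdbool.h>
#include <stdint.h>

/* The cluster-level zero flag may only be set if every subcluster
 * outside the range being zeroed already has data of its own;
 * otherwise those subclusters would silently stop reading from the
 * backing file. */
static bool can_set_cluster_zero_flag(uint64_t allocated_bitmap,
                                      uint64_t zeroed_mask)
{
    return (allocated_bitmap | zeroed_mask) == UINT64_MAX;
}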

Maybe doubling the number of bits would actually be worth it to avoid
the additional complexity in write_zeroes.

Another thing to consider is that with two bits per subcluster, we'd
still have one combination left for other purposes (we only have
unallocated, allocated, zero). In previous discussions, we talked about
the possibility of using that for a new "write-through to backing file"
cluster type. It's not clear if this is a good idea, but it came up
multiple times in the past.

Kevin



* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-27 16:05     ` Denis Lunev
@ 2019-06-28 14:43       ` Alberto Garcia
  2019-06-28 14:47         ` Denis Lunev
  2019-06-28 14:57         ` Kevin Wolf
  0 siblings, 2 replies; 29+ messages in thread
From: Alberto Garcia @ 2019-06-28 14:43 UTC (permalink / raw)
  To: Denis Lunev, qemu-devel; +Cc: Kevin Wolf, Anton Nefedov, qemu-block, Max Reitz

On Thu 27 Jun 2019 06:05:55 PM CEST, Denis Lunev wrote:
>>> Thus in respect to this patterns subclusters could give us benefits
>>> of fast random IO and good reclaim rate.
>> Exactly, but that fast random I/O would only happen when allocating
>> new clusters. Once the clusters are allocated it doesn't provide any
>> additional performance benefit.
>
> No, I am talking about the situation after the allocation. That is the
> main point why I have a feeling that sub-cluster could provide a
> benefit.
>
> OK. The situation (1) is the following:
> - the disk is completely allocated
> - QCOW2 image size is 8 Tb
> - we have image with 1 Mb cluster/64k sub-cluster (for simplicity)
> - L2 metadata cache size is 128 Mb (64 Mb L2 tables, 64 Mb other data)
> - holes are made on a sub-cluster bases, i.e. with 64 Kb granularity
>
> In this case random IO test will give near native IOPS
> result. Metadata is in memory, no additional reads are
> required. Wasted host filesystem space (due to cluster size) is kept
> at minimum, i.e. on the level of the "pre-subcluster" QCOW2.
>
> Situation (2):
> - 8 Tb QCOW2 image is completely allocated
> - 1 Mb cluster size, 128 Mb L2 cache size
>
> Near same performance as (1), but much less disk space savings for
> holes.
>
> Situation (3):
> - 8 Tb QCOW2 image, completely allocated
> - 64 Kb cluster size, 128 MB L2 cache
>
> Random IO performance halved from (1) and (2) due to metadata re-read
> for each subsequent operation. Same disk space savings as in case (1).

If I understood correctly what you are trying to say, subclusters allow
us to use larger cluster sizes in order to reduce the amount of L2
metadata (and therefore the cache size) while keeping the same space
benefits as smaller clusters.

> Please note, I am not talking now about your case with COW. Here the
> allocation is performed on the sub-cluster basis, i.e. the absence of
> the sub-cluster in the image means hole on that offset. This is
> important difference.

I mentioned the possibility that if you have a case like 2MB / 64KB and
you write to an empty cluster then you could allocate the necessary
subclusters, and additionally fallocate() the space of the whole cluster
(2MB) in order to try to keep it contiguous.

With this we would lose the space saving advantage of having
subclusters. But perhaps that would work for smaller cluster sizes (it
would mitigate the fragmentation problem).

Berto



* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 14:43       ` Alberto Garcia
@ 2019-06-28 14:47         ` Denis Lunev
  2019-06-28 14:57         ` Kevin Wolf
  1 sibling, 0 replies; 29+ messages in thread
From: Denis Lunev @ 2019-06-28 14:47 UTC (permalink / raw)
  To: Alberto Garcia, qemu-devel
  Cc: Kevin Wolf, Anton Nefedov, qemu-block, Max Reitz

On 6/28/19 5:43 PM, Alberto Garcia wrote:
> On Thu 27 Jun 2019 06:05:55 PM CEST, Denis Lunev wrote:
>>>> Thus in respect to this patterns subclusters could give us benefits
>>>> of fast random IO and good reclaim rate.
>>> Exactly, but that fast random I/O would only happen when allocating
>>> new clusters. Once the clusters are allocated it doesn't provide any
>>> additional performance benefit.
>> No, I am talking about the situation after the allocation. That is the
>> main point why I have a feeling that sub-cluster could provide a
>> benefit.
>>
>> OK. The situation (1) is the following:
>> - the disk is completely allocated
>> - QCOW2 image size is 8 Tb
>> - we have image with 1 Mb cluster/64k sub-cluster (for simplicity)
>> - L2 metadata cache size is 128 Mb (64 Mb L2 tables, 64 Mb other data)
>> - holes are made on a sub-cluster bases, i.e. with 64 Kb granularity
>>
>> In this case random IO test will give near native IOPS
>> result. Metadata is in memory, no additional reads are
>> required. Wasted host filesystem space (due to cluster size) is kept
>> at minimum, i.e. on the level of the "pre-subcluster" QCOW2.
>>
>> Situation (2):
>> - 8 Tb QCOW2 image is completely allocated
>> - 1 Mb cluster size, 128 Mb L2 cache size
>>
>> Near same performance as (1), but much less disk space savings for
>> holes.
>>
>> Situation (3):
>> - 8 Tb QCOW2 image, completely allocated
>> - 64 Kb cluster size, 128 MB L2 cache
>>
>> Random IO performance halved from (1) and (2) due to metadata re-read
>> for each subsequent operation. Same disk space savings as in case (1).
> If I understood correctly what you are trying to say, subclusters allow
> us to use larger cluster sizes in order to reduce the amount of L2
> metadata (and therefore the cache size) while keeping the same space
> benefits as smaller clusters.
yes

>> Please note, I am not talking now about your case with COW. Here the
>> allocation is performed on a sub-cluster basis, i.e. the absence of
>> the sub-cluster in the image means a hole at that offset. This is an
>> important difference.
> I mentioned the possibility that if you have a case like 2MB / 64KB and
> you write to an empty cluster then you could allocate the necessary
> subclusters, and additionally fallocate() the space of the whole cluster
> (2MB) in order to try to keep it contiguous.
>
> With this we would lose the space saving advantage of having
> subclusters. But perhaps that would work for smaller cluster sizes (it
> would mitigate the fragmentation problem).
yes, this is the distinction and a completely different use case.
We have obtained it over time from our customers,
who want very fast performance AND space conservation
at once. This is still the case for SSD users, since SSDs
are fast but small.

Den

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 14:43       ` Alberto Garcia
  2019-06-28 14:47         ` Denis Lunev
@ 2019-06-28 14:57         ` Kevin Wolf
  2019-06-28 15:02           ` Alberto Garcia
  1 sibling, 1 reply; 29+ messages in thread
From: Kevin Wolf @ 2019-06-28 14:57 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

Am 28.06.2019 um 16:43 hat Alberto Garcia geschrieben:
> On Thu 27 Jun 2019 06:05:55 PM CEST, Denis Lunev wrote:
> > Please note, I am not talking now about your case with COW. Here the
> > allocation is performed on a sub-cluster basis, i.e. the absence of
> > the sub-cluster in the image means a hole at that offset. This is an
> > important difference.
> 
> I mentioned the possibility that if you have a case like 2MB / 64KB and
> you write to an empty cluster then you could allocate the necessary
> subclusters, and additionally fallocate() the space of the whole cluster
> (2MB) in order to try to keep it contiguous.
> 
> With this we would lose the space saving advantage of having
> subclusters. But perhaps that would work for smaller cluster sizes (it
> would mitigate the fragmentation problem).

There seem to be use cases for both ways. So does this need to be an
option?

Kevin


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 14:57         ` Kevin Wolf
@ 2019-06-28 15:02           ` Alberto Garcia
  2019-06-28 15:03             ` Denis Lunev
  2019-06-28 15:09             ` Kevin Wolf
  0 siblings, 2 replies; 29+ messages in thread
From: Alberto Garcia @ 2019-06-28 15:02 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

On Fri 28 Jun 2019 04:57:08 PM CEST, Kevin Wolf wrote:
> Am 28.06.2019 um 16:43 hat Alberto Garcia geschrieben:
>> On Thu 27 Jun 2019 06:05:55 PM CEST, Denis Lunev wrote:
>> > Please note, I am not talking now about your case with COW. Here the
>> > allocation is performed on a sub-cluster basis, i.e. the absence of
>> > the sub-cluster in the image means a hole at that offset. This is an
>> > important difference.
>> 
>> I mentioned the possibility that if you have a case like 2MB / 64KB
>> and you write to an empty cluster then you could allocate the
>> necessary subclusters, and additionally fallocate() the space of the
>> whole cluster (2MB) in order to try to keep it contiguous.
>> 
>> With this we would lose the space saving advantage of having
>> subclusters. But perhaps that would work for smaller cluster sizes
>> (it would mitigate the fragmentation problem).
>
> There seem to be use cases for both ways. So does this need to be an
> option?

Probably a runtime option, or a heuristic that decides what to do
depending on the cluster size.

Berto


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 15:02           ` Alberto Garcia
@ 2019-06-28 15:03             ` Denis Lunev
  2019-06-28 15:10               ` Alberto Garcia
  2019-06-28 15:09             ` Kevin Wolf
  1 sibling, 1 reply; 29+ messages in thread
From: Denis Lunev @ 2019-06-28 15:03 UTC (permalink / raw)
  To: Alberto Garcia, Kevin Wolf
  Cc: Anton Nefedov, qemu-devel, qemu-block, Max Reitz

On 6/28/19 6:02 PM, Alberto Garcia wrote:
> On Fri 28 Jun 2019 04:57:08 PM CEST, Kevin Wolf wrote:
>> Am 28.06.2019 um 16:43 hat Alberto Garcia geschrieben:
>>> On Thu 27 Jun 2019 06:05:55 PM CEST, Denis Lunev wrote:
>>>> Please note, I am not talking now about your case with COW. Here the
>>>> allocation is performed on a sub-cluster basis, i.e. the absence of
>>>> the sub-cluster in the image means a hole at that offset. This is an
>>>> important difference.
>>> I mentioned the possibility that if you have a case like 2MB / 64KB
>>> and you write to an empty cluster then you could allocate the
>>> necessary subclusters, and additionally fallocate() the space of the
>>> whole cluster (2MB) in order to try to keep it contiguous.
>>>
>>> With this we would lose the space saving advantage of having
>>> subclusters. But perhaps that would work for smaller cluster sizes
>>> (it would mitigate the fragmentation problem).
>> There seem to be use cases for both ways. So does this need to be an
>> option?
> Probably a runtime option, or a heuristic that decides what to do
> depending on the cluster size.
no, I think that this should be an on-disk option, as it affects the
allocation strategy.

Den

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 15:02           ` Alberto Garcia
  2019-06-28 15:03             ` Denis Lunev
@ 2019-06-28 15:09             ` Kevin Wolf
  2019-06-28 15:12               ` Alberto Garcia
  1 sibling, 1 reply; 29+ messages in thread
From: Kevin Wolf @ 2019-06-28 15:09 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

Am 28.06.2019 um 17:02 hat Alberto Garcia geschrieben:
> On Fri 28 Jun 2019 04:57:08 PM CEST, Kevin Wolf wrote:
> > Am 28.06.2019 um 16:43 hat Alberto Garcia geschrieben:
> >> On Thu 27 Jun 2019 06:05:55 PM CEST, Denis Lunev wrote:
> >> > Please note, I am not talking now about your case with COW. Here the
> >> > allocation is performed on a sub-cluster basis, i.e. the absence of
> >> > the sub-cluster in the image means a hole at that offset. This is an
> >> > important difference.
> >> 
> >> I mentioned the possibility that if you have a case like 2MB / 64KB
> >> and you write to an empty cluster then you could allocate the
> >> necessary subclusters, and additionally fallocate() the space of the
> >> whole cluster (2MB) in order to try to keep it contiguous.
> >> 
> >> With this we would lose the space saving advantage of having
> >> subclusters. But perhaps that would work for smaller cluster sizes
> >> (it would mitigate the fragmentation problem).
> >
> > There seem to be use cases for both ways. So does this need to be an
> > option?
> 
> Probably a runtime option, or a heuristic that decides what to do
> depending on the cluster size.

How would the heuristic decide whether the user wants to save disk space
or whether they consider avoiding fragmentation (i.e. performance) more
important?

Kevin


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 15:03             ` Denis Lunev
@ 2019-06-28 15:10               ` Alberto Garcia
  2019-06-28 15:15                 ` Kevin Wolf
  0 siblings, 1 reply; 29+ messages in thread
From: Alberto Garcia @ 2019-06-28 15:10 UTC (permalink / raw)
  To: Denis Lunev, Kevin Wolf; +Cc: Anton Nefedov, qemu-devel, qemu-block, Max Reitz

On Fri 28 Jun 2019 05:03:13 PM CEST, Denis Lunev wrote:
> On 6/28/19 6:02 PM, Alberto Garcia wrote:
>>>>> Please note, I am not talking now about your case with COW. Here the
>>>>> allocation is performed on a sub-cluster basis, i.e. the absence of
>>>>> the sub-cluster in the image means a hole at that offset. This is an
>>>>> important difference.
>>>> I mentioned the possibility that if you have a case like 2MB / 64KB
>>>> and you write to an empty cluster then you could allocate the
>>>> necessary subclusters, and additionally fallocate() the space of the
>>>> whole cluster (2MB) in order to try to keep it contiguous.
>>>>
>>>> With this we would lose the space saving advantage of having
>>>> subclusters. But perhaps that would work for smaller cluster sizes
>>>> (it would mitigate the fragmentation problem).
>>> There seem to be use cases for both ways. So does this need to be an
>>> option?
>> Probably a runtime option, or a heuristic that decides what to do
>> depending on the cluster size.
> no, I think that this should be on-disk option as this affects
> allocation strategy.

But why does it need to be stored on-disk? It should be theoretically
possible to switch between one strategy and the other at runtime (not
that it would make sense though).

Berto


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 15:09             ` Kevin Wolf
@ 2019-06-28 15:12               ` Alberto Garcia
  2019-07-01  6:22                 ` Kevin Wolf
  0 siblings, 1 reply; 29+ messages in thread
From: Alberto Garcia @ 2019-06-28 15:12 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

On Fri 28 Jun 2019 05:09:11 PM CEST, Kevin Wolf wrote:
> Am 28.06.2019 um 17:02 hat Alberto Garcia geschrieben:
>> On Fri 28 Jun 2019 04:57:08 PM CEST, Kevin Wolf wrote:
>> > Am 28.06.2019 um 16:43 hat Alberto Garcia geschrieben:
>> >> On Thu 27 Jun 2019 06:05:55 PM CEST, Denis Lunev wrote:
>> >> > Please note, I am not talking now about your case with COW. Here the
>> >> > allocation is performed on a sub-cluster basis, i.e. the absence of
>> >> > the sub-cluster in the image means a hole at that offset. This is an
>> >> > important difference.
>> >> 
>> >> I mentioned the possibility that if you have a case like 2MB / 64KB
>> >> and you write to an empty cluster then you could allocate the
>> >> necessary subclusters, and additionally fallocate() the space of the
>> >> whole cluster (2MB) in order to try to keep it contiguous.
>> >> 
>> >> With this we would lose the space saving advantage of having
>> >> subclusters. But perhaps that would work for smaller cluster sizes
>> >> (it would mitigate the fragmentation problem).
>> >
>> > There seem to be use cases for both ways. So does this need to be an
>> > option?
>> 
>> Probably a runtime option, or a heuristic that decides what to do
>> depending on the cluster size.
>
> How would the heuristic decide whether the user wants to save disk space
> or whether they consider avoiding fragmentation (i.e. performance) more
> important?

Well I suppose the fragmentation problem is more important when you have
small clusters and less so when you have large clusters, so that would
be a way to do it.

Of course with an option the user would have the final choice.
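
Just to illustrate what I mean by a heuristic (the threshold here is
completely made up):

    /* Made-up heuristic: preallocate the full cluster only when clusters
     * are small, where fragmentation hurts most; keep holes otherwise. */
    #include <stdbool.h>
    #include <stdint.h>

    static bool prealloc_full_cluster(uint64_t cluster_size)
    {
        return cluster_size <= 128 * 1024;   /* illustrative 128 KB cutoff */
    }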

Berto


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 15:10               ` Alberto Garcia
@ 2019-06-28 15:15                 ` Kevin Wolf
  0 siblings, 0 replies; 29+ messages in thread
From: Kevin Wolf @ 2019-06-28 15:15 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

Am 28.06.2019 um 17:10 hat Alberto Garcia geschrieben:
> On Fri 28 Jun 2019 05:03:13 PM CEST, Denis Lunev wrote:
> > On 6/28/19 6:02 PM, Alberto Garcia wrote:
> >>>>> Please note, I am not talking now about your case with COW. Here the
> >>>>> allocation is performed on a sub-cluster basis, i.e. the absence of
> >>>>> the sub-cluster in the image means a hole at that offset. This is an
> >>>>> important difference.
> >>>> I mentioned the possibility that if you have a case like 2MB / 64KB
> >>>> and you write to an empty cluster then you could allocate the
> >>>> necessary subclusters, and additionally fallocate() the space of the
> >>>> whole cluster (2MB) in order to try to keep it contiguous.
> >>>>
> >>>> With this we would lose the space saving advantage of having
> >>>> subclusters. But perhaps that would work for smaller cluster sizes
> >>>> (it would mitigate the fragmentation problem).
> >>> There seem to be use cases for both ways. So does this need to be an
> >>> option?
> >> Probably a runtime option, or a heuristic that decides what to do
> >> depending on the cluster size.
> > no, I think that this should be on-disk option as this affects
> > allocation strategy.
> 
> But why does it need to be stored on-disk? It should be theoretically
> possible to switch between on strategy and the other at runtime (not
> that it would make sense though).

I think it makes sense to store the default in the image and allow it to
be overridden at runtime, similar to lazy_refcounts.

Kevin


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 14:16               ` Kevin Wolf
@ 2019-06-28 16:31                 ` Alberto Garcia
  0 siblings, 0 replies; 29+ messages in thread
From: Alberto Garcia @ 2019-06-28 16:31 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

On Fri 28 Jun 2019 04:16:28 PM CEST, Kevin Wolf wrote:
>> >> >> >> I would consider 64k cluster/8k subcluster as too extreme
>> >> >> >> for me.
>> >> >> 
>> >> >> I forgot to add: this 64k/8k ratio is only with my current
>> >> >> prototype.
>> >> >> 
>> >> >> In practice if we go with the 128-bit L2 entries we would have
>> >> >> 64 subclusters per cluster, or 32 if we want to have a separate
>> >> >> QCOW_OFLAG_ZERO for each subcluster (would we need this?).
>> >> >
>> >> > Yes, I think we'd want to have a separate zero flag for each
>> >> > subcluster.  Otherwise, when writing to a zero cluster, you'd
>> >> > have to COW the whole supercluster again.
>> >> 
>> >> Yes if the original cluster had the QCOW_OFLAG_ZERO bit, not if it
>> >> was simply unallocated.
>> >
>> > Right, but we already noticed before writing the spec for v3 that
>> > leaving clusters simply unallocated doesn't quite cut it. Only really
>> > necessary when you have a backing file, of course, but that's one of
>> > the more interesting cases for subclusters anyway.
>> 
>> I wonder if it would be possible to have a hybrid solution:
>> 
>> With 64 bits to indicate the allocation status of each subcluster we
>> still have the original L2 entry with its QCOW_OFLAG_ZERO bit, so we
>> need to specify what happens if that bit is set and at the same time
>> some subclusters are marked as allocated.
>> 
>> One possibility is that allocated subclusters have actual data and
>> the rest are zero subclusters.
>
> Hm, yes, that would be possible.
>
> However, that would require some additional complexity in write_zeroes
> then: If the zero flag isn't set yet, then we need to check that no
> other subcluster is unallocated before we can turn the zero flag on
> for the whole cluster. You couldn't have subclusters referring to the
> backing file and zeroed subclusters at the same time.
>
> Maybe doubling the number of bits would actually be worth the
> additional complexity in write_zeroes.

Yes, that's a bit my doubt: how important is it to have those features
that you mention vs being able to halve the minimum unit of allocation?

Berto


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-27 17:08   ` Denis Lunev
@ 2019-06-28 16:32     ` Alberto Garcia
  2019-07-11 14:08     ` Alberto Garcia
  1 sibling, 0 replies; 29+ messages in thread
From: Alberto Garcia @ 2019-06-28 16:32 UTC (permalink / raw)
  To: Denis Lunev, Kevin Wolf; +Cc: Anton Nefedov, qemu-devel, qemu-block, Max Reitz

On Thu 27 Jun 2019 07:08:29 PM CEST, Denis Lunev wrote:
> But can we get a link to the repo with actual version of patches.

It's not in a state that can be published at the moment, but I'll try to
have something available soon.

Berto


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-28 15:12               ` Alberto Garcia
@ 2019-07-01  6:22                 ` Kevin Wolf
  0 siblings, 0 replies; 29+ messages in thread
From: Kevin Wolf @ 2019-07-01  6:22 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

Am 28.06.2019 um 17:12 hat Alberto Garcia geschrieben:
> On Fri 28 Jun 2019 05:09:11 PM CEST, Kevin Wolf wrote:
> > Am 28.06.2019 um 17:02 hat Alberto Garcia geschrieben:
> >> On Fri 28 Jun 2019 04:57:08 PM CEST, Kevin Wolf wrote:
> >> > Am 28.06.2019 um 16:43 hat Alberto Garcia geschrieben:
> >> >> On Thu 27 Jun 2019 06:05:55 PM CEST, Denis Lunev wrote:
> >> >> > Please note, I am not talking now about your case with COW. Here the
> >> >> > allocation is performed on the sub-cluster basis, i.e. the abscence of
> >> >> > the sub-cluster in the image means hole on that offset. This is
> >> >> > important difference.
> >> >> 
> >> >> I mentioned the possibility that if you have a case like 2MB / 64KB
> >> >> and you write to an empty cluster then you could allocate the
> >> >> necessary subclusters, and additionally fallocate() the space of the
> >> >> whole cluster (2MB) in order to try to keep it contiguous.
> >> >> 
> >> >> With this we would lose the space saving advantage of having
> >> >> subclusters. But perhaps that would work for smaller cluster sizes
> >> >> (it would mitigate the fragmentation problem).
> >> >
> >> > There seem to be use cases for both ways. So does this need to be an
> >> > option?
> >> 
> >> Probably a runtime option, or a heuristic that decides what to do
> >> depending on the cluster size.
> >
> > How would the heuristic decide whether the user wants to save disk space
> > or whether they consider avoiding fragmentation (i.e. performance) more
> > important?
> 
> Well I suppose the fragmentation problem is more important when you have
> small clusters and less so when you have large clusters, so that would
> be a way to do it.

On the other hand, if the user cares about fragmentation, they will
probably use large clusters, and if they care about disk space, they
will probably use smaller clusters.

> Of course with an option the user would have the final choice.

Ah, okay, if it's only a default, then guessing wrong isn't as much of a
problem.

Kevin


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-06-27 17:08   ` Denis Lunev
  2019-06-28 16:32     ` Alberto Garcia
@ 2019-07-11 14:08     ` Alberto Garcia
  2019-07-11 14:32       ` Kevin Wolf
  1 sibling, 1 reply; 29+ messages in thread
From: Alberto Garcia @ 2019-07-11 14:08 UTC (permalink / raw)
  To: Denis Lunev, Kevin Wolf; +Cc: Anton Nefedov, qemu-devel, qemu-block, Max Reitz

On Thu 27 Jun 2019 07:08:29 PM CEST, Denis Lunev wrote:

> But can we get a link to the repo with actual version of patches.

Hi,

I updated my code to increase the L2 entry size from 64 bits to 128 bits
and thanks to this we now have 32 subclusters per cluster (32 bits for
"subcluster allocated" and 32 for "subcluster is all zeroes").

I also fixed a few bugs along the way and started to clean up the code
a bit so it is more readable. You can get it here:

   https://github.com/bertogg/qemu/releases/tag/subcluster-allocation-prototype-20190711

The idea is that you can test it, evaluate the performance and see
whether the general approach makes sense, but this is obviously not
release-quality code so don't focus too much on the coding style,
variable names, hacks, etc. Many things need to change, other things
still need to be implemented, and I'm already in the process of doing
it.

Some questions that are still open:

- It is possible to configure the number of subclusters per cluster very
  easily. It is currently hardcoded to 32 in qcow2_do_open(), but any
  power of 2 would work (just change the number there if you want to
  test it). Would an option for this be worth adding?

- We could also allow the user to choose 64 subclusters per cluster and
  disable the "all zeroes" bits in that case. It is quite simple in
  terms of lines of code but it would make the qcow2 spec a bit more
  complicated.

- We would now have "all zeroes" bits at the cluster and subcluster
  levels, so there's an ambiguity here that we need to solve. In
  particular, what happens if we have a QCOW2_CLUSTER_ZERO_ALLOC cluster
  but some bits from the bitmap are set? Do we ignore them completely?

I also ran some I/O tests using a similar scenario to last time (SSD
drive, 40GB backing image). Here are the results; you can see the
difference between the previous prototype (8 subclusters per cluster)
and the new one (32):

|--------------+----------------+---------------+-----------------|
| Cluster size | 32 subclusters | 8 subclusters | subclusters=off |
|--------------+----------------+---------------+-----------------|
|         4 KB |        80 IOPS |      101 IOPS |         92 IOPS |
|         8 KB |       108 IOPS |      299 IOPS |        417 IOPS |
|        16 KB |      3440 IOPS |     7555 IOPS |       3347 IOPS |
|        32 KB |     10718 IOPS |    13038 IOPS |       2435 IOPS |
|        64 KB |     12569 IOPS |    10613 IOPS |       1622 IOPS |
|       128 KB |     11444 IOPS |     4907 IOPS |        866 IOPS |
|       256 KB |      9335 IOPS |     2618 IOPS |        561 IOPS |
|       512 KB |       185 IOPS |     1678 IOPS |        353 IOPS |
|      1024 KB |      2477 IOPS |      863 IOPS |        212 IOPS |
|      2048 KB |      1536 IOPS |      571 IOPS |        123 IOPS |
|--------------+----------------+---------------+-----------------|

I'm surprised about the 256 KB cluster / 32 subclusters case (I would
expect ~3300 IOPS), but I ran it a few times and the results are always
the same. I still haven't investigated why that happens. The rest of the
results seem more or less normal.

I will now continue working towards a complete solution, but any
feedback or comments will be very welcome.

Regards,

Berto


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-07-11 14:08     ` Alberto Garcia
@ 2019-07-11 14:32       ` Kevin Wolf
  2019-07-11 14:56         ` Alberto Garcia
  0 siblings, 1 reply; 29+ messages in thread
From: Kevin Wolf @ 2019-07-11 14:32 UTC (permalink / raw)
  To: Alberto Garcia
  Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

Am 11.07.2019 um 16:08 hat Alberto Garcia geschrieben:
> Some questions that are still open:
> 
> - It is possible to configure very easily the number of subclusters per
>   cluster. It is now hardcoded to 32 in qcow2_do_open() but any power of
>   2 would work (just change the number there if you want to test
>   it). Would an option for this be worth adding?

I think for testing we can just change the constant. Once the feature is
merged and used in production, I don't think there is any reason to
leave bits unused.

> - We could also allow the user to choose 64 subclusters per cluster and
>   disable the "all zeroes" bits in that case. It is quite simple in
>   terms of lines of code but it would make the qcow2 spec a bit more
>   complicated.
> 
> - We would now have "all zeroes" bits at the cluster and subcluster
>   levels, so there's an ambiguity here that we need to solve. In
>   particular, what happens if we have a QCOW2_CLUSTER_ZERO_ALLOC cluster
>   but some bits from the bitmap are set? Do we ignore them completely?

The (super)cluster zero bit should probably always be clear if
subclusters are used. If it's set, we have a corrupted image.

> I also ran some I/O tests using a similar scenario like last time (SSD
> drive, 40GB backing image). Here are the results, you can see the
> difference between the previous prototype (8 subclusters per cluster)
> and the new one (32):

Is the 8 subclusters test run with the old version (64 bit L2 entries)
or the new version (128 bit L2 entries) with bits left unused?

> |--------------+----------------+---------------+-----------------|
> | Cluster size | 32 subclusters | 8 subclusters | subclusters=off |
> |--------------+----------------+---------------+-----------------|
> |         4 KB |        80 IOPS |      101 IOPS |         92 IOPS |
> |         8 KB |       108 IOPS |      299 IOPS |        417 IOPS |
> |        16 KB |      3440 IOPS |     7555 IOPS |       3347 IOPS |
> |        32 KB |     10718 IOPS |    13038 IOPS |       2435 IOPS |
> |        64 KB |     12569 IOPS |    10613 IOPS |       1622 IOPS |
> |       128 KB |     11444 IOPS |     4907 IOPS |        866 IOPS |
> |       256 KB |      9335 IOPS |     2618 IOPS |        561 IOPS |
> |       512 KB |       185 IOPS |     1678 IOPS |        353 IOPS |
> |      1024 KB |      2477 IOPS |      863 IOPS |        212 IOPS |
> |      2048 KB |      1536 IOPS |      571 IOPS |        123 IOPS |
> |--------------+----------------+---------------+-----------------|
> 
> I'm surprised about the 256 KB cluster / 32 subclusters case (I would
> expect ~3300 IOPS), but I ran it a few times and the results are always
> the same. I still haven't investigated why that happens. The rest of the
> results seem more or less normal.

Shouldn't 256k/8k perform similarly to 64k/8k, or maybe a bit better?
Why did you expect ~3300 IOPS?

I found other results more surprising. In particular:

* Why does 64k/2k perform better than 128k/4k when the block size for
  your requests is 4k?

* Why is the maximum for 8 subclusters higher than for 32 subclusters?
  I guess this does make some sense if the 8 subclusters case actually
  used 64 bit L2 entries. If you did use 128 bit entries for both 32 and
  8 subclusters, I don't see why 8 subclusters should perform better in
  any case.

* What causes the minimum at 512k with 32 subclusters? The other two
  setups have a maximum and performance decreases monotonically to both
  sides. This one has a minimum at 512k and larger cluster sizes improve
  performance again.

  In fact, 512k performs really bad compared even to subclusters=off.

Kevin


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
  2019-07-11 14:32       ` Kevin Wolf
@ 2019-07-11 14:56         ` Alberto Garcia
  0 siblings, 0 replies; 29+ messages in thread
From: Alberto Garcia @ 2019-07-11 14:56 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Anton Nefedov, Denis Lunev, qemu-block, qemu-devel, Max Reitz

On Thu 11 Jul 2019 04:32:34 PM CEST, Kevin Wolf wrote:

>> - It is possible to configure very easily the number of subclusters per
>>   cluster. It is now hardcoded to 32 in qcow2_do_open() but any power of
>>   2 would work (just change the number there if you want to test
>>   it). Would an option for this be worth adding?
>
> I think for testing we can just change the constant. Once the feature
> is merged and used in production, I don't think there is any reason to
> leave bits unused.

Me neither unless we want to allow the 64 subclusters scenario that I
mentioned.

>> - We would now have "all zeroes" bits at the cluster and subcluster
>> levels, so there's an ambiguity here that we need to solve. In
>> particular, what happens if we have a QCOW2_CLUSTER_ZERO_ALLOC
>> cluster but some bits from the bitmap are set? Do we ignore them
>> completely?
>
> The (super)cluster zero bit should probably always be clear if
> subclusters are used. If it's set, we have a corrupted image.

I mentioned in an earlier e-mail that one possibility is to leave that
bit as it is now and use the bitmap only for the allocation status (so
we'd have 64 subclusters). If QCOW_OFLAG_ZERO is set and the subcluster
is not allocated then it's all zeroes.

With this we'd double the amount of subclusters but we'd lose the
possibility to have zero and unallocated subclusters at the same time.
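
A sketch of how a read would classify one subcluster under that
64-subcluster variant (a hypothetical helper, not code from the
prototype):

    #include <stdbool.h>
    #include <stdint.h>

    #define QCOW_OFLAG_ZERO (1ULL << 0)   /* as in QEMU's block/qcow2.h */

    /* With 64 allocation bits and no per-subcluster zero bits, an
     * unallocated subcluster reads as zeroes if the cluster-level
     * QCOW_OFLAG_ZERO is set, and comes from the backing file otherwise. */
    static bool subcluster_reads_as_zero(uint64_t l2_entry,
                                         uint64_t alloc_bitmap, int sc)
    {
        if (alloc_bitmap & (1ULL << sc)) {
            return false;                 /* data allocated in this image */
        }
        return (l2_entry & QCOW_OFLAG_ZERO) != 0;
    }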

>> I also ran some I/O tests using a similar scenario like last time
>> (SSD drive, 40GB backing image). Here are the results, you can see
>> the difference between the previous prototype (8 subclusters per
>> cluster) and the new one (32):
>
> Is the 8 subclusters test run with the old version (64 bit L2 entries)
> or the new version (128 bit L2 entries) with bits left unused?

It's the old version of the code (I copied & pasted the numbers from the
previous table).

>> |--------------+----------------+---------------+-----------------|
>> | Cluster size | 32 subclusters | 8 subclusters | subclusters=off |
>> |--------------+----------------+---------------+-----------------|
>> |         4 KB |        80 IOPS |      101 IOPS |         92 IOPS |
>> |         8 KB |       108 IOPS |      299 IOPS |        417 IOPS |
>> |        16 KB |      3440 IOPS |     7555 IOPS |       3347 IOPS |
>> |        32 KB |     10718 IOPS |    13038 IOPS |       2435 IOPS |
>> |        64 KB |     12569 IOPS |    10613 IOPS |       1622 IOPS |
>> |       128 KB |     11444 IOPS |     4907 IOPS |        866 IOPS |
>> |       256 KB |      9335 IOPS |     2618 IOPS |        561 IOPS |
>> |       512 KB |       185 IOPS |     1678 IOPS |        353 IOPS |
>> |      1024 KB |      2477 IOPS |      863 IOPS |        212 IOPS |
>> |      2048 KB |      1536 IOPS |      571 IOPS |        123 IOPS |
>> |--------------+----------------+---------------+-----------------|
>> 
>> I'm surprised about the 256 KB cluster / 32 subclusters case (I would
>> expect ~3300 IOPS), but I ran it a few times and the results are always
>> the same. I still haven't investigated why that happens. The rest of the
>> results seem more or less normal.
>
> Shouldn't 256k/8k perform similarly to 64k/8k, or maybe a bit better?
> Why did you expect ~3300 IOPS?

Sorry I meant the 512k/16k case, which is obviously the outlier there.
 
> I found other results more surprising. In particular:
>
> * Why does 64k/2k perform better than 128k/4k when the block size for
>   your requests is 4k?

They should perform similarly, because the only difference in practice
is that in the former case you set two bits in the bitmap and in the
latter only one. The difference is not too big; I could run the tests
again and, if the results are consistent, I can investigate what's
going on.

But yes, I would expect 128k/4k to be the fastest of them all.

> * Why is the maximum for 8 subclusters higher than for 32 subclusters?
>   I guess this does make some sense if the 8 subclusters case actually
>   used 64 bit L2 entries. If you did use 128 bit entries for both 32 and
>   8 subclusters, I don't see why 8 subclusters should perform better in
>   any case.

I used 64-bit entries for the 8 subcluster case. I can try with the new
code and see what happens.

> * What causes the minimum at 512k with 32 subclusters?

That's the case that I meant earlier, and I still don't have a good
hypothesis of why that happens. I'll need to debug it.

Berto


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2019-07-11 14:57 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-27 13:59 [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images Alberto Garcia
2019-06-27 14:19 ` Denis Lunev
2019-06-27 15:38   ` Alberto Garcia
2019-06-27 15:42     ` Alberto Garcia
2019-06-28  9:20       ` Kevin Wolf
2019-06-28  9:53         ` Alberto Garcia
2019-06-28 10:04           ` Kevin Wolf
2019-06-28 13:19             ` Alberto Garcia
2019-06-28 14:16               ` Kevin Wolf
2019-06-28 16:31                 ` Alberto Garcia
2019-06-27 16:05     ` Denis Lunev
2019-06-28 14:43       ` Alberto Garcia
2019-06-28 14:47         ` Denis Lunev
2019-06-28 14:57         ` Kevin Wolf
2019-06-28 15:02           ` Alberto Garcia
2019-06-28 15:03             ` Denis Lunev
2019-06-28 15:10               ` Alberto Garcia
2019-06-28 15:15                 ` Kevin Wolf
2019-06-28 15:09             ` Kevin Wolf
2019-06-28 15:12               ` Alberto Garcia
2019-07-01  6:22                 ` Kevin Wolf
2019-06-27 16:54 ` Kevin Wolf
2019-06-27 17:08   ` Denis Lunev
2019-06-28 16:32     ` Alberto Garcia
2019-07-11 14:08     ` Alberto Garcia
2019-07-11 14:32       ` Kevin Wolf
2019-07-11 14:56         ` Alberto Garcia
2019-06-28 12:57   ` Alberto Garcia
2019-06-28 13:03     ` Alberto Garcia
