[Qemu-devel] [PATCH v2 00/12] qcow2: Add new overlap check functions

* [Qemu-devel] [PATCH v2 00/12] qcow2: Add new overlap check functions
@ 2014-11-24 15:56 Max Reitz
  2014-11-24 15:56 ` [Qemu-devel] [PATCH v2 01/12] " Max Reitz
                   ` (17 more replies)
  0 siblings, 18 replies; 38+ messages in thread
From: Max Reitz @ 2014-11-24 15:56 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Peter Lieven, Stefan Hajnoczi, Max Reitz

As has been requested, this series adds new overlap check functions to
the qcow2 code. My local branch is called "qcow2-improved-overlap-v1",
but I am not so sure whether it is actually an improvement; that is left
for you to decide, dear reviewers.

See patch 1 for an explanation of why this series exists and what it
does. Patch 1 is basically the core of this series, the rest just
employs the functions introduced there.

In a later patch, we may want to change the meaning of the "constant"
overlap checking option to mean the same as "cached", which is
everything except for inactive L2 tables. This series does make
checking for overlaps with inactive L2 tables at runtime just as cheap
as everything else (constant time plus caching), but using these checks
means qemu has to read all the snapshot L1 tables when opening a qcow2
file. This does not take long, of course, but it does result in a bit of
overhead so I did not want to enable it by default.

I think just enabling all overlap checks by default after this series
should be fine and useful, though.

Benchmarks
==========

First, let me repeat the results I wrote as a reply to the cover letter
of v1:

I had a 1 GB qcow2 image in tmpfs (sorry, I "only" have 4 GB of RAM in
this laptop, and a 2 GB image will kill it) which I accessed from a
guest from inside qemu (Arch Linux, to be exact, because its image just
dumps you into a console and that's all I need). I ran

$ for i in $(seq 1 42); do \
    dd if=/dev/zero of=/dev/vda bs=65536 \
    done

because it does not matter what data is written to the image, I just
need to write to a lot of clusters fast to maximize pressure on the
overlap check.

The results were 13.5/10.5/12.41 % of CPU usage (recorded using perf) in
qcow2_check_metadata_overlap() with the current implementation and
0.08/0.09 % with this series applied (0.08 % with overlap-check=cached,
0.09 % with overlap-check=all).

Now as a table, just to have one here (and because it is useful when
skipping through this text):

Current implementation:     12.14 (± 1.519) %  (n = 3)
With this series applied:    0.08           %  (n = 1)
overlap-check=all:           0.09           %  (n = 1)

Now I did some other benchmarks.

The first of those is just running (on the host, obviously):

$ perf record -ag \
  qemu-img create -f qcow2 -o preallocation=metadata,cluster_size=512 \
  /tmp/test.qcow2 2G

I ran ten rounds with the current implementation, ten rounds with these
patches applied and five rounds with cache_size set to 0 in
qcow2_create_empty_metadata_list() (which results in only one bitmap
existing at one time, and therefore a lot of conversions between the
bitmap and the list representation).

The current implementation had a CPU usage (fraction of cycles) in
qcow2_check_metadata_overlap() of 47.65 (±4.24) %. There was one rather
extreme outlier (36.31 %), with that removed it is 48.91 (±1.54) %.

With this series applied, the usage dropped to 0.149 (± 0.028) %.
Additionally, qcow2_metadata_list_enter() took 0.046 (± 0.021) %. Both
together took thus 0.195 (± 0.040) %.

With cache_size set to 0, the usage was 0.130 % (± 0.012 %) in
qcow2_check_metadata_overlap(); 0.056 % (± 0.011 %) in
qcow2_metadata_list_enter(); and 0.186 % (± 0.021 %). It dropped, but
still in range of standard deviation. Thus, heavy conversion between
bitmap and list conversion (in normal use cases due to random accesses)
should not be a problem at all.

As a table:

Current implementation:     48.91 (± 1.537) %  (n =  9)
With this series applied:   0.195 (± 0.040) %  (n = 10)
Only one bitmap cached:     0.186 (± 0.021) %  (n =  5)

And as a really noticeable consequence: Before this series, creating
the image like that took 16.624 s (15.97 s user). With this series, it
takes 4.502 s (3.94 s user).

Because statistics are fun, I collected the number of metadata list
operations for the second and the third tests:

Second test (default of 15 bitmaps cached):

List to bitmap conversions: 1053
Bitmap to list conversions: 1038
Bitmaps deallocated:        1038

Insert operations:         82266
Remove operations:            14
Queries:                  312265

Third test (only one bitmap cached):

List to bitmap conversions: 135943
Bitmap to list conversions:  66546
Bitmaps deallocated:        135942

Insert operations:           82266
Remove operations:              14
Queries:                    312265

As expected, the number of conversions is much higher. Interestingly, in
the first case, every bitmap deallocation also meant a bitmap to list
conversion; that means, every bitmap was modified at least once while it
was cached. In contrast to that, in the second case, only every second
deallocation resulted in a conversion, which means that half of the
cached bitmaps were only read (which is bad considering all was done was
allocating metadata structures).

But for performance, there is no real information here (we only see that
setting cache_size to 0 actually did increase the number of
conversions).

The second benchmark I ran today was a program which opened /dev/vda in
qemu (again on Arch) with O_DIRECT | O_SYNC | O_RDWR. It then wrote an
uninitialized 512 byte buffer to random places (512-byte-aligned) to a
1 GB image freshly created before VM launch with metadata preallocation
and 512 byte clusters. The intention was to break bitmap caching and
having to convert back and forth between list and bitmap representation.
The results are as follows:

Current implementation:     18.73 (± 0.157) %  (n = 10)
With this series applied:   0.068 (± 0.009) %  (n = 10)
Only one bitmap cached:     0.062 (± 0.008) %  (n =  5)

Runtime conclusion
------------------

As can be seen from the benchmarks, runtime CPU usage by the metadata
overlap checks is greatly decreased by this series (to 1/150 in the
first benchmark, to 1/250 in the second, and to 1/275 in the third).

Where is the caveat?
--------------------

The problem with this series is obviously that all the metadata
information is read when reading a qcow2 image. iotest 044 is a good
test for this. I personally haven't seen a problem here, but I can't
outrule any. I never noticed any waits when opening images.

When approaching this from a theoretical approach, it becomes clear that
there shouldn't be any problems here. If using the default of
overlap-check=cached, only information available anyway is used to built
up the metadata list (the same information that is iterated through for
every overlap check in the current implementation). Since building the
list simply means setting a bitmask in a bitmap and then transforming
that bitmap to a list (which has been shown not pose any performance
issues by the second benchmark), there should not be any problem here.

So, the only caveat I do see is increased code complexity. I initially
thought it might be too complex for its purpose; but after having done
the benchmarks, it became apparent to me that there is a problem with
the current implementation and that this series does fix it.

And RAM usage may pose a problem.

RAM usage
=========

So I have looked at my 2 GB image above, and the list uses 40 kB, which
may or may not be too much (sounds completely fine to me for an image
with 512 byte clusters); but it is a least a number I can use for
testing the following theoretical inspection:

So, once again, we assume the worst. Every metadata structure needs its
own entry in the list - actually, the worst would be that every cluster
needs its own entry, but that only makes a difference for L1 tables and
the refcount table, so we can ignore that. In fact, we can completely
ignore those tables; it makes things much easier.

Let's name the cluster size "CS", and name the image size in bytes "IS".

So, for a refcount width of 16 bits per entry, we can describe CS / 2
clusters per refblock; which means we need (IS / CS) / (CS / 2) =
2 * IS / (CS * CS) refblocks. Let's be generous and say that the L2
tables are capable of describing the complete image size (which in
practice they do not because the guest disk size is smaller than the
physical size). This assumption also has the neat side-effect of not
having to care about snapshots. If we have a lot of internal snapshots
with copied L2 tables, IS grows and therefore the next formula knows
that the number of L2 tables grows as well. Nice. So, as every L2 table
can describe CS / 8 (guest) clusters, we need 8 * IS / (CS * CS) L2
tables.

Therefore, ignoring the image header, L1 tables, the reftable and the
snapshot table, we have about the following amount of metadata clusters
per image (with 16 bit refcount entries):

(2 + 8) * IS / (CS * CS) = 10 * IS / (CS * CS)

We said that for every metadata cluster, we would need one entry in the
list. Every list entry takes 32 bits (4 bytes). Thus, we need
approximately up to

40 * IS / (CS * CS)

bytes for the metadata list (please ignore units and append "bytes" at
the resulting number...).

Let's test that for the above image, which has a disk size of 266 MB:

40 * 266M / (512 * 512) = 42 kB

Great! It works.

So, now let's set CS to 64 kB, because that is basically the only
cluster size we really care about. For a 1 TB image, we need 10 kB for
the list. Sounds great to me. For a 1 PB image, we will need 10 MB. Fair
enough. (Note that you don't need 10 MB of RAM to create a 1 PB image;
you only need that once the disk size of the image has reached 1 PB).

And 1 TB with 512 byte clusters? 160 MB. Urgh, that is a lot. But then
again, you can switch off the overlap check with overlap-check=off; and
trying to actually use a 1 TB image with 512 byte clusters is crazy in
itself (have you tried just creating one without preallocation? It takes
forever). So I can live with that.

tl;dr
=====

* CPU usage at runtime decreased by 150 to 275 percent on
  overlap-check-heavy tasks
* No additional performance problems at loading time (in theory has the
  same runtime complexity as a single overlap check right now; in
  practice I could not find any problems)
* Decent RAM usage (40 kB for a 1 TB image with 64 kB clusters; 40 MB
  for a 1 PB image etc. pp.)

As of this version, this series depends on
"[PATCH v6] qcow2: Buffer L1 table in snapshot refcount update".

v2:
- Patch 1:
  - Fixed a mistake regarding the value of
    Qcow2MetadataFragment::relative_start; this value should be relative
    to the window start, not relative to the end of the last fragment
    (destroy_window_bitmap() wrote the latter there in v1)
  - Use uint64_t for the window index everywhere
- Patch 8: Said "qcow2: Buffer L1 table in snapshot refcount update"
  removes l1_allocated in qcow2_update_snapshot_refcount() and basically
  replaces it by active_l1 (where active_l1 = !l1_allocated). In v1, the
  condition it was used in was actually wrong, this is fixed here, too
  (the active L2 cluster type should be removed from the list if we are
  working on the active L1 table, not if we are working on an inactive
  L1 table).

git-backport-diff against v2:

Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/12:[0010] [FC] 'qcow2: Add new overlap check functions'
002/12:[----] [--] 'qcow2: Pull up overlap check option evaluation'
003/12:[----] [--] 'qcow2: Create metadata list'
004/12:[----] [--] 'qcow2/overlaps: Protect image header'
005/12:[----] [--] 'qcow2/overlaps: Protect refcount table'
006/12:[----] [--] 'qcow2/overlaps: Protect refcount blocks'
007/12:[----] [--] 'qcow2/overlaps: Protect active L1 table'
008/12:[0002] [FC] 'qcow2/overlaps: Protect active L2 tables'
009/12:[----] [--] 'qcow2/overlaps: Protect snapshot table'
010/12:[----] [--] 'qcow2/overlaps: Protect inactive L1 tables'
011/12:[----] [-C] 'qcow2/overlaps: Protect inactive L2 tables'
012/12:[----] [--] 'qcow2: Use new metadata overlap check function'

Max Reitz (12):
  qcow2: Add new overlap check functions
  qcow2: Pull up overlap check option evaluation
  qcow2: Create metadata list
  qcow2/overlaps: Protect image header
  qcow2/overlaps: Protect refcount table
  qcow2/overlaps: Protect refcount blocks
  qcow2/overlaps: Protect active L1 table
  qcow2/overlaps: Protect active L2 tables
  qcow2/overlaps: Protect snapshot table
  qcow2/overlaps: Protect inactive L1 tables
  qcow2/overlaps: Protect inactive L2 tables
  qcow2: Use new metadata overlap check function

 block/Makefile.objs    |   3 +-
 block/qcow2-cluster.c  |  13 ++
 block/qcow2-overlap.c  | 400 +++++++++++++++++++++++++++++++++++++++++++++++++
 block/qcow2-refcount.c | 202 ++++++++++---------------
 block/qcow2-snapshot.c | 105 ++++++++++++-
 block/qcow2.c          | 130 ++++++++++------
 block/qcow2.h          |  13 ++
 7 files changed, 694 insertions(+), 172 deletions(-)
 create mode 100644 block/qcow2-overlap.c

-- 
1.9.3

^ permalink raw reply	[flat|nested] 38+ messages in thread