* [Qemu-devel] RFC: Reducing the size of entries in the qcow2 L2 cache
@ 2017-09-19 15:07 Alberto Garcia
  2017-09-19 15:18 ` Denis V. Lunev
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Alberto Garcia @ 2017-09-19 15:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, Kevin Wolf, Max Reitz, Denis V. Lunev

Hi everyone,

over the past few weeks I have been testing the effects of reducing
the size of the entries in the qcow2 L2 cache. This was briefly
mentioned by Denis in the same thread where we discussed subcluster
allocation back in April, but I'll describe here the problem and the
proposal in detail.

=== Problem ===

In the qcow2 file format guest addresses are mapped to host addresses
using the so-called L1 and L2 tables. The size of an L2 table is the
same as the cluster size, therefore a larger cluster means more L2
entries in a table, and because of that an L2 table can map a larger
portion of the address space (not only because it contains more
entries, but also because the data cluster that each one of those
entries points at is larger).

There are two consequences of this:

   1) If you double the cluster size of a qcow2 image then the maximum
      space needed for all L2 tables is divided by two (i.e. you need
      half the metadata).

   2) If you double the cluster size of a qcow2 image then each one of
      the L2 tables will map four times as much disk space.

With the default cluster size of 64KB, each L2 table maps 512MB of
contiguous disk space. This table shows what happens when you change
the cluster size:

     |--------------+------------------|
     | Cluster size | An L2 table maps |
     |--------------+------------------|
     |       512  B |            32 KB |
     |         1 KB |           128 KB |
     |         2 KB |           512 KB |
     |         4 KB |             2 MB |
     |         8 KB |             8 MB |
     |        16 KB |            32 MB |
     |        32 KB |           128 MB |
     |        64 KB |           512 MB |
     |       128 KB |             2 GB |
     |       256 KB |             8 GB |
     |       512 KB |            32 GB |
     |         1 MB |           128 GB |
     |         2 MB |           512 GB |
     |--------------+------------------|
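
For reference, the arithmetic behind the table is simply "L2 entries
per table = cluster size / 8 bytes" and "disk mapped = entries x
cluster size". A minimal sketch that reproduces the numbers
(illustration only, not QEMU code):

    /* Illustration only (not QEMU code): reproduce the table above,
     * assuming the standard 8-byte qcow2 L2 entries. */
    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t cluster_size;

        for (cluster_size = 512; cluster_size <= 2 << 20; cluster_size *= 2) {
            uint64_t entries = cluster_size / 8;      /* L2 entries per table */
            uint64_t mapped = entries * cluster_size; /* bytes mapped by one table */

            printf("cluster size %7" PRIu64 " B -> one L2 table maps %10" PRIu64
                   " KB\n", cluster_size, mapped / 1024);
        }
        return 0;
    }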

When QEMU wants to convert a guest address into a host address, it
needs to read the entry from the corresponding L2 table. The qcow2
driver doesn't read those entries directly from disk: it loads the
tables into the L2 cache so they can be kept in memory in case they
are needed again later.

The problem here is that the L2 cache (and the qcow2 driver in
general) always works with complete L2 tables: if QEMU needs a
particular L2 entry then the whole cluster containing the L2 table is
read from disk, and if the cache is full then a cluster worth of
cached data has to be discarded.

The consequences of this are worse the larger the cluster size is, not
only because we're reading (and discarding) larger amounts of data,
but also because we're using that memory in a very inefficient way.

Example: with 1MB clusters each L2 table maps 128GB of contiguous
virtual disk, so that's the granularity of our cache. If we're
performing I/O in a 4GB area that overlaps two of those 128GB chunks,
we need to have in the cache two complete L2 tables (2MB) even when in
practice we're only using 32KB of those 2MB (32KB contain enough L2
entries to map the 4GB that we're using).

=== The proposal ===

One way to solve the problems described above is to decouple the L2
table size (which is equal to the cluster size) from the cache entry
size.

The qcow2 cache doesn't actually know anything about the data that
it's loading, it just receives a disk offset and checks that it is
properly aligned. It's perfectly possible to make it load data blocks
smaller than a cluster.

I already have a working prototype, and I have been doing tests using
a 4KB cache entry size. 4KB is small enough to allow a more flexible
use of the cache, it's also a common file system block size, and it
can hold enough L2 entries to cover substantial amounts of disk space
(especially with large clusters).

     |--------------+-----------------------|
     | Cluster size | 4KB of L2 entries map |
     |--------------+-----------------------|
     | 64 KB        | 32 MB                 |
     | 128 KB       | 64 MB                 |
     | 256 KB       | 128 MB                |
     | 512 KB       | 256 MB                |
     | 1 MB         | 512 MB                |
     | 2 MB         | 1 GB                  |
     |--------------+-----------------------|
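
Again as an illustration only (the 2 TB disk below is just an example):
the coverage of one 4KB cache entry is 512 entries x cluster size, and
the L2 cache needed to cover a whole disk is 8 bytes x disk size /
cluster size, which is also where the figures in brackets in the
results table below come from:

    /* Illustration only: disk space covered by one 4 KB cache entry, and
     * the L2 cache needed to cover an entire (here: 2 TB) disk. */
    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        const uint64_t entry_size = 4096;        /* proposed cache entry size */
        const uint64_t disk_size = 2ULL << 40;   /* example disk: 2 TB */
        uint64_t cluster_size;

        for (cluster_size = 64 << 10; cluster_size <= 2 << 20; cluster_size *= 2) {
            uint64_t per_entry = (entry_size / 8) * cluster_size;
            uint64_t full_cache = 8 * disk_size / cluster_size;

            printf("cluster %4" PRIu64 " KB: one 4 KB entry maps %4" PRIu64
                   " MB, full coverage needs %6" PRIu64 " KB of cache\n",
                   cluster_size >> 10, per_entry >> 20, full_cache >> 10);
        }
        return 0;
    }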

Some results from my tests (using an SSD drive and random 4K reads):

|-----------+--------------+-------------+---------------+--------------|
| Disk size | Cluster size | L2 cache    | Standard QEMU | Patched QEMU |
|-----------+--------------+-------------+---------------+--------------|
| 16 GB     | 64 KB        | 1 MB [8 GB] | 5000 IOPS     | 12700 IOPS   |
|  2 TB     |  2 MB        | 4 MB [1 TB] |  576 IOPS     | 11000 IOPS   |
|-----------+--------------+-------------+---------------+--------------|

The improvements are clearly visible, but it's important to point out
a couple of things:

   - L2 cache size is always < total L2 metadata on disk (otherwise
     this wouldn't make sense). Increasing the L2 cache size improves
     performance a lot (and makes the effect of these patches
     disappear), but it requires more RAM.
   - Doing random reads over the whole disk is probably not a very
     realistic scenario. During normal usage only certain areas of the
     disk need to be accessed, so performance should be much better
     with the same amount of cache.
   - I wrote a best-case scenario test (several I/O jobs each accessing
     a part of the disk that requires loading its own L2 table) and my
     patched version is 20x faster even with 64KB clusters.

=== What needs to change? ===

Not that much, fortunately:

   - The on-disk format does not need any change, qcow2 images remain
     the same.
   - The qcow2 cache driver needs almost no changes: the entry size is
     no longer assumed to be equal to the cluster size and has to be
     set explicitly. Other than that it remains the same (it can even
     be simplified).

The QEMU qcow2 driver does need a few changes:

   - qcow2_get_cluster_offset() and get_cluster_table() simply need to
     be aware that they're not loading full L2 tables anymore (see the
     sketch after this list).
   - handle_copied(), handle_alloc(), discard_single_l2() and
     zero_single_l2() only need to update the calculation of
     nb_clusters.
   - l2_allocate() and qcow2_update_snapshot_refcount() cannot load a
     full L2 table in memory at once, they need to loop over the
     sub-tables. Other than wrapping the core of those functions
     inside a loop I haven't detected any major problem.
   - We need a proper name for these sub-tables that we are loading
     now. I'm actually still struggling with this :-) I can't think of
     any name that is clear enough and not too cumbersome to use (L2
     subtables? => Confusing. L3 tables? => they're not really that).
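
To make the first item a bit more concrete, this is roughly how the
lookup arithmetic changes when the cache holds fixed-size pieces of an
L2 table instead of whole tables. All the names here are made up for
the example, they are not from the actual prototype:

    /* Illustration only, invented names: given the guest offset and the
     * location of the full L2 table in the image file, find which
     * sub-table-sized piece the cache must load and where the L2 entry
     * sits inside it. */
    #include <stdint.h>

    #define L2_ENTRY_SIZE 8

    static void locate_l2_entry(uint64_t cluster_size, uint64_t slice_size,
                                uint64_t l2_table_offset, uint64_t guest_offset,
                                uint64_t *slice_offset, uint64_t *index_in_slice)
    {
        uint64_t entries_per_table = cluster_size / L2_ENTRY_SIZE;
        uint64_t entries_per_slice = slice_size / L2_ENTRY_SIZE;

        /* Index of the entry within the full L2 table, as today */
        uint64_t l2_index = (guest_offset / cluster_size) % entries_per_table;

        /* Which piece of the table contains it, and where inside that piece */
        *slice_offset = l2_table_offset +
                        (l2_index / entries_per_slice) * slice_size;
        *index_in_slice = l2_index % entries_per_slice;
    }

The cache would then be asked for *slice_offset (which is aligned to
the cache entry size rather than to the cluster size) and the entry
would be read at *index_in_slice inside the cached block.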

=== What about the refcount cache? ===

This proposal would not touch it at all: the assumption would remain
that "refcount cache entry size = cluster size".

===========================

I think I haven't forgotten anything. As I said I have a working
prototype of this and if you like the idea I'd like to publish it
soon. Any questions or comments will be appreciated.

Thanks!

Berto


* Re: [Qemu-devel] RFC: Reducing the size of entries in the qcow2 L2 cache
  2017-09-19 15:07 [Qemu-devel] RFC: Reducing the size of entries in the qcow2 L2 cache Alberto Garcia
@ 2017-09-19 15:18 ` Denis V. Lunev
  2017-09-20  7:06 ` Kevin Wolf
  2017-09-25 20:15 ` [Qemu-devel] [Qemu-block] " John Snow
  2 siblings, 0 replies; 6+ messages in thread
From: Denis V. Lunev @ 2017-09-19 15:18 UTC (permalink / raw)
  To: Alberto Garcia, qemu-devel; +Cc: qemu-block, Kevin Wolf, Max Reitz

On 09/19/2017 06:07 PM, Alberto Garcia wrote:
> Hi everyone,
>
> over the past few weeks I have been testing the effects of reducing
> the size of the entries in the qcow2 L2 cache. This was briefly
> mentioned by Denis in the same thread where we discussed subcluster
> allocation back in April, but I'll describe here the problem and the
> proposal in detail.
>
> [...]

Very good analysis :)

A couple of words about how realistic the scenario is: there is a very
common test among our customers where they create a 1, 2, 4 or 16 TB
disk and start random I/O over that disk. They are trying to emulate
database access on an all-flash drive.

IMHO this scenario is important and I like this very much.

Thank you in advance,
    Den


* Re: [Qemu-devel] RFC: Reducing the size of entries in the qcow2 L2 cache
  2017-09-19 15:07 [Qemu-devel] RFC: Reducing the size of entries in the qcow2 L2 cache Alberto Garcia
  2017-09-19 15:18 ` Denis V. Lunev
@ 2017-09-20  7:06 ` Kevin Wolf
  2017-09-20 13:10   ` Alberto Garcia
  2017-09-25 20:15 ` [Qemu-devel] [Qemu-block] " John Snow
  2 siblings, 1 reply; 6+ messages in thread
From: Kevin Wolf @ 2017-09-20  7:06 UTC (permalink / raw)
  To: Alberto Garcia; +Cc: qemu-devel, qemu-block, Max Reitz, Denis V. Lunev

Am 19.09.2017 um 17:07 hat Alberto Garcia geschrieben:
> Hi everyone,
> 
> over the past few weeks I have been testing the effects of reducing
> the size of the entries in the qcow2 L2 cache. This was briefly
> mentioned by Denis in the same thread where we discussed subcluster
> allocation back in April, but I'll describe here the problem and the
> proposal in detail.
> [...]

Thanks for working on this, Berto! I think this is essential for large
cluster sizes and have been meaning to make a change like this for a
long time, but I never found the time for it.

> Some results from my tests (using an SSD drive and random 4K reads):
> 
> |-----------+--------------+-------------+---------------+--------------|
> | Disk size | Cluster size | L2 cache    | Standard QEMU | Patched QEMU |
> |-----------+--------------+-------------+---------------+--------------|
> | 16 GB     | 64 KB        | 1 MB [8 GB] | 5000 IOPS     | 12700 IOPS   |
> |  2 TB     |  2 MB        | 4 MB [1 TB] |  576 IOPS     | 11000 IOPS   |
> |-----------+--------------+-------------+---------------+--------------|
> 
> The improvements are clearly visible, but it's important to point out
> a couple of things:
> 
>    - L2 cache size is always < total L2 metadata on disk (otherwise
>      this wouldn't make sense). Increasing the L2 cache size improves
>      performance a lot (and makes the effect of these patches
>      disappear), but it requires more RAM.

Do you have the numbers for the two cases above if the cached L2
tables covered the whole image?

>    - Doing random reads over the whole disk is probably not a very
>      realistic scenario. During normal usage only certain areas of the
>      disk need to be accessed, so performance should be much better
>      with the same amount of cache.
>    - I wrote a best-case scenario test (several I/O jobs each accessing
>      a part of the disk that requires loading its own L2 table) and my
>      patched version is 20x faster even with 64KB clusters.

I suppose you chose the scenario so that the number of jobs is larger
than the number of cached L2 tables without the patch, but smaller
than the number of cache entries with the patch?

We will probably need to do some more benchmarking to find a good
default value for the cached chunks. 4k is nice and small, so we can
cover many parallel jobs without using too much memory. But if we have a
single sequential job, we may end up doing the metadata updates in
small 4k chunks instead of doing a single larger write.

Of course, if this starts becoming a problem (maybe unlikely?), we can
always change the cache code to gather any adjacent dirty chunks in the
cache when writing out something. Same thing for readahead, if we can
find a policy for when to evict old entries.
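
As a purely illustrative sketch of that gathering idea (invented names
and data structures, nothing from the actual qcow2 cache code): when
flushing one dirty chunk, we could look for dirty chunks that are
contiguous with it in the image file and write the whole run at once:

    /* Illustration only: invented structures, not the qcow2 cache code. */
    #define _GNU_SOURCE
    #include <stdbool.h>
    #include <stdint.h>
    #include <sys/uio.h>

    #define MAX_RUN 64

    typedef struct Chunk {
        uint64_t offset;   /* offset of this cached chunk in the image file */
        bool dirty;
        void *data;
    } Chunk;

    static Chunk *find_dirty(Chunk *c, int n, uint64_t offset)
    {
        for (int i = 0; i < n; i++) {
            if (c[i].dirty && c[i].offset == offset) {
                return &c[i];
            }
        }
        return NULL;
    }

    /* Flush 'start' together with any dirty chunks contiguous with it on
     * disk, as one single write instead of several chunk-sized ones. */
    static int flush_run(int fd, Chunk *chunks, int n, Chunk *start,
                         uint64_t chunk_size)
    {
        uint64_t first = start->offset;
        struct iovec iov[MAX_RUN];
        Chunk *c;
        int steps = 0, cnt = 0;

        /* Extend the range downwards while the preceding chunk is dirty too */
        while (steps < MAX_RUN - 1 && first >= chunk_size &&
               (c = find_dirty(chunks, n, first - chunk_size))) {
            first = c->offset;
            steps++;
        }

        /* Collect the contiguous dirty run upwards, starting at 'first' */
        for (uint64_t off = first;
             cnt < MAX_RUN && (c = find_dirty(chunks, n, off));
             off += chunk_size) {
            iov[cnt].iov_base = c->data;
            iov[cnt].iov_len = chunk_size;
            c->dirty = false;
            cnt++;
        }

        /* One write for the whole run instead of 'cnt' small ones */
        return pwritev(fd, iov, cnt, first) == (ssize_t)(cnt * chunk_size) ? 0 : -1;
    }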

>    - We need a proper name for these sub-tables that we are loading
>      now. I'm actually still struggling with this :-) I can't think of
>      any name that is clear enough and not too cumbersome to use (L2
>      subtables? => Confusing. L3 tables? => they're not really that).

L2 table chunk? Or just L2 cache entry?

> I think I haven't forgotten anything. As I said I have a working
> prototype of this and if you like the idea I'd like to publish it
> soon. Any questions or comments will be appreciated.

Please do post it!

Kevin


* Re: [Qemu-devel] RFC: Reducing the size of entries in the qcow2 L2 cache
  2017-09-20  7:06 ` Kevin Wolf
@ 2017-09-20 13:10   ` Alberto Garcia
  0 siblings, 0 replies; 6+ messages in thread
From: Alberto Garcia @ 2017-09-20 13:10 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, qemu-block, Max Reitz, Denis V. Lunev

On Wed 20 Sep 2017 09:06:20 AM CEST, Kevin Wolf wrote:
>> |-----------+--------------+-------------+---------------+--------------|
>> | Disk size | Cluster size | L2 cache    | Standard QEMU | Patched QEMU |
>> |-----------+--------------+-------------+---------------+--------------|
>> | 16 GB     | 64 KB        | 1 MB [8 GB] | 5000 IOPS     | 12700 IOPS   |
>> |  2 TB     |  2 MB        | 4 MB [1 TB] |  576 IOPS     | 11000 IOPS   |
>> |-----------+--------------+-------------+---------------+--------------|
>> 
>> The improvements are clearly visible, but it's important to point out
>> a couple of things:
>> 
>>    - L2 cache size is always < total L2 metadata on disk (otherwise
>>      this wouldn't make sense). Increasing the L2 cache size improves
>>      performance a lot (and makes the effect of these patches
>>      disappear), but it requires more RAM.
>
> Do you have the numbers for the two cases above if the cached L2
> tables covered the whole image?

Yeah, sorry, it's around 60000 IOPS in both cases (more or less what I
also get with a raw image).

>>    - Doing random reads over the whole disk is probably not a very
>>      realistic scenario. During normal usage only certain areas of the
>>      disk need to be accessed, so performance should be much better
>>      with the same amount of cache.
>>    - I wrote a best-case scenario test (several I/O jobs each accessing
>>      a part of the disk that requires loading its own L2 table) and my
>>      patched version is 20x faster even with 64KB clusters.
>
> I suppose you chose the scenario so that the number of jobs is larger
> than the number of cached L2 tables without the patch, but smaller
> than the number of cache entries with the patch?

Exactly, I should have made that explicit :) I had 32 jobs, each one of
them limited to a small area (32MB), so with 4K pages you only need
128KB of cache memory (vs 2MB with the current code).

> We will probably need to do some more benchmarking to find a good
> default value for the cached chunks. 4k is nice and small, so we can
> cover many parallel jobs without using too much memory. But if we have
> a single sequential job, we may end up doing the metadata updates in
> small 4k chunks instead of doing a single larger write.

Right, although a 4K table can already hold pointers to 512 data
clusters, so even if you do sequential I/O you don't need to update the
metadata so often, do you?

I guess the default value should probably depend on the cluster size.

>>    - We need a proper name for these sub-tables that we are loading
>>      now. I'm actually still struggling with this :-) I can't think of
>>      any name that is clear enough and not too cumbersome to use (L2
>>      subtables? => Confusing. L3 tables? => they're not really that).
>
> L2 table chunk? Or just L2 cache entry?

Yeah, something like that, but let's see how variables end up being
named :)

Berto


* Re: [Qemu-devel] [Qemu-block] RFC: Reducing the size of entries in the qcow2 L2 cache
  2017-09-19 15:07 [Qemu-devel] RFC: Reducing the size of entries in the qcow2 L2 cache Alberto Garcia
  2017-09-19 15:18 ` Denis V. Lunev
  2017-09-20  7:06 ` Kevin Wolf
@ 2017-09-25 20:15 ` John Snow
  2017-09-25 20:21   ` Alberto Garcia
  2 siblings, 1 reply; 6+ messages in thread
From: John Snow @ 2017-09-25 20:15 UTC (permalink / raw)
  To: Alberto Garcia, qemu-devel
  Cc: Kevin Wolf, Denis V. Lunev, qemu-block, Max Reitz



On 09/19/2017 11:07 AM, Alberto Garcia wrote:
> Hi everyone,

>    - We need a proper name for these sub-tables that we are loading
>      now. I'm actually still struggling with this :-) I can't think of
>      any name that is clear enough and not too cumbersome to use (L2
>      subtables? => Confusing. L3 tables? => they're not really that).
> 

L2 <page | cluster | table> <segment | slice>.

"slice" might be the nicest as it's in common usage elsewhere in
software engineering and properly intuits that it's a division / portion
of a whole.

"l2 slice" ?

--js


* Re: [Qemu-devel] [Qemu-block] RFC: Reducing the size of entries in the qcow2 L2 cache
  2017-09-25 20:15 ` [Qemu-devel] [Qemu-block] " John Snow
@ 2017-09-25 20:21   ` Alberto Garcia
  0 siblings, 0 replies; 6+ messages in thread
From: Alberto Garcia @ 2017-09-25 20:21 UTC (permalink / raw)
  To: John Snow, qemu-devel; +Cc: Kevin Wolf, Denis V. Lunev, qemu-block, Max Reitz

On Mon 25 Sep 2017 10:15:49 PM CEST, John Snow <jsnow@redhat.com> wrote:

>>    - We need a proper name for these sub-tables that we are loading
>>      now. I'm actually still struggling with this :-) I can't think of
>>      any name that is clear enough and not too cumbersome to use (L2
>>      subtables? => Confusing. L3 tables? => they're not really that).
>
> L2 <page | cluster | table> <segment | slice>.
>
> "slice" might be the nicest as it's in common usage elsewhere in
> software engineering and properly intuits that it's a division /
> portion of a whole.
>
> "l2 slice" ?

Doesn't sound bad! Thanks for the suggestion.

Berto

