* [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation @ 2017-04-06 15:01 Alberto Garcia 2017-04-06 16:40 ` Eric Blake ` (4 more replies) 0 siblings, 5 replies; 64+ messages in thread From: Alberto Garcia @ 2017-04-06 15:01 UTC (permalink / raw) To: qemu-devel; +Cc: qemu-block, Stefan Hajnoczi, Max Reitz, Kevin Wolf Hi all, over the past couple of months I discussed with some of you the possibility to extend the qcow2 format in order to improve its performance and reduce its memory requirements (particularly with very large images). After some discussion in the mailing list and the #qemu IRC channel I decided to write a prototype of a new extension for qcow2 so I could understand better the scope of the changes and have some preliminary data about its effects. This e-mail is the formal presentation of my proposal to extend the on-disk qcow2 format. As you can see this is still an RFC. Due to the nature of the changes I would like to get as much feedback as possible before going forward. === Problem === The original problem that I wanted to address is the memory requirements of qcow2 files if you want to have very large images and still keep good I/O performance. This is a consequence of its very structure, which I'm going to describe now. A qcow2 image is divided into units of constant size called clusters, and among other things it contains metadata that maps guest addresses to host addresses (the so-called L1 and L2 tables). There are two basic problems that result from this: 1) Reading from or writing to a qcow2 image involves reading the corresponding entry on the L2 table that maps the guest address to the host address. This is very slow because it involves two I/O operations: one on the L2 table and the other one on the actual data cluster. 2) A cluster is the smallest unit of allocation. 
Therefore writing a mere 512 bytes to an empty disk requires allocating a complete cluster and filling it with zeroes (or with data from the backing image if there is one). This wastes more disk space and also has a negative impact on I/O. Problem (1) can be solved by keeping in memory a cache of the L2 tables (QEMU has an "l2_cache_size" parameter for this purpose). The amount of disk space used by L2 tables depends on two factors: the disk size and the cluster size. The cluster size can be configured when the image is created, and it can be any power of two between 512 bytes and 2 MB (it defaults to 64 KB). The maximum amount of space needed for the L2 tables can be calculated with the following formula: max_l2_size = virtual_disk_size * 8 / cluster_size Large images require a large amount of metadata, and therefore a large amount of memory for the L2 cache. With the default cluster size (64KB) that's 128MB of L2 cache for a 1TB qcow2 image. The only way to reduce the size of the L2 tables is therefore increasing the cluster size, but this makes the image less efficient as described earlier in (2). === The proposal === The idea of this proposal is to extend the qcow2 format by allowing subcluster allocation. There would be an optional feature that would need to be enabled when creating the image. The on-disk format would remain essentially the same, except that each data cluster would be internally divided into a number of subclusters of equal size. What this means in practice is that each entry on an L2 table would be accompanied by a bitmap indicating the allocation state of each one of the subclusters for that cluster. There are several alternatives for storing the bitmap, described below. Other than L2 entries, all other data structures would remain unchanged, but for data clusters the smallest unit of allocation would now be the subcluster. Reference counting would still be at the cluster level, because there's no way to reference individual subclusters. 
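The L2 metadata formula above is easy to sanity-check with a few lines of Python (purely illustrative, not code from the prototype):

```python
def max_l2_size(virtual_disk_size, cluster_size):
    # Each L2 entry is 8 bytes and maps one cluster of guest data,
    # hence max_l2_size = virtual_disk_size * 8 / cluster_size.
    return virtual_disk_size * 8 // cluster_size

TB, MB = 1024 ** 4, 1024 ** 2

# Default 64 KB clusters on a 1 TB image: 128 MB of L2 tables,
# matching the figure quoted above.
print(max_l2_size(1 * TB, 64 * 1024) // MB)  # → 128
```

Going to the maximum 2 MB cluster size shrinks that to 4 MB of L2 tables for the same 1 TB image, which is exactly the trade-off this proposal is about.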
Copy-on-write on internal snapshots would need to copy complete clusters, so that scenario would not benefit from this change. I see two main use cases for this feature: a) The qcow2 image is not too large / the L2 cache is not a problem, but you want to increase the allocation performance. In this case you can have something like a 128KB cluster with 4KB subclusters (with 4KB being a common block size in ext4 and other filesystems) b) The qcow2 image is very large and you want to save metadata space in order to have a smaller L2 cache. In this case you can go for the maximum cluster size (2MB) but you want to have smaller subclusters to increase the allocation performance and optimize the disk usage. This was actually my original use case. === Test results === I have a basic working prototype of this. It's still incomplete -and buggy :)- but it gives an idea of what we can expect from it. In my implementation each data cluster has 8 subclusters, but that's not set in stone (see below). I made all tests on an SSD drive, writing to an empty qcow2 image with a fully populated 40GB backing image, performing random writes using fio with a block size of 4KB. I tried with the default and maximum cluster sizes (64KB and 2MB) and also with some other sizes. I also made sure to try with 32KB clusters so the subcluster size matches the 4KB block size used for the I/O. It's important to point out that once a cluster has been completely allocated then having subclusters offers no performance benefit. For this reason the size of the image for these tests (40GB) was chosen to be large enough to guarantee that there are always new clusters being allocated. This is therefore a worst-case scenario (or best-case for this feature, if you want). 
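A simplified model of the per-write copy-on-write cost helps explain the numbers below (my own back-of-the-envelope sketch, not code from the prototype):

```python
def cow_bytes_per_write(allocation_unit, io_size):
    # When a write allocates a new unit, the rest of that unit has to
    # be filled with data copied from the backing image (COW).
    return max(allocation_unit - io_size, 0)

KB = 1024

# 64 KB clusters, no subclusters: each 4 KB random write copies 60 KB.
print(cow_bytes_per_write(64 * KB, 4 * KB) // KB)  # → 60

# 32 KB clusters with 4 KB subclusters: the allocation unit matches
# the I/O block size, so the COW overhead disappears entirely.
print(cow_bytes_per_write(4 * KB, 4 * KB) // KB)   # → 0
```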
Here are the results (subcluster size in brackets):

|-----------------+----------------+-----------------+-------------------|
| cluster size    | subclusters=on | subclusters=off | Max L2 cache size |
|-----------------+----------------+-----------------+-------------------|
| 2 MB (256 KB)   |   440 IOPS     |   100 IOPS      | 160 KB (*)        |
| 512 KB (64 KB)  |  1000 IOPS     |   300 IOPS      | 640 KB            |
| 64 KB (8 KB)    |  3000 IOPS     |  1000 IOPS      |   5 MB            |
| 32 KB (4 KB)    | 12000 IOPS     |  1300 IOPS      |  10 MB            |
| 4 KB (512 B)    |   100 IOPS     |   100 IOPS      |  80 MB            |
|-----------------+----------------+-----------------+-------------------|

(*) The L2 cache must be a multiple of the cluster size, so in this case it must be 2MB. In the table I chose to show how much of those 2MB are actually used so you can compare it with the other cases.

Some comments about the results:

- For the 64KB, 512KB and 2MB cases, having subclusters increases write performance roughly threefold. This happens because for each cluster allocation there's less data to copy from the backing image. For the same reason, the smaller the cluster, the better the performance. As expected, 64KB clusters with no subclusters perform roughly the same as 512KB clusters with 64KB subclusters.

- The 32KB case is the most interesting one. Without subclusters it's not very different from the 64KB case, but having a subcluster with the same size as the I/O block eliminates the need for COW entirely and the performance skyrockets (10 times faster!).

- 4KB is however very slow. I attribute this to the fact that the cluster size is so small that a new cluster needs to be allocated for every single write and its refcount updated accordingly. The L2 and refcount tables are also so small that they are too inefficient and need to grow all the time.

Here are the results when writing to an empty 40GB qcow2 image with no backing file.
The numbers are of course different but as you can see the patterns are similar:

|-----------------+----------------+-----------------+-------------------|
| cluster size    | subclusters=on | subclusters=off | Max L2 cache size |
|-----------------+----------------+-----------------+-------------------|
| 2 MB (256 KB)   |  1200 IOPS     |   255 IOPS      | 160 KB            |
| 512 KB (64 KB)  |  3000 IOPS     |   700 IOPS      | 640 KB            |
| 64 KB (8 KB)    |  7200 IOPS     |  3300 IOPS      |   5 MB            |
| 32 KB (4 KB)    | 12300 IOPS     |  4200 IOPS      |  10 MB            |
| 4 KB (512 B)    |   100 IOPS     |   100 IOPS      |  80 MB            |
|-----------------+----------------+-----------------+-------------------|

=== Changes to the on-disk format ===

The qcow2 on-disk format needs to change so each L2 entry has a bitmap indicating the allocation status of each subcluster. There are three possible states (unallocated, allocated, all zeroes), so we need two bits per subcluster.

An L2 entry is 64 bits wide, and this is the current format (for uncompressed clusters):

   63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
   00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
   **<----> <--------------------------------------------------><------->*
    Rsrved           host cluster offset of data                 Reserved
   (6 bits)                   (47 bits)                          (8 bits)

   bit 63: refcount == 1   (QCOW_OFLAG_COPIED)
   bit 62: compressed = 1  (QCOW_OFLAG_COMPRESSED)
   bit  0: all zeros       (QCOW_OFLAG_ZERO)

I thought of three alternatives for storing the subcluster bitmaps. I haven't completely made up my mind about which one is best, so I'd like to present all three for discussion. Here they are:

(1) Storing the bitmap inside the 64-bit entry

    This is a simple alternative and is the one that I chose for my
    prototype. There are 14 unused bits plus the "all zeroes" one. If
    we steal one from the host offset we have the 16 bits that we need
    for the bitmap and we have 46 bits left for the host offset, which
    is more than enough.

    * Pros:
      + Simple. Few changes compared to the current qcow2 format.
    * Cons:
      - Only 8 subclusters per cluster. We would not be making the
        most of this feature.

      - No reserved bits left for the future.

(2) Making L2 entries 128-bit wide.

    In this alternative we would double the size of L2 entries. The
    first half would remain unchanged and the second one would store
    the bitmap. That would leave us with 32 subclusters per cluster.

    * Pros:
      + More subclusters per cluster. We could have images with
        e.g. 128k clusters with 4k subclusters.

    * Cons:
      - More space needed for L2 entries. The same cluster size would
        require a cache twice as large, although having subcluster
        allocation would compensate for this.

      - More changes to the code to handle 128-bit entries.

      - We would still be wasting the 14 reserved bits that L2 entries
        have.

(3) Storing the bitmap somewhere else

    This would involve storing the bitmap separate from the L2 tables
    (perhaps using the bitmaps extension? I haven't looked much into
    this).

    * Pros:
      + Possibility to make the number of subclusters configurable
        by the user (32, 64, 128, ...)
      + All existing metadata structures would remain untouched
        (although the "all zeroes" bit in L2 entries would probably
        become unused).

    * Cons:
      - As with alternative (2), more space needed for metadata.

      - The bitmap would also need to be cached for performance
        reasons.

      - Possibly one more *_cache_size option.

      - One more metadata structure to be updated for each
        allocation. This would probably impact I/O negatively.

=== Compressed clusters ===

My idea is that compressed clusters would remain the same. They are read-only anyway so they would not be affected by any of these changes.

===========================

I think I managed to summarize all the ideas that I had in my mind, but I'm sure you probably have questions and comments, so I'd be happy to get as much feedback as possible.

So, looking forward to reading your opinions.

Regards,

Berto

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-06 15:01 [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation Alberto Garcia @ 2017-04-06 16:40 ` Eric Blake 2017-04-07 8:49 ` Alberto Garcia ` (2 more replies) 2017-04-07 12:20 ` [Qemu-devel] " Stefan Hajnoczi ` (3 subsequent siblings) 4 siblings, 3 replies; 64+ messages in thread From: Eric Blake @ 2017-04-06 16:40 UTC (permalink / raw) To: Alberto Garcia, qemu-devel Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz [-- Attachment #1: Type: text/plain, Size: 9872 bytes --] On 04/06/2017 10:01 AM, Alberto Garcia wrote: > Hi all, > > over the past couple of months I discussed with some of you the > possibility to extend the qcow2 format in order to improve its > performance and reduce its memory requirements (particularly with very > large images). > > After some discussion in the mailing list and the #qemu IRC channel I > decided to write a prototype of a new extension for qcow2 so I could > understand better the scope of the changes and have some preliminary > data about its effects. > > This e-mail is the formal presentation of my proposal to extend the > on-disk qcow2 format. As you can see this is still an RFC. Due to the > nature of the changes I would like to get as much feedback as possible > before going forward. The idea in general makes sense; I can even remember chatting with Kevin about similar ideas as far back as 2015, where the biggest drawback is that it is an incompatible image change, and therefore images created with the flag cannot be read by older tools. > === Test results === > > I have a basic working prototype of this. It's still incomplete -and > buggy :)- but it gives an idea of what we can expect from it. In my > implementation each data cluster has 8 subclusters, but that's not set > in stone (see below). 
> > I made all tests on an SSD drive, writing to an empty qcow2 image with > a fully populated 40GB backing image, performing random writes using > fio with a block size of 4KB. > > I tried with the default and maximum cluster sizes (64KB and 2MB) and > also with some other sizes. I also made sure to try with 32KB clusters > so the subcluster size matches the 4KB block size used for the I/O. > > It's important to point out that once a cluster has been completely > allocated then having subclusters offers no performance benefit. For > this reason the size of the image for these tests (40GB) was chosen to > be large enough to guarantee that there are always new clusters being > allocated. This is therefore a worst-case scenario (or best-case for > this feature, if you want). > > Here are the results (subcluster size in brackets): > > |-----------------+----------------+-----------------+-------------------| > | cluster size | subclusters=on | subclusters=off | Max L2 cache size | > |-----------------+----------------+-----------------+-------------------| > | 2 MB (256 KB) | 440 IOPS | 100 IOPS | 160 KB (*) | > | 512 KB (64 KB) | 1000 IOPS | 300 IOPS | 640 KB | > | 64 KB (8 KB) | 3000 IOPS | 1000 IOPS | 5 MB | > | 32 KB (4 KB) | 12000 IOPS | 1300 IOPS | 10 MB | > | 4 KB (512 B) | 100 IOPS | 100 IOPS | 80 MB | > |-----------------+----------------+-----------------+-------------------| Those are some cool results! > === Changes to the on-disk format === > > The qcow2 on-disk format needs to change so each L2 entry has a bitmap > indicating the allocation status of each subcluster. There are three > possible states (unallocated, allocated, all zeroes), so we need two > bits per subcluster. You also have to add a new incompatible feature bit, so that older tools know they can't read the new image correctly, and therefore don't accidentally corrupt it. 
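For illustration, the incompatible-feature check works at the header level roughly like this (a sketch based on the published qcow2 v3 header layout; the bit value chosen for the subcluster feature is purely hypothetical):

```python
import struct

QCOW2_MAGIC = b'QFI\xfb'

def incompatible_features(header: bytes) -> int:
    # qcow2 header fields are big-endian: magic at offset 0, version at
    # offset 4, and (for v3) the incompatible_features bitmap at offset 72.
    magic, version = struct.unpack_from('>4sI', header, 0)
    if magic != QCOW2_MAGIC:
        raise ValueError('not a qcow2 image')
    if version < 3:
        return 0  # v2 images have no feature bitmaps
    return struct.unpack_from('>Q', header, 72)[0]

# Hypothetical bit for the proposed extension; an implementation that
# does not know this bit must refuse to open the image if it is set.
QCOW2_INCOMPAT_SUBCLUSTER = 1 << 2

hdr = QCOW2_MAGIC + struct.pack('>I', 3) + bytes(64) + \
      struct.pack('>Q', QCOW2_INCOMPAT_SUBCLUSTER)
print(bool(incompatible_features(hdr) & QCOW2_INCOMPAT_SUBCLUSTER))  # → True
```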
>
> An L2 entry is 64 bits wide, and this is the current format (for
> uncompressed clusters):
>
>    63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
>    00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
>    **<----> <--------------------------------------------------><------->*
>     Rsrved           host cluster offset of data                 Reserved
>    (6 bits)                   (47 bits)                          (8 bits)
>
>    bit 63: refcount == 1   (QCOW_OFLAG_COPIED)
>    bit 62: compressed = 1  (QCOW_OFLAG_COMPRESSED)
>    bit  0: all zeros       (QCOW_OFLAG_ZERO)
>
> I thought of three alternatives for storing the subcluster bitmaps. I
> haven't made my mind completely about which one is the best one, so
> I'd like to present all three for discussion. Here they are:
>
> (1) Storing the bitmap inside the 64-bit entry
>
>     This is a simple alternative and is the one that I chose for my
>     prototype. There are 14 unused bits plus the "all zeroes" one. If
>     we steal one from the host offset we have the 16 bits that we need
>     for the bitmap and we have 46 bits left for the host offset, which
>     is more than enough.

Note that because you are using exactly 8 subclusters, you can require that the minimum cluster size when subclusters are enabled be 4k (since we already have a lower-limit of 512-byte sector operation, and don't want subclusters to be smaller than that); in which case you are guaranteed that the host cluster offset will be 4k aligned.
So in reality, once you turn on subclusters, you have:

   63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
   00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
   **<----> <-----------------------------------------------><---------->*
    Rsrved          host cluster offset of data                Reserved
   (6 bits)                 (44 bits)                          (11 bits)

where you have 17 bits plus the "all zeroes" bit to play with, thanks to the three bits of host cluster offset that are now guaranteed to be zero due to cluster size alignment (but you're also right that the "all zeroes" bit is now redundant information with the 8 subcluster-is-zero bits, so repurposing it does not hurt)

>
> * Pros:
>   + Simple. Few changes compared to the current qcow2 format.
>
> * Cons:
>   - Only 8 subclusters per cluster. We would not be making the
>     most of this feature.
>
>   - No reserved bits left for the future.

I just argued you have at least one, and probably 2, bits left over for future in-word expansion.

>
> (2) Making L2 entries 128-bit wide.
>
>     In this alternative we would double the size of L2 entries. The
>     first half would remain unchanged and the second one would store
>     the bitmap. That would leave us with 32 subclusters per cluster.

Although for smaller cluster sizes (such as 4k clusters), you'd still want to restrict that subclusters are at least 512-byte sectors, so you'd be using fewer than 32 of those subcluster positions until the cluster size is large enough.

>
> * Pros:
>   + More subclusters per cluster. We could have images with
>     e.g. 128k clusters with 4k subclusters.

Could allow variable-sized subclusters (your choice of 32 subclusters of 4k each, or 16 subclusters of 8k each)

>
> * Cons:
>   - More space needed for L2 entries. The same cluster size would
>     require a cache twice as large, although having subcluster
>     allocation would compensate for this.
>
>   - More changes to the code to handle 128-bit entries.
Dealing with variable-sized subclusters, or with unused subcluster entries when the cluster size is too small (such as a 4k cluster, which should not be allowed any subclusters smaller than 512 bytes, but that's at most 8 out of the 32 slots available), can get tricky.

>
>   - We would still be wasting the 14 reserved bits that L2 entries
>     have.
>
> (3) Storing the bitmap somewhere else
>
>     This would involve storing the bitmap separate from the L2 tables
>     (perhaps using the bitmaps extension? I haven't looked much into
>     this).
>
>     * Pros:
>       + Possibility to make the number of subclusters configurable
>         by the user (32, 64, 128, ...)
>       + All existing metadata structures would remain untouched
>         (although the "all zeroes" bit in L2 entries would probably
>         become unused).

It might still remain useful for optimization purposes, although then we get into image consistency questions (if the all zeroes bit is set but the subcluster map claims allocation, or if the all zeroes bit is clear but all subclusters claim zero, which one wins).

>
>     * Cons:
>       - As with alternative (2), more space needed for metadata.
>
>       - The bitmap would also need to be cached for performance
>         reasons.
>
>       - Possibly one more *_cache_size option.
>
>       - One more metadata structure to be updated for each
>         allocation. This would probably impact I/O negatively.

Having the subcluster table directly in the L2 means that updating the L2 table is done with a single write. You are definitely right that having the subcluster table as a bitmap in a separate cluster means two writes instead of one, but as always, it's hard to predict how much of an impact that is without benchmarks.

>
> === Compressed clusters ===
>
> My idea is that compressed clusters would remain the same. They are
> read-only anyway so they would not be affected by any of these
> changes.
> > =========================== > > I think I managed to summarize all the ideas that I had in my mind, > but I'm sure you probably have questions and comments, so I'd be happy > to get as much feedback as possible. > > So, looking forward to reading your opinions. > The fact that you already have numbers proving the speedups that are possible when first allocating the image make this sound like a useful project, even though it is an incompatible image change that old tools won't be able to recognize. You'll want to make sure 'qemu-img amend' can rewrite an image with subclusters into an older image. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 604 bytes --] ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-06 16:40 ` Eric Blake @ 2017-04-07 8:49 ` Alberto Garcia 2017-04-07 12:41 ` Kevin Wolf 2017-04-21 21:09 ` [Qemu-devel] proposed qcow2 extension: cluster reservations [was: " Eric Blake 2 siblings, 0 replies; 64+ messages in thread From: Alberto Garcia @ 2017-04-07 8:49 UTC (permalink / raw) To: Eric Blake, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz On Thu 06 Apr 2017 06:40:41 PM CEST, Eric Blake wrote: >> This e-mail is the formal presentation of my proposal to extend the >> on-disk qcow2 format. As you can see this is still an RFC. Due to the >> nature of the changes I would like to get as much feedback as >> possible before going forward. > > The idea in general makes sense; I can even remember chatting with > Kevin about similar ideas as far back as 2015, where the biggest > drawback is that it is an incompatible image change, and therefore > images created with the flag cannot be read by older tools. That's correct. >> === Changes to the on-disk format === >> >> The qcow2 on-disk format needs to change so each L2 entry has a >> bitmap indicating the allocation status of each subcluster. There are >> three possible states (unallocated, allocated, all zeroes), so we >> need two bits per subcluster. > > You also have to add a new incompatible feature bit, so that older > tools know they can't read the new image correctly, and therefore > don't accidentally corrupt it. Yes, of course. The name that I'm considering is something like QCOW2_INCOMPAT_SUBCLUSTER, and for the creation options either 'subclusters=on/off' or 'subcluster_size=XXX' (depending on whether the size is configurable or not). >> (1) Storing the bitmap inside the 64-bit entry >> >> This is a simple alternative and is the one that I chose for my >> prototype. There are 14 unused bits plus the "all zeroes" one. 
If >> we steal one from the host offset we have the 16 bits that we need >> for the bitmap and we have 46 bits left for the host offset, which >> is more than enough. > > Note that because you are using exactly 8 subclusters, you can require > that the minimum cluster size when subclusters are enabled be 4k (since > we already have a lower-limit of 512-byte sector operation, and don't > want subclusters to be smaller than that); at which case you are > guaranteed that the host cluster offset will be 4k aligned. So in > reality, once you turn on subclusters, you have: > > 63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0 > 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 > **<----> <-----------------------------------------------><---------->* > Rsrved host cluster offset of data Reserved > (6 bits) (44 bits) (11 bits) > > where you have 17 bits plus the "all zeroes" bit to play with, thanks to > the three bits of host cluster offset that are now guaranteed to be zero > due to cluster size alignment (but you're also right that the "all > zeroes" bit is now redundant information with the 8 subcluster-is-zero > bits, so repurposing it does not hurt) The lower bits of the offset field are guaranteed to be zero if the cluster size is anything other than the minimum (4KB), so alternatively "host cluster offset" could become "host cluster index", where host_cluster_offset = host_cluster_index * cluster_size. With the 56 bits that we have now we can address 64 PB of data. In theory QEMU can create larger qcow2 files if the cluster size is big enough. 
The current hard limit is QCOW_MAX_L1_SIZE = 0x2000000, which implies:

|--------------+------------------|
| Cluster size | Max virtual size |
|--------------+------------------|
| 512 bytes    |      128 GB      |
| 1 KB         |      512 GB      |
| 2 KB         |        2 TB      |
| 4 KB         |        8 TB      |
| 8 KB         |       32 TB      |
| 16 KB        |      128 TB      |
| 32 KB        |      512 TB      |
| 64 KB        |        2 PB      |
| 128 KB       |        8 PB      |
| 256 KB       |       32 PB      |
| 512 KB       |      128 PB      |
| 1 MB         |      512 PB      |
| 2 MB         |        2 EB      |
|--------------+------------------|

In practice however 64PB ought to be enough for anybody™, so maybe it's not worth doing this.

>> * Cons:
>>   - Only 8 subclusters per cluster. We would not be making the
>>     most of this feature.
>>
>>   - No reserved bits left for the future.
>
> I just argued you have at least one, and probably 2, bits left over
> for future in-word expansion.

You're correct. In any case, 512 bytes per subcluster should be the absolute minimum.

>> (2) Making L2 entries 128-bit wide.
>>
>>     In this alternative we would double the size of L2 entries. The
>>     first half would remain unchanged and the second one would store
>>     the bitmap. That would leave us with 32 subclusters per cluster.
>
> Although for smaller cluster sizes (such as 4k clusters), you'd still
> want to restrict that subclusters are at least 512-byte sectors, so
> you'd be using fewer than 32 of those subcluster positions until the
> cluster size is large enough.

I think in that case what would make sense is to increase the minimum cluster size instead, or is there any reason why we would want smaller clusters if we already guarantee that the minimum allocation is 512 bytes? If we're going to reserve all 64 bits for the bitmap anyway I don't know if there's a good reason not to use them all.

>> (3) Storing the bitmap somewhere else
>>
>>     This would involve storing the bitmap separate from the L2 tables
>>     (perhaps using the bitmaps extension? I haven't looked much into
>>     this).
>> >> * Pros: >> + Possibility to make the number of subclusters configurable >> by the user (32, 64, 128, ...) >> + All existing metadata structures would remain untouched >> (although the "all zeroes" bit in L2 entries would probably >> become unused). > > It might still remain useful for optimization purposes, although then > we get into image consistency questions (if the all zeroes bit is set > but subcluster map claims allocation, or if the all zeroes bit is > clear but all subclusters claim zero, which one wins). If we keep the bit we'd need to define its semantics clearly. We're going to run into a similar problem with the subcluster state bits (we have three states but two bits, so four possible values). > Having the subcluster table directly in the L2 means that updating the > L2 table is done with a single write. You are definitely right that > having the subcluster table as a bitmap in a separate cluster means > two writes instead of one, but as always, it's hard to predict how > much of an impact that is without benchmarks. Yes, this one almost feels like the cleanest alternative of the three I described, but I suspect its effect on I/O performance would be noticeable. > The fact that you already have numbers proving the speedups that are > possible when first allocating the image make this sound like a useful > project, even though it is an incompatible image change that old tools > won't be able to recognize. You'll want to make sure 'qemu-img amend' > can rewrite an image with subclusters into an older image. Yes, definitely. Thanks a lot for your feedback! Berto ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-06 16:40 ` Eric Blake 2017-04-07 8:49 ` Alberto Garcia @ 2017-04-07 12:41 ` Kevin Wolf 2017-04-07 14:24 ` Alberto Garcia 2017-04-21 21:09 ` [Qemu-devel] proposed qcow2 extension: cluster reservations [was: " Eric Blake 2 siblings, 1 reply; 64+ messages in thread From: Kevin Wolf @ 2017-04-07 12:41 UTC (permalink / raw) To: Eric Blake Cc: Alberto Garcia, qemu-devel, Stefan Hajnoczi, qemu-block, Max Reitz [-- Attachment #1: Type: text/plain, Size: 7664 bytes --] Am 06.04.2017 um 18:40 hat Eric Blake geschrieben: > On 04/06/2017 10:01 AM, Alberto Garcia wrote: > > I thought of three alternatives for storing the subcluster bitmaps. I > > haven't made my mind completely about which one is the best one, so > > I'd like to present all three for discussion. Here they are: > > > > (1) Storing the bitmap inside the 64-bit entry > > > > This is a simple alternative and is the one that I chose for my > > prototype. There are 14 unused bits plus the "all zeroes" one. If > > we steal one from the host offset we have the 16 bits that we need > > for the bitmap and we have 46 bits left for the host offset, which > > is more than enough. > > Note that because you are using exactly 8 subclusters, you can require > that the minimum cluster size when subclusters are enabled be 4k (since > we already have a lower-limit of 512-byte sector operation, and don't > want subclusters to be smaller than that); at which case you are > guaranteed that the host cluster offset will be 4k aligned. 
So in > reality, once you turn on subclusters, you have: > > 63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0 > 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 > **<----> <-----------------------------------------------><---------->* > Rsrved host cluster offset of data Reserved > (6 bits) (44 bits) (11 bits) > > where you have 17 bits plus the "all zeroes" bit to play with, thanks to > the three bits of host cluster offset that are now guaranteed to be zero > due to cluster size alignment (but you're also right that the "all > zeroes" bit is now redundant information with the 8 subcluster-is-zero > bits, so repurposing it does not hurt) > > > > > * Pros: > > + Simple. Few changes compared to the current qcow2 format. > > > > * Cons: > > - Only 8 subclusters per cluster. We would not be making the > > most of this feature. > > > > - No reserved bits left for the future. > > I just argued you have at least one, and probably 2, bits left over for > future in-word expansion. I think only 8 subclusters is just too few. That the subcluster status would be split in two halves doesn't make me like this layout much better either. Intuitively one might think that squeezing everything into the existing data structures will save us memory, but as you correctly state below, the difference between 8 and 32 subclusters means that we can use a larger cluster size and the doubled L2 entry size uses actually less space to cover the whole image than using the existing one. I can see how few changes compared to the current format can look attractive, but I also think that there is a danger of forgetting to support subclusters in some place. When the layout is radically different, such mistakes would be caught very quickly, so maybe being more different is actually a plus. > > > > (2) Making L2 entries 128-bit wide. > > > > In this alternative we would double the size of L2 entries. 
The > > first half would remain unchanged and the second one would store > > the bitmap. That would leave us with 32 subclusters per cluster. > > Although for smaller cluster sizes (such as 4k clusters), you'd still > want to restrict that subclusters are at least 512-byte sectors, so > you'd be using fewer than 32 of those subcluster positions until the > cluster size is large enough. > > > > > * Pros: > > + More subclusters per cluster. We could have images with > > e.g. 128k clusters with 4k subclusters. > > Could allow variable-sized subclusters (your choice of 32 subclusters of > 4k each, or 16 subclusters of 8k each) I don't think using less subclusters is desirable if it doesn't come with savings elsewhere. We already need to allocate two clusters for an L2 table now, so we want to use it. The more interesting kind of variable-sized subclusters would be if you could select any multiple of 32, meaning three or more clusters per L2 table (with 192 bits or more per entry). I'm doubtful if this would be worth the effort, though. > > > > * Cons: > > - More space needed for L2 entries. The same cluster size would > > require a cache twice as large, although having subcluster > > allocation would compensate for this. > > > > - More changes to the code to handle 128-bit entries. > > Dealing with variable-sized subclusters, or with unused subclsuster > entries when the cluster size is too small (such as a 4k cluster should > not be allowed any subclusters smaller than 512 bytes, but that's at > most 8 out of the 32 slots available), can get tricky. If it's too tricky, we just don't allow it. 32 * 512 = 16k would be the minimal cluster size that allows enabling subclusters. > > > > - We would still be wasting the 14 reserved bits that L2 entries > > have. We discussed the concept of subclusters multiple times in the past, and I think every time my conclusion was that option (2) is really what we want if we implement this. 
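To make option (2) concrete, the per-subcluster state handling in the second 64-bit word of a 128-bit L2 entry could look like this (a sketch; the 2-bit state encoding and helper names are my assumptions, not part of the proposal):

```python
# Three states per subcluster, two bits each (encoding is hypothetical;
# the fourth value is left undefined, which is one of the consistency
# questions raised elsewhere in this thread).
UNALLOCATED, ALLOCATED, ALL_ZEROES = 0, 1, 2

def get_state(bitmap, subcluster):
    # The second 64-bit word of a 128-bit L2 entry holds 32 x 2-bit states.
    return (bitmap >> (2 * subcluster)) & 0x3

def set_state(bitmap, subcluster, state):
    # Clear the 2-bit field for this subcluster, then store the new state.
    shift = 2 * subcluster
    return (bitmap & ~(0x3 << shift)) | (state << shift)

bitmap = 0
bitmap = set_state(bitmap, 0, ALLOCATED)    # guest wrote subcluster 0
bitmap = set_state(bitmap, 31, ALL_ZEROES)  # subcluster 31 reads as zeroes
print(get_state(bitmap, 0), get_state(bitmap, 31), get_state(bitmap, 15))  # → 1 2 0
```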
> > (3) Storing the bitmap somewhere else
> >
> >     This would involve storing the bitmap separate from the L2 tables
> >     (perhaps using the bitmaps extension? I haven't looked much into
> >     this).
> >
> > * Pros:
> >   + Possibility to make the number of subclusters configurable
> >     by the user (32, 64, 128, ...)
> >
> >   + All existing metadata structures would remain untouched
> >     (although the "all zeroes" bit in L2 entries would probably
> >     become unused).
>
> It might still remain useful for optimization purposes, although then we
> get into image consistency questions (if the all zeroes bit is set but
> subcluster map claims allocation, or if the all zeroes bit is clear but
> all subclusters claim zero, which one wins).
>
> >
> > * Cons:
> >   - As with alternative (2), more space needed for metadata.
> >
> >   - The bitmap would also need to be cached for performance
> >     reasons.
> >
> >   - Possibly one more *_cache_size option.
> >
> >   - One more metadata structure to be updated for each
> >     allocation. This would probably impact I/O negatively.
>
> Having the subcluster table directly in the L2 means that updating the
> L2 table is done with a single write. You are definitely right that
> having the subcluster table as a bitmap in a separate cluster means two
> writes instead of one, but as always, it's hard to predict how much of
> an impact that is without benchmarks.

Note that it's not just additional write requests, but that we can't
update the L2 table entry and the bitmap atomically any more, so we have
to worry about ordering. The ordering between L2 table and refcount
blocks is already painful enough, I'm not sure if I would want to add a
third type. Ordering also means disk flushes, which are a lot slower
than just additional writes.

We wouldn't have these ordering problems with a journal (because we
could then commit things atomically), but I suppose we don't want to
make two major changes at once.
:-)

> > === Compressed clusters ===
> >
> >     My idea is that compressed clusters would remain the same. They are
> >     read-only anyway so they would not be affected by any of these
> >     changes.

Yes, this makes sense to me. Compression already uses its own kind of
splitting clusters.

Kevin

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-07 12:41       ` Kevin Wolf
@ 2017-04-07 14:24         ` Alberto Garcia
  0 siblings, 0 replies; 64+ messages in thread
From: Alberto Garcia @ 2017-04-07 14:24 UTC (permalink / raw)
  To: Kevin Wolf, Eric Blake; +Cc: qemu-devel, Stefan Hajnoczi, qemu-block, Max Reitz

On Fri 07 Apr 2017 02:41:21 PM CEST, Kevin Wolf <kwolf@redhat.com> wrote:
>>  63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
>> 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
>> <------> <-----------------------------------------------><---------->
>>  Rsrved            host cluster offset of data              Reserved
>> (6 bits)                   (44 bits)                        (11 bits)
>>
>> where you have 17 bits plus the "all zeroes" bit to play with, thanks to
>> the three bits of host cluster offset that are now guaranteed to be zero
>> due to cluster size alignment (but you're also right that the "all
>> zeroes" bit is now redundant information with the 8 subcluster-is-zero
>> bits, so repurposing it does not hurt)
>>
>> >
>> > * Pros:
>> >   + Simple. Few changes compared to the current qcow2 format.
>> >
>> > * Cons:
>> >   - Only 8 subclusters per cluster. We would not be making the
>> >     most of this feature.
>> >
>> >   - No reserved bits left for the future.
>>
>> I just argued you have at least one, and probably 2, bits left over for
>> future in-word expansion.
>
> I think only 8 subclusters is just too few. That the subcluster status
> would be split in two halves doesn't make me like this layout much
> better either.

I also agree that 8 are too few (splitting the subcluster field would
not be strictly necessary, but that's not so important).

>> > (2) Making L2 entries 128-bit wide.
>> >
>> >     In this alternative we would double the size of L2 entries. The
>> >     first half would remain unchanged and the second one would store
>> >     the bitmap. That would leave us with 32 subclusters per cluster.
>> >> Although for smaller cluster sizes (such as 4k clusters), you'd still >> want to restrict that subclusters are at least 512-byte sectors, so >> you'd be using fewer than 32 of those subcluster positions until the >> cluster size is large enough. >> >> > >> > * Pros: >> > + More subclusters per cluster. We could have images with >> > e.g. 128k clusters with 4k subclusters. >> >> Could allow variable-sized subclusters (your choice of 32 subclusters of >> 4k each, or 16 subclusters of 8k each) > > I don't think using less subclusters is desirable if it doesn't come > with savings elsewhere. We already need to allocate two clusters for an > L2 table now, so we want to use it. > > The more interesting kind of variable-sized subclusters would be if you > could select any multiple of 32, meaning three or more clusters per L2 > table (with 192 bits or more per entry). Yeah, I agree. I think it's worth considering. One more drawback that I can think of is that if we make L2 entries wider and we have compressed clusters we'd be wasting space in their entries. >> > - One more metadata structure to be updated for each >> > allocation. This would probably impact I/O negatively. >> >> Having the subcluster table directly in the L2 means that updating >> the L2 table is done with a single write. You are definitely right >> that having the subcluster table as a bitmap in a separate cluster >> means two writes instead of one, but as always, it's hard to predict >> how much of an impact that is without benchmarks. > > Note that it's not just additional write requests, but that we can't > update the L2 table entry and the bitmap atomically any more, so we > have to worry about ordering. The ordering between L2 table and > refcount blocks is already painful enough, I'm not sure if I would > want to add a third type. Ordering also means disk flushes, which are > a lot slower than just additional writes. 
You're right, I think you just convinced me that this is a bad idea and
I'm also more inclined towards (2) now.

Berto

^ permalink raw reply	[flat|nested] 64+ messages in thread
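To make option (2) concrete, here is a small sketch of what the second 64-bit word of a 128-bit L2 entry could look like: 32 subclusters with two bits of state each, matching the three states named elsewhere in the thread (unallocated, allocated, all zeroes). The exact bit assignment is hypothetical; the thread has not fixed an on-disk encoding.

```python
# Illustrative model of the 64-bit "subcluster status" half of a
# 128-bit L2 entry: 32 subclusters x 2 bits each.  The state values
# (0 = unallocated, 1 = allocated, 2 = all zeroes) are an assumption
# for this sketch, not a settled layout.
UNALLOCATED, ALLOCATED, ALL_ZEROES = 0, 1, 2

def get_subcluster_state(status_word: int, index: int) -> int:
    """Return the 2-bit state of subcluster `index` (0..31)."""
    assert 0 <= index < 32
    return (status_word >> (2 * index)) & 0x3

def set_subcluster_state(status_word: int, index: int, state: int) -> int:
    """Return a new status word with subcluster `index` set to `state`."""
    assert 0 <= index < 32 and state in (UNALLOCATED, ALLOCATED, ALL_ZEROES)
    mask = 0x3 << (2 * index)
    return (status_word & ~mask) | (state << (2 * index))

# A freshly mapped cluster starts with every subcluster unallocated,
# i.e. all reads still fall through to the backing file.
word = 0
word = set_subcluster_state(word, 3, ALLOCATED)
word = set_subcluster_state(word, 7, ALL_ZEROES)
print(hex(word))  # → 0x8040
```

Updating a subcluster is then a read-modify-write of one 64-bit word that sits next to the existing mapping word, which is what makes the single-write update property discussed above possible.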
* [Qemu-devel] proposed qcow2 extension: cluster reservations [was: [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-06 16:40 ` Eric Blake 2017-04-07 8:49 ` Alberto Garcia 2017-04-07 12:41 ` Kevin Wolf @ 2017-04-21 21:09 ` Eric Blake 2017-04-22 17:56 ` Max Reitz 2 siblings, 1 reply; 64+ messages in thread From: Eric Blake @ 2017-04-21 21:09 UTC (permalink / raw) To: Alberto Garcia, qemu-devel Cc: Kevin Wolf, qemu-block, Stefan Hajnoczi, Max Reitz [-- Attachment #1: Type: text/plain, Size: 13291 bytes --] On 04/06/2017 11:40 AM, Eric Blake wrote: >> === Changes to the on-disk format === >> >> The qcow2 on-disk format needs to change so each L2 entry has a bitmap >> indicating the allocation status of each subcluster. There are three >> possible states (unallocated, allocated, all zeroes), so we need two >> bits per subcluster. > > You also have to add a new incompatible feature bit, so that older tools > know they can't read the new image correctly, and therefore don't > accidentally corrupt it. As long as we are talking about incompatible feature bits, I had some thoughts about image mapping today. tl;dr: summary As long as we are considering incompatible features, maybe we should make it easier to have an image file that explicitly preserves guest=>host mapping, such that nothing the guest can do will reorder the mapping. This way, it would be possible to fully preallocate an image such that all guest offsets are adjusted by a constant offset to become the corresponding host offset (basically, no qcow2 metadata is interleaved in the middle of guest data). I don't see any way to do it on current qcow2 images, but with subclusters, you get it for free by having a cluster with an offset but with all subclusters marked as unallocated. But perhaps it is something orthogonal enough that we would want a separate incompatible feature bit for just this, without having subclusters at the same time. 
In the process of exploring the topic, I expose a couple of existing
bugs in our qcow2 handling.

========

Longer version:

If I'm reading qcow2-clusters.c and qcow2-refcount.c correctly, our
current implementation of bdrv_discard() says that except for clusters
already marked QCOW2_CLUSTER_ZERO, we will unconditionally remove the L2
mapping of the address. Whether we actually punch a hole in the
underlying image, or merely add it to a list of free clusters for use in
subsequent allocations, is later decided based on
s->discard_passthrough[type] (that is, the caller can pass different
type levels that control whether to never punch, always punch, or let
the blockdev parameters of the caller control whether to punch).

Presumably, if I've preallocated my image because I want to guarantee
enough host space, then I would use blockdev parameters that ensure that
guest actions never punch a hole. But based on my reading, I still have
the situation that if I initially preallocated things so that guest
clusters 0 and 1 are consecutive clusters in the host, and the guest
triggers bdrv_pdiscard() over both clusters, then writes to cluster 1
before cluster 0, then even though I'm not changing the underlying
allocation of the host file, I _am_ introducing fragmentation in the
qcow2 file, where cluster 1 in the guest now falls prior to cluster 0.

Demonstration:

$ qemu-img create -f qcow2 -o cluster_size=1M,preallocation=full file3 10M
Formatting 'file3', fmt=qcow2 size=10485760 encryption=off
cluster_size=1048576 preallocation=full lazy_refcounts=off refcount_bits=16

$ qemu-img map -f qcow2 file3
Offset          Length          Mapped to       File
0               0x900000        0x500000        file3
0x9ff000        0x1000          0xeff000        file3

Oh, that's weird - our "full allocation" still results in a gap in
guest allocation (nothing from 0x900000-0x9ff000 appears to be mapped),
even though every offset that IS allocated has a 1:1 mapping (for any
given guest offset, add 5M to get the host offset).
Maybe the JSON output will shed more light:

$ qemu-img map -f qcow2 file3 --output=json
[{ "start": 0, "length": 9437184, "depth": 0, "zero": false, "data": true, "offset": 5242880},
{ "start": 9437184, "length": 1044480, "depth": 0, "zero": true, "data": true, "offset": 14680064},
{ "start": 10481664, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": 15724544}]

That's a bit better - the hole that didn't show up is still mapped, it's
just the human version omitted it because that portion of the file
explicitly reads as zero. But I'm still confused - how can only half of
a 1M cluster read as zero, when our QCOW2_CLUSTER_ZERO flag applies a
cluster-at-a-time? Oh - it's because we defer to the file system, and
according to lseek(SEEK_HOLE), we actually do have a hole in the file...

$ qemu-img map -f raw file3 --output=json
[{ "start": 0, "length": 5242880, "depth": 0, "zero": false, "data": true, "offset": 0},
{ "start": 5242880, "length": 10481664, "depth": 0, "zero": true, "data": false, "offset": 5242880},
{ "start": 15724544, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": 15724544}]

and reading the L2 table shows that we do NOT have a case of
QCOW2_CLUSTER_ZERO set in the qcow2 image:

$ qemu-io -f raw file3
qemu-io> r -v 4m 128
00400000: 80 00 00 00 00 50 00 00 80 00 00 00 00 60 00 00 .....P..........
00400010: 80 00 00 00 00 70 00 00 80 00 00 00 00 80 00 00 .....p..........
00400020: 80 00 00 00 00 90 00 00 80 00 00 00 00 a0 00 00 ................
00400030: 80 00 00 00 00 b0 00 00 80 00 00 00 00 c0 00 00 ................
00400040: 80 00 00 00 00 d0 00 00 80 00 00 00 00 e0 00 00 ................
00400050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00400060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00400070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
read 128/128 bytes at offset 4194304
128 bytes, 1 ops; 0.0001 sec (905.797 KiB/sec and 7246.3768 ops/sec)
qemu-io> q

But it's still sad that our request for preallocation=full didn't
actually work on the underlying file (we have a hole that we didn't
want). Oh well, yet another thing to fix someday.

Anyways, now have the guest do a discard and subsequent re-write:

$ qemu-io -f qcow2 file3
qemu-io> discard 0 2m
discard 2097152/2097152 bytes at offset 0
2 MiB, 1 ops; 0.0014 sec (1.395 GiB/sec and 714.2857 ops/sec)
qemu-io> w 1m 1m
wrote 1048576/1048576 bytes at offset 1048576
1 MiB, 1 ops; 0.0465 sec (21.487 MiB/sec and 21.4869 ops/sec)
qemu-io> w 0 1m
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0327 sec (30.497 MiB/sec and 30.4971 ops/sec)
qemu-io> q

$ qemu-img map -f qcow2 file3
Offset          Length          Mapped to       File
0               0x100000        0x600000        file3
0x100000        0x100000        0x500000        file3
0x200000        0x700000        0x700000        file3
0x9ff000        0x1000          0xeff000        file3

We haven't consumed any new host space (good), but we HAVE fragmented
the guest image (cluster 0 and 1 have now swapped spaces; we no longer
have contiguous guest=>host translations).

But watch what happens with explicit zero clusters:

$ qemu-io -f qcow2 file3
qemu-io> w -z 3m 2m
wrote 2097152/2097152 bytes at offset 3145728
2 MiB, 1 ops; 0.0136 sec (146.563 MiB/sec and 73.2815 ops/sec)
qemu-io> q

Per the L2 map, we have now set the explicit zero bit, but maintained an
allocation:

$ qemu-io -f raw file3
qemu-io> r -v 4m 128
00400000: 80 00 00 00 00 60 00 00 80 00 00 00 00 50 00 00 .............P..
00400010: 80 00 00 00 00 70 00 00 80 00 00 00 00 80 00 01 .....p..........
00400020: 80 00 00 00 00 90 00 01 80 00 00 00 00 a0 00 00 ................
00400030: 80 00 00 00 00 b0 00 00 80 00 00 00 00 c0 00 00 ................
00400040: 80 00 00 00 00 d0 00 00 80 00 00 00 00 e0 00 00 ................
00400050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00400060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00400070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
read 128/128 bytes at offset 4194304
128 bytes, 1 ops; 0.0001 sec (1.197 MiB/sec and 9803.9216 ops/sec)
qemu-io> q

And with that allocation in place, discarding and then rewriting zero
does NOT lose the allocation:

$ qemu-io -f qcow2 file3
qemu-io> discard 3m 2m
discard 2097152/2097152 bytes at offset 3145728
2 MiB, 1 ops; 0.0007 sec (2.622 GiB/sec and 1342.2819 ops/sec)
qemu-io> w -z 4m 1m
wrote 1048576/1048576 bytes at offset 4194304
1 MiB, 1 ops; 0.0103 sec (97.003 MiB/sec and 97.0026 ops/sec)
qemu-io> w -z 3m 1m
wrote 1048576/1048576 bytes at offset 3145728
1 MiB, 1 ops; 0.0095 sec (104.319 MiB/sec and 104.3188 ops/sec)
qemu-io> q

$ qemu-io -f raw file3
qemu-io> r -v 4m 128
00400000: 80 00 00 00 00 60 00 00 80 00 00 00 00 50 00 00 .............P..
00400010: 80 00 00 00 00 70 00 00 80 00 00 00 00 80 00 01 .....p..........
00400020: 80 00 00 00 00 90 00 01 80 00 00 00 00 a0 00 00 ................
00400030: 80 00 00 00 00 b0 00 00 80 00 00 00 00 c0 00 00 ................
00400040: 80 00 00 00 00 d0 00 00 80 00 00 00 00 e0 00 00 ................
00400050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00400060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00400070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
read 128/128 bytes at offset 4194304
128 bytes, 1 ops; 0.0001 sec (1.185 MiB/sec and 9708.7379 ops/sec)
qemu-io>

Clusters 3 and 4 are still mapped to host 8 and 9, as they were
originally.
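The L2 entries in the hexdumps above are big-endian 64-bit words; per the qcow2 specification, bit 63 is the COPIED flag, bit 62 marks compression, bit 0 is the all-zeroes flag, and bits 9-55 hold the host cluster offset. A short decoding helper (a sketch, not QEMU code) makes the dumps easier to read:

```python
import struct

QCOW_OFLAG_COPIED = 1 << 63      # refcount == 1, cluster writable in place
QCOW_OFLAG_COMPRESSED = 1 << 62
QCOW_OFLAG_ZERO = 1 << 0         # cluster reads as all zeroes
L2E_OFFSET_MASK = 0x00fffffffffffe00  # bits 9..55: host cluster offset

def decode_l2_entry(raw8: bytes) -> dict:
    (entry,) = struct.unpack('>Q', raw8)  # L2 entries are big-endian
    return {
        'copied': bool(entry & QCOW_OFLAG_COPIED),
        'compressed': bool(entry & QCOW_OFLAG_COMPRESSED),
        'zero': bool(entry & QCOW_OFLAG_ZERO),
        'host_offset': entry & L2E_OFFSET_MASK,
    }

# Fourth entry of the dump above (guest cluster 3): the zero flag is set
# while the host mapping at 0x800000 is kept.
print(decode_l2_entry(bytes.fromhex('8000000000800001')))
# → {'copied': True, 'compressed': False, 'zero': True, 'host_offset': 8388608}
```

This is exactly the pattern the transcript demonstrates: an allocated-but-zero cluster (zero flag and host offset both set) that survives a discard-plus-write-zeroes cycle.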
But what happens when we try to write data there:

$ qemu-io -f qcow2 file3
qemu-io> w 4m 1m
wrote 1048576/1048576 bytes at offset 4194304
1 MiB, 1 ops; 0.0402 sec (24.830 MiB/sec and 24.8299 ops/sec)
qemu-io> q

$ qemu-img map -f qcow2 file3
Offset          Length          Mapped to       File
0               0x100000        0x600000        file3
0x100000        0x100000        0x500000        file3
0x200000        0x100000        0x700000        file3
0x400000        0x100000        0xf00000        file3
0x500000        0x400000        0xa00000        file3
0x9ff000        0x1000          0xeff000        file3

Ouch - that cluster write IGNORED the fact that we already had a
reservation at host 8, and instead changed the mapping so that guest 4
now maps to host f, making our refcount table larger, and fragmenting
things. So much for pre-allocation saving the day :(

So, we need a bug-fix such that allocating a cluster that will be
replacing an allocated-but-zero cluster would reuse the existing
allocation, rather than create a new mapping. [Oh, and how will all of
this interact with refcounts > 1, when internal snapshots are in use?
Ugh - I don't know that we want to go there]

But if we can preserve mappings of clusters that are explicitly marked
zero, I started to wonder if it would also be reasonable to have a mode
where we could ALWAYS preserve mappings. Adding another bit that says
that a cluster has a reserved mapping, but still defers to the backing
file for its current data, would let us extend the existing behavior of
map-preservation on write zeros to work with ALL writes, when writing to
a fully pre-allocated image.

Such a bit is incompatible (existing qemu would not know how to present
correct data to the guest when the bit is set), so if we want it, we
have to burn an incompatible feature bit. The question is whether this
feature is independently useful enough to do it orthogonally from
subclusters, or whether we want to minimize the exponential explosion of
orthogonal bits by only allowing the feature when subclusters are also
enabled.
And in a way, subclusters already provide this feature: any time you map
a cluster to an address, but have ALL of its subclusters marked as
unallocated (read from backing file), you have preallocated a mapping
that you can later use when writing to that cluster, and that
preallocation is no longer tied to whether the guest uses write-zeros.

When I chatted with Max on IRC about the idea, we said this:

<mreitz> I mean, sure, we can add both, but I'd still want them to be
two incompatible bits
<eblake> if you want the features to be orthogonal (with exponentially
more cases to check), then having multiple incompatible bits is okay
<mreitz> Because FEATURE_BIT_UNALLOCATED_AND_SUBCLUSTERS sounds weird
and FEATURE_BIT_EXTENDED_L2_ENTRIES a bit pretentious
<mreitz> Well, orthogonal is a good question. If we want to have an
UNALLOCATED flag we should think so before adding subclusters
<mreitz> Because then we should at least make clear that the ZERO flag
for a subcluster requires the ALLOCATED flag to be set or something
<mreitz> So we can reserve ZERO/!ALLOCATED for the case where you want
to fall through to the backing file

So, if you got this far in reading, the question becomes whether having
a mode where you can mark a cluster as mapping-reserved-but-unallocated
has enough use case to be worth pursuing, knowing that it will burn an
incompatible feature bit; or if it should be rolled into the subcluster
proposal, or whether it's not a feature that anyone needs after all.

And meanwhile, it looks like I have some patches to propose (and
qemu-iotests to write) if I can help fix the bugs I've pointed out.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread
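The ZERO/!ALLOCATED convention discussed on IRC above can be summarized as a tiny read-path model. This is purely illustrative Python, assuming the 2-bit per-subcluster states from the subcluster proposal; nothing like it exists in QEMU:

```python
# Toy read path for one subcluster under the convention sketched above:
# ZERO only makes sense together with ALLOCATED, and plain UNALLOCATED
# falls through to the backing file even when the cluster itself still
# carries a (reserved) host mapping.  All names here are assumptions.
UNALLOCATED, ALLOCATED, ALL_ZEROES = 0, 1, 2
SUBCLUSTER_SIZE = 4096

def read_subcluster(state: int, host_data: bytes, backing_data: bytes) -> bytes:
    if state == UNALLOCATED:
        # The cluster may have a reserved host mapping, but the data
        # still comes from the backing file -- this is what gives
        # "cluster reservations" for free with subclusters.
        return backing_data
    if state == ALL_ZEROES:
        return b'\x00' * SUBCLUSTER_SIZE
    return host_data  # ALLOCATED: read the mapped host subcluster

backing = b'B' * SUBCLUSTER_SIZE
host = b'H' * SUBCLUSTER_SIZE
print(read_subcluster(UNALLOCATED, host, backing)[:1])  # → b'B'
```

The key point the model makes explicit: a guest write only flips a subcluster's state, never the cluster's host mapping, so a fully reserved image keeps its guest=>host layout under any write pattern.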
* Re: [Qemu-devel] proposed qcow2 extension: cluster reservations [was: [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-21 21:09 ` [Qemu-devel] proposed qcow2 extension: cluster reservations [was: " Eric Blake @ 2017-04-22 17:56 ` Max Reitz 2017-04-24 11:45 ` Kevin Wolf 2017-04-24 12:46 ` Alberto Garcia 0 siblings, 2 replies; 64+ messages in thread From: Max Reitz @ 2017-04-22 17:56 UTC (permalink / raw) To: Eric Blake, Alberto Garcia, qemu-devel Cc: Kevin Wolf, qemu-block, Stefan Hajnoczi [-- Attachment #1: Type: text/plain, Size: 5565 bytes --] On 21.04.2017 23:09, Eric Blake wrote: > On 04/06/2017 11:40 AM, Eric Blake wrote: > >>> === Changes to the on-disk format === >>> >>> The qcow2 on-disk format needs to change so each L2 entry has a bitmap >>> indicating the allocation status of each subcluster. There are three >>> possible states (unallocated, allocated, all zeroes), so we need two >>> bits per subcluster. >> >> You also have to add a new incompatible feature bit, so that older tools >> know they can't read the new image correctly, and therefore don't >> accidentally corrupt it. > > As long as we are talking about incompatible feature bits, I had some > thoughts about image mapping today. > > tl;dr: summary> As long as we are considering incompatible features, maybe we should > make it easier to have an image file that explicitly preserves > guest=>host mapping, such that nothing the guest can do will reorder the > mapping. This way, it would be possible to fully preallocate an image > such that all guest offsets are adjusted by a constant offset to become > the corresponding host offset (basically, no qcow2 metadata is > interleaved in the middle of guest data). > > I don't see any way to do it on current qcow2 images, but with > subclusters, you get it for free by having a cluster with an offset but > with all subclusters marked as unallocated. 
But perhaps it is something > orthogonal enough that we would want a separate incompatible feature bit > for just this, without having subclusters at the same time. > > In the process of exploring the topic, I expose a couple of existing > bugs in our qcow2 handling. > > ======== > > Longer version: > > If I'm reading qcow2-clusters.c and qcow2-refcount.c correctly, our > current implementation of bdrv_discard() says that except for clusters > already marked QCOW2_CLUSTER_ZERO, we will unconditionally remove the L2 > mapping of the address. As I've said, I think the ZERO bit is just a bug and we should free preallocated zero clusters. > Whether we actually punch a hole in the > underlying image, or merely add it to a list of free clusters for use in > subsequent allocations, is later decided based on > s->discard_passthrough[type] (that is, the caller can pass different > type levels that control whether to never punch, always punch, or let > the blockdev parameters of the caller control whether to punch). > > Presumably, if I've preallocated my image because I want to guarantee > enough host space, then I would use blockdev parameters that ensure that > guest actions never punch a hole. But based on my reading, I still have > the situation that if I initially preallocated things so that guest > cluster 0 and 1 are consecutive clusters in the host, and the guest > triggers bdrv_pdiscard() over both clusters, then writes to cluster 1 > before cluster 0, then even though I'm not changing the underlying > allocation of the host file, I _am_ resulting in fragmentation in the > qcow2 file, where cluster 1 in the guest now falls prior to cluster 0. [...] > But if we can preserve mappings of clusters that are explicitly marked > zero, I started to wonder if it would also be reasonable to have a mode > where we could ALWAYS preserve mappings. 
> Adding another bit that says
> that a cluster has a reserved mapping, but still defers to the backing
> file for its current data, would let us extend the existing behavior of
> map-preservation on write zeros to work with ALL writes, when writing to
> a fully pre-allocated image.

Yes, and it also means that we may want to think about implementing a
preallocation mode in qemu which puts all of the data into a single
consecutive chunk (as you have hinted at somewhere above).

> When I chatted with Max on IRC about the idea, we said this:
>
> <mreitz> I mean, sure, we can add both, but I'd still want them to be
> two incompatible bits
> <eblake> if you want the features to be orthogonal (with exponentially
> more cases to check), then having multiple incompatible bits is okay
> <mreitz> Because FEATURE_BIT_UNALLOCATED_AND_SUBCLUSTERS sounds weird
> and FEATURE_BIT_EXTENDED_L2_ENTRIES a bit pretentious
> <mreitz> Well, orthogonal is a good question. If we want to have an
> UNALLOCATED flag we should think so before adding subclusters
> <mreitz> Because then we should at least make clear that the ZERO flag
> for a subcluster requires the ALLOCATED flag to be set or something
> <mreitz> So we can reserve ZERO/!ALLOCATED for the case where you want
> to fall through to the backing file
>
> So, if you got this far in reading, the question becomes whether having
> a mode where you can mark a cluster as mapping-reserved-but-unallocated
> has enough use case to be worth pursuing, knowing that it will burn an
> incompatible feature bit; or if it should be rolled into the subcluster
> proposal, or whether it's not a feature that anyone needs after all.

I just forgot that just saying !ALLOCATED will be enough, regardless of
the ZERO flag... Yeah, subclusters will give us this for free, and I
think it's therefore reasonable to just require them if you want this
feature (preallocation with a backing file).
> And meanwhile, it looks like I have some patches to propose (and > qemu-iotests to write) if I can help fix the bugs I've pointed out. You mean these? https://github.com/XanClic/qemu/commits/random-qcow2-stuff-v1 ;-) I'll send them once I've written tests. Max [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 512 bytes --] ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Qemu-devel] proposed qcow2 extension: cluster reservations [was: [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-22 17:56 ` Max Reitz @ 2017-04-24 11:45 ` Kevin Wolf 2017-04-24 12:46 ` Alberto Garcia 1 sibling, 0 replies; 64+ messages in thread From: Kevin Wolf @ 2017-04-24 11:45 UTC (permalink / raw) To: Max Reitz Cc: Eric Blake, Alberto Garcia, qemu-devel, qemu-block, Stefan Hajnoczi [-- Attachment #1: Type: text/plain, Size: 821 bytes --] Am 22.04.2017 um 19:56 hat Max Reitz geschrieben: > On 21.04.2017 23:09, Eric Blake wrote: > > And meanwhile, it looks like I have some patches to propose (and > > qemu-iotests to write) if I can help fix the bugs I've pointed out. > > You mean these? > https://github.com/XanClic/qemu/commits/random-qcow2-stuff-v1 > > ;-) > > I'll send them once I've written tests. You were quicker with writing the patches than I could reply that I already started a patch to reuse preallocated zero clusters, but never got around to finish it: http://repo.or.cz/qemu/kevin.git/commitdiff/60b9f813bb258978455944b4ace1ab5d93b5e7d4 I seem to remember that it still had a problem (and probably also the refcount > 0 bug that Eric mentioned), but maybe you can reuse at least the test case from there. Kevin [-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Qemu-devel] proposed qcow2 extension: cluster reservations [was: [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-22 17:56 ` Max Reitz 2017-04-24 11:45 ` Kevin Wolf @ 2017-04-24 12:46 ` Alberto Garcia 1 sibling, 0 replies; 64+ messages in thread From: Alberto Garcia @ 2017-04-24 12:46 UTC (permalink / raw) To: Max Reitz, Eric Blake, qemu-devel; +Cc: Kevin Wolf, qemu-block, Stefan Hajnoczi On Sat 22 Apr 2017 07:56:57 PM CEST, Max Reitz wrote: >> So, if you got this far in reading, the question becomes whether >> having a mode where you can mark a cluster as >> mapping-reserved-but-unallocated has enough use case to be worth >> pursuing, knowing that it will burn an incompatible feature bit; or >> if it should be rolled into the subcluster proposal, or whether it's >> not a feature that anyone needs after all. > > I just forgot that just saying !ALLOCATED will be enough, regardless > of the ZERO flag... Yeah, subclusters will give us this for free, and > I think it's therefore reasonable to just require them if you want > this feature (preallocation with a backing file). I agree with Max here. Berto ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
  2017-04-06 15:01 [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation Alberto Garcia
  2017-04-06 16:40 ` Eric Blake
@ 2017-04-07 12:20 ` Stefan Hajnoczi
  2017-04-07 12:24   ` Alberto Garcia
  2017-04-07 13:01   ` Kevin Wolf
  2017-04-07 17:10 ` Max Reitz
  ` (2 subsequent siblings)
  4 siblings, 2 replies; 64+ messages in thread
From: Stefan Hajnoczi @ 2017-04-07 12:20 UTC (permalink / raw)
  To: Alberto Garcia; +Cc: qemu-devel, qemu-block, Max Reitz, Kevin Wolf

[-- Attachment #1: Type: text/plain, Size: 3379 bytes --]

On Thu, Apr 06, 2017 at 06:01:48PM +0300, Alberto Garcia wrote:
> Here are the results (subcluster size in brackets):
>
> |-----------------+----------------+-----------------+-------------------|
> |  cluster size   | subclusters=on | subclusters=off | Max L2 cache size |
> |-----------------+----------------+-----------------+-------------------|
> |  2 MB (256 KB)  |    440 IOPS    |     100 IOPS    |    160 KB (*)     |
> | 512 KB (64 KB)  |   1000 IOPS    |     300 IOPS    |    640 KB         |
> |  64 KB (8 KB)   |   3000 IOPS    |    1000 IOPS    |      5 MB         |
> |  32 KB (4 KB)   |  12000 IOPS    |    1300 IOPS    |     10 MB         |
> |   4 KB (512 B)  |    100 IOPS    |     100 IOPS    |     80 MB         |
> |-----------------+----------------+-----------------+-------------------|
>
>    (*) The L2 cache must be a multiple of the cluster
>        size, so in this case it must be 2MB. On the table
>        I chose to show how much of those 2MB are actually
>        used so you can compare it with the other cases.
>
> Some comments about the results:
>
> - For the 64KB, 512KB and 2MB cases, having subclusters increases
>   write performance roughly by three. This happens because for each
>   cluster allocation there's less data to copy from the backing
>   image. For the same reason, the smaller the cluster, the better the
>   performance. As expected, 64KB clusters with no subclusters perform
>   roughly the same as 512KB clusters with 64KB subclusters.
>
> - The 32KB case is the most interesting one.
>   Without subclusters it's
>   not very different from the 64KB case, but having a subcluster with
>   the same size of the I/O block eliminates the need for COW entirely
>   and the performance skyrockets (10 times faster!).
>
> - 4KB is however very slow. I attribute this to the fact that the
>   cluster size is so small that a new cluster needs to be allocated
>   for every single write and its refcount updated accordingly. The L2
>   and refcount tables are also so small that they are too inefficient
>   and need to grow all the time.
>
> Here are the results when writing to an empty 40GB qcow2 image with no
> backing file. The numbers are of course different but as you can see
> the patterns are similar:
>
> |-----------------+----------------+-----------------+-------------------|
> |  cluster size   | subclusters=on | subclusters=off | Max L2 cache size |
> |-----------------+----------------+-----------------+-------------------|
> |  2 MB (256 KB)  |   1200 IOPS    |     255 IOPS    |    160 KB         |
> | 512 KB (64 KB)  |   3000 IOPS    |     700 IOPS    |    640 KB         |
> |  64 KB (8 KB)   |   7200 IOPS    |    3300 IOPS    |      5 MB         |
> |  32 KB (4 KB)   |  12300 IOPS    |    4200 IOPS    |     10 MB         |
> |   4 KB (512 B)  |    100 IOPS    |     100 IOPS    |     80 MB         |
> |-----------------+----------------+-----------------+-------------------|

I don't understand why subclusters=on performs so much better when
there's no backing file. Is qcow2 zeroing out the 64 KB cluster with
subclusters=off? It ought to just write the 4 KB data when a new
cluster is touched. Therefore the performance should be very similar to
subclusters=on.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread
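The "Max L2 cache size" column in both tables follows directly from the formula given at the top of the thread, max_l2_size = virtual_disk_size * 8 / cluster_size; a quick sketch checking it against the 40 GB test image:

```python
# Each L2 entry is 8 bytes and maps one cluster, so covering the whole
# virtual disk needs (virtual_disk_size / cluster_size) entries.
def max_l2_size(virtual_disk_size: int, cluster_size: int) -> int:
    return virtual_disk_size * 8 // cluster_size

KiB, MiB, GiB = 2**10, 2**20, 2**30
disk = 40 * GiB
for cluster in (2 * MiB, 512 * KiB, 64 * KiB, 32 * KiB, 4 * KiB):
    size = max_l2_size(disk, cluster)
    print(f"{cluster // KiB:5d} KiB clusters -> {size // KiB} KiB of L2 tables")
# Prints 160, 640, 5120 (5 MB), 10240 (10 MB) and 81920 (80 MB) KiB,
# matching the "Max L2 cache size" column of the tables above.
```

The 128-bit entries of the subcluster proposal would double these numbers for a given cluster size, but since subclusters let you pick a much larger cluster size for the same COW granularity, the net cache requirement still goes down.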
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-07 12:20 ` [Qemu-devel] " Stefan Hajnoczi @ 2017-04-07 12:24 ` Alberto Garcia 2017-04-07 13:01 ` Kevin Wolf 1 sibling, 0 replies; 64+ messages in thread From: Alberto Garcia @ 2017-04-07 12:24 UTC (permalink / raw) To: Stefan Hajnoczi; +Cc: qemu-devel, qemu-block, Max Reitz, Kevin Wolf On Fri 07 Apr 2017 02:20:21 PM CEST, Stefan Hajnoczi wrote: >> Here are the results when writing to an empty 40GB qcow2 image with no >> backing file. The numbers are of course different but as you can see >> the patterns are similar: >> >> |-----------------+----------------+-----------------+-------------------| >> | cluster size | subclusters=on | subclusters=off | Max L2 cache size | >> |-----------------+----------------+-----------------+-------------------| >> | 2 MB (256 KB) | 1200 IOPS | 255 IOPS | 160 KB | >> | 512 KB (64 KB) | 3000 IOPS | 700 IOPS | 640 KB | >> | 64 KB (8 KB) | 7200 IOPS | 3300 IOPS | 5 MB | >> | 32 KB (4 KB) | 12300 IOPS | 4200 IOPS | 10 MB | >> | 4 KB (512 B) | 100 IOPS | 100 IOPS | 80 MB | >> |-----------------+----------------+-----------------+-------------------| > > I don't understand why subclusters=on performs so much better when > there's no backing file. Is qcow2 zeroing out the 64 KB cluster with > subclusters=off? It is, see https://lists.nongnu.org/archive/html/qemu-devel/2015-07/msg05877.html Berto ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-07 12:20 ` [Qemu-devel] " Stefan Hajnoczi 2017-04-07 12:24 ` Alberto Garcia @ 2017-04-07 13:01 ` Kevin Wolf 2017-04-10 15:32 ` Stefan Hajnoczi 1 sibling, 1 reply; 64+ messages in thread From: Kevin Wolf @ 2017-04-07 13:01 UTC (permalink / raw) To: Stefan Hajnoczi; +Cc: Alberto Garcia, qemu-devel, qemu-block, Max Reitz [-- Attachment #1: Type: text/plain, Size: 4218 bytes --] Am 07.04.2017 um 14:20 hat Stefan Hajnoczi geschrieben: > On Thu, Apr 06, 2017 at 06:01:48PM +0300, Alberto Garcia wrote: > > Here are the results (subcluster size in brackets): > > > > |-----------------+----------------+-----------------+-------------------| > > | cluster size | subclusters=on | subclusters=off | Max L2 cache size | > > |-----------------+----------------+-----------------+-------------------| > > | 2 MB (256 KB) | 440 IOPS | 100 IOPS | 160 KB (*) | > > | 512 KB (64 KB) | 1000 IOPS | 300 IOPS | 640 KB | > > | 64 KB (8 KB) | 3000 IOPS | 1000 IOPS | 5 MB | > > | 32 KB (4 KB) | 12000 IOPS | 1300 IOPS | 10 MB | > > | 4 KB (512 B) | 100 IOPS | 100 IOPS | 80 MB | > > |-----------------+----------------+-----------------+-------------------| > > > > (*) The L2 cache must be a multiple of the cluster > > size, so in this case it must be 2MB. On the table > > I chose to show how much of those 2MB are actually > > used so you can compare it with the other cases. > > > > Some comments about the results: > > > > - For the 64KB, 512KB and 2MB cases, having subclusters increases > > write performance roughly by three. This happens because for each > > cluster allocation there's less data to copy from the backing > > image. For the same reason, the smaller the cluster, the better the > > performance. As expected, 64KB clusters with no subclusters perform > > roughly the same as 512KB clusters with 64KB subclusters. > > > > - The 32KB case is the most interesting one. 
Without subclusters it's > > not very different from the 64KB case, but having a subcluster with > > the same size of the I/O block eliminates the need for COW entirely > > and the performance skyrockets (10 times faster!). > > > > - 4KB is however very slow. I attribute this to the fact that the > > cluster size is so small that a new cluster needs to be allocated > > for every single write and its refcount updated accordingly. The L2 > > and refcount tables are also so small that they are too inefficient > > and need to grow all the time. > > > > Here are the results when writing to an empty 40GB qcow2 image with no > > backing file. The numbers are of course different but as you can see > > the patterns are similar: > > > > |-----------------+----------------+-----------------+-------------------| > > | cluster size | subclusters=on | subclusters=off | Max L2 cache size | > > |-----------------+----------------+-----------------+-------------------| > > | 2 MB (256 KB) | 1200 IOPS | 255 IOPS | 160 KB | > > | 512 KB (64 KB) | 3000 IOPS | 700 IOPS | 640 KB | > > | 64 KB (8 KB) | 7200 IOPS | 3300 IOPS | 5 MB | > > | 32 KB (4 KB) | 12300 IOPS | 4200 IOPS | 10 MB | > > | 4 KB (512 B) | 100 IOPS | 100 IOPS | 80 MB | > > |-----------------+----------------+-----------------+-------------------| > > I don't understand why subclusters=on performs so much better when > there's no backing file. Is qcow2 zeroing out the 64 KB cluster with > subclusters=off? > > It ought to just write the 4 KB data when a new cluster is touched. > Therefore the performance should be very similar to subclusters=on. No, it can't do that. Nobody guarantees that the cluster contains only zeros when we don't write them. It could have been used before and then either freed on a qcow2 level or we could be sitting on a block device rather than a file. 
One optimisation that would be possible even without subclusters is making only a single I/O request to write the whole cluster instead of three of them (COW head, guest write, COW tail). Without a backing file, this improved performance almost to the level of rewrites, but it couldn't solve the problem when a backing file was used (which is the main use case for qcow2), so I never got around to submitting a patch for it. Kevin
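Kevin's three-part breakdown of a cluster allocation (COW head, guest write, COW tail) can be sketched as follows. This illustrates the geometry only, it is not QEMU code, and it assumes the guest write fits within a single cluster:

```python
# Illustrative sketch: when a write of [offset, offset+length) lands in
# a freshly allocated cluster, the head (before the write) and the tail
# (after it) must be filled by copy-on-write, either from the backing
# file or with zeroes. Assumes the write does not cross a cluster
# boundary.

def cow_regions(offset, length, cluster_size):
    """Return (head, tail) as (start, size) pairs: the parts of the
    cluster not covered by the guest write."""
    cluster_start = (offset // cluster_size) * cluster_size
    head = (cluster_start, offset - cluster_start)
    tail_start = offset + length
    tail = (tail_start, cluster_start + cluster_size - tail_start)
    return head, tail

# A 4 KB write at offset 20 KB into a 64 KB cluster: 20 KB of head COW
# and 40 KB of tail COW, i.e. 60 KB copied for 4 KB of guest data.
head, tail = cow_regions(20 * 1024, 4 * 1024, 64 * 1024)
print(head, tail)
```

With subclusters of the same size as the I/O block, both regions shrink to zero, which is why the 32 KB / 4 KB case in the tables performs so well.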
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-07 13:01 ` Kevin Wolf @ 2017-04-10 15:32 ` Stefan Hajnoczi 0 siblings, 0 replies; 64+ messages in thread From: Stefan Hajnoczi @ 2017-04-10 15:32 UTC (permalink / raw) To: Kevin Wolf; +Cc: Alberto Garcia, qemu-devel, qemu-block, Max Reitz [-- Attachment #1: Type: text/plain, Size: 4118 bytes --] On Fri, Apr 07, 2017 at 03:01:29PM +0200, Kevin Wolf wrote: > Am 07.04.2017 um 14:20 hat Stefan Hajnoczi geschrieben: > > On Thu, Apr 06, 2017 at 06:01:48PM +0300, Alberto Garcia wrote: > > > Here are the results (subcluster size in brackets): > > > > > > |-----------------+----------------+-----------------+-------------------| > > > | cluster size | subclusters=on | subclusters=off | Max L2 cache size | > > > |-----------------+----------------+-----------------+-------------------| > > > | 2 MB (256 KB) | 440 IOPS | 100 IOPS | 160 KB (*) | > > > | 512 KB (64 KB) | 1000 IOPS | 300 IOPS | 640 KB | > > > | 64 KB (8 KB) | 3000 IOPS | 1000 IOPS | 5 MB | > > > | 32 KB (4 KB) | 12000 IOPS | 1300 IOPS | 10 MB | > > > | 4 KB (512 B) | 100 IOPS | 100 IOPS | 80 MB | > > > |-----------------+----------------+-----------------+-------------------| > > > > > > (*) The L2 cache must be a multiple of the cluster > > > size, so in this case it must be 2MB. On the table > > > I chose to show how much of those 2MB are actually > > > used so you can compare it with the other cases. > > > > > > Some comments about the results: > > > > > > - For the 64KB, 512KB and 2MB cases, having subclusters increases > > > write performance roughly by three. This happens because for each > > > cluster allocation there's less data to copy from the backing > > > image. For the same reason, the smaller the cluster, the better the > > > performance. As expected, 64KB clusters with no subclusters perform > > > roughly the same as 512KB clusters with 64KB subclusters. > > > > > > - The 32KB case is the most interesting one. 
Without subclusters it's > > > not very different from the 64KB case, but having a subcluster with > > > the same size of the I/O block eliminates the need for COW entirely > > > and the performance skyrockets (10 times faster!). > > > > > > - 4KB is however very slow. I attribute this to the fact that the > > > cluster size is so small that a new cluster needs to be allocated > > > for every single write and its refcount updated accordingly. The L2 > > > and refcount tables are also so small that they are too inefficient > > > and need to grow all the time. > > > > > > Here are the results when writing to an empty 40GB qcow2 image with no > > > backing file. The numbers are of course different but as you can see > > > the patterns are similar: > > > > > > |-----------------+----------------+-----------------+-------------------| > > > | cluster size | subclusters=on | subclusters=off | Max L2 cache size | > > > |-----------------+----------------+-----------------+-------------------| > > > | 2 MB (256 KB) | 1200 IOPS | 255 IOPS | 160 KB | > > > | 512 KB (64 KB) | 3000 IOPS | 700 IOPS | 640 KB | > > > | 64 KB (8 KB) | 7200 IOPS | 3300 IOPS | 5 MB | > > > | 32 KB (4 KB) | 12300 IOPS | 4200 IOPS | 10 MB | > > > | 4 KB (512 B) | 100 IOPS | 100 IOPS | 80 MB | > > > |-----------------+----------------+-----------------+-------------------| > > > > I don't understand why subclusters=on performs so much better when > > there's no backing file. Is qcow2 zeroing out the 64 KB cluster with > > subclusters=off? > > > > It ought to just write the 4 KB data when a new cluster is touched. > > Therefore the performance should be very similar to subclusters=on. > > No, it can't do that. Nobody guarantees that the cluster contains only > zeros when we don't write them. It could have been used before and then > either freed on a qcow2 level or we could be sitting on a block device > rather than a file. 
I thought we had the no-op optimization for clusters allocated at the end of a POSIX file. All the more reason to add sub-clusters! Stefan
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-06 15:01 [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation Alberto Garcia 2017-04-06 16:40 ` Eric Blake 2017-04-07 12:20 ` [Qemu-devel] " Stefan Hajnoczi @ 2017-04-07 17:10 ` Max Reitz 2017-04-10 8:42 ` Kevin Wolf 2017-04-11 12:56 ` Alberto Garcia 2017-04-12 16:54 ` Denis V. Lunev 2017-04-12 17:55 ` Denis V. Lunev 4 siblings, 2 replies; 64+ messages in thread From: Max Reitz @ 2017-04-07 17:10 UTC (permalink / raw) To: Alberto Garcia, qemu-devel; +Cc: qemu-block, Stefan Hajnoczi, Kevin Wolf [-- Attachment #1: Type: text/plain, Size: 15746 bytes --] On 06.04.2017 17:01, Alberto Garcia wrote: > Hi all, > > over the past couple of months I discussed with some of you the > possibility to extend the qcow2 format in order to improve its > performance and reduce its memory requirements (particularly with very > large images). > > After some discussion in the mailing list and the #qemu IRC channel I > decided to write a prototype of a new extension for qcow2 so I could > understand better the scope of the changes and have some preliminary > data about its effects. > > This e-mail is the formal presentation of my proposal to extend the > on-disk qcow2 format. As you can see this is still an RFC. Due to the > nature of the changes I would like to get as much feedback as possible > before going forward. > > === Problem === > > The original problem that I wanted to address is the memory > requirements of qcow2 files if you want to have very large images and > still keep good I/O performance. This is a consequence of its very > structure, which I'm going to describe now. > > A qcow2 image is divided into units of constant size called clusters, > and among other things it contains metadata that maps guest addresses > to host addresses (the so-called L1 and L2 tables). 
> > There are two basic problems that result from this: > > 1) Reading from or writing to a qcow2 image involves reading the > corresponding entry on the L2 table that maps the guest address to > the host address. This is very slow because it involves two I/O > operations: one on the L2 table and the other one on the actual > data cluster. > > 2) A cluster is the smallest unit of allocation. Therefore writing a > mere 512 bytes to an empty disk requires allocating a complete > cluster and filling it with zeroes (or with data from the backing > image if there is one). This wastes more disk space and also has a > negative impact on I/O. > > Problem (1) can be solved by keeping in memory a cache of the L2 > tables (QEMU has an "l2_cache_size" parameter for this purpose). The > amount of disk space used by L2 tables depends on two factors: the > disk size and the cluster size. > > The cluster size can be configured when the image is created, and it > can be any power of two between 512 bytes and 2 MB (it defaults to 64 > KB). > > The maximum amount of space needed for the L2 tables can be calculated > with the following formula: > > max_l2_size = virtual_disk_size * 8 / cluster_size > > Large images require a large amount of metadata, and therefore a large > amount of memory for the L2 cache. With the default cluster size > (64KB) that's 128MB of L2 cache for a 1TB qcow2 image. > > The only way to reduce the size of the L2 tables is therefore > increasing the cluster size, but this makes the image less efficient > as described earlier in (2). > > === The proposal === > > The idea of this proposal is to extend the qcow2 format by allowing > subcluster allocation. There would be an optional feature that would > need to be enabled when creating the image. The on-disk format would > remain essentially the same, except that each data cluster would be > internally divided into a number of subclusters of equal size. 
> > What this means in practice is that each entry on an L2 table would be > accompanied by a bitmap indicating the allocation state of each one of > the subclusters for that cluster. There are several alternatives for > storing the bitmap, described below. > > Other than L2 entries, all other data structures would remain > unchanged, but for data clusters the smallest unit of allocation would > now be the subcluster. Reference counting would still be at the > cluster level, because there's no way to reference individual > subclusters. Copy-on-write on internal snapshots would need to copy > complete clusters, so that scenario would not benefit from this > change. > > I see two main use cases for this feature: > > a) The qcow2 image is not too large / the L2 cache is not a problem, > but you want to increase the allocation performance. In this case > you can have something like a 128KB cluster with 4KB subclusters > (with 4KB being a common block size in ext4 and other filesystems) > > b) The qcow2 image is very large and you want to save metadata space > in order to have a smaller L2 cache. In this case you can go for > the maximum cluster size (2MB) but you want to have smaller > subclusters to increase the allocation performance and optimize the > disk usage. This was actually my original use case. > > === Test results === > > I have a basic working prototype of this. It's still incomplete -and > buggy :)- but it gives an idea of what we can expect from it. In my > implementation each data cluster has 8 subclusters, but that's not set > in stone (see below). > > I made all tests on an SSD drive, writing to an empty qcow2 image with > a fully populated 40GB backing image, performing random writes using > fio with a block size of 4KB. > > I tried with the default and maximum cluster sizes (64KB and 2MB) and > also with some other sizes. I also made sure to try with 32KB clusters > so the subcluster size matches the 4KB block size used for the I/O. 
> > It's important to point out that once a cluster has been completely > allocated then having subclusters offers no performance benefit. For > this reason the size of the image for these tests (40GB) was chosen to > be large enough to guarantee that there are always new clusters being > allocated. This is therefore a worst-case scenario (or best-case for > this feature, if you want). > > Here are the results (subcluster size in brackets): > > |-----------------+----------------+-----------------+-------------------| > | cluster size | subclusters=on | subclusters=off | Max L2 cache size | > |-----------------+----------------+-----------------+-------------------| > | 2 MB (256 KB) | 440 IOPS | 100 IOPS | 160 KB (*) | > | 512 KB (64 KB) | 1000 IOPS | 300 IOPS | 640 KB | > | 64 KB (8 KB) | 3000 IOPS | 1000 IOPS | 5 MB | > | 32 KB (4 KB) | 12000 IOPS | 1300 IOPS | 10 MB | > | 4 KB (512 B) | 100 IOPS | 100 IOPS | 80 MB | > |-----------------+----------------+-----------------+-------------------| > > (*) The L2 cache must be a multiple of the cluster > size, so in this case it must be 2MB. On the table > I chose to show how much of those 2MB are actually > used so you can compare it with the other cases. > > Some comments about the results: > > - For the 64KB, 512KB and 2MB cases, having subclusters increases > write performance roughly by three. This happens because for each > cluster allocation there's less data to copy from the backing > image. For the same reason, the smaller the cluster, the better the > performance. As expected, 64KB clusters with no subclusters perform > roughly the same as 512KB clusters with 64KB subclusters. > > - The 32KB case is the most interesting one. Without subclusters it's > not very different from the 64KB case, but having a subcluster with > the same size of the I/O block eliminates the need for COW entirely > and the performance skyrockets (10 times faster!). > > - 4KB is however very slow. 
I attribute this to the fact that the > cluster size is so small that a new cluster needs to be allocated > for every single write and its refcount updated accordingly. The L2 > and refcount tables are also so small that they are too inefficient > and need to grow all the time. > > Here are the results when writing to an empty 40GB qcow2 image with no > backing file. The numbers are of course different but as you can see > the patterns are similar: > > |-----------------+----------------+-----------------+-------------------| > | cluster size | subclusters=on | subclusters=off | Max L2 cache size | > |-----------------+----------------+-----------------+-------------------| > | 2 MB (256 KB) | 1200 IOPS | 255 IOPS | 160 KB | > | 512 KB (64 KB) | 3000 IOPS | 700 IOPS | 640 KB | > | 64 KB (8 KB) | 7200 IOPS | 3300 IOPS | 5 MB | > | 32 KB (4 KB) | 12300 IOPS | 4200 IOPS | 10 MB | > | 4 KB (512 B) | 100 IOPS | 100 IOPS | 80 MB | > |-----------------+----------------+-----------------+-------------------| > > === Changes to the on-disk format === > > The qcow2 on-disk format needs to change so each L2 entry has a bitmap > indicating the allocation status of each subcluster. There are three > possible states (unallocated, allocated, all zeroes), so we need two > bits per subcluster. You don't need two bits, you need log(3) / log(2) = ld(3) ≈ 1.58. You can encode the status of eight subclusters (3^8 = 6561) in 13 bits (ld(6561) ≈ 12.68). (Why not write it in Malbolge? I guarantee it will be much simpler given this is ternary information.) 
> An L2 entry is 64 bits wide, and this is the current format (for > uncompressed clusters): > > 63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0 > 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 > **<----> <--------------------------------------------------><------->* > Rsrved host cluster offset of data Reserved > (6 bits) (47 bits) (8 bits) > > bit 63: refcount == 1 (QCOW_OFLAG_COPIED) > bit 62: compressed = 1 (QCOW_OFLAG_COMPRESSED) > bit 0: all zeros (QCOW_OFLAG_ZERO) > > I thought of three alternatives for storing the subcluster bitmaps. I > haven't made my mind completely about which one is the best one, so > I'd like to present all three for discussion. Here they are: > > (1) Storing the bitmap inside the 64-bit entry > > This is a simple alternative and is the one that I chose for my > prototype. There are 14 unused bits plus the "all zeroes" one. If > we steal one from the host offset we have the 16 bits that we need > for the bitmap and we have 46 bits left for the host offset, which > is more than enough. No need to steal because you only need 13 bits (apart from what Eric wrote). > > * Pros: > + Simple. Few changes compared to the current qcow2 format. > > * Cons: > - Only 8 subclusters per cluster. We would not be making the > most of this feature. I agree on this one, though. One case I'd be especially interested in are of course 4 kB subclusters for 64 kB clusters (because 4 kB is a usual page size and can be configured to be the block size of a guest device; and because 64 kB simply is the standard cluster size of qcow2 images nowadays[1]...). Then we'd have 16 subclusters which would require 26 bits. Unfortunately, with 64 kB clusters we'd only have 22 bits available (including the is-zero flag)... So this is not going to fly. (We could even get one more bit if we had a subcluster-flag, because I guess we can always assume subclustered clusters to have OFLAG_COPIED and be uncompressed. But still, three bits missing.) 
If course, if you'd be willing to give up the all-zeroes state for subclusters, it would be enough... > - No reserved bits left for the future. One bit left (or more, see Eric). :-) > (2) Making L2 entries 128-bit wide. > > In this alternative we would double the size of L2 entries. The > first half would remain unchanged and the second one would store > the bitmap. That would leave us with 32 subclusters per cluster. > > * Pros: > + More subclusters per cluster. We could have images with > e.g. 128k clusters with 4k subclusters. > > * Cons: > - More space needed for L2 entries. The same cluster size would > require a cache twice as large, although having subcluster > allocation would compensate for this. > > - More changes to the code to handle 128-bit entries. > > - We would still be wasting the 14 reserved bits that L2 entries > have. Not sure I like this too much, but maybe it performs better than (3). Also, Kevin doesn't like (3) and neither do you anymore... > (3) Storing the bitmap somewhere else > > This would involve storing the bitmap separate from the L2 tables > (perhaps using the bitmaps extension? I haven't looked much into > this). (not really) > * Pros: > + Possibility to make the number of subclusters configurable > by the user (32, 64, 128, ...) Yes indeed! > + All existing metadata structures would remain untouched > (although the "all zeroes" bit in L2 entries would probably > become unused). Well, sounds nice but it's still an incompatible change, so I'm not sure whether this is really a benefit. I guess it means our code stays simpler. > * Cons: > - As with alternative (2), more space needed for metadata. Yeah, but just one 1/4 more if you have 8 subclusters or 1/2 more if you take my favorite 64 kB/4 kB configuration. Not too bad. (This is worse for (2). It'll always be double or more.) > - The bitmap would also need to be cached for performance > reasons. > > - Possibly one more *_cache_size option. I don't think so. 
There is no reason we shouldn't cache all the entries whose corresponding L2 table is loaded. (Or I can't see one, at least. At least once we can cache partial clusters; if the subcluster information for an L2 table only takes 1/4th of the space (and thus 1/4th of a cluster) we only need to cache that much data. We don't need to cache the adjacent subcluster information blocks if their L2 tables aren't cached, too.) > - One more metadata structure to be updated for each > allocation. This would probably impact I/O negatively. Needs benchmarks. :-) I don't think it would be too bad on an SSD. On an HDD it means additional seeking, yes. (Unless you're clever: If you have 1/2th overhead, you could put the subcluster information between the corresponding L2 tables (same for every multiple of 1/2 that isn't also a multiple of 1). For multiples of 1, you just put it into a cluster adjacent to the L2 table. (OTOH, the fact that you effectively need to at least double the L2 size for (2) means that you need to write more data, so (2) might actually perform worse than (3) if you would be all fine with just 16 subclusters (at least on an SSD).) By the way, if you'd only allow multiple of 1s overhead (i.e. multiples of 32 subclusters), I think (3) would be pretty much the same as (2) if you just always write the subcluster information adjacent to the L2 table. Should be just the same caching-wise and performance-wise. Max > === Compressed clusters === > > My idea is that compressed clusters would remain the same. They are > read-only anyway so they would not be affected by any of these > changes. > > =========================== > > I think I managed to summarize all the ideas that I had in my mind, > but I'm sure you probably have questions and comments, so I'd be happy > to get as much feedback as possible. > > So, looking forward to reading your opinions. 
> > Regards, > > Berto >
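The 64-bit L2 entry layout quoted in the message above (QCOW_OFLAG_COPIED in bit 63, QCOW_OFLAG_COMPRESSED in bit 62, QCOW_OFLAG_ZERO in bit 0, and a 47-bit host cluster offset in bits 9-55) can be sketched as a small decoder. The flag values follow the layout given in the message; the helper itself is illustrative, not QEMU code:

```python
# Sketch of decoding the current 64-bit L2 entry layout for
# uncompressed clusters, as described in the proposal.

QCOW_OFLAG_COPIED     = 1 << 63   # refcount == 1
QCOW_OFLAG_COMPRESSED = 1 << 62   # cluster is compressed
QCOW_OFLAG_ZERO       = 1 << 0    # cluster reads as all zeroes
L2E_OFFSET_MASK       = 0x00fffffffffffe00  # 47-bit host offset, bits 9..55

def decode_l2_entry(entry):
    return {
        "copied":      bool(entry & QCOW_OFLAG_COPIED),
        "compressed":  bool(entry & QCOW_OFLAG_COMPRESSED),
        "zero":        bool(entry & QCOW_OFLAG_ZERO),
        "host_offset": entry & L2E_OFFSET_MASK,
    }

# A writable, uncompressed cluster at host offset 0x50000:
entry = QCOW_OFLAG_COPIED | 0x50000
print(decode_l2_entry(entry))
```

Alternative (1) would carve the subcluster bitmap out of the reserved bits (and possibly the top of the offset field) of this same 64-bit word.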
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-07 17:10 ` Max Reitz @ 2017-04-10 8:42 ` Kevin Wolf 2017-04-10 15:03 ` Max Reitz 2017-04-11 12:56 ` Alberto Garcia 1 sibling, 1 reply; 64+ messages in thread From: Kevin Wolf @ 2017-04-10 8:42 UTC (permalink / raw) To: Max Reitz; +Cc: Alberto Garcia, qemu-devel, qemu-block, Stefan Hajnoczi [-- Attachment #1: Type: text/plain, Size: 859 bytes --] Am 07.04.2017 um 19:10 hat Max Reitz geschrieben: > One case I'd be especially interested in are of course 4 kB subclusters > for 64 kB clusters (because 4 kB is a usual page size and can be > configured to be the block size of a guest device; and because 64 kB > simply is the standard cluster size of qcow2 images nowadays[1]...). Why should the current default cluster size be an argument for anything? 64k is a tradeoff between small COW size and large allocation granularity that seemed to work well (and it used to be the maximum cluster size originally, so it seemed safer than 128k). With subclusters, we have a completely different situation and need to find a new default that works best. I'm relatively sure that 64k are too small under the new conditions. Also, undefined reference: [1] (I was hoping to find a better argument there...) Kevin [-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-10 8:42 ` Kevin Wolf @ 2017-04-10 15:03 ` Max Reitz 0 siblings, 0 replies; 64+ messages in thread From: Max Reitz @ 2017-04-10 15:03 UTC (permalink / raw) To: Kevin Wolf; +Cc: Alberto Garcia, qemu-devel, qemu-block, Stefan Hajnoczi [-- Attachment #1: Type: text/plain, Size: 1066 bytes --] On 10.04.2017 10:42, Kevin Wolf wrote: > Am 07.04.2017 um 19:10 hat Max Reitz geschrieben: >> One case I'd be especially interested in are of course 4 kB subclusters >> for 64 kB clusters (because 4 kB is a usual page size and can be >> configured to be the block size of a guest device; and because 64 kB >> simply is the standard cluster size of qcow2 images nowadays[1]...). > > Why should the current default cluster size be an argument for anything? > 64k is a tradeoff between small COW size and large allocation > granularity that seemed to work well (and it used to be the maximum > cluster size originally, so it seemed safer than 128k). > > With subclusters, we have a completely different situation and need to > find a new default that works best. I'm relatively sure that 64k are too > small under the new conditions. > > Also, undefined reference: [1] (I was hoping to find a better argument > there...) I meant [1] to be basically what you just said. Well, the thing is that 64 kB is still better than 32 kB. :-P Max [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 512 bytes --] ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-07 17:10 ` Max Reitz 2017-04-10 8:42 ` Kevin Wolf @ 2017-04-11 12:56 ` Alberto Garcia 2017-04-11 14:04 ` Max Reitz 1 sibling, 1 reply; 64+ messages in thread From: Alberto Garcia @ 2017-04-11 12:56 UTC (permalink / raw) To: Max Reitz, qemu-devel; +Cc: qemu-block, Stefan Hajnoczi, Kevin Wolf On Fri 07 Apr 2017 07:10:46 PM CEST, Max Reitz wrote: >> === Changes to the on-disk format === >> >> The qcow2 on-disk format needs to change so each L2 entry has a bitmap >> indicating the allocation status of each subcluster. There are three >> possible states (unallocated, allocated, all zeroes), so we need two >> bits per subcluster. > > You don't need two bits, you need log(3) / log(2) = ld(3) ≈ 1.58. You > can encode the status of eight subclusters (3^8 = 6561) in 13 bits > (ld(6561) ≈ 12.68). Right, although that would make the encoding more cumbersome to use and to debug. Is it worth it? > One case I'd be especially interested in are of course 4 kB > subclusters for 64 kB clusters (because 4 kB is a usual page size and > can be configured to be the block size of a guest device; and because > 64 kB simply is the standard cluster size of qcow2 images > nowadays[1]...). I think that we should have at least that, but ideally larger cluster-to-subcluster ratios. > (We could even get one more bit if we had a subcluster-flag, because I > guess we can always assume subclustered clusters to have OFLAG_COPIED > and be uncompressed. But still, three bits missing.) Why can we always assume OFLAG_COPIED? > If course, if you'd be willing to give up the all-zeroes state for > subclusters, it would be enough... I still think that it looks like a better idea to allow having more subclusters, but giving up the all-zeroes state is a valid alternative. Apart from having to overwrite with zeroes when a subcluster is discarded, is there anything else that we would miss? 
> By the way, if you'd only allow multiple of 1s overhead > (i.e. multiples of 32 subclusters), I think (3) would be pretty much > the same as (2) if you just always write the subcluster information > adjacent to the L2 table. Should be just the same caching-wise and > performance-wise. Then (3) is effectively the same as (2), just that the subcluster bitmaps are at the end of the L2 cluster, and not next to each entry. Berto
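For comparison, the straightforward encoding from alternative (1), two bits per subcluster in a 16-bit bitmap covering 8 subclusters, looks like this. The helper names are made up for illustration, and where the bitmap sits inside the L2 entry is deliberately left out:

```python
# Sketch of the 2-bits-per-subcluster bitmap from alternative (1):
# 8 subclusters x 2 bits = 16 bits, three states per subcluster.

UNALLOCATED, ALLOCATED, ALL_ZEROES = 0, 1, 2

def get_subcluster_state(bitmap, index):
    """Read the 2-bit state of subcluster `index` (0..7)."""
    return (bitmap >> (2 * index)) & 0b11

def set_subcluster_state(bitmap, index, state):
    """Return the bitmap with subcluster `index` set to `state`."""
    bitmap &= ~(0b11 << (2 * index))
    return bitmap | (state << (2 * index))

bitmap = 0
bitmap = set_subcluster_state(bitmap, 0, ALLOCATED)
bitmap = set_subcluster_state(bitmap, 3, ALL_ZEROES)
assert get_subcluster_state(bitmap, 0) == ALLOCATED
assert get_subcluster_state(bitmap, 3) == ALL_ZEROES
```

Dropping the all-zeroes state, as discussed above, would halve this to one bit per subcluster and double the number of subclusters that fit in the same space.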
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
From: Max Reitz @ 2017-04-11 14:04 UTC
To: Alberto Garcia, qemu-devel; +Cc: qemu-block, Stefan Hajnoczi, Kevin Wolf

On 11.04.2017 14:56, Alberto Garcia wrote:
> On Fri 07 Apr 2017 07:10:46 PM CEST, Max Reitz wrote:
>>> === Changes to the on-disk format ===
>>>
>>> The qcow2 on-disk format needs to change so each L2 entry has a bitmap
>>> indicating the allocation status of each subcluster. There are three
>>> possible states (unallocated, allocated, all zeroes), so we need two
>>> bits per subcluster.
>>
>> You don't need two bits, you need log(3) / log(2) = ld(3) ≈ 1.58. You
>> can encode the status of eight subclusters (3^8 = 6561) in 13 bits
>> (ld(6561) ≈ 12.68).
>
> Right, although that would make the encoding more cumbersome to use and
> to debug. Is it worth it?

Probably not, considering this is probably not the way we want to go
anyway.

>> One case I'd be especially interested in is of course 4 kB
>> subclusters for 64 kB clusters (because 4 kB is a usual page size and
>> can be configured to be the block size of a guest device; and because
>> 64 kB simply is the standard cluster size of qcow2 images
>> nowadays[1]...).
>
> I think that we should have at least that, but ideally larger
> cluster-to-subcluster ratios.
>
>> (We could even get one more bit if we had a subcluster-flag, because I
>> guess we can always assume subclustered clusters to have OFLAG_COPIED
>> and be uncompressed. But still, three bits missing.)
>
> Why can we always assume OFLAG_COPIED?

Because partially allocated clusters cannot be used with internal
snapshots, and that is what OFLAG_COPIED is for.

>> Of course, if you'd be willing to give up the all-zeroes state for
>> subclusters, it would be enough...
>
> I still think that it looks like a better idea to allow having more
> subclusters, but giving up the all-zeroes state is a valid
> alternative. Apart from having to overwrite with zeroes when a
> subcluster is discarded, is there anything else that we would miss?

If it's a real discard you can just discard it (which is what we do
for compat=0.10 images anyway); but zero-writes will then have to
become real writes, yes.

>> By the way, if you'd only allow multiple of 1s overhead
>> (i.e. multiples of 32 subclusters), I think (3) would be pretty much
>> the same as (2) if you just always write the subcluster information
>> adjacent to the L2 table. Should be just the same caching-wise and
>> performance-wise.
>
> Then (3) is effectively the same as (2), just that the subcluster
> bitmaps are at the end of the L2 cluster, and not next to each entry.

Exactly. But it's a difference in implementation, as you won't have to
worry about having changed the L2 table layout; maybe that's a benefit.

Max
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
From: Alberto Garcia @ 2017-04-11 14:31 UTC
To: Max Reitz, qemu-devel; +Cc: qemu-block, Stefan Hajnoczi, Kevin Wolf

On Tue 11 Apr 2017 04:04:53 PM CEST, Max Reitz wrote:
>>> (We could even get one more bit if we had a subcluster-flag, because I
>>> guess we can always assume subclustered clusters to have OFLAG_COPIED
>>> and be uncompressed. But still, three bits missing.)
>>
>> Why can we always assume OFLAG_COPIED?
>
> Because partially allocated clusters cannot be used with internal
> snapshots, and that is what OFLAG_COPIED is for.

Why can't they be used?

>>> Of course, if you'd be willing to give up the all-zeroes state for
>>> subclusters, it would be enough...
>>
>> I still think that it looks like a better idea to allow having more
>> subclusters, but giving up the all-zeroes state is a valid
>> alternative. Apart from having to overwrite with zeroes when a
>> subcluster is discarded, is there anything else that we would miss?
>
> If it's a real discard you can just discard it (which is what we do
> for compat=0.10 images anyway); but zero-writes will then have to
> become real writes, yes.

Perhaps we can give up that bit for subclusters then, that would allow
us to double their number. We would still have the zero flag at the
cluster level. Opinions on this, anyone?

>>> By the way, if you'd only allow multiple of 1s overhead
>>> (i.e. multiples of 32 subclusters), I think (3) would be pretty much
>>> the same as (2) if you just always write the subcluster information
>>> adjacent to the L2 table. Should be just the same caching-wise and
>>> performance-wise.
>>
>> Then (3) is effectively the same as (2), just that the subcluster
>> bitmaps are at the end of the L2 cluster, and not next to each entry.
>
> Exactly. But it's a difference in implementation, as you won't have to
> worry about having changed the L2 table layout; maybe that's a
> benefit.

I'm not sure if that would simplify or complicate things, but it's worth
considering.

Berto
* Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation
From: Eric Blake @ 2017-04-11 14:45 UTC
To: Alberto Garcia, Max Reitz, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block

On 04/11/2017 09:31 AM, Alberto Garcia wrote:
> On Tue 11 Apr 2017 04:04:53 PM CEST, Max Reitz wrote:
>>>> (We could even get one more bit if we had a subcluster-flag, because I
>>>> guess we can always assume subclustered clusters to have OFLAG_COPIED
>>>> and be uncompressed. But still, three bits missing.)
>>>
>>> Why can we always assume OFLAG_COPIED?
>>
>> Because partially allocated clusters cannot be used with internal
>> snapshots, and that is what OFLAG_COPIED is for.
>
> Why can't they be used?

An internal snapshot causes a COW to happen if another write happens
anywhere in the cluster. Setting OFLAG_COPIED is a shorthand for
whether the COW must happen, but it is always possible (but slower) to
refer back to the refcount to learn the same information. If we have a
cluster with missing subclusters, and need to do a COW, we are already
reading from the backing file - so we might as well populate the
missing subclusters of the original cluster at the time we write the
new updated cluster, at which point we no longer need to mark the
cluster as using subclusters.

Or we could state that the action of creating an internal snapshot
takes longer, because it fully populates all partially-populated
clusters (taking an internal snapshot is something that is not done
frequently, after all, as we've gradually been trying to steer users
to external snapshots) - or we could even go so far as to state that
internal snapshots and subclusters are incompatible (you can't use
both features at the same time). It may be possible to make
OFLAG_COPIED and subclusters work usefully together, but the point
being made here is that because we're already changing design
principles, we don't necessarily have to burn a bit on OFLAG_COPIED if
subclusters are in use.

> Perhaps we can give up that bit for subclusters then, that would allow
> us to double their number. We would still have the zero flag at the
> cluster level. Opinions on this, anyone?

I already think we're leaning away from option 1, even though the above
conversation was about how to pack more state into just 64 bits if we
wanted to stick with option 1. If we use option 2 or 3, it may still
be worth burning a bit in the original 64 bits that says whether the
secondary subcluster 64 bits are valid (it might speed up some
operations if we only have to do one 64-bit read and realize that the
entire cluster is uniform, compared to operations where the subcluster
flag is set so we have to do a second 64-bit read to learn about the
state of each subcluster).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
* Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation
From: Alberto Garcia @ 2017-04-12 12:41 UTC
To: Eric Blake, Max Reitz, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block

On Tue 11 Apr 2017 04:45:29 PM CEST, Eric Blake wrote:
>>>>> (We could even get one more bit if we had a subcluster-flag,
>>>>> because I guess we can always assume subclustered clusters to have
>>>>> OFLAG_COPIED and be uncompressed. But still, three bits missing.)
>>>>
>>>> Why can we always assume OFLAG_COPIED?
>>>
>>> Because partially allocated clusters cannot be used with internal
>>> snapshots, and that is what OFLAG_COPIED is for.
>>
>> Why can't they be used?
>
> An internal snapshot causes a COW to happen if another write happens
> anywhere in the cluster. Setting OFLAG_COPIED is a shorthand for
> whether the COW must happen, but it is always possible (but slower) to
> refer back to the refcount to learn the same information. If we have
> a cluster with missing subclusters, and need to do a COW, we are
> already reading from the backing file - so we might as well populate
> the missing subclusters of the original cluster at the time we write
> the new updated cluster, at which point we no longer need to mark the
> cluster as using subclusters.

I still don't see why we can always assume OFLAG_COPIED. Before doing
the COW one cluster can have references from multiple snapshots, and
OFLAG_COPIED is equally valid in that context. We still need to know if
we need to perform COW when writing to it, regardless of whether it has
subclusters or not.

Also, when doing a COW for an internal snapshot we definitely have to
duplicate the whole cluster, but do we really need to read from the
backing file? Can't we leave the missing subclusters unallocated in the
copy?

> If we use option 2 or 3, it may still be worth burning a bit in the
> original 64 bits that says whether the subcluster secondary 64-bits is
> valid (it might speed up some operations if we only have to do one
> 64-bit read and realize that the entire cluster is uniform, compared
> to operations where the subcluster flag is set so we have to do a
> second 64-bit read to learn about the state of each subcluster).

Yeah, that might be a good idea.

Berto
* Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation
From: Max Reitz @ 2017-04-12 14:10 UTC
To: Alberto Garcia, Eric Blake, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block

On 12.04.2017 14:41, Alberto Garcia wrote:
> On Tue 11 Apr 2017 04:45:29 PM CEST, Eric Blake wrote:
>>>>>> (We could even get one more bit if we had a subcluster-flag,
>>>>>> because I guess we can always assume subclustered clusters to have
>>>>>> OFLAG_COPIED and be uncompressed. But still, three bits missing.)
>>>>>
>>>>> Why can we always assume OFLAG_COPIED?
>>>>
>>>> Because partially allocated clusters cannot be used with internal
>>>> snapshots, and that is what OFLAG_COPIED is for.
>>>
>>> Why can't they be used?
>>
>> An internal snapshot causes a COW to happen if another write happens
>> anywhere in the cluster. Setting OFLAG_COPIED is a shorthand for
>> whether the COW must happen, but it is always possible (but slower) to
>> refer back to the refcount to learn the same information. If we have
>> a cluster with missing subclusters, and need to do a COW, we are
>> already reading from the backing file - so we might as well populate
>> the missing subclusters of the original cluster at the time we write
>> the new updated cluster, at which point we no longer need to mark the
>> cluster as using subclusters.
>
> I still don't see why we can always assume OFLAG_COPIED. Before doing
> the COW one cluster can have references from multiple snapshots,

Yes...

> and OFLAG_COPIED is equally valid in that context.

In what context? When having subclusters? Why?

> We still need to know if we need to perform COW when writing to it,
> regardless of whether it has subclusters or not.

But why would you reference a cluster multiple times if it has
subclusters? Yes, you can do it in theory, but we could just disallow
it, because it doesn't make sense.

As I've said before, there may be uses for clusters to be referenced
multiple times other than internal snapshots; but right now we only use
it for internal snapshots, and I don't think that disallowing multiple
references for subclustered clusters would be horrible for any of the
other use cases.

> Also, when doing a COW for an internal snapshot we definitely have to
> duplicate the whole cluster, but do we really need to read from the
> backing file? Can't we leave the missing subclusters unallocated in
> the copy?

Can't we just disallow !OFLAG_COPIED for subclustered clusters?

To me it seems you implicitly assume we would want to allow
!OFLAG_COPIED, and based on that you argue how we could do so and what
semantics it would bear. However, I fail to see why we would want that
feature at all.

Max
* Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation
From: Alberto Garcia @ 2017-04-13 8:05 UTC
To: Max Reitz, Eric Blake, qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block

On Wed 12 Apr 2017 04:10:46 PM CEST, Max Reitz <mreitz@redhat.com> wrote:
>> I still don't see why we can always assume OFLAG_COPIED. Before doing
>> the COW one cluster can have references from multiple snapshots,
>
> Yes...
>
>> and OFLAG_COPIED is equally valid in that context.
>
> In what context? When having subclusters? Why?
>
>> We still need to know if we need to perform COW when writing to it,
>> regardless of whether it has subclusters or not.
>
> But why would you reference a cluster multiple times if it has
> subclusters? Yes, you can do it in theory, but we could just disallow
> it, because it doesn't make sense.

We discussed this yesterday on IRC, but I'm writing the summary here for
reference:

if you want to be able to create internal snapshots quickly without
having to copy a lot of data, you need to be able to have multiple
references to one cluster.

Anyway, it seems that this discussion is only relevant if we're trying
to save as many bits as possible because we want to store everything in
a 64-bit entry -alternative (1) from my original e-mail-. I agree that
for that alternative the usefulness of that bit can be put into
question.

Berto
* Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation
From: Kevin Wolf @ 2017-04-13 9:02 UTC
To: Alberto Garcia; +Cc: Max Reitz, Eric Blake, qemu-devel, Stefan Hajnoczi, qemu-block

Am 13.04.2017 um 10:05 hat Alberto Garcia geschrieben:
> On Wed 12 Apr 2017 04:10:46 PM CEST, Max Reitz <mreitz@redhat.com> wrote:
> >> I still don't see why we can always assume OFLAG_COPIED. Before doing
> >> the COW one cluster can have references from multiple snapshots,
> >
> > Yes...
> >
> >> and OFLAG_COPIED is equally valid in that context.
> >
> > In what context? When having subclusters? Why?
> >
> >> We still need to know if we need to perform COW when writing to it,
> >> regardless of whether it has subclusters or not.
> >
> > But why would you reference a cluster multiple times if it has
> > subclusters? Yes, you can do it in theory, but we could just disallow
> > it, because it doesn't make sense.
>
> We discussed this yesterday on IRC, but I'm writing the summary here
> for reference:
>
> if you want to be able to create internal snapshots quickly without
> having to copy a lot of data, you need to be able to have multiple
> references to one cluster.
>
> Anyway, it seems that this discussion is only relevant if we're trying
> to save as many bits as possible because we want to store everything
> in a 64-bit entry -alternative (1) from my original e-mail-. I agree
> that for that alternative the usefulness of that bit can be put into
> question.

I think you still need it if you don't want to look at the refcount
blocks for every write. When you take an internal snapshot, you just
increase the refcount of the L2 tables at first and keep the contents
the same, including the subcluster information.

On the first write to the cluster, like with normal images you need to
copy the whole cluster, and whether this is the case is determined with
the COPIED flag. (The copy can actually keep unallocated/zeroed
subclusters that way, but data has to be copied for the whole cluster.)

Of course, in all cases the COPIED flag is just an optimisation because
it's always possible to look at the refcount blocks, but I don't think
there is any difference between subclustered and traditional images in
this respect.

Kevin
* Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation
From: Alberto Garcia @ 2017-04-13 9:05 UTC
To: Kevin Wolf; +Cc: Max Reitz, Eric Blake, qemu-devel, Stefan Hajnoczi, qemu-block

On Thu 13 Apr 2017 11:02:10 AM CEST, Kevin Wolf <kwolf@redhat.com> wrote:
> I think you still need it if you don't want to look at the refcount
> blocks for every write. When you take an internal snapshot, you just
> increase the refcount of the L2 tables at first and keep the contents
> the same, including the subcluster information. On the first write to
> the cluster, like with normal images you need to copy the whole
> cluster, and whether this is the case is determined with the COPIED
> flag. (The copy can actually keep unallocated/zeroed subclusters that
> way, but data has to be copied for the whole cluster.)
>
> Of course, in all cases the COPIED flag is just an optimisation
> because it's always possible to look at the refcount blocks, but I
> don't think there is any difference between subclustered and
> traditional images in this respect.

Exactly :)

Berto
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
From: Kevin Wolf @ 2017-04-11 14:49 UTC
To: Alberto Garcia; +Cc: Max Reitz, qemu-devel, qemu-block, Stefan Hajnoczi

Am 11.04.2017 um 16:31 hat Alberto Garcia geschrieben:
> On Tue 11 Apr 2017 04:04:53 PM CEST, Max Reitz wrote:
> >>> (We could even get one more bit if we had a subcluster-flag, because I
> >>> guess we can always assume subclustered clusters to have OFLAG_COPIED
> >>> and be uncompressed. But still, three bits missing.)
> >>
> >> Why can we always assume OFLAG_COPIED?
> >
> > Because partially allocated clusters cannot be used with internal
> > snapshots, and that is what OFLAG_COPIED is for.
>
> Why can't they be used?

Refcounts are on a cluster granularity, so you have to COW the whole
cluster at once. If you copied only a subcluster, you'd lose the
information where to find the other subclusters.

> >>> Of course, if you'd be willing to give up the all-zeroes state for
> >>> subclusters, it would be enough...
> >>
> >> I still think that it looks like a better idea to allow having more
> >> subclusters, but giving up the all-zeroes state is a valid
> >> alternative. Apart from having to overwrite with zeroes when a
> >> subcluster is discarded, is there anything else that we would miss?
> >
> > If it's a real discard you can just discard it (which is what we do
> > for compat=0.10 images anyway); but zero-writes will then have to
> > become real writes, yes.
>
> Perhaps we can give up that bit for subclusters then, that would allow
> us to double their number. We would still have the zero flag at the
> cluster level. Opinions on this, anyone?

No, making the backing file contents reappear is really bad, we don't
want that.

If anything, we'd have to use the cluster level zero flag and do COW
(i.e. write explicit zeros) on the first write to a subcluster in it.
I'd rather keep the zero flag for subclusters.

> >>> By the way, if you'd only allow multiple of 1s overhead
> >>> (i.e. multiples of 32 subclusters), I think (3) would be pretty much
> >>> the same as (2) if you just always write the subcluster information
> >>> adjacent to the L2 table. Should be just the same caching-wise and
> >>> performance-wise.
> >>
> >> Then (3) is effectively the same as (2), just that the subcluster
> >> bitmaps are at the end of the L2 cluster, and not next to each entry.
> >
> > Exactly. But it's a difference in implementation, as you won't have to
> > worry about having changed the L2 table layout; maybe that's a
> > benefit.
>
> I'm not sure if that would simplify or complicate things, but it's
> worth considering.

Note that 64k between an L2 entry and the corresponding bitmap is enough
to make an update not atomic any more. They need to be within the same
sector to get atomicity.

Kevin
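Kevin's atomicity point can be illustrated with a small sketch (not QEMU code; layout numbers are only for illustration). Disks guarantee atomic writes at 512-byte sector granularity at best, so an L2 entry and its subcluster bitmap can only be updated together atomically if both lie in the same sector. With the interleaved layout of option (2) — an 8-byte entry immediately followed by an 8-byte bitmap — the 16-byte pair always fits in one sector because 512 is a multiple of 16; with option (3) as originally stated, the bitmap sits a full 64 KB away and the two always land in different sectors.

```python
# Sketch: which L2-entry/bitmap layouts allow a single-sector (and
# hence atomic) update? Offsets are bytes from the start of the table.
SECTOR = 512

def same_sector(a_off, a_len, b_off, b_len):
    """True if both fields lie entirely within one 512-byte sector."""
    first = min(a_off, b_off) // SECTOR
    last = (max(a_off + a_len, b_off + b_len) - 1) // SECTOR
    return first == last

# Option (2), interleaved: entry i at 16*i, bitmap right after it.
# 512 % 16 == 0, so the 16-byte pair never straddles a sector border.
assert all(same_sector(16 * i, 8, 16 * i + 8, 8) for i in range(4096))

# Option (3), bitmaps stored 64 KB after the L2 entries: the pair is
# always 128 sectors apart, so it can never be updated atomically
# with one sector write.
assert not any(same_sector(8 * i, 8, 65536 + 8 * i, 8)
               for i in range(8192))
```

This is exactly why the later messages in the thread discuss padding entries to sizes that divide the sector size evenly.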
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
From: Eric Blake @ 2017-04-11 14:58 UTC
To: Kevin Wolf, Alberto Garcia; +Cc: Stefan Hajnoczi, qemu-devel, qemu-block, Max Reitz

On 04/11/2017 09:49 AM, Kevin Wolf wrote:
>>>> Then (3) is effectively the same as (2), just that the subcluster
>>>> bitmaps are at the end of the L2 cluster, and not next to each entry.
>>>
>>> Exactly. But it's a difference in implementation, as you won't have to
>>> worry about having changed the L2 table layout; maybe that's a
>>> benefit.
>>
>> I'm not sure if that would simplify or complicate things, but it's
>> worth considering.
>
> Note that 64k between an L2 entry and the corresponding bitmap is enough
> to make an update not atomic any more. They need to be within the same
> sector to get atomicity.

Furthermore, there is a benefit to cache line packing - alternating
64 bits for the offset and 64 bits for the subclusters will fit all
128 bits in the same cache line, while having all offsets up front
followed by all subclusters later will not. Worse, depending on the
architecture, if the 64 bits for the offset are at the same relative
offset to overall cache alignment as the 64 bits for the subcluster
(for example, with 1M clusters, if the offset is at 4M and the
subcluster info is at 5M), the alignments of the separate memory pages
may cause you to end up with both values competing for the same cache
line, causing ping-pong evictions and associated slowdowns. Data
locality really does argue for placing stuff that is commonly used
together close together in memory.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
From: Max Reitz @ 2017-04-11 14:59 UTC
To: Kevin Wolf, Alberto Garcia; +Cc: qemu-devel, qemu-block, Stefan Hajnoczi

On 11.04.2017 16:49, Kevin Wolf wrote:
[...]
>>>>> By the way, if you'd only allow multiple of 1s overhead
>>>>> (i.e. multiples of 32 subclusters), I think (3) would be pretty much
>>>>> the same as (2) if you just always write the subcluster information
>>>>> adjacent to the L2 table. Should be just the same caching-wise and
>>>>> performance-wise.
>>>>
>>>> Then (3) is effectively the same as (2), just that the subcluster
>>>> bitmaps are at the end of the L2 cluster, and not next to each entry.
>>>
>>> Exactly. But it's a difference in implementation, as you won't have to
>>> worry about having changed the L2 table layout; maybe that's a
>>> benefit.
>>
>> I'm not sure if that would simplify or complicate things, but it's
>> worth considering.
>
> Note that 64k between an L2 entry and the corresponding bitmap is enough
> to make an update not atomic any more. They need to be within the same
> sector to get atomicity.

Good point, but that also means that (with (2)) you can only use
subcluster configurations where the L2 entry size increases by a power
of two. Unfortunately, only one of those configurations itself is a
power of two, and that is 32.

(With 32 subclusters, you take up 64 bits, which means an L2 entry will
take 128 bits; with any higher 2^n, you'd take up 2^{n+1} bits and the
L2 entry would take 2^{n+1} + 64, which is impossible to be a power of
two.)

I don't know how useful non-power-of-two subcluster configurations are.
Probably not at all.

Since using subclusters would always result in the L2 table taking more
than 512 bytes, you could therefore never guarantee that there is no
entry overlapping a sector border (except with 32 subclusters).

Max
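Max's arithmetic above can be checked with a few lines (an illustrative sketch, assuming 2 status bits per subcluster and power-of-two subcluster counts as discussed in the thread): the combined size of a 64-bit L2 entry plus its bitmap is itself a power of two only for 32 subclusters.

```python
# Sketch of the size arithmetic: 64-bit L2 entry + 2 bits per
# subcluster. Only 32 subclusters yield a power-of-two entry size
# (128 bits); every other power-of-two count needs padding.
def entry_bits(n_subclusters):
    return 64 + 2 * n_subclusters

def is_pow2(x):
    return x > 0 and x & (x - 1) == 0

sizes = {n: entry_bits(n) for n in (8, 16, 32, 64, 128, 256)}
# 8 -> 80, 16 -> 96, 32 -> 128, 64 -> 192, 128 -> 320, 256 -> 576
assert [n for n in sizes if is_pow2(sizes[n])] == [32]
```

Eric's padding suggestion corresponds to rounding these sizes up to the next power of two, e.g. 192 bits for 64 subclusters rounded up to 256, with 64 bits of padding.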
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
From: Eric Blake @ 2017-04-11 15:08 UTC
To: Max Reitz, Kevin Wolf, Alberto Garcia; +Cc: Stefan Hajnoczi, qemu-devel, qemu-block

On 04/11/2017 09:59 AM, Max Reitz wrote:
> Good point, but that also means that (with (2)) you can only use
> subcluster configurations where the L2 entry size increases by a power
> of two. Unfortunately, only one of those configurations itself is a
> power of two, and that is 32.
>
> (With 32 subclusters, you take up 64 bits, which means an L2 entry will
> take 128 bits; with any higher 2^n, you'd take up 2^{n+1} bits and the
> L2 entry would take 2^{n+1} + 64 which is impossible to be a power of
> two.)

Or we add padding. If you want 64 subclusters, you burn 256 bits per
entry, even though only 192 of those bits are used.

> I don't know how useful non-power-of-two subcluster configurations are.
> Probably not at all.
>
> Since using subclusters would always result in the L2 table taking more
> than 512 bytes, you could therefore never guarantee that there is no
> entry overlapping a sector border (except with 32 subclusters).

Yes, there's definite benefits to keeping whatever structure we end up
with aligned so that it naturally falls into sector boundaries, even if
it means more padding bits.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
From: Max Reitz @ 2017-04-11 15:18 UTC
To: Eric Blake, Kevin Wolf, Alberto Garcia; +Cc: Stefan Hajnoczi, qemu-devel, qemu-block

On 11.04.2017 17:08, Eric Blake wrote:
> On 04/11/2017 09:59 AM, Max Reitz wrote:
>>
>> Good point, but that also means that (with (2)) you can only use
>> subcluster configurations where the L2 entry size increases by a power
>> of two. Unfortunately, only one of those configurations itself is a
>> power of two, and that is 32.
>>
>> (With 32 subclusters, you take up 64 bits, which means an L2 entry will
>> take 128 bits; with any higher 2^n, you'd take up 2^{n+1} bits and the
>> L2 entry would take 2^{n+1} + 64 which is impossible to be a power of
>> two.)
>
> Or we add padding. If you want 64 subclusters, you burn 256 bits per
> entry, even though only 192 of those bits are used.

Hm, yeah, although you have to keep in mind that the padding is pretty
much the same size as the data bits we need, effectively doubling the
size of the L2 tables:

    padding = 2^{n+2} - 2^{n+1} - 64 (=2^6)
            = 2^{n+1} - 64

So that's not so nice, but if it's the only thing we can do...

>> I don't know how useful non-power-of-two subcluster configurations are.
>> Probably not at all.
>>
>> Since using subclusters would always result in the L2 table taking more
>> than 512 bytes, you could therefore never guarantee that there is no
>> entry overlapping a sector border (except with 32 subclusters).
>
> Yes, there's definite benefits to keeping whatever structure we end up
> with aligned so that it naturally falls into sector boundaries, even if
> it means more padding bits.

Then again, I'm not even sure we really need atomicity for L2 entries +
subcluster bits. I don't think you'd ever have to modify both at the
same time (if you just say the subclusters are all unallocated when
allocating the cluster itself, and then you write which subclusters are
actually allocated afterwards).

(This also applies to your remark on caching, I think.)

Atomicity certainly makes things easier, though.

Max
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
From: Kevin Wolf @ 2017-04-11 15:29 UTC
To: Max Reitz; +Cc: Eric Blake, Alberto Garcia, Stefan Hajnoczi, qemu-devel, qemu-block

Am 11.04.2017 um 17:18 hat Max Reitz geschrieben:
> On 11.04.2017 17:08, Eric Blake wrote:
> > On 04/11/2017 09:59 AM, Max Reitz wrote:
> >>
> >> Good point, but that also means that (with (2)) you can only use
> >> subcluster configurations where the L2 entry size increases by a power
> >> of two. Unfortunately, only one of those configurations itself is a
> >> power of two, and that is 32.
> >>
> >> (With 32 subclusters, you take up 64 bits, which means an L2 entry will
> >> take 128 bits; with any higher 2^n, you'd take up 2^{n+1} bits and the
> >> L2 entry would take 2^{n+1} + 64 which is impossible to be a power of
> >> two.)
> >
> > Or we add padding. If you want 64 subclusters, you burn 256 bits per
> > entry, even though only 192 of those bits are used.
>
> Hm, yeah, although you have to keep in mind that the padding is pretty
> much the same size as the data bits we need, effectively doubling the
> size of the L2 tables:
>
>     padding = 2^{n+2} - 2^{n+1} - 64 (=2^6)
>             = 2^{n+1} - 64
>
> So that's not so nice, but if it's the only thing we can do...
>
> >> I don't know how useful non-power-of-two subcluster configurations are.
> >> Probably not at all.
> >>
> >> Since using subclusters would always result in the L2 table taking more
> >> than 512 bytes, you could therefore never guarantee that there is no
> >> entry overlapping a sector border (except with 32 subclusters).
> >
> > Yes, there's definite benefits to keeping whatever structure we end up
> > with aligned so that it naturally falls into sector boundaries, even if
> > it means more padding bits.
>
> Then again, I'm not even sure we really need atomicity for L2 entries +
> subcluster bits. I don't think you'd ever have to modify both at the
> same time (if you just say the subclusters are all unallocated when
> allocating the cluster itself, and then you write which subclusters are
> actually allocated afterwards).
>
> (This also applies to your remark on caching, I think.)
>
> Atomicity certainly makes things easier, though.

Unless you want to deal with ordering (i.e. on cluster allocation, first
update the subcluster bitmap, then flush, then add the L2 entry), I
think you need atomicity.

Kevin
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-11 15:29 ` Kevin Wolf @ 2017-04-11 15:29 ` Max Reitz 0 siblings, 0 replies; 64+ messages in thread From: Max Reitz @ 2017-04-11 15:29 UTC (permalink / raw) To: Kevin Wolf Cc: Eric Blake, Alberto Garcia, Stefan Hajnoczi, qemu-devel, qemu-block [-- Attachment #1: Type: text/plain, Size: 2363 bytes --] On 11.04.2017 17:29, Kevin Wolf wrote: > Am 11.04.2017 um 17:18 hat Max Reitz geschrieben: >> On 11.04.2017 17:08, Eric Blake wrote: >>> On 04/11/2017 09:59 AM, Max Reitz wrote: >>> >>>> >>>> Good point, but that also means that (with (2)) you can only use >>>> subcluster configurations where the L2 entry size increases by a power >>>> of two. Unfortunately, only one of those configurations is itself a >>>> power of two, and that is 32. >>>> >>>> (With 32 subclusters, you take up 64 bits, which means an L2 entry will >>>> take 128 bits; with any higher 2^n, you'd take up 2^{n+1} bits and the >>>> L2 entry would take 2^{n+1} + 64, which cannot be a power of two.) >>> >>> Or we add padding. If you want 64 subclusters, you burn 256 bits per >>> entry, even though only 192 of those bits are used. >> >> Hm, yeah, although you have to keep in mind that the padding is almost >> pretty much the same as the data bits we need, effectively doubling >> the size of the L2 tables: >> >> padding = 2^{n+2} - 2^{n+1} - 64 (=2^6) >> = 2^{n+1} - 64 >> >> So that's not so nice, but if it's the only thing we can do... >> >>>> I don't know how useful non-power-of-two subcluster configurations are. >>>> Probably not at all. >>>> >>>> Since using subclusters would always result in the L2 table taking more >>>> than 512 bytes, you could therefore never guarantee that there is no >>>> entry overlapping a sector border (except with 32 subclusters). 
>>> >>> Yes, there are definite benefits to keeping whatever structure we end up >>> with aligned so that it naturally falls into sector boundaries, even if >>> it means more padding bits. >> >> Then again, I'm not even sure we really need atomicity for L2 entries + >> subcluster bits. I don't think you'd ever have to modify both at the >> same time (if you just say the subclusters are all unallocated when >> allocating the cluster itself, and then you write which subclusters are >> actually allocated afterwards). >> >> (This also applies to your remark on caching, I think.) >> >> Atomicity certainly makes things easier, though. > > Unless you want to deal with ordering (i.e. on cluster allocation, first > update the subcluster bitmap, then flush, then add the L2 entry), I > think you need atomicity. Yes, that's what I meant. Max [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 512 bytes --] ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-11 15:18 ` Max Reitz 2017-04-11 15:29 ` Kevin Wolf @ 2017-04-11 15:30 ` Eric Blake 2017-04-11 15:34 ` Max Reitz 1 sibling, 1 reply; 64+ messages in thread From: Eric Blake @ 2017-04-11 15:30 UTC (permalink / raw) To: Max Reitz, Kevin Wolf, Alberto Garcia Cc: Stefan Hajnoczi, qemu-devel, qemu-block [-- Attachment #1: Type: text/plain, Size: 877 bytes --] On 04/11/2017 10:18 AM, Max Reitz wrote: > Hm, yeah, although you have to keep in mind that the padding is almost > pretty much the same as the data bits we need, effectively doubling > the size of the L2 tables: > > padding = 2^{n+2} - 2^{n+1} - 64 (=2^6) > = 2^{n+1} - 64 > > So that's not so nice, but if it's the only thing we can do... Or we mix-and-match your ideas: since our subclusters use a ternary encoding, if you want 128 subclusters, instead of asking for 256 bits, you ask for ld(3^128) = 203 bits, then use 11 padding bits of your original 64, and you have something that still fits in 256 bits instead of 512... Okay, just kidding. Anything larger than 32 subclusters does get awkward fast. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 604 bytes --] ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-11 15:30 ` Eric Blake @ 2017-04-11 15:34 ` Max Reitz 0 siblings, 0 replies; 64+ messages in thread From: Max Reitz @ 2017-04-11 15:34 UTC (permalink / raw) To: Eric Blake, Kevin Wolf, Alberto Garcia Cc: Stefan Hajnoczi, qemu-devel, qemu-block [-- Attachment #1: Type: text/plain, Size: 1233 bytes --] On 11.04.2017 17:30, Eric Blake wrote: > On 04/11/2017 10:18 AM, Max Reitz wrote: > >> Hm, yeah, although you have to keep in mind that the padding is almost >> pretty much the same as the data bits we need, effectively doubling >> the size of the L2 tables: >> >> padding = 2^{n+2} - 2^{n+1} - 64 (=2^6) >> = 2^{n+1} - 64 >> >> So that's not so nice, but if it's the only thing we can do... > > Or we mix-and-match your ideas: since our subclusters use a ternary > encoding, if you want 128 subclusters, instead of asking for 256 bits, > you ask for ld(3^128) = 203 bits, then use 11 padding bits of your > original 64, and you have something that still fits in 256 bits instead > of 512... > > Okay, just kidding. Anything larger than 32 subclusters does get > awkward fast. If we want to do something a bit complicated that is still kind of reasonable, though, we could just not use padding except for 512-byte boundaries. Of course, that makes calculating the index of an L2 table a bit more complicated (probably involving a division by a value that is not a power of two), but it would kind of work. Or we use Malbolge, yes. That would probably be the most sensible choice. Max [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 512 bytes --] ^ permalink raw reply [flat|nested] 64+ messages in thread
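[Editorial note: Max's padding formula above can be checked with a few lines of arithmetic. The sketch below is an illustration by the editor, not code from QEMU; it assumes two status bits per subcluster, as in the proposal.]

```python
def l2_entry_bits(subclusters):
    """Bits needed for one L2 entry: 64 for the base entry plus
    2 status bits per subcluster (unallocated/allocated/all-zeroes)."""
    return 64 + 2 * subclusters

def padded_entry_bits(subclusters):
    """Round the entry up to the next power of two, as discussed above."""
    bits = l2_entry_bits(subclusters)
    width = 1
    while width < bits:
        width *= 2
    return width

# 32 subclusters: 64 + 64 = 128 bits, already a power of two (no padding).
# 64 subclusters: 64 + 128 = 192 used bits, padded up to 256 (64 wasted).
# 128 subclusters: 64 + 256 = 320 used bits, padded up to 512 (192 wasted),
# matching Max's formula padding = 2^{n+1} - 64.
for n in (32, 64, 128):
    used = l2_entry_bits(n)
    padded = padded_entry_bits(n)
    print(n, used, padded, padded - used)
```

This confirms that 32 is the only subcluster count for which the widened entry is naturally a power of two.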
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-11 14:49 ` [Qemu-devel] " Kevin Wolf 2017-04-11 14:58 ` Eric Blake 2017-04-11 14:59 ` Max Reitz @ 2017-04-12 12:47 ` Alberto Garcia 2 siblings, 0 replies; 64+ messages in thread From: Alberto Garcia @ 2017-04-12 12:47 UTC (permalink / raw) To: Kevin Wolf; +Cc: Max Reitz, qemu-devel, qemu-block, Stefan Hajnoczi On Tue 11 Apr 2017 04:49:21 PM CEST, Kevin Wolf wrote: >> >>> (We could even get one more bit if we had a subcluster-flag, because I >> >>> guess we can always assume subclustered clusters to have OFLAG_COPIED >> >>> and be uncompressed. But still, three bits missing.) >> >> >> >> Why can we always assume OFLAG_COPIED? >> > >> > Because partially allocated clusters cannot be used with internal >> > snapshots, and that is what OFLAG_COPIED is for. >> >> Why can't they be used? > > Refcounts are on a cluster granularity, so you have to COW the whole > cluster at once. If you copied only a subcluster, you'd lose the > information where to find the other subclusters. Individual subclusters don't have reference counts, and OFLAG_COPIED would always be at the cluster level, but otherwise I don't see the problem (see the reply that I just wrote to Eric). >> > If it's a real discard you can just discard it (which is what we >> > do for compat=0.10 images anyway); but zero-writes will then have >> > to become real writes, yes. >> >> Perhaps we can give up that bit for subclusters then; that would >> allow us to double their number. We would still have the zero flag at >> the cluster level. Opinions on this, anyone? > > No, making the backing file contents reappear is really bad, we don't > want that. I'm not talking about making the backing file contents reappear, but about writing zeroes instead of setting the zero flag. >> >> Then (3) is effectively the same as (2), just that the subcluster >> >> bitmaps are at the end of the L2 cluster, and not next to each >> >> entry. >> > >> > Exactly. 
But it's a difference in implementation, as you won't have >> > to worry about having changed the L2 table layout; maybe that's a >> > benefit. >> >> I'm not sure if that would simplify or complicate things, but it's >> worth considering. > > Note that 64k between an L2 entry and the corresponding bitmap is > enough to make an update not atomic any more. They need to be within > the same sector to get atomicity. Good point. Berto ^ permalink raw reply [flat|nested] 64+ messages in thread
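[Editorial note: Kevin's point that an L2 entry and its subcluster bitmap can only be updated atomically if they share a 512-byte sector can be checked mechanically. The sketch below is purely illustrative: the 16-byte entry models alternative (2), and the 56 KB gap is a made-up offset standing in for alternative (3), where the bitmaps sit at the end of a 64 KB L2 cluster.]

```python
SECTOR = 512

def same_sector(off_a, size_a, off_b, size_b):
    """True if both fields fall within one 512-byte sector, i.e. a
    single sector write can update them atomically."""
    start = min(off_a, off_b)
    end = max(off_a + size_a, off_b + size_b)
    return start // SECTOR == (end - 1) // SECTOR

# Layout (2): 128-bit (16-byte) entries with the bitmap embedded.
# Since 512 is a multiple of 16, no entry ever straddles a sector.
ENTRY = 16
assert all(same_sector(i * ENTRY, 8, i * ENTRY + 8, 8) for i in range(4096))

# Layout (3): a 64-bit entry at the start of the L2 cluster and its
# bitmap ~56 KB away (illustrative offset) cannot share a sector.
print(same_sector(0, 8, 57344, 8))
```

This is just the geometric version of Kevin's remark: with the bitmap far from the entry, you need either ordered writes with a flush in between or some other atomicity mechanism.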
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-06 15:01 [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation Alberto Garcia ` (2 preceding siblings ...) 2017-04-07 17:10 ` Max Reitz @ 2017-04-12 16:54 ` Denis V. Lunev 2017-04-13 11:58 ` Alberto Garcia 2017-04-12 17:55 ` Denis V. Lunev 4 siblings, 1 reply; 64+ messages in thread From: Denis V. Lunev @ 2017-04-12 16:54 UTC (permalink / raw) To: Alberto Garcia, qemu-devel Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz On 04/06/2017 06:01 PM, Alberto Garcia wrote: > Hi all, > > over the past couple of months I discussed with some of you the > possibility to extend the qcow2 format in order to improve its > performance and reduce its memory requirements (particularly with very > large images). > > After some discussion in the mailing list and the #qemu IRC channel I > decided to write a prototype of a new extension for qcow2 so I could > understand better the scope of the changes and have some preliminary > data about its effects. > > This e-mail is the formal presentation of my proposal to extend the > on-disk qcow2 format. As you can see this is still an RFC. Due to the > nature of the changes I would like to get as much feedback as possible > before going forward. > > === Problem === > > The original problem that I wanted to address is the memory > requirements of qcow2 files if you want to have very large images and > still keep good I/O performance. This is a consequence of its very > structure, which I'm going to describe now. > > A qcow2 image is divided into units of constant size called clusters, > and among other things it contains metadata that maps guest addresses > to host addresses (the so-called L1 and L2 tables). > > There are two basic problems that result from this: > > 1) Reading from or writing to a qcow2 image involves reading the > corresponding entry on the L2 table that maps the guest address to > the host address. 
This is very slow because it involves two I/O > operations: one on the L2 table and the other one on the actual > data cluster. > > 2) A cluster is the smallest unit of allocation. Therefore writing a > mere 512 bytes to an empty disk requires allocating a complete > cluster and filling it with zeroes (or with data from the backing > image if there is one). This wastes more disk space and also has a > negative impact on I/O. > > Problem (1) can be solved by keeping in memory a cache of the L2 > tables (QEMU has an "l2_cache_size" parameter for this purpose). The > amount of disk space used by L2 tables depends on two factors: the > disk size and the cluster size. > > The cluster size can be configured when the image is created, and it > can be any power of two between 512 bytes and 2 MB (it defaults to 64 > KB). > > The maximum amount of space needed for the L2 tables can be calculated > with the following formula: > > max_l2_size = virtual_disk_size * 8 / cluster_size > > Large images require a large amount of metadata, and therefore a large > amount of memory for the L2 cache. With the default cluster size > (64KB) that's 128MB of L2 cache for a 1TB qcow2 image. > > The only way to reduce the size of the L2 tables is therefore > increasing the cluster size, but this makes the image less efficient > as described earlier in (2). > > === The proposal === > > The idea of this proposal is to extend the qcow2 format by allowing > subcluster allocation. There would be an optional feature that would > need to be enabled when creating the image. The on-disk format would > remain essentially the same, except that each data cluster would be > internally divided into a number of subclusters of equal size. > > What this means in practice is that each entry on an L2 table would be > accompanied by a bitmap indicating the allocation state of each one of > the subclusters for that cluster. There are several alternatives for > storing the bitmap, described below. 
> > Other than L2 entries, all other data structures would remain > unchanged, but for data clusters the smallest unit of allocation would > now be the subcluster. Reference counting would still be at the > cluster level, because there's no way to reference individual > subclusters. Copy-on-write on internal snapshots would need to copy > complete clusters, so that scenario would not benefit from this > change. > > I see two main use cases for this feature: > > a) The qcow2 image is not too large / the L2 cache is not a problem, > but you want to increase the allocation performance. In this case > you can have something like a 128KB cluster with 4KB subclusters > (with 4KB being a common block size in ext4 and other filesystems) > > b) The qcow2 image is very large and you want to save metadata space > in order to have a smaller L2 cache. In this case you can go for > the maximum cluster size (2MB) but you want to have smaller > subclusters to increase the allocation performance and optimize the > disk usage. This was actually my original use case. > > === Test results === > > I have a basic working prototype of this. It's still incomplete -and > buggy :)- but it gives an idea of what we can expect from it. In my > implementation each data cluster has 8 subclusters, but that's not set > in stone (see below). > > I made all tests on an SSD drive, writing to an empty qcow2 image with > a fully populated 40GB backing image, performing random writes using > fio with a block size of 4KB. > > I tried with the default and maximum cluster sizes (64KB and 2MB) and > also with some other sizes. I also made sure to try with 32KB clusters > so the subcluster size matches the 4KB block size used for the I/O. > > It's important to point out that once a cluster has been completely > allocated then having subclusters offers no performance benefit. 
For > this reason the size of the image for these tests (40GB) was chosen to > be large enough to guarantee that there are always new clusters being > allocated. This is therefore a worst-case scenario (or best-case for > this feature, if you want). > > Here are the results (subcluster size in brackets): > >
> |-----------------+----------------+-----------------+-------------------|
> | cluster size    | subclusters=on | subclusters=off | Max L2 cache size |
> |-----------------+----------------+-----------------+-------------------|
> | 2 MB (256 KB)   | 440 IOPS       | 100 IOPS        | 160 KB (*)        |
> | 512 KB (64 KB)  | 1000 IOPS      | 300 IOPS        | 640 KB            |
> | 64 KB (8 KB)    | 3000 IOPS      | 1000 IOPS       | 5 MB              |
> | 32 KB (4 KB)    | 12000 IOPS     | 1300 IOPS       | 10 MB             |
> | 4 KB (512 B)    | 100 IOPS       | 100 IOPS        | 80 MB             |
> |-----------------+----------------+-----------------+-------------------|
> > (*) The L2 cache must be a multiple of the cluster > size, so in this case it must be 2MB. On the table > I chose to show how much of those 2MB are actually > used so you can compare it with the other cases. > > Some comments about the results: > > - For the 64KB, 512KB and 2MB cases, having subclusters increases > write performance roughly threefold. This happens because for each > cluster allocation there's less data to copy from the backing > image. For the same reason, the smaller the cluster, the better the > performance. As expected, 64KB clusters with no subclusters perform > roughly the same as 512KB clusters with 64KB subclusters. > > - The 32KB case is the most interesting one. Without subclusters it's > not very different from the 64KB case, but having a subcluster with > the same size as the I/O block eliminates the need for COW entirely > and the performance skyrockets (10 times faster!). > > - 4KB is however very slow. I attribute this to the fact that the > cluster size is so small that a new cluster needs to be allocated > for every single write and its refcount updated accordingly. 
The L2 > and refcount tables are also so small that they are too inefficient > and need to grow all the time. > > Here are the results when writing to an empty 40GB qcow2 image with no > backing file. The numbers are of course different but as you can see > the patterns are similar: > >
> |-----------------+----------------+-----------------+-------------------|
> | cluster size    | subclusters=on | subclusters=off | Max L2 cache size |
> |-----------------+----------------+-----------------+-------------------|
> | 2 MB (256 KB)   | 1200 IOPS      | 255 IOPS        | 160 KB            |
> | 512 KB (64 KB)  | 3000 IOPS      | 700 IOPS        | 640 KB            |
> | 64 KB (8 KB)    | 7200 IOPS      | 3300 IOPS       | 5 MB              |
> | 32 KB (4 KB)    | 12300 IOPS     | 4200 IOPS       | 10 MB             |
> | 4 KB (512 B)    | 100 IOPS       | 100 IOPS        | 80 MB             |
> |-----------------+----------------+-----------------+-------------------|
> > === Changes to the on-disk format === > > The qcow2 on-disk format needs to change so each L2 entry has a bitmap > indicating the allocation status of each subcluster. There are three > possible states (unallocated, allocated, all zeroes), so we need two > bits per subcluster. > > An L2 entry is 64 bits wide, and this is the current format (for > uncompressed clusters): > >
> 63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
> 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> **<----> <--------------------------------------------------><------->*
>   Rsrved          host cluster offset of data                 Reserved
>   (6 bits)               (47 bits)                            (8 bits)
> > bit 63: refcount == 1 (QCOW_OFLAG_COPIED) > bit 62: compressed = 1 (QCOW_OFLAG_COMPRESSED) > bit 0: all zeros (QCOW_OFLAG_ZERO) > > I thought of three alternatives for storing the subcluster bitmaps. I > haven't completely made up my mind about which one is the best, so > I'd like to present all three for discussion. Here they are: > > (1) Storing the bitmap inside the 64-bit entry > > This is a simple alternative and is the one that I chose for my > prototype. 
There are 14 unused bits plus the "all zeroes" one. If > we steal one from the host offset we have the 16 bits that we need > for the bitmap and we have 46 bits left for the host offset, which > is more than enough. > > * Pros: > + Simple. Few changes compared to the current qcow2 format. > > * Cons: > - Only 8 subclusters per cluster. We would not be making the > most of this feature. > > - No reserved bits left for the future. > > (2) Making L2 entries 128-bit wide. > > In this alternative we would double the size of L2 entries. The > first half would remain unchanged and the second one would store > the bitmap. That would leave us with 32 subclusters per cluster. > > * Pros: > + More subclusters per cluster. We could have images with > e.g. 128k clusters with 4k subclusters. > > * Cons: > - More space needed for L2 entries. The same cluster size would > require a cache twice as large, although having subcluster > allocation would compensate for this. > > - More changes to the code to handle 128-bit entries. > > - We would still be wasting the 14 reserved bits that L2 entries > have. > > (3) Storing the bitmap somewhere else > > This would involve storing the bitmap separate from the L2 tables > (perhaps using the bitmaps extension? I haven't looked much into > this). > > * Pros: > + Possibility to make the number of subclusters configurable > by the user (32, 64, 128, ...) > + All existing metadata structures would remain untouched > (although the "all zeroes" bit in L2 entries would probably > become unused). > > * Cons: > - As with alternative (2), more space needed for metadata. > > - The bitmap would also need to be cached for performance > reasons. > > - Possibly one more *_cache_size option. > > - One more metadata structure to be updated for each > allocation. This would probably impact I/O negatively. > > === Compressed clusters === > > My idea is that compressed clusters would remain the same. 
They are > read-only anyway so they would not be affected by any of these > changes. > > =========================== > > I think I managed to summarize all the ideas that I had in my mind, > but I'm sure you probably have questions and comments, so I'd be happy > to get as much feedback as possible. > > So, looking forward to reading your opinions. > > Regards, > > Berto > My opinion about this approach is very negative, as the problem could be (partially) solved in a much better way. 1) The current L2 cache management seems very wrong to me. Each cache miss means that we have to read an entire L2 table cluster. This means that in the worst case (when the dataset of the test does not fit in the L2 cache) we read 64 KB of L2 table for each 4 KB read. The situation gets MUCH worse once we start to increase the cluster size: for 1 MB clusters we have to read 1 MB on each cache miss. The situation can be cured immediately once we start reading the L2 cache in 4 or 8 KB chunks. We have a patchset for this in our downstream and are preparing it for upstream. Performance results are the following (roughly, from memory) for an Intel P3700 PCIe SSD, which is capable of 300k IOPS. We get 20k IOPS without any modifications on a 16 GB dataset IOPS test. We get 50k with (1) implemented. With a 1 MB data block we get 100k IOPS. This is not that big, but it is much better. 2) Yet another terrible thing in cluster allocation is its allocation strategy. The current QCOW2 codebase implies that we need 5 (five) I/O operations to complete a COW operation: we read the head, write the head, read the tail, write the tail, and write the actual data. This could easily be reduced to 3 I/O operations. Another problem is the amount of data written. We write the entire cluster in the write operation, and this is also insane. It is possible to perform fallocate() and the actual data write on a normal modern filesystem. This patch series was also finished but not yet sent. This gives around a 20x difference for cluster allocation. 
3) Partial allocation is very bad from the point of view of later read performance. The guest can send 4-5 MB requests. In this case a QCOW2 read of non-contiguously allocated data would be performed serially, chunk after chunk, adding HUGE latency. We should perform the chunk reads in a coroutine pool, like has recently been done by Peter Lieven for qemu-img in commit 2d9187bc65. We are struggling to finish this and put it into production. I do think that this complex of things could address almost all of the initial issues. Den ^ permalink raw reply [flat|nested] 64+ messages in thread
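[Editorial note: alternative (1) from the RFC can be sketched in a few lines. The exact bit placement is not fixed in the proposal; this illustration assumes the 16 bitmap bits reuse the 16 low bits of the entry, which are zero anyway for cluster-aligned host offsets once clusters are at least 64 KB. The names below mirror the flags from the RFC but the pack/unpack helpers are hypothetical.]

```python
# 2-bit state per subcluster, 8 subclusters per cluster as in the prototype.
UNALLOCATED, ALLOCATED, ALL_ZEROES = 0b00, 0b01, 0b10

QCOW_OFLAG_COPIED = 1 << 63
QCOW_OFLAG_COMPRESSED = 1 << 62
BITMAP_MASK = (1 << 16) - 1

def pack_l2_entry(host_offset, states, copied=True):
    """Pack a 64-bit L2 entry: subcluster bitmap in the low 16 bits,
    cluster-aligned host offset above it, flags in the top bits."""
    assert host_offset & BITMAP_MASK == 0 and len(states) == 8
    bitmap = 0
    for i, state in enumerate(states):
        bitmap |= state << (2 * i)
    entry = host_offset | bitmap
    if copied:
        entry |= QCOW_OFLAG_COPIED
    return entry

def unpack_l2_entry(entry):
    """Recover the host offset and the 8 subcluster states."""
    offset = entry & ~(QCOW_OFLAG_COPIED | QCOW_OFLAG_COMPRESSED) & ~BITMAP_MASK
    bitmap = entry & BITMAP_MASK
    states = [(bitmap >> (2 * i)) & 0b11 for i in range(8)]
    return offset, states

# A cluster at host offset 0x50000 with its first two subclusters written.
entry = pack_l2_entry(0x50000, [ALLOCATED, ALLOCATED] + [UNALLOCATED] * 6)
offset, states = unpack_l2_entry(entry)
```

Note how this layout keeps the entry at 64 bits, which is what makes alternative (1) attractive despite the cap of 8 subclusters.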
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-12 16:54 ` Denis V. Lunev @ 2017-04-13 11:58 ` Alberto Garcia 2017-04-13 12:44 ` Denis V. Lunev 0 siblings, 1 reply; 64+ messages in thread From: Alberto Garcia @ 2017-04-13 11:58 UTC (permalink / raw) To: Denis V. Lunev, qemu-devel Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz On Wed 12 Apr 2017 06:54:50 PM CEST, Denis V. Lunev wrote: > My opinion about this approach is very negative as the problem could > be (partially) solved in a much better way. Hmm... it seems to me that (some of) the problems you are describing are different from the ones this proposal tries to address. Not that I disagree with them! I think you are giving useful feedback :) > 1) current L2 cache management seems very wrong to me. Each cache > miss means that we have to read entire L2 cache block. This means > that in the worst case (when dataset of the test does not fit L2 > cache size we read 64kb of L2 table for each 4 kb read). > > The situation is MUCH worse once we are starting to increase > cluster size. For 1 Mb blocks we have to read 1 Mb on each cache > miss. > > The situation can be cured immediately once we will start reading > L2 cache with 4 or 8kb chunks. We have patchset for this for our > downstream and preparing it for upstream. Correct, although the impact of this depends on whether you are using an SSD or an HDD. With an SSD, what you want to minimize is the number of unnecessary reads, so reading small chunks will likely increase the performance when there's a cache miss. With an HDD, what you want to minimize is the number of seeks. Once you have moved the disk head to the location where the cluster is, reading the whole cluster is relatively inexpensive, so (leaving the memory requirements aside) you generally want to read as much as possible. > 2) yet another terrible thing in cluster allocation is its allocation > strategy. 
> Current QCOW2 codebase implies that we need 5 (five) IOPSes to > complete COW operation. We are reading head, writing head, reading > tail, writing tail, writing actual data to be written. This could > be easily reduced to 3 IOPSes. That sounds right, but I'm not sure if this is really incompatible with my proposal :) > Another problem is the amount of data written. We are writing > entire cluster in write operation and this is also insane. It is > possible to perform fallocate() and actual data write on normal > modern filesystem. But that only works when filling the cluster with zeroes, doesn't it? If there's a backing image you need to bring all the contents from there. Berto ^ permalink raw reply [flat|nested] 64+ messages in thread
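[Editorial note: the 5-vs-3 I/O count for a partial-cluster COW allocation that Den describes can be tallied explicitly. This sketch is the editor's illustration, assuming the reduction works by merging head, guest data, and tail into a single contiguous write; it is not how QEMU actually implements COW.]

```python
def cow_ops(cluster_size, write_off, write_len, merged):
    """List the I/O operations needed to allocate a cluster when the
    guest writes [write_off, write_off + write_len) inside it and the
    rest must be copied (e.g. from a backing file)."""
    head = write_off                              # bytes before the write
    tail = cluster_size - write_off - write_len   # bytes after it
    ops = []
    if merged:
        if head:
            ops.append("read head")
        if tail:
            ops.append("read tail")
        ops.append("write head+data+tail")        # one contiguous write
    else:
        if head:
            ops += ["read head", "write head"]
        if tail:
            ops += ["read tail", "write tail"]
        ops.append("write data")
    return ops

# A 4 KB guest write at offset 4 KB inside a 64 KB cluster:
print(len(cow_ops(65536, 4096, 4096, merged=False)))  # Den's 5 operations
print(len(cow_ops(65536, 4096, 4096, merged=True)))   # reduced to 3
```

The subcluster proposal attacks the same cost from the other side: with a subcluster matching the I/O block size, head and tail are empty and the COW reads disappear entirely.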
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-13 11:58 ` Alberto Garcia @ 2017-04-13 12:44 ` Denis V. Lunev 2017-04-13 13:05 ` Kevin Wolf 2017-04-13 13:21 ` Alberto Garcia 0 siblings, 2 replies; 64+ messages in thread From: Denis V. Lunev @ 2017-04-13 12:44 UTC (permalink / raw) To: Alberto Garcia, qemu-devel Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz On 04/13/2017 02:58 PM, Alberto Garcia wrote: > On Wed 12 Apr 2017 06:54:50 PM CEST, Denis V. Lunev wrote: >> My opinion about this approach is very negative as the problem could >> be (partially) solved in a much better way. > Hmm... it seems to me that (some of) the problems you are describing are > different from the ones this proposal tries to address. Not that I > disagree with them! I think you are giving useful feedback :) > >> 1) current L2 cache management seems very wrong to me. Each cache >> miss means that we have to read entire L2 cache block. This means >> that in the worst case (when dataset of the test does not fit L2 >> cache size we read 64kb of L2 table for each 4 kb read). >> >> The situation is MUCH worse once we are starting to increase >> cluster size. For 1 Mb blocks we have to read 1 Mb on each cache >> miss. >> >> The situation can be cured immediately once we will start reading >> L2 cache with 4 or 8kb chunks. We have patchset for this for our >> downstream and preparing it for upstream. > Correct, although the impact of this depends on whether you are using > SDD or HDD. > > With an SSD what you want is to minimize is the number of unnecessary > reads, so reading small chunks will likely increase the performance when > there's a cache miss. > > With an HDD what you want is to minimize the number of seeks. Once you > have moved the disk head to the location where the cluster is, reading > the whole cluster is relatively inexpensive, so (leaving the memory > requirements aside) you generally want to read as much as possible. no! 
This greatly helps for HDD too! The reason is that you cover areas of the virtual disk much more precisely. Here is a very simple example: let us assume that I have, e.g., a 1 TB virtual HDD with a 1 MB cluster size. As far as I understand, right now the L2 cache for this case consists of 4 L2 table clusters. So I can exhaust the current cache with only 5 requests, and each actual read will then cost an additional L2 table read. This is a read problem, and this condition could easily happen on a fragmented FS. With my proposal the situation is MUCH better: all accesses will be served from the cache after the first run. >> 2) yet another terrible thing in cluster allocation is its allocation >> strategy. >> Current QCOW2 codebase implies that we need 5 (five) IOPSes to >> complete COW operation. We are reading head, writing head, reading >> tail, writing tail, writing actual data to be written. This could >> be easily reduced to 3 IOPSes. > That sounds right, but I'm not sure if this is really incompatible with > my proposal :) The problem is code complexity, which is already very high. >> Another problem is the amount of data written. We are writing >> entire cluster in write operation and this is also insane. It is >> possible to perform fallocate() and actual data write on normal >> modern filesystem. > But that only works when filling the cluster with zeroes, doesn't it? If > there's a backing image you need to bring all the contents from there. Yes, backing images are a problem. Though even with sub-clusters we will suffer exactly the same amount of I/O operations, since the head and the tail still have to be read. If you are talking about subclusters equal to the FS block size, avoiding COW entirely, this would be terribly slow later on with sequential reading: in such an approach sequential reads will turn into random reads. Guest OSes are written keeping in mind that adjacent LBAs are really adjacent and reading them sequentially is a very good idea. 
This invariant will be broken in the case of subclusters. For today's SSDs we are facing problems elsewhere. Right now I can achieve only 100k IOPS on an SSD capable of 350-550k. A 1 MB block with preallocation and a fragmented L2 cache gives the same 100k. Tests with an initially empty image give around 80k for us. Maybe I have too good hardware for these tests; these are the numbers from our last run on top Intel SSDs. We will remeasure this on something slower before submission ;) Den ^ permalink raw reply [flat|nested] 64+ messages in thread
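[Editorial note: the arithmetic behind Den's cache-thrashing example follows directly from the max_l2_size formula in the RFC. A quick sketch of the sizes involved; the "4 cached table clusters" figure is taken from his example, not derived from QEMU defaults.]

```python
def l2_metadata_bytes(disk_size, cluster_size):
    """max_l2_size = virtual_disk_size * 8 / cluster_size
    (one 8-byte L2 entry per data cluster)."""
    return disk_size * 8 // cluster_size

TB, GB, MB, KB = 2**40, 2**30, 2**20, 2**10

# The RFC's figure: a 1 TB image with 64 KB clusters needs 128 MB of L2 tables.
assert l2_metadata_bytes(1 * TB, 64 * KB) == 128 * MB

# Den's example: 1 TB image, 1 MB clusters -> 8 MB of L2 tables, i.e.
# 8 L2 table clusters, each mapping (1 MB / 8) * 1 MB = 128 GB of the disk.
tables = l2_metadata_bytes(1 * TB, 1 * MB) // (1 * MB)
coverage = (1 * MB // 8) * (1 * MB)
print(tables, coverage // GB)

# With a cache that holds only 4 whole table clusters, 5 reads landing in
# 5 different 128 GB regions already evict and re-read full 1 MB tables,
# which is why reading the cache in 4-8 KB chunks helps so much.
```

This also makes Alberto's counterpoint concrete: the cost of a miss is one full cluster-sized read, which is cheap per seek on an HDD but wasteful on an SSD.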
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-13 12:44 ` Denis V. Lunev @ 2017-04-13 13:05 ` Kevin Wolf 2017-04-13 13:09 ` Denis V. Lunev 2017-04-13 13:21 ` Alberto Garcia 1 sibling, 1 reply; 64+ messages in thread From: Kevin Wolf @ 2017-04-13 13:05 UTC (permalink / raw) To: Denis V. Lunev Cc: Alberto Garcia, qemu-devel, Stefan Hajnoczi, qemu-block, Max Reitz Am 13.04.2017 um 14:44 hat Denis V. Lunev geschrieben: > On 04/13/2017 02:58 PM, Alberto Garcia wrote: > > On Wed 12 Apr 2017 06:54:50 PM CEST, Denis V. Lunev wrote: > >> My opinion about this approach is very negative as the problem could > >> be (partially) solved in a much better way. > > Hmm... it seems to me that (some of) the problems you are describing are > > different from the ones this proposal tries to address. Not that I > > disagree with them! I think you are giving useful feedback :) > > > >> 1) current L2 cache management seems very wrong to me. Each cache > >> miss means that we have to read entire L2 cache block. This means > >> that in the worst case (when dataset of the test does not fit L2 > >> cache size we read 64kb of L2 table for each 4 kb read). > >> > >> The situation is MUCH worse once we are starting to increase > >> cluster size. For 1 Mb blocks we have to read 1 Mb on each cache > >> miss. > >> > >> The situation can be cured immediately once we will start reading > >> L2 cache with 4 or 8kb chunks. We have patchset for this for our > >> downstream and preparing it for upstream. > > Correct, although the impact of this depends on whether you are using > > SDD or HDD. > > > > With an SSD what you want is to minimize is the number of unnecessary > > reads, so reading small chunks will likely increase the performance when > > there's a cache miss. > > > > With an HDD what you want is to minimize the number of seeks. 
Once you > > have moved the disk head to the location where the cluster is, reading > > the whole cluster is relatively inexpensive, so (leaving the memory > > requirements aside) you generally want to read as much as possible. > no! This greatly helps for HDD too! > > The reason is that you cover areas of the virtual disk much more precise. > There is very simple example. Let us assume that I have f.e. 1 TB virtual > HDD with 1 MB block size. As far as I understand right now L2 cache > for the case consists of 4 L2 clusters. > > So, I can exhaust current cache only with 5 requests and each actual read > will costs L2 table read. This is a read problem. This condition could > happen on fragmented FS without a problem. > > With my proposal the situation is MUCH better. All accesses will be taken > from the cache after the first run. > > >> 2) yet another terrible thing in cluster allocation is its allocation > >> strategy. > >> Current QCOW2 codebase implies that we need 5 (five) IOPSes to > >> complete COW operation. We are reading head, writing head, reading > >> tail, writing tail, writing actual data to be written. This could > >> be easily reduced to 3 IOPSes. > > That sounds right, but I'm not sure if this is really incompatible with > > my proposal :) > the problem is code complexity, with is very complex right now. > > > >> Another problem is the amount of data written. We are writing > >> entire cluster in write operation and this is also insane. It is > >> possible to perform fallocate() and actual data write on normal > >> modern filesystem. > > But that only works when filling the cluster with zeroes, doesn't it? If > > there's a backing image you need to bring all the contents from there. > > Yes. Backing images are problems. Though, even with sub-clusters, we > will suffer exactly the same with the amount of IOPSes as even with > that head and tail have to be read. 
If you are spoken about > subclusters equals to FS block size and avoid COW at all, this would > be terribly slow later on with sequential reading. In such an approach > sequential reading will result in random read. > > Guest OSes are written keeping in mind that adjacent LBAs are really > adjacent and reading them sequentially is a very good idea. This > invariant will be broken for the case of subclusters. How that? Given the same cluster size, subclustered and traditional images behave _exactly_ the same regarding fragmentation. Subclusters only have an effect on it (and a positive one) when you take them as a reason that you can now afford to increase the cluster size. I see subclusters and fragmentation as mostly independent topics. > For nowadays SSD we are facing problems somewhere else. Right now I > can achieve only 100k IOPSes on SSD capable of 350-550k. 1 Mb block > with preallocation and fragmented L2 cache gives same 100k. Tests for > initially empty image gives around 80k for us. Preallocated images aren't particularly interesting to me. qcow2 is used mainly for two reasons. One of them is sparseness (initially small file size) mostly for desktop use cases with no serious I/O, so not that interesting either. The other one is snapshots, i.e. backing files, which doesn't work with preallocation (yet). Actually, preallocation with backing files is something that subclusters would automatically enable: You could already reserve the space for a cluster, but still leave all subclusters marked as unallocated. Kevin ^ permalink raw reply [flat|nested] 64+ messages in thread
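Denis's 1 TB / 1 MB example above can be checked with a little arithmetic. The sketch below (plain Python; the figures are assumed from the thread rather than read out of the QEMU source) computes how much guest data a single cached L2 table covers and why a handful of scattered requests can thrash a 4-table cache:

```python
# Back-of-the-envelope check of the L2 cache example in the thread.
# Assumed figures: 1 TiB disk, 1 MiB clusters, 8-byte L2 entries,
# and a cache holding 4 L2 tables (Denis's estimate for the default).

KiB, MiB, GiB, TiB = 1024, 1024**2, 1024**3, 1024**4

disk_size = 1 * TiB
cluster_size = 1 * MiB
l2_entry_size = 8            # one 64-bit entry per data cluster

# Total L2 metadata for the image: one entry per data cluster.
l2_metadata = disk_size // cluster_size * l2_entry_size
print(f"total L2 metadata: {l2_metadata // MiB} MiB")          # 8 MiB

# An L2 table is itself one cluster of entries, so each cached table
# maps cluster_size / 8 data clusters of guest space.
entries_per_table = cluster_size // l2_entry_size              # 131072
coverage_per_table = entries_per_table * cluster_size          # 128 GiB
print(f"coverage per cached L2 table: {coverage_per_table // GiB} GiB")

# With a 4-table cache, 5 reads that each land in a different 128 GiB
# region already force an eviction, after which every miss costs a
# full 1 MiB L2 table read -- which is Denis's point.
cache_tables = 4
regions = disk_size // coverage_per_table                      # 8 regions
assert regions > cache_tables
```

Reading the L2 tables in smaller chunks, as Denis proposes, changes the unit of both the miss cost and the cache occupancy, which is why it helps even though the total metadata size stays the same.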
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-13 13:05 ` Kevin Wolf @ 2017-04-13 13:09 ` Denis V. Lunev 2017-04-13 13:36 ` Alberto Garcia 0 siblings, 1 reply; 64+ messages in thread From: Denis V. Lunev @ 2017-04-13 13:09 UTC (permalink / raw) To: Kevin Wolf Cc: Alberto Garcia, qemu-devel, Stefan Hajnoczi, qemu-block, Max Reitz On 04/13/2017 04:05 PM, Kevin Wolf wrote: > Am 13.04.2017 um 14:44 hat Denis V. Lunev geschrieben: >> On 04/13/2017 02:58 PM, Alberto Garcia wrote: >>> On Wed 12 Apr 2017 06:54:50 PM CEST, Denis V. Lunev wrote: >>>> My opinion about this approach is very negative as the problem could >>>> be (partially) solved in a much better way. >>> Hmm... it seems to me that (some of) the problems you are describing are >>> different from the ones this proposal tries to address. Not that I >>> disagree with them! I think you are giving useful feedback :) >>> >>>> 1) current L2 cache management seems very wrong to me. Each cache >>>> miss means that we have to read entire L2 cache block. This means >>>> that in the worst case (when dataset of the test does not fit L2 >>>> cache size we read 64kb of L2 table for each 4 kb read). >>>> >>>> The situation is MUCH worse once we are starting to increase >>>> cluster size. For 1 Mb blocks we have to read 1 Mb on each cache >>>> miss. >>>> >>>> The situation can be cured immediately once we will start reading >>>> L2 cache with 4 or 8kb chunks. We have patchset for this for our >>>> downstream and preparing it for upstream. >>> Correct, although the impact of this depends on whether you are using >>> SDD or HDD. >>> >>> With an SSD what you want is to minimize is the number of unnecessary >>> reads, so reading small chunks will likely increase the performance when >>> there's a cache miss. >>> >>> With an HDD what you want is to minimize the number of seeks. 
Once you >>> have moved the disk head to the location where the cluster is, reading >>> the whole cluster is relatively inexpensive, so (leaving the memory >>> requirements aside) you generally want to read as much as possible. >> no! This greatly helps for HDD too! >> >> The reason is that you cover areas of the virtual disk much more precise. >> There is very simple example. Let us assume that I have f.e. 1 TB virtual >> HDD with 1 MB block size. As far as I understand right now L2 cache >> for the case consists of 4 L2 clusters. >> >> So, I can exhaust current cache only with 5 requests and each actual read >> will costs L2 table read. This is a read problem. This condition could >> happen on fragmented FS without a problem. >> >> With my proposal the situation is MUCH better. All accesses will be taken >> from the cache after the first run. >> >>>> 2) yet another terrible thing in cluster allocation is its allocation >>>> strategy. >>>> Current QCOW2 codebase implies that we need 5 (five) IOPSes to >>>> complete COW operation. We are reading head, writing head, reading >>>> tail, writing tail, writing actual data to be written. This could >>>> be easily reduced to 3 IOPSes. >>> That sounds right, but I'm not sure if this is really incompatible with >>> my proposal :) >> the problem is code complexity, with is very complex right now. >> >> >>>> Another problem is the amount of data written. We are writing >>>> entire cluster in write operation and this is also insane. It is >>>> possible to perform fallocate() and actual data write on normal >>>> modern filesystem. >>> But that only works when filling the cluster with zeroes, doesn't it? If >>> there's a backing image you need to bring all the contents from there. >> Yes. Backing images are problems. Though, even with sub-clusters, we >> will suffer exactly the same with the amount of IOPSes as even with >> that head and tail have to be read. 
If you are spoken about >> subclusters equals to FS block size and avoid COW at all, this would >> be terribly slow later on with sequential reading. In such an approach >> sequential reading will result in random read. >> >> Guest OSes are written keeping in mind that adjacent LBAs are really >> adjacent and reading them sequentially is a very good idea. This >> invariant will be broken for the case of subclusters. > How that? > > Given the same cluster size, subclustered and traditional images behave > _exactly_ the same regarding fragmentation. Subclusters only have an > effect on it (and a positive one) when you take them as a reason that > you can now afford to increase the cluster size. > > I see subclusters and fragmentation as mostly independent topics. > >> For nowadays SSD we are facing problems somewhere else. Right now I >> can achieve only 100k IOPSes on SSD capable of 350-550k. 1 Mb block >> with preallocation and fragmented L2 cache gives same 100k. Tests for >> initially empty image gives around 80k for us. > Preallocated images aren't particularly interesting to me. qcow2 is used > mainly for two reasons. One of them is sparseness (initially small file > size) mostly for desktop use cases with no serious I/O, so not that > interesting either. The other one is snapshots, i.e. backing files, > which doesn't work with preallocation (yet). > > Actually, preallocation with backing files is something that subclusters > would automatically enable: You could already reserve the space for a > cluster, but still leave all subclusters marked as unallocated. I am spoken about fallocate() for the entire cluster before actual write() for originally empty image. This increases the performance of 4k random writes 10+ times. In this case we can just write those 4k and do nothing else. Den ^ permalink raw reply [flat|nested] 64+ messages in thread
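Denis is describing reserving the whole cluster with fallocate() and then writing only the 4k of guest data, relying on the filesystem to return zeroes for the unwritten remainder. A minimal sketch of that idea (using Python's `os.posix_fallocate` on a temporary file; the 64k cluster size is illustrative, and how cheap the reservation is depends on the host filesystem):

```python
import os
import tempfile

CLUSTER = 64 * 1024            # illustrative cluster size
WRITE_OFF, WRITE_LEN = 8192, 4096

fd, path = tempfile.mkstemp()
try:
    # Reserve the whole cluster up front instead of writing 64k of zeroes.
    os.posix_fallocate(fd, 0, CLUSTER)
    # ...then write only the 4k of guest data.
    os.pwrite(fd, b"\xab" * WRITE_LEN, WRITE_OFF)

    data = os.pread(fd, CLUSTER, 0)
    # The untouched parts of the reserved cluster read back as zeroes,
    # so no explicit zero-filling I/O was needed.
    assert data[:WRITE_OFF] == b"\0" * WRITE_OFF
    assert data[WRITE_OFF:WRITE_OFF + WRITE_LEN] == b"\xab" * WRITE_LEN
    assert data[WRITE_OFF + WRITE_LEN:] == b"\0" * (CLUSTER - WRITE_OFF - WRITE_LEN)
finally:
    os.close(fd)
    os.unlink(path)
```

As the following messages note, this only substitutes for COW when the correct cluster contents are zeroes, i.e. on an image without a backing file.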
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-13 13:09 ` Denis V. Lunev @ 2017-04-13 13:36 ` Alberto Garcia 2017-04-13 14:06 ` Denis V. Lunev 0 siblings, 1 reply; 64+ messages in thread From: Alberto Garcia @ 2017-04-13 13:36 UTC (permalink / raw) To: Denis V. Lunev, Kevin Wolf Cc: qemu-devel, Stefan Hajnoczi, qemu-block, Max Reitz On Thu 13 Apr 2017 03:09:53 PM CEST, Denis V. Lunev wrote: >>> For nowadays SSD we are facing problems somewhere else. Right now I >>> can achieve only 100k IOPSes on SSD capable of 350-550k. 1 Mb block >>> with preallocation and fragmented L2 cache gives same 100k. Tests >>> for initially empty image gives around 80k for us. >> Preallocated images aren't particularly interesting to me. qcow2 is >> used mainly for two reasons. One of them is sparseness (initially >> small file size) mostly for desktop use cases with no serious I/O, so >> not that interesting either. The other one is snapshots, i.e. backing >> files, which doesn't work with preallocation (yet). >> >> Actually, preallocation with backing files is something that >> subclusters would automatically enable: You could already reserve the >> space for a cluster, but still leave all subclusters marked as >> unallocated. > > I am spoken about fallocate() for the entire cluster before actual > write() for originally empty image. This increases the performance of > 4k random writes 10+ times. In this case we can just write those 4k > and do nothing else. You're talking about using fallocate() for filling a cluster with zeroes before writing data to it. As noted earlier in this thread, this works if the image is empty or if it doesn't have a backing file. And if the image is not empty you cannot guarantee that the cluster contains zeroes (you can use FALLOC_FL_ZERO_RANGE, but that won't work in all cases). Berto ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-13 13:36 ` Alberto Garcia @ 2017-04-13 14:06 ` Denis V. Lunev 0 siblings, 0 replies; 64+ messages in thread From: Denis V. Lunev @ 2017-04-13 14:06 UTC (permalink / raw) To: Alberto Garcia, Kevin Wolf Cc: qemu-devel, Stefan Hajnoczi, qemu-block, Max Reitz On 04/13/2017 04:36 PM, Alberto Garcia wrote: > On Thu 13 Apr 2017 03:09:53 PM CEST, Denis V. Lunev wrote: >>>> For nowadays SSD we are facing problems somewhere else. Right now I >>>> can achieve only 100k IOPSes on SSD capable of 350-550k. 1 Mb block >>>> with preallocation and fragmented L2 cache gives same 100k. Tests >>>> for initially empty image gives around 80k for us. >>> Preallocated images aren't particularly interesting to me. qcow2 is >>> used mainly for two reasons. One of them is sparseness (initially >>> small file size) mostly for desktop use cases with no serious I/O, so >>> not that interesting either. The other one is snapshots, i.e. backing >>> files, which doesn't work with preallocation (yet). >>> >>> Actually, preallocation with backing files is something that >>> subclusters would automatically enable: You could already reserve the >>> space for a cluster, but still leave all subclusters marked as >>> unallocated. >> I am spoken about fallocate() for the entire cluster before actual >> write() for originally empty image. This increases the performance of >> 4k random writes 10+ times. In this case we can just write those 4k >> and do nothing else. > You're talking about using fallocate() for filling a cluster with zeroes > before writing data to it. > > As noted earlier in this thread, this works if the image is empty or if > it doesn't have a backing file. > > And if the image is not empty you cannot guarantee that the cluster > contains zeroes (you can use FALLOC_FL_ZERO_RANGE, but that won't work > in all cases). > > Berto yes, I agree here. 
But COW operations suffer more from the number of I/O operations required than from the amount of data transferred. Let us assume that we have a 64k cluster represented as [--------]. With a 4k write in the middle of the cluster we currently need 5 IOPSes to perform the operation: read head, write head, write 4k, read tail, write tail. Normally this should take 2 operations: read the entire cluster (64kb), write the entire cluster (64kb). In this approach a further 64kb read of this cluster will result in 1 host IOPS - the file is continuous from the point of view of the host file system. With 8kb subclusters we will have the same 1 read and 1 write after the tuning. The difference is only the amount of data moved: an 8kb read and an 8kb write instead of a 64kb read and a 64kb write. Sure, the size of the cluster should be increased. Personally I like 1 MB due to my past experience. In my opinion there is no difference at all in terms of performance on rotational drives between COW of a 1 MB cluster and of a 64 kb cluster without subclusters. Reading 1 Mb and reading 64 kb cost the same (100-150 IOPSes with 150 Mb/s throughput). Further continuous reads will be much better with 1 MB blocks and without subclusters. Yes, there is a difference on "average" SSD drives, which give 40k-100k IOPSes. We will experience a slowdown reading 1 MB instead of 64kb, but the difference is not that big actually. Top-notch PCIe SSDs cannot be saturated with QEMU nowadays. I am able to reach only 100k IOPSes instead of the host's 300k-500k even when the data is written to existing clusters. Thus with them we will see no difference between a 1 MB cluster and a 64 kb cluster in terms of COW. So, in my opinion, a simple 1 Mb cluster size along with a fragmented L2 cache is very good from all points of view. Even from the COW point ;) The situation in real life will not be worse or better from the performance point of view, as we will also avoid additional metadata updates. Den ^ permalink raw reply [flat|nested] 64+ messages in thread
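The trade-off Denis describes can be put into numbers. The sketch below (illustrative accounting only, not QEMU code) counts operations and bytes moved for a 4k write into an unallocated cluster under the current head/tail scheme, a merged whole-cluster COW, and a subcluster-granularity COW:

```python
def cow_cost(write_off, write_len, granularity):
    """(ops, bytes_moved) for a merged copy-on-write: one read and one
    write of the written range rounded out to `granularity` boundaries."""
    start = write_off - write_off % granularity
    end = -(-(write_off + write_len) // granularity) * granularity
    return 2, 2 * (end - start)

CLUSTER = 64 * 1024

# Current code path: read head, write head, write 4k, read tail,
# write tail -- 5 operations for a single 4k guest write.
CURRENT_OPS = 5

# Merged COW of the whole 64k cluster: 2 ops, 128k moved.
assert cow_cost(32 * 1024, 4096, CLUSTER) == (2, 128 * 1024)

# 8k subclusters: still 2 ops, but only 16k moved for the same write
# (assuming the 4k write stays within a single subcluster).
assert cow_cost(32 * 1024, 4096, 8 * 1024) == (2, 16 * 1024)

# With 1 MB clusters and no subclusters, the same write moves 2 MB.
assert cow_cost(32 * 1024, 4096, 1024 * 1024) == (2, 2 * 1024 * 1024)
```

The op count is the same for whole-cluster and subcluster COW; what subclusters change is the last line, the bytes moved per COW once the cluster size grows.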
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-13 12:44 ` Denis V. Lunev 2017-04-13 13:05 ` Kevin Wolf @ 2017-04-13 13:21 ` Alberto Garcia 2017-04-13 13:30 ` Denis V. Lunev 2017-04-13 13:51 ` Kevin Wolf 1 sibling, 2 replies; 64+ messages in thread From: Alberto Garcia @ 2017-04-13 13:21 UTC (permalink / raw) To: Denis V. Lunev, qemu-devel Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz On Thu 13 Apr 2017 02:44:51 PM CEST, Denis V. Lunev wrote: >>> 1) current L2 cache management seems very wrong to me. Each cache >>> miss means that we have to read entire L2 cache block. This means >>> that in the worst case (when dataset of the test does not fit L2 >>> cache size we read 64kb of L2 table for each 4 kb read). >>> >>> The situation is MUCH worse once we are starting to increase >>> cluster size. For 1 Mb blocks we have to read 1 Mb on each cache >>> miss. >>> >>> The situation can be cured immediately once we will start reading >>> L2 cache with 4 or 8kb chunks. We have patchset for this for our >>> downstream and preparing it for upstream. >> Correct, although the impact of this depends on whether you are using >> SDD or HDD. >> >> With an SSD what you want is to minimize is the number of unnecessary >> reads, so reading small chunks will likely increase the performance when >> there's a cache miss. >> >> With an HDD what you want is to minimize the number of seeks. Once you >> have moved the disk head to the location where the cluster is, reading >> the whole cluster is relatively inexpensive, so (leaving the memory >> requirements aside) you generally want to read as much as possible. > no! This greatly helps for HDD too! > > The reason is that you cover areas of the virtual disk much more > precise. There is very simple example. Let us assume that I have > f.e. 1 TB virtual HDD with 1 MB block size. As far as I understand > right now L2 cache for the case consists of 4 L2 clusters. 
> > So, I can exhaust current cache only with 5 requests and each actual > read will costs L2 table read. This is a read problem. This condition > could happen on fragmented FS without a problem. But what you're saying is that this makes a more efficient use of cache memory. If the guest OS has a lot of unused space but is very fragmented then you don't want to fill up your cache with L2 entries that are not going to be used. It's better to read smaller chunks from the L2 table so there are fewer chances of having to evict entries from the cache. Therefore this results in less cache misses and better I/O performance. Ok, this sounds perfectly reasonable to me. If the cache is however big enough for the whole disk then you never need to evict entries, so with an HDD you actually want to take advantage of disk seeks and read as many L2 entries as possible. However it's true that in this case this will only affect the initial reads. Once the cache is full there's no need to read the L2 tables from disk anymore and the performance will be the same, so your point remains valid. Still, one of the goals from my proposal is to reduce the amount of metadata needed for the image. No matter how efficient you make the cache, the only way to reduce the amount of L2 entries is to increase the cluster size. And increasing the cluster size results in slower COW and less efficient use of disk space. >>> Another problem is the amount of data written. We are writing >>> entire cluster in write operation and this is also insane. It is >>> possible to perform fallocate() and actual data write on normal >>> modern filesystem. >> But that only works when filling the cluster with zeroes, doesn't it? If >> there's a backing image you need to bring all the contents from there. > > Yes. Backing images are problems. Though, even with sub-clusters, we > will suffer exactly the same with the amount of IOPSes as even with > that head and tail have to be read. 
If you are spoken about > subclusters equals to FS block size and avoid COW at all, this would > be terribly slow later on with sequential reading. In such an approach > sequential reading will result in random read. > > Guest OSes are written keeping in mind that adjacent LBAs are really > adjacent and reading them sequentially is a very good idea. This > invariant will be broken for the case of subclusters. This invariant is already broken by the very design of the qcow2 format, subclusters don't really add anything new there. For any given cluster size you can write 4k in every odd cluster, then do the same in every even cluster, and you'll get an equally fragmented image. Berto ^ permalink raw reply [flat|nested] 64+ messages in thread
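Berto's point that only a larger cluster size reduces the amount of L2 metadata follows directly from the formula in the original proposal, max_l2_size = virtual_disk_size * 8 / cluster_size. A quick tabulation for a 1 TB image:

```python
KiB, MiB, TiB = 1024, 1024**2, 1024**4

def max_l2_size(virtual_disk_size, cluster_size):
    # From the proposal: 8 bytes of L2 entry per data cluster.
    return virtual_disk_size * 8 // cluster_size

disk = 1 * TiB
for cs in (64 * KiB, 1 * MiB, 2 * MiB):
    print(f"{cs // KiB:5} KiB clusters -> "
          f"{max_l2_size(disk, cs) // MiB} MiB of L2 tables")

# 64 KiB clusters (the default) need 128 MiB of L2 metadata for 1 TiB;
# 2 MiB clusters cut that to 4 MiB -- at the cost of slower COW and
# less efficient allocation, which is what subclusters try to offset.
assert max_l2_size(disk, 64 * KiB) == 128 * MiB
assert max_l2_size(disk, 2 * MiB) == 4 * MiB
```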
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-13 13:21 ` Alberto Garcia @ 2017-04-13 13:30 ` Denis V. Lunev 2017-04-13 13:59 ` Kevin Wolf 2017-04-13 15:04 ` Alberto Garcia 2017-04-13 13:51 ` Kevin Wolf 1 sibling, 2 replies; 64+ messages in thread From: Denis V. Lunev @ 2017-04-13 13:30 UTC (permalink / raw) To: Alberto Garcia, qemu-devel Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz On 04/13/2017 04:21 PM, Alberto Garcia wrote: > On Thu 13 Apr 2017 02:44:51 PM CEST, Denis V. Lunev wrote: >>>> 1) current L2 cache management seems very wrong to me. Each cache >>>> miss means that we have to read entire L2 cache block. This means >>>> that in the worst case (when dataset of the test does not fit L2 >>>> cache size we read 64kb of L2 table for each 4 kb read). >>>> >>>> The situation is MUCH worse once we are starting to increase >>>> cluster size. For 1 Mb blocks we have to read 1 Mb on each cache >>>> miss. >>>> >>>> The situation can be cured immediately once we will start reading >>>> L2 cache with 4 or 8kb chunks. We have patchset for this for our >>>> downstream and preparing it for upstream. >>> Correct, although the impact of this depends on whether you are using >>> SDD or HDD. >>> >>> With an SSD what you want is to minimize is the number of unnecessary >>> reads, so reading small chunks will likely increase the performance when >>> there's a cache miss. >>> >>> With an HDD what you want is to minimize the number of seeks. Once you >>> have moved the disk head to the location where the cluster is, reading >>> the whole cluster is relatively inexpensive, so (leaving the memory >>> requirements aside) you generally want to read as much as possible. >> no! This greatly helps for HDD too! >> >> The reason is that you cover areas of the virtual disk much more >> precise. There is very simple example. Let us assume that I have >> f.e. 1 TB virtual HDD with 1 MB block size. 
As far as I understand >> right now L2 cache for the case consists of 4 L2 clusters. >> >> So, I can exhaust current cache only with 5 requests and each actual >> read will costs L2 table read. This is a read problem. This condition >> could happen on fragmented FS without a problem. > But what you're saying is that this makes a more efficient use of cache > memory. > > If the guest OS has a lot of unused space but is very fragmented then > you don't want to fill up your cache with L2 entries that are not going > to be used. It's better to read smaller chunks from the L2 table so > there are fewer chances of having to evict entries from the > cache. Therefore this results in less cache misses and better I/O > performance. > > Ok, this sounds perfectly reasonable to me. > > If the cache is however big enough for the whole disk then you never > need to evict entries, so with an HDD you actually want to take > advantage of disk seeks and read as many L2 entries as possible. > > However it's true that in this case this will only affect the initial > reads. Once the cache is full there's no need to read the L2 tables from > disk anymore and the performance will be the same, so your point remains > valid. > > Still, one of the goals from my proposal is to reduce the amount of > metadata needed for the image. No matter how efficient you make the > cache, the only way to reduce the amount of L2 entries is to increase > the cluster size. And increasing the cluster size results in slower COW > and less efficient use of disk space. actually we can read by clusters if the cache is empty or near empty. Yes, block size should be increased. I perfectly in agreement with your. But I think that we could do that by plain increase of the cluster size without any further dances. Sub-clusters as sub-clusters will help if we are able to avoid COW. With COW I do not see much difference. 
But for the case of the COW absence, further sequential reading will be broken by the fragmented file in the host. That is the point. We should try to avoid host fragmentation at all. >>>> Another problem is the amount of data written. We are writing >>>> entire cluster in write operation and this is also insane. It is >>>> possible to perform fallocate() and actual data write on normal >>>> modern filesystem. >>> But that only works when filling the cluster with zeroes, doesn't it? If >>> there's a backing image you need to bring all the contents from there. >> Yes. Backing images are problems. Though, even with sub-clusters, we >> will suffer exactly the same with the amount of IOPSes as even with >> that head and tail have to be read. If you are spoken about >> subclusters equals to FS block size and avoid COW at all, this would >> be terribly slow later on with sequential reading. In such an approach >> sequential reading will result in random read. >> >> Guest OSes are written keeping in mind that adjacent LBAs are really >> adjacent and reading them sequentially is a very good idea. This >> invariant will be broken for the case of subclusters. > This invariant is already broken by the very design of the qcow2 format, > subclusters don't really add anything new there. For any given cluster > size you can write 4k in every odd cluster, then do the same in every > even cluster, and you'll get an equally fragmented image. > The size of the cluster matters! Our experiments in older Parallels shown that with 1 Mb continuous (!) cluster this invariant is "almost" kept and this works fine for sequential ops. Den ^ permalink raw reply [flat|nested] 64+ messages in thread
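Berto's odd/even example can be made concrete with a toy allocator (all names hypothetical): host offsets are handed out in first-write order, as in a growing qcow2 file, so writing every odd guest cluster and then every even one interleaves the host layout regardless of whether subclusters exist:

```python
def allocate(guest_write_order, n_clusters):
    """Toy first-touch allocator: host offsets are assigned in the
    order guest clusters are first written."""
    host_of = {}
    for guest in guest_write_order:
        if guest not in host_of:
            host_of[guest] = len(host_of)   # next free host cluster
    return [host_of[g] for g in range(n_clusters)]

N = 8
# Write every odd guest cluster first, then every even one.
order = list(range(1, N, 2)) + list(range(0, N, 2))
layout = allocate(order, N)
print(layout)   # -> [4, 0, 5, 1, 6, 2, 7, 3]

# A guest-sequential read now jumps around the host file: count the
# non-contiguous transitions.
seeks = sum(1 for a, b in zip(layout, layout[1:]) if b != a + 1)
assert seeks == N - 1   # every transition is a seek -- fully interleaved
```

Nothing in this model depends on a subcluster bitmap; the fragmentation comes purely from the allocation order, which supports Kevin's claim that the two topics are independent for a given cluster size.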
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-13 13:30 ` Denis V. Lunev @ 2017-04-13 13:59 ` Kevin Wolf 2017-04-13 15:04 ` Alberto Garcia 1 sibling, 0 replies; 64+ messages in thread From: Kevin Wolf @ 2017-04-13 13:59 UTC (permalink / raw) To: Denis V. Lunev Cc: Alberto Garcia, qemu-devel, Stefan Hajnoczi, qemu-block, Max Reitz Am 13.04.2017 um 15:30 hat Denis V. Lunev geschrieben: > On 04/13/2017 04:21 PM, Alberto Garcia wrote: > > On Thu 13 Apr 2017 02:44:51 PM CEST, Denis V. Lunev wrote: > >>>> 1) current L2 cache management seems very wrong to me. Each cache > >>>> miss means that we have to read entire L2 cache block. This means > >>>> that in the worst case (when dataset of the test does not fit L2 > >>>> cache size we read 64kb of L2 table for each 4 kb read). > >>>> > >>>> The situation is MUCH worse once we are starting to increase > >>>> cluster size. For 1 Mb blocks we have to read 1 Mb on each cache > >>>> miss. > >>>> > >>>> The situation can be cured immediately once we will start reading > >>>> L2 cache with 4 or 8kb chunks. We have patchset for this for our > >>>> downstream and preparing it for upstream. > >>> Correct, although the impact of this depends on whether you are using > >>> SDD or HDD. > >>> > >>> With an SSD what you want is to minimize is the number of unnecessary > >>> reads, so reading small chunks will likely increase the performance when > >>> there's a cache miss. > >>> > >>> With an HDD what you want is to minimize the number of seeks. Once you > >>> have moved the disk head to the location where the cluster is, reading > >>> the whole cluster is relatively inexpensive, so (leaving the memory > >>> requirements aside) you generally want to read as much as possible. > >> no! This greatly helps for HDD too! > >> > >> The reason is that you cover areas of the virtual disk much more > >> precise. There is very simple example. Let us assume that I have > >> f.e. 
1 TB virtual HDD with 1 MB block size. As far as I understand > >> right now L2 cache for the case consists of 4 L2 clusters. > >> > >> So, I can exhaust current cache only with 5 requests and each actual > >> read will costs L2 table read. This is a read problem. This condition > >> could happen on fragmented FS without a problem. > > But what you're saying is that this makes a more efficient use of cache > > memory. > > > > If the guest OS has a lot of unused space but is very fragmented then > > you don't want to fill up your cache with L2 entries that are not going > > to be used. It's better to read smaller chunks from the L2 table so > > there are fewer chances of having to evict entries from the > > cache. Therefore this results in less cache misses and better I/O > > performance. > > > > Ok, this sounds perfectly reasonable to me. > > > > If the cache is however big enough for the whole disk then you never > > need to evict entries, so with an HDD you actually want to take > > advantage of disk seeks and read as many L2 entries as possible. > > > > However it's true that in this case this will only affect the initial > > reads. Once the cache is full there's no need to read the L2 tables from > > disk anymore and the performance will be the same, so your point remains > > valid. > > > > Still, one of the goals from my proposal is to reduce the amount of > > metadata needed for the image. No matter how efficient you make the > > cache, the only way to reduce the amount of L2 entries is to increase > > the cluster size. And increasing the cluster size results in slower COW > > and less efficient use of disk space. > actually we can read by clusters if the cache is empty or near empty. > Yes, block size should be increased. I perfectly in agreement with your. > But I think that we could do that by plain increase of the cluster size > without any further dances. Sub-clusters as sub-clusters will help > if we are able to avoid COW. 
With COW I do not see much difference. With COW, it's basically just the same argument as you're making for reading L2 tables: There's a difference between having to copy 64k to process a write request and having to copy 2 MB. > But for the case of the COW absence, further sequential reading will > be broken by the fragmented file in the host. That is the point. We > should try to avoid host fragmentation at all. I still don't understand why you think that subclusters will cause fragmentation that wouldn't be there without subclusters. The opposite is true, with subclusters, larger cluster sizes become more realistic to use. > >>>> Another problem is the amount of data written. We are writing > >>>> entire cluster in write operation and this is also insane. It is > >>>> possible to perform fallocate() and actual data write on normal > >>>> modern filesystem. > >>> But that only works when filling the cluster with zeroes, doesn't it? If > >>> there's a backing image you need to bring all the contents from there. > >> Yes. Backing images are problems. Though, even with sub-clusters, we > >> will suffer exactly the same with the amount of IOPSes as even with > >> that head and tail have to be read. If you are spoken about > >> subclusters equals to FS block size and avoid COW at all, this would > >> be terribly slow later on with sequential reading. In such an approach > >> sequential reading will result in random read. > >> > >> Guest OSes are written keeping in mind that adjacent LBAs are really > >> adjacent and reading them sequentially is a very good idea. This > >> invariant will be broken for the case of subclusters. > > This invariant is already broken by the very design of the qcow2 format, > > subclusters don't really add anything new there. For any given cluster > > size you can write 4k in every odd cluster, then do the same in every > > even cluster, and you'll get an equally fragmented image. > > The size of the cluster matters! 
Our experiments in older Parallels showed that with a 1 Mb continuous (!) cluster this invariant is "almost" kept and this works fine for sequential ops. So you should be interested in what I called "use case 2" in another email in this thread: Making use of the subclusters so that you can increase the cluster size to 2 MB while still maintaining reasonable COW granularity. Kevin ^ permalink raw reply [flat|nested] 64+ messages in thread
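Kevin's "use case 2", together with his earlier preallocation point, can be sketched with the per-cluster bitmap the proposal describes: an L2 entry carries a host offset plus a subcluster allocation bitmap, so a cluster's space can be reserved while every subcluster still reads from the backing file. (The field layout and names below are an illustrative in-memory model, not the proposed on-disk encoding.)

```python
SUBCLUSTERS = 32                  # e.g. 2 MiB cluster / 64 KiB subclusters

class L2Entry:
    """Illustrative model of a subcluster-aware L2 entry."""
    def __init__(self, host_offset=None):
        self.host_offset = host_offset   # space reserved in the image file
        self.bitmap = 0                  # bit i set => subcluster i written

    def mark_written(self, sub):
        self.bitmap |= 1 << sub

    def read_from(self, sub):
        # Preallocation with a backing file: host space may be reserved
        # while unwritten subclusters still come from the backing image.
        if self.host_offset is not None and self.bitmap & (1 << sub):
            return "image"
        return "backing"

entry = L2Entry(host_offset=0x500000)    # cluster space reserved up front
assert all(entry.read_from(s) == "backing" for s in range(SUBCLUSTERS))

entry.mark_written(3)                    # guest writes one 64 KiB subcluster
assert entry.read_from(3) == "image"
assert entry.read_from(4) == "backing"   # COW copies 64 KiB, not 2 MiB
```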
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-13 13:30 ` Denis V. Lunev 2017-04-13 13:59 ` Kevin Wolf @ 2017-04-13 15:04 ` Alberto Garcia 2017-04-13 15:17 ` Denis V. Lunev 1 sibling, 1 reply; 64+ messages in thread From: Alberto Garcia @ 2017-04-13 15:04 UTC (permalink / raw) To: Denis V. Lunev, qemu-devel Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz On Thu 13 Apr 2017 03:30:43 PM CEST, Denis V. Lunev wrote: > Yes, block size should be increased. I perfectly in agreement with > your. But I think that we could do that by plain increase of the > cluster size without any further dances. Sub-clusters as sub-clusters > will help if we are able to avoid COW. With COW I do not see much > difference. I'm trying to summarize your position, tell me if I got everything correctly: 1. We should try to reduce data fragmentation on the qcow2 file, because it will have a long term effect on the I/O performance (as opposed to an effect on the initial operations on the empty image). 2. The way to do that is to increase the cluster size (to 1MB or more). 3. Benefit: increasing the cluster size also decreases the amount of metadata (L2 and refcount). 4. Problem: L2 tables become too big and fill up the cache more easily. To solve this the cache code should do partial reads instead of complete L2 clusters. 5. Problem: larger cluster sizes also mean more data to copy when there's a COW. To solve this the COW code should be modified so it goes from 5 OPs (read head, write head, read tail, write tail, write data) to 2 OPs (read cluster, write modified cluster). 6. Having subclusters adds incompatible changes to the file format, and they offer no benefit after allocation. 7. Subclusters are only really useful if they match the guest fs block size (because you would avoid doing COW on allocation). 
Otherwise the only thing that you get is a faster COW (because you move less data), but the improvement is not dramatic and it's better if we do what's proposed in point 5. 8. Even if the subcluster size matches the guest block size, you'll get very fast initial allocation but also more chances to end up with a very fragmented qcow2 image, which is worse in the long run. 9. Problem: larger clusters make a less efficient use of disk space, but that's a drawback you're fine with considering all of the above. Is that a fair summary of what you're trying to say? Anything else missing? Berto ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-13 15:04 ` Alberto Garcia @ 2017-04-13 15:17 ` Denis V. Lunev 2017-04-18 11:52 ` Alberto Garcia 0 siblings, 1 reply; 64+ messages in thread From: Denis V. Lunev @ 2017-04-13 15:17 UTC (permalink / raw) To: Alberto Garcia, qemu-devel Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz On 04/13/2017 06:04 PM, Alberto Garcia wrote: > On Thu 13 Apr 2017 03:30:43 PM CEST, Denis V. Lunev wrote: >> Yes, block size should be increased. I perfectly in agreement with >> your. But I think that we could do that by plain increase of the >> cluster size without any further dances. Sub-clusters as sub-clusters >> will help if we are able to avoid COW. With COW I do not see much >> difference. > I'm trying to summarize your position, tell me if I got everything > correctly: > > 1. We should try to reduce data fragmentation on the qcow2 file, > because it will have a long term effect on the I/O performance (as > opposed to an effect on the initial operations on the empty image). yes > 2. The way to do that is to increase the cluster size (to 1MB or > more). yes > 3. Benefit: increasing the cluster size also decreases the amount of > metadata (L2 and refcount). yes > 4. Problem: L2 tables become too big and fill up the cache more > easily. To solve this the cache code should do partial reads > instead of complete L2 clusters. yes. We can read full cluster as originally if L2 cache is empty. > 5. Problem: larger cluster sizes also mean more data to copy when > there's a COW. To solve this the COW code should be modified so it > goes from 5 OPs (read head, write head, read tail, write tail, > write data) to 2 OPs (read cluster, write modified cluster). yes, with small tweak if head and tail are in different clusters. In this case we will end up with 3 OPs. > 6. Having subclusters adds incompatible changes to the file format, > and they offer no benefit after allocation. yes > 7. 
Subclusters are only really useful if they match the guest fs block > size (because you would avoid doing COW on allocation). Otherwise > the only thing that you get is a faster COW (because you move less > data), but the improvement is not dramatic and it's better if we do > what's proposed in point 5. yes > 8. Even if the subcluster size matches the guest block size, you'll > get very fast initial allocation but also more chances to end up > with a very fragmented qcow2 image, which is worse in the long run. yes > 9. Problem: larger clusters make a less efficient use of disk space, > but that's a drawback you're fine with considering all of the > above. yes > Is that a fair summary of what you're trying to say? Anything else > missing? yes. 5a. Problem: initial cluster allocation without COW. Could be made cluster-size agnostic with the help of fallocate() call. Big clusters are even better as the amount of such allocations is reduced. Thank you very much for this cool summary! I am too tongue-tied. Den ^ permalink raw reply [flat|nested] 64+ messages in thread
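Point 5a refers to the fallocate() call: space can be reserved without writing any zeroes, so initial allocation cost does not grow with the cluster size. A minimal sketch from the shell (the file name is illustrative, not part of the proposal):

```shell
#!/bin/bash
# Sketch of point 5a: reserve a 1 MB cluster's worth of space up front
# with fallocate(), so first-touch allocation needs no zero-filling writes.
# /tmp/cluster.bin is an illustrative file name, not from the thread.
img=/tmp/cluster.bin
rm -f "$img"
fallocate -l $((1024 * 1024)) "$img"   # allocate 1 MB of extents, no data written
stat -c 'size=%s blocks=%b' "$img"     # blocks > 0: space is really reserved
```

This is the same mechanism Roman demonstrates later in the thread, where a fallocate()d file comes back from filefrag as a single extent.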
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-13 15:17 ` Denis V. Lunev @ 2017-04-18 11:52 ` Alberto Garcia 2017-04-18 17:27 ` Denis V. Lunev 0 siblings, 1 reply; 64+ messages in thread From: Alberto Garcia @ 2017-04-18 11:52 UTC (permalink / raw) To: Denis V. Lunev, qemu-devel Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz On Thu 13 Apr 2017 05:17:21 PM CEST, Denis V. Lunev wrote: > On 04/13/2017 06:04 PM, Alberto Garcia wrote: >> On Thu 13 Apr 2017 03:30:43 PM CEST, Denis V. Lunev wrote: >>> Yes, block size should be increased. I perfectly in agreement with >>> your. But I think that we could do that by plain increase of the >>> cluster size without any further dances. Sub-clusters as sub-clusters >>> will help if we are able to avoid COW. With COW I do not see much >>> difference. >> I'm trying to summarize your position, tell me if I got everything >> correctly: >> >> 1. We should try to reduce data fragmentation on the qcow2 file, >> because it will have a long term effect on the I/O performance (as >> opposed to an effect on the initial operations on the empty image). > yes > >> 2. The way to do that is to increase the cluster size (to 1MB or >> more). > yes > >> 3. Benefit: increasing the cluster size also decreases the amount of >> metadata (L2 and refcount). > yes > >> 4. Problem: L2 tables become too big and fill up the cache more >> easily. To solve this the cache code should do partial reads >> instead of complete L2 clusters. > yes. We can read full cluster as originally if L2 cache is empty. > >> 5. Problem: larger cluster sizes also mean more data to copy when >> there's a COW. To solve this the COW code should be modified so it >> goes from 5 OPs (read head, write head, read tail, write tail, >> write data) to 2 OPs (read cluster, write modified cluster). > yes, with small tweak if head and tail are in different clusters. In > this case we > will end up with 3 OPs. > >> 6. 
Having subclusters adds incompatible changes to the file format, >> and they offer no benefit after allocation. > yes > >> 7. Subclusters are only really useful if they match the guest fs block >> size (because you would avoid doing COW on allocation). Otherwise >> the only thing that you get is a faster COW (because you move less >> data), but the improvement is not dramatic and it's better if we do >> what's proposed in point 5. > yes > >> 8. Even if the subcluster size matches the guest block size, you'll >> get very fast initial allocation but also more chances to end up >> with a very fragmented qcow2 image, which is worse in the long run. > yes > >> 9. Problem: larger clusters make a less efficient use of disk space, >> but that's a drawback you're fine with considering all of the >> above. > yes > >> Is that a fair summary of what you're trying to say? Anything else >> missing? > yes. > > 5a. Problem: initial cluster allocation without COW. Could be made > cluster-size agnostic with the help of fallocate() call. Big > clusters are even > better as the amount of such allocations is reduced. > > Thank you very much for this cool summary! I am too tongue-tied. Hi Denis, I don't have the data to verify all your claims here, but in general what you say makes sense. Although I'm not sure if I agree with everything (especially on whether any of this applies to SSD drives at all) it seems that we all agree that the COW algorithm can be improved, so perhaps I should start by taking a look at that. Regards, Berto ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-18 11:52 ` Alberto Garcia @ 2017-04-18 17:27 ` Denis V. Lunev 0 siblings, 0 replies; 64+ messages in thread From: Denis V. Lunev @ 2017-04-18 17:27 UTC (permalink / raw) To: Alberto Garcia, qemu-devel Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz On 04/18/2017 02:52 PM, Alberto Garcia wrote: > On Thu 13 Apr 2017 05:17:21 PM CEST, Denis V. Lunev wrote: >> On 04/13/2017 06:04 PM, Alberto Garcia wrote: >>> On Thu 13 Apr 2017 03:30:43 PM CEST, Denis V. Lunev wrote: >>>> Yes, block size should be increased. I perfectly in agreement with >>>> your. But I think that we could do that by plain increase of the >>>> cluster size without any further dances. Sub-clusters as sub-clusters >>>> will help if we are able to avoid COW. With COW I do not see much >>>> difference. >>> I'm trying to summarize your position, tell me if I got everything >>> correctly: >>> >>> 1. We should try to reduce data fragmentation on the qcow2 file, >>> because it will have a long term effect on the I/O performance (as >>> opposed to an effect on the initial operations on the empty image). >> yes >> >>> 2. The way to do that is to increase the cluster size (to 1MB or >>> more). >> yes >> >>> 3. Benefit: increasing the cluster size also decreases the amount of >>> metadata (L2 and refcount). >> yes >> >>> 4. Problem: L2 tables become too big and fill up the cache more >>> easily. To solve this the cache code should do partial reads >>> instead of complete L2 clusters. >> yes. We can read full cluster as originally if L2 cache is empty. >> >>> 5. Problem: larger cluster sizes also mean more data to copy when >>> there's a COW. To solve this the COW code should be modified so it >>> goes from 5 OPs (read head, write head, read tail, write tail, >>> write data) to 2 OPs (read cluster, write modified cluster). >> yes, with small tweak if head and tail are in different clusters. 
In >> this case we >> will end up with 3 OPs. >> >>> 6. Having subclusters adds incompatible changes to the file format, >>> and they offer no benefit after allocation. >> yes >> >>> 7. Subclusters are only really useful if they match the guest fs block >>> size (because you would avoid doing COW on allocation). Otherwise >>> the only thing that you get is a faster COW (because you move less >>> data), but the improvement is not dramatic and it's better if we do >>> what's proposed in point 5. >> yes >> >>> 8. Even if the subcluster size matches the guest block size, you'll >>> get very fast initial allocation but also more chances to end up >>> with a very fragmented qcow2 image, which is worse in the long run. >> yes >> >>> 9. Problem: larger clusters make a less efficient use of disk space, >>> but that's a drawback you're fine with considering all of the >>> above. >> yes >> >>> Is that a fair summary of what you're trying to say? Anything else >>> missing? >> yes. >> >> 5a. Problem: initial cluster allocation without COW. Could be made >> cluster-size agnostic with the help of fallocate() call. Big >> clusters are even >> better as the amount of such allocations is reduced. >> >> Thank you very much for this cool summary! I am too tongue-tied. > Hi Denis, > > I don't have the have data to verify all your claims here, but in > general what you say makes sense. > > Although I'm not sure if I agree with everything (especially on whether > any of this applies to SSD drives at all) it seems that we all agree > that the COW algorithm can be improved, so perhaps I should start by > taking a look at that. > > Regards, > > Berto I understand. I just wanted to raise another possible (compatible) approach, which could help. Den ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-13 13:21 ` Alberto Garcia 2017-04-13 13:30 ` Denis V. Lunev @ 2017-04-13 13:51 ` Kevin Wolf 2017-04-13 14:15 ` Alberto Garcia 2017-04-13 14:42 ` [Qemu-devel] " Denis V. Lunev 1 sibling, 2 replies; 64+ messages in thread From: Kevin Wolf @ 2017-04-13 13:51 UTC (permalink / raw) To: Alberto Garcia Cc: Denis V. Lunev, qemu-devel, Stefan Hajnoczi, qemu-block, Max Reitz Am 13.04.2017 um 15:21 hat Alberto Garcia geschrieben: > This invariant is already broken by the very design of the qcow2 format, > subclusters don't really add anything new there. For any given cluster > size you can write 4k in every odd cluster, then do the same in every > even cluster, and you'll get an equally fragmented image. Because this scenario has appeared repeatedly in this thread: Can we please use a more realistic one that shows an actual problem? Because with 8k or more for the cluster size you don't get any qcow2 fragmentation with 4k even/odd writes (which is a pathological case anyway), and the file systems are clever enough to cope with it, too. 
Just to confirm this experimentally, I ran this short script: ---------------------------------------------------------------- #!/bin/bash ./qemu-img create -f qcow2 /tmp/test.qcow2 64M echo even blocks for i in $(seq 0 32767); do echo "write $((i * 8))k 4k"; done | ./qemu-io /tmp/test.qcow2 > /dev/null echo odd blocks for i in $(seq 0 32767); do echo "write $((i * 8 + 4))k 4k"; done | ./qemu-io /tmp/test.qcow2 > /dev/null ./qemu-img map /tmp/test.qcow2 filefrag -v /tmp/test.qcow2 ---------------------------------------------------------------- And sure enough, this is the output: ---------------------------------------------------------------- Formatting '/tmp/test.qcow2', fmt=qcow2 size=67108864 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16 even blocks odd blocks Offset Length Mapped to File 0 0x4000000 0x50000 /tmp/test.qcow2 Filesystem type is: 58465342 File size of /tmp/test.qcow2 is 67436544 (16464 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 47: 142955.. 143002: 48: 1: 48.. 48: 143016.. 143016: 1: 143003: 2: 64.. 79: 142868.. 142883: 16: 143017: 3: 80.. 111: 155386.. 155417: 32: 142884: 4: 112.. 303: 227558.. 227749: 192: 155418: 5: 304.. 559: 228382.. 228637: 256: 227750: 6: 560.. 1071: 455069.. 455580: 512: 228638: 7: 1072.. 2095: 485544.. 486567: 1024: 455581: 8: 2096.. 4143: 497978.. 500025: 2048: 486568: 9: 4144.. 8239: 508509.. 512604: 4096: 500026: 10: 8240.. 16431: 563122.. 571313: 8192: 512605: 11: 16432.. 32815: 632969.. 649352: 16384: 571314: eof /tmp/test.qcow2: 12 extents found ---------------------------------------------------------------- That is, on the qcow2 level we have exactly 0% fragmentation, everything is completely contiguous in a single chunk. XFS as the container of the test image creates a few more extents, but as you can see, it uses fairly large extent sizes in the end (and it would use even larger ones if I wrote more than 64 MB). 
Kevin
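Kevin's result can be checked with a little cluster arithmetic: with 64 KB clusters, the even pass at i*8k touches each cluster for the first time in strictly ascending order (a new cluster every 8 iterations), so allocation is sequential, and the odd pass at i*8k+4k only hits clusters that are already allocated. A sketch:

```shell
#!/bin/bash
# Sketch: map 4 KB writes from the even/odd script onto 64 KB clusters.
# Even-pass writes land at i*8k; a 64 KB cluster holds 8 such writes,
# so the first touch of each cluster happens in ascending order.
cluster=$((64 * 1024))
for i in 0 7 8 15 16; do
    off=$((i * 8 * 1024))
    echo "i=$i: even write @ $((off / 1024))k -> cluster $((off / cluster))"
done
```

That is why the qemu-img map output shows one fully contiguous mapping: the pathological-looking pattern degenerates into sequential cluster allocation as soon as the cluster size exceeds the write stride.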
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-13 13:51 ` Kevin Wolf @ 2017-04-13 14:15 ` Alberto Garcia 2017-04-13 14:27 ` Kevin Wolf 2017-04-13 14:42 ` [Qemu-devel] " Denis V. Lunev 1 sibling, 1 reply; 64+ messages in thread From: Alberto Garcia @ 2017-04-13 14:15 UTC (permalink / raw) To: Kevin Wolf Cc: Denis V. Lunev, qemu-devel, Stefan Hajnoczi, qemu-block, Max Reitz On Thu 13 Apr 2017 03:51:55 PM CEST, Kevin Wolf wrote: >> This invariant is already broken by the very design of the qcow2 >> format, subclusters don't really add anything new there. For any >> given cluster size you can write 4k in every odd cluster, then do the >> same in every even cluster, and you'll get an equally fragmented >> image. > > Because this scenario has appeared repeatedly in this thread: Can we > please use a more realistic one that shows an actual problem? Because > with 8k or more for the cluster size you don't get any qcow2 > fragmentation with 4k even/odd writes (which is a pathological case > anyway), and the file systems are clever enough to cope with it, too. > > Just to confirm this experimentally, I ran this short script: > > ---------------------------------------------------------------- > #!/bin/bash > ./qemu-img create -f qcow2 /tmp/test.qcow2 64M > > echo even blocks > for i in $(seq 0 32767); do echo "write $((i * 8))k 4k"; done | ./qemu-io /tmp/test.qcow2 > /dev/null > echo odd blocks > for i in $(seq 0 32767); do echo "write $((i * 8 + 4))k 4k"; done | ./qemu-io /tmp/test.qcow2 > /dev/null > > ./qemu-img map /tmp/test.qcow2 > filefrag -v /tmp/test.qcow2 > ---------------------------------------------------------------- But that's because while you're writing on every other 4k block the cluster size is 64k, so you're effectively allocating clusters in sequential order. That's why you get this: > Offset Length Mapped to File > 0 0x4000000 0x50000 /tmp/test.qcow2 You would need to either have 4k clusters, or space writes even more. 
Here's a simpler example, mkfs.ext4 on an empty drive gets you something like this: ---------------------------------------------------------------- Offset Length Mapped to File [...] 0x1040000 0x10000 0x150000 test.qcow2 0x1050000 0x50000 0x100000 test.qcow2 0x10a0000 0x10000 0x190000 test.qcow2 0x10b0000 0x30000 0x160000 test.qcow2 0x10e0000 0x10000 0x200000 test.qcow2 0x10f0000 0x30000 0x1a0000 test.qcow2 0x1120000 0x10000 0x210000 test.qcow2 0x1130000 0x30000 0x1d0000 test.qcow2 0x1160000 0x10000 0x280000 test.qcow2 0x1170000 0x30000 0x220000 test.qcow2 [...] ---------------------------------------------------------------- Now, I haven't measured the effect of this on I/O performance, but Denis's point seems in principle valid to me. Berto ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-13 14:15 ` Alberto Garcia @ 2017-04-13 14:27 ` Kevin Wolf 2017-04-13 16:42 ` [Qemu-devel] [Qemu-block] " Roman Kagan 0 siblings, 1 reply; 64+ messages in thread From: Kevin Wolf @ 2017-04-13 14:27 UTC (permalink / raw) To: Alberto Garcia Cc: Denis V. Lunev, qemu-devel, Stefan Hajnoczi, qemu-block, Max Reitz Am 13.04.2017 um 16:15 hat Alberto Garcia geschrieben: > On Thu 13 Apr 2017 03:51:55 PM CEST, Kevin Wolf wrote: > >> This invariant is already broken by the very design of the qcow2 > >> format, subclusters don't really add anything new there. For any > >> given cluster size you can write 4k in every odd cluster, then do the > >> same in every even cluster, and you'll get an equally fragmented > >> image. > > > > Because this scenario has appeared repeatedly in this thread: Can we > > please use a more realistic one that shows an actual problem? Because > > with 8k or more for the cluster size you don't get any qcow2 > > fragmentation with 4k even/odd writes (which is a pathological case > > anyway), and the file systems are clever enough to cope with it, too. > > > > Just to confirm this experimentally, I ran this short script: > > > > ---------------------------------------------------------------- > > #!/bin/bash > > ./qemu-img create -f qcow2 /tmp/test.qcow2 64M > > > > echo even blocks > > for i in $(seq 0 32767); do echo "write $((i * 8))k 4k"; done | ./qemu-io /tmp/test.qcow2 > /dev/null > > echo odd blocks > > for i in $(seq 0 32767); do echo "write $((i * 8 + 4))k 4k"; done | ./qemu-io /tmp/test.qcow2 > /dev/null > > > > ./qemu-img map /tmp/test.qcow2 > > filefrag -v /tmp/test.qcow2 > > ---------------------------------------------------------------- > > But that's because while you're writing on every other 4k block the > cluster size is 64k, so you're effectively allocating clusters in > sequential order. 
That's why you get this: > > > Offset Length Mapped to File > > 0 0x4000000 0x50000 /tmp/test.qcow2 > > You would need to either have 4k clusters, or space writes even more. > > Here's a simpler example, mkfs.ext4 on an empty drive gets you something > like this: > [...] My point wasn't that qcow2 doesn't fragment, but that Denis and you were both using a really bad example. You were trying to construct an artificially bad image and you actually ended up constructing a perfect one. > Now, I haven't measured the effect of this on I/O performance, but > Denis's point seems in principle valid to me. In principle yes, but especially his fear of host file system fragmentation seems a bit exaggerated. If I use 64k even/odd writes in the script, I end up with a horribly fragmented qcow2 image, but still perfectly contiguous layout of the image file in the file system. We can and probably should do something about the qcow2 fragmentation eventually (I guess a more intelligent cluster allocation strategy could go a long way there), but I wouldn't worry too much about the host file system. Kevin ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-13 14:27 ` Kevin Wolf @ 2017-04-13 16:42 ` Roman Kagan 0 siblings, 0 replies; 64+ messages in thread From: Roman Kagan @ 2017-04-13 16:42 UTC (permalink / raw) To: Kevin Wolf Cc: Alberto Garcia, Denis V. Lunev, qemu-block, qemu-devel, Stefan Hajnoczi, Max Reitz On Thu, Apr 13, 2017 at 04:27:35PM +0200, Kevin Wolf wrote: > Am 13.04.2017 um 16:15 hat Alberto Garcia geschrieben: > > On Thu 13 Apr 2017 03:51:55 PM CEST, Kevin Wolf wrote: > > >> This invariant is already broken by the very design of the qcow2 > > >> format, subclusters don't really add anything new there. For any > > >> given cluster size you can write 4k in every odd cluster, then do the > > >> same in every even cluster, and you'll get an equally fragmented > > >> image. > > > > > > Because this scenario has appeared repeatedly in this thread: Can we > > > please use a more realistic one that shows an actual problem? Because > > > with 8k or more for the cluster size you don't get any qcow2 > > > fragmentation with 4k even/odd writes (which is a pathological case > > > anyway), and the file systems are clever enough to cope with it, too. 
> > > > > > Just to confirm this experimentally, I ran this short script: > > > > > > ---------------------------------------------------------------- > > > #!/bin/bash > > > ./qemu-img create -f qcow2 /tmp/test.qcow2 64M > > > > > > echo even blocks > > > for i in $(seq 0 32767); do echo "write $((i * 8))k 4k"; done | ./qemu-io /tmp/test.qcow2 > /dev/null > > > echo odd blocks > > > for i in $(seq 0 32767); do echo "write $((i * 8 + 4))k 4k"; done | ./qemu-io /tmp/test.qcow2 > /dev/null > > > > > > ./qemu-img map /tmp/test.qcow2 > > > filefrag -v /tmp/test.qcow2 > > > ---------------------------------------------------------------- > > > > But that's because while you're writing on every other 4k block the > > cluster size is 64k, so you're effectively allocating clusters in > > sequential order. That's why you get this: > > > > > Offset Length Mapped to File > > > 0 0x4000000 0x50000 /tmp/test.qcow2 > > > > You would need to either have 4k clusters, or space writes even more. > > > > Here's a simpler example, mkfs.ext4 on an empty drive gets you something > > like this: > > [...] > > My point wasn't that qcow2 doesn't fragment, but that Denis and you were > both using a really bad example. You were trying to construct an > artificially bad image and you actually ended up constructing a perfect > one. > > > Now, I haven't measured the effect of this on I/O performance, but > > Denis's point seems in principle valid to me. > > In principle yes, but especially his fear of host file system > fragmentation seems a bit exaggerated. If I use 64k even/odd writes in > the script, I end up with a horribly fragmented qcow2 image, but still > perfectly contiguous layout of the image file in the file system. > > We can and probably should do something about the qcow2 fragmentation > eventually (I guess a more intelligent cluster allocation strategy could > go a long way there), but I wouldn't worry to much about the host file > system. I beg to disagree. 
I didn't have QEMU with subcluster allocation enabled (you did, didn't you?) so I went ahead with a raw file: # truncate --size 64k bbb [14/14] # filefrag -v bbb Filesystem type is: ef53 File size of bbb is 65536 (16 blocks of 4096 bytes) bbb: 0 extents found # for i in {0..7}; do echo write $[(i * 2) * 4]k 4k; done | qemu-io bbb ... # for i in {0..7}; do echo write $[(i * 2 + 1) * 4]k 4k; done | qemu-io bbb ... # filefrag -v bbb Filesystem type is: ef53 File size of bbb is 65536 (16 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 1: 65860793.. 65860794: 2: 1: 2.. 2: 65859644.. 65859644: 1: 65860795: 2: 3.. 3: 65859651.. 65859651: 1: 65859645: 3: 4.. 4: 65859645.. 65859645: 1: 65859652: 4: 5.. 5: 65859652.. 65859652: 1: 65859646: 5: 6.. 6: 65859646.. 65859646: 1: 65859653: 6: 7.. 7: 65859653.. 65859653: 1: 65859647: 7: 8.. 8: 65859647.. 65859647: 1: 65859654: 8: 9.. 9: 65859654.. 65859654: 1: 65859648: 9: 10.. 10: 65859648.. 65859648: 1: 65859655: 10: 11.. 11: 65859655.. 65859655: 1: 65859649: 11: 12.. 12: 65859649.. 65859649: 1: 65859656: 12: 13.. 13: 65859656.. 65859656: 1: 65859650: 13: 14.. 14: 65859650.. 65859650: 1: 65859657: 14: 15.. 15: 65859657.. 65859657: 1: 65859651: last,eof bbb: 15 extents found So the host filesystem did a very poor job here (ext4 on top of two-way raid0 on top of rotating disks). Naturally, replacing truncate with fallocate in the above example gives no fragmenation: ... # filefrag -v bbb Filesystem type is: ef53 File size of bbb is 65536 (16 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 15: 183616784.. 183616799: 16: last,eof bbb: 1 extent found Roman. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-13 13:51 ` Kevin Wolf 2017-04-13 14:15 ` Alberto Garcia @ 2017-04-13 14:42 ` Denis V. Lunev 1 sibling, 0 replies; 64+ messages in thread From: Denis V. Lunev @ 2017-04-13 14:42 UTC (permalink / raw) To: Kevin Wolf, Alberto Garcia Cc: qemu-devel, Stefan Hajnoczi, qemu-block, Max Reitz On 04/13/2017 04:51 PM, Kevin Wolf wrote: > Am 13.04.2017 um 15:21 hat Alberto Garcia geschrieben: >> This invariant is already broken by the very design of the qcow2 format, >> subclusters don't really add anything new there. For any given cluster >> size you can write 4k in every odd cluster, then do the same in every >> even cluster, and you'll get an equally fragmented image. > Because this scenario has appeared repeatedly in this thread: Can we > please use a more realistic one that shows an actual problem? Because > with 8k or more for the cluster size you don't get any qcow2 > fragmentation with 4k even/odd writes (which is a pathological case > anyway), and the file systems are clever enough to cope with it, too. 
> > Just to confirm this experimentally, I ran this short script: > > ---------------------------------------------------------------- > #!/bin/bash > ./qemu-img create -f qcow2 /tmp/test.qcow2 64M > > echo even blocks > for i in $(seq 0 32767); do echo "write $((i * 8))k 4k"; done | ./qemu-io /tmp/test.qcow2 > /dev/null > echo odd blocks > for i in $(seq 0 32767); do echo "write $((i * 8 + 4))k 4k"; done | ./qemu-io /tmp/test.qcow2 > /dev/null > > ./qemu-img map /tmp/test.qcow2 > filefrag -v /tmp/test.qcow2 > ---------------------------------------------------------------- > > And sure enough, this is the output: > > ---------------------------------------------------------------- > Formatting '/tmp/test.qcow2', fmt=qcow2 size=67108864 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16 > even blocks > odd blocks > Offset Length Mapped to File > 0 0x4000000 0x50000 /tmp/test.qcow2 > Filesystem type is: 58465342 > File size of /tmp/test.qcow2 is 67436544 (16464 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 47: 142955.. 143002: 48: > 1: 48.. 48: 143016.. 143016: 1: 143003: > 2: 64.. 79: 142868.. 142883: 16: 143017: > 3: 80.. 111: 155386.. 155417: 32: 142884: > 4: 112.. 303: 227558.. 227749: 192: 155418: > 5: 304.. 559: 228382.. 228637: 256: 227750: > 6: 560.. 1071: 455069.. 455580: 512: 228638: > 7: 1072.. 2095: 485544.. 486567: 1024: 455581: > 8: 2096.. 4143: 497978.. 500025: 2048: 486568: > 9: 4144.. 8239: 508509.. 512604: 4096: 500026: > 10: 8240.. 16431: 563122.. 571313: 8192: 512605: > 11: 16432.. 32815: 632969.. 649352: 16384: 571314: eof > /tmp/test.qcow2: 12 extents found > ---------------------------------------------------------------- > > That is, on the qcow2 level we have exactly 0% fragmentation, everything > is completely contiguous in a single chunk. 
XFS as the container of the > test image creates a few more extents, but as you can see, it uses > fairly large extent sizes in the end (and it would use even larger ones > if I wrote more than 64 MB). I am spoken about image like this: #!/bin/bash qemu-img create -f qcow2 /tmp/test.qcow2 64M echo even blocks for i in $(seq 0 512); do echo "write $((i * 128 + 1))k 4k"; done | qemu-io /tmp/test.qcow2 > /dev/null echo odd blocks for i in $(seq 0 512); do echo "write $((i * 128 + 65))k 4k"; done | qemu-io /tmp/test.qcow2 > /dev/null echo fragmented strace -f -e pread64 qemu-io -c "read 0 64M" /tmp/test.qcow2 2>&1 | wc -l rm -rf 1.img qemu-img create -f qcow2 /tmp/test.qcow2 64M echo sequential for i in $(seq 0 1024); do echo "write $((i * 64))k 4k"; done | qemu-io /tmp/test.qcow2 > /dev/null strace -f -e pread64 qemu-io -c "read 0 64M" /tmp/test.qcow2 2>&1 | wc -l and the difference is important - see the amount of read operations reported: 1032 vs 9 iris ~/tmp/2 $ ./1.sh Formatting '/tmp/test.qcow2', fmt=qcow2 size=67108864 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16 even blocks odd blocks fragmented 1032 <------------------------- (1) Formatting '/tmp/test.qcow2', fmt=qcow2 size=67108864 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16 sequential main-loop: WARNING: I/O thread spun for 1000 iterations 9 <-------------------------- (2) iris ~/tmp/2 $ Subclusters will work exactly like (1) rather than (2). With a big block (1 Mb) one could expect much better performance for sequential operations as (2). The file (1) is continuous from the host point of view, correct, but it will be accessed randomly from guest. See the difference in the amount of reads performed. Den ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-06 15:01 [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation Alberto Garcia ` (3 preceding siblings ...) 2017-04-12 16:54 ` Denis V. Lunev @ 2017-04-12 17:55 ` Denis V. Lunev 2017-04-12 18:20 ` Eric Blake 4 siblings, 1 reply; 64+ messages in thread From: Denis V. Lunev @ 2017-04-12 17:55 UTC (permalink / raw) To: Alberto Garcia, qemu-devel Cc: Kevin Wolf, Stefan Hajnoczi, qemu-block, Max Reitz On 04/06/2017 06:01 PM, Alberto Garcia wrote: > Hi all, > > over the past couple of months I discussed with some of you the > possibility to extend the qcow2 format in order to improve its > performance and reduce its memory requirements (particularly with very > large images). > > After some discussion in the mailing list and the #qemu IRC channel I > decided to write a prototype of a new extension for qcow2 so I could > understand better the scope of the changes and have some preliminary > data about its effects. > > This e-mail is the formal presentation of my proposal to extend the > on-disk qcow2 format. As you can see this is still an RFC. Due to the > nature of the changes I would like to get as much feedback as possible > before going forward. > > === Problem === > > The original problem that I wanted to address is the memory > requirements of qcow2 files if you want to have very large images and > still keep good I/O performance. This is a consequence of its very > structure, which I'm going to describe now. > > A qcow2 image is divided into units of constant size called clusters, > and among other things it contains metadata that maps guest addresses > to host addresses (the so-called L1 and L2 tables). > > There are two basic problems that result from this: > > 1) Reading from or writing to a qcow2 image involves reading the > corresponding entry on the L2 table that maps the guest address to > the host address. 
This is very slow because it involves two I/O > operations: one on the L2 table and the other one on the actual > data cluster. > > 2) A cluster is the smallest unit of allocation. Therefore writing a > mere 512 bytes to an empty disk requires allocating a complete > cluster and filling it with zeroes (or with data from the backing > image if there is one). This wastes more disk space and also has a > negative impact on I/O. > > Problem (1) can be solved by keeping in memory a cache of the L2 > tables (QEMU has an "l2_cache_size" parameter for this purpose). The > amount of disk space used by L2 tables depends on two factors: the > disk size and the cluster size. > > The cluster size can be configured when the image is created, and it > can be any power of two between 512 bytes and 2 MB (it defaults to 64 > KB). > > The maximum amount of space needed for the L2 tables can be calculated > with the following formula: > > max_l2_size = virtual_disk_size * 8 / cluster_size > > Large images require a large amount of metadata, and therefore a large > amount of memory for the L2 cache. With the default cluster size > (64KB) that's 128MB of L2 cache for a 1TB qcow2 image. > > The only way to reduce the size of the L2 tables is therefore > increasing the cluster size, but this makes the image less efficient > as described earlier in (2). > > === The proposal === > > The idea of this proposal is to extend the qcow2 format by allowing > subcluster allocation. There would be an optional feature that would > need to be enabled when creating the image. The on-disk format would > remain essentially the same, except that each data cluster would be > internally divided into a number of subclusters of equal size. > > What this means in practice is that each entry on an L2 table would be > accompanied by a bitmap indicating the allocation state of each one of > the subclusters for that cluster. There are several alternatives for > storing the bitmap, described below. 
> > Other than L2 entries, all other data structures would remain > unchanged, but for data clusters the smallest unit of allocation would > now be the subcluster. Reference counting would still be at the > cluster level, because there's no way to reference individual > subclusters. Copy-on-write on internal snapshots would need to copy > complete clusters, so that scenario would not benefit from this > change. > > I see two main use cases for this feature: > > a) The qcow2 image is not too large / the L2 cache is not a problem, > but you want to increase the allocation performance. In this case > you can have something like a 128KB cluster with 4KB subclusters > (with 4KB being a common block size in ext4 and other filesystems) > > b) The qcow2 image is very large and you want to save metadata space > in order to have a smaller L2 cache. In this case you can go for > the maximum cluster size (2MB) but you want to have smaller > subclusters to increase the allocation performance and optimize the > disk usage. This was actually my original use case. > > === Test results === > > I have a basic working prototype of this. It's still incomplete -and > buggy :)- but it gives an idea of what we can expect from it. In my > implementation each data cluster has 8 subclusters, but that's not set > in stone (see below). > > I made all tests on an SSD drive, writing to an empty qcow2 image with > a fully populated 40GB backing image, performing random writes using > fio with a block size of 4KB. > > I tried with the default and maximum cluster sizes (64KB and 2MB) and > also with some other sizes. I also made sure to try with 32KB clusters > so the subcluster size matches the 4KB block size used for the I/O. > > It's important to point out that once a cluster has been completely > allocated then having subclusters offers no performance benefit. 
> For this reason the size of the image for these tests (40GB) was
> chosen to be large enough to guarantee that there are always new
> clusters being allocated. This is therefore a worst-case scenario (or
> best-case for this feature, if you want).
>
> Here are the results (subcluster size in brackets):
>
> |-----------------+----------------+-----------------+-------------------|
> | cluster size    | subclusters=on | subclusters=off | Max L2 cache size |
> |-----------------+----------------+-----------------+-------------------|
> | 2 MB (256 KB)   | 440 IOPS       | 100 IOPS        | 160 KB (*)        |
> | 512 KB (64 KB)  | 1000 IOPS      | 300 IOPS        | 640 KB            |
> | 64 KB (8 KB)    | 3000 IOPS      | 1000 IOPS       | 5 MB              |
> | 32 KB (4 KB)    | 12000 IOPS     | 1300 IOPS       | 10 MB             |
> | 4 KB (512 B)    | 100 IOPS       | 100 IOPS        | 80 MB             |
> |-----------------+----------------+-----------------+-------------------|
>
> (*) The L2 cache must be a multiple of the cluster size, so in this
>     case it must be 2MB. On the table I chose to show how much of
>     those 2MB are actually used so you can compare it with the other
>     cases.
>
> Some comments about the results:
>
> - For the 64KB, 512KB and 2MB cases, having subclusters increases
>   write performance roughly by a factor of three. This happens
>   because for each cluster allocation there's less data to copy from
>   the backing image. For the same reason, the smaller the cluster,
>   the better the performance. As expected, 64KB clusters with no
>   subclusters perform roughly the same as 512KB clusters with 64KB
>   subclusters.
>
> - The 32KB case is the most interesting one. Without subclusters it's
>   not very different from the 64KB case, but having a subcluster with
>   the same size as the I/O block eliminates the need for COW entirely
>   and the performance skyrockets (10 times faster!).
>
> - 4KB is however very slow. I attribute this to the fact that the
>   cluster size is so small that a new cluster needs to be allocated
>   for every single write and its refcount updated accordingly.
>   The L2 and refcount tables are also so small that they are too
>   inefficient and need to grow all the time.
>
> Here are the results when writing to an empty 40GB qcow2 image with no
> backing file. The numbers are of course different but as you can see
> the patterns are similar:
>
> |-----------------+----------------+-----------------+-------------------|
> | cluster size    | subclusters=on | subclusters=off | Max L2 cache size |
> |-----------------+----------------+-----------------+-------------------|
> | 2 MB (256 KB)   | 1200 IOPS      | 255 IOPS        | 160 KB            |
> | 512 KB (64 KB)  | 3000 IOPS      | 700 IOPS        | 640 KB            |
> | 64 KB (8 KB)    | 7200 IOPS      | 3300 IOPS       | 5 MB              |
> | 32 KB (4 KB)    | 12300 IOPS     | 4200 IOPS       | 10 MB             |
> | 4 KB (512 B)    | 100 IOPS       | 100 IOPS        | 80 MB             |
> |-----------------+----------------+-----------------+-------------------|
>
> === Changes to the on-disk format ===
>
> The qcow2 on-disk format needs to change so each L2 entry has a bitmap
> indicating the allocation status of each subcluster. There are three
> possible states (unallocated, allocated, all zeroes), so we need two
> bits per subcluster.
>
> An L2 entry is 64 bits wide, and this is the current format (for
> uncompressed clusters):
>
>     63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
>     00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
>     **<----> <--------------------------------------------------><------->*
>       Rsrved           host cluster offset of data              Reserved
>      (6 bits)                  (47 bits)                        (8 bits)
>
>     bit 63: refcount == 1   (QCOW_OFLAG_COPIED)
>     bit 62: compressed = 1  (QCOW_OFLAG_COMPRESSED)
>     bit  0: all zeros       (QCOW_OFLAG_ZERO)
>
> I thought of three alternatives for storing the subcluster bitmaps. I
> haven't completely made up my mind about which one is best, so I'd
> like to present all three for discussion. Here they are:
>
> (1) Storing the bitmap inside the 64-bit entry
>
>     This is a simple alternative and is the one that I chose for my
>     prototype.
>     There are 14 unused bits plus the "all zeroes" one. If we steal
>     one from the host offset we have the 16 bits that we need for the
>     bitmap and we have 46 bits left for the host offset, which is
>     more than enough.
>
>     * Pros:
>       + Simple. Few changes compared to the current qcow2 format.
>
>     * Cons:
>       - Only 8 subclusters per cluster. We would not be making the
>         most of this feature.
>
>       - No reserved bits left for the future.
>
> (2) Making L2 entries 128-bit wide.
>
>     In this alternative we would double the size of L2 entries. The
>     first half would remain unchanged and the second one would store
>     the bitmap. That would leave us with 32 subclusters per cluster.
>
>     * Pros:
>       + More subclusters per cluster. We could have images with
>         e.g. 128k clusters with 4k subclusters.
>
>     * Cons:
>       - More space needed for L2 entries. The same cluster size would
>         require a cache twice as large, although having subcluster
>         allocation would compensate for this.
>
>       - More changes to the code to handle 128-bit entries.
>
>       - We would still be wasting the 14 reserved bits that L2
>         entries have.
>
> (3) Storing the bitmap somewhere else
>
>     This would involve storing the bitmap separate from the L2 tables
>     (perhaps using the bitmaps extension? I haven't looked much into
>     this).
>
>     * Pros:
>       + Possibility to make the number of subclusters configurable
>         by the user (32, 64, 128, ...)
>       + All existing metadata structures would remain untouched
>         (although the "all zeroes" bit in L2 entries would probably
>         become unused).
>
>     * Cons:
>       - As with alternative (2), more space needed for metadata.
>
>       - The bitmap would also need to be cached for performance
>         reasons.
>
>       - Possibly one more *_cache_size option.
>
>       - One more metadata structure to be updated for each
>         allocation. This would probably impact I/O negatively.
>
> === Compressed clusters ===
>
> My idea is that compressed clusters would remain the same.
They are > read-only anyway so they would not be affected by any of these > changes. > > =========================== > > I think I managed to summarize all the ideas that I had in my mind, > but I'm sure you probably have questions and comments, so I'd be happy > to get as much feedback as possible. > > So, looking forward to reading your opinions. > > Regards, > > Berto > Let me rephrase a bit. The proposal is looking very close to the following case: - raw sparse file In this case all writes are very-very-very fast and from the guest point of view all is OK. Sequential data is really sequential. Though once we are starting to perform any sequential IO, we have real pain. Each sequential operation becomes random on the host file system and the IO becomes very slow. This will not be observed with the test, but the performance will degrade very soon. This is why raw sparse files are not used in the real life. Hypervisor must maintain guest OS invariants and the data, which is nearby from the guest point of view should be kept nearby in host. This is why actually that 64kb data blocks are extremely small :) OK. This is offtopic. One can easily recreate this case using the following simple test: - write each even 4kb page of the disk, one by one - write each odd 4 kb page of the disk - run sequential read with f.e. 1 MB data block Normally we should still have native performance, but with raw sparse files and (as far as understand the proposal) sub-clusters we will have the host IO pattern exactly like random. This seems like a big and inevitable problem of the approach for me. We still have the potential to improve current algorithms and not introduce non-compatible changes. Sorry if this is too emotional. We have learned above in a very hard way. Den ^ permalink raw reply [flat|nested] 64+ messages in thread
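[Editorial note: Denis's even/odd experiment is easy to model. Below is a toy allocator (my construction, not QEMU code) that hands out host space strictly in first-write order, as a raw sparse file or an unpreallocated subcluster scheme would; a guest-sequential read then finds almost nothing adjacent on the host:]

```python
PAGES = 1024  # guest disk of 1024 x 4 KiB pages

host_of = {}       # guest page -> host page, assigned in write order
next_free = 0

def write(page):
    global next_free
    if page not in host_of:          # first touch allocates host space
        host_of[page] = next_free
        next_free += 1

for p in range(0, PAGES, 2):         # pass 1: every even page
    write(p)
for p in range(1, PAGES, 2):         # pass 2: every odd page
    write(p)

# How many guest-adjacent page pairs are still host-adjacent?
adjacent = sum(1 for p in range(PAGES - 1)
               if host_of[p + 1] == host_of[p] + 1)
print(adjacent, "of", PAGES - 1)     # -> 0 of 1023: fully interleaved
```

Every guest-sequential 1 MB read then decomposes into 4 KiB hops across the host file, which is exactly the random host I/O pattern Denis warns about.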
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-12 17:55 ` Denis V. Lunev @ 2017-04-12 18:20 ` Eric Blake 2017-04-12 19:02 ` Denis V. Lunev 0 siblings, 1 reply; 64+ messages in thread From: Eric Blake @ 2017-04-12 18:20 UTC (permalink / raw) To: Denis V. Lunev, Alberto Garcia, qemu-devel Cc: Kevin Wolf, qemu-block, Stefan Hajnoczi, Max Reitz [-- Attachment #1: Type: text/plain, Size: 3059 bytes --] On 04/12/2017 12:55 PM, Denis V. Lunev wrote: > Let me rephrase a bit. > > The proposal is looking very close to the following case: > - raw sparse file > > In this case all writes are very-very-very fast and from the > guest point of view all is OK. Sequential data is really sequential. > Though once we are starting to perform any sequential IO, we > have real pain. Each sequential operation becomes random > on the host file system and the IO becomes very slow. This > will not be observed with the test, but the performance will > degrade very soon. > > This is why raw sparse files are not used in the real life. > Hypervisor must maintain guest OS invariants and the data, > which is nearby from the guest point of view should be kept > nearby in host. > > This is why actually that 64kb data blocks are extremely > small :) OK. This is offtopic. Not necessarily. Using subclusters may allow you to ramp up to larger cluster sizes. We can also set up our allocation (and pre-allocation schemes) so that we always reserve an entire cluster on the host at the time we allocate the cluster, even if we only plan to write to particular subclusters within that cluster. In fact, 32 subclusters to a 2M cluster results in 64k subclusters, where you are still writing at 64k data chunks but could now have guaranteed 2M locality, compared to the current qcow2 with 64k clusters that writes in 64k data chunks but with no locality. 
Just because we don't write the entire cluster up front does not mean that we don't have to allocate (or have a mode that allocates) the entire cluster at the time of the first subcluster use. > > One can easily recreate this case using the following simple > test: > - write each even 4kb page of the disk, one by one > - write each odd 4 kb page of the disk > - run sequential read with f.e. 1 MB data block > > Normally we should still have native performance, but > with raw sparse files and (as far as understand the > proposal) sub-clusters we will have the host IO pattern > exactly like random. Only if we don't pre-allocate entire clusters at the point that we first touch the cluster. > > This seems like a big and inevitable problem of the approach > for me. We still have the potential to improve current > algorithms and not introduce non-compatible changes. > > Sorry if this is too emotional. We have learned above in a > very hard way. And your experience is useful, as a way to fine-tune this proposal. But it doesn't mean we should entirely ditch this proposal. I also appreciate that you have patches in the works to reduce bottlenecks (such as turning sub-cluster writes into 3 IOPs rather than 5, by doing read-head, read-tail, write-cluster, instead of the current read-head, write-head, write-body, read-tail, write-tail), but think that both approaches are complimentary, not orthogonal. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 604 bytes --] ^ permalink raw reply [flat|nested] 64+ messages in thread
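[Editorial note: Eric's mitigation — reserve the entire cluster in the host at first touch, even when only one subcluster is written — can be bolted onto the same kind of toy model (again a sketch, not QEMU code). With 64 KiB clusters of sixteen 4 KiB subclusters, the even/odd pattern keeps full host locality:]

```python
PAGES, PER_CLUSTER = 1024, 16   # 4 KiB pages, 64 KiB clusters

cluster_base = {}               # guest cluster -> first host page
next_free = 0

def host_page(page):
    global next_free
    c = page // PER_CLUSTER
    if c not in cluster_base:            # first touch of the cluster:
        cluster_base[c] = next_free      # reserve it contiguously,
        next_free += PER_CLUSTER         # even if mostly unwritten
    return cluster_base[c] + page % PER_CLUSTER

host_of = {}
for p in list(range(0, PAGES, 2)) + list(range(1, PAGES, 2)):
    host_of[p] = host_page(p)

adjacent = sum(1 for p in range(PAGES - 1)
               if host_of[p + 1] == host_of[p] + 1)
print(adjacent, "of", PAGES - 1)         # -> 1023 of 1023
```

The second (odd) pass lands inside already-reserved clusters, so guest layout and host layout coincide — Denis's fragmentation objection applies only if clusters are not reserved whole.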
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-12 18:20 ` Eric Blake @ 2017-04-12 19:02 ` Denis V. Lunev 2017-04-13 9:44 ` Kevin Wolf 0 siblings, 1 reply; 64+ messages in thread From: Denis V. Lunev @ 2017-04-12 19:02 UTC (permalink / raw) To: Eric Blake, Alberto Garcia, qemu-devel Cc: Kevin Wolf, qemu-block, Stefan Hajnoczi, Max Reitz On 04/12/2017 09:20 PM, Eric Blake wrote: > On 04/12/2017 12:55 PM, Denis V. Lunev wrote: >> Let me rephrase a bit. >> >> The proposal is looking very close to the following case: >> - raw sparse file >> >> In this case all writes are very-very-very fast and from the >> guest point of view all is OK. Sequential data is really sequential. >> Though once we are starting to perform any sequential IO, we >> have real pain. Each sequential operation becomes random >> on the host file system and the IO becomes very slow. This >> will not be observed with the test, but the performance will >> degrade very soon. >> >> This is why raw sparse files are not used in the real life. >> Hypervisor must maintain guest OS invariants and the data, >> which is nearby from the guest point of view should be kept >> nearby in host. >> >> This is why actually that 64kb data blocks are extremely >> small :) OK. This is offtopic. > Not necessarily. Using subclusters may allow you to ramp up to larger > cluster sizes. We can also set up our allocation (and pre-allocation > schemes) so that we always reserve an entire cluster on the host at the > time we allocate the cluster, even if we only plan to write to > particular subclusters within that cluster. In fact, 32 subclusters to > a 2M cluster results in 64k subclusters, where you are still writing at > 64k data chunks but could now have guaranteed 2M locality, compared to > the current qcow2 with 64k clusters that writes in 64k data chunks but > with no locality. 
> > Just because we don't write the entire cluster up front does not mean > that we don't have to allocate (or have a mode that allocates) the > entire cluster at the time of the first subcluster use. this is something that I do not understand. We reserve the entire cluster at allocation. Why do we need sub-clusters at cluster "creation" without COW? fallocate() and preallocation completely covers this stage for now in full and solve all botllenecks we have. 4k/8k granularity of L2 cache solves metadata write problem. But IMHO it is not important. Normally we sync metadata at guest sync. The only difference I am observing in this case is "copy-on-write" pattern of the load with backing store or snapshot, where we copy only partial cluster. Thus we should clearly define that this is the only area of improvement and start discussion from this point. Simple cluster creation is not the problem anymore. I think that this reduces the scope of the proposal a lot. Initial proposal starts from stating 2 problems: "1) Reading from or writing to a qcow2 image involves reading the corresponding entry on the L2 table that maps the guest address to the host address. This is very slow because it involves two I/O operations: one on the L2 table and the other one on the actual data cluster. 2) A cluster is the smallest unit of allocation. Therefore writing a mere 512 bytes to an empty disk requires allocating a complete cluster and filling it with zeroes (or with data from the backing image if there is one). This wastes more disk space and also has a negative impact on I/O." With pre-allocation (2) would be exactly the same as now and all gain with sub-clusters will be effectively 0 as we will have to preallocate entire cluster. (1) is also questionable. I think that the root of the problem is the cost of L2 cache miss, which is giant. With 1 Mb or 2 Mb cluster the cost of the cache miss is not acceptable at all. With page granularity of L2 cache this problem is seriously reduced. 
We can switch to bigger blocks without much problem. Again, the only problem is COW. Thus I think that the proposal should be seriously re-analyzed and refined with this input. >> One can easily recreate this case using the following simple >> test: >> - write each even 4kb page of the disk, one by one >> - write each odd 4 kb page of the disk >> - run sequential read with f.e. 1 MB data block >> >> Normally we should still have native performance, but >> with raw sparse files and (as far as understand the >> proposal) sub-clusters we will have the host IO pattern >> exactly like random. > Only if we don't pre-allocate entire clusters at the point that we first > touch the cluster. > >> This seems like a big and inevitable problem of the approach >> for me. We still have the potential to improve current >> algorithms and not introduce non-compatible changes. >> >> Sorry if this is too emotional. We have learned above in a >> very hard way. > And your experience is useful, as a way to fine-tune this proposal. But > it doesn't mean we should entirely ditch this proposal. I also > appreciate that you have patches in the works to reduce bottlenecks > (such as turning sub-cluster writes into 3 IOPs rather than 5, by doing > read-head, read-tail, write-cluster, instead of the current read-head, > write-head, write-body, read-tail, write-tail), but think that both > approaches are complimentary, not orthogonal. > Thank you :) I just prefer to dead end with compatible changes and start incompatible ones after that. There are really a lot of other possibilities for viable optimizations, which are not yet done on top of proposed ones: - IO plug/unplug support at QCOW2 level. plug in controller is definitely not enough. 
This affects only the first IO operation while we could have a bunch of them - sort and merge requests list in submit - direct AIO read/write support to avoid extra coroutine creation for read-write ops if we are doing several operations in parallel in qcow2_co_readv/writev. Right now AIO operations are emulated via coroutines which have some impact - offload compression/decompression/encryption to side thread - optimize sequential write operation not aligned to the cluster boundary if cluster is not allocated initially May be it would be useful to create intermediate DIO structure for IO operation which will carry offset/iovec on it like done in kernel. I do think that such compatible changes could improve raw performance even with the current format 2-3 times, which is brought out by the proposal. Den ^ permalink raw reply [flat|nested] 64+ messages in thread
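[Editorial note: of the compatible optimizations Denis lists, "sort and merge requests list in submit" is the most self-contained. A minimal sketch — the (offset, length) request representation is my assumption, not an existing QEMU structure:]

```python
def merge_requests(reqs):
    """Coalesce overlapping or adjacent (offset, length) requests."""
    merged = []
    for off, ln in sorted(reqs):
        if merged and off <= merged[-1][0] + merged[-1][1]:
            # touches or overlaps the previous request: extend it
            prev_off, prev_ln = merged[-1]
            merged[-1] = (prev_off,
                          max(prev_off + prev_ln, off + ln) - prev_off)
        else:
            merged.append((off, ln))
    return merged

# three 4 KiB writes arriving out of order become one 12 KiB submit
print(merge_requests([(8192, 4096), (0, 4096), (4096, 4096)]))
# -> [(0, 12288)]
```

Fewer, larger host requests is the same locality argument made elsewhere in the thread, applied at submit time instead of at allocation time.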
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-12 19:02 ` Denis V. Lunev @ 2017-04-13 9:44 ` Kevin Wolf 2017-04-13 10:19 ` Denis V. Lunev 0 siblings, 1 reply; 64+ messages in thread From: Kevin Wolf @ 2017-04-13 9:44 UTC (permalink / raw) To: Denis V. Lunev Cc: Eric Blake, Alberto Garcia, qemu-devel, qemu-block, Stefan Hajnoczi, Max Reitz Am 12.04.2017 um 21:02 hat Denis V. Lunev geschrieben: > On 04/12/2017 09:20 PM, Eric Blake wrote: > > On 04/12/2017 12:55 PM, Denis V. Lunev wrote: > >> Let me rephrase a bit. > >> > >> The proposal is looking very close to the following case: > >> - raw sparse file > >> > >> In this case all writes are very-very-very fast and from the > >> guest point of view all is OK. Sequential data is really sequential. > >> Though once we are starting to perform any sequential IO, we > >> have real pain. Each sequential operation becomes random > >> on the host file system and the IO becomes very slow. This > >> will not be observed with the test, but the performance will > >> degrade very soon. > >> > >> This is why raw sparse files are not used in the real life. > >> Hypervisor must maintain guest OS invariants and the data, > >> which is nearby from the guest point of view should be kept > >> nearby in host. > >> > >> This is why actually that 64kb data blocks are extremely > >> small :) OK. This is offtopic. > > Not necessarily. Using subclusters may allow you to ramp up to larger > > cluster sizes. We can also set up our allocation (and pre-allocation > > schemes) so that we always reserve an entire cluster on the host at the > > time we allocate the cluster, even if we only plan to write to > > particular subclusters within that cluster. 
In fact, 32 subclusters to > > a 2M cluster results in 64k subclusters, where you are still writing at > > 64k data chunks but could now have guaranteed 2M locality, compared to > > the current qcow2 with 64k clusters that writes in 64k data chunks but > > with no locality. > > > > Just because we don't write the entire cluster up front does not mean > > that we don't have to allocate (or have a mode that allocates) the > > entire cluster at the time of the first subcluster use. > > this is something that I do not understand. We reserve the entire cluster at > allocation. Why do we need sub-clusters at cluster "creation" without COW? > fallocate() and preallocation completely covers this stage for now in > full and > solve all botllenecks we have. 4k/8k granularity of L2 cache solves metadata > write problem. But IMHO it is not important. Normally we sync metadata > at guest sync. > > The only difference I am observing in this case is "copy-on-write" pattern > of the load with backing store or snapshot, where we copy only partial > cluster. > Thus we should clearly define that this is the only area of improvement and > start discussion from this point. Simple cluster creation is not the problem > anymore. I think that this reduces the scope of the proposal a lot. I think subclusters have two different valid use cases: 1. The first one is what you describe and what I was mostly interested in: By reducing the subcluster size to the file system block size, we can completely avoid any COW because there won't be partial writes any more. You attended my KVM Forum talk two years ago where I described how COW is the biggest pain point for qcow2 performance, costing about 50% of performance for initial writes after taking a snapshot. So while you're right that many other improvements are possible, I think this is one of the most important points to address. 2. 
The other use case is what Berto had in mind: Keep subclusters at the current cluster size (to avoid getting even larger COWs), but increase the cluster size. This reduces the metadata size and allows to cover a larger range with the same L2 cache size. Additionally, it can possibly reduce fragmentation of the image on the file system level. The fundamental observation in both cases is that it's impractical to use the same granularity for cluster mapping (want larger sizes) and for status tracking (want small sizes to avoid partial writes). > Initial proposal starts from stating 2 problems: > > "1) Reading from or writing to a qcow2 image involves reading the > corresponding entry on the L2 table that maps the guest address to > the host address. This is very slow because it involves two I/O > operations: one on the L2 table and the other one on the actual > data cluster. > > 2) A cluster is the smallest unit of allocation. Therefore writing a > mere 512 bytes to an empty disk requires allocating a complete > cluster and filling it with zeroes (or with data from the backing > image if there is one). This wastes more disk space and also has a > negative impact on I/O." > > With pre-allocation (2) would be exactly the same as now and all > gain with sub-clusters will be effectively 0 as we will have to > preallocate entire cluster. To be honest, I'm not sure if we should do preallocation or whether that's the file system's job. In any case, the big improvement is that we don't have to read from the backing file, so even if we keep preallocating the whole cluster, we'd gain something there. I also think that preallocating would use something like fallocate() rather than pwrite(), so it should involve a lot less I/O. > (1) is also questionable. I think that the root of the problem > is the cost of L2 cache miss, which is giant. With 1 Mb or 2 Mb > cluster the cost of the cache miss is not acceptable at all. 
> With page granularity of L2 cache this problem is seriously > reduced. We can switch to bigger blocks without much problem. > Again, the only problem is COW. Yes, with larger cluster sizes, the L2 cache sucks currently. It needs to be able to cache partial tables. With the currently common cluster sizes, I don't think it makes a big difference, though. As long as you end up making a single request, 4k or 64k size isn't that different, and three 4k requests are almost certainly slower than one 64k request. So I agree, for use case 2, some work on the cache is required in addition to increasing the cluster size. > There are really a lot of other possibilities for viable optimizations, > which > are not yet done on top of proposed ones: > - IO plug/unplug support at QCOW2 level. plug in controller is definitely > not enough. This affects only the first IO operation while we could have > a bunch of them > - sort and merge requests list in submit > - direct AIO read/write support to avoid extra coroutine creation for > read-write ops if we are doing several operations in parallel in > qcow2_co_readv/writev. Right now AIO operations are emulated > via coroutines which have some impact > - offload compression/decompression/encryption to side thread > - optimize sequential write operation not aligned to the cluster boundary > if cluster is not allocated initially > > May be it would be useful to create intermediate DIO structure for IO > operation which will carry offset/iovec on it like done in kernel. I do > think > that such compatible changes could improve raw performance even > with the current format 2-3 times, which is brought out by the proposal. Well, if you can show a concrete optimisation and ideally also numbers, that would certainly be interesting. I am very skeptical of 2-3 times improvement (not the least because we would be well over native performance then...), but I'm happy to be convinced otherwise. 
Maybe start a new thread like this one if/when you think you have a detailled idea for one of them that is ready for discussion. The one that I actually think could make a big difference for its use case is the compression/decompression/encryption one, but honestly, if someone really cared about these, I think there would be lower hanging fruit (e.g. we're currently reading in data only cluster by cluster for compressed images). Kevin ^ permalink raw reply [flat|nested] 64+ messages in thread
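[Editorial note: Kevin's remark that the cache "needs to be able to cache partial tables" for large clusters is easy to quantify. A back-of-the-envelope sketch, using the format's 8-byte L2 entries; the 4 KiB page granularity follows Denis's suggestion:]

```python
def coverage_bytes(cached_bytes, cluster_size, entry_size=8):
    # guest data addressed by `cached_bytes` worth of L2 entries
    return cached_bytes // entry_size * cluster_size

KiB, MiB, GiB = 1024, 1024**2, 1024**3

# caching whole L2 tables: with 2 MiB clusters an L2 table is itself
# 2 MiB, so one cache miss drags in metadata for 512 GiB of guest data
assert coverage_bytes(2 * MiB, 2 * MiB) == 512 * GiB

# page-granularity caching: a single 4 KiB page still maps 1 GiB of
# contiguous guest data, so a fragmented working set needs far less
# cache memory to avoid misses
assert coverage_bytes(4 * KiB, 2 * MiB) == 1 * GiB
```

The mapping density is the same either way; what page granularity changes is the unit evicted and refilled on a miss, which is what makes 1-2 MiB clusters tolerable.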
* Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-13 9:44 ` Kevin Wolf @ 2017-04-13 10:19 ` Denis V. Lunev 2017-04-14 1:06 ` [Qemu-devel] [Qemu-block] " John Snow 0 siblings, 1 reply; 64+ messages in thread From: Denis V. Lunev @ 2017-04-13 10:19 UTC (permalink / raw) To: Kevin Wolf Cc: Eric Blake, Alberto Garcia, qemu-devel, qemu-block, Stefan Hajnoczi, Max Reitz On 04/13/2017 12:44 PM, Kevin Wolf wrote: > Am 12.04.2017 um 21:02 hat Denis V. Lunev geschrieben: >> On 04/12/2017 09:20 PM, Eric Blake wrote: >>> On 04/12/2017 12:55 PM, Denis V. Lunev wrote: >>>> Let me rephrase a bit. >>>> >>>> The proposal is looking very close to the following case: >>>> - raw sparse file >>>> >>>> In this case all writes are very-very-very fast and from the >>>> guest point of view all is OK. Sequential data is really sequential. >>>> Though once we are starting to perform any sequential IO, we >>>> have real pain. Each sequential operation becomes random >>>> on the host file system and the IO becomes very slow. This >>>> will not be observed with the test, but the performance will >>>> degrade very soon. >>>> >>>> This is why raw sparse files are not used in the real life. >>>> Hypervisor must maintain guest OS invariants and the data, >>>> which is nearby from the guest point of view should be kept >>>> nearby in host. >>>> >>>> This is why actually that 64kb data blocks are extremely >>>> small :) OK. This is offtopic. >>> Not necessarily. Using subclusters may allow you to ramp up to larger >>> cluster sizes. We can also set up our allocation (and pre-allocation >>> schemes) so that we always reserve an entire cluster on the host at the >>> time we allocate the cluster, even if we only plan to write to >>> particular subclusters within that cluster. 
In fact, 32 subclusters to >>> a 2M cluster results in 64k subclusters, where you are still writing at >>> 64k data chunks but could now have guaranteed 2M locality, compared to >>> the current qcow2 with 64k clusters that writes in 64k data chunks but >>> with no locality. >>> >>> Just because we don't write the entire cluster up front does not mean >>> that we don't have to allocate (or have a mode that allocates) the >>> entire cluster at the time of the first subcluster use. >> this is something that I do not understand. We reserve the entire cluster at >> allocation. Why do we need sub-clusters at cluster "creation" without COW? >> fallocate() and preallocation completely covers this stage for now in >> full and >> solve all botllenecks we have. 4k/8k granularity of L2 cache solves metadata >> write problem. But IMHO it is not important. Normally we sync metadata >> at guest sync. >> >> The only difference I am observing in this case is "copy-on-write" pattern >> of the load with backing store or snapshot, where we copy only partial >> cluster. >> Thus we should clearly define that this is the only area of improvement and >> start discussion from this point. Simple cluster creation is not the problem >> anymore. I think that this reduces the scope of the proposal a lot. > I think subclusters have two different valid use cases: > > 1. The first one is what you describe and what I was mostly interested > in: By reducing the subcluster size to the file system block size, we > can completely avoid any COW because there won't be partial writes > any more. > > You attended my KVM Forum talk two years ago where I described how > COW is the biggest pain point for qcow2 performance, costing about > 50% of performance for initial writes after taking a snapshot. So > while you're right that many other improvements are possible, I think > this is one of the most important points to address. > > 2. 
The other use case is what Berto had in mind: Keep subclusters at the > current cluster size (to avoid getting even larger COWs), but > increase the cluster size. This reduces the metadata size and allows > to cover a larger range with the same L2 cache size. Additionally, it > can possibly reduce fragmentation of the image on the file system > level. > > The fundamental observation in both cases is that it's impractical to > use the same granularity for cluster mapping (want larger sizes) and for > status tracking (want small sizes to avoid partial writes). Fragmented images are very good at the initial moment once the IO is coming first time in small pieces. But it becomes real pain later on once guest will REUSE this area for sequential data written in big chunks. Or the guest could even read this data sequentially later on. You will have random read in host instead of sequential read in guest. This would be serious performance problem. The guest really believes that the data with similar LBAs are adjacent and optimize IO keeping this in mind. With clusters/subclusters/etc you break this fundamental assumption. This is not visible initially, but will trigger later on. >> Initial proposal starts from stating 2 problems: >> >> "1) Reading from or writing to a qcow2 image involves reading the >> corresponding entry on the L2 table that maps the guest address to >> the host address. This is very slow because it involves two I/O >> operations: one on the L2 table and the other one on the actual >> data cluster. >> >> 2) A cluster is the smallest unit of allocation. Therefore writing a >> mere 512 bytes to an empty disk requires allocating a complete >> cluster and filling it with zeroes (or with data from the backing >> image if there is one). This wastes more disk space and also has a >> negative impact on I/O." 
>> >> With pre-allocation (2) would be exactly the same as now and all >> gain with sub-clusters will be effectively 0 as we will have to >> preallocate entire cluster. > To be honest, I'm not sure if we should do preallocation or whether > that's the file system's job. > > In any case, the big improvement is that we don't have to read from the > backing file, so even if we keep preallocating the whole cluster, we'd > gain something there. I also think that preallocating would use > something like fallocate() rather than pwrite(), so it should involve a > lot less I/O. > yes. fallocate() works pretty well and we will have almost the same amount of IO as submitted from the guest. This also helps a lot for sequential writes not aligned to the cluster boundary. >> (1) is also questionable. I think that the root of the problem >> is the cost of L2 cache miss, which is giant. With 1 Mb or 2 Mb >> cluster the cost of the cache miss is not acceptable at all. >> With page granularity of L2 cache this problem is seriously >> reduced. We can switch to bigger blocks without much problem. >> Again, the only problem is COW. > Yes, with larger cluster sizes, the L2 cache sucks currently. It needs > to be able to cache partial tables. > > With the currently common cluster sizes, I don't think it makes a big > difference, though. As long as you end up making a single request, 4k or > 64k size isn't that different, and three 4k requests are almost certainly > slower than one 64k request. > > So I agree, for use case 2, some work on the cache is required in > addition to increasing the cluster size. This helps even with 64kb cluster size. For the case of the fragmented guest filesystem the dataset could be quite fragmented and we can technically use much less memory to cover it if we use page-sized clusters. >> There are really a lot of other possibilities for viable optimizations, >> which >> are not yet done on top of proposed ones: >> - IO plug/unplug support at QCOW2 level. 
plug in controller is definitely >> not enough. This affects only the first IO operation while we could have >> a bunch of them >> - sort and merge requests list in submit >> - direct AIO read/write support to avoid extra coroutine creation for >> read-write ops if we are doing several operations in parallel in >> qcow2_co_readv/writev. Right now AIO operations are emulated >> via coroutines which have some impact >> - offload compression/decompression/encryption to side thread >> - optimize sequential write operation not aligned to the cluster boundary >> if cluster is not allocated initially >> >> Maybe it would be useful to create intermediate DIO structure for IO >> operation which will carry offset/iovec on it like it is done in the kernel. I do >> think >> that such compatible changes could improve raw performance even >> with the current format 2-3 times, which is brought out by the proposal. > Well, if you can show a concrete optimisation and ideally also numbers, > that would certainly be interesting. I am very skeptical of 2-3 times > improvement (not the least because we would be well over native > performance then...), but I'm happy to be convinced otherwise. Maybe > start a new thread like this one if/when you think you have a detailed > idea for one of them that is ready for discussion. > > The one that I actually think could make a big difference for its use > case is the compression/decompression/encryption one, but honestly, if > someone really cared about these, I think there would be lower hanging > fruit (e.g. we're currently reading in data only cluster by cluster for > compressed images). In progress. Compression has a lot of troubles ;) Den ^ permalink raw reply [flat|nested] 64+ messages in thread
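The L2 metadata sizes being traded off in this exchange follow directly from the formula in the original proposal (max_l2_size = virtual_disk_size * 8 / cluster_size). A minimal sketch, assuming 8-byte L2 entries as in current qcow2 and a hypothetical 16-byte extended entry (mapping plus subcluster bitmap) for the proposed format — the 16-byte figure is an assumption, not the final on-disk layout:

```python
def l2_metadata_size(disk_size, cluster_size, entry_size=8):
    """Worst-case total L2 table size: one entry per data cluster."""
    return disk_size * entry_size // cluster_size

# Current qcow2, 64 KiB clusters: 128 MiB of L2 tables for a 1 TiB image.
print(l2_metadata_size(2**40, 64 * 1024) // 2**20, "MiB")
# Raising the cluster size to 2 MiB shrinks this to 4 MiB...
print(l2_metadata_size(2**40, 2 * 2**20) // 2**20, "MiB")
# ...and even if each entry doubled to 16 bytes to carry a subcluster
# bitmap, the total would only be 8 MiB while tracking 64 KiB granularity.
print(l2_metadata_size(2**40, 2 * 2**20, entry_size=16) // 2**20, "MiB")
```

This is the arithmetic behind use case 2: larger clusters cut the L2 cache requirement by the cluster-size ratio, and a per-entry bitmap only doubles it back.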
* Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-13 10:19 ` Denis V. Lunev @ 2017-04-14 1:06 ` John Snow 2017-04-14 4:17 ` Denis V. Lunev 2017-04-14 7:40 ` Roman Kagan 0 siblings, 2 replies; 64+ messages in thread From: John Snow @ 2017-04-14 1:06 UTC (permalink / raw) To: Denis V. Lunev, Kevin Wolf Cc: qemu-block, qemu-devel, Max Reitz, Stefan Hajnoczi On 04/13/2017 06:19 AM, Denis V. Lunev wrote: > On 04/13/2017 12:44 PM, Kevin Wolf wrote: >> Am 12.04.2017 um 21:02 hat Denis V. Lunev geschrieben: >>> On 04/12/2017 09:20 PM, Eric Blake wrote: >>>> On 04/12/2017 12:55 PM, Denis V. Lunev wrote: >>>>> Let me rephrase a bit. >>>>> >>>>> The proposal is looking very close to the following case: >>>>> - raw sparse file >>>>> >>>>> In this case all writes are very-very-very fast and from the >>>>> guest point of view all is OK. Sequential data is really sequential. >>>>> Though once we are starting to perform any sequential IO, we >>>>> have real pain. Each sequential operation becomes random >>>>> on the host file system and the IO becomes very slow. This >>>>> will not be observed with the test, but the performance will >>>>> degrade very soon. >>>>> >>>>> This is why raw sparse files are not used in the real life. >>>>> Hypervisor must maintain guest OS invariants and the data, >>>>> which is nearby from the guest point of view should be kept >>>>> nearby in host. >>>>> >>>>> This is why actually that 64kb data blocks are extremely >>>>> small :) OK. This is offtopic. >>>> Not necessarily. Using subclusters may allow you to ramp up to larger >>>> cluster sizes. We can also set up our allocation (and pre-allocation >>>> schemes) so that we always reserve an entire cluster on the host at the >>>> time we allocate the cluster, even if we only plan to write to >>>> particular subclusters within that cluster. 
In fact, 32 subclusters to >>>> a 2M cluster results in 64k subclusters, where you are still writing at >>>> 64k data chunks but could now have guaranteed 2M locality, compared to >>>> the current qcow2 with 64k clusters that writes in 64k data chunks but >>>> with no locality. >>>> >>>> Just because we don't write the entire cluster up front does not mean >>>> that we don't have to allocate (or have a mode that allocates) the >>>> entire cluster at the time of the first subcluster use. >>> this is something that I do not understand. We reserve the entire cluster at >>> allocation. Why do we need sub-clusters at cluster "creation" without COW? >>> fallocate() and preallocation completely covers this stage for now in >>> full and >>> solve all bottlenecks we have. 4k/8k granularity of L2 cache solves metadata >>> write problem. But IMHO it is not important. Normally we sync metadata >>> at guest sync. >>> >>> The only difference I am observing in this case is "copy-on-write" pattern >>> of the load with backing store or snapshot, where we copy only partial >>> cluster. >>> Thus we should clearly define that this is the only area of improvement and >>> start discussion from this point. Simple cluster creation is not the problem >>> anymore. I think that this reduces the scope of the proposal a lot. >> I think subclusters have two different valid use cases: >> >> 1. The first one is what you describe and what I was mostly interested >> in: By reducing the subcluster size to the file system block size, we >> can completely avoid any COW because there won't be partial writes >> any more. >> >> You attended my KVM Forum talk two years ago where I described how >> COW is the biggest pain point for qcow2 performance, costing about >> 50% of performance for initial writes after taking a snapshot. So >> while you're right that many other improvements are possible, I think >> this is one of the most important points to address. >> >> 2. 
The other use case is what Berto had in mind: Keep subclusters at the >> current cluster size (to avoid getting even larger COWs), but >> increase the cluster size. This reduces the metadata size and allows >> to cover a larger range with the same L2 cache size. Additionally, it >> can possibly reduce fragmentation of the image on the file system >> level. >> >> The fundamental observation in both cases is that it's impractical to >> use the same granularity for cluster mapping (want larger sizes) and for >> status tracking (want small sizes to avoid partial writes). > Fragmented images are very good at the initial moment once the > IO is coming first time in small pieces. But it becomes real > pain later on once guest will REUSE this area for sequential > data written in big chunks. Or the guest could even read this > data sequentially later on. > > You will have random read in host instead of sequential read in > guest. This would be serious performance problem. The guest > really believes that the data with similar LBAs are adjacent and > optimize IO keeping this in mind. With clusters/subclusters/etc > you break this fundamental assumption. This is not visible initially, > but will trigger later on. > Hi Denis, I've read this entire thread now and I really like Berto's summary which I think is one of the best recaps of existing qcow2 problems and this discussion so far. I understand your opinion that we should focus on compatible changes before incompatible ones, and I also understand that you are very concerned about physical fragmentation for reducing long-term IO. What I don't understand is why you think that subclusters will increase fragmentation. If we admit that fragmentation is a problem now, surely increasing cluster sizes to 1 or 2 MB will only help to *reduce* physical fragmentation, right? 
Subclusters as far as I am understanding them will not actually allow subclusters to be located at virtually disparate locations, we will continue to allocate clusters as normal -- we'll just keep track of which portions of the cluster we've written to to help us optimize COW*. So if we have a 1MB cluster with 64k subclusters as a hypothetical, if we write just the first subcluster, we'll have a map like: X--------------- Whatever actually happens to exist in this space, whether it be a hole we punched via fallocate or literal zeroes, this space is known to the filesystem to be contiguous. If we write to the last subcluster, we'll get: X--------------X And again, maybe the dashes are a fallocate hole, maybe they're zeroes. but the last subcluster is located virtually exactly 15 subclusters behind the first, they're not physically contiguous. We've saved the space between them. Future out-of-order writes won't contribute to any fragmentation, at least at this level. You might be able to reduce COW from 5 IOPs to 3 IOPs, but if we tune the subclusters right, we'll have *zero*, won't we? As far as I can tell, this lets us do a lot of good things all at once: (1) We get some COW optimizations (for reasons Berto and Kevin have detailed) (2) We can increase our cluster size (3) L2 cache can cover bigger files with less memory (4) Fragmentation decreases. Is this not a win all around? We can improve throughput for a lot of different reasons all at once, here. Have I misunderstood the discussion so far, anyone? Please correct me where I am wrong and _ELI5_. Thanks, -John > >>> Initial proposal starts from stating 2 problems: >>> >>> "1) Reading from or writing to a qcow2 image involves reading the >>> corresponding entry on the L2 table that maps the guest address to >>> the host address. This is very slow because it involves two I/O >>> operations: one on the L2 table and the other one on the actual >>> data cluster. 
>>> >>> 2) A cluster is the smallest unit of allocation. Therefore writing a >>> mere 512 bytes to an empty disk requires allocating a complete >>> cluster and filling it with zeroes (or with data from the backing >>> image if there is one). This wastes more disk space and also has a >>> negative impact on I/O." >>> >>> With pre-allocation (2) would be exactly the same as now and all >>> gain with sub-clusters will be effectively 0 as we will have to >>> preallocate entire cluster. >> To be honest, I'm not sure if we should do preallocation or whether >> that's the file system's job. >> >> In any case, the big improvement is that we don't have to read from the >> backing file, so even if we keep preallocating the whole cluster, we'd >> gain something there. I also think that preallocating would use >> something like fallocate() rather than pwrite(), so it should involve a >> lot less I/O. >> > yes. fallocate() works pretty well and we will have almost > the same amount of IO as submitted from the guest. This > also helps a lot for sequential writes not aligned to the > cluster boundary. > > >>> (1) is also questionable. I think that the root of the problem >>> is the cost of L2 cache miss, which is giant. With 1 Mb or 2 Mb >>> cluster the cost of the cache miss is not acceptable at all. >>> With page granularity of L2 cache this problem is seriously >>> reduced. We can switch to bigger blocks without much problem. >>> Again, the only problem is COW. >> Yes, with larger cluster sizes, the L2 cache sucks currently. It needs >> to be able to cache partial tables. >> >> With the currently common cluster sizes, I don't think it makes a big >> difference, though. As long as you end up making a single request, 4k or >> 64k size isn't that different, and three 4k requests are almost certainly >> slower than one 64k request. >> >> So I agree, for use case 2, some work on the cache is required in >> addition to increasing the cluster size. 
> This helps even with 64kb cluster size. For the case of the fragmented guest > filesystem the dataset could be quite fragmented and we can technically use > much less memory to cover it if we use page-sized clusters. > > >>> There are really a lot of other possibilities for viable optimizations, >>> which >>> are not yet done on top of proposed ones: >>> - IO plug/unplug support at QCOW2 level. plug in controller is definitely >>> not enough. This affects only the first IO operation while we could have >>> a bunch of them >>> - sort and merge requests list in submit >>> - direct AIO read/write support to avoid extra coroutine creation for >>> read-write ops if we are doing several operations in parallel in >>> qcow2_co_readv/writev. Right now AIO operations are emulated >>> via coroutines which have some impact >>> - offload compression/decompression/encryption to side thread >>> - optimize sequential write operation not aligned to the cluster boundary >>> if cluster is not allocated initially >>> >>> Maybe it would be useful to create intermediate DIO structure for IO >>> operation which will carry offset/iovec on it like it is done in the kernel. I do >>> think >>> that such compatible changes could improve raw performance even >>> with the current format 2-3 times, which is brought out by the proposal. >> Well, if you can show a concrete optimisation and ideally also numbers, >> that would certainly be interesting. I am very skeptical of 2-3 times >> improvement (not the least because we would be well over native >> performance then...), but I'm happy to be convinced otherwise. Maybe >> start a new thread like this one if/when you think you have a detailed >> idea for one of them that is ready for discussion. >> >> The one that I actually think could make a big difference for its use >> case is the compression/decompression/encryption one, but honestly, if >> someone really cared about these, I think there would be lower hanging >> fruit (e.g. 
we're currently reading in data only cluster by cluster for >> compressed images). > In progress. Compression has a lot of troubles ;) > > Den > > ^ permalink raw reply [flat|nested] 64+ messages in thread
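John's X--------------X diagrams can be captured in a toy model — a sketch of the idea only, not the proposed on-disk format: one cluster of sixteen 64 KiB subclusters, where a write marks the subclusters it touches and COW is charged only for subclusters it covers partially:

```python
SUBCLUSTER = 64 * 1024          # assumed 64 KiB subclusters
N = 16                          # 16 per 1 MiB cluster, as in John's example

def write(bitmap, offset, length):
    """Mark touched subclusters allocated; return how many need COW.

    Toy rule from the thread: a subcluster needs a copy from the
    backing file only if the write covers it partially and it was
    not allocated before.
    """
    first = offset // SUBCLUSTER
    last = (offset + length - 1) // SUBCLUSTER
    cow = 0
    for i in range(first, last + 1):
        lo = max(offset, i * SUBCLUSTER)
        hi = min(offset + length, (i + 1) * SUBCLUSTER)
        if (hi - lo) < SUBCLUSTER and not bitmap[i]:
            cow += 1
        bitmap[i] = True
    return cow

bitmap = [False] * N
# Aligned 64 KiB writes to the first and last subcluster: zero COW,
# and the bitmap becomes John's X--------------X picture.
assert write(bitmap, 0, SUBCLUSTER) == 0
assert write(bitmap, 15 * SUBCLUSTER, SUBCLUSTER) == 0
print("".join("X" if b else "-" for b in bitmap))  # X--------------X
# A misaligned 512-byte write still costs one subcluster of COW.
assert write(bitmap, 5 * SUBCLUSTER + 512, 512) == 1
```

This is the "zero COW" claim in miniature: as long as guest writes are aligned to the subcluster size, no copy from the backing file is ever needed, while the cluster itself stays one contiguous host allocation.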
* Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-14 1:06 ` [Qemu-devel] [Qemu-block] " John Snow @ 2017-04-14 4:17 ` Denis V. Lunev 2017-04-18 11:22 ` Kevin Wolf 2017-04-14 7:40 ` Roman Kagan 1 sibling, 1 reply; 64+ messages in thread From: Denis V. Lunev @ 2017-04-14 4:17 UTC (permalink / raw) To: John Snow, Kevin Wolf; +Cc: qemu-block, qemu-devel, Max Reitz, Stefan Hajnoczi [skipped...] > Hi Denis, > > I've read this entire thread now and I really like Berto's summary which > I think is one of the best recaps of existing qcow2 problems and this > discussion so far. > > I understand your opinion that we should focus on compatible changes > before incompatible ones, and I also understand that you are very > concerned about physical fragmentation for reducing long-term IO. > > What I don't understand is why you think that subclusters will increase > fragmentation. If we admit that fragmentation is a problem now, surely > increasing cluster sizes to 1 or 2 MB will only help to *reduce* > physical fragmentation, right? > > Subclusters as far as I am understanding them will not actually allow > subclusters to be located at virtually disparate locations, we will > continue to allocate clusters as normal -- we'll just keep track of > which portions of the cluster we've written to to help us optimize COW*. > > So if we have a 1MB cluster with 64k subclusters as a hypothetical, if > we write just the first subcluster, we'll have a map like: > > X--------------- > > Whatever actually happens to exist in this space, whether it be a hole > we punched via fallocate or literal zeroes, this space is known to the > filesystem to be contiguous. > > If we write to the last subcluster, we'll get: > > X--------------X > > And again, maybe the dashes are a fallocate hole, maybe they're zeroes. > but the last subcluster is located virtually exactly 15 subclusters > behind the first, they're not physically contiguous. 
We've saved the > space between them. Future out-of-order writes won't contribute to any > fragmentation, at least at this level. > > You might be able to reduce COW from 5 IOPs to 3 IOPs, but if we tune > the subclusters right, we'll have *zero*, won't we? > > As far as I can tell, this lets us do a lot of good things all at once: > > (1) We get some COW optimizations (for reasons Berto and Kevin have > detailed) Yes. We are fine with COW. Let us assume that we issue a read of the entire cluster after the COW, in the situation X--------------X with a backing store. This is possible even with 1-2 Mb cluster size. I have seen 4-5 Mb requests from the guest in real life. In this case we will have 3 IOPs: read left X area, read backing, read right X. This is the real drawback of the approach, if sub-cluster size is really small enough, which should be the case for optimal COW. Thus we will have random IO in the host instead of sequential one in guest. Thus we have optimized COW at the cost of long term reads. This is what I am worrying about as we can have a lot of such reads before any further data change. With real holes the situation is even worse. If we have a real hole (obtained with truncate), we are in the case mentioned by Roman. Virtually the space is contiguous, but we have host fragmentation, equal to sub-cluster size. We are in a tough position even with COW. If the sub-cluster size is not equal to the size of the guest filesystem block, we still need to do COW. The only win is the amount of data copied, but the loss in the number of I/O operations is the same. On the other hand, to get a real win, we should have properly aligned guest partitions, know the exact guest filesystem block size etc etc etc. This is not as easy as it may seem. > (2) We can increase our cluster size > (3) L2 cache can cover bigger files with less memory > (4) Fragmentation decreases. Yes. I like (2)-(4) a lot. 
But in my opinion we could stick to 1 MB without sub-clusters and still get (2)-(4). > > Is this not a win all around? We can improve throughput for a lot of > different reasons all at once, here. Have I misunderstood the discussion > so far, anyone? > > Please correct me where I am wrong and _ELI5_. I have tried that, sorry for my illiteracy ;) Den ^ permalink raw reply [flat|nested] 64+ messages in thread
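Denis's 3-I/O read can be counted mechanically under one assumption: a guest read spanning a whole cluster costs one host operation per contiguous run of same-sourced subclusters (allocated in the image vs. served from the backing file). A sketch:

```python
def host_reads(bitmap):
    """Host read ops for a guest read of the whole cluster: one per
    contiguous run of allocated (True) or backing-file (False)
    subclusters, since the source changes at each run boundary."""
    ops, prev = 0, None
    for allocated in bitmap:
        if allocated != prev:
            ops += 1
            prev = allocated
    return ops

full  = [True] * 16                        # fully allocated cluster
edges = [True] + [False] * 14 + [True]     # the X--------------X case
print(host_reads(full))    # 1: one sequential host read
print(host_reads(edges))   # 3: left X, backing middle, right X
```

The model also shows the worst case Denis is warning about: a cluster whose subclusters alternate between image and backing file turns one guest read into a host read per subcluster.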
* Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-14 4:17 ` Denis V. Lunev @ 2017-04-18 11:22 ` Kevin Wolf 2017-04-18 17:30 ` Denis V. Lunev 0 siblings, 1 reply; 64+ messages in thread From: Kevin Wolf @ 2017-04-18 11:22 UTC (permalink / raw) To: Denis V. Lunev Cc: John Snow, qemu-block, qemu-devel, Max Reitz, Stefan Hajnoczi Am 14.04.2017 um 06:17 hat Denis V. Lunev geschrieben: > [skipped...] > > > Hi Denis, > > > > I've read this entire thread now and I really like Berto's summary which > > I think is one of the best recaps of existing qcow2 problems and this > > discussion so far. > > > > I understand your opinion that we should focus on compatible changes > > before incompatible ones, and I also understand that you are very > > concerned about physical fragmentation for reducing long-term IO. > > > > What I don't understand is why you think that subclusters will increase > > fragmentation. If we admit that fragmentation is a problem now, surely > > increasing cluster sizes to 1 or 2 MB will only help to *reduce* > > physical fragmentation, right? > > > > Subclusters as far as I am understanding them will not actually allow > > subclusters to be located at virtually disparate locations, we will > > continue to allocate clusters as normal -- we'll just keep track of > > which portions of the cluster we've written to to help us optimize COW*. > > > > So if we have a 1MB cluster with 64k subclusters as a hypothetical, if > > we write just the first subcluster, we'll have a map like: > > > > X--------------- > > > > Whatever actually happens to exist in this space, whether it be a hole > > we punched via fallocate or literal zeroes, this space is known to the > > filesystem to be contiguous. > > > > If we write to the last subcluster, we'll get: > > > > X--------------X > > > > And again, maybe the dashes are a fallocate hole, maybe they're zeroes. 
> > but the last subcluster is located virtually exactly 15 subclusters > > behind the first, they're not physically contiguous. We've saved the > > space between them. Future out-of-order writes won't contribute to any > > fragmentation, at least at this level. > > > > You might be able to reduce COW from 5 IOPs to 3 IOPs, but if we tune > > the subclusters right, we'll have *zero*, won't we? > > > > As far as I can tell, this lets us do a lot of good things all at once: > > > > (1) We get some COW optimizations (for reasons Berto and Kevin have > > detailed) > Yes. We are fine with COW. Let us assume that we will have issued read > entire cluster command after the COW, in the situation > > X--------------X > > with a backing store. This is possible even with 1-2 Mb cluster size. > I have seen 4-5 Mb requests from the guest in the real life. In this > case we will have 3 IOP: > read left X area, read backing, read right X. > This is the real drawback of the approach, if sub-cluster size is really > small enough, which should be the case for optimal COW. Thus we > will have random IO in the host instead of sequential one in guest. > Thus we have optimized COW at the cost of long term reads. This > is what I am worrying about as we can have a lot of such reads before > any further data change. So just to avoid misunderstandings about what you're comparing here: You get these 3 iops for 2 MB clusters with 64k subclusters, whereas you would get only a single operation for 2 MB clusters without subclusters. Today's 64k clusters without subclusters behave no better than the 2M/64k version, but that's not what you're comparing. Yes, you are correct about this observation. But it is a tradeoff that you're intentionally making when using backing files. In the extreme, there is an alternative that performs so much better: Instead of using a backing file, use 'qemu-img convert' to copy (and defragment) the whole image upfront. No COW whatsoever, no fragmentation, fast reads. 
The downside is that it takes a while to copy the whole image upfront, and it also costs quite a bit of disk space. So once we acknowledge that we're dealing with a tradeoff here, it becomes obvious that neither the extreme setup for performance (copy the whole image upfront) nor the extreme setup for sparseness (COW on a sector level) are the right default for the average case, nor is optimising one-sidedly a good idea. It is good if we can provide solutions for extreme cases, but by default we need to cater for the average case, which cares both about reasonable performance and disk usage. Kevin ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-18 11:22 ` Kevin Wolf @ 2017-04-18 17:30 ` Denis V. Lunev 0 siblings, 0 replies; 64+ messages in thread From: Denis V. Lunev @ 2017-04-18 17:30 UTC (permalink / raw) To: Kevin Wolf; +Cc: John Snow, qemu-block, qemu-devel, Max Reitz, Stefan Hajnoczi On 04/18/2017 02:22 PM, Kevin Wolf wrote: > Am 14.04.2017 um 06:17 hat Denis V. Lunev geschrieben: >> [skipped...] >> >>> Hi Denis, >>> >>> I've read this entire thread now and I really like Berto's summary which >>> I think is one of the best recaps of existing qcow2 problems and this >>> discussion so far. >>> >>> I understand your opinion that we should focus on compatible changes >>> before incompatible ones, and I also understand that you are very >>> concerned about physical fragmentation for reducing long-term IO. >>> >>> What I don't understand is why you think that subclusters will increase >>> fragmentation. If we admit that fragmentation is a problem now, surely >>> increasing cluster sizes to 1 or 2 MB will only help to *reduce* >>> physical fragmentation, right? >>> >>> Subclusters as far as I am understanding them will not actually allow >>> subclusters to be located at virtually disparate locations, we will >>> continue to allocate clusters as normal -- we'll just keep track of >>> which portions of the cluster we've written to to help us optimize COW*. >>> >>> So if we have a 1MB cluster with 64k subclusters as a hypothetical, if >>> we write just the first subcluster, we'll have a map like: >>> >>> X--------------- >>> >>> Whatever actually happens to exist in this space, whether it be a hole >>> we punched via fallocate or literal zeroes, this space is known to the >>> filesystem to be contiguous. >>> >>> If we write to the last subcluster, we'll get: >>> >>> X--------------X >>> >>> And again, maybe the dashes are a fallocate hole, maybe they're zeroes. 
>>> but the last subcluster is located virtually exactly 15 subclusters >>> behind the first, they're not physically contiguous. We've saved the >>> space between them. Future out-of-order writes won't contribute to any >>> fragmentation, at least at this level. >>> >>> You might be able to reduce COW from 5 IOPs to 3 IOPs, but if we tune >>> the subclusters right, we'll have *zero*, won't we? >>> >>> As far as I can tell, this lets us do a lot of good things all at once: >>> >>> (1) We get some COW optimizations (for reasons Berto and Kevin have >>> detailed) >> Yes. We are fine with COW. Let us assume that we will have issued read >> entire cluster command after the COW, in the situation >> >> X--------------X >> >> with a backing store. This is possible even with 1-2 Mb cluster size. >> I have seen 4-5 Mb requests from the guest in the real life. In this >> case we will have 3 IOP: >> read left X area, read backing, read right X. >> This is the real drawback of the approach, if sub-cluster size is really >> small enough, which should be the case for optimal COW. Thus we >> will have random IO in the host instead of sequential one in guest. >> Thus we have optimized COW at the cost of long term reads. This >> is what I am worrying about as we can have a lot of such reads before >> any further data change. > So just to avoid misunderstandings about what you're comparing here: > You get these 3 iops for 2 MB clusters with 64k subclusters, whereas you > would get only a single operation for 2 MB clusters without subclusters. > Today's 64k clusters without subclusters behave no better than the > 2M/64k version, but that's not what you're comparing. > > Yes, you are correct about this observation. But it is a tradeoff that > you're intentionally making when using backing files. In the extreme, > there is an alternative that performs so much better: Instead of using a > backing file, use 'qemu-img convert' to copy (and defragment) the whole > image upfront. 
No COW whatsoever, no fragmentation, fast reads. The > downside is that it takes a while to copy the whole image upfront, and > it also costs quite a bit of disk space. In general, for production environments, this is a total pain. We have a lot of customers with TB images. Free space is also a real problem for them. > So once we acknowledge that we're dealing with a tradeoff here, it > becomes obvious that neither the extreme setup for performance (copy the > whole image upfront) nor the extreme setup for sparseness (COW on a > sector level) are the right default for the average case, nor is > optimising one-sidedly a good idea. It is good if we can provide > solutions for extreme cases, but by default we need to cater for the > average case, which cares both about reasonable performance and disk > usage. Yes, I agree. But 64 KB cluster size by default for big images (not for backup!) is another extreme ;) Who will care about a few MB with a 1 TB or 10 TB image? Please note that 1 MB is a better default block size, as at that size a random write costs about the same as a sequential write on non-SSD disks. Den ^ permalink raw reply [flat|nested] 64+ messages in thread
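Denis's rule of thumb about 1 MB blocks on rotating disks can be sanity-checked with back-of-the-envelope arithmetic. The seek time and transfer rate below are assumed typical HDD figures, not measurements:

```python
SEEK_MS = 8.0            # assumed average seek + rotational latency
XFER_MB_PER_S = 150.0    # assumed sequential transfer rate

def random_write_slowdown(block_bytes):
    """Random/sequential time ratio when each random block pays one seek
    on top of the pure transfer time."""
    xfer_ms = block_bytes / (XFER_MB_PER_S * 1e6) * 1000
    return (SEEK_MS + xfer_ms) / xfer_ms

print(round(random_write_slowdown(64 * 1024), 1))    # ~19x slower at 64 KiB
print(round(random_write_slowdown(1024 * 1024), 1))  # ~2x at 1 MiB
```

Under these assumed figures, 64 KiB random writes are dominated by seeks, while at 1 MiB the transfer time is comparable to the seek cost — which is the sense in which random and sequential writes converge at around that block size.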
* Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation 2017-04-14 1:06 ` [Qemu-devel] [Qemu-block] " John Snow 2017-04-14 4:17 ` Denis V. Lunev @ 2017-04-14 7:40 ` Roman Kagan 1 sibling, 0 replies; 64+ messages in thread From: Roman Kagan @ 2017-04-14 7:40 UTC (permalink / raw) To: John Snow Cc: Denis V. Lunev, Kevin Wolf, Stefan Hajnoczi, qemu-devel, qemu-block, Max Reitz On Thu, Apr 13, 2017 at 09:06:19PM -0400, John Snow wrote: > So if we have a 1MB cluster with 64k subclusters as a hypothetical, if > we write just the first subcluster, we'll have a map like: > > X--------------- > > Whatever actually happens to exist in this space, whether it be a hole > we punched via fallocate or literal zeroes, this space is known to the > filesystem to be contiguous. > > If we write to the last subcluster, we'll get: > > X--------------X > > And again, maybe the dashes are a fallocate hole, maybe they're zeroes. > but the last subcluster is located virtually exactly 15 subclusters > behind the first, they're not physically contiguous. We've saved the > space between them. Future out-of-order writes won't contribute to any > fragmentation, at least at this level. Yeah I think this is where the confusion lies. You apparently assume that the filesystem is smart enough to compensate for the subclusters being sparse within a cluster, and will make them eventually contiguous on the *media* once they are all written. Denis is claiming the opposite. I posted a simple experiment with a 64kB sparse file written out of order which ended up being 16 disparate blocks on the platters (ext4; with xfs this may be different), and this is obviously detrimental for performance with rotating disks. 
Note also that if the filesystem actually is smart enough to keep the subclusters contiguous even when written out of order, apparently by not allowing blocks from other files to take the yet unused space between sparse subclusters, the disk space saving becomes less obvious. > You might be able to reduce COW from 5 IOPs to 3 IOPs, but if we tune > the subclusters right, we'll have *zero*, won't we? Right, this is an attractive advantage. We need to test whether later access to such interleaved clusters is degraded, though. Roman. ^ permalink raw reply [flat|nested] 64+ messages in thread
end of thread, other threads:[~2017-04-24 12:47 UTC | newest] Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-04-06 15:01 [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation Alberto Garcia 2017-04-06 16:40 ` Eric Blake 2017-04-07 8:49 ` Alberto Garcia 2017-04-07 12:41 ` Kevin Wolf 2017-04-07 14:24 ` Alberto Garcia 2017-04-21 21:09 ` [Qemu-devel] proposed qcow2 extension: cluster reservations [was: " Eric Blake 2017-04-22 17:56 ` Max Reitz 2017-04-24 11:45 ` Kevin Wolf 2017-04-24 12:46 ` Alberto Garcia 2017-04-07 12:20 ` [Qemu-devel] " Stefan Hajnoczi 2017-04-07 12:24 ` Alberto Garcia 2017-04-07 13:01 ` Kevin Wolf 2017-04-10 15:32 ` Stefan Hajnoczi 2017-04-07 17:10 ` Max Reitz 2017-04-10 8:42 ` Kevin Wolf 2017-04-10 15:03 ` Max Reitz 2017-04-11 12:56 ` Alberto Garcia 2017-04-11 14:04 ` Max Reitz 2017-04-11 14:31 ` Alberto Garcia 2017-04-11 14:45 ` [Qemu-devel] [Qemu-block] " Eric Blake 2017-04-12 12:41 ` Alberto Garcia 2017-04-12 14:10 ` Max Reitz 2017-04-13 8:05 ` Alberto Garcia 2017-04-13 9:02 ` Kevin Wolf 2017-04-13 9:05 ` Alberto Garcia 2017-04-11 14:49 ` [Qemu-devel] " Kevin Wolf 2017-04-11 14:58 ` Eric Blake 2017-04-11 14:59 ` Max Reitz 2017-04-11 15:08 ` Eric Blake 2017-04-11 15:18 ` Max Reitz 2017-04-11 15:29 ` Kevin Wolf 2017-04-11 15:29 ` Max Reitz 2017-04-11 15:30 ` Eric Blake 2017-04-11 15:34 ` Max Reitz 2017-04-12 12:47 ` Alberto Garcia 2017-04-12 16:54 ` Denis V. Lunev 2017-04-13 11:58 ` Alberto Garcia 2017-04-13 12:44 ` Denis V. Lunev 2017-04-13 13:05 ` Kevin Wolf 2017-04-13 13:09 ` Denis V. Lunev 2017-04-13 13:36 ` Alberto Garcia 2017-04-13 14:06 ` Denis V. Lunev 2017-04-13 13:21 ` Alberto Garcia 2017-04-13 13:30 ` Denis V. Lunev 2017-04-13 13:59 ` Kevin Wolf 2017-04-13 15:04 ` Alberto Garcia 2017-04-13 15:17 ` Denis V. Lunev 2017-04-18 11:52 ` Alberto Garcia 2017-04-18 17:27 ` Denis V. 
Lunev 2017-04-13 13:51 ` Kevin Wolf 2017-04-13 14:15 ` Alberto Garcia 2017-04-13 14:27 ` Kevin Wolf 2017-04-13 16:42 ` [Qemu-devel] [Qemu-block] " Roman Kagan 2017-04-13 14:42 ` [Qemu-devel] " Denis V. Lunev 2017-04-12 17:55 ` Denis V. Lunev 2017-04-12 18:20 ` Eric Blake 2017-04-12 19:02 ` Denis V. Lunev 2017-04-13 9:44 ` Kevin Wolf 2017-04-13 10:19 ` Denis V. Lunev 2017-04-14 1:06 ` [Qemu-devel] [Qemu-block] " John Snow 2017-04-14 4:17 ` Denis V. Lunev 2017-04-18 11:22 ` Kevin Wolf 2017-04-18 17:30 ` Denis V. Lunev 2017-04-14 7:40 ` Roman Kagan