On 04/06/2017 11:40 AM, Eric Blake wrote:
>> === Changes to the on-disk format ===
>>
>> The qcow2 on-disk format needs to change so each L2 entry has a bitmap
>> indicating the allocation status of each subcluster. There are three
>> possible states (unallocated, allocated, all zeroes), so we need two
>> bits per subcluster.
>
> You also have to add a new incompatible feature bit, so that older tools
> know they can't read the new image correctly, and therefore don't
> accidentally corrupt it.

As long as we are talking about incompatible feature bits, I had some
thoughts about image mapping today.

tl;dr summary:

As long as we are considering incompatible features, maybe we should make
it easier to have an image file that explicitly preserves the guest=>host
mapping, such that nothing the guest can do will reorder the mapping.
That way, it would be possible to fully preallocate an image such that
every guest offset is adjusted by a constant offset to become the
corresponding host offset (basically, no qcow2 metadata is interleaved in
the middle of guest data). I don't see any way to do it on current qcow2
images, but with subclusters you get it for free, by having a cluster
with an offset but with all subclusters marked as unallocated. But
perhaps it is something orthogonal enough that we would want a separate
incompatible feature bit for just this, without requiring subclusters at
the same time. In the process of exploring the topic, I expose a couple
of existing bugs in our qcow2 handling.

========

Longer version:

If I'm reading qcow2-cluster.c and qcow2-refcount.c correctly, our
current implementation of bdrv_discard() says that except for clusters
already marked QCOW2_CLUSTER_ZERO, we will unconditionally remove the L2
mapping of the address. Whether we actually punch a hole in the
underlying image, or merely add it to a list of free clusters for use in
subsequent allocations, is decided later based on
s->discard_passthrough[type] (that is, the caller can pass different type
levels that control whether to never punch, always punch, or let the
blockdev parameters of the caller decide whether to punch).

Presumably, if I've preallocated my image because I want to guarantee
enough host space, then I would use blockdev parameters that ensure that
guest actions never punch a hole. But based on my reading, I still have
the situation that if I initially preallocated things so that guest
clusters 0 and 1 are consecutive clusters in the host, and the guest
triggers bdrv_pdiscard() over both clusters and then writes to cluster 1
before cluster 0, then even though I'm not changing the underlying
allocation of the host file, I _am_ introducing fragmentation in the
qcow2 file, where cluster 1 in the guest now falls prior to cluster 0 in
the host.

Demonstration:

$ qemu-img create -f qcow2 -o cluster_size=1M,preallocation=full file3 10M
Formatting 'file3', fmt=qcow2 size=10485760 encryption=off
cluster_size=1048576 preallocation=full lazy_refcounts=off refcount_bits=16
$ qemu-img map -f qcow2 file3
Offset          Length          Mapped to       File
0               0x900000        0x500000        file3
0x9ff000        0x1000          0xeff000        file3

Oh, that's weird - our "full" preallocation still results in a gap in the
guest mapping (nothing from 0x900000-0x9ff000 appears to be mapped), even
though every offset that IS allocated has a 1:1 mapping (for any given
guest offset, add 5M to get the host offset).
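To make that 1:1 claim concrete: with an image laid out like this,
guest-to-host translation is nothing more than adding a constant, and the
whole point of this mail is that today's qcow2 cannot promise that the
property survives guest activity. A minimal sketch of the arithmetic (not
QEMU code; the 5M constant is simply what this particular image happens
to use, per the "Mapped to" column):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* 0x500000: where guest data starts in the host file for THIS image. */
#define HOST_DATA_START (5 * 1024 * 1024)

static uint64_t guest_to_host(uint64_t guest_offset)
{
    /* Only valid as long as no guest action reorders the L2 mapping. */
    return guest_offset + HOST_DATA_START;
}

int main(void)
{
    printf("guest 0x0      -> host 0x%" PRIx64 "\n", guest_to_host(0));
    printf("guest 0x400000 -> host 0x%" PRIx64 "\n", guest_to_host(0x400000));
    return 0;
}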
Maybe the JSON output will shed more light:

$ qemu-img map -f qcow2 file3 --output=json
[{ "start": 0, "length": 9437184, "depth": 0, "zero": false, "data": true, "offset": 5242880},
{ "start": 9437184, "length": 1044480, "depth": 0, "zero": true, "data": true, "offset": 14680064},
{ "start": 10481664, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": 15724544}]

That's a bit better - the hole that didn't show up is still mapped, it's
just that the human-readable version omitted it because that portion of
the file explicitly reads as zero. But I'm still confused - how can only
part of a 1M cluster read as zero, when our QCOW2_CLUSTER_ZERO flag
applies a cluster at a time?

Oh - it's because we defer to the file system, and according to
lseek(SEEK_HOLE), we actually do have a hole in the file...

$ qemu-img map -f raw file3 --output=json
[{ "start": 0, "length": 5242880, "depth": 0, "zero": false, "data": true, "offset": 0},
{ "start": 5242880, "length": 10481664, "depth": 0, "zero": true, "data": false, "offset": 5242880},
{ "start": 15724544, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": 15724544}]

and reading the L2 table shows that we do NOT have a case of
QCOW2_CLUSTER_ZERO set in the qcow2 image:

$ qemu-io -f raw file3
qemu-io> r -v 4m 128
00400000: 80 00 00 00 00 50 00 00 80 00 00 00 00 60 00 00 .....P..........
00400010: 80 00 00 00 00 70 00 00 80 00 00 00 00 80 00 00 .....p..........
00400020: 80 00 00 00 00 90 00 00 80 00 00 00 00 a0 00 00 ................
00400030: 80 00 00 00 00 b0 00 00 80 00 00 00 00 c0 00 00 ................
00400040: 80 00 00 00 00 d0 00 00 80 00 00 00 00 e0 00 00 ................
00400050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00400060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00400070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
read 128/128 bytes at offset 4194304
128 bytes, 1 ops; 0.0001 sec (905.797 KiB/sec and 7246.3768 ops/sec)
qemu-io> q

But it's still sad that our request for preallocation=full didn't
actually work on the underlying file (we have a hole that we didn't
want). Oh well, yet another thing to fix someday.

Anyways, now have the guest do a discard and subsequent re-write:

$ qemu-io -f qcow2 file3
qemu-io> discard 0 2m
discard 2097152/2097152 bytes at offset 0
2 MiB, 1 ops; 0.0014 sec (1.395 GiB/sec and 714.2857 ops/sec)
qemu-io> w 1m 1m
wrote 1048576/1048576 bytes at offset 1048576
1 MiB, 1 ops; 0.0465 sec (21.487 MiB/sec and 21.4869 ops/sec)
qemu-io> w 0 1m
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0327 sec (30.497 MiB/sec and 30.4971 ops/sec)
qemu-io> q

$ qemu-img map -f qcow2 file3
Offset          Length          Mapped to       File
0               0x100000        0x600000        file3
0x100000        0x100000        0x500000        file3
0x200000        0x700000        0x700000        file3
0x9ff000        0x1000          0xeff000        file3

We haven't consumed any new host space (good), but we HAVE fragmented the
guest image (clusters 0 and 1 have now swapped places; we no longer have
contiguous guest=>host translations).

But watch what happens with explicit zero clusters:

$ qemu-io -f qcow2 file3
qemu-io> w -z 3m 2m
wrote 2097152/2097152 bytes at offset 3145728
2 MiB, 1 ops; 0.0136 sec (146.563 MiB/sec and 73.2815 ops/sec)
qemu-io> q
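Reading the r -v dumps by hand gets tedious; each 8-byte big-endian word
is one L2 entry. Here is a quick throwaway decoder, based on the field
layout in the qcow2 on-disk format spec (bit 63 = COPIED, bit 62 =
compressed, bits 9-55 = host cluster offset, bit 0 = reads-as-zero)
rather than on code lifted from the block layer:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define L2E_COPIED      (1ULL << 63)           /* refcount == 1, safe to write in place */
#define L2E_COMPRESSED  (1ULL << 62)
#define L2E_ZERO        (1ULL << 0)            /* cluster reads as all zeroes */
#define L2E_OFFSET_MASK 0x00fffffffffffe00ULL  /* bits 9-55: host cluster offset */

static void decode(uint64_t e)
{
    printf("host offset 0x%" PRIx64 "%s%s%s\n",
           e & L2E_OFFSET_MASK,
           e & L2E_COPIED ? ", copied" : "",
           e & L2E_COMPRESSED ? ", compressed" : "",
           e & L2E_ZERO ? ", reads as zero" : "");
}

int main(void)
{
    decode(0x8000000000500000ULL); /* ordinary allocated cluster */
    decode(0x8000000000800001ULL); /* allocated host cluster, zero flag set */
    return 0;
}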
Per the L2 map, we have now set the explicit zero bit, but maintained the
allocation:

$ qemu-io -f raw file3
qemu-io> r -v 4m 128
00400000: 80 00 00 00 00 60 00 00 80 00 00 00 00 50 00 00 .............P..
00400010: 80 00 00 00 00 70 00 00 80 00 00 00 00 80 00 01 .....p..........
00400020: 80 00 00 00 00 90 00 01 80 00 00 00 00 a0 00 00 ................
00400030: 80 00 00 00 00 b0 00 00 80 00 00 00 00 c0 00 00 ................
00400040: 80 00 00 00 00 d0 00 00 80 00 00 00 00 e0 00 00 ................
00400050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00400060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00400070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
read 128/128 bytes at offset 4194304
128 bytes, 1 ops; 0.0001 sec (1.197 MiB/sec and 9803.9216 ops/sec)
qemu-io> q

And with that allocation in place, discarding and then rewriting zeroes
does NOT lose the allocation:

$ qemu-io -f qcow2 file3
qemu-io> discard 3m 2m
discard 2097152/2097152 bytes at offset 3145728
2 MiB, 1 ops; 0.0007 sec (2.622 GiB/sec and 1342.2819 ops/sec)
qemu-io> w -z 4m 1m
wrote 1048576/1048576 bytes at offset 4194304
1 MiB, 1 ops; 0.0103 sec (97.003 MiB/sec and 97.0026 ops/sec)
qemu-io> w -z 3m 1m
wrote 1048576/1048576 bytes at offset 3145728
1 MiB, 1 ops; 0.0095 sec (104.319 MiB/sec and 104.3188 ops/sec)
qemu-io> q

$ qemu-io -f raw file3
qemu-io> r -v 4m 128
00400000: 80 00 00 00 00 60 00 00 80 00 00 00 00 50 00 00 .............P..
00400010: 80 00 00 00 00 70 00 00 80 00 00 00 00 80 00 01 .....p..........
00400020: 80 00 00 00 00 90 00 01 80 00 00 00 00 a0 00 00 ................
00400030: 80 00 00 00 00 b0 00 00 80 00 00 00 00 c0 00 00 ................
00400040: 80 00 00 00 00 d0 00 00 80 00 00 00 00 e0 00 00 ................
00400050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00400060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00400070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
read 128/128 bytes at offset 4194304
128 bytes, 1 ops; 0.0001 sec (1.185 MiB/sec and 9708.7379 ops/sec)
qemu-io>

Clusters 3 and 4 are still mapped to host 8 and 9, as they were
originally. But what happens when we try to write data there:

$ qemu-io -f qcow2 file3
qemu-io> w 4m 1m
wrote 1048576/1048576 bytes at offset 4194304
1 MiB, 1 ops; 0.0402 sec (24.830 MiB/sec and 24.8299 ops/sec)
qemu-io> q

$ qemu-img map -f qcow2 file3
Offset          Length          Mapped to       File
0               0x100000        0x600000        file3
0x100000        0x100000        0x500000        file3
0x200000        0x100000        0x700000        file3
0x400000        0x100000        0xf00000        file3
0x500000        0x400000        0xa00000        file3
0x9ff000        0x1000          0xeff000        file3

Ouch - that cluster write IGNORED the fact that we already had a
reservation at host 8, and instead changed the mapping so that guest 4
now maps to host f, making our refcount table larger and fragmenting
things. So much for preallocation saving the day :(

So, we need a bug fix such that allocating a cluster that replaces an
allocated-but-zero cluster reuses the existing allocation, rather than
creating a new mapping. [Oh, and how will all of this interact with
refcounts > 1, when internal snapshots are in use? Ugh - I don't know
that we want to go there.]

But if we can preserve mappings of clusters that are explicitly marked
zero, I started to wonder if it would also be reasonable to have a mode
where we could ALWAYS preserve mappings. Adding another bit that says
that a cluster has a reserved mapping, but still defers to the backing
file for its current data, would let us extend the existing behavior of
map-preservation on write-zeroes to work with ALL writes, when writing to
a fully preallocated image.
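Purely as a strawman (none of these names or bit positions exist today;
I'm just grabbing a currently-reserved L2 bit for illustration), the
interesting change would be in the write path's allocation decision:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define L2E_OFFSET_MASK  0x00fffffffffffe00ULL
#define L2E_RESERVED_MAP (1ULL << 57)  /* strawman: mapping reserved, data still in backing file */

/*
 * Hypothetical decision for where a guest write should land when the
 * cluster currently has no data of its own.
 */
static uint64_t host_offset_for_write(uint64_t l2_entry, uint64_t next_free_cluster)
{
    if (l2_entry & L2E_RESERVED_MAP) {
        /* Reuse the reserved host cluster; the guest=>host mapping never moves. */
        return l2_entry & L2E_OFFSET_MASK;
    }
    /* Today's behavior: take whatever free cluster the allocator hands back. */
    return next_free_cluster;
}

int main(void)
{
    uint64_t reserved = 0x800000ULL | L2E_RESERVED_MAP;  /* preallocated at host 0x800000 */
    printf("write lands at 0x%" PRIx64 "\n", host_offset_for_write(reserved, 0xf00000));
    return 0;
}

Discard would then clear any data/zero state but leave the reserved
offset in place, generalizing the map-preserving behavior that the
write-zeroes experiment above already demonstrates.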
Such a bit is incompatible (existing qemu would not know how to present
correct data to the guest when the bit is set), so if we want it, we have
to burn an incompatible feature bit. The question is whether this feature
is independently useful enough to do it orthogonally from subclusters, or
whether we want to minimize the exponential explosion of orthogonal bits
by only allowing the feature when subclusters are also enabled. And in a
way, subclusters already provide this feature: any time you map a cluster
to an address, but have ALL of its subclusters marked as unallocated
(read from backing file), you have preallocated a mapping that you can
later use when writing to that cluster, and that preallocation is no
longer tied to whether the guest uses write-zeroes.

When I chatted with Max on IRC about the idea, we said this:

  I mean, sure, we can add both, but I'd still want them to be two
  incompatible bits
  if you want the features to be orthogonal (with exponentially more
  cases to check), then having multiple incompatible bits is okay
  Because FEATURE_BIT_UNALLOCATED_AND_SUBCLUSTERS sounds weird and
  FEATURE_BIT_EXTENDED_L2_ENTRIES a bit pretentious
  Well, orthogonal is a good question. If we want to have an UNALLOCATED
  flag we should think so before adding subclusters
  Because then we should at least make clear that the ZERO flag for a
  subcluster requires the ALLOCATED flag to be set or something
  So we can reserve ZERO/!ALLOCATED for the case where you want to fall
  through to the backing file

So, if you got this far in reading, the question becomes whether having a
mode where you can mark a cluster as mapping-reserved-but-unallocated has
enough of a use case to be worth pursuing, knowing that it will burn an
incompatible feature bit; or whether it should be rolled into the
subcluster proposal; or whether it's not a feature that anyone needs
after all. And meanwhile, it looks like I have some patches to propose
(and qemu-iotests to write) to help fix the bugs I've pointed out.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org