On 21.04.2017 23:09, Eric Blake wrote: > On 04/06/2017 11:40 AM, Eric Blake wrote: > >>> === Changes to the on-disk format === >>> >>> The qcow2 on-disk format needs to change so each L2 entry has a bitmap >>> indicating the allocation status of each subcluster. There are three >>> possible states (unallocated, allocated, all zeroes), so we need two >>> bits per subcluster. >> >> You also have to add a new incompatible feature bit, so that older tools >> know they can't read the new image correctly, and therefore don't >> accidentally corrupt it. > > As long as we are talking about incompatible feature bits, I had some > thoughts about image mapping today. > > tl;dr: summary> As long as we are considering incompatible features, maybe we should > make it easier to have an image file that explicitly preserves > guest=>host mapping, such that nothing the guest can do will reorder the > mapping. This way, it would be possible to fully preallocate an image > such that all guest offsets are adjusted by a constant offset to become > the corresponding host offset (basically, no qcow2 metadata is > interleaved in the middle of guest data). > > I don't see any way to do it on current qcow2 images, but with > subclusters, you get it for free by having a cluster with an offset but > with all subclusters marked as unallocated. But perhaps it is something > orthogonal enough that we would want a separate incompatible feature bit > for just this, without having subclusters at the same time. > > In the process of exploring the topic, I expose a couple of existing > bugs in our qcow2 handling. > > ======== > > Longer version: > > If I'm reading qcow2-clusters.c and qcow2-refcount.c correctly, our > current implementation of bdrv_discard() says that except for clusters > already marked QCOW2_CLUSTER_ZERO, we will unconditionally remove the L2 > mapping of the address. As I've said, I think the ZERO bit is just a bug and we should free preallocated zero clusters. > Whether we actually punch a hole in the > underlying image, or merely add it to a list of free clusters for use in > subsequent allocations, is later decided based on > s->discard_passthrough[type] (that is, the caller can pass different > type levels that control whether to never punch, always punch, or let > the blockdev parameters of the caller control whether to punch). > > Presumably, if I've preallocated my image because I want to guarantee > enough host space, then I would use blockdev parameters that ensure that > guest actions never punch a hole. But based on my reading, I still have > the situation that if I initially preallocated things so that guest > cluster 0 and 1 are consecutive clusters in the host, and the guest > triggers bdrv_pdiscard() over both clusters, then writes to cluster 1 > before cluster 0, then even though I'm not changing the underlying > allocation of the host file, I _am_ resulting in fragmentation in the > qcow2 file, where cluster 1 in the guest now falls prior to cluster 0. [...] > But if we can preserve mappings of clusters that are explicitly marked > zero, I started to wonder if it would also be reasonable to have a mode > where we could ALWAYS preserve mappings. Adding another bit that says > that a cluster has a reserved mapping, but still defers to the backing > file for its current data, would let us extend the existing behavior of > map-preservation on write zeros to work with ALL writes, when writing to > a fully pre-allocated image. Yes, and it also means that we may want to think about implementation a preallocation mode in qemu which puts all of the data into a single consecutive chunk (as you have hinted at somewhere above). > When I chatted with Max on IRC about the idea, we said this: > > > I mean, sure, we can add both, but I'd still want them two be > two incompatible bits > if you want the features to be orthogonal (with exponentially > more cases to check), then having multiple incompatible bits is okay > Because FEATURE_BIT_UNALLOCATED_AND_SUBCLUSTERS sounds weird > and FEATURE_BIT_EXTENDED_L2_ENTRIES a bit pretentious > Well, orthogonal is a good question. If we want to have an > UNALLOCATED flag we should think so before adding subclusters > Because then we should at least make clear that the ZERO flag > for a subcluster requires the ALLOCATED flag to be set or something > So we can reserve ZERO/!ALLOCATED for the case where you want > to fall through to the backing file > > So, if you got this far in reading, the question becomes whether having > a mode where you can mark a cluster as mapping-reserved-but-unallocated > has enough use case to be worth pursuing, knowing that it will burn an > incompatible feature bit; or if it should be rolled into the subcluster > proposal, or whether it's not a feature that anyone needs after all. I just forgot that just saying !ALLOCATED will be enough, regardless of the ZERO flag... Yeah, subclusters will give us this for free, and I think it's therefore reasonable to just require them if you want this feature (preallocation with a backing file). > And meanwhile, it looks like I have some patches to propose (and > qemu-iotests to write) if I can help fix the bugs I've pointed out. You mean these? https://github.com/XanClic/qemu/commits/random-qcow2-stuff-v1 ;-) I'll send them once I've written tests. Max