On 11.04.2017 14:56, Alberto Garcia wrote:
> On Fri 07 Apr 2017 07:10:46 PM CEST, Max Reitz wrote:
>>> === Changes to the on-disk format ===
>>>
>>> The qcow2 on-disk format needs to change so each L2 entry has a bitmap
>>> indicating the allocation status of each subcluster. There are three
>>> possible states (unallocated, allocated, all zeroes), so we need two
>>> bits per subcluster.
>>
>> You don't need two bits, you need log(3) / log(2) = ld(3) ≈ 1.58. You
>> can encode the status of eight subclusters (3^8 = 6561) in 13 bits
>> (ld(6561) ≈ 12.68).
>
> Right, although that would make the encoding more cumbersome to use and
> to debug. Is it worth it?

Probably not, considering this is probably not the way we want to go
anyway.

>> One case I'd be especially interested in is of course 4 kB
>> subclusters for 64 kB clusters (because 4 kB is a usual page size and
>> can be configured to be the block size of a guest device; and because
>> 64 kB simply is the standard cluster size of qcow2 images
>> nowadays[1]...).
>
> I think that we should have at least that, but ideally larger
> cluster-to-subcluster ratios.
>
>> (We could even get one more bit if we had a subcluster-flag, because I
>> guess we can always assume subclustered clusters to have OFLAG_COPIED
>> and be uncompressed. But still, three bits missing.)
>
> Why can we always assume OFLAG_COPIED?

Because partially allocated clusters cannot be used with internal
snapshots, and that is what OFLAG_COPIED is for.

>> Of course, if you'd be willing to give up the all-zeroes state for
>> subclusters, it would be enough...
>
> I still think that it looks like a better idea to allow having more
> subclusters, but giving up the all-zeroes state is a valid
> alternative. Apart from having to overwrite with zeroes when a
> subcluster is discarded, is there anything else that we would miss?

If it's a real discard you can just discard it (which is what we do for
compat=0.10 images anyway); but zero-writes will then have to become
real writes, yes.

>> By the way, if you'd only allow multiple of 1s overhead
>> (i.e. multiples of 32 subclusters), I think (3) would be pretty much
>> the same as (2) if you just always write the subcluster information
>> adjacent to the L2 table. Should be just the same caching-wise and
>> performance-wise.
>
> Then (3) is effectively the same as (2), just that the subcluster
> bitmaps are at the end of the L2 cluster, and not next to each entry.

Exactly. But it's a difference in implementation, as you won't have to
worry about having changed the L2 table layout; maybe that's a benefit.

Max
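
For reference, the ld(3) arithmetic from the start of this mail can be sketched as below. This is only an illustration of base-3 packing, not anything from the qcow2 code or the proposal: the state numbering (0 = unallocated, 1 = allocated, 2 = all zeroes) and the names `bits_needed`, `pack_group` and `unpack_group` are assumptions made up for the example.

```python
# Illustrative sketch only: base-3 packing of three-state subcluster status.
# The state numbering and all function names here are hypothetical.
UNALLOCATED, ALLOCATED, ZERO = 0, 1, 2

GROUP = 8  # eight subclusters per group, as in the 3^8 = 6561 example


def bits_needed(subclusters):
    """Bits required to store `subclusters` three-state values in one integer."""
    # The largest encoded value is 3**n - 1; its bit length is the exact cost.
    return (3 ** subclusters - 1).bit_length()


def pack_group(states):
    """Pack eight states (each 0..2) into one integer below 3**8."""
    if len(states) != GROUP:
        raise ValueError("expected exactly %d states" % GROUP)
    value = 0
    # Walk from the last state to the first so states[0] ends up as the
    # least significant base-3 digit.
    for state in reversed(states):
        if state not in (UNALLOCATED, ALLOCATED, ZERO):
            raise ValueError("state must be 0, 1 or 2")
        value = value * 3 + state
    return value


def unpack_group(value):
    """Inverse of pack_group: recover the eight states from the integer."""
    if not 0 <= value < 3 ** GROUP:
        raise ValueError("value out of range for %d base-3 digits" % GROUP)
    states = []
    for _ in range(GROUP):
        value, digit = divmod(value, 3)
        states.append(digit)
    return states
```

With this, `bits_needed(8)` gives 13 against 16 bits for the two-bits-per-subcluster layout, and `bits_needed(16)` gives 26 against 32 for the 4 kB-in-64 kB case. The price is the one Alberto mentions: reading the state of a single subcluster means a chain of divisions instead of a shift and a mask.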