On 11.04.2017 14:56, Alberto Garcia wrote:
> On Fri 07 Apr 2017 07:10:46 PM CEST, Max Reitz wrote:
>>> === Changes to the on-disk format ===
>>>
>>> The qcow2 on-disk format needs to change so each L2 entry has a bitmap
>>> indicating the allocation status of each subcluster. There are three
>>> possible states (unallocated, allocated, all zeroes), so we need two
>>> bits per subcluster.
>>
>> You don't need two bits, you need log(3) / log(2) = ld(3) ≈ 1.58. You
>> can encode the status of eight subclusters (3^8 = 6561) in 13 bits
>> (ld(6561) ≈ 12.68).
> 
> Right, although that would make the encoding more cumbersome to use and
> to debug. Is it worth it?

Probably not, considering this is likely not the way we want to go anyway.
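
For reference, the denser encoding amounts to treating the subcluster states as base-3 digits of one integer. A minimal sketch, assuming a made-up state numbering and hypothetical helper names (none of this exists in the qcow2 code):

```python
# Hypothetical state numbering, for illustration only.
UNALLOCATED, ALLOCATED, ZERO = 0, 1, 2


def bits_needed(n_subclusters):
    """Smallest number of bits that can hold 3**n_subclusters combinations."""
    return (3 ** n_subclusters - 1).bit_length()


def pack(states):
    """Pack a list of ternary states into one integer (states[0] is the lowest digit)."""
    value = 0
    for s in reversed(states):
        if s not in (UNALLOCATED, ALLOCATED, ZERO):
            raise ValueError("invalid subcluster state")
        value = value * 3 + s
    return value


def unpack(value, n_subclusters):
    """Recover the list of ternary states from a packed integer."""
    states = []
    for _ in range(n_subclusters):
        value, s = divmod(value, 3)
        states.append(s)
    return states
```

Looking up a single subcluster then needs a division and a modulo instead of a shift and a mask, which is the "more cumbersome" part.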

>> One case I'd be especially interested in is of course 4 kB
>> subclusters for 64 kB clusters (because 4 kB is a usual page size and
>> can be configured to be the block size of a guest device; and because
>> 64 kB simply is the standard cluster size of qcow2 images
>> nowadays[1]...).
> 
> I think that we should have at least that, but ideally larger
> cluster-to-subcluster ratios.
> 
>> (We could even get one more bit if we had a subcluster-flag, because I
>> guess we can always assume subclustered clusters to have OFLAG_COPIED
>> and be uncompressed. But still, three bits missing.)
> 
> Why can we always assume OFLAG_COPIED?

Because partially allocated clusters cannot be used with internal
snapshots, and that is what OFLAG_COPIED is for.

>> Of course, if you'd be willing to give up the all-zeroes state for
>> subclusters, it would be enough...
> 
> I still think that it looks like a better idea to allow having more
> subclusters, but giving up the all-zeroes state is a valid
> alternative. Apart from having to overwrite with zeroes when a
> subcluster is discarded, is there anything else that we would miss?

If it's a real discard you can just discard it (which is what we do
for compat=0.10 images anyway); but zero-writes will then have to
become real writes, yes.

>> By the way, if you'd only allow multiple of 1s overhead
>> (i.e. multiples of 32 subclusters), I think (3) would be pretty much
>> the same as (2) if you just always write the subcluster information
>> adjacent to the L2 table. Should be just the same caching-wise and
>> performance-wise.
> 
> Then (3) is effectively the same as (2), just that the subcluster
> bitmaps are at the end of the L2 cluster, and not next to each entry.

Exactly. But it's a difference in implementation, as you won't have to
worry about having changed the L2 table layout; maybe that's a benefit.
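
To illustrate the difference, here is a rough sketch of where an entry and its bitmap would live in each layout. It assumes 8-byte L2 entries and one 64-bit bitmap per entry (32 subclusters at two bits each); the function names are hypothetical:

```python
L2_ENTRY_SIZE = 8  # bytes per L2 entry
BITMAP_SIZE = 8    # assumption: 32 subclusters x 2 bits = 64 bits


def interleaved_offsets(index):
    """Bitmap stored right next to its L2 entry: (entry offset, bitmap offset)."""
    stride = L2_ENTRY_SIZE + BITMAP_SIZE
    entry = index * stride
    return entry, entry + L2_ENTRY_SIZE


def trailing_offsets(index, cluster_size):
    """Unchanged L2 entries first, all bitmaps at the end of the same cluster."""
    n_entries = cluster_size // (L2_ENTRY_SIZE + BITMAP_SIZE)
    if not 0 <= index < n_entries:
        raise IndexError("L2 index out of range")
    entry = index * L2_ENTRY_SIZE
    bitmap = n_entries * L2_ENTRY_SIZE + index * BITMAP_SIZE
    return entry, bitmap
```

Either way the entry and its bitmap sit in the same cluster, so one read fetches both; only the second variant keeps the existing entry array as it is today.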

Max