Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation

From: Alberto Garcia <berto@igalia.com>
To: Kevin Wolf <kwolf@redhat.com>, Eric Blake <eblake@redhat.com>
Cc: qemu-devel@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>,
	qemu-block@nongnu.org, Max Reitz <mreitz@redhat.com>
Subject: Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
Date: Fri, 07 Apr 2017 16:24:44 +0200
Message-ID: <w514ly0e1n7.fsf@maestria.local.igalia.com>
In-Reply-To: <20170407124121.GC4716@noname.redhat.com>

On Fri 07 Apr 2017 02:41:21 PM CEST, Kevin Wolf <kwolf@redhat.com> wrote:
>> 63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
>> 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
>> **<----> <-----------------------------------------------><---------->*
>>   Rsrved              host cluster offset of data             Reserved
>>   (6 bits)                (44 bits)                           (11 bits)
>> 
>> where you have 17 bits plus the "all zeroes" bit to play with, thanks to
>> the three bits of host cluster offset that are now guaranteed to be zero
>> due to cluster size alignment (but you're also right that the "all
>> zeroes" bit is now redundant information with the 8 subcluster-is-zero
>> bits, so repurposing it does not hurt)
>> 
>> > 
>> >     * Pros:
>> >       + Simple. Few changes compared to the current qcow2 format.
>> > 
>> >     * Cons:
>> >       - Only 8 subclusters per cluster. We would not be making the
>> >         most of this feature.
>> > 
>> >       - No reserved bits left for the future.
>> 
>> I just argued you have at least one, and probably 2, bits left over for
>> future in-word expansion.
>
> I think only 8 subclusters is just too few. That the subcluster status
> would be split in two halves doesn't make me like this layout much
> better either.

I also agree that 8 are too few (splitting the subcluster field would
not be strictly necessary, but that's not so important).

>> > (2) Making L2 entries 128-bit wide.
>> > 
>> >     In this alternative we would double the size of L2 entries. The
>> >     first half would remain unchanged and the second one would store
>> >     the bitmap. That would leave us with 32 subclusters per cluster.
>> 
>> Although for smaller cluster sizes (such as 4k clusters), you'd still
>> want to restrict that subclusters are at least 512-byte sectors, so
>> you'd be using fewer than 32 of those subcluster positions until the
>> cluster size is large enough.
>> 
>> > 
>> >     * Pros:
>> >       + More subclusters per cluster. We could have images with
>> >         e.g. 128k clusters with 4k subclusters.
>> 
>> Could allow variable-sized subclusters (your choice of 32 subclusters of
>> 4k each, or 16 subclusters of 8k each)
>
> I don't think using less subclusters is desirable if it doesn't come
> with savings elsewhere. We already need to allocate two clusters for an
> L2 table now, so we want to use it.
>
> The more interesting kind of variable-sized subclusters would be if you
> could select any multiple of 32, meaning three or more clusters per L2
> table (with 192 bits or more per entry).

Yeah, I agree. I think it's worth considering. One more drawback that I
can think of: if we make L2 entries wider and the image has compressed
clusters, we'd be wasting space in their entries.
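
To make the numbers in this part of the discussion concrete, here is a
rough sketch of the arithmetic. It is not QEMU code; the function names
are made up for illustration, and the figure of 2 bitmap bits per
subcluster is only an assumption inferred from "64 extra bits give 32
subclusters". It also assumes an L2 table keeps the same number of
entries as today, so a wider entry means more clusters per table.

```python
SECTOR_SIZE = 512        # smallest subcluster size discussed in the thread
BASE_ENTRY_BITS = 64     # width of a current qcow2 L2 entry
BITS_PER_SUBCLUSTER = 2  # assumption: 64 bitmap bits / 32 subclusters


def usable_subclusters(cluster_size, bitmap_slots=32):
    """Bitmap slots that can be used without going below 512 bytes."""
    return min(bitmap_slots, cluster_size // SECTOR_SIZE)


def subcluster_size(cluster_size, bitmap_slots=32):
    """Size in bytes of one subcluster for the given cluster size."""
    return cluster_size // usable_subclusters(cluster_size, bitmap_slots)


def l2_entry_bits(bitmap_slots=32):
    """Width of an extended L2 entry: unchanged 64 bits plus the bitmap."""
    return BASE_ENTRY_BITS + BITS_PER_SUBCLUSTER * bitmap_slots


def clusters_per_l2_table(bitmap_slots=32):
    """Clusters needed for an L2 table that keeps today's entry count."""
    return l2_entry_bits(bitmap_slots) // BASE_ENTRY_BITS


if __name__ == "__main__":
    # cluster size, usable subclusters, subcluster size
    for cluster_size in (4 * 1024, 64 * 1024, 128 * 1024):
        print(cluster_size,
              usable_subclusters(cluster_size),
              subcluster_size(cluster_size))
    # bitmap slots, entry width in bits, clusters per L2 table
    for slots in (32, 64, 96):
        print(slots, l2_entry_bits(slots), clusters_per_l2_table(slots))
```

Under these assumptions, 128k clusters give 32 subclusters of 4k each,
a 4k cluster can use only 8 of the 32 slots, and 64 slots mean 192-bit
entries and three clusters per L2 table.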

>> >       - One more metadata structure to be updated for each
>> >         allocation. This would probably impact I/O negatively.
>> 
>> Having the subcluster table directly in the L2 means that updating
>> the L2 table is done with a single write. You are definitely right
>> that having the subcluster table as a bitmap in a separate cluster
>> means two writes instead of one, but as always, it's hard to predict
>> how much of an impact that is without benchmarks.
>
> Note that it's not just additional write requests, but that we can't
> update the L2 table entry and the bitmap atomically any more, so we
> have to worry about ordering. The ordering between L2 table and
> refcount blocks is already painful enough, I'm not sure if I would
> want to add a third type. Ordering also means disk flushes, which are
> a lot slower than just additional writes.

You're right, I think you just convinced me that this is a bad idea and
I'm also more inclined towards (2) now.

Berto