From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:41332)
	by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1cwVJC-0006Wu-KV for qemu-devel@nongnu.org;
	Fri, 07 Apr 2017 10:56:19 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from ) id 1cwVJ9-0004D8-Bp for qemu-devel@nongnu.org;
	Fri, 07 Apr 2017 10:56:18 -0400
From: Alberto Garcia
In-Reply-To: <20170407124121.GC4716@noname.redhat.com>
References: <20170406150148.zwjpozqtale44jfh@perseus.local>
	<20170407124121.GC4716@noname.redhat.com>
Date: Fri, 07 Apr 2017 16:24:44 +0200
MIME-Version: 1.0
Content-Type: text/plain
Subject: Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
To: Kevin Wolf , Eric Blake
Cc: qemu-devel@nongnu.org, Stefan Hajnoczi , qemu-block@nongnu.org,
	Max Reitz

On Fri 07 Apr 2017 02:41:21 PM CEST, Kevin Wolf wrote:

>> 63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
>> 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
>> **<----> <-----------------------------------------------><---------->*
>>  Rsrved           host cluster offset of data               Reserved
>> (6 bits)                   (44 bits)                        (11 bits)
>>
>> where you have 17 bits plus the "all zeroes" bit to play with, thanks
>> to the three bits of host cluster offset that are now guaranteed to be
>> zero due to cluster size alignment (but you're also right that the
>> "all zeroes" bit is now redundant information with the 8
>> subcluster-is-zero bits, so repurposing it does not hurt)
>>
>> >
>> > * Pros:
>> >   + Simple. Few changes compared to the current qcow2 format.
>> >
>> > * Cons:
>> >   - Only 8 subclusters per cluster. We would not be making the
>> >     most of this feature.
>> >
>> >   - No reserved bits left for the future.
>>
>> I just argued you have at least one, and probably 2, bits left over
>> for future in-word expansion.
> I think only 8 subclusters is just too few. That the subcluster
> status would be split in two halves doesn't make me like this layout
> much better either.

I also agree that 8 are too few (splitting the subcluster field would
not be strictly necessary, but that's not so important).

>> > (2) Making L2 entries 128-bit wide.
>> >
>> >   In this alternative we would double the size of L2 entries. The
>> >   first half would remain unchanged and the second one would store
>> >   the bitmap. That would leave us with 32 subclusters per cluster.
>>
>> Although for smaller cluster sizes (such as 4k clusters), you'd still
>> want to restrict that subclusters are at least 512-byte sectors, so
>> you'd be using fewer than 32 of those subcluster positions until the
>> cluster size is large enough.
>>
>> >
>> > * Pros:
>> >   + More subclusters per cluster. We could have images with
>> >     e.g. 128k clusters with 4k subclusters.
>>
>> Could allow variable-sized subclusters (your choice of 32 subclusters
>> of 4k each, or 16 subclusters of 8k each)

> I don't think using less subclusters is desirable if it doesn't come
> with savings elsewhere. We already need to allocate two clusters for
> an L2 table now, so we want to use it.
>
> The more interesting kind of variable-sized subclusters would be if
> you could select any multiple of 32, meaning three or more clusters
> per L2 table (with 192 bits or more per entry).

Yeah, I agree. I think it's worth considering.

One more drawback that I can think of: if we make L2 entries wider and
we have compressed clusters, we'd be wasting space in their entries.

>> > - One more metadata structure to be updated for each
>> >   allocation. This would probably impact I/O negatively.
>>
>> Having the subcluster table directly in the L2 means that updating
>> the L2 table is done with a single write.
>> You are definitely right that having the subcluster table as a
>> bitmap in a separate cluster means two writes instead of one, but as
>> always, it's hard to predict how much of an impact that is without
>> benchmarks.

> Note that it's not just additional write requests, but that we can't
> update the L2 table entry and the bitmap atomically any more, so we
> have to worry about ordering. The ordering between L2 table and
> refcount blocks is already painful enough, I'm not sure if I would
> want to add a third type. Ordering also means disk flushes, which are
> a lot slower than just additional writes.

You're right, I think you just convinced me that this is a bad idea
and I'm also more inclined towards (2) now.

Berto