All of lore.kernel.org
 help / color / mirror / Atom feed
From: Max Reitz <mreitz@redhat.com>
To: Eric Blake <eblake@redhat.com>, Alberto Garcia <berto@igalia.com>,
	qemu-devel@nongnu.org
Cc: Kevin Wolf <kwolf@redhat.com>,
	qemu-block@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>
Subject: Re: [Qemu-devel] proposed qcow2 extension: cluster reservations [was: [RFC] Proposed qcow2 extension: subcluster allocation
Date: Sat, 22 Apr 2017 19:56:57 +0200	[thread overview]
Message-ID: <1877dc76-87e0-650c-d76a-44e6e7754a2b@redhat.com> (raw)
In-Reply-To: <382b0707-233d-ab05-e054-3da43617460d@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 5565 bytes --]

On 21.04.2017 23:09, Eric Blake wrote:
> On 04/06/2017 11:40 AM, Eric Blake wrote:
> 
>>> === Changes to the on-disk format ===
>>>
>>> The qcow2 on-disk format needs to change so each L2 entry has a bitmap
>>> indicating the allocation status of each subcluster. There are three
>>> possible states (unallocated, allocated, all zeroes), so we need two
>>> bits per subcluster.
>>
>> You also have to add a new incompatible feature bit, so that older tools
>> know they can't read the new image correctly, and therefore don't
>> accidentally corrupt it.
> 
> As long as we are talking about incompatible feature bits, I had some
> thoughts about image mapping today.
> 
> tl;dr: summary> As long as we are considering incompatible features, maybe we should
> make it easier to have an image file that explicitly preserves
> guest=>host mapping, such that nothing the guest can do will reorder the
> mapping.  This way, it would be possible to fully preallocate an image
> such that all guest offsets are adjusted by a constant offset to become
> the corresponding host offset (basically, no qcow2 metadata is
> interleaved in the middle of guest data).
> 
> I don't see any way to do it on current qcow2 images, but with
> subclusters, you get it for free by having a cluster with an offset but
> with all subclusters marked as unallocated.  But perhaps it is something
> orthogonal enough that we would want a separate incompatible feature bit
> for just this, without having subclusters at the same time.
> 
> In the process of exploring the topic, I expose a couple of existing
> bugs in our qcow2 handling.
> 
> ========
> 
> Longer version:
> 
> If I'm reading qcow2-clusters.c and qcow2-refcount.c correctly, our
> current implementation of bdrv_discard() says that except for clusters
> already marked QCOW2_CLUSTER_ZERO, we will unconditionally remove the L2
> mapping of the address.

As I've said, I think the ZERO bit is just a bug and we should free
preallocated zero clusters.

>                          Whether we actually punch a hole in the
> underlying image, or merely add it to a list of free clusters for use in
> subsequent allocations, is later decided based on
> s->discard_passthrough[type] (that is, the caller can pass different
> type levels that control whether to never punch, always punch, or let
> the blockdev parameters of the caller control whether to punch).
> 
> Presumably, if I've preallocated my image because I want to guarantee
> enough host space, then I would use blockdev parameters that ensure that
> guest actions never punch a hole.  But based on my reading, I still have
> the situation that if I initially preallocated things so that guest
> cluster 0 and 1 are consecutive clusters in the host, and the guest
> triggers bdrv_pdiscard() over both clusters, then writes to cluster 1
> before cluster 0, then even though I'm not changing the underlying
> allocation of the host file, I _am_ resulting in fragmentation in the
> qcow2 file, where cluster 1 in the guest now falls prior to cluster 0.

[...]

> But if we can preserve mappings of clusters that are explicitly marked
> zero, I started to wonder if it would also be reasonable to have a mode
> where we could ALWAYS preserve mappings.  Adding another bit that says
> that a cluster has a reserved mapping, but still defers to the backing
> file for its current data, would let us extend the existing behavior of
> map-preservation on write zeros to work with ALL writes, when writing to
> a fully pre-allocated image.

Yes, and it also means that we may want to think about implementation a
preallocation mode in qemu which puts all of the data into a single
consecutive chunk (as you have hinted at somewhere above).

> When I chatted with Max on IRC about the idea, we said this:
> 
> 
> <mreitz> I mean, sure, we can add both, but I'd still want them two be
> two incompatible bits
> <eblake> if you want the features to be orthogonal (with exponentially
> more cases to check), then having multiple incompatible bits is okay
> <mreitz> Because FEATURE_BIT_UNALLOCATED_AND_SUBCLUSTERS sounds weird
> and FEATURE_BIT_EXTENDED_L2_ENTRIES a bit pretentious
> <mreitz> Well, orthogonal is a good question. If we want to have an
> UNALLOCATED flag we should think so before adding subclusters
> <mreitz> Because then we should at least make clear that the ZERO flag
> for a subcluster requires the ALLOCATED flag to be set or something
> <mreitz> So we can reserve ZERO/!ALLOCATED for the case where you want
> to fall through to the backing file
> 
> So, if you got this far in reading, the question becomes whether having
> a mode where you can mark a cluster as mapping-reserved-but-unallocated
> has enough use case to be worth pursuing, knowing that it will burn an
> incompatible feature bit; or if it should be rolled into the subcluster
> proposal, or whether it's not a feature that anyone needs after all.

I just forgot that just saying !ALLOCATED will be enough, regardless of
the ZERO flag... Yeah, subclusters will give us this for free, and I
think it's therefore reasonable to just require them if you want this
feature (preallocation with a backing file).

> And meanwhile, it looks like I have some patches to propose (and
> qemu-iotests to write) if I can help fix the bugs I've pointed out.

You mean these?
https://github.com/XanClic/qemu/commits/random-qcow2-stuff-v1

;-)

I'll send them once I've written tests.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 512 bytes --]

  reply	other threads:[~2017-04-22 17:57 UTC|newest]

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-06 15:01 [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation Alberto Garcia
2017-04-06 16:40 ` Eric Blake
2017-04-07  8:49   ` Alberto Garcia
2017-04-07 12:41   ` Kevin Wolf
2017-04-07 14:24     ` Alberto Garcia
2017-04-21 21:09   ` [Qemu-devel] proposed qcow2 extension: cluster reservations [was: " Eric Blake
2017-04-22 17:56     ` Max Reitz [this message]
2017-04-24 11:45       ` Kevin Wolf
2017-04-24 12:46       ` Alberto Garcia
2017-04-07 12:20 ` [Qemu-devel] " Stefan Hajnoczi
2017-04-07 12:24   ` Alberto Garcia
2017-04-07 13:01   ` Kevin Wolf
2017-04-10 15:32     ` Stefan Hajnoczi
2017-04-07 17:10 ` Max Reitz
2017-04-10  8:42   ` Kevin Wolf
2017-04-10 15:03     ` Max Reitz
2017-04-11 12:56   ` Alberto Garcia
2017-04-11 14:04     ` Max Reitz
2017-04-11 14:31       ` Alberto Garcia
2017-04-11 14:45         ` [Qemu-devel] [Qemu-block] " Eric Blake
2017-04-12 12:41           ` Alberto Garcia
2017-04-12 14:10             ` Max Reitz
2017-04-13  8:05               ` Alberto Garcia
2017-04-13  9:02                 ` Kevin Wolf
2017-04-13  9:05                   ` Alberto Garcia
2017-04-11 14:49         ` [Qemu-devel] " Kevin Wolf
2017-04-11 14:58           ` Eric Blake
2017-04-11 14:59           ` Max Reitz
2017-04-11 15:08             ` Eric Blake
2017-04-11 15:18               ` Max Reitz
2017-04-11 15:29                 ` Kevin Wolf
2017-04-11 15:29                   ` Max Reitz
2017-04-11 15:30                 ` Eric Blake
2017-04-11 15:34                   ` Max Reitz
2017-04-12 12:47           ` Alberto Garcia
2017-04-12 16:54 ` Denis V. Lunev
2017-04-13 11:58   ` Alberto Garcia
2017-04-13 12:44     ` Denis V. Lunev
2017-04-13 13:05       ` Kevin Wolf
2017-04-13 13:09         ` Denis V. Lunev
2017-04-13 13:36           ` Alberto Garcia
2017-04-13 14:06             ` Denis V. Lunev
2017-04-13 13:21       ` Alberto Garcia
2017-04-13 13:30         ` Denis V. Lunev
2017-04-13 13:59           ` Kevin Wolf
2017-04-13 15:04           ` Alberto Garcia
2017-04-13 15:17             ` Denis V. Lunev
2017-04-18 11:52               ` Alberto Garcia
2017-04-18 17:27                 ` Denis V. Lunev
2017-04-13 13:51         ` Kevin Wolf
2017-04-13 14:15           ` Alberto Garcia
2017-04-13 14:27             ` Kevin Wolf
2017-04-13 16:42               ` [Qemu-devel] [Qemu-block] " Roman Kagan
2017-04-13 14:42           ` [Qemu-devel] " Denis V. Lunev
2017-04-12 17:55 ` Denis V. Lunev
2017-04-12 18:20   ` Eric Blake
2017-04-12 19:02     ` Denis V. Lunev
2017-04-13  9:44       ` Kevin Wolf
2017-04-13 10:19         ` Denis V. Lunev
2017-04-14  1:06           ` [Qemu-devel] [Qemu-block] " John Snow
2017-04-14  4:17             ` Denis V. Lunev
2017-04-18 11:22               ` Kevin Wolf
2017-04-18 17:30                 ` Denis V. Lunev
2017-04-14  7:40             ` Roman Kagan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1877dc76-87e0-650c-d76a-44e6e7754a2b@redhat.com \
    --to=mreitz@redhat.com \
    --cc=berto@igalia.com \
    --cc=eblake@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.