From: Eric Blake <eblake@redhat.com>
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	qemu-block@nongnu.org
Cc: kwolf@redhat.com, fam@euphon.net, qemu-devel@nongnu.org,
	mreitz@redhat.com, stefanha@redhat.com, den@openvz.org
Subject: Re: [PATCH v2 1/5] block/io: fix bdrv_co_block_status_above
Date: Tue, 19 May 2020 16:48:14 -0500
Message-ID: <711cc70d-bc12-9fda-b24c-7b3acdd5cb08@redhat.com>
In-Reply-To: <3a66bacf-3462-a82c-c758-730107e75898@virtuozzo.com>
On 5/19/20 4:13 PM, Vladimir Sementsov-Ogievskiy wrote:
> 19.05.2020 23:41, Eric Blake wrote:
>> On 5/19/20 2:54 PM, Vladimir Sementsov-Ogievskiy wrote:
>>> bdrv_co_block_status_above has several problems with handling short
>>> backing files:
>>>
>>> 1. With want_zeros=true, it may return ret with BDRV_BLOCK_ZERO but
>>> without BDRV_BLOCK_ALLOCATED flag, when actually short backing file
>>> which produces these after-EOF zeros is inside requested backing
>>> sequence.
>>
>> That's intentional. That portion of the guest-visible data reads as
>> zero (BDRV_BLOCK_ZERO set) but was NOT read from the top layer, but
>> rather synthesized by the block layer because it derived from the
>> backing file but was beyond EOF of that backing layer
>> (BDRV_BLOCK_ALLOCATED is clear).
>
> Not in top yes. But _inside_ the requested base..top backing-chain-part.
> So it should be considered ALLOCATED, as we should not go to further
> backing.
Yes, I think I figured that out by patch 5.
>
> Assume the following chain:
>
> top    aa--
> middle bb
> base   xxxx
>
> (so, middle is short)
>
> block_status(top, 2) should return ZERO without ALLOCATED, as yes it's
> ZERO and yes, it's from another layer
>
> block_status_above(top, base, 2) should return ZERO with ALLOCATED, as
> it's ZERO, and it's produced inside requested backing-chain-region,
> actually, it's produced because of short middle node. We must report
> ALLOCATED to show that we are not going to read from base.
Yes, that matches my intuition. allocated_above says "where in the
chain did we get the data, since it did not come from top", and the
correct answer is "we got it from middle, due to synthesizing zero
beyond EOF". Okay, with that understanding in place, maybe this patch
is right. But I'll have to revisit it tomorrow with a fresh mind (it's
too late in the day for me to be sure that I'm getting it all straight
right now).
>
>>
>>>
>>> 2. With want_zero=false, it may return pnum=0 prior to actual EOF,
>>> because of EOF of short backing file.
>>
>> Do you have a reproducer for this?
>
> No, I don't have one, but it seems possible at least with
> want_zero=false. I'll think of it tomorrow, too tired now.
>
>> In my experience, this is not possible. Generally, if you request
>> status that overlaps EOF of the backing, you get a response truncated
>> to the end of the backing, and you are then likely to follow up with a
>> subsequent status request starting from the underlying EOF which then
>> sees the desired unallocated zeroes:
>>
>> back     xxxx
>> top      yy------
>> request    ^^^^^^
>> response   ^^
>> request      ^^^^
>> response     ^^^^
If we can come up with a reproducer where allocated_above returns
pnum=0, that would indeed prove my initial hesitation wrong (perhaps by:

back xxxxxxxx
mid1 xxxxxx
mid2 xxxx
mid3 xxxxxx
top  xxxxxxxx

for various different start and base points within the chain?)
>>
>>>
>>> Fix these things, making logic about short backing files clearer.
>>>
>>> Note that 154 output changed, because now bdrv_block_status_above don't
>>
>> doesn't
>>
>>> merge unallocated zeros with zeros after EOF (which are actually
>>> "allocated" in POV of read from backing-chain top) and is_zero() just
>>> don't understand that the whole head or tail is zero. We may update
>>> is_zero to call bdrv_block_status_above several times, or add flag to
>>> bdrv_block_status_above that we are not interested in ALLOCATED flag,
>>> so ranges with different ALLOCATED status may be merged, but actually,
>>> it seems that we'd better don't care about this corner case.
>>
>> This actually sounds like an avoidable regression. :(
>
> I don't see real problem in it. But it seems not hard to avoid it, so I
> will try to.
I guess my real reasoning is: "I spent a lot of time trying to tweak
that test to not lose the fact that the tail of the image reads as
zero". It looks weird if we later resize the image but still have a
glitch in the middle, reporting one non-zero cluster out of a larger
range, all because of the shenanigans that occurred around the tail
prior to resizing.
>>> +++ b/block/io.c
>>> @@ -2461,25 +2461,45 @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
>>> ret = bdrv_co_block_status(p, want_zero, offset, bytes,
>>> pnum, map,
>>> file);
>>> if (ret < 0) {
>>> - break;
>>> + return ret;
>>> }
>>> - if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
>>> + if (*pnum == 0) {
>>> + if (first) {
>>> + return ret;
>>> + }
>>> +
>>> /*
>>> - * Reading beyond the end of the file continues to read
>>> - * zeroes, but we can only widen the result to the
>>> - * unallocated length we learned from an earlier
>>> - * iteration.
>>> + * Reads from bs for the selected region will return zeroes,
>>> + * produced because the current level is short. We should consider
>>> + * it as allocated.
>>
>> Why? If we replaced the backing file to something longer (qemu-img
>> rebase -u), we would WANT to read from the backing file. The only
>> reason we read zero is because the block layer synthesized it _while_
>> deferring to the backing layer, not because it was directly allocated
>> in the top layer.
>
> No, if we replace backing file of the current layer, nothing will
> change, as _this_ layer is short, not the backing. Or which backing file
> do you mean? If you mean current bs, than replacing it doesn't make
> sense in the context, as block_status_above requested the current bs (as
> part of base..top range), not the other one.
Maybe it's just the comment wording that needs help. After reading
through patch 5, it looks like my problem is now coming up with a
comment to the effect of "the top layer deferred to this layer, and
because this layer is short, any zeroes that we synthesize beyond EOF
behave as if they were allocated at this layer".
>
>>
>>> + *
>>> + * TODO: Should we report p as file here?
>>
>> No. Reporting 'file' only makes sense if you can point to an offset
>> within that file that would read the guest-visible data in question -
>> but when the data is synthesized, there is no such offset.
>
> I don't know. It still adds some information about which level is
> responsible for these ZEROES. Kevin argued that it make sense.
It took me a while, but I'm coming around to it: my initial read was
assuming that you were reporting that the tail was being claimed as
allocated by top; but in reality, you are fixing things to claim it as
being allocated by mid. The former is wrong (top did not allocate, it
deferred to mid); but the latter does indeed make sense (reading from
mid ended up synthesizing, which means that our hunt for the data ends
at mid and we never traverse deeper, regardless of whether base may also
have data). But now it's a question of whether the code matches that
textual description, and I'm a bit too fried to answer that question
properly today :)
>>> +++ b/tests/qemu-iotests/154.out
>>> @@ -310,13 +310,13 @@ wrote 512/512 bytes at offset 134217728
>>> 512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>> 2048/2048 bytes allocated at offset 128 MiB
>>> [{ "start": 0, "length": 134217728, "depth": 1, "zero": true, "data": false},
>>> -{ "start": 134217728, "length": 2048, "depth": 0, "zero": true, "data": false}]
>>> +{ "start": 134217728, "length": 2048, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
>>
>> The fact that we no longer see zeroes in the tail of the file makes me
>> think this patch is wrong.
So, if we can avoid that minor regression, and still otherwise report
zeroes as allocated from mid, then I think we'll be on the right track.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org