Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize

From: Max Reitz <mreitz@redhat.com>
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	qemu-block@nongnu.org
Cc: Kevin Wolf <kwolf@redhat.com>,
	Anton Nefedov <anton.nefedov@virtuozzo.com>,
	qemu-devel@nongnu.org, qemu-stable@nongnu.org,
	Stefan Hajnoczi <stefanha@redhat.com>,
	"Denis V . Lunev" <den@openvz.org>
Subject: Re: [PATCH for-4.2 v2 3/3] block/file-posix: Let post-EOF fallocate serialize
Date: Tue, 2 Jun 2020 17:46:37 +0200	[thread overview]
Message-ID: <50115120-9d1a-79f7-64f4-cd45508c0e7c@redhat.com> (raw)
In-Reply-To: <dfe5fbff-ce04-504e-542b-11095a57fd78@virtuozzo.com>

[-- Attachment #1.1: Type: text/plain, Size: 4956 bytes --]

On 02.06.20 16:43, Vladimir Sementsov-Ogievskiy wrote:
> 01.11.2019 18:25, Max Reitz wrote:
> 
> Sorry for being late, I have some comments

Uh, well.  Reasonable, but I hope you don’t mind me having no longer
having this patch fresh on my mind.

>> The XFS kernel driver has a bug that may cause data corruption for qcow2
>> images as of qemu commit c8bb23cbdbe32f.  We can work around it by
>> treating post-EOF fallocates as serializing up until infinity (INT64_MAX
>> in practice).
>>
>> Cc: qemu-stable@nongnu.org
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   block/file-posix.c | 36 ++++++++++++++++++++++++++++++++++++
>>   1 file changed, 36 insertions(+)
>>
>> diff --git a/block/file-posix.c b/block/file-posix.c
>> index 0b7e904d48..1f0f61a02b 100644
>> --- a/block/file-posix.c
>> +++ b/block/file-posix.c
>> @@ -2721,6 +2721,42 @@ raw_do_pwrite_zeroes(BlockDriverState *bs,
>> int64_t offset, int bytes,
>>       RawPosixAIOData acb;
>>       ThreadPoolFunc *handler;
>>   +#ifdef CONFIG_FALLOCATE
>> +    if (offset + bytes > bs->total_sectors * BDRV_SECTOR_SIZE) {
>> +        BdrvTrackedRequest *req;
>> +        uint64_t end;
>> +
>> +        /*
>> +         * This is a workaround for a bug in the Linux XFS driver,
>> +         * where writes submitted through the AIO interface will be
>> +         * discarded if they happen beyond a concurrently running
>> +         * fallocate() that increases the file length (i.e., both the
>> +         * write and the fallocate() happen beyond the EOF).
>> +         *
>> +         * To work around it, we extend the tracked request for this
>> +         * zero write until INT64_MAX (effectively infinity), and mark
>> +         * it as serializing.
>> +         *
>> +         * We have to enable this workaround for all filesystems and
>> +         * AIO modes (not just XFS with aio=native), because for
>> +         * remote filesystems we do not know the host configuration.
>> +         */
>> +
>> +        req = bdrv_co_get_self_request(bs);
>> +        assert(req);
>> +        assert(req->type == BDRV_TRACKED_WRITE);
>> +        assert(req->offset <= offset);
>> +        assert(req->offset + req->bytes >= offset + bytes);
> 
> Why these assertions?

Mostly to see that bdrv_co_get_self_request() (introduced by the same
series) actually got the right request.  (I suppose.)

> TrackedRequest offset and bytes fields correspond
> to the original request. When request is being expanded to satisfy
> request_alignment, these fields are not updated.

Well, shrunk in this case, but OK.

> So, maybe, we should assert overlap_offset and overlap_bytes?

Maybe, but would that have any benefit?  Especially after this patch
having been in qemu for over half a year?

(Also, intuitively off the top of my head I don’t see how it would make
more sense to check overlap_offset and overlap_bytes, if all the
assertions are for is to see that we got the right request.
overlap_offset and overlap_bytes may still not exactly match @offset or
@bytes, respectively.)

Your suggestion makes it sound a bit like you have a different purpose
in mind what these assertions might be useful for...?

>> +
>> +        end = INT64_MAX & -(uint64_t)bs->bl.request_alignment;
>> +        req->bytes = end - req->offset;
> 
> And I doubt that we should update req->bytes. We never updated it in
> other places, it corresponds to original request. It's enough to update
> overlap_bytes to achieve corresponding serialising.

Does it hurt?  If so, would you send a patch?

I assume you reply to this patch instead of writing a patch because you
have the same feeling of “It probably doesn’t really matter, so let’s
have a discussion first”.

My stance is: I don’t think it matters and this whole piece of code is a
hack that shouldn’t exist, obviously.  So I don’t really care how it
fits into all of our other code.

I would like to say I wouldn’t mind a patch to drop the req->bytes
assignment, but OTOH it would mean I’d have to review it and verify that
it’s indeed sufficient to set overlap_bytes.

If it’s in any way inconvenient for you that req->bytes is adjusted,
then of course please send one.

>> +        req->overlap_bytes = req->bytes;
>> +
>> +        bdrv_mark_request_serialising(req, bs->bl.request_alignment);
> 
> Not sure, how much should we care about request_alignment here, I think,
> it's enough to just set req->overlap_bytes = INT64_MAX -
> req->overlap_offest, but it doesn't really matter.

As long as req->bytes is adjusted, we have to care, or the overlap_bytes
calculation in bdrv_mark_request_serialising will overflow.

Well, one could argue that it doesn’t matter because the MAX() will
still do the right thing, but overflowing is never nice.

(Of course, it probably doesn’t matter at all if we just wouldn’t touch
req->bytes.)

Max

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]