Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)

From: Dominique MARTINET <dominique.martinet@atmark-techno.com>
To: Filipe Manana <fdmanana@kernel.org>
Cc: Nikolay Borisov <nborisov@suse.com>, Jens Axboe <axboe@kernel.dk>,
	io-uring@vger.kernel.org, linux-btrfs@vger.kernel.org
Subject: Re: read corruption with qemu master io_uring engine / linux master / btrfs(?)
Date: Thu, 30 Jun 2022 09:41:01 +0900	[thread overview]
Message-ID: <YrzxHbWCR6zhIAcx@atmark-techno.com> (raw)
In-Reply-To: <20220629153710.GA379981@falcondesktop>

Filipe Manana wrote on Wed, Jun 29, 2022 at 04:37:10PM +0100:
> On Wed, Jun 29, 2022 at 02:14:18PM +0900, Dominique MARTINET wrote:
> >  - qemu short read handling was... rather disappointing.
> > Patch should appear here[1] eventually, but as it seems moderated?
> > [1] https://lore.kernel.org/qemu-devel/20220629044957.1998430-1-dominique.martinet@atmark-techno.com
> 
> Btw, the link doesn't work (at least at the moment):
> 
> "Message-ID <20220629044957.1998430-1-dominique.martinet@atmark-techno.com> not found"

Yes, the submitting a patch documentation[1] mentions the lists are
moderated, so it took a bit of time.
It looks like it went through now, and my understanding is further
mails won't be delayed -- but I'll Cc you on v2 after testing it.

[1] https://www.qemu.org/docs/master/devel/submitting-a-patch.html

> >  - comments there also say short reads should never happen on newer
> > kernels (assuming local filesystems?) -- how true is that? If we're
> > doing our best kernel side to avoid short reads I guess we probably
> > ought to have a look at this.
> 
> Short reads can happen, and an application should deal with it.

I definitely agree with this, qemu must be fixed. I don't think anyone
will argue with that.

> If we look at the man page for read(2):
> 
> "
>        On success, the number of bytes read is returned (zero indicates
>        end of file), and the file position is advanced by this number.
>        It is not an error if this number is smaller than the number of
>        bytes requested; this may happen for example because fewer bytes
>        are actually available right now (maybe because we were close to
>        end-of-file, or because we are reading from a pipe, or from a
>        terminal), or because read() was interrupted by a signal.  See
>        also NOTES.
> "
> 
> pread(2) refers to read(2)'s documention about short reads as well.
> I don't think reading with io_uring is an exception, I'm not aware of
> any rules that forbided short reads from happening (even if the offset
> and length don't cross the EOF boundary).

It might be documented, but I was laughed when I said we (9p) were
allowed to return short reads on a whim:

https://lkml.kernel.org/r/20200406164641.GF21484@bombadil.infradead.org

(now that might not be the proudest thing I've allowed for 9p, but it
shows there is some expectation we don't do short reads if we can avoid
it... But yes, that doesn't mean we shouldn't fix broken applications
when we find one)

> As mentioned in the commit pointed before, we recently had a similar
> report with MariaDB, which wasn't dealing with short reads properly
> and got fixed shortly after:
> 
> https://jira.mariadb.org/browse/MDEV-27900?focusedCommentId=216582&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-216582
> 
> In fact not dealing with short reads at all, is not that uncommon
> in applications. In that particular case we could avoid doing the
> short read in btrfs, by returning -EAGAIN and making io_uring use
> a blocking context to do a blocking direct IO read.

Sounds good to me.

> > It can easily be reproduced with a simple io_uring program -- see
> > example attached that eventually fails with the following error on
> > btrfs:
> > bad read result for io 8, offset 792227840: 266240 should be 1466368
> > 
> > but doesn't fail on tmpfs or without O_DIRECT.
> > feel free to butcher it, it's already a quickly hacked downversion of my
> > original test that had hash computation etc so the flow might feel a bit
> > weird.
> > Just compile with `gcc -o shortreads uring_shortreads.c -luring` and run
> > with file to read in argument.
> 
> I just tried your program, against the qemu/vmdk image you mentioned in the
> first message, and after over an hour running I couldn't trigger any short
> reads - this was on the integration misc-next branch.
>
> It's possible that to trigger the issue, one needs a particular file extent
> layout, which will not be the same as yours after downloading and converting
> the file.

Ugh. I've also been unable to reproduce on a test fs, despite filling it
with small files and removing some to artificially fragment the image,
so I guess I really do have something on these "normal" filesystems...

Is there a way to artificially try to recreate weird layouts?
I've also tried btrfs send|receive, but while it did preserve reflinked
extents it didn't seem to do the trick.

> Are you able to apply kernel patches and test? If so I may provide you with
> a patch to try and see if it fixes the problem for you.

Yes, no problem with that; I'm not deleting that file until we've seen
the end of it and will be happy to test anything :)

-- 
Dominique