All of lore.kernel.org
 help / color / mirror / Atom feed
From: Josef Bacik <josef@toxicpanda.com>
To: Filipe Manana <fdmanana@kernel.org>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com,
	Dylan Yudaken <dylany@fb.com>, Jens Axboe <axboe@kernel.dk>,
	agruenba@redhat.com
Subject: Re: [PATCH] btrfs: handle DIO read faults properly
Date: Fri, 12 Aug 2022 11:39:23 -0400	[thread overview]
Message-ID: <YvZ0K53zVv21mNVs@localhost.localdomain> (raw)
In-Reply-To: <20220812090208.GA2373742@falcondesktop>

On Fri, Aug 12, 2022 at 10:02:08AM +0100, Filipe Manana wrote:
> On Thu, Aug 11, 2022 at 05:06:11PM -0400, Josef Bacik wrote:
> > Dylan reported a problem where he had an io_uring test that was returning
> > short DIO reads with ee5b46a353af ("btrfs: increase direct io read size
> > limit to 256 sectors") applied.  This turned out to be a red herring,
> > this simply increases the size of the reads we'll attempt to do in one
> > go.  The root of the problem is that because we're now trying to read in
> > more into our buffer, we're more likely to hit a page fault while trying
> > to read into the buffer.
> > 
> > Because we pass IOMAP_DIO_PARTIAL into iomap if we get an -EFAULT we'll
> > simply return 0, expecting that we'll do the fault and then try again.
> > However since we're only checking for a ret > 0 || ret == -EFAULT we
> > return a short read.
> 
> I find this explanation to be terse. There's a lot of non-obvious details missing.
> 
> So, at __iomap_dio_rw() we have this:
> 
>     if (ret == -EFAULT && dio->size && (dio_flags & IOMAP_DIO_PARTIAL)) {
>         if (!(iocb->ki_flags & IOCB_NOWAIT))
>             wait_for_completion = true;
>         ret = 0;
>     }
> 
> And shortly after, in the same function, we have:
> 
>     if (!atomic_dec_and_test(&dio->ref)) {
>         if (!wait_for_completion)
>             return ERR_PTR(-EIOCBQUEUED);
> 
>         for (;;) {
>             set_current_state(TASK_UNINTERRUPTIBLE);
>             if (!READ_ONCE(dio->submit.waiter))
>                 break;
> 
>             blk_io_schedule();
>         }
>         __set_current_state(TASK_RUNNING);
>     }
> 
> If the short read happens with io_uring, it means the initial request had
> IOCB_NOWAIT set. With the IOCB_NOWAIT call to btrfs, we were able to satisfy
> part of the read, submit one or more bios, and then we got -EFAULT when trying
> to faultin a page from the iovector.
> 
> This makes the first if-statement set 'ret' to 0 and 'wait_for_completion'
> stays with a false value. Then later on, because 'wait_for_completion' is
> false, -EIOCBQUEUED is returned to btrfs, which returns it and propagates
> it back to io_uring.
> 
> Because the read was not entirely satisfied, io_uring fallbacks to a blocking
> context, i.e. IOCB_NOWAIT is not set. We get back to btrfs and __iomap_dio_rw(),
> we submit one or more bios for some parts of remaining range, but then we get
> -EFAULT again. Then in that first if statement at __iomap_dio_rw(), we set 'ret'
> to 0 and 'wait_for_completion' to true, since dio->size is > 0 (we have submitted
> one more bios, made some progress) and IOCB_NOWAIT is not set.
> 
> This results in 0 being returned to btrfs_direct_read(), which returns 0 back
> to btrfs_file_read_iter(). And there we call filemap_read() to try to read
> what's left using buffered IO, passing it a value of 0 for its 'already_read'
> argument. Then there we get some error, presumably a page fault again, and
> return it back to io_uring.
> 
> And that's how user space gets the partial read?
> I don't see how else it can happen, so I'd like you to confirm that and update the
> changelog with a lot more details how user space gets a partial read.
> 

Sure I'll include these details, I left them out because it's relatively
convoluted (as you found out) and really the important detail is we get a 0 back
because we tried to page fault and it got turned into 0 from -EFAULT.

> 
> > Fix this by checking for a 0 read from iomap as
> > well.  I've left the EFAULT case because it appears like we can still
> > get this in the case of a page fault from a direct read from an inline
> > extent.
> 
> The EFAULT case is not related to inline extents.
> When we find an inline extent, we always fallback to buffered IO.
> btrfs_iomap_begin() will return -ENOTBLK or -EAGAIN depending on NOWAIT, so
> btrfs_direct_read() will never get -EFAULT because of an inline extent.
> 
> To get -EFAULT it means we are trying to read a non-inline, non-compressed
> extent and then failed to faultin the very first page of the iovector, so in
> that first if-statement at __iomap_dio_rw(), 'ret' remains with the value
> of -EFAULT because dio->size is 0 (no progress made at all) and inline
> extents always start at file offset 0 (so no progress may have happened).
> 

Sorry I thought I saw IOMAP_INLINE for our inline stuff.  We can get it for
IOMAP_HOLE where we try to zero the buffer which can fault, and then we have a
dio->size == 0 && ret == -EFAULT and that'll get propagated up to us, so we need
to preserve the -EFAULT check.  I'll update the changelog with this as well.
Thanks,

Josef

  reply	other threads:[~2022-08-12 15:39 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-11 21:06 [PATCH] btrfs: handle DIO read faults properly Josef Bacik
2022-08-12  9:02 ` Filipe Manana
2022-08-12 15:39   ` Josef Bacik [this message]
2022-08-19 15:55   ` Josef Bacik

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YvZ0K53zVv21mNVs@localhost.localdomain \
    --to=josef@toxicpanda.com \
    --cc=agruenba@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=dylany@fb.com \
    --cc=fdmanana@kernel.org \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.