All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] btrfs: handle DIO read faults properly
@ 2022-08-11 21:06 Josef Bacik
  2022-08-12  9:02 ` Filipe Manana
  0 siblings, 1 reply; 4+ messages in thread
From: Josef Bacik @ 2022-08-11 21:06 UTC (permalink / raw)
  To: linux-btrfs, kernel-team; +Cc: Dylan Yudaken, Jens Axboe

Dylan reported a problem where he had an io_uring test that was returning
short DIO reads with ee5b46a353af ("btrfs: increase direct io read size
limit to 256 sectors") applied.  This turned out to be a red herring,
this simply increases the size of the reads we'll attempt to do in one
go.  The root of the problem is that because we're now trying to read in
more into our buffer, we're more likely to hit a page fault while trying
to read into the buffer.

Because we pass IOMAP_DIO_PARTIAL into iomap if we get an -EFAULT we'll
simply return 0, expecting that we'll do the fault and then try again.
However since we're only checking for a ret > 0 || ret == -EFAULT we
return a short read.  Fix this by checking for a 0 read from iomap as
well.  I've left the EFAULT case because it appears like we can still
get this in the case of a page fault from a direct read from an inline
extent.

Jens tested this patch with their testcase and it fixed the problem.

Reported-by: Dylan Yudaken <dylany@fb.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index c9649b7b2f18..a61085ac59d6 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3776,7 +3776,7 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
 	if (ret > 0)
 		read = ret;
 
-	if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
+	if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret >= 0)) {
 		const size_t left = iov_iter_count(to);
 
 		if (left == prev_left) {
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] btrfs: handle DIO read faults properly
  2022-08-11 21:06 [PATCH] btrfs: handle DIO read faults properly Josef Bacik
@ 2022-08-12  9:02 ` Filipe Manana
  2022-08-12 15:39   ` Josef Bacik
  2022-08-19 15:55   ` Josef Bacik
  0 siblings, 2 replies; 4+ messages in thread
From: Filipe Manana @ 2022-08-12  9:02 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team, Dylan Yudaken, Jens Axboe, agruenba

On Thu, Aug 11, 2022 at 05:06:11PM -0400, Josef Bacik wrote:
> Dylan reported a problem where he had an io_uring test that was returning
> short DIO reads with ee5b46a353af ("btrfs: increase direct io read size
> limit to 256 sectors") applied.  This turned out to be a red herring,
> this simply increases the size of the reads we'll attempt to do in one
> go.  The root of the problem is that because we're now trying to read in
> more into our buffer, we're more likely to hit a page fault while trying
> to read into the buffer.
> 
> Because we pass IOMAP_DIO_PARTIAL into iomap if we get an -EFAULT we'll
> simply return 0, expecting that we'll do the fault and then try again.
> However since we're only checking for a ret > 0 || ret == -EFAULT we
> return a short read.

I find this explanation to be terse. There's a lot of non-obvious details missing.

So, at __iomap_dio_rw() we have this:

    if (ret == -EFAULT && dio->size && (dio_flags & IOMAP_DIO_PARTIAL)) {
        if (!(iocb->ki_flags & IOCB_NOWAIT))
            wait_for_completion = true;
        ret = 0;
    }

And shortly after, in the same function, we have:

    if (!atomic_dec_and_test(&dio->ref)) {
        if (!wait_for_completion)
            return ERR_PTR(-EIOCBQUEUED);

        for (;;) {
            set_current_state(TASK_UNINTERRUPTIBLE);
            if (!READ_ONCE(dio->submit.waiter))
                break;

            blk_io_schedule();
        }
        __set_current_state(TASK_RUNNING);
    }

If the short read happens with io_uring, it means the initial request had
IOCB_NOWAIT set. With the IOCB_NOWAIT call to btrfs, we were able to satisfy
part of the read, submit one or more bios, and then we got -EFAULT when trying
to faultin a page from the iovector.

This makes the first if-statement set 'ret' to 0 and 'wait_for_completion'
stays with a false value. Then later on, because 'wait_for_completion' is
false, -EIOCBQUEUED is returned to btrfs, which returns it and propagates
it back to io_uring.

Because the read was not entirely satisfied, io_uring fallbacks to a blocking
context, i.e. IOCB_NOWAIT is not set. We get back to btrfs and __iomap_dio_rw(),
we submit one or more bios for some parts of remaining range, but then we get
-EFAULT again. Then in that first if statement at __iomap_dio_rw(), we set 'ret'
to 0 and 'wait_for_completion' to true, since dio->size is > 0 (we have submitted
one more bios, made some progress) and IOCB_NOWAIT is not set.

This results in 0 being returned to btrfs_direct_read(), which returns 0 back
to btrfs_file_read_iter(). And there we call filemap_read() to try to read
what's left using buffered IO, passing it a value of 0 for its 'already_read'
argument. Then there we get some error, presumably a page fault again, and
return it back to io_uring.

And that's how user space gets the partial read?
I don't see how else it can happen, so I'd like you to confirm that and update the
changelog with a lot more details how user space gets a partial read.


> Fix this by checking for a 0 read from iomap as
> well.  I've left the EFAULT case because it appears like we can still
> get this in the case of a page fault from a direct read from an inline
> extent.

The EFAULT case is not related to inline extents.
When we find an inline extent, we always fallback to buffered IO.
btrfs_iomap_begin() will return -ENOTBLK or -EAGAIN depending on NOWAIT, so
btrfs_direct_read() will never get -EFAULT because of an inline extent.

To get -EFAULT it means we are trying to read a non-inline, non-compressed
extent and then failed to faultin the very first page of the iovector, so in
that first if-statement at __iomap_dio_rw(), 'ret' remains with the value
of -EFAULT because dio->size is 0 (no progress made at all) and inline
extents always start at file offset 0 (so no progress may have happened).

I'm also seeing that gfs2 also doesn't retry if it gets 0 from iomap
(at gfs2_file_direct_read()), so adding Andreas to CC.

Thanks.

> 
> Jens tested this patch with their testcase and it fixed the problem.
> 
> Reported-by: Dylan Yudaken <dylany@fb.com>
> Tested-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/file.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index c9649b7b2f18..a61085ac59d6 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3776,7 +3776,7 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
>  	if (ret > 0)
>  		read = ret;
>  
> -	if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
> +	if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret >= 0)) {
>  		const size_t left = iov_iter_count(to);
>  
>  		if (left == prev_left) {
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] btrfs: handle DIO read faults properly
  2022-08-12  9:02 ` Filipe Manana
@ 2022-08-12 15:39   ` Josef Bacik
  2022-08-19 15:55   ` Josef Bacik
  1 sibling, 0 replies; 4+ messages in thread
From: Josef Bacik @ 2022-08-12 15:39 UTC (permalink / raw)
  To: Filipe Manana
  Cc: linux-btrfs, kernel-team, Dylan Yudaken, Jens Axboe, agruenba

On Fri, Aug 12, 2022 at 10:02:08AM +0100, Filipe Manana wrote:
> On Thu, Aug 11, 2022 at 05:06:11PM -0400, Josef Bacik wrote:
> > Dylan reported a problem where he had an io_uring test that was returning
> > short DIO reads with ee5b46a353af ("btrfs: increase direct io read size
> > limit to 256 sectors") applied.  This turned out to be a red herring,
> > this simply increases the size of the reads we'll attempt to do in one
> > go.  The root of the problem is that because we're now trying to read in
> > more into our buffer, we're more likely to hit a page fault while trying
> > to read into the buffer.
> > 
> > Because we pass IOMAP_DIO_PARTIAL into iomap if we get an -EFAULT we'll
> > simply return 0, expecting that we'll do the fault and then try again.
> > However since we're only checking for a ret > 0 || ret == -EFAULT we
> > return a short read.
> 
> I find this explanation to be terse. There's a lot of non-obvious details missing.
> 
> So, at __iomap_dio_rw() we have this:
> 
>     if (ret == -EFAULT && dio->size && (dio_flags & IOMAP_DIO_PARTIAL)) {
>         if (!(iocb->ki_flags & IOCB_NOWAIT))
>             wait_for_completion = true;
>         ret = 0;
>     }
> 
> And shortly after, in the same function, we have:
> 
>     if (!atomic_dec_and_test(&dio->ref)) {
>         if (!wait_for_completion)
>             return ERR_PTR(-EIOCBQUEUED);
> 
>         for (;;) {
>             set_current_state(TASK_UNINTERRUPTIBLE);
>             if (!READ_ONCE(dio->submit.waiter))
>                 break;
> 
>             blk_io_schedule();
>         }
>         __set_current_state(TASK_RUNNING);
>     }
> 
> If the short read happens with io_uring, it means the initial request had
> IOCB_NOWAIT set. With the IOCB_NOWAIT call to btrfs, we were able to satisfy
> part of the read, submit one or more bios, and then we got -EFAULT when trying
> to faultin a page from the iovector.
> 
> This makes the first if-statement set 'ret' to 0 and 'wait_for_completion'
> stays with a false value. Then later on, because 'wait_for_completion' is
> false, -EIOCBQUEUED is returned to btrfs, which returns it and propagates
> it back to io_uring.
> 
> Because the read was not entirely satisfied, io_uring fallbacks to a blocking
> context, i.e. IOCB_NOWAIT is not set. We get back to btrfs and __iomap_dio_rw(),
> we submit one or more bios for some parts of remaining range, but then we get
> -EFAULT again. Then in that first if statement at __iomap_dio_rw(), we set 'ret'
> to 0 and 'wait_for_completion' to true, since dio->size is > 0 (we have submitted
> one more bios, made some progress) and IOCB_NOWAIT is not set.
> 
> This results in 0 being returned to btrfs_direct_read(), which returns 0 back
> to btrfs_file_read_iter(). And there we call filemap_read() to try to read
> what's left using buffered IO, passing it a value of 0 for its 'already_read'
> argument. Then there we get some error, presumably a page fault again, and
> return it back to io_uring.
> 
> And that's how user space gets the partial read?
> I don't see how else it can happen, so I'd like you to confirm that and update the
> changelog with a lot more details how user space gets a partial read.
> 

Sure I'll include these details, I left them out because it's relatively
convoluted (as you found out) and really the important detail is we get a 0 back
because we tried to page fault and it got turned into 0 from -EFAULT.

> 
> > Fix this by checking for a 0 read from iomap as
> > well.  I've left the EFAULT case because it appears like we can still
> > get this in the case of a page fault from a direct read from an inline
> > extent.
> 
> The EFAULT case is not related to inline extents.
> When we find an inline extent, we always fallback to buffered IO.
> btrfs_iomap_begin() will return -ENOTBLK or -EAGAIN depending on NOWAIT, so
> btrfs_direct_read() will never get -EFAULT because of an inline extent.
> 
> To get -EFAULT it means we are trying to read a non-inline, non-compressed
> extent and then failed to faultin the very first page of the iovector, so in
> that first if-statement at __iomap_dio_rw(), 'ret' remains with the value
> of -EFAULT because dio->size is 0 (no progress made at all) and inline
> extents always start at file offset 0 (so no progress may have happened).
> 

Sorry I thought I saw IOMAP_INLINE for our inline stuff.  We can get it for
IOMAP_HOLE where we try to zero the buffer which can fault, and then we have a
dio->size == 0 && ret == -EFAULT and that'll get propagated up to us, so we need
to preserve the -EFAULT check.  I'll update the changelog with this as well.
Thanks,

Josef

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] btrfs: handle DIO read faults properly
  2022-08-12  9:02 ` Filipe Manana
  2022-08-12 15:39   ` Josef Bacik
@ 2022-08-19 15:55   ` Josef Bacik
  1 sibling, 0 replies; 4+ messages in thread
From: Josef Bacik @ 2022-08-19 15:55 UTC (permalink / raw)
  To: Filipe Manana
  Cc: linux-btrfs, kernel-team, Dylan Yudaken, Jens Axboe, agruenba

On Fri, Aug 12, 2022 at 10:02:08AM +0100, Filipe Manana wrote:
> On Thu, Aug 11, 2022 at 05:06:11PM -0400, Josef Bacik wrote:
> > Dylan reported a problem where he had an io_uring test that was returning
> > short DIO reads with ee5b46a353af ("btrfs: increase direct io read size
> > limit to 256 sectors") applied.  This turned out to be a red herring,
> > this simply increases the size of the reads we'll attempt to do in one
> > go.  The root of the problem is that because we're now trying to read in
> > more into our buffer, we're more likely to hit a page fault while trying
> > to read into the buffer.
> > 
> > Because we pass IOMAP_DIO_PARTIAL into iomap if we get an -EFAULT we'll
> > simply return 0, expecting that we'll do the fault and then try again.
> > However since we're only checking for a ret > 0 || ret == -EFAULT we
> > return a short read.
> 
> I find this explanation to be terse. There's a lot of non-obvious details missing.
> 

Turns out there was some weirdness with the reproducer, and this isn't actually
the problem.

The real problem is that we would submit part of the IO since we got a fault,
and because we are async we'd get the short read.  I've sent a different patch
with the actual explanation for what was happening and a proper fix for this
particular regression.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2022-08-19 16:06 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-11 21:06 [PATCH] btrfs: handle DIO read faults properly Josef Bacik
2022-08-12  9:02 ` Filipe Manana
2022-08-12 15:39   ` Josef Bacik
2022-08-19 15:55   ` Josef Bacik

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.