All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Filipe Manana <fdmanana@kernel.org>,
	Michael Kerrisk <mtk@man7.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-xfs@vger.kernel.org,
	linux-btrfs <linux-btrfs@vger.kernel.org>,
	"Darrick J. Wong" <djwong@kernel.org>,
	Christoph Hellwig <hch@infradead.org>,
	cluster-devel <cluster-devel@redhat.com>,
	Josef Bacik <josef@toxicpanda.com>,
	Filipe Manana <fdmanana@suse.com>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH] iomap: fix incomplete async dio reads when using IOMAP_DIO_PARTIAL
Date: Thu, 3 Mar 2022 10:12:06 +1100	[thread overview]
Message-ID: <20220302231206.GV59715@dread.disaster.area> (raw)
In-Reply-To: <CAHc6FU7jBeUEAaB0BupypG1zdxf4shF5T56cHZCD_xXi-jeB+Q@mail.gmail.com>

On Wed, Mar 02, 2022 at 02:03:28PM +0100, Andreas Gruenbacher wrote:
> On Wed, Mar 2, 2022 at 11:17 AM Filipe Manana <fdmanana@kernel.org> wrote:
> > On Tue, Mar 01, 2022 at 09:38:30AM +1100, Dave Chinner wrote:
> > > On Mon, Feb 28, 2022 at 02:32:03PM +0000, fdmanana@kernel.org wrote:
> > > > From: Filipe Manana <fdmanana@suse.com>
> > > .....
> > >
> > > > 11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
> > > >     and return 4K (the amount of io done) to iomap_dio_complete_work();
> > > >
> > > > 12) iomap_dio_complete_work() calls the iocb completion callback,
> > > >     iocb->ki_complete() with a second argument value of 4K (total io
> > > >     done) and the iocb with the adjust ki_pos of X + 4K. This results
> > > >     in completing the read request for io_uring, leaving it with a
> > > >     result of 4K bytes read, and only the first page of the buffer
> > > >     filled in, while the remaining 3 pages, corresponding to the other
> > > >     3 extents, were not filled;
> > > >
> > > > 13) For the application, the result is unexpected because if we ask
> > > >     to read N bytes, it expects to get N bytes read as long as those
> > > >     N bytes don't cross the EOF (i_size).
> > >
> > > Yeah, that's exactly the sort of problem we were having with XFS
> > > with partial DIO completions due to needing multiple iomap iteration
> > > loops to complete a single IO. Hence IOMAP_NOWAIT now triggers the
> > > above range check and aborts before we start...
> >
> > Interesting.
> 
> Dave, this seems to affect all users of iomap_dio_rw in the same way,
> so would it make sense to move this check there?

Perhaps, but I'm not sure it makes sense because filesystems need to
abort ->iomap_begin with -EAGAIN in IOMAP_NOWAIT contexts before
they make any changes.

Hence detecting short extents in the generic code becomes ...
difficult because we might now need to undo changes that have been
made in ->iomap_begin. e.g. if the filesystem allocates space and
the iomap core says "not long enough" because IOMAP_NOWAIT is set,
then we may have to punch out that allocation in ->iomap_end beforei
returning -EAGAIN.

That means filesystems like XFS may now need to supply a ->iomap_end
function to undo stuff the core decides it shouldn't have done,
instead of the filesystem ensuring it never does the operation it in
the first place...

IOWs, the correct behaviour here is for the filesystem ->iomap_begin
method to see that it needs to allocate and return -EAGAIN if
IOMAP_NOWAIT is set, not do the allocation and hope it that it ends
up being long enough to cover the entire IO we have to do.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

WARNING: multiple messages have this Message-ID (diff)
From: Dave Chinner <david@fromorbit.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [PATCH] iomap: fix incomplete async dio reads when using IOMAP_DIO_PARTIAL
Date: Thu, 3 Mar 2022 10:12:06 +1100	[thread overview]
Message-ID: <20220302231206.GV59715@dread.disaster.area> (raw)
In-Reply-To: <CAHc6FU7jBeUEAaB0BupypG1zdxf4shF5T56cHZCD_xXi-jeB+Q@mail.gmail.com>

On Wed, Mar 02, 2022 at 02:03:28PM +0100, Andreas Gruenbacher wrote:
> On Wed, Mar 2, 2022 at 11:17 AM Filipe Manana <fdmanana@kernel.org> wrote:
> > On Tue, Mar 01, 2022 at 09:38:30AM +1100, Dave Chinner wrote:
> > > On Mon, Feb 28, 2022 at 02:32:03PM +0000, fdmanana at kernel.org wrote:
> > > > From: Filipe Manana <fdmanana@suse.com>
> > > .....
> > >
> > > > 11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
> > > >     and return 4K (the amount of io done) to iomap_dio_complete_work();
> > > >
> > > > 12) iomap_dio_complete_work() calls the iocb completion callback,
> > > >     iocb->ki_complete() with a second argument value of 4K (total io
> > > >     done) and the iocb with the adjust ki_pos of X + 4K. This results
> > > >     in completing the read request for io_uring, leaving it with a
> > > >     result of 4K bytes read, and only the first page of the buffer
> > > >     filled in, while the remaining 3 pages, corresponding to the other
> > > >     3 extents, were not filled;
> > > >
> > > > 13) For the application, the result is unexpected because if we ask
> > > >     to read N bytes, it expects to get N bytes read as long as those
> > > >     N bytes don't cross the EOF (i_size).
> > >
> > > Yeah, that's exactly the sort of problem we were having with XFS
> > > with partial DIO completions due to needing multiple iomap iteration
> > > loops to complete a single IO. Hence IOMAP_NOWAIT now triggers the
> > > above range check and aborts before we start...
> >
> > Interesting.
> 
> Dave, this seems to affect all users of iomap_dio_rw in the same way,
> so would it make sense to move this check there?

Perhaps, but I'm not sure it makes sense because filesystems need to
abort ->iomap_begin with -EAGAIN in IOMAP_NOWAIT contexts before
they make any changes.

Hence detecting short extents in the generic code becomes ...
difficult because we might now need to undo changes that have been
made in ->iomap_begin. e.g. if the filesystem allocates space and
the iomap core says "not long enough" because IOMAP_NOWAIT is set,
then we may have to punch out that allocation in ->iomap_end beforei
returning -EAGAIN.

That means filesystems like XFS may now need to supply a ->iomap_end
function to undo stuff the core decides it shouldn't have done,
instead of the filesystem ensuring it never does the operation it in
the first place...

IOWs, the correct behaviour here is for the filesystem ->iomap_begin
method to see that it needs to allocate and return -EAGAIN if
IOMAP_NOWAIT is set, not do the allocation and hope it that it ends
up being long enough to cover the entire IO we have to do.

Cheers,

Dave.
-- 
Dave Chinner
david at fromorbit.com



  reply	other threads:[~2022-03-02 23:49 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-28 14:32 [PATCH] iomap: fix incomplete async dio reads when using IOMAP_DIO_PARTIAL fdmanana
2022-02-28 14:32 ` [Cluster-devel] " fdmanana
2022-02-28 22:38 ` Dave Chinner
2022-02-28 22:38   ` [Cluster-devel] " Dave Chinner
2022-03-02 10:17   ` Filipe Manana
2022-03-02 10:17     ` [Cluster-devel] " Filipe Manana
2022-03-02 13:03     ` Andreas Gruenbacher
2022-03-02 13:03       ` [Cluster-devel] " Andreas Gruenbacher
2022-03-02 23:12       ` Dave Chinner [this message]
2022-03-02 23:12         ` Dave Chinner
2022-03-01 16:42 ` Darrick J. Wong
2022-03-01 16:42   ` [Cluster-devel] " Darrick J. Wong
2022-03-02 10:30   ` Filipe Manana
2022-03-02 10:30     ` [Cluster-devel] " Filipe Manana

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220302231206.GV59715@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=agruenba@redhat.com \
    --cc=cluster-devel@redhat.com \
    --cc=djwong@kernel.org \
    --cc=fdmanana@kernel.org \
    --cc=fdmanana@suse.com \
    --cc=hch@infradead.org \
    --cc=josef@toxicpanda.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=mtk@man7.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.