linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Josef Bacik <josef@toxicpanda.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>,
	David Sterba <dsterba@suse.com>,
	"linux-btrfs @ vger . kernel . org" <linux-btrfs@vger.kernel.org>,
	Filipe Manana <fdmanana@gmail.com>,
	Christoph Hellwig <hch@lst.de>,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>
Subject: Re: [RFC PATCH] btrfs: don't call btrfs_sync_file from iomap context
Date: Wed, 2 Sep 2020 07:46:13 +1000	[thread overview]
Message-ID: <20200901214613.GH12096@dread.disaster.area> (raw)
In-Reply-To: <42efa646-73cd-d884-1c9c-dd889294bde2@toxicpanda.com>

On Tue, Sep 01, 2020 at 11:11:58AM -0400, Josef Bacik wrote:
> On 9/1/20 9:06 AM, Johannes Thumshirn wrote:
> > This happens because iomap_dio_complete() calls into generic_write_sync()
> > if we have the data-sync flag set. But as we're still under the
> > inode_lock() from btrfs_file_write_iter() we will deadlock once
> > btrfs_sync_file() tries to acquire the inode_lock().
> > 
> > Calling into generic_write_sync() is not needed as __btrfs_direct_write()
> > already takes care of persisting the data on disk. We can temporarily drop
> > the IOCB_DSYNC flag before calling into __btrfs_direct_write() so the
> > iomap code won't try to call into the sync routines as well.
> > 
> > References: https://github.com/btrfs/fstests/issues/12
> > Fixes: da4d7c1b4c45 ("btrfs: switch to iomap for direct IO")
> > Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> > ---
> >   fs/btrfs/file.c | 5 ++++-
> >   1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index b62679382799..c75c0f2a5f72 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -2023,6 +2023,7 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> >   		atomic_inc(&BTRFS_I(inode)->sync_writers);
> >   	if (iocb->ki_flags & IOCB_DIRECT) {
> > +		iocb->ki_flags &= ~IOCB_DSYNC;
> >   		num_written = __btrfs_direct_write(iocb, from);
> >   	} else {
> >   		num_written = btrfs_buffered_write(iocb, from);
> > @@ -2046,8 +2047,10 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> >   	if (num_written > 0)
> >   		num_written = generic_write_sync(iocb, num_written);
> > -	if (sync)
> > +	if (sync) {
> > +		iocb->ki_flags |= IOCB_DSYNC;
> >   		atomic_dec(&BTRFS_I(inode)->sync_writers);
> > +	}
> >   out:
> >   	current->backing_dev_info = NULL;
> >   	return num_written ? num_written : err;
> > 
> 
> Christoph, I feel like this is broken.

No, it isn't broken, it's just a -different design- to the old
direct IO path. It was done this way done by design because the old
way of requiring separate paths for calling generic_write_sync() for
sync and AIO is ....  nasty, and doesn't allow for optimisation of
IO completion functionality that may be wholly dependent on
submission time inode state.

e.g. moving the O_DSYNC completion out of the context of the
IOMAP_F_DIRTY submission context means we can't reliably do FUA
writes to avoid calls to generic_write_sync() completely.
Compromising that functionality is going to cause major performance
regressions for high performance enterprise databases using O_DSYNC
AIO+DIO...

> Xfs and ext4 get away with this for
> different reasons,

No, they "don't get away with it", this is how it was designed to
work.

> ext4 doesn't take the inode_lock() at all in fsync, and
> xfs takes the ILOCK instead of the IOLOCK, so it's fine.  However btrfs uses
> inode_lock() in ->fsync (not for the IO, just for the logging part).  A long
> time ago I specifically pushed the inode locking down into ->fsync()
> handlers to give us this sort of control.
> 
> I'm not 100% on the iomap stuff, but the fix seems like we need to move the
> generic_write_sync() out of iomap_dio_complete() completely, and the callers
> do their own thing, much like the normal generic_file_write_iter() does.

That effectively breaks O_DSYNC AIO and requires us to reintroduce
all the nasty code that the old direct IO path required both the
infrastructure and the filesystems to handle it. That's really not
acceptible solution to an internal btrfs locking issue...

> And then I'd like to add a WARN_ON(lockdep_is_held()) in vfs_fsync_range()
> so we can avoid this sort of thing in the future.  What do you think?

That's not going to work, either. There are filesystems that call
vfs_fsync_range() directly from under the inode_lock(). For example,
the fallocate() path in gfs2. And it's called under the ext4 and XFS
MMAPLOCK from the dax page fault path, which is the page fault
equivalent of the inode_lock(). IOWs, if you know that you aren't
going to take inode locks in your ->fsync() method, there's nothing
that says you cannot call vfs_fsync_range() while holding those
inode locks.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  parent reply	other threads:[~2020-09-01 21:46 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20200901130644.12655-1-johannes.thumshirn@wdc.com>
2020-09-01 15:11 ` [RFC PATCH] btrfs: don't call btrfs_sync_file from iomap context Josef Bacik
2020-09-01 17:45   ` Darrick J. Wong
2020-09-01 17:55     ` Josef Bacik
2020-09-01 21:46   ` Dave Chinner [this message]
2020-09-01 22:19     ` Josef Bacik
2020-09-01 23:58       ` Dave Chinner
2020-09-02  0:22         ` Josef Bacik
2020-09-02  7:12           ` Johannes Thumshirn
2020-09-02 11:10             ` Josef Bacik
2020-09-02 16:29               ` Darrick J. Wong
2020-09-02 16:47                 ` Josef Bacik
2020-09-02 11:44         ` Matthew Wilcox
2020-09-02 12:20           ` Dave Chinner
2020-09-02 12:42             ` Josef Bacik
2020-09-03  2:28               ` Dave Chinner
2020-09-03  9:49                 ` Filipe Manana
2020-09-03 16:32   ` Christoph Hellwig
2020-09-03 16:46     ` Josef Bacik
2020-09-07  0:04     ` Dave Chinner
2020-09-15 21:48       ` Goldwyn Rodrigues
2020-09-17  3:09         ` Dave Chinner
2020-09-17  5:52           ` Christoph Hellwig
2020-09-17  6:29             ` Dave Chinner
2020-09-17  6:42               ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200901214613.GH12096@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=dsterba@suse.com \
    --cc=fdmanana@gmail.com \
    --cc=hch@lst.de \
    --cc=johannes.thumshirn@wdc.com \
    --cc=josef@toxicpanda.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).