From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>,
	Christoph Hellwig <hch@lst.de>,
	Goldwyn Rodrigues <rgoldwyn@suse.de>,
	linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org,
	fdmanana@gmail.com, dsterba@suse.cz, cluster-devel@redhat.com,
	linux-ext4@vger.kernel.org, linux-xfs@vger.kernel.org
Subject: Re: always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
Date: Thu, 9 Jul 2020 10:10:38 -0700	[thread overview]
Message-ID: <20200709171038.GE7625@magnolia> (raw)
In-Reply-To: <20200709170519.GH12769@casper.infradead.org>

On Thu, Jul 09, 2020 at 06:05:19PM +0100, Matthew Wilcox wrote:
> On Thu, Jul 09, 2020 at 09:09:26AM -0700, Darrick J. Wong wrote:
> > On Thu, Jul 09, 2020 at 12:25:27PM +1000, Dave Chinner wrote:
> > > iomap: Only invalidate page cache pages on direct IO writes
> > > 
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > The historic requirement for XFS to invalidate cached pages on
> > > direct IO reads has been lost in the twisty pages of history - it was
> > > inherited from Irix, which implemented page cache invalidation on
> > > read as a method of working around problems synchronising page
> > > cache state with uncached IO.
> > 
> > Urk.
> > 
> > > XFS has carried this ever since. In the initial linux ports it was
> > > necessary to get mmap and DIO to play "ok" together and not
> > > immediately corrupt data. This was the state of play until the linux
> > > kernel had infrastructure to track unwritten extents and synchronise
> > > page faults with allocations and unwritten extent conversions
> > > (->page_mkwrite infrastructure). IOWs, the page cache invalidation
> > > on DIO read was necessary to prevent trivial data corruptions. This
> > > didn't solve all the problems, though.
> > > 
> > > There were performance problems if we didn't invalidate the entire
> > > page cache over the file on read - we couldn't easily determine if
> > > the cached pages were over the range of the IO, and invalidation
> > > required taking a serialising lock (i_mutex) on the inode. This
> > > serialising lock was an issue for XFS, as it was the only exclusive
> > > lock in the direct IO read path.
> > > 
> > > Hence if there were any cached pages, we'd just invalidate the
> > > entire file in one go so that subsequent IOs didn't need to take the
> > > serialising lock. This was a problem that prevented ranged
> > > invalidation from being particularly useful for avoiding the
> > > remaining coherency issues. This was solved with the conversion of
> > > i_mutex to i_rwsem and the conversion of the XFS inode IO lock to
> > > use i_rwsem. Hence we could now just do ranged invalidation and the
> > > performance problem went away.
> > > 
> > > However, page cache invalidation was still needed to serialise
> > > sub-page/sub-block zeroing via direct IO against buffered IO because
> > > bufferhead state attached to the cached page could get out of whack
> > > when direct IOs were issued.  We've removed bufferheads from the
> > > XFS code, and we don't carry any extent state on the cached pages
> > > anymore, and so this problem has gone away, too.
> > > 
> > > IOWs, it would appear that we don't have any good reason to be
> > > invalidating the page cache on DIO reads anymore. Hence remove the
> > > invalidation on read because it is unnecessary overhead,
> > > not needed to maintain coherency between mmap/buffered access and
> > > direct IO anymore, and prevents anyone from using direct IO reads
> > > from intentionally invalidating the page cache of a file.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/iomap/direct-io.c | 33 +++++++++++++++++----------------
> > >  1 file changed, 17 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index ec7b78e6feca..ef0059eb34b5 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -475,23 +475,24 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > >  	if (ret)
> > >  		goto out_free_dio;
> > >  
> > > -	/*
> > > -	 * Try to invalidate cache pages for the range we're direct
> > > -	 * writing.  If this invalidation fails, tough, the write will
> > > -	 * still work, but racing two incompatible write paths is a
> > > -	 * pretty crazy thing to do, so we don't support it 100%.
> > 
> > I always wondered about the repeated use of 'write' in this comment
> > despite the lack of any sort of WRITE check logic.  Seems fine to me,
> > let's throw it on the fstests pile and see what happens.
> > 
> > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> 
> > --D
> > 
> > > -	 */
> > > -	ret = invalidate_inode_pages2_range(mapping,
> > > -			pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > -	if (ret)
> > > -		dio_warn_stale_pagecache(iocb->ki_filp);
> > > -	ret = 0;
> > > +	if (iov_iter_rw(iter) == WRITE) {
> > > +		/*
> > > +		 * Try to invalidate cache pages for the range we're direct
> > > +		 * writing.  If this invalidation fails, tough, the write will
> > > +		 * still work, but racing two incompatible write paths is a
> > > +		 * pretty crazy thing to do, so we don't support it 100%.
> > > +		 */
> > > +		ret = invalidate_inode_pages2_range(mapping,
> > > +				pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > +		if (ret)
> > > +			dio_warn_stale_pagecache(iocb->ki_filp);
> > > +		ret = 0;
> > >  
> > > -	if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> > > -	    !inode->i_sb->s_dio_done_wq) {
> > > -		ret = sb_init_dio_done_wq(inode->i_sb);
> > > -		if (ret < 0)
> > > -			goto out_free_dio;
> > > +		if (!wait_for_completion &&
> > > +		    !inode->i_sb->s_dio_done_wq) {
> > > +			ret = sb_init_dio_done_wq(inode->i_sb);
> > > +			if (ret < 0)
> > > +				goto out_free_dio;

...and yes I did add in the closing brace here. :P

--D

> > >  	}
> > >  
> > >  	inode_dio_begin(inode);

