Re: [PATCH] iomap: Return zero in case of unsuccessful pagecache invalidation before DIO

From: Filipe Manana <fdmanana@gmail.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>,
	Goldwyn Rodrigues <rgoldwyn@suse.de>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-btrfs <linux-btrfs@vger.kernel.org>,
	Johannes Thumshirn <Johannes.Thumshirn@wdc.com>,
	Christoph Hellwig <hch@infradead.org>,
	dsterba@suse.cz
Subject: Re: [PATCH] iomap: Return zero in case of unsuccessful pagecache invalidation before DIO
Date: Fri, 29 May 2020 12:50:01 +0100	[thread overview]
Message-ID: <CAL3q7H5cp8joqHnS8rtPBBEQyYw9L0KbRNCQwFfKz1pD-tZvwQ@mail.gmail.com> (raw)
In-Reply-To: <20200529113116.GU17206@bombadil.infradead.org>

On Fri, May 29, 2020 at 12:31 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, May 29, 2020 at 11:55:33AM +0100, Filipe Manana wrote:
> > On Fri, May 29, 2020 at 1:23 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > >
> > > On Thu, May 28, 2020 at 02:21:03PM -0500, Goldwyn Rodrigues wrote:
> > > >
> > > > Filesystems such as btrfs are unable to guarantee page invalidation
> > > > because pages could be locked as a part of the extent. Return zero
> > >
> > > Locked for what?  filemap_write_and_wait_range should have just cleaned
> > > them off.
> >
> > Yes, it will be confusing even for someone more familiar with btrfs.
> > The changelog could be more detailed to make it clear what's happening and why.
> >
> > So what happens:
> >
> > 1) iomap_dio_rw() calls filemap_write_and_wait_range().
> >     That starts delalloc for all dirty pages in the range and then
> > waits for writeback to complete.
> >     This is enough for most filesystems at least (if not all except btrfs).
> >
> > 2) However, in btrfs once writeback finishes, a job is queued to run
> > on a dedicated workqueue, to execute the function
> > btrfs_finish_ordered_io().
> >     So that job will be run after filemap_write_and_wait_range() returns.
> >     That function locks the file range (using a btrfs specific data
> > structure), does a bunch of things (updating several btrees), and then
> > unlocks the file range.
> >
> > 3) While iomap calls invalidate_inode_pages2_range(), which ends up
> > calling the btrfs callback btfs_releasepage(),
> >     btrfs_finish_ordered_io() is running and has the file range locked
> > (this is what Goldwyn means by "pages could be locked", which is
> > confusing because it's not about any locked struct page).
> >
> > 4) Because the file range is locked, btrfs_releasepage() returns 0
> > (page can't be released), this happens in the helper function
> > try_release_extent_state().
> >     Any page in that range is not dirty nor under writeback anymore
> > and, in fact, btrfs_finished_ordered_io() doesn't do anything with the
> > pages, it's only updating metadata.
> >
> >     btrfs_releasepage() in this case could release the pages, but
> > there are other contextes where the file range is locked, the pages
> > are still not dirty and not under writeback, where this would not be
> > safe to do.
>
> Isn't this the bug, though?  Rather than returning "page can't be
> released", shouldn't ->releasepage sleep on the extent state, at least
> if the GFP mask indicates you can sleep?

Goldwyn mentioned in another thread that he had tried that with the
following patch:

https://patchwork.kernel.org/patch/11275063/

But he mentioned it didn't work though, caused some locking problems.
I don't know the details and I haven't tried the patchset yet.
Goldwyn?

>
> > 5) So because of that invalidate_inode_pages2_range() returns
> > non-zero, the iomap code prints that warning message and then proceeds
> > with doing a direct IO write anyway.
> >
> > What happens currently in btrfs, before Goldwyn's patchset:
> >
> > 1) generic_file_direct_write() also calls filemap_write_and_wait_range().
> > 2) After that it calls invalidate_inode_pages2_range() too, but if
> > that returns non-zero, it doesn't print any warning and falls back to
> > a buffered write.
> >
> > So Goldwyn here is effectively adding that behaviour from
> > generic_file_direct_write() to iomap.
> >
> > Thanks.
> >
> > >
> > > > in case a page cache invalidation is unsuccessful so filesystems can
> > > > fallback to buffered I/O. This is similar to
> > > > generic_file_direct_write().
> > > >
> > > > This takes care of the following invalidation warning during btrfs
> > > > mixed buffered and direct I/O using iomap_dio_rw():
> > > >
> > > > Page cache invalidation failure on direct I/O.  Possible data
> > > > corruption due to collision with buffered I/O!
> > > >
> > > > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > >
> > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > > index e4addfc58107..215315be6233 100644
> > > > --- a/fs/iomap/direct-io.c
> > > > +++ b/fs/iomap/direct-io.c
> > > > @@ -483,9 +483,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > > >        */
> > > >       ret = invalidate_inode_pages2_range(mapping,
> > > >                       pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> > > > -     if (ret)
> > > > -             dio_warn_stale_pagecache(iocb->ki_filp);
> > > > -     ret = 0;
> > > > +     /*
> > > > +      * If a page can not be invalidated, return 0 to fall back
> > > > +      * to buffered write.
> > > > +      */
> > > > +     if (ret) {
> > > > +             if (ret == -EBUSY)
> > > > +                     ret = 0;
> > > > +             goto out_free_dio;
> > >
> > > XFS doesn't fall back to buffered io when directio fails, which means
> > > this will cause a regression there.
> > >
> > > Granted mixing write types is bogus...
> > >
> > > --D
> > >
> > > > +     }
> > > >
> > > >       if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> > > >           !inode->i_sb->s_dio_done_wq) {
> > > >
> > > > --
> > > > Goldwyn
> >
> >
> >
> > --
> > Filipe David Manana,
> >
> > “Whether you think you can, or you think you can't — you're right.”

-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”