linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>,
	xfs <linux-xfs@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH v3] vfs: fix page locking deadlocks when deduping files
Date: Wed, 14 Aug 2019 08:33:49 -0700	[thread overview]
Message-ID: <20190814153349.GS7138@magnolia> (raw)
In-Reply-To: <20190814095448.GK6129@dread.disaster.area>

On Wed, Aug 14, 2019 at 07:54:48PM +1000, Dave Chinner wrote:
> On Tue, Aug 13, 2019 at 08:40:10AM -0700, Matthew Wilcox wrote:
> > On Tue, Aug 13, 2019 at 08:14:34AM -0700, Darrick J. Wong wrote:
> > > +		/*
> > > +		 * Now that we've locked both pages, make sure they still
> > > +		 * represent the data we're interested in.  If not, someone
> > > +		 * is invalidating pages on us and we lose.
> > > +		 */
> > > +		if (src_page->mapping != src->i_mapping ||
> > > +		    src_page->index != srcoff >> PAGE_SHIFT ||
> > > +		    dest_page->mapping != dest->i_mapping ||
> > > +		    dest_page->index != destoff >> PAGE_SHIFT) {
> > > +			same = false;
> > > +			goto unlock;
> > > +		}
> > 
> > It is my understanding that you don't need to check the ->index here.
> > If I'm wrong about that, I'd really appreciate being corrected, because
> > the page cache locking is subtle.
> 
> Ah, when talking to Darrick about this, I didn't notice the code
> took references on the page, so it probably doesn't need the index
> check - the page can't be recycled out from under us here an
> inserted into a new mapping until we drop the reference.
> 
> What I was mainly concerned about here is that we only have a shared
> inode lock on the src inode, so this code can be running
> concurrently with both invalidation and insertion into the mapping.
> e.g. direct write io does invalidation, buffered read does
> insertion. Hence we have to be really careful about the data in the
> source page being valid and stable while we run the comparison.
> 
> And on further thought, I don't think shared locking is actually
> safe here. A shared lock doesn't stop new direct IO from being
> submitted, so inode_dio_wait() just drains IO at that point in time
> and but doesn't provide any guarantee that there isn't concurrent
> DIO running.
> 
> Hence we could do the comparison here, see the data is the same,
> drop the page lock, a DIO write then invalidates the page and writes
> new data while we are comparing the rest of page(s) in the range. By
> the time we've checked the whole range, the data at the start is no
> longer the same, and the comparison is stale.
> 
> And then we do the dedupe operation oblivious to the fact the data
> on disk doesn't actually match anymore, and we corrupt the data in
> the destination file as it gets linked to mismatched data in the
> source file....

<urrrrrrk> Yeah, that looks like a bug to me.  I didn't realize that
directio writes were IOLOCK_SHARED and therefore reflink has to take
IOLOCK_EXCL to block them.

Related question: do we need to do the same for MMAPLOCK?  I think the
answer is yes because xfs_filemap_fault can call page_mkwrite with
MMAPLOCK_SHARED.

--D

> Darrick?
> 
> > You call read_mapping_page() which returns the page with an elevated
> > refcount.  That means the page can't go back to the page allocator and
> > be allocated again.  It can, because it's unlocked, still be truncated,
> > so the check for ->mapping after locking it is needed.  But the check
> > for ->index being correct was done by find_get_entry().
> > 
> > See pagecache_get_page() -- if we specify FGP_LOCK, then it will lock
> > the page, check the ->mapping but not check ->index.  OK, it does check
> > ->index, but in a VM_BUG_ON(), so it's not something that ought to be
> > able to be wrong.
> 
> Yeah, we used to have to play tricks in the old XFS writeback
> clustering code to do our own non-blocking page cache lookups adn
> this was one of the things we needed to be careful about until
> the pagevec_lookup* interfaces came along and solved all the
> problems for us. Funny how the brain remembers old gotchas with
> also reminding you that the problems went away almost as long
> ago.....
> 
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

  reply	other threads:[~2019-08-14 15:34 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-13 15:14 [PATCH v3] vfs: fix page locking deadlocks when deduping files Darrick J. Wong
2019-08-13 15:40 ` Matthew Wilcox
2019-08-14  7:03   ` Gao Xiang
2019-08-14  7:17     ` Gao Xiang
2019-08-14  9:54   ` Dave Chinner
2019-08-14 15:33     ` Darrick J. Wong [this message]
2019-08-14 21:28       ` Dave Chinner
2019-08-15  0:41         ` Darrick J. Wong
2019-08-13 15:53 ` Filipe Manana

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190814153349.GS7138@magnolia \
    --to=darrick.wong@oracle.com \
    --cc=david@fromorbit.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).