linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>,
	xfs <linux-xfs@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH v3] vfs: fix page locking deadlocks when deduping files
Date: Wed, 14 Aug 2019 17:41:40 -0700	[thread overview]
Message-ID: <20190815004140.GE3440173@magnolia> (raw)
In-Reply-To: <20190814212833.GO6129@dread.disaster.area>

On Thu, Aug 15, 2019 at 07:28:33AM +1000, Dave Chinner wrote:
> On Wed, Aug 14, 2019 at 08:33:49AM -0700, Darrick J. Wong wrote:
> > On Wed, Aug 14, 2019 at 07:54:48PM +1000, Dave Chinner wrote:
> > > On Tue, Aug 13, 2019 at 08:40:10AM -0700, Matthew Wilcox wrote:
> > > > On Tue, Aug 13, 2019 at 08:14:34AM -0700, Darrick J. Wong wrote:
> > > > > +		/*
> > > > > +		 * Now that we've locked both pages, make sure they still
> > > > > +		 * represent the data we're interested in.  If not, someone
> > > > > +		 * is invalidating pages on us and we lose.
> > > > > +		 */
> > > > > +		if (src_page->mapping != src->i_mapping ||
> > > > > +		    src_page->index != srcoff >> PAGE_SHIFT ||
> > > > > +		    dest_page->mapping != dest->i_mapping ||
> > > > > +		    dest_page->index != destoff >> PAGE_SHIFT) {
> > > > > +			same = false;
> > > > > +			goto unlock;
> > > > > +		}
> > > > 
> > > > It is my understanding that you don't need to check the ->index here.
> > > > If I'm wrong about that, I'd really appreciate being corrected, because
> > > > the page cache locking is subtle.
> > > 
> > > Ah, when talking to Darrick about this, I didn't notice the code
> > > took references on the page, so it probably doesn't need the index
> > > check - the page can't be recycled out from under us here an
> > > inserted into a new mapping until we drop the reference.
> > > 
> > > What I was mainly concerned about here is that we only have a shared
> > > inode lock on the src inode, so this code can be running
> > > concurrently with both invalidation and insertion into the mapping.
> > > e.g. direct write io does invalidation, buffered read does
> > > insertion. Hence we have to be really careful about the data in the
> > > source page being valid and stable while we run the comparison.
> > > 
> > > And on further thought, I don't think shared locking is actually
> > > safe here. A shared lock doesn't stop new direct IO from being
> > > submitted, so inode_dio_wait() just drains IO at that point in time
> > > and but doesn't provide any guarantee that there isn't concurrent
> > > DIO running.
> > > 
> > > Hence we could do the comparison here, see the data is the same,
> > > drop the page lock, a DIO write then invalidates the page and writes
> > > new data while we are comparing the rest of page(s) in the range. By
> > > the time we've checked the whole range, the data at the start is no
> > > longer the same, and the comparison is stale.
> > > 
> > > And then we do the dedupe operation oblivious to the fact the data
> > > on disk doesn't actually match anymore, and we corrupt the data in
> > > the destination file as it gets linked to mismatched data in the
> > > source file....
> > 
> > <urrrrrrk> Yeah, that looks like a bug to me.  I didn't realize that
> > directio writes were IOLOCK_SHARED and therefore reflink has to take
> > IOLOCK_EXCL to block them.
> > 
> > Related question: do we need to do the same for MMAPLOCK?  I think the
> > answer is yes because xfs_filemap_fault can call page_mkwrite with
> > MMAPLOCK_SHARED.
> 
> Hmmmm. Yes, you are right, but I don't just holding MMAPLOCK_EXCL is
> enough. Holding the MMAPLOCK will hold off page faults while we have
> the lock, but it won't prevent pages that already have writeable
> ptes from being modified as they don't require another page fault
> until they've been cleaned.
> 
> So it seems to me that if we need to ensure that the file range is
> not being concurrently modified, we have to:
> 
> 	a) inode lock exclusive
> 	b) mmap lock exclusive
> 	c) break layouts(*)
> 	d) wait for dio
> 	e) clean all the dirty pages
> 
> On both the source and destination files. And then, because the
> locks we hold form a barrier against newly dirtied pages, will all
> attempts to modify the data be blocked. And so now the data
> comparison can be done safely.

I think xfs already proceeds in this order (a-e), it's just that we
aren't correctly taking IOLOCK_EXCL/MMAPLOCK_EXCL on the source file to
prevent some other thread from starting a directio write or dirtying an
mmap'd page.  But let's try my crappy little patch that fixes the shared
locks and see what other smoke comes out of the machine...

--D

> (*) The break layouts bit is necessary to handle co-ordination with
> RDMA, NVMEoF, P2P DMA, pNFS, etc that manipulate data directly via
> the block device rather than through file pages or DIO...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

  reply	other threads:[~2019-08-15  0:41 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-13 15:14 [PATCH v3] vfs: fix page locking deadlocks when deduping files Darrick J. Wong
2019-08-13 15:40 ` Matthew Wilcox
2019-08-14  7:03   ` Gao Xiang
2019-08-14  7:17     ` Gao Xiang
2019-08-14  9:54   ` Dave Chinner
2019-08-14 15:33     ` Darrick J. Wong
2019-08-14 21:28       ` Dave Chinner
2019-08-15  0:41         ` Darrick J. Wong [this message]
2019-08-13 15:53 ` Filipe Manana

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190815004140.GE3440173@magnolia \
    --to=darrick.wong@oracle.com \
    --cc=david@fromorbit.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).