From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>,
xfs <linux-xfs@vger.kernel.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH v3] vfs: fix page locking deadlocks when deduping files
Date: Wed, 14 Aug 2019 17:41:40 -0700 [thread overview]
Message-ID: <20190815004140.GE3440173@magnolia> (raw)
In-Reply-To: <20190814212833.GO6129@dread.disaster.area>
On Thu, Aug 15, 2019 at 07:28:33AM +1000, Dave Chinner wrote:
> On Wed, Aug 14, 2019 at 08:33:49AM -0700, Darrick J. Wong wrote:
> > On Wed, Aug 14, 2019 at 07:54:48PM +1000, Dave Chinner wrote:
> > > On Tue, Aug 13, 2019 at 08:40:10AM -0700, Matthew Wilcox wrote:
> > > > On Tue, Aug 13, 2019 at 08:14:34AM -0700, Darrick J. Wong wrote:
> > > > > + /*
> > > > > + * Now that we've locked both pages, make sure they still
> > > > > + * represent the data we're interested in. If not, someone
> > > > > + * is invalidating pages on us and we lose.
> > > > > + */
> > > > > + if (src_page->mapping != src->i_mapping ||
> > > > > + src_page->index != srcoff >> PAGE_SHIFT ||
> > > > > + dest_page->mapping != dest->i_mapping ||
> > > > > + dest_page->index != destoff >> PAGE_SHIFT) {
> > > > > + same = false;
> > > > > + goto unlock;
> > > > > + }
> > > >
> > > > It is my understanding that you don't need to check the ->index here.
> > > > If I'm wrong about that, I'd really appreciate being corrected, because
> > > > the page cache locking is subtle.
> > >
> > > Ah, when talking to Darrick about this, I didn't notice the code
> > > took references on the page, so it probably doesn't need the index
> > > check - the page can't be recycled out from under us here an
> > > inserted into a new mapping until we drop the reference.
> > >
> > > What I was mainly concerned about here is that we only have a shared
> > > inode lock on the src inode, so this code can be running
> > > concurrently with both invalidation and insertion into the mapping.
> > > e.g. direct write io does invalidation, buffered read does
> > > insertion. Hence we have to be really careful about the data in the
> > > source page being valid and stable while we run the comparison.
> > >
> > > And on further thought, I don't think shared locking is actually
> > > safe here. A shared lock doesn't stop new direct IO from being
> > > submitted, so inode_dio_wait() just drains IO at that point in time
> > > and but doesn't provide any guarantee that there isn't concurrent
> > > DIO running.
> > >
> > > Hence we could do the comparison here, see the data is the same,
> > > drop the page lock, a DIO write then invalidates the page and writes
> > > new data while we are comparing the rest of page(s) in the range. By
> > > the time we've checked the whole range, the data at the start is no
> > > longer the same, and the comparison is stale.
> > >
> > > And then we do the dedupe operation oblivious to the fact the data
> > > on disk doesn't actually match anymore, and we corrupt the data in
> > > the destination file as it gets linked to mismatched data in the
> > > source file....
> >
> > <urrrrrrk> Yeah, that looks like a bug to me. I didn't realize that
> > directio writes were IOLOCK_SHARED and therefore reflink has to take
> > IOLOCK_EXCL to block them.
> >
> > Related question: do we need to do the same for MMAPLOCK? I think the
> > answer is yes because xfs_filemap_fault can call page_mkwrite with
> > MMAPLOCK_SHARED.
>
> Hmmmm. Yes, you are right, but I don't just holding MMAPLOCK_EXCL is
> enough. Holding the MMAPLOCK will hold off page faults while we have
> the lock, but it won't prevent pages that already have writeable
> ptes from being modified as they don't require another page fault
> until they've been cleaned.
>
> So it seems to me that if we need to ensure that the file range is
> not being concurrently modified, we have to:
>
> a) inode lock exclusive
> b) mmap lock exclusive
> c) break layouts(*)
> d) wait for dio
> e) clean all the dirty pages
>
> On both the source and destination files. And then, because the
> locks we hold form a barrier against newly dirtied pages, will all
> attempts to modify the data be blocked. And so now the data
> comparison can be done safely.
I think xfs already proceeds in this order (a-e), it's just that we
aren't correctly taking IOLOCK_EXCL/MMAPLOCK_EXCL on the source file to
prevent some other thread from starting a directio write or dirtying an
mmap'd page. But let's try my crappy little patch that fixes the shared
locks and see what other smoke comes out of the machine...
--D
> (*) The break layouts bit is necessary to handle co-ordination with
> RDMA, NVMEoF, P2P DMA, pNFS, etc that manipulate data directly via
> the block device rather than through file pages or DIO...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
next prev parent reply other threads:[~2019-08-15 0:41 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-08-13 15:14 [PATCH v3] vfs: fix page locking deadlocks when deduping files Darrick J. Wong
2019-08-13 15:40 ` Matthew Wilcox
2019-08-14 7:03 ` Gao Xiang
2019-08-14 7:17 ` Gao Xiang
2019-08-14 9:54 ` Dave Chinner
2019-08-14 15:33 ` Darrick J. Wong
2019-08-14 21:28 ` Dave Chinner
2019-08-15 0:41 ` Darrick J. Wong [this message]
2019-08-13 15:53 ` Filipe Manana
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190815004140.GE3440173@magnolia \
--to=darrick.wong@oracle.com \
--cc=david@fromorbit.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).