Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v3] vfs: fix page locking deadlocks when deduping files
@ 2019-08-13 15:14 Darrick J. Wong
  2019-08-13 15:40 ` Matthew Wilcox
  2019-08-13 15:53 ` Filipe Manana
  0 siblings, 2 replies; 9+ messages in thread
From: Darrick J. Wong @ 2019-08-13 15:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-fsdevel

From: Darrick J. Wong <darrick.wong@oracle.com>

When dedupe wants to use the page cache to compare parts of two files
for dedupe, we must be very careful to handle locking correctly.  The
current code doesn't do this.  It must lock and unlock the page only
once if the two pages are the same, since the overlapping range check
doesn't catch this when blocksize < pagesize.  If the pages are distinct
but from the same file, we must observe page locking order and lock them
in order of increasing offset to avoid clashing with writeback locking.

Fixes: 876bec6f9bbfcb3 ("vfs: refactor clone/dedupe_file_range common functions")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
---
v3: revalidate page after locking it
v2: provide an unlock helper
---
 fs/read_write.c |   50 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 42 insertions(+), 8 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 1f5088dec566..da341eb3033c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1811,10 +1811,7 @@ static int generic_remap_check_len(struct inode *inode_in,
 	return (remap_flags & REMAP_FILE_DEDUP) ? -EBADE : -EINVAL;
 }
 
-/*
- * Read a page's worth of file data into the page cache.  Return the page
- * locked.
- */
+/* Read a page's worth of file data into the page cache. */
 static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
 {
 	struct page *page;
@@ -1826,10 +1823,32 @@ static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
 		put_page(page);
 		return ERR_PTR(-EIO);
 	}
-	lock_page(page);
 	return page;
 }
 
+/*
+ * Lock two pages, ensuring that we lock in offset order if the pages are from
+ * the same file.
+ */
+static void vfs_lock_two_pages(struct page *page1, struct page *page2)
+{
+	/* Always lock in order of increasing index. */
+	if (page1->index > page2->index)
+		swap(page1, page2);
+
+	lock_page(page1);
+	if (page1 != page2)
+		lock_page(page2);
+}
+
+/* Unlock two pages, being careful not to unlock the same page twice. */
+static void vfs_unlock_two_pages(struct page *page1, struct page *page2)
+{
+	unlock_page(page1);
+	if (page1 != page2)
+		unlock_page(page2);
+}
+
 /*
  * Compare extents of two files to see if they are the same.
  * Caller must have locked both inodes to prevent write races.
@@ -1867,10 +1886,25 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 		dest_page = vfs_dedupe_get_page(dest, destoff);
 		if (IS_ERR(dest_page)) {
 			error = PTR_ERR(dest_page);
-			unlock_page(src_page);
 			put_page(src_page);
 			goto out_error;
 		}
+
+		vfs_lock_two_pages(src_page, dest_page);
+
+		/*
+		 * Now that we've locked both pages, make sure they still
+		 * represent the data we're interested in.  If not, someone
+		 * is invalidating pages on us and we lose.
+		 */
+		if (src_page->mapping != src->i_mapping ||
+		    src_page->index != srcoff >> PAGE_SHIFT ||
+		    dest_page->mapping != dest->i_mapping ||
+		    dest_page->index != destoff >> PAGE_SHIFT) {
+			same = false;
+			goto unlock;
+		}
+
 		src_addr = kmap_atomic(src_page);
 		dest_addr = kmap_atomic(dest_page);
 
@@ -1882,8 +1916,8 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 
 		kunmap_atomic(dest_addr);
 		kunmap_atomic(src_addr);
-		unlock_page(dest_page);
-		unlock_page(src_page);
+unlock:
+		vfs_unlock_two_pages(src_page, dest_page);
 		put_page(dest_page);
 		put_page(src_page);
 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] vfs: fix page locking deadlocks when deduping files
  2019-08-13 15:14 [PATCH v3] vfs: fix page locking deadlocks when deduping files Darrick J. Wong
@ 2019-08-13 15:40 ` Matthew Wilcox
  2019-08-14  7:03   ` Gao Xiang
  2019-08-14  9:54   ` Dave Chinner
  2019-08-13 15:53 ` Filipe Manana
  1 sibling, 2 replies; 9+ messages in thread
From: Matthew Wilcox @ 2019-08-13 15:40 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, xfs, linux-fsdevel

On Tue, Aug 13, 2019 at 08:14:34AM -0700, Darrick J. Wong wrote:
> +		/*
> +		 * Now that we've locked both pages, make sure they still
> +		 * represent the data we're interested in.  If not, someone
> +		 * is invalidating pages on us and we lose.
> +		 */
> +		if (src_page->mapping != src->i_mapping ||
> +		    src_page->index != srcoff >> PAGE_SHIFT ||
> +		    dest_page->mapping != dest->i_mapping ||
> +		    dest_page->index != destoff >> PAGE_SHIFT) {
> +			same = false;
> +			goto unlock;
> +		}

It is my understanding that you don't need to check the ->index here.
If I'm wrong about that, I'd really appreciate being corrected, because
the page cache locking is subtle.

You call read_mapping_page() which returns the page with an elevated
refcount.  That means the page can't go back to the page allocator and
be allocated again.  It can, because it's unlocked, still be truncated,
so the check for ->mapping after locking it is needed.  But the check
for ->index being correct was done by find_get_entry().

See pagecache_get_page() -- if we specify FGP_LOCK, then it will lock
the page, check the ->mapping but not check ->index.  OK, it does check
->index, but in a VM_BUG_ON(), so it's not something that ought to be
able to be wrong.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] vfs: fix page locking deadlocks when deduping files
  2019-08-13 15:14 [PATCH v3] vfs: fix page locking deadlocks when deduping files Darrick J. Wong
  2019-08-13 15:40 ` Matthew Wilcox
@ 2019-08-13 15:53 ` Filipe Manana
  1 sibling, 0 replies; 9+ messages in thread
From: Filipe Manana @ 2019-08-13 15:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, xfs, linux-fsdevel

On Tue, Aug 13, 2019 at 4:15 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> When dedupe wants to use the page cache to compare parts of two files
> for dedupe, we must be very careful to handle locking correctly.  The
> current code doesn't do this.  It must lock and unlock the page only
> once if the two pages are the same, since the overlapping range check
> doesn't catch this when blocksize < pagesize.  If the pages are distinct
> but from the same file, we must observe page locking order and lock them
> in order of increasing offset to avoid clashing with writeback locking.
>
> Fixes: 876bec6f9bbfcb3 ("vfs: refactor clone/dedupe_file_range common functions")
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> Reviewed-by: Bill O'Donnell <billodo@redhat.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

We actually had the same bug in btrfs, before we had cloning/dedupe in
vfs/xfs/etc, and fixed it back in 2017 [1].
I totally missed this behaviour in the vfs helpers when I updated
btrfs to use them some months ago.
Thanks.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b1517622f2524f531113b12c27b9a0ea69c38983


> ---
> v3: revalidate page after locking it
> v2: provide an unlock helper
> ---
>  fs/read_write.c |   50 ++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 42 insertions(+), 8 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 1f5088dec566..da341eb3033c 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1811,10 +1811,7 @@ static int generic_remap_check_len(struct inode *inode_in,
>         return (remap_flags & REMAP_FILE_DEDUP) ? -EBADE : -EINVAL;
>  }
>
> -/*
> - * Read a page's worth of file data into the page cache.  Return the page
> - * locked.
> - */
> +/* Read a page's worth of file data into the page cache. */
>  static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
>  {
>         struct page *page;
> @@ -1826,10 +1823,32 @@ static struct page *vfs_dedupe_get_page(struct inode *inode, loff_t offset)
>                 put_page(page);
>                 return ERR_PTR(-EIO);
>         }
> -       lock_page(page);
>         return page;
>  }
>
> +/*
> + * Lock two pages, ensuring that we lock in offset order if the pages are from
> + * the same file.
> + */
> +static void vfs_lock_two_pages(struct page *page1, struct page *page2)
> +{
> +       /* Always lock in order of increasing index. */
> +       if (page1->index > page2->index)
> +               swap(page1, page2);
> +
> +       lock_page(page1);
> +       if (page1 != page2)
> +               lock_page(page2);
> +}
> +
> +/* Unlock two pages, being careful not to unlock the same page twice. */
> +static void vfs_unlock_two_pages(struct page *page1, struct page *page2)
> +{
> +       unlock_page(page1);
> +       if (page1 != page2)
> +               unlock_page(page2);
> +}
> +
>  /*
>   * Compare extents of two files to see if they are the same.
>   * Caller must have locked both inodes to prevent write races.
> @@ -1867,10 +1886,25 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
>                 dest_page = vfs_dedupe_get_page(dest, destoff);
>                 if (IS_ERR(dest_page)) {
>                         error = PTR_ERR(dest_page);
> -                       unlock_page(src_page);
>                         put_page(src_page);
>                         goto out_error;
>                 }
> +
> +               vfs_lock_two_pages(src_page, dest_page);
> +
> +               /*
> +                * Now that we've locked both pages, make sure they still
> +                * represent the data we're interested in.  If not, someone
> +                * is invalidating pages on us and we lose.
> +                */
> +               if (src_page->mapping != src->i_mapping ||
> +                   src_page->index != srcoff >> PAGE_SHIFT ||
> +                   dest_page->mapping != dest->i_mapping ||
> +                   dest_page->index != destoff >> PAGE_SHIFT) {
> +                       same = false;
> +                       goto unlock;
> +               }
> +
>                 src_addr = kmap_atomic(src_page);
>                 dest_addr = kmap_atomic(dest_page);
>
> @@ -1882,8 +1916,8 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
>
>                 kunmap_atomic(dest_addr);
>                 kunmap_atomic(src_addr);
> -               unlock_page(dest_page);
> -               unlock_page(src_page);
> +unlock:
> +               vfs_unlock_two_pages(src_page, dest_page);
>                 put_page(dest_page);
>                 put_page(src_page);
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] vfs: fix page locking deadlocks when deduping files
  2019-08-13 15:40 ` Matthew Wilcox
@ 2019-08-14  7:03   ` Gao Xiang
  2019-08-14  7:17     ` Gao Xiang
  2019-08-14  9:54   ` Dave Chinner
  1 sibling, 1 reply; 9+ messages in thread
From: Gao Xiang @ 2019-08-14  7:03 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Darrick J. Wong, Dave Chinner, xfs, linux-fsdevel

On Tue, Aug 13, 2019 at 08:40:10AM -0700, Matthew Wilcox wrote:
> On Tue, Aug 13, 2019 at 08:14:34AM -0700, Darrick J. Wong wrote:
> > +		/*
> > +		 * Now that we've locked both pages, make sure they still
> > +		 * represent the data we're interested in.  If not, someone
> > +		 * is invalidating pages on us and we lose.
> > +		 */
> > +		if (src_page->mapping != src->i_mapping ||
> > +		    src_page->index != srcoff >> PAGE_SHIFT ||
> > +		    dest_page->mapping != dest->i_mapping ||
> > +		    dest_page->index != destoff >> PAGE_SHIFT) {
> > +			same = false;
> > +			goto unlock;
> > +		}
> 
> It is my understanding that you don't need to check the ->index here.
> If I'm wrong about that, I'd really appreciate being corrected, because
> the page cache locking is subtle.
> 
> You call read_mapping_page() which returns the page with an elevated
> refcount.  That means the page can't go back to the page allocator and
> be allocated again.  It can, because it's unlocked, still be truncated,
> so the check for ->mapping after locking it is needed.  But the check
> for ->index being correct was done by find_get_entry().
> 
> See pagecache_get_page() -- if we specify FGP_LOCK, then it will lock
> the page, check the ->mapping but not check ->index.  OK, it does check
> ->index, but in a VM_BUG_ON(), so it's not something that ought to be
> able to be wrong.

That is my understanding as well. In details...

The page data get ready after read_mapping_page() is successfully
returned. However, if someone needs to get a stable untruncated page,
lock_page() and recheck page->mapping are needed as well.

I have no idea how page->index can be changed safely without reallocating
the page, even some paths could keep using some truncated page temporarily
with some refcounts held but I think those paths cannot add these pages
directly to some page cache again without freeing since it seems really
unsafe.....

Thanks,
Gao Xiang

> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] vfs: fix page locking deadlocks when deduping files
  2019-08-14  7:03   ` Gao Xiang
@ 2019-08-14  7:17     ` Gao Xiang
  0 siblings, 0 replies; 9+ messages in thread
From: Gao Xiang @ 2019-08-14  7:17 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Darrick J. Wong, Dave Chinner, xfs, linux-fsdevel

On Wed, Aug 14, 2019 at 03:03:21PM +0800, Gao Xiang wrote:
> On Tue, Aug 13, 2019 at 08:40:10AM -0700, Matthew Wilcox wrote:
> > On Tue, Aug 13, 2019 at 08:14:34AM -0700, Darrick J. Wong wrote:
> > > +		/*
> > > +		 * Now that we've locked both pages, make sure they still
> > > +		 * represent the data we're interested in.  If not, someone
> > > +		 * is invalidating pages on us and we lose.
> > > +		 */
> > > +		if (src_page->mapping != src->i_mapping ||
> > > +		    src_page->index != srcoff >> PAGE_SHIFT ||
> > > +		    dest_page->mapping != dest->i_mapping ||
> > > +		    dest_page->index != destoff >> PAGE_SHIFT) {
> > > +			same = false;
> > > +			goto unlock;
> > > +		}
> > 
> > It is my understanding that you don't need to check the ->index here.
> > If I'm wrong about that, I'd really appreciate being corrected, because
> > the page cache locking is subtle.
> > 
> > You call read_mapping_page() which returns the page with an elevated
> > refcount.  That means the page can't go back to the page allocator and
> > be allocated again.  It can, because it's unlocked, still be truncated,
> > so the check for ->mapping after locking it is needed.  But the check
> > for ->index being correct was done by find_get_entry().
> > 
> > See pagecache_get_page() -- if we specify FGP_LOCK, then it will lock
> > the page, check the ->mapping but not check ->index.  OK, it does check
> > ->index, but in a VM_BUG_ON(), so it's not something that ought to be
> > able to be wrong.
> 
> That is my understanding as well. In details...
> 
> The page data get ready after read_mapping_page() is successfully
> returned. However, if someone needs to get a stable untruncated page,
> lock_page() and recheck page->mapping are needed as well.
> 
> I have no idea how page->index can be changed safely without reallocating
> the page, even some paths could keep using some truncated page temporarily
> with some refcounts held but I think those paths cannot add these pages

Such a case is like that even if the page can be truncated
at the same time without locking, some paths only needs to
get its page data unstrictly (and note that these pages
should be Uptodated before). Therefore those paths can
only take a refcount without PG_lock... But such refcounts
should be used temporarily, those pages cannot be added to
page cache again without reallocating...

Thanks,
Gao Xiang

> directly to some page cache again without freeing since it seems really
> unsafe.....
> 
> Thanks,
> Gao Xiang
> 
> > 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] vfs: fix page locking deadlocks when deduping files
  2019-08-13 15:40 ` Matthew Wilcox
  2019-08-14  7:03   ` Gao Xiang
@ 2019-08-14  9:54   ` Dave Chinner
  2019-08-14 15:33     ` Darrick J. Wong
  1 sibling, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2019-08-14  9:54 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Darrick J. Wong, xfs, linux-fsdevel

On Tue, Aug 13, 2019 at 08:40:10AM -0700, Matthew Wilcox wrote:
> On Tue, Aug 13, 2019 at 08:14:34AM -0700, Darrick J. Wong wrote:
> > +		/*
> > +		 * Now that we've locked both pages, make sure they still
> > +		 * represent the data we're interested in.  If not, someone
> > +		 * is invalidating pages on us and we lose.
> > +		 */
> > +		if (src_page->mapping != src->i_mapping ||
> > +		    src_page->index != srcoff >> PAGE_SHIFT ||
> > +		    dest_page->mapping != dest->i_mapping ||
> > +		    dest_page->index != destoff >> PAGE_SHIFT) {
> > +			same = false;
> > +			goto unlock;
> > +		}
> 
> It is my understanding that you don't need to check the ->index here.
> If I'm wrong about that, I'd really appreciate being corrected, because
> the page cache locking is subtle.

Ah, when talking to Darrick about this, I didn't notice the code
took references on the page, so it probably doesn't need the index
check - the page can't be recycled out from under us here an
inserted into a new mapping until we drop the reference.

What I was mainly concerned about here is that we only have a shared
inode lock on the src inode, so this code can be running
concurrently with both invalidation and insertion into the mapping.
e.g. direct write io does invalidation, buffered read does
insertion. Hence we have to be really careful about the data in the
source page being valid and stable while we run the comparison.

And on further thought, I don't think shared locking is actually
safe here. A shared lock doesn't stop new direct IO from being
submitted, so inode_dio_wait() just drains IO at that point in time
and but doesn't provide any guarantee that there isn't concurrent
DIO running.

Hence we could do the comparison here, see the data is the same,
drop the page lock, a DIO write then invalidates the page and writes
new data while we are comparing the rest of page(s) in the range. By
the time we've checked the whole range, the data at the start is no
longer the same, and the comparison is stale.

And then we do the dedupe operation oblivious to the fact the data
on disk doesn't actually match anymore, and we corrupt the data in
the destination file as it gets linked to mismatched data in the
source file....

Darrick?

> You call read_mapping_page() which returns the page with an elevated
> refcount.  That means the page can't go back to the page allocator and
> be allocated again.  It can, because it's unlocked, still be truncated,
> so the check for ->mapping after locking it is needed.  But the check
> for ->index being correct was done by find_get_entry().
> 
> See pagecache_get_page() -- if we specify FGP_LOCK, then it will lock
> the page, check the ->mapping but not check ->index.  OK, it does check
> ->index, but in a VM_BUG_ON(), so it's not something that ought to be
> able to be wrong.

Yeah, we used to have to play tricks in the old XFS writeback
clustering code to do our own non-blocking page cache lookups adn
this was one of the things we needed to be careful about until
the pagevec_lookup* interfaces came along and solved all the
problems for us. Funny how the brain remembers old gotchas with
also reminding you that the problems went away almost as long
ago.....


Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] vfs: fix page locking deadlocks when deduping files
  2019-08-14  9:54   ` Dave Chinner
@ 2019-08-14 15:33     ` Darrick J. Wong
  2019-08-14 21:28       ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2019-08-14 15:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Matthew Wilcox, xfs, linux-fsdevel

On Wed, Aug 14, 2019 at 07:54:48PM +1000, Dave Chinner wrote:
> On Tue, Aug 13, 2019 at 08:40:10AM -0700, Matthew Wilcox wrote:
> > On Tue, Aug 13, 2019 at 08:14:34AM -0700, Darrick J. Wong wrote:
> > > +		/*
> > > +		 * Now that we've locked both pages, make sure they still
> > > +		 * represent the data we're interested in.  If not, someone
> > > +		 * is invalidating pages on us and we lose.
> > > +		 */
> > > +		if (src_page->mapping != src->i_mapping ||
> > > +		    src_page->index != srcoff >> PAGE_SHIFT ||
> > > +		    dest_page->mapping != dest->i_mapping ||
> > > +		    dest_page->index != destoff >> PAGE_SHIFT) {
> > > +			same = false;
> > > +			goto unlock;
> > > +		}
> > 
> > It is my understanding that you don't need to check the ->index here.
> > If I'm wrong about that, I'd really appreciate being corrected, because
> > the page cache locking is subtle.
> 
> Ah, when talking to Darrick about this, I didn't notice the code
> took references on the page, so it probably doesn't need the index
> check - the page can't be recycled out from under us here an
> inserted into a new mapping until we drop the reference.
> 
> What I was mainly concerned about here is that we only have a shared
> inode lock on the src inode, so this code can be running
> concurrently with both invalidation and insertion into the mapping.
> e.g. direct write io does invalidation, buffered read does
> insertion. Hence we have to be really careful about the data in the
> source page being valid and stable while we run the comparison.
> 
> And on further thought, I don't think shared locking is actually
> safe here. A shared lock doesn't stop new direct IO from being
> submitted, so inode_dio_wait() just drains IO at that point in time
> and but doesn't provide any guarantee that there isn't concurrent
> DIO running.
> 
> Hence we could do the comparison here, see the data is the same,
> drop the page lock, a DIO write then invalidates the page and writes
> new data while we are comparing the rest of page(s) in the range. By
> the time we've checked the whole range, the data at the start is no
> longer the same, and the comparison is stale.
> 
> And then we do the dedupe operation oblivious to the fact the data
> on disk doesn't actually match anymore, and we corrupt the data in
> the destination file as it gets linked to mismatched data in the
> source file....

<urrrrrrk> Yeah, that looks like a bug to me.  I didn't realize that
directio writes were IOLOCK_SHARED and therefore reflink has to take
IOLOCK_EXCL to block them.

Related question: do we need to do the same for MMAPLOCK?  I think the
answer is yes because xfs_filemap_fault can call page_mkwrite with
MMAPLOCK_SHARED.

--D

> Darrick?
> 
> > You call read_mapping_page() which returns the page with an elevated
> > refcount.  That means the page can't go back to the page allocator and
> > be allocated again.  It can, because it's unlocked, still be truncated,
> > so the check for ->mapping after locking it is needed.  But the check
> > for ->index being correct was done by find_get_entry().
> > 
> > See pagecache_get_page() -- if we specify FGP_LOCK, then it will lock
> > the page, check the ->mapping but not check ->index.  OK, it does check
> > ->index, but in a VM_BUG_ON(), so it's not something that ought to be
> > able to be wrong.
> 
> Yeah, we used to have to play tricks in the old XFS writeback
> clustering code to do our own non-blocking page cache lookups adn
> this was one of the things we needed to be careful about until
> the pagevec_lookup* interfaces came along and solved all the
> problems for us. Funny how the brain remembers old gotchas with
> also reminding you that the problems went away almost as long
> ago.....
> 
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] vfs: fix page locking deadlocks when deduping files
  2019-08-14 15:33     ` Darrick J. Wong
@ 2019-08-14 21:28       ` Dave Chinner
  2019-08-15  0:41         ` Darrick J. Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2019-08-14 21:28 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Matthew Wilcox, xfs, linux-fsdevel

On Wed, Aug 14, 2019 at 08:33:49AM -0700, Darrick J. Wong wrote:
> On Wed, Aug 14, 2019 at 07:54:48PM +1000, Dave Chinner wrote:
> > On Tue, Aug 13, 2019 at 08:40:10AM -0700, Matthew Wilcox wrote:
> > > On Tue, Aug 13, 2019 at 08:14:34AM -0700, Darrick J. Wong wrote:
> > > > +		/*
> > > > +		 * Now that we've locked both pages, make sure they still
> > > > +		 * represent the data we're interested in.  If not, someone
> > > > +		 * is invalidating pages on us and we lose.
> > > > +		 */
> > > > +		if (src_page->mapping != src->i_mapping ||
> > > > +		    src_page->index != srcoff >> PAGE_SHIFT ||
> > > > +		    dest_page->mapping != dest->i_mapping ||
> > > > +		    dest_page->index != destoff >> PAGE_SHIFT) {
> > > > +			same = false;
> > > > +			goto unlock;
> > > > +		}
> > > 
> > > It is my understanding that you don't need to check the ->index here.
> > > If I'm wrong about that, I'd really appreciate being corrected, because
> > > the page cache locking is subtle.
> > 
> > Ah, when talking to Darrick about this, I didn't notice the code
> > took references on the page, so it probably doesn't need the index
> > check - the page can't be recycled out from under us here an
> > inserted into a new mapping until we drop the reference.
> > 
> > What I was mainly concerned about here is that we only have a shared
> > inode lock on the src inode, so this code can be running
> > concurrently with both invalidation and insertion into the mapping.
> > e.g. direct write io does invalidation, buffered read does
> > insertion. Hence we have to be really careful about the data in the
> > source page being valid and stable while we run the comparison.
> > 
> > And on further thought, I don't think shared locking is actually
> > safe here. A shared lock doesn't stop new direct IO from being
> > submitted, so inode_dio_wait() just drains IO at that point in time
> > and but doesn't provide any guarantee that there isn't concurrent
> > DIO running.
> > 
> > Hence we could do the comparison here, see the data is the same,
> > drop the page lock, a DIO write then invalidates the page and writes
> > new data while we are comparing the rest of page(s) in the range. By
> > the time we've checked the whole range, the data at the start is no
> > longer the same, and the comparison is stale.
> > 
> > And then we do the dedupe operation oblivious to the fact the data
> > on disk doesn't actually match anymore, and we corrupt the data in
> > the destination file as it gets linked to mismatched data in the
> > source file....
> 
> <urrrrrrk> Yeah, that looks like a bug to me.  I didn't realize that
> directio writes were IOLOCK_SHARED and therefore reflink has to take
> IOLOCK_EXCL to block them.
> 
> Related question: do we need to do the same for MMAPLOCK?  I think the
> answer is yes because xfs_filemap_fault can call page_mkwrite with
> MMAPLOCK_SHARED.

Hmmmm. Yes, you are right, but I don't just holding MMAPLOCK_EXCL is
enough. Holding the MMAPLOCK will hold off page faults while we have
the lock, but it won't prevent pages that already have writeable
ptes from being modified as they don't require another page fault
until they've been cleaned.

So it seems to me that if we need to ensure that the file range is
not being concurrently modified, we have to:

	a) inode lock exclusive
	b) mmap lock exclusive
	c) break layouts(*)
	d) wait for dio
	e) clean all the dirty pages

On both the source and destination files. And then, because the
locks we hold form a barrier against newly dirtied pages, will all
attempts to modify the data be blocked. And so now the data
comparison can be done safely.

(*) The break layouts bit is necessary to handle co-ordination with
RDMA, NVMEoF, P2P DMA, pNFS, etc that manipulate data directly via
the block device rather than through file pages or DIO...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] vfs: fix page locking deadlocks when deduping files
  2019-08-14 21:28       ` Dave Chinner
@ 2019-08-15  0:41         ` Darrick J. Wong
  0 siblings, 0 replies; 9+ messages in thread
From: Darrick J. Wong @ 2019-08-15  0:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Matthew Wilcox, xfs, linux-fsdevel

On Thu, Aug 15, 2019 at 07:28:33AM +1000, Dave Chinner wrote:
> On Wed, Aug 14, 2019 at 08:33:49AM -0700, Darrick J. Wong wrote:
> > On Wed, Aug 14, 2019 at 07:54:48PM +1000, Dave Chinner wrote:
> > > On Tue, Aug 13, 2019 at 08:40:10AM -0700, Matthew Wilcox wrote:
> > > > On Tue, Aug 13, 2019 at 08:14:34AM -0700, Darrick J. Wong wrote:
> > > > > +		/*
> > > > > +		 * Now that we've locked both pages, make sure they still
> > > > > +		 * represent the data we're interested in.  If not, someone
> > > > > +		 * is invalidating pages on us and we lose.
> > > > > +		 */
> > > > > +		if (src_page->mapping != src->i_mapping ||
> > > > > +		    src_page->index != srcoff >> PAGE_SHIFT ||
> > > > > +		    dest_page->mapping != dest->i_mapping ||
> > > > > +		    dest_page->index != destoff >> PAGE_SHIFT) {
> > > > > +			same = false;
> > > > > +			goto unlock;
> > > > > +		}
> > > > 
> > > > It is my understanding that you don't need to check the ->index here.
> > > > If I'm wrong about that, I'd really appreciate being corrected, because
> > > > the page cache locking is subtle.
> > > 
> > > Ah, when talking to Darrick about this, I didn't notice the code
> > > took references on the page, so it probably doesn't need the index
> > > check - the page can't be recycled out from under us here an
> > > inserted into a new mapping until we drop the reference.
> > > 
> > > What I was mainly concerned about here is that we only have a shared
> > > inode lock on the src inode, so this code can be running
> > > concurrently with both invalidation and insertion into the mapping.
> > > e.g. direct write io does invalidation, buffered read does
> > > insertion. Hence we have to be really careful about the data in the
> > > source page being valid and stable while we run the comparison.
> > > 
> > > And on further thought, I don't think shared locking is actually
> > > safe here. A shared lock doesn't stop new direct IO from being
> > > submitted, so inode_dio_wait() just drains IO at that point in time
> > > and but doesn't provide any guarantee that there isn't concurrent
> > > DIO running.
> > > 
> > > Hence we could do the comparison here, see the data is the same,
> > > drop the page lock, a DIO write then invalidates the page and writes
> > > new data while we are comparing the rest of page(s) in the range. By
> > > the time we've checked the whole range, the data at the start is no
> > > longer the same, and the comparison is stale.
> > > 
> > > And then we do the dedupe operation oblivious to the fact the data
> > > on disk doesn't actually match anymore, and we corrupt the data in
> > > the destination file as it gets linked to mismatched data in the
> > > source file....
> > 
> > <urrrrrrk> Yeah, that looks like a bug to me.  I didn't realize that
> > directio writes were IOLOCK_SHARED and therefore reflink has to take
> > IOLOCK_EXCL to block them.
> > 
> > Related question: do we need to do the same for MMAPLOCK?  I think the
> > answer is yes because xfs_filemap_fault can call page_mkwrite with
> > MMAPLOCK_SHARED.
> 
> Hmmmm. Yes, you are right, but I don't just holding MMAPLOCK_EXCL is
> enough. Holding the MMAPLOCK will hold off page faults while we have
> the lock, but it won't prevent pages that already have writeable
> ptes from being modified as they don't require another page fault
> until they've been cleaned.
> 
> So it seems to me that if we need to ensure that the file range is
> not being concurrently modified, we have to:
> 
> 	a) inode lock exclusive
> 	b) mmap lock exclusive
> 	c) break layouts(*)
> 	d) wait for dio
> 	e) clean all the dirty pages
> 
> On both the source and destination files. And then, because the
> locks we hold form a barrier against newly dirtied pages, will all
> attempts to modify the data be blocked. And so now the data
> comparison can be done safely.

I think xfs already proceeds in this order (a-e), it's just that we
aren't correctly taking IOLOCK_EXCL/MMAPLOCK_EXCL on the source file to
prevent some other thread from starting a directio write or dirtying an
mmap'd page.  But let's try my crappy little patch that fixes the shared
locks and see what other smoke comes out of the machine...

--D

> (*) The break layouts bit is necessary to handle co-ordination with
> RDMA, NVMEoF, P2P DMA, pNFS, etc that manipulate data directly via
> the block device rather than through file pages or DIO...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, back to index

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-13 15:14 [PATCH v3] vfs: fix page locking deadlocks when deduping files Darrick J. Wong
2019-08-13 15:40 ` Matthew Wilcox
2019-08-14  7:03   ` Gao Xiang
2019-08-14  7:17     ` Gao Xiang
2019-08-14  9:54   ` Dave Chinner
2019-08-14 15:33     ` Darrick J. Wong
2019-08-14 21:28       ` Dave Chinner
2019-08-15  0:41         ` Darrick J. Wong
2019-08-13 15:53 ` Filipe Manana

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org linux-fsdevel@archiver.kernel.org
	public-inbox-index linux-fsdevel


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/ public-inbox