* [PATCH 0/4] Introduce a mmap sem to deal with some mmap issues
From: Josef Bacik @ 2020-12-14 18:19 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Hello,

Both Filipe and I have found different issues that come down to the same
thing: we need to be able to exclude mmap from happening in certain
scenarios.  The specifics are well described in the commit logs, but
broadly there are two issues:

1) dedupe needs to validate that pages match, and since the validation is done
   outside of the extent lock we can race with mmap and dedupe pages that do not
   match.
2) We can deadlock in certain low metadata space scenarios where we need to flush
   an ordered extent, but can't because mmap is holding the page lock.

These issues exist for remap and fallocate, so add an i_mmap_sem that lets us
disallow mmap in these cases.  I'm still waiting on xfstests to finish with
this, but two hours in there are no lockdep splats or deadlocks.  Thanks,

Josef Bacik (4):
  btrfs: add a i_mmap_lock to our inode
  btrfs: cleanup inode_lock/inode_unlock uses
  btrfs: exclude mmaps while doing remap
  btrfs: exclude mmap from happening during all fallocate operations

 fs/btrfs/btrfs_inode.h   |  1 +
 fs/btrfs/ctree.h         |  1 +
 fs/btrfs/delayed-inode.c |  4 ++--
 fs/btrfs/file.c          | 20 ++++++++++----------
 fs/btrfs/inode.c         | 10 ++++++++++
 fs/btrfs/ioctl.c         | 26 +++++++++++++-------------
 fs/btrfs/reflink.c       | 30 ++++++++++++++++++++++++------
 fs/btrfs/relocation.c    |  4 ++--
 8 files changed, 63 insertions(+), 33 deletions(-)

-- 
2.26.2



* [PATCH 1/4] btrfs: add a i_mmap_lock to our inode
From: Josef Bacik @ 2020-12-14 18:19 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

We need to be able to exclude page_mkwrite from happening concurrently
with certain operations.  To facilitate this, add an i_mmap_lock to our
inode, down_read() it in our mkwrite handler, and add a new ILOCK flag to
indicate that we also want to take the i_mmap_lock (for write).  I used
pahole to check the size of btrfs_inode; the sizes are as follows:

no lockdep:
before: 1120 (3 per 4k page)
after: 1160 (3 per 4k page)

lockdep:
before: 2072 (1 per 4k page)
after: 2224 (1 per 4k page)

We're slightly larger but it doesn't change how many objects we can fit
per page.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/btrfs_inode.h |  1 +
 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/inode.c       | 10 ++++++++++
 3 files changed, 12 insertions(+)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index d9bf53d9ff90..66cec5c13396 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -220,6 +220,7 @@ struct btrfs_inode {
 	/* Hook into fs_info->delayed_iputs */
 	struct list_head delayed_iput;
 
+	struct rw_semaphore i_mmap_lock;
 	struct inode vfs_inode;
 };
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3935d297d198..e2bebb3318cc 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3147,6 +3147,7 @@ extern const struct iomap_dio_ops btrfs_dio_ops;
 /* Inode locking type flags, by default the exclusive lock is taken */
 #define BTRFS_ILOCK_SHARED	(1U << 0)
 #define BTRFS_ILOCK_TRY 	(1U << 1)
+#define BTRFS_ILOCK_MMAP	(1U << 2)
 
 int btrfs_inode_lock(struct inode *inode, unsigned int ilock_flags);
 void btrfs_inode_unlock(struct inode *inode, unsigned int ilock_flags);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 070716650df8..a5ee4d45ee02 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -101,6 +101,7 @@ static void __endio_write_update_ordered(struct btrfs_inode *inode,
  * BTRFS_ILOCK_SHARED - acquire a shared lock on the inode
  * BTRFS_ILOCK_TRY - try to acquire the lock, if fails on first attempt
  *		     return -EAGAIN
+ * BTRFS_ILOCK_MMAP - acquire a write lock on the i_mmap_lock
  */
 int btrfs_inode_lock(struct inode *inode, unsigned int ilock_flags)
 {
@@ -121,6 +122,8 @@ int btrfs_inode_lock(struct inode *inode, unsigned int ilock_flags)
 		}
 		inode_lock(inode);
 	}
+	if (ilock_flags & BTRFS_ILOCK_MMAP)
+		down_write(&BTRFS_I(inode)->i_mmap_lock);
 	return 0;
 }
 
@@ -132,6 +135,8 @@ int btrfs_inode_lock(struct inode *inode, unsigned int ilock_flags)
  */
 void btrfs_inode_unlock(struct inode *inode, unsigned int ilock_flags)
 {
+	if (ilock_flags & BTRFS_ILOCK_MMAP)
+		up_write(&BTRFS_I(inode)->i_mmap_lock);
 	if (ilock_flags & BTRFS_ILOCK_SHARED)
 		inode_unlock_shared(inode);
 	else
@@ -8344,6 +8349,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 
 	ret = VM_FAULT_NOPAGE; /* make the VM retry the fault */
 again:
+	down_read(&BTRFS_I(inode)->i_mmap_lock);
 	lock_page(page);
 	size = i_size_read(inode);
 
@@ -8367,6 +8373,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 		unlock_extent_cached(io_tree, page_start, page_end,
 				     &cached_state);
 		unlock_page(page);
+		up_read(&BTRFS_I(inode)->i_mmap_lock);
 		btrfs_start_ordered_extent(ordered, 1);
 		btrfs_put_ordered_extent(ordered);
 		goto again;
@@ -8424,6 +8431,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 	BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
 
 	unlock_extent_cached(io_tree, page_start, page_end, &cached_state);
+	up_read(&BTRFS_I(inode)->i_mmap_lock);
 
 	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
 	sb_end_pagefault(inode->i_sb);
@@ -8432,6 +8440,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 
 out_unlock:
 	unlock_page(page);
+	up_read(&BTRFS_I(inode)->i_mmap_lock);
 out:
 	btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
 	btrfs_delalloc_release_space(BTRFS_I(inode), data_reserved, page_start,
@@ -8680,6 +8689,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	INIT_LIST_HEAD(&ei->delalloc_inodes);
 	INIT_LIST_HEAD(&ei->delayed_iput);
 	RB_CLEAR_NODE(&ei->rb_node);
+	init_rwsem(&ei->i_mmap_lock);
 
 	return inode;
 }
-- 
2.26.2



* [PATCH 2/4] btrfs: cleanup inode_lock/inode_unlock uses
From: Josef Bacik @ 2020-12-14 18:19 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

In a few places we intermix btrfs_inode_lock with an inode_unlock, and in
some places we just use inode_lock/inode_unlock instead of
btrfs_inode_lock.  None of these places use the locking incorrectly, but
as we adjust some of these callers it would be nice to keep everything
consistent, so convert them all to btrfs_inode_lock/btrfs_inode_unlock.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/delayed-inode.c |  4 ++--
 fs/btrfs/file.c          | 18 +++++++++---------
 fs/btrfs/ioctl.c         | 26 +++++++++++++-------------
 fs/btrfs/reflink.c       |  4 ++--
 fs/btrfs/relocation.c    |  4 ++--
 5 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 70c0340d839c..7532229e7efb 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1588,8 +1588,8 @@ bool btrfs_readdir_get_delayed_items(struct inode *inode,
 	 * We can only do one readdir with delayed items at a time because of
 	 * item->readdir_list.
 	 */
-	inode_unlock_shared(inode);
-	inode_lock(inode);
+	btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
+	btrfs_inode_lock(inode, 0);
 
 	mutex_lock(&delayed_node->mutex);
 	item = __btrfs_first_delayed_insertion_item(delayed_node);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0e41459b8de6..ed633ae87e06 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2131,7 +2131,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	if (ret)
 		goto out;
 
-	inode_lock(inode);
+	btrfs_inode_lock(inode, 0);
 
 	atomic_inc(&root->log_batch);
 
@@ -2163,7 +2163,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	 */
 	ret = start_ordered_ops(inode, start, end);
 	if (ret) {
-		inode_unlock(inode);
+		btrfs_inode_unlock(inode, 0);
 		goto out;
 	}
 
@@ -2259,7 +2259,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	 * file again, but that will end up using the synchronization
 	 * inside btrfs_sync_log to keep things safe.
 	 */
-	inode_unlock(inode);
+	btrfs_inode_unlock(inode, 0);
 
 	if (ret != BTRFS_NO_LOG_SYNC) {
 		if (!ret) {
@@ -2289,7 +2289,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 
 out_release_extents:
 	btrfs_release_log_ctx_extents(&ctx);
-	inode_unlock(inode);
+	btrfs_inode_unlock(inode, 0);
 	goto out;
 }
 
@@ -2872,7 +2872,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	if (ret)
 		return ret;
 
-	inode_lock(inode);
+	btrfs_inode_lock(inode, 0);
 	ino_size = round_up(inode->i_size, fs_info->sectorsize);
 	ret = find_first_non_hole(BTRFS_I(inode), &offset, &len);
 	if (ret < 0)
@@ -2912,7 +2912,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 		truncated_block = true;
 		ret = btrfs_truncate_block(BTRFS_I(inode), offset, 0, 0);
 		if (ret) {
-			inode_unlock(inode);
+			btrfs_inode_unlock(inode, 0);
 			return ret;
 		}
 	}
@@ -3013,7 +3013,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 				ret = ret2;
 		}
 	}
-	inode_unlock(inode);
+	btrfs_inode_unlock(inode, 0);
 	return ret;
 }
 
@@ -3378,7 +3378,7 @@ static long btrfs_fallocate(struct file *file, int mode,
 
 	if (mode & FALLOC_FL_ZERO_RANGE) {
 		ret = btrfs_zero_range(inode, offset, len, mode);
-		inode_unlock(inode);
+		btrfs_inode_unlock(inode, 0);
 		return ret;
 	}
 
@@ -3488,7 +3488,7 @@ static long btrfs_fallocate(struct file *file, int mode,
 	unlock_extent_cached(&BTRFS_I(inode)->io_tree, alloc_start, locked_end,
 			     &cached_state);
 out:
-	inode_unlock(inode);
+	btrfs_inode_unlock(inode, 0);
 	/* Let go of our reservation. */
 	if (ret != 0 && !(mode & FALLOC_FL_ZERO_RANGE))
 		btrfs_free_reserved_data_space(BTRFS_I(inode), data_reserved,
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index dde49a791f3e..becb5d9ae3ce 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -226,7 +226,7 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 	if (ret)
 		return ret;
 
-	inode_lock(inode);
+	btrfs_inode_lock(inode, 0);
 	fsflags = btrfs_mask_fsflags_for_type(inode, fsflags);
 	old_fsflags = btrfs_inode_flags_to_fsflags(binode->flags);
 
@@ -353,7 +353,7 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
  out_end_trans:
 	btrfs_end_transaction(trans);
  out_unlock:
-	inode_unlock(inode);
+	btrfs_inode_unlock(inode, 0);
 	mnt_drop_write_file(file);
 	return ret;
 }
@@ -449,7 +449,7 @@ static int btrfs_ioctl_fssetxattr(struct file *file, void __user *arg)
 	if (ret)
 		return ret;
 
-	inode_lock(inode);
+	btrfs_inode_lock(inode, 0);
 
 	old_flags = binode->flags;
 	old_i_flags = inode->i_flags;
@@ -501,7 +501,7 @@ static int btrfs_ioctl_fssetxattr(struct file *file, void __user *arg)
 		inode->i_flags = old_i_flags;
 	}
 
-	inode_unlock(inode);
+	btrfs_inode_unlock(inode, 0);
 	mnt_drop_write_file(file);
 
 	return ret;
@@ -1010,7 +1010,7 @@ static noinline int btrfs_mksubvol(const struct path *parent,
 out_dput:
 	dput(dentry);
 out_unlock:
-	inode_unlock(dir);
+	btrfs_inode_unlock(dir, 0);
 	return error;
 }
 
@@ -1602,7 +1602,7 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 			ra_index += cluster;
 		}
 
-		inode_lock(inode);
+		btrfs_inode_lock(inode, 0);
 		if (IS_SWAPFILE(inode)) {
 			ret = -ETXTBSY;
 		} else {
@@ -1611,13 +1611,13 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 			ret = cluster_pages_for_defrag(inode, pages, i, cluster);
 		}
 		if (ret < 0) {
-			inode_unlock(inode);
+			btrfs_inode_unlock(inode, 0);
 			goto out_ra;
 		}
 
 		defrag_count += ret;
 		balance_dirty_pages_ratelimited(inode->i_mapping);
-		inode_unlock(inode);
+		btrfs_inode_unlock(inode, 0);
 
 		if (newer_than) {
 			if (newer_off == (u64)-1)
@@ -1665,9 +1665,9 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
 
 out_ra:
 	if (do_compress) {
-		inode_lock(inode);
+		btrfs_inode_lock(inode, 0);
 		BTRFS_I(inode)->defrag_compress = BTRFS_COMPRESS_NONE;
-		inode_unlock(inode);
+		btrfs_inode_unlock(inode, 0);
 	}
 	if (!file)
 		kfree(ra);
@@ -3083,9 +3083,9 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file,
 		goto out_dput;
 	}
 
-	inode_lock(inode);
+	btrfs_inode_lock(inode, 0);
 	err = btrfs_delete_subvolume(dir, dentry);
-	inode_unlock(inode);
+	btrfs_inode_unlock(inode, 0);
 	if (!err) {
 		fsnotify_rmdir(dir, dentry);
 		d_delete(dentry);
@@ -3094,7 +3094,7 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file,
 out_dput:
 	dput(dentry);
 out_unlock_dir:
-	inode_unlock(dir);
+	btrfs_inode_unlock(dir, 0);
 free_subvol_name:
 	kfree(subvol_name_ptr);
 free_parent:
diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index b03e7891394e..97f58b15eba2 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -816,7 +816,7 @@ loff_t btrfs_remap_file_range(struct file *src_file, loff_t off,
 		return -EINVAL;
 
 	if (same_inode)
-		inode_lock(src_inode);
+		btrfs_inode_lock(src_inode, 0);
 	else
 		lock_two_nondirectories(src_inode, dst_inode);
 
@@ -832,7 +832,7 @@ loff_t btrfs_remap_file_range(struct file *src_file, loff_t off,
 
 out_unlock:
 	if (same_inode)
-		inode_unlock(src_inode);
+		btrfs_inode_unlock(src_inode, 0);
 	else
 		unlock_two_nondirectories(src_inode, dst_inode);
 
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 19b7db8b2117..3fa4f7ba8da3 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2553,7 +2553,7 @@ static noinline_for_stack int prealloc_file_extent_cluster(
 	if (ret)
 		return ret;
 
-	inode_lock(&inode->vfs_inode);
+	btrfs_inode_lock(&inode->vfs_inode, 0);
 	for (nr = 0; nr < cluster->nr; nr++) {
 		start = cluster->boundary[nr] - offset;
 		if (nr + 1 < cluster->nr)
@@ -2571,7 +2571,7 @@ static noinline_for_stack int prealloc_file_extent_cluster(
 		if (ret)
 			break;
 	}
-	inode_unlock(&inode->vfs_inode);
+	btrfs_inode_unlock(&inode->vfs_inode, 0);
 
 	if (cur_offset < prealloc_end)
 		btrfs_free_reserved_data_space_noquota(inode->root->fs_info,
-- 
2.26.2



* [PATCH 3/4] btrfs: exclude mmaps while doing remap
From: Josef Bacik @ 2020-12-14 18:19 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Darrick reported a potential issue to me where we could allow mmap
writes after validating that a page range matched in the case of dedupe.
Generally we rely on lock page -> lock extent with the ordered flush to
protect us, but because we use the generic helpers that locking is done
after we check the pages, so a mmap write could modify a page between
doing the check and locking the range.

There also exists a deadlock, as described by Filipe

"""
When cloning a file range, we lock the inodes, flush any delalloc within
the respective file ranges, wait for any ordered extents and then lock the
file ranges in both inodes. This means that right after we flush delalloc
and before we lock the file ranges, memory mapped writes can come in and
dirty pages in the file ranges of the clone operation.

Most of the time this is harmless and causes no problems. However, if we
are low on available metadata space, we can later end up in a deadlock
when starting a transaction to replace file extent items. This happens if
when allocating metadata space for the transaction, we need to wait for
the async reclaim thread to release space and the reclaim thread needs to
flush delalloc for the inode that got the memory mapped write and has its
range locked by the clone task.

Basically what happens is the following:

1) A clone operation locks inodes A and B, flushes delalloc for both
   inodes in the respective file ranges and waits for any ordered extents
   in those ranges to complete;

2) Before the clone task locks the file ranges, another task does a
   memory mapped write (which does not lock the inode) for one of the
   inodes of the clone operation. So now we have a dirty page in one of
   the ranges used by the clone operation;

3) The clone operation locks the file ranges for inodes A and B;

4) Later, when iterating over the file extents of inode A, the clone
   task attempts to start a transaction. There's not enough available
   free metadata space, so the async reclaim task is started (if not
   running already) and we wait for someone to wake us up on our
   reservation ticket;

5) The async reclaim task is not able to release space by any other
   means and decides to flush delalloc for the inode of the clone
   operation;

6) The workqueue job used to flush the inode blocks when starting
   delalloc for the inode, since the file range is currently locked by
   the clone task;

7) But the clone task is waiting on its reservation ticket and the async
   reclaim task is waiting on the flush job to complete, which can't
   progress since the clone task has the file range locked. So unless
   some other task is able to release space, for example an ordered
   extent for some other inode completes, we have a deadlock between all
   these tasks;

When this happens, stack traces like the following show up in dmesg/syslog:

 INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/u16:11  state:D stack:    0 pid:1810830 ppid:     2 flags:0x00004000
 Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
 Call Trace:
  __schedule+0x5d1/0xcf0
  schedule+0x45/0xe0
  lock_extent_bits+0x1e6/0x2d0 [btrfs]
  ? finish_wait+0x90/0x90
  btrfs_invalidatepage+0x32c/0x390 [btrfs]
  ? __mod_memcg_state+0x8e/0x160
  __extent_writepage+0x2d4/0x400 [btrfs]
  extent_write_cache_pages+0x2b2/0x500 [btrfs]
  ? lock_release+0x20e/0x4c0
  ? trace_hardirqs_on+0x1b/0xf0
  extent_writepages+0x43/0x90 [btrfs]
  ? lock_acquire+0x1a3/0x490
  do_writepages+0x43/0xe0
  ? __filemap_fdatawrite_range+0xa4/0x100
  __filemap_fdatawrite_range+0xc5/0x100
  btrfs_run_delalloc_work+0x17/0x40 [btrfs]
  btrfs_work_helper+0xf1/0x600 [btrfs]
  process_one_work+0x24e/0x5e0
  worker_thread+0x50/0x3b0
  ? process_one_work+0x5e0/0x5e0
  kthread+0x153/0x170
  ? kthread_mod_delayed_work+0xc0/0xc0
  ret_from_fork+0x22/0x30
 INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/u16:1   state:D stack:    0 pid:2426217 ppid:     2 flags:0x00004000
 Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
 Call Trace:
  __schedule+0x5d1/0xcf0
  ? kvm_clock_read+0x14/0x30
  ? wait_for_completion+0x81/0x110
  schedule+0x45/0xe0
  schedule_timeout+0x30c/0x580
  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  ? lock_acquire+0x1a3/0x490
  ? try_to_wake_up+0x7a/0xa20
  ? lock_release+0x20e/0x4c0
  ? lock_acquired+0x199/0x490
  ? wait_for_completion+0x81/0x110
  wait_for_completion+0xab/0x110
  start_delalloc_inodes+0x2af/0x390 [btrfs]
  btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
  flush_space+0x24f/0x660 [btrfs]
  btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
  process_one_work+0x24e/0x5e0
  worker_thread+0x20f/0x3b0
  ? process_one_work+0x5e0/0x5e0
  kthread+0x153/0x170
  ? kthread_mod_delayed_work+0xc0/0xc0
  ret_from_fork+0x22/0x30
(...)
several other tasks blocked on inode locks held by the clone task below
(...)
 RIP: 0033:0x7f61efe73fff
 Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5.
 RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
 RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff
 RDX: 00000000ffffff9c RSI: 0000560fbd604690 RDI: 00000000ffffff9c
 RBP: 00007ffc3371beb0 R08: 0000000000000002 R09: 0000560fbd5d75f0
 R10: 0000560fbd5d81f0 R11: 0000000000000202 R12: 0000000000000002
 R13: 000000000000000b R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
 task: fdm-stress        state:D stack:    0 pid:2508234 ppid:2508153 flags:0x00004000
 Call Trace:
  __schedule+0x5d1/0xcf0
  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  schedule+0x45/0xe0
  __reserve_bytes+0x4a4/0xb10 [btrfs]
  ? finish_wait+0x90/0x90
  btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
  btrfs_block_rsv_add+0x1f/0x50 [btrfs]
  start_transaction+0x2d1/0x760 [btrfs]
  btrfs_replace_file_extents+0x120/0x930 [btrfs]
  ? lock_release+0x20e/0x4c0
  btrfs_clone+0x3e4/0x7e0 [btrfs]
  ? btrfs_lookup_first_ordered_extent+0x8e/0x100 [btrfs]
  btrfs_clone_files+0xf6/0x150 [btrfs]
  btrfs_remap_file_range+0x324/0x3d0 [btrfs]
  do_clone_file_range+0xd4/0x1f0
  vfs_clone_file_range+0x4d/0x230
  ? lock_release+0x20e/0x4c0
  ioctl_file_clone+0x8f/0xc0
  do_vfs_ioctl+0x342/0x750
  __x64_sys_ioctl+0x62/0xb0
  do_syscall_64+0x33/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
"""

Fix both of these issues by excluding mmaps while we are doing any sort
of remap, which closes this race completely.
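
For context, btrfs_double_mmap_lock() below uses the standard idiom for
taking two locks of the same class: order the two objects by address so
that every task acquires them in the same order, and annotate the second
acquisition with SINGLE_DEPTH_NESTING so lockdep knows the nesting is
intentional rather than a self-deadlock.  Here is a minimal user-space
sketch of that ordering idiom, with pthread rwlocks standing in for the
kernel rw_semaphore and distinct objects assumed; the obj and
double_write_lock() names are made up for the example, which is
illustrative only and not code from this series:

#include <pthread.h>

struct obj {
	pthread_rwlock_t lock;
};

/*
 * Write-lock two distinct objects in a fixed (address) order so that two
 * tasks locking the same pair with the arguments swapped cannot ABBA
 * deadlock.  This mirrors the swap() in btrfs_double_mmap_lock().
 */
static void double_write_lock(struct obj *a, struct obj *b)
{
	if (a < b) {
		struct obj *tmp = a;

		a = b;
		b = tmp;
	}
	pthread_rwlock_wrlock(&a->lock);
	pthread_rwlock_wrlock(&b->lock);
}

static void double_write_unlock(struct obj *a, struct obj *b)
{
	pthread_rwlock_unlock(&a->lock);
	pthread_rwlock_unlock(&b->lock);
}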

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/reflink.c | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index 97f58b15eba2..7df2182c98d4 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -587,6 +587,20 @@ static void btrfs_double_extent_lock(struct inode *inode1, u64 loff1,
 	lock_extent(&BTRFS_I(inode2)->io_tree, loff2, loff2 + len - 1);
 }
 
+static void btrfs_double_mmap_lock(struct inode *inode1, struct inode *inode2)
+{
+	if (inode1 < inode2)
+		swap(inode1, inode2);
+	down_write(&BTRFS_I(inode1)->i_mmap_lock);
+	down_write_nested(&BTRFS_I(inode2)->i_mmap_lock, SINGLE_DEPTH_NESTING);
+}
+
+static void btrfs_double_mmap_unlock(struct inode *inode1, struct inode *inode2)
+{
+	up_write(&BTRFS_I(inode1)->i_mmap_lock);
+	up_write(&BTRFS_I(inode2)->i_mmap_lock);
+}
+
 static int btrfs_extent_same_range(struct inode *src, u64 loff, u64 len,
 				   struct inode *dst, u64 dst_loff)
 {
@@ -815,10 +829,12 @@ loff_t btrfs_remap_file_range(struct file *src_file, loff_t off,
 	if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY))
 		return -EINVAL;
 
-	if (same_inode)
-		btrfs_inode_lock(src_inode, 0);
-	else
+	if (same_inode) {
+		btrfs_inode_lock(src_inode, BTRFS_ILOCK_MMAP);
+	} else {
 		lock_two_nondirectories(src_inode, dst_inode);
+		btrfs_double_mmap_lock(src_inode, dst_inode);
+	}
 
 	ret = btrfs_remap_file_range_prep(src_file, off, dst_file, destoff,
 					  &len, remap_flags);
@@ -831,10 +847,12 @@ loff_t btrfs_remap_file_range(struct file *src_file, loff_t off,
 		ret = btrfs_clone_files(dst_file, src_file, off, len, destoff);
 
 out_unlock:
-	if (same_inode)
-		btrfs_inode_unlock(src_inode, 0);
-	else
+	if (same_inode) {
+		btrfs_inode_unlock(src_inode, BTRFS_ILOCK_MMAP);
+	} else {
+		btrfs_double_mmap_unlock(src_inode, dst_inode);
 		unlock_two_nondirectories(src_inode, dst_inode);
+	}
 
 	return ret < 0 ? ret : len;
 }
-- 
2.26.2



* [PATCH 4/4] btrfs: exclude mmap from happening during all fallocate operations
From: Josef Bacik @ 2020-12-14 18:19 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

There's a small window where a deadlock can happen between fallocate and
mmap.  This is described in detail by Filipe

"""
When doing a fallocate operation we lock the inode, flush delalloc within
the target range, wait for any ordered extents to complete and then lock
the file range. Before we lock the range and after we flush delalloc,
there is a time window where another task can come in and do a memory
mapped write for a page within the fallocate range.

This means that after fallocate locks the range, there can be a dirty page
in the range. More often than not, this does not cause any problem.
The exception is when we are low on available metadata space, because an
fallocate operation needs to start a transaction while holding the file
range locked, either through btrfs_prealloc_file_range() or through the
call to btrfs_fallocate_update_isize(). If that's the case, we can end up
in a deadlock. The following list of steps explains how that happens:

1) A fallocate operation starts, locks the inode, flushes delalloc in the
   range and waits for ordered extents in the range to complete;

2) Before the fallocate task locks the file range, another task does a
   memory mapped write for a page in the fallocate target range. This is
   possible since memory mapped writes do not (and can not) lock the
   inode;

3) The fallocate task locks the file range. At this point there is one
   dirty page in the range (due to the memory mapped write);

4) When the fallocate task attempts to start a transaction, it blocks when
   attempting to reserve metadata space, since we are low on available
   metadata space. Before blocking (wait on its reservation ticket), it
   starts the async reclaim task (if not running already);

5) The async reclaim task is not able to release space through any other
   means, so it decides to flush delalloc for inodes with dirty pages.
   It finds that the inode used in the fallocate operation has a dirty
   page and therefore queues a job (fs_info->flush_workers workqueue) to
   flush delalloc for that inode and waits on that job to complete;

6) The flush job blocks when attempting to lock the file range because
   it is currently locked by the fallocate task;

7) The fallocate task keeps waiting for its metadata reservation, waiting
   for a wakeup on its reservation ticket. The async reclaim task is
   waiting on the flush job, which in turn is waiting for locking the file
   range that is currently locked by the fallocate task. So unless some
   other task is able to release enough metadata space, for example an
   ordered extent for some other inode completes, we end up in a deadlock
   between all these tasks.

When this happens, stack traces like the following show up in dmesg/syslog:

 INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/u16:11  state:D stack:    0 pid:1810830 ppid:     2 flags:0x00004000
 Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
 Call Trace:
  __schedule+0x5d1/0xcf0
  schedule+0x45/0xe0
  lock_extent_bits+0x1e6/0x2d0 [btrfs]
  ? finish_wait+0x90/0x90
  btrfs_invalidatepage+0x32c/0x390 [btrfs]
  ? __mod_memcg_state+0x8e/0x160
  __extent_writepage+0x2d4/0x400 [btrfs]
  extent_write_cache_pages+0x2b2/0x500 [btrfs]
  ? lock_release+0x20e/0x4c0
  ? trace_hardirqs_on+0x1b/0xf0
  extent_writepages+0x43/0x90 [btrfs]
  ? lock_acquire+0x1a3/0x490
  do_writepages+0x43/0xe0
  ? __filemap_fdatawrite_range+0xa4/0x100
  __filemap_fdatawrite_range+0xc5/0x100
  btrfs_run_delalloc_work+0x17/0x40 [btrfs]
  btrfs_work_helper+0xf1/0x600 [btrfs]
  process_one_work+0x24e/0x5e0
  worker_thread+0x50/0x3b0
  ? process_one_work+0x5e0/0x5e0
  kthread+0x153/0x170
  ? kthread_mod_delayed_work+0xc0/0xc0
  ret_from_fork+0x22/0x30
 INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/u16:1   state:D stack:    0 pid:2426217 ppid:     2 flags:0x00004000
 Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
 Call Trace:
  __schedule+0x5d1/0xcf0
  ? kvm_clock_read+0x14/0x30
  ? wait_for_completion+0x81/0x110
  schedule+0x45/0xe0
  schedule_timeout+0x30c/0x580
  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  ? lock_acquire+0x1a3/0x490
  ? try_to_wake_up+0x7a/0xa20
  ? lock_release+0x20e/0x4c0
  ? lock_acquired+0x199/0x490
  ? wait_for_completion+0x81/0x110
  wait_for_completion+0xab/0x110
  start_delalloc_inodes+0x2af/0x390 [btrfs]
  btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
  flush_space+0x24f/0x660 [btrfs]
  btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
  process_one_work+0x24e/0x5e0
  worker_thread+0x20f/0x3b0
  ? process_one_work+0x5e0/0x5e0
  kthread+0x153/0x170
  ? kthread_mod_delayed_work+0xc0/0xc0
  ret_from_fork+0x22/0x30
(...)
several tasks waiting for the inode lock held by the fallocate task below
(...)
 RIP: 0033:0x7f61efe73fff
 Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5.
 RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
 RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff
 RDX: 00000000ffffff9c RSI: 0000560fbd5d90a0 RDI: 00000000ffffff9c
 RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
 R10: 0000560fbd5d7ad0 R11: 0000000000000202 R12: 0000000000000001
 R13: 000000000000005e R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
 task:fdm-stress        state:D stack:    0 pid:2508243 ppid:2508153 flags:0x00000000
 Call Trace:
  __schedule+0x5d1/0xcf0
  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  schedule+0x45/0xe0
  __reserve_bytes+0x4a4/0xb10 [btrfs]
  ? finish_wait+0x90/0x90
  btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
  btrfs_block_rsv_add+0x1f/0x50 [btrfs]
  start_transaction+0x2d1/0x760 [btrfs]
  btrfs_replace_file_extents+0x120/0x930 [btrfs]
  ? btrfs_fallocate+0xdcf/0x1260 [btrfs]
  btrfs_fallocate+0xdfb/0x1260 [btrfs]
  ? filename_lookup+0xf1/0x180
  vfs_fallocate+0x14f/0x440
  ioctl_preallocate+0x92/0xc0
  do_vfs_ioctl+0x66b/0x750
  ? __do_sys_newfstat+0x53/0x60
  __x64_sys_ioctl+0x62/0xb0
  do_syscall_64+0x33/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
"""

Fix this by disallowing mmaps from happening while we're doing any of
the fallocate operations on this inode.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/file.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index ed633ae87e06..d573ffcbd626 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2872,7 +2872,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	if (ret)
 		return ret;
 
-	btrfs_inode_lock(inode, 0);
+	btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP);
 	ino_size = round_up(inode->i_size, fs_info->sectorsize);
 	ret = find_first_non_hole(BTRFS_I(inode), &offset, &len);
 	if (ret < 0)
@@ -2912,7 +2912,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 		truncated_block = true;
 		ret = btrfs_truncate_block(BTRFS_I(inode), offset, 0, 0);
 		if (ret) {
-			btrfs_inode_unlock(inode, 0);
+			btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
 			return ret;
 		}
 	}
@@ -3013,7 +3013,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 				ret = ret2;
 		}
 	}
-	btrfs_inode_unlock(inode, 0);
+	btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
 	return ret;
 }
 
@@ -3336,7 +3336,7 @@ static long btrfs_fallocate(struct file *file, int mode,
 			return ret;
 	}
 
-	btrfs_inode_lock(inode, 0);
+	btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP);
 
 	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) {
 		ret = inode_newsize_ok(inode, offset + len);
@@ -3378,7 +3378,7 @@ static long btrfs_fallocate(struct file *file, int mode,
 
 	if (mode & FALLOC_FL_ZERO_RANGE) {
 		ret = btrfs_zero_range(inode, offset, len, mode);
-		btrfs_inode_unlock(inode, 0);
+		btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
 		return ret;
 	}
 
@@ -3488,7 +3488,7 @@ static long btrfs_fallocate(struct file *file, int mode,
 	unlock_extent_cached(&BTRFS_I(inode)->io_tree, alloc_start, locked_end,
 			     &cached_state);
 out:
-	btrfs_inode_unlock(inode, 0);
+	btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
 	/* Let go of our reservation. */
 	if (ret != 0 && !(mode & FALLOC_FL_ZERO_RANGE))
 		btrfs_free_reserved_data_space(BTRFS_I(inode), data_reserved,
-- 
2.26.2



* Re: [PATCH 3/4] btrfs: exclude mmaps while doing remap
From: Darrick J. Wong @ 2020-12-15 20:23 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Mon, Dec 14, 2020 at 01:19:40PM -0500, Josef Bacik wrote:
> Darrick reported a potential issue to me where we could allow mmap
> writes after validating that a page range matched in the case of dedupe.
> Generally we rely on lock page -> lock extent with the ordered flush to
> protect us, but because we use the generic helpers that locking is done
> after we check the pages, so a mmap write could modify a page between
> doing the check and locking the range.

FWIW I only found that via code inspection because Matthew Wilcox asked
me about whether or not filesystems did the right thing.  I wrote the
attached fstest to try to demonstrate the problem on btrfs but it
actually passes on xfs and btrfs.  This means either that (a) btrfs is
doing it right through some other means because I don't understand its
locking or (b) the test is wrong.

The test /does/ explode as expected on ocfs2 because mmap io lol there.

--D

From: Darrick J. Wong <darrick.wong@oracle.com>
Subject: [PATCH] generic: test file writers racing with FIDEDUPERANGE

Create a test to make sure that dedupe actually locks the file ranges
correctly before starting the content comparison and keeps them locked
until the operation completes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 src/Makefile          |    2 
 src/deduperace.c      |  370 +++++++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/949     |   51 +++++++
 tests/generic/949.out |    2 
 tests/generic/group   |    1 
 5 files changed, 425 insertions(+), 1 deletion(-)
 create mode 100644 src/deduperace.c
 create mode 100755 tests/generic/949
 create mode 100644 tests/generic/949.out

diff --git a/src/Makefile b/src/Makefile
index 5b9781c5..2875e8ef 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -21,7 +21,7 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \
 
 LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
 	preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \
-	locktest unwritten_mmap bulkstat_unlink_test \
+	locktest unwritten_mmap bulkstat_unlink_test deduperace \
 	bulkstat_unlink_test_modified t_dir_offset t_futimens t_immutable \
 	stale_handle pwrite_mmap_blocked t_dir_offset2 seek_sanity_test \
 	seek_copy_test t_readdir_1 t_readdir_2 fsync-tester nsexec cloner \
diff --git a/src/deduperace.c b/src/deduperace.c
new file mode 100644
index 00000000..88c0b6b9
--- /dev/null
+++ b/src/deduperace.c
@@ -0,0 +1,370 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2020 Oracle.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * Race pwrite/mwrite with dedupe to see if we got the locking right.
+ *
+ * File writes and mmap writes should not be able to change the src_fd's
+ * contents after dedupe prep has verified that the file contents are the same.
+ */
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+#include <sys/ioctl.h>
+#include <linux/fs.h>
+#include <string.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <stdlib.h>
+#include <errno.h>
+
+#define GOOD_BYTE		0x58
+#define BAD_BYTE		0x66
+
+#ifndef FIDEDUPERANGE
+/* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
+#define FILE_DEDUPE_RANGE_SAME		0
+#define FILE_DEDUPE_RANGE_DIFFERS	1
+
+/* from struct btrfs_ioctl_file_extent_same_info */
+struct file_dedupe_range_info {
+	__s64 dest_fd;		/* in - destination file */
+	__u64 dest_offset;	/* in - start of extent in destination */
+	__u64 bytes_deduped;	/* out - total # of bytes we were able
+				 * to dedupe from this file. */
+	/* status of this dedupe operation:
+	 * < 0 for error
+	 * == FILE_DEDUPE_RANGE_SAME if dedupe succeeds
+	 * == FILE_DEDUPE_RANGE_DIFFERS if data differs
+	 */
+	__s32 status;		/* out - see above description */
+	__u32 reserved;		/* must be zero */
+};
+
+/* from struct btrfs_ioctl_file_extent_same_args */
+struct file_dedupe_range {
+	__u64 src_offset;	/* in - start of extent in source */
+	__u64 src_length;	/* in - length of extent */
+	__u16 dest_count;	/* in - total elements in info array */
+	__u16 reserved1;	/* must be zero */
+	__u32 reserved2;	/* must be zero */
+	struct file_dedupe_range_info info[0];
+};
+#define FIDEDUPERANGE	_IOWR(0x94, 54, struct file_dedupe_range)
+#endif /* FIDEDUPERANGE */
+
+static int fd1, fd2;
+static loff_t offset = 37; /* Nice low offset to trick the compare */
+static loff_t blksz;
+
+/* Continuously dirty the pagecache for the region being dupe-tested. */
+void *
+mwriter(
+	void		*data)
+{
+	volatile char	*p;
+
+	p = mmap(NULL, blksz, PROT_WRITE, MAP_SHARED, fd1, 0);
+	if (p == MAP_FAILED) {
+		perror("mmap");
+		exit(2);
+	}
+
+	while (1) {
+		*(p + offset) = BAD_BYTE;
+		*(p + offset) = GOOD_BYTE;
+	}
+}
+
+/* Continuously write to the region being dupe-tested. */
+void *
+pwriter(
+	void		*data)
+{
+	char		v;
+	ssize_t		sz;
+
+	while (1) {
+		v = BAD_BYTE;
+		sz = pwrite(fd1, &v, sizeof(v), offset);
+		if (sz != sizeof(v)) {
+			perror("pwrite0");
+			exit(2);
+		}
+
+		v = GOOD_BYTE;
+		sz = pwrite(fd1, &v, sizeof(v), offset);
+		if (sz != sizeof(v)) {
+			perror("pwrite1");
+			exit(2);
+		}
+	}
+
+	return NULL;
+}
+
+static inline void
+complain(
+	loff_t	offset,
+	char	bad)
+{
+	fprintf(stderr, "ASSERT: offset %llu should be 0x%x, got 0x%x!\n",
+			(unsigned long long)offset, GOOD_BYTE, bad);
+	abort();
+}
+
+/* Make sure the destination file pagecache never changes. */
+void *
+mreader(
+	void		*data)
+{
+	volatile char	*p;
+
+	p = mmap(NULL, blksz, PROT_READ, MAP_SHARED, fd2, 0);
+	if (p == MAP_FAILED) {
+		perror("mmap");
+		exit(2);
+	}
+
+	while (1) {
+		if (*(p + offset) != GOOD_BYTE)
+			complain(offset, *(p + offset));
+	}
+}
+
+/* Make sure the destination file never changes. */
+void *
+preader(
+	void		*data)
+{
+	char		v;
+	ssize_t		sz;
+
+	while (1) {
+		sz = pread(fd2, &v, sizeof(v), offset);
+		if (sz != sizeof(v)) {
+			perror("pread");
+			exit(2);
+		}
+
+		if (v != GOOD_BYTE)
+			complain(offset, v);
+	}
+
+	return NULL;
+}
+
+void
+print_help(const char *progname)
+{
+	printf("Usage: %s [-b blksz] [-c dir] [-n nr_ops] [-o offset] [-r] [-w] [-v]\n",
+			progname);
+	printf("-b sets the block size (default is autoconfigured)\n");
+	printf("-c chdir to this path before starting\n");
+	printf("-n controls the number of dedupe ops (default 10000)\n");
+	printf("-o reads and writes to this offset (default 37)\n");
+	printf("-r uses pread instead of mmap read.\n");
+	printf("-v prints status updates.\n");
+	printf("-w uses pwrite instead of mmap write.\n");
+}
+
+int
+main(
+	int		argc,
+	char		*argv[])
+{
+	struct file_dedupe_range *fdr;
+	char		*Xbuf;
+	void		*(*reader_fn)(void *) = mreader;
+	void		*(*writer_fn)(void *) = mwriter;
+	unsigned long	same = 0;
+	unsigned long	differs = 0;
+	unsigned long	i, nr_ops = 10000;
+	ssize_t		sz;
+	pthread_t	reader, writer;
+	int		verbose = 0;
+	int		c;
+	int		ret;
+
+	while ((c = getopt(argc, argv, "b:c:n:o:rvw")) != -1) {
+		switch (c) {
+		case 'b':
+			errno = 0;
+			blksz = strtoul(optarg, NULL, 0);
+			if (errno) {
+				perror(optarg);
+				exit(1);
+			}
+			break;
+		case 'c':
+			ret = chdir(optarg);
+			if (ret) {
+				perror("chdir");
+				exit(1);
+			}
+			break;
+		case 'n':
+			errno = 0;
+			nr_ops = strtoul(optarg, NULL, 0);
+			if (errno) {
+				perror(optarg);
+				exit(1);
+			}
+			break;
+		case 'o':
+			errno = 0;
+			offset = strtoul(optarg, NULL, 0);
+			if (errno) {
+				perror(optarg);
+				exit(1);
+			}
+			break;
+		case 'r':
+			reader_fn = preader;
+			break;
+		case 'v':
+			verbose = 1;
+			break;
+		case 'w':
+			writer_fn = pwriter;
+			break;
+		default:
+			print_help(argv[0]);
+			exit(1);
+			break;
+		}
+	}
+
+	fdr = malloc(sizeof(struct file_dedupe_range) +
+			sizeof(struct file_dedupe_range_info));
+	if (!fdr) {
+		perror("malloc");
+		exit(1);
+	}
+
+	/* Initialize both files. */
+	fd1 = open("file1", O_RDWR | O_CREAT | O_TRUNC | O_NOATIME, 0600);
+	if (fd1 < 0) {
+		perror("file1");
+		exit(1);
+	}
+
+	fd2 = open("file2", O_RDWR | O_CREAT | O_TRUNC | O_NOATIME, 0600);
+	if (fd2 < 0) {
+		perror("file2");
+		exit(1);
+	}
+
+	if (blksz <= 0) {
+		struct stat	statbuf;
+
+		ret = fstat(fd1, &statbuf);
+		if (ret) {
+			perror("file1 stat");
+			exit(1);
+		}
+		blksz = statbuf.st_blksize;
+	}
+
+	if (offset >= blksz) {
+		fprintf(stderr, "offset (%llu) must be less than blksz (%llu)\n",
+				(unsigned long long)offset,
+				(unsigned long long)blksz);
+		exit(1);
+	}
+
+	Xbuf = malloc(blksz);
+	if (!Xbuf) {
+		perror("malloc buffer");
+		exit(1);
+	}
+	memset(Xbuf, GOOD_BYTE, blksz);
+
+	sz = pwrite(fd1, Xbuf, blksz, 0);
+	if (sz != blksz) {
+		perror("file1 write");
+		exit(1);
+	}
+
+	sz = pwrite(fd2, Xbuf, blksz, 0);
+	if (sz != blksz) {
+		perror("file2 write");
+		exit(1);
+	}
+
+	ret = fsync(fd1);
+	if (ret) {
+		perror("file1 fsync");
+		exit(1);
+	}
+
+	ret = fsync(fd2);
+	if (ret) {
+		perror("file2 fsync");
+		exit(1);
+	}
+
+	/* Start our reader and writer threads. */
+	ret = pthread_create(&reader, NULL, reader_fn, NULL);
+	if (ret) {
+		fprintf(stderr, "rthread: %s\n", strerror(ret));
+		exit(1);
+	}
+
+	ret = pthread_create(&writer, NULL, writer_fn, NULL);
+	if (ret) {
+		fprintf(stderr, "wthread: %s\n", strerror(ret));
+		exit(1);
+	}
+
+	/*
+	 * Now start deduping.  If the contents match, fd1's blocks will be
+	 * remapped into fd2, which is why the writer thread targets fd1 and
+	 * the reader checks fd2 to make sure that none of fd1's writes ever
+	 * make it into fd2.
+	 */
+	for (i = 1; i <= nr_ops; i++) {
+		fdr->src_offset = 0;
+		fdr->src_length = blksz;
+		fdr->dest_count = 1;
+		fdr->reserved1 = 0;
+		fdr->reserved2 = 0;
+		fdr->info[0].dest_fd = fd2;
+		fdr->info[0].dest_offset = 0;
+		fdr->info[0].reserved = 0;
+
+		ret = ioctl(fd1, FIDEDUPERANGE, fdr);
+		if (ret) {
+			perror("dedupe");
+			exit(2);
+		}
+
+		switch (fdr->info[0].status) {
+		case FILE_DEDUPE_RANGE_DIFFERS:
+			differs++;
+			break;
+		case FILE_DEDUPE_RANGE_SAME:
+			same++;
+			break;
+		default:
+			fprintf(stderr, "deduperange: %s\n",
+					strerror(-fdr->info[0].status));
+			exit(2);
+			break;
+		}
+
+		if (verbose && (i % 337) == 0)
+			printf("nr_ops: %lu; same: %lu; differs: %lu\n",
+					i, same, differs);
+	}
+
+	if (verbose)
+		printf("nr_ops: %lu; same: %lu; differs: %lu\n", i - 1, same,
+				differs);
+
+	/* Program termination will kill the threads and close the files. */
+	return 0;
+}
diff --git a/tests/generic/949 b/tests/generic/949
new file mode 100755
index 00000000..b827fcaa
--- /dev/null
+++ b/tests/generic/949
@@ -0,0 +1,51 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2020, Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 949
+#
+# Make sure that mmap and file writers racing with FIDEDUPERANGE cannot write
+# to the file after the dedupe prep function has decided that the file contents
+# are identical and we can therefore go ahead with the remapping.
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1    # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/reflink
+
+# real QA test starts here
+_supported_fs generic
+_require_scratch_dedupe
+
+rm -f $seqres.full
+
+nr_ops=$((TIME_FACTOR * 10000))
+
+# Format filesystem
+_scratch_mkfs > $seqres.full
+_scratch_mount
+
+# Test once with mmap writes
+$here/src/deduperace -c $SCRATCH_MNT -n $nr_ops
+
+# Test again with pwrites for the lulz
+$here/src/deduperace -c $SCRATCH_MNT -n $nr_ops -w
+
+echo Silence is golden.
+# success, all done
+status=0
+exit
diff --git a/tests/generic/949.out b/tests/generic/949.out
new file mode 100644
index 00000000..2998b46c
--- /dev/null
+++ b/tests/generic/949.out
@@ -0,0 +1,2 @@
+QA output created by 949
+Silence is golden.
diff --git a/tests/generic/group b/tests/generic/group
index fe196f0a..364c011d 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -628,6 +628,7 @@
 722 auto quick atime bigtime shutdown
 947 auto quick rw clone
 948 auto quick rw copy_range
+949 auto quick rw dedupe clone
 1200 auto quick swapext quota
 1201 auto quick swapext quota
 1202 auto quick swapext


* Re: [PATCH 3/4] btrfs: exclude mmaps while doing remap
From: Josef Bacik @ 2020-12-15 20:40 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-btrfs, kernel-team

On 12/15/20 3:23 PM, Darrick J. Wong wrote:
> On Mon, Dec 14, 2020 at 01:19:40PM -0500, Josef Bacik wrote:
>> Darrick reported a potential issue to me where we could allow mmap
>> writes after validating that a page range matched in the case of dedupe.
>> Generally we rely on lock page -> lock extent with the ordered flush to
>> protect us, but because we use the generic helpers that locking is done
>> after we check the pages, so a mmap write could modify a page between
>> doing the check and locking the range.
> 
> FWIW I only found that via code inspection because Matthew Wilcox asked
> me about whether or not filesystems did the right thing.  I wrote the
> attached fstest to try to demonstrate the problem on btrfs but it
> actually passes on xfs and btrfs.  This means either that (a) btrfs is
> doing it right through some other means because I don't understand its
> locking or (b) the test is wrong.
> 
> The test /does/ explode as expected on ocfs2 because mmap io lol there.
> 

The problem exists, it's just that the window is super narrow.  If you put a
schedule_timeout(HZ) right after btrfs_remap_file_range_prep() you'd hit the 
problem every time.  We basically have to mkwrite right between doing the 
vfs_dedupe_file_range_compare() and calling btrfs_extent_same(), which is going 
to be hard to do.  You aren't missing something, it's just hard to hit.  Thanks,

Josef
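
For reference, the debugging hack described above amounts to something
like the following sketch, paraphrased from btrfs_remap_file_range() and
not part of the series; msleep() stands in for the bare
schedule_timeout(HZ) mentioned above, since the latter would also need
the task state to be set first:

	ret = btrfs_remap_file_range_prep(src_file, off, dst_file, destoff,
					  &len, remap_flags);
	if (ret < 0 || len == 0)
		goto out_unlock;

	/*
	 * Debugging hack: widen the window between the generic content
	 * comparison done in the prep helper and the btrfs-side extent
	 * locking, so that a racing page_mkwrite() reliably dirties a
	 * source page in between.
	 */
	msleep(1000);

	if (remap_flags & REMAP_FILE_DEDUP)
		ret = btrfs_extent_same(src_inode, off, len, dst_inode, destoff);
	else
		ret = btrfs_clone_files(dst_file, src_file, off, len, destoff);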


* Re: [PATCH 0/4] Introduce a mmap sem to deal with some mmap issues
From: Christoph Hellwig @ 2020-12-23  8:19 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Mon, Dec 14, 2020 at 01:19:37PM -0500, Josef Bacik wrote:
> These issues exist for remap and fallocate, so add an i_mmap_sem that lets us
> disallow mmap in these cases.  I'm still waiting on xfstests to finish with
> this, but two hours in there are no lockdep splats or deadlocks.  Thanks,

Any chance you could look into lifting it to the VFS and also converting
ext4 and XFS to the common scheme?
