* [PATCH 0/4] btrfs: fix a couple swapfile support bugs
@ 2021-02-03 11:17 fdmanana
  2021-02-03 11:17 ` [PATCH 1/4] btrfs: avoid checking for RO block group twice during nocow writeback fdmanana
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: fdmanana @ 2021-02-03 11:17 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

The following patchset fixes 2 bugs in the swapfile support, where we can
end up falling back to COW when writing to an active swapfile. As a bonus,
it makes the NOCOW write path, for both buffered and direct IO, more efficient
by avoiding repeated work when checking if the target block group is
read-only.

Filipe Manana (4):
  btrfs: avoid checking for RO block group twice during nocow writeback
  btrfs: fix race between writes to swap files and scrub
  btrfs: remove no longer used function btrfs_extent_readonly()
  btrfs: fix race between swap file activation and snapshot creation

 fs/btrfs/block-group.c | 33 ++++++++++++++++++++++++++++-
 fs/btrfs/block-group.h |  9 ++++++++
 fs/btrfs/ctree.h       |  6 +++++-
 fs/btrfs/extent-tree.c | 13 ------------
 fs/btrfs/inode.c       | 47 ++++++++++++++++++++++++++++++++++--------
 fs/btrfs/scrub.c       |  9 +++++++-
 6 files changed, 92 insertions(+), 25 deletions(-)

-- 
2.28.0



* [PATCH 1/4] btrfs: avoid checking for RO block group twice during nocow writeback
  2021-02-03 11:17 [PATCH 0/4] btrfs: fix a couple swapfile support bugs fdmanana
@ 2021-02-03 11:17 ` fdmanana
  2021-02-04  7:47   ` Anand Jain
  2021-02-03 11:17 ` [PATCH 2/4] btrfs: fix race between writes to swap files and scrub fdmanana
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: fdmanana @ 2021-02-03 11:17 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

During the nocow writeback path, we currently iterate the rbtree of block
groups twice: once for checking if the target block group is RO with the
call to btrfs_extent_readonly(), and once again for getting a nocow
reference on the block group with a call to btrfs_inc_nocow_writers().

Since btrfs_inc_nocow_writers() already returns false when the target
block group is RO, remove the call to btrfs_extent_readonly(). Not only
do we avoid searching the block groups rbtree twice, it also helps reduce
contention on the lock that protects it (especially since it is a spin
lock and not a read-write lock). That may make a noticeable difference
on very large filesystems, with thousands of allocated block groups.
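
As a rough sketch of why the dropped check is redundant, this is
approximately what btrfs_inc_nocow_writers() does (simplified from
fs/btrfs/block-group.c, not the exact upstream code):

    bool btrfs_inc_nocow_writers(struct btrfs_fs_info *fs_info, u64 bytenr)
    {
            struct btrfs_block_group *bg;
            bool ret = true;

            /* The same rbtree search btrfs_extent_readonly() did. */
            bg = btrfs_lookup_block_group(fs_info, bytenr);
            if (!bg)
                    return false;

            spin_lock(&bg->lock);
            if (bg->ro)
                    ret = false;    /* caller must fall back to COW */
            else
                    atomic_inc(&bg->nocow_writers);
            spin_unlock(&bg->lock);

            /* Reference kept on success, dropped by btrfs_dec_nocow_writers(). */
            if (!ret)
                    btrfs_put_block_group(bg);

            return ret;
    }

So a single call both performs the RO check and takes the nocow
reference, with one rbtree lookup instead of two.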

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 589030cefd90..b10fc42f9e9a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1657,9 +1657,6 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 			 */
 			btrfs_release_path(path);
 
-			/* If extent is RO, we must COW it */
-			if (btrfs_extent_readonly(fs_info, disk_bytenr))
-				goto out_check;
 			ret = btrfs_cross_ref_exist(root, ino,
 						    found_key.offset -
 						    extent_offset, disk_bytenr, false);
@@ -1706,6 +1703,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 				WARN_ON_ONCE(freespace_inode);
 				goto out_check;
 			}
+			/* If the extent's block group is RO, we must COW. */
 			if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr))
 				goto out_check;
 			nocow = true;
-- 
2.28.0



* [PATCH 2/4] btrfs: fix race between writes to swap files and scrub
  2021-02-03 11:17 [PATCH 0/4] btrfs: fix a couple swapfile support bugs fdmanana
  2021-02-03 11:17 ` [PATCH 1/4] btrfs: avoid checking for RO block group twice during nocow writeback fdmanana
@ 2021-02-03 11:17 ` fdmanana
  2021-02-04  8:48   ` Anand Jain
  2021-02-03 11:17 ` [PATCH 3/4] btrfs: remove no longer used function btrfs_extent_readonly() fdmanana
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: fdmanana @ 2021-02-03 11:17 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When we activate a swap file, at btrfs_swap_activate(), we acquire the
exclusive operation lock to prevent the physical location of the swap
file extents from being changed by operations such as balance and device
replace/resize/remove. There we also call can_nocow_extent() which,
among other things, checks if the block group of a swap file extent is
currently RO, and if it is we cannot use the extent, since a write
into it would result in COWing the extent.

However we have no protection against a scrub operation running after we
activate the swap file, which can result in the swap file extents being
COWed while the scrub is running and operating on the respective block
group, because scrub turns a block group RO before it processes it
and then back to RW mode after processing it. That means an attempt
to write into a swap file extent while scrub is processing the respective
block group will result in COWing the extent, changing its physical
location on disk.

Fix this by making sure that block groups that have extents that are used
by active swap files cannot be turned RO, therefore making it impossible
for a scrub to turn them RO. When a scrub finds a block group that cannot
be turned RO due to the existence of extents used by swap files, it
proceeds to the next block group and logs a warning message that mentions
the block group was skipped due to active swap files - this is the same
approach we currently use for balance.

This ends up removing the need to call btrfs_extent_readonly() from
can_nocow_extent(), as btrfs_swap_activate() now checks if a block group
is RO through the new function btrfs_inc_block_group_swap_extents().

The only other caller of can_nocow_extent() is the direct IO write path,
btrfs_get_blocks_direct_write(), but that already checks if a block group
is RO through the call to btrfs_inc_nocow_writers(). In fact, after this
change we end up optimizing the direct IO write path, since we no longer
iterate the block groups rbtree twice, once with btrfs_extent_readonly(),
through can_nocow_extent(), and once again with btrfs_inc_nocow_writers().
This can save time and reduce contention on the lock that protects the
rbtree (especially because it is a spinlock and not a read/write lock) on
very large filesystems, with several thousand allocated block groups.
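
To illustrate the mutual exclusion this patch establishes, here is a
simplified sketch of the two sides (condensed from the diff below, not
the exact code):

    /* Scrub/balance side: refuse to flip a block group RO while a swap
     * file has extents in it. */
    static int inc_block_group_ro(struct btrfs_block_group *cache, int force)
    {
            ...
            spin_lock(&cache->lock);
            if (cache->swap_extents) {
                    ret = -ETXTBSY;         /* scrub skips this block group */
                    goto out;
            }
            cache->ro++;
            ...
    }

    /* Swap activation side: refuse to use extents from a block group
     * that is already RO. */
    bool btrfs_inc_block_group_swap_extents(struct btrfs_block_group *bg)
    {
            bool ret = true;

            spin_lock(&bg->lock);
            if (bg->ro)
                    ret = false;
            else
                    bg->swap_extents++;
            spin_unlock(&bg->lock);

            return ret;
    }

Both sides check and update their state under bg->lock, so turning the
block group RO and activating a swap file extent in it can never both
succeed.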

Fixes: ed46ff3d42378 ("Btrfs: support swap files")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/block-group.c | 33 ++++++++++++++++++++++++++++++++-
 fs/btrfs/block-group.h |  9 +++++++++
 fs/btrfs/ctree.h       |  5 +++++
 fs/btrfs/inode.c       | 22 ++++++++++++++++++----
 fs/btrfs/scrub.c       |  9 ++++++++-
 5 files changed, 72 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 5fa6b3d540f4..c0a8ddf92ef8 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1150,6 +1150,11 @@ static int inc_block_group_ro(struct btrfs_block_group *cache, int force)
 	spin_lock(&sinfo->lock);
 	spin_lock(&cache->lock);
 
+	if (cache->swap_extents) {
+		ret = -ETXTBSY;
+		goto out;
+	}
+
 	if (cache->ro) {
 		cache->ro++;
 		ret = 0;
@@ -2260,7 +2265,7 @@ int btrfs_inc_block_group_ro(struct btrfs_block_group *cache,
 	}
 
 	ret = inc_block_group_ro(cache, 0);
-	if (!do_chunk_alloc)
+	if (!do_chunk_alloc || ret == -ETXTBSY)
 		goto unlock_out;
 	if (!ret)
 		goto out;
@@ -2269,6 +2274,8 @@ int btrfs_inc_block_group_ro(struct btrfs_block_group *cache,
 	if (ret < 0)
 		goto out;
 	ret = inc_block_group_ro(cache, 0);
+	if (ret == -ETXTBSY)
+		goto unlock_out;
 out:
 	if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
 		alloc_flags = btrfs_get_alloc_profile(fs_info, cache->flags);
@@ -3352,6 +3359,7 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
 		ASSERT(list_empty(&block_group->io_list));
 		ASSERT(list_empty(&block_group->bg_list));
 		ASSERT(refcount_read(&block_group->refs) == 1);
+		ASSERT(block_group->swap_extents == 0);
 		btrfs_put_block_group(block_group);
 
 		spin_lock(&info->block_group_cache_lock);
@@ -3418,3 +3426,26 @@ void btrfs_unfreeze_block_group(struct btrfs_block_group *block_group)
 		__btrfs_remove_free_space_cache(block_group->free_space_ctl);
 	}
 }
+
+bool btrfs_inc_block_group_swap_extents(struct btrfs_block_group *bg)
+{
+	bool ret = true;
+
+	spin_lock(&bg->lock);
+	if (bg->ro)
+		ret = false;
+	else
+		bg->swap_extents++;
+	spin_unlock(&bg->lock);
+
+	return ret;
+}
+
+void btrfs_dec_block_group_swap_extents(struct btrfs_block_group *bg, int amount)
+{
+	spin_lock(&bg->lock);
+	ASSERT(!bg->ro);
+	ASSERT(bg->swap_extents >= amount);
+	bg->swap_extents -= amount;
+	spin_unlock(&bg->lock);
+}
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 8f74a96074f7..105094bd1821 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -181,6 +181,12 @@ struct btrfs_block_group {
 	 */
 	int needs_free_space;
 
+	/*
+	 * Number of extents in this block group used for swap files.
+	 * All accesses protected by the spinlock 'lock'.
+	 */
+	int swap_extents;
+
 	/* Record locked full stripes for RAID5/6 block group */
 	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
 };
@@ -296,6 +302,9 @@ static inline int btrfs_block_group_done(struct btrfs_block_group *cache)
 void btrfs_freeze_block_group(struct btrfs_block_group *cache);
 void btrfs_unfreeze_block_group(struct btrfs_block_group *cache);
 
+bool btrfs_inc_block_group_swap_extents(struct btrfs_block_group *bg);
+void btrfs_dec_block_group_swap_extents(struct btrfs_block_group *bg, int amount);
+
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
 		     u64 physical, u64 **logical, int *naddrs, int *stripe_len);
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ed6bb46a2572..5269777a4fb4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -523,6 +523,11 @@ struct btrfs_swapfile_pin {
 	 * points to a struct btrfs_device.
 	 */
 	bool is_block_group;
+	/*
+	 * Only used when 'is_block_group' is true and it is the number of
+	 * extents used by a swapfile for this block group ('ptr' field).
+	 */
+	int bg_extent_count;
 };
 
 bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b10fc42f9e9a..464c289c402d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7204,9 +7204,6 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 		*ram_bytes = btrfs_file_extent_ram_bytes(leaf, fi);
 	}
 
-	if (btrfs_extent_readonly(fs_info, disk_bytenr))
-		goto out;
-
 	num_bytes = min(offset + *len, extent_end) - offset;
 	if (!nocow && found_type == BTRFS_FILE_EXTENT_PREALLOC) {
 		u64 range_end;
@@ -9990,6 +9987,7 @@ static int btrfs_add_swapfile_pin(struct inode *inode, void *ptr,
 	sp->ptr = ptr;
 	sp->inode = inode;
 	sp->is_block_group = is_block_group;
+	sp->bg_extent_count = 1;
 
 	spin_lock(&fs_info->swapfile_pins_lock);
 	p = &fs_info->swapfile_pins.rb_node;
@@ -10003,6 +10001,8 @@ static int btrfs_add_swapfile_pin(struct inode *inode, void *ptr,
 			   (sp->ptr == entry->ptr && sp->inode > entry->inode)) {
 			p = &(*p)->rb_right;
 		} else {
+			if (is_block_group)
+				entry->bg_extent_count++;
 			spin_unlock(&fs_info->swapfile_pins_lock);
 			kfree(sp);
 			return 1;
@@ -10028,8 +10028,11 @@ static void btrfs_free_swapfile_pins(struct inode *inode)
 		sp = rb_entry(node, struct btrfs_swapfile_pin, node);
 		if (sp->inode == inode) {
 			rb_erase(&sp->node, &fs_info->swapfile_pins);
-			if (sp->is_block_group)
+			if (sp->is_block_group) {
+				btrfs_dec_block_group_swap_extents(sp->ptr,
+							   sp->bg_extent_count);
 				btrfs_put_block_group(sp->ptr);
+			}
 			kfree(sp);
 		}
 		node = next;
@@ -10244,6 +10247,17 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 			goto out;
 		}
 
+		if (!btrfs_inc_block_group_swap_extents(bg)) {
+			btrfs_warn(fs_info,
+			   "block group for swapfile at %llu is read-only%s",
+			   bg->start,
+			   atomic_read(&fs_info->scrubs_running) ?
+				   " (scrub running)" : "");
+			btrfs_put_block_group(bg);
+			ret = -EINVAL;
+			goto out;
+		}
+
 		ret = btrfs_add_swapfile_pin(inode, bg, true);
 		if (ret) {
 			btrfs_put_block_group(bg);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 5f4f88a4d2c8..c09a494be8c6 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3630,6 +3630,13 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 			 * commit_transactions.
 			 */
 			ro_set = 0;
+		} else if (ret == -ETXTBSY) {
+			btrfs_warn(fs_info,
+		   "skipping scrub of block group %llu due to active swapfile",
+				   cache->start);
+			scrub_pause_off(fs_info);
+			ret = 0;
+			goto skip_unfreeze;
 		} else {
 			btrfs_warn(fs_info,
 				   "failed setting block group ro: %d", ret);
@@ -3719,7 +3726,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		} else {
 			spin_unlock(&cache->lock);
 		}
-
+skip_unfreeze:
 		btrfs_unfreeze_block_group(cache);
 		btrfs_put_block_group(cache);
 		if (ret)
-- 
2.28.0



* [PATCH 3/4] btrfs: remove no longer used function btrfs_extent_readonly()
  2021-02-03 11:17 [PATCH 0/4] btrfs: fix a couple swapfile support bugs fdmanana
  2021-02-03 11:17 ` [PATCH 1/4] btrfs: avoid checking for RO block group twice during nocow writeback fdmanana
  2021-02-03 11:17 ` [PATCH 2/4] btrfs: fix race between writes to swap files and scrub fdmanana
@ 2021-02-03 11:17 ` fdmanana
  2021-02-03 11:17 ` [PATCH 4/4] btrfs: fix race between swap file activation and snapshot creation fdmanana
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: fdmanana @ 2021-02-03 11:17 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

After the two previous patches:

  btrfs: avoid checking for RO block group twice during nocow writeback
  btrfs: fix race between writes to swap files and scrub

it is no longer used, so just remove it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ctree.h       |  1 -
 fs/btrfs/extent-tree.c | 13 -------------
 2 files changed, 14 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5269777a4fb4..65d158082b86 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2686,7 +2686,6 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans);
 int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 			 struct btrfs_ref *generic_ref);
 
-int btrfs_extent_readonly(struct btrfs_fs_info *fs_info, u64 bytenr);
 void btrfs_clear_space_info_full(struct btrfs_fs_info *info);
 
 /*
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5476ab84e544..11ef4259cacb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2455,19 +2455,6 @@ int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	return __btrfs_mod_ref(trans, root, buf, full_backref, 0);
 }
 
-int btrfs_extent_readonly(struct btrfs_fs_info *fs_info, u64 bytenr)
-{
-	struct btrfs_block_group *block_group;
-	int readonly = 0;
-
-	block_group = btrfs_lookup_block_group(fs_info, bytenr);
-	if (!block_group || block_group->ro)
-		readonly = 1;
-	if (block_group)
-		btrfs_put_block_group(block_group);
-	return readonly;
-}
-
 static u64 get_alloc_profile_by_root(struct btrfs_root *root, int data)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
-- 
2.28.0



* [PATCH 4/4] btrfs: fix race between swap file activation and snapshot creation
  2021-02-03 11:17 [PATCH 0/4] btrfs: fix a couple swapfile support bugs fdmanana
                   ` (2 preceding siblings ...)
  2021-02-03 11:17 ` [PATCH 3/4] btrfs: remove no longer used function btrfs_extent_readonly() fdmanana
@ 2021-02-03 11:17 ` fdmanana
  2021-02-05 12:55 ` [PATCH v2 0/3] btrfs: fix a couple swapfile support bugs fdmanana
  2021-02-10 15:30 ` [PATCH 0/4] " Josef Bacik
  5 siblings, 0 replies; 19+ messages in thread
From: fdmanana @ 2021-02-03 11:17 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When creating a snapshot we check if the current number of swap files, in
the root, is non-zero, and if it is, we error out and warn that we cannot
create the snapshot because there are active swap files.

However this is racy, because when a task starts activating a swap file,
another task might already have started snapshot creation and might have
seen the counter for the number of swap files as zero. This means that
after the swap file is activated we may end up with a snapshot of the
same root successfully created, and therefore when the first write to the
swap file happens it has to fall back to COW mode, which should never
happen for active swap files.

Basically what can happen is:

1) Task A starts snapshot creation and enters ioctl.c:create_snapshot().
   There it sees that root->nr_swapfiles has a value of 0 so it continues;

2) Task B enters btrfs_swap_activate(). It is not aware that another task
   has started snapshot creation and has not yet finished it. It increments
   root->nr_swapfiles from 0 to 1;

3) Task B checks that the file meets all requirements to be an active
   swap file - it has NOCOW set, there are no snapshots for the inode's
   root at the moment, no file holes, no reflinked extents, etc;

4) Task B returns success and now the file is an active swap file;

5) Task A commits the transaction to create the snapshot and finishes.
   The swap file's extents are now shared between the original root and
   the snapshot;

6) A write into an extent of the swap file is attempted - there is a
   snapshot of the file's root, so we fall back to COW mode and therefore
   the physical location of the extent changes on disk.

So fix this by taking the snapshot lock during swap file activation before
locking the extent range, as that is the order in which we lock these
during buffered writes.
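
A minimal sketch of the resulting order in btrfs_swap_activate() (names
taken from the diff below; this assumes the drew lock semantics where
nocow writers take the write side and snapshot creation takes the read
side):

    /* 1) Block snapshot creation on this root. create_snapshot() takes
     *    the read side of the same drew lock, so this fails if a
     *    snapshot is already being created. */
    if (!btrfs_drew_try_write_lock(&root->snapshot_lock))
            return -EINVAL;

    /* 2) Only then lock the file's extent range - the same order used
     *    by buffered writes. */
    lock_extent_bits(io_tree, 0, isize - 1, &cached_state);

    /* ... validate and pin the swap file extents ... */

    unlock_extent_cached(io_tree, 0, isize - 1, &cached_state);
    btrfs_drew_write_unlock(&root->snapshot_lock);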

Fixes: ed46ff3d42378 ("Btrfs: support swap files")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 464c289c402d..ca89f3166dd7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10093,7 +10093,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 			       sector_t *span)
 {
 	struct inode *inode = file_inode(file);
-	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	struct extent_state *cached_state = NULL;
 	struct extent_map *em = NULL;
@@ -10144,13 +10145,27 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	   "cannot activate swapfile while exclusive operation is running");
 		return -EBUSY;
 	}
+
+	/*
+	 * Prevent snapshot creation while we are activating the swap file.
+	 * We do not want to race with snapshot creation. If snapshot creation
+	 * already started before we bumped nr_swapfiles from 0 to 1 and
+	 * completes before the first write into the swap file after it is
+	 * activated, then that write would fall back to COW.
+	 */
+	if (!btrfs_drew_try_write_lock(&root->snapshot_lock)) {
+		btrfs_exclop_finish(fs_info);
+		btrfs_warn(fs_info,
+	   "cannot activate swapfile because snapshot creation is in progress");
+		return -EINVAL;
+	}
 	/*
 	 * Snapshots can create extents which require COW even if NODATACOW is
 	 * set. We use this counter to prevent snapshots. We must increment it
 	 * before walking the extents because we don't want a concurrent
 	 * snapshot to run after we've already checked the extents.
 	 */
-	atomic_inc(&BTRFS_I(inode)->root->nr_swapfiles);
+	atomic_inc(&root->nr_swapfiles);
 
 	isize = ALIGN_DOWN(inode->i_size, fs_info->sectorsize);
 
@@ -10296,6 +10311,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	if (ret)
 		btrfs_swap_deactivate(file);
 
+	btrfs_drew_write_unlock(&root->snapshot_lock);
+
 	btrfs_exclop_finish(fs_info);
 
 	if (ret)
-- 
2.28.0



* Re: [PATCH 1/4] btrfs: avoid checking for RO block group twice during nocow writeback
  2021-02-03 11:17 ` [PATCH 1/4] btrfs: avoid checking for RO block group twice during nocow writeback fdmanana
@ 2021-02-04  7:47   ` Anand Jain
  0 siblings, 0 replies; 19+ messages in thread
From: Anand Jain @ 2021-02-04  7:47 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

On 2/3/2021 7:17 PM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> During the nocow writeback path, we currently iterate the rbtree of block
> groups twice: once for checking if the target block group is RO with the
> call to btrfs_extent_readonly(), and once again for getting a nocow
> reference on the block group with a call to btrfs_inc_nocow_writers().
>
> Since btrfs_inc_nocow_writers() already returns false when the target
> block group is RO, remove the call to btrfs_extent_readonly(). Not only
> do we avoid searching the block groups rbtree twice, it also helps reduce
> contention on the lock that protects it (especially since it is a spin
> lock and not a read-write lock). That may make a noticeable difference
> on very large filesystems, with thousands of allocated block groups.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Thanks.

> ---
>   fs/btrfs/inode.c | 4 +---
>   1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 589030cefd90..b10fc42f9e9a 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1657,9 +1657,6 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>   			 */
>   			btrfs_release_path(path);
>   
> -			/* If extent is RO, we must COW it */
> -			if (btrfs_extent_readonly(fs_info, disk_bytenr))
> -				goto out_check;
>   			ret = btrfs_cross_ref_exist(root, ino,
>   						    found_key.offset -
>   						    extent_offset, disk_bytenr, false);
> @@ -1706,6 +1703,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>   				WARN_ON_ONCE(freespace_inode);
>   				goto out_check;
>   			}
> +			/* If the extent's block group is RO, we must COW. */
>   			if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr))
>   				goto out_check;
>   			nocow = true;
> 



* Re: [PATCH 2/4] btrfs: fix race between writes to swap files and scrub
  2021-02-03 11:17 ` [PATCH 2/4] btrfs: fix race between writes to swap files and scrub fdmanana
@ 2021-02-04  8:48   ` Anand Jain
  2021-02-04 10:11     ` Filipe Manana
  0 siblings, 1 reply; 19+ messages in thread
From: Anand Jain @ 2021-02-04  8:48 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

On 2/3/2021 7:17 PM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When we activate a swap file, at btrfs_swap_activate(), we acquire the
> exclusive operation lock to prevent the physical location of the swap
> file extents from being changed by operations such as balance and device
> replace/resize/remove. There we also call can_nocow_extent() which,
> among other things, checks if the block group of a swap file extent is
> currently RO, and if it is we cannot use the extent, since a write
> into it would result in COWing the extent.
>
> However we have no protection against a scrub operation running after we
> activate the swap file, which can result in the swap file extents being
> COWed while the scrub is running and operating on the respective block
> group, because scrub turns a block group RO before it processes it
> and then back to RW mode after processing it. That means an attempt
> to write into a swap file extent while scrub is processing the respective
> block group will result in COWing the extent, changing its physical
> location on disk.
>
> Fix this by making sure that block groups that have extents that are used
> by active swap files cannot be turned RO, therefore making it
> impossible for a scrub to turn them RO.

> When a scrub finds a
> block group that cannot be turned RO due to the existence of extents
> used by swap files, it proceeds to the next block group and logs a warning
> message that mentions the block group was skipped due to active swap
> files - this is the same approach we currently use for balance.

  It would be better if this info were documented in the scrub man-page, IMO.

> This ends up removing the need to call btrfs_extent_readonly() from
> can_nocow_extent(), as btrfs_swap_activate() now checks if a block group
> is RO through the new function btrfs_inc_block_group_swap_extents().
> 
> The only other caller of can_nocow_extent() is the direct IO write path,
> btrfs_get_blocks_direct_write(), but that already checks if a block group
> is RO through the call to btrfs_inc_nocow_writers(). In fact, after this
> change we end up optimizing the direct IO write path, since we no longer
> iterate the block groups rbtree twice, once with btrfs_extent_readonly(),
> through can_nocow_extent(), and once again with btrfs_inc_nocow_writers().
> This can save time and reduce contention on the lock that protects the
> rbtree (especially because it is a spinlock and not a read/write lock) on
> very large filesystems, with several thousand allocated block groups.
> 
> Fixes: ed46ff3d42378 ("Btrfs: support swap files")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

  I am not sure about the direct IO optimization part, but otherwise the
  changes look good.

  Reviewed-by: Anand Jain <anand.jain@oracle.com>

  Thanks, Anand

> ---
>   fs/btrfs/block-group.c | 33 ++++++++++++++++++++++++++++++++-
>   fs/btrfs/block-group.h |  9 +++++++++
>   fs/btrfs/ctree.h       |  5 +++++
>   fs/btrfs/inode.c       | 22 ++++++++++++++++++----
>   fs/btrfs/scrub.c       |  9 ++++++++-
>   5 files changed, 72 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 5fa6b3d540f4..c0a8ddf92ef8 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1150,6 +1150,11 @@ static int inc_block_group_ro(struct btrfs_block_group *cache, int force)
>   	spin_lock(&sinfo->lock);
>   	spin_lock(&cache->lock);
>   
> +	if (cache->swap_extents) {
> +		ret = -ETXTBSY;
> +		goto out;
> +	}
> +
>   	if (cache->ro) {
>   		cache->ro++;
>   		ret = 0;
> @@ -2260,7 +2265,7 @@ int btrfs_inc_block_group_ro(struct btrfs_block_group *cache,
>   	}
>   
>   	ret = inc_block_group_ro(cache, 0);
> -	if (!do_chunk_alloc)
> +	if (!do_chunk_alloc || ret == -ETXTBSY)
>   		goto unlock_out;
>   	if (!ret)
>   		goto out;
> @@ -2269,6 +2274,8 @@ int btrfs_inc_block_group_ro(struct btrfs_block_group *cache,
>   	if (ret < 0)
>   		goto out;
>   	ret = inc_block_group_ro(cache, 0);
> +	if (ret == -ETXTBSY)
> +		goto unlock_out;
>   out:
>   	if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
>   		alloc_flags = btrfs_get_alloc_profile(fs_info, cache->flags);
> @@ -3352,6 +3359,7 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
>   		ASSERT(list_empty(&block_group->io_list));
>   		ASSERT(list_empty(&block_group->bg_list));
>   		ASSERT(refcount_read(&block_group->refs) == 1);
> +		ASSERT(block_group->swap_extents == 0);
>   		btrfs_put_block_group(block_group);
>   
>   		spin_lock(&info->block_group_cache_lock);
> @@ -3418,3 +3426,26 @@ void btrfs_unfreeze_block_group(struct btrfs_block_group *block_group)
>   		__btrfs_remove_free_space_cache(block_group->free_space_ctl);
>   	}
>   }
> +
> +bool btrfs_inc_block_group_swap_extents(struct btrfs_block_group *bg)
> +{
> +	bool ret = true;
> +
> +	spin_lock(&bg->lock);
> +	if (bg->ro)
> +		ret = false;
> +	else
> +		bg->swap_extents++;
> +	spin_unlock(&bg->lock);
> +
> +	return ret;
> +}
> +
> +void btrfs_dec_block_group_swap_extents(struct btrfs_block_group *bg, int amount)
> +{
> +	spin_lock(&bg->lock);
> +	ASSERT(!bg->ro);
> +	ASSERT(bg->swap_extents >= amount);
> +	bg->swap_extents -= amount;
> +	spin_unlock(&bg->lock);
> +}
> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
> index 8f74a96074f7..105094bd1821 100644
> --- a/fs/btrfs/block-group.h
> +++ b/fs/btrfs/block-group.h
> @@ -181,6 +181,12 @@ struct btrfs_block_group {
>   	 */
>   	int needs_free_space;
>   
> +	/*
> +	 * Number of extents in this block group used for swap files.
> +	 * All accesses protected by the spinlock 'lock'.
> +	 */
> +	int swap_extents;
> +
>   	/* Record locked full stripes for RAID5/6 block group */
>   	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
>   };
> @@ -296,6 +302,9 @@ static inline int btrfs_block_group_done(struct btrfs_block_group *cache)
>   void btrfs_freeze_block_group(struct btrfs_block_group *cache);
>   void btrfs_unfreeze_block_group(struct btrfs_block_group *cache);
>   
> +bool btrfs_inc_block_group_swap_extents(struct btrfs_block_group *bg);
> +void btrfs_dec_block_group_swap_extents(struct btrfs_block_group *bg, int amount);
> +
>   #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>   int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
>   		     u64 physical, u64 **logical, int *naddrs, int *stripe_len);
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index ed6bb46a2572..5269777a4fb4 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -523,6 +523,11 @@ struct btrfs_swapfile_pin {
>   	 * points to a struct btrfs_device.
>   	 */
>   	bool is_block_group;
> +	/*
> +	 * Only used when 'is_block_group' is true and it is the number of
> +	 * extents used by a swapfile for this block group ('ptr' field).
> +	 */
> +	int bg_extent_count;
>   };
>   
>   bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index b10fc42f9e9a..464c289c402d 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7204,9 +7204,6 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
>   		*ram_bytes = btrfs_file_extent_ram_bytes(leaf, fi);
>   	}
>   
> -	if (btrfs_extent_readonly(fs_info, disk_bytenr))
> -		goto out;
> -
>   	num_bytes = min(offset + *len, extent_end) - offset;
>   	if (!nocow && found_type == BTRFS_FILE_EXTENT_PREALLOC) {
>   		u64 range_end;
> @@ -9990,6 +9987,7 @@ static int btrfs_add_swapfile_pin(struct inode *inode, void *ptr,
>   	sp->ptr = ptr;
>   	sp->inode = inode;
>   	sp->is_block_group = is_block_group;
> +	sp->bg_extent_count = 1;
>   
>   	spin_lock(&fs_info->swapfile_pins_lock);
>   	p = &fs_info->swapfile_pins.rb_node;
> @@ -10003,6 +10001,8 @@ static int btrfs_add_swapfile_pin(struct inode *inode, void *ptr,
>   			   (sp->ptr == entry->ptr && sp->inode > entry->inode)) {
>   			p = &(*p)->rb_right;
>   		} else {
> +			if (is_block_group)
> +				entry->bg_extent_count++;
>   			spin_unlock(&fs_info->swapfile_pins_lock);
>   			kfree(sp);
>   			return 1;
> @@ -10028,8 +10028,11 @@ static void btrfs_free_swapfile_pins(struct inode *inode)
>   		sp = rb_entry(node, struct btrfs_swapfile_pin, node);
>   		if (sp->inode == inode) {
>   			rb_erase(&sp->node, &fs_info->swapfile_pins);
> -			if (sp->is_block_group)
> +			if (sp->is_block_group) {
> +				btrfs_dec_block_group_swap_extents(sp->ptr,
> +							   sp->bg_extent_count);
>   				btrfs_put_block_group(sp->ptr);
> +			}
>   			kfree(sp);
>   		}
>   		node = next;
> @@ -10244,6 +10247,17 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
>   			goto out;
>   		}
>   
> +		if (!btrfs_inc_block_group_swap_extents(bg)) {
> +			btrfs_warn(fs_info,
> +			   "block group for swapfile at %llu is read-only%s",
> +			   bg->start,
> +			   atomic_read(&fs_info->scrubs_running) ?
> +				   " (scrub running)" : "");
> +			btrfs_put_block_group(bg);
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
>   		ret = btrfs_add_swapfile_pin(inode, bg, true);
>   		if (ret) {
>   			btrfs_put_block_group(bg);
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 5f4f88a4d2c8..c09a494be8c6 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3630,6 +3630,13 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>   			 * commit_transactions.
>   			 */
>   			ro_set = 0;
> +		} else if (ret == -ETXTBSY) {
> +			btrfs_warn(fs_info,
> +		   "skipping scrub of block group %llu due to active swapfile",
> +				   cache->start);
> +			scrub_pause_off(fs_info);
> +			ret = 0;
> +			goto skip_unfreeze;
>   		} else {
>   			btrfs_warn(fs_info,
>   				   "failed setting block group ro: %d", ret);
> @@ -3719,7 +3726,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>   		} else {
>   			spin_unlock(&cache->lock);
>   		}
> -
> +skip_unfreeze:
>   		btrfs_unfreeze_block_group(cache);
>   		btrfs_put_block_group(cache);
>   		if (ret)
> 



* Re: [PATCH 2/4] btrfs: fix race between writes to swap files and scrub
  2021-02-04  8:48   ` Anand Jain
@ 2021-02-04 10:11     ` Filipe Manana
  2021-02-05  7:44       ` Anand Jain
  0 siblings, 1 reply; 19+ messages in thread
From: Filipe Manana @ 2021-02-04 10:11 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Thu, Feb 4, 2021 at 8:48 AM Anand Jain <anand.jain@oracle.com> wrote:
>
> On 2/3/2021 7:17 PM, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When we activate a swap file, at btrfs_swap_activate(), we acquire the
> > exclusive operation lock to prevent the physical location of the swap
> > file extents from being changed by operations such as balance and device
> > replace/resize/remove. There we also call can_nocow_extent() which,
> > among other things, checks if the block group of a swap file extent is
> > currently RO, and if it is we cannot use the extent, since a write
> > into it would result in COWing the extent.
> >
> > However we have no protection against a scrub operation running after we
> > activate the swap file, which can result in the swap file extents being
> > COWed while the scrub is running and operating on the respective block
> > group, because scrub turns a block group RO before it processes it
> > and then back to RW mode after processing it. That means an attempt
> > to write into a swap file extent while scrub is processing the respective
> > block group will result in COWing the extent, changing its physical
> > location on disk.
> >
> > Fix this by making sure that block groups that have extents that are used
> > by active swap files cannot be turned RO, therefore making it
> > impossible for a scrub to turn them RO.
>
> > When a scrub finds a
> > block group that cannot be turned RO due to the existence of extents
> > used by swap files, it proceeds to the next block group and logs a warning
> > message that mentions the block group was skipped due to active swap
> > files - this is the same approach we currently use for balance.
>
>   It would be better if this info were documented in the scrub man-page, IMO.
>
> > This ends up removing the need to call btrfs_extent_readonly() from
> > can_nocow_extent(), as btrfs_swap_activate() now checks if a block group
> > is RO through the new function btrfs_inc_block_group_swap_extents().
> >
> > The only other caller of can_nocow_extent() is the direct IO write path,
> > btrfs_get_blocks_direct_write(), but that already checks if a block group
> > is RO through the call to btrfs_inc_nocow_writers(). In fact, after this
> > change we end up optimizing the direct IO write path, since we no longer
> > iterate the block groups rbtree twice, once with btrfs_extent_readonly(),
> > through can_nocow_extent(), and once again with btrfs_inc_nocow_writers().
> > This can save time and reduce contention on the lock that protects the
> > rbtree (especially because it is a spinlock and not a read/write lock) on
> > very large filesystems, with several thousand allocated block groups.
> >
> > Fixes: ed46ff3d42378 ("Btrfs: support swap files")
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
>
>   I am not sure about the direct IO optimization part, but otherwise the
>   changes look good.

So, if you could understand the buffered IO path (first patch in the
series), how can you not be sure about the direct IO path (which does
exactly the same)?

>
>   Reviewed-by: Anand Jain <anand.jain@oracle.com>

Hum, and how can you give a Reviewed-by tag when you are not sure
about some part of the code? That doesn't make sense to me.

Thanks.

>
>   Thanks, Anand
>
> > ---
> >   fs/btrfs/block-group.c | 33 ++++++++++++++++++++++++++++++++-
> >   fs/btrfs/block-group.h |  9 +++++++++
> >   fs/btrfs/ctree.h       |  5 +++++
> >   fs/btrfs/inode.c       | 22 ++++++++++++++++++----
> >   fs/btrfs/scrub.c       |  9 ++++++++-
> >   5 files changed, 72 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> > index 5fa6b3d540f4..c0a8ddf92ef8 100644
> > --- a/fs/btrfs/block-group.c
> > +++ b/fs/btrfs/block-group.c
> > @@ -1150,6 +1150,11 @@ static int inc_block_group_ro(struct btrfs_block_group *cache, int force)
> >       spin_lock(&sinfo->lock);
> >       spin_lock(&cache->lock);
> >
> > +     if (cache->swap_extents) {
> > +             ret = -ETXTBSY;
> > +             goto out;
> > +     }
> > +
> >       if (cache->ro) {
> >               cache->ro++;
> >               ret = 0;
> > @@ -2260,7 +2265,7 @@ int btrfs_inc_block_group_ro(struct btrfs_block_group *cache,
> >       }
> >
> >       ret = inc_block_group_ro(cache, 0);
> > -     if (!do_chunk_alloc)
> > +     if (!do_chunk_alloc || ret == -ETXTBSY)
> >               goto unlock_out;
> >       if (!ret)
> >               goto out;
> > @@ -2269,6 +2274,8 @@ int btrfs_inc_block_group_ro(struct btrfs_block_group *cache,
> >       if (ret < 0)
> >               goto out;
> >       ret = inc_block_group_ro(cache, 0);
> > +     if (ret == -ETXTBSY)
> > +             goto unlock_out;
> >   out:
> >       if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
> >               alloc_flags = btrfs_get_alloc_profile(fs_info, cache->flags);
> > @@ -3352,6 +3359,7 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
> >               ASSERT(list_empty(&block_group->io_list));
> >               ASSERT(list_empty(&block_group->bg_list));
> >               ASSERT(refcount_read(&block_group->refs) == 1);
> > +             ASSERT(block_group->swap_extents == 0);
> >               btrfs_put_block_group(block_group);
> >
> >               spin_lock(&info->block_group_cache_lock);
> > @@ -3418,3 +3426,26 @@ void btrfs_unfreeze_block_group(struct btrfs_block_group *block_group)
> >               __btrfs_remove_free_space_cache(block_group->free_space_ctl);
> >       }
> >   }
> > +
> > +bool btrfs_inc_block_group_swap_extents(struct btrfs_block_group *bg)
> > +{
> > +     bool ret = true;
> > +
> > +     spin_lock(&bg->lock);
> > +     if (bg->ro)
> > +             ret = false;
> > +     else
> > +             bg->swap_extents++;
> > +     spin_unlock(&bg->lock);
> > +
> > +     return ret;
> > +}
> > +
> > +void btrfs_dec_block_group_swap_extents(struct btrfs_block_group *bg, int amount)
> > +{
> > +     spin_lock(&bg->lock);
> > +     ASSERT(!bg->ro);
> > +     ASSERT(bg->swap_extents >= amount);
> > +     bg->swap_extents -= amount;
> > +     spin_unlock(&bg->lock);
> > +}
> > diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
> > index 8f74a96074f7..105094bd1821 100644
> > --- a/fs/btrfs/block-group.h
> > +++ b/fs/btrfs/block-group.h
> > @@ -181,6 +181,12 @@ struct btrfs_block_group {
> >        */
> >       int needs_free_space;
> >
> > +     /*
> > +      * Number of extents in this block group used for swap files.
> > +      * All accesses protected by the spinlock 'lock'.
> > +      */
> > +     int swap_extents;
> > +
> >       /* Record locked full stripes for RAID5/6 block group */
> >       struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
> >   };
> > @@ -296,6 +302,9 @@ static inline int btrfs_block_group_done(struct btrfs_block_group *cache)
> >   void btrfs_freeze_block_group(struct btrfs_block_group *cache);
> >   void btrfs_unfreeze_block_group(struct btrfs_block_group *cache);
> >
> > +bool btrfs_inc_block_group_swap_extents(struct btrfs_block_group *bg);
> > +void btrfs_dec_block_group_swap_extents(struct btrfs_block_group *bg, int amount);
> > +
> >   #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> >   int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
> >                    u64 physical, u64 **logical, int *naddrs, int *stripe_len);
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index ed6bb46a2572..5269777a4fb4 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -523,6 +523,11 @@ struct btrfs_swapfile_pin {
> >        * points to a struct btrfs_device.
> >        */
> >       bool is_block_group;
> > +     /*
> > +      * Only used when 'is_block_group' is true and it is the number of
> > +      * extents used by a swapfile for this block group ('ptr' field).
> > +      */
> > +     int bg_extent_count;
> >   };
> >
> >   bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index b10fc42f9e9a..464c289c402d 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -7204,9 +7204,6 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
> >               *ram_bytes = btrfs_file_extent_ram_bytes(leaf, fi);
> >       }
> >
> > -     if (btrfs_extent_readonly(fs_info, disk_bytenr))
> > -             goto out;
> > -
> >       num_bytes = min(offset + *len, extent_end) - offset;
> >       if (!nocow && found_type == BTRFS_FILE_EXTENT_PREALLOC) {
> >               u64 range_end;
> > @@ -9990,6 +9987,7 @@ static int btrfs_add_swapfile_pin(struct inode *inode, void *ptr,
> >       sp->ptr = ptr;
> >       sp->inode = inode;
> >       sp->is_block_group = is_block_group;
> > +     sp->bg_extent_count = 1;
> >
> >       spin_lock(&fs_info->swapfile_pins_lock);
> >       p = &fs_info->swapfile_pins.rb_node;
> > @@ -10003,6 +10001,8 @@ static int btrfs_add_swapfile_pin(struct inode *inode, void *ptr,
> >                          (sp->ptr == entry->ptr && sp->inode > entry->inode)) {
> >                       p = &(*p)->rb_right;
> >               } else {
> > +                     if (is_block_group)
> > +                             entry->bg_extent_count++;
> >                       spin_unlock(&fs_info->swapfile_pins_lock);
> >                       kfree(sp);
> >                       return 1;
> > @@ -10028,8 +10028,11 @@ static void btrfs_free_swapfile_pins(struct inode *inode)
> >               sp = rb_entry(node, struct btrfs_swapfile_pin, node);
> >               if (sp->inode == inode) {
> >                       rb_erase(&sp->node, &fs_info->swapfile_pins);
> > -                     if (sp->is_block_group)
> > +                     if (sp->is_block_group) {
> > +                             btrfs_dec_block_group_swap_extents(sp->ptr,
> > +                                                        sp->bg_extent_count);
> >                               btrfs_put_block_group(sp->ptr);
> > +                     }
> >                       kfree(sp);
> >               }
> >               node = next;
> > @@ -10244,6 +10247,17 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
> >                       goto out;
> >               }
> >
> > +             if (!btrfs_inc_block_group_swap_extents(bg)) {
> > +                     btrfs_warn(fs_info,
> > +                        "block group for swapfile at %llu is read-only%s",
> > +                        bg->start,
> > +                        atomic_read(&fs_info->scrubs_running) ?
> > +                                " (scrub running)" : "");
> > +                     btrfs_put_block_group(bg);
> > +                     ret = -EINVAL;
> > +                     goto out;
> > +             }
> > +
> >               ret = btrfs_add_swapfile_pin(inode, bg, true);
> >               if (ret) {
> >                       btrfs_put_block_group(bg);
> > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> > index 5f4f88a4d2c8..c09a494be8c6 100644
> > --- a/fs/btrfs/scrub.c
> > +++ b/fs/btrfs/scrub.c
> > @@ -3630,6 +3630,13 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
> >                        * commit_transactions.
> >                        */
> >                       ro_set = 0;
> > +             } else if (ret == -ETXTBSY) {
> > +                     btrfs_warn(fs_info,
> > +                "skipping scrub of block group %llu due to active swapfile",
> > +                                cache->start);
> > +                     scrub_pause_off(fs_info);
> > +                     ret = 0;
> > +                     goto skip_unfreeze;
> >               } else {
> >                       btrfs_warn(fs_info,
> >                                  "failed setting block group ro: %d", ret);
> > @@ -3719,7 +3726,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
> >               } else {
> >                       spin_unlock(&cache->lock);
> >               }
> > -
> > +skip_unfreeze:
> >               btrfs_unfreeze_block_group(cache);
> >               btrfs_put_block_group(cache);
> >               if (ret)
> >
>


* Re: [PATCH 2/4] btrfs: fix race between writes to swap files and scrub
  2021-02-04 10:11     ` Filipe Manana
@ 2021-02-05  7:44       ` Anand Jain
  2021-02-05 12:54         ` Filipe Manana
  0 siblings, 1 reply; 19+ messages in thread
From: Anand Jain @ 2021-02-05  7:44 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On 2/4/2021 6:11 PM, Filipe Manana wrote:
> On Thu, Feb 4, 2021 at 8:48 AM Anand Jain <anand.jain@oracle.com> wrote:
>>
>> On 2/3/2021 7:17 PM, fdmanana@kernel.org wrote:
>>> From: Filipe Manana <fdmanana@suse.com>
>>>
>>> When we activate a swap file, at btrfs_swap_activate(), we acquire the
>>> exclusive operation lock to prevent the physical location of the swap
>>> file extents from being changed by operations such as balance and device
>>> replace/resize/remove. There we also call can_nocow_extent() which,
>>> among other things, checks if the block group of a swap file extent is
>>> currently RO, and if it is we cannot use the extent, since a write
>>> into it would result in COWing the extent.
>>>
>>> However we have no protection against a scrub operation running after we
>>> activate the swap file, which can result in the swap file extents being
>>> COWed while the scrub is running and operating on the respective block
>>> group, because scrub turns a block group RO before it processes it
>>> and then back to RW mode after processing it. That means an attempt
>>> to write into a swap file extent while scrub is processing the respective
>>> block group will result in COWing the extent, changing its physical
>>> location on disk.
>>>
>>> Fix this by making sure that block groups that have extents that are used
>>> by active swap files cannot be turned RO, therefore making it
>>> impossible for a scrub to turn them RO.
>>
>>> When a scrub finds a
>>> block group that cannot be turned RO due to the existence of extents
>>> used by swap files, it proceeds to the next block group and logs a warning
>>> message that mentions the block group was skipped due to active swap
>>> files - this is the same approach we currently use for balance.
>>
>>    It would be better if this info were documented in the scrub man-page, IMO.
>>
>>> This ends up removing the need to call btrfs_extent_readonly() from
>>> can_nocow_extent(), as btrfs_swap_activate() now checks if a block group
>>> is RO through the new function btrfs_inc_block_group_swap_extents().
>>>


>>> The only other caller of can_nocow_extent() is the direct IO write path,

There is a third caller. check_can_nocow() also calls
can_nocow_extent(), which I missed before. Any idea how the threads
using check_can_nocow() get to know that the extent is RO? I have to
back out the RB for this reason for now.


>>> btrfs_get_blocks_direct_write(), but that already checks if a block group
>>> is RO through the call to btrfs_inc_nocow_writers().

>>> In fact, after this
>>> change we end up optimizing the direct IO write path, since we no longer
>>> iterate the block groups rbtree twice, once with btrfs_extent_readonly(),
>>> through can_nocow_extent(), and once again with btrfs_inc_nocow_writers().
>>> This can save time and reduce contention on the lock that protects the
>>> rbtree (especially because it is a spinlock and not a read/write lock) on
>>> very large filesystems, with several thousand allocated block groups.
>>>
>>> Fixes: ed46ff3d42378 ("Btrfs: support swap files")
>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>
>>    I am not sure about the direct IO optimization part, but otherwise the
>>    changes look good.

To clarify the optimization part (for both buffered and direct IO):
after patches 1 and 2 we now check for RO extents after the calls to
btrfs_cross_ref_exist() and csum_exist_in_range(), both of which do a
search_slot, whereas before patches 1 and 2 we used to fail early (if
the extent is RO) and avoid the search_slot. So I am not sure there is
an optimization.

Thanks, Anand


* Re: [PATCH 2/4] btrfs: fix race between writes to swap files and scrub
  2021-02-05  7:44       ` Anand Jain
@ 2021-02-05 12:54         ` Filipe Manana
  0 siblings, 0 replies; 19+ messages in thread
From: Filipe Manana @ 2021-02-05 12:54 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Fri, Feb 5, 2021 at 7:44 AM Anand Jain <anand.jain@oracle.com> wrote:
>
> On 2/4/2021 6:11 PM, Filipe Manana wrote:
> > On Thu, Feb 4, 2021 at 8:48 AM Anand Jain <anand.jain@oracle.com> wrote:
> >>
> >> On 2/3/2021 7:17 PM, fdmanana@kernel.org wrote:
> >>> From: Filipe Manana <fdmanana@suse.com>
> >>>
> >>> When we activate a swap file, at btrfs_swap_activate(), we acquire the
> >>> exclusive operation lock to prevent the physical location of the swap
> >>> file extents from being changed by operations such as balance and device
> >>> replace/resize/remove. There we also call can_nocow_extent() which,
> >>> among other things, checks if the block group of a swap file extent is
> >>> currently RO, and if it is we cannot use the extent, since a write
> >>> into it would result in COWing the extent.
> >>>
> >>> However we have no protection against a scrub operation running after we
> >>> activate the swap file, which can result in the swap file extents being
> >>> COWed while the scrub is running and operating on the respective block
> >>> group, because scrub turns a block group RO before it processes it
> >>> and then back to RW mode after processing it. That means an attempt
> >>> to write into a swap file extent while scrub is processing the respective
> >>> block group will result in COWing the extent, changing its physical
> >>> location on disk.
> >>>
> >>> Fix this by making sure that block groups that have extents that are used
> >>> by active swap files cannot be turned RO, therefore making it
> >>> impossible for a scrub to turn them RO.
> >>
> >>> When a scrub finds a
> >>> block group that cannot be turned RO due to the existence of extents
> >>> used by swap files, it proceeds to the next block group and logs a warning
> >>> message that mentions the block group was skipped due to active swap
> >>> files - this is the same approach we currently use for balance.
> >>
> >>    It would be better if this info were documented in the scrub man-page, IMO.
> >>
> >>> This ends up removing the need to call btrfs_extent_readonly() from
> >>> can_nocow_extent(), as btrfs_swap_activate() now checks if a block group
> >>> is RO through the new function btrfs_inc_block_group_swap_extents().
> >>>
>
>
> >>> The only other caller of can_nocow_extent() is the direct IO write path,
>
> There is a third caller. check_can_nocow() also calls
> can_nocow_extent(), which I missed before. Any idea how the threads
> using check_can_nocow() get to know that the extent is RO? I have to
> back out the RB for this reason for now.

Yes, that one I missed. However it's arguable how useful it is, because
starting nocow writers and changing a block group from RW to RO is not
atomic, and therefore sometimes it will have no effect, see below.

I'll leave that part out and perhaps deal with the direct IO write path
optimization later, as things are a bit entangled and it is not simple
to distinguish whether we are in the direct IO path or not at
can_nocow_extent().

>
>
> >>> btrfs_get_blocks_direct_write(), but that already checks if a block group
> >>> is RO through the call to btrfs_inc_nocow_writers().
>
> >>> In fact, after this
> >>> change we end up optimizing the direct IO write path, since we no longer
> >>> iterate the block groups rbtree twice, once with btrfs_extent_readonly(),
> >>> through can_nocow_extent(), and once again with btrfs_inc_nocow_writers().
> >>> This can save time and reduce contention on the lock that protects the
> >>> rbtree (especially because it is a spinlock and not a read/write lock) on
> >>> very large filesystems, with several thousand allocated block groups.
> >>>
> >>> Fixes: ed46ff3d42378 ("Btrfs: support swap files")
> >>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> >>
> >>    I am not sure about the direct IO optimization part, but otherwise the
> >>    changes look good.
>
> To clarify the optimization part (for both buffered and direct IO):
> after patches 1 and 2 we now check for RO extents after the calls to
> btrfs_cross_ref_exist() and csum_exist_in_range(), both of which do a
> search_slot, whereas before patches 1 and 2 we used to fail early (if
> the extent is RO) and avoid the search_slot. So I am not sure there is
> an optimization.

And?
Doing a single search is faster than doing 2 searches.
It does not matter to check if a block group is RO before or after
those checks, because:

1) Having a block group RO is a rather exceptional situation and, when
it happens (scrub and balance), it's temporary. We optimize for common
cases; we don't gain anything by checking for it twice. Your concern
goes the other way around: you want to do the RO check first to fall
back more quickly into COW mode - optimizing for the exceptional and
uncommon case. I could move up the call to btrfs_inc_nocow_writers(),
to take the place of the call to btrfs_extent_readonly(), but really
that is pointless since RO block groups are exceptional and temporary,
and it would make the code more complex than needed (having to track
which gotos require decrementing the nocow writers).

2) More importantly, and this is what really matters, have you thought
about what happens if the block group is turned RO right after we
checked that it was RW? Either after calling btrfs_extent_readonly()
and before calling btrfs_inc_nocow_writers(), or after calling both.
Should we have additional checks to see if the block group is now RO
after each one of those calls?

In case you didn't notice, starting a nocow write and setting a block
group RO is not atomic, and that is fine (it's actually much simpler
than making it atomic), because scrub and balance, after turning a
block group RO, wait for any existing nocow writes to complete before
they do anything with the block group's extents - new writes are
guaranteed to not allocate from the block group or write to its
existing extents because the block group is RO now.
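
For reference, a simplified sketch of that ordering on the scrub side
(assuming the wait helpers in block-group.c behave as their names
suggest; this is not the exact scrub_enumerate_chunks() code):

    ret = btrfs_inc_block_group_ro(cache, false);
    if (ret == 0) {
            /* The block group is now RO, so new nocow writers are
             * refused. Drain the ones that raced past the RO check. */
            btrfs_wait_block_group_reservations(cache);
            btrfs_wait_nocow_writers(cache);
            /* From here on nothing writes into this block group. */
    }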

I hope this clarifies why having the RO block group check earlier or
later is irrelevant.

Thanks.

>
> Thanks, Anand


* [PATCH v2 0/3] btrfs: fix a couple swapfile support bugs
  2021-02-03 11:17 [PATCH 0/4] btrfs: fix a couple swapfile support bugs fdmanana
                   ` (3 preceding siblings ...)
  2021-02-03 11:17 ` [PATCH 4/4] btrfs: fix race between swap file activation and snapshot creation fdmanana
@ 2021-02-05 12:55 ` fdmanana
  2021-02-05 12:55   ` [PATCH v2 1/3] btrfs: avoid checking for RO block group twice during nocow writeback fdmanana
                     ` (3 more replies)
  2021-02-10 15:30 ` [PATCH 0/4] " Josef Bacik
  5 siblings, 4 replies; 19+ messages in thread
From: fdmanana @ 2021-02-05 12:55 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

The following patchset fixes 2 bugs with the swapfile support, where we can
end up falling back to COW when writing to an active swapfile. The first patch
is actually independent and just makes the nocow buffered IO path more efficient
by eliminating a repeated check for a read-only block group.

V2: Removed the part of optimizing the direct IO nocow path from patch 2,
    because removing the RO block group check from can_nocow_extent() would
    prevent the buffered write path, which checks if we can fall back to
    nocow at write time (and not writeback time) after hitting ENOSPC when
    attempting to reserve data space, from doing that check. The
    optimization can still be done, but it would require adding more
    context information to can_nocow_extent(), so it could know when it
    needs to check if the block group is RO or not - since things are a bit
    entangled around that function and its callers, I've left it out for
    now.

Filipe Manana (3):
  btrfs: avoid checking for RO block group twice during nocow writeback
  btrfs: fix race between writes to swap files and scrub
  btrfs: fix race between swap file activation and snapshot creation

 fs/btrfs/block-group.c | 33 ++++++++++++++++++++++++++++++-
 fs/btrfs/block-group.h |  9 +++++++++
 fs/btrfs/ctree.h       |  5 +++++
 fs/btrfs/inode.c       | 44 ++++++++++++++++++++++++++++++++++++------
 fs/btrfs/scrub.c       |  9 ++++++++-
 5 files changed, 92 insertions(+), 8 deletions(-)

-- 
2.28.0


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 1/3] btrfs: avoid checking for RO block group twice during nocow writeback
  2021-02-05 12:55 ` [PATCH v2 0/3] btrfs: fix a couple swapfile support bugs fdmanana
@ 2021-02-05 12:55   ` fdmanana
  2021-02-10 12:28     ` Anand Jain
  2021-02-05 12:55   ` [PATCH v2 2/3] btrfs: fix race between writes to swap files and scrub fdmanana
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: fdmanana @ 2021-02-05 12:55 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

During the nocow writeback path, we currently iterate the rbtree of block
groups twice: once for checking if the target block group is RO, with the
call to btrfs_extent_readonly(), and once again for getting a nocow
reference on the block group, with a call to btrfs_inc_nocow_writers().

Since btrfs_inc_nocow_writers() already returns false when the target
block group is RO, remove the call to btrfs_extent_readonly(). Not only
do we avoid searching the block groups rbtree twice, it also helps reduce
contention on the lock that protects it (especially since it is a spin
lock and not a read-write lock). That may make a noticeable difference
on very large filesystems, with thousands of allocated block groups.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 04cd95899ac8..76a0151ef05a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1657,9 +1657,6 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 			 */
 			btrfs_release_path(path);
 
-			/* If extent is RO, we must COW it */
-			if (btrfs_extent_readonly(fs_info, disk_bytenr))
-				goto out_check;
 			ret = btrfs_cross_ref_exist(root, ino,
 						    found_key.offset -
 						    extent_offset, disk_bytenr, false);
@@ -1706,6 +1703,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 				WARN_ON_ONCE(freespace_inode);
 				goto out_check;
 			}
+			/* If the extent's block group is RO, we must COW. */
 			if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr))
 				goto out_check;
 			nocow = true;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 2/3] btrfs: fix race between writes to swap files and scrub
  2021-02-05 12:55 ` [PATCH v2 0/3] btrfs: fix a couple swapfile support bugs fdmanana
  2021-02-05 12:55   ` [PATCH v2 1/3] btrfs: avoid checking for RO block group twice during nocow writeback fdmanana
@ 2021-02-05 12:55   ` fdmanana
  2021-02-10 16:54     ` Anand Jain
  2021-02-05 12:55   ` [PATCH v2 3/3] btrfs: fix race between swap file activation and snapshot creation fdmanana
  2021-02-10 22:15   ` [PATCH v2 0/3] btrfs: fix a couple swapfile support bugs David Sterba
  3 siblings, 1 reply; 19+ messages in thread
From: fdmanana @ 2021-02-05 12:55 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When we activate a swap file, at btrfs_swap_activate(), we acquire the
exclusive operation lock to prevent the physical location of the swap
file extents from being changed by operations such as balance and device
replace/resize/remove. There we also call can_nocow_extent() which,
among other things, checks if the block group of a swap file extent is
currently RO, and if it is we can not use the extent, since a write
into it would result in COWing the extent.

However we have no protection against a scrub operation running after we
activate the swap file, which can result in the swap file extents being
COWed while the scrub is running and operating on the respective block
group, because scrub turns a block group RO before it processes it and
then back to RW mode after processing it. That means an attempt to
write into a swap file extent while scrub is processing the respective
block group will result in COWing the extent, changing its physical
location on disk.

Fix this by making sure that block groups that have extents used by
active swap files can not be turned RO, therefore making it impossible
for a scrub to turn them RO. When a scrub finds a block group that can
not be turned RO due to the existence of extents used by swap files, it
proceeds to the next block group and logs a warning message mentioning
that the block group was skipped due to active swap files - this is the
same approach we currently use for balance.

Fixes: ed46ff3d42378 ("Btrfs: support swap files")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/block-group.c | 33 ++++++++++++++++++++++++++++++++-
 fs/btrfs/block-group.h |  9 +++++++++
 fs/btrfs/ctree.h       |  5 +++++
 fs/btrfs/inode.c       | 19 ++++++++++++++++++-
 fs/btrfs/scrub.c       |  9 ++++++++-
 5 files changed, 72 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 5fa6b3d540f4..c0a8ddf92ef8 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1150,6 +1150,11 @@ static int inc_block_group_ro(struct btrfs_block_group *cache, int force)
 	spin_lock(&sinfo->lock);
 	spin_lock(&cache->lock);
 
+	if (cache->swap_extents) {
+		ret = -ETXTBSY;
+		goto out;
+	}
+
 	if (cache->ro) {
 		cache->ro++;
 		ret = 0;
@@ -2260,7 +2265,7 @@ int btrfs_inc_block_group_ro(struct btrfs_block_group *cache,
 	}
 
 	ret = inc_block_group_ro(cache, 0);
-	if (!do_chunk_alloc)
+	if (!do_chunk_alloc || ret == -ETXTBSY)
 		goto unlock_out;
 	if (!ret)
 		goto out;
@@ -2269,6 +2274,8 @@ int btrfs_inc_block_group_ro(struct btrfs_block_group *cache,
 	if (ret < 0)
 		goto out;
 	ret = inc_block_group_ro(cache, 0);
+	if (ret == -ETXTBSY)
+		goto unlock_out;
 out:
 	if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
 		alloc_flags = btrfs_get_alloc_profile(fs_info, cache->flags);
@@ -3352,6 +3359,7 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
 		ASSERT(list_empty(&block_group->io_list));
 		ASSERT(list_empty(&block_group->bg_list));
 		ASSERT(refcount_read(&block_group->refs) == 1);
+		ASSERT(block_group->swap_extents == 0);
 		btrfs_put_block_group(block_group);
 
 		spin_lock(&info->block_group_cache_lock);
@@ -3418,3 +3426,26 @@ void btrfs_unfreeze_block_group(struct btrfs_block_group *block_group)
 		__btrfs_remove_free_space_cache(block_group->free_space_ctl);
 	}
 }
+
+bool btrfs_inc_block_group_swap_extents(struct btrfs_block_group *bg)
+{
+	bool ret = true;
+
+	spin_lock(&bg->lock);
+	if (bg->ro)
+		ret = false;
+	else
+		bg->swap_extents++;
+	spin_unlock(&bg->lock);
+
+	return ret;
+}
+
+void btrfs_dec_block_group_swap_extents(struct btrfs_block_group *bg, int amount)
+{
+	spin_lock(&bg->lock);
+	ASSERT(!bg->ro);
+	ASSERT(bg->swap_extents >= amount);
+	bg->swap_extents -= amount;
+	spin_unlock(&bg->lock);
+}
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 8f74a96074f7..105094bd1821 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -181,6 +181,12 @@ struct btrfs_block_group {
 	 */
 	int needs_free_space;
 
+	/*
+	 * Number of extents in this block group used for swap files.
+	 * All accesses protected by the spinlock 'lock'.
+	 */
+	int swap_extents;
+
 	/* Record locked full stripes for RAID5/6 block group */
 	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
 };
@@ -296,6 +302,9 @@ static inline int btrfs_block_group_done(struct btrfs_block_group *cache)
 void btrfs_freeze_block_group(struct btrfs_block_group *cache);
 void btrfs_unfreeze_block_group(struct btrfs_block_group *cache);
 
+bool btrfs_inc_block_group_swap_extents(struct btrfs_block_group *bg);
+void btrfs_dec_block_group_swap_extents(struct btrfs_block_group *bg, int amount);
+
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
 		     u64 physical, u64 **logical, int *naddrs, int *stripe_len);
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a9b0521d9e89..0597e85ab38c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -523,6 +523,11 @@ struct btrfs_swapfile_pin {
 	 * points to a struct btrfs_device.
 	 */
 	bool is_block_group;
+	/*
+	 * Only used when 'is_block_group' is true and it is the number of
+	 * extents used by a swapfile for this block group ('ptr' field).
+	 */
+	int bg_extent_count;
 };
 
 bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 76a0151ef05a..715ae1320383 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10010,6 +10010,7 @@ static int btrfs_add_swapfile_pin(struct inode *inode, void *ptr,
 	sp->ptr = ptr;
 	sp->inode = inode;
 	sp->is_block_group = is_block_group;
+	sp->bg_extent_count = 1;
 
 	spin_lock(&fs_info->swapfile_pins_lock);
 	p = &fs_info->swapfile_pins.rb_node;
@@ -10023,6 +10024,8 @@ static int btrfs_add_swapfile_pin(struct inode *inode, void *ptr,
 			   (sp->ptr == entry->ptr && sp->inode > entry->inode)) {
 			p = &(*p)->rb_right;
 		} else {
+			if (is_block_group)
+				entry->bg_extent_count++;
 			spin_unlock(&fs_info->swapfile_pins_lock);
 			kfree(sp);
 			return 1;
@@ -10048,8 +10051,11 @@ static void btrfs_free_swapfile_pins(struct inode *inode)
 		sp = rb_entry(node, struct btrfs_swapfile_pin, node);
 		if (sp->inode == inode) {
 			rb_erase(&sp->node, &fs_info->swapfile_pins);
-			if (sp->is_block_group)
+			if (sp->is_block_group) {
+				btrfs_dec_block_group_swap_extents(sp->ptr,
+							   sp->bg_extent_count);
 				btrfs_put_block_group(sp->ptr);
+			}
 			kfree(sp);
 		}
 		node = next;
@@ -10264,6 +10270,17 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 			goto out;
 		}
 
+		if (!btrfs_inc_block_group_swap_extents(bg)) {
+			btrfs_warn(fs_info,
+			   "block group for swapfile at %llu is read-only%s",
+			   bg->start,
+			   atomic_read(&fs_info->scrubs_running) ?
+				   " (scrub running)" : "");
+			btrfs_put_block_group(bg);
+			ret = -EINVAL;
+			goto out;
+		}
+
 		ret = btrfs_add_swapfile_pin(inode, bg, true);
 		if (ret) {
 			btrfs_put_block_group(bg);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 5f4f88a4d2c8..c09a494be8c6 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3630,6 +3630,13 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 			 * commit_transactions.
 			 */
 			ro_set = 0;
+		} else if (ret == -ETXTBSY) {
+			btrfs_warn(fs_info,
+		   "skipping scrub of block group %llu due to active swapfile",
+				   cache->start);
+			scrub_pause_off(fs_info);
+			ret = 0;
+			goto skip_unfreeze;
 		} else {
 			btrfs_warn(fs_info,
 				   "failed setting block group ro: %d", ret);
@@ -3719,7 +3726,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		} else {
 			spin_unlock(&cache->lock);
 		}
-
+skip_unfreeze:
 		btrfs_unfreeze_block_group(cache);
 		btrfs_put_block_group(cache);
 		if (ret)
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 3/3] btrfs: fix race between swap file activation and snapshot creation
  2021-02-05 12:55 ` [PATCH v2 0/3] btrfs: fix a couple swapfile support bugs fdmanana
  2021-02-05 12:55   ` [PATCH v2 1/3] btrfs: avoid checking for RO block group twice during nocow writeback fdmanana
  2021-02-05 12:55   ` [PATCH v2 2/3] btrfs: fix race between writes to swap files and scrub fdmanana
@ 2021-02-05 12:55   ` fdmanana
  2021-02-10 17:19     ` Anand Jain
  2021-02-10 22:15   ` [PATCH v2 0/3] btrfs: fix a couple swapfile support bugs David Sterba
  3 siblings, 1 reply; 19+ messages in thread
From: fdmanana @ 2021-02-05 12:55 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When creating a snapshot we check if the current number of swap files in
the root is non-zero, and if it is, we error out and warn that we can not
create the snapshot because there are active swap files.

However this is racy, because by the time a task starts activation of a
swap file, another task might have already started snapshot creation and
seen the counter for the number of swap files as zero. This means that
after the swap file is activated we may end up with a snapshot of the
same root successfully created, and therefore when the first write to
the swap file happens it has to fall back into COW mode, which should
never happen for active swap files.

Basically what can happen is:

1) Task A starts snapshot creation and enters ioctl.c:create_snapshot().
   There it sees that root->nr_swapfiles has a value of 0 so it continues;

2) Task B enters btrfs_swap_activate(). It is not aware that another task
   started snapshot creation but it did not finish yet. It increments
   root->nr_swapfiles from 0 to 1;

3) Task B checks that the file meets all requirements to be an active
   swap file - it has NOCOW set, there are no snapshots for the inode's
   root at the moment, no file holes, no reflinked extents, etc;

4) Task B returns success and now the file is an active swap file;

5) Task A commits the transaction to create the snapshot and finishes.
   The swap file's extents are now shared between the original root and
   the snapshot;

6) A write into an extent of the swap file is attempted - there is a
   snapshot of the file's root, so we fall back to COW mode and therefore
   the physical location of the extent changes on disk.

So fix this by taking the snapshot lock during swap file activation before
locking the extent range, as that is the order in which we lock these
during buffered writes.

Fixes: ed46ff3d42378 ("Btrfs: support swap files")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 715ae1320383..a9fb6a4eb9dd 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10116,7 +10116,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 			       sector_t *span)
 {
 	struct inode *inode = file_inode(file);
-	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	struct extent_state *cached_state = NULL;
 	struct extent_map *em = NULL;
@@ -10167,13 +10168,27 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	   "cannot activate swapfile while exclusive operation is running");
 		return -EBUSY;
 	}
+
+	/*
+	 * Prevent snapshot creation while we are activating the swap file.
+	 * We do not want to race with snapshot creation. If snapshot creation
+	 * already started before we bumped nr_swapfiles from 0 to 1 and
+	 * completes before the first write into the swap file after it is
+	 * activated, then that write would fall back to COW.
+	 */
+	if (!btrfs_drew_try_write_lock(&root->snapshot_lock)) {
+		btrfs_exclop_finish(fs_info);
+		btrfs_warn(fs_info,
+	   "cannot activate swapfile because snapshot creation is in progress");
+		return -EINVAL;
+	}
 	/*
 	 * Snapshots can create extents which require COW even if NODATACOW is
 	 * set. We use this counter to prevent snapshots. We must increment it
 	 * before walking the extents because we don't want a concurrent
 	 * snapshot to run after we've already checked the extents.
 	 */
-	atomic_inc(&BTRFS_I(inode)->root->nr_swapfiles);
+	atomic_inc(&root->nr_swapfiles);
 
 	isize = ALIGN_DOWN(inode->i_size, fs_info->sectorsize);
 
@@ -10319,6 +10334,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	if (ret)
 		btrfs_swap_deactivate(file);
 
+	btrfs_drew_write_unlock(&root->snapshot_lock);
+
 	btrfs_exclop_finish(fs_info);
 
 	if (ret)
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/3] btrfs: avoid checking for RO block group twice during nocow writeback
  2021-02-05 12:55   ` [PATCH v2 1/3] btrfs: avoid checking for RO block group twice during nocow writeback fdmanana
@ 2021-02-10 12:28     ` Anand Jain
  0 siblings, 0 replies; 19+ messages in thread
From: Anand Jain @ 2021-02-10 12:28 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

On 05/02/2021 20:55, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> During the nocow writeback path, we currently iterate the rbtree of block
> groups twice: once for checking if the target block group is RO, with the
> call to btrfs_extent_readonly(), and once again for getting a nocow
> reference on the block group, with a call to btrfs_inc_nocow_writers().
> 
> Since btrfs_inc_nocow_writers() already returns false when the target
> block group is RO, remove the call to btrfs_extent_readonly(). Not only
> do we avoid searching the block groups rbtree twice, it also helps reduce
> contention on the lock that protects it (especially since it is a spin
> lock and not a read-write lock). That may make a noticeable difference
> on very large filesystems, with thousands of allocated block groups.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Looks like btrfs_inc_nocow_writers() came in later and made the
RO check redundant.

Reviewed-by: Anand Jain <anand.jain@oracle.com>

Thanks, Anand

> ---
>   fs/btrfs/inode.c | 4 +---
>   1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 04cd95899ac8..76a0151ef05a 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1657,9 +1657,6 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>   			 */
>   			btrfs_release_path(path);
>   
> -			/* If extent is RO, we must COW it */
> -			if (btrfs_extent_readonly(fs_info, disk_bytenr))
> -				goto out_check;
>   			ret = btrfs_cross_ref_exist(root, ino,
>   						    found_key.offset -
>   						    extent_offset, disk_bytenr, false);
> @@ -1706,6 +1703,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>   				WARN_ON_ONCE(freespace_inode);
>   				goto out_check;
>   			}
> +			/* If the extent's block group is RO, we must COW. */
>   			if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr))
>   				goto out_check;
>   			nocow = true;
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/4] btrfs: fix a couple swapfile support bugs
  2021-02-03 11:17 [PATCH 0/4] btrfs: fix a couple swapfile support bugs fdmanana
                   ` (4 preceding siblings ...)
  2021-02-05 12:55 ` [PATCH v2 0/3] btrfs: fix a couple swapfile support bugs fdmanana
@ 2021-02-10 15:30 ` Josef Bacik
  5 siblings, 0 replies; 19+ messages in thread
From: Josef Bacik @ 2021-02-10 15:30 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

On 2/3/21 6:17 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> The following patchset fixes 2 bugs with the swapfile support, where we can
> end up falling back to COW when writing to an active swapfile. As a bonus,
> it makes the NOCOW write path, for both buffered and direct IO, more efficient
> by avoiding repeated work when checking if the target block group is
> read-only.
> 

You can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

To the series, thanks,

Josef

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/3] btrfs: fix race between writes to swap files and scrub
  2021-02-05 12:55   ` [PATCH v2 2/3] btrfs: fix race between writes to swap files and scrub fdmanana
@ 2021-02-10 16:54     ` Anand Jain
  0 siblings, 0 replies; 19+ messages in thread
From: Anand Jain @ 2021-02-10 16:54 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

On 05/02/2021 20:55, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When we activate a swap file, at btrfs_swap_activate(), we acquire the
> exclusive operation lock to prevent the physical location of the swap
> file extents from being changed by operations such as balance and device
> replace/resize/remove. There we also call can_nocow_extent() which,
> among other things, checks if the block group of a swap file extent is
> currently RO, and if it is we can not use the extent, since a write
> into it would result in COWing the extent.
> 
> However we have no protection against a scrub operation running after we
> activate the swap file, which can result in the swap file extents being
> COWed while the scrub is running and operating on the respective block
> group, because scrub turns a block group RO before it processes it and
> then back to RW mode after processing it. That means an attempt to
> write into a swap file extent while scrub is processing the respective
> block group will result in COWing the extent, changing its physical
> location on disk.
> 
> Fix this by making sure that block groups that have extents used by
> active swap files can not be turned RO, therefore making it impossible
> for a scrub to turn them RO. When a scrub finds a block group that can
> not be turned RO due to the existence of extents used by swap files, it
> proceeds to the next block group and logs a warning message mentioning
> that the block group was skipped due to active swap files - this is the
> same approach we currently use for balance.
> 
> Fixes: ed46ff3d42378 ("Btrfs: support swap files")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Looks good.
Reviewed-by: Anand Jain <anand.jain@oracle.com>

Thanks, Anand

> ---
>   fs/btrfs/block-group.c | 33 ++++++++++++++++++++++++++++++++-
>   fs/btrfs/block-group.h |  9 +++++++++
>   fs/btrfs/ctree.h       |  5 +++++
>   fs/btrfs/inode.c       | 19 ++++++++++++++++++-
>   fs/btrfs/scrub.c       |  9 ++++++++-
>   5 files changed, 72 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 5fa6b3d540f4..c0a8ddf92ef8 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1150,6 +1150,11 @@ static int inc_block_group_ro(struct btrfs_block_group *cache, int force)
>   	spin_lock(&sinfo->lock);
>   	spin_lock(&cache->lock);
>   
> +	if (cache->swap_extents) {
> +		ret = -ETXTBSY;
> +		goto out;
> +	}
> +
>   	if (cache->ro) {
>   		cache->ro++;
>   		ret = 0;
> @@ -2260,7 +2265,7 @@ int btrfs_inc_block_group_ro(struct btrfs_block_group *cache,
>   	}
>   
>   	ret = inc_block_group_ro(cache, 0);
> -	if (!do_chunk_alloc)
> +	if (!do_chunk_alloc || ret == -ETXTBSY)
>   		goto unlock_out;
>   	if (!ret)
>   		goto out;
> @@ -2269,6 +2274,8 @@ int btrfs_inc_block_group_ro(struct btrfs_block_group *cache,
>   	if (ret < 0)
>   		goto out;
>   	ret = inc_block_group_ro(cache, 0);
> +	if (ret == -ETXTBSY)
> +		goto unlock_out;
>   out:
>   	if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
>   		alloc_flags = btrfs_get_alloc_profile(fs_info, cache->flags);
> @@ -3352,6 +3359,7 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
>   		ASSERT(list_empty(&block_group->io_list));
>   		ASSERT(list_empty(&block_group->bg_list));
>   		ASSERT(refcount_read(&block_group->refs) == 1);
> +		ASSERT(block_group->swap_extents == 0);
>   		btrfs_put_block_group(block_group);
>   
>   		spin_lock(&info->block_group_cache_lock);
> @@ -3418,3 +3426,26 @@ void btrfs_unfreeze_block_group(struct btrfs_block_group *block_group)
>   		__btrfs_remove_free_space_cache(block_group->free_space_ctl);
>   	}
>   }
> +
> +bool btrfs_inc_block_group_swap_extents(struct btrfs_block_group *bg)
> +{
> +	bool ret = true;
> +
> +	spin_lock(&bg->lock);
> +	if (bg->ro)
> +		ret = false;
> +	else
> +		bg->swap_extents++;
> +	spin_unlock(&bg->lock);
> +
> +	return ret;
> +}
> +
> +void btrfs_dec_block_group_swap_extents(struct btrfs_block_group *bg, int amount)
> +{
> +	spin_lock(&bg->lock);
> +	ASSERT(!bg->ro);
> +	ASSERT(bg->swap_extents >= amount);
> +	bg->swap_extents -= amount;
> +	spin_unlock(&bg->lock);
> +}
> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
> index 8f74a96074f7..105094bd1821 100644
> --- a/fs/btrfs/block-group.h
> +++ b/fs/btrfs/block-group.h
> @@ -181,6 +181,12 @@ struct btrfs_block_group {
>   	 */
>   	int needs_free_space;
>   
> +	/*
> +	 * Number of extents in this block group used for swap files.
> +	 * All accesses protected by the spinlock 'lock'.
> +	 */
> +	int swap_extents;
> +
>   	/* Record locked full stripes for RAID5/6 block group */
>   	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
>   };
> @@ -296,6 +302,9 @@ static inline int btrfs_block_group_done(struct btrfs_block_group *cache)
>   void btrfs_freeze_block_group(struct btrfs_block_group *cache);
>   void btrfs_unfreeze_block_group(struct btrfs_block_group *cache);
>   
> +bool btrfs_inc_block_group_swap_extents(struct btrfs_block_group *bg);
> +void btrfs_dec_block_group_swap_extents(struct btrfs_block_group *bg, int amount);
> +
>   #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>   int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start,
>   		     u64 physical, u64 **logical, int *naddrs, int *stripe_len);
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index a9b0521d9e89..0597e85ab38c 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -523,6 +523,11 @@ struct btrfs_swapfile_pin {
>   	 * points to a struct btrfs_device.
>   	 */
>   	bool is_block_group;
> +	/*
> +	 * Only used when 'is_block_group' is true and it is the number of
> +	 * extents used by a swapfile for this block group ('ptr' field).
> +	 */
> +	int bg_extent_count;
>   };
>   
>   bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 76a0151ef05a..715ae1320383 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -10010,6 +10010,7 @@ static int btrfs_add_swapfile_pin(struct inode *inode, void *ptr,
>   	sp->ptr = ptr;
>   	sp->inode = inode;
>   	sp->is_block_group = is_block_group;
> +	sp->bg_extent_count = 1;
>   
>   	spin_lock(&fs_info->swapfile_pins_lock);
>   	p = &fs_info->swapfile_pins.rb_node;
> @@ -10023,6 +10024,8 @@ static int btrfs_add_swapfile_pin(struct inode *inode, void *ptr,
>   			   (sp->ptr == entry->ptr && sp->inode > entry->inode)) {
>   			p = &(*p)->rb_right;
>   		} else {
> +			if (is_block_group)
> +				entry->bg_extent_count++;
>   			spin_unlock(&fs_info->swapfile_pins_lock);
>   			kfree(sp);
>   			return 1;
> @@ -10048,8 +10051,11 @@ static void btrfs_free_swapfile_pins(struct inode *inode)
>   		sp = rb_entry(node, struct btrfs_swapfile_pin, node);
>   		if (sp->inode == inode) {
>   			rb_erase(&sp->node, &fs_info->swapfile_pins);
> -			if (sp->is_block_group)
> +			if (sp->is_block_group) {
> +				btrfs_dec_block_group_swap_extents(sp->ptr,
> +							   sp->bg_extent_count);
>   				btrfs_put_block_group(sp->ptr);
> +			}
>   			kfree(sp);
>   		}
>   		node = next;
> @@ -10264,6 +10270,17 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
>   			goto out;
>   		}
>   
> +		if (!btrfs_inc_block_group_swap_extents(bg)) {
> +			btrfs_warn(fs_info,
> +			   "block group for swapfile at %llu is read-only%s",
> +			   bg->start,
> +			   atomic_read(&fs_info->scrubs_running) ?
> +				   " (scrub running)" : "");
> +			btrfs_put_block_group(bg);
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
>   		ret = btrfs_add_swapfile_pin(inode, bg, true);
>   		if (ret) {
>   			btrfs_put_block_group(bg);
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 5f4f88a4d2c8..c09a494be8c6 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3630,6 +3630,13 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>   			 * commit_transactions.
>   			 */
>   			ro_set = 0;
> +		} else if (ret == -ETXTBSY) {
> +			btrfs_warn(fs_info,
> +		   "skipping scrub of block group %llu due to active swapfile",
> +				   cache->start);
> +			scrub_pause_off(fs_info);
> +			ret = 0;
> +			goto skip_unfreeze;
>   		} else {
>   			btrfs_warn(fs_info,
>   				   "failed setting block group ro: %d", ret);
> @@ -3719,7 +3726,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>   		} else {
>   			spin_unlock(&cache->lock);
>   		}
> -
> +skip_unfreeze:
>   		btrfs_unfreeze_block_group(cache);
>   		btrfs_put_block_group(cache);
>   		if (ret)
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/3] btrfs: fix race between swap file activation and snapshot creation
  2021-02-05 12:55   ` [PATCH v2 3/3] btrfs: fix race between swap file activation and snapshot creation fdmanana
@ 2021-02-10 17:19     ` Anand Jain
  0 siblings, 0 replies; 19+ messages in thread
From: Anand Jain @ 2021-02-10 17:19 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

On 05/02/2021 20:55, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When creating a snapshot we check if the current number of swap files in
> the root is non-zero, and if it is, we error out and warn that we can not
> create the snapshot because there are active swap files.
> 
> However this is racy, because by the time a task starts activation of a
> swap file, another task might have already started snapshot creation and
> seen the counter for the number of swap files as zero. This means that
> after the swap file is activated we may end up with a snapshot of the
> same root successfully created, and therefore when the first write to
> the swap file happens it has to fall back into COW mode, which should
> never happen for active swap files.
> 
> Basically what can happen is:
> 
> 1) Task A starts snapshot creation and enters ioctl.c:create_snapshot().
>     There it sees that root->nr_swapfiles has a value of 0 so it continues;
> 
> 2) Task B enters btrfs_swap_activate(). It is not aware that another task
>     started snapshot creation but it did not finish yet. It increments
>     root->nr_swapfiles from 0 to 1;
> 
> 3) Task B checks that the file meets all requirements to be an active
>     swap file - it has NOCOW set, there are no snapshots for the inode's
>     root at the moment, no file holes, no reflinked extents, etc;
> 
> 4) Task B returns success and now the file is an active swap file;
> 
> 5) Task A commits the transaction to create the snapshot and finishes.
>     The swap file's extents are now shared between the original root and
>     the snapshot;
> 
> 6) A write into an extent of the swap file is attempted - there is a
>     snapshot of the file's root, so we fall back to COW mode and therefore
>     the physical location of the extent changes on disk.
> 
> So fix this by taking the snapshot lock during swap file activation before
> locking the extent range, as that is the order in which we lock these
> during buffered writes.
> 
> Fixes: ed46ff3d42378 ("Btrfs: support swap files")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Anand Jain <anand.jain@oracle.com>

Thanks, Anand

> ---
>   fs/btrfs/inode.c | 21 +++++++++++++++++++--
>   1 file changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 715ae1320383..a9fb6a4eb9dd 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -10116,7 +10116,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
>   			       sector_t *span)
>   {
>   	struct inode *inode = file_inode(file);
> -	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
> +	struct btrfs_root *root = BTRFS_I(inode)->root;
> +	struct btrfs_fs_info *fs_info = root->fs_info;
>   	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
>   	struct extent_state *cached_state = NULL;
>   	struct extent_map *em = NULL;
> @@ -10167,13 +10168,27 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
>   	   "cannot activate swapfile while exclusive operation is running");
>   		return -EBUSY;
>   	}
> +
> +	/*
> +	 * Prevent snapshot creation while we are activating the swap file.
> +	 * We do not want to race with snapshot creation. If snapshot creation
> +	 * already started before we bumped nr_swapfiles from 0 to 1 and
> +	 * completes before the first write into the swap file after it is
> +	 * activated, then that write would fall back to COW.
> +	 */
> +	if (!btrfs_drew_try_write_lock(&root->snapshot_lock)) {
> +		btrfs_exclop_finish(fs_info);
> +		btrfs_warn(fs_info,
> +	   "cannot activate swapfile because snapshot creation is in progress");
> +		return -EINVAL;
> +	}
>   	/*
>   	 * Snapshots can create extents which require COW even if NODATACOW is
>   	 * set. We use this counter to prevent snapshots. We must increment it
>   	 * before walking the extents because we don't want a concurrent
>   	 * snapshot to run after we've already checked the extents.
>   	 */
> -	atomic_inc(&BTRFS_I(inode)->root->nr_swapfiles);
> +	atomic_inc(&root->nr_swapfiles);
>   
>   	isize = ALIGN_DOWN(inode->i_size, fs_info->sectorsize);
>   
> @@ -10319,6 +10334,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
>   	if (ret)
>   		btrfs_swap_deactivate(file);
>   
> +	btrfs_drew_write_unlock(&root->snapshot_lock);
> +
>   	btrfs_exclop_finish(fs_info);
>   
>   	if (ret)
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/3] btrfs: fix a couple swapfile support bugs
  2021-02-05 12:55 ` [PATCH v2 0/3] btrfs: fix a couple swapfile support bugs fdmanana
                     ` (2 preceding siblings ...)
  2021-02-05 12:55   ` [PATCH v2 3/3] btrfs: fix race between swap file activation and snapshot creation fdmanana
@ 2021-02-10 22:15   ` David Sterba
  3 siblings, 0 replies; 19+ messages in thread
From: David Sterba @ 2021-02-10 22:15 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Fri, Feb 05, 2021 at 12:55:35PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> The following patchset fixes 2 bugs with the swapfile support, where we can
> end up falling back to COW when writing to an active swapfile. The first patch
> is actually independent and just makes the nocow buffered IO path more efficient
> by eliminating a repeated check for a read-only block group.
> 
> V2: Removed the part of optimizing the direct IO nocow path from patch 2,
>     because removing the RO block group check from can_nocow_extent() would
>     prevent the buffered write path, which checks if we can fall back to
>     nocow at write time (and not writeback time) after hitting ENOSPC when
>     attempting to reserve data space, from doing that check. The
>     optimization can still be done, but it would require adding more
>     context information to can_nocow_extent(), so it could know when it
>     needs to check if the block group is RO or not - since things are a bit
>     entangled around that function and its callers, I've left it out for
>     now.
> 
> Filipe Manana (3):
>   btrfs: avoid checking for RO block group twice during nocow writeback
>   btrfs: fix race between writes to swap files and scrub
>   btrfs: fix race between swap file activation and snapshot creation

Added to for-next, thanks. Target merge is 5.12-rc.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-02-10 22:18 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-03 11:17 [PATCH 0/4] btrfs: fix a couple swapfile support bugs fdmanana
2021-02-03 11:17 ` [PATCH 1/4] btrfs: avoid checking for RO block group twice during nocow writeback fdmanana
2021-02-04  7:47   ` Anand Jain
2021-02-03 11:17 ` [PATCH 2/4] btrfs: fix race between writes to swap files and scrub fdmanana
2021-02-04  8:48   ` Anand Jain
2021-02-04 10:11     ` Filipe Manana
2021-02-05  7:44       ` Anand Jain
2021-02-05 12:54         ` Filipe Manana
2021-02-03 11:17 ` [PATCH 3/4] btrfs: remove no longer used function btrfs_extent_readonly() fdmanana
2021-02-03 11:17 ` [PATCH 4/4] btrfs: fix race between swap file activation and snapshot creation fdmanana
2021-02-05 12:55 ` [PATCH v2 0/3] btrfs: fix a couple swapfile support bugs fdmanana
2021-02-05 12:55   ` [PATCH v2 1/3] btrfs: avoid checking for RO block group twice during nocow writeback fdmanana
2021-02-10 12:28     ` Anand Jain
2021-02-05 12:55   ` [PATCH v2 2/3] btrfs: fix race between writes to swap files and scrub fdmanana
2021-02-10 16:54     ` Anand Jain
2021-02-05 12:55   ` [PATCH v2 3/3] btrfs: fix race between swap file activation and snapshot creation fdmanana
2021-02-10 17:19     ` Anand Jain
2021-02-10 22:15   ` [PATCH v2 0/3] btrfs: fix a couple swapfile support bugs David Sterba
2021-02-10 15:30 ` [PATCH 0/4] " Josef Bacik
