All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/8] Misc improvements to dev-replace code
@ 2019-05-14 10:54 Nikolay Borisov
  2019-05-14 10:54 ` [PATCH 1/8] btrfs: Don't opencode sync_blockdev in btrfs_init_dev_replace_tgtdev Nikolay Borisov
                   ` (8 more replies)
  0 siblings, 9 replies; 19+ messages in thread
From: Nikolay Borisov @ 2019-05-14 10:54 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

While fixing the failing ASSERT in device replace finishing procedure I also 
found several oddities/bugs. Here is the resultant pile.

First 3 patches are a couple simple cleanups.

Patch 4 fixes a real bug since btrfs_init_dev_replace_tgtdev accesses values
which might be updated by transaction commit, so naturally it should be called
after the transaction is actually committed. I think this should go to stable. 

Patch 5 cleanups unlocking code in btrfs_dev_replace_start removing a goto label 
and a local variable. 

Patch 6 also fixes a bug, since persisting the dev-replace item relied on global 
reserve being able to satisfy the condition. While this is not wrong per-se I 
find it somewhat subtle, so just be explicit and start a transaction with 
reservation for at least 1 item.

Patch 7 fixes the race condition which caused the newly added ASSERT to trigger. 
I've added the Fixes: tag to point to the first commit which introduced taking 
chunk_mutex. This is also stable material. 

Patch 8 is also a minor cleanup, just removing what I believe to be a redundant 
assignment. 

This series went under multiple xfstest runs and no regressions were observed. 

Nikolay Borisov (8):
  btrfs: Don't opencode sync_blockdev in btrfs_init_dev_replace_tgtdev
  btrfs: Reduce critical section in btrfs_init_dev_replace_tgtdev
  btrfs: Remove impossible WARN_ON
  btrfs: Ensure btrfs_init_dev_replace_tgtdev sees up to date values
  btrfs: Streamline replace sem unlock in btrfs_dev_replace_start
  btrfs: Explicitly reserve space for devreplace item
  btrfs: Ensure replaced device doesn't have pending chunk allocation
  btrfs: Remove redundant assignment of tgt_device->commit_total_bytes

 fs/btrfs/dev-replace.c | 59 +++++++++++++++++++++++-------------------
 1 file changed, 32 insertions(+), 27 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/8] btrfs: Don't opencode sync_blockdev in btrfs_init_dev_replace_tgtdev
  2019-05-14 10:54 [PATCH 0/8] Misc improvements to dev-replace code Nikolay Borisov
@ 2019-05-14 10:54 ` Nikolay Borisov
  2019-05-14 10:59   ` Johannes Thumshirn
  2019-05-14 10:54 ` [PATCH 2/8] btrfs: Reduce critical section " Nikolay Borisov
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 19+ messages in thread
From: Nikolay Borisov @ 2019-05-14 10:54 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

Using sync_blockdev makes it plain obvious what's happening. No
functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/dev-replace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 55c15f31d00d..c738ba460682 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -201,7 +201,7 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 		return PTR_ERR(bdev);
 	}
 
-	filemap_write_and_wait(bdev->bd_inode->i_mapping);
+	sync_blockdev(bdev);
 
 	devices = &fs_info->fs_devices->devices;
 	list_for_each_entry(device, devices, dev_list) {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/8] btrfs: Reduce critical section in btrfs_init_dev_replace_tgtdev
  2019-05-14 10:54 [PATCH 0/8] Misc improvements to dev-replace code Nikolay Borisov
  2019-05-14 10:54 ` [PATCH 1/8] btrfs: Don't opencode sync_blockdev in btrfs_init_dev_replace_tgtdev Nikolay Borisov
@ 2019-05-14 10:54 ` Nikolay Borisov
  2019-05-14 11:05   ` Johannes Thumshirn
  2019-05-14 10:54 ` [PATCH 3/8] btrfs: Remove impossible WARN_ON Nikolay Borisov
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 19+ messages in thread
From: Nikolay Borisov @ 2019-05-14 10:54 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

There is no point in holding btrfs_fs_devices::device_list_mutex
while initialising fields of the not-yet-published device. Instead,
hold the mutex only when the newly initialised device is being
published. I think holding device_list_mutex here is redundant
altogether, because at this point BTRFS_FS_EXCL_OP is set which
prevents device removal/addition/balance/resize to occur.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/dev-replace.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index c738ba460682..ac6600a00188 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -237,7 +237,6 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 	}
 	rcu_assign_pointer(device->name, name);
 
-	mutex_lock(&fs_info->fs_devices->device_list_mutex);
 	set_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
 	device->generation = 0;
 	device->io_width = fs_info->sectorsize;
@@ -256,6 +255,8 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 	device->dev_stats_valid = 1;
 	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
 	device->fs_devices = fs_info->fs_devices;
+
+	mutex_lock(&fs_info->fs_devices->device_list_mutex);
 	list_add(&device->dev_list, &fs_info->fs_devices->devices);
 	fs_info->fs_devices->num_devices++;
 	fs_info->fs_devices->open_devices++;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/8] btrfs: Remove impossible WARN_ON
  2019-05-14 10:54 [PATCH 0/8] Misc improvements to dev-replace code Nikolay Borisov
  2019-05-14 10:54 ` [PATCH 1/8] btrfs: Don't opencode sync_blockdev in btrfs_init_dev_replace_tgtdev Nikolay Borisov
  2019-05-14 10:54 ` [PATCH 2/8] btrfs: Reduce critical section " Nikolay Borisov
@ 2019-05-14 10:54 ` Nikolay Borisov
  2019-05-14 11:09   ` Johannes Thumshirn
  2019-05-14 10:54 ` [PATCH 4/8] btrfs: Ensure btrfs_init_dev_replace_tgtdev sees up to date values Nikolay Borisov
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 19+ messages in thread
From: Nikolay Borisov @ 2019-05-14 10:54 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

This WARN_ON can never trigger because src_device cannot be null.
btrfs_find_device_by_devspec always returns either an error or a valid
pointer to the device. Just remove it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/dev-replace.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index ac6600a00188..7db9057d3d3c 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -447,7 +447,6 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 	}
 
 	dev_replace->cont_reading_from_srcdev_mode = read_src;
-	WARN_ON(!src_device);
 	dev_replace->srcdev = src_device;
 	dev_replace->tgtdev = tgt_device;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 4/8] btrfs: Ensure btrfs_init_dev_replace_tgtdev sees up to date values
  2019-05-14 10:54 [PATCH 0/8] Misc improvements to dev-replace code Nikolay Borisov
                   ` (2 preceding siblings ...)
  2019-05-14 10:54 ` [PATCH 3/8] btrfs: Remove impossible WARN_ON Nikolay Borisov
@ 2019-05-14 10:54 ` Nikolay Borisov
  2019-05-14 10:54 ` [PATCH 5/8] btrfs: Streamline replace sem unlock in btrfs_dev_replace_start Nikolay Borisov
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Nikolay Borisov @ 2019-05-14 10:54 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

btrfs_init_dev_replace_tgtdev reads certain values from the source
device (such as commit_total_bytes) which are updated during transaction
commit. Currently this function is called before committing any pending
transaction, leading to possibly reading outdated values. Fix this
by moving the function below the transaction commit, at this point the
exclusive op bit it set hence once transaction is complete the total
size of the device cannot be changed (it's usually changed by
resize/remove ops which are blocked).

Fixes: 9e271ae27e44 ("Btrfs: kernel operation should come after user input has been verified")

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/dev-replace.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 7db9057d3d3c..fbf53e996668 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -414,11 +414,6 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 		return -ETXTBSY;
 	}
 
-	ret = btrfs_init_dev_replace_tgtdev(fs_info, tgtdev_name,
-					    src_device, &tgt_device);
-	if (ret)
-		return ret;
-
 	/*
 	 * Here we commit the transaction to make sure commit_total_bytes
 	 * of all the devices are updated.
@@ -432,6 +427,11 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 		return PTR_ERR(trans);
 	}
 
+	ret = btrfs_init_dev_replace_tgtdev(fs_info, tgtdev_name,
+					    src_device, &tgt_device);
+	if (ret)
+		return ret;
+
 	need_unlock = true;
 	down_write(&dev_replace->rwsem);
 	switch (dev_replace->replace_state) {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 5/8] btrfs: Streamline replace sem unlock in btrfs_dev_replace_start
  2019-05-14 10:54 [PATCH 0/8] Misc improvements to dev-replace code Nikolay Borisov
                   ` (3 preceding siblings ...)
  2019-05-14 10:54 ` [PATCH 4/8] btrfs: Ensure btrfs_init_dev_replace_tgtdev sees up to date values Nikolay Borisov
@ 2019-05-14 10:54 ` Nikolay Borisov
  2019-05-14 12:50   ` Johannes Thumshirn
  2019-05-14 10:54 ` [PATCH 6/8] btrfs: Explicitly reserve space for devreplace item Nikolay Borisov
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 19+ messages in thread
From: Nikolay Borisov @ 2019-05-14 10:54 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

There are only 2 branches which goto leave label with need_unlock set
to true. Essentially need_unlock is used as a substitute for directly
calling up_write. Since the branches needing this are only 2 and their
context is not that big it's more clear to just call up_write where
required. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/dev-replace.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index fbf53e996668..61ae43308192 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -400,7 +400,6 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 	int ret;
 	struct btrfs_device *tgt_device = NULL;
 	struct btrfs_device *src_device = NULL;
-	bool need_unlock;
 
 	src_device = btrfs_find_device_by_devspec(fs_info, srcdevid,
 						  srcdev_name);
@@ -432,7 +431,6 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 	if (ret)
 		return ret;
 
-	need_unlock = true;
 	down_write(&dev_replace->rwsem);
 	switch (dev_replace->replace_state) {
 	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
@@ -443,6 +441,7 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 	case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:
 		ASSERT(0);
 		ret = BTRFS_IOCTL_DEV_REPLACE_RESULT_ALREADY_STARTED;
+		up_write(&dev_replace->rwsem);
 		goto leave;
 	}
 
@@ -471,7 +470,6 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 	atomic64_set(&dev_replace->num_write_errors, 0);
 	atomic64_set(&dev_replace->num_uncorrectable_read_errors, 0);
 	up_write(&dev_replace->rwsem);
-	need_unlock = false;
 
 	ret = btrfs_sysfs_add_device_link(tgt_device->fs_devices, tgt_device);
 	if (ret)
@@ -483,12 +481,12 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 	trans = btrfs_start_transaction(root, 0);
 	if (IS_ERR(trans)) {
 		ret = PTR_ERR(trans);
-		need_unlock = true;
 		down_write(&dev_replace->rwsem);
 		dev_replace->replace_state =
 			BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED;
 		dev_replace->srcdev = NULL;
 		dev_replace->tgtdev = NULL;
+		up_write(&dev_replace->rwsem);
 		goto leave;
 	}
 
@@ -510,8 +508,6 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 	return ret;
 
 leave:
-	if (need_unlock)
-		up_write(&dev_replace->rwsem);
 	btrfs_destroy_dev_replace_tgtdev(tgt_device);
 	return ret;
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 6/8] btrfs: Explicitly reserve space for devreplace item
  2019-05-14 10:54 [PATCH 0/8] Misc improvements to dev-replace code Nikolay Borisov
                   ` (4 preceding siblings ...)
  2019-05-14 10:54 ` [PATCH 5/8] btrfs: Streamline replace sem unlock in btrfs_dev_replace_start Nikolay Borisov
@ 2019-05-14 10:54 ` Nikolay Borisov
  2019-05-14 12:56   ` Johannes Thumshirn
  2019-05-14 10:54 ` [PATCH 7/8] btrfs: Ensure replaced device doesn't have pending chunk allocation Nikolay Borisov
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 19+ messages in thread
From: Nikolay Borisov @ 2019-05-14 10:54 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

Part of device replace involves writing an item to the device root
containing information about pending replace operations. Currently space
for this item is not being explicitly reserved so this works thanks to
presence of global reserve. While not fatal it's not good practice.
Let's be explicit about space requirement of device replace and reserve
space when starting the transaction.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/dev-replace.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 61ae43308192..fb2bbc2a53a9 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -477,8 +477,8 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 
 	btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
 
-	/* force writing the updated state information to disk */
-	trans = btrfs_start_transaction(root, 0);
+	/* Commit dev_replace state and reserve 1 item for it. */
+	trans = btrfs_start_transaction(root, 1);
 	if (IS_ERR(trans)) {
 		ret = PTR_ERR(trans);
 		down_write(&dev_replace->rwsem);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 7/8] btrfs: Ensure replaced device doesn't have pending chunk allocation
  2019-05-14 10:54 [PATCH 0/8] Misc improvements to dev-replace code Nikolay Borisov
                   ` (5 preceding siblings ...)
  2019-05-14 10:54 ` [PATCH 6/8] btrfs: Explicitly reserve space for devreplace item Nikolay Borisov
@ 2019-05-14 10:54 ` Nikolay Borisov
  2019-05-15 16:52   ` David Sterba
  2019-05-14 10:54 ` [PATCH 8/8] btrfs: Remove redundant assignment of tgt_device->commit_total_bytes Nikolay Borisov
  2019-05-28 16:47 ` [PATCH 0/8] Misc improvements to dev-replace code David Sterba
  8 siblings, 1 reply; 19+ messages in thread
From: Nikolay Borisov @ 2019-05-14 10:54 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

Recent FITRIM work, namely bbbf7243d62d ("btrfs: combine device update
operations during transaction commit") combined the way certain
operations are recoded in a transaction. As a result an ASSERT was
added in dev_replace_finish to ensure the new code works correctly.
Unfortunately I got reports that it's possible to trigger the assert,
meaning that during a device replace it's possible to have an unfinished
chunk allocation on the source device.

This is supposed to be prevented by the fact that a transaction is
committed before finishing the replace oepration and alter acquiring
the chunk mutex. This is not sufficient since by the time the
transaction is committed and the chunk mutex acquired it's possible to
allocate a chunk depending on the workload being executed on the
replaced device. This bug has been present ever since device replace was
introduced but there was never code which checks for it.

The correct way to fix is to ensure that there is no pending device
modification operation when the chunk mutex is acquire and if there is
repeat transaction commit. Unfortunately it's not possible to just
exclude the source device from btrfs_fs_devices::dev_alloc_list since
this causes ENOSPC to be hit in transaction commit.

Fixes: 391cd9df81ac ("Btrfs: fix unprotected alloc list insertion during the finishing procedure of replace")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/dev-replace.c | 30 ++++++++++++++++++++----------
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index fb2bbc2a53a9..8ec9328609bd 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -599,17 +599,28 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	}
 	btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
 
-	trans = btrfs_start_transaction(root, 0);
-	if (IS_ERR(trans)) {
-		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
-		return PTR_ERR(trans);
+	while (1) {
+		trans = btrfs_start_transaction(root, 0);
+		if (IS_ERR(trans)) {
+			mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
+			return PTR_ERR(trans);
+		}
+		ret = btrfs_commit_transaction(trans);
+		WARN_ON(ret);
+
+		/* Prevent write_all_supers() during the finishing procedure */
+		mutex_lock(&fs_info->fs_devices->device_list_mutex);
+		/* Prevent new chunks being allocated on the source device */
+		mutex_lock(&fs_info->chunk_mutex);
+
+		if (!list_empty(&src_device->post_commit_list)) {
+			mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+			mutex_unlock(&fs_info->chunk_mutex);
+		} else {
+			break;
+		}
 	}
-	ret = btrfs_commit_transaction(trans);
-	WARN_ON(ret);
 
-	/* keep away write_all_supers() during the finishing procedure */
-	mutex_lock(&fs_info->fs_devices->device_list_mutex);
-	mutex_lock(&fs_info->chunk_mutex);
 	down_write(&dev_replace->rwsem);
 	dev_replace->replace_state =
 		scrub_ret ? BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED
@@ -658,7 +669,6 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	btrfs_device_set_disk_total_bytes(tgt_device,
 					  src_device->disk_total_bytes);
 	btrfs_device_set_bytes_used(tgt_device, src_device->bytes_used);
-	ASSERT(list_empty(&src_device->post_commit_list));
 	tgt_device->commit_total_bytes = src_device->commit_total_bytes;
 	tgt_device->commit_bytes_used = src_device->bytes_used;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 8/8] btrfs: Remove redundant assignment of tgt_device->commit_total_bytes
  2019-05-14 10:54 [PATCH 0/8] Misc improvements to dev-replace code Nikolay Borisov
                   ` (6 preceding siblings ...)
  2019-05-14 10:54 ` [PATCH 7/8] btrfs: Ensure replaced device doesn't have pending chunk allocation Nikolay Borisov
@ 2019-05-14 10:54 ` Nikolay Borisov
  2019-05-14 12:59   ` Johannes Thumshirn
  2019-05-28 16:47 ` [PATCH 0/8] Misc improvements to dev-replace code David Sterba
  8 siblings, 1 reply; 19+ messages in thread
From: Nikolay Borisov @ 2019-05-14 10:54 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

This is already done in btrfs_init_dev_replace_tgtdev which is the first
phase of device replace, called before doing scrub. During that time
exclusive lock is held. Additionally btrfs_fs_device::commit_total_bytes
is always set based on the size of the underlying block device which
shouldn't change once set. This makes the 2nd assignment of the variable
in the finishing phase redundant.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/dev-replace.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 8ec9328609bd..5005a03ba847 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -669,7 +669,6 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	btrfs_device_set_disk_total_bytes(tgt_device,
 					  src_device->disk_total_bytes);
 	btrfs_device_set_bytes_used(tgt_device, src_device->bytes_used);
-	tgt_device->commit_total_bytes = src_device->commit_total_bytes;
 	tgt_device->commit_bytes_used = src_device->bytes_used;
 
 	btrfs_assign_next_active_device(src_device, tgt_device);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/8] btrfs: Don't opencode sync_blockdev in btrfs_init_dev_replace_tgtdev
  2019-05-14 10:54 ` [PATCH 1/8] btrfs: Don't opencode sync_blockdev in btrfs_init_dev_replace_tgtdev Nikolay Borisov
@ 2019-05-14 10:59   ` Johannes Thumshirn
  0 siblings, 0 replies; 19+ messages in thread
From: Johannes Thumshirn @ 2019-05-14 10:59 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/8] btrfs: Reduce critical section in btrfs_init_dev_replace_tgtdev
  2019-05-14 10:54 ` [PATCH 2/8] btrfs: Reduce critical section " Nikolay Borisov
@ 2019-05-14 11:05   ` Johannes Thumshirn
  0 siblings, 0 replies; 19+ messages in thread
From: Johannes Thumshirn @ 2019-05-14 11:05 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/8] btrfs: Remove impossible WARN_ON
  2019-05-14 10:54 ` [PATCH 3/8] btrfs: Remove impossible WARN_ON Nikolay Borisov
@ 2019-05-14 11:09   ` Johannes Thumshirn
  0 siblings, 0 replies; 19+ messages in thread
From: Johannes Thumshirn @ 2019-05-14 11:09 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/8] btrfs: Streamline replace sem unlock in btrfs_dev_replace_start
  2019-05-14 10:54 ` [PATCH 5/8] btrfs: Streamline replace sem unlock in btrfs_dev_replace_start Nikolay Borisov
@ 2019-05-14 12:50   ` Johannes Thumshirn
  0 siblings, 0 replies; 19+ messages in thread
From: Johannes Thumshirn @ 2019-05-14 12:50 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 6/8] btrfs: Explicitly reserve space for devreplace item
  2019-05-14 10:54 ` [PATCH 6/8] btrfs: Explicitly reserve space for devreplace item Nikolay Borisov
@ 2019-05-14 12:56   ` Johannes Thumshirn
  0 siblings, 0 replies; 19+ messages in thread
From: Johannes Thumshirn @ 2019-05-14 12:56 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 8/8] btrfs: Remove redundant assignment of tgt_device->commit_total_bytes
  2019-05-14 10:54 ` [PATCH 8/8] btrfs: Remove redundant assignment of tgt_device->commit_total_bytes Nikolay Borisov
@ 2019-05-14 12:59   ` Johannes Thumshirn
  0 siblings, 0 replies; 19+ messages in thread
From: Johannes Thumshirn @ 2019-05-14 12:59 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 7/8] btrfs: Ensure replaced device doesn't have pending chunk allocation
  2019-05-14 10:54 ` [PATCH 7/8] btrfs: Ensure replaced device doesn't have pending chunk allocation Nikolay Borisov
@ 2019-05-15 16:52   ` David Sterba
  2019-05-17  7:44     ` [PATCH v2] " Nikolay Borisov
  0 siblings, 1 reply; 19+ messages in thread
From: David Sterba @ 2019-05-15 16:52 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

On Tue, May 14, 2019 at 01:54:44PM +0300, Nikolay Borisov wrote:
> Recent FITRIM work, namely bbbf7243d62d ("btrfs: combine device update
> operations during transaction commit") combined the way certain
> operations are recoded in a transaction. As a result an ASSERT was
> added in dev_replace_finish to ensure the new code works correctly.
> Unfortunately I got reports that it's possible to trigger the assert,
> meaning that during a device replace it's possible to have an unfinished
> chunk allocation on the source device.
> 
> This is supposed to be prevented by the fact that a transaction is
> committed before finishing the replace oepration and alter acquiring
> the chunk mutex. This is not sufficient since by the time the
> transaction is committed and the chunk mutex acquired it's possible to
> allocate a chunk depending on the workload being executed on the
> replaced device. This bug has been present ever since device replace was
> introduced but there was never code which checks for it.
> 
> The correct way to fix is to ensure that there is no pending device
> modification operation when the chunk mutex is acquire and if there is
> repeat transaction commit. Unfortunately it's not possible to just
> exclude the source device from btrfs_fs_devices::dev_alloc_list since
> this causes ENOSPC to be hit in transaction commit.
> 
> Fixes: 391cd9df81ac ("Btrfs: fix unprotected alloc list insertion during the finishing procedure of replace")
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> ---
>  fs/btrfs/dev-replace.c | 30 ++++++++++++++++++++----------
>  1 file changed, 20 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index fb2bbc2a53a9..8ec9328609bd 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -599,17 +599,28 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>  	}
>  	btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
>  
> -	trans = btrfs_start_transaction(root, 0);
> -	if (IS_ERR(trans)) {
> -		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
> -		return PTR_ERR(trans);

Please add a comment that briefly explains why this is looping and not
the usual start/commit. Otherwise ok.

For the record, the other solution we discussed, removing the source
device from wrieable does not work due to enospc at the transaction
commit, and adding some extra conditions everywhere just to make sure
this special case is handled did not seem better than the looped commit.

The speciality is that the source device needs a point where no new
writes are accepted, but we still need to write the pending data plus
the final transaction commit. So the device is there but not really. The
commit could loop, but given how hard it was to reproduce that, it'll
almost never happen and overall runtime of dev-replace is high so this
won't cause noticeable delay.

> +	while (1) {
> +		trans = btrfs_start_transaction(root, 0);
> +		if (IS_ERR(trans)) {
> +			mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
> +			return PTR_ERR(trans);
> +		}
> +		ret = btrfs_commit_transaction(trans);
> +		WARN_ON(ret);
> +
> +		/* Prevent write_all_supers() during the finishing procedure */
> +		mutex_lock(&fs_info->fs_devices->device_list_mutex);
> +		/* Prevent new chunks being allocated on the source device */
> +		mutex_lock(&fs_info->chunk_mutex);
> +
> +		if (!list_empty(&src_device->post_commit_list)) {
> +			mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> +			mutex_unlock(&fs_info->chunk_mutex);
> +		} else {
> +			break;
> +		}
>  	}
> -	ret = btrfs_commit_transaction(trans);
> -	WARN_ON(ret);
>  
> -	/* keep away write_all_supers() during the finishing procedure */
> -	mutex_lock(&fs_info->fs_devices->device_list_mutex);
> -	mutex_lock(&fs_info->chunk_mutex);
>  	down_write(&dev_replace->rwsem);
>  	dev_replace->replace_state =
>  		scrub_ret ? BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED
> @@ -658,7 +669,6 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>  	btrfs_device_set_disk_total_bytes(tgt_device,
>  					  src_device->disk_total_bytes);
>  	btrfs_device_set_bytes_used(tgt_device, src_device->bytes_used);
> -	ASSERT(list_empty(&src_device->post_commit_list));
>  	tgt_device->commit_total_bytes = src_device->commit_total_bytes;
>  	tgt_device->commit_bytes_used = src_device->bytes_used;
>  
> -- 
> 2.17.1

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2] btrfs: Ensure replaced device doesn't have pending chunk allocation
  2019-05-15 16:52   ` David Sterba
@ 2019-05-17  7:44     ` Nikolay Borisov
  2019-05-17 14:28       ` David Sterba
  0 siblings, 1 reply; 19+ messages in thread
From: Nikolay Borisov @ 2019-05-17  7:44 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs, Nikolay Borisov

Recent FITRIM work, namely bbbf7243d62d ("btrfs: combine device update
operations during transaction commit") combined the way certain
operations are recoded in a transaction. As a result an ASSERT was
added in dev_replace_finish to ensure the new code works correctly.
Unfortunately I got reports that it's possible to trigger the assert,
meaning that during a device replace it's possible to have an unfinished
chunk allocation on the source device.

This is supposed to be prevented by the fact that a transaction is
committed before finishing the replace oepration and alter acquiring
the chunk mutex. This is not sufficient since by the time the
transaction is committed and the chunk mutex acquired it's possible to
allocate a chunk depending on the workload being executed on the
replaced device. This bug has been present ever since device replace was
introduced but there was never code which checks for it.

The correct way to fix is to ensure that there is no pending device
modification operation when the chunk mutex is acquire and if there is
repeat transaction commit. Unfortunately it's not possible to just
exclude the source device from btrfs_fs_devices::dev_alloc_list since
this causes ENOSPC to be hit in transaction commit.

Fixes: 391cd9df81ac ("Btrfs: fix unprotected alloc list insertion during the finishing procedure of replace")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---

Changes since v1: 
 * Add an explicit comment why loop construct must be used.

 fs/btrfs/dev-replace.c | 35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index fb2bbc2a53a9..45c5ba7d80ed 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -599,17 +599,33 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	}
 	btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
 
-	trans = btrfs_start_transaction(root, 0);
-	if (IS_ERR(trans)) {
-		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
-		return PTR_ERR(trans);
+	/*
+	 * We have to use this loop approach because at this point src_device
+	 * has to be available for transaction commit to complete, yet new
+	 * chunks shouldn't be allocated on the device.
+	 */
+	while (1) {
+		trans = btrfs_start_transaction(root, 0);
+		if (IS_ERR(trans)) {
+			mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
+			return PTR_ERR(trans);
+		}
+		ret = btrfs_commit_transaction(trans);
+		WARN_ON(ret);
+
+		/* Prevent write_all_supers() during the finishing procedure */
+		mutex_lock(&fs_info->fs_devices->device_list_mutex);
+		/* Prevent new chunks being allocated on the source device */
+		mutex_lock(&fs_info->chunk_mutex);
+
+		if (!list_empty(&src_device->post_commit_list)) {
+			mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+			mutex_unlock(&fs_info->chunk_mutex);
+		} else {
+			break;
+		}
 	}
-	ret = btrfs_commit_transaction(trans);
-	WARN_ON(ret);
 
-	/* keep away write_all_supers() during the finishing procedure */
-	mutex_lock(&fs_info->fs_devices->device_list_mutex);
-	mutex_lock(&fs_info->chunk_mutex);
 	down_write(&dev_replace->rwsem);
 	dev_replace->replace_state =
 		scrub_ret ? BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED
@@ -658,7 +674,6 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	btrfs_device_set_disk_total_bytes(tgt_device,
 					  src_device->disk_total_bytes);
 	btrfs_device_set_bytes_used(tgt_device, src_device->bytes_used);
-	ASSERT(list_empty(&src_device->post_commit_list));
 	tgt_device->commit_total_bytes = src_device->commit_total_bytes;
 	tgt_device->commit_bytes_used = src_device->bytes_used;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] btrfs: Ensure replaced device doesn't have pending chunk allocation
  2019-05-17  7:44     ` [PATCH v2] " Nikolay Borisov
@ 2019-05-17 14:28       ` David Sterba
  0 siblings, 0 replies; 19+ messages in thread
From: David Sterba @ 2019-05-17 14:28 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: dsterba, linux-btrfs

On Fri, May 17, 2019 at 10:44:25AM +0300, Nikolay Borisov wrote:
> Recent FITRIM work, namely bbbf7243d62d ("btrfs: combine device update
> operations during transaction commit") combined the way certain
> operations are recoded in a transaction. As a result an ASSERT was
> added in dev_replace_finish to ensure the new code works correctly.
> Unfortunately I got reports that it's possible to trigger the assert,
> meaning that during a device replace it's possible to have an unfinished
> chunk allocation on the source device.
> 
> This is supposed to be prevented by the fact that a transaction is
> committed before finishing the replace oepration and alter acquiring
> the chunk mutex. This is not sufficient since by the time the
> transaction is committed and the chunk mutex acquired it's possible to
> allocate a chunk depending on the workload being executed on the
> replaced device. This bug has been present ever since device replace was
> introduced but there was never code which checks for it.
> 
> The correct way to fix is to ensure that there is no pending device
> modification operation when the chunk mutex is acquire and if there is
> repeat transaction commit. Unfortunately it's not possible to just
> exclude the source device from btrfs_fs_devices::dev_alloc_list since
> this causes ENOSPC to be hit in transaction commit.
> 
> Fixes: 391cd9df81ac ("Btrfs: fix unprotected alloc list insertion during the finishing procedure of replace")
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>

Added to misc-next, thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/8] Misc improvements to dev-replace code
  2019-05-14 10:54 [PATCH 0/8] Misc improvements to dev-replace code Nikolay Borisov
                   ` (7 preceding siblings ...)
  2019-05-14 10:54 ` [PATCH 8/8] btrfs: Remove redundant assignment of tgt_device->commit_total_bytes Nikolay Borisov
@ 2019-05-28 16:47 ` David Sterba
  8 siblings, 0 replies; 19+ messages in thread
From: David Sterba @ 2019-05-28 16:47 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

On Tue, May 14, 2019 at 01:54:37PM +0300, Nikolay Borisov wrote:
> While fixing the failing ASSERT in device replace finishing procedure I also 
> found several oddities/bugs. Here is the resultant pile.
> 
> First 3 patches are a couple simple cleanups.
> 
> Patch 4 fixes a real bug since btrfs_init_dev_replace_tgtdev accesses values
> which might be updated by transaction commit, so naturally it should be called
> after the transaction is actually committed. I think this should go to stable. 
> 
> Patch 5 cleanups unlocking code in btrfs_dev_replace_start removing a goto label 
> and a local variable. 
> 
> Patch 6 also fixes a bug, since persisting the dev-replace item relied on global 
> reserve being able to satisfy the condition. While this is not wrong per-se I 
> find it somewhat subtle, so just be explicit and start a transaction with 
> reservation for at least 1 item.
> 
> Patch 7 fixes the race condition which caused the newly added ASSERT to trigger. 
> I've added the Fixes: tag to point to the first commit which introduced taking 
> chunk_mutex. This is also stable material. 
> 
> Patch 8 is also a minor cleanup, just removing what I believe to be a redundant 
> assignment. 
> 
> This series went under multiple xfstest runs and no regressions were observed. 
> 
> Nikolay Borisov (8):
>   btrfs: Don't opencode sync_blockdev in btrfs_init_dev_replace_tgtdev
>   btrfs: Reduce critical section in btrfs_init_dev_replace_tgtdev
>   btrfs: Remove impossible WARN_ON
>   btrfs: Ensure btrfs_init_dev_replace_tgtdev sees up to date values
>   btrfs: Streamline replace sem unlock in btrfs_dev_replace_start
>   btrfs: Explicitly reserve space for devreplace item
>   btrfs: Ensure replaced device doesn't have pending chunk allocation
>   btrfs: Remove redundant assignment of tgt_device->commit_total_bytes

Added as topic branch and to for-next will be moved to misc-next soon.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2019-05-28 16:46 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-14 10:54 [PATCH 0/8] Misc improvements to dev-replace code Nikolay Borisov
2019-05-14 10:54 ` [PATCH 1/8] btrfs: Don't opencode sync_blockdev in btrfs_init_dev_replace_tgtdev Nikolay Borisov
2019-05-14 10:59   ` Johannes Thumshirn
2019-05-14 10:54 ` [PATCH 2/8] btrfs: Reduce critical section " Nikolay Borisov
2019-05-14 11:05   ` Johannes Thumshirn
2019-05-14 10:54 ` [PATCH 3/8] btrfs: Remove impossible WARN_ON Nikolay Borisov
2019-05-14 11:09   ` Johannes Thumshirn
2019-05-14 10:54 ` [PATCH 4/8] btrfs: Ensure btrfs_init_dev_replace_tgtdev sees up to date values Nikolay Borisov
2019-05-14 10:54 ` [PATCH 5/8] btrfs: Streamline replace sem unlock in btrfs_dev_replace_start Nikolay Borisov
2019-05-14 12:50   ` Johannes Thumshirn
2019-05-14 10:54 ` [PATCH 6/8] btrfs: Explicitly reserve space for devreplace item Nikolay Borisov
2019-05-14 12:56   ` Johannes Thumshirn
2019-05-14 10:54 ` [PATCH 7/8] btrfs: Ensure replaced device doesn't have pending chunk allocation Nikolay Borisov
2019-05-15 16:52   ` David Sterba
2019-05-17  7:44     ` [PATCH v2] " Nikolay Borisov
2019-05-17 14:28       ` David Sterba
2019-05-14 10:54 ` [PATCH 8/8] btrfs: Remove redundant assignment of tgt_device->commit_total_bytes Nikolay Borisov
2019-05-14 12:59   ` Johannes Thumshirn
2019-05-28 16:47 ` [PATCH 0/8] Misc improvements to dev-replace code David Sterba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.