linux-btrfs.vger.kernel.org archive mirror
From: David Sterba <dsterba@suse.cz>
To: Nikolay Borisov <nborisov@suse.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 7/8] btrfs: Ensure replaced device doesn't have pending chunk allocation
Date: Wed, 15 May 2019 18:52:09 +0200
Message-ID: <20190515165207.GU3138@twin.jikos.cz>
In-Reply-To: <20190514105445.23051-8-nborisov@suse.com>

On Tue, May 14, 2019 at 01:54:44PM +0300, Nikolay Borisov wrote:
> Recent FITRIM work, namely bbbf7243d62d ("btrfs: combine device update
> operations during transaction commit") combined the way certain
> operations are recorded in a transaction. As a result an ASSERT was
> added in dev_replace_finish to ensure the new code works correctly.
> Unfortunately I got reports that it's possible to trigger the assert,
> meaning that during a device replace it's possible to have an unfinished
> chunk allocation on the source device.
> 
> This is supposed to be prevented by the fact that a transaction is
> committed before finishing the replace operation and later acquiring
> the chunk mutex. This is not sufficient since by the time the
> transaction is committed and the chunk mutex acquired it's possible to
> allocate a chunk depending on the workload being executed on the
> replaced device. This bug has been present ever since device replace was
> introduced but there was never code which checks for it.
> 
> The correct way to fix this is to ensure that there is no pending device
> modification operation when the chunk mutex is acquired and, if there is,
> to repeat the transaction commit. Unfortunately it's not possible to just
> exclude the source device from btrfs_fs_devices::dev_alloc_list since
> this causes ENOSPC to be hit in transaction commit.
> 
> Fixes: 391cd9df81ac ("Btrfs: fix unprotected alloc list insertion during the finishing procedure of replace")
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> ---
>  fs/btrfs/dev-replace.c | 30 ++++++++++++++++++++----------
>  1 file changed, 20 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index fb2bbc2a53a9..8ec9328609bd 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -599,17 +599,28 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>  	}
>  	btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
>  
> -	trans = btrfs_start_transaction(root, 0);
> -	if (IS_ERR(trans)) {
> -		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
> -		return PTR_ERR(trans);

Please add a comment that briefly explains why this is looping and not
the usual start/commit. Otherwise ok.

For the record, the other solution we discussed, removing the source
device from the writeable devices, does not work due to ENOSPC at the
transaction commit, and adding extra conditions everywhere just to make
sure this special case is handled did not seem better than the looped
commit.

What makes this case special is that the source device needs a point
where no new writes are accepted, but we still need to write the pending
data plus the final transaction commit. So the device is there but not
really. The commit could loop, but given how hard it was to reproduce
that, it'll almost never happen, and the overall runtime of dev-replace
is high, so this won't cause a noticeable delay.
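
To illustrate the pattern outside the kernel, here is a minimal
single-threaded userspace model of the "commit, take the lock, recheck,
retry" loop. It is only a sketch: struct fake_fs, struct fake_device,
fake_commit(), commit_until_quiescent() and racing_allocs_left are all
hypothetical names made up for this example, the mutex is modelled by a
plain flag, and the race is simulated by a counter rather than a real
concurrent writer. None of this is btrfs code.

```c
#include <assert.h>

/* Hypothetical stand-ins for illustration only, not btrfs types. */
struct fake_device {
	int pending_updates;	/* models a non-empty post_commit_list */
};

struct fake_fs {
	int chunk_mutex_held;	/* models fs_info->chunk_mutex as a flag */
	struct fake_device src;
	int racing_allocs_left;	/* simulated allocations that win the race */
	int commits;		/* number of commits performed so far */
};

/*
 * Models a transaction commit: pending device updates are flushed, but
 * an allocation may still slip in before the caller takes the lock.
 */
static void fake_commit(struct fake_fs *fs)
{
	fs->src.pending_updates = 0;
	fs->commits++;

	if (fs->racing_allocs_left > 0) {
		fs->racing_allocs_left--;
		fs->src.pending_updates++;
	}
}

/*
 * A single commit is not enough because the source device can pick up
 * a new pending update between the commit and taking the lock. So
 * commit, lock, recheck, and retry until the device is quiescent.
 * Returns the number of commits needed, with the lock still held.
 */
static int commit_until_quiescent(struct fake_fs *fs)
{
	for (;;) {
		fake_commit(fs);

		fs->chunk_mutex_held = 1;	/* "mutex_lock" */
		if (fs->src.pending_updates == 0)
			break;			/* keep the lock, proceed */
		fs->chunk_mutex_held = 0;	/* "mutex_unlock", retry */
	}

	/* The invariant the original ASSERT was checking. */
	assert(fs->src.pending_updates == 0);
	return fs->commits;
}
```

With racing_allocs_left set to 2, the model needs three commits before
the check passes; with 0 it behaves like the plain start/commit sequence
and commits once.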

> +	while (1) {
> +		trans = btrfs_start_transaction(root, 0);
> +		if (IS_ERR(trans)) {
> +			mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
> +			return PTR_ERR(trans);
> +		}
> +		ret = btrfs_commit_transaction(trans);
> +		WARN_ON(ret);
> +
> +		/* Prevent write_all_supers() during the finishing procedure */
> +		mutex_lock(&fs_info->fs_devices->device_list_mutex);
> +		/* Prevent new chunks being allocated on the source device */
> +		mutex_lock(&fs_info->chunk_mutex);
> +
> +		if (!list_empty(&src_device->post_commit_list)) {
> +			mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> +			mutex_unlock(&fs_info->chunk_mutex);
> +		} else {
> +			break;
> +		}
>  	}
> -	ret = btrfs_commit_transaction(trans);
> -	WARN_ON(ret);
>  
> -	/* keep away write_all_supers() during the finishing procedure */
> -	mutex_lock(&fs_info->fs_devices->device_list_mutex);
> -	mutex_lock(&fs_info->chunk_mutex);
>  	down_write(&dev_replace->rwsem);
>  	dev_replace->replace_state =
>  		scrub_ret ? BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED
> @@ -658,7 +669,6 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>  	btrfs_device_set_disk_total_bytes(tgt_device,
>  					  src_device->disk_total_bytes);
>  	btrfs_device_set_bytes_used(tgt_device, src_device->bytes_used);
> -	ASSERT(list_empty(&src_device->post_commit_list));
>  	tgt_device->commit_total_bytes = src_device->commit_total_bytes;
>  	tgt_device->commit_bytes_used = src_device->bytes_used;
>  
> -- 
> 2.17.1
