Re: [PATCH v4 2/5] btrfs: scrub: Fix RAID56 recovery race condition

From: Liu Bo <bo.li.liu@oracle.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v4 2/5] btrfs: scrub: Fix RAID56 recovery race condition
Date: Thu, 30 Mar 2017 10:05:59 -0700	[thread overview]
Message-ID: <20170330170558.GB8963@lim.localdomain> (raw)
In-Reply-To: <20170330063251.16872-3-quwenruo@cn.fujitsu.com>

On Thu, Mar 30, 2017 at 02:32:48PM +0800, Qu Wenruo wrote:
> When scrubbing a RAID5 which has recoverable data corruption (only one
> data stripe is corrupted), sometimes scrub will report more csum errors
> than expected. Sometimes even unrecoverable error will be reported.
> 
> The problem can be easily reproduced by the following steps:
> 1) Create a btrfs with RAID5 data profile with 3 devs
> 2) Mount it with nospace_cache or space_cache=v2
>    To avoid extra data space usage.
> 3) Create a 128K file and sync the fs, unmount it
>    Now the 128K file lies at the beginning of the data chunk
> 4) Locate the physical bytenr of data chunk on dev3
>    Dev3 is the 1st data stripe.
> 5) Corrupt the first 64K of the data chunk stripe on dev3
> 6) Mount the fs and scrub it
> 
> The correct csum error number should be 16(assuming using x86_64).
> Larger csum error number can be reported in a 1/3 chance.
> And unrecoverable error can also be reported in a 1/10 chance.
> 
> The root cause of the problem is RAID5/6 recover code has race
> condition, due to the fact that full scrub is initiated per device.
> 
> While for other mirror based profiles, each mirror is independent with
> each other, so race won't cause any big problem.
> 
> For example:
>         Corrupted       |       Correct          |      Correct        |
> |   Scrub dev3 (D1)     |    Scrub dev2 (D2)     |    Scrub dev1(P)    |
> ------------------------------------------------------------------------
> Read out D1             |Read out D2             |Read full stripe     |
> Check csum              |Check csum              |Check parity         |
> Csum mismatch           |Csum match, continue    |Parity mismatch      |
> handle_errored_block    |                        |handle_errored_block |
>  Read out full stripe   |                        | Read out full stripe|
>  D1 csum error(err++)   |                        | D1 csum error(err++)|
>  Recover D1             |                        | Recover D1          |
> 
> So D1's csum error is accounted twice, just because
> handle_errored_block() doesn't have enough protect, and race can happen.
> 
> On even worse case, for example D1's recovery code is re-writing
> D1/D2/P, and P's recovery code is just reading out full stripe, then we
> can cause unrecoverable error.
> 
> This patch will use previously introduced lock_full_stripe() and
> unlock_full_stripe() to protect the whole scrub_handle_errored_block()
> function for RAID56 recovery.
> So no extra csum error nor unrecoverable error.
> 
> Reported-by: Goffredo Baroncelli <kreijack@libero.it>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
>  fs/btrfs/scrub.c | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 5fc99a92b4ff..4bbefc96485d 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -1109,6 +1109,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>  	int mirror_index;
>  	int page_num;
>  	int success;
> +	bool full_stripe_locked;
>  	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
>  				      DEFAULT_RATELIMIT_BURST);
>  
> @@ -1134,6 +1135,25 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>  	have_csum = sblock_to_check->pagev[0]->have_csum;
>  	dev = sblock_to_check->pagev[0]->dev;
>  
> +	/*
> +	 * For RAID5/6 race can happen for different dev scrub thread.
> +	 * For data corruption, Parity and Data thread will both try
> +	 * to recovery the data.
> +	 * Race can lead to double added csum error, or even unrecoverable
> +	 * error.
> +	 */
> +	ret = lock_full_stripe(fs_info, logical, &full_stripe_locked);
> +	if (ret < 0) {
> +		spin_lock(&sctx->stat_lock);
> +		/* Either malloc failure or bg_cache not found */
> +		if (ret == -ENOMEM)
> +			sctx->stat.malloc_errors++;
> +		else
> +			sctx->stat.uncorrectable_errors++;

Other places in scrub_handle_errored_block() also set read_errors and
uncorrectable_errors, why the above is an exception?

I'm fine with putting a bool into lock_full_stripe(), but it's easier
to tell whether locking is successful by setting the flag after
locking.

Thanks,

-liubo

> +		spin_unlock(&sctx->stat_lock);
> +		return ret;
> +	}
> +
>  	if (sctx->is_dev_replace && !is_metadata && !have_csum) {
>  		sblocks_for_recheck = NULL;
>  		goto nodatasum_case;
> @@ -1468,6 +1488,9 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>  		kfree(sblocks_for_recheck);
>  	}
>  
> +	ret = unlock_full_stripe(fs_info, logical, full_stripe_locked);
> +	if (ret < 0)
> +		return ret;
>  	return 0;
>  }
>  
> -- 
> 2.12.1
> 
> 
>