linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] btrfs: workaround the over-confident over-commit available space calculation
@ 2020-09-30 12:01 Qu Wenruo
  2020-10-05  9:22 ` Qu Wenruo
  2020-10-05 13:05 ` Josef Bacik
  0 siblings, 2 replies; 4+ messages in thread
From: Qu Wenruo @ 2020-09-30 12:01 UTC (permalink / raw)
  To: linux-btrfs; +Cc: stable

[BUG]
There are quite some bug reports of btrfs falling into a ENOSPC trap,
where btrfs can't even start a transaction to add new devices.

[CAUSE]
Most of the reports are utilize multi-device profiles, like
RAID1/RAID10/RAID5/RAID6, and the involved disks have very unbalanced
sizes.

It turns out that, the overcommit calculation in btrfs_can_overcommit()
is just a factor based calculation, which can't check if devices can
really fulfill the requirement for the desired profile.

This makes btrfs_can_overcommit() to be always over-confident about
usable space, and when we can't allocate any new metadata chunk but
still allow new metadata operations, we fall into the ENOSPC trap and
have no way to exit it.

[WORKAROUND]
The root fix needs a device layout aware, chunk allocator like available
space calculation.

There used to be such patchset submitted to the mail list, but the extra
failure mode is tricky to handle for chunk allocation, thus that
patchset needs more time to mature.

Meanwhile to prevent such problems reaching more users, workaround the
problem by:
- Half the over-commit available space reported
  So that we won't always be that over-confident.
  But this won't really help if we have extremely unbalanced disk size.

- Don't over-commit if the space info is already full
  This may already be too late, but still better than doing nothing and
  believe the over-commit values.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/space-info.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 475968ccbd1d..e8133ec7e34a 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -339,6 +339,18 @@ static u64 calc_available_free_space(struct btrfs_fs_info *fs_info,
 		avail >>= 3;
 	else
 		avail >>= 1;
+	/*
+	 * Since current over-commit calculation is doomed already for
+	 * RAID0/RADI1/RAID10/RAID5/6, we half the availabe space to reduce
+	 * over-commit amount.
+	 *
+	 * This is just a workaround before the device layout aware
+	 * available space calculation arrives.
+	 */
+	if ((BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1_MASK |
+	     BTRFS_BLOCK_GROUP_RAID10 | BTRFS_BLOCK_GROUP_RAID56_MASK) &
+	     profile)
+		avail >>= 1;
 	return avail;
 }
 
@@ -353,6 +365,14 @@ int btrfs_can_overcommit(struct btrfs_fs_info *fs_info,
 	if (space_info->flags & BTRFS_BLOCK_GROUP_DATA)
 		return 0;
 
+	/*
+	 * If we can't allocate new space already, no overcommit is allowed.
+	 *
+	 * This check may be already late, but still better than nothing.
+	 */
+	if (space_info->full)
+		return 0;
+
 	used = btrfs_space_info_used(space_info, true);
 	avail = calc_available_free_space(fs_info, space_info, flush);
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] btrfs: workaround the over-confident over-commit available space calculation
  2020-09-30 12:01 [PATCH] btrfs: workaround the over-confident over-commit available space calculation Qu Wenruo
@ 2020-10-05  9:22 ` Qu Wenruo
  2020-10-05 13:05 ` Josef Bacik
  1 sibling, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2020-10-05  9:22 UTC (permalink / raw)
  To: linux-btrfs, David Sterba; +Cc: stable


[-- Attachment #1.1: Type: text/plain, Size: 3612 bytes --]

Hi David,

Would you please consider merge this patch as a hotfix?

We have more and more reports about deadly ENOSPC trap for multi-device
setup.

Considering the worst consequence, user can't even delete anything due
to exhausted metadata, I really hope we can at least workaround it.

The side effect of the patch is, smaller metadata over-commit, which may
decrease the performance, but I see it worthy to avoid the worst case
scenario.

And buy enough time for us to review the per-profile available space patch.

Thanks,
Qu

On 2020/9/30 下午8:01, Qu Wenruo wrote:
> [BUG]
> There are quite some bug reports of btrfs falling into a ENOSPC trap,
> where btrfs can't even start a transaction to add new devices.
> 
> [CAUSE]
> Most of the reports are utilize multi-device profiles, like
> RAID1/RAID10/RAID5/RAID6, and the involved disks have very unbalanced
> sizes.
> 
> It turns out that, the overcommit calculation in btrfs_can_overcommit()
> is just a factor based calculation, which can't check if devices can
> really fulfill the requirement for the desired profile.
> 
> This makes btrfs_can_overcommit() to be always over-confident about
> usable space, and when we can't allocate any new metadata chunk but
> still allow new metadata operations, we fall into the ENOSPC trap and
> have no way to exit it.
> 
> [WORKAROUND]
> The root fix needs a device layout aware, chunk allocator like available
> space calculation.
> 
> There used to be such patchset submitted to the mail list, but the extra
> failure mode is tricky to handle for chunk allocation, thus that
> patchset needs more time to mature.
> 
> Meanwhile to prevent such problems reaching more users, workaround the
> problem by:
> - Half the over-commit available space reported
>   So that we won't always be that over-confident.
>   But this won't really help if we have extremely unbalanced disk size.
> 
> - Don't over-commit if the space info is already full
>   This may already be too late, but still better than doing nothing and
>   believe the over-commit values.
> 
> CC: stable@vger.kernel.org # 4.4+
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/space-info.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 475968ccbd1d..e8133ec7e34a 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -339,6 +339,18 @@ static u64 calc_available_free_space(struct btrfs_fs_info *fs_info,
>  		avail >>= 3;
>  	else
>  		avail >>= 1;
> +	/*
> +	 * Since current over-commit calculation is doomed already for
> +	 * RAID0/RADI1/RAID10/RAID5/6, we half the availabe space to reduce
> +	 * over-commit amount.
> +	 *
> +	 * This is just a workaround before the device layout aware
> +	 * available space calculation arrives.
> +	 */
> +	if ((BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1_MASK |
> +	     BTRFS_BLOCK_GROUP_RAID10 | BTRFS_BLOCK_GROUP_RAID56_MASK) &
> +	     profile)
> +		avail >>= 1;
>  	return avail;
>  }
>  
> @@ -353,6 +365,14 @@ int btrfs_can_overcommit(struct btrfs_fs_info *fs_info,
>  	if (space_info->flags & BTRFS_BLOCK_GROUP_DATA)
>  		return 0;
>  
> +	/*
> +	 * If we can't allocate new space already, no overcommit is allowed.
> +	 *
> +	 * This check may be already late, but still better than nothing.
> +	 */
> +	if (space_info->full)
> +		return 0;
> +
>  	used = btrfs_space_info_used(space_info, true);
>  	avail = calc_available_free_space(fs_info, space_info, flush);
>  
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] btrfs: workaround the over-confident over-commit available space calculation
  2020-09-30 12:01 [PATCH] btrfs: workaround the over-confident over-commit available space calculation Qu Wenruo
  2020-10-05  9:22 ` Qu Wenruo
@ 2020-10-05 13:05 ` Josef Bacik
  2020-10-05 13:12   ` Qu Wenruo
  1 sibling, 1 reply; 4+ messages in thread
From: Josef Bacik @ 2020-10-05 13:05 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: stable

On 9/30/20 8:01 AM, Qu Wenruo wrote:
> [BUG]
> There are quite some bug reports of btrfs falling into a ENOSPC trap,
> where btrfs can't even start a transaction to add new devices.
> 
> [CAUSE]
> Most of the reports are utilize multi-device profiles, like
> RAID1/RAID10/RAID5/RAID6, and the involved disks have very unbalanced
> sizes.
> 
> It turns out that, the overcommit calculation in btrfs_can_overcommit()
> is just a factor based calculation, which can't check if devices can
> really fulfill the requirement for the desired profile.
> 
> This makes btrfs_can_overcommit() to be always over-confident about
> usable space, and when we can't allocate any new metadata chunk but
> still allow new metadata operations, we fall into the ENOSPC trap and
> have no way to exit it.
> 
> [WORKAROUND]
> The root fix needs a device layout aware, chunk allocator like available
> space calculation.
> 
> There used to be such patchset submitted to the mail list, but the extra
> failure mode is tricky to handle for chunk allocation, thus that
> patchset needs more time to mature.
> 
> Meanwhile to prevent such problems reaching more users, workaround the
> problem by:
> - Half the over-commit available space reported
>    So that we won't always be that over-confident.
>    But this won't really help if we have extremely unbalanced disk size.
> 
> - Don't over-commit if the space info is already full
>    This may already be too late, but still better than doing nothing and
>    believe the over-commit values.
> 

I just had a thought, what if we simply cap the free_chunk_space to the min of 
the free space of all the devices.  Simply walk through all the devices on 
mount, and we do the initial set of whatever the smallest one is.  The rest of 
the math would work out fine, and the rest of the modifications would work fine. 
  The only "tricky" part would be when we do a shrink or grow, we'd have to 
re-calculate the sizes for everybody, but that's not a big deal.  Thanks,

Josef


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] btrfs: workaround the over-confident over-commit available space calculation
  2020-10-05 13:05 ` Josef Bacik
@ 2020-10-05 13:12   ` Qu Wenruo
  0 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2020-10-05 13:12 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs; +Cc: stable


[-- Attachment #1.1: Type: text/plain, Size: 2781 bytes --]



On 2020/10/5 下午9:05, Josef Bacik wrote:
> On 9/30/20 8:01 AM, Qu Wenruo wrote:
>> [BUG]
>> There are quite some bug reports of btrfs falling into a ENOSPC trap,
>> where btrfs can't even start a transaction to add new devices.
>>
>> [CAUSE]
>> Most of the reports are utilize multi-device profiles, like
>> RAID1/RAID10/RAID5/RAID6, and the involved disks have very unbalanced
>> sizes.
>>
>> It turns out that, the overcommit calculation in btrfs_can_overcommit()
>> is just a factor based calculation, which can't check if devices can
>> really fulfill the requirement for the desired profile.
>>
>> This makes btrfs_can_overcommit() to be always over-confident about
>> usable space, and when we can't allocate any new metadata chunk but
>> still allow new metadata operations, we fall into the ENOSPC trap and
>> have no way to exit it.
>>
>> [WORKAROUND]
>> The root fix needs a device layout aware, chunk allocator like available
>> space calculation.
>>
>> There used to be such patchset submitted to the mail list, but the extra
>> failure mode is tricky to handle for chunk allocation, thus that
>> patchset needs more time to mature.
>>
>> Meanwhile to prevent such problems reaching more users, workaround the
>> problem by:
>> - Half the over-commit available space reported
>>    So that we won't always be that over-confident.
>>    But this won't really help if we have extremely unbalanced disk size.
>>
>> - Don't over-commit if the space info is already full
>>    This may already be too late, but still better than doing nothing and
>>    believe the over-commit values.
>>
> 
> I just had a thought, what if we simply cap the free_chunk_space to the
> min of the free space of all the devices.

Sure, reducing the number will never be a problem.

> Simply walk through all the
> devices on mount, and we do the initial set of whatever the smallest one
> is.  The rest of the math would work out fine, and the rest of the
> modifications would work fine.

But I still prefer to do the minimal device size update at the timing of
my per-profile available space, so we don't have any chance to
over-estimate.

>  The only "tricky" part would be when we
> do a shrink or grow, we'd have to re-calculate the sizes for everybody,
> but that's not a big deal.  Thanks,

As long as we don't over-estimate, everything will be fine, just how
many extra metadata flushing is needed (thus extra overhead).

The rest is just a spectrum between "I don't really like over-commit at
all and let's make it really hard to do any overcommit" and "I'm a super
smart guy and here is the best algorithm to estimate how many space we
really have for over-commit".

Thanks,
Qu

> 
> Josef
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2020-10-05 13:13 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-30 12:01 [PATCH] btrfs: workaround the over-confident over-commit available space calculation Qu Wenruo
2020-10-05  9:22 ` Qu Wenruo
2020-10-05 13:05 ` Josef Bacik
2020-10-05 13:12   ` Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).