* Global reserve ran out of space at 512MB, fails to rebalance
@ 2020-12-10  1:52 Eric Wheeler
  2020-12-10  2:38 ` Qu Wenruo
  2020-12-10  3:12 ` Zygo Blaxell
  0 siblings, 2 replies; 8+ messages in thread
From: Eric Wheeler @ 2020-12-10  1:52 UTC (permalink / raw)
  To: linux-btrfs

Hello all,

We have a 30TB volume with lots of snapshots that is low on space and we 
are trying to rebalance.  Even if we don't rebalance, the space cleaner 
still fills up the Global reserve:

    Device size:                  30.00TiB
    Device allocated:             30.00TiB
    Device unallocated:            1.00GiB
    Device missing:                  0.00B
    Used:                         29.27TiB
    Free (estimated):            705.21GiB	(min: 704.71GiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
>>> Global reserve:              512.00MiB	(used: 512.00MiB) <<<<<<<

This was on a Linux 5.6 kernel.  I'm trying a Linux 5.9.13 kernel with a 
hacked in SZ_4G in place of the SZ_512MB and will report back when I learn 
more.

In the meantime, do you have any suggestions to work through the issue?

Thank you for your help!


--
Eric Wheeler


* Re: Global reserve ran out of space at 512MB, fails to rebalance
  2020-12-10  1:52 Global reserve ran out of space at 512MB, fails to rebalance Eric Wheeler
@ 2020-12-10  2:38 ` Qu Wenruo
  2020-12-10  3:12 ` Zygo Blaxell
  1 sibling, 0 replies; 8+ messages in thread
From: Qu Wenruo @ 2020-12-10  2:38 UTC (permalink / raw)
  To: Eric Wheeler, linux-btrfs


On 2020/12/10 9:52 AM, Eric Wheeler wrote:
> Hello all,
> 
> We have a 30TB volume with lots of snapshots that is low on space and we 
> are trying to rebalance.  Even if we don't rebalance, the space cleaner 
> still fills up the Global reserve:
> 
>     Device size:                  30.00TiB
>     Device allocated:             30.00TiB
>     Device unallocated:            1.00GiB

So it still has one GiB unallocated for rebalance.

But don't expect much luck, as balance is not the solution to your problem.
What you really need is to free up some metadata space, since you're
exhausting it (only 1GiB is left available for new metadata allocations).

>     Device missing:                  0.00B
>     Used:                         29.27TiB
>     Free (estimated):            705.21GiB	(min: 704.71GiB)
>     Data ratio:                       1.00
>     Metadata ratio:                   2.00
>>>> Global reserve:              512.00MiB	(used: 512.00MiB) <<<<<<<
> 
> This was on a Linux 5.6 kernel.  I'm trying a Linux 5.9.13 kernel with a 
> hacked in SZ_4G in place of the SZ_512MB and will report back when I learn 
> more.

It won't change. In fact, if you increase the global rsv to 4G, it would
cause more problems, since btrfs will try to reserve more metadata space,
which you definitely don't have.
> 
> In the meantime, do you have any suggestions to work through the issue?

If you can still remove files/snapshots, do it.
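
Something like this (just a sketch with example paths; the space only comes
back after the cleaner has processed the deleted subvolumes):

    # list snapshots and delete the ones you can spare
    btrfs subvolume list -s /path/to/fs
    btrfs subvolume delete /path/to/fs/snapshots/<oldest>
    # wait until deleted subvolumes are fully cleaned before re-checking space
    btrfs subvolume sync /path/to/fs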

Thanks,
Qu
> 
> Thank you for your help!
> 
> 
> --
> Eric Wheeler
> 




* Re: Global reserve ran out of space at 512MB, fails to rebalance
  2020-12-10  1:52 Global reserve ran out of space at 512MB, fails to rebalance Eric Wheeler
  2020-12-10  2:38 ` Qu Wenruo
@ 2020-12-10  3:12 ` Zygo Blaxell
  2020-12-10 19:02   ` Eric Wheeler
  2020-12-10 19:50   ` Eric Wheeler
  1 sibling, 2 replies; 8+ messages in thread
From: Zygo Blaxell @ 2020-12-10  3:12 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: linux-btrfs

On Thu, Dec 10, 2020 at 01:52:19AM +0000, Eric Wheeler wrote:
> Hello all,
> 
> We have a 30TB volume with lots of snapshots that is low on space and we 
> are trying to rebalance.  Even if we don't rebalance, the space cleaner 
> still fills up the Global reserve:
> 
>     Device size:                  30.00TiB
>     Device allocated:             30.00TiB
>     Device unallocated:            1.00GiB
>     Device missing:                  0.00B
>     Used:                         29.27TiB
>     Free (estimated):            705.21GiB	(min: 704.71GiB)
>     Data ratio:                       1.00
>     Metadata ratio:                   2.00
> >>> Global reserve:              512.00MiB	(used: 512.00MiB) <<<<<<<

It would be nice to have the rest of the btrfs fi usage output.  We are
having to guess how your drives are populated with data and metadata
and what profiles are in use.

You probably need to be running some data balances (btrfs balance start
-dlimit=9 about once a day) to ensure there is always at least 1GB of
unallocated space on every drive.
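
e.g. a minimal cron job along these lines (untested sketch; adjust the mount
point for your filesystem):

    #!/bin/sh
    # /etc/cron.daily/btrfs-data-balance (example path)
    # compact up to 9 partly-used data chunks to keep some space unallocated
    btrfs balance start -dlimit=9 /path/to/fs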

Never balance metadata, especially not from a scheduled job.  Metadata
balances lead directly to this situation.

> This was on a Linux 5.6 kernel.  I'm trying a Linux 5.9.13 kernel with a 
> hacked in SZ_4G in place of the SZ_512MB and will report back when I learn 
> more.
> 
> In the meantime, do you have any suggestions to work through the issue?

I've had similar problems with snapshot deletes hitting ENOSPC with
small amounts of free metadata space.  In this case, the upgrade from
5.6 to 5.9 will include a fix for that (it's in 5.8, also 5.4 and earlier
LTS kernels).

Increasing the global reserve may seem to help, but so will just rebooting
over and over, so a positive result from an experimental kernel does not
necessarily mean anything.  Pending snapshot deletes will be making small
amounts of progress just before hitting ENOSPC, so it will eventually
succeed if you repeat the mount enough times even with an old stock
kernel.

> Thank you for your help!
> 
> 
> --
> Eric Wheeler


* Re: Global reserve ran out of space at 512MB, fails to rebalance
  2020-12-10  3:12 ` Zygo Blaxell
@ 2020-12-10 19:02   ` Eric Wheeler
  2020-12-10 19:50   ` Eric Wheeler
  1 sibling, 0 replies; 8+ messages in thread
From: Eric Wheeler @ 2020-12-10 19:02 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs, Qu Wenruo

On Wed, 9 Dec 2020, Zygo Blaxell wrote:

> On Thu, Dec 10, 2020 at 01:52:19AM +0000, Eric Wheeler wrote:
> > Hello all,
> > 
> > We have a 30TB volume with lots of snapshots that is low on space and we 
> > are trying to rebalance.  Even if we don't rebalance, the space cleaner 
> > still fills up the Global reserve:
> > 
> >     Device size:                  30.00TiB
> >     Device allocated:             30.00TiB
> >     Device unallocated:            1.00GiB
> >     Device missing:                  0.00B
> >     Used:                         29.27TiB
> >     Free (estimated):            705.21GiB	(min: 704.71GiB)
> >     Data ratio:                       1.00
> >     Metadata ratio:                   2.00
> > >>> Global reserve:              512.00MiB	(used: 512.00MiB) <<<<<<<
> 
> It would be nice to have the rest of the btrfs fi usage output.  We are
> having to guess how your drives are populated with data and metadata
> and what profiles are in use.
> 
> You probably need to be running some data balances (btrfs balance start
> -dlimit=9 about once a day) to ensure there is always at least 1GB of
> unallocated space on every drive.
> 
> Never balance metadata, especially not from a scheduled job.  Metadata
> balances lead directly to this situation.
> 
> > This was on a Linux 5.6 kernel.  I'm trying a Linux 5.9.13 kernel with a 
> > hacked in SZ_4G in place of the SZ_512MB and will report back when I learn 
> > more.
> > 
> > In the meantime, do you have any suggestions to work through the issue?
> 
> I've had similar problems with snapshot deletes hitting ENOSPC with
> small amounts of free metadata space.  In this case, the upgrade from
> 5.6 to 5.9 will include a fix for that (it's in 5.8, also 5.4 and earlier
> LTS kernels).

Good to know, glad there's a patch for that!

Zygo and Qu, thank you both for your feedback!

-Eric

> 
> Increasing the global reserve may seem to help, but so will just rebooting
> over and over, so a positive result from an experimental kernel does not
> necessarily mean anything.  Pending snapshot deletes will be making small
> amounts of progress just before hitting ENOSPC, so it will eventually
> succeed if you repeat the mount enough times even with an old stock
> kernel.
> 
> > Thank you for your help!
> > 
> > 
> > --
> > Eric Wheeler
> 


* Re: Global reserve ran out of space at 512MB, fails to rebalance
  2020-12-10  3:12 ` Zygo Blaxell
  2020-12-10 19:02   ` Eric Wheeler
@ 2020-12-10 19:50   ` Eric Wheeler
  2020-12-11  3:49     ` Zygo Blaxell
  1 sibling, 1 reply; 8+ messages in thread
From: Eric Wheeler @ 2020-12-10 19:50 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs, Qu Wenruo

On Wed, 9 Dec 2020, Zygo Blaxell wrote:
> On Thu, Dec 10, 2020 at 01:52:19AM +0000, Eric Wheeler wrote:
> > Hello all,
> > 
> > We have a 30TB volume with lots of snapshots that is low on space and we 
> > are trying to rebalance.  Even if we don't rebalance, the space cleaner 
> > still fills up the Global reserve:
> > 
> > >>> Global reserve:              512.00MiB	(used: 512.00MiB) <<<<<<<
> 
> It would be nice to have the rest of the btrfs fi usage output.  We are
> having to guess how your drives are populated with data and metadata
> and what profiles are in use.

Here is the whole output:

]# df -h /mnt/btrbackup ; echo; btrfs fi df /mnt/btrbackup|column -t; echo; btrfs fi usage /mnt/btrbackup

Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/btrbackup-luks   30T   30T  541G  99% /mnt/btrbackup

Data,           single:  total=29.80TiB,  used=29.28TiB
System,         DUP:     total=8.00MiB,   used=3.42MiB
Metadata,	DUP:     total=99.00GiB,  used=87.03GiB
GlobalReserve,  single:  total=4.00GiB,   used=1.73MiB

Overall:
    Device size:                  30.00TiB
    Device allocated:             30.00TiB
    Device unallocated:            1.00GiB
    Device missing:                  0.00B
    Used:                         29.45TiB
    Free (estimated):            540.74GiB	(min: 540.24GiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:                4.00GiB	(used: 1.73MiB) <<<< with 4GB hack

Data,single: Size:29.80TiB, Used:29.28TiB (98.23%)
   /dev/mapper/btrbackup-luks     29.80TiB

Metadata,DUP: Size:99.00GiB, Used:87.03GiB (87.91%)
   /dev/mapper/btrbackup-luks    198.00GiB

System,DUP: Size:8.00MiB, Used:3.42MiB (42.77%)
   /dev/mapper/btrbackup-luks     16.00MiB

Unallocated:
   /dev/mapper/btrbackup-luks	   1.00GiB
 
> You probably need to be running some data balances (btrfs balance start
> -dlimit=9 about once a day) to ensure there is always at least 1GB of
> unallocated space on every drive.

Thanks for the daily rebalance tip.  Is there a reason you chose 
-dlimit=9?  I know it means 9 chunks, but why 9?  Also, how big is a 
"chunk" ? 

In this case we have 1GB unallocated.

> Never balance metadata, especially not from a scheduled job.  Metadata
> balances lead directly to this situation.

So when /would/ you balance metadata?

> > This was on a Linux 5.6 kernel.  I'm trying a Linux 5.9.13 kernel with a 
> > hacked in SZ_4G in place of the SZ_512MB and will report back when I learn 
> > more.
>
> I've had similar problems with snapshot deletes hitting ENOSPC with
> small amounts of free metadata space.  In this case, the upgrade from
> 5.6 to 5.9 will include a fix for that (it's in 5.8, also 5.4 and earlier
> LTS kernels).

Ok, now on 5.9.13
 
> Increasing the global reserve may seem to help, but so will just rebooting
> over and over, so a positive result from an experimental kernel does not
> necessarily mean anything.

At least this reduces the number of times I need to reboot ;)

Question: 

What do people think of making this a module option or ioctl for those who 
need to hack it into place to minimize reboots?

-Eric

--
Eric Wheeler


* Re: Global reserve ran out of space at 512MB, fails to rebalance
  2020-12-10 19:50   ` Eric Wheeler
@ 2020-12-11  3:49     ` Zygo Blaxell
  2020-12-11 19:08       ` Eric Wheeler
  0 siblings, 1 reply; 8+ messages in thread
From: Zygo Blaxell @ 2020-12-11  3:49 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: linux-btrfs, Qu Wenruo

On Thu, Dec 10, 2020 at 07:50:06PM +0000, Eric Wheeler wrote:
> On Wed, 9 Dec 2020, Zygo Blaxell wrote:
> > On Thu, Dec 10, 2020 at 01:52:19AM +0000, Eric Wheeler wrote:
> > > Hello all,
> > > 
> > > We have a 30TB volume with lots of snapshots that is low on space and we 
> > > are trying to rebalance.  Even if we don't rebalance, the space cleaner 
> > > still fills up the Global reserve:
> > > 
> > > >>> Global reserve:              512.00MiB	(used: 512.00MiB) <<<<<<<
> > 
> > It would be nice to have the rest of the btrfs fi usage output.  We are
> > having to guess how your drives are populated with data and metadata
> > and what profiles are in use.
> 
> Here is the whole output:
> 
> ]# df -h /mnt/btrbackup ; echo; btrfs fi df /mnt/btrbackup|column -t; echo; btrfs fi usage /mnt/btrbackup
> 
> Filesystem                  Size  Used Avail Use% Mounted on
> /dev/mapper/btrbackup-luks   30T   30T  541G  99% /mnt/btrbackup
> 
> Data,           single:  total=29.80TiB,  used=29.28TiB
> System,         DUP:     total=8.00MiB,   used=3.42MiB
> Metadata,	DUP:     total=99.00GiB,  used=87.03GiB
> GlobalReserve,  single:  total=4.00GiB,   used=1.73MiB
> 
> Overall:
>     Device size:                  30.00TiB
>     Device allocated:             30.00TiB
>     Device unallocated:            1.00GiB
>     Device missing:                  0.00B
>     Used:                         29.45TiB
>     Free (estimated):            540.74GiB	(min: 540.24GiB)
>     Data ratio:                       1.00
>     Metadata ratio:                   2.00
>     Global reserve:                4.00GiB	(used: 1.73MiB) <<<< with 4GB hack
> 
> Data,single: Size:29.80TiB, Used:29.28TiB (98.23%)
>    /dev/mapper/btrbackup-luks     29.80TiB
> 
> Metadata,DUP: Size:99.00GiB, Used:87.03GiB (87.91%)
>    /dev/mapper/btrbackup-luks    198.00GiB
> 
> System,DUP: Size:8.00MiB, Used:3.42MiB (42.77%)
>    /dev/mapper/btrbackup-luks     16.00MiB
> 
> Unallocated:
>    /dev/mapper/btrbackup-luks	   1.00GiB
>  
> > You probably need to be running some data balances (btrfs balance start
> > -dlimit=9 about once a day) to ensure there is always at least 1GB of
> > unallocated space on every drive.
> 
> Thanks for the daily rebalance tip.  Is there a reason you chose 
> -dlimit=9?  I know it means 9 chunks, but why 9?  Also, how big is a 
> "chunk" ? 

"9" is fewer digits than "10"... ;)

It's a good starting point for an average filesystem.  You may want
to tune it for your workload.

Most chunks are multiples of 1GB except on very small filesystems;
however, sometimes the last data chunk is an odd size (i.e. not 1GB).
If the filesystem has been resized in the past, there could be multiple
odd-sized chunks.  A balance of one or two tiny chunks might not release a
useful amount of space.  Balancing 9 chunks at a time reduces the chance
of getting stuck with such a small chunk, and also gives you some extra
unallocated space in case some of it gets allocated during the day.

On a mostly idle system, you may not need to run the balance every day.
On a system with hundreds of GB of unallocated space on every drive,
you don't need to run balance at all.  Monitor your system's unallocated
space over time and adjust the limit and scheduling as required so you
don't run out of unallocated space.

Ideally we'd have a packaged tool ready to install that makes those
decisions automatically, but we're still trying to figure out what the
threshold values should be.  Clearly, "2TB of unallocated space on every
drive" needs no intervention, while "0 unallocated space on any drive"
is too late for intervention to be useful any more.  In between these
extremes, it is harder to evaluate from code.  When there's 20GB
unallocated on every drive, and 10GB are consumed per day, does that
require a balance today, or can it wait until tomorrow?  How would we get
the trend data?  Will the user do something unpredictable that requires
balance earlier than expected?  Will we set the thresholds too low,
and waste a lot of iops on balancing that isn't useful?  Will we set
the thresholds too high, and fail to intervene on a system that is
headed for metadata ENOSPC?
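
The trivial version of such a tool is easy to sketch (the threshold and mount
point below are arbitrary examples, not recommendations):

    #!/bin/sh
    # run a small data balance only when unallocated space drops below ~10GiB
    MNT=/path/to/fs
    THRESHOLD=$((10 * 1024 * 1024 * 1024))
    UNALLOC=$(btrfs filesystem usage -b "$MNT" | awk '/Device unallocated:/ {print $3}')
    if [ "$UNALLOC" -lt "$THRESHOLD" ]; then
            btrfs balance start -dlimit=9 "$MNT"
    fi

The hard part is everything that script doesn't do: trend prediction,
per-device accounting, and not burning iops on balances that aren't needed.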

> In this case we have 1GB unallocated.

And also almost 12 GB of allocated but unused metadata.  So I'm not sure
why you're hitting ENOSPC unless you are hitting a straight-up bug
(or using ssd_spread...don't use ssd_spread).  But 5.6 kernels did
have exactly such a bug (as did all kernels before 5.8).

> > Never balance metadata, especially not from a scheduled job.  Metadata
> > balances lead directly to this situation.
> 
> So when /would/ you balance metadata?

There are three cases where metadata balance does something useful:

	1.  When converting to a different raid profile (e.g. add
	a second disk, convert with balance -mconvert=raid1,soft).

	2.  When shrinking a disk or removing a drive from an array.
	This will run relocation on metadata in the removed sections
	of the filesystem.  This is not avoidable with the current
	btrfs code.

	3.  When you are a kernel developer testing for regressions in
	the metadata balance code.

Otherwise, never balance metadata.
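
For case 1, the kind of invocation I mean looks like this (sketch only; the
device and mount point are examples):

    # after adding a second device, mirror metadata across both;
    # 'soft' skips chunks that are already in the target profile
    btrfs device add /dev/sdX /path/to/fs
    btrfs balance start -mconvert=raid1,soft /path/to/fs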

btrfs is pretty good at reusing the allocated metadata space, plus
or minus a gigabyte.  Free space fragmentation is not an issue for
metadata space, since every metadata page is the same size, so there is
no equivalent benefit for defragmenting free space in metadata chunks as
there is for data chunks.  If your filesystem needed metadata space in
the past, chances are good it will be needed in the future, so the peak
metadata allocation size is almost always the correct size.

Metadata balance will deallocate metadata chunks based on current
instantaneous metadata space requirements, which can leave you with
insufficient space for metadata when you need it at other times.
Metadata should be allowed to grow, and never forced to shrink, until
it reaches approximately the right size for your workload.

> > > This was on a Linux 5.6 kernel.  I'm trying a Linux 5.9.13 kernel with a 
> > > hacked in SZ_4G in place of the SZ_512MB and will report back when I learn 
> > > more.
> >
> > I've had similar problems with snapshot deletes hitting ENOSPC with
> > small amounts of free metadata space.  In this case, the upgrade from
> > 5.6 to 5.9 will include a fix for that (it's in 5.8, also 5.4 and earlier
> > LTS kernels).
> 
> Ok, now on 5.9.13
>  
> > Increasing the global reserve may seem to help, but so will just rebooting
> > over and over, so a positive result from an experimental kernel does not
> > necessarily mean anything.
> 
> At least this reduces the number of times I need to reboot ;)

Does it?

You switched from SZ_512M on 5.6 to SZ_4G on 5.9.  5.9 has a proper fix
for the bug you most likely encountered, in which case SZ_4G would not
have any effect other than to use more memory and increase commit latency.

Transactions should be throttled at the reserved size, regardless of what
the reserved size is.  i.e. if you make the reserved size bigger, then
big deletes will just run longer and use more memory between commits.
Deletions are pretty huge metadata consumers--deleting as little as
0.3% of the filesystem can force a rewrite of all metadata in the worst
case--so you have probably been hitting the 512M limit over and over
without issue before now.  Transactions are supposed to grow to reach
the reserved limit and then pause to commit, so a big delete operation
will span many transactions.  The limit is there to keep kernel RAM
usage and transaction latency down.

If you ran two boots at SZ_512M followed by one boot at SZ_4G, and it
works, then that test only tells you that you needed to do 3 boots in
total to recover the filesystem from the initial bad state.  It doesn't
indicate SZ_4G is solving any problem.  To confirm that hypothesis,
you'd need to rewind the filesystem back to the initial state, run
a series of reboots on kernel A and a series of reboots on kernel B,
correct for timing artifacts (at some point there will be a scheduled
transaction commit that will appear at a random point in the IO sequence,
so different test runs under identical initial conditions will not produce
identical results), and show both 1) a consistent improvement in reboot
count between kernel A and kernel B, and 2) the only difference between
kernel A and kernel B is reserved size SZ_512M vs SZ_4G.

> Question: 
> 
> What do people think of making this a module option or ioctl for those who 
> need to hack it into place to minimize reboots?

I think it would be better to find the code that is failing to handle
running out of free transaction metadata, and fix that.  Assuming it
isn't fixed already.

> -Eric
> 
> --
> Eric Wheeler


* Re: Global reserve ran out of space at 512MB, fails to rebalance
  2020-12-11  3:49     ` Zygo Blaxell
@ 2020-12-11 19:08       ` Eric Wheeler
  2020-12-11 21:05         ` Zygo Blaxell
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Wheeler @ 2020-12-11 19:08 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs, Qu Wenruo

On Thu, 10 Dec 2020, Zygo Blaxell wrote:
> On Thu, Dec 10, 2020 at 07:50:06PM +0000, Eric Wheeler wrote:
> > On Wed, 9 Dec 2020, Zygo Blaxell wrote:
> > > On Thu, Dec 10, 2020 at 01:52:19AM +0000, Eric Wheeler wrote:
> > > > Hello all,
> > > > 
> > > > We have a 30TB volume with lots of snapshots that is low on space and we 
> > > > are trying to rebalance.  Even if we don't rebalance, the space cleaner 
> > > > still fills up the Global reserve:
> > > > 
> > > > >>> Global reserve:              512.00MiB	(used: 512.00MiB) <<<<<<<
> > > 
> > > It would be nice to have the rest of the btrfs fi usage output.  We are
> > > having to guess how your drives are populated with data and metadata
> > > and what profiles are in use.
> > 
> > Here is the whole output:
> > 
> > ]# df -h /mnt/btrbackup ; echo; btrfs fi df /mnt/btrbackup|column -t; echo; btrfs fi usage /mnt/btrbackup
> > 
> > Filesystem                  Size  Used Avail Use% Mounted on
> > /dev/mapper/btrbackup-luks   30T   30T  541G  99% /mnt/btrbackup
> > 
> > Data,           single:  total=29.80TiB,  used=29.28TiB
> > System,         DUP:     total=8.00MiB,   used=3.42MiB
> > Metadata,	DUP:     total=99.00GiB,  used=87.03GiB
> > GlobalReserve,  single:  total=4.00GiB,   used=1.73MiB
> > 
> > Overall:
> >     Device size:                  30.00TiB
> >     Device allocated:             30.00TiB
> >     Device unallocated:            1.00GiB
> >     Device missing:                  0.00B
> >     Used:                         29.45TiB
> >     Free (estimated):            540.74GiB	(min: 540.24GiB)
> >     Data ratio:                       1.00
> >     Metadata ratio:                   2.00
> >     Global reserve:                4.00GiB	(used: 1.73MiB) <<<< with 4GB hack
> > 
> > Data,single: Size:29.80TiB, Used:29.28TiB (98.23%)
> >    /dev/mapper/btrbackup-luks     29.80TiB
> > 
> > Metadata,DUP: Size:99.00GiB, Used:87.03GiB (87.91%)
> >    /dev/mapper/btrbackup-luks    198.00GiB
> > 
> > System,DUP: Size:8.00MiB, Used:3.42MiB (42.77%)
> >    /dev/mapper/btrbackup-luks     16.00MiB
> > 
> > Unallocated:
> >    /dev/mapper/btrbackup-luks	   1.00GiB
> >  
> > > You probably need to be running some data balances (btrfs balance start
> > > -dlimit=9 about once a day) to ensure there is always at least 1GB of
> > > unallocated space on every drive.
> > 
> > Thanks for the daily rebalance tip.  Is there a reason you chose 
> > -dlimit=9?  I know it means 9 chunks, but why 9?  Also, how big is a 
> > "chunk" ? 
> 
> "9" is fewer digits than "10"... ;)
> 
> It's a good starting point for an average filesystem.  You may want
> to tune it for your workload.
> 
> Most chunks are multiples of 1GB except on very small filesystems;
> however, sometimes the last data chunk is an odd size (i.e. not 1GB).
> If the filesystem has been resized in the past, there could be multiple
> odd-sized chunks.  A balance of one or two tiny chunks might not release a
> useful amount of space.  Balancing 9 chunks at a time reduces the chance
> of getting stuck with such a small chunk, and also gives you some extra
> unallocated space in case some of it gets allocated during the day.
> 
> On a mostly idle system, you may not need to run the balance every day.
> On a system with hundreds of GB of unallocated space on every drive,
> you don't need to run balance at all.  Monitor your system's unallocated
> space over time and adjust the limit and scheduling as required so you
> don't run out of unallocated space.
> 
> Ideally we'd have a packaged tool ready to install that makes those
> decisions automatically, but we're still trying to figure out what the
> threshold values should be.  Clearly, "2TB of unallocated space on every
> drive" needs no intervention, while "0 unallocated space on any drive"
> is too late for intervention to be useful any more.  In between these
> extremes, it is harder to evaluate from code.  When there's 20GB
> unallocated on every drive, and 10GB are consumed per day, does that
> require a balance today, or can it wait until tomorrow?  How would we get
> the trend data?  Will the user do something unpredictable that requires
> balance earlier than expected?  Will we set the thresholds too low,
> and waste a lot of iops on balancing that isn't useful?  Will we set
> the thresholds too high, and fail to intervene on a system that is
> headed for metadata ENOSPC?
> 
> > In this case we have 1GB unallocated.
> 
> And also almost 12 GB of allocated but unused metadata.  So I'm not sure
> why you're hitting ENOSPC unless you are hitting a straight-up bug
> (or using ssd_spread...don't use ssd_spread).  But 5.6 kernels did
> have exactly such a bug (as did all kernels before 5.8).

Sounds like our issue, we were on 5.6.

> 
> > > Never balance metadata, especially not from a scheduled job.  Metadata
> > > balances lead directly to this situation.
> > 
> > So when /would/ you balance metadata?
> 
> There are three cases where metadata balance does something useful:
> 
> 	1.  When converting to a different raid profile (e.g. add
> 	a second disk, convert with balance -mconvert=raid1,soft).
> 
> 	2.  When shrinking a disk or removing a drive from an array.
> 	This will run relocation on metadata in the removed sections
> 	of the filesystem.  This is not avoidable with the current
> 	btrfs code.
> 
> 	3.  When you are a kernel developer testing for regressions in
> 	the metadata balance code.
> 
> Otherwise, never balance metadata.
> 
> btrfs is pretty good at reusing the allocated metadata space, plus
> or minus a gigabyte.  Free space fragmentation is not an issue for
> metadata space, since every metadata page is the same size, so there is
> no equivalent benefit for defragmenting free space in metadata chunks as
> there is for data chunks.  If your filesystem needed metadata space in
> the past, chances are good it will be needed in the future, so the peak
> metadata allocation size is almost always the correct size.
> 
> Metadata balance will deallocate metadata chunks based on current
> instantaneous metadata space requirements, which can leave you with
> insufficient space for metadata when you need it at other times.
> Metadata should be allowed to grow, and never forced to shrink, until
> it reaches approximately the right size for your workload.
> 
> > > > This was on a Linux 5.6 kernel.  I'm trying a Linux 5.9.13 kernel with a 
> > > > hacked in SZ_4G in place of the SZ_512MB and will report back when I learn 
> > > > more.
> > >
> > > I've had similar problems with snapshot deletes hitting ENOSPC with
> > > small amounts of free metadata space.  In this case, the upgrade from
> > > 5.6 to 5.9 will include a fix for that (it's in 5.8, also 5.4 and earlier
> > > LTS kernels).
> > 
> > Ok, now on 5.9.13
> >  
> > > Increasing the global reserve may seem to help, but so will just rebooting
> > > over and over, so a positive result from an experimental kernel does not
> > > necessarily mean anything.
> > 
> > At least this reduces the number of times I need to reboot ;)
> 
> Does it?
> 
> You switched from SZ_512M on 5.6 to SZ_4G on 5.9.  5.9 has a proper fix
> for the bug you most likely encountered, in which case SZ_4G would not
> have any effect other than to use more memory and increase commit latency.
> 
> Transactions should be throttled at the reserved size, regardless of what
> the reserved size is.  i.e. if you make the reserved size bigger, then
> big deletes will just run longer and use more memory between commits.
> Deletions are pretty huge metadata consumers--deleting as little as
> 0.3% of the filesystem can force a rewrite of all metadata in the worst
> case--so you have probably been hitting the 512M limit over and over
> without issue before now.  Transactions are supposed to grow to reach
> the reserved limit and then pause to commit, so a big delete operation
> will span many transactions.  The limit is there to keep kernel RAM
> usage and transaction latency down.

Interesting.  In that case, perhaps there are low-latency scenarios for 
which the 512MB limit should be tuned down (not up like we tried).

I'm guessing most transaction management is asynchronous to userspace, but 
under what circumstances (if any) might a transaction commit block 
userspace IO?
 
> If you ran two boots at SZ_512M followed by one boot at SZ_4G, and it
> works, then that test only tells you that you needed to do 3 boots in
> total to recover the filesystem from the initial bad state.  It doesn't
> indicate SZ_4G is solving any problem.  To confirm that hypothesis,
> you'd need to rewind the filesystem back to the initial state, run
> a series of reboots on kernel A and a series of reboots on kernel B,
> correct for timing artifacts (at some point there will be a scheduled
> transaction commit that will appear at a random point in the IO sequence,
> so different test runs under identical initial conditions will not produce
> identical results), and show both 1) a consistent improvement in reboot
> count between kernel A and kernel B, and 2) the only difference between
> kernel A and kernel B is reserved size SZ_512M vs SZ_4G.

I understand; it is likely that 5.9 with 4G didn't do anything, since the 
5.8 fix is in place.  Good to know that we don't have a reason to maintain 
the 4G hack.

Great writeup.  Your entire email response should go on the wiki!  I've 
searched high and low for disk space best practices in BTRFS and I think you 
covered it all.

The system is in great shape now, check it out:

Overall:
    Device size:                  30.00TiB
    Device allocated:             29.88TiB
    Device unallocated:          123.97GiB
    Device missing:                  0.00B
    Used:                         23.45TiB
    Free (estimated):              6.47TiB	(min: 6.41TiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:                4.00GiB	(used: 0.00B)

Data,single: Size:29.68TiB, Used:23.33TiB (78.59%)
   /dev/mapper/btrbackup-luks     29.68TiB

Metadata,DUP: Size:101.00GiB, Used:63.49GiB (62.86%)
   /dev/mapper/btrbackup-luks    202.00GiB

System,DUP: Size:8.00MiB, Used:3.42MiB (42.77%)
   /dev/mapper/btrbackup-luks     16.00MiB

Unallocated:
   /dev/mapper/btrbackup-luks    123.97GiB


--
Eric Wheeler


* Re: Global reserve ran out of space at 512MB, fails to rebalance
  2020-12-11 19:08       ` Eric Wheeler
@ 2020-12-11 21:05         ` Zygo Blaxell
  0 siblings, 0 replies; 8+ messages in thread
From: Zygo Blaxell @ 2020-12-11 21:05 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: linux-btrfs, Qu Wenruo

On Fri, Dec 11, 2020 at 07:08:23PM +0000, Eric Wheeler wrote:
> On Thu, 10 Dec 2020, Zygo Blaxell wrote:
> > On Thu, Dec 10, 2020 at 07:50:06PM +0000, Eric Wheeler wrote:
> > > On Wed, 9 Dec 2020, Zygo Blaxell wrote:
> > > > On Thu, Dec 10, 2020 at 01:52:19AM +0000, Eric Wheeler wrote:
> > > > > Hello all,
> > > > > 
> > > > > We have a 30TB volume with lots of snapshots that is low on space and we 
> > > > > are trying to rebalance.  Even if we don't rebalance, the space cleaner 
> > > > > still fills up the Global reserve:
> > > > > 
> > > > > >>> Global reserve:              512.00MiB	(used: 512.00MiB) <<<<<<<
>[...]
> > > Ok, now on 5.9.13
> > >  
> > > > Increasing the global reserve may seem to help, but so will just rebooting
> > > > over and over, so a positive result from an experimental kernel does not
> > > > necessarily mean anything.
> > > 
> > > At least this reduces the number of times I need to reboot ;)
> > 
> > Does it?
> > 
> > You switched from SZ_512M on 5.6 to SZ_4G on 5.9.  5.9 has a proper fix
> > for the bug you most likely encountered, in which case SZ_4G would not
> > have any effect other than to use more memory and increase commit latency.
> > 
> > Transactions should be throttled at the reserved size, regardless of what
> > the reserved size is.  i.e. if you make the reserved size bigger, then
> > big deletes will just run longer and use more memory between commits.
> > Deletions are pretty huge metadata consumers--deleting as little as
> > 0.3% of the filesystem can force a rewrite of all metadata in the worst
> > case--so you have probably been hitting the 512M limit over and over
> > without issue before now.  Transactions are supposed to grow to reach
> > the reserved limit and then pause to commit, so a big delete operation
> > will span many transactions.  The limit is there to keep kernel RAM
> > usage and transaction latency down.
> 
> Interesting.  In that case, perhaps there are low-latency scenarios for 
> which the 512MB limit should be tuned down (not up like we tried).
> 
> I'm guessing most transaction management is asynchronous to userspace, but 
> under what circumstances (if any) might a transaction commit block 
> userspace IO?

Transaction commit is not asynchronous.  Only one can run at a time--the
singular transaction is used to lock out certain concurrent operations.
It normally has a short critical section, and some write operations are
allowed to proceed across transaction boundaries.

Since 5.0, processes can add more work for btrfs-transaction to do while
it's trying to commit a transaction, limited only by available disk space
(previously there was a feedback loop that measured how fast the disks
were disposing of the data and throttled writing processes so the disks
could keep up).  This tops up the queued work for the transaction to
write to the disk, so the transaction never completes.  This leads to
several "stop the world" scenarios where one process can overwhelm btrfs
with write operations and lock out all other writers for as long as it
takes for the filesystem to run out of space (when the filesystem runs
out of space, the transaction must end, either with a commit or by
forcing the filesystem read-only).

There have been some patches to address some special cases of these,
and more are in for-next and on the mailing list.  But there are still
plenty of bad cases for current kernels.

> > If you ran two boots at SZ_512M followed by one boot at SZ_4G, and it
> > works, then that test only tells you that you needed to do 3 boots in
> > total to recover the filesystem from the initial bad state.  It doesn't
> > indicate SZ_4G is solving any problem.  To confirm that hypothesis,
> > you'd need to rewind the filesystem back to the initial state, run
> > a series of reboots on kernel A and a series of reboots on kernel B,
> > correct for timing artifacts (at some point there will be a scheduled
> > transaction commit that will appear at a random point in the IO sequence,
> > so different test runs under identical initial conditions will not produce
> > identical results), and show both 1) a consistent improvement in reboot
> > count between kernel A and kernel B, and 2) the only difference between
> > kernel A and kernel B is reserved size SZ_512M vs SZ_4G.
> 
> I understand; it is likely that 5.9 with 4G didn't do anything, since the 
> 5.8 fix is in place.  Good to know that we don't have a reason to maintain 
> the 4G hack.
> 
> Great writeup.  Your entire email response should go on the wiki!  I've 
> searched high and low for disk space best practices in BTRFS and I think you 
> covered it all.
> 
> The system is in great shape now, check it out:
> 
> Overall:
>     Device size:                  30.00TiB
>     Device allocated:             29.88TiB
>     Device unallocated:          123.97GiB
>     Device missing:                  0.00B
>     Used:                         23.45TiB
>     Free (estimated):              6.47TiB	(min: 6.41TiB)
>     Data ratio:                       1.00
>     Metadata ratio:                   2.00
>     Global reserve:                4.00GiB	(used: 0.00B)
> 
> Data,single: Size:29.68TiB, Used:23.33TiB (78.59%)
>    /dev/mapper/btrbackup-luks     29.68TiB
> 
> Metadata,DUP: Size:101.00GiB, Used:63.49GiB (62.86%)
>    /dev/mapper/btrbackup-luks    202.00GiB
> 
> System,DUP: Size:8.00MiB, Used:3.42MiB (42.77%)
>    /dev/mapper/btrbackup-luks     16.00MiB
> 
> Unallocated:
>    /dev/mapper/btrbackup-luks    123.97GiB

Yep, that should be good to go for a while.

> 
> --
> Eric Wheeler

