* ENOSPC while df shows 826.93GiB free
@ 2021-12-07  2:29 Christoph Anton Mitterer
  2021-12-07  2:59 ` Qu Wenruo
  2021-12-07 15:39 ` Phillip Susi
  0 siblings, 2 replies; 20+ messages in thread
From: Christoph Anton Mitterer @ 2021-12-07  2:29 UTC (permalink / raw)
  To: linux-btrfs

Hey.

At the university I'm running a Tier-2 site for the Large Hadron
Collider, with a total storage of about 4 PB.

For a bit more than half of that I use btrfs, with HDDs combined into
hardware RAID volumes that are provided as 16 TiB devices (on which the
btrfs filesystems sit).

It runs Debian bullseye, which has kernel 5.10.70. Oh, and I've used
-R free-space-tree.
I don't use snapshots on these filesystems.


On one of the filesystems I've now run into ENOSPC.

# btrfs filesystem usage /srv/dcache/pools/2
Overall:
    Device size:		  16.00TiB
    Device allocated:		  16.00TiB
    Device unallocated:		   1.00MiB
    Device missing:		     0.00B
    Used:			  15.19TiB
    Free (estimated):		 826.93GiB	(min: 826.93GiB)
    Free (statfs, df):		 826.93GiB
    Data ratio:			      1.00
    Metadata ratio:		      2.00
    Global reserve:		 512.00MiB	(used: 0.00B)
    Multiple profiles:		        no

Data,single: Size:15.97TiB, Used:15.16TiB (94.94%)
   /dev/sdf	  15.97TiB

Metadata,DUP: Size:17.01GiB, Used:16.51GiB (97.06%)
   /dev/sdf	  34.01GiB

System,DUP: Size:8.00MiB, Used:2.12MiB (26.56%)
   /dev/sdf	  16.00MiB

Unallocated:
   /dev/sdf	   1.00MiB


yet:
# /srv/dcache/pools/2/foo
-bash: /srv/dcache/pools/2/foo: No such file or directory


balancing also fails, e.g.:
# btrfs balance start -dusage=50 /srv/dcache/pools/2
ERROR: error during balancing '/srv/dcache/pools/2': No space left on device
There may be more info in syslog - try dmesg | tail
# btrfs balance start -dusage=40 /srv/dcache/pools/2
Done, had to relocate 0 out of 16370 chunks
# btrfs balance start  /srv/dcache/pools/2
WARNING:

	Full balance without filters requested. This operation is very
	intense and takes potentially very long. It is recommended to
	use the balance filters to narrow down the scope of balance.
	Use 'btrfs balance start --full-balance' option to skip this
	warning. The operation will start in 10 seconds.
	Use Ctrl-C to stop it.
10 9 8 7 6 5 4 3 2 1
Starting balance without any filters.
ERROR: error during balancing '/srv/dcache/pools/2': No space left on device
There may be more info in syslog - try dmesg | tail
# btrfs balance start -dusage=0 /srv/dcache/pools/2
Done, had to relocate 0 out of 16370 chunks




fsck showed no errors.



Any ideas what's going on and how to recover?


Thanks,
Chris.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07  2:29 ENOSPC while df shows 826.93GiB free Christoph Anton Mitterer
@ 2021-12-07  2:59 ` Qu Wenruo
  2021-12-07  3:06   ` Christoph Anton Mitterer
  2021-12-07 15:39 ` Phillip Susi
  1 sibling, 1 reply; 20+ messages in thread
From: Qu Wenruo @ 2021-12-07  2:59 UTC (permalink / raw)
  To: Christoph Anton Mitterer, linux-btrfs



On 2021/12/7 10:29, Christoph Anton Mitterer wrote:
> Hey.
>
> At the university I'm running a Tier-2 site for the large hadron
> collider, with some total storage of 4 PB.
>
> For a bit more than half of that I use btrfs, with HDDs combined to
> some hardware raid, provided as 16TiB devices (on which the btrfs
> sits).
>
> It runs Debian bullseye, which has 5.10.70. Oh and I've used -R free-
> space-tree.
> I don't use snapshots on these filesystems.
>
>
> On one of the filesystems I've ran now into ENOSPC.
>
> # btrfs filesystem usage /srv/dcache/pools/2
> Overall:
>      Device size:		  16.00TiB
>      Device allocated:		  16.00TiB
>      Device unallocated:		   1.00MiB

All device space is allocated already.

>      Device missing:		     0.00B
>      Used:			  15.19TiB
>      Free (estimated):		 826.93GiB	(min: 826.93GiB)
>      Free (statfs, df):		 826.93GiB
>      Data ratio:			      1.00
>      Metadata ratio:		      2.00
>      Global reserve:		 512.00MiB	(used: 0.00B)
>      Multiple profiles:		        no
>
> Data,single: Size:15.97TiB, Used:15.16TiB (94.94%)
>     /dev/sdf	  15.97TiB
>
> Metadata,DUP: Size:17.01GiB, Used:16.51GiB (97.06%)

Your metadata is full. Although there is some free space (512M), that
is mostly taken by the global reserve, which is kept for very critical
operations.

Thus your metadata is effectively full.

>     /dev/sdf	  34.01GiB
>
> System,DUP: Size:8.00MiB, Used:2.12MiB (26.56%)
>     /dev/sdf	  16.00MiB
>
> Unallocated:
>     /dev/sdf	   1.00MiB
>
>
> yet:
> # /srv/dcache/pools/2/foo
> -bash: /srv/dcache/pools/2/foo: No such file or directory
>
>
> balancing also fails, e.g.:
> # btrfs balance start -dusage=50 /srv/dcache/pools/2

Since your metadata is full, btrfs can't reserve enough metadata space
to relocate a data chunk.

> ERROR: error during balancing '/srv/dcache/pools/2': No space left on device
> There may be more info in syslog - try dmesg | tail
> # btrfs balance start -dusage=40 /srv/dcache/pools/2
> Done, had to relocate 0 out of 16370 chunks
> # btrfs balance start  /srv/dcache/pools/2
> WARNING:
>
> 	Full balance without filters requested. This operation is very
> 	intense and takes potentially very long. It is recommended to
> 	use the balance filters to narrow down the scope of balance.
> 	Use 'btrfs balance start --full-balance' option to skip this
> 	warning. The operation will start in 10 seconds.
> 	Use Ctrl-C to stop it.
> 10 9 8 7 6 5 4 3 2 1
> Starting balance without any filters.
> ERROR: error during balancing '/srv/dcache/pools/2': No space left on device
> There may be more info in syslog - try dmesg | tail
> # btrfs balance start -dusage=0 /srv/dcache/pools/2
> Done, had to relocate 0 out of 16370 chunks
>
>
>
>
> fsck showed no errors.
>
>
>
> Any ideas what's going on and how to recover?

Since your metadata is already full, you may need to delete enough data
to free up sufficient metadata space.

Good candidates include small files (mostly inlined files) and large
files with many checksums.
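
(Rough numbers: with the default crc32c checksums, btrfs stores 4 bytes
of csum per 4KiB data block, i.e. roughly 1GiB of csum items per TiB of
data. For the ~15.2TiB of data here that alone is ~15GiB, which already
accounts for most of the 16.51GiB of metadata in use.)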

Thanks,
Qu

>
>
> Thanks,
> Chris.
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07  2:59 ` Qu Wenruo
@ 2021-12-07  3:06   ` Christoph Anton Mitterer
  2021-12-07  3:29     ` Qu Wenruo
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Anton Mitterer @ 2021-12-07  3:06 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On Tue, 2021-12-07 at 10:59 +0800, Qu Wenruo wrote:
> 
> Since your metadata is already full, you may need to delete enough
> data
> to free up enough metadata space.
> 
> The candidates includes small files (mostly inlined files), and large
> files with checksums.

On that fs, there are rather many large files (800 MB - 1.5 GB).

Is there any way to get (much?) more space reserved for metadata in the
future, respectively on the other existing filesystems that haven't
deadlocked themselves yet?!


Thanks,
Chris.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07  3:06   ` Christoph Anton Mitterer
@ 2021-12-07  3:29     ` Qu Wenruo
  2021-12-07  3:44       ` Christoph Anton Mitterer
  0 siblings, 1 reply; 20+ messages in thread
From: Qu Wenruo @ 2021-12-07  3:29 UTC (permalink / raw)
  To: Christoph Anton Mitterer, linux-btrfs



On 2021/12/7 11:06, Christoph Anton Mitterer wrote:
> On Tue, 2021-12-07 at 10:59 +0800, Qu Wenruo wrote:
>>
>> Since your metadata is already full, you may need to delete enough
>> data
>> to free up enough metadata space.
>>
>> The candidates includes small files (mostly inlined files), and large
>> files with checksums.
>
> On that fs, there are rather many large files (800MB - 1.5 GB).
>
> Is there anyway to get (much?) more space reserved for metadata in the
> future respectively on the other existing filesystems that haven't
> deadlocked themselves yet?!

In fact, this is not really a deadlock; only balance is blocked by this
problem.

For other regular operations, you either get ENOSPC just like on any
other fs that has run out of space, or they complete without problem.

Furthermore, balance in this case is not really the preferred way to
free up space; actually deleting data is the correct way to go.

Thanks,
Qu

>
>
> Thanks,
> Chris.
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07  3:29     ` Qu Wenruo
@ 2021-12-07  3:44       ` Christoph Anton Mitterer
  2021-12-07  4:56         ` Qu Wenruo
  2021-12-07  7:21         ` Zygo Blaxell
  0 siblings, 2 replies; 20+ messages in thread
From: Christoph Anton Mitterer @ 2021-12-07  3:44 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On Tue, 2021-12-07 at 11:29 +0800, Qu Wenruo wrote:
> For other regular operations, you either got ENOSPC just like all
> other
> fses which runs out of space, or do it without problem.
> 
> Furthermore, balance in this case is not really the preferred way to
> free up space, really freeing up data is the correct way to go.

Well, but to be honest... that makes btrfs kinda broken for that
particular purpose.


The software which runs on the storage and provides the data to the
experiments does in fact make sure that the space isn't fully used (by
default, it leaves a gap of 4 GB).

While this gap is configurable, it seems a bit odd if one had to set it
to ~1 TB per fs... just to make sure that btrfs doesn't run out of
space for metadata.


And btrfs *does* show that plenty of space is left (always around 700-
800 GB)... so the application thinks it can happily continue to write,
while in fact it fails (and it cannot even start anymore, as it fails
to create lock files).


My understanding was that when not using --mixed, btrfs has separate
block groups for data and metadata.

And it seems here that the data block groups still have several hundred
GB free, while - AFAIU you - the metadata block groups are already full.



I also wouldn't want to balance regularly (which doesn't really seem to
help that much so far anyway)... because it puts quite some IO load on
the systems.


So if csum data needs so much space... why can't it simply reserve e.g.
60 GB for metadata instead of just 17 GB?



If I really had to reserve ~1 TB of storage to be left unused (per
16 TB fs) just to get that working... I would need to move stuff back
to ext4, because that's such a big loss we couldn't justify it to our
funding agencies.


And we haven't had that issue with e.g. ext4... that seems to reserve
just enough for metadata, so that we could basically fill up the fs
close to the end.



Cheers,
Chris.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07  3:44       ` Christoph Anton Mitterer
@ 2021-12-07  4:56         ` Qu Wenruo
  2021-12-07 14:30           ` Christoph Anton Mitterer
  2021-12-07  7:21         ` Zygo Blaxell
  1 sibling, 1 reply; 20+ messages in thread
From: Qu Wenruo @ 2021-12-07  4:56 UTC (permalink / raw)
  To: Christoph Anton Mitterer, Qu Wenruo, linux-btrfs



On 2021/12/7 11:44, Christoph Anton Mitterer wrote:
> On Tue, 2021-12-07 at 11:29 +0800, Qu Wenruo wrote:
>> For other regular operations, you either got ENOSPC just like all
>> other
>> fses which runs out of space, or do it without problem.
>>
>> Furthermore, balance in this case is not really the preferred way to
>> free up space, really freeing up data is the correct way to go.
>
> Well but to be honest... that makes btrfs kinda broke for that
> particular purpose.
>
>
> The software which runs on the storage and provides the data to the
> experiments does in fact make sure that the space isn't fully used (per
> default, it leave a gap of 4GB).
>
> While this gap is configurable it seems a bit odd if one would have to
> set it to ~1TB per fs... just to make sure that btrfs doesn't run out
> of space for metadata.
>
>
> And btrfs *does* show that plenty of space is left (always around 700-
> 800 GB)... so the application thinks it can happily continue to write,
> while in fact it fails (and the cannot even start anymore as it fails
> to create lock files).

That's the problem with dynamic chunk allocation, and to be honest, I
don't have any better idea how to make it work just like traditional
fses.

You could consider it to be something like a thin-provisioned device,
which has the same problem (it reports tons of free space, but will
hang if the underlying space is used up).

>
>
> My understanding was the when not using --mixed, btrfs has block groups
> for data and metadata.
>
> And it seems here that the data block groups have several 100 GB still
> free, while - AFAIU you - the metadata block groups are already full.
>
>
>
> I also wouldn't want to regularly balance (which doesn't really seem to
> help that much so far)... cause it puts quite some IO load on the
> systems.
>
>
> So if csum data needs so much space... why can't it simply reserve e.g.
> 60 GB for metadata instead of just 17 GB?

Because all chunks are allocated on demand, if 1) your workload has a
very unbalanced data/metadata usage, like in this case (almost 1000:1),
and 2) you run out of unallocated space, then you will hit this
particular problem.

>
>
>
> If I really had to reserve ~ 1TB of storage to be unused (per 16TB fs)
> just to get that working... I would need to move stuff back to ext4,
> cause that's such a big loss we couldn't justify to our funding
> agencies.

It won't matter whether you reserve 1T for the data or not.

You can still hit the same problem even if there are tons of unused
data space: fragmented data space can still cause it.

>
>
> And we haven't had that issue with e.g. ext4 ... that seems to reserve
> just enough for meta, so that we could basically fill up the fs close
> to the end.

Ext4/XFS have a similar problem that is much harder to hit: inode
limits.

They use pre-determined inode limits (fixed at mkfs time), thus you can
run out of inodes before the free space is used up.

Tools like "df" have ways to report such limits, but unfortunately for
btrfs there is no such way other than using the btrfs-specific tools.
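
For example (the mount point is just a placeholder):

# df -i /mnt/pool                   <- ext4/XFS inode usage and limit
# btrfs filesystem usage /mnt/pool  <- btrfs chunk allocation instead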

Thanks,
Qu

>
>
>
> Cheers,
> Chris.
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07  3:44       ` Christoph Anton Mitterer
  2021-12-07  4:56         ` Qu Wenruo
@ 2021-12-07  7:21         ` Zygo Blaxell
  2021-12-07 12:31           ` Jorge Bastos
                             ` (2 more replies)
  1 sibling, 3 replies; 20+ messages in thread
From: Zygo Blaxell @ 2021-12-07  7:21 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: Qu Wenruo, linux-btrfs

On Tue, Dec 07, 2021 at 04:44:13AM +0100, Christoph Anton Mitterer wrote:
> On Tue, 2021-12-07 at 11:29 +0800, Qu Wenruo wrote:
> > For other regular operations, you either got ENOSPC just like all
> > other
> > fses which runs out of space, or do it without problem.
> > 
> > Furthermore, balance in this case is not really the preferred way to
> > free up space, really freeing up data is the correct way to go.
> 
> Well but to be honest... that makes btrfs kinda broke for that
> particular purpose.
> 
> 
> The software which runs on the storage and provides the data to the
> experiments does in fact make sure that the space isn't fully used (per
> default, it leave a gap of 4GB).
> 
> While this gap is configurable it seems a bit odd if one would have to
> set it to ~1TB per fs... just to make sure that btrfs doesn't run out
> of space for metadata.
> 
> 
> And btrfs *does* show that plenty of space is left (always around 700-
> 800 GB)... so the application thinks it can happily continue to write,
> while in fact it fails (and the cannot even start anymore as it fails
> to create lock files).
> 
> 
> My understanding was the when not using --mixed, btrfs has block groups
> for data and metadata.
> 
> And it seems here that the data block groups have several 100 GB still
> free, while - AFAIU you - the metadata block groups are already full.
> 
> 
> 
> I also wouldn't want to regularly balance (which doesn't really seem to
> help that much so far)... cause it puts quite some IO load on the
> systems.

If you minimally balance data (so that you keep 2GB unallocated at all
times) then it works much better: you can allocate the last metadata
chunk that you need to expand, and it requires only a few minutes of IO
per day.  After a while you don't need to do this any more, as a large
buffer of allocated but unused metadata will form.

If you need a drastic intervention, you can mount with metadata_ratio=1
for a short(!) time to allocate a lot of extra metadata block groups.
Combine with a data block group balance for a few blocks (e.g. -dlimit=9).

You need about (3 + number_of_disks) GB of allocated but unused metadata
block groups to handle the worst case (balance, scrub, and discard all
active at the same time, plus the required free metadata space).  Also
leave room for existing metadata to expand by about 50%, especially if
you have snapshots.
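
(For the 16TiB single-device filesystem above that would be, very
roughly: (3 + 1)GB = 4GiB of headroom, plus ~50% of the existing ~17GiB
of metadata = another ~8-9GiB, i.e. something on the order of 29-30GiB
of total metadata chunks rather than the current 17GiB.)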

Never balance metadata.  Balancing metadata will erase existing metadata
allocations, leading directly to this situation.

Free space search time goes up as the filesystem fills up.  The last 1%
of the filesystem will fill up significantly more slowly than the other
99%.  You might need to reserve 3% of the filesystem to keep latencies
down (ironically about the same amount that ext4 reserves).

There are some patches floating around to address these issues.

> So if csum data needs so much space... why can't it simply reserve e.g.
> 60 GB for metadata instead of just 17 GB?

It normally does.  Are you:

	- running metadata balances?  (Stop immediately.)

	- preallocating large files?  Checksums are allocated later, and
	naive usage of prealloc burns metadata space due to fragmentation.

	- modifying snapshots?	Metadata size increases with each
	modified snapshot.

	- replacing large files with a lot of very small ones?	Files
	below 2K are stored in metadata.  max_inline=0 disables this.

> If I really had to reserve ~ 1TB of storage to be unused (per 16TB fs)
> just to get that working... I would need to move stuff back to ext4,
> cause that's such a big loss we couldn't justify to our funding
> agencies.
> 
> 
> And we haven't had that issue with e.g. ext4 ... that seems to reserve
> just enough for meta, so that we could basically fill up the fs close
> to the end.
> 
> 
> 
> Cheers,
> Chris.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07  7:21         ` Zygo Blaxell
@ 2021-12-07 12:31           ` Jorge Bastos
  2021-12-07 15:07           ` Christoph Anton Mitterer
  2021-12-07 15:10           ` Jorge Bastos
  2 siblings, 0 replies; 20+ messages in thread
From: Jorge Bastos @ 2021-12-07 12:31 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Christoph Anton Mitterer, Qu Wenruo, Btrfs BTRFS

This looks to me like this issue I reported before:

https://lore.kernel.org/linux-btrfs/CAHzMYBSap30NbnPnv4ka+fDA2nYGHfjYvD-NgT04t4vvN4q2sw@mail.gmail.com/

Data,single: Size:15.97TiB, Used:15.16TiB (94.94%)

When this happens to me I can see that the data usage ratio is lower
than normal (there are mostly large files), and you can balance as much
as you like: the data ratio stays unchanged, and the unallocated space
gets to zero much sooner because of that. Most times there's no issue
and the data usage ratio is much higher, e.g. this filesystem could be
filled up until less than 4GB was available:

Data,RAID0: Size:10.89TiB, Used:10.89TiB (99.97%)

This one could only be filled up to about 300GB available:

Data,RAID0: Size:10.89TiB, Used:10.59TiB (97.26%)

Both contain only large 100GiB size files, both file systems were
filled from new in exactly the same way, one file at a time, no
snapshots, no modifications after the initial data copy.

Regards,
Jorge Bastos

On Tue, Dec 7, 2021 at 9:45 AM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Tue, Dec 07, 2021 at 04:44:13AM +0100, Christoph Anton Mitterer wrote:
> > On Tue, 2021-12-07 at 11:29 +0800, Qu Wenruo wrote:
> > > For other regular operations, you either got ENOSPC just like all
> > > other
> > > fses which runs out of space, or do it without problem.
> > >
> > > Furthermore, balance in this case is not really the preferred way to
> > > free up space, really freeing up data is the correct way to go.
> >
> > Well but to be honest... that makes btrfs kinda broke for that
> > particular purpose.
> >
> >
> > The software which runs on the storage and provides the data to the
> > experiments does in fact make sure that the space isn't fully used (per
> > default, it leave a gap of 4GB).
> >
> > While this gap is configurable it seems a bit odd if one would have to
> > set it to ~1TB per fs... just to make sure that btrfs doesn't run out
> > of space for metadata.
> >
> >
> > And btrfs *does* show that plenty of space is left (always around 700-
> > 800 GB)... so the application thinks it can happily continue to write,
> > while in fact it fails (and the cannot even start anymore as it fails
> > to create lock files).
> >
> >
> > My understanding was the when not using --mixed, btrfs has block groups
> > for data and metadata.
> >
> > And it seems here that the data block groups have several 100 GB still
> > free, while - AFAIU you - the metadata block groups are already full.
> >
> >
> >
> > I also wouldn't want to regularly balance (which doesn't really seem to
> > help that much so far)... cause it puts quite some IO load on the
> > systems.
>
> If you minimally balance data (so that you keep 2GB unallocated at all
> times) then it works much better: you can allocate the last metadata
> chunk that you need to expand, and it requires only a few minutes of IO
> per day.  After a while you don't need to do this any more, as a large
> buffer of allocated but unused metadata will form.
>
> If you need a drastic intervention, you can mount with metadata_ratio=1
> for a short(!) time to allocate a lot of extra metadata block groups.
> Combine with a data block group balance for a few blocks (e.g. -dlimit=9).
>
> You need about (3 + number_of_disks) GB of allocated but unused metadata
> block groups to handle the worst case (balance, scrub, and discard all
> active at the same time, plus the required free metadata space).  Also
> leave room for existing metadata to expand by about 50%, especially if
> you have snapshots.
>
> Never balance metadata.  Balancing metadata will erase existing metadata
> allocations, leading directly to this situation.
>
> Free space search time goes up as the filesystem fills up.  The last 1%
> of the filesystem will fill up significantly slower than the other 99%,
> You might need to reserve 3% of the filesystem to keep latencies down
> (ironically about the same amount that ext4 reserves).
>
> There are some patches floating around to address these issues.
>
> > So if csum data needs so much space... why can't it simply reserve e.g.
> > 60 GB for metadata instead of just 17 GB?
>
> It normally does.  Are you:
>
>         - running metadata balances?  (Stop immediately.)
>
>         - preallocating large files?  Checksums are allocated later, and
>         naive usage of prealloc burns metadata space due to fragmentation.
>
>         - modifying snapshots?  Metadata size increases with each
>         modified snapshot.
>
>         - replacing large files with a lot of very small ones?  Files
>         below 2K are stored in metadata.  max_inline=0 disables this.
>
> > If I really had to reserve ~ 1TB of storage to be unused (per 16TB fs)
> > just to get that working... I would need to move stuff back to ext4,
> > cause that's such a big loss we couldn't justify to our funding
> > agencies.
> >
> >
> > And we haven't had that issue with e.g. ext4 ... that seems to reserve
> > just enough for meta, so that we could basically fill up the fs close
> > to the end.
> >
> >
> >
> > Cheers,
> > Chris.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07  4:56         ` Qu Wenruo
@ 2021-12-07 14:30           ` Christoph Anton Mitterer
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Anton Mitterer @ 2021-12-07 14:30 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On Tue, 2021-12-07 at 12:56 +0800, Qu Wenruo wrote:
> That's the problem with dynamic chunk allocation, and to be honest, I
> don't have any better idea how to make it work just like traditional
> fses.
> 
> You could consider it as something like thin-provisioned device,
> which
> would have the same problem (reporting tons of free space, but will
> hang
> if underlying space is used up).

Well, the first thing I don't understand is that my scenario seems
pretty... simple.

These filesystems have only few files (some 30k to perhaps 200k).
That seems far simpler than e.g. the fs of the system itself, where one
can have many files of completely varying size in /usr, /home, and so
on.

Also, these files (apart from some small metadata files) are *always*
written once and then only read (or deleted).
There is never any random write access... so fragmentation should be
far less than on "normal" systems.

The total size of the fs is obviously known.
You said now that the likely cause is the csum data... but isn't it
then kinda clear from the beginning how much you'd need (at most) if
the filesystem were filled up with data?


Just for my understanding:
How is csum data stored?
Is it one sum per fixed block size of data? Or one sum per (not fixed)
extent size of data?

In both cases I'd have assumed that the maximum space needed for that
is kinda predictable?
Unlike e.g. on a thin-provisioned device, or when using many (rw)
snapshots... where one cannot really predict how much storage will be
needed, because data diverges from the shared copy.



> Because all chunks are allocated on demand, if 1) your workload has
> every unbalanced data/metadata usage, like this case (almost 1000:1).
> 2) You run out of space, then you will hit this particular problem.

I've described the typical workload above:
rather large files (the data sets from the experiments), written once,
never any further writes to them, only deletions.

I'd have expected that this causes *far* less fragmentation than e.g.
filesystems that contain /home or so, where one has many random writes.


> It won't matter if you reserve 1T or not for the data.
> 
> It can still go the same problem even if there are tons of unused
> data
> space.
> Fragmented data space can still cause the same problem.

Let me try to understand this better:

btrfs allocates data block groups and meta-data block groups (both
dynamically), right?

Are these always of the same size (like e.g. always 1G)?

When I now write a 500 MB file... it would e.g. fill one such data
block group with 500 MB (and write some data into a metadata block
group).
And when I next write a 2 GB file... it would write the first 500 MB to
the already allocated data block group... and then allocate more to
write the remaining data.

Does that sound kinda right so far (simplified of course)?

The problem I had now was that the fs filled up more and more and (due
to fragmentation)... all free space is in data block groups... but
since no unallocated storage is left, it could not allocate more
metadata block groups.
So from the data PoV it could still write (i.e. there is free space),
because the fragmented data block groups still have some ~800 GiB
free... but it cannot write any more metadata.

Still kinda right?


So my naive assumption(s) would have been:
1) It's a sign that btrfs doesn't allocate metadata block groups
aggressively enough.

2) If I cure the fragmentation (in the data block groups)... and btrfs
could give those back... there would again be some unallocated space,
which it could use for metadata block groups... and so I could use more
of the remaining 800 GB, right?

Would balance already do this? I guess not, because AFAIU balance just
re-writes block groups as they are, right?
So that's the reason why balancing didn't help in any way?

So the proper way would be btrfs filesystem defragment... to thus
reclaim some unallocated space and get that for the metadata.
Right?


But still... that all seems like quite a lot of manual work (and thus
doesn't scale for a large data centre):
Would the defragmentation even work if the metadata is already out of
space?


Why would it not help if btrfs (pre-)reserved more metadata block
groups?
So maybe of the ~800 GB that are now still free (within data block
groups)... one would use e.g. 100 GB for metadata...
Of these 100 GB... 50 GB might never be used... but overall I could
still use ~700 GB in data block groups - whereas now both are
effectively lost (the full ~800 GB).




Are there any manual ways to say, in e.g. our use case:
don't just allocate 17 GB per fs for metadata... but allocate 80 GB
right away...

And wouldn't that cure our problem... by simply helping to (likely)
never reach the out-of-metadata-space situation?


Thanks,
Chris.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07  7:21         ` Zygo Blaxell
  2021-12-07 12:31           ` Jorge Bastos
@ 2021-12-07 15:07           ` Christoph Anton Mitterer
  2021-12-07 18:14             ` Zygo Blaxell
  2021-12-07 15:10           ` Jorge Bastos
  2 siblings, 1 reply; 20+ messages in thread
From: Christoph Anton Mitterer @ 2021-12-07 15:07 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Qu Wenruo, linux-btrfs

On Tue, 2021-12-07 at 02:21 -0500, Zygo Blaxell wrote:
> If you minimally balance data (so that you keep 2GB unallocated at
> all
> times) then it works much better: you can allocate the last metadata
> chunk that you need to expand, and it requires only a few minutes of
> IO
> per day.  After a while you don't need to do this any more, as a
> large
> buffer of allocated but unused metadata will form.

Hm, I've already asked Qu in the other mail just before whether/why
balancing would help there at all.

Doesn't it just re-write the block groups (but not defragment them)...
would that (and why) help to gain back unallocated space (which could
then be allocated for metadata)?

And what exactly do you mean by "minimally"? I mean, of course I can
use -dusage=20 or so... is it that?


But I guess all that wouldn't help now, when the unallocated space is
already used up, right?



> If you need a drastic intervention, you can mount with
> metadata_ratio=1
> for a short(!) time to allocate a lot of extra metadata block groups.
> Combine with a data block group balance for a few blocks (e.g. -
> dlimit=9).

All that seems rather impractical to do, to be honest. At least for a
non-expert admin.

First, these systems are production systems... so one doesn't want to
unmount (and do this procedure) when one sees that unallocated space is
running out.
One would rather want some way so that if one sees: unallocated space
gets low -> allocate this much more for metadata.

I guess there are no real/official tools out there for such
surveillance? Like Nagios/Icinga checks that look at the unallocated
space?



> You need about (3 + number_of_disks) GB of allocated but unused
> metadata
> block groups to handle the worst case (balance, scrub, and discard
> all
> active at the same time, plus the required free metadata space). 
> Also
> leave room for existing metadata to expand by about 50%, especially
> if
> you have snapshots.



> Never balance metadata.  Balancing metadata will erase existing
> metadata
> allocations, leading directly to this situation.

Wouldn't that only unallocate such allocations that are completely
empty?

> > So if csum data needs so much space... why can't it simply reserve
> > e.g. 60 GB for metadata instead of just 17 GB?
> 
> It normally does.  Are you:
> 
>         - running metadata balances?  (Stop immediately.)

Nope, I did that once accidentally (-musage=0 ... copy&pasted the wrong
one), but only *after* the filesystem got stuck...


>         - preallocating large files?  Checksums are allocated later,
> and
>         naive usage of prealloc burns metadata space due to
> fragmentation.

Hmm... not so sure about that... (I mean, I don't know what the storage
middleware, which is www.dcache.org, does)... but it would probably do
this only for one to a few such large files at once, if at all.


>         - modifying snapshots?  Metadata size increases with each
>         modified snapshot.

No snapshots are used at all on these filesystems.


>         - replacing large files with a lot of very small ones?  Files
>         below 2K are stored in metadata.  max_inline=0 disables this.

I guess you mean here:
First many large files were written... unallocated space is used up
(with data and metadata block groups).
Then large files are deleted... data block groups get fragmented (but
are not unallocated again, because they're not empty).

Then loads of small files would be written (inline)... which then
fails, as metadata space would fill up even faster, right?


Well, we do have filesystems where there may be *many* small files...
but I guess still all around the range of 1 MB or more. I don't think
we have lots of files below 2K... if at all.


So I don't think that we have this IO pattern.

It rather seems simply as if btrfs doesn't reserve metadata
aggressively enough (at least not in our case)... and that too much is
allocated for data... and when that is actually filled, it cannot
allocate enough for metadata anymore.



Thanks,
Chris.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07  7:21         ` Zygo Blaxell
  2021-12-07 12:31           ` Jorge Bastos
  2021-12-07 15:07           ` Christoph Anton Mitterer
@ 2021-12-07 15:10           ` Jorge Bastos
  2021-12-07 15:22             ` Christoph Anton Mitterer
  2 siblings, 1 reply; 20+ messages in thread
From: Jorge Bastos @ 2021-12-07 15:10 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Christoph Anton Mitterer, Qu Wenruo, Btrfs BTRFS

Hi,

Disregard my last email, it is the same issue of metadata exhaustion; I
just didn't understand why two identical filesystems used in the same
way on the same server behaved so differently. If I convert metadata
from raid1 to single it leaves some extra metadata chunks, and I was
then able to fill up the data chunks. Of course, now I can't balance
metadata back to raid1 again. I wish there were an easy way to allocate
an extra metadata chunk while there's still available space.

Regards,
Jorge Bastos

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07 15:10           ` Jorge Bastos
@ 2021-12-07 15:22             ` Christoph Anton Mitterer
  2021-12-07 16:11               ` Jorge Bastos
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Anton Mitterer @ 2021-12-07 15:22 UTC (permalink / raw)
  To: Jorge Bastos, Zygo Blaxell; +Cc: Qu Wenruo, Btrfs BTRFS

Hey Jorge.

I've looked at your old mail thread... and the first case you showed:
btrfs fi usage /mnt/disk4
Overall:
    Device size:                   7.28TiB
    Device allocated:              7.28TiB
    Device unallocated:            1.04MiB
    Device missing:                  0.00B
    Used:                          7.24TiB
    Free (estimated):             34.55GiB      (min: 34.55GiB)
    Free (statfs, df):            34.55GiB
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)
    Multiple profiles:                  no

Data,single: Size:7.26TiB, Used:7.22TiB (99.54%)
   /dev/md4        7.26TiB

Metadata,DUP: Size:9.50GiB, Used:8.45GiB (88.93%)
   /dev/md4       19.00GiB

System,DUP: Size:32.00MiB, Used:800.00KiB (2.44%)
   /dev/md4       64.00MiB

Unallocated:
   /dev/md4        1.04MiB


Seems similar to my problem... but far less extreme... so that I
personally would say I could live with that.

Of data you "lose" 0.04 TiB... so ~40 GiB... and of metadata you "lose"
1.45 GiB.


It's a bit strange IMO that you then get ENOSPC when your metadata
still has 1 GB free (I thought it would reserve less?).

But still, out of 7.28 TiB that's ~0.556%... not sooo much.


I, in contrast, have:
829.44 + 0.5 = 829.94 GiB "lost".
Which, out of 16.00 TiB, is some ~5.066% lost... which seems like
rather a lot.


Cheers,
Chris.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07  2:29 ENOSPC while df shows 826.93GiB free Christoph Anton Mitterer
  2021-12-07  2:59 ` Qu Wenruo
@ 2021-12-07 15:39 ` Phillip Susi
  2021-12-16  3:47   ` Christoph Anton Mitterer
  1 sibling, 1 reply; 20+ messages in thread
From: Phillip Susi @ 2021-12-07 15:39 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: linux-btrfs


Christoph Anton Mitterer <calestyo@scientia.org> writes:

> yet:
> # /srv/dcache/pools/2/foo
> -bash: /srv/dcache/pools/2/foo: No such file or directory

I'm not sure what you intended this to do or show.  It looks like you
tried to execute a program named /srv/dcache/pools/2/foo, and there is
no such program.  That doesn't say anything about the filesystem.

> balancing also fails, e.g.:
> # btrfs balance start -dusage=50 /srv/dcache/pools/2
> ERROR: error during balancing '/srv/dcache/pools/2': No space left on device

Balance is basically like a defrag.  You have less than 0.01% space
free, which is not enough to do a defrag.  Either free up some space, or
don't bother trying to defrag.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07 15:22             ` Christoph Anton Mitterer
@ 2021-12-07 16:11               ` Jorge Bastos
  0 siblings, 0 replies; 20+ messages in thread
From: Jorge Bastos @ 2021-12-07 16:11 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: Zygo Blaxell, Qu Wenruo, Btrfs BTRFS

On Tue, Dec 7, 2021 at 3:22 PM Christoph Anton Mitterer
<calestyo@scientia.org> wrote:
>
> Hey Jorge.
>
> I've looked at your old mail thread... and the first case you've
> showed:
> btrfs fi usage /mnt/disk4

Hi,

Thanks for the reply, that one was fine, it was the "good" example,
the next one was the problem, /mnt/disk3, but like mentioned I figured
out it's the metadata exhaustion issue.

Regards,
Jorge Bastos

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07 15:07           ` Christoph Anton Mitterer
@ 2021-12-07 18:14             ` Zygo Blaxell
  2021-12-16 23:16               ` Christoph Anton Mitterer
  0 siblings, 1 reply; 20+ messages in thread
From: Zygo Blaxell @ 2021-12-07 18:14 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: Qu Wenruo, linux-btrfs

On Tue, Dec 07, 2021 at 04:07:32PM +0100, Christoph Anton Mitterer wrote:
> On Tue, 2021-12-07 at 02:21 -0500, Zygo Blaxell wrote:
> > If you minimally balance data (so that you keep 2GB unallocated at
> > all
> > times) then it works much better: you can allocate the last metadata
> > chunk that you need to expand, and it requires only a few minutes of
> > IO
> > per day.  After a while you don't need to do this any more, as a
> > large
> > buffer of allocated but unused metadata will form.
> 
> Hm I've already asked Qu in the other mail just before, whether/why
> balancing would help there at all.
> 
> Doesn't it just re-write the block groups (but not defragment them...)
> would that (and why) help to gain back unallocated space (which could
> then be allocated for meta-data)?

It coalesces the free space in each block group into big contiguous
regions, eventually growing them to regions over 1GB in size.  Usually
this gives back unallocated space.

If balance can't pack the extents in 1GB units without changing their
sizes or crossing a block group boundary, then balance might not be
able to free any block groups this way, so this tends to fail when the
filesystem is over about 97% full.  It's important to run the minimal
data balances _before_ this happens, as it's too late to allocate
metadata after.

> And what exactly do you mean with "minimally"? I mean of course I can
> use -dusage=20 or so... is it that?

Minimal balance is exactly one data block group, i.e.

	btrfs balance start -dlimit=1 /fs

Run it when unallocated space gets low.  The exact threshold is low
enough that the time between new data block group allocations is less
than the balance time.
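
A sketch of what that could look like from cron (untested; the 2GiB
threshold and the parsing of 'btrfs filesystem usage -b' output are
just examples):

	#!/bin/sh
	# rewrite one data block group whenever unallocated space gets low
	FS=/srv/dcache/pools/2
	MIN=$((2 * 1024 * 1024 * 1024))   # keep at least ~2GiB unallocated
	UNALLOC=$(btrfs filesystem usage -b "$FS" |
	          awk '/Device unallocated:/ { print $3 }')
	if [ "$UNALLOC" -lt "$MIN" ]; then
	        btrfs balance start -dlimit=1 "$FS"
	fi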

Usage filter is OK for one-off interventions, but repeated use eventually
leads to a filesystem full of block groups that are filled to the
threshold in the usage filter, and no unallocated space.

> But I guess all that wouldn't help now, when the unallocated space is
> already used up, right?

If you have many GB of free space in the block groups, then usually
one can be freed up.  After that, it's a straightforward slot-puzzle,
packing data into the unallocated space.

If the free space is too fragmented or the extents are too large, then
it will not be possible to recover without adding disk space or deleting
data.

> > If you need a drastic intervention, you can mount with
> > metadata_ratio=1
> > for a short(!) time to allocate a lot of extra metadata block groups.
> > Combine with a data block group balance for a few blocks (e.g. -
> > dlimit=9).
> 
> All that seems rather impractical do to, to be honest. At least for an
> non-expert admin.
> 
> First, these systems are production systems... so one doesn't want to
> unmount (and do this procedure) when one sees that unallocated space
> runs out.

I think remount suffices, but I haven't checked.  The mount option is
checked at block allocation time in the code, so it should be possible
to change it live.

It has to be run for a short time because metadata_ratio=1 means 1:1
metadata to data allocation.  You only want to do this to rescue a
filesystem that has become stuck with too little metadata.  Once the
required amount of metadata is allocated, remove the metadata_ratio
option and do minimal data balancing going forward.
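
Roughly like this (untested sketch; the mount point and the -dlimit
value are just examples):

	# temporarily force 1:1 metadata:data chunk allocation
	mount -o remount,metadata_ratio=1 /srv/dcache/pools/2
	# rewrite a few data block groups so new chunks get allocated
	btrfs balance start -dlimit=9 /srv/dcache/pools/2
	# back to the default behaviour once enough metadata chunks exist
	mount -o remount,metadata_ratio=0 /srv/dcache/pools/2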

> One would rather want some way that if one sees: unallocated space gets
> low -> allocate so and so much for meta data

You can set metadata_ratio=30, which will allocate (100 / 30) = ~3%
of the space for metadata, if you are starting with an empty filesystem.

> I guess there are no real/official tools out there for such
> surveillance? Like Nagios/Icinga checks, that look at the unallocated
> space?

TBH it's never been a problem--but I run the minimal data balance daily,
and scrub every month, and never balance metadata, and have snapshots
and dedupe.  Between these they trigger all the necessary metadata
allocations.

> > You need about (3 + number_of_disks) GB of allocated but unused
> > metadata
> > block groups to handle the worst case (balance, scrub, and discard
> > all
> > active at the same time, plus the required free metadata space). 
> > Also
> > leave room for existing metadata to expand by about 50%, especially
> > if
> > you have snapshots.
> 
> 
> 
> > Never balance metadata.  Balancing metadata will erase existing
> > metadata
> > allocations, leading directly to this situation.
> 
> Wouldn't that only unallocated such allocations, that are completely
> empty?

It will repack existing metadata into existing metadata block groups,
which _creates_ empty block groups (i.e. it removes all the data from
existing groups), then it removes the empty groups.  That's the opposite of
what you want:  you want extra unused space to be kept in the metadata
block groups, so that metadata can expand without having to compete with
data for new block group allocations.

> > > So if csum data needs so much space... why can't it simply reserve
> > > e.g. 60 GB for metadata instead of just 17 GB?
> > 
> > It normally does.  Are you:
> > 
> >         - running metadata balances?  (Stop immediately.)
> 
> Nope, I did once accidentally (-musage=0 ... copy&pasted the wrong one)
> but only *after* the filesystem got stuck...

That can only do one of two things:  have no effect, or make it worse.

> >         - preallocating large files?  Checksums are allocated later,
> > and
> >         naive usage of prealloc burns metadata space due to
> > fragmentation.
> 
> Hmm... not so sure about that... (I mean I don't know what the storage
> middleware, which is www.dcache.org, does)... but it would probably do
> this only for 1 to few such large files at once, if at all.
> 
> 
> >         - modifying snapshots?  Metadata size increases with each
> >         modified snapshot.
> 
> No snapshots are used at all on these filesystems.
> 
> 
> >         - replacing large files with a lot of very small ones?  Files
> >         below 2K are stored in metadata.  max_inline=0 disables this.
> 
> I guess you mean here:
> First many large files were written... unallocated space is used up
> (with data and meta-data block groups).
> Then, large files are deleted... data block groups get fragmented (but
> not unallocated acagain, because they're not empty.
> 
> Then loads of small files would be written (inline)... which then fails
> as meta-data space would fill up even faster, right?

Correct.

> Well we do have filesystems, where there may be *many* small files..
> but I guess still all around the range of 1MB or more. I don't think we
> have lots of files below 2K.. if at all.

In theory if the average file size decreases drastically it can change
the amount of metadata required and maybe require an increase in
metadata ratio after the metadata has been allocated.

Another case happens when you suddenly start using a lot of reflinks
when the filesystem is already completely allocated.

It is possible to contrive cases where metadata usage approaches 100%
of the filesystem, so there's no such thing as allocating "enough"
metadata space for all use cases.

> So I don't think that we have this IO pattern.
> 
> It rather seems simply as if btrfs wouldn't reserve meta-data
> aggressively enough (at least not in our case)... and that to much is
> allocated for data.. and when that is actually filled, it cannot
> allocate anymore enough for metadata.

That's possible (and there are patches attempting to address it).
We don't want to be too aggressive, or the disk fills up with unused
metadata allocations...but we need to be about 5 block groups more
aggressive than we are now to handle special cases like "mount and
write until full without doing any backups or maintenance."

A couple more suggestions (more like exploitable side-effects):

	- Run regular scrubs.  If a write occurs to a block group
	while it's being scrubbed, there's an extra metadata block
	group allocation.

	- Mount with -o ssd.  This makes metadata allocation more
	aggressive (though it also requires more metadata allocation,
	so like metadata_ratio, it might be worth turning off after
	the filesystem fills up).

> 
> 
> Thanks,
> Chris.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07 15:39 ` Phillip Susi
@ 2021-12-16  3:47   ` Christoph Anton Mitterer
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Anton Mitterer @ 2021-12-16  3:47 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-btrfs

On Tue, 2021-12-07 at 10:39 -0500, Phillip Susi wrote:
> I'm not sure what you intended this to do or show.  It looks like you
> tried to execute a program named /srv/dcache/pools/2/foo, and there
> is
> no such program.  That doesn't say anything about the filesystem.

Ooops, sorry... that was some wrong copy&pasting.

What I actually did was merely a
# touch /srv/dcache/pools/2/foo

which already gave me the ENOSPC (for the reasons that have now already
been explained... i.e. no (usable) free space in the metadata and no
unallocated space either).


Thanks,
Chris.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07 18:14             ` Zygo Blaxell
@ 2021-12-16 23:16               ` Christoph Anton Mitterer
  2021-12-17  2:00                 ` Qu Wenruo
  2021-12-17  5:53                 ` Zygo Blaxell
  0 siblings, 2 replies; 20+ messages in thread
From: Christoph Anton Mitterer @ 2021-12-16 23:16 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Qu Wenruo, linux-btrfs

On Tue, 2021-12-07 at 13:14 -0500, Zygo Blaxell wrote:
> It coalesces the free space in each block group into big contiguous
> regions, eventually growing them to regions over 1GB in size. 
> Usually
> this gives back unallocated space.

Ah, I see... and yes, that worked.

Not sure if I missed anything, but I think this should somehow be
explained in btrfs-balance(8).
I mean, there *is* the section "MAKING BLOCK GROUP LAYOUT MORE COMPACT",
but that also kinda misses the point that this can be used to get
unallocated space back, doesn't it?


Is there some way to see a distribution of the space usage of the block
groups?
Like some printout that shows me:
- there are n block groups
- xx = 100%
- xx > 90%
- xx > 80%
...
- xx = 0%
?

That would also give a better idea of how worthwhile it is to balance,
and which options to use.
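
(I guess one could hack something together by dumping the extent tree
and aggregating the BLOCK_GROUP_ITEMs, e.g. - completely untested, and
the field positions depend on the btrfs-progs dump-tree output format:

	btrfs inspect-internal dump-tree -t extent /dev/sdf | awk '
	    /BLOCK_GROUP_ITEM/ { gsub(/\)/, ""); len = $6 }  # key offset = block group length
	    /block group used/ { print int(100 * $4 / len / 10) * 10 }  # used -> 10% bucket
	' | sort -n | uniq -c

...but a proper option in the btrfs tools would of course be nicer.)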


> If balance can't pack the extents in 1GB units without changing their
> sizes or crossing a block group boundary, then balance might not be
> able to free any block groups this way, so this tends to fail when
> the
> filesystem is over about 97% full.

So that's basically the point where one can only move data away... do
the balance, and move it back afterwards.

Which, btw, worked quite nicely (so thanks to all the people involved
for the help with that).


> Minimal balance is exactly one data block group, i.e.
> 
>         btrfs balance start -dlimit=1 /fs
> 
> Run it when unallocated space gets low.  The exact threshold is low
> enough that the time between new data block group allocations is less
> than the balance time.

What the sysadmin of large storage farms needs is something that one
can run basically all the time (so even if unallocated space is NOT
low), which kinda works out of the box and automatically (run via
cron?), and doesn't impact the IO too much.
Or one would need some daemon which monitors unallocated space and
kicks in if necessary.

Does it make sense to use -dusage=xx in addition to -dlimit?
I mean, if space is already tight... would just -dlimit=1 try to find a
block group that it can balance (because its usage is low enough)... or
might it just fail when the first one tried is nearly full (and not
enough space is left for it in the other block groups)?


> It has to be run for a short time because metadata_ratio=1 means 1:1
> metadata to data allocation.  You only want to do this to rescue a
> filesystem that has become stuck with too little metadata.  Once the
> required amount of metadata is allocated, remove the metadata_ratio
> option and do minimal data balancing going forward.

But that's also something only really suitable for "rescuing"... one
wouldn't want to do that on big storage systems with hundreds of
filesystems just to make sure that btrfs doesn't run into that
situation in the first place.

For that it would be much nicer if one had other means to tell btrfs to
allocate more for metadata... like either a command to reserve xx GB
that one can run when one sees that space is getting tight... or some
better logic by which btrfs does that automatically.


> You can set metadata_ratio=30, which will allocate (100 / 30) = ~3%
> of the space for metadata, if you are starting with an empty
> filesystem.

Okay that sounds more like a way...


> TBH it's never been a problem--but I run the minimal data balance
> daily,
> and scrub every month, and never balance metadata, and have snapshots
> and dedupe.  Between these they trigger all the necessary metadata
> allocations.

I'm also still not really sure why this happened here.

I've asked the developers of our storage middleware software in the
meantime, and it seems in fact that dCache does pre-allocate the space
of files that it wants to write.

But even then, shouldn't btrfs be able to know how much it will
generally need for csum metadata?

I can only think of IO patterns where one would end up with too
aggressive meta-data allocation (e.g. when writing lots of directories
or XATTRS) and where not enough data block groups are left.

But the other way round?
If one writes very small files (so that they are inlined) -> meta-data
should grow.

If one writes non-inlined files, regardless of whether small or big...
shouldn't it always be clear how much space could be needed for csum
meta-data, when a new block group is allocated for data and if that
would be fully written?


> In theory if the average file size decreases drastically it can
> change
> the amount of metadata required and maybe require an increase in
> metadata ratio after the metadata has been allocated.

I cannot totally rule this out, but it's pretty unlikely.


> Another case happens when you suddenly start using a lot of reflinks
> when the filesystem is already completely allocated.

That I can rule out, we didn't make any snapshots or ref-copies.


> That's possible (and there are patches attempting to address it).
> We don't want to be too aggressive, or the disk fills up with unused
> metadata allocations...but we need to be about 5 block groups more
> aggressive than we are now to handle special cases like "mount and
> write until full without doing any backups or maintenance."

Wouldn't a "simple" (at least in my mind ;-) ) solution be, that:
- if the case arises, that either data or meta-data block groups are
  full
- and not unallocated space is left
- and if the other kind of block groups has plenty of free space left
  (say in total something like > 10 times the size of a block group...
  or maybe more (depending on the total filesystem size), cause one
  probably doesn't want to shuffle loads of data around, just for the
  last 0.005% to be squeezed out.)
then:
- btrfs automatically does the balance?
  Or maybe something "better" that also works when it would need to
  break up extents?

If there are cases where one doesn't like that automatic shuffling, one
could make it opt-in via some mount option.


> A couple more suggestions (more like exploitable side-effects):
> 
>         - Run regular scrubs.  If a write occurs to a block group
>         while it's being scrubbed, there's an extra metadata block
>         group allocation.

But writes during scrubs would only happen when it finds any corrupted
blocks?


Thanks,
Chris.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: ENOSPC while df shows 826.93GiB free
  2021-12-16 23:16               ` Christoph Anton Mitterer
@ 2021-12-17  2:00                 ` Qu Wenruo
  2021-12-17  3:10                   ` Christoph Anton Mitterer
  2021-12-17  5:53                 ` Zygo Blaxell
  1 sibling, 1 reply; 20+ messages in thread
From: Qu Wenruo @ 2021-12-17  2:00 UTC (permalink / raw)
  To: Christoph Anton Mitterer, Zygo Blaxell; +Cc: linux-btrfs



[...]
>> That's possible (and there are patches attempting to address it).
>> We don't want to be too aggressive, or the disk fills up with unused
>> metadata allocations...but we need to be about 5 block groups more
>> aggressive than we are now to handle special cases like "mount and
>> write until full without doing any backups or maintenance."
>
> Wouldn't a "simple" (at least in my mind ;-) ) solution be, that:
> - if the case arises, that either data or meta-data block groups are
>    full
> - and not unallocated space is left
> - and if the other kind of block groups has plenty of free space left
>    (say in total something like > 10 times the size of a block group...
>    or maybe more (depending on the total filesystem size), cause one
>    probably doesn't want to shuffle loads of data around, just for the
>    last 0.005% to be squeezed out.)
> then:
> - btrfs automatically does the balance?
>    Or maybe something "better" that also works when it would need to
>    break up extents?

Or, let's change what our vanilla `df` command outputs, by taking
metadata free space and unallocated space into consideration, like:

- If there is plenty of unallocated space:
   Keep the current output.

- If there is no more unallocated space that can be utilized:
   Then take metadata free space into consideration; e.g. if there is
   only 1G of free metadata space but several TB of free data space,
   we only report free metadata space * some ratio as free data space.

   And if by some magic calculation we determine that even balance
   won't free up any space, we return available space as 0 directly.

By this we under-report the amount of available space. Although users
may (and in most cases indeed can) write way more than the reported
available space, we have done our best to show end users that they
need to take care of the fs - either by deleting unused data, or by
doing proper maintenance before the reported available space reaches 0.

By this, your existing space reservation tool will work much better
than in your current situation, and you will get enough early warning
before reaching the current situation.

But I'm afraid this could noticeably drop the disk utilization, as we
would become too cautious in reporting available space.

Thanks,
Qu

>
> If there are cases where one doesn't like that automatic shuffling, one
> could make it opt-in via some mount option.
>
>
>> A couple more suggestions (more like exploitable side-effects):
>>
>>          - Run regular scrubs.  If a write occurs to a block group
>>          while it's being scrubbed, there's an extra metadata block
>>          group allocation.
>
> But writes during scrubs would only happen when it finds any corrupted
> blocks?
>
>
> Thanks,
> Chris.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: ENOSPC while df shows 826.93GiB free
  2021-12-17  2:00                 ` Qu Wenruo
@ 2021-12-17  3:10                   ` Christoph Anton Mitterer
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Anton Mitterer @ 2021-12-17  3:10 UTC (permalink / raw)
  To: Qu Wenruo, Zygo Blaxell; +Cc: linux-btrfs

On Fri, 2021-12-17 at 10:00 +0800, Qu Wenruo wrote:
> Or, let's change how we compute our vanilla `df` command output, by
> taking metadata free space and unallocated space into consideration,
> like:

Actually I was thinking about this before as well, but that would
rather just remedy the consequences of that particular ENOSPC situation
and not prevent it.



> - If there is no more unallocated space that can be utilized
>    Then take metadata free space into consideration: e.g. if there is
>    only 1G of free metadata space but several TiB of free data space,
>    we only report free metadata space * some ratio as free data space.

Not sure whether this is so good... because then the shown free space
is completely made up... it could end up being that value if the
remaining unallocated space and the remaining meta-data space get eaten
up as "anticipated"... but it could also be much more or much less
(depending on what actually happens), right?

What I'd rather do is:
*If* btrfs realises that there's still free space in the data block
groups... but really nothing at all (that isn't reserved for special
operations) is left in the meta-data block groups AND nothing more
could be allocated... then suddenly drop the shown free space to
exactly 0.

Because from a classic program's point of view, that's effectively the
case: it cannot add any further files (not even empty ones).


This would also allow programs like dCache to better deal with that
situation.
What dCache does is laid out here:
https://github.com/dCache/dcache/issues/5352#issuecomment-989793555

Perhaps some background... dCache is a distributed storage system, so
it runs on multiple nodes managing files placed in many filesystems (on
so-called pools).
Clients first connect via some protocol to a "door node", from which
they are (at least if the respective protocol supports it) redirected
to a pool where dCache thinks the file can be written (in the write
case, obviously).

dCache decides that by knowing all its pools and monitoring their
(filesystems') free space. It also has a configurable gap value
(defaulting to 4GB), which it will try to leave free on a pool.

If the file is expected to fit (I think it again depends on the
protocol whether it really knows in advance how much the client will
write) while still observing the gap... plus several more load
balancing metrics... a pool may be selected and the client redirected.

Seems to me like a fairly reasonable process.
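
Very roughly (this is just how I understand it; invented names, not
actual dCache code), the selection boils down to something like:

# Simplified illustration of the pool selection as I understand it;
# NOT actual dCache code.
GAP = 4 * 1024**3   # configurable gap dCache tries to leave free (default 4GB)

def select_pool(pools, expected_file_size):
    """pools: list of (name, reported_free_bytes, load_metric) tuples."""
    candidates = [p for p in pools
                  if p[1] - expected_file_size >= GAP]  # fits, keeping the gap
    if not candidates:
        return None                 # no pool can take the file
    # In reality several load-balancing metrics are combined; here we
    # simply pick the least loaded candidate.
    return min(candidates, key=lambda p: p[2])[0]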


So as things currently stand with btrfs, when that particular
situation arises that I've hit now (plenty of free space in the data
block groups, but zero in the meta-data block groups plus zero
unallocated space), dCache cannot really deal properly with it:

- df (respectively the usual syscalls) will show it that much more
space is available than what the gap could protect against

- the client tries to write to the pool, there's immediately ENOSPC and
the transfer is properly aborted with some failure

- but dCache cannot really tell whether the situation is still there or
not... so it will run into broken write transfers over and over

- typically also, once a client is redirected to a pool, there is no
going back and retrying the same on another one (at least not
automatically from within the protocol)... so the failure is really
"permanent", unless the client itself tries again and then (by chance)
lands on another pool where the btrfs is still good


If df (respectively the syscalls) returned 0 free space in that
situation, we'd still have ~800 GB lost (without manual
intervention)... but at least the middleware should be able to deal
with that.



> By this we under-report the amount of available space. Although users
> may (and in most cases, they indeed can) write way more data than the
> reported available space, we have done our best to show end users that
> they need to take care of the fs, either by deleting unused data or by
> doing proper maintenance before the reported available space reaches 0.

Well, but at least when the problem has happened, then - without any
further intervention - no further writes (of new files respectively new
data) will be possible... so the "under-reporting" is only true if one
assumes that this intervention will happen.

If it does, like by some "minimal" maintenance balance as Zygo
suggested, then the whole situation should not happen anyway, AFAIU.
And if it's by some intervention after the ENOSPC, then the "under-
reporting" would also go away as soon as the problem was fixed
(manually).




But what do you think about my idea of btrfs automatically solving the
situation by doing a balance on its own, once the problem has arisen?


One could also think of something like the following:
Add some 2nd-level global reserve, which is much bigger than the
current one... at least big enough that one could manually balance the
fs (or btrfs does that automatically if it decides it needs to).

If the problem from this mail thread occurs, it could be used to solve
it more easily, without the need to move data somewhere else (which may
not always be feasible), because that space would be reserved to be
used e.g. for such a balance.

One could make its size dependent on the size of the fs. If the fs has
e.g. 1TB, then reserving e.g. 4GB is barely noticeable. And if the fs
should be too small, one simply doesn't have the 2nd-level global
reserve.

If(!) the fs runs full in a proper way (i.e. no more unallocated space
and meta-data and data block groups are equally full), then btrfs could
decide to release that 2nd-level global reserve back for normal use, to
squeeze out as much space as possible without losing too much.

Once it's really full, it's full and not much new could happen
anyway... and the normal global reserve would still be there for the
*very* important things.

If files should later on be deleted, btrfs could decide to try to re-
establish the 2nd-level global reserve... again to be "reserved" until
the fs is really, really full again and it would otherwise just be
wasted space.


Cheers,
Chris.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: ENOSPC while df shows 826.93GiB free
  2021-12-16 23:16               ` Christoph Anton Mitterer
  2021-12-17  2:00                 ` Qu Wenruo
@ 2021-12-17  5:53                 ` Zygo Blaxell
  1 sibling, 0 replies; 20+ messages in thread
From: Zygo Blaxell @ 2021-12-17  5:53 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: Qu Wenruo, linux-btrfs

On Fri, Dec 17, 2021 at 12:16:21AM +0100, Christoph Anton Mitterer wrote:
> On Tue, 2021-12-07 at 13:14 -0500, Zygo Blaxell wrote:

[snip]

> Is there some way to see a distribution of the space usage of block
> groups?
> Like some print out that shows me:
> - there are n block groups
> - xx = 100%
> - xx > 90%
> - xx > 80%
> ...
> - xx = 0%
> ?
> 
> That would also give a better idea of how worthwhile it is to balance,
> and which options to use.

Python-btrfs lets you access btrfs data structures from python scripts.
There might even be an existing example script that does this.
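
Untested sketch of what that could look like (the exact attribute and
constant names may differ between python-btrfs versions):

#!/usr/bin/env python3
# Untested sketch: histogram of data block group usage via python-btrfs.
import sys
import btrfs

buckets = [0] * 11          # 0-9%, 10-19%, ..., 90-99%, 100%
fs = btrfs.FileSystem(sys.argv[1])
for chunk in fs.chunks():
    if not chunk.type & btrfs.ctree.BLOCK_GROUP_DATA:
        continue            # only look at data block groups
    bg = fs.block_group(chunk.vaddr, chunk.length)
    pct = bg.used * 100 // chunk.length
    buckets[min(pct // 10, 10)] += 1

for i, count in enumerate(buckets):
    label = "100%" if i == 10 else "%d-%d%%" % (i * 10, i * 10 + 9)
    print("%7s: %4d block groups" % (label, count))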

> > Minimal balance is exactly one data block group, i.e.
> > 
> >         btrfs balance start -dlimit=1 /fs
> > 
> > Run it when unallocated space gets low.  The exact threshold is low
> > enough that the time between new data block group allocations is less
> > than the balance time.
> 
> What the sysadmin of large storage farms needs is something that one
> can run basically always (so even if unallocated space is NOT low),
> which kinda works out of the box and automatically (run via cron?) and
> doesn't impact the IO too much.
> Or one would need some daemon, which monitors unallocated space and
> kicks in if necessary.

That's the theory, and it's what packages like btrfsmaintenance try to do.

The practice is...more complicated.
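
For what it's worth, the core of such a monitor can be quite small; a
rough sketch (the threshold is arbitrary, and it simply shells out to
the btrfs CLI):

#!/usr/bin/env python3
# Rough sketch of a cron-driven "balance one block group when
# unallocated space gets low" job.  The 16GiB threshold is arbitrary;
# pick one so that new block groups get allocated more slowly than one
# balance run takes.
import subprocess
import sys

THRESHOLD = 16 * 1024**3    # start a minimal balance below 16GiB unallocated

def unallocated_bytes(path):
    out = subprocess.run(["btrfs", "filesystem", "usage", "-b", path],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if "Device unallocated:" in line:
            return int(line.split()[-1])
    raise RuntimeError("could not parse 'btrfs filesystem usage' output")

path = sys.argv[1]
if unallocated_bytes(path) < THRESHOLD:
    # Minimal balance: relocate exactly one data block group.
    subprocess.run(["btrfs", "balance", "start", "-dlimit=1", path], check=True)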

> Does it make sense to use -dusage=xx in addition to -dlimit?
> I mean if space is already tight... would just -dlimit=1 try to find a
> block group that it can balance (because its usage is low enough)...
> or might it just fail when the first one tried is nearly full (and not
> enough space is left for that in other block groups)?

The best strategy I've found so far is to choose block groups entirely
at random (a rough sketch follows this list), because:

	* the benefit is fixed:  after a successful block group balance,
	you will have 1GB of unallocated space on all disks in the
	block group.  In that sense it doesn't matter which block groups
	you balance, only the number that you balance.  If you pick
	a full block group, btrfs will pack the data into emptier block
	groups.  If you pick an empty block group, btrfs will pack the
	data into other empty block groups, or create a new empty block
	group and just shuffle the data around.

	* the cost of computing the cost of relocating a block group is
	proportional to doing the work of relocating the block group.  The
	data movement for 1GB takes 12 seconds on modern spinning drives
	and 1 second or less on NVMe.  The other 60-seconds-to-an-hour
	of relocating a block group is updating all the data references,
	and the parent nodes that reference them, recursively.	If you
	had some clever caching and precomputation scheme you could
	maybe choose a good block group to balance in less time than
	it takes to balance it, but if you predict wrong, you're stuck
	doing the extra work with no benefit.  Also because this is a
	deterministic algorithm, you run into the next problem:

	* choosing block groups by a deterministic algorithm (e.g. number
	of free bytes, percentage of free space, fullest/emptiest device,
	largest vaddr, smallest vaddr) eventually runs into adverse
	selection, and gets stuck on a block group that doesn't fit into
	the available free space, but it's always the "next" block group
	according to the selecting algorithm, so it can make no further
	progress.  Choosing a completely random block group (from the
	target devices where unallocated space is required) may or may
	not succeed, but it's a cheap algorithm to run and it's very
	good at avoiding adverse selection.
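
The sketch mentioned above (untested; it uses python-btrfs to enumerate
data block groups and the vrange balance filter to relocate exactly the
chosen one; attribute names may differ between versions):

#!/usr/bin/env python3
# Untested sketch: balance one randomly chosen data block group.
import random
import subprocess
import sys
import btrfs

path = sys.argv[1]
fs = btrfs.FileSystem(path)
data_chunks = [c for c in fs.chunks()
               if c.type & btrfs.ctree.BLOCK_GROUP_DATA]
chunk = random.choice(data_chunks)

# The vrange filter picks block groups overlapping the given vaddr range,
# so vaddr..vaddr+1 relocates exactly the chosen block group.
vrange = "-dvrange=%d..%d" % (chunk.vaddr, chunk.vaddr + 1)
subprocess.run(["btrfs", "balance", "start", vrange, path], check=True)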

> > TBH it's never been a problem--but I run the minimal data balance
> > daily,
> > and scrub every month, and never balance metadata, and have snapshots
> > and dedupe.  Between these they trigger all the necessary metadata
> > allocations.
> 
> I'm also still not really sure why this happened here.
> 
> I've asked the developers of our storage middleware software in the
> meantime, and it seems in fact that dCache does pre-allocate the space
> of files that it wants to write.
> 
> But even then, shouldn't btrfs be able to know how much it will
> generally need for csum metadata?

It varies a lot.  Checksum items have variable overheads as they are
packed into pages.  There is some heuristic based on a constant ratio
but maybe it's a little too low.

It does seem to be prone to rounding error, as I've seen a lot of
users presenting filesystems that have exactly 1GB too little metadata
allocated.

> I can only think of IO patterns where one would end up with too
> aggressive meta-data allocation (e.g. when writing lots of directories
> or XATTRS) and where not enough data block groups are left.
> 
> But the other way round?
> If one writes very small files (so that they are inlined) -> meta-data
> should grow.
> 
> If one writes non-inlined files, regardless of whether small or big...
> shouldn't it always be clear how much space could be needed for csum
> meta-data, when a new block group is allocated for data and if that
> would be fully written?

It's not even clear how much space is needed for the data.  Extents are
immutable, so if you overwrite part of a large extent, you will need more
space for the new data even though the old data is no longer reachable
through any file.

Checksums can vary in density from 779 (if there are a lot of holes in
files) to 4090 blocks per metadata page (if they're all contiguous).
That's a 5:1 size ratio between the extremes.
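
Back-of-the-envelope, assuming 4KiB data blocks and 16KiB metadata
pages, those two densities work out to roughly 1 GiB vs. ~5.3 GiB of
csum metadata per TiB of data:

# Rough csum metadata overhead per TiB of data at the two densities above.
TIB = 2**40
PAGE = 16 * 1024            # metadata page size
BLOCK = 4 * 1024            # data block size

for blocks_per_page in (4090, 779):
    pages = TIB / (blocks_per_page * BLOCK)
    print("%4d blocks/page -> %.2f GiB csum metadata per TiB of data"
          % (blocks_per_page, pages * PAGE / 2**30))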

> > That's possible (and there are patches attempting to address it).
> > We don't want to be too aggressive, or the disk fills up with unused
> > metadata allocations...but we need to be about 5 block groups more
> > aggressive than we are now to handle special cases like "mount and
> > write until full without doing any backups or maintenance."
> 
> Wouldn't a "simple" (at least in my mind ;-) ) solution be, that:
> - if the case arises, that either data or meta-data block groups are
>   full
> - and not unallocated space is left
> - and if the other kind of block groups has plenty of free space left
>   (say in total something like > 10 times the size of a block group...
>   or maybe more (depending on the total filesystem size), cause one
>   probably doesn't want to shuffle loads of data around, just for the
>   last 0.005% to be squeezed out.)
> then:
> - btrfs automatically does the balance?
>   Or maybe something "better" that also works when it would need to
>   break up extents?

The problem is in the definitions of things like "plenty" and "not a lot",
and expectations like "last 0.005%."  We all know balancing automatically
solves the problem, but all the algorithms we use to trigger it are wrong
in some edge case.

Balance is a big and complex thing that operates on big filesystem
allocation objects, too big to run automatically at the moment a critical
failure is detected.  The challenge is to predict the future well enough
to know when to run balance to avoid it.  In these early days, everybody
seems to be rolling their own solutions and discovering surprising
implications of their choices.

Also there are much simpler solutions, like "put all the metadata on
SSD", where the administrator picks the metadata size and btrfs works
(or doesn't work) with it.

Rewriting the extent tree is also on the table, though people have
recently worked on that (extent tree v2) and the ability to change
allocated extent lengths after the fact was dropped from the proposal.

> If there are cases where one doesn't like that automatic shuffling, one
> could make it opt-in via some mount option.

In theory a garbage collection tool can be written today to manage
this, but it's only a theory until somebody writes it.  It's possible
to break up extents by running a combination of defrag and dedupe over
them using existing userspace interfaces.  Once such a tool exists, the
kernel interfaces could be improved for performance.  That tool would
essentially be data balance in userspace, so the kernel data balance
would no longer be needed.  It's not clear that this would be able to
perform any better than the current data balance scheme, though, except
for being slightly more flexible on extremely full filesystems.

> > A couple more suggestions (more like exploitable side-effects):
> > 
> >         - Run regular scrubs.  If a write occurs to a block group
> >         while it's being scrubbed, there's an extra metadata block
> >         group allocation.
> 
> But writes during scrubs would only happen when it finds any corrupted
> blocks?

Each block group is made read-only while it is scrubbed to prevent
modification while scrub verifies it.  If some process wants to modify
data on the filesystem during the scrub, it must allocate its new data
in some block group that is not being scrubbed.  If all the existing
block groups are either full or read-only, then a new block group must be
allocated.  If this is not possible, the writing process will hit ENOSPC.
In other words, scrub effectively decreases free space while it runs by
locking some of it away temporarily, and this forces btrfs to allocate
a little more space for data and metadata.

This is one of the many triggers for btrfs to require and allocate another
GB of metadata apparently at random.  It's never random, but there are
a lot of different triggering conditions in the implementation.

Only a few spare block groups are usually needed, so people running
scrub regularly work around the metadata problem without knowing they're
working around a problem.

> Thanks,
> Chris.
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2021-12-17  5:53 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-07  2:29 ENOSPC while df shows 826.93GiB free Christoph Anton Mitterer
2021-12-07  2:59 ` Qu Wenruo
2021-12-07  3:06   ` Christoph Anton Mitterer
2021-12-07  3:29     ` Qu Wenruo
2021-12-07  3:44       ` Christoph Anton Mitterer
2021-12-07  4:56         ` Qu Wenruo
2021-12-07 14:30           ` Christoph Anton Mitterer
2021-12-07  7:21         ` Zygo Blaxell
2021-12-07 12:31           ` Jorge Bastos
2021-12-07 15:07           ` Christoph Anton Mitterer
2021-12-07 18:14             ` Zygo Blaxell
2021-12-16 23:16               ` Christoph Anton Mitterer
2021-12-17  2:00                 ` Qu Wenruo
2021-12-17  3:10                   ` Christoph Anton Mitterer
2021-12-17  5:53                 ` Zygo Blaxell
2021-12-07 15:10           ` Jorge Bastos
2021-12-07 15:22             ` Christoph Anton Mitterer
2021-12-07 16:11               ` Jorge Bastos
2021-12-07 15:39 ` Phillip Susi
2021-12-16  3:47   ` Christoph Anton Mitterer
