* ENOSPC while df shows 826.93GiB free
@ 2021-12-07  2:29 Christoph Anton Mitterer
  2021-12-07  2:59 ` Qu Wenruo
  2021-12-07 15:39 ` Phillip Susi
  0 siblings, 2 replies; 20+ messages in thread

From: Christoph Anton Mitterer @ 2021-12-07  2:29 UTC (permalink / raw)
To: linux-btrfs

Hey.

At the university I'm running a Tier-2 site for the Large Hadron
Collider, with a total storage of about 4 PB.

For a bit more than half of that I use btrfs, with HDDs combined into
hardware RAID and provided as 16 TiB devices (on which the btrfs sits).

It runs Debian bullseye, which has kernel 5.10.70. I've also used
-R free-space-tree.
I don't use snapshots on these filesystems.

On one of the filesystems I've now run into ENOSPC.

# btrfs filesystem usage /srv/dcache/pools/2
Overall:
    Device size:                 16.00TiB
    Device allocated:            16.00TiB
    Device unallocated:           1.00MiB
    Device missing:                 0.00B
    Used:                        15.19TiB
    Free (estimated):           826.93GiB  (min: 826.93GiB)
    Free (statfs, df):          826.93GiB
    Data ratio:                      1.00
    Metadata ratio:                  2.00
    Global reserve:             512.00MiB  (used: 0.00B)
    Multiple profiles:                 no

Data,single: Size:15.97TiB, Used:15.16TiB (94.94%)
   /dev/sdf   15.97TiB

Metadata,DUP: Size:17.01GiB, Used:16.51GiB (97.06%)
   /dev/sdf   34.01GiB

System,DUP: Size:8.00MiB, Used:2.12MiB (26.56%)
   /dev/sdf   16.00MiB

Unallocated:
   /dev/sdf    1.00MiB

yet:
# /srv/dcache/pools/2/foo
-bash: /srv/dcache/pools/2/foo: No such file or directory

Balancing also fails, e.g.:
# btrfs balance start -dusage=50 /srv/dcache/pools/2
ERROR: error during balancing '/srv/dcache/pools/2': No space left on device
There may be more info in syslog - try dmesg | tail
# btrfs balance start -dusage=40 /srv/dcache/pools/2
Done, had to relocate 0 out of 16370 chunks
# btrfs balance start /srv/dcache/pools/2
WARNING:

	Full balance without filters requested. This operation is very
	intense and takes potentially very long. It is recommended to
	use the balance filters to narrow down the scope of balance.
	Use 'btrfs balance start --full-balance' option to skip this
	warning.
	The operation will start in 10 seconds.
	Use Ctrl-C to stop it.
10 9 8 7 6 5 4 3 2 1
Starting balance without any filters.
ERROR: error during balancing '/srv/dcache/pools/2': No space left on device
There may be more info in syslog - try dmesg | tail
# btrfs balance start -dusage=0 /srv/dcache/pools/2
Done, had to relocate 0 out of 16370 chunks

fsck showed no errors.

Any ideas what's going on and how to recover?

Thanks,
Chris.

^ permalink raw reply	[flat|nested] 20+ messages in thread
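The arithmetic behind the confusing numbers in the `btrfs filesystem usage` output above can be sketched as follows. All figures are taken from the report; the point is that "Free (statfs, df)" counts slack inside already-allocated data block groups, while *unallocated* device space — the only space a new metadata chunk can come from — is essentially zero. This is an illustrative breakdown, not btrfs's exact internal accounting:

```python
GiB = 1024**3
TiB = 1024**4

# Figures from the report above:
data_size      = 15.97 * TiB   # allocated to data block groups
data_used      = 15.16 * TiB
metadata_size  = 17.01 * GiB   # per-copy; DUP stores it twice on disk
metadata_used  = 16.51 * GiB
global_reserve = 0.5   * GiB   # carved out of metadata for critical ops

# "Free (statfs, df)" is roughly the slack inside data block groups:
data_free = data_size - data_used

# Metadata headroom, after subtracting the global reserve, is tiny:
metadata_free = metadata_size - metadata_used - global_reserve

print(round(data_free / GiB))      # ~829 GiB -> what df reports as free
print(round(metadata_free / GiB))  # ~0 GiB   -> why writes actually fail
```

So df's ~827 GiB and the ENOSPC are both "correct": they measure different pools of space.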
* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07  2:29 ENOSPC while df shows 826.93GiB free Christoph Anton Mitterer
@ 2021-12-07  2:59 ` Qu Wenruo
  2021-12-07  3:06   ` Christoph Anton Mitterer
  2021-12-07 15:39 ` Phillip Susi
  1 sibling, 1 reply; 20+ messages in thread

From: Qu Wenruo @ 2021-12-07  2:59 UTC (permalink / raw)
To: Christoph Anton Mitterer, linux-btrfs

On 2021/12/7 10:29, Christoph Anton Mitterer wrote:
> Hey.
>
> At the university I'm running a Tier-2 site for the large hadron
> collider, with some total storage of 4 PB.
>
> For a bit more than half of that I use btrfs, with HDDs combined to
> some hardware raid, provided as 16TiB devices (on which the btrfs
> sits).
>
> It runs Debian bullseye, which has 5.10.70. Oh and I've used
> -R free-space-tree.
> I don't use snapshots on these filesystems.
>
> On one of the filesystems I've ran now into ENOSPC.
>
> # btrfs filesystem usage /srv/dcache/pools/2
> Overall:
>     Device size:                16.00TiB
>     Device allocated:           16.00TiB
>     Device unallocated:          1.00MiB

All device space is allocated already.

>     Device missing:                0.00B
>     Used:                       15.19TiB
>     Free (estimated):          826.93GiB  (min: 826.93GiB)
>     Free (statfs, df):         826.93GiB
>     Data ratio:                     1.00
>     Metadata ratio:                 2.00
>     Global reserve:            512.00MiB  (used: 0.00B)
>     Multiple profiles:                no
>
> Data,single: Size:15.97TiB, Used:15.16TiB (94.94%)
>    /dev/sdf  15.97TiB
>
> Metadata,DUP: Size:17.01GiB, Used:16.51GiB (97.06%)

Your metadata is full: although there is some free space (512M), it is
mostly taken by the global reserve, which is kept for very critical
operations. Thus your metadata is effectively full.

>    /dev/sdf  34.01GiB
>
> System,DUP: Size:8.00MiB, Used:2.12MiB (26.56%)
>    /dev/sdf  16.00MiB
>
> Unallocated:
>    /dev/sdf   1.00MiB
>
> yet:
> # /srv/dcache/pools/2/foo
> -bash: /srv/dcache/pools/2/foo: No such file or directory
>
> balancing also fails, e.g.:
> # btrfs balance start -dusage=50 /srv/dcache/pools/2

Since your metadata is full, btrfs can't reserve enough metadata space
to relocate a data chunk.
> ERROR: error during balancing '/srv/dcache/pools/2': No space left on device
> There may be more info in syslog - try dmesg | tail
> # btrfs balance start -dusage=40 /srv/dcache/pools/2
> Done, had to relocate 0 out of 16370 chunks
> # btrfs balance start /srv/dcache/pools/2
> WARNING:
>
> 	Full balance without filters requested. This operation is very
> 	intense and takes potentially very long. It is recommended to
> 	use the balance filters to narrow down the scope of balance.
> 	Use 'btrfs balance start --full-balance' option to skip this
> 	warning.
>
> 	The operation will start in 10 seconds.
> 	Use Ctrl-C to stop it.
> 10 9 8 7 6 5 4 3 2 1
> Starting balance without any filters.
> ERROR: error during balancing '/srv/dcache/pools/2': No space left on device
> There may be more info in syslog - try dmesg | tail
> # btrfs balance start -dusage=0 /srv/dcache/pools/2
> Done, had to relocate 0 out of 16370 chunks
>
> fsck showed no errors.
>
> Any ideas what's going on and how to recover?

Since your metadata is already full, you may need to delete enough data
to free up some metadata space.

The candidates include small files (mostly inlined files) and large
files with checksums.

Thanks,
Qu

>
> Thanks,
> Chris.

^ permalink raw reply	[flat|nested] 20+ messages in thread
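The failure mode Qu describes — no unallocated device space left while metadata is nearly full — can be detected mechanically from `btrfs filesystem usage -b` output. The sketch below parses an abbreviated sample (byte values approximated from the report in this thread); the thresholds in `metadata_starved` are illustrative guesses, not official limits:

```python
import re

# Abbreviated, approximate `btrfs filesystem usage -b` output for the
# filesystem in this thread (field names follow btrfs-progs formatting):
SAMPLE = """\
Overall:
    Device size:          17592186044416
    Device allocated:     17592184995840
    Device unallocated:          1048576
Metadata,DUP: Size:18263900160, Used:17727352832
"""

def parse(text):
    unalloc = int(re.search(r"Device unallocated:\s+(\d+)", text).group(1))
    m = re.search(r"Metadata,\w+: Size:(\d+), Used:(\d+)", text)
    return unalloc, int(m.group(1)), int(m.group(2))

def metadata_starved(unalloc, meta_size, meta_used,
                     reserve=512 * 1024**2, slack=1 * 1024**3):
    # Starved if no new metadata chunk can be allocated (unallocated space
    # far below a chunk size) AND existing metadata has less than `slack`
    # headroom beyond the global reserve. Thresholds are illustrative.
    return unalloc < 256 * 1024**2 and (meta_size - meta_used) < reserve + slack

unalloc, ms, mu = parse(SAMPLE)
print(metadata_starved(unalloc, ms, mu))  # True for this filesystem
```

A real monitoring script would feed it the live output of `btrfs filesystem usage -b <mnt>` instead of the embedded sample.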
* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07  2:59 ` Qu Wenruo
@ 2021-12-07  3:06   ` Christoph Anton Mitterer
  2021-12-07  3:29     ` Qu Wenruo
  0 siblings, 1 reply; 20+ messages in thread

From: Christoph Anton Mitterer @ 2021-12-07  3:06 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs

On Tue, 2021-12-07 at 10:59 +0800, Qu Wenruo wrote:
>
> Since your metadata is already full, you may need to delete enough
> data to free up some metadata space.
>
> The candidates include small files (mostly inlined files) and large
> files with checksums.

On that fs, there are rather many large files (800 MB - 1.5 GB).

Is there any way to get (much?) more space reserved for metadata in the
future, respectively on the other existing filesystems that haven't
deadlocked themselves yet?

Thanks,
Chris.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07  3:06 ` Christoph Anton Mitterer
@ 2021-12-07  3:29   ` Qu Wenruo
  2021-12-07  3:44     ` Christoph Anton Mitterer
  0 siblings, 1 reply; 20+ messages in thread

From: Qu Wenruo @ 2021-12-07  3:29 UTC (permalink / raw)
To: Christoph Anton Mitterer, linux-btrfs

On 2021/12/7 11:06, Christoph Anton Mitterer wrote:
> On Tue, 2021-12-07 at 10:59 +0800, Qu Wenruo wrote:
>>
>> Since your metadata is already full, you may need to delete enough
>> data to free up some metadata space.
>>
>> The candidates include small files (mostly inlined files) and large
>> files with checksums.
>
> On that fs, there are rather many large files (800 MB - 1.5 GB).
>
> Is there any way to get (much?) more space reserved for metadata in
> the future, respectively on the other existing filesystems that
> haven't deadlocked themselves yet?

In fact, this is not really a deadlock; only balance is blocked by the
problem.

For other regular operations, you either get ENOSPC just like on all
other fses which run out of space, or they succeed without problem.

Furthermore, balance in this case is not really the preferred way to
free up space; really freeing up data is the correct way to go.

Thanks,
Qu

>
> Thanks,
> Chris.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07  3:29 ` Qu Wenruo
@ 2021-12-07  3:44   ` Christoph Anton Mitterer
  2021-12-07  4:56     ` Qu Wenruo
  2021-12-07  7:21     ` Zygo Blaxell
  0 siblings, 2 replies; 20+ messages in thread

From: Christoph Anton Mitterer @ 2021-12-07  3:44 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs

On Tue, 2021-12-07 at 11:29 +0800, Qu Wenruo wrote:
> For other regular operations, you either get ENOSPC just like on all
> other fses which run out of space, or they succeed without problem.
>
> Furthermore, balance in this case is not really the preferred way to
> free up space; really freeing up data is the correct way to go.

Well, to be honest... that makes btrfs kinda broken for that
particular purpose.

The software which runs on the storage and provides the data to the
experiments does in fact make sure that the space isn't fully used
(per default, it leaves a gap of 4 GB).

While this gap is configurable, it seems a bit odd if one had to set
it to ~1 TB per fs... just to make sure that btrfs doesn't run out of
space for metadata.

And btrfs *does* show that plenty of space is left (always around
700-800 GB)... so the application thinks it can happily continue to
write, while in fact writes fail (and it cannot even start anymore, as
it fails to create lock files).

My understanding was that when not using --mixed, btrfs has block
groups for data and metadata.

And it seems here that the data block groups have several 100 GB still
free, while - AFAIU you - the metadata block groups are already full.

I also wouldn't want to regularly balance (which doesn't really seem
to help that much so far)... because it puts quite some IO load on the
systems.

So if csum data needs so much space... why can't it simply reserve
e.g. 60 GB for metadata instead of just 17 GB?

If I really had to leave ~1 TB of storage unused (per 16 TB fs) just
to get that working... I would need to move stuff back to ext4,
because such a big loss we couldn't justify to our funding agencies.

And we haven't had that issue with e.g. ext4... that seems to reserve
just enough for metadata, so that we could basically fill up the fs
close to the end.

Cheers,
Chris.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07  3:44 ` Christoph Anton Mitterer
@ 2021-12-07  4:56   ` Qu Wenruo
  2021-12-07 14:30     ` Christoph Anton Mitterer
  1 sibling, 1 reply; 20+ messages in thread

From: Qu Wenruo @ 2021-12-07  4:56 UTC (permalink / raw)
To: Christoph Anton Mitterer, Qu Wenruo, linux-btrfs

On 2021/12/7 11:44, Christoph Anton Mitterer wrote:
> On Tue, 2021-12-07 at 11:29 +0800, Qu Wenruo wrote:
>> For other regular operations, you either get ENOSPC just like on
>> all other fses which run out of space, or they succeed without
>> problem.
>>
>> Furthermore, balance in this case is not really the preferred way
>> to free up space; really freeing up data is the correct way to go.
>
> Well, to be honest... that makes btrfs kinda broken for that
> particular purpose.
>
> The software which runs on the storage and provides the data to the
> experiments does in fact make sure that the space isn't fully used
> (per default, it leaves a gap of 4 GB).
>
> While this gap is configurable, it seems a bit odd if one had to set
> it to ~1 TB per fs... just to make sure that btrfs doesn't run out
> of space for metadata.
>
> And btrfs *does* show that plenty of space is left (always around
> 700-800 GB)... so the application thinks it can happily continue to
> write, while in fact writes fail (and it cannot even start anymore,
> as it fails to create lock files).

That's the problem with dynamic chunk allocation, and to be honest, I
don't have any better idea how to make it work just like traditional
fses.

You could consider it as something like a thin-provisioned device,
which has the same problem (reporting tons of free space, but hanging
if the underlying space is used up).

>
> My understanding was that when not using --mixed, btrfs has block
> groups for data and metadata.
>
> And it seems here that the data block groups have several 100 GB
> still free, while - AFAIU you - the metadata block groups are
> already full.
>
> I also wouldn't want to regularly balance (which doesn't really seem
> to help that much so far)... because it puts quite some IO load on
> the systems.
>
> So if csum data needs so much space... why can't it simply reserve
> e.g. 60 GB for metadata instead of just 17 GB?

Because all chunks are allocated on demand. You will hit this
particular problem if 1) your workload has a very unbalanced
data/metadata usage, like in this case (almost 1000:1), and 2) you run
out of space.

>
> If I really had to leave ~1 TB of storage unused (per 16 TB fs) just
> to get that working... I would need to move stuff back to ext4,
> because such a big loss we couldn't justify to our funding agencies.

It won't matter whether you reserve 1T for the data or not.

You can still hit the same problem even if there are tons of unused
data space: fragmented data space can cause the same problem.

>
> And we haven't had that issue with e.g. ext4... that seems to
> reserve just enough for metadata, so that we could basically fill up
> the fs close to the end.

Ext4/XFS have a similar, but much harder to hit, problem: inode
limits. They use pre-determined inode limits (set at mkfs time), thus
you can run out of inodes before the free space is used up.

Tools like "df" have ways to report such limits, but unfortunately for
btrfs there is no such way other than using btrfs-specific tools.

Thanks,
Qu

>
> Cheers,
> Chris.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07  4:56 ` Qu Wenruo
@ 2021-12-07 14:30   ` Christoph Anton Mitterer
  0 siblings, 0 replies; 20+ messages in thread

From: Christoph Anton Mitterer @ 2021-12-07 14:30 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs

On Tue, 2021-12-07 at 12:56 +0800, Qu Wenruo wrote:
> That's the problem with dynamic chunk allocation, and to be honest, I
> don't have any better idea how to make it work just like traditional
> fses.
>
> You could consider it as something like a thin-provisioned device,
> which has the same problem (reporting tons of free space, but hanging
> if the underlying space is used up).

Well, the first thing I don't understand is that my scenario seems
pretty... simple.

These filesystems have only few files (some 30k to perhaps 200k).
That seems far simpler than e.g. the fs of the system itself, where
one can have many files of completely varying size in /usr, /home, and
so on.

Also, these files (apart from some small meta-data files) are *always*
written once and then only read (or deleted). There is never any
random write access... so fragmentation should be far less than on
"normal" systems.

The total size of the fs is obviously known. You said that the likely
cause is the csum data... but isn't it then kinda clear from the
beginning how much one would need (at most) if the filesystem were
filled up with data?

Just for my understanding: how is csum data stored? Is it like one sum
per fixed block size of data? Or one sum per (not fixed) extent size
of data?

In both cases I'd have assumed that the maximum of space needed for
that is kinda predictable - unlike e.g. on a thin-provisioned device,
or when using many (rw) snapshots, where one cannot really predict how
much storage will be needed, because data is changed from the shared
copy.

> Because all chunks are allocated on demand. You will hit this
> particular problem if 1) your workload has a very unbalanced
> data/metadata usage, like in this case (almost 1000:1), and 2) you
> run out of space.

I've described the typical workload above: rather large files (the
data sets from the experiments), written once, never any further
writes to them, only deletions.

I'd have expected that this causes *far* less fragmentation than e.g.
filesystems that contain /home or so, where one has many random
writes.

> It won't matter whether you reserve 1T for the data or not.
>
> You can still hit the same problem even if there are tons of unused
> data space: fragmented data space can cause the same problem.

Let me try to understand this better:

btrfs allocates data block groups and meta-data block groups (both
dynamically), right? Are these always of the same size (like e.g.
always 1G)?

When I now write a 500M file... it would e.g. fill one such data block
group with 500M (and write some data into a metadata block group).
And when I next write a 2G file... it would write the first 500M into
the already allocated data block group, and then allocate more block
groups to write the remaining data.

Does that sound kinda right so far (simplified, of course)?

The problem I had now was that the fs filled up more and more, and
(due to fragmentation) all free space ended up inside data block
groups... but since no unallocated storage was left, it could not
allocate more metadata block groups.

So from the data PoV it could still write (i.e. the free space),
because all the fragmented data block groups still have some ~800 GiB
free... but it cannot write any more meta-data.

Still kinda right?

So my naive assumption(s) would have been:

1) It's a sign that it doesn't allocate meta-data block groups
   aggressively enough.

2) If I cure the fragmentation (in the data block groups), and btrfs
   could give those back... there would again be some unallocated
   space, which it could use for meta-data block groups... and so I
   could use more of the remaining 800 GB, right?

Would balance already do this?
I guess not, because AFAIU balance just re-writes block groups as-is,
right? So that's the reason why balancing didn't help in any way?

So the proper way would be btrfs filesystem defragment... thus
reclaiming some unallocated space and getting that for the meta-data.
Right?

But still... that seems like quite a lot of manual work (and thus
doesn't scale for a large data centre):

Would the defragmentation even work if the meta-data is already out of
space?

Why would it not help if btrfs (pre-)reserved more meta-data block
groups?
So maybe of the ~800 GB that are now still free (within data block
groups)... one would give e.g. 100 GB to meta-data... Of these 100
GB... 50 GB might never be used... but overall I could still use
~700 GB in data block groups - whereas now both are effectively lost
(the full ~800 GB).

Are there any manual ways to say, in e.g. our use case: don't just
allocate 17 GB per fs for meta-data... but allocate 80 GB right away?

And wouldn't that cure our problem... by simply helping to (likely)
never reach the out-of-metadata-space situation?

Thanks,
Chris.

^ permalink raw reply	[flat|nested] 20+ messages in thread
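The question "how is csum data stored, and isn't its maximum predictable?" has a concrete back-of-envelope answer: btrfs keeps one checksum per 4 KiB data block in the csum tree, 4 bytes each with the default crc32c (other hashes such as xxhash64 or sha256 use 8 or 32 bytes). The following sketch applies that to the 16 TiB filesystem in this thread; the function and constants are my own framing of that rule:

```python
def csum_bytes(data_bytes, csum_size=4, block=4096):
    """Worst-case csum-tree payload for `data_bytes` of checksummed data."""
    return data_bytes // block * csum_size

TiB, GiB = 1024**4, 1024**3

data = 16 * TiB
per_copy = csum_bytes(data)   # csum tree payload, one copy
on_disk = 2 * per_copy        # metadata is DUP on this filesystem

print(per_copy // GiB)  # 16 -> GiB of checksums for 16 TiB of data
print(on_disk // GiB)   # 32 -> close to the 34 GiB this fs has allocated
```

So the worst-case csum footprint really is roughly predictable from the device size (and it dominates metadata here); the catch is that btrfs only allocates metadata chunks on demand, not up front.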
* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07  3:44 ` Christoph Anton Mitterer
  2021-12-07  4:56   ` Qu Wenruo
@ 2021-12-07  7:21   ` Zygo Blaxell
  2021-12-07 12:31     ` Jorge Bastos
                        ` (2 more replies)
  1 sibling, 3 replies; 20+ messages in thread

From: Zygo Blaxell @ 2021-12-07  7:21 UTC (permalink / raw)
To: Christoph Anton Mitterer; +Cc: Qu Wenruo, linux-btrfs

On Tue, Dec 07, 2021 at 04:44:13AM +0100, Christoph Anton Mitterer wrote:
> On Tue, 2021-12-07 at 11:29 +0800, Qu Wenruo wrote:
> > For other regular operations, you either get ENOSPC just like on
> > all other fses which run out of space, or they succeed without
> > problem.
> >
> > Furthermore, balance in this case is not really the preferred way
> > to free up space; really freeing up data is the correct way to go.
>
> Well, to be honest... that makes btrfs kinda broken for that
> particular purpose.
>
> The software which runs on the storage and provides the data to the
> experiments does in fact make sure that the space isn't fully used
> (per default, it leaves a gap of 4 GB).
>
> While this gap is configurable, it seems a bit odd if one had to set
> it to ~1 TB per fs... just to make sure that btrfs doesn't run out
> of space for metadata.
>
> And btrfs *does* show that plenty of space is left (always around
> 700-800 GB)... so the application thinks it can happily continue to
> write, while in fact writes fail (and it cannot even start anymore,
> as it fails to create lock files).
>
> My understanding was that when not using --mixed, btrfs has block
> groups for data and metadata.
>
> And it seems here that the data block groups have several 100 GB
> still free, while - AFAIU you - the metadata block groups are
> already full.
>
> I also wouldn't want to regularly balance (which doesn't really seem
> to help that much so far)... because it puts quite some IO load on
> the systems.
If you minimally balance data (so that you keep 2GB unallocated at all
times) then it works much better: you can allocate the last metadata
chunk that you need to expand, and it requires only a few minutes of
IO per day. After a while you don't need to do this any more, as a
large buffer of allocated but unused metadata will form.

If you need a drastic intervention, you can mount with metadata_ratio=1
for a short(!) time to allocate a lot of extra metadata block groups.
Combine it with a data block group balance for a few blocks (e.g.
-dlimit=9).

You need about (3 + number_of_disks) GB of allocated but unused
metadata block groups to handle the worst case (balance, scrub, and
discard all active at the same time, plus the required free metadata
space). Also leave room for existing metadata to expand by about 50%,
especially if you have snapshots.

Never balance metadata. Balancing metadata will erase existing
metadata allocations, leading directly to this situation.

Free space search time goes up as the filesystem fills up. The last 1%
of the filesystem will fill up significantly slower than the other
99%; you might need to reserve 3% of the filesystem to keep latencies
down (ironically, about the same amount that ext4 reserves).

There are some patches floating around to address these issues.

> So if csum data needs so much space... why can't it simply reserve
> e.g. 60 GB for metadata instead of just 17 GB?

It normally does. Are you:

	- running metadata balances? (Stop immediately.)

	- preallocating large files? Checksums are allocated later,
	and naive usage of prealloc burns metadata space due to
	fragmentation.

	- modifying snapshots? Metadata size increases with each
	modified snapshot.

	- replacing large files with a lot of very small ones? Files
	below 2K are stored in metadata. max_inline=0 disables this.

> If I really had to leave ~1 TB of storage unused (per 16 TB fs) just
> to get that working... I would need to move stuff back to ext4,
> because such a big loss we couldn't justify to our funding agencies.
>
> And we haven't had that issue with e.g. ext4... that seems to
> reserve just enough for metadata, so that we could basically fill up
> the fs close to the end.
>
> Cheers,
> Chris.

^ permalink raw reply	[flat|nested] 20+ messages in thread
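Zygo's sizing rule of thumb — (3 + number_of_disks) GB of allocated-but-unused metadata, plus roughly 50% growth room over current metadata usage — can be sketched as a small function. The function name and the way the two terms are combined are my own framing of his advice, not an official formula:

```python
GiB = 1024**3

def recommended_metadata_headroom(num_disks, metadata_used_bytes,
                                  growth_factor=0.5):
    # Worst case: balance, scrub, and discard all active at once,
    # plus the required free metadata space.
    worst_case_ops = (3 + num_disks) * GiB
    # Room for existing metadata to grow (more with snapshots).
    growth = metadata_used_bytes * growth_factor
    return worst_case_ops + growth

# Single-device 16 TiB fs from this thread, ~16.5 GiB metadata used:
need = recommended_metadata_headroom(1, int(16.5 * GiB))
print(need / GiB)  # 12.25 -> GiB of metadata headroom to keep allocated
```

By this estimate the filesystem in the thread would want roughly 12 GiB of spare metadata allocation, far more than the ~0.5 GiB it actually had left.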
* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07  7:21 ` Zygo Blaxell
@ 2021-12-07 12:31   ` Jorge Bastos
  2021-12-07 15:07   ` Christoph Anton Mitterer
  2021-12-07 15:10   ` Jorge Bastos
  2 siblings, 0 replies; 20+ messages in thread

From: Jorge Bastos @ 2021-12-07 12:31 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: Christoph Anton Mitterer, Qu Wenruo, Btrfs BTRFS

This looks to me like the issue I reported before:

https://lore.kernel.org/linux-btrfs/CAHzMYBSap30NbnPnv4ka+fDA2nYGHfjYvD-NgT04t4vvN4q2sw@mail.gmail.com/

Data,single: Size:15.97TiB, Used:15.16TiB (94.94%)

When this happens to me, I can see that the data usage ratio is lower
than normal, there are mostly large files, and you can balance as much
as you like and the data ratio stays unchanged; the unallocated space
gets to zero much sooner because of that. Most times there's no issue
and the data usage ratio is much higher, e.g., this filesystem could
be filled up until less than 4 GB was available:

Data,RAID0: Size:10.89TiB, Used:10.89TiB (99.97%)

This one could only be filled up to about 300 GB available:

Data,RAID0: Size:10.89TiB, Used:10.59TiB (97.26%)

Both contain only large 100 GiB files, and both filesystems were
filled from new in exactly the same way, one file at a time, no
snapshots, no modifications after the initial data copy.

Regards,
Jorge Bastos

On Tue, Dec 7, 2021 at 9:45 AM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Tue, Dec 07, 2021 at 04:44:13AM +0100, Christoph Anton Mitterer wrote:
> > Well, to be honest... that makes btrfs kinda broken for that
> > particular purpose.
> >
> > The software which runs on the storage and provides the data to the
> > experiments does in fact make sure that the space isn't fully used
> > (per default, it leaves a gap of 4 GB).
> >
> > While this gap is configurable, it seems a bit odd if one had to
> > set it to ~1 TB per fs... just to make sure that btrfs doesn't run
> > out of space for metadata.
> >
> > And btrfs *does* show that plenty of space is left (always around
> > 700-800 GB)... so the application thinks it can happily continue
> > to write, while in fact writes fail (and it cannot even start
> > anymore, as it fails to create lock files).
> >
> > My understanding was that when not using --mixed, btrfs has block
> > groups for data and metadata.
> >
> > And it seems here that the data block groups have several 100 GB
> > still free, while - AFAIU you - the metadata block groups are
> > already full.
> >
> > I also wouldn't want to regularly balance (which doesn't really
> > seem to help that much so far)... because it puts quite some IO
> > load on the systems.
>
> If you minimally balance data (so that you keep 2GB unallocated at
> all times) then it works much better: you can allocate the last
> metadata chunk that you need to expand, and it requires only a few
> minutes of IO per day. After a while you don't need to do this any
> more, as a large buffer of allocated but unused metadata will form.
>
> If you need a drastic intervention, you can mount with
> metadata_ratio=1 for a short(!) time to allocate a lot of extra
> metadata block groups. Combine it with a data block group balance
> for a few blocks (e.g. -dlimit=9).
>
> You need about (3 + number_of_disks) GB of allocated but unused
> metadata block groups to handle the worst case (balance, scrub, and
> discard all active at the same time, plus the required free metadata
> space). Also leave room for existing metadata to expand by about
> 50%, especially if you have snapshots.
>
> Never balance metadata. Balancing metadata will erase existing
> metadata allocations, leading directly to this situation.
>
> Free space search time goes up as the filesystem fills up. The last
> 1% of the filesystem will fill up significantly slower than the
> other 99%; you might need to reserve 3% of the filesystem to keep
> latencies down (ironically, about the same amount that ext4
> reserves).
>
> There are some patches floating around to address these issues.
>
> > So if csum data needs so much space... why can't it simply reserve
> > e.g. 60 GB for metadata instead of just 17 GB?
>
> It normally does. Are you:
>
> 	- running metadata balances? (Stop immediately.)
>
> 	- preallocating large files? Checksums are allocated later,
> 	and naive usage of prealloc burns metadata space due to
> 	fragmentation.
>
> 	- modifying snapshots? Metadata size increases with each
> 	modified snapshot.
>
> 	- replacing large files with a lot of very small ones? Files
> 	below 2K are stored in metadata. max_inline=0 disables this.
>
> > If I really had to leave ~1 TB of storage unused (per 16 TB fs)
> > just to get that working... I would need to move stuff back to
> > ext4, because such a big loss we couldn't justify to our funding
> > agencies.
> >
> > And we haven't had that issue with e.g. ext4... that seems to
> > reserve just enough for metadata, so that we could basically fill
> > up the fs close to the end.
> >
> > Cheers,
> > Chris.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: ENOSPC while df shows 826.93GiB free 2021-12-07 7:21 ` Zygo Blaxell 2021-12-07 12:31 ` Jorge Bastos @ 2021-12-07 15:07 ` Christoph Anton Mitterer 2021-12-07 18:14 ` Zygo Blaxell 2021-12-07 15:10 ` Jorge Bastos 2 siblings, 1 reply; 20+ messages in thread From: Christoph Anton Mitterer @ 2021-12-07 15:07 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Qu Wenruo, linux-btrfs On Tue, 2021-12-07 at 02:21 -0500, Zygo Blaxell wrote: > If you minimally balance data (so that you keep 2GB unallocated at > all > times) then it works much better: you can allocate the last metadata > chunk that you need to expand, and it requires only a few minutes of > IO > per day. After a while you don't need to do this any more, as a > large > buffer of allocated but unused metadata will form. Hm I've already asked Qu in the other mail just before, whether/why balancing would help there at all. Doesn't it just re-write the block groups (but not defragment them...) would that (and why) help to gain back unallocated space (which could then be allocated for meta-data)? And what exactly do you mean with "minimally"? I mean of course I can use -dusage=20 or so... is it that? But I guess all that wouldn't help now, when the unallocated space is already used up, right? > If you need a drastic intervention, you can mount with > metadata_ratio=1 > for a short(!) time to allocate a lot of extra metadata block groups. > Combine with a data block group balance for a few blocks (e.g. - > dlimit=9). All that seems rather impractical do to, to be honest. At least for an non-expert admin. First, these systems are production systems... so one doesn't want to unmount (and do this procedure) when one sees that unallocated space runs out. One would rather want some way that if one sees: unallocated space gets low -> allocate so and so much for meta data I guess there are no real/official tools out there for such surveillance? Like Nagios/Icinga checks, that look at the unallocated space? 
> You need about (3 + number_of_disks) GB of allocated but unused > metadata > block groups to handle the worst case (balance, scrub, and discard > all > active at the same time, plus the required free metadata space). > Also > leave room for existing metadata to expand by about 50%, especially > if > you have snapshots. > Never balance metadata. Balancing metadata will erase existing > metadata > allocations, leading directly to this situation. Wouldn't that only unallocated such allocations, that are completely empty? > > So if csum data needs so much space... why can't it simply reserve > > e.g. 60 GB for metadata instead of just 17 GB? > > It normally does. Are you: > > - running metadata balances? (Stop immediately.) Nope, I did once accidentally (-musage=0 ... copy&pasted the wrong one) but only *after* the filesystem got stuck... > - preallocating large files? Checksums are allocated later, > and > naive usage of prealloc burns metadata space due to > fragmentation. Hmm... not so sure about that... (I mean I don't know what the storage middleware, which is www.dcache.org, does)... but it would probably do this only for 1 to few such large files at once, if at all. > - modifying snapshots? Metadata size increases with each > modified snapshot. No snapshots are used at all on these filesystems. > - replacing large files with a lot of very small ones? Files > below 2K are stored in metadata. max_inline=0 disables this. I guess you mean here: First many large files were written... unallocated space is used up (with data and meta-data block groups). Then, large files are deleted... data block groups get fragmented (but not unallocated acagain, because they're not empty. Then loads of small files would be written (inline)... which then fails as meta-data space would fill up even faster, right? Well we do have filesystems, where there may be *many* small files.. but I guess still all around the range of 1MB or more. I don't think we have lots of files below 2K.. 
if at all. So I don't think that we have this IO pattern. It rather seems simply as if btrfs doesn't reserve meta-data aggressively enough (at least not in our case)... and that too much is allocated for data... and when that is actually filled, it can no longer allocate enough for metadata. Thanks, Chris.
* Re: ENOSPC while df shows 826.93GiB free 2021-12-07 15:07 ` Christoph Anton Mitterer @ 2021-12-07 18:14 ` Zygo Blaxell 2021-12-16 23:16 ` Christoph Anton Mitterer 0 siblings, 1 reply; 20+ messages in thread From: Zygo Blaxell @ 2021-12-07 18:14 UTC (permalink / raw) To: Christoph Anton Mitterer; +Cc: Qu Wenruo, linux-btrfs On Tue, Dec 07, 2021 at 04:07:32PM +0100, Christoph Anton Mitterer wrote: > On Tue, 2021-12-07 at 02:21 -0500, Zygo Blaxell wrote: > > If you minimally balance data (so that you keep 2GB unallocated at > > all > > times) then it works much better: you can allocate the last metadata > > chunk that you need to expand, and it requires only a few minutes of > > IO > > per day. After a while you don't need to do this any more, as a > > large > > buffer of allocated but unused metadata will form. > > Hm I've already asked Qu in the other mail just before, whether/why > balancing would help there at all. > > Doesn't it just re-write the block groups (but not defragment them...) > would that (and why) help to gain back unallocated space (which could > then be allocated for meta-data)? It coalesces the free space in each block group into big contiguous regions, eventually growing them to regions over 1GB in size. Usually this gives back unallocated space. If balance can't pack the extents in 1GB units without changing their sizes or crossing a block group boundary, then balance might not be able to free any block groups this way, so this tends to fail when the filesystem is over about 97% full. It's important to run the minimal data balances _before_ this happens, as it's too late to allocate metadata after. > And what exactly do you mean with "minimally"? I mean of course I can > use -dusage=20 or so... is it that? Minimal balance is exactly one data block group, i.e. btrfs balance start -dlimit=1 /fs Run it when unallocated space gets low. 
The exact threshold is low enough that the time between new data block group allocations is less than the balance time. Usage filter is OK for one-off interventions, but repeated use eventually leads to a filesystem full of block groups that are filled to the threshold in the usage filter, and no unallocated space. > But I guess all that wouldn't help now, when the unallocated space is > already used up, right? If you have many GB of free space in the block groups, then usually one can be freed up. After that, it's a straightforward slot-puzzle, packing data into the unallocated space. If the free space is too fragmented or the extents are too large, then it will not be possible to recover without adding disk space or deleting data. > > If you need a drastic intervention, you can mount with > > metadata_ratio=1 > > for a short(!) time to allocate a lot of extra metadata block groups. > > Combine with a data block group balance for a few blocks (e.g. - > > dlimit=9). > > All that seems rather impractical do to, to be honest. At least for an > non-expert admin. > > First, these systems are production systems... so one doesn't want to > unmount (and do this procedure) when one sees that unallocated space > runs out. I think remount suffices, but I haven't checked. The mount option is checked at block allocation time in the code, so it should be possible to change it live. It has to be run for a short time because metadata_ratio=1 means 1:1 metadata to data allocation. You only want to do this to rescue a filesystem that has become stuck with too little metadata. Once the required amount of metadata is allocated, remove the metadata_ratio option and do minimal data balancing going forward. > One would rather want some way that if one sees: unallocated space gets > low -> allocate so and so much for meta data You can set metadata_ratio=30, which will allocate (100 / 30) = ~3% of the space for metadata, if you are starting with an empty filesystem. 
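For illustration, the daily minimal-balance job could be wrapped roughly like this (untested sketch; the 10GiB threshold is an arbitrary assumption, as is the parsing of `btrfs filesystem usage -b` output):

```python
import re
import subprocess

# Hypothetical daily-cron wrapper around the "minimal balance" described
# above: balance exactly one data block group, but only when unallocated
# space has dropped below a threshold. The threshold is a guess.

MIN_UNALLOCATED = 10 * 1024**3  # run a minimal balance below 10 GiB

def needs_minimal_balance(unallocated_bytes, threshold=MIN_UNALLOCATED):
    """Pure decision function: balance only when unallocated space is low."""
    return unallocated_bytes < threshold

def unallocated_bytes(path):
    """Assumes -b prints a raw-byte 'Device unallocated' line."""
    out = subprocess.run(["btrfs", "filesystem", "usage", "-b", path],
                         capture_output=True, text=True, check=True).stdout
    return int(re.search(r"Device unallocated:\s+(\d+)", out).group(1))

def daily_maintenance(path):
    if needs_minimal_balance(unallocated_bytes(path)):
        # Exactly one data block group, as described above.
        subprocess.run(["btrfs", "balance", "start", "-dlimit=1", path],
                       check=True)
```

Run `daily_maintenance()` from cron; on most days it does nothing at all.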
> I guess there are no real/official tools out there for such > surveillance? Like Nagios/Icinga checks, that look at the unallocated > space? TBH it's never been a problem--but I run the minimal data balance daily, and scrub every month, and never balance metadata, and have snapshots and dedupe. Between these they trigger all the necessary metadata allocations. > > You need about (3 + number_of_disks) GB of allocated but unused > > metadata > > block groups to handle the worst case (balance, scrub, and discard > > all > > active at the same time, plus the required free metadata space). > > Also > > leave room for existing metadata to expand by about 50%, especially > > if > > you have snapshots. > > > > > Never balance metadata. Balancing metadata will erase existing > > metadata > > allocations, leading directly to this situation. > > Wouldn't that only unallocated such allocations, that are completely > empty? It will repack existing metadata into existing metadata block groups, which _creates_ empty block groups (i.e. it removes all the data from existing groups), then it removes the empty groups. That's the opposite of what you want: you want extra unused space to be kept in the metadata block groups, so that metadata can expand without having to compete with data for new block group allocations. > > > So if csum data needs so much space... why can't it simply reserve > > > e.g. 60 GB for metadata instead of just 17 GB? > > > > It normally does. Are you: > > > > - running metadata balances? (Stop immediately.) > > Nope, I did once accidentally (-musage=0 ... copy&pasted the wrong one) > but only *after* the filesystem got stuck... That can only do one of two things: have no effect, or make it worse. > > - preallocating large files? Checksums are allocated later, > > and > > naive usage of prealloc burns metadata space due to > > fragmentation. > > Hmm... not so sure about that... 
(I mean I don't know what the storage > middleware, which is www.dcache.org, does)... but it would probably do > this only for 1 to few such large files at once, if at all. > > > > - modifying snapshots? Metadata size increases with each > > modified snapshot. > > No snapshots are used at all on these filesystems. > > > > - replacing large files with a lot of very small ones? Files > > below 2K are stored in metadata. max_inline=0 disables this. > > I guess you mean here: > First many large files were written... unallocated space is used up > (with data and meta-data block groups). > Then, large files are deleted... data block groups get fragmented (but > not unallocated acagain, because they're not empty. > > Then loads of small files would be written (inline)... which then fails > as meta-data space would fill up even faster, right? Correct. > Well we do have filesystems, where there may be *many* small files.. > but I guess still all around the range of 1MB or more. I don't think we > have lots of files below 2K.. if at all. In theory if the average file size decreases drastically it can change the amount of metadata required and maybe require an increase in metadata ratio after the metadata has been allocated. Another case happens when you suddenly start using a lot of reflinks when the filesystem is already completely allocated. It is possible to contrive cases where metadata usage approaches 100% of the filesystem, so there's no such thing as allocating "enough" metadata space for all use cases. > So I don't think that we have this IO pattern. > > It rather seems simply as if btrfs wouldn't reserve meta-data > aggressively enough (at least not in our case)... and that to much is > allocated for data.. and when that is actually filled, it cannot > allocate anymore enough for metadata. That's possible (and there are patches attempting to address it). 
We don't want to be too aggressive, or the disk fills up with unused metadata allocations...but we need to be about 5 block groups more aggressive than we are now to handle special cases like "mount and write until full without doing any backups or maintenance." A couple more suggestions (more like exploitable side-effects): - Run regular scrubs. If a write occurs to a block group while it's being scrubbed, there's an extra metadata block group allocation. - Mount with -o ssd. This makes metadata allocation more aggressive (though it also requires more metadata allocation, so like metadata_ratio, it might be worth turning off after the filesystem fills up). > > > Thanks, > Chris. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: ENOSPC while df shows 826.93GiB free 2021-12-07 18:14 ` Zygo Blaxell @ 2021-12-16 23:16 ` Christoph Anton Mitterer 2021-12-17 2:00 ` Qu Wenruo 2021-12-17 5:53 ` Zygo Blaxell 0 siblings, 2 replies; 20+ messages in thread From: Christoph Anton Mitterer @ 2021-12-16 23:16 UTC (permalink / raw) To: Zygo Blaxell; +Cc: Qu Wenruo, linux-btrfs On Tue, 2021-12-07 at 13:14 -0500, Zygo Blaxell wrote: > It coalesces the free space in each block group into big contiguous > regions, eventually growing them to regions over 1GB in size. > Usually > this gives back unallocated space. Ah, I see... an yes that worked. Not sure if I missed anything, but I think this should be somehow explained in the btrfs-balance(8). I mean there *is* the section "MAKING BLOCK GROUP LAYOUT MORE COMPACT", but that also kinda misses the point that this can be used to get unallocated space back, doesn't it? Is there some way to see a distribution of the space usage of block groups? Like some print out that shows me: - there are n block groups - xx = 100% - xx > 90% - xx > 80% ... - xx = 0% ? That would also give some better idea on how worth it is to balance, and which options to use. > If balance can't pack the extents in 1GB units without changing their > sizes or crossing a block group boundary, then balance might not be > able to free any block groups this way, so this tends to fail when > the > filesystem is over about 97% full. So that's basically the point when one can only move data away... do the balance and move it back afterwards. Which btw. worked quite nicely. (so thanks to all involved people for the help with that). > Minimal balance is exactly one data block group, i.e. > > btrfs balance start -dlimit=1 /fs > > Run it when unallocated space gets low. The exact threshold is low > enough that the time between new data block group allocations is less > than the balance time. 
What the sysadmin of large storage farms needs is something that one can run basically always (so even if unallocated space is NOT low), which kinda works out of the box and automatically (run via cron?) and doesn't impact the IO too much. Or one would need some daemon, which monitors unallocated space and kicks in if necessary. Does it make sense to use -dusage=xx in addition to -dlimit? I mean, if space is already tight... would just -dlimit=1 try to find a block group that it can balance (because its usage is low enough)... or might it just fail when the first tried one is nearly full (and not enough space is left for that in other block groups)? > It has to be run for a short time because metadata_ratio=1 means 1:1 > metadata to data allocation. You only want to do this to rescue a > filesystem that has become stuck with too little metadata. Once the > required amount of metadata is allocated, remove the metadata_ratio > option and do minimal data balancing going forward. But that's also something rather only suitable for "rescuing"... one wouldn't want to do that in big storage systems on hundreds of filesystems, just to make sure that btrfs doesn't run into that situation in the first place. For that it would be much nicer if one had other means to tell btrfs to allocate more for metadata... like either a command to reserve xx GB, that one can run when one sees that space gets tight... or by some better logic by which btrfs does that automatically. > You can set metadata_ratio=30, which will allocate (100 / 30) = ~3% > of the space for metadata, if you are starting with an empty > filesystem. Okay, that sounds more like a way... > TBH it's never been a problem--but I run the minimal data balance > daily, > and scrub every month, and never balance metadata, and have snapshots > and dedupe. Between these they trigger all the necessary metadata > allocations. I'm also still not really sure why this happened here. 
I've asked the developers of our storage middleware software in the meantime, and it seems in fact that dCache does pre-allocate the space of files that it wants to write. But even then, shouldn't btrfs be able to know how much it will generally need for csum metadata? I can only think of IO patterns where one would end up with too aggressive meta-data allocation (e.g. when writing lots of directories or XATTRS) and where not enough data block groups are left. But the other way round? If one writes very small files (so that they are inlined) -> meta-data should grow. If one writes non-inlined files, regardless of whether small or big... shouldn't it always be clear how much space could be needed for csum meta-data, when a new block group is allocated for data and if that would be fully written? > In theory if the average file size decreases drastically it can > change > the amount of metadata required and maybe require an increase in > metadata ratio after the metadata has been allocated. I cannot totally rule this out, but it's pretty unlikely. > Another case happens when you suddenly start using a lot of reflinks > when the filesystem is already completely allocated. That I can rule out, we didn't make any snapshots or ref-copies. > That's possible (and there are patches attempting to address it). > We don't want to be too aggressive, or the disk fills up with unused > metadata allocations...but we need to be about 5 block groups more > aggressive than we are now to handle special cases like "mount and > write until full without doing any backups or maintenance." Wouldn't a "simple" (at least in my mind ;-) ) solution be, that: - if the case arises, that either data or meta-data block groups are full - and no unallocated space is left - and if the other kind of block groups has plenty of free space left (say in total something like > 10 times the size of a block group... 
or maybe more (depending on the total filesystem size), because one probably doesn't want to shuffle loads of data around just for the last 0.005% to be squeezed out.) then: - btrfs automatically does the balance? Or maybe something "better" that also works when it would need to break up extents? If there are cases where one doesn't like that automatic shuffling, one could make it opt-in via some mount option. > A couple more suggestions (more like exploitable side-effects): > > - Run regular scrubs. If a write occurs to a block group > while it's being scrubbed, there's an extra metadata block > group allocation. But writes during scrubs would only happen when it finds corrupted blocks? Thanks, Chris.
* Re: ENOSPC while df shows 826.93GiB free 2021-12-16 23:16 ` Christoph Anton Mitterer @ 2021-12-17 2:00 ` Qu Wenruo 2021-12-17 3:10 ` Christoph Anton Mitterer 2021-12-17 5:53 ` Zygo Blaxell 1 sibling, 1 reply; 20+ messages in thread From: Qu Wenruo @ 2021-12-17 2:00 UTC (permalink / raw) To: Christoph Anton Mitterer, Zygo Blaxell; +Cc: linux-btrfs [...] >> That's possible (and there are patches attempting to address it). >> We don't want to be too aggressive, or the disk fills up with unused >> metadata allocations...but we need to be about 5 block groups more >> aggressive than we are now to handle special cases like "mount and >> write until full without doing any backups or maintenance." > > Wouldn't a "simple" (at least in my mind ;-) ) solution be, that: > - if the case arises, that either data or meta-data block groups are > full > - and not unallocated space is left > - and if the other kind of block groups has plenty of free space left > (say in total something like > 10 times the size of a block group... > or maybe more (depending on the total filesystem size), cause one > probably doesn't want to shuffle loads of data around, just for the > last 0.005% to be squeezed out.) > then: > - btrfs automatically does the balance? > Or maybe something "better" that also works when it would need to > break up extents? Or, let's change how we output our vanilla `df` command output, by taking metadata free space and unallocated space into consideration, like: - If there is plenty of unallocated space: keep the current output. - If no more unallocated space can be utilized: then take metadata free space into consideration. E.g. if there is only 1G of free metadata space but several terabytes of free data space, we only report free metadata space * some ratio as free data space. And if, by some magic calculation, we determine that even a balance won't free up any space, we return available space as 0 directly. 
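In rough pseudo-code, the idea would be something like the following (the metadata-to-data ratio here is only a placeholder, not a concrete proposal; the real logic would live in the kernel's statfs path):

```python
# Sketch of the statfs/df heuristic described above. The expansion ratio
# and the "balance won't help" predicate are placeholder assumptions.

META_TO_DATA_RATIO = 200  # assume ~1 unit of metadata describes ~200 of data

def estimated_avail(unallocated, data_free, meta_free, balance_can_help=True):
    if unallocated > 0:
        # Plenty of unallocated space: keep the current behaviour.
        return data_free + unallocated
    if not balance_can_help:
        # Even a balance would not free up space: report 0 directly.
        return 0
    # No unallocated space left: cap the reported free data space by how
    # much data the remaining free metadata could plausibly describe.
    return min(data_free, meta_free * META_TO_DATA_RATIO)
```

With 1G of free metadata and terabytes of free data space, this reports only a few hundred GB available, shrinking toward 0 as metadata fills up.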
By this, we under-report the amount of available space, although users may (and for most cases, they indeed can) write way more space than the reported available space, we have done our best to show end users that they need to take care of the fs. Either by deleting unused data, or do proper maintenance before reported available space reaches 0. By this, your existing space reservation tool will work way better than your current situation, and you have enough early warning before reaching the current situation. But I doubt if this would greatly drop the disk utilization, as we will become too cautious on reporting available space. Thanks, Qu > > If there are cases where one doesn't like that automatic shuffling, one > could make it opt-in via some mount option. > > >> A couple more suggestions (more like exploitable side-effects): >> >> - Run regular scrubs. If a write occurs to a block group >> while it's being scrubbed, there's an extra metadata block >> group allocation. > > But writes during scrubs would only happen when it finds and corrupted > blocks? > > > Thanks, > Chris. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: ENOSPC while df shows 826.93GiB free 2021-12-17 2:00 ` Qu Wenruo @ 2021-12-17 3:10 ` Christoph Anton Mitterer 0 siblings, 0 replies; 20+ messages in thread From: Christoph Anton Mitterer @ 2021-12-17 3:10 UTC (permalink / raw) To: Qu Wenruo, Zygo Blaxell; +Cc: linux-btrfs On Fri, 2021-12-17 at 10:00 +0800, Qu Wenruo wrote: > Or, let's change how we output our vanilla `df` command output, by > taking metadata free space and unallocated space into consideration, > like: Actually I was thinking about this before as well, but that would rather just remedy the consequences of that particular ENOSPC situation and not prevent it. > - If there is no more unallocated space can be utilized > Then take metadata free space into consideration, like if there is > only 1G free metadata space, while several tera free data space, > we only report free metadata space * some ratio as free data > space. Not sure whether this is so good... because then the shown free space is completely made up... it could be like that value if the remaining unallocated space and the remaining meta-data space are eaten up as "anticipated"... but it could also be much more or much less (depending on what actually happens), right? What I'd rather do is: *If* btrfs realises that there's still free space in the data block groups... but really nothing at all (that isn't reserved for special operations) is left in the meta-data block groups AND nothing more could be allocated... then suddenly drop the shown free space to exactly 0. Because from a classic program's point of view, that's the case: it cannot add any further files (not even empty ones). This would also allow programs like dCache to better deal with that situation. What dCache does is laid out here: https://github.com/dCache/dcache/issues/5352#issuecomment-989793555 Perhaps some background... dCache is a distributed storage system, so it runs on multiple nodes managing files placed in many filesystems (on so-called pools). 
Clients first connect via some protocol to a "door node", from which they are (at least if the respective protocol supports it) redirected to a pool where dCache thinks the file could be written to (in the write case, obviously). dCache decides that by knowing all its pools and monitoring their (filesystems') free space. It also has a configurable gap value (defaulting to 4GB), which it will try to leave free on a pool. If the file is expected to fit in (I think it again depends on the protocol whether it really knows in advance how much the client will write) while still observing the gap... plus several more load-balancing metrics... a pool may be selected and the client redirected. Seems to me like a fairly reasonable process. So as things are currently with btrfs, when that particular situation arises that I've had now (plenty of free space in data block groups, but zero in meta-data block groups plus zero unallocated space), then dCache cannot really deal properly with that: - df (respectively the usual syscalls) will show it that much more space is available than what the gap would help against - the client tries to write to the pool, there's immediately ENOSPC and the transfer is properly aborted with some failure - but dCache cannot really tell whether the situation is still there or not... so it will run into broken write transfers over and over - typically also, once a client is redirected to a pool, there is no going back and retrying the same on another one (at least not automatically from within the protocol)... so the failure is really "permanent", unless the client itself tries again and then (by chance) lands on another pool where the btrfs is still good If df, respectively the syscalls, returned 0 free space in that situation, we'd still have ~800 GB lost (without manual intervention)... but at least the middleware should be able to deal with that. 
> By this, we under-report the amount of available space, although > users > may (and for most cases, they indeed can) write way more space than > the > reported available space, we have done our best to show end users > that > they need to take care of the fs. > Either by deleting unused data, or do proper maintenance before > reported > available space reaches 0. Well, but at least when the problem has happened, then - without any further intervention - no further writes (of new files respectively new data) will be possible... so the "under-reporting" is only true if one assumes that this intervention will happen. If it does, like by some maintenance "minimal" balance as Zygo suggested, then the whole situation should anyway not happen, AFAIU. And if it's by some intervention after the ENOSPC, then the "under-reporting" would also go away as soon as the problem was fixed (manually). But what do you think about my idea of btrfs automatically solving the situation by doing a balance on its own, once the problem has arisen? One could also think of something like the following: Add some 2nd-level global reserve, which is much bigger than the current one... at least enough that one could manually balance the fs (or btrfs does that automatically if it decides it needs to). If the problem of this mail thread occurs, it could be used to more easily solve it without the need to move data somewhere else (which may not always be feasible), because it would be reserved to be used e.g. for such a balance. One could make it dependent on the size of the fs. If the fs has e.g. 1TB, then reserving e.g. 4GB is barely noticeable. And if the fs should be too small, one simply doesn't have the 2nd-level global reserve. If(!) the fs runs full in a proper way (i.e. no more unallocated space, and meta-data and data block groups are equally full), then btrfs could decide to release that 2nd-level global reserve back to be used, to squeeze out as much space as possible without losing too much. 
Once it's really full, it's full and not much new could happen anyway... and the normal global reserve would still be there for the *very* important things. If files should later on be deleted, btrfs could decide to try to re-establish the 2nd-level global reserve... again to be "reserved" until the fs is again really, really full and it would be just wasted space. Cheers, Chris.
* Re: ENOSPC while df shows 826.93GiB free 2021-12-16 23:16 ` Christoph Anton Mitterer 2021-12-17 2:00 ` Qu Wenruo @ 2021-12-17 5:53 ` Zygo Blaxell 1 sibling, 0 replies; 20+ messages in thread From: Zygo Blaxell @ 2021-12-17 5:53 UTC (permalink / raw) To: Christoph Anton Mitterer; +Cc: Qu Wenruo, linux-btrfs On Fri, Dec 17, 2021 at 12:16:21AM +0100, Christoph Anton Mitterer wrote: > On Tue, 2021-12-07 at 13:14 -0500, Zygo Blaxell wrote: [snip] > Is there some way to see a distribution of the space usage of block > groups? > Like some print out that shows me: > - there are n block groups > - xx = 100% > - xx > 90% > - xx > 80% > ... > - xx = 0% > ? > > That would also give some better idea on how worth it is to balance, > and which options to use. Python-btrfs lets you access btrfs data structures from python scripts. There might even be an existing example for this. > > Minimal balance is exactly one data block group, i.e. > > > > btrfs balance start -dlimit=1 /fs > > > > Run it when unallocated space gets low. The exact threshold is low > > enough that the time between new data block group allocations is less > > than the balance time. > > What the sysadmin of large storage farms needs is something that one > can run basically always (so even if unallocated space is NOT low), > which kinda works out of the box and automatically (run via cron?) and > doesn't impact the IO too much. > Or one would need some daemon, which monitors unallocated space and > kicks in if necessary. That's the theory, and it's what packages like btrfsmaintenance try to do. The practice is...more complicated. > Does it make sense to use -dusage=xx in addition to -dlimit? > I mean if space is already tight... would just -dlimit=1 try to find a > block group that it can balance (because it's usage is low enough)... > or might it just fail when the first tried one is nearly fully (and not > enough space is left for that in other block groups)? 
The best strategy I've found so far is to choose block groups entirely at random, because: * the benefit is fixed: after a successful block group balance, you will have 1GB of unallocated space on all disks in the block group. In that sense it doesn't matter which block groups you balance, only the number that you balance. If you pick a full block group, btrfs will pack the data into emptier block groups. If you pick an empty block group, btrfs will pack the data into other empty block groups, or create a new empty block group and just shuffle the data around. * the cost of computing the cost of relocating a block group is proportional to doing the work of relocating the block group. The data movement for 1GB takes 12 seconds on modern spinning drives and 1 second or less on NVMe. The other 60-seconds-to-an-hour of relocating a block group is updating all the data references, and the parent nodes that reference them, recursively. If you had some clever caching and precomputation scheme you could maybe choose a good block group to balance in less time than it takes to balance it, but if you predict wrong, you're stuck doing the extra work with no benefit. Also because this is a deterministic algorithm, you run into the next problem: * choosing block groups by a deterministic algorithm (e.g. number of free bytes, percentage of free space, fullest/emptiest device, largest vaddr, smallest vaddr) eventually runs into adverse selection, and gets stuck on a block group that doesn't fit into the available free space, but it's always the "next" block group according to the selecting algorithm, so it can make no further progress. Choosing a completely random block group (from the target devices where unallocated space is required) may or may not succeed, but it's a cheap algorithm to run and it's very good at avoiding adverse selection. 
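A sketch of the random selection (hypothetical; it assumes the caller already has the (vaddr, length) list of data block groups, e.g. obtained via python-btrfs, and uses balance's vrange filter to target exactly one group):

```python
import random
import subprocess

# Sketch of "pick a block group entirely at random". The (vaddr, length)
# list is assumed to come from elsewhere (e.g. python-btrfs); the vrange
# balance filter then restricts the balance to just that one group.

def pick_random_vrange(block_groups):
    """block_groups: list of (vaddr, length) tuples for data block groups."""
    vaddr, length = random.choice(block_groups)
    return f"-dvrange={vaddr}..{vaddr + length}"

def balance_one_random(path, block_groups):
    arg = pick_random_vrange(block_groups)
    # May fail with ENOSPC if this group's data doesn't fit elsewhere;
    # with random selection you simply retry with another group.
    subprocess.run(["btrfs", "balance", "start", arg, path], check=True)
```

Because each attempt is independent and cheap to set up, a failed attempt carries no adverse-selection risk: the next run almost certainly picks a different group.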
> > TBH it's never been a problem--but I run the minimal data balance > > daily, > > and scrub every month, and never balance metadata, and have snapshots > > and dedupe. Between these they trigger all the necessary metadata > > allocations. > > I'm also still not really sure why this happened here. > > I've asked the developers of our storage middleware software in the > meantime, and it seems in fact that dCache does pre-allocate the space > of files that it wants to write. > > But even then, shouldn't btrfs be able to know how much it will > generally need for csum metadata? It varies a lot. Checksum items have variable overheads as they are packed into pages. There is some heuristic based on a constant ratio but maybe it's a little too low. It does seem to be prone to rounding error, as I've seen a lot of users presenting filesystems that have exactly 1GB too little metadata allocated. > I can only think of IO patterns where one would end up with too > aggressive meta-data allocation (e.g. when writing lots of directories > or XATTRS) and where not enough data block groups are left. > > But the other way round? > If one writes very small files (so that they are inlined) -> meta-data > should grow. > > If one writes non-inlined files, regardless of whether small or big... > shouldn't it always be clear how much space could be needed for csum > meta-data, when a new block group is allocated for data and if that > would be fully written? It's not even clear how much space is needed for the data. Extents are immutable, so if you overwrite part of a large extent, you will need more space for the new data even though the old data is no longer reachable through any file. Checksums can vary in density from 779 (if there are a lot of holes in files) to 4090 blocks per metadata page (if they're all contiguous). That's a 5:1 size ratio between the extremes. > > That's possible (and there are patches attempting to address it). 
> > We don't want to be too aggressive, or the disk fills up with unused > > metadata allocations...but we need to be about 5 block groups more > > aggressive than we are now to handle special cases like "mount and > > write until full without doing any backups or maintenance." > > Wouldn't a "simple" (at least in my mind ;-) ) solution be, that: > - if the case arises, that either data or meta-data block groups are > full > - and not unallocated space is left > - and if the other kind of block groups has plenty of free space left > (say in total something like > 10 times the size of a block group... > or maybe more (depending on the total filesystem size), cause one > probably doesn't want to shuffle loads of data around, just for the > last 0.005% to be squeezed out.) > then: > - btrfs automatically does the balance? > Or maybe something "better" that also works when it would need to > break up extents? The problem is in the definitions of things like "plenty" and "not a lot", and expectations like "last 0.005%." We all know balancing automatically solves the problem, but all the algorithms we use to trigger it are wrong in some edge case. Balance is a big and complex thing that operates on big filesystem allocation objects, too big to run automatically at the moment a critical failure is detected. The challenge is to predict the future well enough to know when to run balance to avoid it. In these early days, everybody seems to be rolling their own solutions and discovering surprising implications of their choices. Also there are much simpler solutions, like "put all the metadata on SSD", where the administrator picks the metadata size and btrfs works (or doesn't work) with it. Rewriting the extent tree is also on the table, though people have recently worked on that (extent tree v2) and the ability to change allocated extent lengths after the fact was dropped from the proposal. 
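As a back-of-envelope illustration of the csum density range mentioned earlier (779 to 4090 data blocks described per 16KiB metadata page), one can bracket the csum leaf space needed for a given amount of data. This is a rough estimate only; real metadata also holds the extent, fs, and other trees:

```python
import math

# Bracket csum leaf space using the density range quoted above: one 16KiB
# metadata page describes between ~779 (worst case, many holes) and ~4090
# (best case, fully contiguous) 4KiB data blocks.

PAGE = 16 * 1024
BLOCK = 4 * 1024

def csum_metadata_bytes(data_bytes, blocks_per_page):
    data_blocks = math.ceil(data_bytes / BLOCK)
    pages = math.ceil(data_blocks / blocks_per_page)
    return pages * PAGE

def csum_estimate_range(data_bytes):
    """(best_case, worst_case) csum leaf space for the given data size."""
    return (csum_metadata_bytes(data_bytes, 4090),
            csum_metadata_bytes(data_bytes, 779))
```

For the ~15.16TiB of data on the filesystem in this thread, this works out to roughly 15GiB best case and around 80GiB worst case, which at least brackets the ~16.5GiB of metadata actually in use there.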
> If there are cases where one doesn't like that automatic shuffling, one
> could make it opt-in via some mount option.

In theory a garbage collection tool can be written today to manage
this, but it's only a theory until somebody writes it.  It's possible
to break up extents by running a combination of defrag and dedupe over
them using existing userspace interfaces.  Once such a tool exists, the
kernel interfaces could be improved for performance.

That tool would essentially be data balance in userspace, so the kernel
data balance would no longer be needed.  It's not clear that this would
be able to perform any better than the current data balance scheme,
though, except for being slightly more flexible on extremely full
filesystems.

> > A couple more suggestions (more like exploitable side-effects):
> >
> > - Run regular scrubs.  If a write occurs to a block group
> >   while it's being scrubbed, there's an extra metadata block
> >   group allocation.

> But writes during scrubs would only happen when it finds any corrupted
> blocks?

Each block group is made read-only while it is scrubbed, to prevent
modification while scrub verifies it.  If some process wants to modify
data on the filesystem during the scrub, it must allocate its new data
in some block group that is not being scrubbed.  If all the existing
block groups are either full or read-only, then a new block group must
be allocated.  If this is not possible, the writing process will hit
ENOSPC.

In other words, scrub effectively decreases free space while it runs by
locking some of it away temporarily, and this forces btrfs to allocate
a little more space for data and metadata.  This is one of the many
triggers for btrfs to require and allocate another GB of metadata,
apparently at random.  It's never random, but there are a lot of
different triggering conditions in the implementation.
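The read-only-during-scrub effect can be sketched as a toy allocator. This is a cartoon of the behaviour just described, not actual btrfs code:

```python
# Cartoon of why scrub forces extra block-group allocation: a write
# cannot land in a block group that is full or currently being scrubbed.
def place_write(block_groups, scrubbing_index):
    """Return block groups after placing one unit-sized write,
    allocating a new block group if every existing one is full
    or locked read-only by scrub."""
    for i, free in enumerate(block_groups):
        if free > 0 and i != scrubbing_index:
            block_groups[i] -= 1
            return block_groups
    # All candidates full or locked by scrub -> allocate a new group
    # (capacity 10; the write consumes 1 unit, leaving 9 free).
    return block_groups + [9]

# One group has free space, but scrub has it read-only, so the write
# triggers a brand-new block group:
print(place_write([0, 0, 3], scrubbing_index=2))  # -> [0, 0, 3, 9]
```

Once the scrub moves on, the leftover group is the "spare" that later absorbs a metadata burst.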
Only a few spare block groups are usually needed, so people running
scrub regularly work around the metadata problem without knowing
they're working around a problem.

> Thanks,
> Chris.
* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07  7:21 ` Zygo Blaxell
  2021-12-07 12:31   ` Jorge Bastos
  2021-12-07 15:07   ` Christoph Anton Mitterer
@ 2021-12-07 15:10   ` Jorge Bastos
  2021-12-07 15:22     ` Christoph Anton Mitterer
  2 siblings, 1 reply; 20+ messages in thread
From: Jorge Bastos @ 2021-12-07 15:10 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: Christoph Anton Mitterer, Qu Wenruo, Btrfs BTRFS

Hi,

Disregard my last email; it is the same metadata-exhaustion issue.  I
just didn't understand why two identical filesystems, used in the same
way on the same server, behaved so differently.  If I convert metadata
from raid1 to single, it leaves some extra metadata chunks, and I was
then able to fill up the data chunks.  Of course, now I can't balance
metadata back to raid1.  I wish there were an easy way to allocate an
extra metadata chunk while there's still available space.

Regards,
Jorge Bastos
* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07 15:10 ` Jorge Bastos
@ 2021-12-07 15:22   ` Christoph Anton Mitterer
  2021-12-07 16:11     ` Jorge Bastos
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Anton Mitterer @ 2021-12-07 15:22 UTC (permalink / raw)
To: Jorge Bastos, Zygo Blaxell; +Cc: Qu Wenruo, Btrfs BTRFS

Hey Jorge.

I've looked at your old mail thread... and the first case you showed:

btrfs fi usage /mnt/disk4
Overall:
    Device size:                   7.28TiB
    Device allocated:              7.28TiB
    Device unallocated:            1.04MiB
    Device missing:                  0.00B
    Used:                          7.24TiB
    Free (estimated):             34.55GiB      (min: 34.55GiB)
    Free (statfs, df):            34.55GiB
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)
    Multiple profiles:                  no

Data,single: Size:7.26TiB, Used:7.22TiB (99.54%)
   /dev/md4        7.26TiB

Metadata,DUP: Size:9.50GiB, Used:8.45GiB (88.93%)
   /dev/md4       19.00GiB

System,DUP: Size:32.00MiB, Used:800.00KiB (2.44%)
   /dev/md4       64.00MiB

Unallocated:
   /dev/md4        1.04MiB

seems similar to my problem... but far less extreme... so that I
personally would say I could live with that.

Of data you "lose" 0.04 TiB... so ~40 GiB; of metadata you "lose"
9.50 - 8.45 = 1.05 GiB.

It's a bit strange IMO that you then get ENOSPC when your metadata
still has ~1 GiB free (I thought it would reserve less?).

But still, out of 7.28 TiB that's ~0.551%... not sooo much.

I in contrast have:
829.44 + 0.5 = 829.94 GiB "lost"

which, out of 16.00 TiB, is some ~5.066% lost... which seems pretty
much.

Cheers,
Chris.
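The arithmetic in both messages is the same: add the free-but-unreachable space left inside the data and metadata block groups, then divide by the device size. A quick sketch (the function name is made up; the metadata free-space figures are Size minus Used from the `btrfs fi usage` outputs):

```python
# Stranded free space once the device is fully allocated: free room
# inside data block groups plus free room inside metadata block
# groups, as a percentage of the device size.
def stranded_pct(data_free_gib: float, meta_free_gib: float,
                 device_tib: float) -> float:
    return (data_free_gib + meta_free_gib) / (device_tib * 1024) * 100

# This thread's 16 TiB filesystem: ~826.93 + ~2.5 GiB data, 0.5 GiB metadata
print(round(stranded_pct(829.44, 0.50, 16.00), 3))  # -> 5.066
# Jorge's disk4: ~40 GiB data, 9.50 - 8.45 = 1.05 GiB metadata
print(round(stranded_pct(40.00, 1.05, 7.28), 3))    # -> 0.551
```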
* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07 15:22 ` Christoph Anton Mitterer
@ 2021-12-07 16:11   ` Jorge Bastos
  0 siblings, 0 replies; 20+ messages in thread
From: Jorge Bastos @ 2021-12-07 16:11 UTC (permalink / raw)
To: Christoph Anton Mitterer; +Cc: Zygo Blaxell, Qu Wenruo, Btrfs BTRFS

On Tue, Dec 7, 2021 at 3:22 PM Christoph Anton Mitterer
<calestyo@scientia.org> wrote:
>
> Hey Jorge.
>
> I've looked at your old mail thread... and the first case you've
> showed:
> btrfs fi usage /mnt/disk4

Hi,

Thanks for the reply.  That one was fine; it was the "good" example.
The next one, /mnt/disk3, was the problem, but as mentioned, I figured
out it's the metadata-exhaustion issue.

Regards,
Jorge Bastos
* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07  2:29 ENOSPC while df shows 826.93GiB free Christoph Anton Mitterer
  2021-12-07  2:59 ` Qu Wenruo
@ 2021-12-07 15:39 ` Phillip Susi
  2021-12-16  3:47   ` Christoph Anton Mitterer
  1 sibling, 1 reply; 20+ messages in thread
From: Phillip Susi @ 2021-12-07 15:39 UTC (permalink / raw)
To: Christoph Anton Mitterer; +Cc: linux-btrfs

Christoph Anton Mitterer <calestyo@scientia.org> writes:

> yet:
> # /srv/dcache/pools/2/foo
> -bash: /srv/dcache/pools/2/foo: No such file or directory

I'm not sure what you intended this to do or show.  It looks like you
tried to execute a program named /srv/dcache/pools/2/foo, and there is
no such program.  That doesn't say anything about the filesystem.

> balancing also fails, e.g.:
> # btrfs balance start -dusage=50 /srv/dcache/pools/2
> ERROR: error during balancing '/srv/dcache/pools/2': No space left on device

Balance is basically like a defrag.  You have less than 0.01% space
free, which is not enough to do a defrag.  Either free up some space,
or don't bother trying to defrag.
* Re: ENOSPC while df shows 826.93GiB free
  2021-12-07 15:39 ` Phillip Susi
@ 2021-12-16  3:47   ` Christoph Anton Mitterer
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Anton Mitterer @ 2021-12-16 3:47 UTC (permalink / raw)
To: Phillip Susi; +Cc: linux-btrfs

On Tue, 2021-12-07 at 10:39 -0500, Phillip Susi wrote:
> I'm not sure what you intended this to do or show.  It looks like you
> tried to execute a program named /srv/dcache/pools/2/foo, and there is
> no such program.  That doesn't say anything about the filesystem.

Ooops, sorry... that was some wrong copy&pasting.  What I actually did
was merely a

# touch /srv/dcache/pools/2/foo

which already gave me the ENOSPC (for the reasons that have now already
been explained... i.e. no (usable) free space in the meta-data and no
unallocated space either).

Thanks,
Chris.
end of thread, other threads:[~2021-12-17  5:53 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-07  2:29 ENOSPC while df shows 826.93GiB free Christoph Anton Mitterer
2021-12-07  2:59 ` Qu Wenruo
2021-12-07  3:06   ` Christoph Anton Mitterer
2021-12-07  3:29     ` Qu Wenruo
2021-12-07  3:44       ` Christoph Anton Mitterer
2021-12-07  4:56         ` Qu Wenruo
2021-12-07 14:30           ` Christoph Anton Mitterer
2021-12-07  7:21 ` Zygo Blaxell
2021-12-07 12:31   ` Jorge Bastos
2021-12-07 15:07   ` Christoph Anton Mitterer
2021-12-07 18:14     ` Zygo Blaxell
2021-12-16 23:16       ` Christoph Anton Mitterer
2021-12-17  2:00         ` Qu Wenruo
2021-12-17  3:10           ` Christoph Anton Mitterer
2021-12-17  5:53         ` Zygo Blaxell
2021-12-07 15:10 ` Jorge Bastos
2021-12-07 15:22   ` Christoph Anton Mitterer
2021-12-07 16:11     ` Jorge Bastos
2021-12-07 15:39 ` Phillip Susi
2021-12-16  3:47   ` Christoph Anton Mitterer