On Wed, Jan 29, 2020 at 05:01:47PM +0100, David Sterba wrote: > On Fri, Jan 17, 2020 at 10:16:45PM +0800, Qu Wenruo wrote: > > >> But this behavior itself is not accurate. > > >> > > >> We have global reservation, which is normally always larger than the > > >> immediate number 4M. > > > > > > The global block reserve is subtracted from the metadata accounted from > > > the block groups. And after that, if there's only little space left, the > > > check triggers. Because at this point any new metadata reservation > > > cannot be satisfied from the remaining space, yet there's >0 reported. > > > > OK, then we need to do over-commit calculation here, and do the 4M > > calculation. > > > > The quick solution I can think of would go back to Josef's solution by > > exporting can_overcommit() to do the calculation. > > > > > > But my biggest problem is, do we really need to do all these hassle? > > As far as I know we did not ask for it, it's caused by limitations of > the public interfaces (statfs). We don't use half of the existing public interface (the f_files side). Conflating data with metadata isn't helpful: they need very different remedial actions when you run out of each. This is true even on other filesystems, though other filesystems don't have remedial actions as extremely different as "deleting files" and "btrfs data balance." > > My argument is, other fs like ext4/xfs still has their inode number > > limits, and they don't report 0 avail when that get exhausted. > > (Although statfs() has such report mechanism for them though). > > Maybe that's also the perception of what "space" is for users. The data > and metadata distinction is an implementation detail. So if 'df' tells > me there's space I should be able to create new files, right? Not necessarily. Someone else might allocate data between statfs check and the write. The new file's name may overflow existing directory blocks and trigger a data block allocation (on filesystems that have directory data blocks). The filesystem might be out of inodes, if it's a filesystem with static inode allocation. Also recall that 'df' has the '-i' option, so df does give you two numbers for space, and it has two opportunities to tell you that you don't have any. Seeing "72KB data blocks free" is not a guarantee that you can write exactly 72KB of data. It's an estimate that you can write _about_ 72KB of data. Some difference between the estimated free space and the actual number of data blocks that can be written is expected, especially when the filesystem is nearly full or has multiple agents writing to it (and then there's compression, snapshots, dedupe, unreachable overwritten extent blocks...). We expect certainty when comparing numbers that are very different. The closer you get to the reported number, the less certain we can be about whether an ENOSPC will occur before you write that amount of data. Examples: If you have 72K free in df, you _definitely_ can't install that package that requires 860MB of space, but you probably can write just one more 4K block to a log. Maybe you can install a 60KB package but not a 68KB one. Users (and the robots they configure) can usually cope with derating the numbers a little. The recent problem in 5.4+ kernels is that this difference is now often far too large--thousands of gigabytes, because btrfs flips from counting "all the free data blocks in existing and future data block groups" to "zero" at arbitrary times. This is *much worse* than the nearly-full case, where the correct and reported (zero) numbers are within a few GB of each other. The large errors in statfs reporting and large changes in reported value result in large consequences--automated deletion of data is the most serious one, where the bigger the error, the more data is unnecessarily deleted. The case where a filesystem runs out of unallocated space, has several thousand GB of unused space in data block groups, and runs out of metadata space is...tricky, because you correctly hit ENOSPC with thousands of GB of free space on the drive, but it's not in a form btrfs can use because it's all in data block groups. The usual automated responses triggered on statfs() reports of low data space don't work here. Data balances are required to free up space for metadata. 'unlink' or 'btrfs sub delete' usually don't help in these cases, or help extremely inefficiently (i.e. the kernel won't free data block groups until they're entirely empty, so it's not sufficient to delete a few files--you have to delete almost _all_ of the files). 'unlink' can even make the problem worse (duplicating shared metadata pages, consuming even more metadata space). The arbitrary reporting of "0 available data blocks" in df isn't helpful in this case because the ENOSPC has nothing at all to do with data blocks. One could say "well if we're out of metadata space and have free data blocks, the correct thing to do is block writers and start balancing, so that a free block is a free block." There are obvious problems with that, but it would make sense to consider reducing data blocks and metadata free space to a single number in df only *after* they were made transparently fungible. > more data, but still looking at the same number of free space. > > For ext2 or ext3 it should be easier to see if it's possible to create > new inodes, the values of 'df -i' are likely accurate because of the > mkfs-time preallocation. > Newer features on ext4 and xfs allow dynamic creation of inodes, you can > find guesswork regarding the f_files estimates. > > I vaguely remember that it's possible to get ENOSPC on ext4 by > exhausting metadata yet statfs will still tell you there's free space. Yes, if you run out of f_favail on a traditional Unix filesystem, you can't make any more inodes, but you can still write all the data blocks to existing files you want (until f_bavail runs out). Other filesystems do _not_ set f_bavail to 0 when f_favail happens to be 0 too. Also, on a traditional Unix filesystem, you can overwrite, truncate (shrink, maybe not expand) or delete any files you want, no matter what df says. On btrfs those things are not all always possible, and sometimes have a negative effect on free space. > This is confusing too. statfs on btrfs will tell you that you have free blocks (f_bavail), but not free files. Many users have never heard of f_favail because some filesystems (including btrfs) don't implement it, and on the others f_favail changes very slowly and rarely hits zero for typical workloads when compared to f_bavail. That said, f_favail is an ideal way to report to users that they are running out of some scarce resource that isn't data blocks. Thankfully btrfs only has two kinds of scarce resource (for now), so we can still use the statvfs structure. On btrfs we could overload f_f{avail,free,iles} to count metadata blocks (metadata_allocated - metadata_used + (all_unallocated / raid_profile_parameters) - metadata_reserved). That would provide a number that proportionally decreases to zero as all options for allocating new metadata blocks are exhausted. When ENOSPC occurs, either f_favail or f_bavail will end up being zero[1], no questionable hacks required. Automated low space responders will be able to apply the correct proportional response (delete files/snapshots or balance data for low data and metadata space, respectively) by looking at which value is zero. TBH I think the code change content of Qu's patch was fine as-is: it mostly reverts ca8a51b3a9 "btrfs: statfs: report zero available if metadata are exhausted", and I think that should be reverted regardless of the other issues, simply and only because it conflates data blocks with metadata. It's reporting values in the wrong columns, so confusion is (and bugs are) inevitable. [1] If f_{b,f}avail ends up being below 0, make it 0. Nothing is more confusing or has less predictable effects than reports of negative free disk space.