Date: Mon, 8 Dec 2014 01:43:17 -0500
From: Zygo Blaxell
To: Shriramana Sharma, i@hungrycats.org
Cc: linux-btrfs
Subject: Re: Why is the actual disk usage of btrfs considered unknowable?
Message-ID: <20141208064315.GA22023@hungrycats.org>

On Sun, Dec 07, 2014 at 08:45:59PM +0530, Shriramana Sharma wrote:
> IIUC:
>
> 1) btrfs fi df already shows the alloc-ed space and the space used out
> of that.
>
> 2) Despite snapshots, CoW and compression, the tree knows how many
> extents of data and metadata there are, and how many bytes on disk
> these occupy, no matter what the total (uncompressed, "unsnapshotted")
> size of all the directories and files on the disk is.
>
> So this means that btrfs fi df actually shows the real on-disk usage.
> In this case, why do we hear people saying it's not possible to know
> the actual on-disk usage and when a btrfs-formatted disk (or
> partition) will run out of space?

"On-disk usage" is easy--that's about the past, and it can be measured
straightforwardly with a single count of bytes. "When a btrfs filesystem
will return ENOSPC" is much more complicated--that's about the future,
and it depends heavily on the filesystem's current structure and on the
modifications about to be made to it.

There were some pretty terrible btrfs bugs and warts that were fixed
only in the last five months or so. Since some of them had been around
for a year or more, they gave btrfs a reputation.

The 'df' command (statvfs(2)) used to report raw free space instead of
an estimate based on the current RAID profile. This confused some badly
designed programs that used statvfs to determine that N bytes of free
space were available, and were then surprised when N bytes were not all
available for their use. A btrfs filesystem using RAID1 reported double
the amount of space used and available (one count per disk): 2x1TB
disks 75% full showed up as 2TB capacity with 1.5TB used and 0.5TB
free. Now statvfs(2) computes more correct values (1TB capacity with
750GB used and 250GB free). A rough sketch of that arithmetic appears
below.

Some bugs would crash the btrfs cleaner (the thread that removes
deleted snapshots) or balance, and would cause the filesystem to report
ENOSPC prematurely when (in theory) hundreds of gigabytes were still
available. These were straight-up bugs that are now fixed.

Modifying the filesystem tree requires free metadata blocks into which
to write new CoW nodes for the modified metadata, so when you delete
something, disk usage goes up for a few seconds before it goes down (if
you have snapshots, the "down" part may be delayed until you delete the
snapshots). This can lead to surprising "No space left on device"
errors from commands like 'rm -rf lots_of_files'. The GlobalReserve
chunk type was introduced to set aside a few MB of space on the
filesystem to handle such cases.

Thankfully, everything above now seems to be fixed.
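For illustration, here is roughly the arithmetic a naive program does
with statvfs(2). This is a minimal sketch: the mount point is made up,
and the final "divide by two" line reflects the old RAID1 reporting
described above, not anything statvfs itself knows about.

/* free_space.c: the arithmetic a naive program does with statvfs(2).
 * The mount point is an assumption for illustration; the RAID1
 * halving reflects the old (pre-fix) reporting described above.
 */
#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
        struct statvfs st;

        if (statvfs("/mnt/btrfs", &st) != 0) {
                perror("statvfs");
                return 1;
        }

        /* Block counts are in units of f_frsize. */
        unsigned long long total = (unsigned long long)st.f_blocks * st.f_frsize;
        unsigned long long avail = (unsigned long long)st.f_bavail * st.f_frsize;

        printf("reported size:      %llu bytes\n", total);
        printf("reported available: %llu bytes\n", avail);

        /* With the old RAID1 reporting these were raw bytes summed over
         * both disks, so only roughly half of "available" could really
         * be written as new file data: */
        printf("actually writable under old RAID1 reporting (approx): %llu bytes\n",
               avail / 2);
        return 0;
}

With the newer reporting the first two numbers are already scaled for
the RAID profile, so no such correction is needed (or even possible--the
profile is not visible through statvfs at all).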
There is still an issue with heterogeneous chunk allocation. The 'df'
command and the statvfs(2) syscall report only a single quantity for
used and free space, while in btrfs there are two distinct data types
stored in two distinct container types--and, for maximum result
irreproducibility, the amount of space allocated to each type is
dynamic. Data (file contents) is allocated 1GB at a time; metadata
(directory structures, inodes, checksums) is allocated 256MB at a time;
and the two types are not interchangeable after allocation. This makes
free-space reporting inaccurate as the last few free GB are consumed:
256MB might abruptly disappear from free space if you happen to run out
of free metadata space and allocate a new metadata chunk instead of a
data chunk.

The last few KB of a file that does not fill a full 4K block can be
stored 'inline' (next to the inode in the metadata tree). If you are
low on space in data chunks, you might be able to write a large number
of small files whose contents fit inline in metadata, but not an
equivalently sized large file that needs data extent blocks. If you
have lots of free data space but not enough metadata space, you get the
opposite result (e.g. you can write new large files but not extend
small existing ones).

All of the above happens with RAID, compression and quotas turned
*off*. Turning them on makes space usage even harder to analyze (and
ENOSPC errors harder to predict) with a single-dimension "available
space" metric.
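To make the two-pool accounting concrete, here is a toy model--not real
btrfs code; the chunk sizes come from the paragraph above and the
starting numbers are invented for illustration. It shows how 256MB can
vanish from a single "free space" figure the moment a metadata chunk is
carved out of the unallocated pool, with no file data written at all.

/* chunk_model.c: toy model of btrfs-style two-pool allocation.
 * Not real btrfs code; the chunk sizes come from the text above and
 * the scenario numbers are invented for illustration.
 */
#include <stdio.h>

#define GB (1024ULL * 1024 * 1024)
#define MB (1024ULL * 1024)

#define DATA_CHUNK     (1 * GB)    /* data is allocated 1GB at a time       */
#define METADATA_CHUNK (256 * MB)  /* metadata is allocated 256MB at a time */

int main(void)
{
        unsigned long long unallocated = 2 * GB;   /* raw space not yet in any chunk          */
        unsigned long long data_free   = 300 * MB; /* free space inside existing data chunks  */
        unsigned long long meta_free   = 4 * MB;   /* free space inside existing metadata chunks */

        /* A single-number report lumps everything writable together: */
        printf("apparent free space: %llu MB\n", (unallocated + data_free) / MB);

        /* A metadata-heavy workload (many small files, snapshots, etc.)
         * exhausts meta_free and forces allocation of a new 256MB
         * metadata chunk out of the unallocated pool: */
        if (meta_free < 16 * MB && unallocated >= METADATA_CHUNK) {
                unallocated -= METADATA_CHUNK;
                meta_free   += METADATA_CHUNK;
        }

        /* 256MB has now vanished from the single-number view even though
         * no file data was written: */
        printf("apparent free space: %llu MB\n", (unallocated + data_free) / MB);
        return 0;
}

Whether the next chunk carved out of the unallocated pool becomes data
or metadata depends on the workload at that moment, which is why a
single number cannot tell you when ENOSPC will arrive.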