* Massive loss of disk space
@ 2017-08-01 11:43 pwm
  2017-08-01 12:20 ` Hugo Mills
  0 siblings, 1 reply; 26+ messages in thread
From: pwm @ 2017-08-01 11:43 UTC (permalink / raw)
  To: linux-btrfs

I have a 10TB file system with a parity file for a snapraid. However, I
suddenly cannot extend the parity file despite the file system only
being about 50% filled - I should have 5TB of unallocated space. When
trying to extend the parity file, fallocate() just returns ENOSPC, i.e.
that the disk is full.

The machine was originally Debian 8 (Jessie), but after I detected the
issue and no btrfs tool showed any errors, I updated to Debian 9
(Stretch) to get a newer kernel and newer btrfs tools.

pwm@europium:/mnt$ btrfs --version
btrfs-progs v4.7.3
pwm@europium:/mnt$ uname -a
Linux europium 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26) x86_64 GNU/Linux

pwm@europium:/mnt/snap_04$ ls -l
total 4932703608
-rw------- 1 root root     319148889 Jul  8 04:21 snapraid.content
-rw------- 1 root root     283115520 Aug  1 04:08 snapraid.content.tmp
-rw------- 1 root root 5050486226944 Jul 31 17:14 snapraid.parity

pwm@europium:/mnt/snap_04$ df .
Filesystem      1K-blocks       Used  Available Use% Mounted on
/dev/sdg1      9766434816 4944614648 4819831432  51% /mnt/snap_04

pwm@europium:/mnt/snap_04$ sudo btrfs fi show .
Label: 'snap_04'  uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
        Total devices 1 FS bytes used 4.60TiB
        devid    1 size 9.09TiB used 9.09TiB path /dev/sdg1

Compare this with the second snapraid parity disk:
pwm@europium:/mnt/snap_04$ sudo btrfs fi show /mnt/snap_05/
Label: 'snap_05'  uuid: bac477e3-e78c-43ee-8402-6bdfff194567
        Total devices 1 FS bytes used 4.69TiB
        devid    1 size 9.09TiB used 4.70TiB path /dev/sdi1

So on one parity disk, devid 1 shows 9.09TiB used - on the other only
4.70TiB, while both have almost the same amount of file system usage and
an almost identical usage pattern. It's an archival RAID, so there are
hardly any writes to the parity files because there are almost no file
changes to the data files. The main usage is that the parity file gets
extended when one of the data disks reaches a new high-water mark.

The only file that gets regularly rewritten is the snapraid.content
file, which gets regenerated after every scrub.

pwm@europium:/mnt/snap_04$ sudo btrfs fi df .
Data, single: total=9.08TiB, used=4.59TiB
System, DUP: total=8.00MiB, used=992.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=6.00GiB, used=4.81GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B

pwm@europium:/mnt/snap_04$ sudo btrfs filesystem du .
     Total   Exclusive  Set shared  Filename
   4.59TiB     4.59TiB           -  ./snapraid.parity
 304.37MiB   304.37MiB           -  ./snapraid.content
 270.00MiB   270.00MiB           -  ./snapraid.content.tmp
   4.59TiB     4.59TiB       0.00B  .

pwm@europium:/mnt/snap_04$ sudo btrfs filesystem usage .
Overall:
    Device size:                   9.09TiB
    Device allocated:              9.09TiB
    Device unallocated:              0.00B
    Device missing:                  0.00B
    Used:                          4.60TiB
    Free (estimated):              4.49TiB      (min: 4.49TiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,single: Size:9.08TiB, Used:4.59TiB
   /dev/sdg1       9.08TiB

Metadata,single: Size:8.00MiB, Used:0.00B
   /dev/sdg1       8.00MiB

Metadata,DUP: Size:6.00GiB, Used:4.81GiB
   /dev/sdg1      12.00GiB

System,single: Size:4.00MiB, Used:0.00B
   /dev/sdg1       4.00MiB

System,DUP: Size:8.00MiB, Used:992.00KiB
   /dev/sdg1      16.00MiB

Unallocated:
   /dev/sdg1         0.00B

pwm@europium:~$ sudo btrfs check /dev/sdg1
Checking filesystem on /dev/sdg1
UUID: c46df8fa-03db-4b32-8beb-5521d9931a31
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
found 5057294639104 bytes used err is 0
total csum bytes: 4529856120
total tree bytes: 5170151424
total fs tree bytes: 178700288
total extent tree bytes: 209616896
btree space waste bytes: 182357204
file data blocks allocated: 5073330888704
 referenced 5052040339456

pwm@europium:~$ sudo btrfs scrub status /mnt/snap_04/
scrub status for c46df8fa-03db-4b32-8beb-5521d9931a31
        scrub started at Mon Jul 31 21:26:50 2017 and finished after 06:53:47
        total bytes scrubbed: 4.60TiB with 0 errors

So where has my 5TB of disk space gone? And what should I do to get it
back?

I could obviously reformat the partition and rebuild the parity since I
still have one good parity, but that doesn't feel like a good route. It
isn't impossible this might happen again.

/Per W

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: Massive loss of disk space
  2017-08-01 11:43 Massive loss of disk space pwm
@ 2017-08-01 12:20 ` Hugo Mills
  2017-08-01 14:39   ` pwm
  0 siblings, 1 reply; 26+ messages in thread
From: Hugo Mills @ 2017-08-01 12:20 UTC (permalink / raw)
  To: pwm; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 5847 bytes --]

Hi, Per,

   Start here:

https://btrfs.wiki.kernel.org/index.php/FAQ#if_your_device_is_large_.28.3E16GiB.29

   In your case, I'd suggest using "-dusage=20" to start with, as
it'll probably free up quite a lot of your existing allocation.

   And this may also be of interest, in how to read the output of the
tools:

https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools

   Finally, I note that you've still got some "single" chunks present
for metadata. It won't affect your space allocation issues, but I
would recommend getting rid of them anyway:

# btrfs balance start -mconvert=dup,soft

   Hugo.

On Tue, Aug 01, 2017 at 01:43:23PM +0200, pwm wrote:
> I have a 10TB file system with a parity file for a snapraid.
> However, I can suddenly not extend the parity file despite the file
> system only being about 50% filled - I should have 5TB of
> unallocated space. When trying to extend the parity file,
> fallocate() just returns ENOSPC, i.e. that the disk is full.
>
> Machine was originally a Debian 8 (Jessie) but after I detected the
> issue and no btrfs tool did show any errors, I have updated to
> Debian 9 (Stretch) to get a newer kernel and newer btrfs tools.
>
> pwm@europium:/mnt$ btrfs --version
> btrfs-progs v4.7.3
> pwm@europium:/mnt$ uname -a
> Linux europium 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u2
> (2017-06-26) x86_64 GNU/Linux
>
> pwm@europium:/mnt/snap_04$ ls -l
> total 4932703608
> -rw------- 1 root root     319148889 Jul  8 04:21 snapraid.content
> -rw------- 1 root root     283115520 Aug  1 04:08 snapraid.content.tmp
> -rw------- 1 root root 5050486226944 Jul 31 17:14 snapraid.parity
>
> pwm@europium:/mnt/snap_04$ df .
> Filesystem      1K-blocks       Used  Available Use% Mounted on
> /dev/sdg1      9766434816 4944614648 4819831432  51% /mnt/snap_04
>
> pwm@europium:/mnt/snap_04$ sudo btrfs fi show .
> Label: 'snap_04'  uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>         Total devices 1 FS bytes used 4.60TiB
>         devid    1 size 9.09TiB used 9.09TiB path /dev/sdg1
>
> Compare this with the second snapraid parity disk:
> pwm@europium:/mnt/snap_04$ sudo btrfs fi show /mnt/snap_05/
> Label: 'snap_05'  uuid: bac477e3-e78c-43ee-8402-6bdfff194567
>         Total devices 1 FS bytes used 4.69TiB
>         devid    1 size 9.09TiB used 4.70TiB path /dev/sdi1
>
> So on one parity disk, devid is 9.09TiB used - on the other only 4.70TiB.
> While almost the same amount of file system usage. And almost
> identical usage pattern. It's an archival RAID, so there is hardly
> any writes to the parity files because there are almost no file
> changes to the data files. The main usage is that the parity file
> gets extended when one of the data disks reaches a new high water
> mark.
>
> The only file that gets regularly rewritten is the snapraid.content
> file that gets regenerated after every scrub.
>
> pwm@europium:/mnt/snap_04$ sudo btrfs fi df .
> Data, single: total=9.08TiB, used=4.59TiB
> System, DUP: total=8.00MiB, used=992.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, DUP: total=6.00GiB, used=4.81GiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> pwm@europium:/mnt/snap_04$ sudo btrfs filesystem du .
>      Total   Exclusive  Set shared  Filename
>    4.59TiB     4.59TiB           -  ./snapraid.parity
>  304.37MiB   304.37MiB           -  ./snapraid.content
>  270.00MiB   270.00MiB           -  ./snapraid.content.tmp
>    4.59TiB     4.59TiB       0.00B  .
>
> pwm@europium:/mnt/snap_04$ sudo btrfs filesystem usage .
> Overall:
>     Device size:                   9.09TiB
>     Device allocated:              9.09TiB
>     Device unallocated:              0.00B
>     Device missing:                  0.00B
>     Used:                          4.60TiB
>     Free (estimated):              4.49TiB      (min: 4.49TiB)
>     Data ratio:                       1.00
>     Metadata ratio:                   2.00
>     Global reserve:              512.00MiB      (used: 0.00B)
>
> Data,single: Size:9.08TiB, Used:4.59TiB
>    /dev/sdg1       9.08TiB
>
> Metadata,single: Size:8.00MiB, Used:0.00B
>    /dev/sdg1       8.00MiB
>
> Metadata,DUP: Size:6.00GiB, Used:4.81GiB
>    /dev/sdg1      12.00GiB
>
> System,single: Size:4.00MiB, Used:0.00B
>    /dev/sdg1       4.00MiB
>
> System,DUP: Size:8.00MiB, Used:992.00KiB
>    /dev/sdg1      16.00MiB
>
> Unallocated:
>    /dev/sdg1         0.00B
>
> pwm@europium:~$ sudo btrfs check /dev/sdg1
> Checking filesystem on /dev/sdg1
> UUID: c46df8fa-03db-4b32-8beb-5521d9931a31
> checking extents
> checking free space cache
> checking fs roots
> checking csums
> checking root refs
> found 5057294639104 bytes used err is 0
> total csum bytes: 4529856120
> total tree bytes: 5170151424
> total fs tree bytes: 178700288
> total extent tree bytes: 209616896
> btree space waste bytes: 182357204
> file data blocks allocated: 5073330888704
>  referenced 5052040339456
>
> pwm@europium:~$ sudo btrfs scrub status /mnt/snap_04/
> scrub status for c46df8fa-03db-4b32-8beb-5521d9931a31
>         scrub started at Mon Jul 31 21:26:50 2017 and finished after
> 06:53:47
>         total bytes scrubbed: 4.60TiB with 0 errors
>
> So where have my 5TB disk space gone lost?
> And what should I do to be able to get it back again?
>
> I could obviously reformat the partition and rebuild the parity
> since I still have one good parity, but that doesn't feel like a
> good route. It isn't impossible this might happen again.
>
> /Per W

-- 
Hugo Mills             | Well, sir, the floor is yours. But remember, the
hugo@... carfax.org.uk | roof is ours!
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                                          The Goons

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: Massive loss of disk space
  2017-08-01 12:20 ` Hugo Mills
@ 2017-08-01 14:39   ` pwm
  2017-08-01 14:47     ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 26+ messages in thread
From: pwm @ 2017-08-01 14:39 UTC (permalink / raw)
  To: Hugo Mills; +Cc: linux-btrfs

Thanks for the links and suggestions.

I did try your suggestions, but they didn't solve the underlying
problem.

pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04
Dumping filters: flags 0x1, state 0x0, force is off
  DATA (flags 0x2): balancing, usage=20
Done, had to relocate 4596 out of 9317 chunks

pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft /mnt/snap_04/
Done, had to relocate 2 out of 4721 chunks

pwm@europium:~$ sudo btrfs fi df /mnt/snap_04
Data, single: total=4.60TiB, used=4.59TiB
System, DUP: total=40.00MiB, used=512.00KiB
Metadata, DUP: total=6.50GiB, used=4.81GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

pwm@europium:~$ sudo btrfs fi show /mnt/snap_04
Label: 'snap_04'  uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
        Total devices 1 FS bytes used 4.60TiB
        devid    1 size 9.09TiB used 4.61TiB path /dev/sdg1

So now device 1 usage is down from 9.09TiB to 4.61TiB.

But if I try to fallocate() to grow the large parity file, it fails
immediately. I wrote a little helper program that just focuses on
fallocate(), instead of having to run snapraid with lots of unknown
additional actions being performed.

Original file size is 5050486226944 bytes
Trying to grow file to 5151751667712 bytes
Failed fallocate [No space left on device]

And the result afterwards shows 'used' has jumped back up to 9.09TiB.

root@europium:/mnt# btrfs fi show snap_04
Label: 'snap_04'  uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
        Total devices 1 FS bytes used 4.60TiB
        devid    1 size 9.09TiB used 9.09TiB path /dev/sdg1

root@europium:/mnt# btrfs fi df /mnt/snap_04/
Data, single: total=9.08TiB, used=4.59TiB
System, DUP: total=40.00MiB, used=992.00KiB
Metadata, DUP: total=6.50GiB, used=4.81GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

It's almost like the file system has decided that it needs to make a
snapshot and store two complete copies of the file, which is obviously
not going to work with a file larger than 50% of the file system.

There is no issue at all growing the parity file on the other parity
disk. And that's why I wonder if there is some undetected file system
corruption.

/Per W

On Tue, 1 Aug 2017, Hugo Mills wrote:

> Hi, Per,
>
>    Start here:
>
> https://btrfs.wiki.kernel.org/index.php/FAQ#if_your_device_is_large_.28.3E16GiB.29
>
>    In your case, I'd suggest using "-dusage=20" to start with, as
> it'll probably free up quite a lot of your existing allocation.
>
>    And this may also be of interest, in how to read the output of the
> tools:
>
> https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools
>
>    Finally, I note that you've still got some "single" chunks present
> for metadata. It won't affect your space allocation issues, but I
> would recommend getting rid of them anyway:
>
> # btrfs balance start -mconvert=dup,soft
>
>    Hugo.
>
> On Tue, Aug 01, 2017 at 01:43:23PM +0200, pwm wrote:
>> I have a 10TB file system with a parity file for a snapraid.
>> However, I can suddenly not extend the parity file despite the file
>> system only being about 50% filled - I should have 5TB of
>> unallocated space. When trying to extend the parity file,
>> fallocate() just returns ENOSPC, i.e. that the disk is full.
>>
>> Machine was originally a Debian 8 (Jessie) but after I detected the
>> issue and no btrfs tool did show any errors, I have updated to
>> Debian 9 (Stretch) to get a newer kernel and newer btrfs tools.
>>
>> [snipped: command output quoted in full in the original mail]
>>
>> So where have my 5TB disk space gone lost?
>> And what should I do to be able to get it back again?
>>
>> I could obviously reformat the partition and rebuild the parity
>> since I still have one good parity, but that doesn't feel like a
>> good route. It isn't impossible this might happen again.
>>
>> /Per W
>
> -- 
> Hugo Mills             | Well, sir, the floor is yours. But remember, the
> hugo@... carfax.org.uk | roof is ours!
> http://carfax.org.uk/  |
> PGP: E2AB1DE4          |                                          The Goons

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: Massive loss of disk space
  2017-08-01 14:39     ` pwm
@ 2017-08-01 14:47       ` Austin S. Hemmelgarn
  2017-08-01 15:00         ` Austin S. Hemmelgarn
  2017-08-02  4:14         ` Duncan
  0 siblings, 2 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-01 14:47 UTC (permalink / raw)
  To: pwm, Hugo Mills; +Cc: linux-btrfs

On 2017-08-01 10:39, pwm wrote:
> Thanks for the links and suggestions.
>
> I did try your suggestions but it didn't solve the underlying problem.
>
> pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04
> Dumping filters: flags 0x1, state 0x0, force is off
>   DATA (flags 0x2): balancing, usage=20
> Done, had to relocate 4596 out of 9317 chunks
>
> pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft /mnt/snap_04/
> Done, had to relocate 2 out of 4721 chunks
>
> pwm@europium:~$ sudo btrfs fi df /mnt/snap_04
> Data, single: total=4.60TiB, used=4.59TiB
> System, DUP: total=40.00MiB, used=512.00KiB
> Metadata, DUP: total=6.50GiB, used=4.81GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> pwm@europium:~$ sudo btrfs fi show /mnt/snap_04
> Label: 'snap_04'  uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>         Total devices 1 FS bytes used 4.60TiB
>         devid    1 size 9.09TiB used 4.61TiB path /dev/sdg1
>
> So now device 1 usage is down from 9.09TiB to 4.61TiB.
>
> But if I test to fallocate() to grow the large parity file, I directly
> fail. I wrote a little help program that just focuses on fallocate()
> instead of having to run snapraid with lots of unknown additional
> actions being performed.
>
> Original file size is 5050486226944 bytes
> Trying to grow file to 5151751667712 bytes
> Failed fallocate [No space left on device]
>
> And result after shows 'used' have jumped up to 9.09TiB again.
>
> root@europium:/mnt# btrfs fi show snap_04
> Label: 'snap_04'  uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>         Total devices 1 FS bytes used 4.60TiB
>         devid    1 size 9.09TiB used 9.09TiB path /dev/sdg1
>
> root@europium:/mnt# btrfs fi df /mnt/snap_04/
> Data, single: total=9.08TiB, used=4.59TiB
> System, DUP: total=40.00MiB, used=992.00KiB
> Metadata, DUP: total=6.50GiB, used=4.81GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> It's almost like the file system have decided that it needs to make a
> snapshot and store two complete copies of the complete file, which is
> obviously not going to work with a file larger than 50% of the file
> system.
I think I _might_ understand what's going on here.  Is that test program
calling fallocate using the desired total size of the file, or just
trying to allocate the range beyond the end to extend the file?  I've
seen issues with the first case on BTRFS before, and I'm starting to
think that it might actually be trying to allocate the exact amount of
space requested by fallocate, even if part of the range is already
allocated space.
>
> No issue at all to grow the parity file on the other parity disk. And
> that's why I wonder if there is some undetected file system corruption.
>
> /Per W
>
> On Tue, 1 Aug 2017, Hugo Mills wrote:
>
>> Hi, Per,
>>
>>    Start here:
>>
>> https://btrfs.wiki.kernel.org/index.php/FAQ#if_your_device_is_large_.28.3E16GiB.29
>>
>>    In your case, I'd suggest using "-dusage=20" to start with, as
>> it'll probably free up quite a lot of your existing allocation.
>>
>>    And this may also be of interest, in how to read the output of the
>> tools:
>>
>> https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools
>>
>>    Finally, I note that you've still got some "single" chunks present
>> for metadata. It won't affect your space allocation issues, but I
>> would recommend getting rid of them anyway:
>>
>> # btrfs balance start -mconvert=dup,soft
>>
>>    Hugo.
>>
>> On Tue, Aug 01, 2017 at 01:43:23PM +0200, pwm wrote:
>>> I have a 10TB file system with a parity file for a snapraid.
>>> However, I can suddenly not extend the parity file despite the file
>>> system only being about 50% filled - I should have 5TB of
>>> unallocated space. When trying to extend the parity file,
>>> fallocate() just returns ENOSPC, i.e. that the disk is full.
>>>
>>> Machine was originally a Debian 8 (Jessie) but after I detected the
>>> issue and no btrfs tool did show any errors, I have updated to
>>> Debian 9 (Stretch) to get a newer kernel and newer btrfs tools.
>>>
>>> [snipped: command output quoted in full in the original mail]
>>>
>>> So where have my 5TB disk space gone lost?
>>> And what should I do to be able to get it back again?
>>>
>>> I could obviously reformat the partition and rebuild the parity
>>> since I still have one good parity, but that doesn't feel like a
>>> good route. It isn't impossible this might happen again.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: Massive loss of disk space
  2017-08-01 14:47       ` Austin S. Hemmelgarn
@ 2017-08-01 15:00         ` Austin S. Hemmelgarn
  2017-08-01 15:24           ` pwm
  2017-08-02 17:52           ` Goffredo Baroncelli
  1 sibling, 2 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-01 15:00 UTC (permalink / raw)
  To: pwm, Hugo Mills; +Cc: linux-btrfs

On 2017-08-01 10:47, Austin S. Hemmelgarn wrote:
> On 2017-08-01 10:39, pwm wrote:
>> Thanks for the links and suggestions.
>>
>> I did try your suggestions but it didn't solve the underlying problem.
>>
>> pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04
>> Dumping filters: flags 0x1, state 0x0, force is off
>>   DATA (flags 0x2): balancing, usage=20
>> Done, had to relocate 4596 out of 9317 chunks
>>
>> pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft /mnt/snap_04/
>> Done, had to relocate 2 out of 4721 chunks
>>
>> pwm@europium:~$ sudo btrfs fi df /mnt/snap_04
>> Data, single: total=4.60TiB, used=4.59TiB
>> System, DUP: total=40.00MiB, used=512.00KiB
>> Metadata, DUP: total=6.50GiB, used=4.81GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> pwm@europium:~$ sudo btrfs fi show /mnt/snap_04
>> Label: 'snap_04'  uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>>         Total devices 1 FS bytes used 4.60TiB
>>         devid    1 size 9.09TiB used 4.61TiB path /dev/sdg1
>>
>> So now device 1 usage is down from 9.09TiB to 4.61TiB.
>>
>> But if I test to fallocate() to grow the large parity file, I directly
>> fail. I wrote a little help program that just focuses on fallocate()
>> instead of having to run snapraid with lots of unknown additional
>> actions being performed.
>>
>> Original file size is 5050486226944 bytes
>> Trying to grow file to 5151751667712 bytes
>> Failed fallocate [No space left on device]
>>
>> And result after shows 'used' have jumped up to 9.09TiB again.
>>
>> root@europium:/mnt# btrfs fi show snap_04
>> Label: 'snap_04'  uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>>         Total devices 1 FS bytes used 4.60TiB
>>         devid    1 size 9.09TiB used 9.09TiB path /dev/sdg1
>>
>> root@europium:/mnt# btrfs fi df /mnt/snap_04/
>> Data, single: total=9.08TiB, used=4.59TiB
>> System, DUP: total=40.00MiB, used=992.00KiB
>> Metadata, DUP: total=6.50GiB, used=4.81GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> It's almost like the file system have decided that it needs to make a
>> snapshot and store two complete copies of the complete file, which is
>> obviously not going to work with a file larger than 50% of the file
>> system.
> I think I _might_ understand what's going on here.  Is that test program
> calling fallocate using the desired total size of the file, or just
> trying to allocate the range beyond the end to extend the file?  I've
> seen issues with the first case on BTRFS before, and I'm starting to
> think that it might actually be trying to allocate the exact amount of
> space requested by fallocate, even if part of the range is already
> allocated space.

OK, I just did a dead simple test by hand, and it looks like I was
right.  The method I used to check this is as follows:

1. Create and mount a reasonably small filesystem (I used an 8G
   temporary LV for this, a file would work too though).
2. Using dd or a similar tool, create a test file that takes up half of
   the size of the filesystem.  It is important that this _not_ be
   fallocated, but just written out.
3. Use `fallocate -l` to try and extend the size of the file beyond
   half the size of the filesystem.

For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it will
succeed with no error.  Based on this and some low-level inspection, it
looks like BTRFS treats the full range of the fallocate call as
unallocated, and thus is trying to allocate space for regions of that
range that are already allocated.
>>
>> No issue at all to grow the parity file on the other parity disk. And
>> that's why I wonder if there is some undetected file system corruption.
>>

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: Massive loss of disk space
  2017-08-01 15:00         ` Austin S. Hemmelgarn
@ 2017-08-01 15:24           ` pwm
  2017-08-01 15:45             ` Austin S. Hemmelgarn
  2017-08-02 17:52             ` Goffredo Baroncelli
  1 sibling, 1 reply; 26+ messages in thread
From: pwm @ 2017-08-01 15:24 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Hugo Mills, linux-btrfs

Yes, the test code is as below - trying to match what snapraid tries to
do:

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>

int main() {
    int fd = open("/mnt/snap_04/snapraid.parity", O_NOFOLLOW|O_RDWR);
    if (fd < 0) {
        printf("Failed opening parity file [%s]\n", strerror(errno));
        return 1;
    }

    off_t filesize = 5151751667712ull;
    int res;
    struct stat statbuf;

    if (fstat(fd, &statbuf)) {
        printf("Failed stat [%s]\n", strerror(errno));
        close(fd);
        return 1;
    }
    printf("Original file size is %llu bytes\n",
           (unsigned long long)statbuf.st_size);
    printf("Trying to grow file to %llu bytes\n",
           (unsigned long long)filesize);

    res = fallocate(fd, 0, 0, filesize);
    if (res) {
        printf("Failed fallocate [%s]\n", strerror(errno));
        close(fd);
        return 1;
    }
    if (fsync(fd)) {
        printf("Failed fsync [%s]\n", strerror(errno));
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}

So the call doesn't make use of the previous file size as offset for
the extension.

int fallocate(int fd, int mode, off_t offset, off_t len);

What you are implying here is that if the fallocate() call is modified
to:

res = fallocate(fd, 0, old_size, new_size - old_size);

then everything should work as expected?

/Per W

On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote:

> On 2017-08-01 10:47, Austin S. Hemmelgarn wrote:
>> On 2017-08-01 10:39, pwm wrote:
>>> Thanks for the links and suggestions.
>>>
>>> I did try your suggestions but it didn't solve the underlying problem.
>>>
>>> pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04
>>> Dumping filters: flags 0x1, state 0x0, force is off
>>>   DATA (flags 0x2): balancing, usage=20
>>> Done, had to relocate 4596 out of 9317 chunks
>>>
>>> pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft /mnt/snap_04/
>>> Done, had to relocate 2 out of 4721 chunks
>>>
>>> pwm@europium:~$ sudo btrfs fi df /mnt/snap_04
>>> Data, single: total=4.60TiB, used=4.59TiB
>>> System, DUP: total=40.00MiB, used=512.00KiB
>>> Metadata, DUP: total=6.50GiB, used=4.81GiB
>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>
>>> pwm@europium:~$ sudo btrfs fi show /mnt/snap_04
>>> Label: 'snap_04'  uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>>>         Total devices 1 FS bytes used 4.60TiB
>>>         devid    1 size 9.09TiB used 4.61TiB path /dev/sdg1
>>>
>>> So now device 1 usage is down from 9.09TiB to 4.61TiB.
>>>
>>> But if I test to fallocate() to grow the large parity file, I directly
>>> fail. I wrote a little help program that just focuses on fallocate()
>>> instead of having to run snapraid with lots of unknown additional
>>> actions being performed.
>>>
>>> Original file size is 5050486226944 bytes
>>> Trying to grow file to 5151751667712 bytes
>>> Failed fallocate [No space left on device]
>>>
>>> And result after shows 'used' have jumped up to 9.09TiB again.
>>>
>>> root@europium:/mnt# btrfs fi show snap_04
>>> Label: 'snap_04'  uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>>>         Total devices 1 FS bytes used 4.60TiB
>>>         devid    1 size 9.09TiB used 9.09TiB path /dev/sdg1
>>>
>>> root@europium:/mnt# btrfs fi df /mnt/snap_04/
>>> Data, single: total=9.08TiB, used=4.59TiB
>>> System, DUP: total=40.00MiB, used=992.00KiB
>>> Metadata, DUP: total=6.50GiB, used=4.81GiB
>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>
>>> It's almost like the file system have decided that it needs to make a
>>> snapshot and store two complete copies of the complete file, which is
>>> obviously not going to work with a file larger than 50% of the file
>>> system.
>> I think I _might_ understand what's going on here.  Is that test program
>> calling fallocate using the desired total size of the file, or just
>> trying to allocate the range beyond the end to extend the file?  I've
>> seen issues with the first case on BTRFS before, and I'm starting to
>> think that it might actually be trying to allocate the exact amount of
>> space requested by fallocate, even if part of the range is already
>> allocated space.
>
> OK, I just did a dead simple test by hand, and it looks like I was
> right.  The method I used to check this is as follows:
> 1. Create and mount a reasonably small filesystem (I used an 8G
>    temporary LV for this, a file would work too though).
> 2. Using dd or a similar tool, create a test file that takes up half of
>    the size of the filesystem.  It is important that this _not_ be
>    fallocated, but just written out.
> 3. Use `fallocate -l` to try and extend the size of the file beyond
>    half the size of the filesystem.
>
> For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it will
> succeed with no error.
Based on this and some low-level inspection, it looks > like BTRFS treats the full range of the fallocate call as unallocated, and > thus is trying to allocate space for regions of that range that are already > allocated. > >>> >>> No issue at all to grow the parity file on the other parity disk. And >>> that's why I wonder if there is some undetected file system corruption. >>> > ^ permalink raw reply [flat|nested] 26+ messages in thread
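A minimal sketch of the tail-only variant discussed above — fallocating only the region past the current end of file. The helper name and error handling are mine, not snapraid's actual code:

```c
#define _GNU_SOURCE            /* for fallocate() */
#include <fcntl.h>
#include <sys/stat.h>

/* Grow an open file to new_size bytes, allocating only the range past
 * the current end of file, as proposed in the thread.  Unlike the
 * full-range call fallocate(fd, 0, 0, new_size), this never asks the
 * filesystem to reserve space for extents that already exist, so it
 * sidesteps the btrfs ENOSPC behavior discussed above.
 * Returns 0 on success, -1 on error (errno is set). */
int extend_file(int fd, off_t new_size)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    if (st.st_size >= new_size)
        return 0;               /* already large enough, nothing to do */
    /* offset = old size, len = amount of growth */
    return fallocate(fd, 0, st.st_size, new_size - st.st_size);
}
```

Because mode 0 is used (no FALLOC_FL_KEEP_SIZE), a successful call also updates the file size, so no separate ftruncate() is needed.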
* Re: Massive loss of disk space 2017-08-01 15:24 ` pwm @ 2017-08-01 15:45 ` Austin S. Hemmelgarn 2017-08-01 16:50 ` pwm 0 siblings, 1 reply; 26+ messages in thread From: Austin S. Hemmelgarn @ 2017-08-01 15:45 UTC (permalink / raw) To: pwm; +Cc: Hugo Mills, linux-btrfs On 2017-08-01 11:24, pwm wrote: > Yes, the test code is as below - trying to match what snapraid tries to do: > > #include <sys/types.h> > #include <sys/stat.h> > #include <fcntl.h> > #include <stdio.h> > #include <string.h> > #include <unistd.h> > #include <errno.h> > > int main() { > int fd = open("/mnt/snap_04/snapraid.parity",O_NOFOLLOW|O_RDWR); > if (fd < 0) { > printf("Failed opening parity file [%s]\n",strerror(errno)); > return 1; > } > > off_t filesize = 5151751667712ull; > int res; > > struct stat statbuf; > if (fstat(fd,&statbuf)) { > printf("Failed stat [%s]\n",strerror(errno)); > close(fd); > return 1; > } > > printf("Original file size is %llu bytes\n", > (unsigned long long)statbuf.st_size); > printf("Trying to grow file to %llu bytes\n", > (unsigned long long)filesize); > > res = fallocate(fd,0,0,filesize); > if (res) { > printf("Failed fallocate [%s]\n",strerror(errno)); > close(fd); > return 1; > } > > if (fsync(fd)) { > printf("Failed fsync [%s]\n",strerror(errno)); > close(fd); > return 1; > } > > close(fd); > return 0; > } > > So the call doesn't make use of the previous file size as offset for the > extension. > > int fallocate(int fd, int mode, off_t offset, off_t len); > > What you are implying here is that if the fallocate() call is modified to: > > res = fallocate(fd,0,old_size,new_size-old_size); > > then everything should work as expected? Based on what I've seen testing on my end, yes, that should cause things to work correctly.
That said, given what snapraid does, the fact that they call fallocate covering the full desired size of the file is correct usage (the point is to make behavior deterministic, and calling it on the whole file makes sure that the file isn't sparse, which can impact performance). Given both the fact that calling fallocate() to extend the file without worrying about an offset is a legitimate use case, and that both ext4 and XFS (and I suspect almost every other Linux filesystem) work in this situation, I'd argue that the behavior of BTRFS in this situation is incorrect. > > /Per W > > On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote: > >> On 2017-08-01 10:47, Austin S. Hemmelgarn wrote: >>> On 2017-08-01 10:39, pwm wrote: >>>> Thanks for the links and suggestions. >>>> >>>> I did try your suggestions but it didn't solve the underlying problem. >>>> >>>> >>>> >>>> pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04 >>>> Dumping filters: flags 0x1, state 0x0, force is off >>>> DATA (flags 0x2): balancing, usage=20 >>>> Done, had to relocate 4596 out of 9317 chunks >>>> >>>> >>>> pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft >>>> /mnt/snap_04/ >>>> Done, had to relocate 2 out of 4721 chunks >>>> >>>> >>>> pwm@europium:~$ sudo btrfs fi df /mnt/snap_04 >>>> Data, single: total=4.60TiB, used=4.59TiB >>>> System, DUP: total=40.00MiB, used=512.00KiB >>>> Metadata, DUP: total=6.50GiB, used=4.81GiB >>>> GlobalReserve, single: total=512.00MiB, used=0.00B >>>> >>>> >>>> pwm@europium:~$ sudo btrfs fi show /mnt/snap_04 >>>> Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 >>>> Total devices 1 FS bytes used 4.60TiB >>>> devid 1 size 9.09TiB used 4.61TiB path /dev/sdg1 >>>> >>>> >>>> So now device 1 usage is down from 9.09TiB to 4.61TiB. >>>> >>>> But if I test to fallocate() to grow the large parity file, I >>>> directly fail.
I wrote a little help program that just focuses on >>>> fallocate() instead of having to run snapraid with lots of unknown >>>> additional actions being performed. >>>> >>>> >>>> Original file size is 5050486226944 bytes >>>> Trying to grow file to 5151751667712 bytes >>>> Failed fallocate [No space left on device] >>>> >>>> >>>> >>>> And result after shows 'used' have jumped up to 9.09TiB again. >>>> >>>> root@europium:/mnt# btrfs fi show snap_04 >>>> Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 >>>> Total devices 1 FS bytes used 4.60TiB >>>> devid 1 size 9.09TiB used 9.09TiB path /dev/sdg1 >>>> >>>> root@europium:/mnt# btrfs fi df /mnt/snap_04/ >>>> Data, single: total=9.08TiB, used=4.59TiB >>>> System, DUP: total=40.00MiB, used=992.00KiB >>>> Metadata, DUP: total=6.50GiB, used=4.81GiB >>>> GlobalReserve, single: total=512.00MiB, used=0.00B >>>> >>>> >>>> It's almost like the file system have decided that it needs to make >>>> a snapshot and store two complete copies of the complete file, which >>>> is obviously not going to work with a file larger than 50% of the >>>> file system. >>> I think I _might_ understand what's going on here. Is that test >>> program calling fallocate using the desired total size of the file, >>> or just trying to allocate the range beyond the end to extend the >>> file? I've seen issues with the first case on BTRFS before, and I'm >>> starting to think that it might actually be trying to allocate the >>> exact amount of space requested by fallocate, even if part of the >>> range is already allocated space. >> >> OK, I just did a dead simple test by hand, and it looks like I was >> right. The method I used to check this is as follows: >> 1. Create and mount a reasonably small filesystem (I used an 8G >> temporary LV for this, a file would work too though). >> 2. Using dd or a similar tool, create a test file that takes up half >> of the size of the filesystem. 
It is important that this _not_ be >> fallocated, but just written out. >> 3. Use `fallocate -l` to try and extend the size of the file beyond >> half the size of the filesystem. >> >> For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it >> will succeed with no error. Based on this and some low-level >> inspection, it looks like BTRFS treats the full range of the fallocate >> call as unallocated, and thus is trying to allocate space for regions >> of that range that are already allocated. >> >>>> >>>> No issue at all to grow the parity file on the other parity disk. >>>> And that's why I wonder if there is some undetected file system >>>> corruption. >>>> >> ^ permalink raw reply [flat|nested] 26+ messages in thread
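For reference, the failing call pattern can be reduced to a small sketch (function name and sizes are mine for illustration). To reproduce the btrfs behavior, point `path` at a file on a btrfs mount and pick `write_bytes` larger than half the filesystem's free space; on ext4/XFS the same call succeeds:

```c
#define _GNU_SOURCE            /* for fallocate() */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Scripted version of the manual test above:
 *  1. write `write_bytes` of ordinary (non-fallocated) data to `path`,
 *  2. try to extend the file with a single full-range fallocate()
 *     call to `falloc_bytes` - the same pattern snapraid uses.
 * Returns 0 if the fallocate succeeds, -1 otherwise. */
int write_then_extend(const char *path, off_t write_bytes, off_t falloc_bytes)
{
    char buf[4096];
    off_t done = 0;
    int ret = -1;
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);

    if (fd < 0)
        return -1;
    memset(buf, 0xab, sizeof(buf));
    while (done < write_bytes) {
        size_t chunk = sizeof(buf);
        ssize_t n;
        if (write_bytes - done < (off_t)chunk)
            chunk = (size_t)(write_bytes - done);
        n = write(fd, buf, chunk);
        if (n <= 0)
            goto out;
        done += n;
    }
    /* Full-range allocation: offset 0, length = desired total size. */
    ret = fallocate(fd, 0, 0, falloc_bytes);
out:
    close(fd);
    return ret;
}
```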
* Re: Massive loss of disk space 2017-08-01 15:45 ` Austin S. Hemmelgarn @ 2017-08-01 16:50 ` pwm 2017-08-01 17:04 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 26+ messages in thread From: pwm @ 2017-08-01 16:50 UTC (permalink / raw) To: Austin S. Hemmelgarn; +Cc: Hugo Mills, linux-btrfs I did a temporary patch of the snapraid code to start fallocate() from the previous parity file size. Finally have a snapraid sync up and running. Looks good, but will take quite a while before I can try a scrub command to double-check everything. Thanks for the help. /Per W On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote: > On 2017-08-01 11:24, pwm wrote: >> Yes, the test code is as below - trying to match what snapraid tries to do: >> >> #include <sys/types.h> >> #include <sys/stat.h> >> #include <fcntl.h> >> #include <stdio.h> >> #include <string.h> >> #include <unistd.h> >> #include <errno.h> >> >> int main() { >> int fd = open("/mnt/snap_04/snapraid.parity",O_NOFOLLOW|O_RDWR); >> if (fd < 0) { >> printf("Failed opening parity file [%s]\n",strerror(errno)); >> return 1; >> } >> >> off_t filesize = 5151751667712ull; >> int res; >> >> struct stat statbuf; >> if (fstat(fd,&statbuf)) { >> printf("Failed stat [%s]\n",strerror(errno)); >> close(fd); >> return 1; >> } >> >> printf("Original file size is %llu bytes\n", >> (unsigned long long)statbuf.st_size); >> printf("Trying to grow file to %llu bytes\n", >> (unsigned long long)filesize); >> >> res = fallocate(fd,0,0,filesize); >> if (res) { >> printf("Failed fallocate [%s]\n",strerror(errno)); >> close(fd); >> return 1; >> } >> >> if (fsync(fd)) { >> printf("Failed fsync [%s]\n",strerror(errno)); >> close(fd); >> return 1; >> } >> >> close(fd); >> return 0; >> } >> >> So the call doesn't make use of the previous file size as offset for the >> extension.
>> >> int fallocate(int fd, int mode, off_t offset, off_t len); >> >> What you are implying here is that if the fallocate() call is modified to: >> >> res = fallocate(fd,0,old_size,new_size-old_size); >> >> then everything should work as expected? > Based on what I've seen testing on my end, yes, that should cause things to > work correctly. That said, given what snapraid does, the fact that they call > fallocate covering the full desired size of the file is correct usage (the > point is to make behavior deterministic, and calling it on the whole file > makes sure that the file isn't sparse, which can impact performance). > > Given both the fact that calling fallocate() to extend the file without > worrying about an offset is a legitimate use case, and that both ext4 and XFS > (and I suspect almost every other Linux filesystem) works in this situation, > I'd argue that the behavior of BTRFS in this situation is incorrect. >> >> /Per W >> >> On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote: >> >>> On 2017-08-01 10:47, Austin S. Hemmelgarn wrote: >>>> On 2017-08-01 10:39, pwm wrote: >>>>> Thanks for the links and suggestions. >>>>> >>>>> I did try your suggestions but it didn't solve the underlying problem. 
>>>>> >>>>> >>>>> >>>>> pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04 >>>>> Dumping filters: flags 0x1, state 0x0, force is off >>>>> DATA (flags 0x2): balancing, usage=20 >>>>> Done, had to relocate 4596 out of 9317 chunks >>>>> >>>>> >>>>> pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft >>>>> /mnt/snap_04/ >>>>> Done, had to relocate 2 out of 4721 chunks >>>>> >>>>> >>>>> pwm@europium:~$ sudo btrfs fi df /mnt/snap_04 >>>>> Data, single: total=4.60TiB, used=4.59TiB >>>>> System, DUP: total=40.00MiB, used=512.00KiB >>>>> Metadata, DUP: total=6.50GiB, used=4.81GiB >>>>> GlobalReserve, single: total=512.00MiB, used=0.00B >>>>> >>>>> >>>>> pwm@europium:~$ sudo btrfs fi show /mnt/snap_04 >>>>> Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 >>>>> Total devices 1 FS bytes used 4.60TiB >>>>> devid 1 size 9.09TiB used 4.61TiB path /dev/sdg1 >>>>> >>>>> >>>>> So now device 1 usage is down from 9.09TiB to 4.61TiB. >>>>> >>>>> But if I test to fallocate() to grow the large parity file, I directly >>>>> fail. I wrote a little help program that just focuses on fallocate() >>>>> instead of having to run snapraid with lots of unknown additional >>>>> actions being performed. >>>>> >>>>> >>>>> Original file size is 5050486226944 bytes >>>>> Trying to grow file to 5151751667712 bytes >>>>> Failed fallocate [No space left on device] >>>>> >>>>> >>>>> >>>>> And result after shows 'used' have jumped up to 9.09TiB again. 
>>>>> >>>>> root@europium:/mnt# btrfs fi show snap_04 >>>>> Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 >>>>> Total devices 1 FS bytes used 4.60TiB >>>>> devid 1 size 9.09TiB used 9.09TiB path /dev/sdg1 >>>>> >>>>> root@europium:/mnt# btrfs fi df /mnt/snap_04/ >>>>> Data, single: total=9.08TiB, used=4.59TiB >>>>> System, DUP: total=40.00MiB, used=992.00KiB >>>>> Metadata, DUP: total=6.50GiB, used=4.81GiB >>>>> GlobalReserve, single: total=512.00MiB, used=0.00B >>>>> >>>>> >>>>> It's almost like the file system have decided that it needs to make a >>>>> snapshot and store two complete copies of the complete file, which is >>>>> obviously not going to work with a file larger than 50% of the file >>>>> system. >>>> I think I _might_ understand what's going on here. Is that test program >>>> calling fallocate using the desired total size of the file, or just >>>> trying to allocate the range beyond the end to extend the file? I've >>>> seen issues with the first case on BTRFS before, and I'm starting to >>>> think that it might actually be trying to allocate the exact amount of >>>> space requested by fallocate, even if part of the range is already >>>> allocated space. >>> >>> OK, I just did a dead simple test by hand, and it looks like I was right. >>> The method I used to check this is as follows: >>> 1. Create and mount a reasonably small filesystem (I used an 8G temporary >>> LV for this, a file would work too though). >>> 2. Using dd or a similar tool, create a test file that takes up half of >>> the size of the filesystem. It is important that this _not_ be >>> fallocated, but just written out. >>> 3. Use `fallocate -l` to try and extend the size of the file beyond half >>> the size of the filesystem. >>> >>> For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it will >>> succeed with no error. 
Based on this and some low-level inspection, it >>> looks like BTRFS treats the full range of the fallocate call as >>> unallocated, and thus is trying to allocate space for regions of that >>> range that are already allocated. >>> >>>>> >>>>> No issue at all to grow the parity file on the other parity disk. And >>>>> that's why I wonder if there is some undetected file system corruption. >>>>> >>> > > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Massive loss of disk space 2017-08-01 16:50 ` pwm @ 2017-08-01 17:04 ` Austin S. Hemmelgarn 0 siblings, 0 replies; 26+ messages in thread From: Austin S. Hemmelgarn @ 2017-08-01 17:04 UTC (permalink / raw) To: pwm; +Cc: Hugo Mills, linux-btrfs On 2017-08-01 12:50, pwm wrote: > I did a temporary patch of the snapraid code to start fallocate() from > the previous parity file size. Like I said though, it's BTRFS that's misbehaving here, not snapraid. I'm going to try to get some further discussion about this here on the mailing list, and hopefully it will get fixed in BTRFS (I would try to do so myself, but I'm at best a novice at C, and not well versed in kernel code). > > Finally have a snapraid sync up and running. Looks good, but will take > quite a while before I can try a scrub command to double-check everything. > > Thanks for the help. Glad I could be helpful! > > /Per W > > On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote: > >> On 2017-08-01 11:24, pwm wrote: >>> Yes, the test code is as below - trying to match what snapraid tries >>> to do: >>> >>> #include <sys/types.h> >>> #include <sys/stat.h> >>> #include <fcntl.h> >>> #include <stdio.h> >>> #include <string.h> >>> #include <unistd.h> >>> #include <errno.h> >>> >>> int main() { >>> int fd = open("/mnt/snap_04/snapraid.parity",O_NOFOLLOW|O_RDWR); >>> if (fd < 0) { >>> printf("Failed opening parity file [%s]\n",strerror(errno)); >>> return 1; >>> } >>> >>> off_t filesize = 5151751667712ull; >>> int res; >>> >>> struct stat statbuf; >>> if (fstat(fd,&statbuf)) { >>> printf("Failed stat [%s]\n",strerror(errno)); >>> close(fd); >>> return 1; >>> } >>> >>> printf("Original file size is %llu bytes\n", >>> (unsigned long long)statbuf.st_size); >>> printf("Trying to grow file to %llu bytes\n", >>> (unsigned long long)filesize); >>> >>> res = fallocate(fd,0,0,filesize); >>> if (res) { >>> printf("Failed fallocate [%s]\n",strerror(errno)); >>> close(fd); >>> return 1; >>> } >>> >>> if (fsync(fd)) { >>>
printf("Failed fsync [%s]\n",strerror(errno)); >>> close(fd); >>> return 1; >>> } >>> >>> close(fd); >>> return 0; >>> } >>> >>> So the call doesn't make use of the previous file size as offset for >>> the extension. >>> >>> int fallocate(int fd, int mode, off_t offset, off_t len); >>> >>> What you are implying here is that if the fallocate() call is >>> modified to: >>> >>> res = fallocate(fd,0,old_size,new_size-old_size); >>> >>> then everything should work as expected? >> Based on what I've seen testing on my end, yes, that should cause >> things to work correctly. That said, given what snapraid does, the >> fact that they call fallocate covering the full desired size of the >> file is correct usage (the point is to make behavior deterministic, >> and calling it on the whole file makes sure that the file isn't >> sparse, which can impact performance). >> >> Given both the fact that calling fallocate() to extend the file >> without worrying about an offset is a legitimate use case, and that >> both ext4 and XFS (and I suspect almost every other Linux filesystem) >> works in this situation, I'd argue that the behavior of BTRFS in this >> situation is incorrect. >>> >>> /Per W >>> >>> On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote: >>> >>>> On 2017-08-01 10:47, Austin S. Hemmelgarn wrote: >>>>> On 2017-08-01 10:39, pwm wrote: >>>>>> Thanks for the links and suggestions. >>>>>> >>>>>> I did try your suggestions but it didn't solve the underlying >>>>>> problem.
>>>>>> >>>>>> >>>>>> >>>>>> pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04 >>>>>> Dumping filters: flags 0x1, state 0x0, force is off >>>>>> DATA (flags 0x2): balancing, usage=20 >>>>>> Done, had to relocate 4596 out of 9317 chunks >>>>>> >>>>>> >>>>>> pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft >>>>>> /mnt/snap_04/ >>>>>> Done, had to relocate 2 out of 4721 chunks >>>>>> >>>>>> >>>>>> pwm@europium:~$ sudo btrfs fi df /mnt/snap_04 >>>>>> Data, single: total=4.60TiB, used=4.59TiB >>>>>> System, DUP: total=40.00MiB, used=512.00KiB >>>>>> Metadata, DUP: total=6.50GiB, used=4.81GiB >>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B >>>>>> >>>>>> >>>>>> pwm@europium:~$ sudo btrfs fi show /mnt/snap_04 >>>>>> Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 >>>>>> Total devices 1 FS bytes used 4.60TiB >>>>>> devid 1 size 9.09TiB used 4.61TiB path /dev/sdg1 >>>>>> >>>>>> >>>>>> So now device 1 usage is down from 9.09TiB to 4.61TiB. >>>>>> >>>>>> But if I test to fallocate() to grow the large parity file, I >>>>>> directly fail. I wrote a little help program that just focuses on >>>>>> fallocate() instead of having to run snapraid with lots of unknown >>>>>> additional actions being performed. >>>>>> >>>>>> >>>>>> Original file size is 5050486226944 bytes >>>>>> Trying to grow file to 5151751667712 bytes >>>>>> Failed fallocate [No space left on device] >>>>>> >>>>>> >>>>>> >>>>>> And result after shows 'used' have jumped up to 9.09TiB again. 
>>>>>> >>>>>> root@europium:/mnt# btrfs fi show snap_04 >>>>>> Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31 >>>>>> Total devices 1 FS bytes used 4.60TiB >>>>>> devid 1 size 9.09TiB used 9.09TiB path /dev/sdg1 >>>>>> >>>>>> root@europium:/mnt# btrfs fi df /mnt/snap_04/ >>>>>> Data, single: total=9.08TiB, used=4.59TiB >>>>>> System, DUP: total=40.00MiB, used=992.00KiB >>>>>> Metadata, DUP: total=6.50GiB, used=4.81GiB >>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B >>>>>> >>>>>> >>>>>> It's almost like the file system have decided that it needs to >>>>>> make a snapshot and store two complete copies of the complete >>>>>> file, which is obviously not going to work with a file larger than >>>>>> 50% of the file system. >>>>> I think I _might_ understand what's going on here. Is that test >>>>> program calling fallocate using the desired total size of the file, >>>>> or just trying to allocate the range beyond the end to extend the >>>>> file? I've seen issues with the first case on BTRFS before, and >>>>> I'm starting to think that it might actually be trying to allocate >>>>> the exact amount of space requested by fallocate, even if part of >>>>> the range is already allocated space. >>>> >>>> OK, I just did a dead simple test by hand, and it looks like I was >>>> right. The method I used to check this is as follows: >>>> 1. Create and mount a reasonably small filesystem (I used an 8G >>>> temporary LV for this, a file would work too though). >>>> 2. Using dd or a similar tool, create a test file that takes up half >>>> of the size of the filesystem. It is important that this _not_ be >>>> fallocated, but just written out. >>>> 3. Use `fallocate -l` to try and extend the size of the file beyond >>>> half the size of the filesystem. >>>> >>>> For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it >>>> will succeed with no error. 
Based on this and some low-level >>>> inspection, it looks like BTRFS treats the full range of the >>>> fallocate call as unallocated, and thus is trying to allocate space >>>> for regions of that range that are already allocated. >>>> >>>>>> >>>>>> No issue at all to grow the parity file on the other parity disk. >>>>>> And that's why I wonder if there is some undetected file system >>>>>> corruption. >>>>>> >>>> >> >> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Massive loss of disk space 2017-08-01 15:00 ` Austin S. Hemmelgarn 2017-08-01 15:24 ` pwm @ 2017-08-02 17:52 ` Goffredo Baroncelli 2017-08-02 19:10 ` Austin S. Hemmelgarn ` (2 more replies) 1 sibling, 3 replies; 26+ messages in thread From: Goffredo Baroncelli @ 2017-08-02 17:52 UTC (permalink / raw) To: Austin S. Hemmelgarn, pwm, Hugo Mills; +Cc: linux-btrfs Hi, On 2017-08-01 17:00, Austin S. Hemmelgarn wrote: > OK, I just did a dead simple test by hand, and it looks like I was right. The method I used to check this is as follows: > 1. Create and mount a reasonably small filesystem (I used an 8G temporary LV for this, a file would work too though). > 2. Using dd or a similar tool, create a test file that takes up half of the size of the filesystem. It is important that this _not_ be fallocated, but just written out. > 3. Use `fallocate -l` to try and extend the size of the file beyond half the size of the filesystem. > > For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it will succeed with no error. Based on this and some low-level inspection, it looks like BTRFS treats the full range of the fallocate call as unallocated, and thus is trying to allocate space for regions of that range that are already allocated. I can confirm this behavior; below are some steps to reproduce it [2]. However, I don't think it is a bug: this is the correct behavior for a COW filesystem (see below). Looking at the function btrfs_fallocate() (file fs/btrfs/file.c) static long btrfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len) { [...] alloc_start = round_down(offset, blocksize); alloc_end = round_up(offset + len, blocksize); [...] /* * Only trigger disk allocation, don't trigger qgroup reserve * * For qgroup space, it will be checked later. */ ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode), alloc_end - alloc_start) it seems that BTRFS always allocates the maximum space required, without considering the space already allocated.
Is it too conservative? I think not: consider the following scenario: a) create a 2GB file b) fallocate -o 1GB -l 2GB c) write from 1GB to 3GB after b), the expectation is that c) always succeeds [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space because there could be a small time window where both the old and the new data exist on the disk. My opinion is that in general this behavior is correct due to the COW nature of BTRFS. The only exception that I can find is about the "nocow" file. For these cases, taking into account the already allocated space would be better. Comments are welcome. BR G.Baroncelli [1] from man 2 fallocate [...] After a successful call, subsequent writes into the range specified by offset and len are guaranteed not to fail because of lack of disk space. [...] [2] -- create a 5G btrfs filesystem # mkdir t1 # truncate --size 5G disk # losetup /dev/loop0 disk # mkfs.btrfs /dev/loop0 # mount /dev/loop0 t1 # cd t1 -- test -- create a 1500 MB file, then expand it to 4000MB -- expected result: the file is 4000MB in size -- result: fail: the expansion fails # fallocate -l $((1024*1024*100*15)) file.bin # fallocate -l $((1024*1024*100*40)) file.bin fallocate: fallocate failed: No space left on device # ls -lh file.bin -rw-r--r-- 1 root root 1.5G Aug 2 19:09 file.bin -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 26+ messages in thread
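The scenario (a)-(c) above can also be sketched in C at a reduced scale (sizes scaled from GB down to an arbitrary `unit` so it runs anywhere; the function name is mine, not from the thread):

```c
#define _GNU_SOURCE            /* for fallocate() */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Goffredo's scenario (a)-(c), with 1GB replaced by `unit` bytes:
 *  a) create a file with 2*unit bytes of real data,
 *  b) fallocate a range starting at unit, of length 2*unit (so it
 *     covers the second half of the existing data plus unit past EOF),
 *  c) overwrite that whole range.
 * Per fallocate(2), once b) succeeds the writes in c) are guaranteed
 * not to fail with ENOSPC.  Returns 0 when all three steps succeed. */
int cow_overcommit_demo(const char *path, off_t unit)
{
    int ret = -1;
    char *buf = malloc((size_t)(2 * unit));
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);

    if (fd < 0 || !buf)
        goto out;
    memset(buf, 0x5a, (size_t)(2 * unit));
    if (write(fd, buf, (size_t)(2 * unit)) != (ssize_t)(2 * unit))
        goto out;                                 /* a) 2*unit of data */
    if (fallocate(fd, 0, unit, 2 * unit) != 0)
        goto out;                                 /* b) offset unit, len 2*unit */
    if (pwrite(fd, buf, (size_t)(2 * unit), unit) != (ssize_t)(2 * unit))
        goto out;                                 /* c) overwrite the range */
    ret = 0;
out:
    free(buf);
    if (fd >= 0)
        close(fd);
    return ret;
}
```

The final file size is 3*unit, mirroring the 3GB file in the original scenario.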
* Re: Massive loss of disk space 2017-08-02 17:52 ` Goffredo Baroncelli @ 2017-08-02 19:10 ` Austin S. Hemmelgarn 2017-08-02 21:05 ` Goffredo Baroncelli 2017-08-03 3:48 ` Duncan 2017-08-03 11:44 ` Marat Khalili 2 siblings, 1 reply; 26+ messages in thread From: Austin S. Hemmelgarn @ 2017-08-02 19:10 UTC (permalink / raw) To: kreijack, pwm, Hugo Mills; +Cc: linux-btrfs On 2017-08-02 13:52, Goffredo Baroncelli wrote: > Hi, > > On 2017-08-01 17:00, Austin S. Hemmelgarn wrote: >> OK, I just did a dead simple test by hand, and it looks like I was right. The method I used to check this is as follows: >> 1. Create and mount a reasonably small filesystem (I used an 8G temporary LV for this, a file would work too though). >> 2. Using dd or a similar tool, create a test file that takes up half of the size of the filesystem. It is important that this _not_ be fallocated, but just written out. >> 3. Use `fallocate -l` to try and extend the size of the file beyond half the size of the filesystem. >> >> For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it will succeed with no error. Based on this and some low-level inspection, it looks like BTRFS treats the full range of the fallocate call as unallocated, and thus is trying to allocate space for regions of that range that are already allocated. > > I can confirm this behavior; below some step to reproduce it [2]; however I don't think that it is a bug, but this is the correct behavior for a COW filesystem (see below). > > > Looking at the function btrfs_fallocate() (file fs/btrfs/file.c) > > > static long btrfs_fallocate(struct file *file, int mode, > loff_t offset, loff_t len) > { > [...] > alloc_start = round_down(offset, blocksize); > alloc_end = round_up(offset + len, blocksize); > [...] > /* > * Only trigger disk allocation, don't trigger qgroup reserve > * > * For qgroup space, it will be checked later. 
> */ > ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode), > alloc_end - alloc_start) > > > it seems that BTRFS always allocate the maximum space required, without consider the one already allocated. Is it too conservative ? I think no: consider the following scenario: > > a) create a 2GB file > b) fallocate -o 1GB -l 2GB > c) write from 1GB to 3GB > > after b), the expectation is that c) always succeed [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space because there could be a small time window where both the old and the new data exists on the disk. There is also an expectation based on pretty much every other FS in existence that calling fallocate() on a range that is already in use is a (possibly expensive) no-op, and by extension using fallocate() with an offset of 0 like a ftruncate() call will succeed as long as the new size will fit. I've checked JFS, XFS, ext4, vfat, NTFS (via NTFS-3G, not the kernel driver), NILFS2, OCFS2 (local mode only), F2FS, UFS, and HFS+ on Linux, UFS and HFS+ on OS X, UFS and ZFS on FreeBSD, FFS (UFS with a different name) and LFS (log structured) on NetBSD, and UFS and ZFS on Solaris, and VxFS on HP-UX, and _all_ of them behave correctly here and succeed with the test I listed, while BTRFS does not. This isn't codified in POSIX, but it's also not something that is listed as implementation defined, which in turn means that we should be trying to match the other implementations. > > My opinion is that in general this behavior is correct due to the COW nature of BTRFS. > The only exception that I can find, is about the "nocow" file. For these cases taking in accout the already allocated space would be better. 
There are other, saner ways to make that expectation hold though, and I'm not even certain that it does as things are implemented (I believe we still CoW unwritten extents when data is written to them, because I _have_ had writes to fallocate'ed files fail on BTRFS before with -ENOSPC). The ideal situation IMO is as follows: 1. This particular case (using fallocate() with an offset of 0 to extend a file that is already larger than half the remaining free space on the FS) _should_ succeed. Short of very convoluted configurations, extending a file with fallocate will not result in over-committing space on a CoW filesystem unless it would extend the file by more than the remaining free space, and therefore barring long external interactions, subsequent writes will also succeed. Proof of this for a general case is somewhat complicated, but in the very specific case of the script I posted as a reproducer in the other thread about this and the test case I gave in this thread, it's trivial to prove that the writes will succeed. Either way, the behavior of SnapRAID, while not optimal in this case, is still a legitimate usage (I've seen programs do things like that just to make sure the file isn't sparse). 2. Conversion of unwritten extents to written ones should not require new allocation. Ideally, we need to be allocating not just space for the data, but also reasonable space for the associated metadata when allocating an unwritten extent, and there should be no CoW involved when they are written to except for the small metadata updates required to account the new blocks. Unless we're doing this, we have edge cases where the above listed expectation does not hold (also note that GlobalReserve does not count IMO, it's supposed to be for temporary usage only and doesn't ever appear to be particularly large). 3.
There should be some small amount of space reserved globally for not just metadata, but data too, so that a 'full' filesystem can still update existing files reliably. I'm not sure that we're not doing this already, but AIUI, GlobalReserve is metadata only. If we do this, we don't have to worry _as much_ about avoiding CoW when converting unwritten extents to regular ones. > > Comments are welcome. > > BR > G.Baroncelli > > [1] from man 2 fallocate > [...] > After a successful call, subsequent writes into the range specified by offset and len are > guaranteed not to fail because of lack of disk space. > [...] > > > [2] > > -- create a 5G btrfs filesystem > > # mkdir t1 > # truncate --size 5G disk > # losetup /dev/loop0 disk > # mkfs.btrfs /dev/loop0 > # mount /dev/loop0 t1 > > -- test > -- create a 1500 MB file, then expand it to 4000 MB > -- expected result: the file is 4000 MB in size > -- result: fail: the expansion fails > > # fallocate -l $((1024*1024*100*15)) file.bin > # fallocate -l $((1024*1024*100*40)) file.bin > fallocate: fallocate failed: No space left on device > # ls -lh file.bin > -rw-r--r-- 1 root root 1.5G Aug 2 19:09 file.bin > > ^ permalink raw reply [flat|nested] 26+ messages in thread
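The contract from man 2 fallocate quoted in [1] can be exercised directly from userspace. Below is a minimal, hedged sketch in Python (not from the thread) using `os.posix_fallocate` on Linux; the sizes are small so it runs on any filesystem, but on a nearly full btrfs the allocation step itself is what returns ENOSPC, as reproducer [2] shows:

```python
import os
import tempfile

def reserve_and_write(path, offset, length):
    """Reserve [offset, offset+length) with posix_fallocate, then write
    into the reserved range. Per the man page, the write is expected not
    to fail with ENOSPC once the reservation succeeded."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.posix_fallocate(fd, offset, length)  # reserve the range
        os.lseek(fd, offset, os.SEEK_SET)
        os.write(fd, b"\xff" * length)          # write inside the reservation
        return os.fstat(fd).st_size
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as d:
    size = reserve_and_write(os.path.join(d, "file.bin"), 1024, 4096)
    print(size)  # 5120: the file extends to offset + length
```

The dispute in the thread is precisely whether a CoW filesystem may charge such a reservation the full new length even where the range is already backed by allocated extents.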
* Re: Massive loss of disk space 2017-08-02 19:10 ` Austin S. Hemmelgarn @ 2017-08-02 21:05 ` Goffredo Baroncelli 2017-08-03 11:39 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 26+ messages in thread From: Goffredo Baroncelli @ 2017-08-02 21:05 UTC (permalink / raw) To: Austin S. Hemmelgarn, pwm, Hugo Mills; +Cc: linux-btrfs On 2017-08-02 21:10, Austin S. Hemmelgarn wrote: > On 2017-08-02 13:52, Goffredo Baroncelli wrote: >> Hi, >> [...] >> consider the following scenario: >> >> a) create a 2GB file >> b) fallocate -o 1GB -l 2GB >> c) write from 1GB to 3GB >> >> after b), the expectation is that c) always succeeds [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space because there could be a small time window where both the old and the new data exist on the disk. > There is also an expectation based on pretty much every other FS in existence that calling fallocate() on a range that is already in use is a (possibly expensive) no-op, and by extension using fallocate() with an offset of 0 like a ftruncate() call will succeed as long as the new size will fit. The man page of fallocate doesn't guarantee that. Unfortunately in a COW filesystem the assumption that an allocated area may be simply overwritten is not true. Let me say it in other words: as a general rule, if you want to _write_ something in a cow filesystem, you need space. It doesn't matter whether you are *over-writing* existing data or *appending* to a file. > > I've checked JFS, XFS, ext4, vfat, NTFS (via NTFS-3G, not the kernel driver), NILFS2, OCFS2 (local mode only), F2FS, UFS, and HFS+ on Linux, UFS and HFS+ on OS X, UFS and ZFS on FreeBSD, FFS (UFS with a different name) and LFS (log structured) on NetBSD, and UFS and ZFS on Solaris, and VxFS on HP-UX, and _all_ of them behave correctly here and succeed with the test I listed, while BTRFS does not.
This isn't codified in POSIX, but it's also not something that is listed as implementation defined, which in turn means that we should be trying to match the other implementations. [...] > >> >> My opinion is that in general this behavior is correct due to the COW nature of BTRFS. >> The only exception I can find is the "nocow" file. For these cases, taking into account the already allocated space would be better. > There are other, saner ways to make that expectation hold though, and I'm not even certain that it does as things are implemented (I believe we still CoW unwritten extents when data is written to them, because I _have_ had writes to fallocate'ed files fail on BTRFS before with -ENOSPC). > > The ideal situation IMO is as follows: > > 1. This particular case (using fallocate() with an offset of 0 to extend a file that is already larger than half the remaining free space on the FS) _should_ succeed. This description is not accurate. What happens is the following: 1) you have a file *with valid data* 2) you want to prepare an update of this file and want to be sure to have enough space At this point fallocate has to guarantee: a) you have your old data still available b) you have allocated the space for the update In terms of a COW filesystem, you need the space of a) + the space of b) > Short of very convoluted configurations, extending a file with fallocate will not result in over-committing space on a CoW filesystem unless it would extend the file by more than the remaining free space, and therefore barring long external interactions, subsequent writes will also succeed. Proof of this for a general case is somewhat complicated, but in the very specific case of the script I posted as a reproducer in the other thread about this and the test case I gave in this thread, it's trivial to prove that the writes will succeed.
Either way, the behavior of SnapRAID, while not optimal in this case, is still a legitimate usage (I've seen programs do things like that just to make sure the file isn't sparse). > > 2. Conversion of unwritten extents to written ones should not require new allocation. Ideally, we need to be allocating not just space for the data, but also reasonable space for the associated metadata when allocating an unwritten extent, and there should be no CoW involved when they are written to except for the small metadata updates required to account the new blocks. Unless we're doing this, then we have edge cases where the the above listed expectation does not hold (also note that GlobalReserve does not count IMO, it's supposed to be for temporary usage only and doesn't ever appear to be particularly large). > > 3. There should be some small amount of space reserved globally for not just metadata, but data too, so that a 'full' filesystem can still update existing files reliably. I'm not sure that we're not doing this already, but AIUI, GlobalReserve is metadata only. If we do this, we don't have to worry _as much_ about avoiding CoW when converting unwritten extents to regular ones. >> >> Comments are welcome. >> >> BR >> G.Baroncelli >> >> [1] from man 2 fallocate >> [...] >> After a successful call, subsequent writes into the range specified by offset and len are >> guaranteed not to fail because of lack of disk space. >> [...] 
>> >> >> [2] >> >> -- create a 5G btrfs filesystem >> >> # mkdir t1 >> # truncate --size 5G disk >> # losetup /dev/loop0 disk >> # mkfs.btrfs /dev/loop0 >> # mount /dev/loop0 t1 >> >> -- test >> -- create a 1500 MB file, the expand it to 4000MB >> -- expected result: the file is 4000MB size >> -- result: fail: the expansion fails >> >> # fallocate -l $((1024*1024*100*15)) file.bin >> # fallocate -l $((1024*1024*100*40)) file.bin >> fallocate: fallocate failed: No space left on device >> # ls -lh file.bin >> -rw-r--r-- 1 root root 1.5G Aug 2 19:09 file.bin >> >> > > -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Massive loss of disk space 2017-08-02 21:05 ` Goffredo Baroncelli @ 2017-08-03 11:39 ` Austin S. Hemmelgarn 2017-08-03 16:37 ` Goffredo Baroncelli 0 siblings, 1 reply; 26+ messages in thread From: Austin S. Hemmelgarn @ 2017-08-03 11:39 UTC (permalink / raw) To: kreijack, pwm, Hugo Mills; +Cc: linux-btrfs On 2017-08-02 17:05, Goffredo Baroncelli wrote: > On 2017-08-02 21:10, Austin S. Hemmelgarn wrote: >> On 2017-08-02 13:52, Goffredo Baroncelli wrote: >>> Hi, >>> > [...] > >>> consider the following scenario: >>> >>> a) create a 2GB file >>> b) fallocate -o 1GB -l 2GB >>> c) write from 1GB to 3GB >>> >>> after b), the expectation is that c) always succeed [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space because there could be a small time window where both the old and the new data exists on the disk. > >> There is also an expectation based on pretty much every other FS in existence that calling fallocate() on a range that is already in use is a (possibly expensive) no-op, and by extension using fallocate() with an offset of 0 like a ftruncate() call will succeed as long as the new size will fit. > > The man page of fallocate doesn't guarantee that. > > Unfortunately in a COW filesystem the assumption that an allocate area may be simply overwritten is not true. > > Let me to say it with others words: as general rule if you want to _write_ something in a cow filesystem, you need space. Doesn't matter if you are *over-writing* existing data or you are *appending* to a file. Yes, you need space, but you don't need _all_ the space. For a file that already has data in it, you only _need_ as much space as the largest chunk of data that can be written at once at a low level, because the moment that first write finishes, the space that was used in the file for that region is freed, and the next write can go there. 
Put a bit differently, you only need to allocate what isn't allocated in the region, and then a bit more to handle the initial write to the file. Also, as I said below, _THIS WORKS ON ZFS_. That immediately means that a CoW filesystem _does not_ need to behave the way BTRFS does. > > >> >> I've checked JFS, XFS, ext4, vfat, NTFS (via NTFS-3G, not the kernel driver), NILFS2, OCFS2 (local mode only), F2FS, UFS, and HFS+ on Linux, UFS and HFS+ on OS X, UFS and ZFS on FreeBSD, FFS (UFS with a different name) and LFS (log structured) on NetBSD, and UFS and ZFS on Solaris, and VxFS on HP-UX, and _all_ of them behave correctly here and succeed with the test I listed, while BTRFS does not. This isn't codified in POSIX, but it's also not something that is listed as implementation defined, which in turn means that we should be trying to match the other implementations. > > [...] > >> >>> >>> My opinion is that in general this behavior is correct due to the COW nature of BTRFS. >>> The only exception I can find is the "nocow" file. For these cases, taking into account the already allocated space would be better. >> There are other, saner ways to make that expectation hold though, and I'm not even certain that it does as things are implemented (I believe we still CoW unwritten extents when data is written to them, because I _have_ had writes to fallocate'ed files fail on BTRFS before with -ENOSPC). >> >> The ideal situation IMO is as follows: >> >> 1. This particular case (using fallocate() with an offset of 0 to extend a file that is already larger than half the remaining free space on the FS) _should_ succeed. > > This description is not accurate. What happens is the following: > 1) you have a file *with valid data* > 2) you want to prepare an update of this file and want to be sure to have enough space Except this is not the common case.
Most filesystems aren't CoW, so calling fallocate() like this is generally not 'ensuring you have enough space', it's 'ensuring the file isn't sparse, and we can write to the extra area beyond the end we care about'. > > at this point fallocate have to guarantee: > a) you have your old data still available > b) you have allocated the space for the update > > In terms of a COW filesystem, you need the space of a) + the space of b) No, that is only required if the entire file needs to be written atomically. There is some maximal size atomic write that BTRFS can perform as a single operation at a low level (I'm not sure if this is equal to the block size, or larger, but it doesn't matter much, either way, I'm talking the largest chunk of data it will write to a disk in a single operation before updating metadata to point to that new data). If your total size (original data plus the new space) is less than this maximal atomic write size, then the above is true, but if it is larger, you only need to allocate space for regions of the fallocate() range that aren't already allocated, plus space to accommodate at least one write of this maximal atomic write size. Any space beyond that just ends up minimizing the degree of fragmentation introduced by allocation. The methodology that allows this is really simple. When you start to write data to the file, the first part of the write goes into the newly allocated space, and the original region covered by that write gets freed. You can then write into the space that was just freed and repeat the process until the write is done. Implementing this requires the freeing process to know that the freed region was covered by an fallocate() call, and thus that it should be saved for future writes. Provided that the back-conversion from used space to fallocated() space is done directly, this is also race free. 
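The rolling scheme described above (write the new copy into fresh space, free the old copy of that region, reuse it for the next chunk) can be illustrated with a toy free-space model. This is a pure simulation with made-up block counts, not btrfs code:

```python
def cow_overwrite_space_needed(file_blocks, free_blocks, atomic_write_blocks):
    """Simulate CoW-overwriting a whole file one atomic write at a time.

    Each step allocates `atomic_write_blocks` of free space for the new
    copy, then frees the old copy of that region, making the space
    reusable for the next step. Returns the peak extra space ever in
    use; raises if a step cannot find room (the ENOSPC case)."""
    free = free_blocks
    peak_extra = 0
    remaining = file_blocks
    while remaining > 0:
        step = min(atomic_write_blocks, remaining)
        if free < step:
            raise RuntimeError("ENOSPC")
        free -= step                               # new copy allocated
        peak_extra = max(peak_extra, free_blocks - free)
        free += step                               # old copy of this region freed
        remaining -= step
    return peak_extra

# Overwriting a 1000-block file needs only one atomic write's worth of
# slack (8 blocks here), not another 1000 blocks:
print(cow_overwrite_space_needed(1000, 8, 8))  # 8
```

As the argument in the thread notes, making this race-free requires the freed regions to be earmarked for the remaining writes rather than returned to the general pool.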
> > >> Short of very convoluted configurations, extending a file with fallocate will not result in over-committing space on a CoW filesystem unless it would extend the file by more than the remaining free space, and therefore barring long external interactions, subsequent writes will also succeed. Proof of this for a general case is somewhat complicated, but in the very specific case of the script I posted as a reproducer in the other thread about this and the test case I gave in this thread, it's trivial to prove that the writes will succeed. Either way, the behavior of SnapRAID, while not optimal in this case, is still a legitimate usage (I've seen programs do things like that just to make sure the file isn't sparse). >> >> 2. Conversion of unwritten extents to written ones should not require new allocation. Ideally, we need to be allocating not just space for the data, but also reasonable space for the associated metadata when allocating an unwritten extent, and there should be no CoW involved when they are written to except for the small metadata updates required to account the new blocks. Unless we're doing this, then we have edge cases where the the above listed expectation does not hold (also note that GlobalReserve does not count IMO, it's supposed to be for temporary usage only and doesn't ever appear to be particularly large). >> >> 3. There should be some small amount of space reserved globally for not just metadata, but data too, so that a 'full' filesystem can still update existing files reliably. I'm not sure that we're not doing this already, but AIUI, GlobalReserve is metadata only. If we do this, we don't have to worry _as much_ about avoiding CoW when converting unwritten extents to regular ones. >>> >>> Comments are welcome. >>> >>> BR >>> G.Baroncelli >>> >>> [1] from man 2 fallocate >>> [...] >>> After a successful call, subsequent writes into the range specified by offset and len are >>> guaranteed not to fail because of lack of disk space. 
>>> [...] >>> >>> >>> [2] >>> >>> -- create a 5G btrfs filesystem >>> >>> # mkdir t1 >>> # truncate --size 5G disk >>> # losetup /dev/loop0 disk >>> # mkfs.btrfs /dev/loop0 >>> # mount /dev/loop0 t1 >>> >>> -- test >>> -- create a 1500 MB file, the expand it to 4000MB >>> -- expected result: the file is 4000MB size >>> -- result: fail: the expansion fails >>> >>> # fallocate -l $((1024*1024*100*15)) file.bin >>> # fallocate -l $((1024*1024*100*40)) file.bin >>> fallocate: fallocate failed: No space left on device >>> # ls -lh file.bin >>> -rw-r--r-- 1 root root 1.5G Aug 2 19:09 file.bin >>> >>> >> >> > > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Massive loss of disk space 2017-08-03 11:39 ` Austin S. Hemmelgarn @ 2017-08-03 16:37 ` Goffredo Baroncelli 2017-08-03 17:23 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 26+ messages in thread From: Goffredo Baroncelli @ 2017-08-03 16:37 UTC (permalink / raw) To: Austin S. Hemmelgarn, pwm, Hugo Mills; +Cc: linux-btrfs On 2017-08-03 13:39, Austin S. Hemmelgarn wrote: > On 2017-08-02 17:05, Goffredo Baroncelli wrote: >> On 2017-08-02 21:10, Austin S. Hemmelgarn wrote: >>> On 2017-08-02 13:52, Goffredo Baroncelli wrote: >>>> Hi, >>>> >> [...] >> >>>> consider the following scenario: >>>> >>>> a) create a 2GB file >>>> b) fallocate -o 1GB -l 2GB >>>> c) write from 1GB to 3GB >>>> >>>> after b), the expectation is that c) always succeed [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space because there could be a small time window where both the old and the new data exists on the disk. >> >>> There is also an expectation based on pretty much every other FS in existence that calling fallocate() on a range that is already in use is a (possibly expensive) no-op, and by extension using fallocate() with an offset of 0 like a ftruncate() call will succeed as long as the new size will fit. >> >> The man page of fallocate doesn't guarantee that. >> >> Unfortunately in a COW filesystem the assumption that an allocate area may be simply overwritten is not true. >> >> Let me to say it with others words: as general rule if you want to _write_ something in a cow filesystem, you need space. Doesn't matter if you are *over-writing* existing data or you are *appending* to a file. > Yes, you need space, but you don't need _all_ the space. 
For a file that already has data in it, you only _need_ as much space as the largest chunk of data that can be written at once at a low level, because the moment that first write finishes, the space that was used in the file for that region is freed, and the next write can go there. Put a bit differently, you only need to allocate what isn't allocated in the region, and then a bit more to handle the initial write to the file. > > Also, as I said below, _THIS WORKS ON ZFS_. That immediately means that a CoW filesystem _does not_ need to behave the way BTRFS does. It seems that ZFS on Linux doesn't support fallocate; see https://github.com/zfsonlinux/zfs/issues/326 So I think that you are referring to posix_fallocate and ZFS on Solaris, which I can't test, so I can't comment. [...] >> In terms of a COW filesystem, you need the space of a) + the space of b) > No, that is only required if the entire file needs to be written atomically. There is some maximal size atomic write that BTRFS can perform as a single operation at a low level (I'm not sure if this is equal to the block size, or larger, but it doesn't matter much, either way, I'm talking the largest chunk of data it will write to a disk in a single operation before updating metadata to point to that new data). To the best of my knowledge there is only a time limit: IIRC every 30 seconds a transaction is closed. If you are able to fill the filesystem in this time window you are in trouble. [...] -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Massive loss of disk space 2017-08-03 16:37 ` Goffredo Baroncelli @ 2017-08-03 17:23 ` Austin S. Hemmelgarn 2017-08-04 14:45 ` Goffredo Baroncelli 0 siblings, 1 reply; 26+ messages in thread From: Austin S. Hemmelgarn @ 2017-08-03 17:23 UTC (permalink / raw) To: kreijack, pwm, Hugo Mills; +Cc: linux-btrfs On 2017-08-03 12:37, Goffredo Baroncelli wrote: > On 2017-08-03 13:39, Austin S. Hemmelgarn wrote: >> On 2017-08-02 17:05, Goffredo Baroncelli wrote: >>> On 2017-08-02 21:10, Austin S. Hemmelgarn wrote: >>>> On 2017-08-02 13:52, Goffredo Baroncelli wrote: >>>>> Hi, >>>>> >>> [...] >>> >>>>> consider the following scenario: >>>>> >>>>> a) create a 2GB file >>>>> b) fallocate -o 1GB -l 2GB >>>>> c) write from 1GB to 3GB >>>>> >>>>> after b), the expectation is that c) always succeed [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space because there could be a small time window where both the old and the new data exists on the disk. >>> >>>> There is also an expectation based on pretty much every other FS in existence that calling fallocate() on a range that is already in use is a (possibly expensive) no-op, and by extension using fallocate() with an offset of 0 like a ftruncate() call will succeed as long as the new size will fit. >>> >>> The man page of fallocate doesn't guarantee that. >>> >>> Unfortunately in a COW filesystem the assumption that an allocate area may be simply overwritten is not true. >>> >>> Let me to say it with others words: as general rule if you want to _write_ something in a cow filesystem, you need space. Doesn't matter if you are *over-writing* existing data or you are *appending* to a file. >> Yes, you need space, but you don't need _all_ the space. 
For a file that already has data in it, you only _need_ as much space as the largest chunk of data that can be written at once at a low level, because the moment that first write finishes, the space that was used in the file for that region is freed, and the next write can go there. Put a bit differently, you only need to allocate what isn't allocated in the region, and then a bit more to handle the initial write to the file. >> >> Also, as I said below, _THIS WORKS ON ZFS_. That immediately means that a CoW filesystem _does not_ need to behave the way BTRFS does. > > It seems that ZFS on Linux doesn't support fallocate > > see https://github.com/zfsonlinux/zfs/issues/326 > > So I think that you are referring to posix_fallocate and ZFS on Solaris, which I can't test, so I can't comment. Both Solaris and FreeBSD (I've got a FreeNAS system at work I checked on). That said, I'm starting to wonder if just failing fallocate() calls to allocate space is actually the right thing to do here after all. Aside from this, we don't reserve metadata space for checksums and similar things for the eventual writes (so it's possible to get -ENOSPC on a write to an fallocate'ed region anyway because of metadata exhaustion), and splitting extents can also cause it to fail, so it's perfectly possible for the fallocate assumption to not hold on BTRFS. The irony of this is that if you're in a situation where you actually need to reserve space, you're more likely to fail (because if you actually _need_ to reserve the space, your filesystem may already be mostly full, and therefore any of the above issues may occur). On the specific note of splitting extents, the following will probably fail on BTRFS as well when done with a large enough FS (the turnover point ends up being the point at which 256MiB isn't enough space to account for all the extents), but will succeed with: 1. Create filesystem and mount it.
On BTRFS, make sure autodefrag is off (this makes it fail more reliably, but is not essential for it to fail). 2. Use fallocate to allocate as large a file as possible (in the BTRFS case, try for the size of the filesystem - 544 MiB (512 MiB for the metadata chunk, 32 MiB for the system chunk)). 3. Write half the file using 1MB blocks, skipping 1MB of space between each block (so every other 1MB of space is actually written to). 4. Write the other half of the file by filling in the holes. The net effect of this is to split the single large fallocate'd extent into a very large number of 1MB extents, which in turn eats up lots of metadata space and will eventually exhaust it. While this specific exercise requires a large filesystem, more generic real-world situations exist where this can happen (and I have had this happen before). > > [...] >>> In terms of a COW filesystem, you need the space of a) + the space of b) >> No, that is only required if the entire file needs to be written atomically. There is some maximal size atomic write that BTRFS can perform as a single operation at a low level (I'm not sure if this is equal to the block size, or larger, but it doesn't matter much, either way, I'm talking the largest chunk of data it will write to a disk in a single operation before updating metadata to point to that new data). > > To the best of my knowledge there is only a time limit: IIRC every 30 seconds a transaction is closed. If you are able to fill the filesystem in this time window you are in trouble. Even with that, it's still possible to implement the method I outlined by defining such a limit and forcing a transaction commit when that limit is hit. I'm also not entirely convinced that the transaction is the limiting factor here (I was under the impression that the transaction just updates the top-level metadata to point to the new tree of metadata). ^ permalink raw reply [flat|nested] 26+ messages in thread
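The extent-splitting exercise above can be modeled abstractly. The sketch below (illustrative only; it counts logical extents, not real btrfs metadata items) shows why the alternating write pattern is the worst case: every 1 MiB block lands next to still-unwritten neighbours, so nothing coalesces unless a defragmenting pass merges adjacent extents afterwards:

```python
def extent_count(file_mib, merge_adjacent=False):
    """Toy model of the exercise: one fallocated extent of `file_mib`
    MiB is overwritten in 1 MiB blocks, even offsets first, then odd.
    With no merging (autodefrag off, the btrfs case described), every
    block stays its own extent; with idealized merging, adjacent
    written blocks coalesce into runs."""
    written = sorted(set(range(0, file_mib, 2)) | set(range(1, file_mib, 2)))
    if not merge_adjacent:
        return len(written)
    runs = 1
    for a, b in zip(written, written[1:]):
        if b != a + 1:
            runs += 1
    return runs

# A single 4096 MiB fallocated extent degenerates into 4096 one-MiB
# extents, each needing its own metadata entry; with idealized merging
# the whole file would collapse back to a single extent:
print(extent_count(4096))                       # 4096
print(extent_count(4096, merge_adjacent=True))  # 1
```

Each of those extents needs bookkeeping in the extent tree, which is how the pattern exhausts metadata space on a filesystem large enough that the reserve cannot absorb it.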
* Re: Massive loss of disk space 2017-08-03 17:23 ` Austin S. Hemmelgarn @ 2017-08-04 14:45 ` Goffredo Baroncelli 2017-08-04 15:05 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 26+ messages in thread From: Goffredo Baroncelli @ 2017-08-04 14:45 UTC (permalink / raw) To: Austin S. Hemmelgarn, pwm, Hugo Mills; +Cc: linux-btrfs On 2017-08-03 19:23, Austin S. Hemmelgarn wrote: > On 2017-08-03 12:37, Goffredo Baroncelli wrote: >> On 2017-08-03 13:39, Austin S. Hemmelgarn wrote: [...] >>> Also, as I said below, _THIS WORKS ON ZFS_. That immediately means that a CoW filesystem _does not_ need to behave the way BTRFS does. >> >> It seems that ZFS on Linux doesn't support fallocate >> >> see https://github.com/zfsonlinux/zfs/issues/326 >> >> So I think that you are referring to posix_fallocate and ZFS on Solaris, which I can't test, so I can't comment. > Both Solaris and FreeBSD (I've got a FreeNAS system at work I checked on). For fun I checked the FreeBSD and ZFS sources. To me it seems that ZFS on FreeBSD doesn't implement posix_fallocate() (VOP_ALLOCATE in FreeBSD jargon), but instead relies on the FreeBSD default one. http://fxr.watson.org/fxr/source/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c#L7212 Following the chain of function pointers http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=10#L110 it seems that the FreeBSD vop_allocate() is implemented in vop_stdallocate() http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=excerpts#L912 which simply calls read() and write() on the range [offset...offset+len), which for a "conventional" filesystem ensures block allocation. Of course it is an expensive solution. So I think (but I am not familiar with FreeBSD) that ZFS doesn't implement a real posix_fallocate but tries to simulate it. Of course this doesn't > > That said, I'm starting to wonder if just failing fallocate() calls to allocate space is actually the right thing to do here after all.
Aside from this, we don't reserve metadata space for checksums and similar things for the eventual writes (so it's possible to get -ENOSPC on a write to an fallocate'ed region anyway because of metadata exhaustion), and splitting extents can also cause it to fail, so it's perfectly possible for the fallocate assumption to not hold on BTRFS. posix_fallocate in BTRFS is not reliable for another reason. This syscall guarantees that a block group (BG) is allocated, but I think that the allocated BG is available to all processes, so a parallel process may exhaust all the available space before the first process uses it. My opinion is that BTRFS is not reliable when the space is exhausted, so it needs to keep an amount of disk space free. The size of this disk space should be O(2*size_of_biggest_write), and for an operation like fallocate this means O(2*length). I think it is no coincidence that the fallocate implemented by ZFS on Linux works only in FALLOC_FL_PUNCH_HOLE mode. https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zpl_file.c#L662 [...] /* * The only flag combination which matches the behavior of zfs_space() * is FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE. The FALLOC_FL_PUNCH_HOLE * flag was introduced in the 2.6.38 kernel. */ #if defined(HAVE_FILE_FALLOCATE) || defined(HAVE_INODE_FALLOCATE) long zpl_fallocate_common(struct inode *ip, int mode, loff_t offset, loff_t len) { int error = -EOPNOTSUPP; #if defined(FALLOC_FL_PUNCH_HOLE) && defined(FALLOC_FL_KEEP_SIZE) cred_t *cr = CRED(); flock64_t bf; loff_t olen; fstrans_cookie_t cookie; if (mode != (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) return (error); [...] -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Massive loss of disk space 2017-08-04 14:45 ` Goffredo Baroncelli @ 2017-08-04 15:05 ` Austin S. Hemmelgarn 0 siblings, 0 replies; 26+ messages in thread From: Austin S. Hemmelgarn @ 2017-08-04 15:05 UTC (permalink / raw) To: kreijack, pwm, Hugo Mills; +Cc: linux-btrfs On 2017-08-04 10:45, Goffredo Baroncelli wrote: > On 2017-08-03 19:23, Austin S. Hemmelgarn wrote: >> On 2017-08-03 12:37, Goffredo Baroncelli wrote: >>> On 2017-08-03 13:39, Austin S. Hemmelgarn wrote: > [...] > >>>> Also, as I said below, _THIS WORKS ON ZFS_. That immediately means that a CoW filesystem _does not_ need to behave like BTRFS is. >>> >>> It seems that ZFS on linux doesn't support fallocate >>> >>> see https://github.com/zfsonlinux/zfs/issues/326 >>> >>> So I think that you are referring to a posix_fallocate and ZFS on solaris, which I can't test so I can't comment. >> Both Solaris, and FreeBSD (I've got a FreeNAS system at work i checked on). > > For fun I checked the freebsd source and zfs source. To me it seems that ZFS on freebsd doesn't implement posix_fallocate() (VOP_ALLOCATE in freebas jargon), but instead relies on the freebsd default one. > > http://fxr.watson.org/fxr/source/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c#L7212 > > Following the chain of function pointers > > http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=10#L110 > > it seems that the freebsd vop_allocate() is implemented in vop_stdallocate() > > http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=excerpts#L912 > > which simply calls read() and write() on the range [offset...offset+len), which for a "conventional" filesystem ensure the block allocation. Of course it is an expensive solution. > > So I think (but I am not familiar with freebsd) that ZFS doesn't implement a real posix_allocate but it try to simulate it. Of course this don't From a practical perspective though, posix_fallocate() doesn't matter, because almost everything uses the native fallocate call if at all possible. 
As you mention, FreeBSD is emulating it, but that 'emulation' provides behavior that is close enough to what is required that it doesn't matter. As a matter of perspective, posix_fallocate() is emulated on Linux too, see my reply below to your later comment about posix_fallocate() on BTRFS. Internally ZFS also keeps _some_ space reserved so it doesn't get wedged like BTRFS does when near full, and they don't do the whole data versus metadata segregation crap, so from a practical perspective, what FreeBSD's ZFS implementation does is sufficient because of the internal structure and handling of writes in ZFS. > > >> >> That said, I'm starting to wonder if just failing fallocate() calls to allocate space is actually the right thing to do here after all. Aside from this, we don't reserve metadata space for checksums and similar things for the eventual writes (so it's possible to get -ENOSPC on a write to an fallocate'ed region anyway because of metadata exhaustion), and splitting extents can also cause it to fail, so it's perfectly possible for the fallocate assumption to not hole on BTRFS. > > posix_fallocate in BTRFS is not reliable for another reason. This syscall guarantees that a BG is allocated, but I think that the allocated BG is available to all processes, so a parallel process my exhaust all the available space before the first process uses it. As mentioned above, posix_fallocate() is emulated in libc on Linux by calling the regular fallocate() if the FS supports it (which BTRFS does), or by writing out data like FreeBSD does in the kernel if the FS doesn't support fallocate(). IOW, posix_fallocate() has the exact same issues on BTRFS as Linux's fallocate() syscall does. > > My opinion is that BTRFS is not reliable when the space is exhausted, so it needs to work with an amount of disk space free. The size of this disk space should be O(2*size_of_biggest_write), and for operation like fallocate this means O(2*length). 
Again, this arises from how we handle writes. If we were to track
blocks that have had fallocate called on them and only use those (for
the first write at least) for writes to the file that had fallocate
called on them (as well as breaking reflinks on them when fallocate is
called), then we could get away with just using the size of the
biggest write plus a little bit more space for _data_, but even then
we need space for metadata (which we don't appear to track right now).

> I think it is no accident that the fallocate implemented by
> ZFSONLINUX works with the flag FALLOC_FL_PUNCH_HOLE mode.
>
> https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zpl_file.c#L662
> [...]
> /*
>  * The only flag combination which matches the behavior of zfs_space()
>  * is FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE. The
>  * FALLOC_FL_PUNCH_HOLE flag was introduced in the 2.6.38 kernel.
>  */
> #if defined(HAVE_FILE_FALLOCATE) || defined(HAVE_INODE_FALLOCATE)
> long
> zpl_fallocate_common(struct inode *ip, int mode, loff_t offset, loff_t len)
> {
>     int error = -EOPNOTSUPP;
>
> #if defined(FALLOC_FL_PUNCH_HOLE) && defined(FALLOC_FL_KEEP_SIZE)
>     cred_t *cr = CRED();
>     flock64_t bf;
>     loff_t olen;
>     fstrans_cookie_t cookie;
>
>     if (mode != (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
>         return (error);
>
> [...]

^ permalink raw reply	[flat|nested] 26+ messages in thread
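The read-and-write emulation strategy discussed earlier in the thread
(FreeBSD's vop_stdallocate(), and glibc's fallback when a filesystem
lacks native fallocate support) can be sketched in userspace. This is
an illustration of the approach, not anyone's actual implementation;
the function name is made up for the example:

```python
import os
import tempfile

def emulate_posix_fallocate(fd, offset, length, blocksize=4096):
    # Read each block in [offset, offset+length) and write it back,
    # in the spirit of FreeBSD's vop_stdallocate(): on a conventional
    # (non-CoW) filesystem the write forces block allocation.
    # Expensive, and not safe against concurrent writers.
    end = offset + length
    pos = offset
    while pos < end:
        n = min(blocksize, end - pos)
        buf = os.pread(fd, n, pos)   # short or empty read past EOF
        buf = buf.ljust(n, b"\0")    # pad the tail with zeros
        os.pwrite(fd, buf, pos)
        pos += n
    os.fsync(fd)

fd, path = tempfile.mkstemp()
try:
    emulate_posix_fallocate(fd, 0, 16384)
    size = os.fstat(fd).st_size      # the file now covers the range
finally:
    os.close(fd)
    os.unlink(path)
```

Note that, exactly as discussed above, this emulation gives no real
guarantee on a CoW filesystem: the rewritten blocks can still be
relocated on the next write.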
* Re: Massive loss of disk space
  2017-08-02 17:52 ` Goffredo Baroncelli
  2017-08-02 19:10 ` Austin S. Hemmelgarn
@ 2017-08-03  3:48 ` Duncan
  2017-08-03 11:44 ` Marat Khalili
  2 siblings, 0 replies; 26+ messages in thread
From: Duncan @ 2017-08-03  3:48 UTC (permalink / raw)
  To: linux-btrfs

Goffredo Baroncelli posted on Wed, 02 Aug 2017 19:52:30 +0200 as
excerpted:

> it seems that BTRFS always allocates the maximum space required,
> without considering the space already allocated. Is it too
> conservative? I think no: consider the following scenario:
>
> a) create a 2GB file
> b) fallocate -o 1GB -l 2GB
> c) write from 1GB to 3GB
>
> after b), the expectation is that c) always succeeds [1]: i.e. there
> is enough space on the filesystem. Due to the COW nature of BTRFS,
> you cannot rely on the already allocated space because there could be
> a small time window where both the old and the new data exist on the
> disk.

Not only a small time, perhaps (effectively) permanently, due to
either of two factors:

1) If the existing extents are reflinked by snapshots or other files,
they obviously won't be released at all when the overwrite is
completed. fallocate must account for this possibility, and behaving
differently in the context of other reflinks would be confusing, so
the best policy is to consistently behave as if the existing data will
not be freed.

2) As the devs have commented a number of times, an extent isn't freed
if there's still a reflink to part of it.
If the original extent was a full 1 GiB data chunk (the chunk being
the max size of a native btrfs extent, one of the reasons a balance
and defrag after conversion from ext4 and deletion of the ext4-saved
subvolume is recommended: to break up the longer ext4 extents so they
won't cause btrfs problems later) and all but a single 4 KiB block has
been rewritten, the full 1 GiB extent will remain referenced and
continue to take that original full 1 GiB of space, *plus* the space
of all the new-version extents of the overwritten data, of course.

So in our fallocate-and-overwrite scenario, we again must reserve
space for two copies of the data: the original, which may well not be
freed even without other reflinks if a single 4 KiB block of an extent
remains unoverwritten, and the new version of the data.

At least that /was/ the behavior explained on-list previous to the
hole-punching changes. I'm not a dev and haven't seen a dev comment on
whether that remains the behavior after hole-punching, which may at
least naively be expected to automatically handle and free overwritten
data using hole-punching, or not. I'd be interested in seeing someone
who can read the code confirm one way or the other whether
hole-punching changed that previous behavior.

> My opinion is that in general this behavior is correct due to the COW
> nature of BTRFS.
> The only exception that I can find is for "nocow" files. For these
> cases, taking into account the already allocated space would be
> better.

I'd say it's dangerously optimistic even then, considering that
"nocow" is actually "cow1" in the presence of snapshots.

Meanwhile, it's worth keeping in mind that it's exactly these sorts of
corner-cases that are why btrfs is taking so long to stabilize.
Supposedly "simple" expectations aren't always so simple, and if a
filesystem gets it wrong, it's somebody's data hanging in the balance!
(Tho if they've any wisdom at all, they'll ensure they're aware of the
stability status of a filesystem before they put data on it, and will
adjust their backup policies accordingly if they're using a still not
fully stabilized filesystem such as btrfs, so the data won't actually
be in any danger anyway unless it was literally throw-away value, only
whatever specific instance of it was involved in that corner-case.)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

^ permalink raw reply	[flat|nested] 26+ messages in thread
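Duncan's 1 GiB example above can be put into numbers. The figures
below simply restate the worst case described in the message (extent
and block sizes as given there); this is back-of-the-envelope
arithmetic, not anything read out of the btrfs reservation code:

```python
GiB = 1 << 30
KiB = 1 << 10

extent = 1 * GiB               # a full-sized btrfs data extent
pinned_by = 4 * KiB            # one still-referenced block keeps the
                               # whole old extent alive
rewritten = extent - pinned_by # everything else rewritten out of place

# The old extent stays fully allocated (held down by the single 4 KiB
# block) *plus* the new copies of the rewritten data:
worst_case = extent + rewritten

overhead = worst_case / extent # just under 2x the logical data size
```

This is the arithmetic behind fallocate reserving the full requested
range on btrfs: in the worst case, overwriting N bytes of existing
data can consume very nearly 2N bytes of allocated space.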
* Re: Massive loss of disk space
  2017-08-02 17:52 ` Goffredo Baroncelli
  2017-08-02 19:10 ` Austin S. Hemmelgarn
  2017-08-03  3:48 ` Duncan
@ 2017-08-03 11:44 ` Marat Khalili
  2017-08-03 11:52 ` Austin S. Hemmelgarn
  2017-08-03 16:01 ` Goffredo Baroncelli
  2 siblings, 2 replies; 26+ messages in thread
From: Marat Khalili @ 2017-08-03 11:44 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, linux-btrfs; +Cc: kreijack, pwm, Hugo Mills

On 02/08/17 20:52, Goffredo Baroncelli wrote:
> consider the following scenario:
>
> a) create a 2GB file
> b) fallocate -o 1GB -l 2GB
> c) write from 1GB to 3GB
>
> after b), the expectation is that c) always succeed [1]: i.e. there
> is enough space on the filesystem. Due to the COW nature of BTRFS,
> you cannot rely on the already allocated space because there could be
> a small time window where both the old and the new data exists on the
> disk.

Just curious. With current implementation, in the following case:
a) create a 2GB file1 && create a 2GB file2
b) fallocate -o 1GB -l 2GB file1 && fallocate -o 1GB -l 2GB file2
c) write from 1GB to 3GB file1 && write from 1GB to 3GB file2
will (c) always succeed? I.e. does fallocate really allocate 2GB per
file, or does it only allocate additional 1GB and check free space for
another 1GB? If it's only the latter, it is useless.

-- 
With Best Regards,
Marat Khalili

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: Massive loss of disk space
  2017-08-03 11:44 ` Marat Khalili
@ 2017-08-03 11:52 ` Austin S. Hemmelgarn
  2017-08-03 16:01 ` Goffredo Baroncelli
  1 sibling, 0 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-03 11:52 UTC (permalink / raw)
  To: Marat Khalili, linux-btrfs; +Cc: kreijack, pwm, Hugo Mills

On 2017-08-03 07:44, Marat Khalili wrote:
> On 02/08/17 20:52, Goffredo Baroncelli wrote:
>> consider the following scenario:
>>
>> a) create a 2GB file
>> b) fallocate -o 1GB -l 2GB
>> c) write from 1GB to 3GB
>>
>> after b), the expectation is that c) always succeed [1]: i.e. there
>> is enough space on the filesystem. Due to the COW nature of BTRFS,
>> you cannot rely on the already allocated space because there could
>> be a small time window where both the old and the new data exists on
>> the disk.
> Just curious. With current implementation, in the following case:
> a) create a 2GB file1 && create a 2GB file2
> b) fallocate -o 1GB -l 2GB file1 && fallocate -o 1GB -l 2GB file2
> c) write from 1GB to 3GB file1 && write from 1GB to 3GB file2
> will (c) always succeed? I.e. does fallocate really allocate 2GB per
> file, or does it only allocate additional 1GB and check free space
> for another 1GB? If it's only the latter, it is useless.

It will currently allocate 4GB total in this case (2 for each file),
and _should_ succeed. I think there are corner cases where it can fail
though because of metadata exhaustion, and I'm still not certain we
don't CoW unwritten extents (if we do CoW unwritten extents, then
this, and all fallocate allocation for that matter, becomes
non-deterministic as to whether or not it succeeds).

^ permalink raw reply	[flat|nested] 26+ messages in thread
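The accounting in Austin's answer ("4GB total ... 2 for each file")
can be expressed as a simple rule: under CoW worst-case accounting,
fallocate reserves the whole requested range, counting the overlap
with existing data as if it will be rewritten out of place. A rough
model of that rule (the function name is invented for illustration;
this is not btrfs's actual reservation code):

```python
GiB = 1 << 30

def cow_fallocate_reservation(file_size, offset, length):
    # Worst-case model: the part of the range overlapping existing
    # data must be reserved again (the old extents may stay referenced
    # while the new CoW copies are written), and the part past EOF is
    # new allocation either way.
    req_end = offset + length
    overlap = max(0, min(file_size, req_end) - offset)
    past_eof = max(0, req_end - max(file_size, offset))
    return overlap + past_eof

# Marat's scenario: a 2 GiB file, then `fallocate -o 1GB -l 2GB`
per_file = cow_fallocate_reservation(2 * GiB, 1 * GiB, 2 * GiB)
total = 2 * per_file   # two such files, as in steps a) and b)
```

A non-CoW filesystem would reserve only the `past_eof` part here
(1 GiB per file); the CoW worst case doubles it to 2 GiB per file,
4 GiB total, matching the figure in the reply above.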
* Re: Massive loss of disk space
  2017-08-03 11:44 ` Marat Khalili
  2017-08-03 11:52 ` Austin S. Hemmelgarn
@ 2017-08-03 16:01 ` Goffredo Baroncelli
  2017-08-03 17:15 ` Marat Khalili
  2017-08-03 22:51 ` pwm
  1 sibling, 2 replies; 26+ messages in thread
From: Goffredo Baroncelli @ 2017-08-03 16:01 UTC (permalink / raw)
  To: Marat Khalili, Austin S. Hemmelgarn, linux-btrfs
  Cc: pwm, Hugo Mills

On 2017-08-03 13:44, Marat Khalili wrote:
> On 02/08/17 20:52, Goffredo Baroncelli wrote:
>> consider the following scenario:
>>
>> a) create a 2GB file
>> b) fallocate -o 1GB -l 2GB
>> c) write from 1GB to 3GB
>>
>> after b), the expectation is that c) always succeed [1]: i.e. there
>> is enough space on the filesystem. Due to the COW nature of BTRFS,
>> you cannot rely on the already allocated space because there could
>> be a small time window where both the old and the new data exists on
>> the disk.
> Just curious. With current implementation, in the following case:
> a) create a 2GB file1 && create a 2GB file2
> b) fallocate -o 1GB -l 2GB file1 && fallocate -o 1GB -l 2GB file2

At this step you are trying to allocate 3GB+3GB = 6GB, so you have
exhausted the filesystem space.

> c) write from 1GB to 3GB file1 && write from 1GB to 3GB file2
> will (c) always succeed? I.e. does fallocate really allocate 2GB per
> file, or does it only allocate additional 1GB and check free space
> for another 1GB? If it's only the latter, it is useless.
The file is physically extended:

ghigo@venice:/tmp$ fallocate -l 1000 foo.txt
ghigo@venice:/tmp$ ls -l foo.txt
-rw-r--r-- 1 ghigo ghigo 1000 Aug  3 18:00 foo.txt
ghigo@venice:/tmp$ fallocate -o 500 -l 1000 foo.txt
ghigo@venice:/tmp$ ls -l foo.txt
-rw-r--r-- 1 ghigo ghigo 1500 Aug  3 18:00 foo.txt
ghigo@venice:/tmp$

>
> --
>
> With Best Regards,
> Marat Khalili
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 26+ messages in thread
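The size arithmetic in the demo above (a 1000-byte file, then
`fallocate -o 500 -l 1000` giving 1500 bytes) is simply
size = max(old_size, offset + length) when FALLOC_FL_KEEP_SIZE is not
used. The same result can be reproduced from Python's
os.posix_fallocate wrapper on a Linux filesystem (glibc falls back to
writing data where the filesystem lacks native fallocate support):

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"\0" * 1000)         # 1000-byte file, as in the demo
    os.posix_fallocate(fd, 500, 1000)  # like `fallocate -o 500 -l 1000`
    new_size = os.fstat(fd).st_size    # grows to offset + length
finally:
    os.close(fd)
    os.unlink(path)
```

Note this only shows the visible file-size semantics; as the thread
discusses, how much space is actually *reserved* underneath differs
between btrfs and non-CoW filesystems.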
* Re: Massive loss of disk space
  2017-08-03 16:01 ` Goffredo Baroncelli
@ 2017-08-03 17:15 ` Marat Khalili
  2017-08-03 17:25 ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 26+ messages in thread
From: Marat Khalili @ 2017-08-03 17:15 UTC (permalink / raw)
  To: kreijack, Goffredo Baroncelli, Austin S. Hemmelgarn, linux-btrfs
  Cc: pwm, Hugo Mills

On August 3, 2017 7:01:06 PM GMT+03:00, Goffredo Baroncelli wrote:
>The file is physically extended
>
>ghigo@venice:/tmp$ fallocate -l 1000 foo.txt

For clarity let's replace the fallocate above with:
$ head -c 1000 </dev/urandom >foo.txt

>ghigo@venice:/tmp$ ls -l foo.txt
>-rw-r--r-- 1 ghigo ghigo 1000 Aug  3 18:00 foo.txt
>ghigo@venice:/tmp$ fallocate -o 500 -l 1000 foo.txt
>ghigo@venice:/tmp$ ls -l foo.txt
>-rw-r--r-- 1 ghigo ghigo 1500 Aug  3 18:00 foo.txt
>ghigo@venice:/tmp$

According to the explanation by Austin, foo.txt at this point somehow
occupies 2000 bytes of space, because I can reflink it and then write
another 1000 bytes of data into it without losing the 1000 bytes I
already have or running out of drive space. (Or is it only true while
there are open file handles?)

-- 
With Best Regards,
Marat Khalili

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: Massive loss of disk space
  2017-08-03 17:15 ` Marat Khalili
@ 2017-08-03 17:25 ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-03 17:25 UTC (permalink / raw)
  To: Marat Khalili, kreijack, linux-btrfs; +Cc: pwm, Hugo Mills

On 2017-08-03 13:15, Marat Khalili wrote:
> On August 3, 2017 7:01:06 PM GMT+03:00, Goffredo Baroncelli
>> The file is physically extended
>>
>> ghigo@venice:/tmp$ fallocate -l 1000 foo.txt
>
> For clarity let's replace the fallocate above with:
> $ head -c 1000 </dev/urandom >foo.txt
>
>> ghigo@venice:/tmp$ ls -l foo.txt
>> -rw-r--r-- 1 ghigo ghigo 1000 Aug  3 18:00 foo.txt
>> ghigo@venice:/tmp$ fallocate -o 500 -l 1000 foo.txt
>> ghigo@venice:/tmp$ ls -l foo.txt
>> -rw-r--r-- 1 ghigo ghigo 1500 Aug  3 18:00 foo.txt
>> ghigo@venice:/tmp$
>
> According to the explanation by Austin, foo.txt at this point somehow
> occupies 2000 bytes of space, because I can reflink it and then write
> another 1000 bytes of data into it without losing the 1000 bytes I
> already have or running out of drive space. (Or is it only true while
> there are open file handles?)

OK, I think there may be some misunderstanding here. By 'CoW unwritten
extents', I mean that when we write to the extent, a CoW operation
happens, instead of the data being written directly into the extent.
In this case, it has nothing to do with reflinking, and Goffredo is
correct that if your filesystem is small enough, the second fallocate
will fail there.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: Massive loss of disk space
  2017-08-03 16:01 ` Goffredo Baroncelli
  2017-08-03 17:15 ` Marat Khalili
@ 2017-08-03 22:51 ` pwm
  1 sibling, 0 replies; 26+ messages in thread
From: pwm @ 2017-08-03 22:51 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: Marat Khalili, Austin S. Hemmelgarn, linux-btrfs, Hugo Mills

In 30 seconds I should be able to fill about 200MB * 30 = 6GB.
Requiring that the parity not grow by more than 6GB of additional
space is possible to live with on a 10TB disk.

It seems that for SnapRAID to have any chance of working correctly
with parity on a BTRFS partition, it would need a min-free
configuration parameter to make sure there is always enough free space
for one parity file update. But as it is right now, requiring that the
disk isn't filled past 50% because fallocate() wants enough free space
for 100% of the original file data to be rewritten obviously is not a
working solution.

Right now, it sounds like I should change all parity disks to a
different file system to avoid the CoW issue. There doesn't seem to be
any way to turn off CoW for an already existing file, and the parity
data is already way past 50% so I can't make a copy.

/Per W

On Thu, 3 Aug 2017, Goffredo Baroncelli wrote:

> On 2017-08-03 13:44, Marat Khalili wrote:
>> On 02/08/17 20:52, Goffredo Baroncelli wrote:
>>> consider the following scenario:
>>>
>>> a) create a 2GB file
>>> b) fallocate -o 1GB -l 2GB
>>> c) write from 1GB to 3GB
>>>
>>> after b), the expectation is that c) always succeed [1]: i.e. there
>>> is enough space on the filesystem. Due to the COW nature of BTRFS,
>>> you cannot rely on the already allocated space because there could
>>> be a small time window where both the old and the new data exists
>>> on the disk.
>> Just curious.
>> With current implementation, in the following case:
>> a) create a 2GB file1 && create a 2GB file2
>> b) fallocate -o 1GB -l 2GB file1 && fallocate -o 1GB -l 2GB file2
>
> At this step you are trying to allocate 3GB+3GB = 6GB, so you have
> exhausted the filesystem space.
>
>> c) write from 1GB to 3GB file1 && write from 1GB to 3GB file2
>> will (c) always succeed? I.e. does fallocate really allocate 2GB per
>> file, or does it only allocate additional 1GB and check free space
>> for another 1GB? If it's only the latter, it is useless.
>
> The file is physically extended
>
> ghigo@venice:/tmp$ fallocate -l 1000 foo.txt
> ghigo@venice:/tmp$ ls -l foo.txt
> -rw-r--r-- 1 ghigo ghigo 1000 Aug  3 18:00 foo.txt
> ghigo@venice:/tmp$ fallocate -o 500 -l 1000 foo.txt
> ghigo@venice:/tmp$ ls -l foo.txt
> -rw-r--r-- 1 ghigo ghigo 1500 Aug  3 18:00 foo.txt
> ghigo@venice:/tmp$
>
>>
>> --
>>
>> With Best Regards,
>> Marat Khalili
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: Massive loss of disk space
  2017-08-01 14:47 ` Austin S. Hemmelgarn
  2017-08-01 15:00 ` Austin S. Hemmelgarn
@ 2017-08-02  4:14 ` Duncan
  2017-08-02 11:18 ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 26+ messages in thread
From: Duncan @ 2017-08-02  4:14 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Tue, 01 Aug 2017 10:47:30 -0400 as
excerpted:

> I think I _might_ understand what's going on here. Is that test
> program calling fallocate using the desired total size of the file,
> or just trying to allocate the range beyond the end to extend the
> file? I've seen issues with the first case on BTRFS before, and I'm
> starting to think that it might actually be trying to allocate the
> exact amount of space requested by fallocate, even if part of the
> range is already allocated space.

If I've interpreted correctly (not being a dev, only a btrfs user,
sysadmin, and list regular) previous discussions I've seen on this
list...

That's exactly what it's doing, and it's _intended_ behavior.

The reasoning is something like this: fallocate is supposed to
pre-allocate some space with the intent being that writes into that
space won't fail, because the space is already allocated.

For an existing file with some data already in it, ext4 and xfs do
that counting the existing space.

But btrfs is copy-on-write, meaning it's going to have to write the
new data to a different location than the existing data, and it may
well not free up the existing allocation (if even a single 4k block of
the existing allocation remains unwritten, it will remain to hold down
the entire previous allocation, which isn't released until *none* of
it is still in use -- of course in normal usage "in use" can be due to
old snapshots or other reflinks to the same extent as well, tho in
these test cases it's not).

So in order to provide the guarantee that writes to preallocated space
shouldn't ENOSPC, btrfs can't count currently used space as part of
the fallocate.
The different behavior is entirely due to btrfs being COW, and thus a
choice having to be made: do we worst-case fallocate-reserve for
writes over currently used data that will have to be COWed elsewhere,
possibly without freeing the existing extents because there's still
something referencing them, or do we risk ENOSPCing on a write to a
previously fallocated area?

The choice was to worst-case-reserve and take the ENOSPC risk at
fallocate time, so the write into that fallocated space could then
proceed without the ENOSPC risk that COW would otherwise imply.

Make sense, or is my understanding a horrible misunderstanding? =:^)

So if you're actually only appending, fallocate the /additional/
space, not the /entire/ space, and you'll get what you need. But if
you're potentially overwriting what's there already, better fallocate
the entire space, which triggers the btrfs worst-case allocation
behavior you see, in order to guarantee it won't ENOSPC during the
actual write.

Of course the only time the behavior actually differs is with COW, but
then there's a BIG difference, but that BIG difference has a GOOD BIG
reason! =:^)

Tho that difference will certainly necessitate some relearning of the
/correct/ way to do it, for devs who were doing it the COW-worst-case
way all along, even if they didn't actually need to, because it didn't
happen to make a difference on what they happened to be testing on,
which happened not to be COW...

Reminds me of the way newer versions of gcc, and/or trying to build
with clang as well, tend to trigger relearning, because newer versions
are stricter in order to allow better optimization, and other
implementations are simply different in what they're strict on,
/because/ they're a different implementation. Well, btrfs is
stricter... because it's a different implementation that /has/ to be
stricter... due to COW.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

^ permalink raw reply	[flat|nested] 26+ messages in thread
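Duncan's "fallocate the /additional/ space, not the /entire/ space"
advice corresponds to calling fallocate with offset equal to the
current file size. A minimal sketch of the two call shapes, using
plain posix_fallocate so it also runs on non-CoW filesystems (file
names and sizes here are arbitrary examples):

```python
import os
import tempfile

GROW = 4096                      # how much we are about to append
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"x" * 8192)    # the file's existing contents
    size = os.fstat(fd).st_size

    # Append case: reserve only the tail about to be written; on
    # btrfs this avoids the worst-case reservation over existing data.
    os.posix_fallocate(fd, size, GROW)

    # Overwrite case (accepting the CoW worst-case reservation
    # described in the thread) would instead be:
    #   os.posix_fallocate(fd, 0, size + GROW)

    final = os.fstat(fd).st_size
finally:
    os.close(fd)
    os.unlink(path)
```

The two calls are equivalent on ext4 or xfs (existing space is counted
either way); only on a CoW filesystem does the choice change how much
free space the call demands.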
* Re: Massive loss of disk space
  2017-08-02  4:14 ` Duncan
@ 2017-08-02 11:18 ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-02 11:18 UTC (permalink / raw)
  To: linux-btrfs

On 2017-08-02 00:14, Duncan wrote:
> Austin S. Hemmelgarn posted on Tue, 01 Aug 2017 10:47:30 -0400 as
> excerpted:
>
>> I think I _might_ understand what's going on here. Is that test
>> program calling fallocate using the desired total size of the file,
>> or just trying to allocate the range beyond the end to extend the
>> file? I've seen issues with the first case on BTRFS before, and I'm
>> starting to think that it might actually be trying to allocate the
>> exact amount of space requested by fallocate, even if part of the
>> range is already allocated space.
>
> If I've interpreted correctly (not being a dev, only a btrfs user,
> sysadmin, and list regular) previous discussions I've seen on this
> list...
>
> That's exactly what it's doing, and it's _intended_ behavior.
>
> The reasoning is something like this: fallocate is supposed to
> pre-allocate some space with the intent being that writes into that
> space won't fail, because the space is already allocated.
>
> For an existing file with some data already in it, ext4 and xfs do
> that counting the existing space.
>
> But btrfs is copy-on-write, meaning it's going to have to write the
> new data to a different location than the existing data, and it may
> well not free up the existing allocation (if even a single 4k block
> of the existing allocation remains unwritten, it will remain to hold
> down the entire previous allocation, which isn't released until
> *none* of it is still in use -- of course in normal usage "in use"
> can be due to old snapshots or other reflinks to the same extent as
> well, tho in these test cases it's not).
>
> So in order to provide the guarantee that writes to preallocated
> space shouldn't ENOSPC, btrfs can't count currently used space as
> part of the fallocate.
>
> The different behavior is entirely due to btrfs being COW, and thus
> a choice having to be made: do we worst-case fallocate-reserve for
> writes over currently used data that will have to be COWed elsewhere,
> possibly without freeing the existing extents because there's still
> something referencing them, or do we risk ENOSPCing on a write to a
> previously fallocated area?
>
> The choice was to worst-case-reserve and take the ENOSPC risk at
> fallocate time, so the write into that fallocated space could then
> proceed without the ENOSPC risk that COW would otherwise imply.
>
> Make sense, or is my understanding a horrible misunderstanding? =:^)

Your reasoning is sound, except for the fact that at least on older
kernels (not sure if this is still the case), BTRFS will still perform
a COW operation when updating a fallocate'ed region.

> So if you're actually only appending, fallocate the /additional/
> space, not the /entire/ space, and you'll get what you need. But if
> you're potentially overwriting what's there already, better fallocate
> the entire space, which triggers the btrfs worst-case allocation
> behavior you see, in order to guarantee it won't ENOSPC during the
> actual write.
>
> Of course the only time the behavior actually differs is with COW,
> but then there's a BIG difference, but that BIG difference has a GOOD
> BIG reason! =:^)
>
> Tho that difference will certainly necessitate some relearning of the
> /correct/ way to do it, for devs who were doing it the COW-worst-case
> way all along, even if they didn't actually need to, because it
> didn't happen to make a difference on what they happened to be
> testing on, which happened not to be COW...
>
> Reminds me of the way newer versions of gcc, and/or trying to build
> with clang as well, tend to trigger relearning, because newer
> versions are stricter in order to allow better optimization, and
> other implementations are simply different in what they're strict on,
> /because/ they're a different implementation. Well, btrfs is
> stricter... because it's a different implementation that /has/ to be
> stricter... due to COW.

Except that that strictness breaks userspace programs that are doing
perfectly reasonable things.

^ permalink raw reply	[flat|nested] 26+ messages in thread
end of thread, other threads:[~2017-08-04 15:05 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-01 11:43 Massive loss of disk space pwm
2017-08-01 12:20 ` Hugo Mills
2017-08-01 14:39 ` pwm
2017-08-01 14:47 ` Austin S. Hemmelgarn
2017-08-01 15:00 ` Austin S. Hemmelgarn
2017-08-01 15:24 ` pwm
2017-08-01 15:45 ` Austin S. Hemmelgarn
2017-08-01 16:50 ` pwm
2017-08-01 17:04 ` Austin S. Hemmelgarn
2017-08-02 17:52 ` Goffredo Baroncelli
2017-08-02 19:10 ` Austin S. Hemmelgarn
2017-08-02 21:05 ` Goffredo Baroncelli
2017-08-03 11:39 ` Austin S. Hemmelgarn
2017-08-03 16:37 ` Goffredo Baroncelli
2017-08-03 17:23 ` Austin S. Hemmelgarn
2017-08-04 14:45 ` Goffredo Baroncelli
2017-08-04 15:05 ` Austin S. Hemmelgarn
2017-08-03  3:48 ` Duncan
2017-08-03 11:44 ` Marat Khalili
2017-08-03 11:52 ` Austin S. Hemmelgarn
2017-08-03 16:01 ` Goffredo Baroncelli
2017-08-03 17:15 ` Marat Khalili
2017-08-03 17:25 ` Austin S. Hemmelgarn
2017-08-03 22:51 ` pwm
2017-08-02  4:14 ` Duncan
2017-08-02 11:18 ` Austin S. Hemmelgarn