* Massive loss of disk space
@ 2017-08-01 11:43 pwm
2017-08-01 12:20 ` Hugo Mills
0 siblings, 1 reply; 26+ messages in thread
From: pwm @ 2017-08-01 11:43 UTC (permalink / raw)
To: linux-btrfs
I have a 10TB file system holding a parity file for snapraid. However, I
suddenly cannot extend the parity file, despite the file system being only
about 50% full - I should have 5TB of unallocated space. When trying to
extend the parity file, fallocate() just returns ENOSPC, i.e. the disk
is full.
The machine originally ran Debian 8 (Jessie), but after I detected the issue
and no btrfs tool showed any errors, I updated to Debian 9 (Stretch)
to get a newer kernel and newer btrfs tools.
pwm@europium:/mnt$ btrfs --version
btrfs-progs v4.7.3
pwm@europium:/mnt$ uname -a
Linux europium 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26)
x86_64 GNU/Linux
pwm@europium:/mnt/snap_04$ ls -l
total 4932703608
-rw------- 1 root root 319148889 Jul 8 04:21 snapraid.content
-rw------- 1 root root 283115520 Aug 1 04:08 snapraid.content.tmp
-rw------- 1 root root 5050486226944 Jul 31 17:14 snapraid.parity
pwm@europium:/mnt/snap_04$ df .
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdg1 9766434816 4944614648 4819831432 51% /mnt/snap_04
pwm@europium:/mnt/snap_04$ sudo btrfs fi show .
Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
Total devices 1 FS bytes used 4.60TiB
devid 1 size 9.09TiB used 9.09TiB path /dev/sdg1
Compare this with the second snapraid parity disk:
pwm@europium:/mnt/snap_04$ sudo btrfs fi show /mnt/snap_05/
Label: 'snap_05' uuid: bac477e3-e78c-43ee-8402-6bdfff194567
Total devices 1 FS bytes used 4.69TiB
devid 1 size 9.09TiB used 4.70TiB path /dev/sdi1
So one parity disk shows devid usage of 9.09TiB - the other only 4.70TiB,
despite almost the same amount of file system usage and an almost identical
usage pattern. It's an archival RAID, so there are hardly any writes to the
parity files because there are almost no changes to the data files.
The main usage is that the parity file gets extended when one of the data
disks reaches a new high-water mark.
The only file that gets regularly rewritten is the snapraid.content file
that gets regenerated after every scrub.
pwm@europium:/mnt/snap_04$ sudo btrfs fi df .
Data, single: total=9.08TiB, used=4.59TiB
System, DUP: total=8.00MiB, used=992.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=6.00GiB, used=4.81GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B
pwm@europium:/mnt/snap_04$ sudo btrfs filesystem du .
Total Exclusive Set shared Filename
4.59TiB 4.59TiB - ./snapraid.parity
304.37MiB 304.37MiB - ./snapraid.content
270.00MiB 270.00MiB - ./snapraid.content.tmp
4.59TiB 4.59TiB 0.00B .
pwm@europium:/mnt/snap_04$ sudo btrfs filesystem usage .
Overall:
Device size: 9.09TiB
Device allocated: 9.09TiB
Device unallocated: 0.00B
Device missing: 0.00B
Used: 4.60TiB
Free (estimated): 4.49TiB (min: 4.49TiB)
Data ratio: 1.00
Metadata ratio: 2.00
Global reserve: 512.00MiB (used: 0.00B)
Data,single: Size:9.08TiB, Used:4.59TiB
/dev/sdg1 9.08TiB
Metadata,single: Size:8.00MiB, Used:0.00B
/dev/sdg1 8.00MiB
Metadata,DUP: Size:6.00GiB, Used:4.81GiB
/dev/sdg1 12.00GiB
System,single: Size:4.00MiB, Used:0.00B
/dev/sdg1 4.00MiB
System,DUP: Size:8.00MiB, Used:992.00KiB
/dev/sdg1 16.00MiB
Unallocated:
/dev/sdg1 0.00B
pwm@europium:~$ sudo btrfs check /dev/sdg1
Checking filesystem on /dev/sdg1
UUID: c46df8fa-03db-4b32-8beb-5521d9931a31
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
found 5057294639104 bytes used err is 0
total csum bytes: 4529856120
total tree bytes: 5170151424
total fs tree bytes: 178700288
total extent tree bytes: 209616896
btree space waste bytes: 182357204
file data blocks allocated: 5073330888704
referenced 5052040339456
pwm@europium:~$ sudo btrfs scrub status /mnt/snap_04/
scrub status for c46df8fa-03db-4b32-8beb-5521d9931a31
scrub started at Mon Jul 31 21:26:50 2017 and finished after
06:53:47
total bytes scrubbed: 4.60TiB with 0 errors
So where has my 5TB of disk space gone?
And what should I do to get it back?
I could obviously reformat the partition and rebuild the parity, since I
still have one good parity, but that doesn't feel like a good route. It
isn't impossible that this might happen again.
/Per W
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Massive loss of disk space
2017-08-01 11:43 Massive loss of disk space pwm
@ 2017-08-01 12:20 ` Hugo Mills
2017-08-01 14:39 ` pwm
0 siblings, 1 reply; 26+ messages in thread
From: Hugo Mills @ 2017-08-01 12:20 UTC (permalink / raw)
To: pwm; +Cc: linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 5847 bytes --]
Hi, Per,
Start here:
https://btrfs.wiki.kernel.org/index.php/FAQ#if_your_device_is_large_.28.3E16GiB.29
In your case, I'd suggest using "-dusage=20" to start with, as
it'll probably free up quite a lot of your existing allocation.
And this may also be of interest, in how to read the output of the
tools:
https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools
Finally, I note that you've still got some "single" chunks present
for metadata. It won't affect your space allocation issues, but I
would recommend getting rid of them anyway:
# btrfs balance start -mconvert=dup,soft /mnt/snap_04
Hugo.
On Tue, Aug 01, 2017 at 01:43:23PM +0200, pwm wrote:
> I have a 10TB file system with a parity file for a snapraid.
> However, I can suddenly not extend the parity file despite the file
> system only being about 50% filled - I should have 5TB of
> unallocated space. When trying to extend the parity file,
> fallocate() just returns ENOSPC, i.e. that the disk is full.
>
> Machine was originally a Debian 8 (Jessie) but after I detected the
> issue and no btrfs tool did show any errors, I have updated to
> Debian 9 (Stretch) to get a newer kernel and newer btrfs tools.
>
> pwm@europium:/mnt$ btrfs --version
> btrfs-progs v4.7.3
> pwm@europium:/mnt$ uname -a
> Linux europium 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u2
> (2017-06-26) x86_64 GNU/Linux
>
>
>
>
> pwm@europium:/mnt/snap_04$ ls -l
> total 4932703608
> -rw------- 1 root root 319148889 Jul 8 04:21 snapraid.content
> -rw------- 1 root root 283115520 Aug 1 04:08 snapraid.content.tmp
> -rw------- 1 root root 5050486226944 Jul 31 17:14 snapraid.parity
>
>
>
> pwm@europium:/mnt/snap_04$ df .
> Filesystem 1K-blocks Used Available Use% Mounted on
> /dev/sdg1 9766434816 4944614648 4819831432 51% /mnt/snap_04
>
>
>
> pwm@europium:/mnt/snap_04$ sudo btrfs fi show .
> Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
> Total devices 1 FS bytes used 4.60TiB
> devid 1 size 9.09TiB used 9.09TiB path /dev/sdg1
>
> Compare this with the second snapraid parity disk:
> pwm@europium:/mnt/snap_04$ sudo btrfs fi show /mnt/snap_05/
> Label: 'snap_05' uuid: bac477e3-e78c-43ee-8402-6bdfff194567
> Total devices 1 FS bytes used 4.69TiB
> devid 1 size 9.09TiB used 4.70TiB path /dev/sdi1
>
> So on one parity disk, devid is 9.09TiB used - on the other only 4.70TiB.
> While almost the same amount of file system usage. And almost
> identical usage pattern. It's an archival RAID, so there is hardly
> any writes to the parity files because there are almost no file
> changes to the data files. The main usage is that the parity file
> gets extended when one of the data disks reaches a new high water
> mark.
>
> The only file that gets regularly rewritten is the snapraid.content
> file that gets regenerated after every scrub.
>
>
>
> pwm@europium:/mnt/snap_04$ sudo btrfs fi df .
> Data, single: total=9.08TiB, used=4.59TiB
> System, DUP: total=8.00MiB, used=992.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, DUP: total=6.00GiB, used=4.81GiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
>
>
> pwm@europium:/mnt/snap_04$ sudo btrfs filesystem du .
> Total Exclusive Set shared Filename
> 4.59TiB 4.59TiB - ./snapraid.parity
> 304.37MiB 304.37MiB - ./snapraid.content
> 270.00MiB 270.00MiB - ./snapraid.content.tmp
> 4.59TiB 4.59TiB 0.00B .
>
>
>
> pwm@europium:/mnt/snap_04$ sudo btrfs filesystem usage .
> Overall:
> Device size: 9.09TiB
> Device allocated: 9.09TiB
> Device unallocated: 0.00B
> Device missing: 0.00B
> Used: 4.60TiB
> Free (estimated): 4.49TiB (min: 4.49TiB)
> Data ratio: 1.00
> Metadata ratio: 2.00
> Global reserve: 512.00MiB (used: 0.00B)
>
> Data,single: Size:9.08TiB, Used:4.59TiB
> /dev/sdg1 9.08TiB
>
> Metadata,single: Size:8.00MiB, Used:0.00B
> /dev/sdg1 8.00MiB
>
> Metadata,DUP: Size:6.00GiB, Used:4.81GiB
> /dev/sdg1 12.00GiB
>
> System,single: Size:4.00MiB, Used:0.00B
> /dev/sdg1 4.00MiB
>
> System,DUP: Size:8.00MiB, Used:992.00KiB
> /dev/sdg1 16.00MiB
>
> Unallocated:
> /dev/sdg1 0.00B
>
>
>
> pwm@europium:~$ sudo btrfs check /dev/sdg1
> Checking filesystem on /dev/sdg1
> UUID: c46df8fa-03db-4b32-8beb-5521d9931a31
> checking extents
> checking free space cache
> checking fs roots
> checking csums
> checking root refs
> found 5057294639104 bytes used err is 0
> total csum bytes: 4529856120
> total tree bytes: 5170151424
> total fs tree bytes: 178700288
> total extent tree bytes: 209616896
> btree space waste bytes: 182357204
> file data blocks allocated: 5073330888704
> referenced 5052040339456
>
>
>
> pwm@europium:~$ sudo btrfs scrub status /mnt/snap_04/
> scrub status for c46df8fa-03db-4b32-8beb-5521d9931a31
> scrub started at Mon Jul 31 21:26:50 2017 and finished after
> 06:53:47
> total bytes scrubbed: 4.60TiB with 0 errors
>
>
>
> So where have my 5TB disk space gone lost?
> And what should I do to be able to get it back again?
>
> I could obviously reformat the partition and rebuild the parity
> since I still have one good parity, but that doesn't feel like a
> good route. It isn't impossible this might happen again.
>
> /Per W
--
Hugo Mills | Well, sir, the floor is yours. But remember, the
hugo@... carfax.org.uk | roof is ours!
http://carfax.org.uk/ |
PGP: E2AB1DE4 | The Goons
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Massive loss of disk space
2017-08-01 12:20 ` Hugo Mills
@ 2017-08-01 14:39 ` pwm
2017-08-01 14:47 ` Austin S. Hemmelgarn
0 siblings, 1 reply; 26+ messages in thread
From: pwm @ 2017-08-01 14:39 UTC (permalink / raw)
To: Hugo Mills; +Cc: linux-btrfs
Thanks for the links and suggestions.
I tried your suggestions, but they didn't solve the underlying problem.
pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04
Dumping filters: flags 0x1, state 0x0, force is off
DATA (flags 0x2): balancing, usage=20
Done, had to relocate 4596 out of 9317 chunks
pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft /mnt/snap_04/
Done, had to relocate 2 out of 4721 chunks
pwm@europium:~$ sudo btrfs fi df /mnt/snap_04
Data, single: total=4.60TiB, used=4.59TiB
System, DUP: total=40.00MiB, used=512.00KiB
Metadata, DUP: total=6.50GiB, used=4.81GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
pwm@europium:~$ sudo btrfs fi show /mnt/snap_04
Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
Total devices 1 FS bytes used 4.60TiB
devid 1 size 9.09TiB used 4.61TiB path /dev/sdg1
So now device 1 usage is down from 9.09TiB to 4.61TiB.
But if I try to use fallocate() to grow the large parity file, it fails
immediately. I wrote a little helper program that focuses on just the
fallocate() call, instead of having to run snapraid with lots of unknown
additional actions being performed.
Original file size is 5050486226944 bytes
Trying to grow file to 5151751667712 bytes
Failed fallocate [No space left on device]
And afterwards, 'used' has jumped back up to 9.09TiB.
root@europium:/mnt# btrfs fi show snap_04
Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
Total devices 1 FS bytes used 4.60TiB
devid 1 size 9.09TiB used 9.09TiB path /dev/sdg1
root@europium:/mnt# btrfs fi df /mnt/snap_04/
Data, single: total=9.08TiB, used=4.59TiB
System, DUP: total=40.00MiB, used=992.00KiB
Metadata, DUP: total=6.50GiB, used=4.81GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
It's almost like the file system has decided that it needs to make a
snapshot and store two complete copies of the file, which is obviously
not going to work with a file larger than 50% of the file system.
There is no issue at all growing the parity file on the other parity disk.
That's why I wonder if there is some undetected file system corruption.
/Per W
On Tue, 1 Aug 2017, Hugo Mills wrote:
> Hi, Per,
>
> Start here:
>
> https://btrfs.wiki.kernel.org/index.php/FAQ#if_your_device_is_large_.28.3E16GiB.29
>
> In your case, I'd suggest using "-dusage=20" to start with, as
> it'll probably free up quite a lot of your existing allocation.
>
> And this may also be of interest, in how to read the output of the
> tools:
>
> https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools
>
> Finally, I note that you've still got some "single" chunks present
> for metadata. It won't affect your space allocation issues, but I
> would recommend getting rid of them anyway:
>
> # btrfs balance start -mconvert=dup,soft /mnt/snap_04
>
> Hugo.
>
> On Tue, Aug 01, 2017 at 01:43:23PM +0200, pwm wrote:
>> I have a 10TB file system with a parity file for a snapraid.
>> However, I can suddenly not extend the parity file despite the file
>> system only being about 50% filled - I should have 5TB of
>> unallocated space. When trying to extend the parity file,
>> fallocate() just returns ENOSPC, i.e. that the disk is full.
>>
>> Machine was originally a Debian 8 (Jessie) but after I detected the
>> issue and no btrfs tool did show any errors, I have updated to
>> Debian 9 (Stretch) to get a newer kernel and newer btrfs tools.
>>
>> pwm@europium:/mnt$ btrfs --version
>> btrfs-progs v4.7.3
>> pwm@europium:/mnt$ uname -a
>> Linux europium 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u2
>> (2017-06-26) x86_64 GNU/Linux
>>
>>
>>
>>
>> pwm@europium:/mnt/snap_04$ ls -l
>> total 4932703608
>> -rw------- 1 root root 319148889 Jul 8 04:21 snapraid.content
>> -rw------- 1 root root 283115520 Aug 1 04:08 snapraid.content.tmp
>> -rw------- 1 root root 5050486226944 Jul 31 17:14 snapraid.parity
>>
>>
>>
>> pwm@europium:/mnt/snap_04$ df .
>> Filesystem 1K-blocks Used Available Use% Mounted on
>> /dev/sdg1 9766434816 4944614648 4819831432 51% /mnt/snap_04
>>
>>
>>
>> pwm@europium:/mnt/snap_04$ sudo btrfs fi show .
>> Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>> Total devices 1 FS bytes used 4.60TiB
>> devid 1 size 9.09TiB used 9.09TiB path /dev/sdg1
>>
>> Compare this with the second snapraid parity disk:
>> pwm@europium:/mnt/snap_04$ sudo btrfs fi show /mnt/snap_05/
>> Label: 'snap_05' uuid: bac477e3-e78c-43ee-8402-6bdfff194567
>> Total devices 1 FS bytes used 4.69TiB
>> devid 1 size 9.09TiB used 4.70TiB path /dev/sdi1
>>
>> So on one parity disk, devid is 9.09TiB used - on the other only 4.70TiB.
>> While almost the same amount of file system usage. And almost
>> identical usage pattern. It's an archival RAID, so there is hardly
>> any writes to the parity files because there are almost no file
>> changes to the data files. The main usage is that the parity file
>> gets extended when one of the data disks reaches a new high water
>> mark.
>>
>> The only file that gets regularly rewritten is the snapraid.content
>> file that gets regenerated after every scrub.
>>
>>
>>
>> pwm@europium:/mnt/snap_04$ sudo btrfs fi df .
>> Data, single: total=9.08TiB, used=4.59TiB
>> System, DUP: total=8.00MiB, used=992.00KiB
>> System, single: total=4.00MiB, used=0.00B
>> Metadata, DUP: total=6.00GiB, used=4.81GiB
>> Metadata, single: total=8.00MiB, used=0.00B
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>>
>>
>> pwm@europium:/mnt/snap_04$ sudo btrfs filesystem du .
>> Total Exclusive Set shared Filename
>> 4.59TiB 4.59TiB - ./snapraid.parity
>> 304.37MiB 304.37MiB - ./snapraid.content
>> 270.00MiB 270.00MiB - ./snapraid.content.tmp
>> 4.59TiB 4.59TiB 0.00B .
>>
>>
>>
>> pwm@europium:/mnt/snap_04$ sudo btrfs filesystem usage .
>> Overall:
>> Device size: 9.09TiB
>> Device allocated: 9.09TiB
>> Device unallocated: 0.00B
>> Device missing: 0.00B
>> Used: 4.60TiB
>> Free (estimated): 4.49TiB (min: 4.49TiB)
>> Data ratio: 1.00
>> Metadata ratio: 2.00
>> Global reserve: 512.00MiB (used: 0.00B)
>>
>> Data,single: Size:9.08TiB, Used:4.59TiB
>> /dev/sdg1 9.08TiB
>>
>> Metadata,single: Size:8.00MiB, Used:0.00B
>> /dev/sdg1 8.00MiB
>>
>> Metadata,DUP: Size:6.00GiB, Used:4.81GiB
>> /dev/sdg1 12.00GiB
>>
>> System,single: Size:4.00MiB, Used:0.00B
>> /dev/sdg1 4.00MiB
>>
>> System,DUP: Size:8.00MiB, Used:992.00KiB
>> /dev/sdg1 16.00MiB
>>
>> Unallocated:
>> /dev/sdg1 0.00B
>>
>>
>>
>> pwm@europium:~$ sudo btrfs check /dev/sdg1
>> Checking filesystem on /dev/sdg1
>> UUID: c46df8fa-03db-4b32-8beb-5521d9931a31
>> checking extents
>> checking free space cache
>> checking fs roots
>> checking csums
>> checking root refs
>> found 5057294639104 bytes used err is 0
>> total csum bytes: 4529856120
>> total tree bytes: 5170151424
>> total fs tree bytes: 178700288
>> total extent tree bytes: 209616896
>> btree space waste bytes: 182357204
>> file data blocks allocated: 5073330888704
>> referenced 5052040339456
>>
>>
>>
>> pwm@europium:~$ sudo btrfs scrub status /mnt/snap_04/
>> scrub status for c46df8fa-03db-4b32-8beb-5521d9931a31
>> scrub started at Mon Jul 31 21:26:50 2017 and finished after
>> 06:53:47
>> total bytes scrubbed: 4.60TiB with 0 errors
>>
>>
>>
>> So where have my 5TB disk space gone lost?
>> And what should I do to be able to get it back again?
>>
>> I could obviously reformat the partition and rebuild the parity
>> since I still have one good parity, but that doesn't feel like a
>> good route. It isn't impossible this might happen again.
>>
>> /Per W
>
> --
> Hugo Mills | Well, sir, the floor is yours. But remember, the
> hugo@... carfax.org.uk | roof is ours!
> http://carfax.org.uk/ |
> PGP: E2AB1DE4 | The Goons
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Massive loss of disk space
2017-08-01 14:39 ` pwm
@ 2017-08-01 14:47 ` Austin S. Hemmelgarn
2017-08-01 15:00 ` Austin S. Hemmelgarn
2017-08-02 4:14 ` Duncan
0 siblings, 2 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-01 14:47 UTC (permalink / raw)
To: pwm, Hugo Mills; +Cc: linux-btrfs
On 2017-08-01 10:39, pwm wrote:
> Thanks for the links and suggestions.
>
> I did try your suggestions but it didn't solve the underlying problem.
>
>
>
> pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04
> Dumping filters: flags 0x1, state 0x0, force is off
> DATA (flags 0x2): balancing, usage=20
> Done, had to relocate 4596 out of 9317 chunks
>
>
> pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft /mnt/snap_04/
> Done, had to relocate 2 out of 4721 chunks
>
>
> pwm@europium:~$ sudo btrfs fi df /mnt/snap_04
> Data, single: total=4.60TiB, used=4.59TiB
> System, DUP: total=40.00MiB, used=512.00KiB
> Metadata, DUP: total=6.50GiB, used=4.81GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
>
> pwm@europium:~$ sudo btrfs fi show /mnt/snap_04
> Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
> Total devices 1 FS bytes used 4.60TiB
> devid 1 size 9.09TiB used 4.61TiB path /dev/sdg1
>
>
> So now device 1 usage is down from 9.09TiB to 4.61TiB.
>
> But if I test to fallocate() to grow the large parity file, I directly
> fail. I wrote a little help program that just focuses on fallocate()
> instead of having to run snapraid with lots of unknown additional
> actions being performed.
>
>
> Original file size is 5050486226944 bytes
> Trying to grow file to 5151751667712 bytes
> Failed fallocate [No space left on device]
>
>
>
> And result after shows 'used' have jumped up to 9.09TiB again.
>
> root@europium:/mnt# btrfs fi show snap_04
> Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
> Total devices 1 FS bytes used 4.60TiB
> devid 1 size 9.09TiB used 9.09TiB path /dev/sdg1
>
> root@europium:/mnt# btrfs fi df /mnt/snap_04/
> Data, single: total=9.08TiB, used=4.59TiB
> System, DUP: total=40.00MiB, used=992.00KiB
> Metadata, DUP: total=6.50GiB, used=4.81GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
>
> It's almost like the file system have decided that it needs to make a
> snapshot and store two complete copies of the complete file, which is
> obviously not going to work with a file larger than 50% of the file system.
I think I _might_ understand what's going on here. Is that test program
calling fallocate using the desired total size of the file, or just
trying to allocate the range beyond the end to extend the file? I've
seen issues with the first case on BTRFS before, and I'm starting to
think that it might actually be trying to allocate the exact amount of
space requested by fallocate, even if part of the range is already
allocated space.
>
> No issue at all to grow the parity file on the other parity disk. And
> that's why I wonder if there is some undetected file system corruption.
>
> /Per W
>
> On Tue, 1 Aug 2017, Hugo Mills wrote:
>
>> Hi, Per,
>>
>> Start here:
>>
>> https://btrfs.wiki.kernel.org/index.php/FAQ#if_your_device_is_large_.28.3E16GiB.29
>>
>>
>> In your case, I'd suggest using "-dusage=20" to start with, as
>> it'll probably free up quite a lot of your existing allocation.
>>
>> And this may also be of interest, in how to read the output of the
>> tools:
>>
>> https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools
>>
>>
>> Finally, I note that you've still got some "single" chunks present
>> for metadata. It won't affect your space allocation issues, but I
>> would recommend getting rid of them anyway:
>>
>> # btrfs balance start -mconvert=dup,soft /mnt/snap_04
>>
>> Hugo.
>>
>> On Tue, Aug 01, 2017 at 01:43:23PM +0200, pwm wrote:
>>> I have a 10TB file system with a parity file for a snapraid.
>>> However, I can suddenly not extend the parity file despite the file
>>> system only being about 50% filled - I should have 5TB of
>>> unallocated space. When trying to extend the parity file,
>>> fallocate() just returns ENOSPC, i.e. that the disk is full.
>>>
>>> Machine was originally a Debian 8 (Jessie) but after I detected the
>>> issue and no btrfs tool did show any errors, I have updated to
>>> Debian 9 (Stretch) to get a newer kernel and newer btrfs tools.
>>>
>>> pwm@europium:/mnt$ btrfs --version
>>> btrfs-progs v4.7.3
>>> pwm@europium:/mnt$ uname -a
>>> Linux europium 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u2
>>> (2017-06-26) x86_64 GNU/Linux
>>>
>>>
>>>
>>>
>>> pwm@europium:/mnt/snap_04$ ls -l
>>> total 4932703608
>>> -rw------- 1 root root 319148889 Jul 8 04:21 snapraid.content
>>> -rw------- 1 root root 283115520 Aug 1 04:08 snapraid.content.tmp
>>> -rw------- 1 root root 5050486226944 Jul 31 17:14 snapraid.parity
>>>
>>>
>>>
>>> pwm@europium:/mnt/snap_04$ df .
>>> Filesystem 1K-blocks Used Available Use% Mounted on
>>> /dev/sdg1 9766434816 4944614648 4819831432 51% /mnt/snap_04
>>>
>>>
>>>
>>> pwm@europium:/mnt/snap_04$ sudo btrfs fi show .
>>> Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>>> Total devices 1 FS bytes used 4.60TiB
>>> devid 1 size 9.09TiB used 9.09TiB path /dev/sdg1
>>>
>>> Compare this with the second snapraid parity disk:
>>> pwm@europium:/mnt/snap_04$ sudo btrfs fi show /mnt/snap_05/
>>> Label: 'snap_05' uuid: bac477e3-e78c-43ee-8402-6bdfff194567
>>> Total devices 1 FS bytes used 4.69TiB
>>> devid 1 size 9.09TiB used 4.70TiB path /dev/sdi1
>>>
>>> So on one parity disk, devid is 9.09TiB used - on the other only
>>> 4.70TiB.
>>> While almost the same amount of file system usage. And almost
>>> identical usage pattern. It's an archival RAID, so there is hardly
>>> any writes to the parity files because there are almost no file
>>> changes to the data files. The main usage is that the parity file
>>> gets extended when one of the data disks reaches a new high water
>>> mark.
>>>
>>> The only file that gets regularly rewritten is the snapraid.content
>>> file that gets regenerated after every scrub.
>>>
>>>
>>>
>>> pwm@europium:/mnt/snap_04$ sudo btrfs fi df .
>>> Data, single: total=9.08TiB, used=4.59TiB
>>> System, DUP: total=8.00MiB, used=992.00KiB
>>> System, single: total=4.00MiB, used=0.00B
>>> Metadata, DUP: total=6.00GiB, used=4.81GiB
>>> Metadata, single: total=8.00MiB, used=0.00B
>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>
>>>
>>>
>>> pwm@europium:/mnt/snap_04$ sudo btrfs filesystem du .
>>> Total Exclusive Set shared Filename
>>> 4.59TiB 4.59TiB - ./snapraid.parity
>>> 304.37MiB 304.37MiB - ./snapraid.content
>>> 270.00MiB 270.00MiB - ./snapraid.content.tmp
>>> 4.59TiB 4.59TiB 0.00B .
>>>
>>>
>>>
>>> pwm@europium:/mnt/snap_04$ sudo btrfs filesystem usage .
>>> Overall:
>>> Device size: 9.09TiB
>>> Device allocated: 9.09TiB
>>> Device unallocated: 0.00B
>>> Device missing: 0.00B
>>> Used: 4.60TiB
>>> Free (estimated): 4.49TiB (min: 4.49TiB)
>>> Data ratio: 1.00
>>> Metadata ratio: 2.00
>>> Global reserve: 512.00MiB (used: 0.00B)
>>>
>>> Data,single: Size:9.08TiB, Used:4.59TiB
>>> /dev/sdg1 9.08TiB
>>>
>>> Metadata,single: Size:8.00MiB, Used:0.00B
>>> /dev/sdg1 8.00MiB
>>>
>>> Metadata,DUP: Size:6.00GiB, Used:4.81GiB
>>> /dev/sdg1 12.00GiB
>>>
>>> System,single: Size:4.00MiB, Used:0.00B
>>> /dev/sdg1 4.00MiB
>>>
>>> System,DUP: Size:8.00MiB, Used:992.00KiB
>>> /dev/sdg1 16.00MiB
>>>
>>> Unallocated:
>>> /dev/sdg1 0.00B
>>>
>>>
>>>
>>> pwm@europium:~$ sudo btrfs check /dev/sdg1
>>> Checking filesystem on /dev/sdg1
>>> UUID: c46df8fa-03db-4b32-8beb-5521d9931a31
>>> checking extents
>>> checking free space cache
>>> checking fs roots
>>> checking csums
>>> checking root refs
>>> found 5057294639104 bytes used err is 0
>>> total csum bytes: 4529856120
>>> total tree bytes: 5170151424
>>> total fs tree bytes: 178700288
>>> total extent tree bytes: 209616896
>>> btree space waste bytes: 182357204
>>> file data blocks allocated: 5073330888704
>>> referenced 5052040339456
>>>
>>>
>>>
>>> pwm@europium:~$ sudo btrfs scrub status /mnt/snap_04/
>>> scrub status for c46df8fa-03db-4b32-8beb-5521d9931a31
>>> scrub started at Mon Jul 31 21:26:50 2017 and finished after
>>> 06:53:47
>>> total bytes scrubbed: 4.60TiB with 0 errors
>>>
>>>
>>>
>>> So where have my 5TB disk space gone lost?
>>> And what should I do to be able to get it back again?
>>>
>>> I could obviously reformat the partition and rebuild the parity
>>> since I still have one good parity, but that doesn't feel like a
>>> good route. It isn't impossible this might happen again.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Massive loss of disk space
2017-08-01 14:47 ` Austin S. Hemmelgarn
@ 2017-08-01 15:00 ` Austin S. Hemmelgarn
2017-08-01 15:24 ` pwm
2017-08-02 17:52 ` Goffredo Baroncelli
2017-08-02 4:14 ` Duncan
1 sibling, 2 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-01 15:00 UTC (permalink / raw)
To: pwm, Hugo Mills; +Cc: linux-btrfs
On 2017-08-01 10:47, Austin S. Hemmelgarn wrote:
> On 2017-08-01 10:39, pwm wrote:
>> Thanks for the links and suggestions.
>>
>> I did try your suggestions but it didn't solve the underlying problem.
>>
>>
>>
>> pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04
>> Dumping filters: flags 0x1, state 0x0, force is off
>> DATA (flags 0x2): balancing, usage=20
>> Done, had to relocate 4596 out of 9317 chunks
>>
>>
>> pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft /mnt/snap_04/
>> Done, had to relocate 2 out of 4721 chunks
>>
>>
>> pwm@europium:~$ sudo btrfs fi df /mnt/snap_04
>> Data, single: total=4.60TiB, used=4.59TiB
>> System, DUP: total=40.00MiB, used=512.00KiB
>> Metadata, DUP: total=6.50GiB, used=4.81GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>>
>> pwm@europium:~$ sudo btrfs fi show /mnt/snap_04
>> Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>> Total devices 1 FS bytes used 4.60TiB
>> devid 1 size 9.09TiB used 4.61TiB path /dev/sdg1
>>
>>
>> So now device 1 usage is down from 9.09TiB to 4.61TiB.
>>
>> But if I test to fallocate() to grow the large parity file, I directly
>> fail. I wrote a little help program that just focuses on fallocate()
>> instead of having to run snapraid with lots of unknown additional
>> actions being performed.
>>
>>
>> Original file size is 5050486226944 bytes
>> Trying to grow file to 5151751667712 bytes
>> Failed fallocate [No space left on device]
>>
>>
>>
>> And result after shows 'used' have jumped up to 9.09TiB again.
>>
>> root@europium:/mnt# btrfs fi show snap_04
>> Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>> Total devices 1 FS bytes used 4.60TiB
>> devid 1 size 9.09TiB used 9.09TiB path /dev/sdg1
>>
>> root@europium:/mnt# btrfs fi df /mnt/snap_04/
>> Data, single: total=9.08TiB, used=4.59TiB
>> System, DUP: total=40.00MiB, used=992.00KiB
>> Metadata, DUP: total=6.50GiB, used=4.81GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>>
>> It's almost like the file system have decided that it needs to make a
>> snapshot and store two complete copies of the complete file, which is
>> obviously not going to work with a file larger than 50% of the file
>> system.
> I think I _might_ understand what's going on here. Is that test program
> calling fallocate using the desired total size of the file, or just
> trying to allocate the range beyond the end to extend the file? I've
> seen issues with the first case on BTRFS before, and I'm starting to
> think that it might actually be trying to allocate the exact amount of
> space requested by fallocate, even if part of the range is already
> allocated space.
OK, I just did a dead simple test by hand, and it looks like I was
right. The method I used to check this is as follows:
1. Create and mount a reasonably small filesystem (I used an 8G
temporary LV for this, a file would work too though).
2. Using dd or a similar tool, create a test file that takes up half of
the size of the filesystem. It is important that this _not_ be
fallocated, but just written out.
3. Use `fallocate -l` to try and extend the size of the file beyond half
the size of the filesystem.
For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it will
succeed with no error. Based on this and some low-level inspection, it
looks like BTRFS treats the full range of the fallocate call as
unallocated, and thus is trying to allocate space for regions of that
range that are already allocated.
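Transcribed as a session, the by-hand test looks roughly like this (run as
root; the volume name, mount point, and file name are just examples, and a
loop device works the same as an LV):

```
# lvcreate -L 8G -n scratch vg0
# mkfs.btrfs /dev/vg0/scratch
# mount /dev/vg0/scratch /mnt/scratch
# dd if=/dev/zero of=/mnt/scratch/testfile bs=1M count=4096
# fallocate -l 6G /mnt/scratch/testfile
```

On BTRFS the final fallocate fails with ENOSPC even though only ~2G of new
space is actually needed; on ext4 or XFS it succeeds.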
>>
>> No issue at all to grow the parity file on the other parity disk. And
>> that's why I wonder if there is some undetected file system corruption.
>>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Massive loss of disk space
2017-08-01 15:00 ` Austin S. Hemmelgarn
@ 2017-08-01 15:24 ` pwm
2017-08-01 15:45 ` Austin S. Hemmelgarn
2017-08-02 17:52 ` Goffredo Baroncelli
1 sibling, 1 reply; 26+ messages in thread
From: pwm @ 2017-08-01 15:24 UTC (permalink / raw)
To: Austin S. Hemmelgarn; +Cc: Hugo Mills, linux-btrfs
Yes, the test code is as below, trying to match what snapraid
does:
#define _GNU_SOURCE   /* for fallocate() */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>

int main() {
    int fd = open("/mnt/snap_04/snapraid.parity", O_NOFOLLOW|O_RDWR);
    if (fd < 0) {
        printf("Failed opening parity file [%s]\n", strerror(errno));
        return 1;
    }
    off_t filesize = 5151751667712ull;
    int res;
    struct stat statbuf;
    if (fstat(fd, &statbuf)) {
        printf("Failed stat [%s]\n", strerror(errno));
        close(fd);
        return 1;
    }
    printf("Original file size is %llu bytes\n",
           (unsigned long long)statbuf.st_size);
    printf("Trying to grow file to %llu bytes\n",
           (unsigned long long)filesize);
    res = fallocate(fd, 0, 0, filesize);
    if (res) {
        printf("Failed fallocate [%s]\n", strerror(errno));
        close(fd);
        return 1;
    }
    if (fsync(fd)) {
        printf("Failed fsync [%s]\n", strerror(errno));
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}
So the call doesn't make use of the previous file size as offset for the
extension.
int fallocate(int fd, int mode, off_t offset, off_t len);
What you are implying here is that if the fallocate() call is modified to:
res = fallocate(fd,0,old_size,new_size-old_size);
then everything should work as expected?
/Per W
On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote:
> On 2017-08-01 10:47, Austin S. Hemmelgarn wrote:
>> On 2017-08-01 10:39, pwm wrote:
>>> Thanks for the links and suggestions.
>>>
>>> I did try your suggestions but it didn't solve the underlying problem.
>>>
>>>
>>>
>>> pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04
>>> Dumping filters: flags 0x1, state 0x0, force is off
>>> DATA (flags 0x2): balancing, usage=20
>>> Done, had to relocate 4596 out of 9317 chunks
>>>
>>>
>>> pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft /mnt/snap_04/
>>> Done, had to relocate 2 out of 4721 chunks
>>>
>>>
>>> pwm@europium:~$ sudo btrfs fi df /mnt/snap_04
>>> Data, single: total=4.60TiB, used=4.59TiB
>>> System, DUP: total=40.00MiB, used=512.00KiB
>>> Metadata, DUP: total=6.50GiB, used=4.81GiB
>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>
>>>
>>> pwm@europium:~$ sudo btrfs fi show /mnt/snap_04
>>> Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>>> Total devices 1 FS bytes used 4.60TiB
>>> devid 1 size 9.09TiB used 4.61TiB path /dev/sdg1
>>>
>>>
>>> So now device 1 usage is down from 9.09TiB to 4.61TiB.
>>>
>>> But if I try fallocate() to grow the large parity file, it fails
>>> immediately. I wrote a little helper program that just focuses on
>>> fallocate() instead of having to run snapraid with lots of unknown
>>> additional actions being performed.
>>>
>>>
>>> Original file size is 5050486226944 bytes
>>> Trying to grow file to 5151751667712 bytes
>>> Failed fallocate [No space left on device]
>>>
>>>
>>>
>>> And the result afterwards shows 'used' has jumped up to 9.09TiB again.
>>>
>>> root@europium:/mnt# btrfs fi show snap_04
>>> Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>>> Total devices 1 FS bytes used 4.60TiB
>>> devid 1 size 9.09TiB used 9.09TiB path /dev/sdg1
>>>
>>> root@europium:/mnt# btrfs fi df /mnt/snap_04/
>>> Data, single: total=9.08TiB, used=4.59TiB
>>> System, DUP: total=40.00MiB, used=992.00KiB
>>> Metadata, DUP: total=6.50GiB, used=4.81GiB
>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>
>>>
>>> It's almost like the file system has decided that it needs to make a
>>> snapshot and store two complete copies of the file, which is
>>> obviously not going to work with a file larger than 50% of the file
>>> system.
>> I think I _might_ understand what's going on here. Is that test program
>> calling fallocate using the desired total size of the file, or just trying
>> to allocate the range beyond the end to extend the file? I've seen issues
>> with the first case on BTRFS before, and I'm starting to think that it
>> might actually be trying to allocate the exact amount of space requested by
>> fallocate, even if part of the range is already allocated space.
>
> OK, I just did a dead simple test by hand, and it looks like I was right.
> The method I used to check this is as follows:
> 1. Create and mount a reasonably small filesystem (I used an 8G temporary LV
> for this, a file would work too though).
> 2. Using dd or a similar tool, create a test file that takes up half of the
> size of the filesystem. It is important that this _not_ be fallocated, but
> just written out.
> 3. Use `fallocate -l` to try and extend the size of the file beyond half the
> size of the filesystem.
>
> For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it will
> succeed with no error. Based on this and some low-level inspection, it looks
> like BTRFS treats the full range of the fallocate call as unallocated, and
> thus is trying to allocate space for regions of that range that are already
> allocated.
>
>>>
>>> No issue at all to grow the parity file on the other parity disk. And
>>> that's why I wonder if there is some undetected file system corruption.
>>>
>
* Re: Massive loss of disk space
2017-08-01 15:24 ` pwm
@ 2017-08-01 15:45 ` Austin S. Hemmelgarn
2017-08-01 16:50 ` pwm
0 siblings, 1 reply; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-01 15:45 UTC (permalink / raw)
To: pwm; +Cc: Hugo Mills, linux-btrfs
On 2017-08-01 11:24, pwm wrote:
> Yes, the test code is as below - trying to match what snapraid tries to do:
>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <string.h>
> #include <unistd.h>
> #include <errno.h>
>
> int main() {
> int fd = open("/mnt/snap_04/snapraid.parity",O_NOFOLLOW|O_RDWR);
> if (fd < 0) {
> printf("Failed opening parity file [%s]\n",strerror(errno));
> return 1;
> }
>
> off_t filesize = 5151751667712ull;
> int res;
>
> struct stat statbuf;
> if (fstat(fd,&statbuf)) {
> printf("Failed stat [%s]\n",strerror(errno));
> close(fd);
> return 1;
> }
>
> printf("Original file size is %llu bytes\n",i
> (unsigned long long)statbuf.st_size);
> printf("Trying to grow file to %llu bytes\n",i
> (unsigned long long)filesize);
>
> res = fallocate(fd,0,0,filesize);
> if (res) {
> printf("Failed fallocate [%s]\n",strerror(errno));
> close(fd);
> return 1;
> }
>
> if (fsync(fd)) {
> printf("Failed fsync [%s]\n",fsync(errno));
> close(fd);
> return 1;
> }
>
> close(fd);
> return 0;
> }
>
> So the call doesn't make use of the previous file size as offset for the
> extension.
>
> int fallocate(int fd, int mode, off_t offset, off_t len);
>
> What you are implying here is that if the fallocate() call is modified to:
>
> res = fallocate(fd,0,old_size,new_size-old_size);
>
> then everything should work as expected?
Based on what I've seen testing on my end, yes, that should cause things
to work correctly. That said, given what snapraid does, calling
fallocate over the full desired size of the file is correct usage (the
point is to make behavior deterministic, and calling it on the whole
file makes sure that the file isn't sparse, which can impact
performance).
Given both the fact that calling fallocate() to extend the file without
worrying about an offset is a legitimate use case, and that both ext4
and XFS (and I suspect almost every other Linux filesystem) work in
this situation, I'd argue that the behavior of BTRFS in this situation
is incorrect.
>
> /Per W
>
> On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote:
>
>> On 2017-08-01 10:47, Austin S. Hemmelgarn wrote:
>>> On 2017-08-01 10:39, pwm wrote:
>>>> Thanks for the links and suggestions.
>>>>
>>>> I did try your suggestions but it didn't solve the underlying problem.
>>>>
>>>>
>>>>
>>>> pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04
>>>> Dumping filters: flags 0x1, state 0x0, force is off
>>>> DATA (flags 0x2): balancing, usage=20
>>>> Done, had to relocate 4596 out of 9317 chunks
>>>>
>>>>
>>>> pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft
>>>> /mnt/snap_04/
>>>> Done, had to relocate 2 out of 4721 chunks
>>>>
>>>>
>>>> pwm@europium:~$ sudo btrfs fi df /mnt/snap_04
>>>> Data, single: total=4.60TiB, used=4.59TiB
>>>> System, DUP: total=40.00MiB, used=512.00KiB
>>>> Metadata, DUP: total=6.50GiB, used=4.81GiB
>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>
>>>>
>>>> pwm@europium:~$ sudo btrfs fi show /mnt/snap_04
>>>> Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>>>> Total devices 1 FS bytes used 4.60TiB
>>>> devid 1 size 9.09TiB used 4.61TiB path /dev/sdg1
>>>>
>>>>
>>>> So now device 1 usage is down from 9.09TiB to 4.61TiB.
>>>>
>>>> But if I try fallocate() to grow the large parity file, it fails
>>>> immediately. I wrote a little helper program that just focuses on
>>>> fallocate() instead of having to run snapraid with lots of unknown
>>>> additional actions being performed.
>>>>
>>>>
>>>> Original file size is 5050486226944 bytes
>>>> Trying to grow file to 5151751667712 bytes
>>>> Failed fallocate [No space left on device]
>>>>
>>>>
>>>>
>>>> And the result afterwards shows 'used' has jumped up to 9.09TiB again.
>>>>
>>>> root@europium:/mnt# btrfs fi show snap_04
>>>> Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>>>> Total devices 1 FS bytes used 4.60TiB
>>>> devid 1 size 9.09TiB used 9.09TiB path /dev/sdg1
>>>>
>>>> root@europium:/mnt# btrfs fi df /mnt/snap_04/
>>>> Data, single: total=9.08TiB, used=4.59TiB
>>>> System, DUP: total=40.00MiB, used=992.00KiB
>>>> Metadata, DUP: total=6.50GiB, used=4.81GiB
>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>
>>>>
>>>> It's almost like the file system has decided that it needs to make
>>>> a snapshot and store two complete copies of the file, which
>>>> is obviously not going to work with a file larger than 50% of the
>>>> file system.
>>> I think I _might_ understand what's going on here. Is that test
>>> program calling fallocate using the desired total size of the file,
>>> or just trying to allocate the range beyond the end to extend the
>>> file? I've seen issues with the first case on BTRFS before, and I'm
>>> starting to think that it might actually be trying to allocate the
>>> exact amount of space requested by fallocate, even if part of the
>>> range is already allocated space.
>>
>> OK, I just did a dead simple test by hand, and it looks like I was
>> right. The method I used to check this is as follows:
>> 1. Create and mount a reasonably small filesystem (I used an 8G
>> temporary LV for this, a file would work too though).
>> 2. Using dd or a similar tool, create a test file that takes up half
>> of the size of the filesystem. It is important that this _not_ be
>> fallocated, but just written out.
>> 3. Use `fallocate -l` to try and extend the size of the file beyond
>> half the size of the filesystem.
>>
>> For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it
>> will succeed with no error. Based on this and some low-level
>> inspection, it looks like BTRFS treats the full range of the fallocate
>> call as unallocated, and thus is trying to allocate space for regions
>> of that range that are already allocated.
>>
>>>>
>>>> No issue at all to grow the parity file on the other parity disk.
>>>> And that's why I wonder if there is some undetected file system
>>>> corruption.
>>>>
>>
* Re: Massive loss of disk space
2017-08-01 15:45 ` Austin S. Hemmelgarn
@ 2017-08-01 16:50 ` pwm
2017-08-01 17:04 ` Austin S. Hemmelgarn
0 siblings, 1 reply; 26+ messages in thread
From: pwm @ 2017-08-01 16:50 UTC (permalink / raw)
To: Austin S. Hemmelgarn; +Cc: Hugo Mills, linux-btrfs
I did a temporary patch of the snapraid code to start fallocate() from the
previous parity file size.
Finally have a snapraid sync up and running. Looks good, but will take
quite a while before I can try a scrub command to double-check everything.
Thanks for the help.
/Per W
On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote:
> On 2017-08-01 11:24, pwm wrote:
>> Yes, the test code is as below - trying to match what snapraid tries to do:
>>
>> #include <sys/types.h>
>> #include <sys/stat.h>
>> #include <fcntl.h>
>> #include <stdio.h>
>> #include <string.h>
>> #include <unistd.h>
>> #include <errno.h>
>>
>> int main() {
>> int fd = open("/mnt/snap_04/snapraid.parity",O_NOFOLLOW|O_RDWR);
>> if (fd < 0) {
>> printf("Failed opening parity file [%s]\n",strerror(errno));
>> return 1;
>> }
>>
>> off_t filesize = 5151751667712ull;
>> int res;
>>
>> struct stat statbuf;
>> if (fstat(fd,&statbuf)) {
>> printf("Failed stat [%s]\n",strerror(errno));
>> close(fd);
>> return 1;
>> }
>>
>> printf("Original file size is %llu bytes\n",i
>> (unsigned long long)statbuf.st_size);
>> printf("Trying to grow file to %llu bytes\n",i
>> (unsigned long long)filesize);
>>
>> res = fallocate(fd,0,0,filesize);
>> if (res) {
>> printf("Failed fallocate [%s]\n",strerror(errno));
>> close(fd);
>> return 1;
>> }
>>
>> if (fsync(fd)) {
>> printf("Failed fsync [%s]\n",fsync(errno));
>> close(fd);
>> return 1;
>> }
>>
>> close(fd);
>> return 0;
>> }
>>
>> So the call doesn't make use of the previous file size as offset for the
>> extension.
>>
>> int fallocate(int fd, int mode, off_t offset, off_t len);
>>
>> What you are implying here is that if the fallocate() call is modified to:
>>
>> res = fallocate(fd,0,old_size,new_size-old_size);
>>
>> then everything should work as expected?
> Based on what I've seen testing on my end, yes, that should cause things to
> work correctly. That said, given what snapraid does, calling fallocate
> covering the full desired size of the file is correct usage (the
> point is to make behavior deterministic, and calling it on the whole file
> makes sure that the file isn't sparse, which can impact performance).
>
> Given both the fact that calling fallocate() to extend the file without
> worrying about an offset is a legitimate use case, and that both ext4 and XFS
> (and I suspect almost every other Linux filesystem) work in this situation,
> I'd argue that the behavior of BTRFS in this situation is incorrect.
>>
>> /Per W
>>
>> On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote:
>>
>>> On 2017-08-01 10:47, Austin S. Hemmelgarn wrote:
>>>> On 2017-08-01 10:39, pwm wrote:
>>>>> Thanks for the links and suggestions.
>>>>>
>>>>> I did try your suggestions but it didn't solve the underlying problem.
>>>>>
>>>>>
>>>>>
>>>>> pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04
>>>>> Dumping filters: flags 0x1, state 0x0, force is off
>>>>> DATA (flags 0x2): balancing, usage=20
>>>>> Done, had to relocate 4596 out of 9317 chunks
>>>>>
>>>>>
>>>>> pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft
>>>>> /mnt/snap_04/
>>>>> Done, had to relocate 2 out of 4721 chunks
>>>>>
>>>>>
>>>>> pwm@europium:~$ sudo btrfs fi df /mnt/snap_04
>>>>> Data, single: total=4.60TiB, used=4.59TiB
>>>>> System, DUP: total=40.00MiB, used=512.00KiB
>>>>> Metadata, DUP: total=6.50GiB, used=4.81GiB
>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>>
>>>>>
>>>>> pwm@europium:~$ sudo btrfs fi show /mnt/snap_04
>>>>> Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>>>>> Total devices 1 FS bytes used 4.60TiB
>>>>> devid 1 size 9.09TiB used 4.61TiB path /dev/sdg1
>>>>>
>>>>>
>>>>> So now device 1 usage is down from 9.09TiB to 4.61TiB.
>>>>>
>>>>> But if I try fallocate() to grow the large parity file, it fails
>>>>> immediately. I wrote a little helper program that just focuses on
>>>>> fallocate() instead of having to run snapraid with lots of unknown
>>>>> additional actions being performed.
>>>>>
>>>>>
>>>>> Original file size is 5050486226944 bytes
>>>>> Trying to grow file to 5151751667712 bytes
>>>>> Failed fallocate [No space left on device]
>>>>>
>>>>>
>>>>>
>>>>> And the result afterwards shows 'used' has jumped up to 9.09TiB again.
>>>>>
>>>>> root@europium:/mnt# btrfs fi show snap_04
>>>>> Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>>>>> Total devices 1 FS bytes used 4.60TiB
>>>>> devid 1 size 9.09TiB used 9.09TiB path /dev/sdg1
>>>>>
>>>>> root@europium:/mnt# btrfs fi df /mnt/snap_04/
>>>>> Data, single: total=9.08TiB, used=4.59TiB
>>>>> System, DUP: total=40.00MiB, used=992.00KiB
>>>>> Metadata, DUP: total=6.50GiB, used=4.81GiB
>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>>
>>>>>
>>>>> It's almost like the file system has decided that it needs to make a
>>>>> snapshot and store two complete copies of the file, which is
>>>>> obviously not going to work with a file larger than 50% of the file
>>>>> system.
>>>> I think I _might_ understand what's going on here. Is that test program
>>>> calling fallocate using the desired total size of the file, or just
>>>> trying to allocate the range beyond the end to extend the file? I've
>>>> seen issues with the first case on BTRFS before, and I'm starting to
>>>> think that it might actually be trying to allocate the exact amount of
>>>> space requested by fallocate, even if part of the range is already
>>>> allocated space.
>>>
>>> OK, I just did a dead simple test by hand, and it looks like I was right.
>>> The method I used to check this is as follows:
>>> 1. Create and mount a reasonably small filesystem (I used an 8G temporary
>>> LV for this, a file would work too though).
>>> 2. Using dd or a similar tool, create a test file that takes up half of
>>> the size of the filesystem. It is important that this _not_ be
>>> fallocated, but just written out.
>>> 3. Use `fallocate -l` to try and extend the size of the file beyond half
>>> the size of the filesystem.
>>>
>>> For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it will
>>> succeed with no error. Based on this and some low-level inspection, it
>>> looks like BTRFS treats the full range of the fallocate call as
>>> unallocated, and thus is trying to allocate space for regions of that
>>> range that are already allocated.
>>>
>>>>>
>>>>> No issue at all to grow the parity file on the other parity disk. And
>>>>> that's why I wonder if there is some undetected file system corruption.
>>>>>
>>>
>
>
* Re: Massive loss of disk space
2017-08-01 16:50 ` pwm
@ 2017-08-01 17:04 ` Austin S. Hemmelgarn
0 siblings, 0 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-01 17:04 UTC (permalink / raw)
To: pwm; +Cc: Hugo Mills, linux-btrfs
On 2017-08-01 12:50, pwm wrote:
> I did a temporary patch of the snapraid code to start fallocate() from
> the previous parity file size.
Like I said though, it's BTRFS that's misbehaving here, not snapraid.
I'm going to try to get some further discussion about this here on the
mailing list, and hopefully it will get fixed in BTRFS (I would try to do
so myself, but I'm at best a novice at C, and not well versed in kernel
code).
>
> Finally have a snapraid sync up and running. Looks good, but will take
> quite a while before I can try a scrub command to double-check everything.
>
> Thanks for the help.
Glad I could be helpful!
>
> /Per W
>
> On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote:
>
>> On 2017-08-01 11:24, pwm wrote:
>>> Yes, the test code is as below - trying to match what snapraid tries
>>> to do:
>>>
>>> #include <sys/types.h>
>>> #include <sys/stat.h>
>>> #include <fcntl.h>
>>> #include <stdio.h>
>>> #include <string.h>
>>> #include <unistd.h>
>>> #include <errno.h>
>>>
>>> int main() {
>>> int fd = open("/mnt/snap_04/snapraid.parity",O_NOFOLLOW|O_RDWR);
>>> if (fd < 0) {
>>> printf("Failed opening parity file [%s]\n",strerror(errno));
>>> return 1;
>>> }
>>>
>>> off_t filesize = 5151751667712ull;
>>> int res;
>>>
>>> struct stat statbuf;
>>> if (fstat(fd,&statbuf)) {
>>> printf("Failed stat [%s]\n",strerror(errno));
>>> close(fd);
>>> return 1;
>>> }
>>>
>>> printf("Original file size is %llu bytes\n",i
>>> (unsigned long long)statbuf.st_size);
>>> printf("Trying to grow file to %llu bytes\n",i
>>> (unsigned long long)filesize);
>>>
>>> res = fallocate(fd,0,0,filesize);
>>> if (res) {
>>> printf("Failed fallocate [%s]\n",strerror(errno));
>>> close(fd);
>>> return 1;
>>> }
>>>
>>> if (fsync(fd)) {
>>> printf("Failed fsync [%s]\n",fsync(errno));
>>> close(fd);
>>> return 1;
>>> }
>>>
>>> close(fd);
>>> return 0;
>>> }
>>>
>>> So the call doesn't make use of the previous file size as offset for
>>> the extension.
>>>
>>> int fallocate(int fd, int mode, off_t offset, off_t len);
>>>
>>> What you are implying here is that if the fallocate() call is
>>> modified to:
>>>
>>> res = fallocate(fd,0,old_size,new_size-old_size);
>>>
>>> then everything should work as expected?
>> Based on what I've seen testing on my end, yes, that should cause
>> things to work correctly. That said, given what snapraid does,
>> calling fallocate covering the full desired size of the
>> file is correct usage (the point is to make behavior deterministic,
>> and calling it on the whole file makes sure that the file isn't
>> sparse, which can impact performance).
>>
>> Given both the fact that calling fallocate() to extend the file
>> without worrying about an offset is a legitimate use case, and that
>> both ext4 and XFS (and I suspect almost every other Linux filesystem)
>> work in this situation, I'd argue that the behavior of BTRFS in this
>> situation is incorrect.
>>>
>>> /Per W
>>>
>>> On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote:
>>>
>>>> On 2017-08-01 10:47, Austin S. Hemmelgarn wrote:
>>>>> On 2017-08-01 10:39, pwm wrote:
>>>>>> Thanks for the links and suggestions.
>>>>>>
>>>>>> I did try your suggestions but it didn't solve the underlying
>>>>>> problem.
>>>>>>
>>>>>>
>>>>>>
>>>>>> pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04
>>>>>> Dumping filters: flags 0x1, state 0x0, force is off
>>>>>> DATA (flags 0x2): balancing, usage=20
>>>>>> Done, had to relocate 4596 out of 9317 chunks
>>>>>>
>>>>>>
>>>>>> pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft
>>>>>> /mnt/snap_04/
>>>>>> Done, had to relocate 2 out of 4721 chunks
>>>>>>
>>>>>>
>>>>>> pwm@europium:~$ sudo btrfs fi df /mnt/snap_04
>>>>>> Data, single: total=4.60TiB, used=4.59TiB
>>>>>> System, DUP: total=40.00MiB, used=512.00KiB
>>>>>> Metadata, DUP: total=6.50GiB, used=4.81GiB
>>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>>>
>>>>>>
>>>>>> pwm@europium:~$ sudo btrfs fi show /mnt/snap_04
>>>>>> Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>>>>>> Total devices 1 FS bytes used 4.60TiB
>>>>>> devid 1 size 9.09TiB used 4.61TiB path /dev/sdg1
>>>>>>
>>>>>>
>>>>>> So now device 1 usage is down from 9.09TiB to 4.61TiB.
>>>>>>
>>>>>> But if I try fallocate() to grow the large parity file, it fails
>>>>>> immediately. I wrote a little helper program that just focuses on
>>>>>> fallocate() instead of having to run snapraid with lots of unknown
>>>>>> additional actions being performed.
>>>>>>
>>>>>>
>>>>>> Original file size is 5050486226944 bytes
>>>>>> Trying to grow file to 5151751667712 bytes
>>>>>> Failed fallocate [No space left on device]
>>>>>>
>>>>>>
>>>>>>
>>>>>> And the result afterwards shows 'used' has jumped up to 9.09TiB again.
>>>>>>
>>>>>> root@europium:/mnt# btrfs fi show snap_04
>>>>>> Label: 'snap_04' uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>>>>>> Total devices 1 FS bytes used 4.60TiB
>>>>>> devid 1 size 9.09TiB used 9.09TiB path /dev/sdg1
>>>>>>
>>>>>> root@europium:/mnt# btrfs fi df /mnt/snap_04/
>>>>>> Data, single: total=9.08TiB, used=4.59TiB
>>>>>> System, DUP: total=40.00MiB, used=992.00KiB
>>>>>> Metadata, DUP: total=6.50GiB, used=4.81GiB
>>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>>>
>>>>>>
>>>>>> It's almost like the file system has decided that it needs to
>>>>>> make a snapshot and store two complete copies of the
>>>>>> file, which is obviously not going to work with a file larger than
>>>>>> 50% of the file system.
>>>>> I think I _might_ understand what's going on here. Is that test
>>>>> program calling fallocate using the desired total size of the file,
>>>>> or just trying to allocate the range beyond the end to extend the
>>>>> file? I've seen issues with the first case on BTRFS before, and
>>>>> I'm starting to think that it might actually be trying to allocate
>>>>> the exact amount of space requested by fallocate, even if part of
>>>>> the range is already allocated space.
>>>>
>>>> OK, I just did a dead simple test by hand, and it looks like I was
>>>> right. The method I used to check this is as follows:
>>>> 1. Create and mount a reasonably small filesystem (I used an 8G
>>>> temporary LV for this, a file would work too though).
>>>> 2. Using dd or a similar tool, create a test file that takes up half
>>>> of the size of the filesystem. It is important that this _not_ be
>>>> fallocated, but just written out.
>>>> 3. Use `fallocate -l` to try and extend the size of the file beyond
>>>> half the size of the filesystem.
>>>>
>>>> For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it
>>>> will succeed with no error. Based on this and some low-level
>>>> inspection, it looks like BTRFS treats the full range of the
>>>> fallocate call as unallocated, and thus is trying to allocate space
>>>> for regions of that range that are already allocated.
>>>>
>>>>>>
>>>>>> No issue at all to grow the parity file on the other parity disk.
>>>>>> And that's why I wonder if there is some undetected file system
>>>>>> corruption.
>>>>>>
>>>>
>>
>>
* Re: Massive loss of disk space
2017-08-01 14:47 ` Austin S. Hemmelgarn
2017-08-01 15:00 ` Austin S. Hemmelgarn
@ 2017-08-02 4:14 ` Duncan
2017-08-02 11:18 ` Austin S. Hemmelgarn
1 sibling, 1 reply; 26+ messages in thread
From: Duncan @ 2017-08-02 4:14 UTC (permalink / raw)
To: linux-btrfs
Austin S. Hemmelgarn posted on Tue, 01 Aug 2017 10:47:30 -0400 as
excerpted:
> I think I _might_ understand what's going on here. Is that test program
> calling fallocate using the desired total size of the file, or just
> trying to allocate the range beyond the end to extend the file? I've
> seen issues with the first case on BTRFS before, and I'm starting to
> think that it might actually be trying to allocate the exact amount of
> space requested by fallocate, even if part of the range is already
> allocated space.
If I've interpreted correctly (not being a dev, only a btrfs user,
sysadmin, and list regular) previous discussions I've seen on this list...
That's exactly what it's doing, and it's _intended_ behavior.
The reasoning is something like this: fallocate is supposed to pre-
allocate some space with the intent being that writes into that space
won't fail, because the space is already allocated.
For an existing file with some data already in it, ext4 and xfs do that
counting the existing space.
But btrfs is copy-on-write, meaning it's going to have to write the new
data to a different location than the existing data, and it may well not
free up the existing allocation (if even a single 4k block of the
existing allocation remains unwritten, it will remain to hold down the
entire previous allocation, which isn't released until *none* of it is
still in use -- of course in normal usage "in use" can be due to old
snapshots or other reflinks to the same extent, as well, tho in these
test cases it's not).
So in order to honor the guarantee that writes into preallocated space
won't ENOSPC, btrfs can't count currently used space as part of the
fallocate.
The different behavior is entirely due to btrfs being COW, and thus a
choice having to be made, do we worst-case fallocate-reserve for writes
over currently used data that will have to be COWed elsewhere, possibly
without freeing the existing extents because there's still something
referencing them, or do we risk ENOSPCing on write to a previously
fallocated area?
The choice was to worst-case-reserve and take the ENOSPC risk at fallocate
time, so the write into that fallocated space could then proceed without
the ENOSPC risk that COW would otherwise imply.
Make sense, or is my understanding a horrible misunderstanding? =:^)
So if you're actually only appending, fallocate the /additional/ space,
not the /entire/ space, and you'll get what you need. But if you're
potentially overwriting what's there already, better fallocate the entire
space, which triggers the btrfs worst-case allocation behavior you see,
in order to guarantee it won't ENOSPC during the actual write.
Of course the only time the behavior actually differs is with COW, but
then there's a BIG difference, but that BIG difference has a GOOD BIG
reason! =:^)
Tho that difference will certainly necessitate some relearning the
/correct/ way to do it, for devs who were doing it the COW-worst-case way
all along, even if they didn't actually need to, because it didn't happen
to make a difference on what they happened to be testing on, which
happened not to be COW...
Reminds me of the way newer versions of gcc and/or trying to build with
clang as well tends to trigger relearning, because newer versions are
stricter in order to allow better optimization, and other
implementations are simply different in what they're strict on, /because/
they're a different implementation. Well, btrfs is stricter... because
it's a different implementation that /has/ to be stricter... due to COW.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Massive loss of disk space
2017-08-02 4:14 ` Duncan
@ 2017-08-02 11:18 ` Austin S. Hemmelgarn
0 siblings, 0 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-02 11:18 UTC (permalink / raw)
To: linux-btrfs
On 2017-08-02 00:14, Duncan wrote:
> Austin S. Hemmelgarn posted on Tue, 01 Aug 2017 10:47:30 -0400 as
> excerpted:
>
>> I think I _might_ understand what's going on here. Is that test program
>> calling fallocate using the desired total size of the file, or just
>> trying to allocate the range beyond the end to extend the file? I've
>> seen issues with the first case on BTRFS before, and I'm starting to
>> think that it might actually be trying to allocate the exact amount of
>> space requested by fallocate, even if part of the range is already
>> allocated space.
>
> If I've interpreted correctly (not being a dev, only a btrfs user,
> sysadmin, and list regular) previous discussions I've seen on this list...
>
> That's exactly what it's doing, and it's _intended_ behavior.
>
> The reasoning is something like this: fallocate is supposed to pre-
> allocate some space with the intent being that writes into that space
> won't fail, because the space is already allocated.
>
> For an existing file with some data already in it, ext4 and xfs do that
> counting the existing space.
>
> But btrfs is copy-on-write, meaning it's going to have to write the new
> data to a different location than the existing data, and it may well not
> free up the existing allocation (if even a single 4k block of the
> existing allocation remains unwritten, it will remain to hold down the
> entire previous allocation, which isn't released until *none* of it is
> still in use -- of course in normal usage "in use" can be due to old
> snapshots or other reflinks to the same extent, as well, tho in these
> test cases it's not).
>
> So in order to honor the guarantee that writes into preallocated space
> won't ENOSPC, btrfs can't count currently used space as part of the
> fallocate.
>
> The different behavior is entirely due to btrfs being COW, and thus a
> choice having to be made, do we worst-case fallocate-reserve for writes
> over currently used data that will have to be COWed elsewhere, possibly
> without freeing the existing extents because there's still something
> referencing them, or do we risk ENOSPCing on write to a previously
> fallocated area?
>
> The choice was to worst-case-reserve and take the ENOSPC risk at fallocate
> time, so the write into that fallocated space could then proceed without
> the ENOSPC risk that COW would otherwise imply.
>
> Make sense, or is my understanding a horrible misunderstanding? =:^)
Your reasoning is sound, except for the fact that at least on older
kernels (not sure if this is still the case), BTRFS will still perform a
COW operation when updating a fallocate'ed region.
>
> So if you're actually only appending, fallocate the /additional/ space,
> not the /entire/ space, and you'll get what you need. But if you're
> potentially overwriting what's there already, better fallocate the entire
> space, which triggers the btrfs worst-case allocation behavior you see,
> in order to guarantee it won't ENOSPC during the actual write.
>
> Of course the only time the behavior actually differs is with COW, but
> then there's a BIG difference, but that BIG difference has a GOOD BIG
> reason! =:^)
>
> Tho that difference will certainly necessitate some relearning of the
> /correct/ way to do it, for devs who were doing it the COW-worst-case way
> all along, even if they didn't actually need to, because it didn't happen
> to make a difference on what they happened to be testing on, which
> happened not to be COW...
>
> Reminds me of the way newer versions of gcc and/or trying to build with
> clang as well tends to trigger relearning, because newer versions are
> stricter in order to allow better optimization, and other
> implementations are simply different in what they're strict on, /because/
> they're a different implementation. Well, btrfs is stricter... because
> it's a different implementation that /has/ to be stricter... due to COW.
Except that that strictness breaks userspace programs that are doing
perfectly reasonable things.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Massive loss of disk space
2017-08-01 15:00 ` Austin S. Hemmelgarn
2017-08-01 15:24 ` pwm
@ 2017-08-02 17:52 ` Goffredo Baroncelli
2017-08-02 19:10 ` Austin S. Hemmelgarn
` (2 more replies)
1 sibling, 3 replies; 26+ messages in thread
From: Goffredo Baroncelli @ 2017-08-02 17:52 UTC (permalink / raw)
To: Austin S. Hemmelgarn, pwm, Hugo Mills; +Cc: linux-btrfs
Hi,
On 2017-08-01 17:00, Austin S. Hemmelgarn wrote:
> OK, I just did a dead simple test by hand, and it looks like I was right. The method I used to check this is as follows:
> 1. Create and mount a reasonably small filesystem (I used an 8G temporary LV for this, a file would work too though).
> 2. Using dd or a similar tool, create a test file that takes up half of the size of the filesystem. It is important that this _not_ be fallocated, but just written out.
> 3. Use `fallocate -l` to try and extend the size of the file beyond half the size of the filesystem.
>
> For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it will succeed with no error. Based on this and some low-level inspection, it looks like BTRFS treats the full range of the fallocate call as unallocated, and thus is trying to allocate space for regions of that range that are already allocated.
I can confirm this behavior; below are some steps to reproduce it [2]. However, I don't think it is a bug: this is the correct behavior for a COW filesystem (see below).
Looking at the function btrfs_fallocate() (file fs/btrfs/file.c)
static long btrfs_fallocate(struct file *file, int mode,
                            loff_t offset, loff_t len)
{
        [...]
        alloc_start = round_down(offset, blocksize);
        alloc_end = round_up(offset + len, blocksize);
        [...]
        /*
         * Only trigger disk allocation, don't trigger qgroup reserve
         *
         * For qgroup space, it will be checked later.
         */
        ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode),
                        alloc_end - alloc_start);
it seems that BTRFS always allocates the maximum space required, without considering the space already allocated. Is it too conservative? I think not: consider the following scenario:
a) create a 2GB file
b) fallocate -o 1GB -l 2GB
c) write from 1GB to 3GB
after b), the expectation is that c) always succeeds [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space, because there could be a small time window where both the old and the new data exist on the disk.
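A scaled-down version of this scenario (sizes in MB rather than GB, file name arbitrary) can be run with util-linux fallocate(1) and dd(1):

```shell
set -e
f=demo.bin

# a) create a file with real written data (2 MB stands in for 2 GB)
dd if=/dev/zero of="$f" bs=1M count=2 status=none

# b) preallocate a range that half-overlaps the existing data:
#    offset 1 MB, length 2 MB -> the file grows to 3 MB
fallocate -o $((1*1024*1024)) -l $((2*1024*1024)) "$f"

# c) the write from 1 MB to 3 MB is the one fallocate is supposed
#    to guarantee against ENOSPC
dd if=/dev/zero of="$f" bs=1M seek=1 count=2 conv=notrunc status=none

stat -c %s "$f"   # prints 3145728 (3 MB)
```

On ext4 or XFS, step b) only needs to allocate the 1 MB tail; the point under discussion is that btrfs reserves space for the full 2 MB range.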
My opinion is that in general this behavior is correct due to the COW nature of BTRFS.
The only exception that I can find, is about the "nocow" file. For these cases taking in accout the already allocated space would be better.
Comments are welcome.
BR
G.Baroncelli
[1] from man 2 fallocate
[...]
After a successful call, subsequent writes into the range specified by offset and len are
guaranteed not to fail because of lack of disk space.
[...]
[2]
-- create a 5G btrfs filesystem
# mkdir t1
# truncate --size 5G disk
# losetup /dev/loop0 disk
# mkfs.btrfs /dev/loop0
# mount /dev/loop0 t1
-- test
-- create a 1500 MB file, then expand it to 4000 MB
-- expected result: the file is 4000 MB in size
-- actual result: the expansion fails
# fallocate -l $((1024*1024*100*15)) file.bin
# fallocate -l $((1024*1024*100*40)) file.bin
fallocate: fallocate failed: No space left on device
# ls -lh file.bin
-rw-r--r-- 1 root root 1.5G Aug 2 19:09 file.bin
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Massive loss of disk space
2017-08-02 17:52 ` Goffredo Baroncelli
@ 2017-08-02 19:10 ` Austin S. Hemmelgarn
2017-08-02 21:05 ` Goffredo Baroncelli
2017-08-03 3:48 ` Duncan
2017-08-03 11:44 ` Marat Khalili
2 siblings, 1 reply; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-02 19:10 UTC (permalink / raw)
To: kreijack, pwm, Hugo Mills; +Cc: linux-btrfs
On 2017-08-02 13:52, Goffredo Baroncelli wrote:
> Hi,
>
> On 2017-08-01 17:00, Austin S. Hemmelgarn wrote:
>> OK, I just did a dead simple test by hand, and it looks like I was right. The method I used to check this is as follows:
>> 1. Create and mount a reasonably small filesystem (I used an 8G temporary LV for this, a file would work too though).
>> 2. Using dd or a similar tool, create a test file that takes up half of the size of the filesystem. It is important that this _not_ be fallocated, but just written out.
>> 3. Use `fallocate -l` to try and extend the size of the file beyond half the size of the filesystem.
>>
>> For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it will succeed with no error. Based on this and some low-level inspection, it looks like BTRFS treats the full range of the fallocate call as unallocated, and thus is trying to allocate space for regions of that range that are already allocated.
>
> I can confirm this behavior; below are some steps to reproduce it [2]. However, I don't think it is a bug: this is the correct behavior for a COW filesystem (see below).
>
>
> Looking at the function btrfs_fallocate() (file fs/btrfs/file.c)
>
>
> static long btrfs_fallocate(struct file *file, int mode,
>                             loff_t offset, loff_t len)
> {
>         [...]
>         alloc_start = round_down(offset, blocksize);
>         alloc_end = round_up(offset + len, blocksize);
>         [...]
>         /*
>          * Only trigger disk allocation, don't trigger qgroup reserve
>          *
>          * For qgroup space, it will be checked later.
>          */
>         ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode),
>                         alloc_end - alloc_start);
>
>
> it seems that BTRFS always allocates the maximum space required, without considering the space already allocated. Is it too conservative? I think not: consider the following scenario:
>
> a) create a 2GB file
> b) fallocate -o 1GB -l 2GB
> c) write from 1GB to 3GB
>
> after b), the expectation is that c) always succeeds [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space, because there could be a small time window where both the old and the new data exist on the disk.
There is also an expectation based on pretty much every other FS in
existence that calling fallocate() on a range that is already in use is
a (possibly expensive) no-op, and by extension using fallocate() with an
offset of 0 like a ftruncate() call will succeed as long as the new size
will fit.
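That ftruncate()-like usage is, roughly, the following (a sketch with arbitrary small sizes; on most filesystems the second call is cheap because the first megabyte is already allocated):

```shell
set -e
f=grow.bin
# A file with 1 MB of ordinary written (not preallocated) data.
dd if=/dev/zero of="$f" bs=1M count=1 status=none
# Extend it to 2 MB with an offset of 0, covering the existing data
# as well -- the SnapRAID-style "make sure it isn't sparse and grow
# it" call. ext4/XFS only allocate the new 1 MB tail; btrfs, as
# discussed in this thread, reserves space for the whole 2 MB again.
fallocate -l $((2*1024*1024)) "$f"
stat -c %s "$f"   # prints 2097152 (2 MB)
```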
I've checked JFS, XFS, ext4, vfat, NTFS (via NTFS-3G, not the kernel
driver), NILFS2, OCFS2 (local mode only), F2FS, UFS, and HFS+ on Linux,
UFS and HFS+ on OS X, UFS and ZFS on FreeBSD, FFS (UFS with a different
name) and LFS (log structured) on NetBSD, and UFS and ZFS on Solaris,
and VxFS on HP-UX, and _all_ of them behave correctly here and succeed
with the test I listed, while BTRFS does not. This isn't codified in
POSIX, but it's also not something that is listed as implementation
defined, which in turn means that we should be trying to match the other
implementations.
>
> My opinion is that in general this behavior is correct due to the COW nature of BTRFS.
> The only exception that I can find is the "nocow" file. For these cases, taking into account the already allocated space would be better.
There are other, saner ways to make that expectation hold though, and
I'm not even certain that it does as things are implemented (I believe
we still CoW unwritten extents when data is written to them, because I
_have_ had writes to fallocate'ed files fail on BTRFS before with -ENOSPC).
The ideal situation IMO is as follows:
1. This particular case (using fallocate() with an offset of 0 to extend
a file that is already larger than half the remaining free space on the
FS) _should_ succeed. Short of very convoluted configurations,
extending a file with fallocate will not result in over-committing space
on a CoW filesystem unless it would extend the file by more than the
remaining free space, and therefore barring long external interactions,
subsequent writes will also succeed. Proof of this for a general case
is somewhat complicated, but in the very specific case of the script I
posted as a reproducer in the other thread about this and the test case
I gave in this thread, it's trivial to prove that the writes will
succeed. Either way, the behavior of SnapRAID, while not optimal in
this case, is still a legitimate usage (I've seen programs do things
like that just to make sure the file isn't sparse).
2. Conversion of unwritten extents to written ones should not require
new allocation. Ideally, we need to be allocating not just space for
the data, but also reasonable space for the associated metadata when
allocating an unwritten extent, and there should be no CoW involved when
they are written to except for the small metadata updates required to
account the new blocks. Unless we're doing this, then we have edge
cases where the above listed expectation does not hold (also note
that GlobalReserve does not count IMO, it's supposed to be for temporary
usage only and doesn't ever appear to be particularly large).
3. There should be some small amount of space reserved globally for not
just metadata, but data too, so that a 'full' filesystem can still
update existing files reliably. I'm not sure that we're not doing this
already, but AIUI, GlobalReserve is metadata only. If we do this, we
don't have to worry _as much_ about avoiding CoW when converting
unwritten extents to regular ones.
>
> Comments are welcome.
>
> BR
> G.Baroncelli
>
> [1] from man 2 fallocate
> [...]
> After a successful call, subsequent writes into the range specified by offset and len are
> guaranteed not to fail because of lack of disk space.
> [...]
>
>
> [2]
>
> -- create a 5G btrfs filesystem
>
> # mkdir t1
> # truncate --size 5G disk
> # losetup /dev/loop0 disk
> # mkfs.btrfs /dev/loop0
> # mount /dev/loop0 t1
>
> -- test
> -- create a 1500 MB file, then expand it to 4000 MB
> -- expected result: the file is 4000 MB in size
> -- actual result: the expansion fails
>
> # fallocate -l $((1024*1024*100*15)) file.bin
> # fallocate -l $((1024*1024*100*40)) file.bin
> fallocate: fallocate failed: No space left on device
> # ls -lh file.bin
> -rw-r--r-- 1 root root 1.5G Aug 2 19:09 file.bin
>
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Massive loss of disk space
2017-08-02 19:10 ` Austin S. Hemmelgarn
@ 2017-08-02 21:05 ` Goffredo Baroncelli
2017-08-03 11:39 ` Austin S. Hemmelgarn
0 siblings, 1 reply; 26+ messages in thread
From: Goffredo Baroncelli @ 2017-08-02 21:05 UTC (permalink / raw)
To: Austin S. Hemmelgarn, pwm, Hugo Mills; +Cc: linux-btrfs
On 2017-08-02 21:10, Austin S. Hemmelgarn wrote:
> On 2017-08-02 13:52, Goffredo Baroncelli wrote:
>> Hi,
>>
[...]
>> consider the following scenario:
>>
>> a) create a 2GB file
>> b) fallocate -o 1GB -l 2GB
>> c) write from 1GB to 3GB
>>
>> after b), the expectation is that c) always succeeds [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space, because there could be a small time window where both the old and the new data exist on the disk.
> There is also an expectation based on pretty much every other FS in existence that calling fallocate() on a range that is already in use is a (possibly expensive) no-op, and by extension using fallocate() with an offset of 0 like a ftruncate() call will succeed as long as the new size will fit.
The man page of fallocate doesn't guarantee that.
Unfortunately, in a COW filesystem the assumption that an allocated area may simply be overwritten is not true.
Let me say it in other words: as a general rule, if you want to _write_ something in a COW filesystem, you need space. It doesn't matter whether you are *over-writing* existing data or *appending* to a file.
>
> I've checked JFS, XFS, ext4, vfat, NTFS (via NTFS-3G, not the kernel driver), NILFS2, OCFS2 (local mode only), F2FS, UFS, and HFS+ on Linux, UFS and HFS+ on OS X, UFS and ZFS on FreeBSD, FFS (UFS with a different name) and LFS (log structured) on NetBSD, and UFS and ZFS on Solaris, and VxFS on HP-UX, and _all_ of them behave correctly here and succeed with the test I listed, while BTRFS does not. This isn't codified in POSIX, but it's also not something that is listed as implementation defined, which in turn means that we should be trying to match the other implementations.
[...]
>
>>
>> My opinion is that in general this behavior is correct due to the COW nature of BTRFS.
>> The only exception that I can find is the "nocow" file. For these cases, taking into account the already allocated space would be better.
> There are other, saner ways to make that expectation hold though, and I'm not even certain that it does as things are implemented (I believe we still CoW unwritten extents when data is written to them, because I _have_ had writes to fallocate'ed files fail on BTRFS before with -ENOSPC).
>
> The ideal situation IMO is as follows:
>
> 1. This particular case (using fallocate() with an offset of 0 to extend a file that is already larger than half the remaining free space on the FS) _should_ succeed.
This description is not accurate. What happens is the following:
1) you have a file *with valid data*
2) you want to prepare an update of this file and want to be sure to have enough space
at this point fallocate has to guarantee:
a) you have your old data still available
b) you have allocated the space for the update
In terms of a COW filesystem, you need the space of a) + the space of b)
> Short of very convoluted configurations, extending a file with fallocate will not result in over-committing space on a CoW filesystem unless it would extend the file by more than the remaining free space, and therefore barring long external interactions, subsequent writes will also succeed. Proof of this for a general case is somewhat complicated, but in the very specific case of the script I posted as a reproducer in the other thread about this and the test case I gave in this thread, it's trivial to prove that the writes will succeed. Either way, the behavior of SnapRAID, while not optimal in this case, is still a legitimate usage (I've seen programs do things like that just to make sure the file isn't sparse).
>
> 2. Conversion of unwritten extents to written ones should not require new allocation. Ideally, we need to be allocating not just space for the data, but also reasonable space for the associated metadata when allocating an unwritten extent, and there should be no CoW involved when they are written to except for the small metadata updates required to account the new blocks. Unless we're doing this, then we have edge cases where the above listed expectation does not hold (also note that GlobalReserve does not count IMO, it's supposed to be for temporary usage only and doesn't ever appear to be particularly large).
>
> 3. There should be some small amount of space reserved globally for not just metadata, but data too, so that a 'full' filesystem can still update existing files reliably. I'm not sure that we're not doing this already, but AIUI, GlobalReserve is metadata only. If we do this, we don't have to worry _as much_ about avoiding CoW when converting unwritten extents to regular ones.
>>
>> Comments are welcome.
>>
>> BR
>> G.Baroncelli
>>
>> [1] from man 2 fallocate
>> [...]
>> After a successful call, subsequent writes into the range specified by offset and len are
>> guaranteed not to fail because of lack of disk space.
>> [...]
>>
>>
>> [2]
>>
>> -- create a 5G btrfs filesystem
>>
>> # mkdir t1
>> # truncate --size 5G disk
>> # losetup /dev/loop0 disk
>> # mkfs.btrfs /dev/loop0
>> # mount /dev/loop0 t1
>>
>> -- test
>> -- create a 1500 MB file, then expand it to 4000 MB
>> -- expected result: the file is 4000 MB in size
>> -- actual result: the expansion fails
>>
>> # fallocate -l $((1024*1024*100*15)) file.bin
>> # fallocate -l $((1024*1024*100*40)) file.bin
>> fallocate: fallocate failed: No space left on device
>> # ls -lh file.bin
>> -rw-r--r-- 1 root root 1.5G Aug 2 19:09 file.bin
>>
>>
>
>
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Massive loss of disk space
2017-08-02 17:52 ` Goffredo Baroncelli
2017-08-02 19:10 ` Austin S. Hemmelgarn
@ 2017-08-03 3:48 ` Duncan
2017-08-03 11:44 ` Marat Khalili
2 siblings, 0 replies; 26+ messages in thread
From: Duncan @ 2017-08-03 3:48 UTC (permalink / raw)
To: linux-btrfs
Goffredo Baroncelli posted on Wed, 02 Aug 2017 19:52:30 +0200 as
excerpted:
> it seems that BTRFS always allocates the maximum space required, without
> considering the space already allocated. Is it too conservative? I think
> not: consider the following scenario:
>
> a) create a 2GB file
> b) fallocate -o 1GB -l 2GB
> c) write from 1GB to 3GB
>
> after b), the expectation is that c) always succeeds [1]: i.e. there is
> enough space on the filesystem. Due to the COW nature of BTRFS, you
> cannot rely on the already allocated space because there could be a
> small time window where both the old and the new data exist on the
> disk.
Not only a small time, perhaps (effectively) permanently, due to either
of two factors:
1) If the existing extents are reflinked by snapshots or other files they
obviously won't be released at all when the overwrite is completed.
fallocate must account for this possibility, and behaving differently in
the context of other reflinks would be confusing, so the best policy is
consistently behave as if the existing data will not be freed.
2) As the devs have commented a number of times, an extent isn't freed if
there's still a reflink to part of it. If the original extent was a full
1 GiB data chunk (the chunk being the max size of a native btrfs extent,
one of the reasons a balance and defrag after conversion from ext4 and
deletion of the ext4-saved subvolume is recommended, to break up the
longer ext4 extents so they won't cause btrfs problems later) and all but
a single 4 KiB block has been rewritten, the full 1 GiB extent will
remain referenced and continue to take that original full 1 GiB space,
*plus* the space of all the new-version extents of the overwritten data,
of course.
So in our fallocate and overwrite scenario, we again must reserve space
for two copies of the data, the original which may well not be freed even
without other reflinks, if a single 4 KiB block of an extent remains
unoverwritten, and the new version of the data.
At least that /was/ the behavior explained on-list previous to the hole-
punching changes. I'm not a dev and haven't seen a dev comment on
whether that remains the behavior after hole-punching, which may at least
naively be expected to automatically handle and free overwritten data
using hole-punching, or not. I'd be interested in seeing someone who can
read the code confirm one way or the other whether hole-punching changed
that previous behavior, or not.
> My opinion is that in general this behavior is correct due to the COW
> nature of BTRFS.
> The only exception that I can find, is about the "nocow" file. For these
> cases, taking into account the already allocated space would be better.
I'd say it's dangerously optimistic even then, considering that "nocow"
is actually "cow1" in the presence of snapshots.
Meanwhile, it's worth keeping in mind that it's exactly these sorts of
corner-cases that are why btrfs is taking so long to stabilize.
Supposedly "simple" expectations aren't always so simple, and if a
filesystem gets it wrong, it's somebody's data hanging in the balance!
(Tho if they've any wisdom at all, they'll ensure they're aware of the
stability status of a filesystem before they put data on it, and will
adjust their backup policies accordingly if they're using a still not
fully stabilized filesystem such as btrfs, so the data won't actually be
in any danger anyway unless it was literally throw-away value, only
whatever specific instance of it was involved in that corner-case.)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Massive loss of disk space
2017-08-02 21:05 ` Goffredo Baroncelli
@ 2017-08-03 11:39 ` Austin S. Hemmelgarn
2017-08-03 16:37 ` Goffredo Baroncelli
0 siblings, 1 reply; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-03 11:39 UTC (permalink / raw)
To: kreijack, pwm, Hugo Mills; +Cc: linux-btrfs
On 2017-08-02 17:05, Goffredo Baroncelli wrote:
> On 2017-08-02 21:10, Austin S. Hemmelgarn wrote:
>> On 2017-08-02 13:52, Goffredo Baroncelli wrote:
>>> Hi,
>>>
> [...]
>
>>> consider the following scenario:
>>>
>>> a) create a 2GB file
>>> b) fallocate -o 1GB -l 2GB
>>> c) write from 1GB to 3GB
>>>
>>> after b), the expectation is that c) always succeeds [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space, because there could be a small time window where both the old and the new data exist on the disk.
>
>> There is also an expectation based on pretty much every other FS in existence that calling fallocate() on a range that is already in use is a (possibly expensive) no-op, and by extension using fallocate() with an offset of 0 like a ftruncate() call will succeed as long as the new size will fit.
>
> The man page of fallocate doesn't guarantee that.
>
> Unfortunately, in a COW filesystem the assumption that an allocated area may simply be overwritten is not true.
>
> Let me say it in other words: as a general rule, if you want to _write_ something in a COW filesystem, you need space. It doesn't matter whether you are *over-writing* existing data or *appending* to a file.
Yes, you need space, but you don't need _all_ the space. For a file
that already has data in it, you only _need_ as much space as the
largest chunk of data that can be written at once at a low level,
because the moment that first write finishes, the space that was used in
the file for that region is freed, and the next write can go there. Put
a bit differently, you only need to allocate what isn't allocated in the
region, and then a bit more to handle the initial write to the file.
Also, as I said below, _THIS WORKS ON ZFS_. That immediately means that
a CoW filesystem _does not_ need to behave the way BTRFS does.
>
>
>>
>> I've checked JFS, XFS, ext4, vfat, NTFS (via NTFS-3G, not the kernel driver), NILFS2, OCFS2 (local mode only), F2FS, UFS, and HFS+ on Linux, UFS and HFS+ on OS X, UFS and ZFS on FreeBSD, FFS (UFS with a different name) and LFS (log structured) on NetBSD, and UFS and ZFS on Solaris, and VxFS on HP-UX, and _all_ of them behave correctly here and succeed with the test I listed, while BTRFS does not. This isn't codified in POSIX, but it's also not something that is listed as implementation defined, which in turn means that we should be trying to match the other implementations.
>
> [...]
>
>>
>>>
>>> My opinion is that in general this behavior is correct due to the COW nature of BTRFS.
>>> The only exception that I can find is the "nocow" file. For these cases, taking into account the already allocated space would be better.
>> There are other, saner ways to make that expectation hold though, and I'm not even certain that it does as things are implemented (I believe we still CoW unwritten extents when data is written to them, because I _have_ had writes to fallocate'ed files fail on BTRFS before with -ENOSPC).
>>
>> The ideal situation IMO is as follows:
>>
>> 1. This particular case (using fallocate() with an offset of 0 to extend a file that is already larger than half the remaining free space on the FS) _should_ succeed.
>
> This description is not accurate. What happens is the following:
> 1) you have a file *with valid data*
> 2) you want to prepare an update of this file and want to be sure to have enough space
Except this is not the common case. Most filesystems aren't CoW, so
calling fallocate() like this is generally not 'ensuring you have enough
space', it's 'ensuring the file isn't sparse, and we can write to the
extra area beyond the end we care about'.
>
> at this point fallocate has to guarantee:
> a) you have your old data still available
> b) you have allocated the space for the update
>
> In terms of a COW filesystem, you need the space of a) + the space of b)
No, that is only required if the entire file needs to be written
atomically. There is some maximal size atomic write that BTRFS can
perform as a single operation at a low level (I'm not sure if this is
equal to the block size, or larger, but it doesn't matter much, either
way, I'm talking the largest chunk of data it will write to a disk in a
single operation before updating metadata to point to that new data).
If your total size (original data plus the new space) is less than this
maximal atomic write size, then the above is true, but if it is larger,
you only need to allocate space for regions of the fallocate() range
that aren't already allocated, plus space to accommodate at least one
write of this maximal atomic write size. Any space beyond that just
ends up minimizing the degree of fragmentation introduced by allocation.
The methodology that allows this is really simple. When you start to
write data to the file, the first part of the write goes into the newly
allocated space, and the original region covered by that write gets
freed. You can then write into the space that was just freed and repeat
the process until the write is done. Implementing this requires the
freeing process to know that the freed region was covered by an
fallocate() call, and thus that it should be saved for future writes.
Provided that the back-conversion from used space to fallocated() space
is done directly, this is also race free.
>
>
>> Short of very convoluted configurations, extending a file with fallocate will not result in over-committing space on a CoW filesystem unless it would extend the file by more than the remaining free space, and therefore barring long external interactions, subsequent writes will also succeed. Proof of this for a general case is somewhat complicated, but in the very specific case of the script I posted as a reproducer in the other thread about this and the test case I gave in this thread, it's trivial to prove that the writes will succeed. Either way, the behavior of SnapRAID, while not optimal in this case, is still a legitimate usage (I've seen programs do things like that just to make sure the file isn't sparse).
>>
>> 2. Conversion of unwritten extents to written ones should not require new allocation. Ideally, we need to be allocating not just space for the data, but also reasonable space for the associated metadata when allocating an unwritten extent, and there should be no CoW involved when they are written to except for the small metadata updates required to account the new blocks. Unless we're doing this, then we have edge cases where the above listed expectation does not hold (also note that GlobalReserve does not count IMO, it's supposed to be for temporary usage only and doesn't ever appear to be particularly large).
>>
>> 3. There should be some small amount of space reserved globally for not just metadata, but data too, so that a 'full' filesystem can still update existing files reliably. I'm not sure that we're not doing this already, but AIUI, GlobalReserve is metadata only. If we do this, we don't have to worry _as much_ about avoiding CoW when converting unwritten extents to regular ones.
>>>
>>> Comments are welcome.
>>>
>>> BR
>>> G.Baroncelli
>>>
>>> [1] from man 2 fallocate
>>> [...]
>>> After a successful call, subsequent writes into the range specified by offset and len are
>>> guaranteed not to fail because of lack of disk space.
>>> [...]
>>>
>>>
>>> [2]
>>>
>>> -- create a 5G btrfs filesystem
>>>
>>> # mkdir t1
>>> # truncate --size 5G disk
>>> # losetup /dev/loop0 disk
>>> # mkfs.btrfs /dev/loop0
>>> # mount /dev/loop0 t1
>>>
>>> -- test
>>> -- create a 1500 MB file, then expand it to 4000 MB
>>> -- expected result: the file is 4000 MB in size
>>> -- actual result: the expansion fails
>>>
>>> # fallocate -l $((1024*1024*100*15)) file.bin
>>> # fallocate -l $((1024*1024*100*40)) file.bin
>>> fallocate: fallocate failed: No space left on device
>>> # ls -lh file.bin
>>> -rw-r--r-- 1 root root 1.5G Aug 2 19:09 file.bin
>>>
>>>
>>
>>
>
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Massive loss of disk space
2017-08-02 17:52 ` Goffredo Baroncelli
2017-08-02 19:10 ` Austin S. Hemmelgarn
2017-08-03 3:48 ` Duncan
@ 2017-08-03 11:44 ` Marat Khalili
2017-08-03 11:52 ` Austin S. Hemmelgarn
2017-08-03 16:01 ` Goffredo Baroncelli
2 siblings, 2 replies; 26+ messages in thread
From: Marat Khalili @ 2017-08-03 11:44 UTC (permalink / raw)
To: Austin S. Hemmelgarn, linux-btrfs; +Cc: kreijack, pwm, Hugo Mills
On 02/08/17 20:52, Goffredo Baroncelli wrote:
> consider the following scenario:
>
> a) create a 2GB file
> b) fallocate -o 1GB -l 2GB
> c) write from 1GB to 3GB
>
> after b), the expectation is that c) always succeeds [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space, because there could be a small time window where both the old and the new data exist on the disk.
Just curious. With current implementation, in the following case:
a) create a 2GB file1 && create a 2GB file2
b) fallocate -o 1GB -l 2GB file1 && fallocate -o 1GB -l 2GB file2
c) write from 1GB to 3GB file1 && write from 1GB to 3GB file2
will (c) always succeed? I.e. does fallocate really allocate 2GB per
file, or does it only allocate an additional 1GB and check free space for
another 1GB? If it's only the latter, it is useless.
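The two-file experiment, scaled down to MB, looks like this (a sketch; file names and sizes are arbitrary, and answering the reservation question for btrfs would additionally require running it on a btrfs mount and comparing `btrfs filesystem usage` before and after step b)):

```shell
set -e
for f in file1.bin file2.bin; do
    dd if=/dev/zero of="$f" bs=1M count=2 status=none          # a) 2 MB of data
    fallocate -o $((1*1024*1024)) -l $((2*1024*1024)) "$f"     # b) grow to 3 MB
    dd if=/dev/zero of="$f" bs=1M seek=1 count=2 conv=notrunc \
       status=none                                             # c) write 1..3 MB
done
stat -c '%n %s' file1.bin file2.bin   # both files end up 3145728 bytes
```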
--
With Best Regards,
Marat Khalili
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Massive loss of disk space
2017-08-03 11:44 ` Marat Khalili
@ 2017-08-03 11:52 ` Austin S. Hemmelgarn
2017-08-03 16:01 ` Goffredo Baroncelli
1 sibling, 0 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-03 11:52 UTC (permalink / raw)
To: Marat Khalili, linux-btrfs; +Cc: kreijack, pwm, Hugo Mills
On 2017-08-03 07:44, Marat Khalili wrote:
> On 02/08/17 20:52, Goffredo Baroncelli wrote:
>> consider the following scenario:
>>
>> a) create a 2GB file
>> b) fallocate -o 1GB -l 2GB
>> c) write from 1GB to 3GB
>>
> after b), the expectation is that c) always succeeds [1]: i.e. there is
>> enough space on the filesystem. Due to the COW nature of BTRFS, you
>> cannot rely on the already allocated space because there could be a
> small time window where both the old and the new data exist on the disk.
> Just curious. With current implementation, in the following case:
> a) create a 2GB file1 && create a 2GB file2
> b) fallocate -o 1GB -l 2GB file1 && fallocate -o 1GB -l 2GB file2
> c) write from 1GB to 3GB file1 && write from 1GB to 3GB file2
> will (c) always succeed? I.e. does fallocate really allocate 2GB per
> file, or does it only allocate additional 1GB and check free space for
> another 1GB? If it's only the latter, it is useless.
It will currently allocate 4GB total in this case (2 for each file), and
_should_ succeed. I think there are corner cases where it can fail
though because of metadata exhaustion, and I'm still not certain we
don't CoW unwritten extents (if we do CoW unwritten extents, then this,
and all fallocate allocation for that matter, becomes non-deterministic
as to whether or not it succeeds).
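At the CLI level, the two-file scenario looks like this (a hedged, scaled-down sketch using MB instead of GB so it runs on any filesystem; the btrfs-specific point is that each fallocate call reserves the whole requested range, not just the extending tail):

```shell
# Scaled-down sketch of the scenario above (MB instead of GB).
# On btrfs, each fallocate reserves the full [1M,3M) range up front,
# not just the extending 1MB tail.
set -e
cd "$(mktemp -d)"

fallocate -l 2M file1            # a) create a 2MB file
fallocate -l 2M file2
fallocate -o 1M -l 2M file1      # b) reserve [1M,3M); file grows to 3MB
fallocate -o 1M -l 2M file2
stat -c '%n %s' file1 file2      # both now report 3145728 bytes
```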
* Re: Massive loss of disk space
2017-08-03 11:44 ` Marat Khalili
2017-08-03 11:52 ` Austin S. Hemmelgarn
@ 2017-08-03 16:01 ` Goffredo Baroncelli
2017-08-03 17:15 ` Marat Khalili
2017-08-03 22:51 ` pwm
1 sibling, 2 replies; 26+ messages in thread
From: Goffredo Baroncelli @ 2017-08-03 16:01 UTC (permalink / raw)
To: Marat Khalili, Austin S. Hemmelgarn, linux-btrfs; +Cc: pwm, Hugo Mills
On 2017-08-03 13:44, Marat Khalili wrote:
> On 02/08/17 20:52, Goffredo Baroncelli wrote:
>> consider the following scenario:
>>
>> a) create a 2GB file
>> b) fallocate -o 1GB -l 2GB
>> c) write from 1GB to 3GB
>>
>> after b), the expectation is that c) always succeed [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space because there could be a small time window where both the old and the new data exists on the disk.
> Just curious. With current implementation, in the following case:
> a) create a 2GB file1 && create a 2GB file2
> b) fallocate -o 1GB -l 2GB file1 && fallocate -o 1GB -l 2GB file2
At this step you are trying to allocate 3GB + 3GB = 6GB, so you have exhausted the filesystem space.
> c) write from 1GB to 3GB file1 && write from 1GB to 3GB file2
> will (c) always succeed? I.e. does fallocate really allocate 2GB per file, or does it only allocate additional 1GB and check free space for another 1GB? If it's only the latter, it is useless.
The file is physically extended:
ghigo@venice:/tmp$ fallocate -l 1000 foo.txt
ghigo@venice:/tmp$ ls -l foo.txt
-rw-r--r-- 1 ghigo ghigo 1000 Aug 3 18:00 foo.txt
ghigo@venice:/tmp$ fallocate -o 500 -l 1000 foo.txt
ghigo@venice:/tmp$ ls -l foo.txt
-rw-r--r-- 1 ghigo ghigo 1500 Aug 3 18:00 foo.txt
ghigo@venice:/tmp$
>
> --
>
> With Best Regards,
> Marat Khalili
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: Massive loss of disk space
2017-08-03 11:39 ` Austin S. Hemmelgarn
@ 2017-08-03 16:37 ` Goffredo Baroncelli
2017-08-03 17:23 ` Austin S. Hemmelgarn
0 siblings, 1 reply; 26+ messages in thread
From: Goffredo Baroncelli @ 2017-08-03 16:37 UTC (permalink / raw)
To: Austin S. Hemmelgarn, pwm, Hugo Mills; +Cc: linux-btrfs
On 2017-08-03 13:39, Austin S. Hemmelgarn wrote:
> On 2017-08-02 17:05, Goffredo Baroncelli wrote:
>> On 2017-08-02 21:10, Austin S. Hemmelgarn wrote:
>>> On 2017-08-02 13:52, Goffredo Baroncelli wrote:
>>>> Hi,
>>>>
>> [...]
>>
>>>> consider the following scenario:
>>>>
>>>> a) create a 2GB file
>>>> b) fallocate -o 1GB -l 2GB
>>>> c) write from 1GB to 3GB
>>>>
>>>> after b), the expectation is that c) always succeed [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space because there could be a small time window where both the old and the new data exists on the disk.
>>
>>> There is also an expectation based on pretty much every other FS in existence that calling fallocate() on a range that is already in use is a (possibly expensive) no-op, and by extension using fallocate() with an offset of 0 like a ftruncate() call will succeed as long as the new size will fit.
>>
>> The man page of fallocate doesn't guarantee that.
>>
>> Unfortunately in a COW filesystem the assumption that an allocate area may be simply overwritten is not true.
>>
>> Let me say it in other words: as a general rule, if you want to _write_ something in a CoW filesystem, you need space. It doesn't matter whether you are *over-writing* existing data or *appending* to a file.
> Yes, you need space, but you don't need _all_ the space. For a file that already has data in it, you only _need_ as much space as the largest chunk of data that can be written at once at a low level, because the moment that first write finishes, the space that was used in the file for that region is freed, and the next write can go there. Put a bit differently, you only need to allocate what isn't allocated in the region, and then a bit more to handle the initial write to the file.
>
> Also, as I said below, _THIS WORKS ON ZFS_. That immediately means that a CoW filesystem _does not_ need to behave like BTRFS does.
It seems that ZFS on linux doesn't support fallocate
see https://github.com/zfsonlinux/zfs/issues/326
So I think that you are referring to posix_fallocate and ZFS on Solaris, which I can't test, so I can't comment.
[...]
>> In terms of a COW filesystem, you need the space of a) + the space of b)
> No, that is only required if the entire file needs to be written atomically. There is some maximal size atomic write that BTRFS can perform as a single operation at a low level (I'm not sure if this is equal to the block size, or larger, but it doesn't matter much, either way, I'm talking the largest chunk of data it will write to a disk in a single operation before updating metadata to point to that new data).
To the best of my knowledge there is only a time limit: IIRC, every 30 seconds a transaction is closed. If you are able to fill the filesystem within this time window you are in trouble.
[...]
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: Massive loss of disk space
2017-08-03 16:01 ` Goffredo Baroncelli
@ 2017-08-03 17:15 ` Marat Khalili
2017-08-03 17:25 ` Austin S. Hemmelgarn
2017-08-03 22:51 ` pwm
1 sibling, 1 reply; 26+ messages in thread
From: Marat Khalili @ 2017-08-03 17:15 UTC (permalink / raw)
To: kreijack, Goffredo Baroncelli, Austin S. Hemmelgarn, linux-btrfs
Cc: pwm, Hugo Mills
On August 3, 2017 7:01:06 PM GMT+03:00, Goffredo Baroncelli wrote:
>The file is physically extended
>
>ghigo@venice:/tmp$ fallocate -l 1000 foo.txt
For clarity let's replace the fallocate above with:
$ head -c 1000 </dev/urandom >foo.txt
>ghigo@venice:/tmp$ ls -l foo.txt
>-rw-r--r-- 1 ghigo ghigo 1000 Aug 3 18:00 foo.txt
>ghigo@venice:/tmp$ fallocate -o 500 -l 1000 foo.txt
>ghigo@venice:/tmp$ ls -l foo.txt
>-rw-r--r-- 1 ghigo ghigo 1500 Aug 3 18:00 foo.txt
>ghigo@venice:/tmp$
According to the explanation by Austin, foo.txt at this point somehow occupies 2000 bytes of space, because I can reflink it and then write another 1000 bytes of data into it without losing the 1000 bytes I already have or running out of drive space. (Or is it only true while there are open file handles?)
--
With Best Regards,
Marat Khalili
* Re: Massive loss of disk space
2017-08-03 16:37 ` Goffredo Baroncelli
@ 2017-08-03 17:23 ` Austin S. Hemmelgarn
2017-08-04 14:45 ` Goffredo Baroncelli
0 siblings, 1 reply; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-03 17:23 UTC (permalink / raw)
To: kreijack, pwm, Hugo Mills; +Cc: linux-btrfs
On 2017-08-03 12:37, Goffredo Baroncelli wrote:
> On 2017-08-03 13:39, Austin S. Hemmelgarn wrote:
>> On 2017-08-02 17:05, Goffredo Baroncelli wrote:
>>> On 2017-08-02 21:10, Austin S. Hemmelgarn wrote:
>>>> On 2017-08-02 13:52, Goffredo Baroncelli wrote:
>>>>> Hi,
>>>>>
>>> [...]
>>>
>>>>> consider the following scenario:
>>>>>
>>>>> a) create a 2GB file
>>>>> b) fallocate -o 1GB -l 2GB
>>>>> c) write from 1GB to 3GB
>>>>>
>>>>> after b), the expectation is that c) always succeed [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space because there could be a small time window where both the old and the new data exists on the disk.
>>>
>>>> There is also an expectation based on pretty much every other FS in existence that calling fallocate() on a range that is already in use is a (possibly expensive) no-op, and by extension using fallocate() with an offset of 0 like a ftruncate() call will succeed as long as the new size will fit.
>>>
>>> The man page of fallocate doesn't guarantee that.
>>>
>>> Unfortunately in a COW filesystem the assumption that an allocate area may be simply overwritten is not true.
>>>
>>> Let me say it in other words: as a general rule, if you want to _write_ something in a CoW filesystem, you need space. It doesn't matter whether you are *over-writing* existing data or *appending* to a file.
>> Yes, you need space, but you don't need _all_ the space. For a file that already has data in it, you only _need_ as much space as the largest chunk of data that can be written at once at a low level, because the moment that first write finishes, the space that was used in the file for that region is freed, and the next write can go there. Put a bit differently, you only need to allocate what isn't allocated in the region, and then a bit more to handle the initial write to the file.
>>
>> Also, as I said below, _THIS WORKS ON ZFS_. That immediately means that a CoW filesystem _does not_ need to behave like BTRFS does.
>
> It seems that ZFS on linux doesn't support fallocate
>
> see https://github.com/zfsonlinux/zfs/issues/326
>
> So I think that you are referring to a posix_fallocate and ZFS on solaris, which I can't test so I can't comment.
Both Solaris and FreeBSD (I've got a FreeNAS system at work I checked on).
That said, I'm starting to wonder if just failing fallocate() calls to
allocate space is actually the right thing to do here after all. Aside
from this, we don't reserve metadata space for checksums and similar
things for the eventual writes (so it's possible to get -ENOSPC on a
write to an fallocate'ed region anyway because of metadata exhaustion),
and splitting extents can also cause it to fail, so it's perfectly
possible for the fallocate assumption to not hold on BTRFS. The irony
of this is that if you're in a situation where you actually need to
reserve space, you're more likely to fail (because if you actually
_need_ to reserve the space, your filesystem may already be mostly full,
and therefore any of the above issues may occur).
On the specific note of splitting extents, the following will probably
fail on BTRFS as well when done with a large enough FS (the turn-over
point ends up being the point at which 256MiB isn't enough space to
account for all the extents), but will succeed with:
1. Create filesystem and mount it. On BTRFS, make sure autodefrag is
off (this makes it fail more reliably, but is not essential for it to fail).
2. Use fallocate to allocate as large a file as possible (in the BTRFS
case, try for the size of the filesystem minus 544MiB: 512MiB for the
metadata chunk and 32MiB for the system chunk).
3. Write half the file using 1MB blocks, skipping 1MB of space
between each block (so every other 1MB of space is actually written to).
4. Write the other half of the file by filling in the holes.
The net effect of this is to split the single large fallocate'd extent
into a very large number of 1MB extents, which in turn eats up lots of
metadata space and will eventually exhaust it. While this specific
exercise requires a large filesystem, more generic real world situations
exist where this can happen (and I have had this happen before).
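The write pattern in the steps above can be sketched as follows (scaled down to a 16MB file so it runs anywhere; the metadata-exhaustion failure itself would only show up on a large, near-full btrfs):

```shell
# Scaled-down sketch of the extent-splitting pattern described above:
# fallocate one extent, write every other 1MB block, then fill the
# holes.  On a large, near-full btrfs the resulting one-block extents
# can exhaust metadata space; here (16MB, any fs) it only demonstrates
# the write pattern itself.
set -e
cd "$(mktemp -d)"
fallocate -l 16M big.file
for i in $(seq 0 2 15); do       # pass 1: blocks 0,2,4,...,14
    dd if=/dev/zero of=big.file bs=1M count=1 seek=$i conv=notrunc status=none
done
for i in $(seq 1 2 15); do       # pass 2: fill the odd-numbered holes
    dd if=/dev/zero of=big.file bs=1M count=1 seek=$i conv=notrunc status=none
done
stat -c %s big.file              # size unchanged: 16777216
```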
>
> [...]
>>> In terms of a COW filesystem, you need the space of a) + the space of b)
>> No, that is only required if the entire file needs to be written atomically. There is some maximal size atomic write that BTRFS can perform as a single operation at a low level (I'm not sure if this is equal to the block size, or larger, but it doesn't matter much, either way, I'm talking the largest chunk of data it will write to a disk in a single operation before updating metadata to point to that new data).
>
> To the best of my knowledge there is only a time limit: IIRC, every 30 seconds a transaction is closed. If you are able to fill the filesystem within this time window you are in trouble.
Even with that, it's still possible to implement the method I outlined
by defining such a limit and forcing a transaction commit when that
limit is hit. I'm also not entirely convinced that the transaction is
the limiting factor here (I was under the impression that the
transaction just updates the top level metadata to point to the new tree
of metadata).
* Re: Massive loss of disk space
2017-08-03 17:15 ` Marat Khalili
@ 2017-08-03 17:25 ` Austin S. Hemmelgarn
0 siblings, 0 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-03 17:25 UTC (permalink / raw)
To: Marat Khalili, kreijack, linux-btrfs; +Cc: pwm, Hugo Mills
On 2017-08-03 13:15, Marat Khalili wrote:
> On August 3, 2017 7:01:06 PM GMT+03:00, Goffredo Baroncelli wrote:
>> The file is physically extended
>>
>> ghigo@venice:/tmp$ fallocate -l 1000 foo.txt
>
> For clarity let's replace the fallocate above with:
> $ head -c 1000 </dev/urandom >foo.txt
>
>> ghigo@venice:/tmp$ ls -l foo.txt
>> -rw-r--r-- 1 ghigo ghigo 1000 Aug 3 18:00 foo.txt
>> ghigo@venice:/tmp$ fallocate -o 500 -l 1000 foo.txt
>> ghigo@venice:/tmp$ ls -l foo.txt
>> -rw-r--r-- 1 ghigo ghigo 1500 Aug 3 18:00 foo.txt
>> ghigo@venice:/tmp$
>
> According to the explanation by Austin, foo.txt at this point somehow occupies 2000 bytes of space, because I can reflink it and then write another 1000 bytes of data into it without losing the 1000 bytes I already have or running out of drive space. (Or is it only true while there are open file handles?)
>
OK, I think there may be some misunderstanding here. By 'CoW unwritten
extents', I mean that when we write to the extent, a CoW operation
happens, instead of the data being written directly into the extent. In
this case, it has nothing to do with reflinking, and Goffredo is correct
that if your filesystem is small enough, the second fallocate will fail
there.
* Re: Massive loss of disk space
2017-08-03 16:01 ` Goffredo Baroncelli
2017-08-03 17:15 ` Marat Khalili
@ 2017-08-03 22:51 ` pwm
1 sibling, 0 replies; 26+ messages in thread
From: pwm @ 2017-08-03 22:51 UTC (permalink / raw)
To: Goffredo Baroncelli
Cc: Marat Khalili, Austin S. Hemmelgarn, linux-btrfs, Hugo Mills
In 30 seconds I should be able to write about 200MB/s * 30s = 6GB.
Requiring 6GB of additional free space for the parity file to grow
into is possible to live with on a 10TB disk.
It seems that for SnapRAID to have any chance to work correctly with
parity on a BTRFS partition, it would need a min-free configuration
paramter to make sure there is always enough free space for one parity
file update.
But as it is right now, requiring that the disk isn't filled past 50%
because fallocate() wants enough free space for 100% of the original file
data to be rewritten obviously is not a working solution.
Right now, it sounds like I should change all parity disks to a different
file system to avoid the CoW issue. There doesn't seem to be any way to
turn off CoW for an already existing file, and the parity data is already
way past 50% so I can't make a copy.
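(For reference, the usual NOCOW workaround looks like the sketch below. It needs enough free space for a second copy of the file, which is exactly the problem here; paths are illustrative, not the real parity layout.)

```shell
# Sketch of the usual NOCOW workaround.  chattr +C only takes effect
# on an empty file, so a fresh file is created with the attribute and
# the data copied in.  Paths are illustrative; the chattr call is
# allowed to fail so the sketch also runs on non-btrfs filesystems.
set -e
cd "$(mktemp -d)"
printf 'parity-data' > snapraid.parity       # stand-in for the real file

touch snapraid.parity.nocow
chattr +C snapraid.parity.nocow 2>/dev/null || true
cat snapraid.parity > snapraid.parity.nocow
cmp snapraid.parity snapraid.parity.nocow    # verify before replacing
mv snapraid.parity.nocow snapraid.parity
```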
/Per W
On Thu, 3 Aug 2017, Goffredo Baroncelli wrote:
> On 2017-08-03 13:44, Marat Khalili wrote:
>> On 02/08/17 20:52, Goffredo Baroncelli wrote:
>>> consider the following scenario:
>>>
>>> a) create a 2GB file
>>> b) fallocate -o 1GB -l 2GB
>>> c) write from 1GB to 3GB
>>>
>>> after b), the expectation is that c) always succeed [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space because there could be a small time window where both the old and the new data exists on the disk.
>> Just curious. With current implementation, in the following case:
>> a) create a 2GB file1 && create a 2GB file2
>> b) fallocate -o 1GB -l 2GB file1 && fallocate -o 1GB -l 2GB file2
>
> A this step you are trying to allocate 3GB+3GB = 6GB, so you exhausted the filesystem space.
>
>> c) write from 1GB to 3GB file1 && write from 1GB to 3GB file2
>> will (c) always succeed? I.e. does fallocate really allocate 2GB per file, or does it only allocate additional 1GB and check free space for another 1GB? If it's only the latter, it is useless.
> The file is physically extended
>
> ghigo@venice:/tmp$ fallocate -l 1000 foo.txt
> ghigo@venice:/tmp$ ls -l foo.txt
> -rw-r--r-- 1 ghigo ghigo 1000 Aug 3 18:00 foo.txt
> ghigo@venice:/tmp$ fallocate -o 500 -l 1000 foo.txt
> ghigo@venice:/tmp$ ls -l foo.txt
> -rw-r--r-- 1 ghigo ghigo 1500 Aug 3 18:00 foo.txt
> ghigo@venice:/tmp$
>
>>
>> --
>>
>> With Best Regards,
>> Marat Khalili
>>
>
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
>
* Re: Massive loss of disk space
2017-08-03 17:23 ` Austin S. Hemmelgarn
@ 2017-08-04 14:45 ` Goffredo Baroncelli
2017-08-04 15:05 ` Austin S. Hemmelgarn
0 siblings, 1 reply; 26+ messages in thread
From: Goffredo Baroncelli @ 2017-08-04 14:45 UTC (permalink / raw)
To: Austin S. Hemmelgarn, pwm, Hugo Mills; +Cc: linux-btrfs
On 2017-08-03 19:23, Austin S. Hemmelgarn wrote:
> On 2017-08-03 12:37, Goffredo Baroncelli wrote:
>> On 2017-08-03 13:39, Austin S. Hemmelgarn wrote:
[...]
>>> Also, as I said below, _THIS WORKS ON ZFS_. That immediately means that a CoW filesystem _does not_ need to behave like BTRFS is.
>>
>> It seems that ZFS on linux doesn't support fallocate
>>
>> see https://github.com/zfsonlinux/zfs/issues/326
>>
>> So I think that you are referring to a posix_fallocate and ZFS on solaris, which I can't test so I can't comment.
> Both Solaris, and FreeBSD (I've got a FreeNAS system at work i checked on).
For fun I checked the FreeBSD source and the ZFS source. To me it seems that ZFS on FreeBSD doesn't implement posix_fallocate() (VOP_ALLOCATE in FreeBSD jargon), but instead relies on the FreeBSD default one.
http://fxr.watson.org/fxr/source/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c#L7212
Following the chain of function pointers
http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=10#L110
it seems that the freebsd vop_allocate() is implemented in vop_stdallocate()
http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=excerpts#L912
which simply calls read() and write() on the range [offset...offset+len), which for a "conventional" filesystem ensures block allocation. Of course it is an expensive solution.
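A rough shell analogue of that emulation (read each block back and rewrite it in place, which forces block allocation on a conventional, non-CoW filesystem; sizes are illustrative):

```shell
# Rough shell analogue of vop_stdallocate(): "allocate" a range by
# reading each 4KB block and rewriting it in place over a 64KB sparse
# file.  On a conventional filesystem the rewrite forces block
# allocation; on a CoW filesystem it guarantees nothing.
set -e
cd "$(mktemp -d)"
truncate -s 64K f                # sparse: no data blocks allocated yet
for i in $(seq 0 15); do
    dd if=f of=f bs=4096 skip=$i seek=$i count=1 conv=notrunc status=none
done
stat -c %s f                     # size unchanged: 65536
```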
So I think (but I am not familiar with FreeBSD) that ZFS doesn't implement a real posix_fallocate() but tries to simulate it. Of course this doesn't guarantee anything on a CoW filesystem.
>
> That said, I'm starting to wonder if just failing fallocate() calls to allocate space is actually the right thing to do here after all. Aside from this, we don't reserve metadata space for checksums and similar things for the eventual writes (so it's possible to get -ENOSPC on a write to an fallocate'ed region anyway because of metadata exhaustion), and splitting extents can also cause it to fail, so it's perfectly possible for the fallocate assumption to not hold on BTRFS.
posix_fallocate() in BTRFS is not reliable for another reason. This syscall guarantees that a block group (BG) is allocated, but I think that the allocated BG is available to all processes, so a parallel process may exhaust all the available space before the first process uses it.
My opinion is that BTRFS is not reliable when space is exhausted, so it needs to operate with some amount of disk space kept free. The size of this free space should be O(2*size_of_biggest_write), and for an operation like fallocate this means O(2*length).
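That headroom rule could be applied from user space as in this sketch (the factor of two and the df-based check are illustrative assumptions, not a btrfs guarantee):

```shell
# Sketch of the headroom rule above: before growing a file by LEN
# bytes, require roughly 2*LEN of free space.  The factor of two and
# the df-based check are illustrative, not a btrfs guarantee.
set -e
cd "$(mktemp -d)"
len_kb=1024                                    # want to grow by 1MB
free_kb=$(df -k . | awk 'NR==2 {print $4}')
if [ "$free_kb" -ge $((2 * len_kb)) ]; then
    fallocate -l $((len_kb * 1024)) grown.file
else
    echo "insufficient headroom: ${free_kb}KB free" >&2
fi
stat -c %s grown.file                          # 1048576 if allocated
```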
I think it is no coincidence that the fallocate implemented by ZFS on Linux only works in FALLOC_FL_PUNCH_HOLE mode.
https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zpl_file.c#L662
[...]
/*
* The only flag combination which matches the behavior of zfs_space()
* is FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE. The FALLOC_FL_PUNCH_HOLE
* flag was introduced in the 2.6.38 kernel.
*/
#if defined(HAVE_FILE_FALLOCATE) || defined(HAVE_INODE_FALLOCATE)
long
zpl_fallocate_common(struct inode *ip, int mode, loff_t offset, loff_t len)
{
int error = -EOPNOTSUPP;
#if defined(FALLOC_FL_PUNCH_HOLE) && defined(FALLOC_FL_KEEP_SIZE)
cred_t *cr = CRED();
flock64_t bf;
loff_t olen;
fstrans_cookie_t cookie;
if (mode != (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
return (error);
[...]
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: Massive loss of disk space
2017-08-04 14:45 ` Goffredo Baroncelli
@ 2017-08-04 15:05 ` Austin S. Hemmelgarn
0 siblings, 0 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-04 15:05 UTC (permalink / raw)
To: kreijack, pwm, Hugo Mills; +Cc: linux-btrfs
On 2017-08-04 10:45, Goffredo Baroncelli wrote:
> On 2017-08-03 19:23, Austin S. Hemmelgarn wrote:
>> On 2017-08-03 12:37, Goffredo Baroncelli wrote:
>>> On 2017-08-03 13:39, Austin S. Hemmelgarn wrote:
> [...]
>
>>>> Also, as I said below, _THIS WORKS ON ZFS_. That immediately means that a CoW filesystem _does not_ need to behave like BTRFS does.
>>>
>>> It seems that ZFS on linux doesn't support fallocate
>>>
>>> see https://github.com/zfsonlinux/zfs/issues/326
>>>
>>> So I think that you are referring to a posix_fallocate and ZFS on solaris, which I can't test so I can't comment.
>> Both Solaris and FreeBSD (I've got a FreeNAS system at work I checked on).
>
> For fun I checked the FreeBSD source and the ZFS source. To me it seems that ZFS on FreeBSD doesn't implement posix_fallocate() (VOP_ALLOCATE in FreeBSD jargon), but instead relies on the FreeBSD default one.
>
> http://fxr.watson.org/fxr/source/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c#L7212
>
> Following the chain of function pointers
>
> http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=10#L110
>
> it seems that the freebsd vop_allocate() is implemented in vop_stdallocate()
>
> http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=excerpts#L912
>
> which simply calls read() and write() on the range [offset...offset+len), which for a "conventional" filesystem ensures block allocation. Of course it is an expensive solution.
>
> So I think (but I am not familiar with FreeBSD) that ZFS doesn't implement a real posix_fallocate() but tries to simulate it. Of course this doesn't guarantee anything on a CoW filesystem.
From a practical perspective though, posix_fallocate() doesn't matter,
because almost everything uses the native fallocate call if at all
possible. As you mention, FreeBSD is emulating it, but that 'emulation'
provides behavior that is close enough to what is required that it
doesn't matter. As a matter of perspective, posix_fallocate() is
emulated on Linux too, see my reply below to your later comment about
posix_fallocate() on BTRFS.
Internally ZFS also keeps _some_ space reserved so it doesn't get wedged
like BTRFS does when near full, and they don't do the whole data versus
metadata segregation crap, so from a practical perspective, what
FreeBSD's ZFS implementation does is sufficient because of the internal
structure and handling of writes in ZFS.
>
>
>>
>> That said, I'm starting to wonder if just failing fallocate() calls to allocate space is actually the right thing to do here after all. Aside from this, we don't reserve metadata space for checksums and similar things for the eventual writes (so it's possible to get -ENOSPC on a write to an fallocate'ed region anyway because of metadata exhaustion), and splitting extents can also cause it to fail, so it's perfectly possible for the fallocate assumption to not hold on BTRFS.
>
> posix_fallocate() in BTRFS is not reliable for another reason. This syscall guarantees that a block group (BG) is allocated, but I think that the allocated BG is available to all processes, so a parallel process may exhaust all the available space before the first process uses it.
As mentioned above, posix_fallocate() is emulated in libc on Linux by
calling the regular fallocate() if the FS supports it (which BTRFS
does), or by writing out data like FreeBSD does in the kernel if the FS
doesn't support fallocate(). IOW, posix_fallocate() has the exact same
issues on BTRFS as Linux's fallocate() syscall does.
>
> My opinion is that BTRFS is not reliable when space is exhausted, so it needs to operate with some amount of disk space kept free. The size of this free space should be O(2*size_of_biggest_write), and for an operation like fallocate this means O(2*length).
Again, this arises from how we handle writes. If we were to track
blocks that have had fallocate called on them and only use those (for
the first write at least) for writes to the file that had fallocate
called on them (as well as breaking reflinks on them when fallocate is
called), then we can get away with just using the size of the biggest
write plus a little bit more space for _data_, but even then we need
space for metadata (which we don't appear to track right now).
>
> I think it is no coincidence that the fallocate implemented by ZFS on Linux only works in FALLOC_FL_PUNCH_HOLE mode.
>
> https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zpl_file.c#L662
> [...]
> /*
> * The only flag combination which matches the behavior of zfs_space()
> * is FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE. The FALLOC_FL_PUNCH_HOLE
> * flag was introduced in the 2.6.38 kernel.
> */
> #if defined(HAVE_FILE_FALLOCATE) || defined(HAVE_INODE_FALLOCATE)
> long
> zpl_fallocate_common(struct inode *ip, int mode, loff_t offset, loff_t len)
> {
> int error = -EOPNOTSUPP;
>
> #if defined(FALLOC_FL_PUNCH_HOLE) && defined(FALLOC_FL_KEEP_SIZE)
> cred_t *cr = CRED();
> flock64_t bf;
> loff_t olen;
> fstrans_cookie_t cookie;
>
> if (mode != (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> return (error);
>
> [...]
>
end of thread, other threads:[~2017-08-04 15:05 UTC | newest]
Thread overview: 26+ messages
2017-08-01 11:43 Massive loss of disk space pwm
2017-08-01 12:20 ` Hugo Mills
2017-08-01 14:39 ` pwm
2017-08-01 14:47 ` Austin S. Hemmelgarn
2017-08-01 15:00 ` Austin S. Hemmelgarn
2017-08-01 15:24 ` pwm
2017-08-01 15:45 ` Austin S. Hemmelgarn
2017-08-01 16:50 ` pwm
2017-08-01 17:04 ` Austin S. Hemmelgarn
2017-08-02 17:52 ` Goffredo Baroncelli
2017-08-02 19:10 ` Austin S. Hemmelgarn
2017-08-02 21:05 ` Goffredo Baroncelli
2017-08-03 11:39 ` Austin S. Hemmelgarn
2017-08-03 16:37 ` Goffredo Baroncelli
2017-08-03 17:23 ` Austin S. Hemmelgarn
2017-08-04 14:45 ` Goffredo Baroncelli
2017-08-04 15:05 ` Austin S. Hemmelgarn
2017-08-03 3:48 ` Duncan
2017-08-03 11:44 ` Marat Khalili
2017-08-03 11:52 ` Austin S. Hemmelgarn
2017-08-03 16:01 ` Goffredo Baroncelli
2017-08-03 17:15 ` Marat Khalili
2017-08-03 17:25 ` Austin S. Hemmelgarn
2017-08-03 22:51 ` pwm
2017-08-02 4:14 ` Duncan
2017-08-02 11:18 ` Austin S. Hemmelgarn