* Massive loss of disk space
@ 2017-08-01 11:43 pwm
  2017-08-01 12:20 ` Hugo Mills
  0 siblings, 1 reply; 26+ messages in thread
From: pwm @ 2017-08-01 11:43 UTC (permalink / raw)
  To: linux-btrfs

I have a 10TB file system with a parity file for a snapraid array. However, 
I suddenly cannot extend the parity file, despite the file system being 
only about 50% filled - I should have 5TB of unallocated space. When trying 
to extend the parity file, fallocate() just returns ENOSPC, i.e. that the 
disk is full.

The machine originally ran Debian 8 (Jessie), but after I detected the 
issue and no btrfs tool showed any errors, I updated to Debian 9 (Stretch) 
to get a newer kernel and newer btrfs tools.

pwm@europium:/mnt$ btrfs --version
btrfs-progs v4.7.3
pwm@europium:/mnt$ uname -a
Linux europium 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26) 
x86_64 GNU/Linux




pwm@europium:/mnt/snap_04$ ls -l
total 4932703608
-rw------- 1 root root     319148889 Jul  8 04:21 snapraid.content
-rw------- 1 root root     283115520 Aug  1 04:08 snapraid.content.tmp
-rw------- 1 root root 5050486226944 Jul 31 17:14 snapraid.parity



pwm@europium:/mnt/snap_04$ df .
Filesystem      1K-blocks       Used  Available Use% Mounted on
/dev/sdg1      9766434816 4944614648 4819831432  51% /mnt/snap_04



pwm@europium:/mnt/snap_04$ sudo btrfs fi show .
Label: 'snap_04'  uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
         Total devices 1 FS bytes used 4.60TiB
         devid    1 size 9.09TiB used 9.09TiB path /dev/sdg1

Compare this with the second snapraid parity disk:
pwm@europium:/mnt/snap_04$ sudo btrfs fi show /mnt/snap_05/
Label: 'snap_05'  uuid: bac477e3-e78c-43ee-8402-6bdfff194567
         Total devices 1 FS bytes used 4.69TiB
         devid    1 size 9.09TiB used 4.70TiB path /dev/sdi1

So on one parity disk the device usage is 9.09TiB - on the other, only 
4.70TiB - despite almost the same amount of file system usage and an 
almost identical usage pattern. It's an archival RAID, so there are hardly 
any writes to the parity files, because there are almost no changes to the 
data files. The main activity is that the parity file gets extended when 
one of the data disks reaches a new high-water mark.

The only file that gets regularly rewritten is the snapraid.content file 
that gets regenerated after every scrub.



pwm@europium:/mnt/snap_04$ sudo btrfs fi df .
Data, single: total=9.08TiB, used=4.59TiB
System, DUP: total=8.00MiB, used=992.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=6.00GiB, used=4.81GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B



pwm@europium:/mnt/snap_04$ sudo btrfs filesystem du .
      Total   Exclusive  Set shared  Filename
    4.59TiB     4.59TiB           -  ./snapraid.parity
  304.37MiB   304.37MiB           -  ./snapraid.content
  270.00MiB   270.00MiB           -  ./snapraid.content.tmp
    4.59TiB     4.59TiB       0.00B  .



pwm@europium:/mnt/snap_04$ sudo btrfs filesystem usage .
Overall:
     Device size:                   9.09TiB
     Device allocated:              9.09TiB
     Device unallocated:              0.00B
     Device missing:                  0.00B
     Used:                          4.60TiB
     Free (estimated):              4.49TiB      (min: 4.49TiB)
     Data ratio:                       1.00
     Metadata ratio:                   2.00
     Global reserve:              512.00MiB      (used: 0.00B)

Data,single: Size:9.08TiB, Used:4.59TiB
    /dev/sdg1       9.08TiB

Metadata,single: Size:8.00MiB, Used:0.00B
    /dev/sdg1       8.00MiB

Metadata,DUP: Size:6.00GiB, Used:4.81GiB
    /dev/sdg1      12.00GiB

System,single: Size:4.00MiB, Used:0.00B
    /dev/sdg1       4.00MiB

System,DUP: Size:8.00MiB, Used:992.00KiB
    /dev/sdg1      16.00MiB

Unallocated:
    /dev/sdg1         0.00B



pwm@europium:~$ sudo btrfs check /dev/sdg1
Checking filesystem on /dev/sdg1
UUID: c46df8fa-03db-4b32-8beb-5521d9931a31
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
found 5057294639104 bytes used err is 0
total csum bytes: 4529856120
total tree bytes: 5170151424
total fs tree bytes: 178700288
total extent tree bytes: 209616896
btree space waste bytes: 182357204
file data blocks allocated: 5073330888704
  referenced 5052040339456



pwm@europium:~$ sudo btrfs scrub status /mnt/snap_04/
scrub status for c46df8fa-03db-4b32-8beb-5521d9931a31
         scrub started at Mon Jul 31 21:26:50 2017 and finished after 
06:53:47
         total bytes scrubbed: 4.60TiB with 0 errors



So where has my 5TB of disk space gone?
And what should I do to get it back?

I could obviously reformat the partition and rebuild the parity, since I 
still have one good parity copy, but that doesn't feel like a good route. 
It isn't impossible that this will happen again.

/Per W


* Re: Massive loss of disk space
  2017-08-01 11:43 Massive loss of disk space pwm
@ 2017-08-01 12:20 ` Hugo Mills
  2017-08-01 14:39   ` pwm
  0 siblings, 1 reply; 26+ messages in thread
From: Hugo Mills @ 2017-08-01 12:20 UTC (permalink / raw)
  To: pwm; +Cc: linux-btrfs


   Hi, Per,

   Start here:

https://btrfs.wiki.kernel.org/index.php/FAQ#if_your_device_is_large_.28.3E16GiB.29

   In your case, I'd suggest using "-dusage=20" to start with, as
it'll probably free up quite a lot of your existing allocation.

And this may also be of interest for understanding how to read the output
of the tools:

https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools

   Finally, I note that you've still got some "single" chunks present
for metadata. It won't affect your space allocation issues, but I
would recommend getting rid of them anyway:

# btrfs balance start -mconvert=dup,soft /mnt/snap_04

   Hugo.


-- 
Hugo Mills             | Well, sir, the floor is yours. But remember, the
hugo@... carfax.org.uk | roof is ours!
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                                             The Goons


* Re: Massive loss of disk space
  2017-08-01 12:20 ` Hugo Mills
@ 2017-08-01 14:39   ` pwm
  2017-08-01 14:47     ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 26+ messages in thread
From: pwm @ 2017-08-01 14:39 UTC (permalink / raw)
  To: Hugo Mills; +Cc: linux-btrfs

Thanks for the links and suggestions.

I tried your suggestions, but they didn't solve the underlying problem.



pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04
Dumping filters: flags 0x1, state 0x0, force is off
   DATA (flags 0x2): balancing, usage=20
Done, had to relocate 4596 out of 9317 chunks


pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft /mnt/snap_04/
Done, had to relocate 2 out of 4721 chunks


pwm@europium:~$ sudo btrfs fi df /mnt/snap_04
Data, single: total=4.60TiB, used=4.59TiB
System, DUP: total=40.00MiB, used=512.00KiB
Metadata, DUP: total=6.50GiB, used=4.81GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


pwm@europium:~$ sudo btrfs fi show /mnt/snap_04
Label: 'snap_04'  uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
         Total devices 1 FS bytes used 4.60TiB
         devid    1 size 9.09TiB used 4.61TiB path /dev/sdg1


So now device 1 usage is down from 9.09TiB to 4.61TiB.

But if I try to fallocate() to grow the large parity file, it fails 
immediately. I wrote a little helper program that just exercises 
fallocate(), instead of having to run snapraid with lots of unknown 
additional actions being performed.


Original file size is  5050486226944 bytes
Trying to grow file to 5151751667712 bytes
Failed fallocate [No space left on device]



And afterwards the result shows 'used' has jumped up to 9.09TiB again.

root@europium:/mnt# btrfs fi show snap_04
Label: 'snap_04'  uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
         Total devices 1 FS bytes used 4.60TiB
         devid    1 size 9.09TiB used 9.09TiB path /dev/sdg1

root@europium:/mnt# btrfs fi df /mnt/snap_04/
Data, single: total=9.08TiB, used=4.59TiB
System, DUP: total=40.00MiB, used=992.00KiB
Metadata, DUP: total=6.50GiB, used=4.81GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


It's almost like the file system has decided that it needs to make a 
snapshot and store two complete copies of the file, which is obviously not 
going to work with a file larger than 50% of the file system.

There is no issue at all growing the parity file on the other parity disk. 
And that's why I wonder if there is some undetected file system corruption.

/Per W


* Re: Massive loss of disk space
  2017-08-01 14:39   ` pwm
@ 2017-08-01 14:47     ` Austin S. Hemmelgarn
  2017-08-01 15:00       ` Austin S. Hemmelgarn
  2017-08-02  4:14       ` Duncan
  0 siblings, 2 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-01 14:47 UTC (permalink / raw)
  To: pwm, Hugo Mills; +Cc: linux-btrfs

On 2017-08-01 10:39, pwm wrote:
> Thanks for the links and suggestions.
> 
> I tried your suggestions, but they didn't solve the underlying problem.
> 
> 
> 
> pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04
> Dumping filters: flags 0x1, state 0x0, force is off
>    DATA (flags 0x2): balancing, usage=20
> Done, had to relocate 4596 out of 9317 chunks
> 
> 
> pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft /mnt/snap_04/
> Done, had to relocate 2 out of 4721 chunks
> 
> 
> pwm@europium:~$ sudo btrfs fi df /mnt/snap_04
> Data, single: total=4.60TiB, used=4.59TiB
> System, DUP: total=40.00MiB, used=512.00KiB
> Metadata, DUP: total=6.50GiB, used=4.81GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> 
> pwm@europium:~$ sudo btrfs fi show /mnt/snap_04
> Label: 'snap_04'  uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>          Total devices 1 FS bytes used 4.60TiB
>          devid    1 size 9.09TiB used 4.61TiB path /dev/sdg1
> 
> 
> So now device 1 usage is down from 9.09TiB to 4.61TiB.
> 
> But if I try to fallocate() to grow the large parity file, it fails 
> immediately. I wrote a little helper program that just exercises 
> fallocate(), instead of having to run snapraid with lots of unknown 
> additional actions being performed.
> 
> 
> Original file size is  5050486226944 bytes
> Trying to grow file to 5151751667712 bytes
> Failed fallocate [No space left on device]
> 
> 
> 
> And afterwards the result shows 'used' has jumped up to 9.09TiB again.
> 
> root@europium:/mnt# btrfs fi show snap_04
> Label: 'snap_04'  uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>          Total devices 1 FS bytes used 4.60TiB
>          devid    1 size 9.09TiB used 9.09TiB path /dev/sdg1
> 
> root@europium:/mnt# btrfs fi df /mnt/snap_04/
> Data, single: total=9.08TiB, used=4.59TiB
> System, DUP: total=40.00MiB, used=992.00KiB
> Metadata, DUP: total=6.50GiB, used=4.81GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> 
> It's almost like the file system has decided that it needs to make a 
> snapshot and store two complete copies of the file, which is obviously 
> not going to work with a file larger than 50% of the file system.
I think I _might_ understand what's going on here.  Is that test program 
calling fallocate using the desired total size of the file, or just 
trying to allocate the range beyond the end to extend the file?  I've 
seen issues with the first case on BTRFS before, and I'm starting to 
think that it might actually be trying to allocate the exact amount of 
space requested by fallocate, even if part of the range is already 
allocated space.
> 
> There is no issue at all growing the parity file on the other parity 
> disk. And that's why I wonder if there is some undetected file system 
> corruption.
> 
> /Per W

* Re: Massive loss of disk space
  2017-08-01 14:47     ` Austin S. Hemmelgarn
@ 2017-08-01 15:00       ` Austin S. Hemmelgarn
  2017-08-01 15:24         ` pwm
  2017-08-02 17:52         ` Goffredo Baroncelli
  2017-08-02  4:14       ` Duncan
  1 sibling, 2 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-01 15:00 UTC (permalink / raw)
  To: pwm, Hugo Mills; +Cc: linux-btrfs

On 2017-08-01 10:47, Austin S. Hemmelgarn wrote:
> I think I _might_ understand what's going on here.  Is that test program 
> calling fallocate using the desired total size of the file, or just 
> trying to allocate the range beyond the end to extend the file?  I've 
> seen issues with the first case on BTRFS before, and I'm starting to 
> think that it might actually be trying to allocate the exact amount of 
> space requested by fallocate, even if part of the range is already 
> allocated space.

OK, I just did a dead simple test by hand, and it looks like I was 
right.  The method I used to check this is as follows:
1. Create and mount a reasonably small filesystem (I used an 8G 
temporary LV for this, a file would work too though).
2. Using dd or a similar tool, create a test file that takes up half of 
the size of the filesystem.  It is important that this _not_ be 
fallocated, but just written out.
3. Use `fallocate -l` to try to extend the size of the file beyond half 
the size of the filesystem.

For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it will 
succeed with no error.  Based on this and some low-level inspection, it 
looks like BTRFS treats the full range of the fallocate call as 
unallocated, and thus is trying to allocate space for regions of that 
range that are already allocated.
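
For reference, steps 2 and 3 can also be reproduced from C. Here is a 
minimal sketch (the mount point /mnt/test and the exact sizes are made up; 
it assumes a freshly made ~8G btrfs filesystem is mounted there):

#define _GNU_SOURCE            /* for fallocate() */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>

int main(void) {
    /* hypothetical test file on a small (~8G) btrfs filesystem */
    int fd = open("/mnt/test/testfile", O_CREAT|O_TRUNC|O_WRONLY, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* step 2: write out (do not fallocate) half the filesystem, 4GiB */
    static char buf[1 << 20];          /* 1MiB of zeros */
    for (int i = 0; i < 4096; i++) {
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
            perror("write");
            close(fd);
            return 1;
        }
    }
    fsync(fd);

    /* step 3: try to extend beyond half the filesystem, to 5GiB */
    if (fallocate(fd, 0, 0, 5ll << 30))
        printf("fallocate failed [%s]\n", strerror(errno));
    else
        printf("fallocate succeeded\n");

    close(fd);
    return 0;
}

On btrfs the fallocate() call here should fail with ENOSPC, matching the 
`fallocate -l` result above, while on ext4 or XFS it should succeed.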


* Re: Massive loss of disk space
  2017-08-01 15:00       ` Austin S. Hemmelgarn
@ 2017-08-01 15:24         ` pwm
  2017-08-01 15:45           ` Austin S. Hemmelgarn
  2017-08-02 17:52         ` Goffredo Baroncelli
  1 sibling, 1 reply; 26+ messages in thread
From: pwm @ 2017-08-01 15:24 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Hugo Mills, linux-btrfs

Yes, the test code is as below - trying to match what snapraid tries 
to do:

#define _GNU_SOURCE            /* for fallocate() */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>

int main() {
    int fd = open("/mnt/snap_04/snapraid.parity",O_NOFOLLOW|O_RDWR);
    if (fd < 0) {
        printf("Failed opening parity file [%s]\n",strerror(errno));
        return 1;
    }

    off_t filesize = 5151751667712ull;   /* desired new total size */
    int res;

    struct stat statbuf;
    if (fstat(fd,&statbuf)) {
        printf("Failed stat [%s]\n",strerror(errno));
        close(fd);
        return 1;
    }

    printf("Original file size is  %llu bytes\n",
           (unsigned long long)statbuf.st_size);
    printf("Trying to grow file to %llu bytes\n",
           (unsigned long long)filesize);

    /* allocate the whole range from offset 0, as snapraid does */
    res = fallocate(fd,0,0,filesize);
    if (res) {
        printf("Failed fallocate [%s]\n",strerror(errno));
        close(fd);
        return 1;
    }

    if (fsync(fd)) {
        printf("Failed fsync [%s]\n",strerror(errno));
        close(fd);
        return 1;
    }

    close(fd);
    return 0;
}

So the call doesn't make use of the previous file size as the offset for 
the extension.

int fallocate(int fd, int mode, off_t offset, off_t len);

What you are implying here is that if the fallocate() call is modified to:

   res = fallocate(fd,0,old_size,new_size-old_size);

then everything should work as expected?

/Per W


* Re: Massive loss of disk space
  2017-08-01 15:24         ` pwm
@ 2017-08-01 15:45           ` Austin S. Hemmelgarn
  2017-08-01 16:50             ` pwm
  0 siblings, 1 reply; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-01 15:45 UTC (permalink / raw)
  To: pwm; +Cc: Hugo Mills, linux-btrfs

On 2017-08-01 11:24, pwm wrote:
> So the call doesn't make use of the previous file size as the offset 
> for the extension.
> 
> int fallocate(int fd, int mode, off_t offset, off_t len);
> 
> What you are implying here is that if the fallocate() call is modified to:
> 
>    res = fallocate(fd,0,old_size,new_size-old_size);
> 
> then everything should work as expected?
Based on what I've seen testing on my end, yes, that should cause things 
to work correctly.  That said, given what snapraid does, the fact that 
they call fallocate covering the full desired size of the file is 
correct usage (the point is to make behavior deterministic, and calling 
it on the whole file makes sure that the file isn't sparse, which can 
impact performance).

Given both the fact that calling fallocate() to extend the file without 
worrying about an offset is a legitimate use case, and the fact that both 
ext4 and XFS (and I suspect almost every other Linux filesystem) work in 
this situation, I'd argue that the behavior of BTRFS in this situation 
is incorrect.
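
In the meantime, the extension-only workaround would look something like 
this sketch (untested here; it just reuses the path and target size from 
Per's test program):

#define _GNU_SOURCE            /* for fallocate() */
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>

/* Grow fd to new_size by allocating only the range past the current end
 * of the file, sidestepping the btrfs whole-range behaviour above. */
static int extend_file(int fd, off_t new_size) {
    struct stat st;
    if (fstat(fd, &st))
        return -1;
    if (st.st_size >= new_size)        /* nothing to do */
        return 0;
    return fallocate(fd, 0, st.st_size, new_size - st.st_size);
}

int main(void) {
    int fd = open("/mnt/snap_04/snapraid.parity", O_NOFOLLOW|O_RDWR);
    if (fd < 0) {
        printf("Failed opening parity file [%s]\n", strerror(errno));
        return 1;
    }
    if (extend_file(fd, 5151751667712ull)) {
        printf("Failed fallocate [%s]\n", strerror(errno));
        close(fd);
        return 1;
    }
    if (fsync(fd))
        printf("Failed fsync [%s]\n", strerror(errno));
    close(fd);
    return 0;
}

Note the workaround should only matter for a file that already has extents 
written; a whole-file fallocate() on a fresh, empty parity file has no 
pre-existing extents to double-allocate.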

* Re: Massive loss of disk space
  2017-08-01 15:45           ` Austin S. Hemmelgarn
@ 2017-08-01 16:50             ` pwm
  2017-08-01 17:04               ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 26+ messages in thread
From: pwm @ 2017-08-01 16:50 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Hugo Mills, linux-btrfs

I made a temporary patch to the snapraid code to start fallocate() from 
the previous parity file size.

I finally have a snapraid sync up and running. It looks good, but it will 
take quite a while before I can run a scrub command to double-check 
everything.

Thanks for the help.

/Per W


* Re: Massive loss of disk space
  2017-08-01 16:50             ` pwm
@ 2017-08-01 17:04               ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-01 17:04 UTC (permalink / raw)
  To: pwm; +Cc: Hugo Mills, linux-btrfs

On 2017-08-01 12:50, pwm wrote:
> I made a temporary patch to the snapraid code to start fallocate() from 
> the previous parity file size.
Like I said though, it's BTRFS that's misbehaving here, not snapraid. 
I'm going to try to get some further discussion about this here on the 
mailing list, and hopefully it will get fixed in BTRFS (I would try to do 
so myself, but I'm at best a novice at C, and not well versed in kernel 
code).
> 
> I finally have a snapraid sync up and running. It looks good, but it 
> will take quite a while before I can run a scrub command to double-check 
> everything.
> 
> Thanks for the help.
Glad I could be helpful!
> 
> /Per W
> 
> On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote:
> 
>> On 2017-08-01 11:24, pwm wrote:
>>> Yes, the test code is as below - trying to match what snapraid tries 
>>> to do:
>>>
>>> #include <sys/types.h>
>>> #include <sys/stat.h>
>>> #include <fcntl.h>
>>> #include <stdio.h>
>>> #include <string.h>
>>> #include <unistd.h>
>>> #include <errno.h>
>>>
>>> int main() {
>>>      int fd = open("/mnt/snap_04/snapraid.parity",O_NOFOLLOW|O_RDWR);
>>>      if (fd < 0) {
>>>          printf("Failed opening parity file [%s]\n",strerror(errno));
>>>          return 1;
>>>      }
>>>
>>>      off_t filesize = 5151751667712ull;
>>>      int res;
>>>
>>>      struct stat statbuf;
>>>      if (fstat(fd,&statbuf)) {
>>>          printf("Failed stat [%s]\n",strerror(errno));
>>>          close(fd);
>>>          return 1;
>>>      }
>>>
>>>      printf("Original file size is  %llu bytes\n",i
>>>             (unsigned long long)statbuf.st_size);
>>>      printf("Trying to grow file to %llu bytes\n",i
>>>             (unsigned long long)filesize);
>>>
>>>      res = fallocate(fd,0,0,filesize);
>>>      if (res) {
>>>          printf("Failed fallocate [%s]\n",strerror(errno));
>>>          close(fd);
>>>          return 1;
>>>      }
>>>
>>>      if (fsync(fd)) {
>>>          printf("Failed fsync [%s]\n",fsync(errno));
>>>          close(fd);
>>>          return 1;
>>>      }
>>>
>>>      close(fd);
>>>      return 0;
>>> }
>>>
>>> So the call doesn't make use of the previous file size as offset for 
>>> the extension.
>>>
>>> int fallocate(int fd, int mode, off_t offset, off_t len);
>>>
>>> What you are implying here is that if the fallocate() call is 
>>> modified to:
>>>
>>>    res = fallocate(fd,0,old_size,new_size-old_size);
>>>
>>> then everything should work as expected?
>> Based on what I've seen testing on my end, yes, that should cause 
>> things to work correctly.  That said, given what snapraid does, the 
>> fact that they call fallocate covering the full desired size of the 
>> file is correct usage (the point is to make behavior deterministic, 
>> and calling it on the whole file makes sure that the file isn't 
>> sparse, which can impact performance).
>>
>> Given both the fact that calling fallocate() to extend the file 
>> without worrying about an offset is a legitimate use case, and that 
>> both ext4 and XFS (and I suspect almost every other Linux filesystem) 
>> works in this situation, I'd argue that the behavior of BTRFS in this 
>> situation is incorrect.
>>>
>>> /Per W
>>>
>>> On Tue, 1 Aug 2017, Austin S. Hemmelgarn wrote:
>>>
>>>> On 2017-08-01 10:47, Austin S. Hemmelgarn wrote:
>>>>> On 2017-08-01 10:39, pwm wrote:
>>>>>> Thanks for the links and suggestions.
>>>>>>
>>>>>> I did try your suggestions but it didn't solve the underlying 
>>>>>> problem.
>>>>>>
>>>>>>
>>>>>>
>>>>>> pwm@europium:~$ sudo btrfs balance start -v -dusage=20 /mnt/snap_04
>>>>>> Dumping filters: flags 0x1, state 0x0, force is off
>>>>>>    DATA (flags 0x2): balancing, usage=20
>>>>>> Done, had to relocate 4596 out of 9317 chunks
>>>>>>
>>>>>>
>>>>>> pwm@europium:~$ sudo btrfs balance start -mconvert=dup,soft 
>>>>>> /mnt/snap_04/
>>>>>> Done, had to relocate 2 out of 4721 chunks
>>>>>>
>>>>>>
>>>>>> pwm@europium:~$ sudo btrfs fi df /mnt/snap_04
>>>>>> Data, single: total=4.60TiB, used=4.59TiB
>>>>>> System, DUP: total=40.00MiB, used=512.00KiB
>>>>>> Metadata, DUP: total=6.50GiB, used=4.81GiB
>>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>>>
>>>>>>
>>>>>> pwm@europium:~$ sudo btrfs fi show /mnt/snap_04
>>>>>> Label: 'snap_04'  uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>>>>>>          Total devices 1 FS bytes used 4.60TiB
>>>>>>          devid    1 size 9.09TiB used 4.61TiB path /dev/sdg1
>>>>>>
>>>>>>
>>>>>> So now device 1 usage is down from 9.09TiB to 4.61TiB.
>>>>>>
>>>>>> But if I test to fallocate() to grow the large parity file, I 
>>>>>> directly fail. I wrote a little help program that just focuses on 
>>>>>> fallocate() instead of having to run snapraid with lots of unknown 
>>>>>> additional actions being performed.
>>>>>>
>>>>>>
>>>>>> Original file size is  5050486226944 bytes
>>>>>> Trying to grow file to 5151751667712 bytes
>>>>>> Failed fallocate [No space left on device]
>>>>>>
>>>>>>
>>>>>>
>>>>>> And result after shows 'used' have jumped up to 9.09TiB again.
>>>>>>
>>>>>> root@europium:/mnt# btrfs fi show snap_04
>>>>>> Label: 'snap_04'  uuid: c46df8fa-03db-4b32-8beb-5521d9931a31
>>>>>>          Total devices 1 FS bytes used 4.60TiB
>>>>>>          devid    1 size 9.09TiB used 9.09TiB path /dev/sdg1
>>>>>>
>>>>>> root@europium:/mnt# btrfs fi df /mnt/snap_04/
>>>>>> Data, single: total=9.08TiB, used=4.59TiB
>>>>>> System, DUP: total=40.00MiB, used=992.00KiB
>>>>>> Metadata, DUP: total=6.50GiB, used=4.81GiB
>>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>>>
>>>>>>
>>>>>> It's almost like the file system has decided that it needs to 
>>>>>> make a snapshot and store two complete copies of the file, 
>>>>>> which is obviously not going to work with a file larger than 
>>>>>> 50% of the file system.
>>>>> I think I _might_ understand what's going on here.  Is that test 
>>>>> program calling fallocate using the desired total size of the file, 
>>>>> or just trying to allocate the range beyond the end to extend the 
>>>>> file?  I've seen issues with the first case on BTRFS before, and 
>>>>> I'm starting to think that it might actually be trying to allocate 
>>>>> the exact amount of space requested by fallocate, even if part of 
>>>>> the range is already allocated space.
>>>>
>>>> OK, I just did a dead simple test by hand, and it looks like I was 
>>>> right. The method I used to check this is as follows:
>>>> 1. Create and mount a reasonably small filesystem (I used an 8G 
>>>> temporary LV for this, a file would work too though).
>>>> 2. Using dd or a similar tool, create a test file that takes up half 
>>>> of the size of the filesystem.  It is important that this _not_ be 
>>>> fallocated, but just written out.
>>>> 3. Use `fallocate -l` to try and extend the size of the file beyond 
>>>> half the size of the filesystem.
>>>>
>>>> For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it 
>>>> will succeed with no error.  Based on this and some low-level 
>>>> inspection, it looks like BTRFS treats the full range of the 
>>>> fallocate call as unallocated, and thus is trying to allocate space 
>>>> for regions of that range that are already allocated.
>>>>
>>>>>>
>>>>>> No issue at all to grow the parity file on the other parity disk. 
>>>>>> And that's why I wonder if there is some undetected file system 
>>>>>> corruption.
>>>>>>
>>>>
>>
>>
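
For reference, the by-hand test quoted above can be reproduced with a 
short C program; the mount point and sizes here are illustrative and 
assume a small (e.g. 8G) test filesystem:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    /* Assumes an 8G test filesystem mounted at /mnt/test; both the
     * path and the sizes are illustrative. */
    const off_t gib = 1024LL * 1024 * 1024;
    static char buf[1 << 20];
    off_t done;
    int fd;

    memset(buf, 0x5a, sizeof(buf));
    fd = open("/mnt/test/file.bin", O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Step 2: write out (not fallocate) half the filesystem size. */
    for (done = 0; done < 4 * gib; done += sizeof(buf)) {
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
            perror("write");
            return 1;
        }
    }
    /* Step 3: extend beyond half the filesystem with fallocate.
     * This fails with ENOSPC on BTRFS but succeeds on ext4/XFS. */
    if (fallocate(fd, 0, 0, 6 * gib))
        perror("fallocate");
    close(fd);
    return 0;
}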


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Massive loss of disk space
  2017-08-01 14:47     ` Austin S. Hemmelgarn
  2017-08-01 15:00       ` Austin S. Hemmelgarn
@ 2017-08-02  4:14       ` Duncan
  2017-08-02 11:18         ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 26+ messages in thread
From: Duncan @ 2017-08-02  4:14 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Tue, 01 Aug 2017 10:47:30 -0400 as
excerpted:

> I think I _might_ understand what's going on here.  Is that test program
> calling fallocate using the desired total size of the file, or just
> trying to allocate the range beyond the end to extend the file?  I've
> seen issues with the first case on BTRFS before, and I'm starting to
> think that it might actually be trying to allocate the exact amount of
> space requested by fallocate, even if part of the range is already
> allocated space.

If I've interpreted correctly (not being a dev, only a btrfs user, 
sysadmin, and list regular) previous discussions I've seen on this list...

That's exactly what it's doing, and it's _intended_ behavior.

The reasoning is something like this:  fallocate is supposed to pre-
allocate some space with the intent being that writes into that space 
won't fail, because the space is already allocated.

For an existing file with some data already in it, ext4 and xfs do that 
counting the existing space.

But btrfs is copy-on-write, meaning it's going to have to write the new 
data to a different location than the existing data, and it may well not 
free up the existing allocation (if even a single 4k block of the 
existing allocation remains unwritten, it will remain to hold down the 
entire previous allocation, which isn't released until *none* of it is 
still in use -- of course in normal usage "in use" can be due to old 
snapshots or other reflinks to the same extent, as well, tho in these 
test cases it's not).

So in order to provide the guarantee that writes to preallocated space 
shouldn't ENOSPC, btrfs can't count currently used space as part of the 
fallocate.

The different behavior is entirely due to btrfs being COW, and thus a 
choice having to be made, do we worst-case fallocate-reserve for writes 
over currently used data that will have to be COWed elsewhere, possibly 
without freeing the existing extents because there's still something 
referencing them, or do we risk ENOSPCing on write to a previously 
fallocated area?

The choice was to worst-case-reserve and take the ENOSPC risk at fallocate 
time, so the write into that fallocated space could then proceed without 
the ENOSPC risk that COW would otherwise imply.

Make sense, or is my understanding a horrible misunderstanding? =:^)

So if you're actually only appending, fallocate the /additional/ space, 
not the /entire/ space, and you'll get what you need.  But if you're 
potentially overwriting what's there already, better fallocate the entire 
space, which triggers the btrfs worst-case allocation behavior you see, 
in order to guarantee it won't ENOSPC during the actual write.
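
For concreteness, a minimal sketch of the append-only variant, reusing 
the parity path and target size from earlier in this thread purely for 
illustration:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Reserve only the tail beyond the current end of file. */
static int grow_file(int fd, off_t target)
{
    struct stat st;

    if (fstat(fd, &st))
        return -1;
    if (st.st_size >= target)
        return 0;                       /* already big enough */
    return fallocate(fd, 0, st.st_size, target - st.st_size);
}

int main(void)
{
    int fd = open("/mnt/snap_04/snapraid.parity", O_RDWR);

    if (fd < 0 || grow_file(fd, (off_t)5151751667712ull)) {
        perror("grow_file");
        return 1;
    }
    close(fd);
    return 0;
}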

Of course the only time the behavior actually differs is with COW, but 
then there's a BIG difference, but that BIG difference has a GOOD BIG 
reason!  =:^)

Tho that difference will certainly necessitate some relearning of the 
/correct/ way to do it, for devs who were doing it the COW-worst-case way 
all along, even if they didn't actually need to, because it didn't happen 
to make a difference on what they happened to be testing on, which 
happened not to be COW...

Reminds me of the way newer versions of gcc, or trying to build with 
clang as well, tend to trigger relearning, because newer versions are 
stricter in order to allow better optimization, and other 
implementations are simply different in what they're strict on, /because/ 
they're a different implementation.  Well, btrfs is stricter... because 
it's a different implementation that /has/ to be stricter... due to COW.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Massive loss of disk space
  2017-08-02  4:14       ` Duncan
@ 2017-08-02 11:18         ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-02 11:18 UTC (permalink / raw)
  To: linux-btrfs

On 2017-08-02 00:14, Duncan wrote:
> Austin S. Hemmelgarn posted on Tue, 01 Aug 2017 10:47:30 -0400 as
> excerpted:
> 
>> I think I _might_ understand what's going on here.  Is that test program
>> calling fallocate using the desired total size of the file, or just
>> trying to allocate the range beyond the end to extend the file?  I've
>> seen issues with the first case on BTRFS before, and I'm starting to
>> think that it might actually be trying to allocate the exact amount of
>> space requested by fallocate, even if part of the range is already
>> allocated space.
> 
> If I've interpreted correctly (not being a dev, only a btrfs user,
> sysadmin, and list regular) previous discussions I've seen on this list...
> 
> That's exactly what it's doing, and it's _intended_ behavior.
> 
> The reasoning is something like this:  fallocate is supposed to pre-
> allocate some space with the intent being that writes into that space
> won't fail, because the space is already allocated.
> 
> For an existing file with some data already in it, ext4 and xfs do that
> counting the existing space.
> 
> But btrfs is copy-on-write, meaning it's going to have to write the new
> data to a different location than the existing data, and it may well not
> free up the existing allocation (if even a single 4k block of the
> existing allocation remains unwritten, it will remain to hold down the
> entire previous allocation, which isn't released until *none* of it is
> still in use -- of course in normal usage "in use" can be due to old
> snapshots or other reflinks to the same extent, as well, tho in these
> test cases it's not).
> 
> So in order to provide the guarantee that writes to preallocated space
> shouldn't ENOSPC, btrfs can't count currently used space as part of the
> fallocate.
> 
> The different behavior is entirely due to btrfs being COW, and thus a
> choice having to be made, do we worst-case fallocate-reserve for writes
> over currently used data that will have to be COWed elsewhere, possibly
> without freeing the existing extents because there's still something
> referencing them, or do we risk ENOSPCing on write to a previously
> fallocated area?
> 
> The choice was to worst-case-reserve and take the ENOSPC risk at fallocate
> time, so the write into that fallocated space could then proceed without
> the ENOSPC risk that COW would otherwise imply.
> 
> Make sense, or is my understanding a horrible misunderstanding? =:^)
Your reasoning is sound, except for the fact that at least on older 
kernels (not sure if this is still the case), BTRFS will still perform a 
COW operation when updating a fallocate'ed region.
> 
> So if you're actually only appending, fallocate the /additional/ space,
> not the /entire/ space, and you'll get what you need.  But if you're
> potentially overwriting what's there already, better fallocate the entire
> space, which triggers the btrfs worst-case allocation behavior you see,
> in order to guarantee it won't ENOSPC during the actual write.
> 
> Of course the only time the behavior actually differs is with COW, but
> then there's a BIG difference, but that BIG difference has a GOOD BIG
> reason!  =:^)
> 
> Tho that difference will certainly necessitate some relearning of the
> /correct/ way to do it, for devs who were doing it the COW-worst-case way
> all along, even if they didn't actually need to, because it didn't happen
> to make a difference on what they happened to be testing on, which
> happened not to be COW...
> 
> Reminds me of the way newer versions of gcc, or trying to build with
> clang as well, tend to trigger relearning, because newer versions are
> stricter in order to allow better optimization, and other
> implementations are simply different in what they're strict on, /because/
> they're a different implementation.  Well, btrfs is stricter... because
> it's a different implementation that /has/ to be stricter... due to COW.
Except that that strictness breaks userspace programs that are doing 
perfectly reasonable things.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Massive loss of disk space
  2017-08-01 15:00       ` Austin S. Hemmelgarn
  2017-08-01 15:24         ` pwm
@ 2017-08-02 17:52         ` Goffredo Baroncelli
  2017-08-02 19:10           ` Austin S. Hemmelgarn
                             ` (2 more replies)
  1 sibling, 3 replies; 26+ messages in thread
From: Goffredo Baroncelli @ 2017-08-02 17:52 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, pwm, Hugo Mills; +Cc: linux-btrfs

Hi,

On 2017-08-01 17:00, Austin S. Hemmelgarn wrote:
> OK, I just did a dead simple test by hand, and it looks like I was right.  The method I used to check this is as follows:
> 1. Create and mount a reasonably small filesystem (I used an 8G temporary LV for this, a file would work too though).
> 2. Using dd or a similar tool, create a test file that takes up half of the size of the filesystem.  It is important that this _not_ be fallocated, but just written out.
> 3. Use `fallocate -l` to try and extend the size of the file beyond half the size of the filesystem.
> 
> For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it will succeed with no error.  Based on this and some low-level inspection, it looks like BTRFS treats the full range of the fallocate call as unallocated, and thus is trying to allocate space for regions of that range that are already allocated.

I can confirm this behavior; below are some steps to reproduce it [2]. However, I don't think it is a bug; rather, this is the correct behavior for a COW filesystem (see below).


Looking at the function btrfs_fallocate() (file fs/btrfs/file.c)


static long btrfs_fallocate(struct file *file, int mode,
                            loff_t offset, loff_t len)
{
[...]
        alloc_start = round_down(offset, blocksize);        
        alloc_end = round_up(offset + len, blocksize);
[...]
        /*
         * Only trigger disk allocation, don't trigger qgroup reserve
         *
         * For qgroup space, it will be checked later.
         */
        ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode),
                        alloc_end - alloc_start)


it seems that BTRFS always allocates the maximum space required, without considering what is already allocated. Is it too conservative? I think not: consider the following scenario:

a) create a 2GB file
b) fallocate -o 1GB -l 2GB
c) write from 1GB to 3GB

after b), the expectation is that c) always succeeds [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space because there could be a small time window where both the old and the new data exist on the disk. 

My opinion is that in general this behavior is correct due to the COW nature of BTRFS. 
The only exception that I can find is the "nocow" file; for these cases, taking into account the already allocated space would be better.
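
As an aside, a minimal sketch of creating such a "nocow" file from C via 
the FS_IOC_SETFLAGS ioctl; the flag only takes effect while the file is 
still empty, the path here is hypothetical, and `chattr +C` does the 
same from the shell:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

int main(void)
{
    int flags = 0;
    /* O_EXCL: the nocow flag must be set while the file is empty. */
    int fd = open("/mnt/snap_04/parity.nocow",
                  O_CREAT | O_EXCL | O_RDWR, 0600);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
        perror("FS_IOC_GETFLAGS");
        close(fd);
        return 1;
    }
    flags |= FS_NOCOW_FL;
    if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0) {
        perror("FS_IOC_SETFLAGS");
        close(fd);
        return 1;
    }
    /* Data written through fd from here on is not copy-on-write. */
    close(fd);
    return 0;
}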

Comments are welcome.

BR
G.Baroncelli

[1] from man 2 fallocate
[...]
       After  a  successful call, subsequent writes into the range specified by offset and len are
       guaranteed not to fail because of lack of disk space.
[...]


[2]

-- create a 5G btrfs filesystem

# mkdir t1
# truncate --size 5G disk
# losetup /dev/loop0 disk
# mkfs.btrfs /dev/loop0
# mount /dev/loop0 t1

-- test
-- create a 1500 MB file, then expand it to 4000MB
-- expected result: the file is 4000MB size
-- result: fail: the expansion fails

# fallocate -l $((1024*1024*100*15))  file.bin
# fallocate -l $((1024*1024*100*40))  file.bin
fallocate: fallocate failed: No space left on device
# ls -lh file.bin 
-rw-r--r-- 1 root root 1.5G Aug  2 19:09 file.bin


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Massive loss of disk space
  2017-08-02 17:52         ` Goffredo Baroncelli
@ 2017-08-02 19:10           ` Austin S. Hemmelgarn
  2017-08-02 21:05             ` Goffredo Baroncelli
  2017-08-03  3:48           ` Duncan
  2017-08-03 11:44           ` Marat Khalili
  2 siblings, 1 reply; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-02 19:10 UTC (permalink / raw)
  To: kreijack, pwm, Hugo Mills; +Cc: linux-btrfs

On 2017-08-02 13:52, Goffredo Baroncelli wrote:
> Hi,
> 
> On 2017-08-01 17:00, Austin S. Hemmelgarn wrote:
>> OK, I just did a dead simple test by hand, and it looks like I was right.  The method I used to check this is as follows:
>> 1. Create and mount a reasonably small filesystem (I used an 8G temporary LV for this, a file would work too though).
>> 2. Using dd or a similar tool, create a test file that takes up half of the size of the filesystem.  It is important that this _not_ be fallocated, but just written out.
>> 3. Use `fallocate -l` to try and extend the size of the file beyond half the size of the filesystem.
>>
>> For BTRFS, this will result in -ENOSPC, while for ext4 and XFS, it will succeed with no error.  Based on this and some low-level inspection, it looks like BTRFS treats the full range of the fallocate call as unallocated, and thus is trying to allocate space for regions of that range that are already allocated.
> 
> I can confirm this behavior; below are some steps to reproduce it [2]. However, I don't think it is a bug; rather, this is the correct behavior for a COW filesystem (see below).
> 
> 
> Looking at the function btrfs_fallocate() (file fs/btrfs/file.c)
> 
> 
> static long btrfs_fallocate(struct file *file, int mode,
>                              loff_t offset, loff_t len)
> {
> [...]
>          alloc_start = round_down(offset, blocksize);
>          alloc_end = round_up(offset + len, blocksize);
> [...]
>          /*
>           * Only trigger disk allocation, don't trigger qgroup reserve
>           *
>           * For qgroup space, it will be checked later.
>           */
>          ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode),
>                          alloc_end - alloc_start)
> 
> 
> it seems that BTRFS always allocates the maximum space required, without considering what is already allocated. Is it too conservative? I think not: consider the following scenario:
> 
> a) create a 2GB file
> b) fallocate -o 1GB -l 2GB
> c) write from 1GB to 3GB
> 
> after b), the expectation is that c) always succeeds [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space because there could be a small time window where both the old and the new data exist on the disk.
There is also an expectation based on pretty much every other FS in 
existence that calling fallocate() on a range that is already in use is 
a (possibly expensive) no-op, and by extension using fallocate() with an 
offset of 0 like a ftruncate() call will succeed as long as the new size 
will fit.

I've checked JFS, XFS, ext4, vfat, NTFS (via NTFS-3G, not the kernel 
driver), NILFS2, OCFS2 (local mode only), F2FS, UFS, and HFS+ on Linux, 
UFS and HFS+ on OS X, UFS and ZFS on FreeBSD, FFS (UFS with a different 
name) and LFS (log structured) on NetBSD, and UFS and ZFS on Solaris, 
and VxFS on HP-UX, and _all_ of them behave correctly here and succeed 
with the test I listed, while BTRFS does not.  This isn't codified in 
POSIX, but it's also not something that is listed as implementation 
defined, which in turn means that we should be trying to match the other 
implementations.

> 
> My opinion is that in general this behavior is correct due to the COW nature of BTRFS.
> The only exception that I can find is the "nocow" file; for these cases, taking into account the already allocated space would be better.
There are other, saner ways to make that expectation hold though, and 
I'm not even certain that it does as things are implemented (I believe 
we still CoW unwritten extents when data is written to them, because I 
_have_ had writes to fallocate'ed files fail on BTRFS before with -ENOSPC).

The ideal situation IMO is as follows:

1. This particular case (using fallocate() with an offset of 0 to extend 
a file that is already larger than half the remaining free space on the 
FS) _should_ succeed.  Short of very convoluted configurations, 
extending a file with fallocate will not result in over-committing space 
on a CoW filesystem unless it would extend the file by more than the 
remaining free space, and therefore barring long external interactions, 
subsequent writes will also succeed.  Proof of this for a general case 
is somewhat complicated, but in the very specific case of the script I 
posted as a reproducer in the other thread about this and the test case 
I gave in this thread, it's trivial to prove that the writes will 
succeed.  Either way, the behavior of SnapRAID, while not optimal in 
this case, is still a legitimate usage (I've seen programs do things 
like that just to make sure the file isn't sparse).

2. Conversion of unwritten extents to written ones should not require 
new allocation.  Ideally, we need to be allocating not just space for 
the data, but also reasonable space for the associated metadata when 
allocating an unwritten extent, and there should be no CoW involved when 
they are written to except for the small metadata updates required to 
account the new blocks.  Unless we're doing this, then we have edge 
cases where the the above listed expectation does not hold (also note 
that GlobalReserve does not count IMO, it's supposed to be for temporary 
usage only and doesn't ever appear to be particularly large).

3. There should be some small amount of space reserved globally for not 
just metadata, but data too, so that a 'full' filesystem can still 
update existing files reliably.  I'm not sure that we're not doing this 
already, but AIUI, GlobalReserve is metadata only.  If we do this, we 
don't have to worry _as much_ about avoiding CoW when converting 
unwritten extents to regular ones.
> 
> Comments are welcome.
> 
> BR
> G.Baroncelli
> 
> [1] from man 2 fallocate
> [...]
>         After  a  successful call, subsequent writes into the range specified by offset and len are
>         guaranteed not to fail because of lack of disk space.
> [...]
> 
> 
> [2]
> 
> -- create a 5G btrfs filesystem
> 
> # mkdir t1
> # truncate --size 5G disk
> # losetup /dev/loop0 disk
> # mkfs.btrfs /dev/loop0
> # mount /dev/loop0 t1
> 
> -- test
> -- create a 1500 MB file, then expand it to 4000MB
> -- expected result: the file is 4000MB size
> -- result: fail: the expansion fails
> 
> # fallocate -l $((1024*1024*100*15))  file.bin
> # fallocate -l $((1024*1024*100*40))  file.bin
> fallocate: fallocate failed: No space left on device
> # ls -lh file.bin
> -rw-r--r-- 1 root root 1.5G Aug  2 19:09 file.bin
> 
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Massive loss of disk space
  2017-08-02 19:10           ` Austin S. Hemmelgarn
@ 2017-08-02 21:05             ` Goffredo Baroncelli
  2017-08-03 11:39               ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 26+ messages in thread
From: Goffredo Baroncelli @ 2017-08-02 21:05 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, pwm, Hugo Mills; +Cc: linux-btrfs

On 2017-08-02 21:10, Austin S. Hemmelgarn wrote:
> On 2017-08-02 13:52, Goffredo Baroncelli wrote:
>> Hi,
>>
[...]

>> consider the following scenario:
>>
>> a) create a 2GB file
>> b) fallocate -o 1GB -l 2GB
>> c) write from 1GB to 3GB
>>
>> after b), the expectation is that c) always succeeds [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space because there could be a small time window where both the old and the new data exist on the disk.

> There is also an expectation based on pretty much every other FS in existence that calling fallocate() on a range that is already in use is a (possibly expensive) no-op, and by extension using fallocate() with an offset of 0 like a ftruncate() call will succeed as long as the new size will fit.

The man page of fallocate doesn't guarantee that.

Unfortunately in a COW filesystem the assumption that an allocated area may be simply overwritten is not true. 

Let me say it in other words: as a general rule, if you want to _write_ something in a COW filesystem, you need space. It doesn't matter whether you are *over-writing* existing data or *appending* to a file.


> 
> I've checked JFS, XFS, ext4, vfat, NTFS (via NTFS-3G, not the kernel driver), NILFS2, OCFS2 (local mode only), F2FS, UFS, and HFS+ on Linux, UFS and HFS+ on OS X, UFS and ZFS on FreeBSD, FFS (UFS with a different name) and LFS (log structured) on NetBSD, and UFS and ZFS on Solaris, and VxFS on HP-UX, and _all_ of them behave correctly here and succeed with the test I listed, while BTRFS does not.  This isn't codified in POSIX, but it's also not something that is listed as implementation defined, which in turn means that we should be trying to match the other implementations.

[...]

> 
>>
>> My opinion is that in general this behavior is correct due to the COW nature of BTRFS.
>> The only exception that I can find is the "nocow" file; for these cases, taking into account the already allocated space would be better.
> There are other, saner ways to make that expectation hold though, and I'm not even certain that it does as things are implemented (I believe we still CoW unwritten extents when data is written to them, because I _have_ had writes to fallocate'ed files fail on BTRFS before with -ENOSPC).
> 
> The ideal situation IMO is as follows:
> 
> 1. This particular case (using fallocate() with an offset of 0 to extend a file that is already larger than half the remaining free space on the FS) _should_ succeed.  

This description is not accurate. What happens is the following:
1) you have a file *with valid data*
2) you want to prepare an update of this file and want to be sure to have enough space

at this point fallocate has to guarantee:
a) you have your old data still available
b) you have allocated the space for the update

In terms of a COW filesystem, you need the space of a) + the space of b)
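
(Concretely, in the scenario above: the fallocate covers the 1GB-3GB 
range, so btrfs reserves the full 2GB even though 1GB of it is already 
allocated, and in the worst case the old 1GB in the overlap and the new 
2GB coexist on disk until the old extents are freed.)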


> Short of very convoluted configurations, extending a file with fallocate will not result in over-committing space on a CoW filesystem unless it would extend the file by more than the remaining free space, and therefore barring long external interactions, subsequent writes will also succeed.  Proof of this for a general case is somewhat complicated, but in the very specific case of the script I posted as a reproducer in the other thread about this and the test case I gave in this thread, it's trivial to prove that the writes will succeed.  Either way, the behavior of SnapRAID, while not optimal in this case, is still a legitimate usage (I've seen programs do things like that just to make sure the file isn't sparse).
> 
> 2. Conversion of unwritten extents to written ones should not require new allocation.  Ideally, we need to be allocating not just space for the data, but also reasonable space for the associated metadata when allocating an unwritten extent, and there should be no CoW involved when they are written to except for the small metadata updates required to account the new blocks.  Unless we're doing this, then we have edge cases where the above listed expectation does not hold (also note that GlobalReserve does not count IMO, it's supposed to be for temporary usage only and doesn't ever appear to be particularly large).
> 
> 3. There should be some small amount of space reserved globally for not just metadata, but data too, so that a 'full' filesystem can still update existing files reliably.  I'm not sure that we're not doing this already, but AIUI, GlobalReserve is metadata only.  If we do this, we don't have to worry _as much_ about avoiding CoW when converting unwritten extents to regular ones.
>>
>> Comments are welcome.
>>
>> BR
>> G.Baroncelli
>>
>> [1] from man 2 fallocate
>> [...]
>>         After  a  successful call, subsequent writes into the range specified by offset and len are
>>         guaranteed not to fail because of lack of disk space.
>> [...]
>>
>>
>> [2]
>>
>> -- create a 5G btrfs filesystem
>>
>> # mkdir t1
>> # truncate --size 5G disk
>> # losetup /dev/loop0 disk
>> # mkfs.btrfs /dev/loop0
>> # mount /dev/loop0 t1
>>
>> -- test
>> -- create a 1500 MB file, then expand it to 4000MB
>> -- expected result: the file is 4000MB size
>> -- result: fail: the expansion fails
>>
>> # fallocate -l $((1024*1024*100*15))  file.bin
>> # fallocate -l $((1024*1024*100*40))  file.bin
>> fallocate: fallocate failed: No space left on device
>> # ls -lh file.bin
>> -rw-r--r-- 1 root root 1.5G Aug  2 19:09 file.bin
>>
>>
> 
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Massive loss of disk space
  2017-08-02 17:52         ` Goffredo Baroncelli
  2017-08-02 19:10           ` Austin S. Hemmelgarn
@ 2017-08-03  3:48           ` Duncan
  2017-08-03 11:44           ` Marat Khalili
  2 siblings, 0 replies; 26+ messages in thread
From: Duncan @ 2017-08-03  3:48 UTC (permalink / raw)
  To: linux-btrfs

Goffredo Baroncelli posted on Wed, 02 Aug 2017 19:52:30 +0200 as
excerpted:

> it seems that BTRFS always allocates the maximum space required, without
> considering what is already allocated. Is it too conservative? I think not:
> consider the following scenario:
> 
> a) create a 2GB file
> b) fallocate -o 1GB -l 2GB
> c) write from 1GB to 3GB
> 
> after b), the expectation is that c) always succeeds [1]: i.e. there is
> enough space on the filesystem. Due to the COW nature of BTRFS, you
> cannot rely on the already allocated space because there could be a
> small time window where both the old and the new data exist on the
> disk.

Not only a small time, perhaps (effectively) permanently, due to either 
of two factors:

1) If the existing extents are reflinked by snapshots or other files they 
obviously won't be released at all when the overwrite is completed.  
fallocate must account for this possibility, and behaving differently in 
the context of other reflinks would be confusing, so the best policy is 
consistently behave as if the existing data will not be freed.

2) As the devs have commented a number of times, an extent isn't freed if 
there's still a reflink to part of it.  If the original extent was a full 
1 GiB data chunk (the chunk being the max size of a native btrfs extent, 
one of the reasons a balance and defrag after conversion from ext4 and 
deletion of the ext4-saved subvolume is recommended, to break up the 
longer ext4 extents so they won't cause btrfs problems later) and all but 
a single 4 KiB block has been rewritten, the full 1 GiB extent will 
remain referenced and continue to take that original full 1 GiB space, 
*plus* the space of all the new-version extents of the overwritten data, 
of course.

So in our fallocate-and-overwrite scenario, we again must reserve space 
for two copies of the data: the original, which may well not be freed 
even without other reflinks if a single 4 KiB block of an extent remains 
unoverwritten, and the new version of the data.
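
(Numerically: if even one 4 KiB block of the original 1 GiB extent stays 
referenced, the full 1 GiB remains allocated, plus up to 1 GiB of new 
extents for the rewritten data -- nearly 2 GiB on disk for a 1 GiB file.)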

At least that /was/ the behavior explained on-list previous to the hole-
punching changes.  I'm not a dev and haven't seen a dev comment on 
whether that remains the behavior after hole-punching, which may at least 
naively be expected to automatically handle and free overwritten data 
using hole-punching, or not.  I'd be interested in seeing someone who can 
read the code confirm one way or the other whether hole-punching changed 
that previous behavior, or not.
 
> My opinion is that in general this behavior is correct due to the COW
> nature of BTRFS.
> The only exception that I can find is the "nocow" file; for these
> cases, taking into account the already allocated space would be better.

I'd say it's dangerously optimistic even then, considering that "nocow" 
is actually "cow1" in the presence of snapshots.


Meanwhile, it's worth keeping in mind that it's exactly these sorts of 
corner-cases that are why btrfs is taking so long to stabilize.  
Supposedly "simple" expectations aren't always so simple, and if a 
filesystem gets it wrong, it's somebody's data hanging in the balance!  
(Tho if they've any wisdom at all, they'll ensure they're aware of the 
stability status of a filesystem before they put data on it, and will 
adjust their backup policies accordingly if they're using a still not 
fully stabilized filesystem such as btrfs, so the data won't actually be 
in any danger anyway unless it was literally throw-away value, only 
whatever specific instance of it was involved in that corner-case.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Massive loss of disk space
  2017-08-02 21:05             ` Goffredo Baroncelli
@ 2017-08-03 11:39               ` Austin S. Hemmelgarn
  2017-08-03 16:37                 ` Goffredo Baroncelli
  0 siblings, 1 reply; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-03 11:39 UTC (permalink / raw)
  To: kreijack, pwm, Hugo Mills; +Cc: linux-btrfs

On 2017-08-02 17:05, Goffredo Baroncelli wrote:
> On 2017-08-02 21:10, Austin S. Hemmelgarn wrote:
>> On 2017-08-02 13:52, Goffredo Baroncelli wrote:
>>> Hi,
>>>
> [...]
> 
>>> consider the following scenario:
>>>
>>> a) create a 2GB file
>>> b) fallocate -o 1GB -l 2GB
>>> c) write from 1GB to 3GB
>>>
>>> after b), the expectation is that c) always succeeds [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space because there could be a small time window where both the old and the new data exist on the disk.
> 
>> There is also an expectation based on pretty much every other FS in existence that calling fallocate() on a range that is already in use is a (possibly expensive) no-op, and by extension using fallocate() with an offset of 0 like a ftruncate() call will succeed as long as the new size will fit.
> 
> The man page of fallocate doesn't guarantee that.
> 
> Unfortunately in a COW filesystem the assumption that an allocated area may be simply overwritten is not true.
> 
> Let me say it in other words: as a general rule, if you want to _write_ something in a COW filesystem, you need space. It doesn't matter whether you are *over-writing* existing data or *appending* to a file.
Yes, you need space, but you don't need _all_ the space.  For a file 
that already has data in it, you only _need_ as much space as the 
largest chunk of data that can be written at once at a low level, 
because the moment that first write finishes, the space that was used in 
the file for that region is freed, and the next write can go there.  Put 
a bit differently, you only need to allocate what isn't allocated in the 
region, and then a bit more to handle the initial write to the file.

Also, as I said below, _THIS WORKS ON ZFS_.  That immediately means that 
a CoW filesystem _does not_ need to behave like BTRFS is.
> 
> 
>>
>> I've checked JFS, XFS, ext4, vfat, NTFS (via NTFS-3G, not the kernel driver), NILFS2, OCFS2 (local mode only), F2FS, UFS, and HFS+ on Linux, UFS and HFS+ on OS X, UFS and ZFS on FreeBSD, FFS (UFS with a different name) and LFS (log structured) on NetBSD, and UFS and ZFS on Solaris, and VxFS on HP-UX, and _all_ of them behave correctly here and succeed with the test I listed, while BTRFS does not.  This isn't codified in POSIX, but it's also not something that is listed as implementation defined, which in turn means that we should be trying to match the other implementations.
> 
> [...]
> 
>>
>>>
>>> My opinion is that in general this behavior is correct due to the COW nature of BTRFS.
>>> The only exception that I can find is the "nocow" file; for these cases, taking into account the already allocated space would be better.
>> There are other, saner ways to make that expectation hold though, and I'm not even certain that it does as things are implemented (I believe we still CoW unwritten extents when data is written to them, because I _have_ had writes to fallocate'ed files fail on BTRFS before with -ENOSPC).
>>
>> The ideal situation IMO is as follows:
>>
>> 1. This particular case (using fallocate() with an offset of 0 to extend a file that is already larger than half the remaining free space on the FS) _should_ succeed.
> 
> This description is not accurate. What happens is the following:
> 1) you have a file *with valid data*
> 2) you want to prepare an update of this file and want to be sure to have enough space
Except this is not the common case.  Most filesystems aren't CoW, so 
calling fallocate() like this is generally not 'ensuring you have enough 
space', it's 'ensuring the file isn't sparse, and we can write to the 
extra area beyond the end we care about'.
> 
> at this point fallocate has to guarantee:
> a) you have your old data still available
> b) you have allocated the space for the update
> 
> In terms of a COW filesystem, you need the space of a) + the space of b)
No, that is only required if the entire file needs to be written 
atomically.  There is some maximal size atomic write that BTRFS can 
perform as a single operation at a low level (I'm not sure if this is 
equal to the block size, or larger, but it doesn't matter much, either 
way, I'm talking the largest chunk of data it will write to a disk in a 
single operation before updating metadata to point to that new data). 
If your total size (original data plus the new space) is less than this 
maximal atomic write size, then the above is true, but if it is larger, 
you only need to allocate space for regions of the fallocate() range 
that aren't already allocated, plus space to accommodate at least one 
write of this maximal atomic write size.  Any space beyond that just 
ends up minimizing the degree of fragmentation introduced by allocation.

The methodology that allows this is really simple.  When you start to 
write data to the file, the first part of the write goes into the newly 
allocated space, and the original region covered by that write gets 
freed.  You can then write into the space that was just freed and repeat 
the process until the write is done.  Implementing this requires the 
freeing process to know that the freed region was covered by an 
fallocate() call, and thus that it should be saved for future writes. 
Provided that the back-conversion from used space to fallocated() space 
is done directly, this is also race free.
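
To make that accounting argument concrete, here is a toy user-space 
simulation of it; the chunk size and megabyte figures are made up, and 
this models only the accounting, not actual btrfs behavior:

#include <stdio.h>

int main(void)
{
    long fs_free = 1024;    /* MB free on the filesystem (made up) */
    long file_size = 2048;  /* MB of existing, already-allocated data */
    long chunk = 128;       /* assumed max low-level atomic write, MB */
    long off;

    /* Overwrite the whole file chunk by chunk: each CoW write first
     * consumes 'chunk' MB of free space, then the old copy of that
     * region is freed back, so free space never drops by more than
     * one chunk at a time. */
    for (off = 0; off < file_size; off += chunk) {
        fs_free -= chunk;   /* write the new copy of this region */
        if (fs_free < 0) {
            puts("ENOSPC");
            return 1;
        }
        fs_free += chunk;   /* the old copy of the region is freed */
    }
    printf("rewrote %ld MB with only %ld MB of headroom needed\n",
           file_size, chunk);
    return 0;
}
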
> 
> 
>> Short of very convoluted configurations, extending a file with fallocate will not result in over-committing space on a CoW filesystem unless it would extend the file by more than the remaining free space, and therefore barring long external interactions, subsequent writes will also succeed.  Proof of this for a general case is somewhat complicated, but in the very specific case of the script I posted as a reproducer in the other thread about this and the test case I gave in this thread, it's trivial to prove that the writes will succeed.  Either way, the behavior of SnapRAID, while not optimal in this case, is still a legitimate usage (I've seen programs do things like that just to make sure the file isn't sparse).
>>
>> 2. Conversion of unwritten extents to written ones should not require new allocation.  Ideally, we need to be allocating not just space for the data, but also reasonable space for the associated metadata when allocating an unwritten extent, and there should be no CoW involved when they are written to except for the small metadata updates required to account the new blocks.  Unless we're doing this, then we have edge cases where the above listed expectation does not hold (also note that GlobalReserve does not count IMO, it's supposed to be for temporary usage only and doesn't ever appear to be particularly large).
>>
>> 3. There should be some small amount of space reserved globally for not just metadata, but data too, so that a 'full' filesystem can still update existing files reliably.  I'm not sure that we're not doing this already, but AIUI, GlobalReserve is metadata only.  If we do this, we don't have to worry _as much_ about avoiding CoW when converting unwritten extents to regular ones.
>>>
>>> Comments are welcome.
>>>
>>> BR
>>> G.Baroncelli
>>>
>>> [1] from man 2 fallocate
>>> [...]
>>>          After  a  successful call, subsequent writes into the range specified by offset and len are
>>>          guaranteed not to fail because of lack of disk space.
>>> [...]
>>>
>>>
>>> [2]
>>>
>>> -- create a 5G btrfs filesystem
>>>
>>> # mkdir t1
>>> # truncate --size 5G disk
>>> # losetup /dev/loop0 disk
>>> # mkfs.btrfs /dev/loop0
>>> # mount /dev/loop0 t1
>>>
>>> -- test
>>> -- create a 1500 MB file, then expand it to 4000MB
>>> -- expected result: the file is 4000MB size
>>> -- result: fail: the expansion fails
>>>
>>> # fallocate -l $((1024*1024*100*15))  file.bin
>>> # fallocate -l $((1024*1024*100*40))  file.bin
>>> fallocate: fallocate failed: No space left on device
>>> # ls -lh file.bin
>>> -rw-r--r-- 1 root root 1.5G Aug  2 19:09 file.bin
>>>
>>>
>>
>>
> 
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Massive loss of disk space
  2017-08-02 17:52         ` Goffredo Baroncelli
  2017-08-02 19:10           ` Austin S. Hemmelgarn
  2017-08-03  3:48           ` Duncan
@ 2017-08-03 11:44           ` Marat Khalili
  2017-08-03 11:52             ` Austin S. Hemmelgarn
  2017-08-03 16:01             ` Goffredo Baroncelli
  2 siblings, 2 replies; 26+ messages in thread
From: Marat Khalili @ 2017-08-03 11:44 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, linux-btrfs; +Cc: kreijack, pwm, Hugo Mills

On 02/08/17 20:52, Goffredo Baroncelli wrote:
> consider the following scenario:
>
> a) create a 2GB file
> b) fallocate -o 1GB -l 2GB
> c) write from 1GB to 3GB
>
> after b), the expectation is that c) always succeeds [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space because there could be a small time window where both the old and the new data exist on the disk.
Just curious. With current implementation, in the following case:
a) create a 2GB file1 && create a 2GB file2
b) fallocate -o 1GB -l 2GB file1 && fallocate -o 1GB -l 2GB file2
c) write from 1GB to 3GB file1 && write from 1GB to 3GB file2
will (c) always succeed? I.e. does fallocate really allocate 2GB per 
file, or does it only allocate an additional 1GB and check free space for 
another 1GB? If it's only the latter, it is useless.

--

With Best Regards,
Marat Khalili


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Massive loss of disk space
  2017-08-03 11:44           ` Marat Khalili
@ 2017-08-03 11:52             ` Austin S. Hemmelgarn
  2017-08-03 16:01             ` Goffredo Baroncelli
  1 sibling, 0 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-03 11:52 UTC (permalink / raw)
  To: Marat Khalili, linux-btrfs; +Cc: kreijack, pwm, Hugo Mills

On 2017-08-03 07:44, Marat Khalili wrote:
> On 02/08/17 20:52, Goffredo Baroncelli wrote:
>> consider the following scenario:
>>
>> a) create a 2GB file
>> b) fallocate -o 1GB -l 2GB
>> c) write from 1GB to 3GB
>>
>> after b), the expectation is that c) always succeeds [1]: i.e. there is 
>> enough space on the filesystem. Due to the COW nature of BTRFS, you 
>> cannot rely on the already allocated space because there could be a 
>> small time window where both the old and the new data exist on the disk.
> Just curious. With current implementation, in the following case:
> a) create a 2GB file1 && create a 2GB file2
> b) fallocate -o 1GB -l 2GB file1 && fallocate -o 1GB -l 2GB file2
> c) write from 1GB to 3GB file1 && write from 1GB to 3GB file2
> will (c) always succeed? I.e. does fallocate really allocate 2GB per 
> file, or does it only allocate an additional 1GB and check free space for 
> another 1GB? If it's only the latter, it is useless.
It will currently allocate 4GB total in this case (2 for each file), and 
_should_ succeed.  I think there are corner cases where it can fail 
though because of metadata exhaustion, and I'm still not certain we 
don't CoW unwritten extents (if we do CoW unwritten extents, then this, 
and all fallocate allocation for that matter, becomes non-deterministic 
as to whether or not it succeeds).


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Massive loss of disk space
  2017-08-03 11:44           ` Marat Khalili
  2017-08-03 11:52             ` Austin S. Hemmelgarn
@ 2017-08-03 16:01             ` Goffredo Baroncelli
  2017-08-03 17:15               ` Marat Khalili
  2017-08-03 22:51               ` pwm
  1 sibling, 2 replies; 26+ messages in thread
From: Goffredo Baroncelli @ 2017-08-03 16:01 UTC (permalink / raw)
  To: Marat Khalili, Austin S. Hemmelgarn, linux-btrfs; +Cc: pwm, Hugo Mills

On 2017-08-03 13:44, Marat Khalili wrote:
> On 02/08/17 20:52, Goffredo Baroncelli wrote:
>> consider the following scenario:
>>
>> a) create a 2GB file
>> b) fallocate -o 1GB -l 2GB
>> c) write from 1GB to 3GB
>>
>> after b), the expectation is that c) always succeeds [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space because there could be a small time window where both the old and the new data exist on the disk.
> Just curious. With current implementation, in the following case:
> a) create a 2GB file1 && create a 2GB file2
> b) fallocate -o 1GB -l 2GB file1 && fallocate -o 1GB -l 2GB file2

At this step you are trying to allocate 3GB+3GB = 6GB, so you have exhausted the filesystem space.

> c) write from 1GB to 3GB file1 && write from 1GB to 3GB file2
> will (c) always succeed? I.e. does fallocate really allocate 2GB per file, or does it only allocate an additional 1GB and check free space for another 1GB? If it's only the latter, it is useless.
The file is physically extended

ghigo@venice:/tmp$ fallocate -l 1000 foo.txt
ghigo@venice:/tmp$ ls -l foo.txt
-rw-r--r-- 1 ghigo ghigo 1000 Aug  3 18:00 foo.txt
ghigo@venice:/tmp$ fallocate -o 500 -l 1000 foo.txt
ghigo@venice:/tmp$ ls -l foo.txt
-rw-r--r-- 1 ghigo ghigo 1500 Aug  3 18:00 foo.txt
ghigo@venice:/tmp$
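
(Note that ls -l only reports the file length; whether the blocks are 
actually reserved can be checked with stat or du, which report allocated 
blocks.)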

> 
> -- 
> 
> With Best Regards,
> Marat Khalili
> 
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Massive loss of disk space
  2017-08-03 11:39               ` Austin S. Hemmelgarn
@ 2017-08-03 16:37                 ` Goffredo Baroncelli
  2017-08-03 17:23                   ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 26+ messages in thread
From: Goffredo Baroncelli @ 2017-08-03 16:37 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, pwm, Hugo Mills; +Cc: linux-btrfs

On 2017-08-03 13:39, Austin S. Hemmelgarn wrote:
> On 2017-08-02 17:05, Goffredo Baroncelli wrote:
>> On 2017-08-02 21:10, Austin S. Hemmelgarn wrote:
>>> On 2017-08-02 13:52, Goffredo Baroncelli wrote:
>>>> Hi,
>>>>
>> [...]
>>
>>>> consider the following scenario:
>>>>
>>>> a) create a 2GB file
>>>> b) fallocate -o 1GB -l 2GB
>>>> c) write from 1GB to 3GB
>>>>
>>>> after b), the expectation is that c) always succeeds [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space because there could be a small time window where both the old and the new data exist on the disk.
>>
>>> There is also an expectation based on pretty much every other FS in existence that calling fallocate() on a range that is already in use is a (possibly expensive) no-op, and by extension using fallocate() with an offset of 0 like a ftruncate() call will succeed as long as the new size will fit.
>>
>> The man page of fallocate doesn't guarantee that.
>>
>> Unfortunately in a COW filesystem the assumption that an allocated area may be simply overwritten is not true.
>>
>> Let me say it in other words: as a general rule, if you want to _write_ something in a COW filesystem, you need space. It doesn't matter whether you are *over-writing* existing data or *appending* to a file.
> Yes, you need space, but you don't need _all_ the space.  For a file that already has data in it, you only _need_ as much space as the largest chunk of data that can be written at once at a low level, because the moment that first write finishes, the space that was used in the file for that region is freed, and the next write can go there.  Put a bit differently, you only need to allocate what isn't allocated in the region, and then a bit more to handle the initial write to the file.
> 
> Also, as I said below, _THIS WORKS ON ZFS_.  That immediately means that a CoW filesystem _does not_ need to behave like BTRFS is.

It seems that ZFS on Linux doesn't support fallocate

see https://github.com/zfsonlinux/zfs/issues/326

So I think that you are referring to posix_fallocate and ZFS on Solaris, which I can't test, so I can't comment.

[...]
>> In terms of a COW filesystem, you need the space of a) + the space of b)
> No, that is only required if the entire file needs to be written atomically.  There is some maximal size atomic write that BTRFS can perform as a single operation at a low level (I'm not sure if this is equal to the block size, or larger, but it doesn't matter much, either way, I'm talking the largest chunk of data it will write to a disk in a single operation before updating metadata to point to that new data). 

To the best of my knowledge there is only a time limit: IIRC, every 30 seconds a transaction is closed. If you are able to fill the filesystem in this time window, you are in trouble.

[...]

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Massive loss of disk space
  2017-08-03 16:01             ` Goffredo Baroncelli
@ 2017-08-03 17:15               ` Marat Khalili
  2017-08-03 17:25                 ` Austin S. Hemmelgarn
  2017-08-03 22:51               ` pwm
  1 sibling, 1 reply; 26+ messages in thread
From: Marat Khalili @ 2017-08-03 17:15 UTC (permalink / raw)
  To: kreijack, Goffredo Baroncelli, Austin S. Hemmelgarn, linux-btrfs
  Cc: pwm, Hugo Mills

On August 3, 2017 7:01:06 PM GMT+03:00, Goffredo Baroncelli 
>The file is physically extended
>
>ghigo@venice:/tmp$ fallocate -l 1000 foo.txt

For clarity let's replace the fallocate above with:
$ head -c 1000 </dev/urandom >foo.txt

>ghigo@venice:/tmp$ ls -l foo.txt
>-rw-r--r-- 1 ghigo ghigo 1000 Aug  3 18:00 foo.txt
>ghigo@venice:/tmp$ fallocate -o 500 -l 1000 foo.txt
>ghigo@venice:/tmp$ ls -l foo.txt
>-rw-r--r-- 1 ghigo ghigo 1500 Aug  3 18:00 foo.txt
>ghigo@venice:/tmp$

According to the explanation by Austin, foo.txt at this point somehow occupies 2000 bytes of space, because I can reflink it and then write another 1000 bytes of data into it without losing the 1000 bytes I already have or running out of drive space. (Or is it only true while there are open file handles?)
-- 

With Best Regards,
Marat Khalili

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Massive loss of disk space
  2017-08-03 16:37                 ` Goffredo Baroncelli
@ 2017-08-03 17:23                   ` Austin S. Hemmelgarn
  2017-08-04 14:45                     ` Goffredo Baroncelli
  0 siblings, 1 reply; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-03 17:23 UTC (permalink / raw)
  To: kreijack, pwm, Hugo Mills; +Cc: linux-btrfs

On 2017-08-03 12:37, Goffredo Baroncelli wrote:
> On 2017-08-03 13:39, Austin S. Hemmelgarn wrote:
>> On 2017-08-02 17:05, Goffredo Baroncelli wrote:
>>> On 2017-08-02 21:10, Austin S. Hemmelgarn wrote:
>>>> On 2017-08-02 13:52, Goffredo Baroncelli wrote:
>>>>> Hi,
>>>>>
>>> [...]
>>>
>>>>> consider the following scenario:
>>>>>
>>>>> a) create a 2GB file
>>>>> b) fallocate -o 1GB -l 2GB
>>>>> c) write from 1GB to 3GB
>>>>>
>>>>> after b), the expectation is that c) always succeeds [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space because there could be a small time window where both the old and the new data exist on the disk.
>>>
>>>> There is also an expectation based on pretty much every other FS in existence that calling fallocate() on a range that is already in use is a (possibly expensive) no-op, and by extension using fallocate() with an offset of 0 like a ftruncate() call will succeed as long as the new size will fit.
>>>
>>> The man page of fallocate doesn't guarantee that.
>>>
>>> Unfortunately in a COW filesystem the assumption that an allocated area may be simply overwritten is not true.
>>>
>>> Let me say it in other words: as a general rule, if you want to _write_ something in a COW filesystem, you need space. It doesn't matter whether you are *over-writing* existing data or *appending* to a file.
>> Yes, you need space, but you don't need _all_ the space.  For a file that already has data in it, you only _need_ as much space as the largest chunk of data that can be written at once at a low level, because the moment that first write finishes, the space that was used in the file for that region is freed, and the next write can go there.  Put a bit differently, you only need to allocate what isn't allocated in the region, and then a bit more to handle the initial write to the file.
>>
>> Also, as I said below, _THIS WORKS ON ZFS_.  That immediately means that a CoW filesystem _does not_ need to behave like BTRFS is.
> 
> It seems that ZFS on Linux doesn't support fallocate
> 
> see https://github.com/zfsonlinux/zfs/issues/326
> 
> So I think that you are referring to posix_fallocate and ZFS on Solaris, which I can't test, so I can't comment.
Both Solaris and FreeBSD (I've got a FreeNAS system at work I checked on).

That said, I'm starting to wonder if just failing fallocate() calls to 
allocate space is actually the right thing to do here after all.  Aside 
from this, we don't reserve metadata space for checksums and similar 
things for the eventual writes (so it's possible to get -ENOSPC on a 
write to an fallocate'ed region anyway because of metadata exhaustion), 
and splitting extents can also cause it to fail, so it's perfectly 
possible for the fallocate assumption to not hold on BTRFS.  The irony 
of this is that if you're in a situation where you actually need to 
reserve space, you're more likely to fail (because if you actually 
_need_ to reserve the space, your filesystem may already be mostly full, 
and therefore any of the above issues may occur).

On the specific note of splitting extents, the following will probably 
fail on BTRFS as well when done with a large enough FS (the turnover 
point ends up being the point at which 256MiB isn't enough space to 
account for all the extents), but will succeed on other filesystems:
1. Create filesystem and mount it.  On BTRFS, make sure autodefrag is 
off (this makes it fail more reliably, but is not essential for it to fail).
2. Use fallocate to allocate as large a file as possible (in the BTRFS 
case, try for the size of the filesystem - 544MiB (512MiB for the 
metadata chunk, 32MiB for the system chunk)).
3. Write half the file using 1MB blocks and skipping 1MB of space 
between each block (so every other 1MB of space is actually written to).
4. Write the other half of the file by filling in the holes.

The net effect of this is to split the single large fallocate'd extent 
into a very large number of 1MB extents, which in turn eats up lots of 
metadata space and will eventually exhaust it.  While this specific 
exercise requires a large filesystem, more generic real world situations 
exist where this can happen (and I have had this happen before).
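
A sketch of that write pattern as a small C program; the path and the 
1GiB size are illustrative, and per the above a real test needs a much 
larger filesystem to actually exhaust metadata:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const off_t mb = 1024 * 1024;
    const off_t size = 1024 * mb;       /* 1GiB for illustration */
    static char buf[1024 * 1024];
    off_t off;

    memset(buf, 0xab, sizeof(buf));
    int fd = open("/mnt/test/file.bin", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || fallocate(fd, 0, 0, size)) {
        perror("setup");
        return 1;
    }
    /* Pass 1: write every other 1MB block, splitting the single
     * fallocate'd extent into many small ones. */
    for (off = 0; off < size; off += 2 * mb)
        if (pwrite(fd, buf, mb, off) != mb) {
            perror("pass 1");
            return 1;
        }
    /* Pass 2: fill in the holes between them. */
    for (off = mb; off < size; off += 2 * mb)
        if (pwrite(fd, buf, mb, off) != mb) {
            perror("pass 2");
            return 1;
        }
    close(fd);
    return 0;
}
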
> 
> [...]
>>> In terms of a COW filesystem, you need the space of a) + the space of b)
>> No, that is only required if the entire file needs to be written atomically.  There is some maximal size atomic write that BTRFS can perform as a single operation at a low level (I'm not sure if this is equal to the block size, or larger, but it doesn't matter much, either way, I'm talking the largest chunk of data it will write to a disk in a single operation before updating metadata to point to that new data).
> 
> To the best of my knowledge there is only a time limit: IIRC, a transaction is closed every 30 seconds. If you are able to fill the filesystem within this time window, you are in trouble.
Even with that, it's still possible to implement the method I outlined 
by defining such a limit and forcing a transaction commit when that 
limit is hit.  I'm also not entirely convinced that the transaction is 
the limiting factor here (I was under the impression that the 
transaction commit just updates the top-level metadata to point to the 
new tree of metadata).
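
(For what it's worth, userspace can already force a commit at any time; 
for example, for a btrfs mounted at /mnt (path assumed for 
illustration):

btrfs filesystem sync /mnt

so a kernel-side policy of 'commit once the in-flight CoW data passes 
some limit' is at least mechanically possible.)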

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Massive loss of disk space
  2017-08-03 17:15               ` Marat Khalili
@ 2017-08-03 17:25                 ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-03 17:25 UTC (permalink / raw)
  To: Marat Khalili, kreijack, linux-btrfs; +Cc: pwm, Hugo Mills

On 2017-08-03 13:15, Marat Khalili wrote:
> On August 3, 2017 7:01:06 PM GMT+03:00, Goffredo Baroncelli wrote:
>> The file is physically extended
>>
>> ghigo@venice:/tmp$ fallocate -l 1000 foo.txt
> 
> For clarity let's replace the fallocate above with:
> $ head -c 1000 </dev/urandom >foo.txt
> 
>> ghigo@venice:/tmp$ ls -l foo.txt
>> -rw-r--r-- 1 ghigo ghigo 1000 Aug  3 18:00 foo.txt
>> ghigo@venice:/tmp$ fallocate -o 500 -l 1000 foo.txt
>> ghigo@venice:/tmp$ ls -l foo.txt
>> -rw-r--r-- 1 ghigo ghigo 1500 Aug  3 18:00 foo.txt
>> ghigo@venice:/tmp$
> 
> According to Austin's explanation, foo.txt at this point somehow occupies 2000 bytes of space, because I can reflink it and then write another 1000 bytes of data into it without losing the 1000 bytes I already have or running out of drive space. (Or is that only true while there are open file handles?)
> 
OK, I think there may be some misunderstanding here.  By 'CoW unwritten 
extents', I mean that when we write to the extent, a CoW operation 
happens, instead of the data being written directly into the extent.  In 
this case, it has nothing to do with reflinking, and Goffredo is correct 
that if your filesystem is small enough, the second fallocate will fail 
there.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Massive loss of disk space
  2017-08-03 16:01             ` Goffredo Baroncelli
  2017-08-03 17:15               ` Marat Khalili
@ 2017-08-03 22:51               ` pwm
  1 sibling, 0 replies; 26+ messages in thread
From: pwm @ 2017-08-03 22:51 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: Marat Khalili, Austin S. Hemmelgarn, linux-btrfs, Hugo Mills

In 30 seconds I should be able to write about 200MB/s * 30s = 6GB.

Requiring 6GB of additional free space for the parity file to grow into 
is possible to live with on a 10TB disk.

It seems that for SnapRAID to have any chance of working correctly with 
parity on a BTRFS partition, it would need a min-free configuration 
parameter to make sure there is always enough free space for one parity 
file update.

But as it is right now, requiring that the disk never be filled past 
50%, just because fallocate() wants enough free space to rewrite 100% of 
the original file data, is obviously not a working solution.

Right now, it sounds like I should move all parity disks to a different 
file system to avoid the CoW issue. There doesn't seem to be any way to 
turn off CoW for an already existing file, and the parity data is 
already way past 50%, so I can't make a copy.
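
For anyone who does have the free space for a full copy, the usual 
recipe would be something like this (untested here, precisely because I 
don't have the space; the .nocow name is just for illustration):

# the C (nodatacow) attribute only takes effect on new, empty files
touch /mnt/snap_04/snapraid.parity.nocow
chattr +C /mnt/snap_04/snapraid.parity.nocow
# force a real data copy, not a reflink, so the new extents are nodatacow
cp --reflink=never /mnt/snap_04/snapraid.parity /mnt/snap_04/snapraid.parity.nocow
mv /mnt/snap_04/snapraid.parity.nocow /mnt/snap_04/snapraid.parity

Note that nodatacow also disables checksumming for that file, which may 
be acceptable for parity data that SnapRAID scrubs anyway.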

/Per W

On Thu, 3 Aug 2017, Goffredo Baroncelli wrote:

> On 2017-08-03 13:44, Marat Khalili wrote:
>> On 02/08/17 20:52, Goffredo Baroncelli wrote:
>>> consider the following scenario:
>>>
>>> a) create a 2GB file
>>> b) fallocate -o 1GB -l 2GB
>>> c) write from 1GB to 3GB
>>>
>>> after b), the expectation is that c) always succeeds [1]: i.e. there is enough space on the filesystem. Due to the CoW nature of BTRFS, you cannot rely on the already allocated space, because there could be a small time window where both the old and the new data exist on the disk.
>> Just curious. With the current implementation, in the following case:
>> a) create a 2GB file1 && create a 2GB file2
>> b) fallocate -o 1GB -l 2GB file1 && fallocate -o 1GB -l 2GB file2
>
> At this step you are trying to allocate 3GB + 3GB = 6GB, so you have exhausted the filesystem space.
>
>> c) write from 1GB to 3GB file1 && write from 1GB to 3GB file2
>> will (c) always succeed? I.e., does fallocate really allocate 2GB per file, or does it only allocate an additional 1GB and check free space for another 1GB? If it's only the latter, it is useless.
> The file is physically extended
>
> ghigo@venice:/tmp$ fallocate -l 1000 foo.txt
> ghigo@venice:/tmp$ ls -l foo.txt
> -rw-r--r-- 1 ghigo ghigo 1000 Aug  3 18:00 foo.txt
> ghigo@venice:/tmp$ fallocate -o 500 -l 1000 foo.txt
> ghigo@venice:/tmp$ ls -l foo.txt
> -rw-r--r-- 1 ghigo ghigo 1500 Aug  3 18:00 foo.txt
> ghigo@venice:/tmp$
>
>>
>> --
>>
>> With Best Regards,
>> Marat Khalili
>>
>
>
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Massive loss of disk space
  2017-08-03 17:23                   ` Austin S. Hemmelgarn
@ 2017-08-04 14:45                     ` Goffredo Baroncelli
  2017-08-04 15:05                       ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 26+ messages in thread
From: Goffredo Baroncelli @ 2017-08-04 14:45 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, pwm, Hugo Mills; +Cc: linux-btrfs

On 2017-08-03 19:23, Austin S. Hemmelgarn wrote:
> On 2017-08-03 12:37, Goffredo Baroncelli wrote:
>> On 2017-08-03 13:39, Austin S. Hemmelgarn wrote:
[...]

>>> Also, as I said below, _THIS WORKS ON ZFS_.  That immediately means that a CoW filesystem _does not_ need to behave the way BTRFS does.
>>
>> It seems that ZFS on Linux doesn't support fallocate
>>
>> see https://github.com/zfsonlinux/zfs/issues/326
>>
>> So I think that you are referring to posix_fallocate() and ZFS on Solaris, which I can't test, so I can't comment.
> Both Solaris and FreeBSD (I've got a FreeNAS system at work I checked on).

For fun I checked the FreeBSD source and the ZFS source. To me it seems that ZFS on FreeBSD doesn't implement posix_fallocate() (VOP_ALLOCATE in FreeBSD jargon), but instead relies on the FreeBSD default one.

	http://fxr.watson.org/fxr/source/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c#L7212

Following the chain of function pointers

	http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=10#L110

it seems that the freebsd vop_allocate() is implemented in vop_stdallocate()

	http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=excerpts#L912

which simply calls read() and write() on the range [offset...offset+len), which for a "conventional" filesystem ensures the block allocation. Of course it is an expensive solution.
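
In other words, the fallback just rewrites the requested range in 
place. A crude shell rendering of the same idea (block size arbitrary, 
path hypothetical):

# read every block of the existing range and write it straight back;
# on a conventional filesystem this forces allocation of each block
dd if=/path/to/file of=/path/to/file bs=128k conv=notrunc

Note that on a CoW filesystem the same trick allocates *new* space for 
every block rather than pinning the existing blocks in place.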

So I think (though I am not familiar with FreeBSD) that ZFS doesn't implement a real posix_fallocate(), but tries to simulate it. Of course this doesn't give any real guarantee on a CoW filesystem.


> 
> That said, I'm starting to wonder if just failing fallocate() calls that allocate space is actually the right thing to do here after all.  Aside from this, we don't reserve metadata space for checksums and similar things for the eventual writes (so it's possible to get -ENOSPC on a write to an fallocate'ed region anyway because of metadata exhaustion), and splitting extents can also cause failures, so it's perfectly possible for the fallocate assumption to not hold on BTRFS.

posix_fallocate() on BTRFS is not reliable for another reason: the syscall guarantees that a BG (block group) is allocated, but I think the allocated BG is available to all processes, so a parallel process may exhaust all the available space before the first process uses it.

My opinion is that BTRFS is not reliable when space is exhausted, so it needs to operate with a certain amount of disk space kept free. The size of this reserve should be O(2*size_of_biggest_write), and for an operation like fallocate this means O(2*length).
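
In practice this means a careful application on BTRFS should check the 
unallocated space itself, for example (path hypothetical):

sudo btrfs filesystem usage /path/to/fs | grep -i unallocated

and keep roughly twice its biggest pending write free, rather than 
trusting the return value of fallocate() alone.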

I think it is no coincidence that the fallocate implemented by ZFS on Linux only works in FALLOC_FL_PUNCH_HOLE mode.

https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zpl_file.c#L662
[...]
/*
 * The only flag combination which matches the behavior of zfs_space()
 * is FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE.  The FALLOC_FL_PUNCH_HOLE
 * flag was introduced in the 2.6.38 kernel.
 */
#if defined(HAVE_FILE_FALLOCATE) || defined(HAVE_INODE_FALLOCATE)
long
zpl_fallocate_common(struct inode *ip, int mode, loff_t offset, loff_t len)
{
	int error = -EOPNOTSUPP;

#if defined(FALLOC_FL_PUNCH_HOLE) && defined(FALLOC_FL_KEEP_SIZE)
	cred_t *cr = CRED();
	flock64_t bf;
	loff_t olen;
	fstrans_cookie_t cookie;

	if (mode != (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
		return (error);

[...]

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Massive loss of disk space
  2017-08-04 14:45                     ` Goffredo Baroncelli
@ 2017-08-04 15:05                       ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 26+ messages in thread
From: Austin S. Hemmelgarn @ 2017-08-04 15:05 UTC (permalink / raw)
  To: kreijack, pwm, Hugo Mills; +Cc: linux-btrfs

On 2017-08-04 10:45, Goffredo Baroncelli wrote:
> On 2017-08-03 19:23, Austin S. Hemmelgarn wrote:
>> On 2017-08-03 12:37, Goffredo Baroncelli wrote:
>>> On 2017-08-03 13:39, Austin S. Hemmelgarn wrote:
> [...]
> 
>>>> Also, as I said below, _THIS WORKS ON ZFS_.  That immediately means that a CoW filesystem _does not_ need to behave the way BTRFS does.
>>>
>>> It seems that ZFS on Linux doesn't support fallocate
>>>
>>> see https://github.com/zfsonlinux/zfs/issues/326
>>>
>>> So I think that you are referring to posix_fallocate() and ZFS on Solaris, which I can't test, so I can't comment.
>> Both Solaris and FreeBSD (I've got a FreeNAS system at work I checked on).
> 
> For fun I checked the FreeBSD source and the ZFS source. To me it seems that ZFS on FreeBSD doesn't implement posix_fallocate() (VOP_ALLOCATE in FreeBSD jargon), but instead relies on the FreeBSD default one.
> 
> 	http://fxr.watson.org/fxr/source/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c#L7212
> 
> Following the chain of function pointers
> 
> 	http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=10#L110
> 
> it seems that the freebsd vop_allocate() is implemented in vop_stdallocate()
> 
> 	http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=excerpts#L912
> 
> which simply calls read() and write() on the range [offset...offset+len), which for a "conventional" filesystem ensures the block allocation. Of course it is an expensive solution.
> 
> So I think (though I am not familiar with FreeBSD) that ZFS doesn't implement a real posix_fallocate(), but tries to simulate it. Of course this doesn't give any real guarantee on a CoW filesystem.
From a practical perspective though, posix_fallocate() doesn't matter, 
because almost everything uses the native fallocate call if at all 
possible.  As you mention, FreeBSD is emulating it, but that 'emulation' 
provides behavior that is close enough to what is required that it 
doesn't matter.  For that matter, posix_fallocate() is emulated on Linux 
too; see my reply below to your later comment about posix_fallocate() on 
BTRFS.

Internally, ZFS also keeps _some_ space reserved so it doesn't get wedged 
like BTRFS does when near full, and it doesn't do the whole data versus 
metadata segregation crap, so from a practical perspective, what 
FreeBSD's ZFS implementation does is sufficient because of ZFS's 
internal structure and handling of writes.
> 
> 
>>
>> That said, I'm starting to wonder if just failing fallocate() calls that allocate space is actually the right thing to do here after all.  Aside from this, we don't reserve metadata space for checksums and similar things for the eventual writes (so it's possible to get -ENOSPC on a write to an fallocate'ed region anyway because of metadata exhaustion), and splitting extents can also cause failures, so it's perfectly possible for the fallocate assumption to not hold on BTRFS.
> 
> posix_fallocate() on BTRFS is not reliable for another reason: the syscall guarantees that a BG (block group) is allocated, but I think the allocated BG is available to all processes, so a parallel process may exhaust all the available space before the first process uses it.
As mentioned above, posix_fallocate() is emulated in libc on Linux by 
calling the regular fallocate() if the FS supports it (which BTRFS 
does), or by writing out data like FreeBSD does in the kernel if the FS 
doesn't support fallocate().  IOW, posix_fallocate() has the exact same 
issues on BTRFS as Linux's fallocate() syscall does.
> 
> My opinion is that BTRFS is not reliable when space is exhausted, so it needs to operate with a certain amount of disk space kept free. The size of this reserve should be O(2*size_of_biggest_write), and for an operation like fallocate this means O(2*length).
Again, this arises from how we handle writes.  If we were to track 
blocks that have had fallocate called on them, use those blocks (for the 
first write at least) only for writes to the file that had fallocate 
called on it, and break reflinks on them when fallocate is called, then 
we could get away with just the size of the biggest write plus a little 
bit more space for _data_.  Even then, though, we would need space for 
metadata (which we don't appear to track right now).
> 
> I think it is no coincidence that the fallocate implemented by ZFS on Linux only works in FALLOC_FL_PUNCH_HOLE mode.
> 
> https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zpl_file.c#L662
> [...]
> /*
>   * The only flag combination which matches the behavior of zfs_space()
>   * is FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE.  The FALLOC_FL_PUNCH_HOLE
>   * flag was introduced in the 2.6.38 kernel.
>   */
> #if defined(HAVE_FILE_FALLOCATE) || defined(HAVE_INODE_FALLOCATE)
> long
> zpl_fallocate_common(struct inode *ip, int mode, loff_t offset, loff_t len)
> {
> 	int error = -EOPNOTSUPP;
> 
> #if defined(FALLOC_FL_PUNCH_HOLE) && defined(FALLOC_FL_KEEP_SIZE)
> 	cred_t *cr = CRED();
> 	flock64_t bf;
> 	loff_t olen;
> 	fstrans_cookie_t cookie;
> 
> 	if (mode != (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> 		return (error);
> 
> [...]
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread

Thread overview: 26+ messages
2017-08-01 11:43 Massive loss of disk space pwm
2017-08-01 12:20 ` Hugo Mills
2017-08-01 14:39   ` pwm
2017-08-01 14:47     ` Austin S. Hemmelgarn
2017-08-01 15:00       ` Austin S. Hemmelgarn
2017-08-01 15:24         ` pwm
2017-08-01 15:45           ` Austin S. Hemmelgarn
2017-08-01 16:50             ` pwm
2017-08-01 17:04               ` Austin S. Hemmelgarn
2017-08-02 17:52         ` Goffredo Baroncelli
2017-08-02 19:10           ` Austin S. Hemmelgarn
2017-08-02 21:05             ` Goffredo Baroncelli
2017-08-03 11:39               ` Austin S. Hemmelgarn
2017-08-03 16:37                 ` Goffredo Baroncelli
2017-08-03 17:23                   ` Austin S. Hemmelgarn
2017-08-04 14:45                     ` Goffredo Baroncelli
2017-08-04 15:05                       ` Austin S. Hemmelgarn
2017-08-03  3:48           ` Duncan
2017-08-03 11:44           ` Marat Khalili
2017-08-03 11:52             ` Austin S. Hemmelgarn
2017-08-03 16:01             ` Goffredo Baroncelli
2017-08-03 17:15               ` Marat Khalili
2017-08-03 17:25                 ` Austin S. Hemmelgarn
2017-08-03 22:51               ` pwm
2017-08-02  4:14       ` Duncan
2017-08-02 11:18         ` Austin S. Hemmelgarn
