* Why is the actual disk usage of btrfs considered unknowable?
@ 2014-12-07 15:15 Shriramana Sharma
  2014-12-07 15:33 ` Martin Steigerwald
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Shriramana Sharma @ 2014-12-07 15:15 UTC (permalink / raw)
  To: linux-btrfs

IIUC:

1) btrfs fi df already shows the alloc-ed space and the space used out of that.

2) Despite snapshots, CoW and compression, the tree knows how many
extents of data and metadata there are, and how many bytes on disk
these occupy, no matter what the total (uncompressed,
"unsnapshotted") size of all the directories and files on the disk is.

So this means that btrfs fi df actually shows the real on-disk usage.
In this case, why do we hear people saying it's not possible to know
the actual on-disk usage and when a btrfs-formatted disk (or
partition) will run out of space?

-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-07 15:15 Why is the actual disk usage of btrfs considered unknowable? Shriramana Sharma
@ 2014-12-07 15:33 ` Martin Steigerwald
  2014-12-07 15:37   ` Shriramana Sharma
                     ` (3 more replies)
  2014-12-08  4:59 ` Robert White
  2014-12-08  6:43 ` Zygo Blaxell
  2 siblings, 4 replies; 28+ messages in thread
From: Martin Steigerwald @ 2014-12-07 15:33 UTC (permalink / raw)
  To: Shriramana Sharma; +Cc: linux-btrfs

Hi Shriramana!

On Sunday 7 December 2014 20:45:59 Shriramana Sharma wrote:
> IIUC:
> 
> 1) btrfs fi df already shows the alloc-ed space and the space used out of
> that.
> 
> 2) Despite snapshots, CoW and compression, the tree knows how many
> extents of data and metadata there are, and how many bytes on disk
> these occupy, no matter what the total (uncompressed,
> "unsnapshotted") size of all the directories and files on the disk is.
> 
> So this means that btrfs fi df actually shows the real on-disk usage.
> In this case, why do we hear people saying it's not possible to know
> the actual on-disk usage and when a btrfs-formatted disk (or
> partition) will run out of space?

I have never read that the actual disk usage is unknown. But I have read
that what is actually free is unknown. And there are several reasons for that:

1) On a compressed filesystem you cannot know, but only estimate the 
compression ratio for future data.

2) On a compressed filesystem you can choose to have parts of it uncompressed
by file / directory attributes, I think. BTRFS can't know how much of the
future data you are going to store compressed or uncompressed.

3) From what I gathered it is planned to allow different raid / redundancy
levels for different subvolumes. BTRFS can't know beforehand where applications
request to save future data, i.e. in which subvolume.

4) Even on a conventional filesystem the free space is an estimate, because it
cannot predict the activity of other processes writing to the filesystem. You
may have 10 GiB free at some point, but if another process is writing another
5 GiB at the same time as your process, your process will see less and less
of the estimated 10 GiB free, and if it wanted to write 10 GiB it will not
be able to.

What might be possible, though still subject to the fourth point above,
would be a query: how much free space do you have *right now*, on this
directory path, if I write with standard settings?

But the only guarantee you can ever get is to pre-allocate your files with
fallocate. When the fallocate call succeeds, you get a guarantee that you can
write the allocated amount of space into the file. Whether BTRFS can hold to
that guarantee in every case depends on how bug-free its free space handling
is in that regard.
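
For illustration, a minimal sketch of such a pre-allocation in C, using
Linux's fallocate(2) (error handling trimmed; the file name is made up):

#define _GNU_SOURCE
#include <fcntl.h>      /* open, fallocate */
#include <stdio.h>      /* perror */
#include <unistd.h>     /* close */

int reserve(const char *path, off_t bytes)
{
    int fd = open(path, O_CREAT | O_WRONLY, 0644);
    if (fd < 0) {
        perror("open");
        return -1;
    }
    /* If this returns 0, the filesystem has reserved 'bytes' of space
     * and later writes into that range should not fail with ENOSPC. */
    if (fallocate(fd, 0, 0, bytes) != 0) {
        perror("fallocate");
        close(fd);
        return -1;
    }
    return close(fd);
}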

And in case you do not need all the fallocated space, other processes may not
be able to write data anymore even though there is free space inside your
fallocated files.

So you either overprovision or underprovision… :)

That written: filling up a filesystem to 100% will considerably limit the
performance of any filesystem known to me and asks for further trouble. So
better keep at least 10-20% of the space free, except maybe for very large
filesystems. On the other hand, I saw recommendations on the XFS mailing list
that with heavy random I/O on lots of files it is even better to leave 40-50%
free if you want to delay the slowing down of the filesystem and want to have
a well-structured filesystem after 10 years of heavy usage. BTRFS can rebalance
things, but I have yet to see that this rebalancing really optimizes things.
It may not, or at least not in all cases.

So welcome to the challenges of filesystem development, especially for a
copy-on-write filesystem with the feature set BTRFS provides.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-07 15:33 ` Martin Steigerwald
@ 2014-12-07 15:37   ` Shriramana Sharma
  2014-12-07 15:40   ` Martin Steigerwald
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 28+ messages in thread
From: Shriramana Sharma @ 2014-12-07 15:37 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: linux-btrfs

On Sun, Dec 7, 2014 at 9:03 PM, Martin Steigerwald <Martin@lichtvoll.de> wrote:
>
> I have never read that the actual disk usage is unknown. But I have read
> that what is actually free is unknown. And there are several reasons for that:

That is totally understood. But when your allocated space is nearing
90% of your disk capacity, and the used space is around 80% or so of
the allocated space, I guess it's reasonable to expect people to add
a drive to the pool, which btrfs makes so easy. Given this, why do
people complain about btrfs not being predictable when it comes to
ENOSPC?

Even with any other FS, I don't think I'd like my files to occupy more
than 90% or so, since beyond that even defrag would probably not work.

-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-07 15:33 ` Martin Steigerwald
  2014-12-07 15:37   ` Shriramana Sharma
@ 2014-12-07 15:40   ` Martin Steigerwald
  2014-12-08  5:32     ` Robert White
  2014-12-07 18:20   ` ashford
  2014-12-07 19:19   ` Goffredo Baroncelli
  3 siblings, 1 reply; 28+ messages in thread
From: Martin Steigerwald @ 2014-12-07 15:40 UTC (permalink / raw)
  To: Shriramana Sharma; +Cc: linux-btrfs

On Sunday 7 December 2014 16:33:37 Martin Steigerwald wrote:
> What might be possible, though still subject to the fourth point above,
> would be a query: how much free space do you have *right now*, on this
> directory path, if I write with standard settings?
> 
> But the only guarantee you can ever get is to pre-allocate your files with
> fallocate. When the fallocate call succeeds, you get a guarantee that you
> can write the allocated amount of space into the file. Whether BTRFS can
> hold to that guarantee in every case depends on how bug-free its free
> space handling is in that regard.

Well, what I bet would be possible is a kind of system call like this:

I need to write 5 GB of data in 100 files to /opt/mynewshinysoftware; can I
do it, *and* give me a guarantee that I can.

So a more flexible fallocate-like approach, since fallocate just allocates one
file and you would need to run it for every file you intend to create. But the
challenge would be to estimate the metadata allocation accurately beforehand.

Or have tar --fallocate -xf, which for all files in the archive would first
call fallocate and only if that succeeded actually write them. But due to the
nature of tar archives, with their content listing spread across the whole
archive, this means tar may have to read the archive twice, so ZIP archives
might be better suited for that.
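
A rough sketch of that two-pass idea (hypothetical; a real tar integration
would first have to collect the full file list and sizes from the archive):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

struct entry { const char *path; off_t size; };

/* Pass 1: create and reserve every file up front. Only if all
 * reservations succeed would pass 2 write the actual contents. */
int reserve_all(const struct entry *e, int n)
{
    for (int i = 0; i < n; i++) {
        int fd = open(e[i].path, O_CREAT | O_WRONLY, 0644);
        if (fd < 0)
            return -1;
        /* sizes are assumed > 0; fallocate rejects a zero length */
        if (fallocate(fd, 0, 0, e[i].size) != 0) {
            close(fd);
            return -1;      /* abort before writing any data */
        }
        close(fd);
    }
    return 0;
}

As noted above, metadata is the hard part; this reserves data space only.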

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-07 15:33 ` Martin Steigerwald
  2014-12-07 15:37   ` Shriramana Sharma
  2014-12-07 15:40   ` Martin Steigerwald
@ 2014-12-07 18:20   ` ashford
  2014-12-07 18:34     ` Hugo Mills
  2014-12-07 18:38     ` Martin Steigerwald
  2014-12-07 19:19   ` Goffredo Baroncelli
  3 siblings, 2 replies; 28+ messages in thread
From: ashford @ 2014-12-07 18:20 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Shriramana Sharma, linux-btrfs

Martin,

> I have read
> that what is actually free is unknown. And there are several reasons for that:
>
> 1) On a compressed filesystem you cannot know, but only estimate the
> compression ratio for future data.

It is NOT the job of BTRFS, or ANY file-system, to try to predict the
future.  The future is unknown.  Don't try to account for it.  When asked
for the status (i.e. 'df'), it should return the current status.

> 2) On a compressed filesystem you can choose to have parts of it
> uncompressed by file / directory attributes, I think. BTRFS can't
> know how much of the
> future data you are going to store compressed or uncompressed.

Same as above.

If the user sees 18GB free space and has 20GB of data to write, it is up
to them to determine whether or not compression will allow it to fit.

> 3) From what I gathered it is planned to allow different raid /
> redundancy levels for different subvolumes. BTRFS can't know
> beforehand where applications request to save future data, i.e.
> in which subvolume.

Same as above.

If a user will be requesting to use a specific subvolume, it is up to them
to verify that adequate free space exists there, or handle the exception.

> 4) Even on a conventional filesystem the free space is an estimate,
> because it cannot predict the activity of other processes writing
> to the filesystem. You may have 10 GiB free at some point, but if
> another process is writing another 5 GiB at the same time as your
> process, your process will see less and less of the estimated
> 10 GiB free, and if it wanted to write 10 GiB it will not be able to.

Same as above.

This  is normal for multi-user systems.  It happens.  There's no way
around it, and other file-systems don't try.

> What might be possible, though still subject to the fourth
> point above, would be a query: how much free space do you have
> *right now*, on this directory path, if I write with standard
> settings?

That's what the 'df' command is supposed to return, and what it DOES
return on other file-systems, including file-systems that support
compression.

> But the only guarantee you can ever get is to pre-allocate your files
> with fallocate. When the fallocate call succeeds, you get a guarantee
> that you can write the allocated amount of space into the file.
> Whether BTRFS can hold to that guarantee in every case depends on
> how bug-free its free space handling is in that regard.

This is the same as in all other file-systems.

> And in case you do not need all the fallocated space, other processes
> may not be able to write data anymore even though there is free
> space inside your fallocated files.

Again, this is the same as in other file-systems.

> That written: filling up a filesystem to 100% will considerably limit
> the performance of any filesystem known to me and asks for further
> trouble. So better keep at least 10-20% of the space free, except
> maybe for very large filesystems

Once again, this is the normal recommendation for most (all?)
file-systems.  As they fill, they get less efficient.  The impact is
minimal for a while, then the curve hits a knee and performance drops. 
Some file-systems have a setting to only allow the ROOT user to exceed a
specified percentage of file-system use.

Peter Ashford


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-07 18:20   ` ashford
@ 2014-12-07 18:34     ` Hugo Mills
  2014-12-07 18:48       ` Martin Steigerwald
                         ` (2 more replies)
  2014-12-07 18:38     ` Martin Steigerwald
  1 sibling, 3 replies; 28+ messages in thread
From: Hugo Mills @ 2014-12-07 18:34 UTC (permalink / raw)
  To: ashford; +Cc: Martin Steigerwald, Shriramana Sharma, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1278 bytes --]

On Sun, Dec 07, 2014 at 10:20:27AM -0800, ashford@whisperpc.com wrote:
[snip]
> > 3) From what I gathered it is planned to allow different raid /
> > redundancy levels for different subvolumes. BTRFS can't know
> > beforehand where applications request to save future data, i.e.
> > in which subvolume.
> 
> Same as above.
> 
> If a user will be requesting to use a specific subvolume, it is up to them
> to verify that adequate free space exists there, or handle the exception.

   OK, so let's say I've got a filesystem with 100 GiB of unallocated
space. I have two subvolumes, one configured for RAID-1 and one
configured for single storage.

   What number should be shown in the free output of df?

   100 GiB? But I can only write 50 GiB to the RAID-1 subvolume before
it runs out of space.

   50 GiB? I can get twice that much on the single subvolume.

   *Any* value shown here is going to be inaccurate, and whatever way
round we show it, someone will complain.

   Hugo.

-- 
Hugo Mills             | My doctor tells me that I have a malformed
hugo@... carfax.org.uk | public-duty gland, and a natural deficiency in moral
http://carfax.org.uk/  | fibre.
PGP: 65E74AC0          |                                     Zaphod Beeblebrox

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-07 18:20   ` ashford
  2014-12-07 18:34     ` Hugo Mills
@ 2014-12-07 18:38     ` Martin Steigerwald
  2014-12-07 19:44       ` ashford
  1 sibling, 1 reply; 28+ messages in thread
From: Martin Steigerwald @ 2014-12-07 18:38 UTC (permalink / raw)
  To: ashford; +Cc: Shriramana Sharma, linux-btrfs

On Sunday 7 December 2014 10:20:27 ashford@whisperpc.com wrote:
> Martin,
> 
> > I have read
> > that what is actually free is unknown. And there are several reasons for that:
> > 
> > 1) On a compressed filesystem you cannot know, but only estimate the
> > compression ratio for future data.
> 
> It is NOT the job of BTRFS, or ANY file-system, to try to predict the
> future.  The future is unknown.  Don't try to account for it.  When asked
> for the status (i.e. 'df'), it should return the current status.
> 
> > 2) On a compressed filesystem you can choose to have parts of it
> > uncompressed by file / directory attributes, I think. BTRFS can't
> > know how much of the
> > future data you are going to store compressed or uncompressed.
> 
> Same as above.

What is the point you are trying to make?

I just described the problems there are with trying to predict available
free space, with BTRFS as an example. Some points apply to all
filesystems, some do not, so what is the point you are trying to make?

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-07 18:34     ` Hugo Mills
@ 2014-12-07 18:48       ` Martin Steigerwald
  2014-12-07 19:39       ` ashford
  2014-12-08  5:17       ` Chris Murphy
  2 siblings, 0 replies; 28+ messages in thread
From: Martin Steigerwald @ 2014-12-07 18:48 UTC (permalink / raw)
  To: Hugo Mills, ashford, Shriramana Sharma, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3087 bytes --]

On Sunday 7 December 2014 18:34:44 Hugo Mills wrote:
> On Sun, Dec 07, 2014 at 10:20:27AM -0800, ashford@whisperpc.com wrote:
> [snip]
> 
> > > 3) From what I gathered it is planned to allow different raid /
> > > redundancy levels for different subvolumes. BTRFS can't know
> > > beforehand where applications request to save future data, i.e.
> > > in which subvolume.
> > 
> > Same as above.
> > 
> > If a user will be requesting to use a specific subvolume, it is up to them
> > to verify that adequate free space exists there, or handle the exception.
> 
>    OK, so let's say I've got a filesystem with 100 GiB of unallocated
> space. I have two subvolumes, one configured for RAID-1 and one
> configured for single storage.
> 
>    What number should be shown in the free output of df?
> 
>    100 GiB? But I can only write 50 GiB to the RAID-1 subvolume before
> it runs out of space.
> 
>    50 GiB? I can get twice that much on the single subvolume.
> 
>    *Any* value shown here is going to be inaccurate, and whatever way
> round we show it, someone will complain.

That's why I pointed out fallocate. If it succeeds, I would expect that even
BTRFS, with its special free space challenges, guarantees the space is there.

A getfreespacebypath() syscall may yield quite accurate figures as well, because
then BTRFS would know which subvolume the application wants to write to, but as
it cannot predict the future write behavior of all processes, it cannot
guarantee anything.

So for any guarantee, as far as I know, the only thing you can do is fallocate.

I never liked the pre-installation check for roughly 5 GiB or so of free space
that succeeds or fails on that alone. But on the other hand, running 90% of the
way through an installation and then failing due to not enough free space is
also not nice. Similar to copying 4 GiB of a 4.3 GB DVD image to a FAT32
filesystem before the copy fails.

But for FAT32 it is much easier to know it can't write a file larger than 4 GiB
than it is for BTRFS or any other filesystem to know whether installing a set of
files to a set of directories with a certain total size is going to work out.

The only thing that could improve this would be some kind of more flexible space
allocation. Or… creating every directory and fallocating each file with the
exact size before doing the actual copy. And heck, somehow I like this idea.
It could help to avoid actions that just do not make sense. An rsync could
abort early if the total free space would not be enough. But for rsync this
again doesn't work, as rsync works incrementally since version 3. On the other
hand, if btrfs send could store size requirements in the send data before the
receive side starts working, then the receive side could preallocate, but…
that would depend on how easy it would be for the send side to get the size
difference between two snapshots.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-07 15:33 ` Martin Steigerwald
                     ` (2 preceding siblings ...)
  2014-12-07 18:20   ` ashford
@ 2014-12-07 19:19   ` Goffredo Baroncelli
  2014-12-07 20:32     ` ashford
  3 siblings, 1 reply; 28+ messages in thread
From: Goffredo Baroncelli @ 2014-12-07 19:19 UTC (permalink / raw)
  To: Shriramana Sharma; +Cc: Martin Steigerwald, linux-btrfs

On 12/07/2014 04:33 PM, Martin Steigerwald wrote:
> Hi Shriramana!
> 
> On Sunday 7 December 2014 20:45:59 Shriramana Sharma wrote:
>>> IIUC:
>>> 
>>> 1) btrfs fi df already shows the alloc-ed space and the space 
>>> used out of that.
>>> 
>>> 2) Despite snapshots, CoW and compression, the tree knows how 
>>> many extents of data and metadata there are, and how many bytes 
>>> on disk these occupy, no matter what the total (uncompressed,
>>> "unsnapshotted") size of all the directories and files on the
>>> disk is.
>>> 
>>> So this means that btrfs fi df actually shows the real on-disk 
>>> usage. In this case, why do we hear people saying it's not 
>>> possible to know the actual on-disk usage and when a 
>>> btrfs-formatted disk (or partition) will run out of space?
> I have never read that the actual disk usage is unknown. But I have
> read that what is actually free is unknown. And there are several
> reasons for that:
> 
> 1) On a compressed filesystem you cannot know, but only estimate the 
> compression ratio for future data.
> 
> 2) On a compressed filesystem you can choose to have parts of it 
> uncompressed by file / directory attributes, I think. BTRFS can't 
> know how much of the future data you are going to store compressed
> or uncompressed.
> 
> 3) From what I gathered it is planned to allow different raid / 
> redundancy levels for different subvolumes. BTRFS can't know 
> beforehand where applications request to save future data, i.e. in 
> which subvolume.


3.1) Even in the case of a single-disk filesystem, data and metadata
have different profiles: the data chunks don't have any redundancy,
so 64 KB of data consume 64 KB of disk space. The metadata chunks
are usually stored as DUP, so 64 KB of metadata consume 128 KB on disk.
Moreover, you have to consider that small files are stored in metadata
chunks. This means that for a big file the disk space consumed is equal
to the data size, but for a small file it is doubled.

Going back to your request: to be clearer, I use the following terms:
1- disk space used: the space used on the disk
2- size of data: the size of the data stored on the disks
3- disk free space: the unused space of the disk
4- free space: the size of data that the system is able to contain

Values 1, 2 and 3 are known. What is unknown is point 4. In
the past I posted some patches which try to estimate point 4 as:

                                 size_of_data 
free_space = disk_free_space * -----------------
                                disk_space_used

This estimation assumes that the ratio size_of_data/disk_space_used
is constant. But for the points above, this assumption may be wrong.

In conclusion, the disk usage is well known; what is unknown is
the space that is available to the user (who is not interested in
all the details inside a filesystem). The best that can be done
is an estimation like the one above.
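
As a plain C sketch of that estimate (illustrative names only, not an
actual btrfs API; double avoids 64-bit overflow in the multiplication):

unsigned long long
estimate_free_space(unsigned long long disk_free_space,
                    unsigned long long size_of_data,
                    unsigned long long disk_space_used)
{
    if (disk_space_used == 0)
        return disk_free_space;   /* nothing to extrapolate from */
    /* assumes size_of_data/disk_space_used stays constant */
    return (unsigned long long)((double)disk_free_space
                                * (double)size_of_data
                                / (double)disk_space_used);
}
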
BR
Goffredo

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-07 18:34     ` Hugo Mills
  2014-12-07 18:48       ` Martin Steigerwald
@ 2014-12-07 19:39       ` ashford
  2014-12-08  5:17       ` Chris Murphy
  2 siblings, 0 replies; 28+ messages in thread
From: ashford @ 2014-12-07 19:39 UTC (permalink / raw)
  To: Hugo Mills, ashford, Martin Steigerwald, Shriramana Sharma, linux-btrfs

> On Sun, Dec 07, 2014 at 10:20:27AM -0800, ashford@whisperpc.com wrote:
> [snip]
>> > 3) From what I gathered it is planned to allow different raid /
>> > redundancy levels for different subvolumes. BTRFS can't know
>> > beforehand where applications request to save future data, i.e.
>> > in which subvolume.
>>
>> Same as above.
>>
>> If a user will be requesting to use a specific subvolume, it is up to
>> them
>> to verify that adequate free space exists there, or handle the
>> exception.
>
>    OK, so let's say I've got a filesystem with 100 GiB of unallocated
> space. I have two subvolumes, one configured for RAID-1 and one
> configured for single storage.
>
>    What number should be shown in the free output of df?
>
>    100 GiB? But I can only write 50 GiB to the RAID-1 subvolume before
> it runs out of space.
>
>    50 GiB? I can get twice that much on the single subvolume.
>
>    *Any* value shown here is going to be inaccurate, and whatever way
> round we show it, someone will complain.

As an example, let's assume that the file-system is mounted as /data, with
a non-mirrored subvolume of /data/1 and a mirrored subvolume of /data/2. 
A df should show 100GiB free in both /data and /data/1, and 50GiB free
in /data/2.

Peter Ashford


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-07 18:38     ` Martin Steigerwald
@ 2014-12-07 19:44       ` ashford
  0 siblings, 0 replies; 28+ messages in thread
From: ashford @ 2014-12-07 19:44 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: ashford, Shriramana Sharma, linux-btrfs

> On Sunday 7 December 2014 10:20:27 ashford@whisperpc.com wrote:
>> Martin,
>>
>> > I have read
>> > that what is actually free is unknown. And there are several reasons for that:
>> >
>> > 1) On a compressed filesystem you cannot know, but only estimate the
>> > compression ratio for future data.
>>
>> It is NOT the job of BTRFS, or ANY file-system, to try to predict the
>> future.  The future is unknown.  Don't try to account for it.  When
>> asked
>> for the status (i.e. 'df'), it should return the current status.
>>
>> > 2) On a compressed filesystem you can choose to have parts of it
>> > uncompressed by file / directory attributes, I think. BTRFS can't
>> > know how much of the
>> > future data you are going to store compressed or uncompressed.
>>
>> Same as above.
>
> What is the point you are trying to make?
>
> I just described the problems there are with trying to predict
> available free space, with BTRFS as an example. Some points apply to all
> filesystems, some do not, so what is the point you are trying to make?

My point is that you don't try to predict, as that's a guaranteed path to
failure.  You deliver what you know.  This is what every other file-system
does.  There's no reason to do anything else.

Peter Ashford


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-07 19:19   ` Goffredo Baroncelli
@ 2014-12-07 20:32     ` ashford
  2014-12-07 23:01       ` Goffredo Baroncelli
  2014-12-08  8:18       ` Chris Murphy
  0 siblings, 2 replies; 28+ messages in thread
From: ashford @ 2014-12-07 20:32 UTC (permalink / raw)
  To: kreijack; +Cc: Shriramana Sharma, Martin Steigerwald, linux-btrfs

> 3.1) Even in the case of a single-disk filesystem, data and metadata
> have different profiles: the data chunks don't have any redundancy,
> so 64 KB of data consume 64 KB of disk space. The metadata chunks
> are usually stored as DUP, so 64 KB of metadata consume 128 KB on disk.
> Moreover, you have to consider that small files are stored in metadata
> chunks. This means that for a big file the disk space consumed is equal
> to the data size, but for a small file it is doubled.

As there's no way to predict what the user will be doing, I see no reason
to do anything except return the actual amount of free space.

> Going back to your request: to be clearer, I use the following terms:
> 1- disk space used: the space used on the disk
> 2- size of data: the size of the data stored on the disks
> 3- disk free space: the unused space of the disk
> 4- free space: the size of data that the system is able to contain
>
> Values 1, 2 and 3 are known. What is unknown is point 4. In
> the past I posted some patches which try to estimate point 4 as:
>
>                                  size_of_data
> free_space = disk_free_space * -----------------
>                                 disk_space_used
>
> This estimation assumes that the ratio size_of_data/disk_space_used
> is constant. But for the points above, this assumption may be wrong.

While I expect that this is the best simple prediction, it's still a
prediction, with all the possible problems that a prediction entails.  My
contention is that predictions should be avoided whenever possible.

> In conclusion, the disk usage is well known; what is unknown is
> the space that is available to the user (who is not interested in
> all the details inside a filesystem). The best that can be done
> is an estimation like the one above.

I disagree.  My experiences with other file-systems, including ZFS, show
that the most common solution is to just deliver to the user the actual
amount of unused disk space.  Anything else changes this known value into
a guess or prediction.

Peter Ashford


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-07 20:32     ` ashford
@ 2014-12-07 23:01       ` Goffredo Baroncelli
  2014-12-08  0:12         ` ashford
  2014-12-08  8:18       ` Chris Murphy
  1 sibling, 1 reply; 28+ messages in thread
From: Goffredo Baroncelli @ 2014-12-07 23:01 UTC (permalink / raw)
  To: ashford; +Cc: Shriramana Sharma, Martin Steigerwald, linux-btrfs

On 12/07/2014 09:32 PM, ashford@whisperpc.com wrote:
>> In conclusion, the disk usage is well known; what is unknown is
>> the space that is available to the user (who is not interested in
>> all the details inside a filesystem). The best that can be done
>> is an estimation like the one above.

> I disagree.  My experiences with other file-systems, including ZFS, show
> that the most common solution is to just deliver to the user the actual
> amount of unused disk space.  Anything else changes this known value into
> a guess or prediction.

So in case you have a raid1 filesystem on two disks and each disk has 300GB
free, which free space do you expect: 300GB or 600GB, and why?


> Peter Ashford

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-07 23:01       ` Goffredo Baroncelli
@ 2014-12-08  0:12         ` ashford
  2014-12-08  2:42           ` Qu Wenruo
  2014-12-08 14:34           ` Goffredo Baroncelli
  0 siblings, 2 replies; 28+ messages in thread
From: ashford @ 2014-12-08  0:12 UTC (permalink / raw)
  To: kreijack; +Cc: ashford, Shriramana Sharma, Martin Steigerwald, linux-btrfs

Goffredo,

> So in case you have a raid1 filesystem on two disks and each disk has 300GB
> free, which free space do you expect: 300GB or 600GB, and why?

You should see 300GB free.  That's what you'll see with RAID-1 with a
hardware RAID controller, and with MD RAID.  Why would you expect to see
anything else with BTRFS RAID?

Peter Ashford


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-08  0:12         ` ashford
@ 2014-12-08  2:42           ` Qu Wenruo
  2014-12-08  8:12             ` ashford
  2014-12-08 14:34           ` Goffredo Baroncelli
  1 sibling, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2014-12-08  2:42 UTC (permalink / raw)
  To: ashford, kreijack; +Cc: Shriramana Sharma, Martin Steigerwald, linux-btrfs


-------- Original Message --------
Subject: Re: Why is the actual disk usage of btrfs considered unknowable?
From: <ashford@whisperpc.com>
To: <kreijack@inwind.it>
Date: 2014-12-08 08:12
> Goffredo,
>
>> So in case you have a raid1 filesystem on two disks and each disk has 300GB
>> free, which free space do you expect: 300GB or 600GB, and why?
> You should see 300GB free.  That's what you'll see with RAID-1 with a
> hardware RAID controller, and with MD RAID.  Why would you expect to see
> anything else with BTRFS RAID?
>
> Peter Ashford
Yeah, you pointed out the real problem here:

[DIFFERENT RESULT FROM DIFFERENT VIEW]
Seen from the *PURE ON-DISK* usage view, it is still 600G, no matter what
level of RAID.
Seen from the *BLOCK LEVEL RAID1* usage view, it is 300G. If a fs (not
btrfs) is built on block-level RAID1, then the *FILESYSTEM* usage will
also be 300G.

[BTRFS DOES NOT BELONG TO ANY TYPE]
But btrfs is neither pure block-level management (that would be MD or
HW RAID or LVM) nor a traditional filesystem!!

So the root of the problem is that btrfs mixes the roles of block-level
management and filesystem-level management, which makes everything hard
to understand. You can't treat btrfs raid1 as a complete block-level
raid1, due to its flexibility in having different metadata/data profiles.

If the vanilla df command shows filesystem-level free space, then btrfs
won't give an accurate one.

[ONLY PREDICTABLE CASE]
For the 300Gx2 case with btrfs, you can only consider it 300G of free
space if you can ensure that there was, is, and will be only RAID1
data/metadata stored on it (also ignoring small space usage from CoW).

[RELIABLE DATA IS ON-DISK USAGE]
Only pure on-disk level usage is *a little* reliable. There is still the
problem of unbalanced metadata/data chunk allocation (e.g. all space is
allocated for data, leaving no space for metadata CoW writes).

[FEATURE SIMILAR CASE]
The only case where I can see a similar problem is a mirrored thin LV
(not implemented yet) and a normal thin LV competing for a thin pool.

Although not implemented, I think even if it were, admins may not
complain so much, since LVM doesn't report free space, only used space,
in the thin pool case.

Thanks,
Qu



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-07 15:15 Why is the actual disk usage of btrfs considered unknowable? Shriramana Sharma
  2014-12-07 15:33 ` Martin Steigerwald
@ 2014-12-08  4:59 ` Robert White
  2014-12-08  6:43 ` Zygo Blaxell
  2 siblings, 0 replies; 28+ messages in thread
From: Robert White @ 2014-12-08  4:59 UTC (permalink / raw)
  To: Shriramana Sharma, linux-btrfs

On 12/07/2014 07:15 AM, Shriramana Sharma wrote:
> IIUC:
>
> 1) btrfs fi df already shows the alloc-ed space and the space used out of that.
>
> 2) Despite snapshots, CoW and compression, the tree knows how many
> extents of data and metadata there are, and how many bytes on disk
> these occupy, no matter what the total (uncompressed,
> "unsnapshotted") size of all the directories and files on the disk is.
>

I tried to answer this last time. So let's do a thought experiment...

You have an essentially full filesystem. Then the last two extents are 
allocated: one a 1GB extent for data, the other a 256MB extent for 
metadata.

How much space on the disk is "free"? Is it 1GB for data, is it 256MB 
for metadata, is it 1280MB for the combination of data and metadata, or 
is it _zero_ for the complete absence of blocks that can be allocated 
into extents?

How about if I allocate 1GB of data space and there is 512MB of 
unallocated space, which is enough room for two more metadata extents 
but not enough room for another data extent. Is the drive "full" when 
you fill that last 1GB? After all, you cannot write more data to the 
disk, but you can write more metadata.

If I start deleting files, and thereby create gaps in the previously 
allocated extents, are those gaps "free"? They are purposed but 
available for their respective uses.

Subtracting blocks allocated from blocks on media doesn't give you the 
"real" answer to what is or isn't "free".

If there are a leftover two dozen sectors that won't fit in _any_ kind of 
extent, are those sectors "free" or are they just leftovers?

In real property terms: if I hold an easement on your driveway and you 
want to expand your house, how much of your property can be used for 
the expansion of your house? My rights to your driveway don't count 
against you for meeting the "undeveloped land" calculation of your local 
zoning board, but you can't build any house-bits on that driveway since 
I hold a right to use it, so it does count against the available square 
feet that you can design over.

"Free space" isn't the simple proposition you imagine because "free for 
what purpose" and "free in what sense" both have to be answered.

So the system estimates, and it does so in different ways for different 
purposes.

If you have a means in mind to resolve these conflicts we'd love to see 
the rationale and even the code...

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-07 18:34     ` Hugo Mills
  2014-12-07 18:48       ` Martin Steigerwald
  2014-12-07 19:39       ` ashford
@ 2014-12-08  5:17       ` Chris Murphy
  2 siblings, 0 replies; 28+ messages in thread
From: Chris Murphy @ 2014-12-08  5:17 UTC (permalink / raw)
  To: Hugo Mills, ashford, Martin Steigerwald, Shriramana Sharma, linux-btrfs

On Sun, Dec 7, 2014 at 11:34 AM, Hugo Mills <hugo@carfax.org.uk> wrote:

>    *Any* value shown here is going to be inaccurate, and whatever way
> round we show it, someone will complain.

Yeah, I'd suggest that for the regular df command, when multiple-device
volumes exist, they're shown with ?? for the Avail and Use% columns. Maybe
one day df can show more columns: AvailS/R0, AvailR1, AvailR5,
AvailR6.

Today:

# btrfs fi show
Label: 'test'  uuid: fb7df1f2-480d-4426-84f1-8aed197700e4
Total devices 2 FS bytes used 1.38GiB
devid    1 size 5.00GiB used 2.28GiB path /dev/loop0
devid    2 size 5.00GiB used 2.28GiB path /dev/loop1

# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0      5.0G  1.4G  4.0G  27% /mnt


# btrfs fi df /mnt
Data, RAID1: total=2.00GiB, used=1.38GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=256.00MiB, used=1.61MiB
GlobalReserve, single: total=16.00MiB, used=0.00B

Perhaps something like this instead:

# btrfs fi df /mnt
Available Space Range: 5.4GiB - 7.2 GiB
RAID0
Available: ~5.1GiB
Data: allocated=1.0GiB, used=0.1GiB
System: allocated=32MiB, used=16KiB
Metadata:allocated=256MiB, used=KiB
RAID1
Available: ~2.7GiB
Data: allocated=2.00GiB, used=1.38GiB
System: allocated=32.00MiB, used=16.00KiB
Metadata: allocated=256.00MiB, used=1.61MiB
SINGLE
Available:
GlobalReserve: allocated=16.00MiB, used=0.00B

1. There's 10.0GiB of total space for two devices in the volume.
2. There's 1.3GiB of raid0 chunks allocated and 4.6GiB of raid1 chunks
allocated, so 5.9GiB is allocated in total.
3. To show available space for any particular profile, we have to
subtract out all the allocated chunks, and then add back in the free
space only for that profile's chunks.
4. The resulting value needs to be multiplied by that profile's
"replication factor" e.g. 1 for single and raid0, 0.5 for raid1, and
for raid 5/6 it depends not just on the number of devices, but the mix
of chunks striped across n disks since Btrfs allows chunks to be
striped across any number of disks so long as it meets the minimum,
i.e. there can be a dozen raid5 chunks striped across 3 drives, and
another 1/2 dozen chunks striped across 4 drives. Only a balance would
redistribute such that each chunk has the same number of stripes.

For above it's
raid0 available:
[10.00 total - 5.9 allocated + the free space in only raid0 chunks] * 1

raid1 available:
[10.00 total - 5.9 allocated + the free space in only raid1 chunks] * 0.5
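
As a sketch of that arithmetic only (illustrative, not a btrfs API;
'factor' is 1.0 for single/raid0, 0.5 for raid1, and for raid5/6 it
would depend on the stripe layout):

double profile_available(double total, double allocated,
                         double free_in_profile_chunks, double factor)
{
    return (total - allocated + free_in_profile_chunks) * factor;
}

/* e.g. raid1 above: profile_available(10.0, 5.9, raid1_chunk_free, 0.5) */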

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-07 15:40   ` Martin Steigerwald
@ 2014-12-08  5:32     ` Robert White
  2014-12-08  6:20       ` ashford
  2014-12-08 14:47       ` Martin Steigerwald
  0 siblings, 2 replies; 28+ messages in thread
From: Robert White @ 2014-12-08  5:32 UTC (permalink / raw)
  To: Martin Steigerwald, Shriramana Sharma; +Cc: linux-btrfs

On 12/07/2014 07:40 AM, Martin Steigerwald wrote:
> Well, what I bet would be possible is a kind of system call like this:
>
> I need to write 5 GB of data in 100 files to /opt/mynewshinysoftware; can I
> do it, *and* give me a guarantee that I can.
>
> So a more flexible fallocate-like approach, since fallocate just allocates one
> file and you would need to run it for every file you intend to create. But the
> challenge would be to estimate the metadata allocation accurately beforehand.
>
> Or have tar --fallocate -xf, which for all files in the archive would first
> call fallocate and only if that succeeded actually write them. But due to the
> nature of tar archives, with their content listing spread across the whole
> archive, this means tar may have to read the archive twice, so ZIP archives
> might be better suited for that.
>

What you suggest is Still Not Practical™ (the tar thing might have some 
ability if you were willing to analyze every file to the byte level).

Compression _can_ make a file _bigger_ than its base size. BTRFS decides 
whether or not to compress a file based on the results it gets when 
trying to compress the first N bytes. (I do not know the value of N.) But 
it is _easy_ to have a file where the first N bytes compress well but 
the bytes after N take up more space than their byte count. So to 
fallocate() the right size in blocks you'd have to compress the input 
and determine what BTRFS _would_ _do_ and then allocate that much space 
instead of the file size.

And even then, if you didn't create all the names and directories, you 
might find that the RB-tree had to expand (allocate another tree node) 
one or more times to accommodate the actual files. Lather, rinse, repeat 
for any checksum trees, and for anything hitting a flush barrier because 
of commit= or sync() events, or other writers perturbing your results; 
it only matters if the filesystem is nearly full, and nearly full 
filesystems may not be quiescent at all.

So while the core problem isn't insoluble, in real life it is _not_ 
_worth_ _solving_.

On a nearly empty filesystem, it's going to fit.

In a reasonably empty filesystem, it's going to fit.

On a nearly full filesystem, it may or may not fit.

On a filesystem that is so close to full that you have reason to doubt 
it will fit, you are going to have a very bad time even if it fits.

If you did manage to invent and implement an fallocate algorithm that 
could make this promise and make it stick, then some other running 
program is what's going to crash when you use up that last byte anyway.

Almost full filesystems are their own reward.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-08  5:32     ` Robert White
@ 2014-12-08  6:20       ` ashford
  2014-12-08  7:06         ` Robert White
  2014-12-08 14:47       ` Martin Steigerwald
  1 sibling, 1 reply; 28+ messages in thread
From: ashford @ 2014-12-08  6:20 UTC (permalink / raw)
  To: Robert White; +Cc: Martin Steigerwald, Shriramana Sharma, linux-btrfs

Martin,

Excellent analysis.

> On 12/07/2014 07:40 AM, Martin Steigerwald wrote:
>
> So while the core problem isn't insoluble, in real life it is _not_
> _worth_ _solving_.

I agree.  There is inadequate return on the investment.  In addition, the
number of corner cases increases dramatically, making testing
significantly more complex.

> On a nearly empty filesystem, it's going to fit.
>
> In a reasonably empty filesystem, it's going to fit.
>
> On a nearly full filesystem, it may or may not fit.
>
> On a filesystem that is so close to full that you have reason to doubt
> it will fit, you are going to have a very bad time even if it fits.
>
> If you did manage to invent and implement an fallocate algorithm that
> could make this promise and make it stick, then some other running
> program is what's going to crash when you use up that last byte anyway.
>
> Almost full filesystems are their own reward.

In other words, BTRFS acts like any other filesystem with compression. 
This is reasonable.

Peter Ashford


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-07 15:15 Why is the actual disk usage of btrfs considered unknowable? Shriramana Sharma
  2014-12-07 15:33 ` Martin Steigerwald
  2014-12-08  4:59 ` Robert White
@ 2014-12-08  6:43 ` Zygo Blaxell
  2 siblings, 0 replies; 28+ messages in thread
From: Zygo Blaxell @ 2014-12-08  6:43 UTC (permalink / raw)
  To: Shriramana Sharma, i; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4187 bytes --]

On Sun, Dec 07, 2014 at 08:45:59PM +0530, Shriramana Sharma wrote:
> IIUC:
> 
> 1) btrfs fi df already shows the alloc-ed space and the space used out of that.
> 
> 2) Despite snapshots, CoW and compression, the tree knows how many
> extents of data and metadata there are, and how many bytes on disk
> these occupy, no matter what the total (uncompressed,
> "unsnapshotted") size of all the directories and files on the disk is.
> 
> So this means that btrfs fi df actually shows the real on-disk usage.
> In this case, why do we hear people saying it's not possible to know
> the actual on-disk usage and when a btrfs-formatted disk (or
> partition) will run out of space?

"On-disk usage" is easy--that's about the past, and can be measured
straightforwardly with a single count of bytes.

"When a btrfs filesystem will return ENOSPC" is much more
complicated--that's about the future, and depends heavily on current
structure and upcoming modifications of it.

There were some pretty terrible btrfs bugs and warts that were fixed
only in the last 5 months or so.  Since some of those had been around
for a year or more, they gave btrfs a reputation.

The 'df' command (statvfs(2)) would report raw free space instead of an
estimate based on the current RAID profile.  This confused some badly
designed programs that would use statvfs to determine that N bytes of free
space were available, and be surprised when N bytes were not all available
for their use.  If you had a btrfs using RAID1, it would report double
the amount of space used and available (one for each disk, e.g. 2x1TB
disks 75% full would be reported as 2TB capacity with 1.5TB used and
0.5TB free).  Now statvfs(2) computes more correct values (1TB capacity
with 750GB used and 250GB free).
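
This is how df-style tools read those numbers; a minimal statvfs(2)
example in C (the mount point is just a placeholder):

#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
    struct statvfs s;
    if (statvfs("/mnt", &s) != 0)
        return 1;
    /* f_blocks and f_bavail are counted in units of f_frsize */
    printf("capacity: %llu bytes\n",
           (unsigned long long)s.f_blocks * s.f_frsize);
    printf("free:     %llu bytes\n",
           (unsigned long long)s.f_bavail * s.f_frsize);
    return 0;
}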

Some bugs would crash the btrfs cleaner (the thread which removes deleted
snapshots) or balance, and would cause the filesystem to prematurely
report ENOSPC when (in theory) hundreds of gigabytes were available.
These were straight up bugs that are now fixed.

Modifying the filesystem tree requires free metadata blocks into which
to write new CoW nodes for the modified metadata.  When you delete
something, disk usage goes up for a few seconds before it goes down
(if you have snapshots, the "down" part may be delayed until you delete
the snapshots).  This can lead to surprising "No space left on device"
errors from commands like 'rm -rf lots_of_files'.  The GlobalReserve
chunk type was introduced to reserve a few MB of space on the filesystem
to handle such cases.

Thankfully, everything above now seems to be fixed.

There is still an issue with heterogeneous chunk allocation.  The 'df'
command and 'statvfs' syscall only report a single quantity for used
and free space, while in btrfs there are two distinct data types to be
stored in two distinct container types--and for maximum result
irreproducibility, the amount of space allocated to each type is dynamic.

Data (file contents) is allocated 1GB at a time, metadata (directory
structures, inodes, checksums) is allocated 256MB at a time, and
the two types are not interchangeable after allocation.  This can
cause inaccuracies when reporting free space as the last few free GB
are consumed.  256MB might abruptly disappear from free space if you
happen to run out of free metadata space and allocate a new metadata
chunk instead of a data chunk.

The last few KB of a file that does not fill a full 4K block can be
stored 'inline' (next to the inode in the metadata tree).  If you are
low on space in data chunks, you might be able to write a large number
of small files using inline metadata to store the file contents, but
not an equivalent-sized large file using data extent blocks.  If you
have lots of free data space but not enough metadata space, you get the
opposite result (e.g. you can write new large files but not extend small
existing ones).

All of the above happens with RAID, compression and quotas turned *off*.
Turning them on makes space usage even harder to analyze (and
ENOSPC errors harder to predict) with a single-dimension "available
space" metric.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-08  6:20       ` ashford
@ 2014-12-08  7:06         ` Robert White
  0 siblings, 0 replies; 28+ messages in thread
From: Robert White @ 2014-12-08  7:06 UTC (permalink / raw)
  To: ashford; +Cc: Martin Steigerwald, Shriramana Sharma, linux-btrfs

On 12/07/2014 10:20 PM, ashford@whisperpc.com wrote:
> Martin,
>
> Excellent analysis.
>
>> On 12/07/2014 07:40 AM, Martin Steigerwald wrote:
>>
>> So while the core problem isn't insoluble, in real life it is _not_
>> _worth_ _solving_.

Your email quoting is messed up... I wrote that analysis... 8-)



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-08  2:42           ` Qu Wenruo
@ 2014-12-08  8:12             ` ashford
  0 siblings, 0 replies; 28+ messages in thread
From: ashford @ 2014-12-08  8:12 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: ashford, kreijack, Shriramana Sharma, Martin Steigerwald, linux-btrfs

>
> -------- Original Message --------
> Subject: Re: Why is the actual disk usage of btrfs considered unknowable?
> From: <ashford@whisperpc.com>
> To: <kreijack@inwind.it>
> Date: 2014-12-08 08:12
>> Goffredo,
>>
>>> So in case you have a raid1 filesystem on two disks and each disk
>>> has 300GB free, which free space do you expect: 300GB or 600GB,
>>> and why?
>> You should see 300GB free.  That's what you'll see with RAID-1 with a
>> hardware RAID controller, and with MD RAID.  Why would you expect to see
>> anything else with BTRFS RAID?
>>
>> Peter Ashford
> Yeah, you pointed out the real problem here:
>
> [DIFFERENT RESULT FROM DIFFERENT VIEW]
> Seen from the *PURE ON-DISK* usage view, it is still 600G, no matter what
> level of RAID.
> Seen from the *BLOCK LEVEL RAID1* usage view, it is 300G. If a fs (not
> btrfs) is built on block-level RAID1, then the *FILESYSTEM* usage will
> also be 300G.
>
> [BTRFS DOES NOT BELONG TO ANY TYPE]
> But btrfs is neither pure block-level management (that would be MD or
> HW RAID or LVM) nor a traditional filesystem!!

For the purposes of reporting free space, it is reasonable to assume that
the default structure will be used.  If the default for the volume or
subvolume is RAID-1, then that should be used for 'df' output.  Obviously,
the same should be done for other RAID levels.

> So the root of the problem is that btrfs mixes the roles of block-level
> management and filesystem-level management, which makes everything hard
> to understand. You can't treat btrfs raid1 as a complete block-level
> raid1, due to its flexibility in having different metadata/data profiles.

It will have the same discrepancies as other file-systems with
compression, plus a few more of its own, due to chunking.  If the
file-system can't give a completely accurate answer, it should give one
that makes sense.

> If the vanilla df command shows filesystem-level free space, then btrfs
> won't give an accurate one.
>
> [ONLY PREDICTABLE CASE]
> For the 300Gx2 case with btrfs, you can only consider it 300G of free
> space if you can ensure that there was, is, and will be only RAID1
> data/metadata stored on it (also ignoring small space usage from CoW).

I disagree.  You can consider the RAID structure to be whatever the
default structure is.  If the default is RAID-1, then that structure
should be used to compute the free space for 'df'.  The user should
understand that by explicitly requesting a different RAID structure,
different amounts of space will be used.

> [RELIABLE DATA IS ON-DISK USAGE]
> Only pure on-disk level usage is *a little* reliable. There is still the
> problem of unbalanced metadata/data chunk allocation (e.g. all space is
> allocated for data, leaving no space for metadata CoW writes).

I agree.  Unused disk space isn't always available to be used by data. 
Sometimes it's reserved for metadata of one sort or another, and sometimes
it's too small to be of use.  In addition, BTRFS sometimes (with small
files) uses the Metadata chunks for data.  Yes, it's a complex problem. 
There is no simple solution that will make everyone happy.

---------------------------------

As for the 'df' output, I believe that the default should be the sum of
free space in data chunks, free space in metadata chunks and unallocated
space, ignoring any amounts that are small enough that BTRFS won't use
them, and adjusted for the RAID level of the volume/subvolume.
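
Read literally, that proposal amounts to something like this sketch
(illustrative names, not a btrfs interface; raid_factor would be e.g.
0.5 for RAID-1):

unsigned long long df_free(unsigned long long data_chunk_free,
                           unsigned long long metadata_chunk_free,
                           unsigned long long unallocated,
                           double raid_factor)
{
    /* amounts too small for BTRFS to use are assumed already excluded */
    return (unsigned long long)
        ((double)(data_chunk_free + metadata_chunk_free + unallocated)
         * raid_factor);
}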

While it's possible to generate other values that will make sense for
specific cases, it's not possible to create one value that is correct in
all cases.

If it's not possible to be absolutely correct, considering every usage (or
even the most common usages), a 'reasonable' value should be returned. 
That reasonable value should be based on the default volume/subvolume
settings, including RAID levels and any space limits that may exist on the
volume or subvolume.  It should neither be the most optimistic nor the
most pessimistic.

Peter Ashford


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-07 20:32     ` ashford
  2014-12-07 23:01       ` Goffredo Baroncelli
@ 2014-12-08  8:18       ` Chris Murphy
  1 sibling, 0 replies; 28+ messages in thread
From: Chris Murphy @ 2014-12-08  8:18 UTC (permalink / raw)
  To: ashford; +Cc: kreijack, Shriramana Sharma, Martin Steigerwald, linux-btrfs

On Sun, Dec 7, 2014 at 1:32 PM,  <ashford@whisperpc.com> wrote:
>
> I disagree.  My experiences with other file-systems, including ZFS, show
> that the most common solution is to just deliver to the user the actual
> amount of unused disk space.  Anything else changes this known value into
> a guess or prediction.

What is the "actual amount of unused disk space" in a mirror of 2x 8GB
drives? Very literally, it's 16GB. Subtracting the space used for
replication (the n mirror copies, or parity) is a convenience. This is
in fact how df reported Btrfs volumes with kernel 3.16 and older.

A ZFS mirror vdev doesn't work this way; it reports available space as
8GB. The level of replication and number of devices is a function of
the vdev, and is fixed. It can't be changed. With Btrfs there isn't a
zpool vs vdev type of distinction, and replication level isn't a
function of volume but rather that of chunks. At some future point
there will be a way to supply a hint (per subvolume, maybe per
directory or per file) for the allocator to put the file in a
particular chunk which has a particular level of replication and
number of devices. And that means "available space" isn't knowable.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-08  0:12         ` ashford
  2014-12-08  2:42           ` Qu Wenruo
@ 2014-12-08 14:34           ` Goffredo Baroncelli
  1 sibling, 0 replies; 28+ messages in thread
From: Goffredo Baroncelli @ 2014-12-08 14:34 UTC (permalink / raw)
  To: ashford; +Cc: Shriramana Sharma, Martin Steigerwald, linux-btrfs

On 12/08/2014 01:12 AM, ashford@whisperpc.com wrote:
> Goffredo,
> 
>> So in case you have a raid1 filesystem on two disks and each disk has 300GB
>> free, which free space do you expect: 300GB or 600GB, and why?
> 
> You should see 300GB free.  That's what you'll see with RAID-1 with a
> hardware RAID controller, and with MD RAID.  Why would you expect to see
> anything else with BTRFS RAID?

I had to ask you because in a previous email of yours you stated something
different:

On 12/07/2014 09:32 PM, ashford@whisperpc.com wrote:
> I disagree.  My experiences with other file-systems, including ZFS, show
> that the most common solution is to just deliver to the user the actual
> amount of *unused disk space*
            ^^^^^^^^^^^^^^^^^^^

So I expected you to answer 600GB. But you have now told the truth:
the user wants to know how much data can be stored on the disk, not
how much raw disk space is unused.

But I have to point out that the common case is a single-disk
filesystem where the metadata chunks have a ratio of data stored to
disk space consumed of 1:2, while the data chunks have a ratio of 1:1.
This is one reason why it is difficult to evaluate the free space: if
the remaining space were filled entirely with metadata chunks, you
would have to halve it. Another reason is the idea of allowing
different raid profiles in the same filesystem, which will further
complicate the free space evaluation.
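
As an illustration of the first point (numbers invented; "payload"
means bytes as the user sees them):

    #include <stdio.h>

    int main(void)
    {
        double unallocated_gb = 100.0;
        double f;  /* fraction of future chunks that become DUP metadata */

        for (f = 0.0; f <= 1.0; f += 0.5) {
            double as_data = unallocated_gb * (1.0 - f);  /* 1:1 ratio */
            double as_meta = unallocated_gb * f / 2.0;    /* 1:2 ratio */
            printf("%3.0f%% metadata -> %5.1f GB of payload fits\n",
                   f * 100.0, as_data + as_meta);
        }
        return 0;
    }

So the same 100GB of unallocated space holds anywhere between 50GB and
100GB of payload, depending on a mix that is unknown in advance.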


> Peter Ashford

G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-08  5:32     ` Robert White
  2014-12-08  6:20       ` ashford
@ 2014-12-08 14:47       ` Martin Steigerwald
  2014-12-08 14:57         ` Austin S Hemmelgarn
  2014-12-08 23:14         ` Zygo Blaxell
  1 sibling, 2 replies; 28+ messages in thread
From: Martin Steigerwald @ 2014-12-08 14:47 UTC (permalink / raw)
  To: Robert White; +Cc: Shriramana Sharma, linux-btrfs

Hi,

On Sunday, 7 December 2014 at 21:32:01, Robert White wrote:
> On 12/07/2014 07:40 AM, Martin Steigerwald wrote:
> > Well, what I bet would be possible is a kind of system call like
> > this:
> > 
> > I need to write 5 GB of data in 100 files to /opt/mynewshinysoftware,
> > can I do it *and* give me a guarantee that I can.
> > 
> > So, a more flexible fallocate-like approach, since fallocate just
> > allocates one file and you would need to run it for every file you
> > intend to create. But the challenge would be to estimate the metadata
> > allocation accurately beforehand.
> > 
> > Or have tar --fallocate -xf, which for all files in the archive would
> > first call fallocate and, only if that succeeded, actually write them.
> > But because a tar archive's content listing is spread across the whole
> > archive, it may have to read the archive twice, so ZIP archives might
> > be better suited for this.
> 
> What you suggest is Still Not Practical™ (the tar thing might have some
> ability if you were willing to analyze every file to the byte level).
> 
> Compression _can_ make a file _bigger_ than its base size. BTRFS
> decides whether or not to compress a file based on the results it gets
> when trying to compress the first N bytes. (I do not know the value of
> N.) But it is _easy_ to have a file where the first N bytes compress
> well but the bytes after N take up more space than their byte count.
> So to fallocate() the right size in blocks you'd have to compress the
> input yourself, determine what BTRFS _would_ _do_, and then allocate
> that much space instead of the file size.
> 
> And even then, if you didn't create all the names and directories you
> might find that the RBtree had to expand (allocate another tree node)
> one or more times to accommodate the actual files. Lather, rinse,
> repeat
> for any checksum trees and anything hitting a flush barrier because of
> commit= or sync() events or other writers perturbing your results
> because it only matters if the filesystem is nearly full and nearly full
> filesystems may not be quiescent at all.
> 
> So while the core problem isn't insoluble, in real life it is _not_
> _worth_ _solving_.
> 
> On a nearly empty filesystem, it's going to fit.
> 
> On a reasonably empty filesystem, it's going to fit.
> 
> On a nearly full filesystem, it may or may not fit.
> 
> On a filesystem that is so close to full that you have reason to doubt
> it will fit, you are going to have a very bad time even if it fits.
> 
> If you did manage to invent and implement an fallocate algorithm that
> could make this promise and make it stick, then some other running
> program is what's going to crash when you use up that last byte anyway.
> 
> Almost full filesystems are their own reward.

So you are basically saying that BTRFS with compression does not meet
the fallocate guarantee. Now that's interesting, because it basically
violates the documentation for the system call:

DESCRIPTION
       The function posix_fallocate() ensures that disk space  is  allo‐
       cated for the file referred to by the descriptor fd for the bytes
       in the range starting at offset and  continuing  for  len  bytes.
       After  a  successful call to posix_fallocate(), subsequent writes
       to bytes in the  specified  range  are  guaranteed  not  to  fail
       because of lack of disk space.

So in order to be standards-compliant there, BTRFS would need to write
fallocated files uncompressed… wow, this is getting complex.
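
For reference, this is the pattern whose guarantee is at stake (a
minimal sketch; the file name is invented and error handling is trimmed
to the essentials):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("reserved.dat", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* After this succeeds, POSIX says writes anywhere in the
         * first MiB must not fail for lack of disk space. */
        int err = posix_fallocate(fd, 0, 1 << 20);
        if (err) {
            fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
            return 1;
        }

        /* ...but under compression, the blocks that eventually back
         * this write can differ in size from what was reserved. */
        (void)write(fd, "x", 1);
        close(fd);
        return 0;
    }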

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-08 14:47       ` Martin Steigerwald
@ 2014-12-08 14:57         ` Austin S Hemmelgarn
  2014-12-08 15:52           ` Martin Steigerwald
  2014-12-08 23:14         ` Zygo Blaxell
  1 sibling, 1 reply; 28+ messages in thread
From: Austin S Hemmelgarn @ 2014-12-08 14:57 UTC (permalink / raw)
  To: Martin Steigerwald, Robert White; +Cc: Shriramana Sharma, linux-btrfs

On 2014-12-08 09:47, Martin Steigerwald wrote:
> [... quote of Robert White's compression argument snipped ...]
>
> So you are basically saying that BTRFS with compression does not meet
> the fallocate guarantee. Now that's interesting, because it basically
> violates the documentation for the system call:
>
> [posix_fallocate(3) DESCRIPTION snipped]
>
> So in order to be standards-compliant there, BTRFS would need to write
> fallocated files uncompressed… wow, this is getting complex.
The other option would be to allocate based on the worst-case size
increase for the compression algorithm (which works out to about 5%
IIRC for zlib, and a bit more for lzo) and then possibly discard the
unwritten extents at some later point.
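
zlib even ships a helper for its own worst-case bound, which shows the
shape of the calculation (a sketch; zlib's stream bound is actually
well under 1%, so any extent framing overhead btrfs adds would have to
sit on top of it):

    #include <stdio.h>
    #include <zlib.h>   /* link with -lz */

    int main(void)
    {
        uLong src   = 128 * 1024;          /* a 128 KiB extent */
        uLong worst = compressBound(src);  /* max possible deflate output */

        printf("%lu input bytes -> reserve %lu bytes (+%.2f%% worst case)\n",
               src, worst, 100.0 * (worst - src) / src);
        return 0;
    }

Whatever the exact bound, reserving it up front and trimming the unused
tail later would keep the reservation honest without banning
compression outright.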




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-08 14:57         ` Austin S Hemmelgarn
@ 2014-12-08 15:52           ` Martin Steigerwald
  0 siblings, 0 replies; 28+ messages in thread
From: Martin Steigerwald @ 2014-12-08 15:52 UTC (permalink / raw)
  To: Austin S Hemmelgarn; +Cc: Robert White, Shriramana Sharma, linux-btrfs

On Monday, 8 December 2014 at 09:57:50, Austin S Hemmelgarn wrote:
> [... earlier fallocate/compression discussion snipped ...]
>
> The other option would be to allocate based on the worst-case size
> increase for the compression algorithm (which works out to about 5%
> IIRC for zlib, and a bit more for lzo) and then possibly discard the
> unwritten extents at some later point.

Now that seems like a workable solution.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-08 14:47       ` Martin Steigerwald
  2014-12-08 14:57         ` Austin S Hemmelgarn
@ 2014-12-08 23:14         ` Zygo Blaxell
  1 sibling, 0 replies; 28+ messages in thread
From: Zygo Blaxell @ 2014-12-08 23:14 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Robert White, Shriramana Sharma, linux-btrfs

On Mon, Dec 08, 2014 at 03:47:23PM +0100, Martin Steigerwald wrote:
> On Sunday, 7 December 2014 at 21:32:01, Robert White wrote:
> > On 12/07/2014 07:40 AM, Martin Steigerwald wrote:
> > Almost full filesystems are their own reward.
> 
> So you are basically saying that BTRFS with compression does not meet
> the fallocate guarantee. Now that's interesting, because it basically
> violates the documentation for the system call:
> 
> [posix_fallocate(3) DESCRIPTION snipped]
> 
> So in order to be standards-compliant there, BTRFS would need to write
> fallocated files uncompressed… wow, this is getting complex.

...and with nodatacow and no snapshots, since those require more space
than fallocate ever anticipated.

Given the choice, I'd just let fallocate fail.  Usually when I come
across a program using fallocate, I end up patching it so it doesn't use
fallocate any more.
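
A softer variant of the same patch, treating preallocation as purely
advisory (a sketch; the helper name and the fallback policy are
hypothetical):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Hypothetical wrapper: try to preallocate, but warn and carry on
     * if the filesystem can't or won't honor it. */
    static int try_prealloc(int fd, off_t len)
    {
        int err = posix_fallocate(fd, 0, len);

        if (err)
            fprintf(stderr, "preallocation skipped: %s\n", strerror(err));
        return err;
    }

    int main(void)
    {
        int fd = open("output.dat", O_CREAT | O_WRONLY, 0644);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        try_prealloc(fd, 1 << 20);   /* best effort only */
        (void)write(fd, "data", 4);  /* normal writes proceed either way */
        close(fd);
        return 0;
    }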


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2014-12-08 23:14 UTC | newest]

Thread overview: 28+ messages
2014-12-07 15:15 Why is the actual disk usage of btrfs considered unknowable? Shriramana Sharma
2014-12-07 15:33 ` Martin Steigerwald
2014-12-07 15:37   ` Shriramana Sharma
2014-12-07 15:40   ` Martin Steigerwald
2014-12-08  5:32     ` Robert White
2014-12-08  6:20       ` ashford
2014-12-08  7:06         ` Robert White
2014-12-08 14:47       ` Martin Steigerwald
2014-12-08 14:57         ` Austin S Hemmelgarn
2014-12-08 15:52           ` Martin Steigerwald
2014-12-08 23:14         ` Zygo Blaxell
2014-12-07 18:20   ` ashford
2014-12-07 18:34     ` Hugo Mills
2014-12-07 18:48       ` Martin Steigerwald
2014-12-07 19:39       ` ashford
2014-12-08  5:17       ` Chris Murphy
2014-12-07 18:38     ` Martin Steigerwald
2014-12-07 19:44       ` ashford
2014-12-07 19:19   ` Goffredo Baroncelli
2014-12-07 20:32     ` ashford
2014-12-07 23:01       ` Goffredo Baroncelli
2014-12-08  0:12         ` ashford
2014-12-08  2:42           ` Qu Wenruo
2014-12-08  8:12             ` ashford
2014-12-08 14:34           ` Goffredo Baroncelli
2014-12-08  8:18       ` Chris Murphy
2014-12-08  4:59 ` Robert White
2014-12-08  6:43 ` Zygo Blaxell
