* Why is the actual disk usage of btrfs considered unknowable? @ 2014-12-07 15:15 Shriramana Sharma 2014-12-07 15:33 ` Martin Steigerwald ` (2 more replies) 0 siblings, 3 replies; 28+ messages in thread From: Shriramana Sharma @ 2014-12-07 15:15 UTC (permalink / raw) To: linux-btrfs IIUC: 1) btrfs fi df already shows the alloc-ed space and the space used out of that. 2) Despite snapshots, CoW and compression, the tree knows how many extents of data and metadata there are, and how many bytes on disk these occupy, no matter what the total (uncompressed, "unsnapshotted") size of all the directories and files on the disk is. So this means that btrfs fi df actually shows the real on-disk usage. In this case, why do we hear people saying it's not possible to know the actual on-disk usage and when a btrfs-formatted disk (or partition) will run out of space? -- Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-07 15:15 Why is the actual disk usage of btrfs considered unknowable? Shriramana Sharma @ 2014-12-07 15:33 ` Martin Steigerwald 2014-12-07 15:37 ` Shriramana Sharma ` (3 more replies) 2014-12-08 4:59 ` Robert White 2014-12-08 6:43 ` Zygo Blaxell 2 siblings, 4 replies; 28+ messages in thread From: Martin Steigerwald @ 2014-12-07 15:33 UTC (permalink / raw) To: Shriramana Sharma; +Cc: linux-btrfs Hi Shriramana! On Sunday, 7 December 2014, 20:45:59, Shriramana Sharma wrote: > IIUC: > > 1) btrfs fi df already shows the alloc-ed space and the space used out of > that. > > 2) Despite snapshots, CoW and compression, the tree knows how many > extents of data and metadata there are, and how many bytes on disk > these occupy, no matter what the total (uncompressed, > "unsnapshotted") size of all the directories and files on the disk is. > > So this means that btrfs fi df actually shows the real on-disk usage. > In this case, why do we hear people saying it's not possible to know > the actual on-disk usage and when a btrfs-formatted disk (or > partition) will run out of space? I never read that the actual disk usage is unknown. But I read that what is actually free is unknown. And there are several reasons for that: 1) On a compressed filesystem you cannot know, but only estimate, the compression ratio for future data. 2) On a compressed filesystem you can choose to have parts of it uncompressed by file / directory attributes, I think. BTRFS can't know how much of the future data you are going to store compressed or uncompressed. 3) From what I gathered it is planned to allow different raid / redundancy levels for different subvolumes. BTRFS can't know beforehand where applications request to save future data, i.e. in which subvolume. 4) Even on a conventional filesystem the free space is an estimate, because it cannot predict the activity of other processes writing to the filesystem. 
You may have 10 GiB free at some point, but if another process is writing another 5 GiB at the same time as your process is writing, your process will see less and less of the estimated 10 GiB free, and if it wanted to write 10 GiB it will not be able to. What might be possible, but still has the limitation of the fourth point above, would be a query: how much free space do you have *right now*, on this directory path, if I write with standard settings? But the only guarantee you can ever get is to pre-allocate your files with fallocate. When the fallocate call succeeds, you get a guarantee that you can write the allocated amount of data into the file. Whether BTRFS can hold to that guarantee in every case? That depends on how bug-free its free space handling is in that regard. And in case you do not need all the fallocated space, other processes may not be able to write data anymore, even though there would be free space inside your fallocated files. So you either overprovision or underprovision… :) That written: filling up a filesystem to 100% will considerably limit the performance of any filesystem known to me and asks for further trouble. So better have at least 10-20% of the space free, except maybe for very large filesystems. On the other hand, I saw recommendations on the XFS mailing list that in the case of heavy random I/O on lots of files it is even better to leave 40-50% free if you want to delay the slowing down of the filesystem and want to have a well-structured filesystem after 10 years of heavy usage. BTRFS can rebalance things, but I have yet to see that this rebalancing really optimizes things. It may not, or at least not in all cases. So welcome to the challenges of filesystem development, especially for a copy-on-write filesystem with the feature set BTRFS provides. Ciao, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 ^ permalink raw reply [flat|nested] 28+ messages in thread
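[Editor's note: the fallocate guarantee discussed above can be exercised directly. The sketch below is illustrative only (Python on Linux assumed; `os.posix_fallocate` wraps posix_fallocate(3), and the function and file names are invented for the example): it reserves space first and only then writes into the reserved range.]

```python
# Reserve space up front with posix_fallocate, then write into the
# reserved range. Per POSIX, writes inside [0, size) after a successful
# call must not fail for lack of disk space -- modulo the btrfs caveats
# discussed in this thread.
import os
import tempfile

def reserve_and_write(path, size, payload):
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)  # raises OSError (e.g. ENOSPC) on failure
        os.write(fd, payload[:size])     # write within the reserved range
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reserved.bin")
    reserve_and_write(path, 4096, b"x" * 4096)
    print(os.path.getsize(path))  # -> 4096
```

If the reservation itself fails, the caller gets ENOSPC at fallocate time, before any data is written, which is exactly the early failure being asked for here.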
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-07 15:33 ` Martin Steigerwald @ 2014-12-07 15:37 ` Shriramana Sharma 2014-12-07 15:40 ` Martin Steigerwald ` (2 subsequent siblings) 3 siblings, 0 replies; 28+ messages in thread From: Shriramana Sharma @ 2014-12-07 15:37 UTC (permalink / raw) To: Martin Steigerwald; +Cc: linux-btrfs On Sun, Dec 7, 2014 at 9:03 PM, Martin Steigerwald <Martin@lichtvoll.de> wrote: > > I never read that the actual disk usage is unknown. But I read that > what is actually free is unknown. And there are several reasons for that: That is totally understood. But I guess when your alloc space is nearing 90% of your disk capacity, and used space is sorta 80% or so of the alloc space, it's reasonable to expect people to add a drive to the pool, which btrfs makes so easy. Given this, why do people complain about btrfs not being predictable when it comes to ENOSPC? Even with any other FS, I do think I'd not like my files to occupy more than 90% or so, since even then defrag would probably not work. -- Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-07 15:33 ` Martin Steigerwald 2014-12-07 15:37 ` Shriramana Sharma @ 2014-12-07 15:40 ` Martin Steigerwald 2014-12-08 5:32 ` Robert White 2014-12-07 18:20 ` ashford 2014-12-07 19:19 ` Goffredo Baroncelli 3 siblings, 1 reply; 28+ messages in thread From: Martin Steigerwald @ 2014-12-07 15:40 UTC (permalink / raw) To: Shriramana Sharma; +Cc: linux-btrfs On Sunday, 7 December 2014, 16:33:37, Martin Steigerwald wrote: > What might be possible, but still has the limitation of the fourth point > above, would be a query: how much free space do you have *right now*, on > this directory path, if I write with standard settings? > > But the only guarantee you can ever get is to pre-allocate your files with > fallocate. When the fallocate call succeeds, you get a guarantee that you > can write the allocated amount of data into the file. Whether BTRFS > can hold to that guarantee in every case? That depends on how bug-free > its free space handling is in that regard. Well what would be possible, I bet, would be a kind of system call like this: I need to write 5 GB of data in 100 files to /opt/mynewshinysoftware, can I do it *and* give me a guarantee that I can. So, like a more flexible fallocate approach, as fallocate just allocates one file and you would need to run it for all files you intend to create. But the challenge would be to estimate the metadata allocation beforehand accurately. Or have tar --fallocate -xf, which for all files in the archive will first call fallocate and only if that succeeded, actually write them. But due to the nature of tar archives, with their content listing across the whole archive, this means it may have to read the tar archive twice, so ZIP archives might be better suited for that. -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 ^ permalink raw reply [flat|nested] 28+ messages in thread
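[Editor's note: the two-pass extraction proposed above could look roughly like this sketch. The --fallocate tar flag is hypothetical; Python's tarfile module stands in for tar itself, and `extract_with_preallocation` is an invented name. Pass one reserves space for every regular member, pass two writes contents only after all reservations succeeded. Directory and metadata overhead is not accounted for, which is exactly the hard part noted above.]

```python
# Sketch: preallocate every regular file in the archive first; only when
# all reservations succeed, write the actual contents. A failed
# reservation raises ENOSPC before any file data has been written.
import io
import os
import tarfile

def extract_with_preallocation(tar_path, dest):
    with tarfile.open(tar_path) as tar:
        members = tar.getmembers()  # first pass over the archive
        # Pass 1: create the files and reserve their full sizes.
        for m in members:
            if not m.isreg():
                continue
            target = os.path.join(dest, m.name)
            os.makedirs(os.path.dirname(target) or dest, exist_ok=True)
            fd = os.open(target, os.O_RDWR | os.O_CREAT, 0o600)
            try:
                if m.size:
                    os.posix_fallocate(fd, 0, m.size)
            finally:
                os.close(fd)
        # Pass 2: space is reserved, so these writes should not run out.
        for m in members:
            if m.isreg():
                with tar.extractfile(m) as src, \
                        open(os.path.join(dest, m.name), "r+b") as dst:
                    dst.write(src.read())
```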
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-07 15:40 ` Martin Steigerwald @ 2014-12-08 5:32 ` Robert White 2014-12-08 6:20 ` ashford 2014-12-08 14:47 ` Martin Steigerwald 0 siblings, 2 replies; 28+ messages in thread From: Robert White @ 2014-12-08 5:32 UTC (permalink / raw) To: Martin Steigerwald, Shriramana Sharma; +Cc: linux-btrfs On 12/07/2014 07:40 AM, Martin Steigerwald wrote: > Well what would be possible, I bet, would be a kind of system call like this: > > I need to write 5 GB of data in 100 files to /opt/mynewshinysoftware, can I > do it *and* give me a guarantee that I can. > > So, like a more flexible fallocate approach, as fallocate just allocates one file > and you would need to run it for all files you intend to create. But the challenge > would be to estimate the metadata allocation beforehand accurately. > > Or have tar --fallocate -xf, which for all files in the archive will first call > fallocate and only if that succeeded, actually write them. But due to the > nature of tar archives, with their content listing across the whole archive, > this means it may have to read the tar archive twice, so ZIP archives might be > better suited for that. > What you suggest is Still Not Practical™ (the tar thing might have some ability if you were willing to analyze every file to the byte level). Compression _can_ make a file _bigger_ than its base size. BTRFS decides whether or not to compress a file based on the results it gets when trying to compress the first N bytes. (I do not know the value of N). But it is _easy_ to have a file where the first N bytes compress well but the bytes after N take up more space than their byte count. 
And even then, if you didn't create all the names and directories you might find that the RBtree had to expand (allocate another tree node) one or more times to accommodate the actual files. Lather rinse repeat for any checksum trees and anything hitting a flush barrier because of commit= or sync() events or other writers perturbing your results because it only matters if the filesystem is nearly full and nearly full filesystems may not be quiescent at all. So while the core problem isn't insoluble, in real life it is _not_ _worth_ _solving_. On a nearly empty filesystem, it's going to fit. In a reasonably empty filesystem, it's going to fit. On a nearly full filesystem, it may or may not fit. On a filesystem that is so close to full that you have reason to doubt it will fit, you are going to have a very bad time even if it fits. If you did manage to invent and implement a fallocate algorithm that could make this promise and make it stick, then some other running program is what's going to crash when you use up that last byte anyway. Almost full filesystems are their own reward. ^ permalink raw reply [flat|nested] 28+ messages in thread
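[Editor's note: the claim above that compression can expand data is easy to demonstrate with any real compressor; here with zlib from Python's standard library. btrfs's own zlib/lzo framing differs in the details, so this only illustrates the principle.]

```python
# Already-incompressible (random) input gains header and block
# bookkeeping overhead, so the "compressed" form is larger than
# the original.
import os
import zlib

original = os.urandom(64 * 1024)        # random bytes: incompressible
compressed = zlib.compress(original)
print(len(compressed) > len(original))  # -> True
```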
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-08 5:32 ` Robert White @ 2014-12-08 6:20 ` ashford 2014-12-08 7:06 ` Robert White 2014-12-08 14:47 ` Martin Steigerwald 1 sibling, 1 reply; 28+ messages in thread From: ashford @ 2014-12-08 6:20 UTC (permalink / raw) To: Robert White; +Cc: Martin Steigerwald, Shriramana Sharma, linux-btrfs Martin, Excellent analysis. > On 12/07/2014 07:40 AM, Martin Steigerwald wrote: > > So while the core problem isn't insoluble, in real life it is _not_ > _worth_ _solving_. I agree. There is inadequate return on the investment. In addition, the number of corner cases increases dramatically, making testing significantly more complex. > On a nearly empty filesystem, it's going to fit. > > In a reasonably empty filesystem, it's going to fit. > > On a nearly full filesystem, it may or may not fit. > > On a filesystem that is so close to full that you have reason to doubt > it will fit, you are going to have a very bad time even if it fits. > > If you did manage to invent and implement a fallocate algorithm that > could make this promise and make it stick, then some other running > program is what's going to crash when you use up that last byte anyway. > > Almost full filesystems are their own reward. In other words, BTRFS acts like any other filesystem with compression. This is reasonable. Peter Ashford ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-08 6:20 ` ashford @ 2014-12-08 7:06 ` Robert White 0 siblings, 0 replies; 28+ messages in thread From: Robert White @ 2014-12-08 7:06 UTC (permalink / raw) To: ashford; +Cc: Martin Steigerwald, Shriramana Sharma, linux-btrfs On 12/07/2014 10:20 PM, ashford@whisperpc.com wrote: > Martin, > > Excellent analysis. > >> On 12/07/2014 07:40 AM, Martin Steigerwald wrote: >> >> So while the core problem isn't insoluble, in real life it is _not_ >> _worth_ _solving_. Your email quoting things is messed up... I wrote that analysis... 8-) ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-08 5:32 ` Robert White 2014-12-08 6:20 ` ashford @ 2014-12-08 14:47 ` Martin Steigerwald 2014-12-08 14:57 ` Austin S Hemmelgarn 2014-12-08 23:14 ` Zygo Blaxell 1 sibling, 2 replies; 28+ messages in thread From: Martin Steigerwald @ 2014-12-08 14:47 UTC (permalink / raw) To: Robert White; +Cc: Shriramana Sharma, linux-btrfs Hi, On Sunday, 7 December 2014, 21:32:01, Robert White wrote: > On 12/07/2014 07:40 AM, Martin Steigerwald wrote: > > Well what would be possible, I bet, would be a kind of system call like > > this: > > > > I need to write 5 GB of data in 100 files to /opt/mynewshinysoftware, > > can I do it *and* give me a guarantee that I can. > > > > So, like a more flexible fallocate approach, as fallocate just allocates one > > file and you would need to run it for all files you intend to create. But > > the challenge would be to estimate the metadata allocation beforehand accurately. > > > > Or have tar --fallocate -xf, which for all files in the archive will first > > call fallocate and only if that succeeded, actually write them. But due > > to the nature of tar archives, with their content listing across the whole > > archive, this means it may have to read the tar archive twice, so ZIP > > archives might be better suited for that. > > What you suggest is Still Not Practical™ (the tar thing might have some > ability if you were willing to analyze every file to the byte level). > > Compression _can_ make a file _bigger_ than its base size. BTRFS decides > whether or not to compress a file based on the results it gets when > trying to compress the first N bytes. (I do not know the value of N). But > it is _easy_ to have a file where the first N bytes compress well but > the bytes after N take up more space than their byte count. 
So to > fallocate() the right size in blocks you'd have to compress the input > and determine what BTRFS _would_ _do_ and then allocate that much space > instead of the file size. > > And even then, if you didn't create all the names and directories you > might find that the RBtree had to expand (allocate another tree node) > one or more times to accommodate the actual files. Lather rinse repeat > for any checksum trees and anything hitting a flush barrier because of > commit= or sync() events or other writers perturbing your results > because it only matters if the filesystem is nearly full and nearly full > filesystems may not be quiescent at all. > > So while the core problem isn't insoluble, in real life it is _not_ > _worth_ _solving_. > > On a nearly empty filesystem, it's going to fit. > > In a reasonably empty filesystem, it's going to fit. > > On a nearly full filesystem, it may or may not fit. > > On a filesystem that is so close to full that you have reason to doubt > it will fit, you are going to have a very bad time even if it fits. > > If you did manage to invent and implement a fallocate algorithm that > could make this promise and make it stick, then some other running > program is what's going to crash when you use up that last byte anyway. > > Almost full filesystems are their own reward. So you basically say that BTRFS with compression does not meet the fallocate guarantee. Now that's interesting, because it basically violates the documentation for the system call: DESCRIPTION The function posix_fallocate() ensures that disk space is allocated for the file referred to by the descriptor fd for the bytes in the range starting at offset and continuing for len bytes. After a successful call to posix_fallocate(), subsequent writes to bytes in the specified range are guaranteed not to fail because of lack of disk space. So in order to be standard compliant there, BTRFS would need to write fallocated files uncompressed… wow this is getting complex. 
Thanks, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-08 14:47 ` Martin Steigerwald @ 2014-12-08 14:57 ` Austin S Hemmelgarn 2014-12-08 15:52 ` Martin Steigerwald 2014-12-08 23:14 ` Zygo Blaxell 1 sibling, 1 reply; 28+ messages in thread From: Austin S Hemmelgarn @ 2014-12-08 14:57 UTC (permalink / raw) To: Martin Steigerwald, Robert White; +Cc: Shriramana Sharma, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 3944 bytes --] On 2014-12-08 09:47, Martin Steigerwald wrote: > Hi, > > On Sunday, 7 December 2014, 21:32:01, Robert White wrote: >> On 12/07/2014 07:40 AM, Martin Steigerwald wrote: >>> Well what would be possible, I bet, would be a kind of system call like >>> this: >>> >>> I need to write 5 GB of data in 100 files to /opt/mynewshinysoftware, >>> can I do it *and* give me a guarantee that I can. >>> >>> So, like a more flexible fallocate approach, as fallocate just allocates one >>> file and you would need to run it for all files you intend to create. But >>> the challenge would be to estimate the metadata allocation beforehand accurately. >>> >>> Or have tar --fallocate -xf, which for all files in the archive will first >>> call fallocate and only if that succeeded, actually write them. But due >>> to the nature of tar archives, with their content listing across the whole >>> archive, this means it may have to read the tar archive twice, so ZIP >>> archives might be better suited for that. >> >> What you suggest is Still Not Practical™ (the tar thing might have some >> ability if you were willing to analyze every file to the byte level). >> >> Compression _can_ make a file _bigger_ than its base size. BTRFS decides >> whether or not to compress a file based on the results it gets when >> trying to compress the first N bytes. (I do not know the value of N). But >> it is _easy_ to have a file where the first N bytes compress well but >> the bytes after N take up more space than their byte count. 
So to >> fallocate() the right size in blocks you'd have to compress the input >> and determine what BTRFS _would_ _do_ and then allocate that much space >> instead of the file size. >> >> And even then, if you didn't create all the names and directories you >> might find that the RBtree had to expand (allocate another tree node) >> one or more times to accommodate the actual files. Lather rinse repeat >> for any checksum trees and anything hitting a flush barrier because of >> commit= or sync() events or other writers perturbing your results >> because it only matters if the filesystem is nearly full and nearly full >> filesystems may not be quiescent at all. >> >> So while the core problem isn't insoluble, in real life it is _not_ >> _worth_ _solving_. >> >> On a nearly empty filesystem, it's going to fit. >> >> In a reasonably empty filesystem, it's going to fit. >> >> On a nearly full filesystem, it may or may not fit. >> >> On a filesystem that is so close to full that you have reason to doubt >> it will fit, you are going to have a very bad time even if it fits. >> >> If you did manage to invent and implement a fallocate algorithm that >> could make this promise and make it stick, then some other running >> program is what's going to crash when you use up that last byte anyway. >> >> Almost full filesystems are their own reward. > > So you basically say that BTRFS with compression does not meet the fallocate > guarantee. Now that's interesting, because it basically violates the > documentation for the system call: > > DESCRIPTION > The function posix_fallocate() ensures that disk space is > allocated for the file referred to by the descriptor fd for the bytes > in the range starting at offset and continuing for len bytes. 
> > So in order to be standard compliant there, BTRFS would need to write > fallocated files uncompressed… wow this is getting complex. The other option would be to allocate based on the worst case size increase for the compression algorithm, (which works out to about 5% IIRC for zlib and a bit more for lzo) and then possibly discard the unwritten extents at some later point. [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 2455 bytes --] ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-08 14:57 ` Austin S Hemmelgarn @ 2014-12-08 15:52 ` Martin Steigerwald 0 siblings, 0 replies; 28+ messages in thread From: Martin Steigerwald @ 2014-12-08 15:52 UTC (permalink / raw) To: Austin S Hemmelgarn; +Cc: Robert White, Shriramana Sharma, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 4387 bytes --] On Monday, 8 December 2014, 09:57:50, Austin S Hemmelgarn wrote: > On 2014-12-08 09:47, Martin Steigerwald wrote: > > Hi, > > > > On Sunday, 7 December 2014, 21:32:01, Robert White wrote: > >> On 12/07/2014 07:40 AM, Martin Steigerwald wrote: > >>> Well what would be possible, I bet, would be a kind of system call like > >>> this: > >>> > >>> I need to write 5 GB of data in 100 files to /opt/mynewshinysoftware, > >>> can I do it *and* give me a guarantee that I can. > >>> > >>> So, like a more flexible fallocate approach, as fallocate just allocates > >>> one > >>> file and you would need to run it for all files you intend to create. > >>> But > >>> the challenge would be to estimate the metadata allocation beforehand > >>> accurately. > >>> > >>> Or have tar --fallocate -xf, which for all files in the archive will > >>> first > >>> call fallocate and only if that succeeded, actually write them. But due > >>> to the nature of tar archives, with their content listing across the > >>> whole > >>> archive, this means it may have to read the tar archive twice, so ZIP > >>> archives might be better suited for that. > >> > >> What you suggest is Still Not Practical™ (the tar thing might have some > >> ability if you were willing to analyze every file to the byte level). > >> > >> Compression _can_ make a file _bigger_ than its base size. BTRFS decides > >> whether or not to compress a file based on the results it gets when > >> trying to compress the first N bytes. (I do not know the value of N). 
But > >> it is _easy_ to have a file where the first N bytes compress well but > >> the bytes after N take up more space than their byte count. So to > >> fallocate() the right size in blocks you'd have to compress the input > >> and determine what BTRFS _would_ _do_ and then allocate that much space > >> instead of the file size. > >> > >> And even then, if you didn't create all the names and directories you > >> might find that the RBtree had to expand (allocate another tree node) > >> one or more times to accommodate the actual files. Lather rinse repeat > >> for any checksum trees and anything hitting a flush barrier because of > >> commit= or sync() events or other writers perturbing your results > >> because it only matters if the filesystem is nearly full and nearly full > >> filesystems may not be quiescent at all. > >> > >> So while the core problem isn't insoluble, in real life it is _not_ > >> _worth_ _solving_. > >> > >> On a nearly empty filesystem, it's going to fit. > >> > >> In a reasonably empty filesystem, it's going to fit. > >> > >> On a nearly full filesystem, it may or may not fit. > >> > >> On a filesystem that is so close to full that you have reason to doubt > >> it will fit, you are going to have a very bad time even if it fits. > >> > >> If you did manage to invent and implement a fallocate algorithm that > >> could make this promise and make it stick, then some other running > >> program is what's going to crash when you use up that last byte anyway. > >> > >> Almost full filesystems are their own reward. > > > > So you basically say that BTRFS with compression does not meet the > > fallocate guarantee. Now that's interesting, because it basically violates > > the > > documentation for the system call: > > > > DESCRIPTION > > > > The function posix_fallocate() ensures that disk space is > > allocated for the file referred to by the descriptor fd for the bytes > > in the range starting at offset and continuing for len bytes. 
> > After a successful call to posix_fallocate(), subsequent writes > > to bytes in the specified range are guaranteed not to fail > > because of lack of disk space. > > > > So in order to be standard compliant there, BTRFS would need to write > > fallocated files uncompressed… wow this is getting complex. > > The other option would be to allocate based on the worst case size > increase for the compression algorithm, (which works out to about 5% > IIRC for zlib and a bit more for lzo) and then possibly discard the > unwritten extents at some later point. Now that seems like a workable solution. -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-08 14:47 ` Martin Steigerwald 2014-12-08 14:57 ` Austin S Hemmelgarn @ 2014-12-08 23:14 ` Zygo Blaxell 1 sibling, 0 replies; 28+ messages in thread From: Zygo Blaxell @ 2014-12-08 23:14 UTC (permalink / raw) To: Martin Steigerwald; +Cc: Robert White, Shriramana Sharma, linux-btrfs On Mon, Dec 08, 2014 at 03:47:23PM +0100, Martin Steigerwald wrote: > On Sunday, 7 December 2014, 21:32:01, Robert White wrote: > > On 12/07/2014 07:40 AM, Martin Steigerwald wrote: > > Almost full filesystems are their own reward. > > So you basically say that BTRFS with compression does not meet the fallocate > guarantee. Now that's interesting, because it basically violates the > documentation for the system call: > > DESCRIPTION > The function posix_fallocate() ensures that disk space is > allocated for the file referred to by the descriptor fd for the bytes > in the range starting at offset and continuing for len bytes. > After a successful call to posix_fallocate(), subsequent writes > to bytes in the specified range are guaranteed not to fail > because of lack of disk space. > > So in order to be standard compliant there, BTRFS would need to write > fallocated files uncompressed… wow this is getting complex. ...and nodatacow and no snapshots, since those require more space than was ever anticipated by fallocate. Given the choice, I'd just let fallocate fail. Usually when I come across a program using fallocate, I end up patching it so it doesn't use fallocate any more. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-07 15:33 ` Martin Steigerwald 2014-12-07 15:37 ` Shriramana Sharma 2014-12-07 15:40 ` Martin Steigerwald @ 2014-12-07 18:20 ` ashford 2014-12-07 18:34 ` Hugo Mills 2014-12-07 18:38 ` Martin Steigerwald 2014-12-07 19:19 ` Goffredo Baroncelli 3 siblings, 2 replies; 28+ messages in thread From: ashford @ 2014-12-07 18:20 UTC (permalink / raw) To: Martin Steigerwald; +Cc: Shriramana Sharma, linux-btrfs Martin, > I read that what is > actually free is unknown. And there are several reasons for that: > > 1) On a compressed filesystem you cannot know, but only estimate the > compression ratio for future data. It is NOT the job of BTRFS, or ANY file-system, to try to predict the future. The future is unknown. Don't try to account for it. When asked for the status (i.e. 'df'), it should return the current status. > 2) On a compressed filesystem you can choose to have parts of it > uncompressed by file / directory attributes, I think. BTRFS can't > know how much of the > future data you are going to store compressed or uncompressed. Same as above. If the user sees 18GB free space and has 20GB of data to write, it is up to them to determine whether or not compression will allow it to fit. > 3) From what I gathered it is planned to allow different raid / > redundancy levels for different subvolumes. BTRFS can't know > beforehand where applications request to save future data, i.e. > in which subvolume. Same as above. If a user will be requesting to use a specific subvolume, it is up to them to verify that adequate free space exists there, or handle the exception. > 4) Even on a conventional filesystem the free space is an estimate, > because it cannot predict the activity of other processes writing > to the filesystem. 
> You may have 10 GiB free at some point, but if > another process is writing another 5 GiB at the same time > as your process is writing, your process will see less and less > of the estimated 10 GiB free, and if it wanted to write 10 GiB > it will not be able to. Same as above. This is normal for multi-user systems. It happens. There's no way around it, and other file-systems don't try. > What might be possible, but still has the limitation of the fourth > point above, would be a query: how much free space do you have > *right now*, on this directory path, if I write with standard > settings? That's what the 'df' command is supposed to return, and what it DOES return on other file-systems, including file-systems that support compression. > But the only guarantee you can ever get is to pre-allocate your files > with fallocate. When the fallocate call succeeds, you get a guarantee > that you can write the allocated amount of data into the file. > Whether BTRFS can hold to that guarantee in every case? That depends on > how bug-free its free space handling is in that regard. This is the same as in all other file-systems. > And in case you do not need all the fallocated space, other processes > may not be able to write data anymore, even though there would be free > space inside your fallocated files. Again, this is the same as in other file-systems. > That written: filling up a filesystem to 100% will considerably limit > the performance of any filesystem known to me and asks for further > trouble. So better have at least 10-20% of the space free, except > maybe for very large filesystems Once again, this is the normal recommendation for most (all?) file-systems. As they fill, they get less efficient. The impact is minimal for a while, then the curve hits a knee and performance drops. Some file-systems have a setting to only allow the ROOT user to exceed a specified percentage of file-system use. 
Peter Ashford ^ permalink raw reply [flat|nested] 28+ messages in thread
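[Editor's note: the current-status query described above is the statfs/statvfs interface that df itself uses; a minimal sketch (the path and function name are just examples). It also shows where the root-reserved percentage mentioned above appears in the numbers.]

```python
# df-style snapshot of free space: valid only at the instant of the
# call, with no promise about what concurrent writers do next.
import os

def free_bytes(path):
    st = os.statvfs(path)
    # f_bavail = blocks available to unprivileged users (df's "Avail");
    # f_bfree additionally counts any root-reserved blocks.
    return st.f_bavail * st.f_frsize

print(free_bytes("/") >= 0)  # -> True; the number is stale immediately
```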
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-07 18:20 ` ashford @ 2014-12-07 18:34 ` Hugo Mills 2014-12-07 18:48 ` Martin Steigerwald ` (2 more replies) 2014-12-07 18:38 ` Martin Steigerwald 1 sibling, 3 replies; 28+ messages in thread From: Hugo Mills @ 2014-12-07 18:34 UTC (permalink / raw) To: ashford; +Cc: Martin Steigerwald, Shriramana Sharma, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 1278 bytes --] On Sun, Dec 07, 2014 at 10:20:27AM -0800, ashford@whisperpc.com wrote: [snip] > > 3) From what I gathered it is planned to allow different raid / > > redundancy levels for different subvolumes. BTRFS can't know > > beforehand where applications request to save future data, i.e. > > in which subvolume. > > Same as above. > > If a user will be requesting to use a specific subvolume, it is up to them > to verify that adequate free space exists there, or handle the exception. OK, so let's say I've got a filesystem with 100 GiB of unallocated space. I have two subvolumes, one configured for RAID-1 and one configured for single storage. What number should be shown in the free output of df? 100 GiB? But I can only write 50 GiB to the RAID-1 subvolume before it runs out of space. 50 GiB? I can get twice that much on the single subvolume. *Any* value shown here is going to be inaccurate, and whatever way round we show it, someone will complain. Hugo. -- Hugo Mills | My doctor tells me that I have a malformed hugo@... carfax.org.uk | public-duty gland, and a natural deficiency in moral http://carfax.org.uk/ | fibre. PGP: 65E74AC0 | Zaphod Beeblebrox [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 28+ messages in thread
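[Editor's note: the arithmetic behind Hugo's example, spelled out. The profile factors are the standard replication counts for btrfs single and RAID-1 data; the function name is invented for illustration.]

```python
# The same pool of unallocated bytes yields different "free" answers
# depending on which replication profile the data would be written with.
GIB = 1024 ** 3
PROFILE_COPIES = {"single": 1, "raid1": 2}  # copies stored per profile

def writable_bytes(unallocated, profile):
    # RAID-1 stores two copies, so only half the raw space is writable.
    return unallocated // PROFILE_COPIES[profile]

unallocated = 100 * GIB
print(writable_bytes(unallocated, "single") // GIB)  # -> 100
print(writable_bytes(unallocated, "raid1") // GIB)   # -> 50
```

Whichever single figure df reports, users of the other subvolume will find it wrong by a factor of two, which is exactly Hugo's point.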
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-07 18:34 ` Hugo Mills @ 2014-12-07 18:48 ` Martin Steigerwald 2014-12-07 19:39 ` ashford 2014-12-08 5:17 ` Chris Murphy 2 siblings, 0 replies; 28+ messages in thread From: Martin Steigerwald @ 2014-12-07 18:48 UTC (permalink / raw) To: Hugo Mills, ashford, Shriramana Sharma, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 3087 bytes --] On Sunday, 7 December 2014, 18:34:44, Hugo Mills wrote: > On Sun, Dec 07, 2014 at 10:20:27AM -0800, ashford@whisperpc.com wrote: > [snip] > > > > 3) From what I gathered it is planned to allow different raid / > > > redundancy levels for different subvolumes. BTRFS can´t know > > > beforehand where applications request to save future data, i.e. > > > in which subvolume. > > > > Same as above. > > > > If a user will be requesting to use a specific subvolume, it is up to them > > to verify that adequate free space exists there, or handle the exception. > > OK, so let's say I've got a filesystem with 100 GiB of unallocated > space. I have two subvolumes, one configured for RAID-1 and one > configured for single storage. > > What number should be shown in the free output of df? > > 100 GiB? But I can only write 50 GiB to the RAID-1 subvolume before > it runs out of space. > > 50 GiB? I can get twice that much on the single subvolume. > > *Any* value shown here is going to be inaccurate, and whatever way > round we show it, someone will complain. That's why I pointed out fallocate. If it succeeds, I would expect that even BTRFS, with its special free space challenges, guarantees the space is there. A getfreespacebypath() syscall may yield quite accurate figures as well, because then BTRFS can know which subvolume the application wants to write to, but as it cannot predict the future write behavior of all processes, it cannot guarantee anything. So for any guarantee, as far as I know, the only thing you can do is fallocate. 
I never liked the pre-installation check that roughly 5 GiB or so of free space must be available for it to succeed. But on the other hand, running 90% through an installation and then failing due to not enough free space is also not nice. Similarly with copying 4 GiB of a 4.3 GB DVD image to a FAT32 filesystem before the copy fails. But for FAT32 it is much easier to know it can´t write a file larger than 4 GiB than for BTRFS or any other filesystem to know whether installing a set of files to a set of directories with a certain total size is going to work out. The only thing that could improve this would be some kind of more flexible space allocation. Or… creating every directory and fallocating each file with the exact size before doing the actual copy. And heck, somehow I like this idea. It could help to avoid actions that just do not make sense. An rsync could abort early if the total free space would not be enough. But this again doesn´t work for rsync, as it works incrementally since version 3. On the other hand, if btrfs send could store size requirements in the send data before the receive side starts working, then the receive side could preallocate, but… that would depend on how easy it would be for the send side to get the size difference between two snapshots. Thanks, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 181 bytes --] ^ permalink raw reply [flat|nested] 28+ messages in thread
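Martin's "fallocate everything first" idea can be exercised from user space today; a minimal, hypothetical sketch using Python's os.posix_fallocate (the file names are arbitrary):

```python
import os
import tempfile

def reserve(path, nbytes):
    """Create `path` and pre-allocate nbytes for it up front.
    Raises OSError (e.g. ENOSPC) immediately, instead of letting a
    later bulk write fail halfway through."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        # On success the filesystem guarantees the space is really there.
        os.posix_fallocate(fd, 0, nbytes)
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "reserved.bin")
    reserve(target, 1 << 20)  # ask for 1 MiB before copying anything
```

A copy tool could walk its whole file list calling reserve() first, and only then start moving data, which is exactly the early-abort behavior described above.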
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-07 18:34 ` Hugo Mills 2014-12-07 18:48 ` Martin Steigerwald @ 2014-12-07 19:39 ` ashford 2014-12-08 5:17 ` Chris Murphy 2 siblings, 0 replies; 28+ messages in thread From: ashford @ 2014-12-07 19:39 UTC (permalink / raw) To: Hugo Mills, ashford, Martin Steigerwald, Shriramana Sharma, linux-btrfs > On Sun, Dec 07, 2014 at 10:20:27AM -0800, ashford@whisperpc.com wrote: > [snip] >> > 3) From what I gathered it is planned to allow different raid / >> > redundancy levels for different subvolumes. BTRFS can´t know >> > beforehand where applications request to save future data, i.e. >> > in which subvolume. >> >> Same as above. >> >> If a user will be requesting to use a specific subvolume, it is up to >> them >> to verify that adequate free space exists there, or handle the >> exception. > > OK, so let's say I've got a filesystem with 100 GiB of unallocated > space. I have two subvolumes, one configured for RAID-1 and one > configured for single storage. > > What number should be shown in the free output of df? > > 100 GiB? But I can only write 50 GiB to the RAID-1 subvolume before > it runs out of space. > > 50 GiB? I can get twice that much on the single subvolume. > > *Any* value shown here is going to be inaccurate, and whatever way > round we show it, someone will complain. As an example, let's assume that the file-system is mounted as /data, with a non-mirrored subvolume of /data/1 and a mirrored subvolume of /data/2. A df should show 100GiB free in both /data and /data/1, and 50GiB free in /data/2. Peter Ashford ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-07 18:34 ` Hugo Mills 2014-12-07 18:48 ` Martin Steigerwald 2014-12-07 19:39 ` ashford @ 2014-12-08 5:17 ` Chris Murphy 2 siblings, 0 replies; 28+ messages in thread From: Chris Murphy @ 2014-12-08 5:17 UTC (permalink / raw) To: Hugo Mills, ashford, Martin Steigerwald, Shriramana Sharma, linux-btrfs On Sun, Dec 7, 2014 at 11:34 AM, Hugo Mills <hugo@carfax.org.uk> wrote: > *Any* value shown here is going to be inaccurate, and whatever way > round we show it, someone will complain. Yeah I'd suggest that for the regular df command, when multiple device volumes exist, they're shown with ?? for the Avail and Use% columns. Maybe one day df can show more columns: AvailS/R0, AvailR1, AvailR5, AvailR6. Today: # btrfs fi show Label: 'test' uuid: fb7df1f2-480d-4426-84f1-8aed197700e4 Total devices 2 FS bytes used 1.38GiB devid 1 size 5.00GiB used 2.28GiB path /dev/loop0 devid 2 size 5.00GiB used 2.28GiB path /dev/loop1 # df -h Filesystem Size Used Avail Use% Mounted on /dev/loop0 5.0G 1.4G 4.0G 27% /mnt # btrfs fi df /mnt Data, RAID1: total=2.00GiB, used=1.38GiB System, RAID1: total=32.00MiB, used=16.00KiB Metadata, RAID1: total=256.00MiB, used=1.61MiB GlobalReserve, single: total=16.00MiB, used=0.00B Perhaps something like this instead: # btrfs fi df /mnt Available Space Range: 5.4GiB - 7.2 GiB RAID0 Available: ~5.1GiB Data: allocated=1.0GiB, used=0.1GiB System: allocated=32MiB, used=16KiB Metadata: allocated=256MiB, used=KiB RAID1 Available: ~2.7GiB Data: allocated=2.00GiB, used=1.38GiB System: allocated=32.00MiB, used=16.00KiB Metadata: allocated=256.00MiB, used=1.61MiB SINGLE Available: GlobalReserve: allocated=16.00MiB, used=0.00B 1. There's 10.0GiB of total space for two devices in the volume. 2. There's 1.3GiB raid0 chunks allocated, and 4.6GiB raid1 chunks allocated, so 5.9 GiB is allocated. 3. 
To show available space for any particular profile, we have to subtract out all the allocated chunks, and then add back in the free space only for that profile's chunks. 4. The resulting value needs to be multiplied by that profile's "replication factor" e.g. 1 for single and raid0, 0.5 for raid1, and for raid 5/6 it depends not just on the number of devices, but the mix of chunks striped across n disks since Btrfs allows chunks to be striped across any number of disks so long as it meets the minimum, i.e. there can be a dozen raid5 chunks striped across 3 drives, and another 1/2 dozen chunks striped across 4 drives. Only a balance would redistribute such that each chunk has the same number of stripes. For above it's raid0 available: [10.00 total - 5.9 allocated + the free space in only raid0 chunks] * 1 raid1 available: [10.00 total - 5.9 allocated + the free space in only raid1 chunks] * 0.5 -- Chris Murphy ^ permalink raw reply [flat|nested] 28+ messages in thread
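Chris's steps 3 and 4 are plain arithmetic; a sketch (the function and figures are mine, taken from the example output above, and the metadata/System chunk slack is ignored for brevity):

```python
def profile_avail(total, allocated, slack_in_profile_chunks, factor):
    """Estimated writable space for one profile: subtract all allocated
    chunks, add back the free space inside that profile's own chunks,
    then scale by the replication factor (1 for single/raid0, 0.5 for
    two-copy raid1)."""
    return (total - allocated + slack_in_profile_chunks) * factor

# 10 GiB raw on two 5 GiB devices, 5.9 GiB of chunks allocated.
raid0 = profile_avail(10.0, 5.9, 1.0 - 0.1, 1.0)    # raid0 data slack only
raid1 = profile_avail(10.0, 5.9, 2.0 - 1.38, 0.5)   # raid1 data slack only
```

With these inputs the estimates land near the ~5.1 GiB and ~2.7 GiB figures shown above; the remaining gap is the metadata slack the sketch leaves out.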
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-07 18:20 ` ashford 2014-12-07 18:34 ` Hugo Mills @ 2014-12-07 18:38 ` Martin Steigerwald 2014-12-07 19:44 ` ashford 1 sibling, 1 reply; 28+ messages in thread From: Martin Steigerwald @ 2014-12-07 18:38 UTC (permalink / raw) To: ashford; +Cc: Shriramana Sharma, linux-btrfs On Sunday, 7 December 2014, 10:20:27, ashford@whisperpc.com wrote: > Martin, > > > I read that the actual > > what is free is unknown. And there are several reasons for that: > > > > 1) On a compressed filesystem you cannot know, but only estimate the > > compression ratio for future data. > > It is NOT the job of BTRFS, or ANY file-system, to try to predict the > future. The future is unknown. Don't try to account for it. When asked > for the status (i.e. 'df'), it should return the current status. > > > 2) On a compressed filesystem you can choose to have parts of it > > uncompressed by file / directory attributes, I think. BTRFS can't > > know how much of the > > future data you are going to store compressed or uncompressed. > > Same as above. What is the point you are trying to make? I just described the problems there are with trying to predict available free space, with BTRFS as an example. Some points apply to all filesystems, some do not, so what is the point you are trying to make? -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-07 18:38 ` Martin Steigerwald @ 2014-12-07 19:44 ` ashford 0 siblings, 0 replies; 28+ messages in thread From: ashford @ 2014-12-07 19:44 UTC (permalink / raw) To: Martin Steigerwald; +Cc: ashford, Shriramana Sharma, linux-btrfs > On Sunday, 7 December 2014, 10:20:27, ashford@whisperpc.com wrote: >> Martin, >> >> > I read that the actual >> > what is free is unknown. And there are several reasons for that: >> > >> > 1) On a compressed filesystem you cannot know, but only estimate the >> > compression ratio for future data. >> >> It is NOT the job of BTRFS, or ANY file-system, to try to predict the >> future. The future is unknown. Don't try to account for it. When >> asked >> for the status (i.e. 'df'), it should return the current status. >> >> > 2) On a compressed filesystem you can choose to have parts of it >> > uncompressed by file / directory attributes, I think. BTRFS can't >> > know how much of the >> > future data you are going to store compressed or uncompressed. >> >> Same as above. > > What is the point you are trying to make? > > I just described the reasons on what problems there are with trying to > predict > available free space, with BTRFS as an example. Some points apply to all > filesystems, some do not, so what is the point you are trying to make? My point is that you don't try to predict, as that's a guaranteed path to failure. You deliver what you know. This is what every other file-system does. There's no reason to do anything else. Peter Ashford ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-07 15:33 ` Martin Steigerwald ` (2 preceding siblings ...) 2014-12-07 18:20 ` ashford @ 2014-12-07 19:19 ` Goffredo Baroncelli 2014-12-07 20:32 ` ashford 3 siblings, 1 reply; 28+ messages in thread From: Goffredo Baroncelli @ 2014-12-07 19:19 UTC (permalink / raw) To: Shriramana Sharma; +Cc: Martin Steigerwald, linux-btrfs On 12/07/2014 04:33 PM, Martin Steigerwald wrote: > Hi Shriramana! > > On Sunday, 7 December 2014, 20:45:59, Shriramana Sharma wrote: >>> IIUC: >>> >>> 1) btrfs fi df already shows the alloc-ed space and the space >>> used out of that. >>> >>> 2) Despite snapshots, CoW and compression, the tree knows how >>> many extents of data and metadata there are, and how many bytes >>> on disk these occupy, no matter what is the total (uncompressed, >>> "unsnapshotted") size of all the directories and files on the >>> disk. >>> >>> So this means that btrfs fi df actually shows the real on-disk >>> usage. In this case, why do we hear people saying it's not >>> possible to know the actual on-disk usage and when a >>> btrfs-formatted disk (or partition) will go out of space? > I never read that the actual disk usage is unknown. But I read that > the actual what is free is unknown. And there are several reasons > for that: > > 1) On a compressed filesystem you cannot know, but only estimate the > compression ratio for future data. > > 2) On a compressed filesystem you can choose to have parts of it > uncompressed by file / directory attributes, I think. BTRFS can´t > know how much of the future data you are going to store compressed > or uncompressed. > > 3) From what I gathered it is planned to allow different raid / > redundancy levels for different subvolumes. BTRFS can´t know > beforehand where applications request to save future data, i.e. in > which subvolume. 
3.1) even in the case of a single disk filesystem, data and metadata have different profiles: the data chunk doesn't have any redundancy, so 64kb of data consume 64kb of disk space. The metadata chunks usually are stored as DUP, so 64kb of metadata consume 128kb on disk. Moreover you have to consider that small files are stored in metadata chunks. This means that for a big file the disk space consumed is equal to the data size, but for a small file it is doubled. Going back to your request, to be clearer I use the following terms: 1- disk space used: the space used on the disk 2- size of data: the size of the data stored on the disks 3- disk free space: the unused space of the disk 4- free space: the size of data that the system is able to contain The values 1, 2, 3 are known. What is unknown is point 4. In the past I posted some patches which try to estimate point 4 as: free_space = disk_free_space * (size_of_data / disk_space_used) This estimation assumes that the ratio size_of_data/disk_space_used is constant. But for the points above this assumption may be wrong. In conclusion, the disk usage is well known; what is unknown is the space that is available to the user (who is uninterested in all the details inside a filesystem). The best that is doable is an estimation like the above one. BR Goffredo -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 28+ messages in thread
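Goffredo's estimator is one line of arithmetic; a sketch (the function name is mine), with the assumption he states -- that future data behaves like past data -- noted in the comment:

```python
def estimated_free(disk_free_space, size_of_data, disk_space_used):
    """Estimate how much *data* still fits by scaling the raw free
    space with the observed size_of_data/disk_space_used ratio.
    Assumes future writes compress and duplicate like past ones did."""
    if disk_space_used == 0:
        return disk_free_space  # no history to extrapolate from yet
    return disk_free_space * size_of_data / disk_space_used

# 2:1 compression so far: 200 units of data fit in 100 units of disk,
# so 100 free units are predicted to hold ~200 more units of data.
assert estimated_free(100, 200, 100) == 200
# DUP-metadata-heavy usage (1:2) shrinks the prediction instead.
assert estimated_free(100, 50, 100) == 50
```

The ratio drifts as the workload changes, which is exactly the objection raised in the reply below: it is a prediction, not a measurement.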
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-07 19:19 ` Goffredo Baroncelli @ 2014-12-07 20:32 ` ashford 2014-12-07 23:01 ` Goffredo Baroncelli 2014-12-08 8:18 ` Chris Murphy 0 siblings, 2 replies; 28+ messages in thread From: ashford @ 2014-12-07 20:32 UTC (permalink / raw) To: kreijack; +Cc: Shriramana Sharma, Martin Steigerwald, linux-btrfs > 3.1) even in the case of a single disk filesystem, data and metadata > have different profiles: the data chunk doesn't have any redundancy, > so 64kb of data consume 64kb of disk space. The metadata chunks > usually are stored as DUP, so 64kb of metadata consume 128kb on disk. > Moreover you have to consider that small files are stored in metadata > chunk. This means that for big file the disk space consumed is equal > to the data size, but for small file this is doubled. As there's no way to predict what the user will be doing, I see no reason to do anything except return the actual amount of free space. > Going back to your request, to be more clear I used the following terms: > 1- disk space used: the space used on the disk > 2- size of data: the size of the data stored on the disks > 3- disk free space: the unused space of the disk > 4- free space: the size of data that the system is able to contain > > The value 1,2,3 are known. Which is unknown is the point 4. In > the past I posted some patch which try to estimate the point 4 as: > > free_space = disk_free_space * (size_of_data / disk_space_used) > > This estimation assumes that the ratio size_of_data/disk_space_used > is constant. But for the point above this assumption may be wrong. While I expect that this is the best simple prediction, it's still a prediction, with all the possible problems that a prediction entails. My contention is that predictions should be avoided whenever possible. 
> In conclusion, the disk usage is well known; which is unknown is > the space that is available to the user (who is uninterested to > all the details inside a filesystem). The best that is doable > is an estimation like the above one. I disagree. My experiences with other file-systems, including ZFS, show that the most common solution is to just deliver to the user the actual amount of unused disk space. Anything else changes this known value into a guess or prediction. Peter Ashford ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-07 20:32 ` ashford @ 2014-12-07 23:01 ` Goffredo Baroncelli 2014-12-08 0:12 ` ashford 2014-12-08 8:18 ` Chris Murphy 1 sibling, 1 reply; 28+ messages in thread From: Goffredo Baroncelli @ 2014-12-07 23:01 UTC (permalink / raw) To: ashford; +Cc: Shriramana Sharma, Martin Steigerwald, linux-btrfs On 12/07/2014 09:32 PM, ashford@whisperpc.com wrote: >> In conclusion, the disk usage is well known; which is unknown is >> > the space that is available to the user (who is uninterested to >> > all the details inside a filesystem). The best that is doable >> > is an estimation like the above one. > I disagree. My experiences with other file-systems, including ZFS, show > that the most common solution is to just deliver to the user the actual > amount of unused disk space. Anything else changes this known value into > a guess or prediction. So in case you have a raid1 filesystem on two disks, and each disk has 300GB free: what is the free space that you expect, 300GB or 600GB, and why? > > Peter Ashford BR G.Baroncelli > > -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-07 23:01 ` Goffredo Baroncelli @ 2014-12-08 0:12 ` ashford 2014-12-08 2:42 ` Qu Wenruo 2014-12-08 14:34 ` Goffredo Baroncelli 0 siblings, 2 replies; 28+ messages in thread From: ashford @ 2014-12-08 0:12 UTC (permalink / raw) To: kreijack; +Cc: ashford, Shriramana Sharma, Martin Steigerwald, linux-btrfs Goffredo, > So in case you have a raid1 filesystem on two disks; each disk has 300GB > free; which is the free space that you expected: 300GB or 600GB and why ? You should see 300GB free. That's what you'll see with RAID-1 with a hardware RAID controller, and with MD RAID. Why would you expect to see anything else with BTRFS RAID? Peter Ashford ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-08 0:12 ` ashford @ 2014-12-08 2:42 ` Qu Wenruo 2014-12-08 8:12 ` ashford 2014-12-08 14:34 ` Goffredo Baroncelli 1 sibling, 1 reply; 28+ messages in thread From: Qu Wenruo @ 2014-12-08 2:42 UTC (permalink / raw) To: ashford, kreijack; +Cc: Shriramana Sharma, Martin Steigerwald, linux-btrfs -------- Original Message -------- Subject: Re: Why is the actual disk usage of btrfs considered unknowable? From: <ashford@whisperpc.com> To: <kreijack@inwind.it> Date: 2014-12-08 08:12 > Goffredo, > >> So in case you have a raid1 filesystem on two disks; each disk has 300GB >> free; which is the free space that you expected: 300GB or 600GB and why ? > You should see 300GB free. That's what you'll see with RAID-1 with a > hardware RAID controller, and with MD RAID. Why would you expect to see > anything else with BTRFS RAID? > > Peter Ashford Yeah, you pointed out the real problem here: [DIFFERENT RESULT FROM DIFFERENT VIEW] Seen from the *PURE ON-DISK* usage view, it is still 600G, no matter what level of RAID. Seen from the *BLOCK LEVEL RAID1* usage view, it is 300G. If a fs (not btrfs) is built on BLOCK LEVEL RAID1, then the *FILESYSTEM* usage will also be 300G. [BTRFS DOES NOT BELONG TO ANY TYPE] But btrfs is neither pure block level management (that would be MD or HW RAID or LVM), nor a traditional filesystem!! So the root of the problem is that btrfs mixes the roles of block level management and filesystem level management, which makes everything hard to understand. You can't treat btrfs raid1 as complete block level raid1, due to its flexibility in having different metadata/data profiles. If the vanilla df command shows filesystem level free space, then btrfs won't give an accurate one. 
[ONLY PREDICTABLE CASE] For the 300Gx2 case for btrfs, you can consider it 300G of free space only if you can ensure that there was/is and will be only RAID1 data/metadata stored on it (you also need to ignore small space usage from CoW). [RELIABLE DATA IS ON-DISK USAGE] Only pure on-disk level usage is *a little* reliable. There is still the problem of unbalanced metadata/data chunk allocation (e.g. all space is allocated for data, leaving no space for metadata CoW writes). [SIMILAR FEATURE CASE] The only case where I see a similar problem would be a mirrored thin lv (not implemented yet) and a normal thin lv competing for a thin pool. Although not implemented, I think even if it were, admins may not complain so much, since LVM doesn't report free space, only used space, in the thin pool case. Thanks, Qu > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-08 2:42 ` Qu Wenruo @ 2014-12-08 8:12 ` ashford 0 siblings, 0 replies; 28+ messages in thread From: ashford @ 2014-12-08 8:12 UTC (permalink / raw) To: Qu Wenruo Cc: ashford, kreijack, Shriramana Sharma, Martin Steigerwald, linux-btrfs > > -------- Original Message -------- > Subject: Re: Why is the actual disk usage of btrfs considered unknowable? > From: <ashford@whisperpc.com> > To: <kreijack@inwind.it> > Date: 2014-12-08 08:12 >> Goffredo, >> >>> So in case you have a raid1 filesystem on two disks; each disk has >>> 300GB >>> free; which is the free space that you expected: 300GB or 600GB and why >>> ? >> You should see 300GB free. That's what you'll see with RAID-1 with a >> hardware RAID controller, and with MD RAID. Why would you expect to see >> anything else with BTRFS RAID? >> >> Peter Ashford > Yeah, you pointed out the real problem here: > > [DIFFERENT RESULT FROM DIFFERENT VIEW] > See from *PURE ON-DISK* usage, it is still 600G, no matter what level of > RAID. > See from *BLOCK LEVEL RAID1* usage, it is 300G. If fs(not btrfs) is > built on BLOCK LEVEL RAID1, > then the *FILESYSTEM* usage will also be 300G > > [BTRFS DOES NOT BELONG TO ANY TYPE] > But, btrfs is neither pure block level management(that should be MD or > HW RAID or LVM), nor a > traditional filesystem!! For the purposes of reporting free space, it is reasonable to assume that the default structure will be used. If the default for the volume or subvolume is RAID-1, then that should be used for 'df' output. Obviously, the same should be done for other RAID levels. > So the root of the problem is, btrfs mixes the position of block level > management and filesystem level > management, which makes everything hard to understand. > You can't treat btrfs raid1 as a complete block level raid1, due to its > flexibility on metadata/data profile different. 
It will have the same discrepancies as other file-systems with compression, plus a few more of its own, due to chunking. If the file-system can't give a completely accurate answer, it should give one that makes sense. > If the vanilla df command shows filesystem level freespace, then btrfs won't > give an accurate one. > > [ONLY PREDICTABLE CASE] > For the 300Gx2 case for btrfs, you can consider it 300G of free space > only if you can ensure that > there was/is and will be only RAID1 data/metadata stored on it (also > need to ignore small space usage from CoW) I disagree. You can consider the RAID structure to be whatever the default structure is. If the default is RAID-1, then that structure should be used to compute the free space for 'df'. The user should understand that by explicitly requesting a different RAID structure, different amounts of space will be used. > [RELIABLE DATA IS ON-DISK USAGE] > Only pure on-disk level usage is *a little* reliable. There is still > the problem of unbalanced metadata/data chunk > allocation (e.g. all space is allocated for data, no space for > metadata CoW writes). I agree. Unused disk space isn't always available to be used by data. Sometimes it's reserved for metadata of one sort or another, and sometimes it's too small to be of use. In addition, BTRFS sometimes (with small files) uses the Metadata chunks for data. Yes, it's a complex problem. There is no simple solution that will make everyone happy. --------------------------------- As for the 'df' output, I believe that the default should be the sum of free space in data chunks, free space in metadata chunks and unallocated space, ignoring any amounts that are small enough that BTRFS won't use them, and adjusted for the RAID level of the volume/subvolume. While it's possible to generate other values that will make sense for specific cases, it's not possible to create one value that is correct in all cases. 
If it's not possible to be absolutely correct, considering every usage (or even the most common usages), a 'reasonable' value should be returned. That reasonable value should be based on the default volume/subvolume settings, including RAID levels and any space limits that may exist on the volume or subvolume. It should neither be the most optimistic nor the most pessimistic. Peter Ashford ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-08 0:12 ` ashford 2014-12-08 2:42 ` Qu Wenruo @ 2014-12-08 14:34 ` Goffredo Baroncelli 1 sibling, 0 replies; 28+ messages in thread From: Goffredo Baroncelli @ 2014-12-08 14:34 UTC (permalink / raw) To: ashford; +Cc: Shriramana Sharma, Martin Steigerwald, linux-btrfs On 12/08/2014 01:12 AM, ashford@whisperpc.com wrote: > Goffredo, > >> So in case you have a raid1 filesystem on two disks; each disk has 300GB >> free; which is the free space that you expected: 300GB or 600GB and why ? > > You should see 300GB free. That's what you'll see with RAID-1 with a > hardware RAID controller, and with MD RAID. Why would you expect to see > anything else with BTRFS RAID? I had to ask you because in one of your previous emails you stated something different: On 12/07/2014 09:32 PM, ashford@whisperpc.com wrote: > I disagree. My experiences with other file-systems, including ZFS, show > that the most common solution is to just deliver to the user the actual > amount of *unused disk space* ^^^^^^^^^^^^^^^^^^^ So I expected that you would answer with 600GB. But you have told the truth: the user wants to know how much data he is able to store on the disk, and not the unused disk space. But I have to point out that the common case is a one-disk filesystem where the metadata chunks have a ratio of data stored to disk space consumed of 1:2, while the data chunks have a ratio of 1:1. This is one reason why it is difficult to evaluate the free space: if you have all metadata chunks, you have to halve the disk space. Another reason is that there is the idea to allow different raid profiles in the same filesystem. This will further complicate the free space evaluation. 
> Peter Ashford G.Baroncelli > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 28+ messages in thread
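The 1:1 data and 1:2 DUP-metadata ratios Goffredo describes at least bound the answer on a single disk; a hypothetical sketch (not a btrfs API):

```python
def free_space_bounds(unallocated_bytes):
    """Best and worst case for how much more can be stored on a
    single-disk filesystem with DUP metadata: best case every byte
    lands in data chunks (1:1); worst case everything becomes DUP
    metadata (1:2, each byte written twice)."""
    worst = unallocated_bytes // 2
    best = unallocated_bytes
    return worst, best

worst, best = free_space_bounds(100 << 30)  # 100 GiB unallocated
```

Any single number df reports has to fall somewhere inside this interval, which is why every choice is arguable.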
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-07 20:32 ` ashford 2014-12-07 23:01 ` Goffredo Baroncelli @ 2014-12-08 8:18 ` Chris Murphy 1 sibling, 0 replies; 28+ messages in thread From: Chris Murphy @ 2014-12-08 8:18 UTC (permalink / raw) To: ashford; +Cc: kreijack, Shriramana Sharma, Martin Steigerwald, linux-btrfs On Sun, Dec 7, 2014 at 1:32 PM, <ashford@whisperpc.com> wrote: > > I disagree. My experiences with other file-systems, including ZFS, show > that the most common solution is to just deliver to the user the actual > amount of unused disk space. Anything else changes this known value into > a guess or prediction. What is the "actual amount of unused disk space" in a 2x 8GB drives mirror? Very literally, it's 16GB. It's a convenience subtracting the space used for replication (the n mirror copies, or parity). This is in fact how df reported Btrfs volumes with kernel 3.16 and older. A ZFS mirror vdev doesn't work this way, it reports available space as 8GB. The level of replication and number of devices is a function of the vdev, and is fixed. It can't be changed. With Btrfs there isn't a zpool vs vdev type of distinction, and replication level isn't a function of volume but rather that of chunks. At some future point there will be a way to supply a hint (per subvolume, maybe per directory or per file) for the allocator to put the file in a particular chunk which has a particular level of replication and number of devices. And that means "available space" isn't knowable. -- Chris Murphy ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Why is the actual disk usage of btrfs considered unknowable? 2014-12-07 15:15 Why is the actual disk usage of btrfs considered unknowable? Shriramana Sharma 2014-12-07 15:33 ` Martin Steigerwald @ 2014-12-08 4:59 ` Robert White 2014-12-08 6:43 ` Zygo Blaxell 2 siblings, 0 replies; 28+ messages in thread From: Robert White @ 2014-12-08 4:59 UTC (permalink / raw) To: Shriramana Sharma, linux-btrfs On 12/07/2014 07:15 AM, Shriramana Sharma wrote: > IIUC: > > 1) btrfs fi df already shows the alloc-ed space and the space used out of that. > > 2) Despite snapshots, CoW and compression, the tree knows how many > extents of data and metadata there are, and how many bytes on disk > these occupy, no matter what is the total (uncompressed, > "unsnapshotted") size of all the directories and files on the disk. > I tried to answer this last time. So let's do a thought experiment... You have an essentially full filesystem. Then the last two extents are allocated: one is a 1Gb extent for data and the other a 256Mb extent for metadata. How much space on the disk is "free"? Is it 1Gb for the data extent, is it 256Mb for the metadata extent, is it 1280Mb for the combination of data and metadata, or is it _zero_ for the complete absence of blocks that can be allocated into extents? How about if I allocate 1Gb of data space and there is 512Mb of unallocated space, which is enough room for two more metadata extents but not enough room for another data extent? Is the drive "full" when you fill that last 1Gb? After all, you cannot write more data to the disk, but you can write more metadata. If I start deleting files, and thereby create gaps in the previously allocated extents, are those gaps "free"? They are purposed but available for their respective uses. Subtracting blocks allocated from blocks on media doesn't give you the "real" answer to what is or isn't "free". If there are a leftover two-dozen sectors that won't fit in _any_ kind of extent, are those sectors "Free" or are they just leftovers? 
In real property terms: if I hold an easement on your driveway and you want to expand your house, how much of your property can be used for the expansion? My rights to your driveway don't count against you when meeting the "undeveloped land" calculation of your local zoning board, but you can't build any house-bits on that driveway since I hold a right to use it, so it does count against the available square feet you can design over.

"Free space" isn't the simple proposition you imagine, because "free for what purpose" and "free in what sense" both have to be answered. So the system estimates, and it does so in different ways for different purposes.

If you have a means in mind to resolve these conflicts, we'd love to see the rationale and even the code...
* Re: Why is the actual disk usage of btrfs considered unknowable?
  2014-12-07 15:15 Why is the actual disk usage of btrfs considered unknowable? Shriramana Sharma
  2014-12-07 15:33 ` Martin Steigerwald
  2014-12-08  4:59 ` Robert White
@ 2014-12-08  6:43 ` Zygo Blaxell
  2 siblings, 0 replies; 28+ messages in thread
From: Zygo Blaxell @ 2014-12-08 6:43 UTC (permalink / raw)
  To: Shriramana Sharma, i; +Cc: linux-btrfs

On Sun, Dec 07, 2014 at 08:45:59PM +0530, Shriramana Sharma wrote:
> IIUC:
>
> 1) btrfs fi df already shows the alloc-ed space and the space used out of that.
>
> 2) Despite snapshots, CoW and compression, the tree knows how many
> extents of data and metadata there are, and how many bytes on disk
> these occupy, no matter what is the total (uncompressed,
> "unsnapshotted") size of all the directories and files on the disk.
>
> So this means that btrfs fi df actually shows the real on-disk usage.
> In this case, why do we hear people saying it's not possible to know
> the actual on-disk usage and when a btrfs-formatted disk (or
> partition) will go out of space?

"On-disk usage" is easy -- that's about the past, and can be measured straightforwardly with a single count of bytes. "When a btrfs filesystem will return ENOSPC" is much more complicated -- that's about the future, and depends heavily on the current structure and upcoming modifications of it.

There were some pretty terrible btrfs bugs and warts that were fixed only in the last 5 months or so. Since some of those had been around for a year or more, they gave btrfs a reputation.

The 'df' command (statvfs(2)) would report raw free space instead of an estimate based on the current RAID profile. This confused some badly designed programs that would use statvfs to determine that N bytes of free space were available, and then be surprised when N bytes were not all available for their use.
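[Editor's note] The "badly designed" pattern described above -- trusting statvfs as a promise of writable bytes -- looks like this. The helper name is invented for illustration; `os.statvfs` is Python's wrapper for the statvfs(2) call the message refers to:

```python
# A naive free-space check of the kind described above. On pre-3.17
# kernels with btrfs RAID1, the figure returned here was the raw
# (double-counted) free space, so a program could "verify" N free bytes
# and still hit ENOSPC well before writing N bytes.

import os

def naive_free_bytes(path):
    """Free bytes per statvfs: available blocks times fragment size.
    A hint, not a guarantee -- especially on btrfs."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

print(naive_free_bytes("/"))  # some non-negative byte count
```

Even on filesystems without btrfs's complications, this number is stale the instant it is returned; the only reliable test for "can I write N bytes?" is to write them and handle ENOSPC.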
If you had a btrfs using RAID1, it would report double the amount of space used and available (one for each disk; e.g. 2x1TB disks 75% full would be reported as 2TB capacity with 1.5TB used and 0.5TB free). Now statvfs(2) computes more correct values (1TB capacity with 750GB used and 250GB free).

Some bugs would crash the btrfs cleaner (the thread which removes deleted snapshots) or balance, and would cause the filesystem to prematurely report ENOSPC when (in theory) hundreds of gigabytes were available. These were straight-up bugs that are now fixed.

Modifying the filesystem tree requires free metadata blocks into which to write new CoW nodes for the modified metadata. When you delete something, disk usage goes up for a few seconds before it goes down (if you have snapshots, the "down" part may be delayed until you delete the snapshots). This can lead to surprising "No space left on device" errors from commands like 'rm -rf lots_of_files'. The GlobalReserve chunk type was introduced to reserve a few MB of space on the filesystem to handle such cases. Thankfully, everything above now seems to be fixed.

There is still an issue with heterogeneous chunk allocation. The 'df' command and 'statvfs' syscall report only a single quantity each for used and free space, while in btrfs there are two distinct data types to be stored in two distinct container types -- and, for maximum result irreproducibility, the amount of space allocated to each type is dynamic. Data (file contents) is allocated 1GB at a time, metadata (directory structures, inodes, checksums) is allocated 256MB at a time, and the two types are not interchangeable after allocation. This can cause inaccuracies when reporting free space as the last few free GB are consumed: 256MB might abruptly disappear from free space if you happen to run out of free metadata space and allocate a new metadata chunk instead of a data chunk.
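[Editor's note] The 2x1TB RAID1 numbers above work out as follows (decimal TB, matching the message; the old/new split mirrors the pre/post-3.17 statvfs behaviour described):

```python
# Worked arithmetic for the RAID1 reporting example: 2x 1 TB disks, 75% full.
TB = 10 ** 12  # decimal terabytes, as in the message

disks, size_each, used_fraction = 2, 1 * TB, 0.75

# Old statvfs behaviour: raw device totals, replication double-counted.
old_capacity = disks * size_each               # 2 TB
old_used = int(old_capacity * used_fraction)   # 1.5 TB
old_free = old_capacity - old_used             # 0.5 TB

# Corrected behaviour: report one logical copy.
new_capacity = size_each                       # 1 TB
new_used = int(new_capacity * used_fraction)   # 750 GB
new_free = new_capacity - new_used             # 250 GB

print(old_capacity // TB, old_free)  # 2 500000000000
print(new_capacity // TB, new_free)  # 1 250000000000
```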
The last few KB of a file that does not fill a full 4K block can be stored 'inline' (next to the inode in the metadata tree). If you are low on space in data chunks, you might be able to write a large number of small files using inline metadata to store the file contents, but not an equivalent-sized large file using data extent blocks. If you have lots of free data space but not enough metadata space, you get the opposite result (e.g. you can write new large files but not extend small existing ones).

All of the above happens with RAID, compression and quotas turned *off*. Turning them on makes space usage even harder to analyze (and ENOSPC errors harder to predict) with a single-dimension "available space" metric.
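[Editor's note] The inline-vs-extent split above can be sketched as a threshold. The 2048-byte inline limit here is an assumption for illustration (btrfs's actual limit depends on the `max_inline` mount option and node size), and `blocks_needed` is an invented helper, not a btrfs API:

```python
# Hedged sketch of the inline-extent distinction: a file small enough to
# fit inline lives in the metadata tree and costs no data-chunk space;
# anything larger needs whole 4 KiB data blocks. INLINE_LIMIT is an
# assumed illustrative value, not the exact btrfs default.

INLINE_LIMIT = 2048   # assumed inline threshold (tunable via max_inline)
BLOCK = 4096          # data block size

def data_blocks_needed(file_size):
    """Data blocks a file consumes; 0 if it can be stored inline."""
    if file_size <= INLINE_LIMIT:
        return 0
    return -(-file_size // BLOCK)  # ceiling division

print(data_blocks_needed(100))   # 0 -- fits inline, costs only metadata space
print(data_blocks_needed(5000))  # 2 -- must come out of data chunks
```

This is why "free space" splits along a second axis: a thousand 100-byte files and one 100KB file consume entirely different pools.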
end of thread, other threads:[~2014-12-08 23:14 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-07 15:15 Why is the actual disk usage of btrfs considered unknowable? Shriramana Sharma
2014-12-07 15:33 ` Martin Steigerwald
2014-12-07 15:37 ` Shriramana Sharma
2014-12-07 15:40 ` Martin Steigerwald
2014-12-08  5:32 ` Robert White
2014-12-08  6:20 ` ashford
2014-12-08  7:06 ` Robert White
2014-12-08 14:47 ` Martin Steigerwald
2014-12-08 14:57 ` Austin S Hemmelgarn
2014-12-08 15:52 ` Martin Steigerwald
2014-12-08 23:14 ` Zygo Blaxell
2014-12-07 18:20 ` ashford
2014-12-07 18:34 ` Hugo Mills
2014-12-07 18:48 ` Martin Steigerwald
2014-12-07 19:39 ` ashford
2014-12-08  5:17 ` Chris Murphy
2014-12-07 18:38 ` Martin Steigerwald
2014-12-07 19:44 ` ashford
2014-12-07 19:19 ` Goffredo Baroncelli
2014-12-07 20:32 ` ashford
2014-12-07 23:01 ` Goffredo Baroncelli
2014-12-08  0:12 ` ashford
2014-12-08  2:42 ` Qu Wenruo
2014-12-08  8:12 ` ashford
2014-12-08 14:34 ` Goffredo Baroncelli
2014-12-08  8:18 ` Chris Murphy
2014-12-08  4:59 ` Robert White
2014-12-08  6:43 ` Zygo Blaxell