* btrfs: poor performance on deleting many large files
@ 2015-11-23  1:43 Mitch Fossen
  2015-11-23  6:29 ` Duncan
  2015-11-23 12:59 ` Austin S Hemmelgarn
  0 siblings, 2 replies; 48+ messages in thread
From: Mitch Fossen @ 2015-11-23  1:43 UTC (permalink / raw)
  To: linux-btrfs

Hi all,

I have a btrfs setup of 4x2TB HDDs for /home in btrfs RAID0 on Ubuntu
15.10 (kernel 4.2) and btrfs-progs 4.3.1. Root is on a separate SSD
also running btrfs.

About 6 people use it via ssh and run simulations. One of these
simulations generates a lot of intermediate data that can be discarded
after it is run; it usually ends up being around 100GB to 300GB spread
across dozens of files, 500MB to 5GB apiece.

The problem is that, when it comes time to do a "rm -rf
~/working_directory" the entire machine locks up and sporadically
allows other IO requests to go through, with a 5 to 10 minute delay
before other requests seem to be served. It can end up taking half an
hour or more to fully remove the offending directory, with the hangs
happening frequently enough to be frustrating. This didn't seem to
happen when the system was using ext4 on LVM.

Is there a way to fix this performance issue or at least mitigate it?
Would using ionice and the CFQ scheduler help? As far as I know Ubuntu
uses deadline by default which ignores ionice values.

Alternatively, would balancing and defragging data more often help?
The current mount options are compress=lzo and space_cache, and I will
try it with autodefrag enabled as well to see if that helps.

For now I think I'll recommend that everyone use subvolumes for these
runs and then enable user_subvol_rm_allowed.
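
Roughly, the workflow I have in mind looks something like this (untested,
and the device/paths are just examples):

  # add user_subvol_rm_allowed to the /home mount options, e.g. in /etc/fstab:
  #   /dev/sdb  /home  btrfs  compress=lzo,space_cache,user_subvol_rm_allowed  0  0

  # per run: create a throwaway subvolume instead of a plain directory
  btrfs subvolume create ~/working_directory

  # when the run is done: drop the whole thing at once instead of rm -rf
  btrfs subvolume delete ~/working_directory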

Regards,

Mitch Fossen


* Re: btrfs: poor performance on deleting many large files
  2015-11-23  1:43 btrfs: poor performance on deleting many large files Mitch Fossen
@ 2015-11-23  6:29 ` Duncan
  2015-11-25 21:49   ` Mitchell Fossen
  2015-11-23 12:59 ` Austin S Hemmelgarn
  1 sibling, 1 reply; 48+ messages in thread
From: Duncan @ 2015-11-23  6:29 UTC (permalink / raw)
  To: linux-btrfs

Mitch Fossen posted on Sun, 22 Nov 2015 19:43:28 -0600 as excerpted:

> Hi all,
> 
> I have a btrfs setup of 4x2TB HDDs for /home in btrfs RAID0 on Ubuntu
> 15.10 (kernel 4.2) and btrfs-progs 4.3.1. Root is on a separate SSD also
> running btrfs.
> 
> About 6 people use it via ssh and run simulations. One of these
> simulations generates a lot of intermediate data that can be discarded
> after it is run, it usually ends up being around 100GB to 300GB spread
> across dozens of files 500M to 5GB apiece.
> 
> The problem is that, when it comes time to do a "rm -rf
> ~/working_directory" the entire machine locks up and sporadically allows
> other IO requests to go through, with a 5 to 10 minute delay before
> other requests seem to be served. It can end up taking half an hour or
> more to fully remove the offending directory, with the hangs happening
> frequently enough to be frustrating. This didn't seem to happen when the
> system was using ext4 on LVM.
> 
> Is there a way to fix this performance issue or at least mitigate it?
> Would using ionice and the CFQ scheduler help? As far as I know Ubuntu
> uses deadline by default which ignores ionice values.
> 
> Alternatively, would balancing and defragging data more often help? The
> current mount options are compress=lzo and space_cache, and I will try
> it with autodefrag enabled as well to see if that helps.
> 
> For now I think I'll recommend that everyone use subvolumes for these
> runs and then enable user_subvol_rm_allowed.

Using subvolumes was the first recommendation I was going to make, too, 
so you're on the right track. =:^)

Also, in case you are using it (you didn't say, but this has been 
demonstrated to solve similar issues for others so it's worth 
mentioning), try turning btrfs quota functionality off.  While the devs 
are working very hard on that feature for btrfs, the fact is that it's 
simply still buggy and doesn't work reliably anyway, in addition to 
triggering scaling issues before they'd otherwise occur.  So my 
recommendation has been, and remains, unless you're working directly with 
the devs to fix quota issues (in which case, thanks!), if you actually 
NEED quota functionality, use a filesystem where it works reliably, while 
if you don't, just turn it off and avoid the scaling and other issues 
that currently still come with it.
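
If it turns out quotas are enabled, turning them off is quick (the
mountpoint here is of course just an example):

  btrfs qgroup show /home      # check whether qgroups/quota are active
  btrfs quota disable /home    # turn the quota/qgroup machinery off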


As for defrag, that's quite a topic of its own, with complications 
related to snapshots and the nocow file attribute.  Very briefly, if you 
haven't been running it regularly or using the autodefrag mount option by 
default, chances are your available free space is rather fragmented as 
well, and while defrag may help, it may not reduce fragmentation to the 
degree you'd like.  (I'd suggest using filefrag to check fragmentation, 
but it doesn't know how to deal with btrfs compression, and will report 
heavy fragmentation for compressed files even if they're fine.  Since you 
use compression, that kind of eliminates using filefrag to actually see 
what your fragmentation is.)

Additionally, defrag isn't snapshot aware (they tried it for a few 
kernels a couple years ago but it simply didn't scale), so if you're 
using snapshots (as I believe Ubuntu does by default on btrfs, at least 
taking snapshots for upgrade-in-place), using defrag on files that 
exist in the snapshots as well can dramatically increase space usage, 
since defrag will break the reflinks to the snapshotted extents and 
create new extents for defragged files.

Meanwhile, the absolute worst-case fragmentation on btrfs occurs with  
random-internal-rewrite-pattern files (as opposed to never changed, or 
append-only).  Common examples are database files and VM images.  For 
/relatively/ small files, up to say 256 MiB, the autodefrag mount option is 
a reasonably effective solution, but it tends to have scaling issues with 
files over half a GiB so you can call this a negative recommendation for 
trying that option with half-gig-plus internal-random-rewrite-pattern 
files.  There are other mitigation strategies that can be used, but here 
the subject gets complex so I'll not detail them.  Suffice it to say that 
if the filesystem in question is used with large VM images or database 
files and you haven't taken specific fragmentation avoidance measures, 
that's very likely a good part of your problem right there, and you can 
call this a hint that further research is called for.

If your half-gig-plus files are mostly write-once, for example most media 
files unless you're doing heavy media editing, however, then autodefrag 
could be a good option in general, as it deals well with such files and 
with random-internal-rewrite-pattern files under a quarter gig or so.  Be 
aware, however, that if it's enabled on an already heavily fragmented 
filesystem (as yours likely is), it's likely to actually make performance 
worse until it gets things under control.  Your best bet in that case, if 
you have spare devices available to do so, is probably to create a fresh 
btrfs and consistently use autodefrag as you populate it from the 
existing heavily fragmented btrfs.  That way, it'll never have a chance 
for the fragmentation to build up in the first place, and autodefrag used 
as a routine mount option should keep it from getting bad in normal use.
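
Very roughly, and with device names purely as examples, that migration
could look like:

  mkfs.btrfs -d raid0 /dev/sdf /dev/sdg /dev/sdh /dev/sdi
  mount -o autodefrag,compress=lzo /dev/sdf /mnt/newhome
  rsync -aHAX /home/ /mnt/newhome/   # populated with autodefrag active from the start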

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: btrfs: poor performance on deleting many large files
  2015-11-23  1:43 btrfs: poor performance on deleting many large files Mitch Fossen
  2015-11-23  6:29 ` Duncan
@ 2015-11-23 12:59 ` Austin S Hemmelgarn
  2015-11-26  0:23   ` [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?) Christoph Anton Mitterer
  1 sibling, 1 reply; 48+ messages in thread
From: Austin S Hemmelgarn @ 2015-11-23 12:59 UTC (permalink / raw)
  To: Mitch Fossen, linux-btrfs


On 2015-11-22 20:43, Mitch Fossen wrote:
> Hi all,
>
> I have a btrfs setup of 4x2TB HDDs for /home in btrfs RAID0 on Ubuntu
> 15.10 (kernel 4.2) and btrfs-progs 4.3.1. Root is on a separate SSD
> also running btrfs.
>
> About 6 people use it via ssh and run simulations. One of these
> simulations generates a lot of intermediate data that can be discarded
> after it is run, it usually ends up being around 100GB to 300GB spread
> across dozens of files 500M to 5GB apiece.
>
> The problem is that, when it comes time to do a "rm -rf
> ~/working_directory" the entire machine locks up and sporadically
> allows other IO requests to go through, with a 5 to 10 minute delay
> before other requests seem to be served. It can end up taking half an
> hour or more to fully remove the offending directory, with the hangs
> happening frequently enough to be frustrating. This didn't seem to
> happen when the system was using ext4 on LVM.
Based on this description, this sounds to me like an issue with 
fragmentation.
>
> Is there a way to fix this performance issue or at least mitigate it?
> Would using ionice and the CFQ scheduler help? As far as I know Ubuntu
> uses deadline by default which ignores ionice values.
This depends on a number of factors.  If you are on a new enough kernel, 
you may actually be using the blk-mq code instead of one of the 
traditional I/O schedulers, which does honor ionice values, and is 
generally a lot better than CFQ or deadline at actual fairness and 
performance.  If you aren't running on that code path, then whether 
deadline or CFQ is better is pretty hard to determine.  In general, CFQ 
needs some serious effort and benchmarking to get reasonable performance 
out of it.  CFQ can beat deadline in performance when properly tuned to 
the workload (except if you have really small rotational media (smaller 
than 32G or so), or if you absolutely need deterministic scheduling), 
but when you don't take the time to tune CFQ, deadline is usually better 
(except on SSD's, where CFQ is generally better than deadline even 
without performance tuning).
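
For reference, checking and switching the scheduler on a given device is
just a sysfs write (sdX being a placeholder):

  cat /sys/block/sdX/queue/scheduler         # the active one is shown in brackets
  echo cfq > /sys/block/sdX/queue/scheduler  # switch at runtime, as root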
>
> Alternatively, would balancing and defragging data more often help?
> The current mount options are compress=lzo and space_cache, and I will
> try it with autodefrag enabled as well to see if that helps.
Balance is not likely to help much, but defragmentation might.  I would 
suggest running the defrag when nobody has any other I/O going to the 
filesystem, as it will likely cause a severe drop in performance the 
first time it's run.  Autodefrag might help, but it may also make 
performance worse while writing the files in the first place.  You might 
also try with compress=none; depending on your storage hardware, using 
in-line compression can actually make things go significantly slower (I 
see this a lot with SSDs, and also with some high-end storage 
controllers, and especially when dealing with large data-sets that 
aren't very compressible).
>
> For now I think I'll recommend that everyone use subvolumes for these
> runs and then enable user_subvol_rm_allowed.
As Duncan said, this is probably the best option short term.  It is 
worth noting, however, that removing a subvolume still has some overhead 
(which appears to scale linearly with the amount of data in the 
subvolume).  This overhead isn't likely to be an issue unless a bunch of 
subvolumes get removed in bulk.





* Re: btrfs: poor performance on deleting many large files
  2015-11-23  6:29 ` Duncan
@ 2015-11-25 21:49   ` Mitchell Fossen
  2015-11-26 16:52     ` Duncan
  2015-11-27  1:49     ` Qu Wenruo
  0 siblings, 2 replies; 48+ messages in thread
From: Mitchell Fossen @ 2015-11-25 21:49 UTC (permalink / raw)
  To: Duncan, linux-btrfs

On Mon, 2015-11-23 at 06:29 +0000, Duncan wrote:

> Using subvolumes was the first recommendation I was going to make, too, 
> so you're on the right track. =:^)
> 
> Also, in case you are using it (you didn't say, but this has been 
> demonstrated to solve similar issues for others so it's worth 
> mentioning), try turning btrfs quota functionality off.  While the devs 
> are working very hard on that feature for btrfs, the fact is that it's 
> simply still buggy and doesn't work reliably anyway, in addition to 
> triggering scaling issues before they'd otherwise occur.  So my 
> recommendation has been, and remains, unless you're working directly with 
> the devs to fix quota issues (in which case, thanks!), if you actually 
> NEED quota functionality, use a filesystem where it works reliably, while 
> if you don't, just turn it off and avoid the scaling and other issues 
> that currently still come with it.
> 

I did indeed have quotas turned on for the home directories! Since they were
mostly to calculate space used by everyone (since du -hs is so slow) and not
actually needed to limit people, I disabled them. 

> As for defrag, that's quite a topic of its own, with complications 
> related to snapshots and the nocow file attribute.  Very briefly, if you 
> haven't been running it regularly or using the autodefrag mount option by 
> default, chances are your available free space is rather fragmented as 
> well, and while defrag may help, it may not reduce fragmentation to the 
> degree you'd like.  (I'd suggest using filefrag to check fragmentation, 
> but it doesn't know how to deal with btrfs compression, and will report 
> heavy fragmentation for compressed files even if they're fine.  Since you 
> use compression, that kind of eliminates using filefrag to actually see 
> what your fragmentation is.)
> Additionally, defrag isn't snapshot aware (they tried it for a few 
> kernels a couple years ago but it simply didn't scale), so if you're 
> using snapshots (as I believe Ubuntu does by default on btrfs, at least 
> taking snapshots for upgrade-in-place), using defrag on files that 
> exist in the snapshots as well can dramatically increase space usage, 
> since defrag will break the reflinks to the snapshotted extents and 
> create new extents for defragged files.
> 
> Meanwhile, the absolute worst-case fragmentation on btrfs occurs with  
> random-internal-rewrite-pattern files (as opposed to never changed, or 
> append-only).  Common examples are database files and VM images.  For 
> /relatively/ small files, to say 256 MiB, the autodefrag mount option is 
> a reasonably effective solution, but it tends to have scaling issues with 
> files over half a GiB so you can call this a negative recommendation for 
> trying that option with half-gig-plus internal-random-rewrite-pattern 
> files.  There are other mitigation strategies that can be used, but here 
> the subject gets complex so I'll not detail them.  Suffice it to say that 
> if the filesystem in question is used with large VM images or database 
> files and you haven't taken specific fragmentation avoidance measures, 
> that's very likely a good part of your problem right there, and you can 
> call this a hint that further research is called for.
> 
> If your half-gig-plus files are mostly write-once, for example most media 
> files unless you're doing heavy media editing, however, then autodefrag 
> could be a good option in general, as it deals well with such files and 
> with random-internal-rewrite-pattern files under a quarter gig or so.  Be 
> aware, however, that if it's enabled on an already heavily fragmented 
> filesystem (as yours likely is), it's likely to actually make performance 
> worse until it gets things under control.  Your best bet in that case, if 
> you have spare devices available to do so, is probably to create a fresh 
> btrfs and consistently use autodefrag as you populate it from the 
> existing heavily fragmented btrfs.  That way, it'll never have a chance 
> for the fragmentation to build up in the first place, and autodefrag used 
> as a routine mount option should keep it from getting bad in normal use.

Thanks for explaining that! Most of these files are written once and then read
from for the rest of their "lifetime" until the simulations are done and they
get archived/deleted. I'll try leaving autodefrag on and defragging directories
over the holiday weekend when no one is using the server. There is some database
usage, but I turned off COW for its folder and it only gets used sporadically
and shouldn't be a huge factor in day-to-day usage. 

Also, is there a recommendation for relatime vs noatime mount options? I don't
believe anything that runs on the server needs to use file access times, so if
it can help with performance/disk usage I'm fine with setting it to noatime.

I just tried copying a 70GB folder and then rm -rf it and it didn't appear to
impact performance, and I plan to try some larger tests later.

Thanks again for the help!

-Mitch



* Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
  2015-11-23 12:59 ` Austin S Hemmelgarn
@ 2015-11-26  0:23   ` Christoph Anton Mitterer
  2015-11-26  0:33     ` Hugo Mills
  2015-11-26 23:08     ` Duncan
  0 siblings, 2 replies; 48+ messages in thread
From: Christoph Anton Mitterer @ 2015-11-26  0:23 UTC (permalink / raw)
  To: Austin S Hemmelgarn, linux-btrfs, Duncan


Hey.

I've worried before about the topics Mitch has raised.
Some questions.

1) AFAIU, the fragmentation problem exists especially for those files
that see many random writes, especially, but not limited to, big files.
That databases and VMs are affected by this is probably broadly known
by now (well at least by people on that list).
But I'd guess there are n other cases where such IO patterns can happen
which one simply never notices, while the btrfs continues to degrade.

So is there any general approach towards this?

And what are the actual possible consequences? Is it just that fs gets
slower (due to the fragmentation) or may I even run into other issues
to the point the space is eaten up or the fs becomes basically
unusable?

This is especially important for me, because for some VMs and even DBs
I wouldn't want to use nodatacow, because I want to have the
checksumming. (i.e. those cases where data integrity is much more
important than security)


2) Why does nodatacow imply nodatasum and can that ever be decoupled?
For me the checksumming is actually the most important part of btrfs
(not that I wouldn't like its other features as well)... so turning it
off is something I really would want to avoid.

Plus it opens questions like: When there are no checksums, how can it
(in the RAID cases) decide which block is the good one in case of
corruptions?


3) When I would actually disable datacow for e.g. a subvolume that
holds VMs or DBs... what are all the implications?
Obviously no checksumming, but what happens if I snapshot such a
subvolume or if I send/receive it?
I'd expect that then some kind of CoW needs to take place or does that
simply not work?


4) Duncan mentioned that defrag (and I guess that's also for auto-
defrag) isn't ref-link aware...
Isn't that somehow a complete showstopper?

As soon as one uses snapshot, and would defrag or auto defrag any of
them, space usage would just explode, perhaps to the extent of ENOSPC,
and rendering the fs effectively useless.

That sounds to me like, either I can't use ref-links, which are crucial
not only to snapshots but every file I copy with cp --reflink auto ...
or I can't defrag... which however will sooner or later cause quite
some fragmentation issues on btrfs?


5) Especially keeping (4) in mind but also the other comments in from
Duncan and Austin...
Is auto-defrag now recommended to be generally used?
Are both auto-defrag and defrag considered stable to be used? Or are
there other implications, like when I use compression?


6) Does defragmentation work with compression? Or is it just filefrag
which can't cope with it?

Are there any other combinations of the typical btrfs technologies
(cow/nocow, compression, snapshots, subvols, defrag, balance) that one
can use but which lead to unexpected problems? (I, for example,
wouldn't have expected that defragmentation isn't ref-link aware...
still kinda shocked ;) )

For example, when I do a balance and change the compression, and I have
multiple snapshots or files within one subvol that share their blocks...
would that also lead to copies being made and the space growing
possibly dramatically?


7) How does free-space defragmentation happen (or is there even such a
thing)?
For example, when I have my big qemu images, *not* using nodatacow, and
I copy the image e.g. with qemu-img old.img new.img ... and delete the
old then.
Then I'd expect that the new.img is more or less not fragmented,... but
will my free space (from the removed old.img) still be completely
messed up sooner or later driving me into problems?


8) why does a balance not also defragment? Since everything is anyway
copied... why not defragmenting it?
I somehow would have hoped that a balance cleans up all kinds of
things,... like free space issues and also fragmentation.


Given all these issues,... fragmentation, situations in which space may
grow dramatically where the end-user/admin may not necessarily expect
it (e.g. the defrag or the balance+compression case?)... btrfs seems to
require much more in-depth knowledge and especially care (that even
depends on the type of data) on the end-user/admin side than the
traditional filesystems.
Are there, for example, any general recommendations on what to do
regularly to keep the fs in a clean and proper shape (and I don't count
"start with a fresh one and copy the data over" as a valid way)?


Thanks,
Chris.

> 



* Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
  2015-11-26  0:23   ` [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?) Christoph Anton Mitterer
@ 2015-11-26  0:33     ` Hugo Mills
  2015-12-09  5:43       ` Christoph Anton Mitterer
  2015-12-14  1:44       ` Christoph Anton Mitterer
  2015-11-26 23:08     ` Duncan
  1 sibling, 2 replies; 48+ messages in thread
From: Hugo Mills @ 2015-11-26  0:33 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: Austin S Hemmelgarn, linux-btrfs, Duncan


On Thu, Nov 26, 2015 at 01:23:59AM +0100, Christoph Anton Mitterer wrote:
> 2) Why does nodatacow imply nodatasum and can that ever be decoupled?

   Answering the second part first, no, it can't.

   The issue is that nodatacow bypasses the transactional nature of
the FS, making changes to live data immediately. This then means that
if you modify a modatacow file, the csum for that modified section is
out of date, and won't be back in sync again until the latest
transaction is committed. So you can end up with an inconsistent
filesystem if there's a crash between the two events.

> For me the checksumming is actually the most important part of btrfs
> (not that I wouldn't like its other features as well)... so turning it
> off is something I really would want to avoid.
> 
> Plus it opens questions like: When there are no checksums, how can it
> (in the RAID cases) decide which block is the good one in case of
> corruptions?

   It doesn't decide -- both copies look equally good, because there's
no checksum, so if you read the data, the FS will return whatever data
was on the copy it happened to pick.


> 3) When I would actually disable datacow for e.g. a subvolume that
> holds VMs or DBs... what are all the implications?
> Obviously no checksumming, but what happens if I snapshot such a
> subvolume or if I send/receive it?

   After snapshotting, modifications are CoWed precisely once, and
then it reverts to nodatacow again. This means that making a snapshot
of a nodatacow object will cause it to fragment as writes are made to
it.

> I'd expect that then some kind of CoW needs to take place or does that
> simply not work?
> 
> 
> 4) Duncan mentioned that defrag (and I guess that's also for auto-
> defrag) isn't ref-link aware...
> Isn't that somehow a complete showstopper?

   It is, but the one attempt at dealing with it caused massive data
corruption, and it was turned off again. autodefrag, however, has
always been snapshot aware and snapshot safe, and would be the
recommended approach here. (Actually, it was broken in the same
incident I just described -- but fixed again when the broken patches
were reverted).

> As soon as one uses snapshot, and would defrag or auto defrag any of
> them, space usage would just explode, perhaps to the extent of ENOSPC,
> and rendering the fs effectively useless.
> 
> That sounds to me like, either I can't use ref-links, which are crucial
> not only to snapshots but every file I copy with cp --reflink auto ...
> or I can't defrag... which however will sooner or later cause quite
> some fragmentation issues on btrfs?
> 
> 
> 5) Especially keeping (4) in mind but also the other comments in from
> Duncan and Austin...
> Is auto-defrag now recommended to be generally used?

   Absolutely, yes.

   It's late for me, and this email was longer than I suspected, so
I'm going to stop here, but I'll try to pick it up again and answer
your other questions tomorrow.

   Hugo.

> Are both auto-defrag and defrag considered stable to be used? Or are
> there other implications, like when I use compression
> 
> 
> 6) Does defragmentation work with compression? Or is it just filefrag
> which can't cope with it?
> 
> Are there any other combinations of the typical btrfs technologies
> (cow/nocow, compression, snapshots, subvols, defrag, balance) that one
> can use but which lead to unexpected problems? (I, for example,
> wouldn't have expected that defragmentation isn't ref-link aware...
> still kinda shocked ;) )
> 
> For example, when I do a balance and change the compression, and I have
> multiple snapshots or files within one subvol that share their blocks...
> would that also lead to copies being made and the space growing
> possibly dramatically?
> 
> 
> 7) How does free-space defragmentation happen (or is there even such a
> thing)?
> For example, when I have my big qemu images, *not* using nodatacow, and
> I copy the image e.g. with qemu-img old.img new.img ... and delete the
> old then.
> Then I'd expect that the new.img is more or less not fragmented,... but
> will my free space (from the removed old.img) still be completely
> messed up sooner or later driving me into problems?
> 
> 
> 8) why does a balance not also defragment? Since everything is anyway
> copied... why not defragmenting it?
> I somehow would have hoped that a balance cleans up all kinds of
> things,... like free space issues and also fragmentation.
> 
> 
> Given all these issues,... fragmentation, situations in which space may
> grow dramatically where the end-user/admin may not necessarily expect
> it (e.g. the defrag or the balance+compression case?)... btrfs seem to
> require much more in-depth knowledge and especially care (that even
> depends on the type of data) on the end-user/admin side than the
> traditional filesystems.
> Are there for example any general recommendations what to regularly to
> do keep the fs in a clean and proper shape (and I don't count "start
> with a fresh one and copy the data over" as a valid way).
> 
> 
> Thanks,
> Chris.
> 
> > 



-- 
Hugo Mills             | "There's more than one way to do it" is not a
hugo@... carfax.org.uk | commandment. It is a dire warning.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |



* Re: btrfs: poor performance on deleting many large files
  2015-11-25 21:49   ` Mitchell Fossen
@ 2015-11-26 16:52     ` Duncan
  2015-11-26 18:25       ` Christoph Anton Mitterer
  2015-11-27  1:49     ` Qu Wenruo
  1 sibling, 1 reply; 48+ messages in thread
From: Duncan @ 2015-11-26 16:52 UTC (permalink / raw)
  To: linux-btrfs

Mitchell Fossen posted on Wed, 25 Nov 2015 15:49:58 -0600 as excerpted:

> Also, is there a recommendation for relatime vs noatime mount options? I
> don't believe anything that runs on the server needs to use file access
> times, so if it can help with performance/disk usage I'm fine with
> setting it to noatime.

FWIW I finally got tired enough of always setting noatime (for over a 
decade, since kernel 2.4 and my standardizing on reiserfs back then) that 
I went and found the spot in the kernel where the relatime default is 
set, and patched it to be noatime by default.  My kernel scripts now 
apply that on top of my git kernel pulls.

For people doing snapshotting in particular, atime updates can be a big 
part of the differences between snapshots, so it's particularly important 
to set noatime if you're snapshotting.

If you're not doing snapshots, it's somewhat less important, but IIRC it 
was still somewhat more of a performance issue than with ext*, tho I don't 
remember the details; I'd guess it's to do with COWing the metadata 
triggering metadata fragmentation.

Bottom line, use noatime unless you have something that needs atime.  
It's not going to hurt for sure, and should improve performance at least 
somewhat even on ext*.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: btrfs: poor performance on deleting many large files
  2015-11-26 16:52     ` Duncan
@ 2015-11-26 18:25       ` Christoph Anton Mitterer
  2015-11-26 23:29         ` Duncan
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Anton Mitterer @ 2015-11-26 18:25 UTC (permalink / raw)
  To: Duncan, linux-btrfs


On Thu, 2015-11-26 at 16:52 +0000, Duncan wrote:
> For people doing snapshotting in particular, atime updates can be a big
> part of the differences between snapshots, so it's particularly important
> to set noatime if you're snapshotting.
What exactly happens when that is left at relatime?

I'd guess that obviously every time the atime is updated there will be
some CoW, but only on meta-data blocks, right?

Does this then lead to fragmentation problems in the meta-data block
groups?

And how serious are the effects on space that is eaten up... say I have
n snapshots and access all of their files... then I'd probably get n
times the metadata, right? Which would sound quite dramatic...

Or are just parts of the metadata copied with new atimes?


Thanks,
Chris.



* Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
  2015-11-26  0:23   ` [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?) Christoph Anton Mitterer
  2015-11-26  0:33     ` Hugo Mills
@ 2015-11-26 23:08     ` Duncan
  2015-12-09  5:45       ` Christoph Anton Mitterer
  1 sibling, 1 reply; 48+ messages in thread
From: Duncan @ 2015-11-26 23:08 UTC (permalink / raw)
  To: linux-btrfs

Christoph Anton Mitterer posted on Thu, 26 Nov 2015 01:23:59 +0100 as
excerpted:

> Hey.
> 
> I've worried before about the topics Mitch has raised.
> Some questions.
> 
> 1) AFAIU, the fragmentation problem exists especially for those files
> that see many random writes, especially, but not limited to, big files.
> Now that databases and VMs are affected by this, is probably broadly
> known in the meantime (well at least by people on that list).
> But I'd guess there are n other cases where such IO patterns can happen
> which one simply never notices, while the btrfs continues to degrade.

The two other known cases are:

1) Bittorrent download files, where the full file size is preallocated 
(and I think fsynced), then the torrent client downloads into it a chunk 
at a time.

The more general case would be any time a file of some size is 
preallocated and then written into more or less randomly, the problem 
being the preallocation, which on traditional rewrite-in-place 
filesystems helps avoid fragmentation (as well as ensuring space to save 
the full file), but on COW-based filesystems like btrfs, triggers exactly 
the fragmentation it was trying to avoid.

At least some torrent clients (ktorrent at least) have an option to turn 
off that preallocation, however, and that would be recommended where 
possible.  Where disabling the preallocation isn't possible, arranging to 
have the client write into a dir with the nocow attribute set, so newly 
created torrent files inherit it and do rewrite-in-place, is highly 
recommended.
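
Setting that up is just the usual chattr dance; note that the attribute
only takes effect for files created after it is set on the directory
(path is an example):

  mkdir -p ~/torrents/incoming
  chattr +C ~/torrents/incoming   # newly created files in here will be nocow
  lsattr -d ~/torrents/incoming   # should now show the 'C' attribute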

It's also worth noting that once the download is complete, the files 
aren't going to be rewritten any further, and thus can be moved out of 
the nocow-set download dir and treated normally.  For those who will 
continue to seed the files for some time, this could be done, provided 
the client can seed from a directory different than the download dir.

2) As a subcase of the database file case that people may not think 
about, systemd journal files are known to have had the internal-rewrite-
pattern problem in the past.  Apparently, while they're mostly append-
only in general, they do have an index at the beginning of the file that 
gets rewritten quite a bit.

The problem is much reduced in newer systemd, which is btrfs aware and in 
fact uses btrfs-specific features such as subvolumes in a number of cases 
(creating subvolumes rather than directories where it makes sense in some 
shipped tmpfiles.d config files, for instance), if it's running on 
btrfs.  For the journal, I /think/ (see the next paragraph) that it now 
sets the journal files nocow, and puts them in a dedicated subvolume so 
snapshots of the parent won't snapshot the journals, thereby helping to 
avoid the snapshot-triggered cow1 issue.

On my own systems, however, I've configured journald to only use the 
volatile tmpfs journals in /run, not the permanent /var location, 
tweaking the size of the tmpfs mounted on /run and the journald config so 
it normally stores a full boot session, but of course doesn't store 
journals from previous sessions as they're wiped along with the tmpfs at 
reboot.  I run syslog-ng as well, configured to work with journald, and 
thus have its more traditional append-only plain-text syslogs for 
previous boot sessions.
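
The journald side of that is only a couple of lines in
/etc/systemd/journald.conf (the size here is purely illustrative, not
necessarily what I run):

  [Journal]
  Storage=volatile
  RuntimeMaxUse=64M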

For my usage that actually seems the best of both worlds as I get journald 
benefits such as service status reports showing the last 10 log entries 
for that service, etc, with those benefits mostly applying to the current 
session only, while I still have the traditional plain-text greppable, 
etc, syslogs, from both the current and previous sessions, back as far as 
my log rotation policy keeps them.  It also keeps the journals entirely 
off of btrfs, so that's one particular problem I don't have to worry 
about at all, the reason I'm a bit fuzzy on the exact details of systemd's 
solution to the journal on btrfs issue.

> So is there any general approach towards this?

The general case is that for normal desktop users, it doesn't tend to be 
a problem, as they don't do either large VMs or large databases, and 
small ones such as the sqlite files generated by firefox and various 
email clients are handled quite well by autodefrag, with that general 
desktop usage being its primary target.

For server usage and the more technically inclined workstation users who 
are running VMs and larger databases, the general feeling seems to be 
that those adminning such systems are, or should be, technically inclined 
enough to do their research and know when measures such as nocow and 
limited snapshotting along with manual defrags where necessary, are 
called for.  And if they don't originally, they find out when they start 
researching why performance isn't what they expected and what to do about 
it. =:^)

> And what are the actual possible consequences? Is it just that fs gets
> slower (due to the fragmentation) or may I even run into other issues to
> the point the space is eaten up or the fs becomes basically unusable?

It's primarily a performance issue, tho in severe cases it can also be a 
scaling issue, to the point that maintenance tasks such as balance take 
much longer than they should and can become impractical to run (where the 
alternative starting over with a new filesystem and restoring from 
backups is faster), because btrfs simply has too much bookkeeping 
overhead to do due to the high fragmentation.

And quotas tend to make the scaling issues much (MUCH!) worse, but since 
btrfs quotas are to date generally buggy and not entirely reliable 
anyway, that tends not to be a big problem for those who do their 
research, since they either stick with a more mature filesystem where 
quotas actually work if they need 'em, or don't ever enable them on btrfs 
if they don't actually need 'em.

> This is especially important for me, because for some VMs and even DBs I
> wouldn't want to use nodatacow, because I want to have the checksumming.
> (i.e. those cases where data integrity is much more important than
> security)

In general, nocow and the resulting loss of checksumming on these files 
isn't nearly the problem that it might seem at first glance.  Why?  
Because think about it, the applications using these files have had to be 
usable on more traditional filesystems without filesystem-level 
checksumming for decades, so the ones where data integrity is absolutely 
vital have tended to develop their own data integrity assurance 
mechanisms.  They really had no choice, as if they hadn't, they'd have 
been too unstable for the tasks at hand, and something else would have 
come along that was more stable and thus more suited to the task at hand.

In fact, while I've seen no reports of this recently, a few years ago 
there were a number of reported cases where the best explanation was that 
after a crash, the btrfs level file integrity and the application level 
file integrity apparently clashed, with the btrfs commit points and the 
application's own commit points out of sync, so that while btrfs said the 
file was fine, apparently parts of it were from before an application 
level checkpoint while other parts of it were after, so the application 
itself rejected the file, even tho the btrfs checksums matched.

As I said, that was a few years ago, and I think btrfs' barrier handling 
and fsync log rewriting are better now, such that I've not seen such 
reports in quite awhile.  But something was definitely happening at the 
time, and I think in at least some cases the application alone would have 
handled things better, as then it could have detected the damage and 
potentially replayed its own log or restored to a previous checkpoint, 
the exact same thing it did on filesystems without the integrity 
protections btrfs has.

Since most of these apps already have their own data integrity assurance 
mechanisms, the btrfs data integrity mechanisms aren't such a big deal 
and can in fact be turned off, letting the application layer handle it.  
Instead, where btrfs' data integrity works best is in two cases (1) btrfs 
internal metadata integrity handling, and (2) on the general run of the 
mill file processed by run of the mill applications that don't do their 
own data integrity processing (beyond perhaps a rather minimal sanity 
check, if that) and simply trust the data the filesystem feeds them.  In 
many cases they'd simply process the corrupt data and keep on going, 
while in others they'd crash, but it wouldn't be a big deal, because it'd 
be one corrupt jpeg or a few seconds of garbage in an mp3 or mpeg, and if 
the one app couldn't handle it without crashing, another would.  It 
wouldn't be a whole DB or VM's worth of data, down the drain, as it would 
be for the big apps, the reason the big apps had to implement their own 
data integrity processing.

Plus, the admins running the big, important apps are much more likely to 
appreciate the value of the admin's rule of backups: if it's not backed 
up, by definition it's of less value than the time and resources saved 
by not doing that backup, any protests to the contrary notwithstanding, 
as they simply underline the lie of the words in the face of the 
demonstrated lack of backups and thus, by definition, the low value of 
the data.

Because checksumming doesn't help you if the filesystem as a whole goes 
bad, or if the physical devices hosting it do so, while backups do!  (And 
the same of course applies to snapshotting, tho they can help with the 
generally worst risk, as any admin worth their salt knows, the admin's 
own fat-fingering!)


In general, then, for the big VMs and DBs, I recommend nocow, on 
dedicated subvolumes so parent snapshotting doesn't interfere, and 
preferably no snapshotting of the dedicated subvolume, if there's 
sufficient down-time to do proper db/vm-atomic backups, anyway.  If not, 
then snapshot at the low end of acceptable frequency for backups, back up 
the snapshot, and erase it.  There will still be some fragmentation due 
to the snapshot-induced cow1 (see discussion under #3 below), but it can 
be controlled, and scheduled defrag can be used to keep it within an 
acceptable range.  Altho defrag isn't snapshot aware, with snapshots only 
taken for backup purposes and then deleted, there won't be snapshots for 
defrag to be aware of, eliminating the potential problems there as well.

Based on posted reports, this sort of approach works well to keep 
fragmentation within manageable levels, while still allowing temporary 
snapshots for backup purposes.


> 2) Why does nodatacow imply nodatasum and can that ever be decoupled?

Hugo covered that.  It's a race issue.  With data rewritten in-place, 
it's no longer possible to atomically update both the data and its 
checksum at the same time, and if there's a crash between updates of the 
two or while one is actually being written...

Which is precisely why checksummed data integrity isn't more commonly 
implemented; on overwrite-in-place, it's simply not race free, so copy-on-
write is what actually makes it possible.  Therefore, disable copy-on-
write and by definition you must disable checksumming as well.

> 3) When I would actually disable datacow for e.g. a subvolume that holds
> VMs or DBs... what are all the implications?
> Obviously no checksumming, but what happens if I snapshot such a
> subvolume or if I send/receive it?
> I'd expect that then some kind of CoW needs to take place or does that
> simply not work?

Snapshots too are cow-based, as they lock in the existing version where 
it's at.  By virtue of necessity, then, first-writes to a block after a 
snapshot cow it, that being a necessary exception to nocow.  However, the 
file retains its nocow attribute, and further writes to the new block are 
now done in-place... until it too is locked in place by another snapshot.

Someone on-list referred to this once as cow1, and that has become a 
common shorthand reference for the process.  In fact, I referred to cow1 
in #1 above, and just now added a parenthetical back up there, referring 
here.

> 4) Duncan mentioned that defrag (and I guess that's also for auto-
> defrag) isn't ref-link aware...
> Isn't that somehow a complete showstopper?
> 
> As soon as one uses snapshot, and would defrag or auto defrag any of
> them, space usage would just explode, perhaps to the extent of ENOSPC,
> and rendering the fs effectively useless.
> 
> That sounds to me like, either I can't use ref-links, which are crucial
> not only to snapshots but every file I copy with cp --reflink auto ...
> or I can't defrag... which however will sooner or later cause quite some
> fragmentation issues on btrfs?

Hugo answered this one too, tho I wasn't aware that autodefrag was 
snapshot-aware.

But even without snapshot awareness, with an appropriate program of 
snapshot thinning (ideally no more than 250-ish snapshots per subvolume, 
which easily covers a year's worth of snapshots even starting at 
something like half-hourly, if they're thinned properly as well; 250 per 
subvolume lets you cover 8 subvolumes with a 2000 snapshot total, a 
reasonable cap that doesn't trigger severe scaling issues) defrag 
shouldn't be /too/ bad.

Most files aren't actually modified that much, so the number of defrag-
triggered copies wouldn't be that high.

And as discussed above, for VM images and databases, the recommendation 
is nocow, and either no snapshotting if there's down-time enough to do 
atomic backups without them, or only temporary snapshotting if necessary 
for atomic backups, with the snapshots removed after the backup is 
complete.  Further, defrag should only be done at a rather lower 
frequency than the temporary snapshotting, so even if a few snapshots are 
kept around, that's only a few copies of the files, nothing like the 
potentially 250-ish snapshots and thus copies of the file, for normal 
subvolumes, were defrag done at the same frequency as the snapshotting.

> 5) Especially keeping (4) in mind but also the other comments in from
> Duncan and Austin...
> Is auto-defrag now recommended to be generally used?
> Are both auto-defrag and defrag considered stable to be used? Or are
> there other implications, like when I use compression

Autodefrag is recommended for, and indeed targeted at, general desktop 
use, where internal-rewrite-pattern database, etc, files tend to be 
relatively small, quarter to half gig at the largest.

> 6) Does defragmentation work with compression? Or is it just filefrag
> which can't cope with it?

It's just filefrag -- which it can be noted isn't a btrfs-progs 
application (it's part of e2fsprogs).  There is or possibly was in fact 
discussion of teaching filefrag about btrfs compression so it wouldn't 
false-report massive fragmentation with it, but that was some time ago 
(I'd guess a couple years), and I've read absolutely nothing on it since, 
so I've no idea if the project was abandoned or indeed never got off the 
ground, or OTOH, if perhaps it's actually already done in the latest 
e2fsprogs.

btrfs defrag works fine with compression and in fact it even has an 
option to compress as it goes, thus allowing one to use it to compress 
files later, if you for instance weren't running the compress mount 
option (or perhaps toggled between zlib and lzo based compression) at the 
time the file was originally written.
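
For example, something along the lines of (recursive, recompressing with
lzo as it goes; the path is a placeholder):

  btrfs filesystem defragment -r -clzo /path/to/files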

And AFAIK autodefrag, because it simply queues affected files for 
defragging rewrite by a background thread, uses the current compress 
mount option just as does ordinary file writing.

> Are there any other combinations of the typical btrfs technologies
> (cow/nocow, compression, snapshots, subvols, defrag, balance) that one
> can use but which lead to unexpected problems? (I, for example,
> wouldn't have expected that defragmentation isn't ref-link aware...
> still kinda shocked ;) )

FWIW, I believe the intent remains to reenable snapshot-aware-defrag 
sometime in the future, after the various scaling issues including 
quotas, have been dealt with.  When the choice is between a defrag taking 
a half hour but not being snapshot aware, and taking perhaps literally 
/weeks/, because the scaling issues really were that bad... an actually 
practical defrag, even if it broke snapshot reflinks, was *clearly* 
preferred to one that was for all practical purposes too badly broken to 
actually use, because it scaled so badly it took weeks to do what should 
have been a half-hour job.

The one set of scaling issues was actually dealt with some time ago.  I 
think now it's primarily the fact that we're on the third quota subsystem 
rewrite and it's still buggy and far from stable, is what's holding up 
further progress on again having a snapshot-aware-defrag.  Once the quota 
code actually stabilizes there's probably some other work to do tying up 
loose ends as well, but my impression is that it's really not even 
possible until the quota code stabilizes.  The only exception to that 
would be if people simply give up on quotas entirely, and there's enough 
demand for that feature that giving up on them would be a *BIG* hit to 
btrfs as the assumed ext* successor, so unless they come up against a 
wall and find quotas simply can't be done in a reliable and scalable way 
on btrfs, the feature /will/ be there eventually, and then I think 
snapshot-aware-defrag work can resume.

But given results to date, quota code could be good in a couple kernel 
cycles... or it could be five years... and how long snapshot-aware-defrag 
would take to come back together after that is anyone's guess as well, so 
don't hold your breath... you won't make it!

> For example, when I do a balance and change the compression, and I have
> multiple snapshots or files within one subvol that share their blocks...
> would that also lead to copies being made and the space growing possibly
> dramatically?

AFAIK balance has nothing to do with compression.  Defrag has an option 
to recompress... with the usual snapshot-unaware implications in terms of 
snapshot reflink breakage, of course.

I actually don't know what the effect of defrag, with or without 
recompression, is on same-subvolume reflinks.  If I were to guess I'd say 
it breaks them too, but I don't know.  If I needed to know I'd probably 
test it to see... or ask.

It _is_ worth noting, however, lest there be any misconceptions, that 
regardless of the number of reflinks sharing an extent between them, a 
single defrag on a single file will only make, maximum, a single 
additional copy.  It's not like it makes another copy for each of the 
reflinks to it, unless you defrag each of those reflinks individually.

So 250 snapshots of something isn't going to grow usage by 250 times with 
just a single defrag.  It will double it if the defrag is actually done 
(defrag doesn't touch a file if it doesn't think it needs defragging, in 
which case no space usage change would occur, but then neither would the 
actual defrag), but it won't blow up by 250X just because there's 250 
snapshots!

> 7) How does free-space defragmentation happen (or is there even such a
> thing)?
> For example, when I have my big qemu images, *not* using nodatacow, and
> I copy the image e.g. with qemu-img old.img new.img ... and delete the
> old then.
> Then I'd expect that the new.img is more or less not fragmented,... but
> will my free space (from the removed old.img) still be completely messed
> up sooner or later driving me into problems?

This one's actually a very good question as there has been a moderate 
regression in defrag's efficiency lately (well, 3.17 IIRC, which is out 
of the recommended 2-LTS-kernels range, but it was actually about 4.1 
before people put two and two together and figured out what happened, as 
it was conceptually entirely unrelated), due to implications of an 
otherwise unrelated change.  Meanwhile, the change did fix the problem it 
was designed to fix and reports of it are far rarer these days, to the 
point that I'd expect most would consider it well worth the very moderate 
inadvertent regression.

Defrag doesn't really defrag free space, tho if you're running 
autodefrag, free space shouldn't ever get /that/ fragmented to begin 
with, since file fragmentation level will in general be kept low enough 
that the remaining space should be generally fragmentation free as well.

Meanwhile, at the blockgroup aka chunk level balance defrags free space 
to some degree, by rewriting and consolidating chunks.  However, that's 
not directly free space defrag either, it just happens to do some of that 
due to the rewrites it does.

As to what caused that moderate regression mentioned above, it happened 
this way (IIRC my theory as actually described, but others agreed in 
general, tho I don't believe it has been actually proven just yet, see 
below).  Defrag was originally designed to work with currently allocated 
chunks and not allocate new ones, as back then, there tended to be plenty 
of empty data chunks lying around from the same normal use that triggered 
the fragmentation in the first place, as btrfs didn't reclaim empty chunks 
back then as it does now.

But people got tired of btrfs running into ENOSPC errors when df said it 
had plenty of space -- but it was all tied up in empty (usually) data 
chunks, so there was no unallocated space left to allocate to more 
metadata chunks when needed, and having to manually run a balance
-dusage=0 or whatever to free up a bunch of empty data chunks so metadata 
chunks could be allocated.  (Occasionally it was the reverse, lots of 
empty metadata chunks, running out of data chunks, but that was much 
rarer due to normal usage patterns favoring data chunk allocation.)
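
For the record, that manual cleanup was typically something like
(mountpoint being an example):

  btrfs balance start -dusage=0 /home   # drop completely empty data chunks
  btrfs balance start -musage=0 /home   # likewise for empty metadata chunks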

So along around 3.17, btrfs behavior was changed so that it now deletes 
empty chunks automatically, and people don't have to do so many manual 
balances to clear empty data chunks any more. =:^)  And a worthwhile 
change it was, too, except...

Only several kernel cycles later did we figure out the problem that 
change was for defrag, since it's pretty conservative about allocating 
new and thus empty data chunks.  It took that long because it apparently 
never occurred to anyone that it'd affect defrag in any way at all, when 
the change was made.  And indeed, the effect is rather subtle and none-
too-intuitive, so it's no wonder it didn't even occur to anyone.

So what happens now is there's no empty data chunks around for defrag to 
put its work into, so it has to use much more congested partially full 
data chunks with much smaller contiguous blocks of free space, and the 
defrag often ends up being much less efficient than it would be if it 
still had all those empty chunks of free space to work with that are now 
automatically deleted.  In fact, in some cases defrag can now actually 
result in *more* fragmentation, if the existing file extents are larger 
than those available in existing data chunks.  Tho from reports, that 
doesn't tend to happen on initial run when people notice a problem and 
decide to defrag, the initial defrag usually improves the situation 
some.  But given the situation, people might decide the first result 
isn't good enough and try another defrag, and then it can actually make 
the problem worse.

Of course, if people are consistently using autodefrag (as I do) this 
doesn't tend to be a very big problem, as fragmentation is never allowed 
to build up to the point where it's significantly interfering with free 
space.  But if people are doing it manually and allow the fragmentation 
to build up between runs, it can be a significant problem, because that 
much file fragmentation means free space is highly fragmented as well, 
and with no extra empty chunks around as they've all been deleted...

So at some point, defrag will need at least partially rewritten to be at 
least somewhat more greedy in its new data chunk allocation.  I'm not a 
coder so I can't evaluate how big a rewrite that'll be, but with a bit of 
luck, it's more like a few line patch than a rewrite.  Because if it's a 
rewrite, then it's likely to wait until they can try to address the 
snapshot-aware-defrag issue again at the same time, and it's anyone's 
guess when that'll be, but probably more like years than months.

Meanwhile, I don't know that anybody has tried this yet, and with both 
compression and autodefrag on here it's not easy for me to try it, but in 
theory anyway, if defrag isn't working particularly well, it should be 
possible to truncate-create a number of GiB-sized files, then sync (or 
fsync each one individually) so they're written out to storage, then 
truncate each file down to a few bytes, something 0 < size < 4096 bytes 
(or page size, on archs where it's not 4096 by default), so each takes 
only a single block of its original 1 GiB allocation, and sync again.
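
As an untested sketch of that (paths, sizes and file count made up, and 
using dd rather than a plain truncate for the creation step so the data 
is actually written out rather than left sparse):

    # create a few ~1 GiB balloon files, fsyncing each so the data
    # chunks really get allocated
    for i in 1 2 3 4; do
        dd if=/dev/zero of=/home/balloon.$i bs=1M count=1024 conv=fsync
    done
    btrfs fi df /home      # note the Data total vs used spread
    # now shrink each one to a single block, freeing most of each chunk
    for i in 1 2 3 4; do
        truncate -s 3K /home/balloon.$i
    done
    sync
    btrfs fi df /home      # Data used should drop, total should not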

With a btrfs fi df run before and after the process, you can see if it's 
having the intended effect of creating a bunch of nearly empty data 
chunks (which are nominally 1 GiB in size each, tho they can be smaller 
if space is tight or larger on a large but nearly empty filesystem).  If 
there's a number of partially empty chunks such that the spread between 
data size and used is over a GiB, it may take writing a number of files 
at a GiB each to use up that space and see new chunks allocated, but once 
the desired number of data chunks is allocated, then start truncating to 
say 3 KiB, and see if the data used number starts coming down accordingly.

The idea of course would be to force creation of some new data chunks 
with files the size of a data chunk, then truncate them to the size of a 
single block, freeing most of the data chunk.

/Then/ run defrag, and it should actually have some near 1 GiB contiguous 
free-space blocks it can use, and thus should be rather more efficient! 
=:^)

Of course when you're done you can delete all those "balloon files" you 
used to force the data chunk allocation.
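
(Continuing the untested sketch above, that would presumably amount to 
just:)

    btrfs filesystem defragment -r -v /home
    rm /home/balloon.*
    sync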

I'm not /sure/, but I think btrfs may actually delay empty chunk deletion 
by a bit, to see if it's going to be used.  If it does, then someone 
could actually create, sync, and then delete the balloon files, and do 
the defrag in the lag time before btrfs deletes the empty chunks.  If it 
works, that should let files over a GiB in size grab whole GiB size data 
chunks, but I'm not sure it'll work as I don't know what btrfs' delay 
factor is before deleting those unused chunks.

It'd be a worthwhile experiment anyway.  If it works, it will have 
nicely demonstrated that defrag does indeed work better with a few extra 
empty chunks lying around, and that it really does need to be patched up 
to be a bit more greedy in allocating new chunks, now that btrfs auto-
deletes them so defrag isn't likely to find them simply lying around to 
be used, as it used to.  Because AFAIK I was actually the one who came up 
with the idea that the new lack of empty chunks lying around was the 
problem, and while I did get some agreement that it was likely, I'm not 
sure it's actually been tested yet.  Not being a coder, I can't easily 
just look at the code and see what defrag's new chunk allocation policy 
is, so to this point it remains a nicely logical theory, but as yet 
unproven to the best of my knowledge.

> 8) why does a balance not also defragment? Since everything is anyway
> copied... why not defragmenting it?
> I somehow would have hoped that a balance cleans up all kinds of
> things,... like free space issues and also fragmentation.

Balance works with blockgroups/chunks, rewriting and defragging (and 
converting if told to do so with the appropriate balance filters) at that 
level, not the individual file or extent level.

Defrag works at the file/extent level, within blockgroups.
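
To make the split concrete (mountpoint made up):

    # balance: rewrites whole chunks/block groups, optionally filtered
    btrfs balance start -dusage=50 /mnt
    # defrag: rewrites file extents within chunks, optionally recompressing
    btrfs filesystem defragment -r /mnt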

Perhaps there will be a tool that combines the two at some point in the 
likely distant future, but as of now there's all sorts of other projects 
to be done, and given that the existing tools do the job in general, it's 
unlikely this one will rise high enough in the priority queue to get any 
attention for some years.

> Given all these issues,... fragmentation, situations in which space may
> grow dramatically where the end-user/admin may not necessarily expect it
> (e.g. the defrag or the balance+compression case?)... btrfs seem to
> require much more in-depth knowledge and especially care (that even
> depends on the type of data) on the end-user/admin side than the
> traditional filesystems.

To some extent that comes with the territory of being a more advanced 
than ordinary filesystem.  However, I think a lot more of it is simply 
the continued relative immaturity of the filesystem.  As it matures, 
presumably a lot of these still-rough edges will be polished away, but 
it's a long process, with full general maturity likely still some years 
away given the relatively limited number of devs and their current rate 
of progress.

> Are there for example any general recommendations what to regularly to
> do keep the fs in a clean and proper shape (and I don't count "start
> with a fresh one and copy the data over" as a valid way).

=:^)

When I switched to btrfs some years ago, it was obviously rather less 
stable and mature than it is today, and was in fact still labeled 
experimental, with much stronger warnings about risks should you decide 
to use it without good backups than it has today.

And since then, a number of features not available in the earlier 
versions have been introduced, some of which could only be enabled on a 
freshly created filesystem.

Now my backups routine already involved creating or using existing backup 
partitions the same size as the working copy, doing a mkfs thereon, 
copying all the data from each working partition and filesystem to its 
parallel backup(s), and then testing those backups by mounting and/or 
booting to them as alternates to the working copies.  Because as any good 
admin knows, a would-be backup isn't a backup until it has been tested to 
work, because until then, the backup job isn't complete, and the to-be 
backup cannot be relied upon /as/ a backup.

And periodically, I'd take the opportunity presented at that point to 
reverse the process as well: once booted onto the backup, I'd blow away 
the normal working copy and do a fresh mkfs on it, then copy everything 
back from the backup to the working copy.

With btrfs then in experimental and various new feature additions 
requiring a fresh mkfs.btrfs anyway, it was thus little to no change in 
routine to simply be a bit more regular with that last step, blowing away 
the working copy whilst booted to backup, and copying it all back to the 
working copy from the backup, as if I were doing a backup to what 
actually happened to be the working copy.


So "start with a fresh btrfs and copy the data over", is indeed part of 
my regular backups routine here, just as it was back on reiserfs before 
btrfs, only a bit more regular for awhile, while btrfs was adding new 
features rather regularly.  Now that the btrfs forward-compatible-only on-
disk-format change train has slowed down some, I've not actually done it 
recently, but it's certainly easy enough to do so when I decide to. =:^)


But in terms of your question, the only things I do somewhat regularly 
are an occasional scrub (with btrfs raid1 precisely so I /do/ have a 
second copy available if one or the other fails checksum), and keeping an 
eye on the combination of btrfs fi sh and btrfs fi df to see if I need to 
run a filtered balance, mostly out of habit from before the automatic 
empty chunk delete code, and because my btrfs are all relatively small so 
the room for error is accordingly smaller.  But I've not had to do such a 
balance in awhile, so how long the habit will remain around I really 
don't know.
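
In command terms that occasional routine is nothing more than something 
like:

    btrfs scrub start -Bd /     # -B waits and prints stats, -d per device
    btrfs fi show /
    btrfs fi df /
    # and only if the total vs used spread looks excessive:
    btrfs balance start -dusage=20 -musage=20 /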

Other than that, it's simply the usual keeping up with the backups, 
which I don't automate; I generally pick a stable point when everything's 
working and do one whenever I start getting uncomfortable about the work 
I'd lose if things went kerflooey.

Tho I'm obviously active on this list, keeping up with current status and 
developments including the latest commonly reported bugs, as well. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: btrfs: poor performance on deleting many large files
  2015-11-26 18:25       ` Christoph Anton Mitterer
@ 2015-11-26 23:29         ` Duncan
  2015-11-27  0:06           ` Christoph Anton Mitterer
  0 siblings, 1 reply; 48+ messages in thread
From: Duncan @ 2015-11-26 23:29 UTC (permalink / raw)
  To: linux-btrfs

Christoph Anton Mitterer posted on Thu, 26 Nov 2015 19:25:47 +0100 as
excerpted:

> On Thu, 2015-11-26 at 16:52 +0000, Duncan wrote:
>> For people doing snapshotting in particular, atime updates can be a big
>> part of the differences between snapshots, so it's particularly
>> important to set noatime if you're snapshotting.

> What exactly happens when that is left at relatime?
> 
> I'd guess that obviously every time the atime is updated there will be
> some CoW, but only on meta-data blocks, right?

Yes.
 
> Does this then lead to fragmentation problems in the meta-data block
> groups?

I don't believe so.  I think individual metadata elements tend to be 
small enough that several fit in a metadata node (16 KiB by default these 
days, IIRC), so there's no "metadata fragmentation" to speak of.

> And how serious are the effects on space that is eaten up... say I have
> n snapshots and access all of their files... then I'd probably get n
> times the metadata, right? Which would sound quite dramatic...
> 
> Or are just parts of the metadata copied with new atimes?

I think it's whole 4 KiB blocks and possibly whole metadata nodes (16 
KiB), copy-on-write, and these would be relatively small changes 
triggering cow of the entire block/node, aka write amplification.  While 
not too large in themselves, it's the number of them that becomes a 
problem.

IIRC relatime updates once a day on access.  If you're doing daily 
snapshots, updating metadata blocks for all files accessed in the last 24 
hours...

Again, individual snapshots aren't so much of a problem, and if you're 
thinning to the 250 snapshots per subvolume or less as I recommend, the 
problem will remain controlled.  But at 250, starting with daily 
snapshots, each carrying atime changes for at least all files accessed 
during its 24 hours, that's still a sizable set of unnecessarily modified 
and thus space-taking snapshotted metadata.

But I wouldn't worry about it too much if you're doing say monthly 
snapshots and only keeping a year's worth or less, 12-13 snapshots per 
subvolume total.

In my case, I'm on SSD with their limited write cycles, so while the 
snapshot thing doesn't affect me since my use-case doesn't involve 
snapshots, the SSD write cycle count thing certainly does, and noatime is 
worth it to me for that alone.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: btrfs: poor performance on deleting many large files
  2015-11-26 23:29         ` Duncan
@ 2015-11-27  0:06           ` Christoph Anton Mitterer
  2015-11-27  3:38             ` Duncan
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Anton Mitterer @ 2015-11-27  0:06 UTC (permalink / raw)
  To: Duncan, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2037 bytes --]

On Thu, 2015-11-26 at 23:29 +0000, Duncan wrote:
> > but only on meta-data blocks, right?
> Yes.
Okay... so at most it'll end up with the whole meta-data for a snapshot 
duplicated and no longer shared...
And when these are chained as in ZFS... it probably amplifies, i.e. a 
change deep down in the tree changes all the upper elements as well?
Which shouldn't be too big a problem unless I have a lot of snapshots or 
extremely many files.



> I think it's whole 4 KiB blocks and possibly whole metadata nodes (16
> KiB), copy-on-write, and these would be relatively small changes 
> triggering cow of the entire block/node, aka write
> amplification.  While 
> not too large in themselves, it's the number of them that becomes a 
> problem.
Ah... there you say it already =)
But still it's always only meta-data that is copied, never the data,
right?!


> IIRC relatime updates once a day on access.  If you're doing daily 
> snapshots, updating metadata blocks for all files accessed in the
> last 24 
> hours...
Yes...


Wouldn't it be a way to handle that problem if btrfs allowed creating 
snapshots for which the atime never gets updated, regardless of any 
mount option?

And additionally, allow people to mount subvols with different 
noatime/relatime/atime settings (unless that's already working)... that 
way, they could enable it for the things where they want/need it, and 
disable it where not.


> In my case, I'm on SSD with their limited write cycles, so while the
> snapshot thing doesn't affect me since my use-case doesn't involve 
> snapshots, the SSD write cycle count thing certainly does, and
> noatime is 
> worth it to me for that alone.
I'm always a bit unsure about that... I used to do it as well, for the 
wear... but is that really necessary?
With relatime, atime updates happen at most once a day... so at worst 
you rewrite... what... some 100 MB (at least in the ext234 case)... and 
SSDs seem to bear many more write cycles than advertised.


Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5313 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: btrfs: poor performance on deleting many large files
  2015-11-25 21:49   ` Mitchell Fossen
  2015-11-26 16:52     ` Duncan
@ 2015-11-27  1:49     ` Qu Wenruo
  1 sibling, 0 replies; 48+ messages in thread
From: Qu Wenruo @ 2015-11-27  1:49 UTC (permalink / raw)
  To: Mitchell Fossen, Duncan, linux-btrfs



Mitchell Fossen wrote on 2015/11/25 15:49 -0600:
> On Mon, 2015-11-23 at 06:29 +0000, Duncan wrote:
>
>> Using subvolumes was the first recommendation I was going to make, too,
>> so you're on the right track. =:^)
>>
>> Also, in case you are using it (you didn't say, but this has been
>> demonstrated to solve similar issues for others so it's worth
>> mentioning), try turning btrfs quota functionality off.  While the devs
>> are working very hard on that feature for btrfs, the fact is that it's
>> simply still buggy and doesn't work reliably anyway, in addition to
>> triggering scaling issues before they'd otherwise occur.  So my
>> recommendation has been, and remains, unless you're working directly with
>> the devs to fix quota issues (in which case, thanks!), if you actually
>> NEED quota functionality, use a filesystem where it works reliably, while
>> if you don't, just turn it off and avoid the scaling and other issues
>> that currently still come with it.
>>
>
> I did indeed have quotas turned on for the home directories! Since they were
> mostly to calculate space used by everyone (since du -hs is so slow) and not
> actually needed to limit people, I disabled them.

[[About quota]]
Personally speaking, I'd like to see some comparison between quota 
enabled and disabled, to help determine whether it's quota that's causing 
the problem.

If you can find a good and reliable reproducer, it would be very helpful 
for developers to improve btrfs.

BTW, it's also a good idea to use ps to see which process is running at 
the time your btrfs hangs.

If it's the kernel thread named btrfs-transaction, then it may be related 
to quota.
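
For example, something simple like this (just a sketch) should be enough:

    ps -eo pid,state,wchan:32,comm | grep btrfs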


>
>> As for defrag, that's quite a topic of its own, with complications
>> related to snapshots and the nocow file attribute.  Very briefly, if you
>> haven't been running it regularly or using the autodefrag mount option by
>> default, chances are your available free space is rather fragmented as
>> well, and while defrag may help, it may not reduce fragmentation to the
>> degree you'd like.  (I'd suggest using filefrag to check fragmentation,
>> but it doesn't know how to deal with btrfs compression, and will report
>> heavy fragmentation for compressed files even if they're fine.  Since you
>> use compression, that kind of eliminates using filefrag to actually see
>> what your fragmentation is.)
>> Additionally, defrag isn't snapshot aware (they tried it for a few
>> kernels a couple years ago but it simply didn't scale), so if you're
>> using snapshots (as I believe Ubuntu does by default on btrfs, at least
>> taking snapshots for upgrade-in-place), so using defrag on files that
>> exist in the snapshots as well can dramatically increase space usage,
>> since defrag will break the reflinks to the snapshotted extents and
>> create new extents for defragged files.
>>
>> Meanwhile, the absolute worst-case fragmentation on btrfs occurs with
>> random-internal-rewrite-pattern files (as opposed to never changed, or
>> append-only).  Common examples are database files and VM images.  For
>> /relatively/ small files, to say 256 MiB, the autodefrag mount option is
>> a reasonably effective solution, but it tends to have scaling issues with
>> files over half a GiB so you can call this a negative recommendation for
>> trying that option with half-gig-plus internal-random-rewrite-pattern
>> files.  There are other mitigation strategies that can be used, but here
>> the subject gets complex so I'll not detail them.  Suffice it to say that
>> if the filesystem in question is used with large VM images or database
>> files and you haven't taken specific fragmentation avoidance measures,
>> that's very likely a good part of your problem right there, and you can
>> call this a hint that further research is called for.
>>
>> If your half-gig-plus files are mostly write-once, for example most media
>> files unless you're doing heavy media editing, however, then autodefrag
>> could be a good option in general, as it deals well with such files and
>> with random-internal-rewrite-pattern files under a quarter gig or so.  Be
>> aware, however, that if it's enabled on an already heavily fragmented
>> filesystem (as yours likely is), it's likely to actually make performance
>> worse until it gets things under control.  Your best bet in that case, if
>> you have spare devices available to do so, is probably to create a fresh
>> btrfs and consistently use autodefrag as you populate it from the
>> existing heavily fragmented btrfs.  That way, it'll never have a chance
>> for the fragmentation to build up in the first place, and autodefrag used
>> as a routine mount option should keep it from getting bad in normal use.
>
> Thanks for explaining that! Most of these files are written once and then read
> from for the rest of their "lifetime" until the simulations are done and they
> get archived/deleted. I'll try leaving autodefrag on and defragging directories
> over the holiday weekend when no one is using the server. There is some database
> usage, but I turned off COW for its folder and it only gets used sporadically
> and shouldn't be a huge factor in day-to-day usage.
>
> Also, is there a recommendation for relatime vs noatime mount options? I don't
> believe anything that runs on the server needs to use file access times, so if
> it can help with performance/disk usage I'm fine with setting it to noatime.
>
> I just tried copying a 70GB folder and then rm -rf it and it didn't appear to
> impact performance, and I plan to try some larger tests later.

It depends on the folder structure, but even for the worst case, it 
won't really trigger your problem.

[[About large files in btrfs]]
I agree with Duncan's suggestion completely, as that's a problem of the 
btrfs fs tree design: it causes too much contention on the same tree lock.
Changing to multiple subvolumes will improve performance greatly, 
especially for large files/directories.

The real problem is that btrfs deletes a large file in a way that 
doesn't scale:

it blocks the transaction until *all* the file extents belonging to the 
inode are deleted.

Check the __btrfs_update_delayed_inode() function in 
fs/btrfs/delayed-inode.c.

For small files that's OK, but for super huge files it's a nightmare, as 
the transaction won't be committed until all the file extents are deleted.
For the 70G case, the file will consist of fewer than 600 file extents.
2~3 leaves can handle that, so you may not feel the glitch when running 
the delayed inode.

But for your 500~700G case, btrfs will need to delete about 4K file 
extents; the deletion may change the b-tree hugely, and takes much longer.
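
(As a rough sanity check, assuming the usual 128 MiB maximum size of an 
uncompressed data extent: 70G / 128M is about 560 extents, and 500~700G 
/ 128M is roughly 4000~5600 extents, which is where the numbers above 
come from.)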

So in your case, you may need files that large to trigger the problem...

We can try a better method that deletes the file extents a batch at a 
time, transaction by transaction, and hope that helps your case.

Thanks,
Qu


>
> Thanks again for the help!
>
> -Mitch
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: btrfs: poor performance on deleting many large files
  2015-11-27  0:06           ` Christoph Anton Mitterer
@ 2015-11-27  3:38             ` Duncan
  2015-11-28  3:57               ` Christoph Anton Mitterer
  0 siblings, 1 reply; 48+ messages in thread
From: Duncan @ 2015-11-27  3:38 UTC (permalink / raw)
  To: linux-btrfs

Christoph Anton Mitterer posted on Fri, 27 Nov 2015 01:06:45 +0100 as
excerpted:

> And additionally, allow people to mount subvols with different
> noatime/relatime/atime settings (unless that's already working)... that
> way, they could enable it for things where they want/need it,... and
> disable it where not.

AFAIK, per-subvolume *atime mounts should already be working.  The *atime 
mount options are filesystem-generic (aka Linux vfs level), and while my 
own use-case doesn't involve subvolumes, the wiki says they should be 
working (wrapped link I'm not bothering to jump thru the hoops to 
properly unwrap):

https://btrfs.wiki.kernel.org/index.php/FAQ
#Can_I_mount_subvolumes_with_different_mount_options.3F

So while personally untested, per-subvolume *atime mount options /should/ 
"just work".

Meanwhile, I've simply grown to hate atime as an inefficient and mostly 
useless drain on resources, so I pretty much just noatime everything, 
which is the reason I decided to bother patching my kernel to make that 
the default, instead of having yet another option I use everywhere anyway 
clogging up the options field in my fstab.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: btrfs: poor performance on deleting many large files
  2015-11-27  3:38             ` Duncan
@ 2015-11-28  3:57               ` Christoph Anton Mitterer
  2015-11-28  6:49                 ` Duncan
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Anton Mitterer @ 2015-11-28  3:57 UTC (permalink / raw)
  To: Duncan, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 422 bytes --]

On Fri, 2015-11-27 at 03:38 +0000, Duncan wrote:
> AFAIK, per-subvolume *atime mounts should already be working.
Ah I see. :)

Still, specifically for snapshots that's a bit unhandy, as one 
typically doesn't mount each of them... one rather mounts e.g. the top 
level subvol and has a snapshots subdir there...
So perhaps the idea of having snapshots that are per se noatime is 
still not too bad.


Cheers,
Chris

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5313 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: btrfs: poor performance on deleting many large files
  2015-11-28  3:57               ` Christoph Anton Mitterer
@ 2015-11-28  6:49                 ` Duncan
  2015-12-12 22:15                   ` Christoph Anton Mitterer
  0 siblings, 1 reply; 48+ messages in thread
From: Duncan @ 2015-11-28  6:49 UTC (permalink / raw)
  To: linux-btrfs

Christoph Anton Mitterer posted on Sat, 28 Nov 2015 04:57:05 +0100 as
excerpted:

> On Fri, 2015-11-27 at 03:38 +0000, Duncan wrote:
>> AFAIK, per-subvolume *atime mounts should already be working.
> Ah I see. :)
> 
> Still, specifically for snapshots that's a bit unhandy, as one typically
> doesn't mount each of them... one rather mount e.g. the top level subvol
> and has a subdir snapshots there...
> So perhaps the idea of having snapshots that are per se noatime is still
> not too bad.

Read-only snapshots?  That'd do it, and of course you can toggle the read-
only property (see btrfs property and its btrfs-property manpage).

Alternatively, mount the toplevel subvol read-only or noatime on one 
mountpoint, and bind-mount it read-write or whatever other appropriate 
*atime elsewhere (or the reverse, if more appropriate).  Then use the 
noatime or read-only one unless you specifically want atimes updated.
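
Roughly, with made-up paths (untested here):

    # toggle the read-only property of an existing snapshot
    btrfs property set /mnt/snapshots/2015-11-28 ro true
    btrfs property get /mnt/snapshots/2015-11-28 ro

    # or expose the same subvol twice, with different atime handling
    mount -o subvol=top,noatime /dev/sdb1 /mnt/top
    mount --bind /mnt/top /mnt/top-atime
    mount -o remount,bind,relatime /mnt/top-atime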

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
  2015-11-26  0:33     ` Hugo Mills
@ 2015-12-09  5:43       ` Christoph Anton Mitterer
  2015-12-09 13:36         ` Duncan
  2015-12-14  1:44       ` Christoph Anton Mitterer
  1 sibling, 1 reply; 48+ messages in thread
From: Christoph Anton Mitterer @ 2015-12-09  5:43 UTC (permalink / raw)
  To: Hugo Mills; +Cc: Austin S Hemmelgarn, linux-btrfs, Duncan

[-- Attachment #1: Type: text/plain, Size: 5585 bytes --]

Hey Hugo,


On Thu, 2015-11-26 at 00:33 +0000, Hugo Mills wrote:
>    Answering the second part first, no, it can't.
Thanks so far :)


>    The issue is that nodatacow bypasses the transactional nature of
> the FS, making changes to live data immediately. This then means that
> if you modify a nodatacow file, the csum for that modified section is
> out of date, and won't be back in sync again until the latest
> transaction is committed. So you can end up with an inconsistent
> filesystem if there's a crash between the two events.
Sure... (and btw: is there some kind of journal planned for 
nodatacow'ed files?)... but why not simply try to write an updated 
checksum after the modified section has been flushed to disk?  Of 
course there's no guarantee that both are consistent in case of a crash 
(but that's also the case without any checksum)... but at least one 
would have the csum protection against everything else (block errors and 
the like) in case no crash occurs?



> > For me the checksumming is actually the most important part of
> > btrfs
> > (not that I wouldn't like its other features as well)... so turning
> > it
> > off is something I really would want to avoid.
> > 
> > Plus it opens questions like: When there are no checksums, how can
> > it
> > (in the RAID cases) decide which block is the good one in case of
> > corruptions?
>    It doesn't decide -- both copies look equally good, because
> there's
> no checksum, so if you read the data, the FS will return whatever
> data
> was on the copy it happened to pick.
Hmm I see... so one basically gets the behaviour of traditional RAID.
Isn't that kind of a big loss? I always considered the guarantee 
against block errors and the like one of the big and basic features of 
btrfs.
It seems that for certain (not too unimportant) cases, like DBs and VMs, 
one has to choose between two evils: losing the guaranteed consistency 
via checksums, or basically running into severe trouble (like Mitch's 
reported fragmentation issues).


> > 3) When I would actually disable datacow for e.g. a subvolume that
> > holds VMs or DBs... what are all the implications?
> > Obviously no checksumming, but what happens if I snapshot such a
> > subvolume or if I send/receive it?
> 
>    After snapshotting, modifications are CoWed precisely once, and
> then it reverts to nodatacow again. This means that making a snapshot
> of a nodatacow object will cause it to fragment as writes are made to
> it.
I see... something that should possibly go into some advanced admin 
documentation (if it isn't already there).
It basically means that one must ensure that any such files (VM 
images, DB data dirs) are already created with nodatacow (perhaps on a 
subvolume which is mounted as such).


> > 4) Duncan mentioned that defrag (and I guess that's also for auto-
> > defrag) isn't ref-link aware...
> > Isn't that somehow a complete showstopper?
>    It is, but the one attempt at dealing with it caused massive data
> corruption, and it was turned off again.
So... does this mean that it's still planned to be implemented some day
or has it been given up forever?
And is it (hopefully) also planned to be implemented for reflinks when
compression is added/changed/removed?


Given that you (or Duncan?... sorry, I sometimes mix up which of you said 
exactly what, since both of you are notoriously helpful :-) ) mentioned 
that autodefrag basically fails with larger files,... and given that it 
seems to be quite important for btrfs to not be fragmented too heavily, 
it sounds a bit as if anything that uses (multiple) reflinks (e.g. 
snapshots) cannot really be used very well.


>  autodefrag, however, has
> always been snapshot aware and snapshot safe, and would be the
> recommended approach here.
Ahhh... so autodefrag *is* snapshot aware, and that's basically why the 
suggestion is (AFAIU) that it be turned on, right?
So, I'm afraid O:-), that triggers a follow-up question:
Why isn't it the default? Or in other words what are its drawbacks
(e.g. other cases where ref-links would be broken up,... or issues with
compression)?

And also, when I now activate it on an already populated fs, will it 
also defrag old files (even if they're not being rewritten)?
I tried to have a look for some general (rather "for dummies" than for
core developers) description of how defrag and autodefrag work... but
couldn't find anything in the usual places... :-(

btw: The wiki (https://btrfs.wiki.kernel.org/index.php/UseCases#How_do_
I_defragment_many_files.3F) doesn't mention that auto-defrag doesn't
suffer from that problem.


>  (Actually, it was broken in the same
> incident I just described -- but fixed again when the broken patches
> were reverted).
So it just couldn't be fixed (hopefully: yet) for the (manual) online 
defragmentation?!


> > 5) Especially keeping (4) in mind but also the other comments in
> > from
> > Duncan and Austin...
> > Is auto-defrag now recommended to be generally used?
>
>    Absolutely, yes.
I see... well, I'll probably wait for some answers about its drawbacks
and then give it a try.


>    It's late for me, and this email was longer than I suspected, so
> I'm going to stop here, but I'll try to pick it up again and answer
> your other questions tomorrow.
Thanks so far :)

I know I haven't replied to that thread for some days, but if you have
anything to add to the remaining questions, I'd be still happy to read
it :)


Thanks and best wishes,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5313 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
  2015-11-26 23:08     ` Duncan
@ 2015-12-09  5:45       ` Christoph Anton Mitterer
  2015-12-09 16:36         ` Duncan
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Anton Mitterer @ 2015-12-09  5:45 UTC (permalink / raw)
  To: Duncan, linux-btrfs

On 2015-11-27 00:08, Duncan wrote:
> Christoph Anton Mitterer posted on Thu, 26 Nov 2015 01:23:59 +0100 as
> excerpted:
>> 1) AFAIU, the fragmentation problem exists especially for those files
>> that see many random writes, especially, but not limited to, big files.
>> Now that databases and VMs are affected by this, is probably broadly
>> known in the meantime (well at least by people on that list).
>> But I'd guess there are n other cases where such IO patterns can happen
>> which one simply never notices, while the btrfs continues to degrade.
> 
> The two other known cases are:
> 
> 1) Bittorrent download files, where the full file size is preallocated 
> (and I think fsynced), then the torrent client downloads into it a chunk 
> at a time.
Okay, sounds obvious.


> The more general case would be any time a file of some size is 
> preallocated and then written into more or less randomly, the problem 
> being the preallocation, which on traditional rewrite-in-place 
> filesystems helps avoid fragmentation (as well as ensuring space to save 
> the full file), but on COW-based filesystems like btrfs, triggers exactly 
> the fragmentation it was trying to avoid.
Is it really just the case when the file storage *is* actually fully 
pre-allocated?
Cause that wouldn't (necessarily) be the case for e.g. VM images (e.g. 
qcow2, or raw images when these are sparse files).
Or is it rather any case where, in a larger file, many random (file-
internal) writes occur?


> arranging to 
> have the client write into a dir with the nocow attribute set, so newly 
> created torrent files inherit it and do rewrite-in-place, is highly 
> recommended.
At the IMHO pretty high expense of losing the checksumming :-(
Basically losing half of the main functionality that makes btrfs 
interesting for me.
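
(For reference, the nocow download dir recipe referred to above would 
presumably be just something like this, with a made-up path; only files 
created after the attribute is set inherit it:)

    mkdir -p ~/torrents/incomplete
    chattr +C ~/torrents/incomplete
    lsattr -d ~/torrents/incomplete    # should now show the 'C' attribute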


> It's also worth noting that once the download is complete, the files 
> aren't going to be rewritten any further, and thus can be moved out of 
> the nocow-set download dir and treated normally.
Sure... but this requires manual intervention.

For databases, will e.g. the vacuuming maintenance tasks solve the 
fragmentation issues (cause I guess at least when doing full vacuuming, 
it will rewrite the files)?


> The problem is much reduced in newer systemd, which is btrfs aware and in 
> fact uses btrfs-specific features such as subvolumes in a number of cases 
> (creating subvolumes rather than directories where it makes sense in some 
> shipped tmpfiles.d config files, for instance), if it's running on 
> btrfs.
Hmm, doesn't seem really good to me if systemd does that, cause it 
then excludes any such files from being snapshotted.


> For the journal, I /think/ (see the next paragraph) that it now 
> sets the journal files nocow, and puts them in a dedicated subvolume so 
> snapshots of the parent won't snapshot the journals, thereby helping to 
> avoid the snapshot-triggered cow1 issue.
The same here, kinda disturbing if systemd decides that on its 
own, i.e. excluding files from being checksum protected...


>> So is there any general approach towards this?
> The general case is that for normal desktop users, it doesn't tend to be 
> a problem, as they don't do either large VMs or large databases,
Well depends a bit on how one defines the "normal desktop user",... for
e.g. developers or more "power users" it's probably not so unlikely that
they do run local VMs for testing or whatever.

> and 
> small ones such as the sqlite files generated by firefox and various 
> email clients are handled quite well by autodefrag, with that general 
> desktop usage being its primary target.
Which is however not yet the default...


> For server usage and the more technically inclined workstation users who 
> are running VMs and larger databases, the general feeling seems to be 
> that those adminning such systems are, or should be, technically inclined 
> enough to do their research and know when measures such as nocow and 
> limited snapshotting along with manual defrags where necessary, are 
> called for.
mhh... well it's perhaps reasonable to expect that knowledge for a few 
things like VMs, DBs and the like... but there are countless software 
systems, many of them being more or less a black box, at least with 
respect to their internals.

It feels a bit as if there should be some tools provided by btrfs which 
tell the users which files are likely problematic and should be 
nodatacow'ed.


> And if they don't originally, they find out when they start 
> researching why performance isn't what they expected and what to do about 
> it. =:^)
Which can take quite a while to be found out...


>> And what are the actual possible consequences? Is it just that fs gets
>> slower (due to the fragmentation) or may I even run into other issues to
>> the point the space is eaten up or the fs becomes basically unusable?
> It's primarily a performance issue, tho in severe cases it can also be a 
> scaling issue, to the point that maintenance tasks such as balance take 
> much longer than they should and can become impractical to run
hmm so it could in principle also affect other files and not just the
fragmented ones, right?!

Are there any problems caused by all this with respect to free space
fragmentation? And what exactly are the consequences of free space
fragmentation? ;)


> (where the 
> alternative starting over with a new filesystem and restoring from 
> backups is faster)
Which is not always feasible :-/ .. and shouldn't be necessary for a fs.


Well I probably miss some real world experience here, i.e. whether these
issues are really problematic in practice or rather not, but that sounds
all quite worrisome..


>> This is especially important for me, because for some VMs and even DBs I
>> wouldn't want to use nodatacow, because I want to have the checksumming.
>> (i.e. those cases where data integrity is much more important than
>> security)
> In general, nocow and the resulting loss of checksumming on these files 
> isn't nearly the problem that it might seem at first glance.  Why?  
> Because think about it, the applications using these files have had to be 
> usable on more traditional filesystems without filesystem-level 
> checksumming for decades, so the ones where data integrity is absolutely 
> vital have tended to develop their own data integrity assurance 
> mechanisms.  They really had no choice, as if they hadn't, they'd have 
> been too unstable for the tasks at hand, and something else would have 
> come along that was more stable and thus more suited to the task at hand.
Hmm, I don't share that view... take DBs: these are typically not 
checksummed, simply for performance reasons... so if you had block 
corruption, it could easily be the case that simply a value was 
changed, which would go through unnoticed.

IIRC, Ted Tso once mentioned that some proposals for checksumming on
ext4 had been made (or that even some work was done on that)... so I
guess it must be doable even without CoW.
As said previously,... not having checksumming, even when "just" in
cases like VMs, DBs, etc. seems like a very big loss to me. :(


> In fact, while I've seen no reports of this recently, a few years ago 
> there were a number of reported cases where the best explanation was that 
> after a crash, the btrfs level file integrity and the application level 
> file integrity apparently clashed, with the btrfs commit points and the 
> application's own commit points out of sync, so that while btrfs said the 
> file was fine, apparently parts of it were from before an application 
> level checkpoint while other parts of it were after, so the application 
> itself rejected the file, even tho the btrfs checksums matched.
mhh I remember some cases, where these programs didn't properly sync
their data while already writing their own journals or similar
statuses... but that's simply bugs in these applications.

What fs level checksumming should mainly protect against, AFAIU, is 
corruption on the media or bus.


> As I said, that was a few years ago, and I think btrfs' barrier handling 
> and fsync log rewriting are better now, such that I've not seen such 
> reports in quite awhile.  But something was definitely happening at the 
> time, and I think in at least some cases the application alone would have 
> handled things better, as then it could have detected the damage and 
> potentially replayed its own log or restored to a previous checkpoint, 
> the exact same thing it did on filesystems without the integrity 
> protections btrfs has.
Hmm I think dpkg was one case, IIRC,... but again... this is nothing
that would apply to big VM images, where there is no protection from the
application... and nothing that protects against single byte errors,
which merely change a value and which even DBs with their journal
wouldn't notice.


> In 
> many cases they'd simply process the corrupt data and keep on going,
Which may be much worse than if they'd crash at least...

> while in others they'd crash, but it wouldn't be a big deal, because it'd 
> be one corrupt jpeg or a few seconds of garbage in an mp3 or mpeg, and if 
> the one app couldn't handle it without crashing, another would.
That may apply to desktop applications, where wrong data is usually not 
that critical... but if you do scientific computation, then these kinds 
of unnoticed errors may easily be the worst.


> Plus, the admins running the big, important apps, are much more likely to 
> appreciate the value of the admin's rule of backups
Backups don't help in the case of silent single-block corruption... 
your data is simply wrong and you continue to use it, which is why 
overall checksumming (including on every read) would be so important.


> Because checksumming doesn't help you if the filesystem as a whole goes 
> bad, or if the physical devices hosting it do so, while backups do!  (And 
> the same of course applies to snapshotting, tho they can help with the 
> generally worst risk, as any admin worth their salt knows, the admin's 
> own fat-fingering!)
Sure, but these kinds of incidents are rather harmless (given one has 
done proper backups) as they're more or less immediately noticed.


>> 2) Why does notdatacow imply nodatasum and can that ever be decoupled?
> 
> Hugo covered that.  It's a race issue.  With data rewritten in-place, 
> it's no longer possible to atomically update both the data and its 
> checksum at the same time, and if there's a crash between updates of the 
> two or while one is actually being written...
> 
> Which is precisely why checksummed data integrity isn't more commonly 
> implemented; on overwrite-in-place, it's simply not race free, so copy-on-
> write is what actually makes it possible.  Therefore, disable copy-on-
> write and by definition you must disable checksumming as well.
I've already answered that in my reply to Hugo... so see there.

Plus, as said above, I seem to remember that something was in the works 
for ext4... so it must somehow be possible, even if at the cost that 
it's ambiguous in case of crashes.


> Snapshots too are cow-based, as they lock in the existing version where 
> it's at.  By virtue of necessity, then, first-writes to a block after a 
> snapshot cow it, that being a necessary exception to nocow.  However, the 
> file retains its nocow attribute, and further writes to the new block are 
> now done in-place... until it to is locked in place by another snapshot.
Maybe that (and further exceptions, if any) should go into the 
description of nodatacow, also explaining the possible implications 
(like the fragmentation that will then likely occur in the live, 
non-snapshot copy of the file).


>> 4) Duncan mentioned that defrag (and I guess that's also for auto-
>> defrag) isn't ref-link aware...
>> Isn't that somehow a complete showstopper?
>>
>> As soon as one uses snapshot, and would defrag or auto defrag any of
>> them, space usage would just explode, perhaps to the extent of ENOSPC,
>> and rendering the fs effectively useless.
>>
>> That sounds to me like, either I can't use ref-links, which are crucial
>> not only to snapshots but every file I copy with cp --reflink auto ...
>> or I can't defrag... which however will sooner or later cause quite some
>> fragmentation issues on btrfs?
> 
> Hugo answered this one too, tho I wasn't aware that autodefrag was 
> snapshot-aware.
Is there some "definite" resource on that? Just in case Hugo may have
recalled this incorrectly?


> But even without snapshot awareness, with an appropriate program of 
> snapshot thinning (ideally no more than 250-ish snapshots per subvolume, 
> which easily covers a year's worth of snapshots even starting at 
> something like half-hourly, if they're thinned properly as well; 250 per 
> subvolume lets you cover 8 subvolumes with a 2000 snapshot total, a 
> reasonable cap that doesn't trigger severe scaling issues) defrag 
> shouldn't be /too/ bad.
>
> Most files aren't actually modified that much, so the number of
> defrag-triggered copies wouldn't be that high.
Hmm, I thought that would only depend on how badly the files are 
fragmented when being snapshotted.
If I make a snapshot while there are many fragments, and then defrag 
one of them, everything that gets defragmented would be rewritten, 
losing any ref-links, while files that aren't defragmented would retain 
them.

So I'd have thought that whether one runs into scaling issues depends 
fully on the respective fs.


So concluding:
- auto-defrag is ref-link aware and generally suggested to be enabled;
  it should also have no issues with compression
- non-auto-defrag may become reflink aware again in the future(?),
  solving the problems that arise right now from reflink
  copies/snapshots and the need to defragment for performance reasons
  (in those cases where autodefrag doesn't work well)
- at least in my opinion, not having checksumming is a very big loss,
  by far not compensated for in most cases at the application level



> Autodefrag is recommended for, and indeed targeted at, general desktop 
> use, where internal-rewrite-pattern database, etc, files tend to be 
> relatively small, quarter to half gig at the largest.
Hmm, and what about mixed-use systems... which have both desktop and 
server-like IO patterns?


> btrfs defrag works fine with compression and in fact it even has an 
> option to compress as it goes, thus allowing one to use it to compress 
> files later, if you for instance weren't running the compress mount 
> option (or perhaps toggled between zlib and lzo based compression) at the 
> time the file was originally written.
btw: I think the documentation (at least the manpage) doesn't say whether 
btrfs defragment -c XX will do anything to files which aren't fragmented.
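
I.e. for something like:

    # recompress a whole tree while defragmenting it; does this rewrite
    # files that aren't fragmented at all?
    btrfs filesystem defragment -r -czlib /path/to/dir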


> FWIW, I believe the intent remains to reenable snapshot-aware-defrag 
> sometime in the future, after the various scaling issues including 
> quotas, have been dealt with.  When the choice is between a defrag taking 
> a half hour but not being snapshot aware, and taking perhaps literally 
> /weeks/, because the scaling issues really were that bad... an actually 
> practical defrag, even if it broke snapshot reflinks, was *clearly* 
> preferred to one that was for all practical purposes too badly broken to 
> actually use, because it scaled so badly it took weeks to do what should 
> have been a half-hour job.
Phew... "clearly" may be rather something that differs from person to
person.
- A defrag that doesn't work due to scaling issues - well one can
hopefully abort it and it's as if there simply was no defragmentation.
- A defrag which breaks up the ref-links, may eat up vast amounts of
storage that should not need to be "wasted" like this, and you'll never
get the ref-links back (unless perhaps with dedup).

Especially since the reflink stuff is one of the core parts of btrfs, I 
wouldn't be so sure that it's better to silently break up the reflinks 
(end users likely have no idea of what we're discussing here, and it 
doesn't seem to be mentioned in the manpages) instead of simply having a 
non-working defragmentation.


> The only exception to that 
> would be if people simply give up on quotas entirely, and there's enough 
> demand for that feature that giving up on them would be a *BIG* hit to 
> btrfs as the assumed ext* successor, so unless they come up against a 
> wall and find quotas simply can't be done in a reliable and scalable way 
> on btrfs, the feature /will/ be there eventually, and then I think 
> snapshot-aware-defrag work can resume.
Well, sounds like a good plan; dropping quotas would surely be bad for 
many people... as however would several other things (like the 
aforementioned loss of checksumming).


> I actually don't know what the effect of defrag, with or without 
> recompression, is on same-subvolume reflinks.  If I were to guess I'd say 
> it breaks them too, but I don't know.  If I needed to know I'd probably 
> test it to see... or ask.
How would you find out? Somehow via space usage?


> It _is_ worth noting, however, lest there be any misconceptions, that 
> regardless of the number of reflinks sharing an extent between them, a 
> single defrag on a single file will only make, maximum, a single 
> additional copy.  It's not like it makes another copy for each of the 
> reflinks to it, unless you defrag each of those reflinks individually.
Yes, that's what I'd have expected.
However, when one runs e.g. btrfs fi defrag /snapshots/ one would get n 
additional copies (one per snapshot), in the worst case.


>> 7) How das free-space defragmentation happen (or is there even such a
>> thing)?
>> For example, when I have my big qemu images, *not* using nodatacow, and
>> I copy the image e.g. with qemu-img old.img new.img ... and delete the
>> old then.
>> Then I'd expect that the new.img is more or less not fragmented,... but
>> will my free space (from the removed old.img) still be completely messed
>> up sooner or later driving me into problems?

> and having to manually run a balance
> -dusage=0
btw: shouldn't it do that particular one automatically from time to
time? Or is that actually the case now, by what you mentioned further
below around 3.17?


> So at some point, defrag will need at least partially rewritten to be at 
> least somewhat more greedy in its new data chunk allocation.
Just wanted to ask why defrag doesn't simply allocate some bigger chunks
of data in advance... ;)

>  I'm not a 
> coder so I can't evaluate how big a rewrite that'll be, but with a bit of 
> luck, it's more like a few line patch than a rewrite.  Because if it's a 
> rewrite, then it's likely to wait until they can try to address the 
> snapshot-aware-defrag issue again at the same time, and it's anyone's 
> guess when that'll be, but probably more like years than months.
Years? Ok... what a pity...


> Meanwhile, I don't know that anybody has tried this yet, and with both 
> compression and autodefrag on here it's not easy for me to try it, but in 
> theory anyway, if defrag isn't working particularly well, it should be 
> possible to truncate-create a number of GiB-sized files, sync (or fsync 
> each one individually) so they're written out to storage, then truncate 
> each file down to a few bytes, something 0 < size < 4096 bytes (or page 
> size on archs where it's not 4096 by default), so they take only a single 
> block of that original 1 GiB allocation, and sync again.
a) wouldn't truncate create a sparse file? And would btrfs then really 
allocate chunks for that (it would sound quite strange to me), which I 
guess is your goal here?

b) How can one find out whether defragmentation worked well? I guess with 
filefrag in the compress=no case, and not at all in any other?
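
E.g. something like:

    filefrag /path/to/large.img        # reports "N extents found"
    # with compression each 128 KiB compressed extent is counted
    # separately, so the number looks alarming even for healthy files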


>> Are there for example any general recommendations what to regularly to
>> do keep the fs in a clean and proper shape (and I don't count "start
>> with a fresh one and copy the data over" as a valid way).

> So "start with a fresh btrfs and copy the data over", is indeed part of 
> my regular backups routine here
[The following being more general thoughts/comments, not specifically a
reply to you ;-)]:
Well apart from several other more severe things I've mentioned before
(plus the IMHO quite severe issues with UUID collisions and possible
security leaks) I can only emphasize this once more:
It's IMHO not acceptable for a fs when it would more or less require
starting with a fresh fs every now and then - at least not when one
wants to use it in production mode.

Obviously, I don't demand that one can do in-place conversion when new 
features get in (like skinny metadata)... I'm rather saying that copying 
data off, starting with a fresh fs and copying data back cannot be 
considered (more or less) normal maintenance, e.g. to fight severe 
forms of fragmentation or the like.

While this wouldn't be a problem for single desktop machines or 
smaller servers, it would IMHO be a showstopper for big storage tiers 
(and I'm running one).
Take the LHC Computing Grid for example,...we manage some 100 PiB,
probably more in the meantime, in many research centres worldwide, much
of that being on disk and at least some parts of it with no real backups
anywhere. This may sound stupid, but in reality, one has funding
constraints and many other reasons that may keep one from having
everything twice.
This should especially demonstrate that not everyone has e.g. twice his
actually used storage just to move the data away, recreate the
filesystems and move it back (not to talk about any larger downtimes
that would result from that).

For quite a while I was thinking about productively using btrfs at our
local Tier-2 in Munich, but then decided against:
- the regular kernels from our current distro would have been rather too
old (and btrfs in them probably not yet stable enough)
- and even with more current kernels (I decided that around 4.0), btrfs
RAID6 (as well as MD RAID, with either btrfs or ext4) was slower than
ext4 on hardware RAID.
Not in all IO cases, but in the majority of those IO patterns that we
have (which is typically write once+read many, never append, sequential
read, random read and vector read).

On HW RAID, btrfs and ext4 were rather close... yet I decided against
btrfs for now, as it feels like it needs much more maintenance (in the
form of human interaction, digging out what actually causes the problems
and so on)... and as one of the core guys in storage support here (at
least in Germany), I'd probably recommend other sites the same.

So apart from these bigger issues, some of which are possibly rather a
matter of time to be solved during development (like snapshot aware
defrag), there are IMHO also some other areas which make it difficult to
use btrfs at large.
The fact alone that you guys need to explain here over many pages shows
that. And not every group/organisation/company is big enough to simply
hire their own btrfs developers to get first grade support ;)

Part of that is of course my own inexperience with btrfs (at least to 
the extent that I'd entrust it with our ~2PiB of data)... but even during 
the short time that I've been more "regularly" on the list here, I've read 
about many people having issues (with fragmentation mostly ^^) and 
stumbled over many places where I think documentation for 
admins/end-users is missing... or over effects which are not at all 
clear to the non-power-btrfs-user but which may have tremendous 
consequences (e.g. CoW+DBs/VMs/etc, atime+snapshots, defrag+snapshots, 
etc.).  And while those are rather clear if one thinks thoroughly 
through the likely effects of CoW, others, like what Marc Merlin 
recently reported (ENOSPC on scrub), aren't that easily clear at all.

Long story short... this is all fine when I just play around with my 
notebooks, or my few own servers... in the worst case I start from 
scratch using a backup... but when dealing with more systems, or those 
where downtime/failure is a much bigger problem, then I think 
self-maintenance and documentation need to get better (especially for 
normal admins; and believe me, not every admin is willing to dig into 
the details of btrfs and understand "all" the circumstances of 
fragmentation or issues with datacow/nodatacow).


> But in terms of your question, the only things I do somewhat regularly 
> are an occasional scrub (with btrfs raid1 precisely so I /do/ have a 
> second copy available if one or the other fails checksum), and mostly 
> because it's habit from before the automatic empty chunk delete code and 
> my btrfs are all relatively small so the room for error is accordingly 
> smaller, keeping an eye on the combination of btrfs fi sh and btrfs fi df, 
> to see if I need to run a filtered balance.
Speaking of which:
Is there a good piece of documentation somewhere on what exactly all the 
numbers from show, df, usage and so on mean?


> Other than that, it's the usual simply keeping up with the backups
Well but AFAIU it's much more, which I'd count towards maintenance:
- enabling autodefrag
- fighting fragmentation (by manually using svols with nodatacow in
  those cases where necessary, which first need to be determined)
- enabling notatime, especially when doing snapshots
- sometimes (still?) the necessity to run balance to reorder block
  groups,.. okay you said that empty ones are now automatically
  reclaimed.



Thanks for all your detailed explanations, that helped a lot[0] :)
Cheers,
Chris.


[0] The same goes obviously for Hugo :)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
  2015-12-09  5:43       ` Christoph Anton Mitterer
@ 2015-12-09 13:36         ` Duncan
  2015-12-14  2:46           ` Christoph Anton Mitterer
  2015-12-16 23:39           ` Kai Krakow
  0 siblings, 2 replies; 48+ messages in thread
From: Duncan @ 2015-12-09 13:36 UTC (permalink / raw)
  To: linux-btrfs

Christoph Anton Mitterer posted on Wed, 09 Dec 2015 06:43:01 +0100 as
excerpted:

> Hey Hugo,
> 
> 
> On Thu, 2015-11-26 at 00:33 +0000, Hugo Mills wrote:
> 
>> The issue is that nodatacow bypasses the transactional nature of
>> the FS, making changes to live data immediately. This then means that
>> if you modify a nodatacow file, the csum for that modified section is
>> out of date, and won't be back in sync again until the latest
>> transaction is committed. So you can end up with an inconsistent
>> filesystem if there's a crash between the two events.

> Sure,... (and btw: is there some kind of journal planned for
> nodatacow'ed files?),... but why not simply try to write an updated
> checksum after the modified section has been flushed to disk... of
> course there's no guarantee that both are consistent in case of a crash
> (but that's also the case without any checksum)... but at least one would
> have the csum protection against everything else (block errors and the
> like) in case no crash occurs?

Answering the BTW first, not to my knowledge, and I'd be skeptical.  In 
general, btrfs is cowed, and that's the focus.  To the extent that nocow 
is necessary for fragmentation/performance reasons, etc, the idea is to 
try to make cow work better in those cases, for example by working on 
autodefrag to make it better at handling large files without the scaling 
issues it currently has above half a gig or so, and thus to confine nocow 
to a smaller and smaller niche use-case, rather than focusing on making 
nocow better.

Of course it remains to be seen how much better they can do with 
autodefrag, etc, but at this point, there's way more project 
possibilities than people to develop them, so even if they do find they 
can't make cow work much better for these cases, actually working on nocow 
would still be rather far down the list, because there's so many other 
improvement and feature opportunities that will get the focus first.  
Which in practice probably puts it in "it'd be nice, but it's low enough 
priority that we're talking five years out or more, unless of course 
someone else qualified steps up and that's their personal itch they want 
to scratch", territory.

As for the updated checksum after modification, the problem with that is 
that in the mean time, the checksum wouldn't verify, and while btrfs 
could of course keep status in memory during normal operations, that's 
not the problem, the problem is what happens if there's a crash and in-
memory state vaporizes.  In that case, when btrfs remounted, it'd have no 
way of knowing why the checksum didn't match, just that it didn't, and 
would then refuse access to that block in the file, because for all it 
knows, it /is/ a block error.

And there's already a mechanism for telling btrfs to ignore checksums, 
and nocow already activates it, so... there's really nothing more to be 
done.

>> > For me the checksumming is actually the most important part of btrfs
>> > (not that I wouldn't like its other features as well)... so turning
>> > it off is something I really would want to avoid.

Same here.  In fact, my most anticipated feature is N-way-mirroring, 
since that will allow three copies (or more, but three is my sweet spot 
balance between the space and reliability factors) instead of the current 
limit of two.  It just disturbs me that in the event of one copy being 
bad, the other copy /better/ be good, because there's no further 
fallback!  With a third copy, there'd be that one further fallback, and 
the chances of all three copies failing checksum verification are remote 
enough I'm willing to risk it, given the incremental cost of additional 
copies.

>> > Plus it opens questions like: When there are no checksums, how can it
>> > (in the RAID cases) decide which block is the good one in case of
>> > corruptions?

>>    It doesn't decide -- both copies look equally good, because
>> there's no checksum, so if you read the data, the FS will return
>> whatever data was on the copy it happened to pick.

> Hmm I see... so one gets basically the behaviour of RAID.
> Isn't that kind of a big loss? I always considered the guarantee against
> block errors and the like one of the big and basic features of btrfs.

It is a big and basic feature, but turning it off isn't the end of the 
world, because then it's still the same level of reliability other 
solutions such as raid generally provide.

And the choice to turn it off is just that, a choice, tho it's currently 
the recommended one in some cases, such as with large VM images, etc.

But as it happens, both VM image management and databases tend to come 
with their own integrity management, in part precisely because the 
filesystem could never provide that sort of service.  So to the extent 
that btrfs must turn off its integrity management features when dealing 
with that sort of file, it's no bigger deal than it would be on any other 
filesystem, it's simply returning what's normally a huge bonus compared 
to other filesystems, to the status quo for specific situations that it 
otherwise doesn't deal so well with.  And if the status quo was good 
enough before, and in the absence of btrfs would of necessity be good 
enough still, then where it's necessary with btrfs, it's good enough 
there as well.

IOW, there's only upside, no downside.  If the upside doesn't apply, it's 
still no worse than it was before, no downside.

> It seems that for certain (not too unimportant) cases (DBs, VMs) one has
> to decide between either evil: losing the guaranteed consistency via
> checksums... or basically running into severe troubles (like Mitch's
> reported fragmentation issues).
> 
> 
>> > 3) When I would actually disable datacow for e.g. a subvolume that
>> > holds VMs or DBs... what are all the implications?
>> > Obviously no checksumming, but what happens if I snapshot such a
>> > subvolume or if I send/receive it?
>> 
>>    After snapshotting, modifications are CoWed precisely once, and
>> then it reverts to nodatacow again. This means that making a snapshot
>> of a nodatacow object will cause it to fragment as writes are made to
>> it.
> I see... something that should possibly go into some advanced admin
> documentation (if it's not already there).
> It basically means that one must ensure that any such files (VM images,
> DB data dirs) are already created with nodatacow (perhaps on a subvolume
> which is mounted as such).
> 
> 
>> > 4) Duncan mentioned that defrag (and I guess that's also for auto-
>> > defrag) isn't ref-link aware...
>> > Isn't that somehow a complete showstopper?

>> It is, but the one attempt at dealing with it caused massive data
>> corruption, and it was turned off again.

IIRC, it wasn't data corruption so much, as massive scaling issues, to 
the point where defrag was entirely useless, as it could take a week or 
more for just one file.

So the decision was made that a non-reflink-aware defrag that actually 
worked in something like reasonable time even if it did break reflinks 
and thus increase space usage, was of more use than a defrag that 
basically didn't work at all, because it effectively took an eternity.  
After all, you can always decide not to run it if you're worried about 
the space effects it's going to have, but if it's going to take a week or 
more for just one file, you effectively don't have the choice to run it 
at all.

> So... does this mean that it's still planned to be implemented some day
> or has it been given up forever?

AFAIK it's still on the list.  And the scaling issues are better, but one 
big thing holding it up now is quota management.  Quotas never have 
worked correctly, but they were a big part (close to half, IIRC) of the 
original snapshot-aware-defrag scaling issues, and thus must be reliably 
working and in a generally stable state before a snapshot-aware-defrag 
can be coded to work with them.  And without that, it's only half a 
solution that would have to be redone when quotas stabilized anyway, so 
really, quota code /must/ be stabilized to the point that it's not a 
moving target, before reimplementing snapshot-aware-defrag makes any 
sense at all.

But even at that point, while snapshot-aware-defrag is still on the list, 
I'm not sure if it's ever going to be actually viable.  It may be that 
the scaling issues are just too big, and it simply can't be made to work 
both correctly and in anything approaching practical time.  Time will 
tell, of course, but until then...

> Given that you (or Duncan?,... sorry I sometimes mix up which of you said
> exactly what, since both of you are notoriously helpful :-) ) mentioned
> that autodefrag basically fails with larger files,... and given that it
> seems to be quite important for btrfs to not be fragmented too heavily,
> it sounds a bit as if anything that uses (multiple) reflinks (e.g.
> snapshots) cannot be really used very well.

That might have been either of us, as I think we've both said effectively 
that, over time.

As for reflink/snapshot usefulness, it really depends on your use-case.  
If both modifications and snapshots are seldom, it shouldn't be a big 
deal.  For use-cases where snapshots are temporary, as can be the case 
for most snapshots anyway in most send/receive usage scenarios, again, 
the problem is quite limited.

The biggest problem is with large random-rewrite-pattern files, where 
both rewrites and snapshots occur frequently.  That's really a worst-case 
for copy-on-write in general, and btrfs is no exception.  But there's 
still workarounds that can help keep the situation under control, and if 
it comes to it, one can always use other filesystems and accept their 
limitations, where btrfs isn't a particularly useful choice due to these 
sorts of limitations.

Which again emphasizes my point, while there's cases where btrfs' 
features run into limits, it's all upside, no downside.  Worst-case, you 
set nocow and turn off snapshotting, but that's exactly the situation 
you're in anyway with other filesystems, so you're no worse off than if 
you were using them.

Meanwhile, where those btrfs features *can* be used, which is on /most/ 
files, with only limited exceptions, it's all upside! =:^)

>>  autodefrag, however, has
>> always been snapshot aware and snapshot safe, and would be the
>> recommended approach here.

> Ahhh... so autodefrag *is* snapshot aware, and that's basically why the
> suggestion is (AFAIU) that it's turned on, right?

FWIW, I've seen it asserted that autodefrag is snapshot aware a few times 
now, but I'm not personally sure that is the case and I don't see any 
immediately obvious reason it would be, when (manual) defrag isn't, so 
I've refrained from making that claim, myself.  If I were to see multiple 
devs make that assertion, I'd be more confident, but I believe I've only 
seen it from Hugo, and while I trust him in general because in general 
what he says makes sense, here, as I said, it just doesn't make immediate 
sense to me that the two would be so different, and without that 
explained and lacking further/other confirmation...  I just remain 
personally unsure and thus refrain from making that assertion, myself.

Which is why you've not seen me mention it...

Tho I can and _do_ say I've been happy with autodefrag here, and ensure 
it's enabled on everything, generally on first mount.  But again, my 
particular use-case doesn't deal with snapshots or reflinking in general, 
neither does it have these large random-rewrite-pattern files, so I'd be 
unlikely to see the effects of reflink-awareness, or lack thereof, in my 
own autodefrag usage, however much I might otherwise endorse it in 
general.

> So, I'm afraid O:-), that triggers a follow-up question:
> Why isn't it the default? Or in other words what are its drawbacks (e.g.
> other cases where ref-links would be broken up,... or issues with
> compression)?

The biggest downside of autodefrag is its performance on large (generally 
noticeable at between half a gig and a gig) random-rewrite-pattern files 
in actively-being-rewritten use.  For all other cases it's generally 
recommended, but that's why it's not the default.

And the problem there is simply that at some point the files get large 
enough that the defragging rewrites take longer than the time between 
those random updates, so the defragging rewrites become the bottleneck.  
As long as that's not occurring, either because the file is small enough, 
or because the backing device is SSD and/or simply fast enough, or 
because the updates are coming in slow enough to allow the file to be 
rewritten between them (the VM or DB using the file isn't in heavy enough 
use to trigger the problem), autodefrag works fine.

Meanwhile, there remain some tweaks they think they can do to autodefrag, 
that in theory should help eliminate this issue or at least move the 
bottlenecking to say 10 gig instead of 1 gig, but again, there's way more 
improvements to be made at this point than devs working on making them, 
so this improvement, as many others, simply has to wait its turn.  
However, this one's at least intermediate priority, so I'd put it at 
anywhere from two months to perhaps three years out.  It's unlikely to be 
beyond the 5 year mark, as some features on the wishlist almost certainly 
are.

> And also, when I now activate it on an already populated fs, will it
> defrag also any old files (even if they're not rewritten or so)?
> I tried to have a look for some general (rather "for dummies" than for
> core developers) description of how defrag and autodefrag work... but
> couldn't find anything in the usual places... :-(

AFAIK autodefrag only queues up the defrag when it detects fragmentation 
beyond some threshold, and it only checks and thus only detects at file 
(re)write.

Additionally, on a filesystem that hasn't had autodefrag on from the 
beginning, fragmentation is likely to be high enough that defrag, either 
auto or manual, won't be able to defrag to ideal levels, and 
fragmentation is thus likely to remain high for some time.

Further, when a filesystem is highly fragmented and autodefrag is first 
turned on, often it actually rather negatively affects performance for a 
few days, because so many files are so fragmented that it's queuing up 
defrags for nearly everything written.

So really, the ideal is having autodefrag on from the beginning, which is 
why I generally ensure it's on from the very first mount, or at least 
before I actually start putting files in the filesystem, here.  (Normally 
I'll create the filesystem including the label, and create the fstab 
entry for it referencing that label that includes autodefrag, at very 
nearly the same time, sometimes creating the fstab entry first since I do 
use the label, not the UUID.  Then I mount it using that fstab entry, so 
yes, it /does/ have autodefrag enabled from the very first mount. =:^)
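For the record, that first-mount setup is roughly the following (label,
device and extra options purely illustrative):

    mkfs.btrfs -L home /dev/sda2
    # /etc/fstab entry referencing the label, so autodefrag applies from
    # the very first mount:
    #   LABEL=home  /home  btrfs  autodefrag,compress=lzo,noatime  0  0
    mount /home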

Of course this might be reason enough to verify your backups one more 
time, blow away the filesystem with a brand new mkfs.btrfs, create that 
fstab entry with autodefrag included, mount, and restore from backups.  
This even gives you a chance to activate newer btrfs features like 16 KiB 
node size by default, if your filesystem is old enough to have been 
created before they were available, or before they were the default. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
  2015-12-09  5:45       ` Christoph Anton Mitterer
@ 2015-12-09 16:36         ` Duncan
  2015-12-16 21:59           ` Christoph Anton Mitterer
  0 siblings, 1 reply; 48+ messages in thread
From: Duncan @ 2015-12-09 16:36 UTC (permalink / raw)
  To: linux-btrfs

Christoph Anton Mitterer posted on Wed, 09 Dec 2015 06:45:47 +0100 as
excerpted:

> On 2015-11-27 00:08, Duncan wrote:
>> Christoph Anton Mitterer posted on Thu, 26 Nov 2015 01:23:59 +0100 as
>> excerpted:
>>> 1) AFAIU, the fragmentation problem exists especially for those files
>>> that see many random writes, especially, but not limited to, big
>>> files. Now that databases and VMs are affected by this, is probably
>>> broadly known in the meantime (well at least by people on that list).
>>> But I'd guess there are n other cases where such IO patterns can
>>> happen which one simply never notices, while the btrfs continues to
>>> degrade.
>> 
>> The two other known cases are:
>> 
>> 1) Bittorrent download files, where the full file size is preallocated
>> (and I think fsynced), then the torrent client downloads into it a
>> chunk at a time.

> Okay, sounds obvious.
> 
>> The more general case would be any time a file of some size is
>> preallocated and then written into more or less randomly, the problem
>> being the preallocation, which on traditional rewrite-in-place
>> filesystems helps avoid fragmentation (as well as ensuring space to
>> save the full file), but on COW-based filesystems like btrfs, triggers
>> exactly the fragmentation it was trying to avoid.

> Is it really just the case when the file storage *is* actually fully
> pre-allocated?
> Cause that wouldn't (necessarily) be the case for e.g. VM images (e.g.
> qcow2, or raw images when these are sparse files).
> Or is it rather any case where, in a larger file, many random (file
> internal) writes occur?

It's the second case, or rather, the reverse of the first case, since 
preallocation and fsync, then write into it, is one specific subset case 
of the broader case of random rewrites into existing files.  VM images 
and database files are two other specific subset cases of the same 
broader case superset.

>> arranging to have the client write into a dir with the nocow attribute
>> set, so newly created torrent files inherit it and do rewrite-in-place,
>> is highly recommended.

> At the IMHO pretty high expense of losing the checksumming :-(
> Basically losing half of the main functionalities that make btrfs
> interesting for me.

But... as I've pointed out in other replies, in many cases including this 
specific one (bittorrent), applications have already had to develop their 
own integrity management features, because other filesystems didn't 
supply them and the apps simply didn't work reliably without those 
features.

In the bittorrent case specifically, torrent chunks are already 
checksummed, and if they don't verify upon download, the chunk is thrown 
away and redownloaded.

And after the download is complete and the file isn't being constantly 
rewritten, it's perfectly fine to copy it elsewhere, into a dir where 
nocow doesn't apply.  With the copy, btrfs will create checksums, and if 
you're paranoid you can hashcheck the original nocow copy against the new 
checksummed/cow copy, and after that, any on-media changes will be caught 
by the normal checksum verification mechanisms.

Further, at least some bittorrent clients make preallocation an option.  
Here, on btrfs I'd simply turn off that option, rather than bothering 
with nocow in the first place.  That should already reduce fragmentation 
significantly due to the 30-second by default commit frequency, tho there 
will likely still be some fragmentation due to the out-of-order 
downloading.  But either autodefrag or the previously mentioned post-
download recopy should deal with that.
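As a sketch (file names purely illustrative), the post-download recopy
plus paranoia check would be something like:

    # copy from the nocow download dir into a normal cow directory, so
    # the new copy gets btrfs checksums from here on
    cp ~/torrents/incoming/big.iso ~/archive/big.iso
    sync
    # optional paranoia: compare the nocow original against the cow copy
    sha256sum ~/torrents/incoming/big.iso ~/archive/big.iso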

> For databases, will e.g. the vacuuming maintenance tasks solve the
> fragmentation issues (cause I guess at least when doing full vacuuming,
> it will rewrite the files)?

If it does full rewrite, it should, provided the freespace itself isn't 
so fragmented that it's impossible to find sufficiently large extents to 
avoid fragmentation.

Of course there's also autodefrag, if the database isn't so busy and/or 
the database files are small enough that the defragging rewrites don't 
trigger bottlenecking, the primary downside risk with autodefrag.

>> The problem is much reduced in newer systemd, which is btrfs aware and
>> in fact uses btrfs-specific features such as subvolumes in a number of
>> cases (creating subvolumes rather than directories where it makes sense
>> in some shipped tmpfiles.d config files, for instance), if it's running
>> on btrfs.

> Hmm doesn't seem really good to me if systemd would do that, cause it
> then excludes any such files from being snapshotted.

Of course if the directories are already present due to systemd upgrading 
from non-btrfs-aware versions, they'll remain as normal dirs, not 
subvolumes.  This is the case here.

And of course you can switch them around to dirs if you like, and/or 
override the shipped tmpfiles.d config with your own.

Meanwhile, distros that both ship systemd and offer btrfs as a filesystem 
option (or use it by default), should integrate this setting much as they 
would any other, patching the upstream version in their own packages if 
it's not a reasonable option for their distro.  So for the general case 
of people just using btrfs and systemd because that's what their distro 
does, it should just work, and to the degree that it doesn't, it's a 
distro-level bug, just as it'd be for any other distro-integration bug.

>> For the journal, I /think/ (see the next paragraph) that it now sets
>> the journal files nocow, and puts them in a dedicated subvolume so
>> snapshots of the parent won't snapshot the journals, thereby helping to
>> avoid the snapshot-triggered cow1 issue.

> The same here, kinda disturbing if systemd would decide that on its
> own, i.e. excluding files from being checksum protected...

... With the same answer.  In the normal distro case, to the degree that 
the integration doesn't work, it's a distro integration issue.

But also, again, systemd provides its own journal file integrity 
management, meaning there's less reason for btrfs to do so as well, and 
the lack of btrfs checksumming on nocow files doesn't matter so much.

So the systemd settings are actually quite sane, and again, to the degree 
that the distro does things differently for their own integration 
purposes, any bugs resulting from such are distro integration bugs, not 
upstream bugs.

Meanwhile, those not using distros to manage such things (or on distros 
such as gentoo, where by design, far more decisions of that nature are 
left to the admin or local policy of the system it's deployed on) should 
by definition be advanced enough to do the research and make their own 
decisions, since that's precisely what they're choosing to do by straying 
from the distro-level integration policy.

>>> So is there any general approach towards this?

>> The general case is that for normal desktop users, it doesn't tend to
>> be a problem, as they don't do either large VMs or large databases,

> Well depends a bit on how one defines the "normal desktop user",... for
> e.g. developers or more "power users" it's probably not so unlikely that
> they do run local VMs for testing or whatever.

Well yes, but that's devs and power users, who by definition are advanced 
enough to do the research necessary and make the appropriate decisions.

The normal desktop user, referred to by some as luser (local user, but 
with the obvious connotation)... generally tends to run their web browser 
and their apps of choice and games... and doesn't want to be bothered 
with details of this nature that the distro should be managing for them 
-- after all, that's what a distro /does/.

>> and small ones such as the sqlite files generated by firefox and
>> various email clients are handled quite well by autodefrag, with that
>> general desktop usage being its primary target.
> Which is however not yet the default...

Distro integration bug! =:^)

> It feels a bit as if there should be some tools provided by btrfs, which
> tell the users which files are likely problematic and should be
> nodatacow'ed

And there very well might be such a tool... five or ten years down the 
road when btrfs is much more mature and generally stabilized, well beyond 
the "still maturing and stabilizing" status of the moment.

>>> And what are the actual possible consequences? Is it just that fs gets
>>> slower (due to the fragmentation) or may I even run into other issues
>>> to the point the space is eaten up or the fs becomes basically
>>> unusable?

>> It's primarily a performance issue, tho in severe cases it can also be
>> a scaling issue, to the point that maintenance tasks such as balance
>> take much longer than they should and can become impractical to run

> hmm so it could in principle also affect other files and not just the
> fragmented ones, right?!

Not really, except that general btrfs maintenance like balance and check 
takes far longer than it otherwise would.

But it can be the case that as filesystem fragmentation levels rise, free-
space itself is fragmented, to the point where files that would otherwise 
not be fragmented as they're created once and never touched again, end up 
fragmented, because there's simply no free-space extents big enough to 
create them in unfragmented, so a bunch of smaller free-space extents 
must be used where one larger one would have been used had it existed.

In that regard, yes, it can affect other files, but it affects them by 
fragmentation, so no, it doesn't affect unfragmented files... to the 
extent that there are any unfragmented files left.

> Are there any problems caused by all this with respect to free space
> fragmentation? And what exactly are the consequences of free space
> fragmentation? ;)

I must have intuited the question as I just answered it, above! =:^)

>> But even without snapshot awareness, with an appropriate program of
>> snapshot thinning (ideally no more than 250-ish snapshots per
>> subvolume, which easily covers a year's worth of snapshots even
>> starting at something like half-hourly, if they're thinned properly as
>> well; 250 per subvolume lets you cover 8 subvolumes with a 2000
>> snapshot total, a reasonable cap that doesn't trigger severe scaling
>> issues) defrag shouldn't be /too/ bad.
>>
>> Most files aren't actually modified that much, so the number of
>> defrag-triggered copies wouldn't be that high.
> Hmm I thought that would only depend on how badly the files are
> fragmented when being snapshot.
> If I make a snapshot, while there are many fragments, and then defrag
> one of them, everything that gets defragmented would be rewritten,
> losing any ref-links, while files that aren't defragmented would retain
> them.

Yes, but I was talking about repeated defrag.  A single defrag should at 
most double the space usage of a file, if it unreflinks the entire thing.

But if the file is repeatedly modified and repeatedly snapshotted, and if 
autodefrag is /not/ snapshot aware, then worst-case is that every 
snapshot ends up being its own defragged fully un-reflinked copy, 
multiplying the space usage by the number of snapshots kept around!

By limiting the number of snapshots to 250, that already limits the space 
usage multiplication to 250 as well.  (While that may seem high, given 
that we've had people posting with tens or hundreds of thousands of 
snapshots, if autodefrag was breaking reflinks and they had it enabled... 
250X really is already relatively limited!)

But, as I said, most files don't actually get changed that much, so even 
assuming autodefrag isn't snapshot aware, that 250X worst-case is 
relatively unlikely.  In fact, many files are written once and never 
changed, in which case the autodefrag, if necessary at all, will happen 
shortly after write, and there will very likely be only the single copy.  
Others may have a handful, but only 2-10 copies, with more than that 
quite rare on most systems, so space usage will be nothing close to the 
250X worst-case scenario.  It may be bad, but it's strictly limited bad.

And of course that's assuming the worst case, that autodefrag is /not/ 
snapshot-aware.  If it is, then the problem effectively vaporizes 
entirely.

>> Autodefrag is recommended for, and indeed targeted at, general desktop
>> use, where internal-rewrite-pattern database, etc, files tend to be
>> relatively small, quarter to half gig at the largest.

> Hmm and what about mixed-use systems,... which have both, desktop and
> server like IO patterns?

Valid question.  And autodefrag, like most btrfs-specific mount options, 
remains filesystem-global at this point, too, so it's not like you can 
mount different subvolumes, some with autodefrag, some without (tho 
that's a planned future implementation detail).

But, at least personally, I tend to prefer separate filesystems, not 
subvols, in any case, primarily because I don't like having my data eggs 
all in the same filesystem basket and then watching its bottom drop out 
when I find it unmountable!

But the filesystem-global nature of autodefrag and similar mount options, 
tends to encourage the separate filesystem layout as well, as in that 
case you simply don't have to worry, because the server stuff is on its 
own separate btrfs where the autodefrag on the desktop btrfs can't 
interfere with it, as each separate filesystem can have its own mount 
options. =:^)

So that'd be /my/ preferred solution, but I can indeed see it being a 
problem for those users (or distros) that prefer one big filesystem with 
subvolumes, which some do, because then it's all in a single storage pool 
and thus easier to manage.

> btw: I think documentation (at least the manpage) doesn't tell whether
> btrfs defragment -c XX will work on files which aren't fragmented.

It implies it, but I don't believe it's explicit.

The implication is that defrag with the compress option is effectively a 
compress operation, in that it rewrites everything it's told to compress, 
of course defragging in the process due to the rewrite, but with the 
primary purpose being the compression, when used in that manner.

But, while true (one poster found that out the hard way, when his space 
usage doubled due to snapshot reflink breaking for EVERY file... when he 
expected it to go down due to the compression -- he obviously didn't 
think thru the fact that compression MUST be a rewrite, thereby breaking 
snapshot reflinks, even were normal non-compression defrag to be snapshot 
aware, because compression substantially changes the way the file is 
stored), that's _implied_, not explicit.  You are correct in that making 
it explicit would be clearer.
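To make the point concrete, something like the following (path
illustrative) rewrites, and thus recompresses, everything under the
target, breaking any snapshot reflinks to those extents in the process:

    # -r = recurse, -clzo = recompress with lzo; every rewritten extent
    # becomes a new, un-reflinked copy
    btrfs filesystem defragment -r -clzo /path/to/subvol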

> Phew... "clearly" may be rather something that differs from person to
> person.
> - A defrag that doesn't work due to scaling issues - well one can
> hopefully abort it and it's as if there simply was no defragmentation.
> - A defrag which breaks up the ref-links, may eat up vast amounts of
> storage that should not need to be "wasted" like this, and you'll never
> get the ref-links back (unless perhaps with dedup).

I addressed this in a reply a few hours ago to a different (I think) 
subthread.


>> I actually don't know what the effect of defrag, with or without
>> recompression, is on same-subvolume reflinks.  If I were to guess I'd
>> say it breaks them too, but I don't know.  If I needed to know I'd
>> probably test it to see... or ask.
> How would you find out? Somehow via space usage?

Yes.  Try it on a file that's large enough (a gig or so should do it 
nicely) to make a difference in the btrfs fi df listing.  Compare before 
and after listings.
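Untested here, but the sort of experiment I have in mind would look
roughly like this (sizes and paths illustrative; note defrag may simply
skip a file it doesn't consider fragmented, in which case nothing
changes):

    cd /mnt/test
    dd if=/dev/urandom of=orig bs=1M count=1024   # ~1 GiB, incompressible
    cp --reflink=always orig clone                # shares extents with orig
    sync; btrfs filesystem df /mnt/test           # note "Data ... used"
    btrfs filesystem defragment clone
    sync; btrfs filesystem df /mnt/test           # used grows by ~1 GiB if
                                                  # the reflinks were broken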

> However when one runs e.g. btrfs fi defrag /snapshots/ one would get n
> additional copies (one per snapshot), in the worst case.

Hmm... That would be a

Very.

Bad.

Idea!

>> and having to manually run a balance -dusage=0
> btw: shouldn't it do that particular one automatically from time to
> time? Or is that actually the case now, by what you mentioned further
> below around 3.17?

Yes -- an (effective) balance -dusage=0 is automatic now; of course it's 
all kernel side, the btrfs balance userspace tool isn't actually called.

>> So at some point, defrag will need at least partially rewritten to be
>> at least somewhat more greedy in its new data chunk allocation.

> Just wanted to ask why defrag doesn't simply allocate some bigger chunks
> of data in advance... ;)

It's possible that's actually how they'll fix it, when they do.

>> Meanwhile, I don't know that anybody has tried this yet, and with both
>> compression and autodefrag on here it's not easy for me to try it, but
>> in theory anyway, if defrag isn't working particularly well, it should
>> be possible to truncate-create a number of GiB-sized files, sync (or
>> fsync each one individually) so they're written out to storage, then
>> truncate each file down to a few bytes, something 0 < size < 4096 bytes
>> (or page size on archs where it's not 4096 by default), so they take
>> only a single block of that original 1 GiB allocation, and sync again.

> a) wouldn't truncate create a sparse file? And would btrfs then really
> allocate chunks for that (would sound quite strange to me), which I
> guess is your goal here?

As I said to my knowledge it hasn't been tried, but AFAIK, truncate, 
followed by sync (or fsync), doesn't do sparse.  I've seen it used for (I 
believe) similar purposes elsewhere, which is why I suggested its use 
here.

But obviously trying it would be the way to find out for sure.  There's a 
reason I added both the "hasn't been tried yet" and "in theory" 
qualifiers...

Of course if truncate doesn't work, catting from /dev/urandom should do 
the trick, as that should be neither sparse nor compressible.  
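Spelled out, the sequence would be roughly (again untested, in-theory
only, sizes illustrative):

    # create a few 1 GiB placeholder files and force them out to disk
    for i in 1 2 3 4; do truncate -s 1G pin.$i; done
    sync
    # shrink each to a single block so it keeps only a sliver of its
    # original allocation
    for i in 1 2 3 4; do truncate -s 4096 pin.$i; done
    sync
    # if truncate turns out to create sparse files after all, substitute
    # incompressible real data instead:
    #   dd if=/dev/urandom of=pin.1 bs=1M count=1024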
 
> b) How can one find out whether defragmentation worked well? I guess with
> filefrag in the compress=no case and not at all in any other?

I recently found out that filefrag -v actually lists the extent byte 
addresses, thus making it possible to check manually (or potentially via 
script) whether the 128-KiB compression blocks are contiguous or not.  
Contiguous would mean same extent, even if filefrag doesn't understand 
that yet.

But certainly, filefrag in the uncompressed case is exactly what I had in 
mind.
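I.e. something along these lines either way (file name illustrative):

    filefrag -v somefile
    # each row lists logical offset, physical offset and length; with
    # compression the rows come in 128 KiB pieces, and if the physical
    # offsets of consecutive rows are contiguous they're really part of
    # the same on-disk extent, even though filefrag counts them separately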

> Take the LHC Computing Grid for example,...we manage some 100 PiB,
> probably more in the meantime, in many research centres worldwide, much
> of that being on disk and at least some parts of it with no real backups
> anywhere. This may sound stupid, but in reality, one has funding
> constraints and many other reasons that may keep one from having
> everything twice.
> This should especially demonstrate that not everyone has e.g. twice his
> actually used storage just to move the data away, recreate the
> filesystems and move it back (not to talk about any larger downtimes
> that would result from that).

Yeah, the LHC is rather a special case.

Tho to be fair, were I managing data for them or that sort of data set 
where sheer size makes backups impractical, I'd probably be at least as 
conservative about btrfs usage as you're sounding, not necessarily in the 
specifics, but simply because while btrfs is indeed stabilizing, I 
haven't had any argument on this list against my oft stated opinion that 
it's not fully stable and mature yet, and won't be for some time.

As such, to date I'd be unlikely to consider btrfs at all for data where 
backups aren't feasible, unless it really is simply throw-away data 
(which from your description isn't the case there), and would be leery as 
well about btrfs usage where backups are available, but simply 
impractical to deal with, due to sheer size and data transfer time.

> For quite a while I was thinking about productively using btrfs at our
> local Tier-2 in Munich, but then decided against:

As should be apparent from the above, I basically agree.

I did want to mention that I enjoyed seeing your large-scale description, 
however, as well as your own reasoning for the decisions you have made.  
(Of course it's confirming my own opinion so I'm likely to enjoy it, but 
still...)

> Long story short... this is all fine when I just play around with my
> notebooks, or my few own servers,... in the worst case I start from
> scratch, restoring from a backup... but when dealing with more systems or
> those where downtime/failure is a much bigger problem, then I think
> self-maintenance and documentation need to get better (especially for
> normal admins, and believe me, not every admin is willing to dig into
> the details of btrfs and understand "all" the circumstances of
> fragmentation or issues with datacow/nodatacow).

Absolutely, positively, agreed!  There's certainly a place for btrfs at 
its current stability level, but production level on that size of a 
system, really isn't it, unless perhaps you have the resources to do what 
facebook has done and hire Chris Mason. =:^)  (And even there, from what 
I've read, they have reasonably large test deployments and we do 
regularly see patches fixing problems they've found, but I'm not sure 
they're using it on their primary production, yet, tho they may be.)

>> But in terms of your question, the only things I do somewhat regularly
>> are an occasional scrub (with btrfs raid1 precisely so I /do/ have a
>> second copy available if one or the other fails checksum), and mostly
>> because it's habit from before the automatic empty chunk delete code
>> and my btrfs are all relatively small so the room for error is
>> accordingly smaller, keeping an eye on the combination of btrfs fi sh
>> and btrfs fi df,
>> to see if I need to run a filtered balance.

> Speaking of which:
> Is there somewhere some good documentation of what exactly all these numbers
> of show, df, usage and so on tell?

It's certainly in quite a few on-list posts over the years, but now that 
you mention it, I don't believe it's in the wiki or manpages.

I'm starting to go droopy so won't attempt to repeat it in this post, but 
may well do it in a followup, particularly if you ask about it again.

>> Other than that, it's the usual simply keeping up with the backups
> Well but AFAIU it's much more, which I'd count towards maintenance:
> - enabling autodefrag
> - fighting fragmentation (by manually using svols with nodatacow in
>   those cases where necessary, which first need to be determined)
> - enabling noatime, especially when doing snapshots
> - sometimes (still?) the necessity to run balance to reorder block
>   groups,.. okay you said that empty ones are now automatically
>   reclaimed.

I agree with these, but I consider them pretty much one-shot, and thus 
didn't think about them in the context of what I took to be a question 
about routine, i.e. ongoing, maintenance.

Autodefrag I use everywhere, but for VM and DB usecases it'd take some 
research and likely testing.

General anti-fragmentation setup is IMO vital, but one-shot, particularly 
the research, which once done, becomes a part of one's personal knowledge 
base.

Noatime I've been setting for a decade now, since I saw it suggested in 
the reiserfs docs when I was first setting that up, so that's as second-
nature to me now as using mount to mount a filesystem... and using the 
mount and fstab manpages to figure out configuration.  I'd suggest that 
by now, any admin worth their salt should similarly be enabling it on 
principle by default, or be able to explain why not (mutt in the mode 
that needs it, for example) should they be asked.  So while I agree it's 
important, I'm not sure it should be on this list any more than, say, 
using mount should be, just because it /is/ routine.

Entirely empty block groups are now automatically reclaimed, correct, but 
just today I saw the first posting from someone who didn't realize that 
btrfs still doesn't automatically reclaim low-usage block groups, say 
under 10% but not 0, and that those can still get out of balance over 
time.  With the entirely empty ones reclaimed, though, it does actually 
take longer to reach that ENOSPC due to lack of unallocated chunks than 
it used to.

So balance can still be necessary, but if it was necessary every month 
before, perhaps every six months to a year is a reasonable balance target 
now.
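When it is needed, a filtered balance of the mostly-empty chunks is
usually enough (threshold and mountpoint illustrative):

    # rewrite only data/metadata chunks that are less than 10% used,
    # returning the freed chunks to the unallocated pool
    btrfs balance start -dusage=10 -musage=10 /mountpoint
    btrfs filesystem df /mountpoint    # check the result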

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: btrfs: poor performance on deleting many large files
  2015-11-28  6:49                 ` Duncan
@ 2015-12-12 22:15                   ` Christoph Anton Mitterer
  2015-12-13  7:10                     ` Duncan
  2015-12-14 14:24                     ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 48+ messages in thread
From: Christoph Anton Mitterer @ 2015-12-12 22:15 UTC (permalink / raw)
  To: Duncan, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1401 bytes --]

On Sat, 2015-11-28 at 06:49 +0000, Duncan wrote:
> Christoph Anton Mitterer posted on Sat, 28 Nov 2015 04:57:05 +0100 as
> excerpted:
> > Still, specifically for snapshots that's a bit unhandy, as one
> > typically
> > doesn't mount each of them... one rather mount e.g. the top level
> > subvol
> > and has a subdir snapshots there...
> > So perhaps the idea of having snapshots that are per se noatime is
> > still
> > not too bad.
> Read-only snapshots?
So you basically mean that ro snapshots won't have their atime updated
even without noatime?
Well I guess that was anyway the recent behaviour of Linux filesystems,
and only very old UNIX systems updated the atime even when the fs was
set ro.

> That'd do it, and of course you can toggle the read-
> only property (see btrfs property and its btrfs-property manpage).
Sure, but then it would still be nice for rw snapshots.

I guess what I probably actually want is the ability to set noatime as
a property.
I'll add that in a "feature request" on the project ideas wiki.

> Alternatively, mount the toplevel subvol read-only or noatime on one 
> mountpoint, and bind-mount it read-write or whatever other
> appropriate 
Well it's of course somehow possible... but that seems a bit ugly to
me... the best IMHO, would really be if one could set a property on
snapshots that marks them noatime.


Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5313 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: btrfs: poor performance on deleting many large files
  2015-12-12 22:15                   ` Christoph Anton Mitterer
@ 2015-12-13  7:10                     ` Duncan
  2015-12-16 22:14                       ` Christoph Anton Mitterer
  2015-12-14 14:24                     ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 48+ messages in thread
From: Duncan @ 2015-12-13  7:10 UTC (permalink / raw)
  To: linux-btrfs

Christoph Anton Mitterer posted on Sat, 12 Dec 2015 23:15:38 +0100 as
excerpted:

> On Sat, 2015-11-28 at 06:49 +0000, Duncan wrote:
>> Christoph Anton Mitterer posted on Sat, 28 Nov 2015 04:57:05 +0100 as
>> excerpted:
>> > Still, specifically for snapshots that's a bit unhandy, as one
>> > typically doesn't mount each of them... one rather mount e.g. the top
>> > level subvol and has a subdir snapshots there...
>> > So perhaps the idea of having snapshots that are per se noatime is
>> > still not too bad.
>> Read-only snapshots?
> So you basically mean that ro snapshots won't have their atime updated
> even without noatime?
> Well I guess that was anyway the recent behaviour of Linux filesystems,
> and only very old UNIX systems updated the atime even when the fs was
> set ro.

I'd test it to be sure before relying on it (keeping in mind that my own 
use-case doesn't include subvolumes/snapshots so it's quite possible I 
could get fine details of this nature wrong), but that would be my very 
(_very_! see next) strong assumption, yes.

Because read-only snapshots are used for btrfs-send among other things, 
with the idea being that the read-only will keep them from changing in 
the middle of the send, and ro snapshot atime updates would seem to throw 
that entirely out the window.  So I can't imagine ro snapshots doing atime 
updates under any circumstance because I just can't see how send could 
rely on them then, but I'd still test it before counting on it.

>> That'd do it, and of course you can toggle the read-
>> only property (see btrfs property and its btrfs-property manpage).
> Sure, but then it would still be nice for rw snapshots.
> 
> I guess what I probably actually want is the ability to set noatime as a
> property.
> I'll add that in a "feature request" on the project ideas wiki.

AFAIK, the general idea was to eventually have all the (where possible; some 
are global-filesystem-scope) subvolume mount options exposed as 
properties, it's just not implemented yet, but I'm not entirely sure if 
that was all /btrfs-specific/ mount options, or included the generic ones 
such as the *atime and no* (noexec/nodev/...) options as well.  In view 
of that and the fact that noatime is generic, adding it as a specific 
request still makes sense.  Someone with more specific knowledge on the 
current plan can remove it if it's already covered.
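For reference, what's already exposed today is quite limited; e.g. the ro
flag on a snapshot (path purely illustrative):

    btrfs property get -ts /snapshots/2015-12-12 ro
    btrfs property set -ts /snapshots/2015-12-12 ro false   # make it writable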

>> Alternatively, mount the toplevel subvol read-only or noatime on one
>> mountpoint, and bind-mount it read-write or whatever other appropriate
> Well it's of course somehow possible... but that seems a bit ugly to
> me... the best IMHO, would really be if one could set a property on
> snapshots that marks them noatime.

Yes.  Possible is good, but "just works", as one would hope the 
properties solution to eventually be, is still better than "possible by 
jumping thru mount-bind hoops", the current "possibility method". =:^)
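For anyone wanting the hoops spelled out, the current workaround looks
roughly like this (device and paths are placeholders):

    # mount the top-level subvolume (id 5) with noatime in one place...
    mount -o subvolid=5,noatime /dev/sdX /mnt/pool
    # ...then bind-mount the pieces actually used elsewhere (if the bind
    # mount doesn't pick up noatime by itself, a remount sets it):
    mount --bind /mnt/pool/home /home
    mount -o remount,bind,noatime /home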

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
  2015-11-26  0:33     ` Hugo Mills
  2015-12-09  5:43       ` Christoph Anton Mitterer
@ 2015-12-14  1:44       ` Christoph Anton Mitterer
  2015-12-14 10:51         ` Duncan
  1 sibling, 1 reply; 48+ messages in thread
From: Christoph Anton Mitterer @ 2015-12-14  1:44 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Duncan, Hugo Mills

[-- Attachment #1: Type: text/plain, Size: 1467 bytes --]

Two more on these:

On Thu, 2015-11-26 at 00:33 +0000, Hugo Mills wrote:
> 3) When I would actually disable datacow for e.g. a subvolume that
> > holds VMs or DBs... what are all the implications?
> > Obviously no checksumming, but what happens if I snapshot such a
> > subvolume or if I send/receive it?
>    After snapshotting, modifications are CoWed precisely once, and
> then it reverts to nodatacow again. This means that making a snapshot
> of a nodatacow object will cause it to fragment as writes are made to
> it.
AFAIU, the one that gets fragmented then is the snapshot, right, and
the "original" will stay in place where it was? (Which is of course
good, because one probably marked it nodatacow, to avoid that
fragmentation problem on internal writes).

I'd assume the same happens when I do a reflink cp.

Can one make a copy, where one still has atomicity (which I guess
implies CoW) but where the destination file isn't heavily fragmented
afterwards,... i.e. there's some pre-allocation, and then cp really
does copy each block (just everything's at the state of the time when I
started cp, not including any other internal changes made on the source
in between).


And one more:
You both said, auto-defrag is generally recommended.
Does that also apply for SSDs (where we want to avoid unnecessary
writes)?
It does seem to get enabled, when SSD mode is detected.
What would it actually do on an SSD?


Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5313 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
  2015-12-09 13:36         ` Duncan
@ 2015-12-14  2:46           ` Christoph Anton Mitterer
  2015-12-14 11:19             ` Duncan
  2015-12-16 23:39           ` Kai Krakow
  1 sibling, 1 reply; 48+ messages in thread
From: Christoph Anton Mitterer @ 2015-12-14  2:46 UTC (permalink / raw)
  To: Duncan, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 9322 bytes --]

On Wed, 2015-12-09 at 13:36 +0000, Duncan wrote:
> Answering the BTW first, not to my knowledge, and I'd be
> skeptical.  In 
> general, btrfs is cowed, and that's the focus.  To the extent that
> nocow 
> is necessary for fragmentation/performance reasons, etc, the idea is
> to 
> try to make cow work better in those cases, for example by working on
> autodefrag to make it better at handling large files without the
> scaling 
> issues it currently has above half a gig or so, and thus to confine
> nocow 
> to a smaller and smaller niche use-case, rather than focusing on
> making 
> nocow better.
> Of course it remains to be seen how much better they can do with 
> autodefrag, etc, but at this point, there's way more project 
> possibilities than people to develop them, so even if they do find
> they 
> can't make cow work much better for these cases, actually working on
> nocow 
> would still be rather far down the list, because there's so many
> other 
> improvement and feature opportunities that will get the focus
> first.  
> Which in practice probably puts it in "it'd be nice, but it's low
> enough 
> priority that we're talking five years out or more, unless of course 
> someone else qualified steps up and that's their personal itch they
> want 
> to scratch", territory.
I guess I'll split out my answer on that, in a fresh thread about
checksums for nodatacow later, hoping to attract some more devs there
:-)

I think however, again with my naive understanding of how CoW works and
what it inherently implies, that there cannot be a really good solution
to the fragmentation problem for DB/etc. files.

And as such, I'd think that having a checksumming feature for
nodatacow as well, even if it's not perfect, is definitely worth it.


> As for the updated checksum after modification, the problem with that
> is 
> that in the mean time, the checksum wouldn't verify,
Well, one could either implement some locking,... but I don't see the
general problem here... if the block is still being written (and I
count updating the metadata, including the checksum, as part of that)
it cannot be read anyway, can it? It may be only half written and the
data returned would be garbage.


>  and while btrfs 
> could of course keep status in memory during normal operations,
> that's 
> not the problem, the problem is what happens if there's a crash and
> in-
> memory state vaporizes.  In that case, when btrfs remounted, it'd
> have no 
> way of knowing why the checksum didn't match, just that it didn't,
> and 
> would then refuse access to that block in the file, because for all
> it 
> knows, it /is/ a block error.
And this would only happen in the rare cases that anything crashes,
where it's anyway quite likely that this no-CoWed block will be
garbage.
I'll talk about that more in the separate thread... so let's move
things there.


> Same here.  In fact, my most anticipated feature is N-way-mirroring, 
Hmm ... not totally sure about that...
AFAIU, N-way-mirroring is the generalisation of what's currently
(wrongly) called RAID1 in btrfs, i.e. having N replicas of everything
on M devices, right?
In other words, not being an N-parity-RAID and not guaranteeing that
*any* N disks could fail, right?

Hmm I guess that would be definitely nice to have, especially since
then we could have true RAID1, i.e. N=M.

But it's probably rather important for those scenarios, where either
resilience matters a lot... and/or   those where write speed doesn't
but read speed does, right?

Taking the example of our use case at the university, i.e. the LHC
Tier-2 we run,... that would rather be uninteresting.
We typically have storage nodes (and many of them) of say 16-24
devices, and based on funding constraints, resilience concerns and IO
performance, we place them in RAID6 (yeah i know, RAID5 is faster, but
even with hotspares in place, practise lead too often to lost RAIDs).

Especially for the bigger nodes, with more disks, we'd rather have a N-
parity RAID, where any N disks can fail)... of course performance
considerations may kill that desire again ;)


> It is a big and basic feature, but turning it off isn't the end of
> the 
> world, because then it's still the same level of reliability other 
> solutions such as raid generally provide.
Sure... I never meant it as "loss to what we already have in other
systems"... but as "loss compared to how awesome[0] btrfs could be ;-)"


> But as it happens, both VM image management and databases tend to
> come 
> with their own integrity management, in part precisely because the 
> filesystem could never provide that sort of service.
Well that's only partially true, to my knowledge.
a) I wouldn't know that hypervisors do that at all.
b) DBs have of course their journal, but that protects only against
crashes,... not against bad blocks nor does it help you to decide which
block is good when you have multiple copies.


> After all, you can always decide not to run it if you're worried
> about the space effects it's going to have
Hmm well,... and the manpage actually mentions that it blows up when
snapshots are used... at least in some technical language...
So,... you're possibly right here,... though I guess many may just do
btrfs filesystem --help, which says nothing about the possibly grave
effects of defrag.


> But even at that point, while snapshot-aware-defrag is still on the
> list,  I'm not sure if it's ever going to be actually viable.  It may
> be that the scaling issues are just too big, and it simply can't be
> made to work  both correctly and in anything approaching practical
> time.
Well, I shall hope not
:)


> Worst-case, you  set nocow and turn off snapshotting, but that's 
> exactly the situation
> you're in anyway with other filesystems, so you're no worse off than
> if you were using them.
> Meanwhile, where those btrfs features *can* be used, which is on
> /most/ 
> files, with only limited exceptions, it's all upside! =:^)
Sure :D ... but that doesn't mean we shouldn't try to maximise the
upside cases where possible :-)


> FWIW, I've seen it asserted that autodefrag is snapshot aware a few
> times 
> now, but I'm not personally sure that is the case and I don't see any
> immediately obvious reason it would be, when (manual) defrag isn't,
> so 
> I've refrained from making that claim, myself.  If I were to see
> multiple 
> devs make that assertion, I'd be more confident, but I believe I've
> only 
> seen it from Hugo, and while I trust him in general because in
> general 
> what he says makes sense, here, as I said, it just doesn't make
> immediate 
> sense to me that the two would be so different
Yes, that was my concern as well...


> The biggest downside of autodefrag is its performance on large
> (generally 
> noticeable at between half a gig and a gig) random-rewrite-pattern
> files 
> in actively-being-rewritten use.  For all other cases it's generally 
> recommended, but that's why it's not the default.
Hmm that makes it a bit difficult to use when you have mixed use cases.
Can't they just add a feature that allows one to select up to which
file sizes autodefrag kicks in?

Interestingly, I've enabled it now, and as I've mentioned before I run
several VMs on that machine (which has an SSD), so far intentionally
without setting nodatacow... however, so far I don't see any aggressive
rewriting, though admittedly I wouldn't know how to properly tell
whether autodefrag was doing heavy IO or not; it doesn't seem to show up
as a kernel thread.

> AFAIK autodefrag only queues up the defrag when it detects
> fragmentation 
> beyond some threshold, and it only checks and thus only detects at
> file 
> (re)write.
Sounds reasonable... especially I wouldn't want the situation in which
it basically constantly rewrites files, just because of a few fragments.

Another case, however, could be more tricky to detect: files which
continuously and quickly fragment as a whole or at least in parts.
AFAIU, it would basically not make any sense to try any defrag on such
files (because they'd quickly fragment again).

Also it would be nice to have some knobs to control in more detail how
much IO is spent on autodefrag, perhaps even on a per-filesystem basis or
at even finer granularity.


> Further, when a filesystem is highly fragmented and autodefrag is
> first 
> turned on, often it actually rather negatively affects performance
> for a 
> few days, because so many files are so fragmented that it's queuing
> up 
> defrags for nearly everything written.
I've read that advice of yours before... so you basically think it
would also queue up files that were already fragmented, even when they
haven't been written to since it was turned on.

Interestingly, I turned it on just a few days ago, and so far I haven't
seen much disk activity that would point to autodefrag.



Thanks again for your time and answers :)
Chris.


[0] In Germany there's the term "eierlegende Wollmilchsau"; it
basically describes a pig that gives milk, eggs and wool... perhaps
one could translate it as "jack-of-all-trades device".
(No, I don't want btrfs to include a web browser and PDF reader ;) )

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5313 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
  2015-12-14  1:44       ` Christoph Anton Mitterer
@ 2015-12-14 10:51         ` Duncan
  2015-12-16 23:55           ` Christoph Anton Mitterer
  0 siblings, 1 reply; 48+ messages in thread
From: Duncan @ 2015-12-14 10:51 UTC (permalink / raw)
  To: linux-btrfs

Christoph Anton Mitterer posted on Mon, 14 Dec 2015 02:44:55 +0100 as
excerpted:

> Two more on these:
> 
> On Thu, 2015-11-26 at 00:33 +0000, Hugo Mills wrote:
>> 3) When I would actually disable datacow for e.g. a subvolume that
>> > holds VMs or DBs... what are all the implications?

>> After snapshotting, modifications are CoWed precisely once, and
>> then it reverts to nodatacow again. This means that making a snapshot
>> of a nodatacow object will cause it to fragment as writes are made to
>> it.

> AFAIU, the one that gets fragmented then is the snapshot, right, and the
> "original" will stay in place where it was? (Which is of course good,
> because one probably marked it nodatacow to avoid that fragmentation
> problem on internal writes.)

No.  Or more precisely, keep in mind that from btrfs' perspective, in 
terms of reflinks, once made, there's no "original" in terms of special 
treatment, all references to the extent are treated the same.

What a snapshot actually does is create another reference (reflink) to an 
extent.  What btrfs normally does on change as a cow-based filesystem is 
of course copy-on-write the change.  What nocow does, in the absence of 
other references to that extent, is rewrite the change in-place.

But if there's another reference to that extent, the change can't be in-
place because that would change the file reached by that other reference 
as well, and the change was only to be made to one of them.  So in the 
case of nocow, a cow1 (one-time-cow) exception must be made, rewriting 
the changed data to a new location, as the old location continues to be 
referenced by at least one other reflink.

So (with the fact that writable snapshots are available and thus it can 
be the snapshot that changed if it's what was written to) the one that 
gets the changed fragment written elsewhere, thus getting fragmented, is 
the one that changed, whether that's the working copy or the snapshot of 
that working copy.

> I'd assume the same happens when I do a reflink cp.

Yes.  It's the same reflinking mechanism, after all.  If there's other 
reflinks to the extent, snapshot or otherwise, changes must be written 
elsewhere, even if they'd otherwise be nocow.
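
To make that concrete, here's a rough sketch one could run to watch it happen
(paths and sizes are purely illustrative; note that chattr +C only takes
proper effect on a file that's still empty):

  btrfs subvolume create /mnt/vmdir
  touch /mnt/vmdir/disk.img
  chattr +C /mnt/vmdir/disk.img                  # nodatacow
  dd if=/dev/urandom of=/mnt/vmdir/disk.img bs=1M count=512 conv=notrunc
  sync
  filefrag /mnt/vmdir/disk.img                   # typically few extents

  btrfs subvolume snapshot /mnt/vmdir /mnt/vmdir-snap
  dd if=/dev/urandom of=/mnt/vmdir/disk.img bs=4k count=64 seek=10000 \
     conv=notrunc
  sync
  filefrag /mnt/vmdir/disk.img        # the rewritten range was cow1ed to a
                                      # new location, so the changed copy
                                      # gains extents
  filefrag /mnt/vmdir-snap/disk.img   # the untouched reference keeps its
                                      # old layout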

> Can one make a copy where one still has atomicity (which I guess
> implies CoW) but where the destination file isn't heavily fragmented
> afterwards... i.e. there's some pre-allocation, and then cp really does
> copy each block (just with everything at the state of the time when I
> started cp, not including any other internal changes made on the source
> in between)?

The way that's handled is via ro snapshots which are then copied, which 
of course is what btrfs send does (at least in non-incremental mode, and 
incremental mode still uses the ro snapshot part to get atomicity), in 
effect.
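
In shell terms, that pattern amounts to something like this (a sketch; names
are illustrative):

  # atomic, point-in-time copy of a subvolume's current state
  btrfs subvolume snapshot -r /mnt/data /mnt/data.snap
  cp -a --reflink=never /mnt/data.snap/bigfile /mnt/archive/
  # or, to another btrfs, which is what send/receive does for you:
  btrfs send /mnt/data.snap | btrfs receive /mnt/backup
  btrfs subvolume delete /mnt/data.snap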

> And one more:
> You both said, auto-defrag is generally recommended.
> Does that also apply for SSDs (where we want to avoid unnecessary
> writes)?
> It does seem to get enabled, when SSD mode is detected.
> What would it actually do on an SSD?

Did you mean it does _not_ seem to get (automatically) enabled, when SSD 
mode is detected, or that it _does_ seem to get enabled, when 
specifically included in the mount options, even on SSDs?

Or did you actually mean it the way you wrote it, that it seems to be 
enabled (implying automatically, along with ssd), when ssd mode is 
detected?

Because the latter would be a shock to me, as that behavior hasn't been 
documented anywhere, but I can't imagine it's actually doing it and that 
you actually meant what you actually wrote.


If you look waaayyy back to shortly before I did my first more or less 
permanent deployment (I had initially posted some questions and did an 
initial experimental deployment several months earlier, but it didn't 
last long, because $reasons), you'll see a post I made to the list with 
pretty much the same general question, autodefrag on ssd, or not.

I believe the most accurate short answer is that the benefit of 
autodefrag on SSD is fuzzy, and thus left to local choice/policy, without 
an official recommendation either way.

There are two points that we know for certain: (1) the zero-seek-time of 
SSD effectively nullifies the biggest and most direct cost associated 
with fragmentation on spinning rust, thereby lessening the advantage of 
autodefrag as seen on spinning rust by an equally large degree, and (2) 
autodefrag will without question lead to a relatively limited number of 
near-time additional writes, as the rewrite is queued and eventually 
processed.

To the extent that an admin considers these undisputed factors alone, or 
weighs them less heavily than the more controversial factors below, 
they're likely to consider autodefrag on ssd a net negative and leave it 
off.

But I was persuaded by the discussion when I asked the question, to 
enable autodefrag on my all-ssd btrfs deployment here.  Why?  Those 
other, less direct and arguably less directly measurable factors (except 
possibly by actual detailed benchmarking or a/b deployment testing over 
long periods).

There are three factors I'm aware of here as well, all favoring 
autodefrag, just as the two above favored leaving it off.

1) IOPS, Input/Output Operations Per Second.  SSDs typically have both an 
IOPS and a throughput rating.  And unlike spinning rust, where raw non-
sequential-write IOPS are generally bottlenecked by seek times, on SSDs 
with their zero seek-times, IOPS can actually be the bottleneck.

Now I'm /far/ from a hardware storage device expert and thus may be badly 
misconstruing things here, but at least as I understand things, reading/
writing a single extent/fragment is typically issued as a single IO 
operation (to some maximum size), and particularly at the higher 
throughput speeds ssds commonly have and with their zero-seek-times, it's 
quite possible to bottleneck on the number of such operations, hitting 
the IOPS ceiling on either the device itself or its controller, if files 
are highly fragmented and/or there's multiple tasks doing IO to the same 
device at once.

Back when I first setup btrfs on my then new SSDs, I didn't know a whole 
lot about SSDs and this was my primary reason for choosing autodefrag; 
less fragmentation means larger IO operations so fewer of them are 
necessary to complete the data transfer, placing a lower stress on the 
device controllers and making it less likely to bottleneck on the IOPS 
limits.

2) SSD physical write and erase block sizes as multiples of the logical/
read block size.  To the extent that extent sizes are multiples of the 
write and/or erase-block size, writing larger extents will reduce write 
amplification due to writing blocks smaller than the write or erase 
block size.

While the initial autodefrag rewrite is a second-cycle write after a 
fragmented write, spending a write cycle for the autodefrag, consistent 
use of autodefrag should help keep file fragmentation and thus ultimately 
space fragmentation to a minimum, so initial writes, where there's enough 
data to write an initially large extent, won't be forced to be broken 
into smaller extents because there's simply no large free-space extents 
left due to space fragmentation.

IOW, autodefrag used consistently should reduce space fragmentation as 
well as file fragmentation, and this reduced space fragmentation will 
lead to the possibility of writing larger extents initially, where the 
amount of data to be written allows it, thereby reducing initial file 
write fragmentation and the need for autodefrag as a result.

This one dawned on me somewhat later, after I understood a bit more about 
SSDs and write amplification due to physical write and erase block 
sizing. I was in the process of explaining (in the context of spinning 
rust) how autodefrag used consistently should help manage space 
fragmentation as well, when I suddenly realized the implications that had 
on SSDs as well, due to their larger physical write and erase block sizes.

3) Btrfs metadata management overhead.  While btrfs tracks things like 
checksums at fixed sizes, other metadata is per extent.  Obviously, the 
more extents a file has, the harder btrfs has to work to track them all.  
Maintenance tasks such as balance and check already have scaling issues; 
do we really want to make them worse by forcing them to track thousands 
or tens of thousands of extents per (large) file where they could be 
tracking a dozen or two?

Autodefrag helps keep the work btrfs itself has to do under control, and 
in some contexts, that alone can be worth any write-amplification costs.
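
A quick way to see the per-file extent count that this overhead scales with
(keeping in mind that on compressed files filefrag currently over-reports,
since each 128-KiB compression block shows up as its own extent):

  filefrag /path/to/large.file
  # e.g. "large.file: 14 extents found" on a well-behaved file, vs. tens of
  # thousands on a badly fragmented VM image or database file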


On balance, I was persuaded to use autodefrag on my own btrfs' on SSDs, 
and believe the near-term write-cycle damage may in fact be largely 
counteracted by indirect free-space defrag effect and the effect that in 
turn has on the ability to even find large areas of cohesive free space 
to write into in the first place.  With that largely counteracted, the 
other benefits in my mind again outweigh the negatives, so autodefrag 
continues to be worth it in general, even on SSDs.

But I can definitely see how someone could logically take the opposing 
position, and without someone actually doing either some pretty complex 
benchmarks or some longer term a/b testing where autodefrag's longer term 
effect on free space fragmentation can come into play, against just 
letting things fragment as they will on the other side, in enough 
different usage scenarios to be convincing for the general purpose case 
as well, it's unlikely the debate will ever be properly resolved.

I suppose someone will eventually do that sort of testing, but of course 
even if they did it now, with btrfs code still to be optimized and 
various scaling work still to be done, it's anyone's guess if the test 
results would still apply a few years down the road, after that scaling 
and optimization work.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
  2015-12-14  2:46           ` Christoph Anton Mitterer
@ 2015-12-14 11:19             ` Duncan
  0 siblings, 0 replies; 48+ messages in thread
From: Duncan @ 2015-12-14 11:19 UTC (permalink / raw)
  To: linux-btrfs

Christoph Anton Mitterer posted on Mon, 14 Dec 2015 03:46:01 +0100 as
excerpted:

>> Same here.  In fact, my most anticipated feature is N-way-mirroring,
> Hmm... not totally sure about that...
> AFAIU, N-way-mirroring is what the currently (wrongly) named RAID1 in
> btrfs already is, i.e. having N replicas of everything on M devices,
> right?
> In other words, not being an N-parity-RAID and not guaranteeing that
> *any* N disks could fail, right?

No.  N-way-mirroring, at least in simplest form (as in md/raid1) is N 
replicas on N devices, so loss of N-1 devices is permitted without loss 
of data.

Normally the best thing about this is that unlike parity, once the 
general support is in, you can increase redundancy at will, with 
guaranteed device-loss protection of as many devices as you care to 
insure against.

At one point with somewhat old devices that I didn't particularly trust 
any more and because I had them from a previous raid6 setup, I was 
running 4-way-md/raid1.
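
For reference, that sort of 4-way mirror is a one-liner with mdadm (device
names purely illustrative, not my actual setup):

  mdadm --create /dev/md0 --level=1 --raid-devices=4 /dev/sd[b-e]1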

Of course with md/raid1, the problem is the lack of any sort of data 
integrity assurance; even scrubbing just arbitrarily chooses one copy and, 
in the case of a difference, simply copies that to the others, without even 
a plurality vote for the most authoritative version.

With btrfs checksumming, the value of N-way-mirroring is increased 
dramatically, since it allows individual block verification and fallback, 
as opposed to whole-device-loss.

While my own sweet-spot balance will tend to be three-way, avoiding the 
"if one copy is bad (perhaps because of a device that's known failing/
failed), you better /hope/ your only remaining copy is good" problem of 
the present two-way-only solution, I could easily see people finding 
value in 4/5/6-way mirroring as well.

And of course if that is extended to raid10, three-way-mirrored, two-way-
striped on six total devices would be my preference, over the three-way-
striped, two-way-mirrored layout that's the only current choice for six-
device btrfs raid10.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: btrfs: poor performance on deleting many large files
  2015-12-12 22:15                   ` Christoph Anton Mitterer
  2015-12-13  7:10                     ` Duncan
@ 2015-12-14 14:24                     ` Austin S. Hemmelgarn
  2015-12-14 19:39                       ` Christoph Anton Mitterer
  2015-12-15  4:05                       ` Chris Murphy
  1 sibling, 2 replies; 48+ messages in thread
From: Austin S. Hemmelgarn @ 2015-12-14 14:24 UTC (permalink / raw)
  To: Christoph Anton Mitterer, Duncan, linux-btrfs

On 2015-12-12 17:15, Christoph Anton Mitterer wrote:
> On Sat, 2015-11-28 at 06:49 +0000, Duncan wrote:
>> Christoph Anton Mitterer posted on Sat, 28 Nov 2015 04:57:05 +0100 as
>> excerpted:
>>> Still, specifically for snapshots that's a bit unhandy, as one
>>> typically
>>> doesn't mount each of them... one rather mount e.g. the top level
>>> subvol
>>> and has a subdir snapshots there...
>>> So perhaps the idea of having snapshots that are per se noatime is
>>> still
>>> not too bad.
>> Read-only snapshots?
> So you basically mean that ro snapshots won't have their atime updated
> even without noatime?
> Well, I guess that was the behaviour of recent Linux filesystems anyway,
> and only very old UNIX systems updated the atime even when the fs was
> set ro.
Unless things have changed very recently, even many modern systems 
update atime on read-only filesystems, unless the media itself is 
read-only.  This is part of the reason for some of the forensics tools 
out there that drop write commands to the block devices connected to them.
>
>> That'd do it, and of course you can toggle the read-
>> only property (see btrfs property and its btrfs-property manpage).
> Sure, but then it would still be nice for rw snapshots.
>
> I guess what I probably actually want is the ability to set noatime as
> a property.
> I'll add that in a "feature request" on the project ideas wiki.
>
>> Alternatively, mount the toplevel subvol read-only or noatime on one
>> mountpoint, and bind-mount it read-write or whatever other
>> appropriate
> Well it's of course somehow possible... but that seems a bit ugly to
> me... the best IMHO, would really be if one could set a property on
> snapshots that marks them noatime.
If you have software that actually depends on atimes, then that software 
is broken (and yes, I even feel this way about Mutt).  The way atimes 
are implemented on most systems breaks the semantics that almost 
everyone expects from them, because they get updated for anything that 
even looks sideways at the inode from across the room.  Most software 
that uses them expects them to answer the question 'When were the 
contents of this file last read?', but they can get updated even for 
stuff like calculating file sizes, listing directory contents, or 
modifying the file's metadata.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: btrfs: poor performance on deleting many large files
  2015-12-14 14:24                     ` Austin S. Hemmelgarn
@ 2015-12-14 19:39                       ` Christoph Anton Mitterer
  2015-12-14 20:27                         ` Austin S. Hemmelgarn
  2015-12-15  4:05                       ` Chris Murphy
  1 sibling, 1 reply; 48+ messages in thread
From: Christoph Anton Mitterer @ 2015-12-14 19:39 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Duncan, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1758 bytes --]

On Mon, 2015-12-14 at 09:24 -0500, Austin S. Hemmelgarn wrote:
> Unless things have changed very recently, even many modern systems
> update atime on read-only filesystems, unless the media itself is 
> read-only.
Seriously? Oh... *sigh*...
You mean as in Linux, ext*, xfs?

> If you have software that actually depends on atimes, then that
> software 
> is broken (and yes, I even feel this way about Mutt).
I don't disagree here :D

> The way atimes 
> are implemented on most systems breaks the semantics that almost 
> everyone expects from them, because they get updated for anything
> that 
> even looks sideways at the inode from across the room.  Most software
> that uses them expects them to answer the question 'When were the 
> contents of this file last read?', but they can get updated even for 
> stuff like calculating file sizes, listing directory contents, or 
> modifying the file's metadata.
Sure... my point here again was that I try, every now and then, to look
at the whole thing from the pure end-user side:
For them, the default is relatime, and they likely won't want to
change that because they have no clue what further effects this
may have (or not).
So as long as Linux doesn't change its defaults to noatime, leaving
things up to broken software (i.e. to get fixed), I think it would be
nice for the end-user to have e.g. snapshots be "safe" (from the
write-amplification on read) out of the box.

My idea would basically be that having a noatime btrfs-property, which
is perhaps even set automatically, would be an elegant way of doing
that.
I just haven't had time to properly write that up and add it as a
"feature request" to the project ideas wiki page.


Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5313 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: btrfs: poor performance on deleting many large files
  2015-12-14 19:39                       ` Christoph Anton Mitterer
@ 2015-12-14 20:27                         ` Austin S. Hemmelgarn
  2015-12-14 21:30                           ` Lionel Bouton
                                             ` (3 more replies)
  0 siblings, 4 replies; 48+ messages in thread
From: Austin S. Hemmelgarn @ 2015-12-14 20:27 UTC (permalink / raw)
  To: Christoph Anton Mitterer, Duncan, linux-btrfs

On 2015-12-14 14:39, Christoph Anton Mitterer wrote:
> On Mon, 2015-12-14 at 09:24 -0500, Austin S. Hemmelgarn wrote:
>> Unless things have changed very recently, even many modern systems
>> update atime on read-only filesystems, unless the media itself is
>> read-only.
> Seriously? Oh... *sigh*...
> You mean as in Linux, ext*, xfs?
Possibly, I know that Windows 7 does it, and I think OS X and OpenBSD do 
it, but I'm not sure about Linux.
>
>> If you have software that actually depends on atimes, then that
>> software
>> is broken (and yes, I even feel this way about Mutt).
> I don't disagree here :D
>
>> The way atimes
>> are implemented on most systems breaks the semantics that almost
>> everyone expects from them, because they get updated for anything
>> that
>> even looks sideways at the inode from across the room.  Most software
>> that uses them expects them to answer the question 'When were the
>> contents of this file last read?', but they can get updated even for
>> stuff like calculating file sizes, listing directory contents, or
>> modifying the file's metadata.
> Sure... my point here again was that I try, every now and then, to look
> at the whole thing from the pure end-user side:
> For them, the default is relatime, and they likely won't want to
> change that because they have no clue what further effects this
> may have (or not).
> So as long as Linux doesn't change its defaults to noatime, leaving
> things up to broken software (i.e. to get fixed), I think it would be
> nice for the end-user to have e.g. snapshots be "safe" (from the
> write-amplification on read) out of the box.
AFAIUI, the _only_ reason that that is still the default is because of 
Mutt, and that won't change as long as some of the kernel developers are 
using Mutt for e-mail and the Mutt developers don't realize that what 
they are doing is absolutely stupid.

FWIW, both Duncan and I have our own copy of the sources patched to 
default to noatime, and I know a number of embedded Linux developers who 
do likewise, and I've even heard talk in the past of some distributions 
possibly using such patches themselves (although it always ends up not 
happening, because of Mutt).
>
> My idea would basically be that having a noatime btrfs-property, which
> is perhaps even set automatically, would be an elegant way of doing
> that.
> I just haven't had time to properly write that up and add it as a
> "feature request" to the project ideas wiki page.
I like this idea.
>
>
> Cheers,
> Chris.
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: btrfs: poor performance on deleting many large files
  2015-12-14 20:27                         ` Austin S. Hemmelgarn
@ 2015-12-14 21:30                           ` Lionel Bouton
  2015-12-14 23:25                             ` Christoph Anton Mitterer
  2015-12-14 23:10                           ` Christoph Anton Mitterer
                                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 48+ messages in thread
From: Lionel Bouton @ 2015-12-14 21:30 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Christoph Anton Mitterer, Duncan, linux-btrfs

Le 14/12/2015 21:27, Austin S. Hemmelgarn a écrit :
> AFAIUI, the _only_ reason that that is still the default is because of
> Mutt, and that won't change as long as some of the kernel developers
> are using Mutt for e-mail and the Mutt developers don't realize that
> what they are doing is absolutely stupid.
>

Mutt is often used as an example but tmpwatch uses atime by default too
and it's quite useful.

If you have a local cache of remote files for which you want a good hit
ratio and don't care too much about its exact size (you should have
Nagios/Zabbix/... alerting you when a filesystem reaches a %free limit
if you value your system's availability anyway), using tmpwatch with
cron to maintain it is only one single line away and does the job. For
an example of this particular case, on Gentoo the /usr/portage/distfiles
directory is used in one of the tasks you can uncomment to activate in
the cron.daily file provided when installing tmpwatch.
Using tmpwatch/cron is far more convenient than using a dedicated cache
(which might get tricky if the remote isn't HTTP-based, like an
rsync/ftp/nfs/... server or doesn't support HTTP IMS requests for example).
Some http frameworks put sessions in /tmp: in this case if you want
sessions to expire based on usage and not creation time, using tmpwatch
or similar with atime is the only way to clean these files. This can
even become a performance requirement: I've seen some servers slowing
down with tens/hundreds of thousands of session files in /tmp because it
was only cleaned at boot and the systems were almost never rebooted...
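
For concreteness, the kind of one-liner I mean might look like this in a
cron.daily script (retention period and path are just examples; tmpwatch
tests atime by default):

  # expire anything not read within the last 30 days (720 hours)
  /usr/sbin/tmpwatch 720 /usr/portage/distfiles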

I use noatime and nodiratime on some BTRFS filesystems for performance
reasons: Ceph OSDs, heavily snapshotted first-level backup servers and
filesystems dedicated to database server files (in addition to
nodatacow) come to mind, but the cases where these options are really
useful even with BTRFS don't seem to be the common ones.

Finally Linus Torvalds has been quite vocal and consistent on the
general subject of the kernel not breaking user-space APIs no matter
what so I wouldn't have much hope for default kernel mount options
changes...

Lionel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: btrfs: poor performance on deleting many large files
  2015-12-14 20:27                         ` Austin S. Hemmelgarn
  2015-12-14 21:30                           ` Lionel Bouton
@ 2015-12-14 23:10                           ` Christoph Anton Mitterer
  2015-12-14 23:16                           ` project idea: per-object default mount-options / more btrfs-properties / chattr attributes (was: btrfs: poor performance on deleting many large files) Christoph Anton Mitterer
  2015-12-15  2:08                           ` btrfs: poor performance on deleting many large files Duncan
  3 siblings, 0 replies; 48+ messages in thread
From: Christoph Anton Mitterer @ 2015-12-14 23:10 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Duncan, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 850 bytes --]

On Mon, 2015-12-14 at 15:27 -0500, Austin S. Hemmelgarn wrote:
> On 2015-12-14 14:39, Christoph Anton Mitterer wrote:
> > On Mon, 2015-12-14 at 09:24 -0500, Austin S. Hemmelgarn wrote:
> > > Unless things have changed very recently, even many modern
> > > systems
> > > update atime on read-only filesystems, unless the media itself is
> > > read-only.
> > Seriously? Oh... *sigh*...
> > You mean as in Linux, ext*, xfs?
> Possibly, I know that Windows 7 does it, and I think OS X and OpenBSD
> do 
> it, but I'm not sure about Linux.
I've just checked it via a loopback image and strictatime:

- ro snapshot doesn't get atime updated
- rw snapshot does get atime updated
- ro mounted fs (top level subvol) doesn't get atimes updated (neither
  do its subvols)
- rw mounted fs (top level subvol) does get atimes updated
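
For reference, such a test can be set up roughly like this (a sketch; sizes
and paths are illustrative):

  truncate -s 2G /tmp/btrfs-test.img
  mkfs.btrfs /tmp/btrfs-test.img
  mount -o loop,strictatime /tmp/btrfs-test.img /mnt/test
  btrfs subvolume create /mnt/test/sv
  echo data > /mnt/test/sv/file
  btrfs subvolume snapshot -r /mnt/test/sv /mnt/test/snap-ro
  btrfs subvolume snapshot    /mnt/test/sv /mnt/test/snap-rw
  stat -c %x /mnt/test/snap-ro/file       # note the atime...
  cat /mnt/test/snap-ro/file >/dev/null   # ...read the file...
  stat -c %x /mnt/test/snap-ro/file       # ...unchanged on the ro snapshot
  cat /mnt/test/snap-rw/file >/dev/null
  stat -c %x /mnt/test/snap-rw/file       # updated on the rw snapshot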

Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5313 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* project idea: per-object default mount-options / more btrfs-properties / chattr attributes (was: btrfs: poor performance on deleting many large files)
  2015-12-14 20:27                         ` Austin S. Hemmelgarn
  2015-12-14 21:30                           ` Lionel Bouton
  2015-12-14 23:10                           ` Christoph Anton Mitterer
@ 2015-12-14 23:16                           ` Christoph Anton Mitterer
  2015-12-15  2:08                           ` btrfs: poor performance on deleting many large files Duncan
  3 siblings, 0 replies; 48+ messages in thread
From: Christoph Anton Mitterer @ 2015-12-14 23:16 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1014 bytes --]

Just FYI:

On Mon, 2015-12-14 at 15:27 -0500, Austin S. Hemmelgarn wrote:
> > My idea would basically be that having a noatime btrfs-property,
> > which
> > is perhaps even set automatically, would be an elegant way of doing
> > that.
> > I just haven't had time to properly write that up and add it as a
> > "feature request" to the project ideas wiki page.
> I like this idea.

I've just compiled some thoughts and ideas into:
https://btrfs.wiki.kernel.org/index.php/Project_ideas#Per-object_default_mount-options_.2F_btrfs-properties_.2F_chattr.281.29_attributes_and_reasonable_userland_defaults

As usual, this is mostly from my admin/end-user side, i.e. what I could
imagine would ease the maintenance of large/complex (in terms of
subvols, nesting, snapshots) btrfs filesystems...

And of course, any developer or more expert user than me is happily
invited to comment/remove any (possibly stupid) ideas of mine therein,
or summon the inquisition for my heresy ;)


Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5313 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: btrfs: poor performance on deleting many large files
  2015-12-14 21:30                           ` Lionel Bouton
@ 2015-12-14 23:25                             ` Christoph Anton Mitterer
  2015-12-15  1:49                               ` Duncan
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Anton Mitterer @ 2015-12-14 23:25 UTC (permalink / raw)
  To: Lionel Bouton, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2371 bytes --]

On Mon, 2015-12-14 at 22:30 +0100, Lionel Bouton wrote:
> Mutt is often used as an example but tmpwatch uses atime by default
> too
> and it's quite useful.
Hmm, one could probably argue that these few cases justify the use of
separate filesystems (or btrfs subvols ;) ), so that the majority could
benefit from noatime.

> If you have a local cache of remote files for which you want a good
> hit
> ratio and don't care too much about its exact size (you should have
> Nagios/Zabbix/... alerting you when a filesystem reaches a %free
> limit
> if you value your system's availability anyway), using tmpwatch with
> cron to maintain it is only one single line away and does the job.
> For
> an example of this particular case, on Gentoo the
> /usr/portage/distfiles
> directory is used in one of the tasks you can uncomment to activate
> in
> the cron.daily file provided when installing tmpwatch.
> Using tmpwatch/cron is far more convenient than using a dedicated
> cache
> (which might get tricky if the remote isn't HTTP-based, like an
> rsync/ftp/nfs/... server or doesn't support HTTP IMS requests for
> example).
> Some http frameworks put sessions in /tmp: in this case if you want
> sessions to expire based on usage and not creation time, using
> tmpwatch
> or similar with atime is the only way to clean these files. This can
> even become a performance requirement: I've seen some servers slowing
> down with tens/hundreds of thousands of session files in /tmp because
> it
> was only cleaned at boot and the systems were almost never
> rebooted...
Okay, there are probably some use cases... the session cleaning, however,
I'd rather consider a bug in the respective software, especially if
it really depends on it to expire the sessions (what if for some reason
tmpwatch gets broken, uninstalled, etc.?)


> I use noatime and nodiratime
FYI: noatime implies nodiratime :-)


> Finally Linus Torvalds has been quite vocal and consistent on the
> general subject of the kernel not breaking user-space APIs no matter
> what so I wouldn't have much hope for default kernel mount options
> changes...
He surely is right in general... but when the point has been reached
where only a minority actually requires the feature, and everyone else
actually starts to suffer from it... it may change.


Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5313 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: btrfs: poor performance on deleting many large files
  2015-12-14 23:25                             ` Christoph Anton Mitterer
@ 2015-12-15  1:49                               ` Duncan
  2015-12-15  2:38                                 ` Lionel Bouton
  0 siblings, 1 reply; 48+ messages in thread
From: Duncan @ 2015-12-15  1:49 UTC (permalink / raw)
  To: linux-btrfs

Christoph Anton Mitterer posted on Tue, 15 Dec 2015 00:25:05 +0100 as
excerpted:

> On Mon, 2015-12-14 at 22:30 +0100, Lionel Bouton wrote:
> 
>> I use noatime and nodiratime

> FYI: noatime implies nodiratime :-)

Was going to post that myself.  Is there some reason you:

a) use nodiratime when noatime is already enabled, despite the fact that 
the latter already includes the former, or

b) didn't sufficiently research the option (at least the current mount 
manpage documents that noatime includes nodiratime under both the noatime 
and nodiratime options, and at least some hint of that has been in the 
manpage for years as I recall reading it when I first read of nodiratime 
and checked whether my noatime options included it) before standardizing 
on it, or

c) might have actually been talking in general, and there's some mounts 
you don't actually choose to make noatime, but still want nodiratime, or

d) made some other choice that isn't otherwise reflected in the above?  If so, please 
describe, as it could be a learning experience for me, and possibly 
others as well.

>> Finally Linus Torvalds has been quite vocal and consistent on the
>> general subject of the kernel not breaking user-space APIs no matter
>> what so I wouldn't have much hope for default kernel mount options
>> changes...

> He surely is right in general,... but when the point has been reached,
> where only a minority actually requires the feature... and the minority
> actually starts to suffer from that... it may change.

Generally speaking, the practical rule is that you don't break userspace, 
but that a break that isn't noticed and reported by someone within a few 
release cycles is considered OK, as obviously nobody who actually cares 
enough about the possibility of old userspace breaking on new kernels 
to test for it was (still) using that functionality anyway.  (This 
is sometimes known as the "if a tree falls in the forest and there's 
nobody around to hear it, did it actually fall", rule. =:^)

But if it's noticed and reported before the new behavior itself is locked 
into place by other userspace relying on it, the change in behavior must 
be reverted.  (There have actually been a few cases over the years where 
they went to rather exceptional lengths to make two otherwise 
incompatible userspace-exposed behaviors both continue to work for the 
userspace that expected that behavior, without actually coding in such 
obvious hacks as executable name conditionals or the like, as others have 
been known to do at times.  Sometimes these fixes do end up bending the 
rules a bit, particularly the no-policy-in-the-kernel rule, but they do 
reinforce the no-userspace-breakage rule.)

The possible workarounds include the handful of kernel compatibility 
options that when enabled continue otherwise userspace breaking behavior 
such as removing old kernel API procfs files and the like.

That practical rule does in effect make it possible to do userspace-
breaking changes if you wait around long enough that there's nobody who 
will complain still actually using the old behavior.


-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: btrfs: poor performance on deleting many large files
  2015-12-14 20:27                         ` Austin S. Hemmelgarn
                                             ` (2 preceding siblings ...)
  2015-12-14 23:16                           ` project idea: per-object default mount-options / more btrfs-properties / chattr attributes (was: btrfs: poor performance on deleting many large files) Christoph Anton Mitterer
@ 2015-12-15  2:08                           ` Duncan
  3 siblings, 0 replies; 48+ messages in thread
From: Duncan @ 2015-12-15  2:08 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Mon, 14 Dec 2015 15:27:11 -0500 as
excerpted:

> FWIW, both Duncan and I have our own copy of the sources patched to
> default to noatime, and I know a number of embedded Linux developers who
> do likewise, and I've even heard talk in the past of some distributions
> possibly using such patches themselves (although it always ends up not
> happening, because of Mutt).

And FWIW, while I was reasonably conservative with my original patch and 
simply defaulted to noatime, turning it off if any of the atime-enabling 
options were found, I'm beginning to think I might as well simply hard-
code noatime, removing the conditions.  This is due to initr* behavior 
that ends up not disabling atime for early, mostly virtual/memory-based 
filesystems like procfs, sysfs, devfs, tmp-on-tmpfs, etc, but could 
extend to initial initr* mount of the root filesystem as well, if I 
decide to make it rw on the kernel commandline or some such.

Of course atime on a memory-based fs isn't normally a huge problem since 
it's all memory-based anyway, and it would enable stuff like atime-based 
tmpwatch since I do a tmpfs based tmp, so I've not worried about it 
much.  But at the same time, I'm now assuming noatime on my systems, and 
anything that breaks that assumption could trigger hard to trace down 
bugs, and hardcoding the noatime assumption would bring a consistency 
that I don't have ATM.

If/when I change my patch in that regard, I may look into adding other 
conditional options, perhaps defaulting to autodefrag if it's btrfs, for 
instance, if my limited sysadmin-not-developer-level patching/coding 
skills allow it.  I'd have to see...  But I'd certainly start with making 
autodefrag a default, not hard-coded, if I did patch in autodefrag, 
because while I don't have large VM images and the like, where autodefrag 
can be a performance bottleneck, to worry about now, I'd like to keep 
that option available for me in the future, and would thus make 
autodefrag the default, not hard-coded.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: btrfs: poor performance on deleting many large files
  2015-12-15  1:49                               ` Duncan
@ 2015-12-15  2:38                                 ` Lionel Bouton
  2015-12-16  8:10                                   ` Duncan
  0 siblings, 1 reply; 48+ messages in thread
From: Lionel Bouton @ 2015-12-15  2:38 UTC (permalink / raw)
  To: Duncan, linux-btrfs

Le 15/12/2015 02:49, Duncan a écrit :
> Christoph Anton Mitterer posted on Tue, 15 Dec 2015 00:25:05 +0100 as
> excerpted:
>
>> On Mon, 2015-12-14 at 22:30 +0100, Lionel Bouton wrote:
>>
>>> I use noatime and nodiratime
>> FYI: noatime implies nodiratime :-)
> Was going to post that myself.  Is there some reason you:
>
> a) use nodiratime when noatime is already enabled, despite the fact that 
> the latter already includes the former, or

I don't (and haven't for some time). I didn't check for nodiratime on all the
systems I admin, so there could be some left around, but as they are
harmless I only remove them when I happen to stumble on them.

>
> b) didn't sufficiently research the option (at least the current mount 
> manpage documents that noatime includes nodiratime under both the noatime 
> and nodiratime options,

I just checked: this has only been made crystal-clear in the latest
man-pages version 4.03, released 10 days ago.

The mount(8) page of Gentoo's current stable man-pages (4.02 release in
August) which is installed on my systems states for noatime:
"Do not update inode access times on this filesystem (e.g., for faster
access on the news spool to speed up news servers)."

This is prone to misinterpretation: directories have inodes too, but that may
not be self-explanatory for everyone. At least it could leave me with a
doubt if I wasn't absolutely certain of the behavior (see below): I'm
not sure myself that there isn't a difference between a VFS inode (the
in-memory structure) and an on-disk structure called inode which some
filesystems may not have (I may have been mistaken but IIRC ReiserFS
left me with the impression that it wasn't storing directory entries in
inodes or it didn't call it that).

In fact I remember that, when I read statements about noatime implying
nodiratime, I had to check fs/inode.c to make sure of the behavior, after
finding a random discussion on the subject mentioning that the proof was
in the code.


>  and at least some hint of that has been in the 
> manpage for years as I recall reading it when I first read of nodiratime 
> and checked whether my noatime options included it) before standardizing 
> on it, or
>
> c) might have actually been talking in general, and there's some mounts 
> you don't actually choose to make noatime, but still want nodiratime, or

I probably used this case for testing purposes (but don't remember a
case where it was useful to me).
The expression I used was not meant to describe the exact flags in fstab
on my systems but the general idea of avoiding file and directory
atime updates, as by using noatime I'm implicitly using nodiratime too.
Sorry for the confusion (I've been confused about the subject for a long
time, which probably didn't help me express myself clearly).

Best regards,

Lionel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: btrfs: poor performance on deleting many large files
  2015-12-14 14:24                     ` Austin S. Hemmelgarn
  2015-12-14 19:39                       ` Christoph Anton Mitterer
@ 2015-12-15  4:05                       ` Chris Murphy
  1 sibling, 0 replies; 48+ messages in thread
From: Chris Murphy @ 2015-12-15  4:05 UTC (permalink / raw)
  To: Btrfs BTRFS

On Mon, Dec 14, 2015 at 7:24 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:

>
> If you have software that actually depends on atimes, then that software is
> broken (and yes, I even feel this way about Mutt).  The way atimes are
> implemented on most systems breaks the semantics that almost everyone
> expects from them, because they get updated for anything that even looks
> sideways at the inode from across the room.  Most software that uses them
> expects them to answer the question 'When were the contents of this file
> last read?', but they can get updated even for stuff like calculating file
> sizes, listing directory contents, or modifying the file's metadata.

This Jonathan Corbet article still applies:
http://lwn.net/Articles/397442/

What a mess!

Hey. The 5 year anniversary was in July. Wanna bring it up again, Austin? Haha.
http://thread.gmane.org/gmane.linux.kernel.cifs/294

Users want file creation time. Specifically, an immutable time for
that file that persists across file system copies. The time of its
first occurrence on a particular volume is not useful information.
Getting that requires what seems to be an unlikely consensus.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: btrfs: poor performance on deleting many large files
  2015-12-15  2:38                                 ` Lionel Bouton
@ 2015-12-16  8:10                                   ` Duncan
  0 siblings, 0 replies; 48+ messages in thread
From: Duncan @ 2015-12-16  8:10 UTC (permalink / raw)
  To: linux-btrfs

Lionel Bouton posted on Tue, 15 Dec 2015 03:38:33 +0100 as excerpted:

> I just checked: this has only be made crystal-clear in the latest
> man-pages version 4.03 released 10 days ago.
> 
> The mount(8) page of Gentoo's current stable man-pages (4.02 release in
> August) which is installed on my systems states for noatime:
> "Do not update inode access times on this filesystem (e.g., for faster
> access on the news spool to speed up news servers)."

Hmm... I hadn't synced and updated in about that time, and sure enough, 
while I've just synced I've not yet updated, and still have man-pages 
4.02 installed.

But the mount.8 manpage (mount.8.bz2 in my case, as that's the compression 
I'm configured for; I had to use man -d mount to debug-dump which file it 
was actually loading) actually belongs to util-linux, according to equery 
belongs, while equery files man-pages | grep mount only returns hits for 
mount.2.bz2 (and umount).

So at least here, it's util-linux providing the mount (8) manpage, not 
man-pages.

Tho I'm on ~amd64 and IIRC just updated util-linux in the last update, so 
the cross-ref to nodiratime in the noatime entry (saying it isn't 
necessary as noatime covers it) probably came from there, or a similar 
recent util-linux update.

Let's see...

My current util-linux (with the xref in both noatime and nodiratime to 
the other, saying nodiratime isn't needed if noatime is used) is 2.27.1.

The oldest version I still have in my binpkg cache (tho I likely have 
older on the backup) is util-linux 2.24.2.  For noatime it has the 
wording you mention, don't update inode access times, but for nodiratime, 
it specifically mentions directory inode access times.  So from util-linux 
2.24.2 at least, the information was there, but you had to read between 
the lines a bit more, because nodiratime mentions dir inodes, and noatime 
says don't update atime on inodes, so it's there but you have to be a 
reasonably astute reader to see it.

In between those two I have other versions including 2.26.2 and 2.27.  
Looks like 2.27 added both the "implies nodiratime" wording to the noatime 
entry, and the nodiratime unneeded if noatime set notation to the 
nodiratime entry.

If there was a util-linux 2.26.x beyond x=2, I apparently never installed 
it, so the wording likely changed with 2.27, but may have changed with 
late 2.26 versions as well, if there were any beyond 2.26.2.

And on gentoo, 2.26.2 appears to be the latest stable-keyworded, so 
that's what stable users would have.

But as I said, the info is there at least as of 2.24.2, you just have to 
note in the nodiratime entry that it says dir inodes, while the noatime 
entry simply says inodes, without excluding dir inodes.  So it's there, 
you just have to be a somewhat astute reader to note it.

Anywhere else, say on-the-net recommendations for nodiratime, /should/ 
mention that they aren't necessary if noatime is used as well, but of 
course not all of them will.  (Tho I'd actually find it a bit strange to 
see discussion of nodiratime without discussion of noatime as well, as 
I'd guess any discussion of just one of the two would likely be on 
noatime, leaving nodiratime unmentioned if they're only covering one, as 
it shouldn't be necessary to mention, since it's already included in 
noatime.)

But there's probably a bunch of folks who originally read coverage of 
noatime, then saw nodiratime later, and thought "Oh, that's separate?  
Well I want that too!" and simply enabled them both, without actually 
checking the manpage or other documentation including on-the-net 
discussion.

I know here I originally saw noatime and decided I wanted it, then was 
confused when I saw nodiratime sometime later.  But I don't just enable 
stuff without having some idea what I'm enabling, so I did my research, 
and saw noatime implied nodiratime as well, so the only reason nodiratime 
might be needed would be if you wanted atime in general, but not on dirs.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
  2015-12-09 16:36         ` Duncan
@ 2015-12-16 21:59           ` Christoph Anton Mitterer
  2015-12-17  4:06             ` Duncan
                               ` (5 more replies)
  0 siblings, 6 replies; 48+ messages in thread
From: Christoph Anton Mitterer @ 2015-12-16 21:59 UTC (permalink / raw)
  To: Duncan, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 14545 bytes --]

On Wed, 2015-12-09 at 16:36 +0000, Duncan wrote:
> But... as I've pointed out in other replies, in many cases including
> this 
> specific one (bittorrent), applications have already had to develop
> their 
> own integrity management features
Well, let's move discussion of that into the "dear developers, can we
have notdatacow + checksumming, plz?" thread, where I showed, in one of the
more recent mails, that bittorrent seems to be about the only thing which
uses that by default... while on the VM image front nothing seems
to support it, and on the DB front some support it but don't use it
by default.


> In the bittorrent case specifically, torrent chunks are already 
> checksummed, and if they don't verify upon download, the chunk is
> thrown 
> away and redownloaded.
I'm not a bittorrent expert, because I don't use it, but that sounds
more like the edonkey model, where - while there are checksums -
these are only used until the download completes. Then you have the
complete file, any checksum info is thrown away, and the file is again
"at risk" (i.e. not checksum protected).


> And after the download is complete and the file isn't being
> constantly 
> rewritten, it's perfectly fine to copy it elsewhere, into a dir where
> nocow doesn't apply.
Sure, but again, that's nothing that happens automatically for the user, and
there's still the gap between the final verification by the bt software and
the time it's copied over.
Arguably, that may be very short, but I see no reason to allow any
breaks in the everything-verified chain on the btrfs side.


>   With the copy, btrfs will create checksums, and if 
> you're paranoid you can hashcheck the original nocow copy against the
> new 
> checksummed/cow copy, and after that, any on-media changes will be
> caught 
> by the normal checksum verification mechanisms.
As before... of course you're right that one can do this, but it's nothing
that happens by default.
And I think that's just one of the nice things btrfs would/should give
us: that the filesystem assures that data is valid, at least in terms
of storage device and bus errors (it cannot, of course, protect against
memory errors or the like).
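
Spelled out, the procedure being discussed would be roughly this (paths
illustrative):

  # force a real (non-reflink) copy from the nocow dir into a normal,
  # checksummed CoW dir, then compare hashes once by hand
  cp --reflink=never ~/torrents-nocow/file.iso ~/archive/file.iso
  sha256sum ~/torrents-nocow/file.iso ~/archive/file.iso
  # from here on, the archived copy is covered by btrfs checksums on
  # every read and scrub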


> > Hmm doesn't seem really good to me if systemd would do that, cause
> > it
> > then excludes any such files from being snapshot.
> 
> Of course if the directories are already present due to systemd
> upgrading 
> from non-btrfs-aware versions, they'll remain as normal dirs, not 
> subvolumes.  This is the case here.
Well, even if not, because one starts from a fresh system... people may
still not want that.


> And of course you can switch them around to dirs if you like, and/or 
> override the shipped tmpfiles.d config with your own.
... sure, but people may not even notice that.
I don't think such a decision is up to systemd.
Anyway, since we're talking btrfs here, not systemd, that shouldn't bother us
;)


> > > and small ones such as the sqlite files generated by firefox and
> > > various email clients are handled quite well by autodefrag, with
> > > that
> > > general desktop usage being its primary target.
> > Which is however not yet the default...
> Distro integration bug! =:^)
Nah... really not...
I'm quite sure that most distros will generally decide against
deviating from upstream in such choices.



> > It feels a bit, if there should be some tools provided by btrfs,
> > which
> > tell the users which files are likely problematic and should be
> > nodatacow'ed
> And there very well might be such a tool... five or ten years down
> the 
> road when btrfs is much more mature and generally stabilized, well
> beyond 
> the "still maturing and stabilizing" status of the moment.
Hmm, let's hope btrfs isn't only finished by the time the next-gen default fs
arrives ;^)



> But it can be the case that as filesystem fragmentation levels rise,
> free-
> space itself is fragmented, to the point where files that would
> otherwise 
> not be fragmented as they're created once and never touched again,
> end up 
> fragmented, because there's simply no free-space extents big enough
> to 
> create them in unfragmented, so a bunch of smaller free-space extents
> must be used where one larger one would have been used had it
> existed.
I'm kinda curious what free space fragmentation actually means here.

Is it simply like this:
+----------+-----+---+--------+
|     F    |  D  | F |    D   |
+----------+-----+---+--------+
Where D is data (i.e. files/metadata) and F is free space.
In other words, (F)ree space itself is not further subdivided and only
fragmented by the (D)ata extents in between.

Or is it more complex like this:
+-----+----+-----+---+--------+
|  F  |  F |  D  | F |    D   |
+-----+----+-----+---+--------+
Where the (F)ree space itself is subdivided
into "extents" (not necessarily of the same size), and btrfs couldn't
use e.g. the first two F's as one contiguous amount of free space for a
larger (D)ata extent of that size:
+----------+-----+---+--------+
|     D    |  D  | F |    D   |
+----------+-----+---+--------+
but would split
that up into two instead:
+-----+----+-----+---+--------+
|  D  |  D |  D  | F |    D   |
+-----+----+-----+---+--------+

?


> In that regard, yes, it can affect other files, but it affects them
> by 
> fragmentation, so no, it doesn't affect unfragmented files... to the 
> extent that there are any unfragmented files left.
I see :)


> > Are there any problems caused by all this with respect to free
> > space
> > fragmentation? And what exactly are the consequences of free space
> > fragmentation? ;)
> I must have intuited the question as I just answered it, above! =:^)
O:-D


> And of course that's assuming the worst case, that autodefrag is
> /not/ 
> snapshot-aware.  If it is, then the problem effectively vaporizes 
> entirely.


> > Hmm and what about mixed-use systems,... which have both, desktop
> > and
> > server like IO patterns?
> 
> Valid question.  And autodefrag, like most btrfs-specific mount
> options, 
> remains filesystem-global at this point, too, so it's not like you
> can 
> mount different subvolumes, some with autodefrag, some without (tho 
> that's a planned future implementation detail).
> 
> But, at least personally, I tend to prefer separate filesystems, not 
> subvols, in any case, primarily because I don't like having my data
> eggs 
> all in the same filesystem basket and then watching its bottom drop
> out 
> when I find it unmountable!
> 
> But the filesystem-global nature of autodefrag and similar mount
> options, 
> tends to encourage the separate filesystem layout as well, as in that
> case you simply don't have to worry, because the server stuff is on
> its 
> own separate btrfs where the autdefrag on the desktop btrfs can't 
> interfere with it, as each separate filesystem can have its own mount
> options. =:^)
> 
> So that'd be /my/ preferred solution, but I can indeed see it being a
 
> problem for those users (or distros) that prefer one big filesystem
> with 
> subvolumes, which some do, because then it's all in a single storage
> pool 
> and thus easier to manage.
Well, the problem I see here mainly is, with additional filesystems
(while you're absolutely right with your eggs/basket example ;) )...
one again has the problem of partitioning, or of using e.g. LVM, in order
not to have to allocate a more or less fixed amount of space for each of
the different filesystems to be created for different purposes.
Now placing LVM below btrfs is at least conceptually bad, because btrfs
already provides similar/the same features by itself.

So that would be the nice part of just using subvols with different
e.g. autodefrag options: it doesn't matter which of the subvols
eventually eats up more space - they share it.



> > btw: I think documentation (at least the manpage) doesn't tell
> > whether
> > btrfs defragment -c XX will work on files which aren't fragmented.
> 
> It implies it, but I don't believe it's explicit.
> 
> The implication is due to the implication that defrag with the
> compress 
> option is effectively compress, in that it rewrites everything it's
> told 
> to compress in that case, of course defragging in the process due to
> the 
> rewrite, but with the primary purpose being the compress, when used
> in 
> that manner.
Hmm... I guess it would be better if there were a separate option for
that... or at least some clearer documentation.

> he obviously didn't
> think thru the fact that compression MUST be a rewrite, thereby
> breaking 
> snapshot reflinks, even were normal non-compression defrag to be
> snapshot 
> aware, because compression substantially changes the way the file is 
> stored), that's _implied_, not explicit.
So you mean, even if ref-link-aware defrag returned, it would still
break them again when compressing/uncompressing/recompressing?
I'd have hoped that then all snapshots, respectively other reflinks,
would simply also change to being compressed, or at least that there
would then be an option that allows one to choose... break up the
reflinks, or change them.

> Yes.  Try it on a file that's large enough (a gig or so should do it
> nicely) to make a difference in the btrfs fi df listing.  Compare
> before 
> and after listings.
Okay... I'll need to think a bit more about how to actually trigger
that.
Because a) one doesn't get any notice, AFAICS, whether autodefrag ran, b) I
need to actually manage to create a fragmented file first, and c)
understand what each of the fi df values actually means ;)
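A rough sketch of how such a before/after comparison could look (with
/home just a placeholder mount point):

  btrfs filesystem df /home > before.txt
  # ...create and randomly rewrite a large test file, give autodefrag time...
  btrfs filesystem df /home > after.txt
  diff before.txt after.txt

Though filefrag on the test file before and after is probably the more
direct check.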



> As I said to my knowledge it hasn't been tried, but AFAIK, truncate,
> followed by sync (or fsync), doesn't do sparse.  I've seen it used
> for (I 
> believe) similar purposes elsewhere, which is why I suggested its use
> here.
Hmm, at least doing a
truncate --size 10G foo
sync
doesn't seem to cause any disk IO.
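Which fits with truncate just creating a sparse file - assuming the file
is called foo as above, this should show a 10G apparent size but (almost)
no allocated blocks:

  ls -lh foo   # apparent size: 10G
  du -h foo    # allocated size: basically 0, i.e. just a hole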


> Of course if truncate doesn't work, catting from /dev/urandom should
> do 
> the trick, as that should be neither sparse nor compressible.  
Or perhaps a bit faster, /dev/zero ;-P
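Though with a compress mount option in play the zeroes would presumably
be squashed to almost nothing, so something like this (name and size made
up) is probably the safer way to get a genuinely incompressible,
non-sparse test file:

  dd if=/dev/urandom of=testfile bs=1M count=2048 conv=fsync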

 
> > b) How can one find out whether defragmentation worked well? I guess
> > with
> > filefrag in the compress=no case and not at all in any other?
> 
> I recently found out that filefrag -v actually lists the extent byte 
> addresses, thus making it possible to manually (or potentially via 
> script) check whether the 128-KiB compression blocks are contiguous or
> not.  
> Contiguous would mean same extent, even if filefrag doesn't
> understand 
> that yet.
> 
> But certainly, filefrag in the uncompressed case is exactly what I
> had in 
> mind.
I'm a bit unsure how to read filefrag's output... (even in the
uncompressed case).
What would it show me if there was fragmentation?
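For what it's worth, a minimal check in the uncompressed case could look
like this (file name made up); an unfragmented file reports a single
extent:

  filefrag bigfile
  bigfile: 1 extent found     <- unfragmented
  filefrag -v bigfile         <- per-extent logical/physical offsets and lengths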


> Yeah, the LHC is rather a special case.
Well, many fields of science actually go into those ranges now:
astronomy, microbiology, genetics, brain research, other
fields of physics; we've even had contact with some people from the
humanities who apparently think their "research" would need such large
amounts of storage (don't ask me,... I didn't understand it ^^)


> Tho to be fair, were I managing data for them or that sort of data
> set 
> where shear size makes backups impractical, I'd probably be at least
> as 
> conservative about btrfs usage as you're sounding, not necessarily in
> the 
> specifics, but simply because while btrfs is indeed stabilizing, I 
> haven't had any argument on this list against my oft stated opinion
> that 
> it's not fully stable and mature yet, and won't be for some time.
Sure,... I mean right now it's not a shame for btrfs that one perhaps
wouldn't recommend it for that usage.
But in the future the goal should be that it can be used,... in our
case that would probably still be simple, as the vast majority of data
(i.e. the PiBs that are archived) is typically write-once read-many.
But the files that are processed in jobs aren't necessarily. They may
easily do all these things where right now btrfs may still start to
choke sooner or later.



> As such, to date I'd be unlikely to consider btrfs at all for data
> where 
> backups aren't feasible, unless it really is simply throw-away data 
> (which from your description isn't the case there), and would be
> leery as 
> well about btrfs usage where backups are available, but simply 
> impractical to deal with, due to shear size and data transfer time.
Sure, talking about right now,... but at least in 5-10 years btrfs will
hopefully have matured enough that people don't have to start making
backups on the fresh fs ;)


> > For quite a while I was thinking about productively using btrfs at
> > our
> > local Tier-2 in Munich, but then decided against:
> 
> As should be apparent from the above, I basically agree.
> 
> I did want to mention that I enjoyed seeing your large-scale
> description, 
> however, as well as your own reasoning for the decisions you have
> made.  
> (Of course it's confirming my own opinion so I'm likely to enjoy it,
> but 
> still...)
Well, it was also meant to give the devs some insight into which
problems real-world scenarios might run into.

It would be interesting to hear from Chris (Mason) how things are
going at Facebook (IIRC, they were testing btrfs in production),
especially with regard to maintainability and all these things we were
talking about (fragmentation and the like).

But of course, even if it works out perfectly for them, one may not
immediately generalise... perhaps they don't do snapshots ;-) ... or
their nodes are more multiply-redundant and throw-away (i.e. if one VM's
fs breaks or gets slower, it would be automatically re-deployed and
populated with data).
These, and of course having the maintainer of the fs on staff, may not be
things every site can afford (and it would also require cloning Chris,
which he may not be particularly fond of ;) ).


> > Speaking of which:
> > Is there somewhere a good documentation of what exactly all this
> > numbers
> > of show, df, usage and so on tell?
> 
> It's certainly in quite a few on-list posts over the years
okay,.. in other words: no ;-)
List posts scattered over the years don't count as documentation :P


Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5313 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: btrfs: poor performance on deleting many large files
  2015-12-13  7:10                     ` Duncan
@ 2015-12-16 22:14                       ` Christoph Anton Mitterer
  0 siblings, 0 replies; 48+ messages in thread
From: Christoph Anton Mitterer @ 2015-12-16 22:14 UTC (permalink / raw)
  To: Duncan, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2133 bytes --]

On Sun, 2015-12-13 at 07:10 +0000, Duncan wrote:
> > So you basically mean that ro snapshots won't have their atime
> > updated
> > even without noatime?
> > Well I guess that was anyway the recent behaviour of Linux
> > filesystems,
> > and only very old UNIX systems updated the atime even when the fs
> > was
> > set ro.
> 
> I'd test it to be sure before relying on it (keeping in mind that my
> own 
> use-case doesn't include subvolumes/snapshots so it's quite possible
> I 
> could get fine details of this nature wrong), but that would be my
> very 
> (_very_! see next) strong assumption, yes.
> 
> Because read-only snapshots are used for btrfs-send among other
> things, 
> with the idea being that the read-only will keep them from changing
> in 
> the middle of the send, and ro snapshot atime updates would seem to
> throw 
> that entirely out the window.  So I can't imagine ro snapshots doing
> atime 
> updates under any circumstance because I just can't see how send
> could 
> rely on them then, but I'd still test it before counting on it.

For those who haven't followed up the other threads:
I've tried it out and yes, ro-snapshots (as well as ro mounted btrfs
filesystem/subvolumes) don't have their atimes changed on e.g. read.
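For reference, a test along these lines should reproduce it (paths are
just placeholders):

  btrfs subvolume snapshot -r /mnt/data /mnt/snap   # -r = read-only snapshot
  stat -c '%x' /mnt/snap/somefile                   # note the atime
  cat /mnt/snap/somefile > /dev/null
  stat -c '%x' /mnt/snap/somefile                   # stays the same here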



> AFAIK, the general idea was to eventually have all the (possible,
> some 
> are global-filesystem-scope) subvolume mount options exposed as 
> properties, it's just not implemented yet, but I'm not entirely sure
> if 
> that was all /btrfs-specific/ mount options, or included the generic
> ones 
> such as the *atime and no* (noexec/nodev/...) options as well.  In
> view 
> of that and the fact that noatime is generic, adding it as a specific
> request still makes sense.  Someone with more specific knowledge on
> the 
> current plan can remove it if it's already covered.
Not sure if I had already posted that here, but I did write some of
these ideas up and added them to the wiki:
https://btrfs.wiki.kernel.org/index.php?title=Project_ideas&action=historysubmit&diff=29757&oldid=29743



Best wishes,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5313 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
  2015-12-09 13:36         ` Duncan
  2015-12-14  2:46           ` Christoph Anton Mitterer
@ 2015-12-16 23:39           ` Kai Krakow
  1 sibling, 0 replies; 48+ messages in thread
From: Kai Krakow @ 2015-12-16 23:39 UTC (permalink / raw)
  To: linux-btrfs

On Wed, 9 Dec 2015 13:36:01 +0000 (UTC),
Duncan <1i5t5.duncan@cox.net> wrote:

> >> > 4) Duncan mentioned that defrag (and I guess that's also for
> >> > auto- defrag) isn't ref-link aware...
> >> > Isn't that somehow a complete showstopper?  
> 
> >> It is, but the one attempt at dealing with it caused massive data
> >> corruption, and it was turned off again.  
> 
> IIRC, it wasn't data corruption so much, as massive scaling issues,
> to the point where defrag was entirely useless, as it could take a
> week or more for just one file.
> 
> So the decision was made that a non-reflink-aware defrag that
> actually worked in something like reasonable time even if it did
> break reflinks and thus increase space usage, was of more use than a
> defrag that basically didn't work at all, because it effectively took
> an eternity. After all, you can always decide not to run it if you're
> worried about the space effects it's going to have, but if it's going
> to take a week or more for just one file, you effectively don't have
> the choice to run it at all.
> 
> > So... does this mean that it's still planned to be implemented some
> > day or has it been given up forever?  
> 
> AFAIK it's still on the list.  And the scaling issues are better, but
> one big thing holding it up now is quota management.  Quotas never
> have worked correctly, but they were a big part (close to half, IIRC)
> of the original snapshot-aware-defrag scaling issues, and thus must
> be reliably working and in a generally stable state before a
> snapshot-aware-defrag can be coded to work with them.  And without
> that, it's only half a solution that would have to be redone when
> quotas stabilized anyway, so really, quota code /must/ be stabilized
> to the point that it's not a moving target, before reimplementing
> snapshot-aware-defrag makes any sense at all.
> 
> But even at that point, while snapshot-aware-defrag is still on the
> list, I'm not sure if it's ever going to be actually viable.  It may
> be that the scaling issues are just too big, and it simply can't be
> made to work both correctly and in anything approaching practical
> time.  Time will tell, of course, but until then...

I'd like to throw in an idea... Couldn't auto-defrag just be made "sort
of reflink-aware" in a very simple fashion: Just let it ignore extents
that are shared?

That way you can still enjoy its benefits in a mixed-mode scenario where
you are working with snapshots partly but other subvolumes never have
snapshots taken of them.

Comments?

-- 



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
  2015-12-14 10:51         ` Duncan
@ 2015-12-16 23:55           ` Christoph Anton Mitterer
  0 siblings, 0 replies; 48+ messages in thread
From: Christoph Anton Mitterer @ 2015-12-16 23:55 UTC (permalink / raw)
  To: Duncan, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 8534 bytes --]

On Mon, 2015-12-14 at 10:51 +0000, Duncan wrote:
> > AFAIU, the one that gets fragmented then is the snapshot, right,
> > and the
> > "original" will stay in place where it was? (Which is of course
> > good,
> > because one probably marked it nodatacow, to avoid that
> > fragmentation
> > problem on internal writes).
> 
> No.  Or more precisely, keep in mind that from btrfs' perspective, in
> terms of reflinks, once made, there's no "original" in terms of
> special 
> treatment, all references to the extent are treated the same.
Sure... you misunderstood me I guess..

> 
> What a snapshot actually does is create another reference (reflink)
> to an 
> extent.
[snip snap]
> So in the 
> case of nocow, a cow1 (one-time-cow) exception must be made,
> rewriting 
> the changed data to a new location, as the old location continues to
> be 
> referenced by at least one other reflink.
That's what I've meant.


> So (with the fact that writable snapshots are available and thus it
> can 
> be the snapshot that changed if it's what was written to) the one
> that 
> gets the changed fragment written elsewhere, thus getting fragmented,
> is 
> the one that changed, whether that's the working copy or the snapshot
> of 
> that working copy.
Yep,.. that's what I've suspected and asked for.

The "original" file, in the sense of the file that first reflinked the
contiguous blocks,... will continue to point to these continuous
blocks.

While the "new" file, i.e he CoW-1-ed snapshot's file, will partially
reflink blocks form the contiguous range, and it's rewritten blocks
will reflink somewhere else.
Thus the "new" file is the one that gets fragmented.


> > And one more:
> > You both said, auto-defrag is generally recommended.
> > Does that also apply for SSDs (where we want to avoid unnecessary
> > writes)?
> > It does seem to get enabled, when SSD mode is detected.
> > What would it actually do on an SSD?
> Did you mean it does _not_ seem to get (automatically) enabled, when
> SSD 
> mode is detected, or that it _does_ seem to get enabled, when 
> specifically included in the mount options, even on SSDs?
It does seem to get enabled when specifically included in the mount
options (the ssd mount option is not used), i.e.:
/dev/mapper/system      /       btrfs   subvol=/root,defaults,noatime,autodefrag        0       1
leads to:
[    5.294205] BTRFS: device label foo devid 1 transid 13 /dev/disk/by-label/foo
[    5.295957] BTRFS info (device sdb3): disk space caching is enabled
[    5.296034] BTRFS: has skinny extents
[   67.082702] BTRFS: device label system devid 1 transid 60710 /dev/mapper/system
[   67.111185] BTRFS info (device dm-0): disk space caching is enabled
[   67.111267] BTRFS: has skinny extents
[   67.305084] BTRFS: detected SSD devices, enabling SSD mode
[   68.562084] BTRFS info (device dm-0): enabling auto defrag
[   68.562150] BTRFS info (device dm-0): disk space caching is enabled



> Or did you actually mean it the way you wrote it, that it seems to be
> enabled (implying automatically, along with ssd), when ssd mode is 
> detected?
No, sorry for being unclear.
I meant it the way that having the ssd detected doesn't auto-disable
autodefrag, which I thought might make sense, given that I didn't know
exactly what it would do on SSDs...
IIRC, Hugo or Austin mentioned the point about making for better IOPS,
but I hadn't considered that to have enough impact... so I thought
it could have made sense to ignore the "autodefrag" mount option in
case an ssd was detected.



> There are three factors I'm aware of here as well, all favoring 
> autodefrag, just as the two above favored leaving it off.
> 
> 1) IOPS, Input/Output Operations Per Second.  SSDs typically have
> both an 
> IOPS and a throughput rating.  And unlike spinning rust, where raw
> non-
> sequential-write IOPS are generally bottlenecked by seek times, on
> SSDs 
> with their zero seek-times, IOPS can actually be the bottleneck.
Hmm, it would be really nice if someone found a way to do some sound
analysis/benchmarking of that.


> 2) SSD physical write and erase block sizes as multiples of the
> logical/
> read block size.  To the extent that extent sizes are multiples of
> the 
> write and/or erase-block size, writing larger extents will reduce
> write 
> amplification due to writing and blocks smaller than the write or
> erase 
> block size.
Hmm... okay, I don't know the details of how btrfs does this, but I'd
have expected that all extents are aligned to the underlying physical
devices' block structure.
Thus each extent should start at such a write/erase block, and at most it
may not end perfectly aligned at the end of the extent.
If the file is fragmented (i.e. more than one extent), I'd have even
hoped that all but the last one fit perfectly.

So what you basically mean, AFAIU, is that by having autodefrag you
get larger extents (i.e. smaller ones collapsed into one), and thus
you get less cut off at the ends of extents where these don't exactly
match the underlying write/erase blocks?

I still don't see the advantage here,... neighbouring extents would
hopefully still be aligned,... and it doesn't seem that one saves write
cycles, but rather incurs more due to the defrag.


> While the initial autodefrag rewrite is a second-cycle write after a 
> fragmented write, spending a write cycle for the autodefrag,
> consistent 
> use of autodefrag should help keep file fragmentation and thus
> ultimately 
> space fragmentation to a minimum, so initial writes, where there's
> enough 
> data to write an initially large extent, won't be forced to be broken
> into smaller extents because there's simply no large free-space
> extents 
> left due to space fragmentation.
> IOW, autodefrag used consistently should reduce space fragmentation
> as 
> well as file fragmentation, and this reduced space fragmentation will
> lead to the possibility of writing larger extents initially, where
> the 
> amount of data to be written allows it, thereby reducing initial file
> write fragmentation and the need for autodefrag as a result.
Okay... but AFAIU, that's more like the effects described in (1) and
has less to do with erase/write block sizes...


> 3) Btrfs metadata management overhead.  While btrfs tracks things
> like 
> checksums at fixed sizes,
btw: over which amounts of data is each checksum calculated?


>  other metadata is per extent.  Obviously, the 
> more extents a file has, the harder btrfs has to work to track them
> all.  
> Maintenance tasks such as balance and check already have scaling
> issues; 
> do we really want to make them worse by forcing them to track
> thousands 
> or tens of thousands of extents per (large) file where they could be 
> tracking a dozen or two?
Okay, but these effects are IMHO also more similar to (1),... I'd
probably call them "meta-data compaction" or so...


> On balance, I was persuaded to use autodefrag on my own btrfs' on
> SSDs, 
> and believe the near-term write-cycle damage may in fact be largely 
> counteracted by indirect free-space defrag effect and the effect that
> in 
> turn has on the ability to even find large areas of cohesive free
> space 
> to write into in the first place.  With that largely counteracted,
> the 
> other benefits in my mind again outweigh the negatives, so autodefrag
> continues to be worth it in general, even on SSDs.
Intuitively, I'd tend to agree... even though I either didn't fully
understand your (2), or I'd count it and (3) rather under (1).

Would be interesting to see a actual analysis with measurements from
one of the filesystem/block device geeks.



> I suppose someone will eventually do that sort of testing, but of
> course 
> even if they did it now, with btrfs code still to be optimized and 
> various scaling work still to be done, it's anyone's guess if the
> test 
> results would still apply a few years down the road, after that
> scaling 
> and optimization work.
Sure... :-)

I guess I'll leave it on, and when, in 5-10 years after btrfs has been
stabilised and optimised, someone comes up with rock-solid data proving
it, I can claim that I always knew it...
And if the data disproves it, I can claim it's all Duncan's fault, who
lured me into this
;^-P

Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5313 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
  2015-12-16 21:59           ` Christoph Anton Mitterer
@ 2015-12-17  4:06             ` Duncan
  2015-12-18  0:21               ` Christoph Anton Mitterer
  2015-12-17  4:35             ` Duncan
                               ` (4 subsequent siblings)
  5 siblings, 1 reply; 48+ messages in thread
From: Duncan @ 2015-12-17  4:06 UTC (permalink / raw)
  To: linux-btrfs

Christoph Anton Mitterer posted on Wed, 16 Dec 2015 22:59:01 +0100 as
excerpted:

> On Wed, 2015-12-09 at 16:36 +0000, Duncan wrote:
>> But... as I've pointed out in other replies, in many cases including
>> this specific one (bittorrent), applications have already had to
>> develop their own integrity management features

> Well, let's move the discussion on that into the "dear developers, can we
> have notdatacow + checksumming, plz?" thread, where I showed in one of the
> more recent threads that bittorrent seems to be about the only thing which
> uses that by default... while on the VM image front, nothing seems
> to support it, and on the DB front, some support it, but don't use it
> by default.
> 
>> In the bittorrent case specifically, torrent chunks are already
>> checksummed, and if they don't verify upon download, the chunk is
>> thrown away and redownloaded.

> I'm not a bittorrent expert, because I don't use it, but that sounds
> more like the edonkey model, where - while there are checksums -
> these are only used until the download completes. Then you have the
> complete file, any checksum info thrown away, and the file again being
> "at risk" (i.e. not checksum protected).

[I'm breaking this into smaller replies again.]

Just to mention here, that I said "integrity management features", which 
includes more than checksumming.  As Austin Hemmelgarn has been pointing 
out, DBs and some VMs do COW, some DBs do checksumming or at least have 
that option, and both VMs and DBs generally do at least some level of 
consistency checking as they load.  Those are all "integrity management 
features" at some level.

As for bittorrent, I /think/ the checksums are in the torrent files 
themselves (and if I'm not mistaken, much as git, the chunks within the 
file are actually IDed by checksum, not specific position, so as long as 
the torrent is active, uploading or downloading, these will by definition 
be retained).  As long as those are retained, the checksums should be 
retained.  And ideally, people will continue to torrent the files long 
after they've finished downloading them, in which case they'll still need 
the torrent files themselves, along with the checksums info.

And for longer term storage, people really should be copying/moving their 
torrented files elsewhere, in such a way that they either eliminate the 
fragmentation if the files weren't nocowed, or eliminate the nocow 
attribute and get them checksum-protected as normal for files not 
intended to be constantly randomly rewritten, which will be the case once 
they're no longer being actively downloaded.  Of course that's at the 
slightly technically oriented user level, but then, the whole nocow 
thing, or even caring about checksums and longer term file integrity in 
the first place, is also technically oriented user level.  Normal users 
will just download without worrying about the nocow in the first place, 
and perhaps wonder why the disk is thrashing so, but not be inclined to 
do anything about it except perhaps switch back to their old filesystem, 
where it was faster and the disk didn't sound as bad.  In doing so, 
they'll either automatically get the checksumming along with the worse 
performance, or go back to a filesystem without the checksumming, and 
think it's fine as they know no different.

Meanwhile, if they do it correctly there's no window without protection, 
as the torrent file can be used to double-verify the file once moved, as 
well, before deleting it.
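One way that copy-and-reverify step could look (paths made up; a plain
copy rather than a reflink forces a fresh, checksummed write into the
non-nocow destination):

  cp --reflink=never ~/torrents/big.iso /archive/big.iso
  # re-verify /archive/big.iso against the torrent's piece hashes (or a
  # checksum taken earlier) before deleting the nocow original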

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
  2015-12-16 21:59           ` Christoph Anton Mitterer
  2015-12-17  4:06             ` Duncan
@ 2015-12-17  4:35             ` Duncan
  2015-12-17  5:07             ` Duncan
                               ` (3 subsequent siblings)
  5 siblings, 0 replies; 48+ messages in thread
From: Duncan @ 2015-12-17  4:35 UTC (permalink / raw)
  To: linux-btrfs

Christoph Anton Mitterer posted on Wed, 16 Dec 2015 22:59:01 +0100 as
excerpted:

>> And there very well might be such a tool... five or ten years down the
>> road when btrfs is much more mature and generally stabilized, well
>> beyond the "still maturing and stabilizing" status of the moment.

> Hmm let's hope btrfs isn't finished only when the next-gen default fs
> arrives ;^)

[Again, breaking into smaller point replies...]

Well, given the development history for both zfs and btrfs to date, five 
to ten years down the line, with yet another even newer filesystem then 
already under development, is more being "real", than not.  Also see the 
history in MS' attempt at a next-gen filesystem.  The reality is these 
things take FAR longer than one might think.

FWIW, on the wiki I see feature points and benchmarks for v0.14, 
introduced in April of 2008, and a link to an earlier btree filesystem on 
which btrfs apparently was based, dating to 2006, so while I don't have a 
precise beginning date, and to some extent such a thing would be rather 
arbitrary anyway, as Chris would certainly have done some major thinking, 
preliminary research and coding, before his first announcement, a project 
origin in late 2006 or sometime in 2007 has to be quite close.

And (as I noted in a parenthetical at my discovery in a different 
thread), I switched to btrfs for my main filesystems when I bought my 
first SSDs, in June of 2013, so already a quarter decade ago.  At the 
time btrfs was just starting to remove some of the more dire 
"experimental" warnings.  Obviously it has stabilized quite a bit since 
then, but due to the oft-quoted 80/20 rule and extensions, where the last 
20% of the progress takes 80% of the work, etc...

It could well be another five years before btrfs is at a point I think 
most here would call stable.  That would be 2020 or so, about 13 years 
for the project, and if you look at the similar projects mentioned above, 
that really isn't unrealistic at all.  Ten years minimum, and that's with 
serious corporate level commitments and a lot more dedicated devs than 
btrfs has.  12 years not unusual at all, and a decade and a half still 
well within reasonable range, for a filesystem with this level of 
complexity, scope, and features.

And realistically, by that time, yet another successor filesystem may 
indeed be in the early stages of development, say at the 20/80 point, 20% 
of required effort invested, possibly 80% of the features done, but not 
stabilized.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
  2015-12-16 21:59           ` Christoph Anton Mitterer
  2015-12-17  4:06             ` Duncan
  2015-12-17  4:35             ` Duncan
@ 2015-12-17  5:07             ` Duncan
  2015-12-17  5:12             ` Duncan
                               ` (2 subsequent siblings)
  5 siblings, 0 replies; 48+ messages in thread
From: Duncan @ 2015-12-17  5:07 UTC (permalink / raw)
  To: linux-btrfs

Christoph Anton Mitterer posted on Wed, 16 Dec 2015 22:59:01 +0100 as
excerpted:

> I'm kinda curious what free space fragmentation actually means here.
> 
> Is it simply like this:
> +----------+-----+---+--------+
> |     F    |  D  | F |    D   |
> +----------+-----+---+--------+
> Where D is data (i.e. files/metadata) and F is free space.
> In other words, (F)ree space itself is not further subdivided and only
> fragmented by the (D)ata extents in between.
> 
> Or is it more complex like this:
> +-----+----+-----+---+--------+
> |  F  |  F |  D  | F |    D   |
> +-----+----+-----+---+--------+
> Where the (F)ree space itself is subdivided into "extents" (not
> necessarily of the same size), and btrfs couldn't use e.g. the first two
> F's as one contiguous amount of free space for a larger (D)ata extent

[still breaking into smaller points for reply]

At the one level, I had the simpler f/d/f/d scheme in mind, but that 
would be the case inside a single data chunk.  At the higher file level, 
with files significant fractions of the size of a single data chunk to 
much larger than a single data chunk, the more complex and second
f/f/d/f/d case would apply, with the chunk boundary as the separation 
between the f/f.

IOW, files larger than data chunk size will always be fragmented into 
data chunk size fragments/extents, at the largest, because chunks are 
designed to be movable using balance, device remove, replace, etc.

So (using the size numbers from a recent comment from Qu in a different 
thread), on a filesystem with under 100 GiB total space-effective (space-
effective, space available, accounting for the replication type, raid1, 
etc, and I'm simplifying here...), data chunks should be 1 GiB, while 
above that, with striping, they might be upto 10 GiB.

Using the 1 GiB nominal figure, files over 1 GiB would always be broken 
into 1 GiB maximum size extents, corresponding to 1 extent per chunk.

But while 4 KiB extents are clearly tiny and inefficient at today's 
scale, in practice, efficiency gains break down at well under GiB scale, 
with AFAIK 128 MiB being the upper bound at which any efficiency gains 
could really be expected, and 1 MiB arguably being a reasonable point at 
which further increases in extent size likely won't have a whole lot of 
effect even on SSD erase-block (where 1 MiB is a nominal max), but that's
still 256X the usual 4 KiB minimum data block size, 8X the 128 KiB 
btrfs compression-block size, and 4X the 256 KiB defrag default "don't 
bother with extents larger than this" size.

Basically, the 256 KiB btrfs defrag "don't bother with anything larger 
than this" default is quite reasonable, tho for massive multi-gig VM 
images, the number of 256 KiB fragments will still look pretty big, so 
while technically a very reasonable choice, the "eye appeal" still isn't 
that great.
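For what it's worth, that threshold can be raised per defrag run via the
target extent size option, if the extent count on big VM images bothers
you; path and size here are just examples:

  btrfs filesystem defragment -t 32M /path/to/vm.img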

But based on real reports posting before and after numbers from filefrag 
(on uncompressed btrfs), we do have cases where defrag can't find 256 KiB 
free-space blocks and thus can actually fragment a file worse than it was 
before, so free-space fragmentation is indeed a very real problem.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
  2015-12-16 21:59           ` Christoph Anton Mitterer
                               ` (2 preceding siblings ...)
  2015-12-17  5:07             ` Duncan
@ 2015-12-17  5:12             ` Duncan
  2015-12-17  6:00             ` Duncan
  2015-12-17  6:01             ` Duncan
  5 siblings, 0 replies; 48+ messages in thread
From: Duncan @ 2015-12-17  5:12 UTC (permalink / raw)
  To: linux-btrfs

Christoph Anton Mitterer posted on Wed, 16 Dec 2015 22:59:01 +0100 as
excerpted:

>> he obviously didn't think thru the fact that compression MUST be a
>> rewrite, thereby breaking snapshot reflinks, even were normal
>> non-compression defrag to be snapshot aware, because compression
>> substantially changes the way the file is stored), that's _implied_,
>> not explicit.
> So you mean, even if ref-link aware defrag would return, it would still
> break them again when compressing/uncompressing/recompressing?
> I'd have hoped that then, all snapshots respectively other reflinks
> would simply also change to being compressed,

You're correct.  I "obviously didn't think thru" that the whole way, 
myself. =:^(

But meanwhile, we don't have snapshot-aware-defrag, and in that case, the 
implication... and his result... remains.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
  2015-12-16 21:59           ` Christoph Anton Mitterer
                               ` (3 preceding siblings ...)
  2015-12-17  5:12             ` Duncan
@ 2015-12-17  6:00             ` Duncan
  2015-12-17  6:01             ` Duncan
  5 siblings, 0 replies; 48+ messages in thread
From: Duncan @ 2015-12-17  6:00 UTC (permalink / raw)
  To: linux-btrfs

Christoph Anton Mitterer posted on Wed, 16 Dec 2015 22:59:01 +0100 as
excerpted:

> I'm a bit unsure how to read filefrag's output... (even in the
> uncompressed case).
> What would it show me if there was fragmentation?

/path/to/file:	18 extents found

It tells you the number of extents found.  Nominally, each extent should 
be a fragment, but as has been discussed elsewhere, on btrfs compressed 
files it will interpret each 128 KiB btrfs compression block as its own 
extent, even if (as seen in verbose mode) the next one begins where the 
previous one ends so it's really just a single extent.

Apparently on ext3/4, it's possible to have multi-gig files as a single 
extent, thus unfragmented, but as explained in an earlier reply to a 
point earlier in your post, on btrfs, extents of a GiB are nominally the 
best you can do as that's the nominal data chunk size, tho in limited 
circumstances larger extents are still possible on btrfs.

In the case above, where I took the 18 extents result from a real file 
(tho obviously the posted path isn't real), it was 4 MiB in size (I think 
exactly, it's a 4 MiB BIOS image =:^), so doing the math, extents average 
227 KiB.  That's on a filesystem that is always mounted with autodefrag, 
but it's also always mounted with compress, so it's possible some of the 
reported extents are compressed.

Actually, looking at filefrag -v output (which I've never used before but 
which someone noted could be used to check fragmentation on compressed 
files, tho it's not as straightforward as you might think), it looks like 
all but two of the listed extents are 32 blocks long (with 4096 byte 
blocks), which equates to 128 KiB, the btrfs compression-block size, and 
the two remaining extents are 224 blocks long or 896 KiB, an exact 7 
multiple of 128 KiB, so this file would indeed appear to be compressed 
except for those two uncompressed extents.  (As for figuring out how to 
interpret the full -v output to know whether the compressed blocks are 
actually single extents or not, as I said this is my first time trying
-v, and I didn't bother going that far with it.)
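A rough sketch of the "potentially via script" idea from earlier: treat
extents as contiguous when the physical start of one is exactly the
previous physical end + 1.  The field positions assume the -v layout of
reasonably recent e2fsprogs, so take this with a grain of salt:

  filefrag -v file | awk -F'[[:space:].:]+' '
    /^ *[0-9]+:/ {                              # per-extent lines only
      if (seen && $5 != prev_end + 1) gaps++    # $5 = physical start, $6 = physical end
      prev_end = $6; seen = 1
    }
    END { print gaps+0, "physical gap(s) between listed extents" }'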

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
  2015-12-16 21:59           ` Christoph Anton Mitterer
                               ` (4 preceding siblings ...)
  2015-12-17  6:00             ` Duncan
@ 2015-12-17  6:01             ` Duncan
  5 siblings, 0 replies; 48+ messages in thread
From: Duncan @ 2015-12-17  6:01 UTC (permalink / raw)
  To: linux-btrfs

Christoph Anton Mitterer posted on Wed, 16 Dec 2015 22:59:01 +0100 as
excerpted:

>> It's certainly in quite a few on-list posts over the years

> okay,.. in other words: no ;-)
> scatter over the years list posts don't count as documentation :P

=:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
  2015-12-17  4:06             ` Duncan
@ 2015-12-18  0:21               ` Christoph Anton Mitterer
  0 siblings, 0 replies; 48+ messages in thread
From: Christoph Anton Mitterer @ 2015-12-18  0:21 UTC (permalink / raw)
  To: Duncan, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 7920 bytes --]

[I'm combining the messages again, since I feel a bit bad, when I write
so many mails to the list ;) ]
But from my side, feel free to split up as much as you want (perhaps
not single characters or so ;) )


On Thu, 2015-12-17 at 04:06 +0000, Duncan wrote:
> Just to mention here, that I said "integrity management features",
> which 
> includes more than checksumming.  As Austin Hemmelgarn has been
> pointing 
> out, DBs and some VMs do COW, some DBs do checksumming or at least
> have 
> that option, and both VMs and DBs generally do at least some level
> of 
> consistency checking as they load.  Those are all "integrity
> management 
> features" at some level.
Okay... well, but the point of that whole thread was obviously data
integrity protection in the sense of what data checksumming does in
btrfs for CoWed data and for meta-data.
In other words: checksums at some block level, which are verified upon
every read.



> As for bittorrent, I /think/ the checksums are in the torrent files 
> themselves (and if I'm not mistaken, much as git, the chunks within
> the 
> file are actually IDed by checksum, not specific position, so as long
> as 
> the torrent is active, uploading or downloading, these will by
> definition 
> be retained).  As long as those are retained, the checksums should
> be 
> retained.  And ideally, people will continue to torrent the files
> long 
> after they've finished downloading them, in which case they'll still
> need 
> the torrent files themselves, along with the checksums info.
Well, I guess we don't need to hang ourselves up so much on the p2p
formats.
They're just one example; even if these were actually integrity-
protected in the sense described above - well, fine - but there are
other major use cases left for which this is not the case.

Of course one can also always argue that users can then manually move
the files out of the no-CoWed area, or manually create their own
checksums as I do and store them in XATTRs.
But all this is not real, proper, full checksum protection: there are
gaps where things are not protected, and normal users may simply not
do/know all this (and why shouldn't they still benefit from proper
checksumming if we can provide it for them).
IMHO, even the argument that one could manually make checksums or move
the file to a CoWed area while the e.g. downloaded files are still in
cache doesn't count: that wouldn't work for VMs, DBs, and certainly not
for torrent files larger than the memory.
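The manual-XATTR approach mentioned above can be as simple as something
like the following (attribute name and file name are just examples, and
of course it only catches corruption when one actually re-verifies):

  setfattr -n user.sha256 -v "$(sha256sum -b file | cut -d' ' -f1)" file
  # later:
  test "$(getfattr --only-values -n user.sha256 file)" = \
       "$(sha256sum -b file | cut -d' ' -f1)" && echo OK || echo MISMATCH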


> Meanwhile, if they do it correctly there's no window without
> protection, 
> as the torrent file can be used to double-verify the file once moved,
> as 
> well, before deleting it.
Again, that would work only for torrent-like files, not for VM images, and
only partially for DBs... plus... why require users to do it manually,
if the fs could take care of it.







On Thu, 2015-12-17 at 05:07 +0000, Duncan wrote:
> > I'm kinda curious what free space fragmentation actually means here.
> > 
> > Is it simply like this:
> > +----------+-----+---+--------+
> > >     F    |  D  | F |    D   |
> > +----------+-----+---+--------+
> > Where D is data (i.e. files/metadata) and F is free space.
> > In other words, (F)ree space itself is not further subdivided and
> > only
> > fragmented by the (D)ata extents in between.
> > 
> > Or is it more complex like this:
> > +-----+----+-----+---+--------+
> > >  F  |  F |  D  | F |    D   |
> > +-----+----+-----+---+--------+
> > Where the (F)ree space itself is subdivided into "extents" (not
> > necessarily of the same size), and btrfs couldn't use e.g. the
> > first two
> > F's as one contiguous amount of free space for a larger (D)ata
> > extent
> At the one level, I had the simpler f/d/f/d scheme in mind, but that 
> would be the case inside a single data chunk.  At the higher file
> level, 
> with files significant fractions of the size of a single data chunk
> to 
> much larger than a single data chunk, the more complex and second
> f/f/d/f/d case would apply, with the chunk boundary as the
> separation 
> between the f/f.
Okay, but that's only when there are data chunks that neighbour each
other... since the data chunks are rather big normally (1GB) that
shouldn't be such a big issue,... so I guess the real world looks like
this:
 DC#1                  DC#2
...----+---------------------------------...
...---+|+----------+-----+---+--------+
... F |||     F    |  D  | F |    D   |
...---+|+----------+-----+---+--------+
...----++---------------------------------...
(with DC = data chunk)

but it could NOT look like this:
 DC#1                  DC#2
...----+---------------------------------...
...---+|+-----+----+-----+---+--------+
... F |||  F  |  F |  D  | F |    D   |
...---+|+-----+----+-----+---+--------+
...----++---------------------------------...
in other words, there could be 2 adjacent free space "extents" when
these are actually parts of different neighbouring chunks, but there
could NOT be >=2 adjacent free space "extents" as part of the same data
chunk.
Right?


 
> IOW, files larger than data chunk size will always be fragmented
> into 
> data chunk size fragments/extents, at the largest, because chunks
> are 
> designed to be movable using balance, device remove, replace, etc.
IOW, filefrag doesn't really show me directly whether a file is
fragmented or not (at least not when the file is > chunk size)...
There should be a better tool for that from the btrfs side :)

And one more (I think I found parts of the answer already below):
Does defrag only try to defrag within chunks, or would it also try to
align data chunks that "belong" together next to each other - or, better
said, would it try to place extents belonging together in neighbouring
extents?
Or is it basically not really foreseen in btrfs that files > chunk
size are really fully consecutive on disk?
Similarly perhaps, whether freshly allocating a file larger than chunk
size would try to choose the (already existing) chunks and allocate new
chunks so that its extents are contiguous even at chunk borders?

I think if files > chunk size were always fragmented at the chunk
level, this may show up as a problematic edge case:
If a file is heavily accessed at regions that are at the chunk borders,
one would always have seeks (on HDDs) when the next chunk is actually
needed... and one could never defrag it fully, or at least any balance
could "destroy" it again.
I guess nodatacow'ed areas also use the 1GB chunk size, right?


> Using the 1 GiB nominal figure, files over 1 GiB would always be
> broken 
> into 1 GiB maximum size extents, corresponding to 1 extent per chunk.
I see...

 
> But based on real reports posting before and after numbers from
> filefrag 
> (on uncompressed btrfs), we do have cases where defrag can't find 256
> KiB 
> free-space blocks and thus can actually fragment a file worse than it
> was 
> before, so free-space fragmentation is indeed a very real problem.
btw: That's IMHO quite strange... or rather: I'd have thought that
checking whether an extent gets even more fragmented than before
would have been rather trivial...







On Thu, 2015-12-17 at 06:00 +0000, Duncan wrote:
> but as has been discussed elsewhere, on btrfs
> compressed 
> files it will interpret each 128 KiB btrfs compression block as its
> own 
> extent, even if (as seen in verbose mode) the next one begins where
> the 
> previous one ends so it's really just a single extent.
Hmm I took the opportunity and reported that as a wishlist bug
upstream:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=808265


Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5313 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2015-12-18  0:21 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-23  1:43 btrfs: poor performance on deleting many large files Mitch Fossen
2015-11-23  6:29 ` Duncan
2015-11-25 21:49   ` Mitchell Fossen
2015-11-26 16:52     ` Duncan
2015-11-26 18:25       ` Christoph Anton Mitterer
2015-11-26 23:29         ` Duncan
2015-11-27  0:06           ` Christoph Anton Mitterer
2015-11-27  3:38             ` Duncan
2015-11-28  3:57               ` Christoph Anton Mitterer
2015-11-28  6:49                 ` Duncan
2015-12-12 22:15                   ` Christoph Anton Mitterer
2015-12-13  7:10                     ` Duncan
2015-12-16 22:14                       ` Christoph Anton Mitterer
2015-12-14 14:24                     ` Austin S. Hemmelgarn
2015-12-14 19:39                       ` Christoph Anton Mitterer
2015-12-14 20:27                         ` Austin S. Hemmelgarn
2015-12-14 21:30                           ` Lionel Bouton
2015-12-14 23:25                             ` Christoph Anton Mitterer
2015-12-15  1:49                               ` Duncan
2015-12-15  2:38                                 ` Lionel Bouton
2015-12-16  8:10                                   ` Duncan
2015-12-14 23:10                           ` Christoph Anton Mitterer
2015-12-14 23:16                           ` project idea: per-object default mount-options / more btrfs-properties / chattr attributes (was: btrfs: poor performance on deleting many large files) Christoph Anton Mitterer
2015-12-15  2:08                           ` btrfs: poor performance on deleting many large files Duncan
2015-12-15  4:05                       ` Chris Murphy
2015-11-27  1:49     ` Qu Wenruo
2015-11-23 12:59 ` Austin S Hemmelgarn
2015-11-26  0:23   ` [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?) Christoph Anton Mitterer
2015-11-26  0:33     ` Hugo Mills
2015-12-09  5:43       ` Christoph Anton Mitterer
2015-12-09 13:36         ` Duncan
2015-12-14  2:46           ` Christoph Anton Mitterer
2015-12-14 11:19             ` Duncan
2015-12-16 23:39           ` Kai Krakow
2015-12-14  1:44       ` Christoph Anton Mitterer
2015-12-14 10:51         ` Duncan
2015-12-16 23:55           ` Christoph Anton Mitterer
2015-11-26 23:08     ` Duncan
2015-12-09  5:45       ` Christoph Anton Mitterer
2015-12-09 16:36         ` Duncan
2015-12-16 21:59           ` Christoph Anton Mitterer
2015-12-17  4:06             ` Duncan
2015-12-18  0:21               ` Christoph Anton Mitterer
2015-12-17  4:35             ` Duncan
2015-12-17  5:07             ` Duncan
2015-12-17  5:12             ` Duncan
2015-12-17  6:00             ` Duncan
2015-12-17  6:01             ` Duncan
