* btrfs filesystem defragment -r -- does it affect subvolumes?
@ 2017-08-31  7:05 Ulli Horlacher
  2017-09-12 16:28 ` defragmenting best practice? Ulli Horlacher
  0 siblings, 1 reply; 56+ messages in thread
From: Ulli Horlacher @ 2017-08-31  7:05 UTC (permalink / raw)
  To: linux-btrfs

When I do a 
btrfs filesystem defragment -r /directory
does it really defragment all files in this directory tree, even if it
contains subvolumes?
The man page does not mention subvolumes on this topic.

I have an older script (written by myself) which does a 
"btrfs filesystem defragment -r" on all subvolumes recursively:

  # $m is the btrfs mount point; defragment it, then every writable subvolume below it
  btrfs filesystem defragment -r "$m"
  for s in $(btrfs subvolume list "$m" | awk '{ print $NF }'); do
    [[ "$s" =~ ^@ ]] && continue                                            # skip subvolumes whose name starts with @
    [[ $(btrfs subvolume show "$m/$s") =~ Flags:.*readonly ]] && continue   # skip read-only snapshots
    btrfs filesystem defragment -r "$m/$s"
  done

I wonder why I have done it that way :-}


-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK         
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<20170831070558.GB5783@rus.uni-stuttgart.de>


* defragmenting best practice?
  2017-08-31  7:05 btrfs filesystem defragment -r -- does it affect subvolumes? Ulli Horlacher
@ 2017-09-12 16:28 ` Ulli Horlacher
  2017-09-12 17:27   ` Austin S. Hemmelgarn
  2017-09-14 11:38   ` Kai Krakow
  0 siblings, 2 replies; 56+ messages in thread
From: Ulli Horlacher @ 2017-09-12 16:28 UTC (permalink / raw)
  To: linux-btrfs

On Thu 2017-08-31 (09:05), Ulli Horlacher wrote:
> When I do a 
> btrfs filesystem defragment -r /directory
> does it really defragment all files in this directory tree, even if it
> contains subvolumes?
> The man page does not mention subvolumes on this topic.

No answer so far :-(

But I found another problem in the man-page:

  Defragmenting with Linux kernel versions < 3.9 or >= 3.14-rc2 as well as
  with Linux stable kernel versions >= 3.10.31, >= 3.12.12 or >= 3.13.4
  will break up the ref-links of COW data (for example files copied with
  cp --reflink, snapshots or de-duplicated data). This may cause
  considerable increase of space usage depending on the broken up
  ref-links.

I am running Ubuntu 16.04 with Linux kernel 4.10 and I have several
snapshots.
Therefore, should I avoid calling "btrfs filesystem defragment -r"?

What is the defragmenting best practice?
Avoid it completely?



-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK         
Universitaet Stuttgart         E-Mail: horlacher@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<20170831070558.GB5783@rus.uni-stuttgart.de>


* Re: defragmenting best practice?
  2017-09-12 16:28 ` defragmenting best practice? Ulli Horlacher
@ 2017-09-12 17:27   ` Austin S. Hemmelgarn
  2017-09-14  7:54     ` Duncan
  2017-09-14 11:38   ` Kai Krakow
  1 sibling, 1 reply; 56+ messages in thread
From: Austin S. Hemmelgarn @ 2017-09-12 17:27 UTC (permalink / raw)
  To: linux-btrfs

On 2017-09-12 12:28, Ulli Horlacher wrote:
> On Thu 2017-08-31 (09:05), Ulli Horlacher wrote:
>> When I do a
>> btrfs filesystem defragment -r /directory
>> does it really defragment all files in this directory tree, even if it
>> contains subvolumes?
>> The man page does not mention subvolumes on this topic.
> 
> No answer so far :-(
I hadn't seen your original mail, otherwise I probably would have 
responded.  Sorry about that.

On the note of the original question:
I'm pretty sure that it does recursively operate on nested subvolumes. 
The documentation doesn't say otherwise, and not doing so would be 
non-intuitive to people who don't know anything about subvolumes.
> 
> But I found another problem in the man-page:
> 
>    Defragmenting with Linux kernel versions < 3.9 or >= 3.14-rc2 as well as
>    with Linux stable kernel versions >= 3.10.31, >= 3.12.12 or >= 3.13.4
>    will break up the ref-links of COW data (for example files copied with
>    cp --reflink, snapshots or de-duplicated data). This may cause
>    considerable increase of space usage depending on the broken up
>    ref-links.
> 
> I am running Ubuntu 16.04 with Linux kernel 4.10 and I have several
> snapshots.
> Therefore, should I avoid calling "btrfs filesystem defragment -r"?
> 
> What is the defragmenting best practice?
That really depends on what you're doing.

First, you need to understand that defrag won't break _all_ reflinks, 
just the particular instances you point it at.  So, if you have 
subvolume A, and snapshots S1 and S2 of that subvolume A, then running 
defrag on _just_ subvolume A will break the reflinks between it and the 
snapshots, but S1 and S2 will still share any data they originally shared 
with each other.  If you then take a third snapshot of A, it will share 
data with A, but not with S1 or S2 (because A is no longer sharing data 
with S1 or S2).
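
For example, you can watch this happen on a scratch btrfs mount with a
reasonably large file (the paths here are placeholders):

  cp --reflink=always /mnt/A/big.img /mnt/A/copy.img   # copy shares all extents with the original
  btrfs filesystem du -s /mnt/A                        # the shared column counts the reflinked data
  btrfs filesystem defragment /mnt/A/copy.img          # rewrites copy.img, breaking its reflinks
  btrfs filesystem du -s /mnt/A                        # shared shrinks, exclusive (real disk usage) grows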

Given this behavior, you have in turn three potential cases when talking 
about persistent snapshots:

1. You care about minimizing space used, but aren't as worried about 
performance.  In this case, the only option is to not run defrag at all.
2. You care about performance, but not space usage.  In this case, 
defragment everything.
3. You care about both space usage and performance.  In this case, I 
would personally suggest defragmenting only the source subvolume (so 
only subvolume A in the above explanation), and doing so on a schedule 
that coincides with snapshot rotation.  The idea is to defrag just 
before you take a snapshot, and at a frequency that gives a good balance 
between space usage and performance.  As a general rule, if you take 
this route, start by doing the defrag on either a monthly basis if 
you're doing daily or weekly snapshots, or with every fourth snapshot if 
not, and then adjust the interval based on how that impacts your space 
usage (see the sketch below).
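
To make case 3 concrete, a rough sketch (the subvolume and snapshot
paths are made up, adjust them to your layout):

  # defragment the writable source subvolume, then snapshot it right away,
  # so only one generation of reflinks is broken per rotation
  SRC=/mnt/A
  SNAPDIR=/mnt/snapshots
  btrfs filesystem defragment -r "$SRC"
  btrfs subvolume snapshot -r "$SRC" "$SNAPDIR/A-$(date +%Y%m%d)"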

Additionally, you can compact free space without defragmenting data or 
breaking reflinks by running a full balance on the filesystem.
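
For example (the filtered variant is an optional, cheaper alternative to
a full balance, and the 50% threshold is arbitrary):

  btrfs balance start /mnt                        # full balance, rewrites every chunk
  btrfs balance start -dusage=50 -musage=50 /mnt  # only repack chunks that are at most 50% used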

The tricky part though is that differing workloads are impacted 
differently by fragmentation.  Using just four generic examples:

* Mostly sequential write focused workloads (like security recording 
systems) tend to be impacted by free space fragmentation more than data 
fragmentation.  Balancing filesystems used for such workloads is likely 
to give a noticeable improvement, but defragmenting probably won't give 
much.
* Mostly sequential read focused workloads (like a streaming media 
server) tend to be the most impacted by data fragmentation, but aren't 
generally impacted by free space fragmentation.  As a result, defrag 
will help here a lot, but balance won't as much.
* Mostly random write focused workloads (like most database systems or 
virtual machines) are often impacted by both free space and data 
fragmentation, and are a pathological case for CoW filesystems.  Balance 
and defrag will help here, but they won't help for long.
* Mostly random read focused workloads (like most non-multimedia desktop 
usage) are not impacted much by either aspect, but if you're on a 
traditional hard drive they can be impacted significantly by how the 
data is spread across the disk.  Balance can help here, but only because 
it improves data locality, not because it compacts free space.


* Re: defragmenting best practice?
  2017-09-12 17:27   ` Austin S. Hemmelgarn
@ 2017-09-14  7:54     ` Duncan
  2017-09-14 12:28       ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 56+ messages in thread
From: Duncan @ 2017-09-14  7:54 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Tue, 12 Sep 2017 13:27:00 -0400 as
excerpted:

> The tricky part though is that differing workloads are impacted
> differently by fragmentation.  Using just four generic examples:
> 
> * Mostly sequential write focused workloads (like security recording
> systems) tend to be impacted by free space fragmentation more than data
> fragmentation.  Balancing filesystems used for such workloads is likely
> to give a noticeable improvement, but defragmenting probably won't give
> much.
> * Mostly sequential read focused workloads (like a streaming media
> server)
> tend to be the most impacted by data fragmentation, but aren't generally
> impacted by free space fragmentation.  As a result, defrag will help
> here a lot, but balance won't as much.
> * Mostly random write focused workloads (like most database systems or
> virtual machines) are often impacted by both free space and data
> fragmentation, and are a pathological case for CoW filesystems.  Balance
> and defrag will help here, but they won't help for long.
> * Mostly random read focused workloads (like most non-multimedia desktop
> usage) are not impacted much by either aspect, but if you're on a
> traditional hard drive they can be impacted significantly by how the
> data is spread across the disk.  Balance can help here, but only because
> it improves data locality, not because it compacts free space.

This is a very useful analysis, particularly given the examples.  Maybe 
put it on the wiki under the defrag discussion?  (Assuming something like 
it isn't already there.  I've not looked in awhile.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: defragmenting best practice?
  2017-09-12 16:28 ` defragmenting best practice? Ulli Horlacher
  2017-09-12 17:27   ` Austin S. Hemmelgarn
@ 2017-09-14 11:38   ` Kai Krakow
  2017-09-14 13:31     ` Tomasz Kłoczko
  1 sibling, 1 reply; 56+ messages in thread
From: Kai Krakow @ 2017-09-14 11:38 UTC (permalink / raw)
  To: linux-btrfs

Am Tue, 12 Sep 2017 18:28:43 +0200
schrieb Ulli Horlacher <framstag@rus.uni-stuttgart.de>:

> On Thu 2017-08-31 (09:05), Ulli Horlacher wrote:
> > When I do a 
> > btrfs filesystem defragment -r /directory
> > does it really defragment all files in this directory tree, even if
> > it contains subvolumes?
> > The man page does not mention subvolumes on this topic.  
> 
> No answer so far :-(
> 
> But I found another problem in the man-page:
> 
>   Defragmenting with Linux kernel versions < 3.9 or >= 3.14-rc2 as well as
>   with Linux stable kernel versions >= 3.10.31, >= 3.12.12 or >= 3.13.4
>   will break up the ref-links of COW data (for example files copied with
>   cp --reflink, snapshots or de-duplicated data). This may cause
>   considerable increase of space usage depending on the broken up
>   ref-links.
> 
> I am running Ubuntu 16.04 with Linux kernel 4.10 and I have several
> snapshots.
> Therefore, should I avoid calling "btrfs filesystem defragment -r"?
> 
> What is the defragmenting best practice?
> Avoid it completely?

You may want to try https://github.com/Zygo/bees. It is a daemon that
watches the file system generation changes, scans the blocks and
then recombines them. Of course, this process somewhat defeats the
purpose of defragging in the first place as it will undo some of the
defragmenting.

I suggest you only ever defragment parts of your main subvolume or rely
on autodefrag, and let bees do optimizing the snapshots.
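
In practice that could look roughly like this (the mount point and path
are placeholders, and 32M is just an example target extent size):

  mount -o remount,autodefrag /mnt                                 # or add autodefrag to the fstab options
  btrfs filesystem defragment -r -t 32M /mnt/projects/build-tree   # targeted, non-snapshotted data only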

Also, I experimented with adding btrfs support to shake, still working
on better integration but currently lacking time... :-(

Shake is an adaptive defragger which rewrites files. With my current
patches it clones each file, and then rewrites it to its original
location. This approach is currently not optimal as it simply bails out
if some other process is accessing the file and leaves you with an
(intact) temporary copy you need to move back in place manually.

Shake works very well: it detects how fragmented, how old, and how far
away from an "ideal" position a file is, and exploits standard Linux
file system behavior to place files optimally by rewriting them. It then
records its status per file in extended
attributes. It also works with non-btrfs file systems. My patches try
to avoid defragging files with shared extents, so this may help your
situation. However, it will still shuffle files around if they are too
far from their ideal position, thus destroying shared extents. A future
patch could use extent recombining and skip shared extents in that
process. But first I'd like to clean out some of the rough edges
together with the original author of shake.

Look here: https://github.com/unbrice/shake and also check out the pull
requests and comments there. You shouldn't currently run shake
unattended, and you should only run it on specific parts of your FS
that you feel need defragmenting.


-- 
Regards,
Kai

Replies to list-only preferred.



* Re: defragmenting best practice?
  2017-09-14  7:54     ` Duncan
@ 2017-09-14 12:28       ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 56+ messages in thread
From: Austin S. Hemmelgarn @ 2017-09-14 12:28 UTC (permalink / raw)
  To: linux-btrfs

On 2017-09-14 03:54, Duncan wrote:
> Austin S. Hemmelgarn posted on Tue, 12 Sep 2017 13:27:00 -0400 as
> excerpted:
> 
>> The tricky part though is that differing workloads are impacted
>> differently by fragmentation.  Using just four generic examples:
>>
>> * Mostly sequential write focused workloads (like security recording
>> systems) tend to be impacted by free space fragmentation more than data
>> fragmentation.  Balancing filesystems used for such workloads is likely
>> to give a noticeable improvement, but defragmenting probably won't give
>> much.
>> * Mostly sequential read focused workloads (like a streaming media
>> server)
>> tend to be the most impacted by data fragmentation, but aren't generally
>> impacted by free space fragmentation.  As a result, defrag will help
>> here a lot, but balance won't as much.
>> * Mostly random write focused workloads (like most database systems or
>> virtual machines) are often impacted by both free space and data
>> fragmentation, and are a pathological case for CoW filesystems.  Balance
>> and defrag will help here, but they won't help for long.
>> * Mostly random read focused workloads (like most non-multimedia desktop
>> usage) are not impacted much by either aspect, but if you're on a
>> traditional hard drive they can be impacted significantly by how the
>> data is spread across the disk.  Balance can help here, but only because
>> it improves data locality, not because it compacts free space.
> 
> This is a very useful analysis, particularly given the examples.  Maybe
> put it on the wiki under the defrag discussion?  (Assuming something like
> it isn't already there.  I've not looked in awhile.)
> 
I've actually been meaning to write up something more thoroughly about 
this online (probably as a Gist).  When I finally get around to that 
(probably in the next few weeks), I'll try to make sure a link ends up 
on the defrag page on the wiki.


* Re: defragmenting best practice?
  2017-09-14 11:38   ` Kai Krakow
@ 2017-09-14 13:31     ` Tomasz Kłoczko
  2017-09-14 15:24       ` Kai Krakow
  0 siblings, 1 reply; 56+ messages in thread
From: Tomasz Kłoczko @ 2017-09-14 13:31 UTC (permalink / raw)
  Cc: linux-btrfs

On 14 September 2017 at 12:38, Kai Krakow <hurikhan77@gmail.com> wrote:
[..]
>
> I suggest you only ever defragment parts of your main subvolume or rely
> on autodefrag, and let bees do optimizing the snapshots.
>
> Also, I experimented with adding btrfs support to shake, still working
> on better integration but currently lacking time... :-(
>
> Shake is an adaptive defragger which rewrites files. With my current
> patches it clones each file, and then rewrites it to its original
> location. This approach is currently not optimal as it simply bails out
> if some other process is accessing the file and leaves you with an
> (intact) temporary copy you need to move back in place manually.

If you really want a real and *ideal* distribution of the data
across the physical disk, first you need to build a time travel device.
This device will allow you to put all blocks which need to be read in
perfect order (to read all data sequentially without seeking).
However, it will only work in the case of spindles, because with
SSDs there is no seek time.
Please let us know when you write the drivers/timetravel/ Linux kernel driver.
When such a driver is available I promise I'll write all the necessary
btrfs code myself in a matter of a few days (it will be a piece of cake
compared to building such a device).

But seriously ..
The only context/scenario where you may want to lower fragmentation is
when something needs to allocate a contiguous area smaller than the
total free space but larger than the largest free chunk. Something like
this happens only when the volume is working at almost 100% allocated
space. In such a scenario even your bees cannot do much, as there may
not be enough free space to move other data around in larger chunks to
defragment the FS's physical space. If your workload keeps writing
new data to the FS, such defragmentation may give you (maybe) a few more
seconds, and just after that the FS will be 100% full.

In other words if someone is thinking that such defragmentation daemon
is solving any problems he/she may be 100% right .. such person is
only *thinking* that this is truth.

kloczek
PS. Do you know the first MacGyver rule? -> "If it ain't broke, don't fix it".
So first show that fragmentation is hurting the latency of access to
btrfs data and that it is possible to measure such impact.
Before you start measuring this you need to learn how to sample,
for example, VFS layer latency. Do you know how to do this to deliver
such proof?
PS2. The same "discussions" about fragmentation where in the past
about +10 years ago after ZFS has been introduced. Just to let you
know that after initial ZFS introduction up to now was not written
even single line of ZFS code to handle active fragmentation and no one
been able to prove that something about active defragmentation needs
to be done in case of ZFS.
Why? Because all stands on the shoulders of enough cleaver *allocation
algorithm*. Only this and nothing more.
PS3. Please can we stop this/EOT?
--
Tomasz Kłoczko | LinkedIn: http://lnkd.in/FXPWxH


* Re: defragmenting best practice?
  2017-09-14 13:31     ` Tomasz Kłoczko
@ 2017-09-14 15:24       ` Kai Krakow
  2017-09-14 15:47         ` Kai Krakow
  2017-09-14 17:48         ` Tomasz Kłoczko
  0 siblings, 2 replies; 56+ messages in thread
From: Kai Krakow @ 2017-09-14 15:24 UTC (permalink / raw)
  To: linux-btrfs

Am Thu, 14 Sep 2017 14:31:48 +0100
schrieb Tomasz Kłoczko <kloczko.tomasz@gmail.com>:

> On 14 September 2017 at 12:38, Kai Krakow <hurikhan77@gmail.com>
> wrote: [..]
> >
> > I suggest you only ever defragment parts of your main subvolume or
> > rely on autodefrag, and let bees do optimizing the snapshots.

Please read that again including the parts you omitted.


> > Also, I experimented with adding btrfs support to shake, still
> > working on better integration but currently lacking time... :-(
> >
> > Shake is an adaptive defragger which rewrites files. With my current
> > patches it clones each file, and then rewrites it to its original
> > location. This approach is currently not optimal as it simply bails
> > out if some other process is accessing the file and leaves you with
> > an (intact) temporary copy you need to move back in place
> > manually.  
> 
> If you really want to have real and *ideal* distribution of the data
> across physical disk first you need to build time travel device. This
> device will allow you to put all blocks which needs to be read in
> perfect order (to read all data only sequentially without seek).
> However it will be working only in case of spindles because in case of
> SSDs there is no seek time.
> Please let us know when you will write drivers/timetravel/ Linux
> kernel driver. When such driver will be available I promise I'll
> write all necessary btrfs code by myself in matter of few days (it
> will be piece of cake compare to build such device).
> 
> But seriously ..

Seriously: Defragmentation on spindles is IMHO not about getting the
perfect continuous allocation but providing better spatial layout of
the files you work with.

Getting e.g. boot files into read order or at least nearby improves
boot time a lot. Similar for loading applications. Shake tries to
improve this by rewriting the files - and this works because file
systems (given enough free space) already do a very good job at doing
this. But constant system updates degrade this order over time.

It really doesn't matter if some big file is laid out in 1 allocation
of 1 GB or in 250 allocations of 4MB: It really doesn't make a big
difference.

Recombining extents into bigger ones, though, can make a big difference in
an aging btrfs, even on SSDs.

Bees is, btw, not about defragmentation: I have some OS containers
running and I want to deduplicate data after updates. It seems to do a
good job here, better than other deduplicators I found. And if some
defrag tools destroyed your snapshot reflinks, bees can also help here.
On its way it may recombine extents so it may improve fragmentation.
But usually it probably defragments because it needs to split extents
that a defragger combined.

But well, I think getting 100% continuous allocation is really not the
achievement you want to get, especially when reflinks are a primary
concern.


> The only context/scenario where you may want to lower fragmentation is
> when something needs to allocate a contiguous area smaller than the
> total free space but larger than the largest free chunk. Something like
> this happens only when the volume is working at almost 100% allocated
> space. In such a scenario even your bees cannot do much, as there may
> not be enough free space to move other data around in larger chunks to
> defragment the FS's physical space.

Bees does not do that.


> If your workload keeps writing
> new data to the FS, such defragmentation may give you (maybe) a few more
> seconds, and just after that the FS will be 100% full.
> 
> In other words if someone is thinking that such defragmentation daemon
> is solving any problems he/she may be 100% right .. such person is
> only *thinking* that this is truth.

Bees is not about that.


> kloczek
> PS. Do you know first McGyver rule? -> "If it ain't broke, don't fix
> it".

Do you know the saying "think first, then act"?


> So first show that fragmentation is hurting latency of the
> access to btrfs data and that it is possible to measure such
> impact. Before you start measuring this you need to learn how to
> sample for example VFS layer latency. Do you know how to do this to
> deliver such proof?

You didn't get the point. You only read "defragmentation" and your
alarm lights lit up. You even think bees would be a defragmenter. It
probably is more the opposite because it introduces more fragments in
exchange for more reflinks.


> PS2. The same "discussions" about fragmentation where in the past
> about +10 years ago after ZFS has been introduced. Just to let you
> know that after initial ZFS introduction up to now was not written
> even single line of ZFS code to handle active fragmentation and no one
> been able to prove that something about active defragmentation needs
> to be done in case of ZFS.

Btrfs has autodefrag to reduce the number of fragments by rewriting
small portions of the file being written to. This is needed, otherwise
the feature wouldn't be there. Why? Have you tried working with 1GB files
broken into 100,000+ fragments just because of how CoW works? Try it,
there's your latency.


> Why? Because all stands on the shoulders of enough cleaver *allocation
> algorithm*. Only this and nothing more.
> PS3. Please can we stop this/EOT?

Can we please not start a flame war just because you hate defrag tools?

I think the whole discussion about "defragmenting" should be stopped.
Let's call it "optimizers":

If it reduces needed storage space, it optimizes. And I need a tool for
that. Otherwise tell me how btrfs solves this in-kernel, when
applications break reflinks by rewriting data...

If you're on spindles you want files that are needed at around the same
time to be kept spatially nearby. This improves boot times and application
start times. The file system already does a good job at doing this. But
for some work loads (like booting) this degrades over time and the FS
can do nothing about it because this is just not how package managers
work (or Windows updates, NTFS also uses extent allocation and as such
solves the same problems in similar way as most Linux systems). Let the
package manager reinstall all files accessed at boot and it would
probably be solved. But who wants that? Btrfs does not solve this, SSDs
do. I'm using bcache for that on my local system. Without SSDs,
shake (and other tools) can solve this.

If you are on SSD and work with almost full file systems, you may get
back performance by recombining free space. Defragmentation here is not
about files but free space. This can also be called an optimizer then.


I really have no interest in defragmenting a file system to 100%
continuous allocation. That was needed for FAT and small systems without
enough RAM for caching all the file system infrastructure. Today
systems use extent allocations and that solves the problem where the
original idea of defragmentation came from. When I speak of
defragmentation I mean something more intelligent like optimizing file
system layout for access patterns you use.


Conclusion: The original question was about defrag best practice with
regards to reflinked snapshots. And I recommended partially against it
and instead recommended bees which restores and optimizes the reflinks
and may recombine some of the extents. From my wording, and I apologize
for that, it was probably not completely clear what this means:

[I wrote]
> You may want to try https://github.com/Zygo/bees. It is a daemon that
> watches the file system generation changes, scans the blocks and
> then recombines them. Of course, this process somewhat defeats the
> purpose of defragging in the first place as it will undo some of the
> defragmenting.

It scans for duplicate blocks and recombines them into reflinked
blocks. This is done by recombining extents. For that purpose, extents
that the file system allocated, usually need to be broken up again into
smaller chunks. But bees tries to recombine such broken extents back
into bigger ones. But it is not a defragger, seriously! It indeed
breaks extents into smaller chunks.

Later I recommended having a look at shake, which I experimented with.
And I also recommended letting btrfs autodefrag do the work and only
ever defragmenting a few selected parts of the file system he feels
need "defragmentation". My patches to shake try to avoid btrfs
shared extents so actually they reduce the effect of defragmenting the
FS, because I think keeping reflinked extents is more important. But I
see the main purpose of shake to re-layout supplied files into nearby
space. I think it is more important to improve spatial locality of
files than having them 100% continuous.

I will try to make my intent clearer next time, but I guess you probably
won't read it in its entirety anyway. :,-(


-- 
Regards,
Kai

Replies to list-only preferred.




* Re: defragmenting best practice?
  2017-09-14 15:24       ` Kai Krakow
@ 2017-09-14 15:47         ` Kai Krakow
  2017-09-14 17:48         ` Tomasz Kłoczko
  1 sibling, 0 replies; 56+ messages in thread
From: Kai Krakow @ 2017-09-14 15:47 UTC (permalink / raw)
  To: linux-btrfs

Am Thu, 14 Sep 2017 17:24:34 +0200
schrieb Kai Krakow <hurikhan77@gmail.com>:

Errors corrected, see below...


> [..]
> 
> Bees is, btw, not about defragmentation: I have some OS containers
> running and I want to deduplicate data after updates. It seems to do a
> good job here, better than other deduplicators I found. And if some
> defrag tools destroyed your snapshot reflinks, bees can also help
> here. On its way it may recombine extents so it may improve
> fragmentation. But usually it probably defragments because it needs
                                         ^^^^^^^^^^^
It fragments!

> to split extents that a defragger combined.
> 
> [..]



-- 
Regards,
Kai

Replies to list-only preferred.




* Re: defragmenting best practice?
  2017-09-14 15:24       ` Kai Krakow
  2017-09-14 15:47         ` Kai Krakow
@ 2017-09-14 17:48         ` Tomasz Kłoczko
  2017-09-14 18:53           ` Austin S. Hemmelgarn
                             ` (2 more replies)
  1 sibling, 3 replies; 56+ messages in thread
From: Tomasz Kłoczko @ 2017-09-14 17:48 UTC (permalink / raw)
  Cc: linux-btrfs

On 14 September 2017 at 16:24, Kai Krakow <hurikhan77@gmail.com> wrote:
[..]
> Getting e.g. boot files into read order or at least nearby improves
> boot time a lot. Similar for loading applications.

By how much is it possible to improve boot time?
Please just give some example which I can try to replay, which will
show whether we have similar results.
I still have one of my laptops with a spindle and a btrfs root fs (and
no other filesystems in use) so I would be able to confirm whether my
numbers are close enough to your numbers.

> Shake tries to
> improve this by rewriting the files - and this works because file
> systems (given enough free space) already do a very good job at doing
> this. But constant system updates degrade this order over time.

OK. Please prepare some database and import some data whose size is a
few times the unused RAM (best if this multiplication factor is at
least 10). Then do some batch of selects, measuring the distribution of
latencies of those queries.
This will give you some data about non-fragmented data.
Then, in the next stage, apply some number of update queries, and
afterwards reboot the system or drop all caches, and repeat the same
set of selects.
After this, all you need to do is compare the distribution of the latencies.

> It really doesn't matter if some big file is laid out in 1 allocation
> of 1 GB or in 250 allocations of 4MB: It really doesn't make a big
> difference.
>
> Recombining extents into bigger ones, though, can make a big difference in
> an aging btrfs, even on SSDs.

That may be an issue with using extents.
Again: please show some results of some test unit which anyone will be
able to replay and confirm or not that this effect really exists.

If the problem really exists and is related to extents you should have a real
scenario explanation why ZFS is not using extents.
btrfs is not too far from the classic approach to FS because it still uses
allocation structures.
This is not the case in the context of ZFS because this technology has no
information about what is already allocated.
ZFS uses free lists so by negation whatever is not on a free list is
already allocated.
I'm not trying to point out that ZFS is better but only that by
changing the allocation strategy you may not be blasted by something like
some extents bottleneck (which still needs to be proven).

There are at least a few very good reasons why it is sometimes even
necessary to change strategy from allocation structures to free lists.
First: ZFS free list management is very similar to the Linux memory
SLAB allocator.
Did you ever hear that someone needs to do system memory defragmentation
because fragmented memory adds some additional latency to memory
access?
Another consequence is that with growing size of files and number of
files or directories, FS metadata grows exponentially with the size
and number of such objects. In the case of free lists there is no such
growth and all structures grow with linear correlation.
Caching free list data in memory takes much less than caching b-trees.
The last thing is the effort of deallocating something in an FS with
allocation structures versus with free lists.
In the classic approach the number of such operations grows with the
depth of the b-trees.
In the free list case all you need to do is compare the ctime of the
allocated block with the volume or snapshot ctime to decide whether or
not to return the block to the free list.
No matter how many snapshots, volumes, files or directories there are,
it will always be *just one compare* of the block or vol/snapshot ctime.
With the necessity to do just one compare comes much more predictable
behavior of the whole FS and simplicity of the code making such
decisions.
In other words, ZFS internally uses the well-known SLAB allocator
approach, caching some data about the best possible locations to
allocate units of different sizes (multiplied by 2^n), like you can see
on Linux in /proc/slabinfo in the case of the *kmalloc* SLABs.
This is why, in the case of ZFS, the number of volumes and snapshots has
zero impact on the avg speed of interactions over the VFS layer.

If you are able to present a real impact of the fragmentation (again,
*if*), this may trigger other actions.
So far, AFAIK, no one has been able to deliver real numbers or
scenarios about such impact.
And *if* such impact really exists, one of the solutions may be to just
mimic what ZFS is doing (maybe there are other solutions).

So please show us a test unit exposing the problem, with a measurement
methodology, presenting the pathology related to fragmentation.

> Bees is, btw, not about defragmentation: I have some OS containers
> running and I want to deduplicate data after updates.

Deduplication done in userspace has natural consequences in the form of
security issues.
An executable doing such things will need full access to everything and
needs to have some API/ABI exposed that allows fiddling with the content
of the btrfs, which adds a second batch of security-related risks.

Try to have a look at how deduplication works in the case of ZFS without
offline deduplication.

>> In other words if someone is thinking that such defragmentation daemon
>> is solving any problems he/she may be 100% right .. such person is
>> only *thinking* that this is truth.
>
> Bees is not about that.

I've been only trying to say that I would be really surprised if bees
will be taking care of such scenarios.

>> So first show that fragmentation is hurting latency of the
>> access to btrfs data and that it is possible to measure such
>> impact. Before you start measuring this you need to learn how to
>> sample for example VFS layer latency. Do you know how to do this to
>> deliver such proof?
>
> You didn't get the point. You only read "defragmentation" and your
> alarm lights lit up. You even think bees would be a defragmenter. It
> probably is more the opposite because it introduces more fragments in
> exchange for more reflinks.

So you are asking us to start investing development time in
implementing something without proving or demonstrating that the problem
is real?
No matter how long someone thinks about this, it will change nothing.

[..]
> Can we please not start a flame war just because you hate defrag tools?

Really, I have no idea where I wrote that I hate defragmentation.
Using ZFS as a working, real example, I've only told you that the
necessity to reduce fragmentation is NULL if you follow the exact
path.
In your world you are trying to tell me that your keys do not match the
lock in the door.
I'm only trying to tell you that there are many doors without a keyhole
which can be opened and closed.

I can only repeat that to trigger some action about defragmentation,
first you need to *present* some case scenario exposing that the
problem is real. I may even believe that you may be right, but
engineering is not something to which the term "believe" can be applied.

Intuition may always be tricking you here into thinking that as long as
the impact is non-zero, someone should take care of it.
No. If this impact is small enough it can be ignored, the same way
we ignore some consequences of quantum physics in our lives
(the probability that a bucket of water standing on an open fire will
freeze instead of boil is, according to quantum physics, always
non-zero, and despite this fact no one has been able to observe such a
thing).
In other words, you need to show some *real numbers* which will show the
SCALE of the issue.

kloczek


* Re: defragmenting best practice?
  2017-09-14 17:48         ` Tomasz Kłoczko
@ 2017-09-14 18:53           ` Austin S. Hemmelgarn
  2017-09-15  2:26             ` Tomasz Kłoczko
  2017-09-14 20:17           ` Kai Krakow
  2017-09-15 10:54           ` Michał Sokołowski
  2 siblings, 1 reply; 56+ messages in thread
From: Austin S. Hemmelgarn @ 2017-09-14 18:53 UTC (permalink / raw)
  To: Tomasz Kłoczko; +Cc: linux-btrfs

On 2017-09-14 13:48, Tomasz Kłoczko wrote:
> On 14 September 2017 at 16:24, Kai Krakow <hurikhan77@gmail.com> wrote:
> [..]
>> Getting e.g. boot files into read order or at least nearby improves
>> boot time a lot. Similar for loading applications.
> 
> By how much is it possible to improve boot time?
> Please just give some example which I can try to replay, which will
> show whether we have similar results.
> I still have one of my laptops with a spindle and a btrfs root fs (and
> no other filesystems in use) so I would be able to confirm whether my
> numbers are close enough to your numbers.
While it's not for BTRFS, a tool called e4rat might be of interest to 
you regarding this.  It reorganizes files on an ext4 filesystem so that 
stuff used by the boot loader is right at the beginning of the device, 
and I've known people to get insane performance improvements (on the 
order of 20x in some pathologically bad cases) in the time taken from 
the BIOS handing things off to GRUB to GRUB handing execution off to the 
kernel.
> 
>> Shake tries to
>> improve this by rewriting the files - and this works because file
>> systems (given enough free space) already do a very good job at doing
>> this. But constant system updates degrade this order over time.
> 
> OK. Please prepare some database and import some data whose size is a
> few times the unused RAM (best if this multiplication factor is at
> least 10). Then do some batch of selects, measuring the distribution of
> latencies of those queries.
> This will give you some data about non-fragmented data.
> Then, in the next stage, apply some number of update queries, and
> afterwards reboot the system or drop all caches, and repeat the same
> set of selects.
> After this, all you need to do is compare the distribution of the latencies.
> 
>> It really doesn't matter if some big file is laid out in 1 allocation
>> of 1 GB or in 250 allocations of 4MB: It really doesn't make a big
>> difference.
>>
>> Recombining extents into bigger ones, though, can make a big difference in
>> an aging btrfs, even on SSDs.
> 
> That may be an issue with using extents.
> Again: please show some results of some test unit which anyone will be
> able to replay and confirm or not that this effect really exists.
This shouldn't need examples.  It's trivial math combined with basic 
knowledge of hardware behavior.  Every request to a device has a minimum 
amount of overhead.  On traditional hard drives, this is usually 
dominated by seek latency, but on SSD's, the request setup, dispatch, 
and completion are the dominant factor.  Assuming you have a 2 
micro-second overhead per-request (not an exact number, just chosen for 
demonstration purposes because it makes the math easy), and a 1GB file, 
the time difference between reading ten 100MB extents and reading ten 
thousand 100kB extents is just short of 0.02 seconds, or a factor of 
about one thousand (which, no surprise here, is the factor of difference 
between the number of extents).
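
Spelled out, with the same made-up 2 microsecond figure:

  awk 'BEGIN {
    per_req = 2e-6                                   # seconds of fixed overhead per request
    printf "ten extents:          %f s\n", 10 * per_req
    printf "ten thousand extents: %f s\n", 10000 * per_req
    printf "difference:           %f s\n", (10000 - 10) * per_req
  }'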
> 
> If the problem really exists and is related to extents you should have a real
> scenario explanation why ZFS is not using extents.
Extents have nothing to do with it.  What matters is how much of the 
file data is contiguous (and therefore can be read as a single request) 
and how smart the FS is about figuring that out.  Extents help figure 
that out, but the primary reason to use them is to save space encoding 
block allocations within a file (go take a look at how ext2 handles 
allocations, and then compare that to ext4, the difference is insane in 
terms of space savings).
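
To see how a given file actually ended up on disk, filefrag (from
e2fsprogs) reports its extent count; the path is a placeholder, and on
btrfs compressed files show up as many small extents even when they are
laid out contiguously:

  filefrag /srv/images/test.img      # e.g. "test.img: 8 extents found"
  filefrag -v /srv/images/test.img   # verbose per-extent listing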
> [..]
> 
>> Bees is, btw, not about defragmentation: I have some OS containers
>> running and I want to deduplicate data after updates.
> 
> Deduplication done in userspace has natural consequences in the form of
> security issues.
> An executable doing such things will need full access to everything and
> needs to have some API/ABI exposed that allows fiddling with the content
> of the btrfs, which adds a second batch of security-related risks.
> 
> Try to have a look at how deduplication works in the case of ZFS without
> offline deduplication.
You mean how it eats tons of RAM and gives nearly no benefit in most 
cases compared to just using transparent compression?  Online 
deduplication like ZFS offers has issues too.
> 
> [..]


* Re: defragmenting best practice?
  2017-09-14 17:48         ` Tomasz Kłoczko
  2017-09-14 18:53           ` Austin S. Hemmelgarn
@ 2017-09-14 20:17           ` Kai Krakow
  2017-09-15 10:54           ` Michał Sokołowski
  2 siblings, 0 replies; 56+ messages in thread
From: Kai Krakow @ 2017-09-14 20:17 UTC (permalink / raw)
  To: linux-btrfs

Am Thu, 14 Sep 2017 18:48:54 +0100
schrieb Tomasz Kłoczko <kloczko.tomasz@gmail.com>:

> On 14 September 2017 at 16:24, Kai Krakow <hurikhan77@gmail.com>
> wrote: [..]
> > Getting e.g. boot files into read order or at least nearby improves
> > boot time a lot. Similar for loading applications.  
> 
> By how much is it possible to improve boot time?
> Please just give some example which I can try to replay, which will
> show whether we have similar results.
> I still have one of my laptops with a spindle and a btrfs root fs (and
> no other filesystems in use) so I would be able to confirm whether my
> numbers are close enough to your numbers.

I need to create a test setup because this system uses bcache. The
difference (according to systemd-analyze) between warm bcache and no
bcache at all ranges from 16-30s boot time vs. 3+ minutes boot time.

I could turn off bcache, do a boot trace, try to rearrange boot files,
boot again. However, that is not very reproducible as the current file
layout is not defined. It'd be better to set up a separate machine where
I could start over from a "well defined" state before applying
optimization steps to see the differences between different strategies.
At least readahead is not very helpful, I tested that in the past. It
reduces boot time just by a few seconds, maybe 20-30, thus going from
3+ minutes to 2+ minutes.

I still have an old laptop lying around: Single spindle, should make a
good test scenario. I'll have to see if I can get it back into shape.
It will take me some time.


> > Shake tries to
> > improve this by rewriting the files - and this works because file
> > systems (given enough free space) already do a very good job at
> > doing this. But constant system updates degrade this order over
> > time.  
> 
> OK. Please prepare some database, import some data whose size will be
> a few times the unused RAM (best if this multiplication factor is
> at least 10). Then do some batch of selects, measuring the latency
> distribution of those queries.

Well, this is pretty easy. Systemd-journald is a real beast when it
comes to cow fragmentation. Results can be easily generated and
reproduced. There are long traces of discussions in the systemd mailing
list and I simply decided to make the files nocow right from the start
and that fixed it for me. I can simply revert it and create benchmarks.
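
For reference, a minimal sketch of what I mean by "make the files
nocow" (this assumes the default persistent journal location under
/var/log/journal; note that the +C attribute only takes effect for
newly created files, which is why the existing journals get rewritten):

  systemctl stop systemd-journald
  chattr -R +C /var/log/journal
  find /var/log/journal -name '*.journal' | while read -r f; do
      # the copy is a fresh file and inherits nocow from its directory
      cp --reflink=never "$f" "$f.tmp" && mv "$f.tmp" "$f"
  done
  systemctl start systemd-journald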


> This will give you some data about non-fragmented data.

Well, I would probably do it the other way around: Generate a
fragmented journal file (as that is how journald creates the file over
time), then rewrite it by some manner to reduce extents, then run
journal operations again on this file. Does it bother you to turn this
around?


> Then as the next stage try to apply some number of update queries,
> and after that reboot the system or drop all caches, and repeat the
> same set of selects.
> After this all that you need to do is compare the distributions of
> the latencies.

Which tool to use to measure which latencies?

Speaking of latencies: What's of interest here is perceived
performance resulting mostly from seek overhead (except probably in the
journal file case which simply overwhelms by the sheer amount of
extents). I'm not sure if measuring VFS latencies would provide any
useful insights here. VFS probably works fast enough still in this
case.


> > It really doesn't matter if some big file is laid out in 1
> > allocation of 1 GB or in 250 allocations of 4MB: It really doesn't
> > make a big difference.
> >
> > Recombining extents into bigger ones, tho, can make a big
> > difference in an aging btrfs, even on SSDs.  
> 
> That may be an issue with using extents.

I can't follow why you argue that a file with thousands of extents vs
a file of the same size but only a few extents would make no difference to
operate on. And of course this has to do with extents. But btrfs uses
extents. Do you suggest to use ZFS instead?

Due to how cow works, the effect would probably be less or barely
noticeable for writes, but read scanning through the file becomes slow
with clearly more "noise" from the moving heads.


> Again: please show some results of some test unit which anyone will be
> able to replay and confirm or not that this effect really exists.
> 
> If the problem really exists and is related to extents you should have
> a real scenario explaining why ZFS is not using extents.

That was never the discussion. You brought in the ZFS point. I read
about the design reasoning behind ZFS when it appeared and started to gain
public interest years back.


> btrfs is not too far from the classic approach to FS design because it
> still uses allocation structures.
> This is not the case in the context of ZFS because this technology has
> no information about what is already allocated.

What about btrfs free space tree? Isn't that more or less the same? But
I don't believe that makes a significant difference for desktop-sized
storages. I think introduction of free space tree was due to
performance of many-TB file systems up to petabyte storage (and beyond
of course).


> ZFS uses free lists so by negation whatever is not on free list is
> already allocated.
> I'm not trying to say that ZFS is better, but only that by
> changing the allocation strategy you may not be blasted by something
> like an extents bottleneck (which still needs to be proven).

Reasoning behind using block-oriented allocation probably has more to
do with providing efficient vdevs and snapshotting. Using extents for
that has some nasty (and obvious) downsides if you think about it, like
slack space from only partially shared extents. I guess that is why
bees rewrites extent and then shares them again using EXTENT_SAME
IOCTL. It generates a lot of writes just to free some unused extent
slack.


> There are at least a few very good reasons why it is sometimes even
> necessary to change strategy from allocation structures to free lists.
> First: ZFS free list management is very similar to the Linux kernel
> memory SLAB allocator.
> Did you ever hear that someone needs to do system memory defragmentation
> because fragmented memory adds some additional latency to memory
> access?

64 bit systems tend to have enough address space that this is not an
issue. But it can easily become an issue if you fill the page tables or
use huge pages a lot. There's really something like memory
fragmentation but you usually don't defragment memory (and yes, such
products existed in the past for unnamed popular "OS"es but that is
snake oil).

And I can totally follow why free lists are better here, you don't need
to explain that.

BTW: Do you really compare RAM to spindle storage now? Latency for RAM
access is clearly more an electrical than a mechanical problem and also
very predictable and thus static, like it is with SSDs.


> Another consequence is that with growing file sizes and numbers of
> files or directories, FS metadata grows exponentially with the size
> and number of such objects.

I'm not sure if this holds true for every implementation out there. You
can make it pretty linear if you wanted to (but you don't).


> In case of free lists there is no such
> growth and all structures are growing with linear correlation.

Why is that so? Can you illustrate examples?

Well, of course lists are linear, trees are not. But lists become slow.
So if you implement free lists as trees, I don't think that growth is
strictly linear. That's just not how trees work. And a list will become
slow at some point.

BTW: The slab memory allocator indeed has to handle fragmentation
issues. And it can become slow if used in wrong ways.

Slab uses a triple linked list to keep track of allocations, free items
and mixed items (items that hold allocated and free objects). I think
you can compare btrfs chunks and extents to how slab manages memory. A
full btrfs chunk would be tracked as a full slab item, a free chunk as
free item, and the rest is mixed.

When inserting objects into slab this would compare to btrfs extents.
You will have some slack because you cannot optimally fit all different
sized extents into a chunk. If you deallocate objects (thus remove an
extent), you'll get fragmented free space.

I think btrfs pretty well knows where such free space exists, and it
can find it. But if it has to start looking in the mixed case, it will
be harder to find fitting space (especially an optimal fit).

Slab will struggle with the same problem. But it has to move no heads
for this. And I think slab matches objects into different size buckets to
alleviate such problems where possible. I think even ZFS differentiates
block sizes into different buckets for more performant and optimal
handling. Btrfs has to try to fit it with a lot of strategies to
optimize this: Will the extent grow shortly? Should I allocate now or
later? Maybe later would provide a better fit?

It is a good strategy for most workloads, but it doesn't pair so well
with CoW.


> Caching free list data in memory takes much less than caching b-trees.
> The last thing is the effort of deallocating something in an FS with
> allocation structures versus with free lists.
> In the classic approach the number of such operations grows with the
> depth of the b-trees. In the free list case all that you need to do is
> compare the ctime of the allocated block with the volume or snapshot
> ctime to decide whether or not to return the block to the free list.

As noted above I can follow why this was chosen. But that's not the
topic here.

Btrfs has b-trees - that's what it is. It's not ZFS. It's not ext4. It
is btrfs. You say "btrfs needs no defragmentation, it makes no
difference in speed" but now you list the many flaws and performance
downsides of things different to ZFS. So maybe there is a benefit in
coalescing many small extents back into few big extents? Or there is a
benefit in coalescing free space all over the place into fewer chunks
as "btrfs balance" would do it?

Why are there these tools if it makes no difference to have them? When
there was no strong benefit, why did anyone bother with the effort of
programming this and putting infrastructure into the kernel for it when
the kernel is already clearly very complex? Why did anyone program
different file systems? We could have gone with ext4, or xfs (which
starts to support reflinks already). What's the point of autodefrag
when it's not needed?


> No matter how many snapshots, volumes, files or directories, it will
> always be *just one compare* of the block or vol/snapshot ctime.
> With the necessity to do only one compare comes much more
> predictable behavior of whole FS and simplicity of the code making
> such decisions.

You almost completely convinced me to ditch btrfs and use ZFS and
recommend it to everyone who feels the urge to "defragment" even only
one of her/his files...

How much RAM do I need again for ZFS to operate with good performance?


> In other words ZFS internally uses well know SLAB allocator with
> caching some data about best possible location to allocate some
> different sizes allocation unit size multiplied by n^2 like you can
> see on Linux in /proc/slabinfo in case of *kmalloc* SLABs.
> This is why in case of ZFS number of volumes, snapshots has zero
> impact on avg speed of interactions over VFS layer.

I'm feeling the whole discussion only started because you think
performance perception solely comes from VFS latencies. Is that so?


> If you will be able present real impact of the fragmentation (again
> *if*) this may trigger other actions.

I start guessing that the numbers I'd present are not convincing for
you because you only want to see VFS latencies. Please think of
something imaginary: Perceived performance *whoosh*

Sure, I can throw lots of RAM at the problem. I can throw SSDs at the
problem. I can introduce HBAs with huge caching capabilities. I can
throw ZFS with L2ARC and ZIL at it. Plus huge amounts of RAM. It's all
no problem, we actually do that for high performance, high cost
enterprise server machines. But the ordinary desktop user can probably
not afford that.


> So AFAIK no one has been able to deliver real numbers or scenarios
> about such impact.
> And *if* such impact really exists, one of the solutions may be to
> just mimic what ZFS is doing (maybe there are other solutions).

No. Probably not. You cannot just replace btrfs infrastructure with
something else and still call it btrfs. And also, there would be no
migration path. And then: ZFS on Linux is already there. If I want ZFS,
I use it, and do not invest efforts to make something else into ZFS.

Remember the rules: If it's not broken, don't fix it. And also use the
tools that best fit. When we are faced with what is here, and it
improves things as a one shot solution for an acceptable period of time
- why not use it? I mean, McGyver would also use that bubble gum to
glue the lighter to a stick, and not walk to the next super glue store
to get the one and only valid way to glue lighters to sticks. The
bubble gum will do long enough to temporarily solve the problem.


> So please show us test unit exposing problem with measurement
> methodology presenting pathology related to fragmentation.

Yeah, I get it: Fragmentation is a non-issue.


> > Bees is, btw, not about defragmentation: I have some OS containers
> > running and I want to deduplicate data after updates.  
> 
> Deduplication done in userspace has natural consequences in form of
> security issues.

Yes, of course. It needs proper isolation. The kernel is already very
"bloated", do you really want another worker process doing complicated
things running directly in kernel space? This naturally introduces
stability issues (which, btw, also introduce security issues). What
about providing better interfaces for exactly such operations?


> An executable doing such things will need full access to everything
> and needs to have exposed some API/ABI allowing it to fiddle with the
> content of the btrfs. Which adds a second batch of security related risks.

It depends on how much other interfaces such a process exposes. You can
use proper process isolation. And maybe you shouldn't run it on
untrusted machines. But then again: Personally, I'd not store sensitive
information there. If security is your concern, then don't bloat the
kernel with such things, and then simply don't run it. Every extra
process running can be a security issue. Everyone knows that.

> Try to have a look at how deduplication is working in the case of ZFS
> without offline deduplication.

I didn't investigate the inner workings but I know it needs lots of
RAM.


> >> In other words if someone is thinking that such defragmentation
> >> daemon is solving any problems he/she may be 100% right .. such
> >> person is only *thinking* that this is truth.  
> >
> > Bees is not about that.  
> 
> I've been only trying to say that I would be really surprised if bees
> will be taking care of such scenarios.

It at least tries to not be totally inefficient and as far as I read
the code and docs, it removes extent slack by recombining and
resplitting extents using data-safe kernel operations. But not for the
sake of defragmenting.


> >> So first show that fragmentation is hurting latency of the
> >> access to btrfs data and it will be possible to measure such
> >> impact. Before you start measuring this you need to learn how
> >> to sample, for example, VFS layer latency. Do you know how to do
> >> this to deliver such proof?
> >
> > You didn't get the point. You only read "defragmentation" and your
> > alarm lights lit up. You even think bees would be a defragmenter. It
> > probably is more the opposite because it introduces more fragments
> > in exchange for more reflinks.  
> 
> So you are asking to start investing in the development time
> implementing something without proving or demonstrating that problem
> is real?

No, you did ask for it between the lines. You are talking about
latencies of a single access. It is probably no problem. BTW: You don't
need to prove that to me.

But - personal experience - when searching the system journal takes me
30-40s, and after I defragmented the file it takes just 3-4
seconds? What does this have to do with VFS layer latencies? Nothing!

I'm even in the same boat with you, saying that the many file accesses
are still all low latency at the VFS layer. But boy, there are so many
more of them! That is perceived performance. Fragmentation makes a
performance difference. It takes no scientific approach to believe that.
The fix is already implemented: defrag the extents. The kernel has an
IOCTL for this.
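
For instance, this is roughly the kind of before/after check I have in
mind (the journal path and the -t target extent size are just what I'd
try, adjust as needed):

  # filefrag /var/log/journal/*/system.journal    # extent count before
  # sync; echo 3 > /proc/sys/vm/drop_caches
  # time journalctl --since=yesterday --no-pager > /dev/null
  # btrfs filesystem defragment -t 32M /var/log/journal/*/system.journal
  # filefrag /var/log/journal/*/system.journal    # extent count after
  # sync; echo 3 > /proc/sys/vm/drop_caches
  # time journalctl --since=yesterday --no-pager > /dev/null

Keep in mind, per the man page excerpt quoted earlier in this thread,
that defragmenting will break up reflinks to snapshotted copies of the
file.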

Now, leverage the tools for it: To fasten a screw, you use a screw
driver. You don't build it yourself, you take it from your toolbox. The
screw is already there, the screw driver is there. Nothing to invent.
McGyver wouldn't build one himself when one was already lying around.


> No matter how long someone will be thinking about this it will change
> nothing.

Probably the right conclusion. So let's take the tools that are here,
or switch to a better fitting file system (which, btw, is also a tool
that is available).


> [..]
> > Can we please not start a flame war just because you hate defrag
> > tools?  
> 
> Really, I have no idea where I wrote that I hate defragmentation.
> Using ZFS as a working, real example I've only told you that the
> necessity to reduce fragmentation is NULL if you follow that exact
> path.

Yes, I'll provide data for systemd journal access. And please, not
another thread about that application.


> In your world you are trying to tell me that your keys do not match
> the lock in the door.

No, the key is just under the carpet. Use it, and turn it in the right
direction.


> I'm only trying to tell you that there are many doors without a
> keyhole which can be opened and closed.

That is insecure. *scnr


> I can only repeat that to trigger some action about defragmentation
> you first need to *present* some case scenario exposing that the
> problem is real. I may even believe that you may be right, but
> engineering is not something to which the "believe" term can be applied.

Okay, no more hints about useful software because btrfs already has
everything you ever need.

Seriously, I didn't ask for fixing anything in btrfs. I hinted at two
tools that the OP could benefit from when using snapshots and handling
fragmented files and asking for best practice. And I didn't recommend
to defragment the whole filesystem all day long because it will give
you a speed boost of 100+%.

You jumped the train and said that defragmentation is never needed,
because btrfs does all this perfectly already, while later telling how
much better zfs does everything, then telling that extent allocation is
the problem. Ah yes, we get to the point... But well, that's a
non-issue because VFS latencies are not the problem unless I
scientifically prove it. No one wanted to go so far and deep. Really.

Fragmented files with lots of small extents? Defragment this file. Did
it help?

  If yes, okay, that's your tool; the problem comes from the CoW
nature. Also, please use bees if you are planning to defrag files that
are part of snapshot reflinks or undo operations. Maybe btrfs doesn't
fit your workload then.

  If no, okay let's look at the underlying problem. Now it's time to do
all this scientific stuff and so on.

But this has totally been hijacked with no chance for the OP to follow
this thread sanely.


> Intuition may be tricking you here into thinking that as long as the
> impact is non-zero someone should take care of it.

Yes, if access to the file is slow, I rewrite it with some tool, and
now it's fast. I must have been totally tricked. God, how dare I
measure the time with a clock and not some block tracing debug tool
from the kernel...

And if I rearrange boot files on a spindle and the system comes up in
30s now like a fresh build instead of in 2 minutes... I must have been
tricked. Okay, it was Windows. But really, tell me: What does Windows
do during boot that Linux wouldn't do? Read files? Nah... I can deduce
that it has an effect even on Linux; I'm still working on finding and
making the right tool for it, while in the meantime I've circumvented
it with bcache.

And please, I don't use those shiny snake oil defraggers with even
counterproductive effects on the file system. I'm not a dumb non-tech
reader born this millennium, I'm not clicking those click-bait articles
"defragment your harddrive for speed". I'm looking into the technical
workings behind this (and other stuff) for almost 30 years now. There are
only very very few tools available that do defrag right. And I know
exactly 2, one for NTFS, one for ext3.

But in the FOSS world, I can at least improve that. But maybe I
shouldn't even try, because there is no problem. And there's nothing to
fix.


> No. If this impact is small enough it can be ignored, just as we
> ignore some consequences of quantum physics in our lives
> (the probability that a bucket of water standing on an open fire will
> freeze instead of boil is, according to quantum physics, always
> non-zero, and despite this fact no one has been able to observe it).
> In other words you need to show some *real numbers* which will show
> SCALE of the issue.

Quantum physics is - literally - when you try to plug your USB thumb
drive and it doesn't fit, turn it around, try again, and it doesn't
fit, then look at it and try again, and it fits. And that is a perfect
example of what the Schrödinger experiment really stands for.

Try that with your water example, it won't work so easily. ;-)


-- 
Regards,
Kai

Replies to list-only preferred.



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-09-14 18:53           ` Austin S. Hemmelgarn
@ 2017-09-15  2:26             ` Tomasz Kłoczko
  2017-09-15 12:23               ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 56+ messages in thread
From: Tomasz Kłoczko @ 2017-09-15  2:26 UTC (permalink / raw)
  To: linux-btrfs

On 14 September 2017 at 19:53, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
[..]
> While it's not for BTRFS, a tool called e4rat might be of interest to you
> regarding this.  It reorganizes files on an ext4 filesystem so that stuff
> used by the boot loader is right at the beginning of the device, and I've
> known people to get insane performance improvements (on the order of 20x in
> some pathologically bad cases) in the time taken from the BIOS handing
> things off to GRUB to GRUB handing execution off to the kernel.

Do you know that what you've just written has nothing to do with fragmentation?
Intentionally or not, you are just trying to change the subject.

[..]
> This shouldn't need examples.  It's trivial math combined with basic
> knowledge of hardware behavior.  Every request to a device has a minimum
> amount of overhead.  On traditional hard drives, this is usually dominated
> by seek latency, but on SSD's, the request setup, dispatch, and completion
> are the dominant factor.  Assuming you have a 2 micro-second overhead
> per-request (not an exact number, just chosen for demonstration purposes
> because it makes the math easy), and a 1GB file, the time difference between
> reading ten 100MB extents and reading ten thousand 100kB extents is just
> short of 0.02 seconds, or a factor of about one thousand (which, no surprise
> here, is the factor of difference between the number of extents).

So to produce a few seconds of delay during boot you would need to make
a few hundred thousand if not millions more IOs, even when reading
everything using ideal long sequential reads.
Almost every package upgrade rewrites some files 100%, which by using
COW will produce fully contiguous areas per file.
You know .. there are not so many files in a typical distribution
installation to produce such a measurable impact.
On my current laptop I have a lot of devel and debug stuff installed
and still I have only

$ rpm -qal | wc -l
276489

files (from which only small fractions are ELF DSOs or executables)
installed by:

$ rpm -qa | wc -l
2314

packages.

I can bet that during even a very complicated boot process only a few
hundred files will be touched (by read IOs). None of those files
will be read sequentially because this is not how executable content
is usually loaded into the buffer cache. Simply changing the block
device read-ahead may improve boot time enough without putting all
blocks in perfect order. All you need is to run "blockdev --setra N"
early enough, where N is greater than the default 256. All this can be
done without thinking about fragmentation.
It seems you don't know that Linux by default reads data from a block
device in chunks of at least 256 sectors (512 bytes each, i.e. 128 KB)
because such an IO size is part of the default read-ahead settings. You
can raise those settings just for boot time and you will have a much
lower number of IOs, and still no significant improvement like a few
times shorter boot time. Fragmentation will in such a case be a
secondary factor.
All this could be done without bothering about fragmentation.
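
For the record, a quick sketch of what I mean (the device name is just
an example; the value is in 512-byte sectors):

  # blockdev --getra /dev/sda         # default: 256 sectors = 128 KB
  # blockdev --setra 16384 /dev/sda   # 8 MB read-ahead, set early in boot
  # blockdev --setra 256 /dev/sda     # restore the default afterwards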

In other words you are still talking about some theoretically
possible results which will be falsified if you try even once
to do some real tests and measurements.
The last time I did some boot time measurements it was
about sequentially starting all services vs. maximum
parallelization. And yes, by this it was possible to improve boot time
by a few times. All without bothering about fragmentation.

The current Fedora systemd base service definitions can be improved in
many places by adding more dependencies and executing many small
services in parallel. All those corrections can be done without even
thinking about fragmentation. Because this base set of systemd services
comes with the systemd source code, those improvements can be done for
almost all systemd-based Linux distros.

kloczek

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-09-14 17:48         ` Tomasz Kłoczko
  2017-09-14 18:53           ` Austin S. Hemmelgarn
  2017-09-14 20:17           ` Kai Krakow
@ 2017-09-15 10:54           ` Michał Sokołowski
  2017-09-15 11:13             ` Peter Grandi
  2017-09-15 13:07             ` Tomasz Kłoczko
  2 siblings, 2 replies; 56+ messages in thread
From: Michał Sokołowski @ 2017-09-15 10:54 UTC (permalink / raw)
  To: Tomasz Kłoczko; +Cc: Linux fs Btrfs

[-- Attachment #1: Type: text/plain, Size: 687 bytes --]

On 09/14/2017 07:48 PM, Tomasz Kłoczko wrote:
> On 14 September 2017 at 16:24, Kai Krakow <hurikhan77@gmail.com> wrote:
> [..]
>> > Getting e.g. boot files into read order or at least nearby improves
>> > boot time a lot. Similar for loading applications.
> [...]
> Please just give some example which I can try to replay, which will
> show whether we get similar results.

Case #1
2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu cow2 storage
-> guest BTRFS filesystem
SQL table row insertions per second: 1-2

Case #2
2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu raw storage ->
guest EXT4 filesystem
SQL table row insertions per second: 10-15



[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3849 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-09-15 10:54           ` Michał Sokołowski
@ 2017-09-15 11:13             ` Peter Grandi
  2017-09-15 13:07             ` Tomasz Kłoczko
  1 sibling, 0 replies; 56+ messages in thread
From: Peter Grandi @ 2017-09-15 11:13 UTC (permalink / raw)
  To: Linux fs Btrfs

> Case #1
> 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu cow2 storage
> -> guest BTRFS filesystem
> SQL table row insertions per second: 1-2

"Doctor, if I stab my hand with a fork it hurts a lot: can you
cure that?"

> Case #2
> 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu raw
> storage -> guest EXT4 filesystem
> SQL table row insertions per second: 10-15

"Doctor, I can't run as fast with a backpack full of bricks as
without it: can you cure that?"

:-)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-09-15  2:26             ` Tomasz Kłoczko
@ 2017-09-15 12:23               ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 56+ messages in thread
From: Austin S. Hemmelgarn @ 2017-09-15 12:23 UTC (permalink / raw)
  To: Tomasz Kłoczko, linux-btrfs

On 2017-09-14 22:26, Tomasz Kłoczko wrote:
> On 14 September 2017 at 19:53, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
> [..]
>> While it's not for BTRFS, a tool called e4rat might be of interest to you
>> regarding this.  It reorganizes files on an ext4 filesystem so that stuff
>> used by the boot loader is right at the beginning of the device, and I've
>> known people to get insane performance improvements (on the order of 20x in
>> some pathologically bad cases) in the time taken from the BIOS handing
>> things off to GRUB to GRUB handing execution off to the kernel.
> 
> Do you know that what you've just written has nothing to do with fragmentation?
> Intentionally or not, you are just trying to change the subject.
As hard as it may be to believe, this _is_ relevant to the part of your 
reply that I was responding to, namely:

 > By how much it is possible to improve boot time?

Note that discussion of file ordering impacting boot times is almost 
always centered around the boot loader, and _not_ userspace (because as 
you choose to focus on in changing the subject for the rest of this 
message, it's trivially possible to improve performance in userspace 
with some really simple tweaks).

You wanted examples regarding reordering of data in a localized manner 
improving boot time, so I gave _the_ reference for this on Linux (e4rat 
is the only publicly available tool I know of that does this).
> 
> [..]
>> This shouldn't need examples.  It's trivial math combined with basic
>> knowledge of hardware behavior.  Every request to a device has a minimum
>> amount of overhead.  On traditional hard drives, this is usually dominated
>> by seek latency, but on SSD's, the request setup, dispatch, and completion
>> are the dominant factor.  Assuming you have a 2 micro-second overhead
>> per-request (not an exact number, just chosen for demonstration purposes
>> because it makes the math easy), and a 1GB file, the time difference between
>> reading ten 100MB extents and reading ten thousand 100kB extents is just
>> short of 0.02 seconds, or a factor of about one thousand (which, no surprise
>> here, is the factor of difference between the number of extents).
> 
> So to produce a few seconds of delay during boot you would need to make
> a few hundred thousand if not millions more IOs, even when reading
> everything using ideal long sequential reads.
No, that isn't what I was talking about.  Quit taking things out of 
context and assuming all of someone's reply is about only part of yours.

This was responding solely to this:

 > That may be an issue with using extents.
 > Again: please show some results of some test unit which anyone will be
 > able to replay and confirm or not that this effect really exists.

And has nothing to do with boot time.

> Almost every package upgrade rewrites some files 100%, which by using
> COW will produce fully contiguous areas per file.
> You know .. there are not so many files in a typical distribution
> installation to produce such a measurable impact.
> On my current laptop I have a lot of devel and debug stuff installed
> and still I have only
> 
> $ rpm -qal | wc -l
> 276489
> 
> files (from which only small fractions are ELF DSOs or executables)
> installed by:
> 
> $ rpm -qa | wc -l
> 2314
> 
> packages.
> 
> I can bet that during even a very complicated boot process only a few
> hundred files will be touched (by read IOs). None of those files
> will be read sequentially because this is not how executable content
> is usually loaded into the buffer cache. Simply changing the block
> device read-ahead may improve boot time enough without putting all
> blocks in perfect order. All you need is to run "blockdev --setra N"
> early enough, where N is greater than the default 256. All this can be
> done without thinking about fragmentation.
As I mentioned above, the primary argument for reordering data for boot 
is largely related to the boot-loader, which doesn't have intelligent 
I/O scheduling and doesn't do read ahead, and is primarily about usage 
with traditional hard drives, where seek latency caused by lack of data 
locality actually does have a significant (and well documented) impact.

> It seems you don't know that Linux by default reads data from a block
> device in chunks of at least 256 sectors (512 bytes each, i.e. 128 KB)
> because such an IO size is part of the default read-ahead settings. You
> can raise those settings just for boot time and you will have a much
> lower number of IOs, and still no significant improvement like a few
> times shorter boot time. Fragmentation will in such a case be a
> secondary factor.
> All this could be done without bothering about fragmentation.
The block-level read-ahead done by the kernel has near zero impact on 
performance unless your data is already highly local (not necessarily 
ordered, but at least all in the same place), which will almost never be 
the case on BTRFS when dealing with an active data set because of its 
copy on write semantics.
> 
> In other words you are still talking about some theoretically
> possible results which will be falsified if you try even once
> to do some real tests and measurements.
> The last time I did some boot time measurements it was
> about sequentially starting all services vs. maximum
> parallelization. And yes, by this it was possible to improve boot time
> by a few times. All without bothering about fragmentation.
> 
> The current Fedora systemd base service definitions can be improved in
> many places by adding more dependencies and executing many small
> services in parallel. All those corrections can be done without even
> thinking about fragmentation. Because this base set of systemd services
> comes with the systemd source code, those improvements can be done for
> almost all systemd-based Linux distros.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-09-15 10:54           ` Michał Sokołowski
  2017-09-15 11:13             ` Peter Grandi
@ 2017-09-15 13:07             ` Tomasz Kłoczko
  2017-09-15 14:11               ` Michał Sokołowski
  1 sibling, 1 reply; 56+ messages in thread
From: Tomasz Kłoczko @ 2017-09-15 13:07 UTC (permalink / raw)
  To: Michał Sokołowski; +Cc: Linux fs Btrfs

On 15 September 2017 at 11:54, Michał Sokołowski <michal@sarach.com.pl> wrote:
[..]
>> Please just give some example which I can try to replay, which will
>> show whether we get similar results.
>
> Case #1
> 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu cow2 storage
> -> guest BTRFS filesystem
> SQL table row insertions per second: 1-2
>
> Case #2
> 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu raw storage ->
> guest EXT4 filesystem
> SQL table row insertions per second: 10-15

Q -1) why are you comparing btrfs against ext4 on top of btrfs, which
is doing its own COW operations at the bottom of such a sandwich, if
we are SUPPOSED to be talking about the impact of fragmentation on top
of btrfs?
Q 0) what do you think that you measure here?
Q 1) how did you produce those time measurements? time command?
looking on the watch?
Q 2) why there are ranges of timings? did you repeat some operations
few times (how many times and with or without dropping caches or doing
reboots?)
Q 3) What kind of SQL engine? with what kind of settings? with what
kind of tables? (indexes? foreign keys?) What kind of transactions
semantics?
Q 4) where is the example set of inserts which I can replay in my
setup? did you drop caches before the batch of inserts? (do you know
that every insert also generates some number of read IOs, so whether
something is already cached before the batch of inserts is *crucial*)
Did you restart the SQL engine?
Q 5) were both tests executed on the same box? if not, which
versions of the kernel(s) were used?
Q 6) effectively how many IOs were done during those tests? how
did you measure those numbers (dtrace? perf? systemtap?)
Q 7) why are you running your tests over qemu? Was anything else
running on the host system during those tests?
.
.
.
I can probably make this list of questions 2 or 3 times longer.

koczek
--
Tomasz Kłoczko | LinkedIn: http://lnkd.in/FXPWxH

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-09-15 13:07             ` Tomasz Kłoczko
@ 2017-09-15 14:11               ` Michał Sokołowski
  2017-09-15 16:35                 ` Peter Grandi
  2017-09-15 17:08                 ` Kai Krakow
  0 siblings, 2 replies; 56+ messages in thread
From: Michał Sokołowski @ 2017-09-15 14:11 UTC (permalink / raw)
  To: Tomasz Kłoczko; +Cc: Linux fs Btrfs

[-- Attachment #1: Type: text/plain, Size: 2877 bytes --]

On 09/15/2017 03:07 PM, Tomasz Kłoczko wrote:
> [...]
> Case #1
> 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu cow2 storage
> -> guest BTRFS filesystem
> SQL table row insertions per second: 1-2
>
> Case #2
> 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu raw storage ->
> guest EXT4 filesystem
> SQL table row insertions per second: 10-15
> Q -1) why are you comparing btrfs against ext4 on top of btrfs, which
> is doing its own COW operations at the bottom of such a sandwich, if
> we are SUPPOSED to be talking about the impact of fragmentation on
> top of btrfs?

Tomasz,
you seem to be convinced that fragmentation does not matter. I found
this (extremely bad, true) example says otherwise.

> Q 0) what do you think that you measure here?

Cow's fragmentation impact on SQL write performance.

> Q 1) how did you produce those time measurements? time command?
> looking on the watch?

Time command (real) of bash script inserting 1000 rows (index and 128B
random string).
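
Roughly this kind of thing (a from-memory sketch, not the exact script
we ran; the database and table names are made up):

  psql -q -c 'CREATE TABLE IF NOT EXISTS frag_test (id int, payload text);' testdb
  time for i in $(seq 1 1000); do
      payload=$(head -c 96 /dev/urandom | base64 | tr -d '\n')   # 128 chars
      psql -q -c "INSERT INTO frag_test VALUES ($i, '$payload');" testdb
  done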

> Q 2) why there are ranges of timings? did you repeat some operations
> few times (how many times and with or without dropping caches or doing
> reboots?)

Yes, we've repeated it. With and without flushing the cache (it didn't
seem to have any impact). I cannot remember whether there were any
reboots. Those big time ranges are because I don't have exact numbers on
me. It was a quick and dirty task to find, prove and remove a performance
bottleneck at minimal cost. AFAIR removing storage cow2 and guest BTRFS
storage gave us ~ 10 times boost. Surprisingly for us this boost seems
to be consistent (it does not degrade noticeably over time - 2 months
from the change).

> Q 3) What kind of SQL engine? with what kind of settings? with what
> kind of tables? (indexes? foreign keys?) What kind of transactions
> semantics?

PostgreSQL and MySQL both gave us those results. *

> Q 4) where is the example set of inserts which I can replay in my
> setup? did you drop caches before the batch of inserts? (do you know
> that every insert also generates some number of read IOs, so whether
> something is already cached before the batch of inserts is *crucial*)
> Did you restart the SQL engine?
> Q 5) were both tests executed on the same box? if not, which
> versions of the kernel(s) were used?

Same distribution, machine and kernel. *

> Q 6) effectively how many IOs were done during those tests? how
> did you measure those numbers (dtrace? perf? systemtap?)

I didn't check that. *

> Q 7) why are you running your tests over qemu? Was anything else
> running on the host system during those tests?

Because of "production" environment location. No, there was not.

*) If you're really interested (which I doubt), then I can put an
example environment somewhere and gather more data.


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3849 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-09-15 14:11               ` Michał Sokołowski
@ 2017-09-15 16:35                 ` Peter Grandi
  2017-09-15 17:08                 ` Kai Krakow
  1 sibling, 0 replies; 56+ messages in thread
From: Peter Grandi @ 2017-09-15 16:35 UTC (permalink / raw)
  To: Linux fs Btrfs

[ ... ]
>>>> Case #1
>>>> 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs
>>>> -> qemu cow2 storage -> guest BTRFS filesystem
>>>> SQL table row insertions per second: 1-2

>>>> Case #2
>>>> 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs
>>>> -> qemu raw storage -> guest EXT4 filesystem
>>>> SQL table row insertions per second: 10-15
[ ... ]

>> Q 0) what do you think that you measure here?

> Cow's fragmentation impact on SQL write performance.

That's not what you are measuring, you are measuring the impact on
speed of configurations "designed" (perhaps unintentionally) for
maximum flexibility, lowest cost, and complete disregard for
speed.

[ ... ]

> It was quick and dirty task to find, prove and remove
> performance bottleneck at minimal cost.

This is based on the usual confusion between "performance" (the
result of several tradeoffs) and "speed". When you report "row
insertions per second" you are reporting a rate, that is a
"speed", not "performance", which is always multi-dimensional.
http://www.sabi.co.uk/blog/15-two.html?151023#151023

In the cases above speed is low, but I think that, taking into
account flexibility and cost, performance is pretty good.

> AFAIR removing storage cow2 and guest BTRFS storage gave us ~
> 10 times boost.

"Oh doctor, if I stop stabbing my hand with a fork it no longer
hurts, but running while carrying a rucksack full of bricks is
still slower than with a rucksack full of feathers".

[ ... ]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-09-15 14:11               ` Michał Sokołowski
  2017-09-15 16:35                 ` Peter Grandi
@ 2017-09-15 17:08                 ` Kai Krakow
  2017-09-15 19:10                   ` Tomasz Kłoczko
  1 sibling, 1 reply; 56+ messages in thread
From: Kai Krakow @ 2017-09-15 17:08 UTC (permalink / raw)
  To: linux-btrfs

Am Fri, 15 Sep 2017 16:11:50 +0200
schrieb Michał Sokołowski <michal@sarach.com.pl>:

> On 09/15/2017 03:07 PM, Tomasz Kłoczko wrote:
> > [...]
> > Case #1
> > 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu cow2
> > storage -> guest BTRFS filesystem  
> > SQL table row insertions per second: 1-2
> >
> > Case #2
> > 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu raw
> > storage -> guest EXT4 filesystem
> > SQL table row insertions per second: 10-15
> > Q -1) why are you comparing btrfs against ext4 on top of btrfs,
> > which is doing its own COW operations at the bottom of such a
> > sandwich, if we are SUPPOSED to be talking about the impact of
> > fragmentation on top of btrfs?
> 
> Tomasz,
> you seem to be convinced that fragmentation does not matter. I found
> this (extremely bad, true) example says otherwise.

Sorry to jump in on this, but did you at least set the qemu image to
nocow? Otherwise this example is totally flawed because you're mostly
testing the qemu storage layer and not btrfs.

A better test would've been to test qemu raw on btrfs cow vs on btrfs
nocow, with both the same file system inside the qemu image.

But you are modifying multiple parameters at once during the test, and
I expect each one has a huge impact on performance, but only one is
specific to btrfs, which you apparently did not test this way.

Personally, I find running qemu cow2 on btrfs cow helps nothing except
producing really bad performance. Make one of the two layers nocow and
it should get better.

If you want to give some better numbers, please reduce this test to
just one cow layer, the one at the top layer: btrfs host fs. Copy the
image somewhere else to restore from, and ensure (using filefrag) that
the starting situation matches each test run.

Don't change any parameters of the qemu layer at each test. And run a
file system inside which doesn't do any fancy stuff, like ext2 or ext3
without journal. Use qemu raw storage.

Then test again with cow vs nocow on the host side.

Create a nocow copy of your image (if you pre-size it with truncate,
use the size of the source image; note that the nocow attribute has to
be set while the file is still empty):

# rm -f qemu-image-nocow.raw
# touch qemu-image-nocow.raw
# chattr +C -c qemu-image-nocow.raw
# dd if=source-image.raw of=qemu-image-nocow.raw bs=1M
# btrfs fi defrag -f qemu-image-nocow.raw
# filefrag -v qemu-image-nocow.raw

Create a cow copy of your image:

# rm -f qemu-image-cow.raw
# touch qemu-image-cow.raw
# chattr -C -c qemu-image-cow.raw
# dd if=source-image.raw of=qemu-image-cow.raw bs=1M
# btrfs fi defrag -f qemu-image-cow.raw
# filefrag -v qemu-image-cow.raw

Given that host btrfs is mounted datacow,compress=none and without
autodefrag, and you don't touch the source image contents during tests.

Now run your test script inside both qemu machines, take your
measurements and check fragmentation again after the run.

filefrag should report no more fragments than before the test for the
first test, but should report a magnitude more for the second test.

Now copy (cp) both one at a time to a new file and measure the time. It
should be slower for the highly fragmented version.

Don't forget to run tests with and without flushed caches so we get
cold and warm numbers.
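
Something along these lines for the cold-cache read timing (just a
sketch; reading to /dev/null to take the write side out of the
equation, adjust paths to wherever the images live):

  # sync; echo 3 > /proc/sys/vm/drop_caches
  # time dd if=qemu-image-nocow.raw of=/dev/null bs=1M
  # sync; echo 3 > /proc/sys/vm/drop_caches
  # time dd if=qemu-image-cow.raw of=/dev/null bs=1M

Run each one a second time without dropping caches to get the warm
numbers.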

In this scenario, qemu would only be the application to modify the raw
image files and you're actually testing the impact of fragmentation of
btrfs.

You could also make a reflink copy of the nocow test image and do a
third test to see that it introduces fragmentation now, tho probably
much lower than for the cow test image. You can verify the numbers with
filefrag.

According to Tomasz, your tests should not run at vastly different
speeds because fragmentation has no impact on performance, quod est
demonstrandum... I think we will not get to the "erat" part.


-- 
Regards,
Kai

Replies to list-only preferred.



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-09-15 17:08                 ` Kai Krakow
@ 2017-09-15 19:10                   ` Tomasz Kłoczko
  2017-09-20  6:38                     ` Dave
  2017-09-20  7:34                     ` Dmitry Kudriavtsev
  0 siblings, 2 replies; 56+ messages in thread
From: Tomasz Kłoczko @ 2017-09-15 19:10 UTC (permalink / raw)
  To: Linux fs Btrfs

On 15 September 2017 at 18:08, Kai Krakow <hurikhan77@gmail.com> wrote:
[..]
> According to Tomasz, your tests should not run at vastly different
> speeds because fragmentation has no impact on performance, quod est
> demonstrandum... I think we will not get to the "erat" part.

No. This is not precisely what I'm trying to say.
Now, however, seeing that there is no precise/fully repeatable
methodology for performing the proposed test, I have huge doubts about
whether the reported effect has anything to do with fragmentation or
whether it is a side effect of using COW (which allows gluing some
number of random updates into larger sequential write IOs).

kloczek
-- 
Tomasz Kłoczko  LinkedIn: http://lnkd.in/FXPWxH

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-09-15 19:10                   ` Tomasz Kłoczko
@ 2017-09-20  6:38                     ` Dave
  2017-09-20 11:46                       ` Austin S. Hemmelgarn
                                         ` (2 more replies)
  2017-09-20  7:34                     ` Dmitry Kudriavtsev
  1 sibling, 3 replies; 56+ messages in thread
From: Dave @ 2017-09-20  6:38 UTC (permalink / raw)
  To: Linux fs Btrfs

>On Thu 2017-08-31 (09:05), Ulli Horlacher wrote:
>> When I do a
>> btrfs filesystem defragment -r /directory
>> does it defragment really all files in this directory tree, even if it
>> contains subvolumes?
>> The man page does not mention subvolumes on this topic.
>
>No answer so far :-(
>
>But I found another problem in the man-page:
>
>  Defragmenting with Linux kernel versions < 3.9 or >= 3.14-rc2 as well as
>  with Linux stable kernel versions >= 3.10.31, >= 3.12.12 or >= 3.13.4
>  will break up the ref-links of COW data (for example files copied with
>  cp --reflink, snapshots or de-duplicated data). This may cause
>  considerable increase of space usage depending on the broken up
>  ref-links.
>
>I am running Ubuntu 16.04 with Linux kernel 4.10 and I have several
>snapshots.
>Therefore, I better should avoid calling "btrfs filesystem defragment -r"?
>
>What is the defragmenting best practice?
>Avoid it completly?

My question is the same as the OP in this thread, so I came here to
read the answers before asking. However, it turns out that I still
need to ask something. Should I ask here or start a new thread? (I'll
assume here, since the topic is the same.)

Based on the answers here, it sounds like I should not run defrag at
all. However, I have a performance problem I need to solve, so if I
don't defrag, I need to do something else.

Here's my scenario. Some months ago I built an over-the-top powerful
desktop computer / workstation and I was looking forward to really
fantastic performance improvements over my 6 year old Ubuntu machine.
I installed Arch Linux on BTRFS on the new computer (on an SSD). To my
shock, it was no faster than my old machine. I focused a lot on
Firefox performance because I use Firefox a lot and that was one of
the applications in which I was most looking forward to better
performance.

I tried everything I could think of and everything recommended to me
in various forums (except switching to Windows) and the performance
remained very disappointing.

Then today I read the following:

    Gotchas - btrfs Wiki
    https://btrfs.wiki.kernel.org/index.php/Gotchas

    Fragmentation: Files with a lot of random writes can become
heavily fragmented (10000+ extents) causing excessive multi-second
spikes of CPU load on systems with an SSD or a large amount of RAM. On
desktops this primarily affects application databases (including
Firefox). Workarounds include manually defragmenting your home
directory using btrfs fi defragment. Auto-defragment (mount option
autodefrag) should solve this problem.

Upon reading that I am wondering if fragmentation in the Firefox
profile is part of my issue. That's one thing I never tested
previously. (BTW, this system has 256 GB of RAM and 20 cores.)

Furthermore, on the same BTRFS Wiki page, it mentions the performance
penalties of many snapshots. I am keeping 30 to 50 snapshots of the
volume that contains the Firefox profile.

Would these two things be enough to turn top-of-the-line hardware into
a mediocre-performing desktop system? (The system performs fine on
benchmarks -- it's real life usage, particularly with Firefox where it
is disappointing.)

After reading the info here, I am wondering if I should make a new
subvolume just for my Firefox profile(s) and not use COW and/or not
keep snapshots on it and mount it with the autodefrag option.

As part of this strategy, I could send snapshots to another disk using
btrfs send-receive. That way I would have the benefits of snapshots
(which are important to me), but by not keeping any snapshots on the
live subvolume I could avoid the performance problems.
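
To make that concrete, here is roughly what I have in mind (just a
sketch; the paths, the fstab line and the backup target are my guesses
at how it would be set up, and it assumes the existing profile data
gets copied into the new subvolume afterwards):

  # separate subvolume for the profile; new files created in it are nocow
  btrfs subvolume create /home/user/.mozilla
  chattr +C /home/user/.mozilla
  # mount the filesystem with autodefrag, e.g. in /etc/fstab:
  #   UUID=<fs-uuid>  /home  btrfs  subvol=home,autodefrag,noatime  0 0
  # periodic backup without keeping snapshots on the live subvolume:
  snap=/home/.mozilla-snap-$(date +%F)
  btrfs subvolume snapshot -r /home/user/.mozilla "$snap"
  btrfs send "$snap" | btrfs receive /mnt/backupdisk/
  btrfs subvolume delete "$snap"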

What would you guys do in this situation?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-09-15 19:10                   ` Tomasz Kłoczko
  2017-09-20  6:38                     ` Dave
@ 2017-09-20  7:34                     ` Dmitry Kudriavtsev
  1 sibling, 0 replies; 56+ messages in thread
From: Dmitry Kudriavtsev @ 2017-09-20  7:34 UTC (permalink / raw)
  To: linux-btrfs

I've had a very similar issue with the performance of my laptop dropping to very low levels, eventually solved by uninstalling Snapper, deleting snapshots, and then defragmenting the drive.

This seems to be a common concern, I also had it happen on my desktop.

Dmitry

---

Thank you,
Dmitry Kudriavtsev

https://dkudriavtsev.xyz
inexpensivecomputers.net

⠀⠀⠀⠀⠀⠀⠀⣸⣧⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣰⣿⣿⣆⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⣀⡙⠿⣿⣿⣆⠀⠀⠀⠀⠀Hey, did you hear about that cool new OS? It's called
⠀⠀⠀⠀⣰⣿⣿⣷⣿⣿⣿⣆⠀⠀⠀⠀Arch Linux. I use Arch Linux. Have you ever used Arch
⠀⠀⠀⣰⣿⣿⣿⡿⢿⣿⣿⣿⣆⠀⠀⠀Linux? You should use Arch Linux. Everyone uses Arch!
⠀⠀⣰⣿⣿⣿⡏⠀⠀⢹⣿⣿⠿⡆⠀⠀Check out i3wm too!
⠀⣰⣿⣿⣿⡿⠇⠀⠀⠸⢿⣿⣷⣦⣄⠀
⣼⠿⠛⠉⠀⠀⠀⠀⠀⠀⠀⠀⠉⠛⠿⣦
September 19 2017 11:38 PM, "Dave" <davestechshop@gmail.com> wrote:
>> On Thu 2017-08-31 (09:05), Ulli Horlacher wrote:
>>> When I do a
>>> btrfs filesystem defragment -r /directory
>>> does it defragment really all files in this directory tree, even if it
>>> contains subvolumes?
>>> The man page does not mention subvolumes on this topic.
>> 
>> No answer so far :-(
>> 
>> But I found another problem in the man-page:
>> 
>> Defragmenting with Linux kernel versions < 3.9 or >= 3.14-rc2 as well as
>> with Linux stable kernel versions >= 3.10.31, >= 3.12.12 or >= 3.13.4
>> will break up the ref-links of COW data (for example files copied with
>> cp --reflink, snapshots or de-duplicated data). This may cause
>> considerable increase of space usage depending on the broken up
>> ref-links.
>> 
>> I am running Ubuntu 16.04 with Linux kernel 4.10 and I have several
>> snapshots.
>> Therefore, I better should avoid calling "btrfs filesystem defragment -r"?
>> 
>> What is the defragmenting best practice?
>> Avoid it completly?
> 
> My question is the same as the OP in this thread, so I came here to
> read the answers before asking. However, it turns out that I still
> need to ask something. Should I ask here or start a new thread? (I'll
> assume here, since the topic is the same.)
> 
> Based on the answers here, it sounds like I should not run defrag at
> all. However, I have a performance problem I need to solve, so if I
> don't defrag, I need to do something else.
> 
> Here's my scenario. Some months ago I built an over-the-top powerful
> desktop computer / workstation and I was looking forward to really
> fantastic performance improvements over my 6 year old Ubuntu machine.
> I installed Arch Linux on BTRFS on the new computer (on an SSD). To my
> shock, it was no faster than my old machine. I focused a lot on
> Firefox performance because I use Firefox a lot and that was one of
> the applications in which I was most looking forward to better
> performance.
> 
> I tried everything I could think of and everything recommended to me
> in various forums (except switching to Windows) and the performance
> remained very disappointing.
> 
> Then today I read the following:
> 
> Gotchas - btrfs Wiki
> https://btrfs.wiki.kernel.org/index.php/Gotchas
> 
> Fragmentation: Files with a lot of random writes can become
> heavily fragmented (10000+ extents) causing excessive multi-second
> spikes of CPU load on systems with an SSD or a large amount of RAM. On
> desktops this primarily affects application databases (including
> Firefox). Workarounds include manually defragmenting your home
> directory using btrfs fi defragment. Auto-defragment (mount option
> autodefrag) should solve this problem.
> 
> Upon reading that I am wondering if fragmentation in the Firefox
> profile is part of my issue. That's one thing I never tested
> previously. (BTW, this system has 256 GB of RAM and 20 cores.)
> 
> Furthermore, on the same BTRFS Wiki page, it mentions the performance
> penalties of many snapshots. I am keeping 30 to 50 snapshots of the
> volume that contains the Firefox profile.
> 
> Would these two things be enough to turn top-of-the-line hardware into
> a mediocre-performing desktop system? (The system performs fine on
> benchmarks -- it's real life usage, particularly with Firefox where it
> is disappointing.)
> 
> After reading the info here, I am wondering if I should make a new
> subvolume just for my Firefox profile(s) and not use COW and/or not
> keep snapshots on it and mount it with the autodefrag option.
> 
> As part of this strategy, I could send snapshots to another disk using
> btrfs send-receive. That way I would have the benefits of snapshots
> (which are important to me), but by not keeping any snapshots on the
> live subvolume I could avoid the performance problems.
> 
> What would you guys do in this situation?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-09-20  6:38                     ` Dave
@ 2017-09-20 11:46                       ` Austin S. Hemmelgarn
  2017-09-21 20:10                         ` Kai Krakow
  2017-09-21 11:09                       ` Duncan
  2017-09-21 19:28                       ` Sean Greenslade
  2 siblings, 1 reply; 56+ messages in thread
From: Austin S. Hemmelgarn @ 2017-09-20 11:46 UTC (permalink / raw)
  To: Dave, Linux fs Btrfs

On 2017-09-20 02:38, Dave wrote:
>> On Thu 2017-08-31 (09:05), Ulli Horlacher wrote:
>>> When I do a
>>> btrfs filesystem defragment -r /directory
>>> does it defragment really all files in this directory tree, even if it
>>> contains subvolumes?
>>> The man page does not mention subvolumes on this topic.
>>
>> No answer so far :-(
>>
>> But I found another problem in the man-page:
>>
>>   Defragmenting with Linux kernel versions < 3.9 or >= 3.14-rc2 as well as
>>   with Linux stable kernel versions >= 3.10.31, >= 3.12.12 or >= 3.13.4
>>   will break up the ref-links of COW data (for example files copied with
>>   cp --reflink, snapshots or de-duplicated data). This may cause
>>   considerable increase of space usage depending on the broken up
>>   ref-links.
>>
>> I am running Ubuntu 16.04 with Linux kernel 4.10 and I have several
>> snapshots.
>> Therefore, I better should avoid calling "btrfs filesystem defragment -r"?
>>
>> What is the defragmenting best practice?
>> Avoid it completly?
> 
> My question is the same as the OP in this thread, so I came here to
> read the answers before asking. However, it turns out that I still
> need to ask something. Should I ask here or start a new thread? (I'll
> assume here, since the topic is the same.)
> 
> Based on the answers here, it sounds like I should not run defrag at
> all. However, I have a performance problem I need to solve, so if I
> don't defrag, I need to do something else.
> 
> Here's my scenario. Some months ago I built an over-the-top powerful
> desktop computer / workstation and I was looking forward to really
> fantastic performance improvements over my 6 year old Ubuntu machine.
> I installed Arch Linux on BTRFS on the new computer (on an SSD). To my
> shock, it was no faster than my old machine. I focused a lot on
> Firefox performance because I use Firefox a lot and that was one of
> the applications in which I was most looking forward to better
> performance.
> 
> I tried everything I could think of and everything recommended to me
> in various forums (except switching to Windows) and the performance
> remained very disappointing.
Switching to Windows won't help any more than switching to ext4 would. 
If you were running Chrome, it might (Chrome actually has better 
performance on Windows than Linux by a small margin last time I 
checked), but Firefox gets pretty much the same performance on both 
platforms.
> 
> Then today I read the following:
> 
>      Gotchas - btrfs Wiki
>      https://btrfs.wiki.kernel.org/index.php/Gotchas
> 
>      Fragmentation: Files with a lot of random writes can become
> heavily fragmented (10000+ extents) causing excessive multi-second
> spikes of CPU load on systems with an SSD or large amount a RAM. On
> desktops this primarily affects application databases (including
> Firefox). Workarounds include manually defragmenting your home
> directory using btrfs fi defragment. Auto-defragment (mount option
> autodefrag) should solve this problem.
> 
> Upon reading that I am wondering if fragmentation in the Firefox
> profile is part of my issue. That's one thing I never tested
> previously. (BTW, this system has 256 GB of RAM and 20 cores.)
Almost certainly.  Most modern web browsers are brain-dead and insist on 
using SQLite databases (or traditional DB files) for everything, 
including the cache, and the usage for the cache in particular kills 
performance when fragmentation is an issue.
> 
> Furthermore, on the same BTRFS Wiki page, it mentions the performance
> penalties of many snapshots. I am keeping 30 to 50 snapshots of the
> volume that contains the Firefox profile.
> 
> Would these two things be enough to turn top-of-the-line hardware into
> a mediocre-preforming desktop system? (The system performs fine on
> benchmarks -- it's real life usage, particularly with Firefox where it
> is disappointing.)
Even ignoring fragmentation and reflink issues (it's reflinks, not 
snapshots, that are the issue; snapshots just have tons of reflinks), 
BTRFS is slower than ext4 or XFS simply because it's doing far more 
work.  The difference should have limited impact on an SSD if you get a 
handle on the other issues though.
> 
> After reading the info here, I am wondering if I should make a new
> subvolume just for my Firefox profile(s) and not use COW and/or not
> keep snapshots on it and mount it with the autodefrag option.
> 
> As part of this strategy, I could send snapshots to another disk using
> btrfs send-receive. That way I would have the benefits of snapshots
> (which are important to me), but by not keeping any snapshots on the
> live subvolume I could avoid the performance problems.
> 
> What would you guys do in this situation?
Personally?  Use Chrome or Chromium and turn on the simple cache backend 
(chrome://flags/#enable-simple-cache-backend) which doesn't have issues 
with fragmentation because it doesn't use a database file to store the 
cache and lets the filesystem handle the allocations.  The difference in 
performance in Chrome itself from flipping this switch is pretty amazing 
to be honest.  They're also faster than Firefox in general in my 
experience, but that's a separate discussion.

 From a practical perspective though, if you're using the profile sync 
feature in Firefox, you don't need the checksumming of BTRFS and 
shouldn't need snapshots either (at least, not for that), so through 
some symlink trickery you could put your Firefox profile on another 
filesystem (same for Thunderbird, which has the same issues).
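
As a rough sketch of the symlink approach (the paths are placeholders 
for whatever non-BTRFS filesystem you pick, and Firefox must be closed 
while you do it):

   $ mv ~/.mozilla /mnt/otherfs/mozilla
   $ ln -s /mnt/otherfs/mozilla ~/.mozilla

The same pattern works for ~/.thunderbird.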

Alternatively, if you can afford to have your space usage effectively 
multiplied by the number of snapshots, defragment the FS after every 
snapshot.  That will deal both with the performance issues from 
fragmentation, and the performance issues from reflinks (because defrag 
breaks reflinks).
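
A minimal sketch of that workflow (subvolume and snapshot paths are only 
examples) would be something like:

   $ btrfs subvolume snapshot -r /home /snapshots/home-$(date +%Y%m%d)
   $ btrfs filesystem defragment -r /home

with the caveat, as noted, that the defrag immediately unshares the data 
from the snapshot you just took.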

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-09-20  6:38                     ` Dave
  2017-09-20 11:46                       ` Austin S. Hemmelgarn
@ 2017-09-21 11:09                       ` Duncan
  2017-10-31 21:47                         ` Dave
  2017-09-21 19:28                       ` Sean Greenslade
  2 siblings, 1 reply; 56+ messages in thread
From: Duncan @ 2017-09-21 11:09 UTC (permalink / raw)
  To: linux-btrfs

Dave posted on Wed, 20 Sep 2017 02:38:13 -0400 as excerpted:

> Here's my scenario. Some months ago I built an over-the-top powerful
> desktop computer / workstation and I was looking forward to really
> fantastic performance improvements over my 6 year old Ubuntu machine. I
> installed Arch Linux on BTRFS on the new computer (on an SSD). To my
> shock, it was no faster than my old machine. I focused a lot on Firefox
> performance because I use Firefox a lot and that was one of the
> applications in which I was most looking forward to better performance.
> 
> I tried everything I could think of and everything recommended to me in
> various forums (except switching to Windows) and the performance
> remained very disappointing.
> 
> Then today I read the following:
> 
>     Gotchas - btrfs Wiki https://btrfs.wiki.kernel.org/index.php/Gotchas
> 
>     Fragmentation: Files with a lot of random writes can become
> heavily fragmented (10000+ extents) causing excessive multi-second
> spikes of CPU load on systems with an SSD or large amount a RAM. On
> desktops this primarily affects application databases (including
> Firefox). Workarounds include manually defragmenting your home directory
> using btrfs fi defragment. Auto-defragment (mount option autodefrag)
> should solve this problem.
> 
> Upon reading that I am wondering if fragmentation in the Firefox profile
> is part of my issue. That's one thing I never tested previously. (BTW,
> this system has 256 GB of RAM and 20 cores.)
> 
> Furthermore, on the same BTRFS Wiki page, it mentions the performance
> penalties of many snapshots. I am keeping 30 to 50 snapshots of the
> volume that contains the Firefox profile.
> 
> Would these two things be enough to turn top-of-the-line hardware into a
> mediocre-preforming desktop system? (The system performs fine on
> benchmarks -- it's real life usage, particularly with Firefox where it
> is disappointing.)
> 
> After reading the info here, I am wondering if I should make a new
> subvolume just for my Firefox profile(s) and not use COW and/or not keep
> snapshots on it and mount it with the autodefrag option.
> 
> As part of this strategy, I could send snapshots to another disk using
> btrfs send-receive. That way I would have the benefits of snapshots
> (which are important to me), but by not keeping any snapshots on the
> live subvolume I could avoid the performance problems.
> 
> What would you guys do in this situation?

[FWIW this is my second try at a reply, my first being way too detailed 
and going off into the weeds somewhere, so I killed it.]

That's an interesting scenario indeed, and perhaps I can help, since my 
config isn't near as high end as yours, but I run firefox on btrfs on 
ssds, and have no performance complaints.  The difference is very likely 
due to one or more of the following (FWIW I'd suggest a 4-3-1-2 order, 
tho only 1 and 2 are really btrfs related):

1) I make sure I consistently mount with autodefrag, from the first mount 
after the filesystem is created in order to first populate it, on.  The 
filesystem never gets fragmented, forcing writes to highly fragmented 
free space, in the first place.  (With the past and current effect of the 
ssd mount option under discussion to change, it's possible I'll get more 
fragmentation in the future after ssd doesn't try so hard to find 
reasonably large free-space chunks to write into, but it has been fine so 
far.)
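
For reference, an fstab line with autodefrag looks roughly like this 
(the UUID is of course a placeholder):

   UUID=<fs-uuid>  /home  btrfs  autodefrag,noatime  0 0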

2) Subvolumes and snapshots seemed to me more trouble than they were 
worth, particularly since it's the same filesystem anyway, and if it's 
damaged, it'll take all the subvolumes and snapshots with it.  So I don't 
use them, preferring instead to use real partitioning and more smaller 
fully separate filesystems, some of which aren't mounted by default (and 
root mounted read-only by default), so there's little chance they'll be 
damaged in a crash or filesystem bug damage scenario.  And if there /is/ 
any damage, it's much more limited in scope since all my data eggs aren't 
in the same basket, so maintenance such as btrfs check and scrub take far 
less time (and check far less memory) than they would were it one big 
pool with snapshots.  And if recovery fails too, the backups are likewise 
small filesystems the same size as the working copies, so copying the 
data back over takes far less time as well (not to mention making the 
backups takes less time in the first place, so it's easier to regularly 
update them).

3) Austin mentioned the firefox cache.  I honestly wouldn't know on it, 
since I have firefox configured to use a tmpfs for its cache, so it 
operates at memory speed and gets cleared along with its memory at every 
reboot or tmpfs umount.  My inet speed is fast enough I don't really need 
cache anyway, but it's nice to have it, operating at memory speed, within 
a single boot session... and to have it cleared on reboot.
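
One simple way to get a similar effect (not necessarily exactly how mine 
is configured) is to mount a tmpfs over the disk cache location, e.g. an 
fstab line like this (size and path are just examples):

   tmpfs  /home/you/.cache/mozilla  tmpfs  noatime,nodev,nosuid,size=512M  0 0

or alternatively to point browser.cache.disk.parent_directory in 
about:config at a directory that already lives on a tmpfs.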


4) This one was the biggest one for me for awhile.

Is firefox running in multi-process mode?  If you don't know, go to 
about:support, and look in the Application Basics section, at the 
Multiprocess Windows entry and the Web Content Processes entry.  When you 
have multiple windows open it should show something like 2/2 (for two 
windows open, tho you won't get 20/20 for 20 windows open) for windows, 
and n/7 (tho I believe the default is 4 instead of 7, I've upped mine) 
for content processes, with n going up toward 7 (or 4) if you have 
multiple tabs/windows open playing video or the like.

If you're stuck at a single process that'll be a *BIG* drag on 
performance, particularly when playing youtube full-screen or the like.  
There are various reasons you might get stuck at a single process, 
including extensions that aren't compatible with "electrolysis" (aka e10s, 
this being the mozilla code name for multi-process firefox), and the one 
that was my problem after I ensured all my extensions were e10s 
compatible -- I was trying to run the upstream firefox binary, which is 
now pulseaudio-only (no more direct alsa support), with apulse as a 
pulseaudio substitute, and apulse is apparently single-process-only 
(forcing multi-process would crash the tabs as soon as I tried navigating 
away from about:whatever to anything remote).

Once I figured that out I switched back to using the gentoo firefox ebuild 
and enabling the alsa USE flag instead of pulseaudio there.  That got 
multiprocess working, and it was *MUCH* more responsive, as I figured 
it should be! =:^)

If you find you're stuck at single process (remember, check with at least 
two windows open) and need help with it, yell.  Because it'll make a 
*HUGE* difference.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-09-20  6:38                     ` Dave
  2017-09-20 11:46                       ` Austin S. Hemmelgarn
  2017-09-21 11:09                       ` Duncan
@ 2017-09-21 19:28                       ` Sean Greenslade
  2 siblings, 0 replies; 56+ messages in thread
From: Sean Greenslade @ 2017-09-21 19:28 UTC (permalink / raw)
  To: Dave, Linux fs Btrfs

On September 19, 2017 11:38:13 PM PDT, Dave <davestechshop@gmail.com> wrote:
>>On Thu 2017-08-31 (09:05), Ulli Horlacher wrote:
> <snip>
>Here's my scenario. Some months ago I built an over-the-top powerful
>desktop computer / workstation and I was looking forward to really
>fantastic performance improvements over my 6 year old Ubuntu machine.
>I installed Arch Linux on BTRFS on the new computer (on an SSD). To my
>shock, it was no faster than my old machine. I focused a lot on
>Firefox performance because I use Firefox a lot and that was one of
>the applications in which I was most looking forward to better
>performance.
>
> <snip>
>
>What would you guys do in this situation?

Check out profile sync daemon:

https://wiki.archlinux.org/index.php/profile-sync-daemon

It keeps the active profile files in a ramfs, periodically syncing them back to disk. It works quite well on my 7 year old netbook.

--Sean


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-09-20 11:46                       ` Austin S. Hemmelgarn
@ 2017-09-21 20:10                         ` Kai Krakow
  2017-09-21 23:30                           ` Dave
                                             ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Kai Krakow @ 2017-09-21 20:10 UTC (permalink / raw)
  To: linux-btrfs

On Wed, 20 Sep 2017 07:46:52 -0400,
"Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:

> >      Fragmentation: Files with a lot of random writes can become
> > heavily fragmented (10000+ extents) causing excessive multi-second
> > spikes of CPU load on systems with an SSD or large amount a RAM. On
> > desktops this primarily affects application databases (including
> > Firefox). Workarounds include manually defragmenting your home
> > directory using btrfs fi defragment. Auto-defragment (mount option
> > autodefrag) should solve this problem.
> > 
> > Upon reading that I am wondering if fragmentation in the Firefox
> > profile is part of my issue. That's one thing I never tested
> > previously. (BTW, this system has 256 GB of RAM and 20 cores.)  
> Almost certainly.  Most modern web browsers are brain-dead and insist
> on using SQLite databases (or traditional DB files) for everything, 
> including the cache, and the usage for the cache in particular kills 
> performance when fragmentation is an issue.

At least in Chrome, you can turn on the simple cache backend, which, I
think, uses many small files instead of one huge file. This suits btrfs
much better:

chrome://flags/#enable-simple-cache-backend


And then I suggest also doing this (as your login user):

$ cd $HOME
$ mv .cache .cache.old
$ mkdir .cache
$ lsattr +C .cache
$ rsync -av .cache.old/ .cache/
$ rm -Rf .cache.old

This makes caches for most applications nocow. Chrome performance was
completely fixed for me by doing this.

I'm not sure where Firefox puts its cache, as I only use it on very rare
occasions. But I think it was going to .cache/mozilla the last time I
looked at it.

You may want to close all apps before converting the cache directory.

Also, I don't see any downsides to making this nocow. That directory
could easily also be completely volatile. If something breaks because it
is no longer protected by data csums, just clean it out.


-- 
Regards,
Kai

Replies to list-only preferred.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-09-21 20:10                         ` Kai Krakow
@ 2017-09-21 23:30                           ` Dave
  2017-09-21 23:58                           ` Kai Krakow
  2017-09-22 11:22                           ` Austin S. Hemmelgarn
  2 siblings, 0 replies; 56+ messages in thread
From: Dave @ 2017-09-21 23:30 UTC (permalink / raw)
  To: Linux fs Btrfs

These are great suggestions. I will test several of them (or all of
them) and report back with my results once I have done the testing.
Thank you! This is a fantastic mailing list.

P.S. I'm inclined to stay with Firefox, but I will definitely test
Chromium vs Firefox after making a series of changes based on the
suggestions here. I would hate to see the market lose the option of
Firefox because everyone goes to Chrome/Chromium.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-09-21 20:10                         ` Kai Krakow
  2017-09-21 23:30                           ` Dave
@ 2017-09-21 23:58                           ` Kai Krakow
  2017-09-22 11:22                           ` Austin S. Hemmelgarn
  2 siblings, 0 replies; 56+ messages in thread
From: Kai Krakow @ 2017-09-21 23:58 UTC (permalink / raw)
  To: linux-btrfs

On Thu, 21 Sep 2017 22:10:13 +0200,
Kai Krakow <hurikhan77@gmail.com> wrote:

> On Wed, 20 Sep 2017 07:46:52 -0400,
> "Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:
> 
> > >      Fragmentation: Files with a lot of random writes can become
> > > heavily fragmented (10000+ extents) causing excessive multi-second
> > > spikes of CPU load on systems with an SSD or large amount a RAM.
> > > On desktops this primarily affects application databases
> > > (including Firefox). Workarounds include manually defragmenting
> > > your home directory using btrfs fi defragment. Auto-defragment
> > > (mount option autodefrag) should solve this problem.
> > > 
> > > Upon reading that I am wondering if fragmentation in the Firefox
> > > profile is part of my issue. That's one thing I never tested
> > > previously. (BTW, this system has 256 GB of RAM and 20 cores.)    
> > Almost certainly.  Most modern web browsers are brain-dead and
> > insist on using SQLite databases (or traditional DB files) for
> > everything, including the cache, and the usage for the cache in
> > particular kills performance when fragmentation is an issue.  
> 
> At least in Chrome, you can turn on simple cache backend, which, I
> think, is using many small instead of one huge file. This suit btrfs
> much better:
> 
> chrome://flags/#enable-simple-cache-backend
> 
> 
> And then I suggest also doing this (as your login user):
> 
> $ cd $HOME
> $ mv .cache .cache.old
> $ mkdir .cache
> $ lsattr +C .cache

Oops, of course that's chattr, not lsattr

> $ rsync -av .cache.old/ .cache/
> $ rm -Rf .cache.old
> 
> This makes caches for most applications nocow. Chrome performance was
> completely fixed for me by doing this.
> 
> I'm not sure where Firefox puts its cache, I only use it on very rare
> occasions. But I think it's going to .cache/mozilla last time looked
> at it.
> 
> You may want to close all apps before converting the cache directory.
> 
> Also, I don't see any downsides in making this nocow. That directory
> could easily be also completely volatile. If something breaks due to
> no longer protected by data csum, just clean it out.


-- 
Regards,
Kai

Replies to list-only preferred.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-09-21 20:10                         ` Kai Krakow
  2017-09-21 23:30                           ` Dave
  2017-09-21 23:58                           ` Kai Krakow
@ 2017-09-22 11:22                           ` Austin S. Hemmelgarn
  2017-09-22 20:29                             ` Marc Joliet
  2 siblings, 1 reply; 56+ messages in thread
From: Austin S. Hemmelgarn @ 2017-09-22 11:22 UTC (permalink / raw)
  To: linux-btrfs

On 2017-09-21 16:10, Kai Krakow wrote:
> On Wed, 20 Sep 2017 07:46:52 -0400,
> "Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:
> 
>>>       Fragmentation: Files with a lot of random writes can become
>>> heavily fragmented (10000+ extents) causing excessive multi-second
>>> spikes of CPU load on systems with an SSD or large amount a RAM. On
>>> desktops this primarily affects application databases (including
>>> Firefox). Workarounds include manually defragmenting your home
>>> directory using btrfs fi defragment. Auto-defragment (mount option
>>> autodefrag) should solve this problem.
>>>
>>> Upon reading that I am wondering if fragmentation in the Firefox
>>> profile is part of my issue. That's one thing I never tested
>>> previously. (BTW, this system has 256 GB of RAM and 20 cores.)
>> Almost certainly.  Most modern web browsers are brain-dead and insist
>> on using SQLite databases (or traditional DB files) for everything,
>> including the cache, and the usage for the cache in particular kills
>> performance when fragmentation is an issue.
> 
> At least in Chrome, you can turn on simple cache backend, which, I
> think, is using many small instead of one huge file. This suit btrfs
> much better:
That's correct.  The traditional cache in Chrome and Chromium uses a 
single SQLite database for storing all the cache data and metadata (just 
like Firefox did last time I checked).  The simple cache backend instead 
uses the filesystem to handle allocations and uses directory hashing to 
speed up look ups of items, which actually means that even without BTRFS 
involved, it will usually be faster (both because it allows concurrent 
access unlike SQLite, and because it's generally faster to parse a 
multi-level directory hash than an SQL statement).
> 
> chrome://flags/#enable-simple-cache-backend
> 
> 
> And then I suggest also doing this (as your login user):
> 
> $ cd $HOME
> $ mv .cache .cache.old
> $ mkdir .cache
> $ lsattr +C .cache
> $ rsync -av .cache.old/ .cache/
> $ rm -Rf .cache.old
> 
> This makes caches for most applications nocow. Chrome performance was
> completely fixed for me by doing this.
> 
> I'm not sure where Firefox puts its cache, I only use it on very rare
> occasions. But I think it's going to .cache/mozilla last time looked
> at it.
I'm pretty sure that is correct.
> 
> You may want to close all apps before converting the cache directory.
At a minimum, you'll have to restart them to get them to use the new 
location.
> 
> Also, I don't see any downsides in making this nocow. That directory
> could easily be also completely volatile. If something breaks due to no
> longer protected by data csum, just clean it out.
Indeed, anything that is storing data here that can't be regenerated 
from some other source is asking for trouble; sane backup systems don't 
include ~/.cache, and it's quite often one of the first things 
recommended for deletion when trying to free up disk space.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-09-22 11:22                           ` Austin S. Hemmelgarn
@ 2017-09-22 20:29                             ` Marc Joliet
  0 siblings, 0 replies; 56+ messages in thread
From: Marc Joliet @ 2017-09-22 20:29 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2055 bytes --]

On Friday, 22 September 2017 at 13:22:52 CEST, Austin S. Hemmelgarn wrote:
> > I'm not sure where Firefox puts its cache, I only use it on very rare
> > occasions. But I think it's going to .cache/mozilla last time looked
> > at it.
> 
> I'm pretty sure that is correct.

FWIW, on my system Firefox's cache looks like this:

% du -hsc (find .cache/mozilla/firefox/ -type f) | wc -l
9008
% du -hsc (find .cache/mozilla/firefox/ -type f) | sort -h | tail
5,4M    .cache/mozilla/firefox/cb236e4s.default-1464421886682/cache2/entries/83CEC8ADA08D9A9658458AB872BE107A216E71C6
5,5M    .cache/mozilla/firefox/cb236e4s.default-1464421886682/cache2/entries/C60061B33D3BB91ED45951C922BAA1BB40022CB7
5,7M    .cache/mozilla/firefox/cb236e4s.default-1464421886682/cache2/entries/0900D9EA8E3222EB8690348C2482C69308B15A20
5,7M    .cache/mozilla/firefox/cb236e4s.default-1464421886682/cache2/entries/F8E90D121B884360E36BCB1735CC5A8B1B7A744B
5,8M    .cache/mozilla/firefox/cb236e4s.default-1464421886682/cache2/entries/903C4CD01ABD74E353C7484C6E21A053AAC5DCC2
6,1M    .cache/mozilla/firefox/cb236e4s.default-1464421886682/cache2/entries/3A0D4193B009700155811D14A28DBE38C37C0067
6,1M    .cache/mozilla/firefox/cb236e4s.default-1464421886682/startupCache/scriptCache-current.bin
6,5M    .cache/mozilla/firefox/cb236e4s.default-1464421886682/cache2/entries/304405168662C3624D57AF98A74345464F32A0DB
8,8M    .cache/mozilla/firefox/ik7qsfwb.Temp/cache2/entries/BD7CA4125B3AA87D6B16C987741F33C65DBFFFDD
427M    insgesamt

So lots of files, many of which are (I suppose) relatively large, but do not 
look "everything in one database" large to me.

(This is with Firefox 55.0.2.)

-- 
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-09-21 11:09                       ` Duncan
@ 2017-10-31 21:47                         ` Dave
  2017-10-31 23:06                           ` Peter Grandi
                                             ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Dave @ 2017-10-31 21:47 UTC (permalink / raw)
  To: Linux fs Btrfs

I'm following up on all the suggestions regarding Firefox performance
on BTRFS. I have time to make these changes now, but I am having
trouble figuring out what to do. The constraints are:

1. BTRFS snapshots have proven to be too useful (and too important to
our overall IT approach) to forego.
2. We do not see any practical alternative (for us) to the incremental
backup strategy
(https://btrfs.wiki.kernel.org/index.php/Incremental_Backup)
3. We have large amounts of storage space (and can add more), but not
enough to break all reflinks on all snapshots.
4. We can transfer snapshots to backup storage (and thereby retain
minimal snapshots on the live volume)
5. Our team is standardized on Firefox. (Switching to Chromium is not
an option for us.)
6. Firefox profile sync has not worked well for us in the past, so we
don't use it.
7. Our machines generally have plenty of RAM so we could put the
Firefox cache (and maybe profile) into RAM using a technique such as
https://wiki.archlinux.org/index.php/Firefox/Profile_on_RAM. However,
profile persistence is important.

The most common recommendations were to switch to Chromium, to
defragment, and to stop using snapshots. As the constraints above
illustrate, we cannot do those things.

The tentative solution I have come up with is:

1. Continue using snapshots, but retain the minimal number possible on
the live volume. Move historical snapshots to a backup device using
btrfs send-receive.
(https://btrfs.wiki.kernel.org/index.php/Incremental_Backup)

2. Put $HOME/.cache on a separate BTRFS subvolume that is mounted
nocow -- it will NOT be snapshotted

3. Put most of $HOME on a "home" volume but separate all user
documents to another volume (i.e., "documents").

3.a. The "home" volume will retain only the one most recent snapshot
on that live volume. (More backup history will be retained on a backup
volume. ) This home volume can be defragmented. With one snapshot,
that will double our space usage, which is acceptable.

3.b. The documents volume will be snapshotted hourly and 36 hourly
snapshots plus daily, weekly and monthly snapshots retained. Therefore
it will NOT be defragmented, as that would not be practical or
space-wise possible.

3.c. The root volume (operating system, etc.) will follow a strategy
similar to home, but will also retain pre- and post- update snapshots.

4. Put the Firefox cache in RAM

5. If needed, consider putting the Firefox profile in RAM

6. Make sure Firefox is running in multi-process mode. (Duncan's
instructions, while greatly appreciated and very useful, left me
slightly confused about pulseaudio's compatibility with multi-process
mode.)

7. Check various Firefox performance tweaks such as these:
https://wiki.archlinux.org/index.php/Firefox/Tweaks

Can anyone guess whether this will be sufficient to solve our severe
performance problems? Do these steps make sense? Will any of these
steps lead to new problems? Should I proceed to give them a try? Or
can anyone suggest a better set of steps to test?

Notes:

In regard to snapshots, we must retain about 36 hourly snapshots of
user documents, for example. We have to have pre- and post- package
upgrade snapshots from at least the most recent operating system &
application package update. And we have to retain several daily,
weekly and monthly snapshots of system directories and some other
locations. Most of these snapshots can be retained on backup storage
devices.

Regarding Firefox profile sync, it does not have an intelligent method
for resolving conflicts, for example. We found too many unexpected
changes when using sync, so we do not use it.

On Thu, Sep 21, 2017 at 7:09 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> Dave posted on Wed, 20 Sep 2017 02:38:13 -0400 as excerpted:
>
>> Here's my scenario. Some months ago I built an over-the-top powerful
>> desktop computer / workstation and I was looking forward to really
>> fantastic performance improvements over my 6 year old Ubuntu machine. I
>> installed Arch Linux on BTRFS on the new computer (on an SSD). To my
>> shock, it was no faster than my old machine. I focused a lot on Firefox
>> performance because I use Firefox a lot and that was one of the
>> applications in which I was most looking forward to better performance.
>>
>> I tried everything I could think of and everything recommended to me in
>> various forums (except switching to Windows) and the performance
>> remained very disappointing.
>>
>> Then today I read the following:
>>
>>     Gotchas - btrfs Wiki https://btrfs.wiki.kernel.org/index.php/Gotchas
>>
>>     Fragmentation: Files with a lot of random writes can become
>> heavily fragmented (10000+ extents) causing excessive multi-second
>> spikes of CPU load on systems with an SSD or large amount a RAM. On
>> desktops this primarily affects application databases (including
>> Firefox). Workarounds include manually defragmenting your home directory
>> using btrfs fi defragment. Auto-defragment (mount option autodefrag)
>> should solve this problem.
>>
>> Upon reading that I am wondering if fragmentation in the Firefox profile
>> is part of my issue. That's one thing I never tested previously. (BTW,
>> this system has 256 GB of RAM and 20 cores.)
>>
>> Furthermore, on the same BTRFS Wiki page, it mentions the performance
>> penalties of many snapshots. I am keeping 30 to 50 snapshots of the
>> volume that contains the Firefox profile.
>>
>> Would these two things be enough to turn top-of-the-line hardware into a
>> mediocre-preforming desktop system? (The system performs fine on
>> benchmarks -- it's real life usage, particularly with Firefox where it
>> is disappointing.)
>>
>> After reading the info here, I am wondering if I should make a new
>> subvolume just for my Firefox profile(s) and not use COW and/or not keep
>> snapshots on it and mount it with the autodefrag option.
>>
>> As part of this strategy, I could send snapshots to another disk using
>> btrfs send-receive. That way I would have the benefits of snapshots
>> (which are important to me), but by not keeping any snapshots on the
>> live subvolume I could avoid the performance problems.
>>
>> What would you guys do in this situation?
>
> [FWIW this is my second try at a reply, my first being way too detailed
> and going off into the weeds somewhere, so I killed it.]
>
> That's an interesting scenario indeed, and perhaps I can help, since my
> config isn't near as high end as yours, but I run firefox on btrfs on
> ssds, and have no performance complaints.  The difference is very likely
> due to one or more of the following (FWIW I'd suggest a 4-3-1-2 order,
> tho only 1 and 2 are really btrfs related):
>
> 1) I make sure I consistently mount with autodefrag, from the first mount
> after the filesystem is created in ordered to first populate it, on.  The
> filesystem never gets fragmented, forcing writes to highly fragmented
> free space, in the first place.  (With the past and current effect of the
> ssd mount option under discussion to change, it's possible I'll get more
> fragmentation in the future after ssd doesn't try so hard to find
> reasonably large free-space chunks to write into, but it has been fine so
> far.)
>
> 2) Subvolumes and snapshots seemed to me more trouble than they were
> worth, particularly since it's the same filesystem anyway, and if it's
> damaged, it'll take all the subvolumes and snapshots with it.  So I don't
> use them, preferring instead to use real partitioning and more smaller
> fully separate filesystems, some of which aren't mounted by default (and
> root mounted read-only by default), so there's little chance they'll be
> damaged in a crash or filesystem bug damage scenario.  And if there /is/
> any damage, it's much more limited in scope since all my data eggs aren't
> in the same basket, so maintenance such as btrfs check and scrub take far
> less time (and check far less memory) than they would were it one big
> pool with snapshots.  And if recovery fails too, the backups are likewise
> small filesystems the same size as the working copies, so copying the
> data back over takes far less time as well (not to mention making the
> backups takes less time in the first place, so it's easier to regularly
> update them).
>
> 3) Austin mentioned the firefox cache.  I honestly wouldn't know on it,
> since I have firefox configured to use a tmpfs for its cache, so it
> operates at memory speed and gets cleared along with its memory at every
> reboot or tmpfs umount.  My inet speed is fast enough I don't really need
> cache anyway, but it's nice to have it, operating at memory speed, within
> a single boot session... and to have it cleared on reboot.
>
>
> 4) This one was the biggest one for me for awhile.
>
> Is firefox running in multi-process mode?  If you don't know, got to
> about:support, and look in the Application Basics section, at the
> Multiprocess Windows entry and the Web Content Processes entry.  When you
> have multiple windows open it should show something like 2/2 (for two
> windows open, tho you won't get 20/20 for 20 windows open) for windows,
> and n/7 (tho I believe the default is 4 instead of 7, I've upped mine)
> for content processes, with n going up toward 7 (or 4) if you have
> multiple tabs/windows open playing video or the like.
>
> If you're stuck at a single process that'll be a *BIG* drag on
> performance, particularly when playing youtube full-screen or the like.
> There are various reasons you might get stuck at a single process,
> including extensions that aren't compatible with "electrolysis" (aka e10s,
> this being the mozilla code name for multi-process firefox), and the one
> that was my problem after I ensured all my extensions were e10s
> compatible -- I was trying to run the upstream firefox binary, which is
> now pulseaudio-only (no more direct alsa support), with apulse as a
> pulseaudio substitute, and apulse is apparently single-process-only
> (forcing multi-process would crash the tabs as soon as I tried navigating
> away from about:whatever to anything remote).
>
> Once I figured that out I switched back to using the gentoo firefox ebuild
> and enabling the alsa USE flag instead of pulseaudio there.  That got
> multiprocess working, and it was was *MUCH* more responsive, as I figured
> it should be! =:^)
>
> If you find you're stuck at single process (remember, check with at least
> two windows open) and need help with it, yell.  Because it'll make a
> *HUGE* difference.
>
> --
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-10-31 21:47                         ` Dave
@ 2017-10-31 23:06                           ` Peter Grandi
  2017-11-01  0:37                             ` Dave
       [not found]                             ` <CAH=dxU47-52-asM5vJ_-qOpEpjZczHw7vQzgi1-TeKm58++zBQ@mail.gmail.com>
  2017-11-01  7:43                           ` Sean Greenslade
  2017-11-01 13:31                           ` Duncan
  2 siblings, 2 replies; 56+ messages in thread
From: Peter Grandi @ 2017-10-31 23:06 UTC (permalink / raw)
  To: Linux fs Btrfs

> I'm following up on all the suggestions regarding Firefox performance
> on BTRFS. [ ... ]

I haven't read that yet, so maybe I am missing something, but I
use Firefox with Btrfs all the time and I haven't got issues.

[ ... ]
> 1. BTRFS snapshots have proven to be too useful (and too important to
>    our overall IT approach) to forego.
[ ... ]
> 3. We have large amounts of storage space (and can add more), but not
>    enough to break all reflinks on all snapshots.

Firefox profiles get fragmented only in the databases contained
in them, and those are tiny, as in dozens of MB. That's usually
irrelevant.

Also nothing forces you to defragment a whole filesystem, you
can just defragment individual files or directories by using
'find' with it.
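
For example, something like this (adjust the path and the size threshold
to taste) defragments just the larger files under a directory, without
touching anything else:

  tree$  sudo find .firefox/default -type f -size +1M \
           -exec btrfs filesystem defragment -t 32M {} +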

My top "$HOME" fragmented files are the aKregator RSS feed
databases, usually a few hundred fragments each, and the
'.sqlite' files for Firefox. Occasionally like just now I do
this:

  tree$  sudo filefrag .firefox/default/*.sqlite | sort -t: -k 2n | tail -4
  .firefox/default/cleanup.sqlite: 43 extents found
  .firefox/default/content-prefs.sqlite: 67 extents found
  .firefox/default/formhistory.sqlite: 87 extents found
  .firefox/default/places.sqlite: 3879 extents found

  tree$  sudo btrfs fi defrag .firefox/default/*.sqlite

  tree$  sudo filefrag .firefox/default/*.sqlite | sort -t: -k 2n | tail -4
  .firefox/default/webappsstore.sqlite: 1 extent found
  .firefox/default/favicons.sqlite: 2 extents found
  .firefox/default/kinto.sqlite: 2 extents found
  .firefox/default/places.sqlite: 44 extents found

> 2. Put $HOME/.cache on a separate BTRFS subvolume that is mounted
> nocow -- it will NOT be snapshotted

The cache can be simply deleted, and usually files in it are not
updated in place, so they don't get fragmented, so no worry there.

Also, you can declare the '.firefox/default/' directory to be
NOCOW, and that "just works". I haven't even bothered with that.
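
If you do want to try it, that is just chattr, with the caveat that +C
only takes effect for files created after the flag is set on the
directory, so it is best done on an empty (or freshly re-copied) profile
directory:

  tree$  chattr +C .firefox/default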

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-10-31 23:06                           ` Peter Grandi
@ 2017-11-01  0:37                             ` Dave
  2017-11-01 12:21                               ` Austin S. Hemmelgarn
                                                 ` (2 more replies)
       [not found]                             ` <CAH=dxU47-52-asM5vJ_-qOpEpjZczHw7vQzgi1-TeKm58++zBQ@mail.gmail.com>
  1 sibling, 3 replies; 56+ messages in thread
From: Dave @ 2017-11-01  0:37 UTC (permalink / raw)
  To: Linux fs Btrfs; +Cc: Peter Grandi

On Tue, Oct 31, 2017 at 7:06 PM, Peter Grandi <pg@btrfs.list.sabi.co.uk> wrote:
>
> Also nothing forces you to defragment a whole filesystem, you
> can just defragment individual files or directories by using
> 'find' with it.

Thanks for that info. When defragmenting individual files on a BTRFS
filesystem with COW, I assume reflinks between that file and all
snapshots are broken. So if there are 30 snapshots on that volume,
that one file will suddenly take up 30 times more space... Is that
correct? Or are the reflinks only broken between the live file and the
latest snapshot? Or is it something between, based on how many times
the file has changed?

>
> My top "$HOME" fragmented files are the aKregator RSS feed
> databases, usually a few hundred fragments each, and the
> '.sqlite' files for Firefox. Occasionally like just now I do
> this:
>
>   tree$  sudo filefrag .firefox/default/*.sqlite | sort -t: -k 2n | tail -4
>   .firefox/default/cleanup.sqlite: 43 extents found
>   .firefox/default/content-prefs.sqlite: 67 extents found
>   .firefox/default/formhistory.sqlite: 87 extents found
>   .firefox/default/places.sqlite: 3879 extents found
>
>   tree$  sudo btrfs fi defrag .firefox/default/*.sqlite
>
>   tree$  sudo filefrag .firefox/default/*.sqlite | sort -t: -k 2n | tail -4
>   .firefox/default/webappsstore.sqlite: 1 extent found
>   .firefox/default/favicons.sqlite: 2 extents found
>   .firefox/default/kinto.sqlite: 2 extents found
>   .firefox/default/places.sqlite: 44 extents found

That's a very helpful example.

Can you also give an example of using find, as you suggested above?
I'm generally familiar with using find to execute specific commands,
but an example is appreciated in this case.

> > 2. Put $HOME/.cache on a separate BTRFS subvolume that is mounted nocow -- it will NOT be snapshotted

> Also, you can declare the '.firefox/default/' directory to be NOCOW, and that "just works".

The cache is in a separate location from the profiles, as I'm sure you
know.  The reason I suggested a separate BTRFS subvolume for
$HOME/.cache is that this will prevent the cache files for all
applications (for that user) from being included in the snapshots. We
take frequent snapshots and (afaik) it makes no sense to include cache
in backups or snapshots. The easiest way I know to exclude cache from
BTRFS snapshots is to put it on a separate subvolume. I assumed this
would make several things related to snapshots more efficient too.

As far as the Firefox profile being declared NOCOW, as soon as we take
the first snapshot, I understand that it will become COW again. So I
don't see any point in making it NOCOW.

Thanks for your reply. I appreciate any other feedback or suggestions.

Background: I'm not sure why our Firefox performance is so terrible
but here's my original post from Sept 20. (I could repost the earlier
replies too if needed.) I've been waiting to have a window of
opportunity to try to fix our Firefox performance again, and now I
have that chance.

>On Thu 2017-08-31 (09:05), Ulli Horlacher wrote:
>> When I do a
>> btrfs filesystem defragment -r /directory
>> does it defragment really all files in this directory tree, even if it
>> contains subvolumes?
>> The man page does not mention subvolumes on this topic.
>
>No answer so far :-(
>
>But I found another problem in the man-page:
>
>  Defragmenting with Linux kernel versions < 3.9 or >= 3.14-rc2 as well as
>  with Linux stable kernel versions >= 3.10.31, >= 3.12.12 or >= 3.13.4
>  will break up the ref-links of COW data (for example files copied with
>  cp --reflink, snapshots or de-duplicated data). This may cause
>  considerable increase of space usage depending on the broken up
>  ref-links.
>
>I am running Ubuntu 16.04 with Linux kernel 4.10 and I have several
>snapshots.
>Therefore, I better should avoid calling "btrfs filesystem defragment -r"?
>
>What is the defragmenting best practice?
>Avoid it completly?

My question is the same as the OP in this thread, so I came here to
read the answers before asking.

Based on the answers here, it sounds like I should not run defrag at
all. However, I have a performance problem I need to solve, so if I
don't defrag, I need to do something else.

Here's my scenario. Some months ago I built an over-the-top powerful
desktop computer / workstation and I was looking forward to really
fantastic performance improvements over my 6 year old Ubuntu machine.
I installed Arch Linux on BTRFS on the new computer (on an SSD). To my
shock, it was no faster than my old machine. I focused a lot on
Firefox performance because I use Firefox a lot and that was one of
the applications in which I was most looking forward to better
performance.

I tried everything I could think of and everything recommended to me
in various forums (except switching to Windows) and the performance
remained very disappointing.

Then today I read the following:

    Gotchas - btrfs Wiki
    https://btrfs.wiki.kernel.org/index.php/Gotchas

    Fragmentation: Files with a lot of random writes can become
heavily fragmented (10000+ extents) causing excessive multi-second
spikes of CPU load on systems with an SSD or large amount a RAM. On
desktops this primarily affects application databases (including
Firefox). Workarounds include manually defragmenting your home
directory using btrfs fi defragment. Auto-defragment (mount option
autodefrag) should solve this problem.

Upon reading that I am wondering if fragmentation in the Firefox
profile is part of my issue. That's one thing I never tested
previously. (BTW, this system has 256 GB of RAM and 20 cores.)

Furthermore, on the same BTRFS Wiki page, it mentions the performance
penalties of many snapshots. I am keeping 30 to 50 snapshots of the
volume that contains the Firefox profile.

Would these two things be enough to turn top-of-the-line hardware into
a mediocre-preforming desktop system? (The system performs fine on
benchmarks -- it's real life usage, particularly with Firefox where it
is disappointing.)

After reading the info here, I am wondering if I should make a new
subvolume just for my Firefox profile(s) and not use COW and/or not
keep snapshots on it and mount it with the autodefrag option.

As part of this strategy, I could send snapshots to another disk using
btrfs send-receive. That way I would have the benefits of snapshots
(which are important to me), but by not keeping any snapshots on the
live subvolume I could avoid the performance problems.

What would you guys do in this situation?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-10-31 21:47                         ` Dave
  2017-10-31 23:06                           ` Peter Grandi
@ 2017-11-01  7:43                           ` Sean Greenslade
  2017-11-01 13:31                           ` Duncan
  2 siblings, 0 replies; 56+ messages in thread
From: Sean Greenslade @ 2017-11-01  7:43 UTC (permalink / raw)
  To: Dave; +Cc: Linux fs Btrfs

On Tue, Oct 31, 2017 at 05:47:54PM -0400, Dave wrote:
> I'm following up on all the suggestions regarding Firefox performance
> on BTRFS. 
>
> <SNIP>
>
> 5. Firefox profile sync has not worked well for us in the past, so we
> don't use it.
> 6. Our machines generally have plenty of RAM so we could put the
> Firefox cache (and maybe profile) into RAM using a technique such as
> https://wiki.archlinux.org/index.php/Firefox/Profile_on_RAM. However,
> profile persistence is important.

> 4. Put the Firefox cache in RAM
> 
> 5. If needed, consider putting the Firefox profile in RAM

Have you looked into profile-sync-daemon?

https://wiki.archlinux.org/index.php/profile-sync-daemon

It basically does the "keep the profile in RAM but also sync it to HDD"
for you. I've used it for years, it works quite well.
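
Going from memory, the setup is just a tiny per-user config plus a user
service, roughly like this (check the wiki page above for the details):

  # ~/.config/psd/psd.conf
  BROWSERS="firefox"

  $ systemctl --user enable --now psd.service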

--Sean


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-11-01  0:37                             ` Dave
@ 2017-11-01 12:21                               ` Austin S. Hemmelgarn
  2017-11-02  1:39                                 ` Dave
  2017-11-01 17:48                               ` Peter Grandi
  2017-11-02 21:16                               ` Kai Krakow
  2 siblings, 1 reply; 56+ messages in thread
From: Austin S. Hemmelgarn @ 2017-11-01 12:21 UTC (permalink / raw)
  To: Dave, Linux fs Btrfs; +Cc: Peter Grandi

On 2017-10-31 20:37, Dave wrote:
> On Tue, Oct 31, 2017 at 7:06 PM, Peter Grandi <pg@btrfs.list.sabi.co.uk> wrote:
>>
>> Also nothing forces you to defragment a whole filesystem, you
>> can just defragment individual files or directories by using
>> 'find' with it.
> 
> Thanks for that info. When defragmenting individual files on a BTRFS
> filesystem with COW, I assume reflinks between that file and all
> snapshots are broken. So if there are 30 snapshots on that volume,
> that one file will suddenly take up 30 times more space... Is that
> correct? Or are the reflinks only broken between the live file and the
> latest snapshot? Or is it something between, based on how many times
> the file has changed?
Only that file will be split, all the other reflinks will be preserved, 
so it will only take up twice the space in your example.  Reflinks are 
at the block level, and don't have a single origin point where they can 
all be broken at once.  It's just like having multiple hardlinks to a 
file, and then replacing one of them via a rename.  The rename will 
break _that_ hardlink, but not any of the others.  In fact, the simplest 
way to explain reflinks is block-level hard links that automatically 
break when the block is updated.
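
If you want to see this for yourself, 'btrfs filesystem du' (btrfs-progs 
4.6 and newer) reports how much of a file is exclusive versus shared, so 
running it on a file before and after defragmenting it shows the shared 
portion turning into exclusive space (the path is just an example):

   $ sudo btrfs filesystem du ~/.mozilla/firefox/*.default/places.sqlite
   $ sudo btrfs filesystem defragment ~/.mozilla/firefox/*.default/places.sqlite
   $ sudo btrfs filesystem du ~/.mozilla/firefox/*.default/places.sqlite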
> 
>>
>> My top "$HOME" fragmented files are the aKregator RSS feed
>> databases, usually a few hundred fragments each, and the
>> '.sqlite' files for Firefox. Occasionally like just now I do
>> this:
>>
>>    tree$  sudo filefrag .firefox/default/*.sqlite | sort -t: -k 2n | tail -4
>>    .firefox/default/cleanup.sqlite: 43 extents found
>>    .firefox/default/content-prefs.sqlite: 67 extents found
>>    .firefox/default/formhistory.sqlite: 87 extents found
>>    .firefox/default/places.sqlite: 3879 extents found
>>
>>    tree$  sudo btrfs fi defrag .firefox/default/*.sqlite
>>
>>    tree$  sudo filefrag .firefox/default/*.sqlite | sort -t: -k 2n | tail -4
>>    .firefox/default/webappsstore.sqlite: 1 extent found
>>    .firefox/default/favicons.sqlite: 2 extents found
>>    .firefox/default/kinto.sqlite: 2 extents found
>>    .firefox/default/places.sqlite: 44 extents found
> 
> That's a very helpful example.
> 
> Can you also give an example of using find, as you suggested above?
> I'm generally familiar with using find to execute specific commands,
> but an example is appreciated in this case.
> 
>>> 2. Put $HOME/.cache on a separate BTRFS subvolume that is mounted nocow -- it will NOT be snapshotted
> 
>> Also, you can declare the '.firefox/default/' directory to be NOCOW, and that "just works".
> 
> The cache is in a separate location from the profiles, as I'm sure you
> know.  The reason I suggested a separate BTRFS subvolume for
> $HOME/.cache is that this will prevent the cache files for all
> applications (for that user) from being included in the snapshots. We
> take frequent snapshots and (afaik) it makes no sense to include cache
> in backups or snapshots. The easiest way I know to exclude cache from
> BTRFS snapshots is to put it on a separate subvolume. I assumed this
> would make several things related to snapshots more efficient too.
Yes, it will, and it will save space long-term as well since 
$HOME/.cache is usually the most frequently modified location in $HOME. 
In addition to not including this in the snapshots, it may also improve 
performance.  Each subvolume is its own tree, with its own locking, 
which means that you can generally improve parallel access performance 
by splitting the workload across multiple subvolumes.  Whether it will 
actually provide any real benefit in that respect is heavily dependent 
on the exact workload, but it won't hurt performance.
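
A minimal way to set that up, assuming $HOME itself is on BTRFS (the 
commands are a sketch; do it with your applications closed):

   $ mv ~/.cache ~/.cache.old
   $ btrfs subvolume create ~/.cache
   $ chattr +C ~/.cache            # optional: new cache files will be nocow
   $ cp -a ~/.cache.old/. ~/.cache/
   $ rm -rf ~/.cache.old

Snapshots of the parent subvolume will then simply not descend into 
~/.cache.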
> 
> As far as the Firefox profile being declared NOCOW, as soon as we take
> the first snapshot, I understand that it will become COW again. So I
> don't see any point in making it NOCOW.
When snapshotting NOCOW files, exactly one COW operation happens for 
each block as it gets written.  In your case, this may not matter (most 
people don't change settings on a sub-hourly basis), but in cases where 
changes are very frequent relative to snapshots, it can have a big 
impact to only COW once instead of all the time.
> 
> Thanks for your reply. I appreciate any other feedback or suggestions.
> 
> Background: I'm not sure why our Firefox performance is so terrible
> but here's my original post from Sept 20. (I could repost the earlier
> replies too if needed.) I've been waiting to have a window of
> opportunity to try to fix our Firefox performance again, and now I
> have that chance.
> 
>> On Thu 2017-08-31 (09:05), Ulli Horlacher wrote:
>>> When I do a
>>> btrfs filesystem defragment -r /directory
>>> does it defragment really all files in this directory tree, even if it
>>> contains subvolumes?
>>> The man page does not mention subvolumes on this topic.
>>
>> No answer so far :-(
>>
>> But I found another problem in the man-page:
>>
>>   Defragmenting with Linux kernel versions < 3.9 or >= 3.14-rc2 as well as
>>   with Linux stable kernel versions >= 3.10.31, >= 3.12.12 or >= 3.13.4
>>   will break up the ref-links of COW data (for example files copied with
>>   cp --reflink, snapshots or de-duplicated data). This may cause
>>   considerable increase of space usage depending on the broken up
>>   ref-links.
>>
>> I am running Ubuntu 16.04 with Linux kernel 4.10 and I have several
>> snapshots.
>> Therefore, I better should avoid calling "btrfs filesystem defragment -r"?
>>
>> What is the defragmenting best practice?
>> Avoid it completly?
> 
> My question is the same as the OP in this thread, so I came here to
> read the answers before asking.
> 
> Based on the answers here, it sounds like I should not run defrag at
> all. However, I have a performance problem I need to solve, so if I
> don't defrag, I need to do something else.
> 
> Here's my scenario. Some months ago I built an over-the-top powerful
> desktop computer / workstation and I was looking forward to really
> fantastic performance improvements over my 6 year old Ubuntu machine.
> I installed Arch Linux on BTRFS on the new computer (on an SSD). To my
> shock, it was no faster than my old machine. I focused a lot on
> Firefox performance because I use Firefox a lot and that was one of
> the applications in which I was most looking forward to better
> performance.
> 
> I tried everything I could think of and everything recommended to me
> in various forums (except switching to Windows) and the performance
> remained very disappointing.
> 
> Then today I read the following:
> 
>      Gotchas - btrfs Wiki
>      https://btrfs.wiki.kernel.org/index.php/Gotchas
> 
>      Fragmentation: Files with a lot of random writes can become
> heavily fragmented (10000+ extents) causing excessive multi-second
> spikes of CPU load on systems with an SSD or large amount a RAM. On
> desktops this primarily affects application databases (including
> Firefox). Workarounds include manually defragmenting your home
> directory using btrfs fi defragment. Auto-defragment (mount option
> autodefrag) should solve this problem.
> 
> Upon reading that I am wondering if fragmentation in the Firefox
> profile is part of my issue. That's one thing I never tested
> previously. (BTW, this system has 256 GB of RAM and 20 cores.)
> 
> Furthermore, on the same BTRFS Wiki page, it mentions the performance
> penalties of many snapshots. I am keeping 30 to 50 snapshots of the
> volume that contains the Firefox profile.
> 
> Would these two things be enough to turn top-of-the-line hardware into
> a mediocre-preforming desktop system? (The system performs fine on
> benchmarks -- it's real life usage, particularly with Firefox where it
> is disappointing.)
> 
> After reading the info here, I am wondering if I should make a new
> subvolume just for my Firefox profile(s) and not use COW and/or not
> keep snapshots on it and mount it with the autodefrag option.
> 
> As part of this strategy, I could send snapshots to another disk using
> btrfs send-receive. That way I would have the benefits of snapshots
> (which are important to me), but by not keeping any snapshots on the
> live subvolume I could avoid the performance problems.
> 
> What would you guys do in this situation?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-10-31 21:47                         ` Dave
  2017-10-31 23:06                           ` Peter Grandi
  2017-11-01  7:43                           ` Sean Greenslade
@ 2017-11-01 13:31                           ` Duncan
  2017-11-01 23:36                             ` Dave
  2 siblings, 1 reply; 56+ messages in thread
From: Duncan @ 2017-11-01 13:31 UTC (permalink / raw)
  To: linux-btrfs

Dave posted on Tue, 31 Oct 2017 17:47:54 -0400 as excerpted:

> 6. Make sure Firefox is running in multi-process mode. (Duncan's
> instructions, while greatly appreciated and very useful, left me
> slightly confused about pulseaudio's compatibility with multi-process
> mode.)

Just to clarify:

There's no problem with native pulseaudio and firefox multi-process 
mode.  As that's what most people will be using, and what firefox 
upstream ships for, chances are very high that you're just fine there, 
tho there's some small chance you have some other problem.

My specific problem was that I do *NOT* have pulseaudio installed here, 
as I've never found I needed it and it adds more complication to my 
configuration than the limited benefit I'd get out of it justifies.  
Straight alsa has been fine for me.

(Explanatory note: Being on gentoo/~amd64, aka testing, I do a lot more 
updating than stable users, and because it's gentoo, all those updates 
are build from sources, so every single extra package I have installed 
has a very real cost in terms of repeated update builds over time.  Put a 
bit differently, building and updating from sources tends to rather 
strongly encourage the best security practice of only installing what you 
actually need, because you have to rebuild it at every update.  And I 
don't need pulseaudio enough to be worth the cost to keep it updated, so 
I don't have it installed.  It really is that simple.  Binary-based 
distro users have rather trivial update costs in comparison, so having a 
few extra packages installed that they don't actually use, isn't such a 
big deal for them.  Which is of course fortunate, since dependencies are 
often determined at build-time, and binary-based distros tend to enable 
relatively more of them because /someone/ uses them, even if it's a 
minority, so they tend to carry around more dependencies than the normal 
user will need, simply to support the few that do.  And because the cost 
is relatively lower, users, except for the ones that pay enough attention 
to the security aspect of the wider attack surface, don't generally care 
as much as they would if they were forced to build and update all of them 
from sources!)

So when firefox upstream dropped support for alsa and began requiring 
pulseaudio for users that actually wanted their browser to play sound, I 
had two choices.  I could try to find a workaround that would fake firefox 
into believing that I had pulseaudio, or I could switch back to building 
firefox from sources instead of simply installing the upstream provided 
binaries, since gentoo's firefox build scripts still have the alsa 
support option that upstream firefox refused to support or ship any 
longer.

As with most people and their browsers, firefox is the most security-
exposed app I run, and it sometimes takes gentoo a few days after an 
upstream firefox release to get a working build out, during which users 
waiting on gentoo's package build are exposed to already widely known and 
patched by upstream security issues.  That was more risk than I wanted to 
take, thus my choice of switching to the upstream firefox binaries in the 
first place, since they were available, indeed, autoupdated, on release 
day.  Additionally, a firefox build takes awhile, much longer than most 
other packages, and now requires rust, itself an expensive to build 
package (tho fortunately it doesn't upgrade on the fast cycle that firefox 
does).

So I wasn't particularly happy about being forced back to waiting for 
gentoo to get around to updating its firefox builds several days after 
upstream, and then taking the time to build them myself, making it 
worthwhile to look for a workaround.

And as it happens, there's a /sort/ of workaround called apulse, a much 
simpler and smaller package than pulseaudio itself, that's basically just 
a pulseaudio API wrapper around alsa.

And when I first installed apulse and tested firefox with it, sure 
enough, I got firefox sound back! =:^)  I thought I had my workaround and 
that it was a satisfactory solution.

Unfortunately, apulse appears not to be multi-process-safe, and as firefox 
went more and more multi-process in the announcements, etc, at first I 
couldn't figure out what was keeping firefox single-process for me.

After some research on the web, I found the settings to /force/ firefox-
multi-process, and tried them.  But firefox would then only work in local 
mode (about: pages, basically).  Every time I tried to actually go to a 
normal URL, the multi-process tabs would crash before it rendered a 
thing!  The original firefox UI shell was still running, but with an 
error message indicating the tab crash instead of the page I wanted.

After some troubleshooting I figured out it was apulse.  If I moved the 
apulse library out of the way so firefox couldn't find it, I could browse 
the web in multiprocess mode just fine... except I was of course missing 
audio again. =:^(

So apulse wasn't the workaround for upstream firefox now requiring 
pulseaudio that I thought it was, since apulse wouldn't work with multi-
process, and I had to switch back to gentoo's firefox build from sources 
in ordered to get the alsa support that upstream had dropped, after all.


Thus, it wasn't pulseaudio that was the problem with multiprocess, but 
the fact that firefox had dropped alsa and was forcing pulseaudio on 
Linux if you wanted audio at all, and the fact that the apulse workaround 
I thought I had, didn't work with multiprocess.  So it was apulse that 
was the problem, and pulseaudio was only involved because firefox 
dropping direct alsa support and forcing pulseaudio was what had me 
installing apulse as an attempted workaround.


Meanwhile, my intent with the original mention wasn't that apulse was 
likely your problem, that's relatively unlikely, but that you might have 
some /other/ problem, say a not electrolysis-enabled (aka e10s, e, ten-
letters, s) extension.

Back when I posted that, a not e10s-enabled extension was actually quite 
likely, as e10s was still rather new.  It's probably somewhat less so 
now, and firefox is of course on to the next big change, dropping the old 
"legacy chrome" extension support, in favor of the newer and generally 
Chromium-compatible WebExtensions/WE API, with firefox 57, to be released 
mid-month (Nov 14).

But assuming you're still seeing firefox performance issues, I'm still 
guessing that it's likely to be /something/ forcing single-process, as I 
/know/ how much of a difference that can make from experience.

So I'd definitely check it, and if you're not getting multi-process, the 
firefox about:support page should show it in the application basics 
section, multiprocess windows, and if that's working, web content 
processes, entries.  With luck it'll tell you why it's disabled if it is, 
saying something about incompatible extensions or the like, tho I had to 
do a bit more troubleshooting to find the problem with apulse.

If with multiple firefox windows open you're seeing 2/2 (or higher) in 
the multiprocess windows entry, and n/4 (the default, here I forced a 
higher 7) in the web content processes entry, then you're good to go in 
this regard, and the problem must be elsewhere.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-11-01  0:37                             ` Dave
  2017-11-01 12:21                               ` Austin S. Hemmelgarn
@ 2017-11-01 17:48                               ` Peter Grandi
  2017-11-02  0:09                                 ` Dave
  2017-11-02  0:43                                 ` Peter Grandi
  2017-11-02 21:16                               ` Kai Krakow
  2 siblings, 2 replies; 56+ messages in thread
From: Peter Grandi @ 2017-11-01 17:48 UTC (permalink / raw)
  To: Linux fs Btrfs

> When defragmenting individual files on a BTRFS filesystem with
> COW, I assume reflinks between that file and all snapshots are
> broken. So if there are 30 snapshots on that volume, that one
> file will suddenly take up 30 times more space... [ ... ]

Defragmentation works by effectively making a copy of the file
contents (simplistic view), so the end result is one copy with
29 reflinked contents, and one copy with defragmented contents.
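
For example, something like this shows the effect (paths are just
placeholders, and "btrfs filesystem du" needs a reasonably recent
btrfs-progs):

  # shared vs. exclusive bytes of one snapshotted file
  btrfs filesystem du -s /home/USER/.mozilla/firefox/PROFILE/places.sqlite
  # defragment only that file
  btrfs filesystem defragment /home/USER/.mozilla/firefox/PROFILE/places.sqlite
  # the same "du" afterwards: "Exclusive" grows, because the defragmented
  # copy no longer shares its extents with the snapshots
  btrfs filesystem du -s /home/USER/.mozilla/firefox/PROFILE/places.sqlite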

> Can you also give an example of using find, as you suggested
> above? [ ... ]

Well, one way is to use 'find' as a filtering replacement for
'defrag' option '-r', as in for example:

  find "$HOME" -xdev '(' -name '*.sqlite' -o -name '*.mk4' ')' \
    -type f  -print0 | xargs -0 btrfs fi defrag

Another one is to find the most fragmented files first or all
files of at least 1M with at least, say, 100 fragments as in:

  find "$HOME" -xdev -type f -size +1M -print0 | xargs -0 filefrag \
    | perl -n -e 'print "$1\0" if (m/(.*): ([0-9]+) extents/ && $1 > 100)' \
    | xargs -0 btrfs fi defrag

But there are many 'find' web pages and that is not quite a
Btrfs related topic.

> [ ... ] The easiest way I know to exclude cache from
> BTRFS snapshots is to put it on a separate subvolume. I assumed this
> would make several things related to snapshots more efficient too.

Only slightly.

> Background: I'm not sure why our Firefox performance is so terrible

As I always say, "performance" is not the same as "speed", and
probably your Firefox "performance" is sort of OKish even if the
"speed" is terrible, and neither is likely related to the profile
or the cache being on Btrfs: most JavaScript based sites are
awfully horrible regardless of browser:

  http://www.sabi.co.uk/blog/13-two.html?130817#130817

and if Firefox makes a special contribution it tends to leak
memory on several odd but common cases:

  https://utcc.utoronto.ca/~cks/space/blog/web/FirefoxResignedToLeaks?showcomments

Plus it tends to cache too much, e.g. recently close tabs.

But Firefox is not special because most web browsers are not
designed to run for a long time without a restart, and
Chromium/Chrome simply have a different set of problem sites.
Maybe the new "Quantum" Firefox 57 will improve matters because
it has a far more restrictive plugin API.

The overall problem is insoluble; hipster UX designers will be
the second against the wall when the revolution comes :-).

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-11-01 13:31                           ` Duncan
@ 2017-11-01 23:36                             ` Dave
  0 siblings, 0 replies; 56+ messages in thread
From: Dave @ 2017-11-01 23:36 UTC (permalink / raw)
  To: Linux fs Btrfs

On Wed, Nov 1, 2017 at 9:31 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> Dave posted on Tue, 31 Oct 2017 17:47:54 -0400 as excerpted:
>
>> 6. Make sure Firefox is running in multi-process mode. (Duncan's
>> instructions, while greatly appreciated and very useful, left me
>> slightly confused about pulseaudio's compatibility with multi-process
>> mode.)
>
> Just to clarify:
>
> There's no problem with native pulseaudio and firefox multi-process
> mode.

Thank you for clarifying. And I appreciate your detailed explanation.

> Back when I posted that, a not e10s-enabled extension was actually quite
> likely, as e10s was still rather new.  It's probably somewhat less so
> now, and firefox is of course on to the next big change, dropping the old
> "legacy chrome" extension support, in favor of the newer and generally
> Chromium-compatible WebExtensions/WE API, with firefox 57, to be released
> mid-month (Nov 14).

I am now running Firefox 57 beta and I'll be doing my testing with
that using only WebExtensions.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-11-01 17:48                               ` Peter Grandi
@ 2017-11-02  0:09                                 ` Dave
  2017-11-02 11:17                                   ` Austin S. Hemmelgarn
  2017-11-02  0:43                                 ` Peter Grandi
  1 sibling, 1 reply; 56+ messages in thread
From: Dave @ 2017-11-02  0:09 UTC (permalink / raw)
  To: Linux fs Btrfs; +Cc: Peter Grandi

On Wed, Nov 1, 2017 at 1:48 PM, Peter Grandi <pg@btfs.list.sabi.co.uk> wrote:
>> When defragmenting individual files on a BTRFS filesystem with
>> COW, I assume reflinks between that file and all snapshots are
>> broken. So if there are 30 snapshots on that volume, that one
>> file will suddenly take up 30 times more space... [ ... ]
>
> Defragmentation works by effectively making a copy of the file
> contents (simplistic view), so the end result is one copy with
> 29 reflinked contents, and one copy with defragmented contents.

The clarification is much appreciated.

>> Can you also give an example of using find, as you suggested
>> above? [ ... ]
>
> Well, one way is to use 'find' as a filtering replacement for
> 'defrag' option '-r', as in for example:
>
>   find "$HOME" -xdev '(' -name '*.sqlite' -o -name '*.mk4' ')' \
>     -type f  -print0 | xargs -0 btrfs fi defrag
>
> Another one is to find the most fragmented files first or all
> files of at least 1M with with at least say 100 fragments as in:
>
>   find "$HOME" -xdev -type f -size +1M -print0 | xargs -0 filefrag \
>     | perl -n -e 'print "$1\0" if (m/(.*): ([0-9]+) extents/ && $1 > 100)' \
>     | xargs -0 btrfs fi defrag
>
> But there are many 'find' web pages and that is not quite a
> Btrfs related topic.

Your examples were perfect. I have experience using find in similar
ways. I can take it from there. :-)

>> Background: I'm not sure why our Firefox performance is so terrible
>
> As I always say, "performance" is not the same as "speed", and
> probably your Firefox "performance" is sort of OKish even if the
> "speed" is terrile, and neither is likely related to the profile
> or the cache being on Btrfs.

Here's what happened. Two years ago I installed Kubuntu (with Firefox)
on two desktop computers. One machine performed fine. Like you said,
"sort of OKish" and that's what we expect with the current state of
Linux. The other machine was substantially worse. We ran side-by-side
real-world tests on these two machines for months.

Initially I did a lot of testing, troubleshooting and reconfiguration
trying to get the second machine to perform as well as the first. I
never had success. At first I thought it was related to the GPU (or
driver). Then I thought it was because the first machine used the z170
chipset and the second was X99 based. But that wasn't it. I have never
solved the problem and I have been coming back to it periodically
these last two years. In that time I have tried different distros from
opensuse to Arch, and a lot of different hardware.

Furthermore, my new machines have the same performance problem. The
most interesting example is a high end machine with 256 GB of RAM. It
showed substantially worse desktop application performance than any
other computer here. All are running the exact same version of Firefox
with the exact same add-ons. (The installations are carbon copies of
each other.)

What originally caught my attention was earlier information in this thread:

Am Wed, 20 Sep 2017 07:46:52 -0400
schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:

> >      Fragmentation: Files with a lot of random writes can become
> > heavily fragmented (10000+ extents) causing excessive multi-second
> > spikes of CPU load on systems with an SSD or large amount a RAM. On
> > desktops this primarily affects application databases (including
> > Firefox). Workarounds include manually defragmenting your home
> > directory using btrfs fi defragment. Auto-defragment (mount option
> > autodefrag) should solve this problem.
> >
> > Upon reading that I am wondering if fragmentation in the Firefox
> > profile is part of my issue. That's one thing I never tested
> > previously. (BTW, this system has 256 GB of RAM and 20 cores.)
> Almost certainly.  Most modern web browsers are brain-dead and insist
> on using SQLite databases (or traditional DB files) for everything,
> including the cache, and the usage for the cache in particular kills
> performance when fragmentation is an issue.

It turns out that the first machine (which performed well enough) was
the last one which was installed using LVM + EXT4. The second machine
(the one with the original performance problem) and all subsequent
machines have used BTRFS.

And the worst performing machine was the one with the most RAM and a
fast NVMe drive and top of the line hardware.

While Firefox and Linux in general have their performance "issues",
that's not relevant here. I'm comparing the same distros, same Firefox
versions, same Firefox add-ons, etc. I eventually tested many hardware
configurations: different CPU's, motherboards, GPU's, SSD's, RAM, etc.
The only remaining difference I can find is that the computer with
acceptable performance uses LVM + EXT4 while all the others use BTRFS.

With all the great feedback I have gotten here, I'm now ready to
retest this after implementing all the BTRFS-related suggestions I
have received. Maybe that will solve the problem or maybe this mystery
will continue...

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-11-01 17:48                               ` Peter Grandi
  2017-11-02  0:09                                 ` Dave
@ 2017-11-02  0:43                                 ` Peter Grandi
  1 sibling, 0 replies; 56+ messages in thread
From: Peter Grandi @ 2017-11-02  0:43 UTC (permalink / raw)
  To: Linux fs Btrfs

> Another one is to find the most fragmented files first or all
> files of at least 1M with with at least say 100 fragments as in:

> find "$HOME" -xdev -type f -size +1M -print0 | xargs -0 filefrag \
> | perl -n -e 'print "$1\0" if (m/(.*): ([0-9]+) extents/ && $1 > 100)' \
> | xargs -0 btrfs fi defrag

That should have "&& $2 > 100".
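
That is, in full:

  find "$HOME" -xdev -type f -size +1M -print0 | xargs -0 filefrag \
    | perl -n -e 'print "$1\0" if (m/(.*): ([0-9]+) extents/ && $2 > 100)' \
    | xargs -0 btrfs fi defrag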

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-11-01 12:21                               ` Austin S. Hemmelgarn
@ 2017-11-02  1:39                                 ` Dave
  2017-11-02 11:07                                   ` Austin S. Hemmelgarn
  2017-11-03  5:58                                   ` Marat Khalili
  0 siblings, 2 replies; 56+ messages in thread
From: Dave @ 2017-11-02  1:39 UTC (permalink / raw)
  To: Linux fs Btrfs; +Cc: Peter Grandi, Austin S. Hemmelgarn

On Wed, Nov 1, 2017 at 8:21 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:

>> The cache is in a separate location from the profiles, as I'm sure you
>> know.  The reason I suggested a separate BTRFS subvolume for
>> $HOME/.cache is that this will prevent the cache files for all
>> applications (for that user) from being included in the snapshots. We
>> take frequent snapshots and (afaik) it makes no sense to include cache
>> in backups or snapshots. The easiest way I know to exclude cache from
>> BTRFS snapshots is to put it on a separate subvolume. I assumed this
>> would make several things related to snapshots more efficient too.
>
> Yes, it will, and it will save space long-term as well since $HOME/.cache is
> usually the most frequently modified location in $HOME. In addition to not
> including this in the snapshots, it may also improve performance.  Each
> subvolume is it's own tree, with it's own locking, which means that you can
> generally improve parallel access performance by splitting the workload
> across multiple subvolumes.  Whether it will actually provide any real
> benefit in that respect is heavily dependent on the exact workload however,
> but it won't hurt performance.

I'm going to make this change now. What would be a good way to
implement this so that the change applies to the $HOME/.cache of each
user?

The simple way would be to create a new subvolume for each existing
user and mount it at $HOME/.cache in /etc/fstab, hard coding that
mount location for each user. I don't mind doing that as there are
only 4 users to consider. One minor concern is that it adds an
unexpected step to the process of creating a new user. Is there a
better way?
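
The fstab entries I have in mind would look roughly like the following,
one per user, with the UUID and subvolume name as placeholders:

  UUID=xxxx-xxxx  /home/alice/.cache  btrfs  subvol=@cache-alice,noatime  0 0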

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-11-02  1:39                                 ` Dave
@ 2017-11-02 11:07                                   ` Austin S. Hemmelgarn
  2017-11-03  2:59                                     ` Dave
  2017-11-03  5:58                                   ` Marat Khalili
  1 sibling, 1 reply; 56+ messages in thread
From: Austin S. Hemmelgarn @ 2017-11-02 11:07 UTC (permalink / raw)
  To: Dave, Linux fs Btrfs; +Cc: Peter Grandi

On 2017-11-01 21:39, Dave wrote:
> On Wed, Nov 1, 2017 at 8:21 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
> 
>>> The cache is in a separate location from the profiles, as I'm sure you
>>> know.  The reason I suggested a separate BTRFS subvolume for
>>> $HOME/.cache is that this will prevent the cache files for all
>>> applications (for that user) from being included in the snapshots. We
>>> take frequent snapshots and (afaik) it makes no sense to include cache
>>> in backups or snapshots. The easiest way I know to exclude cache from
>>> BTRFS snapshots is to put it on a separate subvolume. I assumed this
>>> would make several things related to snapshots more efficient too.
>>
>> Yes, it will, and it will save space long-term as well since $HOME/.cache is
>> usually the most frequently modified location in $HOME. In addition to not
>> including this in the snapshots, it may also improve performance.  Each
>> subvolume is it's own tree, with it's own locking, which means that you can
>> generally improve parallel access performance by splitting the workload
>> across multiple subvolumes.  Whether it will actually provide any real
>> benefit in that respect is heavily dependent on the exact workload however,
>> but it won't hurt performance.
> 
> I'm going to make this change now. What would be a good way to
> implement this so that the change applies to the $HOME/.cache of each
> user?
> 
> The simple way would be to create a new subvolume for each existing
> user and mount it at $HOME/.cache in /etc/fstab, hard coding that
> mount location for each user. I don't mind doing that as there are
> only 4 users to consider. One minor concern is that it adds an
> unexpected step to the process of creating a new user. Is there a
> better way?
> 
The easiest option is to just make sure nobody is logged in and run the 
following shell script fragment:

     for dir in /home/* ; do
         rm -rf $dir/.cache
         btrfs subvolume create $dir/.cache
     done

And then add something to the user creation scripts to create that 
subvolume.  This approach won't pollute /etc/fstab, will still exclude 
the directory from snapshots, and doesn't require any hugely creative 
work to integrate with user creation and deletion.

In general, the contents of the .cache directory are just that, cached 
data.  Provided nobody is actively accessing it, it's perfectly safe to 
just nuke the entire directory (I actually do this on a semi-regular 
basis on my systems just because it helps save space).  In fact, based 
on the FreeDesktop.org standards, if this does break anything, it's a 
bug in the software in question.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-11-02  0:09                                 ` Dave
@ 2017-11-02 11:17                                   ` Austin S. Hemmelgarn
  2017-11-02 18:09                                     ` Dave
  0 siblings, 1 reply; 56+ messages in thread
From: Austin S. Hemmelgarn @ 2017-11-02 11:17 UTC (permalink / raw)
  To: Dave, Linux fs Btrfs; +Cc: Peter Grandi

On 2017-11-01 20:09, Dave wrote:
> On Wed, Nov 1, 2017 at 1:48 PM, Peter Grandi <pg@btfs.list.sabi.co.uk> wrote:
>>> When defragmenting individual files on a BTRFS filesystem with
>>> COW, I assume reflinks between that file and all snapshots are
>>> broken. So if there are 30 snapshots on that volume, that one
>>> file will suddenly take up 30 times more space... [ ... ]
>>
>> Defragmentation works by effectively making a copy of the file
>> contents (simplistic view), so the end result is one copy with
>> 29 reflinked contents, and one copy with defragmented contents.
> 
> The clarification is much appreciated.
> 
>>> Can you also give an example of using find, as you suggested
>>> above? [ ... ]
>>
>> Well, one way is to use 'find' as a filtering replacement for
>> 'defrag' option '-r', as in for example:
>>
>>    find "$HOME" -xdev '(' -name '*.sqlite' -o -name '*.mk4' ')' \
>>      -type f  -print0 | xargs -0 btrfs fi defrag
>>
>> Another one is to find the most fragmented files first or all
>> files of at least 1M with with at least say 100 fragments as in:
>>
>>    find "$HOME" -xdev -type f -size +1M -print0 | xargs -0 filefrag \
>>      | perl -n -e 'print "$1\0" if (m/(.*): ([0-9]+) extents/ && $1 > 100)' \
>>      | xargs -0 btrfs fi defrag
>>
>> But there are many 'find' web pages and that is not quite a
>> Btrfs related topic.
> 
> Your examples were perfect. I have experience using find in similar
> ways. I can take it from there. :-)
> 
>>> Background: I'm not sure why our Firefox performance is so terrible
>>
>> As I always say, "performance" is not the same as "speed", and
>> probably your Firefox "performance" is sort of OKish even if the
>> "speed" is terrile, and neither is likely related to the profile
>> or the cache being on Btrfs.
> 
> Here's what happened. Two years ago I installed Kubuntu (with Firefox)
> on two desktop computers. One machine performed fine. Like you said,
> "sort of OKish" and that's what we expect with the current state of
> Linux. The other machine was substantially worse. We ran side-by-side
> real-world tests on these two machines for months.
> 
> Initially I did a lot of testing, troubleshooting and reconfiguration
> trying to get the second machine to perform as well as the first. I
> never had success. At first I thought it was related to the GPU (or
> driver). Then I thought it was because the first machine used the z170
> chipset and the second was X99 based. But that wasn't it. I have never
> solved the problem and I have been coming back to it periodically
> these last two years. In that time I have tried different distros from
> opensuse to Arch, and a lot of different hardware.
> 
> Furthermore, my new machines have the same performance problem. The
> most interesting example is a high end machine with 256 GB of RAM. It
> showed substantially worse desktop application performance than any
> other computer here. All are running the exact same version of Firefox
> with the exact same add-ons. (The installations are carbon copies of
> each other.)
> 
> What originally caught my attention was earlier information in this thread:
> 
> Am Wed, 20 Sep 2017 07:46:52 -0400
> schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
> 
>>>       Fragmentation: Files with a lot of random writes can become
>>> heavily fragmented (10000+ extents) causing excessive multi-second
>>> spikes of CPU load on systems with an SSD or large amount a RAM. On
>>> desktops this primarily affects application databases (including
>>> Firefox). Workarounds include manually defragmenting your home
>>> directory using btrfs fi defragment. Auto-defragment (mount option
>>> autodefrag) should solve this problem.
>>>
>>> Upon reading that I am wondering if fragmentation in the Firefox
>>> profile is part of my issue. That's one thing I never tested
>>> previously. (BTW, this system has 256 GB of RAM and 20 cores.)
>> Almost certainly.  Most modern web browsers are brain-dead and insist
>> on using SQLite databases (or traditional DB files) for everything,
>> including the cache, and the usage for the cache in particular kills
>> performance when fragmentation is an issue.
> 
> It turns out the the first machine (which performed well enough) was
> the last one which was installed using LVM + EXT4. The second machine
> (the one with the original performance problem) and all subsequent
> machines have used BTRFS.
> 
> And the worst performing machine was the one with the most RAM and a
> fast NVMe drive and top of the line hardware.
Somewhat nonsensically, I'll bet that NVMe is a contributing factor in 
this particular case.  NVMe has particularly bad performance with the 
old block IO schedulers (though it is NVMe, so it should still be better 
than a SATA or SAS SSD), and the new blk-mq framework just got 
scheduling support in 4.12, and only got reasonably good scheduling 
options in 4.13.  I doubt it's the entirety of the issue, but it's 
probably part of it.
> 
> While Firefox and Linux in general have their performance "issues",
> that's not relevant here. I'm comparing the same distros, same Firefox
> versions, same Firefox add-ons, etc. I eventually tested many hardware
> configurations: different CPU's, motherboards, GPU's, SSD's, RAM, etc.
> The only remaining difference I can find is that the computer with
> acceptable performance uses LVM + EXT4 while all the others use BTRFS.
> 
> With all the great feedback I have gotten here, I'm now ready to
> retest this after implementing all the BTRFS-related suggestions I
> have received. Maybe that will solve the problem or maybe this mystery
> will continue...
Hmm, if you're only using SSD's, that may partially explain things.  I 
don't remember if it was mentioned earlier in this thread, but you might 
try adding 'nossd' to the mount options.  The 'ssd' mount option (which 
gets set automatically if the device reports as non-rotational) impacts 
how the block allocator works, and that can have a pretty insane impact 
on performance.  Additionally, independently from that, try toggling the 
'discard' mount option.  If you have it enabled, disable it, if you have 
it disabled, enable it.  Inline discards can be very expensive on some 
hardware, especially older SSD's, and discards happen pretty frequently 
in a COW filesystem.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-11-02 11:17                                   ` Austin S. Hemmelgarn
@ 2017-11-02 18:09                                     ` Dave
  2017-11-02 18:37                                       ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 56+ messages in thread
From: Dave @ 2017-11-02 18:09 UTC (permalink / raw)
  To: Linux fs Btrfs; +Cc: Austin S. Hemmelgarn

On Thu, Nov 2, 2017 at 7:17 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:

>> And the worst performing machine was the one with the most RAM and a
>> fast NVMe drive and top of the line hardware.
>
> Somewhat nonsensically, I'll bet that NVMe is a contributing factor in this
> particular case.  NVMe has particularly bad performance with the old block
> IO schedulers (though it is NVMe, so it should still be better than a SATA
> or SAS SSD), and the new blk-mq framework just got scheduling support in
> 4.12, and only got reasonably good scheduling options in 4.13.  I doubt it's
> the entirety of the issue, but it's probably part of it.

Thanks for that news. Based on that, I assume the advice here (to use
noop for NVMe) is now outdated?
https://stackoverflow.com/a/27664577/463994

Is the solution as simple as running a kernel >= 4.13? Or do I need to
specify which scheduler to use?

I just checked one computer:

uname -a
Linux morpheus 4.13.5-1-ARCH #1 SMP PREEMPT Fri Oct 6 09:58:47 CEST
2017 x86_64 GNU/Linux

$ sudo find /sys -name scheduler -exec grep . {} +
/sys/devices/pci0000:00/0000:00:1d.0/0000:08:00.0/nvme/nvme0/nvme0n1/queue/scheduler:[none]
mq-deadline kyber bfq

From this article, it sounds like (maybe) I should use kyber. I see
kyber listed in the output above, so I assume that means it is
available. I also think [none] is the current scheduler being used, as
it is in brackets.

I checked this:
https://www.kernel.org/doc/Documentation/block/switching-sched.txt
Based on that, I assume I would do this at runtime:

echo kyber > /sys/devices/pci0000:00/0000:00:1d.0/0000:08:00.0/nvme/nvme0/nvme0n1/queue/scheduler

I assume this is equivalent:

echo kyber > /sys/block/nvme0n1/queue/scheduler

How would I set it permanently at boot time?

>> While Firefox and Linux in general have their performance "issues",
>> that's not relevant here. I'm comparing the same distros, same Firefox
>> versions, same Firefox add-ons, etc. I eventually tested many hardware
>> configurations: different CPU's, motherboards, GPU's, SSD's, RAM, etc.
>> The only remaining difference I can find is that the computer with
>> acceptable performance uses LVM + EXT4 while all the others use BTRFS.
>>
>> With all the great feedback I have gotten here, I'm now ready to
>> retest this after implementing all the BTRFS-related suggestions I
>> have received. Maybe that will solve the problem or maybe this mystery
>> will continue...
>
> Hmm, if you're only using SSD's, that may partially explain things.  I don't
> remember if it was mentioned earlier in this thread, but you might try
> adding 'nossd' to the mount options.  The 'ssd' mount option (which gets set
> automatically if the device reports as non-rotational) impacts how the block
> allocator works, and that can have a pretty insane impact on performance.

I will test the "nossd" mount option.

> Additionally, independently from that, try toggling the 'discard' mount
> option.  If you have it enabled, disable it, if you have it disabled, enable
> it.  Inline discards can be very expensive on some hardware, especially
> older SSD's, and discards happen pretty frequently in a COW filesystem.

I have been following this advice, so I have never enabled discard for
an NVMe drive. Do you think it is worth testing?

Solid State Drives/NVMe - ArchWiki
https://wiki.archlinux.org/index.php/Solid_State_Drives/NVMe

Discards:
Note: Although continuous TRIM is an option (albeit not recommended)
for SSDs, NVMe devices should not be issued discards.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-11-02 18:09                                     ` Dave
@ 2017-11-02 18:37                                       ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 56+ messages in thread
From: Austin S. Hemmelgarn @ 2017-11-02 18:37 UTC (permalink / raw)
  To: Dave, Linux fs Btrfs

On 2017-11-02 14:09, Dave wrote:
> On Thu, Nov 2, 2017 at 7:17 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
> 
>>> And the worst performing machine was the one with the most RAM and a
>>> fast NVMe drive and top of the line hardware.
>>
>> Somewhat nonsensically, I'll bet that NVMe is a contributing factor in this
>> particular case.  NVMe has particularly bad performance with the old block
>> IO schedulers (though it is NVMe, so it should still be better than a SATA
>> or SAS SSD), and the new blk-mq framework just got scheduling support in
>> 4.12, and only got reasonably good scheduling options in 4.13.  I doubt it's
>> the entirety of the issue, but it's probably part of it.
> 
> Thanks for that news. Based on that, I assume the advice here (to use
> noop for NVMe) is now outdated?
> https://stackoverflow.com/a/27664577/463994
> 
> Is the solution as simple as running a kernel >= 4.13? Or do I need to
> specify which scheduler to use?
> 
> I just checked one computer:
> 
> uname -a
> Linux morpheus 4.13.5-1-ARCH #1 SMP PREEMPT Fri Oct 6 09:58:47 CEST
> 2017 x86_64 GNU/Linux
> 
> $ sudo find /sys -name scheduler -exec grep . {} +
> /sys/devices/pci0000:00/0000:00:1d.0/0000:08:00.0/nvme/nvme0/nvme0n1/queue/scheduler:[none]
> mq-deadline kyber bfq
> 
>  From this article, it sounds like (maybe) I should use kyber. I see
> kyber listed in the output above, so I assume that means it is
> available. I also think [none] is the current scheduler being used, as
> it is in brackets.
> 
> I checked this:
> https://www.kernel.org/doc/Documentation/block/switching-sched.txt
> Based on that, I assume I would do this at runtime:
> 
> echo kyber > /sys/devices/pci0000:00/0000:00:1d.0/0000:08:00.0/nvme/nvme0/nvme0n1/queue/scheduler
> 
> I assume this is equivalent:
> 
> echo kyber > /sys/block/nvme0n1/queue/scheduler
> 
> How would I set it permanently at boot time?
It's kind of complicated overall.  As of 4.14, there are four options 
for the blk-mq path.  The 'none' scheduler is the old behavior prior to 
4.13, and does no scheduling.  'mq-deadline' is the default AFAIK, and 
behaves like the old deadline I/O scheduler (not sure if it supports I/O 
priorities).  'bfq' is a blk-mq port of a scheduler originally designed 
to replace the default CFQ scheduler from the old block layer.  'kyber' 
I know essentially nothing about; I never saw the patches on LKML (not 
sure if I just missed them, or they only went to topic lists), and I've 
not tried it myself.

I have no personal experience with anything but the none scheduler on 
NVMe devices, so I can't really comment much beyond two observations: I've 
seen a huge difference on the SATA SSDs I use, first when the deadline 
scheduler became the default and then again when I switched to BFQ on my 
systems, and I've seen reports of the deadline scheduler improving things 
on NVMe.

As far as setting it at boot time, there's currently no kernel 
configuration option to set a default like there is for the old block 
interface, and I don't know of any kernel command line option to set it 
either, but a udev rule setting it as an attribute works reliably.  I'm 
using something like the following to set all my SATA devices to use BFQ 
by default:

  KERNEL=="sd?", SUBSYSTEM=="block", ACTION=="add", ATTR{queue/scheduler}="bfq"
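
For an NVMe device the same approach should work.  Assuming kyber is
actually built for your kernel, a rule along these lines (file name and
device glob are just examples) ought to do it, e.g. in
/etc/udev/rules.d/60-ioscheduler.rules:

  KERNEL=="nvme[0-9]n[0-9]", SUBSYSTEM=="block", ACTION=="add", ATTR{queue/scheduler}="kyber"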
> 
>>> While Firefox and Linux in general have their performance "issues",
>>> that's not relevant here. I'm comparing the same distros, same Firefox
>>> versions, same Firefox add-ons, etc. I eventually tested many hardware
>>> configurations: different CPU's, motherboards, GPU's, SSD's, RAM, etc.
>>> The only remaining difference I can find is that the computer with
>>> acceptable performance uses LVM + EXT4 while all the others use BTRFS.
>>>
>>> With all the great feedback I have gotten here, I'm now ready to
>>> retest this after implementing all the BTRFS-related suggestions I
>>> have received. Maybe that will solve the problem or maybe this mystery
>>> will continue...
>>
>> Hmm, if you're only using SSD's, that may partially explain things.  I don't
>> remember if it was mentioned earlier in this thread, but you might try
>> adding 'nossd' to the mount options.  The 'ssd' mount option (which gets set
>> automatically if the device reports as non-rotational) impacts how the block
>> allocator works, and that can have a pretty insane impact on performance.
> 
> I will test the "nossd" mount option.
If you're not seeing any difference on the newest kernels (I hadn't 
realized you were running 4.13 on anything), you might not see any 
impact from doing this.  I'd also suggest running a full balance prior 
to testing _after_ switching the option, part of the performance impact 
is due to the resultant on-disk layout.
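
Something along these lines, with the mount point as a placeholder (newer
btrfs-progs wants the explicit --full-balance for a filter-less balance):

  mount -o remount,nossd /home    # add nossd to fstab to make it stick
  btrfs balance start --full-balance /home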
> 
>> Additionally, independently from that, try toggling the 'discard' mount
>> option.  If you have it enabled, disable it, if you have it disabled, enable
>> it.  Inline discards can be very expensive on some hardware, especially
>> older SSD's, and discards happen pretty frequently in a COW filesystem.
> 
> I have been following this advice, so I have never enabled discard for
> an NVMe drive. Do you think it is worth testing?
> 
> Solid State Drives/NVMe - ArchWiki
> https://wiki.archlinux.org/index.php/Solid_State_Drives/NVMe
> 
> Discards:
> Note: Although continuous TRIM is an option (albeit not recommended)
> for SSDs, NVMe devices should not be issued discards.
I've never heard this particular advice before, and it offers no source 
for the claim.  I have seen the Intel advice that they quote below it 
before, though, and would tend to agree with it for most users.  The part 
that makes this all complicated is that different devices handle batched 
discards (what the Arch people call 'Periodic TRIM') and on-demand 
discards (what the Arch people call 'Continuous TRIM') differently. 
Some devices (especially old ones) do better with batched discards, 
while others seem to do better with on-demand discards.  On top of that, 
there's significant variance based on the actual workload (including 
that from the filesystem itself).

Based on my own experience using BTRFS on SATA SSD's, it's usually 
better to do batched discards unless you only write to the filesystem 
infrequently, because:
1. Each COW operation triggers an associated discard (this can seriously 
kill your performance).
2. Because old copies of blocks get discarded immediately, it's much 
harder to recover a damaged filesystem.

There are some odd exceptions though.  If for example you're running 
BTRFS on a ramdisk or ZRAM device, you should just use on-demand 
discards, as that will free up memory immediately.
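
Going back to batched discards: the usual way to do those is fstrim, e.g.
(mount point is a placeholder; the timer unit is shipped with util-linux
on most distros):

  fstrim -v /home                      # one-off batched discard
  systemctl enable --now fstrim.timer  # or let systemd run it weekly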

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-11-01  0:37                             ` Dave
  2017-11-01 12:21                               ` Austin S. Hemmelgarn
  2017-11-01 17:48                               ` Peter Grandi
@ 2017-11-02 21:16                               ` Kai Krakow
  2017-11-03  2:47                                 ` Dave
  2 siblings, 1 reply; 56+ messages in thread
From: Kai Krakow @ 2017-11-02 21:16 UTC (permalink / raw)
  To: linux-btrfs

Am Tue, 31 Oct 2017 20:37:27 -0400
schrieb Dave <davestechshop@gmail.com>:

> > Also, you can declare the '.firefox/default/' directory to be
> > NOCOW, and that "just works".  
> 
> The cache is in a separate location from the profiles, as I'm sure you
> know.  The reason I suggested a separate BTRFS subvolume for
> $HOME/.cache is that this will prevent the cache files for all
> applications (for that user) from being included in the snapshots. We
> take frequent snapshots and (afaik) it makes no sense to include cache
> in backups or snapshots. The easiest way I know to exclude cache from
> BTRFS snapshots is to put it on a separate subvolume. I assumed this
> would make several things related to snapshots more efficient too.
> 
> As far as the Firefox profile being declared NOCOW, as soon as we take
> the first snapshot, I understand that it will become COW again. So I
> don't see any point in making it NOCOW.

Ah well, not really. The files and directories will still be nocow -
however, the next write to any such file after a snapshot will trigger a
COW operation. So you still see the fragmentation effect, but to a much
lesser extent. But the files themselves will remain in nocow format.

You may want to try btrfs autodefrag mount option and see if it
improves things (tho, the effect may take days or weeks to apply if you
didn't enable it right from the creation of the filesystem).
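
Enabling it is just a mount option, e.g. via fstab (the UUID is a
placeholder) or a remount of an already mounted filesystem:

  UUID=xxxx-xxxx  /home  btrfs  defaults,autodefrag  0 0

  mount -o remount,autodefrag /home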

Also, autodefrag will probably unshare reflinks on your snapshots. You
may be able to use bees[1] to work against this effect. Its interaction
with autodefrag is not well tested but it works fine for me. Also, bees
is able to reduce some of the fragmentation during deduplication
because it will rewrite extents back into bigger chunks (but only for
duplicated data).

[1]: https://github.com/Zygo/bees


-- 
Regards,
Kai

Replies to list-only preferred.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-11-02 21:16                               ` Kai Krakow
@ 2017-11-03  2:47                                 ` Dave
  2017-11-03  7:26                                   ` Kai Krakow
  0 siblings, 1 reply; 56+ messages in thread
From: Dave @ 2017-11-03  2:47 UTC (permalink / raw)
  To: Linux fs Btrfs; +Cc: Kai Krakow

On Thu, Nov 2, 2017 at 5:16 PM, Kai Krakow <hurikhan77@gmail.com> wrote:

>
> You may want to try btrfs autodefrag mount option and see if it
> improves things (tho, the effect may take days or weeks to apply if you
> didn't enable it right from the creation of the filesystem).
>
> Also, autodefrag will probably unshare reflinks on your snapshots. You
> may be able to use bees[1] to work against this effect. Its interaction
> with autodefrag is not well tested but it works fine for me. Also, bees
> is able to reduce some of the fragmentation during deduplication
> because it will rewrite extents back into bigger chunks (but only for
> duplicated data).
>
> [1]: https://github.com/Zygo/bees

I will look into bees. And yes, I plan to try autodefrag. (I already
have it enabled now.) However, I need to understand something about
how btrfs send-receive works in regard to reflinks and fragmentation.

Say I have 2 snapshots on my live volume. The earlier one of them has
already been sent to another block device by btrfs send-receive (full
backup). Now defrag runs on the live volume and breaks some percentage
of the reflinks. At this point I do an incremental btrfs send-receive
using "-p" (or "-c") with the diff going to the same other block
device where the prior snapshot was already sent.

Will reflinks be "made whole" (restored) on the receiving block
device? Or is the state of the source volume replicated so closely
that reflink status is the same on the target?

Also, is fragmentation reduced on the receiving block device?

My expectation is that fragmentation would be reduced and duplication
would be reduced too. In other words, does send-receive result in
defragmentation and deduplication too?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-11-02 11:07                                   ` Austin S. Hemmelgarn
@ 2017-11-03  2:59                                     ` Dave
  2017-11-03  7:12                                       ` Kai Krakow
  0 siblings, 1 reply; 56+ messages in thread
From: Dave @ 2017-11-03  2:59 UTC (permalink / raw)
  To: Linux fs Btrfs; +Cc: Austin S. Hemmelgarn

On Thu, Nov 2, 2017 at 7:07 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2017-11-01 21:39, Dave wrote:
>> I'm going to make this change now. What would be a good way to
>> implement this so that the change applies to the $HOME/.cache of each
>> user?
>>
>> The simple way would be to create a new subvolume for each existing
>> user and mount it at $HOME/.cache in /etc/fstab, hard coding that
>> mount location for each user. I don't mind doing that as there are
>> only 4 users to consider. One minor concern is that it adds an
>> unexpected step to the process of creating a new user. Is there a
>> better way?
>>
> The easiest option is to just make sure nobody is logged in and run the
> following shell script fragment:
>
>     for dir in /home/* ; do
>         rm -rf $dir/.cache
>         btrfs subvolume create $dir/.cache
>     done
>
> And then add something to the user creation scripts to create that
> subvolume.  This approach won't pollute /etc/fstab, will still exclude the
> directory from snapshots, and doesn't require any hugely creative work to
> integrate with user creation and deletion.
>
> In general, the contents of the .cache directory are just that, cached data.
> Provided nobody is actively accessing it, it's perfectly safe to just nuke
> the entire directory...

I like this suggestion. Thank you. I had intended to mount the .cache
subvolumes with the NODATACOW option. However, with this approach, I
won't be explicitly mounting the .cache subvolumes. Is it possible to
use "chattr +C $dir/.cache" in that loop even though it is a
subvolume? And, is setting the .cache directory to NODATACOW the right
choice given this scenario? From earlier comments, I believe it is,
but I want to be sure I understood correctly.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-11-02  1:39                                 ` Dave
  2017-11-02 11:07                                   ` Austin S. Hemmelgarn
@ 2017-11-03  5:58                                   ` Marat Khalili
  2017-11-03  7:19                                     ` Kai Krakow
  1 sibling, 1 reply; 56+ messages in thread
From: Marat Khalili @ 2017-11-03  5:58 UTC (permalink / raw)
  To: Dave, Linux fs Btrfs; +Cc: Peter Grandi, Austin S. Hemmelgarn

On 02/11/17 04:39, Dave wrote:
> I'm going to make this change now. What would be a good way to
> implement this so that the change applies to the $HOME/.cache of each
> user?
I'd make each user's .cache a symlink (that should work, but if it doesn't, 
use a bind mount) to a per-user directory on some separately mounted volume 
with the necessary options.

--

With Best Regards,
Marat Khalili

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-11-03  2:59                                     ` Dave
@ 2017-11-03  7:12                                       ` Kai Krakow
  0 siblings, 0 replies; 56+ messages in thread
From: Kai Krakow @ 2017-11-03  7:12 UTC (permalink / raw)
  To: linux-btrfs

Am Thu, 2 Nov 2017 22:59:36 -0400
schrieb Dave <davestechshop@gmail.com>:

> On Thu, Nov 2, 2017 at 7:07 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
> > On 2017-11-01 21:39, Dave wrote:  
> >> I'm going to make this change now. What would be a good way to
> >> implement this so that the change applies to the $HOME/.cache of
> >> each user?
> >>
> >> The simple way would be to create a new subvolume for each existing
> >> user and mount it at $HOME/.cache in /etc/fstab, hard coding that
> >> mount location for each user. I don't mind doing that as there are
> >> only 4 users to consider. One minor concern is that it adds an
> >> unexpected step to the process of creating a new user. Is there a
> >> better way?
> >>  
> > The easiest option is to just make sure nobody is logged in and run
> > the following shell script fragment:
> >
> >     for dir in /home/* ; do
> >         rm -rf $dir/.cache
> >         btrfs subvolume create $dir/.cache
> >     done
> >
> > And then add something to the user creation scripts to create that
> > subvolume.  This approach won't pollute /etc/fstab, will still
> > exclude the directory from snapshots, and doesn't require any
> > hugely creative work to integrate with user creation and deletion.
> >
> > In general, the contents of the .cache directory are just that,
> > cached data. Provided nobody is actively accessing it, it's
> > perfectly safe to just nuke the entire directory...  
> 
> I like this suggestion. Thank you. I had intended to mount the .cache
> subvolumes with the NODATACOW option. However, with this approach, I
> won't be explicitly mounting the .cache subvolumes. Is it possible to
> use "chattr +C $dir/.cache" in that loop even though it is a
> subvolume? And, is setting the .cache directory to NODATACOW the right
> choice given this scenario? From earlier comments, I believe it is,
> but I want to be sure I understood correctly.

It is important to apply "chattr +C" to the _empty_ directory, because
even if used recursively, it won't apply to already existing, non-empty
files. But the +C attribute is inherited by newly created files and
directory: So simply follow the "chattr +C on empty directory" and
you're all set.

BTW: You cannot mount subvolumes from an already mounted btrfs device
with different mount options. That is currently not implemented (except
for maybe a very few options). So the fstab approach probably wouldn't
have helped you (depending on your partition layout).

You can simply just create subvolumes within the location needed and
they are implicitly mounted. Then change the particular subvolume cow
behavior with chattr.
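
Combining that with Austin's earlier loop, a minimal sketch (run as root
while nobody is logged in; the chown hands each subvolume back to its
user) would be:

  for dir in /home/* ; do
      rm -rf "$dir/.cache"
      btrfs subvolume create "$dir/.cache"
      chattr +C "$dir/.cache"                  # new files inherit NOCOW
      chown --reference="$dir" "$dir/.cache"   # match the home dir's owner
  done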


-- 
Regards,
Kai

Replies to list-only preferred.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-11-03  5:58                                   ` Marat Khalili
@ 2017-11-03  7:19                                     ` Kai Krakow
  0 siblings, 0 replies; 56+ messages in thread
From: Kai Krakow @ 2017-11-03  7:19 UTC (permalink / raw)
  To: linux-btrfs

Am Fri, 3 Nov 2017 08:58:22 +0300
schrieb Marat Khalili <mkh@rqc.ru>:

> On 02/11/17 04:39, Dave wrote:
> > I'm going to make this change now. What would be a good way to
> > implement this so that the change applies to the $HOME/.cache of
> > each user?  
> I'd make each user's .cache a symlink (should work but if it won't
> then bind mount) to a per-user directory in some separately mounted
> volume with necessary options.

On a systemd system, each user already has a private tmpfs location
at /run/user/$(id -u).

You could add to the central login script:

# CACHE_DIR="/run/user/$(id -u)/cache"
# mkdir -p $CACHE_DIR && ln -snf $CACHE_DIR $HOME/.cache

You should not run this as root (because of mkdir -p).

You could wrap it into an if statement:

# if [ "$(whoami)" != "root" ]; then
#   ...
# fi


-- 
Regards,
Kai

Replies to list-only preferred.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-11-03  2:47                                 ` Dave
@ 2017-11-03  7:26                                   ` Kai Krakow
  2017-11-03 11:30                                     ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 56+ messages in thread
From: Kai Krakow @ 2017-11-03  7:26 UTC (permalink / raw)
  To: linux-btrfs

Am Thu, 2 Nov 2017 22:47:31 -0400
schrieb Dave <davestechshop@gmail.com>:

> On Thu, Nov 2, 2017 at 5:16 PM, Kai Krakow <hurikhan77@gmail.com>
> wrote:
> 
> >
> > You may want to try btrfs autodefrag mount option and see if it
> > improves things (tho, the effect may take days or weeks to apply if
> > you didn't enable it right from the creation of the filesystem).
> >
> > Also, autodefrag will probably unshare reflinks on your snapshots.
> > You may be able to use bees[1] to work against this effect. Its
> > interaction with autodefrag is not well tested but it works fine
> > for me. Also, bees is able to reduce some of the fragmentation
> > during deduplication because it will rewrite extents back into
> > bigger chunks (but only for duplicated data).
> >
> > [1]: https://github.com/Zygo/bees  
> 
> I will look into bees. And yes, I plan to try autodefrag. (I already
> have it enabled now.) However, I need to understand something about
> how btrfs send-receive works in regard to reflinks and fragmentation.
> 
> Say I have 2 snapshots on my live volume. The earlier one of them has
> already been sent to another block device by btrfs send-receive (full
> backup). Now defrag runs on the live volume and breaks some percentage
> of the reflinks. At this point I do an incremental btrfs send-receive
> using "-p" (or "-c") with the diff going to the same other block
> device where the prior snapshot was already sent.
> 
> Will reflinks be "made whole" (restored) on the receiving block
> device? Or is the state of the source volume replicated so closely
> that reflink status is the same on the target?
> 
> Also, is fragmentation reduced on the receiving block device?
> 
> My expectation is that fragmentation would be reduced and duplication
> would be reduced too. In other words, does send-receive result in
> defragmentation and deduplication too?

As far as I understand, btrfs send/receive doesn't create an exact
mirror. It just replays the block operations between generation
numbers. That is: If it finds new blocks referenced between
generations, it will write a _new_ block to the destination.

So, no, it won't reduce fragmentation or duplication. It just keeps
reflinks intact as long as such extents weren't touched within the
generation range. Otherwise they are rewritten as new extents.

Autodefrag and deduplication processes will as such probably increase
duplication at the destination. A developer may have a better clue, tho.


-- 
Regards,
Kai

Replies to list-only preferred.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-11-03  7:26                                   ` Kai Krakow
@ 2017-11-03 11:30                                     ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 56+ messages in thread
From: Austin S. Hemmelgarn @ 2017-11-03 11:30 UTC (permalink / raw)
  To: linux-btrfs

On 2017-11-03 03:26, Kai Krakow wrote:
> On Thu, 2 Nov 2017 22:47:31 -0400, Dave <davestechshop@gmail.com> wrote:
> 
>> On Thu, Nov 2, 2017 at 5:16 PM, Kai Krakow <hurikhan77@gmail.com>
>> wrote:
>>
>>>
>>> You may want to try btrfs autodefrag mount option and see if it
>>> improves things (tho, the effect may take days or weeks to apply if
>>> you didn't enable it right from the creation of the filesystem).
>>>
>>> Also, autodefrag will probably unshare reflinks on your snapshots.
>>> You may be able to use bees[1] to work against this effect. Its
>>> interaction with autodefrag is not well tested but it works fine
>>> for me. Also, bees is able to reduce some of the fragmentation
>>> during deduplication because it will rewrite extents back into
>>> bigger chunks (but only for duplicated data).
>>>
>>> [1]: https://github.com/Zygo/bees
>>
>> I will look into bees. And yes, I plan to try autodefrag. (I already
>> have it enabled now.) However, I need to understand something about
>> how btrfs send-receive works in regard to reflinks and fragmentation.
>>
>> Say I have 2 snapshots on my live volume. The earlier one of them has
>> already been sent to another block device by btrfs send-receive (full
>> backup). Now defrag runs on the live volume and breaks some percentage
>> of the reflinks. At this point I do an incremental btrfs send-receive
>> using "-p" (or "-c") with the diff going to the same other block
>> device where the prior snapshot was already sent.
>>
>> Will reflinks be "made whole" (restored) on the receiving block
>> device? Or is the state of the source volume replicated so closely
>> that reflink status is the same on the target?
>>
>> Also, is fragmentation reduced on the receiving block device?
>>
>> My expectation is that fragmentation would be reduced and duplication
>> would be reduced too. In other words, does send-receive result in
>> defragmentation and deduplication too?
> 
> As far as I understand, btrfs send/receive doesn't create an exact
> mirror. It just replays the block operations between generation
> numbers. That is: If it finds new blocks referenced between
> generations, it will write a _new_ block to the destination.
That is mostly correct, except it's not a block level copy.  To put it 
in a heavily simplified manner, send/receive will recreate the subvolume 
using nothing more than basic file manipulation syscalls (write(), 
chown(), chmod(), etc), the clone ioctl, and some extra logic to figure 
out the correct location to clone from.  IOW, it's functionally 
equivalent to using rsync to copy the data, and then deduplicating, 
albeit a bit smarter about when to deduplicate (and more efficient in 
that respect).
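If you want to see how much of the received data actually ended up
shared on the destination, btrfs filesystem du (in reasonably recent
btrfs-progs) reports exclusive vs. shared bytes per path; the snapshot
paths below are just placeholders:

  btrfs filesystem du -s /mnt/backup/snapshots/home.2017-11-01 \
                         /mnt/backup/snapshots/home.2017-11-02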
> 
> So, no, it won't reduce fragmentation or duplication. It just keeps
> reflinks intact as long as such extents weren't touched within the
> generation range. Otherwise they are rewritten as new extents.
A received subvolume will almost always be less fragmented than the 
source, since everything is received serially, and each file is written 
out one at a time.
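A rough way to compare an individual file on the source and on the
receive side is filefrag; take its extent count with a grain of salt
on btrfs, since compressed files always report many small extents
(paths are placeholders again):

  filefrag /home/user/bigfile
  filefrag /mnt/backup/snapshots/home.2017-11-02/user/bigfile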
> 
> Autodefrag and deduplication processes will as such probably increase
> duplication at the destination. A developer may have a better clue, tho.
In theory, yes, but in practice, not so much.  Autodefrag generally 
operates on very small blocks of data (64k IIRC), and I'm pretty sure it 
has some heuristic that only triggers it on small random writes, so 
depending on the workload, it may not be triggering much (for example, 
it often won't trigger on cache directories, since those almost never 
have files rewritten in place).
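
For completeness, autodefrag can be toggled without recreating the
filesystem; a sketch, assuming /home is the btrfs mount point in
question:

  # enable it on an already mounted filesystem
  mount -o remount,autodefrag /home
  # or persist it in /etc/fstab (the UUID is a placeholder)
  # UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /home  btrfs  defaults,noatime,autodefrag  0  0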

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
       [not found]                             ` <CAH=dxU47-52-asM5vJ_-qOpEpjZczHw7vQzgi1-TeKm58++zBQ@mail.gmail.com>
@ 2017-12-11  5:18                               ` Dave
  2017-12-11  6:10                                 ` Timofey Titovets
  0 siblings, 1 reply; 56+ messages in thread
From: Dave @ 2017-12-11  5:18 UTC (permalink / raw)
  To: Linux fs Btrfs

On Tue, Oct 31, 2017 someone wrote:
>
>
> > 2. Put $HOME/.cache on a separate BTRFS subvolume that is mounted
> > nocow -- it will NOT be snapshotted

I did exactly this. It serves the purpose of avoiding snapshots.
However, today I saw the following at
https://wiki.archlinux.org/index.php/Btrfs

Note: From Btrfs Wiki Mount options: within a single file system, it
is not possible to mount some subvolumes with nodatacow and others
with datacow. The mount option of the first mounted subvolume applies
to any other subvolumes.

That makes me think my nodatacow mount option on $HOME/.cache is not
effective. True?

(My subjective performance results have not been as good as hoped for
with the tweaks I have tried so far.)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: defragmenting best practice?
  2017-12-11  5:18                               ` Dave
@ 2017-12-11  6:10                                 ` Timofey Titovets
  0 siblings, 0 replies; 56+ messages in thread
From: Timofey Titovets @ 2017-12-11  6:10 UTC (permalink / raw)
  To: Dave; +Cc: Linux fs Btrfs

2017-12-11 8:18 GMT+03:00 Dave <davestechshop@gmail.com>:
> On Tue, Oct 31, 2017 someone wrote:
>>
>>
>> > 2. Put $HOME/.cache on a separate BTRFS subvolume that is mounted
>> > nocow -- it will NOT be snapshotted
>
> I did exactly this. It serves the purpose of avoiding snapshots.
> However, today I saw the following at
> https://wiki.archlinux.org/index.php/Btrfs
>
> Note: From Btrfs Wiki Mount options: within a single file system, it
> is not possible to mount some subvolumes with nodatacow and others
> with datacow. The mount option of the first mounted subvolume applies
> to any other subvolumes.
>
> That makes me think my nodatacow mount option on $HOME/.cache is not
> effective. True?
>
> (My subjective performance results have not been as good as hoped for
> with the tweaks I have tried so far.)

True. For specific directories that you want to mark as nocow, you
need to use chattr instead, like:
rm -rf ~/.cache
mkdir ~/.cache
chattr +C ~/.cache
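
A quick way to check that the flag stuck (keep in mind that +C only
affects files created after it is set, which is why the directory is
recreated empty above):

  lsattr -d ~/.cache    # the attribute list should now contain a 'C'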

-- 
Have a nice day,
Timofey.

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2017-12-11  6:11 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-31  7:05 btrfs filesystem defragment -r -- does it affect subvolumes? Ulli Horlacher
2017-09-12 16:28 ` defragmenting best practice? Ulli Horlacher
2017-09-12 17:27   ` Austin S. Hemmelgarn
2017-09-14  7:54     ` Duncan
2017-09-14 12:28       ` Austin S. Hemmelgarn
2017-09-14 11:38   ` Kai Krakow
2017-09-14 13:31     ` Tomasz Kłoczko
2017-09-14 15:24       ` Kai Krakow
2017-09-14 15:47         ` Kai Krakow
2017-09-14 17:48         ` Tomasz Kłoczko
2017-09-14 18:53           ` Austin S. Hemmelgarn
2017-09-15  2:26             ` Tomasz Kłoczko
2017-09-15 12:23               ` Austin S. Hemmelgarn
2017-09-14 20:17           ` Kai Krakow
2017-09-15 10:54           ` Michał Sokołowski
2017-09-15 11:13             ` Peter Grandi
2017-09-15 13:07             ` Tomasz Kłoczko
2017-09-15 14:11               ` Michał Sokołowski
2017-09-15 16:35                 ` Peter Grandi
2017-09-15 17:08                 ` Kai Krakow
2017-09-15 19:10                   ` Tomasz Kłoczko
2017-09-20  6:38                     ` Dave
2017-09-20 11:46                       ` Austin S. Hemmelgarn
2017-09-21 20:10                         ` Kai Krakow
2017-09-21 23:30                           ` Dave
2017-09-21 23:58                           ` Kai Krakow
2017-09-22 11:22                           ` Austin S. Hemmelgarn
2017-09-22 20:29                             ` Marc Joliet
2017-09-21 11:09                       ` Duncan
2017-10-31 21:47                         ` Dave
2017-10-31 23:06                           ` Peter Grandi
2017-11-01  0:37                             ` Dave
2017-11-01 12:21                               ` Austin S. Hemmelgarn
2017-11-02  1:39                                 ` Dave
2017-11-02 11:07                                   ` Austin S. Hemmelgarn
2017-11-03  2:59                                     ` Dave
2017-11-03  7:12                                       ` Kai Krakow
2017-11-03  5:58                                   ` Marat Khalili
2017-11-03  7:19                                     ` Kai Krakow
2017-11-01 17:48                               ` Peter Grandi
2017-11-02  0:09                                 ` Dave
2017-11-02 11:17                                   ` Austin S. Hemmelgarn
2017-11-02 18:09                                     ` Dave
2017-11-02 18:37                                       ` Austin S. Hemmelgarn
2017-11-02  0:43                                 ` Peter Grandi
2017-11-02 21:16                               ` Kai Krakow
2017-11-03  2:47                                 ` Dave
2017-11-03  7:26                                   ` Kai Krakow
2017-11-03 11:30                                     ` Austin S. Hemmelgarn
     [not found]                             ` <CAH=dxU47-52-asM5vJ_-qOpEpjZczHw7vQzgi1-TeKm58++zBQ@mail.gmail.com>
2017-12-11  5:18                               ` Dave
2017-12-11  6:10                                 ` Timofey Titovets
2017-11-01  7:43                           ` Sean Greenslade
2017-11-01 13:31                           ` Duncan
2017-11-01 23:36                             ` Dave
2017-09-21 19:28                       ` Sean Greenslade
2017-09-20  7:34                     ` Dmitry Kudriavtsev
