* btrfs fi defrag -r -t 32M? What is actually happening?
@ 2016-07-26 23:03 Nicholas D Steeves
  2016-07-27  1:10 ` Duncan
  0 siblings, 1 reply; 8+ messages in thread
From: Nicholas D Steeves @ 2016-07-26 23:03 UTC (permalink / raw)
  To: Btrfs BTRFS

Hi,

I've been using btrfs fi defrag without the "-r -t 32M" options for
regular maintenance.  I just learned, in
Documentation/btrfs-convert.asciidoc, that there is a recommendation
to run with "-t 32M" after a conversion from ext2/3/4.  I then
cross-referenced this with btrfs-filesystem(8), and found that:

    Extents bigger than value given by -t will be skipped, otherwise
    this value is used as a target extent size, but is only advisory
    and may not be reached if the free space is too fragmented. Use 0
    to take the kernel default, which is 256kB but may change in the
    future.

I understand the default behaviour of target extent size of 256kB to
mean only defragment small files and metadata.  Or does this mean that
the default behaviour is to defragment extent tree metadata >256kB,
and then defragment the (larger than 256kB) data from many extents
into a single extent?  I was surprised to read this!

What's really happening with this default behaviour?  Should everyone
be using -t with a much larger value to actually defragment their
databases?
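
For concreteness, here is roughly the difference I'm asking about (the
path below is just a placeholder):

    # what I've been running for regular maintenance
    btrfs filesystem defragment /srv/data

    # what btrfs-convert.asciidoc recommends after a conversion
    btrfs filesystem defragment -r -t 32M /srv/data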

Thanks,
Nicholas


* Re: btrfs fi defrag -r -t 32M? What is actually happening?
  2016-07-26 23:03 btrfs fi defrag -r -t 32M? What is actually happening? Nicholas D Steeves
@ 2016-07-27  1:10 ` Duncan
  2016-07-27 17:19   ` btrfs fi defrag does not defrag files >256kB? Nicholas D Steeves
  0 siblings, 1 reply; 8+ messages in thread
From: Duncan @ 2016-07-27  1:10 UTC (permalink / raw)
  To: linux-btrfs

Nicholas D Steeves posted on Tue, 26 Jul 2016 19:03:53 -0400 as excerpted:

> Hi,
> 
> I've been using btrfs fi defrag without the "-r -t 32M" options for
> regular maintenance.  I just learned, in
> Documentation/btrfs-convert.asciidoc, that there is a recommendation
> to run with "-t 32M" after a conversion from ext2/3/4.  I then
> cross-referenced this with btrfs-filesystem(8), and found that:
> 
>     Extents bigger than value given by -t will be skipped, otherwise
>     this value is used as a target extent size, but is only advisory
>     and may not be reached if the free space is too fragmented. Use 0
>     to take the kernel default, which is 256kB but may change in the
>     future.
> 
> I understand the default behaviour of target extent size of 256kB to
> mean only defragment small files and metadata.  Or does this mean that
> the default behaviour is to defragment extent tree metadata >256kB,
> and then defragment the (larger than 256kB) data from many extents
> into a single extent?  I was surprised to read this!
> 
> What's really happening with this default behaviour?  Should everyone
> be using -t with a much larger value to actually defragment their
> databases?

Something about defrag's -t option should really be in the FAQ, as it is 
known to be somewhat confusing and to come up from time to time, tho this 
is the first time I've seen it in the context of convert.

In general, you are correct in that the larger the value given to -t, the 
more defragging you should ultimately get.  There's a practical upper 
limit, however, the data chunk size, which is nominally 1 GiB (tho on 
tiny btrfs it's smaller and on TB-scale it can be larger, to 8 or 10 GiB 
IIRC).  32-bit btrfs-progs defrag also had a bug at one point that would 
(IIRC) kill the parameter if it was set to 2+ GiB -- that has been fixed 
by hard-coding the 32-bit max to 1 GiB, I believe.  The bug didn't affect 
64-bit.  In any case, 1 GiB is fine, and often the largest btrfs can do 
anyway, due as I said to that being the normal data chunk size.

And btrfs defrag only deals with data.  There's no metadata defrag, tho 
balance -m (or whole filesystem) will normally consolidate the metadata 
into the fewest (nominally 256 MiB) metadata chunks possible as it 
rewrites them.
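
For example, something along these lines (the mountpoint is just an
example):

    # consolidate metadata into as few metadata chunks as possible
    btrfs balance start -m /mnt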

In that regard a defrag -t 32M recommendation is reasonable for a 
converted filesystem, tho you can certainly go larger... to 1 GiB as I 
said.

On a converted filesystem, however, there's possibly the opposite issue 
as well -- on btrfs, as stated, extents are normally limited to the 
chunk size, nominally 1 GiB (the reason being chunk management: the 
indirection chunks provide is what lets balance do all the stuff it can 
do, like converting between different raid levels), while ext* 
apparently has no such limitation.  If the defrag and balance done after 
deleting the saved ext* subvolume don't work correctly -- they've been 
buggy at times in the past -- then the thing blocking a successful full 
balance can be huge files with single extents larger than a GiB that 
didn't get broken up into btrfs-native chunk-sized extents.  At times 
people have had to temporarily move such files off the filesystem, thus 
clearing the over-1-GiB extents they took, and then move them back on, 
recreating them with extents of at most btrfs-native chunk size.
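
Roughly like this (the file name and paths are purely illustrative):

    # a cross-filesystem move copies and deletes, so the file comes back
    # rewritten with btrfs-native extents (at most chunk-sized)
    mv /mnt/btrfs/huge-file.img /mnt/scratch/
    mv /mnt/scratch/huge-file.img /mnt/btrfs/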


Of course all this is in the context of btrfs-convert.  But it's not 
really a recommended conversion path anyway, tho it's recognized as 
ultimately a pragmatic necessity.  The reasoning goes like this.  Before 
you do anything as major as a filesystem conversion in the first place, 
full backups of anything you wish to keep are strongly recommended, 
because it's always possible something will go wrong during the convert.  
No sane sysadmin would attempt the conversion without a backup, unless 
the data really is worth less than the cost of the space required to 
store that backup, because sane sysadmins recognize that attempting such 
a risky operation without a backup, by definition of the risk involved, 
declares the data simply not worth the trouble -- they'd literally 
prefer to lose the data rather than pay the time/hassle/resource cost 
of keeping a backup.

And once the requirement of a full backup (or, alternatively, the 
admission that the data really isn't worth the hassle) is recognized, 
it's far easier and faster to simply mkfs.btrfs a brand new btrfs and 
copy everything over from the old ext*, in the process leaving /it/ in 
place as the backup, than it is to go thru the hassle of making the 
backup (the same copy step you'd do anyway when copying the old data to 
a new filesystem), then the convert-in-place, then testing that it 
worked, then deleting the saved subvolume with the ext* metadata, then 
doing the defrag and balance, before you can be sure your btrfs is 
properly cleaned up, ready for normal use, and the convert thus fully 
successful.
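
In rough outline, the fresh-filesystem path is just this (device names, 
mountpoints and rsync flags are only an illustration):

    mkfs.btrfs /dev/sdb1                  # brand new btrfs, your choice of options
    mount /dev/sdb1 /mnt/new
    rsync -aHAX /mnt/old-ext4/ /mnt/new/  # copy everything over; the old ext4
                                          # stays untouched as the backup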

Meanwhile, even when it functions perfectly, convert, because it /is/ 
dealing with the data and metadata in place, isn't going to give you 
the flexibility, the choices, or the performance of a freshly created 
btrfs, created with the options you want and freshly populated with 
cleanly copied -- and thus fully native and defragged -- data and 
metadata from the old copy, now the backup.

So to a sysadmin considering the risks involved, convert gains you 
nothing and loses you lots, compared to starting with a brand new 
filesystem and copying everything over, thus letting the old filesystem 
remain in place as the initial backup until you're confident the data on 
the new filesystem is complete and the filesystem is functioning properly as 
a full replacement for the old, now backup, copy.

Nevertheless, having a convert utility is "nice", and pragmatically 
recognized as a practical necessity if btrfs is to eventually supplant 
ext* as the assumed Linux default filesystem: despite all the wisdom 
saying otherwise, and despite the risks and disadvantages of 
convert-in-place, some people will never have those backups and will 
just take the risk, and without a convert utility they would simply 
remain with ext*.

(It's worth noting that, arguably, those sorts of people shouldn't be 
switching to a still-maturing filesystem in the first place, as backups, 
or testing only with "losable" data, are still strongly recommended for 
those wishing to try btrfs.  People who aren't willing to deal with 
backups really should be sticking to a rather more mature and stable 
filesystem than btrfs in its current state, and once they /are/ willing 
to deal with backups, the choice of a brand new btrfs with everything 
copied over from the old filesystem, which then becomes the backup, is 
so much better than convert that there's simply no sane reason to use 
convert in the first place.  Thus, arguably, the only people using 
convert at this point should be those whose specific purpose is to test 
it, in order to be sure it's ready for the day when btrfs really is 
stable enough that it can in clear conscience be recommended to people 
without backups, as at least as stable and problem-free as whatever 
they were using previously.  Tho that day's likely some years in the 
future.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: btrfs fi defrag does not defrag files >256kB?
  2016-07-27  1:10 ` Duncan
@ 2016-07-27 17:19   ` Nicholas D Steeves
  2016-07-28  5:14     ` Duncan
  2016-07-28 10:55     ` David Sterba
  0 siblings, 2 replies; 8+ messages in thread
From: Nicholas D Steeves @ 2016-07-27 17:19 UTC (permalink / raw)
  To: Btrfs BTRFS, David Sterba

On 26 July 2016 at 21:10, Duncan <1i5t5.duncan@cox.net> wrote:
> Nicholas D Steeves posted on Tue, 26 Jul 2016 19:03:53 -0400 as excerpted:
>
>> Hi,
>>
>> I've been using btrfs fi defrag without the "-r -t 32M" options for
>> regular maintenance.  I just learned, in
>> Documentation/btrfs-convert.asciidoc, that there is a recommendation
>> to run with "-t 32M" after a conversion from ext2/3/4.  I then
>> cross-referenced this with btrfs-filesystem(8), and found that:
>>
>>     Extents bigger than value given by -t will be skipped, otherwise
>>     this value is used as a target extent size, but is only advisory
>>     and may not be reached if the free space is too fragmented. Use 0
>>     to take the kernel default, which is 256kB but may change in the
>>     future.
>>
>> I understand the default behaviour of target extent size of 256kB to
>> mean only defragment small files and metadata.  Or does this mean that
>> the default behaviour is to defragment extent tree metadata >256kB,
>> and then defragment the (larger than 256kB) data from many extents
>> into a single extent?  I was surprised to read this!
>>
>> What's really happening with this default behaviour?  Should everyone
>> be using -t with a much larger value to actually defragment their
>> databases?
>
> Something about defrag's -t option should really be in the FAQ, as it is
> known to be somewhat confusing and to come up from time to time, tho this
> is the first time I've seen it in the context of convert.
>
> In general, you are correct in that the larger the value given to -t, the
> more defragging you should ultimately get.  There's a practical upper
> limit, however, the data chunk size, which is nominally 1 GiB (tho on
> tiny btrfs it's smaller and on TB-scale it can be larger, to 8 or 10 GiB
> IIRC).  32-bit btrfs-progs defrag also had a bug at one point that would
> (IIRC) kill the parameter if it was set to 2+ GiB -- that has been fixed
> by hard-coding the 32-bit max to 1 GiB, I believe.  The bug didn't affect
> 64-bit.  In any case, 1 GiB is fine, and often the largest btrfs can do
> anyway, due as I said to that being the normal data chunk size.
>
> And btrfs defrag only deals with data.  There's no metadata defrag, tho
> balance -m (or whole filesystem) will normally consolidate the metadata
> into the fewest (nominally 256 MiB) metadata chunks possible as it
> rewrites them.

Thank you for this metadata consolidation tip!

> In that regard a defrag -t 32M recommendation is reasonable for a
> converted filesystem, tho you can certainly go larger... to 1 GiB as I
> said.
>

I only mentioned btrfs-convert.asciidoc, because that's what led me to
the discrepancy between the default target extent size value, and a
recommended value.  I was searching for everything I could find on
defrag, because I had begun to suspect that it wasn't functioning as
expected.

Is there any reason why defrag without -t cannot detect and default to
the data chunk size, or why it does not default to 1 GiB?  In the same
way that balance's default behaviour is a full balance, shouldn't
defrag's default behaviour defrag whole chunks?  Does it not default
to 1 GiB because that would increase the number of cases where defrag
unreflinks and duplicates files--leading to an ENOSPC?

https://github.com/kdave/btrfsmaintenance/blob/master/btrfs-defrag.sh
uses -t 32M; if a default target extent size of 1 GiB is too radical,
why not set it to 32M?  If SLED ships btrfsmaintenance, then defrag -t
32M should be well-tested, no?

Thank you,
Nicholas


* Re: btrfs fi defrag does not defrag files >256kB?
  2016-07-27 17:19   ` btrfs fi defrag does not defrag files >256kB? Nicholas D Steeves
@ 2016-07-28  5:14     ` Duncan
  2016-07-28 10:55     ` David Sterba
  1 sibling, 0 replies; 8+ messages in thread
From: Duncan @ 2016-07-28  5:14 UTC (permalink / raw)
  To: linux-btrfs

Nicholas D Steeves posted on Wed, 27 Jul 2016 13:19:01 -0400 as excerpted:

> Is there any reason why defrag without -t cannot detect and default to
> the data chunk size, or why it does not default to 1 GiB?

I don't know the answer, but have wondered that myself.  256 KiB seems a 
rather small default to me.  I'd expect something in the MiB range at 
least, maybe the same 2 MiB that modern partitioners tend to use for 
alignment, and for the same reason: it tends to be a reasonable whole 
multiple of most erase-block sizes, so if the partition is aligned it 
should prevent unnecessary read-modify-write cycles on SSDs, and it 
should help with the shingled zones on SMR drives as well.

As to the question in the subject line, AFAIK btrfs fi defrag works on 
extents, not file size per se, so with the default 256 KiB target, yes, 
it'll defrag files larger than that, but only the extents that are 
smaller than that.  If all of a file's extents are 256 KiB or larger, 
defrag won't do anything with it without a larger target option, unless 
the compress option is also used, in which case it rewrites everything 
it is pointed at in order to recompress it.

Talking about compression, it's worth mentioning that filefrag doesn't 
understand btrfs compression either, and will count each 128 KiB 
(uncompressed size) compression block as a separate extent.  To get the 
true picture using filefrag, you need to use verbose mode (-v) and 
either eyeball the results manually or feed them into a script that 
processes the numbers and combines "extents" that are reported as 
immediately consecutive on the filesystem.
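
A rough sketch of such a script (untested; the field positions assume 
the usual filefrag -v output layout, so treat it only as a starting 
point):

    filefrag -v somefile | awk '
        /^ *[0-9]+:/ {                          # extent-table rows only
            gsub(/\.\./, " "); gsub(/:/, " ")   # strip the ".." and ":" separators
            pstart = $4; pend = $5              # physical start/end, in fs blocks
            if (!seen || pstart != prev_end + 1)
                runs++                          # not contiguous with the previous row
            prev_end = pend; seen = 1
        }
        END { print runs, "physically contiguous runs" }'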

As such, with the normal opportunistic compression, filefrag turns out 
to be a good way of determining whether a file (over 128 KiB in size) 
is actually compressed or not: if it is, filefrag will report multiple 
128 KiB extents, while if it's not, the extent sizes should be much 
less regular, likely with larger extents unless the file is often 
modified and rewritten in place.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: btrfs fi defrag does not defrag files >256kB?
  2016-07-27 17:19   ` btrfs fi defrag does not defrag files >256kB? Nicholas D Steeves
  2016-07-28  5:14     ` Duncan
@ 2016-07-28 10:55     ` David Sterba
  2016-07-28 17:25       ` Duncan
  2016-07-28 17:53       ` Nicholas D Steeves
  1 sibling, 2 replies; 8+ messages in thread
From: David Sterba @ 2016-07-28 10:55 UTC (permalink / raw)
  To: Nicholas D Steeves; +Cc: Btrfs BTRFS, David Sterba

On Wed, Jul 27, 2016 at 01:19:01PM -0400, Nicholas D Steeves wrote:
> > In that regard a defrag -t 32M recommendation is reasonable for a
> > converted filesystem, tho you can certainly go larger... to 1 GiB as I
> > said.
> 
> I only mentioned btrfs-convert.asciidoc, because that's what led me to
> the discrepancy between the default target extent size value, and a
> recommended value.  I was searching for everything I could find on
> defrag, because I had begun to suspect that it wasn't functioning as
> expected.

Historically, the 256K size comes from the kernel.  Defrag can create
tons of data to write, and that is noticeable on the system.  However,
the results of defragmentation at that size are not satisfactory to the
user, so the recommended value is 32M.  I'd rather not change the
kernel default, but we can increase the default threshold (-t) in the
userspace tools.

> Is there any reason why defrag without -t cannot detect and default to
> the data chunk size, or why it does not default to 1 GiB?

The 1G value wouldn't be reached on an average filesystem where the free
space is fragmented; besides, there are some smaller internal limits on
extent sizes, so the user's target size may not be reached.  The value
32M was found experimentally and tested on various systems, and it
proved to work well.  With 64M the defragmentation was less successful,
but as the value is only a hint, it's not wrong to use it.

> In the same
> way that balance's default behaviour is a full balance, shouldn't
> defrag's default behaviour defrag whole chunks?  Does it not default
> to 1 GiB because that would increase the number of cases where defrag
> unreflinks and duplicates files--leading to an ENOSPC?

Yes, this would also happen, unless the '-f' option is given (flush data
after defragmenting each file).

> https://github.com/kdave/btrfsmaintenance/blob/master/btrfs-defrag.sh
> uses -t 32M; if a default target extent size of 1 GiB is too radical,
> why not set it to 32M?  If SLED ships btrfsmaintenance, then defrag -t
> 32M should be well-tested, no?

It is.


* Re: btrfs fi defrag does not defrag files >256kB?
  2016-07-28 10:55     ` David Sterba
@ 2016-07-28 17:25       ` Duncan
  2016-07-28 17:53       ` Nicholas D Steeves
  1 sibling, 0 replies; 8+ messages in thread
From: Duncan @ 2016-07-28 17:25 UTC (permalink / raw)
  To: linux-btrfs

David Sterba posted on Thu, 28 Jul 2016 12:55:55 +0200 as excerpted:

> On Wed, Jul 27, 2016 at 01:19:01PM -0400, Nicholas D Steeves wrote:
>> > In that regard a defrag -t 32M recommendation is reasonable for a
>> > converted filesystem, tho you can certainly go larger... to 1 GiB as
>> > I said.

And... I see that the progs v4.7-rc1 release has the 32M default.

=:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: btrfs fi defrag does not defrag files >256kB?
  2016-07-28 10:55     ` David Sterba
  2016-07-28 17:25       ` Duncan
@ 2016-07-28 17:53       ` Nicholas D Steeves
  2016-07-29  3:56         ` Duncan
  1 sibling, 1 reply; 8+ messages in thread
From: Nicholas D Steeves @ 2016-07-28 17:53 UTC (permalink / raw)
  To: dsterba, Nicholas D Steeves, Btrfs BTRFS, David Sterba

On 28 July 2016 at 06:55, David Sterba <dsterba@suse.cz> wrote:
> On Wed, Jul 27, 2016 at 01:19:01PM -0400, Nicholas D Steeves wrote:
>> > In that regard a defrag -t 32M recommendation is reasonable for a
>> > converted filesystem, tho you can certainly go larger... to 1 GiB as I
>> > said.
>>
>> I only mentioned btrfs-convert.asciidoc, because that's what led me to
>> the discrepancy between the default target extent size value, and a
>> recommended value.  I was searching for everything I could find on
>> defrag, because I had begun to suspect that it wasn't functioning as
>> expected.
>
> Historically, the 256K size comes from the kernel.  Defrag can create
> tons of data to write, and that is noticeable on the system.  However,
> the results of defragmentation at that size are not satisfactory to the
> user, so the recommended value is 32M.  I'd rather not change the
> kernel default, but we can increase the default threshold (-t) in the
> userspace tools.

Thank you, I just saw that commit too!  To minimize the impact that a
btrfs fi defrag run from a background cron or systemd.trigger job has
on a running system, I've read that "-f" (flush data after the defrag
of each file) is beneficial.  Would it be even more beneficial to
ionice -c idle the defragmentation?
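
I.e. something along these lines (purely illustrative):

    ionice -c 3 btrfs filesystem defragment -r -f -t 32M /srv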

>> Is there any reason why defrag without -t cannot detect and default to
>> the data chunk size, or why it does not default to 1 GiB?
>
> The 1G value wouldn't be reached on an average filesystem where the free
> space is fragmented; besides, there are some smaller internal limits on
> extent sizes, so the user's target size may not be reached.  The value
> 32M was found experimentally and tested on various systems, and it
> proved to work well.  With 64M the defragmentation was less successful,
> but as the value is only a hint, it's not wrong to use it.

Thank you for sharing these results :-)

>> In the same
>> way that balance's default behaviour is a full balance, shouldn't
>> defrag's default behaviour defrag whole chunks?  Does it not default
>> to 1 GiB because that would increase the number of cases where defrag
>> unreflinks and duplicates files--leading to an ENOSPC?
>
> Yes, this would also happen, unless the '-f' option is given (flush data
> after defragmenting each file).

When flushing data after defragmenting each file, one might still hit
an ENOSPC, right?  But because the writes are more atomic it will be
easier to recover from?

Additionally, I've read that -o autodefrag doesn't yet work well for
large databases.  Would a supplementary targeted defrag policy be
useful here?  For example: a general cron/systemd.trigger default of
"-t 32M", and then another job for /var/lib/mysql/ with a policy of
"-f -t 1G"?  Or did your findings also show that large databases did
not benefit from larger target extent defrags?
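
In other words, two hypothetical /etc/cron.d entries along these lines
(schedule and paths made up):

    # general, filesystem-wide pass
    0 3 * * *   root   ionice -c 3 btrfs filesystem defragment -r -t 32M /
    # database-specific pass, flushing after each file
    0 4 * * 0   root   ionice -c 3 btrfs filesystem defragment -r -f -t 1G /var/lib/mysql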

Best regards,
Nicholas


* Re: btrfs fi defrag does not defrag files >256kB?
  2016-07-28 17:53       ` Nicholas D Steeves
@ 2016-07-29  3:56         ` Duncan
  0 siblings, 0 replies; 8+ messages in thread
From: Duncan @ 2016-07-29  3:56 UTC (permalink / raw)
  To: linux-btrfs

Nicholas D Steeves posted on Thu, 28 Jul 2016 13:53:31 -0400 as excerpted:

> Additionally, I've read that -o autodefrag doesn't yet work well for
> large databases.  Would a supplementary targeted defrag policy be useful
> here?  For example: a general cron/systemd.trigger default of "-t 32M",
> and then another job for /var/lib/mysql/ with a policy of "-f -t 1G"? 
> Or did your findings also show that large databases did not benefit from
> larger target extent defrags?

That the autodefrag mount option didn't work well with large rewrite-
pattern files like vm images and databases was the previous advice, yes, 
but that changed at some point.  I'm not sure if autodefrag has always 
worked this way and they simply weren't sure before, or if it changed, 
but in any case, these days, it doesn't rewrite the entire file, only a 
(relatively) larger block of it than the individual 4 KiB block that 
would otherwise be rewritten.  (I'm not sure what size, perhaps the same 
256 KiB that's the kernel default for manual defrag?)

As such, it scales better than it would if the full gig-size (or 
whatever) file was being rewritten, altho there will still be some 
fragmentation.

And for the same reason, it's actually not as bad with snapshots as it 
might have been otherwise, because it only cows/de-reflinks a bit more of 
the file than would otherwise be cowed due to the write in any case, so 
it doesn't duplicate the entire file as originally feared by some, either.

Tho the only way to be sure would be to try it.


Meanwhile, it's worth noting that autodefrag works best if on from the 
beginning, so fragmentation doesn't get ahead of it.  Here, I ensure 
autodefrag is on from the first time I mount it, while the filesystem is 
still empty.  That way, fragmentation should never get out of hand, 
fragmenting free space so badly that large free extents to defrag into 
/can't/ be found, as may be the case if autodefrag isn't turned on until 
later and manual defrag hasn't been done regularly either.  There have 
been a few reports of people waiting to turn it on until the filesystem 
is highly fragmented, and then having several days of low performance as 
defrag tries to catch up.  If it's consistently on from the beginning, 
that shouldn't happen.
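
For instance (device, mountpoint and fstab line purely illustrative):

    # enable autodefrag from the very first mount, while the fs is empty
    mount -o autodefrag /dev/sdb1 /mnt/data
    # or persistently, via /etc/fstab:
    # UUID=<fs-uuid>  /mnt/data  btrfs  defaults,autodefrag  0  0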

Of course that may mean backing up and recreating the filesystem fresh in 
order to have autodefrag on from the beginning, if you're looking at 
trying it on existing filesystems that are likely highly fragmented.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


