* [PATCH] recursive defrag cleanup
@ 2016-12-06  4:39 Anand Jain
  2016-12-06  4:39 ` [PATCH] btrfs-progs: recursive defrag cleanup duplicate code Anand Jain
  2016-12-12 17:19 ` [PATCH] recursive defrag cleanup David Sterba
  0 siblings, 2 replies; 13+ messages in thread
From: Anand Jain @ 2016-12-06  4:39 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs

The command,
   btrfs fi defrag -v /btrfs
 does nothing; it won't defrag the files under /btrfs as the user
 may expect. The command with the recursive option,
   btrfs fi defrag -vr /btrfs
 would defrag all the files under /btrfs, including the files in
 its subdirectories.

 While attempting to fix this, the patch below (1/1) provides
 a cleanup. The actual fix is still pending because, to my
 understanding, nftw() does not provide a way to list the files of
 a directory without also including the files under its subdirectories.
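
 For what it is worth, nftw(3) does report the depth of each visited
 entry to its callback via ftwbuf->level, so one possible way to touch
 only the entries directly under the given directory is to filter in
 the callback. The standalone sketch below is only an illustration and
 not part of this patch; list_top_level is a made-up name:

 #define _XOPEN_SOURCE 500
 #include <ftw.h>
 #include <stdio.h>
 #include <sys/stat.h>

 /* print only the regular files directly under the walked directory */
 static int list_top_level(const char *path, const struct stat *sb,
                           int typeflag, struct FTW *ftwbuf)
 {
         if (ftwbuf->level != 1)   /* skip the root itself and deeper levels */
                 return 0;
         if (typeflag == FTW_F)
                 printf("%s\n", path);
         return 0;
 }

 int main(int argc, char **argv)
 {
         if (argc != 2)
                 return 1;
         /* same flags the defrag code passes to nftw() */
         return nftw(argv[1], list_top_level, 10, FTW_MOUNT | FTW_PHYS) ? 1 : 0;
 }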

Anand Jain (1):
  btrfs-progs: recursive defrag cleanup remove duplicate code

 cmds-filesystem.c | 20 ++++++--------------
 1 file changed, 6 insertions(+), 14 deletions(-)

-- 
2.10.0



* [PATCH] btrfs-progs: recursive defrag cleanup duplicate code
  2016-12-06  4:39 [PATCH] recursive defrag cleanup Anand Jain
@ 2016-12-06  4:39 ` Anand Jain
  2016-12-12 17:11   ` David Sterba
  2016-12-12 17:19 ` [PATCH] recursive defrag cleanup David Sterba
  1 sibling, 1 reply; 13+ messages in thread
From: Anand Jain @ 2016-12-06  4:39 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 cmds-filesystem.c | 20 ++++++--------------
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 41623f3183a8..ecac37edf936 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -1136,21 +1136,13 @@ static int cmd_filesystem_defrag(int argc, char **argv)
 			close_file_or_dir(fd, dirstream);
 			continue;
 		}
-		if (recursive) {
-			if (S_ISDIR(st.st_mode)) {
-				ret = nftw(argv[i], defrag_callback, 10,
+		if (recursive && S_ISDIR(st.st_mode)) {
+			ret = nftw(argv[i], defrag_callback, 10,
 						FTW_MOUNT | FTW_PHYS);
-				if (ret == ENOTTY)
-					exit(1);
-				/* errors are handled in the callback */
-				ret = 0;
-			} else {
-				if (defrag_global_verbose)
-					printf("%s\n", argv[i]);
-				ret = do_defrag(fd, defrag_global_fancy_ioctl,
-						&defrag_global_range);
-				e = errno;
-			}
+			if (ret == ENOTTY)
+				exit(1);
+			/* errors are handled in the callback */
+			ret = 0;
 		} else {
 			if (defrag_global_verbose)
 				printf("%s\n", argv[i]);
-- 
2.10.0



* Re: [PATCH] btrfs-progs: recursive defrag cleanup duplicate code
  2016-12-06  4:39 ` [PATCH] btrfs-progs: recursive defrag cleanup duplicate code Anand Jain
@ 2016-12-12 17:11   ` David Sterba
  0 siblings, 0 replies; 13+ messages in thread
From: David Sterba @ 2016-12-12 17:11 UTC (permalink / raw)
  To: Anand Jain; +Cc: dsterba, linux-btrfs

On Tue, Dec 06, 2016 at 12:39:38PM +0800, Anand Jain wrote:
> Signed-off-by: Anand Jain <anand.jain@oracle.com>

Applied, thanks.


* Re: [PATCH] recursive defrag cleanup
  2016-12-06  4:39 [PATCH] recursive defrag cleanup Anand Jain
  2016-12-06  4:39 ` [PATCH] btrfs-progs: recursive defrag cleanup duplicate code Anand Jain
@ 2016-12-12 17:19 ` David Sterba
  2016-12-19 11:12   ` Anand Jain
  1 sibling, 1 reply; 13+ messages in thread
From: David Sterba @ 2016-12-12 17:19 UTC (permalink / raw)
  To: Anand Jain; +Cc: dsterba, linux-btrfs

On Tue, Dec 06, 2016 at 12:39:37PM +0800, Anand Jain wrote:
> The command,
>    btrfs fi defrag -v /btrfs
>  does nothing; it won't defrag the files under /btrfs as the user
>  may expect. The command with the recursive option,
>    btrfs fi defrag -vr /btrfs
>  would defrag all the files under /btrfs, including the files in
>  its subdirectories.
> 
>  While attempting to fix this, the patch below (1/1) provides
>  a cleanup. The actual fix is still pending because, to my
>  understanding, nftw() does not provide a way to list the files of
>  a directory without also including the files under its subdirectories.

What kind of fix do you mean? We could detect if there's a directory in
the list of arguments and assume the recursive mode. I think this is
what most users expect.

Currently passing a directory will defragment the extent tree, but I
think we should extend the defrag ioctl flags to explicitly ask for that.
At minimum, a directory without -r could print a warning about what it's
really doing. But I'm open to other ideas.
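
For reference, the sketch below (written against the uapi
<linux/btrfs.h> only, not the actual btrfs-progs code) shows roughly
the ioctl that gets issued per path: on a regular file descriptor
BTRFS_IOC_DEFRAG_RANGE defragments the file data, while on a directory
descriptor the kernel currently defragments the tree instead (and
requires CAP_SYS_ADMIN), which is the surprising part discussed above.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>

int main(int argc, char **argv)
{
	struct btrfs_ioctl_defrag_range_args range;
	int fd, ret;

	if (argc != 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(&range, 0, sizeof(range));
	range.len = (__u64)-1;			/* whole file */
	range.extent_thresh = 32 * 1024 * 1024;	/* the -t target extent size */
	ret = ioctl(fd, BTRFS_IOC_DEFRAG_RANGE, &range);
	if (ret < 0)
		perror("BTRFS_IOC_DEFRAG_RANGE");
	close(fd);
	return ret < 0;
}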


* Re: [PATCH] recursive defrag cleanup
  2016-12-12 17:19 ` [PATCH] recursive defrag cleanup David Sterba
@ 2016-12-19 11:12   ` Anand Jain
  2016-12-28 10:40     ` Janos Toth F.
  0 siblings, 1 reply; 13+ messages in thread
From: Anand Jain @ 2016-12-19 11:12 UTC (permalink / raw)
  To: dsterba, linux-btrfs



On 12/13/16 01:19, David Sterba wrote:
> On Tue, Dec 06, 2016 at 12:39:37PM +0800, Anand Jain wrote:
>> The command,
>>    btrfs fi defrag -v /btrfs
>>  does nothing; it won't defrag the files under /btrfs as the user
>>  may expect. The command with the recursive option,
>>    btrfs fi defrag -vr /btrfs
>>  would defrag all the files under /btrfs, including the files in
>>  its subdirectories.
>>
>>  While attempting to fix this, the patch below (1/1) provides
>>  a cleanup. The actual fix is still pending because, to my
>>  understanding, nftw() does not provide a way to list the files of
>>  a directory without also including the files under its subdirectories.
>
> What kind of fix do you mean? We could detect if there's a directory in
> the list of arguments and assume the recursive mode. I think this is
> what most users expect.

As per BTRFS-FILESYSTEM(8)
  :
---------
Note
Directory arguments without -r do not defragment files recursively but 
will defragment certain internal trees (extent tree and the subvolume 
tree). This has been confusing and could be removed in the future.
---------

  I was thinking of the following as a fix:
    without -r: apply it on the extent tree (current behavior)
                (and probably add a warning, as you suggest below).
    with -r only: error out and suggest also using a (new) per-file
                  option -x (instead of applying it on all files of the
                  dir in the arg and its sub-dirs. [*])
    with -x: apply it on all the files in the dir in the arg
    with -xr: apply it on all files in the dir in the arg and
              its sub-dirs.

> Currently passing a directory will defragment the extent tree, but I
> think we should extend the defrag ioctl flags to explicitly ask for that.

  I am ok with this approach as well, which means,
    without -r: apply it on all files in dir. (instead of current way
                of applying on the extent tree. [*])
    with -r: apply it on all files in dir and its sub-dir.
    with -e (new): apply it on extent tree.

  [*] changes the current cli semantics

> At minimum, a directory without -r could print a warning about what it's
> really doing. But I'm open to other ideas.

  agreed.

  (I was on vacation, sorry for the delay.)


Thanks, Anand


* Re: [PATCH] recursive defrag cleanup
  2016-12-19 11:12   ` Anand Jain
@ 2016-12-28 10:40     ` Janos Toth F.
  2017-01-03 10:16       ` Anand Jain
  0 siblings, 1 reply; 13+ messages in thread
From: Janos Toth F. @ 2016-12-28 10:40 UTC (permalink / raw)
  To: Btrfs BTRFS

I still find the defrag tool a little bit confusing from a user perspective:
- Does the recursive defrag (-r) also defrag the specified directory's
extent tree or should one run two separate commands for completeness
(one with -r and one without -r)?
- What's the target scope of the extent tree defragmentation? Is it
recursive on the tree (regardless of the -r option) and thus
defragments all the extent trees in case one targets the root
subvolume?

In other words: What is the exact sequence of commands if one wishes
to defragment the whole filesystem as extensively as possible (all
files and extent trees included)?

There used to be a script floating around on various wikis (for
example, the Arch Linux wiki) which used the "find" tool to feed each
and every directory to the defrag command. I always thought that must
be overkill and now it's gone, but I don't see further explanations
and/or new scripts in place (other than a single command with the -r
option).


It's also a little mystery for me if balancing the metadata chunks is
supposed to be effectively defragmenting the metadata or not and what
the best practice regarding that issue is.
In my personal experience Btrfs filesystems tend to get slower over
time, up to the point where it takes several minutes to mount them or
to delete some big files (observed on HDDs, not on SSDs where the
sheer speed might mask the problem and the filesystem tends to be smaller
anyway). When it gets really bad, Gentoo's localmount script starts to
time out on boot and Samba based network file deletions tend to freeze
the client Windows machine's file explorer.
It only takes 3-6 months and/or roughly 10-20 times the total
disk(s) capacity's worth of write load to get there. Defrag doesn't
seem to help with that but running a balance on each and every
metadata blocks (data and system blocks can be skipped) seems to
"renew" it (no more timeouts or noticeable delays on mount, metadata
operation are as fast as expected, it works like a young
filesystem...).

One might expect that targeting the root subvolume with a recursive
defrag will take care of metadata fragmentation as well but it doesn't
seem to be the case and I don't see anybody recommending regular
metadata balancing.
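
(In command terms that is a metadata-only balance, something like
"btrfs balance start -m <mountpoint>", which rewrites just the
metadata block groups.)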


* Re: [PATCH] recursive defrag cleanup
  2016-12-28 10:40     ` Janos Toth F.
@ 2017-01-03 10:16       ` Anand Jain
  2017-01-03 14:21         ` Janos Toth F.
  0 siblings, 1 reply; 13+ messages in thread
From: Anand Jain @ 2017-01-03 10:16 UTC (permalink / raw)
  To: Janos Toth F.; +Cc: Btrfs BTRFS



  Thanks for the comments.

  We are in the midst of making defrag better. For now, the -r option
  picks up the files of the specified dir; there is no way to defrag
  all the subvolume trees without scripting something like this.

  If /mnt is mounted with subvolid=5 (default):
  for s in $(btrfs subvolume list /mnt | awk '{print $NF}')
  do
	btrfs fi defrag "/mnt/$s"
  done
  btrfs fi defrag /mnt
  btrfs fi defrag -vr /mnt    # file defrag

> In my personal experience Btrfs filesystems tend to get slower over
> time, up to the point where it takes several minutes to mount them or
> to delete some big files (observed on HDDs, not on SSDs where the
> sheer speed might mask the problem and the filesystem tends to be smaller
> anyway). When it gets really bad, Gentoo's localmount script starts to
> time out on boot and Samba based network file deletions tend to freeze
> the client Windows machine's file explorer.
> It only takes 3-6 months and/or roughly 10-20 times the total
> disk(s) capacity's worth of write load to get there.

  IMO the solution has to be reviewed against the use case here,
  and that certainly takes a lot of time.

  In some cases the access pattern for writes may be different
  from the pattern for read-only access. We need to understand the
  context in which the access leads to the timeout, and then tuning
  might help.


> Defrag doesn't
> seem to help with that but running a balance on each and every
> metadata blocks (data and system blocks can be skipped) seems to
> "renew" it (no more timeouts or noticeable delays on mount, metadata
> operation are as fast as expected, it works like a young
> filesystem...).

  What's the defrag command that you found didn't help, while the
  above balance command did?

Thanks Anand


* Re: [PATCH] recursive defrag cleanup
  2017-01-03 10:16       ` Anand Jain
@ 2017-01-03 14:21         ` Janos Toth F.
  2017-01-03 16:01           ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 13+ messages in thread
From: Janos Toth F. @ 2017-01-03 14:21 UTC (permalink / raw)
  To: Btrfs BTRFS

So, in order to defrag "everything" in the filesystem (which is
possible to / potentially needs defrag) I need to run:
1: a recursive defrag starting from the root subvolume (to pick up all
the files in all the possible subvolumes and directories)
2: a non-recursive defrag on the root subvolume + (optionally)
additional non-recursive defrag(s) on all the other subvolume(s) (if
any) [but not on all directories like some old scripts did]

In my opinion, the recursive defrag should pick up and operate on all
the subvolumes, including the one specified in the command line (if
it's a subvolume) and all subvolumes "below" it (not on files only).


Does the -t parameter have any meaning/effect on non-recursive (tree)
defrag? I usually go with 32M because t>=128Mb tends to be unduly slow
(it takes a lot of time, even if I try to run it repeatedly on the
same static file several times in a row whereas t<=32M finishes rather
quickly in this case -> could this be a bug or design flaw?).


I have a Btrfs filesystem (among others) on a single HDD with
single,single,single block profiles which is effectively write-only.
Nine concurrent ffmpeg processes write files from real-time video
streams 24/7 (there is no pre-allocation, the files just grow and grow
for an hour until a new one starts). A daily cronjob deletes the old
files every night and starts a recursive defrag on the root subvolume
(there are no other subvolumes, only the default id=5). I appended a
non-recursive defrag to this maintenance script now but I doubt it
does anything meaningful in this case (it finishes very fast, so I
don't think it does too much work). This is the filesystem which
"degrades" in speed for me very fast and needs metadata re-balance
from time to time (I usually do it before every kernel upgrade and
thus reboot, in order to avoid possible localmount rc-script
timeouts).

I know I should probably use a much more simple filesystem (might even
vfat, or ext4 - possibly with the journal disabled) for this kind of
storage but I was curious how Btrfs can handle the job (with CoW
enabled, no less). All in all, everything is fine except the
degradation of metadata performance. Since it's mostly write-only, I
could even skip the file defrags (I originally scheduled it in a hope
it will overcome the metadata slowdown problems and it's also useful
[even if not necessary] to have the files defragmented in case I
occasionally want to use them). I am not sure but I guess defraging
the files helps to reduce the overall metadata size and thus makes the
balance step faster (quick balancing) and more efficient (better
post-balance performance).


I can't remember the exact script but it basically fed every single
directories (not just subvolumes) to the defrag tool using 'find' and
it was meant to complement a separate recursive defrag step. It was
supposed to defrag the metadata (the metadata of every single
directory below the specified location, one by one, so it was very
quick on my video-archive but very slow on my system root and didn't
really seem to achieve anything on either of them).


* Re: [PATCH] recursive defrag cleanup
  2017-01-03 14:21         ` Janos Toth F.
@ 2017-01-03 16:01           ` Austin S. Hemmelgarn
  2017-01-03 18:16             ` Janos Toth F.
  0 siblings, 1 reply; 13+ messages in thread
From: Austin S. Hemmelgarn @ 2017-01-03 16:01 UTC (permalink / raw)
  To: Janos Toth F., Btrfs BTRFS

On 2017-01-03 09:21, Janos Toth F. wrote:
> So, in order to defrag "everything" in the filesystem (which is
> possible to / potentially needs defrag) I need to run:
> 1: a recursive defrag starting from the root subvolume (to pick up all
> the files in all the possible subvolumes and directories)
> 2: a non-recursive defrag on the root subvolume + (optionally)
> additional non-recursive defrag(s) on all the other subvolume(s) (if
> any) [but not on all directories like some old scripts did]
>
> In my opinion, the recursive defrag should pick up and operate on all
> the subvolumes, including the one specified in the command line (if
> it's a subvolume) and all subvolumes "below" it (not on files only).
I agree on this point.  I actually hadn't known that it didn't recurse 
into sub-volumes, and that's a pretty significant caveat that should be 
documented (and ideally fixed, defrag doesn't need to worry about 
cross-subvolume stuff because it breaks reflinks anyway).
>
> Does the -t parameter have any meaning/effect on non-recursive (tree)
> defrag? I usually go with 32M because t>=128Mb tends to be unduly slow
> (it takes a lot of time, even if I try to run it repeatedly on the
> same static file several times in a row whereas t<=32M finishes rather
> quickly in this case -> could this be a bug or design flaw?).
For single directories, -t almost certainly has near zero effect since 
directories are entirely in metadata.  For single files, it should only 
have an effect if it's smaller than the size of the file (it probably is 
for your usage if you've got hour long video files).  As far as the 
behavior above 128MB, stuff like that is expected to a certain extent 
when you have highly fragmented free space (the FS has to hunt harder to 
find a large enough free area to place the extent).

FWIW, unless you have insanely slow storage, 32MB is a reasonable target 
fragment size.  Fragmentation is mostly an issue with sequential reads, 
and usually by the time you're through processing that 32MB of data, 
your storage device will have the next 32MB ready.  The optimal value of 
course depends on many things, but 32-64MB is reasonable for most users 
who aren't streaming multiple files simultaneously off of a slow hard drive.
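
To put rough numbers on that, assuming a disk that streams ~150MB/s and
seeks in ~10ms: a 32MB fragment takes roughly 200ms to read, so one seek
per fragment adds only about 5% overhead, whereas at 1MB fragments the
seek time already exceeds the transfer time.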
>
> I have a Btrfs filesystem (among others) on a single HDD with
> single,single,single block profiles which is effectively write-only.
> Nine concurrent ffmpeg processes write files from real-time video
> streams 24/7 (there is no pre-allocation, the files just grow and grow
> for an hour until a new one starts). A daily cronjob deletes the old
> files every night and starts a recursive defrag on the root subvolume
> (there are no other subvolumes, only the default id=5). I appended a
> non-recursive defrag to this maintenance script now but I doubt it
> does anything meaningful in this case (it finishes very fast, so I
> don't think it does too much work). This is the filesystem which
> "degrades" in speed for me very fast and needs metadata re-balance
> from time to time (I usually do it before every kernel upgrade and
> thus reboot, in order to avoid possible localmount rc-script
> timeouts).
>
> I know I should probably use a much more simple filesystem (might even
> vfat, or ext4 - possibly with the journal disabled) for this kind of
> storage but I was curious how Btrfs can handle the job (with CoW
> enabled, no less). All in all, everything is fine except the
> degradation of metadata performance. Since it's mostly write-only, I
> could even skip the file defrags (I originally scheduled it in a hope
> it will overcome the metadata slowdown problems and it's also useful
> [even if not necessary] to have the files defragmented in case I
> occasionally want to use them). I am not sure but I guess defraging
> the files helps to reduce the overall metadata size and thus makes the
> balance step faster (quick balancing) and more efficient (better
> post-balance performance).
Really use case specific question, but have you tried putting each set 
of files (one for each stream) in it's own sub-volume?  Your metadata 
performance is probably degrading from the sheer number of extents 
involved (assuming H.264 encoding and full HD video with DVD quality 
audio, you're probably looking at at least 1000 extents for each file, 
probably more), and splitting into a subvolume per-stream should 
segregate the metadata for each set of files, which should in turn help 
avoid stuff like lock contention (and may actually make both balance and 
defrag run faster).
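
(As a rough cross-check, assuming a 4-8Mbit/s stream, an hour of video
is about 2-3.5GB, and at an average extent size of 1-2MB that is
already on the order of 1000-3000 extents per file.)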
>
> I can't remember the exact script but it basically fed every single
> directories (not just subvolumes) to the defrag tool using 'find' and
> it was meant to complement a separate recursive defrag step. It was
> supposed to defrag the metadata (the metadata of every single
> directory below the specified location, one by one, so it was very
> quick on my video-archive but very slow on my system root and didn't
> really seem to achieve anything on either of them).
In general, unless you're talking about a directory with tens of 
thousands of entries (which you should be avoiding for other reasons, 
even on other filesystems), defragmenting a directory itself will 
usually have near zero impact on performance on modern storage hardware. 
  In most cases, the whole directory fits entirely in the cache on the 
storage device and gets pre-loaded by read-ahead done by the device 
firmware, so the only optimization is in the lookup itself, which is 
already really efficient because of how BTRFS stores everything in 
B-trees.  You also have to factor in that directories tend to have more 
sticking power than file blocks in the VFS cache, since they're 
(usually) used more frequently, so once you've read the directory the 
first time, it's almost certainly going to be completely in cache.

To put it in perspective, a directory with about 20-25 entries and all 
file/directory names less than 15 characters (roughly typical root 
directory, not counting the . and .. pseudo-entries) easily fits 
entirely in one metadata block on BTRFS with a 16k block size (the 
current default), with lots of room to spare.  My home directory on my 
laptop, which has a total of 129 entries with no names longer than 50 
bytes (and an average filename length of about 30 bytes), fits entirely 
in about 4 16k metadata blocks on BTRFS (assuming I did the math right 
for this one, I may not have).

If the directory is one block or smaller, defrag will do absolutely 
nothing to it (it's already got only one fragment).  If it's more than 
one block, defrag will try to put those blocks together in order, but 
the difference between 4 16k reads and 1 64k read on most modern storage 
devices is near zero, so it often will have near zero impact on that 
aspect of performance unless things are really bad (for example, if I 
had a traditional hard drive and each of those four metadata blocks were 
exactly equally spaced across the whole partition, I might see some 
difference in performance), but even then you're talking small enough 
improvements that you won't notice unless you're constantly listing the 
directory and trashing the page cache at the same time.


* Re: [PATCH] recursive defrag cleanup
  2017-01-03 16:01           ` Austin S. Hemmelgarn
@ 2017-01-03 18:16             ` Janos Toth F.
  2017-01-03 20:08               ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 13+ messages in thread
From: Janos Toth F. @ 2017-01-03 18:16 UTC (permalink / raw)
  To: Btrfs BTRFS

On Tue, Jan 3, 2017 at 5:01 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> I agree on this point.  I actually hadn't known that it didn't recurse into
> sub-volumes, and that's a pretty significant caveat that should be
> documented (and ideally fixed, defrag doesn't need to worry about
> cross-subvolume stuff because it breaks reflinks anyway).

I think it descends into subvolumes to pick up the files (data)
inside them. I was referring to picking up the "child" subvolumes
(trees) and defrag those (as if you fed all the subvolumes to a
non-recursive defrag one-by-one with the current implementation --- if
I understand this current implementation correctly*).

To keep it simple: the recursive mode (IMO) should not ignore any
entities which the defrag tool is able to meaningfully operate on (no
matter if these are file data, directory metadata or subvolume tree
metadata, etc --- if it can be fragmented and can be defraged by this
tool, it should be done during a recursive mode operation with no
exceptions --- unless you can and do set explicit exceptions). I think
only the subvolume and/or the directory (*) metadata are currently
ignored by the recursive mode (if anything).

* But you got me a little bit confused again. After reading the first
few emails in this thread I thought only files (data) and subvolumes
(tree metadata) can be defraged by this tool and it's a no-op for
regular directories. Yet you seem to imply it's possible to defrag
regular directories (the directory metadata), meaning defrag can
operate on 3 types of entities in total (file data, subvolume tree
metadata, regular directory metadata).

> For single directories, -t almost certainly has near zero effect since
> directories are entirely in metadata.  For single files, it should only have
> an effect if it's smaller than the size of the file (it probably is for your
> usage if you've got hour long video files).  As far as the behavior above
> 128MB, stuff like that is expected to a certain extent when you have highly
> fragmented free space (the FS has to hunt harder to find a large enough free
> area to place the extent).
>
> FWIW, unless you have insanely slow storage, 32MB is a reasonable target
> fragment size.  Fragmentation is mostly an issue with sequential reads, and
> usually by the time you're through processing that 32MB of data, your
> storage device will have the next 32MB ready.  The optimal value of course
> depends on many things, but 32-64MB is reasonable for most users who aren't
> streaming multiple files simultaneously off of a slow hard drive.

Yes, I know and it's not a problem to use <=32Mb. I just wondered why
>=128Mb seems to be so incredibly slow for me.
Well, actually, I also wondered if the defrag tool can "create" big
enough continuous free space chunks by relocating other (probably
small[ish]) files (including non-fragmented files) in order to make
room for huge fragmented files to be re-assembled there as continuous
files. I just didn't make the connection between these two questions.
I mean, defrag will obviously fail with huge target extent sizes and
huge fragmented files if the free space is fragmented (and why
wouldn't it be somewhat fragmented...? deleting fragmented files will
result in fragmented free space and new files will be fragmented if
free space is fragmented, so you will delete fragmented files once
again, and it goes on forever -> that was my nightmare with ZFS... it
feels like the FS can only become more and more fragmented over time
unless you keep a lot of free space [let's say >50%] all the time and
it still remains somewhat random).

Although, this brings up complications. A really extensive defrag
would require some sort of smart planning: building a map of objects
(including free space and continuous files), divining the best
possible target and trying to achieve that shape by heavy
reorganization of (meta/)data.

> Really use case specific question, but have you tried putting each set of
> files (one for each stream) in it's own sub-volume?  Your metadata
> performance is probably degrading from the sheer number of extents involved
> (assuming H.264 encoding and full HD video with DVD quality audio, you're
> probably looking at at least 1000 extents for each file, probably more), and
> splitting into a subvolume per-stream should segregate the metadata for each
> set of files, which should in turn help avoid stuff like lock contention
> (and may actually make both balance and defrag run faster).

Before I had a dedicated disk+filesystem for these files I did think
about creating a subvolume for all these video recordings rather than
keeping them in a simple directory under a big multi-disk filesystem's
root/default subvolume. (The decision to separate these files was
forced by an external scale-ability problem --- limited number of
connectors/slots for disks and limited "working" RAID options in Btrfs
--- rather than an explicit desire for segregation -> although in the
light of these issues it might have come on its own at some point by now)
but I didn't really see the point. On the contrary, I would think
segregation by subvolumes could only complicate things further. It can
only increase the total complexity if it does anything. The total
amount of metadata will be roughly the same or more but not less. You
just add more complexity to the basket (making it bigger in some
sense) by introducing subvolumes.

But if it could "serve the common good", I could certainly try as a test-case.

The file size tends to be anywhere between 200 and 2000 Megabytes and
I observed some heavy fragmentation, like ~2k extents per ~2Gb files,
thus 1Mb/extent sizes on average. I guess it also depends on the total
write cache load (some database-like loads often result in write cache
flushing-frenzies but other times I allow up to ~1Gb to be cached in
memory before the disk has to write anything, so the extent size could
build up to >32Mb --- if the allocator is smart enough and free space
fragments are big enough...).
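
For anyone who wants to put numbers on this kind of fragmentation, the
standalone sketch below uses the standard FIEMAP ioctl (the same
mechanism filefrag relies on); with fm_extent_count left at 0 the
kernel only reports how many extents the file has in fm_mapped_extents:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	struct fiemap fm;
	int fd;

	if (argc != 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(&fm, 0, sizeof(fm));
	fm.fm_length = ~0ULL;		/* map the whole file */
	fm.fm_flags = FIEMAP_FLAG_SYNC;	/* flush dirty data first */
	fm.fm_extent_count = 0;		/* count only, no extent array */
	if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) {
		perror("FS_IOC_FIEMAP");
		close(fd);
		return 1;
	}
	printf("%s: %u extents\n", argv[1], fm.fm_mapped_extents);
	close(fd);
	return 0;
}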

> You also have to factor
> in that directories tend to have more sticking power than file blocks in the
> VFS cache, since they're (usually) used more frequently, so once you've read
> the directory the first time, it's almost certainly going to be completely
> in cache.

I tried to tune that in the past (to favor metadata even more than the
default behavior) but I ended up with OOMs.

> To put it in perspective, a directory with about 20-25 entries and all
> file/directory names less than 15 characters (roughly typical root
> directory, not counting the . and .. pseudo-entries) easily fits entirely in
> one metadata block on BTRFS with a 16k block size (the current default),
> with lots of room to spare.

I use 4k nodesize. I am not sure why I picked that (probably in order
to try to minimize locking contention, which I thought I might have
had a problem with, years ago).

> then you're talking small enough improvements that you won't notice unless
> you're constantly listing the directory and trashing the page cache at the
> same time.

Well, actually, I do. I already filed a request on ffmpeg's bug
tracker to ask for Direct-IO support because video recording with
ffmpeg constantly flushes my page cache (and it's not the only job of
this little home server).


* Re: [PATCH] recursive defrag cleanup
  2017-01-03 18:16             ` Janos Toth F.
@ 2017-01-03 20:08               ` Austin S. Hemmelgarn
  2017-01-04 22:12                 ` Janos Toth F.
  0 siblings, 1 reply; 13+ messages in thread
From: Austin S. Hemmelgarn @ 2017-01-03 20:08 UTC (permalink / raw)
  To: Janos Toth F., Btrfs BTRFS

On 2017-01-03 13:16, Janos Toth F. wrote:
> On Tue, Jan 3, 2017 at 5:01 PM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> I agree on this point.  I actually hadn't known that it didn't recurse into
>> sub-volumes, and that's a pretty significant caveat that should be
>> documented (and ideally fixed, defrag doesn't need to worry about
>> cross-subvolume stuff because it breaks reflinks anyway).
>
> I think it descends into subvolumes to pick up the files (data)
> inside them. I was referring to picking up the "child" subvolumes
> (trees) and defrag those (as if you fed all the subvolumes to a
> non-recursive defrag one-by-one with the current implementation --- if
> I understand this current implementation correctly*).
>
> To keep it simple: the recursive mode (IMO) should not ignore any
> entities which the defrag tool is able to meaningfully operate on (no
> matter if these are file data, directory metadata or subvolume tree
> metadata, etc --- if it can be fragmented and can be defraged by this
> tool, it should be done during a recursive mode operation with no
> exceptions --- unless you can and do set explicit exceptions). I think
> only the subvolume and/or the directory (*) metadata are currently
> ignored by the recursive mode (if anything).
>
> * But you got me a little bit confused again. After reading the first
> few emails in this thread I thought only files (data) and subvolumes
> (tree metadata) can be defraged by this tool and it's a no-op for
> regular directories. Yet you seem to imply it's possible to defrag
> regular directories (the directory metadata), meaning defrag can
> operate on 3 types of entities in total (file data, subvolume tree
> metadata, regular directory metadata).
Actually, I was under the impression that it could defrag directory 
metadata, but I may be completely wrong about that (and it wouldn't 
surprise me if it didn't, considering what I mentioned about it probably 
having near zero performance benefit for most people).
>
>> For single directories, -t almost certainly has near zero effect since
>> directories are entirely in metadata.  For single files, it should only have
>> an effect if it's smaller than the size of the file (it probably is for your
>> usage if you've got hour long video files).  As far as the behavior above
>> 128MB, stuff like that is expected to a certain extent when you have highly
>> fragmented free space (the FS has to hunt harder to find a large enough free
>> area to place the extent).
>>
>> FWIW, unless you have insanely slow storage, 32MB is a reasonable target
>> fragment size.  Fragmentation is mostly an issue with sequential reads, and
>> usually by the time you're through processing that 32MB of data, your
>> storage device will have the next 32MB ready.  The optimal value of course
>> depends on many things, but 32-64MB is reasonable for most users who aren't
>> streaming multiple files simultaneously off of a slow hard drive.
>
> Yes, I know and it's not a problem to use <=32Mb. I just wondered why
> >=128Mb seems to be so incredibly slow for me.
> Well, actually, I also wondered if the defrag tool can "create" big
> enough continuous free space chunks by relocating other (probably
> small[ish]) files (including non-fragmented files) in order to make
> room for huge fragmented files to be re-assembled there as continuous
> files. I just didn't make the connection between these two questions.
> I mean, defrag will obviously fail with huge target extent sizes and
> huge fragmented files if the free space is fragmented (and why
> wouldn't it be somewhat fragmented...? deleting fragmented files will
> result in fragmented free space and new files will be fragmented if
> free space is fragmented, so you will delete fragmented files once
> again, and it goes on forever -> that was my nightmare with ZFS... it
> feels like the FS can only become more and more fragmented over time
> unless you keep a lot of free space [let's say >50%] all the time and
> it still remains somewhat random).

>
> Although, this brings up complications. A really extensive defrag
> would require some sort of smart planning: building a map of objects
> (including free space and continuous files), divining the best
> possible target and trying to achieve that shape by heavy
> reorganization of (meta/)data.
>
>> Really use case specific question, but have you tried putting each set of
>> files (one for each stream) in it's own sub-volume?  Your metadata
>> performance is probably degrading from the sheer number of extents involved
>> (assuming H.264 encoding and full HD video with DVD quality audio, you're
>> probably looking at at least 1000 extents for each file, probably more), and
>> splitting into a subvolume per-stream should segregate the metadata for each
>> set of files, which should in turn help avoid stuff like lock contention
>> (and may actually make both balance and defrag run faster).
>
> Before I had a dedicated disk+filesystem for these files I did think
> about creating a subvolume for all these video recordings rather than
> keeping them in a simple directory under a big multi-disk filesystem's
> root/default subvolume. (The decision to separate these files was
> forced by an external scale-ability problem --- limited number of
> connectors/slots for disks and limited "working" RAID options in Btrfs
> --- rather than an explicit desire for segregation -> although in the
> light of these issues it might have come on its own at some point by now)
> but I didn't really see the point. On the contrary, I would think
> segregation by subvolumes could only complicate things further. It can
> only increase the total complexity if it does anything. The total
> amount of metadata will be roughly the same or more but not less. You
> just add more complexity to the basket (making it bigger in some
> sense) by introducing subvolumes.
Because each subvolume is functionally it's own tree, it has it's own 
locking for changes and other stuff, which means that splitting into 
subvolumes will usually help with concurrency.  A lot of high 
concurrency performance benchmarks do significantly better if you split 
things into individual subvolumes (and this drives a couple of the other 
kernel developers crazy to no end).  It's not well published, but this 
is actually the recommended usage if you can afford the complexity and 
don't need snapshots.
>
> But if it could "serve the common good", I could certainly try as a test-case.
>
> The file size tends to be anywhere between 200 and 2000 Megabytes and
> I observed some heavy fragmentation, like ~2k extents per ~2Gb files,
> thus 1Mb/extent sizes on average. I guess it also depends on the total
> write cache load (some database-like loads often result in write cache
> flushing-frenzies but other times I allow up to ~1Gb to be cached in
> memory before the disk has to write anything, so the extent size could
> build up to >32Mb --- if the allocator is smart enough and free space
> fragments are big enough...).
In general, the allocator should be smart enough to do this properly.

As far as how much you're buffering for write-back, that should depend 
entirely on how fast your RAM is relative to your storage device.  The 
smaller the gap between your storage and your RAM in terms of speed, the 
more you should be buffering (up to a point).  FWIW, I find that with 
DDR3-1600 RAM and a good (~540MB/s sequential write) SATA3 SSD, about 
160-320MB gets a near ideal balance of performance, throughput, and 
fragmentation, but of course YMMV.
>
>> You also have to factor
>> in that directories tend to have more sticking power than file blocks in the
>> VFS cache, since they're (usually) used more frequently, so once you've read
>> the directory the first time, it's almost certainly going to be completely
>> in cache.
>
> I tried to tune that in the past (to favor metadata even more than the
> default behavior) but I ended up with OOMs.
Yeah, it's not easy, especially since Linux doesn't support setting the 
parameters per-filesystem (or better yet per-block-device).
>
>> To put it in perspective, a directory with about 20-25 entries and all
>> file/directory names less than 15 characters (roughly typical root
>> directory, not counting the . and .. pseudo-entries) easily fits entirely in
>> one metadata block on BTRFS with a 16k block size (the current default),
>> with lots of room to spare.
>
> I use 4k nodesize. I am not sure why I picked that (probably in order
> to try to minimize locking contention, which I thought I might have had a
> problem with, years ago).
You might want to try with 16k node-size.  It's been the default for a 
while now for new filesystems (or at least, large filesystems, but yours 
is probably well above the threshold considering that you're talking 
about multiple hour long video streams).  A larger node-size helps with 
caching (usually), and often cuts down on fragmentation a bit (because 
it functionally sets the smallest possible fragment, so 16k node-size 
means you have worst-case 4 times fewer fragments than with a 4k node-size).
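
(Note that the node size is fixed when the filesystem is created, e.g.
mkfs.btrfs -n 16k, so switching means recreating the filesystem.)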
>
>> then you're talking small enough improvements that you won't notice unless
>> you're constantly listing the directory and trashing the page cache at the
>> same time.
>
> Well, actually, I do. I already filed a request on ffmpeg's bug
> tracker to ask for Direct-IO support because video recording with
> ffmpeg constantly flushes my page cache (and it's not the only job of
> this little home server).
Out of curiosity, just on this part, have you tried using cgroups to 
keep the memory usage isolated better?  Setting up a cgroup correctly 
for this is of course non-trivial, but at least you won't take down the 
whole machine if you get the parameters wrong.  Check 
Documentation/cgroup-v1/memory.txt and/or Documentation/cgroup-v2.txt in 
the kernel source tree for info on setup (If you're using systemd, 
you're using cgroup-v2, if you're using OpenRC (Gentoo and friends 
default init system), you'd be using cgroup-v1, beyond that, I have no 
idea).

Also, if you can get ffmpeg to spit out the stream on stdout, you could 
pipe to dd and have that use Direct-IO.  The dd command should be 
something along the lines of:
dd of=<filename> oflag=direct iflag=fullblock bs=<arbitrary large 
multiple of node-size>
The oflag will force dd to open the output file with O_DIRECT, the iflag 
will force it to collect full blocks of data before writing them (the 
size is set by bs=, I'd recommend using a power of 2 that's a multiple 
of your node-size, larger numbers will increase latency but reduce 
fragmentation and improve throughput).  This may still use a significant 
amount of RAM (the pipe is essentially an in-memory buffer), and may 
crowd out other applications, but I have no idea how much it may or may 
not help.
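
If the pipe through dd turns out to be awkward, the same idea can be
done with a tiny helper.  The sketch below is only an illustration of
the mechanism, not a tested tool: it reads stdin into an aligned buffer
and writes full blocks with O_DIRECT, which is essentially what
oflag=direct plus iflag=fullblock give you, then drops O_DIRECT for the
unaligned tail.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK (16 * 1024 * 1024)	/* the "bs=", a multiple of the node size */

int main(int argc, char **argv)
{
	char *buf;
	int fd;
	ssize_t got;
	size_t fill = 0;

	if (argc != 2)
		return 1;
	if (posix_memalign((void **)&buf, 4096, BLOCK))
		return 1;
	fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* collect full blocks (what iflag=fullblock does), then write them */
	while ((got = read(STDIN_FILENO, buf + fill, BLOCK - fill)) > 0) {
		fill += got;
		if (fill == BLOCK) {
			if (write(fd, buf, BLOCK) != BLOCK) {
				perror("write");
				return 1;
			}
			fill = 0;
		}
	}
	if (fill) {	/* unaligned tail: drop O_DIRECT and flush the rest */
		fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) & ~O_DIRECT);
		if (write(fd, buf, fill) < 0)
			perror("write");
	}
	close(fd);
	free(buf);
	return 0;
}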


* Re: [PATCH] recursive defrag cleanup
  2017-01-03 20:08               ` Austin S. Hemmelgarn
@ 2017-01-04 22:12                 ` Janos Toth F.
  2017-01-05 18:13                   ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 13+ messages in thread
From: Janos Toth F. @ 2017-01-04 22:12 UTC (permalink / raw)
  To: Btrfs BTRFS

I separated these 9 camera storages into 9 subvolumes (so now I have
10 subvols in total in this filesystem with the "root" subvol). It's
obviously way too early to talk about long term performance but now I
can tell that recursive defrag does NOT descend into "child"
subvolumes (it does not pick up the files located in these "child"
subvolumes when I point it to the "root" subvolume with the -r
option). That's very inconvenient (one might need to write a script
with a long static list of subvolumes and maintain it over time or
write a script which acquires the list from the subvolume list command
and feeds it to the defrag command one-by-one).

> Because each subvolume is functionally it's own tree, it has it's own
> locking for changes and other stuff, which means that splitting into
> subvolumes will usually help with concurrency.  A lot of high concurrency
> performance benchmarks do significantly better if you split things into
> individual subvolumes (and this drives a couple of the other kernel
> developers crazy to no end).  It's not well published, but this is actually
> the recommended usage if you can afford the complexity and don't need
> snapshots.

I am not a developer but this idea drives me crazy as well. I know
it's a silly reasoning but if you blindly extrapolate this idea you
come to the conclusion that every single file should be transparently
placed in it's own unique subvolume (by some automatic background
task) and every directory should automatically be a subvolume. I guess
there must be some inconveniently sub-optimal behavior in the tree
handling which could theoretically be optimized (or the observed
performance improvement of the subvolume segregation is some kind of
measurement error which does not really translate into actual real
life overall performance benefit but only looks like that from some
specific perspective of the tests).

> As far as how much you're buffering for write-back, that should depend
> entirely on how fast your RAM is relative to your storage device.  The
> smaller the gap between your storage and your RAM in terms of speed, the
> more you should be buffering (up to a point).  FWIW, I find that with
> DDR3-1600 RAM and a good (~540MB/s sequential write) SATA3 SSD, about
> 160-320MB gets a near ideal balance of performance, throughput, and
> fragmentation, but of course YMMV.

I don't think I share your logic on this. I usually consider the write
load random and I don't like my software possibly stalling while
there is plenty of RAM lying around to be used as a buffer until some
other tasks might stop trashing the disks (e.g. "bigger is always
better").

> Out of curiosity, just on this part, have you tried using cgroups to keep
> the memory usage isolated better?

No, I didn't even know cgroups can control the pagecache based on the
process which generates the cache-able IO.
To be honest, I don't think it's worth the effort for me (I would need
to learn how to use cgroups, I have zero experience with that).

> Also, if you can get ffmpeg to spit out the stream on stdout, you could pipe
> to dd and have that use Direct-IO.  The dd command should be something along
> the lines of:
> dd of=<filename> oflag=direct iflag=fullblock bs=<arbitrary large multiple
> of node-size>
> The oflag will force dd to open the output file with O_DIRECT, the iflag
> will force it to collect full blocks of data before writing them (the size
> is set by bs=, I'd recommend using a power of 2 that's a multiple of your
> node-size, larger numbers will increase latency but reduce fragmentation and
> improve throughput).  This may still use a significant amount of RAM (the
> pipe is essentially an in-memory buffer), and may crowd out other
> applications, but I have no idea how much it may or may not help.

This I can try (when I have no better things to play with). Thank you.


* Re: [PATCH] recursive defrag cleanup
  2017-01-04 22:12                 ` Janos Toth F.
@ 2017-01-05 18:13                   ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 13+ messages in thread
From: Austin S. Hemmelgarn @ 2017-01-05 18:13 UTC (permalink / raw)
  To: Janos Toth F., Btrfs BTRFS

On 2017-01-04 17:12, Janos Toth F. wrote:
> I separated these 9 camera storages into 9 subvolumes (so now I have
> 10 subvols in total in this filesystem with the "root" subvol). It's
> obviously way too early to talk about long term performance but now I
> can tell that recursive defrag does NOT descend into "child"
> subvolumes (it does not pick up the files located in these "child"
> subvolumes when I point it to the "root" subvolume with the -r
> option). That's very inconvenient (one might need to write a script
> with a long static list of subvolumes and maintain it over time or
> write a script which acquires the list from the subvolume list command
> and feeds it to the defrag command one-by-one).
OK, that's good to know.  You might look at some way to parse the output 
of `btrfs subvol show` to simplify writing such a script.  Also, it's 
worth pointing out that there are other circumstances that will prevent 
defrag from operating on a file (I know it refuses to touch running 
executables, and I think that it may also avoid files opened with O_DIRECT).
>
>> Because each subvolume is functionally it's own tree, it has it's own
>> locking for changes and other stuff, which means that splitting into
>> subvolumes will usually help with concurrency.  A lot of high concurrency
>> performance benchmarks do significantly better if you split things into
>> individual subvolumes (and this drives a couple of the other kernel
>> developers crazy to no end).  It's not well published, but this is actually
>> the recommended usage if you can afford the complexity and don't need
>> snapshots.
>
> I am not a developer but this idea drives me crazy as well. I know
> it's a silly reasoning but if you blindly extrapolate this idea you
> come to the conclusion that every single file should be transparently
> placed in it's own unique subvolume (by some automatic background
> task) and every directory should automatically be a subvolume. I guess
> there must be some inconveniently sub-optimal behavior in the tree
> handling which could theoretically be optimized (or the observed
> performance improvement of the subvolume segregation is some kind of
> measurement error which does not really translate into actual real
> life overall performance benefit but only looks like that from some
> specific perspective of the tests).
While it's annoying, it's also rather predictable from simple analysis 
of the code.  Many metadata operations (and any append to a file 
requires a metadata operation) require eventually locking part of the 
tree, and that ends up being a point of contention.  In general, I 
wouldn't say that _every_ file and _every_ directory would need this, as 
it's not often an issue on a lot of workloads either because the 
contention doesn't happen (serialized data transfer, WORM access 
patterns, etc), or because it's not happening frequently enough that it 
has a significant impact (most general desktop usage).  That said, there 
are other benefits to using subvolumes that make them attractive for 
many of the cases where this type of thing helps (for example, I use 
dedicated subvolumes for any local VCS repositories I have, both because 
it isolates them from global contention on locks, and it lets me nuke 
them much quicker than rm -rf would).
>
>> As far as how much you're buffering for write-back, that should depend
>> entirely on how fast your RAM is relative to your storage device.  The
>> smaller the gap between your storage and your RAM in terms of speed, the
>> more you should be buffering (up to a point).  FWIW, I find that with
>> DDR3-1600 RAM and a good (~540MB/s sequential write) SATA3 SSD, about
>> 160-320MB gets a near ideal balance of performance, throughput, and
>> fragmentation, but of course YMMV.
>
> I don't think I share your logic on this. I usually consider the write
> load random and I don't like my software possibly stalling while
> there is plenty of RAM lying around to be used as a buffer until some
> other tasks might stop trashing the disks (e.g. "bigger is always
> better").
Like I said, things may be different for you, but I find in general that 
unless I'm 100% disk-bound, I actually have fewer latency issues when I 
buffer less (up to a point, anything less than about 64MB on my hardware 
makes latency worse).  Stalls happen more frequently, but each 
individual stall has much less impact on overall performance because the 
time is amortized across the whole operation.  Throughput suffers a bit, 
but once you get past a certain point, increasing the buffering will 
actually hurt throughput because of how long things stall for.  Less 
buffering also means you're less likely to trash the read side of the 
page-cache because your write cache will fluctuate in size less.
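
(As a rough illustration, assuming storage that sustains ~500MB/s:
letting 1GB of dirty data pile up can stall a writer for about 2 seconds
in one go, while 64MB bursts stall for roughly 0.13 seconds sixteen
times over the same amount of data.)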
>
>> Out of curiosity, just on this part, have you tried using cgroups to keep
>> the memory usage isolated better?
>
> No, I didn't even know cgroups can control the pagecache based on the
> process which generates the cache-able IO.
I'm pretty sure they can cap the write-back buffering usage, but the 
tunable is kernel memory usage, and some old kernels didn't work with it 
(I forget when it actually started working correctly).
> To be honest, I don't think it's worth the effort for me (I would need
> to learn how to use cgroups, I have zero experience with that).
FWIW, it's probably worth learning to use cgroups, they're a great tool 
for isolating tasks from each other, and the memory controller is really 
the only one that's not all that intuitive.
>
>> Also, if you can get ffmpeg to spit out the stream on stdout, you could pipe
>> to dd and have that use Direct-IO.  The dd command should be something along
>> the lines of:
>> dd of=<filename> oflag=direct iflag=fullblock bs=<arbitrary large multiple
>> of node-size>
>> The oflag will force dd to open the output file with O_DIRECT, the iflag
>> will force it to collect full blocks of data before writing them (the size
>> is set by bs=, I'd recommend using a power of 2 that's a multiple of your
>> node-size, larger numbers will increase latency but reduce fragmentation and
>> improve throughput).  This may still use a significant amount of RAM (the
>> pipe is essentially an in-memory buffer), and may crowd out other
>> applications, but I have no idea how much it may or may not help.
>
> This I can try (when I have no better things to play with). Thank you.
Glad I could help.


