Linux-BTRFS Archive on lore.kernel.org
* BTRFS Mount Delay Time Graph
@ 2018-12-03 18:20 Wilson, Ellis
  2018-12-03 19:56 ` Lionel Bouton
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Wilson, Ellis @ 2018-12-03 18:20 UTC (permalink / raw)
  To: BTRFS

[-- Attachment #1: Type: text/plain, Size: 2176 bytes --]

Hi all,

Many months ago I promised to graph how long it took to mount a BTRFS 
filesystem as it grows.  I finally had (made) time for this, and the 
attached is the result of my testing.  The image is a fairly 
self-explanatory graph, and the raw data is also attached in 
comma-delimited format for the more curious.  The columns are: 
Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).

Experimental setup:
- System:
Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
- 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
- 3 unmount/mount cycles performed in between adding another 250GB of data
- 250GB of data added each time in the form of 25x10GB files in their 
own directory.  Files generated in parallel each epoch (25 at the same 
time, with a 1MB record size).
- 240 repetitions of this performed (to collect timings in increments of 
250GB between a 0GB and 60TB filesystem)
- Normal "time" command used to measure time to mount.  "Real" time used 
of the timings reported from time.
- Mount:
/dev/md0 on /btrfs type btrfs 
(rw,relatime,space_cache=v2,subvolid=5,subvol=/)
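
For concreteness, the fill/remount loop above could be scripted roughly as
follows. This is only a sketch; the device path, mount point, and the
fill_250gb.sh helper are placeholders rather than the harness actually used.

#!/usr/bin/env python3
# Sketch of the fill/remount timing loop described above; /dev/md0,
# /btrfs and fill_250gb.sh are placeholders, not the actual harness.
import subprocess
import time

DEV, MNT = "/dev/md0", "/btrfs"

def timed_remount():
    # unmount, then measure wall-clock ("real") time of the mount itself
    subprocess.run(["umount", MNT], check=True)
    t0 = time.monotonic()
    subprocess.run(["mount", DEV, MNT], check=True)
    return time.monotonic() - t0

# assumes the filesystem starts out freshly created and mounted
for step in range(240):                       # 240 x 250GB = 60TB
    # placeholder for "25 x 10GB files written in parallel, 1MB records"
    subprocess.run(["./fill_250gb.sh", f"{MNT}/epoch-{step:03d}"], check=True)
    t = [timed_remount() for _ in range(3)]   # 3 unmount/mount cycles
    print(f"{(step + 1) * 250},{t[0]:.2f},{t[1]:.2f},{t[2]:.2f}")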

At 60TB, we take 30s to mount the filesystem, which is actually not as 
bad as I originally thought it would be (perhaps as a result of using 
RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
to comment if folks more intimately familiar with BTRFS think this is 
due to the very large files I've used.  I can redo the test with much 
more realistic data if people have legitimate reason to think it will 
drastically change the result.

With 14TB drives available today, it doesn't take more than a handful of 
drives to result in a filesystem that takes around a minute to mount. 
As a result of this, I suspect this will become an increasing problem 
for serious users of BTRFS as time goes on.  I'm not complaining as I'm 
not a contributor so I have no room to do so -- just shedding some light 
on a problem that may deserve attention as filesystem sizes continue to 
grow.

Best,

ellis

[-- Attachment #2: btrfs_mount_time_delay.jpg --]
[-- Type: image/jpeg, Size: 42838 bytes --]

[-- Attachment #3: mount_times.csv --]
[-- Type: text/csv, Size: 6288 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BTRFS Mount Delay Time Graph
  2018-12-03 18:20 BTRFS Mount Delay Time Graph Wilson, Ellis
@ 2018-12-03 19:56 ` Lionel Bouton
  2018-12-03 20:04   ` Lionel Bouton
  2018-12-03 22:22   ` Hans van Kranenburg
  2018-12-04  0:16 ` Qu Wenruo
  2018-12-04 13:07 ` Nikolay Borisov
  2 siblings, 2 replies; 12+ messages in thread
From: Lionel Bouton @ 2018-12-03 19:56 UTC (permalink / raw)
  To: Wilson, Ellis, BTRFS

Hi,

On 03/12/2018 at 19:20, Wilson, Ellis wrote:
> Hi all,
>
> Many months ago I promised to graph how long it took to mount a BTRFS 
> filesystem as it grows.  I finally had (made) time for this, and the 
> attached is the result of my testing.  The image is a fairly 
> self-explanatory graph, and the raw data is also attached in 
> comma-delimited format for the more curious.  The columns are: 
> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
>
> Experimental setup:
> - System:
> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
> - 3 unmount/mount cycles performed in between adding another 250GB of data
> - 250GB of data added each time in the form of 25x10GB files in their 
> own directory.  Files generated in parallel each epoch (25 at the same 
> time, with a 1MB record size).
> - 240 repetitions of this performed (to collect timings in increments of 
> 250GB between a 0GB and 60TB filesystem)
> - Normal "time" command used to measure time to mount.  "Real" time used 
> of the timings reported from time.
> - Mount:
> /dev/md0 on /btrfs type btrfs 
> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
>
> At 60TB, we take 30s to mount the filesystem, which is actually not as 
> bad as I originally thought it would be (perhaps as a result of using 
> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
> to comment if folks more intimately familiar with BTRFS think this is 
> due to the very large files I've used.  I can redo the test with much 
> more realistic data if people have legitimate reason to think it will 
> drastically change the result.

We are hosting some large BTRFS filesystems on Ceph (RBD used by
QEMU/KVM). I believe the delay is heavily linked to the number of files
(I didn't check whether snapshots matter; I suspect they do, but not as
much as the number of "original" files, at least if you don't heavily
modify existing files but mostly create new ones, as we do).
As an example, we have a filesystem with 20TB of used space and 4
subvolumes hosting multiple millions of files/directories (probably 10-20
million total; I haven't checked the exact number recently, as simply
counting the files is a very long process) and 40 snapshots for each
subvolume.
Mount takes about 15 minutes.
We have virtual machines that we don't reboot as often as we would like
because of these slow mount times.

If you want to study this, you could:
- graph the delay for various individual file sizes (instead of 25x10GB,
create 2,500 x 100MB and 250,000 x 1MB files between each run and
compare to the original result)
- graph the delay vs the number of snapshots (probably starting with a
large number of files in the initial subvolume, to start with a
non-trivial mount delay)
You may also want to study the impact of the differences between
snapshots by comparing snapshotting without modifications and snapshots
made at various stages of your subvolume's growth.
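
If it helps, a parallel file generator for the suggested runs could look
roughly like this. It is only a sketch; the script name, target directory,
and counts are example parameters, not a tested tool.

#!/usr/bin/env python3
# Sketch: create many equally sized files in parallel between runs, e.g.
# the "250,000 x 1MB" case suggested above.  Paths and sizes are only
# example parameters.  Writes zeros, which is fine as long as compression
# is not enabled on the filesystem.
import os
import sys
from concurrent.futures import ThreadPoolExecutor

CHUNK = b"\0" * (1024 * 1024)            # write in 1MB pieces

def make_file(job):
    path, size = job
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            n = min(len(CHUNK), remaining)
            f.write(CHUNK[:n])
            remaining -= n

def fill(target_dir, count, size_bytes, workers=25):
    os.makedirs(target_dir, exist_ok=True)
    jobs = [(os.path.join(target_dir, "f%07d" % i), size_bytes)
            for i in range(count)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in pool.map(make_file, jobs):
            pass

if __name__ == "__main__":
    # e.g.: make_files.py /btrfs/run-001 250000 1048576
    fill(sys.argv[1], int(sys.argv[2]), int(sys.argv[3]))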

Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
tuning of the io queue (switching between classic io-schedulers and
blk-mq ones in the virtual machines) and BTRFS mount options
(space_cache=v2,ssd_spread) but there wasn't any measurable improvement
in mount time (I managed to reduce the mount of IO requests by half on
one server in production though although more tests are needed to
isolate the cause).
I didn't expect much for the mount times; it seems to me that mount is
mostly constrained by the BTRFS on-disk structures needed at mount time
and by how the filesystem reads them (for example, it doesn't benefit at
all from large IO queue depths, which probably means that each read
depends on previous ones, preventing io-schedulers from optimizing
anything).

Best regards,

Lionel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BTRFS Mount Delay Time Graph
  2018-12-03 19:56 ` Lionel Bouton
@ 2018-12-03 20:04   ` Lionel Bouton
  2018-12-04  2:52     ` Chris Murphy
  2018-12-03 22:22   ` Hans van Kranenburg
  1 sibling, 1 reply; 12+ messages in thread
From: Lionel Bouton @ 2018-12-03 20:04 UTC (permalink / raw)
  To: Wilson, Ellis, BTRFS

On 03/12/2018 at 20:56, Lionel Bouton wrote:
> [...]
> Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
> tuning of the io queue (switching between classic io-schedulers and
> blk-mq ones in the virtual machines) and BTRFS mount options
> (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
> in mount time (I managed to reduce the mount of IO requests

Sent too quickly: I meant to write "managed to reduce by half the number
of IO write requests for the same amount of data written"

>  by half on
> one server in production though although more tests are needed to
> isolate the cause).



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BTRFS Mount Delay Time Graph
  2018-12-03 19:56 ` Lionel Bouton
  2018-12-03 20:04   ` Lionel Bouton
@ 2018-12-03 22:22   ` Hans van Kranenburg
  2018-12-04 16:45     ` [Mount time bug bounty?] was: " Lionel Bouton
  1 sibling, 1 reply; 12+ messages in thread
From: Hans van Kranenburg @ 2018-12-03 22:22 UTC (permalink / raw)
  To: Lionel Bouton, Wilson, Ellis, BTRFS

[-- Attachment #1: Type: text/plain, Size: 7058 bytes --]

Hi,

On 12/3/18 8:56 PM, Lionel Bouton wrote:
> 
> On 03/12/2018 at 19:20, Wilson, Ellis wrote:
>>
>> Many months ago I promised to graph how long it took to mount a BTRFS 
>> filesystem as it grows.  I finally had (made) time for this, and the 
>> attached is the result of my testing.  The image is a fairly 
>> self-explanatory graph, and the raw data is also attached in 
>> comma-delimited format for the more curious.  The columns are: 
>> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
>>
>> Experimental setup:
>> - System:
>> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
>> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
>> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
>> - 3 unmount/mount cycles performed in between adding another 250GB of data
>> - 250GB of data added each time in the form of 25x10GB files in their 
>> own directory.  Files generated in parallel each epoch (25 at the same 
>> time, with a 1MB record size).
>> - 240 repetitions of this performed (to collect timings in increments of 
>> 250GB between a 0GB and 60TB filesystem)
>> - Normal "time" command used to measure time to mount.  "Real" time used 
>> of the timings reported from time.
>> - Mount:
>> /dev/md0 on /btrfs type btrfs 
>> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
>>
>> At 60TB, we take 30s to mount the filesystem, which is actually not as 
>> bad as I originally thought it would be (perhaps as a result of using 
>> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
>> to comment if folks more intimately familiar with BTRFS think this is 
>> due to the very large files I've used.

Probably yes. The thing that is happening is that all block group items
are read from the extent tree. And, instead of being nicely grouped
together, they are scattered all over the place, at their virtual
address, in between all normal extent items.

So, mount time depends on the cold random read iops your storage can do,
the size of the extent tree, and the number of block groups. And your
extent tree has more items in it if you have more extents. So, yes,
writing a lot of 4KiB files should, I think, have a similar effect as a
lot of 128MiB files that are each still stored in 1 extent per file.
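
As a rough back-of-envelope illustration of that dependence (estimates
derived from the numbers in this thread, assuming ~1GiB single-device
data chunks; not measurements):

# Back-of-envelope check of the above (assumptions, not measurements):
# single-device data chunks are capped at 1GiB, so a mostly-data
# filesystem has roughly one block group item per GiB.
fs_bytes = 60 * 10**12                     # the ~60TB filesystem from the test
bg_bytes = 1 << 30                         # ~1GiB per data block group
block_groups = fs_bytes // bg_bytes        # ~56,000 block group items
per_item_ms = 30.0 / block_groups * 1000   # 30s observed mount at 60TB
print(block_groups, "block groups, ~%.2f ms per item" % per_item_ms)
# ~0.5ms per dependent lookup -- well under one cold disk seek, so many
# of these searches presumably land on metadata pages already read.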

>  I can redo the test with much 
>> more realistic data if people have legitimate reason to think it will 
>> drastically change the result.
> 
> We are hosting some large BTRFS filesystems on Ceph (RBD used by
> QEMU/KVM). I believe the delay is heavily linked to the number of files
> (I didn't check whether snapshots matter; I suspect they do, but not as
> much as the number of "original" files, at least if you don't heavily
> modify existing files but mostly create new ones, as we do).
> As an example, we have a filesystem with 20TB of used space and 4
> subvolumes hosting multiple millions of files/directories (probably 10-20
> million total; I haven't checked the exact number recently, as simply
> counting the files is a very long process) and 40 snapshots for each
> subvolume.
> Mount takes about 15 minutes.
> We have virtual machines that we don't reboot as often as we would like
> because of these slow mount times.
> 
> If you want to study this, you could:
> - graph the delay for various individual file sizes (instead of 25x10GB,
> create 2,500 x 100MB and 250,000 x 1MB files between each run and
> compare to the original result)
> - graph the delay vs the number of snapshots (probably starting with a
> large number of files in the initial subvolume, to start with a
> non-trivial mount delay)
> You may also want to study the impact of the differences between
> snapshots by comparing snapshotting without modifications and snapshots
> made at various stages of your subvolume's growth.
> 
> Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
> tuning of the io queue (switching between classic io-schedulers and
> blk-mq ones in the virtual machines) and BTRFS mount options
> (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
> in mount time (I managed to reduce the mount of IO requests by half on
> one server in production though although more tests are needed to
> isolate the cause).
> I didn't expect much for the mount times; it seems to me that mount is
> mostly constrained by the BTRFS on-disk structures needed at mount time
> and by how the filesystem reads them (for example, it doesn't benefit at
> all from large IO queue depths, which probably means that each read
> depends on previous ones, preventing io-schedulers from optimizing
> anything).

Yes, I think that's true. See btrfs_read_block_groups in extent-tree.c:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/extent-tree.c#n9982

What the code is doing here is starting at the beginning of the extent
tree, searching forward until it sees the first BLOCK_GROUP_ITEM (which
is not that far away), and then based on the information in it, computes
where the next one will be (just after the end of the vaddr+length of
it), and then jumps over all normal extent items and searches again near
where the next block group item has to be. So, yes, that means that they
depend on each other.

Two possible ways to improve this:

1. Instead, walk the chunk tree (which has all related items packed
together) to find out at which locations in the extent tree the
block group items are located, and then start getting those items in
parallel. If you have storage with a lot of rotating rust that can
deliver many more random reads if you ask for more of them at the same
time, then this can already cause a massive speedup.

2. Move the block group items somewhere else, where they can nicely be
grouped together, so that the number of metadata pages that have to be
looked up is minimal. Quoting from the link below, "slightly tricky
[...] but there are no fundamental obstacles".

https://www.spinics.net/lists/linux-btrfs/msg71766.html

I think the main obstacle here is finding a developer with enough
experience and time to do it. :)

For fun, you can also just read the block group metadata after dropping
caches each time, which should give similar relative timing results as
mounting the filesystem again. (Well, if disk IO wait is the main
slowdown of course.)

Attached are two example programs, using python-btrfs.

* bg_after_another.py does the same thing as the kernel code I just linked.
* bg_via_chunks.py looks them up based on chunk tree info.
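
For illustration, the bg_via_chunks approach is presumably something along
these lines; this is a sketch assuming the python-btrfs FileSystem.chunks()
and FileSystem.block_group() API, not the attached file itself.

#!/usr/bin/env python3
# Sketch of the bg_via_chunks idea: walk the (densely packed) chunk tree
# and look up the block group item for each chunk, instead of hopping
# through the extent tree to find them.
import sys
import btrfs

fs = btrfs.FileSystem(sys.argv[1])   # path to a mounted btrfs filesystem
for chunk in fs.chunks():
    # one targeted extent tree search per block group item
    fs.block_group(chunk.vaddr, chunk.length)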

The time it would take once option 2 above is implemented should be very
similar to just reading the chunk tree (remove the block group lookup
from bg_via_chunks and run that).

Now what's still missing is changing the bg_via_chunks one to start
kicking off the block group searches in parallel, and then you can
predict how long it would take if option 1 were implemented.
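
A sketch of such a parallel variant, under the same python-btrfs
assumptions, might look like this; whether the search-ioctl lookups
actually benefit from concurrency on a given kernel and storage stack is
exactly what such a test would measure.

#!/usr/bin/env python3
# Sketch: issue the block group item lookups concurrently, to estimate
# the potential gain of option 1 above.  Each worker slice uses its own
# FileSystem handle so no file descriptor is shared between threads.
import sys
from concurrent.futures import ThreadPoolExecutor

import btrfs

path = sys.argv[1]
chunks = [(c.vaddr, c.length) for c in btrfs.FileSystem(path).chunks()]

def lookup_slice(slice_):
    fs = btrfs.FileSystem(path)
    return [fs.block_group(vaddr, length) for vaddr, length in slice_]

nworkers = 16
slices = [chunks[i::nworkers] for i in range(nworkers)]
with ThreadPoolExecutor(max_workers=nworkers) as pool:
    found = sum(len(r) for r in pool.map(lookup_slice, slices))
print("looked up", found, "block group items")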

\:D/

-- 
Hans van Kranenburg

[-- Attachment #2: bg_after_another.py --]
[-- Type: text/x-python, Size: 798 bytes --]

[-- Attachment #3: bg_via_chunks.py --]
[-- Type: text/x-python, Size: 306 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BTRFS Mount Delay Time Graph
  2018-12-03 18:20 BTRFS Mount Delay Time Graph Wilson, Ellis
  2018-12-03 19:56 ` Lionel Bouton
@ 2018-12-04  0:16 ` Qu Wenruo
  2018-12-04 13:07 ` Nikolay Borisov
  2 siblings, 0 replies; 12+ messages in thread
From: Qu Wenruo @ 2018-12-04  0:16 UTC (permalink / raw)
  To: Wilson, Ellis, BTRFS

[-- Attachment #1.1: Type: text/plain, Size: 2649 bytes --]



On 2018/12/4 2:20 AM, Wilson, Ellis wrote:
> Hi all,
> 
> Many months ago I promised to graph how long it took to mount a BTRFS 
> filesystem as it grows.  I finally had (made) time for this, and the 
> attached is the result of my testing.  The image is a fairly 
> self-explanatory graph, and the raw data is also attached in 
> comma-delimited format for the more curious.  The columns are: 
> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
> 
> Experimental setup:
> - System:
> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
> - 3 unmount/mount cycles performed in between adding another 250GB of data
> - 250GB of data added each time in the form of 25x10GB files in their 
> own directory.  Files generated in parallel each epoch (25 at the same 
> time, with a 1MB record size).
> - 240 repetitions of this performed (to collect timings in increments of 
> 250GB between a 0GB and 60TB filesystem)
> - Normal "time" command used to measure time to mount.  "Real" time used 
> of the timings reported from time.
> - Mount:
> /dev/md0 on /btrfs type btrfs 
> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
> 
> At 60TB, we take 30s to mount the filesystem, which is actually not as 
> bad as I originally thought it would be (perhaps as a result of using 
> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
> to comment if folks more intimately familiar with BTRFS think this is 
> due to the very large files I've used.  I can redo the test with much 
> more realistic data if people have legitimate reason to think it will 
> drastically change the result.
> 
> With 14TB drives available today, it doesn't take more than a handful of 
> drives to result in a filesystem that takes around a minute to mount. 
> As a result of this, I suspect this will become an increasing problem 
> for serious users of BTRFS as time goes on.  I'm not complaining as I'm 
> not a contributor so I have no room to do so -- just shedding some light 
> on a problem that may deserve attention as filesystem sizes continue to 
> grow.

This problem is somewhat known.

If you dig further, it's btrfs_read_block_groups() which will try to
read *ALL* block group items.
And to no one's surprise, the larger the fs gets, the more block group
items need to be read from disk.

We would need to delay such reads somehow to improve this case.

Thanks,
Qu

> 
> Best,
> 
> ellis
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BTRFS Mount Delay Time Graph
  2018-12-03 20:04   ` Lionel Bouton
@ 2018-12-04  2:52     ` Chris Murphy
  2018-12-04 15:08       ` Lionel Bouton
  0 siblings, 1 reply; 12+ messages in thread
From: Chris Murphy @ 2018-12-04  2:52 UTC (permalink / raw)
  To: Lionel Bouton; +Cc: Ellis H. Wilson III, Btrfs BTRFS

On Mon, Dec 3, 2018 at 1:04 PM Lionel Bouton
<lionel-subscription@bouton.name> wrote:
>
> On 03/12/2018 at 20:56, Lionel Bouton wrote:
> > [...]
> > Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
> > tuning of the io queue (switching between classic io-schedulers and
> > blk-mq ones in the virtual machines) and BTRFS mount options
> > (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
> > in mount time (I managed to reduce the mount of IO requests
>
> Sent too quickly: I meant to write "managed to reduce by half the number
> of IO write requests for the same amount of data written"
>
> >  by half on
> > one server in production though although more tests are needed to
> > isolate the cause).

Interesting. I wonder if it's ssd_spread or space_cache=v2 that
reduces the writes by half, or by how much for each? That's a major
reduction in writes, and suggests further optimization might be
possible, to help mitigate the wandering trees impact.
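
One way to separate the two would be to run an identical workload under
each mount option and compare the block-layer write counters before and
after; a rough sketch, where the device name and the fixed_workload.sh
command are placeholders:

#!/usr/bin/env python3
# Sketch: compare write requests/sectors issued to a block device across
# a fixed workload, e.g. once with space_cache=v1 and once with v2.
# The device name and the workload command are placeholders.
import subprocess

def write_stats(dev="md0"):
    # /sys/block/<dev>/stat: write I/Os and write sectors are the
    # 5th and 7th whitespace-separated fields
    fields = open("/sys/block/%s/stat" % dev).read().split()
    return int(fields[4]), int(fields[6])

ios0, sectors0 = write_stats()
subprocess.run(["./fixed_workload.sh"], check=True)   # placeholder workload
ios1, sectors1 = write_stats()
print("write requests:", ios1 - ios0, "sectors written:", sectors1 - sectors0)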


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BTRFS Mount Delay Time Graph
  2018-12-03 18:20 BTRFS Mount Delay Time Graph Wilson, Ellis
  2018-12-03 19:56 ` Lionel Bouton
  2018-12-04  0:16 ` Qu Wenruo
@ 2018-12-04 13:07 ` Nikolay Borisov
  2018-12-04 13:31   ` Qu Wenruo
  2018-12-04 20:14   ` Wilson, Ellis
  2 siblings, 2 replies; 12+ messages in thread
From: Nikolay Borisov @ 2018-12-04 13:07 UTC (permalink / raw)
  To: Wilson, Ellis, BTRFS



On 3.12.18, 20:20, Wilson, Ellis wrote:
> Hi all,
> 
> Many months ago I promised to graph how long it took to mount a BTRFS 
> filesystem as it grows.  I finally had (made) time for this, and the 
> attached is the result of my testing.  The image is a fairly 
> self-explanatory graph, and the raw data is also attached in 
> comma-delimited format for the more curious.  The columns are: 
> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
> 
> Experimental setup:
> - System:
> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
> - 3 unmount/mount cycles performed in between adding another 250GB of data
> - 250GB of data added each time in the form of 25x10GB files in their 
> own directory.  Files generated in parallel each epoch (25 at the same 
> time, with a 1MB record size).
> - 240 repetitions of this performed (to collect timings in increments of 
> 250GB between a 0GB and 60TB filesystem)
> - Normal "time" command used to measure time to mount.  "Real" time used 
> of the timings reported from time.
> - Mount:
> /dev/md0 on /btrfs type btrfs 
> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
> 
> At 60TB, we take 30s to mount the filesystem, which is actually not as 
> bad as I originally thought it would be (perhaps as a result of using 
> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
> to comment if folks more intimately familiar with BTRFS think this is 
> due to the very large files I've used.  I can redo the test with much 
> more realistic data if people have legitimate reason to think it will 
> drastically change the result.
> 
> With 14TB drives available today, it doesn't take more than a handful of 
> drives to result in a filesystem that takes around a minute to mount. 
> As a result of this, I suspect this will become an increasing problem 
> for serious users of BTRFS as time goes on.  I'm not complaining as I'm 
> not a contributor so I have no room to do so -- just shedding some light 
> on a problem that may deserve attention as filesystem sizes continue to 
> grow.

Would it be possible to provide perf traces of the longer-running mount
time? Everyone seems to be fixated on reading block groups (which is
likely to be the culprit) but before pointing fingers I'd like concrete
evidence pointing at the offender.

> 
> Best,
> 
> ellis
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BTRFS Mount Delay Time Graph
  2018-12-04 13:07 ` Nikolay Borisov
@ 2018-12-04 13:31   ` Qu Wenruo
  2018-12-04 20:14   ` Wilson, Ellis
  1 sibling, 0 replies; 12+ messages in thread
From: Qu Wenruo @ 2018-12-04 13:31 UTC (permalink / raw)
  To: Nikolay Borisov, Wilson, Ellis, BTRFS



On 2018/12/4 9:07 PM, Nikolay Borisov wrote:
> 
> 
> On 3.12.18, 20:20, Wilson, Ellis wrote:
>> Hi all,
>>
>> Many months ago I promised to graph how long it took to mount a BTRFS 
>> filesystem as it grows.  I finally had (made) time for this, and the 
>> attached is the result of my testing.  The image is a fairly 
>> self-explanatory graph, and the raw data is also attached in 
>> comma-delimited format for the more curious.  The columns are: 
>> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
>>
>> Experimental setup:
>> - System:
>> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
>> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
>> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
>> - 3 unmount/mount cycles performed in between adding another 250GB of data
>> - 250GB of data added each time in the form of 25x10GB files in their 
>> own directory.  Files generated in parallel each epoch (25 at the same 
>> time, with a 1MB record size).
>> - 240 repetitions of this performed (to collect timings in increments of 
>> 250GB between a 0GB and 60TB filesystem)
>> - Normal "time" command used to measure time to mount.  "Real" time used 
>> of the timings reported from time.
>> - Mount:
>> /dev/md0 on /btrfs type btrfs 
>> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
>>
>> At 60TB, we take 30s to mount the filesystem, which is actually not as 
>> bad as I originally thought it would be (perhaps as a result of using 
>> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
>> to comment if folks more intimately familiar with BTRFS think this is 
>> due to the very large files I've used.  I can redo the test with much 
>> more realistic data if people have legitimate reason to think it will 
>> drastically change the result.
>>
>> With 14TB drives available today, it doesn't take more than a handful of 
>> drives to result in a filesystem that takes around a minute to mount. 
>> As a result of this, I suspect this will become an increasing problem 
>> for serious users of BTRFS as time goes on.  I'm not complaining as I'm 
>> not a contributor so I have no room to do so -- just shedding some light 
>> on a problem that may deserve attention as filesystem sizes continue to 
>> grow.
> 
> Would it be possible to provide perf traces of the longer-running mount
> time? Everyone seems to be fixated on reading block groups (which is
> likely to be the culprit) but before pointing fingers I'd like concrete
> evidence pointing at the offender.

IIRC I submitted such an analysis years ago.

Nowadays the picture may have changed due to the chunk <-> bg <->
dev_extents cross-checking.
So yes, it would be a good idea to show such a percentage breakdown.

Thanks,
Qu

> 
>>
>> Best,
>>
>> ellis
>>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BTRFS Mount Delay Time Graph
  2018-12-04  2:52     ` Chris Murphy
@ 2018-12-04 15:08       ` Lionel Bouton
  0 siblings, 0 replies; 12+ messages in thread
From: Lionel Bouton @ 2018-12-04 15:08 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Ellis H. Wilson III, Btrfs BTRFS

On 04/12/2018 at 03:52, Chris Murphy wrote:
> On Mon, Dec 3, 2018 at 1:04 PM Lionel Bouton
> <lionel-subscription@bouton.name> wrote:
>> On 03/12/2018 at 20:56, Lionel Bouton wrote:
>>> [...]
>>> Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
>>> tuning of the io queue (switching between classic io-schedulers and
>>> blk-mq ones in the virtual machines) and BTRFS mount options
>>> (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
>>> in mount time (I managed to reduce the mount of IO requests
>> Sent too quickly: I meant to write "managed to reduce by half the number
>> of IO write requests for the same amount of data written"
>>
>>>  by half on
>>> one server in production though although more tests are needed to
>>> isolate the cause).
> Interesting. I wonder if it's ssd_spread or space_cache=v2 that
> reduces the writes by half, or by how much for each? That's a major
> reduction in writes, and suggests further optimization might be
> possible, to help mitigate the wandering trees impact.

Note, the other major changes were:
- upgrading from 4.9 to 4.14,
- using multi-queue aware bfq instead of noop.

If BTRFS IO patterns in our case allow bfq to merge IO requests, this
could be another explanation.

Lionel


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Mount time bug bounty?] was: BTRFS Mount Delay Time Graph
  2018-12-03 22:22   ` Hans van Kranenburg
@ 2018-12-04 16:45     ` " Lionel Bouton
  0 siblings, 0 replies; 12+ messages in thread
From: Lionel Bouton @ 2018-12-04 16:45 UTC (permalink / raw)
  To: Hans van Kranenburg, Wilson, Ellis, BTRFS

On 03/12/2018 at 23:22, Hans van Kranenburg wrote:
> [...]
> Yes, I think that's true. See btrfs_read_block_groups in extent-tree.c:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/extent-tree.c#n9982
>
> What the code is doing here is starting at the beginning of the extent
> tree, searching forward until it sees the first BLOCK_GROUP_ITEM (which
> is not that far away), and then based on the information in it, computes
> where the next one will be (just after the end of the vaddr+length of
> it), and then jumps over all normal extent items and searches again near
> where the next block group item has to be. So, yes, that means that they
> depend on each other.
>
> Two possible ways to improve this:
>
> 1. Instead, walk the chunk tree (which has all related items packed
> together) to find out at which locations in the extent tree the
> block group items are located, and then start getting those items in
> parallel. If you have storage with a lot of rotating rust that can
> deliver many more random reads if you ask for more of them at the same
> time, then this can already cause a massive speedup.
>
> 2. Move the block group items somewhere else, where they can nicely be
> grouped together, so that the number of metadata pages that have to be
> looked up is minimal. Quoting from the link below, "slightly tricky
> [...] but there are no fundamental obstacles".
>
> https://www.spinics.net/lists/linux-btrfs/msg71766.html
>
> I think the main obstacle here is finding a developer with enough
> experience and time to do it. :)

I would definitely be interested in sponsoring at least a part of the
needed time through my company (we are too small to hire kernel
developers full-time, but we can make a one-time contribution for
something as valuable to us as shorter mount delays).

If needed, it could be split into two steps with separate bounties:
- providing a patch for the latest LTS kernel with a substantial
decrease in mount time in our case (ideally less than a minute instead
of 15 minutes, but <5 minutes is already worth it).
- having it integrated in mainline.

I don't have any experience with company sponsorship/bounties but I'm
willing to learn (don't hesitate to make suggestions). I'll have to
discuss it with our accountant to make sure we do it correctly.

Is this the right place to discuss this kind of subject, or should I take
the discussion elsewhere?

Best regards,

Lionel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BTRFS Mount Delay Time Graph
  2018-12-04 13:07 ` Nikolay Borisov
  2018-12-04 13:31   ` Qu Wenruo
@ 2018-12-04 20:14   ` Wilson, Ellis
  2018-12-05  6:55     ` Nikolay Borisov
  1 sibling, 1 reply; 12+ messages in thread
From: Wilson, Ellis @ 2018-12-04 20:14 UTC (permalink / raw)
  To: Nikolay Borisov, BTRFS

On 12/4/18 8:07 AM, Nikolay Borisov wrote:
> On 3.12.18, 20:20, Wilson, Ellis wrote:
>> With 14TB drives available today, it doesn't take more than a handful of
>> drives to result in a filesystem that takes around a minute to mount.
>> As a result of this, I suspect this will become an increasing problem
>> for serious users of BTRFS as time goes on.  I'm not complaining as I'm
>> not a contributor so I have no room to do so -- just shedding some light
>> on a problem that may deserve attention as filesystem sizes continue to
>> grow.
> Would it be possible to provide perf traces of the longer-running mount
> time? Everyone seems to be fixated on reading block groups (which is
> likely to be the culprit) but before pointing fingers I'd like concrete
> evidence pointing at the offender.

I am glad to collect such traces -- please advise with commands that 
would achieve that.  If you just mean block traces, I can do that, but I 
suspect you mean something more BTRFS-specific.

Best,

ellis


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BTRFS Mount Delay Time Graph
  2018-12-04 20:14   ` Wilson, Ellis
@ 2018-12-05  6:55     ` Nikolay Borisov
  0 siblings, 0 replies; 12+ messages in thread
From: Nikolay Borisov @ 2018-12-05  6:55 UTC (permalink / raw)
  To: Wilson, Ellis, BTRFS



On 4.12.18, 22:14, Wilson, Ellis wrote:
> On 12/4/18 8:07 AM, Nikolay Borisov wrote:
>> On 3.12.18, 20:20, Wilson, Ellis wrote:
>>> With 14TB drives available today, it doesn't take more than a handful of
>>> drives to result in a filesystem that takes around a minute to mount.
>>> As a result of this, I suspect this will become an increasing problem
>>> for serious users of BTRFS as time goes on.  I'm not complaining as I'm
>>> not a contributor so I have no room to do so -- just shedding some light
>>> on a problem that may deserve attention as filesystem sizes continue to
>>> grow.
>> Would it be possible to provide perf traces of the longer-running mount
>> time? Everyone seems to be fixated on reading block groups (which is
>> likely to be the culprit) but before pointing fingers I'd like concrete
>> evidence pointing at the offender.
> 
> I am glad to collect such traces -- please advise with commands that 
> would achieve that.  If you just mean block traces, I can do that, but I 
> suspect you mean something more BTRFS-specific.

A command that would be good is:

perf record --all-kernel -g mount /dev/vdc /media/scratch/

Of course, replace the device/mount path appropriately. This will result
in a perf.data file which contains stacktraces of the hottest paths
executed during the invocation of mount. If you could send this file to
the mailing list, or upload it somewhere for interested people (me and
perhaps Qu) to inspect, that would be appreciated.

If the file turns out to be way too big, you can use

perf report --stdio

to create a text output, and send that instead.

> 
> Best,
> 
> ellis
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, back to index

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-03 18:20 BTRFS Mount Delay Time Graph Wilson, Ellis
2018-12-03 19:56 ` Lionel Bouton
2018-12-03 20:04   ` Lionel Bouton
2018-12-04  2:52     ` Chris Murphy
2018-12-04 15:08       ` Lionel Bouton
2018-12-03 22:22   ` Hans van Kranenburg
2018-12-04 16:45     ` [Mount time bug bounty?] was: " Lionel Bouton
2018-12-04  0:16 ` Qu Wenruo
2018-12-04 13:07 ` Nikolay Borisov
2018-12-04 13:31   ` Qu Wenruo
2018-12-04 20:14   ` Wilson, Ellis
2018-12-05  6:55     ` Nikolay Borisov
