linux-btrfs.vger.kernel.org archive mirror
* BTRFS Mount Delay Time Graph
@ 2018-12-03 18:20 Wilson, Ellis
  2018-12-03 19:56 ` Lionel Bouton
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Wilson, Ellis @ 2018-12-03 18:20 UTC (permalink / raw)
  To: BTRFS

[-- Attachment #1: Type: text/plain, Size: 2176 bytes --]

Hi all,

Many months ago I promised to graph how long it took to mount a BTRFS 
filesystem as it grows.  I finally had (made) time for this, and the 
attached is the result of my testing.  The image is a fairly 
self-explanatory graph, and the raw data is also attached in 
comma-delimited format for the more curious.  The columns are: 
Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).

Experimental setup:
- System:
Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
- 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
- 3 unmount/mount cycles performed in between adding another 250GB of data
- 250GB of data added each time in the form of 25x10GB files in their 
own directory.  Files generated in parallel each epoch (25 at the same 
time, with a 1MB record size).
- 240 repetitions of this performed (to collect timings in increments of 
250GB between a 0GB and 60TB filesystem)
- Normal "time" command used to measure time to mount.  "Real" time used 
of the timings reported from time.
- Mount:
/dev/md0 on /btrfs type btrfs 
(rw,relatime,space_cache=v2,subvolid=5,subvol=/)
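
For reference, a rough sketch of the kind of measurement loop described
above -- not the exact script used; the device, mountpoint, file layout and
serial file writing are simplifying assumptions (the real runs wrote the 25
files in parallel):

#!/usr/bin/python3
# Rough sketch of the measurement loop described above (illustrative only).
import os
import subprocess
import time

DEV = '/dev/md0'
MNT = '/btrfs'
RECORD = b'\0' * (1 << 20)                 # 1MB record size

for epoch in range(240):                   # 240 x 250GB = 60TB
    dirname = os.path.join(MNT, 'epoch-{:03d}'.format(epoch))
    os.mkdir(dirname)
    for i in range(25):                    # 25 x 10GB files per epoch
        with open(os.path.join(dirname, 'file-{:02d}'.format(i)), 'wb') as f:
            for _ in range(10 * 1024):     # 10240 x 1MB = 10GB
                f.write(RECORD)
    mount_times = []
    for _ in range(3):                     # 3 unmount/mount cycles
        subprocess.run(['umount', MNT], check=True)
        start = time.monotonic()
        subprocess.run(['mount', DEV, MNT], check=True)
        mount_times.append(time.monotonic() - start)
    print('{},{:.3f},{:.3f},{:.3f}'.format((epoch + 1) * 250, *mount_times))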

At 60TB, we take 30s to mount the filesystem, which is actually not as 
bad as I originally thought it would be (perhaps as a result of using 
RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
to comment if folks more intimately familiar with BTRFS think this is 
due to the very large files I've used.  I can redo the test with much 
more realistic data if people have legitimate reason to think it will 
drastically change the result.

With 14TB drives available today, it doesn't take more than a handful of 
drives to result in a filesystem that takes around a minute to mount. 
As a result, I suspect this will become an increasing problem 
for serious users of BTRFS as time goes on.  I'm not complaining, as I'm 
not a contributor so I have no room to do so -- just shedding some light 
on a problem that may deserve attention as filesystem sizes continue to 
grow.

Best,

ellis

[-- Attachment #2: btrfs_mount_time_delay.jpg --]
[-- Type: image/jpeg, Size: 42838 bytes --]

[-- Attachment #3: mount_times.csv --]
[-- Type: text/csv; name="mount_times.csv", Size: 6288 bytes --]

0,0.018,0.037,0.016
250,0.245,0.098,0.066
500,0.417,0.119,0.138
750,0.284,0.073,0.066
1000,0.506,0.109,0.126
1250,0.824,0.134,0.204
1500,0.779,0.098,0.147
1750,0.805,0.107,0.215
2000,0.87,0.137,0.223
2250,1.009,0.168,0.226
2500,1.094,0.147,0.174
2750,0.908,0.137,0.246
3000,1.144,0.182,0.313
3250,1.232,0.209,0.312
3500,1.287,0.259,0.292
3750,1.29,0.166,0.298
4000,1.521,0.249,0.418
4250,1.448,0.341,0.395
4500,1.441,0.383,0.362
4750,1.555,0.35,0.371
5000,1.825,0.482,0.638
5250,1.731,0.69,0.928
5500,1.8,0.353,0.348
5750,1.979,0.295,1.194
6000,2.115,0.915,1.241
6250,2.238,0.614,1.735
6500,2.025,0.523,0.536
6750,2.15,0.458,0.727
7000,2.415,2.158,1.925
7250,2.589,1.059,2.24
7500,2.371,1.796,2.102
7750,2.737,1.579,1.659
8000,2.768,1.786,2.579
8250,2.979,2.544,2.654
8500,2.994,2.529,2.847
8750,3.042,2.283,2.947
9000,3.209,2.509,3.077
9250,3.124,2.7,3.096
9500,3.13,3.048,3.105
9750,3.444,2.702,3.33
10000,3.671,3.354,3.297
10250,3.639,3.468,3.681
10500,3.693,3.651,3.711
10750,3.729,3.135,3.303
11000,3.846,3.862,3.917
11250,4.006,3.668,3.861
11500,4.113,3.919,3.875
11750,3.968,3.774,3.985
12000,4.205,3.882,4.218
12250,4.454,4.354,4.444
12500,4.528,4.441,4.616
12750,4.688,4.206,4.252
13000,4.551,4.507,4.444
13250,4.806,5.059,4.81
13500,5.041,4.662,4.997
13750,5.057,4.394,4.713
14000,5.029,5.03,4.927
14250,5.173,5.259,5.101
14500,5.104,5.3,5.416
14750,4.809,4.62,4.698
15000,5.045,5.066,4.806
15250,5.101,5.159,5.174
15500,5.074,5.245,5.65
15750,5.123,5.031,5.056
16000,5.518,5.097,5.595
16250,5.318,5.463,5.353
16500,5.63,5.689,5.768
16750,5.375,5.24,5.165
17000,5.578,5.846,5.628
17250,5.73,5.774,5.726
17500,6.108,6.202,6.226
17750,5.645,5.668,5.936
18000,6.308,5.925,6.317
18250,6.19,6.171,6.169
18500,6.442,6.601,6.403
18750,6.558,6.44,6.803
19000,6.664,7.176,6.742
19250,7.37,7.414,6.807
19500,7.021,7.143,7.253
19750,7.051,6.691,7.063
20000,6.942,6.858,7.225
20250,7.617,7.39,7.202
20500,7.239,7.525,7.381
20750,7.638,7.332,7.549
21000,7.697,8.081,7.807
21250,7.867,7.929,7.826
21500,7.98,8.208,8.059
21750,7.79,7.614,7.726
22000,8.144,8.611,8.361
22250,8.19,8.558,8.459
22500,8.685,8.785,8.617
22750,8.702,8.454,8.727
23000,8.653,8.699,8.89
23250,8.897,9.328,9.101
23500,9.245,9.456,9.464
23750,9.242,9.072,9.363
24000,9.367,8.934,9.541
24250,9.2,9.754,9.708
24500,9.622,9.472,9.484
24750,9.756,9.672,10.091
25000,10.207,10.304,9.981
25250,10.135,10.166,9.991
25500,9.969,10.234,10.266
25750,10.098,10.515,10.98
26000,10.811,10.6,11.3
26250,11.211,10.761,10.825
26500,10.799,11.075,10.973
26750,10.72,11.12,11.39
27000,11.463,11.106,11.679
27250,11.644,11.363,11.316
27500,11.541,11.748,11.657
27750,11.292,11.794,11.616
28000,11.888,11.697,12.169
28250,12.298,12.183,12.002
28500,12.124,12.48,12.352
28750,11.347,11.815,12.201
29000,12.009,11.72,12.734
29250,11.918,12.02,12.583
29500,12.445,12.439,12.466
29750,12.071,11.863,12.078
30000,12.287,12.188,13.199
30250,12.63,12.429,13.088
30500,12.705,13.422,13.208
30750,12.713,13.168,13.089
31000,13.284,13.018,13.836
31250,13.086,12.977,13.741
31500,13.346,13.484,13.774
31750,13.069,13.436,13.48
32000,13.316,13.054,13.677
32250,13.555,13.813,13.918
32500,13.803,14.493,14.038
32750,13.853,13.861,14.46
33000,13.823,13.95,14.243
33250,14.702,15.369,14.527
33500,14.265,15.188,14.842
33750,14.527,14.138,14.502
34000,14.632,14.436,14.957
34250,14.595,14.354,15.724
34500,15.179,15.833,15.449
34750,15.119,15.564,15.589
35000,14.609,14.206,15.503
35250,14.829,14.811,16.051
35500,15.315,15.762,15.845
35750,15.482,16.136,15.77
36000,15.462,15.345,16.531
36250,15.766,16.858,16.009
36500,15.71,16.809,16.037
36750,15.976,16.57,16.203
37000,16.108,15.725,15.944
37250,16.405,16.228,17.203
37500,16.487,17.017,16.918
37750,16.571,17.48,20.071
38000,16.797,16.333,16.508
38250,16.795,17.292,17.217
38500,17.112,17.675,17.492
38750,17.218,17.56,17.346
39000,17.081,16.774,17.618
39250,17.783,17.931,20.41
39500,18.295,18.839,18.028
39750,17.986,18.649,18.257
40000,38.812,18.504,19.277
40250,18.577,19.63,18.959
40500,18.515,18.684,18.455
40750,18.98,19.499,18.979
41000,17.88,17.878,18.381
41250,18.603,19.768,19.112
41500,19.049,19.398,18.833
41750,18.955,19.42,19.663
42000,18.908,18.831,19.493
42250,19.533,22.931,20.046
42500,19.91,19.584,20.093
42750,19.729,20.241,20
43000,19.371,19.158,19.976
43250,19.784,20.238,20.352
43500,19.719,19.844,19.833
43750,19.93,20.861,20.229
44000,19.376,19.14,19.743
44250,20.335,21.226,20.736
44500,20.161,20.69,20.056
44750,20.551,21.042,20.682
45000,20.386,20.098,20.53
45250,20.871,21.7,21.466
45500,21.422,21.066,20.773
45750,21.177,21.18,21.177
46000,20.595,20.413,21.323
46250,21.696,22.117,21.523
46500,21.683,21.854,21.728
46750,21.924,22.163,21.99
47000,21.353,21.63,24.405
47250,22.15,21.849,22.274
47500,22.472,22.431,22.447
47750,22.118,23.374,22.495
48000,21.333,21.431,22.706
48250,22.415,21.989,24.122
48500,23.082,22.747,23.672
48750,23.226,24.14,23.605
49000,22.708,24.12,24.001
49250,22.946,23.023,23.522
49500,23.989,24.303,23.88
49750,23.499,24.185,23.821
50000,23.213,23.933,24.674
50250,23.922,23.836,24.489
50500,24.867,24.441,24.651
50750,24.654,24.781,24.614
51000,23.348,25.946,24.985
51250,25.184,24.438,26.443
51500,24.46,25.768,25.204
51750,25.314,26.491,25.526
52000,24.343,25.309,25.408
52250,25.285,24.875,26.173
52500,25.213,26.85,25.581
52750,25.337,26.823,26.116
53000,25.297,26.924,25.983
53250,25.998,25.646,27.488
53500,25.854,27.46,26.466
53750,26.499,27.225,27.087
54000,25.907,27.074,26.986
54250,26.586,26.204,27.17
54500,26.338,27.799,26.815
54750,27.095,27.326,26.908
55000,26.428,27.867,27.6
55250,26.634,26.403,27.963
55500,26.911,27.799,27.99
55750,27.324,28.459,27.825
56000,27.244,29.491,28.44
56250,27.248,27.275,28.161
56500,27.961,28.517,29.099
56750,28.139,28.268,29.295
57000,28.252,28.22,28.395
57250,28.361,28.141,30.187
57500,28.423,31.114,29.703
57750,29.229,29.945,29.792
58000,29.376,29.362,29.759
58250,28.764,28.672,29.961
58500,29.288,30.759,30.688
58750,28.908,29.38,30.402
59000,29.659,29.648,29.914
59250,29.908,28.835,30.583
59500,30.452,31.719,31.664
59750,30.197,30.971,31.771


* Re: BTRFS Mount Delay Time Graph
  2018-12-03 18:20 BTRFS Mount Delay Time Graph Wilson, Ellis
@ 2018-12-03 19:56 ` Lionel Bouton
  2018-12-03 20:04   ` Lionel Bouton
  2018-12-03 22:22   ` Hans van Kranenburg
  2018-12-04  0:16 ` Qu Wenruo
  2018-12-04 13:07 ` Nikolay Borisov
  2 siblings, 2 replies; 14+ messages in thread
From: Lionel Bouton @ 2018-12-03 19:56 UTC (permalink / raw)
  To: Wilson, Ellis, BTRFS

Hi,

On 03/12/2018 at 19:20, Wilson, Ellis wrote:
> Hi all,
>
> Many months ago I promised to graph how long it took to mount a BTRFS 
> filesystem as it grows.  I finally had (made) time for this, and the 
> attached is the result of my testing.  The image is a fairly 
> self-explanatory graph, and the raw data is also attached in 
> comma-delimited format for the more curious.  The columns are: 
> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
>
> Experimental setup:
> - System:
> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
> - 3 unmount/mount cycles performed in between adding another 250GB of data
> - 250GB of data added each time in the form of 25x10GB files in their 
> own directory.  Files generated in parallel each epoch (25 at the same 
> time, with a 1MB record size).
> - 240 repetitions of this performed (to collect timings in increments of 
> 250GB between a 0GB and 60TB filesystem)
> - Normal "time" command used to measure time to mount.  "Real" time used 
> of the timings reported from time.
> - Mount:
> /dev/md0 on /btrfs type btrfs 
> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
>
> At 60TB, we take 30s to mount the filesystem, which is actually not as 
> bad as I originally thought it would be (perhaps as a result of using 
> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
> to comment if folks more intimately familiar with BTRFS think this is 
> due to the very large files I've used.  I can redo the test with much 
> more realistic data if people have legitimate reason to think it will 
> drastically change the result.

We are hosting some large BTRFS filesystems on Ceph (RBD used by
QEMU/KVM). I believe the delay is heavily linked to the number of files
(I didn't check whether snapshots matter; I suspect they do, but not as
much as the number of "original" files, at least if you mostly create
new files rather than heavily modifying existing ones, as we do).
As an example, we have a filesystem with 20TB of used space and 4
subvolumes hosting multiple millions of files/directories (probably 10-20
million total; I didn't check the exact number recently, as simply
counting the files is a very long process) and 40 snapshots for each volume.
Mount takes about 15 minutes.
We have virtual machines that we don't reboot as often as we would like
because of these slow mount times.

If you want to study this, you could:
- graph the delay for various individual file sizes (instead of 25x10GB,
create 2,500 x 100MB and 250,000 x 1MB files between each run and
compare to the original result; a sketch of the small-file variant
follows below)
- graph the delay vs. the number of snapshots (probably starting with a
large number of files in the initial subvolume, so that you begin with a
non-trivial mount delay)
You may want to study the impact of the differences between snapshots by
comparing snapshots taken without modifications against snapshots made at
various stages of your subvolume growth.
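
As a rough, untested illustration of the small-file variant (the target
directory and the fan-out of 1,000 files per subdirectory are arbitrary
choices, not taken from the original setup):

#!/usr/bin/python3
# Sketch: write 250,000 x 1MB files (~250GB) per step instead of 25 x 10GB.
import os
import sys

target_dir = sys.argv[1]                   # e.g. /btrfs/epoch-000
ONE_MB = b'\0' * (1 << 20)

for i in range(250000):
    # Fan out over subdirectories so no single directory grows huge.
    subdir = os.path.join(target_dir, 'd{:03d}'.format(i // 1000))
    os.makedirs(subdir, exist_ok=True)
    with open(os.path.join(subdir, 'f{:06d}'.format(i)), 'wb') as f:
        f.write(ONE_MB)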

Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
tuning of the io queue (switching between classic io-schedulers and
blk-mq ones in the virtual machines) and BTRFS mount options
(space_cache=v2,ssd_spread) but there wasn't any measurable improvement
in mount time (I managed to reduce the mount of IO requests by half on
one server in production though although more tests are needed to
isolate the cause).
I didn't expect much for the mount times; it seems to me that mount is
mostly constrained by the BTRFS on-disk structures needed at mount time
and by how the filesystem reads them (for example, it doesn't benefit at
all from large IO queue depths, which probably means that each read
depends on previous ones, which prevents io-schedulers from optimizing
anything).

Best regards,

Lionel


* Re: BTRFS Mount Delay Time Graph
  2018-12-03 19:56 ` Lionel Bouton
@ 2018-12-03 20:04   ` Lionel Bouton
  2018-12-04  2:52     ` Chris Murphy
  2018-12-03 22:22   ` Hans van Kranenburg
  1 sibling, 1 reply; 14+ messages in thread
From: Lionel Bouton @ 2018-12-03 20:04 UTC (permalink / raw)
  To: Wilson, Ellis, BTRFS

On 03/12/2018 at 20:56, Lionel Bouton wrote:
> [...]
> Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
> tuning of the io queue (switching between classic io-schedulers and
> blk-mq ones in the virtual machines) and BTRFS mount options
> (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
> in mount time (I managed to reduce the mount of IO requests

Sent too quickly: I meant to write "managed to reduce by half the number
of IO write requests for the same amount of data written"

>  by half on
> one server in production though although more tests are needed to
> isolate the cause).




* Re: BTRFS Mount Delay Time Graph
  2018-12-03 19:56 ` Lionel Bouton
  2018-12-03 20:04   ` Lionel Bouton
@ 2018-12-03 22:22   ` Hans van Kranenburg
  2018-12-04 16:45     ` [Mount time bug bounty?] was: " Lionel Bouton
  1 sibling, 1 reply; 14+ messages in thread
From: Hans van Kranenburg @ 2018-12-03 22:22 UTC (permalink / raw)
  To: Lionel Bouton, Wilson, Ellis, BTRFS

[-- Attachment #1: Type: text/plain, Size: 7058 bytes --]

Hi,

On 12/3/18 8:56 PM, Lionel Bouton wrote:
> 
> On 03/12/2018 at 19:20, Wilson, Ellis wrote:
>>
>> Many months ago I promised to graph how long it took to mount a BTRFS 
>> filesystem as it grows.  I finally had (made) time for this, and the 
>> attached is the result of my testing.  The image is a fairly 
>> self-explanatory graph, and the raw data is also attached in 
>> comma-delimited format for the more curious.  The columns are: 
>> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
>>
>> Experimental setup:
>> - System:
>> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
>> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
>> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
>> - 3 unmount/mount cycles performed in between adding another 250GB of data
>> - 250GB of data added each time in the form of 25x10GB files in their 
>> own directory.  Files generated in parallel each epoch (25 at the same 
>> time, with a 1MB record size).
>> - 240 repetitions of this performed (to collect timings in increments of 
>> 250GB between a 0GB and 60TB filesystem)
>> - Normal "time" command used to measure time to mount.  "Real" time used 
>> of the timings reported from time.
>> - Mount:
>> /dev/md0 on /btrfs type btrfs 
>> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
>>
>> At 60TB, we take 30s to mount the filesystem, which is actually not as 
>> bad as I originally thought it would be (perhaps as a result of using 
>> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
>> to comment if folks more intimately familiar with BTRFS think this is 
>> due to the very large files I've used.

Probably yes. What is happening is that all block group items
are read from the extent tree. And, instead of being nicely grouped
together, they are scattered all over the place, at their virtual
addresses, in between all the normal extent items.

So, mount time depends on the cold random read IOPS your storage can do,
the size of the extent tree and the number of block groups. And your
extent tree has more items in it if you have more extents. So, yes, I
think writing a lot of 4KiB files should have a similar effect to a lot
of 128MiB files that are still stored in 1 extent per file.

>  I can redo the test with much 
>> more realistic data if people have legitimate reason to think it will 
>> drastically change the result.
> 
> We are hosting some large BTRFS filesystems on Ceph (RBD used by
> QEMU/KVM). I believe the delay is heavily linked to the number of files
> (I didn't check if snapshots matter and I suspect it does but not as
> much as the number of "original" files at least if you don't heavily
> modify existing files but mostly create new ones as we do).
> As an example, we have a filesystem with 20TB used space with 4
> subvolumes hosting multi millions files/directories (probably 10-20
> millions total I didn't check the exact number recently as simply
> counting files is a very long process) and 40 snapshots for each volume.
> Mount takes about 15 minutes.
> We have virtual machines that we don't reboot as often as we would like
> because of these slow mount times.
> 
> If you want to study this, you could :
> - graph the delay for various individual file sizes (instead of 25x10GB,
> create 2 500 x 100MB and 250 000 x 1MB files between each run and
> compare to the original result)
> - graph the delay vs the number of snapshots (probably starting with a
> large number of files in the initial subvolume to start with a non
> trivial mount delay)
> You may want to study the impact of the differences between snapshots by
> comparing snapshoting without modifications and snapshots made at
> various stages of your suvolume growth.
> 
> Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
> tuning of the io queue (switching between classic io-schedulers and
> blk-mq ones in the virtual machines) and BTRFS mount options
> (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
> in mount time (I managed to reduce the mount of IO requests by half on
> one server in production though although more tests are needed to
> isolate the cause).
> I didn't expect much for the mount times, it seems to me that mount is
> mostly constrained by the BTRFS on disk structures needed at mount time
> and how the filesystem reads them (for example it doesn't benefit at all
> from large IO queue depths which probably means that each read depends
> on previous ones which prevents io-schedulers from optimizing anything).

Yes, I think that's true. See btrfs_read_block_groups in extent-tree.c:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/extent-tree.c#n9982

What the code does here is start at the beginning of the extent
tree, search forward until it sees the first BLOCK_GROUP_ITEM (which
is not that far away), then, based on the information in it, compute
where the next one will be (just after the end of its vaddr+length),
and then jump over all the normal extent items and search again near
where the next block group item has to be. So, yes, that means the
reads depend on each other.

Two possible ways to improve this:

1. Walk the chunk tree (which has all the related items packed
together) instead, to find out at which locations in the extent tree the
block group items are located, and then start getting those items in
parallel. If you have storage with a lot of rotating rust that can
deliver many more random reads if you ask for more of them at the same
time, then this alone can cause a massive speedup.

2. Move the block group items somewhere else, where they can nicely be
grouped together, so that the number of metadata pages that have to be
looked up is minimal. Quoting from the link below, "slightly tricky
[...] but there are no fundamental obstacles".

https://www.spinics.net/lists/linux-btrfs/msg71766.html

I think the main obstacle here is finding a developer with enough
experience and time to do it. :)

For fun, you can also just read the block group metadata after dropping
caches each time, which should give similar relative timing results as
mounting the filesystem again. (Well, if disk IO wait is the main
slowdown of course.)
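
A tiny timing wrapper for that could look like the following untested
sketch (the script name cold_time.py is made up; writing to drop_caches
requires root):

#!/usr/bin/python3
# Untested sketch: drop the page cache so the next run is cold, then time an
# arbitrary command, e.g.:  ./cold_time.py ./bg_after_another.py /btrfs
import subprocess
import sys
import time

if len(sys.argv) < 2:
    print("Usage: {} <command> [args...]".format(sys.argv[0]))
    sys.exit(1)

# 3 = drop page cache plus dentries and inodes.
with open('/proc/sys/vm/drop_caches', 'w') as f:
    f.write('3\n')

start = time.monotonic()
subprocess.run(sys.argv[1:], check=True)
print('{:.3f}s'.format(time.monotonic() - start))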

Attached are two example programs, using python-btrfs.

* bg_after_another.py does the same thing as the kernel code I just linked.
* bg_via_chunks.py looks them up based on chunk tree info.

The time it would take after option 2 above is implemented should
be very similar to just reading the chunk tree (remove the block group
lookup from bg_via_chunks and run that).

Now what's still missing is changing the bg_via_chunks one to start
kicking off the block group searches in parallel; then you can
predict how long it would take if option 1 were implemented.
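
For what it's worth, a rough, untested sketch of that parallel variant,
assuming python-btrfs tolerates concurrent search ioctls on the same file
descriptor (the worker count of 16 is an arbitrary choice):

#!/usr/bin/python3
# Untested sketch of bg_via_chunks.py with the block group lookups issued
# from a thread pool instead of one after another.
import concurrent.futures
import sys

import btrfs

fs = btrfs.FileSystem(sys.argv[1])
chunks = list(fs.chunks())   # cheap: chunk tree items are packed together

with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    futures = [pool.submit(fs.block_group, chunk.vaddr, chunk.length)
               for chunk in chunks]
    for future in concurrent.futures.as_completed(futures):
        future.result()      # re-raise if a lookup failed
        print('.', end='', flush=True)

print()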

\:D/

-- 
Hans van Kranenburg

[-- Attachment #2: bg_after_another.py --]
[-- Type: text/x-python; name="bg_after_another.py", Size: 798 bytes --]

#!/usr/bin/python3

import btrfs
import sys

if len(sys.argv) < 2:
    print("Usage: {} <mountpoint>".format(sys.argv[0]))
    sys.exit(1)


tree = btrfs.ctree.EXTENT_TREE_OBJECTID
min_key = btrfs.ctree.Key(0, 0, 0)
bufsize = btrfs.utils.SZ_4K


def first_block_group_after(fs, key):
    # Search forward from key and return the first BLOCK_GROUP_ITEM found.
    for header, data in btrfs.ioctl.search_v2(fs.fd, tree, key, buf_size=bufsize):
        if header.type == btrfs.ctree.BLOCK_GROUP_ITEM_KEY:
            return header


fs = btrfs.FileSystem(sys.argv[1])
while True:
    header = first_block_group_after(fs, min_key)
    if header is None:
        break
    # Jump to just past the end of this block group (vaddr + length) and
    # search again from there, like the kernel does.
    min_key = btrfs.ctree.Key(header.objectid + header.offset,
                              btrfs.ctree.BLOCK_GROUP_ITEM_KEY, 0)
    print('.', end='', flush=True)

print()

[-- Attachment #3: bg_via_chunks.py --]
[-- Type: text/x-python; name="bg_via_chunks.py", Size: 306 bytes --]

#!/usr/bin/python3

import btrfs
import sys

if len(sys.argv) < 2:
    print("Usage: {} <mountpoint>".format(sys.argv[0]))
    sys.exit(1)

fs = btrfs.FileSystem(sys.argv[1])
for chunk in fs.chunks():
    # Chunk tree items are packed together, so locating each block group via
    # its chunk avoids scanning forward through the whole extent tree.
    fs.block_group(chunk.vaddr, chunk.length)
    print('.', end='', flush=True)

print()


* Re: BTRFS Mount Delay Time Graph
  2018-12-03 18:20 BTRFS Mount Delay Time Graph Wilson, Ellis
  2018-12-03 19:56 ` Lionel Bouton
@ 2018-12-04  0:16 ` Qu Wenruo
  2018-12-04 13:07 ` Nikolay Borisov
  2 siblings, 0 replies; 14+ messages in thread
From: Qu Wenruo @ 2018-12-04  0:16 UTC (permalink / raw)
  To: Wilson, Ellis, BTRFS


[-- Attachment #1.1: Type: text/plain, Size: 2649 bytes --]



On 2018/12/4 2:20 AM, Wilson, Ellis wrote:
> Hi all,
> 
> Many months ago I promised to graph how long it took to mount a BTRFS 
> filesystem as it grows.  I finally had (made) time for this, and the 
> attached is the result of my testing.  The image is a fairly 
> self-explanatory graph, and the raw data is also attached in 
> comma-delimited format for the more curious.  The columns are: 
> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
> 
> Experimental setup:
> - System:
> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
> - 3 unmount/mount cycles performed in between adding another 250GB of data
> - 250GB of data added each time in the form of 25x10GB files in their 
> own directory.  Files generated in parallel each epoch (25 at the same 
> time, with a 1MB record size).
> - 240 repetitions of this performed (to collect timings in increments of 
> 250GB between a 0GB and 60TB filesystem)
> - Normal "time" command used to measure time to mount.  "Real" time used 
> of the timings reported from time.
> - Mount:
> /dev/md0 on /btrfs type btrfs 
> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
> 
> At 60TB, we take 30s to mount the filesystem, which is actually not as 
> bad as I originally thought it would be (perhaps as a result of using 
> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
> to comment if folks more intimately familiar with BTRFS think this is 
> due to the very large files I've used.  I can redo the test with much 
> more realistic data if people have legitimate reason to think it will 
> drastically change the result.
> 
> With 14TB drives available today, it doesn't take more than a handful of 
> drives to result in a filesystem that takes around a minute to mount. 
> As a result of this, I suspect this will become an increasingly problem 
> for serious users of BTRFS as time goes on.  I'm not complaining as I'm 
> not a contributor so I have no room to do so -- just shedding some light 
> on a problem that may deserve attention as filesystem sizes continue to 
> grow.

This problem is somewhat known.

If you dig further, it's btrfs_read_block_groups() which tries to
read *ALL* block group items.
And to no one's surprise, the larger the fs grows, the more block group
items need to be read from disk.

We need to delay such reads to improve this case.

Thanks,
Qu

> 
> Best,
> 
> ellis
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]


* Re: BTRFS Mount Delay Time Graph
  2018-12-03 20:04   ` Lionel Bouton
@ 2018-12-04  2:52     ` Chris Murphy
  2018-12-04 15:08       ` Lionel Bouton
  0 siblings, 1 reply; 14+ messages in thread
From: Chris Murphy @ 2018-12-04  2:52 UTC (permalink / raw)
  To: Lionel Bouton; +Cc: Ellis H. Wilson III, Btrfs BTRFS

On Mon, Dec 3, 2018 at 1:04 PM Lionel Bouton
<lionel-subscription@bouton.name> wrote:
>
> On 03/12/2018 at 20:56, Lionel Bouton wrote:
> > [...]
> > Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
> > tuning of the io queue (switching between classic io-schedulers and
> > blk-mq ones in the virtual machines) and BTRFS mount options
> > (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
> > in mount time (I managed to reduce the mount of IO requests
>
> Sent to quickly : I meant to write "managed to reduce by half the number
> of IO write requests for the same amount of data writen"
>
> >  by half on
> > one server in production though although more tests are needed to
> > isolate the cause).

Interesting. I wonder whether it's ssd_spread or space_cache=v2 that
reduces the writes by half, and by how much each contributes. That's a major
reduction in writes, and it suggests further optimization might be
possible, to help mitigate the wandering-trees impact.


-- 
Chris Murphy


* Re: BTRFS Mount Delay Time Graph
  2018-12-03 18:20 BTRFS Mount Delay Time Graph Wilson, Ellis
  2018-12-03 19:56 ` Lionel Bouton
  2018-12-04  0:16 ` Qu Wenruo
@ 2018-12-04 13:07 ` Nikolay Borisov
  2018-12-04 13:31   ` Qu Wenruo
  2018-12-04 20:14   ` Wilson, Ellis
  2 siblings, 2 replies; 14+ messages in thread
From: Nikolay Borisov @ 2018-12-04 13:07 UTC (permalink / raw)
  To: Wilson, Ellis, BTRFS



On 3.12.18 at 20:20, Wilson, Ellis wrote:
> Hi all,
> 
> Many months ago I promised to graph how long it took to mount a BTRFS 
> filesystem as it grows.  I finally had (made) time for this, and the 
> attached is the result of my testing.  The image is a fairly 
> self-explanatory graph, and the raw data is also attached in 
> comma-delimited format for the more curious.  The columns are: 
> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
> 
> Experimental setup:
> - System:
> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
> - 3 unmount/mount cycles performed in between adding another 250GB of data
> - 250GB of data added each time in the form of 25x10GB files in their 
> own directory.  Files generated in parallel each epoch (25 at the same 
> time, with a 1MB record size).
> - 240 repetitions of this performed (to collect timings in increments of 
> 250GB between a 0GB and 60TB filesystem)
> - Normal "time" command used to measure time to mount.  "Real" time used 
> of the timings reported from time.
> - Mount:
> /dev/md0 on /btrfs type btrfs 
> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
> 
> At 60TB, we take 30s to mount the filesystem, which is actually not as 
> bad as I originally thought it would be (perhaps as a result of using 
> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
> to comment if folks more intimately familiar with BTRFS think this is 
> due to the very large files I've used.  I can redo the test with much 
> more realistic data if people have legitimate reason to think it will 
> drastically change the result.
> 
> With 14TB drives available today, it doesn't take more than a handful of 
> drives to result in a filesystem that takes around a minute to mount. 
> As a result of this, I suspect this will become an increasingly problem 
> for serious users of BTRFS as time goes on.  I'm not complaining as I'm 
> not a contributor so I have no room to do so -- just shedding some light 
> on a problem that may deserve attention as filesystem sizes continue to 
> grow.

Would it be possible to provide perf traces of the longer-running mounts?
Everyone seems to be fixated on reading block groups (which is
likely to be the culprit), but before pointing fingers I'd like concrete
evidence pointing at the offender.

> 
> Best,
> 
> ellis
> 


* Re: BTRFS Mount Delay Time Graph
  2018-12-04 13:07 ` Nikolay Borisov
@ 2018-12-04 13:31   ` Qu Wenruo
  2018-12-04 20:14   ` Wilson, Ellis
  1 sibling, 0 replies; 14+ messages in thread
From: Qu Wenruo @ 2018-12-04 13:31 UTC (permalink / raw)
  To: Nikolay Borisov, Wilson, Ellis, BTRFS



On 2018/12/4 9:07 PM, Nikolay Borisov wrote:
> 
> 
> On 3.12.18 at 20:20, Wilson, Ellis wrote:
>> Hi all,
>>
>> Many months ago I promised to graph how long it took to mount a BTRFS 
>> filesystem as it grows.  I finally had (made) time for this, and the 
>> attached is the result of my testing.  The image is a fairly 
>> self-explanatory graph, and the raw data is also attached in 
>> comma-delimited format for the more curious.  The columns are: 
>> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
>>
>> Experimental setup:
>> - System:
>> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
>> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
>> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
>> - 3 unmount/mount cycles performed in between adding another 250GB of data
>> - 250GB of data added each time in the form of 25x10GB files in their 
>> own directory.  Files generated in parallel each epoch (25 at the same 
>> time, with a 1MB record size).
>> - 240 repetitions of this performed (to collect timings in increments of 
>> 250GB between a 0GB and 60TB filesystem)
>> - Normal "time" command used to measure time to mount.  "Real" time used 
>> of the timings reported from time.
>> - Mount:
>> /dev/md0 on /btrfs type btrfs 
>> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
>>
>> At 60TB, we take 30s to mount the filesystem, which is actually not as 
>> bad as I originally thought it would be (perhaps as a result of using 
>> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
>> to comment if folks more intimately familiar with BTRFS think this is 
>> due to the very large files I've used.  I can redo the test with much 
>> more realistic data if people have legitimate reason to think it will 
>> drastically change the result.
>>
>> With 14TB drives available today, it doesn't take more than a handful of 
>> drives to result in a filesystem that takes around a minute to mount. 
>> As a result of this, I suspect this will become an increasingly problem 
>> for serious users of BTRFS as time goes on.  I'm not complaining as I'm 
>> not a contributor so I have no room to do so -- just shedding some light 
>> on a problem that may deserve attention as filesystem sizes continue to 
>> grow.
> 
> Would it be possible to provide perf traces of the longer-running mount
> time? Everyone seems to be fixated on reading block groups (which is
> likely to be the culprit) but before pointing finger I'd like concrete
> evidence pointed at the offender.

IIRC I submitted such an analysis years ago.

Nowadays the picture may have changed due to the chunk <-> bg <-> dev_extents cross-checking.
So yes, it would be a good idea to show such a percentage breakdown.

Thanks,
Qu

> 
>>
>> Best,
>>
>> ellis
>>


* Re: BTRFS Mount Delay Time Graph
  2018-12-04  2:52     ` Chris Murphy
@ 2018-12-04 15:08       ` Lionel Bouton
  0 siblings, 0 replies; 14+ messages in thread
From: Lionel Bouton @ 2018-12-04 15:08 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Ellis H. Wilson III, Btrfs BTRFS

On 04/12/2018 at 03:52, Chris Murphy wrote:
> On Mon, Dec 3, 2018 at 1:04 PM Lionel Bouton
> <lionel-subscription@bouton.name> wrote:
>> On 03/12/2018 at 20:56, Lionel Bouton wrote:
>>> [...]
>>> Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
>>> tuning of the io queue (switching between classic io-schedulers and
>>> blk-mq ones in the virtual machines) and BTRFS mount options
>>> (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
>>> in mount time (I managed to reduce the mount of IO requests
>> Sent to quickly : I meant to write "managed to reduce by half the number
>> of IO write requests for the same amount of data writen"
>>
>>>  by half on
>>> one server in production though although more tests are needed to
>>> isolate the cause).
> Interesting. I wonder if it's ssd_spread or space_cache=v2 that
> reduces the writes by half, or by how much for each? That's a major
> reduction in writes, and suggests it might be possible for further
> optimization, to help mitigate the wandering trees impact.

Note, the other major changes were:
- the 4.9 to 4.14 kernel upgrade,
- using the multi-queue aware bfq scheduler instead of noop.

If BTRFS IO patterns in our case allow bfq to merge io-requests, this
could be another explanation.

Lionel



* [Mount time bug bounty?] was: BTRFS Mount Delay Time Graph
  2018-12-03 22:22   ` Hans van Kranenburg
@ 2018-12-04 16:45     ` Lionel Bouton
  0 siblings, 0 replies; 14+ messages in thread
From: Lionel Bouton @ 2018-12-04 16:45 UTC (permalink / raw)
  To: Hans van Kranenburg, Wilson, Ellis, BTRFS

On 03/12/2018 at 23:22, Hans van Kranenburg wrote:
> [...]
> Yes, I think that's true. See btrfs_read_block_groups in extent-tree.c:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/extent-tree.c#n9982
>
> What the code is doing here is starting at the beginning of the extent
> tree, searching forward until it sees the first BLOCK_GROUP_ITEM (which
> is not that far away), and then based on the information in it, computes
> where the next one will be (just after the end of the vaddr+length of
> it), and then jumps over all normal extent items and searches again near
> where the next block group item has to be. So, yes, that means that they
> depend on each other.
>
> Two possible ways to improve this:
>
> 1. Instead, walk the chunk tree (which has all related items packed
> together) instead to find out at which locations in the extent tree the
> block group items are located and then start getting items in parallel.
> If you have storage with a lot of rotating rust that can deliver much
> more random reads if you ask for more of them at the same time, then
> this can already cause a massive speedup.
>
> 2. Move the block group items somewhere else, where they can nicely be
> grouped together, so that the amount of metadata pages that has to be
> looked up is minimal. Quoting from the link below, "slightly tricky
> [...] but there are no fundamental obstacles".
>
> https://www.spinics.net/lists/linux-btrfs/msg71766.html
>
> I think the main obstacle here is finding a developer with enough
> experience and time to do it. :)

I would definitely be interested in sponsoring at least a part of the
needed time through my company (we are too small to hire kernel
developers full-time but we can make a one-time contribution for
something as valuable to us as shorter mount times).

If needed it could be split into two steps with separate bounties:
- providing a patch for the latest LTS kernel with a substantial
decrease in mount time in our case (ideally less than a minute instead
of 15 minutes but <5 minutes is already worth it).
- having it integrated in mainline.

I don't have any experience with company sponsorship/bounties but I'm
willing to learn (don't hesitate to make suggestions). I'll have to
discuss it with our accountant to make sure we do it correctly.

Is this the right place to discuss this kind of subject, or should I take
the discussion elsewhere?

Best regards,

Lionel


* Re: BTRFS Mount Delay Time Graph
  2018-12-04 13:07 ` Nikolay Borisov
  2018-12-04 13:31   ` Qu Wenruo
@ 2018-12-04 20:14   ` Wilson, Ellis
  2018-12-05  6:55     ` Nikolay Borisov
  1 sibling, 1 reply; 14+ messages in thread
From: Wilson, Ellis @ 2018-12-04 20:14 UTC (permalink / raw)
  To: Nikolay Borisov, BTRFS

On 12/4/18 8:07 AM, Nikolay Borisov wrote:
> On 3.12.18 at 20:20, Wilson, Ellis wrote:
>> With 14TB drives available today, it doesn't take more than a handful of
>> drives to result in a filesystem that takes around a minute to mount.
>> As a result of this, I suspect this will become an increasingly problem
>> for serious users of BTRFS as time goes on.  I'm not complaining as I'm
>> not a contributor so I have no room to do so -- just shedding some light
>> on a problem that may deserve attention as filesystem sizes continue to
>> grow.
> Would it be possible to provide perf traces of the longer-running mount
> time? Everyone seems to be fixated on reading block groups (which is
> likely to be the culprit) but before pointing finger I'd like concrete
> evidence pointed at the offender.

I am glad to collect such traces -- please advise with commands that 
would achieve that.  If you just mean block traces, I can do that, but I 
suspect you mean something more BTRFS-specific.

Best,

ellis



* Re: BTRFS Mount Delay Time Graph
  2018-12-04 20:14   ` Wilson, Ellis
@ 2018-12-05  6:55     ` Nikolay Borisov
  2018-12-20  5:47       ` Qu Wenruo
  0 siblings, 1 reply; 14+ messages in thread
From: Nikolay Borisov @ 2018-12-05  6:55 UTC (permalink / raw)
  To: Wilson, Ellis, BTRFS



On 4.12.18 at 22:14, Wilson, Ellis wrote:
> On 12/4/18 8:07 AM, Nikolay Borisov wrote:
>> On 3.12.18 at 20:20, Wilson, Ellis wrote:
>>> With 14TB drives available today, it doesn't take more than a handful of
>>> drives to result in a filesystem that takes around a minute to mount.
>>> As a result of this, I suspect this will become an increasingly problem
>>> for serious users of BTRFS as time goes on.  I'm not complaining as I'm
>>> not a contributor so I have no room to do so -- just shedding some light
>>> on a problem that may deserve attention as filesystem sizes continue to
>>> grow.
>> Would it be possible to provide perf traces of the longer-running mount
>> time? Everyone seems to be fixated on reading block groups (which is
>> likely to be the culprit) but before pointing finger I'd like concrete
>> evidence pointed at the offender.
> 
> I am glad to collect such traces -- please advise with commands that 
> would achieve that.  If you just mean block traces, I can do that, but I 
> suspect you mean something more BTRFS-specific.

A command that would be good is :

perf record --all-kernel -g mount /dev/vdc /media/scratch/

Of course, replace the device/mount path appropriately. This will result in a
perf.data file which contains stack traces of the hottest paths executed
during the invocation of mount. If you could send this file to the mailing
list or upload it somewhere for interested people (me and perhaps Qu) to
inspect, that would be appreciated.

If the file turns out to be way too big, you can use

perf report --stdio

to create a text output and send that instead.

> 
> Best,
> 
> ellis
> 


* Re: BTRFS Mount Delay Time Graph
  2018-12-05  6:55     ` Nikolay Borisov
@ 2018-12-20  5:47       ` Qu Wenruo
  2018-12-26  3:43         ` Btrfs_read_block_groups() delay (Was Re: BTRFS Mount Delay Time Graph) Qu Wenruo
  0 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2018-12-20  5:47 UTC (permalink / raw)
  To: Nikolay Borisov, Wilson, Ellis, BTRFS



On 2018/12/5 2:55 PM, Nikolay Borisov wrote:
> 
> 
>> On 4.12.18 at 22:14, Wilson, Ellis wrote:
>> On 12/4/18 8:07 AM, Nikolay Borisov wrote:
>>> On 3.12.18 at 20:20, Wilson, Ellis wrote:
>>>> With 14TB drives available today, it doesn't take more than a handful of
>>>> drives to result in a filesystem that takes around a minute to mount.
>>>> As a result of this, I suspect this will become an increasingly problem
>>>> for serious users of BTRFS as time goes on.  I'm not complaining as I'm
>>>> not a contributor so I have no room to do so -- just shedding some light
>>>> on a problem that may deserve attention as filesystem sizes continue to
>>>> grow.
>>> Would it be possible to provide perf traces of the longer-running mount
>>> time? Everyone seems to be fixated on reading block groups (which is
>>> likely to be the culprit) but before pointing finger I'd like concrete
>>> evidence pointed at the offender.
>>
>> I am glad to collect such traces -- please advise with commands that 
>> would achieve that.  If you just mean block traces, I can do that, but I 
>> suspect you mean something more BTRFS-specific.
> 
> A command that would be good is :
> 
> perf record --all-kernel -g mount /dev/vdc /media/scratch/


In fact, if we're just going to verify if it's btrfs_read_block_groups()
causing the biggest problem, we could use ftrace directly (wrapped by
"perf ftrace"):

perf ftrace -t function_graph -T open_ctree \
	-T btrfs_read_block_groups \
	mount $dev $mnt

The result will be super easy to read, something like:

 2)               |  open_ctree [btrfs]() {
 2)               |    btrfs_read_block_groups [btrfs]() {
 2) # 1726.598 us |    }
 2) * 21817.28 us |  }


Since I'm just using a small fs, with 4G of data copied from /usr, we don't
populate the extent tree with enough backrefs, thus
btrfs_read_block_groups() isn't a big problem (only 7.9%).

However, when I populate the fs with small inline files along with small
data extents, and a 4K nodesize to bump up the extent tree size, the same 4G
of data results in a different story:

 3)               |  open_ctree [btrfs]() {
 3)               |    btrfs_read_block_groups [btrfs]() {
 3) # 4567.645 us |    }
 3) * 22520.95 us |  }

Now it's 20.3% of the total mount time.
I believe the percentage will only increase, going over 70% as the fs
gets larger and larger.


So, Wilson, would you please use the above "perf ftrace" command to get the
function durations?

Thanks,
Qu

> 
> of course replace device/mount path appropriately. This will result in a
> perf.data file which contains stacktraces of the hottest paths executed
> during invocation of mount. If you could send this file to the mailing
> list or upload it somwhere for interested people (me and perhaps) Qu to
> inspect would be appreciated.
> 
> If the file turned out way too big you can use
> 
> perf report --stdio  to create a text output and you could send that as
> well.
> 
>>
>> Best,
>>
>> ellis
>>


* Btrfs_read_block_groups() delay (Was Re: BTRFS Mount Delay Time Graph)
  2018-12-20  5:47       ` Qu Wenruo
@ 2018-12-26  3:43         ` Qu Wenruo
  0 siblings, 0 replies; 14+ messages in thread
From: Qu Wenruo @ 2018-12-26  3:43 UTC (permalink / raw)
  To: Nikolay Borisov, Wilson, Ellis, BTRFS

Now with a somewhat larger fs (257G used, backed by an HDD), the result is
much more obvious:

$ sudo perf ftrace -t function_graph \
                   -T open_ctree \
                   -T btrfs_read_block_groups \
                   -T check_chunk_block_group_mappings \
                   -T btrfs_read_chunk_tree \
                   -T btrfs_verify_dev_extents \
                   mount /dev/vdc /mnt/btrfs/

 3)               |  open_ctree [btrfs]() {
 3)               |    btrfs_read_chunk_tree [btrfs]() {
 3) * 69033.31 us |    }
 3)               |    btrfs_verify_dev_extents [btrfs]() {
 3) * 90376.15 us |    }
 3)               |    btrfs_read_block_groups [btrfs]() {
 2) $ 2733853 us  |    } /* btrfs_read_block_groups [btrfs] */
 2) $ 3168384 us  |  } /* open_ctree [btrfs] */

btrfs_read_chunk_tree() and btrfs_verify_dev_extents() combined take
less than 160ms.
btrfs_read_block_groups(), however, takes 2.7s out of a 3.1s total mount
time, meaning btrfs_read_block_groups() already accounts for 87% of
the mount time.


I'll try moving the btrfs BLOCK_GROUP_ITEMs into a separate tree so that
they can be iterated just like the chunk tree, and see how it ends up.

Thanks,
Qu

On 2018/12/20 1:47 PM, Qu Wenruo wrote:
> 
> 
> On 2018/12/5 2:55 PM, Nikolay Borisov wrote:
>>
>>
>> On 4.12.18 at 22:14, Wilson, Ellis wrote:
>>
>> A command that would be good is :
>>
>> perf record --all-kernel -g mount /dev/vdc /media/scratch/
> 
> 
> In fact, if we're just going to verify if it's btrfs_read_block_groups()
> causing the biggest problem, we could use ftrace directly (wrapped by
> "perf ftrace"):
> 
> perf ftrace -t function_graph -T open_ctree \
> 	-T btrfs_read_block_groups \
> 	mount $dev $mnt
> 
> The result will be super easy to read, something like:
> 
>  2)               |  open_ctree [btrfs]() {
>  2)               |    btrfs_read_block_groups [btrfs]() {
>  2) # 1726.598 us |    }
>  2) * 21817.28 us |  }
> 
> 
> Since I'm just using a small fs, with 4G data copied from /usr, we won't
> populate extent tree with enough backrefs, thus
> btrfs_read_block_groups() won't be a big problem. (only 7.9%)
> 
> However when I populate the fs with small inline files along with small
> data extents, and 4K nodesize to bump up extent tree size, the same 4G
> data would result a different story:
> 
>  3)               |  open_ctree [btrfs]() {
>  3)               |    btrfs_read_block_groups [btrfs]() {
>  3) # 4567.645 us |    }
>  3) * 22520.95 us |  }
> 
> Now it's 20.3% of the total mount time.
> I believe the percentage will just increase and go over 70% when the fs
> is larger and larger.
> 
> 
> So, Wilson, would you please use above "perf ftrace" command to get the
> function duration?
> 
> Thanks,
> Qu
> 
>>
>> of course replace device/mount path appropriately. This will result in a
>> perf.data file which contains stacktraces of the hottest paths executed
>> during invocation of mount. If you could send this file to the mailing
>> list or upload it somwhere for interested people (me and perhaps) Qu to
>> inspect would be appreciated.
>>
>> If the file turned out way too big you can use
>>
>> perf report --stdio  to create a text output and you could send that as
>> well.


Thread overview: 14+ messages
2018-12-03 18:20 BTRFS Mount Delay Time Graph Wilson, Ellis
2018-12-03 19:56 ` Lionel Bouton
2018-12-03 20:04   ` Lionel Bouton
2018-12-04  2:52     ` Chris Murphy
2018-12-04 15:08       ` Lionel Bouton
2018-12-03 22:22   ` Hans van Kranenburg
2018-12-04 16:45     ` [Mount time bug bounty?] was: " Lionel Bouton
2018-12-04  0:16 ` Qu Wenruo
2018-12-04 13:07 ` Nikolay Borisov
2018-12-04 13:31   ` Qu Wenruo
2018-12-04 20:14   ` Wilson, Ellis
2018-12-05  6:55     ` Nikolay Borisov
2018-12-20  5:47       ` Qu Wenruo
2018-12-26  3:43         ` Btrfs_read_block_groups() delay (Was Re: BTRFS Mount Delay Time Graph) Qu Wenruo
