* Status of FST and mount times
@ 2018-02-14 16:00 Ellis H. Wilson III
  2018-02-14 17:08 ` Nikolay Borisov
                   ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Ellis H. Wilson III @ 2018-02-14 16:00 UTC (permalink / raw)
  To: linux-btrfs

Hi again -- back with a few more questions:

Frame-of-reference here: RAID0.  Around 70TB raw capacity.  No 
compression.  No quotas enabled.  Many (potentially tens to hundreds) of 
subvolumes, each with tens of snapshots.  No control over size or number 
of files, but directory tree (entries per dir and general tree depth) 
can be controlled in case that's helpful.

1. I've been reading up about the space cache, and it appears there is a 
v2 of it called the free space tree that is much friendlier to large 
filesystems such as the one I am designing for.  It is listed as OK/OK 
on the wiki status page, but there is a note that btrfs progs treats it 
as read only (i.e., btrfs check repair cannot help me without a full 
space cache rebuild is my biggest concern) and the last status update on 
this I can find was circa fall 2016.  Can anybody give me an updated 
status on this feature?  From what I read, v1 and tens of TB filesystems 
will not play well together, so I'm inclined to dig into this.

2. There's another thread on-going about mount delays.  I've been 
completely blind to this specific problem until it caught my eye.  Does 
anyone have ballpark estimates for how long very large HDD-based 
filesystems will take to mount?  Yes, I know it will depend on the 
dataset.  I'm looking for O() worst-case approximations for 
enterprise-grade large drives (12/14TB), as I expect it should scale 
with multiple drives so approximating for a single drive should be good 
enough.

3. Do long mount delays relate to space_cache v1 vs v2 (I would guess 
no, unless it needed to be regenerated)?

Note that I'm not sensitive to multi-second mount delays.  I am 
sensitive to multi-minute mount delays, hence why I'm bringing this up.

FWIW: I am currently populating a machine we have with 6TB drives in it 
with real-world home dir data to see if I can replicate the mount issue.

Thanks,

ellis


* Re: Status of FST and mount times
  2018-02-14 16:00 Status of FST and mount times Ellis H. Wilson III
@ 2018-02-14 17:08 ` Nikolay Borisov
  2018-02-14 17:21   ` Ellis H. Wilson III
                     ` (2 more replies)
  2018-02-14 23:24 ` Duncan
  2018-02-15  6:14 ` Chris Murphy
  2 siblings, 3 replies; 32+ messages in thread
From: Nikolay Borisov @ 2018-02-14 17:08 UTC (permalink / raw)
  To: Ellis H. Wilson III, linux-btrfs



On 14.02.2018 18:00, Ellis H. Wilson III wrote:
> Hi again -- back with a few more questions:
> 
> Frame-of-reference here: RAID0.  Around 70TB raw capacity.  No
> compression.  No quotas enabled.  Many (potentially tens to hundreds) of
> subvolumes, each with tens of snapshots.  No control over size or number
> of files, but directory tree (entries per dir and general tree depth)
> can be controlled in case that's helpful.
> 
> 1. I've been reading up about the space cache, and it appears there is a
> v2 of it called the free space tree that is much friendlier to large
> filesystems such as the one I am designing for.  It is listed as OK/OK
> on the wiki status page, but there is a note that btrfs progs treats it
> as read only (i.e., btrfs check repair cannot help me without a full
> space cache rebuild is my biggest concern) and the last status update on
> this I can find was circa fall 2016.  Can anybody give me an updated
> status on this feature?  From what I read, v1 and tens of TB filesystems
> will not play well together, so I'm inclined to dig into this.

V1 for large filesystems is just awful. Facebook have been experiencing
the pain, hence they implemented v2. You can view the space cache tree as
the complement of the extent tree. The v1 cache is implemented as a
hidden inode, and even though writes (aka flushing of the free space
cache) are metadata, they are essentially treated as data. This could
potentially lead to priority inversions if the cgroups io controller is
involved.

Furthermore, there is at least one known deadlock problem in free space
cache v1. So yes, if you want to use btrfs on a multi-TB system, v2 is
really the way to go.
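
For reference, switching an existing filesystem over is just a mount
option (a sketch, assuming a 4.5+ kernel; the first such mount has to
build the free space tree, which can take a while on a big filesystem):

mount -o space_cache=v2 /dev/sdX /mnt   # later mounts keep using the tree automatically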

> 
> 2. There's another thread on-going about mount delays.  I've been
> completely blind to this specific problem until it caught my eye.  Does
> anyone have ballpark estimates for how long very large HDD-based
> filesystems will take to mount?  Yes, I know it will depend on the
> dataset.  I'm looking for O() worst-case approximations for
> enterprise-grade large drives (12/14TB), as I expect it should scale
> with multiple drives so approximating for a single drive should be good
> enough.
> 
> 3. Do long mount delays relate to space_cache v1 vs v2 (I would guess
> no, unless it needed to be regenerated)?

No, the long mount times seem to be due to the fact that in order for a
btrfs filesystem to mount it needs to enumerate its block group items,
and those are stored in the extent tree, which also holds all of the
information pertaining to allocated extents. So mixing those data
structures in the same tree, plus the fact that block groups are
iterated linearly during mount (check btrfs_read_block_groups), means
that on spinning rust with shitty seek times this can take a while.

However, this will really depend on the number of extents you have, and
having taken a look at the thread you referred to, it seems there is no
clear-cut reason why mounting is taking so long on that particular
occasion.
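
For a rough feel of how much iterating that is, the block group items in
the extent tree can be counted with debug tree (a sketch; run it against
an unmounted device, and the exact item name may differ slightly between
btrfs-progs versions):

btrfs-debug-tree -t extent /dev/sdX | grep -c BLOCK_GROUP_ITEM   # roughly one item per ~1GiB data chunk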


> 
> Note that I'm not sensitive to multi-second mount delays.  I am
> sensitive to multi-minute mount delays, hence why I'm bringing this up.
> 
> FWIW: I am currently populating a machine we have with 6TB drives in it
> with real-world home dir data to see if I can replicate the mount issue.
> 
> Thanks,
> 
> ellis


* Re: Status of FST and mount times
  2018-02-14 17:08 ` Nikolay Borisov
@ 2018-02-14 17:21   ` Ellis H. Wilson III
  2018-02-15  1:42   ` Qu Wenruo
  2018-02-15  5:54   ` Chris Murphy
  2 siblings, 0 replies; 32+ messages in thread
From: Ellis H. Wilson III @ 2018-02-14 17:21 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs

On 02/14/2018 12:08 PM, Nikolay Borisov wrote:
> V1 for large filesystems is just awful. Facebook have been experiencing
> the pain, hence they implemented v2. You can view the space cache tree as
> the complement of the extent tree. The v1 cache is implemented as a
> hidden inode, and even though writes (aka flushing of the free space
> cache) are metadata, they are essentially treated as data. This could
> potentially lead to priority inversions if the cgroups io controller is
> involved.
> 
> Furthermore, there is at least one known deadlock problem in free space
> cache v1. So yes, if you want to use btrfs on a multi-TB system, v2 is
> really the way to go.

Fantastic.  Thanks for the backstory.  That is what I will plan to use 
then.  I've been operating with whatever the default is (I presume v1 
based on the man page), but haven't yet populated any of our machines 
sufficiently to notice performance degradation due to space cache 
problems.
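
(In case anyone else wonders how to check which version a filesystem is
actually using: with a reasonably recent btrfs-progs the free space tree
shows up as a compat_ro feature in the superblock. A sketch, since the
flag decoding depends on the progs version:)

btrfs inspect-internal dump-super /dev/sdX | grep -A2 compat_ro   # FREE_SPACE_TREE here means v2 has been built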

> No, the long mount times seem to be due to the fact that in order for a
> btrfs filesystem to mount it needs to enumerate its block group items,
> and those are stored in the extent tree, which also holds all of the
> information pertaining to allocated extents. So mixing those data
> structures in the same tree, plus the fact that block groups are
> iterated linearly during mount (check btrfs_read_block_groups), means
> that on spinning rust with shitty seek times this can take a while.
> 
> However, this will really depend on the number of extents you have, and
> having taken a look at the thread you referred to, it seems there is no
> clear-cut reason why mounting is taking so long on that particular
> occasion.

Ok; thanks.  To phrase it somewhat more simply, should I expect 
"normal" datasets (think home directory) that happen to be part of a 
very large BTRFS filesystem (tens of TBs) to take more than 60s to 
mount?  Let's presume there isn't extreme fragmentation or any media 
errors to keep things simple.

Best,

ellis


* Re: Status of FST and mount times
  2018-02-14 16:00 Status of FST and mount times Ellis H. Wilson III
  2018-02-14 17:08 ` Nikolay Borisov
@ 2018-02-14 23:24 ` Duncan
  2018-02-15 15:42   ` Ellis H. Wilson III
  2018-02-15  6:14 ` Chris Murphy
  2 siblings, 1 reply; 32+ messages in thread
From: Duncan @ 2018-02-14 23:24 UTC (permalink / raw)
  To: linux-btrfs

Ellis H. Wilson III posted on Wed, 14 Feb 2018 11:00:29 -0500 as
excerpted:

> Hi again -- back with a few more questions:
> 
> Frame-of-reference here: RAID0.  Around 70TB raw capacity.  No
> compression.  No quotas enabled.  Many (potentially tens to hundreds) of
> subvolumes, each with tens of snapshots.  No control over size or number
> of files, but directory tree (entries per dir and general tree depth)
> can be controlled in case that's helpful.

??  How can you control both breadth (entries per dir) AND depth of 
directory tree without ultimately limiting your number of files?

Or do you mean you can control breadth XOR depth of tree as needed, 
allowing the other to expand as necessary to accommodate the uncontrolled 
number of files?

Anyway, AFAIK the only performance issue there would be the (IIRC) ~65535 
limit on directory hard links before additional ones are out-of-lined 
into a secondary node, with the entailing performance implications.

> 1. I've been reading up about the space cache, and it appears there is a
> v2 of it called the free space tree that is much friendlier to large
> filesystems such as the one I am designing for.  It is listed as OK/OK
> on the wiki status page, but there is a note that btrfs progs treats it
> as read only (i.e., btrfs check repair cannot help me without a full
> space cache rebuild is my biggest concern) and the last status update on
> this I can find was circa fall 2016.  Can anybody give me an updated
> status on this feature?  From what I read, v1 and tens of TB filesystems
> will not play well together, so I'm inclined to dig into this.

At tens of TB, yes, the free-space-cache (v1) has issues that the free-
space-tree (aka free-space-cache-v2) is designed to solve.  And v2 
should be very well tested in large enterprise installations by now, 
given facebook's usage and intimate involvement with btrfs.

But I have an arguably more basic concern...  Pardon me for reviewing the 
basics as I feel rather like a pupil attempting to lecture a teacher on 
the point and you could very likely teach /me/ about them, but they set up 
the point...

Raid0, particularly at the 10s-of-TB scale, has some implications that 
don't particularly well match your specified concerns above.

Of course "raid0" is a convenient misnomer, as there's nothing 
"redundant" about the "array of independent devices" in a raid0 
configuration, it's simply done for the space and speed features, with 
the sacrificial tradeoff being reliability.  It's only called raid0 as a 
convenience, allowing it to be grouped with the other raid configurations 
where "redundant" /is/ a feature, with the more important grouping 
commonality being they're all multi-device.

Because reliability /is/ the sacrificial tradeoff for raid0, it's 
relatively safe to make the assumption that reliability either isn't 
needed at all because the data literally is "throw-away" value (cache, 
say, where refilling the cache isn't a big cost or time factor), or 
reliability is assured by other mechanisms, backups being the most basic 
but there are others like multi-layered raid, etc, which in practice 
makes at least the particular instance of the data on the raid0 "throw-
away" value, even if the data as a whole is not.

So far, so good.  But then above you mention concern about btrfs-progs 
treating the free-space-tree (free-space-cache-v2) as read-only, and the 
time cost of having to clear and rebuild it after a btrfs check --repair.

Which is what triggered the mismatch warning I mentioned above.  Either 
that raid0 data is of throw-away value appropriate to placement on a 
raid0, and btrfs check --repair is of little concern as the benefits are 
questionable (no guarantees it'll work and the data is either directly 
throw-away value anyway, or there's a backup at hand that /does/ have a 
tested guarantee of viability, or it's not worthy of being called a 
backup in the first place), or it's not.

It's that concern about the viability of btrfs check --repair on what 
you're defining as throw-away data by placing it on raid0 in the first 
place, that's raising all those red warning flags for me!  And the fact 
that you didn't even bother to explain it with a side note to the effect 
that the reliability is addressed some other way, but you still need to 
worry about btrfs check --repair viability because $REASONS, is turning 
those red flags into flashing red lights accompanied by blaring sirens!

OK, so let's assume you /do/ have a tested backup, ready to go.  Then the 
viability of btrfs check --repair is of less concern, but remains 
something you might still be interested in for trivial cases, because 
let's face it, transferring tens of TB of data, even if ready at hand, 
does take time, and if you can avoid it because the btrfs check --repair 
fix is trivial, it's worth doing so.

Valid case, but there's nothing in your post indicating it's valid as 
/your/ case.

Of course the other possibility is live-failover, which is sure to be 
facebook's use-case.  But with live-failover, the viability of btrfs 
check --repair more or less ceases to be of interest, because the failover 
happens (relative to the offline check or restore time) instantly, and 
once the failed devices/machine is taken out of service it's far more 
effective to simply blow away the filesystem (if not replacing the 
device(s) entirely) and restore "at leisure" from backup, a relatively 
guaranteed procedure compared to the "no guarantees" of attempting to 
check --repair the filesystem out of trouble.

Which is very likely why the free-space-tree still isn't well supported 
by btrfs-progs, including btrfs check, several kernel (and thus -progs) 
development cycles later.  The people who really need the one (whichever 
one of the two)... don't tend to (or at least /shouldn't/) make use of 
the other so much.

It's also worth mentioning that btrfs raid0 mode, as well as single mode, 
hobbles the btrfs data and metadata integrity feature: while checksums 
are still generated, stored and checked by default, and integrity 
problems can still be detected, raid0 (and single) includes no 
redundancy, so there's no second copy (raid1/10) or parity redundancy 
(raid5/6) to rebuild the bad data from; the bad data is simply gone.  
(Well, for data you can try btrfs restore of the otherwise inaccessible 
file and hope for the best, and for metadata, you can try check --repair 
and again hope for the best, but...)  If you're using that feature of 
btrfs and want/need more than just detection of a problem that can't be 
fixed due to lack of redundancy, there's a good chance you want a real 
redundancy raid mode on multi-device, or dup mode on single device.
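
For the record, "detection without repair" in practice looks something
like a scrub run (a sketch; the mountpoint is a placeholder):

btrfs scrub start -B /mnt   # -B stays in the foreground until the scrub finishes
btrfs scrub status /mnt     # reports csum errors found; on raid0/single they're detect-only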

So bottom line... given the sacrificial lack of redundancy and 
reliability of raid0, btrfs or not, in an enterprise setting with tens of 
TB of data, why are you worrying about the viability of btrfs check --
repair on what the placement on raid0 decrees to be throw-away data 
anyway?  At first glance, one of the two must be wrong: either the raid0 
mode, and thus the declared throw-away value of tens of TB of data, or 
the concern about the viability of btrfs check --repair, which suggests 
you don't actually consider that data to be of throw-away value after 
all.  Which one is wrong is your call, and there are certainly individual 
cases (one of which I even named) where concern about the viability of 
btrfs check --repair on raid0 might be valid, but your post has no real 
indication that your case is such a case, and honestly, that worries me!

> 2. There's another thread on-going about mount delays.  I've been
> completely blind to this specific problem until it caught my eye.  Does
> anyone have ballpark estimates for how long very large HDD-based
> filesystems will take to mount?  Yes, I know it will depend on the
> dataset.  I'm looking for O() worst-case approximations for
> enterprise-grade large drives (12/14TB), as I expect it should scale
> with multiple drives so approximating for a single drive should be good
> enough.

No input on that question here (my own use-case couldn't be more 
different, multiple small sub-half-TB independent btrfs raid1s on 
partitioned ssds), but another concern, based on real-world reports I've 
seen on-list:

12-14 TB individual drives?

While you /did/ say enterprise grade so this probably doesn't apply to 
you, it might apply to others that will read this.

Be careful that you're not trying to use the "archive application" 
targeted SMR drives for general purpose use.  Occasionally people will 
try to buy and use such drives in general purpose use due to their 
cheaper per-TB cost, and it just doesn't go well.  We've had a number of 
reports of that. =:^(

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Status of FST and mount times
  2018-02-14 17:08 ` Nikolay Borisov
  2018-02-14 17:21   ` Ellis H. Wilson III
@ 2018-02-15  1:42   ` Qu Wenruo
  2018-02-15  2:15     ` Duncan
  2018-02-15 11:12     ` Hans van Kranenburg
  2018-02-15  5:54   ` Chris Murphy
  2 siblings, 2 replies; 32+ messages in thread
From: Qu Wenruo @ 2018-02-15  1:42 UTC (permalink / raw)
  To: Nikolay Borisov, Ellis H. Wilson III, linux-btrfs





On 2018-02-15 01:08, Nikolay Borisov wrote:
> 
> 
> On 14.02.2018 18:00, Ellis H. Wilson III wrote:
>> Hi again -- back with a few more questions:
>>
>> Frame-of-reference here: RAID0.  Around 70TB raw capacity.  No
>> compression.  No quotas enabled.  Many (potentially tens to hundreds) of
>> subvolumes, each with tens of snapshots.  No control over size or number
>> of files, but directory tree (entries per dir and general tree depth)
>> can be controlled in case that's helpful.
>>
>> 1. I've been reading up about the space cache, and it appears there is a
>> v2 of it called the free space tree that is much friendlier to large
>> filesystems such as the one I am designing for.  It is listed as OK/OK
>> on the wiki status page, but there is a note that btrfs progs treats it
>> as read only (i.e., btrfs check repair cannot help me without a full
>> space cache rebuild is my biggest concern) and the last status update on
>> this I can find was circa fall 2016.  Can anybody give me an updated
>> status on this feature?  From what I read, v1 and tens of TB filesystems
>> will not play well together, so I'm inclined to dig into this.
> 
> V1 for large filesystems is just awful. Facebook have been experiencing
> the pain, hence they implemented v2. You can view the space cache tree as
> the complement of the extent tree. The v1 cache is implemented as a
> hidden inode, and even though writes (aka flushing of the free space
> cache) are metadata, they are essentially treated as data. This could
> potentially lead to priority inversions if the cgroups io controller is
> involved.
> 
> Furthermore, there is at least one known deadlock problem in free space
> cache v1. So yes, if you want to use btrfs on a multi-TB system, v2 is
> really the way to go.
> 
>>
>> 2. There's another thread on-going about mount delays.  I've been
>> completely blind to this specific problem until it caught my eye.  Does
>> anyone have ballpark estimates for how long very large HDD-based
>> filesystems will take to mount?  Yes, I know it will depend on the
>> dataset.  I'm looking for O() worst-case approximations for
>> enterprise-grade large drives (12/14TB), as I expect it should scale
>> with multiple drives so approximating for a single drive should be good
>> enough.
>>
>> 3. Do long mount delays relate to space_cache v1 vs v2 (I would guess
>> no, unless it needed to be regenerated)?
> 
> No, the long mount times seem to be due to the fact that in order for a
> btrfs filesystem to mount it needs to enumerate its block group items,
> and those are stored in the extent tree, which also holds all of the
> information pertaining to allocated extents. So mixing those data
> structures in the same tree, plus the fact that block groups are
> iterated linearly during mount (check btrfs_read_block_groups), means
> that on spinning rust with shitty seek times this can take a while.

And, space cache is not loaded at mount time.
It's delayed until we determine to allocate extent from one block group.

So space cache is completely unrelated to long mount time.

> 
> However, this will really depend on the number of extents you have, and
> having taken a look at the thread you referred to, it seems there is no
> clear-cut reason why mounting is taking so long on that particular
> occasion.

Just as said by Nikolay, the biggest problem of slow mount is the size
of extent tree (and HDD seek time)

The easiest way to get a basic idea of how large your extent tree is
using debug tree:

# btrfs-debug-tree -r -t extent <device>

You would get something like:
btrfs-progs v4.15
extent tree key (EXTENT_TREE ROOT_ITEM 0) 30539776 level 0  <<<
total bytes 10737418240
bytes used 393216
uuid 651fcf0c-0ffd-4351-9721-84b1615f02e0

That level would give you some basic idea of the size of your extent
tree.

For level 0, it can contain about 400 items on average.
For level 1, it can contain up to 197K items.
...
For level n, it can contain up to 400 * 493 ^ (n - 1) items.
( n <= 7 )

Thanks,
Qu

> 
> 
>>
>> Note that I'm not sensitive to multi-second mount delays.  I am
>> sensitive to multi-minute mount delays, hence why I'm bringing this up.
>>
>> FWIW: I am currently populating a machine we have with 6TB drives in it
>> with real-world home dir data to see if I can replicate the mount issue.
>>
>> Thanks,
>>
>> ellis




* Re: Status of FST and mount times
  2018-02-15  1:42   ` Qu Wenruo
@ 2018-02-15  2:15     ` Duncan
  2018-02-15  3:49       ` Qu Wenruo
  2018-02-15 11:12     ` Hans van Kranenburg
  1 sibling, 1 reply; 32+ messages in thread
From: Duncan @ 2018-02-15  2:15 UTC (permalink / raw)
  To: linux-btrfs

Qu Wenruo posted on Thu, 15 Feb 2018 09:42:27 +0800 as excerpted:

> The easiest way to get a basic idea of how large your extent tree is
> using debug tree:
> 
> # btrfs-debug-tree -r -t extent <device>
> 
> You would get something like:
> btrfs-progs v4.15 extent tree key (EXTENT_TREE ROOT_ITEM 0) 30539776
> level 0  <<<
> total bytes 10737418240 bytes used 393216 uuid
> 651fcf0c-0ffd-4351-9721-84b1615f02e0
> 
> That level would give you some basic idea of the size of your extent
> tree.
> 
> For level 0, it can contain about 400 items on average.
> For level 1, it can contain up to 197K items.
> ...
> For level n, it can contain up to 400 * 493 ^ (n - 1) items.
> ( n <= 7 )

So for level 2 (which I see on a couple of mine here, ran it out of 
curiosity):

400 * 493 ^ (2 - 1) = 400 * 493 = 197200

197K for both level 1 and level 2?  Doesn't look correct.

Perhaps you meant a simple power of n, instead of (n-1)?  That would 
yield ~97M for level 2, and would yield the given numbers for levels 0 
and 1 as well, whereby using n-1 for level 0 yields less than a single 
entry, and 400 for level 1.

Or the given numbers were for level 1 and 2, with level 0 not holding 
anything, not levels 0 and 1.  But that wouldn't jive with your level 0 
example, which I would assume could never happen if it couldn't hold even 
a single entry.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Status of FST and mount times
  2018-02-15  2:15     ` Duncan
@ 2018-02-15  3:49       ` Qu Wenruo
  0 siblings, 0 replies; 32+ messages in thread
From: Qu Wenruo @ 2018-02-15  3:49 UTC (permalink / raw)
  To: Duncan, linux-btrfs





On 2018-02-15 10:15, Duncan wrote:
> Qu Wenruo posted on Thu, 15 Feb 2018 09:42:27 +0800 as excerpted:
> 
>> The easiest way to get a basic idea of how large your extent tree is
>> using debug tree:
>>
>> # btrfs-debug-tree -r -t extent <device>
>>
>> You would get something like:
>> btrfs-progs v4.15 extent tree key (EXTENT_TREE ROOT_ITEM 0) 30539776
>> level 0  <<<
>> total bytes 10737418240 bytes used 393216 uuid
>> 651fcf0c-0ffd-4351-9721-84b1615f02e0
>>
>> That level would give you some basic idea of the size of your extent
>> tree.
>>
>> For level 0, it can contain about 400 items on average.
>> For level 1, it can contain up to 197K items.
>> ...
>> For level n, it can contain up to 400 * 493 ^ (n - 1) items.
>> ( n <= 7 )
> 
> So for level 2 (which I see on a couple of mine here, ran it out of 
> curiosity):
> 
> 400 * 493 ^ (2 - 1) = 400 * 493 = 197200
> 
> 197K for both level 1 and level 2?  Doesn't look correct.
> 
> Perhaps you meant a simple power of n, instead of (n-1)?

My fault, off-by-one errors are really easy to make.

So it's 400 * 493 ^ n.

And level 0 also fits into the calculation.

>  That would 
> yield ~97M for level 2, and would yield the given numbers for levels 0 
> and 1 as well, whereby using n-1 for level 0 yields less than a single 
> entry, and 400 for level 1.
> 
> Or the given numbers were for level 1 and 2, with level 0 not holding 
> anything, not levels 0 and 1.  But that wouldn't jive with your level 0 
> example, which I would assume could never happen if it couldn't hold even 
> a single entry.

Here level 0 means it's a leaf, and I assume the average item size of
each EXTENT_ITEM/METADATA_ITEM to be 40 bytes.
With a 16K nodesize we have 16283 usable bytes per leaf, which gives 407;
I just round it down to 400 to make the calculation a little easier and
to leave more headroom for larger items.

So for level 0, we can have around 400 items.
For nodes (1 <= level <= 7), since a node ptr is fixed at 33 bytes, the
calculation is pretty simple.
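
As a quick sanity check of the corrected formula, in bash arithmetic (a
sketch; these are rough worst-case capacities, ignoring item size
variation):

for n in 0 1 2 3; do echo "level $n: ~$((400 * 493 ** n)) items"; done
# prints:
# level 0: ~400 items
# level 1: ~197200 items
# level 2: ~97219600 items
# level 3: ~47929262800 items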

Thanks,
Qu

> 




* Re: Status of FST and mount times
  2018-02-14 17:08 ` Nikolay Borisov
  2018-02-14 17:21   ` Ellis H. Wilson III
  2018-02-15  1:42   ` Qu Wenruo
@ 2018-02-15  5:54   ` Chris Murphy
  2 siblings, 0 replies; 32+ messages in thread
From: Chris Murphy @ 2018-02-15  5:54 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Ellis H. Wilson III, Btrfs BTRFS

On Wed, Feb 14, 2018 at 10:08 AM, Nikolay Borisov <nborisov@suse.com> wrote:

> V1 for large filesystems is just awful. Facebook have been experiencing
> the pain, hence they implemented v2. You can view the space cache tree as
> the complement of the extent tree. The v1 cache is implemented as a
> hidden inode, and even though writes (aka flushing of the free space
> cache) are metadata, they are essentially treated as data. This could
> potentially lead to priority inversions if the cgroups io controller is
> involved.
>
> Furthermore, there is at least one known deadlock problem in free space
> cache v1. So yes, if you want to use btrfs on a multi-TB system, v2 is
> really the way to go.

I've been using v2 on a couple of systems' rootfs for a couple of
months. I'm not totally certain it's v2, or another enhancement circa
4.14, but system updates (rpm based) are definitely faster. So it may
not only be a Nice To Have with big file systems. I haven't tried it
yet but if the file system face plants on me, I figure I'll use btrfs
check to wipe the free space cache (hopefully that's allowed even if
the file system is hosed) and then try to repair.
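
The knob I have in mind is roughly the following (a sketch; whether the
v2 variant is accepted depends on the btrfs-progs version, so check the
man page first):

btrfs check --clear-space-cache v1 /dev/sdX   # drop the old v1 cache (filesystem must be unmounted)
btrfs check --clear-space-cache v2 /dev/sdX   # drop the free space tree; mount with space_cache=v2 to rebuild it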


-- 
Chris Murphy


* Re: Status of FST and mount times
  2018-02-14 16:00 Status of FST and mount times Ellis H. Wilson III
  2018-02-14 17:08 ` Nikolay Borisov
  2018-02-14 23:24 ` Duncan
@ 2018-02-15  6:14 ` Chris Murphy
  2018-02-15 16:45   ` Ellis H. Wilson III
  2 siblings, 1 reply; 32+ messages in thread
From: Chris Murphy @ 2018-02-15  6:14 UTC (permalink / raw)
  To: Ellis H. Wilson III; +Cc: Btrfs BTRFS

On Wed, Feb 14, 2018 at 9:00 AM, Ellis H. Wilson III <ellisw@panasas.com> wrote:

> Frame-of-reference here: RAID0.  Around 70TB raw capacity.  No compression.
> No quotas enabled.  Many (potentially tens to hundreds) of subvolumes, each
> with tens of snapshots.

Even if non-catastrophic to lose such a file system, it's big enough
to be tedious and take time to set it up again. I think it's worth
considering one of two things as alternatives:

a. metadata raid1, data single: you lose the striping performance of
raid0, and if it's not randomly filled you'll end up with some disk
contention for reads and writes *but* if you lose a drive you will not
lose the file system. Any missing files on the dead drive will result
in EIO (and I think also a kernel message with path to file), and so
you could just run a script to delete those files and replace them
with backup copies.

b. Variation on the above would be to put it behind glusterfs
replicated volume. Gluster getting EIO from a brick should cause it to
get a copy from another brick and then fix up the bad one
automatically. Or in your raid0 case, the whole volume is lost, and
glusterfs helps do the full rebuild over 3-7 days while you're still
able to access those 70TB of data normally. Of course, this option
requires having two 70TB storage bricks available.
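
For option (a) it's just a matter of profiles at mkfs time, or an
in-place conversion (a sketch; device names and mountpoint are
placeholders):

mkfs.btrfs -d single -m raid1 /dev/sda /dev/sdb /dev/sdc /dev/sdd   # new fs: single data, raid1 metadata
btrfs balance start -dconvert=single -mconvert=raid1 /mnt           # or convert an existing raid0 in place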


-- 
Chris Murphy


* Re: Status of FST and mount times
  2018-02-15  1:42   ` Qu Wenruo
  2018-02-15  2:15     ` Duncan
@ 2018-02-15 11:12     ` Hans van Kranenburg
  2018-02-15 16:30       ` Ellis H. Wilson III
  1 sibling, 1 reply; 32+ messages in thread
From: Hans van Kranenburg @ 2018-02-15 11:12 UTC (permalink / raw)
  To: Qu Wenruo, Nikolay Borisov, Ellis H. Wilson III, linux-btrfs

On 02/15/2018 02:42 AM, Qu Wenruo wrote:
> 
> 
> On 2018-02-15 01:08, Nikolay Borisov wrote:
>>
>>
>> On 14.02.2018 18:00, Ellis H. Wilson III wrote:
>>> Hi again -- back with a few more questions:
>>>
>>> Frame-of-reference here: RAID0.  Around 70TB raw capacity.  No
>>> compression.  No quotas enabled.  Many (potentially tens to hundreds) of
>>> subvolumes, each with tens of snapshots.  No control over size or number
>>> of files, but directory tree (entries per dir and general tree depth)
>>> can be controlled in case that's helpful.
>>>
>>> 1. I've been reading up about the space cache, and it appears there is a
>>> v2 of it called the free space tree that is much friendlier to large
>>> filesystems such as the one I am designing for.  It is listed as OK/OK
>>> on the wiki status page, but there is a note that btrfs progs treats it
>>> as read only (i.e., btrfs check repair cannot help me without a full
>>> space cache rebuild is my biggest concern) and the last status update on
>>> this I can find was circa fall 2016.  Can anybody give me an updated
>>> status on this feature?  From what I read, v1 and tens of TB filesystems
>>> will not play well together, so I'm inclined to dig into this.
>>
>> V1 for large filesystems is just awful. Facebook have been experiencing
>> the pain, hence they implemented v2. You can view the space cache tree as
>> the complement of the extent tree. The v1 cache is implemented as a
>> hidden inode, and even though writes (aka flushing of the free space
>> cache) are metadata, they are essentially treated as data. This could
>> potentially lead to priority inversions if the cgroups io controller is
>> involved.
>>
>> Furthermore, there is at least one known deadlock problem in free space
>> cache v1. So yes, if you want to use btrfs on a multi-TB system, v2 is
>> really the way to go.
>>
>>>
>>> 2. There's another thread on-going about mount delays.  I've been
>>> completely blind to this specific problem until it caught my eye.  Does
>>> anyone have ballpark estimates for how long very large HDD-based
>>> filesystems will take to mount?  Yes, I know it will depend on the
>>> dataset.  I'm looking for O() worst-case approximations for
>>> enterprise-grade large drives (12/14TB), as I expect it should scale
>>> with multiple drives so approximating for a single drive should be good
>>> enough.
>>>
>>> 3. Do long mount delays relate to space_cache v1 vs v2 (I would guess
>>> no, unless it needed to be regenerated)?
>>
>> No, the long mount times seem to be due to the fact that in order for a
>> btrfs filesystem to mount it needs to enumerate its block group items,
>> and those are stored in the extent tree, which also holds all of the
>> information pertaining to allocated extents. So mixing those data
>> structures in the same tree, plus the fact that block groups are
>> iterated linearly during mount (check btrfs_read_block_groups), means
>> that on spinning rust with shitty seek times this can take a while.
> 
> And, space cache is not loaded at mount time.
> It's delayed until we determine to allocate extent from one block group.
> 
> So space cache is completely unrelated to long mount time.
> 
>>
>> However, this will really depend on the number of extents you have, and
>> having taken a look at the thread you referred to, it seems there is no
>> clear-cut reason why mounting is taking so long on that particular
>> occasion.
> 
> Just as said by Nikolay, the biggest problem of slow mount is the size
> of extent tree (and HDD seek time)
> 
> The easiest way to get a basic idea of how large your extent tree is
> using debug tree:
> 
> # btrfs-debug-tree -r -t extent <device>
> 
> You would get something like:
> btrfs-progs v4.15
> extent tree key (EXTENT_TREE ROOT_ITEM 0) 30539776 level 0  <<<
> total bytes 10737418240
> bytes used 393216
> uuid 651fcf0c-0ffd-4351-9721-84b1615f02e0
> 
> That level would give you some basic idea of the size of your extent
> tree.
> 
> For level 0, it can contain about 400 items on average.
> For level 1, it can contain up to 197K items.
> ...
> For level n, it can contain up to 400 * 493 ^ (n - 1) items.
> ( n <= 7 )

Another one to get that data:

https://github.com/knorrie/python-btrfs/blob/master/examples/show_metadata_tree_sizes.py

Example, with amount of leaves on level 0 and nodes higher up:

-# ./show_metadata_tree_sizes.py /
ROOT_TREE         336.00KiB 0(    20) 1(     1)
EXTENT_TREE       123.52MiB 0(  7876) 1(    28) 2(     1)
CHUNK_TREE        112.00KiB 0(     6) 1(     1)
DEV_TREE           80.00KiB 0(     4) 1(     1)
FS_TREE          1016.34MiB 0( 64113) 1(   881) 2(    52)
CSUM_TREE         777.42MiB 0( 49571) 1(   183) 2(     1)
QUOTA_TREE            0.00B
UUID_TREE          16.00KiB 0(     1)
FREE_SPACE_TREE   336.00KiB 0(    20) 1(     1)
DATA_RELOC_TREE    16.00KiB 0(     1)

> 
> Thanks,
> Qu
> 
>>
>>
>>>
>>> Note that I'm not sensitive to multi-second mount delays.  I am
>>> sensitive to multi-minute mount delays, hence why I'm bringing this up.
>>>
>>> FWIW: I am currently populating a machine we have with 6TB drives in it
>>> with real-world home dir data to see if I can replicate the mount issue.
>>>
>>> Thanks,
>>>
>>> ellis
> 


-- 
Hans van Kranenburg


* Re: Status of FST and mount times
  2018-02-14 23:24 ` Duncan
@ 2018-02-15 15:42   ` Ellis H. Wilson III
  2018-02-15 16:51     ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 32+ messages in thread
From: Ellis H. Wilson III @ 2018-02-15 15:42 UTC (permalink / raw)
  To: Duncan, linux-btrfs

On 02/14/2018 06:24 PM, Duncan wrote:
>> Frame-of-reference here: RAID0.  Around 70TB raw capacity.  No
>> compression.  No quotas enabled.  Many (potentially tens to hundreds) of
>> subvolumes, each with tens of snapshots.  No control over size or number
>> of files, but directory tree (entries per dir and general tree depth)
>> can be controlled in case that's helpful.
> 
> ??  How can you control both breadth (entries per dir) AND depth of
> directory tree without ultimately limiting your number of files?

I technically misspoke when I said "No control over size or number of 
files."  There is an upper-limit to the metadata (not BTRFS, for our 
filesystem) we can store on an accompanying SSD, which limits the number 
of files that ultimately can live on our BTRFS RAID0'd HDDs.  The 
current design is tuned to perform well up to that maximum, but it's a 
relatively shallow tree, so if there were known performance issues with 
more than N files per directory or beyond a specific depth of 
directories I was calling out that I can change the algorithm now.

> Anyway, AFAIK the only performance issue there would be the (IIRC) ~65535
> limit on directory hard links before additional ones are out-of-lined
> into a secondary node, with the entailing performance implications.

Here I interpret "directory hard links" to mean hard links within a 
single directory -- not real directory hard links as in Macs.  It's moot 
anyhow, as we support hard links at a much higher level in our parallel 
file system and no hard-links will exist whatsoever from BTRFS's 
perspective.

> So far, so good.  But then above you mention concern about btrfs-progs
> treating the free-space-tree (free-space-cache-v2) as read-only, and the
> time cost of having to clear and rebuild it after a btrfs check --repair.
> 
> Which is what triggered the mismatch warning I mentioned above.  Either
> that raid0 data is of throw-away value appropriate to placement on a
> raid0, and btrfs check --repair is of little concern as the benefits are
> questionable (no guarantees it'll work and the data is either directly
> throw-away value anyway, or there's a backup at hand that /does/ have a
> tested guarantee of viability, or it's not worthy of being called a
> backup in the first place), or it's not.

I think you may be looking at this a touch too black and white, but 
that's probably because I've not been clear about my use-case.  We do 
have mechanisms at a higher level in our parallel file system to do 
scale-out object-based RAID, so in a way the data is "throw-away" in 
that we can lose it without true data loss.  However, one should not 
underestimate the foreground impact of a reconstruction of 60-80TB of 
data, even with architectures like ours that scale reconstruction well. 
When I lose an HDD I fully expect we will need to rebuild that entire 
BTRFS filesystem, and we can.  But I'd like to limit it to real media 
failure.  In other words, if I can't mount my BTRFS filesystem after 
power-fail, and I can't run btrfs check --repair, then in essence I've 
lost a lot of data I need to rebuild for no "good" reason.

Perhaps more critically, when an entire cluster of these systems 
power-fail, if more than N of these running BTRFS come up and require 
check --repair prior to mount due to some commonly triggered BTRFS bug 
(not saying there is one, I'm just conservative), I'm completely hosed. 
Restoring PB's of data from backup is a non-starter.

In short, I've been playing coy about the details of my project and need 
to continue to do so for at least the next 4-6 months, but if you read 
anything about the company I'm emailing from, you can probably make 
reasonable guesses about what I'm trying to do.

> It's also worth mentioning that btrfs raid0 mode, as well as single mode,
> hobbles the btrfs data and metadata integrity feature: while checksums
> are still generated, stored and checked by default, and integrity
> problems can still be detected, raid0 (and single) includes no
> redundancy, so there's no second copy (raid1/10) or parity redundancy
> (raid5/6) to rebuild the bad data from; the bad data is simply gone.

I'm ok with that.  We have a concept called "on-demand reconstruction" 
which permits us to rebuild individual objects in our filesystem 
on-demand (one component of which will be a failed file on one of the 
BTRFS filesystems).  So long as I can identify that a file has been 
corrupted I'm fine.

> 12-14 TB individual drives?
> 
> While you /did/ say enterprise grade so this probably doesn't apply to
> you, it might apply to others that will read this.
> 
> Be careful that you're not trying to use the "archive application"
> targeted SMR drives for general purpose use.

We're using traditional PMR drives for now.  That's available at 12/14TB 
capacity points presently.  I agree with your general sense that SMR 
drives are unlikely to play particularly well with BTRFS for all but the 
truly archival use-case.

Best,

ellis


* Re: Status of FST and mount times
  2018-02-15 11:12     ` Hans van Kranenburg
@ 2018-02-15 16:30       ` Ellis H. Wilson III
  2018-02-16  1:55         ` Qu Wenruo
  0 siblings, 1 reply; 32+ messages in thread
From: Ellis H. Wilson III @ 2018-02-15 16:30 UTC (permalink / raw)
  To: Hans van Kranenburg, Qu Wenruo, Nikolay Borisov, linux-btrfs

On 02/15/2018 06:12 AM, Hans van Kranenburg wrote:
> On 02/15/2018 02:42 AM, Qu Wenruo wrote:
>> Just as said by Nikolay, the biggest problem of slow mount is the size
>> of extent tree (and HDD seek time)
>>
>> The easiest way to get a basic idea of how large your extent tree is
>> using debug tree:
>>
>> # btrfs-debug-tree -r -t extent <device>
>>
>> You would get something like:
>> btrfs-progs v4.15
>> extent tree key (EXTENT_TREE ROOT_ITEM 0) 30539776 level 0  <<<
>> total bytes 10737418240
>> bytes used 393216
>> uuid 651fcf0c-0ffd-4351-9721-84b1615f02e0
>>
>> That level would give you some basic idea of the size of your extent
>> tree.
>>
>> For level 0, it can contain about 400 items on average.
>> For level 1, it can contain up to 197K items.
>> ...
>> For level n, it can contain up to 400 * 493 ^ (n - 1) items.
>> ( n <= 7 )
> 
> Another one to get that data:
> 
> https://github.com/knorrie/python-btrfs/blob/master/examples/show_metadata_tree_sizes.py
> 
> Example, with amount of leaves on level 0 and nodes higher up:
> 
> -# ./show_metadata_tree_sizes.py /
> ROOT_TREE         336.00KiB 0(    20) 1(     1)
> EXTENT_TREE       123.52MiB 0(  7876) 1(    28) 2(     1)
> CHUNK_TREE        112.00KiB 0(     6) 1(     1)
> DEV_TREE           80.00KiB 0(     4) 1(     1)
> FS_TREE          1016.34MiB 0( 64113) 1(   881) 2(    52)
> CSUM_TREE         777.42MiB 0( 49571) 1(   183) 2(     1)
> QUOTA_TREE            0.00B
> UUID_TREE          16.00KiB 0(     1)
> FREE_SPACE_TREE   336.00KiB 0(    20) 1(     1)
> DATA_RELOC_TREE    16.00KiB 0(     1)

Very helpful information.  Thank you Qu and Hans!

I have about 1.7TB of newly rsync'd homedir data on a single 
enterprise 7200rpm HDD and the following output from btrfs-debug-tree:

extent tree key (EXTENT_TREE ROOT_ITEM 0) 543384862720 level 2
total bytes 6001175126016
bytes used 1832557875200

Hans' (very cool) tool reports:
ROOT_TREE         624.00KiB 0(    38) 1(     1)
EXTENT_TREE       327.31MiB 0( 20881) 1(    66) 2(     1)
CHUNK_TREE        208.00KiB 0(    12) 1(     1)
DEV_TREE          144.00KiB 0(     8) 1(     1)
FS_TREE             5.75GiB 0(375589) 1(   952) 2(     2) 3(     1)
CSUM_TREE           1.75GiB 0(114274) 1(   385) 2(     1)
QUOTA_TREE            0.00B
UUID_TREE          16.00KiB 0(     1)
FREE_SPACE_TREE       0.00B
DATA_RELOC_TREE    16.00KiB 0(     1)

Mean mount times across 5 tests: 4.319s (stddev=0.079s)
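
(Rough shape of the measurement, for anyone who wants to compare numbers;
a sketch only, with device and mountpoint as placeholders:)

for i in 1 2 3 4 5; do
    echo 3 > /proc/sys/vm/drop_caches   # drop caches so each mount starts cold
    time mount /dev/sdX /mnt/test       # the time of interest
    umount /mnt/test
done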

Taking 100 snapshots (no changes between snapshots however) of the above 
subvolume doesn't appear to impact mount/umount time.  Snapshot creation 
and deletion both operate at between 0.25s to 0.5s.  I am very impressed 
with snapshot deletion in particular now that qgroups is disabled.

I will do more mount testing with twice and three times that dataset and 
see how mount times scale.

All done on 4.5.5.  I really need to move to a newer kernel.

Best,

ellis


* Re: Status of FST and mount times
  2018-02-15  6:14 ` Chris Murphy
@ 2018-02-15 16:45   ` Ellis H. Wilson III
  0 siblings, 0 replies; 32+ messages in thread
From: Ellis H. Wilson III @ 2018-02-15 16:45 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On 02/15/2018 01:14 AM, Chris Murphy wrote:
> On Wed, Feb 14, 2018 at 9:00 AM, Ellis H. Wilson III <ellisw@panasas.com> wrote:
> 
>> Frame-of-reference here: RAID0.  Around 70TB raw capacity.  No compression.
>> No quotas enabled.  Many (potentially tens to hundreds) of subvolumes, each
>> with tens of snapshots.
> 
> Even if non-catastrophic to lose such a file system, it's big enough
> to be tedious and take time to set it up again. I think it's worth
> considering one of two things as alternatives:
> 
> a. metadata raid1, data single: you lose the striping performance of
> raid0, and if it's not randomly filled you'll end up with some disk
> contention for reads and writes *but* if you lose a drive you will not
> lose the file system. Any missing files on the dead drive will result
> in EIO (and I think also a kernel message with path to file), and so
> you could just run a script to delete those files and replace them
> with backup copies.

This option is on our roadmap for future releases of our parallel file 
system, but unfortunately we do not presently have the time to implement 
the functionality to report from the manager of that btrfs filesystem to 
the pfs manager that said files have gone missing.  We will absolutely 
be revisiting that as an option in early 2019, as replacing just one 
disk instead of N is highly attractive.  Waiting for EIO as you suggest 
in b is a non-starter for us, as we're working at scales sufficiently 
large that we don't want to wait for someone to stumble over a partially 
degraded file.  Pro-active reporting is what's needed, and we'll 
implement that Real Soon Now.

> b. Variation on the above would be to put it behind glusterfs
> replicated volume. Gluster getting EIO from a brick should cause it to
> get a copy from another brick and then fix up the bad one
> automatically. Or in your raid0 case, the whole volume is lost, and
> glusterfs helps do the full rebuild over 3-7 days while you're still
> able to access those 70TB of data normally. Of course, this option
> requires having two 70TB storage bricks available.

See my email address, which may help understand why GlusterFS is a 
non-starter.  Nevertheless, the idea is a fine one and we'll have 
something similar going on, but at higher raid levels and across 
typically a dozen or more of such bricks.

Best,

ellis


* Re: Status of FST and mount times
  2018-02-15 15:42   ` Ellis H. Wilson III
@ 2018-02-15 16:51     ` Austin S. Hemmelgarn
  2018-02-15 16:58       ` Ellis H. Wilson III
  0 siblings, 1 reply; 32+ messages in thread
From: Austin S. Hemmelgarn @ 2018-02-15 16:51 UTC (permalink / raw)
  To: Ellis H. Wilson III, linux-btrfs

On 2018-02-15 10:42, Ellis H. Wilson III wrote:
> On 02/14/2018 06:24 PM, Duncan wrote:
>>> Frame-of-reference here: RAID0.  Around 70TB raw capacity.  No
>>> compression.  No quotas enabled.  Many (potentially tens to hundreds) of
>>> subvolumes, each with tens of snapshots.  No control over size or number
>>> of files, but directory tree (entries per dir and general tree depth)
>>> can be controlled in case that's helpful.
>>
>> ??  How can you control both breadth (entries per dir) AND depth of
>> directory tree without ultimately limiting your number of files?
> 
> I technically misspoke when I said "No control over size or number of 
> files."  There is an upper-limit to the metadata (not BTRFS, for our 
> filesystem) we can store on an accompanying SSD, which limits the number 
> of files that ultimately can live on our BTRFS RAID0'd HDDs.  The 
> current design is tuned to perform well up to that maximum, but it's a 
> relatively shallow tree, so if there were known performance issues with 
> more than N files per directory or beyond a specific depth of 
> directories I was calling out that I can change the algorithm now.
There are scaling performance issues with directory listings on BTRFS 
for directories with more than a few thousand files, but they're not 
well documented (most people don't hit them because most applications 
are designed around the expectation that directory listings will be slow 
in big directories), and I would not expect them to be much of an issue 
unless you're dealing with tens of thousands of files and particularly 
slow storage.
> 
>> Anyway, AFAIK the only performance issue there would be the (IIRC) ~65535
>> limit on directory hard links before additional ones are out-of-lined
>> into a secondary node, with the entailing performance implications.
> 
> Here I interpret "directory hard links" to mean hard links within a 
> single directory -- not real directory hard links as in Macs.  It's moot 
> anyhow, as we support hard links at a much higher level in our parallel 
> file system and no hard-links will exist whatsoever from BTRFS's 
> perspective.
> 
>> So far, so good.  But then above you mention concern about btrfs-progs
>> treating the free-space-tree (free-space-cache-v2) as read-only, and the
>> time cost of having to clear and rebuild it after a btrfs check --repair.
>>
>> Which is what triggered the mismatch warning I mentioned above.  Either
>> that raid0 data is of throw-away value appropriate to placement on a
>> raid0, and btrfs check --repair is of little concern as the benefits are
>> questionable (no guarantees it'll work and the data is either directly
>> throw-away value anyway, or there's a backup at hand that /does/ have a
>> tested guarantee of viability, or it's not worthy of being called a
>> backup in the first place), or it's not.
> 
> I think you may be looking at this a touch too black and white, but 
> that's probably because I've not been clear about my use-case.  We do 
> have mechanisms at a higher level in our parallel file system to do 
> scale-out object-based RAID, so in a way the data is "throw-away" in 
> that we can lose it without true data loss.  However, one should not 
> underestimate the foreground impact of a reconstruction of 60-80TB of 
> data, even with architectures like ours that scale reconstruction well. 
> When I lose an HDD I fully expect we will need to rebuild that entire 
> BTRFS filesystem, and we can.  But I'd like to limit it to real media 
> failure.  In other words, if I can't mount my BTRFS filesystem after 
> power-fail, and I can't run btrfs check --repair, then in essence I've 
> lost a lot of data I need to rebuild for no "good" reason.
> 
> Perhaps more critically, when an entire cluster of these systems 
> power-fail, if more than N of these running BTRFS come up and require 
> check --repair prior to mount due to some commonly triggered BTRFS bug 
> (not saying there is one, I'm just conservative), I'm completely hosed. 
> Restoring PB's of data from backup is a non-starter.
Whether or not this is likely to be an issue is just as much dependent 
on the storage hardware as how BTRFS handles it.  In my own experience, 
I've only ever lost a BTRFS volume to a power failure _once_ in the 
multiple years I've been using it, and that ended up being because the 
power failure trashed the storage device pretty severely (it was 
super-cheap flash storage).  I do know however that there are people who 
have had much worse results than me.
> 
> In short, I've been playing coy about the details of my project and need 
> to continue to do so for at least the next 4-6 months, but if you read 
> anything about the company I'm emailing from, you can probably make 
> reasonable guesses about what I'm trying to do.
> 
>> It's also worth mentioning that btrfs raid0 mode, as well as single mode,
>> hobbles the btrfs data and metadata integrity feature: while checksums
>> are still generated, stored and checked by default, and integrity
>> problems can still be detected, raid0 (and single) includes no
>> redundancy, so there's no second copy (raid1/10) or parity redundancy
>> (raid5/6) to rebuild the bad data from; the bad data is simply gone.
> 
> I'm ok with that.  We have a concept called "on-demand reconstruction" 
> which permits us to rebuild individual objects in our filesystem 
> on-demand (one component of which will be a failed file on one of the 
> BTRFS filesystems).  So long as I can identify that a file has been 
> corrupted I'm fine.
Somewhat ironically, while BTRFS isn't yet great at fixing things when 
they go wrong, it's pretty good at letting you know something as gone 
wrong.  Unfortunately, it tends to be far more aggressive in doing so 
than it sounds like you need it to be.
> 
>> 12-14 TB individual drives?
>>
>> While you /did/ say enterprise grade so this probably doesn't apply to
>> you, it might apply to others that will read this.
>>
>> Be careful that you're not trying to use the "archive application"
>> targeted SMR drives for general purpose use.
> 
> We're using traditional PMR drives for now.  That's available at 12/14TB 
> capacity points presently.  I agree with your general sense that SMR 
> drives are unlikely to play particularly well with BTRFS for all but the 
> truly archival use-case.
It's not exactly a 'general sense' or a hunch, issues with BTRFS on SMR 
drives have been pretty well demonstrated in practice, hence Duncan 
making this statement despite the fact that it most likely did not apply 
to you.


* Re: Status of FST and mount times
  2018-02-15 16:51     ` Austin S. Hemmelgarn
@ 2018-02-15 16:58       ` Ellis H. Wilson III
  2018-02-15 17:57         ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 32+ messages in thread
From: Ellis H. Wilson III @ 2018-02-15 16:58 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, linux-btrfs

On 02/15/2018 11:51 AM, Austin S. Hemmelgarn wrote:
> There are scaling performance issues with directory listings on BTRFS 
> for directories with more than a few thousand files, but they're not 
> well documented (most people don't hit them because most applications 
> are designed around the expectation that directory listings will be slow 
> in big directories), and I would not expect them to be much of an issue 
> unless you're dealing with tens of thousands of files and particularly 
> slow storage.

Understood -- thanks.  The plan is to keep it to around 1k entries per 
directory.  We've done some fairly concrete testing here to find the 
fall-off point for dirent caching in BTRFS, and the sweet-spot between 
having a large number of small directories cached vs. a few massive 
directories cached.  ~1k seems most palatable for our use-case and 
directory tree structure.

> I've only ever lost a BTRFS volume to a power failure _once_ in the 
> multiple years I've been using it, and that ended up being because the 
> power failure trashed the storage device pretty severely (it was 
> super-cheap flash storage).  I do know however that there are people who 
> have had much worse results than me.

Good to know.  We'll be running power-fail testing over the next couple 
months.  I'm waiting for some hardware to arrive presently.  We'll 
power-cycle fairly large filesystems a few thousand times before we deem 
it safe to ship.  If there are latent bugs in BTRFS still w.r.t. 
power-fail, I can guarantee we'll trip over them...
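
A crude software-only approximation of that kind of test, for anyone who
wants to try it before real hardware is available -- a sketch only; the
device and file names are placeholders and sysrq must be enabled:

$ ( while true; do sudo dd if=/dev/zero of=/mnt/btrfs/victim bs=1M count=64 conv=fsync 2>/dev/null; done ) &
$ sleep 30; echo b | sudo tee /proc/sysrq-trigger      # immediate reboot, no sync
# after the box comes back up:
$ sudo mount /dev/sdb /mnt/btrfs && sudo btrfs scrub start -B /mnt/btrfs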

> It's not exactly a 'general sense' or a hunch; issues with BTRFS on SMR 
> drives have been pretty well demonstrated in practice, hence Duncan 
> making this statement despite the fact that it most likely did not apply 
> to you.

Ah, ok, thanks for clarifying.  I appreciate the forewarning regardless.

Best,

ellis

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of FST and mount times
  2018-02-15 16:58       ` Ellis H. Wilson III
@ 2018-02-15 17:57         ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 32+ messages in thread
From: Austin S. Hemmelgarn @ 2018-02-15 17:57 UTC (permalink / raw)
  To: Ellis H. Wilson III, linux-btrfs

On 2018-02-15 11:58, Ellis H. Wilson III wrote:
> On 02/15/2018 11:51 AM, Austin S. Hemmelgarn wrote:
>> There are scaling performance issues with directory listings on BTRFS 
>> for directories with more than a few thousand files, but they're not 
>> well documented (most people don't hit them because most applications 
>> are designed around the expectation that directory listings will be 
>> slow in big directories), and I would not expect them to be much of an 
>> issue unless you're dealing with tens of thousands of files and 
>> particularly slow storage.
> 
> Understood -- thanks.  The plan is to keep it to around 1k entries per 
> directory.  We've done some fairly concrete testing here to find the 
> fall-off point for dirent caching in BTRFS, and the sweet-spot between 
> having a large number of small directories cached vs. a few massive 
> directories cached.  ~1k seems most palatable for our use-case and 
> directory tree structure.
Yeah, in my own experience this starts to get noticeable on slower 
storage around about 4k or more entries in a directory, but it ends up 
depending on the hardware to a certain extent and the rest of the system 
as well (something Samba does seems to make it significantly worse than 
listing locally for example, while NFS seems to only be worse because 
of network latency).
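
A quick way to find that knee on a particular box -- a rough sketch, with
the entry counts and paths as examples only:

$ for n in 1000 4000 16000; do mkdir -p /mnt/btrfs/dirtest/d$n; ( cd /mnt/btrfs/dirtest/d$n && seq 1 $n | xargs touch ); done
$ sync; echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
$ for n in 1000 4000 16000; do /usr/bin/time -f "$n entries: %e s" ls -f /mnt/btrfs/dirtest/d$n > /dev/null; done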
> 
>> I've only ever lost a BTRFS volume to a power failure _once_ in the 
>> multiple years I've been using it, and that ended up being because the 
>> power failure trashed the storage device pretty severely (it was 
>> super-cheap flash storage).  I do know however that there are people 
>> who have had much worse results than me.
> 
> Good to know.  We'll be running power-fail testing over the next couple 
> months.  I'm waiting for some hardware to arrive presently.  We'll 
> power-cycle fairly large filesystems a few thousand times before we deem 
> it safe to ship.  If there are latent bugs in BTRFS still w.r.t. 
> power-fail, I can guarantee we'll trip over them...
Most of my own experience regarding power failures with BTRFS is on 
SSD's.  We actually use it on the embedded systems we build where I 
work, and a lot of our customers don't have the most reliable mains 
power (or they're too lazy to shut off the computer properly before 
flipping the main breaker for the machine to power it off for the 
evening), so some of our systems may see power failures on an almost 
daily basis.  Despite that, we've never had issues with BTRFS not 
recovering by itself, though we do have a very read-heavy workload with 
very infrequent writes, so that may be part of why it's worked so well 
for us.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of FST and mount times
  2018-02-15 16:30       ` Ellis H. Wilson III
@ 2018-02-16  1:55         ` Qu Wenruo
  2018-02-16 14:12           ` Ellis H. Wilson III
  0 siblings, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2018-02-16  1:55 UTC (permalink / raw)
  To: Ellis H. Wilson III, Hans van Kranenburg, Nikolay Borisov, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 4039 bytes --]



On 2018年02月16日 00:30, Ellis H. Wilson III wrote:
> On 02/15/2018 06:12 AM, Hans van Kranenburg wrote:
>> On 02/15/2018 02:42 AM, Qu Wenruo wrote:
>>> Just as said by Nikolay, the biggest problem of slow mount is the size
>>> of extent tree (and HDD seek time)
>>>
>>> The easiest way to get a basic idea of how large your extent tree is
>>> using debug tree:
>>>
>>> # btrfs-debug-tree -r -t extent <device>
>>>
>>> You would get something like:
>>> btrfs-progs v4.15
>>> extent tree key (EXTENT_TREE ROOT_ITEM 0) 30539776 level 0  <<<
>>> total bytes 10737418240
>>> bytes used 393216
>>> uuid 651fcf0c-0ffd-4351-9721-84b1615f02e0
>>>
>>> That level would give you some basic idea of the size of your extent
>>> tree.
>>>
>>> For level 0, it could contain about 400 items on average.
>>> For level 1, it could contain up to 197K items.
>>> ...
>>> For level n, it could contain up to 400 * 493 ^ (n - 1) items.
>>> ( n <= 7 )
>>
>> Another one to get that data:
>>
>> https://github.com/knorrie/python-btrfs/blob/master/examples/show_metadata_tree_sizes.py
>>
>>
>> Example, with amount of leaves on level 0 and nodes higher up:
>>
>> -# ./show_metadata_tree_sizes.py /
>> ROOT_TREE         336.00KiB 0(    20) 1(     1)
>> EXTENT_TREE       123.52MiB 0(  7876) 1(    28) 2(     1)
>> CHUNK_TREE        112.00KiB 0(     6) 1(     1)
>> DEV_TREE           80.00KiB 0(     4) 1(     1)
>> FS_TREE          1016.34MiB 0( 64113) 1(   881) 2(    52)
>> CSUM_TREE         777.42MiB 0( 49571) 1(   183) 2(     1)
>> QUOTA_TREE            0.00B
>> UUID_TREE          16.00KiB 0(     1)
>> FREE_SPACE_TREE   336.00KiB 0(    20) 1(     1)
>> DATA_RELOC_TREE    16.00KiB 0(     1)
> 
> Very helpful information.  Thank you Qu and Hans!
> 
> I have about 1.7TB of newly rsync'd homedir data on a single
> enterprise 7200rpm HDD and the following output for btrfs-debug:
> 
> extent tree key (EXTENT_TREE ROOT_ITEM 0) 543384862720 level 2
> total bytes 6001175126016
> bytes used 1832557875200
> 
> Hans' (very cool) tool reports:
> ROOT_TREE         624.00KiB 0(    38) 1(     1)
> EXTENT_TREE       327.31MiB 0( 20881) 1(    66) 2(     1)

Extent tree is not so large, a little unexpected to see such slow mount.

BTW, how many chunks do you have?

It could be checked by:

# btrfs-debug-tree -t chunk <device> | grep CHUNK_ITEM | wc -l

Unless we have tons of chunks, it shouldn't be too slow.

> CHUNK_TREE        208.00KiB 0(    12) 1(     1)
> DEV_TREE          144.00KiB 0(     8) 1(     1)
> FS_TREE             5.75GiB 0(375589) 1(   952) 2(     2) 3(     1)
> CSUM_TREE           1.75GiB 0(114274) 1(   385) 2(     1)
> QUOTA_TREE            0.00B
> UUID_TREE          16.00KiB 0(     1)
> FREE_SPACE_TREE       0.00B
> DATA_RELOC_TREE    16.00KiB 0(     1)
> 
> Mean mount times across 5 tests: 4.319s (stddev=0.079s)
> 
> Taking 100 snapshots (no changes between snapshots however) of the above
> subvolume doesn't appear to impact mount/umount time.

100 unmodified snapshots won't affect mount time.

It needs new extents, which can be created by overwriting extents in
snapshots.
So it won't really cause much difference if all these snapshots are all
unmodified.
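
To see that effect directly, a small sketch -- the subvolume and file
names here are hypothetical:

$ sudo btrfs subvolume snapshot /mnt/btrfs/home /mnt/btrfs/snap1
$ sudo filefrag -v /mnt/btrfs/home/bigfile /mnt/btrfs/snap1/bigfile    # both still point at the same extents
$ sudo dd if=/dev/urandom of=/mnt/btrfs/snap1/bigfile bs=1M count=100 conv=notrunc,fsync
$ sudo filefrag -v /mnt/btrfs/snap1/bigfile                            # the overwritten range now uses newly allocated extents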

> Snapshot creation
> and deletion both operate at between 0.25s to 0.5s.

IIRC snapshot deletion is delayed, so the real work doesn't happen when
"btrfs sub del" returns.

Thanks,
Qu

>  I am very impressed
> with snapshot deletion in particular now that qgroups is disabled.
> 
> I will do more mount testing with twice and three times that dataset and
> see how mount times scale.
> 
> All done on 4.5.5.  I really need to move to a newer kernel.
> 
> Best,
> 
> ellis


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 520 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of FST and mount times
  2018-02-16  1:55         ` Qu Wenruo
@ 2018-02-16 14:12           ` Ellis H. Wilson III
  2018-02-16 14:20             ` Hans van Kranenburg
  2018-02-17  0:59             ` Qu Wenruo
  0 siblings, 2 replies; 32+ messages in thread
From: Ellis H. Wilson III @ 2018-02-16 14:12 UTC (permalink / raw)
  To: Qu Wenruo, Hans van Kranenburg, Nikolay Borisov, linux-btrfs

On 02/15/2018 08:55 PM, Qu Wenruo wrote:
> On 2018年02月16日 00:30, Ellis H. Wilson III wrote:
>> Very helpful information.  Thank you Qu and Hans!
>>
>> I have about 1.7TB of newly rsync'd homedir data on a single
>> enterprise 7200rpm HDD and the following output for btrfs-debug:
>>
>> extent tree key (EXTENT_TREE ROOT_ITEM 0) 543384862720 level 2
>> total bytes 6001175126016
>> bytes used 1832557875200
>>
>> Hans' (very cool) tool reports:
>> ROOT_TREE         624.00KiB 0(    38) 1(     1)
>> EXTENT_TREE       327.31MiB 0( 20881) 1(    66) 2(     1)
> 
> Extent tree is not so large, a little unexpected to see such slow mount.
> 
> BTW, how many chunks do you have?
> 
> It could be checked by:
> 
> # btrfs-debug-tree -t chunk <device> | grep CHUNK_ITEM | wc -l

Since yesterday I've doubled the size by copying the homedir dataset in 
again.  Here are new stats:

extent tree key (EXTENT_TREE ROOT_ITEM 0) 385990656 level 2
total bytes 6001175126016
bytes used 3663525969920

$ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
3454

$ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
ROOT_TREE           1.14MiB 0(    72) 1(     1)
EXTENT_TREE       644.27MiB 0( 41101) 1(   131) 2(     1)
CHUNK_TREE        384.00KiB 0(    23) 1(     1)
DEV_TREE          272.00KiB 0(    16) 1(     1)
FS_TREE            11.55GiB 0(754442) 1(  2179) 2(     5) 3(     2)
CSUM_TREE           3.50GiB 0(228593) 1(   791) 2(     2) 3(     1)
QUOTA_TREE            0.00B
UUID_TREE          16.00KiB 0(     1)
FREE_SPACE_TREE       0.00B
DATA_RELOC_TREE    16.00KiB 0(     1)

The old mean mount time was 4.319s.  It now takes 11.537s for the 
doubled dataset.  Again please realize this is on an old version of 
BTRFS (4.5.5), so perhaps newer ones will perform better, but I'd still 
like to understand this delay more.  Should I expect this to scale in 
this way all the way up to my proposed 60-80TB filesystem so long as the 
file size distribution stays roughly similar?  That would definitely be 
in terms of multiple minutes at that point.

>> Taking 100 snapshots (no changes between snapshots however) of the above
>> subvolume doesn't appear to impact mount/umount time.
> 
> 100 unmodified snapshots won't affect mount time.
> 
> It needs new extents, which can be created by overwriting extents in
> snapshots.
> So it won't really cause much difference if all these snapshots are all
> unmodified.

Good to know, thanks!

>> Snapshot creation
>> and deletion both operate at between 0.25s to 0.5s.
> 
> IIRC snapshot deletion is delayed, so the real work doesn't happen when
> "btrfs sub del" returns.

I was using btrfs sub del -C for the deletions, so I believe (if that 
command truly waits for the subvolume to be utterly gone) it captures 
the entirety of the snapshot.

Best,

ellis

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of FST and mount times
  2018-02-16 14:12           ` Ellis H. Wilson III
@ 2018-02-16 14:20             ` Hans van Kranenburg
  2018-02-16 14:42               ` Ellis H. Wilson III
  2018-02-17  0:59             ` Qu Wenruo
  1 sibling, 1 reply; 32+ messages in thread
From: Hans van Kranenburg @ 2018-02-16 14:20 UTC (permalink / raw)
  To: Ellis H. Wilson III, Qu Wenruo, Nikolay Borisov, linux-btrfs

On 02/16/2018 03:12 PM, Ellis H. Wilson III wrote:
> On 02/15/2018 08:55 PM, Qu Wenruo wrote:
>> On 2018年02月16日 00:30, Ellis H. Wilson III wrote:
>>> Very helpful information.  Thank you Qu and Hans!
>>>
>>> I have about 1.7TB of newly rsync'd homedir data on a single
>>> enterprise 7200rpm HDD and the following output for btrfs-debug:
>>>
>>> extent tree key (EXTENT_TREE ROOT_ITEM 0) 543384862720 level 2
>>> total bytes 6001175126016
>>> bytes used 1832557875200
>>>
>>> Hans' (very cool) tool reports:
>>> ROOT_TREE         624.00KiB 0(    38) 1(     1)
>>> EXTENT_TREE       327.31MiB 0( 20881) 1(    66) 2(     1)
>>
>> Extent tree is not so large, a little unexpected to see such slow mount.
>>
>> BTW, how many chunks do you have?
>>
>> It could be checked by:
>>
>> # btrfs-debug-tree -t chunk <device> | grep CHUNK_ITEM | wc -l
> 
> Since yesterday I've doubled the size by copying the homedir dataset in
> again.  Here are new stats:
> 
> extent tree key (EXTENT_TREE ROOT_ITEM 0) 385990656 level 2
> total bytes 6001175126016
> bytes used 3663525969920
> 
> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
> 3454
> 
> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
> ROOT_TREE           1.14MiB 0(    72) 1(     1)
> EXTENT_TREE       644.27MiB 0( 41101) 1(   131) 2(     1)
> CHUNK_TREE        384.00KiB 0(    23) 1(     1)
> DEV_TREE          272.00KiB 0(    16) 1(     1)
> FS_TREE            11.55GiB 0(754442) 1(  2179) 2(     5) 3(     2)
> CSUM_TREE           3.50GiB 0(228593) 1(   791) 2(     2) 3(     1)
> QUOTA_TREE            0.00B
> UUID_TREE          16.00KiB 0(     1)
> FREE_SPACE_TREE       0.00B
> DATA_RELOC_TREE    16.00KiB 0(     1)
> 
> The old mean mount time was 4.319s.  It now takes 11.537s for the
> doubled dataset.  Again please realize this is on an old version of
> BTRFS (4.5.5), so perhaps newer ones will perform better, but I'd still
> like to understand this delay more.  Should I expect this to scale in
> this way all the way up to my proposed 60-80TB filesystem so long as the
> file size distribution stays roughly similar?  That would definitely be
> in terms of multiple minutes at that point.

Well, imagine you have a big tree (an actual real life tree outside) and
you need to pick things (e.g. apples) which are hanging everywhere.

So, what you need to do is climb the tree, climb on a branch all the way
to the end where the first apple is... climb back, climb up a bit, go
onto the next branch to the end for the next apple... etc etc....

The bigger the tree is, the longer it keeps you busy, because the apples
will be semi-evenly distributed around the full tree, and they're always
hanging at the end of the branch. The speed with which you can climb
around (random read disk access IO speed for btrfs, because your disk
cache is empty when first mounting) determines how quickly you're done.

So, yes.

>>> Taking 100 snapshots (no changes between snapshots however) of the above
>>> subvolume doesn't appear to impact mount/umount time.
>>
>> 100 unmodified snapshots won't affect mount time.
>>
>> It needs new extents, which can be created by overwriting extents in
>> snapshots.
>> So it won't really cause much difference if all these snapshots are all
>> unmodified.
> 
> Good to know, thanks!
> 
>>> Snapshot creation
>>> and deletion both operate at between 0.25s to 0.5s.
>>
>> IIRC snapshot deletion is delayed, so the real work doesn't happen when
>> "btrfs sub del" returns.
> 
> I was using btrfs sub del -C for the deletions, so I believe (if that
> command truly waits for the subvolume to be utterly gone) it captures
> the entirety of the snapshot.
> 
> Best,
> 
> ellis


-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of FST and mount times
  2018-02-16 14:20             ` Hans van Kranenburg
@ 2018-02-16 14:42               ` Ellis H. Wilson III
  2018-02-16 14:55                 ` Ellis H. Wilson III
  0 siblings, 1 reply; 32+ messages in thread
From: Ellis H. Wilson III @ 2018-02-16 14:42 UTC (permalink / raw)
  To: Hans van Kranenburg, Qu Wenruo, Nikolay Borisov, linux-btrfs

On 02/16/2018 09:20 AM, Hans van Kranenburg wrote:
> Well, imagine you have a big tree (an actual real life tree outside) and
> you need to pick things (e.g. apples) which are hanging everywhere.
> 
> So, what you need to do is climb the tree, climb on a branch all the way
> to the end where the first apple is... climb back, climb up a bit, go
> onto the next branch to the end for the next apple... etc etc....
> 
> The bigger the tree is, the longer it keeps you busy, because the apples
> will be semi-evenly distributed around the full tree, and they're always
> hanging at the end of the branch. The speed with which you can climb
> around (random read disk access IO speed for btrfs, because your disk
> cache is empty when first mounting) determines how quickly you're done.
> 
> So, yes.

Thanks Hans.  I will say multiple minutes (by the looks of things, I'll 
end up near to an hour for 60TB if this non-linear scaling continues) to 
mount a filesystem is undesirable, but I won't offer that criticism 
without thinking constructively for a moment:

Help me out by referencing the tree in question if you don't mind, so I 
can better understand the point of picking all these "apples" (I would 
guess for capacity reporting via df, but maybe there's more).

Typical disclaimer that I haven't yet grokked the various inner-workings 
of BTRFS, so this is quite possibly a terrible or unapproachable idea:

On umount, you must already have whatever metadata you were doing the 
tree walk on mount for in-memory (otherwise you would have been able to 
lazily do the treewalk after a quick mount).  Therefore, could we not 
stash this metadata at or associated with, say, the root of the 
subvolumes?  This way you can always determine on mount quickly if the 
cache is still valid (i.e., no situation like: remount with old btrfs, 
change stuff, umount with old btrfs, remount with new btrfs, pain).  I 
would guess generation would be sufficient to determine if the cached 
metadata is valid for the given root block.

This would scale with number of subvolumes (but not snapshots), and 
would be reasonably quick I think.

Thoughts?

ellis

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of FST and mount times
  2018-02-16 14:42               ` Ellis H. Wilson III
@ 2018-02-16 14:55                 ` Ellis H. Wilson III
  0 siblings, 0 replies; 32+ messages in thread
From: Ellis H. Wilson III @ 2018-02-16 14:55 UTC (permalink / raw)
  To: Hans van Kranenburg, Qu Wenruo, Nikolay Borisov, linux-btrfs

On 02/16/2018 09:42 AM, Ellis H. Wilson III wrote:
> On 02/16/2018 09:20 AM, Hans van Kranenburg wrote:
>> Well, imagine you have a big tree (an actual real life tree outside) and
>> you need to pick things (e.g. apples) which are hanging everywhere.
>>
>> So, what you need to do is climb the tree, climb on a branch all the way
>> to the end where the first apple is... climb back, climb up a bit, go
>> onto the next branch to the end for the next apple... etc etc....
>>
>> The bigger the tree is, the longer it keeps you busy, because the apples
>> will be semi-evenly distributed around the full tree, and they're always
>> hanging at the end of the branch. The speed with which you can climb
>> around (random read disk access IO speed for btrfs, because your disk
>> cache is empty when first mounting) determines how quickly you're done.
>>
>> So, yes.
> 
> Thanks Hans.  I will say multiple minutes (by the looks of things, I'll 
> end up near to an hour for 60TB if this non-linear scaling continues) to 
> mount a filesystem is undesirable, but I won't offer that criticism 
> without thinking constructively for a moment:
> 
> Help me out by referencing the tree in question if you don't mind, so I 
> can better understand the point of picking all these "apples" (I would 
> guess for capacity reporting via df, but maybe there's more).
> 
> Typical disclaimer that I haven't yet grokked the various inner-workings 
> of BTRFS, so this is quite possibly a terrible or unapproachable idea:
> 
> On umount, you must already have whatever metadata you were doing the 
> tree walk on mount for in-memory (otherwise you would have been able to 
> lazily do the treewalk after a quick mount).  Therefore, could we not 
> stash this metadata at or associated with, say, the root of the 
> subvolumes?  This way you can always determine on mount quickly if the 
> cache is still valid (i.e., no situation like: remount with old btrfs, 
> change stuff, umount with old btrfs, remount with new btrfs, pain).  I 
> would guess generation would be sufficient to determine if the cached 
> metadata is valid for the given root block.
> 
> This would scale with number of subvolumes (but not snapshots), and 
> would be reasonably quick I think.

I see on 02/13 Qu commented regarding a similar idea, except proposed 
perhaps a richer version of my above suggestion (making block group into 
its own tree).  The concern was that it would be a lot of work since it 
modifies the on-disk format.  That's a reasonable worry.

I will get a new kernel, expand my array to around 36TB, and will 
generate a plot of mount times against extents going up to at least 30TB 
in increments of 0.5TB.  If this proves to reach absurd mount time 
delays (to be specific, anything above around 60s is untenable for our 
use), we may very well be sufficiently motivated to implement the above 
improvement and submit it for consideration.  Accordingly, if anybody 
has additional and/or more specific thoughts on the optimization, I am 
all ears.
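
A minimal sketch of such a timing loop, assuming the same device and mount
point used earlier in the thread and dropping caches so every mount is cold:

$ for i in 1 2 3 4 5; do sudo umount /mnt/btrfs; echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null; /usr/bin/time -f "mount: %e s" sudo mount /dev/sdb /mnt/btrfs; done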

Best,

ellis

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of FST and mount times
  2018-02-16 14:12           ` Ellis H. Wilson III
  2018-02-16 14:20             ` Hans van Kranenburg
@ 2018-02-17  0:59             ` Qu Wenruo
  2018-02-20 14:59               ` Ellis H. Wilson III
  1 sibling, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2018-02-17  0:59 UTC (permalink / raw)
  To: Ellis H. Wilson III, Hans van Kranenburg, Nikolay Borisov, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 4694 bytes --]



On 2018年02月16日 22:12, Ellis H. Wilson III wrote:
> On 02/15/2018 08:55 PM, Qu Wenruo wrote:
>> On 2018年02月16日 00:30, Ellis H. Wilson III wrote:
>>> Very helpful information.  Thank you Qu and Hans!
>>>
>>> I have about 1.7TB of newly rsync'd homedir data on a single
>>> enterprise 7200rpm HDD and the following output for btrfs-debug:
>>>
>>> extent tree key (EXTENT_TREE ROOT_ITEM 0) 543384862720 level 2
>>> total bytes 6001175126016
>>> bytes used 1832557875200
>>>
>>> Hans' (very cool) tool reports:
>>> ROOT_TREE         624.00KiB 0(    38) 1(     1)
>>> EXTENT_TREE       327.31MiB 0( 20881) 1(    66) 2(     1)
>>
>> Extent tree is not so large, a little unexpected to see such slow mount.
>>
>> BTW, how many chunks do you have?
>>
>> It could be checked by:
>>
>> # btrfs-debug-tree -t chunk <device> | grep CHUNK_ITEM | wc -l
> 
> Since yesterday I've doubled the size by copying the homedir dataset in
> again.  Here are new stats:
> 
> extent tree key (EXTENT_TREE ROOT_ITEM 0) 385990656 level 2
> total bytes 6001175126016
> bytes used 3663525969920
> 
> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
> 3454

OK, this explains everything.

There are too many chunks.
This means at mount you need to search for block group item 3454 times.

Even if each search only needs to iterate over 3 tree blocks, multiplied
by 3454 it is still a lot of work.
Although some tree blocks like the root node and level 1 nodes can be
cached, we still need to read about 3500 tree blocks.

If the fs is created using 16K nodesize, this means you need to do
random read for 54M using 16K blocksize.

No wonder it will take some time.

Normally I would expect 1G chunk for each data and metadata chunk.

If there is nothing special, it means your filesystem is already larger
than 3T.
If your used space is way smaller (less than 30%) than 3.5T, then this
means your chunk usage is pretty low, and in that case, balance to
reduce number of chunks (block groups) would reduce mount time.

My personal estimate is that mount time is O(n log n).
So if you are able to reduce chunk number to half, you could reduce
mount time by 60%.
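
Rough arithmetic behind the 54M figure above, assuming roughly one 16KiB
tree block read per block group once the upper nodes are cached:

$ echo $(( 3454 * 16 )) KiB
55264 KiB      # ~54MiB of scattered 16KiB reads on a cold cache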

> 
> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
> ROOT_TREE           1.14MiB 0(    72) 1(     1)
> EXTENT_TREE       644.27MiB 0( 41101) 1(   131) 2(     1)
> CHUNK_TREE        384.00KiB 0(    23) 1(     1)
> DEV_TREE          272.00KiB 0(    16) 1(     1)
> FS_TREE            11.55GiB 0(754442) 1(  2179) 2(     5) 3(     2)
> CSUM_TREE           3.50GiB 0(228593) 1(   791) 2(     2) 3(     1)
> QUOTA_TREE            0.00B
> UUID_TREE          16.00KiB 0(     1)
> FREE_SPACE_TREE       0.00B
> DATA_RELOC_TREE    16.00KiB 0(     1)
> 
> The old mean mount time was 4.319s.  It now takes 11.537s for the
> doubled dataset.  Again please realize this is on an old version of
> BTRFS (4.5.5), so perhaps newer ones will perform better, but I'd still
> like to understand this delay more.  Should I expect this to scale in
> this way all the way up to my proposed 60-80TB filesystem so long as the
> file size distribution stays roughly similar?  That would definitely be
> in terms of multiple minutes at that point.
> 
>>> Taking 100 snapshots (no changes between snapshots however) of the above
>>> subvolume doesn't appear to impact mount/umount time.
>>
>> 100 unmodified snapshots won't affect mount time.
>>
>> It needs new extents, which can be created by overwriting extents in
>> snapshots.
>> So it won't really cause much difference if all these snapshots are all
>> unmodified.
> 
> Good to know, thanks!
> 
>>> Snapshot creation
>>> and deletion both operate at between 0.25s to 0.5s.
>>
>> IIRC snapshot deletion is delayed, so the real work doesn't happen when
>> "btrfs sub del" returns.
> 
> I was using btrfs sub del -C for the deletions, so I believe (if that
> command truly waits for the subvolume to be utterly gone) it captures
> the entirety of the snapshot.

No, snapshot deletion is completely delayed in background.

-C only ensures that even if a power loss happens after the command returns, you
won't see the snapshot anywhere, but it will still be deleted in background.
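
If the goal is to time the full cleanup rather than just the commit, a
sketch (the snapshot path is hypothetical; needs a reasonably recent
btrfs-progs):

$ sudo btrfs subvolume delete -C /mnt/btrfs/snap-001
$ sudo btrfs subvolume sync /mnt/btrfs      # blocks until deleted subvolumes are actually cleaned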

Thanks,
Qu

> 
> Best,
> 
> ellis


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 520 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of FST and mount times
  2018-02-17  0:59             ` Qu Wenruo
@ 2018-02-20 14:59               ` Ellis H. Wilson III
  2018-02-20 15:41                 ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 32+ messages in thread
From: Ellis H. Wilson III @ 2018-02-20 14:59 UTC (permalink / raw)
  To: Qu Wenruo, Hans van Kranenburg, Nikolay Borisov, linux-btrfs

On 02/16/2018 07:59 PM, Qu Wenruo wrote:
> On 2018年02月16日 22:12, Ellis H. Wilson III wrote:
>> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
>> 3454
> 
> OK, this explains everything.
> 
> There are too many chunks.
> This means at mount you need to search for block group item 3454 times.
> 
> Even if each search only needs to iterate over 3 tree blocks, multiplied
> by 3454 it is still a lot of work.
> Although some tree blocks like the root node and level 1 nodes can be
> cached, we still need to read about 3500 tree blocks.
> 
> If the fs is created using 16K nodesize, this means you need to do
> random read for 54M using 16K blocksize.
> 
> No wonder it will take some time.
> 
> Normally I would expect 1G chunk for each data and metadata chunk.
> 
> If there is nothing special, it means your filesystem is already larger
> than 3T.
> If your used space is way smaller (less than 30%) than 3.5T, then this
> means your chunk usage is pretty low, and in that case, balance to
> reduce number of chunks (block groups) would reduce mount time.

The nodesize is 16K, and the filesystem data is 3.32TiB as reported by 
btrfs fi df.  So, from what I am hearing, this mount time is normal for 
a filesystem this size.  Ignoring a more complex and proper fix like the 
ones we've been discussing, would bumping the nodesize reduce the number 
of chunks, thereby reducing the mount time?

I don't see why balance would come into play here -- my understanding 
was that was for aged filesystems.  The only operations I've done on 
here was:
1. Format filesystem clean
2. Create a subvolume
3. rsync our home directories into that new subvolume
4. Create another subvolume
5. rsync our home directories into that new subvolume

Accordingly, zero (or at least, extremely little) data should have been 
overwritten, so I would expect things to be fairly well allocated 
already.  Please correct me if this is naive thinking.
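
Spelled out with commands, that sequence was essentially the following --
the source path and rsync flags are placeholders for illustration:

$ sudo btrfs subvolume create /mnt/btrfs/home1
$ sudo rsync -aHAX /path/to/homedirs/ /mnt/btrfs/home1/
$ sudo btrfs subvolume create /mnt/btrfs/home2
$ sudo rsync -aHAX /path/to/homedirs/ /mnt/btrfs/home2/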

>> I was using btrfs sub del -C for the deletions, so I believe (if that
>> command truly waits for the subvolume to be utterly gone) it captures
>> the entirety of the snapshot.
> 
> No, snapshot deletion is completely delayed in background.
> 
> -C only ensures that even if a power loss happens after the command returns, you
> won't see the snapshot anywhere, but it will still be deleted in background.

Ah, I had no idea.  Thank you!  Is there any way to "encourage" 
btrfs-cleaner to run at specific times, which I presume is the snapshot 
deletion process you are referring to?  If it can be told to run at a 
given time, can I throttle how fast it works, such that I avoid some of 
the high foreground interruption I've seen in the past?

Thanks,

ellis

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of FST and mount times
  2018-02-20 14:59               ` Ellis H. Wilson III
@ 2018-02-20 15:41                 ` Austin S. Hemmelgarn
  2018-02-21  1:49                   ` Qu Wenruo
  0 siblings, 1 reply; 32+ messages in thread
From: Austin S. Hemmelgarn @ 2018-02-20 15:41 UTC (permalink / raw)
  To: Ellis H. Wilson III, Qu Wenruo, Hans van Kranenburg,
	Nikolay Borisov, linux-btrfs

On 2018-02-20 09:59, Ellis H. Wilson III wrote:
> On 02/16/2018 07:59 PM, Qu Wenruo wrote:
>> On 2018年02月16日 22:12, Ellis H. Wilson III wrote:
>>> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
>>> 3454
>>
>> OK, this explains everything.
>>
>> There are too many chunks.
>> This means at mount you need to search for block group item 3454 times.
>>
>> Even if each search only needs to iterate over 3 tree blocks, multiplied
>> by 3454 it is still a lot of work.
>> Although some tree blocks like the root node and level 1 nodes can be
>> cached, we still need to read about 3500 tree blocks.
>>
>> If the fs is created using 16K nodesize, this means you need to do
>> random read for 54M using 16K blocksize.
>>
>> No wonder it will take some time.
>>
>> Normally I would expect 1G chunk for each data and metadata chunk.
>>
>> If there is nothing special, it means your filesystem is already larger
>> than 3T.
>> If your used space is way smaller (less than 30%) than 3.5T, then this
>> means your chunk usage is pretty low, and in that case, balance to
>> reduce number of chunks (block groups) would reduce mount time.
> 
> The nodesize is 16K, and the filesystem data is 3.32TiB as reported by 
> btrfs fi df.  So, from what I am hearing, this mount time is normal for 
> a filesystem this size.  Ignoring a more complex and proper fix like the 
> ones we've been discussing, would bumping the nodesize reduce the number 
> of chunks, thereby reducing the mount time?
It would probably not.  Chunk size is only based on the total size of 
the filesystem, with reasonable base values, so you would still need to 
have at least as many chunks to store the same amount of data (increase 
the node size too much though, and you will end up with more chunks, 
because you'll have more empty space wasted).
> 
> I don't see why balance would come into play here -- my understanding 
> was that was for aged filesystems.  The only operations I've done on 
> here was:
> 1. Format filesystem clean
> 2. Create a subvolume
> 3. rsync our home directories into that new subvolume
> 4. Create another subvolume
> 5. rsync our home directories into that new subvolume
> 
> Accordingly, zero (or at least, extremely little) data should have been 
> overwritten, so I would expect things to be fairly well allocated 
> already.  Please correct me if this is naive thinking.
Your logic is in general correct regarding data, but not necessarily 
metadata.  Assuming you did not use the `--inplace` option for rsync, it 
had to issue a rename for each individual file that got copied in, and 
as a result there was likely a lot of metadata being rewritten.

As far as balance being for aged filesystems, that's not exactly true. 
There are four big reasons you might run a balance:

1. As part of reshaping a volume.  You generally want to run a balance 
whenever the number of disks in a volume permanently increases (it will 
happen automatically when it permanently decreases, as the device 
deletion operation is a special type of balance under the hood).  It's 
also used for converting chunk profiles.
2. To free up empty space inside chunks when the filesystem is full at 
the chunk level.
3. To redistribute data across multiple disks in a more even manner 
after deleting a lot of data.
4. To reduce the likelihood of 2 or 3 being an issue.

Reasons 2 and 3 are generally more likely to be needed on old volumes. 
Reason 1 is independent of the age of a volume.  Reason 4 is the reason 
for the regular filtered balances that I and some other people recommend 
be run as part of preventative maintenance, and is also generally 
independent of the age of a volume.

Qu's suggestion is actually independent of all the above reasons, but 
does kind of fit in with the fourth as another case of preventative 
maintenance.
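
As an example of the kind of filtered maintenance balance meant in reason
4 -- the usage threshold is only an illustration:

$ sudo btrfs balance start -dusage=50 /mnt/btrfs    # only rewrites data chunks that are at most 50% full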
> 
>>> I was using btrfs sub del -C for the deletions, so I believe (if that
>>> command truly waits for the subvolume to be utterly gone) it captures
>>> the entirety of the snapshot.
>>
>> No, snapshot deletion is completely delayed in background.
>>
>> -C only ensures that even if a power loss happens after the command returns, you
>> won't see the snapshot anywhere, but it will still be deleted in 
>> background.
> 
> Ah, I had no idea.  Thank you!  Is there any way to "encourage" 
> btrfs-cleaner to run at specific times, which I presume is the snapshot 
> deletion process you are referring to?  If it can be told to run at a 
> given time, can I throttle how fast it works, such that I avoid some of 
> the high foreground interruption I've seen in the past?
I don't think there's any way to do this right now (though it would be 
nice if there was).  In theory, you could adjust the priority of the 
kernel thread itself, but messing around with kthread priorities is 
seriously dangerous even if you know exactly what you're doing.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of FST and mount times
  2018-02-20 15:41                 ` Austin S. Hemmelgarn
@ 2018-02-21  1:49                   ` Qu Wenruo
  2018-02-21 14:49                     ` Ellis H. Wilson III
  0 siblings, 1 reply; 32+ messages in thread
From: Qu Wenruo @ 2018-02-21  1:49 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Ellis H. Wilson III, Hans van Kranenburg,
	Nikolay Borisov, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 6244 bytes --]



On 2018年02月20日 23:41, Austin S. Hemmelgarn wrote:
> On 2018-02-20 09:59, Ellis H. Wilson III wrote:
>> On 02/16/2018 07:59 PM, Qu Wenruo wrote:
>>> On 2018年02月16日 22:12, Ellis H. Wilson III wrote:
>>>> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
>>>> 3454
>>>
>>> OK, this explains everything.
>>>
>>> There are too many chunks.
>>> This means at mount you need to search for block group item 3454 times.
>>>
>>> Even if each search only needs to iterate over 3 tree blocks, multiplied
>>> by 3454 it is still a lot of work.
>>> Although some tree blocks like the root node and level 1 nodes can be
>>> cached, we still need to read about 3500 tree blocks.
>>>
>>> If the fs is created using 16K nodesize, this means you need to do
>>> random read for 54M using 16K blocksize.
>>>
>>> No wonder it will take some time.
>>>
>>> Normally I would expect 1G chunk for each data and metadata chunk.
>>>
>>> If there is nothing special, it means your filesystem is already larger
>>> than 3T.
>>> If your used space is way smaller (less than 30%) than 3.5T, then this
>>> means your chunk usage is pretty low, and in that case, balance to
>>> reduce number of chunks (block groups) would reduce mount time.
>>
>> The nodesize is 16K, and the filesystem data is 3.32TiB as reported by
>> btrfs fi df.  So, from what I am hearing, this mount time is normal
>> for a filesystem this size.  Ignoring a more complex and proper fix
>> like the ones we've been discussing, would bumping the nodesize reduce
>> the number of chunks, thereby reducing the mount time?
> It would probably not.  Chunk size is only based on the total size of
> the filesystem, with reasonable base values, so you would still need to
> have at least as many chunks to store the same amount of data (increase
> the node size too much though, and you will end up with more chunks,
> because you'll have more empty space wasted).

Increasing node size may reduce extent tree size, although at most by
one level AFAIK.

But considering that the higher the node is, the more chance it's
cached, reducing tree height wouldn't bring much performance impact AFAIK.

If one could do a real-world benchmark to disprove or confirm my
assumption, that would be much better though.

>>
>> I don't see why balance would come into play here -- my understanding
>> was that was for aged filesystems.  The only operations I've done on
>> here was:
>> 1. Format filesystem clean
>> 2. Create a subvolume
>> 3. rsync our home directories into that new subvolume
>> 4. Create another subvolume
>> 5. rsync our home directories into that new subvolume
>>
>> Accordingly, zero (or at least, extremely little) data should have
>> been overwritten, so I would expect things to be fairly well allocated
>> already.  Please correct me if this is naive thinking.
> Your logic is in general correct regarding data, but not necessarily
> metadata.  Assuming you did not use the `--inplace` option for rsync, it
> had to issue a rename for each individual file that got copied in, and
> as a result there was likely a lot of metadata being rewritten.
> 
> As far as balance being for aged filesystems, that's not exactly true.
> There are four big reasons you might run a balance:
> 
> 1. As part of reshaping a volume.  You generally want to run a balance
> whenever the number of disks in a volume permanently increases (it will
> happen automatically when it permanently decreases, as the device
> deletion operation is a special type of balance under the hood).  It's
> also used for converting chunk profiles.
> 2. To free up empty space inside chunks when the filesystem is full at
> the chunk level.
> 3. To redistribute data across multiple disks in a more even manner
> after deleting a lot of data.
> 4. To reduce the likelihood of 2 or 3 being an issue.
> 
> Reasons 2 and 3 are generally more likely to be needed on old volumes.
> Reason 1 is independent of the age of a volume.  Reason 4 is the reason
> for the regular filtered balances that I and some other people recommend
> be run as part of preventative maintenance, and is also generally
> independent of the age of a volume.
> 
> Qu's suggestion is actually independent of all the above reasons, but
> does kind of fit in with the fourth as another case of preventative
> maintenance.

My suggestion is to use balance to reduce number of block groups, so we
could do less search at mount time.

It's more like reason 2.

But it only works for case where there are a lot of fragments so a lot
of chunks are not fully utilized.
Unfortunately, that's not the case for OP, so my suggestion doesn't make
sense here.

BTW, if OP still wants to try something to possibly reduce mount time
with the same fs, I could try some modification to the current block group
iteration code to see if it makes sense.

Thanks,
Qu

>>
>>>> I was using btrfs sub del -C for the deletions, so I believe (if that
>>>> command truly waits for the subvolume to be utterly gone) it captures
>>>> the entirety of the snapshot.
>>>
>>> No, snapshot deletion is completely delayed in background.
>>>
>>> -C only ensures that even if a power loss happens after the command returns, you
>>> won't see the snapshot anywhere, but it will still be deleted in
>>> background.
>>
>> Ah, I had no idea.  Thank you!  Is there any way to "encourage"
>> btrfs-cleaner to run at specific times, which I presume is the
>> snapshot deletion process you are referring to?  If it can be told to
>> run at a given time, can I throttle how fast it works, such that I
>> avoid some of the high foreground interruption I've seen in the past?
> I don't think there's any way to do this right now (though it would be
> nice if there was).  In theory, you could adjust the priority of the
> kernel thread itself, but messing around with kthread priorities is
> seriously dangerous even if you know exactly what you're doing.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 520 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of FST and mount times
  2018-02-21  1:49                   ` Qu Wenruo
@ 2018-02-21 14:49                     ` Ellis H. Wilson III
  2018-02-21 15:03                       ` Hans van Kranenburg
                                         ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Ellis H. Wilson III @ 2018-02-21 14:49 UTC (permalink / raw)
  To: Qu Wenruo, Austin S. Hemmelgarn, Hans van Kranenburg,
	Nikolay Borisov, linux-btrfs

On 02/20/2018 08:49 PM, Qu Wenruo wrote:
>>>> On 2018年02月16日 22:12, Ellis H. Wilson III wrote:
>>>>> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
>>>>> 3454
>>>>
> Increasing node size may reduce extent tree size, although at most by
> one level AFAIK.
> 
> But considering that the higher the node is, the more chance it's
> cached, reducing tree height wouldn't bring much performance impact AFAIK.
> 
> If one could do a real-world benchmark to disprove or confirm my
> assumption, that would be much better though.

I'm willing to try this if you tell me exactly what you'd like me to do. 
  I've not mucked with nodesize before, so I'd like to avoid changing it 
to something absurd.
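
For reference, nodesize can only be chosen at mkfs time, e.g. (device is a
placeholder; valid values are powers of two up to 64k, 16k being the
current default):

$ sudo mkfs.btrfs -n 32k /dev/sdX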

>> Qu's suggestion is actually independent of all the above reasons, but
>> does kind of fit in with the fourth as another case of preventative
>> maintenance.
> 
> My suggestion is to use balance to reduce number of block groups, so we
> could do less search at mount time.
> 
> It's more like reason 2.
> 
> But it only works for case where there are a lot of fragments so a lot
> of chunks are not fully utilized.
> Unfortunately, that's not the case for OP, so my suggestion doesn't make
> sense here.

I ran the balance all the same, and the number of chunks has not 
changed.  Before 3454, and after 3454:
  $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
3454

HOWEVER, the time to mount has gone up somewhat significantly, from 
11.537s to 16.553s, which was very unexpected.  Output from previously 
run commands shows the extent tree metadata grew about 25% due to the 
balance.  Everything else stayed roughly the same, and no additional 
data was added to the system (nor snapshots taken, nor additional 
volumes added, etc):

Before balance:
$ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
ROOT_TREE           1.14MiB 0(    72) 1(     1)
EXTENT_TREE       644.27MiB 0( 41101) 1(   131) 2(     1)
CHUNK_TREE        384.00KiB 0(    23) 1(     1)
DEV_TREE          272.00KiB 0(    16) 1(     1)
FS_TREE            11.55GiB 0(754442) 1(  2179) 2(     5) 3(     2)
CSUM_TREE           3.50GiB 0(228593) 1(   791) 2(     2) 3(     1)
QUOTA_TREE            0.00B
UUID_TREE          16.00KiB 0(     1)
FREE_SPACE_TREE       0.00B
DATA_RELOC_TREE    16.00KiB 0(     1)

After balance:
$ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
ROOT_TREE           1.16MiB 0(    73) 1(     1)
EXTENT_TREE       806.50MiB 0( 51419) 1(   196) 2(     1)
CHUNK_TREE        384.00KiB 0(    23) 1(     1)
DEV_TREE          272.00KiB 0(    16) 1(     1)
FS_TREE            11.55GiB 0(754442) 1(  2179) 2(     5) 3(     2)
CSUM_TREE           3.49GiB 0(227920) 1(   804) 2(     2) 3(     1)
QUOTA_TREE            0.00B
UUID_TREE          16.00KiB 0(     1)
FREE_SPACE_TREE       0.00B
DATA_RELOC_TREE    16.00KiB 0(     1)

> BTW, if OP still wants to try something to possibly reduce mount time
> with the same fs, I could try some modification to the current block group
> iteration code to see if it makes sense.

I'm glad to try anything if it's helpful to improving BTRFS.  Just let 
me know.

Best,

ellis

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of FST and mount times
  2018-02-21 14:49                     ` Ellis H. Wilson III
@ 2018-02-21 15:03                       ` Hans van Kranenburg
  2018-02-21 15:19                         ` Ellis H. Wilson III
  2018-02-21 21:27                       ` E V
  2018-02-22  0:53                       ` Qu Wenruo
  2 siblings, 1 reply; 32+ messages in thread
From: Hans van Kranenburg @ 2018-02-21 15:03 UTC (permalink / raw)
  To: Ellis H. Wilson III, linux-btrfs

On 02/21/2018 03:49 PM, Ellis H. Wilson III wrote:
> On 02/20/2018 08:49 PM, Qu Wenruo wrote:
>>>>> On 2018年02月16日 22:12, Ellis H. Wilson III wrote:
>>>>>> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
>>>>>> 3454
>>>>>
>> Increasing node size may reduce extent tree size, although at most by
>> one level AFAIK.
>>
>> But considering that the higher the node is, the more chance it's
>> cached, reducing tree height wouldn't bring much performance impact
>> AFAIK.
>>
>> If one could do a real-world benchmark to disprove or confirm my
>> assumption, that would be much better though.
> 
> I'm willing to try this if you tell me exactly what you'd like me to do.
>  I've not mucked with nodesize before, so I'd like to avoid changing it
> to something absurd.
> 
>>> Qu's suggestion is actually independent of all the above reasons, but
>>> does kind of fit in with the fourth as another case of preventative
>>> maintenance.
>>
>> My suggestion is to use balance to reduce number of block groups, so we
>> could do less search at mount time.
>>
>> It's more like reason 2.
>>
>> But it only works for case where there are a lot of fragments so a lot
>> of chunks are not fully utilized.
>> Unfortunately, that's not the case for OP, so my suggestion doesn't make
>> sense here.
> 
> I ran the balance all the same, and the number of chunks has not
> changed.  Before 3454, and after 3454:
>  $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
> 3454
> 
> HOWEVER, the time to mount has gone up somewhat significantly, from
> 11.537s to 16.553s, which was very unexpected.  Output from previously
> run commands shows the extent tree metadata grew about 25% due to the
> balance.  Everything else stayed roughly the same, and no additional
> data was added to the system (nor snapshots taken, nor additional
> volumes added, etc):
> 
> Before balance:
> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
> ROOT_TREE           1.14MiB 0(    72) 1(     1)
> EXTENT_TREE       644.27MiB 0( 41101) 1(   131) 2(     1)
> CHUNK_TREE        384.00KiB 0(    23) 1(     1)
> DEV_TREE          272.00KiB 0(    16) 1(     1)
> FS_TREE            11.55GiB 0(754442) 1(  2179) 2(     5) 3(     2)
> CSUM_TREE           3.50GiB 0(228593) 1(   791) 2(     2) 3(     1)
> QUOTA_TREE            0.00B
> UUID_TREE          16.00KiB 0(     1)
> FREE_SPACE_TREE       0.00B
> DATA_RELOC_TREE    16.00KiB 0(     1)
> 
> After balance:
> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
> ROOT_TREE           1.16MiB 0(    73) 1(     1)
> EXTENT_TREE       806.50MiB 0( 51419) 1(   196) 2(     1)
> CHUNK_TREE        384.00KiB 0(    23) 1(     1)
> DEV_TREE          272.00KiB 0(    16) 1(     1)
> FS_TREE            11.55GiB 0(754442) 1(  2179) 2(     5) 3(     2)
> CSUM_TREE           3.49GiB 0(227920) 1(   804) 2(     2) 3(     1)
> QUOTA_TREE            0.00B
> UUID_TREE          16.00KiB 0(     1)
> FREE_SPACE_TREE       0.00B
> DATA_RELOC_TREE    16.00KiB 0(     1)

Heu, interesting.

What's the output of `btrfs fi df /mountpoint` and `grep btrfs
/proc/self/mounts` (does it contain 'ssd') and which kernel version is
this? (I get a bit lost in the many messages and subthreads in this
thread) I also can't find in the threads which command "the balance" means.

And what does this tell you?

https://github.com/knorrie/python-btrfs/blob/develop/examples/show_free_space_fragmentation.py

Just to make sure you're not pointlessly shovelling data around on a
filesystem that is already in bad shape.

>> BTW, if OP still wants to try something to possibly reduce mount time
>> with the same fs, I could try some modification to the current block group
>> iteration code to see if it makes sense.
> 
> I'm glad to try anything if it's helpful to improving BTRFS.  Just let
> me know.
> 
> Best,
> 
> ellis


-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of FST and mount times
  2018-02-21 15:03                       ` Hans van Kranenburg
@ 2018-02-21 15:19                         ` Ellis H. Wilson III
  2018-02-21 15:56                           ` Hans van Kranenburg
  0 siblings, 1 reply; 32+ messages in thread
From: Ellis H. Wilson III @ 2018-02-21 15:19 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs

On 02/21/2018 10:03 AM, Hans van Kranenburg wrote:
> On 02/21/2018 03:49 PM, Ellis H. Wilson III wrote:
>> On 02/20/2018 08:49 PM, Qu Wenruo wrote:
>>> My suggestion is to use balance to reduce number of block groups, so we
>>> could do less search at mount time.
>>>
>>> It's more like reason 2.
>>>
>>> But it only works for case where there are a lot of fragments so a lot
>>> of chunks are not fully utilized.
>>> Unfortunately, that's not the case for OP, so my suggestion doesn't make
>>> sense here.
>>
>> I ran the balance all the same, and the number of chunks has not
>> changed.  Before 3454, and after 3454:
>>   $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
>> 3454
>>
>> HOWEVER, the time to mount has gone up somewhat significantly, from
>> 11.537s to 16.553s, which was very unexpected.  Output from previously
>> run commands shows the extent tree metadata grew about 25% due to the
>> balance.  Everything else stayed roughly the same, and no additional
>> data was added to the system (nor snapshots taken, nor additional
>> volumes added, etc):
>>
>> Before balance:
>> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
>> ROOT_TREE           1.14MiB 0(    72) 1(     1)
>> EXTENT_TREE       644.27MiB 0( 41101) 1(   131) 2(     1)
>> CHUNK_TREE        384.00KiB 0(    23) 1(     1)
>> DEV_TREE          272.00KiB 0(    16) 1(     1)
>> FS_TREE            11.55GiB 0(754442) 1(  2179) 2(     5) 3(     2)
>> CSUM_TREE           3.50GiB 0(228593) 1(   791) 2(     2) 3(     1)
>> QUOTA_TREE            0.00B
>> UUID_TREE          16.00KiB 0(     1)
>> FREE_SPACE_TREE       0.00B
>> DATA_RELOC_TREE    16.00KiB 0(     1)
>>
>> After balance:
>> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
>> ROOT_TREE           1.16MiB 0(    73) 1(     1)
>> EXTENT_TREE       806.50MiB 0( 51419) 1(   196) 2(     1)
>> CHUNK_TREE        384.00KiB 0(    23) 1(     1)
>> DEV_TREE          272.00KiB 0(    16) 1(     1)
>> FS_TREE            11.55GiB 0(754442) 1(  2179) 2(     5) 3(     2)
>> CSUM_TREE           3.49GiB 0(227920) 1(   804) 2(     2) 3(     1)
>> QUOTA_TREE            0.00B
>> UUID_TREE          16.00KiB 0(     1)
>> FREE_SPACE_TREE       0.00B
>> DATA_RELOC_TREE    16.00KiB 0(     1)
> 
> Heu, interesting.
> 
> What's the output of `btrfs fi df /mountpoint` and `grep btrfs
> /proc/self/mounts` (does it contain 'ssd') and which kernel version is
> this? (I get a bit lost in the many messages and subthreads in this
> thread) I also can't find in the threads which command "the balance" means.

Short recap:
- I found long mount time for 1.65TB of home dir data at ~4s
- Doubling this data on the same btrfs fs to 3.3TB increased mount time 
to 11s
- Qu et al. suggested balance might reduce chunks, which came in around 
3400, and the chunk walk on mount was the driving factor in terms of time
- I ran balance
- Mount time went up to 16s, and all else remains the same except the 
extent tree.

$ sudo btrfs fi df /mnt/btrfs
Data, single: total=3.32TiB, used=3.32TiB
System, DUP: total=8.00MiB, used=384.00KiB
Metadata, DUP: total=16.50GiB, used=15.82GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

$ sudo grep btrfs /proc/self/mounts
/dev/sdb /mnt/btrfs btrfs rw,relatime,space_cache,subvolid=5,subvol=/ 0 0

  $ uname -a
Linux <snip> 4.5.5-300.fc24.x86_64 #1 SMP Thu May 19 13:05:32 UTC 2016 
x86_64 x86_64 x86_64 GNU/Linux

I plan to rerun this on a newer kernel, but haven't had time to spin up 
another machine with a modern kernel yet, and this machine is also being 
used for other things right now so I can't just upgrade it.

> And what does this tell you?
> 
> https://github.com/knorrie/python-btrfs/blob/develop/examples/show_free_space_fragmentation.py

$ sudo ./show_free_space_fragmentation.py /mnt/btrfs
No Free Space Tree (space_cache=v2) found!
Falling back to using the extent tree to determine free space extents.
vaddr 6529453391872 length 1073741824 used_pct 27 free space fragments 1 
score 0
Skipped because of usage > 90%: 3397 chunks
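
For what it's worth, on a kernel and progs combination that supports it,
the free space tree can be created by mounting once with the v2 space
cache; it then stays in use on later mounts:

$ sudo mount -o space_cache=v2 /dev/sdb /mnt/btrfs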

Best,

ellis

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of FST and mount times
  2018-02-21 15:19                         ` Ellis H. Wilson III
@ 2018-02-21 15:56                           ` Hans van Kranenburg
  2018-02-22 12:41                             ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 32+ messages in thread
From: Hans van Kranenburg @ 2018-02-21 15:56 UTC (permalink / raw)
  To: Ellis H. Wilson III, linux-btrfs

On 02/21/2018 04:19 PM, Ellis H. Wilson III wrote:
> On 02/21/2018 10:03 AM, Hans van Kranenburg wrote:
>> On 02/21/2018 03:49 PM, Ellis H. Wilson III wrote:
>>> On 02/20/2018 08:49 PM, Qu Wenruo wrote:
>>>> My suggestion is to use balance to reduce number of block groups, so we
>>>> could do less search at mount time.
>>>>
>>>> It's more like reason 2.
>>>>
>>>> But it only works for case where there are a lot of fragments so a lot
>>>> of chunks are not fully utilized.
>>>> Unfortunately, that's not the case for OP, so my suggestion doesn't
>>>> make
>>>> sense here.
>>>
>>> I ran the balance all the same, and the number of chunks has not
>>> changed.  Before 3454, and after 3454:
>>>   $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
>>> 3454
>>>
>>> HOWEVER, the time to mount has gone up somewhat significantly, from
>>> 11.537s to 16.553s, which was very unexpected.  Output from previously
>>> run commands shows the extent tree metadata grew about 25% due to the
>>> balance.  Everything else stayed roughly the same, and no additional
>>> data was added to the system (nor snapshots taken, nor additional
>>> volumes added, etc):
>>>
>>> Before balance:
>>> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
>>> ROOT_TREE           1.14MiB 0(    72) 1(     1)
>>> EXTENT_TREE       644.27MiB 0( 41101) 1(   131) 2(     1)
>>> CHUNK_TREE        384.00KiB 0(    23) 1(     1)
>>> DEV_TREE          272.00KiB 0(    16) 1(     1)
>>> FS_TREE            11.55GiB 0(754442) 1(  2179) 2(     5) 3(     2)
>>> CSUM_TREE           3.50GiB 0(228593) 1(   791) 2(     2) 3(     1)
>>> QUOTA_TREE            0.00B
>>> UUID_TREE          16.00KiB 0(     1)
>>> FREE_SPACE_TREE       0.00B
>>> DATA_RELOC_TREE    16.00KiB 0(     1)
>>>
>>> After balance:
>>> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
>>> ROOT_TREE           1.16MiB 0(    73) 1(     1)
>>> EXTENT_TREE       806.50MiB 0( 51419) 1(   196) 2(     1)
>>> CHUNK_TREE        384.00KiB 0(    23) 1(     1)
>>> DEV_TREE          272.00KiB 0(    16) 1(     1)
>>> FS_TREE            11.55GiB 0(754442) 1(  2179) 2(     5) 3(     2)
>>> CSUM_TREE           3.49GiB 0(227920) 1(   804) 2(     2) 3(     1)
>>> QUOTA_TREE            0.00B
>>> UUID_TREE          16.00KiB 0(     1)
>>> FREE_SPACE_TREE       0.00B
>>> DATA_RELOC_TREE    16.00KiB 0(     1)
>>
>> Heu, interesting.
>>
>> What's the output of `btrfs fi df /mountpoint` and `grep btrfs
>> /proc/self/mounts` (does it contain 'ssd') and which kernel version is
>> this? (I get a bit lost in the many messages and subthreads in this
>> thread) I also can't find in the threads which command "the balance"
>> means.
> 
> Short recap:
> - I found long mount time for 1.65TB of home dir data at ~4s
> - Doubling this data on the same btrfs fs to 3.3TB increased mount time
> to 11s
> - Qu et al. suggested balance might reduce chunks, which came in around
> 3400, and the chunk walk on mount was the driving factor in terms of time
> - I ran balance
> - Mount time went up to 16s, and all else remains the same except the
> extent tree.
> 
> $ sudo btrfs fi df /mnt/btrfs
> Data, single: total=3.32TiB, used=3.32TiB
> System, DUP: total=8.00MiB, used=384.00KiB
> Metadata, DUP: total=16.50GiB, used=15.82GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B

Ah, so the allocated data space is 100% filled with data. That's very good,
yes. And it explains why you can't lower the number of chunks by
balancing. You're just moving data around and replacing full chunks with
new full chunks. :]
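
For what it's worth, a usage-filtered balance is the usual way to compact
block groups when some of them are only partly filled. A sketch, which
wouldn't help here precisely because your chunks are already full:

$ sudo btrfs balance start -dusage=50 -musage=50 /mnt/btrfs   # only relocates chunks that are <=50% used
$ sudo btrfs fi df /mnt/btrfs                                 # check whether the "total" values shrank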

Doesn't explain why it blows up the size of the extent tree though. I
have no idea why that is.

> $ sudo grep btrfs /proc/self/mounts
> /dev/sdb /mnt/btrfs btrfs rw,relatime,space_cache,subvolid=5,subvol=/ 0 0

Ok, no 'ssd', good.

>  $ uname -a
> Linux <snip> 4.5.5-300.fc24.x86_64 #1 SMP Thu May 19 13:05:32 UTC 2016
> x86_64 x86_64 x86_64 GNU/Linux
> 
> I plan to rerun this on a newer kernel, but haven't had time to spin up
> another machine with a modern kernel yet, and this machine is also being
> used for other things right now so I can't just upgrade it.
> 
>> And what does this tell you?
>>
>> https://github.com/knorrie/python-btrfs/blob/develop/examples/show_free_space_fragmentation.py
>>
> 
> $ sudo ./show_free_space_fragmentation.py /mnt/btrfs
> No Free Space Tree (space_cache=v2) found!
> Falling back to using the extent tree to determine free space extents.
> vaddr 6529453391872 length 1073741824 used_pct 27 free space fragments 1
> score 0
> Skipped because of usage > 90%: 3397 chunks

Good.

-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of FST and mount times
  2018-02-21 14:49                     ` Ellis H. Wilson III
  2018-02-21 15:03                       ` Hans van Kranenburg
@ 2018-02-21 21:27                       ` E V
  2018-02-22  0:53                       ` Qu Wenruo
  2 siblings, 0 replies; 32+ messages in thread
From: E V @ 2018-02-21 21:27 UTC (permalink / raw)
  To: Ellis H. Wilson III
  Cc: Qu Wenruo, Austin S. Hemmelgarn, Hans van Kranenburg,
	Nikolay Borisov, linux-btrfs

On Wed, Feb 21, 2018 at 9:49 AM, Ellis H. Wilson III <ellisw@panasas.com> wrote:
> On 02/20/2018 08:49 PM, Qu Wenruo wrote:
>>>>>
>>>>> On 2018-02-16 22:12, Ellis H. Wilson III wrote:
>>>>>>
>>>>>> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
>>>>>> 3454
>>>>>
>>>>>
>> Increasing node size may reduce extent tree size, although it will at most
>> reduce it by one level AFAIK.
>>
>> But considering that the higher a node is in the tree, the more likely it is
>> to be cached, reducing tree height wouldn't bring much performance impact AFAIK.
>>
>> If someone could do a real-world benchmark to disprove or confirm my assumption,
>> that would be much better though.
>
>
> I'm willing to try this if you tell me exactly what you'd like me to do.
> I've not mucked with nodesize before, so I'd like to avoid changing it to
> something absurd.

mkfs.btrfs caps -n at 64K, so absurd isn't really an option. If you
have a large filesystem on a RAID array, you will likely see a
performance bump in your metadata operations if you use 64K and also
set the stripe size of the RAID array to 64K.
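
For example, something like this at mkfs time (a sketch; nodesize can only
be set when the filesystem is created, so it means recreating the fs, and the
device name is just illustrative):

$ sudo mkfs.btrfs -n 65536 /dev/sdb                                 # 64KiB nodesize, the maximum
$ sudo btrfs inspect-internal dump-super /dev/sdb | grep nodesize   # verify what was set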

>>> Qu's suggestion is actually independent of all the above reasons, but
>>> does kind of fit in with the fourth as another case of preventative
>>> maintenance.
>>
>>
>> My suggestion is to use balance to reduce the number of block groups, so we
>> could do less searching at mount time.
>>
>> It's more like reason 2.
>>
>> But it only works for cases where there are a lot of fragments, so a lot
>> of chunks are not fully utilized.
>> Unfortunately, that's not the case for the OP, so my suggestion doesn't make
>> sense here.
>
>
> I ran the balance all the same, and the number of chunks has not changed.
> Before 3454, and after 3454:
>  $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
> 3454
>
> HOWEVER, the time to mount has gone up somewhat significantly, from 11.537s
> to 16.553s, which was very unexpected.  Output from previously run commands
> shows the extent tree metadata grew about 25% due to the balance.
> Everything else stayed roughly the same, and no additional data was added to
> the system (nor snapshots taken, nor additional volumes added, etc):
>
> Before balance:
> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
> ROOT_TREE           1.14MiB 0(    72) 1(     1)
> EXTENT_TREE       644.27MiB 0( 41101) 1(   131) 2(     1)
> CHUNK_TREE        384.00KiB 0(    23) 1(     1)
> DEV_TREE          272.00KiB 0(    16) 1(     1)
> FS_TREE            11.55GiB 0(754442) 1(  2179) 2(     5) 3(     2)
> CSUM_TREE           3.50GiB 0(228593) 1(   791) 2(     2) 3(     1)
> QUOTA_TREE            0.00B
> UUID_TREE          16.00KiB 0(     1)
> FREE_SPACE_TREE       0.00B
> DATA_RELOC_TREE    16.00KiB 0(     1)
>
> After balance:
> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
> ROOT_TREE           1.16MiB 0(    73) 1(     1)
> EXTENT_TREE       806.50MiB 0( 51419) 1(   196) 2(     1)
> CHUNK_TREE        384.00KiB 0(    23) 1(     1)
> DEV_TREE          272.00KiB 0(    16) 1(     1)
> FS_TREE            11.55GiB 0(754442) 1(  2179) 2(     5) 3(     2)
> CSUM_TREE           3.49GiB 0(227920) 1(   804) 2(     2) 3(     1)
> QUOTA_TREE            0.00B
> UUID_TREE          16.00KiB 0(     1)
> FREE_SPACE_TREE       0.00B
> DATA_RELOC_TREE    16.00KiB 0(     1)
>
>> BTW, if the OP still wants to try something to possibly reduce mount time
>> with the same fs, I could try some modifications to the current block group
>> iteration code to see if it makes sense.
>
>
> I'm glad to try anything if it's helpful to improving BTRFS.  Just let me
> know.
>
> Best,
>
> ellis
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of FST and mount times
  2018-02-21 14:49                     ` Ellis H. Wilson III
  2018-02-21 15:03                       ` Hans van Kranenburg
  2018-02-21 21:27                       ` E V
@ 2018-02-22  0:53                       ` Qu Wenruo
  2 siblings, 0 replies; 32+ messages in thread
From: Qu Wenruo @ 2018-02-22  0:53 UTC (permalink / raw)
  To: Ellis H. Wilson III, Austin S. Hemmelgarn, Hans van Kranenburg,
	Nikolay Borisov, linux-btrfs


On 2018-02-21 22:49, Ellis H. Wilson III wrote:
> On 02/20/2018 08:49 PM, Qu Wenruo wrote:
>>>>> On 2018-02-16 22:12, Ellis H. Wilson III wrote:
>>>>>> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
>>>>>> 3454
>>>>>
>> Increasing node size may reduce extent tree size, although it will at most
>> reduce it by one level AFAIK.
>>
>> But considering that the higher a node is in the tree, the more likely it is
>> to be cached, reducing tree height wouldn't bring much performance impact
>> AFAIK.
>>
>> If someone could do a real-world benchmark to disprove or confirm my assumption,
>> that would be much better though.
> 
> I'm willing to try this if you tell me exactly what you'd like me to do.
>  I've not mucked with nodesize before, so I'd like to avoid changing it
> to something absurd.
> 
>>> Qu's suggestion is actually independent of all the above reasons, but
>>> does kind of fit in with the fourth as another case of preventative
>>> maintenance.
>>
>> My suggestion is to use balance to reduce the number of block groups, so we
>> could do less searching at mount time.
>>
>> It's more like reason 2.
>>
>> But it only works for cases where there are a lot of fragments, so a lot
>> of chunks are not fully utilized.
>> Unfortunately, that's not the case for the OP, so my suggestion doesn't make
>> sense here.
> 
> I ran the balance all the same, and the number of chunks has not
> changed.  Before 3454, and after 3454:
>  $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
> 3454
> 
> HOWEVER, the time to mount has gone up somewhat significantly, from
> 11.537s to 16.553s, which was very unexpected.  Output from previously
> run commands shows the extent tree metadata grew about 25% due to the
> balance.  Everything else stayed roughly the same, and no additional
> data was added to the system (nor snapshots taken, nor additional
> volumes added, etc):

In theory, if the extent tree height and block group usage don't
change dramatically, the tree block reads caused by block group
iteration shouldn't change much.

But in your case, the number of extent tree leaves increased; I believe it's
the tree block readahead that's causing the problem.

> 
> Before balance:
> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
> ROOT_TREE           1.14MiB 0(    72) 1(     1)
> EXTENT_TREE       644.27MiB 0( 41101) 1(   131) 2(     1)
> CHUNK_TREE        384.00KiB 0(    23) 1(     1)
> DEV_TREE          272.00KiB 0(    16) 1(     1)
> FS_TREE            11.55GiB 0(754442) 1(  2179) 2(     5) 3(     2)
> CSUM_TREE           3.50GiB 0(228593) 1(   791) 2(     2) 3(     1)
> QUOTA_TREE            0.00B
> UUID_TREE          16.00KiB 0(     1)
> FREE_SPACE_TREE       0.00B
> DATA_RELOC_TREE    16.00KiB 0(     1)
> 
> After balance:
> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
> ROOT_TREE           1.16MiB 0(    73) 1(     1)
> EXTENT_TREE       806.50MiB 0( 51419) 1(   196) 2(     1)
> CHUNK_TREE        384.00KiB 0(    23) 1(     1)
> DEV_TREE          272.00KiB 0(    16) 1(     1)
> FS_TREE            11.55GiB 0(754442) 1(  2179) 2(     5) 3(     2)
> CSUM_TREE           3.49GiB 0(227920) 1(   804) 2(     2) 3(     1)
> QUOTA_TREE            0.00B
> UUID_TREE          16.00KiB 0(     1)
> FREE_SPACE_TREE       0.00B
> DATA_RELOC_TREE    16.00KiB 0(     1)
> 
>> BTW, if the OP still wants to try something to possibly reduce mount time
>> with the same fs, I could try some modifications to the current block group
>> iteration code to see if it makes sense.
> 
> I'm glad to try anything if it's helpful to improving BTRFS.  Just let
> me know.

Glad to hear that.

I'll send out an RFC patch to see if it helps to reduce mount
time (maybe only by a little).

Thanks,
Qu

> 
> Best,
> 
> ellis


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of FST and mount times
  2018-02-21 15:56                           ` Hans van Kranenburg
@ 2018-02-22 12:41                             ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 32+ messages in thread
From: Austin S. Hemmelgarn @ 2018-02-22 12:41 UTC (permalink / raw)
  To: Hans van Kranenburg, Ellis H. Wilson III, linux-btrfs

On 2018-02-21 10:56, Hans van Kranenburg wrote:
> On 02/21/2018 04:19 PM, Ellis H. Wilson III wrote:
>>
>> $ sudo btrfs fi df /mnt/btrfs
>> Data, single: total=3.32TiB, used=3.32TiB
>> System, DUP: total=8.00MiB, used=384.00KiB
>> Metadata, DUP: total=16.50GiB, used=15.82GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> Ah, so the allocated data space is 100% filled with data. That's very good,
> yes. And it explains why you can't lower the number of chunks by
> balancing. You're just moving data around and replacing full chunks with
> new full chunks. :]
> 
> Doesn't explain why it blows up the size of the extent tree though. I
> have no idea why that is.
This is just a guess, but I think it might have reordered extents within 
each chunk.  Any given extent can't span across a chunk boundary, so if 
the order changed, it may have split extents that had previously been 
full extents.  I'd be somewhat curious to see if defragmenting might 
help here (it should re-combine the split extents, though it will 
probably allocate a new chunk).
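
Something like this would test that theory (a sketch; note that a recursive
defrag breaks reflinks, so on a filesystem with snapshots it can noticeably
increase space usage):

$ sudo btrfs filesystem defragment -r /mnt/btrfs
$ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/   # compare EXTENT_TREE against the numbers above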


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2018-02-22 12:41 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-14 16:00 Status of FST and mount times Ellis H. Wilson III
2018-02-14 17:08 ` Nikolay Borisov
2018-02-14 17:21   ` Ellis H. Wilson III
2018-02-15  1:42   ` Qu Wenruo
2018-02-15  2:15     ` Duncan
2018-02-15  3:49       ` Qu Wenruo
2018-02-15 11:12     ` Hans van Kranenburg
2018-02-15 16:30       ` Ellis H. Wilson III
2018-02-16  1:55         ` Qu Wenruo
2018-02-16 14:12           ` Ellis H. Wilson III
2018-02-16 14:20             ` Hans van Kranenburg
2018-02-16 14:42               ` Ellis H. Wilson III
2018-02-16 14:55                 ` Ellis H. Wilson III
2018-02-17  0:59             ` Qu Wenruo
2018-02-20 14:59               ` Ellis H. Wilson III
2018-02-20 15:41                 ` Austin S. Hemmelgarn
2018-02-21  1:49                   ` Qu Wenruo
2018-02-21 14:49                     ` Ellis H. Wilson III
2018-02-21 15:03                       ` Hans van Kranenburg
2018-02-21 15:19                         ` Ellis H. Wilson III
2018-02-21 15:56                           ` Hans van Kranenburg
2018-02-22 12:41                             ` Austin S. Hemmelgarn
2018-02-21 21:27                       ` E V
2018-02-22  0:53                       ` Qu Wenruo
2018-02-15  5:54   ` Chris Murphy
2018-02-14 23:24 ` Duncan
2018-02-15 15:42   ` Ellis H. Wilson III
2018-02-15 16:51     ` Austin S. Hemmelgarn
2018-02-15 16:58       ` Ellis H. Wilson III
2018-02-15 17:57         ` Austin S. Hemmelgarn
2018-02-15  6:14 ` Chris Murphy
2018-02-15 16:45   ` Ellis H. Wilson III
