* Status of FST and mount times
@ 2018-02-14 16:00 Ellis H. Wilson III
  2018-02-14 17:08 ` Nikolay Borisov
  ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread

From: Ellis H. Wilson III @ 2018-02-14 16:00 UTC (permalink / raw)
To: linux-btrfs

Hi again -- back with a few more questions:

Frame of reference here: RAID0. Around 70TB raw capacity. No
compression. No quotas enabled. Many (potentially tens to hundreds) of
subvolumes, each with tens of snapshots. No control over the size or
number of files, but the directory tree (entries per dir and general
tree depth) can be controlled in case that's helpful.

1. I've been reading up about the space cache, and it appears there is a
v2 of it called the free space tree that is much friendlier to large
filesystems such as the one I am designing for. It is listed as OK/OK
on the wiki status page, but there is a note that btrfs-progs treats it
as read-only (i.e., that btrfs check --repair cannot help me without a
full space cache rebuild is my biggest concern), and the last status
update on this I can find is circa fall 2016. Can anybody give me an
updated status on this feature? From what I've read, v1 and tens-of-TB
filesystems will not play well together, so I'm inclined to dig into
this.

2. There's another thread ongoing about mount delays. I've been
completely blind to this specific problem until it caught my eye. Does
anyone have ballpark estimates for how long very large HDD-based
filesystems will take to mount? Yes, I know it will depend on the
dataset. I'm looking for O() worst-case approximations for
enterprise-grade large drives (12/14TB); I expect it to scale with
multiple drives, so approximating for a single drive should be good
enough.

3. Do long mount delays relate to space_cache v1 vs v2 (I would guess
no, unless it needed to be regenerated)?

Note that I'm not sensitive to multi-second mount delays. I am
sensitive to multi-minute mount delays, hence why I'm bringing this up.
FWIW: I am currently populating a machine we have with 6TB drives in it with real-world home dir data to see if I can replicate the mount issue. Thanks, ellis ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Status of FST and mount times
  2018-02-14 16:00 Status of FST and mount times Ellis H. Wilson III
@ 2018-02-14 17:08 ` Nikolay Borisov
  2018-02-14 17:21   ` Ellis H. Wilson III
  ` (2 more replies)
  2018-02-14 23:24 ` Duncan
  2018-02-15  6:14 ` Chris Murphy
  2 siblings, 3 replies; 32+ messages in thread

From: Nikolay Borisov @ 2018-02-14 17:08 UTC (permalink / raw)
To: Ellis H. Wilson III, linux-btrfs

On 14.02.2018 18:00, Ellis H. Wilson III wrote:
> Hi again -- back with a few more questions:
>
> Frame-of-reference here: RAID0. Around 70TB raw capacity. No
> compression. No quotas enabled. Many (potentially tens to hundreds) of
> subvolumes, each with tens of snapshots. No control over size or number
> of files, but directory tree (entries per dir and general tree depth)
> can be controlled in case that's helpful).
>
> 1. I've been reading up about the space cache, and it appears there is a
> v2 of it called the free space tree that is much friendlier to large
> filesystems such as the one I am designing for. It is listed as OK/OK
> on the wiki status page, but there is a note that btrfs progs treats it
> as read only (i.e., btrfs check repair cannot help me without a full
> space cache rebuild is my biggest concern) and the last status update on
> this I can find was circa fall 2016. Can anybody give me an updated
> status on this feature? From what I read, v1 and tens of TB filesystems
> will not play well together, so I'm inclined to dig into this.

V1 for large filesystems is just awful. Facebook have been experiencing
the pain, hence they implemented v2. You can view the space cache tree
as the complement of the extent tree. The v1 cache is implemented as a
hidden inode, and even though writes (i.e. flushing of the free space
cache) are metadata, they are essentially treated as data. This could
potentially lead to priority inversions if the cgroup IO controller is
involved.

Furthermore, there is at least one known deadlock problem in free space
cache v1.
So yes, if you want to use btrfs on a multi-TB system, v2 is really the
way to go.

>
> 2. There's another thread on-going about mount delays. I've been
> completely blind to this specific problem until it caught my eye. Does
> anyone have ballpark estimates for how long very large HDD-based
> filesystems will take to mount? Yes, I know it will depend on the
> dataset. I'm looking for O() worst-case approximations for
> enterprise-grade large drives (12/14TB), as I expect it should scale
> with multiple drives so approximating for a single drive should be good
> enough.
>
> 3. Do long mount delays relate to space_cache v1 vs v2 (I would guess
> no, unless it needed to be regenerated)?

No, the long mount times seem to be due to the fact that in order for a
btrfs filesystem to mount, it needs to enumerate its block group items,
and those are stored in the extent tree, which also holds all of the
information pertaining to allocated extents. Mixing those data
structures in the same tree, plus the fact that block groups are
iterated linearly during mount (check btrfs_read_block_groups), means
that on spinning rust with shitty seek times this can take a while.

However, this will really depend on the amount of extents you have, and
having taken a look at the thread you referred to, there seems to be no
clear-cut reason why mounting is taking so long on that particular
occasion.

>
> Note that I'm not sensitive to multi-second mount delays. I am
> sensitive to multi-minute mount delays, hence why I'm bringing this up.
>
> FWIW: I am currently populating a machine we have with 6TB drives in it
> with real-world home dir data to see if I can replicate the mount issue.
>
> Thanks,
>
> ellis
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
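[Editor's note: the linear, seek-bound scan of block group items described above can be turned into a rough back-of-envelope model. This sketch is purely illustrative; the ~1GiB-per-block-group figure, the 10ms seek time, and the seeks-per-item fraction are all assumptions, not measured btrfs behavior.]

```python
# Crude, hypothetical model: mount time when reading block group items is
# seek-bound. Assumes ~1GiB data block groups (so roughly bytes_used / 1GiB
# items) and that only a fraction of item reads miss readahead and pay a
# full disk seek.
GIB = 1024 ** 3

def estimated_mount_seconds(bytes_used, seek_ms=10.0, seeks_per_item=0.1):
    """Block group count scales with allocated bytes; each block group
    item read costs seeks_per_item average seeks of seek_ms each."""
    block_groups = bytes_used / GIB
    return block_groups * seeks_per_item * seek_ms / 1000.0

# A fully allocated 70TiB filesystem under these assumptions:
print(estimated_mount_seconds(70 * 1024 * GIB))
```

Under those (entirely assumed) constants, a fully allocated 70TiB filesystem would spend on the order of a minute just walking block group items, which is consistent with the "multi-minute" worry for worse seek ratios.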
* Re: Status of FST and mount times
  2018-02-14 17:08 ` Nikolay Borisov
@ 2018-02-14 17:21   ` Ellis H. Wilson III
  2018-02-15  1:42   ` Qu Wenruo
  2018-02-15  5:54   ` Chris Murphy
  2 siblings, 0 replies; 32+ messages in thread

From: Ellis H. Wilson III @ 2018-02-14 17:21 UTC (permalink / raw)
To: Nikolay Borisov, linux-btrfs

On 02/14/2018 12:08 PM, Nikolay Borisov wrote:
> V1 for large filesystems is jut awful. Facebook have been experiencing
> the pain hence they implemented v2. You can view the spacecache tree as
> the complement version of the extent tree. v1 cache is implemented as a
> hidden inode and even though writes (aka flushing of the freespace
> cache) are metadata they are essentially treated as data. This could
> potentially lead to priority inversions if cgroups io controller is
> involved.
>
> Furthermore, there is at least 1 known deadlock problem in freespace
> cache v1. So yes, if you want to use btrfs ona multi-tb system v2 is
> really the way to go.

Fantastic. Thanks for the backstory. That is what I will plan to use
then. I've been operating with whatever the default is (I presume v1
based on the man page), but haven't yet populated any of our machines
sufficiently to notice performance degradation due to space cache
problems.

> No, the long mount times seems to be due to the fact that in order for a
> btrfs filesystem to mount it needs to enumerate its block_groups items
> and those are stored in the extent tree, which also holds all of the
> information pertaining to allocated extents. So mixing those
> data structures in the same tree and the fact that blockgroups are
> iterated linearly during mount (check btrfs_read_block_groups) means on
> spinning rust with shitty seek times this can take a while.
>
> However, this will really depend on the amount of extents you have and
> having taken a look at the thread you referred to it seems there is not
> clear-cut reason why mounting is taking so long on that particular
> occasion.

Ok; thanks.
To phrase it somewhat more simply: should I expect "normal" datasets
(think home directories) that happen to be part of a very large btrfs
filesystem (tens of TBs) to take more than 60s to mount? Let's presume
there isn't extreme fragmentation or any media errors, to keep things
simple.

Best,

ellis
* Re: Status of FST and mount times 2018-02-14 17:08 ` Nikolay Borisov 2018-02-14 17:21 ` Ellis H. Wilson III @ 2018-02-15 1:42 ` Qu Wenruo 2018-02-15 2:15 ` Duncan 2018-02-15 11:12 ` Hans van Kranenburg 2018-02-15 5:54 ` Chris Murphy 2 siblings, 2 replies; 32+ messages in thread From: Qu Wenruo @ 2018-02-15 1:42 UTC (permalink / raw) To: Nikolay Borisov, Ellis H. Wilson III, linux-btrfs [-- Attachment #1.1: Type: text/plain, Size: 4901 bytes --] On 2018年02月15日 01:08, Nikolay Borisov wrote: > > > On 14.02.2018 18:00, Ellis H. Wilson III wrote: >> Hi again -- back with a few more questions: >> >> Frame-of-reference here: RAID0. Around 70TB raw capacity. No >> compression. No quotas enabled. Many (potentially tens to hundreds) of >> subvolumes, each with tens of snapshots. No control over size or number >> of files, but directory tree (entries per dir and general tree depth) >> can be controlled in case that's helpful). >> >> 1. I've been reading up about the space cache, and it appears there is a >> v2 of it called the free space tree that is much friendlier to large >> filesystems such as the one I am designing for. It is listed as OK/OK >> on the wiki status page, but there is a note that btrfs progs treats it >> as read only (i.e., btrfs check repair cannot help me without a full >> space cache rebuild is my biggest concern) and the last status update on >> this I can find was circa fall 2016. Can anybody give me an updated >> status on this feature? From what I read, v1 and tens of TB filesystems >> will not play well together, so I'm inclined to dig into this. > > V1 for large filesystems is jut awful. Facebook have been experiencing > the pain hence they implemented v2. You can view the spacecache tree as > the complement version of the extent tree. v1 cache is implemented as a > hidden inode and even though writes (aka flushing of the freespace > cache) are metadata they are essentially treated as data. 
This could > potentially lead to priority inversions if cgroups io controller is > involved. > > Furthermore, there is at least 1 known deadlock problem in freespace > cache v1. So yes, if you want to use btrfs ona multi-tb system v2 is > really the way to go. > >> >> 2. There's another thread on-going about mount delays. I've been >> completely blind to this specific problem until it caught my eye. Does >> anyone have ballpark estimates for how long very large HDD-based >> filesystems will take to mount? Yes, I know it will depend on the >> dataset. I'm looking for O() worst-case approximations for >> enterprise-grade large drives (12/14TB), as I expect it should scale >> with multiple drives so approximating for a single drive should be good >> enough. >> >> 3. Do long mount delays relate to space_cache v1 vs v2 (I would guess >> no, unless it needed to be regenerated)? > > No, the long mount times seems to be due to the fact that in order for a > btrfs filesystem to mount it needs to enumerate its block_groups items > and those are stored in the extent tree, which also holds all of the > information pertaining to allocated extents. So mixing those > data structures in the same tree and the fact that blockgroups are > iterated linearly during mount (check btrfs_read_block_groups) means on > spinning rust with shitty seek times this can take a while. And, space cache is not loaded at mount time. It's delayed until we determine to allocate extent from one block group. So space cache is completely unrelated to long mount time. > > However, this will really depend on the amount of extents you have and > having taken a look at the thread you referred to it seems there is not > clear-cut reason why mounting is taking so long on that particular > occasion . 
Just as Nikolay said, the biggest problem for slow mounts is the size of
the extent tree (and HDD seek time).

The easiest way to get a basic idea of how large your extent tree is, is
to use debug tree:

# btrfs-debug-tree -r -t extent <device>

You would get something like:

btrfs-progs v4.15
extent tree key (EXTENT_TREE ROOT_ITEM 0) 30539776 level 0   <<<
total bytes 10737418240
bytes used 393216
uuid 651fcf0c-0ffd-4351-9721-84b1615f02e0

That level would give you some basic idea of the size of your extent
tree.

For level 0, it can contain about 400 items on average.
For level 1, it can contain up to 197K items.
...
For level n, it can contain up to 400 * 493 ^ (n - 1) items. (n <= 7)

Thanks,
Qu

>
>
>>
>> Note that I'm not sensitive to multi-second mount delays. I am
>> sensitive to multi-minute mount delays, hence why I'm bringing this up.
>>
>> FWIW: I am currently populating a machine we have with 6TB drives in it
>> with real-world home dir data to see if I can replicate the mount issue.
>>
>> Thanks,
>>
>> ellis
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
* Re: Status of FST and mount times
  2018-02-15  1:42 ` Qu Wenruo
@ 2018-02-15  2:15   ` Duncan
  2018-02-15  3:49     ` Qu Wenruo
  1 sibling, 1 reply; 32+ messages in thread

From: Duncan @ 2018-02-15 2:15 UTC (permalink / raw)
To: linux-btrfs

Qu Wenruo posted on Thu, 15 Feb 2018 09:42:27 +0800 as excerpted:

> The easiest way to get a basic idea of how large your extent tree is
> using debug tree:
>
> # btrfs-debug-tree -r -t extent <device>
>
> You would get something like:
> btrfs-progs v4.15 extent tree key (EXTENT_TREE ROOT_ITEM 0) 30539776
> level 0 <<<
> total bytes 10737418240 bytes used 393216 uuid
> 651fcf0c-0ffd-4351-9721-84b1615f02e0
>
> That level is would give you some basic idea of the size of your extent
> tree.
>
> For level 0, it could contains about 400 items for average.
> For level 1, it could contains up to 197K items.
> ...
> For leven n, it could contains up to 400 * 493 ^ (n - 1) items.
> ( n <= 7 )

So for level 2 (which I see on a couple of mine here, ran it out of
curiosity):

400 * 493 ^ (2 - 1) = 400 * 493 = 197200

197K for both level 1 and level 2? That doesn't look correct.

Perhaps you meant a simple power of n, instead of (n-1)? That would
yield ~97M for level 2, and would yield the given numbers for levels 0
and 1 as well, whereas using n-1 for level 0 yields less than a single
entry, and 400 for level 1.

Or the given numbers were for levels 1 and 2, with level 0 not holding
anything, not levels 0 and 1. But that wouldn't jibe with your level 0
example, which I would assume could never happen if it couldn't hold
even a single entry.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Status of FST and mount times
  2018-02-15  2:15 ` Duncan
@ 2018-02-15  3:49   ` Qu Wenruo
  0 siblings, 0 replies; 32+ messages in thread

From: Qu Wenruo @ 2018-02-15 3:49 UTC (permalink / raw)
To: Duncan, linux-btrfs

On 2018年02月15日 10:15, Duncan wrote:
> Qu Wenruo posted on Thu, 15 Feb 2018 09:42:27 +0800 as excerpted:
>
>> The easiest way to get a basic idea of how large your extent tree is
>> using debug tree:
>>
>> # btrfs-debug-tree -r -t extent <device>
>>
>> You would get something like:
>> btrfs-progs v4.15 extent tree key (EXTENT_TREE ROOT_ITEM 0) 30539776
>> level 0 <<<
>> total bytes 10737418240 bytes used 393216 uuid
>> 651fcf0c-0ffd-4351-9721-84b1615f02e0
>>
>> That level is would give you some basic idea of the size of your extent
>> tree.
>>
>> For level 0, it could contains about 400 items for average.
>> For level 1, it could contains up to 197K items.
>> ...
>> For leven n, it could contains up to 400 * 493 ^ (n - 1) items.
>> ( n <= 7 )
>
> So for level 2 (which I see on a couple of mine here, ran it out of
> curiosity):
>
> 400 * 493 ^ (2 - 1) = 400 * 493 = 197200
>
> 197K for both level 1 and level 2? Doesn't look correct.
>
> Perhaps you meant a simple power of n, instead of (n-1)?

My fault, an off-by-one is really easy to screw up.

So it's 400 * 493 ^ n, and level 0 also fits into the calculation.

> That would
> yield ~97M for level 2, and would yield the given numbers for levels 0
> and 1 as well, whereby using n-1 for level 0 yields less than a single
> entry, and 400 for level 1.
>
> Or the given numbers were for level 1 and 2, with level 0 not holding
> anything, not levels 0 and 1. But that wouldn't jive with your level 0
> example, which I would assume could never happen if it couldn't hold even
> a single entry.

Here level 0 means it's a leaf, and I assume the average item size of
each EXTENT_ITEM/METADATA_ITEM to be 40 bytes.
And using a 16K nodesize we have 16283 usable bytes per node, which
gives 407 items; I just round it down to 400 to make the calculation a
little easier and to leave more headroom for larger items.

So for level 0, we could have around 400 items.

For nodes (1 <= level <= 7), since a node key pointer is fixed at 33
bytes, the calculation is pretty simple now.

Thanks,
Qu
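[Editor's note: the corrected capacity formula can be tabulated directly. This sketch just evaluates the thread's approximation (~400 items per 16KiB leaf from a ~40-byte average item, ~493 key pointers per node from the 33-byte pointer size); the constants are the thread's estimates, not exact btrfs limits.]

```python
# Reproduce the thread's approximation of extent tree capacity by level.
NODESIZE = 16384   # 16KiB nodes
HEADER = 101       # btrfs header bytes per node, leaving 16283 usable
AVG_ITEM = 40      # assumed average EXTENT_ITEM/METADATA_ITEM footprint
KEY_PTR = 33       # fixed size of a node key pointer

items_per_leaf = (NODESIZE - HEADER) // AVG_ITEM   # ~407, rounded to 400
ptrs_per_node = (NODESIZE - HEADER) // KEY_PTR     # ~493

def max_items(level, per_leaf=400, per_node=493):
    """Upper bound on items for a tree whose root is at `level` (0 = leaf),
    using the corrected formula 400 * 493 ^ level."""
    return per_leaf * per_node ** level

for level in range(4):
    print(level, max_items(level))
```

Level 0 gives 400, level 1 the ~197K quoted above, level 2 roughly 97M, and level 3 already tens of billions, so a level-2 or level-3 extent tree root is plenty for a multi-TB filesystem.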
* Re: Status of FST and mount times 2018-02-15 1:42 ` Qu Wenruo 2018-02-15 2:15 ` Duncan @ 2018-02-15 11:12 ` Hans van Kranenburg 2018-02-15 16:30 ` Ellis H. Wilson III 1 sibling, 1 reply; 32+ messages in thread From: Hans van Kranenburg @ 2018-02-15 11:12 UTC (permalink / raw) To: Qu Wenruo, Nikolay Borisov, Ellis H. Wilson III, linux-btrfs On 02/15/2018 02:42 AM, Qu Wenruo wrote: > > > On 2018年02月15日 01:08, Nikolay Borisov wrote: >> >> >> On 14.02.2018 18:00, Ellis H. Wilson III wrote: >>> Hi again -- back with a few more questions: >>> >>> Frame-of-reference here: RAID0. Around 70TB raw capacity. No >>> compression. No quotas enabled. Many (potentially tens to hundreds) of >>> subvolumes, each with tens of snapshots. No control over size or number >>> of files, but directory tree (entries per dir and general tree depth) >>> can be controlled in case that's helpful). >>> >>> 1. I've been reading up about the space cache, and it appears there is a >>> v2 of it called the free space tree that is much friendlier to large >>> filesystems such as the one I am designing for. It is listed as OK/OK >>> on the wiki status page, but there is a note that btrfs progs treats it >>> as read only (i.e., btrfs check repair cannot help me without a full >>> space cache rebuild is my biggest concern) and the last status update on >>> this I can find was circa fall 2016. Can anybody give me an updated >>> status on this feature? From what I read, v1 and tens of TB filesystems >>> will not play well together, so I'm inclined to dig into this. >> >> V1 for large filesystems is jut awful. Facebook have been experiencing >> the pain hence they implemented v2. You can view the spacecache tree as >> the complement version of the extent tree. v1 cache is implemented as a >> hidden inode and even though writes (aka flushing of the freespace >> cache) are metadata they are essentially treated as data. 
This could >> potentially lead to priority inversions if cgroups io controller is >> involved. >> >> Furthermore, there is at least 1 known deadlock problem in freespace >> cache v1. So yes, if you want to use btrfs ona multi-tb system v2 is >> really the way to go. >> >>> >>> 2. There's another thread on-going about mount delays. I've been >>> completely blind to this specific problem until it caught my eye. Does >>> anyone have ballpark estimates for how long very large HDD-based >>> filesystems will take to mount? Yes, I know it will depend on the >>> dataset. I'm looking for O() worst-case approximations for >>> enterprise-grade large drives (12/14TB), as I expect it should scale >>> with multiple drives so approximating for a single drive should be good >>> enough. >>> >>> 3. Do long mount delays relate to space_cache v1 vs v2 (I would guess >>> no, unless it needed to be regenerated)? >> >> No, the long mount times seems to be due to the fact that in order for a >> btrfs filesystem to mount it needs to enumerate its block_groups items >> and those are stored in the extent tree, which also holds all of the >> information pertaining to allocated extents. So mixing those >> data structures in the same tree and the fact that blockgroups are >> iterated linearly during mount (check btrfs_read_block_groups) means on >> spinning rust with shitty seek times this can take a while. > > And, space cache is not loaded at mount time. > It's delayed until we determine to allocate extent from one block group. > > So space cache is completely unrelated to long mount time. > >> >> However, this will really depend on the amount of extents you have and >> having taken a look at the thread you referred to it seems there is not >> clear-cut reason why mounting is taking so long on that particular >> occasion . 
>
> Just as said by Nikolay, the biggest problem of slow mount is the size
> of extent tree (and HDD seek time)
>
> The easiest way to get a basic idea of how large your extent tree is
> using debug tree:
>
> # btrfs-debug-tree -r -t extent <device>
>
> You would get something like:
> btrfs-progs v4.15
> extent tree key (EXTENT_TREE ROOT_ITEM 0) 30539776 level 0 <<<
> total bytes 10737418240
> bytes used 393216
> uuid 651fcf0c-0ffd-4351-9721-84b1615f02e0
>
> That level is would give you some basic idea of the size of your extent
> tree.
>
> For level 0, it could contains about 400 items for average.
> For level 1, it could contains up to 197K items.
> ...
> For leven n, it could contains up to 400 * 493 ^ (n - 1) items.
> ( n <= 7 )

Another one to get that data:

https://github.com/knorrie/python-btrfs/blob/master/examples/show_metadata_tree_sizes.py

Example, with the amount of leaves on level 0 and nodes higher up:

-# ./show_metadata_tree_sizes.py /
ROOT_TREE         336.00KiB 0(    20) 1(  1)
EXTENT_TREE       123.52MiB 0(  7876) 1( 28) 2(  1)
CHUNK_TREE        112.00KiB 0(     6) 1(  1)
DEV_TREE           80.00KiB 0(     4) 1(  1)
FS_TREE          1016.34MiB 0( 64113) 1( 881) 2( 52)
CSUM_TREE         777.42MiB 0( 49571) 1( 183) 2(  1)
QUOTA_TREE            0.00B
UUID_TREE          16.00KiB 0(     1)
FREE_SPACE_TREE   336.00KiB 0(    20) 1(  1)
DATA_RELOC_TREE    16.00KiB 0(     1)

>
> Thanks,
> Qu
>
>>
>>
>>>
>>> Note that I'm not sensitive to multi-second mount delays. I am
>>> sensitive to multi-minute mount delays, hence why I'm bringing this up.
>>>
>>> FWIW: I am currently populating a machine we have with 6TB drives in it
>>> with real-world home dir data to see if I can replicate the mount issue.
>>>
>>> Thanks,
>>>
>>> ellis
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>

--
Hans van Kranenburg
* Re: Status of FST and mount times 2018-02-15 11:12 ` Hans van Kranenburg @ 2018-02-15 16:30 ` Ellis H. Wilson III 2018-02-16 1:55 ` Qu Wenruo 0 siblings, 1 reply; 32+ messages in thread From: Ellis H. Wilson III @ 2018-02-15 16:30 UTC (permalink / raw) To: Hans van Kranenburg, Qu Wenruo, Nikolay Borisov, linux-btrfs On 02/15/2018 06:12 AM, Hans van Kranenburg wrote: > On 02/15/2018 02:42 AM, Qu Wenruo wrote: >> Just as said by Nikolay, the biggest problem of slow mount is the size >> of extent tree (and HDD seek time) >> >> The easiest way to get a basic idea of how large your extent tree is >> using debug tree: >> >> # btrfs-debug-tree -r -t extent <device> >> >> You would get something like: >> btrfs-progs v4.15 >> extent tree key (EXTENT_TREE ROOT_ITEM 0) 30539776 level 0 <<< >> total bytes 10737418240 >> bytes used 393216 >> uuid 651fcf0c-0ffd-4351-9721-84b1615f02e0 >> >> That level is would give you some basic idea of the size of your extent >> tree. >> >> For level 0, it could contains about 400 items for average. >> For level 1, it could contains up to 197K items. >> ... >> For leven n, it could contains up to 400 * 493 ^ (n - 1) items. >> ( n <= 7 ) > > Another one to get that data: > > https://github.com/knorrie/python-btrfs/blob/master/examples/show_metadata_tree_sizes.py > > Example, with amount of leaves on level 0 and nodes higher up: > > -# ./show_metadata_tree_sizes.py / > ROOT_TREE 336.00KiB 0( 20) 1( 1) > EXTENT_TREE 123.52MiB 0( 7876) 1( 28) 2( 1) > CHUNK_TREE 112.00KiB 0( 6) 1( 1) > DEV_TREE 80.00KiB 0( 4) 1( 1) > FS_TREE 1016.34MiB 0( 64113) 1( 881) 2( 52) > CSUM_TREE 777.42MiB 0( 49571) 1( 183) 2( 1) > QUOTA_TREE 0.00B > UUID_TREE 16.00KiB 0( 1) > FREE_SPACE_TREE 336.00KiB 0( 20) 1( 1) > DATA_RELOC_TREE 16.00KiB 0( 1) Very helpful information. Thank you Qu and Hans! 
I have about 1.7TB of newly rsync'd homedir data on a single enterprise
7200rpm HDD and the following output for btrfs-debug:

extent tree key (EXTENT_TREE ROOT_ITEM 0) 543384862720 level 2
total bytes 6001175126016
bytes used 1832557875200

Hans' (very cool) tool reports:
ROOT_TREE         624.00KiB 0(    38) 1(  1)
EXTENT_TREE       327.31MiB 0( 20881) 1( 66) 2( 1)
CHUNK_TREE        208.00KiB 0(    12) 1(  1)
DEV_TREE          144.00KiB 0(     8) 1(  1)
FS_TREE             5.75GiB 0(375589) 1(952) 2( 2) 3( 1)
CSUM_TREE           1.75GiB 0(114274) 1(385) 2( 1)
QUOTA_TREE            0.00B
UUID_TREE          16.00KiB 0(     1)
FREE_SPACE_TREE       0.00B
DATA_RELOC_TREE    16.00KiB 0(     1)

Mean mount times across 5 tests: 4.319s (stddev=0.079s)

Taking 100 snapshots (no changes between snapshots, however) of the
above subvolume doesn't appear to impact mount/umount time. Snapshot
creation and deletion both operate at between 0.25s and 0.5s. I am very
impressed with snapshot deletion in particular now that qgroups are
disabled.

I will do more mount testing with twice and three times that dataset and
see how mount times scale.

All done on 4.5.5. I really need to move to a newer kernel.

Best,

ellis
* Re: Status of FST and mount times 2018-02-15 16:30 ` Ellis H. Wilson III @ 2018-02-16 1:55 ` Qu Wenruo 2018-02-16 14:12 ` Ellis H. Wilson III 0 siblings, 1 reply; 32+ messages in thread From: Qu Wenruo @ 2018-02-16 1:55 UTC (permalink / raw) To: Ellis H. Wilson III, Hans van Kranenburg, Nikolay Borisov, linux-btrfs [-- Attachment #1.1: Type: text/plain, Size: 4039 bytes --] On 2018年02月16日 00:30, Ellis H. Wilson III wrote: > On 02/15/2018 06:12 AM, Hans van Kranenburg wrote: >> On 02/15/2018 02:42 AM, Qu Wenruo wrote: >>> Just as said by Nikolay, the biggest problem of slow mount is the size >>> of extent tree (and HDD seek time) >>> >>> The easiest way to get a basic idea of how large your extent tree is >>> using debug tree: >>> >>> # btrfs-debug-tree -r -t extent <device> >>> >>> You would get something like: >>> btrfs-progs v4.15 >>> extent tree key (EXTENT_TREE ROOT_ITEM 0) 30539776 level 0 <<< >>> total bytes 10737418240 >>> bytes used 393216 >>> uuid 651fcf0c-0ffd-4351-9721-84b1615f02e0 >>> >>> That level is would give you some basic idea of the size of your extent >>> tree. >>> >>> For level 0, it could contains about 400 items for average. >>> For level 1, it could contains up to 197K items. >>> ... >>> For leven n, it could contains up to 400 * 493 ^ (n - 1) items. >>> ( n <= 7 ) >> >> Another one to get that data: >> >> https://github.com/knorrie/python-btrfs/blob/master/examples/show_metadata_tree_sizes.py >> >> >> Example, with amount of leaves on level 0 and nodes higher up: >> >> -# ./show_metadata_tree_sizes.py / >> ROOT_TREE 336.00KiB 0( 20) 1( 1) >> EXTENT_TREE 123.52MiB 0( 7876) 1( 28) 2( 1) >> CHUNK_TREE 112.00KiB 0( 6) 1( 1) >> DEV_TREE 80.00KiB 0( 4) 1( 1) >> FS_TREE 1016.34MiB 0( 64113) 1( 881) 2( 52) >> CSUM_TREE 777.42MiB 0( 49571) 1( 183) 2( 1) >> QUOTA_TREE 0.00B >> UUID_TREE 16.00KiB 0( 1) >> FREE_SPACE_TREE 336.00KiB 0( 20) 1( 1) >> DATA_RELOC_TREE 16.00KiB 0( 1) > > Very helpful information. Thank you Qu and Hans! 
>
> I have about 1.7TB of homedir data newly rsync'd data on a single
> enterprise 7200rpm HDD and the following output for btrfs-debug:
>
> extent tree key (EXTENT_TREE ROOT_ITEM 0) 543384862720 level 2
> total bytes 6001175126016
> bytes used 1832557875200
>
> Hans' (very cool) tool reports:
> ROOT_TREE 624.00KiB 0( 38) 1( 1)
> EXTENT_TREE 327.31MiB 0( 20881) 1( 66) 2( 1)

Extent tree is not so large; a little unexpected to see such a slow
mount.

BTW, how many chunks do you have?

It could be checked by:

# btrfs-debug-tree -t chunk <device> | grep CHUNK_ITEM | wc -l

Unless we have tons of chunks, it shouldn't be too slow.

> CHUNK_TREE 208.00KiB 0( 12) 1( 1)
> DEV_TREE 144.00KiB 0( 8) 1( 1)
> FS_TREE 5.75GiB 0(375589) 1( 952) 2( 2) 3( 1)
> CSUM_TREE 1.75GiB 0(114274) 1( 385) 2( 1)
> QUOTA_TREE 0.00B
> UUID_TREE 16.00KiB 0( 1)
> FREE_SPACE_TREE 0.00B
> DATA_RELOC_TREE 16.00KiB 0( 1)
>
> Mean mount times across 5 tests: 4.319s (stddev=0.079s)
>
> Taking 100 snapshots (no changes between snapshots however) of the above
> subvolume doesn't appear to impact mount/umount time.

100 unmodified snapshots won't affect mount time.

Affecting mount time needs new extents, which can be created by
overwriting extents in snapshots. So it won't really make much
difference if all these snapshots are unmodified.

> Snapshot creation
> and deletion both operate at between 0.25s to 0.5s.

IIRC snapshot deletion is delayed, so the real work doesn't happen when
"btrfs sub del" returns.

Thanks,
Qu

> I am very impressed
> with snapshot deletion in particular now that qgroups is disabled.
>
> I will do more mount testing with twice and three times that dataset and
> see how mount times scale.
>
> All done on 4.5.5. I really need to move to a newer kernel.
>
> Best,
>
> ellis
* Re: Status of FST and mount times
  2018-02-16  1:55 ` Qu Wenruo
@ 2018-02-16 14:12   ` Ellis H. Wilson III
  2018-02-16 14:20     ` Hans van Kranenburg
  2018-02-17  0:59     ` Qu Wenruo
  1 sibling, 2 replies; 32+ messages in thread

From: Ellis H. Wilson III @ 2018-02-16 14:12 UTC (permalink / raw)
To: Qu Wenruo, Hans van Kranenburg, Nikolay Borisov, linux-btrfs

On 02/15/2018 08:55 PM, Qu Wenruo wrote:
> On 2018年02月16日 00:30, Ellis H. Wilson III wrote:
>> Very helpful information. Thank you Qu and Hans!
>>
>> I have about 1.7TB of homedir data newly rsync'd data on a single
>> enterprise 7200rpm HDD and the following output for btrfs-debug:
>>
>> extent tree key (EXTENT_TREE ROOT_ITEM 0) 543384862720 level 2
>> total bytes 6001175126016
>> bytes used 1832557875200
>>
>> Hans' (very cool) tool reports:
>> ROOT_TREE 624.00KiB 0( 38) 1( 1)
>> EXTENT_TREE 327.31MiB 0( 20881) 1( 66) 2( 1)
>
> Extent tree is not so large, a little unexpected to see such slow mount.
>
> BTW, how many chunks do you have?
>
> It could be checked by:
>
> # btrfs-debug-tree -t chunk <device> | grep CHUNK_ITEM | wc -l

Since yesterday I've doubled the size by copying the homedir dataset in
again. Here are the new stats:

extent tree key (EXTENT_TREE ROOT_ITEM 0) 385990656 level 2
total bytes 6001175126016
bytes used 3663525969920

$ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
3454

$ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
ROOT_TREE           1.14MiB 0(    72) 1(   1)
EXTENT_TREE       644.27MiB 0( 41101) 1( 131) 2( 1)
CHUNK_TREE        384.00KiB 0(    23) 1(   1)
DEV_TREE          272.00KiB 0(    16) 1(   1)
FS_TREE            11.55GiB 0(754442) 1(2179) 2( 5) 3( 2)
CSUM_TREE           3.50GiB 0(228593) 1( 791) 2( 2) 3( 1)
QUOTA_TREE            0.00B
UUID_TREE          16.00KiB 0(     1)
FREE_SPACE_TREE       0.00B
DATA_RELOC_TREE    16.00KiB 0(     1)

The old mean mount time was 4.319s. It now takes 11.537s for the
doubled dataset. Again, please realize this is on an old version of
BTRFS (4.5.5), so perhaps newer ones will perform better, but I'd still
like to understand this delay more.
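[Editor's note: a quick sanity check on the chunk count reported above. This assumes the common ~1GiB data chunk size, which the thread itself doesn't state; metadata and system chunks, typically smaller, plausibly account for the remainder.]

```python
# Compare the reported CHUNK_ITEM count against bytes used, assuming
# ~1GiB per data chunk (an assumption, not a value from the thread).
GIB = 1024 ** 3

bytes_used = 3663525969920   # "bytes used" from the btrfs-debug-tree output
chunk_items = 3454           # CHUNK_ITEM count from the grep above

expected_data_chunks = round(bytes_used / GIB)
extra = chunk_items - expected_data_chunks
print(expected_data_chunks, extra)
```

Under that assumption, roughly 3412 of the 3454 chunks are accounted for by data, leaving a few dozen for metadata and system chunks, so the reported count is internally consistent.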
Should I expect this to scale in this way all the way up to my proposed 60-80TB filesystem so long as the file size distribution stays roughly similar? That would definitely be in terms of multiple minutes at that point. >> Taking 100 snapshots (no changes between snapshots however) of the above >> subvolume doesn't appear to impact mount/umount time. > > 100 unmodified snapshots won't affect mount time. > > It needs new extents, which can be created by overwriting extents in > snapshots. > So it won't really cause much difference if all these snapshots are all > unmodified. Good to know, thanks! >> Snapshot creation >> and deletion both operate at between 0.25s to 0.5s. > > IIRC snapshot deletion is delayed, so the real work doesn't happen when > "btrfs sub del" returns. I was using btrfs sub del -C for the deletions, so I believe (if that command truly waits for the subvolume to be utterly gone) it captures the entirety of the snapshot. Best, ellis ^ permalink raw reply [flat|nested] 32+ messages in thread
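For a rough sense of the scaling question above, the two measurements in this thread (4.319s at ~1.7TB of data, 11.537s after doubling to ~3.4TB) can be fitted to a power law. This is only a two-point extrapolation under an assumed model, not anything derived from btrfs internals; the 60TB figure is purely illustrative:

```python
import math

# Two measured points from this thread (data size in TB, mean mount time in s)
d1, t1 = 1.7, 4.319
d2, t2 = 3.4, 11.537

# Assume t ~ c * d**k and solve for the exponent from the two points
k = math.log(t2 / t1) / math.log(d2 / d1)   # ≈ 1.42, i.e. superlinear

# Extrapolate to a hypothetical 60TB of data under the same model
t60 = t2 * (60 / d2) ** k                   # ≈ 675 s, on the order of 11 minutes
print(f"exponent k ≈ {k:.2f}, projected mount at 60TB ≈ {t60 / 60:.0f} min")
```

If mount time is really dominated by block-group item lookups, the curve depends on chunk count rather than raw data size, so treat this only as a ballpark.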
* Re: Status of FST and mount times 2018-02-16 14:12 ` Ellis H. Wilson III @ 2018-02-16 14:20 ` Hans van Kranenburg 2018-02-16 14:42 ` Ellis H. Wilson III 2018-02-17 0:59 ` Qu Wenruo 1 sibling, 1 reply; 32+ messages in thread From: Hans van Kranenburg @ 2018-02-16 14:20 UTC (permalink / raw) To: Ellis H. Wilson III, Qu Wenruo, Nikolay Borisov, linux-btrfs On 02/16/2018 03:12 PM, Ellis H. Wilson III wrote: > On 02/15/2018 08:55 PM, Qu Wenruo wrote: >> On 2018年02月16日 00:30, Ellis H. Wilson III wrote: >>> Very helpful information. Thank you Qu and Hans! >>> >>> I have about 1.7TB of homedir data newly rsync'd data on a single >>> enterprise 7200rpm HDD and the following output for btrfs-debug: >>> >>> extent tree key (EXTENT_TREE ROOT_ITEM 0) 543384862720 level 2 >>> total bytes 6001175126016 >>> bytes used 1832557875200 >>> >>> Hans' (very cool) tool reports: >>> ROOT_TREE 624.00KiB 0( 38) 1( 1) >>> EXTENT_TREE 327.31MiB 0( 20881) 1( 66) 2( 1) >> >> Extent tree is not so large, a little unexpected to see such slow mount. >> >> BTW, how many chunks do you have? >> >> It could be checked by: >> >> # btrfs-debug-tree -t chunk <device> | grep CHUNK_ITEM | wc -l > > Since yesterday I've doubled the size by copying the homdir dataset in > again. Here are new stats: > > extent tree key (EXTENT_TREE ROOT_ITEM 0) 385990656 level 2 > total bytes 6001175126016 > bytes used 3663525969920 > > $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l > 3454 > > $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/ > ROOT_TREE 1.14MiB 0( 72) 1( 1) > EXTENT_TREE 644.27MiB 0( 41101) 1( 131) 2( 1) > CHUNK_TREE 384.00KiB 0( 23) 1( 1) > DEV_TREE 272.00KiB 0( 16) 1( 1) > FS_TREE 11.55GiB 0(754442) 1( 2179) 2( 5) 3( 2) > CSUM_TREE 3.50GiB 0(228593) 1( 791) 2( 2) 3( 1) > QUOTA_TREE 0.00B > UUID_TREE 16.00KiB 0( 1) > FREE_SPACE_TREE 0.00B > DATA_RELOC_TREE 16.00KiB 0( 1) > > The old mean mount time was 4.319s. It now takes 11.537s for the > doubled dataset. 
Again please realize this is on an old version of > BTRFS (4.5.5), so perhaps newer ones will perform better, but I'd still > like to understand this delay more. Should I expect this to scale in > this way all the way up to my proposed 60-80TB filesystem so long as the > file size distribution stays roughly similar? That would definitely be > in terms of multiple minutes at that point. Well, imagine you have a big tree (an actual real life tree outside) and you need to pick things (e.g. apples) which are hanging everywhere. So, what you need to do is climb the tree, climb on a branch all the way to the end where the first apple is... climb back, climb up a bit, go onto the next branch to the end for the next apple... etc etc.... The bigger the tree is, the longer it keeps you busy, because the apples will be semi-evenly distributed around the full tree, and they're always hanging at the end of the branch. The speed with which you can climb around (random read disk access IO speed for btrfs, because your disk cache is empty when first mounting) determines how quickly you're done. So, yes. >>> Taking 100 snapshots (no changes between snapshots however) of the above >>> subvolume doesn't appear to impact mount/umount time. >> >> 100 unmodified snapshots won't affect mount time. >> >> It needs new extents, which can be created by overwriting extents in >> snapshots. >> So it won't really cause much difference if all these snapshots are all >> unmodified. > > Good to know, thanks! > >>> Snapshot creation >>> and deletion both operate at between 0.25s to 0.5s. >> >> IIRC snapshot deletion is delayed, so the real work doesn't happen when >> "btrfs sub del" returns. > > I was using btrfs sub del -C for the deletions, so I believe (if that > command truly waits for the subvolume to be utterly gone) it captures > the entirety of the snapshot. > > Best, > > ellis -- Hans van Kranenburg ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Status of FST and mount times 2018-02-16 14:20 ` Hans van Kranenburg @ 2018-02-16 14:42 ` Ellis H. Wilson III 2018-02-16 14:55 ` Ellis H. Wilson III 0 siblings, 1 reply; 32+ messages in thread From: Ellis H. Wilson III @ 2018-02-16 14:42 UTC (permalink / raw) To: Hans van Kranenburg, Qu Wenruo, Nikolay Borisov, linux-btrfs On 02/16/2018 09:20 AM, Hans van Kranenburg wrote: > Well, imagine you have a big tree (an actual real life tree outside) and > you need to pick things (e.g. apples) which are hanging everywhere. > > So, what you need to to is climb the tree, climb on a branch all the way > to the end where the first apple is... climb back, climb up a bit, go > onto the next branch to the end for the next apple... etc etc.... > > The bigger the tree is, the longer it keeps you busy, because the apples > will be semi-evenly distributed around the full tree, and they're always > hanging at the end of the branch. The speed with which you can climb > around (random read disk access IO speed for btrfs, because your disk > cache is empty when first mounting) determines how quickly you're done. > > So, yes. Thanks Hans. I will say multiple minutes (by the looks of things, I'll end up near to an hour for 60TB if this non-linear scaling continues) to mount a filesystem is undesirable, but I won't offer that criticism without thinking constructively for a moment: Help me out by referencing the tree in question if you don't mind, so I can better understand the point of picking all these "apples" (I would guess for capacity reporting via df, but maybe there's more). Typical disclaimer that I haven't yet grokked the various inner-workings of BTRFS, so this is quite possibly a terrible or unapproachable idea: On umount, you must already have whatever metadata you were doing the tree walk on mount for in-memory (otherwise you would have been able to lazily do the treewalk after a quick mount). 
Therefore, could we not stash this metadata at or associated with, say, the root of the subvolumes? This way you can always determine on mount quickly if the cache is still valid (i.e., no situation like: remount with old btrfs, change stuff, umount with old btrfs, remount with new btrfs, pain). I would guess generation would be sufficient to determine if the cached metadata is valid for the given root block. This would scale with number of subvolumes (but not snapshots), and would be reasonably quick I think. Thoughts? ellis ^ permalink raw reply [flat|nested] 32+ messages in thread
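The generation-check idea sketched above could work roughly like the following. This is a hypothetical illustration only, with invented names; it is not how btrfs lays anything out on disk:

```python
from dataclasses import dataclass

@dataclass
class CachedBlockGroups:
    """Hypothetical stash of block-group metadata written at clean unmount."""
    generation: int     # tree generation at the time the cache was written
    block_groups: list  # whatever mount would otherwise walk the extent tree for

def load_block_groups(current_generation, cache, slow_tree_walk):
    """Use the stashed copy when generations match; otherwise fall back to a walk."""
    if cache is not None and cache.generation == current_generation:
        return cache.block_groups   # fast path: no tree walk needed
    return slow_tree_walk()         # cache missing or stale (e.g. old kernel wrote)

# Toy demonstration of the validity check
cache = CachedBlockGroups(generation=42, block_groups=["bg0", "bg1"])
assert load_block_groups(42, cache, lambda: []) == ["bg0", "bg1"]
assert load_block_groups(43, cache, lambda: ["walked"]) == ["walked"]
```

The point of the generation comparison is exactly the "old btrfs modified the fs in between" scenario described above: any writer that doesn't know about the cache bumps the generation, which automatically invalidates the stash.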
* Re: Status of FST and mount times 2018-02-16 14:42 ` Ellis H. Wilson III @ 2018-02-16 14:55 ` Ellis H. Wilson III 0 siblings, 0 replies; 32+ messages in thread From: Ellis H. Wilson III @ 2018-02-16 14:55 UTC (permalink / raw) To: Hans van Kranenburg, Qu Wenruo, Nikolay Borisov, linux-btrfs On 02/16/2018 09:42 AM, Ellis H. Wilson III wrote: > On 02/16/2018 09:20 AM, Hans van Kranenburg wrote: >> Well, imagine you have a big tree (an actual real life tree outside) and >> you need to pick things (e.g. apples) which are hanging everywhere. >> >> So, what you need to to is climb the tree, climb on a branch all the way >> to the end where the first apple is... climb back, climb up a bit, go >> onto the next branch to the end for the next apple... etc etc.... >> >> The bigger the tree is, the longer it keeps you busy, because the apples >> will be semi-evenly distributed around the full tree, and they're always >> hanging at the end of the branch. The speed with which you can climb >> around (random read disk access IO speed for btrfs, because your disk >> cache is empty when first mounting) determines how quickly you're done. >> >> So, yes. > > Thanks Hans. I will say multiple minutes (by the looks of things, I'll > end up near to an hour for 60TB if this non-linear scaling continues) to > mount a filesystem is undesirable, but I won't offer that criticism > without thinking constructively for a moment: > > Help me out by referencing the tree in question if you don't mind, so I > can better understand the point of picking all these "apples" (I would > guess for capacity reporting via df, but maybe there's more). > > Typical disclaimer that I haven't yet grokked the various inner-workings > of BTRFS, so this is quite possibly a terrible or unapproachable idea: > > On umount, you must already have whatever metadata you were doing the > tree walk on mount for in-memory (otherwise you would have been able to > lazily do the treewalk after a quick mount). 
Therefore, could we not > stash this metadata at or associated with, say, the root of the > subvolumes? This way you can always determine on mount quickly if the > cache is still valid (i.e., no situation like: remount with old btrfs, > change stuff, umount with old btrfs, remount with new btrfs, pain). I > would guess generation would be sufficient to determine if the cached > metadata is valid for the given root block. > > This would scale with number of subvolumes (but not snapshots), and > would be reasonably quick I think. I see on 02/13 Qu commented regarding a similar idea, except proposed perhaps a richer version of my above suggestion (making block group into its own tree). The concern was that it would be a lot of work since it modifies the on-disk format. That's a reasonable worry. I will get a new kernel, expand my array to around 36TB, and will generate a plot of mount times against extents going up to at least 30TB in increments of 0.5TB. If this proves to reach absurd mount time delays (to be specific, anything above around 60s is untenable for our use), we may very well be sufficiently motivated to implement the above improvement and submit it for consideration. Accordingly, if anybody has additional and/or more specific thoughts on the optimization, I am all ears. Best, ellis ^ permalink raw reply [flat|nested] 32+ messages in thread
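A minimal harness for the mount-time sweep described above might look like the following. This is a sketch only: the device and mountpoint are placeholders, mounting requires root, and in a real run you would drop page caches (e.g. via /proc/sys/vm/drop_caches) between trials so each mount is cold:

```python
import statistics
import subprocess
import time

def time_command(argv, trials=5):
    """Run argv several times; return (mean, stddev) of wall-clock seconds."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        subprocess.run(argv, check=True)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)

# Hypothetical sweep step, run after each 0.5TB of data is added:
#   mean, dev = time_command(["mount", "/dev/sdb", "/mnt/btrfs"])
#   subprocess.run(["umount", "/mnt/btrfs"], check=True)
# Here we time a harmless stand-in command instead of a real mount:
mean, dev = time_command(["true"])
print(f"mean={mean:.3f}s stddev={dev:.3f}s")
```

This mirrors the "mean mount times across 5 tests" numbers reported earlier in the thread.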
* Re: Status of FST and mount times 2018-02-16 14:12 ` Ellis H. Wilson III 2018-02-16 14:20 ` Hans van Kranenburg @ 2018-02-17 0:59 ` Qu Wenruo 2018-02-20 14:59 ` Ellis H. Wilson III 1 sibling, 1 reply; 32+ messages in thread From: Qu Wenruo @ 2018-02-17 0:59 UTC (permalink / raw) To: Ellis H. Wilson III, Hans van Kranenburg, Nikolay Borisov, linux-btrfs [-- Attachment #1.1: Type: text/plain, Size: 4694 bytes --] On 2018年02月16日 22:12, Ellis H. Wilson III wrote: > On 02/15/2018 08:55 PM, Qu Wenruo wrote: >> On 2018年02月16日 00:30, Ellis H. Wilson III wrote: >>> Very helpful information. Thank you Qu and Hans! >>> >>> I have about 1.7TB of homedir data newly rsync'd data on a single >>> enterprise 7200rpm HDD and the following output for btrfs-debug: >>> >>> extent tree key (EXTENT_TREE ROOT_ITEM 0) 543384862720 level 2 >>> total bytes 6001175126016 >>> bytes used 1832557875200 >>> >>> Hans' (very cool) tool reports: >>> ROOT_TREE 624.00KiB 0( 38) 1( 1) >>> EXTENT_TREE 327.31MiB 0( 20881) 1( 66) 2( 1) >> >> Extent tree is not so large, a little unexpected to see such slow mount. >> >> BTW, how many chunks do you have? >> >> It could be checked by: >> >> # btrfs-debug-tree -t chunk <device> | grep CHUNK_ITEM | wc -l > > Since yesterday I've doubled the size by copying the homdir dataset in > again. Here are new stats: > > extent tree key (EXTENT_TREE ROOT_ITEM 0) 385990656 level 2 > total bytes 6001175126016 > bytes used 3663525969920 > > $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l > 3454 OK, this explains everything. There are too many chunks. This means at mount you need to search for block group items 3454 times. Even if each search only needs to iterate 3 tree blocks, multiplied by 3454 it is still a lot of work. Although some tree blocks like the root node and level 1 nodes can be cached, we still need to read about 3500 tree blocks. If the fs is created using 16K nodesize, this means you need to do about 54M of random reads at 16K blocksize. 
No wonder it takes some time. Normally I would expect 1G chunks for both data and metadata. If there is nothing special, it means your filesystem is already larger than 3T. If your used space is way smaller (less than 30%) than 3.5T, then this means your chunk usage is pretty low, and in that case, balance to reduce number of chunks (block groups) would reduce mount time. My personal estimate for mount time is O(nlogn). So if you are able to reduce the chunk count by half, you could reduce mount time by 60%. > > $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/ > ROOT_TREE 1.14MiB 0( 72) 1( 1) > EXTENT_TREE 644.27MiB 0( 41101) 1( 131) 2( 1) > CHUNK_TREE 384.00KiB 0( 23) 1( 1) > DEV_TREE 272.00KiB 0( 16) 1( 1) > FS_TREE 11.55GiB 0(754442) 1( 2179) 2( 5) 3( 2) > CSUM_TREE 3.50GiB 0(228593) 1( 791) 2( 2) 3( 1) > QUOTA_TREE 0.00B > UUID_TREE 16.00KiB 0( 1) > FREE_SPACE_TREE 0.00B > DATA_RELOC_TREE 16.00KiB 0( 1) > > The old mean mount time was 4.319s. It now takes 11.537s for the > doubled dataset. Again please realize this is on an old version of > BTRFS (4.5.5), so perhaps newer ones will perform better, but I'd still > like to understand this delay more. Should I expect this to scale in > this way all the way up to my proposed 60-80TB filesystem so long as the > file size distribution stays roughly similar? That would definitely be > in terms of multiple minutes at that point. > >>> Taking 100 snapshots (no changes between snapshots however) of the above >>> subvolume doesn't appear to impact mount/umount time. >> >> 100 unmodified snapshots won't affect mount time. >> >> It needs new extents, which can be created by overwriting extents in >> snapshots. >> So it won't really cause much difference if all these snapshots are all >> unmodified. > > Good to know, thanks! > >>> Snapshot creation >>> and deletion both operate at between 0.25s to 0.5s. >> >> IIRC snapshot deletion is delayed, so the real work doesn't happen when >> "btrfs sub del" returns. 
> > I was using btrfs sub del -C for the deletions, so I believe (if that > command truly waits for the subvolume to be utterly gone) it captures > the entirety of the snapshot. No, snapshot deletion is completely delayed in background. -C only ensures that even if a powerloss happens after the command returns, you won't see the snapshot anywhere, but it will still be deleted in the background. Thanks, Qu > > Best, > > ellis > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 520 bytes --] ^ permalink raw reply [flat|nested] 32+ messages in thread
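Qu's back-of-envelope numbers above can be checked mechanically. The inputs (3454 chunks, ~3500 tree blocks touched, 16K nodesize) are taken from the mail; the block-count figure is Qu's estimate, not a measurement:

```python
import math

chunks = 3454
nodesize = 16 * 1024    # 16K nodesize, as reported in the thread
tree_blocks = 3500      # Qu's estimate: ~1 leaf per block group plus internal nodes

scattered_read = tree_blocks * nodesize / 2**20
print(f"~{scattered_read:.0f} MiB of random 16K reads")  # ≈ 55 MiB ("54M" in the mail)

# The O(n log n) estimate: what halving the chunk count would buy
ratio = ((chunks / 2) * math.log2(chunks / 2)) / (chunks * math.log2(chunks))
print(f"mount time at half the chunks ≈ {ratio:.0%} of current")  # ≈ 46%
```

The simple n·log(n) model gives a reduction to roughly 46% of the current mount time, in the same ballpark as Qu's "reduce mount time by 60%" estimate.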
* Re: Status of FST and mount times 2018-02-17 0:59 ` Qu Wenruo @ 2018-02-20 14:59 ` Ellis H. Wilson III 2018-02-20 15:41 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 32+ messages in thread From: Ellis H. Wilson III @ 2018-02-20 14:59 UTC (permalink / raw) To: Qu Wenruo, Hans van Kranenburg, Nikolay Borisov, linux-btrfs On 02/16/2018 07:59 PM, Qu Wenruo wrote: > On 2018年02月16日 22:12, Ellis H. Wilson III wrote: >> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l >> 3454 > > OK, this explains everything. > > There are too many chunks. > This means at mount you need to search for block group item 3454 times. > > Even each search only needs to iterate 3 tree blocks, multiply it 3454 > it would still be a big work. > Although some tree blocks like the root node and level 1 nodes can be > cached, we still need to read about 3500 tree blocks. > > If the fs is created using 16K nodesize, this means you need to do > random read for 54M using 16K blocksize. > > No wonder it will takes some time. > > Normally I would expect 1G chunk for each data and metadata chunk. > > If there is nothing special, it means your filesystem is already larger > than 3T. > If your used space is way smaller (less than 30%) than 3.5T, then this > means your chunk usage is pretty low, and in that case, balance to > reduce number of chunks (block groups) would reduce mount time. The nodesize is 16K, and the filesystem data is 3.32TiB as reported by btrfs fi df. So, from what I am hearing, this mount time is normal for a filesystem this size. Ignoring a more complex and proper fix like the ones we've been discussing, would bumping the nodesize reduce the number of chunks, thereby reducing the mount time? I don't see why balance would come into play here -- my understanding was that was for aged filesystems. The only operations I've done on here was: 1. Format filesystem clean 2. Create a subvolume 3. rsync our home directories into that new subvolume 4. Create another subvolume 5. 
rsync our home directories into that new subvolume Accordingly, zero (or at least, extremely little) data should have been overwritten, so I would expect things to be fairly well allocated already. Please correct me if this is naive thinking. >> I was using btrfs sub del -C for the deletions, so I believe (if that >> command truly waits for the subvolume to be utterly gone) it captures >> the entirety of the snapshot. > > No, snapshot deletion is completely delayed in background. > > -C only ensures that even a powerloss happen after command return, you > won't see the snapshot anywhere, but it will still be deleted in background. Ah, I had no idea. Thank you! Is there any way to "encourage" btrfs-cleaner to run at specific times, which I presume is the snapshot deletion process you are referring to? If it can be told to run at a given time, can I throttle how fast it works, such that I avoid some of the high foreground interruption I've seen in the past? Thanks, ellis ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Status of FST and mount times 2018-02-20 14:59 ` Ellis H. Wilson III @ 2018-02-20 15:41 ` Austin S. Hemmelgarn 2018-02-21 1:49 ` Qu Wenruo 0 siblings, 1 reply; 32+ messages in thread From: Austin S. Hemmelgarn @ 2018-02-20 15:41 UTC (permalink / raw) To: Ellis H. Wilson III, Qu Wenruo, Hans van Kranenburg, Nikolay Borisov, linux-btrfs On 2018-02-20 09:59, Ellis H. Wilson III wrote: > On 02/16/2018 07:59 PM, Qu Wenruo wrote: >> On 2018年02月16日 22:12, Ellis H. Wilson III wrote: >>> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l >>> 3454 >> >> OK, this explains everything. >> >> There are too many chunks. >> This means at mount you need to search for block group item 3454 times. >> >> Even each search only needs to iterate 3 tree blocks, multiply it 3454 >> it would still be a big work. >> Although some tree blocks like the root node and level 1 nodes can be >> cached, we still need to read about 3500 tree blocks. >> >> If the fs is created using 16K nodesize, this means you need to do >> random read for 54M using 16K blocksize. >> >> No wonder it will takes some time. >> >> Normally I would expect 1G chunk for each data and metadata chunk. >> >> If there is nothing special, it means your filesystem is already larger >> than 3T. >> If your used space is way smaller (less than 30%) than 3.5T, then this >> means your chunk usage is pretty low, and in that case, balance to >> reduce number of chunks (block groups) would reduce mount time. > > The nodesize is 16K, and the filesystem data is 3.32TiB as reported by > btrfs fi df. So, from what I am hearing, this mount time is normal for > a filesystem this size. Ignoring a more complex and proper fix like the > ones we've been discussing, would bumping the nodesize reduce the number > of chunks, thereby reducing the mount time? It would probably not. 
Chunk size is only based on the total size of the filesystem, with reasonable base values, so you would still need to have at least as many chunks to store the same amount of data (increase the node size too much though, and you will end up with more chunks, because you'll have more empty space wasted). > > I don't see why balance would come into play here -- my understanding > was that was for aged filesystems. The only operations I've done on > here was: > 1. Format filesystem clean > 2. Create a subvolume > 3. rsync our home directories into that new subvolume > 4. Create another subvolume > 5. rsync our home directories into that new subvolume > > Accordingly, zero (or at least, extremely little) data should have been > overwritten, so I would expect things to be fairly well allocated > already. Please correct me if this is naive thinking. Your logic is in general correct regarding data, but not necessarily metadata. Assuming you did not use the `--inplace` option for rsync, it had to issue a rename for each individual file that got copied in, and as a result there was likely a lot of metadata being rewritten. As far as balance being for aged filesystems, that's not exactly true. There are four big reasons you might run a balance: 1. As part of reshaping a volume. You generally want to run a balance whenever the number of disks in a volume permanently increases (it will happen automatically when it permanently decreases, as the device deletion operation is a special type of balance under the hood). It's also used for converting chunk profiles. 2. To free up empty space inside chunks when the filesystem is full at the chunk level. 3. To redistribute data across multiple disks in a more even manner after deleting a lot of data. 4. To reduce the likelihood of 2 or 3 being an issue. Reasons 2 and 3 are generally more likely to be needed on old volumes. Reason 1 is independent of the age of a volume. 
Reason 4 is the reason for the regular filtered balances that I and some other people recommend be run as part of preventative maintenance, and is also generally independent of the age of a volume. Qu's suggestion is actually independent of all the above reasons, but does kind of fit in with the fourth as another case of preventative maintenance. > >>> I was using btrfs sub del -C for the deletions, so I believe (if that >>> command truly waits for the subvolume to be utterly gone) it captures >>> the entirety of the snapshot. >> >> No, snapshot deletion is completely delayed in background. >> >> -C only ensures that even a powerloss happen after command return, you >> won't see the snapshot anywhere, but it will still be deleted in >> background. > > Ah, I had no idea. Thank you! Is there any way to "encourage" > btrfs-cleaner to run at specific times, which I presume is the snapshot > deletion process you are referring to? If it can be told to run at a > given time, can I throttle how fast it works, such that I avoid some of > the high foreground interruption I've seen in the past? I don't think there's any way to do this right now (though it would be nice if there was). In theory, you could adjust the priority of the kernel thread itself, but messing around with kthread priorities is seriously dangerous even if you know exactly what you're doing. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Status of FST and mount times 2018-02-20 15:41 ` Austin S. Hemmelgarn @ 2018-02-21 1:49 ` Qu Wenruo 2018-02-21 14:49 ` Ellis H. Wilson III 0 siblings, 1 reply; 32+ messages in thread From: Qu Wenruo @ 2018-02-21 1:49 UTC (permalink / raw) To: Austin S. Hemmelgarn, Ellis H. Wilson III, Hans van Kranenburg, Nikolay Borisov, linux-btrfs [-- Attachment #1.1: Type: text/plain, Size: 6244 bytes --] On 2018年02月20日 23:41, Austin S. Hemmelgarn wrote: > On 2018-02-20 09:59, Ellis H. Wilson III wrote: >> On 02/16/2018 07:59 PM, Qu Wenruo wrote: >>> On 2018年02月16日 22:12, Ellis H. Wilson III wrote: >>>> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l >>>> 3454 >>> >>> OK, this explains everything. >>> >>> There are too many chunks. >>> This means at mount you need to search for block group item 3454 times. >>> >>> Even each search only needs to iterate 3 tree blocks, multiply it 3454 >>> it would still be a big work. >>> Although some tree blocks like the root node and level 1 nodes can be >>> cached, we still need to read about 3500 tree blocks. >>> >>> If the fs is created using 16K nodesize, this means you need to do >>> random read for 54M using 16K blocksize. >>> >>> No wonder it will takes some time. >>> >>> Normally I would expect 1G chunk for each data and metadata chunk. >>> >>> If there is nothing special, it means your filesystem is already larger >>> than 3T. >>> If your used space is way smaller (less than 30%) than 3.5T, then this >>> means your chunk usage is pretty low, and in that case, balance to >>> reduce number of chunks (block groups) would reduce mount time. >> >> The nodesize is 16K, and the filesystem data is 3.32TiB as reported by >> btrfs fi df. So, from what I am hearing, this mount time is normal >> for a filesystem this size. Ignoring a more complex and proper fix >> like the ones we've been discussing, would bumping the nodesize reduce >> the number of chunks, thereby reducing the mount time? > It would probably not. 
Chunk size is only based on the total size of > the filesystem, with reasonable base values, so you would still need to > have at least as many chunks to store the same amount of data (increase > the node size too much though, and you will end up with more chunks, > because you'll have more empty space wasted). Increasing node size may reduce the extent tree size, although at most by one level AFAIK. But considering that the higher a node is, the more likely it's cached, reducing tree height wouldn't bring much performance impact AFAIK. If one could do a real-world benchmark to disprove or prove my assumption, it would be much better though. >> >> I don't see why balance would come into play here -- my understanding >> was that was for aged filesystems. The only operations I've done on >> here was: >> 1. Format filesystem clean >> 2. Create a subvolume >> 3. rsync our home directories into that new subvolume >> 4. Create another subvolume >> 5. rsync our home directories into that new subvolume >> >> Accordingly, zero (or at least, extremely little) data should have >> been overwritten, so I would expect things to be fairly well allocated >> already. Please correct me if this is naive thinking. > Your logic is in general correct regarding data, but not necessarily > metadata. Assuming you did not use the `--inplace` option for rsync, it > had to issue a rename for each individual file that got copied in, and > as a result there was likely a lot of metadata being rewritten. > > As far as balance being for aged filesystems, that's not exactly true. > There are four big reasons you might run a balance: > > 1. As part of reshaping a volume. You generally want run a balance > whenever the number of disks in a volume permanently increases (it will > happen automatically when it permanently decreases, as the device > deletion operation is a special type of balance under the hood). It's > also used for converting chunk profiles. > 2. 
To free up empty space inside chunks when the filesystem is full at > the chunk level. > 3. To redistribute data across multiple disks in a more even manner > after deleting a lot of data. > 4. To reduce the likelihood of 2 or 3 being an issue. > > Reasons 2 and 3 are generally more likely to be needed on old volumes. > Reason 1 is independent of the age of a volume. Reason 4 is the reason > for the regular filtered balances that I and some other people recommend > be run as part of preventative maintenance, and is also generally > independent of the age of a volume. > > Qu's suggestion is actually independent of all the above reasons, but > does kind of fit in with the fourth as another case of preventative > maintenance. My suggestion is to use balance to reduce the number of block groups, so we could do less searching at mount time. It's more like reason 2. But it only works for the case where there are a lot of fragments so a lot of chunks are not fully utilized. Unfortunately, that's not the case for OP, so my suggestion doesn't make sense here. BTW, if OP still wants to try something to possibly reduce mount time with the same fs, I could try some modifications to the current block group iteration code to see if it makes sense. Thanks, Qu >> >>>> I was using btrfs sub del -C for the deletions, so I believe (if that >>>> command truly waits for the subvolume to be utterly gone) it captures >>>> the entirety of the snapshot. >>> >>> No, snapshot deletion is completely delayed in background. >>> >>> -C only ensures that even a powerloss happen after command return, you >>> won't see the snapshot anywhere, but it will still be deleted in >>> background. >> >> Ah, I had no idea. Thank you! Is there any way to "encourage" >> btrfs-cleaner to run at specific times, which I presume is the >> snapshot deletion process you are referring to? 
If it can be told to >> run at a given time, can I throttle how fast it works, such that I >> avoid some of the high foreground interruption I've seen in the past? > I don't think there's any way to do this right now (though it would be > nice if there was). In theory, you could adjust the priority of the > kernel thread itself, but messing around with kthread priorities is > seriously dangerous even if you know exactly what you're doing. > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 520 bytes --] ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Status of FST and mount times 2018-02-21 1:49 ` Qu Wenruo @ 2018-02-21 14:49 ` Ellis H. Wilson III 2018-02-21 15:03 ` Hans van Kranenburg ` (2 more replies) 0 siblings, 3 replies; 32+ messages in thread From: Ellis H. Wilson III @ 2018-02-21 14:49 UTC (permalink / raw) To: Qu Wenruo, Austin S. Hemmelgarn, Hans van Kranenburg, Nikolay Borisov, linux-btrfs On 02/20/2018 08:49 PM, Qu Wenruo wrote: >>>> On 2018年02月16日 22:12, Ellis H. Wilson III wrote: >>>>> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l >>>>> 3454 >>>> > Increasing node size may reduce extent tree size. Although at most > reduce one level AFAIK. > > But considering that the higher the node is, the more chance it's > cached, reducing tree height wouldn't bring much performance impact AFAIK. > > If one could do real world benchmark to beat or prove my assumption, it > would be much better though. I'm willing to try this if you tell me exactly what you'd like me to do. I've not mucked with nodesize before, so I'd like to avoid changing it to something absurd. >> Qu's suggestion is actually independent of all the above reasons, but >> does kind of fit in with the fourth as another case of preventative >> maintenance. > > My suggestion is to use balance to reduce number of block groups, so we > could do less search at mount time. > > It's more like reason 2. > > But it only works for case where there are a lot of fragments so a lot > of chunks are not fully utilized. > Unfortunately, that's not the case for OP, so my suggestion doesn't make > sense here. I ran the balance all the same, and the number of chunks has not changed. Before 3454, and after 3454: $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l 3454 HOWEVER, the time to mount has gone up somewhat significantly, from 11.537s to 16.553s, which was very unexpected. Output from previously run commands shows the extent tree metadata grew about 25% due to the balance. 
Everything else stayed roughly the same, and no additional data was added to the system (nor snapshots taken, nor additional volumes added, etc): Before balance: $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/ ROOT_TREE 1.14MiB 0( 72) 1( 1) EXTENT_TREE 644.27MiB 0( 41101) 1( 131) 2( 1) CHUNK_TREE 384.00KiB 0( 23) 1( 1) DEV_TREE 272.00KiB 0( 16) 1( 1) FS_TREE 11.55GiB 0(754442) 1( 2179) 2( 5) 3( 2) CSUM_TREE 3.50GiB 0(228593) 1( 791) 2( 2) 3( 1) QUOTA_TREE 0.00B UUID_TREE 16.00KiB 0( 1) FREE_SPACE_TREE 0.00B DATA_RELOC_TREE 16.00KiB 0( 1) After balance: $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/ ROOT_TREE 1.16MiB 0( 73) 1( 1) EXTENT_TREE 806.50MiB 0( 51419) 1( 196) 2( 1) CHUNK_TREE 384.00KiB 0( 23) 1( 1) DEV_TREE 272.00KiB 0( 16) 1( 1) FS_TREE 11.55GiB 0(754442) 1( 2179) 2( 5) 3( 2) CSUM_TREE 3.49GiB 0(227920) 1( 804) 2( 2) 3( 1) QUOTA_TREE 0.00B UUID_TREE 16.00KiB 0( 1) FREE_SPACE_TREE 0.00B DATA_RELOC_TREE 16.00KiB 0( 1) > BTW, if OP still wants to try something to possibly to reduce mount time > with same the fs, I could try some modification to current block group > iteration code to see if it makes sense. I'm glad to try anything if it's helpful to improving BTRFS. Just let me know. Best, ellis ^ permalink raw reply [flat|nested] 32+ messages in thread
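The "about 25%" growth quoted above follows directly from the two EXTENT_TREE totals that show_metadata_tree_sizes.py reported; as a quick sanity check, in plain Python (nothing btrfs-specific, just the two numbers from the listings above):

```python
# Back-of-envelope check of the extent tree growth quoted above, using
# the EXTENT_TREE totals from show_metadata_tree_sizes.py (in MiB).
before_mib = 644.27   # before balance
after_mib = 806.50    # after balance

growth_pct = (after_mib - before_mib) / before_mib * 100
print(f"EXTENT_TREE grew by {growth_pct:.1f}%")  # about 25%
```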
* Re: Status of FST and mount times 2018-02-21 14:49 ` Ellis H. Wilson III @ 2018-02-21 15:03 ` Hans van Kranenburg 2018-02-21 15:19 ` Ellis H. Wilson III 2018-02-21 21:27 ` E V 2018-02-22 0:53 ` Qu Wenruo 2 siblings, 1 reply; 32+ messages in thread From: Hans van Kranenburg @ 2018-02-21 15:03 UTC (permalink / raw) To: Ellis H. Wilson III, linux-btrfs On 02/21/2018 03:49 PM, Ellis H. Wilson III wrote: > On 02/20/2018 08:49 PM, Qu Wenruo wrote: >>>>> On 2018年02月16日 22:12, Ellis H. Wilson III wrote: >>>>>> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l >>>>>> 3454 >>>>> >> Increasing node size may reduce extent tree size. Although at most >> reduce one level AFAIK. >> >> But considering that the higher the node is, the more chance it's >> cached, reducing tree height wouldn't bring much performance impact >> AFAIK. >> >> If one could do real world benchmark to beat or prove my assumption, it >> would be much better though. > > I'm willing to try this if you tell me exactly what you'd like me to do. > I've not mucked with nodesize before, so I'd like to avoid changing it > to something absurd. > >>> Qu's suggestion is actually independent of all the above reasons, but >>> does kind of fit in with the fourth as another case of preventative >>> maintenance. >> >> My suggestion is to use balance to reduce number of block groups, so we >> could do less search at mount time. >> >> It's more like reason 2. >> >> But it only works for case where there are a lot of fragments so a lot >> of chunks are not fully utilized. >> Unfortunately, that's not the case for OP, so my suggestion doesn't make >> sense here. > > I ran the balance all the same, and the number of chunks has not > changed. Before 3454, and after 3454: > $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l > 3454 > > HOWEVER, the time to mount has gone up somewhat significantly, from > 11.537s to 16.553s, which was very unexpected. 
Output from previously > run commands shows the extent tree metadata grew about 25% due to the > balance. Everything else stayed roughly the same, and no additional > data was added to the system (nor snapshots taken, nor additional > volumes added, etc): > > Before balance: > $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/ > ROOT_TREE 1.14MiB 0( 72) 1( 1) > EXTENT_TREE 644.27MiB 0( 41101) 1( 131) 2( 1) > CHUNK_TREE 384.00KiB 0( 23) 1( 1) > DEV_TREE 272.00KiB 0( 16) 1( 1) > FS_TREE 11.55GiB 0(754442) 1( 2179) 2( 5) 3( 2) > CSUM_TREE 3.50GiB 0(228593) 1( 791) 2( 2) 3( 1) > QUOTA_TREE 0.00B > UUID_TREE 16.00KiB 0( 1) > FREE_SPACE_TREE 0.00B > DATA_RELOC_TREE 16.00KiB 0( 1) > > After balance: > $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/ > ROOT_TREE 1.16MiB 0( 73) 1( 1) > EXTENT_TREE 806.50MiB 0( 51419) 1( 196) 2( 1) > CHUNK_TREE 384.00KiB 0( 23) 1( 1) > DEV_TREE 272.00KiB 0( 16) 1( 1) > FS_TREE 11.55GiB 0(754442) 1( 2179) 2( 5) 3( 2) > CSUM_TREE 3.49GiB 0(227920) 1( 804) 2( 2) 3( 1) > QUOTA_TREE 0.00B > UUID_TREE 16.00KiB 0( 1) > FREE_SPACE_TREE 0.00B > DATA_RELOC_TREE 16.00KiB 0( 1) Heu, interesting. What's the output of `btrfs fi df /mountpoint` and `grep btrfs /proc/self/mounts` (does it contain 'ssd') and which kernel version is this? (I get a bit lost in the many messages and subthreads in this thread) I also can't find in the threads which command "the balance" means. And what does this tell you? https://github.com/knorrie/python-btrfs/blob/develop/examples/show_free_space_fragmentation.py Just to make sure you're not pointlessly shovelling data around on a filesystem that is already in bad shape. >> BTW, if OP still wants to try something to possibly to reduce mount time >> with same the fs, I could try some modification to current block group >> iteration code to see if it makes sense. > > I'm glad to try anything if it's helpful to improving BTRFS. Just let > me know. 
> > Best, > > ellis -- Hans van Kranenburg ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Status of FST and mount times 2018-02-21 15:03 ` Hans van Kranenburg @ 2018-02-21 15:19 ` Ellis H. Wilson III 2018-02-21 15:56 ` Hans van Kranenburg 0 siblings, 1 reply; 32+ messages in thread From: Ellis H. Wilson III @ 2018-02-21 15:19 UTC (permalink / raw) To: Hans van Kranenburg, linux-btrfs On 02/21/2018 10:03 AM, Hans van Kranenburg wrote: > On 02/21/2018 03:49 PM, Ellis H. Wilson III wrote: >> On 02/20/2018 08:49 PM, Qu Wenruo wrote: >>> My suggestion is to use balance to reduce number of block groups, so we >>> could do less search at mount time. >>> >>> It's more like reason 2. >>> >>> But it only works for case where there are a lot of fragments so a lot >>> of chunks are not fully utilized. >>> Unfortunately, that's not the case for OP, so my suggestion doesn't make >>> sense here. >> >> I ran the balance all the same, and the number of chunks has not >> changed. Before 3454, and after 3454: >> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l >> 3454 >> >> HOWEVER, the time to mount has gone up somewhat significantly, from >> 11.537s to 16.553s, which was very unexpected. Output from previously >> run commands shows the extent tree metadata grew about 25% due to the >> balance. 
Everything else stayed roughly the same, and no additional >> data was added to the system (nor snapshots taken, nor additional >> volumes added, etc): >> >> Before balance: >> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/ >> ROOT_TREE 1.14MiB 0( 72) 1( 1) >> EXTENT_TREE 644.27MiB 0( 41101) 1( 131) 2( 1) >> CHUNK_TREE 384.00KiB 0( 23) 1( 1) >> DEV_TREE 272.00KiB 0( 16) 1( 1) >> FS_TREE 11.55GiB 0(754442) 1( 2179) 2( 5) 3( 2) >> CSUM_TREE 3.50GiB 0(228593) 1( 791) 2( 2) 3( 1) >> QUOTA_TREE 0.00B >> UUID_TREE 16.00KiB 0( 1) >> FREE_SPACE_TREE 0.00B >> DATA_RELOC_TREE 16.00KiB 0( 1) >> >> After balance: >> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/ >> ROOT_TREE 1.16MiB 0( 73) 1( 1) >> EXTENT_TREE 806.50MiB 0( 51419) 1( 196) 2( 1) >> CHUNK_TREE 384.00KiB 0( 23) 1( 1) >> DEV_TREE 272.00KiB 0( 16) 1( 1) >> FS_TREE 11.55GiB 0(754442) 1( 2179) 2( 5) 3( 2) >> CSUM_TREE 3.49GiB 0(227920) 1( 804) 2( 2) 3( 1) >> QUOTA_TREE 0.00B >> UUID_TREE 16.00KiB 0( 1) >> FREE_SPACE_TREE 0.00B >> DATA_RELOC_TREE 16.00KiB 0( 1) > > Heu, interesting. > > What's the output of `btrfs fi df /mountpoint` and `grep btrfs > /proc/self/mounts` (does it contain 'ssd') and which kernel version is > this? (I get a bit lost in the many messages and subthreads in this > thread) I also can't find in the threads which command "the balance" means. Short recap: - I found long mount time for 1.65TB of home dir data at ~4s - Doubling this data on the same btrfs fs to 3.3TB increased mount time to 11s - Qu et. al. suggested balance might reduce chunks, which came in around 3400, and the chunk walk on mount was the driving factor in terms of time - I ran balance - Mount time went up to 16s, and all else remains the same except the extent tree. 
$ sudo btrfs fi df /mnt/btrfs Data, single: total=3.32TiB, used=3.32TiB System, DUP: total=8.00MiB, used=384.00KiB Metadata, DUP: total=16.50GiB, used=15.82GiB GlobalReserve, single: total=512.00MiB, used=0.00B $ sudo grep btrfs /proc/self/mounts /dev/sdb /mnt/btrfs btrfs rw,relatime,space_cache,subvolid=5,subvol=/ 0 0 $ uname -a Linux <snip> 4.5.5-300.fc24.x86_64 #1 SMP Thu May 19 13:05:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux I plan to rerun this on a newer kernel, but haven't had time to spin up another machine with a modern kernel yet, and this machine is also being used for other things right now so I can't just upgrade it. > And what does this tell you? > > https://github.com/knorrie/python-btrfs/blob/develop/examples/show_free_space_fragmentation.py $ sudo ./show_free_space_fragmentation.py /mnt/btrfs No Free Space Tree (space_cache=v2) found! Falling back to using the extent tree to determine free space extents. vaddr 6529453391872 length 1073741824 used_pct 27 free space fragments 1 score 0 Skipped because of usage > 90%: 3397 chunks Best, ellis ^ permalink raw reply [flat|nested] 32+ messages in thread
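As a rough back-of-envelope (these are just the figures from this subthread, not the output of any btrfs tool), the per-block-group cost implied by the chunk count and the two mount times works out to a few milliseconds per chunk — consistent with the earlier observation that the block group walk dominates mount time on rotational media:

```python
# Rough per-chunk mount cost implied by the numbers in this subthread:
# 3454 chunks, 11.537s to mount before balance, 16.553s after.
chunks = 3454
for label, secs in (("before balance", 11.537), ("after balance", 16.553)):
    print(f"{label}: {secs / chunks * 1000:.1f} ms per block group")
```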
* Re: Status of FST and mount times 2018-02-21 15:19 ` Ellis H. Wilson III @ 2018-02-21 15:56 ` Hans van Kranenburg 2018-02-22 12:41 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 32+ messages in thread From: Hans van Kranenburg @ 2018-02-21 15:56 UTC (permalink / raw) To: Ellis H. Wilson III, linux-btrfs On 02/21/2018 04:19 PM, Ellis H. Wilson III wrote: > On 02/21/2018 10:03 AM, Hans van Kranenburg wrote: >> On 02/21/2018 03:49 PM, Ellis H. Wilson III wrote: >>> On 02/20/2018 08:49 PM, Qu Wenruo wrote: >>>> My suggestion is to use balance to reduce number of block groups, so we >>>> could do less search at mount time. >>>> >>>> It's more like reason 2. >>>> >>>> But it only works for case where there are a lot of fragments so a lot >>>> of chunks are not fully utilized. >>>> Unfortunately, that's not the case for OP, so my suggestion doesn't >>>> make >>>> sense here. >>> >>> I ran the balance all the same, and the number of chunks has not >>> changed. Before 3454, and after 3454: >>> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l >>> 3454 >>> >>> HOWEVER, the time to mount has gone up somewhat significantly, from >>> 11.537s to 16.553s, which was very unexpected. Output from previously >>> run commands shows the extent tree metadata grew about 25% due to the >>> balance. 
Everything else stayed roughly the same, and no additional >>> data was added to the system (nor snapshots taken, nor additional >>> volumes added, etc): >>> >>> Before balance: >>> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/ >>> ROOT_TREE 1.14MiB 0( 72) 1( 1) >>> EXTENT_TREE 644.27MiB 0( 41101) 1( 131) 2( 1) >>> CHUNK_TREE 384.00KiB 0( 23) 1( 1) >>> DEV_TREE 272.00KiB 0( 16) 1( 1) >>> FS_TREE 11.55GiB 0(754442) 1( 2179) 2( 5) 3( 2) >>> CSUM_TREE 3.50GiB 0(228593) 1( 791) 2( 2) 3( 1) >>> QUOTA_TREE 0.00B >>> UUID_TREE 16.00KiB 0( 1) >>> FREE_SPACE_TREE 0.00B >>> DATA_RELOC_TREE 16.00KiB 0( 1) >>> >>> After balance: >>> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/ >>> ROOT_TREE 1.16MiB 0( 73) 1( 1) >>> EXTENT_TREE 806.50MiB 0( 51419) 1( 196) 2( 1) >>> CHUNK_TREE 384.00KiB 0( 23) 1( 1) >>> DEV_TREE 272.00KiB 0( 16) 1( 1) >>> FS_TREE 11.55GiB 0(754442) 1( 2179) 2( 5) 3( 2) >>> CSUM_TREE 3.49GiB 0(227920) 1( 804) 2( 2) 3( 1) >>> QUOTA_TREE 0.00B >>> UUID_TREE 16.00KiB 0( 1) >>> FREE_SPACE_TREE 0.00B >>> DATA_RELOC_TREE 16.00KiB 0( 1) >> >> Heu, interesting. >> >> What's the output of `btrfs fi df /mountpoint` and `grep btrfs >> /proc/self/mounts` (does it contain 'ssd') and which kernel version is >> this? (I get a bit lost in the many messages and subthreads in this >> thread) I also can't find in the threads which command "the balance" >> means. > > Short recap: > - I found long mount time for 1.65TB of home dir data at ~4s > - Doubling this data on the same btrfs fs to 3.3TB increased mount time > to 11s > - Qu et. al. suggested balance might reduce chunks, which came in around > 3400, and the chunk walk on mount was the driving factor in terms of time > - I ran balance > - Mount time went up to 16s, and all else remains the same except the > extent tree. 
> > $ sudo btrfs fi df /mnt/btrfs > Data, single: total=3.32TiB, used=3.32TiB > System, DUP: total=8.00MiB, used=384.00KiB > Metadata, DUP: total=16.50GiB, used=15.82GiB > GlobalReserve, single: total=512.00MiB, used=0.00B Ah, so allocated data space is 100% filled with data. That's very good yes. And it explains why you can't lower the amount of chunks by balancing. You're just moving around data and replacing full chunks with new full chunks. :] Doesn't explain why it blows up the size of the extent tree though. I have no idea why that is. > $ sudo grep btrfs /proc/self/mounts > /dev/sdb /mnt/btrfs btrfs rw,relatime,space_cache,subvolid=5,subvol=/ 0 0 Ok, no 'ssd', good. > $ uname -a > Linux <snip> 4.5.5-300.fc24.x86_64 #1 SMP Thu May 19 13:05:32 UTC 2016 > x86_64 x86_64 x86_64 GNU/Linux > > I plan to rerun this on a newer kernel, but haven't had time to spin up > another machine with a modern kernel yet, and this machine is also being > used for other things right now so I can't just upgrade it. > >> And what does this tell you? >> >> https://github.com/knorrie/python-btrfs/blob/develop/examples/show_free_space_fragmentation.py >> > > $ sudo ./show_free_space_fragmentation.py /mnt/btrfs > No Free Space Tree (space_cache=v2) found! > Falling back to using the extent tree to determine free space extents. > vaddr 6529453391872 length 1073741824 used_pct 27 free space fragments 1 > score 0 > Skipped because of usage > 90%: 3397 chunks Good. -- Hans van Kranenburg ^ permalink raw reply [flat|nested] 32+ messages in thread
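Hans's reading — every allocated chunk is essentially full, so balance has nothing to compact — is just a ratio of the quoted `btrfs fi df` figures. A trivial sketch (note that `fi df` rounds its totals, so "used == total" here really means "nearly full", not bit-for-bit identical):

```python
# Utilization implied by the `btrfs fi df` output quoted above.
GiB = 1 << 30
TiB = 1 << 40

data_used, data_total = 3.32 * TiB, 3.32 * TiB   # rounded by fi df
meta_used, meta_total = 15.82 * GiB, 16.50 * GiB

print(f"data:     {data_used / data_total:.1%} of allocated space used")
print(f"metadata: {meta_used / meta_total:.1%} of allocated space used")
```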
* Re: Status of FST and mount times 2018-02-21 15:56 ` Hans van Kranenburg @ 2018-02-22 12:41 ` Austin S. Hemmelgarn 0 siblings, 0 replies; 32+ messages in thread From: Austin S. Hemmelgarn @ 2018-02-22 12:41 UTC (permalink / raw) To: Hans van Kranenburg, Ellis H. Wilson III, linux-btrfs On 2018-02-21 10:56, Hans van Kranenburg wrote: > On 02/21/2018 04:19 PM, Ellis H. Wilson III wrote: >> >> $ sudo btrfs fi df /mnt/btrfs >> Data, single: total=3.32TiB, used=3.32TiB >> System, DUP: total=8.00MiB, used=384.00KiB >> Metadata, DUP: total=16.50GiB, used=15.82GiB >> GlobalReserve, single: total=512.00MiB, used=0.00B > > Ah, so allocated data space is 100% filled with data. That's very good > yes. And it explains why you can't lower the amount of chunks by > balancing. You're just moving around data and replacing full chunks with > new full chunks. :] > > Doesn't explain why it blows up the size of the extent tree though. I > have no idea why that is. This is just a guess, but I think it might have reordered extents within each chunk. Any given extent can't span across a chunk boundary, so if the order changed, it may have split extents that had previously been full extents. I'd be somewhat curious to see if defragmenting might help here (it should re-combine the split extents, though it will probably allocate a new chunk). ^ permalink raw reply [flat|nested] 32+ messages in thread
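Austin's guess can be illustrated with a toy model (purely illustrative Python — not how btrfs actually stores extents, and the 1GiB chunk size is an assumption, though it is the common data block group size): because a data extent cannot cross a block group boundary, relocating an extent so that it straddles one forces a split, which means more extent items:

```python
# Toy illustration: an extent cannot span a chunk (block group)
# boundary, so an extent relocated across one is split into pieces.
CHUNK = 1 << 30  # assume 1GiB data block groups (an assumption here)

def split_at_chunk_boundaries(start, length, chunk=CHUNK):
    """Return the (start, length) pieces an extent breaks into."""
    pieces = []
    while length > 0:
        room = chunk - (start % chunk)   # bytes left before the boundary
        take = min(length, room)
        pieces.append((start, take))
        start, length = start + take, length - take
    return pieces

# A single 256MiB extent, rewritten so it begins 128MiB before a chunk
# boundary, becomes two extents -- hence a larger extent tree.
pieces = split_at_chunk_boundaries(CHUNK - (128 << 20), 256 << 20)
print(len(pieces))  # 2
```

Confirming whether that is actually what happened here would take a before/after dump of the extent tree, e.g. with btrfs-debug-tree as used earlier in the thread.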
* Re: Status of FST and mount times 2018-02-21 14:49 ` Ellis H. Wilson III 2018-02-21 15:03 ` Hans van Kranenburg @ 2018-02-21 21:27 ` E V 2018-02-22 0:53 ` Qu Wenruo 2 siblings, 0 replies; 32+ messages in thread From: E V @ 2018-02-21 21:27 UTC (permalink / raw) To: Ellis H. Wilson III Cc: Qu Wenruo, Austin S. Hemmelgarn, Hans van Kranenburg, Nikolay Borisov, linux-btrfs On Wed, Feb 21, 2018 at 9:49 AM, Ellis H. Wilson III <ellisw@panasas.com> wrote: > On 02/20/2018 08:49 PM, Qu Wenruo wrote: >>>>> >>>>> On 2018年02月16日 22:12, Ellis H. Wilson III wrote: >>>>>> >>>>>> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l >>>>>> 3454 >>>>> >>>>> >> Increasing node size may reduce extent tree size. Although at most >> reduce one level AFAIK. >> >> But considering that the higher the node is, the more chance it's >> cached, reducing tree height wouldn't bring much performance impact AFAIK. >> >> If one could do real world benchmark to beat or prove my assumption, it >> would be much better though. > > > I'm willing to try this if you tell me exactly what you'd like me to do. > I've not mucked with nodesize before, so I'd like to avoid changing it to > something absurd. mkfs.btrfs caps -n at 64K so absurd isn't really an option. If you have a large filesystem on a RAID array you will likely see a performance bump in your metadata operations if you use 64K and also set the stripe size of the RAID array to 64K. >>> Qu's suggestion is actually independent of all the above reasons, but >>> does kind of fit in with the fourth as another case of preventative >>> maintenance. >> >> >> My suggestion is to use balance to reduce number of block groups, so we >> could do less search at mount time. >> >> It's more like reason 2. >> >> But it only works for case where there are a lot of fragments so a lot >> of chunks are not fully utilized. >> Unfortunately, that's not the case for OP, so my suggestion doesn't make >> sense here. 
> > > I ran the balance all the same, and the number of chunks has not changed. > Before 3454, and after 3454: > $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l > 3454 > > HOWEVER, the time to mount has gone up somewhat significantly, from 11.537s > to 16.553s, which was very unexpected. Output from previously run commands > shows the extent tree metadata grew about 25% due to the balance. > Everything else stayed roughly the same, and no additional data was added to > the system (nor snapshots taken, nor additional volumes added, etc): > > Before balance: > $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/ > ROOT_TREE 1.14MiB 0( 72) 1( 1) > EXTENT_TREE 644.27MiB 0( 41101) 1( 131) 2( 1) > CHUNK_TREE 384.00KiB 0( 23) 1( 1) > DEV_TREE 272.00KiB 0( 16) 1( 1) > FS_TREE 11.55GiB 0(754442) 1( 2179) 2( 5) 3( 2) > CSUM_TREE 3.50GiB 0(228593) 1( 791) 2( 2) 3( 1) > QUOTA_TREE 0.00B > UUID_TREE 16.00KiB 0( 1) > FREE_SPACE_TREE 0.00B > DATA_RELOC_TREE 16.00KiB 0( 1) > > After balance: > $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/ > ROOT_TREE 1.16MiB 0( 73) 1( 1) > EXTENT_TREE 806.50MiB 0( 51419) 1( 196) 2( 1) > CHUNK_TREE 384.00KiB 0( 23) 1( 1) > DEV_TREE 272.00KiB 0( 16) 1( 1) > FS_TREE 11.55GiB 0(754442) 1( 2179) 2( 5) 3( 2) > CSUM_TREE 3.49GiB 0(227920) 1( 804) 2( 2) 3( 1) > QUOTA_TREE 0.00B > UUID_TREE 16.00KiB 0( 1) > FREE_SPACE_TREE 0.00B > DATA_RELOC_TREE 16.00KiB 0( 1) > >> BTW, if OP still wants to try something to possibly to reduce mount time >> with same the fs, I could try some modification to current block group >> iteration code to see if it makes sense. > > > I'm glad to try anything if it's helpful to improving BTRFS. Just let me > know. > > Best, > > ellis > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Status of FST and mount times 2018-02-21 14:49 ` Ellis H. Wilson III 2018-02-21 15:03 ` Hans van Kranenburg 2018-02-21 21:27 ` E V @ 2018-02-22 0:53 ` Qu Wenruo 2 siblings, 0 replies; 32+ messages in thread From: Qu Wenruo @ 2018-02-22 0:53 UTC (permalink / raw) To: Ellis H. Wilson III, Austin S. Hemmelgarn, Hans van Kranenburg, Nikolay Borisov, linux-btrfs [-- Attachment #1.1: Type: text/plain, Size: 4309 bytes --] On 2018年02月21日 22:49, Ellis H. Wilson III wrote: > On 02/20/2018 08:49 PM, Qu Wenruo wrote: >>>>> On 2018年02月16日 22:12, Ellis H. Wilson III wrote: >>>>>> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l >>>>>> 3454 >>>>> >> Increasing node size may reduce extent tree size. Although at most >> reduce one level AFAIK. >> >> But considering that the higher the node is, the more chance it's >> cached, reducing tree height wouldn't bring much performance impact >> AFAIK. >> >> If one could do real world benchmark to beat or prove my assumption, it >> would be much better though. > > I'm willing to try this if you tell me exactly what you'd like me to do. > I've not mucked with nodesize before, so I'd like to avoid changing it > to something absurd. > >>> Qu's suggestion is actually independent of all the above reasons, but >>> does kind of fit in with the fourth as another case of preventative >>> maintenance. >> >> My suggestion is to use balance to reduce number of block groups, so we >> could do less search at mount time. >> >> It's more like reason 2. >> >> But it only works for case where there are a lot of fragments so a lot >> of chunks are not fully utilized. >> Unfortunately, that's not the case for OP, so my suggestion doesn't make >> sense here. > > I ran the balance all the same, and the number of chunks has not > changed. 
Before 3454, and after 3454: > $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l > 3454 > > HOWEVER, the time to mount has gone up somewhat significantly, from > 11.537s to 16.553s, which was very unexpected. Output from previously > run commands shows the extent tree metadata grew about 25% due to the > balance. Everything else stayed roughly the same, and no additional > data was added to the system (nor snapshots taken, nor additional > volumes added, etc): In theory, if the extent tree height and block group usage doesn't change dramatically, the tree block reads caused by block groups iteration shouldn't change much. But in your case, extent tree leaves increased, I believe it's the tree block readahead causing the problem. > > Before balance: > $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/ > ROOT_TREE 1.14MiB 0( 72) 1( 1) > EXTENT_TREE 644.27MiB 0( 41101) 1( 131) 2( 1) > CHUNK_TREE 384.00KiB 0( 23) 1( 1) > DEV_TREE 272.00KiB 0( 16) 1( 1) > FS_TREE 11.55GiB 0(754442) 1( 2179) 2( 5) 3( 2) > CSUM_TREE 3.50GiB 0(228593) 1( 791) 2( 2) 3( 1) > QUOTA_TREE 0.00B > UUID_TREE 16.00KiB 0( 1) > FREE_SPACE_TREE 0.00B > DATA_RELOC_TREE 16.00KiB 0( 1) > > After balance: > $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/ > ROOT_TREE 1.16MiB 0( 73) 1( 1) > EXTENT_TREE 806.50MiB 0( 51419) 1( 196) 2( 1) > CHUNK_TREE 384.00KiB 0( 23) 1( 1) > DEV_TREE 272.00KiB 0( 16) 1( 1) > FS_TREE 11.55GiB 0(754442) 1( 2179) 2( 5) 3( 2) > CSUM_TREE 3.49GiB 0(227920) 1( 804) 2( 2) 3( 1) > QUOTA_TREE 0.00B > UUID_TREE 16.00KiB 0( 1) > FREE_SPACE_TREE 0.00B > DATA_RELOC_TREE 16.00KiB 0( 1) > >> BTW, if OP still wants to try something to possibly to reduce mount time >> with same the fs, I could try some modification to current block group >> iteration code to see if it makes sense. > > I'm glad to try anything if it's helpful to improving BTRFS. Just let > me know. Glad to hear that. 
I'll send out an RFC patch to see if it helps reduce mount time (maybe only by a little). Thanks, Qu > > Best, > > ellis > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 520 bytes --] ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Status of FST and mount times 2018-02-14 17:08 ` Nikolay Borisov 2018-02-14 17:21 ` Ellis H. Wilson III 2018-02-15 1:42 ` Qu Wenruo @ 2018-02-15 5:54 ` Chris Murphy 2 siblings, 0 replies; 32+ messages in thread From: Chris Murphy @ 2018-02-15 5:54 UTC (permalink / raw) To: Nikolay Borisov; +Cc: Ellis H. Wilson III, Btrfs BTRFS On Wed, Feb 14, 2018 at 10:08 AM, Nikolay Borisov <nborisov@suse.com> wrote: > V1 for large filesystems is just awful. Facebook have been experiencing > the pain, hence they implemented v2. You can view the space cache tree as > the complement of the extent tree. The v1 cache is implemented as a > hidden inode, and even though writes (aka flushing of the free space > cache) are metadata, they are essentially treated as data. This could > potentially lead to priority inversions if the cgroups I/O controller is > involved. > > Furthermore, there is at least 1 known deadlock problem in free space > cache v1. So yes, if you want to use btrfs on a multi-TB system, v2 is > really the way to go. I've been using v2 on a couple of systems' rootfs for a couple of months. I'm not totally certain it's v2, or another enhancement circa 4.14, but system updates (rpm based) are definitely faster. So it may not only be a Nice To Have with big file systems. I haven't tried it yet, but if the file system face-plants on me, I figure I'll use btrfs check to wipe the free space cache (hopefully that's allowed even if the file system is hosed) and then try to repair. -- Chris Murphy ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Status of FST and mount times 2018-02-14 16:00 Status of FST and mount times Ellis H. Wilson III 2018-02-14 17:08 ` Nikolay Borisov @ 2018-02-14 23:24 ` Duncan 2018-02-15 15:42 ` Ellis H. Wilson III 2018-02-15 6:14 ` Chris Murphy 2 siblings, 1 reply; 32+ messages in thread From: Duncan @ 2018-02-14 23:24 UTC (permalink / raw) To: linux-btrfs Ellis H. Wilson III posted on Wed, 14 Feb 2018 11:00:29 -0500 as excerpted: > Hi again -- back with a few more questions: > > Frame-of-reference here: RAID0. Around 70TB raw capacity. No > compression. No quotas enabled. Many (potentially tens to hundreds) of > subvolumes, each with tens of snapshots. No control over size or number > of files, but directory tree (entries per dir and general tree depth) > can be controlled in case that's helpful). ?? How can you control both breadth (entries per dir) AND depth of directory tree without ultimately limiting your number of files? Or do you mean you can control breadth XOR depth of tree as needed, allowing the other to expand as necessary to accommodate the uncontrolled number of files? Anyway, AFAIK the only performance issue there would be the (IIRC) ~65535 limit on directory hard links before additional ones are out-of-lined into a secondary node, with the entailing performance implications. > 1. I've been reading up about the space cache, and it appears there is a > v2 of it called the free space tree that is much friendlier to large > filesystems such as the one I am designing for. It is listed as OK/OK > on the wiki status page, but there is a note that btrfs progs treats it > as read only (i.e., btrfs check repair cannot help me without a full > space cache rebuild is my biggest concern) and the last status update on > this I can find was circa fall 2016. Can anybody give me an updated > status on this feature? From what I read, v1 and tens of TB filesystems > will not play well together, so I'm inclined to dig into this. 
At tens of TB, yes, the free-space-cache (v1) has issues that the free-space-tree (aka free-space-cache-v2) is designed to solve. And v2 should be very well tested in large enterprise installations by now, given facebook's usage and intimate involvement with btrfs. But I have an arguably more basic concern... Pardon me for reviewing the basics as I feel rather like a pupil attempting to lecture a teacher on the point and you could very likely teach /me/ about them, but they set up the point... Raid0, particularly at the 10s-of-TB scale, has some implications that don't match your specified concerns above particularly well. Of course "raid0" is a convenient misnomer, as there's nothing "redundant" about the "array of independent devices" in a raid0 configuration; it's simply done for the space and speed features, with the sacrificial tradeoff being reliability. It's only called raid0 as a convenience, allowing it to be grouped with the other raid configurations where "redundant" /is/ a feature, with the more important grouping commonality being they're all multi-device. Because reliability /is/ the sacrificial tradeoff for raid0, it's relatively safe to make the assumption that reliability either isn't needed at all because the data literally is "throw-away" value (cache, say, where refilling the cache isn't a big cost or time factor), or reliability is assured by other mechanisms, backups being the most basic but there are others like multi-layered raid, etc, which in practice makes at least the particular instance of the data on the raid0 "throw-away" value, even if the data as a whole is not. So far, so good. But then above you mention concern about btrfs-progs treating the free-space-tree (free-space-cache-v2) as read-only, and the time cost of having to clear and rebuild it after a btrfs check --repair. Which is what triggered the mismatch warning I mentioned above.
Either that raid0 data is of throw-away value appropriate to placement on a raid0, and btrfs check --repair is of little concern as the benefits are questionable (no guarantees it'll work and the data is either directly throw-away value anyway, or there's a backup at hand that /does/ have a tested guarantee of viability, or it's not worthy of being called a backup in the first place), or it's not. It's that concern about the viability of btrfs check --repair on what you're defining as throw-away data by placing it on raid0 in the first place, that's raising all those red warning flags for me! And the fact that you didn't even bother to explain it with a side note to the effect that the reliability is addressed some other way, but you still need to worry about btrfs check --repair viability because $REASONS, is turning those red flags into flashing red lights accompanied by blaring sirens! OK, so let's assume you /do/ have a tested backup, ready to go. Then the viability of btrfs check --repair is of less concern, but remains something you might still be interested in for trivial cases, because let's face it, transferring tens of TB of data, even if ready at hand, does take time, and if you can avoid it because the btrfs check --repair fix is trivial, it's worth doing so. Valid case, but there's nothing in your post indicating it's valid as /your/ case. Of course the other possibility is live-failover, which is sure to be facebook's use-case. But with live-failover, the viability of btrfs check --repair more or less ceases to be of interest, because the failover happens (relative to the offline check or restore time) instantly, and once the failed devices/machine is taken out of service it's far more effective to simply blow away the filesystem (if not replacing the device(s) entirely) and restore "at leisure" from backup, a relatively guaranteed procedure compared to the "no guarantees" of attempting to check --repair the filesystem out of trouble. 
Which is very likely why the free-space-tree still isn't well supported by btrfs-progs, including btrfs check, several kernel (and thus -progs) development cycles later. The people who really need the one (whichever one of the two)... don't tend to (or at least /shouldn't/) make use of the other so much. It's also worth mentioning that btrfs raid0 mode, as well as single mode, hobbles the btrfs data and metadata integrity feature: checksums are still generated, stored and checked by default, so integrity problems can still be detected, but because raid0 (and single) includes no redundancy, there's no second copy (raid1/10) or parity redundancy (raid5/6) to rebuild the bad data from, so it's simply gone. (Well, for data you can try btrfs restore of the otherwise inaccessible file and hope for the best, and for metadata you can try check --repair and again hope for the best, but...) If you're using that feature of btrfs and want/need more than just detection of a problem that can't be fixed due to lack of redundancy, there's a good chance you want a real redundancy raid mode on multi-device, or dup mode on single device. So bottom line... given the sacrificial lack of redundancy and reliability of raid0, btrfs or not, in an enterprise setting with tens of TB of data, why are you worrying about the viability of btrfs check --repair on what the placement on raid0 decrees to be throw-away data anyway? At first glance, one of the two must be wrong: either the raid0 mode (and thus the declared throw-away value of tens of TB of data), or the concern for the viability of btrfs check --repair, which indicates you don't consider that data to be of throw-away value after all.
Which one is wrong is your call, and there are certainly individual cases (one of which I even named) where concern about the viability of btrfs check --repair on raid0 might be valid, but your post has no real indication that your case is such a case, and honestly, that worries me! > 2. There's another thread on-going about mount delays. I've been > completely blind to this specific problem until it caught my eye. Does > anyone have ballpark estimates for how long very large HDD-based > filesystems will take to mount? Yes, I know it will depend on the > dataset. I'm looking for O() worst-case approximations for > enterprise-grade large drives (12/14TB), as I expect it should scale > with multiple drives so approximating for a single drive should be good > enough. No input on that question here (my own use-case couldn't be more different, multiple small sub-half-TB independent btrfs raid1s on partitioned ssds), but another concern, based on real-world reports I've seen on-list: 12-14 TB individual drives? While you /did/ say enterprise grade so this probably doesn't apply to you, it might apply to others that will read this. Be careful that you're not trying to use the "archive application" targeted SMR drives for general purpose use. Occasionally people will try to buy such drives for general purpose use due to their cheaper per-TB cost, and it just doesn't go well. We've had a number of reports of that. =:^( -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 32+ messages in thread
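As an aside on the btrfs restore salvage path mentioned above: roughly, it looks like the sketch below. This is hedged and illustrative only -- the device, destination, and path regex are placeholders, and a run() wrapper echoes each command instead of executing it, since none of this should be pointed at a real device casually.

```shell
# Sketch of salvaging readable data off an unmountable btrfs filesystem
# with 'btrfs restore' (no writes to the source device). Placeholders
# throughout; run() echoes rather than executes.
DEV=/dev/sdX
DEST=/mnt/recovery
run() { echo "+ $*"; }   # dry-run guard; swap 'echo' for real execution
# Restore everything reachable, verbosely:
run btrfs restore -v "$DEV" "$DEST"
# Or limit the salvage to one subtree via restore's nested path-regex syntax:
run btrfs restore -v --path-regex '^/(home(|/user(|/.*)))$' "$DEV" "$DEST"
```

If memory serves, when the default tree root is too damaged, btrfs-find-root can suggest alternate roots to feed back to restore via -t.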
* Re: Status of FST and mount times 2018-02-14 23:24 ` Duncan @ 2018-02-15 15:42 ` Ellis H. Wilson III 2018-02-15 16:51 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 32+ messages in thread From: Ellis H. Wilson III @ 2018-02-15 15:42 UTC (permalink / raw) To: Duncan, linux-btrfs On 02/14/2018 06:24 PM, Duncan wrote: >> Frame-of-reference here: RAID0. Around 70TB raw capacity. No >> compression. No quotas enabled. Many (potentially tens to hundreds) of >> subvolumes, each with tens of snapshots. No control over size or number >> of files, but directory tree (entries per dir and general tree depth) >> can be controlled in case that's helpful). > > ?? How can you control both breadth (entries per dir) AND depth of > directory tree without ultimately limiting your number of files? I technically misspoke when I said "No control over size or number of files." There is an upper-limit to the metadata (not BTRFS, for our filesystem) we can store on an accompanying SSD, which limits the number of files that ultimately can live on our BTRFS RAID0'd HDDs. The current design is tuned to perform well up to that maximum, but it's a relatively shallow tree, so if there were known performance issues with more than N files per directory or beyond a specific depth of directories, I wanted to call that out, since I can still change the algorithm now. > Anyway, AFAIK the only performance issue there would be the (IIRC) ~65535 > limit on directory hard links before additional ones are out-of-lined > into a secondary node, with the entailing performance implications. Here I interpret "directory hard links" to mean hard links within a single directory -- not real directory hard links as on Macs. It's moot anyhow, as we support hard links at a much higher level in our parallel file system and no hard links will exist whatsoever from BTRFS's perspective. > So far, so good. 
But then above you mention concern about btrfs-progs > treating the free-space-tree (free-space-cache-v2) as read-only, and the > time cost of having to clear and rebuild it after a btrfs check --repair. > > Which is what triggered the mismatch warning I mentioned above. Either > that raid0 data is of throw-away value appropriate to placement on a > raid0, and btrfs check --repair is of little concern as the benefits are > questionable (no guarantees it'll work and the data is either directly > throw-away value anyway, or there's a backup at hand that /does/ have a > tested guarantee of viability, or it's not worthy of being called a > backup in the first place), or it's not. I think you may be looking at this a touch too black and white, but that's probably because I've not been clear about my use-case. We do have mechanisms at a higher level in our parallel file system to do scale-out object-based RAID, so in a way the data is "throw-away" in that we can lose it without true data loss. However, one should not underestimate the foreground impact of a reconstruction of 60-80TB of data, even with architectures like ours that scale reconstruction well. When I lose an HDD I fully expect we will need to rebuild that entire BTRFS filesystem, and we can. But I'd like to limit it to real media failure. In other words, if I can't mount my BTRFS filesystem after power-fail, and I can't run btrfs check --repair, then in essence I've lost a lot of data I need to rebuild for no "good" reason. Perhaps more critically, when an entire cluster of these systems power-fail, if more than N of these running BTRFS come up and require check --repair prior to mount due to some commonly triggered BTRFS bug (not saying there is one, I'm just conservative), I'm completely hosed. Restoring PB's of data from backup is a non-starter. 
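For what it's worth, the post-power-fail triage I have in mind runs read-only checks before ever considering --repair. A hedged sketch only -- device and mountpoint are placeholders, and the commands are echoed rather than executed:

```shell
# Read-only triage after an unclean shutdown, before any --repair.
# DEV and MNT are placeholders; run() echoes instead of executing.
DEV=/dev/sdX
MNT=/mnt/btrfs
run() { echo "+ $*"; }   # dry-run guard; swap 'echo' for real execution
run btrfs check --readonly "$DEV"            # non-destructive consistency check
run mount -o ro,usebackuproot "$DEV" "$MNT"  # fall back to older tree roots, read-only
run btrfs scrub start -Bd "$MNT"             # verify checksums, per-device stats
```

Only if all of that comes back clean-ish would a read-write mount, or as a last resort --repair, be on the table.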
In short, I've been playing coy about the details of my project and need to continue to do so for at least the next 4-6 months, but if you read anything about the company I'm emailing from, you can probably make reasonable guesses about what I'm trying to do. > It's also worth mentioning that btrfs raid0 mode, as well as single mode, > hobbles the btrfs data and metadata integrity feature, because while > checksums can and are still generated, stored and checked by default, and > integrity problems can still be detected, because raid0 (and single) > includes no redundancy, there's no second copy (raid1/10) or parity > redundancy (raid5/6) to rebuild the bad data from, so it's simply gone. I'm ok with that. We have a concept called "on-demand reconstruction" which permits us to rebuild individual objects in our filesystem on-demand (one component of which will be a failed file on one of the BTRFS filesystems). So long as I can identify that a file has been corrupted I'm fine. > 12-14 TB individual drives? > > While you /did/ say enterprise grade so this probably doesn't apply to > you, it might apply to others that will read this. > > Be careful that you're not trying to use the "archive application" > targeted SMR drives for general purpose use. We're using traditional PMR drives for now. That's available at 12/14TB capacity points presently. I agree with your general sense that SMR drives are unlikely to play particularly well with BTRFS for all but the truly archival use-case. Best, ellis ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Status of FST and mount times 2018-02-15 15:42 ` Ellis H. Wilson III @ 2018-02-15 16:51 ` Austin S. Hemmelgarn 2018-02-15 16:58 ` Ellis H. Wilson III 0 siblings, 1 reply; 32+ messages in thread From: Austin S. Hemmelgarn @ 2018-02-15 16:51 UTC (permalink / raw) To: Ellis H. Wilson III, linux-btrfs On 2018-02-15 10:42, Ellis H. Wilson III wrote: > On 02/14/2018 06:24 PM, Duncan wrote: >>> Frame-of-reference here: RAID0. Around 70TB raw capacity. No >>> compression. No quotas enabled. Many (potentially tens to hundreds) of >>> subvolumes, each with tens of snapshots. No control over size or number >>> of files, but directory tree (entries per dir and general tree depth) >>> can be controlled in case that's helpful). >> >> ?? How can you control both breadth (entries per dir) AND depth of >> directory tree without ultimately limiting your number of files? > > I technically misspoke when I said "No control over size or number of > files." There is an upper-limit to the metadata (not BTRFS, for our > filesystem) we can store on an accompanying SSD, which limits the number > of files that ultimately can live on our BTRFS RAID0'd HDDs. The > current design is tuned to perform well up to that maximum, but it's a > relatively shallow tree, so if there were known performance issues with > more than N files per directory or beyond a specific depth of > directories I was calling out that I can change the algorithm now. There are scaling performance issues with directory listings on BTRFS for directories with more than a few thousand files, but they're not well documented (most people don't hit them because most applications are designed around the expectation that directory listings will be slow in big directories), and I would not expect them to be much of an issue unless you're dealing with tens of thousands of files and particularly slow storage. 
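If you want numbers for your own hardware rather than anecdotes, something along these lines gives a rough feel. The counts and paths are purely illustrative, and for cold-cache figures you'd drop caches (as root) between runs:

```shell
# Create directories of increasing size and time an unsorted listing of
# each. Counts are illustrative; point 'base' at the mount under test.
base=$(mktemp -d)
for n in 100 1000 4000; do
  d="$base/listtest-$n"
  mkdir -p "$d"
  seq -f "$d/f%g" "$n" | xargs touch
  # Cold-cache variant (needs root): echo 3 > /proc/sys/vm/drop_caches
  t0=$(date +%s%N)
  ls -f "$d" > /dev/null
  t1=$(date +%s%N)
  echo "$n entries: $(( (t1 - t0) / 1000000 )) ms"
done
```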
> >> Anyway, AFAIK the only performance issue there would be the (IIRC) ~65535 >> limit on directory hard links before additional ones are out-of-lined >> into a secondary node, with the entailing performance implications. > > Here I interpret "directory hard links" to mean hard links within a > single directory -- not real directory hard links as in Macs. It's moot > anyhow, as we support hard links at a much higher level in our parallel > file system and no hard-links will exist whatsoever from BTRFS's > perspective. > >> So far, so good. But then above you mention concern about btrfs-progs >> treating the free-space-tree (free-space-cache-v2) as read-only, and the >> time cost of having to clear and rebuild it after a btrfs check --repair. >> >> Which is what triggered the mismatch warning I mentioned above. Either >> that raid0 data is of throw-away value appropriate to placement on a >> raid0, and btrfs check --repair is of little concern as the benefits are >> questionable (no guarantees it'll work and the data is either directly >> throw-away value anyway, or there's a backup at hand that /does/ have a >> tested guarantee of viability, or it's not worthy of being called a >> backup in the first place), or it's not. > > I think you may be looking at this a touch too black and white, but > that's probably because I've not been clear about my use-case. We do > have mechanisms at a higher level in our parallel file system to do > scale-out object-based RAID, so in a way the data is "throw-away" in > that we can lose it without true data loss. However, one should not > underestimate the foreground impact of a reconstruction of 60-80TB of > data, even with architectures like ours that scale reconstruction well. > When I lose an HDD I fully expect we will need to rebuild that entire > BTRFS filesystem, and we can. But I'd like to limit it to real media > failure. 
In other words, if I can't mount my BTRFS filesystem after > power-fail, and I can't run btrfs check --repair, then in essence I've > lost a lot of data I need to rebuild for no "good" reason. > > Perhaps more critically, when an entire cluster of these systems > power-fail, if more than N of these running BTRFS come up and require > check --repair prior to mount due to some commonly triggered BTRFS bug > (not saying there is one, I'm just conservative), I'm completely hosed. > Restoring PB's of data from backup is a non-starter. Whether or not this is likely to be an issue is just as much dependent on the storage hardware as on how BTRFS handles it. In my own experience, I've only ever lost a BTRFS volume to a power failure _once_ in the multiple years I've been using it, and that ended up being because the power failure trashed the storage device pretty severely (it was super-cheap flash storage). I do know however that there are people who have had much worse results than me. > > In short, I've been playing coy about the details of my project and need > to continue to do so for at least the next 4-6 months, but if you read > anything about the company I'm emailing from, you can probably make > reasonable guesses about what I'm trying to do. > >> It's also worth mentioning that btrfs raid0 mode, as well as single mode, >> hobbles the btrfs data and metadata integrity feature, because while >> checksums can and are still generated, stored and checked by default, and >> integrity problems can still be detected, because raid0 (and single) >> includes no redundancy, there's no second copy (raid1/10) or parity >> redundancy (raid5/6) to rebuild the bad data from, so it's simply gone. > > I'm ok with that. We have a concept called "on-demand reconstruction" > which permits us to rebuild individual objects in our filesystem > on-demand (one component of which will be a failed file on one of the > BTRFS filesystems). 
So long as I can identify that a file has been > corrupted I'm fine. Somewhat ironically, while BTRFS isn't yet great at fixing things when they go wrong, it's pretty good at letting you know something has gone wrong. Unfortunately, it tends to be far more aggressive in doing so than it sounds like you need it to be. > >> 12-14 TB individual drives? >> >> While you /did/ say enterprise grade so this probably doesn't apply to >> you, it might apply to others that will read this. >> >> Be careful that you're not trying to use the "archive application" >> targeted SMR drives for general purpose use. > > We're using traditional PMR drives for now. That's available at 12/14TB > capacity points presently. I agree with your general sense that SMR > drives are unlikely to play particularly well with BTRFS for all but the > truly archival use-case. It's not exactly a 'general sense' or a hunch, issues with BTRFS on SMR drives have been pretty well demonstrated in practice, hence Duncan making this statement despite the fact that it most likely did not apply to you. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Status of FST and mount times 2018-02-15 16:51 ` Austin S. Hemmelgarn @ 2018-02-15 16:58 ` Ellis H. Wilson III 2018-02-15 17:57 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 32+ messages in thread From: Ellis H. Wilson III @ 2018-02-15 16:58 UTC (permalink / raw) To: Austin S. Hemmelgarn, linux-btrfs On 02/15/2018 11:51 AM, Austin S. Hemmelgarn wrote: > There are scaling performance issues with directory listings on BTRFS > for directories with more than a few thousand files, but they're not > well documented (most people don't hit them because most applications > are designed around the expectation that directory listings will be slow > in big directories), and I would not expect them to be much of an issue > unless you're dealing with tens of thousands of files and particularly > slow storage. Understood -- thanks. The plan is to keep it to around 1k entries per directory. We've done some fairly concrete testing here to find the fall-off point for dirent caching in BTRFS, and the sweet-spot between having a large number of small directories cached vs. a few massive directories cached. ~1k seems most palatable for our use-case and directory tree structure. > I've only ever lost a BTRFS volume to a power failure _once_ in the > multiple years I've been using it, and that ended up being because the > power failure trashed the storage device pretty severely (it was > super-cheap flash storage). I do know however that there are people who > have had much worse results than me. Good to know. We'll be running power-fail testing over the next couple of months. I'm waiting for some hardware to arrive presently. We'll power-cycle fairly large filesystems a few thousand times before we deem it safe to ship. If there are latent bugs in BTRFS still w.r.t. power-fail, I can guarantee we'll trip over them... 
> It's not exactly a 'general sense' or a hunch, issues with BTRFS on SMR > drives have been pretty well demonstrated in practice, hence Duncan > making this statement despite the fact that it most likely did not apply > to you. Ah, ok, thanks for clarifying. I appreciate the forewarning regardless. Best, ellis ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Status of FST and mount times 2018-02-15 16:58 ` Ellis H. Wilson III @ 2018-02-15 17:57 ` Austin S. Hemmelgarn 0 siblings, 0 replies; 32+ messages in thread From: Austin S. Hemmelgarn @ 2018-02-15 17:57 UTC (permalink / raw) To: Ellis H. Wilson III, linux-btrfs On 2018-02-15 11:58, Ellis H. Wilson III wrote: > On 02/15/2018 11:51 AM, Austin S. Hemmelgarn wrote: >> There are scaling performance issues with directory listings on BTRFS >> for directories with more than a few thousand files, but they're not >> well documented (most people don't hit them because most applications >> are designed around the expectation that directory listings will be >> slow in big directories), and I would not expect them to be much of an >> issue unless you're dealing with tens of thousands of files and >> particularly slow storage. > > Understood -- thanks. Then plan is to keep it to around 1k entries per > directory. We've done some fairly concrete testing here to find the > fall-off point for dirent caching in BTRFS, and the sweet-spot between > having a large number of small directories cached vs. a few massive > directories cached. ~1k seems most palatable for our use-case and > directory tree structure. Yeah, in my own experience this starts to get noticeable on slower storage at around 4k or more entries in a directory, but it ends up depending on the hardware to a certain extent, and on the rest of the system as well (something Samba does seems to make it significantly worse than listing locally, for example, while NFS seems to only be worse because of network latency). > >> I've only ever lost a BTRFS volume to a power failure _once_ in the >> multiple years I've been using it, and that ended up being because the >> power failure trashed the storage device pretty severely (it was >> super-cheap flash storage). I do know however that there are people >> who have had much worse results than me. > > Good to know. 
We'll be running power-fail testing over the next couple > months. I'm waiting for some hardware to arrive presently. We'll > power-cycle fairly large filesystems a few thousand times before we deem > it safe to ship. If there are latent bugs in BTRFS still w.r.t. > power-fail, I can guarantee we'll trip over them... Most of my own experience regarding power failures with BTRFS is on SSD's. We actually use it on the embedded systems we build where I work, and a lot of our customers don't have the most reliable mains power (or they're too lazy to shut off the computer properly before flipping the main breaker for the machine to power it off for the evening), so some of our systems may see power failures on an almost daily basis. Despite that, we've never had issues with BTRFS not recovering by itself, though we do have a very read-heavy workload with very infrequent writes, so that may be part of why it's worked so well for us. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Status of FST and mount times 2018-02-14 16:00 Status of FST and mount times Ellis H. Wilson III 2018-02-14 17:08 ` Nikolay Borisov 2018-02-14 23:24 ` Duncan @ 2018-02-15 6:14 ` Chris Murphy 2018-02-15 16:45 ` Ellis H. Wilson III 2 siblings, 1 reply; 32+ messages in thread From: Chris Murphy @ 2018-02-15 6:14 UTC (permalink / raw) To: Ellis H. Wilson III; +Cc: Btrfs BTRFS On Wed, Feb 14, 2018 at 9:00 AM, Ellis H. Wilson III <ellisw@panasas.com> wrote: > Frame-of-reference here: RAID0. Around 70TB raw capacity. No compression. > No quotas enabled. Many (potentially tens to hundreds) of subvolumes, each > with tens of snapshots. Even if losing such a file system is non-catastrophic, it's big enough that setting it up again is tedious and takes time. I think it's worth considering one of two things as alternatives: a. metadata raid1, data single: you lose the striping performance of raid0, and if it's not randomly filled you'll end up with some disk contention for reads and writes *but* if you lose a drive you will not lose the file system. Any missing files on the dead drive will result in EIO (and I think also a kernel message with path to file), and so you could just run a script to delete those files and replace them with backup copies. b. Variation on the above would be to put it behind a glusterfs replicated volume. Gluster getting EIO from a brick should cause it to get a copy from another brick and then fix up the bad one automatically. Or in your raid0 case, the whole volume is lost, and glusterfs helps do the full rebuild over 3-7 days while you're still able to access those 70TB of data normally. Of course, this option requires having two 70TB storage bricks available. -- Chris Murphy ^ permalink raw reply [flat|nested] 32+ messages in thread
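Option (a) above maps onto mkfs/balance roughly as in the sketch below. Hedged and illustrative only -- the devices and mountpoint are placeholders, and a run() wrapper echoes each command rather than executing it:

```shell
# Metadata raid1, data single: create fresh, or convert in place.
# DEV1/DEV2 and /mnt are placeholders; run() echoes instead of executing.
DEV1=/dev/sdX
DEV2=/dev/sdY
run() { echo "+ $*"; }   # dry-run guard; swap 'echo' for real execution
# Fresh filesystem with the mixed profiles:
run mkfs.btrfs -m raid1 -d single "$DEV1" "$DEV2"
# Or convert an existing (e.g. raid0) filesystem via balance -- expect
# this to take a long time on tens of TB:
run btrfs balance start -mconvert=raid1 -dconvert=single /mnt
```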
* Re: Status of FST and mount times 2018-02-15 6:14 ` Chris Murphy @ 2018-02-15 16:45 ` Ellis H. Wilson III 0 siblings, 0 replies; 32+ messages in thread From: Ellis H. Wilson III @ 2018-02-15 16:45 UTC (permalink / raw) To: Chris Murphy; +Cc: Btrfs BTRFS On 02/15/2018 01:14 AM, Chris Murphy wrote: > On Wed, Feb 14, 2018 at 9:00 AM, Ellis H. Wilson III <ellisw@panasas.com> wrote: > >> Frame-of-reference here: RAID0. Around 70TB raw capacity. No compression. >> No quotas enabled. Many (potentially tens to hundreds) of subvolumes, each >> with tens of snapshots. > > Even if non-catastrophic to lose such a file system, it's big enough > to be tedious and take time to set it up again. I think it's worth > considering one of two things as alternatives: > > a. metadata raid1, data single: you lose the striping performance of > raid0, and if it's not randomly filled you'll end up with some disk > contention for reads and writes *but* if you lose a drive you will not > lose the file system. Any missing files on the dead drive will result > in EIO (and I think also a kernel message with path to file), and so > you could just run a script to delete those files and replace them > with backup copies. This option is on our roadmap for future releases of our parallel file system, but unfortunately we do not presently have the time to implement the functionality to report from the manager of that btrfs filesystem to the pfs manager that said files have gone missing. We will absolutely be revisiting that as an option in early 2019, as replacing just one disk instead of N is highly attractive. Waiting for EIO as you suggest in b is a non-starter for us, as we're working at scales sufficiently large that we don't want to wait for someone to stumble over a partially degraded file. Pro-active reporting is what's needed, and we'll implement that Real Soon Now. > b. Variation on the above would be to put it behind glusterfs > replicated volume. 
Gluster getting EIO from a brick should cause it to > get a copy from another brick and then fix up the bad one > automatically. Or in your raid0 case, the whole volume is lost, and > glusterfs helps do the full rebuild over 3-7 days while you're still > able to access those 70TB of data normally. Of course, this option > requires having two 70TB storage bricks available. See my email address, which may help explain why GlusterFS is a non-starter. Nevertheless, the idea is a fine one and we'll have something similar going on, but at higher raid levels and typically across a dozen or more such bricks. Best, ellis ^ permalink raw reply [flat|nested] 32+ messages in thread