* Question: how understand the raid profile of a btrfs filesystem
@ 2020-03-20 17:56 Goffredo Baroncelli
  2020-03-21  3:29 ` Zygo Blaxell
  2020-03-24  4:55 ` Anand Jain
  0 siblings, 2 replies; 21+ messages in thread
From: Goffredo Baroncelli @ 2020-03-20 17:56 UTC (permalink / raw)
To: linux-btrfs

Hi all,

for a btrfs filesystem, how can a user find out which {data,metadata,system} [raid] profile is in use? E.g., which profile will the next chunk have?

For a simple filesystem it is easy, looking at the output of (e.g.) "btrfs fi df" or "btrfs fi us". But what if the filesystem is not simple?

    btrfs fi us t/.
    Overall:
        Device size:          40.00GiB
        Device allocated:     19.52GiB
        Device unallocated:   20.48GiB
        Device missing:          0.00B
        Used:                 16.75GiB
        Free (estimated):     12.22GiB  (min: 8.27GiB)
        Data ratio:               1.90
        Metadata ratio:           2.00
        Global reserve:        9.06MiB  (used: 0.00B)

    Data,single: Size:1.00GiB, Used:512.00MiB (50.00%)
       /dev/loop0      1.00GiB

    Data,RAID5: Size:3.00GiB, Used:2.48GiB (82.56%)
       /dev/loop1      1.00GiB
       /dev/loop2      1.00GiB
       /dev/loop3      1.00GiB
       /dev/loop0      1.00GiB

    Data,RAID6: Size:4.00GiB, Used:3.71GiB (92.75%)
       /dev/loop1      2.00GiB
       /dev/loop2      2.00GiB
       /dev/loop3      2.00GiB
       /dev/loop0      2.00GiB

    Data,RAID1C3: Size:2.00GiB, Used:1.88GiB (93.76%)
       /dev/loop1      2.00GiB
       /dev/loop2      2.00GiB
       /dev/loop3      2.00GiB

    Metadata,RAID1: Size:256.00MiB, Used:9.14MiB (3.57%)
       /dev/loop2    256.00MiB
       /dev/loop3    256.00MiB

    System,RAID1: Size:8.00MiB, Used:16.00KiB (0.20%)
       /dev/loop2      8.00MiB
       /dev/loop3      8.00MiB

    Unallocated:
       /dev/loop1      5.00GiB
       /dev/loop2      4.74GiB
       /dev/loop3      4.74GiB
       /dev/loop0      6.00GiB

This is an example of a strange but valid filesystem. So the question is: which profile will the next chunk have? Is there any way to understand what will happen?

I expected that the next chunk would be allocated with the profile of the last "convert". However, I discovered that this is not true.
Looking at the code, it seems to me that the logic is the following (from btrfs_reduce_alloc_profile()):

    if (allowed & BTRFS_BLOCK_GROUP_RAID6)
        allowed = BTRFS_BLOCK_GROUP_RAID6;
    else if (allowed & BTRFS_BLOCK_GROUP_RAID5)
        allowed = BTRFS_BLOCK_GROUP_RAID5;
    else if (allowed & BTRFS_BLOCK_GROUP_RAID10)
        allowed = BTRFS_BLOCK_GROUP_RAID10;
    else if (allowed & BTRFS_BLOCK_GROUP_RAID1)
        allowed = BTRFS_BLOCK_GROUP_RAID1;
    else if (allowed & BTRFS_BLOCK_GROUP_RAID0)
        allowed = BTRFS_BLOCK_GROUP_RAID0;

    flags &= ~BTRFS_BLOCK_GROUP_PROFILE_MASK;

So in the case above the profile will be RAID6. And in general, if a RAID6 chunk exists in a filesystem, it wins! But I am not sure...

Moreover, I expected to also see references to DUP and/or RAID1C[34]...

Does someone have any suggestion?

BR
G.Baroncelli

^ permalink raw reply	[flat|nested] 21+ messages in thread
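[Editorial note: the cascade quoted above can be sketched outside the kernel. This is a minimal illustration only; the bit values below are arbitrary placeholders, not the kernel's real BTRFS_BLOCK_GROUP_* constants.]

```python
# Sketch of the btrfs_reduce_alloc_profile() cascade: from the set of
# profile bits present, the highest-priority profile wins.
# The bit values are placeholders, not the kernel's actual constants.
RAID0, RAID1, RAID10, RAID5, RAID6 = (1 << i for i in range(5))

def reduce_alloc_profile(allowed):
    """Collapse a mask of allowed profile bits to the single winner."""
    for profile in (RAID6, RAID5, RAID10, RAID1, RAID0):
        if allowed & profile:
            return profile
    return 0  # no profile bit set: 'single'

# A filesystem with single + RAID5 + RAID6 data chunks (as in the
# example above) reduces to RAID6 for the next allocation.
print(reduce_alloc_profile(RAID1 | RAID5 | RAID6) == RAID6)  # True
```

Note how DUP and RAID1C[34] are indeed absent from the cascade, which is exactly the observation made in the mail above.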
* Re: Question: how understand the raid profile of a btrfs filesystem
  2020-03-20 17:56 Question: how understand the raid profile of a btrfs filesystem Goffredo Baroncelli
@ 2020-03-21  3:29 ` Zygo Blaxell
  2020-03-21  5:40   ` Andrei Borzenkov
  2020-03-21  9:55   ` Goffredo Baroncelli
  2020-03-24  4:55 ` Anand Jain
  1 sibling, 2 replies; 21+ messages in thread
From: Zygo Blaxell @ 2020-03-21  3:29 UTC (permalink / raw)
To: kreijack; +Cc: linux-btrfs

On Fri, Mar 20, 2020 at 06:56:38PM +0100, Goffredo Baroncelli wrote:
> Hi all,
>
> for a btrfs filesystem, how can a user find out which {data,metadata,system} [raid] profile is in use? E.g., which profile will the next chunk have?

It's the profile used by the highest-numbered block group for the allocation type (one for data, one for metadata/system). There are two profiles to consider, one for data and one for metadata. 'btrfs fi df', 'btrfs fi us', or 'btrfs dev usage' will all indicate which profiles these are.

It is valid for the two profiles to be different, and different profile combinations have different use cases. In most cases the only one that matters is the data profile, as that's the one POSIX 'df' reports and data block writes consume, and the one that typically occupies more than 99% of the total space.

Administrators and system designers have to be more aware of metadata usage when filesystems become extremely full or extremely small (less than 16 GB). Users (without root or CAP_SYS_ADMIN) generally can't do anything about metadata usage, except as a tiny side-effect of their data usage.

> For a simple filesystem it is easy, looking at the output of (e.g.) "btrfs fi df" or "btrfs fi us". But what if the filesystem is not simple?

"Not simple" is not a normal operating mode for btrfs.
The filesystem allows multiple profiles to be active so that it can be converted to a new profile while old data is still accessible; however, the conversion is expected to end at some point, and all block groups will use the same profile when that happens.

The allocator will only use one RAID profile, and will ignore free space in block groups of other profiles, while 'df' reports the total space on the filesystem in each profile, and metadata allocation does something else. 'btrfs fi us' reports a mess and can't give any accurate free space estimate. Disk space will apparently be free while writes fail with ENOSPC. This is not a problem if a conversion is running to eliminate all the "competing" profiles, but if the conversion stops, you can expect some problems with space until it resumes again.

> btrfs fi us t/.
> Overall:
>     Device size:          40.00GiB
>     Device allocated:     19.52GiB
>     Device unallocated:   20.48GiB
>     Device missing:          0.00B
>     Used:                 16.75GiB
>     Free (estimated):     12.22GiB  (min: 8.27GiB)
>     Data ratio:               1.90
>     Metadata ratio:           2.00
>     Global reserve:        9.06MiB  (used: 0.00B)
>
> Data,single: Size:1.00GiB, Used:512.00MiB (50.00%)
>    /dev/loop0      1.00GiB
>
> Data,RAID5: Size:3.00GiB, Used:2.48GiB (82.56%)
>    /dev/loop1      1.00GiB
>    /dev/loop2      1.00GiB
>    /dev/loop3      1.00GiB
>    /dev/loop0      1.00GiB
>
> Data,RAID6: Size:4.00GiB, Used:3.71GiB (92.75%)
>    /dev/loop1      2.00GiB
>    /dev/loop2      2.00GiB
>    /dev/loop3      2.00GiB
>    /dev/loop0      2.00GiB
>
> Data,RAID1C3: Size:2.00GiB, Used:1.88GiB (93.76%)
>    /dev/loop1      2.00GiB
>    /dev/loop2      2.00GiB
>    /dev/loop3      2.00GiB
>
> Metadata,RAID1: Size:256.00MiB, Used:9.14MiB (3.57%)
>    /dev/loop2    256.00MiB
>    /dev/loop3    256.00MiB
>
> System,RAID1: Size:8.00MiB, Used:16.00KiB (0.20%)
>    /dev/loop2      8.00MiB
>    /dev/loop3      8.00MiB
>
> Unallocated:
>    /dev/loop1      5.00GiB
>    /dev/loop2      4.74GiB
>    /dev/loop3      4.74GiB
>    /dev/loop0      6.00GiB
>
> This is an example of a strange but valid filesystem.
Valid, but the filesystem is in a state designed for temporary use during conversions, and you will want to exit that state as soon as possible.

> So the question is: which profile will the next chunk have?
> Is there any way to understand what will happen?
>
> I expected that the next chunk would be allocated with the profile of the last "convert". However I discovered that this is not true.

That's correct in most cases: a convert will create a new block group, which will have the highest bytenr in the filesystem, and its profile will be used to allocate new data, thus converting the filesystem to the new profile. However, if you pause the convert and delete all the files in the new block group, it's possible that the new block group gets deleted too, and then the filesystem reverts to the previous RAID profile. Again, not a problem if you run the convert until it completely removes all old block groups!
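[Editorial note: the "revert" behavior described here can be sketched as a toy model. This is a simplification of the claim in this mail only; it ignores the conversion-in-progress case refined later in the thread, and is not kernel code.]

```python
# Toy model: the next allocation copies the profile of the block group
# with the highest bytenr; deleting the newest block groups can make
# allocation "revert" to an older profile. Not kernel code.
def next_profile(block_groups):
    """block_groups maps bytenr -> profile name; highest bytenr wins."""
    return block_groups[max(block_groups)]

bgs = {0: "raid5", 1 << 30: "raid5", 2 << 30: "single"}  # after a paused convert
assert next_profile(bgs) == "single"

del bgs[2 << 30]  # the emptied 'single' block group gets deleted...
assert next_profile(bgs) == "raid5"  # ...and allocation reverts to raid5
```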
> Moreover, I expected to also see references to DUP and/or RAID1C[34]...

If you get through that 'if' statement without hitting any of the branches, then you're equal to raid0 (0 redundant disks), but raid0 is a special case because it requires 2 disks for allocation. 'dup' (0 redundant disks) and 'single' (which is the absence of any profile bits) also have 0 redundant disks and require only 1 disk for allocation, so there is no need to treat them differently.

raid1c[34] probably should be there. Patches welcome.

> Does someone have any suggestion?
>
> BR
> G.Baroncelli
* Re: Question: how understand the raid profile of a btrfs filesystem
  2020-03-21  3:29 ` Zygo Blaxell
@ 2020-03-21  5:40   ` Andrei Borzenkov
  2020-03-21  7:14     ` Zygo Blaxell
  2020-03-21  9:55   ` Goffredo Baroncelli
  1 sibling, 1 reply; 21+ messages in thread
From: Andrei Borzenkov @ 2020-03-21  5:40 UTC (permalink / raw)
To: Zygo Blaxell, kreijack; +Cc: linux-btrfs

On 21.03.2020 06:29, Zygo Blaxell wrote:
> On Fri, Mar 20, 2020 at 06:56:38PM +0100, Goffredo Baroncelli wrote:
>> Hi all,
>>
>> for a btrfs filesystem, how can a user find out which {data,metadata,system} [raid] profile is in use? E.g., which profile will the next chunk have?
>
> It's the profile used by the highest-numbered block group for the
> allocation type (one for data, one for metadata/system).

Is the "highest-numbered" block group always the last one created? Can block group numbers wrap around?

Recently someone reported that block groups with the old profile remained after a conversion, and this probably explains it: the conversion races with new allocation.

>> So the question is: which profile will the next chunk have?
>> Is there any way to understand what will happen?

Well, from that explanation it is not possible using standard tools; one needs to crawl btrfs internals to find out the "last" block group.
* Re: Question: how understand the raid profile of a btrfs filesystem
  2020-03-21  5:40 ` Andrei Borzenkov
@ 2020-03-21  7:14   ` Zygo Blaxell
  0 siblings, 0 replies; 21+ messages in thread
From: Zygo Blaxell @ 2020-03-21  7:14 UTC (permalink / raw)
To: Andrei Borzenkov; +Cc: kreijack, linux-btrfs

On Sat, Mar 21, 2020 at 08:40:50AM +0300, Andrei Borzenkov wrote:
> On 21.03.2020 06:29, Zygo Blaxell wrote:
> > On Fri, Mar 20, 2020 at 06:56:38PM +0100, Goffredo Baroncelli wrote:
> >> Hi all,
> >>
> >> for a btrfs filesystem, how can a user find out which {data,metadata,system} [raid] profile is in use? E.g., which profile will the next chunk have?
> >
> > It's the profile used by the highest-numbered block group for the
> > allocation type (one for data, one for metadata/system).
>
> Is the "highest-numbered" block group always the last one created?

It's not required by the filesystem format, but it is the current behavior of the implementation.

> Can block group numbers wrap around?

In theory, yes, but they are 64 bits long and correspond to bytes in the filesystem's address space. If you loop balancing a filesystem with a single 4K data block, and you can do it at 1000 block groups per second, you'll wrap around in a little over six months. Typical use cases (and even extreme ones) will take centuries to wrap around, even if you are converting all the time.

> Recently someone reported that block groups with the old profile remained
> after a conversion, and this probably explains it: the conversion races
> with new allocation.

Conversion *is* new allocation; no race is possible, because they are the same thing. While a conversion is running, the conversion itself forces the raid profile of newly created block groups, so there is no race. After conversion is completed, there is special-case code to prevent the last empty block group in the filesystem from being deleted; otherwise, btrfs would lose information about the selected raid profile.
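[Editorial note: the six-month figure checks out as back-of-the-envelope arithmetic if each new block group advances the 64-bit logical address space by 1 GiB, the usual data block-group size. The 1 GiB step is an assumption here; the exact chunk size depends on the filesystem.]

```python
# Back-of-the-envelope check of the wraparound estimate above.
# Assumes each balance advances the logical address by 1 GiB (typical
# data block group size) at 1000 block groups created per second.
ADDRESS_SPACE = 2 ** 64  # block group bytenrs are 64-bit byte offsets
BG_SIZE = 1 << 30        # 1 GiB per block group (assumption)
RATE = 1000              # block groups created per second

seconds = ADDRESS_SPACE // BG_SIZE // RATE
months = seconds / (30 * 24 * 3600)
print(round(months, 1))  # prints 6.6 -- "a little over six months"
```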
When a conversion is paused or cancelled, new allocations normally continue using the conversion target profile; however, if all block groups of the new profile are deleted (i.e. all the data contained in the new block groups is removed), then it is possible to revert back to allocating using an older profile.

E.g. if you want to combine a balance convert with a device remove, you have to let the convert run long enough to ensure several block groups of the new raid profile exist on drives other than the drive being removed. The device remove will delete all block groups on the removed device, in reverse device physical offset order, which is often (but not necessarily) reverse block group order. This can lead to device remove switching back to the old RAID profile. This example is not any kind of race; the result can be produced deterministically, and the conversion must be paused first.

A conversion can be forcibly stopped by various events: crashes, unmounting the filesystem, hitting an unrecoverable read or write error, or running out of space. These events will leave block groups with old profiles on the disk. Generally, if an external event forces conversion to stop, then it will need to be manually restarted. If there are uncorrectable read errors on the filesystem, then the affected data blocks must be removed from the filesystem before conversion can be completed. The same goes for free space: you must have enough to complete the conversion.

Old versions of mkfs.btrfs had bugs which would leave empty block groups with different profiles on the filesystem. When in doubt, or if you have an older-vintage btrfs filesystem, run a converting balance with the desired raid profile and the 'soft' filter to be sure only one profile is present: it will be a no-op if conversion is complete; otherwise, it will finish the conversion.

> >> So the question is: which profile will the next chunk have?
> >> Is there any way to understand what will happen?
> Well, from that explanation it is not possible using standard tools -
> one needs to crawl btrfs internals to find out the "last" block group.

This is required only during the conversion process. In normal cases users can assume the only profile present is the one that will be used.

The python-btrfs package contains an example of listing block groups. The last entry in the list will have the current allocation profile.

An unprivileged user can monitor 'btrfs fi df' output over time. Used space will increase or decrease in the current profile, and only decrease in the other profiles.
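[Editorial note: the unprivileged approach described here can be sketched as a small parser. This is an illustration only; it assumes the 'btrfs fi df' output format shown in the sample strings, the parsing is deliberately simplistic, and it only works if both snapshots use the same units.]

```python
import re

# Given two snapshots of 'btrfs fi df' output taken some time apart,
# the profile whose used figure grew is the one receiving new writes.
LINE = re.compile(r"(\w+), (\w+): total=\S+, used=(\S+)")

def used_by_profile(fi_df_output):
    """Map 'Type,PROFILE' -> used figure (unit suffix stripped)."""
    usage = {}
    for kind, profile, used in LINE.findall(fi_df_output):
        usage[f"{kind},{profile}"] = float(used.rstrip("GMKiB"))
    return usage

before = "Data, RAID5: total=4.00GiB, used=1.85GiB\nData, single: total=1.00GiB, used=0.25GiB\n"
after = "Data, RAID5: total=4.00GiB, used=2.10GiB\nData, single: total=1.00GiB, used=0.25GiB\n"

grew = [k for k in used_by_profile(after)
        if used_by_profile(after)[k] > used_by_profile(before)[k]]
print(grew)  # ['Data,RAID5'] -> raid5 is the active data profile here
```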
* Re: Question: how understand the raid profile of a btrfs filesystem
  2020-03-21  3:29 ` Zygo Blaxell
  2020-03-21  5:40   ` Andrei Borzenkov
@ 2020-03-21  9:55   ` Goffredo Baroncelli
  2020-03-21 23:26     ` Zygo Blaxell
  1 sibling, 1 reply; 21+ messages in thread
From: Goffredo Baroncelli @ 2020-03-21  9:55 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: linux-btrfs

On 3/21/20 4:29 AM, Zygo Blaxell wrote:
> On Fri, Mar 20, 2020 at 06:56:38PM +0100, Goffredo Baroncelli wrote:
>> Hi all,
>>
>> for a btrfs filesystem, how can a user find out which {data,metadata,system} [raid] profile is in use? E.g., which profile will the next chunk have?
>
> It's the profile used by the highest-numbered block group for the
> allocation type (one for data, one for metadata/system). There
> are two profiles to consider, one for data and one for metadata.
> 'btrfs fi df', 'btrfs fi us', or 'btrfs dev usage' will all indicate
> which profiles these are.

What do you mean by "highest-numbered block group": the value in the "offset" field? If so, it doesn't make sense, because it could be relocated easily.

Anyway, what you are describing is not what I saw. In the test below I created a raid5 filesystem, filled one chunk at 100% and a second chunk for a few MB. Then I converted the most empty chunk to single. Then I filled the last chunk (the single one) and forced the creation of a new chunk. What I saw is that the new chunk is in raid5 mode.

    $ sudo mkfs.btrfs -draid5 /dev/loop[012]
    $ dd if=/dev/zero of=t/file-2.128gb_5 bs=1M count=$((2024+128))   # fill two raid5 chunks
    $ sudo btrfs fi du t/.    # see what the situation is
    [...]
    Data,RAID5: Size:4.00GiB, Used:2.10GiB (52.57%)
       /dev/loop0      2.00GiB
       /dev/loop1      2.00GiB
       /dev/loop2      2.00GiB
    [...]
    $ sudo btrfs balance start -dconvert=single,usage=50 t/.   # convert the latest chunk to single
    $ sudo btrfs fi us t/.    # see what the situation is
    [...]
    Data,single: Size:1.00GiB, Used:259.00MiB (25.29%)
       /dev/loop0      1.00GiB

    Data,RAID5: Size:2.00GiB, Used:1.85GiB (92.47%)
       /dev/loop0      1.00GiB
       /dev/loop1      1.00GiB
       /dev/loop2      1.00GiB
    [...]

    # fill the latest chunk and create a new one
    $ dd if=/dev/zero of=t/file-1.128gb_6 bs=1M count=$((1024+128))

    $ sudo btrfs fi us t/.    # see what the situation is
    [...]
    Data,single: Size:1.00GiB, Used:259.00MiB (25.29%)
       /dev/loop0      1.00GiB

    Data,RAID5: Size:4.00GiB, Used:1.85GiB (46.24%)
       /dev/loop0      2.00GiB
       /dev/loop1      2.00GiB
       /dev/loop2      2.00GiB
    [...]

Expected result: the "single" chunk should grow from 1GB to 2GB. What is observed is that raid5 (the oldest chunk) grew from 2GB to 4GB.

[...]

>> Looking at the code, it seems to me that the logic is the following (from btrfs_reduce_alloc_profile()):
>>
>>     if (allowed & BTRFS_BLOCK_GROUP_RAID6)
>>         allowed = BTRFS_BLOCK_GROUP_RAID6;
>>     else if (allowed & BTRFS_BLOCK_GROUP_RAID5)
>>         allowed = BTRFS_BLOCK_GROUP_RAID5;
>>     else if (allowed & BTRFS_BLOCK_GROUP_RAID10)
>>         allowed = BTRFS_BLOCK_GROUP_RAID10;
>>     else if (allowed & BTRFS_BLOCK_GROUP_RAID1)
>>         allowed = BTRFS_BLOCK_GROUP_RAID1;
>>     else if (allowed & BTRFS_BLOCK_GROUP_RAID0)
>>         allowed = BTRFS_BLOCK_GROUP_RAID0;
>>
>>     flags &= ~BTRFS_BLOCK_GROUP_PROFILE_MASK;
>>
>> So in the case above the profile will be RAID6. And in general, if a RAID6 chunk exists in a filesystem, it wins!
>
> This code is used to determine whether a conversion reduces the level of
> redundancy, e.g. you are going from raid6 (2 redundant disks) to raid5
> (1 redundant disk) or raid0 (0 redundant disks). There are warnings and
> a force flag required when that happens. It doesn't determine the raid
> profile of the next block group--that's just a straight copy of the raid
> profile of the last block group.

To me it seems that this function decides the allocation of the next chunk.
The chain of calls is the following:

    btrfs_force_chunk_alloc
      btrfs_get_alloc_profile
        get_alloc_profile
          btrfs_reduce_alloc_profile
      btrfs_chunk_alloc
        btrfs_alloc_chunk
          __btrfs_alloc_chunk

or another one is

    btrfs_alloc_data_chunk_ondemand
      btrfs_data_alloc_profile
        btrfs_get_alloc_profile
          get_alloc_profile
            btrfs_reduce_alloc_profile
      btrfs_chunk_alloc
        btrfs_alloc_chunk
          __btrfs_alloc_chunk

The btrfs_get_alloc_profile/get_alloc_profile/btrfs_reduce_alloc_profile chain decides which profile has to be allocated. The currently active profiles are taken and then filtered down to the ones allowed on the basis of the number of disks. Which means that if a raid6 profile chunk exists (and there is a sufficient number of devices), the next chunk will be allocated as raid6.

That is how I read the code, and what my tests suggest...

My conclusion is: if you have multiple raid profiles in a filesystem, the next chunk allocation doesn't depend on the latest "balance", but on the logic above. The recipe is: when you do a balance, pay attention not to leave any chunk in the old format.

>> But I am not sure... Moreover, I expected to also see references to DUP and/or RAID1C[34]...
>
> If you get through that 'if' statement without hitting any of the
> branches, then you're equal to raid0 (0 redundant disks) but raid0
> is a special case because it requires 2 disks for allocation. 'dup'
> (0 redundant disks) and 'single' (which is the absence of any profile
> bits) also have 0 redundant disks and require only 1 disk for allocation,
> there is no need to treat them differently.
>
> raid1c[34] probably should be there. Patches welcome.
>
>> Does someone have any suggestion?
>>
>> BR
>> G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
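[Editorial note: the reading above — active profiles filtered by device count, then the "largest" survivor wins — can be sketched as follows. This is an illustration of that reading, not kernel code; the minimum device counts are the standard btrfs ones.]

```python
# Sketch of the get_alloc_profile/btrfs_reduce_alloc_profile reading
# above: start from the profiles present on the filesystem, drop those
# the current device count cannot support, then the highest-priority
# survivor wins. Illustration only, not kernel code.
MIN_DEVICES = {"raid6": 3, "raid5": 2, "raid10": 4, "raid1": 2, "raid0": 2}
PRIORITY = ["raid6", "raid5", "raid10", "raid1", "raid0"]

def next_chunk_profile(active, num_devices):
    allowed = {p for p in active if MIN_DEVICES.get(p, 1) <= num_devices}
    for p in PRIORITY:
        if p in allowed:
            return p
    return "single"

# The test above: single + raid5 chunks on a 3-device filesystem.
print(next_chunk_profile({"single", "raid5"}, 3))  # prints raid5
```

This reproduces the observed result: raid5 "wins" over single even though the single chunk was the one created last.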
* Re: Question: how understand the raid profile of a btrfs filesystem
  2020-03-21  9:55 ` Goffredo Baroncelli
@ 2020-03-21 23:26   ` Zygo Blaxell
  2020-03-22  8:34     ` Goffredo Baroncelli
  0 siblings, 1 reply; 21+ messages in thread
From: Zygo Blaxell @ 2020-03-21 23:26 UTC (permalink / raw)
To: Goffredo Baroncelli; +Cc: linux-btrfs

On Sat, Mar 21, 2020 at 10:55:32AM +0100, Goffredo Baroncelli wrote:
> On 3/21/20 4:29 AM, Zygo Blaxell wrote:
> > On Fri, Mar 20, 2020 at 06:56:38PM +0100, Goffredo Baroncelli wrote:
> > > Hi all,
> > >
> > > for a btrfs filesystem, how can a user find out which {data,metadata,system} [raid] profile is in use? E.g., which profile will the next chunk have?
> >
> > It's the profile used by the highest-numbered block group for the
> > allocation type (one for data, one for metadata/system). There
> > are two profiles to consider, one for data and one for metadata.
> > 'btrfs fi df', 'btrfs fi us', or 'btrfs dev usage' will all indicate
> > which profiles these are.
>
> What do you mean by "highest-numbered block group": the value in the "offset" field?

The objectid field (the offset field is the size for block group items).

> If so, it doesn't make sense, because it could be relocated easily.

Relocation will create a new block group with the filesystem's current profile, which is the conversion target profile if present (all conversion is relocation), but some other profile in use on the filesystem otherwise.

> Anyway, what you are describing is not what I saw. In the test below
> I created a raid5 filesystem, filled one chunk at 100% and a second
> chunk for a few MB. Then I converted the most empty chunk to single.

OK, I was missing some details: at mount time all the block group items are read in order, and each one adjusts the allocator profile bits for the entire filesystem. The last block group is the one that has the *most influence* over the profile when no conversion is running, but it doesn't set the profile alone.
If there is a partial conversion, then the behavior changes as you note. When a conversion is active, the conversion target profile overrides everything else. That is how you can get a single block group on a filesystem that is entirely raid5.

So... TL;DR: if you're not running a conversion, the next block group will use some RAID profile already present on the filesystem, and it may not be the one you want it to be.

> Then I filled
> the last chunk (the single one) and forced the creation of a new chunk.
> What I saw is that the new chunk is in raid5 mode.
>
>     $ sudo mkfs.btrfs -draid5 /dev/loop[012]
>     $ dd if=/dev/zero of=t/file-2.128gb_5 bs=1M count=$((2024+128))   # fill two raid5 chunks
>     $ sudo btrfs fi du t/.    # see what the situation is
>     [...]
>     Data,RAID5: Size:4.00GiB, Used:2.10GiB (52.57%)
>        /dev/loop0      2.00GiB
>        /dev/loop1      2.00GiB
>        /dev/loop2      2.00GiB
>     [...]
>     $ sudo btrfs balance start -dconvert=single,usage=50 t/.   # convert the latest chunk to single
>     $ sudo btrfs fi us t/.    # see what the situation is
>     [...]
>     Data,single: Size:1.00GiB, Used:259.00MiB (25.29%)
>        /dev/loop0      1.00GiB
>
>     Data,RAID5: Size:2.00GiB, Used:1.85GiB (92.47%)
>        /dev/loop0      1.00GiB
>        /dev/loop1      1.00GiB
>        /dev/loop2      1.00GiB
>     [...]
>
>     # fill the latest chunk and create a new one
>     $ dd if=/dev/zero of=t/file-1.128gb_6 bs=1M count=$((1024+128))
>
>     $ sudo btrfs fi us t/.    # see what the situation is
>     [...]
>     Data,single: Size:1.00GiB, Used:259.00MiB (25.29%)
>        /dev/loop0      1.00GiB
>
>     Data,RAID5: Size:4.00GiB, Used:1.85GiB (46.24%)
>        /dev/loop0      2.00GiB
>        /dev/loop1      2.00GiB
>        /dev/loop2      2.00GiB
>     [...]
>
> Expected result: the "single" chunk should grow from 1GB to 2GB. What is observed is that raid5 (the oldest chunk) grew from 2GB to 4GB.

...but now you are not running conversion any more, and you have multiple profiles. It's not really specified what will happen under those conditions, nor is it obvious what the correct behavior should be.
The on-disk format does not have a field for "target profile". Adding one would be a disk format change.

> [...]
> > > Looking at the code, it seems to me that the logic is the following (from btrfs_reduce_alloc_profile()):
> > >
> > >     if (allowed & BTRFS_BLOCK_GROUP_RAID6)
> > >         allowed = BTRFS_BLOCK_GROUP_RAID6;
> > >     else if (allowed & BTRFS_BLOCK_GROUP_RAID5)
> > >         allowed = BTRFS_BLOCK_GROUP_RAID5;
> > >     else if (allowed & BTRFS_BLOCK_GROUP_RAID10)
> > >         allowed = BTRFS_BLOCK_GROUP_RAID10;
> > >     else if (allowed & BTRFS_BLOCK_GROUP_RAID1)
> > >         allowed = BTRFS_BLOCK_GROUP_RAID1;
> > >     else if (allowed & BTRFS_BLOCK_GROUP_RAID0)
> > >         allowed = BTRFS_BLOCK_GROUP_RAID0;
> > >
> > >     flags &= ~BTRFS_BLOCK_GROUP_PROFILE_MASK;
> > >
> > > So in the case above the profile will be RAID6. And in general, if a RAID6 chunk exists in a filesystem, it wins!
> >
> > This code is used to determine whether a conversion reduces the level of
> > redundancy, e.g. you are going from raid6 (2 redundant disks) to raid5
> > (1 redundant disk) or raid0 (0 redundant disks). There are warnings and
> > a force flag required when that happens. It doesn't determine the raid
> > profile of the next block group--that's just a straight copy of the raid
> > profile of the last block group.
>
> To me it seems that this function decides the allocation of the next chunk. The chain of calls is the following:

Sorry, in my earlier mail I thought we were talking about a different piece of code that tries to enforce a similar rule.
>     btrfs_force_chunk_alloc
>       btrfs_get_alloc_profile
>         get_alloc_profile
>           btrfs_reduce_alloc_profile
>       btrfs_chunk_alloc
>         btrfs_alloc_chunk
>           __btrfs_alloc_chunk
>
> or another one is
>
>     btrfs_alloc_data_chunk_ondemand
>       btrfs_data_alloc_profile
>         btrfs_get_alloc_profile
>           get_alloc_profile
>             btrfs_reduce_alloc_profile
>       btrfs_chunk_alloc
>         btrfs_alloc_chunk
>           __btrfs_alloc_chunk
>
> The btrfs_get_alloc_profile/get_alloc_profile/btrfs_reduce_alloc_profile chain decides which profile has to be allocated.
> The currently active profiles are taken and then filtered down to the ones allowed on the basis of the number of disks. Which means that if a raid6 profile chunk exists (and there is a sufficient number of devices), the next chunk will be allocated as raid6.
>
> That is how I read the code, and what my tests suggest...
>
> My conclusion is: if you have multiple raid profiles in a filesystem, the next chunk allocation doesn't depend on the latest "balance", but on the logic above.
> The recipe is: when you do a balance, pay attention not to leave any chunk in the old format.

Well, yes, that is what I've been saying: don't expect btrfs to do sane things with a mixture of profiles. Stick to just one profile, except in the special case of a conversion.

You wouldn't leave an array in degraded mode for long, and you need to balance after adding a single drive to a raid1 or striped-profile raid array. Partially converted filesystems fall into this category too.

> > > But I am not sure... Moreover, I expected to also see references to DUP and/or RAID1C[34]...
> >
> > If you get through that 'if' statement without hitting any of the
> > branches, then you're equal to raid0 (0 redundant disks) but raid0
> > is a special case because it requires 2 disks for allocation. 'dup'
> > (0 redundant disks) and 'single' (which is the absence of any profile
> > bits) also have 0 redundant disks and require only 1 disk for allocation,
> > there is no need to treat them differently.
> > raid1c[34] probably should be there. Patches welcome.
>
> > > Does someone have any suggestion?
> > >
> > > BR
> > > G.Baroncelli
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: Question: how understand the raid profile of a btrfs filesystem
  2020-03-21 23:26 ` Zygo Blaxell
@ 2020-03-22  8:34   ` Goffredo Baroncelli
  2020-03-22  8:38     ` Goffredo Baroncelli
  0 siblings, 1 reply; 21+ messages in thread
From: Goffredo Baroncelli @ 2020-03-22  8:34 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: linux-btrfs

Hi Zygo,

On 3/22/20 12:26 AM, Zygo Blaxell wrote:
> On Sat, Mar 21, 2020 at 10:55:32AM +0100, Goffredo Baroncelli wrote:
>> On 3/21/20 4:29 AM, Zygo Blaxell wrote:
>>> On Fri, Mar 20, 2020 at 06:56:38PM +0100, Goffredo Baroncelli wrote:
>>>> Hi all,
>>>> [...]
> ...but now you are not running conversion any more, and you have multiple
> profiles. It's not really specified what will happen under those
> conditions, nor is it obvious what the correct behavior should be.
>
> The on-disk format does not have a field for "target profile".

Ok, I looked for a confirmation of that.

> Adding one would be a disk format change.

Yes, but I think that it could be done in a backward-compatible way. I am thinking of adding a "target profile" field to the super-block. Old kernels would ignore this field and behave as today. New ones would allocate new chunks according to this field.

To me it seems complicated to

Any thoughts?

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: Question: how understand the raid profile of a btrfs filesystem
  2020-03-22  8:34 ` Goffredo Baroncelli
@ 2020-03-22  8:38   ` Goffredo Baroncelli
  2020-03-22 23:49     ` Zygo Blaxell
  0 siblings, 1 reply; 21+ messages in thread
From: Goffredo Baroncelli @ 2020-03-22  8:38 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: linux-btrfs

On 3/22/20 9:34 AM, Goffredo Baroncelli wrote:
>
> To me it seems complicated to

[sorry, I pushed the send button too early]

To me it seems too complicated (and error prone) to derive the target profile from an analysis of the filesystem.

Any thoughts?

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: Question: how understand the raid profile of a btrfs filesystem
  2020-03-22  8:38 ` Goffredo Baroncelli
@ 2020-03-22 23:49   ` Zygo Blaxell
  2020-03-23 20:50     ` Goffredo Baroncelli
  0 siblings, 1 reply; 21+ messages in thread
From: Zygo Blaxell @ 2020-03-22 23:49 UTC (permalink / raw)
To: Goffredo Baroncelli; +Cc: linux-btrfs

On Sun, Mar 22, 2020 at 09:38:30AM +0100, Goffredo Baroncelli wrote:
> On 3/22/20 9:34 AM, Goffredo Baroncelli wrote:
> >
> > To me it seems complicated to
> [sorry, I pushed the send button too early]
>
> To me it seems too complicated (and error prone) to derive the target profile from an analysis of the filesystem.
>
> Any thoughts?

I still don't understand the use case you are trying to support. There are 3 states for a btrfs filesystem:

1. All block groups use the same profile. Pick any one, use its profile for future block groups. Avoid deleting the last one. Simple and easy to implement.

2. A conversion is in progress. Look in fs_info->balance_ctl for a 'convert' filter. If there is one, that's the profile for new block groups. Old block groups will be emptied and destroyed by conversion, and then we automatically go back to state #1.

3. A conversion is interrupted prior to completion. The sysadmin is expected to proceed immediately back to state #2, possibly after recovering from whatever event triggered entry into state #3. It doesn't really matter what the current allocation profile is, since it is likely to change before we allocate any more block groups.

You seem to be trying to sustain or support a filesystem in state #3 for a prolonged period of time. Why would we do that? If your use case is providing information or guidance to a user, tell them how to get back to state #2 ASAP, so that they can then return to state #1 where they should be.

Suppose your use case does involve staying in state #3 for a prolonged period of time--let's say e.g.
you want to be able to use file attributes to put some file data on single profile while putting other files on raid5 profile. That use case would need to come with a bunch of infrastructure to support it, i.e. you'd need to define what the attributes are, and how btrfs could map those to device subsets and raid profiles. None of this exists, and even if it did, it would conflict with the "store the [singular] target profile on disk" idea. There could be a warning message in dmesg if we enter state #3. This message would appear after a converting balance is cancelled or aborted, and on mount when we scan block groups (which we would still need to do even after we added a "target profile" field to the superblock). Userspace like 'btrfs fi df' could also put out a warning like "multiple allocation profiles detected, but conversion is not in progress. Please finish conversion at your earliest convenience to avoid disappointment." I don't see the need to do anything more about it. We only get to state #3 if the automation has already failed, or has been explicitly cancelled at sysadmin request. It is better to wait for the sysadmin to decide what to do next, especially if the sysadmin's prior choice led to us entering this state (e.g. not enough space to complete a conversion to the target profile, so we can no longer use the target profile for new allocations). Picking a target profile at random (from the set of profiles already used in the filesystem) is no better or worse than any deterministic algorithm--it will always be wrong in some situations, and a good choice in other situations. I'd even consider removing the heuristics that are already there for prioritizing profiles. They are just surprising and undocumented behavior, and it would be better to document it as "random, BTW you should finish your conversion now." It doesn't help if e.g. you want to convert from raid6 to raid1, since the heuristic assumes you only want to go the other way. 
> BR > G.Baroncelli > > -- > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 > ^ permalink raw reply [flat|nested] 21+ messages in thread
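Zygo's three-state description above can be sketched as a small model. This is an illustrative sketch only; the function and its signature are hypothetical, not btrfs code or any real API:

```python
def allocation_state(profiles, convert_target=None):
    """Classify a filesystem per the three states described above.

    profiles: set of profile names found in existing block groups.
    convert_target: target of an in-progress balance 'convert' filter,
    or None when no conversion is running.
    Returns (state_number, profile_for_next_chunk_or_None).
    """
    if convert_target is not None:
        # State 2: conversion in progress -- the convert filter
        # decides the profile of new block groups.
        return (2, convert_target)
    if len(profiles) == 1:
        # State 1: uniform filesystem -- any block group's profile works.
        return (1, next(iter(profiles)))
    # State 3: mixed profiles with no conversion running -- the next
    # profile is effectively undefined until conversion is resumed.
    return (3, None)
```

For example, a filesystem holding both 'single' and 'raid5' block groups with no running balance lands in state 3, where the profile of the next chunk cannot be predicted from this state alone.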
* Re: Question: how understand the raid profile of a btrfs filesystem 2020-03-22 23:49 ` Zygo Blaxell @ 2020-03-23 20:50 ` Goffredo Baroncelli 2020-03-23 22:48 ` Graham Cobb 2020-03-23 23:18 ` Zygo Blaxell 0 siblings, 2 replies; 21+ messages in thread From: Goffredo Baroncelli @ 2020-03-23 20:50 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs On 3/23/20 12:49 AM, Zygo Blaxell wrote: > On Sun, Mar 22, 2020 at 09:38:30AM +0100, Goffredo Baroncelli wrote: >> On 3/22/20 9:34 AM, Goffredo Baroncelli wrote: >> >>> >>> To me it seems complicated to >> [sorry I push the send button too early] >> >> To me it seems too complicated (and error prone) to derive the target profile from an analysis of the filesystem. >> >> Any thoughts ? > > I still don't understand the use case you are trying to support. > > There are 3 states for a btrfs filesystem: > [...] > > 3. A conversion is interrupted prior to completion. Sysadmin is > expected to proceed immediately back to state #2, possibly after > taking any necessary recovery actions that triggered entry into > state #3. It doesn't really matter what the current allocation > profile is, since it is likely to change before we allocate > any more block groups. > > You seem to be trying to sustain or support a filesystem in state #3 for > a prolonged period of time. Why would we do that? If your use case is > providing information or guidance to a user, tell them how to get back > to state #2 ASAP, so that they can then return to state #1 where they > should be. Believe me: I *don't want* to sustain #3 at all; btrfs is already too complex. Supporting multiple profiles is the worst thing that we can do. However #3 exists and it could cause unexpected results. I think we agree on this. > [...] > There could be a warning message in dmesg if we enter state #3. 
> This message would appear after a converting balance is cancelled or > aborted, and on mount when we scan block groups (which we would still need > to do even after we added a "target profile" field to the superblock). > Userspace like 'btrfs fi df' could also put out a warning like "multiple > allocation profiles detected, but conversion is not in progress. Please > finish conversion at your earliest convenience to avoid disappointment." > I don't see the need to do anything more about it. It would help if every btrfs command warned the user about an "unwanted" state like this. > > We only get to state #3 if the automation has already failed, or has > been explicitly cancelled at sysadmin request. > Not only that: you can also enter state #3 if you do something like: $ sudo btrfs balance start -dconvert=single,usage=50 t/. where you convert some chunks but not others. This is the point: we can consider the "failed automation" an unexpected event, but doing "btrfs bal stop" or running the command above cannot be considered an unexpected event. [...] > I'd even consider removing the heuristics that are already there for > prioritizing profiles. They are just surprising and undocumented > behavior, and it would be better to document it as "random, BTW you > should finish your conversion now." I agree that we should remove this kind of heuristic. Doing so, I think that, with moderate effort, btrfs could track the wanted profile (i.e. the one set at mkfs time or the one specified in the last balance w/convert [*]) and use it. To me it seems the natural thing to do. Nothing more, nothing less. We can't prevent a mixed-profile filesystem (the possibility to stop a long-running activity like a balance has to be allowed), but we should prevent the unexpected behavior: if we change the profile and something goes wrong, the next chunk allocation should be clear. The user shouldn't have to read the code to understand what will happen. 
[*] we can argue which would be the expected profile after an interrupted balance: the former one or the latter one ? BR G.Baroncelli -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 21+ messages in thread
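The filtered convert mentioned above ('btrfs balance start -dconvert=single,usage=50') can be modeled to show why it leaves mixed profiles behind: only block groups within the usage threshold are relocated to the target profile. The helper below is a hypothetical illustration, not btrfs code, and it glosses over the exact boundary semantics of the real usage filter:

```python
def filtered_convert(block_groups, target, usage_limit):
    """block_groups: list of (profile, used_percent) pairs.
    Relocate only block groups at or below the usage threshold."""
    return [(target, used) if used <= usage_limit else (profile, used)
            for profile, used in block_groups]

# Three raid5 data block groups at 30%, 80% and 45% usage:
bgs = [("raid5", 30), ("raid5", 80), ("raid5", 45)]
after = filtered_convert(bgs, "single", 50)
# The 30% and 45% block groups become 'single'; the 80% one keeps
# raid5, so the filesystem ends up with two profiles (state #3).
```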
* Re: Question: how understand the raid profile of a btrfs filesystem 2020-03-23 20:50 ` Goffredo Baroncelli @ 2020-03-23 22:48 ` Graham Cobb 2020-03-25 4:09 ` Zygo Blaxell 2020-03-23 23:18 ` Zygo Blaxell 0 siblings, 2 replies; 21+ messages in thread From: Graham Cobb @ 2020-03-23 22:48 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs On 23/03/2020 20:50, Goffredo Baroncelli wrote: > On 3/23/20 12:49 AM, Zygo Blaxell wrote: >> On Sun, Mar 22, 2020 at 09:38:30AM +0100, Goffredo Baroncelli wrote: >>> On 3/22/20 9:34 AM, Goffredo Baroncelli wrote: >>> >>>> >>>> To me it seems complicated to >>> [sorry I push the send button too early] >>> >>> To me it seems too complicated (and error prone) to derive the target >>> profile from an analysis of the filesystem. >>> >>> Any thoughts ? >> >> I still don't understand the use case you are trying to support. >> >> There are 3 states for a btrfs filesystem: >> > [...] >> >> 3. A conversion is interrupted prior to completion. Sysadmin is >> expected to proceed immediately back to state #2, possibly after >> taking any necessary recovery actions that triggered entry into >> state #3. It doesn't really matter what the current allocation >> profile is, since it is likely to change before we allocate >> any more block groups. >> >> You seem to be trying to sustain or support a filesystem in state #3 for >> a prolonged period of time. Why would we do that? In real life (particularly outside a commercial datacentre) state #3 can persist for quite a while. I recently found myself in exactly that position, one which not only lasted for weeks but was, at times, getting worse (I was getting further away from my target configuration, not closer). In this case, the original trigger was a disk beginning to go bad in a filesystem of well over 10TB. My strategy for handling that was to replace the failing disk asap, and then rearrange the disk usage on the system later. 
In order to handle the immediate emergency, I made use of existing free space in LVM volume groups to replace the failing disk, but that meant I had some user data and backups on the same physical disk for a while (although I have plenty of other backups available I like to keep my first-tier backups on separate local disks). So, once the immediate crisis was over, I needed to move disks around between the filesystems. It was weeks before I had managed to do sufficient disk adds, removes and replaces to have all the filesystems back to having data and backups on separate disks and all the data and metadata in the profiles I wanted. Just doing a replace for one disk took many days for the system to physically copy the data from one disk to the other. As this system was still in heavy use, this was made worse by btrfs deciding to store data in profiles I did not want (at that point in the manipulation) and forcing me to rebalance the data that had been written during the last disk change before I could start on the next one. Bottom line: although not the top priority in btrfs development, a simple way to control the profile to be used for new data and metadata allocations would have real benefit to overstretched sysadmins. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Question: how understand the raid profile of a btrfs filesystem 2020-03-23 22:48 ` Graham Cobb @ 2020-03-25 4:09 ` Zygo Blaxell 2020-03-25 4:30 ` Paul Jones 0 siblings, 1 reply; 21+ messages in thread From: Zygo Blaxell @ 2020-03-25 4:09 UTC (permalink / raw) To: Graham Cobb; +Cc: linux-btrfs On Mon, Mar 23, 2020 at 10:48:44PM +0000, Graham Cobb wrote: > On 23/03/2020 20:50, Goffredo Baroncelli wrote: > > On 3/23/20 12:49 AM, Zygo Blaxell wrote: > >> On Sun, Mar 22, 2020 at 09:38:30AM +0100, Goffredo Baroncelli wrote: > >>> On 3/22/20 9:34 AM, Goffredo Baroncelli wrote: > >>> > >>>> > >>>> To me it seems complicated to > >>> [sorry I push the send button too early] > >>> > >>> To me it seems too complicated (and error prone) to derive the target > >>> profile from an analysis of the filesystem. > >>> > >>> Any thoughts ? > >> > >> I still don't understand the use case you are trying to support. > >> > >> There are 3 states for a btrfs filesystem: > >> > > [...] > >> > >> 3. A conversion is interrupted prior to completion. Sysadmin is > >> expected to proceed immediately back to state #2, possibly after > >> taking any necessary recovery actions that triggered entry into > >> state #3. It doesn't really matter what the current allocation > >> profile is, since it is likely to change before we allocate > >> any more block groups. > >> > >> You seem to be trying to sustain or support a filesystem in state #3 for > >> a prolonged period of time. Why would we do that? > > In real life situations (particularly outside a commercial datacentre) > this situation can persist for quite a while. I recently found myself > in a real-life situation where this situation was not only in existence > for weeks but was, at some times, getting worse (I was getting further > away from my target configuration, not closer). > > In this case, the original trigger was a disk in a well over 10TB > filesystem beginning to go bad. 
My strategy for handling that was to > replace the failing disk asap, and then rearrange the disk usage on the > system later. In order to handle the immediate emergency, I made use of > existing free space in LVM volume groups to replace the failing disk, > but that meant I had some user data and backups on the same physical > disk for a while (although I have plenty of other backups available I > like to keep my first-tier backups on separate local disks). I've done those. And the annoying thing about them was... > So, once the immediate crisis was over, I needed to move disks around > between the filesystems. It was weeks before I had managed to do > sufficient disk adds, removes Disk removes are where the current system breaks down. 'btrfs device remove' is terrible: - can't cancel a remove except by rebooting or forcing ENOSPC - can't resume automatically after a reboot (probably a good thing for now, given there's no cancel) - can't coexist with a balance, even when paused--device remove requires the balance to be _cancelled_ first - doesn't have any equivalent to the 'convert' filter raid profile target in balance info so if you need to remove a device while you're changing profiles, you have to abort the profile change and then relocate a whole lot of data without being able to specify the correct target profile. The proper fix would be to reimplement 'btrfs dev remove' using pieces of the balance infrastructure (it kind of is now, except where it's not), and so 'device remove' can keep the 'convert=' target. Then you don't have to lose the target profile while doing removes (and fix the other problems too). Or just move it from the balance info to the superblock, as suggested elsewhere in the thread (none of these changes can be done without changing something in the on-disk format). But definitely don't have the target profile in both places! 
> and replaces to have all the filesystems > back to having data and backups on separate disks and all the data and > metadata in the profiles I wanted. Just doing a replace for one disk > took many days for the system to physically copy the data from one disk > to the other. > > As this system was still in heavy use, this was made worse by btrfs > deciding to store data in profiles I did not want (at that point in the > manipulation) and forcing me to rebalance the data that had been written > during the last disk change before I could start on the next one. > > Bottom line: although not the top priority in btrfs development, a > simple way to control the profile to be used for new data and metadata > allocations would have real benefit to overstretched sysadmins. > ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: Question: how understand the raid profile of a btrfs filesystem 2020-03-25 4:09 ` Zygo Blaxell @ 2020-03-25 4:30 ` Paul Jones 2020-03-26 2:51 ` Zygo Blaxell 0 siblings, 1 reply; 21+ messages in thread From: Paul Jones @ 2020-03-25 4:30 UTC (permalink / raw) To: Zygo Blaxell, Graham Cobb; +Cc: linux-btrfs > -----Original Message----- > From: linux-btrfs-owner@vger.kernel.org <linux-btrfs- > owner@vger.kernel.org> On Behalf Of Zygo Blaxell > Sent: Wednesday, 25 March 2020 3:10 PM > To: Graham Cobb <g.btrfs@cobb.uk.net> > Cc: linux-btrfs <linux-btrfs@vger.kernel.org> > Subject: Re: Question: how understand the raid profile of a btrfs filesystem > Disk removes are where the current system breaks down. 'btrfs device > remove' is terrible: > > - can't cancel a remove except by rebooting or forcing ENOSPC > > - can't resume automatically after a reboot (probably a good > thing for now, given there's no cancel) > > - can't coexist with a balance, even when paused--device remove > requires the balance to be _cancelled_ first > > - doesn't have any equivalent to the 'convert' filter raid > profile target in balance info > > so if you need to remove a device while you're changing profiles, you have to > abort the profile change and then relocate a whole lot of data without being > able to specify the correct target profile. > > The proper fix would be to reimplement 'btrfs dev remove' using pieces of > the balance infrastructure (it kind of is now, except where it's not), and so > 'device remove' can keep the 'convert=' target. Then you don't have to lose > the target profile while doing removes (and fix the other problems too). I've often thought it would be handy to be able to forcefully set the disk size or free space to zero, like how it is reported by 'btrfs fi sh' during a remove operation. 
That way a balance operation can be used for various things like profile changes or multiple disk removals (like replacing 4x1T drives with 1x4T drive) without unintentionally writing a bunch of data to a disk you don't want to write to anymore. It would also allow for a more gradual removal for disks that need replacing but not as an emergency, as data will gradually migrate itself to other discs as it is COWed. Paul. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Question: how understand the raid profile of a btrfs filesystem 2020-03-25 4:30 ` Paul Jones @ 2020-03-26 2:51 ` Zygo Blaxell 0 siblings, 0 replies; 21+ messages in thread From: Zygo Blaxell @ 2020-03-26 2:51 UTC (permalink / raw) To: Paul Jones; +Cc: Graham Cobb, linux-btrfs On Wed, Mar 25, 2020 at 04:30:16AM +0000, Paul Jones wrote: > > -----Original Message----- > > From: linux-btrfs-owner@vger.kernel.org <linux-btrfs- > > owner@vger.kernel.org> On Behalf Of Zygo Blaxell > > Sent: Wednesday, 25 March 2020 3:10 PM > > To: Graham Cobb <g.btrfs@cobb.uk.net> > > Cc: linux-btrfs <linux-btrfs@vger.kernel.org> > > Subject: Re: Question: how understand the raid profile of a btrfs filesystem > > > Disk removes are where the current system breaks down. 'btrfs device > > remove' is terrible: > > > > - can't cancel a remove except by rebooting or forcing ENOSPC > > > > - can't resume automatically after a reboot (probably a good > > thing for now, given there's no cancel) > > > > - can't coexist with a balance, even when paused--device remove > > requires the balance to be _cancelled_ first > > > > - doesn't have any equivalent to the 'convert' filter raid > > profile target in balance info > > > > so if you need to remove a device while you're changing profiles, you have to > > abort the profile change and then relocate a whole lot of data without being > > able to specify the correct target profile. > > > > The proper fix would be to reimplement 'btrfs dev remove' using pieces of > > the balance infrastructure (it kind of is now, except where it's not), and so > > 'device remove' can keep the 'convert=' target. Then you don't have to lose > > the target profile while doing removes (and fix the other problems too). > > I've often thought it would be handy to be able to forcefully set the > disk size or free space to zero, like how it is reported by 'btrfs > fi sh' during a remove operation. 
That way a balance operation can be > used for various things like profile changes or multiple disk removals > (like replacing 4x1T drives with 1x4T drive) without unintentionally > writing a bunch of data to a disk you don't want to write to anymore. I forgot "can only remove one disk at a time" in the list above. We can add multiple disks at once (well, add one at a time, then use balance to do all the relocation at once), but the opposite operation isn't possible. That is an elegant way to set up balances to do a device delete/shrink, too. > It would also allow for a more gradual removal for disks that need > replacing but not as an emergency, as data will gradually migrate > itself to other discs as it is COWed. > > Paul. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Question: how understand the raid profile of a btrfs filesystem 2020-03-23 20:50 ` Goffredo Baroncelli 2020-03-23 22:48 ` Graham Cobb @ 2020-03-23 23:18 ` Zygo Blaxell 1 sibling, 0 replies; 21+ messages in thread From: Zygo Blaxell @ 2020-03-23 23:18 UTC (permalink / raw) To: kreijack; +Cc: linux-btrfs On Mon, Mar 23, 2020 at 09:50:03PM +0100, Goffredo Baroncelli wrote: > On 3/23/20 12:49 AM, Zygo Blaxell wrote: > > On Sun, Mar 22, 2020 at 09:38:30AM +0100, Goffredo Baroncelli wrote: > > > On 3/22/20 9:34 AM, Goffredo Baroncelli wrote: > > > > > > > > > > > To me it seems complicated to > > > [sorry I push the send button too early] > > > > > > To me it seems too complicated (and error prone) to derive the target profile from an analysis of the filesystem. > > > > > > Any thoughts ? > > > > I still don't understand the use case you are trying to support. > > > > There are 3 states for a btrfs filesystem: > > > [...] > > > > 3. A conversion is interrupted prior to completion. Sysadmin is > > expected to proceed immediately back to state #2, possibly after > > taking any necessary recovery actions that triggered entry into > > state #3. It doesn't really matter what the current allocation > > profile is, since it is likely to change before we allocate > > any more block groups. > > > > You seem to be trying to sustain or support a filesystem in state #3 for > > a prolonged period of time. Why would we do that? If your use case is > > providing information or guidance to a user, tell them how to get back > > to state #2 ASAP, so that they can then return to state #1 where they > > should be. > > Believe me: I *don't want* to sustain #3 at all; btrfs is already too > complex. Supporting multiple profile is the worst thing that we can do. > However #3 exists and it could cause unexpected results. I think that on > this we agree. > > > [...] > > > There could be a warning message in dmesg if we enter state #3. 
> > This message would appear after a converting balance is cancelled or > > aborted, and on mount when we scan block groups (which we would still need > > to do even after we added a "target profile" field to the superblock). > > Userspace like 'btrfs fi df' could also put out a warning like "multiple > > allocation profiles detected, but conversion is not in progress. Please > > finish conversion at your earliest convenience to avoid disappointment." > > I don't see the need to do anything more about it. > > It would help that every btrfs command should warn the users about an > "un-wanted" state like this. Patches welcome... > > been explicitly cancelled at sysadmin request. > > > Not only, you can enter in state #3 if you do something like: > > $ sudo btrfs balance start -dconvert=single,usage=50 t/. > > where you convert some chunk but not other. Sure, but now you're intentionally doing weird (or sufficiently advanced) stuff. Given a combination of balance flags like that (convert + other restrictions), we should assume the user knows what they're doing, and stay out of the way. The existing code that inserts 'usage=90' when resuming a balance, though highly questionable, still presumes the user knows what they're doing when a balance has a convert in it, and doesn't modify the usage filter setting in that case. It's fairly normal to want to run something like this when changing RAID profiles on a big array:

    # Make lots of free space quickly
    for x in $(seq 0 100); do
        btrfs balance start -dconvert=single,soft,usage=$x t/.
    done

    # OK now do the full BGs, will be slow
    btrfs balance start -dconvert=single,soft t/.

Should that print 101 warnings as it runs? What if the user is using python-btrfs (e.g. to order the block groups by usage) and not the btrfs-progs tools, or some other UI? Do we write warnings from inside the kernel? Will there be a "--quiet" option that suppresses the warning? 
(I suppose if the answer to the last two questions is "yes" then we just need patches to get it done). > This is the point: we can consider the "failed automation" an unexpected > event, however doing "btrfs bal stop" or the command above cannot be > considered as unexpected event. Balance cancel is always unexpected. "balance cancel" is a sysadmin forcing balance to exit using the error recovery code. If early termination of a conversion was _expected_, the sysadmin would have used 'limit' or 'vrange' or 'usage' or 'devid' or some other filter parameter so that balance does what it was told to do _without being cancelled_. > [...] > > > I'd even consider removing the heuristics that are already there for > > prioritizing profiles. They are just surprising and undocumented > > behavior, and it would be better to document it as "random, BTW you > > should finish your conversion now." > > I agree that we should remove this kind of heuristic. > Doing so I think that, with moderate effort, btrfs can track what is the > wanted profile (i.e. the one at the mkfs time or the one specified in last balance > w/convert [*]) and uses it. To me it seems the natural thing to do. Noting more > nothing less. It already kind of does--the balance convert parameters are stored on disk so it can be resumed after a umount or pause. "Pause" implies resuming later, and saving all the state required to do so. "Cancel" says something different, "forget what you were doing and wait for new instructions," so cancel wipes out the conversion target profile. > We can't prevent a mixed profile filesystem (it has to be allowed the > possibility to stop a long activity like the balance), but we should > prevent the unexpected behavior: if we change the profile and something > goes wrong, the next chunk allocation should be clear. The user don't have > to read the code to understand what will happen. 
> [*] we can argue which would be the expected profile after an interrupted balance: > the former one or the latter one ? If we can argue about it, then there's no right answer, and the status quo is fine (or we need a more complete solution). > BR > G.Baroncelli > > -- > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 21+ messages in thread
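The pause/cancel distinction Zygo draws above (pause keeps the stored convert target so the balance can resume; cancel wipes it) can be sketched as a toy model. 'BalanceCtl' here is a hypothetical stand-in, not the kernel's actual balance control structure:

```python
class BalanceCtl:
    """Toy stand-in for the persisted state of a convert balance."""

    def __init__(self, convert_target):
        self.convert_target = convert_target  # stored on disk by btrfs
        self.running = True

    def pause(self):
        # Pause implies resuming later: all state, including the
        # convert target, is kept.
        self.running = False

    def cancel(self):
        # Cancel means "forget what you were doing and wait for new
        # instructions": the target profile is wiped with the rest.
        self.running = False
        self.convert_target = None
```

After a pause the filesystem still knows what the next chunk's profile should be; after a cancel it does not, which is how a long-lived state #3 arises.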
* Re: Question: how understand the raid profile of a btrfs filesystem 2020-03-20 17:56 Question: how understand the raid profile of a btrfs filesystem Goffredo Baroncelli 2020-03-21 3:29 ` Zygo Blaxell @ 2020-03-24 4:55 ` Anand Jain 2020-03-24 17:59 ` Goffredo Baroncelli 1 sibling, 1 reply; 21+ messages in thread From: Anand Jain @ 2020-03-24 4:55 UTC (permalink / raw) To: kreijack, linux-btrfs On 3/21/20 1:56 AM, Goffredo Baroncelli wrote: > Hi all, > > for a btrfs filesystem, how an user can understand which is the > {data,mmetadata,system} [raid] profile in use ? E.g. the next chunk > which profile will have ? > For simple filesystem it is easy looking at the output of (e.g) "btrfs > fi df" or "btrfs fi us". But what if the filesystem is not simple ? > > btrfs fi us t/. > Overall: > Device size: 40.00GiB > Device allocated: 19.52GiB > Device unallocated: 20.48GiB > Device missing: 0.00B > Used: 16.75GiB > Free (estimated): 12.22GiB (min: 8.27GiB) > Data ratio: 1.90 > Metadata ratio: 2.00 > Global reserve: 9.06MiB (used: 0.00B) > > Data,single: Size:1.00GiB, Used:512.00MiB (50.00%) > /dev/loop0 1.00GiB > > Data,RAID5: Size:3.00GiB, Used:2.48GiB (82.56%) > /dev/loop1 1.00GiB > /dev/loop2 1.00GiB > /dev/loop3 1.00GiB > /dev/loop0 1.00GiB > > Data,RAID6: Size:4.00GiB, Used:3.71GiB (92.75%) > /dev/loop1 2.00GiB > /dev/loop2 2.00GiB > /dev/loop3 2.00GiB > /dev/loop0 2.00GiB > > Data,RAID1C3: Size:2.00GiB, Used:1.88GiB (93.76%) > /dev/loop1 2.00GiB > /dev/loop2 2.00GiB > /dev/loop3 2.00GiB > > Metadata,RAID1: Size:256.00MiB, Used:9.14MiB (3.57%) > /dev/loop2 256.00MiB > /dev/loop3 256.00MiB > > System,RAID1: Size:8.00MiB, Used:16.00KiB (0.20%) > /dev/loop2 8.00MiB > /dev/loop3 8.00MiB > > Unallocated: > /dev/loop1 5.00GiB > /dev/loop2 4.74GiB > /dev/loop3 4.74GiB > /dev/loop0 6.00GiB > > This is an example of a strange but valid filesystem. So the question > is: the next chunk which profile will have ? > Is there any way to understand what will happens ? 
> > I expected that the next chunk will be allocated as the last "convert". > However I discovered that this is not true. > > Looking at the code it seems to me that the logic is the following (from > btrfs_reduce_alloc_profile()) > > if (allowed & BTRFS_BLOCK_GROUP_RAID6) > allowed = BTRFS_BLOCK_GROUP_RAID6; > else if (allowed & BTRFS_BLOCK_GROUP_RAID5) > allowed = BTRFS_BLOCK_GROUP_RAID5; > else if (allowed & BTRFS_BLOCK_GROUP_RAID10) > allowed = BTRFS_BLOCK_GROUP_RAID10; > else if (allowed & BTRFS_BLOCK_GROUP_RAID1) > allowed = BTRFS_BLOCK_GROUP_RAID1; > else if (allowed & BTRFS_BLOCK_GROUP_RAID0) > allowed = BTRFS_BLOCK_GROUP_RAID0; > > flags &= ~BTRFS_BLOCK_GROUP_PROFILE_MASK; > > So in the case above the profile will be RAID6. And in the general if a > RAID6 chunk is a filesystem, it wins ! That's arbitrary and doesn't make sense to me, IMO mkfs should save default profile in the super-block (which can be changed using ioctl) and kernel can create chunks based on the default profile. This approach also fixes chunk size inconsistency between progs and kernel as reported/fixed here https://patchwork.kernel.org/patch/11431405/ Thanks, Anand > But I am not sure.. Moreover I expected to see also reference to DUP > and/or RAID1C[34] ... > > Does someone have any suggestion ? > > BR > G.Baroncelli > ^ permalink raw reply [flat|nested] 21+ messages in thread
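The priority chain in the quoted btrfs_reduce_alloc_profile() snippet can be rendered as a simplified model. The 'single' fallback is an assumption of this sketch (in the kernel the profile bits are simply cleared), and note that DUP and RAID1C3/4 are absent from the chain, matching the observation in the original question:

```python
# Priority order taken from the quoted kernel snippet:
# RAID6 > RAID5 > RAID10 > RAID1 > RAID0.
PRIORITY = ("raid6", "raid5", "raid10", "raid1", "raid0")

def reduce_alloc_profile(present):
    """present: set of profile names that exist in the filesystem.
    Return the profile the chain would pick for the next chunk."""
    for profile in PRIORITY:
        if profile in present:
            return profile
    # Nothing from the chain present: all profile bits cleared,
    # which this sketch represents as 'single'.
    return "single"
```

For the example filesystem in the thread (single + RAID5 + RAID6 + RAID1C3 data chunks), RAID6 wins; a filesystem with only RAID1C3 chunks falls through the chain entirely.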
* Re: Question: how understand the raid profile of a btrfs filesystem 2020-03-24 4:55 ` Anand Jain @ 2020-03-24 17:59 ` Goffredo Baroncelli 2020-03-25 4:09 ` Andrei Borzenkov 0 siblings, 1 reply; 21+ messages in thread From: Goffredo Baroncelli @ 2020-03-24 17:59 UTC (permalink / raw) To: Anand Jain; +Cc: linux-btrfs, Zygo Blaxell On 3/24/20 5:55 AM, Anand Jain wrote: > On 3/21/20 1:56 AM, Goffredo Baroncelli wrote: >> Hi all, [..] >> Looking at the code it seems to me that the logic is the following (from btrfs_reduce_alloc_profile()) >> >> if (allowed & BTRFS_BLOCK_GROUP_RAID6) >> allowed = BTRFS_BLOCK_GROUP_RAID6; >> else if (allowed & BTRFS_BLOCK_GROUP_RAID5) >> allowed = BTRFS_BLOCK_GROUP_RAID5; >> else if (allowed & BTRFS_BLOCK_GROUP_RAID10) >> allowed = BTRFS_BLOCK_GROUP_RAID10; >> else if (allowed & BTRFS_BLOCK_GROUP_RAID1) >> allowed = BTRFS_BLOCK_GROUP_RAID1; >> else if (allowed & BTRFS_BLOCK_GROUP_RAID0) >> allowed = BTRFS_BLOCK_GROUP_RAID0; >> >> flags &= ~BTRFS_BLOCK_GROUP_PROFILE_MASK; >> >> So in the case above the profile will be RAID6. And in the general if a RAID6 chunk is a filesystem, it wins ! > > That's arbitrary and doesn't make sense to me, IMO mkfs should save > default profile in the super-block (which can be changed using ioctl) > and kernel can create chunks based on the default profile. I'm working on this idea (storing the target profile in the super-block). Of course this increases consistency, but it doesn't prevent the possibility that a mixed-profiles filesystem could happen. And in this case it is the user who has to solve the issue. Zygo also suggested adding a mixed-profile warning to btrfs (progs). And I agree with him. I think that we can use the space info ioctl (which doesn't require root privileges). BR G.Baroncelli > This approach also fixes chunk size inconsistency between progs and kernel > as reported/fixed here > https://patchwork.kernel.org/patch/11431405/ > > Thanks, Anand > >> But I am not sure.. 
Moreover I expected to see also reference to DUP and/or RAID1C[34] ... >> >> Does someone have any suggestion ? >> >> BR >> G.Baroncelli >> > -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Question: how understand the raid profile of a btrfs filesystem 2020-03-24 17:59 ` Goffredo Baroncelli @ 2020-03-25 4:09 ` Andrei Borzenkov 2020-03-25 17:14 ` Goffredo Baroncelli 0 siblings, 1 reply; 21+ messages in thread From: Andrei Borzenkov @ 2020-03-25 4:09 UTC (permalink / raw) To: kreijack, Anand Jain; +Cc: linux-btrfs, Zygo Blaxell 24.03.2020 20:59, Goffredo Baroncelli wrote: > On 3/24/20 5:55 AM, Anand Jain wrote: >> On 3/21/20 1:56 AM, Goffredo Baroncelli wrote: >>> Hi all, > [..] >>> Looking at the code it seems to me that the logic is the following >>> (from btrfs_reduce_alloc_profile()) >>> >>> if (allowed & BTRFS_BLOCK_GROUP_RAID6) >>> allowed = BTRFS_BLOCK_GROUP_RAID6; >>> else if (allowed & BTRFS_BLOCK_GROUP_RAID5) >>> allowed = BTRFS_BLOCK_GROUP_RAID5; >>> else if (allowed & BTRFS_BLOCK_GROUP_RAID10) >>> allowed = BTRFS_BLOCK_GROUP_RAID10; >>> else if (allowed & BTRFS_BLOCK_GROUP_RAID1) >>> allowed = BTRFS_BLOCK_GROUP_RAID1; >>> else if (allowed & BTRFS_BLOCK_GROUP_RAID0) >>> allowed = BTRFS_BLOCK_GROUP_RAID0; >>> >>> flags &= ~BTRFS_BLOCK_GROUP_PROFILE_MASK; >>> >>> So in the case above the profile will be RAID6. And in the general if >>> a RAID6 chunk is a filesystem, it wins ! >> >> That's arbitrary and doesn't make sense to me, IMO mkfs should save >> default profile in the super-block (which can be changed using ioctl) >> and kernel can create chunks based on the default profile. > > I'm working on this idea (storing the target profile in super-block). What about a per-subvolume profile? This comes up every now and then, like https://lore.kernel.org/linux-btrfs/cd82d247-5c95-18cd-a290-a911ff69613c@dirtcellar.net/ Maybe it could be a subvolume property? > Of > course this increase the consistency, but > doesn't prevent the possibility that a mixed profiles filesystem could > happen. And in this case is the user that > has to solve the issue. > > Zygo, suggested also to add a mixed profile warning to btrfs (prog). And > I agree with him. 
I think that we can use > the space info ioctl (which doesn't require root privileges). > > BR > G.Baroncelli > >> This >> approach also fixes chunk size inconsistency between progs and kernel >> as reported/fixed here >> https://patchwork.kernel.org/patch/11431405/ >> >> Thanks, Anand >> >>> But I am not sure.. Moreover I expected to see also reference to DUP >>> and/or RAID1C[34] ... >>> >>> Does someone have any suggestion ? >>> >>> BR >>> G.Baroncelli >>> >> > > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Question: how understand the raid profile of a btrfs filesystem
  2020-03-25  4:09     ` Andrei Borzenkov
@ 2020-03-25 17:14       ` Goffredo Baroncelli
  2020-03-26  3:10         ` Zygo Blaxell
  0 siblings, 1 reply; 21+ messages in thread

From: Goffredo Baroncelli @ 2020-03-25 17:14 UTC (permalink / raw)
To: Andrei Borzenkov; +Cc: Anand Jain, linux-btrfs, Zygo Blaxell

On 3/25/20 5:09 AM, Andrei Borzenkov wrote:
> 24.03.2020 20:59, Goffredo Baroncelli wrote:
>> On 3/24/20 5:55 AM, Anand Jain wrote:
>>> On 3/21/20 1:56 AM, Goffredo Baroncelli wrote:
>>>> Hi all,
>> [..]
>>>> Looking at the code it seems to me that the logic is the following
>>>> (from btrfs_reduce_alloc_profile())
>>>>
>>>>     if (allowed & BTRFS_BLOCK_GROUP_RAID6)
>>>>         allowed = BTRFS_BLOCK_GROUP_RAID6;
>>>>     else if (allowed & BTRFS_BLOCK_GROUP_RAID5)
>>>>         allowed = BTRFS_BLOCK_GROUP_RAID5;
>>>>     else if (allowed & BTRFS_BLOCK_GROUP_RAID10)
>>>>         allowed = BTRFS_BLOCK_GROUP_RAID10;
>>>>     else if (allowed & BTRFS_BLOCK_GROUP_RAID1)
>>>>         allowed = BTRFS_BLOCK_GROUP_RAID1;
>>>>     else if (allowed & BTRFS_BLOCK_GROUP_RAID0)
>>>>         allowed = BTRFS_BLOCK_GROUP_RAID0;
>>>>
>>>>     flags &= ~BTRFS_BLOCK_GROUP_PROFILE_MASK;
>>>>
>>>> So in the case above the profile will be RAID6. And in general, if
>>>> a RAID6 chunk is in a filesystem, it wins !
>>>
>>> That's arbitrary and doesn't make sense to me, IMO mkfs should save
>>> default profile in the super-block (which can be changed using ioctl)
>>> and kernel can create chunks based on the default profile.
>>
>> I'm working on this idea (storing the target profile in super-block).
>
> What about a per-subvolume profile? This comes up every now and then, like
>
> https://lore.kernel.org/linux-btrfs/cd82d247-5c95-18cd-a290-a911ff69613c@dirtcellar.net/
>
> Maybe it could be a subvolume property?

The idea is nice. However I fear the mess that it could cause.
Even now, with a simpler system where there is a "per filesystem"
profile, there are a lot of corner cases when something goes wrong (an
interrupted balance, or a failed disk). In case of multiple profiles on
a per-subvolume basis there is no simple answer in situations like:

- when I make a snapshot of a sub-volume, and then I change the profile
  of the original one, which is the profile of the files contained in
  the snapshot and in the original subvolume ?

Frankly speaking, if you want different profiles you need different
filesystems...

BR
G.Baroncelli

>
>> Of course this increases the consistency, but doesn't prevent the
>> possibility that a mixed-profiles filesystem could happen. And in this
>> case it is the user that has to solve the issue.
>>
>> Zygo suggested also to add a mixed-profile warning to btrfs (prog). And
>> I agree with him. I think that we can use the space info ioctl (which
>> doesn't require root privileges).
>>
>> BR
>> G.Baroncelli
>>
>>> This approach also fixes chunk size inconsistency between progs and
>>> kernel as reported/fixed here
>>> https://patchwork.kernel.org/patch/11431405/
>>>
>>> Thanks, Anand
>>>
>>>> But I am not sure.. Moreover I expected to see also reference to DUP
>>>> and/or RAID1C[34] ...
>>>>
>>>> Does someone have any suggestion ?
>>>>
>>>> BR
>>>> G.Baroncelli
>>>>
>>>
>>
>>
>

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Question: how understand the raid profile of a btrfs filesystem
  2020-03-25 17:14       ` Goffredo Baroncelli
@ 2020-03-26  3:10         ` Zygo Blaxell
  0 siblings, 0 replies; 21+ messages in thread

From: Zygo Blaxell @ 2020-03-26  3:10 UTC (permalink / raw)
To: Goffredo Baroncelli; +Cc: Andrei Borzenkov, Anand Jain, linux-btrfs

On Wed, Mar 25, 2020 at 06:14:05PM +0100, Goffredo Baroncelli wrote:
> On 3/25/20 5:09 AM, Andrei Borzenkov wrote:
> > 24.03.2020 20:59, Goffredo Baroncelli wrote:
> > > On 3/24/20 5:55 AM, Anand Jain wrote:
> > > > On 3/21/20 1:56 AM, Goffredo Baroncelli wrote:
> > > > > Hi all,
> > > [..]
> > > > > Looking at the code it seems to me that the logic is the following
> > > > > (from btrfs_reduce_alloc_profile())
> > > > >
> > > > >     if (allowed & BTRFS_BLOCK_GROUP_RAID6)
> > > > >         allowed = BTRFS_BLOCK_GROUP_RAID6;
> > > > >     else if (allowed & BTRFS_BLOCK_GROUP_RAID5)
> > > > >         allowed = BTRFS_BLOCK_GROUP_RAID5;
> > > > >     else if (allowed & BTRFS_BLOCK_GROUP_RAID10)
> > > > >         allowed = BTRFS_BLOCK_GROUP_RAID10;
> > > > >     else if (allowed & BTRFS_BLOCK_GROUP_RAID1)
> > > > >         allowed = BTRFS_BLOCK_GROUP_RAID1;
> > > > >     else if (allowed & BTRFS_BLOCK_GROUP_RAID0)
> > > > >         allowed = BTRFS_BLOCK_GROUP_RAID0;
> > > > >
> > > > >     flags &= ~BTRFS_BLOCK_GROUP_PROFILE_MASK;
> > > > >
> > > > > So in the case above the profile will be RAID6. And in general, if
> > > > > a RAID6 chunk is in a filesystem, it wins !
> > > >
> > > > That's arbitrary and doesn't make sense to me, IMO mkfs should save
> > > > default profile in the super-block (which can be changed using ioctl)
> > > > and kernel can create chunks based on the default profile.
> > >
> > > I'm working on this idea (storing the target profile in super-block).
> >
> > What about a per-subvolume profile? This comes up every now and then, like
> >
> > https://lore.kernel.org/linux-btrfs/cd82d247-5c95-18cd-a290-a911ff69613c@dirtcellar.net/
> >
> > Maybe it could be a subvolume property?

...or inode.
> The idea is nice. However I fear the mess that it could cause. Even now,
> with a simpler system where there is a "per filesystem" profile, there
> are a lot of corner cases when something goes wrong (an interrupted
> balance, or a failed disk).

It can't be worse than qgroups.

(only half kidding)

Thinking aloud, you could even set up coarse-but-fast quotas that
way--limit the number of data block groups allocated to a subvol.
No sharing of block groups between subvols though, unless one subvol
is a snapshot of the other.  Also, limiting usage by block group
includes free space within the block group, so it's inaccurate
(i.e. coarse, effectively allocating space with multi-GB granularity
and large error bars).  If you have 20 users, and you want to give
them each about 400GB but don't really care if they get 390GB or
410GB, then maybe it's not so bad.

> In case of multiple profiles on a per-subvolume basis there is no
> simple answer in situations like:
> - when I make a snapshot of a sub-volume, and then I change the profile
>   of the original one, which is the profile of the files contained in
>   the snapshot and in the original subvolume ?

It shouldn't be different from compress: you look up either the inode
or the root, and it tells you what kind of extent you can allocate next.
Any existing data stays where it is until it is deleted (or overwritten
by CoW).

If you start cloning between subvols then things get a little interesting
(especially if you balance those afterwards) but not unsolvable if "when
two or more answers are possible, it's undefined which one btrfs picks"
is allowed in the solution.

You'd have the same problem with no-longer-allocatable block groups that
don't match the currently selected profile as you do now with mixed
block group profiles.  As the unallocatable block groups empty out, the
storage density of the used space within them goes up, space appears to
disappear, etc.
This is state #3, after all, and it would take some work to make btrfs
as happy in this state as it is in state #1.

> Frankly speaking, if you want different profiles you need different
> filesystems...

Well, there is that.  Keeping the status quo (or small modifications
thereof) is far easier to document, and it's not like we don't have a
huge list of RAID-related things to fix already.

> BR
> G.Baroncelli
>
> >
> > > Of course this increases the consistency, but doesn't prevent the
> > > possibility that a mixed-profiles filesystem could happen. And in
> > > this case it is the user that has to solve the issue.
> > >
> > > Zygo suggested also to add a mixed-profile warning to btrfs (prog). And
> > > I agree with him. I think that we can use the space info ioctl (which
> > > doesn't require root privileges).
> > >
> > > BR
> > > G.Baroncelli
> > >
> > > > This approach also fixes chunk size inconsistency between progs and
> > > > kernel as reported/fixed here
> > > > https://patchwork.kernel.org/patch/11431405/
> > > >
> > > > Thanks, Anand
> > > >
> > > > > But I am not sure.. Moreover I expected to see also reference to DUP
> > > > > and/or RAID1C[34] ...
> > > > >
> > > > > Does someone have any suggestion ?
> > > > >
> > > > > BR
> > > > > G.Baroncelli
> > >
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Question: how understand the raid profile of a btrfs filesystem
@ 2020-03-20 17:58 Goffredo Baroncelli
  0 siblings, 0 replies; 21+ messages in thread

From: Goffredo Baroncelli @ 2020-03-20 17:58 UTC (permalink / raw)
To: linux-btrfs

Hi all,

for a btrfs filesystem, how can a user understand which
{data,metadata,system} [raid] profile is in use ? E.g. which profile
will the next chunk have ?

For a simple filesystem it is easy, looking at the output of (e.g.)
"btrfs fi df" or "btrfs fi us". But what if the filesystem is not
simple ?

btrfs fi us t/.
Overall:
    Device size:                  40.00GiB
    Device allocated:             19.52GiB
    Device unallocated:           20.48GiB
    Device missing:                  0.00B
    Used:                         16.75GiB
    Free (estimated):             12.22GiB      (min: 8.27GiB)
    Data ratio:                       1.90
    Metadata ratio:                   2.00
    Global reserve:                9.06MiB      (used: 0.00B)

Data,single: Size:1.00GiB, Used:512.00MiB (50.00%)
   /dev/loop0      1.00GiB

Data,RAID5: Size:3.00GiB, Used:2.48GiB (82.56%)
   /dev/loop1      1.00GiB
   /dev/loop2      1.00GiB
   /dev/loop3      1.00GiB
   /dev/loop0      1.00GiB

Data,RAID6: Size:4.00GiB, Used:3.71GiB (92.75%)
   /dev/loop1      2.00GiB
   /dev/loop2      2.00GiB
   /dev/loop3      2.00GiB
   /dev/loop0      2.00GiB

Data,RAID1C3: Size:2.00GiB, Used:1.88GiB (93.76%)
   /dev/loop1      2.00GiB
   /dev/loop2      2.00GiB
   /dev/loop3      2.00GiB

Metadata,RAID1: Size:256.00MiB, Used:9.14MiB (3.57%)
   /dev/loop2    256.00MiB
   /dev/loop3    256.00MiB

System,RAID1: Size:8.00MiB, Used:16.00KiB (0.20%)
   /dev/loop2      8.00MiB
   /dev/loop3      8.00MiB

Unallocated:
   /dev/loop1      5.00GiB
   /dev/loop2      4.74GiB
   /dev/loop3      4.74GiB
   /dev/loop0      6.00GiB

This is an example of a strange but valid filesystem. So the question
is: which profile will the next chunk have ? Is there any way to
understand what will happen ?

I expected that the next chunk would be allocated with the profile of
the last "convert". However I discovered that this is not true.
Looking at the code it seems to me that the logic is the following
(from btrfs_reduce_alloc_profile())

    if (allowed & BTRFS_BLOCK_GROUP_RAID6)
        allowed = BTRFS_BLOCK_GROUP_RAID6;
    else if (allowed & BTRFS_BLOCK_GROUP_RAID5)
        allowed = BTRFS_BLOCK_GROUP_RAID5;
    else if (allowed & BTRFS_BLOCK_GROUP_RAID10)
        allowed = BTRFS_BLOCK_GROUP_RAID10;
    else if (allowed & BTRFS_BLOCK_GROUP_RAID1)
        allowed = BTRFS_BLOCK_GROUP_RAID1;
    else if (allowed & BTRFS_BLOCK_GROUP_RAID0)
        allowed = BTRFS_BLOCK_GROUP_RAID0;

    flags &= ~BTRFS_BLOCK_GROUP_PROFILE_MASK;

So in the case above the profile will be RAID6. And in general, if a
RAID6 chunk is in a filesystem, it wins ! But I am not sure.. Moreover
I expected to see also a reference to DUP and/or RAID1C[34] ...

Does someone have any suggestion ?

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 21+ messages in thread
end of thread, other threads:[~2020-03-26  3:11 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-20 17:56 Question: how understand the raid profile of a btrfs filesystem Goffredo Baroncelli
2020-03-21  3:29 ` Zygo Blaxell
2020-03-21  5:40   ` Andrei Borzenkov
2020-03-21  7:14     ` Zygo Blaxell
2020-03-21  9:55   ` Goffredo Baroncelli
2020-03-21 23:26     ` Zygo Blaxell
2020-03-22  8:34       ` Goffredo Baroncelli
2020-03-22  8:38         ` Goffredo Baroncelli
2020-03-22 23:49           ` Zygo Blaxell
2020-03-23 20:50             ` Goffredo Baroncelli
2020-03-23 22:48               ` Graham Cobb
2020-03-25  4:09                 ` Zygo Blaxell
2020-03-25  4:30                   ` Paul Jones
2020-03-26  2:51                     ` Zygo Blaxell
2020-03-23 23:18               ` Zygo Blaxell
2020-03-24  4:55 ` Anand Jain
2020-03-24 17:59   ` Goffredo Baroncelli
2020-03-25  4:09     ` Andrei Borzenkov
2020-03-25 17:14       ` Goffredo Baroncelli
2020-03-26  3:10         ` Zygo Blaxell
2020-03-20 17:58 Goffredo Baroncelli