* Question: how understand the raid profile of a btrfs filesystem
@ 2020-03-20 17:56 Goffredo Baroncelli
  2020-03-21  3:29 ` Zygo Blaxell
  2020-03-24  4:55 ` Anand Jain
  0 siblings, 2 replies; 21+ messages in thread
From: Goffredo Baroncelli @ 2020-03-20 17:56 UTC (permalink / raw)
To: linux-btrfs

Hi all,

for a btrfs filesystem, how can a user find out which {data,metadata,system} [raid] profile is in use? E.g., which profile will the next chunk have?

For a simple filesystem it is easy, looking at the output of (e.g.) "btrfs fi df" or "btrfs fi us". But what if the filesystem is not simple?

    btrfs fi us t/.
    Overall:
        Device size:          40.00GiB
        Device allocated:     19.52GiB
        Device unallocated:   20.48GiB
        Device missing:          0.00B
        Used:                 16.75GiB
        Free (estimated):     12.22GiB  (min: 8.27GiB)
        Data ratio:               1.90
        Metadata ratio:           2.00
        Global reserve:        9.06MiB  (used: 0.00B)

    Data,single: Size:1.00GiB, Used:512.00MiB (50.00%)
       /dev/loop0      1.00GiB

    Data,RAID5: Size:3.00GiB, Used:2.48GiB (82.56%)
       /dev/loop1      1.00GiB
       /dev/loop2      1.00GiB
       /dev/loop3      1.00GiB
       /dev/loop0      1.00GiB

    Data,RAID6: Size:4.00GiB, Used:3.71GiB (92.75%)
       /dev/loop1      2.00GiB
       /dev/loop2      2.00GiB
       /dev/loop3      2.00GiB
       /dev/loop0      2.00GiB

    Data,RAID1C3: Size:2.00GiB, Used:1.88GiB (93.76%)
       /dev/loop1      2.00GiB
       /dev/loop2      2.00GiB
       /dev/loop3      2.00GiB

    Metadata,RAID1: Size:256.00MiB, Used:9.14MiB (3.57%)
       /dev/loop2    256.00MiB
       /dev/loop3    256.00MiB

    System,RAID1: Size:8.00MiB, Used:16.00KiB (0.20%)
       /dev/loop2      8.00MiB
       /dev/loop3      8.00MiB

    Unallocated:
       /dev/loop1      5.00GiB
       /dev/loop2      4.74GiB
       /dev/loop3      4.74GiB
       /dev/loop0      6.00GiB

This is an example of a strange but valid filesystem. So the question is: which profile will the next chunk have? Is there any way to understand what will happen?

I expected that the next chunk would be allocated with the profile of the last "convert". However, I discovered that this is not true.
Looking at the code, it seems to me that the logic is the following (from btrfs_reduce_alloc_profile()):

    if (allowed & BTRFS_BLOCK_GROUP_RAID6)
        allowed = BTRFS_BLOCK_GROUP_RAID6;
    else if (allowed & BTRFS_BLOCK_GROUP_RAID5)
        allowed = BTRFS_BLOCK_GROUP_RAID5;
    else if (allowed & BTRFS_BLOCK_GROUP_RAID10)
        allowed = BTRFS_BLOCK_GROUP_RAID10;
    else if (allowed & BTRFS_BLOCK_GROUP_RAID1)
        allowed = BTRFS_BLOCK_GROUP_RAID1;
    else if (allowed & BTRFS_BLOCK_GROUP_RAID0)
        allowed = BTRFS_BLOCK_GROUP_RAID0;

    flags &= ~BTRFS_BLOCK_GROUP_PROFILE_MASK;

So in the case above the profile will be RAID6. And in general, if a RAID6 chunk exists in a filesystem, it wins! But I am not sure...

Moreover, I expected to also see references to DUP and/or RAID1C[34]...

Does someone have any suggestion?

BR
G.Baroncelli

^ permalink raw reply	[flat|nested] 21+ messages in thread
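[Editorial note: the cascade quoted above can be sketched outside the kernel. This is a minimal illustration only; the bit values below are arbitrary placeholders, not the kernel's real BTRFS_BLOCK_GROUP_* constants.]

```python
# Sketch of the btrfs_reduce_alloc_profile() cascade: from the set of
# profile bits present, the highest-priority profile wins.
# The bit values are placeholders, not the kernel's actual constants.
RAID0, RAID1, RAID10, RAID5, RAID6 = (1 << i for i in range(5))

def reduce_alloc_profile(allowed):
    """Collapse a mask of allowed profile bits to the single winner."""
    for profile in (RAID6, RAID5, RAID10, RAID1, RAID0):
        if allowed & profile:
            return profile
    return 0  # no profile bit set: 'single'

# A filesystem with single + RAID5 + RAID6 data chunks (as in the
# example above) reduces to RAID6 for the next allocation.
print(reduce_alloc_profile(RAID1 | RAID5 | RAID6) == RAID6)  # True
```

Note how DUP and RAID1C[34] are indeed absent from the cascade, which is exactly the observation made in the mail above.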
* Re: Question: how understand the raid profile of a btrfs filesystem
  2020-03-20 17:56 Question: how understand the raid profile of a btrfs filesystem Goffredo Baroncelli
@ 2020-03-21  3:29 ` Zygo Blaxell
  2020-03-21  5:40   ` Andrei Borzenkov
  2020-03-21  9:55   ` Goffredo Baroncelli
  2020-03-24  4:55 ` Anand Jain
  1 sibling, 2 replies; 21+ messages in thread
From: Zygo Blaxell @ 2020-03-21  3:29 UTC (permalink / raw)
To: kreijack; +Cc: linux-btrfs

On Fri, Mar 20, 2020 at 06:56:38PM +0100, Goffredo Baroncelli wrote:
> Hi all,
>
> for a btrfs filesystem, how can a user find out which {data,metadata,system} [raid] profile is in use? E.g., which profile will the next chunk have?

It's the profile used by the highest-numbered block group for the allocation type (one for data, one for metadata/system). There are two profiles to consider, one for data and one for metadata. 'btrfs fi df', 'btrfs fi us', or 'btrfs dev usage' will all indicate which profiles these are.

It is valid for the two profiles to be different, and different profile combinations have different use cases. In most cases the only one that matters is the data profile, as that's the one POSIX 'df' reports and data block writes consume, and the one that typically occupies more than 99% of the total space.

Administrators and system designers have to be more aware of metadata usage when filesystems become extremely full or extremely small (less than 16 GB). Users (without root or CAP_SYS_ADMIN) generally can't do anything about metadata usage, except as a tiny side-effect of their data usage.

> For a simple filesystem it is easy, looking at the output of (e.g.) "btrfs fi df" or "btrfs fi us". But what if the filesystem is not simple?

"Not simple" is not a normal operating mode for btrfs.
The filesystem allows multiple profiles to be active so that it can be converted to a new profile while old data is still accessible; however, the conversion is expected to end at some point, and all block groups will use the same profile when that happens.

The allocator will only use one RAID profile, and will ignore free space in block groups of other profiles, while 'df' reports the total space on the filesystem in each profile, and metadata allocation does something else. 'btrfs fi us' reports a mess and can't give any accurate free space estimate. Disk space will apparently be free while writes fail with ENOSPC. This is not a problem if a conversion is running to eliminate all the "competing" profiles, but if the conversion stops, you can expect some problems with space until it resumes again.

> btrfs fi us t/.
> Overall:
>     Device size:          40.00GiB
>     Device allocated:     19.52GiB
>     Device unallocated:   20.48GiB
>     Device missing:          0.00B
>     Used:                 16.75GiB
>     Free (estimated):     12.22GiB  (min: 8.27GiB)
>     Data ratio:               1.90
>     Metadata ratio:           2.00
>     Global reserve:        9.06MiB  (used: 0.00B)
>
> Data,single: Size:1.00GiB, Used:512.00MiB (50.00%)
>    /dev/loop0      1.00GiB
>
> Data,RAID5: Size:3.00GiB, Used:2.48GiB (82.56%)
>    /dev/loop1      1.00GiB
>    /dev/loop2      1.00GiB
>    /dev/loop3      1.00GiB
>    /dev/loop0      1.00GiB
>
> Data,RAID6: Size:4.00GiB, Used:3.71GiB (92.75%)
>    /dev/loop1      2.00GiB
>    /dev/loop2      2.00GiB
>    /dev/loop3      2.00GiB
>    /dev/loop0      2.00GiB
>
> Data,RAID1C3: Size:2.00GiB, Used:1.88GiB (93.76%)
>    /dev/loop1      2.00GiB
>    /dev/loop2      2.00GiB
>    /dev/loop3      2.00GiB
>
> Metadata,RAID1: Size:256.00MiB, Used:9.14MiB (3.57%)
>    /dev/loop2    256.00MiB
>    /dev/loop3    256.00MiB
>
> System,RAID1: Size:8.00MiB, Used:16.00KiB (0.20%)
>    /dev/loop2      8.00MiB
>    /dev/loop3      8.00MiB
>
> Unallocated:
>    /dev/loop1      5.00GiB
>    /dev/loop2      4.74GiB
>    /dev/loop3      4.74GiB
>    /dev/loop0      6.00GiB
>
> This is an example of a strange but valid filesystem.
Valid, but the filesystem is in a state designed for temporary use during conversions, and you will want to exit that state as soon as possible.

> So the question is: which profile will the next chunk have?
> Is there any way to understand what will happen?
>
> I expected that the next chunk would be allocated with the profile of the last "convert". However I discovered that this is not true.

That's correct in most cases: a convert will create a new block group, which will have the highest bytenr in the filesystem, and its profile will be used to allocate new data, thus converting the filesystem to the new profile. However, if you pause the convert and delete all the files in the new block group, it's possible that the new block group gets deleted too, and then the filesystem reverts to the previous RAID profile. Again, not a problem if you run the convert until it completely removes all old block groups!
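[Editorial note: the "revert" behavior described here can be sketched as a toy model. This is a simplification of the claim in this mail only; it ignores the conversion-in-progress case refined later in the thread, and is not kernel code.]

```python
# Toy model: the next allocation copies the profile of the block group
# with the highest bytenr; deleting the newest block groups can make
# allocation "revert" to an older profile. Not kernel code.
def next_profile(block_groups):
    """block_groups maps bytenr -> profile name; highest bytenr wins."""
    return block_groups[max(block_groups)]

bgs = {0: "raid5", 1 << 30: "raid5", 2 << 30: "single"}  # after a paused convert
assert next_profile(bgs) == "single"

del bgs[2 << 30]  # the emptied 'single' block group gets deleted...
assert next_profile(bgs) == "raid5"  # ...and allocation reverts to raid5
```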
> Moreover, I expected to also see references to DUP and/or RAID1C[34]...

If you get through that 'if' statement without hitting any of the branches, then you're equal to raid0 (0 redundant disks), but raid0 is a special case because it requires 2 disks for allocation. 'dup' (0 redundant disks) and 'single' (which is the absence of any profile bits) also have 0 redundant disks and require only 1 disk for allocation, so there is no need to treat them differently.

raid1c[34] probably should be there. Patches welcome.

> Does someone have any suggestion?
>
> BR
> G.Baroncelli
* Re: Question: how understand the raid profile of a btrfs filesystem
  2020-03-21  3:29 ` Zygo Blaxell
@ 2020-03-21  5:40   ` Andrei Borzenkov
  2020-03-21  7:14     ` Zygo Blaxell
  2020-03-21  9:55   ` Goffredo Baroncelli
  1 sibling, 1 reply; 21+ messages in thread
From: Andrei Borzenkov @ 2020-03-21  5:40 UTC (permalink / raw)
To: Zygo Blaxell, kreijack; +Cc: linux-btrfs

On 21.03.2020 06:29, Zygo Blaxell wrote:
> On Fri, Mar 20, 2020 at 06:56:38PM +0100, Goffredo Baroncelli wrote:
>> Hi all,
>>
>> for a btrfs filesystem, how can a user find out which {data,metadata,system} [raid] profile is in use? E.g., which profile will the next chunk have?
>
> It's the profile used by the highest-numbered block group for the
> allocation type (one for data, one for metadata/system).

Is the "highest-numbered" block group always the last one created? Can block group numbers wrap around?

Recently someone reported that block groups with the old profile remained after a conversion, and this probably explains it: the conversion races with new allocation.

>> So the question is: which profile will the next chunk have?
>> Is there any way to understand what will happen?

Well, from that explanation it is not possible using standard tools; one needs to crawl btrfs internals to find out the "last" block group.
* Re: Question: how understand the raid profile of a btrfs filesystem
  2020-03-21  5:40 ` Andrei Borzenkov
@ 2020-03-21  7:14   ` Zygo Blaxell
  0 siblings, 0 replies; 21+ messages in thread
From: Zygo Blaxell @ 2020-03-21  7:14 UTC (permalink / raw)
To: Andrei Borzenkov; +Cc: kreijack, linux-btrfs

On Sat, Mar 21, 2020 at 08:40:50AM +0300, Andrei Borzenkov wrote:
> On 21.03.2020 06:29, Zygo Blaxell wrote:
> > On Fri, Mar 20, 2020 at 06:56:38PM +0100, Goffredo Baroncelli wrote:
> >> Hi all,
> >>
> >> for a btrfs filesystem, how can a user find out which {data,metadata,system} [raid] profile is in use? E.g., which profile will the next chunk have?
> >
> > It's the profile used by the highest-numbered block group for the
> > allocation type (one for data, one for metadata/system).
>
> Is the "highest-numbered" block group always the last one created?

It's not required by the filesystem format, but it is the current behavior of the implementation.

> Can block group numbers wrap around?

In theory, yes, but they are 64 bits long and correspond to bytes in the filesystem's address space. If you loop balancing a filesystem with a single 4K data block, and you can do it at 1000 block groups per second, you'll wrap around in a little over six months. Typical use cases (and even extreme ones) will take centuries to wrap around, even if you are converting all the time.

> Recently someone reported that block groups with the old profile remained
> after a conversion, and this probably explains it: the conversion races
> with new allocation.

Conversion *is* new allocation; no race is possible, because they are the same thing. While a conversion is running, the conversion itself forces the raid profile of newly created block groups, so there is no race. After conversion is completed, there is special-case code to prevent the last empty block group in the filesystem from being deleted; otherwise, btrfs would lose information about the selected raid profile.
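[Editorial note: the six-month figure checks out as back-of-the-envelope arithmetic if each new block group advances the 64-bit logical address space by 1 GiB, the usual data block-group size. The 1 GiB step is an assumption here; the exact chunk size depends on the filesystem.]

```python
# Back-of-the-envelope check of the wraparound estimate above.
# Assumes each balance advances the logical address by 1 GiB (typical
# data block group size) at 1000 block groups created per second.
ADDRESS_SPACE = 2 ** 64  # block group bytenrs are 64-bit byte offsets
BG_SIZE = 1 << 30        # 1 GiB per block group (assumption)
RATE = 1000              # block groups created per second

seconds = ADDRESS_SPACE // BG_SIZE // RATE
months = seconds / (30 * 24 * 3600)
print(round(months, 1))  # prints 6.6 -- "a little over six months"
```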
When a conversion is paused or cancelled, new allocations normally continue using the conversion target profile; however, if all block groups of the new profile are deleted (i.e. all the data contained in the new block groups is removed), then it is possible to revert back to allocating using an older profile.

E.g. if you want to combine a balance convert with a device remove, you have to let the convert run long enough to ensure several block groups of the new raid profile exist on drives other than the drive being removed. The device remove will delete all block groups on the removed device, in reverse device physical offset order, which is often (but not necessarily) reverse block group order. This can lead to device remove switching back to the old RAID profile. This example is not any kind of race; the result can be produced deterministically, and the conversion must be paused first.

A conversion can be forcibly stopped by various events: crashes, unmounting the filesystem, hitting an unrecoverable read or write error, or running out of space. These events will leave block groups with old profiles on the disk. Generally, if an external event forces conversion to stop, then it will need to be manually restarted. If there are uncorrectable read errors on the filesystem, then the affected data blocks must be removed from the filesystem before conversion can be completed. The same goes for free space: you must have enough to complete the conversion.

Old versions of mkfs.btrfs had bugs which would leave empty block groups with different profiles on the filesystem. When in doubt, or if you have an older-vintage btrfs filesystem, run a converting balance with the desired raid profile and the 'soft' filter to be sure only one profile is present: it will be a no-op if conversion is complete; otherwise, it will finish the conversion.

> >> So the question is: which profile will the next chunk have?
> >> Is there any way to understand what will happen?
> Well, from that explanation it is not possible using standard tools -
> one needs to crawl btrfs internals to find out the "last" block group.

This is required only during the conversion process. In normal cases users can assume the only profile present is the one that will be used.

The python-btrfs package contains an example of listing block groups. The last entry in the list will have the current allocation profile.

An unprivileged user can monitor 'btrfs fi df' output over time. Used space will increase or decrease in the current profile, and only decrease in the other profiles.
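[Editorial note: the unprivileged approach described here can be sketched as a small parser. This is an illustration only; it assumes the 'btrfs fi df' output format shown in the sample strings, the parsing is deliberately simplistic, and it only works if both snapshots use the same units.]

```python
import re

# Given two snapshots of 'btrfs fi df' output taken some time apart,
# the profile whose used figure grew is the one receiving new writes.
LINE = re.compile(r"(\w+), (\w+): total=\S+, used=(\S+)")

def used_by_profile(fi_df_output):
    """Map 'Type,PROFILE' -> used figure (unit suffix stripped)."""
    usage = {}
    for kind, profile, used in LINE.findall(fi_df_output):
        usage[f"{kind},{profile}"] = float(used.rstrip("GMKiB"))
    return usage

before = "Data, RAID5: total=4.00GiB, used=1.85GiB\nData, single: total=1.00GiB, used=0.25GiB\n"
after = "Data, RAID5: total=4.00GiB, used=2.10GiB\nData, single: total=1.00GiB, used=0.25GiB\n"

grew = [k for k in used_by_profile(after)
        if used_by_profile(after)[k] > used_by_profile(before)[k]]
print(grew)  # ['Data,RAID5'] -> raid5 is the active data profile here
```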
* Re: Question: how understand the raid profile of a btrfs filesystem
  2020-03-21  3:29 ` Zygo Blaxell
  2020-03-21  5:40   ` Andrei Borzenkov
@ 2020-03-21  9:55   ` Goffredo Baroncelli
  2020-03-21 23:26     ` Zygo Blaxell
  1 sibling, 1 reply; 21+ messages in thread
From: Goffredo Baroncelli @ 2020-03-21  9:55 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: linux-btrfs

On 3/21/20 4:29 AM, Zygo Blaxell wrote:
> On Fri, Mar 20, 2020 at 06:56:38PM +0100, Goffredo Baroncelli wrote:
>> Hi all,
>>
>> for a btrfs filesystem, how can a user find out which {data,metadata,system} [raid] profile is in use? E.g., which profile will the next chunk have?
>
> It's the profile used by the highest-numbered block group for the
> allocation type (one for data, one for metadata/system). There
> are two profiles to consider, one for data and one for metadata.
> 'btrfs fi df', 'btrfs fi us', or 'btrfs dev usage' will all indicate
> which profiles these are.

What do you mean by "highest-numbered block group": the value in the "offset" field? If so, it doesn't make sense, because it could be relocated easily.

Anyway, what you are describing is not what I saw. In the test below I created a raid5 filesystem, filled one chunk at 100% and a second chunk for a few MB. Then I converted the most empty chunk to single. Then I filled the last chunk (the single one) and forced the creation of a new chunk. What I saw is that the new chunk is in raid5 mode.

    $ sudo mkfs.btrfs -draid5 /dev/loop[012]
    $ dd if=/dev/zero of=t/file-2.128gb_5 bs=1M count=$((2024+128))   # fill two raid5 chunks
    $ sudo btrfs fi du t/.    # see what the situation is
    [...]
    Data,RAID5: Size:4.00GiB, Used:2.10GiB (52.57%)
       /dev/loop0      2.00GiB
       /dev/loop1      2.00GiB
       /dev/loop2      2.00GiB
    [...]
    $ sudo btrfs balance start -dconvert=single,usage=50 t/.   # convert the latest chunk to single
    $ sudo btrfs fi us t/.    # see what the situation is
    [...]
    Data,single: Size:1.00GiB, Used:259.00MiB (25.29%)
       /dev/loop0      1.00GiB

    Data,RAID5: Size:2.00GiB, Used:1.85GiB (92.47%)
       /dev/loop0      1.00GiB
       /dev/loop1      1.00GiB
       /dev/loop2      1.00GiB
    [...]

    # fill the latest chunk and create a new one
    $ dd if=/dev/zero of=t/file-1.128gb_6 bs=1M count=$((1024+128))

    $ sudo btrfs fi us t/.    # see what the situation is
    [...]
    Data,single: Size:1.00GiB, Used:259.00MiB (25.29%)
       /dev/loop0      1.00GiB

    Data,RAID5: Size:4.00GiB, Used:1.85GiB (46.24%)
       /dev/loop0      2.00GiB
       /dev/loop1      2.00GiB
       /dev/loop2      2.00GiB
    [...]

Expected result: the "single" chunk should grow from 1GB to 2GB. What is observed is that raid5 (the oldest chunk) grew from 2GB to 4GB.

[...]

>> Looking at the code, it seems to me that the logic is the following (from btrfs_reduce_alloc_profile()):
>>
>>     if (allowed & BTRFS_BLOCK_GROUP_RAID6)
>>         allowed = BTRFS_BLOCK_GROUP_RAID6;
>>     else if (allowed & BTRFS_BLOCK_GROUP_RAID5)
>>         allowed = BTRFS_BLOCK_GROUP_RAID5;
>>     else if (allowed & BTRFS_BLOCK_GROUP_RAID10)
>>         allowed = BTRFS_BLOCK_GROUP_RAID10;
>>     else if (allowed & BTRFS_BLOCK_GROUP_RAID1)
>>         allowed = BTRFS_BLOCK_GROUP_RAID1;
>>     else if (allowed & BTRFS_BLOCK_GROUP_RAID0)
>>         allowed = BTRFS_BLOCK_GROUP_RAID0;
>>
>>     flags &= ~BTRFS_BLOCK_GROUP_PROFILE_MASK;
>>
>> So in the case above the profile will be RAID6. And in general, if a RAID6 chunk exists in a filesystem, it wins!
>
> This code is used to determine whether a conversion reduces the level of
> redundancy, e.g. you are going from raid6 (2 redundant disks) to raid5
> (1 redundant disk) or raid0 (0 redundant disks). There are warnings and
> a force flag required when that happens. It doesn't determine the raid
> profile of the next block group--that's just a straight copy of the raid
> profile of the last block group.

To me it seems that this function decides the allocation of the next chunk.
The chain of calls is the following:

    btrfs_force_chunk_alloc
      btrfs_get_alloc_profile
        get_alloc_profile
          btrfs_reduce_alloc_profile
      btrfs_chunk_alloc
        btrfs_alloc_chunk
          __btrfs_alloc_chunk

or another one is

    btrfs_alloc_data_chunk_ondemand
      btrfs_data_alloc_profile
        btrfs_get_alloc_profile
          get_alloc_profile
            btrfs_reduce_alloc_profile
      btrfs_chunk_alloc
        btrfs_alloc_chunk
          __btrfs_alloc_chunk

The btrfs_get_alloc_profile/get_alloc_profile/btrfs_reduce_alloc_profile chain decides which profile has to be allocated. The currently active profiles are taken and then filtered down to the ones allowed on the basis of the number of disks. Which means that if a raid6 profile chunk exists (and there is a sufficient number of devices), the next chunk will be allocated as raid6.

That is how I read the code, and what my tests suggest...

My conclusion is: if you have multiple raid profiles in a filesystem, the next chunk allocation doesn't depend on the latest "balance", but on the logic above. The recipe is: when you do a balance, pay attention not to leave any chunk in the old format.

>> But I am not sure... Moreover, I expected to also see references to DUP and/or RAID1C[34]...
>
> If you get through that 'if' statement without hitting any of the
> branches, then you're equal to raid0 (0 redundant disks) but raid0
> is a special case because it requires 2 disks for allocation. 'dup'
> (0 redundant disks) and 'single' (which is the absence of any profile
> bits) also have 0 redundant disks and require only 1 disk for allocation,
> there is no need to treat them differently.
>
> raid1c[34] probably should be there. Patches welcome.
>
>> Does someone have any suggestion?
>>
>> BR
>> G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
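[Editorial note: the reading above — active profiles filtered by device count, then the "largest" survivor wins — can be sketched as follows. This is an illustration of that reading, not kernel code; the minimum device counts are the standard btrfs ones.]

```python
# Sketch of the get_alloc_profile/btrfs_reduce_alloc_profile reading
# above: start from the profiles present on the filesystem, drop those
# the current device count cannot support, then the highest-priority
# survivor wins. Illustration only, not kernel code.
MIN_DEVICES = {"raid6": 3, "raid5": 2, "raid10": 4, "raid1": 2, "raid0": 2}
PRIORITY = ["raid6", "raid5", "raid10", "raid1", "raid0"]

def next_chunk_profile(active, num_devices):
    allowed = {p for p in active if MIN_DEVICES.get(p, 1) <= num_devices}
    for p in PRIORITY:
        if p in allowed:
            return p
    return "single"

# The test above: single + raid5 chunks on a 3-device filesystem.
print(next_chunk_profile({"single", "raid5"}, 3))  # prints raid5
```

This reproduces the observed result: raid5 "wins" over single even though the single chunk was the one created last.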
* Re: Question: how understand the raid profile of a btrfs filesystem
  2020-03-21  9:55 ` Goffredo Baroncelli
@ 2020-03-21 23:26   ` Zygo Blaxell
  2020-03-22  8:34     ` Goffredo Baroncelli
  0 siblings, 1 reply; 21+ messages in thread
From: Zygo Blaxell @ 2020-03-21 23:26 UTC (permalink / raw)
To: Goffredo Baroncelli; +Cc: linux-btrfs

On Sat, Mar 21, 2020 at 10:55:32AM +0100, Goffredo Baroncelli wrote:
> On 3/21/20 4:29 AM, Zygo Blaxell wrote:
> > On Fri, Mar 20, 2020 at 06:56:38PM +0100, Goffredo Baroncelli wrote:
> > > Hi all,
> > >
> > > for a btrfs filesystem, how can a user find out which {data,metadata,system} [raid] profile is in use? E.g., which profile will the next chunk have?
> >
> > It's the profile used by the highest-numbered block group for the
> > allocation type (one for data, one for metadata/system). There
> > are two profiles to consider, one for data and one for metadata.
> > 'btrfs fi df', 'btrfs fi us', or 'btrfs dev usage' will all indicate
> > which profiles these are.
>
> What do you mean by "highest-numbered block group": the value in the "offset" field?

The objectid field (the offset field is the size for block group items).

> If so, it doesn't make sense, because it could be relocated easily.

Relocation will create a new block group with the filesystem's current profile, which is the conversion target profile if present (all conversion is relocation), but some other profile in use on the filesystem otherwise.

> Anyway, what you are describing is not what I saw. In the test below
> I created a raid5 filesystem, filled one chunk at 100% and a second
> chunk for a few MB. Then I converted the most empty chunk to single.

OK, I was missing some details: at mount time all the block group items are read in order, and each one adjusts the allocator profile bits for the entire filesystem. The last block group is the one that has the *most influence* over the profile when no conversion is running, but it doesn't set the profile alone.
If there is a partial conversion, then the behavior changes as you note. When a conversion is active, the conversion target profile overrides everything else. That is how you can get a single block group on a filesystem that is entirely raid5.

So... TL;DR: if you're not running a conversion, the next block group will use some RAID profile already present on the filesystem, and it may not be the one you want it to be.

> Then I filled
> the last chunk (the single one) and forced the creation of a new chunk.
> What I saw is that the new chunk is in raid5 mode.
>
>     $ sudo mkfs.btrfs -draid5 /dev/loop[012]
>     $ dd if=/dev/zero of=t/file-2.128gb_5 bs=1M count=$((2024+128))   # fill two raid5 chunks
>     $ sudo btrfs fi du t/.    # see what the situation is
>     [...]
>     Data,RAID5: Size:4.00GiB, Used:2.10GiB (52.57%)
>        /dev/loop0      2.00GiB
>        /dev/loop1      2.00GiB
>        /dev/loop2      2.00GiB
>     [...]
>     $ sudo btrfs balance start -dconvert=single,usage=50 t/.   # convert the latest chunk to single
>     $ sudo btrfs fi us t/.    # see what the situation is
>     [...]
>     Data,single: Size:1.00GiB, Used:259.00MiB (25.29%)
>        /dev/loop0      1.00GiB
>
>     Data,RAID5: Size:2.00GiB, Used:1.85GiB (92.47%)
>        /dev/loop0      1.00GiB
>        /dev/loop1      1.00GiB
>        /dev/loop2      1.00GiB
>     [...]
>
>     # fill the latest chunk and create a new one
>     $ dd if=/dev/zero of=t/file-1.128gb_6 bs=1M count=$((1024+128))
>
>     $ sudo btrfs fi us t/.    # see what the situation is
>     [...]
>     Data,single: Size:1.00GiB, Used:259.00MiB (25.29%)
>        /dev/loop0      1.00GiB
>
>     Data,RAID5: Size:4.00GiB, Used:1.85GiB (46.24%)
>        /dev/loop0      2.00GiB
>        /dev/loop1      2.00GiB
>        /dev/loop2      2.00GiB
>     [...]
>
> Expected result: the "single" chunk should grow from 1GB to 2GB. What is observed is that raid5 (the oldest chunk) grew from 2GB to 4GB.

...but now you are not running conversion any more, and you have multiple profiles. It's not really specified what will happen under those conditions, nor is it obvious what the correct behavior should be.
The on-disk format does not have a field for "target profile". Adding one would be a disk format change.

> [...]
> > > Looking at the code, it seems to me that the logic is the following (from btrfs_reduce_alloc_profile()):
> > >
> > >     if (allowed & BTRFS_BLOCK_GROUP_RAID6)
> > >         allowed = BTRFS_BLOCK_GROUP_RAID6;
> > >     else if (allowed & BTRFS_BLOCK_GROUP_RAID5)
> > >         allowed = BTRFS_BLOCK_GROUP_RAID5;
> > >     else if (allowed & BTRFS_BLOCK_GROUP_RAID10)
> > >         allowed = BTRFS_BLOCK_GROUP_RAID10;
> > >     else if (allowed & BTRFS_BLOCK_GROUP_RAID1)
> > >         allowed = BTRFS_BLOCK_GROUP_RAID1;
> > >     else if (allowed & BTRFS_BLOCK_GROUP_RAID0)
> > >         allowed = BTRFS_BLOCK_GROUP_RAID0;
> > >
> > >     flags &= ~BTRFS_BLOCK_GROUP_PROFILE_MASK;
> > >
> > > So in the case above the profile will be RAID6. And in general, if a RAID6 chunk exists in a filesystem, it wins!
> >
> > This code is used to determine whether a conversion reduces the level of
> > redundancy, e.g. you are going from raid6 (2 redundant disks) to raid5
> > (1 redundant disk) or raid0 (0 redundant disks). There are warnings and
> > a force flag required when that happens. It doesn't determine the raid
> > profile of the next block group--that's just a straight copy of the raid
> > profile of the last block group.
>
> To me it seems that this function decides the allocation of the next chunk. The chain of calls is the following:

Sorry, in my earlier mail I thought we were talking about a different piece of code that tries to enforce a similar rule.
>     btrfs_force_chunk_alloc
>       btrfs_get_alloc_profile
>         get_alloc_profile
>           btrfs_reduce_alloc_profile
>       btrfs_chunk_alloc
>         btrfs_alloc_chunk
>           __btrfs_alloc_chunk
>
> or another one is
>
>     btrfs_alloc_data_chunk_ondemand
>       btrfs_data_alloc_profile
>         btrfs_get_alloc_profile
>           get_alloc_profile
>             btrfs_reduce_alloc_profile
>       btrfs_chunk_alloc
>         btrfs_alloc_chunk
>           __btrfs_alloc_chunk
>
> The btrfs_get_alloc_profile/get_alloc_profile/btrfs_reduce_alloc_profile chain decides which profile has to be allocated.
> The currently active profiles are taken and then filtered down to the ones allowed on the basis of the number of disks. Which means that if a raid6 profile chunk exists (and there is a sufficient number of devices), the next chunk will be allocated as raid6.
>
> That is how I read the code, and what my tests suggest...
>
> My conclusion is: if you have multiple raid profiles in a filesystem, the next chunk allocation doesn't depend on the latest "balance", but on the logic above.
> The recipe is: when you do a balance, pay attention not to leave any chunk in the old format.

Well, yes, that is what I've been saying: don't expect btrfs to do sane things with a mixture of profiles. Stick to just one profile, except in the special case of a conversion.

You wouldn't leave an array in degraded mode for long, and you need to balance after adding a single drive to a raid1 or striped-profile raid array. Partially converted filesystems fall into this category too.

> > > But I am not sure... Moreover, I expected to also see references to DUP and/or RAID1C[34]...
> >
> > If you get through that 'if' statement without hitting any of the
> > branches, then you're equal to raid0 (0 redundant disks) but raid0
> > is a special case because it requires 2 disks for allocation. 'dup'
> > (0 redundant disks) and 'single' (which is the absence of any profile
> > bits) also have 0 redundant disks and require only 1 disk for allocation,
> > there is no need to treat them differently.
> > raid1c[34] probably should be there. Patches welcome.
>
> > > Does someone have any suggestion?
> > >
> > > BR
> > > G.Baroncelli
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: Question: how understand the raid profile of a btrfs filesystem
  2020-03-21 23:26 ` Zygo Blaxell
@ 2020-03-22  8:34   ` Goffredo Baroncelli
  2020-03-22  8:38     ` Goffredo Baroncelli
  0 siblings, 1 reply; 21+ messages in thread
From: Goffredo Baroncelli @ 2020-03-22  8:34 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: linux-btrfs

Hi Zygo,

On 3/22/20 12:26 AM, Zygo Blaxell wrote:
> On Sat, Mar 21, 2020 at 10:55:32AM +0100, Goffredo Baroncelli wrote:
>> On 3/21/20 4:29 AM, Zygo Blaxell wrote:
>>> On Fri, Mar 20, 2020 at 06:56:38PM +0100, Goffredo Baroncelli wrote:
>>>> Hi all,
>>>> [...]
> ...but now you are not running conversion any more, and you have multiple
> profiles. It's not really specified what will happen under those
> conditions, nor is it obvious what the correct behavior should be.
>
> The on-disk format does not have a field for "target profile".

Ok, I looked for a confirmation of that.

> Adding one would be a disk format change.

Yes, but I think that it could be done in a backward-compatible way. I am thinking of adding a "target profile" field to the super-block. Old kernels would ignore this field and behave as today. New ones would allocate new chunks according to this field.

To me it seems complicated to

Any thoughts?

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: Question: how understand the raid profile of a btrfs filesystem
  2020-03-22  8:34 ` Goffredo Baroncelli
@ 2020-03-22  8:38   ` Goffredo Baroncelli
  2020-03-22 23:49     ` Zygo Blaxell
  0 siblings, 1 reply; 21+ messages in thread
From: Goffredo Baroncelli @ 2020-03-22  8:38 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: linux-btrfs

On 3/22/20 9:34 AM, Goffredo Baroncelli wrote:
>
> To me it seems complicated to

[sorry, I pushed the send button too early]

To me it seems too complicated (and error prone) to derive the target profile from an analysis of the filesystem.

Any thoughts?

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: Question: how understand the raid profile of a btrfs filesystem
  2020-03-22  8:38 ` Goffredo Baroncelli
@ 2020-03-22 23:49   ` Zygo Blaxell
  2020-03-23 20:50     ` Goffredo Baroncelli
  0 siblings, 1 reply; 21+ messages in thread
From: Zygo Blaxell @ 2020-03-22 23:49 UTC (permalink / raw)
To: Goffredo Baroncelli; +Cc: linux-btrfs

On Sun, Mar 22, 2020 at 09:38:30AM +0100, Goffredo Baroncelli wrote:
> On 3/22/20 9:34 AM, Goffredo Baroncelli wrote:
> >
> > To me it seems complicated to
> [sorry, I pushed the send button too early]
>
> To me it seems too complicated (and error prone) to derive the target profile from an analysis of the filesystem.
>
> Any thoughts?

I still don't understand the use case you are trying to support. There are 3 states for a btrfs filesystem:

1. All block groups use the same profile. Pick any one, use its profile for future block groups. Avoid deleting the last one. Simple and easy to implement.

2. A conversion is in progress. Look in fs_info->balance_ctl for a 'convert' filter. If there is one, that's the profile for new block groups. Old block groups will be emptied and destroyed by conversion, and then we automatically go back to state #1.

3. A conversion is interrupted prior to completion. The sysadmin is expected to proceed immediately back to state #2, possibly after recovering from whatever event triggered entry into state #3. It doesn't really matter what the current allocation profile is, since it is likely to change before we allocate any more block groups.

You seem to be trying to sustain or support a filesystem in state #3 for a prolonged period of time. Why would we do that? If your use case is providing information or guidance to a user, tell them how to get back to state #2 ASAP, so that they can then return to state #1 where they should be.

Suppose your use case does involve staying in state #3 for a prolonged period of time--let's say e.g.
you want to be able to use file attributes to put some file data on single profile while putting other files on raid5 profile. That use case would need to come with a bunch of infrastructure to support it, i.e. you'd need to define what the attributes are, and how btrfs could map those to device subsets and raid profiles. None of this exists, and even if it did, it would conflict with the "store the [singular] target profile on disk" idea. There could be a warning message in dmesg if we enter state #3. This message would appear after a converting balance is cancelled or aborted, and on mount when we scan block groups (which we would still need to do even after we added a "target profile" field to the superblock). Userspace like 'btrfs fi df' could also put out a warning like "multiple allocation profiles detected, but conversion is not in progress. Please finish conversion at your earliest convenience to avoid disappointment." I don't see the need to do anything more about it. We only get to state #3 if the automation has already failed, or has been explicitly cancelled at sysadmin request. It is better to wait for the sysadmin to decide what to do next, especially if the sysadmin's prior choice led to us entering this state (e.g. not enough space to complete a conversion to the target profile, so we can no longer use the target profile for new allocations). Picking a target profile at random (from the set of profiles already used in the filesystem) is no better or worse than any deterministic algorithm--it will always be wrong in some situations, and a good choice in other situations. I'd even consider removing the heuristics that are already there for prioritizing profiles. They are just surprising and undocumented behavior, and it would be better to document it as "random, BTW you should finish your conversion now." It doesn't help if e.g. you want to convert from raid6 to raid1, since the heuristic assumes you only want to go the other way. 
> BR > G.Baroncelli > > -- > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 > ^ permalink raw reply [flat|nested] 21+ messages in thread
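Zygo's three-state description above can be sketched as a small model. This is an illustrative sketch only; the function and its signature are hypothetical, not btrfs code or any real API:

```python
def allocation_state(profiles, convert_target=None):
    """Classify a filesystem per the three states described above.

    profiles: set of profile names found in existing block groups.
    convert_target: target of an in-progress balance 'convert' filter,
    or None when no conversion is running.
    Returns (state_number, profile_for_next_chunk_or_None).
    """
    if convert_target is not None:
        # State 2: conversion in progress -- the convert filter
        # decides the profile of new block groups.
        return (2, convert_target)
    if len(profiles) == 1:
        # State 1: uniform filesystem -- any block group's profile works.
        return (1, next(iter(profiles)))
    # State 3: mixed profiles with no conversion running -- the next
    # profile is effectively undefined until conversion is resumed.
    return (3, None)
```

For example, a filesystem holding both 'single' and 'raid5' block groups with no running balance lands in state 3, where the profile of the next chunk cannot be predicted from this state alone.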
* Re: Question: how understand the raid profile of a btrfs filesystem 2020-03-22 23:49 ` Zygo Blaxell @ 2020-03-23 20:50 ` Goffredo Baroncelli 2020-03-23 22:48 ` Graham Cobb 2020-03-23 23:18 ` Zygo Blaxell 0 siblings, 2 replies; 21+ messages in thread From: Goffredo Baroncelli @ 2020-03-23 20:50 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs On 3/23/20 12:49 AM, Zygo Blaxell wrote: > On Sun, Mar 22, 2020 at 09:38:30AM +0100, Goffredo Baroncelli wrote: >> On 3/22/20 9:34 AM, Goffredo Baroncelli wrote: >> >>> >>> To me it seems complicated to >> [sorry I push the send button too early] >> >> To me it seems too complicated (and error prone) to derive the target profile from an analysis of the filesystem. >> >> Any thoughts ? > > I still don't understand the use case you are trying to support. > > There are 3 states for a btrfs filesystem: > [...] > > 3. A conversion is interrupted prior to completion. Sysadmin is > expected to proceed immediately back to state #2, possibly after > taking any necessary recovery actions that triggered entry into > state #3. It doesn't really matter what the current allocation > profile is, since it is likely to change before we allocate > any more block groups. > > You seem to be trying to sustain or support a filesystem in state #3 for > a prolonged period of time. Why would we do that? If your use case is > providing information or guidance to a user, tell them how to get back > to state #2 ASAP, so that they can then return to state #1 where they > should be. Believe me: I *don't want* to sustain #3 at all; btrfs is already too complex. Supporting multiple profiles is the worst thing that we can do. However #3 exists and it could cause unexpected results. I think we agree on this. > [...] > There could be a warning message in dmesg if we enter state #3. 
> This message would appear after a converting balance is cancelled or > aborted, and on mount when we scan block groups (which we would still need > to do even after we added a "target profile" field to the superblock). > Userspace like 'btrfs fi df' could also put out a warning like "multiple > allocation profiles detected, but conversion is not in progress. Please > finish conversion at your earliest convenience to avoid disappointment." > I don't see the need to do anything more about it. It would help if every btrfs command warned the user about an "unwanted" state like this. > > We only get to state #3 if the automation has already failed, or has > been explicitly cancelled at sysadmin request. > Not only that: you can also enter state #3 if you do something like: $ sudo btrfs balance start -dconvert=single,usage=50 t/. where you convert some chunks but not others. This is the point: we can consider the "failed automation" an unexpected event, but doing "btrfs bal stop" or running the command above cannot be considered an unexpected event. [...] > I'd even consider removing the heuristics that are already there for > prioritizing profiles. They are just surprising and undocumented > behavior, and it would be better to document it as "random, BTW you > should finish your conversion now." I agree that we should remove this kind of heuristic. Doing so, I think that, with moderate effort, btrfs could track the wanted profile (i.e. the one set at mkfs time or the one specified in the last balance w/convert [*]) and use it. To me it seems the natural thing to do. Nothing more, nothing less. We can't prevent a mixed-profile filesystem (the possibility to stop a long-running activity like a balance has to be allowed), but we should prevent the unexpected behavior: if we change the profile and something goes wrong, the next chunk allocation should be clear. The user shouldn't have to read the code to understand what will happen. 
[*] we can argue which would be the expected profile after an interrupted balance: the former one or the latter one ? BR G.Baroncelli -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 21+ messages in thread
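The filtered convert mentioned above ('btrfs balance start -dconvert=single,usage=50') can be modeled to show why it leaves mixed profiles behind: only block groups within the usage threshold are relocated to the target profile. The helper below is a hypothetical illustration, not btrfs code, and it glosses over the exact boundary semantics of the real usage filter:

```python
def filtered_convert(block_groups, target, usage_limit):
    """block_groups: list of (profile, used_percent) pairs.
    Relocate only block groups at or below the usage threshold."""
    return [(target, used) if used <= usage_limit else (profile, used)
            for profile, used in block_groups]

# Three raid5 data block groups at 30%, 80% and 45% usage:
bgs = [("raid5", 30), ("raid5", 80), ("raid5", 45)]
after = filtered_convert(bgs, "single", 50)
# The 30% and 45% block groups become 'single'; the 80% one keeps
# raid5, so the filesystem ends up with two profiles (state #3).
```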
* Re: Question: how understand the raid profile of a btrfs filesystem 2020-03-23 20:50 ` Goffredo Baroncelli @ 2020-03-23 22:48 ` Graham Cobb 2020-03-25 4:09 ` Zygo Blaxell 2020-03-23 23:18 ` Zygo Blaxell 0 siblings, 2 replies; 21+ messages in thread From: Graham Cobb @ 2020-03-23 22:48 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs On 23/03/2020 20:50, Goffredo Baroncelli wrote: > On 3/23/20 12:49 AM, Zygo Blaxell wrote: >> On Sun, Mar 22, 2020 at 09:38:30AM +0100, Goffredo Baroncelli wrote: >>> On 3/22/20 9:34 AM, Goffredo Baroncelli wrote: >>> >>>> >>>> To me it seems complicated to >>> [sorry I push the send button too early] >>> >>> To me it seems too complicated (and error prone) to derive the target >>> profile from an analysis of the filesystem. >>> >>> Any thoughts ? >> >> I still don't understand the use case you are trying to support. >> >> There are 3 states for a btrfs filesystem: >> > [...] >> >> 3. A conversion is interrupted prior to completion. Sysadmin is >> expected to proceed immediately back to state #2, possibly after >> taking any necessary recovery actions that triggered entry into >> state #3. It doesn't really matter what the current allocation >> profile is, since it is likely to change before we allocate >> any more block groups. >> >> You seem to be trying to sustain or support a filesystem in state #3 for >> a prolonged period of time. Why would we do that? In real life (particularly outside a commercial datacentre) state #3 can persist for quite a while. I recently found myself in exactly that position, one which not only lasted for weeks but was, at times, getting worse (I was getting further away from my target configuration, not closer). In this case, the original trigger was a disk beginning to go bad in a filesystem of well over 10TB. My strategy for handling that was to replace the failing disk asap, and then rearrange the disk usage on the system later. 
In order to handle the immediate emergency, I made use of existing free space in LVM volume groups to replace the failing disk, but that meant I had some user data and backups on the same physical disk for a while (although I have plenty of other backups available I like to keep my first-tier backups on separate local disks). So, once the immediate crisis was over, I needed to move disks around between the filesystems. It was weeks before I had managed to do sufficient disk adds, removes and replaces to have all the filesystems back to having data and backups on separate disks and all the data and metadata in the profiles I wanted. Just doing a replace for one disk took many days for the system to physically copy the data from one disk to the other. As this system was still in heavy use, this was made worse by btrfs deciding to store data in profiles I did not want (at that point in the manipulation) and forcing me to rebalance the data that had been written during the last disk change before I could start on the next one. Bottom line: although not the top priority in btrfs development, a simple way to control the profile to be used for new data and metadata allocations would have real benefit to overstretched sysadmins. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Question: how understand the raid profile of a btrfs filesystem 2020-03-23 22:48 ` Graham Cobb @ 2020-03-25 4:09 ` Zygo Blaxell 2020-03-25 4:30 ` Paul Jones 0 siblings, 1 reply; 21+ messages in thread From: Zygo Blaxell @ 2020-03-25 4:09 UTC (permalink / raw) To: Graham Cobb; +Cc: linux-btrfs On Mon, Mar 23, 2020 at 10:48:44PM +0000, Graham Cobb wrote: > On 23/03/2020 20:50, Goffredo Baroncelli wrote: > > On 3/23/20 12:49 AM, Zygo Blaxell wrote: > >> On Sun, Mar 22, 2020 at 09:38:30AM +0100, Goffredo Baroncelli wrote: > >>> On 3/22/20 9:34 AM, Goffredo Baroncelli wrote: > >>> > >>>> > >>>> To me it seems complicated to > >>> [sorry I push the send button too early] > >>> > >>> To me it seems too complicated (and error prone) to derive the target > >>> profile from an analysis of the filesystem. > >>> > >>> Any thoughts ? > >> > >> I still don't understand the use case you are trying to support. > >> > >> There are 3 states for a btrfs filesystem: > >> > > [...] > >> > >> 3. A conversion is interrupted prior to completion. Sysadmin is > >> expected to proceed immediately back to state #2, possibly after > >> taking any necessary recovery actions that triggered entry into > >> state #3. It doesn't really matter what the current allocation > >> profile is, since it is likely to change before we allocate > >> any more block groups. > >> > >> You seem to be trying to sustain or support a filesystem in state #3 for > >> a prolonged period of time. Why would we do that? > > In real life situations (particularly outside a commercial datacentre) > this situation can persist for quite a while. I recently found myself > in a real-life situation where this situation was not only in existence > for weeks but was, at some times, getting worse (I was getting further > away from my target configuration, not closer). > > In this case, the original trigger was a disk in a well over 10TB > filesystem beginning to go bad. 
My strategy for handling that was to > replace the failing disk asap, and then rearrange the disk usage on the > system later. In order to handle the immediate emergency, I made use of > existing free space in LVM volume groups to replace the failing disk, > but that meant I had some user data and backups on the same physical > disk for a while (although I have plenty of other backups available I > like to keep my first-tier backups on separate local disks). I've done those. And the annoying thing about them was... > So, once the immediate crisis was over, I needed to move disks around > between the filesystems. It was weeks before I had managed to do > sufficient disk adds, removes Disk removes are where the current system breaks down. 'btrfs device remove' is terrible: - can't cancel a remove except by rebooting or forcing ENOSPC - can't resume automatically after a reboot (probably a good thing for now, given there's no cancel) - can't coexist with a balance, even when paused--device remove requires the balance to be _cancelled_ first - doesn't have any equivalent to the 'convert' filter raid profile target in balance info so if you need to remove a device while you're changing profiles, you have to abort the profile change and then relocate a whole lot of data without being able to specify the correct target profile. The proper fix would be to reimplement 'btrfs dev remove' using pieces of the balance infrastructure (it kind of is now, except where it's not), and so 'device remove' can keep the 'convert=' target. Then you don't have to lose the target profile while doing removes (and fix the other problems too). Or just move it from the balance info to the superblock, as suggested elsewhere in the thread (none of these changes can be done without changing something in the on-disk format). But definitely don't have the target profile in both places! 
> and replaces to have all the filesystems > back to having data and backups on separate disks and all the data and > metadata in the profiles I wanted. Just doing a replace for one disk > took many days for the system to physically copy the data from one disk > to the other. > > As this system was still in heavy use, this was made worse by btrfs > deciding to store data in profiles I did not want (at that point in the > manipulation) and forcing me to rebalance the data that had been written > during the last disk change before I could start on the next one. > > Bottom line: although not the top priority in btrfs development, a > simple way to control the profile to be used for new data and metadata > allocations would have real benefit to overstretched sysadmins. > ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: Question: how understand the raid profile of a btrfs filesystem 2020-03-25 4:09 ` Zygo Blaxell @ 2020-03-25 4:30 ` Paul Jones 2020-03-26 2:51 ` Zygo Blaxell 0 siblings, 1 reply; 21+ messages in thread From: Paul Jones @ 2020-03-25 4:30 UTC (permalink / raw) To: Zygo Blaxell, Graham Cobb; +Cc: linux-btrfs > -----Original Message----- > From: linux-btrfs-owner@vger.kernel.org <linux-btrfs- > owner@vger.kernel.org> On Behalf Of Zygo Blaxell > Sent: Wednesday, 25 March 2020 3:10 PM > To: Graham Cobb <g.btrfs@cobb.uk.net> > Cc: linux-btrfs <linux-btrfs@vger.kernel.org> > Subject: Re: Question: how understand the raid profile of a btrfs filesystem > Disk removes are where the current system breaks down. 'btrfs device > remove' is terrible: > > - can't cancel a remove except by rebooting or forcing ENOSPC > > - can't resume automatically after a reboot (probably a good > thing for now, given there's no cancel) > > - can't coexist with a balance, even when paused--device remove > requires the balance to be _cancelled_ first > > - doesn't have any equivalent to the 'convert' filter raid > profile target in balance info > > so if you need to remove a device while you're changing profiles, you have to > abort the profile change and then relocate a whole lot of data without being > able to specify the correct target profile. > > The proper fix would be to reimplement 'btrfs dev remove' using pieces of > the balance infrastructure (it kind of is now, except where it's not), and so > 'device remove' can keep the 'convert=' target. Then you don't have to lose > the target profile while doing removes (and fix the other problems too). I've often thought it would be handy to be able to forcefully set the disk size or free space to zero, like how it is reported by 'btrfs fi sh' during a remove operation. 
That way a balance operation can be used for various things like profile changes or multiple disk removals (like replacing 4x1T drives with 1x4T drive) without unintentionally writing a bunch of data to a disk you don't want to write to anymore. It would also allow for a more gradual removal for disks that need replacing but not as an emergency, as data will gradually migrate itself to other discs as it is COWed. Paul. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Question: how understand the raid profile of a btrfs filesystem 2020-03-25 4:30 ` Paul Jones @ 2020-03-26 2:51 ` Zygo Blaxell 0 siblings, 0 replies; 21+ messages in thread From: Zygo Blaxell @ 2020-03-26 2:51 UTC (permalink / raw) To: Paul Jones; +Cc: Graham Cobb, linux-btrfs On Wed, Mar 25, 2020 at 04:30:16AM +0000, Paul Jones wrote: > > -----Original Message----- > > From: linux-btrfs-owner@vger.kernel.org <linux-btrfs- > > owner@vger.kernel.org> On Behalf Of Zygo Blaxell > > Sent: Wednesday, 25 March 2020 3:10 PM > > To: Graham Cobb <g.btrfs@cobb.uk.net> > > Cc: linux-btrfs <linux-btrfs@vger.kernel.org> > > Subject: Re: Question: how understand the raid profile of a btrfs filesystem > > > Disk removes are where the current system breaks down. 'btrfs device > > remove' is terrible: > > > > - can't cancel a remove except by rebooting or forcing ENOSPC > > > > - can't resume automatically after a reboot (probably a good > > thing for now, given there's no cancel) > > > > - can't coexist with a balance, even when paused--device remove > > requires the balance to be _cancelled_ first > > > > - doesn't have any equivalent to the 'convert' filter raid > > profile target in balance info > > > > so if you need to remove a device while you're changing profiles, you have to > > abort the profile change and then relocate a whole lot of data without being > > able to specify the correct target profile. > > > > The proper fix would be to reimplement 'btrfs dev remove' using pieces of > > the balance infrastructure (it kind of is now, except where it's not), and so > > 'device remove' can keep the 'convert=' target. Then you don't have to lose > > the target profile while doing removes (and fix the other problems too). > > I've often thought it would be handy to be able to forcefully set the > disk size or free space to zero, like how it is reported by 'btrfs > fi sh' during a remove operation. 
That way a balance operation can be > used for various things like profile changes or multiple disk removals > (like replacing 4x1T drives with 1x4T drive) without unintentionally > writing a bunch of data to a disk you don't want to write to anymore. I forgot "can only remove one disk at a time" in the list above. We can add multiple disks at once (well, add one at a time, then use balance to do all the relocation at once), but the opposite operation isn't possible. That is an elegant way to set up balances to do a device delete/shrink, too. > It would also allow for a more gradual removal for disks that need > replacing but not as an emergency, as data will gradually migrate > itself to other discs as it is COWed. > > Paul. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Question: how understand the raid profile of a btrfs filesystem 2020-03-23 20:50 ` Goffredo Baroncelli 2020-03-23 22:48 ` Graham Cobb @ 2020-03-23 23:18 ` Zygo Blaxell 1 sibling, 0 replies; 21+ messages in thread From: Zygo Blaxell @ 2020-03-23 23:18 UTC (permalink / raw) To: kreijack; +Cc: linux-btrfs On Mon, Mar 23, 2020 at 09:50:03PM +0100, Goffredo Baroncelli wrote: > On 3/23/20 12:49 AM, Zygo Blaxell wrote: > > On Sun, Mar 22, 2020 at 09:38:30AM +0100, Goffredo Baroncelli wrote: > > > On 3/22/20 9:34 AM, Goffredo Baroncelli wrote: > > > > > > > > > > > To me it seems complicated to > > > [sorry I push the send button too early] > > > > > > To me it seems too complicated (and error prone) to derive the target profile from an analysis of the filesystem. > > > > > > Any thoughts ? > > > > I still don't understand the use case you are trying to support. > > > > There are 3 states for a btrfs filesystem: > > > [...] > > > > 3. A conversion is interrupted prior to completion. Sysadmin is > > expected to proceed immediately back to state #2, possibly after > > taking any necessary recovery actions that triggered entry into > > state #3. It doesn't really matter what the current allocation > > profile is, since it is likely to change before we allocate > > any more block groups. > > > > You seem to be trying to sustain or support a filesystem in state #3 for > > a prolonged period of time. Why would we do that? If your use case is > > providing information or guidance to a user, tell them how to get back > > to state #2 ASAP, so that they can then return to state #1 where they > > should be. > > Believe me: I *don't want* to sustain #3 at all; btrfs is already too > complex. Supporting multiple profile is the worst thing that we can do. > However #3 exists and it could cause unexpected results. I think that on > this we agree. > > > [...] > > > There could be a warning message in dmesg if we enter state #3. 
> > This message would appear after a converting balance is cancelled or > > aborted, and on mount when we scan block groups (which we would still need > > to do even after we added a "target profile" field to the superblock). > > Userspace like 'btrfs fi df' could also put out a warning like "multiple > > allocation profiles detected, but conversion is not in progress. Please > > finish conversion at your earliest convenience to avoid disappointment." > > I don't see the need to do anything more about it. > > It would help that every btrfs command should warn the users about an > "un-wanted" state like this. Patches welcome... > > been explicitly cancelled at sysadmin request. > > > Not only, you can enter in state #3 if you do something like: > > $ sudo btrfs balance start -dconvert=single,usage=50 t/. > > where you convert some chunk but not other. Sure, but now you're intentionally doing weird (or sufficiently advanced) stuff. Given a combination of balance flags like that (convert + other restrictions), we should assume the user knows what they're doing, and stay out of the way. The existing code that inserts 'usage=90' when resuming a balance, though highly questionable, still presumes the user knows what they're doing when a balance has a convert in it, and doesn't modify the usage filter setting in that case. It's fairly normal to want to run something like this when changing RAID profiles on a big array:

    # Make lots of free space quickly
    for x in $(seq 0 100); do
        btrfs balance start -dconvert=single,soft,usage=$x t/.
    done

    # OK now do the full BGs, will be slow
    btrfs balance start -dconvert=single,soft t/.

Should that print 101 warnings as it runs? What if the user is using python-btrfs (e.g. to order the block groups by usage) and not the btrfs-progs tools, or some other UI? Do we write warnings from inside the kernel? Will there be a "--quiet" option that suppresses the warning? 
(I suppose if the answer to the last two questions is "yes" then we just need patches to get it done). > This is the point: we can consider the "failed automation" an unexpected > event, however doing "btrfs bal stop" or the command above cannot be > considered as unexpected event. Balance cancel is always unexpected. "balance cancel" is a sysadmin forcing balance to exit using the error recovery code. If early termination of a conversion was _expected_, the sysadmin would have used 'limit' or 'vrange' or 'usage' or 'devid' or some other filter parameter so that balance does what it was told to do _without being cancelled_. > [...] > > > I'd even consider removing the heuristics that are already there for > > prioritizing profiles. They are just surprising and undocumented > > behavior, and it would be better to document it as "random, BTW you > > should finish your conversion now." > > I agree that we should remove this kind of heuristic. > Doing so I think that, with moderate effort, btrfs can track what is the > wanted profile (i.e. the one at the mkfs time or the one specified in last balance > w/convert [*]) and uses it. To me it seems the natural thing to do. Noting more > nothing less. It already kind of does--the balance convert parameters are stored on disk so it can be resumed after a umount or pause. "Pause" implies resuming later, and saving all the state required to do so. "Cancel" says something different, "forget what you were doing and wait for new instructions," so cancel wipes out the conversion target profile. > We can't prevent a mixed profile filesystem (it has to be allowed the > possibility to stop a long activity like the balance), but we should > prevent the unexpected behavior: if we change the profile and something > goes wrong, the next chunk allocation should be clear. The user don't have > to read the code to understand what will happen. 
> [*] we can argue which would be the expected profile after an interrupted balance: > the former one or the latter one ? If we can argue about it, then there's no right answer, and the status quo is fine (or we need a more complete solution). > BR > G.Baroncelli > > -- > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 21+ messages in thread
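The pause/cancel distinction Zygo draws above (pause keeps the stored convert target so the balance can resume; cancel wipes it) can be sketched as a toy model. 'BalanceCtl' here is a hypothetical stand-in, not the kernel's actual balance control structure:

```python
class BalanceCtl:
    """Toy stand-in for the persisted state of a convert balance."""

    def __init__(self, convert_target):
        self.convert_target = convert_target  # stored on disk by btrfs
        self.running = True

    def pause(self):
        # Pause implies resuming later: all state, including the
        # convert target, is kept.
        self.running = False

    def cancel(self):
        # Cancel means "forget what you were doing and wait for new
        # instructions": the target profile is wiped with the rest.
        self.running = False
        self.convert_target = None
```

After a pause the filesystem still knows what the next chunk's profile should be; after a cancel it does not, which is how a long-lived state #3 arises.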
* Re: Question: how understand the raid profile of a btrfs filesystem 2020-03-20 17:56 Question: how understand the raid profile of a btrfs filesystem Goffredo Baroncelli 2020-03-21 3:29 ` Zygo Blaxell @ 2020-03-24 4:55 ` Anand Jain 2020-03-24 17:59 ` Goffredo Baroncelli 1 sibling, 1 reply; 21+ messages in thread From: Anand Jain @ 2020-03-24 4:55 UTC (permalink / raw) To: kreijack, linux-btrfs On 3/21/20 1:56 AM, Goffredo Baroncelli wrote: > Hi all, > > for a btrfs filesystem, how an user can understand which is the > {data,mmetadata,system} [raid] profile in use ? E.g. the next chunk > which profile will have ? > For simple filesystem it is easy looking at the output of (e.g) "btrfs > fi df" or "btrfs fi us". But what if the filesystem is not simple ? > > btrfs fi us t/. > Overall: > Device size: 40.00GiB > Device allocated: 19.52GiB > Device unallocated: 20.48GiB > Device missing: 0.00B > Used: 16.75GiB > Free (estimated): 12.22GiB (min: 8.27GiB) > Data ratio: 1.90 > Metadata ratio: 2.00 > Global reserve: 9.06MiB (used: 0.00B) > > Data,single: Size:1.00GiB, Used:512.00MiB (50.00%) > /dev/loop0 1.00GiB > > Data,RAID5: Size:3.00GiB, Used:2.48GiB (82.56%) > /dev/loop1 1.00GiB > /dev/loop2 1.00GiB > /dev/loop3 1.00GiB > /dev/loop0 1.00GiB > > Data,RAID6: Size:4.00GiB, Used:3.71GiB (92.75%) > /dev/loop1 2.00GiB > /dev/loop2 2.00GiB > /dev/loop3 2.00GiB > /dev/loop0 2.00GiB > > Data,RAID1C3: Size:2.00GiB, Used:1.88GiB (93.76%) > /dev/loop1 2.00GiB > /dev/loop2 2.00GiB > /dev/loop3 2.00GiB > > Metadata,RAID1: Size:256.00MiB, Used:9.14MiB (3.57%) > /dev/loop2 256.00MiB > /dev/loop3 256.00MiB > > System,RAID1: Size:8.00MiB, Used:16.00KiB (0.20%) > /dev/loop2 8.00MiB > /dev/loop3 8.00MiB > > Unallocated: > /dev/loop1 5.00GiB > /dev/loop2 4.74GiB > /dev/loop3 4.74GiB > /dev/loop0 6.00GiB > > This is an example of a strange but valid filesystem. So the question > is: the next chunk which profile will have ? > Is there any way to understand what will happens ? 
> > I expected that the next chunk will be allocated as the last "convert". > However I discovered that this is not true. > > Looking at the code it seems to me that the logic is the following (from > btrfs_reduce_alloc_profile()) > > if (allowed & BTRFS_BLOCK_GROUP_RAID6) > allowed = BTRFS_BLOCK_GROUP_RAID6; > else if (allowed & BTRFS_BLOCK_GROUP_RAID5) > allowed = BTRFS_BLOCK_GROUP_RAID5; > else if (allowed & BTRFS_BLOCK_GROUP_RAID10) > allowed = BTRFS_BLOCK_GROUP_RAID10; > else if (allowed & BTRFS_BLOCK_GROUP_RAID1) > allowed = BTRFS_BLOCK_GROUP_RAID1; > else if (allowed & BTRFS_BLOCK_GROUP_RAID0) > allowed = BTRFS_BLOCK_GROUP_RAID0; > > flags &= ~BTRFS_BLOCK_GROUP_PROFILE_MASK; > > So in the case above the profile will be RAID6. And in the general if a > RAID6 chunk is a filesystem, it wins ! That's arbitrary and doesn't make sense to me, IMO mkfs should save default profile in the super-block (which can be changed using ioctl) and kernel can create chunks based on the default profile. This approach also fixes chunk size inconsistency between progs and kernel as reported/fixed here https://patchwork.kernel.org/patch/11431405/ Thanks, Anand > But I am not sure.. Moreover I expected to see also reference to DUP > and/or RAID1C[34] ... > > Does someone have any suggestion ? > > BR > G.Baroncelli > ^ permalink raw reply [flat|nested] 21+ messages in thread
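The priority chain in the quoted btrfs_reduce_alloc_profile() snippet can be rendered as a simplified model. The 'single' fallback is an assumption of this sketch (in the kernel the profile bits are simply cleared), and note that DUP and RAID1C3/4 are absent from the chain, matching the observation in the original question:

```python
# Priority order taken from the quoted kernel snippet:
# RAID6 > RAID5 > RAID10 > RAID1 > RAID0.
PRIORITY = ("raid6", "raid5", "raid10", "raid1", "raid0")

def reduce_alloc_profile(present):
    """present: set of profile names that exist in the filesystem.
    Return the profile the chain would pick for the next chunk."""
    for profile in PRIORITY:
        if profile in present:
            return profile
    # Nothing from the chain present: all profile bits cleared,
    # which this sketch represents as 'single'.
    return "single"
```

For the example filesystem in the thread (single + RAID5 + RAID6 + RAID1C3 data chunks), RAID6 wins; a filesystem with only RAID1C3 chunks falls through the chain entirely.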
* Re: Question: how understand the raid profile of a btrfs filesystem 2020-03-24 4:55 ` Anand Jain @ 2020-03-24 17:59 ` Goffredo Baroncelli 2020-03-25 4:09 ` Andrei Borzenkov 0 siblings, 1 reply; 21+ messages in thread From: Goffredo Baroncelli @ 2020-03-24 17:59 UTC (permalink / raw) To: Anand Jain; +Cc: linux-btrfs, Zygo Blaxell On 3/24/20 5:55 AM, Anand Jain wrote: > On 3/21/20 1:56 AM, Goffredo Baroncelli wrote: >> Hi all, [..] >> Looking at the code it seems to me that the logic is the following (from btrfs_reduce_alloc_profile()) >> >> if (allowed & BTRFS_BLOCK_GROUP_RAID6) >> allowed = BTRFS_BLOCK_GROUP_RAID6; >> else if (allowed & BTRFS_BLOCK_GROUP_RAID5) >> allowed = BTRFS_BLOCK_GROUP_RAID5; >> else if (allowed & BTRFS_BLOCK_GROUP_RAID10) >> allowed = BTRFS_BLOCK_GROUP_RAID10; >> else if (allowed & BTRFS_BLOCK_GROUP_RAID1) >> allowed = BTRFS_BLOCK_GROUP_RAID1; >> else if (allowed & BTRFS_BLOCK_GROUP_RAID0) >> allowed = BTRFS_BLOCK_GROUP_RAID0; >> >> flags &= ~BTRFS_BLOCK_GROUP_PROFILE_MASK; >> >> So in the case above the profile will be RAID6. And in the general if a RAID6 chunk is a filesystem, it wins ! > > That's arbitrary and doesn't make sense to me, IMO mkfs should save > default profile in the super-block (which can be changed using ioctl) > and kernel can create chunks based on the default profile. I'm working on this idea (storing the target profile in the super-block). Of course this increases consistency, but it doesn't prevent the possibility that a mixed-profiles filesystem could happen. And in this case it is the user who has to solve the issue. Zygo also suggested adding a mixed-profile warning to btrfs (progs). And I agree with him. I think that we can use the space info ioctl (which doesn't require root privileges). BR G.Baroncelli > This approach also fixes chunk size inconsistency between progs and kernel > as reported/fixed here > https://patchwork.kernel.org/patch/11431405/ > > Thanks, Anand > >> But I am not sure.. 
Moreover I expected to see also reference to DUP and/or RAID1C[34] ... >> >> Does someone have any suggestion ? >> >> BR >> G.Baroncelli >> > -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Question: how understand the raid profile of a btrfs filesystem 2020-03-24 17:59 ` Goffredo Baroncelli @ 2020-03-25 4:09 ` Andrei Borzenkov 2020-03-25 17:14 ` Goffredo Baroncelli 0 siblings, 1 reply; 21+ messages in thread From: Andrei Borzenkov @ 2020-03-25 4:09 UTC (permalink / raw) To: kreijack, Anand Jain; +Cc: linux-btrfs, Zygo Blaxell 24.03.2020 20:59, Goffredo Baroncelli wrote: > On 3/24/20 5:55 AM, Anand Jain wrote: >> On 3/21/20 1:56 AM, Goffredo Baroncelli wrote: >>> Hi all, > [..] >>> Looking at the code it seems to me that the logic is the following >>> (from btrfs_reduce_alloc_profile()) >>> >>> if (allowed & BTRFS_BLOCK_GROUP_RAID6) >>> allowed = BTRFS_BLOCK_GROUP_RAID6; >>> else if (allowed & BTRFS_BLOCK_GROUP_RAID5) >>> allowed = BTRFS_BLOCK_GROUP_RAID5; >>> else if (allowed & BTRFS_BLOCK_GROUP_RAID10) >>> allowed = BTRFS_BLOCK_GROUP_RAID10; >>> else if (allowed & BTRFS_BLOCK_GROUP_RAID1) >>> allowed = BTRFS_BLOCK_GROUP_RAID1; >>> else if (allowed & BTRFS_BLOCK_GROUP_RAID0) >>> allowed = BTRFS_BLOCK_GROUP_RAID0; >>> >>> flags &= ~BTRFS_BLOCK_GROUP_PROFILE_MASK; >>> >>> So in the case above the profile will be RAID6. And in the general if >>> a RAID6 chunk is a filesystem, it wins ! >> >> That's arbitrary and doesn't make sense to me, IMO mkfs should save >> default profile in the super-block (which can be changed using ioctl) >> and kernel can create chunks based on the default profile. > > I'm working on this idea (storing the target profile in super-block). What about a per-subvolume profile? This comes up every now and then, like https://lore.kernel.org/linux-btrfs/cd82d247-5c95-18cd-a290-a911ff69613c@dirtcellar.net/ Maybe it could be a subvolume property? > Of > course this increase the consistency, but > doesn't prevent the possibility that a mixed profiles filesystem could > happen. And in this case is the user that > has to solve the issue. > > Zygo, suggested also to add a mixed profile warning to btrfs (prog). And > I agree with him. 
I think that we can use > the space info ioctl (which doesn't require root privileges). > > BR > G.Baroncelli > >> This >> approach also fixes chunk size inconsistency between progs and kernel >> as reported/fixed here >> https://patchwork.kernel.org/patch/11431405/ >> >> Thanks, Anand >> >>> But I am not sure.. Moreover I expected to see also reference to DUP >>> and/or RAID1C[34] ... >>> >>> Does someone have any suggestion ? >>> >>> BR >>> G.Baroncelli >>> >> > > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Question: how understand the raid profile of a btrfs filesystem
  2020-03-25  4:09     ` Andrei Borzenkov
@ 2020-03-25 17:14       ` Goffredo Baroncelli
  2020-03-26  3:10         ` Zygo Blaxell
  0 siblings, 1 reply; 21+ messages in thread

From: Goffredo Baroncelli @ 2020-03-25 17:14 UTC (permalink / raw)
To: Andrei Borzenkov; +Cc: Anand Jain, linux-btrfs, Zygo Blaxell

On 3/25/20 5:09 AM, Andrei Borzenkov wrote:
> 24.03.2020 20:59, Goffredo Baroncelli wrote:
>> On 3/24/20 5:55 AM, Anand Jain wrote:
>>> On 3/21/20 1:56 AM, Goffredo Baroncelli wrote:
>>>> Hi all,
>> [..]
>>>> Looking at the code it seems to me that the logic is the following
>>>> (from btrfs_reduce_alloc_profile())
>>>>
>>>>     if (allowed & BTRFS_BLOCK_GROUP_RAID6)
>>>>         allowed = BTRFS_BLOCK_GROUP_RAID6;
>>>>     else if (allowed & BTRFS_BLOCK_GROUP_RAID5)
>>>>         allowed = BTRFS_BLOCK_GROUP_RAID5;
>>>>     else if (allowed & BTRFS_BLOCK_GROUP_RAID10)
>>>>         allowed = BTRFS_BLOCK_GROUP_RAID10;
>>>>     else if (allowed & BTRFS_BLOCK_GROUP_RAID1)
>>>>         allowed = BTRFS_BLOCK_GROUP_RAID1;
>>>>     else if (allowed & BTRFS_BLOCK_GROUP_RAID0)
>>>>         allowed = BTRFS_BLOCK_GROUP_RAID0;
>>>>
>>>>     flags &= ~BTRFS_BLOCK_GROUP_PROFILE_MASK;
>>>>
>>>> So in the case above the profile will be RAID6. And in general, if
>>>> a RAID6 chunk is in a filesystem, it wins !
>>>
>>> That's arbitrary and doesn't make sense to me, IMO mkfs should save
>>> default profile in the super-block (which can be changed using ioctl)
>>> and kernel can create chunks based on the default profile.
>>
>> I'm working on this idea (storing the target profile in super-block).
>
> What about a per-subvolume profile? This comes up every now and then, like
>
> https://lore.kernel.org/linux-btrfs/cd82d247-5c95-18cd-a290-a911ff69613c@dirtcellar.net/
>
> Maybe it could be a subvolume property?

The idea is nice. However I fear the mess that it could cause.
Even now, with a simpler system where there is a "per filesystem"
profile, there are a lot of corner cases when something goes wrong (an
interrupted balance, or a failed disk). In case of multiple profiles on
a per-subvolume basis there is no simple answer in situations like:

- when I make a snapshot of a sub-volume, and then I change the profile
  of the original one, which is the profile of the files contained in
  the snapshot and in the original subvolume ?

Frankly speaking, if you want different profiles you need different
filesystems...

BR
G.Baroncelli

>
>> Of course this increases the consistency, but doesn't prevent the
>> possibility that a mixed-profiles filesystem could happen. And in this
>> case it is the user that has to solve the issue.
>>
>> Zygo suggested also to add a mixed-profile warning to btrfs (prog). And
>> I agree with him. I think that we can use the space info ioctl (which
>> doesn't require root privileges).
>>
>> BR
>> G.Baroncelli
>>
>>> This approach also fixes chunk size inconsistency between progs and
>>> kernel as reported/fixed here
>>> https://patchwork.kernel.org/patch/11431405/
>>>
>>> Thanks, Anand
>>>
>>>> But I am not sure.. Moreover I expected to see also reference to DUP
>>>> and/or RAID1C[34] ...
>>>>
>>>> Does someone have any suggestion ?
>>>>
>>>> BR
>>>> G.Baroncelli
>>>>
>>>
>>
>>
>

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: Question: how understand the raid profile of a btrfs filesystem
  2020-03-25 17:14       ` Goffredo Baroncelli
@ 2020-03-26  3:10         ` Zygo Blaxell
  0 siblings, 0 replies; 21+ messages in thread

From: Zygo Blaxell @ 2020-03-26  3:10 UTC (permalink / raw)
To: Goffredo Baroncelli; +Cc: Andrei Borzenkov, Anand Jain, linux-btrfs

On Wed, Mar 25, 2020 at 06:14:05PM +0100, Goffredo Baroncelli wrote:
> On 3/25/20 5:09 AM, Andrei Borzenkov wrote:
> > 24.03.2020 20:59, Goffredo Baroncelli wrote:
> > > On 3/24/20 5:55 AM, Anand Jain wrote:
> > > > On 3/21/20 1:56 AM, Goffredo Baroncelli wrote:
> > > > > Hi all,
> > > [..]
> > > > > Looking at the code it seems to me that the logic is the following
> > > > > (from btrfs_reduce_alloc_profile())
> > > > >
> > > > >     if (allowed & BTRFS_BLOCK_GROUP_RAID6)
> > > > >         allowed = BTRFS_BLOCK_GROUP_RAID6;
> > > > >     else if (allowed & BTRFS_BLOCK_GROUP_RAID5)
> > > > >         allowed = BTRFS_BLOCK_GROUP_RAID5;
> > > > >     else if (allowed & BTRFS_BLOCK_GROUP_RAID10)
> > > > >         allowed = BTRFS_BLOCK_GROUP_RAID10;
> > > > >     else if (allowed & BTRFS_BLOCK_GROUP_RAID1)
> > > > >         allowed = BTRFS_BLOCK_GROUP_RAID1;
> > > > >     else if (allowed & BTRFS_BLOCK_GROUP_RAID0)
> > > > >         allowed = BTRFS_BLOCK_GROUP_RAID0;
> > > > >
> > > > >     flags &= ~BTRFS_BLOCK_GROUP_PROFILE_MASK;
> > > > >
> > > > > So in the case above the profile will be RAID6. And in general, if
> > > > > a RAID6 chunk is in a filesystem, it wins !
> > > >
> > > > That's arbitrary and doesn't make sense to me, IMO mkfs should save
> > > > default profile in the super-block (which can be changed using ioctl)
> > > > and kernel can create chunks based on the default profile.
> > >
> > > I'm working on this idea (storing the target profile in super-block).
> >
> > What about a per-subvolume profile? This comes up every now and then, like
> >
> > https://lore.kernel.org/linux-btrfs/cd82d247-5c95-18cd-a290-a911ff69613c@dirtcellar.net/
> >
> > Maybe it could be a subvolume property?

...or inode.
> The idea is nice. However I fear the mess that it could cause. Even now,
> with a simpler system where there is a "per filesystem" profile, there
> are a lot of corner cases when something goes wrong (an interrupted
> balance, or a failed disk).

It can't be worse than qgroups.

(only half kidding)

Thinking aloud, you could even set up coarse-but-fast quotas that
way--limit the number of data block groups allocated to a subvol.
No sharing of block groups between subvols though, unless one subvol
is a snapshot of the other.  Also, limiting usage by block group
includes free space within the block group, so it's inaccurate
(i.e. coarse, effectively allocating space with multi-GB granularity
and large error bars).  If you have 20 users, and you want to give
them each about 400GB but don't really care if they get 390GB or
410GB, then maybe it's not so bad.

> In case of multiple profiles on a per-subvolume basis there is no
> simple answer in situations like:
> - when I make a snapshot of a sub-volume, and then I change the profile
>   of the original one, which is the profile of the files contained in
>   the snapshot and in the original subvolume ?

It shouldn't be different from compress: you look up either the inode
or the root, and it tells you what kind of extent you can allocate next.
Any existing data stays where it is until it is deleted (or overwritten
by CoW).

If you start cloning between subvols then things get a little interesting
(especially if you balance those afterwards) but not unsolvable if "when
two or more answers are possible, it's undefined which one btrfs picks"
is allowed in the solution.

You'd have the same problem with no-longer-allocatable block groups that
don't match the currently selected profile as you do now with mixed
block group profiles.  As the unallocatable block groups empty out, the
storage density of the used space within them goes up, space appears to
disappear, etc.
This is state #3, after all, and it would take some work to make btrfs
as happy in this state as it is in state #1.

> Frankly speaking, if you want different profiles you need different
> filesystems...

Well, there is that.  Keeping the status quo (or small modifications
thereof) is far easier to document, and it's not like we don't have a
huge list of RAID-related things to fix already.

> BR
> G.Baroncelli
>
> >
> > > Of course this increases the consistency, but doesn't prevent the
> > > possibility that a mixed-profiles filesystem could happen. And in
> > > this case it is the user that has to solve the issue.
> > >
> > > Zygo suggested also to add a mixed-profile warning to btrfs (prog). And
> > > I agree with him. I think that we can use the space info ioctl (which
> > > doesn't require root privileges).
> > >
> > > BR
> > > G.Baroncelli
> > >
> > > > This approach also fixes chunk size inconsistency between progs and
> > > > kernel as reported/fixed here
> > > > https://patchwork.kernel.org/patch/11431405/
> > > >
> > > > Thanks, Anand
> > > >
> > > > > But I am not sure.. Moreover I expected to see also reference to DUP
> > > > > and/or RAID1C[34] ...
> > > > >
> > > > > Does someone have any suggestion ?
> > > > >
> > > > > BR
> > > > > G.Baroncelli
> > >
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Question: how understand the raid profile of a btrfs filesystem
@ 2020-03-20 17:58 Goffredo Baroncelli
  0 siblings, 0 replies; 21+ messages in thread

From: Goffredo Baroncelli @ 2020-03-20 17:58 UTC (permalink / raw)
To: linux-btrfs

Hi all,

for a btrfs filesystem, how can a user understand which
{data,metadata,system} [raid] profile is in use ? E.g. which profile
will the next chunk have ?

For a simple filesystem it is easy, looking at the output of (e.g.)
"btrfs fi df" or "btrfs fi us". But what if the filesystem is not
simple ?

btrfs fi us t/.
Overall:
    Device size:                  40.00GiB
    Device allocated:             19.52GiB
    Device unallocated:           20.48GiB
    Device missing:                  0.00B
    Used:                         16.75GiB
    Free (estimated):             12.22GiB      (min: 8.27GiB)
    Data ratio:                       1.90
    Metadata ratio:                   2.00
    Global reserve:                9.06MiB      (used: 0.00B)

Data,single: Size:1.00GiB, Used:512.00MiB (50.00%)
   /dev/loop0      1.00GiB

Data,RAID5: Size:3.00GiB, Used:2.48GiB (82.56%)
   /dev/loop1      1.00GiB
   /dev/loop2      1.00GiB
   /dev/loop3      1.00GiB
   /dev/loop0      1.00GiB

Data,RAID6: Size:4.00GiB, Used:3.71GiB (92.75%)
   /dev/loop1      2.00GiB
   /dev/loop2      2.00GiB
   /dev/loop3      2.00GiB
   /dev/loop0      2.00GiB

Data,RAID1C3: Size:2.00GiB, Used:1.88GiB (93.76%)
   /dev/loop1      2.00GiB
   /dev/loop2      2.00GiB
   /dev/loop3      2.00GiB

Metadata,RAID1: Size:256.00MiB, Used:9.14MiB (3.57%)
   /dev/loop2    256.00MiB
   /dev/loop3    256.00MiB

System,RAID1: Size:8.00MiB, Used:16.00KiB (0.20%)
   /dev/loop2      8.00MiB
   /dev/loop3      8.00MiB

Unallocated:
   /dev/loop1      5.00GiB
   /dev/loop2      4.74GiB
   /dev/loop3      4.74GiB
   /dev/loop0      6.00GiB

This is an example of a strange but valid filesystem. So the question
is: which profile will the next chunk have ? Is there any way to
understand what will happen ?

I expected that the next chunk would be allocated with the profile of
the last "convert". However I discovered that this is not true.
Looking at the code it seems to me that the logic is the following
(from btrfs_reduce_alloc_profile())

    if (allowed & BTRFS_BLOCK_GROUP_RAID6)
        allowed = BTRFS_BLOCK_GROUP_RAID6;
    else if (allowed & BTRFS_BLOCK_GROUP_RAID5)
        allowed = BTRFS_BLOCK_GROUP_RAID5;
    else if (allowed & BTRFS_BLOCK_GROUP_RAID10)
        allowed = BTRFS_BLOCK_GROUP_RAID10;
    else if (allowed & BTRFS_BLOCK_GROUP_RAID1)
        allowed = BTRFS_BLOCK_GROUP_RAID1;
    else if (allowed & BTRFS_BLOCK_GROUP_RAID0)
        allowed = BTRFS_BLOCK_GROUP_RAID0;

    flags &= ~BTRFS_BLOCK_GROUP_PROFILE_MASK;

So in the case above the profile will be RAID6. And in general, if a
RAID6 chunk is in a filesystem, it wins ! But I am not sure.. Moreover
I expected to see also a reference to DUP and/or RAID1C[34] ...

Does someone have any suggestion ?

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 21+ messages in thread
end of thread, other threads:[~2020-03-26  3:11 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-20 17:56 Question: how understand the raid profile of a btrfs filesystem Goffredo Baroncelli
2020-03-21  3:29 ` Zygo Blaxell
2020-03-21  5:40   ` Andrei Borzenkov
2020-03-21  7:14     ` Zygo Blaxell
2020-03-21  9:55   ` Goffredo Baroncelli
2020-03-21 23:26     ` Zygo Blaxell
2020-03-22  8:34       ` Goffredo Baroncelli
2020-03-22  8:38         ` Goffredo Baroncelli
2020-03-22 23:49           ` Zygo Blaxell
2020-03-23 20:50             ` Goffredo Baroncelli
2020-03-23 22:48               ` Graham Cobb
2020-03-25  4:09                 ` Zygo Blaxell
2020-03-25  4:30                   ` Paul Jones
2020-03-26  2:51                     ` Zygo Blaxell
2020-03-23 23:18               ` Zygo Blaxell
2020-03-24  4:55 ` Anand Jain
2020-03-24 17:59   ` Goffredo Baroncelli
2020-03-25  4:09     ` Andrei Borzenkov
2020-03-25 17:14       ` Goffredo Baroncelli
2020-03-26  3:10         ` Zygo Blaxell
2020-03-20 17:58 Goffredo Baroncelli