* Array extremely unbalanced after convert to Raid5
@ 2021-05-05 13:41 Abdulla Bubshait
  2021-05-05 13:58 ` remi
  2021-05-05 14:49 ` Zygo Blaxell
  0 siblings, 2 replies; 7+ messages in thread
From: Abdulla Bubshait @ 2021-05-05 13:41 UTC (permalink / raw)
  To: linux-btrfs

I ran a balance to convert my data from the single profile to raid5.
Once it completed, the allocation was extremely unbalanced and doesn't
even make sense for raid5. I tried running a balance with a dlimit of
1000, but that just seems to make things worse.
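
Roughly, the commands were (mountpoint shown here as /fs):

        # convert data chunks from the single profile to raid5
        btrfs balance start -dconvert=raid5 /fs

        # later attempt to move data around, 1000 data chunks at a time
        btrfs balance start -dlimit=1000 /fs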


After the conversion the array looked like this:

btrfs fi show gives:
Label: 'horde'  uuid: 26debbc1-fdd0-4c3a-8581-8445b99c067c
       Total devices 4 FS bytes used 25.53TiB
       devid    1 size 16.37TiB used 2.36TiB path /dev/sdd
       devid    2 size 14.55TiB used 14.27TiB path /dev/sdc
       devid    3 size 12.73TiB used 12.69TiB path /dev/sdf
       devid    4 size 16.37TiB used 16.32TiB path /dev/sde

btrfs fi usage gives:
Overall:
   Device size:                  60.03TiB
   Device allocated:             45.64TiB
   Device unallocated:           14.39TiB
   Device missing:                  0.00B
   Used:                         45.59TiB
   Free (estimated):              8.08TiB      (min: 4.81TiB)
   Free (statfs, df):           410.67GiB
   Data ratio:                       1.78
   Metadata ratio:                   3.00
   Global reserve:              512.00MiB      (used: 80.00KiB)
   Multiple profiles:                  no

Data,RAID5: Size:25.51TiB, Used:25.50TiB (99.93%)
  /dev/sdd        2.33TiB
  /dev/sdc       14.23TiB
  /dev/sdf       12.66TiB
  /dev/sde       16.31TiB

Metadata,RAID1C3: Size:35.00GiB, Used:28.54GiB (81.55%)
  /dev/sdd       34.00GiB
  /dev/sdc       35.00GiB
  /dev/sdf       30.00GiB
  /dev/sde        6.00GiB

System,RAID1C3: Size:32.00MiB, Used:3.06MiB (9.57%)
  /dev/sdd       32.00MiB
  /dev/sdc       32.00MiB
  /dev/sde       32.00MiB

Unallocated:
  /dev/sdd       14.01TiB
  /dev/sdc      292.99GiB
  /dev/sdf       47.00GiB
  /dev/sde       53.00GiB

After doing some balance I currently have:
btrfs fi usage
Overall:
   Device size:                  60.03TiB
   Device allocated:             45.52TiB
   Device unallocated:           14.51TiB
   Device missing:                  0.00B
   Used:                         45.50TiB
   Free (estimated):              8.16TiB      (min: 4.85TiB)
   Free (statfs, df):           414.97GiB
   Data ratio:                       1.78
   Metadata ratio:                   3.00
   Global reserve:              512.00MiB      (used: 80.00KiB)
   Multiple profiles:                  no

Data,RAID5: Size:25.52TiB, Used:25.51TiB (99.96%)
  /dev/sdd        2.23TiB
  /dev/sdc       14.13TiB
  /dev/sdf       12.71TiB
  /dev/sde       16.37TiB

Metadata,RAID1C3: Size:29.00GiB, Used:28.51GiB (98.31%)
  /dev/sdd       29.00GiB
  /dev/sdc       29.00GiB
  /dev/sdf       27.00GiB
  /dev/sde        2.00GiB

System,RAID1C3: Size:32.00MiB, Used:3.03MiB (9.47%)
  /dev/sdd       32.00MiB
  /dev/sdc       32.00MiB
  /dev/sde       32.00MiB

Unallocated:
  /dev/sdd       14.12TiB
  /dev/sdc      404.99GiB
  /dev/sdf        1.00MiB
  /dev/sde        1.00MiB


So the estimated free space is 8TiB when it should be closer to 15. I
am guessing the free space would be better if the array were properly
balanced, but I am unsure how to balance it properly.


* Re: Array extremely unbalanced after convert to Raid5
  2021-05-05 13:41 Array extremely unbalanced after convert to Raid5 Abdulla Bubshait
@ 2021-05-05 13:58 ` remi
  2021-05-05 14:23   ` Abdulla Bubshait
  2021-05-05 14:51   ` Zygo Blaxell
  2021-05-05 14:49 ` Zygo Blaxell
  1 sibling, 2 replies; 7+ messages in thread
From: remi @ 2021-05-05 13:58 UTC (permalink / raw)
  To: Abdulla Bubshait, linux-btrfs



On Wed, May 5, 2021, at 9:41 AM, Abdulla Bubshait wrote:
> I ran a balance to convert my data from the single profile to raid5.
> Once it completed, the allocation was extremely unbalanced and doesn't
> even make sense for raid5. I tried running a balance with a dlimit of
> 1000, but that just seems to make things worse.

> 
> Unallocated:
>   /dev/sdd       14.12TiB
>   /dev/sdc      404.99GiB
>   /dev/sdf        1.00MiB
>   /dev/sde        1.00MiB
> 
> 


Sorry, I don't have a solution for you, but I want to point out that the
situation is far more critical than you seem to have realized: this
filesystem is now completely wedged. I would suggest adding another
device, or replacing either /dev/sdf or /dev/sde with something larger
(though, if those are real disks, I see that might be a challenge).

Your metadata is RAID1C3 (meaning 3 copies), but you only have 2 disks
with unallocated space. And thanks to the recent balancing, there is very
little free space left in the already allocated metadata, so effectively
the filesystem can no longer write any new metadata and will very quickly
hit out-of-space errors.


* Re: Array extremely unbalanced after convert to Raid5
  2021-05-05 13:58 ` remi
@ 2021-05-05 14:23   ` Abdulla Bubshait
  2021-05-05 14:51   ` Zygo Blaxell
  1 sibling, 0 replies; 7+ messages in thread
From: Abdulla Bubshait @ 2021-05-05 14:23 UTC (permalink / raw)
  To: remi; +Cc: linux-btrfs

On Wed, May 5, 2021 at 9:59 AM <remi@georgianit.com> wrote:

> Sorry, I don't have a solution for you, but I want to point out that
> the situation is far more critical than you seem to have realized: this
> filesystem is now completely wedged. I would suggest adding another
> device, or replacing either /dev/sdf or /dev/sde with something larger
> (though, if those are real disks, I see that might be a challenge).

I don't think they make disks larger than sde.

I can get some more space by offloading stuff from the array. But even
if I offload a TB and give myself some breathing room on those disks,
as soon as I run a balance the freed space will quickly fill up again,
leaving me once more without any unallocated space on 3 drives.

Right now I am lucky enough to have some space on sdc, but typically
after balancing I would have 3 disks with 1MB left and 1 disk with 14
TB unallocated. And I can't seem to get the balance to start using up
sdd and freeing up space from all the others.

As things stand I don't know how sde ever got filled up. Since it is
the largest of the disks, it should have a portion for which parity
cannot be placed on any other device, so part of it should remain
unallocated. At least that is what I could gather from the btrfs space
allocator site.


* Re: Array extremely unbalanced after convert to Raid5
  2021-05-05 13:41 Array extremely unbalanced after convert to Raid5 Abdulla Bubshait
  2021-05-05 13:58 ` remi
@ 2021-05-05 14:49 ` Zygo Blaxell
  2021-05-05 15:35   ` Abdulla Bubshait
  1 sibling, 1 reply; 7+ messages in thread
From: Zygo Blaxell @ 2021-05-05 14:49 UTC (permalink / raw)
  To: Abdulla Bubshait; +Cc: linux-btrfs

On Wed, May 05, 2021 at 09:41:49AM -0400, Abdulla Bubshait wrote:
> I ran a balance to convert my data from the single profile to raid5.
> Once it completed, the allocation was extremely unbalanced and doesn't
> even make sense for raid5. I tried running a balance with a dlimit of
> 1000, but that just seems to make things worse.

Balancing a full single array to raid5 requires a better device selection
algorithm than the kernel provides, especially if the disks are of
different sizes and were added to the array over a long period of time.
The kernel will strictly relocate the newest block groups first, which
may leave space occupied on some disks for most of the balance, and
cause many chunks to be created with suboptimal stripe width.

> After the conversion the array looked like this:
> 
> btrfs fi show gives:
> Label: 'horde'  uuid: 26debbc1-fdd0-4c3a-8581-8445b99c067c
>        Total devices 4 FS bytes used 25.53TiB
>        devid    1 size 16.37TiB used 2.36TiB path /dev/sdd
>        devid    2 size 14.55TiB used 14.27TiB path /dev/sdc
>        devid    3 size 12.73TiB used 12.69TiB path /dev/sdf
>        devid    4 size 16.37TiB used 16.32TiB path /dev/sde

For raid5 conversion, you need equal amounts of unallocated space on
each disk.  Convert sufficient raid5 block groups back to single profile
to redistribute the unallocated space:

	btrfs balance start -dconvert=single,devid=2,limit=4000 /fs

	btrfs balance start -dconvert=single,devid=3,limit=4000 /fs

	btrfs balance start -dconvert=single,devid=4,limit=4000 /fs

After this, each disk should have 3-4 TB of unallocated space on it
(devid 2-4 will have data moved to devid 1).  The important thing is to
have equal unallocated space--you can cancel balancing as soon as that
is achieved.  The limits above are higher than necessary to be sure
that happens.
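
To see when the unallocated space is roughly equal, you could watch the
per-device allocation between (or during) those runs and cancel once it
looks even enough, e.g.:

        # per-device summary; look at the Unallocated lines
        btrfs device usage /fs

        # stop the currently running balance early
        btrfs balance cancel /fs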

Now use the stripes filter to get rid of all chunks that have fewer
than the optimum number of stripes on each disk.  Cycle through these
commands until they report 0 chunks relocated (you can just leave these
running in a shell loop and check on it every few hours, when they get
to 0 they will just become no-ops):

	btrfs balance start -dlimit=100,convert=raid5,stripes=1..3,devid=3 /fs

	btrfs balance start -dlimit=100,convert=raid5,stripes=1..2,devid=2 /fs

	btrfs balance start -dlimit=100,convert=raid5,stripes=1..1,devid=1 /fs

	btrfs balance start -dlimit=100,convert=raid5,stripes=1..1,devid=4 /fs

The filters select chunks that have undesirable stripe counts and force
them into raid5 profile.  Single chunks have stripe count 1 and will
be converted to raid5.  RAID5 chunks have stripe count >1 and will be
relocated (converted from raid5 to raid5, but in a different location
with more disks in the chunk).  RAID5 chunks that already occupy the
correct number of drives will not be touched.

It is important to select chunks from every drive in turn in order to
keep some free space available on all disks.  Each command will spread
out 100 chunks from one disk over all the disks, making space on one
disk and filling all the others.  The balance must change to another
devid at regular intervals to ensure all disks maintain free space as
long as possible.
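
A minimal sketch of such a loop (it assumes the usual "Done, had to
relocate N out of M chunks" message from btrfs-progs; adjust the parsing
for your version):

        #!/bin/sh
        # Cycle the stripes-filter balances until a full pass relocates nothing.
        while :; do
            total=0
            for f in stripes=1..3,devid=3 stripes=1..2,devid=2 \
                     stripes=1..1,devid=1 stripes=1..1,devid=4; do
                out=$(btrfs balance start -dlimit=100,convert=raid5,$f /fs 2>&1)
                echo "$out"
                # parse "Done, had to relocate N out of M chunks"
                n=$(printf '%s\n' "$out" | sed -n 's/.*relocate \([0-9]*\) out of.*/\1/p')
                total=$((total + ${n:-0}))
            done
            [ "$total" -eq 0 ] && break
        done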

If you want to avoid converting to single profile then you might be able
to use only the balance commands in the second section; however, if you
run out of space on one or more drives then the balances will push data
around but be unable to make any progress on changing the chunk sizes.
In that case you will need to convert some raid5 back to single chunks
to continue.

> btrfs fi usage gives:
> Overall:
>    Device size:                  60.03TiB
>    Device allocated:             45.64TiB
>    Device unallocated:           14.39TiB
>    Device missing:                  0.00B
>    Used:                         45.59TiB
>    Free (estimated):              8.08TiB      (min: 4.81TiB)
>    Free (statfs, df):           410.67GiB
>    Data ratio:                       1.78
>    Metadata ratio:                   3.00
>    Global reserve:              512.00MiB      (used: 80.00KiB)
>    Multiple profiles:                  no
> 
> Data,RAID5: Size:25.51TiB, Used:25.50TiB (99.93%)
>   /dev/sdd        2.33TiB
>   /dev/sdc       14.23TiB
>   /dev/sdf       12.66TiB
>   /dev/sde       16.31TiB
> 
> Metadata,RAID1C3: Size:35.00GiB, Used:28.54GiB (81.55%)
>   /dev/sdd       34.00GiB
>   /dev/sdc       35.00GiB
>   /dev/sdf       30.00GiB
>   /dev/sde        6.00GiB
> 
> System,RAID1C3: Size:32.00MiB, Used:3.06MiB (9.57%)
>   /dev/sdd       32.00MiB
>   /dev/sdc       32.00MiB
>   /dev/sde       32.00MiB
> 
> Unallocated:
>   /dev/sdd       14.01TiB
>   /dev/sdc      292.99GiB
>   /dev/sdf       47.00GiB
>   /dev/sde       53.00GiB
> 
> After doing some balance I currently have:
> btrfs fi usage
> Overall:
>    Device size:                  60.03TiB
>    Device allocated:             45.52TiB
>    Device unallocated:           14.51TiB
>    Device missing:                  0.00B
>    Used:                         45.50TiB
>    Free (estimated):              8.16TiB      (min: 4.85TiB)
>    Free (statfs, df):           414.97GiB
>    Data ratio:                       1.78
>    Metadata ratio:                   3.00
>    Global reserve:              512.00MiB      (used: 80.00KiB)
>    Multiple profiles:                  no
> 
> Data,RAID5: Size:25.52TiB, Used:25.51TiB (99.96%)
>   /dev/sdd        2.23TiB
>   /dev/sdc       14.13TiB
>   /dev/sdf       12.71TiB
>   /dev/sde       16.37TiB
> 
> Metadata,RAID1C3: Size:29.00GiB, Used:28.51GiB (98.31%)
>   /dev/sdd       29.00GiB
>   /dev/sdc       29.00GiB
>   /dev/sdf       27.00GiB
>   /dev/sde        2.00GiB
> 
> System,RAID1C3: Size:32.00MiB, Used:3.03MiB (9.47%)
>   /dev/sdd       32.00MiB
>   /dev/sdc       32.00MiB
>   /dev/sde       32.00MiB
> 
> Unallocated:
>   /dev/sdd       14.12TiB
>   /dev/sdc      404.99GiB
>   /dev/sdf        1.00MiB
>   /dev/sde        1.00MiB
> 
> 
> So the estimated free space is 8TiB when it should be closer to 15. I
> am guessing the free space would be better if the array were properly
> balanced, but I am unsure how to balance it properly.


* Re: Array extremely unbalanced after convert to Raid5
  2021-05-05 13:58 ` remi
  2021-05-05 14:23   ` Abdulla Bubshait
@ 2021-05-05 14:51   ` Zygo Blaxell
  1 sibling, 0 replies; 7+ messages in thread
From: Zygo Blaxell @ 2021-05-05 14:51 UTC (permalink / raw)
  To: remi; +Cc: Abdulla Bubshait, linux-btrfs

On Wed, May 05, 2021 at 09:58:03AM -0400, remi@georgianit.com wrote:
> 
> 
> On Wed, May 5, 2021, at 9:41 AM, Abdulla Bubshait wrote:
> > I ran a balance to convert my data from the single profile to raid5.
> > Once it completed, the allocation was extremely unbalanced and doesn't
> > even make sense for raid5. I tried running a balance with a dlimit of
> > 1000, but that just seems to make things worse.
> 
> > 
> > Unallocated:
> >   /dev/sdd       14.12TiB
> >   /dev/sdc      404.99GiB
> >   /dev/sdf        1.00MiB
> >   /dev/sde        1.00MiB
> > 
> > 
> 
> 
> Sorry, I don't have a solution for you, but I want to point out that
> the situation is far more critical than you seem to have realized: this
> filesystem is now completely wedged. I would suggest adding another
> device, or replacing either /dev/sdf or /dev/sde with something larger
> (though, if those are real disks, I see that might be a challenge).
> 
> Your metadata is RAID1C3 (meaning 3 copies), but you only have 2 disks
> with unallocated space. And thanks to the recent balancing, there is
> very little free space left in the already allocated metadata, so
> effectively the filesystem can no longer write any new metadata and
> will very quickly hit out-of-space errors.

The situation isn't that dire.  Balancing one data chunk off of either
/dev/sdf or /dev/sde will resolve the issue.
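
For example, something along these lines should push one data chunk off
devid 3 (/dev/sdf in the earlier fi show output) and leave room for a
metadata chunk there (mountpoint assumed to be /fs):

        btrfs balance start -ddevid=3,limit=1 /fs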


* Re: Array extremely unbalanced after convert to Raid5
  2021-05-05 14:49 ` Zygo Blaxell
@ 2021-05-05 15:35   ` Abdulla Bubshait
  2021-05-06  4:32     ` Zygo Blaxell
  0 siblings, 1 reply; 7+ messages in thread
From: Abdulla Bubshait @ 2021-05-05 15:35 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On Wed, May 5, 2021 at 10:49 AM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> Balancing a full single array to raid5 requires a better device selection
> algorithm than the kernel provides, especially if the disks are of
> different sizes and were added to the array over a long period of time.
> The kernel will strictly relocate the newest block groups first, which
> may leave space occupied on some disks for most of the balance, and
> cause many chunks to be created with suboptimal stripe width.
>

Is this also true when running a full balance after the conversion to
raid5? Would it be able to optimize the stripe width, or would the
balance run into the same issue because the disks are full?


> Now use the stripes filter to get rid of all chunks that have fewer
> than the optimum number of stripes on each disk.  Cycle through these
> commands until they report 0 chunks relocated (you can just leave these
> running in a shell loop and check on it every few hours, when they get
> to 0 they will just become no-ops):
>
>         btrfs balance start -dlimit=100,convert=raid5,stripes=1..3,devid=3 /fs
>
>         btrfs balance start -dlimit=100,convert=raid5,stripes=1..2,devid=2 /fs
>
>         btrfs balance start -dlimit=100,convert=raid5,stripes=1..1,devid=1 /fs
>
>         btrfs balance start -dlimit=100,convert=raid5,stripes=1..1,devid=4 /fs
>
> The filters select chunks that have undesirable stripe counts and force
> them into raid5 profile.  Single chunks have stripe count 1 and will
> be converted to raid5.  RAID5 chunks have stripe count >1 and will be
> relocated (converted from raid5 to raid5, but in a different location
> with more disks in the chunk).  RAID5 chunks that already occupy the
> correct number of drives will not be touched.

That is what I was looking to do; I must have missed the stripes
filter. I think I can figure out a script that spreads the data enough.

But here is a question: at what point does the fs stop striping onto a
disk? Does it stop at 1MiB unallocated, and if so, does that cause
issues in practice if the need arises to allocate metadata chunks due to
raid1c3?


* Re: Array extremely unbalanced after convert to Raid5
  2021-05-05 15:35   ` Abdulla Bubshait
@ 2021-05-06  4:32     ` Zygo Blaxell
  0 siblings, 0 replies; 7+ messages in thread
From: Zygo Blaxell @ 2021-05-06  4:32 UTC (permalink / raw)
  To: Abdulla Bubshait; +Cc: linux-btrfs

On Wed, May 05, 2021 at 11:35:23AM -0400, Abdulla Bubshait wrote:
> On Wed, May 5, 2021 at 10:49 AM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > Balancing a full single array to raid5 requires a better device selection
> > algorithm than the kernel provides, especially if the disks are of
> > different sizes and were added to the array over a long period of time.
> > The kernel will strictly relocate the newest block groups first, which
> > may leave space occupied on some disks for most of the balance, and
> > cause many chunks to be created with suboptimal stripe width.
> >
> Is this also true when running a full balance after the conversion to
> raid5? Would it be able to optimize the stripe width, or would the
> balance run into the same issue because the disks are full?

Balancing all the data block groups in a single command will simply
restripe every chunk in reverse creation order, whether needed or not,
and get whatever space is available at the time for each chunk--and
possibly run out of space on filesystems with non-equal disk sizes.
If you run it enough times, it might eventually settle into a good state,
but it is not guaranteed.

Generally you should never do a full balance because a full balance
balances metadata, and you should never balance metadata because it
will lead to low-metadata-space conditions like the one you are in now.
The exceptions to the "never balance metadata" rule are when converting
from one raid profile to a different profile, or when permanently removing
a disk from the filesystem, and even then you should ensure there is a
lot of unallocated space available before starting a metadata balance.
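
For routine maintenance, a filtered data-only balance is usually all you
need; for example (the threshold is illustrative):

        # compact only data block groups that are at most 50% used;
        # metadata chunks are not touched
        btrfs balance start -dusage=50 /fs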

> > Now use the stripes filter to get rid of all chunks that have fewer
> > than the optimum number of stripes on each disk.  Cycle through these
> > commands until they report 0 chunks relocated (you can just leave these
> > running in a shell loop and check on it every few hours, when they get
> > to 0 they will just become no-ops):
> >
> >         btrfs balance start -dlimit=100,convert=raid5,stripes=1..3,devid=3 /fs
> >
> >         btrfs balance start -dlimit=100,convert=raid5,stripes=1..2,devid=2 /fs
> >
> >         btrfs balance start -dlimit=100,convert=raid5,stripes=1..1,devid=1 /fs
> >
> >         btrfs balance start -dlimit=100,convert=raid5,stripes=1..1,devid=4 /fs
> >
> > The filters select chunks that have undesirable stripe counts and force
> > them into raid5 profile.  Single chunks have stripe count 1 and will
> > be converted to raid5.  RAID5 chunks have stripe count >1 and will be
> > relocated (converted from raid5 to raid5, but in a different location
> > with more disks in the chunk).  RAID5 chunks that already occupy the
> > correct number of drives will not be touched.
> 
> That is what I was looking to do; I must have missed the stripes
> filter. I think I can figure out a script that spreads the data enough.
> 
> But here is a question: at what point does the fs stop striping onto a
> disk? Does it stop at 1MiB unallocated, and if so, does that cause
> issues in practice if the need arises to allocate metadata chunks due
> to raid1c3?

Allocators come in two groups:  those that fill the emptiest disks first
(raid1*, single, dup) and those that fill all disks equally (raid0,
raid5, raid6, raid10).  raid1c3 metadata has a 3-disk minimum, so it
will allocate all its space on the 3 largest disks (or most free space
if the array is unbalanced) and normally runs out of space when the 3rd
largest disk in the array fills up.  raid5 data has a 2-disk minimum,
but will try to fill all drives equally, and normally runs out of space
when the 2nd largest disk in the array fills up.

raid5 fills up devid 3 first, then 2, then 1 and 4.  raid1c3 fills up
devid 1, 4, and 2 first, then 3.  When raid5 fills up devid 2, raid1c3 is
out of space, so you will have effectively 2TB unusable--there will not be
enough metadata space to fill the last 2 TB on devid 1 and 4.  You will
also have additional complications due to being out of metadata space
without also being out of data space that can be hard to recover from.
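
Back-of-the-envelope, using the sizes from your fi show output:

        devid 1 (sdd)  16.37 TiB
        devid 2 (sdc)  14.55 TiB
        devid 3 (sdf)  12.73 TiB
        devid 4 (sde)  16.37 TiB

        raid5 keeps striping until devid 2 fills, leaving roughly
        16.37 - 14.55 = 1.82 TiB unallocated on each of devid 1 and 4.
        raid1c3 needs unallocated space on 3 devices, so at that point
        no new metadata chunk fits, and that last ~2 TiB per disk cannot
        be filled with data.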

You can fix that in a few different ways (rough example commands follow
the list):

	- convert metadata to raid1 (2 disk minimum, 1 failure tolerated,
	same as raid5, works with the 2 largest disks you have).

	- resize the 2 largest disks to be equal in size to the 3rd
	largest (will reduce filesystem capacity by 2 TB).  This ensures
	the 3rd largest disk will fill up at the same time as the two
	larger ones, so raid1c3 metadata can always be allocated until
	the filesystem is completely full.

	- replace a smaller disk with one matching the largest 2 disk
	sizes.	This is another way to make the top 3 disks the same
	size to satisfy the raid1c3 requirement.

	- mount -o metadata_ratio=20 (preallocates a lot of metadata
	space, equal in size to 5% of the data when normally <3% are
	needed).  Remember to never balance metadata or you'll lose
	this preallocation.  This enables maximum space usage and
	3-disk metadata redundancy, but it has a risk of failure
	if the metadata ratio turns out to be too low.
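
Rough examples of those commands (device paths and sizes are
placeholders; double-check before running):

        # 1. convert metadata chunks from raid1c3 to raid1
        btrfs balance start -mconvert=raid1 /fs

        # 2. shrink the two largest devices by roughly the difference to
        #    the 3rd largest (~1.82 TiB each here)
        btrfs filesystem resize 1:-1864G /fs
        btrfs filesystem resize 4:-1864G /fs

        # 3. replace the smallest disk with a larger one, then grow onto it
        btrfs replace start /dev/sdf /dev/NEWDISK /fs
        btrfs filesystem resize 3:max /fs

        # 4. preallocate extra metadata space via the metadata_ratio option
        mount -o remount,metadata_ratio=20 /fs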

