* RAID-1 refuses to balance large drive
@ 2016-03-23  0:47 Brad Templeton
  2016-03-23  4:01 ` Qu Wenruo
  0 siblings, 1 reply; 35+ messages in thread
From: Brad Templeton @ 2016-03-23  0:47 UTC (permalink / raw)
  To: linux-btrfs

I have a RAID 1, and was running a bit low, so replaced a 2TB drive with
a 6TB.  The other drives are a 3TB and a 4TB.    After switching the
drive, I did a balance and ... essentially nothing changed.  It did not
balance clusters over to the 6TB drive off of the other 2 drives.  I
found it odd, and wondered if it would do it as needed, but as time went
on, the filesys got full for real.

Making inquiries on the IRC channel, it was suggested perhaps the drives
were too full for a balance, but I would estimate they had at least 50GB
free when I swapped.    As a test, I added a 4th drive, a spare 20GB
partition, and did a balance.  The balance did indeed balance the 3
small drives, so they now each have 6GB unallocated, but the big drive
remained unchanged.   The balance reported it operated on almost all the
clusters, though.

Linux kernel 4.2.0 (Ubuntu Wily)

Label: 'butter'  uuid: a91755d4-87d8-4acd-ae08-c11e7f1f5438
        Total devices 4 FS bytes used 3.88TiB
        devid    1 size 3.62TiB used 3.62TiB path /dev/sdi2
        devid    2 size 2.73TiB used 2.72TiB path /dev/sdh
        devid    3 size 5.43TiB used 1.42TiB path /dev/sdg2
        devid    4 size 20.00GiB used 14.00GiB path /dev/sda1

btrfs fi usage /local

Overall:
    Device size:                  11.81TiB
    Device allocated:              7.77TiB
    Device unallocated:            4.04TiB
    Device missing:                  0.00B
    Used:                          7.76TiB
    Free (estimated):              2.02TiB      (min: 2.02TiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,RAID1: Size:3.87TiB, Used:3.87TiB
   /dev/sda1      14.00GiB
   /dev/sdg2       1.41TiB
   /dev/sdh        2.72TiB
   /dev/sdi2       3.61TiB

Metadata,RAID1: Size:11.00GiB, Used:9.79GiB
   /dev/sdg2       5.00GiB
   /dev/sdh        7.00GiB
   /dev/sdi2      10.00GiB

System,RAID1: Size:32.00MiB, Used:572.00KiB
   /dev/sdg2      32.00MiB
   /dev/sdi2      32.00MiB

Unallocated:
   /dev/sda1       6.00GiB
   /dev/sdg2       4.02TiB
   /dev/sdh        5.52GiB
   /dev/sdi2       7.36GiB

----------------------
btrfs fi df /local
Data, RAID1: total=3.87TiB, used=3.87TiB
System, RAID1: total=32.00MiB, used=572.00KiB
Metadata, RAID1: total=11.00GiB, used=9.79GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

I would have presumed that a balance would take blocks found on both the
3TB and 4TB, and move one of them over to the 6TB until all had 1.3TB of
unallocated space.  But this does not happen.  Any clues on how to make
it happen?




* Re: RAID-1 refuses to balance large drive
  2016-03-23  0:47 RAID-1 refuses to balance large drive Brad Templeton
@ 2016-03-23  4:01 ` Qu Wenruo
  2016-03-23  4:47   ` Brad Templeton
  0 siblings, 1 reply; 35+ messages in thread
From: Qu Wenruo @ 2016-03-23  4:01 UTC (permalink / raw)
  To: bradtem, linux-btrfs



Brad Templeton wrote on 2016/03/22 17:47 -0700:
> I have a RAID 1, and was running a bit low, so replaced a 2TB drive with
> a 6TB.  The other drives are a 3TB and a 4TB.    After switching the
> drive, I did a balance and ... essentially nothing changed.  It did not
> balance clusters over to the 6TB drive off of the other 2 drives.  I
> found it odd, and wondered if it would do it as needed, but as time went
> on, the filesys got full for real.

Did you resize the replaced device to max?
Without a resize, btrfs still considers that it can only use 2T of the 6T device.
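
For example, assuming the new disk is devid 3 and the filesystem is
mounted at /local (as in your output below), that would be something like:

    btrfs filesystem resize 3:max /local
    btrfs balance start /local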

Thanks,
Qu

>
> Making inquiries on the IRC channel, it was suggested perhaps the drives
> were too full for a balance, but they had at least 50gb free I would
> estimate, when I swapped.    As a test, I added a 4th drive, a spare
> 20gb partition and did a balance.  The balance did indeed balance the 3
> small drives, so they now each have 6gb unallocated, but the big drive
> remained unchanged.   The balance reported it operated on almost all the
> clusters, though.
>
> Linux kernel 4.2.0 (Ubuntu Wiley)
>
> Label: 'butter'  uuid: a91755d4-87d8-4acd-ae08-c11e7f1f5438
>          Total devices 4 FS bytes used 3.88TiB
>          devid    1 size 3.62TiB used 3.62TiB path /dev/sdi2
>          devid    2 size 2.73TiB used 2.72TiB path /dev/sdh
>          devid    3 size 5.43TiB used 1.42TiB path /dev/sdg2
>          devid    4 size 20.00GiB used 14.00GiB path /dev/sda1
>
> btrfs fi usage /local
>
> Overall:
>      Device size:                  11.81TiB
>      Device allocated:              7.77TiB
>      Device unallocated:            4.04TiB
>      Device missing:                  0.00B
>      Used:                          7.76TiB
>      Free (estimated):              2.02TiB      (min: 2.02TiB)
>      Data ratio:                       2.00
>      Metadata ratio:                   2.00
>      Global reserve:              512.00MiB      (used: 0.00B)
>
> Data,RAID1: Size:3.87TiB, Used:3.87TiB
>     /dev/sda1      14.00GiB
>     /dev/sdg2       1.41TiB
>     /dev/sdh        2.72TiB
>     /dev/sdi2       3.61TiB
>
> Metadata,RAID1: Size:11.00GiB, Used:9.79GiB
>     /dev/sdg2       5.00GiB
>     /dev/sdh        7.00GiB
>     /dev/sdi2      10.00GiB
>
> System,RAID1: Size:32.00MiB, Used:572.00KiB
>     /dev/sdg2      32.00MiB
>     /dev/sdi2      32.00MiB
>
> Unallocated:
>     /dev/sda1       6.00GiB
>     /dev/sdg2       4.02TiB
>     /dev/sdh        5.52GiB
>     /dev/sdi2       7.36GiB
>
> ----------------------
> btrfs fi df /local
> Data, RAID1: total=3.87TiB, used=3.87TiB
> System, RAID1: total=32.00MiB, used=572.00KiB
> Metadata, RAID1: total=11.00GiB, used=9.79GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> I would have presumed that a balance would take blocks found on both the
> 3TB and 4TB, and move one of them over to the 6TB until all had 1.3TB of
> unallocated space.  But this does not happen.  Any clues on how to make
> it happen?
>
>
>
>




* Re: RAID-1 refuses to balance large drive
  2016-03-23  4:01 ` Qu Wenruo
@ 2016-03-23  4:47   ` Brad Templeton
  2016-03-23  5:42     ` Chris Murphy
  0 siblings, 1 reply; 35+ messages in thread
From: Brad Templeton @ 2016-03-23  4:47 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

That's rather counter-intuitive behaviour.  In most FSs, resizes are
needed when you do things like change the size of an underlying
partition, or when you weren't using all of the partition.  When you add one
drive with device add, and you then remove another with device delete,
why and how would the added device know to size itself to the device
that you are planning to delete?   I.e. I don't see how it could know
(you add the new drive before even telling it you want to remove the old
one), and I also can't see a reason it would not use all of the drive you
tell it to add.

In any event, I did a btrfs fi resize 3:max /local on the 6TB as you
suggested, and have another balance running, but like all the others it
appears to be doing nothing, though of course it will take hours.  Are
you sure it works that way?  Even before the resize, as you see below,
it indicates the volume is 6TB with 4TB of unallocated space.  It is
only the df that says full (and the fact that there is no unallocated
space on the 3TB and 4TB drives).

On 03/22/2016 09:01 PM, Qu Wenruo wrote:
> 
> 
> Brad Templeton wrote on 2016/03/22 17:47 -0700:
>> I have a RAID 1, and was running a bit low, so replaced a 2TB drive with
>> a 6TB.  The other drives are a 3TB and a 4TB.    After switching the
>> drive, I did a balance and ... essentially nothing changed.  It did not
>> balance clusters over to the 6TB drive off of the other 2 drives.  I
>> found it odd, and wondered if it would do it as needed, but as time went
>> on, the filesys got full for real.
> 
> Did you resized the replaced deivces to max?
> Without resize, btrfs still consider it can only use 2T of the 6T devices.
> 
> Thanks,
> Qu
> 
>>
>> Making inquiries on the IRC channel, it was suggested perhaps the drives
>> were too full for a balance, but they had at least 50gb free I would
>> estimate, when I swapped.    As a test, I added a 4th drive, a spare
>> 20gb partition and did a balance.  The balance did indeed balance the 3
>> small drives, so they now each have 6gb unallocated, but the big drive
>> remained unchanged.   The balance reported it operated on almost all the
>> clusters, though.
>>
>> Linux kernel 4.2.0 (Ubuntu Wiley)
>>
>> Label: 'butter'  uuid: a91755d4-87d8-4acd-ae08-c11e7f1f5438
>>          Total devices 4 FS bytes used 3.88TiB
>>          devid    1 size 3.62TiB used 3.62TiB path /dev/sdi2
>>          devid    2 size 2.73TiB used 2.72TiB path /dev/sdh
>>          devid    3 size 5.43TiB used 1.42TiB path /dev/sdg2
>>          devid    4 size 20.00GiB used 14.00GiB path /dev/sda1
>>
>> btrfs fi usage /local
>>
>> Overall:
>>      Device size:                  11.81TiB
>>      Device allocated:              7.77TiB
>>      Device unallocated:            4.04TiB
>>      Device missing:                  0.00B
>>      Used:                          7.76TiB
>>      Free (estimated):              2.02TiB      (min: 2.02TiB)
>>      Data ratio:                       2.00
>>      Metadata ratio:                   2.00
>>      Global reserve:              512.00MiB      (used: 0.00B)
>>
>> Data,RAID1: Size:3.87TiB, Used:3.87TiB
>>     /dev/sda1      14.00GiB
>>     /dev/sdg2       1.41TiB
>>     /dev/sdh        2.72TiB
>>     /dev/sdi2       3.61TiB
>>
>> Metadata,RAID1: Size:11.00GiB, Used:9.79GiB
>>     /dev/sdg2       5.00GiB
>>     /dev/sdh        7.00GiB
>>     /dev/sdi2      10.00GiB
>>
>> System,RAID1: Size:32.00MiB, Used:572.00KiB
>>     /dev/sdg2      32.00MiB
>>     /dev/sdi2      32.00MiB
>>
>> Unallocated:
>>     /dev/sda1       6.00GiB
>>     /dev/sdg2       4.02TiB
>>     /dev/sdh        5.52GiB
>>     /dev/sdi2       7.36GiB
>>
>> ----------------------
>> btrfs fi df /local
>> Data, RAID1: total=3.87TiB, used=3.87TiB
>> System, RAID1: total=32.00MiB, used=572.00KiB
>> Metadata, RAID1: total=11.00GiB, used=9.79GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> I would have presumed that a balance would take blocks found on both the
>> 3TB and 4TB, and move one of them over to the 6TB until all had 1.3TB of
>> unallocated space.  But this does not happen.  Any clues on how to make
>> it happen?
>>
>>
>>
>>
> 


* Re: RAID-1 refuses to balance large drive
  2016-03-23  4:47   ` Brad Templeton
@ 2016-03-23  5:42     ` Chris Murphy
       [not found]       ` <56F22F80.501@gmail.com>
  0 siblings, 1 reply; 35+ messages in thread
From: Chris Murphy @ 2016-03-23  5:42 UTC (permalink / raw)
  To: bradtem; +Cc: Qu Wenruo, Btrfs BTRFS

On Tue, Mar 22, 2016 at 10:47 PM, Brad Templeton <bradtem@gmail.com> wrote:
> That's rather counter intuitive behaviour.  In most FSs, resizes are
> needed when you do things like change the size of an underlying
> partition, or you weren't using all the partition.  When you add one
> drive with device add, and you then remove another with device delete,
> why and how would the added device know to size itself to the device
> that you are planning to delete?   Ie. I don't see how it could know
> (you add the new drive before even telling it you want to remove the old
> one) and I also can't see a reason it would not use all the drive you
> tell it to add.
>
> In any event, I did a btrfs fi resize 3:max /local on the 6TB as you
> suggest, and have another balance running but it appears like all the
> others to be doing nothing, though of course it will take hours.  Are
> you sure it works that way?  Even before the resize, as you see below,
> it indicates the volume is 6TB with 4TB of unallocated space.  It is
> only the df that says full (and the fact that there is no unallocated
> space on the 3TB and 4TB drives.)


It does work that way, and I agree offhand that the lack of an
automatic resize to max is counter-intuitive. I'd think the user has
implicitly set the size they want by handing the device over to Btrfs,
be it a whole device, a partition or an LV. There might be some notes
in the mail archive, and possibly comments in btrfs-progs, that explain
the logic.

        devid    1 size 3.62TiB used 3.62TiB path /dev/sdi2
        devid    2 size 2.73TiB used 2.72TiB path /dev/sdh
        devid    3 size 5.43TiB used 1.42TiB path /dev/sdg2

Also note that after a successful balance the space will not be evenly
allocated, because the device sizes aren't even. Simplistically it'll do
something like this: put copy 1 of each chunk on devid 3 and copy 2 on
devid 1, until the free space on devid 1 is equal to the free space on
devid 2. Then it'll start alternating copy 2 between devids 1
and 2, while copy 1 continues to go to devid 3. That happens
until the free space on all three is equal, and then allocation alternates
among all three to try to maintain approximately equal free space
remaining.

You might find this helpful:
http://carfax.org.uk/btrfs-usage/
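
As a rough sanity check with the sizes from your 'btrfs fi show' output
(ignoring the 20G device), an ideal balance should leave roughly this
much unallocated on each of the three drives:

    echo '(3.62 + 2.73 + 5.43 - 2 * 3.88) / 3' | bc -l
    # ~1.34 TiB per drive, which matches the ~1.3TB you were expecting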



-- 
Chris Murphy


* Re: RAID-1 refuses to balance large drive
       [not found]       ` <56F22F80.501@gmail.com>
@ 2016-03-23  6:17         ` Chris Murphy
  2016-03-23 16:51           ` Brad Templeton
  0 siblings, 1 reply; 35+ messages in thread
From: Chris Murphy @ 2016-03-23  6:17 UTC (permalink / raw)
  To: bradtem, Btrfs BTRFS, Qu Wenruo

On Tue, Mar 22, 2016 at 11:54 PM, Brad Templeton <bradtem@gmail.com> wrote:
> Actually, the URL suggests that all the space will be used, which is
> what I had read about btrfs, that it handled this.

It will. But it does this by dominating writes to the devices that
have the most free space, until all devices have the same free space.


> But again, how could it possibly know to restrict the new device to only
> using 2TB?

In your case, before resizing it, it's just inheriting the size from
the device being replaced.

>
> Stage one:  Add the new 6TB device.  The 2TB device is still present.
>
> Stage two:  Remove the 2TB device.

OK this is confusing. In your first post you said replaced. That
suggests you used 'btrfs replace start' rather than 'btrfs device add'
followed by 'btrfs device remove'. So which did you do?

If you did the latter, then there's no resize necessary.


> The system copies everything on it
> to the device which has the most space, the empty 6TB device.  But you
> are saying it decides to _shrink_ the 6TB device now that we know it is
> a 2TB device being removed?

No I'm not. The source of confusion appears to be that you're
unfamiliar with 'btrfs replace', so by "replaced" you mean 'dev add'
followed by 'dev remove'.

This line:
        devid    3 size 5.43TiB used 1.42TiB path /dev/sdg2

suggests it's using the entire 6TB of the newly added drive, it's
already at max size.


> We didn't know the 2TB would be removed
> when we added the 6TB, so I just can't fathom why the code would do
> that.  In addition, the stats I get back say it didn't do that.

I don't understand the first part. Whether you asked for 'dev remove'
or you used 'replace' both of those mean removing some device. You
have to specify the device to be removed.

Now might be a good time to actually write out the exact commands you've used.


>
> More to the point, after the resize, the balance is still not changing
> any size numbers.  It should be moving blocks to the most empty device,
> should it not?    There is almost no space on devids 1 and 2, so it
> would not copy any chunks there.
>
> I'm starting to think this is a bug, but I'll keep plugging.

Could be a bug. A three-drive raid1 of different sizes is somewhat
uncommon, so it's possible it's hit an edge case somehow. Qu will know
more about how to find out why it's not allocating mostly to the
larger drive. The eventual workaround might end up being to convert
data chunks to single, then convert back to raid1. But before doing
that it'd be better to find out why it's not doing the right thing the
normal way.


-- 
Chris Murphy


* Re: RAID-1 refuses to balance large drive
  2016-03-23  6:17         ` Chris Murphy
@ 2016-03-23 16:51           ` Brad Templeton
  2016-03-23 18:34             ` Chris Murphy
  0 siblings, 1 reply; 35+ messages in thread
From: Brad Templeton @ 2016-03-23 16:51 UTC (permalink / raw)
  To: Chris Murphy, Btrfs BTRFS, Qu Wenruo

Thanks for the assist.  To reiterate what I said in private:

a) I am fairly sure I swapped drives by adding the 6TB drive and then
removing the 2TB drive, which would not have made the 6TB think it was
only 2TB.    The btrfs statistics commands have shown from the beginning
the size of the device as 6TB, and that after the remove, it had 4TB
unallocated.

b) Even if my memory is wrong and I did a replace (that's not even
documented in the wiki page on multiple devices, so I did not think I had
heard of it), I have since done a resize to "max" on all devices, and
still the balance moves nothing.   It says it processes almost all the
blocks it sees, but nothing changes.

So I am looking for other options; if people have commands I might
execute to diagnose this (as it seems to be a flaw in balance), let me know.

Some options remaining open to me:

a) I could re-add the 2TB device, which is still there.  Then balance
again, which hopefully would move a lot of stuff.   Then remove it again
and hopefully the new stuff would distribute mostly to the large drive.
 Then I could try balance again.

b) It was suggested I could (with a good backup) convert the drive to
non-RAID1 to free up tons of space and then re-convert.  What's the
precise procedure for that?  Perhaps I can do it with a limit to see how
it works as an experiment?   Any way to specifically target the blocks
that have their two copies on the 2 smaller drives for conversion?

c) Finally, I could take a full-full backup (my normal backups don't
bother with cached stuff and certain other things that you can recover)
and take the system down for a while to just wipe and restore the
volumes.  That doesn't find the bug, however.

On 03/22/2016 11:17 PM, Chris Murphy wrote:
> On Tue, Mar 22, 2016 at 11:54 PM, Brad Templeton <bradtem@gmail.com> wrote:
>> Actually, the URL suggests that all the space will be used, which is
>> what I had read about btrfs, that it handled this.
> 
> It will. But it does this by dominating writes to the devices that
> have the most free space, until all devices have the same free space.
> 
> 
>> But again, how could it possibly know to restrict the new device to only
>> using 2TB?
> 
> In your case, before resizing it, it's just inheriting the size from
> the device being replaced.
> 
>>
>> Stage one:  Add the new 6TB device.  The 2TB device is still present.
>>
>> Stage two:  Remove the 2TB device.
> 
> OK this is confusing. In your first post you said replaced. That
> suggests you used 'btrfs replace start' rather than 'btrfs device add'
> followed by 'btrfs device remove'. So which did you do?
> 
> If you did the latter, then there's no resize necessary.
> 
> 
>> The system copies everything on it
>> to the device which has the most space, the empty 6TB device.  But you
>> are saying it decides to _shrink_ the 6TB device now that we know it is
>> a 2TB device being removed?
> 
> No I'm not. The source of confusion appears to be that you're
> unfamiliar with 'btrfs replace' so you mean 'dev add' followed by 'dev
> remove' to mean replaced.
> 
> This line:
>         devid    3 size 5.43TiB used 1.42TiB path /dev/sdg2
> 
> suggests it's using the entire 6TB of the newly added drive, it's
> already at max size.
> 
> 
>> We didn't know the 2TB would be removed
>> when we added the 6TB, so I just can't fathom why the code would do
>> that.  In addition, the stats I get back say it didn't do that.
> 
> I don't understand the first part. Whether you asked for 'dev remove'
> or you used 'replace' both of those mean removing some device. You
> have to specify the device to be removed.
> 
> Now might be a good time to actually write out the exact commands you've used.
> 
> 
>>
>> More to the point, after the resize, the balance is still not changing
>> any size numbers.  It should be moving blocks to the most empty device,
>> should it not?    There is almost no space on devids 1 and 2, so it
>> would not copy any chunks there.
>>
>> I'm starting to think this is a bug, but I'll keep plugging.
> 
> Could be a bug. Three drive raid1 of different sizes is somewhat
> uncommon so it's possible it's hit an edge case somehow. Qu will know
> more about how to find out why it's not allocating mostly to the
> larger drive. The eventual work around might end up being to convert
> data chunks to single, then convert back to raid1. But before doing
> that it'd be better to find out why it's not doing the right thing the
> normal way.
> 
> 


* Re: RAID-1 refuses to balance large drive
  2016-03-23 16:51           ` Brad Templeton
@ 2016-03-23 18:34             ` Chris Murphy
  2016-03-23 19:10               ` Brad Templeton
                                 ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Chris Murphy @ 2016-03-23 18:34 UTC (permalink / raw)
  To: Brad Templeton; +Cc: Chris Murphy, Btrfs BTRFS, Qu Wenruo

On Wed, Mar 23, 2016 at 10:51 AM, Brad Templeton <bradtem@gmail.com> wrote:
> Thanks for assist.  To reiterate what I said in private:
>
> a) I am fairly sure I swapped drives by adding the 6TB drive and then
> removing the 2TB drive, which would not have made the 6TB think it was
> only 2TB.    The btrfs statistics commands have shown from the beginning
> the size of the device as 6TB, and that after the remove, it haad 4TB
> unallocated.

I agree this seems to be consistent with what's been reported.


>
> So I am looking for other options, or if people have commands I might
> execute to diagnose this (as it seems to be a flaw in balance) let me know.

What version of btrfs-progs is this? I'm vaguely curious what 'btrfs
check' reports (without --repair). Any version is OK but it's better
to use something fairly recent since the check code continues to
change a lot.
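
For example, run it against any one member device with the filesystem
unmounted, something like:

    btrfs check /dev/sdg2    ## read-only by default; don't pass --repair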

Another thing you could try is a newer kernel. Maybe there's a related
bug in 4.2.0. I think it may be more likely this is just an edge case
bug that's always been there, but it's valuable to know if recent
kernels exhibit the problem.

And before proceeding with a change in layout (converting to another
profile) I suggest taking an image of the metadata with btrfs-image;
it might come in handy for a developer.
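
Something along these lines should do it (the output path is just an
example; ideally run it with the filesystem unmounted, and -s scrubs
filenames if privacy is a concern):

    btrfs-image -c9 -s /dev/sdg2 /tmp/butter-metadata.img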



>
> Some options remaining open to me:
>
> a) I could re-add the 2TB device, which is still there.  Then balance
> again, which hopefully would move a lot of stuff.   Then remove it again
> and hopefully the new stuff would distribute mostly to the large drive.
>  Then I could try balance again.

Yeah, doing this will require -f to wipe the signature info from that
drive when you add it. But I don't think this is a case of needing
more free space; I think it might be due to the odd number of drives
that are also fairly different in size.

But then what happens when you delete the 2TB drive after the balance?
Do you end up right back in this same situation?



>
> b) It was suggested I could (with a good backup) convert the drive to
> non-RAID1 to free up tons of space and then re-convert.  What's the
> precise procedure for that?  Perhaps I can do it with a limit to see how
> it works as an experiment?   Any way to specifically target the blocks
> that have their two copies on the 2 smaller drives for conversion?

btrfs balance start -dconvert=single -mconvert=single -f /local
   ## you have to use -f to force reduction in redundancy
btrfs balance start -dconvert=raid1 -mconvert=raid1 /local

There is the devid= filter, but I'm not sure of the consequences of
limiting the conversion to two of the three devices; that's kinda
confusing, and it's sufficiently an edge case that I wonder how many
bugs you're looking to find today? :-)
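
On the "do it with a limit" idea: I haven't tried combining these
particular filters, but balance filters can be stacked with commas, so a
small trial conversion would look something like:

    btrfs balance start -dconvert=single,limit=10 -f /local   ## convert ~10 data chunks as a test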



> c) Finally, I could take a full-full backup (my normal backups don't
> bother with cached stuff and certain other things that you can recover)
> and take the system down for a while to just wipe and restore the
> volumes.  That doesn't find the bug, however.

I'd have the full backup no matter what choice you make. At any time
for any reason any filesystem can face plant without warning.

But yes this should definitely work or else you've definitely found a
bug. Finding the bug in your current scenario is harder because the
history of this volume makes it really non-deterministic whereas if
you start with a 3 disk volume at mkfs time, and then you reproduce
this problem, for sure it's a bug. And fairly straightforward to
reproduce.

I still recommend a newer kernel and progs though, just because
there's no work being done on 4.2 anymore. I suggest 4.4.6 and 4.4.1
progs. And then if you reproduce it, it's not just a bug, it's a
current bug.



-- 
Chris Murphy


* Re: RAID-1 refuses to balance large drive
  2016-03-23 18:34             ` Chris Murphy
@ 2016-03-23 19:10               ` Brad Templeton
  2016-03-23 19:27                 ` Alexander Fougner
                                   ` (2 more replies)
  2016-03-23 22:28               ` Duncan
  2016-03-24  7:08               ` Andrew Vaughan
  2 siblings, 3 replies; 35+ messages in thread
From: Brad Templeton @ 2016-03-23 19:10 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS, Qu Wenruo

It is Ubuntu Wily, which is 4.2 and btrfs-progs 0.4.  I will upgrade to
Xenial in April but probably not before; I don't have days to spend on
this.   Is there a fairly safe ppa to pull 4.4 or 4.5?  In olden days, I
would patch and build my kernels from source, but I just don't have time
for all the long-term sysadmin burden that creates any more.

Also, I presume if this is a bug, it's in btrfsprogs, though the new one
presumably needs a newer kernel too.

I am surprised to hear it said that having the mixed sizes is an odd
case.  That was actually one of the more compelling features of btrfs
that made me switch from mdadm, lvm and the rest.   I presumed most
people were the same. You need more space, you go out and buy a new
drive and of course the new drive is bigger than the old drives you
bought because they always get bigger.  Under mdadm the bigger drive
still helped, because it replaced at smaller drive, the one that was
holding the RAID back, but you didn't get to use all the big drive until
a year later when you had upgraded them all.  In the meantime you used
the extra space in other RAIDs.  (For example, a raid-5 plus a raid-1 on
the 2 bigger drives) Or you used the extra space as non-RAID space, ie.
space for static stuff that has offline backups.  In fact, most of my
storage is of that class (photo archives, reciprocal backups of other
systems) where RAID is not needed.

So the long story is, I think most home users are likely to always have
different sizes and want their FS to treat it well.

Since 6TB is a relatively new size, I wonder if that plays a role.  More
than 4TB of free space to balance into, could that confuse it?

Off to do a backup (good idea anyway.)



On 03/23/2016 11:34 AM, Chris Murphy wrote:
> On Wed, Mar 23, 2016 at 10:51 AM, Brad Templeton <bradtem@gmail.com> wrote:
>> Thanks for assist.  To reiterate what I said in private:
>>
>> a) I am fairly sure I swapped drives by adding the 6TB drive and then
>> removing the 2TB drive, which would not have made the 6TB think it was
>> only 2TB.    The btrfs statistics commands have shown from the beginning
>> the size of the device as 6TB, and that after the remove, it haad 4TB
>> unallocated.
> 
> I agree this seems to be consistent with what's been reported.
> 
> 
>>
>> So I am looking for other options, or if people have commands I might
>> execute to diagnose this (as it seems to be a flaw in balance) let me know.
> 
> What version of btrfs-progs is this? I'm vaguely curious what 'btrfs
> check' reports (without --repair). Any version is OK but it's better
> to use something fairly recent since the check code continues to
> change a lot.
> 
> Another thing you could try is a newer kernel. Maybe there's a related
> bug in 4.2.0. I think it may be more likely this is just an edge case
> bug that's always been there, but it's valuable to know if recent
> kernels exhibit the problem.
> 
> And before proceeding with a change in layout (converting to another
> profile) I suggest taking an image of the metadata with btrfs-image,
> it might come in handy for a developer.
> 
> 
> 
>>
>> Some options remaining open to me:
>>
>> a) I could re-add the 2TB device, which is still there.  Then balance
>> again, which hopefully would move a lot of stuff.   Then remove it again
>> and hopefully the new stuff would distribute mostly to the large drive.
>>  Then I could try balance again.
> 
> Yeah, to do this will require -f to wipe the signature info from that
> drive when you add it. But I don't think this is a case of needing
> more free space, I think it might be due to the odd number of drives
> that are also fairly different in size.
> 
> But then what happens when you delete the 2TB drive after the balance?
> Do you end up right back in this same situation?
> 
> 
> 
>>
>> b) It was suggested I could (with a good backup) convert the drive to
>> non-RAID1 to free up tons of space and then re-convert.  What's the
>> precise procedure for that?  Perhaps I can do it with a limit to see how
>> it works as an experiment?   Any way to specifically target the blocks
>> that have their two copies on the 2 smaller drives for conversion?
> 
> btrfs balance -dconvert=single -mconvert=single -f   ## you have to
> use -f to force reduction in redundancy
> btrfs balance -dconvert=raid1 -mconvert=raid1
> 
> There is the devid= filter but I'm not sure of the consequences of
> limiting the conversion to two of three devices, that's kinda
> confusing and is sufficiently an edge case I wonder how many bugs
> you're looking to find today? :-)
> 
> 
> 
>> c) Finally, I could take a full-full backup (my normal backups don't
>> bother with cached stuff and certain other things that you can recover)
>> and take the system down for a while to just wipe and restore the
>> volumes.  That doesn't find the bug, however.
> 
> I'd have the full backup no matter what choice you make. At any time
> for any reason any filesystem can face plant without warning.
> 
> But yes this should definitely work or else you've definitely found a
> bug. Finding the bug in your current scenario is harder because the
> history of this volume makes it really non-deterministic whereas if
> you start with a 3 disk volume at mkfs time, and then you reproduce
> this problem, for sure it's a bug. And fairly straightforward to
> reproduce.
> 
> I still recommend a newer kernel and progs though, just because
> there's no work being done on 4.2 anymore. I suggest 4.4.6 and 4.4.1
> progs. And then if you reproduce it, it's not just a bug, it's a
> current bug.
> 
> 
> 


* Re: RAID-1 refuses to balance large drive
  2016-03-23 19:10               ` Brad Templeton
@ 2016-03-23 19:27                 ` Alexander Fougner
  2016-03-23 19:33                 ` Chris Murphy
  2016-03-23 21:54                 ` Duncan
  2 siblings, 0 replies; 35+ messages in thread
From: Alexander Fougner @ 2016-03-23 19:27 UTC (permalink / raw)
  To: bradtem; +Cc: Chris Murphy, Btrfs BTRFS, Qu Wenruo

2016-03-23 20:10 GMT+01:00 Brad Templeton <bradtem@gmail.com>:
> It is Ubuntu wily, which is 4.2 and btrfs-progs 0.4.  I will upgrade to
> Xenial in April but probably not before, I don't have days to spend on
> this.   Is there a fairly safe ppa to pull 4.4 or 4.5?

Use the mainline ppa: http://kernel.ubuntu.com/~kernel-ppa/mainline/
Instructions: https://wiki.ubuntu.com/Kernel/MainlineBuilds

You can also find a newer btrfs-progs .deb here:
launchpad.net/ubuntu/+source/btrfs-tools

 In olden days, I
> would patch and build my kernels from source but I just don't have time
> for all the long-term sysadmin burden that creates any more.
>
> Also, I presume if this is a bug, it's in btrfsprogs, though the new one
> presumably needs a newer kernel too.
>
> I am surprised to hear it said that having the mixed sizes is an odd
> case.  That was actually one of the more compelling features of btrfs
> that made me switch from mdadm, lvm and the rest.   I presumed most
> people were the same. You need more space, you go out and buy a new
> drive and of course the new drive is bigger than the old drives you
> bought because they always get bigger.  Under mdadm the bigger drive
> still helped, because it replaced at smaller drive, the one that was
> holding the RAID back, but you didn't get to use all the big drive until
> a year later when you had upgraded them all.  In the meantime you used
> the extra space in other RAIDs.  (For example, a raid-5 plus a raid-1 on
> the 2 bigger drives) Or you used the extra space as non-RAID space, ie.
> space for static stuff that has offline backups.  In fact, most of my
> storage is of that class (photo archives, reciprocal backups of other
> systems) where RAID is not needed.
>
> So the long story is, I think most home users are likely to always have
> different sizes and want their FS to treat it well.
>
> Since 6TB is a relatively new size, I wonder if that plays a role.  More
> than 4TB of free space to balance into, could that confuse it?
>
> Off to do a backup (good idea anyway.)
>
>
>
> On 03/23/2016 11:34 AM, Chris Murphy wrote:
>> On Wed, Mar 23, 2016 at 10:51 AM, Brad Templeton <bradtem@gmail.com> wrote:
>>> Thanks for assist.  To reiterate what I said in private:
>>>
>>> a) I am fairly sure I swapped drives by adding the 6TB drive and then
>>> removing the 2TB drive, which would not have made the 6TB think it was
>>> only 2TB.    The btrfs statistics commands have shown from the beginning
>>> the size of the device as 6TB, and that after the remove, it haad 4TB
>>> unallocated.
>>
>> I agree this seems to be consistent with what's been reported.
>>
>>
>>>
>>> So I am looking for other options, or if people have commands I might
>>> execute to diagnose this (as it seems to be a flaw in balance) let me know.
>>
>> What version of btrfs-progs is this? I'm vaguely curious what 'btrfs
>> check' reports (without --repair). Any version is OK but it's better
>> to use something fairly recent since the check code continues to
>> change a lot.
>>
>> Another thing you could try is a newer kernel. Maybe there's a related
>> bug in 4.2.0. I think it may be more likely this is just an edge case
>> bug that's always been there, but it's valuable to know if recent
>> kernels exhibit the problem.
>>
>> And before proceeding with a change in layout (converting to another
>> profile) I suggest taking an image of the metadata with btrfs-image,
>> it might come in handy for a developer.
>>
>>
>>
>>>
>>> Some options remaining open to me:
>>>
>>> a) I could re-add the 2TB device, which is still there.  Then balance
>>> again, which hopefully would move a lot of stuff.   Then remove it again
>>> and hopefully the new stuff would distribute mostly to the large drive.
>>>  Then I could try balance again.
>>
>> Yeah, to do this will require -f to wipe the signature info from that
>> drive when you add it. But I don't think this is a case of needing
>> more free space, I think it might be due to the odd number of drives
>> that are also fairly different in size.
>>
>> But then what happens when you delete the 2TB drive after the balance?
>> Do you end up right back in this same situation?
>>
>>
>>
>>>
>>> b) It was suggested I could (with a good backup) convert the drive to
>>> non-RAID1 to free up tons of space and then re-convert.  What's the
>>> precise procedure for that?  Perhaps I can do it with a limit to see how
>>> it works as an experiment?   Any way to specifically target the blocks
>>> that have their two copies on the 2 smaller drives for conversion?
>>
>> btrfs balance -dconvert=single -mconvert=single -f   ## you have to
>> use -f to force reduction in redundancy
>> btrfs balance -dconvert=raid1 -mconvert=raid1
>>
>> There is the devid= filter but I'm not sure of the consequences of
>> limiting the conversion to two of three devices, that's kinda
>> confusing and is sufficiently an edge case I wonder how many bugs
>> you're looking to find today? :-)
>>
>>
>>
>>> c) Finally, I could take a full-full backup (my normal backups don't
>>> bother with cached stuff and certain other things that you can recover)
>>> and take the system down for a while to just wipe and restore the
>>> volumes.  That doesn't find the bug, however.
>>
>> I'd have the full backup no matter what choice you make. At any time
>> for any reason any filesystem can face plant without warning.
>>
>> But yes this should definitely work or else you've definitely found a
>> bug. Finding the bug in your current scenario is harder because the
>> history of this volume makes it really non-deterministic whereas if
>> you start with a 3 disk volume at mkfs time, and then you reproduce
>> this problem, for sure it's a bug. And fairly straightforward to
>> reproduce.
>>
>> I still recommend a newer kernel and progs though, just because
>> there's no work being done on 4.2 anymore. I suggest 4.4.6 and 4.4.1
>> progs. And then if you reproduce it, it's not just a bug, it's a
>> current bug.
>>
>>
>>


* Re: RAID-1 refuses to balance large drive
  2016-03-23 19:10               ` Brad Templeton
  2016-03-23 19:27                 ` Alexander Fougner
@ 2016-03-23 19:33                 ` Chris Murphy
  2016-03-24  1:59                   ` Qu Wenruo
  2016-03-25 13:16                   ` Patrik Lundquist
  2016-03-23 21:54                 ` Duncan
  2 siblings, 2 replies; 35+ messages in thread
From: Chris Murphy @ 2016-03-23 19:33 UTC (permalink / raw)
  To: Brad Templeton; +Cc: Chris Murphy, Btrfs BTRFS, Qu Wenruo

On Wed, Mar 23, 2016 at 1:10 PM, Brad Templeton <bradtem@gmail.com> wrote:
> It is Ubuntu wily, which is 4.2 and btrfs-progs 0.4.  I will upgrade to
> Xenial in April but probably not before, I don't have days to spend on
> this.   Is there a fairly safe ppa to pull 4.4 or 4.5?

I'm not sure.


 In olden days, I
> would patch and build my kernels from source but I just don't have time
> for all the long-term sysadmin burden that creates any more.
>
> Also, I presume if this is a bug, it's in btrfsprogs, though the new one
> presumably needs a newer kernel too.

No, you can mix and match progs and kernel versions. You just don't get
new features if you don't have a new kernel.

But the issue is the balance code is all in the kernel. It's activated
by user space tools but it's all actually done by kernel code.



> I am surprised to hear it said that having the mixed sizes is an odd
> case.

Not odd as in wrong, just uncommon compared to other arrangements being tested.

>  That was actually one of the more compelling features of btrfs
> that made me switch from mdadm, lvm and the rest.   I presumed most
> people were the same. You need more space, you go out and buy a new
> drive and of course the new drive is bigger than the old drives you
> bought because they always get bigger.

Of course and I'm not saying it shouldn't work. The central problem
here is we don't even know what the problem really is; we only know
the manifestation of the problem isn't the desired or expected
outcome. And how to find out the cause is different than how to fix
it.



> Under mdadm the bigger drive
> still helped, because it replaced at smaller drive, the one that was
> holding the RAID back, but you didn't get to use all the big drive until
> a year later when you had upgraded them all.  In the meantime you used
> the extra space in other RAIDs.  (For example, a raid-5 plus a raid-1 on
> the 2 bigger drives) Or you used the extra space as non-RAID space, ie.
> space for static stuff that has offline backups.  In fact, most of my
> storage is of that class (photo archives, reciprocal backups of other
> systems) where RAID is not needed.
>
> So the long story is, I think most home users are likely to always have
> different sizes and want their FS to treat it well.

Yes of course. And at the expense of getting a frownie face....

"Btrfs is under heavy development, and is not suitable for
any uses other than benchmarking and review."
https://www.kernel.org/doc/Documentation/filesystems/btrfs.txt

Despite that disclosure, what you're describing is not what I'd expect
and not what I've previously experienced. But I haven't had three
different sized drives, and they weren't particularly full, and I
don't know if you started with three from the outset at mkfs time or
if this is the result of two drives with a third added on later, etc.
So the nature of file systems is actually really complicated and it's
normal for there to be regressions - and maybe this is a regression,
hard to say with available information.



> Since 6TB is a relatively new size, I wonder if that plays a role.  More
> than 4TB of free space to balance into, could that confuse it?

Seems unlikely.


-- 
Chris Murphy


* Re: RAID-1 refuses to balance large drive
  2016-03-23 19:10               ` Brad Templeton
  2016-03-23 19:27                 ` Alexander Fougner
  2016-03-23 19:33                 ` Chris Murphy
@ 2016-03-23 21:54                 ` Duncan
  2 siblings, 0 replies; 35+ messages in thread
From: Duncan @ 2016-03-23 21:54 UTC (permalink / raw)
  To: linux-btrfs

Brad Templeton posted on Wed, 23 Mar 2016 12:10:29 -0700 as excerpted:

> It is Ubuntu wily, which is 4.2 and btrfs-progs 0.4.

Presumably that's a typo for btrfs-progs.  Either that or Ubuntu's using 
a versioning that's totally different than upstream btrfs.  For some time 
now (since the 3.12 release, ancient history in btrfs terms), btrfs-progs 
has been release version synced with the kernel.  So the latest release 
is 4.5.0, to match the kernel 4.5.0 that came out shortly before that 
userspace release and that was developed at the same time.  Before that 
was 4.4.1, a primarily bugfix release to the previous 4.4.0.

Before 3.12, the previous actual userspace release, extremely stale by 
that point, was 0.19, tho there was a 0.20-rc1 release, that wasn't 
followed up with a 0.20 full release.  The recommendation back then was 
to run and for distros to ship git snapshots.

So where 0.4 came from I've not the foggiest, unless as I said it's a 
typo, perhaps for 4.0.
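
Easy enough to check, e.g.:

    btrfs --version    # prints the installed btrfs-progs version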

> I will upgrade to
> Xenial in April but probably not before, I don't have days to spend on
> this.   Is there a fairly safe ppa to pull 4.4 or 4.5?  In olden days, I
> would patch and build my kernels from source but I just don't have time
> for all the long-term sysadmin burden that creates any more.

Heh, this posting is from a gentooer, who builds /everything/ from 
sources. =:^)  Tho that's not really a problem as it can go on in the 
background and thus takes little actual attention time.

The real time is in figuring out what I need to know about what has 
changed between versions and if/how that needs to affect my existing 
config, but that's time that needs to be spent regardless of the distro.  
The major question is whether you run a rolling distro, and thus spend 
that time a bit here and a bit there as the various components upgrade, 
with a better chance of actually nailing down a problem to a specific 
package upgrade when there are issues, or do it all in one huge version 
upgrade, which pretty much leaves you high and dry in terms of fixing 
problems, since the entire world changes at once and it's thus nearly 
impossible to pin a bug to a particular package upgrade.


But meanwhile, as CMurphy says at the expense of a frowny face...

Given that btrfs is still maturing, and /not/ yet entirely stable and 
mature, and the fact that the list emphasis is on mainline, the list 
kernel recommendation is to follow one of two tracks, either mainline 
current, or mainline LTS.

If you choose the mainline current track, the recommendation is to stay 
within the latest two current kernel series.  With 4.5 out, that means 
you should be on 4.4 at least.  Previous non-LTS kernel series no longer 
get patch backports, at least from mainline, and as we focus on mainline 
here, we're not tracking what distros may or may not backport on their 
own, so we simply can't provide the same level of support.

For the LTS kernel track, the recommendation has recently relaxed slightly.  
Previously, it was again to stick with the latest two kernel LTS series, 
which would be 4.4 and 4.1.  However, the one previous to that was 3.18, 
and it has been reasonably stable, certainly more so than those before it.  
So while 4.1 or 4.4 is still what we really like to see, we recognize 
that some will be sticking to 3.18 and are continuing to try to support 
them as well, now that the LTS 4.4 has pushed it out of the primary 
recommended range.  But anything previous to that really isn't supported.

Not that we won't do best-effort, regardless, but in many instances, the 
best recommendation we can make with out-of-support kernels really is to 
upgrade to something more current, and try again.

Meanwhile, yes, we do recognize that distros have chosen to support btrfs 
on kernels outside that list.  But as I said, we don't track what patches 
the distros may or may not have backported, and thus aren't in a 
particularly good position to provide support for them.  The distros 
themselves, having chosen to provide that support, are in a far better 
position to do just that, since they know what they've backported and 
what they haven't.  So in that case, the best we can do is refer you to 
the distros whose support you are nominally relying on, to actually 
provide that support.

And obviously, kernel 4.2 isn't one of the ones named above.  It's 
neither a mainstream LTS, nor any longer within the last two current 
kernel releases.

So kernel upgrade, however you choose to do it, is strongly recommended, 
with two other alternatives if you prefer:

1) Ask your distro for support of versions off the mainline support 
list.  After all, they're the ones claiming to support the known to be 
not entirely stabilized and ready for production use btrfs on non-
mainline-LTS kernels long after mainline support for those non-LTS 
kernels has been dropped.

2) Choose a filesystem that better matches your needs, presumably because 
it /is/ fully mature and stable, and thus is properly supported on older 
kernels outside the relatively narrow range of btrfs-list recommended 
kernels.


As for userspace, as explained above, in most cases for online and 
generally operational btrfs, it's the kernel code that counts.  Userspace 
is important in three cases, however: (1) when you're first creating the 
filesystem (mkfs.btrfs), (2) if you need relatively new features that 
older userspace doesn't have the kernel-code calls to support, and (3) 
when the filesystem has problems and you're trying to fix them with btrfs 
check and the other offline tools, or you're simply trying to get what 
you can off the (presumably unmountable) filesystem using btrfs restore, 
before giving up on it entirely.

So for normal use, the btrfs userspace version isn't as critical, until it 
gets so old that translating from newer call syntax to older syntax, or 
between output formats, becomes a problem.  But once your btrfs won't 
mount properly and you're either trying to fix it or recover files off 
it, /then/ userspace becomes critical, as the newer versions can deal 
with more problems than older versions can.

Meanwhile, newer btrfs-progs userspace is always designed to be able to 
handle older kernels as well.

So a good rule of thumb for userspace is to run at least the latest 
userspace release from the series matching your kernel version (with the 
short period after kernel release before the corresponding userspace 
release excepted, of course, if you're running /that/ close to current).  
As long as you stay within kernel recommendations, that will keep your 
userspace within reason as well.


So a 4.2 kernel isn't supported (on list, but you can of course refer to 
your distro instead, if they support it) as it's out of the current 
kernel support range and isn't an LTS, and upgrading to 4.4 LTS is 
recommended.  Alternatively, you may wish to downgrade to kernel 4.1, 
which is actually an LTS kernel and remains well supported as such.

And once you're running a supported kernel, ensure that your btrfs-progs 
is the latest of that userspace series, or newer, and you should be good 
to go. =:^)

Again, with the alternatives being either getting support from your 
distro if they're supporting btrfs on versions outside of those supported 
on-list, or switching to a filesystem that better matches your use-case 
in terms of stability and longer term support.

> Also, I presume if this is a bug, it's in btrfsprogs, though the new one
> presumably needs a newer kernel too.

Balance, like most "online" btrfs code, is primarily kernel.  All 
userspace does is call the appropriate kernel code to do the actual work.

So the problem here is almost certainly kernel.


Meanwhile, I have an idea of what /might/ be your balance problem, but I 
want to cover it in a separate reply.  Suffice it to say here that the 
news isn't great, if this is your issue, as it's a known but somewhat 
rare problem that has yet to be properly traced down and fixed.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: RAID-1 refuses to balance large drive
  2016-03-23 18:34             ` Chris Murphy
  2016-03-23 19:10               ` Brad Templeton
@ 2016-03-23 22:28               ` Duncan
  2016-03-24  7:08               ` Andrew Vaughan
  2 siblings, 0 replies; 35+ messages in thread
From: Duncan @ 2016-03-23 22:28 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Wed, 23 Mar 2016 12:34:10 -0600 as excerpted:

> On Wed, Mar 23, 2016 at 10:51 AM, Brad Templeton <bradtem@gmail.com>
> wrote:
>> Thanks for assist.  To reiterate what I said in private:
>>
>> a) I am fairly sure I swapped drives by adding the 6TB drive and then
>> removing the 2TB drive, which would not have made the 6TB think it was
>> only 2TB.    The btrfs statistics commands have shown from the
>> beginning the size of the device as 6TB, and that after the remove, it
>> haad 4TB unallocated.
> 
> I agree this seems to be consistent with what's been reported.

Chris, and Hugo too as the one with the most experience with this, on IRC 
and privately as well as on-list.

Is this possibly another instance of that persistent mystery bug where 
btrfs pretty much refuses to allocate new chunks despite there being all 
sorts of room for it to do so?  That one seems just rare enough that, 
without any known method of replication, it keeps getting backburnered by 
more urgent issues when devs try to properly investigate and trace it 
down, while being persistent over many kernels now and just common 
enough, with just enough common characteristics among those affected, to 
be considered a single, now recognized, bug.

If it's the same bug here, it seems to be affecting only the new 6 TB 
device, not the older and smaller devices, and I'm not sure whether it 
has manifested in that sort of device-exclusive form before.  That, along 
with the facts that there's no fix known and that Hugo seems to be the 
only one with enough experience with the bug to reasonably 
authoritatively consider it the same bug, has me reluctant to actually 
label it as such here.

But I can certainly ask the question, and I've not yet seen it suggested 
as the ultimate bug we're facing in this thread, so...


If Hugo (or Chris if he's seen enough more instances of this bug recently 
to reasonably reliably say) doesn't post something more authoritative...

If this is indeed /that/ bug, then most efforts to fix it won't directly 
fix it at all.  Rebalancing to single, and then back to raid1, /might/ 
eliminate it... or not; I simply don't have enough experience 
troubleshooting this bug to know whether others have tried that and what 
their results were (tho I'd guess Hugo would have suggested it, where 
people weren't dealing with a single affected device only, anyway, and 
might know the results).

The one known way to eliminate the bug is to back everything up, blow 
away the filesystem and recreate it.  Tho AFAIK, in one instance at 
least, the new btrfs ended up having the same bug.  But I believe for 
most, it does get rid of it.  Luckily in the OP's case, the filesystem 
has evolved over time, so chances are that the bug won't appear on the 
new btrfs, created from the start with all the devices intended for it 
currently.  It /might/ reappear with time, but I'd hope it'd only appear 
sometime later, after another device upgrade or two, at least.

Of course, that's assuming it's either this bug, or another one that's 
fixed by starting over with a newly created filesystem with all currently 
intended devices included in the mkfs.btrfs.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: RAID-1 refuses to balance large drive
  2016-03-23 19:33                 ` Chris Murphy
@ 2016-03-24  1:59                   ` Qu Wenruo
  2016-03-24  2:13                     ` Brad Templeton
  2016-03-25 13:16                   ` Patrik Lundquist
  1 sibling, 1 reply; 35+ messages in thread
From: Qu Wenruo @ 2016-03-24  1:59 UTC (permalink / raw)
  To: Chris Murphy, Brad Templeton; +Cc: Btrfs BTRFS



Chris Murphy wrote on 2016/03/23 13:33 -0600:
> On Wed, Mar 23, 2016 at 1:10 PM, Brad Templeton <bradtem@gmail.com> wrote:
>> It is Ubuntu wily, which is 4.2 and btrfs-progs 0.4.  I will upgrade to
>> Xenial in April but probably not before, I don't have days to spend on
>> this.   Is there a fairly safe ppa to pull 4.4 or 4.5?
>
> I'm not sure.
>
>
>   In olden days, I
>> would patch and build my kernels from source but I just don't have time
>> for all the long-term sysadmin burden that creates any more.
>>
>> Also, I presume if this is a bug, it's in btrfsprogs, though the new one
>> presumably needs a newer kernel too.
>
> No you can mix and match progs and kernel versions. You just don't get
> new features if you don't have a new kernel.
>
> But the issue is the balance code is all in the kernel. It's activated
> by user space tools but it's all actually done by kernel code.
>
>
>
>> I am surprised to hear it said that having the mixed sizes is an odd
>> case.
>
> Not odd as in wrong, just uncommon compared to other arrangements being tested.
>
>>   That was actually one of the more compelling features of btrfs
>> that made me switch from mdadm, lvm and the rest.   I presumed most
>> people were the same. You need more space, you go out and buy a new
>> drive and of course the new drive is bigger than the old drives you
>> bought because they always get bigger.
>
> Of course and I'm not saying it shouldn't work. The central problem
> here is we don't even know what the problem really is; we only know
> the manifestation of the problem isn't the desired or expected
> outcome. And how to find out the cause is different than how to fix
> it.

Regarding the chunk allocation problem, I'd like to get a clear view of 
the whole disk layout first.

What's the final disk layout?
Is it a 4T + 3T + 6T + 20G layout?

If so, then only a full re-convert to single may help, as there is not 
enough unallocated space to allocate new raid1 chunks to balance 
them all.


As Chris Murphy may have already mentioned, btrfs chunk allocation has 
some limitations, although it is already more flexible than mdadm.


Btrfs chunk allocation will choose the device with the most unallocated 
space, and for raid1 it will always pick 2 different devices to allocate 
from.

This allocator does let btrfs raid1 use more of the available space, in 
a more flexible way than mdadm raid1.
But that only works if you fill the filesystem from scratch.

I'll explain that case first.

1) 6T and 4T devices only stage: allocate 1T of raid1 chunks.
    As the 6T and 4T devices have the most unallocated space, the first
    1T of raid1 chunks will be allocated from them.
    Remaining space: 3/3/5

2) 6T and 3/4 switching stage: allocate 4T of raid1 chunks.
    After stage 1) we have 3/3/5 remaining space, so btrfs will take
    space from the 5T remaining (the 6T device) and alternate the second
    copy between the other two devices with 3T remaining each.

    That brings the remaining space to 1/1/1.

3) Fake-even allocation stage: allocate 1T of raid1 chunks.
    Now all devices have the same unallocated space, but with 3 devices
    we can't really balance all chunks across them.
    As we must and will only select 2 devices, in this stage there will
    be 1T left unallocated that can never be used.

In total, you get 1 + 4 + 1 = 6T of usable raid1 capacity, still smaller 
than (3 + 4 + 6) / 2 = 6.5T.

Now let's talk about your 3 + 4 + 6 case.

In your initial state, the 3T and 4T devices are already filled up.
Even though your 6T device has about 4T of available space, it's only 1 
device, not the 2 which raid1 needs.

So there is no space for balance to allocate a new raid1 chunk.  The 
extra 20G is so small that it makes almost no difference.


Converting to single and then back to raid1 will do its job, at least 
partly.  But according to another report from the mailing list, the 
result won't be perfectly even, even when the reporter used devices of 
all the same size.


So to conclude:

1) Btrfs will use most of the devices' space for raid1.
2) But 1) only happens if one fills the btrfs from scratch.
3) For the already-filled case, converting to single and then converting
    back will work, but not perfectly.  (See the sketch below.)
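
For reference, a rough sketch of the commands involved (assuming the 
filesystem is mounted at /local; this variant converts only the data 
chunks and leaves metadata as raid1):

  # free up the second copies by converting data chunks to single
  btrfs balance start -dconvert=single /local

  # then convert back to raid1; new chunk pairs go to the two devices
  # with the most unallocated space, i.e. mostly the 6T device
  btrfs balance start -dconvert=raid1 /local

Note that while the data is in the single profile there is only one copy 
of it, so a device failure mid-way means data loss.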

Thanks,
Qu

>
>
>
>> Under mdadm the bigger drive
>> still helped, because it replaced at smaller drive, the one that was
>> holding the RAID back, but you didn't get to use all the big drive until
>> a year later when you had upgraded them all.  In the meantime you used
>> the extra space in other RAIDs.  (For example, a raid-5 plus a raid-1 on
>> the 2 bigger drives) Or you used the extra space as non-RAID space, ie.
>> space for static stuff that has offline backups.  In fact, most of my
>> storage is of that class (photo archives, reciprocal backups of other
>> systems) where RAID is not needed.
>>
>> So the long story is, I think most home users are likely to always have
>> different sizes and want their FS to treat it well.
>
> Yes of course. And at the expense of getting a frownie face....
>
> "Btrfs is under heavy development, and is not suitable for
> any uses other than benchmarking and review."
> https://www.kernel.org/doc/Documentation/filesystems/btrfs.txt
>
> Despite that disclosure, what you're describing is not what I'd expect
> and not what I've previously experienced. But I haven't had three
> different sized drives, and they weren't particularly full, and I
> don't know if you started with three from the outset at mkfs time or
> if this is the result of two drives with a third added on later, etc.
> So the nature of file systems is actually really complicated and it's
> normal for there to be regressions - and maybe this is a regression,
> hard to say with available information.
>
>
>
>> Since 6TB is a relatively new size, I wonder if that plays a role.  More
>> than 4TB of free space to balance into, could that confuse it?
>
> Seems unlikely.
>
>



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID-1 refuses to balance large drive
  2016-03-24  1:59                   ` Qu Wenruo
@ 2016-03-24  2:13                     ` Brad Templeton
  2016-03-24  2:33                       ` Qu Wenruo
  0 siblings, 1 reply; 35+ messages in thread
From: Brad Templeton @ 2016-03-24  2:13 UTC (permalink / raw)
  To: Qu Wenruo, Chris Murphy; +Cc: Btrfs BTRFS



On 03/23/2016 06:59 PM, Qu Wenruo wrote:

> 
> About chunk allocation problem, I hope to get a clear view of the whole
> disk layout now.
> 
> What's the final disk layout?
> Is that 4T + 3T + 6T + 20G layout?
> 
> If so, I'll say, in that case, only fully re-convert to single may help.
> As there is no enough space to allocate new raid1 chunks for balance
> them all.
> 
> 
> Chris Murphy may have already mentioned, btrfs chunk allocation has some
> limitation, although it is already more flex than mdadm.
> 
> 
> Btrfs chunk allocation will choose the device with most unallocated, and
> for raid1, it will ensure always pick 2 different devices to allocation.
> 
> This allocation does make btrfs raid1 allocation more space in a more
> flex method than mdadm raid1.
> But that only works if you start from scratch.
> 
> I'll explain it that case first.
> 
> 1) 6T and 4T devices only stage: Allocate 1T Raid1 chunk.
>    As 6T and 4T devices have the most unallocated space, so the first
>    1T raid chunk will be allocated from them.
>    Remaining space: 3/3/5

This stage never existed.  We had a 4 + 3 + 2 stage, which was low-ish
on space but not full.  I mean it had hundreds of GB free.

Then we had 4 + 3 + 6 + 2, but did not add more files or balance.

Then we had a remove of the 2, which caused, as expected, all the chunks
on the 2TB drive to be copied to the 6TB drive, as it was the most empty
drive.

Then we had a balance.  The balance (I would have expected) would have
moved chunks found on both 3 and 4, taking one of them and moving it to
the 6.  Generally alternating taking ones from the 3 and 4.   I can see
no reason this should not work even if 3 and 4 are almost entirely full,
but they were not.
But this did not happen.

> 
> 2) 6T and 3/4 switch stage: Allocate 4T Raid1 chunk.
>    After stage 1), we have 3/3/5 remaining space, then btrfs will pick
>    space from 5T remaining(6T devices), and switch between the other 3T
>    remaining one.
> 
>    Cause the remaining space to be 1/1/1.
> 
> 3) Fake-even allocation stage: Allocate 1T raid chunk.
>    Now all devices have the same unallocated space, and there are 3
>    devices, we can't really balance all chunks across them.
>    As we must and will only select 2 devices, in this stage, there will
>    be 1T unallocated and never be used.
> 
> After all, you will get 1 +4 +1 = 6T, still smaller than (3 + 4 +6 ) /2
> = 6.5T
> 
> Now let's talk about your 3 + 4 + 6 case.
> 
> For your initial state, 3 and 4 T devices is already filled up.
> Even your 6T device have about 4T available space, it's only 1 device,
> not 2 which raid1 needs.
> 
> So, no space for balance to allocate a new raid chunk. The extra 20G is
> so small that almost makes no sence.

Yes, it was added as an experiment on the suggestion of somebody on the
IRC channel.  I will be rid of it soon.  Still, it seems to me that the
lack of space even after I filled the disks should not interfere with
the balance's ability to move chunks which are found on both 3 and 4 so
that one remains and one goes to the 6.  This action needs no spare
space.   Now I presume the current algorithm perhaps does not work this way?

My next plan is to add the 2TB back. If I am right, balance will move
chunks from 3 and 4 to the 2TB, but it should not move any from the 6TB
because it has so much space.  Likewise, when I re-remove the 2TB, all
its chunks should move to the 6TB, and I will at least be in a usable state.
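
Concretely, something like the following (just a sketch; /dev/sdOLD is a 
placeholder for the old 2TB drive):

  # -f to clear the stale btrfs signature left from the earlier remove
  btrfs device add -f /dev/sdOLD /local
  btrfs balance start /local
  # and later, to migrate its chunks back off again:
  btrfs device delete /dev/sdOLD /local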

Or is the single approach faster?

> 
> 
> The convert to single then back to raid1, will do its job partly.
> But according to other report from mail list.
> The result won't be perfect even, even the reporter uses devices with
> all same size.
> 
> 
> So to conclude:
> 
> 1) Btrfs will use most of devices space for raid1.
> 2) 1) only happens if one fills btrfs from scratch
> 3) For already filled case, convert to single then convert back will
>    work, but not perfectly.
> 
> Thanks,
> Qu
> 
>>
>>
>>
>>> Under mdadm the bigger drive
>>> still helped, because it replaced at smaller drive, the one that was
>>> holding the RAID back, but you didn't get to use all the big drive until
>>> a year later when you had upgraded them all.  In the meantime you used
>>> the extra space in other RAIDs.  (For example, a raid-5 plus a raid-1 on
>>> the 2 bigger drives) Or you used the extra space as non-RAID space, ie.
>>> space for static stuff that has offline backups.  In fact, most of my
>>> storage is of that class (photo archives, reciprocal backups of other
>>> systems) where RAID is not needed.
>>>
>>> So the long story is, I think most home users are likely to always have
>>> different sizes and want their FS to treat it well.
>>
>> Yes of course. And at the expense of getting a frownie face....
>>
>> "Btrfs is under heavy development, and is not suitable for
>> any uses other than benchmarking and review."
>> https://www.kernel.org/doc/Documentation/filesystems/btrfs.txt
>>
>> Despite that disclosure, what you're describing is not what I'd expect
>> and not what I've previously experienced. But I haven't had three
>> different sized drives, and they weren't particularly full, and I
>> don't know if you started with three from the outset at mkfs time or
>> if this is the result of two drives with a third added on later, etc.
>> So the nature of file systems is actually really complicated and it's
>> normal for there to be regressions - and maybe this is a regression,
>> hard to say with available information.
>>
>>
>>
>>> Since 6TB is a relatively new size, I wonder if that plays a role.  More
>>> than 4TB of free space to balance into, could that confuse it?
>>
>> Seems unlikely.
>>
>>
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID-1 refuses to balance large drive
  2016-03-24  2:13                     ` Brad Templeton
@ 2016-03-24  2:33                       ` Qu Wenruo
  2016-03-24  2:49                         ` Brad Templeton
  0 siblings, 1 reply; 35+ messages in thread
From: Qu Wenruo @ 2016-03-24  2:33 UTC (permalink / raw)
  To: bradtem, Chris Murphy; +Cc: Btrfs BTRFS



Brad Templeton wrote on 2016/03/23 19:13 -0700:
>
>
> On 03/23/2016 06:59 PM, Qu Wenruo wrote:
>
>>
>> About chunk allocation problem, I hope to get a clear view of the whole
>> disk layout now.
>>
>> What's the final disk layout?
>> Is that 4T + 3T + 6T + 20G layout?
>>
>> If so, I'll say, in that case, only fully re-convert to single may help.
>> As there is no enough space to allocate new raid1 chunks for balance
>> them all.
>>
>>
>> Chris Murphy may have already mentioned, btrfs chunk allocation has some
>> limitation, although it is already more flex than mdadm.
>>
>>
>> Btrfs chunk allocation will choose the device with most unallocated, and
>> for raid1, it will ensure always pick 2 different devices to allocation.
>>
>> This allocation does make btrfs raid1 allocation more space in a more
>> flex method than mdadm raid1.
>> But that only works if you start from scratch.
>>
>> I'll explain it that case first.
>>
>> 1) 6T and 4T devices only stage: Allocate 1T Raid1 chunk.
>>     As 6T and 4T devices have the most unallocated space, so the first
>>     1T raid chunk will be allocated from them.
>>     Remaining space: 3/3/5
>
> This stage never existed.  We had a 4 + 3 + 2 stage, which was low-ish
> on space but not full.  I mean it had hundreds of gb free.

The stages I talked about only apply if you fill the btrfs from scratch 
with the 3/4/6 devices.

It was just an example to explain how btrfs allocates space on uneven 
devices.

>
> Then we had 4 + 3 + 6 + 2, but did not add more files or balance.
>
> Then we had a remove of the 2, which caused, as expected, all the chunks
> on the 2TB drive to be copied to the 6TB drive, as it was the most empty
> drive.
>
> Then we had a balance.  The balance (I would have expected) would have
> moved chunks found on both 3 and 4, taking one of them and moving it to
> the 6.  Generally alternating taking ones from the 3 and 4.   I can see
> no reason this should not work even if 3 and 4 are almost entirely full,
> but they were not.
> But this did not happen.
>
>>
>> 2) 6T and 3/4 switch stage: Allocate 4T Raid1 chunk.
>>     After stage 1), we have 3/3/5 remaining space, then btrfs will pick
>>     space from 5T remaining(6T devices), and switch between the other 3T
>>     remaining one.
>>
>>     Cause the remaining space to be 1/1/1.
>>
>> 3) Fake-even allocation stage: Allocate 1T raid chunk.
>>     Now all devices have the same unallocated space, and there are 3
>>     devices, we can't really balance all chunks across them.
>>     As we must and will only select 2 devices, in this stage, there will
>>     be 1T unallocated and never be used.
>>
>> After all, you will get 1 +4 +1 = 6T, still smaller than (3 + 4 +6 ) /2
>> = 6.5T
>>
>> Now let's talk about your 3 + 4 + 6 case.
>>
>> For your initial state, 3 and 4 T devices is already filled up.
>> Even your 6T device have about 4T available space, it's only 1 device,
>> not 2 which raid1 needs.
>>
>> So, no space for balance to allocate a new raid chunk. The extra 20G is
>> so small that almost makes no sence.
>
> Yes, it was added as an experiment on the suggestion of somebody on the
> IRC channel.  I will be rid of it soon.  Still, it seems to me that the
> lack of space even after I filled the disks should not interfere with
> the balance's ability to move chunks which are found on both 3 and 4 so
> that one remains and one goes to the 6.  This action needs no spare
> space.   Now I presume the current algorithm perhaps does not work this way?

No, balance does not work like that.
Most users think of balance as moving data, which is only partly right.
The fact is, balance is copy-and-delete, and it needs spare space.

That means you must have enough space for the extents you are balancing; 
btrfs will copy them, update the references, and only then delete the old 
data (along with its block group).

So to balance data on already-filled devices, btrfs needs to find space 
for that data first.
For RAID1 that requires 2 devices with unallocated space.

And in your case, you only have 1 device with unallocated space, so there 
is no space to balance into.


>
> My next plan is to add the 2tb back. If I am right, balance will move
> chunks from 3 and 4 to the 2TB,

Not only to the 2TB, but to the 2TB and 6TB. Never forget that RAID1 
needs 2 devices.
And if the 2TB fills up while the 3T/4T have free space, allocation to 
the 3T/4T devices is also possible.

That will free up to 2TB on the already filled-up devices. But that's 
still not enough to even out the space.

You may need to balance several times (maybe 10+) to make the space 
somewhat even, as balance won't re-balance any chunk that was created by 
the current balance.
(Otherwise balance would loop infinitely.)

> but it should not move any from the 6TB
> because it has so much space.

That's also wrong.
Whether balance will move data off the 6TB device is determined only by 
whether the source chunk has a stripe on the 6TB device and whether there 
is enough space to copy it to.

Balance, unlike chunk allocation, is much simpler and does no complicated 
space calculation.

1) Check the current chunk.
    If the chunk is out of the chunk range (beyond the last chunk, which
    means we are done and the current chunk is a newly created one),
    then we finish the balance.

2) Check whether we have enough space for the current chunk,
    including creating new chunks.

3) Copy all extents in this chunk to the new location.

4) Update the references of all extents to point to the new location,
    and free the old extents.

5) Go to the next chunk (in bytenr order).

So it's possible that some data on the 6TB device is moved to the 6TB 
device again, or to the empty 2TB device.

It's the chunk allocator which ensures the new (destination) chunk is 
allocated from the 6T and the empty 2T devices.

>  LIkewise, when I re-remove the 2tb, all
> its chunks should move to the 6tb, and I will be at least in a usable state.
>
> Or is the single approach faster?

As mentioned, it's not that easy. The 2TB device is not a silver bullet 
at all.

The re-convert method is the preferred one, although it's not perfect.

Thanks,
Qu
>
>>
>>
>> The convert to single then back to raid1, will do its job partly.
>> But according to other report from mail list.
>> The result won't be perfect even, even the reporter uses devices with
>> all same size.
>>
>>
>> So to conclude:
>>
>> 1) Btrfs will use most of devices space for raid1.
>> 2) 1) only happens if one fills btrfs from scratch
>> 3) For already filled case, convert to single then convert back will
>>     work, but not perfectly.
>>
>> Thanks,
>> Qu
>>
>>>
>>>
>>>
>>>> Under mdadm the bigger drive
>>>> still helped, because it replaced at smaller drive, the one that was
>>>> holding the RAID back, but you didn't get to use all the big drive until
>>>> a year later when you had upgraded them all.  In the meantime you used
>>>> the extra space in other RAIDs.  (For example, a raid-5 plus a raid-1 on
>>>> the 2 bigger drives) Or you used the extra space as non-RAID space, ie.
>>>> space for static stuff that has offline backups.  In fact, most of my
>>>> storage is of that class (photo archives, reciprocal backups of other
>>>> systems) where RAID is not needed.
>>>>
>>>> So the long story is, I think most home users are likely to always have
>>>> different sizes and want their FS to treat it well.
>>>
>>> Yes of course. And at the expense of getting a frownie face....
>>>
>>> "Btrfs is under heavy development, and is not suitable for
>>> any uses other than benchmarking and review."
>>> https://www.kernel.org/doc/Documentation/filesystems/btrfs.txt
>>>
>>> Despite that disclosure, what you're describing is not what I'd expect
>>> and not what I've previously experienced. But I haven't had three
>>> different sized drives, and they weren't particularly full, and I
>>> don't know if you started with three from the outset at mkfs time or
>>> if this is the result of two drives with a third added on later, etc.
>>> So the nature of file systems is actually really complicated and it's
>>> normal for there to be regressions - and maybe this is a regression,
>>> hard to say with available information.
>>>
>>>
>>>
>>>> Since 6TB is a relatively new size, I wonder if that plays a role.  More
>>>> than 4TB of free space to balance into, could that confuse it?
>>>
>>> Seems unlikely.
>>>
>>>
>>
>
>



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID-1 refuses to balance large drive
  2016-03-24  2:33                       ` Qu Wenruo
@ 2016-03-24  2:49                         ` Brad Templeton
  2016-03-24  3:44                           ` Chris Murphy
                                             ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Brad Templeton @ 2016-03-24  2:49 UTC (permalink / raw)
  To: Qu Wenruo, Chris Murphy; +Cc: Btrfs BTRFS



On 03/23/2016 07:33 PM, Qu Wenruo wrote:

> 
> The stage I talked about is only for you fill btrfs from scratch, with 3
> 4 6 devices.
> 
> Just as an example to explain how btrfs allocated space on un-even devices.
> 
>>
>> Then we had 4 + 3 + 6 + 2, but did not add more files or balance.
>>
>> Then we had a remove of the 2, which caused, as expected, all the chunks
>> on the 2TB drive to be copied to the 6TB drive, as it was the most empty
>> drive.
>>
>> Then we had a balance.  The balance (I would have expected) would have
>> moved chunks found on both 3 and 4, taking one of them and moving it to
>> the 6.  Generally alternating taking ones from the 3 and 4.   I can see
>> no reason this should not work even if 3 and 4 are almost entirely full,
>> but they were not.
>> But this did not happen.
>>
>>>
>>> 2) 6T and 3/4 switch stage: Allocate 4T Raid1 chunk.
>>>     After stage 1), we have 3/3/5 remaining space, then btrfs will pick
>>>     space from 5T remaining(6T devices), and switch between the other 3T
>>>     remaining one.
>>>
>>>     Cause the remaining space to be 1/1/1.
>>>
>>> 3) Fake-even allocation stage: Allocate 1T raid chunk.
>>>     Now all devices have the same unallocated space, and there are 3
>>>     devices, we can't really balance all chunks across them.
>>>     As we must and will only select 2 devices, in this stage, there will
>>>     be 1T unallocated and never be used.
>>>
>>> After all, you will get 1 +4 +1 = 6T, still smaller than (3 + 4 +6 ) /2
>>> = 6.5T
>>>
>>> Now let's talk about your 3 + 4 + 6 case.
>>>
>>> For your initial state, 3 and 4 T devices is already filled up.
>>> Even your 6T device have about 4T available space, it's only 1 device,
>>> not 2 which raid1 needs.
>>>
>>> So, no space for balance to allocate a new raid chunk. The extra 20G is
>>> so small that almost makes no sence.
>>
>> Yes, it was added as an experiment on the suggestion of somebody on the
>> IRC channel.  I will be rid of it soon.  Still, it seems to me that the
>> lack of space even after I filled the disks should not interfere with
>> the balance's ability to move chunks which are found on both 3 and 4 so
>> that one remains and one goes to the 6.  This action needs no spare
>> space.   Now I presume the current algorithm perhaps does not work
>> this way?
> 
> No, balance is not working like that.
> Although most user consider balance is moving data, which is partly right.
> The fact is, balance is, copy-and-delete. And it needs spare space.
> 
> Means you must have enough space for the extents you are balancing, then
> btrfs will copy them, update reference, and then delete old data (with
> its block group).
> 
> So for balancing data in already filled device, btrfs needs to find
> space for them first.
> Which will need 2 devices with unallocated space for RAID1.
> 
> And in you case, you only have 1 devices with unallocated space, so no
> space to balance.

Ah.  I would class this as a bug, or at least a non-optimal design.  If
I understand, you say it tries to move both of the matching chunks to
new homes.  This makes no sense if there are 3 drives because it is
assured that one chunk is staying on the same drive.   Even with 4 or
more drives, where this could make sense, in fact it would still be wise
to attempt to move only one of the pair of chunks, and then move the
other if that is also a good idea.


> 
> 
>>
>> My next plan is to add the 2tb back. If I am right, balance will move
>> chunks from 3 and 4 to the 2TB,
> 
> Not only to 2TB, but to 2TB and 6TB. Never forgot that RAID1 needs 2
> devices.
> And if 2TB is filled and 3/4 and free space, it's also possible to 3/4
> devices.
> 
> That will free 2TB in already filled up devices. But that's still not
> enough to get space even.
> 
> You may need to balance several times(maybe 10+) to make space a little
> even, as balance won't balance any chunk which is created by balance.
> (Or balance will loop infinitely).

Now I understand -- I had not thought it would try to move 2 when that's
so obviously wrong on a 3-drive, and so I was not thinking of the
general case.  So I can now calculate that if I add the 2TB, in an ideal
situation it will perhaps get 1TB of chunks and the 6TB will get 1TB of
chunks, so that three of the four drives will have 1TB free and the 6TB
will have 3TB free.   Then when I remove the 2TB, the 6TB should get all
of its chunks and will have 2TB free while the other two have 1TB free,
which is actually the right situation, as all new blocks will land on the
6TB and one of the other two drives.

I don't want to keep 4 drives, because the small drives consume power for
little benefit; better to move them to other purposes (offline backup etc.)

 In the algorithm below, does "chunk" refer to both the redundant copies
of the data, or just to one of them?  I am guessing my misunderstanding
may come from it referring to both, and moving both?

The ability you describe for it to move data within the same device is
presumably there to combine things together into fuller chunks, but it
appears to slow down the drive rebalancing plan.

Thanks for your explanations.

> 
>> but it should not move any from the 6TB
>> because it has so much space.
> 
> That's also wrong.
> Whether balance will move data from 6TB devices, is only determined by
> if the src chunk has stripe on 6TB devices and there is enough space to
> copy them to.
> 
> Balance, unlike chunk allocation, is much simple and no complicated
> space calculation.
> 
> 1) Check current chunk
>    If the chunk is out of chunk range (beyond last chunk, which means
>    we are done and current chunk is newly created one)
>    then we finish balance.
> 
> 2) Check if we have enough space for current chunk.
>    Including creating new chunks.
> 
> 3) Copy all exntets in this chunk to new location
> 
> 4) Update reference of all extents to point to new location
>    And free old extents.
> 
> 5) Goto next chunk.(bytenr order)
> 
> So, it's possible that some data in 6TB devices is moved to 6TB again,
> or to the empty 2TB devices.
> 
> It's chunk allocator which ensure the new chunk (destination chunk) is
> allocated from 6T and empty 2T devices.
> 
>>  LIkewise, when I re-remove the 2tb, all
>> its chunks should move to the 6tb, and I will be at least in a usable
>> state.
>>
>> Or is the single approach faster?
> 
> As mentioned, not that easy. The 2Tb devices is not the silver bullet at
> all.
> 
> Re-convert method is the preferred one, although it's not perfect.
> 
> Thanks,
> Qu
>>
>>>
>>>
>>> The convert to single then back to raid1, will do its job partly.
>>> But according to other report from mail list.
>>> The result won't be perfect even, even the reporter uses devices with
>>> all same size.
>>>
>>>
>>> So to conclude:
>>>
>>> 1) Btrfs will use most of devices space for raid1.
>>> 2) 1) only happens if one fills btrfs from scratch
>>> 3) For already filled case, convert to single then convert back will
>>>     work, but not perfectly.
>>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>>
>>>>
>>>>> Under mdadm the bigger drive
>>>>> still helped, because it replaced at smaller drive, the one that was
>>>>> holding the RAID back, but you didn't get to use all the big drive
>>>>> until
>>>>> a year later when you had upgraded them all.  In the meantime you used
>>>>> the extra space in other RAIDs.  (For example, a raid-5 plus a
>>>>> raid-1 on
>>>>> the 2 bigger drives) Or you used the extra space as non-RAID space,
>>>>> ie.
>>>>> space for static stuff that has offline backups.  In fact, most of my
>>>>> storage is of that class (photo archives, reciprocal backups of other
>>>>> systems) where RAID is not needed.
>>>>>
>>>>> So the long story is, I think most home users are likely to always
>>>>> have
>>>>> different sizes and want their FS to treat it well.
>>>>
>>>> Yes of course. And at the expense of getting a frownie face....
>>>>
>>>> "Btrfs is under heavy development, and is not suitable for
>>>> any uses other than benchmarking and review."
>>>> https://www.kernel.org/doc/Documentation/filesystems/btrfs.txt
>>>>
>>>> Despite that disclosure, what you're describing is not what I'd expect
>>>> and not what I've previously experienced. But I haven't had three
>>>> different sized drives, and they weren't particularly full, and I
>>>> don't know if you started with three from the outset at mkfs time or
>>>> if this is the result of two drives with a third added on later, etc.
>>>> So the nature of file systems is actually really complicated and it's
>>>> normal for there to be regressions - and maybe this is a regression,
>>>> hard to say with available information.
>>>>
>>>>
>>>>
>>>>> Since 6TB is a relatively new size, I wonder if that plays a role. 
>>>>> More
>>>>> than 4TB of free space to balance into, could that confuse it?
>>>>
>>>> Seems unlikely.
>>>>
>>>>
>>>
>>
>>
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID-1 refuses to balance large drive
  2016-03-24  2:49                         ` Brad Templeton
@ 2016-03-24  3:44                           ` Chris Murphy
  2016-03-24  3:46                           ` Qu Wenruo
  2016-03-24  6:11                           ` Duncan
  2 siblings, 0 replies; 35+ messages in thread
From: Chris Murphy @ 2016-03-24  3:44 UTC (permalink / raw)
  To: Brad Templeton; +Cc: Qu Wenruo, Chris Murphy, Btrfs BTRFS

On Wed, Mar 23, 2016 at 8:49 PM, Brad Templeton <bradtem@gmail.com> wrote:
> On 03/23/2016 07:33 PM, Qu Wenruo wrote:
>>
>> No, balance is not working like that.
>> Although most user consider balance is moving data, which is partly right.
>> The fact is, balance is, copy-and-delete. And it needs spare space.
>>
>> Means you must have enough space for the extents you are balancing, then
>> btrfs will copy them, update reference, and then delete old data (with
>> its block group).
>>
>> So for balancing data in already filled device, btrfs needs to find
>> space for them first.
>> Which will need 2 devices with unallocated space for RAID1.
>>
>> And in you case, you only have 1 devices with unallocated space, so no
>> space to balance.
>
> Ah.  I would class this as a bug, or at least a non-optimal design.  If
> I understand, you say it tries to move both of the matching chunks to
> new homes.  This makes no sense if there are 3 drives because it is
> assured that one chunk is staying on the same drive.   Even with 4 or
> more drives, where this could make sense, in fact it would still be wise
> to attempt to move only one of the pair of chunks, and then move the
> other if that is also a good idea.

In a separate thread, it's been observed that the balance code is getting
complicated and it's probably important that it not be too smart for its
own good.

The thing to understand is that a chunk is a contiguous range of
physical sectors. What's really being copied are the extents in those
chunks. And balance not only rewrites extents, it tries to collect them
together to use the chunk space efficiently. The Btrfs chunk isn't like
an md chunk.

>
>
>>
>>
>>>
>>> My next plan is to add the 2tb back. If I am right, balance will move
>>> chunks from 3 and 4 to the 2TB,
>>
>> Not only to 2TB, but to 2TB and 6TB. Never forgot that RAID1 needs 2
>> devices.
>> And if 2TB is filled and 3/4 and free space, it's also possible to 3/4
>> devices.
>>
>> That will free 2TB in already filled up devices. But that's still not
>> enough to get space even.
>>
>> You may need to balance several times(maybe 10+) to make space a little
>> even, as balance won't balance any chunk which is created by balance.
>> (Or balance will loop infinitely).
>
> Now I understand -- I had not thought it would try to move 2 when that's
> so obviously wrong on a 3-drive, and so I was not thinking of the
> general case.  So I can now calculate that if I add the 2TB, in an ideal
> situation, it will perhaps get 1TB of chunks and the 6TB will get 1TB of
> chunks and then the 4 drives will have 3 with 1TB free, and the 6TB will
> have 3TB free.

The problem is that you have two devices totally full now, devid1 and
devid2. So it's not certain it's going to start just copying chunks
off those drives. Whatever it does, it does on both chunk copies. It
might be moving them. It might be packing them more efficiently with
extents. No deallocation of a chunk can happen until it's empty. So
for two full drives it's difficult to see how this gets fixed just
with a regular balance. I think you have to go to single profile...
OR...

Add the 2TB.
Remove the 6TB and wait.

        devid    3 size 5.43TiB used 1.42TiB path /dev/sdg2

This suggests only 1.4TiB used on the 6TB drive, so it should be possible
for those chunks to get moved to the 2TB drive.

Now you have an empty 6TB, and you still have a (very full) raid1 with all data.

mkfs a new volume on the 6TB, btrfs send/receive to get all data on
the 6TB drive. "Data,RAID1: Size:3.87TiB, Used:3.87TiB" suggests only
4TB data so the 6TB can hold all of it.

Now you can umount the old volume; then force-add the 3TB and 4TB to the
new 6TB volume and balance with -dconvert=raid1 -mconvert=raid1.

The worst-case scenario is that the 6TB drive dies during the
conversion, in which case it could be totally broken and you have to go
to backup. But otherwise, it's a bit less risky than two balances to and
from single profile across three or even four drives.
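
Roughly, as a sketch (device names and the /mnt/new mount point are 
placeholders, and btrfs send needs read-only snapshots of the subvolumes 
being copied):

  btrfs device add -f /dev/sd2TB /local        # add the 2TB back
  btrfs device remove /dev/sd6TB /local        # empty the 6TB drive
  mkfs.btrfs -L newbutter /dev/sd6TB           # fresh fs on the 6TB
  mount /dev/sd6TB /mnt/new
  btrfs send /local/snap-ro | btrfs receive /mnt/new    # per subvolume
  umount /local
  btrfs device add -f /dev/sd3TB /dev/sd4TB /mnt/new
  btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/new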



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID-1 refuses to balance large drive
  2016-03-24  2:49                         ` Brad Templeton
  2016-03-24  3:44                           ` Chris Murphy
@ 2016-03-24  3:46                           ` Qu Wenruo
  2016-03-24  6:11                           ` Duncan
  2 siblings, 0 replies; 35+ messages in thread
From: Qu Wenruo @ 2016-03-24  3:46 UTC (permalink / raw)
  To: bradtem, Chris Murphy; +Cc: Btrfs BTRFS



Brad Templeton wrote on 2016/03/23 19:49 -0700:
>
>
> On 03/23/2016 07:33 PM, Qu Wenruo wrote:
>
>>
>> The stage I talked about is only for you fill btrfs from scratch, with 3
>> 4 6 devices.
>>
>> Just as an example to explain how btrfs allocated space on un-even devices.
>>
>>>
>>> Then we had 4 + 3 + 6 + 2, but did not add more files or balance.
>>>
>>> Then we had a remove of the 2, which caused, as expected, all the chunks
>>> on the 2TB drive to be copied to the 6TB drive, as it was the most empty
>>> drive.
>>>
>>> Then we had a balance.  The balance (I would have expected) would have
>>> moved chunks found on both 3 and 4, taking one of them and moving it to
>>> the 6.  Generally alternating taking ones from the 3 and 4.   I can see
>>> no reason this should not work even if 3 and 4 are almost entirely full,
>>> but they were not.
>>> But this did not happen.
>>>
>>>>
>>>> 2) 6T and 3/4 switch stage: Allocate 4T Raid1 chunk.
>>>>      After stage 1), we have 3/3/5 remaining space, then btrfs will pick
>>>>      space from 5T remaining(6T devices), and switch between the other 3T
>>>>      remaining one.
>>>>
>>>>      Cause the remaining space to be 1/1/1.
>>>>
>>>> 3) Fake-even allocation stage: Allocate 1T raid chunk.
>>>>      Now all devices have the same unallocated space, and there are 3
>>>>      devices, we can't really balance all chunks across them.
>>>>      As we must and will only select 2 devices, in this stage, there will
>>>>      be 1T unallocated and never be used.
>>>>
>>>> After all, you will get 1 +4 +1 = 6T, still smaller than (3 + 4 +6 ) /2
>>>> = 6.5T
>>>>
>>>> Now let's talk about your 3 + 4 + 6 case.
>>>>
>>>> For your initial state, 3 and 4 T devices is already filled up.
>>>> Even your 6T device have about 4T available space, it's only 1 device,
>>>> not 2 which raid1 needs.
>>>>
>>>> So, no space for balance to allocate a new raid chunk. The extra 20G is
>>>> so small that almost makes no sence.
>>>
>>> Yes, it was added as an experiment on the suggestion of somebody on the
>>> IRC channel.  I will be rid of it soon.  Still, it seems to me that the
>>> lack of space even after I filled the disks should not interfere with
>>> the balance's ability to move chunks which are found on both 3 and 4 so
>>> that one remains and one goes to the 6.  This action needs no spare
>>> space.   Now I presume the current algorithm perhaps does not work
>>> this way?
>>
>> No, balance is not working like that.
>> Although most user consider balance is moving data, which is partly right.
>> The fact is, balance is, copy-and-delete. And it needs spare space.
>>
>> Means you must have enough space for the extents you are balancing, then
>> btrfs will copy them, update reference, and then delete old data (with
>> its block group).
>>
>> So for balancing data in already filled device, btrfs needs to find
>> space for them first.
>> Which will need 2 devices with unallocated space for RAID1.
>>
>> And in you case, you only have 1 devices with unallocated space, so no
>> space to balance.
>
> Ah.  I would class this as a bug, or at least a non-optimal design.  If
> I understand, you say it tries to move both of the matching chunks to
> new homes.  This makes no sense if there are 3 drives because it is
> assured that one chunk is staying on the same drive.   Even with 4 or
> more drives, where this could make sense, in fact it would still be wise
> to attempt to move only one of the pair of chunks, and then move the
> other if that is also a good idea.

By only one of the pair of chunks, you mean a stripe of a chunk.
And in that case, IIRC only device replace works like that.

In most cases, btrfs works in chunk units, which means it may move data 
within a single device.

Even in that case, it's still useful.

For example, say there is a chunk (1G in size) which contains only 1 
extent (4K).  A balance can move that 4K extent into an existing chunk 
and free the whole 1G chunk, allowing a new chunk to be created.

Considering that balance is not only for making chunk allocation even, 
but also for a lot of other uses, IMHO the behavior can hardly be called 
a bug.

>
>
>>
>>
>>>
>>> My next plan is to add the 2tb back. If I am right, balance will move
>>> chunks from 3 and 4 to the 2TB,
>>
>> Not only to 2TB, but to 2TB and 6TB. Never forgot that RAID1 needs 2
>> devices.
>> And if 2TB is filled and 3/4 and free space, it's also possible to 3/4
>> devices.
>>
>> That will free 2TB in already filled up devices. But that's still not
>> enough to get space even.
>>
>> You may need to balance several times(maybe 10+) to make space a little
>> even, as balance won't balance any chunk which is created by balance.
>> (Or balance will loop infinitely).
>
> Now I understand -- I had not thought it would try to move 2 when that's
> so obviously wrong on a 3-drive, and so I was not thinking of the
> general case.  So I can now calculate that if I add the 2TB, in an ideal
> situation, it will perhaps get 1TB of chunks and the 6TB will get 1TB of
> chunks and then the 4 drives will have 3 with 1TB free, and the 6TB will
> have 3TB free.   Then when I remove the 2TB, the 6TB should get all its
> chunks and will have 2TB free and the other two 1TB free and that's
> actually the right situation as all new blocks will appear on the 6TB
> and one of the other two drives.
>
> I don't want to keep 4 drives because small drives consume power for
> little, better to move them to other purposes (offline backup etc.)
>
>   In the algorithm below, does "chunk" refer to both the redundant copies
> of the data, or just to one of them?

Both, or more specifically, the logical data itself.

The copy is normally called a stripe of the chunk.
In the raid1 case, the 2 stripes are simply identical copies of the 
chunk contents.

In btrfs' view (the logical address space), btrfs only cares which chunk 
covers which bytenr range. This makes a lot of things easier.

Like (0~1M range is never covered by any chunk)
Logical bytenr:
0        1G          2G          3G          4G
          |<-Chunk 1->|<-Chunk 2->|<-Chunk 3->|

How each chunk maps to devices is then only the chunk tree's concern.
Most parts of btrfs only need to care about the logical address space.

In chunk tree, it records how chunk is mapped into real devices.
Chunk1: type RAID1|DATA, length 1G
         stripe 0 dev1, dev bytenr XXXX
         stripe 1 dev2, dev bytenr YYYY

Chunk2: type RAID1|METADATA, length 1G
         stripe 0 dev2, dev bytenr ZZZZ
         stripe 1 dev3, dev bytenr WWWW

And what balance does is move all extents (if possible) inside a chunk 
to another place.
Maybe a new chunk, or an old chunk with enough space.
For example, after balancing chunk1, btrfs creates a new chunk, chunk4.

It copies some extents inside chunk1 to chunk4, and some to chunks 2 and 3.

However, the stripes of chunk4 can still be on dev1 and dev2, although 
the bytenr must change.

0       1G          2G          3G          4G          5G
         |           |<-Chunk 2->|<-Chunk 3->|<-Chunk 4->|
Chunk 4: Type RAID1|DATA length 1G
          stripe 0 dev1, dev bytenr some new bytenr
          stripe 1 dev2, dev bytenr some new bytenr

>  I am guessing my misunderstanding
> may come from it referring to both, and moving both?

It's common to think of balance as moving data, and sometimes the idea 
of "moving" leads to misunderstanding.

>
> The ability of it to move within the same device you describe is
> presumably there for combining things together to a chunk, but it
> appears it slows down the drive rebalancing plan.

Personally speaking, the fastest plan is to create a new 6T + 6T btrfs 
raid1 and copy all the data from the old raid over to it.
Then only add devices to that raid in pairs of the same size.
That way there is (mostly) no need to ever bother balancing.

Balance is never as fast as a normal copy, unfortunately.
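
Something like this, as a sketch (device names are placeholders):

  mkfs.btrfs -L newraid -d raid1 -m raid1 /dev/sd6TB_A /dev/sd6TB_B
  mount /dev/sd6TB_A /mnt/newraid
  # rsync/cp the data across, then later grow only in same-size pairs:
  btrfs device add /dev/sdNEW_A /dev/sdNEW_B /mnt/newraid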

Thanks,
Qu
>
> Thanks for your explanations.
>>
>>> but it should not move any from the 6TB
>>> because it has so much space.
>>
>> That's also wrong.
>> Whether balance will move data from 6TB devices, is only determined by
>> if the src chunk has stripe on 6TB devices and there is enough space to
>> copy them to.
>>
>> Balance, unlike chunk allocation, is much simple and no complicated
>> space calculation.
>>
>> 1) Check current chunk
>>     If the chunk is out of chunk range (beyond last chunk, which means
>>     we are done and current chunk is newly created one)
>>     then we finish balance.
>>
>> 2) Check if we have enough space for current chunk.
>>     Including creating new chunks.
>>
>> 3) Copy all exntets in this chunk to new location
>>
>> 4) Update reference of all extents to point to new location
>>     And free old extents.
>>
>> 5) Goto next chunk.(bytenr order)
>>
>> So, it's possible that some data in 6TB devices is moved to 6TB again,
>> or to the empty 2TB devices.
>>
>> It's chunk allocator which ensure the new chunk (destination chunk) is
>> allocated from 6T and empty 2T devices.
>>
>>>   LIkewise, when I re-remove the 2tb, all
>>> its chunks should move to the 6tb, and I will be at least in a usable
>>> state.
>>>
>>> Or is the single approach faster?
>>
>> As mentioned, not that easy. The 2Tb devices is not the silver bullet at
>> all.
>>
>> Re-convert method is the preferred one, although it's not perfect.
>>
>> Thanks,
>> Qu
>>>
>>>>
>>>>
>>>> The convert to single then back to raid1, will do its job partly.
>>>> But according to other report from mail list.
>>>> The result won't be perfect even, even the reporter uses devices with
>>>> all same size.
>>>>
>>>>
>>>> So to conclude:
>>>>
>>>> 1) Btrfs will use most of devices space for raid1.
>>>> 2) 1) only happens if one fills btrfs from scratch
>>>> 3) For already filled case, convert to single then convert back will
>>>>      work, but not perfectly.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Under mdadm the bigger drive
>>>>>> still helped, because it replaced at smaller drive, the one that was
>>>>>> holding the RAID back, but you didn't get to use all the big drive
>>>>>> until
>>>>>> a year later when you had upgraded them all.  In the meantime you used
>>>>>> the extra space in other RAIDs.  (For example, a raid-5 plus a
>>>>>> raid-1 on
>>>>>> the 2 bigger drives) Or you used the extra space as non-RAID space,
>>>>>> ie.
>>>>>> space for static stuff that has offline backups.  In fact, most of my
>>>>>> storage is of that class (photo archives, reciprocal backups of other
>>>>>> systems) where RAID is not needed.
>>>>>>
>>>>>> So the long story is, I think most home users are likely to always
>>>>>> have
>>>>>> different sizes and want their FS to treat it well.
>>>>>
>>>>> Yes of course. And at the expense of getting a frownie face....
>>>>>
>>>>> "Btrfs is under heavy development, and is not suitable for
>>>>> any uses other than benchmarking and review."
>>>>> https://www.kernel.org/doc/Documentation/filesystems/btrfs.txt
>>>>>
>>>>> Despite that disclosure, what you're describing is not what I'd expect
>>>>> and not what I've previously experienced. But I haven't had three
>>>>> different sized drives, and they weren't particularly full, and I
>>>>> don't know if you started with three from the outset at mkfs time or
>>>>> if this is the result of two drives with a third added on later, etc.
>>>>> So the nature of file systems is actually really complicated and it's
>>>>> normal for there to be regressions - and maybe this is a regression,
>>>>> hard to say with available information.
>>>>>
>>>>>
>>>>>
>>>>>> Since 6TB is a relatively new size, I wonder if that plays a role.
>>>>>> More
>>>>>> than 4TB of free space to balance into, could that confuse it?
>>>>>
>>>>> Seems unlikely.
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID-1 refuses to balance large drive
  2016-03-24  2:49                         ` Brad Templeton
  2016-03-24  3:44                           ` Chris Murphy
  2016-03-24  3:46                           ` Qu Wenruo
@ 2016-03-24  6:11                           ` Duncan
  2 siblings, 0 replies; 35+ messages in thread
From: Duncan @ 2016-03-24  6:11 UTC (permalink / raw)
  To: linux-btrfs

Brad Templeton posted on Wed, 23 Mar 2016 19:49:00 -0700 as excerpted:

> On 03/23/2016 07:33 PM, Qu Wenruo wrote:
> 
>>> Still, it seems to me
>>> that the lack of space even after I filled the disks should not
>>> interfere with the balance's ability to move chunks which are found on
>>> both 3 and 4 so that one remains and one goes to the 6.  This action
>>> needs no spare space.   Now I presume the current algorithm perhaps
>>> does not work this way?
>> 
>> No, balance is not working like that.
>> Although most user consider balance is moving data, which is partly
>> right. The fact is, balance is, copy-and-delete. And it needs spare
>> space.
>> 
>> Means you must have enough space for the extents you are balancing,
>> then btrfs will copy them, update reference, and then delete old data
>> (with its block group).
>> 
>> So for balancing data in already filled device, btrfs needs to find
>> space for them first.
>> Which will need 2 devices with unallocated space for RAID1.
>> 
>> And in you case, you only have 1 devices with unallocated space, so no
>> space to balance.
> 
> Ah.  I would class this as a bug, or at least a non-optimal design.  If
> I understand, you say it tries to move both of the matching chunks to
> new homes.  This makes no sense if there are 3 drives because it is
> assured that one chunk is staying on the same drive.   Even with 4 or
> more drives, where this could make sense, in fact it would still be wise
> to attempt to move only one of the pair of chunks, and then move the
> other if that is also a good idea.

What balance does, at its most basic, is rewrite and in the process 
manipulate chunks in some desired way, depending on the filters used, if 
any.  Once the chunks have been rewritten, the old copies are deleted.  
But existing chunks are never simply left in place unless the filters 
exclude them entirely.  If they are rewritten, a new chunk is created and 
the old chunk is removed.

Now one of the simplest and most basic effects of this rewrite process is 
that where two or more chunks of the same type (typically data or 
metadata) are only partially full, the rewrite process will create a new 
chunk and start writing, filling it until it is full, then creating 
another and filling it, etc, which ends up compacting chunks as it 
rewrites them.  So if there are ten chunks averaging 50% full, it'll 
compact them into five chunks, 100% full.  The usage filter is very 
helpful here, letting you tell balance to only bother with chunks that 
are under say 10% (usage=10) full, where you'll get a pretty big effect 
for the effort, as 10 such chunks can be consolidated into one.  Of 
course that would only happen if you /had/ 10 such chunks under 10% full, 
but at say usage=50, you still get one freed chunk for every two balance 
rewrites, taking longer, but still far less time than it would take to 
rewrite 90% full chunks, with far more dramatic effects... as long as 
there are chunks to balance and combine at that usage level, of course.
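
For example, something along these lines, assuming the filesystem is 
mounted at /local:

  btrfs balance start -dusage=10 /local   # only rewrite data chunks under ~10% full
  btrfs balance start -dusage=50 /local   # a bigger, slower pass

The same idea works for metadata with -musage=N.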

Here, we're using a different side effect, the fact that with a raid1 
setup, there are always two copies of the chunk, one on each of exactly 
two devices, and that when new chunks are allocated, they *SHOULD* be 
allocated from the devices with the most free space, subject only to the 
rule that both copies cannot be on the same device, so the effect is that 
it'll allocate from the device with the most space left for the first 
copy, and then for the second copy, it'll allocate from the device with 
the most space left, but where the device list excludes the device that 
the first copy is on.

But, the point that Qu is making is that balance, by definition, rewrites 
both raid1 copies of the chunk.  It can't simply rewrite just the one 
that's on the fullest device to the most empty and leave the other copy 
alone.  So what it will do is allocate space for a new chunk from each of 
the two devices with the most space left, and will copy the chunks to 
them, only releasing the existing copies when the copy is done and the 
new copies are safely on their respective devices.

Which means that at least two devices MUST have space left in order to 
rebalance from raid1 to raid1.  If only one device has space left, no 
rebalance can be done.

Now your 3 TB and 4 TB devices, one each, are full, with space left only 
on the 6 TB device.  When you first switched from the 2 TB device to the 
6 TB device, the device delete would have rewritten from the 2 TB device 
to the 6 TB device, and you probably had some space left on the other 
devices at that point.  However, you didn't have enough space left on the 
other two devices to utilize much of the 6 TB device, because each time 
you allocated a chunk on the 6 TB device, a chunk had to be allocated on 
one of the others as well, and they simply didn't have enough space left 
by that point to do that too many times.


Now, you /did/ try to rebalance before you /fully/ ran out of space on 
the other devices, and that's what Chris and I were thinking should have 
worked, putting one copy of each rebalanced chunk on the 6 TB device.

But, lacking a btrfs device usage report (preferably; or btrfs filesystem 
show, which gives a bit less information but does say how much of each 
device is actually used) from /before/ the further fillup, we can't say 
for sure how much space was actually left.

Now here's the question.  You said you estimated each drive had ~50 GB 
free when you did the original replace and then tried to balance, but 
where did that 50 GB number come from?

Here's why it matters.  Btrfs allocates space in two steps.  First it 
allocates from the unallocated pool into chunks, which can be data or 
metadata (there's also system chunks, but that's only a few MiB total, in 
your case 32 MiB on each of two devices given the raid1, and doesn't 
change dramatically with usage as data and metadata chunks do).

And it can easily happen that all available space is already allocated 
into (partially used) chunks, so there's no unallocated space actually 
left on a device from which to allocate further chunks, but there's still 
sufficient space left in the partially used chunks to continue adding and 
changing files for some time.  Only when new chunk allocation is 
necessary will a problem show up.

Now given the various btrfs reports, btrfs fi show and btrfs fi df, or 
btrfs fi usage, or for a device-centric report, btrfs dev usage, possibly 
combined with the other reports depending on what you're trying to figure 
out, it's quite possible to tell exactly what the status of each of the 
devices is, regarding both unallocated space as well as allocated chunks, 
and how much of those allocated chunks is actually used (globally, 
unfortunately actual usage of the chunk allocation isn't broken down by 
device, tho that information isn't technically needed per-device).

But if you're estimating only based on normal df, not the btrfs versions 
of the commands, you don't know how much space remained actually 
unallocated on each device, and for balance, that's the critical thing, 
particularly with raid1, since it MUST have space to allocate new chunks 
on AT LEAST TWO devices.
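
For reference, the btrfs versions of the commands in question:

  btrfs filesystem show /local    # per-device size vs. allocated ("used")
  btrfs device usage /local       # per-device breakdown, incl. unallocated
  btrfs filesystem usage /local   # whole-filesystem allocation summary

It's the unallocated figures these report that matter for balance, not 
what plain df shows.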


Which is where the IRC recommendation to add a 4th device of some GiB 
came in, the idea being to add enough unallocated space on that 4th 
device, that being the second device with actually unallocated space, to 
get you out of the tight spot.


There is, however, another factor in play here as well, chunk size.  Data 
chunks are the largest, and are nominally 1 GiB in size.  *HOWEVER*, on 
devices over some particular size they can increase, up to (as a dev 
stated in one thread) 10 GiB in size.  However, while I know it can 
happen at larger filesystem and device sizes, I don't have the foggiest 
what the conditions and algorithms for chunk size are.  But with TB-scale 
devices and a btrfs this large, it's very possible, even likely, that 
you're dealing with over the 1 GiB nominal size.

And if you're dealing with 10 GiB chunk sizes, or possibly even larger if 
I took that dev's chunk size limitation comments out of context and am 
wrong about that chunk size limit...

You may well simply not have a second device with enough unallocated 
space on it to properly handle the chunk sizes on that filesystem.  
Certainly, the btrfs fi usage report you posted showed a few gigs of 
unallocated space on each of three of the four devices (with all sorts of 
space left on the 6 TB device, of course), but all three were in the 
single-digits GB, and if most of your data chunks are 10 GiB... you 
simply don't have a device with enough unallocated space left to write 
that second copy.


Tho adding back that 2 TB device and doing a balance should indeed give 
you enough space to put a serious dent in that imbalance.

But as Qu says, you will likely end up having to rebalance several times 
in order to get it nicely balanced out, since you'll fill up that under 
2 TiB pretty fast from the other two full devices and it'll start round-
robinning to all three for the second copy before the other two are even 
a TiB down from full.

Again as Qu says, rebalancing to single and back to raid1 is another 
option that should result in a much faster loading of the 6 TB device.  
I think (but I'm not sure) that the single mode allocator still uses 
the "most space" allocation algorithm, in which case, given a total raid1 
usage of 7.77 TiB, which should be 3.88 TiB (~4.25 TB) in single mode, 
you should end up with a nearly free 3 TB device, just under 1 TiB used 
on the 4 TB device, and just under 3 TB used on the 6 TB device, 
basically 3 TB free/unallocated on each of the three devices.

(The tiny 4th device should be left entirely free in that case and should 
then be trivial to device delete as there will be nothing on it to move 
to other devices, it'll be a simple change to the system chunk device 
data and the superblocks on the other three devices.)

Then you can rebalance to raid1 mode again, and it should use up that 3 
TB on each device relatively evenly, round-robinning an unused device  
that alternates on each set of chunks copied.  While ~3/4 of all chunks 
should start out with their single-mode copy on the 6 TB device, 3/4 of 
all chunks deleted will be off it, leaving it free to get one of the two 
copies most of the time.  You should end up with about 1.3 TB free per 
device, with about 1.6 TB of the 3 TB device allocated, 2.6 TB of the 4 
TB device allocated, together pretty well sharing one copy of each chunk 
between them, and 4.3 T of the 6 TB device used, pretty much one copy of 
each chunk on its own.

The down side to that is that you're left with only a single copy while 
in single mode, and if that copy gets corrupted, you simply lose whatever 
was in that now corrupted chunk.  If the data's valuable enough, you may 
thus prefer to do repeated balances.

The other alternative of course is to ensure that everything that's not 
trivially replaced is backed up, and start from scratch with a newly 
created btrfs on the three devices, restoring to it from backup.

That's what I'd do, since the sysadmin's rule of backups, in simple form, 
says that if it's not backed up, then by your (in)action you are defining 
that data as worth less than the time/trouble/resources necessary to back 
it up.  So if it's worth the hassle, it should already be backed up, and 
you can simply blow away the existing filesystem, create it anew, and 
restore from backups.  And if you don't have those backups, then by 
definition it's not worth the hassle, and starting over with a fresh 
filesystem is all three of (1) less hassle, (2) a chance to take 
advantage of newer filesystem options that weren't available when you 
first created the existing filesystem, and (3) a clean start, blowing 
away any chance of some bug lurking in the existing layout waiting to 
come back and bite you after you've put all the work into those 
rebalances, should you choose them over the clean start.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID-1 refuses to balance large drive
  2016-03-23 18:34             ` Chris Murphy
  2016-03-23 19:10               ` Brad Templeton
  2016-03-23 22:28               ` Duncan
@ 2016-03-24  7:08               ` Andrew Vaughan
  2 siblings, 0 replies; 35+ messages in thread
From: Andrew Vaughan @ 2016-03-24  7:08 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Brad Templeton, Btrfs BTRFS

Hi Brad

Just a user here, not a dev.

I think I might have run into a similar bug about 6 months ago.

At the time I was running Debian stable.  (iirc that is kernel 3.16
and probably btrfs-progs of a similar vintage).

The filesystem was originally a 2 x 6TB array with a 4TB drive added
later when space began to get low.  I'm pretty sure I must have done
at least a partial balance after adding the 4TB drive, but something
like 1TB free on each of the two 6TB drives, and 2TB on the 4TB, would
have been 'good enough for me'.

It was nearly full again when a copy unexpectedly reported
out-of-space.  Balance didn't fix it.  In retrospect btrfs had probably
run out of unallocated space for new chunks on both 6TB drives.

I'm not sure what actually fixed it.  I upgraded to Debian testing
(something I was going to do soon anyway).  I might have also
temporarily added another drive.   (I have since had a 6TB drive fail,
and btrfs is running happily on 2x4TB, and 1x6TB).

More inline below.

On 24 March 2016 at 05:34, Chris Murphy <lists@colorremedies.com> wrote:
> On Wed, Mar 23, 2016 at 10:51 AM, Brad Templeton <bradtem@gmail.com> wrote:
>> Thanks for assist.  To reiterate what I said in private:
>>
>> a) I am fairly sure I swapped drives by adding the 6TB drive and then
>> removing the 2TB drive, which would not have made the 6TB think it was
>> only 2TB.    The btrfs statistics commands have shown from the beginning
>> the size of the device as 6TB, and that after the remove, it haad 4TB
>> unallocated.
>
> I agree this seems to be consistent with what's been reported.
>

<snip>

>>
>> Some options remaining open to me:
>>
>> a) I could re-add the 2TB device, which is still there.  Then balance
>> again, which hopefully would move a lot of stuff.   Then remove it again
>> and hopefully the new stuff would distribute mostly to the large drive.
>>  Then I could try balance again.
>
> Yeah, to do this will require -f to wipe the signature info from that
> drive when you add it. But I don't think this is a case of needing
> more free space, I think it might be due to the odd number of drives
> that are also fairly different in size.
>
If I recall correctly, device delete did remove the btrfs signature
when I did it.  But I could be wrong.

> But then what happens when you delete the 2TB drive after the balance?
> Do you end up right back in this same situation?
>

If balance manages to get the data properly distributed across the
drives, then the 2TB should be mostly empty, and device delete should
be able to remove the 2TB disk.   I successfully added a 4TB disk, did
a balance, and then removed a failing 6TB from the 3 drive array
above.

>
>>
>> b) It was suggested I could (with a good backup) convert the drive to
>> non-RAID1 to free up tons of space and then re-convert.  What's the
>> precise procedure for that?  Perhaps I can do it with a limit to see how
>> it works as an experiment?   Any way to specifically target the blocks
>> that have their two copies on the 2 smaller drives for conversion?
>
> btrfs balance -dconvert=single -mconvert=single -f   ## you have to
> use -f to force reduction in redundancy
> btrfs balance -dconvert=raid1 -mconvert=raid1

I would probably try upgrading to a newer kernel + btrfs-progs first.
Before converting back to raid1, I would also run btrfs device usage
and check to see whether all devices have approximately the same
amount of unallocated space.  If they don't, maybe try running a full
balance again.
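
A minimal sketch of that check-and-convert sequence, assuming a
hypothetical mount point /mnt (and current backups, since single mode
leaves only one copy):

    btrfs device usage /mnt        # per-device allocated/unallocated space
    # -f is required because redundancy is being reduced
    btrfs balance start -dconvert=single -mconvert=single -f /mnt
    btrfs device usage /mnt        # unallocated space should now be more even
    btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt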

<snip>

Andrew

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID-1 refuses to balance large drive
  2016-03-23 19:33                 ` Chris Murphy
  2016-03-24  1:59                   ` Qu Wenruo
@ 2016-03-25 13:16                   ` Patrik Lundquist
  2016-03-25 14:35                     ` Henk Slager
  2016-03-27  4:23                     ` Brad Templeton
  1 sibling, 2 replies; 35+ messages in thread
From: Patrik Lundquist @ 2016-03-25 13:16 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Brad Templeton, Btrfs BTRFS

On 23 March 2016 at 20:33, Chris Murphy <lists@colorremedies.com> wrote:
>
> On Wed, Mar 23, 2016 at 1:10 PM, Brad Templeton <bradtem@gmail.com> wrote:
> >
> > I am surprised to hear it said that having the mixed sizes is an odd
> > case.
>
> Not odd as in wrong, just uncommon compared to other arrangements being tested.

I think mixed drive sizes in raid1 is a killer feature for a home NAS,
where you replace an old smaller drive with the latest and largest
when you need more storage.

My raid1 currently consists of 6TB+3TB+3*2TB.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID-1 refuses to balance large drive
  2016-03-25 13:16                   ` Patrik Lundquist
@ 2016-03-25 14:35                     ` Henk Slager
  2016-03-26  4:15                       ` Duncan
       [not found]                       ` <CAHz9+Emc4DsXoMLKYrp1TfN+2r2cXxaJmPyTnpeCZF=h0FhtMg@mail.gmail.com>
  2016-03-27  4:23                     ` Brad Templeton
  1 sibling, 2 replies; 35+ messages in thread
From: Henk Slager @ 2016-03-25 14:35 UTC (permalink / raw)
  To: Patrik Lundquist; +Cc: Chris Murphy, Brad Templeton, Btrfs BTRFS

On Fri, Mar 25, 2016 at 2:16 PM, Patrik Lundquist
<patrik.lundquist@gmail.com> wrote:
> On 23 March 2016 at 20:33, Chris Murphy <lists@colorremedies.com> wrote:
>>
>> On Wed, Mar 23, 2016 at 1:10 PM, Brad Templeton <bradtem@gmail.com> wrote:
>> >
>> > I am surprised to hear it said that having the mixed sizes is an odd
>> > case.
>>
>> Not odd as in wrong, just uncommon compared to other arrangements being tested.
>
> I think mixed drive sizes in raid1 is a killer feature for a home NAS,
> where you replace an old smaller drive with the latest and largest
> when you need more storage.
>
> My raid1 currently consists of 6TB+3TB+3*2TB.

For the original OP situation, with chunks all filled up with extents
and devices all filled up with chunks, 'integrating' a new 6TB drive
into a 4TB+3TB+2TB raid1 array could probably be done in a somewhat
unusual way in order to avoid immediate balancing needs:
- 'plug-in' the 6TB
- btrfs-replace  4TB by 6TB
- btrfs fi resize max 6TB_devID
- btrfs-replace  2TB by 4TB
- btrfs fi resize max 4TB_devID
- 'unplug' the 2TB

So then there would be 2 devices with roughly 2TB space available, so
good for continued btrfs raid1 writes.
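
As a command-level sketch of the first replace/resize step, with
hypothetical names -- say the old 4TB is devid 1, the new 6TB shows up
as /dev/sdd, and the filesystem is mounted at /mnt:

    btrfs replace start 1 /dev/sdd /mnt   # copy devid 1 (the 4TB) onto the new 6TB
    btrfs replace status /mnt             # wait for the replace to finish
    btrfs filesystem resize 1:max /mnt    # grow the fs to use the full 6TB

The 2TB -> 4TB replace would then follow the same pattern.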

An offline variant with dd instead of btrfs-replace could also be done
(I used to do that sometimes before btrfs-replace was implemented).
My experience is that btrfs-replace runs at roughly the maximum speed
(i.e. the hard disk's magnetic-media transfer speed) during the whole
replace process, and it does in a more direct way what you actually
want.  So in total the device replace/upgrade is mostly much faster
than with the add+delete method.  And raid1 redundancy is active all
the time.  Of course it means first making sure the system runs an
up-to-date/latest kernel+tools.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID-1 refuses to balance large drive
  2016-03-25 14:35                     ` Henk Slager
@ 2016-03-26  4:15                       ` Duncan
       [not found]                       ` <CAHz9+Emc4DsXoMLKYrp1TfN+2r2cXxaJmPyTnpeCZF=h0FhtMg@mail.gmail.com>
  1 sibling, 0 replies; 35+ messages in thread
From: Duncan @ 2016-03-26  4:15 UTC (permalink / raw)
  To: linux-btrfs

Henk Slager posted on Fri, 25 Mar 2016 15:35:52 +0100 as excerpted:

> For the original OP situation, with chunks all filled op with extents
> and devices all filled up with chunks, 'integrating' a new 6TB drive
> in an 4TB+3TG+2TB raid1 array could probably be done in a bit unusual
> way in order to avoid immediate balancing needs:

> - 'plug-in' the 6TB
> - btrfs-replace  4TB by 6TB
> - btrfs fi resize max 6TB_devID
> - btrfs-replace  2TB by 4TB
> - btrfs fi resize max 4TB_devID
> - 'unplug' the 2TB

Way to think outside the box, Henk!  I'll have to remember this as it's
a very clever and rather useful method-tool to have in the ol' admin
toolbox (aka brain). =:^)

I only wish I had thought of it, as it sure seems clear... now that
you described it!

Greatly appreciated, in any case! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID-1 refuses to balance large drive
  2016-03-25 13:16                   ` Patrik Lundquist
  2016-03-25 14:35                     ` Henk Slager
@ 2016-03-27  4:23                     ` Brad Templeton
  1 sibling, 0 replies; 35+ messages in thread
From: Brad Templeton @ 2016-03-27  4:23 UTC (permalink / raw)
  Cc: Btrfs BTRFS




For those curious as to the result, the reduction to single and
restoration to RAID1 did indeed balance the array.   It was extremely
slow of course on a 12TB array.   I did not bother doing this with the
metadata.   I also stopped the conversion to single once it had freed up
enough space on the 2 smaller drives, because at that point it was moving
stuff onto the big drive, which seemed sub-optimal considering what was
to come.
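
For anyone repeating this, a rough sketch of such a partial conversion,
with a hypothetical mount point /mnt and an arbitrary chunk limit
(some versions may also want -f to allow reducing redundancy):

    btrfs balance start -dconvert=single,limit=100 /mnt   # convert data in batches
    btrfs fi usage /mnt                  # stop once the small drives have space
    btrfs balance cancel /mnt            # or interrupt a running conversion
    btrfs balance start -dconvert=raid1 /mnt               # then convert back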

In general, obviously, I hope the long term goal is to not need this,
indeed not to need manual balance at all.   I would hope the goal is to
just be able to add and remove drives, tell the system what type of
redundancy you need and let it figure out the rest.  But I know this is
an FS in development.

I've actually come to feel that when it comes to personal drive arrays,
we need something much smarter than today's filesystems.  The truth
is, for example, that once my infrequently accessed files, such as old
photo and video archives, have a solid backup made, there is not
actually a need to keep them redundantly at all, except for speed, while
the much smaller volume of frequently accessed files does need that (or
even extra redundancy, not for safety but for extra speed, and of course
a cache on an SSD is even better).   This requires not just the
filesystem and OS to get smarter about this, but even the apps.  It may
happen some day -- no matter how cheap storage gets, we keep coming up
with ways to fill it.

Thanks for the help.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID-1 refuses to balance large drive
       [not found]                       ` <CAHz9+Emc4DsXoMLKYrp1TfN+2r2cXxaJmPyTnpeCZF=h0FhtMg@mail.gmail.com>
@ 2018-05-27  1:27                         ` Brad Templeton
  2018-05-27  1:41                           ` Qu Wenruo
  2018-06-08  3:23                           ` Zygo Blaxell
  0 siblings, 2 replies; 35+ messages in thread
From: Brad Templeton @ 2018-05-27  1:27 UTC (permalink / raw)
  To: Btrfs BTRFS

A few years ago, I encountered an issue (halfway between a bug and a
problem) with attempting to grow a BTRFS 3 disk Raid 1 which was
fairly full.   The problem was that after replacing (by add/delete) a
small drive with a larger one, there were now 2 full drives and one
new half-full one, and balance was not able to correct this situation
to produce the desired result, which is 3 drives, each with a roughly
even amount of free space.  It can't do it because the 2 smaller
drives are full, and it doesn't realize it could just move one of the
copies of a block off the smaller drive onto the larger drive to free
space on the smaller drive; it wants to move them both, and there is
nowhere to put them both.

I'm about to do it again, taking my nearly full array which is 4TB,
4TB, 6TB and replacing one of the 4TB with an 8TB.  I don't want to
repeat the very time-consuming situation, so I wanted to find out if
things were fixed now.   I am running Xenial (kernel 4.4.0) and could
consider the upgrade to bionic (4.15), though that adds a lot more to
my plate before a long trip and I would prefer to avoid it if I can.

So what is the best strategy:

a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic" strategy)
b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks
from 4TB but possibly not enough)
c) Replace 6TB with 8TB, resize/balance, then replace 4TB with
recently vacated 6TB -- much longer procedure but possibly better

Or has this all been fixed and method A will work fine and get to the
ideal goal -- 3 drives, with available space suitably distributed to
allow full utilization over time?
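
For reference, a command-level sketch of option (a), with hypothetical
names -- say the outgoing 4TB is devid 2, the new 8TB appears as
/dev/sde, and the filesystem is mounted at /mnt:

    btrfs replace start 2 /dev/sde /mnt   # copy the old 4TB onto the 8TB in place
    btrfs filesystem resize 2:max /mnt    # grow the fs to the full 8TB
    btrfs balance start /mnt              # then attempt a full balance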

On Sat, May 26, 2018 at 6:24 PM, Brad Templeton <bradtem@gmail.com> wrote:
> A few years ago, I encountered an issue (halfway between a bug and a
> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was fairly
> full.   The problem was that after replacing (by add/delete) a small drive
> with a larger one, there were now 2 full drives and one new half-full one,
> and balance was not able to correct this situation to produce the desired
> result, which is 3 drives, each with a roughly even amount of free space.
> It can't do it because the 2 smaller drives are full, and it doesn't realize
> it could just move one of the copies of a block off the smaller drive onto
> the larger drive to free space on the smaller drive, it wants to move them
> both, and there is nowhere to put them both.
>
> I'm about to do it again, taking my nearly full array which is 4TB, 4TB, 6TB
> and replacing one of the 4TB with an 8TB.  I don't want to repeat the very
> time consuming situation, so I wanted to find out if things were fixed now.
> I am running Xenial (kernel 4.4.0) and could consider the upgrade to  bionic
> (4.15) though that adds a lot more to my plate before a long trip and I
> would prefer to avoid if I can.
>
> So what is the best strategy:
>
> a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic"
> strategy)
> b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks from
> 4TB but possibly not enough)
> c) Replace 6TB with 8TB, resize/balance, then replace 4TB with recently
> vacated 6TB -- much longer procedure but possibly better
>
> Or has this all been fixed and method A will work fine and get to the ideal
> goal -- 3 drives, with available space suitably distributed to allow full
> utilization over time?
>
> On Fri, Mar 25, 2016 at 7:35 AM, Henk Slager <eye1tm@gmail.com> wrote:
>>
>> On Fri, Mar 25, 2016 at 2:16 PM, Patrik Lundquist
>> <patrik.lundquist@gmail.com> wrote:
>> > On 23 March 2016 at 20:33, Chris Murphy <lists@colorremedies.com> wrote:
>> >>
>> >> On Wed, Mar 23, 2016 at 1:10 PM, Brad Templeton <bradtem@gmail.com>
>> >> wrote:
>> >> >
>> >> > I am surprised to hear it said that having the mixed sizes is an odd
>> >> > case.
>> >>
>> >> Not odd as in wrong, just uncommon compared to other arrangements being
>> >> tested.
>> >
>> > I think mixed drive sizes in raid1 is a killer feature for a home NAS,
>> > where you replace an old smaller drive with the latest and largest
>> > when you need more storage.
>> >
>> > My raid1 currently consists of 6TB+3TB+3*2TB.
>>
>> For the original OP situation, with chunks all filled op with extents
>> and devices all filled up with chunks, 'integrating' a new 6TB drive
>> in an 4TB+3TG+2TB raid1 array could probably be done in a bit unusual
>> way in order to avoid immediate balancing needs:
>> - 'plug-in' the 6TB
>> - btrfs-replace  4TB by 6TB
>> - btrfs fi resize max 6TB_devID
>> - btrfs-replace  2TB by 4TB
>> - btrfs fi resize max 4TB_devID
>> - 'unplug' the 2TB
>>
>> So then there would be 2 devices with roughly 2TB space available, so
>> good for continued btrfs raid1 writes.
>>
>> An offline variant with dd instead of btrfs-replace could also be done
>> (I used to do that sometimes when btrfs-replace was not implemented).
>> My experience is that btrfs-replace speed is roughly at max speed (so
>> harddisk magnetic media transferspeed) during the whole replace
>> process and it does in a more direct way what you actually want. So in
>> total mostly way faster device replace/upgrade than with the
>> add+delete method. And raid1 redundancy is active all the time. Of
>> course it means first make sure the system runs up-to-date/latest
>> kernel+tools.
>
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID-1 refuses to balance large drive
  2018-05-27  1:27                         ` Brad Templeton
@ 2018-05-27  1:41                           ` Qu Wenruo
  2018-05-27  1:49                             ` Brad Templeton
  2018-06-08  3:23                           ` Zygo Blaxell
  1 sibling, 1 reply; 35+ messages in thread
From: Qu Wenruo @ 2018-05-27  1:41 UTC (permalink / raw)
  To: Brad Templeton, Btrfs BTRFS


[-- Attachment #1.1: Type: text/plain, Size: 6916 bytes --]



On 2018-05-27 09:27, Brad Templeton wrote:
> A few years ago, I encountered an issue (halfway between a bug and a
> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was
> fairly full.   The problem was that after replacing (by add/delete) a
> small drive with a larger one, there were now 2 full drives and one
> new half-full one, and balance was not able to correct this situation
> to produce the desired result, which is 3 drives, each with a roughly
> even amount of free space.  It can't do it because the 2 smaller
> drives are full, and it doesn't realize it could just move one of the
> copies of a block off the smaller drive onto the larger drive to free
> space on the smaller drive, it wants to move them both, and there is
> nowhere to put them both.

It's not that easy.
For balance, btrfs must first find a large enough space to locate both
copies, then copy the data.
Otherwise, if a power loss happens, it would cause data corruption.

So in your case, btrfs can only find enough space for one copy, and is
thus unable to relocate any chunk.

> 
> I'm about to do it again, taking my nearly full array which is 4TB,
> 4TB, 6TB and replacing one of the 4TB with an 8TB.  I don't want to
> repeat the very time consuming situation, so I wanted to find out if
> things were fixed now.   I am running Xenial (kernel 4.4.0) and could
> consider the upgrade to  bionic (4.15) though that adds a lot more to
> my plate before a long trip and I would prefer to avoid if I can.

Since there is nothing to fix, the behavior will not change at all.

> 
> So what is the best strategy:
> 
> a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic" strategy)
> b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks
> from 4TB but possibly not enough)
> c) Replace 6TB with 8TB, resize/balance, then replace 4TB with
> recently vacated 6TB -- much longer procedure but possibly better
> 
> Or has this all been fixed and method A will work fine and get to the
> ideal goal -- 3 drives, with available space suitably distributed to
> allow full utilization over time?

The btrfs chunk allocator has already been trying to utilize all drives
for a long, long time.
When allocating chunks, btrfs will choose the device with the most free
space.  However, the nature of RAID1 requires btrfs to allocate from
2 different devices, which makes your replaced 4/4/6 a little complex.
(If your 4/4/6 array had been set up and then filled to the current
stage, btrfs should be able to utilize all the space.)


Personally speaking, if you're confident enough, just add a new device
and then do a balance.
If enough chunks get balanced, there should be enough space freed on the
existing disks.
Then remove the newly added device, and btrfs should handle the
remaining space well.
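
A rough sketch of that add/balance/remove sequence, with a hypothetical
spare device /dev/sdf and mount point /mnt:

    btrfs device add -f /dev/sdf /mnt   # -f clears any stale signature on the spare
    btrfs balance start /mnt            # full balance, freeing space on the full disks
    btrfs device remove /dev/sdf /mnt   # migrate its chunks off and drop the device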

Thanks,
Qu

> 
> On Sat, May 26, 2018 at 6:24 PM, Brad Templeton <bradtem@gmail.com> wrote:
>> A few years ago, I encountered an issue (halfway between a bug and a
>> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was fairly
>> full.   The problem was that after replacing (by add/delete) a small drive
>> with a larger one, there were now 2 full drives and one new half-full one,
>> and balance was not able to correct this situation to produce the desired
>> result, which is 3 drives, each with a roughly even amount of free space.
>> It can't do it because the 2 smaller drives are full, and it doesn't realize
>> it could just move one of the copies of a block off the smaller drive onto
>> the larger drive to free space on the smaller drive, it wants to move them
>> both, and there is nowhere to put them both.
>>
>> I'm about to do it again, taking my nearly full array which is 4TB, 4TB, 6TB
>> and replacing one of the 4TB with an 8TB.  I don't want to repeat the very
>> time consuming situation, so I wanted to find out if things were fixed now.
>> I am running Xenial (kernel 4.4.0) and could consider the upgrade to  bionic
>> (4.15) though that adds a lot more to my plate before a long trip and I
>> would prefer to avoid if I can.
>>
>> So what is the best strategy:
>>
>> a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic"
>> strategy)
>> b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks from
>> 4TB but possibly not enough)
>> c) Replace 6TB with 8TB, resize/balance, then replace 4TB with recently
>> vacated 6TB -- much longer procedure but possibly better
>>
>> Or has this all been fixed and method A will work fine and get to the ideal
>> goal -- 3 drives, with available space suitably distributed to allow full
>> utilization over time?
>>
>> On Fri, Mar 25, 2016 at 7:35 AM, Henk Slager <eye1tm@gmail.com> wrote:
>>>
>>> On Fri, Mar 25, 2016 at 2:16 PM, Patrik Lundquist
>>> <patrik.lundquist@gmail.com> wrote:
>>>> On 23 March 2016 at 20:33, Chris Murphy <lists@colorremedies.com> wrote:
>>>>>
>>>>> On Wed, Mar 23, 2016 at 1:10 PM, Brad Templeton <bradtem@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> I am surprised to hear it said that having the mixed sizes is an odd
>>>>>> case.
>>>>>
>>>>> Not odd as in wrong, just uncommon compared to other arrangements being
>>>>> tested.
>>>>
>>>> I think mixed drive sizes in raid1 is a killer feature for a home NAS,
>>>> where you replace an old smaller drive with the latest and largest
>>>> when you need more storage.
>>>>
>>>> My raid1 currently consists of 6TB+3TB+3*2TB.
>>>
>>> For the original OP situation, with chunks all filled op with extents
>>> and devices all filled up with chunks, 'integrating' a new 6TB drive
>>> in an 4TB+3TG+2TB raid1 array could probably be done in a bit unusual
>>> way in order to avoid immediate balancing needs:
>>> - 'plug-in' the 6TB
>>> - btrfs-replace  4TB by 6TB
>>> - btrfs fi resize max 6TB_devID
>>> - btrfs-replace  2TB by 4TB
>>> - btrfs fi resize max 4TB_devID
>>> - 'unplug' the 2TB
>>>
>>> So then there would be 2 devices with roughly 2TB space available, so
>>> good for continued btrfs raid1 writes.
>>>
>>> An offline variant with dd instead of btrfs-replace could also be done
>>> (I used to do that sometimes when btrfs-replace was not implemented).
>>> My experience is that btrfs-replace speed is roughly at max speed (so
>>> harddisk magnetic media transferspeed) during the whole replace
>>> process and it does in a more direct way what you actually want. So in
>>> total mostly way faster device replace/upgrade than with the
>>> add+delete method. And raid1 redundancy is active all the time. Of
>>> course it means first make sure the system runs up-to-date/latest
>>> kernel+tools.
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID-1 refuses to balance large drive
  2018-05-27  1:41                           ` Qu Wenruo
@ 2018-05-27  1:49                             ` Brad Templeton
  2018-05-27  1:56                               ` Qu Wenruo
  0 siblings, 1 reply; 35+ messages in thread
From: Brad Templeton @ 2018-05-27  1:49 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

That is what did not work last time.

I say I think there can be a "fix" because I hope the goal of BTRFS
raid is to be superior to traditional RAID: that if one replaces a
drive and asks for a balance, it figures out what needs to be done to
make that work.  I understand that the current balance algorithm may
have trouble with that.   In this situation, the ideal result would be
for the system to take the 3 drives (4TB and 6TB full, 8TB with 4TB
free) and move extents strictly from the 4TB and 6TB to the 8TB -- i.e.
extents which are currently on both the 4TB and 6TB -- by moving only
one copy.   It is not strictly a "bug" in that the code is operating
as designed, but it is undesired behavior.

The problem is the approach you describe did not work in the prior upgrade.

On Sat, May 26, 2018 at 6:41 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
> On 2018年05月27日 09:27, Brad Templeton wrote:
>> A few years ago, I encountered an issue (halfway between a bug and a
>> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was
>> fairly full.   The problem was that after replacing (by add/delete) a
>> small drive with a larger one, there were now 2 full drives and one
>> new half-full one, and balance was not able to correct this situation
>> to produce the desired result, which is 3 drives, each with a roughly
>> even amount of free space.  It can't do it because the 2 smaller
>> drives are full, and it doesn't realize it could just move one of the
>> copies of a block off the smaller drive onto the larger drive to free
>> space on the smaller drive, it wants to move them both, and there is
>> nowhere to put them both.
>
> It's not that easy.
> For balance, btrfs must first find a large enough space to locate both
> copy, then copy data.
> Or if powerloss happens, it will cause data corruption.
>
> So in your case, btrfs can only find enough space for one copy, thus
> unable to relocate any chunk.
>
>>
>> I'm about to do it again, taking my nearly full array which is 4TB,
>> 4TB, 6TB and replacing one of the 4TB with an 8TB.  I don't want to
>> repeat the very time consuming situation, so I wanted to find out if
>> things were fixed now.   I am running Xenial (kernel 4.4.0) and could
>> consider the upgrade to  bionic (4.15) though that adds a lot more to
>> my plate before a long trip and I would prefer to avoid if I can.
>
> Since there is nothing to fix, the behavior will not change at all.
>
>>
>> So what is the best strategy:
>>
>> a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic" strategy)
>> b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks
>> from 4TB but possibly not enough)
>> c) Replace 6TB with 8TB, resize/balance, then replace 4TB with
>> recently vacated 6TB -- much longer procedure but possibly better
>>
>> Or has this all been fixed and method A will work fine and get to the
>> ideal goal -- 3 drives, with available space suitably distributed to
>> allow full utilization over time?
>
> Btrfs chunk allocator is already trying to utilize all drivers for a
> long long time.
> When allocate chunks, btrfs will choose the device with the most free
> space. However the nature of RAID1 needs btrfs to allocate extents from
> 2 different devices, which makes your replaced 4/4/6 a little complex.
> (If your 4/4/6 array is set up and then filled to current stage, btrfs
> should be able to utilize all the space)
>
>
> Personally speaking, if you're confident enough, just add a new device,
> and then do balance.
> If enough chunks get balanced, there should be enough space freed on
> existing disks.
> Then remove the newly added device, then btrfs should handle the
> remaining space well.
>
> Thanks,
> Qu
>
>>
>> On Sat, May 26, 2018 at 6:24 PM, Brad Templeton <bradtem@gmail.com> wrote:
>>> A few years ago, I encountered an issue (halfway between a bug and a
>>> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was fairly
>>> full.   The problem was that after replacing (by add/delete) a small drive
>>> with a larger one, there were now 2 full drives and one new half-full one,
>>> and balance was not able to correct this situation to produce the desired
>>> result, which is 3 drives, each with a roughly even amount of free space.
>>> It can't do it because the 2 smaller drives are full, and it doesn't realize
>>> it could just move one of the copies of a block off the smaller drive onto
>>> the larger drive to free space on the smaller drive, it wants to move them
>>> both, and there is nowhere to put them both.
>>>
>>> I'm about to do it again, taking my nearly full array which is 4TB, 4TB, 6TB
>>> and replacing one of the 4TB with an 8TB.  I don't want to repeat the very
>>> time consuming situation, so I wanted to find out if things were fixed now.
>>> I am running Xenial (kernel 4.4.0) and could consider the upgrade to  bionic
>>> (4.15) though that adds a lot more to my plate before a long trip and I
>>> would prefer to avoid if I can.
>>>
>>> So what is the best strategy:
>>>
>>> a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic"
>>> strategy)
>>> b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks from
>>> 4TB but possibly not enough)
>>> c) Replace 6TB with 8TB, resize/balance, then replace 4TB with recently
>>> vacated 6TB -- much longer procedure but possibly better
>>>
>>> Or has this all been fixed and method A will work fine and get to the ideal
>>> goal -- 3 drives, with available space suitably distributed to allow full
>>> utilization over time?
>>>
>>> On Fri, Mar 25, 2016 at 7:35 AM, Henk Slager <eye1tm@gmail.com> wrote:
>>>>
>>>> On Fri, Mar 25, 2016 at 2:16 PM, Patrik Lundquist
>>>> <patrik.lundquist@gmail.com> wrote:
>>>>> On 23 March 2016 at 20:33, Chris Murphy <lists@colorremedies.com> wrote:
>>>>>>
>>>>>> On Wed, Mar 23, 2016 at 1:10 PM, Brad Templeton <bradtem@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> I am surprised to hear it said that having the mixed sizes is an odd
>>>>>>> case.
>>>>>>
>>>>>> Not odd as in wrong, just uncommon compared to other arrangements being
>>>>>> tested.
>>>>>
>>>>> I think mixed drive sizes in raid1 is a killer feature for a home NAS,
>>>>> where you replace an old smaller drive with the latest and largest
>>>>> when you need more storage.
>>>>>
>>>>> My raid1 currently consists of 6TB+3TB+3*2TB.
>>>>
>>>> For the original OP situation, with chunks all filled op with extents
>>>> and devices all filled up with chunks, 'integrating' a new 6TB drive
>>>> in an 4TB+3TG+2TB raid1 array could probably be done in a bit unusual
>>>> way in order to avoid immediate balancing needs:
>>>> - 'plug-in' the 6TB
>>>> - btrfs-replace  4TB by 6TB
>>>> - btrfs fi resize max 6TB_devID
>>>> - btrfs-replace  2TB by 4TB
>>>> - btrfs fi resize max 4TB_devID
>>>> - 'unplug' the 2TB
>>>>
>>>> So then there would be 2 devices with roughly 2TB space available, so
>>>> good for continued btrfs raid1 writes.
>>>>
>>>> An offline variant with dd instead of btrfs-replace could also be done
>>>> (I used to do that sometimes when btrfs-replace was not implemented).
>>>> My experience is that btrfs-replace speed is roughly at max speed (so
>>>> harddisk magnetic media transferspeed) during the whole replace
>>>> process and it does in a more direct way what you actually want. So in
>>>> total mostly way faster device replace/upgrade than with the
>>>> add+delete method. And raid1 redundancy is active all the time. Of
>>>> course it means first make sure the system runs up-to-date/latest
>>>> kernel+tools.
>>>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID-1 refuses to balance large drive
  2018-05-27  1:49                             ` Brad Templeton
@ 2018-05-27  1:56                               ` Qu Wenruo
  2018-05-27  2:06                                 ` Brad Templeton
  0 siblings, 1 reply; 35+ messages in thread
From: Qu Wenruo @ 2018-05-27  1:56 UTC (permalink / raw)
  To: Brad Templeton; +Cc: Btrfs BTRFS


[-- Attachment #1.1: Type: text/plain, Size: 8677 bytes --]



On 2018-05-27 09:49, Brad Templeton wrote:
> That is what did not work last time.
> 
> I say I think there can be a "fix" because I hope the goal of BTRFS
> raid is to be superior to traditional RAID.   That if one replaces a
> drive, and asks to balance, it figures out what needs to be done to
> make that work.  I understand that the current balance algorithm may
> have trouble with that.   In this situation, the ideal result would be
> the system would take the 3 drives (4TB and 6TB full, 8TB with 4TB
> free) and move extents strictly from the 4TB and 6TB to the 8TB -- ie
> extents which are currently on both the 4TB and 6TB -- by moving only
> one copy.

Btrfs can only do balance in chunk units.
Thus btrfs can only:
1) Create a new chunk
2) Copy the data
3) Remove the old chunk.

So it can't work the way you mentioned.
But your purpose sounds pretty valid, and maybe we could enhance btrfs
to do such a thing.
(Currently only replace behaves like that.)

> It is not strictly a "bug" in that the code is operating
> as designed, but it is an undesired function.
> 
> The problem is the approach you describe did not work in the prior upgrade.

Would you please try 4/4/6 + 4 or 4/4/6 + 2 and then balance?
The "btrfs fi usage" and "btrfs fi show" output from before and after
the balance would also help.
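
Something like this would capture that, assuming a hypothetical mount
point /mnt:

    btrfs fi usage /mnt > usage-before.txt
    btrfs fi show /mnt  > show-before.txt
    btrfs balance start /mnt
    btrfs fi usage /mnt > usage-after.txt
    btrfs fi show /mnt  > show-after.txt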

Thanks,
Qu

> 
> On Sat, May 26, 2018 at 6:41 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>> On 2018年05月27日 09:27, Brad Templeton wrote:
>>> A few years ago, I encountered an issue (halfway between a bug and a
>>> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was
>>> fairly full.   The problem was that after replacing (by add/delete) a
>>> small drive with a larger one, there were now 2 full drives and one
>>> new half-full one, and balance was not able to correct this situation
>>> to produce the desired result, which is 3 drives, each with a roughly
>>> even amount of free space.  It can't do it because the 2 smaller
>>> drives are full, and it doesn't realize it could just move one of the
>>> copies of a block off the smaller drive onto the larger drive to free
>>> space on the smaller drive, it wants to move them both, and there is
>>> nowhere to put them both.
>>
>> It's not that easy.
>> For balance, btrfs must first find a large enough space to locate both
>> copy, then copy data.
>> Or if powerloss happens, it will cause data corruption.
>>
>> So in your case, btrfs can only find enough space for one copy, thus
>> unable to relocate any chunk.
>>
>>>
>>> I'm about to do it again, taking my nearly full array which is 4TB,
>>> 4TB, 6TB and replacing one of the 4TB with an 8TB.  I don't want to
>>> repeat the very time consuming situation, so I wanted to find out if
>>> things were fixed now.   I am running Xenial (kernel 4.4.0) and could
>>> consider the upgrade to  bionic (4.15) though that adds a lot more to
>>> my plate before a long trip and I would prefer to avoid if I can.
>>
>> Since there is nothing to fix, the behavior will not change at all.
>>
>>>
>>> So what is the best strategy:
>>>
>>> a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic" strategy)
>>> b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks
>>> from 4TB but possibly not enough)
>>> c) Replace 6TB with 8TB, resize/balance, then replace 4TB with
>>> recently vacated 6TB -- much longer procedure but possibly better
>>>
>>> Or has this all been fixed and method A will work fine and get to the
>>> ideal goal -- 3 drives, with available space suitably distributed to
>>> allow full utilization over time?
>>
>> Btrfs chunk allocator is already trying to utilize all drivers for a
>> long long time.
>> When allocate chunks, btrfs will choose the device with the most free
>> space. However the nature of RAID1 needs btrfs to allocate extents from
>> 2 different devices, which makes your replaced 4/4/6 a little complex.
>> (If your 4/4/6 array is set up and then filled to current stage, btrfs
>> should be able to utilize all the space)
>>
>>
>> Personally speaking, if you're confident enough, just add a new device,
>> and then do balance.
>> If enough chunks get balanced, there should be enough space freed on
>> existing disks.
>> Then remove the newly added device, then btrfs should handle the
>> remaining space well.
>>
>> Thanks,
>> Qu
>>
>>>
>>> On Sat, May 26, 2018 at 6:24 PM, Brad Templeton <bradtem@gmail.com> wrote:
>>>> A few years ago, I encountered an issue (halfway between a bug and a
>>>> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was fairly
>>>> full.   The problem was that after replacing (by add/delete) a small drive
>>>> with a larger one, there were now 2 full drives and one new half-full one,
>>>> and balance was not able to correct this situation to produce the desired
>>>> result, which is 3 drives, each with a roughly even amount of free space.
>>>> It can't do it because the 2 smaller drives are full, and it doesn't realize
>>>> it could just move one of the copies of a block off the smaller drive onto
>>>> the larger drive to free space on the smaller drive, it wants to move them
>>>> both, and there is nowhere to put them both.
>>>>
>>>> I'm about to do it again, taking my nearly full array which is 4TB, 4TB, 6TB
>>>> and replacing one of the 4TB with an 8TB.  I don't want to repeat the very
>>>> time consuming situation, so I wanted to find out if things were fixed now.
>>>> I am running Xenial (kernel 4.4.0) and could consider the upgrade to  bionic
>>>> (4.15) though that adds a lot more to my plate before a long trip and I
>>>> would prefer to avoid if I can.
>>>>
>>>> So what is the best strategy:
>>>>
>>>> a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic"
>>>> strategy)
>>>> b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks from
>>>> 4TB but possibly not enough)
>>>> c) Replace 6TB with 8TB, resize/balance, then replace 4TB with recently
>>>> vacated 6TB -- much longer procedure but possibly better
>>>>
>>>> Or has this all been fixed and method A will work fine and get to the ideal
>>>> goal -- 3 drives, with available space suitably distributed to allow full
>>>> utilization over time?
>>>>
>>>> On Fri, Mar 25, 2016 at 7:35 AM, Henk Slager <eye1tm@gmail.com> wrote:
>>>>>
>>>>> On Fri, Mar 25, 2016 at 2:16 PM, Patrik Lundquist
>>>>> <patrik.lundquist@gmail.com> wrote:
>>>>>> On 23 March 2016 at 20:33, Chris Murphy <lists@colorremedies.com> wrote:
>>>>>>>
>>>>>>> On Wed, Mar 23, 2016 at 1:10 PM, Brad Templeton <bradtem@gmail.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I am surprised to hear it said that having the mixed sizes is an odd
>>>>>>>> case.
>>>>>>>
>>>>>>> Not odd as in wrong, just uncommon compared to other arrangements being
>>>>>>> tested.
>>>>>>
>>>>>> I think mixed drive sizes in raid1 is a killer feature for a home NAS,
>>>>>> where you replace an old smaller drive with the latest and largest
>>>>>> when you need more storage.
>>>>>>
>>>>>> My raid1 currently consists of 6TB+3TB+3*2TB.
>>>>>
>>>>> For the original OP situation, with chunks all filled op with extents
>>>>> and devices all filled up with chunks, 'integrating' a new 6TB drive
>>>>> in an 4TB+3TG+2TB raid1 array could probably be done in a bit unusual
>>>>> way in order to avoid immediate balancing needs:
>>>>> - 'plug-in' the 6TB
>>>>> - btrfs-replace  4TB by 6TB
>>>>> - btrfs fi resize max 6TB_devID
>>>>> - btrfs-replace  2TB by 4TB
>>>>> - btrfs fi resize max 4TB_devID
>>>>> - 'unplug' the 2TB
>>>>>
>>>>> So then there would be 2 devices with roughly 2TB space available, so
>>>>> good for continued btrfs raid1 writes.
>>>>>
>>>>> An offline variant with dd instead of btrfs-replace could also be done
>>>>> (I used to do that sometimes when btrfs-replace was not implemented).
>>>>> My experience is that btrfs-replace speed is roughly at max speed (so
>>>>> harddisk magnetic media transferspeed) during the whole replace
>>>>> process and it does in a more direct way what you actually want. So in
>>>>> total mostly way faster device replace/upgrade than with the
>>>>> add+delete method. And raid1 redundancy is active all the time. Of
>>>>> course it means first make sure the system runs up-to-date/latest
>>>>> kernel+tools.
>>>>
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID-1 refuses to balance large drive
  2018-05-27  1:56                               ` Qu Wenruo
@ 2018-05-27  2:06                                 ` Brad Templeton
  2018-05-27  2:16                                   ` Qu Wenruo
  0 siblings, 1 reply; 35+ messages in thread
From: Brad Templeton @ 2018-05-27  2:06 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

Thanks.  These are all things which take substantial fractions of a
day to try, unfortunately.    Last time I ended up fixing it in a
fairly kludged way, which was to convert from raid-1 to single long
enough to get enough single blocks that, when I converted back to
raid-1, they got distributed to the right drives.  But this is, aside
from being a kludge, a procedure with some minor risk.  Of course I am
taking a backup first, but still...

This strikes me as something that should be a fairly common event --
your raid is filling up, and so you expand it by replacing the oldest
and smallest drive with a new, much bigger one.   In the old days of
RAID you could not do that; you had to grow all drives at the same
time, and this is one of the ways that BTRFS is quite superior.
When I had MD raid, I went through a strange process of always having
a raid 5 that consisted of different-sized drives.  The raid-5 was
based on the smallest of the 3 drives, and then the larger ones had
extra space which could either be in raid-1, or more simply was in solo
disk mode and used for less critical data (such as backups and old
archives).   Slowly, and in a messy way, each time I replaced the
smallest drive, I could then grow the raid 5.  Yuck.     BTRFS is so
much better, except for this issue.

So if somebody has a thought of a procedure that is fairly sure to
work and doesn't involve too many copying passes -- copying 4TB is not
a quick operation -- it would be much appreciated and might be a good
thing to add to a wiki page, which I would be happy to do.

On Sat, May 26, 2018 at 6:56 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
> On 2018年05月27日 09:49, Brad Templeton wrote:
>> That is what did not work last time.
>>
>> I say I think there can be a "fix" because I hope the goal of BTRFS
>> raid is to be superior to traditional RAID.   That if one replaces a
>> drive, and asks to balance, it figures out what needs to be done to
>> make that work.  I understand that the current balance algorithm may
>> have trouble with that.   In this situation, the ideal result would be
>> the system would take the 3 drives (4TB and 6TB full, 8TB with 4TB
>> free) and move extents strictly from the 4TB and 6TB to the 8TB -- ie
>> extents which are currently on both the 4TB and 6TB -- by moving only
>> one copy.
>
> Btrfs can only do balance in a chunk unit.
> Thus btrfs can only do:
> 1) Create new chunk
> 2) Copy data
> 3) Remove old chunk.
>
> So it can't do the way you mentioned.
> But your purpose sounds pretty valid and maybe we could enhanace btrfs
> to do such thing.
> (Currently only replace can behave like that)
>
>> It is not strictly a "bug" in that the code is operating
>> as designed, but it is an undesired function.
>>
>> The problem is the approach you describe did not work in the prior upgrade.
>
> Would you please try 4/4/6 + 4 or 4/4/6 + 2 and then balance?
> And before/after balance, "btrfs fi usage" and "btrfs fi show" output
> could also help.
>
> Thanks,
> Qu
>
>>
>> On Sat, May 26, 2018 at 6:41 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>
>>>
>>> On 2018年05月27日 09:27, Brad Templeton wrote:
>>>> A few years ago, I encountered an issue (halfway between a bug and a
>>>> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was
>>>> fairly full.   The problem was that after replacing (by add/delete) a
>>>> small drive with a larger one, there were now 2 full drives and one
>>>> new half-full one, and balance was not able to correct this situation
>>>> to produce the desired result, which is 3 drives, each with a roughly
>>>> even amount of free space.  It can't do it because the 2 smaller
>>>> drives are full, and it doesn't realize it could just move one of the
>>>> copies of a block off the smaller drive onto the larger drive to free
>>>> space on the smaller drive, it wants to move them both, and there is
>>>> nowhere to put them both.
>>>
>>> It's not that easy.
>>> For balance, btrfs must first find a large enough space to locate both
>>> copy, then copy data.
>>> Or if powerloss happens, it will cause data corruption.
>>>
>>> So in your case, btrfs can only find enough space for one copy, thus
>>> unable to relocate any chunk.
>>>
>>>>
>>>> I'm about to do it again, taking my nearly full array which is 4TB,
>>>> 4TB, 6TB and replacing one of the 4TB with an 8TB.  I don't want to
>>>> repeat the very time consuming situation, so I wanted to find out if
>>>> things were fixed now.   I am running Xenial (kernel 4.4.0) and could
>>>> consider the upgrade to  bionic (4.15) though that adds a lot more to
>>>> my plate before a long trip and I would prefer to avoid if I can.
>>>
>>> Since there is nothing to fix, the behavior will not change at all.
>>>
>>>>
>>>> So what is the best strategy:
>>>>
>>>> a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic" strategy)
>>>> b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks
>>>> from 4TB but possibly not enough)
>>>> c) Replace 6TB with 8TB, resize/balance, then replace 4TB with
>>>> recently vacated 6TB -- much longer procedure but possibly better
>>>>
>>>> Or has this all been fixed and method A will work fine and get to the
>>>> ideal goal -- 3 drives, with available space suitably distributed to
>>>> allow full utilization over time?
>>>
>>> Btrfs chunk allocator is already trying to utilize all drivers for a
>>> long long time.
>>> When allocate chunks, btrfs will choose the device with the most free
>>> space. However the nature of RAID1 needs btrfs to allocate extents from
>>> 2 different devices, which makes your replaced 4/4/6 a little complex.
>>> (If your 4/4/6 array is set up and then filled to current stage, btrfs
>>> should be able to utilize all the space)
>>>
>>>
>>> Personally speaking, if you're confident enough, just add a new device,
>>> and then do balance.
>>> If enough chunks get balanced, there should be enough space freed on
>>> existing disks.
>>> Then remove the newly added device, then btrfs should handle the
>>> remaining space well.
>>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>> On Sat, May 26, 2018 at 6:24 PM, Brad Templeton <bradtem@gmail.com> wrote:
>>>>> A few years ago, I encountered an issue (halfway between a bug and a
>>>>> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was fairly
>>>>> full.   The problem was that after replacing (by add/delete) a small drive
>>>>> with a larger one, there were now 2 full drives and one new half-full one,
>>>>> and balance was not able to correct this situation to produce the desired
>>>>> result, which is 3 drives, each with a roughly even amount of free space.
>>>>> It can't do it because the 2 smaller drives are full, and it doesn't realize
>>>>> it could just move one of the copies of a block off the smaller drive onto
>>>>> the larger drive to free space on the smaller drive, it wants to move them
>>>>> both, and there is nowhere to put them both.
>>>>>
>>>>> I'm about to do it again, taking my nearly full array which is 4TB, 4TB, 6TB
>>>>> and replacing one of the 4TB with an 8TB.  I don't want to repeat the very
>>>>> time consuming situation, so I wanted to find out if things were fixed now.
>>>>> I am running Xenial (kernel 4.4.0) and could consider the upgrade to  bionic
>>>>> (4.15) though that adds a lot more to my plate before a long trip and I
>>>>> would prefer to avoid if I can.
>>>>>
>>>>> So what is the best strategy:
>>>>>
>>>>> a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic"
>>>>> strategy)
>>>>> b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks from
>>>>> 4TB but possibly not enough)
>>>>> c) Replace 6TB with 8TB, resize/balance, then replace 4TB with recently
>>>>> vacated 6TB -- much longer procedure but possibly better
>>>>>
>>>>> Or has this all been fixed and method A will work fine and get to the ideal
>>>>> goal -- 3 drives, with available space suitably distributed to allow full
>>>>> utilization over time?
>>>>>
>>>>> On Fri, Mar 25, 2016 at 7:35 AM, Henk Slager <eye1tm@gmail.com> wrote:
>>>>>>
>>>>>> On Fri, Mar 25, 2016 at 2:16 PM, Patrik Lundquist
>>>>>> <patrik.lundquist@gmail.com> wrote:
>>>>>>> On 23 March 2016 at 20:33, Chris Murphy <lists@colorremedies.com> wrote:
>>>>>>>>
>>>>>>>> On Wed, Mar 23, 2016 at 1:10 PM, Brad Templeton <bradtem@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I am surprised to hear it said that having the mixed sizes is an odd
>>>>>>>>> case.
>>>>>>>>
>>>>>>>> Not odd as in wrong, just uncommon compared to other arrangements being
>>>>>>>> tested.
>>>>>>>
>>>>>>> I think mixed drive sizes in raid1 is a killer feature for a home NAS,
>>>>>>> where you replace an old smaller drive with the latest and largest
>>>>>>> when you need more storage.
>>>>>>>
>>>>>>> My raid1 currently consists of 6TB+3TB+3*2TB.
>>>>>>
>>>>>> For the original OP situation, with chunks all filled op with extents
>>>>>> and devices all filled up with chunks, 'integrating' a new 6TB drive
>>>>>> in an 4TB+3TG+2TB raid1 array could probably be done in a bit unusual
>>>>>> way in order to avoid immediate balancing needs:
>>>>>> - 'plug-in' the 6TB
>>>>>> - btrfs-replace  4TB by 6TB
>>>>>> - btrfs fi resize max 6TB_devID
>>>>>> - btrfs-replace  2TB by 4TB
>>>>>> - btrfs fi resize max 4TB_devID
>>>>>> - 'unplug' the 2TB
>>>>>>
>>>>>> So then there would be 2 devices with roughly 2TB space available, so
>>>>>> good for continued btrfs raid1 writes.
>>>>>>
>>>>>> An offline variant with dd instead of btrfs-replace could also be done
>>>>>> (I used to do that sometimes when btrfs-replace was not implemented).
>>>>>> My experience is that btrfs-replace speed is roughly at max speed (so
>>>>>> harddisk magnetic media transferspeed) during the whole replace
>>>>>> process and it does in a more direct way what you actually want. So in
>>>>>> total mostly way faster device replace/upgrade than with the
>>>>>> add+delete method. And raid1 redundancy is active all the time. Of
>>>>>> course it means first make sure the system runs up-to-date/latest
>>>>>> kernel+tools.
>>>>>
>>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID-1 refuses to balance large drive
  2018-05-27  2:06                                 ` Brad Templeton
@ 2018-05-27  2:16                                   ` Qu Wenruo
  2018-05-27  2:21                                     ` Brad Templeton
  0 siblings, 1 reply; 35+ messages in thread
From: Qu Wenruo @ 2018-05-27  2:16 UTC (permalink / raw)
  To: Brad Templeton; +Cc: Btrfs BTRFS


[-- Attachment #1.1: Type: text/plain, Size: 11573 bytes --]



On 2018-05-27 10:06, Brad Templeton wrote:
> Thanks.  These are all things which take substantial fractions of a
> day to try, unfortunately.

Normally I would suggest just using a VM and several small disks (~10G),
along with fallocate (the fastest way to fill space), to get a basic
view of the procedure.
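
A quick loop-device stand-in for that kind of test (all names and sizes
hypothetical):

    truncate -s 10G /tmp/d1.img /tmp/d2.img /tmp/d3.img
    losetup -f --show /tmp/d1.img     # repeat for d2/d3, note the /dev/loopN names
    mkfs.btrfs -m raid1 -d raid1 /dev/loop0 /dev/loop1 /dev/loop2
    mkdir -p /mnt/test && mount /dev/loop0 /mnt/test
    fallocate -l 4G /mnt/test/filler1   # fastest way to eat space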

> Last time I ended up fixing it in a
> fairly kluged way, which was to convert from raid-1 to single long
> enough to get enough single blocks that when I converted back to
> raid-1 they got distributed to the right drives.

Yep, that's the ultimate one-size-fits-all solution.
Also, this reminds me of the fact that we could do the
RAID1->Single/DUP->Single downgrade in a much, much faster way.
I think it's worth considering as a later enhancement.

>  But this is, aside
> from being a kludge, a procedure with some minor risk.  Of course I am
> taking a backup first, but still...
> 
> This strikes me as something that should be a fairly common event --
> your raid is filling up, and so you expand it by replacing the oldest
> and smallest drive with a new much bigger one.   In the old days of
> RAID, you could not do that, you had to grow all drives at the same
> time, and this is one of the ways that BTRFS is quite superior.
> When I had MD raid, I went through a strange process of always having
> a raid 5 that consisted of different sized drives.  The raid-5 was
> based on the smallest of the 3 drives, and then the larger ones had
> extra space which could either be in raid-1, or more imply was in solo
> disk mode and used for less critical data (such as backups and old
> archives.)   Slowly, and in a messy way, each time I replaced the
> smallest drive, I could then grow the raid 5.  Yuck.     BTRFS is so
> much better, except for this issue.
> 
> So if somebody has a thought of a procedure that is fairly sure to
> work and doesn't involve too many copying passes -- copying 4tb is not
> a quick operation -- it is much appreciated and might be a good thing
> to add to a wiki page, which I would be happy to do.

Anyway, "btrfs fi show" and "btrfs fi usage" would help before any
further advice from community.

Thanks,
Qu

> 
> On Sat, May 26, 2018 at 6:56 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>> On 2018年05月27日 09:49, Brad Templeton wrote:
>>> That is what did not work last time.
>>>
>>> I say I think there can be a "fix" because I hope the goal of BTRFS
>>> raid is to be superior to traditional RAID.   That if one replaces a
>>> drive, and asks to balance, it figures out what needs to be done to
>>> make that work.  I understand that the current balance algorithm may
>>> have trouble with that.   In this situation, the ideal result would be
>>> the system would take the 3 drives (4TB and 6TB full, 8TB with 4TB
>>> free) and move extents strictly from the 4TB and 6TB to the 8TB -- ie
>>> extents which are currently on both the 4TB and 6TB -- by moving only
>>> one copy.
>>
>> Btrfs can only do balance in a chunk unit.
>> Thus btrfs can only do:
>> 1) Create new chunk
>> 2) Copy data
>> 3) Remove old chunk.
>>
>> So it can't do the way you mentioned.
>> But your purpose sounds pretty valid and maybe we could enhanace btrfs
>> to do such thing.
>> (Currently only replace can behave like that)
>>
>>> It is not strictly a "bug" in that the code is operating
>>> as designed, but it is an undesired function.
>>>
>>> The problem is the approach you describe did not work in the prior upgrade.
>>
>> Would you please try 4/4/6 + 4 or 4/4/6 + 2 and then balance?
>> And before/after balance, "btrfs fi usage" and "btrfs fi show" output
>> could also help.
>>
>> Thanks,
>> Qu
>>
>>>
>>> On Sat, May 26, 2018 at 6:41 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>
>>>>
>>>> On 2018年05月27日 09:27, Brad Templeton wrote:
>>>>> A few years ago, I encountered an issue (halfway between a bug and a
>>>>> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was
>>>>> fairly full.   The problem was that after replacing (by add/delete) a
>>>>> small drive with a larger one, there were now 2 full drives and one
>>>>> new half-full one, and balance was not able to correct this situation
>>>>> to produce the desired result, which is 3 drives, each with a roughly
>>>>> even amount of free space.  It can't do it because the 2 smaller
>>>>> drives are full, and it doesn't realize it could just move one of the
>>>>> copies of a block off the smaller drive onto the larger drive to free
>>>>> space on the smaller drive, it wants to move them both, and there is
>>>>> nowhere to put them both.
>>>>
>>>> It's not that easy.
>>>> For balance, btrfs must first find a large enough space to locate both
>>>> copy, then copy data.
>>>> Or if powerloss happens, it will cause data corruption.
>>>>
>>>> So in your case, btrfs can only find enough space for one copy, thus
>>>> unable to relocate any chunk.
>>>>
>>>>>
>>>>> I'm about to do it again, taking my nearly full array which is 4TB,
>>>>> 4TB, 6TB and replacing one of the 4TB with an 8TB.  I don't want to
>>>>> repeat the very time consuming situation, so I wanted to find out if
>>>>> things were fixed now.   I am running Xenial (kernel 4.4.0) and could
>>>>> consider the upgrade to  bionic (4.15) though that adds a lot more to
>>>>> my plate before a long trip and I would prefer to avoid if I can.
>>>>
>>>> Since there is nothing to fix, the behavior will not change at all.
>>>>
>>>>>
>>>>> So what is the best strategy:
>>>>>
>>>>> a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic" strategy)
>>>>> b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks
>>>>> from 4TB but possibly not enough)
>>>>> c) Replace 6TB with 8TB, resize/balance, then replace 4TB with
>>>>> recently vacated 6TB -- much longer procedure but possibly better
>>>>>
>>>>> Or has this all been fixed and method A will work fine and get to the
>>>>> ideal goal -- 3 drives, with available space suitably distributed to
>>>>> allow full utilization over time?
>>>>
>>>> Btrfs chunk allocator is already trying to utilize all drivers for a
>>>> long long time.
>>>> When allocate chunks, btrfs will choose the device with the most free
>>>> space. However the nature of RAID1 needs btrfs to allocate extents from
>>>> 2 different devices, which makes your replaced 4/4/6 a little complex.
>>>> (If your 4/4/6 array is set up and then filled to current stage, btrfs
>>>> should be able to utilize all the space)
>>>>
>>>>
>>>> Personally speaking, if you're confident enough, just add a new device,
>>>> and then do balance.
>>>> If enough chunks get balanced, there should be enough space freed on
>>>> existing disks.
>>>> Then remove the newly added device, then btrfs should handle the
>>>> remaining space well.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>> On Sat, May 26, 2018 at 6:24 PM, Brad Templeton <bradtem@gmail.com> wrote:
>>>>>> A few years ago, I encountered an issue (halfway between a bug and a
>>>>>> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was fairly
>>>>>> full.   The problem was that after replacing (by add/delete) a small drive
>>>>>> with a larger one, there were now 2 full drives and one new half-full one,
>>>>>> and balance was not able to correct this situation to produce the desired
>>>>>> result, which is 3 drives, each with a roughly even amount of free space.
>>>>>> It can't do it because the 2 smaller drives are full, and it doesn't realize
>>>>>> it could just move one of the copies of a block off the smaller drive onto
>>>>>> the larger drive to free space on the smaller drive, it wants to move them
>>>>>> both, and there is nowhere to put them both.
>>>>>>
>>>>>> I'm about to do it again, taking my nearly full array which is 4TB, 4TB, 6TB
>>>>>> and replacing one of the 4TB with an 8TB.  I don't want to repeat the very
>>>>>> time consuming situation, so I wanted to find out if things were fixed now.
>>>>>> I am running Xenial (kernel 4.4.0) and could consider the upgrade to  bionic
>>>>>> (4.15) though that adds a lot more to my plate before a long trip and I
>>>>>> would prefer to avoid if I can.
>>>>>>
>>>>>> So what is the best strategy:
>>>>>>
>>>>>> a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic"
>>>>>> strategy)
>>>>>> b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks from
>>>>>> 4TB but possibly not enough)
>>>>>> c) Replace 6TB with 8TB, resize/balance, then replace 4TB with recently
>>>>>> vacated 6TB -- much longer procedure but possibly better
>>>>>>
>>>>>> Or has this all been fixed and method A will work fine and get to the ideal
>>>>>> goal -- 3 drives, with available space suitably distributed to allow full
>>>>>> utilization over time?
>>>>>>
>>>>>> On Fri, Mar 25, 2016 at 7:35 AM, Henk Slager <eye1tm@gmail.com> wrote:
>>>>>>>
>>>>>>> On Fri, Mar 25, 2016 at 2:16 PM, Patrik Lundquist
>>>>>>> <patrik.lundquist@gmail.com> wrote:
>>>>>>>> On 23 March 2016 at 20:33, Chris Murphy <lists@colorremedies.com> wrote:
>>>>>>>>>
>>>>>>>>> On Wed, Mar 23, 2016 at 1:10 PM, Brad Templeton <bradtem@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> I am surprised to hear it said that having the mixed sizes is an odd
>>>>>>>>>> case.
>>>>>>>>>
>>>>>>>>> Not odd as in wrong, just uncommon compared to other arrangements being
>>>>>>>>> tested.
>>>>>>>>
>>>>>>>> I think mixed drive sizes in raid1 is a killer feature for a home NAS,
>>>>>>>> where you replace an old smaller drive with the latest and largest
>>>>>>>> when you need more storage.
>>>>>>>>
>>>>>>>> My raid1 currently consists of 6TB+3TB+3*2TB.
>>>>>>>
>>>>>>> For the original OP situation, with chunks all filled op with extents
>>>>>>> and devices all filled up with chunks, 'integrating' a new 6TB drive
>>>>>>> in an 4TB+3TG+2TB raid1 array could probably be done in a bit unusual
>>>>>>> way in order to avoid immediate balancing needs:
>>>>>>> - 'plug-in' the 6TB
>>>>>>> - btrfs-replace  4TB by 6TB
>>>>>>> - btrfs fi resize max 6TB_devID
>>>>>>> - btrfs-replace  2TB by 4TB
>>>>>>> - btrfs fi resize max 4TB_devID
>>>>>>> - 'unplug' the 2TB
>>>>>>>
>>>>>>> So then there would be 2 devices with roughly 2TB space available, so
>>>>>>> good for continued btrfs raid1 writes.
>>>>>>>
>>>>>>> An offline variant with dd instead of btrfs-replace could also be done
>>>>>>> (I used to do that sometimes when btrfs-replace was not implemented).
>>>>>>> My experience is that btrfs-replace speed is roughly at max speed (so
>>>>>>> harddisk magnetic media transferspeed) during the whole replace
>>>>>>> process and it does in a more direct way what you actually want. So in
>>>>>>> total mostly way faster device replace/upgrade than with the
>>>>>>> add+delete method. And raid1 redundancy is active all the time. Of
>>>>>>> course it means first make sure the system runs up-to-date/latest
>>>>>>> kernel+tools.
>>>>>>
>>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID-1 refuses to balance large drive
  2018-05-27  2:16                                   ` Qu Wenruo
@ 2018-05-27  2:21                                     ` Brad Templeton
  2018-05-27  5:55                                       ` Duncan
  2018-05-27 18:22                                       ` Brad Templeton
  0 siblings, 2 replies; 35+ messages in thread
From: Brad Templeton @ 2018-05-27  2:21 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

Certainly.  My apologies for not including them before.   As
described, the disks are reasonably balanced -- not as full as the
last time.  As such, it might be enough that balance would (slowly)
free up enough chunks to get things going.  And if I have to, I will
partially convert to single again.   btrfs replace seems like the
most straightforward and predictable path, but it will result in a
strange distribution of the chunks.

Label: 'butter'  uuid: a91755d4-87d8-4acd-ae08-c11e7f1f5438
       Total devices 3 FS bytes used 6.11TiB
       devid    1 size 3.62TiB used 3.47TiB path /dev/sdj2
Overall:
   Device size:                  12.70TiB
   Device allocated:             12.25TiB
   Device unallocated:          459.95GiB
   Device missing:                  0.00B
   Used:                         12.21TiB
   Free (estimated):            246.35GiB      (min: 246.35GiB)
   Data ratio:                       2.00
   Metadata ratio:                   2.00
   Global reserve:              512.00MiB      (used: 1.32MiB)

Data,RAID1: Size:6.11TiB, Used:6.09TiB
  /dev/sda        3.48TiB
  /dev/sdi2       5.28TiB
  /dev/sdj2       3.46TiB

Metadata,RAID1: Size:14.00GiB, Used:12.38GiB
  /dev/sda        8.00GiB
  /dev/sdi2       7.00GiB
  /dev/sdj2      13.00GiB

System,RAID1: Size:32.00MiB, Used:888.00KiB
  /dev/sdi2      32.00MiB
  /dev/sdj2      32.00MiB

Unallocated:
  /dev/sda      153.02GiB
  /dev/sdi2     154.56GiB
  /dev/sdj2     152.36GiB

      devid    2 size 3.64TiB used 3.49TiB path /dev/sda
       devid    3 size 5.43TiB used 5.28TiB path /dev/sdi2


On Sat, May 26, 2018 at 7:16 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
> On 2018年05月27日 10:06, Brad Templeton wrote:
>> Thanks.  These are all things which take substantial fractions of a
>> day to try, unfortunately.
>
> Normally I would suggest just using VM and several small disks (~10G),
> along with fallocate (the fastest way to use space) to get a basic view
> of the procedure.
>
>> Last time I ended up fixing it in a
>> fairly kluged way, which was to convert from raid-1 to single long
>> enough to get enough single blocks that when I converted back to
>> raid-1 they got distributed to the right drives.
>
> Yep, that's the ultimate one-fit-all solution.
> Also, this reminds me about the fact we could do the
> RAID1->Single/DUP->Single downgrade in a much much faster way.
> I think it's worthy considering for later enhancement.
>
>>  But this is, aside
>> from being a kludge, a procedure with some minor risk.  Of course I am
>> taking a backup first, but still...
>>
>> This strikes me as something that should be a fairly common event --
>> your raid is filling up, and so you expand it by replacing the oldest
>> and smallest drive with a new much bigger one.   In the old days of
>> RAID, you could not do that, you had to grow all drives at the same
>> time, and this is one of the ways that BTRFS is quite superior.
>> When I had MD raid, I went through a strange process of always having
>> a raid 5 that consisted of different sized drives.  The raid-5 was
>> based on the smallest of the 3 drives, and then the larger ones had
>> extra space which could either be in raid-1, or more imply was in solo
>> disk mode and used for less critical data (such as backups and old
>> archives.)   Slowly, and in a messy way, each time I replaced the
>> smallest drive, I could then grow the raid 5.  Yuck.     BTRFS is so
>> much better, except for this issue.
>>
>> So if somebody has a thought of a procedure that is fairly sure to
>> work and doesn't involve too many copying passes -- copying 4tb is not
>> a quick operation -- it is much appreciated and might be a good thing
>> to add to a wiki page, which I would be happy to do.
>
> Anyway, "btrfs fi show" and "btrfs fi usage" would help before any
> further advice from community.
>
> Thanks,
> Qu
>
>>
>> On Sat, May 26, 2018 at 6:56 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>
>>>
>>> On 2018年05月27日 09:49, Brad Templeton wrote:
>>>> That is what did not work last time.
>>>>
>>>> I say I think there can be a "fix" because I hope the goal of BTRFS
>>>> raid is to be superior to traditional RAID.   That if one replaces a
>>>> drive, and asks to balance, it figures out what needs to be done to
>>>> make that work.  I understand that the current balance algorithm may
>>>> have trouble with that.   In this situation, the ideal result would be
>>>> the system would take the 3 drives (4TB and 6TB full, 8TB with 4TB
>>>> free) and move extents strictly from the 4TB and 6TB to the 8TB -- ie
>>>> extents which are currently on both the 4TB and 6TB -- by moving only
>>>> one copy.
>>>
>>> Btrfs can only do balance in a chunk unit.
>>> Thus btrfs can only do:
>>> 1) Create new chunk
>>> 2) Copy data
>>> 3) Remove old chunk.
>>>
>>> So it can't do the way you mentioned.
>>> But your purpose sounds pretty valid and maybe we could enhanace btrfs
>>> to do such thing.
>>> (Currently only replace can behave like that)
>>>
>>>> It is not strictly a "bug" in that the code is operating
>>>> as designed, but it is an undesired function.
>>>>
>>>> The problem is the approach you describe did not work in the prior upgrade.
>>>
>>> Would you please try 4/4/6 + 4 or 4/4/6 + 2 and then balance?
>>> And before/after balance, "btrfs fi usage" and "btrfs fi show" output
>>> could also help.
>>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>> On Sat, May 26, 2018 at 6:41 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>>
>>>>>
>>>>> On 2018年05月27日 09:27, Brad Templeton wrote:
>>>>>> A few years ago, I encountered an issue (halfway between a bug and a
>>>>>> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was
>>>>>> fairly full.   The problem was that after replacing (by add/delete) a
>>>>>> small drive with a larger one, there were now 2 full drives and one
>>>>>> new half-full one, and balance was not able to correct this situation
>>>>>> to produce the desired result, which is 3 drives, each with a roughly
>>>>>> even amount of free space.  It can't do it because the 2 smaller
>>>>>> drives are full, and it doesn't realize it could just move one of the
>>>>>> copies of a block off the smaller drive onto the larger drive to free
>>>>>> space on the smaller drive, it wants to move them both, and there is
>>>>>> nowhere to put them both.
>>>>>
>>>>> It's not that easy.
>>>>> For balance, btrfs must first find a large enough space to locate both
>>>>> copy, then copy data.
>>>>> Or if powerloss happens, it will cause data corruption.
>>>>>
>>>>> So in your case, btrfs can only find enough space for one copy, thus
>>>>> unable to relocate any chunk.
>>>>>
>>>>>>
>>>>>> I'm about to do it again, taking my nearly full array which is 4TB,
>>>>>> 4TB, 6TB and replacing one of the 4TB with an 8TB.  I don't want to
>>>>>> repeat the very time consuming situation, so I wanted to find out if
>>>>>> things were fixed now.   I am running Xenial (kernel 4.4.0) and could
>>>>>> consider the upgrade to  bionic (4.15) though that adds a lot more to
>>>>>> my plate before a long trip and I would prefer to avoid if I can.
>>>>>
>>>>> Since there is nothing to fix, the behavior will not change at all.
>>>>>
>>>>>>
>>>>>> So what is the best strategy:
>>>>>>
>>>>>> a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic" strategy)
>>>>>> b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks
>>>>>> from 4TB but possibly not enough)
>>>>>> c) Replace 6TB with 8TB, resize/balance, then replace 4TB with
>>>>>> recently vacated 6TB -- much longer procedure but possibly better
>>>>>>
>>>>>> Or has this all been fixed and method A will work fine and get to the
>>>>>> ideal goal -- 3 drives, with available space suitably distributed to
>>>>>> allow full utilization over time?
>>>>>
>>>>> Btrfs chunk allocator is already trying to utilize all drivers for a
>>>>> long long time.
>>>>> When allocate chunks, btrfs will choose the device with the most free
>>>>> space. However the nature of RAID1 needs btrfs to allocate extents from
>>>>> 2 different devices, which makes your replaced 4/4/6 a little complex.
>>>>> (If your 4/4/6 array is set up and then filled to current stage, btrfs
>>>>> should be able to utilize all the space)
>>>>>
>>>>>
>>>>> Personally speaking, if you're confident enough, just add a new device,
>>>>> and then do balance.
>>>>> If enough chunks get balanced, there should be enough space freed on
>>>>> existing disks.
>>>>> Then remove the newly added device, then btrfs should handle the
>>>>> remaining space well.
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>
>>>>>>
>>>>>> On Sat, May 26, 2018 at 6:24 PM, Brad Templeton <bradtem@gmail.com> wrote:
>>>>>>> A few years ago, I encountered an issue (halfway between a bug and a
>>>>>>> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was fairly
>>>>>>> full.   The problem was that after replacing (by add/delete) a small drive
>>>>>>> with a larger one, there were now 2 full drives and one new half-full one,
>>>>>>> and balance was not able to correct this situation to produce the desired
>>>>>>> result, which is 3 drives, each with a roughly even amount of free space.
>>>>>>> It can't do it because the 2 smaller drives are full, and it doesn't realize
>>>>>>> it could just move one of the copies of a block off the smaller drive onto
>>>>>>> the larger drive to free space on the smaller drive, it wants to move them
>>>>>>> both, and there is nowhere to put them both.
>>>>>>>
>>>>>>> I'm about to do it again, taking my nearly full array which is 4TB, 4TB, 6TB
>>>>>>> and replacing one of the 4TB with an 8TB.  I don't want to repeat the very
>>>>>>> time consuming situation, so I wanted to find out if things were fixed now.
>>>>>>> I am running Xenial (kernel 4.4.0) and could consider the upgrade to  bionic
>>>>>>> (4.15) though that adds a lot more to my plate before a long trip and I
>>>>>>> would prefer to avoid if I can.
>>>>>>>
>>>>>>> So what is the best strategy:
>>>>>>>
>>>>>>> a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic"
>>>>>>> strategy)
>>>>>>> b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks from
>>>>>>> 4TB but possibly not enough)
>>>>>>> c) Replace 6TB with 8TB, resize/balance, then replace 4TB with recently
>>>>>>> vacated 6TB -- much longer procedure but possibly better
>>>>>>>
>>>>>>> Or has this all been fixed and method A will work fine and get to the ideal
>>>>>>> goal -- 3 drives, with available space suitably distributed to allow full
>>>>>>> utilization over time?
>>>>>>>
>>>>>>> On Fri, Mar 25, 2016 at 7:35 AM, Henk Slager <eye1tm@gmail.com> wrote:
>>>>>>>>
>>>>>>>> On Fri, Mar 25, 2016 at 2:16 PM, Patrik Lundquist
>>>>>>>> <patrik.lundquist@gmail.com> wrote:
>>>>>>>>> On 23 March 2016 at 20:33, Chris Murphy <lists@colorremedies.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 23, 2016 at 1:10 PM, Brad Templeton <bradtem@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I am surprised to hear it said that having the mixed sizes is an odd
>>>>>>>>>>> case.
>>>>>>>>>>
>>>>>>>>>> Not odd as in wrong, just uncommon compared to other arrangements being
>>>>>>>>>> tested.
>>>>>>>>>
>>>>>>>>> I think mixed drive sizes in raid1 is a killer feature for a home NAS,
>>>>>>>>> where you replace an old smaller drive with the latest and largest
>>>>>>>>> when you need more storage.
>>>>>>>>>
>>>>>>>>> My raid1 currently consists of 6TB+3TB+3*2TB.
>>>>>>>>
>>>>>>>> For the original OP situation, with chunks all filled op with extents
>>>>>>>> and devices all filled up with chunks, 'integrating' a new 6TB drive
>>>>>>>> in an 4TB+3TG+2TB raid1 array could probably be done in a bit unusual
>>>>>>>> way in order to avoid immediate balancing needs:
>>>>>>>> - 'plug-in' the 6TB
>>>>>>>> - btrfs-replace  4TB by 6TB
>>>>>>>> - btrfs fi resize max 6TB_devID
>>>>>>>> - btrfs-replace  2TB by 4TB
>>>>>>>> - btrfs fi resize max 4TB_devID
>>>>>>>> - 'unplug' the 2TB
>>>>>>>>
>>>>>>>> So then there would be 2 devices with roughly 2TB space available, so
>>>>>>>> good for continued btrfs raid1 writes.
>>>>>>>>
>>>>>>>> An offline variant with dd instead of btrfs-replace could also be done
>>>>>>>> (I used to do that sometimes when btrfs-replace was not implemented).
>>>>>>>> My experience is that btrfs-replace speed is roughly at max speed (so
>>>>>>>> harddisk magnetic media transferspeed) during the whole replace
>>>>>>>> process and it does in a more direct way what you actually want. So in
>>>>>>>> total mostly way faster device replace/upgrade than with the
>>>>>>>> add+delete method. And raid1 redundancy is active all the time. Of
>>>>>>>> course it means first make sure the system runs up-to-date/latest
>>>>>>>> kernel+tools.
>>>>>>>
>>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID-1 refuses to balance large drive
  2018-05-27  2:21                                     ` Brad Templeton
@ 2018-05-27  5:55                                       ` Duncan
  2018-05-27 18:22                                       ` Brad Templeton
  1 sibling, 0 replies; 35+ messages in thread
From: Duncan @ 2018-05-27  5:55 UTC (permalink / raw)
  To: linux-btrfs

Brad Templeton posted on Sat, 26 May 2018 19:21:57 -0700 as excerpted:

> Certainly.  My apologies for not including them before.

Aieee!  Reply before quote, making the reply out of context, and my
attempt to reply in context... difficult and troublesome.

Please use standard list context-quote, reply in context, next time,
making it easier for further replies also in context.

> As
> described, the disks are reasonably balanced -- not as full as the
> last time.  As such, it might be enough that balance would (slowly)
> free up enough chunks to get things going.  And if I have to, I will
> partially convert to single again.   Certainly btrfs replace seems
> like the most planned and simple path but it will result in a strange
> distribution of the chunks.

[btrfs filesystem usage output below]

> Label: 'butter'  uuid: a91755d4-87d8-4acd-ae08-c11e7f1f5438
>        Total devices 3 FS bytes used 6.11TiB
>        devid    1 size 3.62TiB used 3.47TiB path /dev/sdj2Overall:
>    Device size:                  12.70TiB
>    Device allocated:             12.25TiB
>    Device unallocated:          459.95GiB
>    Device missing:                  0.00B
>    Used:                         12.21TiB
>    Free (estimated):            246.35GiB      (min: 246.35GiB)
>    Data ratio:                       2.00
>    Metadata ratio:                   2.00
>    Global reserve:              512.00MiB      (used: 1.32MiB)
> 
> Data,RAID1: Size:6.11TiB, Used:6.09TiB
>   /dev/sda        3.48TiB
>   /dev/sdi2       5.28TiB
>   /dev/sdj2       3.46TiB
> 
> Metadata,RAID1: Size:14.00GiB, Used:12.38GiB
>   /dev/sda        8.00GiB
>   /dev/sdi2       7.00GiB
>   /dev/sdj2      13.00GiB
> 
> System,RAID1: Size:32.00MiB, Used:888.00KiB
>   /dev/sdi2      32.00MiB
>   /dev/sdj2      32.00MiB
> 
> Unallocated:
>   /dev/sda      153.02GiB
>   /dev/sdi2     154.56GiB
>   /dev/sdj2     152.36GiB

[Presumably this is a bit of btrfs filesystem show output, but the
rest of it is missing...]

>       devid    2 size 3.64TiB used 3.49TiB path /dev/sda
>       devid    3 size 5.43TiB used 5.28TiB path /dev/sdi2


Based on the 100+ GiB still free on each of the three devices above,
you should have no issues balancing after replacing one of them.

Presumably the first time you tried it, there was far less, likely under
a GiB free on the two not replaced.  Since data chunks are nominally
1 GiB each and raid1 requires two copies, each on a different device,
that didn't leave enough space on either of the older devices to do
a balance, even tho there was plenty of space left on the just-replaced
new one.

(Multiple-GiB chunks are possible on TB+ devices, but 10 GiB free on
each device should be plenty, so with 100+ GiB free on each there should
be no issues unless you run into some strange bug.)


Meanwhile, even in the case of not enough space free on all three
existing devices, given that they're currently two 4 TB devices and
a 6 TB device and that you're replacing one of the 4 TB devices with
an 8 TB device...

Doing a two-step replace should do the trick very well: first replace
the 6 TB device with the new 8 TB device and resize to the new 8 TB
size, giving you ~2 TB of free space on it, then replace one of the
4 TB devices with the now free 6 TB device and again resize to the new
6 TB size, giving you ~2 TB free on it too.  That leaves ~2 TB free on
each of two devices instead of all 4 TB of new space on a single
device, and it should still be faster, probably MUCH faster, than doing
a temporary convert to single, then back to raid1, the kludge you used
last time. =:^)
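
In command form that would look something like this (device paths and
devids here are placeholders, check btrfs fi show for the real ones):

  btrfs replace start /dev/OLD6TB /dev/NEW8TB /local
  btrfs replace status /local             # wait for completion
  btrfs fi resize <8TB-devid>:max /local  # grow onto the full 8 TB
  btrfs replace start -f /dev/OLD4TB /dev/OLD6TB /local
                                          # -f: the freed 6 TB still
                                          # carries an old btrfs signature
  btrfs fi resize <6TB-devid>:max /local  # grow onto the full 6 TB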


Meanwhile, while kernel version of course remains up to you, given that
you mentioned 4.4 with a potential upgrade to 4.15, I will at least
cover the following, so you'll have it to use as you decide on kernel
versions.

4.15?  Why?  4.14 is the current mainline LTS kernel series, with 4.15
only being a normal short-term stable series that has already been
EOLed.  So 4.15 now makes little sense at all.  Either go with the current
stable series, 4.16, and continue to upgrade as new kernels come (4.17
should be out shortly as it's past rc6, with rc7 likely out by the time
you read this and release likely in a week), or stick with 4.14 LTS for
the longer-term support.

Of course you can go with your distro kernel if you like, and I presume
that's why you mentioned 4.15, but as I said it's already EOLed upstream,
and of course this list being a kernel development list, our focus tends
to be on upstream/mainstream, not distro level kernels.  If you choose
a distro level kernel series that's EOLed at kernel.org, then you really
should be getting support from them for it, as they know what they've
backported and/or patched and are thus best positioned to support it.

As for what this list does try to support, it's the last two kernel
release series in each of the current and LTS tracks.  So as the first
release back from the current 4.16, 4.15, tho EOLed upstream, is still
reasonably supported here for the moment.  But people should be
upgrading to 4.16 by now, as 4.17 should be out in a couple of weeks
and 4.15 will then fall out of the two-current-kernel-series window.

Meanwhile, the two latest LTS series are as already stated 4.14, and the
earlier 4.9.  4.4 is the one previous to that and it's still mainline
supported in general, but it's out of the two LTS-series window of best
support here, and truth be told, based on history, even supporting the
second newest LTS series starts to get more difficult at about a year and
a half out, 6 months or so before the next LTS comes out.  As it happens
that's about where 4.9 is now, and 4.14 has had about 6 months to
stabilize, so for LTS I'd definitely recommend 4.14 at this point.

Of course that doesn't mean that we /refuse/ to support 4.4, we still
try, but it's out of primary focus now and in many cases, should you
have problems, the first recommendation is going to be to try something
newer and see if the problem goes away or presents differently.  Or,
as mentioned, check with your distro if it's a distro kernel, since
in that case they're best positioned to support it.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID-1 refuses to balance large drive
  2018-05-27  2:21                                     ` Brad Templeton
  2018-05-27  5:55                                       ` Duncan
@ 2018-05-27 18:22                                       ` Brad Templeton
  2018-05-28  8:31                                         ` Duncan
  1 sibling, 1 reply; 35+ messages in thread
From: Brad Templeton @ 2018-05-27 18:22 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

BTW, I decided to follow the double-replace strategy suggested
earlier -- replace the 6TB with the 8TB, then the 4TB with the freed
6TB.  That should be sure to leave the 2 large drives each with 2TB
free once expanded, and thus able to fully use all the space.

However, the first one has been going for 9 hours and is "189.7% done"
and still going.   Some sort of bug in calculating the completion
status, obviously.  With luck 200% will be enough?

On Sat, May 26, 2018 at 7:21 PM, Brad Templeton <bradtem@gmail.com> wrote:
> Certainly.  My apologies for not including them before.   As
> described, the disks are reasonably balanced -- not as full as the
> last time.  As such, it might be enough that balance would (slowly)
> free up enough chunks to get things going.  And if I have to, I will
> partially convert to single again.   Certainly btrfs replace seems
> like the most planned and simple path but it will result in a strange
> distribution of the chunks.
>
> Label: 'butter'  uuid: a91755d4-87d8-4acd-ae08-c11e7f1f5438
>        Total devices 3 FS bytes used 6.11TiB
>        devid    1 size 3.62TiB used 3.47TiB path /dev/sdj2Overall:
>    Device size:                  12.70TiB
>    Device allocated:             12.25TiB
>    Device unallocated:          459.95GiB
>    Device missing:                  0.00B
>    Used:                         12.21TiB
>    Free (estimated):            246.35GiB      (min: 246.35GiB)
>    Data ratio:                       2.00
>    Metadata ratio:                   2.00
>    Global reserve:              512.00MiB      (used: 1.32MiB)
>
> Data,RAID1: Size:6.11TiB, Used:6.09TiB
>   /dev/sda        3.48TiB
>   /dev/sdi2       5.28TiB
>   /dev/sdj2       3.46TiB
>
> Metadata,RAID1: Size:14.00GiB, Used:12.38GiB
>   /dev/sda        8.00GiB
>   /dev/sdi2       7.00GiB
>   /dev/sdj2      13.00GiB
>
> System,RAID1: Size:32.00MiB, Used:888.00KiB
>   /dev/sdi2      32.00MiB
>   /dev/sdj2      32.00MiB
>
> Unallocated:
>   /dev/sda      153.02GiB
>   /dev/sdi2     154.56GiB
>   /dev/sdj2     152.36GiB
>
>       devid    2 size 3.64TiB used 3.49TiB path /dev/sda
>        devid    3 size 5.43TiB used 5.28TiB path /dev/sdi2
>
>
> On Sat, May 26, 2018 at 7:16 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>> On 2018年05月27日 10:06, Brad Templeton wrote:
>>> Thanks.  These are all things which take substantial fractions of a
>>> day to try, unfortunately.
>>
>> Normally I would suggest just using VM and several small disks (~10G),
>> along with fallocate (the fastest way to use space) to get a basic view
>> of the procedure.
>>
>>> Last time I ended up fixing it in a
>>> fairly kluged way, which was to convert from raid-1 to single long
>>> enough to get enough single blocks that when I converted back to
>>> raid-1 they got distributed to the right drives.
>>
>> Yep, that's the ultimate one-fit-all solution.
>> Also, this reminds me about the fact we could do the
>> RAID1->Single/DUP->Single downgrade in a much much faster way.
>> I think it's worthy considering for later enhancement.
>>
>>>  But this is, aside
>>> from being a kludge, a procedure with some minor risk.  Of course I am
>>> taking a backup first, but still...
>>>
>>> This strikes me as something that should be a fairly common event --
>>> your raid is filling up, and so you expand it by replacing the oldest
>>> and smallest drive with a new much bigger one.   In the old days of
>>> RAID, you could not do that, you had to grow all drives at the same
>>> time, and this is one of the ways that BTRFS is quite superior.
>>> When I had MD raid, I went through a strange process of always having
>>> a raid 5 that consisted of different sized drives.  The raid-5 was
>>> based on the smallest of the 3 drives, and then the larger ones had
>>> extra space which could either be in raid-1, or more imply was in solo
>>> disk mode and used for less critical data (such as backups and old
>>> archives.)   Slowly, and in a messy way, each time I replaced the
>>> smallest drive, I could then grow the raid 5.  Yuck.     BTRFS is so
>>> much better, except for this issue.
>>>
>>> So if somebody has a thought of a procedure that is fairly sure to
>>> work and doesn't involve too many copying passes -- copying 4tb is not
>>> a quick operation -- it is much appreciated and might be a good thing
>>> to add to a wiki page, which I would be happy to do.
>>
>> Anyway, "btrfs fi show" and "btrfs fi usage" would help before any
>> further advice from community.
>>
>> Thanks,
>> Qu
>>
>>>
>>> On Sat, May 26, 2018 at 6:56 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>
>>>>
>>>> On 2018年05月27日 09:49, Brad Templeton wrote:
>>>>> That is what did not work last time.
>>>>>
>>>>> I say I think there can be a "fix" because I hope the goal of BTRFS
>>>>> raid is to be superior to traditional RAID.   That if one replaces a
>>>>> drive, and asks to balance, it figures out what needs to be done to
>>>>> make that work.  I understand that the current balance algorithm may
>>>>> have trouble with that.   In this situation, the ideal result would be
>>>>> the system would take the 3 drives (4TB and 6TB full, 8TB with 4TB
>>>>> free) and move extents strictly from the 4TB and 6TB to the 8TB -- ie
>>>>> extents which are currently on both the 4TB and 6TB -- by moving only
>>>>> one copy.
>>>>
>>>> Btrfs can only do balance in a chunk unit.
>>>> Thus btrfs can only do:
>>>> 1) Create new chunk
>>>> 2) Copy data
>>>> 3) Remove old chunk.
>>>>
>>>> So it can't do the way you mentioned.
>>>> But your purpose sounds pretty valid and maybe we could enhanace btrfs
>>>> to do such thing.
>>>> (Currently only replace can behave like that)
>>>>
>>>>> It is not strictly a "bug" in that the code is operating
>>>>> as designed, but it is an undesired function.
>>>>>
>>>>> The problem is the approach you describe did not work in the prior upgrade.
>>>>
>>>> Would you please try 4/4/6 + 4 or 4/4/6 + 2 and then balance?
>>>> And before/after balance, "btrfs fi usage" and "btrfs fi show" output
>>>> could also help.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>> On Sat, May 26, 2018 at 6:41 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>>>
>>>>>>
>>>>>> On 2018年05月27日 09:27, Brad Templeton wrote:
>>>>>>> A few years ago, I encountered an issue (halfway between a bug and a
>>>>>>> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was
>>>>>>> fairly full.   The problem was that after replacing (by add/delete) a
>>>>>>> small drive with a larger one, there were now 2 full drives and one
>>>>>>> new half-full one, and balance was not able to correct this situation
>>>>>>> to produce the desired result, which is 3 drives, each with a roughly
>>>>>>> even amount of free space.  It can't do it because the 2 smaller
>>>>>>> drives are full, and it doesn't realize it could just move one of the
>>>>>>> copies of a block off the smaller drive onto the larger drive to free
>>>>>>> space on the smaller drive, it wants to move them both, and there is
>>>>>>> nowhere to put them both.
>>>>>>
>>>>>> It's not that easy.
>>>>>> For balance, btrfs must first find a large enough space to locate both
>>>>>> copy, then copy data.
>>>>>> Or if powerloss happens, it will cause data corruption.
>>>>>>
>>>>>> So in your case, btrfs can only find enough space for one copy, thus
>>>>>> unable to relocate any chunk.
>>>>>>
>>>>>>>
>>>>>>> I'm about to do it again, taking my nearly full array which is 4TB,
>>>>>>> 4TB, 6TB and replacing one of the 4TB with an 8TB.  I don't want to
>>>>>>> repeat the very time consuming situation, so I wanted to find out if
>>>>>>> things were fixed now.   I am running Xenial (kernel 4.4.0) and could
>>>>>>> consider the upgrade to  bionic (4.15) though that adds a lot more to
>>>>>>> my plate before a long trip and I would prefer to avoid if I can.
>>>>>>
>>>>>> Since there is nothing to fix, the behavior will not change at all.
>>>>>>
>>>>>>>
>>>>>>> So what is the best strategy:
>>>>>>>
>>>>>>> a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic" strategy)
>>>>>>> b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks
>>>>>>> from 4TB but possibly not enough)
>>>>>>> c) Replace 6TB with 8TB, resize/balance, then replace 4TB with
>>>>>>> recently vacated 6TB -- much longer procedure but possibly better
>>>>>>>
>>>>>>> Or has this all been fixed and method A will work fine and get to the
>>>>>>> ideal goal -- 3 drives, with available space suitably distributed to
>>>>>>> allow full utilization over time?
>>>>>>
>>>>>> Btrfs chunk allocator is already trying to utilize all drivers for a
>>>>>> long long time.
>>>>>> When allocate chunks, btrfs will choose the device with the most free
>>>>>> space. However the nature of RAID1 needs btrfs to allocate extents from
>>>>>> 2 different devices, which makes your replaced 4/4/6 a little complex.
>>>>>> (If your 4/4/6 array is set up and then filled to current stage, btrfs
>>>>>> should be able to utilize all the space)
>>>>>>
>>>>>>
>>>>>> Personally speaking, if you're confident enough, just add a new device,
>>>>>> and then do balance.
>>>>>> If enough chunks get balanced, there should be enough space freed on
>>>>>> existing disks.
>>>>>> Then remove the newly added device, then btrfs should handle the
>>>>>> remaining space well.
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>>>
>>>>>>>
>>>>>>> On Sat, May 26, 2018 at 6:24 PM, Brad Templeton <bradtem@gmail.com> wrote:
>>>>>>>> A few years ago, I encountered an issue (halfway between a bug and a
>>>>>>>> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was fairly
>>>>>>>> full.   The problem was that after replacing (by add/delete) a small drive
>>>>>>>> with a larger one, there were now 2 full drives and one new half-full one,
>>>>>>>> and balance was not able to correct this situation to produce the desired
>>>>>>>> result, which is 3 drives, each with a roughly even amount of free space.
>>>>>>>> It can't do it because the 2 smaller drives are full, and it doesn't realize
>>>>>>>> it could just move one of the copies of a block off the smaller drive onto
>>>>>>>> the larger drive to free space on the smaller drive, it wants to move them
>>>>>>>> both, and there is nowhere to put them both.
>>>>>>>>
>>>>>>>> I'm about to do it again, taking my nearly full array which is 4TB, 4TB, 6TB
>>>>>>>> and replacing one of the 4TB with an 8TB.  I don't want to repeat the very
>>>>>>>> time consuming situation, so I wanted to find out if things were fixed now.
>>>>>>>> I am running Xenial (kernel 4.4.0) and could consider the upgrade to  bionic
>>>>>>>> (4.15) though that adds a lot more to my plate before a long trip and I
>>>>>>>> would prefer to avoid if I can.
>>>>>>>>
>>>>>>>> So what is the best strategy:
>>>>>>>>
>>>>>>>> a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic"
>>>>>>>> strategy)
>>>>>>>> b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks from
>>>>>>>> 4TB but possibly not enough)
>>>>>>>> c) Replace 6TB with 8TB, resize/balance, then replace 4TB with recently
>>>>>>>> vacated 6TB -- much longer procedure but possibly better
>>>>>>>>
>>>>>>>> Or has this all been fixed and method A will work fine and get to the ideal
>>>>>>>> goal -- 3 drives, with available space suitably distributed to allow full
>>>>>>>> utilization over time?
>>>>>>>>
>>>>>>>> On Fri, Mar 25, 2016 at 7:35 AM, Henk Slager <eye1tm@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> On Fri, Mar 25, 2016 at 2:16 PM, Patrik Lundquist
>>>>>>>>> <patrik.lundquist@gmail.com> wrote:
>>>>>>>>>> On 23 March 2016 at 20:33, Chris Murphy <lists@colorremedies.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 23, 2016 at 1:10 PM, Brad Templeton <bradtem@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I am surprised to hear it said that having the mixed sizes is an odd
>>>>>>>>>>>> case.
>>>>>>>>>>>
>>>>>>>>>>> Not odd as in wrong, just uncommon compared to other arrangements being
>>>>>>>>>>> tested.
>>>>>>>>>>
>>>>>>>>>> I think mixed drive sizes in raid1 is a killer feature for a home NAS,
>>>>>>>>>> where you replace an old smaller drive with the latest and largest
>>>>>>>>>> when you need more storage.
>>>>>>>>>>
>>>>>>>>>> My raid1 currently consists of 6TB+3TB+3*2TB.
>>>>>>>>>
>>>>>>>>> For the original OP situation, with chunks all filled op with extents
>>>>>>>>> and devices all filled up with chunks, 'integrating' a new 6TB drive
>>>>>>>>> in an 4TB+3TG+2TB raid1 array could probably be done in a bit unusual
>>>>>>>>> way in order to avoid immediate balancing needs:
>>>>>>>>> - 'plug-in' the 6TB
>>>>>>>>> - btrfs-replace  4TB by 6TB
>>>>>>>>> - btrfs fi resize max 6TB_devID
>>>>>>>>> - btrfs-replace  2TB by 4TB
>>>>>>>>> - btrfs fi resize max 4TB_devID
>>>>>>>>> - 'unplug' the 2TB
>>>>>>>>>
>>>>>>>>> So then there would be 2 devices with roughly 2TB space available, so
>>>>>>>>> good for continued btrfs raid1 writes.
>>>>>>>>>
>>>>>>>>> An offline variant with dd instead of btrfs-replace could also be done
>>>>>>>>> (I used to do that sometimes when btrfs-replace was not implemented).
>>>>>>>>> My experience is that btrfs-replace speed is roughly at max speed (so
>>>>>>>>> harddisk magnetic media transferspeed) during the whole replace
>>>>>>>>> process and it does in a more direct way what you actually want. So in
>>>>>>>>> total mostly way faster device replace/upgrade than with the
>>>>>>>>> add+delete method. And raid1 redundancy is active all the time. Of
>>>>>>>>> course it means first make sure the system runs up-to-date/latest
>>>>>>>>> kernel+tools.
>>>>>>>>
>>>>>>>>
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID-1 refuses to balance large drive
  2018-05-27 18:22                                       ` Brad Templeton
@ 2018-05-28  8:31                                         ` Duncan
  0 siblings, 0 replies; 35+ messages in thread
From: Duncan @ 2018-05-28  8:31 UTC (permalink / raw)
  To: linux-btrfs

Brad Templeton posted on Sun, 27 May 2018 11:22:07 -0700 as excerpted:

> BTW, I decided to follow the original double replace strategy suggested 
--
> replace 6TB with 8TB and replace 4TB with 6TB.  That should be sure to
> leave the 2 large drives each with 2TB free once expanded, and thus able
> to fully use all space.
> 
> However, the first one has been going for 9 hours and is "189.7% done" 
> and still going.   Some sort of bug in calculating the completion
> status, obviously.  With luck 200% will be enough?

IIRC there was an over-100% completion status bug fixed, I'd guess about
18 months to two years ago now, long enough ago that it would have slipped
regulars' minds, so nobody would have thought of it even knowing you're
still on 4.4.  That's one of the reasons we don't do as well supporting
stuff that old.

If it is indeed the same bug, anything even half modern should have it
fixed.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID-1 refuses to balance large drive
  2018-05-27  1:27                         ` Brad Templeton
  2018-05-27  1:41                           ` Qu Wenruo
@ 2018-06-08  3:23                           ` Zygo Blaxell
  1 sibling, 0 replies; 35+ messages in thread
From: Zygo Blaxell @ 2018-06-08  3:23 UTC (permalink / raw)
  To: Brad Templeton; +Cc: Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 7601 bytes --]

On Sat, May 26, 2018 at 06:27:57PM -0700, Brad Templeton wrote:
> A few years ago, I encountered an issue (halfway between a bug and a
> problem) with attempting to grow a BTRFS 3 disk Raid 1 which was
> fairly full.   The problem was that after replacing (by add/delete) a
> small drive with a larger one, there were now 2 full drives and one
> new half-full one, and balance was not able to correct this situation
> to produce the desired result, which is 3 drives, each with a roughly
> even amount of free space.  It can't do it because the 2 smaller
> drives are full, and it doesn't realize it could just move one of the
> copies of a block off the smaller drive onto the larger drive to free
> space on the smaller drive, it wants to move them both, and there is
> nowhere to put them both.
> 
> I'm about to do it again, taking my nearly full array which is 4TB,
> 4TB, 6TB and replacing one of the 4TB with an 8TB.  I don't want to
> repeat the very time consuming situation, so I wanted to find out if
> things were fixed now.   I am running Xenial (kernel 4.4.0) and could
> consider the upgrade to  bionic (4.15) though that adds a lot more to
> my plate before a long trip and I would prefer to avoid if I can.
> 
> So what is the best strategy:
> 
> a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic" strategy)
> b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks
> from 4TB but possibly not enough)
> c) Replace 6TB with 8TB, resize/balance, then replace 4TB with
> recently vacated 6TB -- much longer procedure but possibly better

d) Run "btrfs balance start -dlimit=3 /fs" to make some unallocated
space on all drives *before* adding disks.  Then replace, resize up,
and balance until the unallocated space on all disks is equal.  There is
no need to continue balancing after that, so once that point is reached
you can cancel the balance.
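
Concretely, something along these lines (a sketch only; /fs, the device
paths and the devid are placeholders for the real values):

  btrfs balance start -dlimit=3 /fs    # free a little unallocated space
                                       # on every drive first
  btrfs replace start /dev/OLD4TB /dev/NEW8TB /fs
  btrfs fi resize <new-devid>:max /fs  # grow onto the full new drive
  btrfs balance start /fs              # balance, watching btrfs fi usage
  btrfs balance cancel /fs             # from another shell, once the
                                       # unallocated space evens out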

A number of bad things can happen when unallocated space goes to zero,
and being unable to expand a raid1 array is only one of them.  Avoid that
situation even when not resizing the array, because some cases can be
very difficult to get out of.

Assuming your disk is not filled to the last gigabyte, you'll be able
to keep at least 1GB unallocated on every disk at all times.  Monitor
the amount of unallocated space and balance a few data block groups
(e.g. -dlimit=3) whenever unallocated space gets low.
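
For example (path again a placeholder), the periodic check can be as
simple as:

  btrfs fi usage /fs | grep -A4 '^Unallocated:'
  # and when any device drops to a few GiB unallocated:
  btrfs balance start -dlimit=3 /fs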

A potential btrfs enhancement area:  allow the 'devid' parameter of
balance to specify two disks to balance block groups that contain chunks
on both disks.  We want to balance only those block groups that consist of
one chunk on each smaller drive.  This redistributes those block groups
to have one chunk on the large disk and one chunk on one of the smaller
disks, freeing space on the other small disk for the next block group.
Block groups that consist of a chunk on the big disk and one of the
small disks are already in the desired configuration, so rebalancing
them is just a waste of time.  Currently it's only possible to do this
by writing a script to select individual block groups with python-btrfs
or similar--much faster than plain btrfs balance for this case, but more
involved to set up.

> Or has this all been fixed and method A will work fine and get to the
> ideal goal -- 3 drives, with available space suitably distributed to
> allow full utilization over time?
> 
> On Sat, May 26, 2018 at 6:24 PM, Brad Templeton <bradtem@gmail.com> wrote:
> > A few years ago, I encountered an issue (halfway between a bug and a
> > problem) with attempting to grow a BTRFS 3 disk Raid 1 which was fairly
> > full.   The problem was that after replacing (by add/delete) a small drive
> > with a larger one, there were now 2 full drives and one new half-full one,
> > and balance was not able to correct this situation to produce the desired
> > result, which is 3 drives, each with a roughly even amount of free space.
> > It can't do it because the 2 smaller drives are full, and it doesn't realize
> > it could just move one of the copies of a block off the smaller drive onto
> > the larger drive to free space on the smaller drive, it wants to move them
> > both, and there is nowhere to put them both.
> >
> > I'm about to do it again, taking my nearly full array which is 4TB, 4TB, 6TB
> > and replacing one of the 4TB with an 8TB.  I don't want to repeat the very
> > time consuming situation, so I wanted to find out if things were fixed now.
> > I am running Xenial (kernel 4.4.0) and could consider the upgrade to  bionic
> > (4.15) though that adds a lot more to my plate before a long trip and I
> > would prefer to avoid if I can.
> >
> > So what is the best strategy:
> >
> > a) Replace 4TB with 8TB, resize up and balance?  (This is the "basic"
> > strategy)
> > b) Add 8TB, balance, remove 4TB (automatic distribution of some blocks from
> > 4TB but possibly not enough)
> > c) Replace 6TB with 8TB, resize/balance, then replace 4TB with recently
> > vacated 6TB -- much longer procedure but possibly better
> >
> > Or has this all been fixed and method A will work fine and get to the ideal
> > goal -- 3 drives, with available space suitably distributed to allow full
> > utilization over time?
> >
> > On Fri, Mar 25, 2016 at 7:35 AM, Henk Slager <eye1tm@gmail.com> wrote:
> >>
> >> On Fri, Mar 25, 2016 at 2:16 PM, Patrik Lundquist
> >> <patrik.lundquist@gmail.com> wrote:
> >> > On 23 March 2016 at 20:33, Chris Murphy <lists@colorremedies.com> wrote:
> >> >>
> >> >> On Wed, Mar 23, 2016 at 1:10 PM, Brad Templeton <bradtem@gmail.com>
> >> >> wrote:
> >> >> >
> >> >> > I am surprised to hear it said that having the mixed sizes is an odd
> >> >> > case.
> >> >>
> >> >> Not odd as in wrong, just uncommon compared to other arrangements being
> >> >> tested.
> >> >
> >> > I think mixed drive sizes in raid1 is a killer feature for a home NAS,
> >> > where you replace an old smaller drive with the latest and largest
> >> > when you need more storage.
> >> >
> >> > My raid1 currently consists of 6TB+3TB+3*2TB.
> >>
> >> For the original OP situation, with chunks all filled op with extents
> >> and devices all filled up with chunks, 'integrating' a new 6TB drive
> >> in an 4TB+3TG+2TB raid1 array could probably be done in a bit unusual
> >> way in order to avoid immediate balancing needs:
> >> - 'plug-in' the 6TB
> >> - btrfs-replace  4TB by 6TB
> >> - btrfs fi resize max 6TB_devID
> >> - btrfs-replace  2TB by 4TB
> >> - btrfs fi resize max 4TB_devID
> >> - 'unplug' the 2TB
> >>
> >> So then there would be 2 devices with roughly 2TB space available, so
> >> good for continued btrfs raid1 writes.
> >>
> >> An offline variant with dd instead of btrfs-replace could also be done
> >> (I used to do that sometimes when btrfs-replace was not implemented).
> >> My experience is that btrfs-replace speed is roughly at max speed (so
> >> harddisk magnetic media transferspeed) during the whole replace
> >> process and it does in a more direct way what you actually want. So in
> >> total mostly way faster device replace/upgrade than with the
> >> add+delete method. And raid1 redundancy is active all the time. Of
> >> course it means first make sure the system runs up-to-date/latest
> >> kernel+tools.
> >
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2018-06-08  3:24 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-23  0:47 RAID-1 refuses to balance large drive Brad Templeton
2016-03-23  4:01 ` Qu Wenruo
2016-03-23  4:47   ` Brad Templeton
2016-03-23  5:42     ` Chris Murphy
     [not found]       ` <56F22F80.501@gmail.com>
2016-03-23  6:17         ` Chris Murphy
2016-03-23 16:51           ` Brad Templeton
2016-03-23 18:34             ` Chris Murphy
2016-03-23 19:10               ` Brad Templeton
2016-03-23 19:27                 ` Alexander Fougner
2016-03-23 19:33                 ` Chris Murphy
2016-03-24  1:59                   ` Qu Wenruo
2016-03-24  2:13                     ` Brad Templeton
2016-03-24  2:33                       ` Qu Wenruo
2016-03-24  2:49                         ` Brad Templeton
2016-03-24  3:44                           ` Chris Murphy
2016-03-24  3:46                           ` Qu Wenruo
2016-03-24  6:11                           ` Duncan
2016-03-25 13:16                   ` Patrik Lundquist
2016-03-25 14:35                     ` Henk Slager
2016-03-26  4:15                       ` Duncan
     [not found]                       ` <CAHz9+Emc4DsXoMLKYrp1TfN+2r2cXxaJmPyTnpeCZF=h0FhtMg@mail.gmail.com>
2018-05-27  1:27                         ` Brad Templeton
2018-05-27  1:41                           ` Qu Wenruo
2018-05-27  1:49                             ` Brad Templeton
2018-05-27  1:56                               ` Qu Wenruo
2018-05-27  2:06                                 ` Brad Templeton
2018-05-27  2:16                                   ` Qu Wenruo
2018-05-27  2:21                                     ` Brad Templeton
2018-05-27  5:55                                       ` Duncan
2018-05-27 18:22                                       ` Brad Templeton
2018-05-28  8:31                                         ` Duncan
2018-06-08  3:23                           ` Zygo Blaxell
2016-03-27  4:23                     ` Brad Templeton
2016-03-23 21:54                 ` Duncan
2016-03-23 22:28               ` Duncan
2016-03-24  7:08               ` Andrew Vaughan
