* Filesystem Went Read Only During Raid-10 to Raid-6 Data Conversion
@ 2020-07-14 16:13 John Petrini
  2020-07-15  1:18 ` Zygo Blaxell
From: John Petrini @ 2020-07-14 16:13 UTC (permalink / raw)
  To: linux-btrfs

Hello All,

My filesystem went read only while converting data from raid-10 to
raid-6. I attempted a scrub but it immediately aborted. Can anyone
provide some guidance on possibly recovering?

Thank you!

##System Details##
Ubuntu 18.04

# This shows some read errors but they are not new. I had a SATA cable
come loose on a drive some months back that caused these. They haven't
increased since I reseated the cable
sudo btrfs device stats /mnt/storage-array/ | grep sde
[/dev/sde].write_io_errs    0
[/dev/sde].read_io_errs     237
[/dev/sde].flush_io_errs    0
[/dev/sde].corruption_errs  0
[/dev/sde].generation_errs  0


btrfs fi df /mnt/storage-array/
Data, RAID10: total=32.68TiB, used=32.64TiB
Data, RAID6: total=1.04TiB, used=1.04TiB
System, RAID10: total=96.00MiB, used=3.06MiB
Metadata, RAID10: total=40.84GiB, used=39.94GiB
GlobalReserve, single: total=512.00MiB, used=512.00MiB

uname -r
5.3.0-40-generic

##dmesg##
[3813499.479570] BTRFS info (device sdd): no csum found for inode
44197278 start 401276928
[3813499.480211] BTRFS warning (device sdd): csum failed root 5 ino
44197278 off 401276928 csum 0x0573112f expected csum 0x00000000 mirror
2
[3813506.750924] BTRFS error (device sdd): parent transid verify
failed on 98952926429184 wanted 6618521 found 6618515
[3813506.751395] BTRFS error (device sdd): parent transid verify
failed on 98952926429184 wanted 6618521 found 6618515
[3813506.751783] BTRFS info (device sdd): no csum found for inode
44197278 start 152698880
[3813506.773031] BTRFS warning (device sdd): csum failed root 5 ino
44197278 off 152698880 csum 0x9b7d599f expected csum 0x00000000 mirror
1
[3813506.773070] BTRFS error (device sdd): parent transid verify
failed on 98952926429184 wanted 6618521 found 6618515
[3813506.773596] BTRFS error (device sdd): parent transid verify
failed on 98952926429184 wanted 6618521 found 6618515
[3813506.774073] BTRFS info (device sdd): no csum found for inode
44197278 start 152698880
[3813506.813401] BTRFS warning (device sdd): csum failed root 5 ino
44197278 off 152698880 csum 0x9b7d599f expected csum 0x00000000 mirror
2
[3813506.813431] BTRFS error (device sdd): parent transid verify
failed on 98952926429184 wanted 6618521 found 6618515
[3813506.813439] BTRFS error (device sdd): parent transid verify
failed on 98952926429184 wanted 6618521 found 6618515
[3813506.813444] BTRFS info (device sdd): no csum found for inode
44197278 start 152698880
[3813506.813612] BTRFS warning (device sdd): csum failed root 5 ino
44197278 off 152698880 csum 0x9b7d599f expected csum 0x00000000 mirror
3
[3813506.813624] BTRFS error (device sdd): parent transid verify
failed on 98952926429184 wanted 6618521 found 6618515
[3813506.813628] BTRFS error (device sdd): parent transid verify
failed on 98952926429184 wanted 6618521 found 6618515
[3813506.813632] BTRFS info (device sdd): no csum found for inode
44197278 start 152698880
[3813506.816222] BTRFS warning (device sdd): csum failed root 5 ino
44197278 off 152698880 csum 0x9b7d599f expected csum 0x00000000 mirror
4
[3813510.542147] BTRFS error (device sdd): parent transid verify
failed on 104091430649856 wanted 6618521 found 6618516
[3813510.542731] BTRFS error (device sdd): parent transid verify
failed on 104091430649856 wanted 6618521 found 6618516
[3813510.543216] BTRFS info (device sdd): no csum found for inode
44197278 start 288227328
[3813510.558299] BTRFS warning (device sdd): csum failed root 5 ino
44197278 off 288227328 csum 0xa0775535 expected csum 0x00000000 mirror
1
[3813510.558341] BTRFS info (device sdd): no csum found for inode
44197278 start 288227328
[3813510.574681] BTRFS warning (device sdd): csum failed root 5 ino
44197278 off 288227328 csum 0xa0775535 expected csum 0x00000000 mirror
2
[3813510.574714] BTRFS info (device sdd): no csum found for inode
44197278 start 288227328
[3813510.574965] BTRFS warning (device sdd): csum failed root 5 ino
44197278 off 288227328 csum 0xa0775535 expected csum 0x00000000 mirror
3
[3813510.574980] BTRFS info (device sdd): no csum found for inode
44197278 start 288227328
[3813510.576050] BTRFS warning (device sdd): csum failed root 5 ino
44197278 off 288227328 csum 0xa0775535 expected csum 0x00000000 mirror
4
[3813510.576070] BTRFS info (device sdd): no csum found for inode
44197278 start 288227328
[3813510.577198] BTRFS warning (device sdd): csum failed root 5 ino
44197278 off 288227328 csum 0xa0775535 expected csum 0x00000000 mirror
5
[3813510.577214] BTRFS info (device sdd): no csum found for inode
44197278 start 288227328
[3813510.578222] BTRFS warning (device sdd): csum failed root 5 ino
44197278 off 288227328 csum 0xa0775535 expected csum 0x00000000 mirror
6


* Re: Filesystem Went Read Only During Raid-10 to Raid-6 Data Conversion
  2020-07-14 16:13 Filesystem Went Read Only During Raid-10 to Raid-6 Data Conversion John Petrini
@ 2020-07-15  1:18 ` Zygo Blaxell
       [not found]   ` <CADvYWxcq+-Fg0W9dmc-shwszF-7sX+GDVig0GncpvwKUDPfT7g@mail.gmail.com>
From: Zygo Blaxell @ 2020-07-15  1:18 UTC (permalink / raw)
  To: John Petrini; +Cc: linux-btrfs

On Tue, Jul 14, 2020 at 12:13:56PM -0400, John Petrini wrote:
> Hello All,
> 
> My filesystem went read only while converting data from raid-10 to
> raid-6. 

When the filesystem forces itself read-only, it goes into an inert but
broken state: there's enough filesystem left to avoid having to kill
all processes using it, and it no longer writes to the disk so it
doesn't damage any data, but the filesystem doesn't really work any more.

The next thing you should do is umount and mount the filesystem again,
because...

> I attempted a scrub but it immediately aborted. 

...none of the tools will work properly until the filesystem is mounted
again.

You may need to mount with -o skip_balance, then run 'btrfs balance
cancel' on the filesystem to abort the balance; otherwise, it will
resume when you mount the filesystem again and probably have the same
problem.
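
For example, reusing your mount point (substitute whichever device node
you normally mount by):

	umount /mnt/storage-array
	mount -o skip_balance /dev/sdd /mnt/storage-array
	btrfs balance cancel /mnt/storage-array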

> Can anyone
> provide some guidance on possibly recovering?
> 
> Thank you!

Aside:  data-raid6 metadata-raid10 isn't a sane configuration.  It
has 2 redundant disks for data and 1 redundant disk for metadata, so
the second parity disk in raid6 is wasted space.

The sane configurations for parity raid are:

	data-raid6 metadata-raid1c3 (2 parity stripes for data, 3 copies
	for metadata, 2 disks can fail, requires 3 or more disks)

	data-raid5 metadata-raid10 (1 parity stripe for data, 2 copies
	for metadata, 1 disk can fail, requires 4 or more disks)

	data-raid5 metadata-raid1 (1 parity stripe for data, 2 copies
	for metadata, 1 disk can fail, requires 2 or more disks)
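
For reference, once the filesystem is healthy again, converting to the
first layout would be a single balance along these lines (raid1c3 needs
kernel 5.5 or later, see below; mount point taken from your report):

	btrfs balance start -dconvert=raid6,soft -mconvert=raid1c3,soft /mnt/storage-array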

> ##System Details##
> Ubuntu 18.04
> 
> # This shows some read errors but they are not new. I had a SATA cable
> come loose on a drive some months back that caused these. They haven't
> increased since I reseated the cable
> sudo btrfs device stats /mnt/storage-array/ | grep sde
> [/dev/sde].write_io_errs    0
> [/dev/sde].read_io_errs     237
> [/dev/sde].flush_io_errs    0
> [/dev/sde].corruption_errs  0
> [/dev/sde].generation_errs  0

You can clear these numbers with 'btrfs dev stats -z' once the cause
has been resolved.
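
For example:

	btrfs dev stats -z /mnt/storage-array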

> btrfs fi df /mnt/storage-array/
> Data, RAID10: total=32.68TiB, used=32.64TiB
> Data, RAID6: total=1.04TiB, used=1.04TiB
> System, RAID10: total=96.00MiB, used=3.06MiB
> Metadata, RAID10: total=40.84GiB, used=39.94GiB
> GlobalReserve, single: total=512.00MiB, used=512.00MiB

Please post 'btrfs fi usage' output.  'btrfs fi usage' reports how much
is unallocated, allocated, and used on each drive.  This information is
required to understand and correct ENOSPC issues.
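
That is:

	btrfs fi usage /mnt/storage-array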

You didn't post the dmesg messages from when the filesystem went
read-only, but metadata 'total' is very close to 'used', you were doing
a balance, and the filesystem went read-only, so I'm guessing you hit
ENOSPC for metadata due to lack of unallocated space on at least 4 drives
(minimum for raid10).

If you have a cron job or similar scheduled task that does 'btrfs balance
start -m', remove it, as that command will reduce metadata allocation,
which will lead directly to ENOSPC panic as the filesystem gets full.
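
One quick way to check for such a job, assuming it was set up through
cron rather than a systemd timer:

	crontab -l | grep -i 'btrfs balance'
	grep -ri 'btrfs balance' /etc/cron.d /etc/cron.daily /etc/cron.weekly 2>/dev/null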

If the filesystem became read-only for non-ENOSPC reasons, it is likely
in-memory or on-disk metadata corruption.  The former is trivially
recoverable: just mount again (preferably with an updated kernel,
see below).  The latter is not trivially recoverable, and with 40 GB
of metadata you may not want to wait for btrfs check.  Hopefully it's
just ENOSPC.

> uname -r
> 5.3.0-40-generic

Please upgrade to 5.4.13 or later.  Kernels 5.1 through 5.4.12 have a
rare but nasty bug that is triggered by writing at exactly the wrong
moment during balance.  5.3 has some internal defenses against that bug
(the "write time tree checker"), but if they fail, the result is metadata
corruption that requires btrfs check to repair.
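
On Ubuntu 18.04 the HWE kernel packages are one route to a newer kernel;
a sketch, assuming stock Ubuntu packaging (verify after rebooting that
the installed kernel actually carries the 5.4.13+ fixes, or use a
mainline build instead):

	apt install --install-recommends linux-generic-hwe-18.04
	uname -r    # after reboot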

> ##dmesg##
> [3813499.479570] BTRFS info (device sdd): no csum found for inode
> 44197278 start 401276928
> [3813499.480211] BTRFS warning (device sdd): csum failed root 5 ino
> 44197278 off 401276928 csum 0x0573112f expected csum 0x00000000 mirror
> 2

All of these are likely because you haven't umounted the filesystem yet.
The filesystem in memory is no longer in sync with the disk, and will
remain out of sync until umounted.

If umounting and mounting doesn't resolve the problem, please post the
kernel messages starting from before the mount.  We need to see the
_first_ errors.
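
If dmesg has already rotated past them, the persistent logs usually
still have them (default Ubuntu rsyslog paths assumed):

	grep -i btrfs /var/log/kern.log.1 /var/log/kern.log | less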

> [3813506.750924] BTRFS error (device sdd): parent transid verify
> failed on 98952926429184 wanted 6618521 found 6618515
> [3813506.751395] BTRFS error (device sdd): parent transid verify
> failed on 98952926429184 wanted 6618521 found 6618515
> [3813506.751783] BTRFS info (device sdd): no csum found for inode
> 44197278 start 152698880
> [3813506.773031] BTRFS warning (device sdd): csum failed root 5 ino
> 44197278 off 152698880 csum 0x9b7d599f expected csum 0x00000000 mirror
> 1
> [3813506.773070] BTRFS error (device sdd): parent transid verify
> failed on 98952926429184 wanted 6618521 found 6618515
> [3813506.773596] BTRFS error (device sdd): parent transid verify
> failed on 98952926429184 wanted 6618521 found 6618515
> [3813506.774073] BTRFS info (device sdd): no csum found for inode
> 44197278 start 152698880
> [3813506.813401] BTRFS warning (device sdd): csum failed root 5 ino
> 44197278 off 152698880 csum 0x9b7d599f expected csum 0x00000000 mirror
> 2
> [3813506.813431] BTRFS error (device sdd): parent transid verify
> failed on 98952926429184 wanted 6618521 found 6618515
> [3813506.813439] BTRFS error (device sdd): parent transid verify
> failed on 98952926429184 wanted 6618521 found 6618515
> [3813506.813444] BTRFS info (device sdd): no csum found for inode
> 44197278 start 152698880
> [3813506.813612] BTRFS warning (device sdd): csum failed root 5 ino
> 44197278 off 152698880 csum 0x9b7d599f expected csum 0x00000000 mirror
> 3
> [3813506.813624] BTRFS error (device sdd): parent transid verify
> failed on 98952926429184 wanted 6618521 found 6618515
> [3813506.813628] BTRFS error (device sdd): parent transid verify
> failed on 98952926429184 wanted 6618521 found 6618515
> [3813506.813632] BTRFS info (device sdd): no csum found for inode
> 44197278 start 152698880
> [3813506.816222] BTRFS warning (device sdd): csum failed root 5 ino
> 44197278 off 152698880 csum 0x9b7d599f expected csum 0x00000000 mirror
> 4
> [3813510.542147] BTRFS error (device sdd): parent transid verify
> failed on 104091430649856 wanted 6618521 found 6618516
> [3813510.542731] BTRFS error (device sdd): parent transid verify
> failed on 104091430649856 wanted 6618521 found 6618516
> [3813510.543216] BTRFS info (device sdd): no csum found for inode
> 44197278 start 288227328
> [3813510.558299] BTRFS warning (device sdd): csum failed root 5 ino
> 44197278 off 288227328 csum 0xa0775535 expected csum 0x00000000 mirror
> 1
> [3813510.558341] BTRFS info (device sdd): no csum found for inode
> 44197278 start 288227328
> [3813510.574681] BTRFS warning (device sdd): csum failed root 5 ino
> 44197278 off 288227328 csum 0xa0775535 expected csum 0x00000000 mirror
> 2
> [3813510.574714] BTRFS info (device sdd): no csum found for inode
> 44197278 start 288227328
> [3813510.574965] BTRFS warning (device sdd): csum failed root 5 ino
> 44197278 off 288227328 csum 0xa0775535 expected csum 0x00000000 mirror
> 3
> [3813510.574980] BTRFS info (device sdd): no csum found for inode
> 44197278 start 288227328
> [3813510.576050] BTRFS warning (device sdd): csum failed root 5 ino
> 44197278 off 288227328 csum 0xa0775535 expected csum 0x00000000 mirror
> 4
> [3813510.576070] BTRFS info (device sdd): no csum found for inode
> 44197278 start 288227328
> [3813510.577198] BTRFS warning (device sdd): csum failed root 5 ino
> 44197278 off 288227328 csum 0xa0775535 expected csum 0x00000000 mirror
> 5
> [3813510.577214] BTRFS info (device sdd): no csum found for inode
> 44197278 start 288227328
> [3813510.578222] BTRFS warning (device sdd): csum failed root 5 ino
> 44197278 off 288227328 csum 0xa0775535 expected csum 0x00000000 mirror
> 6


* Re: Filesystem Went Read Only During Raid-10 to Raid-6 Data Conversion
       [not found]     ` <20200716042739.GB8346@hungrycats.org>
@ 2020-07-16 13:37       ` John Petrini
       [not found]         ` <CAJix6J9kmQjfFJJ1GwWXsX7WW6QKxPqpKx86g7hgA4PfbH5Rpg@mail.gmail.com>
From: John Petrini @ 2020-07-16 13:37 UTC (permalink / raw)
  To: Zygo Blaxell, linux-btrfs

On Thu, Jul 16, 2020 at 12:27 AM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Tue, Jul 14, 2020 at 10:49:08PM -0400, John Petrini wrote:
> > I've done this and the filesystem mounted successfully though when
> > attempting to cancel the balance it just tells me it's not running.
>
> That's fine, as long as it stops one way or another.
>
> > > Aside:  data-raid6 metadata-raid10 isn't a sane configuration.  It
> > > has 2 redundant disks for data and 1 redundant disk for metadata, so
> > > the second parity disk in raid6 is wasted space.
> > >
> > > The sane configurations for parity raid are:
> > >
> > >         data-raid6 metadata-raid1c3 (2 parity stripes for data, 3 copies
> > >         for metadata, 2 disks can fail, requires 3 or more disks)
> > >
> > >         data-raid5 metadata-raid10 (1 parity stripe for data, 2 copies
> > >         for metadata, 1 disk can fail, requires 4 or more disks)
> > >
> > >         data-raid5 metadata-raid1 (1 parity stripe for data, 2 copies
> > >         for metadata, 1 disk can fail, requires 2 or more disks)
> > >
> >
> > This is very interesting. I had no idea that raid1c3 was an option
> > though it sounds like I may need a really recent kernel version?
>
> 5.5 or later.

Okay I'll look into getting on this version since that's a killer feature.

>
> > btrfs fi usage /mnt/storage-array/
> > WARNING: RAID56 detected, not implemented
> > Overall:
> >     Device size:          67.31TiB
> >     Device allocated:          65.45TiB
> >     Device unallocated:           1.86TiB
> >     Device missing:             0.00B
> >     Used:              65.14TiB
> >     Free (estimated):           1.12TiB    (min: 1.09TiB)
> >     Data ratio:                  1.94
> >     Metadata ratio:              2.00
> >     Global reserve:         512.00MiB    (used: 0.00B)
> >
> > Data,RAID10: Size:32.68TiB, Used:32.53TiB
> >    /dev/sda       4.34TiB
> >    /dev/sdb       4.34TiB
> >    /dev/sdc       4.34TiB
> >    /dev/sdd       2.21TiB
> >    /dev/sde       2.21TiB
> >    /dev/sdf       4.34TiB
> >    /dev/sdi       1.82TiB
> >    /dev/sdj       1.82TiB
> >    /dev/sdk       1.82TiB
> >    /dev/sdl       1.82TiB
> >    /dev/sdm       1.82TiB
> >    /dev/sdn       1.82TiB
> >
> > Data,RAID6: Size:1.04TiB, Used:1.04TiB
> >    /dev/sda     413.92GiB
> >    /dev/sdb     413.92GiB
> >    /dev/sdc     413.92GiB
> >    /dev/sdd     119.07GiB
> >    /dev/sde     119.07GiB
> >    /dev/sdf     413.92GiB
> >
> > Metadata,RAID10: Size:40.84GiB, Used:39.80GiB
> >    /dev/sda       5.66GiB
> >    /dev/sdb       5.66GiB
> >    /dev/sdc       5.66GiB
> >    /dev/sdd       2.41GiB
> >    /dev/sde       2.41GiB
> >    /dev/sdf       5.66GiB
> >    /dev/sdi       2.23GiB
> >    /dev/sdj       2.23GiB
> >    /dev/sdk       2.23GiB
> >    /dev/sdl       2.23GiB
> >    /dev/sdm       2.23GiB
> >    /dev/sdn       2.23GiB
> >
> > System,RAID10: Size:96.00MiB, Used:3.06MiB
> >    /dev/sda       8.00MiB
> >    /dev/sdb       8.00MiB
> >    /dev/sdc       8.00MiB
> >    /dev/sdd       8.00MiB
> >    /dev/sde       8.00MiB
> >    /dev/sdf       8.00MiB
> >    /dev/sdi       8.00MiB
> >    /dev/sdj       8.00MiB
> >    /dev/sdk       8.00MiB
> >    /dev/sdl       8.00MiB
> >    /dev/sdm       8.00MiB
> >    /dev/sdn       8.00MiB
> >
> > Unallocated:
> >    /dev/sda       4.35TiB
> >    /dev/sdb       4.35TiB
> >    /dev/sdc       4.35TiB
> >    /dev/sdd       2.22TiB
> >    /dev/sde       2.22TiB
> >    /dev/sdf       4.35TiB
> >    /dev/sdi       1.82TiB
> >    /dev/sdj       1.82TiB
> >    /dev/sdk       1.82TiB
> >    /dev/sdl       1.82TiB
> >    /dev/sdm       1.82TiB
> >    /dev/sdn       1.82TiB
>
> Plenty of unallocated space.  It should be able to do the conversion.

After upgrading, the unallocated space tells a different story. Maybe
due to the newer kernel or btrfs-progs?

Unallocated:
   /dev/sdd        1.02MiB
   /dev/sde        1.02MiB
   /dev/sdl        1.02MiB
   /dev/sdn        1.02MiB
   /dev/sdm        1.02MiB
   /dev/sdk        1.02MiB
   /dev/sdj        1.02MiB
   /dev/sdi        1.02MiB
   /dev/sdb        1.00MiB
   /dev/sdc        1.00MiB
   /dev/sda        5.90GiB
   /dev/sdg        5.90GiB

This is after clearing up additional space on the filesystem. When I
started the conversion there was only ~300G available. There's now
close to 1TB according to df.

/dev/sdd                      68T   66T  932G  99% /mnt/storage-array

So I'm not sure what to make of this and whether it's safe to start
the conversion again. I don't feel like I can trust the unallocated
space before or after the upgrade.


Here are the versions I'm on now:
sudo dpkg -l | grep btrfs-progs
ii  btrfs-progs                            5.4.1-2
        amd64        Checksumming Copy on Write Filesystem utilities

uname -r
5.4.0-40-generic

>
> > > You didn't post the dmesg messages from when the filesystem went
> > > read-only, but metadata 'total' is very close to 'used', you were doing
> > > a balance, and the filesystem went read-only, so I'm guessing you hit
> > > ENOSPC for metadata due to lack of unallocated space on at least 4 drives
> > > (minimum for raid10).
> > >
> >
> > Here's a paste of everything in dmesg: http://paste.openstack.org/show/795929/
>
> Unfortunately the original errors are no longer in the buffer.  Maybe
> try /var/log/kern.log?
>

Found it. So this was a space issue. I knew the filesystem was very
full but figured ~300G would be enough.

kernel: [3755232.352221] BTRFS: error (device sdd) in
__btrfs_free_extent:4860: errno=-28 No space left
kernel: [3755232.352227] BTRFS: Transaction aborted (error -28)
kernel: [3755232.354693] BTRFS info (device sdd): forced readonly
kernel: [3755232.354700] BTRFS: error (device sdd) in
btrfs_run_delayed_refs:2795: errno=-28 No space left


> > > > uname -r
> > > > 5.3.0-40-generic
> > >
> > > Please upgrade to 5.4.13 or later.  Kernels 5.1 through 5.4.12 have a
> > > rare but nasty bug that is triggered by writing at exactly the wrong
> > > moment during balance.  5.3 has some internal defenses against that bug
> > > (the "write time tree checker"), but if they fail, the result is metadata
> > > corruption that requires btrfs check to repair.
> > >
> >
> > Thanks for the heads up. I'm getting it updated now and will attempt
> > to remount once I do. Once it's remounted how should I proceed? Can I
> > just assume the filesystem is healthy at that point? Should I perform
> > a scrub?
>
> If scrub reports no errors it's probably OK.

I did run a scrub and it came back clean.

>
> A scrub will tell you if any data or metadata is corrupted or any
> parent-child pointers are broken.  That will cover most of the common
> problems.  If the original issue was a spurious ENOSPC then everything
> should be OK.  If the original issue was a write time tree corruption
> then it should be OK.  If the original issue was something else, it
> will present itself again during the scrub or balance.
>
> If there are errors, scrub won't attribute them to the right disks for
> raid6.  It might be worth reading
>
>         https://lore.kernel.org/linux-btrfs/20200627032414.GX10769@hungrycats.org/
>
> for a list of current raid5/6 issues to be aware of.

Thanks. This is good info.


* Re: Filesystem Went Read Only During Raid-10 to Raid-6 Data Conversion
       [not found]         ` <CAJix6J9kmQjfFJJ1GwWXsX7WW6QKxPqpKx86g7hgA4PfbH5Rpg@mail.gmail.com>
@ 2020-07-16 22:57           ` Zygo Blaxell
  2020-07-17  1:11             ` John Petrini
From: Zygo Blaxell @ 2020-07-16 22:57 UTC (permalink / raw)
  To: John Petrini; +Cc: John Petrini, linux-btrfs

On Thu, Jul 16, 2020 at 10:20:43AM -0400, John Petrini wrote:
>    I've cleaned up a bit more space and kicked off a balance. btrfs fi usage
>    is reporting increased unallocated space so it seems to be helping.
>    sudo btrfs balance start -dusage=50 /mnt/storage-array/
>    During one of my attempts to clean up space the filesystem went read only
>    again with the same out of space error. I'm curious why deleting files
>    would cause this.

Deleting a file requires writing a new tree with the file not present.
That requires some extra space...

>    On Thu, Jul 16, 2020 at 9:38 AM John Petrini <[1]john.d.petrini@gmail.com>
>    wrote:
> 
>      On Thu, Jul 16, 2020 at 12:27 AM Zygo Blaxell
>      <[2]ce3g8jdj@umail.furryterror.org> wrote:
>      >
>      > On Tue, Jul 14, 2020 at 10:49:08PM -0400, John Petrini wrote:
>      > > I've done this and the filesystem mounted successfully though when
>      > > attempting to cancel the balance it just tells me it's not running.
>      >
>      > That's fine, as long as it stops one way or another.
>      >
>      > > > Aside:  data-raid6 metadata-raid10 isn't a sane configuration.  It
>      > > > has 2 redundant disks for data and 1 redundant disk for metadata,
>      so
>      > > > the second parity disk in raid6 is wasted space.
>      > > >
>      > > > The sane configurations for parity raid are:
>      > > >
>      > > >         data-raid6 metadata-raid1c3 (2 parity stripes for data, 3
>      copies
>      > > >         for metadata, 2 disks can fail, requires 3 or more disks)
>      > > >
>      > > >         data-raid5 metadata-raid10 (1 parity stripe for data, 2
>      copies
>      > > >         for metadata, 1 disk can fail, requires 4 or more disks)
>      > > >
>      > > >         data-raid5 metadata-raid1 (1 parity stripe for data, 2
>      copies
>      > > >         for metadata, 1 disk can fail, requires 2 or more disks)
>      > > >
>      > >
>      > > This is very interesting. I had no idea that raid1c3 was an option
>      > > though it sounds like I may need a really recent kernel version?
>      >
>      > 5.5 or later.
> 
>      Okay I'll look into getting on this version since that's a killer
>      feature.
> 
>      >
>      > > btrfs fi usage /mnt/storage-array/
>      > > WARNING: RAID56 detected, not implemented
>      > > Overall:
>      > >     Device size:          67.31TiB
>      > >     Device allocated:          65.45TiB
>      > >     Device unallocated:           1.86TiB
>      > >     Device missing:             0.00B
>      > >     Used:              65.14TiB
>      > >     Free (estimated):           1.12TiB    (min: 1.09TiB)
>      > >     Data ratio:                  1.94
>      > >     Metadata ratio:              2.00
>      > >     Global reserve:         512.00MiB    (used: 0.00B)
>      > >
>      > > Data,RAID10: Size:32.68TiB, Used:32.53TiB
>      > >    /dev/sda       4.34TiB
>      > >    /dev/sdb       4.34TiB
>      > >    /dev/sdc       4.34TiB
>      > >    /dev/sdd       2.21TiB
>      > >    /dev/sde       2.21TiB
>      > >    /dev/sdf       4.34TiB
>      > >    /dev/sdi       1.82TiB
>      > >    /dev/sdj       1.82TiB
>      > >    /dev/sdk       1.82TiB
>      > >    /dev/sdl       1.82TiB
>      > >    /dev/sdm       1.82TiB
>      > >    /dev/sdn       1.82TiB
>      > >
>      > > Data,RAID6: Size:1.04TiB, Used:1.04TiB
>      > >    /dev/sda     413.92GiB
>      > >    /dev/sdb     413.92GiB
>      > >    /dev/sdc     413.92GiB
>      > >    /dev/sdd     119.07GiB
>      > >    /dev/sde     119.07GiB
>      > >    /dev/sdf     413.92GiB
>      > >
>      > > Metadata,RAID10: Size:40.84GiB, Used:39.80GiB
>      > >    /dev/sda       5.66GiB
>      > >    /dev/sdb       5.66GiB
>      > >    /dev/sdc       5.66GiB
>      > >    /dev/sdd       2.41GiB
>      > >    /dev/sde       2.41GiB
>      > >    /dev/sdf       5.66GiB
>      > >    /dev/sdi       2.23GiB
>      > >    /dev/sdj       2.23GiB
>      > >    /dev/sdk       2.23GiB
>      > >    /dev/sdl       2.23GiB
>      > >    /dev/sdm       2.23GiB
>      > >    /dev/sdn       2.23GiB
>      > >
>      > > System,RAID10: Size:96.00MiB, Used:3.06MiB
>      > >    /dev/sda       8.00MiB
>      > >    /dev/sdb       8.00MiB
>      > >    /dev/sdc       8.00MiB
>      > >    /dev/sdd       8.00MiB
>      > >    /dev/sde       8.00MiB
>      > >    /dev/sdf       8.00MiB
>      > >    /dev/sdi       8.00MiB
>      > >    /dev/sdj       8.00MiB
>      > >    /dev/sdk       8.00MiB
>      > >    /dev/sdl       8.00MiB
>      > >    /dev/sdm       8.00MiB
>      > >    /dev/sdn       8.00MiB
>      > >
>      > > Unallocated:
>      > >    /dev/sda       4.35TiB
>      > >    /dev/sdb       4.35TiB
>      > >    /dev/sdc       4.35TiB
>      > >    /dev/sdd       2.22TiB
>      > >    /dev/sde       2.22TiB
>      > >    /dev/sdf       4.35TiB
>      > >    /dev/sdi       1.82TiB
>      > >    /dev/sdj       1.82TiB
>      > >    /dev/sdk       1.82TiB
>      > >    /dev/sdl       1.82TiB
>      > >    /dev/sdm       1.82TiB
>      > >    /dev/sdn       1.82TiB
>      >
>      > Plenty of unallocated space.  It should be able to do the conversion.
> 
>      After upgrading, the unallocated space tells a different story. Maybe
>      due to the newer kernel or btrfs-progs?

That is...odd.  Try 'btrfs dev usage'; maybe something weird is happening
with device sizes.

>      Unallocated:
>         /dev/sdd        1.02MiB
>         /dev/sde        1.02MiB
>         /dev/sdl        1.02MiB
>         /dev/sdn        1.02MiB
>         /dev/sdm        1.02MiB
>         /dev/sdk        1.02MiB
>         /dev/sdj        1.02MiB
>         /dev/sdi        1.02MiB
>         /dev/sdb        1.00MiB
>         /dev/sdc        1.00MiB
>         /dev/sda        5.90GiB
>         /dev/sdg        5.90GiB

...and here we have only 2 disks with free space, so there's zero available
space for more metadata (raid10 requires 4 disks).

>      This is after clearing up additional space on the filesytem. When I
>      started the conversion there was only ~300G available. There's now
>      close 1TB according to df.
> 
>      /dev/sdd                      68T   66T  932G  99% /mnt/storage-array
> 
>      So I'm not sure what to make of this and whether it's safe to start
>      the conversion again. I don't feel like I can trust the unallocated
>      space before or after the upgrade.
> 
>      Here's the versions I'm on now:
>      sudo dpkg -l | grep btrfs-progs
>      ii  btrfs-progs                            5.4.1-2
>              amd64        Checksumming Copy on Write Filesystem utilities
> 
>      uname -r
>      5.4.0-40-generic
> 
>      >
>      > > > You didn't post the dmesg messages from when the filesystem went
>      > > > read-only, but metadata 'total' is very close to 'used', you were
>      doing
>      > > > a balance, and the filesystem went read-only, so I'm guessing you
>      hit
>      > > > ENOSPC for metadata due to lack of unallocated space on at least 4
>      drives
>      > > > (minimum for raid10).
>      > > >
>      > >
>      > > Here's a paste of everything in dmesg:
>      [3]http://paste.openstack.org/show/795929/
>      >
>      > Unfortunately the original errors are no longer in the buffer.  Maybe
>      > try /var/log/kern.log?
>      >
> 
>      Found it. So this was a space issue. I knew the filesystem was very
>      full but figured ~300G would be enough.
> 
>      kernel: [3755232.352221] BTRFS: error (device sdd) in
>      __btrfs_free_extent:4860: errno=-28 No space left
>      kernel: [3755232.352227] BTRFS: Transaction aborted (error -28)
>      ernel: [3755232.354693] BTRFS info (device sdd): forced readonly
>      kernel: [3755232.354700] BTRFS: error (device sdd) in
>      btrfs_run_delayed_refs:2795: errno=-28 No space left

The trick is that the free space has to be unallocated to change profiles.
'df' counts both unallocated and allocated-but-unused space.

Also, you have disks of different sizes, which adds a further
complication: raid6 data on 3 disks takes up more raw space for the same
data than raid10 data on 4 disks, because the former is 1 data + 2 parity
while the latter is 1 data + 1 mirror.  So for 100 GB of data, it's 200 GB
of raw space in raid10 on 4 disks, or 200 GB of raw space in raid6 on
4 disks, but 300 GB of raw space in raid6 on 3 disks.

Since your filesystem is nearly full, there are likely to be 3-disk-wide
raid6 block groups formed when there is space available on only 3 drives.
If that happens too often, hundreds of GB will be wasted and the filesystem
fills up.

To convert raid10 to raid6 on a full filesystem with unequal disk sizes
you'll need to do a few steps:

	1.  balance -dconvert=raid1,stripes=1..3,profiles=raid6

This converts any 3-stripe raid6 to raid1, which will get some wasted
space back.  Use raid1 here because it's more flexible for allocation
on small numbers of disks than raid10.  We will get rid of it later.
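
Written out in full (the shorthand above leaves out 'start' and the
mount point):

	btrfs balance start -dconvert=raid1,stripes=1..3,profiles=raid6 /mnt/storage-array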

	2.  balance -dconvert=raid1,devid=1,limit=5
	    balance -dconvert=raid1,devid=2,limit=5
	    balance -dconvert=raid1,devid=3,limit=5
	    balance -dconvert=raid1,devid=6,limit=5

Use btrfs fi show to see the real devids for these; I just put sequential
numbers in the above.

These balances relocate data on the 4.34TB drives to other disks in
the array.  The goal is to get some unallocated space on all of the
largest disks so you can create raid6 block groups that span all of them.

We convert to raid1 to get more flexible redistribution of the
space--raid10 will keep trying to fill every available drive, and has
a 4-disk minimum, while raid1 will try to equally distribute space on
all drives but only 2 at a time.  'soft' is not used here because we
want to relocate block groups on these devices whether they are already
raid1 or not.

Note that if there is 5GB free on all the largest disks we can skip
this entire step.  If there is not 5GB free on all the largest disks
at the end of the above commands, you may need to repeat this step,
or try 'balance -dconvert=raid1,limit=50' to try to force free space
on all disks in the array.
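
In full form, with the devid lookup first:

	btrfs fi show /mnt/storage-array    # note the real devids of the largest drives
	btrfs balance start -dconvert=raid1,devid=1,limit=5 /mnt/storage-array
	# ...and repeat with the other three devids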

	3.  balance -dconvert=raid6,soft,devid=1

This converts all data block groups that have at least one chunk on devid
1 (or any disk of the largest size in the array) from raid10 to raid6.
This will ensure that every chunk that is added to devid 1 has at least
one corresponding chunk that is removed from devid 1.  That way, devid
1 doesn't fill up; instead, it will stay with a few GB unallocated.
The other disks will get unallocated space because a raid6 block group
that is at least 4 disks wide will store more data in the same raw space
than raid10.

At this stage it doesn't matter where the space is coming from, as long as
it's coming from a minimum of 4 other disks, and not filling up devid 1.
Some block groups will not be optimal.  We'll optimize later.

Eventually you'll get to the point where there is unallocated space on
all disks, and then the balance will finish converting the data to raid6
without further attention.
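
In full form, with a status check you can run from another shell while
it works:

	btrfs balance start -dconvert=raid6,soft,devid=1 /mnt/storage-array
	btrfs balance status /mnt/storage-array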

	4.  balance -dstripes=1..3,devid=1  # sda, 4.34TB
	    balance -dstripes=1..3,devid=2  # sdb, 4.34TB
	    balance -dstripes=1..3,devid=3  # sdc, 4.34TB
	    balance -dstripes=1..5,devid=4  # sdd, 2.21TB
	    balance -dstripes=1..5,devid=5  # sde, 2.21TB
	    balance -dstripes=1..3,devid=6  # sdf, 4.34TB
	    balance -dstripes=1..9,devid=7  # sdg, 1.82TB
	    balance -dstripes=1..9,devid=8  # sdh, 1.82TB
	    balance -dstripes=1..9,devid=9  # sdi, 1.82TB
	    balance -dstripes=1..9,devid=10 # sdj, 1.82TB

This rebalances any narrow stripes that may have formed during the
previous balances.  For each device we count how many disks are the
same size or larger, and rebalance any block group that is narrower
than that number of disks:

	There are 4 4.34TB disks, so we balance any block group
	on a 4.34TB disk that is 1 to (4-1) = 3 stripes wide.

	There are 6 2.21TB-or-larger disks (2x2.21TB + 4x4.34TB), so we
	balance any block group on a 2.21TB disk that is 1 to (6-1) =
	5 stripes wide.

	There are 10 1.82TB-or-larger disks (this is the smallest size
	disk, so all 10 disks are equal or larger), so we balance any
	block group on a 1.82TB disk that is 1 to (10-1) = 9 stripes wide.

These balances will only relocate non-optimal block groups, so each one
should not relocate many block groups.  If 'btrfs balance status -v' says
it's relocating thousands of block groups, check the stripe count and
devid--if you use the wrong stripe count it will unnecessarily relocate
all the data on the device.
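
For example, for the first device (substitute the real devid and stripe
range for each disk):

	btrfs balance start -dstripes=1..3,devid=1 /mnt/storage-array
	btrfs balance status -v /mnt/storage-array    # sanity-check how much it is moving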

	5.  balance -mconvert=raid1c3,soft

The final step converts metadata from raid10 to raid1c3.  (requires
kernel 5.5)
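
In full form, once the kernel is new enough:

	uname -r    # confirm 5.5 or later
	btrfs balance start -mconvert=raid1c3,soft /mnt/storage-array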



>      > > > > uname -r
>      > > > > 5.3.0-40-generic
>      > > >
>      > > > Please upgrade to 5.4.13 or later.  Kernels 5.1 through 5.4.12
>      have a
>      > > > rare but nasty bug that is triggered by writing at exactly the
>      wrong
>      > > > moment during balance.  5.3 has some internal defenses against
>      that bug
>      > > > (the "write time tree checker"), but if they fail, the result is
>      metadata
>      > > > corruption that requires btrfs check to repair.
>      > > >
>      > >
>      > > Thanks for the heads up. I'm getting it updated now and will attempt
>      > > to remount once I do. Once it's remounted how should I proceed? Can
>      I
>      > > just assume the filesystem is healthy at that point? Should I
>      perform
>      > > a scrub?
>      >
>      > If scrub reports no errors it's probably OK.
> 
>      I did run a scrub and it came back clean.
> 
>      >
>      > A scrub will tell you if any data or metadata is corrupted or any
>      > parent-child pointers are broken.  That will cover most of the common
>      > problems.  If the original issue was a spurious ENOSPC then everything
>      > should be OK.  If the original issue was a write time tree corruption
>      > then it should be OK.  If the original issue was something else, it
>      > will present itself again during the scrub or balance.
>      >
>      > If there are errors, scrub won't attribute them to the right disks for
>      > raid6.  It might be worth reading
>      >
>      >       
>       [4]https://lore.kernel.org/linux-btrfs/20200627032414.GX10769@hungrycats.org/
>      >
>      > for a list of current raid5/6 issues to be aware of.
> 
>      Thanks. This is good info.
> 
>    --
>    John Petrini
> 
> References
> 
>    Visible links
>    1. mailto:john.d.petrini@gmail.com
>    2. mailto:ce3g8jdj@umail.furryterror.org
>    3. http://paste.openstack.org/show/795929/
>    4. https://lore.kernel.org/linux-btrfs/20200627032414.GX10769@hungrycats.org/


* Re: Filesystem Went Read Only During Raid-10 to Raid-6 Data Conversion
  2020-07-16 22:57           ` Zygo Blaxell
@ 2020-07-17  1:11             ` John Petrini
  2020-07-17  5:57               ` Zygo Blaxell
From: John Petrini @ 2020-07-17  1:11 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: John Petrini, linux-btrfs

On Thu, Jul 16, 2020 at 6:57 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Thu, Jul 16, 2020 at 10:20:43AM -0400, John Petrini wrote:
> >    I've cleaned up a bit more space and kicked off a balance. btrfs fi usage
> >    is reporting increased unallocated space so it seems to be helping.
> >    sudo btrfs balance start -dusage=50 /mnt/storage-array/
> >    During one of my attempts to clean up space the filesystem went read only
> >    again with the same out of space error. I'm curious why deleting files
> >    would cause this.
>
> Deleting a file requires writing a new tree with the file not present.
> That requires some extra space...
>
> >    On Thu, Jul 16, 2020 at 9:38 AM John Petrini <[1]john.d.petrini@gmail.com>
> >    wrote:
> >
> >      On Thu, Jul 16, 2020 at 12:27 AM Zygo Blaxell
> >      <[2]ce3g8jdj@umail.furryterror.org> wrote:
> >      >
> >      > On Tue, Jul 14, 2020 at 10:49:08PM -0400, John Petrini wrote:
> >      > > I've done this and the filesystem mounted successfully though when
> >      > > attempting to cancel the balance it just tells me it's not running.
> >      >
> >      > That's fine, as long as it stops one way or another.
> >      >
> >      > > > Aside:  data-raid6 metadata-raid10 isn't a sane configuration.  It
> >      > > > has 2 redundant disks for data and 1 redundant disk for metadata,
> >      so
> >      > > > the second parity disk in raid6 is wasted space.
> >      > > >
> >      > > > The sane configurations for parity raid are:
> >      > > >
> >      > > >         data-raid6 metadata-raid1c3 (2 parity stripes for data, 3
> >      copies
> >      > > >         for metadata, 2 disks can fail, requires 3 or more disks)
> >      > > >
> >      > > >         data-raid5 metadata-raid10 (1 parity stripe for data, 2
> >      copies
> >      > > >         for metadata, 1 disk can fail, requires 4 or more disks)
> >      > > >
> >      > > >         data-raid5 metadata-raid1 (1 parity stripe for data, 2
> >      copies
> >      > > >         for metadata, 1 disk can fail, requires 2 or more disks)
> >      > > >
> >      > >
> >      > > This is very interesting. I had no idea that raid1c3 was an option
> >      > > though it sounds like I may need a really recent kernel version?
> >      >
> >      > 5.5 or later.
> >
> >      Okay I'll look into getting on this version since that's a killer
> >      feature.
> >
> >      >
> >      > > btrfs fi usage /mnt/storage-array/
> >      > > WARNING: RAID56 detected, not implemented
> >      > > Overall:
> >      > >     Device size:          67.31TiB
> >      > >     Device allocated:          65.45TiB
> >      > >     Device unallocated:           1.86TiB
> >      > >     Device missing:             0.00B
> >      > >     Used:              65.14TiB
> >      > >     Free (estimated):           1.12TiB    (min: 1.09TiB)
> >      > >     Data ratio:                  1.94
> >      > >     Metadata ratio:              2.00
> >      > >     Global reserve:         512.00MiB    (used: 0.00B)
> >      > >
> >      > > Data,RAID10: Size:32.68TiB, Used:32.53TiB
> >      > >    /dev/sda       4.34TiB
> >      > >    /dev/sdb       4.34TiB
> >      > >    /dev/sdc       4.34TiB
> >      > >    /dev/sdd       2.21TiB
> >      > >    /dev/sde       2.21TiB
> >      > >    /dev/sdf       4.34TiB
> >      > >    /dev/sdi       1.82TiB
> >      > >    /dev/sdj       1.82TiB
> >      > >    /dev/sdk       1.82TiB
> >      > >    /dev/sdl       1.82TiB
> >      > >    /dev/sdm       1.82TiB
> >      > >    /dev/sdn       1.82TiB
> >      > >
> >      > > Data,RAID6: Size:1.04TiB, Used:1.04TiB
> >      > >    /dev/sda     413.92GiB
> >      > >    /dev/sdb     413.92GiB
> >      > >    /dev/sdc     413.92GiB
> >      > >    /dev/sdd     119.07GiB
> >      > >    /dev/sde     119.07GiB
> >      > >    /dev/sdf     413.92GiB
> >      > >
> >      > > Metadata,RAID10: Size:40.84GiB, Used:39.80GiB
> >      > >    /dev/sda       5.66GiB
> >      > >    /dev/sdb       5.66GiB
> >      > >    /dev/sdc       5.66GiB
> >      > >    /dev/sdd       2.41GiB
> >      > >    /dev/sde       2.41GiB
> >      > >    /dev/sdf       5.66GiB
> >      > >    /dev/sdi       2.23GiB
> >      > >    /dev/sdj       2.23GiB
> >      > >    /dev/sdk       2.23GiB
> >      > >    /dev/sdl       2.23GiB
> >      > >    /dev/sdm       2.23GiB
> >      > >    /dev/sdn       2.23GiB
> >      > >
> >      > > System,RAID10: Size:96.00MiB, Used:3.06MiB
> >      > >    /dev/sda       8.00MiB
> >      > >    /dev/sdb       8.00MiB
> >      > >    /dev/sdc       8.00MiB
> >      > >    /dev/sdd       8.00MiB
> >      > >    /dev/sde       8.00MiB
> >      > >    /dev/sdf       8.00MiB
> >      > >    /dev/sdi       8.00MiB
> >      > >    /dev/sdj       8.00MiB
> >      > >    /dev/sdk       8.00MiB
> >      > >    /dev/sdl       8.00MiB
> >      > >    /dev/sdm       8.00MiB
> >      > >    /dev/sdn       8.00MiB
> >      > >
> >      > > Unallocated:
> >      > >    /dev/sda       4.35TiB
> >      > >    /dev/sdb       4.35TiB
> >      > >    /dev/sdc       4.35TiB
> >      > >    /dev/sdd       2.22TiB
> >      > >    /dev/sde       2.22TiB
> >      > >    /dev/sdf       4.35TiB
> >      > >    /dev/sdi       1.82TiB
> >      > >    /dev/sdj       1.82TiB
> >      > >    /dev/sdk       1.82TiB
> >      > >    /dev/sdl       1.82TiB
> >      > >    /dev/sdm       1.82TiB
> >      > >    /dev/sdn       1.82TiB
> >      >
> >      > Plenty of unallocated space.  It should be able to do the conversion.
> >
> >      After upgrading, the unallocated space tells a different story. Maybe
> >      due to the newer kernel or btrfs-progs?
>
> That is...odd.  Try 'btrfs dev usage', maybe something weird is happening
> with device sizes.

Here it is. I'm not sure what to make of it though.

sudo btrfs dev usage /mnt/storage-array/
/dev/sdd, ID: 1
   Device size:             4.55TiB
   Device slack:              0.00B
   Data,RAID10:             3.12GiB
   Data,RAID10:             2.78GiB
   Data,RAID10:           784.31GiB
   Data,RAID10:             3.34TiB
   Data,RAID6:            144.07GiB
   Data,RAID6:            293.03GiB
   Metadata,RAID10:         4.47GiB
   Metadata,RAID10:       352.00MiB
   Unallocated:             1.02MiB

/dev/sde, ID: 2
   Device size:             4.55TiB
   Device slack:              0.00B
   Data,RAID10:             3.12GiB
   Data,RAID10:             2.78GiB
   Data,RAID10:           784.31GiB
   Data,RAID10:             3.34TiB
   Data,RAID6:            144.07GiB
   Data,RAID6:            293.03GiB
   Metadata,RAID10:         4.47GiB
   Metadata,RAID10:       352.00MiB
   Unallocated:             1.02MiB

/dev/sdl, ID: 3
   Device size:             3.64TiB
   Device slack:              0.00B
   Data,RAID10:             3.12GiB
   Data,RAID10:             3.34TiB
   Data,RAID6:            293.03GiB
   Metadata,RAID10:         4.47GiB
   Unallocated:             1.02MiB

/dev/sdn, ID: 4
   Device size:             3.64TiB
   Device slack:              0.00B
   Data,RAID10:             3.12GiB
   Data,RAID10:             3.34TiB
   Data,RAID6:            293.03GiB
   Metadata,RAID10:         4.47GiB
   Unallocated:             1.02MiB

/dev/sdm, ID: 5
   Device size:             3.64TiB
   Device slack:              0.00B
   Data,RAID10:             3.12GiB
   Data,RAID10:             3.34TiB
   Data,RAID6:            293.03GiB
   Metadata,RAID10:         4.47GiB
   Unallocated:             1.02MiB

/dev/sdk, ID: 6
   Device size:             3.64TiB
   Device slack:              0.00B
   Data,RAID10:             3.12GiB
   Data,RAID10:             3.34TiB
   Data,RAID6:            293.03GiB
   Metadata,RAID10:         4.47GiB
   Unallocated:             1.02MiB

/dev/sdj, ID: 7
   Device size:             3.64TiB
   Device slack:              0.00B
   Data,RAID10:             3.12GiB
   Data,RAID10:             3.34TiB
   Data,RAID6:            293.03GiB
   Metadata,RAID10:         4.47GiB
   Unallocated:             1.02MiB

/dev/sdi, ID: 8
   Device size:             3.64TiB
   Device slack:              0.00B
   Data,RAID10:             3.12GiB
   Data,RAID10:             3.34TiB
   Data,RAID6:            293.03GiB
   Metadata,RAID10:         4.47GiB
   Unallocated:             1.02MiB

/dev/sdb, ID: 9
   Device size:             9.10TiB
   Device slack:              0.00B
   Data,RAID10:             3.12GiB
   Data,RAID10:             4.01TiB
   Data,RAID10:           784.31GiB
   Data,RAID10:             3.34TiB
   Data,RAID6:            458.56GiB
   Data,RAID6:            144.07GiB
   Data,RAID6:            293.03GiB
   Metadata,RAID10:         4.47GiB
   Metadata,RAID10:       352.00MiB
   Metadata,RAID10:         6.00GiB
   Metadata,RAID1C3:        2.00GiB
   System,RAID1C3:         32.00MiB
   Unallocated:            82.89GiB

/dev/sdc, ID: 10
   Device size:             9.10TiB
   Device slack:              0.00B
   Data,RAID10:             3.12GiB
   Data,RAID10:             4.01TiB
   Data,RAID10:           784.31GiB
   Data,RAID10:             3.34TiB
   Data,RAID6:            458.56GiB
   Data,RAID6:            144.07GiB
   Data,RAID6:            293.03GiB
   Metadata,RAID10:         4.47GiB
   Metadata,RAID10:       352.00MiB
   Metadata,RAID10:         6.00GiB
   Metadata,RAID1C3:        3.00GiB
   Unallocated:            81.92GiB

/dev/sda, ID: 11
   Device size:             9.10TiB
   Device slack:              0.00B
   Data,RAID10:           784.31GiB
   Data,RAID10:             4.01TiB
   Data,RAID10:             3.34TiB
   Data,RAID6:            458.56GiB
   Data,RAID6:            144.07GiB
   Data,RAID6:            293.03GiB
   Metadata,RAID10:         4.47GiB
   Metadata,RAID10:       352.00MiB
   Metadata,RAID10:         6.00GiB
   Metadata,RAID1C3:        5.00GiB
   System,RAID1C3:         32.00MiB
   Unallocated:            85.79GiB

/dev/sdf, ID: 12
   Device size:             9.10TiB
   Device slack:              0.00B
   Data,RAID10:           784.31GiB
   Data,RAID10:             4.01TiB
   Data,RAID10:             3.34TiB
   Data,RAID6:            458.56GiB
   Data,RAID6:            144.07GiB
   Data,RAID6:            293.03GiB
   Metadata,RAID10:         4.47GiB
   Metadata,RAID10:       352.00MiB
   Metadata,RAID10:         6.00GiB
   Metadata,RAID1C3:        5.00GiB
   System,RAID1C3:         32.00MiB
   Unallocated:            85.79GiB

>
> >      Unallocated:
> >         /dev/sdd        1.02MiB
> >         /dev/sde        1.02MiB
> >         /dev/sdl        1.02MiB
> >         /dev/sdn        1.02MiB
> >         /dev/sdm        1.02MiB
> >         /dev/sdk        1.02MiB
> >         /dev/sdj        1.02MiB
> >         /dev/sdi        1.02MiB
> >         /dev/sdb        1.00MiB
> >         /dev/sdc        1.00MiB
> >         /dev/sda        5.90GiB
> >         /dev/sdg        5.90GiB
>
> ...and here we have only 2 disks with free space, so there's zero available
> space for more metadata (raid10 requires 4 disks).
>
> >      This is after clearing up additional space on the filesytem. When I
> >      started the conversion there was only ~300G available. There's now
> >      close 1TB according to df.
> >
> >      /dev/sdd                      68T   66T  932G  99% /mnt/storage-array
> >
> >      So I'm not sure what to make of this and whether it's safe to start
> >      the conversion again. I don't feel like I can trust the unallocated
> >      space before or after the upgrade.
> >
> >      Here's the versions I'm on now:
> >      sudo dpkg -l | grep btrfs-progs
> >      ii  btrfs-progs                            5.4.1-2
> >              amd64        Checksumming Copy on Write Filesystem utilities
> >
> >      uname -r
> >      5.4.0-40-generic
> >
> >      >
> >      > > > You didn't post the dmesg messages from when the filesystem went
> >      > > > read-only, but metadata 'total' is very close to 'used', you were
> >      doing
> >      > > > a balance, and the filesystem went read-only, so I'm guessing you
> >      hit
> >      > > > ENOSPC for metadata due to lack of unallocated space on at least 4
> >      drives
> >      > > > (minimum for raid10).
> >      > > >
> >      > >
> >      > > Here's a paste of everything in dmesg:
> >      [3]http://paste.openstack.org/show/795929/
> >      >
> >      > Unfortunately the original errors are no longer in the buffer.  Maybe
> >      > try /var/log/kern.log?
> >      >
> >
> >      Found it. So this was a space issue. I knew the filesystem was very
> >      full but figured ~300G would be enough.
> >
> >      kernel: [3755232.352221] BTRFS: error (device sdd) in
> >      __btrfs_free_extent:4860: errno=-28 No space left
> >      kernel: [3755232.352227] BTRFS: Transaction aborted (error -28)
> >      ernel: [3755232.354693] BTRFS info (device sdd): forced readonly
> >      kernel: [3755232.354700] BTRFS: error (device sdd) in
> >      btrfs_run_delayed_refs:2795: errno=-28 No space left
>
> The trick is that the free space has to be unallocated to change profiles.
> 'df' counts both unallocated and allocated-but-unused space.
>
> Also you have disks of different sizes, which adds an additional
> complication: raid6 data on 3 disks takes up more space for the same data
> than raid10 data on 4 disks, because the former is 1 data + 2 parity,
> while the latter is 1 data + 1 mirror.  So for 100 GB of data, it's 200
> GB of raw space in raid10 on 4 disks, or 200GB of raw space in raid6 on
> 4 disks, but 300 GB of raw space in raid6 on 3 disks.
>
> Since your filesystem is nearly full, there are likely to be 3-disk-wide
> raid6 block groups formed when there is space available on only 3 drives.
> If that happens too often, hundreds of GB will be wasted and the filesystem
> fills up.
>
> To convert raid10 to raid6 on a full filesystem with unequal disk sizes
> you'll need to do a few steps:
>
>         1.  balance -dconvert=raid1,stripes=1..3,profiles=raid6
>
> This converts any 3-stripe raid6 to raid1, which will get some wasted
> space back.  Use raid1 here because it's more flexible for allocation
> on small numbers of disks than raid10.  We will get rid of it later.
>
>         2.  balance -dconvert=raid1,devid=1,limit=5
>             balance -dconvert=raid1,devid=2,limit=5
>             balance -dconvert=raid1,devid=3,limit=5
>             balance -dconvert=raid1,devid=6,limit=5
>
> Use btrfs fi show to see the real devids for these, I just put sequential
> numbers in the above.
>
> These balances relocate data on the 4.34TB drives to other disks in
> the array.  The goal is to get some unallocated space on all of the
> largest disks so you can create raid6 block groups that span all of them.
>
> We convert to raid1 to get more flexible redistribution of the
> space--raid10 will keep trying to fill every available drive, and has
> a 4-disk minimum, while raid1 will try to equally distribute space on
> all drives but only 2 at a time.  'soft' is not used here because we
> want to relocate block groups on these devices whether they are already
> raid1 or not.
>
> Note that if there is 5GB free on all the largest disks we can skip
> this entire step.  If there is not 5GB free on all the largest disks
> at the end of the above commands, you may need to repeat this step,
> or try 'balance -dconvert=raid1,limit=50' to try to force free space
> on all disks in the array.
>
>         3.  balance -dconvert=raid6,soft,devid=1
>
> This converts all data block groups that have at least one chunk on devid
> 1 (or any disk of the largest size in the array) from raid10 to raid6.
> This will ensure that every chunk that is added to devid 1 has at least
> one corresponding chunk that is removed from devid 1.  That way, devid
> 1 doesn't fill up; instead, it will stay with a few GB unallocated.
> The other disks will get unallocated space because a raid6 block group
> that is at least 4 disks wide will store more data in the same raw space
> than raid10.
>
> At this stage it doesn't matter where the space is coming from, as long as
> it's coming from a minimum of 4 other disks, and not filling up devid 1.
> Some block groups will not be optimal.  We'll optimize later.
>
> Eventually you'll get to the point where there is unallocated space on
> all disks, and then the balance will finish converting the data to raid6
> without further attention.
>
>         4.  balance -dstripes=1..3,devid=1  # sda, 4.34TB
>             balance -dstripes=1..3,devid=2  # sdb, 4.34TB
>             balance -dstripes=1..3,devid=3  # sdc, 4.34TB
>             balance -dstripes=1..5,devid=4  # sdd, 2.21TB
>             balance -dstripes=1..5,devid=5  # sde, 2.21TB
>             balance -dstripes=1..3,devid=6  # sdf, 4.34TB
>             balance -dstripes=1..9,devid=7  # sdg, 1.82TB
>             balance -dstripes=1..9,devid=8  # sdh, 1.82TB
>             balance -dstripes=1..9,devid=9  # sdi, 1.82TB
>             balance -dstripes=1..9,devid=10 # sdj, 1.82TB
>
> This rebalances any narrow stripes that may have formed during the
> previous balances.  For each device we calculate how many disks are
> the same or equal size, and rebalance any block group that is not
> that number of disks wide:
>
>         There are 4 4.34TB disks, so we balance any block group
>         on a 4.34TB disk that is 1 to (4-1) = 3 stripes wide.
>
>         There are 6 2.21TB-or-larger disks (2x2.21TB + 4x4.34TB), so we
>         balance any block group on a 2.21TB disk that is 1 to (6-1) =
>         5 stripes wide.
>
>         There are 10 1.82TB-or-larger disks (this is the smallest size
>         disk, so all 10 disks are equal or larger), so we balance any
>         block group on a 1.82TB disk that is 1 to (10-1) = 9 stripes wide.
>
> These balances will only relocate non-optimal block groups, so each one
> should not relocate many block groups.  If 'btrfs balance status -v' says
> it's relocating thousands of block groups, check the stripe count and
> devid--if you use the wrong stripe count it will unnecessarily relocate
> all the data on the device.
>
>         5.  balance -mconvert=raid1c3,soft
>
> The final step converts metadata from raid10 to raid1c3.  (requires
> kernel 5.5)

Wow, looks like I've got lots of info to mull over here! I kicked off
another convert already after cleaning up quite a bit more space. I
had over 100G unallocated on each device after deleting some data and
running another balance. I'm tempted to let it run and see if it
succeeds, but my unallocated space has already dropped off a cliff with
95% of the rebalance remaining.

The command I used was: btrfs fi balance start -dconvert=raid6,soft
-mconvert=raid1c3 /mnt/storage-array/

Here's the current unallocated with only 5% of the conversion complete.
Unallocated:
   /dev/sdd        1.02MiB
   /dev/sde        1.02MiB
   /dev/sdl        1.02MiB
   /dev/sdn        1.02MiB
   /dev/sdm        1.02MiB
   /dev/sdk        1.02MiB
   /dev/sdj        1.02MiB
   /dev/sdi        1.02MiB
   /dev/sdb       82.89GiB
   /dev/sdc       81.92GiB
   /dev/sda       85.79GiB
   /dev/sdf       85.79GiB

>
>
>
> >      > > > > uname -r
> >      > > > > 5.3.0-40-generic
> >      > > >
> >      > > > Please upgrade to 5.4.13 or later.  Kernels 5.1 through 5.4.12
> >      have a
> >      > > > rare but nasty bug that is triggered by writing at exactly the
> >      wrong
> >      > > > moment during balance.  5.3 has some internal defenses against
> >      that bug
> >      > > > (the "write time tree checker"), but if they fail, the result is
> >      metadata
> >      > > > corruption that requires btrfs check to repair.
> >      > > >
> >      > >
> >      > > Thanks for the heads up. I'm getting it updated now and will attempt
> >      > > to remount once I do. Once it's remounted how should I proceed? Can
> >      I
> >      > > just assume the filesystem is healthy at that point? Should I
> >      perform
> >      > > a scrub?
> >      >
> >      > If scrub reports no errors it's probably OK.
> >
> >      I did run a scrub and it came back clean.
> >
> >      >
> >      > A scrub will tell you if any data or metadata is corrupted or any
> >      > parent-child pointers are broken.  That will cover most of the common
> >      > problems.  If the original issue was a spurious ENOSPC then everything
> >      > should be OK.  If the original issue was a write time tree corruption
> >      > then it should be OK.  If the original issue was something else, it
> >      > will present itself again during the scrub or balance.
> >      >
> >      > If there are errors, scrub won't attribute them to the right disks for
> >      > raid6.  It might be worth reading
> >      >
> >      >
> >      > [4]https://lore.kernel.org/linux-btrfs/20200627032414.GX10769@hungrycats.org/
> >      >
> >      > for a list of current raid5/6 issues to be aware of.
> >
> >      Thanks. This is good info.
> >
> >    --
> >    John Petrini
> >
> > References
> >
> >    Visible links
> >    1. mailto:john.d.petrini@gmail.com
> >    2. mailto:ce3g8jdj@umail.furryterror.org
> >    3. http://paste.openstack.org/show/795929/
> >    4. https://lore.kernel.org/linux-btrfs/20200627032414.GX10769@hungrycats.org/



-- 
---------------------------------------
John Petrini

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Filesystem Went Read Only During Raid-10 to Raid-6 Data Conversion
  2020-07-17  1:11             ` John Petrini
@ 2020-07-17  5:57               ` Zygo Blaxell
  2020-07-17 22:54                 ` John Petrini
  2020-07-18 10:36                 ` Steven Davies
  0 siblings, 2 replies; 13+ messages in thread
From: Zygo Blaxell @ 2020-07-17  5:57 UTC (permalink / raw)
  To: John Petrini; +Cc: John Petrini, linux-btrfs

On Thu, Jul 16, 2020 at 09:11:17PM -0400, John Petrini wrote:
> On Thu, Jul 16, 2020 at 6:57 PM Zygo Blaxell
> > That is...odd.  Try 'btrfs dev usage', maybe something weird is happening
> > with device sizes.
> 
> Here it is. I'm not sure what to make of it though.
> 
> sudo btrfs dev usage /mnt/storage-array/
> /dev/sdd, ID: 1
>    Device size:             4.55TiB
>    Device slack:              0.00B
>    Data,RAID10:             3.12GiB
>    Data,RAID10:             2.78GiB
>    Data,RAID10:           784.31GiB
>    Data,RAID10:             3.34TiB
>    Data,RAID6:            144.07GiB
>    Data,RAID6:            293.03GiB
>    Metadata,RAID10:         4.47GiB
>    Metadata,RAID10:       352.00MiB
>    Unallocated:             1.02MiB
> 
> /dev/sde, ID: 2
>    Device size:             4.55TiB
>    Device slack:              0.00B
>    Data,RAID10:             3.12GiB
>    Data,RAID10:             2.78GiB
>    Data,RAID10:           784.31GiB
>    Data,RAID10:             3.34TiB
>    Data,RAID6:            144.07GiB
>    Data,RAID6:            293.03GiB
>    Metadata,RAID10:         4.47GiB
>    Metadata,RAID10:       352.00MiB
>    Unallocated:             1.02MiB
> 
> /dev/sdl, ID: 3
>    Device size:             3.64TiB
>    Device slack:              0.00B
>    Data,RAID10:             3.12GiB
>    Data,RAID10:             3.34TiB
>    Data,RAID6:            293.03GiB
>    Metadata,RAID10:         4.47GiB
>    Unallocated:             1.02MiB
> 
> /dev/sdn, ID: 4
>    Device size:             3.64TiB
>    Device slack:              0.00B
>    Data,RAID10:             3.12GiB
>    Data,RAID10:             3.34TiB
>    Data,RAID6:            293.03GiB
>    Metadata,RAID10:         4.47GiB
>    Unallocated:             1.02MiB
> 
> /dev/sdm, ID: 5
>    Device size:             3.64TiB
>    Device slack:              0.00B
>    Data,RAID10:             3.12GiB
>    Data,RAID10:             3.34TiB
>    Data,RAID6:            293.03GiB
>    Metadata,RAID10:         4.47GiB
>    Unallocated:             1.02MiB
> 
> /dev/sdk, ID: 6
>    Device size:             3.64TiB
>    Device slack:              0.00B
>    Data,RAID10:             3.12GiB
>    Data,RAID10:             3.34TiB
>    Data,RAID6:            293.03GiB
>    Metadata,RAID10:         4.47GiB
>    Unallocated:             1.02MiB
> 
> /dev/sdj, ID: 7
>    Device size:             3.64TiB
>    Device slack:              0.00B
>    Data,RAID10:             3.12GiB
>    Data,RAID10:             3.34TiB
>    Data,RAID6:            293.03GiB
>    Metadata,RAID10:         4.47GiB
>    Unallocated:             1.02MiB
> 
> /dev/sdi, ID: 8
>    Device size:             3.64TiB
>    Device slack:              0.00B
>    Data,RAID10:             3.12GiB
>    Data,RAID10:             3.34TiB
>    Data,RAID6:            293.03GiB
>    Metadata,RAID10:         4.47GiB
>    Unallocated:             1.02MiB
> 
> /dev/sdb, ID: 9
>    Device size:             9.10TiB
>    Device slack:              0.00B
>    Data,RAID10:             3.12GiB
>    Data,RAID10:             4.01TiB
>    Data,RAID10:           784.31GiB
>    Data,RAID10:             3.34TiB
>    Data,RAID6:            458.56GiB
>    Data,RAID6:            144.07GiB
>    Data,RAID6:            293.03GiB
>    Metadata,RAID10:         4.47GiB
>    Metadata,RAID10:       352.00MiB
>    Metadata,RAID10:         6.00GiB
>    Metadata,RAID1C3:        2.00GiB
>    System,RAID1C3:         32.00MiB
>    Unallocated:            82.89GiB
> 
> /dev/sdc, ID: 10
>    Device size:             9.10TiB
>    Device slack:              0.00B
>    Data,RAID10:             3.12GiB
>    Data,RAID10:             4.01TiB
>    Data,RAID10:           784.31GiB
>    Data,RAID10:             3.34TiB
>    Data,RAID6:            458.56GiB
>    Data,RAID6:            144.07GiB
>    Data,RAID6:            293.03GiB
>    Metadata,RAID10:         4.47GiB
>    Metadata,RAID10:       352.00MiB
>    Metadata,RAID10:         6.00GiB
>    Metadata,RAID1C3:        3.00GiB
>    Unallocated:            81.92GiB
> 
> /dev/sda, ID: 11
>    Device size:             9.10TiB
>    Device slack:              0.00B
>    Data,RAID10:           784.31GiB
>    Data,RAID10:             4.01TiB
>    Data,RAID10:             3.34TiB
>    Data,RAID6:            458.56GiB
>    Data,RAID6:            144.07GiB
>    Data,RAID6:            293.03GiB
>    Metadata,RAID10:         4.47GiB
>    Metadata,RAID10:       352.00MiB
>    Metadata,RAID10:         6.00GiB
>    Metadata,RAID1C3:        5.00GiB
>    System,RAID1C3:         32.00MiB
>    Unallocated:            85.79GiB
> 
> /dev/sdf, ID: 12
>    Device size:             9.10TiB
>    Device slack:              0.00B
>    Data,RAID10:           784.31GiB
>    Data,RAID10:             4.01TiB
>    Data,RAID10:             3.34TiB
>    Data,RAID6:            458.56GiB
>    Data,RAID6:            144.07GiB
>    Data,RAID6:            293.03GiB
>    Metadata,RAID10:         4.47GiB
>    Metadata,RAID10:       352.00MiB
>    Metadata,RAID10:         6.00GiB
>    Metadata,RAID1C3:        5.00GiB
>    System,RAID1C3:         32.00MiB
>    Unallocated:            85.79GiB

OK...slack is 0, so there wasn't anything weird with underlying device
sizes going on.

There's 3 entries for "Data,RAID6" because there are three stripe widths:
12 disks, 6 disks, and 4 disks, corresponding to the number of disks of
each size.  Unfortunately 'dev usage' doesn't say which one is which.
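
A rough way to see the stripe widths directly (just a sketch: it assumes
the chunk-tree text output of current btrfs-progs, counts chunks rather
than bytes, and doesn't separate data from metadata):

        # how many chunks exist at each stripe width
        sudo btrfs inspect-internal dump-tree -t chunk /dev/sda \
            | grep -oE 'num_stripes [0-9]+' \
            | sort | uniq -c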

> Wow looks like I've got lots of info to mull over here! I kicked off
> another convert already after cleaning up quite a bit more space. I
> had over 100G unallocated on each device after deleting some data and
> running another balance. 

If you did balances with no unallocated space on the small drives, then
the block groups created by those balances are the first block groups
to be processed by later balances.  These block groups will be narrow
so they'll use space less efficiently.  We want the opposite of that.

> I'm tempted to let it run and see if it
> succeeds but my unallocated space has already dropped off a cliff with
> 95% of the rebalance remaining.

This is why the devid/stripes filters are important.  Also I noticed
that my logic in my previous reply was wrong for this case:  we do want
to process the smallest disks first, not the largest ones, because that
way we guarantee we always increase unallocated space.

If we convert a 10-disk-wide block group from RAID10 to 4-disk-wide
RAID6, we replace 2 chunks on 10 disks with 5 chunks on 4 disks:

        2 RAID10 block groups:          5 RAID6 block groups:
        sda #1 data1                    sda #1 data1
        sdb #1 mirror1                  sdb #1 data2
        sdc #1 data2                    sdc #1 P1
        sdd #1 mirror2                  sdf #1 Q1
        sde #1 data3                    sda #2 data3
        sdf #1 mirror3                  sdb #2 data4
        sdg #1 data4                    sdc #2 P2
        sdh #1 mirror4                  sdf #2 Q2
        sdi #1 data5                    sda #3 data5
        sdj #1 mirror5                  sdb #3 data6
        sda #2 data6                    sdc #3 P3
        sdb #2 mirror6                  sdf #3 Q3
        sdc #2 data7                    sda #4 data7
        sdd #2 mirror7                  sdb #4 data8
        sde #2 data8                    sdc #4 P4
        sdf #2 mirror8                  sdf #4 Q4
        sdg #2 data9                    sda #5 data9
        sdh #2 mirror9                  sdb #5 data10
        sdi #2 data10                   sdc #5 P5
        sdj #2 mirror10                 sdf #5 Q5

When this happens we lose net 3GB of space on each of the 4 largest disks
for every 2GB we gain on each of the 6 smaller disks, and run out of space part
way through the balance.  We will have to make this tradeoff at some
point in the balance because of the disk sizes, but it's important that
it happens at the very end, after all other possible conversion is done
and the maximum amount of unallocated space is generated.
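
The same arithmetic as a throwaway shell calculation, if it helps (purely
illustrative, assuming the 1GB-per-chunk numbers from the diagram above):

        # each of the 4 largest disks: frees 2 RAID10 chunks, gains 5 RAID6 chunks
        # each of the 6 smaller disks: frees 2 RAID10 chunks, gains none
        echo "largest disks: allocated change = $((5 - 2))GB each"   # 3GB less unallocated
        echo "smaller disks: allocated change = $((0 - 2))GB each"   # 2GB more unallocated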

btrfs balance isn't smart enough to do this by itself, which is why it's
20 commands with filter parameters to get complex arrays reshaped, and
there are sometimes multiple passes.

We want to relocate a 10-disk-wide block group from RAID10 to 10-disk-wide
RAID6, replacing 8 chunks on 10 disks with 5 chunks on 10 disks:

        8 RAID10 block groups:          5 RAID6 block groups:
        sda #1 data1                    sda #1 data1
        sdb #1 mirror1                  sdb #1 data2
        sdc #1 data2                    sdc #1 data3
        sdd #1 mirror2                  sdd #1 data4
        sde #1 data3                    sde #1 data5
        sdf #1 mirror3                  sdf #1 data6
        sdg #1 data4                    sdg #1 data7
        sdh #1 mirror4                  sdh #1 data8
        sdi #1 data5                    sdi #1 P1
        sdj #1 mirror5                  sdj #1 Q1
        sda #2 data6                    sda #2 data9
        sdb #2 mirror6                  sdb #2 data10
        sdc #2 data7                    sdc #2 data11
        sdd #2 mirror7                  sdd #2 data12
        sde #2 data8                    sde #2 data13
        sdf #2 mirror8                  sdf #2 data14
        sdg #2 data9                    sdg #2 data15
        sdh #2 mirror9                  sdh #2 data16
        sdi #2 data10                   sdi #2 P2
        sdj #2 mirror10                 sdj #2 Q2
        ...etc there are 40GB of data

The easiest way to do that is:

	for sc in 12 11 10 9 8 7 6 5 4; do
		btrfs balance start -dconvert=raid6,stripes=$sc..$sc,soft -mconvert=raid1c3,soft /mnt/storage-array/
	done

The above converts the widest block groups first, so that every block
group converted results in a net increase in storage efficiency, and
creates unallocated space on as many disks as possible.
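
Once the loop finishes, a quick sanity check (sketch) is to confirm the
soft filters didn't leave anything behind in the old profiles:

        # any remaining "Data, RAID10" or "Metadata, RAID10" lines mean some
        # block groups were skipped and need another pass
        btrfs fi df /mnt/storage-array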

Then the next step from my original list, edited with the device
IDs and sizes from dev usage, is the optimization step.  I filled
in the device IDs and sizes from your 'dev usage' output:

> >         4.  balance -dstripes=1..5,devid=1  # sdd, 4.55TB
> >             balance -dstripes=1..5,devid=2  # sde, 4.55TB
> >             balance -dstripes=1..11,devid=3 # sdl, 3.64TB
> >             balance -dstripes=1..11,devid=4 # sdn, 3.64TB
> >             balance -dstripes=1..11,devid=5 # sdm, 3.64TB
> >             balance -dstripes=1..11,devid=6 # sdk, 3.64TB
> >             balance -dstripes=1..11,devid=7 # sdj, 3.64TB
> >             balance -dstripes=1..11,devid=8 # sdi, 3.64TB
> >             balance -dstripes=1..3,devid=9  # sdb, 9.10TB
> >             balance -dstripes=1..3,devid=10 # sdc, 9.10TB
> >             balance -dstripes=1..3,devid=11 # sda, 9.10TB
> >             balance -dstripes=1..3,devid=12 # sdf, 9.10TB

This ensures that each disk is a member of an optimum width block
group for the disk size.

Note: I'm not sure about the 1..11.  IIRC the btrfs limit is 10 disks
per stripe, so you might want to use 1..9 if it seems to be trying
to rebalance everything with 1..11.
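
Spelled out as literal commands, that step would look something like this
(a sketch based on the devids above, using your mount point; per the note,
drop devids 3-8 back to 1..9 if 1..11 starts relocating everything):

        for devid in 1 2; do            # the two 4.55TB disks
            btrfs balance start -dstripes=1..5,devid=$devid /mnt/storage-array/
        done
        for devid in 3 4 5 6 7 8; do    # the six 3.64TB disks
            btrfs balance start -dstripes=1..11,devid=$devid /mnt/storage-array/
        done
        for devid in 9 10 11 12; do     # the four 9.10TB disks
            btrfs balance start -dstripes=1..3,devid=$devid /mnt/storage-array/
        done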

Running 'watch btrfs fi usage /mnt/storage-array' while balance runs
can be enlightening.
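
For example (a sketch; the refresh interval is arbitrary):

        watch -n 60 'btrfs fi usage /mnt/storage-array; echo; btrfs balance status -v /mnt/storage-array'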

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Filesystem Went Read Only During Raid-10 to Raid-6 Data Conversion
  2020-07-17  5:57               ` Zygo Blaxell
@ 2020-07-17 22:54                 ` John Petrini
  2020-07-18 10:36                 ` Steven Davies
  1 sibling, 0 replies; 13+ messages in thread
From: John Petrini @ 2020-07-17 22:54 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: John Petrini, linux-btrfs

On Fri, Jul 17, 2020 at 1:57 AM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Thu, Jul 16, 2020 at 09:11:17PM -0400, John Petrini wrote:
> > On Thu, Jul 16, 2020 at 6:57 PM Zygo Blaxell
> > > That is...odd.  Try 'btrfs dev usage', maybe something weird is happening
> > > with device sizes.
> >
> > Here it is. I'm not sure what to make of it though.
> >
> > sudo btrfs dev usage /mnt/storage-array/
> > /dev/sdd, ID: 1
> >    Device size:             4.55TiB
> >    Device slack:              0.00B
> >    Data,RAID10:             3.12GiB
> >    Data,RAID10:             2.78GiB
> >    Data,RAID10:           784.31GiB
> >    Data,RAID10:             3.34TiB
> >    Data,RAID6:            144.07GiB
> >    Data,RAID6:            293.03GiB
> >    Metadata,RAID10:         4.47GiB
> >    Metadata,RAID10:       352.00MiB
> >    Unallocated:             1.02MiB
> >
> > /dev/sde, ID: 2
> >    Device size:             4.55TiB
> >    Device slack:              0.00B
> >    Data,RAID10:             3.12GiB
> >    Data,RAID10:             2.78GiB
> >    Data,RAID10:           784.31GiB
> >    Data,RAID10:             3.34TiB
> >    Data,RAID6:            144.07GiB
> >    Data,RAID6:            293.03GiB
> >    Metadata,RAID10:         4.47GiB
> >    Metadata,RAID10:       352.00MiB
> >    Unallocated:             1.02MiB
> >
> > /dev/sdl, ID: 3
> >    Device size:             3.64TiB
> >    Device slack:              0.00B
> >    Data,RAID10:             3.12GiB
> >    Data,RAID10:             3.34TiB
> >    Data,RAID6:            293.03GiB
> >    Metadata,RAID10:         4.47GiB
> >    Unallocated:             1.02MiB
> >
> > /dev/sdn, ID: 4
> >    Device size:             3.64TiB
> >    Device slack:              0.00B
> >    Data,RAID10:             3.12GiB
> >    Data,RAID10:             3.34TiB
> >    Data,RAID6:            293.03GiB
> >    Metadata,RAID10:         4.47GiB
> >    Unallocated:             1.02MiB
> >
> > /dev/sdm, ID: 5
> >    Device size:             3.64TiB
> >    Device slack:              0.00B
> >    Data,RAID10:             3.12GiB
> >    Data,RAID10:             3.34TiB
> >    Data,RAID6:            293.03GiB
> >    Metadata,RAID10:         4.47GiB
> >    Unallocated:             1.02MiB
> >
> > /dev/sdk, ID: 6
> >    Device size:             3.64TiB
> >    Device slack:              0.00B
> >    Data,RAID10:             3.12GiB
> >    Data,RAID10:             3.34TiB
> >    Data,RAID6:            293.03GiB
> >    Metadata,RAID10:         4.47GiB
> >    Unallocated:             1.02MiB
> >
> > /dev/sdj, ID: 7
> >    Device size:             3.64TiB
> >    Device slack:              0.00B
> >    Data,RAID10:             3.12GiB
> >    Data,RAID10:             3.34TiB
> >    Data,RAID6:            293.03GiB
> >    Metadata,RAID10:         4.47GiB
> >    Unallocated:             1.02MiB
> >
> > /dev/sdi, ID: 8
> >    Device size:             3.64TiB
> >    Device slack:              0.00B
> >    Data,RAID10:             3.12GiB
> >    Data,RAID10:             3.34TiB
> >    Data,RAID6:            293.03GiB
> >    Metadata,RAID10:         4.47GiB
> >    Unallocated:             1.02MiB
> >
> > /dev/sdb, ID: 9
> >    Device size:             9.10TiB
> >    Device slack:              0.00B
> >    Data,RAID10:             3.12GiB
> >    Data,RAID10:             4.01TiB
> >    Data,RAID10:           784.31GiB
> >    Data,RAID10:             3.34TiB
> >    Data,RAID6:            458.56GiB
> >    Data,RAID6:            144.07GiB
> >    Data,RAID6:            293.03GiB
> >    Metadata,RAID10:         4.47GiB
> >    Metadata,RAID10:       352.00MiB
> >    Metadata,RAID10:         6.00GiB
> >    Metadata,RAID1C3:        2.00GiB
> >    System,RAID1C3:         32.00MiB
> >    Unallocated:            82.89GiB
> >
> > /dev/sdc, ID: 10
> >    Device size:             9.10TiB
> >    Device slack:              0.00B
> >    Data,RAID10:             3.12GiB
> >    Data,RAID10:             4.01TiB
> >    Data,RAID10:           784.31GiB
> >    Data,RAID10:             3.34TiB
> >    Data,RAID6:            458.56GiB
> >    Data,RAID6:            144.07GiB
> >    Data,RAID6:            293.03GiB
> >    Metadata,RAID10:         4.47GiB
> >    Metadata,RAID10:       352.00MiB
> >    Metadata,RAID10:         6.00GiB
> >    Metadata,RAID1C3:        3.00GiB
> >    Unallocated:            81.92GiB
> >
> > /dev/sda, ID: 11
> >    Device size:             9.10TiB
> >    Device slack:              0.00B
> >    Data,RAID10:           784.31GiB
> >    Data,RAID10:             4.01TiB
> >    Data,RAID10:             3.34TiB
> >    Data,RAID6:            458.56GiB
> >    Data,RAID6:            144.07GiB
> >    Data,RAID6:            293.03GiB
> >    Metadata,RAID10:         4.47GiB
> >    Metadata,RAID10:       352.00MiB
> >    Metadata,RAID10:         6.00GiB
> >    Metadata,RAID1C3:        5.00GiB
> >    System,RAID1C3:         32.00MiB
> >    Unallocated:            85.79GiB
> >
> > /dev/sdf, ID: 12
> >    Device size:             9.10TiB
> >    Device slack:              0.00B
> >    Data,RAID10:           784.31GiB
> >    Data,RAID10:             4.01TiB
> >    Data,RAID10:             3.34TiB
> >    Data,RAID6:            458.56GiB
> >    Data,RAID6:            144.07GiB
> >    Data,RAID6:            293.03GiB
> >    Metadata,RAID10:         4.47GiB
> >    Metadata,RAID10:       352.00MiB
> >    Metadata,RAID10:         6.00GiB
> >    Metadata,RAID1C3:        5.00GiB
> >    System,RAID1C3:         32.00MiB
> >    Unallocated:            85.79GiB
>
> OK...slack is 0, so there wasn't anything weird with underlying device
> sizes going on.
>
> There's 3 entries for "Data,RAID6" because there are three stripe widths:
> 12 disks, 6 disks, and 4 disks, corresponding to the number of disks of
> each size.  Unfortunately 'dev usage' doesn't say which one is which.
>
> > Wow looks like I've got lots of info to mull over here! I kicked off
> > another convert already after cleaning up quite a bit more space. I
> > had over 100G unallocated on each device after deleting some data and
> > running another balance.
>
> If you did balances with no unallocated space on the small drives, then
> the block groups created by those balances are the first block groups
> to be processed by later balances.  These block groups will be narrow
> so they'll use space less efficiently.  We want the opposite of that.
>

There was unallocated space on all drives before I started this recent
balance. So far it's still chugging along at about 20% complete, but I
assume that even if this does complete successfully I'll be stuck with some
narrow stripes from the first attempt.

> > I'm tempted to let it run and see if it
> > succeeds but my unallocated space has already dropped off a cliff with
> > 95% of the rebalance remaining.
>
> This is why the devid/stripes filters are important.  Also I noticed
> that my logic in my previous reply was wrong for this case:  we do want
> to process the smallest disks first, not the largest ones, because that
> way we guarantee we always increase unallocated space.
>
> If we convert a 10-disk-wide block group from RAID10 to 4-disk-wide
> RAID6, we replace 2 chunks on 10 disks with 5 chunks on 4 disks:
>
>         2 RAID10 block groups:          5 RAID6 block groups:
>         sda #1 data1                    sda #1 data1
>         sdb #1 mirror1                  sdb #1 data2
>         sdc #1 data2                    sdc #1 P1
>         sdd #1 mirror2                  sdf #1 Q1
>         sde #1 data3                    sda #2 data3
>         sdf #1 mirror3                  sdb #2 data4
>         sdg #1 data4                    sdc #2 P2
>         sdh #1 mirror4                  sdf #2 Q2
>         sdi #1 data5                    sda #3 data5
>         sdj #1 mirror5                  sdb #3 data6
>         sda #2 data6                    sdc #3 P3
>         sdb #2 mirror6                  sdf #3 Q3
>         sdc #2 data7                    sda #4 data7
>         sdd #2 mirror7                  sdb #4 data8
>         sde #2 data8                    sdc #4 P4
>         sdf #2 mirror8                  sdf #4 Q4
>         sdg #2 data9                    sda #5 data9
>         sdh #2 mirror9                  sdb #5 data10
>         sdi #2 data10                   sdc #5 P5
>         sdj #2 mirror10                 sdf #5 Q5
>
> When this happens we lose net 3GB of space on each of the 4 largest disks
> for every 2GB we gain on each of the 6 smaller disks, and run out of space part
> way through the balance.  We will have to make this tradeoff at some
> point in the balance because of the disk sizes, but it's important that
> it happens at the very end, after all other possible conversion is done
> and the maximum amount of unallocated space is generated.
>
> btrfs balance isn't smart enough to do this by itself, which is why it's
> 20 commands with filter parameters to get complex arrays reshaped, and
> there are sometimes multiple passes.
>
> We want to relocate a 10-disk-wide block group from RAID10 to 10-disk-wide
> RAID6, replacing 8 chunks on 10 disks with 5 chunks on 10 disks:
>
>         8 RAID10 block groups:          5 RAID6 block groups:
>         sda #1 data1                    sda #1 data1
>         sdb #1 mirror1                  sdb #1 data2
>         sdc #1 data2                    sdc #1 data3
>         sdd #1 mirror2                  sdd #1 data4
>         sde #1 data3                    sde #1 data5
>         sdf #1 mirror3                  sdf #1 data6
>         sdg #1 data4                    sdg #1 data7
>         sdh #1 mirror4                  sdh #1 data8
>         sdi #1 data5                    sdi #1 P1
>         sdj #1 mirror5                  sdj #1 Q1
>         sda #2 data6                    sda #2 data9
>         sdb #2 mirror6                  sdb #2 data10
>         sdc #2 data7                    sdc #2 data11
>         sdd #2 mirror7                  sdd #2 data12
>         sde #2 data8                    sde #2 data13
>         sdf #2 mirror8                  sdf #2 data14
>         sdg #2 data9                    sdg #2 data15
>         sdh #2 mirror9                  sdh #2 data16
>         sdi #2 data10                   sdi #2 P2
>         sdj #2 mirror10                 sdj #2 Q2
>         ...etc there are 40GB of data
>
> The easiest way to do that is:
>
>         for sc in 12 11 10 9 8 7 6 5 4; do
>                 btrfs balance start -dconvert=raid6,stripes=$sc..$sc,soft -mconvert=raid1c3,soft /mnt/storage-array/
>         done
>
> The above converts the widest block groups first, so that every block
> group converted results in a net increase in storage efficiency, and
> creates unallocated space on as many disks as possible.
>
> Then the next step from my original list, edited with the device
> IDs and sizes from dev usage, is the optimization step.  I filled
> in the device IDs and sizes from your 'dev usage' output:
>
> > >         4.  balance -dstripes=1..5,devid=1  # sdd, 4.55TB
> > >             balance -dstripes=1..5,devid=2  # sde, 4.55TB
> > >             balance -dstripes=1..11,devid=3 # sdl, 3.64TB
> > >             balance -dstripes=1..11,devid=4 # sdn, 3.64TB
> > >             balance -dstripes=1..11,devid=5 # sdm, 3.64TB
> > >             balance -dstripes=1..11,devid=6 # sdk, 3.64TB
> > >             balance -dstripes=1..11,devid=7 # sdj, 3.64TB
> > >             balance -dstripes=1..11,devid=8 # sdi, 3.64TB
> > >             balance -dstripes=1..3,devid=9  # sdb, 9.10TB
> > >             balance -dstripes=1..3,devid=10 # sdc, 9.10TB
> > >             balance -dstripes=1..3,devid=11 # sda, 9.10TB
> > >             balance -dstripes=1..3,devid=12 # sdf, 9.10TB
>
> This ensures that each disk is a member of an optimum width block
> group for the disk size.
>
> Note: I'm not sure about the 1..11.  IIRC the btrfs limit is 10 disks
> per stripe, so you might want to use 1..9 if it seems to be trying
> to rebalance everything with 1..11.
>
> Running 'watch btrfs fi usage /mnt/storage-array' while balance runs
> can be enlightening.

Thanks so much for all this detail. I'll see how this run goes and if
it gets stuck again I'll try your strategy of converting to RAID-1 to
get back some unallocated space. Otherwise if this completes
successfully I'll go ahead with optimizing the striping and let you
know how it goes.



-- 
---------------------------------------
John Petrini

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Filesystem Went Read Only During Raid-10 to Raid-6 Data Conversion
  2020-07-17  5:57               ` Zygo Blaxell
  2020-07-17 22:54                 ` John Petrini
@ 2020-07-18 10:36                 ` Steven Davies
  2020-07-20 17:57                   ` Goffredo Baroncelli
  1 sibling, 1 reply; 13+ messages in thread
From: Steven Davies @ 2020-07-18 10:36 UTC (permalink / raw)
  To: Zygo Blaxell, John Petrini; +Cc: John Petrini, linux-btrfs

On 17/07/2020 06:57, Zygo Blaxell wrote:
> On Thu, Jul 16, 2020 at 09:11:17PM -0400, John Petrini wrote:

--snip--

>> /dev/sdf, ID: 12
>>     Device size:             9.10TiB
>>     Device slack:              0.00B
>>     Data,RAID10:           784.31GiB
>>     Data,RAID10:             4.01TiB
>>     Data,RAID10:             3.34TiB
>>     Data,RAID6:            458.56GiB
>>     Data,RAID6:            144.07GiB
>>     Data,RAID6:            293.03GiB
>>     Metadata,RAID10:         4.47GiB
>>     Metadata,RAID10:       352.00MiB
>>     Metadata,RAID10:         6.00GiB
>>     Metadata,RAID1C3:        5.00GiB
>>     System,RAID1C3:         32.00MiB
>>     Unallocated:            85.79GiB
> 
> OK...slack is 0, so there wasn't anything weird with underlying device
> sizes going on.
> 
> There's 3 entries for "Data,RAID6" because there are three stripe widths:
> 12 disks, 6 disks, and 4 disks, corresponding to the number of disks of
> each size.  Unfortunately 'dev usage' doesn't say which one is which.

RFE: improve 'dev usage' to show these details.

As a user I'd look at this output and assume a bug in btrfs-tools 
because of the repeated conflicting information.

-- 
Steven Davies

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Filesystem Went Read Only During Raid-10 to Raid-6 Data Conversion
  2020-07-18 10:36                 ` Steven Davies
@ 2020-07-20 17:57                   ` Goffredo Baroncelli
  2020-07-21 10:15                     ` Steven Davies
  0 siblings, 1 reply; 13+ messages in thread
From: Goffredo Baroncelli @ 2020-07-20 17:57 UTC (permalink / raw)
  To: Steven Davies, Zygo Blaxell, John Petrini; +Cc: John Petrini, linux-btrfs

On 7/18/20 12:36 PM, Steven Davies wrote:
> On 17/07/2020 06:57, Zygo Blaxell wrote:
>> On Thu, Jul 16, 2020 at 09:11:17PM -0400, John Petrini wrote:
> 
> --snip--
> 
>>> /dev/sdf, ID: 12
>>>     Device size:             9.10TiB
>>>     Device slack:              0.00B
>>>     Data,RAID10:           784.31GiB
>>>     Data,RAID10:             4.01TiB
>>>     Data,RAID10:             3.34TiB
>>>     Data,RAID6:            458.56GiB
>>>     Data,RAID6:            144.07GiB
>>>     Data,RAID6:            293.03GiB
>>>     Metadata,RAID10:         4.47GiB
>>>     Metadata,RAID10:       352.00MiB
>>>     Metadata,RAID10:         6.00GiB
>>>     Metadata,RAID1C3:        5.00GiB
>>>     System,RAID1C3:         32.00MiB
>>>     Unallocated:            85.79GiB
>>
[...]
> 
> RFE: improve 'dev usage' to show these details.
> 
> As a user I'd look at this output and assume a bug in btrfs-tools because of the repeated conflicting information.

What would be the expected output?
What about the example below?

  /dev/sdf, ID: 12
      Device size:             9.10TiB
      Device slack:              0.00B
      Data,RAID10:           784.31GiB
      Data,RAID10:             4.01TiB
      Data,RAID10:             3.34TiB
      Data,RAID6[3]:         458.56GiB
      Data,RAID6[5]:         144.07GiB
      Data,RAID6[7]:         293.03GiB
      Metadata,RAID10:         4.47GiB
      Metadata,RAID10:       352.00MiB
      Metadata,RAID10:         6.00GiB
      Metadata,RAID1C3:        5.00GiB
      System,RAID1C3:         32.00MiB
      Unallocated:            85.79GiB


Another possibility (but the output will change drastically; I am thinking of another command)

Filesystem '/'
	Data,RAID1:		123.45GiB
		/dev/sda	 12.34GiB
		/dev/sdb	 12.34GiB
	Data,RAID1:		123.45GiB
		/dev/sde	 12.34GiB
		/dev/sdf	 12.34GiB
	Data,RAID6:		123.45GiB
		/dev/sda	 12.34GiB
		/dev/sdb	 12.34GiB
		/dev/sdc	 12.34GiB
	Data,RAID6:		123.45GiB
		/dev/sdb	 12.34GiB
		/dev/sdc	 12.34GiB
		/dev/sdd	 12.34GiB
		/dev/sde	 12.34GiB
		/dev/sdf	 12.34GiB


The numbers are the chunk sizes (invented). Note: for RAID5/RAID6 a chunk will use nearly all disks; however for (e.g.) RAID1 there is the possibility that chunks use different disk pairs (see the two RAID1 instances).


BR
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Filesystem Went Read Only During Raid-10 to Raid-6 Data Conversion
  2020-07-20 17:57                   ` Goffredo Baroncelli
@ 2020-07-21 10:15                     ` Steven Davies
  2020-07-21 20:48                       ` Goffredo Baroncelli
  0 siblings, 1 reply; 13+ messages in thread
From: Steven Davies @ 2020-07-21 10:15 UTC (permalink / raw)
  To: kreijack; +Cc: Zygo Blaxell, John Petrini, John Petrini, linux-btrfs

On 2020-07-20 18:57, Goffredo Baroncelli wrote:
> On 7/18/20 12:36 PM, Steven Davies wrote:
>> On 17/07/2020 06:57, Zygo Blaxell wrote:
>>> On Thu, Jul 16, 2020 at 09:11:17PM -0400, John Petrini wrote:
>> 
>> --snip--
>> 
>>>> /dev/sdf, ID: 12
>>>>     Device size:             9.10TiB
>>>>     Device slack:              0.00B
>>>>     Data,RAID10:           784.31GiB
>>>>     Data,RAID10:             4.01TiB
>>>>     Data,RAID10:             3.34TiB
>>>>     Data,RAID6:            458.56GiB
>>>>     Data,RAID6:            144.07GiB
>>>>     Data,RAID6:            293.03GiB
>>>>     Metadata,RAID10:         4.47GiB
>>>>     Metadata,RAID10:       352.00MiB
>>>>     Metadata,RAID10:         6.00GiB
>>>>     Metadata,RAID1C3:        5.00GiB
>>>>     System,RAID1C3:         32.00MiB
>>>>     Unallocated:            85.79GiB
>>> 
> [...]
>> 
>> RFE: improve 'dev usage' to show these details.
>> 
>> As a user I'd look at this output and assume a bug in btrfs-tools 
>> because of the repeated conflicting information.
> 
> What would be the expected output ?
> What about the example below ?
> 
>  /dev/sdf, ID: 12
>      Device size:             9.10TiB
>      Device slack:              0.00B
>      Data,RAID10:           784.31GiB
>      Data,RAID10:             4.01TiB
>      Data,RAID10:             3.34TiB
>      Data,RAID6[3]:         458.56GiB
>      Data,RAID6[5]:         144.07GiB
>      Data,RAID6[7]:         293.03GiB
>      Metadata,RAID10:         4.47GiB
>      Metadata,RAID10:       352.00MiB
>      Metadata,RAID10:         6.00GiB
>      Metadata,RAID1C3:        5.00GiB
>      System,RAID1C3:         32.00MiB
>      Unallocated:            85.79GiB

That works for me for RAID6. There are three lines for RAID10 too - 
what's the difference between these?

> Another possibility (but the output will change drastically, I am
> thinking to another command)
> 
> Filesystem '/'
> 	Data,RAID1:		123.45GiB
> 		/dev/sda	 12.34GiB
> 		/dev/sdb	 12.34GiB
> 	Data,RAID1:		123.45GiB
> 		/dev/sde	 12.34GiB
> 		/dev/sdf	 12.34GiB

Is this showing that there's 123.45GiB of RAID1 data which is mirrored 
between sda and sdb, and 123.45GiB which is mirrored between sde and 
sdf? I'm not sure how useful that would be if there are a lot of disks 
in a RAID1 volume with different blocks mirrored between different ones. 
For RAID1 (and RAID10) I would keep it simple.

> 	Data,RAID6:		123.45GiB
> 		/dev/sda	 12.34GiB
> 		/dev/sdb	 12.34GiB
> 		/dev/sdc	 12.34GiB
> 	Data,RAID6:		123.45GiB
> 		/dev/sdb	 12.34GiB
> 		/dev/sdc	 12.34GiB
> 		/dev/sdd	 12.34GiB
> 		/dev/sde	 12.34GiB
> 		/dev/sdf	 12.34GiB

Here there would need to be something which shows what the difference in 
the RAID6 blocks is - if it's the chunk size then I'd do the same as the 
above example with e.g. Data,RAID6[3].

> The number are the chunks sizes (invented). Note: for RAID5/RAID6 a
> chunk will uses near all disks; however for (e.g.) RAID1  there is the
> possibility that CHUNKS use different disks pairs (see the two RAID1
> instances).

-- 
Steven Davies

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Filesystem Went Read Only During Raid-10 to Raid-6 Data Conversion
  2020-07-21 10:15                     ` Steven Davies
@ 2020-07-21 20:48                       ` Goffredo Baroncelli
  2020-07-23  8:57                         ` Steven Davies
  0 siblings, 1 reply; 13+ messages in thread
From: Goffredo Baroncelli @ 2020-07-21 20:48 UTC (permalink / raw)
  To: Steven Davies; +Cc: Zygo Blaxell, John Petrini, John Petrini, linux-btrfs

On 7/21/20 12:15 PM, Steven Davies wrote:
> On 2020-07-20 18:57, Goffredo Baroncelli wrote:
>> On 7/18/20 12:36 PM, Steven Davies wrote:
>>> On 17/07/2020 06:57, Zygo Blaxell wrote:
>>>> On Thu, Jul 16, 2020 at 09:11:17PM -0400, John Petrini wrote:
>>>
>>> --snip--
>>>
>>>>> /dev/sdf, ID: 12
>>>>>     Device size:             9.10TiB
>>>>>     Device slack:              0.00B
>>>>>     Data,RAID10:           784.31GiB
>>>>>     Data,RAID10:             4.01TiB
>>>>>     Data,RAID10:             3.34TiB
>>>>>     Data,RAID6:            458.56GiB
>>>>>     Data,RAID6:            144.07GiB
>>>>>     Data,RAID6:            293.03GiB
>>>>>     Metadata,RAID10:         4.47GiB
>>>>>     Metadata,RAID10:       352.00MiB
>>>>>     Metadata,RAID10:         6.00GiB
>>>>>     Metadata,RAID1C3:        5.00GiB
>>>>>     System,RAID1C3:         32.00MiB
>>>>>     Unallocated:            85.79GiB
>>>>
>> [...]
>>>
>>> RFE: improve 'dev usage' to show these details.
>>>
>>> As a user I'd look at this output and assume a bug in btrfs-tools because of the repeated conflicting information.
>>
>> What would be the expected output ?
>> What about the example below ?
>>
>>  /dev/sdf, ID: 12
>>      Device size:             9.10TiB
>>      Device slack:              0.00B
>>      Data,RAID10:           784.31GiB
>>      Data,RAID10:             4.01TiB
>>      Data,RAID10:             3.34TiB
>>      Data,RAID6[3]:         458.56GiB
>>      Data,RAID6[5]:         144.07GiB
>>      Data,RAID6[7]:         293.03GiB
>>      Metadata,RAID10:         4.47GiB
>>      Metadata,RAID10:       352.00MiB
>>      Metadata,RAID10:         6.00GiB
>>      Metadata,RAID1C3:        5.00GiB
>>      System,RAID1C3:         32.00MiB
>>      Unallocated:            85.79GiB
> 
> That works for me for RAID6. There are three lines for RAID10 too - what's the difference between these?

The difference is the number of disks involved. In raid10, the first 64K is on the first disk, the 2nd 64K is on the 2nd disk, and so on until the last disk. Then the (n+1)th 64K is again on the first disk... and so on. (OK, I skipped the RAID1 mirroring part, but I think this gives the idea.)

So the chunk layout depends on the number of disks involved, even if the difference is not so dramatic.


> 
>> Another possibility (but the output will change drastically, I am
>> thinking to another command)
>>
>> Filesystem '/'
>>     Data,RAID1:        123.45GiB
>>         /dev/sda     12.34GiB
>>         /dev/sdb     12.34GiB
>>     Data,RAID1:        123.45GiB
>>         /dev/sde     12.34GiB
>>         /dev/sdf     12.34GiB
> 
> Is this showing that there's 123.45GiB of RAID1 data which is mirrored between sda and sdb, and 123.45GiB which is mirrored between sde and sdf? I'm not sure how useful that would be if there are a lot of disks in a RAID1 volume with different blocks mirrored between different ones. For RAID1 (and RAID10) I would keep it simple.
> 
>>     Data,RAID6:        123.45GiB
>>         /dev/sda     12.34GiB
>>         /dev/sdb     12.34GiB
>>         /dev/sdc     12.34GiB
>>     Data,RAID6:        123.45GiB
>>         /dev/sdb     12.34GiB
>>         /dev/sdc     12.34GiB
>>         /dev/sdd     12.34GiB
>>         /dev/sde     12.34GiB
>>         /dev/sdf     12.34GiB
> 
> Here there would need to be something which shows what the difference in the RAID6 blocks is - if it's the chunk size then I'd do the same as the above example with e.g. Data,RAID6[3].

We could add a '[n]' for the profiles where it matters, e.g. raid0,
raid10, raid5, raid6.
What do you think?
> 
>> The number are the chunks sizes (invented). Note: for RAID5/RAID6 a
>> chunk will uses near all disks; however for (e.g.) RAID1  there is the
>> possibility that CHUNKS use different disks pairs (see the two RAID1
>> instances).
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Filesystem Went Read Only During Raid-10 to Raid-6 Data Conversion
  2020-07-21 20:48                       ` Goffredo Baroncelli
@ 2020-07-23  8:57                         ` Steven Davies
  2020-07-23 19:29                           ` Zygo Blaxell
  0 siblings, 1 reply; 13+ messages in thread
From: Steven Davies @ 2020-07-23  8:57 UTC (permalink / raw)
  To: kreijack; +Cc: Zygo Blaxell, John Petrini, John Petrini, linux-btrfs

On 2020-07-21 21:48, Goffredo Baroncelli wrote:
> On 7/21/20 12:15 PM, Steven Davies wrote:
>> On 2020-07-20 18:57, Goffredo Baroncelli wrote:
>>> On 7/18/20 12:36 PM, Steven Davies wrote:

>>>>>> /dev/sdf, ID: 12
>>>>>>     Device size:             9.10TiB
>>>>>>     Device slack:              0.00B
>>>>>>     Data,RAID10:           784.31GiB
>>>>>>     Data,RAID10:             4.01TiB
>>>>>>     Data,RAID10:             3.34TiB
>>>>>>     Data,RAID6:            458.56GiB
>>>>>>     Data,RAID6:            144.07GiB
>>>>>>     Data,RAID6:            293.03GiB
>>>>>>     Metadata,RAID10:         4.47GiB
>>>>>>     Metadata,RAID10:       352.00MiB
>>>>>>     Metadata,RAID10:         6.00GiB
>>>>>>     Metadata,RAID1C3:        5.00GiB
>>>>>>     System,RAID1C3:         32.00MiB
>>>>>>     Unallocated:            85.79GiB
>>>>> 
>>> [...]
>>>> 
>>>> RFE: improve 'dev usage' to show these details.
>>>> 
>>>> As a user I'd look at this output and assume a bug in btrfs-tools 
>>>> because of the repeated conflicting information.
>>> 
>>> What would be the expected output ?
>>> What about the example below ?
>>> 
>>>  /dev/sdf, ID: 12
>>>      Device size:             9.10TiB
>>>      Device slack:              0.00B
>>>      Data,RAID10:           784.31GiB
>>>      Data,RAID10:             4.01TiB
>>>      Data,RAID10:             3.34TiB
>>>      Data,RAID6[3]:         458.56GiB
>>>      Data,RAID6[5]:         144.07GiB
>>>      Data,RAID6[7]:         293.03GiB
>>>      Metadata,RAID10:         4.47GiB
>>>      Metadata,RAID10:       352.00MiB
>>>      Metadata,RAID10:         6.00GiB
>>>      Metadata,RAID1C3:        5.00GiB
>>>      System,RAID1C3:         32.00MiB
>>>      Unallocated:            85.79GiB
>> 
>> That works for me for RAID6. There are three lines for RAID10 too - 
>> what's the difference between these?
> 
> The differences is the number of the disks involved. In raid10, the
> first 64K are on the first disk, the 2nd 64K are in the 2nd disk and
> so until the last disk. Then the n+1 th 64K are again in the first
> disk... and so on.. (ok I missed the RAID1 part, but I think the have
> giving the idea )
> 
> So the chunk layout depends by the involved number of disk, even if
> the differences is not so dramatic.

Is this information that the user/sysadmin needs to be aware of in a 
similar manner to the original problem that started this thread? If not 
I'd be tempted to sum all the RAID10 chunks into one line (each for data 
and metadata).

>>>     Data,RAID6:        123.45GiB
>>>         /dev/sda     12.34GiB
>>>         /dev/sdb     12.34GiB
>>>         /dev/sdc     12.34GiB
>>>     Data,RAID6:        123.45GiB
>>>         /dev/sdb     12.34GiB
>>>         /dev/sdc     12.34GiB
>>>         /dev/sdd     12.34GiB
>>>         /dev/sde     12.34GiB
>>>         /dev/sdf     12.34GiB
>> 
>> Here there would need to be something which shows what the difference 
>> in the RAID6 blocks is - if it's the chunk size then I'd do the same 
>> as the above example with e.g. Data,RAID6[3].
> 
> We could add a '[n]' for the profile where it matters, e.g. raid0,
> raid10, raid5, raid6.
> What do you think ?

So like this? That would make sense to me, as long as the meaning of [n] 
is explained in --help or the manpage.
      Data,RAID6[3]:     123.45GiB
          /dev/sda     12.34GiB
          /dev/sdb     12.34GiB
          /dev/sdc     12.34GiB
      Data,RAID6[5]:     123.45GiB
          /dev/sdb     12.34GiB
          /dev/sdc     12.34GiB
          /dev/sdd     12.34GiB
          /dev/sde     12.34GiB
          /dev/sdf     12.34GiB

-- 
Steven Davies

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Filesystem Went Read Only During Raid-10 to Raid-6 Data Conversion
  2020-07-23  8:57                         ` Steven Davies
@ 2020-07-23 19:29                           ` Zygo Blaxell
  0 siblings, 0 replies; 13+ messages in thread
From: Zygo Blaxell @ 2020-07-23 19:29 UTC (permalink / raw)
  To: Steven Davies; +Cc: kreijack, John Petrini, John Petrini, linux-btrfs

On Thu, Jul 23, 2020 at 09:57:50AM +0100, Steven Davies wrote:
> On 2020-07-21 21:48, Goffredo Baroncelli wrote:
> > On 7/21/20 12:15 PM, Steven Davies wrote:
> > > On 2020-07-20 18:57, Goffredo Baroncelli wrote:
> > > > On 7/18/20 12:36 PM, Steven Davies wrote:
> 
> > > > > > > /dev/sdf, ID: 12
> > > > > > >     Device size:             9.10TiB
> > > > > > >     Device slack:              0.00B
> > > > > > >     Data,RAID10:           784.31GiB
> > > > > > >     Data,RAID10:             4.01TiB
> > > > > > >     Data,RAID10:             3.34TiB
> > > > > > >     Data,RAID6:            458.56GiB
> > > > > > >     Data,RAID6:            144.07GiB
> > > > > > >     Data,RAID6:            293.03GiB
> > > > > > >     Metadata,RAID10:         4.47GiB
> > > > > > >     Metadata,RAID10:       352.00MiB
> > > > > > >     Metadata,RAID10:         6.00GiB
> > > > > > >     Metadata,RAID1C3:        5.00GiB
> > > > > > >     System,RAID1C3:         32.00MiB
> > > > > > >     Unallocated:            85.79GiB
> > > > > > 
> > > > [...]
> > > > > 
> > > > > RFE: improve 'dev usage' to show these details.
> > > > > 
> > > > > As a user I'd look at this output and assume a bug in
> > > > > btrfs-tools because of the repeated conflicting information.
> > > > 
> > > > What would be the expected output ?
> > > > What about the example below ?
> > > > 
> > > >  /dev/sdf, ID: 12
> > > >      Device size:             9.10TiB
> > > >      Device slack:              0.00B
> > > >      Data,RAID10:           784.31GiB
> > > >      Data,RAID10:             4.01TiB
> > > >      Data,RAID10:             3.34TiB
> > > >      Data,RAID6[3]:         458.56GiB
> > > >      Data,RAID6[5]:         144.07GiB
> > > >      Data,RAID6[7]:         293.03GiB
> > > >      Metadata,RAID10:         4.47GiB
> > > >      Metadata,RAID10:       352.00MiB
> > > >      Metadata,RAID10:         6.00GiB
> > > >      Metadata,RAID1C3:        5.00GiB
> > > >      System,RAID1C3:         32.00MiB
> > > >      Unallocated:            85.79GiB
> > > 
> > > That works for me for RAID6. There are three lines for RAID10 too -
> > > what's the difference between these?
> > 
> > The differences is the number of the disks involved. In raid10, the
> > first 64K are on the first disk, the 2nd 64K are in the 2nd disk and
> > so until the last disk. Then the n+1 th 64K are again in the first
> > disk... and so on.. (ok I missed the RAID1 part, but I think the have
> > giving the idea )
> > 
> > So the chunk layout depends by the involved number of disk, even if
> > the differences is not so dramatic.
> 
> Is this information that the user/sysadmin needs to be aware of in a similar
> manner to the original problem that started this thread? If not I'd be
> tempted to sum all the RAID10 chunks into one line (each for data and
> metadata).

It's useful for all profiles that use striping across a variable number
of devices.  That's RAID0, RAID5, RAID6, and RAID10.  The other profiles
don't use stripes and have a fixed device count (i.e. RAID1 is always 2
disks, can never be 1 or 3), so there's no need to distinguish them in a
'dev usage' or 'fi usage' style view.

All the profiles that support variable numbers of devices are also
profiles that use striping, so the terms "stripe" and "disk" are used
interchangeably for those profiles.

> > > >     Data,RAID6:        123.45GiB
> > > >         /dev/sda     12.34GiB
> > > >         /dev/sdb     12.34GiB
> > > >         /dev/sdc     12.34GiB
> > > >     Data,RAID6:        123.45GiB
> > > >         /dev/sdb     12.34GiB
> > > >         /dev/sdc     12.34GiB
> > > >         /dev/sdd     12.34GiB
> > > >         /dev/sde     12.34GiB
> > > >         /dev/sdf     12.34GiB
> > > 
> > > Here there would need to be something which shows what the
> > > difference in the RAID6 blocks is - if it's the chunk size then I'd
> > > do the same as the above example with e.g. Data,RAID6[3].
> > 
> > We could add a '[n]' for the profile where it matters, e.g. raid0,
> > raid10, raid5, raid6.
> > What do you think ?
> 
> So like this? That would make sense to me, as long as the meaning of [n] is
> explained in --help or the manpage.
>      Data,RAID6[3]:     123.45GiB
>          /dev/sda     12.34GiB
>          /dev/sdb     12.34GiB
>          /dev/sdc     12.34GiB
>      Data,RAID6[5]:     123.45GiB
>          /dev/sdb     12.34GiB
>          /dev/sdc     12.34GiB
>          /dev/sdd     12.34GiB
>          /dev/sde     12.34GiB
>          /dev/sdf     12.34GiB

It is quite useful to know how much data is used by each _combination_
of disks.  e.g. for a 3-device RAID1 we might want to know about this:

      Data,RAID1:     124.67GiB
          /dev/sda,/dev/sdb     99.99GiB
          /dev/sda,/dev/sdc     12.34GiB
          /dev/sdb,/dev/sdc     12.34GiB

Here there are many more block groups using /dev/sda and /dev/sdb than
there are other pairs of disks.  This may cause problems if disks are
added, removed, or resized, as there may not be enough space on the
/dev/sda,/dev/sdb pair to accommodate data moved from other disks.
Balance may be required to redistribute the sda,sdb block groups onto
sda,sdc and sdb,sdc block groups.

The device breakdown above uses sorted device lists, so e.g. a block group
that uses '/dev/sdc,/dev/sda' would be sorted and counted as part of the
'/dev/sda,/dev/sdc' total.  For space-management purposes we do not care
which disk is mirror/stripe 0 and which is mirror/stripe 1; they take
up the same space either way.

In this case it may be better to separate out the device ID's.
There could be a pathological case like:

	/dev/sda,/dev/sdb,/dev/sdc,/dev/sdd,/dev/sde,/dev/sdf,/dev/sdg,/dev/sdh,/dev/sdi,/dev/sdj,/dev/sdk,/dev/sdl   12.34GB

which might be better written as:

    Device ID 1: /dev/sda
    Device ID 2: /dev/sdb
    Device ID 3: /dev/sdc
    [...]
    Device ID 11: /dev/sdk
    Device ID 12: /dev/sdl

    Data,RAID6[12]:    123.45GiB
        [1,2,3,4,5,6,7,8,9,10,11,12]    123.45GiB

    Data,RAID6[9]:    345.67GiB
        [1,2,3,4,8,9,10,11,12]    123.45GiB
        [1,2,4,5,6,9,10,11,12]    111.11GiB
        [1,2,3,4,6,8,9,11,12]     111.11GiB

If we see that, we know the 12-stripe-wide RAID6 is OK, but maybe some
of the 9-stripe-wide needs to be relocated depending on which disks are
the larger ones.  We would then run some balances with stripes= and devid=
filters e.g. to get rid of 9-stripe-wide RAID6 on devid 5.
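
A rough way to get that kind of per-combination breakdown today is to walk
the chunk tree and group chunks by their devid list.  This is only a sketch:
it assumes the text format that 'btrfs inspect-internal dump-tree' currently
prints, it counts chunks rather than bytes, and the devid lists come out in
stripe order rather than sorted, so treat the totals as approximate:

        # prints: <number of chunks>  <comma-separated devid list>
        sudo btrfs inspect-internal dump-tree -t chunk /dev/sda | awk '
            /CHUNK_ITEM/ && devs != "" { print devs; devs = "" }
            /stripe [0-9]+ devid [0-9]+/ { devs = devs (devs == "" ? "" : ",") $4 }
            END { if (devs != "") print devs }
        ' | sort | uniq -c | sort -rn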

> 
> -- 
> Steven Davies

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread

Thread overview: 13+ messages
2020-07-14 16:13 Filesystem Went Read Only During Raid-10 to Raid-6 Data Conversion John Petrini
2020-07-15  1:18 ` Zygo Blaxell
     [not found]   ` <CADvYWxcq+-Fg0W9dmc-shwszF-7sX+GDVig0GncpvwKUDPfT7g@mail.gmail.com>
     [not found]     ` <20200716042739.GB8346@hungrycats.org>
2020-07-16 13:37       ` John Petrini
     [not found]         ` <CAJix6J9kmQjfFJJ1GwWXsX7WW6QKxPqpKx86g7hgA4PfbH5Rpg@mail.gmail.com>
2020-07-16 22:57           ` Zygo Blaxell
2020-07-17  1:11             ` John Petrini
2020-07-17  5:57               ` Zygo Blaxell
2020-07-17 22:54                 ` John Petrini
2020-07-18 10:36                 ` Steven Davies
2020-07-20 17:57                   ` Goffredo Baroncelli
2020-07-21 10:15                     ` Steven Davies
2020-07-21 20:48                       ` Goffredo Baroncelli
2020-07-23  8:57                         ` Steven Davies
2020-07-23 19:29                           ` Zygo Blaxell
