From: John Petrini <john.d.petrini@gmail.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: John Petrini <me@johnpetrini.com>, linux-btrfs@vger.kernel.org
Subject: Re: Filesystem Went Read Only During Raid-10 to Raid-6 Data Conversion
Date: Fri, 17 Jul 2020 18:54:12 -0400
Message-ID: <CADvYWxeP83uQ7VHQ6y+_3yyRKnNVGBWBRsPSqBG_wHfjkeCFog@mail.gmail.com>
In-Reply-To: <20200717055706.GJ10769@hungrycats.org>

On Fri, Jul 17, 2020 at 1:57 AM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Thu, Jul 16, 2020 at 09:11:17PM -0400, John Petrini wrote:
> > On Thu, Jul 16, 2020 at 6:57 PM Zygo Blaxell
> > > That is...odd.  Try 'btrfs dev usage', maybe something weird is happening
> > > with device sizes.
> >
> > Here it is. I'm not sure what to make of it though.
> >
> > sudo btrfs dev usage /mnt/storage-array/
> > /dev/sdd, ID: 1
> >    Device size:             4.55TiB
> >    Device slack:              0.00B
> >    Data,RAID10:             3.12GiB
> >    Data,RAID10:             2.78GiB
> >    Data,RAID10:           784.31GiB
> >    Data,RAID10:             3.34TiB
> >    Data,RAID6:            144.07GiB
> >    Data,RAID6:            293.03GiB
> >    Metadata,RAID10:         4.47GiB
> >    Metadata,RAID10:       352.00MiB
> >    Unallocated:             1.02MiB
> >
> > /dev/sde, ID: 2
> >    Device size:             4.55TiB
> >    Device slack:              0.00B
> >    Data,RAID10:             3.12GiB
> >    Data,RAID10:             2.78GiB
> >    Data,RAID10:           784.31GiB
> >    Data,RAID10:             3.34TiB
> >    Data,RAID6:            144.07GiB
> >    Data,RAID6:            293.03GiB
> >    Metadata,RAID10:         4.47GiB
> >    Metadata,RAID10:       352.00MiB
> >    Unallocated:             1.02MiB
> >
> > /dev/sdl, ID: 3
> >    Device size:             3.64TiB
> >    Device slack:              0.00B
> >    Data,RAID10:             3.12GiB
> >    Data,RAID10:             3.34TiB
> >    Data,RAID6:            293.03GiB
> >    Metadata,RAID10:         4.47GiB
> >    Unallocated:             1.02MiB
> >
> > /dev/sdn, ID: 4
> >    Device size:             3.64TiB
> >    Device slack:              0.00B
> >    Data,RAID10:             3.12GiB
> >    Data,RAID10:             3.34TiB
> >    Data,RAID6:            293.03GiB
> >    Metadata,RAID10:         4.47GiB
> >    Unallocated:             1.02MiB
> >
> > /dev/sdm, ID: 5
> >    Device size:             3.64TiB
> >    Device slack:              0.00B
> >    Data,RAID10:             3.12GiB
> >    Data,RAID10:             3.34TiB
> >    Data,RAID6:            293.03GiB
> >    Metadata,RAID10:         4.47GiB
> >    Unallocated:             1.02MiB
> >
> > /dev/sdk, ID: 6
> >    Device size:             3.64TiB
> >    Device slack:              0.00B
> >    Data,RAID10:             3.12GiB
> >    Data,RAID10:             3.34TiB
> >    Data,RAID6:            293.03GiB
> >    Metadata,RAID10:         4.47GiB
> >    Unallocated:             1.02MiB
> >
> > /dev/sdj, ID: 7
> >    Device size:             3.64TiB
> >    Device slack:              0.00B
> >    Data,RAID10:             3.12GiB
> >    Data,RAID10:             3.34TiB
> >    Data,RAID6:            293.03GiB
> >    Metadata,RAID10:         4.47GiB
> >    Unallocated:             1.02MiB
> >
> > /dev/sdi, ID: 8
> >    Device size:             3.64TiB
> >    Device slack:              0.00B
> >    Data,RAID10:             3.12GiB
> >    Data,RAID10:             3.34TiB
> >    Data,RAID6:            293.03GiB
> >    Metadata,RAID10:         4.47GiB
> >    Unallocated:             1.02MiB
> >
> > /dev/sdb, ID: 9
> >    Device size:             9.10TiB
> >    Device slack:              0.00B
> >    Data,RAID10:             3.12GiB
> >    Data,RAID10:             4.01TiB
> >    Data,RAID10:           784.31GiB
> >    Data,RAID10:             3.34TiB
> >    Data,RAID6:            458.56GiB
> >    Data,RAID6:            144.07GiB
> >    Data,RAID6:            293.03GiB
> >    Metadata,RAID10:         4.47GiB
> >    Metadata,RAID10:       352.00MiB
> >    Metadata,RAID10:         6.00GiB
> >    Metadata,RAID1C3:        2.00GiB
> >    System,RAID1C3:         32.00MiB
> >    Unallocated:            82.89GiB
> >
> > /dev/sdc, ID: 10
> >    Device size:             9.10TiB
> >    Device slack:              0.00B
> >    Data,RAID10:             3.12GiB
> >    Data,RAID10:             4.01TiB
> >    Data,RAID10:           784.31GiB
> >    Data,RAID10:             3.34TiB
> >    Data,RAID6:            458.56GiB
> >    Data,RAID6:            144.07GiB
> >    Data,RAID6:            293.03GiB
> >    Metadata,RAID10:         4.47GiB
> >    Metadata,RAID10:       352.00MiB
> >    Metadata,RAID10:         6.00GiB
> >    Metadata,RAID1C3:        3.00GiB
> >    Unallocated:            81.92GiB
> >
> > /dev/sda, ID: 11
> >    Device size:             9.10TiB
> >    Device slack:              0.00B
> >    Data,RAID10:           784.31GiB
> >    Data,RAID10:             4.01TiB
> >    Data,RAID10:             3.34TiB
> >    Data,RAID6:            458.56GiB
> >    Data,RAID6:            144.07GiB
> >    Data,RAID6:            293.03GiB
> >    Metadata,RAID10:         4.47GiB
> >    Metadata,RAID10:       352.00MiB
> >    Metadata,RAID10:         6.00GiB
> >    Metadata,RAID1C3:        5.00GiB
> >    System,RAID1C3:         32.00MiB
> >    Unallocated:            85.79GiB
> >
> > /dev/sdf, ID: 12
> >    Device size:             9.10TiB
> >    Device slack:              0.00B
> >    Data,RAID10:           784.31GiB
> >    Data,RAID10:             4.01TiB
> >    Data,RAID10:             3.34TiB
> >    Data,RAID6:            458.56GiB
> >    Data,RAID6:            144.07GiB
> >    Data,RAID6:            293.03GiB
> >    Metadata,RAID10:         4.47GiB
> >    Metadata,RAID10:       352.00MiB
> >    Metadata,RAID10:         6.00GiB
> >    Metadata,RAID1C3:        5.00GiB
> >    System,RAID1C3:         32.00MiB
> >    Unallocated:            85.79GiB
>
> OK...slack is 0, so there wasn't anything weird with underlying device
> sizes going on.
>
> There's 3 entries for "Data,RAID6" because there are three stripe widths:
> 12 disks, 6 disks, and 4 disks, corresponding to the number of disks of
> each size.  Unfortunately 'dev usage' doesn't say which one is which.
>
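
That explains the multiple Data,RAID6 lines. For anyone following along,
I think something like this would show how many chunks exist at each
stripe width (just my reading of the dump-tree output -- there may well
be a cleaner way):

        # count chunks by stripe width; each CHUNK_ITEM has a num_stripes line
        sudo btrfs inspect-internal dump-tree -t chunk /dev/sdd \
                | grep num_stripes | sort | uniq -c
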
> > Wow looks like I've got lots of info to mull over here! I kicked off
> > another convert already after cleaning up quite a bit more space. I
> > had over 100G unallocated on each device after deleting some data and
> > running another balance.
>
> If you did balances with no unallocated space on the small drives, then
> the block groups created by those balances are the first block groups
> to be processed by later balances.  These block groups will be narrow
> so they'll use space less efficiently.  We want the opposite of that.
>

There was unallocated space on all drives before I started this recent
balance. So far it's still chugging along at about 20% complete, but I
assume that even if it does complete successfully I'll be stuck with some
narrow stripes from the first attempt.
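
If that does happen, my thinking is that I can target just the narrow
block groups afterwards with the stripes filter (adapting your step 4
below -- assuming I have the syntax right):

        # rewrite only the narrow RAID6 block groups so they re-stripe wider
        sudo btrfs balance start -dprofiles=raid6,stripes=1..5 /mnt/storage-array/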

> > I'm tempted to let it run and see if it
> > succeeds but my unallocated space has already dropped off a cliff with
> > 95% of the rebalance remaining.
>
> This is why the devid/stripes filters are important.  Also I noticed
> that my logic in my previous reply was wrong for this case:  we do want
> to process the smallest disks first, not the largest ones, because that
> way we guarantee we always increase unallocated space.
>
> If we convert a 10-disk-wide block group from RAID10 to 4-disk-wide
> RAID6, we replace 2 chunks on 10 disks with 5 chunks on 4 disks:
>
>         2 RAID10 block groups:          5 RAID6 block groups:
>         sda #1 data1                    sda #1 data1
>         sdb #1 mirror1                  sdb #1 data2
>         sdc #1 data2                    sdc #1 P1
>         sdd #1 mirror2                  sdf #1 Q1
>         sde #1 data3                    sda #2 data3
>         sdf #1 mirror3                  sdb #2 data4
>         sdg #1 data4                    sdc #2 P2
>         sdh #1 mirror4                  sdf #2 Q2
>         sdi #1 data5                    sda #3 data5
>         sdj #1 mirror5                  sdb #3 data6
>         sda #2 data6                    sdc #3 P3
>         sdb #2 mirror6                  sdf #3 Q3
>         sdc #2 data7                    sda #4 data7
>         sdd #2 mirror7                  sdb #4 data8
>         sde #2 data8                    sdc #4 P4
>         sdf #2 mirror8                  sdf #4 Q4
>         sdg #2 data9                    sda #5 data9
>         sdh #2 mirror9                  sdb #5 data10
>         sdi #2 data10                   sdc #5 P5
>         sdj #2 mirror10                 sdf #5 Q5
>
> When this happens we lose net 3GB of space on each of the 4 largest disks
> for every 1GB we gain on the 6 smaller disks, and run out of space part
> way through the balance.  We will have to make this tradeoff at some
> point in the balance because of the disk sizes, but it's important that
> it happens at the very end, after all other possible conversion is done
> and the maximum amount of unallocated space is generated.
>
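
If I'm doing the accounting right from your chunk counts (taking the
usual 1GiB data chunks):

        10 disks x 2 chunks (RAID10) = 20GiB raw = 10GiB of data
         4 disks x 5 chunks (RAID6)  = 20GiB raw = 10GiB of data

so each of the 6 disks that drop out frees 2GiB, while each of the 4
disks left holding the RAID6 chunks has to absorb an extra 3GiB.
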
> btrfs balance isn't smart enough to do this by itself, which is why it's
> 20 commands with filter parameters to get complex arrays reshaped, and
> there are sometimes multiple passes.
>
> We want to relocate a 10-disk-wide block group from RAID10 to 10-disk-wide
> RAID6, replacing 8 chunks on 10 disks with 5 chunks on 10 disks:
>
>         8 RAID10 block groups:          5 RAID6 block groups:
>         sda #1 data1                    sda #1 data1
>         sdb #1 mirror1                  sdb #1 data2
>         sdc #1 data2                    sdc #1 data3
>         sdd #1 mirror2                  sdd #1 data4
>         sde #1 data3                    sde #1 data5
>         sdf #1 mirror3                  sdf #1 data6
>         sdg #1 data4                    sdg #1 data7
>         sdh #1 mirror4                  sdh #1 data8
>         sdi #1 data5                    sdi #1 P1
>         sdj #1 mirror5                  sdj #1 Q1
>         sda #2 data6                    sda #2 data9
>         sdb #2 mirror6                  sdb #2 data10
>         sdc #2 data7                    sdc #2 data11
>         sdd #2 mirror7                  sdd #2 data12
>         sde #2 data8                    sde #2 data13
>         sdf #2 mirror8                  sdf #2 data14
>         sdg #2 data9                    sdg #2 data15
>         sdh #2 mirror9                  sdh #2 data16
>         sdi #2 data10                   sdi #2 P2
>         sdj #2 mirror10                 sdj #2 Q2
>         ...etc there are 40GB of data
>
> The easiest way to do that is:
>
>         for sc in 12 11 10 9 8 7 6 5 4; do
>                 btrfs balance start -dconvert=raid6,stripes=$sc..$sc,soft -mconvert=raid1c3,soft /mnt/storage-array/
>         done
>
> The above converts the widest block groups first, so that every block
> group converted results in a net increase in storage efficiency, and
> creates unallocated space on as many disks as possible.
>
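
Makes sense -- widest first, so every pass nets some space back. While
each pass of the loop runs I can check on it with (I think this is the
right invocation):

        sudo btrfs balance status -v /mnt/storage-array/
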
> Then the next step from my original list, edited with the device
> IDs and sizes from dev usage, is the optimization step.  I filled
> in the device IDs and sizes from your 'dev usage' output:
>
> > >         4.  balance -dstripes=1..5,devid=1  # sdd, 4.55TB
> > >             balance -dstripes=1..5,devid=2  # sde, 4.55TB
> > >             balance -dstripes=1..11,devid=3 # sdl, 3.64TB
> > >             balance -dstripes=1..11,devid=4 # sdn, 3.64TB
> > >             balance -dstripes=1..11,devid=5 # sdm, 3.64TB
> > >             balance -dstripes=1..11,devid=6 # sdk, 3.64TB
> > >             balance -dstripes=1..11,devid=7 # sdj, 3.64TB
> > >             balance -dstripes=1..11,devid=8 # sdi, 3.64TB
> > >             balance -dstripes=1..3,devid=9  # sdb, 9.10TB
> > >             balance -dstripes=1..3,devid=10 # sdc, 9.10TB
> > >             balance -dstripes=1..3,devid=11 # sda, 9.10TB
> > >             balance -dstripes=1..3,devid=12 # sdf, 9.10TB
>
> This ensures that each disk is a member of an optimum width block
> group for the disk size.
>
> Note: I'm not sure about the 1..11.  IIRC the btrfs limit is 10 disks
> per stripe, so you might want to use 1..9 if it seems to be trying
> to rebalance everything with 1..11.
>
> Running 'watch btrfs fi usage /mnt/storage-array' while balance runs
> can be enlightening.

Thanks so much for all this detail. I'll see how this run goes, and if
it gets stuck again I'll try your strategy of converting to RAID1 to
get back some unallocated space. Otherwise, if it completes
successfully, I'll go ahead with optimizing the striping and let you
know how it goes.
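
For my own reference, here's the sequence I'm planning once this balance
finishes (or after falling back to RAID1 to free up space), adapted from
your commands above -- shout if I've mangled anything:

        # convert the widest block groups first, then progressively narrower
        for sc in 12 11 10 9 8 7 6 5 4; do
                sudo btrfs balance start -dconvert=raid6,stripes=$sc..$sc,soft \
                        -mconvert=raid1c3,soft /mnt/storage-array/
        done

        # then the per-device optimization pass (your step 4), e.g.
        sudo btrfs balance start -dstripes=1..5,devid=1 /mnt/storage-array/
        sudo btrfs balance start -dstripes=1..5,devid=2 /mnt/storage-array/
        # ...and so on through devid=12 with the ranges you listed

        # and keep an eye on progress from another terminal
        watch btrfs fi usage /mnt/storage-array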



-- 
---------------------------------------
John Petrini
