From: John Petrini <john.d.petrini@gmail.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: John Petrini <me@johnpetrini.com>, linux-btrfs@vger.kernel.org
Subject: Re: Filesystem Went Read Only During Raid-10 to Raid-6 Data Conversion
Date: Fri, 17 Jul 2020 18:54:12 -0400 [thread overview]
Message-ID: <CADvYWxeP83uQ7VHQ6y+_3yyRKnNVGBWBRsPSqBG_wHfjkeCFog@mail.gmail.com> (raw)
In-Reply-To: <20200717055706.GJ10769@hungrycats.org>
On Fri, Jul 17, 2020 at 1:57 AM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Thu, Jul 16, 2020 at 09:11:17PM -0400, John Petrini wrote:
> > On Thu, Jul 16, 2020 at 6:57 PM Zygo Blaxell
> > > That is...odd. Try 'btrfs dev usage', maybe something weird is happening
> > > with device sizes.
> >
> > Here it is. I'm not sure what to make of it though.
> >
> > sudo btrfs dev usage /mnt/storage-array/
> > /dev/sdd, ID: 1
> > Device size: 4.55TiB
> > Device slack: 0.00B
> > Data,RAID10: 3.12GiB
> > Data,RAID10: 2.78GiB
> > Data,RAID10: 784.31GiB
> > Data,RAID10: 3.34TiB
> > Data,RAID6: 144.07GiB
> > Data,RAID6: 293.03GiB
> > Metadata,RAID10: 4.47GiB
> > Metadata,RAID10: 352.00MiB
> > Unallocated: 1.02MiB
> >
> > /dev/sde, ID: 2
> > Device size: 4.55TiB
> > Device slack: 0.00B
> > Data,RAID10: 3.12GiB
> > Data,RAID10: 2.78GiB
> > Data,RAID10: 784.31GiB
> > Data,RAID10: 3.34TiB
> > Data,RAID6: 144.07GiB
> > Data,RAID6: 293.03GiB
> > Metadata,RAID10: 4.47GiB
> > Metadata,RAID10: 352.00MiB
> > Unallocated: 1.02MiB
> >
> > /dev/sdl, ID: 3
> > Device size: 3.64TiB
> > Device slack: 0.00B
> > Data,RAID10: 3.12GiB
> > Data,RAID10: 3.34TiB
> > Data,RAID6: 293.03GiB
> > Metadata,RAID10: 4.47GiB
> > Unallocated: 1.02MiB
> >
> > /dev/sdn, ID: 4
> > Device size: 3.64TiB
> > Device slack: 0.00B
> > Data,RAID10: 3.12GiB
> > Data,RAID10: 3.34TiB
> > Data,RAID6: 293.03GiB
> > Metadata,RAID10: 4.47GiB
> > Unallocated: 1.02MiB
> >
> > /dev/sdm, ID: 5
> > Device size: 3.64TiB
> > Device slack: 0.00B
> > Data,RAID10: 3.12GiB
> > Data,RAID10: 3.34TiB
> > Data,RAID6: 293.03GiB
> > Metadata,RAID10: 4.47GiB
> > Unallocated: 1.02MiB
> >
> > /dev/sdk, ID: 6
> > Device size: 3.64TiB
> > Device slack: 0.00B
> > Data,RAID10: 3.12GiB
> > Data,RAID10: 3.34TiB
> > Data,RAID6: 293.03GiB
> > Metadata,RAID10: 4.47GiB
> > Unallocated: 1.02MiB
> >
> > /dev/sdj, ID: 7
> > Device size: 3.64TiB
> > Device slack: 0.00B
> > Data,RAID10: 3.12GiB
> > Data,RAID10: 3.34TiB
> > Data,RAID6: 293.03GiB
> > Metadata,RAID10: 4.47GiB
> > Unallocated: 1.02MiB
> >
> > /dev/sdi, ID: 8
> > Device size: 3.64TiB
> > Device slack: 0.00B
> > Data,RAID10: 3.12GiB
> > Data,RAID10: 3.34TiB
> > Data,RAID6: 293.03GiB
> > Metadata,RAID10: 4.47GiB
> > Unallocated: 1.02MiB
> >
> > /dev/sdb, ID: 9
> > Device size: 9.10TiB
> > Device slack: 0.00B
> > Data,RAID10: 3.12GiB
> > Data,RAID10: 4.01TiB
> > Data,RAID10: 784.31GiB
> > Data,RAID10: 3.34TiB
> > Data,RAID6: 458.56GiB
> > Data,RAID6: 144.07GiB
> > Data,RAID6: 293.03GiB
> > Metadata,RAID10: 4.47GiB
> > Metadata,RAID10: 352.00MiB
> > Metadata,RAID10: 6.00GiB
> > Metadata,RAID1C3: 2.00GiB
> > System,RAID1C3: 32.00MiB
> > Unallocated: 82.89GiB
> >
> > /dev/sdc, ID: 10
> > Device size: 9.10TiB
> > Device slack: 0.00B
> > Data,RAID10: 3.12GiB
> > Data,RAID10: 4.01TiB
> > Data,RAID10: 784.31GiB
> > Data,RAID10: 3.34TiB
> > Data,RAID6: 458.56GiB
> > Data,RAID6: 144.07GiB
> > Data,RAID6: 293.03GiB
> > Metadata,RAID10: 4.47GiB
> > Metadata,RAID10: 352.00MiB
> > Metadata,RAID10: 6.00GiB
> > Metadata,RAID1C3: 3.00GiB
> > Unallocated: 81.92GiB
> >
> > /dev/sda, ID: 11
> > Device size: 9.10TiB
> > Device slack: 0.00B
> > Data,RAID10: 784.31GiB
> > Data,RAID10: 4.01TiB
> > Data,RAID10: 3.34TiB
> > Data,RAID6: 458.56GiB
> > Data,RAID6: 144.07GiB
> > Data,RAID6: 293.03GiB
> > Metadata,RAID10: 4.47GiB
> > Metadata,RAID10: 352.00MiB
> > Metadata,RAID10: 6.00GiB
> > Metadata,RAID1C3: 5.00GiB
> > System,RAID1C3: 32.00MiB
> > Unallocated: 85.79GiB
> >
> > /dev/sdf, ID: 12
> > Device size: 9.10TiB
> > Device slack: 0.00B
> > Data,RAID10: 784.31GiB
> > Data,RAID10: 4.01TiB
> > Data,RAID10: 3.34TiB
> > Data,RAID6: 458.56GiB
> > Data,RAID6: 144.07GiB
> > Data,RAID6: 293.03GiB
> > Metadata,RAID10: 4.47GiB
> > Metadata,RAID10: 352.00MiB
> > Metadata,RAID10: 6.00GiB
> > Metadata,RAID1C3: 5.00GiB
> > System,RAID1C3: 32.00MiB
> > Unallocated: 85.79GiB
>
> OK...slack is 0, so there wasn't anything weird with underlying device
> sizes going on.
>
> There's 3 entries for "Data,RAID6" because there are three stripe widths:
> 12 disks, 6 disks, and 4 disks, corresponding to the number of disks of
> each size. Unfortunately 'dev usage' doesn't say which one is which.
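For context, those three widths have very different storage efficiency. A quick shell sketch of the arithmetic (not btrfs output; RAID6 spends 2 of every n chunks on parity, so the usable fraction is (n-2)/n):

```shell
# Usable data fraction for each RAID6 stripe width present in this array.
for width in 12 6 4; do
    awk -v n="$width" 'BEGIN { printf "%2d-disk RAID6: %.0f%% usable\n", n, 100*(n-2)/n }'
done
```

So the 12-wide block groups are far more space-efficient than the 4-wide ones, which is why the order of conversion below matters.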
>
> > Wow looks like I've got lots of info to mull over here! I kicked off
> > another convert already after cleaning up quite a bit more space. I
> > had over 100G unallocated on each device after deleting some data and
> > running another balance.
>
> If you did balances with no unallocated space on the small drives, then
> the block groups created by those balances are the first block groups
> to be processed by later balances. These block groups will be narrow
> so they'll use space less efficiently. We want the opposite of that.
>
There was unallocated space on all drives before I started this recent
balance. So far it's still chugging along at about 20% complete, but I
assume that even if it does complete successfully I'll be stuck with
some narrow stripes from the first attempt.
> > I'm tempted to let it run and see if it
> > succeeds but my unallocated space has already dropped off a cliff with
> > 95% of the rebalance remaining.
>
> This is why the devid/stripes filters are important. Also I noticed
> that my logic in my previous reply was wrong for this case: we do want
> to process the smallest disks first, not the largest ones, because that
> way we guarantee we always increase unallocated space.
>
> If we convert a 10-disk-wide block group from RAID10 to 4-disk-wide
> RAID6, we replace 2 chunks on 10 disks with 5 chunks on 4 disks:
>
> 2 RAID10 block groups: 5 RAID6 block groups:
> sda #1 data1 sda #1 data1
> sdb #1 mirror1 sdb #1 data2
> sdc #1 data2 sdc #1 P1
> sdd #1 mirror2 sdf #1 Q1
> sde #1 data3 sda #2 data3
> sdf #1 mirror3 sdb #2 data4
> sdg #1 data4 sdc #2 P2
> sdh #1 mirror4 sdf #2 Q2
> sdi #1 data5 sda #3 data5
> sdj #1 mirror5 sdb #3 data6
> sda #2 data6 sdc #3 P3
> sdb #2 mirror6 sdf #3 Q3
> sdc #2 data7 sda #4 data7
> sdd #2 mirror7 sdb #4 data8
> sde #2 data8 sdc #4 P4
> sdf #2 mirror8 sdf #4 Q4
> sdg #2 data9 sda #5 data9
> sdh #2 mirror9 sdb #5 data10
> sdi #2 data10 sdc #5 P5
> sdj #2 mirror10 sdf #5 Q5
>
> When this happens we lose net 3GB of space on each of the 4 largest disks
> for every 1GB we gain on the 6 smaller disks, and run out of space part
> way through the balance. We will have to make this tradeoff at some
> point in the balance because of the disk sizes, but it's important that
> it happens at the very end, after all other possible conversion is done
> and the maximum amount of unallocated space is generated.
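The per-disk accounting in that example can be sketched with a one-liner (pure arithmetic, assuming the usual 1GiB data chunks):

```shell
# 2 RAID10 groups across 10 disks hold 10GiB of data (half the chunks
# are mirrors). Rewriting that data as 4-disk-wide RAID6 (2 of the 4
# chunks per group are parity) piles 5 chunks onto each wide disk.
awk 'BEGIN {
    data   = 10               # GiB of data in the 2 RAID10 groups
    before = 2                # RAID10 chunks allocated per disk (all 10 disks)
    after  = data / (4 - 2)   # RAID6 chunks per disk (4 disks, 2 parity)
    printf "each wide disk:   %d -> %d chunks (net +%dGiB allocated)\n", before, after, after - before
    printf "each narrow disk: %d -> 0 chunks (net -%dGiB freed)\n", before, before
}'
```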
>
> btrfs balance isn't smart enough to do this by itself, which is why
> reshaping a complex array takes 20 commands with filter parameters,
> and sometimes multiple passes.
>
> We want to relocate a 10-disk-wide block group from RAID10 to 10-disk-wide
> RAID6, replacing 8 chunks on 10 disks with 5 chunks on 10 disks:
>
> 8 RAID10 block groups: 5 RAID6 block groups:
> sda #1 data1 sda #1 data1
> sdb #1 mirror1 sdb #1 data2
> sdc #1 data2 sdc #1 data3
> sdd #1 mirror2 sdd #1 data4
> sde #1 data3 sde #1 data5
> sdf #1 mirror3 sdf #1 data6
> sdg #1 data4 sdg #1 data7
> sdh #1 mirror4 sdh #1 data8
> sdi #1 data5 sdi #1 P1
> sdj #1 mirror5 sdj #1 Q1
> sda #2 data6 sda #2 data9
> sdb #2 mirror6 sdb #2 data10
> sdc #2 data7 sdc #2 data11
> sdd #2 mirror7 sdd #2 data12
> sde #2 data8 sde #2 data13
> sdf #2 mirror8 sdf #2 data14
> sdg #2 data9 sdg #2 data15
> sdh #2 mirror9 sdh #2 data16
> sdi #2 data10 sdi #2 P2
> sdj #2 mirror10 sdj #2 Q2
> ...etc there are 40GB of data
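The totals in that layout can be sanity-checked with a quick calculation (again assuming 1GiB data chunks): both layouts hold the same 40GiB.

```shell
# 8 RAID10 groups of 10 chunks (half mirrors) vs 5 ten-wide RAID6
# groups of 10 chunks (2 parity each) -- same data capacity.
awk 'BEGIN {
    raid10 = 8 * (10 / 2)    # 8 groups, 10 chunks each, half are mirrors
    raid6  = 5 * (10 - 2)    # 5 groups, 10 chunks each, 2 are parity
    printf "RAID10: %dGiB  RAID6: %dGiB\n", raid10, raid6
}'
```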
>
> The easiest way to do that is:
>
> for sc in 12 11 10 9 8 7 6 5 4; do
> btrfs balance start -dconvert=raid6,stripes=$sc..$sc,soft -mconvert=raid1c3,soft /mnt/storage-array/
> done
>
> The above converts the widest block groups first, so that every block
> group converted results in a net increase in storage efficiency, and
> creates unallocated space on as many disks as possible.
>
> Then the next step from my original list is the optimization step.
> I filled in the device IDs and sizes from your 'dev usage' output:
>
> > > 4. balance -dstripes=1..5,devid=1 # sdd, 4.55TB
> > > balance -dstripes=1..5,devid=2 # sde, 4.55TB
> > > balance -dstripes=1..11,devid=3 # sdl, 3.64TB
> > > balance -dstripes=1..11,devid=4 # sdn, 3.64TB
> > > balance -dstripes=1..11,devid=5 # sdm, 3.64TB
> > > balance -dstripes=1..11,devid=6 # sdk, 3.64TB
> > > balance -dstripes=1..11,devid=7 # sdj, 3.64TB
> > > balance -dstripes=1..11,devid=8 # sdi, 3.64TB
> > > balance -dstripes=1..3,devid=9 # sdb, 9.10TB
> > > balance -dstripes=1..3,devid=10 # sdc, 9.10TB
> > > balance -dstripes=1..3,devid=11 # sda, 9.10TB
> > > balance -dstripes=1..3,devid=12 # sdf, 9.10TB
>
> This ensures that each disk is a member of an optimum width block
> group for the disk size.
>
> Note: I'm not sure about the 1..11. IIRC the btrfs limit is 10 disks
> per stripe, so you might want to use 1..9 if it seems to be trying
> to rebalance everything with 1..11.
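If it helps, those twelve per-device commands can be generated with a small loop. This sketch only *prints* the commands so they can be eyeballed (and the 1..11 range adjusted if needed) before running anything; the devids and stripe ranges are copied from the 'dev usage' output above.

```shell
# Dry-run generator for the per-device optimization balances.
# Keys are btrfs devids; values are the stripes filter range for that
# disk size (4.55TB -> 1..5, 3.64TB -> 1..11, 9.10TB -> 1..3).
declare -A stripes=(
    [1]=1..5  [2]=1..5
    [3]=1..11 [4]=1..11 [5]=1..11 [6]=1..11 [7]=1..11 [8]=1..11
    [9]=1..3  [10]=1..3 [11]=1..3 [12]=1..3
)
for devid in $(printf '%s\n' "${!stripes[@]}" | sort -n); do
    echo btrfs balance start -dstripes=${stripes[$devid]},devid=$devid /mnt/storage-array/
done
```

Dropping the `echo` would run the balances for real, one device at a time.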
>
> Running 'watch btrfs fi usage /mnt/storage-array' while balance runs
> can be enlightening.
Thanks so much for all this detail. I'll see how this run goes, and if
it gets stuck again I'll try your strategy of converting to RAID-1 to
get back some unallocated space. Otherwise, if this completes
successfully, I'll go ahead with optimizing the striping and let you
know how it goes.
--
---------------------------------------
John Petrini