From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 17 Jul 2020 01:57:06 -0400
From: Zygo Blaxell
To: John Petrini
Cc: John Petrini, linux-btrfs@vger.kernel.org
Subject: Re: Filesystem Went Read Only During Raid-10 to Raid-6 Data Conversion
Message-ID: <20200717055706.GJ10769@hungrycats.org>
References: <20200715011843.GH10769@hungrycats.org> <20200716042739.GB8346@hungrycats.org> <20200716225731.GI10769@hungrycats.org>
In-Reply-To:
User-Agent: Mutt/1.10.1 (2018-07-13)

On Thu, Jul 16, 2020 at 09:11:17PM -0400, John Petrini wrote:
> On Thu, Jul 16, 2020 at 6:57 PM Zygo Blaxell
> > That is...odd. Try 'btrfs dev usage', maybe something weird is happening
> > with device sizes.
>
> Here it is. I'm not sure what to make of it though.
>
> sudo btrfs dev usage /mnt/storage-array/
> /dev/sdd, ID: 1
>    Device size:             4.55TiB
>    Device slack:              0.00B
>    Data,RAID10:             3.12GiB
>    Data,RAID10:             2.78GiB
>    Data,RAID10:           784.31GiB
>    Data,RAID10:             3.34TiB
>    Data,RAID6:            144.07GiB
>    Data,RAID6:            293.03GiB
>    Metadata,RAID10:         4.47GiB
>    Metadata,RAID10:       352.00MiB
>    Unallocated:             1.02MiB
>
> /dev/sde, ID: 2
>    Device size:             4.55TiB
>    Device slack:              0.00B
>    Data,RAID10:             3.12GiB
>    Data,RAID10:             2.78GiB
>    Data,RAID10:           784.31GiB
>    Data,RAID10:             3.34TiB
>    Data,RAID6:            144.07GiB
>    Data,RAID6:            293.03GiB
>    Metadata,RAID10:         4.47GiB
>    Metadata,RAID10:       352.00MiB
>    Unallocated:             1.02MiB
>
> /dev/sdl, ID: 3
>    Device size:             3.64TiB
>    Device slack:              0.00B
>    Data,RAID10:             3.12GiB
>    Data,RAID10:             3.34TiB
>    Data,RAID6:            293.03GiB
>    Metadata,RAID10:         4.47GiB
>    Unallocated:             1.02MiB
>
> /dev/sdn, ID: 4
>    Device size:             3.64TiB
>    Device slack:              0.00B
>    Data,RAID10:             3.12GiB
>    Data,RAID10:             3.34TiB
>    Data,RAID6:            293.03GiB
>    Metadata,RAID10:         4.47GiB
>    Unallocated:             1.02MiB
>
> /dev/sdm, ID: 5
>    Device size:             3.64TiB
>    Device slack:              0.00B
>    Data,RAID10:             3.12GiB
>    Data,RAID10:             3.34TiB
>    Data,RAID6:            293.03GiB
>    Metadata,RAID10:         4.47GiB
>    Unallocated:             1.02MiB
>
> /dev/sdk, ID: 6
>    Device size:             3.64TiB
>    Device slack:              0.00B
>    Data,RAID10:             3.12GiB
>    Data,RAID10:             3.34TiB
>    Data,RAID6:            293.03GiB
>    Metadata,RAID10:         4.47GiB
>    Unallocated:             1.02MiB
>
> /dev/sdj, ID: 7
>    Device size:             3.64TiB
>    Device slack:              0.00B
>    Data,RAID10:             3.12GiB
>    Data,RAID10:             3.34TiB
>    Data,RAID6:            293.03GiB
>    Metadata,RAID10:         4.47GiB
>    Unallocated:             1.02MiB
>
> /dev/sdi, ID: 8
>    Device size:             3.64TiB
>    Device slack:              0.00B
>    Data,RAID10:             3.12GiB
>    Data,RAID10:             3.34TiB
>    Data,RAID6:            293.03GiB
>    Metadata,RAID10:         4.47GiB
>    Unallocated:             1.02MiB
>
> /dev/sdb, ID: 9
>    Device size:             9.10TiB
>    Device slack:              0.00B
>    Data,RAID10:             3.12GiB
>    Data,RAID10:             4.01TiB
>    Data,RAID10:           784.31GiB
>    Data,RAID10:             3.34TiB
>    Data,RAID6:            458.56GiB
>    Data,RAID6:            144.07GiB
>    Data,RAID6:            293.03GiB
>    Metadata,RAID10:         4.47GiB
>    Metadata,RAID10:       352.00MiB
>    Metadata,RAID10:         6.00GiB
>    Metadata,RAID1C3:        2.00GiB
>    System,RAID1C3:         32.00MiB
>    Unallocated:            82.89GiB
>
> /dev/sdc, ID: 10
>    Device size:             9.10TiB
>    Device slack:              0.00B
>    Data,RAID10:             3.12GiB
>    Data,RAID10:             4.01TiB
>    Data,RAID10:           784.31GiB
>    Data,RAID10:             3.34TiB
>    Data,RAID6:            458.56GiB
>    Data,RAID6:            144.07GiB
>    Data,RAID6:            293.03GiB
>    Metadata,RAID10:         4.47GiB
>    Metadata,RAID10:       352.00MiB
>    Metadata,RAID10:         6.00GiB
>    Metadata,RAID1C3:        3.00GiB
>    Unallocated:            81.92GiB
>
> /dev/sda, ID: 11
>    Device size:             9.10TiB
>    Device slack:              0.00B
>    Data,RAID10:           784.31GiB
>    Data,RAID10:             4.01TiB
>    Data,RAID10:             3.34TiB
>    Data,RAID6:            458.56GiB
>    Data,RAID6:            144.07GiB
>    Data,RAID6:            293.03GiB
>    Metadata,RAID10:         4.47GiB
>    Metadata,RAID10:       352.00MiB
>    Metadata,RAID10:         6.00GiB
>    Metadata,RAID1C3:        5.00GiB
>    System,RAID1C3:         32.00MiB
>    Unallocated:            85.79GiB
>
> /dev/sdf, ID: 12
>    Device size:             9.10TiB
>    Device slack:              0.00B
>    Data,RAID10:           784.31GiB
>    Data,RAID10:             4.01TiB
>    Data,RAID10:             3.34TiB
>    Data,RAID6:            458.56GiB
>    Data,RAID6:            144.07GiB
>    Data,RAID6:            293.03GiB
>    Metadata,RAID10:         4.47GiB
>    Metadata,RAID10:       352.00MiB
>    Metadata,RAID10:         6.00GiB
>    Metadata,RAID1C3:        5.00GiB
>    System,RAID1C3:         32.00MiB
>    Unallocated:            85.79GiB

OK...slack is 0, so there wasn't anything weird with underlying device
sizes going on.

There's 3 entries for "Data,RAID6" because there are three stripe widths:
12 disks, 6 disks, and 4 disks, corresponding to the number of disks of
each size. Unfortunately 'dev usage' doesn't say which one is which.
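
If you want to know which stripe widths those three Data,RAID6 entries
correspond to, one way is to count RAID6 data chunks by num_stripes in
the chunk tree. This is only a rough sketch: the dump-tree output layout
and field names can vary a bit between btrfs-progs versions, so adjust
the awk to match what you actually see (and point it at any one device
in the filesystem; use '-t 3' if your progs doesn't take the tree name):

        # count RAID6 data chunks grouped by stripe width
        sudo btrfs inspect-internal dump-tree -t chunk /dev/sdd |
            awk '/type DATA[|]RAID6/ { want = 1; next }
                 want && /num_stripes/ { print $2; want = 0 }' |
            sort -n | uniq -c

Each output line is a chunk count followed by a num_stripes value, and
num_stripes includes the two parity stripes, so a 12-disk-wide RAID6
block group shows up as 12.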
> Wow looks like I've got lots of info to mull over here! I kicked off
> another convert already after cleaning up quite a bit more space. I
> had over 100G unallocated on each device after deleting some data and
> running another balance.

If you did balances with no unallocated space on the small drives, then
the block groups created by those balances are the first block groups to
be processed by later balances. These block groups will be narrow, so
they'll use space less efficiently. We want the opposite of that.

> I'm tempted to let it run and see if it
> succeeds but my unallocated space has already dropped off a cliff with
> 95% of the rebalance remaining.

This is why the devid/stripes filters are important.

Also I noticed that my logic in my previous reply was wrong for this
case: we do want to process the smallest disks first, not the largest
ones, because that way we guarantee we always increase unallocated space.

If we convert a 10-disk-wide block group from RAID10 to 4-disk-wide
RAID6, we replace 2 chunks on 10 disks with 5 chunks on 4 disks:

        2 RAID10 block groups:          5 RAID6 block groups:

        sda #1 data1                    sda #1 data1
        sdb #1 mirror1                  sdb #1 data2
        sdc #1 data2                    sdc #1 P1
        sdd #1 mirror2                  sdf #1 Q1
        sde #1 data3                    sda #2 data3
        sdf #1 mirror3                  sdb #2 data4
        sdg #1 data4                    sdc #2 P2
        sdh #1 mirror4                  sdf #2 Q2
        sdi #1 data5                    sda #3 data5
        sdj #1 mirror5                  sdb #3 data6
        sda #2 data6                    sdc #3 P3
        sdb #2 mirror6                  sdf #3 Q3
        sdc #2 data7                    sda #4 data7
        sdd #2 mirror7                  sdb #4 data8
        sde #2 data8                    sdc #4 P4
        sdf #2 mirror8                  sdf #4 Q4
        sdg #2 data9                    sda #5 data9
        sdh #2 mirror9                  sdb #5 data10
        sdi #2 data10                   sdc #5 P5
        sdj #2 mirror10                 sdf #5 Q5

When this happens we lose a net 3GB of space on each of the 4 largest
disks for every 1GB we gain on the 6 smaller disks, and run out of space
part way through the balance. We will have to make this tradeoff at some
point in the balance because of the disk sizes, but it's important that
it happens at the very end, after all other possible conversion is done
and the maximum amount of unallocated space is generated. btrfs balance
isn't smart enough to do this by itself, which is why it takes 20
commands with filter parameters to reshape complex arrays, and sometimes
multiple passes.

What we want instead is to relocate a 10-disk-wide block group from
RAID10 to 10-disk-wide RAID6, replacing 8 chunks on 10 disks with 5
chunks on 10 disks:

        8 RAID10 block groups:          5 RAID6 block groups:

        sda #1 data1                    sda #1 data1
        sdb #1 mirror1                  sdb #1 data2
        sdc #1 data2                    sdc #1 data3
        sdd #1 mirror2                  sdd #1 data4
        sde #1 data3                    sde #1 data5
        sdf #1 mirror3                  sdf #1 data6
        sdg #1 data4                    sdg #1 data7
        sdh #1 mirror4                  sdh #1 data8
        sdi #1 data5                    sdi #1 P1
        sdj #1 mirror5                  sdj #1 Q1
        sda #2 data6                    sda #2 data9
        sdb #2 mirror6                  sdb #2 data10
        sdc #2 data7                    sdc #2 data11
        sdd #2 mirror7                  sdd #2 data12
        sde #2 data8                    sde #2 data13
        sdf #2 mirror8                  sdf #2 data14
        sdg #2 data9                    sdg #2 data15
        sdh #2 mirror9                  sdh #2 data16
        sdi #2 data10                   sdi #2 P2
        sdj #2 mirror10                 sdj #2 Q2

        ...etc (there are 40GB of data)

The easiest way to do that is:

        for sc in 12 11 10 9 8 7 6 5 4; do
                btrfs balance start -dconvert=raid6,stripes=$sc..$sc,soft \
                        -mconvert=raid1c3,soft /mnt/storage-array/
        done

The above converts the widest block groups first, so that every block
group converted results in a net increase in storage efficiency, and
creates unallocated space on as many disks as possible.
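
If you want a bit of a safety net, you could wrap that loop with a check
on unallocated space between passes. This is just a sketch: the 10GiB
margin is an arbitrary number, and it assumes 'btrfs dev usage -b'
prints one "Unallocated:" line per device with the raw byte count in the
second column, so compare against your own output before trusting it:

        min_free=$((10 * 1024 * 1024 * 1024))   # arbitrary 10GiB safety margin
        for sc in 12 11 10 9 8 7 6 5 4; do
            # count devices with less than min_free bytes unallocated
            low=$(sudo btrfs dev usage -b /mnt/storage-array/ |
                  awk -v min="$min_free" '/Unallocated:/ && $2 + 0 < min + 0' |
                  wc -l)
            if [ "$low" -gt 0 ]; then
                echo "$low devices are low on unallocated space, stopping" >&2
                break
            fi
            sudo btrfs balance start -dconvert=raid6,stripes=$sc..$sc,soft \
                -mconvert=raid1c3,soft /mnt/storage-array/
        done

It only checks between passes, so a single pass can still eat more space
than expected, but at least it won't start a new pass on a nearly full
array.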
Then the next step from my original list is the optimization step. I've
filled in the device IDs and sizes from your 'dev usage' output:

> > 4.  balance -dstripes=1..5,devid=1     # sdd, 4.55TB
> >     balance -dstripes=1..5,devid=2     # sde, 4.55TB
> >     balance -dstripes=1..11,devid=3    # sdl, 3.64TB
> >     balance -dstripes=1..11,devid=4    # sdn, 3.64TB
> >     balance -dstripes=1..11,devid=5    # sdm, 3.64TB
> >     balance -dstripes=1..11,devid=6    # sdk, 3.64TB
> >     balance -dstripes=1..11,devid=7    # sdj, 3.64TB
> >     balance -dstripes=1..11,devid=8    # sdi, 3.64TB
> >     balance -dstripes=1..3,devid=9     # sdb, 9.10TB
> >     balance -dstripes=1..3,devid=10    # sdc, 9.10TB
> >     balance -dstripes=1..3,devid=11    # sda, 9.10TB
> >     balance -dstripes=1..3,devid=12    # sdf, 9.10TB

This ensures that each disk is a member of an optimum-width block group
for its size.

Note: I'm not sure about the 1..11. IIRC the btrfs limit is 10 disks per
stripe, so you might want to use 1..9 if balance seems to be trying to
rebalance everything with 1..11.

Running 'watch btrfs fi usage /mnt/storage-array' while balance runs can
be enlightening.
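
If it helps, here is the same list spelled out as complete commands in a
loop (sketch only: double-check the devid-to-stripes mapping against
your own 'dev usage' output first, and swap 1..11 for 1..9 if the note
above turns out to matter):

        mnt=/mnt/storage-array
        for spec in 1:1..5 2:1..5 3:1..11 4:1..11 5:1..11 6:1..11 \
                    7:1..11 8:1..11 9:1..3 10:1..3 11:1..3 12:1..3; do
            devid=${spec%%:*}       # device ID
            stripes=${spec#*:}      # stripe-count range for that device
            sudo btrfs balance start -dstripes=$stripes,devid=$devid "$mnt"
        done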