From: John Petrini
Date: Fri, 17 Jul 2020 18:54:12 -0400
Subject: Re: Filesystem Went Read Only During Raid-10 to Raid-6 Data Conversion
To: Zygo Blaxell
Cc: John Petrini, linux-btrfs@vger.kernel.org
In-Reply-To: <20200717055706.GJ10769@hungrycats.org>

On Fri, Jul 17, 2020 at 1:57 AM Zygo Blaxell wrote:
>
> On Thu, Jul 16, 2020 at 09:11:17PM -0400, John Petrini wrote:
> > On Thu, Jul 16, 2020 at 6:57 PM Zygo Blaxell
> > > That is...odd. Try 'btrfs dev usage', maybe something weird is
> > > happening with device sizes.
> >
> > Here it is. I'm not sure what to make of it though.
> >
> > sudo btrfs dev usage /mnt/storage-array/
> > /dev/sdd, ID: 1
> >    Device size:          4.55TiB
> >    Device slack:           0.00B
> >    Data,RAID10:          3.12GiB
> >    Data,RAID10:          2.78GiB
> >    Data,RAID10:        784.31GiB
> >    Data,RAID10:          3.34TiB
> >    Data,RAID6:         144.07GiB
> >    Data,RAID6:         293.03GiB
> >    Metadata,RAID10:      4.47GiB
> >    Metadata,RAID10:    352.00MiB
> >    Unallocated:          1.02MiB
> >
> > /dev/sde, ID: 2
> >    Device size:          4.55TiB
> >    Device slack:           0.00B
> >    Data,RAID10:          3.12GiB
> >    Data,RAID10:          2.78GiB
> >    Data,RAID10:        784.31GiB
> >    Data,RAID10:          3.34TiB
> >    Data,RAID6:         144.07GiB
> >    Data,RAID6:         293.03GiB
> >    Metadata,RAID10:      4.47GiB
> >    Metadata,RAID10:    352.00MiB
> >    Unallocated:          1.02MiB
> >
> > /dev/sdl, ID: 3
> >    Device size:          3.64TiB
> >    Device slack:           0.00B
> >    Data,RAID10:          3.12GiB
> >    Data,RAID10:          3.34TiB
> >    Data,RAID6:         293.03GiB
> >    Metadata,RAID10:      4.47GiB
> >    Unallocated:          1.02MiB
> >
> > /dev/sdn, ID: 4
> >    Device size:          3.64TiB
> >    Device slack:           0.00B
> >    Data,RAID10:          3.12GiB
> >    Data,RAID10:          3.34TiB
> >    Data,RAID6:         293.03GiB
> >    Metadata,RAID10:      4.47GiB
> >    Unallocated:          1.02MiB
> >
> > /dev/sdm, ID: 5
> >    Device size:          3.64TiB
> >    Device slack:           0.00B
> >    Data,RAID10:          3.12GiB
> >    Data,RAID10:          3.34TiB
> >    Data,RAID6:         293.03GiB
> >    Metadata,RAID10:      4.47GiB
> >    Unallocated:          1.02MiB
> >
> > /dev/sdk, ID: 6
> >    Device size:          3.64TiB
> >    Device slack:           0.00B
> >    Data,RAID10:          3.12GiB
> >    Data,RAID10:          3.34TiB
> >    Data,RAID6:         293.03GiB
> >    Metadata,RAID10:      4.47GiB
> >    Unallocated:          1.02MiB
> >
> > /dev/sdj, ID: 7
> >    Device size:          3.64TiB
> >    Device slack:           0.00B
> >    Data,RAID10:          3.12GiB
> >    Data,RAID10:          3.34TiB
> >    Data,RAID6:         293.03GiB
> >    Metadata,RAID10:      4.47GiB
> >    Unallocated:          1.02MiB
> >
> > /dev/sdi, ID: 8
> >    Device size:          3.64TiB
> >    Device slack:           0.00B
> >    Data,RAID10:          3.12GiB
> >    Data,RAID10:          3.34TiB
> >    Data,RAID6:         293.03GiB
> >    Metadata,RAID10:      4.47GiB
> >    Unallocated:          1.02MiB
> >
> > /dev/sdb, ID: 9
> >    Device size:          9.10TiB
> >    Device slack:           0.00B
> >    Data,RAID10:          3.12GiB
> >    Data,RAID10:          4.01TiB
> >    Data,RAID10:        784.31GiB
> >    Data,RAID10:          3.34TiB
> >    Data,RAID6:         458.56GiB
> >    Data,RAID6:         144.07GiB
> >    Data,RAID6:         293.03GiB
> >    Metadata,RAID10:      4.47GiB
> >    Metadata,RAID10:    352.00MiB
> >    Metadata,RAID10:      6.00GiB
> >    Metadata,RAID1C3:     2.00GiB
> >    System,RAID1C3:      32.00MiB
> >    Unallocated:         82.89GiB
> >
> > /dev/sdc, ID: 10
> >    Device size:          9.10TiB
> >    Device slack:           0.00B
> >    Data,RAID10:          3.12GiB
> >    Data,RAID10:          4.01TiB
> >    Data,RAID10:        784.31GiB
> >    Data,RAID10:          3.34TiB
> >    Data,RAID6:         458.56GiB
> >    Data,RAID6:         144.07GiB
> >    Data,RAID6:         293.03GiB
> >    Metadata,RAID10:      4.47GiB
> >    Metadata,RAID10:    352.00MiB
> >    Metadata,RAID10:      6.00GiB
> >    Metadata,RAID1C3:     3.00GiB
> >    Unallocated:         81.92GiB
> >
> > /dev/sda, ID: 11
> >    Device size:          9.10TiB
> >    Device slack:           0.00B
> >    Data,RAID10:        784.31GiB
> >    Data,RAID10:          4.01TiB
> >    Data,RAID10:          3.34TiB
> >    Data,RAID6:         458.56GiB
> >    Data,RAID6:         144.07GiB
> >    Data,RAID6:         293.03GiB
> >    Metadata,RAID10:      4.47GiB
> >    Metadata,RAID10:    352.00MiB
> >    Metadata,RAID10:      6.00GiB
> >    Metadata,RAID1C3:     5.00GiB
> >    System,RAID1C3:      32.00MiB
> >    Unallocated:         85.79GiB
> >
> > /dev/sdf, ID: 12
> >    Device size:          9.10TiB
> >    Device slack:           0.00B
> >    Data,RAID10:        784.31GiB
> >    Data,RAID10:          4.01TiB
> >    Data,RAID10:          3.34TiB
> >    Data,RAID6:         458.56GiB
> >    Data,RAID6:         144.07GiB
> >    Data,RAID6:         293.03GiB
> >    Metadata,RAID10:      4.47GiB
> >    Metadata,RAID10:    352.00MiB
> >    Metadata,RAID10:      6.00GiB
> >    Metadata,RAID1C3:     5.00GiB
> >    System,RAID1C3:      32.00MiB
> >    Unallocated:         85.79GiB
>
> OK...slack is 0, so there wasn't anything weird with underlying device
> sizes going on.
>
> There's 3 entries for "Data,RAID6" because there are three stripe
> widths: 12 disks, 6 disks, and 4 disks, corresponding to the number of
> disks of each size. Unfortunately 'dev usage' doesn't say which one is
> which.
>
> > Wow, looks like I've got lots of info to mull over here! I kicked off
> > another convert already after cleaning up quite a bit more space. I
> > had over 100G unallocated on each device after deleting some data and
> > running another balance.
>
> If you did balances with no unallocated space on the small drives, then
> the block groups created by those balances are the first block groups
> to be processed by later balances. These block groups will be narrow,
> so they'll use space less efficiently. We want the opposite of that.

There was unallocated space on all drives before I started this recent
balance. So far it's still chugging along at about 20% complete, but I
assume that even if it does complete successfully I'll be stuck with some
narrow stripes from the first attempt.

> > I'm tempted to let it run and see if it succeeds, but my unallocated
> > space has already dropped off a cliff with 95% of the rebalance
> > remaining.
>
> This is why the devid/stripes filters are important. Also, I noticed
> that my logic in my previous reply was wrong for this case: we do want
> to process the smallest disks first, not the largest ones, because that
> way we guarantee we always increase unallocated space.
>
> If we convert a 10-disk-wide block group from RAID10 to 4-disk-wide
> RAID6, we replace 2 chunks on 10 disks with 5 chunks on 4 disks:
>
>     2 RAID10 block groups:      5 RAID6 block groups:
>     sda #1 data1                sda #1 data1
>     sdb #1 mirror1              sdb #1 data2
>     sdc #1 data2                sdc #1 P1
>     sdd #1 mirror2              sdf #1 Q1
>     sde #1 data3                sda #2 data3
>     sdf #1 mirror3              sdb #2 data4
>     sdg #1 data4                sdc #2 P2
>     sdh #1 mirror4              sdf #2 Q2
>     sdi #1 data5                sda #3 data5
>     sdj #1 mirror5              sdb #3 data6
>     sda #2 data6                sdc #3 P3
>     sdb #2 mirror6              sdf #3 Q3
>     sdc #2 data7                sda #4 data7
>     sdd #2 mirror7              sdb #4 data8
>     sde #2 data8                sdc #4 P4
>     sdf #2 mirror8              sdf #4 Q4
>     sdg #2 data9                sda #5 data9
>     sdh #2 mirror9              sdb #5 data10
>     sdi #2 data10               sdc #5 P5
>     sdj #2 mirror10             sdf #5 Q5
>
> When this happens we lose a net 3GB of space on each of the 4 largest
> disks for every 1GB we gain on the 6 smaller disks, and run out of space
> part way through the balance. We will have to make this tradeoff at some
> point in the balance because of the disk sizes, but it's important that
> it happens at the very end, after all other possible conversion is done
> and the maximum amount of unallocated space has been generated.
>
> btrfs balance isn't smart enough to do this by itself, which is why it
> takes 20 commands with filter parameters to get complex arrays reshaped,
> and there are sometimes multiple passes.
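
A possible way to see which stripe widths actually exist (not something
suggested in the thread; it assumes a btrfs-progs recent enough to accept
'inspect-internal dump-tree -t chunk', and the exact output format can
vary between versions) is to read the chunk tree directly and count the
data chunks at each width:

    # Count RAID6 data chunks per stripe width, run against any member device.
    # 'btrfs dev usage' lumps all RAID6 widths together; this breaks them out.
    sudo btrfs inspect-internal dump-tree -t chunk /dev/sdd |
        grep -A2 'type DATA|RAID6' |
        grep -o 'num_stripes [0-9]*' |
        sort | uniq -c
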
>
> We want to relocate a 10-disk-wide block group from RAID10 to
> 10-disk-wide RAID6, replacing 8 chunks on 10 disks with 5 chunks on
> 10 disks:
>
>     8 RAID10 block groups:      5 RAID6 block groups:
>     sda #1 data1                sda #1 data1
>     sdb #1 mirror1              sdb #1 data2
>     sdc #1 data2                sdc #1 data3
>     sdd #1 mirror2              sdd #1 data4
>     sde #1 data3                sde #1 data5
>     sdf #1 mirror3              sdf #1 data6
>     sdg #1 data4                sdg #1 data7
>     sdh #1 mirror4              sdh #1 data8
>     sdi #1 data5                sdi #1 P1
>     sdj #1 mirror5              sdj #1 Q1
>     sda #2 data6                sda #2 data9
>     sdb #2 mirror6              sdb #2 data10
>     sdc #2 data7                sdc #2 data11
>     sdd #2 mirror7              sdd #2 data12
>     sde #2 data8                sde #2 data13
>     sdf #2 mirror8              sdf #2 data14
>     sdg #2 data9                sdg #2 data15
>     sdh #2 mirror9              sdh #2 data16
>     sdi #2 data10               sdi #2 P2
>     sdj #2 mirror10             sdj #2 Q2
>     ...etc (there are 40GB of data)
>
> The easiest way to do that is:
>
>     for sc in 12 11 10 9 8 7 6 5 4; do
>         btrfs balance start -dconvert=raid6,stripes=$sc..$sc,soft \
>             -mconvert=raid1c3,soft /mnt/storage-array/
>     done
>
> The above converts the widest block groups first, so that every block
> group converted results in a net increase in storage efficiency, and
> creates unallocated space on as many disks as possible.
>
> Then the next step from my original list, edited with the device IDs
> and sizes from dev usage, is the optimization step. I filled in the
> device IDs and sizes from your 'dev usage' output:
>
> > > 4. balance -dstripes=1..5,devid=1      # sdd, 4.55TB
> > >    balance -dstripes=1..5,devid=2      # sde, 4.55TB
> > >    balance -dstripes=1..11,devid=3     # sdl, 3.64TB
> > >    balance -dstripes=1..11,devid=4     # sdn, 3.64TB
> > >    balance -dstripes=1..11,devid=5     # sdm, 3.64TB
> > >    balance -dstripes=1..11,devid=6     # sdk, 3.64TB
> > >    balance -dstripes=1..11,devid=7     # sdj, 3.64TB
> > >    balance -dstripes=1..11,devid=8     # sdi, 3.64TB
> > >    balance -dstripes=1..3,devid=9      # sdb, 9.10TB
> > >    balance -dstripes=1..3,devid=10     # sdc, 9.10TB
> > >    balance -dstripes=1..3,devid=11     # sda, 9.10TB
> > >    balance -dstripes=1..3,devid=12     # sdf, 9.10TB
>
> This ensures that each disk is a member of an optimum-width block group
> for its size.
>
> Note: I'm not sure about the 1..11. IIRC the btrfs limit is 10 disks
> per stripe, so you might want to use 1..9 if it seems to be trying to
> rebalance everything with 1..11.
>
> Running 'watch btrfs fi usage /mnt/storage-array' while balance runs
> can be enlightening.

Thanks so much for all this detail. I'll see how this run goes, and if it
gets stuck again I'll try your strategy of converting to RAID-1 to get
back some unallocated space. Otherwise, if this completes successfully
I'll go ahead with optimizing the striping and let you know how it goes.

--
---------------------------------------
John Petrini
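
P.S. For convenience, the twelve per-device optimization passes quoted
above could be run as a single shell loop. This is just a sketch built
from the devid/stripe-range pairs in that list, not a sequence anyone in
the thread actually ran; per the note about the stripe limit, the 1..11
ranges may need to become 1..9.

    for pair in 1:1..5 2:1..5 3:1..11 4:1..11 5:1..11 6:1..11 \
                7:1..11 8:1..11 9:1..3 10:1..3 11:1..3 12:1..3; do
        devid=${pair%%:*}      # device ID, as reported by 'btrfs dev usage'
        stripes=${pair#*:}     # stripe-count range chosen for that disk size
        sudo btrfs balance start -dstripes="$stripes",devid="$devid" \
            /mnt/storage-array/
    done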