Date: Thu, 30 Apr 2020 22:47:53 -0400
From: Zygo Blaxell
To: Phil Karn
Cc: Chris Murphy, Btrfs BTRFS
Subject: Re: Extremely slow device removals
Message-ID: <20200501024753.GE10769@hungrycats.org>
References: <8b647a7f-1223-fa9f-57c0-9a81a9bbeb27@ka9q.net>
 <14a8e382-0541-0f18-b969-ccf4b3254461@ka9q.net>

On Thu, Apr 30, 2020 at 12:59:29PM -0700, Phil Karn wrote:
> On 4/30/20 11:40, Chris Murphy wrote:
> > It could be any number of things. Each drive has at least 3
> > partitions so what else is on these drives? Are those other partitions
> > active with other things going on at the same time? How are the drives
> > connected to the computer? Direct SATA/SAS connection? Via USB
> > enclosures? How many snapshots? Are quotas enabled? There's nothing in
> > dmesg for 5 days? Anything for the most recent hour? i.e. journalctl
> > -k --since=-1h
>
> Nothing else is going on with these drives. Those other partitions
> include things like EFI, manual backups of the root file system on my
> SSD, and swap (which is barely used, verified with iostat and swapon -s).
>
> The drives are connected internally with SATA at 3.0 Gb/s (this is an
> old motherboard). Still, this is 375 MB/s, much faster than the drives'
> sustained read/write speeds.
>
> I did get rid of a lot of read-only snapshots while this was running in
> hopes this might speed things up. I'm down to 8, and willing to go
> lower. No obvious improvement. Would I expect this to help right away,
> or does it take time for btrfs to reclaim the space and realize it
> doesn't have to be copied?
>
> I've never used quotas; I'm the only user.
>
> There are plenty of messages in dmesg of the form
>
> [482089.101264] BTRFS info (device sdd3): relocating block group
> 9016340119552 flags data|raid1
> [482118.545044] BTRFS info (device sdd3): found 1115 extents
> [482297.404024] BTRFS info (device sdd3): found 1115 extents
>
> These appear to be routinely generated by the copy operation. I know
> what extents are, but these messages don't really tell me much.
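(Aside: a rough way to see whether the same message keeps repeating,
rather than new block groups being started, is to count duplicate
"found N extents" lines in the kernel log. Untested sketch; the device
name sdd3 comes from the messages above, and two different block groups
can coincidentally report the same extent count, so treat this as an
approximate indicator only:

    dmesg | grep -o 'found [0-9]* extents' | sort | uniq -c | sort -rn | head

In a normal relocation each count shows up only a couple of times, as in
the snippet above; dozens of identical lines for one count is the symptom
described below.)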
If it keeps repeating "found 1115 extents" over and over (say 5 or more
times), then you're hitting the balance looping bug in kernel 5.1 and
later. Every N block groups (N seems to vary by user; I've heard reports
from 3 to over 6000) the kernel gets stuck in a loop and needs a reboot
to recover. Even if you cancel the balance, it will just loop again until
rebooted, and there is no cancel for device delete, so if you start
looping there you can skip directly to the reboot. For a non-trivial
filesystem the probability of successfully deleting or resizing a device
is more or less zero. There is no fix for that regression yet. Kernel
4.19 doesn't have the regression and does have the other relevant bug
fixes for balance, so it can be used as a workaround.

> The copy operation appears to be proceeding normally, it's just
> extremely, painfully slow. And it's doing an awful lot of writing to the
> drive I'm removing, which doesn't seem to make sense. Looking at
> 'iostat', those writes are almost always done in parallel with another
> drive, a pattern I often see (and expect) with raid-1.
>
> > It's an old kernel by this list's standards. Mostly this list is
> > active development on mainline and stable kernels, not LTS kernels
> > which - you might have found a bug. But there's thousands of changes
> > throughout the storage stack in the kernel since then, thousands just
> > in Btrfs between 4.19 and 5.7 and 5.8 being worked on now. It's a 20+
> > month development difference.
> >
> > It's pretty much just luck if an upstream Btrfs developer sees this
> > and happens to know why it's slow and that it was fixed in X kernel
> > version or maybe it's a really old bug that just hasn't yet gotten a
> > good enough bug report still, and hasn't been fixed. That's why it's
> > common advice to "try with a newer kernel" because the problem might
> > not happen, and if it does, then chances are it's a bug.
>
> I used to routinely build and install the latest kernels but I got tired
> of that. But I could easily do so here if you think it would make a
> difference. It would force me to reboot, of course. As long as I'm not
> likely to corrupt my file system, I'm willing to do that.
>
> >> I started the operation 5 days ago, and as of right now I still have
> >> 2.18 TB to move off the drive I'm trying to replace. I think it
> >> started around 3.5 TB.
>
> > Issue sysrq+t and post the output from 'journalctl -k --since=-10m'
> > in something like pastebin or in a text file on nextcloud/dropbox etc.
> > It's probably too big to email and usually the formatting gets munged
> > anyway and is hard to read.
> >
> > Someone might have an idea why it's slow from sysrq+t but it's a long shot.
>
> I'm operating headless at the moment, but here's journalctl:
>
> -- Logs begin at Fri 2020-04-24 21:49:22 PDT, end at Thu 2020-04-30
> 12:07:12 PDT. --
> Apr 30 12:04:26 homer.ka9q.net kernel: BTRFS info (device sdd3): found
> 1997 extents
> Apr 30 12:04:33 homer.ka9q.net kernel: BTRFS info (device sdd3):
> relocating block group 9019561345024 flags data|raid1
> Apr 30 12:05:21 homer.ka9q.net kernel: BTRFS info (device sdd3): found
> 6242 extents
>
> > If there's anything important on this file system, you should make a
> > copy now. Update backups. You should be prepared to lose the whole
> > thing before proceeding further.
>
> Already done. Kinda goes without saying...
>
> > Next, disable the write cache on all the drives. This can be done with
> > hdparm -W (cap W, lowercase w is dangerous, see man page). This should
> > improve the chance of the file system on all drives being consistent
> > if you have to force reboot - i.e. the reboot might hang so you should
> > be prepared to issue sysrq+s followed by sysrq+b. Better than power
> > reset.
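(For reference, checking and disabling the write cache looks roughly
like this. Untested sketch, run as root; the sd[abcd] glob is just a
guess at the four drives and should be adjusted to match the actual
devices:

    # show the current write-cache setting on each drive
    for d in /dev/sd[abcd]; do hdparm -W "$d"; done

    # -W 0 disables the volatile write cache; -W 1 re-enables it
    for d in /dev/sd[abcd]; do hdparm -W 0 "$d"; done

As Chris says, that's a capital W; lowercase -w is a different and
dangerous option. On many drives the setting reverts to the factory
default after a power cycle, so reapply it after rebooting.)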
> I did try disabling the write caches. Interestingly there was no obvious
> change in write speeds. I turned them back on, but I'll remember to turn
> them off before rebooting. Good suggestion.
>
> > Boot, leave all drives connected, make sure the write caches are
> > disabled, then make sure there's no SCT ERC mismatch, i.e.
> > https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
>
> All drives support SCT. The timeouts *are* different: 10 sec for the new
> 16TB drives, 7 sec for the older 6 TB drives.
>
> But this shouldn't matter because I'm quite sure all my drives are
> healthy. I regularly run both short and long smart tests, and they've
> always passed. No drive I/O errors in dmesg, no evidence of any retries
> or timeouts. Just lots of small apparently random reads and writes that
> execute very slowly. By "small" I mean the ratio of KB_read/s to tps in
> 'iostat' is small, usually less than 10 KB and often just 4KB.
>
> Yes, my partitions are properly aligned on 8-LBA (4KB) boundaries.
>
> > And then do a scrub with all the drives attached. And then assess the
> > next step only after that completes. It'll either fix something or
> > not. You can do this same thing with kernel 4.19. It should work. But
> > until the health of the file system is known, I can't recommend doing
> > any device replacements or removals. It must be completely healthy
> > first.
>
> I run manual scrubs every month or so. They've always passed with zero
> errors. I don't run them automatically because they take a day and
> there's a very noticeable hit on performance. Btrfs (at least the
> version I'm running) doesn't seem to know how to run stuff like this at
> low priority (yes, I know that's much harder with I/O than with CPU).
>
> > I personally would only do the device removal (either remove while
> > still connected or remove while missing) with 5.6.8 or 5.7rc3 because
> > if I have a problem, I'm reporting it on this list as a bug. With 4.19
> > it's just too old I think for this list, it's pure luck if anyone
> > knows for sure what's going on.
>
> I can always try the latest kernel (5.6.8 is on kernel.org) as long as
> I'm not likely to lose data by rebooting. I do have backups but I'd like
> to avoid the lengthy hassle of rebuilding everything from scratch.
>
> Thanks for the suggestions!
>
> Phil
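One footnote on the SCT ERC timeouts: as I read the wiki page Chris
linked, the mismatch that matters is between each drive's error recovery
timeout and the kernel's SCSI command timeout (30 seconds by default),
not between the drives themselves, so 7 s vs 10 s is fine as long as both
stay well below the kernel timeout. If you want to check or change them,
it's roughly this (untested; sdX is a placeholder for each drive):

    smartctl -l scterc /dev/sdX          # show current SCT ERC read/write timeouts
    smartctl -l scterc,70,70 /dev/sdX    # set both to 7.0 seconds (units are 100 ms)
    cat /sys/block/sdX/device/timeout    # kernel SCSI timeout for this drive, in seconds

Drives that don't support SCT ERC usually need the kernel timeout raised
instead, per that wiki page.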