From: Phil Karn <karn@ka9q.net>
To: Chris Murphy <lists@colorremedies.com>
Cc: Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Extremely slow device removals
Date: Thu, 30 Apr 2020 12:59:29 -0700
Message-ID: <bfa161e9-7389-6a83-edee-2c3adbcc7bda@ka9q.net>
In-Reply-To: <CAJCQCtQqdk3FAyc27PoyTXZkhcmvgDwt=oCR7Yw3yuqeOkr2oA@mail.gmail.com>

On 4/30/20 11:40, Chris Murphy wrote:
> It could be any number of things. Each drive has at least 3
> partitions so what else is on these drives? Are those other partitions
> active with other things going on at the same time? How are the drives
> connected to the computer? Direct SATA/SAS connection? Via USB
> enclosures? How many snapshots? Are quotas enabled? There's nothing in
> dmesg for 5 days? Anything for the most recent hour? i.e. journalctl
> -k --since=-1h

Nothing else is going on with these drives. The other partitions are
things like EFI, manual backups of the root file system on my SSD, and
swap (which is barely used, as verified with iostat and swapon -s).
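
For what it's worth, a quick way to confirm swap really is sitting idle:

  # swap devices and usage (same info as the older 'swapon -s')
  swapon --show
  # the si/so columns should stay at or near zero if swap isn't being touched
  vmstat 5 3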

The drives are connected internally with SATA at 3.0 Gb/s (this is an
old motherboard). Even so, that's roughly 300 MB/s of usable bandwidth
after 8b/10b encoding overhead, still much faster than the drives'
sustained read/write speeds.

I did get rid of a lot of read-only snapshots while this was running in
hopes this might speed things up. I'm down to 8, and willing to go
lower. No obvious improvement. Would I expect this to help right away,
or does it take time for btrfs to reclaim the space and realize it
doesn't have to be copied?
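
For reference, this is the sort of thing I mean by getting rid of
snapshots (the mount point and snapshot path are just examples):

  # list only the read-only subvolumes (i.e. the snapshots)
  btrfs subvolume list -r /pool
  # delete one that's no longer needed
  btrfs subvolume delete /pool/snapshots/home-20200401
  # deleted subvolumes are cleaned up asynchronously; this waits for that
  btrfs subvolume sync /pool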

I've never used quotas; I'm the only user.

There are plenty of messages in dmesg of the form

[482089.101264] BTRFS info (device sdd3): relocating block group 9016340119552 flags data|raid1
[482118.545044] BTRFS info (device sdd3): found 1115 extents
[482297.404024] BTRFS info (device sdd3): found 1115 extents

These appear to be routinely generated by the copy operation. I know
what extents are, but these messages don't really tell me much.
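
A number that does seem to track progress is how much data is still
allocated on the drive being removed, e.g. (mount point is an example):

  # per-device allocation; the outgoing drive's Data,RAID1 figure should shrink
  btrfs device usage /pool
  # or the filesystem-wide summary
  btrfs filesystem usage /pool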

The copy operation appears to be proceeding normally; it's just
extremely, painfully slow. It's also doing an awful lot of writing to
the drive I'm removing, which doesn't seem to make sense. Looking at
'iostat', those writes almost always happen in parallel with another
drive, a pattern I often see (and expect) with raid-1.

>
> It's an old kernel by this list's standards. Mostly this list is
> active development on mainline and stable kernels, not LTS kernels
> which - you might have found a bug. But there's thousands of changes
> throughout the storage stack in the kernel since then, thousands just
> in Btrfs between 4.19 and 5.7 and 5.8 being worked on now. It's a 20+
> month development difference.
>
> It's pretty much just luck if an upstream Btrfs developer sees this
> and happens to know why it's slow and that it was fixed in X kernel
> version or maybe it's a really old bug that just hasn't yet gotten a
> good enough bug report still, and hasn't been fixed. That's why it's
> common advice to "try with a newer kernel" because the problem might
> not happen, and if it does, then chances are it's a bug.
I used to routinely build and install the latest kernels, but I got
tired of that. I could easily do it here if you think it would make a
difference. It would force me to reboot, of course. As long as I'm not
likely to corrupt my file system, I'm willing to do that.
>
>> I started the operation 5 days ago, and of right now I still have 2.18
>> TB to move off the drive I'm trying to replace. I think it started
>> around 3.5 TB.
> Issue sysrq+t and post the output from 'journalctl -k --since=-10m'
> in something like pastebin or in a text file on nextcloud/dropbox etc.
> It's probably too big to email and usually the formatting gets munged
> anyway and is hard to read.
>
> Someone might have an idea why it's slow from sysrq+t but it's a long shot.

I'm operating headless at the moment, but here's journalctl:

-- Logs begin at Fri 2020-04-24 21:49:22 PDT, end at Thu 2020-04-30 12:07:12 PDT. --
Apr 30 12:04:26 homer.ka9q.net kernel: BTRFS info (device sdd3): found 1997 extents
Apr 30 12:04:33 homer.ka9q.net kernel: BTRFS info (device sdd3): relocating block group 9019561345024 flags data|raid1
Apr 30 12:05:21 homer.ka9q.net kernel: BTRFS info (device sdd3): found 6242 extents
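
In case it's useful: sysrq+t can also be triggered without a console,
and the resulting task dump lands in the kernel log:

  # as root: enable all sysrq functions, then request a task-state dump
  echo 1 > /proc/sys/kernel/sysrq
  echo t > /proc/sysrq-trigger
  # the traces then show up in 'journalctl -k' / dmesg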

> If there's anything important on this file system, you should make a
> copy now. Update backups. You should be prepared to lose the whole
> thing before proceeding further.
Already done. Kinda goes without saying...
> Next, disable the write cache on all the drives. This can be done with
> hdparm -W (cap W, lowercase w is dangerous, see man page). This should
> improve the chance of the file system on all drives being consistent
> if you have to force reboot - i.e. the reboot might hang so you should
> be prepared to issue sysrq+s followed by sysrq+b. Better than power
> reset.
I did try disabling the write caches. Interestingly there was no obvious
change in write speeds. I turned them back on, but I'll remember to turn
them off before rebooting. Good suggestion.
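
For the record, checking and toggling the write caches looks like this
(drive names are examples):

  # show the current write-cache setting
  hdparm -W /dev/sdc
  # disable the volatile write cache before rebooting
  hdparm -W0 /dev/sdc
  # re-enable it afterwards
  hdparm -W1 /dev/sdc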
> Boot, leave all drives connected, make sure the write caches are
> disabled, then make sure there's no SCT ERC mismatch, i.e.
> https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

All drives support SCT ERC. The timeouts *are* different: 10 sec for the
new 16 TB drives, 7 sec for the older 6 TB drives.
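
For reference, the timeouts can be read and, if need be, matched with
smartctl (drive name is an example; the values are tenths of a second):

  # report the current SCT ERC read/write recovery limits
  smartctl -l scterc /dev/sdd
  # set both limits to 7.0 seconds
  smartctl -l scterc,70,70 /dev/sdd
  # (this setting usually doesn't survive a power cycle)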

But this shouldn't matter, because I'm quite sure all my drives are
healthy. I regularly run both short and long SMART tests, and they've
always passed. There are no drive I/O errors in dmesg and no evidence of
any retries or timeouts. Just lots of small, apparently random reads and
writes that execute very slowly. By "small" I mean the ratio of
kB_read/s to tps in 'iostat' is small, usually less than 10 kB and often
just 4 kB.
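
That is, the request size falls straight out of the default iostat
columns (device names and numbers are illustrative):

  # tps plus kB_read/s and kB_wrtn/s, refreshed every 5 seconds
  iostat -d 5 sdc sdd
  # rough average transfer size ≈ (kB_read/s + kB_wrtn/s) / tps,
  # e.g. (300 + 100) kB/s over 100 tps ≈ 4 kB per request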

Yes, my partitions are properly aligned on 8-LBA (4KB) boundaries.

>
> And then do a scrub with all the drives attached. And then assess the
> next step only after that completes. It'll either fix something or
> not. You can do this same thing with kernel 4.19. It should work. But
> until the health of the file system is known, I can't recommend doing
> any device replacements or removals. It must be completely healthy
> first.
I run manual scrubs every month or so. They've always passed with zero
errors. I don't run them automatically because they take a day and
there's a very noticeable hit on performance. Btrfs (at least the
version I'm running) doesn't seem to know how to run stuff like this at
low priority (yes, I know that's much harder with I/O than with CPU).
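
For what it's worth, the scrub can at least be asked to run with idle
I/O priority (mount point is an example); how much that actually helps
seems to vary:

  # start the scrub with idle I/O priority (class 3, see ionice(1))
  btrfs scrub start -c 3 /pool
  # check progress later
  btrfs scrub status /pool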
>
> I personally would only do the device removal (either remove while
> still connected or remove while missing) with 5.6.8 or 5.7rc3 because
> if I have a problem, I'm reporting it on this list as a bug. With 4.19
> it's just too old I think for this list, it's pure luck if anyone
> knows for sure what's going on.

I can always try the latest kernel (5.6.8 is on kernel.org) as long as
I'm not likely to lose data by rebooting. I do have backups but I'd like
to avoid the lengthy hassle of rebuilding everything from scratch.

Thanks for the suggestions!

Phil



