From: Chris Murphy <lists@colorremedies.com>
To: Phil Karn <karn@ka9q.net>
Cc: Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Extremely slow device removals
Date: Thu, 30 Apr 2020 12:40:21 -0600
Message-ID: <CAJCQCtQqdk3FAyc27PoyTXZkhcmvgDwt=oCR7Yw3yuqeOkr2oA@mail.gmail.com>
In-Reply-To: <14a8e382-0541-0f18-b969-ccf4b3254461@ka9q.net>

On Thu, Apr 30, 2020 at 11:31 AM Phil Karn <karn@ka9q.net> wrote:
>
> Any comments on my message about btrfs drive removals being extremely slow?

It could be any number of things. Each drive has at least 3
partitions, so what else is on these drives? Are those other
partitions active with other things going on at the same time? How
are the drives connected to the computer? Direct SATA/SAS connection?
Via USB enclosures? How many snapshots? Are quotas enabled? Is there
really nothing in dmesg for 5 days? Anything in the most recent hour?
i.e. journalctl -k --since=-1h
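
For reference, something like the following covers most of those
questions. The mountpoint /mnt/pool is just a placeholder here; use
wherever the file system is actually mounted.

  # how the drives and partitions are laid out and connected (TRAN shows sata/usb/sas)
  lsblk -o NAME,SIZE,TYPE,TRAN,MOUNTPOINT
  # count snapshot subvolumes
  btrfs subvolume list -s /mnt/pool | wc -l
  # quotas: lists qgroups if enabled, errors out if not
  btrfs qgroup show /mnt/pool
  # kernel messages from the last hour
  journalctl -k --since=-1h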

It's an old kernel by this list's standards. This list is mostly
active development on mainline and stable kernels, not LTS kernels.
You might have found a bug, but there have been thousands of changes
throughout the kernel's storage stack since then, thousands just in
Btrfs between 4.19 and 5.7, with 5.8 being worked on now. That's a
20+ month development difference.

It's pretty much just luck if an upstream Btrfs developer sees this
and happens to know why it's slow and which kernel version fixed it.
Or maybe it's a really old bug that hasn't been fixed because it
still hasn't gotten a good enough bug report. That's why the common
advice is to "try with a newer kernel": the problem might not happen,
and if it does, chances are it's a bug.

> I started the operation 5 days ago, and of right now I still have 2.18
> TB to move off the drive I'm trying to replace. I think it started
> around 3.5 TB.

Issue sysrq+t and post the output from 'journalctl -k --since=-10m'
somewhere like pastebin, or as a text file on Nextcloud/Dropbox etc.
It's probably too big to email, and the formatting usually gets
munged anyway, making it hard to read.
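
If you don't have a SysRq key handy, the trigger file does the same
thing as root. The output file name here is just an example:

  # dump kernel task states into the kernel log
  echo t > /proc/sysrq-trigger
  # capture the last 10 minutes of kernel messages to a file
  journalctl -k --since=-10m > sysrq-t.txt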

Someone might have an idea why it's slow from sysrq+t but it's a long shot.

> Should I reboot degraded without this drive and do a "remove missing"
> operation instead? I'm willing to take the risk of losing another drive
> during the operation if it'll speed this up. It wouldn't be so bad if it
> weren't slowing my filesystem to a crawl for normal stuff, like reading
> mail.

If there's anything important on this file system, you should make a
copy now. Update backups. You should be prepared to lose the whole
thing before proceeding further.
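
Use whatever backup method you already trust. Just as an illustration,
with made-up paths, something like:

  # -aHAX preserves hard links, ACLs, and xattrs; copy to a separate disk
  rsync -aHAX --info=progress2 /mnt/pool/ /mnt/backup/pool/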

Next, disable the write cache on all the drives. This can be done
with hdparm -W (capital W; lowercase -w is dangerous, see the man
page). This should improve the chance of the file system on all
drives being consistent if you have to force a reboot. The reboot
might hang, so be prepared to issue sysrq+s followed by sysrq+b;
that's better than a power reset.
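
Roughly, with /dev/sd[a-d] standing in for your actual drives:

  # disable the volatile write cache on each drive (capital W; lowercase -w is dangerous)
  for d in /dev/sd[a-d]; do hdparm -W 0 "$d"; done

The same sysrq sequence can be sent from a working shell if the
keyboard combo isn't available:

  # emergency sync, then immediate reboot
  echo s > /proc/sysrq-trigger
  echo b > /proc/sysrq-trigger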

We don't know what we don't know, so the next step is a guess. While
powered off, you can remove devid 2, the device you want gone. First
see if you can mount -o ro,degraded, check dmesg, and see if things
pass a basic sanity test for reading. Then remount rw and try to
remove the missing device. It might go faster to just rebuild the
missing data from the remaining single copies, but there's not much
to go on.
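
In rough terms, assuming /dev/sdX1 is one of the remaining members
and /mnt/pool is the mountpoint (both placeholders):

  # read-only degraded mount for a basic sanity check
  mount -o ro,degraded /dev/sdX1 /mnt/pool
  dmesg | tail -n 50
  # if reads look sane, remount read-write and drop the missing device
  mount -o remount,rw,degraded /mnt/pool
  btrfs device remove missing /mnt/pool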

Boot, leave all drives connected, make sure the write caches are
disabled, then make sure there's no SCT ERC mismatch, i.e.
https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
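
The check itself is quick; something like this, per drive (device
names are placeholders):

  # report the drive's error recovery timeout, in deciseconds
  smartctl -l scterc /dev/sdX
  # if the drive supports SCT ERC, 7 seconds read/write is a common choice
  smartctl -l scterc,70,70 /dev/sdX
  # if it doesn't, raise the kernel's command timeout instead (default 30s)
  echo 180 > /sys/block/sdX/device/timeout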

Then do a scrub with all the drives attached, and assess the next
step only after that completes. It'll either fix something or not.
You can do this same thing with kernel 4.19; it should work. But
until the health of the file system is known, I can't recommend doing
any device replacements or removals. It must be completely healthy
first.
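
For example (mountpoint is a placeholder):

  # start a scrub in the background and watch its progress
  btrfs scrub start /mnt/pool
  btrfs scrub status /mnt/pool
  # any corrected or uncorrectable errors also show up in dmesg and device stats
  btrfs device stats /mnt/pool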

I personally would only do the device removal (either remove while
still connected, or remove while missing) with 5.6.8 or 5.7-rc3,
because if I hit a problem I can report it on this list as a bug.
With 4.19, I think it's just too old for this list; it's pure luck if
anyone knows for sure what's going on.


-- 
Chris Murphy

