From: Alexandru Dordea <alex@dordea.net>
To: Phil Karn <karn@ka9q.net>
Cc: Chris Murphy <lists@colorremedies.com>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Extremely slow device removals
Date: Thu, 30 Apr 2020 23:27:26 +0300	[thread overview]
Message-ID: <848D59AB-5B64-4C32-BE21-7BC8A8B9821E@dordea.net> (raw)
In-Reply-To: <bfa161e9-7389-6a83-edee-2c3adbcc7bda@ka9q.net>

Hello,

I've been hitting the same issue with RAID6 for months :)
I have a BTRFS RAID6 array of 15 x 8TB HDDs plus 5 x 14TB. One of the 15 x 8TB drives crashed. I removed the faulty drive, and when I run the 'btrfs device delete missing' command the system load climbs and recovering the 6.66TB is projected to take a few months. After 5 days of running, the remaining data on the missing device has only dropped to 6.10TB.
During this period the drives sit at almost 100% utilization and R/W performance is degraded by more than 95%.

R/W performance is not impacted when the delete/balance process is not running. (I don't know whether running balance single-threaded is a feature or a bug, but it's a shame that the process keeps only one of 48 CPUs at 100%.)
No errors, partition is clean.
Mounted with space_cache=v2, no improvement.
Using kernel 5.6.6 with btrfs-progs 5.6 (latest openSUSE Tumbleweed).
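
For reference, this is roughly what I'm running (the mount point below is just a placeholder):

        btrfs device delete missing /mnt/pool   # relocate data off the absent device
        btrfs device usage /mnt/pool            # check how much data still has to move

and 'btrfs filesystem show' reports: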

        Total devices 20 FS bytes used 134.22TiB
        devid    1 size 7.28TiB used 7.10TiB path /dev/sdg
        devid    2 size 7.28TiB used 7.10TiB path /dev/sdh
        devid    3 size 7.28TiB used 7.10TiB path /dev/sdt
        devid    5 size 7.28TiB used 7.10TiB path /dev/sds
        devid    6 size 7.28TiB used 7.10TiB path /dev/sdr
        devid    7 size 7.28TiB used 7.10TiB path /dev/sdq
        devid    8 size 7.28TiB used 7.10TiB path /dev/sdp
        devid    9 size 7.28TiB used 7.10TiB path /dev/sdo
        devid   10 size 7.28TiB used 7.10TiB path /dev/sdn
        devid   11 size 7.28TiB used 7.10TiB path /dev/sdm
        devid   12 size 7.28TiB used 7.10TiB path /dev/sdl
        devid   13 size 7.28TiB used 7.10TiB path /dev/sdk
        devid   14 size 7.28TiB used 7.10TiB path /dev/sdj
        devid   15 size 7.28TiB used 7.10TiB path /dev/sdi
        devid   16 size 12.73TiB used 11.86TiB path /dev/sdc
        devid   17 size 12.73TiB used 11.86TiB path /dev/sdf
        devid   18 size 12.73TiB used 11.86TiB path /dev/sde
        devid   19 size 12.73TiB used 11.86TiB path /dev/sdb
        devid   20 size 12.73TiB used 11.86TiB path /dev/sdd
        *** Some devices missing

After spending months troubleshooting and trying to recover without having my server unavailable for months, I'm about to give up :)


> On Apr 30, 2020, at 22:59, Phil Karn <karn@ka9q.net> wrote:
> 
> On 4/30/20 11:40, Chris Murphy wrote:
>> It could be any number of things. Each drive has at least 3
>> partitions so what else is on these drives? Are those other partitions
>> active with other things going on at the same time? How are the drives
>> connected to the computer? Direct SATA/SAS connection? Via USB
>> enclosures? How many snapshots? Are quotas enabled? There's nothing in
>> dmesg for 5 days? Anything for the most recent hour? i.e. journalctl
>> -k --since=-1h
> 
> Nothing else is going on with these drives. Those other partitions
> include things like EFI, manual backups of the root file system on my
> SSD, and swap (which is barely used, verified with iostat and swapon -s).
> 
> The drives are connected internally with SATA at 3.0 Gb/s (this is an
> old motherboard). Still, this is 375 MB/s, much faster than the drives'
> sustained read/write speeds.
> 
> I did get rid of a lot of read-only snapshots while this was running in
> hopes this might speed things up. I'm down to 8, and willing to go
> lower. No obvious improvement. Would I expect this to help right away,
> or does it take time for btrfs to reclaim the space and realize it
> doesn't have to be copied?
> 
> I've never used quotas; I'm the only user.
> 
> There are plenty of messages in dmesg of the form
> 
> [482089.101264] BTRFS info (device sdd3): relocating block group
> 9016340119552 flags data|raid1
> [482118.545044] BTRFS info (device sdd3): found 1115 extents
> [482297.404024] BTRFS info (device sdd3): found 1115 extents
> 
> These appear to be routinely generated by the copy operation. I know
> what extents are, but these messages don't really tell me much.
> 
> The copy operation appears to be proceeding normally; it's just
> extremely, painfully slow. And it's doing an awful lot of writing to the
> drive I'm removing, which doesn't seem to make sense. Looking at
> 'iostat', those writes are almost always done in parallel with another
> drive, a pattern I often see (and expect) with raid-1.
> 
>> 
>> It's an old kernel by this list's standards. Mostly this list is
>> active development on mainline and stable kernels, not LTS kernels
>> which - you might have found a bug. But there's thousands of changes
>> throughout the storage stack in the kernel since then, thousands just
>> in Btrfs between 4.19 and 5.7 and 5.8 being worked on now. It's a 20+
>> month development difference.
>> 
>> It's pretty much just luck if an upstream Btrfs developer sees this
>> and happens to know why it's slow and that it was fixed in X kernel
>> version or maybe it's a really old bug that just hasn't yet gotten a
>> good enough bug report still, and hasn't been fixed. That's why it's
>> common advice to "try with a newer kernel" because the problem might
>> not happen, and if it does, then chances are it's a bug.
> I used to routinely build and install the latest kernels but I got tired
> of that. But I could easily do so here if you think it would make a
> difference. It would force me to reboot, of course. As long as I'm not
> likely to corrupt my file system, I'm willing to do that.
>> 
>>> I started the operation 5 days ago, and of right now I still have 2.18
>>> TB to move off the drive I'm trying to replace. I think it started
>>> around 3.5 TB.
>> Issue sysrq+t and post the output from 'journalctl -k --since=-10m'
>> in something like pastebin or in a text file on nextcloud/dropbox etc.
>> It's probably too big to email and usually the formatting gets munged
>> anyway and is hard to read.
>> 
>> Someone might have an idea why it's slow from sysrq+t but it's a long shot.
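>> (For the sysrq part, something like 'echo t > /proc/sysrq-trigger' as root
>> should do it, assuming sysrq is enabled, followed immediately by
>> 'journalctl -k --since=-10m'.)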
> 
> I'm operating headless at the moment, but here's journalctl:
> 
> -- Logs begin at Fri 2020-04-24 21:49:22 PDT, end at Thu 2020-04-30
> 12:07:12 PDT. --
> Apr 30 12:04:26 homer.ka9q.net kernel: BTRFS info (device sdd3): found
> 1997 extents
> Apr 30 12:04:33 homer.ka9q.net kernel: BTRFS info (device sdd3):
> relocating block group 9019561345024 flags data|raid1
> Apr 30 12:05:21 homer.ka9q.net kernel: BTRFS info (device sdd3): found
> 6242 extents
> 
>> If there's anything important on this file system, you should make a
>> copy now. Update backups. You should be prepared to lose the whole
>> thing before proceeding further.
> Already done. Kinda goes without saying...
>> Next, disable the write cache on all the drives. This can be done with
>> hdparm -W (cap W, lowercase w is dangerous, see man page). This should
>> improve the chance of the file system on all drives being consistent
>> if you have to force reboot - i.e. the reboot might hang so you should
>> be prepared to issue sysrq+s followed by sysrq+b. Better than power
>> reset.
> I did try disabling the write caches. Interestingly there was no obvious
> change in write speeds. I turned them back on, but I'll remember to turn
> them off before rebooting. Good suggestion.
>> Boot, leave all drives connected, make sure the write caches are
>> disabled, then make sure there's no SCT ERC mismatch, i.e.
>> https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
> 
> All drives support SCT. The timeouts *are* different: 10 sec for the new
> 16TB drives, 7 sec for the older 6 TB drives.
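> (I'm reading those with something like 'smartctl -l scterc /dev/sdX'; if it
> mattered, the timeouts could be matched with e.g. 'smartctl -l scterc,70,70
> /dev/sdX', the values being tenths of a second.)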
> 
> But this shouldn't matter because I'm quite sure all my drives are
> healthy. I regularly run both short and long smart tests, and they've
> always passed. No drive I/O errors in dmesg, no evidence of any retries
> or timeouts. Just lots of small apparently random reads and writes that
> execute very slowly. By "small" I mean the ratio of KB_read/s to tps in
> 'iostat' is small, usually less than 10 KB and often just 4KB.
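> (That ratio is from the default per-device iostat output, e.g. 'iostat -d 5',
> which reports tps and kB_read/s / kB_wrtn/s for each drive.)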
> 
> Yes, my partitions are properly aligned on 8-LBA (4KB) boundaries.
> 
>> 
>> And then do a scrub with all the drives attached. And then assess the
>> next step only after that completes. It'll either fix something or
>> not. You can do this same thing with kernel 4.19. It should work. But
>> until the health of the file system is known, I can't recommend doing
>> any device replacements or removals. It must be completely healthy
>> first.
> I run manual scrubs every month or so. They've always passed with zero
> errors. I don't run them automatically because they take a day and
> there's a very noticeable hit on performance. Btrfs (at least the
> version I'm running) doesn't seem to know how to run stuff like this at
> low priority (yes, I know that's much harder with I/O than with CPU).
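> ('btrfs scrub start' does accept -c/-n flags to request a lower I/O priority
> class, e.g. 'btrfs scrub start -c 3 <mountpoint>' for the idle class, but
> whether the block-layer scheduler actually honors that is another question.)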
>> 
>> I personally would only do the device removal (either remove while
>> still connected or remove while missing) with 5.6.8 or 5.7rc3 because
>> if I have a problem, I'm reporting it on this list as a bug. With 4.19
>> it's just too old I think for this list, it's pure luck if anyone
>> knows for sure what's going on.
> 
> I can always try the latest kernel (5.6.8 is on kernel.org) as long as
> I'm not likely to lose data by rebooting. I do have backups but I'd like
> to avoid the lengthy hassle of rebuilding everything from scratch.
> 
> Thanks for the suggestions!
> 
> Phil
> 
> 
> 


Thread overview: 39+ messages
2020-04-28  7:22 Extremely slow device removals Phil Karn
2020-04-30 17:31 ` Phil Karn
2020-04-30 18:13   ` Jean-Denis Girard
2020-05-01  8:05     ` Phil Karn
2020-05-02  3:35       ` Zygo Blaxell
     [not found]         ` <CAMwB8mjUw+KV8mxg8ynPsv0sj5vSpwG7_khw=oP5n+SnPYzumQ@mail.gmail.com>
2020-05-02  4:31           ` Zygo Blaxell
2020-05-02  4:48         ` Paul Jones
2020-05-02  5:25           ` Phil Karn
2020-05-02  6:04             ` Remi Gauvin
2020-05-02  7:20             ` Zygo Blaxell
2020-05-02  7:27               ` Phil Karn
2020-05-02  7:52                 ` Zygo Blaxell
2020-05-02  6:00           ` Zygo Blaxell
2020-05-02  6:23             ` Paul Jones
2020-05-02  7:20               ` Phil Karn
2020-05-02  7:42                 ` Zygo Blaxell
2020-05-02  8:22                   ` Phil Karn
2020-05-02  8:24                     ` Phil Karn
2020-05-02  9:09                     ` Zygo Blaxell
2020-05-02 17:48                       ` Chris Murphy
2020-05-03  5:26                         ` Zygo Blaxell
2020-05-03  5:39                           ` Chris Murphy
2020-05-03  6:05                             ` Chris Murphy
2020-05-04  2:09                         ` Phil Karn
2020-05-02  7:43                 ` Jukka Larja
2020-05-02  4:49         ` Phil Karn
2020-04-30 18:40   ` Chris Murphy
2020-04-30 19:59     ` Phil Karn
2020-04-30 20:27       ` Alexandru Dordea [this message]
2020-04-30 20:58         ` Phil Karn
2020-05-01  2:47       ` Zygo Blaxell
2020-05-01  4:48         ` Phil Karn
2020-05-01  6:05           ` Alexandru Dordea
2020-05-01  7:29             ` Phil Karn
2020-05-02  4:18               ` Zygo Blaxell
2020-05-02  4:48                 ` Phil Karn
2020-05-02  5:00                 ` Phil Karn
2020-05-03  2:28                 ` Phil Karn
2020-05-04  7:39                   ` Phil Karn
