From: Phil Karn <karn@ka9q.net>
To: linux-btrfs@vger.kernel.org
Subject: Extremely slow device removals
Date: Tue, 28 Apr 2020 00:22:20 -0700
Message-ID: <8b647a7f-1223-fa9f-57c0-9a81a9bbeb27@ka9q.net>

I've been running btrfs in RAID1 mode on four 6TB drives for years. They
have 35K+ hours (about 4 years) of running time, and while they're still
passing SMART scans, I wanted to stop tempting fate. They were also
starting to get full (about 92%) and performance was beginning to suffer.

My plan: replace them with two new 16TB EXOS (Enterprise) drives from
Seagate.
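(For anyone following along, the clean way to do this, as I understand
it, is a pair of in-place "btrfs replace" operations, plus removes for
the two old drives that don't get a one-for-one successor. Device names
here are illustrative, not my actual ones:

  # swap two old drives for the new ones, in place
  btrfs replace start /dev/old1 /dev/new1 /mnt
  btrfs replace start /dev/old2 /dev/new2 /mnt

  # grow each new device to its full size; replace by itself
  # keeps the old device's size
  btrfs filesystem resize <devid-of-new1>:max /mnt
  btrfs filesystem resize <devid-of-new2>:max /mnt

  # then migrate the remaining two old drives off
  btrfs device remove /dev/old3 /dev/old4 /mnt

That's not what I actually did, as you'll see.)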

My first false start was a "device add" of one of the new drives
followed by a "device remove" on an old one. (It had been a while, and
I'd forgotten about "device replace".) This went extremely slowly, and
by morning
it had bombed with a message in the kernel log about running out of
space on (I think) the *old* drive. This seemed odd since the new drive
was still mostly empty.
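(From memory, that false start was just:

  btrfs device add /dev/new1 /mnt
  btrfs device remove /dev/old1 /mnt

with the "device remove" being the step that ran all night and then
bombed.)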

The filesystem also refused to remount right away, but given the furious
drive activity I decided to be patient. The filesystem mounted by
itself an hour or so later. There were plenty of "task hung" messages in
the kernel log, but they all seemed to be warnings. No lost data. Whew.

By now I remembered "device replace". But I'd already done "device add"
on the first new 16 TB drive. That gave me 5 drives online and no spare
slot for the second new drive.

I didn't want to repeat the "device remove" for fear of another
out-of-space failure. So I took a gamble.  I pulled one of the old 6TB
drives to make room for the second new 16TB drive, brought the array up
in degraded mode and started a "device replace missing" operation onto
the second new drive. 'iostat' showed just what I expected: a burst of
reads from one or more of the three old drives alternating with big
writes to the new drive. The data rates were reasonably consistent with
the I/O bandwidth limitations of my 10-year-old server. When it finished
the next day I pulled the old 6TB drive and replaced it with the second
new 16 TB drive. So far so good.
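(Roughly, and again from memory:

  mount -o degraded /dev/sdd3 /mnt
  btrfs replace start <devid-of-missing> /dev/new2 /mnt

"replace start" takes a devid in place of a source device that's no
longer present; "btrfs filesystem show" lists the devids.)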

I then began another "device replace". Since I wasn't forced to degrade
the array this time, I didn't. It's been several days, and it's nowhere
near half done. As far as I can tell, it's only making headway of maybe
100-200 GB/day, so at this rate it could take several more weeks to finish!
Moreover, when I run 'iostat' I see lots of writes *to* the drive
being replaced, usually in parallel with the same amount of data going
to one of the other drives.
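For the record, I'm watching it with the obvious tools, something
like:

  btrfs replace status /mnt   # overall percent complete
  iostat -x 5                 # per-drive throughput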

I'd expect lots of *reads from* the drive being replaced, but why are
there any writes to it at all? Is this just to keep the filesystem
consistent in case of a crash?

I'd already run data and metadata balance operations up to about 95%.
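(Meaning filtered balances along the lines of:

  btrfs balance start -dusage=95 /mnt
  btrfs balance start -musage=95 /mnt

which is what the "about 95%" refers to.)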

I hesitate to tempt fate by forcing the system down to do another
"device replace missing" operation. Can anyone explain why replacing a
missing device is so much faster than replacing an existing device? Is
it simply because, with no redundancy left against a drive loss, less
work needs to (or can) be done to protect against a crash?

Thanks.

Phil Karn

Here's some current system information.

Linux homer.ka9q.net 4.19.0-8-rt-amd64 #1 SMP PREEMPT RT Debian
4.19.98-1 (2020-01-26) x86_64 GNU/Linux

btrfs-progs v4.20.1
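
(The following is from "btrfs filesystem show" and "btrfs filesystem
df", taken while the replace is running:)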

Label: 'homer-btrfs'  uuid: 0d090428-8af8-4d23-99da-92f7176f82a7

Total devices 5 FS bytes used 9.89TiB
    devid    1 size 5.46TiB used 3.81TiB path /dev/sdd3
    devid    2 size 0.00B used 2.72TiB path /dev/sde3 [device currently being replaced]
    devid    4 size 5.46TiB used 5.10TiB path /dev/sdc3
    devid    5 size 14.32TiB used 6.08TiB path /dev/sdb4
    devid    6 size 14.32TiB used 2.08TiB path /dev/sda4

Data, RAID1: total=9.84TiB, used=9.84TiB
System, RAID1: total=32.00MiB, used=1.73MiB
Metadata, RAID1: total=52.00GiB, used=48.32GiB
GlobalReserve, single: total=512.00MiB, used=0.00B