* Extremely slow device removals
@ 2020-04-28  7:22 Phil Karn
  2020-04-30 17:31 ` Phil Karn
  0 siblings, 1 reply; 39+ messages in thread
From: Phil Karn @ 2020-04-28  7:22 UTC (permalink / raw)
  To: linux-btrfs

I've been running btrfs in RAID1 mode on four 6TB drives for years. They
have 35+K hours (4 years) of running time, and while they're still
passing SMART scans I wanted to stop tempting fate. They were also
starting to get full (about 92%) and performance was beginning to suffer.

My plan: replace them with two new 16TB EXOS (Enterprise) drives from
Seagate.

My first false start was a "device add" of one of the new drives
followed by a "device remove" on an old one. (It'd been a while, and I'd
forgotten "device replace"). This went extremely slowly, and by morning
it had bombed with a message in the kernel log about running out of
space on (I think) the *old* drive. This seemed odd since the new drive
was still mostly empty.

The filesystem also refused to remount right away, but given the furious
drive activity I decided to be patient. The file system mounted by
itself an hour or so later. There were plenty of "task hung" messages in
the kernel log, but they all seemed to be warnings. No lost data. Whew.

By now I remembered "device replace". But I'd already done "device add"
on the first new 16 TB drive. That gave me 5 drives online and no spare
slot for the second new drive.

I didn't want to repeat the "device remove" for fear of another
out-of-space failure. So I took a gamble.  I pulled one of the old 6TB
drives to make room for the second new 16TB drive, brought the array up
in degraded mode and started a "device replace missing" operation onto
the second new drive. 'iostat' showed just what I expected: a burst of
reads from one or more of the three old drives alternating with big
writes to the new drive. The data rates were reasonably consistent with
the I/O bandwidth limitations of my 10-year-old server. When it finished
the next day I pulled the old 6TB drive and replaced it with the second
new 16 TB drive. So far so good.
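For reference, the two migration strategies described above can be sketched as shell commands. Device names and the mount point are hypothetical, and the `run()` wrapper prints each command instead of executing it, so this is a safe dry-run sketch rather than something to paste on a live system:

```shell
#!/bin/sh
# Sketch only: device names and mount point are hypothetical; the run()
# wrapper prints each command instead of executing it.
MNT=/mnt/pool
run() { echo "+ $*"; }

# The false start: grow the array, then migrate data off the old drive.
# "device remove" relocates every block group through the allocator.
run btrfs device add /dev/sdf "$MNT"
run btrfs device remove /dev/sde3 "$MNT"

# The intended path: "replace" copies the old drive's chunks onto the
# new one directly, which is usually much faster.
run btrfs replace start /dev/sde3 /dev/sdf "$MNT"
run btrfs replace status "$MNT"

# Replacing a drive that has already been pulled (degraded mount);
# the first argument to "replace start" is the missing drive's devid.
run mount -o degraded /dev/sdd3 "$MNT"
run btrfs replace start 2 /dev/sdf "$MNT"
```

Drop the wrapper to run the commands for real.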

I then began another "device replace". Since I wasn't forced to degrade
the array this time, I didn't. It's been several days, and it's nowhere
near half done. As far as I can tell, it's only making headway of maybe
100-200 GB/day, so at this rate it might finish in several weeks!
Moreover, when I run 'iostat' I see lots of writes **to** the drive
being replaced, usually in parallel with the same amount of data going
to one of the other drives.

I'd expect lots of *reads from* the drive being replaced, but why are
there any writes to it at all? Is this just to keep the filesystem
consistent in case of a crash?

I'd already run data and metadata balance operations up to about 95%.

I hesitate to tempt fate by forcing the system down to do another
"device replace missing" operation. Can anyone explain why replacing a
missing device is so much faster than replacing an existing device? Is
it simply because, with no redundancy left against a drive loss, less
work needs to (or can) be done to protect against a crash?

Thanks.

Phil Karn

Here's some current system information.

Linux homer.ka9q.net 4.19.0-8-rt-amd64 #1 SMP PREEMPT RT Debian
4.19.98-1 (2020-01-26) x86_64 GNU/Linux

btrfs-progs v4.20.1

Label: 'homer-btrfs'  uuid: 0d090428-8af8-4d23-99da-92f7176f82a7

Total devices 5 FS bytes used 9.89TiB
    devid    1 size 5.46TiB used 3.81TiB path /dev/sdd3
    devid    2 size 0.00B used 2.72TiB path /dev/sde3 [device currently
being replaced]
    devid    4 size 5.46TiB used 5.10TiB path /dev/sdc3
    devid    5 size 14.32TiB used 6.08TiB path /dev/sdb4
    devid    6 size 14.32TiB used 2.08TiB path /dev/sda4

Data, RAID1: total=9.84TiB, used=9.84TiB
System, RAID1: total=32.00MiB, used=1.73MiB
Metadata, RAID1: total=52.00GiB, used=48.32GiB
GlobalReserve, single: total=512.00MiB, used=0.00B




^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Extremely slow device removals
  2020-04-28  7:22 Extremely slow device removals Phil Karn
@ 2020-04-30 17:31 ` Phil Karn
  2020-04-30 18:13   ` Jean-Denis Girard
  2020-04-30 18:40   ` Chris Murphy
  0 siblings, 2 replies; 39+ messages in thread
From: Phil Karn @ 2020-04-30 17:31 UTC (permalink / raw)
  To: linux-btrfs

Any comments on my message about btrfs drive removals being extremely slow?

I started the operation 5 days ago, and as of right now I still have 2.18
TB to move off the drive I'm trying to replace. I think it started
around 3.5 TB.
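For what it's worth, the rate implied by those figures (3.5 TB at the start, 2.18 TB left after 5 days) works out as a quick awk calculation:

```shell
# Back-of-the-envelope rate and ETA from the numbers above.
awk 'BEGIN {
    start = 3.5; left = 2.18; days = 5      # TB, TB, elapsed days
    rate = (start - left) / days            # TB relocated per day
    printf "rate: %.3f TB/day\n", rate
    printf "eta:  %.1f more days at this rate\n", left / rate
}'
```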

Should I reboot degraded without this drive and do a "remove missing"
operation instead? I'm willing to take the risk of losing another drive
during the operation if it'll speed this up. It wouldn't be so bad if it
weren't slowing my filesystem to a crawl for normal stuff, like reading
mail.

Thanks,

Phil






* Re: Extremely slow device removals
  2020-04-30 17:31 ` Phil Karn
@ 2020-04-30 18:13   ` Jean-Denis Girard
  2020-05-01  8:05     ` Phil Karn
  2020-04-30 18:40   ` Chris Murphy
  1 sibling, 1 reply; 39+ messages in thread
From: Jean-Denis Girard @ 2020-04-30 18:13 UTC (permalink / raw)
  To: linux-btrfs

Le 30/04/2020 à 07:31, Phil Karn a écrit :
> I started the operation 5 days ago, and of right now I still have 2.18
> TB to move off the drive I'm trying to replace. I think it started
> around 3.5 TB.

Hi Phil,

I did something similar one month ago. It took less than 4 hours for
1.71 TiB of data:

[xxx@taina ~]$ sudo btrfs replace status /home/SysNux
Started on 21.Mar 11:13:20, finished on 21.Mar 15:06:33, 0 write errs, 0
uncorr. read errs

[xxx@taina ~]$ sudo btrfs fi show /home/SysNux/
Label: none  uuid: c5b8386b-b81d-4473-9340-7b8a74fc3a3c
        Total devices 2 FS bytes used 1.70TiB
        devid    1 size 1.82TiB used 1.71TiB path /dev/bcache2
        devid    2 size 1.82TiB used 1.71TiB path /dev/bcache0

These disks are behind bcache, which may have a positive impact. Also,
my system is Fedora 31 with a 5.5 kernel, much more recent than
yours.

> Should I reboot degraded without this drive and do a "remove missing"
> operation instead? I'm willing to take the risk of losing another drive
> during the operation if it'll speed this up. It wouldn't be so bad if it
> weren't slowing my filesystem to a crawl for normal stuff, like reading
> mail.

No idea, I'm just a (happy) Btrfs user.


Hope this helps,
-- 
Jean-Denis Girard

SysNux                   Systèmes   Linux   en   Polynésie  française
https://www.sysnux.pf/   Tél: +689 40.50.10.40 / GSM: +689 87.797.527






* Re: Extremely slow device removals
  2020-04-30 17:31 ` Phil Karn
  2020-04-30 18:13   ` Jean-Denis Girard
@ 2020-04-30 18:40   ` Chris Murphy
  2020-04-30 19:59     ` Phil Karn
  1 sibling, 1 reply; 39+ messages in thread
From: Chris Murphy @ 2020-04-30 18:40 UTC (permalink / raw)
  To: Phil Karn; +Cc: Btrfs BTRFS

On Thu, Apr 30, 2020 at 11:31 AM Phil Karn <karn@ka9q.net> wrote:
>
> Any comments on my message about btrfs drive removals being extremely slow?

It could be any number of things. Each drive has at least 3
partitions so what else is on these drives? Are those other partitions
active with other things going on at the same time? How are the drives
connected to the computer? Direct SATA/SAS connection? Via USB
enclosures? How many snapshots? Are quotas enabled? There's nothing in
dmesg for 5 days? Anything for the most recent hour? i.e. journalctl
-k --since=-1h
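The checklist above can be gathered with read-only commands. A dry-run sketch (the mount point is an assumption, and `run()` prints rather than executes):

```shell
#!/bin/sh
# Dry-run sketch of the diagnostic questions above.
MNT=/mnt/pool
run() { echo "+ $*"; }

run lsblk -o NAME,SIZE,TYPE,MOUNTPOINT   # what else is on the drives?
run btrfs subvolume list -s "$MNT"       # how many snapshots?
run btrfs qgroup show "$MNT"             # are quotas enabled?
run journalctl -k --since=-1h            # recent kernel messages
```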

It's an old kernel by this list's standards. This list mostly tracks
active development on mainline and stable kernels, not LTS kernels,
so you might have found a bug. But there are thousands of changes
throughout the storage stack in the kernel since then, thousands just
in Btrfs between 4.19 and 5.7, with 5.8 being worked on now. It's a
20+ month development difference.

It's pretty much just luck if an upstream Btrfs developer sees this
and happens to know why it's slow and that it was fixed in X kernel
version; or maybe it's a really old bug that still hasn't gotten a
good enough bug report and hasn't been fixed. That's why the common
advice is to "try with a newer kernel": the problem might not happen,
and if it still does, then chances are it's a bug.

> I started the operation 5 days ago, and of right now I still have 2.18
> TB to move off the drive I'm trying to replace. I think it started
> around 3.5 TB.

Issue sysrq+t and post the output from 'journalctl -k --since=-10m'
in something like pastebin or in a text file on nextcloud/dropbox etc.
It's probably too big to email and usually the formatting gets munged
anyway and is hard to read.

Someone might have an idea why it's slow from sysrq+t but it's a long shot.

> Should I reboot degraded without this drive and do a "remove missing"
> operation instead? I'm willing to take the risk of losing another drive
> during the operation if it'll speed this up. It wouldn't be so bad if it
> weren't slowing my filesystem to a crawl for normal stuff, like reading
> mail.

If there's anything important on this file system, you should make a
copy now. Update backups. You should be prepared to lose the whole
thing before proceeding further.

Next, disable the write cache on all the drives. This can be done with
hdparm -W (cap W, lowercase w is dangerous, see man page). This should
improve the chance of the file system on all drives being consistent
if you have to force reboot - i.e. the reboot might hang so you should
be prepared to issue sysrq+s followed by sysrq+b. Better than power
reset.
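Those precautions might look like the following. Drive names are hypothetical, and since these commands are destructive on a live box, the `run()` wrapper prints each one instead of executing it:

```shell
#!/bin/sh
# Dry-run sketch: disable write caches, plus the emergency-reboot sequence.
run() { echo "+ $*"; }

# hdparm -W0 (capital W) disables the volatile write cache; lowercase -w
# is a different, dangerous option -- see the man page.
for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
    run hdparm -W0 "$d"
done

# If the reboot hangs: emergency sync (sysrq+s), then emergency
# reboot (sysrq+b), via the magic-sysrq interface.
run sh -c 'echo s > /proc/sysrq-trigger'
run sh -c 'echo b > /proc/sysrq-trigger'
```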

We don't know what we don't know so it's a guess what the next step
is. While powered off you can remove devid 2, the device you want
removed. And first see if you can mount -o ro,degraded and check
dmesg, and see if things pass a basic sanity test for reading. Then
remount rw, and try to remove the missing device. It might go faster
to just rebuild the missing data from the single copies left? But
there's not much to go on.

Boot, leave all drives connected, make sure the write caches are
disabled, then make sure there's no SCT ERC mismatch, i.e.
https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

And then do a scrub with all the drives attached. And then assess the
next step only after that completes. It'll either fix something or
not. You can do this same thing with kernel 4.19. It should work. But
until the health of the file system is known, I can't recommend doing
any device replacements or removals. It must be completely healthy
first.
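Sketched as commands (dry run again, hypothetical device names; note that smartctl reports SCT ERC values in tenths of a second, so 70 means 7.0 seconds):

```shell
#!/bin/sh
# Dry-run sketch: SCT ERC check and a full scrub.
MNT=/mnt/pool
run() { echo "+ $*"; }

# Read each drive's SCT ERC (error recovery) timeouts.
for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
    run smartctl -l scterc "$d"
done
# Set both read and write recovery to 7.0s on a drive that supports it:
run smartctl -l scterc,70,70 /dev/sda

# Scrub with all drives attached, then check the outcome.
run btrfs scrub start "$MNT"
run btrfs scrub status "$MNT"
```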

I personally would only do the device removal (either remove while
still connected or remove while missing) with 5.6.8 or 5.7rc3 because
if I have a problem, I'm reporting it on this list as a bug. With 4.19
it's just too old I think for this list, it's pure luck if anyone
knows for sure what's going on.


-- 
Chris Murphy


* Re: Extremely slow device removals
  2020-04-30 18:40   ` Chris Murphy
@ 2020-04-30 19:59     ` Phil Karn
  2020-04-30 20:27       ` Alexandru Dordea
  2020-05-01  2:47       ` Zygo Blaxell
  0 siblings, 2 replies; 39+ messages in thread
From: Phil Karn @ 2020-04-30 19:59 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On 4/30/20 11:40, Chris Murphy wrote:
> It could be any number of things. Each drive has at least 3
> partitions so what else is on these drives? Are those other partitions
> active with other things going on at the same time? How are the drives
> connected to the computer? Direct SATA/SAS connection? Via USB
> enclosures? How many snapshots? Are quotas enabled? There's nothing in
> dmesg for 5 days? Anything for the most recent hour? i.e. journalctl
> -k --since=-1h

Nothing else is going on with these drives. Those other partitions
include things like EFI, manual backups of the root file system on my
SSD, and swap (which is barely used, verified with iostat and swapon -s).

The drives are connected internally with SATA at 3.0 Gb/s (this is an
old motherboard). Still, that's about 300 MB/s after encoding overhead,
much faster than the drives' sustained read/write speeds.

I did get rid of a lot of read-only snapshots while this was running in
hopes this might speed things up. I'm down to 8, and willing to go
lower. No obvious improvement. Would I expect this to help right away,
or does it take time for btrfs to reclaim the space and realize it
doesn't have to be copied?

I've never used quotas; I'm the only user.

There are plenty of messages in dmesg of the form

[482089.101264] BTRFS info (device sdd3): relocating block group
9016340119552 flags data|raid1
[482118.545044] BTRFS info (device sdd3): found 1115 extents
[482297.404024] BTRFS info (device sdd3): found 1115 extents

These appear to be routinely generated by the copy operation. I know
what extents are, but these messages don't really tell me much.

The copy operation appears to be proceeding normally, it's just
extremely, painfully slow. And it's doing an awful lot of writing to the
drive I'm removing, which doesn't seem to make sense. Looking at
'iostat', those writes are almost always done in parallel with another
drive, a pattern I often see (and expect) with raid-1.

>
> It's an old kernel by this list's standards. Mostly this list is
> active development on mainline and stable kernels, not LTS kernels
> which - you might have found a bug. But there's thousands of changes
> throughout the storage stack in the kernel since then, thousands just
> in Btrfs between 4.19 and 5.7 and 5.8 being worked on now. It's a 20+
> month development difference.
>
> It's pretty much just luck if an upstream Btrfs developer sees this
> and happens to know why it's slow and that it was fixed in X kernel
> version or maybe it's a really old bug that just hasn't yet gotten a
> good enough bug report still, and hasn't been fixed. That's why it's
> common advice to "try with a newer kernel" because the problem might
> not happen, and if it does, then chances are it's a bug.
I used to routinely build and install the latest kernels but I got tired
of that. But I could easily do so here if you think it would make a
difference. It would force me to reboot, of course. As long as I'm not
likely to corrupt my file system, I'm willing to do that.
>
>> I started the operation 5 days ago, and of right now I still have 2.18
>> TB to move off the drive I'm trying to replace. I think it started
>> around 3.5 TB.
> Issue sysrq+t and post the output from 'journalctl -k --since=-10m'
> in something like pastebin or in a text file on nextcloud/dropbox etc.
> It's probably too big to email and usually the formatting gets munged
> anyway and is hard to read.
>
> Someone might have an idea why it's slow from sysrq+t but it's a long shot.

I'm operating headless at the moment, but here's journalctl:

-- Logs begin at Fri 2020-04-24 21:49:22 PDT, end at Thu 2020-04-30
12:07:12 PDT. --
Apr 30 12:04:26 homer.ka9q.net kernel: BTRFS info (device sdd3): found
1997 extents
Apr 30 12:04:33 homer.ka9q.net kernel: BTRFS info (device sdd3):
relocating block group 9019561345024 flags data|raid1
Apr 30 12:05:21 homer.ka9q.net kernel: BTRFS info (device sdd3): found
6242 extents

> If there's anything important on this file system, you should make a
> copy now. Update backups. You should be prepared to lose the whole
> thing before proceeding further.
Already done. Kinda goes without saying...
> Next, disable the write cache on all the drives. This can be done with
> hdparm -W (cap W, lowercase w is dangerous, see man page). This should
> improve the chance of the file system on all drives being consistent
> if you have to force reboot - i.e. the reboot might hang so you should
> be prepared to issue sysrq+s followed by sysrq+b. Better than power
> reset.
I did try disabling the write caches. Interestingly there was no obvious
change in write speeds. I turned them back on, but I'll remember to turn
them off before rebooting. Good suggestion.
> Boot, leave all drives connected, make sure the write caches are
> disabled, then make sure there's no SCT ERC mismatch, i.e.
> https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

All drives support SCT. The timeouts *are* different: 10 sec for the new
16TB drives, 7 sec for the older 6 TB drives.

But this shouldn't matter because I'm quite sure all my drives are
healthy. I regularly run both short and long smart tests, and they've
always passed. No drive I/O errors in dmesg, no evidence of any retries
or timeouts. Just lots of small apparently random reads and writes that
execute very slowly. By "small" I mean the ratio of KB_read/s to tps in
'iostat' is small, usually less than 10 KB and often just 4KB.
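That ratio can be computed directly from iostat's device lines. With made-up sample numbers, the average request size is just (kB_read/s + kB_wrtn/s) / tps:

```shell
# Average request size per drive from iostat-style columns:
#   device  tps  kB_read/s  kB_wrtn/s   (synthetic sample data)
printf '%s\n' \
    'sdd 212.0 850.0 10.0' \
    'sde 180.0 20.0 700.0' |
awk '{ printf "%s %.1f KB/request\n", $1, ($3 + $4) / $2 }'
```

A result hovering near 4 KB/request means the drives are doing nearly pure random 4K I/O, which is the worst case for spinning disks.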

Yes, my partitions are properly aligned on 8-LBA (4KB) boundaries.

>
> And then do a scrub with all the drives attached. And then assess the
> next step only after that completes. It'll either fix something or
> not. You can do this same thing with kernel 4.19. It should work. But
> until the health of the file system is known, I can't recommend doing
> any device replacements or removals. It must be completely healthy
> first.
I run manual scrubs every month or so. They've always passed with zero
errors. I don't run them automatically because they take a day and
there's a very noticeable hit on performance. Btrfs (at least the
version I'm running) doesn't seem to know how to run stuff like this at
low priority (yes, I know that's much harder with I/O than with CPU).
>
> I personally would only do the device removal (either remove while
> still connected or remove while missing) with 5.6.8 or 5.7rc3 because
> if I have a problem, I'm reporting it on this list as a bug. With 4.19
> it's just too old I think for this list, it's pure luck if anyone
> knows for sure what's going on.

I can always try the latest kernel (5.6.8 is on kernel.org) as long as
I'm not likely to lose data by rebooting. I do have backups but I'd like
to avoid the lengthy hassle of rebuilding everything from scratch.

Thanks for the suggestions!

Phil





* Re: Extremely slow device removals
  2020-04-30 19:59     ` Phil Karn
@ 2020-04-30 20:27       ` Alexandru Dordea
  2020-04-30 20:58         ` Phil Karn
  2020-05-01  2:47       ` Zygo Blaxell
  1 sibling, 1 reply; 39+ messages in thread
From: Alexandru Dordea @ 2020-04-30 20:27 UTC (permalink / raw)
  To: Phil Karn; +Cc: Chris Murphy, Btrfs BTRFS

Hello,

I’m encountering the same issue with RAID6, and have been for months :)
I have a btrfs RAID6 with 15 x 8TB HDDs and 5 x 14TB. One of the 15 x 8TB drives crashed. I removed the faulty drive, and when I run the delete missing command the system load increases; recovering 6.66TB will take a few months. After 5 days of running, the missing data had decreased only to 6.10TB.
During this period the drives are almost 100% busy and the R/W performance is degraded by more than 95%.

The R/W performance is not impacted when the delete/balance process is not running. (I don’t know whether running balance on a single CPU without multithreading is a feature or a bug, but it's a shame that the process keeps only one CPU out of 48 at 100%.)
No errors, partition is clean.
Mounted with space_cache=v2, no improvement.
Using kernel 5.6.6 with btrfs-progs 5.6 (latest openSUSE Tumbleweed).

        Total devices 20 FS bytes used 134.22TiB
        devid    1 size 7.28TiB used 7.10TiB path /dev/sdg
        devid    2 size 7.28TiB used 7.10TiB path /dev/sdh
        devid    3 size 7.28TiB used 7.10TiB path /dev/sdt
        devid    5 size 7.28TiB used 7.10TiB path /dev/sds
        devid    6 size 7.28TiB used 7.10TiB path /dev/sdr
        devid    7 size 7.28TiB used 7.10TiB path /dev/sdq
        devid    8 size 7.28TiB used 7.10TiB path /dev/sdp
        devid    9 size 7.28TiB used 7.10TiB path /dev/sdo
        devid   10 size 7.28TiB used 7.10TiB path /dev/sdn
        devid   11 size 7.28TiB used 7.10TiB path /dev/sdm
        devid   12 size 7.28TiB used 7.10TiB path /dev/sdl
        devid   13 size 7.28TiB used 7.10TiB path /dev/sdk
        devid   14 size 7.28TiB used 7.10TiB path /dev/sdj
        devid   15 size 7.28TiB used 7.10TiB path /dev/sdi
        devid   16 size 12.73TiB used 11.86TiB path /dev/sdc
        devid   17 size 12.73TiB used 11.86TiB path /dev/sdf
        devid   18 size 12.73TiB used 11.86TiB path /dev/sde
        devid   19 size 12.73TiB used 11.86TiB path /dev/sdb
        devid   20 size 12.73TiB used 11.86TiB path /dev/sdd
        *** Some devices missing

After spending months troubleshooting and trying to recover without having my server unavailable for months, I’m about to give up :)


> On Apr 30, 2020, at 22:59, Phil Karn <karn@ka9q.net> wrote:
> 
> On 4/30/20 11:40, Chris Murphy wrote:
>> It could be any number of things. Each drive has at least 3
>> partitions so what else is on these drives? Are those other partitions
>> active with other things going on at the same time? How are the drives
>> connected to the computer? Direct SATA/SAS connection? Via USB
>> enclosures? How many snapshots? Are quotas enabled? There's nothing in
>> dmesg for 5 days? Anything for the most recent hour? i.e. journalctl
>> -k --since=-1h
> 
> Nothing else is going on with these drives. Those other partitions
> include things like EFI, manual backups of the root file system on my
> SSD, and swap (which is barely used, verified with iostat and swapon -s).
> 
> The drives are connected internally with SATA at 3.0 Gb/s (this is an
> old motherboard). Still, this is 375 MB/s, much faster than the drives'
> sustained read/write speeds.
> 
> I did get rid of a lot of read-only snapshots while this was running in
> hopes this might speed things up. I'm down to 8, and willing to go
> lower. No obvious improvement. Would I expect this to help right away,
> or does it take time for btrfs to reclaim the space and realize it
> doesn't have to be copied?
> 
> I've never used quotas; I'm the only user.
> 
> There are plenty of messages in dmesg of the form
> 
> [482089.101264] BTRFS info (device sdd3): relocating block group
> 9016340119552 flags data|raid1
> [482118.545044] BTRFS info (device sdd3): found 1115 extents
> [482297.404024] BTRFS info (device sdd3): found 1115 extents
> 
> These appear to be routinely generated by the copy operation. I know
> what extents are, but these messages don't really tell me much.
> 
> The copy operation appears to be proceeding normally, it's just
> extremely, painfully slow. And it's doing an awful lot of writing to the
> drive I'm removing, which doesn't seem to make sense. Looking at
> 'iostat', those writes are almost always done in parallel with another
> drive, a pattern I often see (and expect) with raid-1.
> 
>> 
>> It's an old kernel by this list's standards. Mostly this list is
>> active development on mainline and stable kernels, not LTS kernels
>> which - you might have found a bug. But there's thousands of changes
>> throughout the storage stack in the kernel since then, thousands just
>> in Btrfs between 4.19 and 5.7 and 5.8 being worked on now. It's a 20+
>> month development difference.
>> 
>> It's pretty much just luck if an upstream Btrfs developer sees this
>> and happens to know why it's slow and that it was fixed in X kernel
>> version or maybe it's a really old bug that just hasn't yet gotten a
>> good enough bug report still, and hasn't been fixed. That's why it's
>> common advice to "try with a newer kernel" because the problem might
>> not happen, and if it does, then chances are it's a bug.
> I used to routinely build and install the latest kernels but I got tired
> of that. But I could easily do so here if you think it would make a
> difference. It would force me to reboot, of course. As long as I'm not
> likely to corrupt my file system, I'm willing to do that.
>> 
>>> I started the operation 5 days ago, and of right now I still have 2.18
>>> TB to move off the drive I'm trying to replace. I think it started
>>> around 3.5 TB.
>> Issue sysrq+t and post the output from 'journalctl -k --since=-10m'
>> in something like pastebin or in a text file on nextcloud/dropbox etc.
>> It's probably too big to email and usually the formatting gets munged
>> anyway and is hard to read.
>> 
>> Someone might have an idea why it's slow from sysrq+t but it's a long shot.
> 
> I'm operating headless at the moment, but here's journalctl:
> 
> -- Logs begin at Fri 2020-04-24 21:49:22 PDT, end at Thu 2020-04-30
> 12:07:12 PDT. --
> Apr 30 12:04:26 homer.ka9q.net kernel: BTRFS info (device sdd3): found
> 1997 extents
> Apr 30 12:04:33 homer.ka9q.net kernel: BTRFS info (device sdd3):
> relocating block group 9019561345024 flags data|raid1
> Apr 30 12:05:21 homer.ka9q.net kernel: BTRFS info (device sdd3): found
> 6242 extents
> 
>> If there's anything important on this file system, you should make a
>> copy now. Update backups. You should be prepared to lose the whole
>> thing before proceeding further.
> Already done. Kinda goes without saying...
>> Next, disable the write cache on all the drives. This can be done with
>> hdparm -W (cap W, lowercase w is dangerous, see man page). This should
>> improve the chance of the file system on all drives being consistent
>> if you have to force reboot - i.e. the reboot might hang so you should
>> be prepared to issue sysrq+s followed by sysrq+b. Better than power
>> reset.
> I did try disabling the write caches. Interestingly there was no obvious
> change in write speeds. I turned them back on, but I'll remember to turn
> them off before rebooting. Good suggestion.
>> Boot, leave all drives connected, make sure the write caches are
>> disabled, then make sure there's no SCT ERC mismatch, i.e.
>> https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
> 
> All drives support SCT. The timeouts *are* different: 10 sec for the new
> 16TB drives, 7 sec for the older 6 TB drives.
> 
> But this shouldn't matter because I'm quite sure all my drives are
> healthy. I regularly run both short and long smart tests, and they've
> always passed. No drive I/O errors in dmesg, no evidence of any retries
> or timeouts. Just lots of small apparently random reads and writes that
> execute very slowly. By "small" I mean the ratio of KB_read/s to tps in
> 'iostat' is small, usually less than 10 KB and often just 4KB.
> 
> Yes, my partitions are properly aligned on 8-LBA (4KB) boundaries.
> 
>> 
>> And then do a scrub with all the drives attached. And then assess the
>> next step only after that completes. It'll either fix something or
>> not. You can do this same thing with kernel 4.19. It should work. But
>> until the health of the file system is known, I can't recommend doing
>> any device replacements or removals. It must be completely healthy
>> first.
> I run manual scrubs every month or so. They've always passed with zero
> errors. I don't run them automatically because they take a day and
> there's a very noticeable hit on performance. Btrfs (at least the
> version I'm running) doesn't seem to know how to run stuff like this at
> low priority (yes, I know that's much harder with I/O than with CPU).
>> 
>> I personally would only do the device removal (either remove while
>> still connected or remove while missing) with 5.6.8 or 5.7rc3 because
>> if I have a problem, I'm reporting it on this list as a bug. With 4.19
>> it's just too old I think for this list, it's pure luck if anyone
>> knows for sure what's going on.
> 
> I can always try the latest kernel (5.6.8 is on kernel.org) as long as
> I'm not likely to lose data by rebooting. I do have backups but I'd like
> to avoid the lengthy hassle of rebuilding everything from scratch.
> 
> Thanks for the suggestions!
> 
> Phil
> 
> 
> 



* Re: Extremely slow device removals
  2020-04-30 20:27       ` Alexandru Dordea
@ 2020-04-30 20:58         ` Phil Karn
  0 siblings, 0 replies; 39+ messages in thread
From: Phil Karn @ 2020-04-30 20:58 UTC (permalink / raw)
  To: Alexandru Dordea; +Cc: Chris Murphy, Btrfs BTRFS

On 4/30/20 13:27, Alexandru Dordea wrote:
> Hello,
>
> I’m encountering the same issue with Raid6 for months :)
> I have a BTRFS raid6 with 15x8TB HDD’s and 5 x 14TB. One of the 15 x 8TB crashed. I have removed the faulty drive and if I’m running the delete missing command the sys-load is increasing and to recover 6.66TB will take few months. After 5 days of running the missing data decreased to -6.10.
> During this period the drivers are almost 100% and the R/W performance is degraded with more than 95%.
I see I have company, and that a more recent kernel has the same problem.
>
> The R/W performance is not impacted if the process of delete/balance is not running. (Don’t know if running balance on a single CPU without multithread is a feature or a bug but it's a shame that the process is keeping only one CPU out of 48 at 100%).

I'm using RAID-1 rather than 6, but for me there's very little CPU
consumption. Nor would I expect there to be, since the work is all in
the copying. I have 8 cores and am running Folding at Home (Covid-19
drug discovery) on 6 of them, but there seems to be plenty of CPU
available; idle time is consistently 12-15%. Still, I tried pausing FAH.
There was no discernible effect on the btrfs remove/copy operation, nor
would I expect there to be since FAH is entirely CPU-bound and the
remove/copy is entirely I/O bound. The 'btrfs remove' command uses
relatively little CPU and always shows as waiting on disk in the 'top'
command. Same for the kernel worker threads.

But just in case, I've scaled FAH back to 3 threads to see what happens.

I'm thinking maybe it's time to go back to dm-raid for RAID-1 and keep
btrfs only for its snapshot feature. Integrating RAID into the file
system seemed like a really good idea at the time, but snapshots alone
are still worth it.

When I ran XFS above dm-raid1, I'd periodically pull one drive, put it
in the safe, replace it with a blank drive and let it rebuild. This gave
me a full image backup, and the rebuild copy went at full disk speed
though it did have to copy unused disk space. But based on what I'm seeing
now, that's preferable. A full disk image copy at sequential disk speed
is still much faster than copying only the used blocks in semi-random
order and shaking the hell out of my drives.
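That rotation can be sketched with mdadm (hypothetical array and partition names; dry run, since these commands degrade a live array):

```shell
#!/bin/sh
# Dry-run sketch of rotating one RAID1 member out as an image backup.
run() { echo "+ $*"; }

# Cleanly detach one half of the mirror and put it in the safe.
run mdadm /dev/md0 --fail /dev/sdb1
run mdadm /dev/md0 --remove /dev/sdb1

# Add a blank drive; the resync is a sequential whole-device copy.
run mdadm /dev/md0 --add /dev/sdc1
run sh -c 'cat /proc/mdstat'    # watch resync progress
```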

I wonder if putting bcache and a SSD between btrfs and my drives would
help...? How about a few hundred GB of RAM (I have only 12)?

--Phil





* Re: Extremely slow device removals
  2020-04-30 19:59     ` Phil Karn
  2020-04-30 20:27       ` Alexandru Dordea
@ 2020-05-01  2:47       ` Zygo Blaxell
  2020-05-01  4:48         ` Phil Karn
  1 sibling, 1 reply; 39+ messages in thread
From: Zygo Blaxell @ 2020-05-01  2:47 UTC (permalink / raw)
  To: Phil Karn; +Cc: Chris Murphy, Btrfs BTRFS

On Thu, Apr 30, 2020 at 12:59:29PM -0700, Phil Karn wrote:
> On 4/30/20 11:40, Chris Murphy wrote:
> > It could be any number of things. Each drive has at least 3
> > partitions so what else is on these drives? Are those other partitions
> > active with other things going on at the same time? How are the drives
> > connected to the computer? Direct SATA/SAS connection? Via USB
> > enclosures? How many snapshots? Are quotas enabled? There's nothing in
> > dmesg for 5 days? Anything for the most recent hour? i.e. journalctl
> > -k --since=-1h
> 
> Nothing else is going on with these drives. Those other partitions
> include things like EFI, manual backups of the root file system on my
> SSD, and swap (which is barely used, verified with iostat and swapon -s).
> 
> The drives are connected internally with SATA at 3.0 Gb/s (this is an
> old motherboard). Still, this is 375 MB/s, much faster than the drives'
> sustained read/write speeds.
> 
> I did get rid of a lot of read-only snapshots while this was running in
> hopes this might speed things up. I'm down to 8, and willing to go
> lower. No obvious improvement. Would I expect this to help right away,
> or does it take time for btrfs to reclaim the space and realize it
> doesn't have to be copied?
> 
> I've never used quotas; I'm the only user.
> 
> There are plenty of messages in dmesg of the form
> 
> [482089.101264] BTRFS info (device sdd3): relocating block group
> 9016340119552 flags data|raid1
> [482118.545044] BTRFS info (device sdd3): found 1115 extents
> [482297.404024] BTRFS info (device sdd3): found 1115 extents
> 
> These appear to be routinely generated by the copy operation. I know
> what extents are, but these messages don't really tell me much.

If it keeps repeating "found 1115 extents" over and over (say 5 or
more times) then you're hitting the balance looping bug in kernel 5.1
and later.  Every N block groups (N seems to vary by user, I've heard
reports from 3 to over 6000) the kernel will get stuck in a loop and
will need to reboot to recover.  Even if you cancel the balance, it will
just loop again until rebooted, and there's no cancel for device delete
so if you start looping there you can just skip directly to the reboot.
For a non-trivial filesystem the probability of successfully deleting
or resizing a device is more or less zero.

There is no fix for that regression yet.  Kernel 4.19 doesn't have the
regression and does have other relevant bug fixes for balance, so it
can be used as a workaround.
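
(As an illustrative sketch, not an official diagnostic: the looping signature can be spotted in the kernel log with a small shell function. The 5-repeat threshold is just the rule of thumb from the paragraph above.)

```shell
# Reads kernel log text on stdin and reports whether the same
# "found N extents" line repeats 5 or more times consecutively.
check_balance_loop() {
    awk '
        match($0, /found [0-9]+ extents/) {
            cur = substr($0, RSTART, RLENGTH)
            run = (cur == prev) ? run + 1 : 1
            prev = cur
            if (run > max) { max = run; hot = cur }
        }
        END {
            if (max >= 5)
                printf "possible balance loop: \"%s\" repeated %d times\n", hot, max
            else
                printf "no loop detected (max consecutive repeats: %d)\n", max + 0
        }'
}
```

Typical use on a live system: journalctl -k --since=-1h | check_balance_loop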

> The copy operation appears to be proceeding normally, it's just
> extremely, painfully slow. And it's doing an awful lot of writing to the
> drive I'm removing, which doesn't seem to make sense. Looking at
> 'iostat', those writes are almost always done in parallel with another
> drive, a pattern I often see (and expect) with raid-1.
> 
> >
> > It's an old kernel by this list's standards. Mostly this list is
> > active development on mainline and stable kernels, not LTS kernels
> > which - you might have found a bug. But there's thousands of changes
> > throughout the storage stack in the kernel since then, thousands just
> > in Btrfs between 4.19 and 5.7 and 5.8 being worked on now. It's a 20+
> > month development difference.
> >
> > It's pretty much just luck if an upstream Btrfs developer sees this
> > and happens to know why it's slow and that it was fixed in X kernel
> > version or maybe it's a really old bug that just hasn't yet gotten a
> > good enough bug report still, and hasn't been fixed. That's why it's
> > common advice to "try with a newer kernel" because the problem might
> > not happen, and if it does, then chances are it's a bug.
> I used to routinely build and install the latest kernels but I got tired
> of that. But I could easily do so here if you think it would make a
> difference. It would force me to reboot, of course. As long as I'm not
> likely to corrupt my file system, I'm willing to do that.
> 
> >> I started the operation 5 days ago, and of right now I still have 2.18
> >> TB to move off the drive I'm trying to replace. I think it started
> >> around 3.5 TB.
> > Issue sysrq+t and post the output from 'journalctl -k --since=-10m'
> > in something like pastebin or in a text file on nextcloud/dropbox etc.
> > It's probably too big to email and usually the formatting gets munged
> > anyway and is hard to read.
> >
> > Someone might have an idea why it's slow from sysrq+t but it's a long shot.
> 
> I'm operating headless at the moment, but here's journalctl:
> 
> -- Logs begin at Fri 2020-04-24 21:49:22 PDT, end at Thu 2020-04-30
> 12:07:12 PDT. --
> Apr 30 12:04:26 homer.ka9q.net kernel: BTRFS info (device sdd3): found
> 1997 extents
> Apr 30 12:04:33 homer.ka9q.net kernel: BTRFS info (device sdd3):
> relocating block group 9019561345024 flags data|raid1
> Apr 30 12:05:21 homer.ka9q.net kernel: BTRFS info (device sdd3): found
> 6242 extents
> 
> > If there's anything important on this file system, you should make a
> > copy now. Update backups. You should be prepared to lose the whole
> > thing before proceeding further.
> Already done. Kinda goes without saying...
> > Next, disable the write cache on all the drives. This can be done with
> > hdparm -W (cap W, lowercase w is dangerous, see man page). This should
> > improve the chance of the file system on all drives being consistent
> > if you have to force reboot - i.e. the reboot might hang so you should
> > be prepared to issue sysrq+s followed by sysrq+b. Better than power
> > reset.
> I did try disabling the write caches. Interestingly there was no obvious
> change in write speeds. I turned them back on, but I'll remember to turn
> them off before rebooting. Good suggestion.
> > Boot, leave all drives connected, make sure the write caches are
> > disabled, then make sure there's no SCT ERC mismatch, i.e.
> > https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
> 
> All drives support SCT. The timeouts *are* different: 10 sec for the new
> 16TB drives, 7 sec for the older 6 TB drives.
> 
> But this shouldn't matter because I'm quite sure all my drives are
> healthy. I regularly run both short and long smart tests, and they've
> always passed. No drive I/O errors in dmesg, no evidence of any retries
> or timeouts. Just lots of small apparently random reads and writes that
> execute very slowly. By "small" I mean the ratio of KB_read/s to tps in
> 'iostat' is small, usually less than 10 KB and often just 4KB.
> 
> Yes, my partitions are properly aligned on 8-LBA (4KB) boundaries.
> 
> >
> > And then do a scrub with all the drives attached. And then assess the
> > next step only after that completes. It'll either fix something or
> > not. You can do this same thing with kernel 4.19. It should work. But
> > until the health of the file system is known, I can't recommend doing
> > any device replacements or removals. It must be completely healthy
> > first.
> I run manual scrubs every month or so. They've always passed with zero
> errors. I don't run them automatically because they take a day and
> there's a very noticeable hit on performance. Btrfs (at least the
> version I'm running) doesn't seem to know how to run stuff like this at
> low priority (yes, I know that's much harder with I/O than with CPU).
> >
> > I personally would only do the device removal (either remove while
> > still connected or remove while missing) with 5.6.8 or 5.7rc3 because
> > if I have a problem, I'm reporting it on this list as a bug. With 4.19
> > it's just too old I think for this list, it's pure luck if anyone
> > knows for sure what's going on.
> 
> I can always try the latest kernel (5.6.8 is on kernel.org) as long as
> I'm not likely to lose data by rebooting. I do have backups but I'd like
> to avoid the lengthy hassle of rebuilding everything from scratch.
> 
> Thanks for the suggestions!
> 
> Phil
> 
> 
> 


* Re: Extremely slow device removals
  2020-05-01  2:47       ` Zygo Blaxell
@ 2020-05-01  4:48         ` Phil Karn
  2020-05-01  6:05           ` Alexandru Dordea
  0 siblings, 1 reply; 39+ messages in thread
From: Phil Karn @ 2020-05-01  4:48 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Btrfs BTRFS

On 4/30/20 19:47, Zygo Blaxell wrote:
>
> If it keeps repeating "found 1115 extents" over and over (say 5 or
> more times) then you're hitting the balance looping bug in kernel 5.1
> and later.  Every N block groups (N seems to vary by user, I've heard
> reports from 3 to over 6000) the kernel will get stuck in a loop and
> will need to reboot to recover.  Even if you cancel the balance, it will
> just loop again until rebooted, and there's no cancel for device delete
> so if you start looping there you can just skip directly to the reboot.
> For a non-trivial filesystem the probability of successfully deleting
> or resizing a device is more or less zero.

This does not seem to be happening. Each message is for a different
block group with a different number of clusters. The device remove *is*
making progress, just very very slowly. I'm almost down to just 2TB
left. Woot!

If I ever have to do this again, I'll insert bcache and a big SSD
between btrfs and my devices. The slowness here has to be due to the
(spinning) disk I/O being highly fragmented and random. I've checked,
and none of my drives (despite their large sizes) are shingled, so
that's not it. The 6 TB units have 128 MB caches and the 16 TB have 256
MB caches.
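
(Should it come to that, the bcache layering would look roughly like this; device names are hypothetical placeholders, and make-bcache destroys existing contents:)

```
make-bcache -C /dev/nvme0n1p1        # SSD partition becomes the cache set
make-bcache -B /dev/sda3             # HDD partition becomes the backing device
bcache-super-show /dev/nvme0n1p1 | grep cset.uuid     # note the cache-set UUID
echo <cset-uuid> > /sys/block/bcache0/bcache/attach   # attach cache to backing dev
echo writeback > /sys/block/bcache0/bcache/cache_mode # cache writes too (riskier)
mkfs.btrfs /dev/bcache0
```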

I've never understood *exactly* what a hard drive internal cache does. I
see little sense in a LRU cache just like the host's own buffer cache
since the host has far more RAM. I do know they're used to reorder
operations to reduce seek latency, though this can be limited by the
need to fence writes to protect against a crash. I've wondered if
they're also used on reads to reduce rotational latency by prospectively
grabbing data as soon as the heads land on a cylinder. How big is a
"cylinder" anyway? The inner workings of hard drives have become
steadily more opaque over the years, which makes it difficult to
optimize their use. Kinda like CPUs, actually. Last time I really tuned
up some tight code, I found that using vector instructions and avoiding
branch mispredictions made a big difference but nothing else seemed to
matter at all.

>
> There is no fix for that regression yet.  Kernel 4.19 doesn't have the
> regression and does have other relevant bug fixes for balance, so it
> can be used as a workaround.

I'm running 4.19.0-8-rt-amd64, the current real-time kernel in Debian
'stable'.

Phil





* Re: Extremely slow device removals
  2020-05-01  4:48         ` Phil Karn
@ 2020-05-01  6:05           ` Alexandru Dordea
  2020-05-01  7:29             ` Phil Karn
  0 siblings, 1 reply; 39+ messages in thread
From: Alexandru Dordea @ 2020-05-01  6:05 UTC (permalink / raw)
  To: Phil Karn; +Cc: Zygo Blaxell, Chris Murphy, Btrfs BTRFS

Don’t get me wrong, the single 100% CPU is only during the balance process.
Running "btrfs device delete missing /storage" has no impact on CPU/RAM. I have 64 GB of DDR4 ECC, but RAM usage never exceeds 3 GB.

I see that @Chris Murphy mentioned that disabling the cache will impact performance. Did you try that?
On my devices the cache is enabled, and so far this is the only thing I haven't tried :)

# hdparm -W /dev/sdc
/dev/sdc:
 write-caching =  1 (on)
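
(To check or toggle the write cache across all members in one go; the device glob is an example:)

```
for d in /dev/sd[a-e]; do hdparm -W "$d"; done   # report current state on each drive
hdparm -W 0 /dev/sdc   # disable write caching on one drive (capital W!)
hdparm -W 1 /dev/sdc   # re-enable it
```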


> On May 1, 2020, at 07:48, Phil Karn <karn@ka9q.net> wrote:
> 
> On 4/30/20 19:47, Zygo Blaxell wrote:
>> 
>> If it keeps repeating "found 1115 extents" over and over (say 5 or
>> more times) then you're hitting the balance looping bug in kernel 5.1
>> and later.  Every N block groups (N seems to vary by user, I've heard
>> reports from 3 to over 6000) the kernel will get stuck in a loop and
>> will need to reboot to recover.  Even if you cancel the balance, it will
>> just loop again until rebooted, and there's no cancel for device delete
>> so if you start looping there you can just skip directly to the reboot.
>> For a non-trivial filesystem the probability of successfully deleting
>> or resizing a device is more or less zero.
> 
> This does not seem to be happening. Each message is for a different
> block group with a different number of clusters. The device remove *is*
> making progress, just very very slowly. I'm almost down to just 2TB
> left. Woot!
> 
> If I ever have to do this again, I'll insert bcache and a big SSD
> between btrfs and my devices. The slowness here has to be due to the
> (spinning) disk I/O being highly fragmented and random. I've checked,
> and none of my drives (despite their large sizes) are shingled, so
> that's not it. The 6 TB units have 128 MB caches and the 16 TB have 256
> MB caches.
> 
> I've never understood *exactly* what a hard drive internal cache does. I
> see little sense in a LRU cache just like the host's own buffer cache
> since the host has far more RAM. I do know they're used to reorder
> operations to reduce seek latency, though this can be limited by the
> need to fence writes to protect against a crash. I've wondered if
> they're also used on reads to reduce rotational latency by prospectively
> grabbing data as soon as the heads land on a cylinder. How big is a
> "cylinder" anyway? The inner workings of hard drives have become
> steadily more opaque over the years, which makes it difficult to
> optimize their use. Kinda like CPUs, actually. Last time I really tuned
> up some tight code, I found that using vector instructions and avoiding
> branch mispredictions made a big difference but nothing else seemed to
> matter at all.
> 
>> 
>> There is no fix for that regression yet.  Kernel 4.19 doesn't have the
>> regression and does have other relevant bug fixes for balance, so it
>> can be used as a workaround.
> 
> I'm running 4.19.0-8-rt-amd64, the current real-time kernel in Debian
> 'stable'.
> 
> Phil
> 
> 
> 



* Re: Extremely slow device removals
  2020-05-01  6:05           ` Alexandru Dordea
@ 2020-05-01  7:29             ` Phil Karn
  2020-05-02  4:18               ` Zygo Blaxell
  0 siblings, 1 reply; 39+ messages in thread
From: Phil Karn @ 2020-05-01  7:29 UTC (permalink / raw)
  To: Alexandru Dordea; +Cc: Zygo Blaxell, Chris Murphy, Btrfs BTRFS

On 4/30/20 23:05, Alexandru Dordea wrote:
> Don’t get me wrong, the single 100% CPU is only during the balance process.
> Running "btrfs device delete missing /storage" has no impact on CPU/RAM. I have 64 GB of DDR4 ECC, but RAM usage never exceeds 3 GB.
3 GB used for what? Does that include the system buffer cache?
>
> I see that @Chris Murphy mentioned that disabling the cache will impact performance. Did you try that?
> On my devices the cache is enabled, and so far this is the only thing I haven't tried :)


It didn't seem to make an obvious difference, which surprised me a
little since the I/O seems so random. Maybe btrfs is already sticking a
lot of fences (barriers) into the write stream so the drive can't do
much reordering anyway?

I've always left write caching enabled in my drives since my system is
plugged into reliable power. I assume the only reason to turn it off is
to reduce the chance of filesystem corruption in case I have to force
the machine to reboot while the operation is still going.

Down to only 1.99 TB now! Wow!

--Phil





* Re: Extremely slow device removals
  2020-04-30 18:13   ` Jean-Denis Girard
@ 2020-05-01  8:05     ` Phil Karn
  2020-05-02  3:35       ` Zygo Blaxell
  0 siblings, 1 reply; 39+ messages in thread
From: Phil Karn @ 2020-05-01  8:05 UTC (permalink / raw)
  To: Jean-Denis Girard, linux-btrfs

On 4/30/20 11:13, Jean-Denis Girard wrote:
>
> Hi Phil,
>
> I did something similar one month ago. It took less than 4 hours for
> 1.71 TiB of data:
>
> [xxx@taina ~]$ sudo btrfs replace status /home/SysNux
> Started on 21.Mar 11:13:20, finished on 21.Mar 15:06:33, 0 write errs, 0
> uncorr. read errs

I just realized you did a *replace* rather than a *remove*. When I did a
replace on another drive, it also went much faster. It must copy the
data from the old drive to the new one in larger and/or more contiguous
chunks. It's only the remove operation that's painfully slow.

Phil




* Re: Extremely slow device removals
  2020-05-01  8:05     ` Phil Karn
@ 2020-05-02  3:35       ` Zygo Blaxell
       [not found]         ` <CAMwB8mjUw+KV8mxg8ynPsv0sj5vSpwG7_khw=oP5n+SnPYzumQ@mail.gmail.com>
                           ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Zygo Blaxell @ 2020-05-02  3:35 UTC (permalink / raw)
  To: Phil Karn; +Cc: Jean-Denis Girard, linux-btrfs

On Fri, May 01, 2020 at 01:05:20AM -0700, Phil Karn wrote:
> On 4/30/20 11:13, Jean-Denis Girard wrote:
> >
> > Hi Phil,
> >
> > I did something similar one month ago. It took less than 4 hours for
> > 1.71 TiB of data:
> >
> > [xxx@taina ~]$ sudo btrfs replace status /home/SysNux
> > Started on 21.Mar 11:13:20, finished on 21.Mar 15:06:33, 0 write errs, 0
> > uncorr. read errs
> 
> I just realized you did a *replace* rather than a *remove*. When I did a
> replace on another drive, it also went much faster. It must copy the
> data from the old drive to the new one in larger and/or more contiguous
> chunks. It's only the remove operation that's painfully slow.

"Replace" is a modified form of scrub which assumes that you want to
reconstruct an entire drive instead of verifying an existing one.
It reads and writes all the blocks roughly in physical disk order,
and doesn't need to update any metadata since it's not changing any of
the data as it passes through.

"Delete" is resize to 0 followed by remove the empty device.  Resize
requires relocating all data onto other disks--or other locations on
the same disk--one extent at a time, and updating all of the reference
pointers in the filesystem.

The difference in speed can be several orders of magnitude.
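
(In command form, with example device names and mount point, the contrast is:)

```
# Fast path: scrub-style copy in roughly physical disk order.
btrfs replace start /dev/sdb3 /dev/sdf1 /mnt/pool
btrfs replace status /mnt/pool

# Slow path: shrink-to-zero relocation, one extent at a time.
# ("delete" is an alias for "remove".)
btrfs device remove /dev/sdb3 /mnt/pool
```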

> Phil
> 
> 


* Re: Extremely slow device removals
  2020-05-01  7:29             ` Phil Karn
@ 2020-05-02  4:18               ` Zygo Blaxell
  2020-05-02  4:48                 ` Phil Karn
                                   ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Zygo Blaxell @ 2020-05-02  4:18 UTC (permalink / raw)
  To: Phil Karn; +Cc: Alexandru Dordea, Chris Murphy, Btrfs BTRFS

On Fri, May 01, 2020 at 12:29:50AM -0700, Phil Karn wrote:
> On 4/30/20 23:05, Alexandru Dordea wrote:
> > Don’t get me wrong, the single 100% CPU is only during the balance process.
> > Running "btrfs device delete missing /storage" has no impact on CPU/RAM. I have 64 GB of DDR4 ECC, but RAM usage never exceeds 3 GB.
> 3 GB used for what? Does that include the system buffer cache?
> >
> > I see that @Chris Murphy mentioned that disabling the cache will impact performance. Did you try that?
> > On my devices the cache is enabled, and so far this is the only thing I haven't tried :)
> 
> 
> It didn't seem to make an obvious difference, which surprised me a
> little since the I/O seems so random. Maybe btrfs is already sticking a
> lot of fences (barriers) into the write stream so the drive can't do
> much reordering anyway?

btrfs can send gigabytes of metadata IO per minute to a drive, enough to
overwhelm even the largest device write caches.  So even if you use 100%
of a 256MB drive's on-board RAM as write cache, the following gigabytes of
a large metadata update won't get much benefit from caching.  The drive
will be stuck a quarter gigabyte behind the host, trying to catch up
all the time.

Also, in large delete operations, half of the IOs are random _reads_,
which can't be optimized by write caching.  The writes are mostly
sequential, so they take less IO time.  So, say, 1% of the IO time
is made 80% faster by write caching, for a net benefit of 0.8% (not real
numbers).  Write caching helps fsync() performance and not much else.

A writeback SSD cache can have a significant beneficial effect on latency
until it gets full, but if it's not big enough to hold the metadata it
won't be very helpful; in the worst case it will make btrfs slower.

> I've always left write caching enabled in my drives since my system is
> plugged into reliable power. I assume the only reason to turn it off is
> to reduce the chance of filesystem corruption in case I have to force
> the machine to reboot while the operation is still going.

The big surprise for write caches is what happens when the drive
gets a UNC sector.  Some drive firmwares work properly under normal
and power-loss conditions, but immediately drop the contents of the
write cache when they see an unreadable block.  This turns an otherwise
completely survivable error--a small number of consecutive bad sectors
on a single-disk filesystem--into a btrfs filesystem damaged beyond repair.

In this event, metadata writes will be dropped in both copies of dup
metadata, but the error is not reported to btrfs by the drive firmware
because it happens after the drive has reported successful completion of
the relevant flush command to the host.  Write caching in drives without
command queueing assumes that the drive will be able to complete the flush
command before a power failure interrupts it.  Usually firmware doesn't
take sector read retries or external vibration events into account, but
those events also prevent the drive from implementing any further write
commands from the host, so write ordering is preserved.  In the UNC case,
the firmware drops the write cache and also keeps accepting new write
commands, which is a bug--it should do at most one of those two things.
The result is unrecoverable metadata loss on btrfs.

Reliable power and crash avoidance won't help in this case--the filesystem
will die while it's still mounted.  If you find out the hard way that
you have a drive with firmware that does this, the only recourse is to
turn off write caching (and make sure it stays off), mkfs, and start
restoring backups.

> Down to only 1.99 TB now! Wow!
> 
> --Phil
> 
> 
> 


* Re: Extremely slow device removals
       [not found]         ` <CAMwB8mjUw+KV8mxg8ynPsv0sj5vSpwG7_khw=oP5n+SnPYzumQ@mail.gmail.com>
@ 2020-05-02  4:31           ` Zygo Blaxell
  0 siblings, 0 replies; 39+ messages in thread
From: Zygo Blaxell @ 2020-05-02  4:31 UTC (permalink / raw)
  To: Phil Karn; +Cc: Jean-Denis Girard, Btrfs BTRFS

On Fri, May 01, 2020 at 09:12:11PM -0700, Phil Karn wrote:
>    Thanks for the explanations of what replace and delete ("remove"?)
>    actually do; that's helpful. I'm still puzzled as to why there was so much
>    write activity to the drive I was removing; can you explain that?
>    My hand was ultimately forced today. The device remove running since last
>    weekend bombed out with a "no space" message in the kernel log despite
>    there being plenty of free space on all devices. The file system had been

Debugging metadata space reservations is an activity that was started 7
years ago and has maybe 2-5 years of debugging left.  (half-;)

>    remounted read-only. When I brought it back up, the mount system call
>    blocked while it underwent what was apparently a lengthy file system
>    check. (I got one message about a block group free space cache being
>    rebuilt). 

OK, you aren't using space_cache=v2 yet.  Unmount the filesystem and
mount with -o space_cache=v2.  It will take a long time to build the
cache (up to an hour per TB), but once it has, it should get a lot faster.
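
(Concretely, with an example device and mount point; the option persists after the first mount:)

```
umount /mnt/pool
mount -o space_cache=v2 /dev/sdd3 /mnt/pool   # first mount rebuilds the cache, slowly
grep space_cache /proc/mounts                 # confirm v2 is active
```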

>    It really doesn't seem like such a good idea for a really basic
>    system call like "mount" to block indefinitely during system boot. 

mount has to take apart any relocation tree that may have been in
progress during shutdown (up to one block group's worth of metadata),
so it can take an arbitrary amount of time (though not usually more than
40 minutes, unless you're mounting with space_cache=v2 for the first time).

>    systemd
>    eventually gives up, but it does take a while. Lots and lots of stack
>    traces in dmesg about system calls blocking for more than 120 sec. Usually
>    mount, but also sd-sync when trying to shut the system down gracefully.
>    Eventually I was forced to hit hard reset.
>    These blocking mounts make it kinda painful to get a root shell just so
>    you can see what's going on. This is why I'll never put a root filesystem
>    on btrfs. I keep my root filesystems in XFS or ext4 on a SSD so I can at
>    least pull all the other drives and boot up single user fairly quickly.

I had a few systems configured that way--then some disk failures happened,
and it turns out that in the real world, ext4 is as fragile as btrfs but
lacks the integrity measurement and self-repair features.  So now I just
put two root filesystems on each machine, then if something bad happens
to the primary, the secondary root can be easily selected from a grub
"rescue" option.

>    I'll manually rsync the root file system onto a spare disk partition as a
>    backup.
>    Before rebooting I physically pulled the drive I was trying to replace and
>    set noauto in /etc/fstab on the btrfs fs. Back in multi-user mode at last,
>    I did a mount with degraded enabled and got the expected message about the
>    missing device (confirming I pulled the right one). It's still madly doing
>    I/O, but since it's not telling me what's going on (and the mount has not
>    completed) I have to assume from the I/O patterns that it's continuing the
>    device remove without it being physically present. I guess if I'm lucky
>    I'll be able to use my filesystem in a week or so. I do have a backup but
>    I'd rather not touch it except as a last resort.
>    On Fri, May 1, 2020 at 8:35 PM Zygo Blaxell
>    <[1]ce3g8jdj@umail.furryterror.org> wrote:
> 
>      On Fri, May 01, 2020 at 01:05:20AM -0700, Phil Karn wrote:
>      > On 4/30/20 11:13, Jean-Denis Girard wrote:
>      > >
>      > > Hi Phil,
>      > >
>      > > I did something similar one month ago. It took less than 4 hours for
>      > > 1.71 TiB of data:
>      > >
>      > > [xxx@taina ~]$ sudo btrfs replace status /home/SysNux
>      > > Started on 21.Mar 11:13:20, finished on 21.Mar 15:06:33, 0 write
>      errs, 0
>      > > uncorr. read errs
>      >
>      > I just realized you did a *replace* rather than a *remove*. When I did
>      a
>      > replace on another drive, it also went much faster. It must copy the
>      > data from the old drive to the new one in larger and/or more
>      contiguous
>      > chunks. It's only the remove operation that's painfully slow.
> 
>      "Replace" is a modified form of scrub which assumes that you want to
>      reconstruct an entire drive instead of verifying an existing one.
>      It reads and writes all the blocks roughly in physical disk order,
>      and doesn't need to update any metadata since it's not changing any of
>      the data as it passes through.
> 
>      "Delete" is resize to 0 followed by remove the empty device.  Resize
>      requires relocating all data onto other disks--or other locations on
>      the same disk--one extent at a time, and updating all of the reference
>      pointers in the filesystem.
> 
>      The difference in speed can be several orders of magnitude.
> 
>      > Phil
>      >
>      >
> 
> References
> 
>    Visible links
>    1. mailto:ce3g8jdj@umail.furryterror.org


* Re: Extremely slow device removals
  2020-05-02  4:18               ` Zygo Blaxell
@ 2020-05-02  4:48                 ` Phil Karn
  2020-05-02  5:00                 ` Phil Karn
  2020-05-03  2:28                 ` Phil Karn
  2 siblings, 0 replies; 39+ messages in thread
From: Phil Karn @ 2020-05-02  4:48 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Alexandru Dordea, Chris Murphy, Btrfs BTRFS

Thanks for the additional information. I know it's inherently hard to
recover from some errors when caching is enabled, like an acknowledged
write that can't later be flushed to stable storage. But I had no idea
that some drives do gratuitous stuff like dropping the whole write
cache on a read error. That does seem pretty indefensible. Has anybody
gotten a response from the vendors? Does anybody keep a list of the
drives and firmware versions that do this?

(Resending because the list requires plain text)

--Phil


* RE: Extremely slow device removals
  2020-05-02  3:35       ` Zygo Blaxell
       [not found]         ` <CAMwB8mjUw+KV8mxg8ynPsv0sj5vSpwG7_khw=oP5n+SnPYzumQ@mail.gmail.com>
@ 2020-05-02  4:48         ` Paul Jones
  2020-05-02  5:25           ` Phil Karn
  2020-05-02  6:00           ` Zygo Blaxell
  2020-05-02  4:49         ` Phil Karn
  2 siblings, 2 replies; 39+ messages in thread
From: Paul Jones @ 2020-05-02  4:48 UTC (permalink / raw)
  To: Zygo Blaxell, Phil Karn; +Cc: Jean-Denis Girard, linux-btrfs

> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org <linux-btrfs-
> owner@vger.kernel.org> On Behalf Of Zygo Blaxell
> Sent: Saturday, 2 May 2020 1:35 PM
> To: Phil Karn <karn@ka9q.net>
> Cc: Jean-Denis Girard <jd.girard@sysnux.pf>; linux-btrfs@vger.kernel.org
> Subject: Re: Extremely slow device removals
> 
> On Fri, May 01, 2020 at 01:05:20AM -0700, Phil Karn wrote:
> > On 4/30/20 11:13, Jean-Denis Girard wrote:
> > >
> > > Hi Phil,
> > >
> > > I did something similar one month ago. It took less than 4 hours for
> > > 1.71 TiB of data:
> > >
> > > [xxx@taina ~]$ sudo btrfs replace status /home/SysNux Started on
> > > 21.Mar 11:13:20, finished on 21.Mar 15:06:33, 0 write errs, 0
> > > uncorr. read errs
> >
> > I just realized you did a *replace* rather than a *remove*. When I did
> > a replace on another drive, it also went much faster. It must copy the
> > data from the old drive to the new one in larger and/or more
> > contiguous chunks. It's only the remove operation that's painfully slow.
> 
> "Replace" is a modified form of scrub which assumes that you want to
> reconstruct an entire drive instead of verifying an existing one.
> It reads and writes all the blocks roughly in physical disk order, and doesn't
> need to update any metadata since it's not changing any of the data as it
> passes through.
> 
> "Delete" is resize to 0 followed by remove the empty device.  Resize requires
> relocating all data onto other disks--or other locations on the same disk--one
> extent at a time, and updating all of the reference pointers in the filesystem.
> 
> The difference in speed can be several orders of magnitude.

Delete seems to work like a balance. I've had a totally unbalanced raid1 array, and after removing a single almost-full drive all the remaining drives were magically 50% full, down from 90% and up from 10%. It's a bit stressful when there is a missing disk, as you can only delete a missing disk, not replace it.
It would be nice if BTRFS had some more smarts, so it knew when to "balance" data and when to simply move/copy a single copy of data.


Paul.


* Re: Extremely slow device removals
  2020-05-02  3:35       ` Zygo Blaxell
       [not found]         ` <CAMwB8mjUw+KV8mxg8ynPsv0sj5vSpwG7_khw=oP5n+SnPYzumQ@mail.gmail.com>
  2020-05-02  4:48         ` Paul Jones
@ 2020-05-02  4:49         ` Phil Karn
  2 siblings, 0 replies; 39+ messages in thread
From: Phil Karn @ 2020-05-02  4:49 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Jean-Denis Girard, Btrfs BTRFS

My hand was ultimately forced today. The device remove running since
last weekend bombed out with a "no space" message in the kernel log
despite there being plenty of free space on all devices. The file
system had been remounted read-only. When I brought it back up, the
mount system call blocked while it underwent what was apparently a
lengthy file system check. (I got one message about a block group free
space cache being rebuilt). It really doesn't seem like such a good
idea for a really basic system call like "mount" to block indefinitely
during system boot. systemd eventually gives up, but it does take a
while. Lots and lots of stack traces in dmesg about system calls
blocking for more than 120 sec. Usually mount, but also sd-sync when
trying to shut the system down gracefully. Eventually I was forced to
hit hard reset.

These blocking mounts make it kinda painful to get a root shell just
so you can see what's going on. This is why I'll never put a root
filesystem on btrfs. I keep my root filesystems in XFS or ext4 on a
SSD so I can at least pull all the other drives and boot up single
user fairly quickly. I'll manually rsync the root file system onto a
spare disk partition as a backup.

Before rebooting I physically pulled the drive I was trying to replace
and set noauto in /etc/fstab on the btrfs fs. Back in multi-user mode
at last, I did a mount with degraded enabled and got the expected
message about the missing device (confirming I pulled the right one).
It's still madly doing I/O, but since it's not telling me what's going
on (and the mount has not completed) I have to assume from the I/O
patterns that it's continuing the device remove without it being
physically present. I guess if I'm lucky I'll be able to use my
filesystem in a week or so. I do have a backup but I'd rather not
touch it except as a last resort.
(resending)
--Phil


* Re: Extremely slow device removals
  2020-05-02  4:18               ` Zygo Blaxell
  2020-05-02  4:48                 ` Phil Karn
@ 2020-05-02  5:00                 ` Phil Karn
  2020-05-03  2:28                 ` Phil Karn
  2 siblings, 0 replies; 39+ messages in thread
From: Phil Karn @ 2020-05-02  5:00 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Alexandru Dordea, Chris Murphy, Btrfs BTRFS

> So even if you use 100% of a 256MB drive's on-board RAM as write cache, the following gigabytes of
> a large metadata update won't get much benefit from caching.  The drive
> will be stuck a quarter gigabyte behind the host, trying to catch up
> all the time.

Well, at least in theory the cache should still make the drive go
faster by reordering the writes to minimize seek and rotational
latency. Only the drive knows its own layout and latencies, so only
the drive can do this optimization. I suppose that for a long time you
could assume that writes to LBAs with large absolute differences would
be slower than writes to LBAs with small ones, but I'm not so sure
even that's true anymore. Especially with shingled drives (though my
drives are not shingled, and I know that's a whole other discussion).
In any event, only the drive knows for sure.

Oh, question about transaction queuing: can each transaction on the
queue consist of any number of LBAs as long as they're consecutive? I
am trying to figure out if 4Kn (4 KiB native sectors) buys you anything
over I/O on a 512e drive in multiples of 8 properly aligned LBAs. If
each transaction can be of any number of LBAs, it's hard to see any
real benefit to 4Kn except that it saves 3 bits in the LBA number.


* Re: Extremely slow device removals
  2020-05-02  4:48         ` Paul Jones
@ 2020-05-02  5:25           ` Phil Karn
  2020-05-02  6:04             ` Remi Gauvin
  2020-05-02  7:20             ` Zygo Blaxell
  2020-05-02  6:00           ` Zygo Blaxell
  1 sibling, 2 replies; 39+ messages in thread
From: Phil Karn @ 2020-05-02  5:25 UTC (permalink / raw)
  To: Paul Jones; +Cc: Zygo Blaxell, Jean-Denis Girard, linux-btrfs

I'm still not sure I understand what "balance" really does. I've run
it quite a few times, with increasing percentage limits as
recommended, but my drives never end up with equal amounts of data.
Maybe that's because I've got an oddball configuration involving
drives of two different sizes and (temporarily at least) an odd number
of drives. It *sounds* like it ought to do what you describe, but what
I read sounds more like an internal defragmentation operation on data
and metadata storage areas. Is it both?

On Fri, May 1, 2020 at 9:48 PM Paul Jones <paul@pauljones.id.au> wrote:

> Delete seems to work like a balance. I've had a totally unbalanced raid 1 array and after removing a single almost full drive all the remaining drives are magically 50% full, down from 90% and up from 10%. It's a bit stressful when there is a missing disk as you can only delete a missing disk, not replace it.
> It would be nice if BTRFS had some more smarts so it knows when to "balance" data, and when to simply "move/copy" a single copy of data.
>
>
> Paul.


* Re: Extremely slow device removals
  2020-05-02  4:48         ` Paul Jones
  2020-05-02  5:25           ` Phil Karn
@ 2020-05-02  6:00           ` Zygo Blaxell
  2020-05-02  6:23             ` Paul Jones
  1 sibling, 1 reply; 39+ messages in thread
From: Zygo Blaxell @ 2020-05-02  6:00 UTC (permalink / raw)
  To: Paul Jones; +Cc: Phil Karn, Jean-Denis Girard, linux-btrfs

On Sat, May 02, 2020 at 04:48:42AM +0000, Paul Jones wrote:
> > -----Original Message-----
> > From: linux-btrfs-owner@vger.kernel.org <linux-btrfs-
> > owner@vger.kernel.org> On Behalf Of Zygo Blaxell
> > Sent: Saturday, 2 May 2020 1:35 PM
> > To: Phil Karn <karn@ka9q.net>
> > Cc: Jean-Denis Girard <jd.girard@sysnux.pf>; linux-btrfs@vger.kernel.org
> > Subject: Re: Extremely slow device removals
> > 
> > On Fri, May 01, 2020 at 01:05:20AM -0700, Phil Karn wrote:
> > > On 4/30/20 11:13, Jean-Denis Girard wrote:
> > > >
> > > > Hi Phil,
> > > >
> > > > I did something similar one month ago. It took less than 4 hours for
> > > > 1.71 TiB of data:
> > > >
> > > > [xxx@taina ~]$ sudo btrfs replace status /home/SysNux Started on
> > > > 21.Mar 11:13:20, finished on 21.Mar 15:06:33, 0 write errs, 0
> > > > uncorr. read errs
> > >
> > > I just realized you did a *replace* rather than a *remove*. When I did
> > > a replace on another drive, it also went much faster. It must copy the
> > > data from the old drive to the new one in larger and/or more
> > > contiguous chunks. It's only the remove operation that's painfully slow.
> > 
> > "Replace" is a modified form of scrub which assumes that you want to
> > reconstruct an entire drive instead of verifying an existing one.
> > It reads and writes all the blocks roughly in physical disk order, and doesn't
> > need to update any metadata since it's not changing any of the data as it
> > passes through.
> > 
> > "Delete" is resize to 0 followed by remove the empty device.  Resize requires
> > relocating all data onto other disks--or other locations on the same disk--one
> > extent at a time, and updating all of the reference pointers in the filesystem.
> > 
> > The difference in speed can be several orders of magnitude.
> 
> Delete seems to work like a balance. I've had a totally unbalanced
> raid 1 array and after removing a single almost full drive all the
> remaining drives are magically 50% full, down from 90% and up from
> 10%. It's a bit stressful when there is a missing disk as you can only
> delete a missing disk, not replace it.

Huh?  Replacing missing disks is what btrfs replace is _for_.

> It would be nice if BTRFS had some more smarts so it knows when to "balance" data, and when to simply "move/copy" a single copy of data. 
> 
> 
> Paul.


* Re: Extremely slow device removals
  2020-05-02  5:25           ` Phil Karn
@ 2020-05-02  6:04             ` Remi Gauvin
  2020-05-02  7:20             ` Zygo Blaxell
  1 sibling, 0 replies; 39+ messages in thread
From: Remi Gauvin @ 2020-05-02  6:04 UTC (permalink / raw)
  To: linux-btrfs



On 2020-05-02 1:25 a.m., Phil Karn wrote:
> I'm still not sure I understand what "balance" really does. I've run
> it quite a few times, with increasing percentage limits as
> recommended, but my drives never end up with equal amounts of data.
> Maybe that's because I've got an oddball configuration involving
> drives of two different sizes and (temporarily at least) an odd number
> of drives. It *sounds* like it ought to do what you describe, but what
> I read sounds more like an internal defragmentation operation on data
> and metadata storage areas. Is it both?

BTRFS tries to balance the free space, not the Used space.  When the
drives are balanced, the Unallocated space across the drives should be
even.





* RE: Extremely slow device removals
  2020-05-02  6:00           ` Zygo Blaxell
@ 2020-05-02  6:23             ` Paul Jones
  2020-05-02  7:20               ` Phil Karn
  0 siblings, 1 reply; 39+ messages in thread
From: Paul Jones @ 2020-05-02  6:23 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

> -----Original Message-----
> From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
> Sent: Saturday, 2 May 2020 4:01 PM
> To: Paul Jones <paul@pauljones.id.au>
> Cc: Phil Karn <karn@ka9q.net>; Jean-Denis Girard <jd.girard@sysnux.pf>;
> linux-btrfs@vger.kernel.org
> Subject: Re: Extremely slow device removals
> 
> >
> > Delete seems to work like a balance. I've had a totally unbalanced
> > raid 1 array and after removing a single almost full drive all the
> > remaining drives are magically 50% full, down from 90% and up from
> > 10%. It's a bit stressful when there is a missing disk as you can only
> > delete a missing disk, not replace it.
> 
> Huh?  Replacing missing disks is what btrfs replace is _for_.

Oh I see where I went wrong. I used a command like 
btrfs replace start missing /dev/sdf1 /mnt
when I should have used
btrfs replace start 999 /dev/sdf1 /mnt

Yet another quirk of BTRFS...


* Re: Extremely slow device removals
  2020-05-02  6:23             ` Paul Jones
@ 2020-05-02  7:20               ` Phil Karn
  2020-05-02  7:42                 ` Zygo Blaxell
  2020-05-02  7:43                 ` Jukka Larja
  0 siblings, 2 replies; 39+ messages in thread
From: Phil Karn @ 2020-05-02  7:20 UTC (permalink / raw)
  To: Paul Jones; +Cc: Zygo Blaxell, linux-btrfs

So I'm trying to figure out the advantage of including RAID 1 inside
btrfs instead of just running it over a conventional (fs-agnostic)
RAID subsystem.

I was originally really intrigued by the idea of integrating RAID into
the file system since it seemed like you could do more that way, or at
least do things more efficiently. For example, when adding or
replacing a mirror you'd only have to copy those parts of the disk
that actually contain data. That promised better performance. But if
those actually-used blocks are copied in small pieces and in random
order so the operation is far slower than the logical equivalent of
"dd if=disk1 of=disk2", then what's left?

Even the ability to use drives of different sizes isn't unique to
btrfs. You can use LVM to concatenate smaller volumes into larger
logical ones.

Phil


* Re: Extremely slow device removals
  2020-05-02  5:25           ` Phil Karn
  2020-05-02  6:04             ` Remi Gauvin
@ 2020-05-02  7:20             ` Zygo Blaxell
  2020-05-02  7:27               ` Phil Karn
  1 sibling, 1 reply; 39+ messages in thread
From: Zygo Blaxell @ 2020-05-02  7:20 UTC (permalink / raw)
  To: Phil Karn; +Cc: Paul Jones, Jean-Denis Girard, linux-btrfs

On Fri, May 01, 2020 at 10:25:44PM -0700, Phil Karn wrote:
> I'm still not sure I understand what "balance" really does. I've run
> it quite a few times, with increasing percentage limits as
> recommended, but my drives never end up with equal amounts of data.
> Maybe that's because I've got an oddball configuration involving
> drives of two different sizes and (temporarily at least) an odd number
> of drives. It *sounds* like it ought to do what you describe, but what
> I read sounds more like an internal defragmentation operation on data
> and metadata storage areas. Is it both?

btrfs balance is mostly used to _free_ space (its other major use case
is to convert raid profiles).  Some raid levels (the mirroring and single
profiles) allocate with the goal of equalizing free space on all drives,
others (the striping profiles) equalize occupied space on all drives.
raid10 does a bit of both. 

If you have a mixed striping and mirroring profile (e.g. raid5 data with
raid1 metadata) then two opposing allocation policies happen at once, and
the space used and free on each disk is determined by the two algorithms
fighting it out.

balance coalesces free space areas into larger contiguous chunks by
reallocating all the existing data as if you had copied the files and
deleted the originals in logical extent order.  Sometimes people call this
"defrag free space" but the use of the word "defrag" can be confusing.

balance is not btrfs defrag.  defrag is concerned with making data extents
contiguous, while balance is concerned with making free space contiguous.
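
The "increasing percentage limits" mentioned earlier in the thread are
balance's usage filter (e.g. `btrfs balance start -dusage=N`). A toy
model can make the "defrag free space" effect concrete; this is an
illustrative sketch, not btrfs's actual relocation logic, and it
assumes fixed 1 GiB data chunks:

```python
CHUNK = 1024  # MiB; btrfs data chunks are typically 1 GiB

def usage_filtered_balance(chunks, usage_pct):
    """chunks: list of used MiB per allocated chunk.  Relocate every
    chunk whose used fraction is at or below usage_pct, packing the
    moved extents into as few new chunks as possible and returning
    the emptied chunks to the unallocated pool."""
    keep = [u for u in chunks if u * 100 > usage_pct * CHUNK]
    moved = sum(u for u in chunks if u * 100 <= usage_pct * CHUNK)
    full, rem = divmod(moved, CHUNK)
    return keep + [CHUNK] * full + ([rem] if rem else [])

before = [1024, 100, 200, 300, 1024]        # two full chunks, three sparse
after = usage_filtered_balance(before, 30)  # like -dusage=30
print(len(before), "->", len(after))        # 5 allocated chunks -> 3
assert sum(before) == sum(after)            # no data leaves the filesystem
```

Same data, fewer allocated chunks: the emptied chunks become
contiguous unallocated space, while the files' own extents may be no
less fragmented than before.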

> On Fri, May 1, 2020 at 9:48 PM Paul Jones <paul@pauljones.id.au> wrote:
> 
> > Delete seems to work like a balance. I've had a totally unbalanced
> raid 1 array and after removing a single almost full drive all the
> remaining drives are magically 50% full, down from 90% and up from
> 10%. It's a bit stressful when there is a missing disk as you can only
> delete a missing disk, not replace it.
> > It would be nice if BTRFS had some more smarts so it knows when to
> "balance" data, and when to simply "move/copy" a single copy of data.
> >
> >
> > Paul.


* Re: Extremely slow device removals
  2020-05-02  7:20             ` Zygo Blaxell
@ 2020-05-02  7:27               ` Phil Karn
  2020-05-02  7:52                 ` Zygo Blaxell
  0 siblings, 1 reply; 39+ messages in thread
From: Phil Karn @ 2020-05-02  7:27 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Paul Jones, Jean-Denis Girard, linux-btrfs

> deleted the originals in logical extent order.  Sometimes people call this
> "defrag free space" but the use of the word "defrag" can be confusing.
>
> balance is not btrfs defrag.  defrag is concerned with making data extents
> contiguous, while balance is concerned with making free space contiguous.

Got it. I actually would have understood "defrag free space" and that
it differed from file defragmentation (btrfs defrag, xfs_fsr,
e4defrag, etc).  "Balance" confused me.

How do you balance free space when you've got drives of unequal sizes,
like my (current) case of a 4-drive array consisting of two 16-TB
drives and two 6-TB drives?

Phil


* Re: Extremely slow device removals
  2020-05-02  7:20               ` Phil Karn
@ 2020-05-02  7:42                 ` Zygo Blaxell
  2020-05-02  8:22                   ` Phil Karn
  2020-05-02  7:43                 ` Jukka Larja
  1 sibling, 1 reply; 39+ messages in thread
From: Zygo Blaxell @ 2020-05-02  7:42 UTC (permalink / raw)
  To: Phil Karn; +Cc: Paul Jones, linux-btrfs

On Sat, May 02, 2020 at 12:20:42AM -0700, Phil Karn wrote:
> So I'm trying to figure out the advantage of including RAID 1 inside
> btrfs instead of just running it over a conventional (fs-agnostic)
> RAID subsystem.
> 
> I was originally really intrigued by the idea of integrating RAID into
> the file system since it seemed like you could do more that way, or at
> least do things more efficiently. For example, when adding or
> replacing a mirror you'd only have to copy those parts of the disk
> that actually contain data. That promised better performance. But if
> those actually-used blocks are copied in small pieces and in random
> order so the operation is far slower than the logical equivalent of
> "dd if=disk1 of=disk2", 

If you use btrfs replace to move data between drives then you get all
the advantages you describe.  Don't do 'device remove' if you can possibly
avoid it.

Array reshapes in btrfs are currently slower than they need to be, but
there's no on-disk-format reason why they can't be as fast as replace
in many cases.

> then what's left?

If there's data corruption on one disk, btrfs can detect it and replace
the lost data from the good copy.  Most block-level raid1's have a 50%
chance of corrupting the good copy with the bad one, and can only report
corruption as a difference in content between the drives (i.e. you have
to guess which is correct), if they bother to report corruption at all.

This allows you to take advantage of diverse redundant storage (e.g.
raid1 pairs composed of disks made by different vendors).  In btrfs
raid1, heterogeneous drive firmware maximizes the chance of having one
bug-free firmware, and scrub will tell you exactly which drive is bad.
In other raid1 implementations, a heterogeneous raid1 pair maximizes the
chance of one firmware in the array having a bug that corrupts the data
on the good drives, and doesn't tell you which drive is bad.

btrfs does not rely on lower level hardware for error detection, so you
can use cheap SSDs and SD cards that have no firmware capability to detect
or report the flash equivalent of UNC sectors as btrfs raid1 members.
I usually can squeeze about six months of extra life out of some very
substandard storage hardware with btrfs.

> Even the ability to use drives of different sizes isn't unique to
> btrfs. You can use LVM to concatenate smaller volumes into larger
> logical ones.
> 
> Phil


* Re: Extremely slow device removals
  2020-05-02  7:20               ` Phil Karn
  2020-05-02  7:42                 ` Zygo Blaxell
@ 2020-05-02  7:43                 ` Jukka Larja
  1 sibling, 0 replies; 39+ messages in thread
From: Jukka Larja @ 2020-05-02  7:43 UTC (permalink / raw)
  To: linux-btrfs

Phil Karn kirjoitti 2.5.2020 klo 10.20:

> So I'm trying to figure out the advantage of including RAID 1 inside
> btrfs instead of just running it over a conventional (fs-agnostic)
> RAID subsystem.
> 
> I was originally really intrigued by the idea of integrating RAID into
> the file system since it seemed like you could do more that way, or at
> least do things more efficiently. For example, when adding or
> replacing a mirror you'd only have to copy those parts of the disk
> that actually contain data. That promised better performance. But if
> those actually-used blocks are copied in small pieces and in random
> order so the operation is far slower than the logical equivalent of
> "dd if=disk1 of=disk2", then what's left?
> 
> Even the ability to use drives of different sizes isn't unique to
> btrfs. You can use LVM to concatenate smaller volumes into larger
> logical ones.

From the point of view of someone who has set up mdadm just twice, and 
both times needed to start over at some point after messing something up, 
I think one great point of Btrfs is its ease of setup. That's especially 
true when adding or deleting disks semi-regularly. I don't have any 
experience with LVM, but adding it to the mix would probably complicate 
things further.

There are, of course, a lot of more or less "edge cases" that require 
knowing things, or doing sufficient research, before proceeding. Like 
using replace instead of add+delete. And if add+delete is required (I 
recently replaced 6x4TB disks with 2x16TB in an array that also has 
2x10TB and 2x8TB disks), it's better to resize the disk being deleted in 
small increments instead of just deleting it (and I think adding some 
balancing would help too, though I came up with the idea too late), to 
avoid a lot of unnecessary rewrites during the delete, and getting stuck 
with a long-running operation that can't be cancelled.

-- 
      ...Elämälle vierasta toimintaa...
     Jukka Larja, Roskakori@aarghimedes.fi

<saylan> I just set up port forwards to defense.gov
<saylan> anyone scanning me now will be scanning/attacking the DoD :D
<renderbod> O.o
<bolt> that's... not exactly how port forwarding works
<saylan> ?
- Quote Database, http://www.bash.org/?954232 -


* Re: Extremely slow device removals
  2020-05-02  7:27               ` Phil Karn
@ 2020-05-02  7:52                 ` Zygo Blaxell
  0 siblings, 0 replies; 39+ messages in thread
From: Zygo Blaxell @ 2020-05-02  7:52 UTC (permalink / raw)
  To: Phil Karn; +Cc: Paul Jones, Jean-Denis Girard, linux-btrfs

On Sat, May 02, 2020 at 12:27:27AM -0700, Phil Karn wrote:
> > deleted the originals in logical extent order.  Sometimes people call this
> > "defrag free space" but the use of the word "defrag" can be confusing.
> >
> > balance is not btrfs defrag.  defrag is concerned with making data extents
> > contiguous, while balance is concerned with making free space contiguous.
> 
> Got it. I actually would have understood "defrag free space" and that
> it differed from file defragmentation (btrfs defrag, xfs_fsr,
> e4defrag, etc).  "Balance" confused me.
> 
> How do you balance free space when you've got drives of unequal sizes,
> like my (current) case of a 4-drive array consisting of two 16-TB
> drives and two 6-TB drives?

Depends on the RAID profile.  For single, dup, raid1, raid1c3, and raid1c4,
the drives with the most unallocated space are filled first, using devid
to break ties.  For raid0, raid5, and raid6, drives with free space are
filled equally.  raid10 fills disks in even-numbered groups of 4 or more
drives at a time, filling disks with the most unallocated space first.
There are some other rules (e.g. at most 10 disks are used in a single
block group) but they're not relevant at this scale.

raid1 and single profiles would fill the 16TB drives first, until there
was only 6 TB remaining.  At this point all the drives would have equal
free space, then all drives fill equally until they are all full.

raid0 and raid5 would fill all the disks at first, until the 6TB drives are
full and the 16TB drives have 10TB of free space, then they'd fill the
16TB drives the rest of the way.

raid1c3, raid1c4, raid6 and raid10 would fill all the drives at first,
then stop with ENOSPC when the 6TB disks are full and the 16TB disks
have 10TB of free space.  These profiles have a minimum of 3 or more
disks, and you don't have that number of the largest size disks.
Once all the smaller disks are full no further allocation can be done.
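
The most-unallocated-first policy for the mirroring profiles is easy to
check with a toy simulation. This is an illustrative sketch, not the
real chunk allocator (it ignores metadata and uses unit-sized chunks),
but it reproduces the fill order described above:

```python
def raid1_usable(sizes, chunk=1):
    """Greedy sketch of btrfs raid1 allocation: each chunk is mirrored
    onto the two drives with the most unallocated space (ties broken
    by device index).  Returns the total mirrored capacity allocated."""
    free = list(sizes)
    used = 0
    while True:
        a, b = sorted(range(len(free)), key=lambda i: (-free[i], i))[:2]
        if free[b] < chunk:   # fewer than two drives still have room
            return used
        free[a] -= chunk
        free[b] -= chunk
        used += chunk

# Phil's array: two 16 TB and two 6 TB drives (in GB-sized units here).
# The 16 TB pair fills alone until only 6 TB remains free on each; then
# all four drives fill together, giving 22 TB of usable mirrored space.
print(raid1_usable([16000, 16000, 6000, 6000]))  # 22000
```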

> Phil


* Re: Extremely slow device removals
  2020-05-02  7:42                 ` Zygo Blaxell
@ 2020-05-02  8:22                   ` Phil Karn
  2020-05-02  8:24                     ` Phil Karn
  2020-05-02  9:09                     ` Zygo Blaxell
  0 siblings, 2 replies; 39+ messages in thread
From: Phil Karn @ 2020-05-02  8:22 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Paul Jones, linux-btrfs

On Sat, May 2, 2020 at 12:42 AM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:

> If you use btrfs replace to move data between drives then you get all
> the advantages you describe.  Don't do 'device remove' if you can possibly
> avoid it.

But I had to use replace to do what I originally wanted to do: replace
four 6TB drives with two 16TB drives.  I could replace two but I'd
still have to remove two more. I may give up on that latter part for
now, but my original hope was to move everything to a smaller and
especially quieter box than the 10-year-old 4U server I have now
that's banished to the garage because of the noise. (Working on its
console in single-user is much less pleasant than retiring to the
house and using my laptop.) I also wanted to retire all four 6 TB
drives because they have over 35K hours (four years) of continuous run
time. They keep passing their SMART checks but I didn't want to keep
pushing my luck.

> If there's data corruption on one disk, btrfs can detect it and replace
> the lost data from the good copy.

That's a very good point I should have remembered. FS-agnostic RAID
depends on drive-level error detection, and being an early TCP/IP guy
I have always been a fan of end-to-end checks. That said, I can't
remember EVER having one of my drives silently corrupt data. When one
failed, I knew it. (Boy, did I know it.)  I can detect silent
corruption even in my ext4 or xfs file systems because I've been
experimenting for years with stashing SHA file hashes in an extended
attribute and periodically verifying them. This originated as a simple
deduplication tool with the attributes used only as a cache. But I
became intrigued by other uses for file-level hashes, like looking for
a file on a heterogeneous collection of machines by multicasting its
hash, and the aforementioned check for silent corruption. (Yes, I know
btrfs checks automatically, but I won't represent what I'm doing as
anything but purely experimental.)
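
A hash-in-xattr scheme like the one described can be sketched in a few
lines. The attribute name and the choice of SHA-256 below are
illustrative assumptions, not a description of the actual tool; note
that os.setxattr/os.getxattr are Linux-specific and need a filesystem
with user xattr support:

```python
import hashlib
import os

XATTR = b"user.sha256"  # illustrative attribute name

def file_sha256(path, bufsize=1 << 20):
    """Stream the file through SHA-256."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def stamp(path):
    """Cache the file's current hash in an extended attribute."""
    os.setxattr(path, XATTR, file_sha256(path).encode())

def verify(path):
    """Recompute and compare.  False means silent corruption -- or a
    legitimate modification since the last stamp; a real tool would
    also record mtime to tell the two apart."""
    try:
        cached = os.getxattr(path, XATTR).decode()
    except OSError:
        return None  # never stamped
    return cached == file_sha256(path)
```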

I've never seen a btrfs scrub produce errors either except very
quickly on one system with faulty RAM, so I was never going to trust
it with real data anyway. (BTW, I believe strongly in ECC RAM. I can't
understand why it isn't universal given that it costs little more.)

I'm beginning to think I should look at some of the less tightly
coupled ways to provide redundant storage, such as gluster.


* Re: Extremely slow device removals
  2020-05-02  8:22                   ` Phil Karn
@ 2020-05-02  8:24                     ` Phil Karn
  2020-05-02  9:09                     ` Zygo Blaxell
  1 sibling, 0 replies; 39+ messages in thread
From: Phil Karn @ 2020-05-02  8:24 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Paul Jones, linux-btrfs

On Sat, May 2, 2020 at 1:22 AM Phil Karn <karn@ka9q.net> wrote:

> But I had to use replace to do what I originally wanted to do: replace
> four 6TB drives with two 16TB drives.

Excuse me, I meant to say I had to use *remove* to do what I
originally wanted to do...

It's getting late.

--Phil


* Re: Extremely slow device removals
  2020-05-02  8:22                   ` Phil Karn
  2020-05-02  8:24                     ` Phil Karn
@ 2020-05-02  9:09                     ` Zygo Blaxell
  2020-05-02 17:48                       ` Chris Murphy
  1 sibling, 1 reply; 39+ messages in thread
From: Zygo Blaxell @ 2020-05-02  9:09 UTC (permalink / raw)
  To: Phil Karn; +Cc: Paul Jones, linux-btrfs

On Sat, May 02, 2020 at 01:22:25AM -0700, Phil Karn wrote:
> On Sat, May 2, 2020 at 12:42 AM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> 
> > If you use btrfs replace to move data between drives then you get all
> > the advantages you describe.  Don't do 'device remove' if you can possibly
> > avoid it.
> 
> But I had to use replace to do what I originally wanted to do: replace
> four 6TB drives with two 16TB drives.  I could replace two but I'd
> still have to remove two more. I may give up on that latter part for
> now, but my original hope was to move everything to a smaller and
> especially quieter box than the 10-year-old 4U server I have now
> that's banished to the garage because of the noise. (Working on its
> console in single-user is much less pleasant than retiring to the
> house and using my laptop.) I also wanted to retire all four 6 TB
> drives because they have over 35K hours (four years) of continuous run
> time. They keep passing their SMART checks but I didn't want to keep
> pushing my luck.

I replace drives in arrays one at a time, equally spaced over their
warranty period.  The replacements are larger, and that requires 3-6-month
long balances.  I guess the balance time is going to double every 18
months, which means there will come a point where balance takes longer
than simply waiting for the next replacement drive to make a pair of
disks with unallocated space.  I don't want to change the schedule to
replace 2 drives at a time as that increases the probability of correlated
2-disk failure.

> > If there's data corruption on one disk, btrfs can detect it and replace
> > the lost data from the good copy.
> 
> That's a very good point I should have remembered. FS-agnostic RAID
> depends on drive-level error detection, and being an early TCP/IP guy
> I have always been a fan of end-to-end checks. That said, I can't
> remember EVER having one of my drives silently corrupt data. 

Out of ~120 drive models I've tested, I've only seen 5 spinning drives
that silently corrupt data.  One disk got hot enough to emit blue smoke,
another didn't have the smoky drama but did have obvious bit errors in
its DRAM cache.  The rest were drives with firmware bugs, so all the
instances of specific models had identical issues.

On SD/MMC and below-$50 SSDs, silent data corruption is the most common
failure mode.  I don't think these disks are capable of detecting or
reporting individual sector errors.  I've never seen it happen.  They
either fall off the bus or they have a catastrophic failure and give
an error on every single access.

Some drive-level error events leave scars that look like data corruption
to btrfs, e.g.  if the firmware crashes before it can empty its write
cache, or if the Linux timeout is set too low and the kernel resets the
drive before it completes a write.  That's so common on low-end desktop
drives that I stopped buying them (at least the cheap SSDs weren't slow).

> When one
> failed, I knew it. (Boy, did I know it.)  I can detect silent
> corruption even in my ext4 or xfs file systems because I've been
> experimenting for years with stashing SHA file hashes in an extended
> attribute and periodically verifying them. This originated as a simple
> deduplication tool with the attributes used only as a cache. But I
> became intrigued by other uses for file-level hashes, like looking for
> a file on a heterogeneous collection of machines by multicasting its
> hash, and the aforementioned check for silent corruption. (Yes, I know
> btrfs checks automatically, but I won't represent what I'm doing as
> anything but purely experimental.)

Experiment away!  The more redundant hashes, the better.  I found two
btrfs data corruption bugs that way, and the same data makes me confident
that there aren't any more (at least with my current application workload).

> I've never seen a btrfs scrub produce errors either except very
> quickly on one system with faulty RAM, so I was never going to trust
> it with real data anyway. (BTW, I believe strongly in ECC RAM. I can't
> understand why it isn't universal given that it costs little more.)

I've seen one scrub error in a month of testing with a machine that had
known bad RAM.  btrfs had unrecoverable corruption 3 times in the same
interval.

> I'm beginning to think I should look at some of the less tightly
> coupled ways to provide redundant storage, such as gluster.


* Re: Extremely slow device removals
  2020-05-02  9:09                     ` Zygo Blaxell
@ 2020-05-02 17:48                       ` Chris Murphy
  2020-05-03  5:26                         ` Zygo Blaxell
  2020-05-04  2:09                         ` Phil Karn
  0 siblings, 2 replies; 39+ messages in thread
From: Chris Murphy @ 2020-05-02 17:48 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Phil Karn, Paul Jones, linux-btrfs

On Sat, May 2, 2020 at 3:09 AM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On SD/MMC and below-$50 SSDs, silent data corruption is the most common
> failure mode.  I don't think these disks are capable of detecting or
> reporting individual sector errors.  I've never seen it happen.  They
> either fall off the bus or they have a catastrophic failure and give
> an error on every single access.

I'm still curious about the allocator to use for this device class. SD
Cards usually self-report rotational=0. Whereas USB sticks report
rotational=1. The man page seems to suggest nossd or ssd_spread.

In my very limited sample size from a single vendor, I've only seen SD
Card fail by becoming read only. i.e. hardware read-only, with the
kernel spewing sd/mmc related debugging info about the card (or card's
firmware). Maybe that's a good example? I suppose it's better to go
read-only with data still readable, and insofar as Btrfs was concerned
the data was correct, rather than start returning transiently bad
data. However, I only knew this due to data checksums.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Extremely slow device removals
  2020-05-02  4:18               ` Zygo Blaxell
  2020-05-02  4:48                 ` Phil Karn
  2020-05-02  5:00                 ` Phil Karn
@ 2020-05-03  2:28                 ` Phil Karn
  2020-05-04  7:39                   ` Phil Karn
  2 siblings, 1 reply; 39+ messages in thread
From: Phil Karn @ 2020-05-03  2:28 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Alexandru Dordea, Chris Murphy, Btrfs BTRFS

On 5/1/20 21:18, Zygo Blaxell wrote:
>
> Also, in large delete operations, half of the IOs are random _reads_,
> which can't be optimized by write caching.  The writes are mostly
> sequential, so they take less IO time.  So, say, 1% of the IO time
> is made 80% faster by write caching, for a net benefit of 0.8% (not real
> numbers).  Write caching helps fsync() performance and not much else.
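
A toy restatement of the arithmetic in that paragraph (the 1% and 80% figures are Zygo's own illustrative numbers, not measurements):

```python
# Amdahl-style estimate: write caching only helps the cacheable slice of IO time
cacheable_fraction = 0.01   # share of total IO time that is sequential writes
speedup_of_slice = 0.80     # fraction of that slice eliminated by caching
net_benefit = cacheable_fraction * speedup_of_slice
print(f"net benefit: {net_benefit:.1%}")   # → net benefit: 0.8%
```
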
Thanks for everyone's help, but listening to everyone else also talk
about taking weeks or months to delete a drive, with terrible
performance for other applications because of all the background I/O, it
really looks to me like, despite the many theoretical advantages of
integrating RAID into btrfs, it simply doesn't work in the real world
with real spinning drives and their real, significant seek latencies.
Btrfs is too far ahead of the technology; its drive management features
look great until you actually try to use them.

Maybe I can revisit this in a few years when SSDs have displaced
spinning drives and have made seek latencies a thing of the past.
Spinning drives seem to have pretty much hit their technology limits
while SSDs are still making good progress in both size and price.

In the meantime I think I'll return to what I used to use before I tried
btrfs several years ago: XFS over LVM, with LVM working in large
contiguous allocation chunks that can be efficiently copied, moved and
resized on real spinning disks regardless of how the file system above
them allocates and uses them.
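A sketch of that arrangement, with placeholder device and volume names (not a tested recipe):

```shell
# LVM underneath, XFS on top; pvmove can later migrate extents between
# disks in large contiguous chunks while the filesystem stays mounted
pvcreate /dev/sdX1 /dev/sdY1
vgcreate vg0 /dev/sdX1 /dev/sdY1
lvcreate -n data -L 4T vg0
mkfs.xfs /dev/vg0/data
mount /dev/vg0/data /srv/data

# Replacing a disk later: move its extents to the new one, then drop it
pvmove /dev/sdX1 /dev/sdY1
vgreduce vg0 /dev/sdX1
```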

I do give btrfs considerable credit for not (yet) losing any of my data
through all this. But that's what offline backups and LVM snapshots are
also for.

Phil

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Extremely slow device removals
  2020-05-02 17:48                       ` Chris Murphy
@ 2020-05-03  5:26                         ` Zygo Blaxell
  2020-05-03  5:39                           ` Chris Murphy
  2020-05-04  2:09                         ` Phil Karn
  1 sibling, 1 reply; 39+ messages in thread
From: Zygo Blaxell @ 2020-05-03  5:26 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Phil Karn, Paul Jones, linux-btrfs

On Sat, May 02, 2020 at 11:48:18AM -0600, Chris Murphy wrote:
> On Sat, May 2, 2020 at 3:09 AM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > On SD/MMC and below-$50 SSDs, silent data corruption is the most common
> > failure mode.  I don't think these disks are capable of detecting or
> > reporting individual sector errors.  I've never seen it happen.  They
> > either fall off the bus or they have a catastrophic failure and give
> > an error on every single access.
> 
> I'm still curious about the allocator to use for this device class. SD
> Cards usually self-report rotational=0. Whereas USB sticks report
> rotational=1. The man page seems to suggest nossd or ssd_spread.

Use dup metadata on all single-disk filesystems, unless you are making
an intentionally temporary filesystem (like a RAM disk, or a cache with
totally expendable contents).  The correct function for maximizing btrfs
lifetime does not have "rotational" as a parameter.
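
In command form, Zygo's advice comes out roughly as follows (/dev/sdX and /mnt are placeholders; requires btrfs-progs):

```shell
# New single-device filesystem with duplicated metadata, single data
mkfs.btrfs -m dup -d single /dev/sdX

# Or convert the metadata of an existing, mounted filesystem in place
btrfs balance start -mconvert=dup /mnt
```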

> In my very limited sample size from a single vendor, I've only seen SD
> Card fail by becoming read only. i.e. hardware read-only, with the
> kernel spewing sd/mmc related debugging info about the card (or card's
> firmware). Maybe that's a good example? 

Yes, that would be a good example if you can read the card.  Usually
when these devices hit the end of their lives there's nothing left
to read, or big chunks of data are misplaced or missing entirely.

All SSDs eventually end read-only, completely inaccessible, or
otherwise incapable of accepting further writes, if you run them long
enough.  Since it's no longer possible to test the drive's capability
as a storage device after this happens, you can have at most one such
failure per drive.  All the other failure modes can happen multiple times.

Some cheap SSDs will flip a bit (either in data or in a sector address)
at some point during their testable lifetimes.  The same drive can do
this over and over, so the error counts get quite high, and this is
easily the single most common failure event.  Since the drive itself
seems unaware of the errors, it never hits any kind of internal limit
on the number of failures (contrast with UNC sectors, where eventually
the remapping table fills up).  Typical error rates are one sector every
few weeks once the drive is past 50% of its endurance rating, but some
cheap SSDs don't wait for 50% and start corrupting data right away.

Some cheap SSDs fail by dropping off the bus until power-cycled.
Sometimes they corrupt data and drop off the bus at the same time, so
this event can end up being included in the silent data corruption count.
That may produce an elevated silent data corruption count, but silent
data corruption is still the most common event even if all bus drops
are subtracted.

Some cheap SSDs fail by becoming 2 orders of magnitude slower suddenly.
This is rare, and there's no data loss in these events.

Some SSDs detect and report UNC sector errors, either on read operations
or SMART self-tests, which I presume are due to internal data corruption
combined with error checking by the firmware, though they could be
false positives.  Cheap SSDs never do this; it only occurs on drives
outside the cheap-SSD group.

I believe that the cheap SSDs are not capable of detecting or reporting
data corruption errors on individual sectors, given the large number
of opportunities they've been provided to demonstrate this capability
under my observation, and the exactly zero times they've used one.

Most of the above applies to SD/MMC devices as well, except I've never
seen a SD/MMC device that had the UNC sector error detection capability.
They only seem to have the cheap SSD failure modes.

> I suppose it's better to go
> read-only with data still readable, and insofar as Btrfs was concerned
> the data was correct, rather than start returning transiently bad
> data. However, I only knew this due to data checksums.
> 
> 
> -- 
> Chris Murphy
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Extremely slow device removals
  2020-05-03  5:26                         ` Zygo Blaxell
@ 2020-05-03  5:39                           ` Chris Murphy
  2020-05-03  6:05                             ` Chris Murphy
  0 siblings, 1 reply; 39+ messages in thread
From: Chris Murphy @ 2020-05-03  5:39 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Phil Karn, Paul Jones, linux-btrfs

On Sat, May 2, 2020 at 11:26 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Sat, May 02, 2020 at 11:48:18AM -0600, Chris Murphy wrote:
> > On Sat, May 2, 2020 at 3:09 AM Zygo Blaxell
> > <ce3g8jdj@umail.furryterror.org> wrote:
> > >
> > > On SD/MMC and below-$50 SSDs, silent data corruption is the most common
> > > failure mode.  I don't think these disks are capable of detecting or
> > > reporting individual sector errors.  I've never seen it happen.  They
> > > either fall off the bus or they have a catastrophic failure and give
> > > an error on every single access.
> >
> > I'm still curious about the allocator to use for this device class. SD
> > Cards usually self-report rotational=0. Whereas USB sticks report
> > rotational=1. The man page seems to suggest nossd or ssd_spread.
>
> Use dup metadata on all single-disk filesystems, unless you are making
> an intentionally temporary filesystem (like a RAM disk, or a cache with
> totally expendable contents).  The correct function for maximizing btrfs
> lifetime does not have "rotational" as a parameter.

Btrfs defaults need to do the right thing. Currently it's single
metadata for mkfs, and ssd mount option when sysfs reports the device
rotational is false. This applies to eMMC and SD Cards. Whereas USB
sticks report they're rotational for whatever reason, and in that case
the default is DUP and nossd. But I don't know that the rotational
flag is the best basis for choosing an allocator.
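
For anyone curious what their own hardware claims, this dumps the flag mkfs consults (read-only; assumes nothing beyond a mounted /sys):

```shell
# Print each block device's rotational flag as the kernel reports it:
# 1 = rotational (HDDs, and oddly most USB sticks), 0 = non-rotational (SSD/eMMC/SD)
for f in /sys/block/*/queue/rotational; do
    [ -e "$f" ] || continue              # glob matched nothing (no /sys mounted)
    dev=${f#/sys/block/}                 # strip the leading directory...
    dev=${dev%/queue/rotational}         # ...and the trailing path
    echo "$dev: $(cat "$f")"
done
```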


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Extremely slow device removals
  2020-05-03  5:39                           ` Chris Murphy
@ 2020-05-03  6:05                             ` Chris Murphy
  0 siblings, 0 replies; 39+ messages in thread
From: Chris Murphy @ 2020-05-03  6:05 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Zygo Blaxell, Phil Karn, Paul Jones, linux-btrfs

On Sat, May 2, 2020 at 11:39 PM Chris Murphy <lists@colorremedies.com> wrote:
>
> On Sat, May 2, 2020 at 11:26 PM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > On Sat, May 02, 2020 at 11:48:18AM -0600, Chris Murphy wrote:
> > > On Sat, May 2, 2020 at 3:09 AM Zygo Blaxell
> > > <ce3g8jdj@umail.furryterror.org> wrote:
> > > >
> > > > On SD/MMC and below-$50 SSDs, silent data corruption is the most common
> > > > failure mode.  I don't think these disks are capable of detecting or
> > > > reporting individual sector errors.  I've never seen it happen.  They
> > > > either fall off the bus or they have a catastrophic failure and give
> > > > an error on every single access.
> > >
> > > I'm still curious about the allocator to use for this device class. SD
> > > Cards usually self-report rotational=0. Whereas USB sticks report
> > > rotational=1. The man page seems to suggest nossd or ssd_spread.
> >
> > Use dup metadata on all single-disk filesystems, unless you are making
> > an intentionally temporary filesystem (like a RAM disk, or a cache with
> > totally expendable contents).  The correct function for maximizing btrfs
> > lifetime does not have "rotational" as a parameter.
>
> Btrfs defaults need to do the right thing. Currently it's single
> metadata for mkfs, and ssd mount option when sysfs reports the device
> rotational is false. This applies to eMMC and SD Cards. Whereas USB
> sticks report they're rotational for whatever reason, and in that case
> the default is DUP and nossd. But I don't know that the rotational
> flag is the best basis for choosing an allocator.

You address this in another thread: It's a bit unfortunate that
btrfs's default is still to use single metadata on SSD.


All I care about is detection on cheap SSDs. At least in my use cases,
I'm not sure it's worth the extra writes from DUP to be able to
recover from silent corruption.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Extremely slow device removals
  2020-05-02 17:48                       ` Chris Murphy
  2020-05-03  5:26                         ` Zygo Blaxell
@ 2020-05-04  2:09                         ` Phil Karn
  1 sibling, 0 replies; 39+ messages in thread
From: Phil Karn @ 2020-05-04  2:09 UTC (permalink / raw)
  To: Chris Murphy, Zygo Blaxell; +Cc: Paul Jones, linux-btrfs

On 5/2/20 10:48, Chris Murphy wrote:
> In my very limited sample size from a single vendor, I've only seen SD
> Card fail by becoming read only. i.e. hardware read-only, with the
> kernel spewing sd/mmc related debugging info about the card (or card's
> firmware). Maybe that's a good example? I suppose it's better to go
> read-only with data still readable, and insofar as Btrfs was concerned
> the data was correct, rather than start returning transiently bad
> data. However, I only knew this due to data checksums.


I use Raspberry Pis a lot, so I've been forced to acquaint myself with
micro-SD cards. I don't see *that* many failures, but the ones I have
seen are sudden and total, i.e., the card simply doesn't respond anymore.
I think I also saw one suddenly drop to a capacity of 16 MB. Electrical
abuse may have been a factor in some of these failures; in the others
there was no obvious cause.

I do trim them frequently to avoid write amplification.
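
For reference, that trimming is just the stock util-linux tooling (the timer name is the systemd default on Debian-like systems):

```shell
# One-off trim of a mounted filesystem; -v reports how much was discarded
fstrim -v /

# Or schedule it weekly via systemd
systemctl enable --now fstrim.timer
```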

Phil

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Extremely slow device removals
  2020-05-03  2:28                 ` Phil Karn
@ 2020-05-04  7:39                   ` Phil Karn
  0 siblings, 0 replies; 39+ messages in thread
From: Phil Karn @ 2020-05-04  7:39 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Alexandru Dordea, Chris Murphy, Btrfs BTRFS

On 5/2/20 19:28, Phil Karn wrote:
>
> Thanks for everyone's help, but listening to everyone else also talk
> about taking weeks or months to delete a drive, with terrible
> performance for other applications because of all the background I/O, it

After sending this message I built and installed kernel version 5.6.10.
Then I pulled the drive I was trying to remove and retried the 'device
remove' command. To my surprise, it went much faster than before. Still
not nearly as fast as the 'device replace' I ran on another drive, but
it finished in about 12 hours. This was in rescue mode with nothing else
running except sshd so I could watch remotely.

I'm now running a full scrub; so far there hasn't been any damage. The
remaining drives (two new 16TB and two old 6TB) are still very
unbalanced. That's a job for another day, but I don't think I'll have
the energy to remove any more drives from my array.
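
For completeness, the scrub and the deferred rebalance are the stock commands (the array mountpoint /mnt is a placeholder):

```shell
# Verify every checksum; with RAID1, bad copies are repaired from the good one
btrfs scrub start /mnt
btrfs scrub status /mnt      # progress and error counts so far

# The deferred rebalance: restripe existing chunks across all devices
btrfs balance start --bg /mnt
btrfs balance status /mnt
```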

I booted back to 4.19.0-8 because of some apparent incompatibilities
between the 5.6.10 kernel and my Debian buster userland binaries and
config files, but that probably has nothing to do with btrfs.

Phil

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2020-05-04  7:39 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-28  7:22 Extremely slow device removals Phil Karn
2020-04-30 17:31 ` Phil Karn
2020-04-30 18:13   ` Jean-Denis Girard
2020-05-01  8:05     ` Phil Karn
2020-05-02  3:35       ` Zygo Blaxell
     [not found]         ` <CAMwB8mjUw+KV8mxg8ynPsv0sj5vSpwG7_khw=oP5n+SnPYzumQ@mail.gmail.com>
2020-05-02  4:31           ` Zygo Blaxell
2020-05-02  4:48         ` Paul Jones
2020-05-02  5:25           ` Phil Karn
2020-05-02  6:04             ` Remi Gauvin
2020-05-02  7:20             ` Zygo Blaxell
2020-05-02  7:27               ` Phil Karn
2020-05-02  7:52                 ` Zygo Blaxell
2020-05-02  6:00           ` Zygo Blaxell
2020-05-02  6:23             ` Paul Jones
2020-05-02  7:20               ` Phil Karn
2020-05-02  7:42                 ` Zygo Blaxell
2020-05-02  8:22                   ` Phil Karn
2020-05-02  8:24                     ` Phil Karn
2020-05-02  9:09                     ` Zygo Blaxell
2020-05-02 17:48                       ` Chris Murphy
2020-05-03  5:26                         ` Zygo Blaxell
2020-05-03  5:39                           ` Chris Murphy
2020-05-03  6:05                             ` Chris Murphy
2020-05-04  2:09                         ` Phil Karn
2020-05-02  7:43                 ` Jukka Larja
2020-05-02  4:49         ` Phil Karn
2020-04-30 18:40   ` Chris Murphy
2020-04-30 19:59     ` Phil Karn
2020-04-30 20:27       ` Alexandru Dordea
2020-04-30 20:58         ` Phil Karn
2020-05-01  2:47       ` Zygo Blaxell
2020-05-01  4:48         ` Phil Karn
2020-05-01  6:05           ` Alexandru Dordea
2020-05-01  7:29             ` Phil Karn
2020-05-02  4:18               ` Zygo Blaxell
2020-05-02  4:48                 ` Phil Karn
2020-05-02  5:00                 ` Phil Karn
2020-05-03  2:28                 ` Phil Karn
2020-05-04  7:39                   ` Phil Karn

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.