Re: Extremely slow device removals

From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Phil Karn <karn@ka9q.net>
Cc: Paul Jones <paul@pauljones.id.au>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: Extremely slow device removals
Date: Sat, 2 May 2020 03:42:37 -0400
Message-ID: <20200502074237.GM10769@hungrycats.org>
In-Reply-To: <CAMwB8mhGkcM3DCTusuHAi-cQcr-FrA5cq4hVYfv+65zn_QjAig@mail.gmail.com>

On Sat, May 02, 2020 at 12:20:42AM -0700, Phil Karn wrote:
> So I'm trying to figure out the advantage of including RAID 1 inside
> btrfs instead of just running it over a conventional (fs-agnostic)
> RAID subsystem.
> 
> I was originally really intrigued by the idea of integrating RAID into
> the file system since it seemed like you could do more that way, or at
> least do things more efficiently. For example, when adding or
> replacing a mirror you'd only have to copy those parts of the disk
> that actually contain data. That promised better performance. But if
> those actually-used blocks are copied in small pieces and in random
> order so the operation is far slower than the logical equivalent of
> "dd if=disk1 of=disk2', 

If you use btrfs replace to move data between drives then you get all
the advantages you describe.  Don't do 'device remove' if you can possibly
avoid it.

Array reshapes in btrfs are currently slower than they need to be, but
there's no on-disk-format reason why they can't be as fast as replace
in many cases.

> then what's left?

If there's data corruption on one disk, btrfs can detect it and replace
the lost data from the good copy.  Most block-level raid1s have a 50%
chance of corrupting the good copy with the bad one, and can only report
corruption as a difference in content between the drives (i.e. you have
to guess which is correct), if they bother to report corruption at all.
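
To make the difference concrete, here is a rough sketch of a checksum-verified
mirror read.  It is a toy model, not btrfs code: the in-memory "mirrors" list,
the function name read_with_repair, and the use of zlib.crc32 as a stand-in
checksum are all assumptions made for illustration.  The point is only that a
checksum stored apart from the data lets the reader tell which copy is good.

```python
import zlib

def read_with_repair(mirrors, index, expected_crc):
    """Read block `index` from a toy mirror set, repairing bad copies.

    mirrors      -- list of devices, each a list of bytes blocks (hypothetical model)
    expected_crc -- checksum stored separately from the data (zlib.crc32 here,
                    purely as a stand-in)
    Returns (data, bad_devices).
    """
    good = None
    bad_devices = []
    for dev, blocks in enumerate(mirrors):
        if zlib.crc32(blocks[index]) == expected_crc:
            if good is None:
                good = blocks[index]      # first copy that verifies
        else:
            bad_devices.append(dev)       # we know exactly which device lied
    if good is None:
        raise IOError("all copies failed checksum; nothing to repair from")
    for dev in bad_devices:
        mirrors[dev][index] = good        # overwrite the bad copy with the good one
    return good, bad_devices
```

A mirror without the stored checksum can only see that the two copies differ;
it has no basis for choosing which one to keep.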

This allows you to take advantage of diverse redundant storage (e.g.
raid1 pairs composed of disks made by different vendors).  In btrfs
raid1, heterogeneous drive firmware maximizes the chance of having one
bug-free firmware, and scrub will tell you exactly which drive is bad.
In other raid1 implementations, a heterogeneous raid1 pair maximizes the
chance of one firmware in the array having a bug that corrupts the data
on the good drives, and doesn't tell you which drive is bad.
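
As a hedged illustration of "scrub will tell you exactly which drive is bad",
the sketch below contrasts a checksum-based scan with a plain copy comparison.
Again this is a toy model under assumptions: the names scrub and compare_only,
the list-of-blocks devices, and zlib.crc32 as the checksum are all hypothetical
and are not how btrfs implements scrub.

```python
import zlib
from collections import Counter

def scrub(mirrors, checksums):
    """Count checksum failures per device in a toy mirror set."""
    errors = Counter()
    for index, expected in enumerate(checksums):
        for dev, blocks in enumerate(mirrors):
            if zlib.crc32(blocks[index]) != expected:
                errors[dev] += 1          # failure is attributed to one device
    return errors

def compare_only(mirrors):
    """What a checksum-less two-way mirror can report: where the copies differ."""
    first, second = mirrors
    return [i for i, (x, y) in enumerate(zip(first, second)) if x != y]
```

With checksums the result is a per-device error count; without them the result
is only a list of mismatching block numbers, with no indication of which side
is wrong.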

btrfs does not rely on lower level hardware for error detection, so you
can use cheap SSDs and SD cards that have no firmware capability to detect
or report the flash equivalent of UNC sectors as btrfs raid1 members.
I can usually squeeze about six months of extra life out of some very
substandard storage hardware with btrfs.

> Even the ability to use drives of different sizes isn't unique to
> btrfs. You can use LVM to concatenate smaller volumes into larger
> logical ones.
> 
> Phil