From: Chris Murphy <lists@colorremedies.com>
To: Anand Jain <anand.jain@oracle.com>
Cc: David Sterba <dsterba@suse.cz>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>,
	Chris Mason <clm@fb.com>
Subject: Re: RAID1 availability issue[2], Hot-spare and auto-replace
Date: Sun, 18 Sep 2016 11:28:41 -0600
Message-ID: <CAJCQCtStO0eO9CqLTW00mZHZd8jDJJV4V-ir4-OQzd2AfyXWSQ@mail.gmail.com>
In-Reply-To: <f1b74a5d-1fdf-2970-86fc-206a4bb26d78@oracle.com>

On Sun, Sep 18, 2016 at 2:34 AM, Anand Jain <anand.jain@oracle.com> wrote:
>
> (updated the subject, was [1])
>
>> IMO the hot-spare feature makes most sense with the raid56,
>
>
>   Why. ?

Raid56 is not scalable, has less redundancy in almost all
configurations, a rebuild impacts the performance of the entire array,
and in the raid6 case losing two drives means an incredibly slow
rebuild. All of that adds up to more risk for raid56, risk that is
mitigated by having a hot spare available for immediate rebuild.
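
Today that rebuild has to be kicked off by hand, which is roughly what
a hot-spare / auto-replace feature would automate. A sketch, with
made-up device names, devid and mount point:

  btrfs filesystem show /mnt        # note the devid of the failed disk
  # -r: only read from the failed device if no other copy is available
  btrfs replace start -r 3 /dev/sdz /mnt
  btrfs replace status /mnt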

Who would actually use a hot spare right now? Problem 1 is that Btrfs
raid10 is not scalable like other raid10 implementations (mdadm, lvm,
hardware). Problem 2 is the Btrfs raid56 parity scrub bug, and
arguably also partial stripe writes not being CoW. I think a hot spare
is pointless while those two problems remain, and the way to mitigate
them right now is a cluster fs. Hot spare doesn't mitigate these Btrfs
weaknesses.


>
>> which is stuck where it is, so we need to get it working first.
>
>
>
>   We need at least one RAID which does not have the availability
>   issue. We could achieve that with raid1; there are patches
>   which need maintainer time.

I agree with the idea of degraded raid1 chunks. It's a nasty surprise
to discover the problem only once it's too late and there's data loss.
The fact that there is a user space workaround maybe makes it less of
a big deal? But I don't think it's documented on the gotchas page,
along with the soft-conversion workaround needed to do the rebuild
properly: scrub or balance alone is not correct.
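
For reference, the workaround as I understand it goes roughly like this
(sdb, sdc and /mnt are placeholders); the "soft" convert filter is the
part that matters, since a plain scrub or balance won't convert the
single chunks back to raid1:

  mount -o degraded /dev/sdb /mnt
  btrfs device add /dev/sdc /mnt
  btrfs device delete missing /mnt
  # convert only the chunks written as single/dup while degraded
  btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt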

I kinda think we need a list of priorities for the multiple-device
stuff, and honestly, while hot spare is important, I think it belongs
at the bottom of that list.

1. multiple fs UUID dev UUID corruption problem (the cloned device problem)
2. degraded volumes new bg's are single profile (Anand's April patchset)
3. raid56 bad parity created during scrub when a data stripe is bad and gets fixed
4. better faulty device tolerance (no crashing)
5. raid10 scaling: needs a way for an even number of block devices of
the same size to get fixed mirror pairing, so it can tolerate multiple
drive failures as long as both members of any one mirrored pair don't fail
6. raid56 partial stripe RMW needs to be CoW; it doesn't matter if that
slows things down, and if you don't like it, use raid10
7. raid1 threaded/async reads (whatever the correct term is for reading
from all raid1 drives rather than picking one based on PID)
8. better faulty device notifications
9. raid56 parity needs to be checksummed
10. hotspare


2 and 3 might seem tied: both can result in data loss, and both have
(undocumented) user space workarounds; but 2 has a greater chance of
happening than 3.

4 is probably worse than 3, but 4 is much more nebulous and 3 produces
a big negative perception.

I'm sure someone could argue hot spare could get squeezed in between 4
and 5, but that's really my one bias in the list: I don't care about
hot spare. I think it's more scalable to take advantage of Btrfs's
uniqueness and shrink the file system to drop the bad drive and regain
full redundancy, rather than do hot spares. That is faster, and it
doesn't waste a drive that sits there doing no work.

I see shrink as more scalable with hard drives than hot spares,
especially in the case of single data profile with cluster
filesystems: drop the bad device and its data, autodelete the lost
files, rebuild metadata to regain complete fs redundancy, inform the
cluster of the partial data loss - boom, the array is completely
fixed, and the cluster can figure out what to do next. Plus each brick
isn't spinning an unused hot spare. In a cluster fs there is in effect
a hot spare *somewhere*, partially used somewhere else, anyway. I see
hot spare as an edge-case need, especially with hard drives, not a
general purpose need.
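
The building block for that shrink already exists as device deletion;
the autodeletion of lost files and the cluster notification would be
new. A sketch with placeholder names, for a raid1 with at least two
surviving devices:

  mount -o degraded /dev/sdc /mnt
  # re-replicates the missing device's chunks across the surviving
  # drives, shrinking the fs instead of consuming a standby disk
  btrfs device delete missing /mnt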

-- 
Chris Murphy

Thread overview: 66+ messages
2016-09-13 13:39 [RFC] Preliminary BTRFS Encryption Anand Jain
2016-09-13 13:39 ` [PATCH] btrfs: Encryption: Add btrfs encryption support Anand Jain
2016-09-13 14:12   ` kbuild test robot
2016-09-13 14:24   ` kbuild test robot
2016-09-13 16:10   ` kbuild test robot
2016-09-13 13:39 ` [PATCH 1/2] btrfs-progs: make wait_for_commit non static Anand Jain
2016-09-13 13:39 ` [PATCH 2/2] btrfs-progs: add encryption support Anand Jain
2016-09-13 13:39 ` [PATCH] fstests: btrfs: support encryption Anand Jain
2016-09-13 16:42 ` [RFC] Preliminary BTRFS Encryption Wilson Meier
2016-09-14  7:02   ` Anand Jain
2016-09-14 18:26     ` Wilson Meier
2016-09-15  4:53 ` Alex Elsayed
2016-09-15 11:33   ` Anand Jain
2016-09-15 11:47     ` Alex Elsayed
2016-09-16 11:35       ` Anand Jain
2016-09-15  5:38 ` Chris Murphy
2016-09-15 11:32   ` Anand Jain
2016-09-15 11:37 ` Austin S. Hemmelgarn
2016-09-15 14:06   ` Anand Jain
2016-09-15 14:24     ` Austin S. Hemmelgarn
2016-09-16  8:58       ` David Sterba
2016-09-17  2:18       ` Zygo Blaxell
2016-09-16  1:12 ` Dave Chinner
2016-09-16  5:47   ` Roman Mamedov
2016-09-16  6:49   ` Alex Elsayed
2016-09-17  4:38     ` Zygo Blaxell
2016-09-17  6:37       ` Alex Elsayed
2016-09-19 18:08         ` Zygo Blaxell
2016-09-19 20:01           ` Alex Elsayed
2016-09-19 22:22             ` Zygo Blaxell
2016-09-19 22:25             ` Chris Murphy
2016-09-19 22:31               ` Zygo Blaxell
2016-09-20  1:10                 ` Zygo Blaxell
2016-09-17 18:45       ` David Sterba
2016-09-20 14:26         ` Anand Jain
2016-09-16 10:45   ` Brendan Hide
2016-09-16 11:46   ` Anand Jain
2016-09-16  8:49 ` David Sterba
2016-09-16 11:56   ` Anand Jain
2016-09-17 20:35     ` David Sterba
2016-09-18  8:34       ` RAID1 availability issue[2], Hot-spare and auto-replace Anand Jain
2016-09-18 17:28         ` Chris Murphy [this message]
2016-09-18 17:34           ` Chris Murphy
2016-09-19  2:25           ` Anand Jain
2016-09-19 12:07             ` Austin S. Hemmelgarn
2016-09-19 12:25           ` Austin S. Hemmelgarn
2016-09-18  9:54       ` [RFC] Preliminary BTRFS Encryption Anand Jain
2016-09-20  0:12   ` Chris Mason
2016-09-20  0:55     ` Anand Jain
2016-09-17  6:58 ` Eric Biggers
2016-09-17  7:13   ` Alex Elsayed
2016-09-19 18:57     ` Zygo Blaxell
2016-09-19 19:50       ` Alex Elsayed
2016-09-19 22:12         ` Zygo Blaxell
2016-09-17 16:12   ` Anand Jain
2016-09-17 18:57     ` Chris Murphy
2016-09-19 15:15 ` Experimental btrfs encryption Theodore Ts'o
2016-09-19 20:58   ` Alex Elsayed
2016-09-20  0:32     ` Chris Mason
2016-09-20  2:47       ` Alex Elsayed
2016-09-20  2:50       ` Theodore Ts'o
2016-09-20  3:05         ` Alex Elsayed
2016-09-20  4:09         ` Zygo Blaxell
2016-09-20 15:44         ` Chris Mason
2016-09-21 13:52           ` Anand Jain
2016-09-20  4:05   ` Anand Jain
