* RAID1, SSD+non-SSD
@ 2015-02-06 20:01 Brian B
  2015-02-07  0:23 ` Chris Murphy
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Brian B @ 2015-02-06 20:01 UTC (permalink / raw)
  To: linux-btrfs

My laptop has two disks, an SSD and a traditional magnetic disk. I plan
to make a partition on the mag disk equal in size to the SSD and set up
BTRFS RAID1. This I know how to do.
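
For reference, this is roughly what I have in mind (sda2 and sdb2 are
just placeholder names for the SSD and the HDD partition):

$ mkfs.btrfs -m raid1 -d raid1 /dev/sda2 /dev/sdb2
$ btrfs device scan
$ mount /dev/sda2 /mnt      # either member can be named at mount time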

The only reason I'm doing the RAID1 is for the self-healing. I realize
writing large amounts of data will be slower than the SSD alone, but
is it possible to set it up to only read from the magnetic drive if
there's an error reading from the SSD?

In other words, is there a way to tell it to only read from the faster
disk?  Is that even necessary?  Is there a better way to accomplish
this?


* Re: RAID1, SSD+non-SSD
  2015-02-06 20:01 RAID1, SSD+non-SSD Brian B
@ 2015-02-07  0:23 ` Chris Murphy
  2015-02-07 18:06   ` Kai Krakow
  2015-02-07  6:39 ` Duncan
  2015-02-07 17:28 ` Kyle Manna
  2 siblings, 1 reply; 10+ messages in thread
From: Chris Murphy @ 2015-02-07  0:23 UTC (permalink / raw)
  To: Btrfs BTRFS

On Fri, Feb 6, 2015 at 1:01 PM, Brian B <canis8585@gmail.com> wrote:
> My laptop has two disks, an SSD and a traditional magnetic disk. I plan
> to make a partition on the mag disk equal in size to the SSD and set up
> BTRFS RAID1. This I know how to do.

There isn't a write-mostly option in btrfs like there is with md raid,
so I don't know how Btrfs will tolerate one device being exceptionally
slower than the other. It may be that most of the time it won't matter,
but I can imagine that a ton of IOPS backing up on the hard drive,
having already completed on the SSD, could be a problem. I'd test it,
unless someone else who has tried it pipes up.

>
> The only reason I'm doing the RAID1 is for the self-healing. I realize
> writing large amounts of data will be slower than the SSD alone, but
> is it possible to set it up to only read from the magnetic drive if
> there's an error reading from the SSD?

No.
>
> In other words, is there a way to tell it to only read from the faster
> disk?  Is that even necessary?  Is there a better way to accomplish
> this?

No. No. And maybe. In order.

If there is an error detected by either drive, or by Btrfs, Btrfs will
get the correct data from the other drive and fix the problem on the
original drive. You don't need to configure anything. The only concern
is the asymmetric performance.
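
If you want to exercise that repair proactively instead of waiting for
a normal read to hit a bad copy, a scrub walks everything and fixes
from the good mirror; roughly (the mount point is just an example):

$ btrfs scrub start /mnt
$ btrfs scrub status /mnt      # csum errors found and corrected
$ btrfs device stats /mnt      # per-device error counters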

I think the use case is better achieved with two HDDs + two SSD
partitions, configured either with bcache or dmcache. The result is
two logical devices using the HDDs as backing devices and the SSD
partitions as cache, which you then format as Btrfs raid1. The question
there, of course, is the maturity of bcache vs dmcache and their
interactions with Btrfs. But at least that's supposed to work.
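
Very roughly, and untested by me, the bcache variant would look
something like this (device names are only examples):

$ make-bcache -C /dev/sdc1 -B /dev/sda   # SSD partition 1 caches HDD 1 -> /dev/bcache0
$ make-bcache -C /dev/sdc2 -B /dev/sdb   # SSD partition 2 caches HDD 2 -> /dev/bcache1
$ mkfs.btrfs -m raid1 -d raid1 /dev/bcache0 /dev/bcache1

The dmcache route gets you to the same place, just with LVM or dmsetup
doing the plumbing instead of bcache superblocks.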

-- 
Chris Murphy


* Re: RAID1, SSD+non-SSD
  2015-02-06 20:01 RAID1, SSD+non-SSD Brian B
  2015-02-07  0:23 ` Chris Murphy
@ 2015-02-07  6:39 ` Duncan
  2015-02-07 12:42   ` RAID1, SSD+non-SSD (RAID 5/6 question) Ed Tomlinson
  2015-02-08  2:41   ` RAID1, SSD+non-SSD Brian B
  2015-02-07 17:28 ` Kyle Manna
  2 siblings, 2 replies; 10+ messages in thread
From: Duncan @ 2015-02-07  6:39 UTC (permalink / raw)
  To: linux-btrfs

Brian B posted on Fri, 06 Feb 2015 15:01:30 -0500 as excerpted:

> The only reason I'm doing the [btrfs] RAID1 is for the self-healing. I
> realize writing large amounts of data will be slower than the SSD
> alone, but is it possible to set it up to only read from the magnetic
> drive if there's an error reading from the SSD?

Chris Murphy is correct.  Btrfs raid1 doesn't have the write-mostly 
option that mdraid has.

I'll simply expand on what he mentioned with two points, #1 being the 
more important for your case.

1) The btrfs raid1 read-mode device choice algorithm is known to be sub-
optimal, and the plan is to change and optimize it in the longer term.  
Basically, it's an easy first implementation that's simple enough to be 
reasonably bug-free and to stay out of the developers' way while they 
work on other things, while still allowing easy testing of both 
devices.

Specifically, it's a very simple even/odd parity assignment based on the 
PID making the request.  Thus, a single PID read task will consistently 
read from the same device (unless the block checksum on that device is 
bad, in which case it tries the other device), no matter how much there 
is to read 
and how backed up that device might be, or how idle the other one might 
be. Even a second read task from another PID, or a 10th, or the 100th, if 
they're all even or all odd parity PIDs, will all be assigned to read 
from the same device, even if the other one is entirely idle.

Which ends up being the worst case for a multi-threaded, heavy-read 
task where the read threads happen to be all even or all odd, say if read and 
compute threads are paired and always spawned in the same order, with 
nothing else going on to throw the parity ordering off.  But that's how 
it's currently implemented.  =:^(

And it /does/ make for easily repeatable test results, while being simple 
enough to stay out of the way while development interest focuses 
elsewhere, after all pretty important factors early in a project of this 
scope. =:^)


Obviously, that's going to be bad news for you, too, unless your use-case 
is specific enough that you can tune the read PIDs to favor the parity 
that hits the SSD. =:^(
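
As a trivial illustration of the shape of it (not the actual kernel
code, just the arithmetic), the mirror a reader lands on is effectively
its PID modulo the number of copies, i.e. modulo 2 for raid1:

$ for pid in 100 101 102 103; do echo "pid $pid -> mirror $(( pid % 2 ))"; done
pid 100 -> mirror 0
pid 101 -> mirror 1
pid 102 -> mirror 0
pid 103 -> mirror 1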


The claim is made that btrfs is stabilizing, and in fact, as a regular 
here for some time, I can vouch for that.  But I think it's reasonable to 
argue that until this sort of read-scheduling algorithm is replaced with 
something a bit more optimized, and of course that replacement well 
tested, it's definitely premature to call btrfs fully stable.  This sort 
of mis-optimization, painfully bad in some cases, just doesn't fit with 
stable, and regardless of how long it takes, until development quiets 
down far enough that the devs can feel comfortable focusing on something 
like this, it's extremely hard to argue that development has quieted down 
enough to fairly call it stable in the first place.

Well, my opinion anyway.

So the short of it is, at least until btrfs optimizes this a bit better, 
for SSD paired with spinning-rust raid1 optimization, as Chris Murphy 
suggested, use some sort of caching mechanism, bcache or dmcache.

Tho you'll want to compare notes with someone who has already tried it, 
as there were some issues with at least btrfs and bcache earlier.  I 
believe they're fixed now, but as explained above, btrfs itself isn't 
really entirely stable yet, so I'd definitely recommend keeping backups, 
and comparing notes with others who have tried it.  (I know there's some 
on the list, tho they may not see this.  But hopefully they'll respond to 
a new thread with bcache or dmcache in the title, if you decide to go 
that way.)


2) While this doesn't make a significant difference in the two-device 
btrfs raid1 case, it does with three or more devices in the btrfs raid1, 
and with other raid forms the difference is even stronger.  I noticed you 
wrote RAID1 in ALL CAPS form.  Btrfs' raid implementations aren't quite 
like traditional RAID, and I recall a dev (Chris Mason, actually, IIRC) 
pointing out that the choice to use small-letters raidX nomenclature was 
deliberate, in order to remind people that there is a difference.

Specifically for btrfs raid1, as contrasted to, for instance, md/RAID-1, 
at present btrfs raid1 is always pair-mirrored, regardless of the number 
of devices (above two, of course).  While a three-device md/RAID-1 will 
have three mirrors and a four-device md/RAID-1 will have four, simply 
adding redundant mirrors while maintaining capacity (in the simple all-
the-same-size case, anyway), a three-device btrfs raid1 will have 1.5x 
the capacity of a two-device btrfs raid1, and a four-device btrfs raid1 
will have twice the two-device capacity, while maintaining a constant 
pair-mirroring regardless of the number of devices in the btrfs raid1.
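
Putting rough numbers on that, assuming equal-sized devices and both
data and metadata in raid1, so every block is stored exactly twice:

#   2 x 1 TB btrfs raid1: 2 TB raw / 2 copies = 1.0 TB usable
#   3 x 1 TB btrfs raid1: 3 TB raw / 2 copies = 1.5 TB usable
#   4 x 1 TB btrfs raid1: 4 TB raw / 2 copies = 2.0 TB usable
#   3 x 1 TB md/RAID-1:   1 TB usable, three full copies
#   4 x 1 TB md/RAID-1:   1 TB usable, four full copies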

For btrfs raid10, the pair-mirroring is there, but for odd numbers of 
devices there's also a difference of uneven striping, because of the odd 
one out in the mirroring and the difference in chunk size between data 
and metadata chunks.

And of course there's the difference that data and metadata are treated 
separately in btrfs, and don't have to have the same raid levels, nor are 
they the same by default.  A filesystem agnostic raid such as mdraid or 
dmraid will by definition treat data and metadata alike as it won't be 
able to tell the difference -- if it did it wouldn't be filesystem 
agnostic.
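
For example (a generic sketch, not specific to the OP's two-device
case), the profiles are chosen at mkfs time and can be converted later
with a balance:

$ mkfs.btrfs -d single -m raid1 /dev/sdX /dev/sdY   # data one copy, metadata mirrored
$ btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
$ btrfs filesystem df /mnt                          # per-profile allocation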


Now that btrfs raid56 mode is basically complete with kernel 3.19, the 
next thing on the raid side of the roadmap is N-way-mirroring.  I'm 
really looking forward to that as I really like btrfs' self-repair 
capacities as well, but for me the ideal balance is three-way-mirroring, 
just in case two copies fail checksum.  Tho the fact of the matter is, 
btrfs only now is getting to the point where a third mirror has some 
reasonable chance of being useful, as until now btrfs itself was unstable 
enough that the chances of it having a bug were far higher than of both 
devices going bad for a checksummed block at the same time.  But btrfs 
really is much more stable than it was, and it's stable enough now that 
the possibility of a third mirror really should start making statistical 
sense pretty soon, if it doesn't already.

But given the time raid56 took, I'm not holding my breath.  I guess 
they'll be focused on the remaining raid56 bugs thru 3.20, and figure 
it'll be at least three kernel cycles later, so second half of the year 
at best, before we see N-way-mirroring in mainstream.  This time next 
year would actually seem more reasonable, and 2H-2016 or into 2017 
wouldn't surprise me in the least, again, given the time raid56 mode 
took.  Hopefully it'll be there before 2018...


Tho as I said, for the two-device case, if both data and metadata are 
raid1 mode, those differences can for the most part be ignored.  Thus, 
this point is mostly for others reading, and for you in the future should 
you end up working with a btrfs raid1 with more than two devices.  I 
mostly mentioned it due to seeing that all-caps RAID1.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: RAID1, SSD+non-SSD (RAID 5/6 question)
  2015-02-07  6:39 ` Duncan
@ 2015-02-07 12:42   ` Ed Tomlinson
  2015-02-08  3:18     ` Duncan
  2015-02-08  2:41   ` RAID1, SSD+non-SSD Brian B
  1 sibling, 1 reply; 10+ messages in thread
From: Ed Tomlinson @ 2015-02-07 12:42 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

On Saturday, February 7, 2015 1:39:07 AM EST, Duncan wrote:

> The btrfs raid1 read-mode device choice algorithm

Duncan,

Very interesting stuff on the raid1 read select alg.  What changes with 
raid5/6?  Is that alg 'smarter'?

TIA
Ed Tomlinson



* Re: RAID1, SSD+non-SSD
  2015-02-06 20:01 RAID1, SSD+non-SSD Brian B
  2015-02-07  0:23 ` Chris Murphy
  2015-02-07  6:39 ` Duncan
@ 2015-02-07 17:28 ` Kyle Manna
  2 siblings, 0 replies; 10+ messages in thread
From: Kyle Manna @ 2015-02-07 17:28 UTC (permalink / raw)
  To: Brian B, linux-btrfs

On Fri Feb 06 2015 at 12:06:33 PM Brian B <canis8585@gmail.com> wrote:
>
> My laptop has two disks, an SSD and a traditional magnetic disk. I plan
> to make a partition on the mag disk equal in size to the SSD and set up
> BTRFS RAID1. This I know how to do.
>
> The only reason I'm doing the RAID1 is for the self-healing. I realize
> writing large amounts of data will be slower than the SSD alone, but
> is it possible to set it up to only read from the magnetic drive if
> there's an error reading from the SSD?
>
> In other words, is there a way to tell it to only read from the faster
> disk?  Is that even necessary?  Is there a better way to accomplish
> this?


What you may want to look at is lvmcache + btrfs.  I've played with
lvmcache (using ext4 on top) and btrfs independently, but not
together.  Too many new technologies at the same time for my taste. :)

The best documentation I've found on lvm cache is the man page:
http://man7.org/linux/man-pages/man7/lvmcache.7.html

LVM cache uses dm-cache behind the scenes and makes it much more
manageable (i.e. construction, manipulation, and teardown of devices).
An LVM cache won't help with redundancy; any given block will exist
either on the caching device or on the slower device.  To remove the
cache, you can force a flush of the blocks out of the cache to the
traditional HDD and keep using the file system without the cache,
without having to recreate it.
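
Roughly, going by the man page (sizes and device names below are only
examples, and I haven't run this exact sequence with btrfs on top):

$ vgcreate vg /dev/sdb /dev/sda1              # HDD plus an SSD partition
$ lvcreate -n slow -L 500G vg /dev/sdb        # the big LV on the HDD
$ lvcreate -n cache -L 50G vg /dev/sda1       # cache data LV on the SSD
$ lvcreate -n cache_meta -L 1G vg /dev/sda1   # cache metadata LV
$ lvconvert --type cache-pool --poolmetadata vg/cache_meta vg/cache
$ lvconvert --type cache --cachepool vg/cache vg/slow
$ mkfs.btrfs /dev/vg/slow
# later, to drop the cache (dirty blocks get flushed back to the HDD):
$ lvconvert --uncache vg/slow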


* Re: RAID1, SSD+non-SSD
  2015-02-07  0:23 ` Chris Murphy
@ 2015-02-07 18:06   ` Kai Krakow
  2015-02-08  3:31     ` Duncan
  0 siblings, 1 reply; 10+ messages in thread
From: Kai Krakow @ 2015-02-07 18:06 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy <lists@colorremedies.com> schrieb:

[...]

> I think the use case is better achieved with two HDDs + two SSD
> partitions, configured either with bcache or dmcache. The result is
> two logical devices using the HDDs as backing devices and the SSD
> partitions as cache, which you then format as Btrfs raid1. The question
> there, of course, is the maturity of bcache vs dmcache and their
> interactions with Btrfs. But at least that's supposed to work.

Bcache on multi-device btrfs works fine for me. No problems yet, even in 
the case of a hard reset. I'm running a normal desktop workload, some 
Steam games, some VMs, and some MySQL/PHP/Rails programming. A single 
bcache partition can support many backing devices, so there's no need to 
use multiple partitions in that case. In this scenario, bcache{0,1,2} is 
my btrfs rootfs:

$ lsblk /dev/sdb
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sdb           8:16   0 119,2G  0 disk
├─sdb1        8:17   0   512M  0 part
├─sdb2        8:18   0    20G  0 part
├─sdb3        8:19   0  79,5G  0 part
│ ├─bcache0 252:0    0 925,5G  0 disk
│ ├─bcache1 252:1    0 925,5G  0 disk
│ └─bcache2 252:2    0 925,5G  0 disk /home
└─sdb4        8:20   0  19,2G  0 part

sdb3 is the cache device, sdb4 is trimmed and left untouched for SSD 
overprovisioning, sdb2 is a dedicated resume swap (traditional swapping goes 
to all three HDDs), and sdb1 is my ESP to boot the kernel and initramfs from.
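
For the curious, pointing several backing devices at the one cache set
is just a matter of attaching them all to the same cset UUID (a sketch
from memory, device names are examples):

$ make-bcache -C /dev/sdb3        # the single cache set on the SSD partition
$ make-bcache -B /dev/sdc         # each HDD becomes its own /dev/bcacheN
$ make-bcache -B /dev/sdd
$ make-bcache -B /dev/sde
$ bcache-super-show /dev/sdb3 | grep cset.uuid
$ echo <cset-uuid> > /sys/block/bcache0/bcache/attach      # repeat for bcache1/2
$ echo writeback > /sys/block/bcache0/bcache/cache_mode    # likewise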

Bcache takes some time to warm up but is really fast afterwards: Boot times 
(using systemd and readahead) went down from >60s (on spinning rust) to ~30s 
on first boot (system was migrated to bcache with dev del/add, not 
reinstalled), then ~15s, 10s, and now it fluctuates between 3 and 8s (mostly 
around 5s) for reaching graphical.target (depending on whether I installed 
updates). KDE takes some time to load but I suppose most of it is due to its 
artificial 4 second delay during initialization of kded and friends - I 
guess this will be fixed in KDE5. This boot target is not stripped down; it 
includes network, mysql, postfix and some other stuff that one usually 
either doesn't need or could optimize away. The numbers are taken from 
"systemd-analyze critical-chain".

The bcache hit rate is usually between 60 and 80% using write-back. So all 
in all I can generally recommend bcache. I don't know dmcache, tho. But I 
really advise against using a mixed setup of SSD and HDD partitions in 
btrfs RAID mode, especially since btrfs does not handle different-sized 
partitions that well. With bcache you can have your cake and eat it too 
(read: big storage pool + fast access times).

BTW: Is there work in progress to let btrfs choose which device to read from 
or write to other than using round-robin or pid mapping? Maybe it would be 
interesting to watch the current read and write latencies of all drives and 
choose the one with the lowest latency. Tho, I think it won't make much 
sense when passing accesses through btrfs.

-- 
Replies to list only preferred.



* Re: RAID1, SSD+non-SSD
  2015-02-07  6:39 ` Duncan
  2015-02-07 12:42   ` RAID1, SSD+non-SSD (RAID 5/6 question) Ed Tomlinson
@ 2015-02-08  2:41   ` Brian B
  2015-02-08  3:51     ` Duncan
  1 sibling, 1 reply; 10+ messages in thread
From: Brian B @ 2015-02-08  2:41 UTC (permalink / raw)
  To: linux-btrfs

On Sat, Feb 7, 2015 at 1:39 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> 1) The btrfs raid1 read-mode device choice algorithm is known to be sub-
> optimal, and the plan is to change and optimize it in the longer term.
[...]
> So the short of it is, at least until btrfs optimizes this a bit better,
> for SSD paired with spinning-rust raid1 optimization, as Chris Murphy
> suggested, use some sort of caching mechanism, bcache or dmcache.

Thanks, very informative about the read alg.  Sounds like it makes
more sense to simply do backups to the slower drive and manually
restore from those if I ever have a checksum error.

My main goal here was protection from undetectable sector corruption
("bitrot" etc.) without having to halve my SSD, but on btrfs I suppose
it's impossible for bitrot errors to creep into backups, because I'd
get a checksum error before that happened, right?  Then I could just
restore it from a previous backup.


* Re: RAID1, SSD+non-SSD (RAID 5/6 question)
  2015-02-07 12:42   ` RAID1, SSD+non-SSD (RAID 5/6 question) Ed Tomlinson
@ 2015-02-08  3:18     ` Duncan
  0 siblings, 0 replies; 10+ messages in thread
From: Duncan @ 2015-02-08  3:18 UTC (permalink / raw)
  To: linux-btrfs

Ed Tomlinson posted on Sat, 07 Feb 2015 07:42:50 -0500 as excerpted:

> On Saturday, February 7, 2015 1:39:07 AM EST, Duncan wrote:
> 
>> The btrfs raid1 read-mode device choice algorithm
> 
> Very interesting suff on the raid1 read select alg.  What changes with
> raid5/6?  Is that alg 'smarter'?

I don't know as much about the raid56 (5/6) mode.  What I /do/ know about 
it is that until the still-in-testing 3.19 kernel and similarly "now" 
userspace, raid56 mode mkfs worked, and normal runtime worked, but scrub 
and the various repair modes were code-incomplete.  That made it 
effectively an inefficient raid0 in practice -- the parity strips were 
calculated and written, but the tools weren't there to properly recover 
from them should it be necessary, so from an admin perspective it was 
like a raid0, if a device drops out, say bye-bye to the entire 
filesystem.  In practice there were certain limited recovery steps that 
could be taken in some circumstances, but as they couldn't be counted on, 
from an admin perspective, the best policy really was to consider it a 
slow raid0, as that's the risk you were taking, running it.

The difference was that if you set it up for raid5/6, once the tools were 
complete and ready, you'd effectively get a "free" redundancy upgrade, 
since it was actually running that way all along, it just couldn't be 
recovered as such because the recovery tools weren't done yet.

With kernel 3.19, in theory all the btrfs raid56 mode kernel pieces are 
there now, altho in practice there's still bugs being worked out, so I'd 
not (bleeding-edge) trust it until 3.20 at least, and I'd hesitate to 
consider it as (relatively) stable as single/dup/raid0/1/10 modes for 
another couple kernels after that, simply because they've been usable for 
long enough to have had quite a few more bugs found and worked out at 
this point.

I'm not exactly sure what the status is on the userspace side, but I 
/think/ it's there in the current v3.18.x userspace release, and should 
be usable by the time the kernelspace is usable, kernel 3.20 with 
userspace 3.19.

But with ~9 week release cycles and with 3.19 very close to out now, if 
we take that 3.20 bleeding-edge usable in say 10 weeks from now, and call 
raid56 mode reasonably stable two kernel cycles or 18 weeks later, that 
puts it 28 weeks out, say 6.5 months, for reasonably stable.  Which would 
be late August.  Of course if you're willing to take a bit more risk, 
it's more like six or seven weeks, say 3.20-rc4 or so, about the end of 
March.  I'd really not recommend raid56 mode until then, unless you *ARE* 
treating it exactly as you would a raid0, and are willing to call the 
entire filesystem a complete loss if a device drops or there's any other 
serious problem with it.


As for algorithm, AFAIK, operationally btrfs raid56 mode stripes data 
similar to raid0, except that one or two devices of each stripe are of 
course reserved for parity.  So a three-way raid5 or a four-way raid6 
will have a two-way-data-stripe, while a four-way raid5 or a five-way 
raid6 will have a three-way-data-stripe.
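
In rough numbers, assuming equal-sized devices:

#   raid5, 3 devices: 2 data strips + 1 parity per stripe -> ~2/3 of raw space usable
#   raid5, 4 devices: 3 data + 1 parity                   -> ~3/4 usable
#   raid6, 4 devices: 2 data + 2 parity                   -> ~1/2 usable
#   raid6, 5 devices: 3 data + 2 parity                   -> ~3/5 usable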

Since data chunks are nominally 1 GiB, and the allocator will allocate a 
chunk on each device and then stripe sub-chunk data across the full 
available width for raid0/5/6, in theory at least performance should be 
very similar to a conventional raid0/5/6, at least for a single thread.

Which means writes are going to be the big bottleneck, just as they are 
with conventional raid5/6, since they end up being read-modify-write for 
any strips of the stripe that aren't already in cache.

FWIW I actually ran md/RAID-6 here for awhile (general desktop/
workstation use-case, tho on gentoo, so call it developer's workstation 
due to the building from source), and was rather disappointed.  I found a 
well-optimized raid1 implementation (as md/RAID-1 is) to be much more 
efficient, even with four-way-mirroring!

Tho due to btrfs raid1 mode not yet being optimized, btrfs raid56 mode 
even with a reasonable write load, might well actually be competitive or 
even faster, at this point.  I haven't even looked to see if there's any 
benchmarks on that, yet.  (Despite raid56 mode repair tools not being 
complete, runtime worked, so it could have been benchmarked against raid1 
mode already.  I just haven't checked to see if there's actually a report 
of such on the wiki or wherever.)


But back to the SSD+spinning-rust combo, I don't expect btrfs raid56 mode 
to do particularly well on that, either, tho at least you wouldn't have 
the potential worst-case of all reads getting assigned to the spinning 
rust, as could well happen with btrfs' unoptimized raid1 mode, at this 
point.  Intuitively, I'd predict that read thruput would be similar to 
that of reading just the spinning-rust share off the spinning-rust 
device.  IOW, when reading from both, the SSD would be done so fast it 
wouldn't even show up in the results, while the speed of the spinning 
rust would be what you'd be getting for data read off of it, so where 
half the data is on spinning rust and half on ssd, you'd effectively get 
twice the speed you'd get if it were all on spinning rust, because half 
would show up at spinning rust speed, while the other half would already 
be there by the time the spinning rust side finished.  But that's simply 
intuition, and simple intuition could be quite wrong.  You could of 
course test it.

The ideal, if you don't want to deal with a cache layer, as I didn't, 
would be to simply declare the money to put it all on SSD worth it, and 
just do that.  Two SSDs in btrfs raid1 mode.  That's actually what I'm 
running here, tho I don't like all my data eggs in the same filesystem 
basket, so I actually have both SSDs partitioned up similarly, and am 
running multiple smaller independent btrfs, all (but for /boot) being 
btrfs raid1, with each of the two devices for each btrfs raid1 being a 
partition on one of the SSDs.

That actually works quite well and I've been very happy with it. =:^)  
Particularly when doing a full balance/scrub/check on a filesystem takes 
under 10 minutes, with some of them a minute or less, both because of the 
speed of the SSDs, and because the filesystems are all under 50 GiB 
each.  It's **MUCH** easier to work with such filesystems when a scrub or 
balance doesn't take the **DAYS** people often report for their multi-
terabyte spinning-rust based filesystems!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: RAID1, SSD+non-SSD
  2015-02-07 18:06   ` Kai Krakow
@ 2015-02-08  3:31     ` Duncan
  0 siblings, 0 replies; 10+ messages in thread
From: Duncan @ 2015-02-08  3:31 UTC (permalink / raw)
  To: linux-btrfs

Kai Krakow posted on Sat, 07 Feb 2015 19:06:14 +0100 as excerpted:

> BTW: Is there work in progress to let btrfs choose which device to read
> from or write to other than using round-robin or pid mapping? Maybe it
> would be interesting to watch the current read and write latencies of
> all drives and choose the one with the lowest latency. Tho, I think it
> won't make much sense when passing accesses through btrfs.

There's several projects in that general area suggested on the wiki.  
You'd need to look there for status (unclaimed, claimed, in progress, 
etc) and to see if any of them match well enough to what you had in mind 
or if you might wish to add another.

There's definitely optimization planned, with the project ideas mentioned 
above going beyond that.  However, I'm not sure of the status of the 
actually planned optimization either.  There's certainly the standard 
worry about premature optimization, but arguably we're past the point at 
which it'd be premature and actually need it now, if for no other reason 
than that making a case for true btrfs stability is rather difficult 
while such optimization is either still being held off as premature, or 
is no longer being held off but that state is so new the optimization 
simply hasn't been done yet.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: RAID1, SSD+non-SSD
  2015-02-08  2:41   ` RAID1, SSD+non-SSD Brian B
@ 2015-02-08  3:51     ` Duncan
  0 siblings, 0 replies; 10+ messages in thread
From: Duncan @ 2015-02-08  3:51 UTC (permalink / raw)
  To: linux-btrfs

Brian B posted on Sat, 07 Feb 2015 21:41:08 -0500 as excerpted:

>  Sounds like it makes more
> sense to simply do backups to the slower drive and manually restore from
> those if I ever have a checksum error.
> 
> My main goal here was protection from undetectable sector corruption
> ("bitrot" etc.) without having to halve my SSD, but on btrfs I suppose
> it's impossible for bitrot errors to creep into backups, because I'd get
> a checksum error before that happened right?  Then I could just restore
> it from a previous backup.

Well, you'd get the checksum error /as/ the bitrot happened rather than 
before it, but you're correct that the rot couldn't creep into a backup: 
you'd get checksum errors on the bitrotted file and btrfs wouldn't even 
let you read it to back it up again, which in turn would mean it's time 
to restore at least that file from backup...

So yes, that plan does make sense to me. =:^)

BTW, it's worth noting the btrfs send/receive feature.  If both the ssd 
and the spinning rust backup are btrfs, send/receive should be an 
extremely efficient way to do the backups.  =:^)
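
Something along these lines (paths are only examples), with the -p form
for incremental runs once a first full copy exists:

$ btrfs subvolume snapshot -r /home /home/.snap/home-a
$ btrfs send /home/.snap/home-a | btrfs receive /mnt/backup
# next time, send only the difference against the previous snapshot:
$ btrfs subvolume snapshot -r /home /home/.snap/home-b
$ btrfs send -p /home/.snap/home-a /home/.snap/home-b | btrfs receive /mnt/backup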

Tho it may be worth keeping a more conventionally maintained second-level 
backup that's /not/ on btrfs as well, depending on how critical you 
consider that data.  While btrfs is stabilizing reasonably well now, it's 
not entirely stable yet and probably won't be for, let's say another 
year, and at least here, I really do sleep better knowing I have a non-
btrfs backup available as well.  You could manually checksum it, either 
in whole or in part, to be sure of detecting rot there, tho I've not done 
so here, figuring if I could survive decades without it before btrfs, I 
can survive another few years with it as a second backup.

Given the cost of SSD vs. spinning-rust, if all your data fits on the 
SSD, you should be able to do multiple levels of backup on spinning rust 
without breaking the bank.

(FWIW, altho as I mentioned earlier I have dual SSD btrfs raid1, I do 
still keep my media on spinning rust, NOT on SSD.  So I can't say all my 
data fits on SSD, here, or rather, it might, but that's not how I've set 
it up.  But as it happens the media files are both larger and less 
critical in terms of access speed, so spinning rust for them actually 
works out very well for me.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

