* RAID1, SSD+non-SSD
@ 2015-02-06 20:01 Brian B
2015-02-07 0:23 ` Chris Murphy
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Brian B @ 2015-02-06 20:01 UTC (permalink / raw)
To: linux-btrfs
My laptop has two disks, an SSD and a traditional magnetic disk. I plan
to make a partition on the mag disk equal in size to the SSD and set up
BTRFS RAID1. This I know how to do.
The only reason I'm doing the RAID1 is for the self-healing. I realize
writing large amounts of data will be slower than the SSD alone, but
is it possible to set it up to only read from the magnetic drive if
there's an error reading from the SSD?
In other words, is there a way to tell it to only read from the faster
disk? Is that even necessary? Is there a better way to accomplish
this?
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: RAID1, SSD+non-SSD
2015-02-06 20:01 RAID1, SSD+non-SSD Brian B
@ 2015-02-07 0:23 ` Chris Murphy
2015-02-07 18:06 ` Kai Krakow
2015-02-07 6:39 ` Duncan
2015-02-07 17:28 ` Kyle Manna
2 siblings, 1 reply; 10+ messages in thread
From: Chris Murphy @ 2015-02-07 0:23 UTC (permalink / raw)
To: Btrfs BTRFS
On Fri, Feb 6, 2015 at 1:01 PM, Brian B <canis8585@gmail.com> wrote:
> My laptop has two disks, an SSD and a traditional magnetic disk. I plan
> to make a partition on the mag disk equal in size to the SSD and set up
> BTRFS RAID1. This I know how to do.
There isn't a write-mostly option in btrfs like there is with md raid,
so I don't know how Btrfs will tolerate one device being exceptionally
slower than the other. It may be that most of the time it won't matter,
but I can imagine that a ton of IOPS backing up on the hard drive, having
already completed on the SSD, could be a problem. I'd test it, unless
someone else who has tried it pipes up.
>
> The only reason I'm doing the RAID1 is for the self-healing. I realize
> writing large amounts of data will be slower than the SSD alone, but
> is it possible to set it up to only read from the magnetic drive if
> there's an error reading from the SSD?
No.
>
> In other words, is there a way to tell it to only read from the faster
> disk? Is that even necessary? Is there a better way to accomplish
> this?
No. No. And maybe. In order.
If there is an error detected by either drive, or by Btrfs, Btrfs will
get the correct data from the other drive and fix the problem on the
original drive. You don't need to configure anything. The only concern
is the asymmetric performance.
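The same repair path can also be run on demand with a scrub, which reads
every block, verifies checksums, and rewrites bad copies from the good
mirror. A sketch, using a hypothetical mount point:

```shell
# On-demand version of the automatic repair described above.
btrfs scrub start -B /mnt   # -B: stay in the foreground and print a summary
btrfs scrub status /mnt     # progress/results of a background scrub
```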
I think the use case is better achieved with two HDDs + two SSD
partitions, configured either with bcache or dmcache. The result is
two logical devices using HDDs as backing device and SSD partitions as
cache, and then format them as Btrfs raid1. The question there of
course, is maturity of bcache vs dmcache and their interactions with
Btrfs. But at least that's supposed to work.
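A rough sketch of that layout with bcache, using hypothetical device
names (HDDs /dev/sda and /dev/sdb as backing devices, SSD partitions
/dev/sdc1 and /dev/sdc2 as caches) -- a sketch only, not something to run
against disks holding data:

```shell
# create two cache devices on the SSD and two backing devices on the HDDs
make-bcache -C /dev/sdc1
make-bcache -C /dev/sdc2
make-bcache -B /dev/sda
make-bcache -B /dev/sdb
# attach each backing device to a cache set (UUIDs from bcache-super-show)
echo <cset-uuid-1> > /sys/block/bcache0/bcache/attach
echo <cset-uuid-2> > /sys/block/bcache1/bcache/attach
# the cached logical devices then become the btrfs raid1 members
mkfs.btrfs -m raid1 -d raid1 /dev/bcache0 /dev/bcache1
```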
--
Chris Murphy
* Re: RAID1, SSD+non-SSD
2015-02-06 20:01 RAID1, SSD+non-SSD Brian B
2015-02-07 0:23 ` Chris Murphy
@ 2015-02-07 6:39 ` Duncan
2015-02-07 12:42 ` RAID1, SSD+non-SSD (RAID 5/6 question) Ed Tomlinson
2015-02-08 2:41 ` RAID1, SSD+non-SSD Brian B
2015-02-07 17:28 ` Kyle Manna
2 siblings, 2 replies; 10+ messages in thread
From: Duncan @ 2015-02-07 6:39 UTC (permalink / raw)
To: linux-btrfs
Brian B posted on Fri, 06 Feb 2015 15:01:30 -0500 as excerpted:
> The only reason I'm doing the [btrfs] RAID1 is for the self-healing. I
> realize writing large amounts of data will be slower than the SSD
> alone, but is it possible to set it up to only read from the magnetic
> drive if there's an error reading from the SSD?
Chris Murphy is correct. Btrfs raid1 doesn't have the write-mostly
option that mdraid has.
I'll simply expand on what he mentioned with two points, #1 being the
more important for your case.
1) The btrfs raid1 read-mode device choice algorithm is known to be sub-
optimal, and the plan is to change and optimize it in the longer term.
Basically, it's an easy first implementation that's simple enough to be
reasonably bug-free and to stay out of the developers' way while they
work on other things, while still allowing easy testing of both
devices.
Specifically, it's a very simple even/odd parity assignment based on the
PID making the request. Thus, a single PID read task will consistently
read from the same device (unless the block checksum on that device is
bad, then it tries the other device), no matter how much there is to read
and how backed up that device might be, or how idle the other one might
be. Even a second read task from another PID, or a 10th, or the 100th, if
they're all even or all odd parity PIDs, will all be assigned to read
from the same device, even if the other one is entirely idle.
Which ends up being worst-case for a multi-threaded heavy-read focused
task where all read threads happen to be even or odd, say if read and
compute threads are paired and always spawned in the same order, with
nothing else going on to throw the parity ordering off. But that's how
it's currently implemented. =:^(
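The even/odd rule is simple enough to sketch in a few lines of shell (a
hypothetical re-implementation of the rule as described, not btrfs
source):

```shell
# mirror = PID % 2: every request from a PID of the same parity lands on
# the same device, no matter how busy it is or how idle the other one is.
pick_mirror() {
    echo $(( $1 % 2 ))   # 0 = first device, 1 = second device
}
pick_mirror 4242   # -> 0
pick_mirror 4243   # -> 1
```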
And it /does/ make for easily repeatable test results, while being simple
enough to stay out of the way while development interest focuses
elsewhere, after all pretty important factors early in a project of this
scope. =:^)
Obviously, that's going to be bad news for you, too, unless your use-case
is specific enough that you can tune the read PIDs to favor the parity
that hits the SSD. =:^(
The claim is made that btrfs is stabilizing, and in fact, as a regular
here for some time, I can vouch for that. But I think it's reasonable to
argue that until this sort of read-scheduling algorithm is replaced with
something a bit more optimized, and that replacement well tested, it's
premature to call btrfs fully stable. A mis-optimization this painfully
bad in some cases just doesn't fit with "stable", and until development
quiets down far enough that the devs can feel comfortable focusing on
something like this, it's extremely hard to argue that development has
quieted down enough to fairly call it stable in the first place.
Well, my opinion anyway.
So the short of it is, at least until btrfs optimizes this a bit better,
for SSD paired with spinning-rust raid1 optimization, as Chris Murphy
suggested, use some sort of caching mechanism, bcache or dmcache.
Tho you'll want to compare notes with someone who has already tried it,
as there were some issues with at least btrfs and bcache earlier. I
believe they're fixed now, but as explained above, btrfs itself isn't
really entirely stable yet, so I'd definitely recommend keeping backups,
and comparing notes with others who have tried it. (I know there's some
on the list, tho they may not see this. But hopefully they'll respond to
a new thread with bcache or dmcache in the title, if you decide to go
that way.)
2) While this doesn't make a significant difference in the two-device
btrfs raid1 case, it does with three or more devices in the btrfs raid1,
and with other raid forms the difference is even stronger. I noticed you
wrote RAID1 in ALL CAPS form. Btrfs' raid implementations aren't quite
like traditional RAID, and I recall a dev (Chris Mason, actually, IIRC)
pointing out that the choice to use small-letters raidX nomenclature was
deliberate, in order to remind people that there is a difference.
Specifically for btrfs raid1, as contrasted to, for instance, md/RAID-1,
at present btrfs raid1 is always pair-mirrored, regardless of the number
of devices (above two, of course). While a three-device md/RAID-1 will
have three mirrors and a four-device md/RAID-1 will have four, simply
adding redundant mirrors while maintaining capacity (in the simple all-
the-same-size case, anyway), a three-device btrfs raid1 will have 1.5x
the capacity of a two-device btrfs raid1, and a four-device btrfs raid1
will have twice the two-device capacity, while maintaining a constant
pair-mirroring regardless of the number of devices in the btrfs raid1.
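The difference in usable capacity can be sketched numerically (equal-size
devices assumed; the sizes are arbitrary examples):

```shell
# btrfs raid1: every block stored exactly twice, regardless of device count
btrfs_raid1_usable() { echo $(( $1 * $2 / 2 )); }   # devices * size / 2
# md RAID-1: every device is a full mirror, so capacity stays at one device
md_raid1_usable()    { echo $2; }
btrfs_raid1_usable 3 100   # -> 150
btrfs_raid1_usable 4 100   # -> 200
md_raid1_usable    4 100   # -> 100
```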
For btrfs raid10, the pair-mirroring is there, but for odd numbers of
devices there's also a difference of uneven striping, because of the odd
one out in the mirroring and the difference in chunk size between data
and metadata chunks.
And of course there's the difference that data and metadata are treated
separately in btrfs, and don't have to have the same raid levels, nor are
they the same by default. A filesystem agnostic raid such as mdraid or
dmraid will by definition treat data and metadata alike as it won't be
able to tell the difference -- if it did it wouldn't be filesystem
agnostic.
Now that btrfs raid56 mode is basically complete with kernel 3.19, the
next thing on the raid side of the roadmap is N-way-mirroring. I'm
really looking forward to that as I really like btrfs' self-repair
capacities as well, but for me the ideal balance is three-way-mirroring,
just in case two copies fail checksum. Tho the fact of the matter is,
btrfs only now is getting to the point where a third mirror has some
reasonable chance of being useful, as until now btrfs itself was unstable
enough that the chances of it having a bug were far higher than of both
devices going bad for a checksummed block at the same time. But btrfs
really is much more stable than it was, and it's stable enough now that
the possibility of a third mirror really should start making statistical
sense pretty soon, if it doesn't already.
But given the time raid56 took, I'm not holding my breath. I guess
they'll be focused on the remaining raid56 bugs thru 3.20, and figure
it'll be at least three kernel cycles later, so second half of the year
at best, before we see N-way-mirroring in mainstream. This time next
year would actually seem more reasonable, and 2H-2016 or into 2017
wouldn't surprise me in the least, again, given the time raid56 mode
took. Hopefully it'll be there before 2018...
Tho as I said, for the two-device case, if both data and metadata are
raid1 mode, those differences can for the most part be ignored. Thus,
this point is mostly for others reading, and for you in the future should
you end up working with a btrfs raid1 with more than two devices. I
mostly mentioned it due to seeing that all-caps RAID1.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: RAID1, SSD+non-SSD (RAID 5/6 question)
2015-02-07 6:39 ` Duncan
@ 2015-02-07 12:42 ` Ed Tomlinson
2015-02-08 3:18 ` Duncan
2015-02-08 2:41 ` RAID1, SSD+non-SSD Brian B
1 sibling, 1 reply; 10+ messages in thread
From: Ed Tomlinson @ 2015-02-07 12:42 UTC (permalink / raw)
To: Duncan; +Cc: linux-btrfs
On Saturday, February 7, 2015 1:39:07 AM EST, Duncan wrote:
> The btrfs raid1 read-mode device choice algorithm
Duncan,
Very interesting stuff on the raid1 read select alg. What changes with
raid5/6? Is that alg 'smarter'?
TIA
Ed Tomlinson
* Re: RAID1, SSD+non-SSD
2015-02-06 20:01 RAID1, SSD+non-SSD Brian B
2015-02-07 0:23 ` Chris Murphy
2015-02-07 6:39 ` Duncan
@ 2015-02-07 17:28 ` Kyle Manna
2 siblings, 0 replies; 10+ messages in thread
From: Kyle Manna @ 2015-02-07 17:28 UTC (permalink / raw)
To: Brian B, linux-btrfs
On Fri Feb 06 2015 at 12:06:33 PM Brian B <canis8585@gmail.com> wrote:
>
> My laptop has two disks, an SSD and a traditional magnetic disk. I plan
> to make a partition on the mag disk equal in size to the SSD and set up
> BTRFS RAID1. This I know how to do.
>
> The only reason I'm doing the RAID1 is for the self-healing. I realize
> writing large amounts of data will be slower than the SSD alone, but
> is it possible to set it up to only read from the magnetic drive if
> there's an error reading from the SSD?
>
> In other words, is there a way to tell it to only read from the faster
> disk? Is that even necessary? Is there a better way to accomplish
> this?
What you may want to look at is lvmcache + btrfs. I've played with
lvmcache (using ext4 on top) and btrfs independently, but not
together. Too many new technologies at the same time for my taste. :)
The best documentation I've found on lvm cache is the man page:
http://man7.org/linux/man-pages/man7/lvmcache.7.html
LVM cache uses dm-cache behind the scenes and makes it much more
manageable (i.e. construction, manipulation, and teardown of devices).
An lvm cache won't help with redundancy; the blocks will exist either
on the caching device or on the slower device. To remove the cache, you
can force a flush of the blocks out of the cache to the traditional HDD
and use the volume without the cache, without having to recreate the
file system.
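A rough outline of the lvmcache steps from that man page, with
hypothetical device, VG, and LV names (a sketch, not a recipe):

```shell
pvcreate /dev/sda1 /dev/sdb1                 # HDD and SSD partitions
vgcreate vg /dev/sda1 /dev/sdb1
lvcreate -L 200G -n slow vg /dev/sda1        # origin LV on the HDD
lvcreate -L 20G  -n fast vg /dev/sdb1        # cache data LV on the SSD
lvcreate -L 100M -n fast_meta vg /dev/sdb1   # cache metadata LV on the SSD
lvconvert --type cache-pool --poolmetadata vg/fast_meta vg/fast
lvconvert --type cache --cachepool vg/fast vg/slow
mkfs.btrfs /dev/vg/slow                      # the fs sits on the cached LV
# removing the cache pool later flushes dirty blocks back to the HDD and
# leaves vg/slow usable on its own (see lvmcache(7)):
lvremove vg/fast
```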
* Re: RAID1, SSD+non-SSD
2015-02-07 0:23 ` Chris Murphy
@ 2015-02-07 18:06 ` Kai Krakow
2015-02-08 3:31 ` Duncan
0 siblings, 1 reply; 10+ messages in thread
From: Kai Krakow @ 2015-02-07 18:06 UTC (permalink / raw)
To: linux-btrfs
Chris Murphy <lists@colorremedies.com> schrieb:
> [...]
> I think the use case is better achieved with two HDDs + two SSD
> partitions, configured either with bcache or dmcache. The result is
> two logical devices using HDDs as backing device and SSD partitions as
> cache, and then format them as Btrfs raid1. The question there of
> course, is maturity of bcache vs dmcache and their interactions with
> Btrfs. But at least that's supposed to work.
Bcache on multi-device btrfs works fine for me. No problems yet, even in
case of hard-reset. I'm using a normal desktop workload, some Steam games,
some VMs, and some MySQL/PHP/Rails programming. A single bcache partition
can support many backing devices, so there's no need to use multiple
partitions in that case. In this scenario, bcache{0,1,2} is my btrfs
rootfs:
$ lsblk /dev/sdb
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 119,2G 0 disk
├─sdb1 8:17 0 512M 0 part
├─sdb2 8:18 0 20G 0 part
├─sdb3 8:19 0 79,5G 0 part
│ ├─bcache0 252:0 0 925,5G 0 disk
│ ├─bcache1 252:1 0 925,5G 0 disk
│ └─bcache2 252:2 0 925,5G 0 disk /home
└─sdb4 8:20 0 19,2G 0 part
sdb3 is the cache device, sdb4 is trimmed and left untouched for SSD
over-provisioning, sdb2 is a dedicated resume swap (traditional swapping
goes to all three HDDs), sdb1 is my ESP to boot kernel and initramfs from.
Bcache takes some time to warm up but is really fast afterwards: Boot times
(using systemd and readahead) went down from >60s (on spinning rust) to ~30s
on first boot (system was migrated to bcache with dev del/add, not
reinstalled), then ~15s, 10s, and now it fluctuates between 3 and 8s (mostly
around 5s) for reaching graphical.target (depending on whether I installed
updates). KDE takes some time to load but I suppose most of it is due to its
artificial 4 second delay during initialization of kded and friends - I
guess this will be fixed in KDE5. This boot target is not stripped down, it
includes network, mysql, postfix and some other stuff that one either
usually not needs or could be optimized away. The numbers are taken from
"systemd-analyze critical-chain".
Bcache hit rate is usually between 60 and 80% using write-back. So all in
all, I can generally recommend bcache. I don't know dmcache, tho. But I
really advise against using a mixed setup of SSD and HDD partitions in
btrfs RAID mode, especially since btrfs does not handle different-sized
partitions that
well. With bcache you can have your cake and eat it too (read: big storage
pool + fast access times).
BTW: Is there work in progress to let btrfs choose which device to read from
or write to other than using round-robin or pid mapping? Maybe it would be
interesting to watch the current read and write latencies of all drives and
choose the one with the lowest latency. Tho, I think it won't make much
sense when passing accesses through btrfs.
--
Replies to list only preferred.
* Re: RAID1, SSD+non-SSD
2015-02-07 6:39 ` Duncan
2015-02-07 12:42 ` RAID1, SSD+non-SSD (RAID 5/6 question) Ed Tomlinson
@ 2015-02-08 2:41 ` Brian B
2015-02-08 3:51 ` Duncan
1 sibling, 1 reply; 10+ messages in thread
From: Brian B @ 2015-02-08 2:41 UTC (permalink / raw)
To: linux-btrfs
On Sat, Feb 7, 2015 at 1:39 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> Brian B posted on Fri, 06 Feb 2015 15:01:30 -0500 as excerpted:
>
>> The only reason I'm doing the [btrfs] RAID1 is for the self-healing. I
>> realize writing large amounts of data will be slower than the SSD
>> alone, but is it possible to set it up to only read from the magnetic
>> drive if there's an error reading from the SSD?
>
> Chris Murphy is correct. Btrfs raid1 doesn't have the write-mostly
> option that mdraid has.
>
> I'll simply expand on what he mentioned with two points, #1 being the
> more important for your case.
> [...]
Thanks, very informative about the read alg. Sounds like it makes
more sense to simply do backups to the slower drive and manually
restore from those if I ever have a checksum error.
My main goal here was protection from undetectable sector corruption
("bitrot" etc.) without having to halve my SSD, but on btrfs I suppose
it's impossible for bitrot errors to creep into backups, because I'd
get a checksum error before that happened, right? Then I could just
restore it from a previous backup.
* Re: RAID1, SSD+non-SSD (RAID 5/6 question)
2015-02-07 12:42 ` RAID1, SSD+non-SSD (RAID 5/6 question) Ed Tomlinson
@ 2015-02-08 3:18 ` Duncan
0 siblings, 0 replies; 10+ messages in thread
From: Duncan @ 2015-02-08 3:18 UTC (permalink / raw)
To: linux-btrfs
Ed Tomlinson posted on Sat, 07 Feb 2015 07:42:50 -0500 as excerpted:
> On Saturday, February 7, 2015 1:39:07 AM EST, Duncan wrote:
>
>> The btrfs raid1 read-mode device choice algorithm
>
> Very interesting suff on the raid1 read select alg. What changes with
> raid5/6? Is that alg 'smarter'?
I don't know as much about the raid56 (5/6) mode. What I /do/ know about
it is that until the still-in-testing 3.19 kernel and similarly "now"
userspace, raid56 mode mkfs worked, and normal runtime worked, but scrub
and the various repair modes were code-incomplete. That made it
effectively an inefficient raid0 in practice -- the parity strips were
calculated and written, but the tools weren't there to properly recover
from them should it be necessary, so from an admin perspective it was
like a raid0: if a device drops out, say bye-bye to the entire
filesystem. In practice there were certain limited recovery steps that
could be taken in some circumstances, but as they couldn't be counted on,
from an admin perspective, the best policy really was to consider it a
slow raid0, as that's the risk you were taking, running it.
The difference was that if you set it up for raid5/6, once the tools were
complete and ready, you'd effectively get a "free" redundancy upgrade,
since it was actually running that way all along, it just couldn't be
recovered as such because the recovery tools weren't done yet.
With kernel 3.19, in theory all the btrfs raid56 mode kernel pieces are
there now, altho in practice there's still bugs being worked out, so I'd
not (bleeding-edge) trust it until 3.20 at least, and I'd hesitate to
consider it as (relatively) stable as single/dup/raid0/1/10 modes for
another couple kernels after that, simply because they've been usable for
long enough to have had quite a few more bugs found and worked out at
this point.
I'm not exactly sure what the status is on the userspace side, but I
/think/ it's there in the current v3.18.x userspace release, and should
be usable by the time the kernelspace is usable, kernel 3.20 with
userspace 3.19.
But with ~9 week release cycles and with 3.19 very close to out now, if
we take that 3.20 bleeding-edge usable in say 10 weeks from now, and call
raid56 mode reasonably stable two kernel cycles or 18 weeks later, that
puts it 28 weeks out, say 6.5 months, for reasonably stable. Which would
be late August. Of course if you're willing to take a bit more risk,
it's more like six or seven weeks, say 3.20-rc4 or so, about the end of
March. I'd really not recommend raid56 mode until then, unless you *ARE*
treating it exactly as you would a raid0, and are willing to call the
entire filesystem a complete loss if a device drops or there's any other
serious problem with it.
As for algorithm, AFAIK, operationally btrfs raid56 mode stripes data
similar to raid0, except that one or two devices of each stripe are of
course reserved for parity. So a three-way raid5 or a four-way raid6
will have a two-way-data-stripe, while a four-way raid5 or a five-way
raid6 will have a three-way-data-stripe.
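That stripe arithmetic, as a trivial sketch:

```shell
# data strips per stripe = devices minus parity devices (1 for raid5, 2 for raid6)
data_strips() { echo $(( $1 - $2 )); }
data_strips 3 1   # raid5 on 3 devices -> 2
data_strips 4 2   # raid6 on 4 devices -> 2
data_strips 4 1   # raid5 on 4 devices -> 3
data_strips 5 2   # raid6 on 5 devices -> 3
```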
Since data chunks are nominally 1 GiB and the allocator will allocate a
chunk on each device, then stripe sub-chunk across the full available
width with raid0/5/6, in theory at least, performance should be very
similar to a conventional raid0/5/6, at least for a single thread.
Which means writes are going to be the big bottleneck, just as they are
with conventional raid5/6, since they end up being read-modify-write for
any of the strips of the stripe not yet read into cache yet.
FWIW I actually ran md/RAID-6 here for awhile (general desktop/
workstation use-case, tho on gentoo, so call it developer's workstation
due to the building from source), and was rather disappointed. I found a
well-optimized raid1 implementation (as md/RAID-1 is) to be much more
efficient, even with four-way-mirroring!
Tho due to btrfs raid1 mode not yet being optimized, btrfs raid56 mode
even with a reasonable write load, might well actually be competitive or
even faster, at this point. I haven't even looked to see if there's any
benchmarks on that, yet. (Despite raid56 mode repair tools not being
complete, runtime worked, so it could have been benchmarked against raid1
mode already. I just haven't checked to see if there's actually a report
of such on the wiki or wherever.)
But back to the SSD+spinning-rust combo, I don't expect btrfs raid56 mode
to do particularly well on that, either, tho at least you wouldn't have
the potential worst-case of all reads getting assigned to the spinning
rust, as could well happen with btrfs' unoptimized raid1 mode, at this
point. Intuitively, I'd predict that read thruput would be similar to
that of reading just the spinning-rust share off the spinning-rust
device. IOW, when reading from both, the SSD would be done so fast it
wouldn't even show up in the results, while the speed of the spinning
rust would be what you'd be getting for data read off of it, so where
half the data is on spinning rust and half on ssd, you'd effectively get
twice the speed you'd get if it were all on spinning rust, because half
would show up at spinning rust speed, while the other half would already
be there by the time the spinning rust side finished. But that's simply
intuition, and simple intuition could be quite wrong. You could of
course test it.
The ideal, if you don't want to deal with a cache layer, as I didn't,
would be to simply declare the money to put it all on SSD worth it, and
just do that. Two SSDs in btrfs raid1 mode. That's actually what I'm
running here, tho I don't like all my data eggs in the same filesystem
basket, so I actually have both SSDs partitioned up similarly, and am
running multiple smaller independent btrfs, all (but for /boot) being
btrfs raid1, with each of the two devices for each btrfs raid1 being a
partition on one of the SSDs.
That actually works quite well and I've been very happy with it. =:^)
Particularly when a full balance/scrub/check on a filesystem takes
under 10 minutes, with some of them a minute or less, both because of the
speed of the SSDs and because the filesystems are all under 50 GiB
each. It's **MUCH** easier to work with such filesystems when a scrub or
balance doesn't take the **DAYS** people often report for their multi-
terabyte spinning-rust based filesystems!
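For reference, the maintenance operations in question (mountpoint is a
placeholder; these need a real, mounted btrfs filesystem and root):

```shell
# /mnt/pool stands in for an already-mounted btrfs raid1 filesystem.
# Scrub reads every block on both mirrors, verifies checksums, and
# repairs a bad copy from the good mirror:
btrfs scrub start /mnt/pool
btrfs scrub status /mnt/pool

# A full balance rewrites every chunk -- this is the operation that can
# take days on large spinning-rust filesystems:
btrfs balance start /mnt/pool
```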
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: RAID1, SSD+non-SSD
2015-02-07 18:06 ` Kai Krakow
@ 2015-02-08 3:31 ` Duncan
0 siblings, 0 replies; 10+ messages in thread
From: Duncan @ 2015-02-08 3:31 UTC (permalink / raw)
To: linux-btrfs
Kai Krakow posted on Sat, 07 Feb 2015 19:06:14 +0100 as excerpted:
> BTW: Is there work in progress to let btrfs choose which device to read
> from or write to other than using round-robin or pid mapping? Maybe it
> would be interesting to watch the current read and write latencies of
> all drives and choose the one with the lowest latency. Tho, I think it
> won't make much sense when passing accesses through btrfs.
There are several projects in that general area suggested on the wiki.
You'd need to look there for status (unclaimed, claimed, in progress,
etc.) and to see whether any of them match what you had in mind closely
enough, or whether you might wish to add another.
There's definitely optimization planned, with the project ideas mentioned
above going beyond that. However, I'm not sure of the status of the
actually planned optimization either. There's certainly the standard
worry about premature optimization, but arguably we're past the point at
which it'd be premature, and actually need it now. If for no other
reason, then because making a case for true btrfs stability is rather
difficult if such optimization is either still being held off as
premature, or is no longer held off but that state is so new the
optimization simply hasn't been done yet.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: RAID1, SSD+non-SSD
2015-02-08 2:41 ` RAID1, SSD+non-SSD Brian B
@ 2015-02-08 3:51 ` Duncan
0 siblings, 0 replies; 10+ messages in thread
From: Duncan @ 2015-02-08 3:51 UTC (permalink / raw)
To: linux-btrfs
Brian B posted on Sat, 07 Feb 2015 21:41:08 -0500 as excerpted:
> Sounds like it makes more
> sense to simply do backups to the slower drive and manually restore from
> those if I ever have a checksum error.
>
> My main goal here was protection from undetectable sector corruption
> ("bitrot" etc.) without having to halve my SSD, but on btrfs I suppose
> it's impossible for bitrot errors to creep into backups, because I'd get
> a checksum error before that happened right? Then I could just restore
> it from a previous backup.
Well, the rot could creep in /as/ it happened, but you couldn't then
back up the bitrotted file, because you're correct: you'd get checksum
errors due to the bitrot, and btrfs wouldn't even let you read the file
to back it up again, which in turn would mean it's time to restore at
least that file from backup...
So yes, that plan does make sense to me. =:^)
BTW, it's worth noting the btrfs send/receive feature. If both the ssd
and the spinning rust backup are btrfs, send/receive should be an
extremely efficient way to do the backups. =:^)
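A sketch of what that looks like, with hypothetical mountpoints and
subvolume names (assumes /mnt/ssd and /mnt/backup are both btrfs, and
/mnt/ssd/home is a subvolume):

```shell
# Initial full backup: snapshot read-only, then stream it across.
btrfs subvolume snapshot -r /mnt/ssd/home /mnt/ssd/home.snap1
btrfs send /mnt/ssd/home.snap1 | btrfs receive /mnt/backup

# Later backups are incremental: with -p (parent), only the blocks
# changed since snap1 cross the pipe.
btrfs subvolume snapshot -r /mnt/ssd/home /mnt/ssd/home.snap2
btrfs send -p /mnt/ssd/home.snap1 /mnt/ssd/home.snap2 | btrfs receive /mnt/backup
```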
Tho it may be worth keeping a more conventionally maintained second-level
backup that's /not/ on btrfs as well, depending on how critical you
consider that data. While btrfs is stabilizing reasonably well now, it's
not entirely stable yet and probably won't be for, let's say, another
year, and at least here, I really do sleep better knowing I have a non-
btrfs backup available as well. You could manually checksum it, either
in whole or in part, to be sure of detecting rot there, tho I've not done
so here, figuring that if I could survive decades without it before
btrfs, I can survive another few years with it as a second backup.
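Manual checksumming of a non-btrfs backup can be as simple as a
sha256sum manifest (paths are placeholders; keep the manifest outside
the tree it covers):

```shell
# Checksum every file in the backup tree so later rot is detectable.
cd /mnt/backup2
find . -type f -print0 | xargs -0 sha256sum > ~/backup2.sha256

# Verification pass: any file whose content changed since the manifest
# was written is reported as FAILED.
sha256sum -c --quiet ~/backup2.sha256
```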
Given the cost of SSD vs. spinning-rust, if all your data fits on the
SSD, you should be able to do multiple levels of backup on spinning rust
without breaking the bank.
(FWIW, altho as I mentioned earlier I have dual SSD btrfs raid1, I do
still keep my media on spinning rust, NOT on SSD. So I can't say all my
data fits on SSD, here, or rather, it might, but that's not how I've set
it up. But as it happens the media files are both larger and less
critical in terms of access speed, so spinning rust for them actually
works out very well for me.)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
end of thread, other threads:[~2015-02-08 3:52 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-06 20:01 RAID1, SSD+non-SSD Brian B
2015-02-07 0:23 ` Chris Murphy
2015-02-07 18:06 ` Kai Krakow
2015-02-08 3:31 ` Duncan
2015-02-07 6:39 ` Duncan
2015-02-07 12:42 ` RAID1, SSD+non-SSD (RAID 5/6 question) Ed Tomlinson
2015-02-08 3:18 ` Duncan
2015-02-08 2:41 ` RAID1, SSD+non-SSD Brian B
2015-02-08 3:51 ` Duncan
2015-02-07 17:28 ` Kyle Manna