* dstat shows unexpected result for two disk RAID1
@ 2016-03-09 20:21 Nicholas D Steeves
  2016-03-09 20:25 ` Nicholas D Steeves
  0 siblings, 1 reply; 18+ messages in thread
From: Nicholas D Steeves @ 2016-03-09 20:21 UTC (permalink / raw)
  To: linux-btrfs

Hello everyone,

I've run into an unexpected behaviour with my two disk RAID1.  I mount
with UUIDs, because sometimes my USB disk gets /dev/sdc instead of
/dev/sdd.  The two elements of my RAID1 are currently sdb and sdd.

dstat -tdD total,sdb,sdc,sdd

It seems that per process, reads come from either sdb or sdd.  This
surprises me, because I understood that a btrfs RAID1


* Re: dstat shows unexpected result for two disk RAID1
  2016-03-09 20:21 dstat shows unexpected result for two disk RAID1 Nicholas D Steeves
@ 2016-03-09 20:25 ` Nicholas D Steeves
  2016-03-09 20:50   ` Goffredo Baroncelli
                     ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Nicholas D Steeves @ 2016-03-09 20:25 UTC (permalink / raw)
  To: linux-btrfs

grr.  Gmail is terrible :-/

I understood that a btrfs RAID1 would at best grab one block from sdb
and then one block from sdd in round-robin fashion, or at worst grab
one chunk from sdb and then one chunk from sdd.  Alternatively I
thought that it might read from both simultaneously, to make sure that
all data matches, while at the same time providing single-disk
performance.  None of these was the case.  Running a single
IO-intensive process reads from a single drive.

Did I misunderstand the documentation and is this normal, or is this a bug?
Nicholas

On 9 March 2016 at 15:21, Nicholas D Steeves <nsteeves@gmail.com> wrote:
> Hello everyone,
>
> I've run into an unexpected behaviour with my two disk RAID1.  I mount
> with UUIDs, because sometimes my USB disk gets /dev/sdc instead of
> /dev/sdd.  The two elements of my RAID1 are currently sdb and sdd.
>
> dstat -tdD total,sdb,sdc,sdd
>
> It seems that per process, reads come from either sdb or sdd.  This
> surprises me, because I understood that a btrfs RAID1


* Re: dstat shows unexpected result for two disk RAID1
  2016-03-09 20:25 ` Nicholas D Steeves
@ 2016-03-09 20:50   ` Goffredo Baroncelli
  2016-03-09 21:26   ` Chris Murphy
  2016-03-09 21:36   ` Roman Mamedov
  2 siblings, 0 replies; 18+ messages in thread
From: Goffredo Baroncelli @ 2016-03-09 20:50 UTC (permalink / raw)
  To: Nicholas D Steeves, linux-btrfs

On 2016-03-09 21:25, Nicholas D Steeves wrote:
> grr.  Gmail is terrible :-/
> 
> I understood that a btrfs RAID1 would at best grab one block from sdb
> and then one block from sdd in round-robin fashion, or at worst grab
> one chunk from sdb and then one chunk from sdd.  Alternatively I
> thought that it might read from both simultaneously, to make sure that
> all data matches, while at the same time providing single-disk
> performance.  None of these was the case.  Running a single
> IO-intensive process reads from a single drive.
> 
> Did I misunderstand the documentation and is this normal, or is this a bug?
> Nicholas

In the case of a BTRFS RAID1, as far as I knew, a process reads from a
drive chosen according to its PID.  I don't know if that has changed,
but from what you write it seems that it is still true today.



> 
> On 9 March 2016 at 15:21, Nicholas D Steeves <nsteeves@gmail.com> wrote:
>> Hello everyone,
>>
>> I've run into an unexpected behaviour with my two disk RAID1.  I mount
>> with UUIDs, because sometimes my USB disk gets /dev/sdc instead of
>> /dev/sdd.  The two elements of my RAID1 are currently sdb and sdd.
>>
>> dstat -tdD total,sdb,sdc,sdd
>>
>> It seems that per process, reads come from either sdb or sdd.  This
>> surprises me, because I understood that a btrfs RAID1


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: dstat shows unexpected result for two disk RAID1
  2016-03-09 20:25 ` Nicholas D Steeves
  2016-03-09 20:50   ` Goffredo Baroncelli
@ 2016-03-09 21:26   ` Chris Murphy
  2016-03-09 22:51     ` Nicholas D Steeves
  2016-03-11 23:42     ` Nicholas D Steeves
  2016-03-09 21:36   ` Roman Mamedov
  2 siblings, 2 replies; 18+ messages in thread
From: Chris Murphy @ 2016-03-09 21:26 UTC (permalink / raw)
  To: Nicholas D Steeves; +Cc: Btrfs BTRFS

On Wed, Mar 9, 2016 at 1:25 PM, Nicholas D Steeves <nsteeves@gmail.com> wrote:
> grr.  Gmail is terrible :-/
>
> I understood that a btrfs RAID1 would at best grab one block from sdb
> and then one block from sdd in round-robin fashion, or at worst grab
> one chunk from sdb and then one chunk from sdd.  Alternatively I
> thought that it might read from both simultaneously, to make sure that
> all data matches, while at the same time providing single-disk
> performance.  None of these was the case.  Running a single
> IO-intensive process reads from a single drive.
>
> Did I misunderstand the documentation and is this normal, or is this a bug?
> Nicholas
>
> On 9 March 2016 at 15:21, Nicholas D Steeves <nsteeves@gmail.com> wrote:
>> Hello everyone,
>>
>> I've run into an unexpected behaviour with my two disk RAID1.  I mount
>> with UUIDs, because sometimes my USB disk gets /dev/sdc instead of
>> /dev/sdd.  The two elements of my RAID1 are currently sdb and sdd.
>>
>> dstat -tdD total,sdb,sdc,sdd
>>
>> It seems that per process, reads come from either sdb or sdd.  This
>> surprises me, because I understood that a btrfs RAID1

It's normal and recognized to be sub-optimal. So it's an optimization
opportunity. :-)

I see parallelization of reads and writes to data single profile
multiple devices as useful also, similar to XFS allocation group
parallelization. Those AGs are spread across multiple devices in
md/lvm linear layouts, so if you have processes that read/write to
multiple AGs at a time, those I/Os happen at the same time when on
separate devices.


-- 
Chris Murphy


* Re: dstat shows unexpected result for two disk RAID1
  2016-03-09 20:25 ` Nicholas D Steeves
  2016-03-09 20:50   ` Goffredo Baroncelli
  2016-03-09 21:26   ` Chris Murphy
@ 2016-03-09 21:36   ` Roman Mamedov
  2016-03-09 21:43     ` Chris Murphy
  2016-03-10  4:06     ` Duncan
  2 siblings, 2 replies; 18+ messages in thread
From: Roman Mamedov @ 2016-03-09 21:36 UTC (permalink / raw)
  To: Nicholas D Steeves; +Cc: linux-btrfs


On Wed, 9 Mar 2016 15:25:19 -0500
Nicholas D Steeves <nsteeves@gmail.com> wrote:

> grr.  Gmail is terrible :-/
> 
> I understood that a btrfs RAID1 would at best grab one block from sdb
> and then one block from sdd in round-robin fashion, or at worst grab
> one chunk from sdb and then one chunk from sdd.  Alternatively I
> thought that it might read from both simultaneously, to make sure that
> all data matches, while at the same time providing single-disk
> performance.  None of these was the case.  Running a single
> IO-intensive process reads from a single drive.

No RAID1 implementation reads from disks in a round-robin fashion, as that
would give terrible performance giving disks a constant seek load instead of
the normal linear read scenario.

As for reading at the same time, there's no reason to do that either, since
the data integrity is protected by checksums, and "the other" disk for a
particular data piece is being consulted only in case the checksum did not
match (or when you execute a 'scrub').

It's a known limitation that the disks are in effect "pinned" to running
processes, based on their process ID. One process reads from the same disk,
from the point it started and until it terminates. Other processes by luck may
read from a different disk, thus achieving load balancing. Or they may not,
and you will have contention with the other disk idling. This is unlike MD
RAID1, which knows to distribute read load dynamically to the least-utilized
array members.
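
For illustration, here is a tiny user-space sketch of the policy just
described -- not the btrfs kernel code (the real selection happens in the
chunk/device mapping code and is more involved), just the shape of the
idea, with hypothetical names:

/* raid1_read_sketch.c -- illustrative only, NOT btrfs code.
 * Two points from above:
 *   1) a process is "pinned" to one mirror, chosen from its PID, and
 *   2) the second copy is consulted only when the checksum of the
 *      first read does not match (or during a scrub).
 */
#include <stdio.h>
#include <stdbool.h>
#include <unistd.h>

#define NUM_COPIES 2

/* Stand-in for "read this block from the given mirror and verify its csum". */
static bool read_and_verify(int mirror, long block)
{
        printf("pid %ld: reading block %ld from mirror %d\n",
               (long)getpid(), block, mirror);
        return true;            /* pretend the checksum matched */
}

/* The pinning: one mirror for the whole life of the process. */
static int pick_mirror(void)
{
        return (int)(getpid() % NUM_COPIES);
}

static void raid1_read(long block)
{
        int mirror = pick_mirror();

        if (!read_and_verify(mirror, block))
                /* checksum mismatch: only now touch the other copy */
                read_and_verify((mirror + 1) % NUM_COPIES, block);
}

int main(void)
{
        for (long block = 0; block < 4; block++)
                raid1_read(block);
        return 0;
}

Every run of this program pins all of its reads to one "mirror" based on
its PID; a second process may or may not land on the other one, which is
the load-balancing-by-luck described above.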

Now if you want to do some more performance evaluation, check with your dstat
if both disks happen to *write* data in parallel, when you write to the array,
as ideally they should. Last I checked they mostly didn't, and this almost
halved write performance on a Btrfs RAID1 compared to a single disk.

-- 
With respect,
Roman



* Re: dstat shows unexpected result for two disk RAID1
  2016-03-09 21:36   ` Roman Mamedov
@ 2016-03-09 21:43     ` Chris Murphy
  2016-03-09 22:08       ` Nicholas D Steeves
  2016-03-10  4:06     ` Duncan
  1 sibling, 1 reply; 18+ messages in thread
From: Chris Murphy @ 2016-03-09 21:43 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Nicholas D Steeves, Btrfs BTRFS

On Wed, Mar 9, 2016 at 2:36 PM, Roman Mamedov <rm@romanrm.net> wrote:
> On Wed, 9 Mar 2016 15:25:19 -0500
> Nicholas D Steeves <nsteeves@gmail.com> wrote:
>
>> grr.  Gmail is terrible :-/
>>
>> I understood that a btrfs RAID1 would at best grab one block from sdb
>> and then one block from sdd in round-robin fashion, or at worst grab
>> one chunk from sdb and then one chunk from sdd.  Alternatively I
>> thought that it might read from both simultaneously, to make sure that
>> all data matches, while at the same time providing single-disk
>> performance.  None of these was the case.  Running a single
>> IO-intensive process reads from a single drive.
>
> No RAID1 implementation reads from disks in a round-robin fashion, as that
> would give terrible performance giving disks a constant seek load instead of
> the normal linear read scenario.
>
> As for reading at the same time, there's no reason to do that either, since
> the data integrity is protected by checksums, and "the other" disk for a
> particular data piece is being consulted only in case the checksum did not
> match (or when you execute a 'scrub').
>
> It's a known limitation that the disks are in effect "pinned" to running
> processes, based on their process ID. One process reads from the same disk,
> from the point it started and until it terminates. Other processes by luck may
> read from a different disk, thus achieving load balancing. Or they may not,
> and you will have contention with the other disk idling. This is unlike MD
> RAID1, which knows to distribute read load dynamically to the least-utilized
> array members.

This is a better qualification than my answer.


>
> Now if you want to do some more performance evaluation, check with your dstat
> if both disks happen to *write* data in parallel, when you write to the array,
> as ideally they should. Last I checked they mostly didn't, and this almost
> halved write performance on a Btrfs RAID1 compared to a single disk.

I've found it to be about the same or slightly less than single disk.
But most of my writes to raid1 are btrfs receive.


-- 
Chris Murphy


* Re: dstat shows unexpected result for two disk RAID1
  2016-03-09 21:43     ` Chris Murphy
@ 2016-03-09 22:08       ` Nicholas D Steeves
  0 siblings, 0 replies; 18+ messages in thread
From: Nicholas D Steeves @ 2016-03-09 22:08 UTC (permalink / raw)
  To: Btrfs BTRFS

On 9 March 2016 at 16:43, Chris Murphy <lists@colorremedies.com> wrote:
> On Wed, Mar 9, 2016 at 2:36 PM, Roman Mamedov <rm@romanrm.net> wrote:
>> On Wed, 9 Mar 2016 15:25:19 -0500
> This is a better qualification than my answer.
>
>>
>> Now if you want to do some more performance evaluation, check with your dstat
>> if both disks happen to *write* data in parallel, when you write to the array,
>> as ideally they should. Last I checked they mostly didn't, and this almost
>> halved write performance on a Btrfs RAID1 compared to a single disk.
>
> I've found it to be about the same or slightly less than single disk.
> But most of my writes to raid1 are btrfs receive.

Here are my results for running
'pv /tmpfs_mem_disk/deleteme.tar -pabet > /scratch/deleteme.tar'
after clearing all caches.  pv reported an average rate of 77MiB/s,
which seems low for a 4GB file.  Here is the dstat section showing the
peak write rates:

----system---- --dsk/total-- ---dsk/sdb--- ---dsk/sdd---
     time     | read  writ : read  writ : read  writ
09-03 16:48:43|  48k  145M :    0   74M :  48k   72M
09-03 16:48:44|    0  120M :    0   74M :    0   46M
09-03 16:48:45| 840k  144M :    0   74M :    0   70M
09-03 16:48:46|    0  147M :    0   80M :    0   67M

and for reading many >200MB raw WAVs from one subvolume while writing
a ~20GB tar to another subvolume:

09-03 16:59:57|  56M  103M :    0   54M :  56M   50M
09-03 16:59:58|  48M  118M :  32k   56M :  48M   62M
09-03 16:59:59|  54M  113M :    0   57M :  54M   55M
09-03 17:00:00|  43M  116M :    0   54M :  43M   63M
09-03 17:00:01|  60M  118M :    0   64M :  60M   54M
09-03 17:00:02|  57M   97M :  32k   48M :  54M   49M


* Re: dstat shows unexpected result for two disk RAID1
  2016-03-09 21:26   ` Chris Murphy
@ 2016-03-09 22:51     ` Nicholas D Steeves
  2016-03-11 23:42     ` Nicholas D Steeves
  1 sibling, 0 replies; 18+ messages in thread
From: Nicholas D Steeves @ 2016-03-09 22:51 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On 9 March 2016 at 16:36, Roman Mamedov <rm@romanrm.net> wrote:
> On Wed, 9 Mar 2016 15:25:19 -0500
> Nicholas D Steeves <nsteeves@gmail.com> wrote:
>
>> I understood that a btrfs RAID1 would at best grab one block from sdb
>> and then one block from sdd in round-robin fashion, or at worst grab
>> one chunk from sdb and then one chunk from sdd.  Alternatively I
>> thought that it might read from both simultaneously, to make sure that
>> all data matches, while at the same time providing single-disk
>> performance.  None of these was the case.  Running a single
>> IO-intensive process reads from a single drive.
>
> No RAID1 implementation reads from disks in a round-robin fashion, as that
> would give terrible performance giving disks a constant seek load instead of
> the normal linear read scenario.

On 9 March 2016 at 16:26, Chris Murphy <lists@colorremedies.com> wrote:
> It's normal and recognized to be sub-optimal. So it's an optimization
> opportunity. :-)
>
> I see parallelization of reads and writes to data single profile
> multiple devices as useful also, similar to XFS allocation group
> parallelization. Those AGs are spread across multiple devices in
> md/lvm linear layouts, so if you have processes that read/write to
> multiple AGs at a time, those I/Os happen at the same time when on
> separate devices.

Chris, yes, that's exactly how I thought that it would work.  Roman,
when I said round-robin--please forgive my naïveté--I meant I hoped
there would be a chunk A1 read from disk0 at the same time as chunk A2
is read from disk1.  Could the btree associated with chunk A1 be used
to put disk1 to work reading ahead, while still searching the btree
associated with chunk A1?  Then, when disk0 finishes reading A1 into
memory, A2 gets concatenated.

If disk0 finishes reading chunk A1 first, change the primary read disk
for the PID to disk1 and let the read of A2 continue, and put disk0 to
work using the same method disk1 was using previously, but on chunk A3.
Else, if disk1 finishes reading A2 before disk0 finishes A1, then disk0
remains the primary read disk for the PID and disk1 begins reading A3.
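
To make that concrete, here is a tiny, purely illustrative user-space
sketch of the alternation I have in mind -- hypothetical names only,
nothing to do with the actual btrfs code, and with "whichever disk
finishes first becomes primary" simplified to a plain swap after every
chunk:

/* alternate_read_sketch.c -- illustrative model of the proposal above.
 * The "primary" disk serves chunk A[n] for the process while the other
 * disk reads ahead into A[n+1]; the roles then swap, so the prefetching
 * disk becomes primary for the chunk it just fetched.
 */
#include <stdio.h>

#define NR_CHUNKS 6

static void read_chunk(int disk, int chunk, const char *how)
{
        printf("disk%d %s chunk A%d\n", disk, how, chunk);
}

int main(void)
{
        int primary = 0;                     /* disk pinned to this PID */

        for (int n = 1; n <= NR_CHUNKS; n++) {
                read_chunk(primary, n, "serves");
                if (n < NR_CHUNKS)
                        read_chunk(1 - primary, n + 1, "reads ahead into");
                primary = 1 - primary;       /* swap roles */
        }
        return 0;
}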

That's how I thought that it would work, and that the scheduler could
interrupt the readahead operation for the non-primary disk.  Eg: disk1
would become the primary read disk for PID2, while disk0 would continue
as primary for PID1.  And if there's a long queue of reads or writes,
then this simplest case would be limited in the following way: disk0 and
disk1 never actually get to read or write to the same chunk <- Is this
the explanation why, for practical reasons, dstat shows the behaviour it
shows?

If this is the case, would it be possible for the non-primary read
disk for PID1 to tag the A[x] chunk it wrote to memory with a request
for the PID to use what it wrote to memory from A[x]?  And also for
the "primary" disk to resume from location y in A[x] instead beginning
from scratch with A[x]?  Roman, in this case, the seeks would be
time-saving, no?

Unfortunately, I don't know how to implement this, but I had imagined
that the btree for a directory contained pointers (I'm using this term
loosely rather than programmatically) to all extents associated with all
files contained underneath it.  Or does it point to the chunk, which
then points to the extent?  At any rate, is this similar to the
dir_index of ext4, and is this the method btrfs uses?

Best regards,
Nicholas


* Re: dstat shows unexpected result for two disk RAID1
  2016-03-09 21:36   ` Roman Mamedov
  2016-03-09 21:43     ` Chris Murphy
@ 2016-03-10  4:06     ` Duncan
  2016-03-10  5:01       ` Chris Murphy
  2016-03-12  0:04       ` Nicholas D Steeves
  1 sibling, 2 replies; 18+ messages in thread
From: Duncan @ 2016-03-10  4:06 UTC (permalink / raw)
  To: linux-btrfs

Roman Mamedov posted on Thu, 10 Mar 2016 02:36:27 +0500 as excerpted:

> It's a known limitation that the disks are in effect "pinned" to running
> processes, based on their process ID. One process reads from the same
> disk, from the point it started and until it terminates. Other processes
> by luck may read from a different disk, thus achieving load balancing.
> Or they may not, and you will have contention with the other disk
> idling. This is unlike MD RAID1, which knows to distribute read load
> dynamically to the least-utilized array members.
> 
> Now if you want to do some more performance evaluation, check with your
> dstat if both disks happen to *write* data in parallel, when you write
> to the array,
> as ideally they should. Last I checked they mostly didn't, and this
> almost halved write performance on a Btrfs RAID1 compared to a single
> disk.

As stated, at present btrfs mostly handles devices (I've made it a 
personal point to try not to say disks, because SSD, etc, unless it's 
/specific/ /to/ spinning rust, but device remains correct) one at a time 
per task.

And for raid1 reads in particular, the read scheduler is a very simple 
even/odd PID based scheduler, implemented early on when simplicity of 
implementation and easy testing of single-task single-device, multi-task 
multi-device, and multi-task-bottlenecked-to-single-device, all three 
scenarios, was of prime consideration, far more so than speed.  Indeed, 
at that point, optimization would have been a prime example of "premature 
optimization", as it would almost certainly have either restricted 
various feature-implementation choices later on, or would have needed to 
be redone once those features and their constraints were known, thus 
losing the work done in the first optimization.

And in fact, I've pointed out this very fact as an easily seen example 
of why btrfs isn't yet fully stable or production ready -- as can be seen 
in the work of the very developers themselves.  Any developer worth the 
name will be very wary of the dangers of "premature optimization" and the 
risk it brings of either severely limiting the implementation of further 
features or having good work thrown out because it doesn't match the new 
code.

When the devs consider the btrfs code stable enough, they'll optimize 
this.  Until then, it's prime evidence that they do _not_ consider btrfs 
stable and mature enough for this sort of optimization just yet. =:^)


Meanwhile, N-way-mirroring has been on the roadmap for implementation 
after raid56 for quite some time (since at least kernel 3.5, when raid56 
was expected in kernel 3.6) -- basically, raid1 the way mdraid does it, 
so 5 devices means 5 mirrors, not the precisely-two-mirrors-of-each-chunk 
scheme we have now, with new chunks distributed across the devices until 
they've all been used (tho that would continue to be an option).

And FWIW, N-way-mirroring is a primary feature interest of mine so I've 
been following it more closely than much of btrfs development.

Of course the logical raid10 extension of that would be the ability to 
specify N mirrors and M stripes on raid10 as well, so that for a 6-device 
raid10, you could choose between the existing two-way-mirroring, three-
way-striping, and a new three-way-mirroring, two-way-striping, mode, tho 
I don't know if they'll implement both N-way-mirroring raid1 and N-way-
mirroring raid10 at the same time, or wait on the latter.

Either way, my point in bringing up N-way-mirroring, is that it has been 
roadmapped for quite some time, and with it roadmapped, attempting either 
two-way-only-optimization or N-way-optimization, now, arguably _would_ be 
premature optimization, because the first would have to be redone for N-
way once it became available, and there's no way to test that the second 
actually works beyond two-way, until n-way is actually available.

So I'd guess N-way-read-optimization, with N=2 just one of the 
possibilities, will come after N-way-mirroring, which in turn has long 
been roadmapped for after raid56.

Meanwhile, while parity-raid (aka raid56) isn't as bad as it was when 
first nominally completed in 3.19, as of 4.4 (and I think 4.5 as I've not 
seen a full trace yet, let alone a fix), there's still at least one known 
bug remaining to be traced down and exterminated, that's causing at least 
some raid56 reshapes to different numbers of devices or recovery from a 
lost device to take at least 10 times as long as they logically should, 
we're talking times of weeks to months, during which time the array can 
be used, but if it's a bad device replacement and more devices go down in 
that time...  So even if it's not an immediate data-loss bug, it's still 
a blocker in terms of actually using parity-raid for the purposes parity-
raid is normally used for.

So raid56, while nominally complete now (after nearly four /years/ of 
work -- remember, originally it was intended for kernel 3.5 or 3.6), still 
isn't anything close to as stable as the rest of btrfs, and is still 
requiring developer focus, so it could be awhile before we see that N-way-
mirroring that was roadmapped after it, which in turn means it'll likely 
be even longer before we see good raid1 read optimization.

Tho hopefully all the really tough problems they would have hit with N-
way-mirroring were hit and resolved with raid56, and N-way-mirroring will 
thus be relatively simple, so hopefully it's less than the four years 
it's taking raid56.  But I don't expect to see it for another year or 
two, and don't expect to actually be able to use it as intended (as a more 
failure resistant raid1) for some time after that as the bugs get worked 
out, so realistically, 2-3 years.

If multi-device scheduling optimization is done in say 6 months after 
that... that means we're looking at 2.5-3.5 years, perhaps longer, for 
it.  So it's a known issue, yes, and on the roadmap, yes, but don't 
expect to see anything in the near (<2-year) future, more like the 
intermediate (3-5 year) future.  In all honesty I don't seriously expect 
it to be in the long-term future, beyond 5 years, but it's possible.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: dstat shows unexpected result for two disk RAID1
  2016-03-10  4:06     ` Duncan
@ 2016-03-10  5:01       ` Chris Murphy
  2016-03-10  8:10         ` Duncan
  2016-03-12  0:04       ` Nicholas D Steeves
  1 sibling, 1 reply; 18+ messages in thread
From: Chris Murphy @ 2016-03-10  5:01 UTC (permalink / raw)
  To: Btrfs BTRFS

On Wed, Mar 9, 2016 at 9:06 PM, Duncan <1i5t5.duncan@cox.net> wrote:

> Tho hopefully all the really tough problems they would have hit with N-
> way-mirroring were hit and resolved with raid56, and N-way-mirroring will
> thus be relatively simple, so hopefully it's less than the four years
> it's taking raid56.  But I don't expect to see it for another year or
> two, and don't expect to actually be able to use it as intended (as a more
> failure resistant raid1) for some time after that as the bugs get worked
> out, so realistically, 2-3 years.
>
> If multi-device scheduling optimization is done in say 6 months after
> that... that means we're looking at 2.5-3.5 years, perhaps longer, for
> it.  So it's a known issue, yes, and on the roadmap, yes, but don't
> expect to see anything in the near (-2-year) future, more like
> intermediate (3-5) year future.  In all honesty I don't seriously expect
> it to be long-term future, beyond 5 years, but it's possible.

Meh, encryption RFC patches arrived 8 days ago and I wasn't expecting
that to happen for a couple years. So I think our expectations have
almost no bearing on feature or fix arrival. For all we know n-way
could appear in 4.6.

-- 
Chris Murphy


* Re: dstat shows unexpected result for two disk RAID1
  2016-03-10  5:01       ` Chris Murphy
@ 2016-03-10  8:10         ` Duncan
  0 siblings, 0 replies; 18+ messages in thread
From: Duncan @ 2016-03-10  8:10 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Wed, 09 Mar 2016 22:01:21 -0700 as excerpted:

> Meh, encryption RFC patches arrived 8 days ago and I wasn't expecting
> that to happen for a couple years. So I think our expectations have
> almost no bearing on feature or fix arrival. For all we know n-way could
> appear in 4.6.

The crypto rfc patches were out of left field for me as well.

But if n-way does show up effectively "tomorrow" (as it would need to, to 
hit 4.6), I'd suspect someone with some money to spend prioritized it, 
much as I suspect that's what happened with the crypto patches.

Call me a conspiracy nut, but don't be too surprised if someone's 
introducing some product with btrfs and encrypted subvolumes a year or 18 
months from now...  I know I won't be! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: dstat shows unexpected result for two disk RAID1
  2016-03-09 21:26   ` Chris Murphy
  2016-03-09 22:51     ` Nicholas D Steeves
@ 2016-03-11 23:42     ` Nicholas D Steeves
  1 sibling, 0 replies; 18+ messages in thread
From: Nicholas D Steeves @ 2016-03-11 23:42 UTC (permalink / raw)
  To: Btrfs BTRFS

On 9 March 2016 at 16:26, Chris Murphy <lists@colorremedies.com> wrote:
>
> It's normal and recognized to be sub-optimal. So it's an optimization
> opportunity. :-)
>
> I see parallelization of reads and writes to data single profile
> multiple devices as useful also, similar to XFS allocation group
> parallelization. Those AGs are spread across multiple devices in
> md/lvm linear layouts, so if you have processes that read/write to
> multiple AGs at a time, those I/Os happen at the same time when on
> separate devices.

I'm not sure if I can pull it off... :-)  At best I might only be able
to define the problem and how things fit together, and then attempt to
logic my way through it with pseudo-code.  My hope is that someone
would look at this work, say "Aha!  You're doing it wrong!" and then
implement it the right way.

On 10 March 2016 at 03:10, Duncan <1i5t5.duncan@cox.net> wrote:
>
> Call me a conspiracy nut, but don't be too surprised if someone's
> introducing some product with btrfs and encrypted subvolumes a year or 18
> months from now...  I know I won't be! =:^)

In that case, couldn't an "at a glance" overview of what needs to be
done for distributed read optimisation entice a product-manager
somewhere out there to throw some employee-time at the problem? :-p

Best regards,
Nicholas


* Re: dstat shows unexpected result for two disk RAID1
  2016-03-10  4:06     ` Duncan
  2016-03-10  5:01       ` Chris Murphy
@ 2016-03-12  0:04       ` Nicholas D Steeves
  2016-03-12  0:10         ` Nicholas D Steeves
  1 sibling, 1 reply; 18+ messages in thread
From: Nicholas D Steeves @ 2016-03-12  0:04 UTC (permalink / raw)
  To: Btrfs BTRFS

On 9 March 2016 at 23:06, Duncan <1i5t5.duncan@cox.net> wrote:
>
> Meanwhile, while parity-raid (aka raid56) isn't as bad as it was when
> first nominally completed in 3.19, as of 4.4 (and I think 4.5 as I've not
> seen a full trace yet, let alone a fix), there's still at least one known
> bug remaining to be traced down and exterminated, that's causing at least
> some raid56 reshapes to different numbers of devices or recovery from a
> lost device to take at least 10 times as long as they logically should,
> we're talking times of weeks to months, during which time the array can
> be used, but if it's a bad device replacement and more devices go down in
> that time...
>
> Tho hopefully all the really tough problems they would have hit with N-
> way-mirroring were hit and resolved with raid56, and N-way-mirroring will
> thus be relatively simple, so hopefully it's less than the four years
> it's taking raid56.  But I don't expect to see it for another year or
> two, and don't expect to actually be able to use it as intended (as a more
> failure resistant raid1) for some time after that as the bugs get worked
> out, so realistically, 2-3 years.

Could the raid5 code be patched to copy/read instead of
build/check parity?  In effect I'm wondering if this could be used as
an alternative to the current raid1 profile.  The bonus being that it
seems like it might accelerate shaking out the bugs in raid5.
Likewise, would doing the same with the raid6 code in effect implement
a 3-way mirror distributed over n-devices?

Kind regards,
Nicholas


* Re: dstat shows unexpected result for two disk RAID1
  2016-03-12  0:04       ` Nicholas D Steeves
@ 2016-03-12  0:10         ` Nicholas D Steeves
  2016-03-12  1:20           ` Chris Murphy
  0 siblings, 1 reply; 18+ messages in thread
From: Nicholas D Steeves @ 2016-03-12  0:10 UTC (permalink / raw)
  To: Btrfs BTRFS

P.S. Rather than parity, I mean instead of distributing into stripes, do a copy!


* Re: dstat shows unexpected result for two disk RAID1
  2016-03-12  0:10         ` Nicholas D Steeves
@ 2016-03-12  1:20           ` Chris Murphy
  2016-04-06  3:58             ` Nicholas D Steeves
  0 siblings, 1 reply; 18+ messages in thread
From: Chris Murphy @ 2016-03-12  1:20 UTC (permalink / raw)
  To: Nicholas D Steeves; +Cc: Btrfs BTRFS

On Fri, Mar 11, 2016 at 5:10 PM, Nicholas D Steeves <nsteeves@gmail.com> wrote:
> P.S. Rather than parity, I mean instead of distributing into stripes, do a copy!

raid56 by definition are parity based, so I'd say no that's confusing
to turn it into something it's not.

-- 
Chris Murphy


* Re: dstat shows unexpected result for two disk RAID1
  2016-03-12  1:20           ` Chris Murphy
@ 2016-04-06  3:58             ` Nicholas D Steeves
  2016-04-06 12:02               ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 18+ messages in thread
From: Nicholas D Steeves @ 2016-04-06  3:58 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On 11 March 2016 at 20:20, Chris Murphy <lists@colorremedies.com> wrote:
> On Fri, Mar 11, 2016 at 5:10 PM, Nicholas D Steeves <nsteeves@gmail.com> wrote:
>> P.S. Rather than parity, I mean instead of distributing into stripes, do a copy!
>
> raid56 by definition are parity based, so I'd say no that's confusing
> to turn it into something it's not.

I just found the Multiple Device Support diagram.  I'm trying to
figure out how hard it's going to be for me to get up to speed, because I've
only ever casually and informally read about filesystems.  I worry
that because I didn't study filesystem design in school, and because
everything I worked on was in C++...well, the level of sophistication
and design might be beyond what I can learn.  What do you think?  Can
you recommend any books on file system design that will provide what
is necessary to understand btrfs?

Cheers,
Nicholas


* Re: dstat shows unexpected result for two disk RAID1
  2016-04-06  3:58             ` Nicholas D Steeves
@ 2016-04-06 12:02               ` Austin S. Hemmelgarn
  2016-04-22 22:36                 ` Nicholas D Steeves
  0 siblings, 1 reply; 18+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-06 12:02 UTC (permalink / raw)
  To: Nicholas D Steeves; +Cc: Chris Murphy, Btrfs BTRFS

On 2016-04-05 23:58, Nicholas D Steeves wrote:
> On 11 March 2016 at 20:20, Chris Murphy <lists@colorremedies.com> wrote:
>> On Fri, Mar 11, 2016 at 5:10 PM, Nicholas D Steeves <nsteeves@gmail.com> wrote:
>>> P.S. Rather than parity, I mean instead of distributing into stripes, do a copy!
>>
>> raid56 by definition are parity based, so I'd say no that's confusing
>> to turn it into something it's not.
>
> I just found the Multiple Device Support diagram.  I'm trying to
> figure out how hard it's going to be for me to get up to speed, because I've
> only ever casually and informally read about filesystems.  I worry
> that because I didn't study filesystem design in school, and because
> everything I worked on was in C++...well, the level of sophistication
> and design might be beyond what I can learn.  What do you think?  Can
> you recommend any books on file system design that will provide what
> is necessary to understand btrfs?
While I can't personally recommend any books on filesystem design, I can 
give some more general advice:
1. Make sure you have at least a basic understanding of how things work 
at a high level from the user perspective.  It's a lot easier to 
understand the low-level stuff if you know how it all ends up fitting 
together.  Back when I started looking at the internals of BTRFS I was 
pretty lost myself.  I still am to a certain extent when it comes to the 
kernel code (most of my background is in Python, Lua, or Bourne Shell, 
not C, and I don't normally deal with data structures at such a low 
level), but as I've used it more on my systems, a lot of stuff that 
seemed cryptic at first is making a lot more sense.
2. Keep in mind that there are a number of things in BTRFS that have no 
equivalent in other filesystems, or are not typical filesystem design 
topics.  The multi-device support for example is pretty much 
non-existent as a filesystem design topic because it's traditionally 
handled by lower levels like LVM.
3. The Linux VFS layer is worth taking a look at, as it handles the 
translation between the low-level ABI provided by each filesystem and 
the user-level API.  Most of the stuff that BTRFS provides through it is 
rather consistent with the user level API, but understanding what 
translation goes on there can be helpful to understanding some of the 
higher-level internals in BTRFS.
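
For orientation, here is a minimal, purely hypothetical sketch of the
registration side of that per-filesystem interface -- "examplefs" and
everything in it is made up, and the real work (a fill_super callback
passed to mount_bdev()/mount_nodev(), plus the inode and file
operations) is left out:

/* examplefs.c -- hypothetical, minimal sketch of VFS registration.
 * Not btrfs code; it only shows where a filesystem plugs into the VFS.
 */
#include <linux/module.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <linux/err.h>

static struct dentry *examplefs_mount(struct file_system_type *fs_type,
                                      int flags, const char *dev_name,
                                      void *data)
{
        /* Placeholder: report "not implemented" instead of mounting. */
        return ERR_PTR(-ENOSYS);
}

static struct file_system_type examplefs_type = {
        .owner   = THIS_MODULE,
        .name    = "examplefs",
        .mount   = examplefs_mount,
        .kill_sb = kill_litter_super,
};

static int __init examplefs_init(void)
{
        /* After this, "examplefs" appears in /proc/filesystems. */
        return register_filesystem(&examplefs_type);
}

static void __exit examplefs_exit(void)
{
        unregister_filesystem(&examplefs_type);
}

module_init(examplefs_init);
module_exit(examplefs_exit);
MODULE_LICENSE("GPL");

User space still goes through the generic syscalls (open/read/write);
the VFS routes those to whatever operations the filesystem registered,
which is the translation point 3 above is talking about.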


* Re: dstat shows unexpected result for two disk RAID1
  2016-04-06 12:02               ` Austin S. Hemmelgarn
@ 2016-04-22 22:36                 ` Nicholas D Steeves
  0 siblings, 0 replies; 18+ messages in thread
From: Nicholas D Steeves @ 2016-04-22 22:36 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Btrfs BTRFS

Everyone, thank you very much for helping me to learn more.  Getting
up to speed takes forever!  I posted an idea relating to this thread,
though it's more read-latency than throughput related; I'm not sure
what the right way to link overlapping threads is, so here is how to
find it:

Date: Fri, 22 Apr 2016 18:14:00 -0400
Message-ID: <CAD=QJKgJ9JAgZAOSivJTL-bcLbdkP6UqGb0i6g=fS9j6XKtcLA@mail.gmail.com>
Subject: Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time)
 de-duplication framework

WRT the original AG balanced IO optimisation problem, should I spend
most of my time reading disk-io.c while looking for opportunities to
optimize multi-device reads, while consulting xfs_file.c and
xfs_super.c, or just xfs_super.c?

Thanks,
Nicholas


Thread overview: 18+ messages
2016-03-09 20:21 dstat shows unexpected result for two disk RAID1 Nicholas D Steeves
2016-03-09 20:25 ` Nicholas D Steeves
2016-03-09 20:50   ` Goffredo Baroncelli
2016-03-09 21:26   ` Chris Murphy
2016-03-09 22:51     ` Nicholas D Steeves
2016-03-11 23:42     ` Nicholas D Steeves
2016-03-09 21:36   ` Roman Mamedov
2016-03-09 21:43     ` Chris Murphy
2016-03-09 22:08       ` Nicholas D Steeves
2016-03-10  4:06     ` Duncan
2016-03-10  5:01       ` Chris Murphy
2016-03-10  8:10         ` Duncan
2016-03-12  0:04       ` Nicholas D Steeves
2016-03-12  0:10         ` Nicholas D Steeves
2016-03-12  1:20           ` Chris Murphy
2016-04-06  3:58             ` Nicholas D Steeves
2016-04-06 12:02               ` Austin S. Hemmelgarn
2016-04-22 22:36                 ` Nicholas D Steeves
