* best base / worst case RAID 5,6 write speeds
@ 2015-12-10  1:34 Dallas Clement
  2015-12-10  6:36 ` Alexander Afonyashin
                   ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Dallas Clement @ 2015-12-10  1:34 UTC (permalink / raw)
  To: linux-raid

Hi all.  I'm trying to determine best and worst case expected
sequential write speeds for Linux software RAID with spinning disks.

I have been assuming the following:

Best case RAID 6 sequential write speed is (N-2) * X, where N is the
number of drives and X is the write speed of a single drive.

Worst case RAID 6 sequential write speed is (N-2) * X / 2.

Best case RAID 5 sequential write speed is (N-1) * X.

Worst case RAID 5 sequential write speed is (N-1) * X / 2.

Could someone please confirm whether these formulas are accurate or not?


I am not even reaching the worst-case write performance with an array of 12
spinning 7200 RPM SATA disks.  Thus I suspect that either the formulas I
am using are wrong, or I have alignment issues or something similar.  My
chunk size is 128 KB at the moment.
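To make the assumption concrete, here is a quick shell sketch of the
arithmetic I have in mind (the 150 MB/s per-drive figure is only a
placeholder):

  N=12; X=150   # X = assumed per-drive sequential write speed, MB/s
  echo "RAID 6 best : $(( (N - 2) * X )) MB/s"
  echo "RAID 6 worst: $(( (N - 2) * X / 2 )) MB/s"
  echo "RAID 5 best : $(( (N - 1) * X )) MB/s"
  echo "RAID 5 worst: $(( (N - 1) * X / 2 )) MB/s"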

Thanks,

Dallas

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-10  1:34 best base / worst case RAID 5,6 write speeds Dallas Clement
@ 2015-12-10  6:36 ` Alexander Afonyashin
  2015-12-10 14:38   ` Dallas Clement
  2015-12-10 15:14 ` John Stoffel
  2015-12-10 20:06 ` Phil Turmel
  2 siblings, 1 reply; 60+ messages in thread
From: Alexander Afonyashin @ 2015-12-10  6:36 UTC (permalink / raw)
  To: Dallas Clement; +Cc: Linux-RAID

Hi,

Did you set the stride/stripe-width configuration on the filesystem so that
it is aligned with the RAID configuration?
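As a rough illustration only (128 KB chunk, 4 KB filesystem blocks, a
12-drive RAID 6, and the array name /dev/md0 is an assumption):

  # stride       = chunk / fs block    = 128K / 4K     = 32
  # stripe-width = stride * data disks = 32 * (12 - 2) = 320
  mkfs.ext4 -b 4096 -E stride=32,stripe-width=320 /dev/md0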

Regards,
Alexander

On Thu, Dec 10, 2015 at 4:34 AM, Dallas Clement
<dallas.a.clement@gmail.com> wrote:
> Hi all.  I'm trying to determine best and worst case expected
> sequential write speeds for Linux software RAID with spinning disks.
>
> I have been assuming on the following:
>
> Best case RAID 6 sequential write speed is (N-2) * X, where is is
> number of drives and X is write speed of a single drive.
>
> Worst case RAID 6 sequential write speed is (N-2) * X / 2.
>
> Best case RAID 5 sequential write speed is (N-1) * X.
>
> Worst case RAID 5 sequential write speed is (N-1) * X / 2.
>
> Could someone please confirm whether these formulas are accurate or not?
>
>
> I am not even getting worst case write performance with an array of 12
> spinning 7200 RPM SATA disks.  Thus I  suspect either the formulas I
> am using are wrong or I have alignment issues or something.  My chunk
> size is 128 KB at the moment.
>
> Thanks,
>
> Dallas
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-10  6:36 ` Alexander Afonyashin
@ 2015-12-10 14:38   ` Dallas Clement
  0 siblings, 0 replies; 60+ messages in thread
From: Dallas Clement @ 2015-12-10 14:38 UTC (permalink / raw)
  To: Alexander Afonyashin; +Cc: Linux-RAID

On Thu, Dec 10, 2015 at 12:36 AM, Alexander Afonyashin
<a.afonyashin@madnet-team.ru> wrote:
> Hi,
>
> Did you set stride/stripe configuration to filesystem that is aligned
> with raid configuration?
>
> Regards,
> Alexander
>
> On Thu, Dec 10, 2015 at 4:34 AM, Dallas Clement
> <dallas.a.clement@gmail.com> wrote:
>> Hi all.  I'm trying to determine best and worst case expected
>> sequential write speeds for Linux software RAID with spinning disks.
>>
>> I have been assuming on the following:
>>
>> Best case RAID 6 sequential write speed is (N-2) * X, where is is
>> number of drives and X is write speed of a single drive.
>>
>> Worst case RAID 6 sequential write speed is (N-2) * X / 2.
>>
>> Best case RAID 5 sequential write speed is (N-1) * X.
>>
>> Worst case RAID 5 sequential write speed is (N-1) * X / 2.
>>
>> Could someone please confirm whether these formulas are accurate or not?
>>
>>
>> I am not even getting worst case write performance with an array of 12
>> spinning 7200 RPM SATA disks.  Thus I  suspect either the formulas I
>> am using are wrong or I have alignment issues or something.  My chunk
>> size is 128 KB at the moment.
>>
>> Thanks,
>>
>> Dallas
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

> Did you set stride/stripe configuration to filesystem that is aligned
> with raid configuration?

Hi Alexander.  I've just been reading and writing the raw RAID device,
with no filesystem, at the moment, using fio with direct=1.  I seem
to get the best results with bs=512k and queue_depth=256.  Even with
this, I am still short of the worst-case write performance.
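Roughly, a job equivalent to what I described would look like this (a
sketch only; the device name and size are placeholders):

  fio --name=seqwrite --filename=/dev/md0 --rw=write --bs=512k \
      --iodepth=256 --ioengine=libaio --direct=1 --size=20G \
      --group_reporting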

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-10  1:34 best base / worst case RAID 5,6 write speeds Dallas Clement
  2015-12-10  6:36 ` Alexander Afonyashin
@ 2015-12-10 15:14 ` John Stoffel
  2015-12-10 18:40   ` Dallas Clement
  2015-12-10 20:06 ` Phil Turmel
  2 siblings, 1 reply; 60+ messages in thread
From: John Stoffel @ 2015-12-10 15:14 UTC (permalink / raw)
  To: Dallas Clement; +Cc: linux-raid

>>>>> "Dallas" == Dallas Clement <dallas.a.clement@gmail.com> writes:

Dallas> Hi all.  I'm trying to determine best and worst case expected
Dallas> sequential write speeds for Linux software RAID with spinning disks.

Dallas> I have been assuming on the following:

Dallas> Best case RAID 6 sequential write speed is (N-2) * X, where is is
Dallas> number of drives and X is write speed of a single drive.

Dallas> Worst case RAID 6 sequential write speed is (N-2) * X / 2.

Dallas> Best case RAID 5 sequential write speed is (N-1) * X.

Dallas> Worst case RAID 5 sequential write speed is (N-1) * X / 2.

Dallas> Could someone please confirm whether these formulas are accurate or not?


Dallas> I am not even getting worst case write performance with an
Dallas> array of 12 spinning 7200 RPM SATA disks.  Thus I suspect
Dallas> either the formulas I am using are wrong or I have alignment
Dallas> issues or something.  My chunk size is 128 KB at the moment.

I think you're over-estimating the speed of your disks.  Remember that
disk speeds are faster on the outer tracks of the drive, and slower on
the inner tracks.

I'd set up two partitions, one at the start (outer tracks) and one at the
end (inner tracks) of the drive, and run something simple like:

  dd if=/dev/zero of=/dev/inner,outer bs=8192 count=100000 oflag=direct

and look at those numbers.  Then build up a table where you vary the
bs= from 512 to N, which could be whatever you want.

That will give you a better estimate of individual drive performance.
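An untested sketch of such a sweep (and of course it destroys whatever is
on the test partition):

  for bs in 512 4k 64k 512k 2048k 8192k; do
      s=$(dd if=/dev/zero of=/dev/sda1 bs=$bs count=200 oflag=direct 2>&1 \
          | awk '/copied/ {print $(NF-1), $NF}')
      echo "bs=$bs -> $s"
  done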

Then when you do your fio tests, vary the queue depth, block size,
inner/outer partition, etc, but all on a single disk at first to
compare with the first set of results and to see how they correlate.

THEN you can start looking at the RAID performance numbers.

And of course, the controller you use matters, how it's configured,
how it's set up for caching, etc.  Lots and lots and lots of details to
be tracked.

Change one thing at a time, then re-run your tests.  Automating them
is key here.



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-10 15:14 ` John Stoffel
@ 2015-12-10 18:40   ` Dallas Clement
       [not found]     ` <CAK2H+ed+fe5Wr0B=h5AzK5_=ougQtW_6cJcUG_S_cg+WfzDb=Q@mail.gmail.com>
  2015-12-10 19:28     ` John Stoffel
  0 siblings, 2 replies; 60+ messages in thread
From: Dallas Clement @ 2015-12-10 18:40 UTC (permalink / raw)
  To: John Stoffel; +Cc: Linux-RAID

On Thu, Dec 10, 2015 at 9:14 AM, John Stoffel <john@stoffel.org> wrote:
>>>>>> "Dallas" == Dallas Clement <dallas.a.clement@gmail.com> writes:
>
> Dallas> Hi all.  I'm trying to determine best and worst case expected
> Dallas> sequential write speeds for Linux software RAID with spinning disks.
>
> Dallas> I have been assuming on the following:
>
> Dallas> Best case RAID 6 sequential write speed is (N-2) * X, where is is
> Dallas> number of drives and X is write speed of a single drive.
>
> Dallas> Worst case RAID 6 sequential write speed is (N-2) * X / 2.
>
> Dallas> Best case RAID 5 sequential write speed is (N-1) * X.
>
> Dallas> Worst case RAID 5 sequential write speed is (N-1) * X / 2.
>
> Dallas> Could someone please confirm whether these formulas are accurate or not?
>
>
> Dallas> I am not even getting worst case write performance with an
> Dallas> array of 12 spinning 7200 RPM SATA disks.  Thus I suspect
> Dallas> either the formulas I am using are wrong or I have alignment
> Dallas> issues or something.  My chunk size is 128 KB at the moment.
>
> I think you're over-estimating the speed of your disks.  Remember that
> disk speeds are faster on the outer tracks of the drive, and slower on
> the inner tracks.
>
> I'd setup two partitions, one at the start and one at the outside and
> do some simple:
>
>   dd if=/dev/zero of=/dev/inner,outer bs=8192 count=100000 oflag=direct
>
> and look at those numbers.  Then build up a table where you vary the
> bs= from 512 to N, which could be whatever you want.
>
> That will give you a better estimate of individual drive performance.
>
> Then when you do your fio tests, vary the queue depth, block size,
> inner/outer partition, etc, but all on a single disk at first to
> compare with the first set of results and to see how they correlate.
>
> THEN you can start looking at the RAID performance numbers.
>
> And of course, the controller you use matters, how it's configured,
> how it's setup for caching, etc.  Lots and lots and lots of details to
> be tracked.
>
> Change one thing at a time, then re-run your tests.  Automating them
> is key here.
>
>

Hi John.  Thanks for the help.  I did what you recommended and created
two equal size partitions on my Hitachi 4TB 7200RPM SATA disks.

Device          Start        End    Sectors  Size Type
/dev/sda1        2048 3907014656 3907012609  1.8T Linux filesystem
/dev/sda2  3907016704 7814037134 3907020431  1.8T Linux filesystem

I ran the dd test with varying block size.  I started to see a
difference in write speed with larger block size.

[root@localhost ~]# dd if=/dev/zero of=/dev/sda1 bs=2048k count=1000
oflag=direct
1000+0 records in
1000+0 records out
2097152000 bytes (2.1 GB) copied, 11.5475 s, 182 MB/s

[root@localhost ~]# dd if=/dev/zero of=/dev/sda2 bs=2048k count=1000
oflag=direct
1000+0 records in
1000+0 records out
2097152000 bytes (2.1 GB) copied, 13.6355 s, 154 MB/s

The difference is not as great as I expected it might be.  If I plug
this lower write speed of 154 MB/s into the RAID 6 worst-case write
speed calculation mentioned earlier, I should be getting at least
(12 - 2) * 154 MB/s / 2 = 770 MB/s.  For this same bs=2048k and
queue_depth=256 I am getting 678 MB/s, which is almost 100 MB/s less
than the worst case.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
       [not found]     ` <CAK2H+ed+fe5Wr0B=h5AzK5_=ougQtW_6cJcUG_S_cg+WfzDb=Q@mail.gmail.com>
@ 2015-12-10 19:26       ` Dallas Clement
  2015-12-10 19:33         ` John Stoffel
  0 siblings, 1 reply; 60+ messages in thread
From: Dallas Clement @ 2015-12-10 19:26 UTC (permalink / raw)
  To: Mark Knecht; +Cc: John Stoffel, Linux-RAID

On Thu, Dec 10, 2015 at 12:54 PM, Mark Knecht <markknecht@gmail.com> wrote:
>
>
> On Thu, Dec 10, 2015 at 10:40 AM, Dallas Clement
> <dallas.a.clement@gmail.com> wrote:
>>
>> <SNIP>
>>
>> Hi John.  Thanks for the help.  I did what you recommended and created
>> two equal size partitions on my Hitachi 4TB 7200RPM SATA disks.
>>
>> Device          Start        End    Sectors  Size Type
>> /dev/sda1        2048 3907014656 3907012609  1.8T Linux filesystem
>> /dev/sda2  3907016704 7814037134 3907020431  1.8T Linux filesystem
>>
>
> If you're goal was to test the speed at the end of the drive it seems to me
> that sda2
> should have started near the end of the drive and not presumably in the
> middle?
>
> Probably has nothing to do with the missing 100MB/s though.
>
> - Mark

I tried a more extreme case.

Device          Start        End    Sectors   Size Type
/dev/sda1        2048 6999998464 6999996417   3.3T Linux filesystem
/dev/sda2  7000000512 7814037134  814036623 388.2G Linux filesystem

Now I'm seeing quite a bit more difference between inner and outer.

[root@localhost ~]# dd if=/dev/zero of=/dev/sda1 bs=2048k count=1000
1000+0 records in
1000+0 records out
2097152000 bytes (2.1 GB) copied, 13.4422 s, 156 MB/s

[root@localhost ~]# dd if=/dev/zero of=/dev/sda2 bs=2048k count=1000
1000+0 records in
1000+0 records out
2097152000 bytes (2.1 GB) copied, 21.9703 s, 95.5 MB/s

Using this worst-case inner speed, my expected worst-case write
speed is (12 - 2) * 95.5 MB/s / 2 = 477.5 MB/s.  So I guess if this
really is the worst case, then the 678 MB/s sequential write speed I
am seeing with RAID 6 on a large partition is not so bad.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-10 18:40   ` Dallas Clement
       [not found]     ` <CAK2H+ed+fe5Wr0B=h5AzK5_=ougQtW_6cJcUG_S_cg+WfzDb=Q@mail.gmail.com>
@ 2015-12-10 19:28     ` John Stoffel
  2015-12-10 22:23       ` Wols Lists
  1 sibling, 1 reply; 60+ messages in thread
From: John Stoffel @ 2015-12-10 19:28 UTC (permalink / raw)
  To: Dallas Clement; +Cc: John Stoffel, Linux-RAID

>>>>> "Dallas" == Dallas Clement <dallas.a.clement@gmail.com> writes:

Dallas> On Thu, Dec 10, 2015 at 9:14 AM, John Stoffel <john@stoffel.org> wrote:
>>>>>>> "Dallas" == Dallas Clement <dallas.a.clement@gmail.com> writes:
>> 
Dallas> Hi all.  I'm trying to determine best and worst case expected
Dallas> sequential write speeds for Linux software RAID with spinning disks.
>> 
Dallas> I have been assuming on the following:
>> 
Dallas> Best case RAID 6 sequential write speed is (N-2) * X, where is is
Dallas> number of drives and X is write speed of a single drive.
>> 
Dallas> Worst case RAID 6 sequential write speed is (N-2) * X / 2.
>> 
Dallas> Best case RAID 5 sequential write speed is (N-1) * X.
>> 
Dallas> Worst case RAID 5 sequential write speed is (N-1) * X / 2.
>> 
Dallas> Could someone please confirm whether these formulas are accurate or not?
>> 
>> 
Dallas> I am not even getting worst case write performance with an
Dallas> array of 12 spinning 7200 RPM SATA disks.  Thus I suspect
Dallas> either the formulas I am using are wrong or I have alignment
Dallas> issues or something.  My chunk size is 128 KB at the moment.
>> 
>> I think you're over-estimating the speed of your disks.  Remember that
>> disk speeds are faster on the outer tracks of the drive, and slower on
>> the inner tracks.
>> 
>> I'd setup two partitions, one at the start and one at the outside and
>> do some simple:
>> 
>> dd if=/dev/zero of=/dev/inner,outer bs=8192 count=100000 oflag=direct
>> 
>> and look at those numbers.  Then build up a table where you vary the
>> bs= from 512 to N, which could be whatever you want.
>> 
>> That will give you a better estimate of individual drive performance.
>> 
>> Then when you do your fio tests, vary the queue depth, block size,
>> inner/outer partition, etc, but all on a single disk at first to
>> compare with the first set of results and to see how they correlate.
>> 
>> THEN you can start looking at the RAID performance numbers.
>> 
>> And of course, the controller you use matters, how it's configured,
>> how it's setup for caching, etc.  Lots and lots and lots of details to
>> be tracked.
>> 
>> Change one thing at a time, then re-run your tests.  Automating them
>> is key here.
>> 
>> 

Dallas> Hi John.  Thanks for the help.  I did what you recommended and created
Dallas> two equal size partitions on my Hitachi 4TB 7200RPM SATA disks.

Dallas> Device          Start        End    Sectors  Size Type
Dallas> /dev/sda1        2048 3907014656 3907012609  1.8T Linux filesystem
Dallas> /dev/sda2  3907016704 7814037134 3907020431  1.8T Linux filesystem

I would do it differently still: put a 10 GB partition at each end of the
drive and run your tests.
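For example (a sketch only; it rewrites the partition table, so only on a
scratch drive):

  parted -s /dev/sda mklabel gpt
  parted -s /dev/sda mkpart outer 1MiB 10GiB     # outer tracks, fast end
  parted -s /dev/sda mkpart inner -10GiB 100%    # inner tracks, slow end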

Dallas> I ran the dd test with varying block size.  I started to see a
Dallas> difference in write speed with larger block size.

You will... that's the streaming write speed.  But in real life,
unless you're streaming video or other large files, you're never
going to see that.

Dallas> [root@localhost ~]# dd if=/dev/zero of=/dev/sda1 bs=2048k count=1000
Dallas> oflag=direct
Dallas> 1000+0 records in
Dallas> 1000+0 records out
Dallas> 2097152000 bytes (2.1 GB) copied, 11.5475 s, 182 MB/s

Dallas> [root@localhost ~]# dd if=/dev/zero of=/dev/sda2 bs=2048k count=1000
Dallas> oflag=direct
Dallas> 1000+0 records in
Dallas> 1000+0 records out
Dallas> 2097152000 bytes (2.1 GB) copied, 13.6355 s, 154 MB/s

The difference will be even larger if you move the partitions closer to
the ends of the drive.

Dallas> The difference is not as great as I suspected it might be.  If
Dallas> I plug in this lower write speed of 154 MB/s in the RAID 6
Dallas> worst case write speed calculation mentioned earlier, I should
Dallas> be getting at least (12 - 2) * 154 MB/s / 2 = 770 MB/s.  For
Dallas> this same bs=2048k and queue_depth=256 I am getting 678 MB/s
Dallas> which is almost 100 MB/s less than worst case.

At this point, you need to now look at your controllers and
motherboard and how they're configured.  If all those drives are on
one controller, and if that controller is on a single lane of PCIe,
then you will see controller bandwidth issues as well.

So now you need to step back and look at the entire system.  How is
the drive cabled?  What is the system powered with?

Also, Linux RAID only recently got away from single-threaded RAID5/6
compute threads in the kernel, so that could have an impact too.

The best way would be to have your disks spread out across multiple
controllers, on multiple busses, all talking in parallel.

If you're looking for a more linear speedup test, build a small 10 GB
partition on each disk, then build a striped RAID 0 array across them
with a small chunk size.  Then do your sequential writes and you should
see a pretty linear increase in speed, up until you hit controller,
memory, CPU, or SATA limits.
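A rough sketch of that, assuming one small test partition per disk (the
device names are only illustrative):

  mdadm --create /dev/md0 --level=0 --raid-devices=12 --chunk=64 \
        /dev/sd[b-m]1
  dd if=/dev/zero of=/dev/md0 bs=2048k count=1000 oflag=direct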

Another option, if you're looking for good performance, might be to
look at lvmcache, which is what I've just done at home.  I have a pair
of mirrored 4 TB disks, and a pair of mirrored 500 GB SSDs which I use
for boot, /, /var and cache.  So far I'm quite happy with the
performance speedup.  But I also haven't done *any* rigorous testing,
since I'm more concerned about durability first, then speed.
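For reference, the lvmcache setup is roughly along these lines (the volume
and device names here are invented, not my actual layout):

  pvcreate /dev/md_ssd                     # the mirrored SSD pair
  vgextend vg0 /dev/md_ssd                 # vg0 already holds the HDD LV
  lvcreate --type cache-pool -L 100G -n cpool vg0 /dev/md_ssd
  lvconvert --type cache --cachepool vg0/cpool vg0/data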

John

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-10 19:26       ` Dallas Clement
@ 2015-12-10 19:33         ` John Stoffel
  2015-12-10 22:19           ` Wols Lists
  0 siblings, 1 reply; 60+ messages in thread
From: John Stoffel @ 2015-12-10 19:33 UTC (permalink / raw)
  To: Dallas Clement; +Cc: Mark Knecht, John Stoffel, Linux-RAID

>>>>> "Dallas" == Dallas Clement <dallas.a.clement@gmail.com> writes:

Dallas> I tried a more extreme case.

Dallas> Device          Start        End    Sectors   Size Type
Dallas> /dev/sda1        2048 6999998464 6999996417   3.3T Linux filesystem
Dallas> /dev/sda2  7000000512 7814037134  814036623 388.2G Linux filesystem

Dallas> Now I'm seeing quite a bit more difference between inner and outer.

Dallas> [root@localhost ~]# dd if=/dev/zero of=/dev/sda1 bs=2048k count=1000
Dallas> 1000+0 records in
Dallas> 1000+0 records out
Dallas> 2097152000 bytes (2.1 GB) copied, 13.4422 s, 156 MB/s

Dallas> [root@localhost ~]# dd if=/dev/zero of=/dev/sda2 bs=2048k count=1000
Dallas> 1000+0 records in
Dallas> 1000+0 records out
Dallas> 2097152000 bytes (2.1 GB) copied, 21.9703 s, 95.5 MB/s


This is actually one of the tricks people used before SSDs were readily
available.  They would buy a bunch of disks and use only the outer
tracks, striping data across the whole set to get the IOPS up when they
were IOPS-limited but not space-limited.  Think databases with lots and
lots of transactions.

Now it's simpler to just A) buy lots and lots of memory, B) bunches of
SSDs, C) both, D) beat the developers until they learn to write better
SQL.

Sorry, D) never happens.  :-)

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-10  1:34 best base / worst case RAID 5,6 write speeds Dallas Clement
  2015-12-10  6:36 ` Alexander Afonyashin
  2015-12-10 15:14 ` John Stoffel
@ 2015-12-10 20:06 ` Phil Turmel
  2015-12-10 20:09   ` Dallas Clement
  2 siblings, 1 reply; 60+ messages in thread
From: Phil Turmel @ 2015-12-10 20:06 UTC (permalink / raw)
  To: Dallas Clement, linux-raid

On 12/09/2015 08:34 PM, Dallas Clement wrote:
> Hi all.  I'm trying to determine best and worst case expected
> sequential write speeds for Linux software RAID with spinning disks.
> 
> I have been assuming on the following:
> 
> Best case RAID 6 sequential write speed is (N-2) * X, where is is
> number of drives and X is write speed of a single drive.
> 
> Worst case RAID 6 sequential write speed is (N-2) * X / 2.
> 
> Best case RAID 5 sequential write speed is (N-1) * X.
> 
> Worst case RAID 5 sequential write speed is (N-1) * X / 2.
> 
> Could someone please confirm whether these formulas are accurate or not?

Confirm these?  No.  In fact, I see no theoretical basis for stating a
worst case speed as half the best case speed.  Or any other fraction.
It's dependent on numerous variables -- block size, processor load, I/O
bandwidth at various choke points (Northbridge, southbridge, PCI/PCIe,
SATA/SAS channels, port mux...), I/O latency vs. queue depth vs. drive
buffers, sector positioning at block boundaries, drive firmware
housekeeping, etc.

Where'd you get the worst case formulas?


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-10 20:06 ` Phil Turmel
@ 2015-12-10 20:09   ` Dallas Clement
  2015-12-10 20:29     ` Phil Turmel
       [not found]     ` <CAK2H+ednN7dCGzcOt8TxgNdhdDA1mN6Xr5P8vQ+Y=-uRoxRksw@mail.gmail.com>
  0 siblings, 2 replies; 60+ messages in thread
From: Dallas Clement @ 2015-12-10 20:09 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Linux-RAID

On Thu, Dec 10, 2015 at 2:06 PM, Phil Turmel <philip@turmel.org> wrote:
> On 12/09/2015 08:34 PM, Dallas Clement wrote:
>> Hi all.  I'm trying to determine best and worst case expected
>> sequential write speeds for Linux software RAID with spinning disks.
>>
>> I have been assuming on the following:
>>
>> Best case RAID 6 sequential write speed is (N-2) * X, where is is
>> number of drives and X is write speed of a single drive.
>>
>> Worst case RAID 6 sequential write speed is (N-2) * X / 2.
>>
>> Best case RAID 5 sequential write speed is (N-1) * X.
>>
>> Worst case RAID 5 sequential write speed is (N-1) * X / 2.
>>
>> Could someone please confirm whether these formulas are accurate or not?
>
> Confirm these?  No.  In fact, I see no theoretical basis for stating a
> worst case speed as half the best case speed.  Or any other fraction.
> It's dependent on numerous variables -- block size, processor load, I/O
> bandwidth at various choke points (Northbridge, southbridge, PCI/PCIe,
> SATA/SAS channels, port mux...), I/O latency vs. queue depth vs. drive
> buffers, sector positioning at block boundaries, drive firmware
> housekeeping, etc.
>
> Where'd you get the worst case formulas?
>

> Where'd you get the worst case formulas?

Google search I'm afraid.  I think the assumption for RAID 5,6 worst
case is having to read and write the parity + data every cycle.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-10 20:09   ` Dallas Clement
@ 2015-12-10 20:29     ` Phil Turmel
  2015-12-10 21:14       ` Dallas Clement
       [not found]     ` <CAK2H+ednN7dCGzcOt8TxgNdhdDA1mN6Xr5P8vQ+Y=-uRoxRksw@mail.gmail.com>
  1 sibling, 1 reply; 60+ messages in thread
From: Phil Turmel @ 2015-12-10 20:29 UTC (permalink / raw)
  To: Dallas Clement; +Cc: Linux-RAID

On 12/10/2015 03:09 PM, Dallas Clement wrote:
> On Thu, Dec 10, 2015 at 2:06 PM, Phil Turmel <philip@turmel.org> wrote:

>> Where'd you get the worst case formulas?
> 
> Google search I'm afraid.  I think the assumption for RAID 5,6 worst
> case is having to read and write the parity + data every cycle.

Well, it'd be a lot worse than half, then.  To use the shortcut in raid5
to write one block, you have to read it first, read the parity, compute
the change in parity, then write the block with the new parity.  That's
two reads and two writes for a single upper level write.  For raid6, add
read and write of the Q syndrome, assuming you have a kernel new enough
to do the raid6 shortcut at all.  Three reads and three writes for a
single upper level write.  In both cases, add rotational latency to
reposition for writing over sectors just read.
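(In XOR terms the raid5 shortcut is just new parity = old parity ^ old
data ^ new data; a toy one-byte example:)

  D_old=0x5a; D_new=0x3c; P_old=0x99
  printf 'P_new = 0x%02x\n' $(( P_old ^ D_old ^ D_new ))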

Those RMW operations generally happen to small random writes, which
makes the assertion for sequential writes odd.  Unless you delay writes
or misalign or inhibit merging, RMW won't trigger except possibly at the
beginning or end of a stream.

That's why I questioned O_SYNC when you were using a filesystem: it
prevents merging, and forces seeking to do small metadata writes.
Basically turning a sequential workload into a random one.

Phil

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-10 20:29     ` Phil Turmel
@ 2015-12-10 21:14       ` Dallas Clement
  2015-12-10 21:32         ` Phil Turmel
  0 siblings, 1 reply; 60+ messages in thread
From: Dallas Clement @ 2015-12-10 21:14 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Linux-RAID

On Thu, Dec 10, 2015 at 2:29 PM, Phil Turmel <philip@turmel.org> wrote:
> On 12/10/2015 03:09 PM, Dallas Clement wrote:
>> On Thu, Dec 10, 2015 at 2:06 PM, Phil Turmel <philip@turmel.org> wrote:
>
>>> Where'd you get the worst case formulas?
>>
>> Google search I'm afraid.  I think the assumption for RAID 5,6 worst
>> case is having to read and write the parity + data every cycle.
>
> Well, it'd be a lot worse than half, then.  To use the shortcut in raid5
> to write one block, you have to read it first, read the parity, compute
> the change in parity, then write the block with the new parity.  That's
> two reads and two writes for a single upper level write.  For raid6, add
> read and write of the Q syndrome, assuming you have a kernel new enough
> to do the raid6 shortcut at all.  Three reads and three writes for a
> single upper level write.  In both cases, add rotational latency to
> reposition for writing over sectors just read.
>
> Those RMW operations generally happen to small random writes, which
> makes the assertion for sequential writes odd.  Unless you delay writes
> or misalign or inhibit merging, RMW won't trigger except possibly at the
> beginning or end of a stream.
>
> That's why I questioned O_SYNC when you were using a filesystem: it
> prevents merging, and forces seeking to do small metadata writes.
> Basically turning a sequential workload into a random one.
>
> Phil

> Those RMW operations generally happen to small random writes, which
> makes the assertion for sequential writes odd.

Exactly.  I'm not expecting RMWs to be happening for large sequential
writes.  And yet my RAID 5, 6 sequential write performance is still
very poor.  As mentioned earlier, I'm getting around 95 MB/s on the
inner side of these disks.  With 12 of them, my RAID 6 write speed
should be (12 - 2) * 95 = 950 MB/s.  I'm getting about 300 MB/s less
than that for this scenario.  I have the disks split up among three
different controllers.  There should be plenty of bandwidth.  Several
days ago I ran fio on each of the 12 disks concurrently.  I was able
to see the disks at or near 100% utilization and wMB/s around 160-170
MB/s.  That's why I started focusing on RAID as being the potential
bottleneck.

> That's why I questioned O_SYNC when you were using a filesystem: it
> prevents merging, and forces seeking to do small metadata writes.
> Basically turning a sequential workload into a random one.

Yes, that certainly makes sense.  Not using O_SYNC anymore.  Just O_DIRECT.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-10 21:14       ` Dallas Clement
@ 2015-12-10 21:32         ` Phil Turmel
  0 siblings, 0 replies; 60+ messages in thread
From: Phil Turmel @ 2015-12-10 21:32 UTC (permalink / raw)
  To: Dallas Clement; +Cc: Linux-RAID

On 12/10/2015 04:14 PM, Dallas Clement wrote:

> Exactly.  I'm not expecting RMWs to be happening for large sequential
> writes.  But yet my RAID 5, 6 sequential write performance is still
> very poor.  As mentioned earlier, I'm getting around 95 MB/s on the
> inner side of these disks.  With 12 of them, my RAID 6 write speed
> should be (12 - 2) * 95 = 950 MB/s.  I'm getting about 300 MB/s less
> than that for this scenario.  I have the disks split up among three
> different controllers.  There should be plenty of bandwidth.  Several
> days ago I ran fio on each of the 12 disks concurrently.  I was able
> to see the disks at or near 100% utilization and wMB/s around 160-170
> MB/s.  That's why I started focusing on RAID as being the potential
> bottleneck.
> 
>> That's why I questioned O_SYNC when you were using a filesystem: it
>> prevents merging, and forces seeking to do small metadata writes.
>> Basically turning a sequential workload into a random one.
> 
> Yes, that certainly makes sense.  Not using O_SYNC anymore.  Just O_DIRECT.

Sounds like it's time to break out blktrace to see what's really
happening between your array and its member devices.

With diffs from old kernels to new.
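Something along these lines would do it (a sketch; substitute your md
device, and member disks can be traced by adding extra -d options):

  blktrace -d /dev/md0 -o md0 -w 30      # trace the array for 30 seconds
  blkparse -i md0 -d md0.bin | less      # human-readable event stream
  btt -i md0.bin                         # latency / seek summary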

Phil

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-10 19:33         ` John Stoffel
@ 2015-12-10 22:19           ` Wols Lists
  0 siblings, 0 replies; 60+ messages in thread
From: Wols Lists @ 2015-12-10 22:19 UTC (permalink / raw)
  To: John Stoffel, Dallas Clement; +Cc: Mark Knecht, Linux-RAID

On 10/12/15 19:33, John Stoffel wrote:
> Now it's simpler to just A) buy lots and lots of memory, B) bunches of
> SSDs, C) both, D) beat the developers until they learn to write better
> SQL.
> 
> Sorry, D) never happens.  :-)

Don't get me started ... :-)

Relational first normal form CANNOT be efficient, and it starts with the
first law - "data comes in rows and columns". In the real world, that's
not true, and the rot just gets worse from there ...

Cheers,
Wol

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-10 19:28     ` John Stoffel
@ 2015-12-10 22:23       ` Wols Lists
  0 siblings, 0 replies; 60+ messages in thread
From: Wols Lists @ 2015-12-10 22:23 UTC (permalink / raw)
  To: John Stoffel, Dallas Clement; +Cc: Linux-RAID

On 10/12/15 19:28, John Stoffel wrote:
> At this point, you need to now look at your controllers and
> motherboard and how they're configured.  If all those drives are on
> one controller, and if that controller is on a single lane of PCIe,
> then you will see controller bandwidth issues as well.

Very out of date ... we're talking the days of CD writers here ... but
we used a pc to write bulk CDs, and we found we simply had to buy add-in
IDE cards to get decent performance - two CD drives on the same PATA
cable and response went through the floor - massively so.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
       [not found]     ` <CAK2H+ednN7dCGzcOt8TxgNdhdDA1mN6Xr5P8vQ+Y=-uRoxRksw@mail.gmail.com>
@ 2015-12-11  0:02       ` Dallas Clement
       [not found]         ` <CAK2H+efF2dM1BsM7kzfTxMdQEHvbWRaVe7zJLTGcPZzafn2M6A@mail.gmail.com>
  0 siblings, 1 reply; 60+ messages in thread
From: Dallas Clement @ 2015-12-11  0:02 UTC (permalink / raw)
  To: Mark Knecht; +Cc: Phil Turmel, Linux-RAID

On Thu, Dec 10, 2015 at 5:04 PM, Mark Knecht <markknecht@gmail.com> wrote:
>
>
> On Thu, Dec 10, 2015 at 12:09 PM, Dallas Clement
> <dallas.a.clement@gmail.com> wrote:
>>
>> On Thu, Dec 10, 2015 at 2:06 PM, Phil Turmel <philip@turmel.org> wrote:
>> <SNIP>
>> >>
>> >> Could someone please confirm whether these formulas are accurate or
>> >> not?
>> >
>> > Confirm these?  No.  In fact, I see no theoretical basis for stating a
>> > worst case speed as half the best case speed.  Or any other fraction.
>> > It's dependent on numerous variables -- block size, processor load, I/O
>> > bandwidth at various choke points (Northbridge, southbridge, PCI/PCIe,
>> > SATA/SAS channels, port mux...), I/O latency vs. queue depth vs. drive
>> > buffers, sector positioning at block boundaries, drive firmware
>> > housekeeping, etc.
>> >
>> > Where'd you get the worst case formulas?
>> >
>>
>> > Where'd you get the worst case formulas?
>>
>> Google search I'm afraid.  I think the assumption for RAID 5,6 worst
>> case is having to read and write the parity + data every cycle.
>
> What sustained throughput do you get in this system if you skip RAID, set
> up a script and write different data to all 12 drives in parallel? I don't
> think
> you've addressed Phil's comment concerning all the other potential choke
> points in  the system. You'd need to be careful and make sure all the data
> is really out to disk but it might tell you something about your assumptions
> vs what the hardware is really doing..
>
> - Mark

Hi Mark,

> What sustained throughput do you get in this system if you skip RAID, set
> up a script and write different data to all 12 drives in parallel?

Just tried this again, running fio concurrently on all 12 disks.  This
time doing sequential writes, bs=2048k, direct=1 to the raw disk
device - no filesystem.  The results are not encouraging.  I tried to
watch the disk behavior with iostat.  This 8-core Xeon system was
really getting crushed.  The load average during the 10-minute test
was 15.16  26.41  21.53.  iostat showed %iowait varying between 40%
and 80%.  iostat also showed only about 8 of the 12 disks on average
actively doing I/O.  Those were at or near 100% utilization, with
pretty good write speed, ~160 - 170 MB/s.  Looks like my disks are just
too slow and the CPU cores are stuck waiting for them.
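For reference, the per-disk jobs were along the lines of the following
sketch (device names and iodepth are illustrative, not my exact command
line):

  for d in /dev/sd{b..m}; do
      fio --name=seq-$(basename $d) --filename=$d --rw=write --bs=2048k \
          --direct=1 --ioengine=libaio --iodepth=32 --runtime=600 \
          --time_based --group_reporting &
  done
  wait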

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
       [not found]         ` <CAK2H+efF2dM1BsM7kzfTxMdQEHvbWRaVe7zJLTGcPZzafn2M6A@mail.gmail.com>
@ 2015-12-11  0:41           ` Dallas Clement
  2015-12-11  1:19             ` Dallas Clement
  0 siblings, 1 reply; 60+ messages in thread
From: Dallas Clement @ 2015-12-11  0:41 UTC (permalink / raw)
  To: Mark Knecht; +Cc: Phil Turmel, Linux-RAID

On Thu, Dec 10, 2015 at 6:22 PM, Mark Knecht <markknecht@gmail.com> wrote:
>
>
> On Thu, Dec 10, 2015 at 4:02 PM, Dallas Clement <dallas.a.clement@gmail.com>
> wrote:
>>
>> On Thu, Dec 10, 2015 at 5:04 PM, Mark Knecht <markknecht@gmail.com> wrote:
> <SNIP>
>>
>> Hi Mark,
>>
>> > What sustained throughput do you get in this system if you skip RAID,
>> > set
>> > up a script and write different data to all 12 drives in parallel?
>>
>> Just tried this again, running fio concurrently on all 12 disks.  This
>> time doing sequential writes, bs=2048k, direct=1 to the raw disk
>> device - no filesystem.  The results are not encouraging.  I tried to
>> watch the disk behavior with iostat.  This 8 core xeon system was
>> really getting crushed.  The load average during the 10 minute test
>> was 15.16  26.41  21.53.  iostat showed %iowait varying between 40%
>> and 80%.  Also iostat showed only about 8 of the 12 disks on average
>> getting CPU time.  They had high near 100% utilization and pretty good
>> write speed ~160 - 170 MB/s.  Looks like my disks are just too slow
>> and the CPU cores are stuck waiting for them.
>
> Well, it was hard on the system but it might not be a total loss. I'm not
> saying this is a good test but it might give you some ideas about how to
> proceed. Fewer drives? Better controller?
>
> Was it any different at the front and back of the drive?
>
> One thing I didn't see in this thread was a check to make sure your
> alignment is on the physical sector alignment if you're using 4K sectors
> which I assume drives this large are using.
>
> Anyway, data is just data. It gives you something to think about.
>
> Good luck,
> Mark

Hi Mark.  Perhaps this is normal behavior when there are more disks to
be served than there are CPUs.  But it surely does seem like a waste
for the CPUs to be locked up in uninterruptible sleep waiting for I/O
on these disks.  I presume this is caused by threads in the kernel
tied up in spin loops waiting for I/O.  It would sure be nice if the
I/O could be handled in a more asynchronous way so that these CPUs can
go off and do other things while they are waiting for I/Os to complete
on slow disks.

> Was it any different at the front and back of the drive?

Didn't try on this particular test.

> One thing I didn't see in this thread was a check to make sure your
> alignment is on the physical sector alignment if you're using 4K sectors
> which I assume drives this large are using.

Yes, these drives surely use 4K sectors.  But I haven't checked for
sector alignment issues.  Any tips on how to do that?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-11  0:41           ` Dallas Clement
@ 2015-12-11  1:19             ` Dallas Clement
       [not found]               ` <CAK2H+ec-zMbhxoFyHXLkdM-z-9cYYzNbPFhn19XjTHqrOMDZKQ@mail.gmail.com>
  0 siblings, 1 reply; 60+ messages in thread
From: Dallas Clement @ 2015-12-11  1:19 UTC (permalink / raw)
  To: Mark Knecht; +Cc: Phil Turmel, Linux-RAID

On Thu, Dec 10, 2015 at 6:41 PM, Dallas Clement
<dallas.a.clement@gmail.com> wrote:
> On Thu, Dec 10, 2015 at 6:22 PM, Mark Knecht <markknecht@gmail.com> wrote:
>>
>>
>> On Thu, Dec 10, 2015 at 4:02 PM, Dallas Clement <dallas.a.clement@gmail.com>
>> wrote:
>>>
>>> On Thu, Dec 10, 2015 at 5:04 PM, Mark Knecht <markknecht@gmail.com> wrote:
>> <SNIP>
>>>
>>> Hi Mark,
>>>
>>> > What sustained throughput do you get in this system if you skip RAID,
>>> > set
>>> > up a script and write different data to all 12 drives in parallel?
>>>
>>> Just tried this again, running fio concurrently on all 12 disks.  This
>>> time doing sequential writes, bs=2048k, direct=1 to the raw disk
>>> device - no filesystem.  The results are not encouraging.  I tried to
>>> watch the disk behavior with iostat.  This 8 core xeon system was
>>> really getting crushed.  The load average during the 10 minute test
>>> was 15.16  26.41  21.53.  iostat showed %iowait varying between 40%
>>> and 80%.  Also iostat showed only about 8 of the 12 disks on average
>>> getting CPU time.  They had high near 100% utilization and pretty good
>>> write speed ~160 - 170 MB/s.  Looks like my disks are just too slow
>>> and the CPU cores are stuck waiting for them.
>>
>> Well, it was hard on the system but it might not be a total loss. I'm not
>> saying this is a good test but it might give you some ideas about how to
>> proceed. Fewer drives? Better controller?
>>
>> Was it any different at the front and back of the drive?
>>
>> One thing I didn't see in this thread was a check to make sure your
>> alignment is on the physical sector alignment if you're using 4K sectors
>> which I assume drives this large are using.
>>
>> Anyway, data is just data. It gives you something to think about.
>>
>> Good luck,
>> Mark
>
> Hi Mark.  Perhaps this is normal behavior when there are more disks to
> be served than there are CPUs.  But it surely does seem like a waste
> for the CPUs to be locked up in uninterruptible sleep waiting for I/O
> on these disks.  I presume this is caused by threads in the kernel
> tied up in spin loops waiting for I/O.  It would sure be nice if the
> I/O could be handled in a more asynchronous way so that these CPUs can
> go off and do other things while they are waiting for I/Os to complete
> on slow disks.
>
>> Was it any different at the front and back of the drive?
>
> Didn't try on this particular test.
>
>> One thing I didn't see in this thread was a check to make sure your
>> alignment is on the physical sector alignment if you're using 4K sectors
>> which I assume drives this large are using.
>
> Yes, these drives surely use 4K sectors.  But I haven't checked for
> sector alignment issues.  Any tips on how to do that?

According to parted, my disk partition is aligned.

(parted) align-check
alignment type(min/opt)  [optimal]/minimal?
Partition number? 6
6 aligned

Partition Table: gpt

Number  Start      End          Size         File system  Name     Flags
 1      2048s      10002431s    10000384s                 primary
 2      10002432s  42002431s    32000000s                 primary
 3      42002432s  42004479s    2048s                     primary  bios_grub
 4      42004480s  42006527s    2048s                     primary
 5      42006528s  50008063s    8001536s                  primary
> 6      50008064s  7796883455s  7746875392s               primary

50008064 / 4096 = 12209
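For what it's worth, the start sectors can also be checked directly
against the 4 KiB physical sector size (512-byte sectors, so any start
that is a multiple of 8 is 4K-aligned); a small sketch:

  cat /sys/block/sda/queue/physical_block_size   # expected to report 4096
  for p in /sys/block/sda/sda[0-9]*; do
      start=$(cat $p/start)
      echo "$(basename $p) start=$start  4k-aligned=$(( start % 8 == 0 ))"
  done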

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
       [not found]               ` <CAK2H+ec-zMbhxoFyHXLkdM-z-9cYYzNbPFhn19XjTHqrOMDZKQ@mail.gmail.com>
@ 2015-12-11 15:44                 ` Dallas Clement
  2015-12-11 16:32                   ` John Stoffel
  0 siblings, 1 reply; 60+ messages in thread
From: Dallas Clement @ 2015-12-11 15:44 UTC (permalink / raw)
  To: Mark Knecht; +Cc: Phil Turmel, Linux-RAID

On Thu, Dec 10, 2015 at 8:50 PM, Mark Knecht <markknecht@gmail.com> wrote:
>
>
> On Thu, Dec 10, 2015 at 5:19 PM, Dallas Clement <dallas.a.clement@gmail.com>
> wrote:
>>
>> <SNIP>
>> >> One thing I didn't see in this thread was a check to make sure your
>> >> alignment is on the physical sector alignment if you're using 4K
>> >> sectors
>> >> which I assume drives this large are using.
>> >
>> > Yes, these drives surely use 4K sectors.  But I haven't checked for
>> > sector alignment issues.  Any tips on how to do that?
>>
>> According to parted, my disk partition is aligned.
>>
>> (parted) align-check
>> alignment type(min/opt)  [optimal]/minimal?
>> Partition number? 6
>> 6 aligned
>>
>> Partition Table: gpt
>>
>> Number  Start      End          Size         File system  Name     Flags
>>  1      2048s      10002431s    10000384s                 primary
>>  2      10002432s  42002431s    32000000s                 primary
>>  3      42002432s  42004479s    2048s                     primary
>> bios_grub
>>  4      42004480s  42006527s    2048s                     primary
>>  5      42006528s  50008063s    8001536s                  primary
>> > 6      50008064s  7796883455s  7746875392s               primary
>>
>> 50008064 / 4096 = 12209
>
> I think you want to divide by 8 as 8*512 = 4096 but you're likely ok
> on all of them. Partitioning tools are better about that these days.
>
> Looking back I didn't see what sort of controller/s you have all these
> drives attached to. If you're spinning in IO loops then it could be a
> driver issue.
>
> I would suggest googling for causes of iowait cycles and post
> back additional questions.
>
> Good luck,
> Mark

Hi Mark.  I have three different controllers on this motherboard.  A
Marvell 9485 controls 8 of the disks.  And an Intel Cougar Point
controls the 4 remaining disks.

> If you're spinning in IO loops then it could be a driver issue.

It sure is looking like that.  I will try to profile the kernel
threads today and maybe use blktrace as Phil recommended to see what
is going on there.

It is pretty sad that 12 single-threaded fio jobs can bring this
system to its knees.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-11 15:44                 ` Dallas Clement
@ 2015-12-11 16:32                   ` John Stoffel
  2015-12-11 16:47                     ` Dallas Clement
  0 siblings, 1 reply; 60+ messages in thread
From: John Stoffel @ 2015-12-11 16:32 UTC (permalink / raw)
  To: Dallas Clement; +Cc: Mark Knecht, Phil Turmel, Linux-RAID

>>>>> "Dallas" == Dallas Clement <dallas.a.clement@gmail.com> writes:

Dallas> Hi Mark.  I have three different controllers on this
Dallas> motherboard.  A Marvell 9485 controls 8 of the disks.  And an
Dallas> Intel Cougar Point controls the 4 remaining disks.

What type of PCIe slots are the controllers in?  And how fast are the
controllers/drives?  Are they SATA1/2/3 drives?  

>> If you're spinning in IO loops then it could be a driver issue.

Dallas> It sure is looking like that.  I will try to profile the
Dallas> kernel threads today and maybe use blktrace as Phil
Dallas> recommended to see what is going on there.

What kernel are you running?

Dallas> This is pretty sad that 12 single threaded fio jobs can bring
Dallas> this system to its knees.

I think it might be better to lower the queue depth; you might just be
blowing out the controller caches...  it's hard to know.
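For example (sketches, not specific recommendations): either step fio's
--iodepth down, or cap the device queue directly:

  cat  /sys/block/sda/device/queue_depth
  echo 31 > /sys/block/sda/device/queue_depth   # limit NCQ depth on one disk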

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-11 16:32                   ` John Stoffel
@ 2015-12-11 16:47                     ` Dallas Clement
  2015-12-11 19:34                       ` John Stoffel
  0 siblings, 1 reply; 60+ messages in thread
From: Dallas Clement @ 2015-12-11 16:47 UTC (permalink / raw)
  To: John Stoffel; +Cc: Mark Knecht, Phil Turmel, Linux-RAID

On Fri, Dec 11, 2015 at 10:32 AM, John Stoffel <john@stoffel.org> wrote:
>>>>>> "Dallas" == Dallas Clement <dallas.a.clement@gmail.com> writes:
>
> Dallas> Hi Mark.  I have three different controllers on this
> Dallas> motherboard.  A Marvell 9485 controls 8 of the disks.  And an
> Dallas> Intel Cougar Point controls the 4 remaining disks.
>
> What type of PCIe slots are the controllers in?  And how fast are the
> controllers/drives?  Are they SATA1/2/3 drives?
>
>>> If you're spinning in IO loops then it could be a driver issue.
>
> Dallas> It sure is looking like that.  I will try to profile the
> Dallas> kernel threads today and maybe use blktrace as Phil
> Dallas> recommended to see what is going on there.
>
> what kernel aer you running?
>
> Dallas> This is pretty sad that 12 single threaded fio jobs can bring
> Dallas> this system to its knees.
>
> I think it might be better to lower the queue depth, you might be just
> blowing out the controller caches...  hard to know.

Hi John.

> What type of PCIe slots are the controllers in?  And how fast are the
> controllers/drives?  Are they SATA1/2/3 drives?

The  MV 9485 controller is attached to an Intel Sandy Bridge via PCIe
GEN2 x 8.  This one controls 8 of the disks.
The Intel Cougar Point is connected to the Intel Sandy Bridge via DMI bus.

All of the drives are SATA III, however I do have two of the drives
connected to SATA II ports on the Cougar Point.  These two drives used
to be connected to SATA III ports on a MV 9125/9120 controller.  But
it had truly horrible write performance.  Moving to the SATA II ports
on the Cougar Point boosted the performance close to the same as the
other drives.  The remaining 10 drives are all connected to SATA III
ports.

> what kernel aer you running?

Right now, I'm using 3.10.69.  But I have tried the 4.2 kernel in
Fedora 23 with similar results.

> I think it might be better to lower the queue depth, you might be just
> blowing out the controller caches...  hard to know.

Good idea.  I'll try lowering it to see what effect it has.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-11 16:47                     ` Dallas Clement
@ 2015-12-11 19:34                       ` John Stoffel
  2015-12-11 21:24                         ` Dallas Clement
  0 siblings, 1 reply; 60+ messages in thread
From: John Stoffel @ 2015-12-11 19:34 UTC (permalink / raw)
  To: Dallas Clement; +Cc: John Stoffel, Mark Knecht, Phil Turmel, Linux-RAID

>>>>> "Dallas" == Dallas Clement <dallas.a.clement@gmail.com> writes:

Dallas> On Fri, Dec 11, 2015 at 10:32 AM, John Stoffel <john@stoffel.org> wrote:
>>>>>>> "Dallas" == Dallas Clement <dallas.a.clement@gmail.com> writes:
>> 
Dallas> Hi Mark.  I have three different controllers on this
Dallas> motherboard.  A Marvell 9485 controls 8 of the disks.  And an
Dallas> Intel Cougar Point controls the 4 remaining disks.
>> 
>> What type of PCIe slots are the controllers in?  And how fast are the
>> controllers/drives?  Are they SATA1/2/3 drives?
>> 
>>>> If you're spinning in IO loops then it could be a driver issue.
>> 
Dallas> It sure is looking like that.  I will try to profile the
Dallas> kernel threads today and maybe use blktrace as Phil
Dallas> recommended to see what is going on there.
>> 
>> what kernel aer you running?
>> 
Dallas> This is pretty sad that 12 single threaded fio jobs can bring
Dallas> this system to its knees.
>> 
>> I think it might be better to lower the queue depth, you might be just
>> blowing out the controller caches...  hard to know.

Dallas> Hi John.

>> What type of PCIe slots are the controllers in?  And how fast are the
>> controllers/drives?  Are they SATA1/2/3 drives?

Dallas> The MV 9485 controller is attached to an Intel Sandy Bridge
Dallas> via PCIe GEN2 x 8.  This one controls 8 of the disks.  The
Dallas> Intel Cougar Point is connected to the Intel Sandy Bridge via
Dallas> DMI bus.

So that should all be nice and fast.  

Dallas> All of the drives are SATA III, however I do have two of the
Dallas> drives connected to SATA II ports on the Cougar Point.  These
Dallas> two drives used to be connected to SATA III ports on a MV
Dallas> 9125/9120 controller.  But it had truly horrible write
Dallas> performance.  Moving to the SATA II ports on the Cougar Point
Dallas> boosted the performance close to the same as the other drives.
Dallas> The remaining 10 drives are all connected to SATA III ports.

>> what kernel aer you running?

Dallas> Right now, I'm using 3.10.69.  But I have tried the 4.2 kernel
Dallas> in Fedora 23 with similar results.

Hmm... maybe if you're feeling adventurous you could try v4.4-rc4 and
see how it works.  You don't want anything between 4.2.6 and that
because of problems with block request management.  I'm hazy on the details.

>> I think it might be better to lower the queue depth, you might be just
>> blowing out the controller caches...  hard to know.

Dallas> Good idea.  I'll trying lowering to see what effect.

It might also make sense to try your tests starting with just 1 disk,
and then adding one more disk, re-running the tests, then another
disk, re-running the tests, etc.

Try with one on the MV, then one on the Cougar, then one on MV and one
on Cougar, etc.

Try to see if you can spot where the performance falls off the cliff.

Also, which disk scheduler are you using?  Instead of CFQ, you might
try deadline instead.
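Checking and switching it per device is just, for example:

  cat /sys/block/sda/queue/scheduler             # current choice in brackets
  echo deadline > /sys/block/sda/queue/scheduler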

As you can see, there are a TON of knobs to twiddle with; it's not a
simple thing to do at times.

John

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-11 19:34                       ` John Stoffel
@ 2015-12-11 21:24                         ` Dallas Clement
  2015-12-11 23:30                           ` Dallas Clement
  0 siblings, 1 reply; 60+ messages in thread
From: Dallas Clement @ 2015-12-11 21:24 UTC (permalink / raw)
  To: John Stoffel; +Cc: Mark Knecht, Phil Turmel, Linux-RAID

On Fri, Dec 11, 2015 at 1:34 PM, John Stoffel <john@stoffel.org> wrote:
>>>>>> "Dallas" == Dallas Clement <dallas.a.clement@gmail.com> writes:
>
> Dallas> On Fri, Dec 11, 2015 at 10:32 AM, John Stoffel <john@stoffel.org> wrote:
>>>>>>>> "Dallas" == Dallas Clement <dallas.a.clement@gmail.com> writes:
>>>
> Dallas> Hi Mark.  I have three different controllers on this
> Dallas> motherboard.  A Marvell 9485 controls 8 of the disks.  And an
> Dallas> Intel Cougar Point controls the 4 remaining disks.
>>>
>>> What type of PCIe slots are the controllers in?  And how fast are the
>>> controllers/drives?  Are they SATA1/2/3 drives?
>>>
>>>>> If you're spinning in IO loops then it could be a driver issue.
>>>
> Dallas> It sure is looking like that.  I will try to profile the
> Dallas> kernel threads today and maybe use blktrace as Phil
> Dallas> recommended to see what is going on there.
>>>
>>> what kernel aer you running?
>>>
> Dallas> This is pretty sad that 12 single threaded fio jobs can bring
> Dallas> this system to its knees.
>>>
>>> I think it might be better to lower the queue depth, you might be just
>>> blowing out the controller caches...  hard to know.
>
> Dallas> Hi John.
>
>>> What type of PCIe slots are the controllers in?  And how fast are the
>>> controllers/drives?  Are they SATA1/2/3 drives?
>
> Dallas> The MV 9485 controller is attached to an Intel Sandy Bridge
> Dallas> via PCIe GEN2 x 8.  This one controls 8 of the disks.  The
> Dallas> Intel Cougar Point is connected to the Intel Sandy Bridge via
> Dallas> DMI bus.
>
> So that should all be nice and fast.
>
> Dallas> All of the drives are SATA III, however I do have two of the
> Dallas> drives connected to SATA II ports on the Cougar Point.  These
> Dallas> two drives used to be connected to SATA III ports on a MV
> Dallas> 9125/9120 controller.  But it had truly horrible write
> Dallas> performance.  Moving to the SATA II ports on the Cougar Point
> Dallas> boosted the performance close to the same as the other drives.
> Dallas> The remaining 10 drives are all connected to SATA III ports.
>
>>> what kernel aer you running?
>
> Dallas> Right now, I'm using 3.10.69.  But I have tried the 4.2 kernel
> Dallas> in Fedora 23 with similar results.
>
> Hmm... maybe if your feeling adventerous you could try v4.4-rc4 and
> see how it works.  You don't want anything between 4.2.6 and that
> because of problems with blk req management.  I'm hazy on the details.
>
>>> I think it might be better to lower the queue depth, you might be just
>>> blowing out the controller caches...  hard to know.
>
> Dallas> Good idea.  I'll trying lowering to see what effect.
>
> It might also make sense to try your tests starting with just 1 disk,
> and then adding one more disk, re-running the tests, then another
> disk, re-running the tests, etc.
>
> Try with one on the MV, then one on the Cougar, then one on MV and one
> on Cougar, etc.
>
> Try to see if you can spot where the performance falls off the cliff.
>
> Also, which disk scheduler are you using?  Instead of CFQ, you might
> try deadline instead.
>
> As you can see, there's a TON of knobs to twiddle with, it's not a
> simple thing to do at times.
>
> John

> It might also make sense to try your tests starting with just 1 disk,
> and then adding one more disk, re-running the tests, then another
> disk, re-running the tests, etc

> Try to see if you can spot where the performance falls off the cliff.

Okay, did this.  Interestingly, things did not fall off the cliff until
I added the 12th disk.  I started adding disks one at a time,
beginning with the Cougar Point.  The %iowait jumped up right away
with this one as well.

> Also, which disk scheduler are you using?  Instead of CFQ, you might
> try deadline instead.

I'm using deadline.  I have definitely observed better performance
with this vs cfq.

At this point I think I probably need to use a tool like blktrace to
get more visibility than what ps and iostat give me.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-11 21:24                         ` Dallas Clement
@ 2015-12-11 23:30                           ` Dallas Clement
  2015-12-12  0:00                             ` Dallas Clement
  0 siblings, 1 reply; 60+ messages in thread
From: Dallas Clement @ 2015-12-11 23:30 UTC (permalink / raw)
  To: John Stoffel; +Cc: Mark Knecht, Phil Turmel, Linux-RAID

On Fri, Dec 11, 2015 at 3:24 PM, Dallas Clement
<dallas.a.clement@gmail.com> wrote:
> On Fri, Dec 11, 2015 at 1:34 PM, John Stoffel <john@stoffel.org> wrote:
>>>>>>> "Dallas" == Dallas Clement <dallas.a.clement@gmail.com> writes:
>>
>> Dallas> On Fri, Dec 11, 2015 at 10:32 AM, John Stoffel <john@stoffel.org> wrote:
>>>>>>>>> "Dallas" == Dallas Clement <dallas.a.clement@gmail.com> writes:
>>>>
>> Dallas> Hi Mark.  I have three different controllers on this
>> Dallas> motherboard.  A Marvell 9485 controls 8 of the disks.  And an
>> Dallas> Intel Cougar Point controls the 4 remaining disks.
>>>>
>>>> What type of PCIe slots are the controllers in?  And how fast are the
>>>> controllers/drives?  Are they SATA1/2/3 drives?
>>>>
>>>>>> If you're spinning in IO loops then it could be a driver issue.
>>>>
>> Dallas> It sure is looking like that.  I will try to profile the
>> Dallas> kernel threads today and maybe use blktrace as Phil
>> Dallas> recommended to see what is going on there.
>>>>
>>>> what kernel are you running?
>>>>
>> Dallas> This is pretty sad that 12 single threaded fio jobs can bring
>> Dallas> this system to its knees.
>>>>
>>>> I think it might be better to lower the queue depth, you might be just
>>>> blowing out the controller caches...  hard to know.
>>
>> Dallas> Hi John.
>>
>>>> What type of PCIe slots are the controllers in?  And how fast are the
>>>> controllers/drives?  Are they SATA1/2/3 drives?
>>
>> Dallas> The MV 9485 controller is attached to an Intel Sandy Bridge
>> Dallas> via PCIe GEN2 x 8.  This one controls 8 of the disks.  The
>> Dallas> Intel Cougar Point is connected to the Intel Sandy Bridge via
>> Dallas> DMI bus.
>>
>> So that should all be nice and fast.
>>
>> Dallas> All of the drives are SATA III, however I do have two of the
>> Dallas> drives connected to SATA II ports on the Cougar Point.  These
>> Dallas> two drives used to be connected to SATA III ports on a MV
>> Dallas> 9125/9120 controller.  But it had truly horrible write
>> Dallas> performance.  Moving to the SATA II ports on the Cougar Point
>> Dallas> boosted the performance close to the same as the other drives.
>> Dallas> The remaining 10 drives are all connected to SATA III ports.
>>
>>>> what kernel are you running?
>>
>> Dallas> Right now, I'm using 3.10.69.  But I have tried the 4.2 kernel
>> Dallas> in Fedora 23 with similar results.
>>
>> Hmm... maybe if you're feeling adventurous you could try v4.4-rc4 and
>> see how it works.  You don't want anything between 4.2.6 and that
>> because of problems with blk req management.  I'm hazy on the details.
>>
>>>> I think it might be better to lower the queue depth, you might be just
>>>> blowing out the controller caches...  hard to know.
>>
>> Dallas> Good idea.  I'll try lowering it to see what effect it has.
>>
>> It might also make sense to try your tests starting with just 1 disk,
>> and then adding one more disk, re-running the tests, then another
>> disk, re-running the tests, etc.
>>
>> Try with one on the MV, then one on the Cougar, then one on MV and one
>> on Cougar, etc.
>>
>> Try to see if you can spot where the performance falls off the cliff.
>>
>> Also, which disk scheduler are you using?  Instead of CFQ, you might
>> try deadline instead.
>>
>> As you can see, there's a TON of knobs to twiddle with, it's not a
>> simple thing to do at times.
>>
>> John
>
>> It might also make sense to try your tests starting with just 1 disk,
>> and then adding one more disk, re-running the tests, then another
>> disk, re-running the tests, etc
>
>> Try to see if you can spot where the performance falls off the cliff.
>
> Okay, did this.  Interestingly, things did not fall off the cliff until
> adding in the 12th disk.  I started adding disks one at a time
> beginning with the Cougar Point.  The %iowait jumped up right away
> with this guy also.
>
>> Also, which disk scheduler are you using?  Instead of CFQ, you might
>> try deadline instead.
>
> I'm using deadline.  I have definitely observed better performance
> with this vs cfq.
>
> At this point I think I need to probably use a tool like blktrace to
> get more visibility than what I have with ps and iostat.

I have one more observation.  I tried varying the queue depth from 1,
4, 16, 32, 64, 128, 256.  Surprisingly, all 12 disks are able to
handle this load with queue depth <= 128.  Each disk is at 100%
utilization and writing 170-180 MB/s.  Things start to fall apart with
queue depth = 256 after adding in the 12th disk.  The inflection point
on load average seems to be around queue depth = 32.  The load average
for this 8 core system goes up to about 13 when I increase the queue
depth to 64.
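
In case it helps to see it concretely, one disk's sweep looks roughly
like this (a sketch using fio's command-line options; one such instance
per member disk, run in parallel, device name illustrative):

for d in 1 4 16 32 64 128 256; do
    fio --name=seqwrite --ioengine=libaio --iodepth=$d --rw=write --bs=2048k \
        --direct=1 --size=10g --filename=/dev/sdb --output=sdb_qd${d}.log
done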

So is my workload of 12 fio jobs writing sequential 2 MB blocks with
direct I/O just too abusive?  Seems so with high queue depth.

I started this discussion because my RAID 5 and RAID 6 write
performance is really bad.  If my system is able to write to all 12
disks at 170 MB/s in JBOD mode, I am expecting that one fio job should
be able to write at a speed of (N - 1) * X = 11 * 170 MB/s = 1870
MB/s.  However, I am getting < 700 MB/s for queue depth = 32 and < 600
MB/s for queue depth = 256.  I get similarly disappointing results for
RAID 6 writes.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-11 23:30                           ` Dallas Clement
@ 2015-12-12  0:00                             ` Dallas Clement
  2015-12-12  0:38                               ` Phil Turmel
  0 siblings, 1 reply; 60+ messages in thread
From: Dallas Clement @ 2015-12-12  0:00 UTC (permalink / raw)
  To: John Stoffel; +Cc: Mark Knecht, Phil Turmel, Linux-RAID

On Fri, Dec 11, 2015 at 5:30 PM, Dallas Clement
<dallas.a.clement@gmail.com> wrote:
> On Fri, Dec 11, 2015 at 3:24 PM, Dallas Clement
> <dallas.a.clement@gmail.com> wrote:
>> On Fri, Dec 11, 2015 at 1:34 PM, John Stoffel <john@stoffel.org> wrote:
>>>>>>>> "Dallas" == Dallas Clement <dallas.a.clement@gmail.com> writes:
>>>
>>> Dallas> On Fri, Dec 11, 2015 at 10:32 AM, John Stoffel <john@stoffel.org> wrote:
>>>>>>>>>> "Dallas" == Dallas Clement <dallas.a.clement@gmail.com> writes:
>>>>>
>>> Dallas> Hi Mark.  I have three different controllers on this
>>> Dallas> motherboard.  A Marvell 9485 controls 8 of the disks.  And an
>>> Dallas> Intel Cougar Point controls the 4 remaining disks.
>>>>>
>>>>> What type of PCIe slots are the controllers in?  And how fast are the
>>>>> controllers/drives?  Are they SATA1/2/3 drives?
>>>>>
>>>>>>> If you're spinning in IO loops then it could be a driver issue.
>>>>>
>>> Dallas> It sure is looking like that.  I will try to profile the
>>> Dallas> kernel threads today and maybe use blktrace as Phil
>>> Dallas> recommended to see what is going on there.
>>>>>
>>>>> what kernel are you running?
>>>>>
>>> Dallas> This is pretty sad that 12 single threaded fio jobs can bring
>>> Dallas> this system to its knees.
>>>>>
>>>>> I think it might be better to lower the queue depth, you might be just
>>>>> blowing out the controller caches...  hard to know.
>>>
>>> Dallas> Hi John.
>>>
>>>>> What type of PCIe slots are the controllers in?  And how fast are the
>>>>> controllers/drives?  Are they SATA1/2/3 drives?
>>>
>>> Dallas> The MV 9485 controller is attached to an Intel Sandy Bridge
>>> Dallas> via PCIe GEN2 x 8.  This one controls 8 of the disks.  The
>>> Dallas> Intel Cougar Point is connected to the Intel Sandy Bridge via
>>> Dallas> DMI bus.
>>>
>>> So that should all be nice and fast.
>>>
>>> Dallas> All of the drives are SATA III, however I do have two of the
>>> Dallas> drives connected to SATA II ports on the Cougar Point.  These
>>> Dallas> two drives used to be connected to SATA III ports on a MV
>>> Dallas> 9125/9120 controller.  But it had truly horrible write
>>> Dallas> performance.  Moving to the SATA II ports on the Cougar Point
>>> Dallas> boosted the performance close to the same as the other drives.
>>> Dallas> The remaining 10 drives are all connected to SATA III ports.
>>>
>>>>> what kernel are you running?
>>>
>>> Dallas> Right now, I'm using 3.10.69.  But I have tried the 4.2 kernel
>>> Dallas> in Fedora 23 with similar results.
>>>
>>> Hmm... maybe if you're feeling adventurous you could try v4.4-rc4 and
>>> see how it works.  You don't want anything between 4.2.6 and that
>>> because of problems with blk req management.  I'm hazy on the details.
>>>
>>>>> I think it might be better to lower the queue depth, you might be just
>>>>> blowing out the controller caches...  hard to know.
>>>
>>> Dallas> Good idea.  I'll try lowering it to see what effect it has.
>>>
>>> It might also make sense to try your tests starting with just 1 disk,
>>> and then adding one more disk, re-running the tests, then another
>>> disk, re-running the tests, etc.
>>>
>>> Try with one on the MV, then one on the Cougar, then one on MV and one
>>> on Cougar, etc.
>>>
>>> Try to see if you can spot where the performance falls off the cliff.
>>>
>>> Also, which disk scheduler are you using?  Instead of CFQ, you might
>>> try deadline instead.
>>>
>>> As you can see, there's a TON of knobs to twiddle with, it's not a
>>> simple thing to do at times.
>>>
>>> John
>>
>>> It might also make sense to try your tests starting with just 1 disk,
>>> and then adding one more disk, re-running the tests, then another
>>> disk, re-running the tests, etc
>>
>>> Try to see if you can spot where the performance falls off the cliff.
>>
>> Okay, did this.  Interestingly, things did not fall off the cliff until
>> adding in the 12th disk.  I started adding disks one at a time
>> beginning with the Cougar Point.  The %iowait jumped up right away
>> with this guy also.
>>
>>> Also, which disk scheduler are you using?  Instead of CFQ, you might
>>> try deadline instead.
>>
>> I'm using deadline.  I have definitely observed better performance
>> with this vs cfq.
>>
>> At this point I think I need to probably use a tool like blktrace to
>> get more visibility than what I have with ps and iostat.
>
> I have one more observation.  I tried varying the queue depth from 1,
> 4, 16, 32, 64, 128, 256.  Surprisingly, all 12 disks are able to
> handle this load with queue depth <= 128.  Each disk is at 100%
> utilization and writing 170-180 MB/s.  Things start to fall apart with
> queue depth = 256 after adding in the 12th disk.  The inflection point
> on load average seems to be around queue depth = 32.  The load average
> for this 8 core system goes up to about 13 when I increase the queue
> depth to 64.
>
> So is my workload of 12 fio jobs writing sequential 2 MB blocks with
> direct I/O just too abusive?  Seems so with high queue depth.
>
> I started this discussion because my RAID 5 and RAID 6 write
> performance is really bad.  If my system is able to write to all 12
> disks at 170 MB/s in JBOD mode, I am expecting that one fio job should
> be able to write at a speed of (N - 1) * X = 11 * 170 MB/s = 1870
> MB/s.  However, I am getting < 700 MB/s for queue depth = 32 and < 600
> MB/s for queue depth = 256.  I get similarly disappointing results for
> RAID 6 writes.

One other thing I failed to mention is that I seem to be unable to
saturate my RAID device using fio.  I have tried increasing the number
of jobs and that has actually resulted in worse performance.  Here's
what I get with just one job thread.

# fio ../job.fio
job: (g=0): rw=write, bs=2M-2M/2M-2M/2M-2M, ioengine=libaio, iodepth=256
fio-2.2.7
Starting 1 process
Jobs: 1 (f=1): [W(1)] [90.5% done] [0KB/725.3MB/0KB /s] [0/362/0 iops] [eta 00m:02s]
job: (groupid=0, jobs=1): err= 0: pid=30569: Sat Dec 12 08:22:54 2015
  write: io=10240MB, bw=561727KB/s, iops=274, runt= 18667msec
    slat (usec): min=316, max=554160, avg=3623.16, stdev=20560.63
    clat (msec): min=25, max=2744, avg=913.26, stdev=508.27
     lat (msec): min=26, max=2789, avg=916.88, stdev=510.13
    clat percentiles (msec):
     |  1.00th=[  221],  5.00th=[  553], 10.00th=[  594], 20.00th=[  635],
     | 30.00th=[  660], 40.00th=[  685], 50.00th=[  709], 60.00th=[  742],
     | 70.00th=[  791], 80.00th=[  947], 90.00th=[ 1827], 95.00th=[ 2114],
     | 99.00th=[ 2442], 99.50th=[ 2474], 99.90th=[ 2540], 99.95th=[ 2737],
     | 99.99th=[ 2737]
    bw (KB  /s): min= 3093, max=934603, per=97.80%, avg=549364.82, stdev=269856.22
    lat (msec) : 50=0.14%, 100=0.39%, 250=0.78%, 500=2.03%, 750=58.67%
    lat (msec) : 1000=18.18%, 2000=11.41%, >=2000=8.40%
  cpu          : usr=5.30%, sys=8.89%, ctx=2219, majf=0, minf=32
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=0.3%, 32=0.6%, >=64=98.8%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=0/w=5120/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
  WRITE: io=10240MB, aggrb=561727KB/s, minb=561727KB/s, maxb=561727KB/s, mint=18667msec, maxt=18667msec

Disk stats (read/write):
    md10: ios=1/81360, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=660/4402, aggrmerge=9848/234056, aggrticks=23282/123890, aggrin_queue=147976, aggrutil=66.50%
  sda: ios=712/4387, merge=10727/233944, ticks=24150/130830, in_queue=155810, util=61.32%
  sdb: ios=697/4441, merge=10246/234331, ticks=19820/108830, in_queue=129430, util=59.58%
  sdc: ios=636/4384, merge=9273/233886, ticks=21380/123780, in_queue=146070, util=62.17%
  sdd: ios=656/4399, merge=9731/234030, ticks=23050/135000, in_queue=158880, util=63.91%
  sdf: ios=672/4427, merge=9862/234117, ticks=20110/101910, in_queue=122790, util=58.53%
  sdg: ios=656/4414, merge=9801/234081, ticks=20820/110860, in_queue=132390, util=61.38%
  sdh: ios=644/4385, merge=9526/234047, ticks=25120/131670, in_queue=157630, util=62.80%
  sdi: ios=739/4369, merge=10757/233876, ticks=32430/160810, in_queue=194080, util=66.50%
  sdj: ios=687/4386, merge=10525/234033, ticks=25770/131950, in_queue=158530, util=64.18%
  sdk: ios=620/4454, merge=9572/234495, ticks=22010/117190, in_queue=139960, util=60.80%
  sdl: ios=610/4393, merge=9090/233924, ticks=23800/118340, in_queue=142910, util=62.12%
  sdm: ios=602/4394, merge=9066/233915, ticks=20930/115520, in_queue=137240, util=60.96%

As you can see, the array utilization is only 66.5% and the disk
utilization is about the same.  Perhaps I am just using the wrong tool
or using fio incorrectly. On the other hand, I suppose it still could
be a problem with RAID 5, 6 implementation.

This is my fio job config:

# cat ../job.fio
[job]
ioengine=libaio
iodepth=256
prio=0
rw=write
bs=2048k
filename=/dev/md10
numjobs=1
size=10g
direct=1
invalidate=1
ramp_time=15
runtime=120
time_based

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-12  0:00                             ` Dallas Clement
@ 2015-12-12  0:38                               ` Phil Turmel
  2015-12-12  2:55                                 ` Dallas Clement
  0 siblings, 1 reply; 60+ messages in thread
From: Phil Turmel @ 2015-12-12  0:38 UTC (permalink / raw)
  To: Dallas Clement, John Stoffel; +Cc: Mark Knecht, Linux-RAID

On 12/11/2015 07:00 PM, Dallas Clement wrote:

>> So is my workload of 12 fio jobs writing sequential 2 MB blocks with
>> direct I/O just too abusive?  Seems so with high queue depth.

I don't think you are adjusting any hardware queue depth here.  The fio
man page is quite explicit that iodepth=N is ineffective for sequential
operations.  But you are using the libaio engine, so you are piling up
many *software* queued operations for the kernel to execute, not
operations in flight to the disks.  From the histograms in your results,
the vast majority of ops are completing at depth=4.  Further queuing is
just adding kernel overhead.

The queuing differences from one kernel to another are a driver and
hardware property, not an application property.

>> I started this discussion because my RAID 5 and RAID 6 write
>> performance is really bad.  If my system is able to write to all 12
>> disks at 170 MB/s in JBOD mode, I am expecting that one fio job should
>> be able to write at a speed of (N - 1) * X = 11 * 170 MB/s = 1870
>> MB/s.  However, I am getting < 700 MB/s for queue depth = 32 and < 600
>> MB/s for queue depth = 256.  I get similarly disappointing results for
>> RAID 6 writes.

That's why I suggested blktrace.  Collect a trace while a single dd is
writing to your raw array device.  Compare the large writes submitted to
the md device against the broken down writes submitted to the member
devices.

Compare the patterns and sizes from older kernels against newer kernels,
possibly varying which controllers and data paths are involved.

Phil

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-12  0:38                               ` Phil Turmel
@ 2015-12-12  2:55                                 ` Dallas Clement
  2015-12-12  4:47                                   ` Phil Turmel
  0 siblings, 1 reply; 60+ messages in thread
From: Dallas Clement @ 2015-12-12  2:55 UTC (permalink / raw)
  To: Phil Turmel; +Cc: John Stoffel, Mark Knecht, Linux-RAID

On Fri, Dec 11, 2015 at 6:38 PM, Phil Turmel <philip@turmel.org> wrote:
> On 12/11/2015 07:00 PM, Dallas Clement wrote:
>
>>> So is my workload of 12 fio jobs writing sequential 2 MB blocks with
>>> direct I/O just too abusive?  Seems so with high queue depth.
>
> I don't think you are adjusting any hardware queue depth here.  The fio
> man page is quite explicit that iodepth=N is ineffective for sequential
> operations.  But you are using the libaio engine, so you are piling up
> many *software* queued operations for the kernel to execute, not
> operations in flight to the disks.  From the histograms in your results,
> the vast majority of ops are completing at depth=4.  Further queuing is
> just adding kernel overhead.
>
> The queuing differences from one kernel to another are a driver and
> hardware property, not an application property.
>
>>> I started this discussion because my RAID 5 and RAID 6 write
>>> performance is really bad.  If my system is able to write to all 12
>>> disks at 170 MB/s in JBOD mode, I am expecting that one fio job should
>>> be able to write at a speed of (N - 1) * X = 11 * 170 MB/s = 1870
>>> MB/s.  However, I am getting < 700 MB/s for queue depth = 32 and < 600
>>> MB/s for queue depth = 256.  I get similarly disappointing results for
>>> RAID 6 writes.
>
> That's why I suggested blktrace.  Collect a trace while a single dd is
> writing to your raw array device.  Compare the large writes submitted to
> the md device against the broken down writes submitted to the member
> devices.
>
> Compare the patterns and sizes from older kernels against newer kernels,
> possibly varying which controllers and data paths are involved.
>
> Phil

Hi Phil,

> I don't think you are adjusting any hardware queue depth here.

Right, that was my understanding as well.  The fio iodepth setting
just controls how many I/Os can be in flight from the application
perspective.  I have not modified the hardware queue depth on my disks
at all yet.  Was saving that for later.

>  The fio man page is quite explicit that iodepth=N is ineffective for sequential
> operations.  But you are using the libaio engine, so you are piling up
> many *software* queued operations for the kernel to execute, not
> operations in flight to the disks.

Right.  I understand the fio iodepth is different than the hardware
queue depth.  But the fio man page seems to only mention limitation on
synchronous operations which mine are not. I'm using direct=1 and
sync=0.

I guess what I would really like to know is how I can achieve at or
near 100% utilization on the raid device and its member disks with
fio.  Do I need to increase /sys/block/sd*/device/queue_depth and
/sys/block/sd*/queue/nr_requests to get more utilization?
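
For reference, these are the knobs I mean (sdb standing in for each
member disk; the values written are just illustrative):

# cat /sys/block/sdb/device/queue_depth
# cat /sys/block/sdb/queue/nr_requests
# echo 31 > /sys/block/sdb/device/queue_depth
# echo 512 > /sys/block/sdb/queue/nr_requests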

> That's why I suggested blktrace.  Collect a trace while a single dd is
> writing to your raw array device.  Compare the large writes submitted to
> the md device against the broken down writes submitted to the member
> devices.

Sounds good.  Will do.  What signs of trouble should I be looking for?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-12  2:55                                 ` Dallas Clement
@ 2015-12-12  4:47                                   ` Phil Turmel
  2015-12-14 20:14                                     ` Dallas Clement
  0 siblings, 1 reply; 60+ messages in thread
From: Phil Turmel @ 2015-12-12  4:47 UTC (permalink / raw)
  To: Dallas Clement; +Cc: John Stoffel, Mark Knecht, Linux-RAID

On 12/11/2015 09:55 PM, Dallas Clement wrote:

> Right.  I understand the fio iodepth is different than the hardware
> queue depth.  But the fio man page seems to only mention limitation on
> synchronous operations which mine are not. I'm using direct=1 and
> sync=0.

You are confusing sequential and synchronous.  The man page says it is
ineffective for *sequential* operations, especially when direct=1.

> I guess what I would really like to know is how I can achieve at or
> near 100% utilization on the raid device and its member disks with
> fio.  Do I need to increase /sys/block/sd*/device/queue_depth and
> /sys/block/sd*/queue/nr_requests to get more utilization?

I don't know specifically.  It seems to me that increasing queue depth
adds resiliency in the face of data transfer timing jitter, but at the
cost of more CPU overhead.

I'm not convinced fio is the right workload, either.  Its options are
much more flexible for random I/O workloads.  dd isn't perfect either,
especially when writing zeroes -- it actually reads zeros over and over
from the special device.  For sequential operations I like dc3dd with
its pat= wipe= mode.  That'll only generate writes.
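
If memory serves, the invocation is along the lines of:

# dc3dd wipe=/dev/md10 pat=AA

where pat= takes an arbitrary hex pattern to write, but check the man
page -- I'm going from memory.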

>> That's why I suggested blktrace.  Collect a trace while a single dd is
>> writing to your raw array device.  Compare the large writes submitted to
>> the md device against the broken down writes submitted to the member
>> devices.
> 
> Sounds good.  Will do.  What signs of trouble should I be looking for?

Look for strictly increasing logical block addresses in requests to the
member devices.  Any disruption in that will break optimum positioning
for streaming throughput. Per device. Requests to the device have to be
large enough and paced quickly enough to avoid starving the write head.

Of course, any reads mixed in mean RMW cycles you didn't avoid.  You
shouldn't have any of those for sequential writes in chunk * (n-2)
multiples.

I know it's a bit hand-wavy, but you have more hardware to play with
than I do :-)

Phil

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-12  4:47                                   ` Phil Turmel
@ 2015-12-14 20:14                                     ` Dallas Clement
       [not found]                                       ` <CAK2H+edazVORrVovWDeTA8DmqUL+5HRH-AcRwg8KkMas=o+Cog@mail.gmail.com>
  0 siblings, 1 reply; 60+ messages in thread
From: Dallas Clement @ 2015-12-14 20:14 UTC (permalink / raw)
  To: Phil Turmel; +Cc: John Stoffel, Mark Knecht, Linux-RAID

On Fri, Dec 11, 2015 at 10:47 PM, Phil Turmel <philip@turmel.org> wrote:
> On 12/11/2015 09:55 PM, Dallas Clement wrote:
>
>> Right.  I understand the fio iodepth is different than the hardware
>> queue depth.  But the fio man page seems to only mention limitation on
>> synchronous operations which mine are not. I'm using direct=1 and
>> sync=0.
>
> You are confusing sequential and synchronous.  The man page says it is
> ineffective for *sequential* operations, especially when direct=1.
>
>> I guess what I would really like to know is how I can achieve at or
>> near 100% utilization on the raid device and its member disks with
>> fio.  Do I need to increase /sys/block/sd*/device/queue_depth and
>> /sys/block/sd*/queue/nr_requests to get more utilization?
>
> I don't know specifically.  It seems to me that increasing queue depth
> adds resiliency in the face of data transfer timing jitter, but at the
> cost of more CPU overhead.
>
> I'm not convinced fio is the right workload, either.  Its options are
> much more flexible for random I/O workloads.  dd isn't perfect either,
> especially when writing zeroes -- it actually reads zeros over and over
> from the special device.  For sequential operations I like dc3dd with
> its pat= wipe= mode.  That'll only generate writes.
>
>>> That's why I suggested blktrace.  Collect a trace while a single dd is
>>> writing to your raw array device.  Compare the large writes submitted to
>>> the md device against the broken down writes submitted to the member
>>> devices.
>>
>> Sounds good.  Will do.  What signs of trouble should I be looking for?
>
> Look for strictly increasing logical block addresses in requests to the
> member devices.  Any disruption in that will break optimum positioning
> for streaming throughput. Per device. Requests to the device have to be
> large enough and paced quickly enough to avoid starving the write head.
>
> Of course, any reads mixed in mean RMW cycles you didn't avoid.  You
> shouldn't have any of those for sequential writes in chunk * (n-2)
> multiples.
>
> I know it's a bit hand-wavy, but you have more hardware to play with
> than I do :-)
>
> Phil

Hi Phil,  I ran blktrace while writing with dd to a RAID 5 device with
12 disks.  My chunk size is 128K.  So I set my block size to 128K *
(12-2) = 1280k.   Here is the dd command I ran.

# /usr/local/bin/dd if=/dev/zero of=/dev/md10 bs=1280k count=1000 oflag=direct

> Look for strictly increasing logical block addresses in requests to the
> member devices.  Any disruption in that will break optimum positioning
> for streaming throughput. Per device. Requests to the device have to be
> large enough and paced quickly enough to avoid starving the write head.

I just ran blktrace and then blkparse after the write finished.  I'm
new to blktrace so not really sure what I'm looking at.  I wasn't able
to see the writes to individual disks.

> Of course, any reads mixed in mean RMW cycles you didn't avoid.  You
> shouldn't have any of those for sequential writes in chunk * (n-2)
> multiples.

I did see lots of rmw's which I am assuming I should not be seeing if
everything is correctly aligned!

  9,10   1        0    15.016034153     0  m   N raid5 rmw 1536 5
  9,10   1        0    15.016039816     0  m   N raid5 rmw 1544 5
  9,10   1        0    15.016042200     0  m   N raid5 rmw 1552 5
  9,10   1        0    15.016044241     0  m   N raid5 rmw 1560 5
  9,10   1        0    15.016046200     0  m   N raid5 rmw 1568 5
  9,10   1        0    15.016048096     0  m   N raid5 rmw 1576 5
  9,10   1        0    15.016049977     0  m   N raid5 rmw 1584 5
  9,10   1        0    15.016051851     0  m   N raid5 rmw 1592 5
  9,10   1        0    15.016054075     0  m   N raid5 rmw 1600 5
  9,10   1        0    15.016056042     0  m   N raid5 rmw 1608 5
  9,10   1        0    15.016057916     0  m   N raid5 rmw 1616 5
  9,10   1        0    15.016059809     0  m   N raid5 rmw 1624 5
  9,10   1        0    15.016061670     0  m   N raid5 rmw 1632 5
  9,10   1        0    15.016063578     0  m   N raid5 rmw 1640 5
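
To tally these over the whole run, something like this should work (the
basename is whatever was passed to blktrace -o; mine here is just
illustrative):

# blkparse -i md10_trace | grep -c 'raid5 rmw'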

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
       [not found]                                       ` <CAK2H+edazVORrVovWDeTA8DmqUL+5HRH-AcRwg8KkMas=o+Cog@mail.gmail.com>
@ 2015-12-14 20:55                                         ` Dallas Clement
       [not found]                                           ` <CAK2H+ed-3Z8SR20t8rpt3Fb48c3X2Jft=qZoiY9emC2nQww1xQ@mail.gmail.com>
  0 siblings, 1 reply; 60+ messages in thread
From: Dallas Clement @ 2015-12-14 20:55 UTC (permalink / raw)
  To: Mark Knecht; +Cc: Phil Turmel, John Stoffel, Linux-RAID

On Mon, Dec 14, 2015 at 2:40 PM, Mark Knecht <markknecht@gmail.com> wrote:
>
>
> On Mon, Dec 14, 2015 at 12:14 PM, Dallas Clement
> <dallas.a.clement@gmail.com> wrote:
>>
>> <SNIP>
>>
>> Hi Phil,  I ran blktrace while writing with dd to a RAID 5 device with
>> 12 disks.  My chunk size is 128K.  So I set my block size to 128K *
>> (12-2) = 1280k.   Here is the dd command I ran.
>
> Just curious but for my own knowledge if it's RAID5 why is it 12-2?
>
> - Mark

> Just curious but for my own knowledge if it's RAID5 why is it 12-2?

Shouldn't be.  It should have been 12-1 or writing 1408k.  Boy do I
feel dumb.  Anyhow, when writing this value, no more RMWs.   Yay!
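
For the record, the full-stripe data size with a 128K chunk and 12
drives works out to:

  RAID 5: 128K * (12 - 1) = 1408K
  RAID 6: 128K * (12 - 2) = 1280K

so the 1280k runs were full-stripe aligned for RAID 6 geometry, not RAID 5.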

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
       [not found]                                           ` <CAK2H+ed-3Z8SR20t8rpt3Fb48c3X2Jft=qZoiY9emC2nQww1xQ@mail.gmail.com>
@ 2015-12-14 21:20                                             ` Dallas Clement
  2015-12-14 22:05                                               ` Dallas Clement
  0 siblings, 1 reply; 60+ messages in thread
From: Dallas Clement @ 2015-12-14 21:20 UTC (permalink / raw)
  To: Mark Knecht; +Cc: Phil Turmel, John Stoffel, Linux-RAID

On Mon, Dec 14, 2015 at 3:02 PM, Mark Knecht <markknecht@gmail.com> wrote:
>
>
> On Mon, Dec 14, 2015 at 12:55 PM, Dallas Clement
> <dallas.a.clement@gmail.com> wrote:
>>
>> On Mon, Dec 14, 2015 at 2:40 PM, Mark Knecht <markknecht@gmail.com> wrote:
>> >
>> >
>> > On Mon, Dec 14, 2015 at 12:14 PM, Dallas Clement
>> > <dallas.a.clement@gmail.com> wrote:
>> >>
>> >> <SNIP>
>> >>
>> >> Hi Phil,  I ran blktrace while writing with dd to a RAID 5 device with
>> >> 12 disks.  My chunk size is 128K.  So I set my block size to 128K *
>> >> (12-2) = 1280k.   Here is the dd command I ran.
>> >
>> > Just curious but for my own knowledge if it's RAID5 why is it 12-2?
>> >
>> > - Mark
>>
>> > Just curious but for my own knowledge if it's RAID5 why is it 12-2?
>>
>> Shouldn't be.  It should have been 12-1 or writing 1408k.  Boy do I
>> feel dumb.  Anyhow, when writing this value, no more RMWs.   Yay!
>
> I wasn't going to be so bold as to suggest the RMW's would go away but I'm
> glad they did.
>
> So, now you can presumably gather new data looking at speed and post that,
> correct?
>
> Cheers,
> Mark

Hmm, I think I may have spoken too soon.  I did a speed test using fio
this time, same bs=1408k.  I see lots of RMWs in the trace this time.
I did another larger dd transfer too, and I see some RMWs but not very
many - maybe 4 or 5 for a 20GB transfer.

It looks like the LBAs are increasing for the writes to the disks.

  9,10   2     2816     0.737523948 27410  Q  WS 965888 + 256 [dd]
  9,10   2     2817     0.737620583 27410  Q  WS 966144 + 256 [dd]
  9,10   2     2818     0.737630651 27410  Q  WS 966400 + 256 [dd]
  9,10   2     2819     0.737641625 27410  Q  WS 966656 + 256 [dd]
  9,10   2     2820     0.737651603 27410  Q  WS 966912 + 256 [dd]
  9,10   2     2821     0.737662735 27410  Q  WS 967168 + 256 [dd]
  9,10   2     2822     0.737672709 27410  Q  WS 967424 + 256 [dd]
  9,10   2     2823     0.737683881 27410  Q  WS 967680 + 256 [dd]
  9,10   2     2824     0.737693896 27410  Q  WS 967936 + 256 [dd]
  9,10   2     2825     0.737704484 27410  Q  WS 968192 + 256 [dd]
  9,10   2     2826     0.737714348 27410  Q  WS 968448 + 256 [dd]

The dd transfers do seem faster when using bs=1408k.  But need to
collect some more data.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-14 21:20                                             ` Dallas Clement
@ 2015-12-14 22:05                                               ` Dallas Clement
  2015-12-14 22:31                                                 ` Tommy Apel
       [not found]                                                 ` <CAK2H+ecMvDLdYLhMtMQbP7Ygw-VohG7LGZ2n7H+LAXQ1waJK3A@mail.gmail.com>
  0 siblings, 2 replies; 60+ messages in thread
From: Dallas Clement @ 2015-12-14 22:05 UTC (permalink / raw)
  To: Mark Knecht; +Cc: Phil Turmel, John Stoffel, Linux-RAID

On Mon, Dec 14, 2015 at 3:20 PM, Dallas Clement
<dallas.a.clement@gmail.com> wrote:
> On Mon, Dec 14, 2015 at 3:02 PM, Mark Knecht <markknecht@gmail.com> wrote:
>>
>>
>> On Mon, Dec 14, 2015 at 12:55 PM, Dallas Clement
>> <dallas.a.clement@gmail.com> wrote:
>>>
>>> On Mon, Dec 14, 2015 at 2:40 PM, Mark Knecht <markknecht@gmail.com> wrote:
>>> >
>>> >
>>> > On Mon, Dec 14, 2015 at 12:14 PM, Dallas Clement
>>> > <dallas.a.clement@gmail.com> wrote:
>>> >>
>>> >> <SNIP>
>>> >>
>>> >> Hi Phil,  I ran blktrace while writing with dd to a RAID 5 device with
>>> >> 12 disks.  My chunk size is 128K.  So I set my block size to 128K *
>>> >> (12-2) = 1280k.   Here is the dd command I ran.
>>> >
>>> > Just curious but for my own knowledge if it's RAID5 why is it 12-2?
>>> >
>>> > - Mark
>>>
>>> > Just curious but for my own knowledge if it's RAID5 why is it 12-2?
>>>
>>> Shouldn't be.  It should have been 12-1 or writing 1408k.  Boy do I
>>> feel dumb.  Anyhow, when writing this value, no more RMWs.   Yay!
>>
>> I wasn't going to be so bold as to suggest the RMW's would go away but I'm
>> glad they did.
>>
>> So, now you can presumably gather new data looking at speed and post that,
>> correct?
>>
>> Cheers,
>> Mark
>
> Hmm, I think I may have spoken too soon.  I did a speed test using fio
> this time, same bs=1408k.  I see lots of RMWs in the trace this time.
> I did another larger dd transfer too, and I see some RMWs but not very
> many - maybe 4 or 5 for a 20GB transfer.
>
> It looks like the LBAs are increasing for the writes to the disks.
>
>   9,10   2     2816     0.737523948 27410  Q  WS 965888 + 256 [dd]
>   9,10   2     2817     0.737620583 27410  Q  WS 966144 + 256 [dd]
>   9,10   2     2818     0.737630651 27410  Q  WS 966400 + 256 [dd]
>   9,10   2     2819     0.737641625 27410  Q  WS 966656 + 256 [dd]
>   9,10   2     2820     0.737651603 27410  Q  WS 966912 + 256 [dd]
>   9,10   2     2821     0.737662735 27410  Q  WS 967168 + 256 [dd]
>   9,10   2     2822     0.737672709 27410  Q  WS 967424 + 256 [dd]
>   9,10   2     2823     0.737683881 27410  Q  WS 967680 + 256 [dd]
>   9,10   2     2824     0.737693896 27410  Q  WS 967936 + 256 [dd]
>   9,10   2     2825     0.737704484 27410  Q  WS 968192 + 256 [dd]
>   9,10   2     2826     0.737714348 27410  Q  WS 968448 + 256 [dd]
>
> The dd transfers do seem faster when using bs=1408k.  But need to
> collect some more data.

The speeds I am seeing with dd are definitely faster.  I was getting
about 333 MB/s when writing bs=2048k which was not chunk aligned.
When writing bs=1408k I am getting at least 750 MB/s.  Reducing the
RMWs certainly did help.  But this write speed is still far short of
the (12 - 1) * 150 MB/s = 1650 MB/s I am expecting for minimal to no
RMWs.  I probably am not able to saturate the RAID device with dd
though.
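
For reference, the chunk-aligned runs are just the earlier dd
invocation with the block size changed, e.g.:

# /usr/local/bin/dd if=/dev/zero of=/dev/md10 bs=1408k count=1000 oflag=direct

with the count scaled up for the larger transfers.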

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-14 22:05                                               ` Dallas Clement
@ 2015-12-14 22:31                                                 ` Tommy Apel
       [not found]                                                 ` <CAK2H+ecMvDLdYLhMtMQbP7Ygw-VohG7LGZ2n7H+LAXQ1waJK3A@mail.gmail.com>
  1 sibling, 0 replies; 60+ messages in thread
From: Tommy Apel @ 2015-12-14 22:31 UTC (permalink / raw)
  To: Dallas Clement; +Cc: Linux-RAID

On Mon, 2015-12-14 at 16:05 -0600, Dallas Clement wrote:
> On Mon, Dec 14, 2015 at 3:20 PM, Dallas Clement
> <dallas.a.clement@gmail.com> wrote:
> > On Mon, Dec 14, 2015 at 3:02 PM, Mark Knecht <markknecht@gmail.com> wrote:
> > > 
> > > 
> > > On Mon, Dec 14, 2015 at 12:55 PM, Dallas Clement
> > > <dallas.a.clement@gmail.com> wrote:
> > > > 
> > > > On Mon, Dec 14, 2015 at 2:40 PM, Mark Knecht <markknecht@gmail.com> wrote:
> > > > > 
> > > > > 
> > > > > On Mon, Dec 14, 2015 at 12:14 PM, Dallas Clement
> > > > > <dallas.a.clement@gmail.com> wrote:
> > > > > > 
> > > > > > <SNIP>
> > > > > > 
> > > > > > Hi Phil,  I ran blktrace while writing with dd to a RAID 5 device with
> > > > > > 12 disks.  My chunk size is 128K.  So I set my block size to 128K *
> > > > > > (12-2) = 1280k.   Here is the dd command I ran.
> > > > > 
> > > > > Just curious but for my own knowledge if it's RAID5 why is it 12-2?
> > > > > 
> > > > > - Mark
> > > > 
> > > > > Just curious but for my own knowledge if it's RAID5 why is it 12-2?
> > > > 
> > > > Shouldn't be.  It should have been 12-1 or writing 1408k.  Boy do I
> > > > feel dumb.  Anyhow, when writing this value, no more RMWs.   Yay!
> > > 
> > > I wasn't going to be so bold as to suggest the RMW's would go away but I'm
> > > glad they did.
> > > 
> > > So, now you can presumably gather new data looking at speed and post that,
> > > correct?
> > > 
> > > Cheers,
> > > Mark
> > 
> > Hmm, I think I may have spoken too soon.  I did a speed test using fio
> > this time, same bs=1408k.  I see lots of RMWs in the trace this time.
> > I did another larger dd transfer too, and I see some RMWs but not very
> > many - maybe 4 or 5 for a 20GB transfer.
> > 
> > It looks like the LBAs are increasing for the writes to the disks.
> > 
> >   9,10   2     2816     0.737523948 27410  Q  WS 965888 + 256 [dd]
> >   9,10   2     2817     0.737620583 27410  Q  WS 966144 + 256 [dd]
> >   9,10   2     2818     0.737630651 27410  Q  WS 966400 + 256 [dd]
> >   9,10   2     2819     0.737641625 27410  Q  WS 966656 + 256 [dd]
> >   9,10   2     2820     0.737651603 27410  Q  WS 966912 + 256 [dd]
> >   9,10   2     2821     0.737662735 27410  Q  WS 967168 + 256 [dd]
> >   9,10   2     2822     0.737672709 27410  Q  WS 967424 + 256 [dd]
> >   9,10   2     2823     0.737683881 27410  Q  WS 967680 + 256 [dd]
> >   9,10   2     2824     0.737693896 27410  Q  WS 967936 + 256 [dd]
> >   9,10   2     2825     0.737704484 27410  Q  WS 968192 + 256 [dd]
> >   9,10   2     2826     0.737714348 27410  Q  WS 968448 + 256 [dd]
> > 
> > The dd transfers do seem faster when using bs=1408k.  But need to
> > collect some more data.
> 
> The speeds I am seeing with dd are definitely faster.  I was getting
> about 333 MB/s when writing bs=2048k which was not chunk aligned.
> When writing bs=1408k I am getting at least 750 MB/s.  Reducing the
> RMWs certainly did help.  But this write speed is still far short of
> the (12 - 1) * 150 MB/s = 1650 MB/s I am expecting for minimal to no
> RMWs.  I probably am not able to saturate the RAID device with dd
> though.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

750MB/s ~ 6000Mbit/s which is most likely the limitation of your expander chip

-- 
/Tommy


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
       [not found]                                                 ` <CAK2H+ecMvDLdYLhMtMQbP7Ygw-VohG7LGZ2n7H+LAXQ1waJK3A@mail.gmail.com>
@ 2015-12-14 23:25                                                   ` Dallas Clement
  2015-12-15  2:36                                                     ` Dallas Clement
  0 siblings, 1 reply; 60+ messages in thread
From: Dallas Clement @ 2015-12-14 23:25 UTC (permalink / raw)
  To: Mark Knecht; +Cc: Phil Turmel, John Stoffel, Linux-RAID

On Mon, Dec 14, 2015 at 4:17 PM, Mark Knecht <markknecht@gmail.com> wrote:
>
>
> On Mon, Dec 14, 2015 at 2:05 PM, Dallas Clement <dallas.a.clement@gmail.com>
> wrote:
>>
>> <SNIP>
>>
>> The speeds I am seeing with dd are definitely faster.  I was getting
>> about 333 MB/s when writing bs=2048k which was not chunk aligned.
>> When writing bs=1408k I am getting at least 750 MB/s.  Reducing the
>> RMWs certainly did help.  But this write speed is still far short of
>> the (12 - 1) * 150 MB/s = 1650 MB/s I am expecting for minimal to no
>> RMWs.  I probably am not able to saturate the RAID device with dd
>> though.
>
> But then you get back to all the questions about where you are on the drives
> physically (inside vs outside) and all the potential bottlenecks in the
> hardware. It
> might not be 'far short' if you're on the inside of the drive.
>
> I have no idea about what vintage Cougar Point machine you have but there
> are some reports about bugs that caused issues with a couple of the
> higher hard drive interface ports on some earlier machines. Your nature
> seems to be to generally build the largest configurations you can but Phil
> suggested earlier and it might be appropriate here to disconnect a bunch of
> drives and then do 1 drive, 2 drives, 3 drives and measure speeds. I seem
> to remember you saying something about it working well until you added the
> last drive so if you go this way I'd suggest physically disconnecting drives
> you are not testing, booting up, testing, powering down, adding another
> drive, etc.

Hi Mark

> But then you get back to all the questions about where you are on the drives
> physically (inside vs outside) and all the potential bottlenecks in the
> hardware. It
> might not be 'far short' if you're on the inside of the drive.

Perhaps.  But I was getting about 95 MB/s on the inside when I
measured earlier.  Even with this number the write speed for RAID 5
should be around 11 * 95 = 1045 MB/s.  Also, when I was running fio on
individual disks concurrently, adding one in at a time, iostat was
showing wMB/s to be around 160-170 MB/s.

> I have no idea about what vintage Cougar Point machine you have but there
> are some reports about bugs that caused issues with a couple of the
> higher hard drive interface ports on some earlier machines.

Hmm, I will need to look into that some more.

> I'd suggest physically disconnecting drives you are not testing, booting up, testing, powering down, adding another drive, etc.

Yes, I haven't tried that yet with RAID 5 or 6.  I'll give it a shot
maybe starting with 4 disks, adding one at a time and measure the
write speed.
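
Something along these lines for each step, I expect (device names and
count illustrative; chunk size as before):

# mdadm --create /dev/md10 --level=5 --chunk=128 --raid-devices=4 /dev/sd[bcde]

then stop the array and re-create it with one more member for the next
data point.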

On another point, this blktrace program sure is neat!  A wealth of info here.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-14 23:25                                                   ` Dallas Clement
@ 2015-12-15  2:36                                                     ` Dallas Clement
  2015-12-15 13:53                                                       ` Phil Turmel
                                                                         ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Dallas Clement @ 2015-12-15  2:36 UTC (permalink / raw)
  To: Mark Knecht; +Cc: Phil Turmel, John Stoffel, Linux-RAID

On Mon, Dec 14, 2015 at 5:25 PM, Dallas Clement
<dallas.a.clement@gmail.com> wrote:
> On Mon, Dec 14, 2015 at 4:17 PM, Mark Knecht <markknecht@gmail.com> wrote:
>>
>>
>> On Mon, Dec 14, 2015 at 2:05 PM, Dallas Clement <dallas.a.clement@gmail.com>
>> wrote:
>>>
>>> <SNIP>
>>>
>>> The speeds I am seeing with dd are definitely faster.  I was getting
>>> about 333 MB/s when writing bs=2048k which was not chunk aligned.
>>> When writing bs=1408k I am getting at least 750 MB/s.  Reducing the
>>> RMWs certainly did help.  But this write speed is still far short of
>>> the (12 - 1) * 150 MB/s = 1650 MB/s I am expecting for minimal to no
>>> RMWs.  I probably am not able to saturate the RAID device with dd
>>> though.
>>
>> But then you get back to all the questions about where you are on the drives
>> physically (inside vs outside) and all the potential bottlenecks in the
>> hardware. It
>> might not be 'far short' if you're on the inside of the drive.
>>
>> I have no idea about what vintage Cougar Point machine you have but there
>> are some reports about bugs that caused issues with a couple of the
>> higher hard drive interface ports on some earlier machines. Your nature
>> seems to be to generally build the largest configurations you can but Phil
>> suggested earlier and it might be appropriate here to disconnect a bunch of
>> drives and then do 1 drive, 2 drives, 3 drives and measure speeds. I seem
>> to remember you saying something about it working well until you added the
>> last drive so if you go this way I'd suggest physically disconnecting drives
>> you are not testing, booting up, testing, powering down, adding another
>> drive, etc.
>
> Hi Mark
>
>> But then you get back to all the questions about where you are on the drives
>> physically (inside vs outside) and all the potential bottlenecks in the
>> hardware. It
>> might not be 'far short' if you're on the inside of the drive.
>
> Perhaps.  But I was getting about 95 MB/s on the inside when I
> measured earlier.  Even with this number the write speed for RAID 5
> should be around 11 * 95 = 1045 MB/s.  Also, when I was running fio on
> individual disks concurrently, adding one in at a time, iostat was
> showing wMB/s to be around 160-170 MB/s.
>
>> I have no idea about what vintage Cougar Point machine you have but there
>> are some reports about bugs that caused issues with a couple of the
>> higher hard drive interface ports on some earlier machines.
>
> Hmm, I will need to look into that some more.
>
>> I'd suggest physically disconnecting drives you are not testing, booting up, testing, powering down, adding another drive, etc.
>
> Yes, I haven't tried that yet with RAID 5 or 6.  I'll give it a shot
> maybe starting with 4 disks, adding one at a time and measure the
> write speed.
>
> On another point, this blktrace program sure is neat!  A wealth of info here.

Hi Everyone.  I have some very interesting news to report.  I did a
little bit more playing around with fio, doing sequential writes to a
RAID 5 device with all 12 disks.  I kept the block size at the 128K
chunk aligned value of 1408K.  But this time I varied the queue depth.
These are my results for writing a 10 GB of data:

iodepth=1 => 642 MB/s, # of RMWs = 11

iodepth=4 => 1108 MB/s, # of RMWs = 6

iodepth=8 => 895 MB/s, # of RMWs = 7

iodepth=16 => 855 MB/s, # of RMWs = 11

iodepth=32 => 936 MB/s, # of RMWs = 11

iodepth=64 => 551 MB/s, # of RMWs = 5606

iodepth=128 => 554 MB/s, # of RMWs = 6333

As you can see, something goes terribly wrong with async i/o with
iodepth >= 64.  Btw, not to be contentious Phil, I have checked
multiple fio man pages and they clearly indicate that iodepth is for
async i/o which this is (libaio).  I don't see any mention of
sequential writes being prohibited with async i/o.  See
https://github.com/axboe/fio/blob/master/HOWTO.  However, maybe I'm
missing something and it sure looks from these results that there may
be a connection.

This is my fio job config:

[job]
ioengine=libaio
iodepth=128
prio=0
rw=write
bs=1408k
filename=/dev/md10
numjobs=1
size=10g
direct=1
invalidate=1

Incidentally, the very best write speed here (1108 MB/s with
iodepth=4) comes out to about 100 MB/s per disk, which is pretty close
to the worst case inner disk speed of 95.5 MB/s I had recorded
earlier.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-15  2:36                                                     ` Dallas Clement
@ 2015-12-15 13:53                                                       ` Phil Turmel
  2015-12-15 14:09                                                       ` Robert Kierski
  2015-12-15 15:14                                                       ` John Stoffel
  2 siblings, 0 replies; 60+ messages in thread
From: Phil Turmel @ 2015-12-15 13:53 UTC (permalink / raw)
  To: Dallas Clement, Mark Knecht; +Cc: John Stoffel, Linux-RAID

Hi Dallas,

On December 14, 2015 9:36:05 PM EST, Dallas Clement

>Hi Everyone. I have some very interesting news to report. I did a
>little bit more playing around with fio, doing sequential writes to a
>RAID 5 device with all 12 disks. I kept the block size at the 128K
>chunk aligned value of 1408K. But this time I varied the queue depth.
>These are my results for writing a 10 GB of data:
>
>iodepth=1 => 642 MB/s, # of RMWs = 11
>
>iodepth=4 => 1108 MB/s, # of RMWs = 6
>
>iodepth=8 => 895 MB/s, # of RMWs = 7
>
>iodepth=16 => 855 MB/s, # of RMWs = 11
>
>iodepth=32 => 936 MB/s, # of RMWs = 11
>
>iodepth=64 => 551 MB/s, # of RMWs = 5606
>
>iodepth=128 => 554 MB/s, # of RMWs = 6333
>
>As you can see, something goes terribly wrong with async i/o with
>iodepth >= 64. Btw, not to be contentious Phil, I have checked
>multiple fio man pages and they clearly indicate that iodepth is for
>async i/o which this is (libaio). I don't see any mention of
>sequential writes being prohibited with async i/o. See
>https://github.com/axboe/fio/blob/master/HOWTO.

Hmmm. I misread that part. But do note the comment that you might not
achieve as many in-flight I/Os as you expect.

>However, maybe I'm
>missing something and it sure looks from these results that there may
>be a connection.
>
>This is my fio job config:
>
>[job]
>ioengine=libaio
>iodepth=128
>prio=0
>rw=write
>bs=1408k
>filename=/dev/md10
>numjobs=1
>size=10g
>direct=1
>invalidate=1
>
>Incidentally, the very best write speed here (1108 MB/s with
>iodepth=4) comes out to about 100 MB/s per disk, which is pretty close
>to the worst case inner disk speed of 95.5 MB/s I had recorded
>earlier.

Very interesting indeed. I wonder if the extra I/O in flight at high
depths is consuming all available stripe cache space, possibly not
consistently. I'd raise and lower that in various combinations with
various combinations of iodepth.  Running out of stripe cache will cause
premature RMWs.
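
E.g. (md10 standing in for your array; the value written is just an
illustrative bump):

# cat /sys/block/md10/md/stripe_cache_size
# echo 16384 > /sys/block/md10/md/stripe_cache_size

If I recall correctly, memory use is roughly stripe_cache_size * 4K *
number of member devices, so watch that when raising it.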

Regards,

Phil

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: best base / worst case RAID 5,6 write speeds
  2015-12-15  2:36                                                     ` Dallas Clement
  2015-12-15 13:53                                                       ` Phil Turmel
@ 2015-12-15 14:09                                                       ` Robert Kierski
  2015-12-15 15:14                                                       ` John Stoffel
  2 siblings, 0 replies; 60+ messages in thread
From: Robert Kierski @ 2015-12-15 14:09 UTC (permalink / raw)
  To: Dallas Clement, Mark Knecht; +Cc: Phil Turmel, John Stoffel, Linux-RAID

Dallas,

The threshold between iodepth=32 and iodepth=64 might be caused by exceeding the stripe cache size.

In my opinion, if you're writing chunk-aligned data, you shouldn't be doing any RMW's.  That you're doing small numbers of RMW's with small iodepth indicates that you're using the stripe cache.

I've tried similar tests where I've set the stripe cache size to 17 (the smallest you can set it to), and then did perfectly aligned IO's.  The results showed that I was doing massive amounts of RMW, and my performance was horrible.

Your FIO job file looks ordinary (and by that I mean "good").  While I wouldn't have picked bs=1408k myself, since it's aligned I wouldn't expect it to be a cause of problems.

You should be able to set iodepth to whatever you want.  You could set it to a billion.  The OS should block additional requests until the underlying device's queue has available space.  Iodepth shouldn't affect RMW when you're doing aligned writes.  In my opinion, increasing iodepth should only help... not hurt.  If it leads to memory pressure, that could be an issue, but it shouldn't increase the amount of RMW.

Bob Kierski
Senior Storage Performance Engineer
Cray Inc.
380 Jackson Street
Suite 210
St. Paul, MN 55101
Tele: 651-967-9590
Fax:  651-605-9001
Cell: 651-890-7461


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-15  2:36                                                     ` Dallas Clement
  2015-12-15 13:53                                                       ` Phil Turmel
  2015-12-15 14:09                                                       ` Robert Kierski
@ 2015-12-15 15:14                                                       ` John Stoffel
  2015-12-15 17:30                                                         ` Dallas Clement
  2 siblings, 1 reply; 60+ messages in thread
From: John Stoffel @ 2015-12-15 15:14 UTC (permalink / raw)
  To: Dallas Clement; +Cc: Mark Knecht, Phil Turmel, John Stoffel, Linux-RAID


Dallas,

I suspect you've hit a known problem-ish area with Linux disk io,
which is that big queue depths aren't optimal.  They're better for
systems where you're talking to a big backend disk array with lots of
cache/memory which is all battery backed, and which can acknowledge
those writes immediately, but then retire them to disk in a more
optimal manner.

So using a queue depth of 4, which is per-device, means that you can
have up to 48 writes outstanding at a time.  Just doubling that to 8,
means you can have 96 writes outstanding, which takes up memory
buffers on the system, etc.

As you can see, it peaks at a queue depth of 4, and then tends
downward before falling off a cliff.  So now what I'd do is keep the
queue depth at 4, but vary the block size and other parameters and see
how things change there.

Now this is all fun, but I also think you need to back up and re-think
about the big picture.  What workloads are you looking to optimize
for?  Lots of small file writes?  Lots of big file writes?  Random
reads of big/small files?

Are you looking for backing stores for VMs?

Have you looked into battery backed RAID cards?  They used to be a lot
more common, but these days CPUs are more than fast enough, and JBOD
works really well with more flexibility and less chance of your data
getting lost due to vendor lock-in.

Another option, if you're looking for performance might be using
lvmcache with a pair of mirrored SSDs, and if you KNOW you have UPS
support on the system, you could change the cache policy from
writeback (both SSDs and backing store writes need to complete) to
write through (SSDs writes done, backing store later...) so that you
get the most speed.

I've just recently done this setup on my home machine (not nearly as
beefy as this) and my off the cuff feeling is that it's a nice
speedup.
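
Roughly like this, going from memory (volume group and LV names made
up; check the lvmcache man pages for the details):

# lvconvert --type cache --cachepool vg0/cachepool --cachemode writethrough vg0/datalv

and later, once you trust the UPS:

# lvchange --cachemode writeback vg0/datalv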

But back to the task at hand, what is the goal here?  To find the
sweet spot for your hardware?  For fun?  I'm all up for fun, this is a
great discussion.

It's too bad there's no auto-tuning script for testing a setup and
running fio to get test results, and then having the next knob tweaked
and tested in an automated way.



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-15 15:14                                                       ` John Stoffel
@ 2015-12-15 17:30                                                         ` Dallas Clement
  2015-12-15 19:22                                                           ` Phil Turmel
  2015-12-15 21:54                                                           ` John Stoffel
  0 siblings, 2 replies; 60+ messages in thread
From: Dallas Clement @ 2015-12-15 17:30 UTC (permalink / raw)
  To: John Stoffel; +Cc: Mark Knecht, Phil Turmel, Linux-RAID

Thanks guys for all the ideas and help.

Phil,

> Very interesting indeed. I wonder if the extra I/O in flight at high
> depths is consuming all available stripe cache space, possibly not
> consistently. I'd raise and lower that in various combinations with
> various combinations of iodepth.  Running out of stripe cache will cause
> premature RMWs.

Okay, I'll play with that today.  I have to confess I'm not sure that
I completely understand how the stripe cache works.  I think the idea
is to batch I/Os into a complete stripe if possible and write out to
the disks all in one go to avoid RMWs.  Other than alignment issues,
I'm unclear on what triggers RMWs.  It seems, as Robert mentioned,
that if the I/O block size is stripe aligned, there should never be
RMWs.

My stripe cache is 8192 btw.

John,

> I suspect you've hit a known problem-ish area with Linux disk io, which is that big queue depths aren't optimal.

Yes, certainly looks that way.  But maybe as Phil indicated I might be
exceeding my stripe cache.  I am still surprised that there are so
many RMWs even if the stripe cache has been exhausted.

> As you can see, it peaks at a queue depth of 4, and then tends
> downward before falling off a cliff.  So now what I'd do is keep the
> queue depth at 4, but vary the block size and other parameters and see
> how things change there.

Why do you think there is a gradual drop off after queue depth of 4
and before it falls off the cliff?

> Now this is all fun, but I also think you need to backup and re-think
> about the big picture.  What workloads are you looking to optimize
> for?  Lots of small file writes?  Lots of big file writes?  Random
> reads of big/small files?

> Are you looking for backing stores for VMs?

I wish this were for fun! ;)  Although this has been a fun discussion.
I've learned a ton.  This effort is for work though.  I'd be all over
the SSDs and caching otherwise.  I'm trying to characterize and then
squeeze all of the performance I can out of a legacy NAS product.  I
am constrained by the existing hardware.  Unfortunately I do not have
the option of using SSDs or hardware RAID controllers.  I have to rely
completely on Linux RAID.

I also need to optimize for large sequential writes (streaming video,
audio, large file transfers), iSCSI (mostly used for hosting VMs), and
random I/O (small and big files) as you would expect with a NAS.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-15 17:30                                                         ` Dallas Clement
@ 2015-12-15 19:22                                                           ` Phil Turmel
  2015-12-15 19:44                                                             ` Dallas Clement
  2015-12-15 21:54                                                           ` John Stoffel
  1 sibling, 1 reply; 60+ messages in thread
From: Phil Turmel @ 2015-12-15 19:22 UTC (permalink / raw)
  To: Dallas Clement, John Stoffel; +Cc: Mark Knecht, Linux-RAID

Hi Dallas,

On 12/15/2015 12:30 PM, Dallas Clement wrote:
> Thanks guys for all the ideas and help.
> 
> Phil,
> 
>> Very interesting indeed. I wonder if the extra I/O in flight at high
>> depths is consuming all available stripe cache space, possibly not
>> consistently. I'd raise and lower that in various combinations with
>> various combinations of iodepth.  Running out of stripe cache will cause
>> premature RMWs.
> 
> Okay, I'll play with that today.  I have to confess I'm not sure that
> I completely understand how the stripe cache works.  I think the idea
> is to batch I/Os into a complete stripe if possible and write out to
> the disks all in one go to avoid RMWs.  Other than alignment issues,
> I'm unclear on what triggers RMWs.  It seems like as Robert mentioned
> that if the I/Os block size is stripe aligned, there should never be
> RMWs.
>
> My stripe cache is 8192 btw.
>

Stripe cache is the kernel's workspace to compute parity or to recover
data from parity.  It works on 4k blocks.  Per "man md", the units are
number of such blocks per device.  *The blocks in each cache stripe are
separated from each other on disk by the chunk size*.

Let's examine some scenarios for your 128k chunk size, 12 devices.  You
have 8192 cache stripes of 12 blocks each:

1) Random write of 16k.  4 stripes will be allocated from the cache for
*all* of the devices, and filled for the devices written.  The raid5
state machine lets them sit briefly for a chance for more writes to the
other blocks in each stripe.

1a) If none come in, MD will request a read of the old data blocks and
the old parities.  When those arrive, it'll compute the new parities and
write both parities and new data blocks.  Total I/O: 32k read, 32k write.

1b) If other random writes come in for those stripes, chunk size spaced,
MD will wait a bit more.  Then it will read in any blocks that weren't
written, compute parity, and write all the new data and parity.  Total
I/O: 16k * n, possibly some reads, the rest writes.

2) Sequential write of stripe-aligned 1408k.  The first 128k allocates
32 cache stripes (128k / 4k) and fills their first block.  The next
128k fills the second block of each cache stripe.  And so on, filling
all the data blocks in the cache stripes.  MD shortly notices a full
cache stripe write on each, so it just computes the parities and
submits all of those writes.

3) Sequential write of 256k, aligned or not.  As above, but you only
fill two blocks in each cache stripe.  MD then reads 1152k, computes
parity, and writes 384k.

4) Multiple back-to-back writes of 1408k aligned.  First grabs 32 cache
stripes and shortly queues all of those writes.  Next grabs another 32
cache stripes and queues more writes.  And then another 32 cache stripes
and writes.  Underlying layer, as its queue grows, notices the adjacency
of chunk writes from multiple top-level writes and starts merging.
Stripe caches are still held, though, until each write is completed.  If
256 top-level writes are in flight (8192/32), you've exhausted your
stripe cache.  Note that this is writes in flight in your application
*and* writes in flight from anything else.  Keep in mind that merging
might actually raise the completion latency for the earlier writes.

I'm sure you can come up with more.  The key is that stripe parity
calculations must be performed on blocks separated on disk by the chunk
size.  Really big chunk sizes don't actually help parity raid, since
everything is broken down to 4k for the stripe cache, then re-merged
underneath it.
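
For a rough sense of scale (assuming 4 KiB pages; md0 is a placeholder
device name), the cache footprint and the knob that controls it look
like this:

# stripe cache memory ~= stripe_cache_size * member_devices * 4 KiB
#   8192 * 12 * 4 KiB ~= 384 MiB for your 12-drive array
cat /sys/block/md0/md/stripe_cache_size     # current setting (entries)
cat /sys/block/md0/md/stripe_cache_active   # entries in use right now
echo 4096 > /sys/block/md0/md/stripe_cache_size   # resize at runtime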

> I with this were for fun! ;)  Although this has been a fun discussion.
> I've learned a ton.  This effort is for work though.  I'd be all over
> the SSDs and caching otherwise.  I'm trying to characterize and then
> squeeze all of the performance I can out of a legacy NAS product.  I
> am constrained by the existing hardware.  Unfortunately I do not have
> the option of using SSDs or hardware RAID controllers.  I have to rely
> completely on Linux RAID.
> 
> I also need to optimize for large sequential writes (streaming video,
> audio, large file transfers), iSCSI (mostly used for hosting VMs), and
> random I/O (small and big files) as you would expect with a NAS.

On spinning rust, once you introduce any random writes, you've
effectively made the entire stack a random workload.  This is true for
all raid levels, but particularly true for parity raid due to the RMW
cycles.  If you really need great sequential performance, you can't
allow the VMs and the databases and small files on the same disks.

That said, I recommend a parity raid chunk size of 16k or 32k for all
workloads.  Greatly improves spatial locality for random writes, reduces
stripe cache hogging for sequential writes, and doesn't hurt sequential
reads too much.

Phil

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-15 19:22                                                           ` Phil Turmel
@ 2015-12-15 19:44                                                             ` Dallas Clement
  2015-12-15 19:52                                                               ` Phil Turmel
  0 siblings, 1 reply; 60+ messages in thread
From: Dallas Clement @ 2015-12-15 19:44 UTC (permalink / raw)
  To: Phil Turmel; +Cc: John Stoffel, Mark Knecht, Linux-RAID

On Tue, Dec 15, 2015 at 1:22 PM, Phil Turmel <philip@turmel.org> wrote:
> Hi Dallas,
>
> On 12/15/2015 12:30 PM, Dallas Clement wrote:
>> Thanks guys for all the ideas and help.
>>
>> Phil,
>>
>>> Very interesting indeed. I wonder if the extra I/O in flight at high
>>> depths is consuming all available stripe cache space, possibly not
>>> consistently. I'd raise and lower that in various combinations with
>>> various combinations of iodepth.  Running out of stripe cache will cause
>>> premature RMWs.
>>
>> Okay, I'll play with that today.  I have to confess I'm not sure that
>> I completely understand how the stripe cache works.  I think the idea
>> is to batch I/Os into a complete stripe if possible and write out to
>> the disks all in one go to avoid RMWs.  Other than alignment issues,
>> I'm unclear on what triggers RMWs.  It seems like as Robert mentioned
>> that if the I/Os block size is stripe aligned, there should never be
>> RMWs.
>>
>> My stripe cache is 8192 btw.
>>
>
> Stripe cache is the kernel's workspace to compute parity or to recover
> data from parity.  It works on 4k blocks.  Per "man md", the units are
> number of such blocks per device.  *The blocks in each cache stripe are
> separated from each other on disk by the chunk size*.
>
> Let's examine some scenarios for your 128k chunk size, 12 devices.  You
> have 8192 cache stripes of 12 blocks each:
>
> 1) Random write of 16k.  4 stripes will be allocated from the cache for
> *all* of the devices, and filled for the devices written.  The raid5
> state machine lets them sit briefly for a chance for more writes to the
> other blocks in each stripe.
>
> 1a) If none come in, MD will request a read of the old data blocks and
> the old parities.  When those arrive, it'll compute the new parities and
> write both parities and new data blocks.  Total I/O: 32k read, 32k write.
>
> 1b) If other random writes come in for those stripes, chunk size spaced,
> MD will wait a bit more.  Then it will read in any blocks that weren't
> written, compute parity, and write all the new data and parity.  Total
> I/O: 16k * n, possibly some reads, the rest writes.
>
> 2) Sequential write of stripe-aligned 1408k.  The first 128k allocates
> 32 cache stripes (128k / 4k) and fills their first block.  The next
> 128k fills the second block of each cache stripe.  And so on, filling
> all the data blocks in the cache stripes.  MD shortly notices a full
> cache stripe write on each, so it just computes the parities and
> submits all of those writes.
>
> 3) Sequential write of 256k, aligned or not.  As above, but you only
> fill two blocks in each cache stripe.  MD then reads 1152k, computes
> parity, and writes 384k.
>
> 4) Multiple back-to-back writes of 1408k aligned.  First grabs 32 cache
> stripes and shortly queues all of those writes.  Next grabs another 32
> cache stripes and queues more writes.  And then another 32 cache stripes
> and writes.  Underlying layer, as its queue grows, notices the adjacency
> of chunk writes from multiple top-level writes and starts merging.
> Stripe caches are still held, though, until each write is completed.  If
> 256 top-level writes are in flight (8192/32), you've exhausted your
> stripe cache.  Note that this is writes in flight in your application
> *and* writes in flight from anything else.  Keep in mind that merging
> might actually raise the completion latency for the earlier writes.
>
> I'm sure you can come up with more.  The key is that stripe parity
> calculations must be performed on blocks separated on disk by the chunk
> size.  Really big chunk sizes don't actually help parity raid, since
> everything is broken down to 4k for the stripe cache, then re-merged
> underneath it.
>
>> I with this were for fun! ;)  Although this has been a fun discussion.
>> I've learned a ton.  This effort is for work though.  I'd be all over
>> the SSDs and caching otherwise.  I'm trying to characterize and then
>> squeeze all of the performance I can out of a legacy NAS product.  I
>> am constrained by the existing hardware.  Unfortunately I do not have
>> the option of using SSDs or hardware RAID controllers.  I have to rely
>> completely on Linux RAID.
>>
>> I also need to optimize for large sequential writes (streaming video,
>> audio, large file transfers), iSCSI (mostly used for hosting VMs), and
>> random I/O (small and big files) as you would expect with a NAS.
>
> On spinning rust, once you introduce any random writes, you've
> effectively made the entire stack a random workload.  This is true for
> all raid levels, but particularly true for parity raid due to the RMW
> cycles.  If you really need great sequential performance, you can't
> allow the VMs and the databases and small files on the same disks.
>
> That said, I recommend a parity raid chunk size of 16k or 32k for all
> workloads.  Greatly improves spatial locality for random writes, reduces
> stripe cache hogging for sequential writes, and doesn't hurt sequential
> reads too much.
>
> Phil

Wow!  Thanks a ton Phil.  This is incredibly helpful!  It looks like I
need to do some experimenting with smaller chunk sizes.  Just one more
question:  what stripe cache size do you recommend for this system?
It has 8 GB of RAM, but can't use all of it for RAID as this NAS needs
to run multiple applications.  I understand that in the >= 4.1 kernels
the stripe cache grows dynamically.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-15 19:44                                                             ` Dallas Clement
@ 2015-12-15 19:52                                                               ` Phil Turmel
  0 siblings, 0 replies; 60+ messages in thread
From: Phil Turmel @ 2015-12-15 19:52 UTC (permalink / raw)
  To: Dallas Clement; +Cc: John Stoffel, Mark Knecht, Linux-RAID

On 12/15/2015 02:44 PM, Dallas Clement wrote:

> Wow!  Thanks a ton Phil.  This is incredibly helpful!  It looks like I
> need to do some experimenting with smaller chunk sizes.  Just one more
> question:  what stripe cache size do you recommend for this system?
> It has 8 GB of RAM, but can't use all of it for RAID as this NAS needs
> to run multiple applications.  I understand that in the >= 4.1 kernels
> the stripe cache grows dynamically.

I don't really know.  I use the default 256, but all of my parity raid
arrays have 16k chunks and are relatively lightly loaded.  Consider
sampling stripe_cache_active once a second for a normal workday on real
workloads to figure out what you need.
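
A one-liner is enough for that kind of sampling (md0 and the log file
name are placeholders):

while sleep 1; do cat /sys/block/md0/md/stripe_cache_active; done >> cache_samples.log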

Phil

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-15 17:30                                                         ` Dallas Clement
  2015-12-15 19:22                                                           ` Phil Turmel
@ 2015-12-15 21:54                                                           ` John Stoffel
  2015-12-15 23:07                                                             ` Dallas Clement
  1 sibling, 1 reply; 60+ messages in thread
From: John Stoffel @ 2015-12-15 21:54 UTC (permalink / raw)
  To: Dallas Clement; +Cc: John Stoffel, Mark Knecht, Phil Turmel, Linux-RAID

>>>>> "Dallas" == Dallas Clement <dallas.a.clement@gmail.com> writes:

Dallas> Thanks guys for all the ideas and help.
Dallas> Phil,

>> Very interesting indeed. I wonder if the extra I/O in flight at high
>> depths is consuming all available stripe cache space, possibly not
>> consistently. I'd raise and lower that in various combinations with
>> various combinations of iodepth.  Running out of stripe cache will cause
>> premature RMWs.

Dallas> Okay, I'll play with that today.  I have to confess I'm not
Dallas> sure that I completely understand how the stripe cache works.
Dallas> I think the idea is to batch I/Os into a complete stripe if
Dallas> possible and write out to the disks all in one go to avoid
Dallas> RMWs.  Other than alignment issues, I'm unclear on what
Dallas> triggers RMWs.  It seems like as Robert mentioned that if the
Dallas> I/Os block size is stripe aligned, there should never be RMWs.

Remember, there's a bounding limit on both how large the stripe cache
is, and how long (timewise) it will let the cache sit around waiting
for new blocks to come in.  That's probably what you're hitting at
times with the high queue depth numbers.

I assume the blktrace info would tell you more, but I haven't really
got a clue how to interpret it.
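
For what it's worth, a minimal capture looks roughly like this (md0 is
a placeholder; interpreting the per-request events is the hard part):

blktrace -d /dev/md0 -o - | blkparse -i -     # live trace on stdout
blktrace -d /dev/md0 -w 30 -o md0trace        # or capture 30 seconds,
blkparse -i md0trace > md0trace.txt           # then post-process it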


Dallas> My stripe cache is 8192 btw.

Dallas> John,

>> I suspect you've hit a known problem-ish area with Linux disk io, which is that big queue depths aren't optimal.

Dallas> Yes, certainly looks that way.  But maybe as Phil indicated I might be
Dallas> exceeding my stripe cache.  I am still surprised that there are so
Dallas> many RMWs even if the stripe cache has been exhausted.

>> As you can see, it peaks at a queue depth of 4, and then tends
>> downward before falling off a cliff.  So now what I'd do is keep the
>> queue depth at 4, but vary the block size and other parameters and see
>> how things change there.

Dallas> Why do you think there is a gradual drop off after queue depth
Dallas> of 4 and before it falls off the cliff?

I think because the in-kernel sizes start getting bigger, and so the
kernel spends more time queuing and caching the data and moving it
around, instead of just shoveling it down to the disks as quick as it
can.

Dallas> I with this were for fun! ;) Although this has been a fun
Dallas> discussion.  I've learned a ton.  This effort is for work
Dallas> though.  I'd be all over the SSDs and caching otherwise.  I'm
Dallas> trying to characterize and then squeeze all of the performance
Dallas> I can out of a legacy NAS product.  I am constrained by the
Dallas> existing hardware.  Unfortunately I do not have the option of
Dallas> using SSDs or hardware RAID controllers.  I have to rely
Dallas> completely on Linux RAID.

Ah... in that case, you need to do your testing from the NAS side,
don't bother going to this level.  I'd honestly now just set your
queue depth to 4 and move on to testing the NAS side of things, where
you have one, two, four, eight, or more test boxes hitting the NAS
box.

Dallas> I also need to optimize for large sequential writes (streaming
Dallas> video, audio, large file transfers), iSCSI (mostly used for
Dallas> hosting VMs), and random I/O (small and big files) as you
Dallas> would expect with a NAS.

So you want to do everything all at once.  Fun.  So really I'd move
back to the Network side, because unless your NAS box has more than
one GigE interface and supports bonding/trunking, you've hit the
performance wall.

Also, even if you get a ton of performance with large streaming
writes, when you sprinkle in a small set of random IO/s, you're going
to hit the cliff much sooner.  And in that case... it's another set of
optimizations.

Are you going to use NFSv3?  TCP?  UDP?  1500 MTU, 9000 MTU?  How many
clients?  How active?

Can you give up disk space for IOP/s?  So get away from the RAID6 and
move to RAID1 mirrors with a stripe atop it, so that you maximize how
many IOPs you can get.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-15 21:54                                                           ` John Stoffel
@ 2015-12-15 23:07                                                             ` Dallas Clement
  2015-12-16 15:31                                                               ` Dallas Clement
  0 siblings, 1 reply; 60+ messages in thread
From: Dallas Clement @ 2015-12-15 23:07 UTC (permalink / raw)
  To: John Stoffel; +Cc: Mark Knecht, Phil Turmel, Linux-RAID

On Tue, Dec 15, 2015 at 3:54 PM, John Stoffel <john@stoffel.org> wrote:
>>>>>> "Dallas" == Dallas Clement <dallas.a.clement@gmail.com> writes:
>
> Dallas> Thanks guys for all the ideas and help.
> Dallas> Phil,
>
>>> Very interesting indeed. I wonder if the extra I/O in flight at high
>>> depths is consuming all available stripe cache space, possibly not
>>> consistently. I'd raise and lower that in various combinations with
>>> various combinations of iodepth.  Running out of stripe cache will cause
>>> premature RMWs.
>
> Dallas> Okay, I'll play with that today.  I have to confess I'm not
> Dallas> sure that I completely understand how the stripe cache works.
> Dallas> I think the idea is to batch I/Os into a complete stripe if
> Dallas> possible and write out to the disks all in one go to avoid
> Dallas> RMWs.  Other than alignment issues, I'm unclear on what
> Dallas> triggers RMWs.  It seems like as Robert mentioned that if the
> Dallas> I/Os block size is stripe aligned, there should never be RMWs.
>
> Remember, there's a bounding limit on both how large the stripe cache
> is, and how long (timewise) it will let the cache sit around waiting
> for new blocks to come in.  That's probably what you're hitting at
> times with the high queue depth numbers.
>
> I assume the blocktrace info would tell you more, but I haven't really
> a clue how to interpret it.
>
>
> Dallas> My stripe cache is 8192 btw.
>
> Dallas> John,
>
>>> I suspect you've hit a known problem-ish area with Linux disk io, which is that big queue depths aren't optimal.
>
> Dallas> Yes, certainly looks that way.  But maybe as Phil indicated I might be
> Dallas> exceeding my stripe cache.  I am still surprised that there are so
> Dallas> many RMWs even if the stripe cache has been exhausted.
>
>>> As you can see, it peaks at a queue depth of 4, and then tends
>>> downward before falling off a cliff.  So now what I'd do is keep the
>>> queue depth at 4, but vary the block size and other parameters and see
>>> how things change there.
>
> Dallas> Why do you think there is a gradual drop off after queue depth
> Dallas> of 4 and before it falls off the cliff?
>
> I think because the in-kernel sizes start getting bigger, and so the
> kernel spends more time queuing and caching the data and moving it
> around, instead of just shoveling it down to the disks as quick as it
> can.
>
> Dallas> I with this were for fun! ;) Although this has been a fun
> Dallas> discussion.  I've learned a ton.  This effort is for work
> Dallas> though.  I'd be all over the SSDs and caching otherwise.  I'm
> Dallas> trying to characterize and then squeeze all of the performance
> Dallas> I can out of a legacy NAS product.  I am constrained by the
> Dallas> existing hardware.  Unfortunately I do not have the option of
> Dallas> using SSDs or hardware RAID controllers.  I have to rely
> Dallas> completely on Linux RAID.
>
> Ah... in that case, you need to do your testing from the NAS side,
> don't bother going to this level.  I'd honestly now just set your
> queue depth to 4 and move on to testing the NAS side of things, where
> you have one, two, four, eight, or more test boxes hitting the NAS
> box.
>
> Dallas> I also need to optimize for large sequential writes (streaming
> Dallas> video, audio, large file transfers), iSCSI (mostly used for
> Dallas> hosting VMs), and random I/O (small and big files) as you
> Dallas> would expect with a NAS.
>
> So you want to do everything at all once.  Fun.  So really I'd move
> back to the Network side, because unless your NAS box has more than
> 1GigE interface, and supports Bonding/trunking, you've hit the
> performance wall.
>
> Also, even if you get a ton of performance with large streaming
> writes, when you sprinkle in a small set of random IO/s, you're going
> to hit the cliff much sooner.  And in that case... it's another set of
> optimizations.
>
> Are you going to use NFSv3?  TCP?  UDP?  1500 MTU, 9000 MTU?  How many
> clients?  How active?
>
> Can you give up disk space for IOP/s?  So get away from the RAID6 and
> move to RAID1 mirrors with a strip atop it, so that you maximize how
> many IOPs you can get.
>

Hi John.

> Remember, there's a bounding limit on both how large the stripe cache
> is, and how long (timewise) it will let the cache sit around waiting
> for new blocks to come in.  That's probably what you're hitting at
> times with the high queue depth numbers.

Okay, good to know.  I did try doubling the size of the stripe cache
just to see if it would reduce the # of RMWs at iodepth>=64.  It did
not.  So it looks like the cache is timing out, as you mentioned.

> So you want to do everything at all once.  Fun.  So really I'd move
> back to the Network side, because unless your NAS box has more than
> 1GigE interface, and supports Bonding/trunking, you've hit the
> performance wall.

I'm not sure I necessarily want to tune everything at once.
Surprisingly this box does have 10GigE interfaces.  I just want to get
RAID tuned the best I can before I start testing over the network.
With 10 GigE this box should be able to write 1200 MB/s max.  But as
reported earlier, I'm not even able to get that with fio running
locally on the box writing to the RAID device.

After Phil's explanation I now better understand what triggers the
RMWs.  Clearly I would like to minimize these to get the best
performance for both sequential and random patterns.  Messing with the
stripe cache size doesn't seem to change anything with performance, so
I will probably play with a smaller chunk size next to see if that
helps.
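
Recreating the test array with a smaller chunk is a one-liner (device
names are placeholders, and this of course destroys the existing
array):

mdadm --stop /dev/md0
mdadm --create /dev/md0 --level=5 --raid-devices=12 --chunk=32 /dev/sd[b-m]1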

> Also, even if you get a ton of performance with large streaming
> writes, when you sprinkle in a small set of random IO/s, you're going
> to hit the cliff much sooner.  And in that case... it's another set of
> optimizations.

Yes, I get that.  There are definitely some customers that do a little
of everything with these NAS boxes.  But from what I've seen, a lot of
them also use a NAS for just one thing - hosting VMs, streaming media,
serving files, or running a database / web app.

> Are you going to use NFSv3?  TCP?  UDP?  1500 MTU, 9000 MTU?

Yes, all of these are supported.

> How many clients?  How active?

These boxes tend to get used pretty hard.  Probably the biggest
application is iSCSI or Samba backups, and then hosting VMs.  For
backups it's usually a small number of clients but heavy volume.  For
VMs there can be quite a few.

One other consideration is that these kinds of products undergo lots of
benchmark testing with the usual host of tools.  My main goal at this
point is to make sure the tests that focus primarily on sequential
throughput (small and large blocks) come out as well as they can given
the hardware limitations.  10 Gbps iSCSI throughput is probably the
most important benchmark.  If I can somehow get the RAID 5,6 write
speeds up over 1 GB/s I would be very happy.  Sadly, right now 10 Gbps
iSCSI write performance is limited by the RAID device performance.

> Can you give up disk space for IOP/s?  So get away from the RAID6 and
> move to RAID1 mirrors with a strip atop it, so that you maximize how
> many IOPs you can get.

Yes, this box already supports RAID 0, 1, 5, 6, 10, 50, 60

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-15 23:07                                                             ` Dallas Clement
@ 2015-12-16 15:31                                                               ` Dallas Clement
       [not found]                                                                 ` <CAK2H+eeD2k4yzuvL4uF_qKycp6A=XPe8pVF_J-7Agi8Ze89PPQ@mail.gmail.com>
  0 siblings, 1 reply; 60+ messages in thread
From: Dallas Clement @ 2015-12-16 15:31 UTC (permalink / raw)
  To: John Stoffel; +Cc: Mark Knecht, Phil Turmel, Linux-RAID

Phil, the 16k chunk size has really given a boost to my RAID 5
sequential write performance measured with fio, bs=1408k.

This is what I was getting with a 128k chunk size:

iodepth=4 => 605 MB/s
iodepth=8 => 589 MB/s
iodepth=16 => 634 MB/s
iodepth=32 => 635 MB/s

But this is what I'm getting with a 16k chunk size:

iodepth=4 => 825 MB/s
iodepth=8 => 810 MB/s
iodepth=16 => 851 MB/s
iodepth=32 => 866 MB/s

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
       [not found]                                                                 ` <CAK2H+eeD2k4yzuvL4uF_qKycp6A=XPe8pVF_J-7Agi8Ze89PPQ@mail.gmail.com>
@ 2015-12-17  5:57                                                                   ` Dallas Clement
  2015-12-17 13:41                                                                   ` Phil Turmel
  1 sibling, 0 replies; 60+ messages in thread
From: Dallas Clement @ 2015-12-17  5:57 UTC (permalink / raw)
  To: Mark Knecht; +Cc: John Stoffel, Phil Turmel, Linux-RAID

On Wed, Dec 16, 2015 at 6:24 PM, Mark Knecht <markknecht@gmail.com> wrote:
>
>
> On Wed, Dec 16, 2015 at 7:31 AM, Dallas Clement <dallas.a.clement@gmail.com>
> wrote:
>>
>> Phil, the 16k chunk size has really given a boost to my RAID 5
>> sequential write performance measured with fio, bs=1408k.
>>
>> This is what I was getting with a 128k chunk size:
>>
>> iodepth=4 => 605 MB/s
>> iodepth=8 => 589 MB/s
>> iodepth=16 => 634 MB/s
>> iodepth=32 => 635 MB/s
>>
>> But this is what I'm getting with a 16k chunk size:
>>
>> iodepth=4 => 825 MB/s
>> iodepth=8 => 810 MB/s
>> iodepth=16 => 851 MB/s
>> iodepth=32 => 866 MB/s
>
>
> Dallas,
>    Hi. Just for kicks I tried Phil's idea (I think it was Phil) and
> sampled stripe_cache_active by putting this command in a 1 second loop
> and running it today while I worked.
>
> cat  /sys/block/md3/md/stripe_cache_active >> testCacheResults
>
> My workload is _very_ different from what you're working on. This is a
> high-end desktop machine (Intel 980i Extreme processor, 24GB DRAM,
> RAID6) running 2 Windows 7 VMs while I watch the stock market and
> program in MATLAB. Nonetheless I was somewhat surprised at the spread
> in the number of active lines. The test ran for about 10 hours with
> about 94% of the results being 0, but with numbers ranging from 1 line
> to 2098 lines active at a single time. Also interesting to me was that
> when the 2098 value hit, it was apparently all clear in less than 1
> second, as the values immediately following were back to 0.
>
>    Note that this is a 5-disk RAID6 set up with a chunk size of 16k
> and an internal intent bitmap. I did none of the tuning work you're
> doing when I set the machine up. I just picked some numbers and built
> it so that I could get to work.
>
>    I've not done any real speed testing, but a quick run of dd
> suggested maybe 160-180 MB/s, which sounds about right to me.
>
>    Anyway, just thought it was interesting.
>
> - Mark
>
> mark@c2RAID6 ~ $ sort -g testCacheResults | uniq -c
>   33316 0
>     127 1
>      98 2
>     105 3
>     141 4
>      71 5
>      48 6
>      38 7
>      39 8
>      36 9
>      31 10
>      23 11
>      30 12
>      26 13
>      17 14
>      12 15
>      20 16
>      14 17
>      17 18
>      23 19
>      19 20
>      12 21
>      13 22
>      14 23
>      16 24
>      15 25
>      14 26
>       8 27
>      11 28
>      16 29
>      10 30
>       3 31
>       9 32
>       3 33
>       5 34
>      13 35
>       7 36
>       7 37
>       3 38
>       7 39
>       6 40
>       9 41
>       5 42
>       6 43
>       7 44
>      12 45
>       7 46
>       7 47
>       6 48
>       6 49
>       5 50
>       4 51
>       8 52
>       2 53
>       6 54
>      10 55
>       3 56
>       7 57
>       7 58
>       9 59
>       3 60
>       5 61
>       8 62
>       1 63
>       5 64
>       4 65
>       9 66
>       3 67
>       3 68
>       2 69
>       2 70
>       5 71
>       2 72
>       3 73
>       3 74
>       3 75
>       3 76
>       3 77
>       1 78
>       4 79
>       1 80
>       3 81
>       2 82
>       1 83
>       4 84
>       1 85
>       4 86
>       1 87
>       2 89
>       2 90
>       1 91
>       2 92
>       1 93
>       4 94
>       2 95
>       5 96
>       2 97
>       2 98
>       2 99
>       5 100
>       2 101
>       1 102
>       6 103
>       5 104
>       1 105
>       3 106
>       3 107
>       2 108
>       3 109
>       3 110
>       4 111
>       3 112
>       1 113
>       4 114
>       1 115
>       1 116
>       1 117
>       3 118
>       4 119
>       3 120
>       3 121
>       2 122
>       3 123
>       4 124
>       2 125
>       3 126
>       1 127
>       2 128
>       2 129
>       1 130
>       3 131
>       2 132
>       2 133
>       2 134
>       3 135
>       1 136
>       2 137
>       3 138
>       5 140
>       3 141
>       3 142
>       1 143
>       1 144
>       5 145
>       1 146
>       6 147
>       3 148
>       1 149
>       1 150
>       1 152
>       2 153
>       1 154
>       1 155
>       1 156
>       4 157
>       3 158
>       1 159
>       3 160
>       1 161
>       6 162
>       1 163
>       2 164
>       1 165
>       1 166
>       4 167
>       2 168
>       5 169
>       2 170
>       3 172
>       5 173
>       4 174
>       4 175
>       4 176
>       3 177
>       2 178
>       2 179
>       6 180
>       2 181
>       3 182
>       3 184
>       2 185
>       3 186
>       4 187
>       2 188
>       5 190
>       4 192
>       3 193
>       2 194
>       6 196
>       1 197
>       1 198
>       1 199
>       2 200
>       4 201
>       2 203
>       2 204
>       4 206
>       1 207
>       2 208
>       5 209
>       2 210
>       3 211
>       6 212
>       3 213
>       3 214
>       4 215
>       4 216
>       6 217
>       8 218
>       1 219
>       5 220
>       6 221
>       4 222
>       6 223
>       6 224
>       5 225
>       2 226
>       3 227
>       5 228
>       2 229
>       1 230
>       5 231
>       6 232
>       6 233
>       3 234
>       4 235
>       6 236
>       5 237
>       1 238
>       5 239
>       2 240
>       5 241
>       4 242
>       2 244
>       2 245
>       2 246
>       2 247
>       3 248
>       2 249
>       4 250
>       3 251
>       6 252
>       2 253
>       2 254
>       5 255
>       3 256
>       4 257
>       3 258
>       3 259
>       6 260
>       2 261
>       3 262
>       3 263
>       1 264
>       3 265
>       1 266
>       4 267
>       4 268
>       4 269
>       3 270
>       4 271
>       2 272
>       1 273
>       1 275
>       1 276
>       5 277
>       6 278
>       2 279
>       2 280
>       1 281
>       6 282
>       5 283
>       8 284
>       1 285
>       5 286
>       4 287
>       2 288
>       2 289
>       3 290
>       2 291
>       1 292
>       2 293
>       1 294
>       3 295
>       2 296
>       2 297
>       1 298
>       3 299
>       2 300
>       1 301
>       2 303
>       3 305
>       3 306
>       1 307
>       1 308
>       2 309
>       2 310
>       1 311
>       1 312
>       1 313
>       2 314
>       1 315
>       1 317
>       1 318
>       2 320
>       1 321
>       2 322
>       2 323
>       2 324
>       1 325
>       1 326
>       2 327
>       3 328
>       2 329
>       1 331
>       1 335
>       1 336
>       2 337
>       1 338
>       1 339
>       1 340
>       3 341
>       1 343
>       1 344
>       1 346
>       1 347
>       1 348
>       2 349
>       1 350
>       1 352
>       2 353
>       1 357
>       1 359
>       1 360
>       1 365
>       1 368
>       1 369
>       2 372
>       2 373
>       1 378
>       1 380
>       1 388
>       2 392
>       1 409
>       1 410
>       1 414
>       1 425
>       1 444
>       1 455
>       2 460
>       1 465
>       1 469
>       1 484
>       1 485
>       1 492
>       1 499
>       1 503
>       1 504
>       1 509
>       1 518
>       1 534
>       1 540
>       1 541
>       1 543
>       1 546
>       1 572
>       1 575
>       1 586
>       1 591
>       1 592
>       1 602
>       1 637
>       1 661
>       1 674
>       1 732
>       1 770
>       1 780
>       1 905
>       2 927
>       1 928
>       1 1036
>       1 1146
>       1 1151
>       1 1157
>       1 1314
>       1 1974
>       1 2098
> mark@c2RAID6 ~ $

Hi Mark.  This is quite fascinating.  Now I really want to try it with
my workloads.  How big is your stripe cache btw?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
       [not found]                                                                 ` <CAK2H+eeD2k4yzuvL4uF_qKycp6A=XPe8pVF_J-7Agi8Ze89PPQ@mail.gmail.com>
  2015-12-17  5:57                                                                   ` Dallas Clement
@ 2015-12-17 13:41                                                                   ` Phil Turmel
  2015-12-17 21:08                                                                     ` Dallas Clement
  1 sibling, 1 reply; 60+ messages in thread
From: Phil Turmel @ 2015-12-17 13:41 UTC (permalink / raw)
  To: Mark Knecht, Dallas Clement; +Cc: John Stoffel, Linux-RAID

Hi Mark,

On 12/16/2015 07:24 PM, Mark Knecht wrote:
> On Wed, Dec 16, 2015 at 7:31 AM, Dallas Clement
> <dallas.a.clement@gmail.com <mailto:dallas.a.clement@gmail.com>> wrote:
> 
>     Phil, the 16k chunk size has really given a boost to my RAID 5
>     sequential write performance measured with fio, bs=1408k.
> 
>     This is what I was getting with a 128k chunk size:
> 
>     iodepth=4 => 605 MB/s
>     iodepth=8 => 589 MB/s
>     iodepth=16 => 634 MB/s
>     iodepth=32 => 635 MB/s
> 
>     But this is what I'm getting with a 16k chunk size:
> 
>     iodepth=4 => 825 MB/s
>     iodepth=8 => 810 MB/s
>     iodepth=16 => 851 MB/s
>     iodepth=32 => 866 MB/s

Very interesting.  Good to see hypotheses supported by results.

> Dallas,
>    Hi. Just for kicks I tried Phil's idea (I think it was Phil) and

:-)

> sampled  stripe_cache_active
> by putting this command in a 1 second loop and running it today while I
> worked.
> 
> cat  /sys/block/md3/md/stripe_cache_active >> testCacheResults
> 
> My workload is _very_ different from what you're working on. This is a
> high-end desktop machine (Intel 980i Extreme processor, 24GB DRAM,
> RAID6) running 2 Windows 7 VMs while I watch the stock market and
> program in MATLAB. Nonetheless I was somewhat surprised at the spread
> in the number of active lines. The test ran for about 10 hours with
> about 94% of the results being 0, but with numbers ranging from 1 line
> to 2098 lines active at a single time. Also interesting to me was that
> when the 2098 value hit, it was apparently all clear in less than 1
> second, as the values immediately following were back to 0.

Yeah, latencies are pretty low.  One-second samples will be fairly
random snapshots under most conditions.  Consider sampling much faster,
but building one-minute histograms and recording those.
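
Something along these lines would do it (a rough sketch; md0 and the
0.1 s period are assumptions):

#!/bin/bash
# sample stripe_cache_active ~10x/second, emit roughly one histogram per minute
while true; do
  for i in $(seq 600); do
    cat /sys/block/md0/md/stripe_cache_active
    sleep 0.1
  done | sort -n | uniq -c >> stripe_cache_histograms.log
  echo "---- $(date)" >> stripe_cache_histograms.log
done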

Phil

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-17 13:41                                                                   ` Phil Turmel
@ 2015-12-17 21:08                                                                     ` Dallas Clement
  2015-12-17 22:40                                                                       ` Phil Turmel
  0 siblings, 1 reply; 60+ messages in thread
From: Dallas Clement @ 2015-12-17 21:08 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Mark Knecht, John Stoffel, Linux-RAID

I am still in the process of collecting a bunch of performance data.
But so far, it is shocking to see the throughput difference when
blocks written are stripe aligned.  However, in the non-ideal world it
is not always possible to ensure that clients are writing blocks of
data which are stripe aligned.  If the goal is to reduce the # of RMWs
it seems like writing big blocks would also help for sequential
workloads where large quantities of data are being written.  Can any
of you think of anything else that can be tuned in the kernel to
reduce # of RMWs in the case where blocks are not stripe aligned?  Is
it a bad idea to mess with the timing of the stripe cache?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-17 21:08                                                                     ` Dallas Clement
@ 2015-12-17 22:40                                                                       ` Phil Turmel
  2015-12-17 23:28                                                                         ` Dallas Clement
  0 siblings, 1 reply; 60+ messages in thread
From: Phil Turmel @ 2015-12-17 22:40 UTC (permalink / raw)
  To: Dallas Clement; +Cc: Mark Knecht, John Stoffel, Linux-RAID

On 12/17/2015 04:08 PM, Dallas Clement wrote:
> I am still in the process of collecting a bunch of performance data.
> But so far, it is shocking to see the throughput difference when
> blocks written are stripe aligned.

Unaligned random writes have at least a 4x multiplier on raid5 and 6x on
raid6 per my earlier explanation.  Why does this surprise you?  It's
parity raid.  This is why users with heavy random workloads are pointed
at raid1 and raid10.  I like raid10,f3 for VM host images and databases.

> However, in the non-ideal world it
> is not always possible to ensure that clients are writing blocks of
> data which are stripe aligned.

Hardly possible at all, except for bulk writes of large media files, and
then only if you are writing one stream at a time to an otherwise idle
storage stack.  Not very realistic in a general-purpose storage
appliance.  "General purpose" just isn't very sequential.

> If the goal is to reduce the # of RMWs
> it seems like writing big blocks would also help for sequential
> workloads where large quantities of data are being written.

The goal is to be able to read later what you need to write now.  Unless
you have unlimited $ to spend, you have to balance speed, redundancy,
and capacity.  As they say, pick two.

Lots of spindles is generally good.  Raid5 is great for capacity, good
for redundancy, and marginal for speed.  Raid6 is great for capacity,
great for redundancy, and pitiful for speed.  Raid10,f2 is great for
speed, poor for capacity, and good for redundancy.  Raid10,f3 is great
for speed, pitiful for capacity, and great for redundancy.

> Can any
> of you think of anything else that can be tuned in the kernel to
> reduce # of RMWs in the case where blocks are not stripe aligned?  Is
> it a bad idea to mess with the timing of the stripe cache?

You can't really hold those writes for long, as any serious application
is going to call fdatasync at short intervals, for algorithmic integrity
reasons.  On random workloads, you simply have no choice but to do RMWs.
 Your only out is to make complete chunk stripes smaller than your
application's typical write size.  That raises the odds that any
particular write will be aligned or mostly aligned.  Have you tried 4k
chunks?
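
As a quick sanity check for your 12-drive raid5 case (11 data drives),
the full data stripe width per chunk size is just arithmetic:

for chunk in 4 16 32 64 128; do
  echo "chunk=${chunk}k  full data stripe=$(( (12 - 1) * chunk ))k"
done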

Phil


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-17 22:40                                                                       ` Phil Turmel
@ 2015-12-17 23:28                                                                         ` Dallas Clement
  2015-12-18  0:54                                                                           ` Dallas Clement
  0 siblings, 1 reply; 60+ messages in thread
From: Dallas Clement @ 2015-12-17 23:28 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Mark Knecht, John Stoffel, Linux-RAID

On Thu, Dec 17, 2015 at 4:40 PM, Phil Turmel <philip@turmel.org> wrote:
> On 12/17/2015 04:08 PM, Dallas Clement wrote:
>> I am still in the process of collecting a bunch of performance data.
>> But so far, it is shocking to see the throughput difference when
>> blocks written are stripe aligned.
>
> Random writes unaligned has at least a 4x multiplier on raid5 and 6x on
> raid6 per my earlier explanation.  Why does this surprise you?  It's
> parity raid.  This is why users with heavy random workloads are pointed
> at raid1 and raid10.  I like raid10,f3 for VM host images and databases.
>
>> However, in the non-ideal world it
>> is not always possible to ensure that clients are writing blocks of
>> data which are stripe aligned.
>
> Hardly possible at all, except for bulk writes of large media files, and
> then only if you are writing one stream at a time to an otherwise idle
> storage stack.  Not very realistic in a general-purpose storage
> appliance.  "General purpose" just isn't very sequential.
>
>> If the goal is to reduce the # of RMWs
>> it seems like writing big blocks would also help for sequential
>> workloads where large quantities of data are being written.
>
> The goal is to be able to read later what you need to write now.  Unless
> you have unlimited $ to spend, you have to balance speed, redundancy,
> and capacity.  As they say, pick two.
>
> Lots of spindles is generally good.  Raid5 is great for capacity, good
> for redundancy, and marginal for speed.  Raid6 is great for capacity,
> great for redundancy, and pitiful for speed.  Raid10,f2 is great for
> speed, poor for capacity, and good for redundancy.  Raid10,f3 is great
> for speed, pitiful for capacity, and great for redundancy.
>
>> Can any
>> of you think of anything else that can be tuned in the kernel to
>> reduce # of RMWs in the case where blocks are not stripe aligned?  Is
>> it a bad idea to mess with the timing of the stripe cache?
>
> You can't really hold those writes for long, as any serious application
> is going to call fdatasync at short intervals, for algorithmic integrity
> reasons.  On random workloads, you simply have no choice but to do RMWs.
>  Your only out is to make complete chunk stripes smaller than your
> application's typical write size.  That raises the odds that any
> particular write will be aligned or mostly aligned.  Have you tried 4k
> chunks?
>
> Phil
>

Hi Phil.  Thanks for the explanation.

> Random writes unaligned has at least a 4x multiplier on raid5 and 6x on
> raid6 per my earlier explanation.  Why does this surprise you?  It's
> parity raid.  This is why users with heavy random workloads are pointed
> at raid1 and raid10.  I like raid10,f3 for VM host images and databases.

It really shouldn't surprise me.  I should have said I am very HAPPY
to see such relatively good performance when the writes are stripe
aligned. :)

> Have you tried 4k chunks?

No not yet.  I've been taking some measurements with 16k, 32k, 64k,
128k, 256k.  So far it looks like 64k has the highest speeds for RAID
5 sequential writes.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-17 23:28                                                                         ` Dallas Clement
@ 2015-12-18  0:54                                                                           ` Dallas Clement
       [not found]                                                                             ` <CAFx4rwT8xgwZ0OWaLLsZvhMskiwmY54MzHgnnEPaswByeRrXxQ@mail.gmail.com>
  0 siblings, 1 reply; 60+ messages in thread
From: Dallas Clement @ 2015-12-18  0:54 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Mark Knecht, John Stoffel, Linux-RAID

>> Have you tried 4k chunks?
>
> No not yet.  I've been taking some measurements with 16k, 32k, 64k,
> 128k, 256k.  So far it looks like 64k has the highest speeds for RAID
> 5 sequential writes.

Correction.  It looks like a 32k chunk size gives the best performance
overall for RAID 5 sequential and random access.  I'm checking this on
RAID 0, 1, and 6 as well.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
       [not found]                                                                             ` <CAFx4rwT8xgwZ0OWaLLsZvhMskiwmY54MzHgnnEPaswByeRrXxQ@mail.gmail.com>
@ 2015-12-22  6:15                                                                               ` Doug Dumitru
  2015-12-22 14:34                                                                                 ` Robert Kierski
  2015-12-22 16:48                                                                                 ` Dallas Clement
  0 siblings, 2 replies; 60+ messages in thread
From: Doug Dumitru @ 2015-12-22  6:15 UTC (permalink / raw)
  Cc: Linux-RAID

My apologies for diving in so late.

I routinely run 24-drive raid-5 sets with SSDs.  Chunk is set at 32K
and the application only writes "perfect" 736K "stripes" (23 data
drives * 32K).  The SSDs are Samsung 850 Pros on dedicated LSI 3008
SAS ports and are at "new" preconditioning (i.e., they are at full
speed, just over 500 MB/sec each).  CPU is a single E5-1650 v3.

With stock RAID-5 code, I get about 1.8 GB/sec, q=4.

Now this application is writing from kernel space
(generic_make_request w/ q waiting for completion callback).  There
are a lot of RMW operations happening here.  I think the raid-5
background thread is waking up asynchronously when only a part of the
write has been buffered into stripe cache pages.  The bio going into
the raid layer is a single bio, so nothing is being carved up on the
request end.  The raid-5 helper thread also saturates a cpu core
(which is about as fast as you can get with an E5-1650).

If I patch raid5.ko with special case code to avoid the stripe cache
and just compute parity and go, the write throughput goes up above
11GB/sec.

This is obviously an impossible IO pattern for most applications, but
it does confirm that the upper limit of (n-1)*bw is "possible", just
not with the current stripe cache logic in the raid layer.

Doug Dumitru
WildFire Storage

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: best base / worst case RAID 5,6 write speeds
  2015-12-22  6:15                                                                               ` Doug Dumitru
@ 2015-12-22 14:34                                                                                 ` Robert Kierski
  2015-12-22 16:48                                                                                 ` Dallas Clement
  1 sibling, 0 replies; 60+ messages in thread
From: Robert Kierski @ 2015-12-22 14:34 UTC (permalink / raw)
  To: doug; +Cc: Linux-RAID

Hey Doug,

I would be interested in seeing the patch you're talking about.  I wonder if that code couldn't be turned on/off with a tuning parameter or module param.

Bob Kierski
Senior Storage Performance Engineer
Cray Inc.
380 Jackson Street
Suite 210
St. Paul, MN 55101
Tele: 651-967-9590
Fax:  651-605-9001
Cell: 651-890-7461


-----Original Message-----
From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Doug Dumitru
Sent: Tuesday, December 22, 2015 12:16 AM
Cc: Linux-RAID
Subject: Re: best base / worst case RAID 5,6 write speeds

My apologies for diving in so late.

I routinely run 24 drive raid-5 sets with SSDs.  Chunk is set at 32K and the applications only writes "perfect" 736K "stripes".  The SSDs are Samsung 850 pros on dedicated LSI 3008 SAS ports and are at "new"
preconditioning (ie, they are at full speed) or just over 500 MB/sec.
CPU is a single E5-1650 v3.

With stock RAID-5 code, I get about 1.8 GB/sec, q=4.

Now this application is writing from kernel space (generic_make_request w/ q waiting for completion callback).  There are a lot of RMW operations happening here.  I think the raid-5 background thread is waking up asynchronously when only a part of the write has been buffered into stripe cache pages.  The bio going into the raid layer is a single bio, so nothing is being carved up on the request end.  The raid-5 helper thread also saturates a cpu core (which is about as fast as you can get with an E5-1650).

If I patch raid5.ko with special case code to avoid the stripe cache and just compute parity and go, the write throughput goes up above 11GB/sec.

This is obviously an impossible IO pattern for most applications, but does confirm that the upper limit of (n-1)*bw is "possible", but not with the current stripe cache logic in the raid layer.

Doug Dumitru
WildFire Storage
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-22  6:15                                                                               ` Doug Dumitru
  2015-12-22 14:34                                                                                 ` Robert Kierski
@ 2015-12-22 16:48                                                                                 ` Dallas Clement
  2015-12-22 18:33                                                                                   ` Doug Dumitru
  1 sibling, 1 reply; 60+ messages in thread
From: Dallas Clement @ 2015-12-22 16:48 UTC (permalink / raw)
  To: doug; +Cc: Linux-RAID

On Tue, Dec 22, 2015 at 12:15 AM, Doug Dumitru <doug@easyco.com> wrote:
> My apologies for diving in so late.
>
> I routinely run 24 drive raid-5 sets with SSDs.  Chunk is set at 32K
> and the applications only writes "perfect" 736K "stripes".  The SSDs
> are Samsung 850 pros on dedicated LSI 3008 SAS ports and are at "new"
> preconditioning (ie, they are at full speed) or just over 500 MB/sec.
> CPU is a single E5-1650 v3.
>
> With stock RAID-5 code, I get about 1.8 GB/sec, q=4.
>
> Now this application is writing from kernel space
> (generic_make_request w/ q waiting for completion callback).  There
> are a lot of RMW operations happening here.  I think the raid-5
> background thread is waking up asynchronously when only a part of the
> write has been buffered into stripe cache pages.  The bio going into
> the raid layer is a single bio, so nothing is being carved up on the
> request end.  The raid-5 helper thread also saturates a cpu core
> (which is about as fast as you can get with an E5-1650).
>
> If I patch raid5.ko with special case code to avoid the stripe cache
> and just compute parity and go, the write throughput goes up above
> 11GB/sec.
>
> This is obviously an impossible IO pattern for most applications, but
> does confirm that the upper limit of (n-1)*bw is "possible", but not
> with the current stripe cache logic in the raid layer.
>
> Doug Dumitru
> WildFire Storage
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


> If I patch raid5.ko with special case code to avoid the stripe cache
> and just compute parity and go, the write throughput goes up above
> 11GB/sec.

Hi Doug.  This is really quite astounding and encouraging!  Would you
be willing to share your patch?  I am eager to give it a try for RAID
5 and 6.

> Now this application is writing from kernel space
> (generic_make_request w/ q waiting for completion callback).  There
> are a lot of RMW operations happening here.  I think the raid-5
> background thread is waking up asynchronously when only a part of the
> write has been buffered into stripe cache pages.

I am also anxious to hear from anyone who maintains the stripe cache
code.  I am seeing similar behavior when I monitor writes of perfectly
stripe-aligned blocks.  The # of RMWs is smallish and seems to vary,
but I still do not expect to see any of them!

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2015-12-22 16:48                                                                                 ` Dallas Clement
@ 2015-12-22 18:33                                                                                   ` Doug Dumitru
  2016-01-04 18:56                                                                                     ` Robert Kierski
  0 siblings, 1 reply; 60+ messages in thread
From: Doug Dumitru @ 2015-12-22 18:33 UTC (permalink / raw)
  To: Dallas Clement, Robert Kierski; +Cc: Linux-RAID

Robert and Dallas,

The patch covers an astonishingly narrow single case and has a few usage caveats.

It only works when IO is precisely aligned on stripe boundaries.  If
anything is off-aligned, or even if an aligned case is encountered while
the stripe cache is not empty, the patch's special case does not happen.
Second, the patch assumes that your application layer "makes sense"
and will not try to read a block that is in the middle of being
written.

The patch is in use on production servers, but with still more
caveats.  It turns off if the array is not clean or if a rebuild or
check is in progress.

Here is "raid5.c" from CentOS 7 with the patch applied:

https://drive.google.com/file/d/0B3T4AZzjEGVkbUYzeVZqbkIzN1E/view?usp=sharing

The modified areas are all inside of #ifdef EASYCO conditionals.  I
did not want to post this as a patch here as this is not appropriate
code for general use.

-- Some comments on stripe cache --

The stripe cache is a lot of overhead for this particular case, but
still works quite well compared to the alternatives.  Most benchmarks
I see with high-end raid cards cannot get to 1GB/sec either on raid-5
or raid-6.

Moving away from the stripe cache, especially dynamically, might open
up a nasty set of locking semantics.

-- Some comments on the raid background thread --

With most "reasonable" disk sets, the single raid thread is fine for
raid-5 at 1.8GB/sec.  If you want to get raid-6 faster, you need more
cores.  With my E5-1650 v3 I get just over 8 GB/sec with raid-6, most
of which is the raid-6 parity compute code.  Multi-socket E5s might do
a little better, but NUMA throws all sorts of interesting performance
tuning issues at our proprietary layer that is above raid.

-- Some comments on benchmarks --

If you run benchmarks like fio, you will get IO patterns that never
happen "in live datasets".  For example, a real file system will never
read a block that is being written.  This is a side effect of the file
system's use of pages as cache and writes that come from dirty pages.
Benchmarks just pump random numbers and overlaps are allowed.  This
means you must write code that survives the benchmarks, but optimizing
for a benchmark in some areas is dubious.

-- Some comments on RMW and SSDs --

One reason I wrote this patch was to keep SSDs happy.  If you write to
SSDs "perfectly" they never degrade and stay at full performance.  If
you do any random writing, the SSDs eventually need to do some space
management (garbage collection).  Even the 2-3% of RMW that I see
without the patch is enough to cost 3x in SSD wear with some drives.



Doug Dumitru
WildFire Storage


On Tue, Dec 22, 2015 at 8:48 AM, Dallas Clement
<dallas.a.clement@gmail.com> wrote:
> On Tue, Dec 22, 2015 at 12:15 AM, Doug Dumitru <doug@easyco.com> wrote:
>> My apologies for diving in so late.
>>
>> I routinely run 24 drive raid-5 sets with SSDs.  Chunk is set at 32K
>> and the applications only writes "perfect" 736K "stripes".  The SSDs
>> are Samsung 850 pros on dedicated LSI 3008 SAS ports and are at "new"
>> preconditioning (ie, they are at full speed) or just over 500 MB/sec.
>> CPU is a single E5-1650 v3.
>>
>> With stock RAID-5 code, I get about 1.8 GB/sec, q=4.
>>
>> Now this application is writing from kernel space
>> (generic_make_request w/ q waiting for completion callback).  There
>> are a lot of RMW operations happening here.  I think the raid-5
>> background thread is waking up asynchronously when only a part of the
>> write has been buffered into stripe cache pages.  The bio going into
>> the raid layer is a single bio, so nothing is being carved up on the
>> request end.  The raid-5 helper thread also saturates a cpu core
>> (which is about as fast as you can get with an E5-1650).
>>
>> If I patch raid5.ko with special case code to avoid the stripe cache
>> and just compute parity and go, the write throughput goes up above
>> 11GB/sec.
>>
>> This is obviously an impossible IO pattern for most applications, but
>> does confirm that the upper limit of (n-1)*bw is "possible", but not
>> with the current stripe cache logic in the raid layer.
>>
>> Doug Dumitru
>> WildFire Storage
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>> If I patch raid5.ko with special case code to avoid the stripe cache
>> and just compute parity and go, the write throughput goes up above
>> 11GB/sec.
>
> Hi Doug.  This is really quite astounding and encouraging!  Would you
> be willing to share your patch?  I am eager to give it a try for RAID
> 5 and 6.
>
>> Now this application is writing from kernel space
>> (generic_make_request w/ q waiting for completion callback).  There
>> are a lot of RMW operations happening here.  I think the raid-5
>> background thread is waking up asynchronously when only a part of the
>> write has been buffered into stripe cache pages.
>
> I am also anxious to hear from anyone who maintains the stripe cache
> code.  I am seeing similar behavior when I monitor writes of perfectly
> stripe-aligned blocks.  The # of RMWs are smallish and seem to vary,
> but still I do not expect to see any of them!



-- 
Doug Dumitru
EasyCo LLC

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: best base / worst case RAID 5,6 write speeds
  2015-12-22 18:33                                                                                   ` Doug Dumitru
@ 2016-01-04 18:56                                                                                     ` Robert Kierski
  2016-01-04 19:13                                                                                       ` Doug Dumitru
  0 siblings, 1 reply; 60+ messages in thread
From: Robert Kierski @ 2016-01-04 18:56 UTC (permalink / raw)
  To: doug; +Cc: linux-raid

Hey Doug,

I'm trying to get the patch to work and not having much luck.

I'm guessing that you're using the generic CentOS 7 kernel (3.10).  I'm using a 3.18.4 kernel, so there were a few changes I had to make to get the patch to apply.  But they weren't significant as far as I can tell, and shouldn't have caused the FastWrite code to be ignored.

There must be additional changes in the kernel that are necessary.  When I create the special case MDRaid, and then use the special case IO pattern, I see only the debug messages indicating that the FastWrite code was ignored. 

I've added more debugging code to try to understand what's going on.  It turns out that no matter what I do, no matter how I configure the MDRaid, and no matter what IO pattern I use, the size of the IO is such that it doesn't conform to the criteria required to call FastWrite.  It seems that in the 3.18.4 kernel, IOs are broken up before calling MDRaid's make_request function.  There doesn't seem to be an obvious relationship between the size being passed in and the chunk size that's configured by mdadm.

At first, it appeared that it was the minimum IO size being used.  But then I tried setting the minimum IO and optimal IO to the same value.  The resulting IOs are now half the size of the minimum IO... which is not the size of the optimal IO (which is the chunk size * number of data disks).
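
For reference, here is a small userspace sketch that asks the block
layer directly what it is advertising for the array, using the
standard BLKIOMIN/BLKIOOPT/BLKSSZGET ioctls (the /dev/md0 default is
just a placeholder):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
        const char *dev = argc > 1 ? argv[1] : "/dev/md0";
        unsigned int io_min = 0, io_opt = 0;
        int ssz = 0;
        int fd = open(dev, O_RDONLY);

        if (fd < 0) { perror(dev); return 1; }
        ioctl(fd, BLKIOMIN, &io_min);   /* minimum_io_size, normally the chunk  */
        ioctl(fd, BLKIOOPT, &io_opt);   /* optimal_io_size, chunk * data disks  */
        ioctl(fd, BLKSSZGET, &ssz);     /* logical sector size                  */
        printf("%s: io_min=%u io_opt=%u logical_sector=%d\n",
               dev, io_min, io_opt, ssz);
        close(fd);
        return 0;
}

The same numbers are visible in /sys/block/mdX/queue/minimum_io_size
and optimal_io_size.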

Bob Kierski
Senior Storage Performance Engineer
Cray Inc.
380 Jackson Street
Suite 210
St. Paul, MN 55101
Tele: 651-967-9590
Fax:  651-605-9001
Cell: 651-890-7461


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2016-01-04 18:56                                                                                     ` Robert Kierski
@ 2016-01-04 19:13                                                                                       ` Doug Dumitru
  2016-01-04 19:33                                                                                         ` Robert Kierski
  0 siblings, 1 reply; 60+ messages in thread
From: Doug Dumitru @ 2016-01-04 19:13 UTC (permalink / raw)
  To: Robert Kierski; +Cc: linux-raid

Robert,

The "FastWrite" code requires a complete stripe all contained inside
of a single BIO.  Depending on your drive count and stripe size, you
could end up above the BIO size limit (1MB for most kernels).  You
might try a smaller chunk size.
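
As a concrete example: the 24-drive, 32K-chunk array described earlier
gives a full stripe of 23 data chunks * 32K = 736K, which fits in a
single 1MB bio.  A 64K chunk on the same drive count would need a
1472K stripe, which could never arrive as one bio given that limit, so
the fast path would never fire.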

Our "application" calls this from a kernel thread, so there may be
other issues that happen if you drive this from user space.  I would
have thought that O_DIRECT would work OK from user space, but have not
tried it.

There is a huge if statement that tests for the FastWrite leg in the
patch.  Pretty much all of the planets need to align for the code to
get used.  printk statements are your friend ;)

Your code changes were probably mostly "bi_iter" stuff with the new
bio iterator structure.  If the code compiles, you probably made the
correct changes.  Plus, it did not crash (this is where a serial
console is helpful ;) ).

Doug

On Mon, Jan 4, 2016 at 10:56 AM, Robert Kierski <rkierski@cray.com> wrote:
> Hey Doug,
>
> I'm trying to get the patch to work and not having much luck.
>
> I'm guessing that you're using the generic CentOS 7 kernel (3.10).  I'm using a 3.18.4 kernel, so there were a few changes I had to make to get the patch to apply.  But they weren't significant as far as I can tell, and shouldn't have caused the FastWrite code to be ignored.
>
> There must be additional changes in the kernel that are necessary.  When I create the special case MDRaid, and then use the special case IO pattern, I see only the debug messages indicating that the FastWrite code was ignored.
>
> I've added more debugging code to try to understand what's going on.  It turns out that no matter what I do, no matter how I configure the MDRaid, and no matter what IO pattern I use, the size of the IO is such that it doesn't conform to the criteria required to call FastWrite.  It seems that in the 3.18 .4 kernel, IO's are broken up before calling MDRaid's make_request function.  There doesn't seem to be an obvious relationship between the size being passed in, and the chunk size that's configured by mdadm.
>
> At first, it appeared that it was the minimum IO size being used.  But then I tried setting the minimum IO and optimal IO to the same thing.  The resulting IO's are now 1/2 the size of the minimum IO.... which is not the size of the optimal IO (which is the chunk size * number of data disks).
>
> Bob Kierski
> Senior Storage Performance Engineer
> Cray Inc.
> 380 Jackson Street
> Suite 210
> St. Paul, MN 55101
> Tele: 651-967-9590
> Fax:  651-605-9001
> Cell: 651-890-7461
>



-- 
Doug Dumitru
EasyCo LLC

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: best base / worst case RAID 5,6 write speeds
  2016-01-04 19:13                                                                                       ` Doug Dumitru
@ 2016-01-04 19:33                                                                                         ` Robert Kierski
  2016-01-04 19:43                                                                                           ` Doug Dumitru
  0 siblings, 1 reply; 60+ messages in thread
From: Robert Kierski @ 2016-01-04 19:33 UTC (permalink / raw)
  To: doug; +Cc: linux-raid

Hey Doug,

I tried all sorts of things... even things that seemed rather backwards.  I tried a variety of disk counts, chunk sizes, IO sizes, IO counts, etc.

Yes...the changes were adding the bi_iter (mostly).

I also eliminated the limit of 10 printk's so that I could try various MDRaid configurations and various block sizes and counts without having to reboot.

I would have thought that the OS would pass the IO from user space to the driver without modification.  But it appears that's not the case.

Bob Kierski
Senior Storage Performance Engineer
Cray Inc.
380 Jackson Street
Suite 210
St. Paul, MN 55101
Tele: 651-967-9590
Fax:  651-605-9001
Cell: 651-890-7461


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: best base / worst case RAID 5,6 write speeds
  2016-01-04 19:33                                                                                         ` Robert Kierski
@ 2016-01-04 19:43                                                                                           ` Doug Dumitru
  2016-01-15 16:53                                                                                             ` Robert Kierski
  0 siblings, 1 reply; 60+ messages in thread
From: Doug Dumitru @ 2016-01-04 19:43 UTC (permalink / raw)
  To: Robert Kierski; +Cc: linux-raid

Robert,

I have a C benchmark program that opens a block device with O_DIRECT
(040000 on x86 and 0200000 on arm) and reads/writes with 4K-aligned
buffers; the IOs show up as BIOs as hoped.
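
A minimal sketch of that kind of benchmark (the device path and stripe
size are placeholders; queue depth is fixed at 1 and error handling is
trimmed):

#define _GNU_SOURCE             /* for O_DIRECT */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define STRIPE_BYTES (736 * 1024)   /* placeholder: chunk * data disks */
#define STRIPES      1024

int main(void)
{
        void *buf;
        int fd = open("/dev/md0", O_WRONLY | O_DIRECT);  /* placeholder device */

        if (fd < 0) { perror("open"); return 1; }
        if (posix_memalign(&buf, 4096, STRIPE_BYTES)) return 1;
        memset(buf, 0xab, STRIPE_BYTES);

        /* Queue depth 1: each write is one full, stripe-aligned stripe. */
        for (long i = 0; i < STRIPES; i++) {
                if (pwrite(fd, buf, STRIPE_BYTES,
                           (off_t)i * STRIPE_BYTES) != STRIPE_BYTES) {
                        perror("pwrite");
                        return 1;
                }
        }
        fsync(fd);
        close(fd);
        return 0;
}

Whether each full-stripe pwrite() actually reaches raid5's
make_request as a single bio still depends on the device's bio and
segment limits, which is exactly the splitting being discussed here.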

One easy way to "test" this is to add debug code to "dm-zero" and use
it as a way to watch IOs as they get to the block stack.
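
Something along these lines (illustrative only, against a 3.14+
kernel's dm-zero, with the original read/write handling elided):

static int zero_map(struct dm_target *ti, struct bio *bio)
{
        pr_info("dm-zero: %s sector=%llu bytes=%u\n",
                bio_data_dir(bio) == WRITE ? "write" : "read",
                (unsigned long long)bio->bi_iter.bi_sector,
                bio->bi_iter.bi_size);

        /* ... original dm-zero read/write handling unchanged ... */
        return DM_MAPIO_SUBMITTED;
}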

Doug


On Mon, Jan 4, 2016 at 11:33 AM, Robert Kierski <rkierski@cray.com> wrote:
> Hey Doug,
>
> I tried all sorts of things... even things that seemed rather backwards.  I tried a variety of disk count's, chunk sizes, IO sizes, IO Counts, etc.
>
> Yes...the changes were adding the bi_iter (mostly).
>
> I also eliminated the limit of 10 printk's so that I could try various MDRaid configurations and various block size and counts without having to reboot.
>
> I would have thought that the OS would pass the IO from user space to the driver without modification.  But it appears that's not the case.
>
> Bob Kierski
> Senior Storage Performance Engineer
> Cray Inc.
> 380 Jackson Street
> Suite 210
> St. Paul, MN 55101
> Tele: 651-967-9590
> Fax:  651-605-9001
> Cell: 651-890-7461
>



-- 
Doug Dumitru
EasyCo LLC

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: best base / worst case RAID 5,6 write speeds
  2016-01-04 19:43                                                                                           ` Doug Dumitru
@ 2016-01-15 16:53                                                                                             ` Robert Kierski
  0 siblings, 0 replies; 60+ messages in thread
From: Robert Kierski @ 2016-01-15 16:53 UTC (permalink / raw)
  To: doug; +Cc: linux-raid

Hey Doug,

I tried as you suggested... only that didn't help.  So I actually had to get down into the guts to figure out what was going wrong.

It turned out to be a bug in the block layer of the 3.18 kernel that required a one-line change.  With that fixed, the FastWrite patch works like a charm.

Thanks!

Bob Kierski
Senior Storage Performance Engineer
Cray Inc.
380 Jackson Street
Suite 210
St. Paul, MN 55101
Tele: 651-967-9590
Fax:  651-605-9001
Cell: 651-890-7461


^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2016-01-15 16:53 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-10  1:34 best base / worst case RAID 5,6 write speeds Dallas Clement
2015-12-10  6:36 ` Alexander Afonyashin
2015-12-10 14:38   ` Dallas Clement
2015-12-10 15:14 ` John Stoffel
2015-12-10 18:40   ` Dallas Clement
     [not found]     ` <CAK2H+ed+fe5Wr0B=h5AzK5_=ougQtW_6cJcUG_S_cg+WfzDb=Q@mail.gmail.com>
2015-12-10 19:26       ` Dallas Clement
2015-12-10 19:33         ` John Stoffel
2015-12-10 22:19           ` Wols Lists
2015-12-10 19:28     ` John Stoffel
2015-12-10 22:23       ` Wols Lists
2015-12-10 20:06 ` Phil Turmel
2015-12-10 20:09   ` Dallas Clement
2015-12-10 20:29     ` Phil Turmel
2015-12-10 21:14       ` Dallas Clement
2015-12-10 21:32         ` Phil Turmel
     [not found]     ` <CAK2H+ednN7dCGzcOt8TxgNdhdDA1mN6Xr5P8vQ+Y=-uRoxRksw@mail.gmail.com>
2015-12-11  0:02       ` Dallas Clement
     [not found]         ` <CAK2H+efF2dM1BsM7kzfTxMdQEHvbWRaVe7zJLTGcPZzafn2M6A@mail.gmail.com>
2015-12-11  0:41           ` Dallas Clement
2015-12-11  1:19             ` Dallas Clement
     [not found]               ` <CAK2H+ec-zMbhxoFyHXLkdM-z-9cYYzNbPFhn19XjTHqrOMDZKQ@mail.gmail.com>
2015-12-11 15:44                 ` Dallas Clement
2015-12-11 16:32                   ` John Stoffel
2015-12-11 16:47                     ` Dallas Clement
2015-12-11 19:34                       ` John Stoffel
2015-12-11 21:24                         ` Dallas Clement
2015-12-11 23:30                           ` Dallas Clement
2015-12-12  0:00                             ` Dallas Clement
2015-12-12  0:38                               ` Phil Turmel
2015-12-12  2:55                                 ` Dallas Clement
2015-12-12  4:47                                   ` Phil Turmel
2015-12-14 20:14                                     ` Dallas Clement
     [not found]                                       ` <CAK2H+edazVORrVovWDeTA8DmqUL+5HRH-AcRwg8KkMas=o+Cog@mail.gmail.com>
2015-12-14 20:55                                         ` Dallas Clement
     [not found]                                           ` <CAK2H+ed-3Z8SR20t8rpt3Fb48c3X2Jft=qZoiY9emC2nQww1xQ@mail.gmail.com>
2015-12-14 21:20                                             ` Dallas Clement
2015-12-14 22:05                                               ` Dallas Clement
2015-12-14 22:31                                                 ` Tommy Apel
     [not found]                                                 ` <CAK2H+ecMvDLdYLhMtMQbP7Ygw-VohG7LGZ2n7H+LAXQ1waJK3A@mail.gmail.com>
2015-12-14 23:25                                                   ` Dallas Clement
2015-12-15  2:36                                                     ` Dallas Clement
2015-12-15 13:53                                                       ` Phil Turmel
2015-12-15 14:09                                                       ` Robert Kierski
2015-12-15 15:14                                                       ` John Stoffel
2015-12-15 17:30                                                         ` Dallas Clement
2015-12-15 19:22                                                           ` Phil Turmel
2015-12-15 19:44                                                             ` Dallas Clement
2015-12-15 19:52                                                               ` Phil Turmel
2015-12-15 21:54                                                           ` John Stoffel
2015-12-15 23:07                                                             ` Dallas Clement
2015-12-16 15:31                                                               ` Dallas Clement
     [not found]                                                                 ` <CAK2H+eeD2k4yzuvL4uF_qKycp6A=XPe8pVF_J-7Agi8Ze89PPQ@mail.gmail.com>
2015-12-17  5:57                                                                   ` Dallas Clement
2015-12-17 13:41                                                                   ` Phil Turmel
2015-12-17 21:08                                                                     ` Dallas Clement
2015-12-17 22:40                                                                       ` Phil Turmel
2015-12-17 23:28                                                                         ` Dallas Clement
2015-12-18  0:54                                                                           ` Dallas Clement
     [not found]                                                                             ` <CAFx4rwT8xgwZ0OWaLLsZvhMskiwmY54MzHgnnEPaswByeRrXxQ@mail.gmail.com>
2015-12-22  6:15                                                                               ` Doug Dumitru
2015-12-22 14:34                                                                                 ` Robert Kierski
2015-12-22 16:48                                                                                 ` Dallas Clement
2015-12-22 18:33                                                                                   ` Doug Dumitru
2016-01-04 18:56                                                                                     ` Robert Kierski
2016-01-04 19:13                                                                                       ` Doug Dumitru
2016-01-04 19:33                                                                                         ` Robert Kierski
2016-01-04 19:43                                                                                           ` Doug Dumitru
2016-01-15 16:53                                                                                             ` Robert Kierski
