* RAID 5,6 sequential writing seems slower in newer kernels
@ 2015-12-01 23:02 Dallas Clement
  2015-12-02  1:07 ` keld
                   ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Dallas Clement @ 2015-12-01 23:02 UTC (permalink / raw)
  To: linux-raid

Hi,

I have a NAS system with 12 spinning disks that has been running with
the 2.6.39.4 kernel. It has a 4 core xeon processor (E31275 @
3.40GHz), with 8 GB of RAM.  The 12 disks in my RAID array are Hitachi
4 TB, 7200 RPM SATA drives.  The filesystem is XFS.

Recently I have been evaluating RAID performance on newer kernels 3.10
and 4.2.  I have observed that with the same settings, I am seeing
much slower RAID 5 and 6 sequential write speeds with newer kernels
compared to what I was seeing with the 2.6.39.4 kernel.  However, the
4.2 kernel has much better read speeds for both sequential and random
patterns.  I understand that there have been many improvements to RAID
5 and 6 in the 4.1 kernel.  I definitely am seeing improvement with
reads but not writes.

If I observe disk and array throughput with iostat, the individual
disk utilization and wMB/s is much lower in the newer kernels.  With
the older 2.6.39.4 kernel, disk utilization seems to stay above 80%
with wMB/s around 74 MB/s, whereas the newer kernel disk utilization
seems to vary between 20-70% with wMB/s around 9-38 MB/s.  CPU iowait
gets up to about 10% much of the time.  These Hitachi disks are
capable of sustaining around 170 MB/s, which is just about what I see
when doing sequential writes to all 12 disks concurrently in a JBOD
configuration, i.e. no RAID.  The iowait for 12 disks of JBOD gets up
to about 97% - which makes the system very unresponsive.

One other observation is that RAID 0 sequential write speeds in newer
kernels are only slightly less than what I was seeing in 2.6.39.4.

I am frankly surprised at these results.  Perhaps there are some
configuration or tunable settings that have changed since the 2.6
kernel that I am unaware of that affect RAID 5, 6 performance.  Please
comment if you have any ideas which might explain what I am seeing.

Thanks,

Dallas

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-01 23:02 RAID 5,6 sequential writing seems slower in newer kernels Dallas Clement
@ 2015-12-02  1:07 ` keld
  2015-12-02 14:18   ` Robert Kierski
  2015-12-02  5:22 ` Roman Mamedov
  2015-12-02 14:15 ` Robert Kierski
  2 siblings, 1 reply; 35+ messages in thread
From: keld @ 2015-12-02  1:07 UTC (permalink / raw)
  To: Dallas Clement; +Cc: linux-raid

Hi Dallas

Did you test the performance of other raid types, such as RAID1 and the various
layouts of RAID10 for the newer kernels?

Best regards
keld

On Tue, Dec 01, 2015 at 05:02:27PM -0600, Dallas Clement wrote:
> Hi,
> 
> I have a NAS system with 12 spinning disks that has been running with
> the 2.6.39.4 kernel. It has a 4 core xeon processor (E31275 @
> 3.40GHz), with 8 GB of RAM.  The 12 disks in my RAID array are Hitachi
> 4 TB, 7200 RPM SATA drives.  The filesystem is XFS.
> 
> Recently I have been evaluating RAID performance on newer kernels 3.10
> and 4.2.  I have observed that with the same settings, I am seeing
> much slower RAID 5 and 6 sequential write speeds with newer kernels
> compared to what I was seeing with the 2.6.39.4 kernel.  However, the
> 4.2 kernel has much better read speeds for both sequential and random
> patterns.  I understand that there have been many improvements to RAID
> 5 and 6 in the 4.1 kernel.  I definitely am seeing improvement with
> reads but not writes.
> 
> If I observe disk and array throughput with iostat, the individual
> disk utilization and wMB/s is much lower in the newer kernels.  With
> the older 2.6.39.4 kernel, disk utilization seems to stay above 80%
> with wMB/s around 74 MB/s, whereas the newer kernel disk utilization
> seems to vary between 20-70% with wMB/s around 9-38 MB/s.  CPU iowait
> gets up to about 10% much of the time.  These Hitachi disks are
> capable of sustaining around 170 MB/s, which is just about what I see
> when doing sequential writes to all 12 disks concurrently in a JBOD
> configuration, i.e. no RAID.  The iowait for 12 disks of JBOD gets up
> to about 97% - which makes the system very unresponsive.
> 
> One other observation is that RAID 0 sequential write speeds in newer
> kernels are only slightly less than what I was seeing in 2.6.39.4.
> 
> I am frankly surprised at these results.  Perhaps there are some
> configuration or tunable settings that have changed since the 2.6
> kernel that I am unaware of that affect RAID 5, 6 performance.  Please
> comment if you have any ideas which might explain what I am seeing.
> 
> Thanks,
> 
> Dallas

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-01 23:02 RAID 5,6 sequential writing seems slower in newer kernels Dallas Clement
  2015-12-02  1:07 ` keld
@ 2015-12-02  5:22 ` Roman Mamedov
  2015-12-02 14:15 ` Robert Kierski
  2 siblings, 0 replies; 35+ messages in thread
From: Roman Mamedov @ 2015-12-02  5:22 UTC (permalink / raw)
  To: Dallas Clement; +Cc: linux-raid

On Tue, 1 Dec 2015 17:02:27 -0600
Dallas Clement <dallas.a.clement@gmail.com> wrote:

> I am frankly surprised at these results.  Perhaps there are some
> configuration or tunable settings that have changed since the 2.6
> kernel that I am unaware of that affect RAID 5, 6 performance.  Please
> comment if you have any ideas which might explain what I am seeing.

Do you use a write intent bitmap (internal?), what is your bitmap chunk size?

What is your stripe_cache_size set to?
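
For reference, a quick way to check both (assuming the array is /dev/md0 and
/dev/sda1 is one of its members, which may not match your setup) would be
something like:

mdadm --detail /dev/md0 | grep -i bitmap    # shows whether an internal bitmap is present
mdadm --examine-bitmap /dev/sda1            # prints the bitmap chunk size for an internal bitmap
cat /sys/block/md0/md/stripe_cache_size     # current stripe cache size (raid4/5/6 only)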

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-01 23:02 RAID 5,6 sequential writing seems slower in newer kernels Dallas Clement
  2015-12-02  1:07 ` keld
  2015-12-02  5:22 ` Roman Mamedov
@ 2015-12-02 14:15 ` Robert Kierski
  2 siblings, 0 replies; 35+ messages in thread
From: Robert Kierski @ 2015-12-02 14:15 UTC (permalink / raw)
  To: Dallas Clement, linux-raid

Wow!!! You've stolen my thunder.  But I can do you one better.

I've got DDR3 RAM disks that are capable of sustained performance of about 6G/s each.  When combined into a 8+2 RAID6, I get only about 3G/s doing sequential writes, and about 42G/s on reads.  CPU is 96.8% idle for the writes.

In iostat, I'm seeing about 10x the amount of data going to each individual disk compared to what my user app is writing to the raid device.

Bob Kierski
Senior Storage Performance Engineer
Cray Inc.
380 Jackson Street
Suite 210
St. Paul, MN 55101
Tele: 651-967-9590
Fax:  651-605-9001
Cell: 651-890-7461


^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-02  1:07 ` keld
@ 2015-12-02 14:18   ` Robert Kierski
  2015-12-02 14:45     ` Phil Turmel
  0 siblings, 1 reply; 35+ messages in thread
From: Robert Kierski @ 2015-12-02 14:18 UTC (permalink / raw)
  To: Dallas Clement; +Cc: linux-raid

Sorry... I should have mentioned that I'm running the 3.18.4 kernel with a 32 core xeon and 128G of memory.  I'm not using a FS, I'm going directly to the raid block device.

Bob Kierski
Senior Storage Performance Engineer
Cray Inc.
380 Jackson Street
Suite 210
St. Paul, MN 55101
Tele: 651-967-9590
Fax:  651-605-9001
Cell: 651-890-7461



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-02 14:18   ` Robert Kierski
@ 2015-12-02 14:45     ` Phil Turmel
  2015-12-02 15:28       ` Robert Kierski
  2015-12-02 15:37       ` Robert Kierski
  0 siblings, 2 replies; 35+ messages in thread
From: Phil Turmel @ 2015-12-02 14:45 UTC (permalink / raw)
  To: Robert Kierski, Dallas Clement; +Cc: linux-raid

On 12/02/2015 09:18 AM, Robert Kierski wrote:
> Sorry... I should have mentioned that I'm running the 3.18.4 kernel
> with a 32 core xeon and 128G of memory.  I'm not using a FS, I'm
> going directly to the raid block device.

I'm not sure if the parallelization of raid parity has been merged yet,
but I'm pretty sure it isn't in 3.18.  With one core tied up computing
parity and the rest idle, that'd be 96.875% idle.

Phil

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-02 14:45     ` Phil Turmel
@ 2015-12-02 15:28       ` Robert Kierski
  2015-12-02 15:37         ` Phil Turmel
  2015-12-02 15:37       ` Robert Kierski
  1 sibling, 1 reply; 35+ messages in thread
From: Robert Kierski @ 2015-12-02 15:28 UTC (permalink / raw)
  To: Phil Turmel, Dallas Clement; +Cc: linux-raid

Thanks for the response.

Nice try... But, the reason I’m using the 3.18.4 kernel is that it has the parallelization.  I've got group_thread_cnt set to 32.  I'm watching the CPU's with mpstat, and they're pretty much idle.  I'm also watching the system traces with perf.  It claims that only 11.9% of my time is spent doing the xor.

I've got my CS set at 128k.  I have noticed that if I set the CS to 32k, the TP is about 2x.  I'm pretty sure the problem is that the 1M writes I'm doing are being broken into 4K pages, and then reassembled before going to disk.

Also, this is independent of the IO Scheduler.  I've tried all 3 and got the same results.
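
For anyone reproducing this, those knobs live under sysfs; a rough sketch,
assuming the array is /dev/md0 and /dev/sda is one of its members:

cat /sys/block/md0/md/group_thread_cnt       # auxiliary raid5/6 stripe-handling threads
echo 32 > /sys/block/md0/md/group_thread_cnt
cat /sys/block/md0/md/chunk_size             # chunk size in bytes
cat /sys/block/sda/queue/scheduler           # per-member I/O scheduler

(The group_thread_cnt and stripe cache entries only appear for raid4/5/6 arrays.)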

Bob Kierski
Senior Storage Performance Engineer
Cray Inc.
380 Jackson Street
Suite 210
St. Paul, MN 55101
Tele: 651-967-9590
Fax:  651-605-9001
Cell: 651-890-7461


-----Original Message-----
From: Phil Turmel [mailto:philip@turmel.org] 
Sent: Wednesday, December 02, 2015 8:45 AM
To: Robert Kierski; Dallas Clement
Cc: linux-raid@vger.kernel.org
Subject: Re: RAID 5,6 sequential writing seems slower in newer kernels

On 12/02/2015 09:18 AM, Robert Kierski wrote:
> Sorry... I should have mentioned that I'm running the 3.18.4 kernel 
> with a 32 core xeon and 128G of memory.  I'm not using a FS, I'm going 
> directly to the raid block device.

I'm not sure if the parallelization of raid parity has been merged yet, but I'm pretty sure it isn't in 3.18.  With one core tied up computing parity and the rest idle, that'd be 96.875% idle.

Phil

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-02 14:45     ` Phil Turmel
  2015-12-02 15:28       ` Robert Kierski
@ 2015-12-02 15:37       ` Robert Kierski
  1 sibling, 0 replies; 35+ messages in thread
From: Robert Kierski @ 2015-12-02 15:37 UTC (permalink / raw)
  To: Phil Turmel, Dallas Clement; +Cc: linux-raid

Just for completeness... I should also mention that... 

The 32 core system has the AVX2 extension.  I've tried this on a 12 Core system without the AVX2 extension.  In that case, it's using SSE24 to do the xor.

I get pretty much the same results on either system.

Bob Kierski
Senior Storage Performance Engineer
Cray Inc.
380 Jackson Street
Suite 210
St. Paul, MN 55101
Tele: 651-967-9590
Fax:  651-605-9001
Cell: 651-890-7461


-----Original Message-----
From: Phil Turmel [mailto:philip@turmel.org] 
Sent: Wednesday, December 02, 2015 8:45 AM
To: Robert Kierski; Dallas Clement
Cc: linux-raid@vger.kernel.org
Subject: Re: RAID 5,6 sequential writing seems slower in newer kernels

On 12/02/2015 09:18 AM, Robert Kierski wrote:
> Sorry... I should have mentioned that I'm running the 3.18.4 kernel 
> with a 32 core xeon and 128G of memory.  I'm not using a FS, I'm going 
> directly to the raid block device.

I'm not sure if the parallelization of raid parity has been merged yet, but I'm pretty sure it isn't in 3.18.  With one core tied up computing parity and the rest idle, that'd be 96.875% idle.

Phil

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-02 15:28       ` Robert Kierski
@ 2015-12-02 15:37         ` Phil Turmel
  2015-12-02 15:44           ` Robert Kierski
  0 siblings, 1 reply; 35+ messages in thread
From: Phil Turmel @ 2015-12-02 15:37 UTC (permalink / raw)
  To: Robert Kierski, Dallas Clement; +Cc: linux-raid

On 12/02/2015 10:28 AM, Robert Kierski wrote:
> Thanks for the response.
> 
> Nice try... But, the reason I’m using the 3.18.4 kernel is that it has the parallelization.  I've got group_thread_cnt set to 32.  I'm watching the CPU's with mpstat, and they're pretty much idle.  I'm also watching the system traces with perf.  It claims that only 11.9% of my time is spent doing the xor.

Hmm. Ok.

> I've got my CS set at 128k.  I have noticed that if I set the CS to 32k, the TP is about 2x.  I'm pretty sure the problem is that the 1M writes I'm doing are being broken into 4K pages, and then reassembled before going to disk.

I think you're right.  What is your stripe cache size?

> Also, this is independent of the IO Scheduler.  I've tried all 3 and got the same results.

If your stripe cache is too small, sequential writes with large chunks
can exhaust the cache before complete stripes are written, turning all
of those partial stripe writes into read-modify-write cycles.
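
A quick way to check, assuming the array is /dev/md0: bump the cache and watch
how much of it is actually in use while the test runs.

echo 8192 > /sys/block/md0/md/stripe_cache_size   # entries; each pins one page per member device
cat /sys/block/md0/md/stripe_cache_active         # stripes in flight; sitting at the limit suggests exhaustion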

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-02 15:37         ` Phil Turmel
@ 2015-12-02 15:44           ` Robert Kierski
  2015-12-02 15:51             ` Phil Turmel
  0 siblings, 1 reply; 35+ messages in thread
From: Robert Kierski @ 2015-12-02 15:44 UTC (permalink / raw)
  To: Phil Turmel, Dallas Clement; +Cc: linux-raid

I've tried a variety of settings... ranging from 17 to 32768.

Yes.. with stripe_cache_size set to 17, I see a C/T of rmw's.  And my TP goes in the toilet -- even with the RAM disks, I get only about 30M/s.

Bob Kierski
Senior Storage Performance Engineer
Cray Inc.
380 Jackson Street
Suite 210
St. Paul, MN 55101
Tele: 651-967-9590
Fax:  651-605-9001
Cell: 651-890-7461


-----Original Message-----
From: Phil Turmel [mailto:philip@turmel.org] 
Sent: Wednesday, December 02, 2015 9:37 AM
To: Robert Kierski; Dallas Clement
Cc: linux-raid@vger.kernel.org
Subject: Re: RAID 5,6 sequential writing seems slower in newer kernels

On 12/02/2015 10:28 AM, Robert Kierski wrote:
> Thanks for the response.
> 
> Nice try... But, the reason I’m using the 3.18.4 kernel is that it has the parallelization.  I've got group_thread_cnt set to 32.  I'm watching the CPU's with mpstat, and they're pretty much idle.  I'm also watching the system traces with perf.  It claims that only 11.9% of my time is spent doing the xor.

Hmm. Ok.

> I've got my CS set at 128k.  I have noticed that if I set the CS to 32k, the TP is about 2x.  I'm pretty sure the problem is that the 1M writes I'm doing are being broken into 4K pages, and then reassembled before going to disk.

I think you're right.  What is your stripe cache size?

> Also, this is independent of the IO Scheduler.  I've tried all 3 and got the same results.

If your stripe cache is too small, sequential writes with large chunks can exhaust the cache before complete stripes are written, turning all of those partial stripe writes into read-modify-write cycles.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-02 15:44           ` Robert Kierski
@ 2015-12-02 15:51             ` Phil Turmel
  2015-12-02 19:50               ` Dallas Clement
  0 siblings, 1 reply; 35+ messages in thread
From: Phil Turmel @ 2015-12-02 15:51 UTC (permalink / raw)
  To: Robert Kierski, Dallas Clement; +Cc: linux-raid

On 12/02/2015 10:44 AM, Robert Kierski wrote:
> I've tried a variety of settings... ranging from 17 to 32768.
> 
> Yes.. with stripe_cache_size set to 17, I see a C/T of rmw's.  And my TP goes in the toilet -- even with the RAM disks, I get only about 30M/s.

Ok.

You mentioned you aren't using a filesystem.  How are you testing?

Phil

ps. convention on kernel.org is to trim replies and bottom-post, or
interleave.  Please do.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-02 15:51             ` Phil Turmel
@ 2015-12-02 19:50               ` Dallas Clement
  2015-12-03  0:12                 ` Dallas Clement
  2015-12-03 14:19                 ` Robert Kierski
  0 siblings, 2 replies; 35+ messages in thread
From: Dallas Clement @ 2015-12-02 19:50 UTC (permalink / raw)
  To: linux-raid

On Wed, Dec 2, 2015 at 9:51 AM, Phil Turmel <philip@turmel.org> wrote:
> On 12/02/2015 10:44 AM, Robert Kierski wrote:
>> I've tried a variety of settings... ranging from 17 to 32768.
>>
>> Yes.. with stripe_cache_size set to 17, I see a C/T of rmw's.  And my TP goes in the toilet -- even with the RAM disks, I get only about 30M/s.
>
> Ok.
>
> You mentioned you aren't using a filesystem.  How are you testing?
>
> Phil
>
> ps. convention on kernel.org is to trim replies and bottom-post, or
> interleave.  Please do.

Thank you all for your responses.

Keld,

> Did you test the performance of other raid types, such as RAID1 and the various layouts of RAID10 for the newer kernels?

I did try RAID 1 but not RAID 10.  With RAID 1 I am seeing much higher
average and peak wMB/s and disk utilization than with RAID 5 and 6.
Though I need to run some more tests to compare the performance of
newer kernels with the 2.6.39.4 kernel.  Will report on that a bit
later.

Roman,

> Do you use a write intent bitmap (internal?), what is your bitmap chunk size?

Yes, I do.  After reading up on this, I see that it can negatively
affect write performance.  The bitmap chunk size is 67108864.
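
If the bitmap turns out to matter, it can be dropped or recreated with a larger
chunk on a live array -- a sketch, assuming the array is /dev/md0:

mdadm --grow --bitmap=none /dev/md0                          # remove the internal write-intent bitmap
mdadm --grow --bitmap=internal --bitmap-chunk=256M /dev/md0  # re-add it with a larger chunk

(A larger bitmap chunk means fewer bitmap updates per write, at the cost of more
resync work after a crash.)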

> What is your stripe_cache_size set to?

stripe_cache_size is 8192

Robert, like you I am observing that my CPU is mostly idle during RAID
5 or 6 write testing.  Something else is throttling the traffic.  Not
sure if there is some threshold crossing i.e. queue size, await time
etc. that is causing this or if it is an implementation problem.

I understand that the stripe cache grows dynamically in >= 4.1
kernels.   Fwiw, adjusting the stripe cache made no difference in my
results.

Regards,

Dallas

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-02 19:50               ` Dallas Clement
@ 2015-12-03  0:12                 ` Dallas Clement
  2015-12-03  2:18                   ` Phil Turmel
  2015-12-03 14:19                 ` Robert Kierski
  1 sibling, 1 reply; 35+ messages in thread
From: Dallas Clement @ 2015-12-03  0:12 UTC (permalink / raw)
  To: linux-raid

On Wed, Dec 2, 2015 at 1:50 PM, Dallas Clement
<dallas.a.clement@gmail.com> wrote:
> On Wed, Dec 2, 2015 at 9:51 AM, Phil Turmel <philip@turmel.org> wrote:
>> On 12/02/2015 10:44 AM, Robert Kierski wrote:
>>> I've tried a variety of settings... ranging from 17 to 32768.
>>>
>>> Yes.. with stripe_cache_size set to 17, I see a C/T of rmw's.  And my TP goes in the toilet -- even with the RAM disks, I get only about 30M/s.
>>
>> Ok.
>>
>> You mentioned you aren't using a filesystem.  How are you testing?
>>
>> Phil
>>
>> ps. convention on kernel.org is to trim replies and bottom-post, or
>> interleave.  Please do.
>
> Thank you all for your responses.
>
> Keld,
>
>> Did you test the performance of other raid types, such as RAID1 and the various layouts of RAID10 for the newer kernels?
>
> I did try RAID 1 but not RAID 10.  With RAID 1 I am seeing much higher
> average and peak wMB/s and disk utilization than with RAID 5 and 6.
> Though I need to run some more tests to compare the performance of
> newer kernels with the 2.6.39.4 kernel.  Will report on that a bit
> later.
>
> Roman,
>
>> Do you use a write intent bitmap (internal?), what is your bitmap chunk size?
>
> Yes, I do.  After reading up on this, I see that it can negatively
> affect write performance.  The bitmap chunk size is 67108864.
>
>> What is your stripe_cache_size set to?
>
> stripe_cache_size is 8192
>
> Robert, like you I am observing that my CPU is mostly idle during RAID
> 5 or 6 write testing.  Something else is throttling the traffic.  Not
> sure if there is some threshold crossing i.e. queue size, await time
> etc. that is causing this or if it is an implementation problem.
>
> I understand that the stripe cache grows dynamically in >= 4.1
> kernels.   Fwiw, adjusting the stripe cache made no difference in my
> results.
>
> Regards,
>
> Dallas

Here is a summary of the performance differences I am seeing with the
3.10.69 kernel vs the 2.6.39.4 kernel (baseline):

RAID 0

bs = 512k - 3.5% slower
bs = 2048k - 1.5% slower

RAID 1

bs = 512k - 35% faster
bs = 2048k - 48% faster

RAID 5

bs = 512k - 22% slower
bs = 2048k - 28% slower

RAID 6

bs = 512k - 24% slower
bs = 2048k - 30% slower

Surprisingly, RAID 1 is faster in the newer kernel, but RAID 5 & 6 are much slower.

All measurements computed from bandwidth averages taken on 12 disk
array with XFS filesystem using fio with direct=1, sync=1,
invalidate=1.
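
Roughly, the fio job looks like this (a sketch; the target directory, file size,
and job count here are placeholders rather than the exact job file):

fio --name=seqwrite --directory=/mnt/raid --size=4G --rw=write \
    --bs=2048k --direct=1 --sync=1 --invalidate=1 --numjobs=1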

Seems incredible!?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-03  0:12                 ` Dallas Clement
@ 2015-12-03  2:18                   ` Phil Turmel
  2015-12-03  2:24                     ` Dallas Clement
  0 siblings, 1 reply; 35+ messages in thread
From: Phil Turmel @ 2015-12-03  2:18 UTC (permalink / raw)
  To: Dallas Clement, linux-raid

On 12/02/2015 07:12 PM, Dallas Clement wrote:
> All measurements computed from bandwidth averages taken on 12 disk
> array with XFS filesystem using fio with direct=1, sync=1,
> invalidate=1.

Why do you need direct=1 and sync=1 ?  Have you checked an strace from
the app you are trying to model that shows it uses these?

> Seems incredible!?

Not with those options.  Particularly sync=1.  That causes an inode
stats update and a hardware queue flush after every write operation.
Support for that on various devices has changed over time.

I suspect if you do a bisect on the kernel to pinpoint the change(s)
that is doing this, you'll find a patch that closes a device-specific or
filesystem sync bug or something that enables deep queues for a device.

Modern software that needs file integrity guarantees makes sparing use of
fdatasync and/or fsync and avoids sync entirely.  You'll have a more
believable test if you use fsync_on_close=1 or end_fsync=1.
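
Concretely, something along these lines (a sketch; the directory, size, and
block size are placeholders):

fio --name=seqwrite --directory=/mnt/raid --size=4G --rw=write \
    --bs=2048k --direct=1 --end_fsync=1 --invalidate=1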

Phil

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-03  2:18                   ` Phil Turmel
@ 2015-12-03  2:24                     ` Dallas Clement
  2015-12-03  2:33                       ` Dallas Clement
  2015-12-03  2:34                       ` Phil Turmel
  0 siblings, 2 replies; 35+ messages in thread
From: Dallas Clement @ 2015-12-03  2:24 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

On Wed, Dec 2, 2015 at 8:18 PM, Phil Turmel <philip@turmel.org> wrote:
> On 12/02/2015 07:12 PM, Dallas Clement wrote:
>> All measurements computed from bandwidth averages taken on 12 disk
>> array with XFS filesystem using fio with direct=1, sync=1,
>> invalidate=1.
>
> Why do you need direct=1 and sync=1 ?  Have you checked an strace from
> the app you are trying to model that shows it uses these?
>
>> Seems incredible!?
>
> Not with those options.  Particularly sync=1.  That causes an inode
> stats update and a hardware queue flush after every write operation.
> Support for that on various devices has changed over time.
>
> I suspect if you do a bisect on the kernel to pinpoint the change(s)
> that is doing this, you'll find a patch that closes a device-specific or
> filesystem sync bug or something that enables deep queues for a device.
>
>> Modern software that needs file integrity guarantees makes sparing use of
>> fdatasync and/or fsync and avoids sync entirely.  You'll have a more
> believable test if you use fsync_on_close=1 or end_fsync=1.
>
> Phil

Hi Phil.  Hmm, that makes sense that something may have changed with respect to
syncing.  Basically what I am trying to do with my fio testing is
avoid any asynchronous or caching behavior.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-03  2:24                     ` Dallas Clement
@ 2015-12-03  2:33                       ` Dallas Clement
  2015-12-03  2:38                         ` Phil Turmel
  2015-12-03  2:34                       ` Phil Turmel
  1 sibling, 1 reply; 35+ messages in thread
From: Dallas Clement @ 2015-12-03  2:33 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

On Wed, Dec 2, 2015 at 8:24 PM, Dallas Clement
<dallas.a.clement@gmail.com> wrote:
> On Wed, Dec 2, 2015 at 8:18 PM, Phil Turmel <philip@turmel.org> wrote:
>> On 12/02/2015 07:12 PM, Dallas Clement wrote:
>>> All measurements computed from bandwidth averages taken on 12 disk
>>> array with XFS filesystem using fio with direct=1, sync=1,
>>> invalidate=1.
>>
>> Why do you need direct=1 and sync=1 ?  Have you checked an strace from
>> the app you are trying to model that shows it uses these?
>>
>>> Seems incredible!?
>>
>> Not with those options.  Particularly sync=1.  That causes an inode
>> stats update and a hardware queue flush after every write operation.
>> Support for that on various devices has changed over time.
>>
>> I suspect if you do a bisect on the kernel to pinpoint the change(s)
>> that is doing this, you'll find a patch that closes a device-specific or
>> filesystem sync bug or something that enables deep queues for a device.
>>
>> Modern software that needs file integrity guarantees makes sparing use of
>> fdatasync and/or fsync and avoids sync entirely.  You'll have a more
>> believable test if you use fsync_on_close=1 or end_fsync=1.
>>
>> Phil
>
> Hi Phil.  Hmm, that makes sense that something may have changed with respect to
> syncing.  Basically what I am trying to do with my fio testing is
> avoid any asynchronous or caching behavior.

I'm not sure that the sync=1 has any effect in this case where I've
got direct=1 set (for non buffered I/O).  I think the sync=1 flag only
matters for buffered I/O.  I really shouldn't be setting that flag at
all.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-03  2:24                     ` Dallas Clement
  2015-12-03  2:33                       ` Dallas Clement
@ 2015-12-03  2:34                       ` Phil Turmel
  1 sibling, 0 replies; 35+ messages in thread
From: Phil Turmel @ 2015-12-03  2:34 UTC (permalink / raw)
  To: Dallas Clement; +Cc: linux-raid

On 12/02/2015 09:24 PM, Dallas Clement wrote:
> On Wed, Dec 2, 2015 at 8:18 PM, Phil Turmel <philip@turmel.org> wrote:
>> I suspect if you do a bisect on the kernel to pinpoint the change(s)
>> that is doing this, you'll find a patch that closes a device-specific or
>> filesystem sync bug or something that enables deep queues for a device.
>>
>> Modern software that needs file integrity guarantees makes sparing use of
>> fdatasync and/or fsync and avoids sync entirely.  You'll have a more
>> believable test if you use fsync_on_close=1 or end_fsync=1.
>>
>> Phil
> 
> Hi Phil.  Hmm, that makes sense that something may have changed with respect to
> syncing.  Basically what I am trying to do with my fio testing is
> avoid any asynchronous or caching behavior.

I hope that if you really need this you are doing exhaustive testing on
failure modes -- I would be worried that these speed changes imply flaws
in the older kernels.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-03  2:33                       ` Dallas Clement
@ 2015-12-03  2:38                         ` Phil Turmel
  2015-12-03  2:51                           ` Dallas Clement
  0 siblings, 1 reply; 35+ messages in thread
From: Phil Turmel @ 2015-12-03  2:38 UTC (permalink / raw)
  To: Dallas Clement; +Cc: linux-raid

On 12/02/2015 09:33 PM, Dallas Clement wrote:

> I'm not sure that the sync=1 has any effect in this case where I've
> got direct=1 set (for non buffered I/O).  I think the sync=1 flag only
> matters for buffered I/O.  I really shouldn't be setting that flag at
> all.

It's substantially different from direct=1.  O_DIRECT just bypasses the
kernel's caches.  O_SYNC flushes the file data and filesystem metadata,
and kills the device caches and queues.
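
At the dd level the difference looks roughly like this (a sketch; the target
device and count are placeholders):

dd if=/dev/zero of=/dev/md0 bs=1M count=1024 oflag=direct        # O_DIRECT: bypass the page cache only
dd if=/dev/zero of=/dev/md0 bs=1M count=1024 oflag=direct,sync   # plus O_SYNC: each write must reach stable media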

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-03  2:38                         ` Phil Turmel
@ 2015-12-03  2:51                           ` Dallas Clement
  2015-12-03  4:30                             ` Phil Turmel
  0 siblings, 1 reply; 35+ messages in thread
From: Dallas Clement @ 2015-12-03  2:51 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

On Wed, Dec 2, 2015 at 8:38 PM, Phil Turmel <philip@turmel.org> wrote:
> On 12/02/2015 09:33 PM, Dallas Clement wrote:
>
>> I'm not sure that the sync=1 has any effect in this case where I've
>> got direct=1 set (for non buffered I/O).  I think the sync=1 flag only
>> matters for buffered I/O.  I really shouldn't be setting that flag at
>> all.
>
> It's substantially different from direct=1.  O_DIRECT just bypasses the
> kernel's caches.  O_SYNC flushes the file data and filesystem metadata,
> and kills the device caches and queues.

Isn't O_SYNC only applicable for buffered I/O or going through the
kernel caches?  If I'm using O_DIRECT, seems like it should just
ignore this flag.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-03  2:51                           ` Dallas Clement
@ 2015-12-03  4:30                             ` Phil Turmel
  2015-12-03  4:49                               ` Dallas Clement
  2015-12-03 13:43                               ` Robert Kierski
  0 siblings, 2 replies; 35+ messages in thread
From: Phil Turmel @ 2015-12-03  4:30 UTC (permalink / raw)
  To: Dallas Clement; +Cc: linux-raid

On 12/02/2015 09:51 PM, Dallas Clement wrote:
> On Wed, Dec 2, 2015 at 8:38 PM, Phil Turmel <philip@turmel.org> wrote:
>> On 12/02/2015 09:33 PM, Dallas Clement wrote:
>>
>>> I'm not sure that the sync=1 has any effect in this case where I've
>>> got direct=1 set (for non buffered I/O).  I think the sync=1 flag only
>>> matters for buffered I/O.  I really shouldn't be setting that flag at
>>> all.
>>
>> It's substantially different from direct=1.  O_DIRECT just bypasses the
>> kernel's caches.  O_SYNC flushes the file data and filesystem metadata,
>> and kills the device caches and queues.
> 
> Isn't O_SYNC only applicable for buffered I/O or going through the
> kernel caches?  If I'm using O_DIRECT, seems like it should just
> ignore this flag.

O_SYNC is orthogonal to whether the kernel caches are involved.  It is
about ensuring that data *and* metadata are safely written all the way
to permanent media.

Phil

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-03  4:30                             ` Phil Turmel
@ 2015-12-03  4:49                               ` Dallas Clement
  2015-12-03 13:43                               ` Robert Kierski
  1 sibling, 0 replies; 35+ messages in thread
From: Dallas Clement @ 2015-12-03  4:49 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

On Wed, Dec 2, 2015 at 10:30 PM, Phil Turmel <philip@turmel.org> wrote:
> On 12/02/2015 09:51 PM, Dallas Clement wrote:
>> On Wed, Dec 2, 2015 at 8:38 PM, Phil Turmel <philip@turmel.org> wrote:
>>> On 12/02/2015 09:33 PM, Dallas Clement wrote:
>>>
>>>> I'm not sure that the sync=1 has any effect in this case where I've
>>>> got direct=1 set (for non buffered I/O).  I think the sync=1 flag only
>>>> matters for buffered I/O.  I really shouldn't be setting that flag at
>>>> all.
>>>
>>> It's substantially different from direct=1.  O_DIRECT just bypasses the
>>> kernel's caches.  O_SYNC flushes the file data and filesystem metadata,
>>> and kills the device caches and queues.
>>
>> Isn't O_SYNC only applicable for buffered I/O or going through the
>> kernel caches?  If I'm using O_DIRECT, seems like it should just
>> ignore this flag.
>
> O_SYNC is orthogonal to whether the kernel caches are involved.  It is
> about ensuring that data *and* metadata are safely written all the way
> to permanent media.
>
> Phil

Okay, that was my original intent, i.e. to avoid caching and buffering
as much as possible so that I could get a feel for true throughput
capability of the RAID device and the disks.  Do you think it would be
better then to use sync=0 or fsync_on_close=1 for the sake of
evaluating RAID write performance?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-03  4:30                             ` Phil Turmel
  2015-12-03  4:49                               ` Dallas Clement
@ 2015-12-03 13:43                               ` Robert Kierski
  2015-12-03 14:37                                 ` Phil Turmel
  1 sibling, 1 reply; 35+ messages in thread
From: Robert Kierski @ 2015-12-03 13:43 UTC (permalink / raw)
  To: Phil Turmel, Dallas Clement; +Cc: linux-raid

This is why I use Direct-IO to the bare-metal block device instead of going through the FS.  Rather than discussing the real problem, we're off in the weeds talking about whether the tests should be using O_SYNC and whether there is a problem introduced in the latest version of the FS.

FS's and cache are very good at hiding the problems of those things below them and prevent you from exercising the code you're interested in debugging.

Bob Kierski
Senior Storage Performance Engineer
Cray Inc.
380 Jackson Street
Suite 210
St. Paul, MN 55101
Tele: 651-967-9590
Fax:  651-605-9001
Cell: 651-890-7461


^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-02 19:50               ` Dallas Clement
  2015-12-03  0:12                 ` Dallas Clement
@ 2015-12-03 14:19                 ` Robert Kierski
  2015-12-03 14:39                   ` Dallas Clement
  2015-12-03 15:04                   ` Phil Turmel
  1 sibling, 2 replies; 35+ messages in thread
From: Robert Kierski @ 2015-12-03 14:19 UTC (permalink / raw)
  To: Dallas Clement, linux-raid

Phil,

I have a variety of testing tools that I use to corroborate the results of the others.  So... IOR, XDD, fio, iozone, (and dd when I need something simple).  Each of those can be run with a variety of options that simulate what an FS will submit to the block layer without adding the complexity, overhead, and uncertainty that an FS brings to the table.  I've run the same tools through an FS, and found that at the bottom end of things, I can configure those tools to do exactly what the FS does... only when I'm looking at the traces, I don't have to scan past 100K lines while the FS is dealing with inodes, privileges, and other meta data.

But to more precisely answer your question... as an example, if I'm using dd, I give this command:

dd if=/dev/zero of=/dev/md0 bs=1M oflag=direct

Where /dev/md0 is the raid device I've configured.

I don't use bitmaps; I've configured my raid using "--bitmap=none" and confirmed that mdadm sees that there is no bitmap.  I don't have alignment issues, as my ramdisk has 512-byte sectors.  If something is somehow aligning things off 512-byte boundaries when doing 1M writes... I would be surprised.  Also... I verified that the data written to disk falls at the boundaries I'm expecting.

I tried RAID0 and got performance that is similar to what I was expecting -- 38G/s doing the writes.

I tried the 4.1 kernel, and was able to get better performance.  It was actually 2x the 3.18 performance... but the 3.18 performance is so bad that twice horrible is still horrible.

Bob Kierski
Senior Storage Performance Engineer
Cray Inc.
380 Jackson Street
Suite 210
St. Paul, MN 55101
Tele: 651-967-9590
Fax:  651-605-9001
Cell: 651-890-7461


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-03 13:43                               ` Robert Kierski
@ 2015-12-03 14:37                                 ` Phil Turmel
  0 siblings, 0 replies; 35+ messages in thread
From: Phil Turmel @ 2015-12-03 14:37 UTC (permalink / raw)
  To: Robert Kierski, Dallas Clement; +Cc: linux-raid

On 12/03/2015 08:43 AM, Robert Kierski wrote:
> This is why I use Direct-IO to the bare-metal block device instead of going through the FS.  Rather than discussing the real problem, we're off in the weeds talking about whether the tests should be using O_SYNC and whether there is a problem introduced in the latest version of the FS.

It's not off in the weeds for Dallas, the OP.

> FS's and cache are very good at hiding the problems of those things below them and prevent you from exercising the code you're interested in debugging.

Yep, you seem to have a real problem.

Phil

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-03 14:19                 ` Robert Kierski
@ 2015-12-03 14:39                   ` Dallas Clement
  2015-12-03 15:04                   ` Phil Turmel
  1 sibling, 0 replies; 35+ messages in thread
From: Dallas Clement @ 2015-12-03 14:39 UTC (permalink / raw)
  To: Robert Kierski; +Cc: linux-raid

On Thu, Dec 3, 2015 at 8:19 AM, Robert Kierski <rkierski@cray.com> wrote:
> Phil,
>
> I have a variety of testing tools that I use to corroborate the results of the others.  So... IOR, XDD, fio, iozone, (and dd when I need something simple).  Each of those can be run with a variety of options that simulate what an FS will submit to the block layer without adding the complexity, overhead, and uncertainty that an FS brings to the table.  I've run the same tools through an FS, and found that at the bottom end of things, I can configure those tools to do exactly what the FS does... only when I'm looking at the traces, I don't have to scan past 100K lines while the FS is dealing with inodes, privileges, and other meta data.
>
> But to more precisely answer your question... as an example, if I'm using dd, I give this command:
>
> dd if=/dev/zero of=/dev/md0 bs=1M oflag=direct
>
> Where /dev/md0 is the raid device I've configured.
>
> I don't use bitmaps; I've configured my raid using "--bitmap=none" and confirmed that mdadm sees that there is no bitmap.  I don't have alignment issues, as my ramdisk has 512-byte sectors.  If something is somehow aligning things off 512-byte boundaries when doing 1M writes... I would be surprised.  Also... I verified that the data written to disk falls at the boundaries I'm expecting.
>
> I tried RAID0 and got performance that is similar to what I was expecting -- 38G/s doing the writes.
>
> I tried the 4.1 kernel, and was able to get better performance.  It was actually 2x the 3.18 performance... but the 3.18 performance is so bad that twice horrible is still horrible.
>
> Bob Kierski
> Senior Storage Performance Engineer
> Cray Inc.
> 380 Jackson Street
> Suite 210
> St. Paul, MN 55101
> Tele: 651-967-9590
> Fax:  651-605-9001
> Cell: 651-890-7461
>

> I have a variety of testing tools that I use to corroborate the results of the others.

Robert, can you summarize your results for RAID 0, 1, 5, 6?  I'm
interested to see what you are actually getting compared to what you
expected particularly for RAID 5 & 6.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-03 14:19                 ` Robert Kierski
  2015-12-03 14:39                   ` Dallas Clement
@ 2015-12-03 15:04                   ` Phil Turmel
  2015-12-03 22:21                     ` Weedy
  2015-12-04 13:40                     ` Robert Kierski
  1 sibling, 2 replies; 35+ messages in thread
From: Phil Turmel @ 2015-12-03 15:04 UTC (permalink / raw)
  To: Robert Kierski, Dallas Clement, linux-raid

On 12/03/2015 09:19 AM, Robert Kierski wrote:
> Phil,
> 
> I have a variety of testing tools that I use to corroborate the results of the others.  So... IOR, XDD, fio, iozone, (and dd when I need something simple).  Each of those can be run with a variety of options that simulate what an FS will submit to the block layer without adding the complexity, overhead, and uncertainty that an FS brings to the table.  I've run the same tools through an FS, and found that at the bottom end of things, I can configure those tools to do exactly what the FS does... only when I'm looking at the traces, I don't have to scan past 100K lines while the FS is dealing with inodes, privileges, and other meta data.

Ok.  Please cite the tool when you give a performance number.

> But to more precisely answer your question... as an example, if I'm using dd, I give this command:
> 
> dd if=/dev/zero of=/dev/md0 bs=1M oflag=direct

Why oflag=direct?  And what do you get without it?

> Where /dev/md0 is the raid device I've configured.
> 
> I don't use bitmaps, I've configured my raid using "--bitmap=none" and confirmed that mdadmin sees that there is no bitmap.  I don't have alignment issues as my ramdisk has 512byte sectors.  If something is somehow aligning things off 512byte boundaries when doing 1m writes.... I would be surprised.  Also... I verified that the data written to disk falls at the boundaries I'm expecting.

Ok.  I wasn't concerned about sector size.  I was concerned about writes
not filling complete stripes in a single IO.  Writes to parity raid are
broken up into 4k blocks in the stripe cache for parity calculation.
Each block in that stripe is separated from its mates by the chunk size.
 If you don't write to all of them before the state machine decides to
compute, the parity devices will be read to perform RMW cycles (or the
other data members will be read to recompute from scratch).  Either way,
when the 4k blocks are then written from the stripe, they have to have a
chance to get merged again.
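
Rough numbers for an 8+2 raid6 like yours, assuming the geometry above:

  8 data members x 128k chunk = 1024k full stripe  (one full stripe per 1M write)
  8 data members x  32k chunk =  256k full stripe  (four full stripes per 1M write)

so with the smaller chunk each submission covers several complete stripes, which
may be part of why the 32k case roughly doubled your throughput.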

> I tried RAID0 and got performance that is similar to what I was expecting -- 38G/s doing the writes.

Yep, those 1M writes are broken into chunk-sized writes for each member
and submitted as is.  Raid456 breaks those down further for parity
calculation.

So, you probably have found a bug in post-stripe merging.  Possibly due
to the extremely low latency of a ramdisk.  Possibly an O_DIRECT side
effect.  There's been a lot of work on parity raid in the past couple
years, both fixing bugs and adding features.

Sounds like time to bisect to locate the patches that make step changes
in performance on your specific hardware.

Phil

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-03 15:04                   ` Phil Turmel
@ 2015-12-03 22:21                     ` Weedy
  2015-12-04 13:40                     ` Robert Kierski
  1 sibling, 0 replies; 35+ messages in thread
From: Weedy @ 2015-12-03 22:21 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Robert Kierski, Dallas Clement, linux-raid

This is a bit off topic for this thread, but is this tuning info on one
of the wikis?  I mean tuning all the caches and whether you should have
bitmaps on or off.

I'm going to build a new raid 5 (possibly raid 6) soon and would like
to saturate a gigabit line.  I'd like to get my caches and stripes set
up right for a 4- or 6-disk array.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-03 15:04                   ` Phil Turmel
  2015-12-03 22:21                     ` Weedy
@ 2015-12-04 13:40                     ` Robert Kierski
  2015-12-04 16:08                       ` Dallas Clement
  2015-12-04 18:51                       ` Shaohua Li
  1 sibling, 2 replies; 35+ messages in thread
From: Robert Kierski @ 2015-12-04 13:40 UTC (permalink / raw)
  To: Phil Turmel, Dallas Clement, linux-raid

It turns out the problem I'm experiencing is related to thread count.  When I run XDD with a reasonable queuedepth parameter (32), I get horrible performance.  When I run it with a small queuedepth (1-4), I get expected performance.

Here are the command lines:

Horrible Performance:
xdd -id commandline -dio -maxall -targets 1 /dev/md0 -queuedepth 32  -blocksize 1048576 -timelimit 10 -reqsize 1 -mbytes 5000 -passes 20 -verbose -op write -seek sequential

GOOD Performance:
xdd -id commandline -dio -maxall -targets 1 /dev/md0 -queuedepth 1 -blocksize 1048576 -timelimit 10 -reqsize 1 -mbytes 5000 -passes 20 -verbose -op write -seek sequential

BEST Performance:
xdd -id commandline -dio -maxall -targets 1 /dev/md0 -queuedepth 3 -blocksize 1048576 -timelimit 10 -reqsize 1 -mbytes 5000 -passes 20 -verbose -op write -seek sequential

BAD Performance
xdd -id commandline -dio -maxall -targets 1 /dev/md1 -queuedepth 5 -blocksize 1048576 -timelimit 10 -reqsize 1 -mbytes 5000 -passes 20 -verbose -op write -seek sequential

Bob Kierski
Senior Storage Performance Engineer
Cray Inc.
380 Jackson Street
Suite 210
St. Paul, MN 55101
Tele: 651-967-9590
Fax:  651-605-9001
Cell: 651-890-7461


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-04 13:40                     ` Robert Kierski
@ 2015-12-04 16:08                       ` Dallas Clement
  2015-12-07 14:29                         ` Robert Kierski
  2015-12-04 18:51                       ` Shaohua Li
  1 sibling, 1 reply; 35+ messages in thread
From: Dallas Clement @ 2015-12-04 16:08 UTC (permalink / raw)
  To: Robert Kierski; +Cc: Phil Turmel, linux-raid

On Fri, Dec 4, 2015 at 7:40 AM, Robert Kierski <rkierski@cray.com> wrote:
> It turns out the problem I'm experiencing is related to thread count.  When I run XDD with a reasonable queuedepth parameter (32), I get horrible performance.  When I run it with a small queuedepth (1-4), I get expected performance.
>
> Here are the command lines:
>
> Horrible Performance:
> xdd -id commandline -dio -maxall -targets 1 /dev/md0 -queuedepth 32  -blocksize 1048576 -timelimit 10 -reqsize 1 -mbytes 5000 -passes 20 -verbose -op write -seek sequential
>
> GOOD Performance:
> xdd -id commandline -dio -maxall -targets 1 /dev/md0 -queuedepth 1 -blocksize 1048576 -timelimit 10 -reqsize 1 -mbytes 5000 -passes 20 -verbose -op write -seek sequential
>
> BEST Performance:
> xdd -id commandline -dio -maxall -targets 1 /dev/md0 -queuedepth 3 -blocksize 1048576 -timelimit 10 -reqsize 1 -mbytes 5000 -passes 20 -verbose -op write -seek sequential
>
> BAD Performance
> xdd -id commandline -dio -maxall -targets 1 /dev/md1 -queuedepth 5 -blocksize 1048576 -timelimit 10 -reqsize 1 -mbytes 5000 -passes 20 -verbose -op write -seek sequential
>
> Bob Kierski
> Senior Storage Performance Engineer
> Cray Inc.
> 380 Jackson Street
> Suite 210
> St. Paul, MN 55101
> Tele: 651-967-9590
> Fax:  651-605-9001
> Cell: 651-890-7461
>

Robert, is this with a RAID 6 array?  Also, what are the actual
throughput numbers you are getting?  Would it be possible for you to
capture the iostat -xm output and report the individual disk wMB/s and
also the disk utilization for RAID 6?
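
Something like the following, sampled once a second during the run, would be
plenty (just the wMB/s and %util columns for the members and for md0):

iostat -xm 1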

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-04 13:40                     ` Robert Kierski
  2015-12-04 16:08                       ` Dallas Clement
@ 2015-12-04 18:51                       ` Shaohua Li
  2015-12-05  1:38                         ` Dallas Clement
  2015-12-07 14:18                         ` Robert Kierski
  1 sibling, 2 replies; 35+ messages in thread
From: Shaohua Li @ 2015-12-04 18:51 UTC (permalink / raw)
  To: Robert Kierski; +Cc: Phil Turmel, Dallas Clement, linux-raid

On Fri, Dec 04, 2015 at 01:40:02PM +0000, Robert Kierski wrote:
> It turns out the problem I'm experiencing is related to thread count.  When I run XDD with a reasonable queuedepth parameter (32), I get horrible performance.  When I run it with a small queuedepth (1-4), I get expected performance.
> 
> Here are the command lines:
> 
> Horrible Performance:
> xdd -id commandline -dio -maxall -targets 1 /dev/md0 -queuedepth 32  -blocksize 1048576 -timelimit 10 -reqsize 1 -mbytes 5000 -passes 20 -verbose -op write -seek sequential
> 
> GOOD Performance:
> xdd -id commandline -dio -maxall -targets 1 /dev/md0 -queuedepth 1 -blocksize 1048576 -timelimit 10 -reqsize 1 -mbytes 5000 -passes 20 -verbose -op write -seek sequential
> 
> BEST Performance:
> xdd -id commandline -dio -maxall -targets 1 /dev/md0 -queuedepth 3 -blocksize 1048576 -timelimit 10 -reqsize 1 -mbytes 5000 -passes 20 -verbose -op write -seek sequential
> 
> BAD Performance
> xdd -id commandline -dio -maxall -targets 1 /dev/md1 -queuedepth 5 -blocksize 1048576 -timelimit 10 -reqsize 1 -mbytes 5000 -passes 20 -verbose -op write -seek sequential

The performance issue only happens for direct I/O writes, right?  Did you check
buffered writes?  The direct I/O case doesn't delay writes, so it will create more
read-modify-writes.  You can check with the debug code below.


diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 45933c1..d480cc3 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5278,10 +5278,10 @@ static void make_request(struct mddev *mddev, struct bio * bi)
 			}
 			set_bit(STRIPE_HANDLE, &sh->state);
 			clear_bit(STRIPE_DELAYED, &sh->state);
-			if ((!sh->batch_head || sh == sh->batch_head) &&
-			    (bi->bi_rw & REQ_SYNC) &&
-			    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
-				atomic_inc(&conf->preread_active_stripes);
+//			if ((!sh->batch_head || sh == sh->batch_head) &&
+//			    (bi->bi_rw & REQ_SYNC) &&
+//			    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+//				atomic_inc(&conf->preread_active_stripes);
 			release_stripe_plug(mddev, sh);
 		} else {
 			/* cannot get stripe for read-ahead, just give-up */

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-04 18:51                       ` Shaohua Li
@ 2015-12-05  1:38                         ` Dallas Clement
  2015-12-07 14:18                         ` Robert Kierski
  1 sibling, 0 replies; 35+ messages in thread
From: Dallas Clement @ 2015-12-05  1:38 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Robert Kierski, Phil Turmel, linux-raid

On Fri, Dec 4, 2015 at 12:51 PM, Shaohua Li <shli@kernel.org> wrote:
> On Fri, Dec 04, 2015 at 01:40:02PM +0000, Robert Kierski wrote:
>> It turns out the problem I'm experiencing is related to thread count.  When I run XDD with a reasonable queuedepth parameter (32), I get horrible performance.  When I run it with a small queuedepth (1-4), I get expected performance.
>>
>> Here are the command lines:
>>
>> Horrible Performance:
>> xdd -id commandline -dio -maxall -targets 1 /dev/md0 -queuedepth 32  -blocksize 1048576 -timelimit 10 -reqsize 1 -mbytes 5000 -passes 20 -verbose -op write -seek sequential
>>
>> GOOD Performance:
>> xdd -id commandline -dio -maxall -targets 1 /dev/md0 -queuedepth 1 -blocksize 1048576 -timelimit 10 -reqsize 1 -mbytes 5000 -passes 20 -verbose -op write -seek sequential
>>
>> BEST Performance:
>> xdd -id commandline -dio -maxall -targets 1 /dev/md0 -queuedepth 3 -blocksize 1048576 -timelimit 10 -reqsize 1 -mbytes 5000 -passes 20 -verbose -op write -seek sequential
>>
>> BAD Performance
>> xdd -id commandline -dio -maxall -targets 1 /dev/md1 -queuedepth 5 -blocksize 1048576 -timelimit 10 -reqsize 1 -mbytes 5000 -passes 20 -verbose -op write -seek sequential
>
> The performance issue only happens for direct I/O writes, right?  Did you check
> buffered writes?  The direct I/O case doesn't delay writes, so it will create more
> read-modify-writes.  You can check with the debug code below.
>
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 45933c1..d480cc3 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5278,10 +5278,10 @@ static void make_request(struct mddev *mddev, struct bio * bi)
>                         }
>                         set_bit(STRIPE_HANDLE, &sh->state);
>                         clear_bit(STRIPE_DELAYED, &sh->state);
> -                       if ((!sh->batch_head || sh == sh->batch_head) &&
> -                           (bi->bi_rw & REQ_SYNC) &&
> -                           !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> -                               atomic_inc(&conf->preread_active_stripes);
> +//                     if ((!sh->batch_head || sh == sh->batch_head) &&
> +//                         (bi->bi_rw & REQ_SYNC) &&
> +//                         !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> +//                             atomic_inc(&conf->preread_active_stripes);
>                         release_stripe_plug(mddev, sh);
>                 } else {
>                         /* cannot get stripe for read-ahead, just give-up */

Hi all.  My original test involved fio sequential writing to XFS
formatted RAID devices with block size = 2M and queue depth = 256.
Today I spent some time focused on raw RAID sequential write tests
with dd, similar to Robert's.  I am happy to report that I
see the exact opposite results that I reported earlier with fio / XFS.
When comparing performance between the 2.6.39.4 kernel and the 3.10.69
kernel, I am seeing that RAID 0 and RAID 1 write speeds are about
the same.  However, RAID 5 is about 60% faster in the 3.10.69 kernel
and RAID 6 is 40% faster.  I am not sure how to control queue depth
with plain old dd.

Next I am going to get a second opinion from fio, this time writing
directly to the RAID devices instead of going through XFS with varying
queue depth.  If I see the same behavior as with DD, then there is no
problem with RAID in the new kernels - it is something else, perhaps
XFS.  Will report my fio findings as soon as I have a chance to
capture them.
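
The plan is something along these lines (a sketch; the device name, depth, and
runtime are assumptions rather than the final job):

fio --name=qd-test --filename=/dev/md0 --rw=write --bs=1M \
    --direct=1 --ioengine=libaio --iodepth=32 --runtime=60 --time_based
# note: this writes directly to /dev/md0 and destroys its contents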

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-04 18:51                       ` Shaohua Li
  2015-12-05  1:38                         ` Dallas Clement
@ 2015-12-07 14:18                         ` Robert Kierski
  1 sibling, 0 replies; 35+ messages in thread
From: Robert Kierski @ 2015-12-07 14:18 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Phil Turmel, Dallas Clement, linux-raid

I re-ran the test using buffered IO, with both XDD and FIO.  Apart from the tool used, the tests were essentially the same -- TC=4, QD=1, BS=1M, DIR=Write, Order=SEQ.

I've switched to using spinning disks, since several people on this list indicated that using DDR disk devices with their ultra-low latency could cause unintended behavior in RAID6.  My theoretical max TP is 1080-1620 MB/s for 6 HDDs that each have a TP of about 270 MB/s on the outer edge.  My SAS infrastructure is such that I shouldn't have a problem achieving the theoretical max when testing on the outer edge.

With Buffered IO...

FIO reported a sustained TP of 2200 MB/s.  As this is well above the theoretical MAX, I didn't trust it.  I ran the same test with XDD.  It reported a TP of 570 MB/s.  When I watched the system via vmstat and iostat while running both of these tests, both tools indicated that I was doing about 570 MB/s while running both XDD and FIO.  So there must be something wrong in the way FIO is calculating TP when using buffered IO.

When running both FIO and XDD using direct IO, I was able to get much better performance (about 1040 MB/s) with the above parameters (TC=4, QD=1, BS=1M, DIR=Write, Order=SEQ).  I was able to confirm that TP using vmstat and iostat.

While running with both buffered IO and direct IO, I watched the IOs in both iostat and vmstat and can confirm that there weren't any reads, which means there wasn't any read-modify-write going on in either case.

As I am getting nearly the theoretical MAX, one might think there isn't a problem.  However, when I increase TC to 5, my TP is only 750 MB/s.  When I increase TC to 32, my TP is about 250 MB/s.  While you may conclude that 32 threads is overwhelming the system, 5 threads shouldn't be overwhelming a system that has 12 cores.  When you consider that I get MAX TP with 4 threads, the fact that I get a 25% reduction by adding one thread is a problem.

Bob Kierski
Senior Storage Performance Engineer
Cray Inc.
380 Jackson Street
Suite 210
St. Paul, MN 55101
Tele: 651-967-9590
Fax:  651-605-9001
Cell: 651-890-7461


^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-04 16:08                       ` Dallas Clement
@ 2015-12-07 14:29                         ` Robert Kierski
  2015-12-08 19:38                           ` Dallas Clement
  0 siblings, 1 reply; 35+ messages in thread
From: Robert Kierski @ 2015-12-07 14:29 UTC (permalink / raw)
  To: Dallas Clement; +Cc: Phil Turmel, linux-raid

I need some clarification on what data you want collected.... do you want iostat -xm collected before and after?  Or at a periodic interval (1 sec)?

I'm measuring sustained performance.  This means I’m running for a period of time -- 1 minute.  If you want iostat collected on an interval, that would probably be more data than one would want posted to this mailing list.
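
If interval data is wanted, something like this (interval, count and
file name are just examples) would keep it manageable:

  iostat -xm 1 60 > iostat-raid6-run.log

The first report is the since-boot average, so it is usually dropped;
the rest can be summarized rather than posted in full.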

But... yes, this is a RAID6.

I've switched to HDD's for storage, as there is concern on the list that the DDR disk devices could be causing unintended behavior in RAID6.  With 6 HDD's, each able to do 270 MB/s on their outer edge, I'm getting RAID6 throughput of about 1080 MB/s when I use a thread count of 4 with FIO.

-----Original Message-----
From: Dallas Clement [mailto:dallas.a.clement@gmail.com] 
Sent: Friday, December 04, 2015 10:08 AM
To: Robert Kierski
Cc: Phil Turmel; linux-raid@vger.kernel.org
Subject: Re: RAID 5,6 sequential writing seems slower in newer kernels

Robert, is this with a RAID 6 array?  Also, what are the actual throughput numbers you are getting?  Would it be possible for you to capture the iostat -xm output and report the individual disk wMB/s and also the disk utilization for RAID 6?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-07 14:29                         ` Robert Kierski
@ 2015-12-08 19:38                           ` Dallas Clement
  2015-12-08 21:24                             ` Robert Kierski
  0 siblings, 1 reply; 35+ messages in thread
From: Dallas Clement @ 2015-12-08 19:38 UTC (permalink / raw)
  To: Robert Kierski; +Cc: Phil Turmel, linux-raid

On Mon, Dec 7, 2015 at 8:29 AM, Robert Kierski <rkierski@cray.com> wrote:
> I need some clarification on what data you want collected.... do you want iostat -xm collected before and after?  Or at a periodic interval (1 sec)?
>
> I'm measuring sustained performance.  This means I’m running for a period of time -- 1 minute.  If you want iostat collected on an interval, that would probably be more data than one would want posted to this mailing list.
>
> But... yes, this is a RAID6.
>
> I've switched to HDD's for storage, as there is concern on the list that the DDR disk devices could be causing unintended behavior in RAID6.  With 6 HDD's, each able to do 270 MB/s on their outer edge, I'm getting RAID6 throughput of about 1080 MB/s when I use a thread count of 4 with FIO.
>
> -----Original Message-----
> From: Dallas Clement [mailto:dallas.a.clement@gmail.com]
> Sent: Friday, December 04, 2015 10:08 AM
> To: Robert Kierski
> Cc: Phil Turmel; linux-raid@vger.kernel.org
> Subject: Re: RAID 5,6 sequential writing seems slower in newer kernels
>
> Robert, is this with a RAID 6 array?  Also, what are the actual throughput numbers you are getting?  Would it be possible for you to capture the iostat -xm output and report the individual disk wMB/s and also the disk utilization for RAID 6?

Hi Robert.  Thanks for posting these results.  Your fio direct I/O
results look pretty darn good.  Were you able to confirm with hdparm
-t that your disks really can do 270 MB/s?  Also, what formula are
you using to calculate best and worst case RAID 6 read and write
speeds?
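
For the per-disk sanity check, something along these lines (device
names are placeholders) reports the raw sequential read rate near the
outer edge of each drive:

  for d in /dev/sd[b-g]; do hdparm -t "$d"; done

As a rough rule of thumb I would expect best-case RAID 6 sequential
writes of around (N-2) x single-disk speed for full-stripe writes, and
worst case (read-modify-write on partial stripes) well below
single-disk speed.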

> However, when I increase TC to 5, my TP is only 750 MB/s.  When I increase TC to 32, my TP is about 250 MB/s.

Now this is pretty disappointing.  I agree, 12 cores should be able to
easily handle this number of threads.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: RAID 5,6 sequential writing seems slower in newer kernels
  2015-12-08 19:38                           ` Dallas Clement
@ 2015-12-08 21:24                             ` Robert Kierski
  0 siblings, 0 replies; 35+ messages in thread
From: Robert Kierski @ 2015-12-08 21:24 UTC (permalink / raw)
  To: Dallas Clement; +Cc: Phil Turmel, linux-raid

Unfortunately, when things seem too good to be true....

I have to go back to the beginning.  I had switched to a system with HDD's, but hadn't noticed that the system has a cache that was skewing my results.  My test script attempts to turn off the cache, but in this case, it's ignored since the cache is battery backed, and therefore knows that I really want it on.

Also, I thought the drives were 10k RPM.  If that were the case, 270 MB/s would be a fairly reasonable TP on the outer edge.  However, these are 7.2k RPM drives and aren't able to do anywhere near 270 MB/s.  The cache was skewing that result as well.

So... please ignore any message in which I claim to get 1080 MB/s from a RAID6.


Bob Kierski
Senior Storage Performance Engineer
Cray Inc.
380 Jackson Street
Suite 210
St. Paul, MN 55101
Tele: 651-967-9590
Fax:  651-605-9001
Cell: 651-890-7461


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2015-12-08 21:24 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-01 23:02 RAID 5,6 sequential writing seems slower in newer kernels Dallas Clement
2015-12-02  1:07 ` keld
2015-12-02 14:18   ` Robert Kierski
2015-12-02 14:45     ` Phil Turmel
2015-12-02 15:28       ` Robert Kierski
2015-12-02 15:37         ` Phil Turmel
2015-12-02 15:44           ` Robert Kierski
2015-12-02 15:51             ` Phil Turmel
2015-12-02 19:50               ` Dallas Clement
2015-12-03  0:12                 ` Dallas Clement
2015-12-03  2:18                   ` Phil Turmel
2015-12-03  2:24                     ` Dallas Clement
2015-12-03  2:33                       ` Dallas Clement
2015-12-03  2:38                         ` Phil Turmel
2015-12-03  2:51                           ` Dallas Clement
2015-12-03  4:30                             ` Phil Turmel
2015-12-03  4:49                               ` Dallas Clement
2015-12-03 13:43                               ` Robert Kierski
2015-12-03 14:37                                 ` Phil Turmel
2015-12-03  2:34                       ` Phil Turmel
2015-12-03 14:19                 ` Robert Kierski
2015-12-03 14:39                   ` Dallas Clement
2015-12-03 15:04                   ` Phil Turmel
2015-12-03 22:21                     ` Weedy
2015-12-04 13:40                     ` Robert Kierski
2015-12-04 16:08                       ` Dallas Clement
2015-12-07 14:29                         ` Robert Kierski
2015-12-08 19:38                           ` Dallas Clement
2015-12-08 21:24                             ` Robert Kierski
2015-12-04 18:51                       ` Shaohua Li
2015-12-05  1:38                         ` Dallas Clement
2015-12-07 14:18                         ` Robert Kierski
2015-12-02 15:37       ` Robert Kierski
2015-12-02  5:22 ` Roman Mamedov
2015-12-02 14:15 ` Robert Kierski
