* Software RAID checksum performance on 24 disks not even close to kernel reported
@ 2012-06-04 23:14 Ole Tange
  2012-06-05  1:26 ` Joe Landman
                   ` (3 more replies)
  0 siblings, 4 replies; 38+ messages in thread
From: Ole Tange @ 2012-06-04 23:14 UTC (permalink / raw)
  To: linux-raid

On my new 24 disk array I get 900 MB/s of raw read or write using `dd`
to all the disks.

When I set the disks up as a 24 disk software RAID6 I get 400 MB/s
write and 600 MB/s read. It seems to be due to checksumming, as I have
a single process (md0_raid6) taking up 100% of one CPU.

It seems, however, that the performance of the checksumming is heavily
dependent on how many disks are in the RAID and how big the chunk size
is.

That makes sense, as the CPU can compute the checksum faster if the
whole stripe (i.e. one chunk for each disk) fits into the CPU cache.
With roughly 22 data chunks per stripe, a 64 KB chunk gives a ~1.4 MB
stripe, while a 4096 KB chunk gives a ~88 MB stripe - far larger than
the CPU cache.

I tested this by creating 24 devices in RAM, using different chunk
sizes, and then copying the Linux kernel source. The test script can be
found at http://oletange.blogspot.dk/2012/05/software-raid-performance-on-24-disks.html

By doing it in RAM the results are not affected by physical disks or
the disk controller. So the only variable is the speed of computing
checksums. This can also be seen as the time the process md0_raid6 is
running.
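
For reference, the core of the setup (condensed from the test script
linked above; see the blog post for the exact version) is roughly:

  mount -t tmpfs tmpfs tmpfs
  seq 24 | parallel dd if=/dev/zero of=tmpfs/disk{} bs=500k count=1k
  seq 24 | parallel losetup /dev/loop{} tmpfs/disk{}
  seq 24 | parallel -X --tty mdadm --create --force /dev/md0 -c $CHUNK --level=6 --raid-devices=23 -x 1 /dev/loop{}
  mkfs.xfs /dev/md0
  mount /dev/md0 /mnt/md0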

The results were:

Chunk size (KB)   10 kernel sources as files   10 kernel sources as one tar file
16                32s                          13s
32                32s                          19s
64                31s                          11s
128               39s                          13s
256               43s                          11s
4096              1m38s                        16s

It makes sense that copying 10 big files is faster than copying the
same amount of data as small files - especially on RAID6, where you
have to read from disk if you do not write a full stripe. So for the
big files the difference between chunk sizes is minimal.

For the small files the difference is more pronounced. Any chunk size
over 64k gives a performance penalty.

But I cannot explain why even the best performance (4600 MB/11s = 420
MB/s) is not even close to the checksum performance reported by the
kernel at boot (6196 MB/s):

    Mar 13 16:02:42 server kernel: [   35.120035] raid6: using
algorithm sse2x4 (6196 MB/s)

Can you explain why I only get 420 MB/s of real world checksumming
instead of 6196 MB/s?


/Ole
-- 
Have you ordered your GNU Parallel merchandise?
https://www.gnu.org/software/parallel/merchandise.html

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-04 23:14 Software RAID checksum performance on 24 disks not even close to kernel reported Ole Tange
@ 2012-06-05  1:26 ` Joe Landman
  2012-06-05  3:36 ` Igor M Podlesny
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 38+ messages in thread
From: Joe Landman @ 2012-06-05  1:26 UTC (permalink / raw)
  To: Ole Tange; +Cc: linux-raid

On 06/04/2012 07:14 PM, Ole Tange wrote:

> But I cannot explain why even the best performance (4600 MB/11s = 420
> MB/s) is not even close to the checksum performance reported by the
> kernel at boot (6196 MB/s):
>
>      Mar 13 16:02:42 server kernel: [   35.120035] raid6: using
> algorithm sse2x4 (6196 MB/s)
>
> Can you explain why I only get 420 MB/s of real world checksumming
> instead of 6196 MB/s?

In the best possible case, you would get 22x the single-disk bandwidth,
which would be ~120MB/s per disk (assuming RAID6, and "infinite" speed
of checksum computation).  This is your "theoretical" upper bound on
performance.

Your pragmatic upper bound on performance is reduced from the 
theoretical by many issues, including various hardware issues 
(controller, PCIe lanes, memory, disk, ...), as well as software (IO 
stack traversal, elevators, buffers/cache fills, ... etc.).

Aside from this, it is very rare that you will have a single application
reading and writing at full stripe width all the time, which would be
the optimal case for you.  There are some, and we've played with a number
of them for our customers.  But they are the exception and not the rule.

Your real world IO performance for 24 disks is 420MB/s.  So this
particular setup is, from your numbers, operating at about 16% peak
efficiency per disk, +/- some (420 MB/s / (22 x ~120 MB/s) is roughly
16%).  This is not uncommon for people's self-built systems.  The
checksumming isn't your rate limiting factor; other things are.

Joe

>
>
> /Ole

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-04 23:14 Software RAID checksum performance on 24 disks not even close to kernel reported Ole Tange
  2012-06-05  1:26 ` Joe Landman
@ 2012-06-05  3:36 ` Igor M Podlesny
  2012-06-05  7:47   ` Ole Tange
  2012-06-05  3:39 ` Igor M Podlesny
  2012-06-06 14:11 ` Ole Tange
  3 siblings, 1 reply; 38+ messages in thread
From: Igor M Podlesny @ 2012-06-05  3:36 UTC (permalink / raw)
  To: Ole Tange; +Cc: linux-raid

On 5 June 2012 07:14, Ole Tange <ole@tange.dk> wrote:
> On my new 24 disk array I get 900 MB/s of raw read or write using `dd`
> to all the disks.

   — An array with what layout?

> When I set the disks up as a 24 disk software RAID6 I get 400 MB/s
> write and 600 MB/s read. It seems to be due to checksuming, as I have
> a single process (md0_raid6) taking up 100% of one CPU.
[…]
> I tested this by creating 24 devices in RAM, used different chunk
> sizes, and then copied the linux kernel source. Test script can be
> found on http://oletange.blogspot.dk/2012/05/software-raid-performance-on-24-disks.html

   What a wild train of thought… Are those 24 disks HDDs, or are they "in RAM"?

> By doing it in RAM the results are not affected by physical disks or
> disk controller. So the only change is the speed of computing
> checksums. This can also be seen as the time the process md0_raid0 is
> running.
>
> The results were:
>
> Chunk size      Time to copy 10 linux kernel sources as files   Time to copy
> 10 linux kernel sources as a single tar file
> 16      32s     13s
[…]
> 4096    1m38s   16s

   You were talking about MB/s and now you're not. That doesn't help
in understanding you either.

> It makes sense that it is faster to copy 10 big files than 10 times
[…]
>
> But I cannot explain why even the best performance (4600 MB/11s = 420
> MB/s) is not even close to the checksum performance reported by the
> kernel at boot (6196 MB/s):
>
>    Mar 13 16:02:42 server kernel: [   35.120035] raid6: using
> algorithm sse2x4 (6196 MB/s)
>
> Can you explain why I only get 420 MB/s of real world checksumming
> instead of 6196 MB/s?

   Again — 420 MB/s on the HDD-based RAID or the in-RAM one? What do you
think LSR subscribers are — mediums?


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-04 23:14 Software RAID checksum performance on 24 disks not even close to kernel reported Ole Tange
  2012-06-05  1:26 ` Joe Landman
  2012-06-05  3:36 ` Igor M Podlesny
@ 2012-06-05  3:39 ` Igor M Podlesny
  2012-06-05  7:47   ` Ole Tange
  2012-06-06 14:11 ` Ole Tange
  3 siblings, 1 reply; 38+ messages in thread
From: Igor M Podlesny @ 2012-06-05  3:39 UTC (permalink / raw)
  To: Ole Tange; +Cc: linux-raid

On 5 June 2012 07:14, Ole Tange <ole@tange.dk> wrote:
[…]
> I tested this by creating 24 devices in RAM, used different chunk
> sizes, and then copied the linux kernel source. Test script can be
> found on http://oletange.blogspot.dk/2012/05/software-raid-performance-on-24-disks.html

   I don't see the "--assume-clean" option there, and it is (again) not
clear — are you testing against a re-syncing RAID or what?


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-05  3:36 ` Igor M Podlesny
@ 2012-06-05  7:47   ` Ole Tange
  2012-06-05 11:25     ` Peter Grandi
  2012-06-05 14:15     ` Stan Hoeppner
  0 siblings, 2 replies; 38+ messages in thread
From: Ole Tange @ 2012-06-05  7:47 UTC (permalink / raw)
  To: Igor M Podlesny; +Cc: linux-raid

On Tue, Jun 5, 2012 at 5:36 AM, Igor M Podlesny <for.poige+lsr@gmail.com> wrote:
> On 5 June 2012 07:14, Ole Tange <ole@tange.dk> wrote:
>> On my new 24 disk array I get 900 MB/s of raw read or write using `dd`
>> to all the disks.
>
>   — Array of layout what?

Raw performance. I.e. no RAID:

  echo 3 > /proc/sys/vm/drop_caches
  time parallel -j0 dd if={} of=/dev/null bs=1000k count=1k ::: /dev/sd?

The 900 MB/s was based on my old controller. I re-measured using my
new controller and get closer to 2000 MB/s in raw (non-RAID)
performance, which is close to the theoretical maximum for that
controller (2400 MB/s). This indicates that the hardware is not the
bottleneck.

>> When I set the disks up as a 24 disk software RAID6 I get 400 MB/s
>> write and 600 MB/s read. It seems to be due to checksuming, as I have
>> a single process (md0_raid6) taking up 100% of one CPU.
> […]
>> I tested this by creating 24 devices in RAM, used different chunk
>> sizes, and then copied the linux kernel source. Test script can be
>> found on http://oletange.blogspot.dk/2012/05/software-raid-performance-on-24-disks.html
>
>   What a wild train of thoughts… Are those 24 disks HDDs or they're "in RAM"?

As I wrote:

>> I tested this by creating 24 devices in RAM

so yes: for the test they are loopback devices on a tmpfs in RAM.

>> By doing it in RAM the results are not affected by physical disks or
>> disk controller.

So the test results are NOT affected by various hardware issues
(controller, PCIe lanes, disk, ...), and also NOT affected by
software related to the hardware (IO stack traversal, elevators,
buffers/cache fills, ... etc.).

The test is thus not limited by the 2000 MB/s that the 'dd' test shows
the hardware supports.

The only hardware being used in the test is RAM.

It should therefore be possible to reproduce my findings on most
systems with > 10 GB RAM. Maybe you get different values, but I would
think you will see the same trend: md0_raid6 is the limiting factor
and you do not get anywhere near the theoretical max that the kernel
reports (6196 MB/s in my case).

The maximum raw performance of my loop devices in RAM is 7000 MB/s,
as measured by:

  time parallel -j0 dd if={} of=/dev/null bs=500k count=1k ::: /dev/loop*

>> So the only change is the speed of computing
>> checksums. This can also be seen as the time the process md0_raid0 is
>> running.
>>
>> The results were:
>>
>> Chunk size      Time to copy 10 linux kernel sources as files   Time to copy
>> 10 linux kernel sources as a single tar file
>> 16      32s     13s
> […]
>> 4096    1m38s   16s
>
>   You were talking bout MB/secs and now you're not. It doesn't help
> understanding you either.

The table shows the chunk size (mdadm -c) and the timings for copying
the Linux source 10 times in parallel, first as individual files and
then as a single uncompressed tar file. This measures performance for
small files and big files respectively.

>> But I cannot explain why even the best performance (4600 MB/11s = 420
>> MB/s) is not even close to the checksum performance reported by the
>> kernel at boot (6196 MB/s):
>>
>>    Mar 13 16:02:42 server kernel: [   35.120035] raid6: using
>> algorithm sse2x4 (6196 MB/s)
>>
>> Can you explain why I only get 420 MB/s of real world checksumming
>> instead of 6196 MB/s?
>
>   Again — 420 MB/sec on HDD-based RAID or in-RAM one? What do you
> think LSR subscribers are — mediums?

I had assumed that if they had any doubt they would read the test
script. As I wrote:

>> Test script can be found on http://oletange.blogspot.dk/2012/05/software-raid-performance-on-24-disks.html

Here it should be clear that for this test setup the "disks" are
loopback devices backed by files on a tmpfs (which is 100% in RAM - not
swapped out).

The main point is:

When I run 'top' during the tests I see 'md0_raid6' taking up 100% of
one CPU core. This leads me to believe the limiting factor is indeed
'md0_raid6' and not hardware. This is true for all the in RAM tests
(and it is also true for the production system which runs on normal
magnetic SATA disks).

So what puzzles me is: if the theoretical maximum for checksumming is
6196 MB/s and the loopback devices deliver 7000 MB/s in raw
(non-RAID) performance, why do I only get 420 MB/s when the loopback
devices are in RAID6? And why is md0_raid6 taking up 100% of one CPU
core, but only delivering 420 MB/s of performance?

I _do_ expect md0_raid6 to take up 100% of one CPU core, but it should
perform at 6196 MB/s, not at the 420 MB/s that I measure.

What performance do you get if you run the test script (lower part of
http://oletange.blogspot.dk/2012/05/software-raid-performance-on-24-disks.html)?
Can you reproduce the findings?


/Ole

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-05  3:39 ` Igor M Podlesny
@ 2012-06-05  7:47   ` Ole Tange
  2012-06-05 11:29     ` Igor M Podlesny
  0 siblings, 1 reply; 38+ messages in thread
From: Ole Tange @ 2012-06-05  7:47 UTC (permalink / raw)
  To: Igor M Podlesny; +Cc: linux-raid

On Tue, Jun 5, 2012 at 5:39 AM, Igor M Podlesny <for.poige+lsr@gmail.com> wrote:
> On 5 June 2012 07:14, Ole Tange <ole@tange.dk> wrote:
> […]
>> I tested this by creating 24 devices in RAM, used different chunk
>> sizes, and then copied the linux kernel source. Test script can be
>> found on http://oletange.blogspot.dk/2012/05/software-raid-performance-on-24-disks.html
>
>   I don't see there "--assume-clean" option, and it's (again) not
> clear — are you playing with re-syncing RAID or what?

Good call. But the resync is done before the mkfs.xfs is finished, so
the time of the copying is not affected by resync.
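
A simple way to verify that - outside the timed section - is to check
that /proc/mdstat shows no resync in progress before starting the copy:

  cat /proc/mdstat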

I re-tested with --assume-clean and as expected it has no impact.


/Ole

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-05  7:47   ` Ole Tange
@ 2012-06-05 11:25     ` Peter Grandi
  2012-06-05 20:57       ` Ole Tange
  2012-06-05 14:15     ` Stan Hoeppner
  1 sibling, 1 reply; 38+ messages in thread
From: Peter Grandi @ 2012-06-05 11:25 UTC (permalink / raw)
  To: Linux RAID

[ ... ]

>>>> I tested this by creating 24 devices in RAM, used different
>>>> chunk sizes, and then copied the linux kernel source. Test
>>>> script can be found on [ ... ] By doing it in RAM the
>>>> results are not affected by physical disks or disk
>>>> controller. So the only change is the speed of computing
>>>> checksums. This can also be seen as the time the process
>>>> md0_raid0 is running.

>>> When I set the disks up as a 24 disk software RAID6

It does not change much of the conclusions as to the (euphemism)
audacity of your conclusions, but you have created a 21+2 RAID6
set, as the 24th block device is a spare:

  seq 24 | parallel -X --tty mdadm --create --force /dev/md0 -c $CHUNK --level=6 --raid-devices=23 -x 1 /dev/loop{}

>>> I get 400 MB/s write and 600 MB/s read. It seems to be due
>>> to checksuming, as I have a single process (md0_raid6)
>>> taking up 100% of one CPU.

[ ... ]

> The 900 MB/s was based on my old controller. I re-measured
> using my new controller and get closer to 2000 MB/s in raw
> (non-RAID) performance, which is close to the theoretical
> maximum for that controller (2400 MB/s). This indicated that
> hardware is not a bottleneck.

A 21+2 drive RAID6 set is (euphemism) brave, and perhaps it
matches the (euphemism) strategic insight that only checksumming
within MD could account for 100% CPU time in a single-threaded
way.

But as a start you could try running your (euphemism) "test"
with O_DIRECT:

  http://www.sabi.co.uk/blog/0709sep.html#070919

While making sure that the IO is stripe aligned (21 times the
chunk size).
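
For example (only a sketch, with a hypothetical target file and assuming
a 64KiB chunk; substitute 21 times your actual chunk size for 'bs='):

  dd if=/dev/zero of=/mnt/md0/testfile oflag=direct bs=$((21 * 64 * 1024)) count=1024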

Your (euphemism) tests could also probably benefit from more
care about (euphemism) details like commit semantics, as the use
of 'sync' in your scripts seems to me based on (euphemism)
unconventional insight, for example this:

 «seq 10 | time parallel mkdir -p /mnt/md0/{}\;tar -x -C /mnt/md0/{} -f linux.tar\; sync»

But also more divertingly:

 «seq 24 | parallel dd if=/dev/zero of=tmpfs/disk{} bs=500k count=1k
  seq 24 | parallel losetup /dev/loop{} tmpfs/disk{}
  sync
  sleep 1;
  sync»

and even:

 «mount /dev/md0 /mnt/md0
  sync»

Perhaps you might also want to investigate the behaviour of
'tmpfs' and 'loop' devices, as it seems quite (euphemism)
creative to me to have RAID set member block devices as 'loop's
over 'tmpfs' files:

 «mount -t tmpfs tmpfs tmpfs
  seq 24 | parallel dd if=/dev/zero of=tmpfs/disk{} bs=500k count=1k
  seq 24 | parallel losetup /dev/loop{} tmpfs/disk{}»

Put another way, most aspects of your (euphemism) tests seem to
me rather (euphemism) imaginative.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-05  7:47   ` Ole Tange
@ 2012-06-05 11:29     ` Igor M Podlesny
  2012-06-05 13:09       ` Peter Grandi
  2012-06-05 18:44       ` Ole Tange
  0 siblings, 2 replies; 38+ messages in thread
From: Igor M Podlesny @ 2012-06-05 11:29 UTC (permalink / raw)
  To: Ole Tange; +Cc: linux-raid

On 5 June 2012 15:47, Ole Tange <ole@tange.dk> wrote:
> On Tue, Jun 5, 2012 at 5:39 AM, Igor M Podlesny <for.poige+lsr@gmail.com> wrote:
>> On 5 June 2012 07:14, Ole Tange <ole@tange.dk> wrote:
>> […]
>>> I tested this by creating 24 devices in RAM, used different chunk
>>> sizes, and then copied the linux kernel source. Test script can be
>>> found on http://oletange.blogspot.dk/2012/05/software-raid-performance-on-24-disks.html
>>
>>   I don't see there "--assume-clean" option, and it's (again) not
>> clear — are you playing with re-syncing RAID or what?
>
> Good call. But the resync is done before the mkfs.xfs is finished, so
> the time of the copying is not affected by resync.
>
> I re-tested with --assume-clean and as expected it has no impact.

   Wanna try CONFIG_MULTICORE_RAID456? :-)


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-05 11:29     ` Igor M Podlesny
@ 2012-06-05 13:09       ` Peter Grandi
  2012-06-05 21:17         ` Ole Tange
  2012-06-05 18:44       ` Ole Tange
  1 sibling, 1 reply; 38+ messages in thread
From: Peter Grandi @ 2012-06-05 13:09 UTC (permalink / raw)
  To: Linux RAID

[ ... ]

>> Good call. But the resync is done before the mkfs.xfs is finished, so
>> the time of the copying is not affected by resync.
>> 
>> I re-tested with --assume-clean and as expected it has no impact.

>    Wanna try CONFIG_MULTICORE_RAID456? :-)

That would be interesting, but the original post reports over
6GB/s for pure checksumming, and around 400MB/s actual transfer
rate. In theory there is no need here for multithreading. There
may be something else going on :-).

[ ... ]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-05  7:47   ` Ole Tange
  2012-06-05 11:25     ` Peter Grandi
@ 2012-06-05 14:15     ` Stan Hoeppner
  2012-06-05 20:45       ` Ole Tange
  1 sibling, 1 reply; 38+ messages in thread
From: Stan Hoeppner @ 2012-06-05 14:15 UTC (permalink / raw)
  To: Ole Tange; +Cc: Igor M Podlesny, linux-raid

On 6/5/2012 2:47 AM, Ole Tange wrote:

>   time parallel -j0 dd if={} of=/dev/null bs=1000k count=1k ::: /dev/sd?
                                            ^^^^^^^^
Block size, bs, should always be a multiple of the page size, or
throughput will suffer.  The Linux page size on x86 CPUs is 4096 bytes.
Using bs values that are not multiples of the page size will usually
give less than optimal results due to unaligned memory accesses.

Additionally, you will typically see optimum throughput using bs values
of between 4096 and 16384 bytes.  Below and above that throughput
typically falls.  Test each page size multiple from 4096 to 32768 to
confirm on your system.
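
For example (a rough sketch; /dev/sdX is a placeholder for one of your
drives), a sweep over the page size multiples could look like:

  for bs in 4096 8192 16384 32768; do
    echo 3 > /proc/sys/vm/drop_caches
    # read 1 GiB at this block size and keep dd's throughput summary line
    dd if=/dev/sdX of=/dev/null bs=$bs count=$((1024*1024*1024 / bs)) 2>&1 | tail -n 1
  done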

Also, using large block sizes causes dd to buffer large amounts of data
into memory as each physical IO is only 4096 bytes.  Thus dd doesn't
actually start writing to disk until each block is buffered into RAM, in
this case just under 1MB.  This reduces efficiency by quite a bit vs the
4096 byte block size which allows streaming directly from dd without the
buffering.

> The 900 MB/s was based on my old controller. I re-measured using my
> new controller and get closer to 2000 MB/s in raw (non-RAID)
> performance, which is close to the theoretical maximum for that
> controller (2400 MB/s). This indicated that hardware is not a
> bottleneck.
> 
>>> When I set the disks up as a 24 disk software RAID6 I get 400 MB/s
>>> write and 600 MB/s read. It seems to be due to checksuming, as I have
>>> a single process (md0_raid6) taking up 100% of one CPU.

The dd block size will likely be even more critical when dealing with
parity arrays, as non page size blocks will cause problems with stripe
aligned writes.

Since both the Linux page size and all filesystem (EXT, XFS, JFS) block
sizes are 4096 bytes, you should always test dd with bs=4096, as that's
your real world day-to-day target block IO size.

-- 
Stan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-05 11:29     ` Igor M Podlesny
  2012-06-05 13:09       ` Peter Grandi
@ 2012-06-05 18:44       ` Ole Tange
  2012-06-06  1:40         ` Brad Campbell
  1 sibling, 1 reply; 38+ messages in thread
From: Ole Tange @ 2012-06-05 18:44 UTC (permalink / raw)
  To: Igor M Podlesny; +Cc: linux-raid

On Tue, Jun 5, 2012 at 1:29 PM, Igor M Podlesny <for.poige+lsr@gmail.com> wrote:
> On 5 June 2012 15:47, Ole Tange <ole@tange.dk> wrote:
>> On Tue, Jun 5, 2012 at 5:39 AM, Igor M Podlesny <for.poige+lsr@gmail.com> wrote:
>>> On 5 June 2012 07:14, Ole Tange <ole@tange.dk> wrote:
>>> […]
>>>> I tested this by creating 24 devices in RAM, used different chunk
>>>> sizes, and then copied the linux kernel source. Test script can be
>>>> found on http://oletange.blogspot.dk/2012/05/software-raid-performance-on-24-disks.html
:
>   Wanna try CONFIG_MULTICORE_RAID456? :-)

If the kernel can checksum 6196 MB/s why would I need
CONFIG_MULTICORE_RAID456? Please elaborate on why you think that is
needed.

/Ole

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-05 14:15     ` Stan Hoeppner
@ 2012-06-05 20:45       ` Ole Tange
  0 siblings, 0 replies; 38+ messages in thread
From: Ole Tange @ 2012-06-05 20:45 UTC (permalink / raw)
  To: stan; +Cc: linux-raid

On Tue, Jun 5, 2012 at 4:15 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 6/5/2012 2:47 AM, Ole Tange wrote:
>
>>   time parallel -j0 dd if={} of=/dev/null bs=1000k count=1k ::: /dev/sd?
>                                            ^^^^^^^^
> Block size, bs, should always be a multiple of the page size lest
> throughput will suffer.  The Linux page size on x86 CPUs is 4096 bytes.
>  Using bs values that are not multiples of page size will usually give
> less than optimal results due to unaligned memory accesses.

The above command was used to measure the raw read performance from
all physical drives, i.e. the 2000 MB/s. If your hypothesis is
correct then I should be able to push the 2000 MB/s even higher by
using a smaller block size.

To see if you were right (i.e. that the block size has any impact
whatsoever) I tried:

  time parallel -j0 dd if={} of=/dev/null bs=4k count=250k ::: /dev/sd?

I ran the test 100 times each with 4k blocks and 1000k blocks, and
found the min, median, and max:

seq 100 | parallel -j1 -I ,, --arg-sep ,, -N0 'echo 3 > /proc/sys/vm/drop_caches;'/usr/bin/time -f%e parallel -j0 dd if={} of=/dev/null bs=4k count=250k ::: /dev/sd? 2>&1 |grep -v o > out-4k

seq 100 | parallel -j1 -I ,, --arg-sep ,, -N0 'echo 3 > /proc/sys/vm/drop_caches;'/usr/bin/time -f%e parallel -j0 dd if={} of=/dev/null bs=1000k count=1k ::: /dev/sd? 2>&1 |grep -v o > out-1000k

$ echo "1000 * " `ls /dev/sd? | wc -l ` / `sort out-4k | tail -n 1` | bc -l
$ echo "1000 * " `ls /dev/sd? | wc -l ` / `sort out-4k | head -n 50 |
tail -n 1` | bc -l
$ echo "1000 * " `ls /dev/sd? | wc -l ` / `sort -r out-4k  | tail -n 1` | bc -l

System 1 (4 kb blocks): Min: 1416.61 MB/s Median: 1899.82 MB/s Max: 2038.92 MB/s
System 2 (4 kb blocks): Min: 1636.24 MB/s Median: 1850.53 MB/s Max: 2039.21 MB/s
System 3 (4 kb blocks): Min: 1123.43 MB/s Median: 1373.13 MB/s Max: 1464.96 MB/s

$ echo "1000 * " `ls /dev/sd? | wc -l ` / `sort out-1000k | tail -n 1` | bc -l
$ echo "1000 * " `ls /dev/sd? | wc -l ` / `sort out-1000k | head -n 50
| tail -n 1` | bc -l
$ echo "1000 * " `ls /dev/sd? | wc -l ` / `sort -r out-1000k  | tail
-n 1` | bc -l

System 1 (1000 kb blocks): Min: 1389.76 MB/s Median: 1909.72 MB/s Max: 2044.60 MB/s
System 2 (1000 kb blocks): Min: 1593.13 MB/s Median: 1799.30 MB/s Max: 1975.68 MB/s
System 3 (1000 kb blocks): Min: 1072.26 MB/s Median: 1345.02 MB/s Max: 1459.39 MB/s

If you compare the numbers between the two block sizes you can see that
the ranges and medians are almost identical. Is this the kind of
throughput suffering you expected from not using your recommended block
size? I would say this suffering is hardly worth mentioning -
it could just as well be due to variation.

> Additionally, you will typically see optimum throughput using bs values
> of between 4096 and 16384 bytes.  Below and above that throughput
> typically falls.  Test each page size multiple from 4096 to 32768 to
> confirm on your system.

Are you aware that the 'dd' part of the script is for setting up the
loop back devices? That part is not timed at all, so if that part took
twice as long it would not change the validity of the test at all.

> Also, using large block sizes causes dd to buffer large amounts of data
> into memory as each physical IO is only 4096 bytes.  Thus dd doesn't
> actually start writing to disk until each block is buffered into RAM, in
> this case just under 1MB.  This reduces efficiency by quite a bit vs the
> 4096 byte block size which allows streaming directly from dd without the
> buffering.

Are you aware that the test takes place in RAM, and not on magnetic media?

>> The 900 MB/s was based on my old controller. I re-measured using my
>> new controller and get closer to 2000 MB/s in raw (non-RAID)
>> performance, which is close to the theoretical maximum for that
>> controller (2400 MB/s). This indicated that hardware is not a
>> bottleneck.
>>
>>>> When I set the disks up as a 24 disk software RAID6 I get 400 MB/s
>>>> write and 600 MB/s read. It seems to be due to checksuming, as I have
>>>> a single process (md0_raid6) taking up 100% of one CPU.
>
> The dd block size will likely be even more critical when dealing with
> parity arrays, as non page size blocks will cause problems with stripe
> aligned writes.

Again: the dd is not done on the array. It is done on the separate
devices, to measure maximal hardware performance and to set up the
loopback devices in RAM, respectively.

Did you run the test script? What were your numbers? Did md0_raid6
take up 100% CPU of one core during the copy?


/Ole

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-05 11:25     ` Peter Grandi
@ 2012-06-05 20:57       ` Ole Tange
  2012-06-06 17:37         ` Peter Grandi
  0 siblings, 1 reply; 38+ messages in thread
From: Ole Tange @ 2012-06-05 20:57 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

On Tue, Jun 5, 2012 at 1:25 PM, Peter Grandi <pg@lxra2.for.sabi.co.uk> wrote:
:
> It does not change much of the conclusions as to the (euphemism)
> audacity of your conclusions), but you have created a 21+2 RAID6
> set, as the 24th block device is a spare:
>
>  seq 24 | parallel -X --tty mdadm --create --force /dev/md0 -c $CHUNK --level=6 --raid-devices=23 -x 1 /dev/loop{}

That is correct. It reflects the physical setup of the 24 physical drives.

>>>> I get 400 MB/s write and 600 MB/s read. It seems to be due
>>>> to checksuming, as I have a single process (md0_raid6)
>>>> taking up 100% of one CPU.
>
> [ ... ]
>
>> The 900 MB/s was based on my old controller. I re-measured
>> using my new controller and get closer to 2000 MB/s in raw
>> (non-RAID) performance, which is close to the theoretical
>> maximum for that controller (2400 MB/s). This indicated that
>> hardware is not a bottleneck.
>
> A 21+2 drive RAID6 set is (euphemism) brave, and perhaps it
> matches the (euphemism) strategic insight that only checksumming
> withing MD could account for 100% CPU time in a single threaded
> way.

It is not a guess that md0_raid6 takes up 100% of 1 core. It is
reported by 'top'.

But maybe you are right: The 100% that md0_raid6 uses could be due to
something other than checksumming. But the test clearly shows that
chunk size has a huge impact on the amount of CPU time md0_raid6 has
to use.

> But as a start you could try running your (euphemism) "test"
> with O_DIRECT:
>
>  http://www.sabi.co.uk/blog/0709sep.html#070919
>
> While making sure that the IO is stripe aligned (21 times the
> chunk size).

It is unclear to me how to change the timed part of the test script to
use O_DIRECT and make it stripe aligned:

seq 10 | time parallel mkdir -p /mnt/md0/{}\;tar -x -C /mnt/md0/{} -f linux.tar\; sync
seq 10 | time parallel mkdir -p /mnt/md0/{}\;cp linux.tar /mnt/md0/{} \; sync

Please advise.

> Your (euphemism) tests could also probably benefit from more
> care about (euphemism) details like commit semantics, as the use
> of 'sync' in your scripts seems to me based on (euphemism)
> unconventional insight, for example this:
>
>  «seq 10 | time parallel mkdir -p /mnt/md0/{}\;tar -x -C /mnt/md0/{} -f linux.tar\; sync»

Feel free to substitute with:

seq 10 | time parallel mkdir -p /mnt/md0/{}\;tar -x -C /mnt/md0/{} -f linux.tar
time sync

Here you will have to add the two durations.

With that modification I get:

Chunk size (KB)   10 kernel sources as files   10 kernel sources as one tar file
16                29s                          13s
32                28s                          11s
64                29s                          13s
128               34s                          10s
256               41s                          11s
4096              1m35s                        2m15s (!)

Most numbers are comparable to the original results
http://oletange.blogspot.dk/2012/05/software-raid-performance-on-24-disks.html

The 2m15s result for the 4096 big-file test was a bit surprising, so
I re-ran that test and got 2m36s.

> But also more divertingly:
>
>  «seq 24 | parallel dd if=/dev/zero of=tmpfs/disk{} bs=500k count=1k
>  seq 24 | parallel losetup /dev/loop{} tmpfs/disk{}
>  sync
>  sleep 1;
>  sync»

Are you aware that this part is for the setup of the test? It is not
the timed section and thus it does not affect the validity of the
test.

> and even:
>
>  «mount /dev/md0 /mnt/md0
>  sync»

Yeah that part was a bit weird, but I had 1 run where the script
failed without the 'sync'. And again: Are you aware that this part is
for the setup of the test? It is not the timed section and thus does
not change the validity of the test.

> Perhaps you might also want to investigate the behaviour of
> 'tmpfs' and 'loop' devices, as it seems quite (euphemism)
> creative to me to have RAID set member block devices as 'loop's
> over 'tmpfs' files:
>
>  «mount -t tmpfs tmpfs tmpfs
>  seq 24 | parallel dd if=/dev/zero of=tmpfs/disk{} bs=500k count=1k
>  seq 24 | parallel losetup /dev/loop{} tmpfs/disk{}»

How would YOU design the test so that:

* it is reproducible for others?
* it does not depend on controllers and disks?
* it uses 24 devices?
* it uses different chunk sizes?
* it tests both big and small file performance?

> Put another way, most aspects of your (euphemism) tests seem to
> me rather (euphemism) imaginative.

Did you run the test script? What were your numbers? Did md0_raid6
take up 100% CPU of 1 core during the copy? And if so: Can you explain
why md0_raid6 would take up 100% CPU of 1 core?


/Ole

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-05 13:09       ` Peter Grandi
@ 2012-06-05 21:17         ` Ole Tange
  2012-06-06  1:38           ` Stan Hoeppner
  0 siblings, 1 reply; 38+ messages in thread
From: Ole Tange @ 2012-06-05 21:17 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

On Tue, Jun 5, 2012 at 3:09 PM, Peter Grandi <pg@lxra2.for.sabi.co.uk> wrote:
> [ ... ]
>
>>> Good call. But the resync is done before the mkfs.xfs is finished, so
>>> the time of the copying is not affected by resync.
>>>
>>> I re-tested with --assume-clean and as expected it has no impact.
>
>>    Wanna try CONFIG_MULTICORE_RAID456? :-)
>
> That would be intreresting, but the original post reports over
> 6GB/s for pure checksumming, and around 400MB/s actual transfer
> rate. In theory there is no need here for multihreading. There
> may something else going on :-).

I have the feeling that some of you have not experienced md0_raid6
taking up 100% CPU of a single core. If you have not, please run the
test on http://oletange.blogspot.dk/2012/05/software-raid-performance-on-24-disks.html

The test requires 10 GB RAM and at least 2 CPU cores, and takes less
than 3 minutes to run.

See if you can reproduce the CPU usage, and post your results along
with the checksumming speed reported by the kernel.


/Ole

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-05 21:17         ` Ole Tange
@ 2012-06-06  1:38           ` Stan Hoeppner
  0 siblings, 0 replies; 38+ messages in thread
From: Stan Hoeppner @ 2012-06-06  1:38 UTC (permalink / raw)
  To: Ole Tange; +Cc: Peter Grandi, Linux RAID

On 6/5/2012 4:17 PM, Ole Tange wrote:
> On Tue, Jun 5, 2012 at 3:09 PM, Peter Grandi <pg@lxra2.for.sabi.co.uk> wrote:
>> [ ... ]
>>
>>>> Good call. But the resync is done before the mkfs.xfs is finished, so
>>>> the time of the copying is not affected by resync.
>>>>
>>>> I re-tested with --assume-clean and as expected it has no impact.
>>
>>>    Wanna try CONFIG_MULTICORE_RAID456? :-)
>>
>> That would be intreresting, but the original post reports over
>> 6GB/s for pure checksumming, and around 400MB/s actual transfer
>> rate. In theory there is no need here for multihreading. There
>> may something else going on :-).
> 
> I have the feeling that some of you have not experienced md0_raid6
> taking up 100% CPU of a single core. If you have not, please run the
> test on http://oletange.blogspot.dk/2012/05/software-raid-performance-on-24-disks.html
> 
> The test requires 10 GB RAM, atleast 2 CPU cores, and takes less than
> 3 minutes to run.
> 
> See if you can reproduce the CPU usage, and post your results along
> with the reported checksumming speed reported by the kernel.

There's no need for anyone to duplicate this testing.  It's already been
done, problem code identified, and patches submitted, about a week
before you started this thread.

Patches to make md RAID 1/10/5 write ops multi-threaded have already
been submitted (read ops already in essence are multi-threaded).  A
patch for RAID 6 has not yet been submitted but is probably in the
works.  Your thread comes about a week on the heels of the most recent
discussion of this problem.  See the archives.

And specifically, search the list archive for these thread scalability
patches by Shaohua Li.  AFAIK the patches haven't been accepted yet, and
it will likely be a while before they hit mainline.

In the meantime, the quickest way to "restore" your lost performance
while still using parity, and without sacrificing lots of platter space,
is to set two spares and create two 11-drive RAID5 arrays.  This costs
you one additional disk as a spare.  Each md RAID5 thread will run on a
different core, and with only 11 SRDs shouldn't peak a single core,
unless you have really slow cores such as the dual-core Intel Atom 330 @
1.60 GHz.

If you need a single file space then layer a concatenated array
(--linear) over the two RAID 5 arrays and format the --linear device
with XFS, which will yield multi-threaded/multi-user parallelism with a
concatenated volume, assuming your workload writes files to multiple
directories.  If you think you need maximum single file streaming
performance, then lay a RAID 0 stripe over the two RAID5s and use
whichever filesystem you like.  If it's XFS, take care to properly align
writes.  This can be difficult using a nested stripe over multiple
parity arrays.
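
A rough sketch of the concatenated layout (device names are placeholders,
and chunk sizes are left at the defaults):

  # two 11-drive RAID5 arrays, each with one hot spare
  mdadm --create /dev/md1 --level=5 --raid-devices=11 -x 1 /dev/sd[b-m]
  mdadm --create /dev/md2 --level=5 --raid-devices=11 -x 1 /dev/sd[n-y]
  # concatenate them and put XFS on top
  mdadm --create /dev/md3 --level=linear --raid-devices=2 /dev/md1 /dev/md2
  mkfs.xfs /dev/md3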

-- 
Stan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-05 18:44       ` Ole Tange
@ 2012-06-06  1:40         ` Brad Campbell
  2012-06-06  3:48           ` Marcus Sorensen
  2012-06-06 11:17           ` Ole Tange
  0 siblings, 2 replies; 38+ messages in thread
From: Brad Campbell @ 2012-06-06  1:40 UTC (permalink / raw)
  To: Ole Tange; +Cc: Igor M Podlesny, linux-raid

On 06/06/12 02:44, Ole Tange wrote:
> On Tue, Jun 5, 2012 at 1:29 PM, Igor M Podlesny<for.poige+lsr@gmail.com>  wrote:
>> On 5 June 2012 15:47, Ole Tange<ole@tange.dk>  wrote:
>>> On Tue, Jun 5, 2012 at 5:39 AM, Igor M Podlesny<for.poige+lsr@gmail.com>  wrote:
>>>> On 5 June 2012 07:14, Ole Tange<ole@tange.dk>  wrote:
>>>> […]
>>>>> I tested this by creating 24 devices in RAM, used different chunk
>>>>> sizes, and then copied the linux kernel source. Test script can be
>>>>> found on http://oletange.blogspot.dk/2012/05/software-raid-performance-on-24-disks.html
> :
>>    Wanna try CONFIG_MULTICORE_RAID456? :-)
>
> If the kernel can checksum 6196 MB/s why would I need
> CONFIG_MULTICORE_RAID456? Please elaborate on why you think that is
> needed.

I'd have thought there was a significant difference between the test 
generating that figure (being a large, single block being checksummed) 
and shunting around blocks from 20 odd block devices, arranging them and 
checksumming them.

I'm not debating the validity of your tests at all; however, I do 
question your assertion that a single raid6 thread should even get close 
to that theoretical figure when actually doing real work.

Why not do as the man suggested and enable CONFIG_MULTICORE_RAID456 and 
see what happens?

Regards,
Brad

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-06  1:40         ` Brad Campbell
@ 2012-06-06  3:48           ` Marcus Sorensen
  2012-06-06 11:21             ` Ole Tange
  2012-06-06 11:17           ` Ole Tange
  1 sibling, 1 reply; 38+ messages in thread
From: Marcus Sorensen @ 2012-06-06  3:48 UTC (permalink / raw)
  To: Brad Campbell; +Cc: Ole Tange, Igor M Podlesny, linux-raid

That's my thought. The checksumming speed doesn't remain static
regardless of what's being checksummed.  If I want to calculate double
parity on a stripe that spans 30 disks I'd expect that to be more CPU
intensive than calculating double parity on a 3 disk stripe.

On Tue, Jun 5, 2012 at 7:40 PM, Brad Campbell <lists2009@fnarfbargle.com> wrote:
> On 06/06/12 02:44, Ole Tange wrote:
>>
>> On Tue, Jun 5, 2012 at 1:29 PM, Igor M Podlesny<for.poige+lsr@gmail.com>
>>  wrote:
>>>
>>> On 5 June 2012 15:47, Ole Tange<ole@tange.dk>  wrote:
>>>>
>>>> On Tue, Jun 5, 2012 at 5:39 AM, Igor M Podlesny<for.poige+lsr@gmail.com>
>>>>  wrote:
>>>>>
>>>>> On 5 June 2012 07:14, Ole Tange<ole@tange.dk>  wrote:
>>>>> […]
>>>>>>
>>>>>> I tested this by creating 24 devices in RAM, used different chunk
>>>>>> sizes, and then copied the linux kernel source. Test script can be
>>>>>> found on
>>>>>> http://oletange.blogspot.dk/2012/05/software-raid-performance-on-24-disks.html
>>
>> :
>>>
>>>   Wanna try CONFIG_MULTICORE_RAID456? :-)
>>
>>
>> If the kernel can checksum 6196 MB/s why would I need
>> CONFIG_MULTICORE_RAID456? Please elaborate on why you think that is
>> needed.
>
>
> I'd have thought there was a significant difference between the test
> generating that figure (being a large, single block being checksummed) and
> shunting around blocks from 20 odd block devices, arranging them and
> checksumming them.
>
> I'm not debating the validity of your tests at all, however I do question
> your assertion than a single raid6 thread should even get close to that
> theoretical figure when actually doing real work.
>
> Why not do as the man suggested and enable CONFIG_MULTICORE_RAID456 and see
> what happens?
>
> Regards,
> Brad
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-06  1:40         ` Brad Campbell
  2012-06-06  3:48           ` Marcus Sorensen
@ 2012-06-06 11:17           ` Ole Tange
  2012-06-06 12:58             ` Brad Campbell
  1 sibling, 1 reply; 38+ messages in thread
From: Ole Tange @ 2012-06-06 11:17 UTC (permalink / raw)
  To: Brad Campbell; +Cc: linux-raid

On Wed, Jun 6, 2012 at 3:40 AM, Brad Campbell <lists2009@fnarfbargle.com> wrote:
> On 06/06/12 02:44, Ole Tange wrote:
>> On Tue, Jun 5, 2012 at 1:29 PM, Igor M Podlesny<for.poige+lsr@gmail.com>
>>  wrote:
:
>>>   Wanna try CONFIG_MULTICORE_RAID456? :-)
>>
>> If the kernel can checksum 6196 MB/s why would I need
>> CONFIG_MULTICORE_RAID456? Please elaborate on why you think that is
>> needed.
>
> I'd have thought there was a significant difference between the test
> generating that figure (being a large, single block being checksummed) and
> shunting around blocks from 20 odd block devices, arranging them and
> checksumming them.

Is this based on gut feeling? Or do you have numbers to back up this claim?

> I'm not debating the validity of your tests at all, however I do question
> your assertion than a single raid6 thread should even get close to that
> theoretical figure when actually doing real work.

It would be good if we had someone who actually _knew_ (not just by
gut feeling) what the kernel reported checksumming is based on, and
how we can compute the expected performance for checksumming for a 24
disk RAID6.

From the source it seems the checksumming is using 16+2 disks:

    ./lib/raid6/algos.c:	const int disks = (65536/PAGE_SIZE)+2;

That is fairly close to the 21+2 disks in my setup.

The chunk size seems to be 4KB:

                                (*algo)->gen_syndrome(disks, PAGE_SIZE, *dptrs);

which is not close to my setup (ranging from 16KB to 4096KB).

This might mean that the only RAID6 setup in which you can expect the
checksumming performance reported by the kernel is 16+2 disks and a
chunk size of 4KB.

But if I try that setup on the test in RAM, md0_raid6 still takes up
more CPU time than the checksumming would account for.
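
(A 16+2 array with 4 KB chunks can be created like this - a sketch using
the loop devices from the test script, not necessarily my exact command:

  mdadm --create --force /dev/md0 -c 4 --level=6 --raid-devices=18 /dev/loop{1..18}
)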

> Why not do as the man suggested and enable CONFIG_MULTICORE_RAID456 and see
> what happens?

It is a lot of work to put into testing something that is at best a guess.

In the best case it shows that it will work with multicore, but it
would not be a solution for me (the option being experimental).

But it would be great to hear from someone who has
CONFIG_MULTICORE_RAID456 enabled already and see if they can reproduce
the results.


/Ole

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-06  3:48           ` Marcus Sorensen
@ 2012-06-06 11:21             ` Ole Tange
  0 siblings, 0 replies; 38+ messages in thread
From: Ole Tange @ 2012-06-06 11:21 UTC (permalink / raw)
  To: Marcus Sorensen; +Cc: linux-raid

On Wed, Jun 6, 2012 at 5:48 AM, Marcus Sorensen <shadowsor@gmail.com> wrote:

> That's my thought. The checksumming speed doesn't remain static
> regardless of what's being checksummed.  If I want to calculate double
> parity on a stripe that spans 30 disks I'd expect that to be more CPU
> intensive than calculating double parity on a 3 disk stripe.

If the checksumming speed differs wildly, then it seems odd that the
kernel chooses the best checksumming algorithm before it knows what it
is going to checksum. Would it not make more sense to defer deciding
the algorithm till you know the actual task?


/Ole

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-06 11:17           ` Ole Tange
@ 2012-06-06 12:58             ` Brad Campbell
  0 siblings, 0 replies; 38+ messages in thread
From: Brad Campbell @ 2012-06-06 12:58 UTC (permalink / raw)
  To: Ole Tange; +Cc: linux-raid

On 06/06/12 19:17, Ole Tange wrote:

> But if I try that setup on the test in RAM, md0_raid6 still takes up
> more CPU time than the checksumming would account for.

What part of "and shunting around blocks from 20 odd block devices, 
arranging them and checksumming them." are you missing?

The number your kernel gives you at bootup comes from taking a block of 
data and checksumming it. In your real-world results (in the same thread 
as the one doing the checksum) you are juggling the IO from all the disks, 
managing the buffers that result from that, and calculating all the block 
positions. In what conceivable way can you conclude that a single thread 
can do all that and still give you the throughput the fabricated 
benchmark does?

>> Why not do as the man suggested and enable CONFIG_MULTICORE_RAID456 and see
>> what happens?
>
> It is a lot of work to put into testing something that is at best a guess.

Can lead a horse to water.

Regards,
Brad

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-04 23:14 Software RAID checksum performance on 24 disks not even close to kernel reported Ole Tange
                   ` (2 preceding siblings ...)
  2012-06-05  3:39 ` Igor M Podlesny
@ 2012-06-06 14:11 ` Ole Tange
  2012-06-06 16:05   ` Igor M Podlesny
  2012-06-06 16:09   ` Dan Williams
  3 siblings, 2 replies; 38+ messages in thread
From: Ole Tange @ 2012-06-06 14:11 UTC (permalink / raw)
  To: linux-raid

On Tue, Jun 5, 2012 at 1:14 AM, Ole Tange <ole@tange.dk> wrote:

> But I cannot explain why even the best performance (4600 MB/11s = 420
> MB/s) is not even close to the checksum performance reported by the
> kernel at boot (6196 MB/s):

From the friendly people on the mailing list the answer can be summarized as:

Checksumming is only a minor part of what md0_raid6 has to do. A lot
of the work is shuffling data around. The reason why checksumming is
in the kernel log is that the checksumming algorithm is one part
that can be optimized, whereas the other parts stay the same no matter
which checksumming algorithm is chosen.

So when parity computation is mentioned under performance on
http://blog.zorinaq.com/?e=10 it is only a small part of the picture:
the parity computation may take only 1.5% CPU time, but shuffling the
data around can take orders of magnitude longer - thus software RAID
may not be able to outperform hardware RAID in that respect.

In other words: what you see is normal, and it is not out of the
ordinary to see md0_raid6 use 100% CPU time on a single core when
using a 24 disk RAID6. Work is underway to spread the load to multiple
cores using the experimental kernel config option
CONFIG_MULTICORE_RAID456.

If the bottleneck is md0_raid6, choose a chunk size that gives
reasonable performance on your CPU (which in your case seems to be
32-64 KB).
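
For example (device list illustrative; otherwise the same parameters as
in the test script), a 64 KB chunk array would be created with:

  mdadm --create --force /dev/md0 -c 64 --level=6 --raid-devices=23 -x 1 /dev/sd?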


/Ole

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-06 14:11 ` Ole Tange
@ 2012-06-06 16:05   ` Igor M Podlesny
  2012-06-06 19:51     ` Ole Tange
  2012-06-06 16:09   ` Dan Williams
  1 sibling, 1 reply; 38+ messages in thread
From: Igor M Podlesny @ 2012-06-06 16:05 UTC (permalink / raw)
  To: Ole Tange; +Cc: linux-raid

On 6 June 2012 22:11, Ole Tange <ole@tange.dk> wrote:
> On Tue, Jun 5, 2012 at 1:14 AM, Ole Tange <ole@tange.dk> wrote:
[…]
> If the bottleneck is md0_raid6, choose a chunk size that gives
> reasonable performance on your CPU (which in your case seem to be
> 32-64 KB).

   Which you should never do, since it's simply stupid to involve a
bunch of disks in such brief I/O ops.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-06 14:11 ` Ole Tange
  2012-06-06 16:05   ` Igor M Podlesny
@ 2012-06-06 16:09   ` Dan Williams
  2012-06-06 19:19     ` Ole Tange
  2012-06-07  4:06     ` Stan Hoeppner
  1 sibling, 2 replies; 38+ messages in thread
From: Dan Williams @ 2012-06-06 16:09 UTC (permalink / raw)
  To: Ole Tange; +Cc: linux-raid

On Wed, Jun 6, 2012 at 7:11 AM, Ole Tange <ole@tange.dk> wrote:
> On Tue, Jun 5, 2012 at 1:14 AM, Ole Tange <ole@tange.dk> wrote:
>
>> But I cannot explain why even the best performance (4600 MB/11s = 420
>> MB/s) is not even close to the checksum performance reported by the
>> kernel at boot (6196 MB/s):
>
> From the friendly people on the mailing list the answer can be summarized as:
>
> Checksumming is only a minor part of what md0_raid6 has to do. A lot
> of the work is shuffling data around. The reason why checksumming is
> in the kernel log is because the checksumming algorithm is one part
> that can be optimized where as the other parts stay the same no matter
> chosen the checksumming algorithm.

The checksum algorithm benchmark is the value for 16 data disks, so a
24-disk array will be a little bit slower.

> So when parity computing is mentioned under performance on
> http://blog.zorinaq.com/?e=10 is is only a small part of the picture:
> The parity computing may only take 1.5% CPU time, but the shuffling
> data around can take several magnitudes longer - thus software RAID
> may not be able to outperform hardware RAID in that aspect.

Hardware RAID ultimately does the same shuffling. Outside of NVRAM, an
advantage it has is that parity data does not traverse the bus; its
potential disadvantage is an embedded CPU / memory vs host CPU /
memory.

> In other words: What you see is normal, and it is not out of the
> ordinary to see md0_raid6 use 100% CPU time on a single core when
> using a 24 disk RAID6. Work is underway to spread the load to multiple
> cores using the experimental kernel parameter
> CONFIG_MULTICORE_RAID456.

Don't use CONFIG_MULTICORE_RAID456, we need a different approach.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-05 20:57       ` Ole Tange
@ 2012-06-06 17:37         ` Peter Grandi
  0 siblings, 0 replies; 38+ messages in thread
From: Peter Grandi @ 2012-06-06 17:37 UTC (permalink / raw)
  To: Linux RAID

[ ... ]

>> A 21+2 drive RAID6 set is (euphemism) brave, and perhaps it
>> matches the (euphemism) strategic insight that only
>> checksumming withing MD could account for 100% CPU time in a
>> single threaded way.

> It is not a guess that md0_raid6 takes up 100% of 1 core. It
> is reported by 'top'.

> But maybe you are right: The 100% that md0_raid6 uses could be
> due to something other than checksumming. But the test clearly
> show that chunk size has a huge impact on the amount of CPU
> time md0_raid6 has to use.

The (euphemism) test(s) much more "clearly show" something else
entirely :-).

For a (euphemism) different approach, here is, in three lines, a
"test" that in its minuscule simplicity (lots of improvements
could be made) illustrates several ways in which it is
(euphemism) different from the one reported above:
------------------------------------------------------------------------
  base#  mdadm --create /dev/md0 -c 64 --level=6 --raid-devices=16 /dev/ram{0..15}
  mdadm: array /dev/md0 started.
------------------------------------------------------------------------
  base#  time dd bs=$((14 * 64 * 1024)) of=/dev/zero iflag=direct if=/dev/md0
  255+0 records in
  255+0 records out
  233963520 bytes (234 MB) copied, 0.0453674 seconds, 5.2 GB/s

  real    0m0.047s
  user    0m0.000s
  sys     0m0.047s
------------------------------------------------------------------------
  base#  sysctl vm/drop_caches=1; time dd bs=$((14 * 64 * 1024)) of=/dev/zero if=/dev/md0
vm.drop_caches = 1
  255+0 records in
  255+0 records out
  233963520 bytes (234 MB) copied, 0.285007 seconds, 821 MB/s

  real    0m0.360s
  user    0m0.000s
  sys     0m0.286s
------------------------------------------------------------------------

Note that this is about *reading* and thus there is no
"checksum" calculation involved. It was amusing also to rerun
the above on 'ram0' instead of 'md0' for comparison.

It was also quite depressing to me to try the same for *writing*
and try different 'bs=' values.

Other (euphemism) different tests: I have compared writing to a
RAID0 set of equivalent stripe width (14) and to a RAID5 set of
equivalent stripe width (14+1).

PS: Running any "test" on a RAID set of in-memory block devices
    seems to me to be (euphemism) entertaining rather than
    useful as RAM accesses are not that parallelizable, and this
    breaks a pretty fundamental assumption.

[ ... ]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-06 16:09   ` Dan Williams
@ 2012-06-06 19:19     ` Ole Tange
  2012-06-06 19:24       ` Dan Williams
  2012-06-07  4:06     ` Stan Hoeppner
  1 sibling, 1 reply; 38+ messages in thread
From: Ole Tange @ 2012-06-06 19:19 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-raid

On Wed, Jun 6, 2012 at 6:09 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Wed, Jun 6, 2012 at 7:11 AM, Ole Tange <ole@tange.dk> wrote:
>> On Tue, Jun 5, 2012 at 1:14 AM, Ole Tange <ole@tange.dk> wrote:
>>
>>> But I cannot explain why even the best performance (4600 MB/11s = 420
>>> MB/s) is not even close to the checksum performance reported by the
>>> kernel at boot (6196 MB/s):
>>
>> From the friendly people on the mailing list the answer can be summarized as:
:
>> In other words: What you see is normal, and it is not out of the
>> ordinary to see md0_raid6 use 100% CPU time on a single core when
>> using a 24 disk RAID6. Work is underway to spread the load to multiple
>> cores using the experimental kernel parameter
>> CONFIG_MULTICORE_RAID456.
>
> Don't use CONFIG_MULTICORE_RAID456, we need a different approach.

Here you disagree with Brad Campbell and Igor M Podlesny. Can you
elaborate why you do not think I should use CONFIG_MULTICORE_RAID456
and why you do not think that approach will work?


/Ole

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-06 19:19     ` Ole Tange
@ 2012-06-06 19:24       ` Dan Williams
  2012-06-06 19:26         ` Ole Tange
  0 siblings, 1 reply; 38+ messages in thread
From: Dan Williams @ 2012-06-06 19:24 UTC (permalink / raw)
  To: Ole Tange; +Cc: linux-raid, shli

On Wed, Jun 6, 2012 at 12:19 PM, Ole Tange <ole@tange.dk> wrote:
> On Wed, Jun 6, 2012 at 6:09 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> On Wed, Jun 6, 2012 at 7:11 AM, Ole Tange <ole@tange.dk> wrote:
>>> On Tue, Jun 5, 2012 at 1:14 AM, Ole Tange <ole@tange.dk> wrote:
>>>
>>>> But I cannot explain why even the best performance (4600 MB/11s = 420
>>>> MB/s) is not even close to the checksum performance reported by the
>>>> kernel at boot (6196 MB/s):
>>>
>>> From the friendly people on the mailing list the answer can be summarized as:
> :
>>> In other words: What you see is normal, and it is not out of the
>>> ordinary to see md0_raid6 use 100% CPU time on a single core when
>>> using a 24 disk RAID6. Work is underway to spread the load to multiple
>>> cores using the experimental kernel parameter
>>> CONFIG_MULTICORE_RAID456.
>>
>> Don't use CONFIG_MULTICORE_RAID456, we need a different approach.
>
> Here you disagree with Brad Campbell and Igor M Podlesny. Can you
> elaborate why you do not think I should use CONFIG_MULTICORE_RAID456
> and why you do not think that approach will work?

Because I wrote the code and it doesn't work the way I want it to,
hence the experimental tag and why I need to go review Shaohua's
recent patches.

--
Dan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-06 19:24       ` Dan Williams
@ 2012-06-06 19:26         ` Ole Tange
  0 siblings, 0 replies; 38+ messages in thread
From: Ole Tange @ 2012-06-06 19:26 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-raid

On Wed, Jun 6, 2012 at 9:24 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Wed, Jun 6, 2012 at 12:19 PM, Ole Tange <ole@tange.dk> wrote:
>> On Wed, Jun 6, 2012 at 6:09 PM, Dan Williams <dan.j.williams@intel.com> wrote:
:
>>> Don't use CONFIG_MULTICORE_RAID456, we need a different approach.
>>
>> Here you disagree with Brad Campbell and Igor M Podlesny. Can you
>> elaborate why you do not think I should use CONFIG_MULTICORE_RAID456
>> and why you do not think that approach will work?
>
> Because I wrote the code and it doesn't work the way I want it to [...]

It can hardly be more authoritative than this :-)


/Ole

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-06 16:05   ` Igor M Podlesny
@ 2012-06-06 19:51     ` Ole Tange
  2012-06-06 22:21       ` Igor M Podlesny
  0 siblings, 1 reply; 38+ messages in thread
From: Ole Tange @ 2012-06-06 19:51 UTC (permalink / raw)
  To: Igor M Podlesny; +Cc: linux-raid

On Wed, Jun 6, 2012 at 6:05 PM, Igor M Podlesny <for.poige+lsr@gmail.com> wrote:
> On 6 June 2012 22:11, Ole Tange <ole@tange.dk> wrote:
>> On Tue, Jun 5, 2012 at 1:14 AM, Ole Tange <ole@tange.dk> wrote:
> […]
>> If the bottleneck is md0_raid6, choose a chunk size that gives
>> reasonable performance on your CPU (which in your case seems to be
>> 32-64 KB).
>
>   Which you're never, ever to do, since it's simply stupid to involve
> a bunch of disks in such brief I/O ops.

Currently the bottleneck seems to be md0_raid6 running on one core, as
it apparently can only deliver 400 MB/s to a 24-disk RAID6. The
hardware + drivers support 2000 MB/s according to the tests.

So without change the total bandwidth is 400 MB/s.

Let us say that we can change the config so md0_raid6 can deliver 600
MB/s, but at the cost that the hardware + drivers lose 2/3 of their
performance and thus can only deliver 650 MB/s. Then the total
bandwidth will be 600 MB/s.

To me what matters is the total performance, so I would choose the 600
MB/s over the 400 MB/s any day.

So Igor: Do you have numbers that back up your claim? Or do you advise
against it just because "it's simply stupid"?
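
For what it is worth, here is a rough sketch of how such numbers could
be collected on RAM-backed devices (the brd parameters are assumptions;
--assume-clean skips the initial resync so it does not pollute the
timing, and writing full stripes avoids read-modify-write):

  modprobe brd rd_nr=24 rd_size=262144     # 24 RAM disks of 256 MiB each
  for c in 16 32 64 128 256; do
      mdadm --create /dev/md0 --run --assume-clean --level=6 \
            --chunk=$c --raid-devices=24 /dev/ram{0..23}
      echo "chunk ${c}K:"
      dd if=/dev/zero of=/dev/md0 oflag=direct bs=$((22 * c * 1024)) count=200
      mdadm --stop /dev/md0
  done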


/Ole

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-06 19:51     ` Ole Tange
@ 2012-06-06 22:21       ` Igor M Podlesny
  2012-06-06 22:53         ` Peter Grandi
  0 siblings, 1 reply; 38+ messages in thread
From: Igor M Podlesny @ 2012-06-06 22:21 UTC (permalink / raw)
  To: Ole Tange; +Cc: linux-raid

On 7 June 2012 03:51, Ole Tange <ole@tange.dk> wrote:
[…]
> So Igor: Do you have numbers that back up your claim? Or do you advise
> against it just because "it's simply stupid"?

   The reasons are pretty well described in FreeBSD 4.11 vinum's
manual (which I had been using long before LSR): «… For optimum
performance, stripes should be at least 128 kB in size: anything
smaller will result in a significant increase in I/O activity due to
mapping of individual requests over multiple disks. The performance
improvement due to the increased number of concurrent transfers caused
by this mapping will not make up for the performance drop due to the
increase in latency. A good guideline for stripe size is between 256
kB and 512 kB.  Avoid powers of 2, however: they tend to cause all
superblocks to be placed on the first subdisk. …» —
http://goo.gl/sTHxY

   P. S. Alas, LSR doesn't support anything but powers of 2 for chunk
sizes, but according to Neil Brown, it can be relatively easily changed.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-06 22:21       ` Igor M Podlesny
@ 2012-06-06 22:53         ` Peter Grandi
  2012-06-07  3:41           ` Igor M Podlesny
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Grandi @ 2012-06-06 22:53 UTC (permalink / raw)
  To: Linux RAID

[ ... ]

> The reasons are pretty well described in FreeBSD 4.11 vinum's
> manual (which I had been using long before LSR):

  «… For optimum performance, stripes should be at least 128 kB
  in size: anything smaller will result in a significant increase
  in I/O activity due to mapping of individual requests over
  multiple disks.»

That is a really gross generalization, because the aggregate
latency due to lack of arm and platter synchronization among RAID
members for example does not apply to flash SSDs, and has
dramatically different impacts on read vs. write and streaming
vs. randomish access patterns, and single vs. multiple threads
workloads.

The above recommendation applies mostly to reading, streaming,
access patterns by single threaded workloads.

However in those cases the argument is even stronger than it was
in the past: the 'man' page gives examples of disks with 6 (six)
MB/s transfer rates and 8 ms latencies, while current disks often
exceed 100 (one hundred) MB/s and access times haven't improved
much.

  «A good guideline for stripe size is between 256 kB and 512
  kB.»

It is very important to note here that "stripe size" in Vinum
means "chunk size" in MD. For many workloads that is too large a
chunk size.

  «Avoid powers of 2, however: they tend to cause all superblocks
  to be placed on the first subdisk. …» — http://goo.gl/sTHxY

> P. S. Alas, LSR doesn't support anything but powers of 2 for
> chunk sizes, but according to Neil Brown, it can be relatively
>> easily changed.

Some important Linux filesystems try to align metadata to chunk
boundaries instead, for example 'ext3' with 'stride=' and XFS
with 'su='.
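
As a concrete illustration (the numbers are assumed: a 24-disk RAID6
with a 64 KiB chunk, i.e. 22 data disks, and 4 KiB filesystem blocks):

  # ext3/4: stride = chunk / block = 16; stripe-width = 16 * 22 = 352
  mkfs.ext4 -E stride=16,stripe-width=352 /dev/md0
  # XFS: su = chunk size, sw = number of data disks
  mkfs.xfs -d su=64k,sw=22 /dev/md0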

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-06 22:53         ` Peter Grandi
@ 2012-06-07  3:41           ` Igor M Podlesny
  2012-06-07  4:59             ` Stan Hoeppner
  0 siblings, 1 reply; 38+ messages in thread
From: Igor M Podlesny @ 2012-06-07  3:41 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

On 7 June 2012 06:53, Peter Grandi <pg@lxra2.to.sabi.co.uk> wrote:
[…]
>  «A good guideline for stripe size is between 256 kB and 512
>  kB.»
>
> It is very important to note here that "stripe size" in Vinum
> means "chunk size" in MD. For many workloads that is too large a
> chunk size.

   It depends on the number of files in the working set. I still can't
see a reason to have a 4 KiB chunk (or stripe size) for a newsdir or
maildir with a huge number of tiny files: having enough of them already
guarantees an even load distribution among all of the disks, even with
a rather large chunk size. But with 4 KiB _every_ I/O op would involve
several disks, and again I don't consider that "nice".

>  «Avoid powers of 2, however: they tend to cause all superblocks
>  to be placed on the first subdisk. …» — http://goo.gl/sTHxY
>
>> P. S. Alas, LSR doesn't support anything but powers of 2 for
>> chunk sizes, but according to Neil Brown, it can be relatively
>> easily changed.
>
> Some important Linux filesystems try to align metadata to chunk
> boundaries instead, for example 'ext3' with 'stride=' and XFS
> with 'su='.

   They try, for sure, but a try is still just a try. If, for example,
you pvmove your LVM with XFS from one RAID to another one with a
different "layout", things can suddenly stop working well. Another
drawback of 2^n chunk sizes is the widening gap between them:
…1024…2048…4096…8192…. You just can't go and try something like 1731
or 3333 KiB for the chunk size.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-06 16:09   ` Dan Williams
  2012-06-06 19:19     ` Ole Tange
@ 2012-06-07  4:06     ` Stan Hoeppner
  2012-06-07 14:40       ` Joe Landman
  1 sibling, 1 reply; 38+ messages in thread
From: Stan Hoeppner @ 2012-06-07  4:06 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-raid

On 6/6/2012 11:09 AM, Dan Williams wrote:

> Hardware RAID ultimately does the same shuffling. Outside of NVRAM, an
> advantage it has is that parity data does not traverse the bus...

Are you referring to the host data bus(es)?  I.e. HT/QPI and PCIe?

With a 24 disk array, a full stripe write is only 1/12th parity data,
less than 10%.  And the buses (point to point actually) of 24 drive
caliber systems will usually start at one way B/W of 4GB/s for PCIe 2.0
x8 and with one way B/W from the PCIe controller to the CPU starting at
10.4GB/s for AMD HT 3.0 systems.  PCIe x8 is plenty to handle a 24 drive
md RAID 6, using 7.2K SATA drives anyway.

What is a bigger issue, and may actually be what you were referring to,
is read-modify-write B/W, which will incur a full stripe read and write.
 For RMW heavy workloads, this is significant.  HBA RAID does have a big
advantage here, compared to one's md array possessing the aggregate
performance to saturate the PCIe bus.
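
A rough back-of-envelope with the raw figures above (round numbers
assumed, not measurements):

  # 24-disk RAID6: 22 data + 2 parity chunks per full stripe
  echo "parity share:    $(( 2 * 100 / 24 ))%"     # ~8%, i.e. 1/12th
  echo "streaming needs: $(( 24 * 100 )) MB/s"     # 24 drives at ~100 MB/s each
  echo "PCIe 2.0 x8 raw: $(( 8 * 500 )) MB/s one-way"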

-- 
Stan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-07  3:41           ` Igor M Podlesny
@ 2012-06-07  4:59             ` Stan Hoeppner
  2012-06-07  5:22               ` Igor M Podlesny
  0 siblings, 1 reply; 38+ messages in thread
From: Stan Hoeppner @ 2012-06-07  4:59 UTC (permalink / raw)
  To: Igor M Podlesny; +Cc: Peter Grandi, Linux RAID

On 6/6/2012 10:41 PM, Igor M Podlesny wrote:

>    They try, for sure, but a try is still just a try. If, for example,
> you pvmove your LVM with XFS from one RAID to another one with a
> different "layout", things can suddenly stop working well.

Filesystems that have zero awareness of the storage geometry have poor
performance on striped RAID devices.  XFS has excellent performance on
striped RAID specifically due to this awareness.  Now you describe this
strength as a weakness due to a volume portability corner case no SA in
his right mind would attempt.

The proper way to do this is to perform an xfsdump of the filesystem to
a file, create a new XFS with the proper stripe geometry in the new
storage location (which takes all of 1 second BTW), then xfsrestore the
dump file.
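
Roughly, and only as a sketch (the device names, mount points and new
geometry below are examples, not a recipe):

  xfsdump -l 0 -f /backup/vol0.xfsdump /oldvol    # level-0 dump of the mounted fs
  mkfs.xfs -d su=64k,sw=22 /dev/md1               # geometry of the new array
  mount /dev/md1 /newvol
  xfsrestore -f /backup/vol0.xfsdump /newvol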

And now you will say "Yes, but..."

-- 
Stan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-07  4:59             ` Stan Hoeppner
@ 2012-06-07  5:22               ` Igor M Podlesny
  2012-06-07  9:03                 ` Stan Hoeppner
  0 siblings, 1 reply; 38+ messages in thread
From: Igor M Podlesny @ 2012-06-07  5:22 UTC (permalink / raw)
  To: stan; +Cc: Peter Grandi, Linux RAID

On 7 June 2012 12:59, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 6/6/2012 10:41 PM, Igor M Podlesny wrote:
>>    They try, for sure, but a try is still just a try. If, for example,
>> you pvmove your LVM with XFS from one RAID to another one with a
>> different "layout", things can suddenly stop working well.
>
> Filesystems that have zero awareness of the storage geometry have poor
> performance on striped RAID devices.  XFS has excellent performance on
> striped RAID specifically due to this awareness.  Now you describe this

   And Btrfs is way faster (~ 30 %) on a single sustained 22 GiB read
— http://poige.livejournal.com/560643.html

   XFS is excellent but mainly for parallel I/O, due to its multiple
allocation groups, not "RAID awareness" — EXT3/4 can be formatted
aligned to the RAID's layout just like XFS, in case you didn't know it.

> strength as a weakness due to a volume portability corner case no SA in
> his right mind would attempt.
>
> The proper way to do this is to perform an xfsdump of the filesystem to
> a file, create a new XFS with the proper stripe geometry in the new
> storage location (which takes all of 1 second BTW), then xfsrestore the
> dump file.

   Damn the proper way, Stan, if it's an inconvenient one and better
results can be achieved automagically using another way. Which one is
more proper then? )

   I see no point in arguing just to argue. I see no drawbacks in
chunk sizes not limited to 2^n, since at least that doesn't prohibit
2^n either. So, there's no point really.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-07  5:22               ` Igor M Podlesny
@ 2012-06-07  9:03                 ` Stan Hoeppner
  2012-06-07  9:22                   ` Igor M Podlesny
  0 siblings, 1 reply; 38+ messages in thread
From: Stan Hoeppner @ 2012-06-07  9:03 UTC (permalink / raw)
  To: Igor M Podlesny; +Cc: Peter Grandi, Linux RAID

On 6/7/2012 12:22 AM, Igor M Podlesny wrote:
> On 7 June 2012 12:59, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> On 6/6/2012 10:41 PM, Igor M Podlesny wrote:
>>>    They try, for sure, but a try is still just a try. If, for example,
>>> you pvmove your LVM with XFS from one RAID to another one with a
>>> different "layout", things can suddenly stop working well.
>>
>> Filesystems that have zero awareness of the storage geometry have poor
>> performance on striped RAID devices.  XFS has excellent performance on
>> striped RAID specifically due to this awareness.  Now you describe this
> 
>    And Btrfs is way faster (~ 30 %) on a single sustained 22 GiB read
> — http://poige.livejournal.com/560643.html

I don't see anything there that credibly demonstrates what you state
here.  Also note this is an English only mailing list.  Linking to a
forum that is primarily in Russian, I assume, and where half the posts
are by you, doesn't lend any credibility to your arguments.

>    XFS is excellent but mainly for parallel I/O, due to its multiple
> allocation groups, not "RAID awareness"

With storage geometries more complex than a single RAID array, which is
probably more common with XFS than not, allocation groups are then
designed around the storage geometry.  Thus these two things are equally
important to the overall performance of the filesystem, especially with
high IOPS metadata heavy workloads.  So yes, storage geometry awareness
plays a very large role in overall performance.  I don't have the time
to post an example scenario at the moment.  You can see examples in
previous posts of mine relating to maildir performance.

> — EXT3/4 can be formatted
> aligned to the RAID's layout just like XFS, in case you didn't know it.

Yes, EXT can be informed of the geometry on the command line.  I freely
admit I don't keep up with EXT development, but last I recall mke2fs
didn't query md and populate its stripe parameters automatically.  XFS
has for quite some time.  I always do mine manually anyway, so automatic
stuff really doesn't matter to me.  Just pointing out a difference, if
it still exists.

>> strength as a weakness due to a volume portability corner case no SA in
>> his right mind would attempt.
>>
>> The proper way to do this is to perform an xfsdump of the filesystem to
>> a file, create a new XFS with the proper stripe geometry in the new
>> storage location (which takes all of 1 second BTW), then xfsrestore the
>> dump file.
> 
>    Damn the proper way, Stan, if it's an inconvenient one and better
> results can be achieved automagically using another way. Which one is
> more proper then? )

It depends on how much one values his data integrity and performance,
and what one considers "inconvenient".  If it takes 3 hours to move a
large filesystem to another array with your pvmove method and you end up
with performance problems afterward, what's the real human time
difference if it takes 5 hours to do it correctly with xfsdump/restore
and an mkfs.xfs?

BTW, if you use an aligned EXT4, you have the same problem with the new
RAID geometry.  But EXT4 doesn't have integrated dump/restore
facilities, so you'd have to use something like tar, which will take
many times longer due to all the system calls.  xfsdump/restore send
commands directly to the filesystem driver--no user space calls.

>    I see no point in arguing just to argue.

Accepting the fact there are people on this list who have far more
knowledge of XFS internals than yourself would be a good start.

-- 
Stan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-07  9:03                 ` Stan Hoeppner
@ 2012-06-07  9:22                   ` Igor M Podlesny
  0 siblings, 0 replies; 38+ messages in thread
From: Igor M Podlesny @ 2012-06-07  9:22 UTC (permalink / raw)
  To: stan; +Cc: Peter Grandi, Linux RAID

On 7 June 2012 17:03, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>>    I see no point in arguing just to argue.
>
> Accepting the fact there are people on this list who have far more
> knowledge of XFS internals than yourself would be a good start.

   It has nothing to do with XFS, but rather with trolling, I guess.
That much I surely can admit: you're a superior one.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-07  4:06     ` Stan Hoeppner
@ 2012-06-07 14:40       ` Joe Landman
  2012-06-08  1:23         ` Stan Hoeppner
  0 siblings, 1 reply; 38+ messages in thread
From: Joe Landman @ 2012-06-07 14:40 UTC (permalink / raw)
  To: stan; +Cc: Dan Williams, linux-raid

Not to interject too much here ...

On 06/07/2012 12:06 AM, Stan Hoeppner wrote:
> On 6/6/2012 11:09 AM, Dan Williams wrote:
>
>> Hardware RAID ultimately does the same shuffling. Outside of NVRAM, an
>> advantage it has is that parity data does not traverse the bus...
>
> Are you referring to the host data bus(es)?  I.e. HT/QPI and PCIe?
>
> With a 24 disk array, a full stripe write is only 1/12th parity data,
> less than 10%.  And the buses (point to point actually) of 24 drive
> caliber systems will usually start at one way B/W of 4GB/s for PCIe 2.0
> x8 and with one way B/W from the PCIe controller to the CPU starting at

PCIe gen 2 is ~500MB/s per lane in each direction, but there's like a 
14% protocol overhead, so your "sustained" streaming performance is more 
along the lines of 430 MB/s.  For a PCIe x8 gen 2 system, this nets you 
about 3.4GB/s in each direction.

> 10.4GB/s for AMD HT 3.0 systems.  PCIe x8 is plenty to handle a 24 drive
> md RAID 6, using 7.2K SATA drives anyway.

Each drive capable of streaming say 140 MB/s (modern drives).  24 x 140 
= 3.4 GB/s

This assumes streaming, no seeks that aren't part of streaming.

This said, this is *not* a design pattern you'd want to follow for a 
number of reasons.

But for seek heavy designs, you aren't going to hit anything close to 
140 MB/s.  We've just done a brief study for a customer on what they 
should expect to see (by measuring it and reporting on the measurement). 
  Assume close to an order of magnitude off for seekier loads.

Also, please note that iozone, dd, bonnie++, ... aren't great load 
generators, especially if things are in cache.  You tend to measure the 
upper layers of the file system stack, and not the actual full stack 
performance.  fio does a better job if you set the right options.  This 
said, almost all of these codes suffer from measuring at the front
end of the stack; if you want to know what the disks are really doing,
you have to start poking your head into the kernel proc/sys spaces.
What's interesting is that of the tools mentioned, only fio appears to
eventually converge its reporting to what the backend hardware does.
The front end measurements seem to do a pretty bad job of deciding when
an IO begins and when it is complete.  It could be an fsync or similar
problem (discussed in the past), but it's very annoying.  End users look
at bonnie++ and other results and don't understand why their use case is
so badly different in performance.
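
As an illustration only, the kind of fio job meant here might look like
the following (the parameters are assumptions, it needs libaio support,
and it writes to the raw device, so it destroys any data on it):

  fio --name=stripe-write --filename=/dev/md0 --rw=write \
      --bs=1408k --direct=1 --ioengine=libaio --iodepth=32 \
      --runtime=60 --time_based --group_reporting

with 1408k being one full stripe (22 data disks x 64 KiB chunk), so the
writes do not degenerate into read-modify-write.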


> What is a bigger issue, and may actually be what you were referring to,
> is read-modify-write B/W, which will incur a full stripe read and write.
>   For RMW heavy workloads, this is significant.  HBA RAID does have a big
> advantage here, compared to one's md array possessing the aggregate
> performance to saturate the PCIe bus.

The big issues for most HBAs are the available bandwidth to the disks, 
the quality/implementation of the controllers/drivers, etc.  Hanging 24 
drives off a single controller is a low cost design, not a high 
performance design.  You will get contention (especially with expander
chips).  You will get sub-optimal performance.

Checksumming speed on the CPU will not be the bottleneck in most of 
these cases.  Controller/driver performance and contention will be.

Back to your regularly scheduled thread ...

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Software RAID checksum performance on 24 disks not even close to kernel reported
  2012-06-07 14:40       ` Joe Landman
@ 2012-06-08  1:23         ` Stan Hoeppner
  0 siblings, 0 replies; 38+ messages in thread
From: Stan Hoeppner @ 2012-06-08  1:23 UTC (permalink / raw)
  To: Joe Landman; +Cc: Dan Williams, linux-raid

On 6/7/2012 9:40 AM, Joe Landman wrote:
> Not to interject too much here ...
> 
> On 06/07/2012 12:06 AM, Stan Hoeppner wrote:
>> On 6/6/2012 11:09 AM, Dan Williams wrote:
>>
>>> Hardware RAID ultimately does the same shuffling. Outside of NVRAM, an
>>> advantage it has is that parity data does not traverse the bus...
>>
>> Are you referring to the host data bus(es)?  I.e. HT/QPI and PCIe?
>>
>> With a 24 disk array, a full stripe write is only 1/12th parity data,
>> less than 10%.  And the buses (point to point actually) of 24 drive
>> caliber systems will usually start at one way B/W of 4GB/s for PCIe 2.0
>> x8 and with one way B/W from the PCIe controller to the CPU starting at
> 
> PCIe gen 2 is ~500MB/s per lane in each direction, but there's like a
> 14% protocol overhead, so your "sustained" streaming performance is more
> along the lines of 430 MB/s.  For a PCIe x8 gen 2 system, this nets you
> about 3.4GB/s in each direction.

You're quite right Joe.  I was intentionally stating raw B/W numbers
simply for easier comparison, same with my HT numbers below.

>> 10.4GB/s for AMD HT 3.0 systems.  PCIe x8 is plenty to handle a 24 drive
>> md RAID 6, using 7.2K SATA drives anyway.
> 
> Each drive capable of streaming say 140 MB/s (modern drives).  24 x 140
> = 3.4 GB/s

I was being conservative and assuming 100MB/s per drive, as streaming
workloads over stripes don't seem to always generate typical single
streaming behavior at the individual drive level.

> This assumes streaming, no seeks that aren't part of streaming.
> 
> This said, this is *not* a design pattern you'd want to follow for a
> number of reasons.
> 
> But for seek heavy designs, you aren't going to hit anything close to
> 140 MB/s.  We've just done a brief study for a customer on what they
> should expect to see (by measuring it and reporting on the measurement).
>  Assume close to an order of magnitude off for seekier loads.

Yep.  Which is why I always recommend the fastest spindles one can
afford if they have a random IOPS workload, or many parallel streaming
workloads, or a mix of these.  Both hammer the actuators, and even more so
using XFS w/Inode64 on a striped array.

And I'd never recommend a 23/24 drive RAID6 (or RAID5).  I was simply
commenting based on the OP's preferred setup.  I did recommend multiple
RAID5s as a better solution to the 23 drive RAID6 and the OP did not
respond to those suggestions.  Seems he's set on a 23 drive RAID6 no
matter what.

> Also, please note that iozone, dd, bonnie++, ... aren't great load
> generators, especially if things are in cache.  You tend to measure the
> upper layers of the file system stack, and not the actual full stack
> performance.  

I've never quoted numbers from any of these benchmarks.  I don't use
them.  I did comment on someone else's apparent misuse of dd.

> fio does a better job if you set the right options.  This
> said, almost all of these codes suffer from measuring at the front
> end of the stack; if you want to know what the disks are really doing,
> you have to start poking your head into the kernel proc/sys spaces.
> What's interesting is that of the tools mentioned, only fio appears to
> eventually converge its reporting to what the backend hardware does. The
> front end measurements seem to do a pretty bad job of deciding when an
> IO begins and when it is complete.  It could be an fsync or similar
> problem (discussed in the past), but it's very annoying.  End users look at
> bonnie++ and other results and don't understand why their use case is so
> badly different in performance.

When I do my own benchmarking it's at the application level.  I let
others benchmark however they wish.  It's difficult and too time
consuming to convince some users that their fav benchy has no relevance
to their target workload.  That takes time and patience, and often
political skills, which I don't tend to possess.  On occasion I will try
to steer people clear of design choices that should be seen as obviously
bad but are not yet seen that way.

>> What is a bigger issue, and may actually be what you were referring to,
>> is read-modify-write B/W, which will incur a full stripe read and write.
>>   For RMW heavy workloads, this is significant.  HBA RAID does have a big
>> advantage here, compared to one's md array possessing the aggregate
>> performance to saturate the PCIe bus.
> 
> The big issues for most HBAs are the available bandwidth to the disks,
> the quality/implementation of the controllers/drivers, etc.  Hanging 24
> drives off a single controller is a low cost design, not a high
> performance design.  You will get contention (especially with expander
> chips).  You will get sub-optimal performance.

In general I'd agree.  But this depends heavily on the HBA, its ASIC,
its QC, and the same for any expanders in question.  The LSI 2x36 6Gb/s
SAS expander ASIC doesn't seem to slow things down any.  The Marvell SAS
expanders, and Marvell and Silicon Image SATA PMPs are another story.

Regarding HBAs, there are a few LSI boards that when used with LSI
expanders can easily handle 24 drive md arrays.

> Checksumming speed on the CPU will not be the bottleneck in most of
> these cases.  Controller/driver performance and contention will be.

Not threading?  Well, I guess if you have a cruddy HBA and/or driver you
won't get far enough along to hit the md raid threading limitation, so
this is a good point.

-- 
Stan

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread

Thread overview: 38+ messages
2012-06-04 23:14 Software RAID checksum performance on 24 disks not even close to kernel reported Ole Tange
2012-06-05  1:26 ` Joe Landman
2012-06-05  3:36 ` Igor M Podlesny
2012-06-05  7:47   ` Ole Tange
2012-06-05 11:25     ` Peter Grandi
2012-06-05 20:57       ` Ole Tange
2012-06-06 17:37         ` Peter Grandi
2012-06-05 14:15     ` Stan Hoeppner
2012-06-05 20:45       ` Ole Tange
2012-06-05  3:39 ` Igor M Podlesny
2012-06-05  7:47   ` Ole Tange
2012-06-05 11:29     ` Igor M Podlesny
2012-06-05 13:09       ` Peter Grandi
2012-06-05 21:17         ` Ole Tange
2012-06-06  1:38           ` Stan Hoeppner
2012-06-05 18:44       ` Ole Tange
2012-06-06  1:40         ` Brad Campbell
2012-06-06  3:48           ` Marcus Sorensen
2012-06-06 11:21             ` Ole Tange
2012-06-06 11:17           ` Ole Tange
2012-06-06 12:58             ` Brad Campbell
2012-06-06 14:11 ` Ole Tange
2012-06-06 16:05   ` Igor M Podlesny
2012-06-06 19:51     ` Ole Tange
2012-06-06 22:21       ` Igor M Podlesny
2012-06-06 22:53         ` Peter Grandi
2012-06-07  3:41           ` Igor M Podlesny
2012-06-07  4:59             ` Stan Hoeppner
2012-06-07  5:22               ` Igor M Podlesny
2012-06-07  9:03                 ` Stan Hoeppner
2012-06-07  9:22                   ` Igor M Podlesny
2012-06-06 16:09   ` Dan Williams
2012-06-06 19:19     ` Ole Tange
2012-06-06 19:24       ` Dan Williams
2012-06-06 19:26         ` Ole Tange
2012-06-07  4:06     ` Stan Hoeppner
2012-06-07 14:40       ` Joe Landman
2012-06-08  1:23         ` Stan Hoeppner
