* kernel checksumming performance vs actual raid device performance
@ 2016-07-12 21:09 Matt Garman
  2016-07-13  3:58 ` Brad Campbell
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Matt Garman @ 2016-07-12 21:09 UTC (permalink / raw)
  To: Mdadm

We have a system with a 24-disk raid6 array, using 2TB SSDs.  We use
this system in a workload that is 99.9% read-only (a few small
writes/day, versus countless reads).  This system is an NFS server for
about 50 compute nodes that continually read its data.

In a non-degraded state, the system works wonderfully: the md0_raid6
process uses less than 1% CPU, each drive is around 20% utilization
(via iostat), no swapping is taking place.  The outbound throughput
averages around 2.0 GB/sec, with 2.5 GB/sec peaks.

However, we had a disk fail, and the throughput dropped considerably,
with the md0_raid6 process pegged at 100% CPU.

I understand that data from the failed disk will need to be
reconstructed from parity, and this will cause the md0_raid6 process
to consume considerable CPU.

What I don't understand is how I can determine what kind of actual MD
device performance (throughput) I can expect in this state?

Dmesg seems to give some hints:

[    6.386820] xor: automatically using best checksumming function:
[    6.396690]    avx       : 24064.000 MB/sec
[    6.414706] raid6: sse2x1   gen()  7636 MB/s
[    6.431725] raid6: sse2x2   gen()  3656 MB/s
[    6.448742] raid6: sse2x4   gen()  3917 MB/s
[    6.465753] raid6: avx2x1   gen()  5425 MB/s
[    6.482766] raid6: avx2x2   gen()  7593 MB/s
[    6.499773] raid6: avx2x4   gen()  8648 MB/s
[    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
[    6.499774] raid6: using avx2x2 recovery algorithm

(CPU is: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.)

Perhaps naively, I would expect that second-to-last line:

[    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)

to indicate what kind of throughput I could expect in a degraded
state, but clearly that is not right---or I have something
misconfigured.

So in other words, what does that gen() 8648 MB/s metric mean in terms
of real-world throughput?  Is there a way I can "convert" that number
to expected throughput of a degraded array?


Thanks,
Matt


* Re: kernel checksumming performance vs actual raid device performance
  2016-07-12 21:09 kernel checksumming performance vs actual raid device performance Matt Garman
@ 2016-07-13  3:58 ` Brad Campbell
       [not found] ` <CAFx4rwQj3_JTNiS0zsQjp_sPXWkrp0ggjg_UiR7oJ8u0X9PQVA@mail.gmail.com>
  2016-08-24  1:02 ` Shaohua Li
  2 siblings, 0 replies; 26+ messages in thread
From: Brad Campbell @ 2016-07-13  3:58 UTC (permalink / raw)
  To: Matt Garman, Mdadm

On 13/07/16 05:09, Matt Garman wrote:
> We have a system with a 24-disk raid6 array, using 2TB SSDs.  We use
> this system in a workload that is 99.9% read-only (a few small
> writes/day, versus countless reads).  This system is an NFS server for
> about 50 compute nodes that continually read its data.


> Perhaps naively, I would expect that second-to-last line:
>
> [    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
>
> to indicate what kind of throughput I could expect in a degraded
> state, but clearly that is not right---or I have something
> misconfigured.
>
> So in other words, what does that gen() 8648 MB/s metric mean in terms
> of real-world throughput?  Is there a way I can "convert" that number
> to expected throughput of a degraded array?

I can't help you with the throughput calculation, but 24 disks would 
imply you are striped across 22 disks.

With a reconstruct read, you of course have to read an entire stripe 
from all present disks to reconstruct that sector. So I would expect
your machine's IO to go through the roof, in addition to whatever
calculations are required to actually do the reconstruction.

What is your chunk size?
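
(For reference, assuming the array is /dev/md0, the chunk size can be
checked with something like:

    mdadm --detail /dev/md0 | grep -i 'chunk size'

or read straight out of /proc/mdstat.)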



* Fwd: kernel checksumming performance vs actual raid device performance
       [not found] ` <CAFx4rwQj3_JTNiS0zsQjp_sPXWkrp0ggjg_UiR7oJ8u0X9PQVA@mail.gmail.com>
@ 2016-07-13 16:52   ` Doug Dumitru
  2016-08-16 19:44   ` Matt Garman
       [not found]   ` <CAJvUf-Dqesy2TJX7W-bPakzeDcOoNy0VoSWWM06rKMYMhyhY7g@mail.gmail.com>
  2 siblings, 0 replies; 26+ messages in thread
From: Doug Dumitru @ 2016-07-13 16:52 UTC (permalink / raw)
  To: linux-raid

---------- Forwarded message ----------
From: Doug Dumitru <doug@easyco.com>
Date: Tue, Jul 12, 2016 at 7:10 PM
Subject: Re: kernel checksumming performance vs actual raid device performance
To: Matt Garman <matthew.garman@gmail.com>


Mr. Garman,

If you only lose a single drive in a raid-6 array, then only XOR
parity needs to be re-computed.  The "first" parity drive in a RAID-6
array is actually a RAID-5 parity drive.  The CPU "parity calc"
overhead for re-computing a missing raid-5 drive is very cheap and
should run at > 5GB/sec.

The raid-6 "test" numbers are the performance of calculating the
raid-6 parity "syndrome".  The overhead of calculating a missing disk
with raid-6 is higher.

In terms of performance overhead, most people look at long linear
write performance.  In this case, raid-6 calc does matter especially
in that the raid "thread" is singular, so the calcs will saturate a
single thread.

I suspect you are seeing something other than the parity math.  I have
24 SSDs in an array here and will need to try this.

You might want to try running "perf" on your system while it is
degraded and see where the thread is churning.  I would love to see
those results.  I would not be surprised to see that the thread is
literally "spinning".  If so, then the 100% cpu is probably fixable,
but it won't actually help performance.

In term of single drive missing performance with short reads, you are
mostly at the mercy of short read IOPS.  If your array is reading 8K
blocks at 2GB/sec, this is at 250,000 IOPS and you kill off a drive,
it will jump to 500,000 IOPS.  Reading from the good drives remains as
single reads, but read from the missing drives require reads from all
of the others (with raid-5, all but one).  I am not sure how the
recovery thread issues these recovery reads.  Hopefully, it blasts them
at the array with abandon (ie, submit all 22 requests concurrently),
but the code might be less aggressive in deference to hard disks.
SSDs love deep queue depths.

Regardless, 500K IOPS as reads is not that easy.  A lot of disk HBAs
start to saturate around there.

A couple of "design" points I would consider, if this is a system that
you need to duplicate.

1)  Consider a single CPU socket solution, like an E5-1650 v3.
Multi-socket CPUs introduce NUMA and a whole slew of "interesting"
system contention issues.
2)  Use good HBAs that are directly connected to the disks.  I like LSI
3008 and the newer 16-port version, although you need to use only 12
ports with 6GBit SATA/SAS to keep from over-running the PCI-e slot
bandwidth.
3)  Do everything you can to hammer deep queue depths.
4)  Set up IRQ affinity so that the HBAs spread their IRQ requests across cores; a rough sketch follows below.
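
As a rough sketch of item 4 (the driver match and CPU spread are
assumptions, adjust for your HBAs; stop irqbalance first or it will
undo the pinning):

    systemctl stop irqbalance
    i=0
    for irq in $(awk -F: '/mpt3sas|megasas/ {gsub(/ /,"",$1); print $1}' /proc/interrupts); do
        echo $((i % $(nproc))) > /proc/irq/$irq/smp_affinity_list
        i=$((i + 1))
    done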


Doug Dumitru
WildFire Storage




On Tue, Jul 12, 2016 at 2:09 PM, Matt Garman <matthew.garman@gmail.com> wrote:
>
> We have a system with a 24-disk raid6 array, using 2TB SSDs.  We use
> this system in a workload that is 99.9% read-only (a few small
> writes/day, versus countless reads).  This system is an NFS server for
> about 50 compute nodes that continually read its data.
>
> In a non-degraded state, the system works wonderfully: the md0_raid6
> process uses less than 1% CPU, each drive is around 20% utilization
> (via iostat), no swapping is taking place.  The outbound throughput
> averages around 2.0 GB/sec, with 2.5 GB/sec peaks.
>
> However, we had a disk fail, and the throughput dropped considerably,
> with the md0_raid6 process pegged at 100% CPU.
>
> I understand that data from the failed disk will need to be
> reconstructed from parity, and this will cause the md0_raid6 process
> to consume considerable CPU.
>
> What I don't understand is how I can determine what kind of actual MD
> device performance (throughput) I can expect in this state?
>
> Dmesg seems to give some hints:
>
> [    6.386820] xor: automatically using best checksumming function:
> [    6.396690]    avx       : 24064.000 MB/sec
> [    6.414706] raid6: sse2x1   gen()  7636 MB/s
> [    6.431725] raid6: sse2x2   gen()  3656 MB/s
> [    6.448742] raid6: sse2x4   gen()  3917 MB/s
> [    6.465753] raid6: avx2x1   gen()  5425 MB/s
> [    6.482766] raid6: avx2x2   gen()  7593 MB/s
> [    6.499773] raid6: avx2x4   gen()  8648 MB/s
> [    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
> [    6.499774] raid6: using avx2x2 recovery algorithm
>
> (CPU is: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.)
>
> Perhaps naively, I would expect that second-to-last line:
>
> [    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
>
> to indicate what kind of throughput I could expect in a degraded
> state, but clearly that is not right---or I have something
> misconfigured.
>
> So in other words, what does that gen() 8648 MB/s metric mean in terms
> of real-world throughput?  Is there a way I can "convert" that number
> to expected throughput of a degraded array?
>
>
> Thanks,
> Matt
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html




-- 
Doug Dumitru
EasyCo LLC





* Re: kernel checksumming performance vs actual raid device performance
       [not found] ` <CAFx4rwQj3_JTNiS0zsQjp_sPXWkrp0ggjg_UiR7oJ8u0X9PQVA@mail.gmail.com>
  2016-07-13 16:52   ` Fwd: " Doug Dumitru
@ 2016-08-16 19:44   ` Matt Garman
  2016-08-16 22:51     ` Doug Dumitru
       [not found]     ` <CAFx4rwTawqrBOWVwtPnGhRRAM1XiGQkS-o3YykmD0AftR45YkA@mail.gmail.com>
       [not found]   ` <CAJvUf-Dqesy2TJX7W-bPakzeDcOoNy0VoSWWM06rKMYMhyhY7g@mail.gmail.com>
  2 siblings, 2 replies; 26+ messages in thread
From: Matt Garman @ 2016-08-16 19:44 UTC (permalink / raw)
  To: Doug Dumitru, Mdadm

Hi Doug & linux-raid list,

On Tue, Jul 12, 2016 at 9:10 PM, Doug Dumitru <doug@easyco.com> wrote:
> You might want to try running "perf" on your system while it is degraded and
> see where the thread is churning.  I would love to see those results.  I
> would not be surprised to see that the thread is literally "spinning".  If
> so, then the 100% cpu is probably fixable, but it won't actually help
> performance.

I sat on your email for a while, as the machine in question was (is)
production, and we don't have any useful downtime windows to
experiment.  But now we have a second, identical machine.  It
eventually needs to go into production as well, but for now we have
some time to test.

My understanding of "perf" is that it analyzes an individual process.
Would you be willing to elaborate on how I might use it while the
rebuild is taking place?

> In term of single drive missing performance with short reads, you are mostly
> at the mercy of short read IOPS.  If you array is reading 8K blocks at
> 2GB/sec, this is at 250,000 IOPS and you kill off a drive, it will jump to
> 500,000 IOPS.  Reading from the good drives remains as single reads, but
> read from the missing drives require reads from all of the others (with
> raid-5, all but one).  I am not sure how the recovery thread issues these
> recovery read.  Hopefully, it blasts them at the array with abandon (ie,
> submit all 22 requests concurrently), but the code might be less aggressive
> in deference to hard disks.  SSDs love deep queue depths.

I may be jumping ahead a little, but I wonder if there are tuning
parameters that make sense for an array such as this, given the
read-dominant (effectively WORM) workload?  In particular, things like
block-level read-ahead, IO scheduler, queue depth, etc.  I know the
standard answer for these is "test and see" but we don't have a second
100-machine compute farm to test with.  It's quite hard to simulate
such a workload.
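
For concreteness, the knobs I have in mind are roughly these (device
names and values are illustrative only, not settings we have validated):

    blockdev --setra 4096 /dev/md0                    # md-level read-ahead
    cat /sys/block/sdb/queue/scheduler                # per-member IO scheduler
    echo noop > /sys/block/sdb/queue/scheduler
    echo 256 > /sys/block/sdb/queue/nr_requests       # per-member queue depth
    echo 16384 > /sys/block/md0/md/stripe_cache_size  # raid5/6 stripe cache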

> 1)  Consider a single CPU socket solution, like an E6-1650 v3.  Multi-socked
> CPU introduce NUMA and a whole slew of "interesting" system contention
> issues.

I think that's a good idea, but I wanted to have two identical systems.

> 2)  Use good HBA that are direct connected to the disks.  I like LSI 3008
> and the newer 16-port version, although you need to use only 12 ports with
> 6GBit SATA/SAS to keep from over-running the PCI-e slot bandwidth.

We have three LSI MegaRAID SAS-3 3108 9361-8i controllers per system.
8 ports per card.  Drives are indeed direct-connected.  (Technically
there is a backplane, but it's not an expander, just a pass-through
backplane for neat cabling.)

> 3)  Do everything you can to hammer deep queue depths.

Can you elaborate on that?
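
My naive reading is that it means keeping a lot of IOs in flight per
drive.  For example, to see what a single member does at a deep queue
depth (assuming fio is installed and /dev/sdb is a member disk; the
--readonly flag keeps it from writing):

    fio --name=qd-test --filename=/dev/sdb --readonly --rw=randread \
        --bs=8k --direct=1 --ioengine=libaio --iodepth=64 \
        --runtime=30 --time_based --group_reporting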

> 4)  Setup IRQ affinity so that the HBAs spread their IRQ requests across
> cores.

We have spent a lot of time tuning the NIC IRQs, but have not yet
spent any time on the HBA IRQs.  Will do.

> You can probably mitigate the amount of degradation by lowering the rebuild
> speed, but this will make the rebuild take longer, so you are messed up
> either way.  If the server has "down time" at night, you might lower the
> rebuild to a really small value during the day, and up it at night.

I'll have to discuss with my colleagues, but we have the impression
that the max rebuild speed parameter is more of a hint than an actual
"hard" setting.  That is, we tried to do exactly what you suggest:
defer most rebuild work to after-hours when the load was lighter (and
no one would notice).  But we were unable to stop the rebuild from
basically completely crippling the NFS performance during the day.

"Messed up either way" is indeed the right conclusion here.  But I
think we have some bottleneck somewhere that is artificially hurting,
making things worse than they could/should be.

Thanks again for the thoughtful feedback!

-Matt


* Re: kernel checksumming performance vs actual raid device performance
  2016-08-16 19:44   ` Matt Garman
@ 2016-08-16 22:51     ` Doug Dumitru
  2016-08-17  0:27       ` Adam Goryachev
       [not found]     ` <CAFx4rwTawqrBOWVwtPnGhRRAM1XiGQkS-o3YykmD0AftR45YkA@mail.gmail.com>
  1 sibling, 1 reply; 26+ messages in thread
From: Doug Dumitru @ 2016-08-16 22:51 UTC (permalink / raw)
  To: Matt Garman; +Cc: Mdadm

Matt,

One last thing I would highly recommend is:

Secure erase the replacement disk before rebuilding onto it.

If the replacement disk is "pre conditioned" with random writes, even
if very slowly, this will lower the write performance of the disk
during the rebuild.
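
A minimal sketch for a SATA SSD (hypothetical device name /dev/sdy; the
drive must not be in the "security frozen" state, and obviously
triple-check the device name first):

    hdparm --user-master u --security-set-pass p /dev/sdy
    hdparm --user-master u --security-erase p /dev/sdy

A quick alternative on drives that support it is to simply discard the
whole device with "blkdiscard /dev/sdy", which TRIMs everything.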

On Tue, Aug 16, 2016 at 12:44 PM, Matt Garman <matthew.garman@gmail.com> wrote:
> Hi Doug & linux-raid list,
>
> On Tue, Jul 12, 2016 at 9:10 PM, Doug Dumitru <doug@easyco.com> wrote:
>> You might want to try running "perf" on your system while it is degraded and
>> see where the thread is churning.  I would love to see those results.  I
>> would not be surprised to see that the thread is literally "spinning".  If
>> so, then the 100% cpu is probably fixable, but it won't actually help
>> performance.
>
> I sat on your email for a while, as the machine in question was (is)
> production, and we don't have any useful downtime windows to
> experiment.  But now we have a second, identical machine.  It
> eventually needs to go into production as well, but for now we have
> some time to test.
>
> My understanding of "perf" is that it analyzes an individual process.
> Would you be willing to elaborate on how I might use it while the
> rebuild is taking place?
>
>> In term of single drive missing performance with short reads, you are mostly
>> at the mercy of short read IOPS.  If you array is reading 8K blocks at
>> 2GB/sec, this is at 250,000 IOPS and you kill off a drive, it will jump to
>> 500,000 IOPS.  Reading from the good drives remains as single reads, but
>> read from the missing drives require reads from all of the others (with
>> raid-5, all but one).  I am not sure how the recovery thread issues these
>> recovery read.  Hopefully, it blasts them at the array with abandon (ie,
>> submit all 22 requests concurrently), but the code might be less aggressive
>> in deference to hard disks.  SSDs love deep queue depths.
>
> I may be jumping ahead a little, but I wonder if there are tuning
> parameters that make sense for an array such as this, given the
> read-dominant (effectively WORM) workload?  In particular, things like
> block-level read-ahead, IO scheduler, queue depth, etc.  I know the
> standard answer for these is "test and see" but we don't have a second
> 100-machine compute farm to test with.  It's quite hard to simulate
> such a workload.
>
>> 1)  Consider a single CPU socket solution, like an E6-1650 v3.  Multi-socked
>> CPU introduce NUMA and a whole slew of "interesting" system contention
>> issues.
>
> I think that's a good idea, but I wanted to have two identical systems.
>
>> 2)  Use good HBA that are direct connected to the disks.  I like LSI 3008
>> and the newer 16-port version, although you need to use only 12 ports with
>> 6GBit SATA/SAS to keep from over-running the PCI-e slot bandwidth.
>
> We have three LSI MegaRAID SAS-3 3108 9361-8i controllers per system.
> 8 ports per card.  Drives are indeed direct-connected.  (Technically
> there is a backplane, but it's not an expander, just a pass-through
> backplane for neat cabling.)
>
>> 3)  Do everything you can to hammer deep queue depths.
>
> Can you elaborate on that?
>
>> 4)  Setup IRQ affinity so that the HBAs spread their IRQ requests across
>> cores.
>
> We have spent a lot of time tuning the NIC IRQs, but have not yet
> spent any time on the HBA IRQs.  Will do.
>
>> You can probably mitigate the amount of degradation by lowering the rebuild
>> speed, but this will make the rebuild take longer, so you are messed up
>> either way.  If the server has "down time" at night, you might lower the
>> rebuild to a really small value during the day, and up it at night.
>
> I'll have to discuss with my colleagues, but we have the impression
> that the max rebuild speed parameter is more of a hint than an actual
> "hard" setting.  That is, we tried to do exactly what you suggest:
> defer most rebuild work to after-hours when the load was lighter (and
> no one would notice).  But we were unable to stop the rebuild from
> basically completely crippling the NFS performance during the day.
>
> "Messed up either way" is indeed the right conclusion here.  But I
> think we have some bottleneck somewhere that is artificially hurting,
> making things worse than they could/should be.
>
> Thanks again for the thoughtful feedback!
>
> -Matt



-- 
Doug Dumitru
EasyCo LLC


* Re: kernel checksumming performance vs actual raid device performance
  2016-08-16 22:51     ` Doug Dumitru
@ 2016-08-17  0:27       ` Adam Goryachev
  0 siblings, 0 replies; 26+ messages in thread
From: Adam Goryachev @ 2016-08-17  0:27 UTC (permalink / raw)
  To: doug, Matt Garman; +Cc: Mdadm

On 17/08/16 08:51, Doug Dumitru wrote:
> Matt,
>
> One last thing I would highly recommend is:
>
> Secure erase the replacement disk before rebuilding onto it.
>
> If the replacement disk is "pre conditioned" with random writes, even
> if very slowly, this will lower the write performance of the disk
> during the rebuild.
>
> On Tue, Aug 16, 2016 at 12:44 PM, Matt Garman <matthew.garman@gmail.com> wrote:
>> Hi Doug & linux-raid list,
>>
>> On Tue, Jul 12, 2016 at 9:10 PM, Doug Dumitru <doug@easyco.com> wrote:
>>
>>> You can probably mitigate the amount of degradation by lowering the rebuild
>>> speed, but this will make the rebuild take longer, so you are messed up
>>> either way.  If the server has "down time" at night, you might lower the
>>> rebuild to a really small value during the day, and up it at night.
>> I'll have to discuss with my colleagues, but we have the impression
>> that the max rebuild speed parameter is more of a hint than an actual
>> "hard" setting.  That is, we tried to do exactly what you suggest:
>> defer most rebuild work to after-hours when the load was lighter (and
>> no one would notice).  But we were unable to stop the rebuild from
>> basically completely crippling the NFS performance during the day.
>>
>> "Messed up either way" is indeed the right conclusion here.  But I
>> think we have some bottleneck somewhere that is artificially hurting,
>> making things worse than they could/should be.
>>
>> Thanks again for the thoughtful feedback!
>>
>> -Matt
Sorry, probably messed up the quoting/attribution, but I don't think 
that is too important here.
You should find that the max value is in fact an upper bound, so the 
re-sync will *try* to limit the speed to this value, and the minimum is 
a lower bound. However, if you set the minimum too high, then the system 
(as a whole) may not be able to achieve that, and so the resync speed 
might be lower.

I don't think I've seen a case where the resync speed is much higher 
than the max value. Of course, even a small resync speed could have a 
big impact on performance (due to extra seeks on the disks moving to the 
resync area away from the active read/write workload area)...

I think there is also an option to completely stop the resync from 
progressing (changes it to "pending" status), but maybe someone else 
on-list can comment about that. You might be able to totally stop the 
resync during the day, and then set an outage period to stop your 
workload and allow the system to run the resync at maximum speed (just 
set the max value to a really large number).
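
If it helps, the knobs I am thinking of are roughly these (md0 as an
example; the "frozen" trick is from memory, so please verify it before
relying on it):

    # global limits, in KB/s
    echo 1000   > /proc/sys/dev/raid/speed_limit_min
    echo 200000 > /proc/sys/dev/raid/speed_limit_max
    # per-array override of the max
    echo 200000 > /sys/block/md0/md/sync_speed_max
    # park the resync entirely, then let it continue later
    echo frozen > /sys/block/md0/md/sync_action
    echo idle   > /sys/block/md0/md/sync_action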

Sorry, but I'm not an mdadm expert, just sharing my experiences with 
dealing with similar issues.

Regards,
Adam

-- 
Adam Goryachev Website Managers www.websitemanagers.com.au


* Re: kernel checksumming performance vs actual raid device performance
       [not found]     ` <CAFx4rwTawqrBOWVwtPnGhRRAM1XiGQkS-o3YykmD0AftR45YkA@mail.gmail.com>
@ 2016-08-23 14:34       ` Matt Garman
  2016-08-23 15:02         ` Chris Murphy
  0 siblings, 1 reply; 26+ messages in thread
From: Matt Garman @ 2016-08-23 14:34 UTC (permalink / raw)
  To: Doug Dumitru; +Cc: Mdadm

On Tue, Aug 16, 2016 at 5:43 PM, Doug Dumitru <doug@easyco.com> wrote:
> One last thing I would highly recommend is:
>
> Secure erase the replacement disk before rebuilding onto it.
>
> If the replacement disk is "pre conditioned" with random writes, even if
> very slowly, this will lower the write performance of the disk during the
> rebuild.

Does that also apply to brand-new disks from the manufacturer?

I.e., should we just always do a secure erase, or sometimes depending
on how the drive was sourced?


* Re: kernel checksumming performance vs actual raid device performance
       [not found]     ` <CAFx4rwSQQuqeCFm+60+Gm75D49tg+mVjU=BnQSZThdE7E6KqPQ@mail.gmail.com>
@ 2016-08-23 14:54       ` Matt Garman
  2016-08-23 18:00         ` Doug Ledford
  0 siblings, 1 reply; 26+ messages in thread
From: Matt Garman @ 2016-08-23 14:54 UTC (permalink / raw)
  To: Doug Dumitru, Mdadm

On Tue, Aug 16, 2016 at 11:36 AM, Doug Dumitru <doug@easyco.com> wrote:
> The RAID rebuild for a single bad drive "should" be an XOR and should run at
> 200,000 kb/sec (the default speed_limit_max).  I might be wrong on this and
> this might still need a full RAID-6 syndrome compute, but I dont think so.
>
> The rebuild might not hit 200MB/sec if the drive you replaced is
> "conditioned".  Be sure to secure erase any non-new drive before you replace
> it.
>
> Your read IOPS will compete with now busy drives which may increase the IO
> latency a lot, and slow you down a lot.
>
> One out of 22 read OPS will be to the bad drive, so this will now take 22
> reads to re-construct the IO.  The reconstruction is XOR, so pretty cheap
> from a CPU point of view.  Regardless, your IOPS total will double.
>
> You can probably mitigate the amount of degradation by lowering the rebuild
> speed, but this will make the rebuild take longer, so you are messed up
> either way.  If the server has "down time" at night, you might lower the
> rebuild to a really small value during the day, and up it at night.

OK, right now I'm looking purely at performance in a degraded state,
no rebuild taking place.

We have designed a simple read load test to simulate the actual
production workload.  (It's not perfect of course, but a reasonable
approximation.  I can share with the list if there's interest.)  But
basically it just runs multiple threads of reading random files
continuously.

When the array is in a pristine state, we can achieve read throughput
of 8000 MB/sec (at the array level, per iostat with 5 second samples).

Now I failed a single drive.  Running the same test, read performance
drops all the way down to 200 MB/sec.

I understand that IOPS should double, which to me says we should
expect a roughly 50% read performance drop (napkin math).  But this is
a drop of over 95%.

Again, this is with no rebuild taking place...

Thoughts?

Thanks again,
Matt


* Re: kernel checksumming performance vs actual raid device performance
  2016-08-23 14:34       ` Matt Garman
@ 2016-08-23 15:02         ` Chris Murphy
  0 siblings, 0 replies; 26+ messages in thread
From: Chris Murphy @ 2016-08-23 15:02 UTC (permalink / raw)
  To: Mdadm

On Tue, Aug 23, 2016 at 8:34 AM, Matt Garman <matthew.garman@gmail.com> wrote:
> On Tue, Aug 16, 2016 at 5:43 PM, Doug Dumitru <doug@easyco.com> wrote:
>> One last thing I would highly recommend is:
>>
>> Secure erase the replacement disk before rebuilding onto it.
>>
>> If the replacement disk is "pre conditioned" with random writes, even if
>> very slowly, this will lower the write performance of the disk during the
>> rebuild.
>
> Does that also apply to brand-new disks from the manufacturer?
>
> I.e., should we just always do a secure erase, or sometimes depending
> on how the drive was sourced?

The main issue is to get rid of previous filesystem signatures so that
no stale file systems linger. If the drive is or was ever partitioned,
those signatures could be anywhere on the drive, so an ATA secure erase
is a way to clobber all of them. An alternative is fully encrypting the
drive: if you merely change the encryption key, the cipher text on the
drive no longer decrypts to anything meaningful, so again everything
that was on the drive is effectively obliterated, and it's much faster.
If security isn't a big concern, you can automate opening the LUKS
device with a keyfile to avoid manually typing in passphrases. The
other nice thing about it is that if you ever have to return such a
drive under warranty, or otherwise decommission it, you don't have to
worry about the drive contents.
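
A rough sketch of that setup (device name, key path and mapper name are
made up for illustration):

    mkdir -p /etc/keys && chmod 700 /etc/keys
    dd if=/dev/urandom of=/etc/keys/sdy.key bs=64 count=1
    cryptsetup luksFormat /dev/sdy /etc/keys/sdy.key
    cryptsetup open --key-file /etc/keys/sdy.key /dev/sdy crypt-sdy
    mdadm /dev/md0 --add /dev/mapper/crypt-sdy
    # later, to obliterate the contents, just destroy the keyslots:
    cryptsetup luksErase /dev/sdy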

-- 
Chris Murphy


* Re: kernel checksumming performance vs actual raid device performance
  2016-08-23 14:54       ` Matt Garman
@ 2016-08-23 18:00         ` Doug Ledford
  2016-08-23 18:27           ` Doug Dumitru
  0 siblings, 1 reply; 26+ messages in thread
From: Doug Ledford @ 2016-08-23 18:00 UTC (permalink / raw)
  To: Matt Garman, Doug Dumitru, Mdadm



On 8/23/2016 10:54 AM, Matt Garman wrote:
> On Tue, Aug 16, 2016 at 11:36 AM, Doug Dumitru <doug@easyco.com> wrote:
>> The RAID rebuild for a single bad drive "should" be an XOR and should run at
>> 200,000 kb/sec (the default speed_limit_max).  I might be wrong on this and
>> this might still need a full RAID-6 syndrome compute, but I dont think so.
>>
>> The rebuild might not hit 200MB/sec if the drive you replaced is
>> "conditioned".  Be sure to secure erase any non-new drive before you replace
>> it.
>>
>> Your read IOPS will compete with now busy drives which may increase the IO
>> latency a lot, and slow you down a lot.
>>
>> One out of 22 read OPS will be to the bad drive, so this will now take 22
>> reads to re-construct the IO.  The reconstruction is XOR, so pretty cheap
>> from a CPU point of view.  Regardless, your IOPS total will double.
>>
>> You can probably mitigate the amount of degradation by lowering the rebuild
>> speed, but this will make the rebuild take longer, so you are messed up
>> either way.  If the server has "down time" at night, you might lower the
>> rebuild to a really small value during the day, and up it at night.
> 
> OK, right now I'm looking purely at performance in a degraded state,
> no rebuild taking place.
> 
> We have designed a simple read load test to simulate the actual
> production workload.  (It's not perfect of course, but a reasonable
> approximation.  I can share with the list if there's interest.)  But
> basically it just runs multiple threads of reading random files
> continuously.
> 
> When the array is in a pristine state, we can achieve read throughput
> of 8000 MB/sec (at the array level, per iostat with 5 second samples).
> 
> Now I failed a single drive.  Running the same test, read performance
> drops all the way down to 200 MB/sec.
> 
> I understand that IOPS should double, which to me says we should
> expect a roughly 50% read performance drop (napkin math).  But this is
> a drop of over 95%.
> 
> Again, this is with no rebuild taking place...
> 
> Thoughts?

This depends a lot on how you structured your raid array.  I didn't see
your earlier emails, so I'm inferring from the "one out of 22 reads will
be to the bad drive" that you have a 24 disk raid6 array?  If so, then
that's 22 data disks and 2 parity disks per stripe.  I'm gonna use that
as the basis for my next statement even if it's slightly wrong.

Doug was right in that you will have to read 21 data disks and 1 parity
disk to reconstruct reads from the missing block of any given stripe.
And while he is also correct that this doubles IO ops needed to get your
read data, it doesn't address the XOR load to get your data.  With 21
data disks and 1 parity disk, and say a 64k chunk size, you have to XOR
22 64k data blocks for 1 result.  If you are getting 200MB/s, you are
actually achieving more like 390MB/s of data read, with 190MB/s of it
being direct reads, and then you are using XOR on 200MB/s in order to
generate the other 10MB/s of results.

The question of why that performance is so bad is probably (and I say
probably because without actually testing it this is just some hand-wavy
explanation based upon what I've tested and found in the past, but may
not be true today) due to a couple factors:

1) 200MB/s of XOR is not insignificant.  Due to our single thread XOR
routines, you can actually keep a CPU pretty busy with this.  Also, even
though the XOR routines try to time their assembly 'just so' so that
they can use the cache avoiding instructions, this fails more often than
not so you end up blowing CPU caches while doing this work, which of
course affects the overall system.  Possible fixes for this might include:
	a) Multi-threaded XOR becoming the default (last I knew it wasn't,
correct me if I'm wrong)
	b) Improved XOR routines that deal with cache more intelligently
	c) Creating a consolidated page cache/stripe cache (if we can read more
of the blocks needed to get our data from cache instead of disk it helps
reduce that IO ops issue)
	d) Rearchitecting your arrays into raid50 instead of big raid6 array

2) Even though we theoretically doubled IO ops, we haven't addressed
whether or not that doubling is done efficiently.  Testing would be
warranted here to make sure that our reads for reconstruction aren't
negatively impacting overall disk IO op capability.  We might be doing
something that we can fix, such as interfering with merges or with
ordering or with latency sensitive commands.  A person would need to do
some deep inspection of how commands are being created and sent to each
device in order to see if we are keeping them busy or our own latencies
at the kernel level are leaving the disks idle and killing our overall
throughput (or conversely has the random head seeks just gone so
radically through the roof that the problem here really is the time it
takes the heads to travel everywhere we are sending them).


-- 
Doug Ledford <dledford@redhat.com>
    GPG Key ID: 0E572FDD




* Re: kernel checksumming performance vs actual raid device performance
  2016-08-23 18:00         ` Doug Ledford
@ 2016-08-23 18:27           ` Doug Dumitru
  2016-08-23 19:10             ` Doug Ledford
  2016-08-23 19:26             ` Matt Garman
  0 siblings, 2 replies; 26+ messages in thread
From: Doug Dumitru @ 2016-08-23 18:27 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Matt Garman, Mdadm

Mr. Ledford,

I think your explanation of RAID "dirty" read performance is a bit off.

If you have 64KB chunks, this describes the layout.  I don't think
this also requires 64K reads.  I know that this is true with RAID-5,
and I am pretty sure it applies to raid-6 as well.  So if you do 4K
reads, you should see 4K reads to all the member drives.

You can verify this pretty easily with iostat.
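
For example, something like

    iostat -xm 2 /dev/md0 /dev/sd[b-y]

while the test runs, and compare r/s and the average request size
(avgrq-sz) on the member drives against /dev/md0.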

Mr. Garman,

Your results are a lot worse than expected.  I always assume that a
raid "dirty" read will try to hit the disk hard.  This implies issuing
the 22 read requests in parallel.  This is how "SSD" folks think.  It
is possible that this code is old enough to be in an HDD "mindset" and
that the requests are issued sequentially.  If so, then this is
something to "fix" in the raid code (I use the term fix here loosely
as this is not really a bug).

Can you run an iostat during your degraded test, and also a top run
over 20+ seconds with kernel threads showing up.  Even better would be
a perf capture, but you might not have all the tools installed.  You
can always try:

perf record -a sleep 20

then

perf report

should show you the top functions globally over the 20 second sample.
If you don't have perf loaded, you might (or might not) be able to
load it from the distro.
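
For the iostat/top side, capturing a minute or so while the degraded
read test is running would be plenty, e.g. (file names are arbitrary):

    iostat -x 5 12 > iostat-degraded.txt &
    top -b -H -d 5 -n 12 > top-degraded.txt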

Doug


On Tue, Aug 23, 2016 at 11:00 AM, Doug Ledford <dledford@redhat.com> wrote:
> On 8/23/2016 10:54 AM, Matt Garman wrote:
>> On Tue, Aug 16, 2016 at 11:36 AM, Doug Dumitru <doug@easyco.com> wrote:
>>> The RAID rebuild for a single bad drive "should" be an XOR and should run at
>>> 200,000 kb/sec (the default speed_limit_max).  I might be wrong on this and
>>> this might still need a full RAID-6 syndrome compute, but I dont think so.
>>>
>>> The rebuild might not hit 200MB/sec if the drive you replaced is
>>> "conditioned".  Be sure to secure erase any non-new drive before you replace
>>> it.
>>>
>>> Your read IOPS will compete with now busy drives which may increase the IO
>>> latency a lot, and slow you down a lot.
>>>
>>> One out of 22 read OPS will be to the bad drive, so this will now take 22
>>> reads to re-construct the IO.  The reconstruction is XOR, so pretty cheap
>>> from a CPU point of view.  Regardless, your IOPS total will double.
>>>
>>> You can probably mitigate the amount of degradation by lowering the rebuild
>>> speed, but this will make the rebuild take longer, so you are messed up
>>> either way.  If the server has "down time" at night, you might lower the
>>> rebuild to a really small value during the day, and up it at night.
>>
>> OK, right now I'm looking purely at performance in a degraded state,
>> no rebuild taking place.
>>
>> We have designed a simple read load test to simulate the actual
>> production workload.  (It's not perfect of course, but a reasonable
>> approximation.  I can share with the list if there's interest.)  But
>> basically it just runs multiple threads of reading random files
>> continuously.
>>
>> When the array is in a pristine state, we can achieve read throughput
>> of 8000 MB/sec (at the array level, per iostat with 5 second samples).
>>
>> Now I failed a single drive.  Running the same test, read performance
>> drops all the way down to 200 MB/sec.
>>
>> I understand that IOPS should double, which to me says we should
>> expect a roughly 50% read performance drop (napkin math).  But this is
>> a drop of over 95%.
>>
>> Again, this is with no rebuild taking place...
>>
>> Thoughts?
>
> This depends a lot on how you structured your raid array.  I didn't see
> your earlier emails, so I'm inferring from the "one out of 22 reads will
> be to the bad drive" that you have a 24 disk raid6 array?  If so, then
> that's 22 data disks and 2 parity disks per stripe.  I'm gonna use that
> as the basis for my next statement even if it's slightly wrong.
>
> Doug was right in that you will have to read 21 data disks and 1 parity
> disk to reconstruct reads from the missing block of any given stripe.
> And while he is also correct that this doubles IO ops needed to get your
> read data, it doesn't address the XOR load to get your data.  With 21
> data disks and 1 parity disk, and say a 64k chunk size, you have to XOR
> 22 64k data blocks for 1 result.  If you are getting 200MB/s, you are
> actually achieving more like 390MB/s of data read, with 190MB/s of it
> being direct reads, and then you are using XOR on 200MB/s in order to
> generate the other 10MB/s of results.
>
> The question of why that performance is so bad is probably (and I say
> probably because without actually testing it this is just some hand-wavy
> explanation based upon what I've tested and found in the past, but may
> not be true today) due to a couple factors:
>
> 1) 200MB/s of XOR is not insignificant.  Due to our single thread XOR
> routines, you can actually keep a CPU pretty busy with this.  Also, even
> though the XOR routines try to time their assembly 'just so' so that
> they can use the cache avoiding instructions, this fails more often than
> not so you end up blowing CPU caches while doing this work, which of
> course effects the overall system.  Possible fixes for this might include:
>         a) Multi-threaded XOR becoming the default (last I knew it wasn't,
> correct me if I'm wrong)
>         b) Improved XOR routines that deal with cache more intelligently
>         c) Creating a consolidated page cache/stripe cache (if we can read more
> of the blocks needed to get our data from cache instead of disk it helps
> reduce that IO ops issue)
>         d) Rearchitecting your arrays into raid50 instead of big raid6 array
>
> 2) Even though we theoretically doubled IO ops, we haven't addressed
> whether or not that doubling is done efficiently.  Testing would be
> warranted here to make sure that our reads for reconstruction aren't
> negatively impacting overall disk IO op capability.  We might be doing
> something that we can fix, such as interfering with merges or with
> ordering or with latency sensitive commands.  A person would need to do
> some deep inspection of how commands are being created and sent to each
> device in order to see if we are keeping them busy or our own latencies
> at the kernel level are leaving the disks idle and killing our overall
> throughput (or conversely has the random head seeks just gone so
> radically through the roof that the problem here really is the time it
> takes the heads to travel everywhere we are sending them).
>
>
> --
> Doug Ledford <dledford@redhat.com>
>     GPG Key ID: 0E572FDD
>



-- 
Doug Dumitru
EasyCo LLC


* Re: kernel checksumming performance vs actual raid device performance
  2016-08-23 18:27           ` Doug Dumitru
@ 2016-08-23 19:10             ` Doug Ledford
  2016-08-23 19:19               ` Doug Dumitru
  2016-08-23 19:26             ` Matt Garman
  1 sibling, 1 reply; 26+ messages in thread
From: Doug Ledford @ 2016-08-23 19:10 UTC (permalink / raw)
  To: doug; +Cc: Matt Garman, Mdadm



On 8/23/2016 2:27 PM, Doug Dumitru wrote:
> Mr. Ledford,
> 
> I think your explanation of RAID "dirty" read performance is a bit off.
> 
> If you have 64KB chunks, this describes the layout.  I don't think
> this also requires 64K reads.  I know that this is true with RAID-5,
> and I am pretty sure it applies to raid-6 as well.  So if you do 4K
> reads, you should see 4K reads to all the member drives.

Of course.  I didn't mean to imply otherwise.  The read size is the read
size.  But, since the OPs test case was to "read random files" and not
"read random blocks of random files" I took it to mean it would be
sequential IO across a multitude of random files.  That assumption might
have been wrong, but I wrote my explanation with that in mind.

> You can verify this pretty easily with iostat.
> 
> Mr. Garman,
> 
> Your results are a lot worse than expected.  I always assume that a
> raid "dirty" read will try to hit the disk hard.  This implies issuing
> the 22 reads requests in parallel.  This is how "SSD" folks think.  It
> is possible that this code is old enough to be in an HDD "mindset" and
> that the requests are issued sequentially.  If so, then this is
> something to "fix" in the raid code (I use the term fix here loosely
> as this is not really a bug).
> 
> Can you run an iostat during your degraded test, and also a top run
> over 20+ seconds with kernel threads showing up.  Even better would be
> a perf capture, but you might not have all the tools installed.  You
> can always try:
> 
> perf record -a sleep 20
> 
> then
> 
> perf report
> 
> should show you the top functions globally over the 20 second sample.
> If you don't have perf loaded, you might (or might not) be able to
> load it from the distro.
> 
> Doug
> 
> 
> On Tue, Aug 23, 2016 at 11:00 AM, Doug Ledford <dledford@redhat.com> wrote:
>> On 8/23/2016 10:54 AM, Matt Garman wrote:
>>> On Tue, Aug 16, 2016 at 11:36 AM, Doug Dumitru <doug@easyco.com> wrote:
>>>> The RAID rebuild for a single bad drive "should" be an XOR and should run at
>>>> 200,000 kb/sec (the default speed_limit_max).  I might be wrong on this and
>>>> this might still need a full RAID-6 syndrome compute, but I dont think so.
>>>>
>>>> The rebuild might not hit 200MB/sec if the drive you replaced is
>>>> "conditioned".  Be sure to secure erase any non-new drive before you replace
>>>> it.
>>>>
>>>> Your read IOPS will compete with now busy drives which may increase the IO
>>>> latency a lot, and slow you down a lot.
>>>>
>>>> One out of 22 read OPS will be to the bad drive, so this will now take 22
>>>> reads to re-construct the IO.  The reconstruction is XOR, so pretty cheap
>>>> from a CPU point of view.  Regardless, your IOPS total will double.
>>>>
>>>> You can probably mitigate the amount of degradation by lowering the rebuild
>>>> speed, but this will make the rebuild take longer, so you are messed up
>>>> either way.  If the server has "down time" at night, you might lower the
>>>> rebuild to a really small value during the day, and up it at night.
>>>
>>> OK, right now I'm looking purely at performance in a degraded state,
>>> no rebuild taking place.
>>>
>>> We have designed a simple read load test to simulate the actual
>>> production workload.  (It's not perfect of course, but a reasonable
>>> approximation.  I can share with the list if there's interest.)  But
>>> basically it just runs multiple threads of reading random files
>>> continuously.
>>>
>>> When the array is in a pristine state, we can achieve read throughput
>>> of 8000 MB/sec (at the array level, per iostat with 5 second samples).
>>>
>>> Now I failed a single drive.  Running the same test, read performance
>>> drops all the way down to 200 MB/sec.
>>>
>>> I understand that IOPS should double, which to me says we should
>>> expect a roughly 50% read performance drop (napkin math).  But this is
>>> a drop of over 95%.
>>>
>>> Again, this is with no rebuild taking place...
>>>
>>> Thoughts?
>>
>> This depends a lot on how you structured your raid array.  I didn't see
>> your earlier emails, so I'm inferring from the "one out of 22 reads will
>> be to the bad drive" that you have a 24 disk raid6 array?  If so, then
>> that's 22 data disks and 2 parity disks per stripe.  I'm gonna use that
>> as the basis for my next statement even if it's slightly wrong.
>>
>> Doug was right in that you will have to read 21 data disks and 1 parity
>> disk to reconstruct reads from the missing block of any given stripe.
>> And while he is also correct that this doubles IO ops needed to get your
>> read data, it doesn't address the XOR load to get your data.  With 21
>> data disks and 1 parity disk, and say a 64k chunk size, you have to XOR
>> 22 64k data blocks for 1 result.  If you are getting 200MB/s, you are
>> actually achieving more like 390MB/s of data read, with 190MB/s of it
>> being direct reads, and then you are using XOR on 200MB/s in order to
>> generate the other 10MB/s of results.
>>
>> The question of why that performance is so bad is probably (and I say
>> probably because without actually testing it this is just some hand-wavy
>> explanation based upon what I've tested and found in the past, but may
>> not be true today) due to a couple factors:
>>
>> 1) 200MB/s of XOR is not insignificant.  Due to our single thread XOR
>> routines, you can actually keep a CPU pretty busy with this.  Also, even
>> though the XOR routines try to time their assembly 'just so' so that
>> they can use the cache avoiding instructions, this fails more often than
>> not so you end up blowing CPU caches while doing this work, which of
>> course effects the overall system.  Possible fixes for this might include:
>>         a) Multi-threaded XOR becoming the default (last I knew it wasn't,
>> correct me if I'm wrong)
>>         b) Improved XOR routines that deal with cache more intelligently
>>         c) Creating a consolidated page cache/stripe cache (if we can read more
>> of the blocks needed to get our data from cache instead of disk it helps
>> reduce that IO ops issue)
>>         d) Rearchitecting your arrays into raid50 instead of big raid6 array
>>
>> 2) Even though we theoretically doubled IO ops, we haven't addressed
>> whether or not that doubling is done efficiently.  Testing would be
>> warranted here to make sure that our reads for reconstruction aren't
>> negatively impacting overall disk IO op capability.  We might be doing
>> something that we can fix, such as interfering with merges or with
>> ordering or with latency sensitive commands.  A person would need to do
>> some deep inspection of how commands are being created and sent to each
>> device in order to see if we are keeping them busy or our own latencies
>> at the kernel level are leaving the disks idle and killing our overall
>> throughput (or conversely has the random head seeks just gone so
>> radically through the roof that the problem here really is the time it
>> takes the heads to travel everywhere we are sending them).
>>
>>
>> --
>> Doug Ledford <dledford@redhat.com>
>>     GPG Key ID: 0E572FDD
>>
> 
> 
> 


-- 
Doug Ledford <dledford@redhat.com>
    GPG Key ID: 0E572FDD




* Re: kernel checksumming performance vs actual raid device performance
  2016-08-23 19:10             ` Doug Ledford
@ 2016-08-23 19:19               ` Doug Dumitru
  2016-08-23 19:26                 ` Doug Ledford
  0 siblings, 1 reply; 26+ messages in thread
From: Doug Dumitru @ 2016-08-23 19:19 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Matt Garman, Mdadm

Mr. Ledford,

I am glad that we are in agreement.  My issue is that if the customer
is reading 4GB/sec with a non-degraded array, the degraded array
should only have 2X the number of IOs and 2X the transfer sizes to the
drives.  If the data rate falls to 1GB, I can suspect cpu overhead.
With this case falling to 200MB/sec, then something else is going on.

SSDs tend to be very "flat" reading from q=1 up to about q=20 assuming
the HBAs can keep up.

Then again, 4GB/sec is actually pretty good for a real array with a file system.

In thinking more about this, it is possible that the raid layer is
passing all of the read overhead for the degraded read to the single
raid5 background thread.  200MB/sec after the overhead of populating
stripe pages is then very believable.  My write testing with raid-5
shows that the stripe cache and single thread doing computes can lower
linear write throughput from 10GB/sec (raid-5) or 8GB/sec (raid-6)
down to under 1.5GB/sec.  Getting to 10 or 8 GB/sec requires patches
to raid5.c bypassing the stripe cache and background thread for
"perfect writes" (writes that are exactly an array stripe in a single
BIO).

The whole raid design is intended to keep locks low.  In looking at
SSD performance, perhaps this needs to be rethought so that processing
can more effectively use multiple cores and deep queue depths.



Doug


On Tue, Aug 23, 2016 at 12:10 PM, Doug Ledford <dledford@redhat.com> wrote:
> On 8/23/2016 2:27 PM, Doug Dumitru wrote:
>> Mr. Ledford,
>>
>> I think your explanation of RAID "dirty" read performance is a bit off.
>>
>> If you have 64KB chunks, this describes the layout.  I don't think
>> this also requires 64K reads.  I know that this is true with RAID-5,
>> and I am pretty sure it applies to raid-6 as well.  So if you do 4K
>> reads, you should see 4K reads to all the member drives.
>
> Of course.  I didn't mean to imply otherwise.  The read size is the read
> size.  But, since the OPs test case was to "read random files" and not
> "read random blocks of random files" I took it to mean it would be
> sequential IO across a multitude of random files.  That assumption might
> have been wrong, but I wrote my explanation with that in mind.
>
>> You can verify this pretty easily with iostat.
>>
>> Mr. Garman,
>>
>> Your results are a lot worse than expected.  I always assume that a
>> raid "dirty" read will try to hit the disk hard.  This implies issuing
>> the 22 reads requests in parallel.  This is how "SSD" folks think.  It
>> is possible that this code is old enough to be in an HDD "mindset" and
>> that the requests are issued sequentially.  If so, then this is
>> something to "fix" in the raid code (I use the term fix here loosely
>> as this is not really a bug).
>>
>> Can you run an iostat during your degraded test, and also a top run
>> over 20+ seconds with kernel threads showing up.  Even better would be
>> a perf capture, but you might not have all the tools installed.  You
>> can always try:
>>
>> perf record -a sleep 20
>>
>> then
>>
>> perf report
>>
>> should show you the top functions globally over the 20 second sample.
>> If you don't have perf loaded, you might (or might not) be able to
>> load it from the distro.
>>
>> Doug
>>
>>
>> On Tue, Aug 23, 2016 at 11:00 AM, Doug Ledford <dledford@redhat.com> wrote:
>>> On 8/23/2016 10:54 AM, Matt Garman wrote:
>>>> On Tue, Aug 16, 2016 at 11:36 AM, Doug Dumitru <doug@easyco.com> wrote:
>>>>> The RAID rebuild for a single bad drive "should" be an XOR and should run at
>>>>> 200,000 kb/sec (the default speed_limit_max).  I might be wrong on this and
>>>>> this might still need a full RAID-6 syndrome compute, but I dont think so.
>>>>>
>>>>> The rebuild might not hit 200MB/sec if the drive you replaced is
>>>>> "conditioned".  Be sure to secure erase any non-new drive before you replace
>>>>> it.
>>>>>
>>>>> Your read IOPS will compete with now busy drives which may increase the IO
>>>>> latency a lot, and slow you down a lot.
>>>>>
>>>>> One out of 22 read OPS will be to the bad drive, so this will now take 22
>>>>> reads to re-construct the IO.  The reconstruction is XOR, so pretty cheap
>>>>> from a CPU point of view.  Regardless, your IOPS total will double.
>>>>>
>>>>> You can probably mitigate the amount of degradation by lowering the rebuild
>>>>> speed, but this will make the rebuild take longer, so you are messed up
>>>>> either way.  If the server has "down time" at night, you might lower the
>>>>> rebuild to a really small value during the day, and up it at night.
>>>>
>>>> OK, right now I'm looking purely at performance in a degraded state,
>>>> no rebuild taking place.
>>>>
>>>> We have designed a simple read load test to simulate the actual
>>>> production workload.  (It's not perfect of course, but a reasonable
>>>> approximation.  I can share with the list if there's interest.)  But
>>>> basically it just runs multiple threads of reading random files
>>>> continuously.
>>>>
>>>> When the array is in a pristine state, we can achieve read throughput
>>>> of 8000 MB/sec (at the array level, per iostat with 5 second samples).
>>>>
>>>> Now I failed a single drive.  Running the same test, read performance
>>>> drops all the way down to 200 MB/sec.
>>>>
>>>> I understand that IOPS should double, which to me says we should
>>>> expect a roughly 50% read performance drop (napkin math).  But this is
>>>> a drop of over 95%.
>>>>
>>>> Again, this is with no rebuild taking place...
>>>>
>>>> Thoughts?
>>>
>>> This depends a lot on how you structured your raid array.  I didn't see
>>> your earlier emails, so I'm inferring from the "one out of 22 reads will
>>> be to the bad drive" that you have a 24 disk raid6 array?  If so, then
>>> that's 22 data disks and 2 parity disks per stripe.  I'm gonna use that
>>> as the basis for my next statement even if it's slightly wrong.
>>>
>>> Doug was right in that you will have to read 21 data disks and 1 parity
>>> disk to reconstruct reads from the missing block of any given stripe.
>>> And while he is also correct that this doubles IO ops needed to get your
>>> read data, it doesn't address the XOR load to get your data.  With 21
>>> data disks and 1 parity disk, and say a 64k chunk size, you have to XOR
>>> 22 64k data blocks for 1 result.  If you are getting 200MB/s, you are
>>> actually achieving more like 390MB/s of data read, with 190MB/s of it
>>> being direct reads, and then you are using XOR on 200MB/s in order to
>>> generate the other 10MB/s of results.
>>>
>>> The question of why that performance is so bad is probably (and I say
>>> probably because without actually testing it this is just some hand-wavy
>>> explanation based upon what I've tested and found in the past, but may
>>> not be true today) due to a couple factors:
>>>
>>> 1) 200MB/s of XOR is not insignificant.  Due to our single thread XOR
>>> routines, you can actually keep a CPU pretty busy with this.  Also, even
>>> though the XOR routines try to time their assembly 'just so' so that
>>> they can use the cache avoiding instructions, this fails more often than
>>> not so you end up blowing CPU caches while doing this work, which of
>>> course effects the overall system.  Possible fixes for this might include:
>>>         a) Multi-threaded XOR becoming the default (last I knew it wasn't,
>>> correct me if I'm wrong)
>>>         b) Improved XOR routines that deal with cache more intelligently
>>>         c) Creating a consolidated page cache/stripe cache (if we can read more
>>> of the blocks needed to get our data from cache instead of disk it helps
>>> reduce that IO ops issue)
>>>         d) Rearchitecting your arrays into raid50 instead of big raid6 array
>>>
>>> 2) Even though we theoretically doubled IO ops, we haven't addressed
>>> whether or not that doubling is done efficiently.  Testing would be
>>> warranted here to make sure that our reads for reconstruction aren't
>>> negatively impacting overall disk IO op capability.  We might be doing
>>> something that we can fix, such as interfering with merges or with
>>> ordering or with latency sensitive commands.  A person would need to do
>>> some deep inspection of how commands are being created and sent to each
>>> device in order to see if we are keeping them busy or our own latencies
>>> at the kernel level are leaving the disks idle and killing our overall
>>> throughput (or conversely has the random head seeks just gone so
>>> radically through the roof that the problem here really is the time it
>>> takes the heads to travel everywhere we are sending them).
>>>
>>>
>>> --
>>> Doug Ledford <dledford@redhat.com>
>>>     GPG Key ID: 0E572FDD
>>>
>>
>>
>>
>
>
> --
> Doug Ledford <dledford@redhat.com>
>     GPG Key ID: 0E572FDD
>



-- 
Doug Dumitru
EasyCo LLC

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel checksumming performance vs actual raid device performance
  2016-08-23 18:27           ` Doug Dumitru
  2016-08-23 19:10             ` Doug Ledford
@ 2016-08-23 19:26             ` Matt Garman
  2016-08-23 19:41               ` Doug Dumitru
  2016-08-23 20:15               ` Doug Ledford
  1 sibling, 2 replies; 26+ messages in thread
From: Matt Garman @ 2016-08-23 19:26 UTC (permalink / raw)
  To: Doug Dumitru; +Cc: Doug Ledford, Mdadm

Doug & Doug,

Thank you for your helpful replies.  I merged both of your posts into
one, see inline comments below:

On Tue, Aug 23, 2016 at 2:10 PM, Doug Ledford <dledford@redhat.com> wrote:
> Of course.  I didn't mean to imply otherwise.  The read size is the read
> size.  But, since the OPs test case was to "read random files" and not
> "read random blocks of random files" I took it to mean it would be
> sequential IO across a multitude of random files.  That assumption might
> have been wrong, but I wrote my explanation with that in mind.

Yes, multiple parallel sequential reads.  Our test program generates a
bunch of big random files (file size has an approximately normal
distribution, centered around 500 MB, going down to 100 MB or so, up
to a few multi-GB outliers).  The file generation is a one-time thing,
and we don't really care about its performance.

The read testing program just randomly picks one of those files, then
reads it start-to-finish using "dd".  But it kicks off several "dd"
threads at once (currently 50, though this is a run-time parameter).
This is how we generate the read load, and I use iostat while this is
running to see how much read throughput I'm getting from the array.
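
For concreteness, the reader side is basically equivalent to this (just a
sketch -- the real script is fancier, and the path and thread count here
are illustrative):

    #!/bin/bash
    # spawn NTHREADS loops, each dd'ing randomly chosen test files to /dev/null
    NTHREADS=50
    FILES=(/mnt/md0/testfiles/*)
    for i in $(seq "$NTHREADS"); do
        (
            while true; do
                f=${FILES[RANDOM % ${#FILES[@]}]}
                dd if="$f" of=/dev/null bs=1M 2>/dev/null
            done
        ) &
    done
    wait    # runs until interrupted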


On Tue, Aug 23, 2016 at 1:00 PM, Doug Ledford <dledford@redhat.com> wrote:
> This depends a lot on how you structured your raid array.  I didn't see
> your earlier emails, so I'm inferring from the "one out of 22 reads will
> be to the bad drive" that you have a 24 disk raid6 array?  If so, then
> that's 22 data disks and 2 parity disks per stripe.  I'm gonna use that
> as the basis for my next statement even if it's slightly wrong.

Yes, that is exactly correct, here's the relevant part of /proc/mdstat:

Personalities : [raid1] [raid6] [raid5] [raid4]

md0 : active raid6 sdl[11] sdi[8] sdx[23] sdc[2] sdo[14] sdn[13]
sdm[12] sdr[17] sdk[10] sdb[1] sdu[20] sdp[15] sdq[16] sds[18] sdt[19]
sdw[22] sdv[21] sda[0](F) sdj[9] sde[4] sdd[3] sdf[5] sdh[7] sdg[6]

      44005879808 blocks super 1.2 level 6, 512k chunk, algorithm 2
[24/23] [_UUUUUUUUUUUUUUUUUUUUUUU]

      bitmap: 0/15 pages [0KB], 65536KB chunk


> Doug was right in that you will have to read 21 data disks and 1 parity
> disk to reconstruct reads from the missing block of any given stripe.
> And while he is also correct that this doubles IO ops needed to get your
> read data, it doesn't address the XOR load to get your data.  With 19
> data disks and 1 parity disk, and say a 64k chunk size, you have to XOR
> 20 64k data blocks for 1 result.  If you are getting 200MB/s, you are
> actually achieving more like 390MB/s of data read, with 190MB/s of it
> being direct reads, and then you are using XOR on 200MB/s in order to
> generate the other 10MB/s of results.

Most of this morning I've been setting/unsetting/changing various
tunables, to see if I could increase the read speed.  I got a huge
boost by increasing the /sys/block/md0/md/stripe_cache_size parameter
from the default (256 IIRC) to 16384.  Doubling it again to 32k didn't
seem to bring any further benefit.  So with the stripe_cache_size
increased to 16k, I'm now getting around 1000 MB/s read in the
degraded state.  When the degraded array was only doing 200 MB/s, the
md0_raid6 process was taking about 50% CPU according to top.  Now I
have a 5x increase in read speed, and md0_raid6 is taking 100% CPU.
I'm still degraded by a factor of eight, though, where I'd expect only
two.
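
For reference, the exact knob I'm tuning (the value is just what I settled
on after experimenting):

    cat /sys/block/md0/md/stripe_cache_size      # default was 256 (IIRC)
    echo 16384 > /sys/block/md0/md/stripe_cache_size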

> 1) 200MB/s of XOR is not insignificant.  Due to our single thread XOR
> routines, you can actually keep a CPU pretty busy with this.  Also, even
> though the XOR routines try to time their assembly 'just so' so that
> they can use the cache avoiding instructions, this fails more often than
> not so you end up blowing CPU caches while doing this work, which of
> course effects the overall system.

While 200 MB/s of XOR sounds high, the kernel is "advertising" over
8000 MB/s, per dmesg:

[    6.386820] xor: automatically using best checksumming function:
[    6.396690]    avx       : 24064.000 MB/sec
[    6.414706] raid6: sse2x1   gen()  7636 MB/s
[    6.431725] raid6: sse2x2   gen()  3656 MB/s
[    6.448742] raid6: sse2x4   gen()  3917 MB/s
[    6.465753] raid6: avx2x1   gen()  5425 MB/s
[    6.482766] raid6: avx2x2   gen()  7593 MB/s
[    6.499773] raid6: avx2x4   gen()  8648 MB/s
[    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
[    6.499774] raid6: using avx2x2 recovery algorithm

(CPU is: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.)

I'm assuming however the kernel does its testing is fairly optimal,
and probably assumes ideal cache behavior... so maybe actual XOR
performance won't be as good as what dmesg suggests... but still, 200
MB/s (or even 1000 MB/s, as I'm now getting), is much lower than 8000
MB/s...

Is it possible to pin kernel threads to a CPU?  I'm thinking I could
reboot with isolcpus=2 (for example) and if I can force that md0_raid6
thread to run on CPU 2, at least the L1/L2 caches should be minimally
affected...
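
Something along these lines is what I have in mind (untested, and the CPU
number is arbitrary):

    # boot with isolcpus=2 on the kernel command line, then:
    taskset -cp 2 $(pgrep -x md0_raid6)     # pin the raid thread to CPU 2
    taskset -cp $(pgrep -x md0_raid6)       # verify the new affinity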

> Possible fixes for this might include:
>         c) Creating a consolidated page cache/stripe cache (if we can read more
> of the blocks needed to get our data from cache instead of disk it helps
> reduce that IO ops issue)

I suppose this might be an explanation for why increasing the array's
stripe_cache_size gave me such a boost?

>         d) Rearchitecting your arrays into raid50 instead of big raid6 array

My colleague tested that exact same config with hardware raid5, and
striped the three raid5 arrays together with software raid1.  So
clearly not apples-to-apples, but he did get dramatically better
degraded and rebuild performance.  I do intend to test a pure software
raid-50 implementation.

> (or conversely has the random head seeks just gone so
> radically through the roof that the problem here really is the time it
> takes the heads to travel everywhere we are sending them).

I'm certain head movement time isn't the issue, as these are SSDs.  :)

On Tue, Aug 23, 2016 at 1:27 PM, Doug Dumitru <doug@easyco.com> wrote:
> Can you run an iostat during your degraded test, and also a top run
> over 20+ seconds with kernel threads showing up.  Even better would be
> a perf capture, but you might not have all the tools installed.  You
> can always try:
>
> perf record -a sleep 20
>
> then
>
> perf report
>
> should show you the top functions globally over the 20 second sample.
> If you don't have perf loaded, you might (or might not) be able to
> load it from the distro.

Running top for 20 or more seconds, the top processes in terms of CPU
usage are pretty static:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1228 root      20   0       0      0      0 R 100.0  0.0 562:16.83 md0_raid6
 1315 root      20   0    4372    684    524 S  17.3  0.0  57:20.92 rngd
  107 root      20   0       0      0      0 S   9.6  0.0  65:16.63 kswapd0
  108 root      20   0       0      0      0 S   8.6  0.0  65:19.58 kswapd1
19424 root      20   0  108972   1676    560 D   3.3  0.0   0:00.52 dd
 6909 root      20   0  108972   1676    560 D   2.7  0.0   0:01.53 dd
18383 root      20   0  108972   1680    560 D   2.7  0.0   0:00.63 dd


I truncated the output.  The "dd" processes are part of our testing
tool that generates the huge read load on the array.  Any given "dd"
process might jump around, but those four kernel processes are always
the top four.  (Note that before I increased the stripe_cache_size (as
mentioned above), the md0_raid6 process was only consuming around 50%
CPU.)

Here is a representative view of a non-first iteration of "iostat -mxt 5":


08/23/2016 01:37:59 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.84    0.00   27.41   67.59    0.00    0.17

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdy               0.00     0.40    0.80    0.60     0.05     0.00
83.43     0.00    1.00    0.50    1.67   1.00   0.14
sdz               0.00     0.40    0.00    0.60     0.00     0.00
10.67     0.00    2.00    0.00    2.00   2.00   0.12
sdd           12927.00     0.00  204.40    0.00    51.00     0.00
511.00     5.93   28.75   28.75    0.00   4.31  88.10
sde           13002.60     0.00  205.20    0.00    51.20     0.00
511.00     6.29   30.39   30.39    0.00   4.59  94.12
sdf           12976.80     0.00  205.00    0.00    51.00     0.00
509.50     6.17   29.76   29.76    0.00   4.57  93.78
sdg           12950.20     0.00  205.60    0.00    50.80     0.00
506.03     6.20   29.75   29.75    0.00   4.57  93.88
sdh           12949.00     0.00  207.20    0.00    50.90     0.00
503.11     6.36   30.35   30.35    0.00   4.59  95.10
sdb           12196.40     0.00  192.60    0.00    48.10     0.00
511.47     5.48   28.15   28.15    0.00   4.38  84.36
sda               0.00     0.00    0.00    0.00     0.00     0.00
0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdi           12923.00     0.00  208.40    0.00    51.00     0.00
501.20     6.79   32.31   32.31    0.00   4.65  96.84
sdj           12796.20     0.00  206.80    0.00    50.50     0.00
500.12     6.62   31.73   31.73    0.00   4.62  95.64
sdk           12746.60     0.00  204.00    0.00    50.20     0.00
503.97     6.38   30.77   30.77    0.00   4.60  93.86
sdl           12570.00     0.00  202.20    0.00    49.70     0.00
503.39     6.39   31.19   31.19    0.00   4.63  93.68
sdn           12594.00     0.00  204.20    0.00    49.95     0.00
500.97     6.40   30.99   30.99    0.00   4.58  93.54
sdm           12569.00     0.00  203.80    0.00    49.90     0.00
501.45     6.30   30.58   30.58    0.00   4.45  90.60
sdp           12568.80     0.00  205.20    0.00    50.10     0.00
500.03     6.37   30.79   30.79    0.00   4.52  92.72
sdo           12569.20     0.00  204.00    0.00    49.95     0.00
501.46     6.40   31.07   31.07    0.00   4.58  93.42
sdw           12568.60     0.00  206.20    0.00    50.00     0.00
496.60     6.34   30.71   30.71    0.00   4.24  87.48
sdx           12038.60     0.00  197.40    0.00    47.60     0.00
493.84     6.01   30.21   30.21    0.00   4.40  86.86
sdq           12570.20     0.00  204.20    0.00    50.15     0.00
502.97     6.23   30.41   30.41    0.00   4.44  90.68
sdr           12571.00     0.00  204.60    0.00    50.25     0.00
502.99     6.15   30.26   30.26    0.00   4.18  85.62
sds           12495.20     0.00  203.80    0.00    49.95     0.00
501.95     6.00   29.62   29.62    0.00   4.24  86.38
sdu           12695.60     0.00  207.80    0.00    50.65     0.00
499.17     6.22   30.00   30.00    0.00   4.16  86.38
sdv           12619.00     0.00  207.80    0.00    50.35     0.00
496.22     6.23   30.03   30.03    0.00   4.20  87.32
sdt           12671.20     0.00  206.20    0.00    50.50     0.00
501.56     6.05   29.30   29.30    0.00   4.24  87.44
sdc           12851.60     0.00  203.00    0.00    50.70     0.00
511.50     5.84   28.49   28.49    0.00   4.17  84.64
md126             0.00     0.00    0.60    1.00     0.05     0.00
71.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.60    0.80     0.05     0.00
81.14     0.00    2.29    0.67    3.50   1.14   0.16
dm-1              0.00     0.00    0.00    0.00     0.00     0.00
0.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00 4475.20    0.00  1110.95     0.00
508.41     0.00    0.00    0.00    0.00   0.00   0.00


sdy and sz are the system drives, so they are uninteresting.

sda is the md0 drive I failed, that's why it stays at zero.

And lastly, here's the output of the perf commands you suggested (at
least the top part):

Samples: 561K of event 'cycles', Event count (approx.): 318536644203
Overhead  Command         Shared Object                 Symbol
  52.85%  swapper         [kernel.kallsyms]             [k] cpu_startup_entry
   4.47%  md0_raid6       [kernel.kallsyms]             [k] memcpy
   3.39%  dd              [kernel.kallsyms]             [k] __find_stripe
   2.50%  md0_raid6       [kernel.kallsyms]             [k] analyse_stripe
   2.43%  dd              [kernel.kallsyms]             [k] _raw_spin_lock_irq
   1.75%  rngd            rngd                          [.] 0x000000000000288b
   1.74%  md0_raid6       [kernel.kallsyms]             [k] xor_avx_5
   1.49%  dd              [kernel.kallsyms]             [k]
copy_user_enhanced_fast_string
   1.33%  md0_raid6       [kernel.kallsyms]             [k] ops_run_io
   0.65%  dd              [kernel.kallsyms]             [k] raid5_compute_sector
   0.60%  md0_raid6       [kernel.kallsyms]             [k] _raw_spin_lock_irq
   0.55%  ps              libc-2.17.so                  [.] _IO_vfscanf
   0.53%  ps              [kernel.kallsyms]             [k] vsnprintf
   0.51%  ps              [kernel.kallsyms]             [k] format_decode
   0.47%  ps              [kernel.kallsyms]             [k] number.isra.2
   0.41%  md0_raid6       [kernel.kallsyms]             [k] raid_run_ops
   0.40%  md0_raid6       [kernel.kallsyms]             [k] __blk_segment_map_sg


That's my first time using the perf tool, so I need a little hand-holding here.

Thanks again all!
Matt

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel checksumming performance vs actual raid device performance
  2016-08-23 19:19               ` Doug Dumitru
@ 2016-08-23 19:26                 ` Doug Ledford
  0 siblings, 0 replies; 26+ messages in thread
From: Doug Ledford @ 2016-08-23 19:26 UTC (permalink / raw)
  To: doug; +Cc: Matt Garman, Mdadm


[-- Attachment #1.1: Type: text/plain, Size: 697 bytes --]

On 8/23/2016 3:19 PM, Doug Dumitru wrote:
> Mr. Ledford,
> 
> I am glad that we are in agreement.  My issue is that if the customer
> is reading 4GB/sec with a non-degraded array, the degraded array
> should only have 2X the number of IOs and 2X the transfer sizes to the
> drives.  If the data rate falls to 1GB, I can suspect cpu overhead.
> With this case falling to 200MB/sec, then something else is going on.
> 
> SSDs tend to be very "flat" reading from q=1 up to about q=20 assuming
> the HBAs can keep up.

Is he using SSDs?  If so, I missed that bit.  I wrote my response
assuming rotating media.




-- 
Doug Ledford <dledford@redhat.com>
    GPG Key ID: 0E572FDD


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel checksumming performance vs actual raid device performance
  2016-08-23 19:26             ` Matt Garman
@ 2016-08-23 19:41               ` Doug Dumitru
  2016-08-23 20:15               ` Doug Ledford
  1 sibling, 0 replies; 26+ messages in thread
From: Doug Dumitru @ 2016-08-23 19:41 UTC (permalink / raw)
  To: Matt Garman; +Cc: Doug Ledford, Mdadm

Matt,

So you are up at 1GB/sec, which is only 1/4 the non-degraded speed, but
1/2 the expected speed based on drive data transfers required.  This
is actually pretty good.

I should have mentioned the stripe cache parameter before, but I use
raid "differently" and stripe cache does not impact my use case.
Sorry.

The 1GB/sec saturating a core is probably as good as it gets.  This
core is doing a lot of stripe cache page manipulations which are not
all that fast.

Also, the single-parity recovery case should be plain XOR and not the
raid-6 recovery logic, so it should be pretty cheap.  Another point, not
important for this issue, is that the benchmarks measure parity
generation, not recovery.  Recovery with raid-6 (i.e., two drives failed)
is more expensive than the writes.  I am not sure how optimized that path
is, but it could be really bad.

If you need this to go faster, then it is either a raid re-design, or
perhaps you should consider cutting your array into two parts.  Two
12-drive raid-6 arrays will give you more bandwidth because a failure is
less "wide": a single failed drive only forces 11 reads per stripe
instead of 22.  Plus you get the benefit of two raid-6 threads should
you have dead drives on both halves.  You can raid-0 the arrays
together.  Then again, you lose two drives' worth of space.
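
Roughly something like this (device names are just placeholders -- adjust
to your actual drives):

    mdadm --create /dev/md1  --level=6 --chunk=512 --raid-devices=12 /dev/sd[a-l]
    mdadm --create /dev/md2  --level=6 --chunk=512 --raid-devices=12 /dev/sd[m-x]
    mdadm --create /dev/md10 --level=0 --raid-devices=2 /dev/md1 /dev/md2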

Doug


On Tue, Aug 23, 2016 at 12:26 PM, Matt Garman <matthew.garman@gmail.com> wrote:
> Doug & Doug,
>
> Thank you for your helpful replies.  I merged both of your posts into
> one, see inline comments below:
>
> On Tue, Aug 23, 2016 at 2:10 PM, Doug Ledford <dledford@redhat.com> wrote:
>> Of course.  I didn't mean to imply otherwise.  The read size is the read
>> size.  But, since the OPs test case was to "read random files" and not
>> "read random blocks of random files" I took it to mean it would be
>> sequential IO across a multitude of random files.  That assumption might
>> have been wrong, but I wrote my explanation with that in mind.
>
> Yes, multiple parallel sequential reads.  Our test program generates a
> bunch of big random files (file size has an approximately normal
> distribution, centered around 500 MB, going down to 100 MB or so, up
> to a few multi-GB outliers).  The file generation is a one-time thing,
> and we don't really care about its performance.
>
> The read testing program just randomly picks one of those files, then
> reads it start-to-finish using "dd".  But it kicks off several "dd"
> threads at once (currently 50, though this is a run-time parameter).
> This is how we generate the read load, and I use iostat while this is
> running to see how much read throughput I'm getting from the array.
>
>
> On Tue, Aug 23, 2016 at 1:00 PM, Doug Ledford <dledford@redhat.com> wrote:
>> This depends a lot on how you structured your raid array.  I didn't see
>> your earlier emails, so I'm inferring from the "one out of 22 reads will
>> be to the bad drive" that you have a 24 disk raid6 array?  If so, then
>> that's 22 data disks and 2 parity disks per stripe.  I'm gonna use that
>> as the basis for my next statement even if it's slightly wrong.
>
> Yes, that is exactly correct, here's the relevant part of /proc/mdstat:
>
> Personalities : [raid1] [raid6] [raid5] [raid4]
>
> md0 : active raid6 sdl[11] sdi[8] sdx[23] sdc[2] sdo[14] sdn[13]
> sdm[12] sdr[17] sdk[10] sdb[1] sdu[20] sdp[15] sdq[16] sds[18] sdt[19]
> sdw[22] sdv[21] sda[0](F) sdj[9] sde[4] sdd[3] sdf[5] sdh[7] sdg[6]
>
>       44005879808 blocks super 1.2 level 6, 512k chunk, algorithm 2
> [24/23] [_UUUUUUUUUUUUUUUUUUUUUUU]
>
>       bitmap: 0/15 pages [0KB], 65536KB chunk
>
>
>> Doug was right in that you will have to read 21 data disks and 1 parity
>> disk to reconstruct reads from the missing block of any given stripe.
>> And while he is also correct that this doubles IO ops needed to get your
>> read data, it doesn't address the XOR load to get your data.  With 19
>> data disks and 1 parity disk, and say a 64k chunk size, you have to XOR
>> 20 64k data blocks for 1 result.  If you are getting 200MB/s, you are
>> actually achieving more like 390MB/s of data read, with 190MB/s of it
>> being direct reads, and then you are using XOR on 200MB/s in order to
>> generate the other 10MB/s of results.
>
> Most of this morning I've been setting/unsetting/changing various
> tunables, to see if I could increase the read speed.  I got a huge
> boost by increasing the /sys/block/md0/md/stripe_cache_size parameter
> from the default (256 IIRC) to 16384.  Doubling it again to 32k didn't
> seem to bring any further benefit.  So with the stripe_cache_size
> increased to 16k, I'm now getting around 1000 MB/s read in the
> degraded state.  When the degraded array was only doing 200 MB/s, the
> md0_raid6 process was taking about 50% CPU according to top.  Now I
> have a 5x increase in read speed, and md0_raid6 is taking 100% CPU.
> I'm still degraded by a factor of eight, though, where I'd expect only
> two.
>
>> 1) 200MB/s of XOR is not insignificant.  Due to our single thread XOR
>> routines, you can actually keep a CPU pretty busy with this.  Also, even
>> though the XOR routines try to time their assembly 'just so' so that
>> they can use the cache avoiding instructions, this fails more often than
>> not so you end up blowing CPU caches while doing this work, which of
>> course effects the overall system.
>
> While 200 MB/s of XOR sounds high, the kernel is "advertising" over
> 8000 MB/s, per dmesg:
>
> [    6.386820] xor: automatically using best checksumming function:
> [    6.396690]    avx       : 24064.000 MB/sec
> [    6.414706] raid6: sse2x1   gen()  7636 MB/s
> [    6.431725] raid6: sse2x2   gen()  3656 MB/s
> [    6.448742] raid6: sse2x4   gen()  3917 MB/s
> [    6.465753] raid6: avx2x1   gen()  5425 MB/s
> [    6.482766] raid6: avx2x2   gen()  7593 MB/s
> [    6.499773] raid6: avx2x4   gen()  8648 MB/s
> [    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
> [    6.499774] raid6: using avx2x2 recovery algorithm
>
> (CPU is: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.)
>
> I'm assuming however the kernel does its testing is fairly optimal,
> and probably assumes ideal cache behavior... so maybe actual XOR
> performance won't be as good as what dmesg suggests... but still, 200
> MB/s (or even 1000 MB/s, as I'm now getting), is much lower than 8000
> MB/s...
>
> Is it possible to pin kernel threads to a CPU?  I'm thinking I could
> reboot with isolcpus=2 (for example) and if I can force that md0_raid6
> thread to run on CPU 2, at least the L1/L2 caches should be minimally
> affected...
>
>> Possible fixes for this might include:
>>         c) Creating a consolidated page cache/stripe cache (if we can read more
>> of the blocks needed to get our data from cache instead of disk it helps
>> reduce that IO ops issue)
>
> I suppose this might be an explanation for why increasing the array's
> stripe_cache_size gave me such a boost?
>
>>         d) Rearchitecting your arrays into raid50 instead of big raid6 array
>
> My colleague tested that exact same config with hardware raid5, and
> striped the three raid5 arrays together with software raid1.  So
> clearly not apples-to-apples, but he did get dramatically better
> degraded and rebuild performance.  I do intend to test a pure software
> raid-50 implementation.
>
>> (or conversely has the random head seeks just gone so
>> radically through the roof that the problem here really is the time it
>> takes the heads to travel everywhere we are sending them).
>
> I'm certain head movement time isn't the issue, as these are SSDs.  :)
>
> On Tue, Aug 23, 2016 at 1:27 PM, Doug Dumitru <doug@easyco.com> wrote:
>> Can you run an iostat during your degraded test, and also a top run
>> over 20+ seconds with kernel threads showing up.  Even better would be
>> a perf capture, but you might not have all the tools installed.  You
>> can always try:
>>
>> perf record -a sleep 20
>>
>> then
>>
>> perf report
>>
>> should show you the top functions globally over the 20 second sample.
>> If you don't have perf loaded, you might (or might not) be able to
>> load it from the distro.
>
> Running top for 20 or more seconds, the top processes in terms of CPU
> usage are pretty static:
>
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>  1228 root      20   0       0      0      0 R 100.0  0.0 562:16.83 md0_raid6
>  1315 root      20   0    4372    684    524 S  17.3  0.0  57:20.92 rngd
>   107 root      20   0       0      0      0 S   9.6  0.0  65:16.63 kswapd0
>   108 root      20   0       0      0      0 S   8.6  0.0  65:19.58 kswapd1
> 19424 root      20   0  108972   1676    560 D   3.3  0.0   0:00.52 dd
>  6909 root      20   0  108972   1676    560 D   2.7  0.0   0:01.53 dd
> 18383 root      20   0  108972   1680    560 D   2.7  0.0   0:00.63 dd
>
>
> I truncated the output.  The "dd" processes are part of our testing
> tool that generates the huge read load on the array.  Any given "dd"
> process might jump around, but those four kernel processes are always
> the top four.  (Note that before I increased the stripe_cache_size (as
> mentioned above), the md0_raid6 process was only consuming around 50%
> CPU.)
>
> Here is a representative view of a non-first iteration of "iostat -mxt 5":
>
>
> 08/23/2016 01:37:59 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            4.84    0.00   27.41   67.59    0.00    0.17
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdy               0.00     0.40    0.80    0.60     0.05     0.00
> 83.43     0.00    1.00    0.50    1.67   1.00   0.14
> sdz               0.00     0.40    0.00    0.60     0.00     0.00
> 10.67     0.00    2.00    0.00    2.00   2.00   0.12
> sdd           12927.00     0.00  204.40    0.00    51.00     0.00
> 511.00     5.93   28.75   28.75    0.00   4.31  88.10
> sde           13002.60     0.00  205.20    0.00    51.20     0.00
> 511.00     6.29   30.39   30.39    0.00   4.59  94.12
> sdf           12976.80     0.00  205.00    0.00    51.00     0.00
> 509.50     6.17   29.76   29.76    0.00   4.57  93.78
> sdg           12950.20     0.00  205.60    0.00    50.80     0.00
> 506.03     6.20   29.75   29.75    0.00   4.57  93.88
> sdh           12949.00     0.00  207.20    0.00    50.90     0.00
> 503.11     6.36   30.35   30.35    0.00   4.59  95.10
> sdb           12196.40     0.00  192.60    0.00    48.10     0.00
> 511.47     5.48   28.15   28.15    0.00   4.38  84.36
> sda               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdi           12923.00     0.00  208.40    0.00    51.00     0.00
> 501.20     6.79   32.31   32.31    0.00   4.65  96.84
> sdj           12796.20     0.00  206.80    0.00    50.50     0.00
> 500.12     6.62   31.73   31.73    0.00   4.62  95.64
> sdk           12746.60     0.00  204.00    0.00    50.20     0.00
> 503.97     6.38   30.77   30.77    0.00   4.60  93.86
> sdl           12570.00     0.00  202.20    0.00    49.70     0.00
> 503.39     6.39   31.19   31.19    0.00   4.63  93.68
> sdn           12594.00     0.00  204.20    0.00    49.95     0.00
> 500.97     6.40   30.99   30.99    0.00   4.58  93.54
> sdm           12569.00     0.00  203.80    0.00    49.90     0.00
> 501.45     6.30   30.58   30.58    0.00   4.45  90.60
> sdp           12568.80     0.00  205.20    0.00    50.10     0.00
> 500.03     6.37   30.79   30.79    0.00   4.52  92.72
> sdo           12569.20     0.00  204.00    0.00    49.95     0.00
> 501.46     6.40   31.07   31.07    0.00   4.58  93.42
> sdw           12568.60     0.00  206.20    0.00    50.00     0.00
> 496.60     6.34   30.71   30.71    0.00   4.24  87.48
> sdx           12038.60     0.00  197.40    0.00    47.60     0.00
> 493.84     6.01   30.21   30.21    0.00   4.40  86.86
> sdq           12570.20     0.00  204.20    0.00    50.15     0.00
> 502.97     6.23   30.41   30.41    0.00   4.44  90.68
> sdr           12571.00     0.00  204.60    0.00    50.25     0.00
> 502.99     6.15   30.26   30.26    0.00   4.18  85.62
> sds           12495.20     0.00  203.80    0.00    49.95     0.00
> 501.95     6.00   29.62   29.62    0.00   4.24  86.38
> sdu           12695.60     0.00  207.80    0.00    50.65     0.00
> 499.17     6.22   30.00   30.00    0.00   4.16  86.38
> sdv           12619.00     0.00  207.80    0.00    50.35     0.00
> 496.22     6.23   30.03   30.03    0.00   4.20  87.32
> sdt           12671.20     0.00  206.20    0.00    50.50     0.00
> 501.56     6.05   29.30   29.30    0.00   4.24  87.44
> sdc           12851.60     0.00  203.00    0.00    50.70     0.00
> 511.50     5.84   28.49   28.49    0.00   4.17  84.64
> md126             0.00     0.00    0.60    1.00     0.05     0.00
> 71.00     0.00    0.00    0.00    0.00   0.00   0.00
> dm-0              0.00     0.00    0.60    0.80     0.05     0.00
> 81.14     0.00    2.29    0.67    3.50   1.14   0.16
> dm-1              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
> md0               0.00     0.00 4475.20    0.00  1110.95     0.00
> 508.41     0.00    0.00    0.00    0.00   0.00   0.00
>
>
> sdy and sz are the system drives, so they are uninteresting.
>
> sda is the md0 drive I failed, that's why it stays at zero.
>
> And lastly, here's the output of the perf commands you suggested (at
> least the top part):
>
> Samples: 561K of event 'cycles', Event count (approx.): 318536644203
> Overhead  Command         Shared Object                 Symbol
>   52.85%  swapper         [kernel.kallsyms]             [k] cpu_startup_entry
>    4.47%  md0_raid6       [kernel.kallsyms]             [k] memcpy
>    3.39%  dd              [kernel.kallsyms]             [k] __find_stripe
>    2.50%  md0_raid6       [kernel.kallsyms]             [k] analyse_stripe
>    2.43%  dd              [kernel.kallsyms]             [k] _raw_spin_lock_irq
>    1.75%  rngd            rngd                          [.] 0x000000000000288b
>    1.74%  md0_raid6       [kernel.kallsyms]             [k] xor_avx_5
>    1.49%  dd              [kernel.kallsyms]             [k]
> copy_user_enhanced_fast_string
>    1.33%  md0_raid6       [kernel.kallsyms]             [k] ops_run_io
>    0.65%  dd              [kernel.kallsyms]             [k] raid5_compute_sector
>    0.60%  md0_raid6       [kernel.kallsyms]             [k] _raw_spin_lock_irq
>    0.55%  ps              libc-2.17.so                  [.] _IO_vfscanf
>    0.53%  ps              [kernel.kallsyms]             [k] vsnprintf
>    0.51%  ps              [kernel.kallsyms]             [k] format_decode
>    0.47%  ps              [kernel.kallsyms]             [k] number.isra.2
>    0.41%  md0_raid6       [kernel.kallsyms]             [k] raid_run_ops
>    0.40%  md0_raid6       [kernel.kallsyms]             [k] __blk_segment_map_sg
>
>
> That's my first time using the perf tool, so I need a little hand-holding here.
>
> Thanks again all!
> Matt



-- 
Doug Dumitru
EasyCo LLC

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel checksumming performance vs actual raid device performance
  2016-08-23 19:26             ` Matt Garman
  2016-08-23 19:41               ` Doug Dumitru
@ 2016-08-23 20:15               ` Doug Ledford
  2016-08-23 21:42                 ` Phil Turmel
  1 sibling, 1 reply; 26+ messages in thread
From: Doug Ledford @ 2016-08-23 20:15 UTC (permalink / raw)
  To: Matt Garman, Doug Dumitru; +Cc: Mdadm


[-- Attachment #1.1: Type: text/plain, Size: 19347 bytes --]

On 8/23/2016 3:26 PM, Matt Garman wrote:
> Doug & Doug,
> 
> Thank you for your helpful replies.  I merged both of your posts into
> one, see inline comments below:
> 
> On Tue, Aug 23, 2016 at 2:10 PM, Doug Ledford <dledford@redhat.com> wrote:
>> Of course.  I didn't mean to imply otherwise.  The read size is the read
>> size.  But, since the OPs test case was to "read random files" and not
>> "read random blocks of random files" I took it to mean it would be
>> sequential IO across a multitude of random files.  That assumption might
>> have been wrong, but I wrote my explanation with that in mind.
> 
> Yes, multiple parallel sequential reads.  Our test program generates a
> bunch of big random files (file size has an approximately normal
> distribution, centered around 500 MB, going down to 100 MB or so, up
> to a few multi-GB outliers).  The file generation is a one-time thing,
> and we don't really care about its performance.
> 
> The read testing program just randomly picks one of those files, then
> reads it start-to-finish using "dd".  But it kicks off several "dd"
> threads at once (currently 50, though this is a run-time parameter).
> This is how we generate the read load, and I use iostat while this is
> running to see how much read throughput I'm getting from the array.

OK, 50 sequential I/Os at a time.  Good point to know.

> 
> On Tue, Aug 23, 2016 at 1:00 PM, Doug Ledford <dledford@redhat.com> wrote:
>> This depends a lot on how you structured your raid array.  I didn't see
>> your earlier emails, so I'm inferring from the "one out of 22 reads will
>> be to the bad drive" that you have a 24 disk raid6 array?  If so, then
>> that's 22 data disks and 2 parity disks per stripe.  I'm gonna use that
>> as the basis for my next statement even if it's slightly wrong.
> 
> Yes, that is exactly correct, here's the relevant part of /proc/mdstat:
> 
> Personalities : [raid1] [raid6] [raid5] [raid4]
> 
> md0 : active raid6 sdl[11] sdi[8] sdx[23] sdc[2] sdo[14] sdn[13]
> sdm[12] sdr[17] sdk[10] sdb[1] sdu[20] sdp[15] sdq[16] sds[18] sdt[19]
> sdw[22] sdv[21] sda[0](F) sdj[9] sde[4] sdd[3] sdf[5] sdh[7] sdg[6]
> 
>       44005879808 blocks super 1.2 level 6, 512k chunk, algorithm 2
> [24/23] [_UUUUUUUUUUUUUUUUUUUUUUU]
> 
>       bitmap: 0/15 pages [0KB], 65536KB chunk

Your raid device has a good chunk size for your usage pattern.  If you
had a smallish chunk size (like 64k or 32k), I would actually expect
things to behave differently.  But, then again, maybe I'm wrong and that
would help.  With a smaller chunk size, you would be able to fit more
stripes in the stripe cache using less memory.

> 
>> Doug was right in that you will have to read 21 data disks and 1 parity
>> disk to reconstruct reads from the missing block of any given stripe.
>> And while he is also correct that this doubles IO ops needed to get your
>> read data, it doesn't address the XOR load to get your data.  With 19
>> data disks and 1 parity disk, and say a 64k chunk size, you have to XOR
>> 20 64k data blocks for 1 result.  If you are getting 200MB/s, you are
>> actually achieving more like 390MB/s of data read, with 190MB/s of it
>> being direct reads, and then you are using XOR on 200MB/s in order to
>> generate the other 10MB/s of results.
> 
> Most of this morning I've been setting/unsetting/changing various
> tunables, to see if I could increase the read speed.  I got a huge
> boost by increasing the /sys/block/md0/md/stripe_cache_size parameter
> from the default (256 IIRC) to 16384.  Doubling it again to 32k didn't
> seem to bring any further benefit.

Makes sense.  I know the stripe cache size is conservative by default
because of the fact that it's not shared with the page cache, so you
might as well consider its memory lost.  When you upped it to 64k, and
you have 22 disks at 512k chunk, that's 11MB per stripe and 65536 total
allowed stripes, which is a maximum memory consumption of around 700GB of
RAM.  I doubt you have that much in your machine, so I'm guessing it's
simply using all available RAM that the page cache or something else
isn't already using.  That also explains why setting it higher doesn't
provide any additional benefits ;-).

>  So with the stripe_cache_size
> increased to 16k, I'm now getting around 1000 MB/s read in the
> degraded state.  When the degraded array was only doing 200 MB/s, the
> md0_raid6 process was taking about 50% CPU according to top.  Now I
> have a 5x increase in read speed, and md0_raid6 is taking 100% CPU.

You probably have maxed out your single CPU performance and won't see
any benefit without having a multi-threaded XOR routine.

> I'm still degraded by a factor of eight, though, where I'd expect only
> two.
> 
>> 1) 200MB/s of XOR is not insignificant.  Due to our single thread XOR
>> routines, you can actually keep a CPU pretty busy with this.  Also, even
>> though the XOR routines try to time their assembly 'just so' so that
>> they can use the cache avoiding instructions, this fails more often than
>> not so you end up blowing CPU caches while doing this work, which of
>> course effects the overall system.
> 
> While 200 MB/s of XOR sounds high, the kernel is "advertising" over
> 8000 MB/s, per dmesg:
> 
> [    6.386820] xor: automatically using best checksumming function:
> [    6.396690]    avx       : 24064.000 MB/sec
> [    6.414706] raid6: sse2x1   gen()  7636 MB/s
> [    6.431725] raid6: sse2x2   gen()  3656 MB/s
> [    6.448742] raid6: sse2x4   gen()  3917 MB/s
> [    6.465753] raid6: avx2x1   gen()  5425 MB/s
> [    6.482766] raid6: avx2x2   gen()  7593 MB/s
> [    6.499773] raid6: avx2x4   gen()  8648 MB/s
> [    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
> [    6.499774] raid6: using avx2x2 recovery algorithm
> 
> (CPU is: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.)
> 
> I'm assuming however the kernel does its testing is fairly optimal,

It is *highly* optimal.  What's more, it uses 100% CPU during this time.
 The raid6 thread doing your recovery is responsible for lots of stuff,
issuing reads, doing xor, fulfilling write requests, maintaining the
cache, etc.  It has to have time to actually do other work.  So start
with that 8GB/s figure, but immediately start subtracting from that
because the CPU needs to do other things as well.  Then remember that we
are under *extreme* memory pressure.  When you have to bring in 22 reads
in order to reconstruct just 1 block of the same size, then for 100MB/s
of degraded reads you are generating 2200MB/s of PCI DMA -> MEM
bandwidth consumption, followed by 2200MB/s of MEM -> register load
bandwidth consumption, then I'd have to read the avx xor routine to know
how much write bandwidth it is using, but it's at least 100MB/s of
bandwidth, and likely at least four or five times that much because it
probably doesn't do all 22 blocks in a single xor pass, it likely loads
parity, then reads up to maybe four blocks and xors them together and
then stores the parity, so each pass will re-read and re-store the
parity block.  The point of all of this is that people forget to do the
math on the memory bandwidth used by these XOR operations.  The faster
they are, the higher the percentage of main memory bandwidth you are
consuming.  Now you have to subtract all of that main memory bandwidth
from the total main memory bandwidth for the CPU, and what's left over
is all you have for doing other productive work.  Even if you aren't
blowing your caches doing all of this XOR work, you are blowing your
main memory bandwidth.  Other threads or other actions end up stalling
waiting on main memory accesses to complete.

> and probably assumes ideal cache behavior... so maybe actual XOR
> performance won't be as good as what dmesg suggests...

It will never be that good, and you can thank your stars that it isn't,
because if it were, your computer would be ground to a halt with nothing
happening but data XOR computations.

> but still, 200
> MB/s (or even 1000 MB/s, as I'm now getting), is much lower than 8000
> MB/s...

The math fits.  Most quad channel Intel CPUs have memory bandwidths in
the 50GByte/s range theoretical maximum, but it's not bidirectional,
it's not even multi-access, so you have to remember that the usage looks
like this on a good read:

copy 1: DMA from PCI bus to main memory
copy 2: Load from main memory to CPU for copy_to_user
copy 3: Store from CPU to main memory for user

To get 8GB/s of read performance undegraded then requires 24GB/s of
actual memory bandwidth just for the copies.  That's half of your entire
memory bandwidth (unless you have multiple sockets, then things get more
complex, but this is still true for one socket of the multiple socket
machine).  Once you add the XOR routine into the figure, the 3 accesses
is the same for part of it, but for degraded fixups, it is much worse.

> Is it possible to pin kernel threads to a CPU?  I'm thinking I could
> reboot with isolcpus=2 (for example) and if I can force that md0_raid6
> thread to run on CPU 2, at least the L1/L2 caches should be minimally
> affected...

You could try that, but I doubt it will effect much.

>> Possible fixes for this might include:
>>         c) Creating a consolidated page cache/stripe cache (if we can read more
>> of the blocks needed to get our data from cache instead of disk it helps
>> reduce that IO ops issue)
> 
> I suppose this might be an explanation for why increasing the array's
> stripe_cache_size gave me such a boost?

Yes.  The default setting is conservative, you told it to use as much
memory as it needed.

>>         d) Rearchitecting your arrays into raid50 instead of big raid6 array
> 
> My colleague tested that exact same config with hardware raid5, and
> striped the three raid5 arrays together with software raid1.

That's a huge waste, are you sure he didn't use raid0 for the stripe?

>  So
> clearly not apples-to-apples, but he did get dramatically better
> degraded and rebuild performance.  I do intend to test a pure software
> raid-50 implementation.

I would try it.  If you are OK with single disk failures anyway.

>> (or conversely has the random head seeks just gone so
>> radically through the roof that the problem here really is the time it
>> takes the heads to travel everywhere we are sending them).
> 
> I'm certain head movement time isn't the issue, as these are SSDs.  :)

Fair enough ;-).  And given these are SSDs, I'd be just fine doing
something like four 6 disk raid5s then striped in a raid0 myself.  The
main cause for concern with spinning disks is latent bad sectors causing
a read error on rebuild; with SSDs that's much less of a concern.

> On Tue, Aug 23, 2016 at 1:27 PM, Doug Dumitru <doug@easyco.com> wrote:
>> Can you run an iostat during your degraded test, and also a top run
>> over 20+ seconds with kernel threads showing up.  Even better would be
>> a perf capture, but you might not have all the tools installed.  You
>> can always try:
>>
>> perf record -a sleep 20
>>
>> then
>>
>> perf report
>>
>> should show you the top functions globally over the 20 second sample.
>> If you don't have perf loaded, you might (or might not) be able to
>> load it from the distro.
> 
> Running top for 20 or more seconds, the top processes in terms of CPU
> usage are pretty static:
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>  1228 root      20   0       0      0      0 R 100.0  0.0 562:16.83 md0_raid6
>  1315 root      20   0    4372    684    524 S  17.3  0.0  57:20.92 rngd
>   107 root      20   0       0      0      0 S   9.6  0.0  65:16.63 kswapd0
>   108 root      20   0       0      0      0 S   8.6  0.0  65:19.58 kswapd1
> 19424 root      20   0  108972   1676    560 D   3.3  0.0   0:00.52 dd
>  6909 root      20   0  108972   1676    560 D   2.7  0.0   0:01.53 dd
> 18383 root      20   0  108972   1680    560 D   2.7  0.0   0:00.63 dd
> 
> 
> I truncated the output.  The "dd" processes are part of our testing
> tool that generates the huge read load on the array.  Any given "dd"
> process might jump around, but those four kernel processes are always
> the top four.  (Note that before I increased the stripe_cache_size (as
> mentioned above), the md0_raid6 process was only consuming around 50%
> CPU.)

I would try to tune your stripe cache size such that the kswapd?
processes go to sleep.  Those are reading/writing swap.  That won't help
your overall performance.

> Here is a representative view of a non-first iteration of "iostat -mxt 5":
> 
> 
> 08/23/2016 01:37:59 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            4.84    0.00   27.41   67.59    0.00    0.17
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdy               0.00     0.40    0.80    0.60     0.05     0.00
> 83.43     0.00    1.00    0.50    1.67   1.00   0.14
> sdz               0.00     0.40    0.00    0.60     0.00     0.00
> 10.67     0.00    2.00    0.00    2.00   2.00   0.12
> sdd           12927.00     0.00  204.40    0.00    51.00     0.00
> 511.00     5.93   28.75   28.75    0.00   4.31  88.10

I'm not sure how much I trust some of these numbers.  According to this,
you are issuing 200 read/s, at an average size of 511KB, which should
work out to roughly 100MB/s of data read, but rMB/s is only 51.  I
wonder if the read requests from the raid6 thread are bypassing the
rMB/s accounting because they aren't coming from the VFS or some such?
It would explain why the rMB/s is only half of what it should be based
upon requests and average request size.

> sde           13002.60     0.00  205.20    0.00    51.20     0.00
> 511.00     6.29   30.39   30.39    0.00   4.59  94.12
> sdf           12976.80     0.00  205.00    0.00    51.00     0.00
> 509.50     6.17   29.76   29.76    0.00   4.57  93.78
> sdg           12950.20     0.00  205.60    0.00    50.80     0.00
> 506.03     6.20   29.75   29.75    0.00   4.57  93.88
> sdh           12949.00     0.00  207.20    0.00    50.90     0.00
> 503.11     6.36   30.35   30.35    0.00   4.59  95.10
> sdb           12196.40     0.00  192.60    0.00    48.10     0.00
> 511.47     5.48   28.15   28.15    0.00   4.38  84.36
> sda               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdi           12923.00     0.00  208.40    0.00    51.00     0.00
> 501.20     6.79   32.31   32.31    0.00   4.65  96.84
> sdj           12796.20     0.00  206.80    0.00    50.50     0.00
> 500.12     6.62   31.73   31.73    0.00   4.62  95.64
> sdk           12746.60     0.00  204.00    0.00    50.20     0.00
> 503.97     6.38   30.77   30.77    0.00   4.60  93.86
> sdl           12570.00     0.00  202.20    0.00    49.70     0.00
> 503.39     6.39   31.19   31.19    0.00   4.63  93.68
> sdn           12594.00     0.00  204.20    0.00    49.95     0.00
> 500.97     6.40   30.99   30.99    0.00   4.58  93.54
> sdm           12569.00     0.00  203.80    0.00    49.90     0.00
> 501.45     6.30   30.58   30.58    0.00   4.45  90.60
> sdp           12568.80     0.00  205.20    0.00    50.10     0.00
> 500.03     6.37   30.79   30.79    0.00   4.52  92.72
> sdo           12569.20     0.00  204.00    0.00    49.95     0.00
> 501.46     6.40   31.07   31.07    0.00   4.58  93.42
> sdw           12568.60     0.00  206.20    0.00    50.00     0.00
> 496.60     6.34   30.71   30.71    0.00   4.24  87.48
> sdx           12038.60     0.00  197.40    0.00    47.60     0.00
> 493.84     6.01   30.21   30.21    0.00   4.40  86.86
> sdq           12570.20     0.00  204.20    0.00    50.15     0.00
> 502.97     6.23   30.41   30.41    0.00   4.44  90.68
> sdr           12571.00     0.00  204.60    0.00    50.25     0.00
> 502.99     6.15   30.26   30.26    0.00   4.18  85.62
> sds           12495.20     0.00  203.80    0.00    49.95     0.00
> 501.95     6.00   29.62   29.62    0.00   4.24  86.38
> sdu           12695.60     0.00  207.80    0.00    50.65     0.00
> 499.17     6.22   30.00   30.00    0.00   4.16  86.38
> sdv           12619.00     0.00  207.80    0.00    50.35     0.00
> 496.22     6.23   30.03   30.03    0.00   4.20  87.32
> sdt           12671.20     0.00  206.20    0.00    50.50     0.00
> 501.56     6.05   29.30   29.30    0.00   4.24  87.44
> sdc           12851.60     0.00  203.00    0.00    50.70     0.00
> 511.50     5.84   28.49   28.49    0.00   4.17  84.64
> md126             0.00     0.00    0.60    1.00     0.05     0.00
> 71.00     0.00    0.00    0.00    0.00   0.00   0.00
> dm-0              0.00     0.00    0.60    0.80     0.05     0.00
> 81.14     0.00    2.29    0.67    3.50   1.14   0.16
> dm-1              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
> md0               0.00     0.00 4475.20    0.00  1110.95     0.00
> 508.41     0.00    0.00    0.00    0.00   0.00   0.00
> 
> 
> sdy and sz are the system drives, so they are uninteresting.
> 
> sda is the md0 drive I failed, that's why it stays at zero.
> 
> And lastly, here's the output of the perf commands you suggested (at
> least the top part):
> 
> Samples: 561K of event 'cycles', Event count (approx.): 318536644203
> Overhead  Command         Shared Object                 Symbol
>   52.85%  swapper         [kernel.kallsyms]             [k] cpu_startup_entry
>    4.47%  md0_raid6       [kernel.kallsyms]             [k] memcpy
>    3.39%  dd              [kernel.kallsyms]             [k] __find_stripe
>    2.50%  md0_raid6       [kernel.kallsyms]             [k] analyse_stripe
>    2.43%  dd              [kernel.kallsyms]             [k] _raw_spin_lock_irq
>    1.75%  rngd            rngd                          [.] 0x000000000000288b
>    1.74%  md0_raid6       [kernel.kallsyms]             [k] xor_avx_5
>    1.49%  dd              [kernel.kallsyms]             [k]
> copy_user_enhanced_fast_string
>    1.33%  md0_raid6       [kernel.kallsyms]             [k] ops_run_io
>    0.65%  dd              [kernel.kallsyms]             [k] raid5_compute_sector
>    0.60%  md0_raid6       [kernel.kallsyms]             [k] _raw_spin_lock_irq
>    0.55%  ps              libc-2.17.so                  [.] _IO_vfscanf
>    0.53%  ps              [kernel.kallsyms]             [k] vsnprintf
>    0.51%  ps              [kernel.kallsyms]             [k] format_decode
>    0.47%  ps              [kernel.kallsyms]             [k] number.isra.2
>    0.41%  md0_raid6       [kernel.kallsyms]             [k] raid_run_ops
>    0.40%  md0_raid6       [kernel.kallsyms]             [k] __blk_segment_map_sg
> 
> 
> That's my first time using the perf tool, so I need a little hand-holding here.

You might get more interesting perf results if you could pin the md
raid6 thread to a single CPU and then filter the perf results to just
that CPU.
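
Something like this should do it (the CPU number is arbitrary):

    taskset -cp 2 $(pgrep -x md0_raid6)   # pin the raid thread to CPU 2
    perf record -a -C 2 -- sleep 20       # sample only CPU 2
    perf report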


-- 
Doug Ledford <dledford@redhat.com>
    GPG Key ID: 0E572FDD


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel checksumming performance vs actual raid device performance
  2016-08-23 20:15               ` Doug Ledford
@ 2016-08-23 21:42                 ` Phil Turmel
  0 siblings, 0 replies; 26+ messages in thread
From: Phil Turmel @ 2016-08-23 21:42 UTC (permalink / raw)
  To: Doug Ledford, Matt Garman, Doug Dumitru; +Cc: Mdadm

On 08/23/2016 04:15 PM, Doug Ledford wrote:

> Your raid device has a good chunk size for your usage pattern.  If you
> had a smallish chunk size (like 64k or 32k), I would actually expect
> things to behave differently.  But, then again, maybe I'm wrong and that
> would help.  With a smaller chunk size, you would be able to fit more
> stripes in the stripe cache using less memory.

This is not correct.  Parity operations in MD raid4/5/6 operate on 4k
blocks.  The stripe cache for an array is a collection of 4k elements
per member device.  Chunk size doesn't factor into the cache itself.

But see below....

> Makes sense.  I know the stripe cache size is conservative by default
> because of the fact that it's not shared with the page cache, so you
> might as well consider it's memory lost.  When you upped it to 64k, and
> you have 22 disks at 512k chunk, that 11MB per stripe and 65536 total
> allowed stripes which is a maximum memory consumption of around 700GB
> RAM.  I doubt you have that much in your machine, so I'm guessing it's
> simply using all available RAM that the page cache or something else
> isn't already using.  That's also explains why setting it higher doesn't
> provide any additional benefits ;-).

More likely the parity thread saturated and no more speed was possible.
Also possible that there would be a step change in performance again at
a much larger cache size.

>> While 200 MB/s of XOR sounds high, the kernel is "advertising" over
>> 8000 MB/s, per dmesg:
>>
>> [    6.386820] xor: automatically using best checksumming function:
>> [    6.396690]    avx       : 24064.000 MB/sec
>> [    6.414706] raid6: sse2x1   gen()  7636 MB/s
>> [    6.431725] raid6: sse2x2   gen()  3656 MB/s
>> [    6.448742] raid6: sse2x4   gen()  3917 MB/s
>> [    6.465753] raid6: avx2x1   gen()  5425 MB/s
>> [    6.482766] raid6: avx2x2   gen()  7593 MB/s
>> [    6.499773] raid6: avx2x4   gen()  8648 MB/s
>> [    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
>> [    6.499774] raid6: using avx2x2 recovery algorithm
>>
>> (CPU is: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.)

Parity operations in raid must always involve all (available) member
devices.  Read operations when not degraded won't generate any parity
operations.  Most large write operations and any degraded read
operations will involve all members, even if those members' data is not
part of the larger read/write request.

As chunk sizes get larger the odds grow that any given array I/O will
touch a fraction of the slice, causing I/O to members purely for parity
math.  Also, the odds rise that the starting point or ending point of an
array I/O operation will not be aligned to the stripe, making more
member I/O solely for parity math.

Then add in the fact that dd issues I/O requests one block at a time,
per the bs=? parameter.  So it is possible that data that would have
been sequential without parallel pressure (still in the stripe cache for
later reads) generates multiple parity calculations for fractional
stripe operations, just due to stripe size/alignment mismatch on single
dd dispatches.

What bs=? value are you using in your dd commands?  Based on your 512k
chunk and 22 data members, it should be 11264k for aligned operations and
much larger than that for unaligned.

FWIW, I use small chunk sizes -- usually 16k.

Phil

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel checksumming performance vs actual raid device performance
  2016-07-12 21:09 kernel checksumming performance vs actual raid device performance Matt Garman
  2016-07-13  3:58 ` Brad Campbell
       [not found] ` <CAFx4rwQj3_JTNiS0zsQjp_sPXWkrp0ggjg_UiR7oJ8u0X9PQVA@mail.gmail.com>
@ 2016-08-24  1:02 ` Shaohua Li
  2016-08-25 15:07   ` Matt Garman
  2 siblings, 1 reply; 26+ messages in thread
From: Shaohua Li @ 2016-08-24  1:02 UTC (permalink / raw)
  To: Matt Garman; +Cc: Mdadm

On Tue, Jul 12, 2016 at 04:09:25PM -0500, Matt Garman wrote:
> We have a system with a 24-disk raid6 array, using 2TB SSDs.  We use
> this system in a workload that is 99.9% read-only (a few small
> writes/day, versus countless reads).  This system is an NFS server for
> about 50 compute nodes that continually read its data.
> 
> In a non-degraded state, the system works wonderfully: the md0_raid6
> process uses less than 1% CPU, each drive is around 20% utilization
> (via iostat), no swapping is taking place.  The outbound throughput
> averages around 2.0 GB/sec, with 2.5 GB/sec peaks.
> 
> However, we had a disk fail, and the throughput dropped considerably,
> with the md0_raid6 process pegged at 100% CPU.
> 
> I understand that data from the failed disk will need to be
> reconstructed from parity, and this will cause the md0_raid6 process
> to consume considerable CPU.
> 
> What I don't understand is how I can determine what kind of actual MD
> device performance (throughput) I can expect in this state?
> 
> Dmesg seems to give some hints:
> 
> [    6.386820] xor: automatically using best checksumming function:
> [    6.396690]    avx       : 24064.000 MB/sec
> [    6.414706] raid6: sse2x1   gen()  7636 MB/s
> [    6.431725] raid6: sse2x2   gen()  3656 MB/s
> [    6.448742] raid6: sse2x4   gen()  3917 MB/s
> [    6.465753] raid6: avx2x1   gen()  5425 MB/s
> [    6.482766] raid6: avx2x2   gen()  7593 MB/s
> [    6.499773] raid6: avx2x4   gen()  8648 MB/s
> [    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
> [    6.499774] raid6: using avx2x2 recovery algorithm
> 
> (CPU is: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.)
> 
> Perhaps naively, I would expect that second-to-last line:
> 
> [    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
> 
> to indicate what kind of throughput I could expect in a degraded
> state, but clearly that is not right---or I have something
> misconfigured.
> 
> So in other words, what does that gen() 8648 MB/s metric mean in terms
> of real-world throughput?  Is there a way I can "convert" that number
> to expected throughput of a degraded array?

In non-degrade mode, raid6 just directly dispatch IO to raid disks, software
involvement is very small. In degrade mode, the data is calculated. There are a
lot of factors impacting the performance:
1. enter the raid6 state machine, which has a long code path. (this is
debatable, if a read doesn't read the faulty disk and it's a small random read,
raid6 doesn't need to run the state machine. Fixing this could hugely improve
the performance)
2. the state machine runs in a single thread, which is a bottleneck.  Try
increasing group_thread_cnt, which will make the handling multi-threaded.
3. the stripe cache is involved.  Try increasing stripe_cache_size (see the
sketch below for both tunables).
4. the faulty disk data must be calculated, which involves read from other
disks. If this is a numa machine, and each disk interrupts to different
cpus/nodes, there will be big impact (cache, wakeup IPI)
5. the xor calculation overhead.  Actually I don't think the impact is big;
modern CPUs can do the calculation fast.
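
For example, a sketch of items 2 and 3 (the array name md0 and the values
are placeholders; tune and measure for your workload):

  echo 4 > /sys/block/md0/md/group_thread_cnt      # handle stripes with 4 threads
  echo 4096 > /sys/block/md0/md/stripe_cache_size  # enlarge the stripe cache
  cat /sys/block/md0/md/stripe_cache_active        # see how much is actually in use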

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel checksumming performance vs actual raid device performance
  2016-08-24  1:02 ` Shaohua Li
@ 2016-08-25 15:07   ` Matt Garman
  2016-08-25 23:39     ` Adam Goryachev
  0 siblings, 1 reply; 26+ messages in thread
From: Matt Garman @ 2016-08-25 15:07 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Mdadm

Note: again I consolidated several previous posts into one for inline replies...

On Tue, Aug 23, 2016 at 2:41 PM, Doug Dumitru <doug@easyco.com> wrote:
> So you are up at 1GB/sec, which is only 1/4 the degraded speed, but
> 1/2 the expected speed based on drive data transfers required.  This
> is actually pretty good.

I get 8 GB/sec non-degraded.  So I'd say I'm still at only 1/8 of the
non-degraded speed, and about 1/4 of what I expect in the degraded state,
i.e., I expect 4 GB/sec degraded.  However, based on what I'm
reading in this thread, maybe I can't do any better?  But
group_thread_cnt might save the day...


> If you need this to go faster, then it is either a raid re-design, or
> perhaps you should consider cutting your array into two parts.  Two 12
> drives raid-6 arrays will give you more bandwidth both because the
> failures are less "wide", so a single drive will only do 11 reads
> instead of 22.  Plus you get the benefit of two raid-6 threads should
> you have dead drives on both halves.  You can raid-0 the arrays
> together.  Then again, you lose two drives worth of space.

Yes, that's on the list to test.  Actually we'll try three 8-disk
raid-5s striped into one big raid0.  That only loses one more drive's
worth of space than a single 24-disk raid6.  Space is at a premium
here, as we're really needing to build this system with 4 TB drives.
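
A rough sketch of that layout (hypothetical member names; chunk size still
to be tuned):

  mdadm --create /dev/md1 --level=5 --raid-devices=8 --chunk=512 /dev/sd[b-i]
  mdadm --create /dev/md2 --level=5 --raid-devices=8 --chunk=512 /dev/sd[j-q]
  mdadm --create /dev/md3 --level=5 --raid-devices=8 --chunk=512 /dev/sd[r-y]
  mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/md1 /dev/md2 /dev/md3

The raid0 on top keeps a single device to export over NFS, while each raid5
leg gets its own md thread.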

The loss of resiliency using raid5 instead of raid6 "shouldn't" be an
issue here.  The design is to deliberately over-provision these
servers so that we have one more than we need.  Then in case of
failure (or major degradation) of a single server, we can migrate
clients to the other ones.


On Tue, Aug 23, 2016 at 3:15 PM, Doug Ledford <dledford@redhat.com> wrote:
> OK, 50 sequential I/Os at a time.  Good point to know.

Note that's just the test workload.  The real workload has literally
*thousands* of sequential reads at once.  However, those thousands of
reads aren't reading at full speed like dd of=/dev/null.  In the real
workload, after a chunk of data is read, some computations are done.
IOW, when the storage backend is working optimally, the read processes
are CPU bound.  But it's extremely hard to accurately generate this
kind of test workload, so we have fewer reader threads (50 in this
case), but they are pure read-as-fast-as-we-can jobs, as opposed to
read-and-compute.

> You're raid device has a good chunk size for your usage pattern.  If you
> had a smallish chunk size (like 64k or 32k), I would actually expect
> things to behave differently.  But, then again, maybe I'm wrong and that
> would help.  With a smaller chunk size, you would be able to fit more
> stripes in the stripe cache using less memory.

For some reason I thought we had a 64k chunk size, which was the old
mdadm default (newer mdadm versions default to 512k).  But, you're right,
it is indeed 512k.  I will try to experiment with different chunk sizes,
as my Internet research suggests that's a very application-dependent
setting; I can't find any rules of thumb for what our ideal chunk size
might be for this particular workload.  My intuition says bigger is
better, since we're dealing with sequential reads of generally large-ish files.
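
For reference, the current chunk size can be confirmed with something like
this (md0 as a placeholder):

  mdadm --detail /dev/md0 | grep -i chunk
  cat /sys/block/md0/md/chunk_size    # reported in bytes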


> Makes sense.  I know the stripe cache size is conservative by default
> because of the fact that it's not shared with the page cache, so you
> might as well consider it's memory lost.  When you upped it to 64k, and
> you have 22 disks at 512k chunk, that 11MB per stripe and 65536 total
> allowed stripes which is a maximum memory consumption of around 700GB
> RAM.  I doubt you have that much in your machine, so I'm guessing it's
> simply using all available RAM that the page cache or something else
> isn't already using.  That's also explains why setting it higher doesn't
> provide any additional benefits ;-).

Do you think more RAM might be beneficial then?


> The math fits.  Most quad channel Intel CPUs have memory bandwidths in
> the 50GByte/s range theoretical maximum, but it's not bidirectional,
> it's not even multi-access, so you have to remember that the usage looks
> like this on a good read:

I'll have to re-read your explanation a few more times to fully grasp
it, but thank you for that!

For what it's worth, this is a NUMA system: two E5-2620v3 CPUs.  More
cores, but I understand the complexities added by memory controller
and PCIe node locality.


>> My colleague tested that exact same config with hardware raid5, and
>> striped the three raid5 arrays together with software raid1.
>
> That's a huge waste, are you sure he didn't use raid0 for the stripe?

Sorry, typo, that was raid0 indeed.


> I would try to tune your stripe cache size such that the kswapd?
> processes go to sleep.  Those are reading/writing swap.  That won't help
> your overall performance.

Do you mean swapping as in swapping memory to disk?  I don't think
that is happening.  I have 32 GB of swap space, but according to "free
-k" only 48k of swap is being used, and that number never grows.
Also, I don't have any of the classic telltale signs of disk-swapping,
e.g. overall laggy system feel.

Also, I re-set the stripe_cache_size back down to 256, and those
kswapd processes continue to peg a couple CPUs.  IOW,
stripe_cache_size doesn't appear to have much effect on kswapd.


On Tue, Aug 23, 2016 at 8:02 PM, Shaohua Li <shli@kernel.org> wrote:
> 2. the state machine runs in a single thread, which is a bottleneck. try to
> increase group_thread_cnt, which will make the handling multi-thread.

For others' reference, this parameter is in
/sys/block/<device>/md/stripe_cache_size.

On this CentOS (RHEL) 7.2 server, the parameter defaults to 0.  I set
it to 4, and the degraded reads went up dramatically.  Need to
experiment with this (and all the other tunables) some more, but that
change alone put me up to 2.5 GB/s read from the degraded array!

Thanks again,
Matt

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel checksumming performance vs actual raid device performance
  2016-08-25 15:07   ` Matt Garman
@ 2016-08-25 23:39     ` Adam Goryachev
  2016-08-26 13:01       ` Matt Garman
  2016-08-26 18:11       ` Wols Lists
  0 siblings, 2 replies; 26+ messages in thread
From: Adam Goryachev @ 2016-08-25 23:39 UTC (permalink / raw)
  To: Matt Garman; +Cc: Mdadm

On 26/08/16 01:07, Matt Garman wrote:
>
>> Makes sense.  I know the stripe cache size is conservative by default
>> because of the fact that it's not shared with the page cache, so you
>> might as well consider it's memory lost.  When you upped it to 64k, and
>> you have 22 disks at 512k chunk, that 11MB per stripe and 65536 total
>> allowed stripes which is a maximum memory consumption of around 700GB
>> RAM.  I doubt you have that much in your machine, so I'm guessing it's
>> simply using all available RAM that the page cache or something else
>> isn't already using.  That's also explains why setting it higher doesn't
>> provide any additional benefits ;-).
> Do you think more RAM might be beneficial then?
I'm not sure of this, but I can suggest that you try various sizes for 
the stripe_cache_size, in my testing, I tried various values up to 64k, 
but 4k ended up being the optimal value (I only have 8 disks with 64k 
chunk size)...
>
>> I would try to tune your stripe cache size such that the kswapd?
>> processes go to sleep.  Those are reading/writing swap.  That won't help
>> your overall performance.
> Do you mean swapping as in swapping memory to disk?  I don't think
> that is happening.  I have 32 GB of swap space, but according to "free
> -k" only 48k of swap is being used, and that number never grows.
> Also, I don't have any of the classic telltale signs of disk-swapping,
> e.g. overall laggy system feel.
>
> Also, I re-set the stripe_cache_size back down to 256, and those
> kswapd processes continue to peg a couple CPUs.  IOW,
> stripe_cache_size doesn't appear to have much effect on kswapd.
You should find out if you are swapping with vmstat:
vmstat 5
Watch the Swap (SI and SO) columns, if they are non-zero, then you are 
indeed swapping.

You might find that if there is insufficient memory, then the kernel 
will automatically reduce/limit the value for the stripe_cache_size (I'm 
only guessing, but my memory tells me that the kernel locks this memory 
and it can't be swapped/etc).
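
One way to check is simply to read the value back after setting it (md0 as
a placeholder):

  echo 32768 > /sys/block/md0/md/stripe_cache_size
  cat /sys/block/md0/md/stripe_cache_size    # whatever value actually stuck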

>
> On Tue, Aug 23, 2016 at 8:02 PM, Shaohua Li <shli@kernel.org> wrote:
>> 2. the state machine runs in a single thread, which is a bottleneck. try to
>> increase group_thread_cnt, which will make the handling multi-thread.
> For others' reference, this parameter is in
> /sys/block/<device>/md/stripe_cache_size.
>
> On this CentOS (RHEL) 7.2 server, the parameter defaults to 0.  I set
> it to 4, and the degraded reads went up dramatically.  Need to
> experiment with this (and all the other tunables) some more, but that
> change alone put me up to 2.5 GB/s read from the degraded array!

Did you mean group_thread_cnt which defaults to 0?
I don't recall the default for stripe_cache_size, but I'm pretty certain 
it is not 0...
Note, in your case, it might increase the "test read scenario" but since 
your "live" scenario has a lot more CPU overhead, then this option might 
decrease overall results... Unfortunately, only testing with "live" load 
will really provide the information you will need to decide on this.

Regards,
Adam



-- 
Adam Goryachev Website Managers www.websitemanagers.com.au

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel checksumming performance vs actual raid device performance
  2016-08-25 23:39     ` Adam Goryachev
@ 2016-08-26 13:01       ` Matt Garman
  2016-08-26 20:04         ` Doug Dumitru
  2016-08-26 18:11       ` Wols Lists
  1 sibling, 1 reply; 26+ messages in thread
From: Matt Garman @ 2016-08-26 13:01 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Mdadm

On Thu, Aug 25, 2016 at 6:39 PM, Adam Goryachev
<mailinglists@websitemanagers.com.au> wrote:
>> Do you think more RAM might be beneficial then?
>
> I'm not sure of this, but I can suggest that you try various sizes for the
> stripe_cache_size, in my testing, I tried various values up to 64k, but 4k
> ended up being the optimal value (I only have 8 disks with 64k chunk
> size)...
>
> You should find out if you are swapping with vmstat:
> vmstat 5
> Watch the Swap (SI and SO) columns, if they are non-zero, then you are
> indeed swapping.
>
> You might find that if there is insufficient memory, then the kernel will
> automatically reduce/limit the value for the stripe_cache_size (I'm only
> guessing, but my memory tells me that the kernel locks this memory and it
> can't be swapped/etc).

Good ideas.  I actually halved the amount of physical memory in this
machine.  I replaced the original eight 8GB DIMMs with eight 4GB
DIMMs.  So no change in number of modules, but total RAM went from 64
GB to 32 GB.

I then cranked the stripe_cache_size up to 32k, degraded the array,
and kicked off my reader test.

Performance is basically the same.  And I'm definitely not swapping,
vmstat shows both swap values constant at zero.  So it appears the
kernel is smart enough to scale back the stripe_cache_size to avoid
swapping.
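
Roughly, that test amounts to the following sketch (device, disk, and file
names are hypothetical):

  echo 32768 > /sys/block/md0/md/stripe_cache_size
  mdadm /dev/md0 --fail /dev/sdx        # simulate the failed drive
  for i in $(seq 1 50); do              # 50 parallel sequential readers
      dd if=/data/file$i of=/dev/null bs=10240k &
  done
  wait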


>> On Tue, Aug 23, 2016 at 8:02 PM, Shaohua Li <shli@kernel.org> wrote:
>>>
>>> 2. the state machine runs in a single thread, which is a bottleneck. try
>>> to
>>> increase group_thread_cnt, which will make the handling multi-thread.
>>
>> For others' reference, this parameter is in
>> /sys/block/<device>/md/stripe_cache_size.
>>
>> On this CentOS (RHEL) 7.2 server, the parameter defaults to 0.  I set
>> it to 4, and the degraded reads went up dramatically.  Need to
>> experiment with this (and all the other tunables) some more, but that
>> change alone put me up to 2.5 GB/s read from the degraded array!
>
>
> Did you mean group_thread_cnt which defaults to 0?
> I don't recall the default for stripe_cache_size, but I'm pretty certain it
> is not 0...
> Note, in your case, it might increase the "test read scenario" but since
> your "live" scenario has a lot more CPU overhead, then this option might
> decrease overall results... Unfortunately, only testing with "live" load
> will really provide the information you will need to decide on this.

Yes, sorry, that is a typo, meant to write group_thread_cnt.  That
defaults to 0.  stripe_cache_size appears to default to 256.  (At
least on CentOS/RHEL 7.2.)

Agreed, yes, upping group_thread_cnt could improve one thing only to
the detriment of something else.  Nothing like a little "testing in
production" to make the higher-ups sweat.  :)

Thanks again all!
Matt

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel checksumming performance vs actual raid device performance
  2016-08-25 23:39     ` Adam Goryachev
  2016-08-26 13:01       ` Matt Garman
@ 2016-08-26 18:11       ` Wols Lists
  1 sibling, 0 replies; 26+ messages in thread
From: Wols Lists @ 2016-08-26 18:11 UTC (permalink / raw)
  To: Adam Goryachev, Matt Garman; +Cc: Mdadm

On 26/08/16 00:39, Adam Goryachev wrote:
> You should find out if you are swapping with vmstat:
> vmstat 5
> Watch the Swap (SI and SO) columns, if they are non-zero, then you are
> indeed swapping.
> 
> You might find that if there is insufficient memory, then the kernel
> will automatically reduce/limit the value for the stripe_cache_size (I'm
> only guessing, but my memory tells me that the kernel locks this memory
> and it can't be swapped/etc).

Are you using a gui :-) ?

Download and build the latest version of xosview (assuming it builds;
when I last tried, "bleeding edge" was bleeding ... :-( )

git://github.com/mromberg/xosview

That'll give you a nice little overview of both raid and swap. The
current version of xosview is fine for swap, but the raid monitor is broken.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel checksumming performance vs actual raid device performance
  2016-08-26 13:01       ` Matt Garman
@ 2016-08-26 20:04         ` Doug Dumitru
  2016-08-26 21:57           ` Phil Turmel
  0 siblings, 1 reply; 26+ messages in thread
From: Doug Dumitru @ 2016-08-26 20:04 UTC (permalink / raw)
  To: Matt Garman; +Cc: Adam Goryachev, Mdadm

On Fri, Aug 26, 2016 at 6:01 AM, Matt Garman <matthew.garman@gmail.com> wrote:
> On Thu, Aug 25, 2016 at 6:39 PM, Adam Goryachev
> <mailinglists@websitemanagers.com.au> wrote:
>>> Do you think more RAM might be beneficial then?
>>
>> I'm not sure of this, but I can suggest that you try various sizes for the
>> stripe_cache_size, in my testing, I tried various values up to 64k, but 4k
>> ended up being the optimal value (I only have 8 disks with 64k chunk
>> size)...
>>
>> You should find out if you are swapping with vmstat:
>> vmstat 5
>> Watch the Swap (SI and SO) columns, if they are non-zero, then you are
>> indeed swapping.
>>
>> You might find that if there is insufficient memory, then the kernel will
>> automatically reduce/limit the value for the stripe_cache_size (I'm only
>> guessing, but my memory tells me that the kernel locks this memory and it
>> can't be swapped/etc).
>
> Good ideas.  I actually halved the amount of physical memory in this
> machine.  I replaced the original eight 8GB DIMMs with eight 4GB
> DIMMs.  So no change in number of modules, but total RAM went from 64
> GB to 32 GB.
>
> I then cranked the stripe_cache_size up to 32k, degraded the array,
> and kicked off my reader test.
>
> Performance is basically the same.  And I'm definitely not swapping,
> vmstat shows both swap values constant at zero.  So it appears the
> kernel is smart enough to scale back the stripe_cache_size to avoid
> swapping.

The documentation implies that 32K is the upper limit for stripe_cache_size.

It is not immediately clear from the documentation or the code whether
a "stripe" is a page, a control structure, or a chunk.  I "think" it
is a control structure with a bio plus a single page.

I took a simple array from stripe_cache_size 256 => 32K and the system
allocated 265 MB of RAM (crude number via free), so this implies that
the stripe cache is 8K per entry.  The stripe cache struct appears to
have a bio plus a bunch of other control items in the struct.  I am
not sure if it has a statically allocated page, but at 8K it looks
like it does.  So I think the minimum/static memory allocated by the
stripe cache is 8K per entry.  This "might" also be the maximum, or
the cache size might grow to handle longer requests.
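
That crude measurement can be repeated with something like this (md0 as a
placeholder):

  free -m                                            # note "used" before
  echo 32768 > /sys/block/md0/md/stripe_cache_size
  free -m                                            # the delta approximates the cache footprint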

My test array uses 16K chunks, and 8K is lower than 16K, so the max
might be (4K + chunk_size) * stripe_cache_size, but I suspect it is
actually (4K + 4K) * stripe_cache_size.  Others write and breathe this
code more than I do, so clarification would be helpful.

If it were actually chunk size, the upper limit would be really bad:
(32K * 512K) = 16GB.  RAID 5/6 is "compatible as a swap device", so
memory allocations during IO are generally not allowed.  So I think that
the stripe cache gets bumped and just stays there with little (or no)
dynamic allocation during operation.  If you run out of stripe cache
buckets, the driver "stalls" the calling IO operations until stripe
cache entries become available.  This "stall" of calling IOs lowers the
number of outstanding IOs to the member drives, which probably
explains your performance at 200 MB/sec.  Once stripe_cache_size gets
big enough to handle your workload, additional allocation does not help.
You can look at stripe_cache_active to see what is in use during your
run.
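
For example (md0 as a placeholder):

  # sample stripe cache usage once per second while the workload runs
  watch -n 1 cat /sys/block/md0/md/stripe_cache_active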

Doug

[... rest snipped ...]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel checksumming performance vs actual raid device performance
  2016-08-26 20:04         ` Doug Dumitru
@ 2016-08-26 21:57           ` Phil Turmel
  2016-08-26 22:11             ` Doug Dumitru
  0 siblings, 1 reply; 26+ messages in thread
From: Phil Turmel @ 2016-08-26 21:57 UTC (permalink / raw)
  To: doug, Matt Garman; +Cc: Adam Goryachev, Mdadm

On 08/26/2016 04:04 PM, Doug Dumitru wrote:
> I took a simple array from stripe_cache_size 256 => 32K and the system
> allocated 265 MB of RAM (crude number via free), so this implies that
> the stripe cache is 8K per entry.  The stripe cache struct appears to
> have a bio plus a bunch of other control items in the struct.  I am
> not sure if it has a statically allocated page, but at 8K it looks
> like it does.  So I think the minimum/static memory allocated by the
> stripe cache is 8K per entry.  This "might" also be the maximum, or
> the cache size might grow to handle longer requests.

This was answered three days ago.  Allow me to quote myself:

> This is not correct.  Parity operations in MD raid4/5/6 operate on 4k
> blocks.  The stripe cache for an array is a collection of 4k elements
> per member device.  Chunk size doesn't factor into the cache itself.

Phil


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: kernel checksumming performance vs actual raid device performance
  2016-08-26 21:57           ` Phil Turmel
@ 2016-08-26 22:11             ` Doug Dumitru
  0 siblings, 0 replies; 26+ messages in thread
From: Doug Dumitru @ 2016-08-26 22:11 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Matt Garman, Adam Goryachev, Mdadm

Phil,

My apologies for missing this.  This thread is getting long.

Regardless, the max stripe_cache_size will not use more than 256MB of
RAM (32K x 8K) for a single device, and the memory usage will be
static.

Doug


On Fri, Aug 26, 2016 at 2:57 PM, Phil Turmel <philip@turmel.org> wrote:
> On 08/26/2016 04:04 PM, Doug Dumitru wrote:
>> I took a simple array from stripe_cache_size 256 => 32K and the system
>> allocated 265 MB of RAM (crude number via free), so this implies that
>> the stripe cache is 8K per entry.  The stripe cache struct appears to
>> have a bio plus a bunch of other control items in the struct.  I am
>> not sure if it has a statically allocated page, but at 8K it looks
>> like it does.  So I think the minimum/static memory allocated by the
>> stripe cache is 8K per entry.  This "might" also be the maximum, or
>> the cache size might grow to handle longer requests.
>
> This was answered three days ago.  Allow me to quote myself:
>
>> This is not correct.  Parity operations in MD raid4/5/6 operate on 4k
>> blocks.  The stripe cache for an array is a collection of 4k elements
>> per member device.  Chunk size doesn't factor into the cache itself.
>
> Phil
>



-- 
Doug Dumitru
EasyCo LLC

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2016-08-26 22:11 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-12 21:09 kernel checksumming performance vs actual raid device performance Matt Garman
2016-07-13  3:58 ` Brad Campbell
     [not found] ` <CAFx4rwQj3_JTNiS0zsQjp_sPXWkrp0ggjg_UiR7oJ8u0X9PQVA@mail.gmail.com>
2016-07-13 16:52   ` Fwd: " Doug Dumitru
2016-08-16 19:44   ` Matt Garman
2016-08-16 22:51     ` Doug Dumitru
2016-08-17  0:27       ` Adam Goryachev
     [not found]     ` <CAFx4rwTawqrBOWVwtPnGhRRAM1XiGQkS-o3YykmD0AftR45YkA@mail.gmail.com>
2016-08-23 14:34       ` Matt Garman
2016-08-23 15:02         ` Chris Murphy
     [not found]   ` <CAJvUf-Dqesy2TJX7W-bPakzeDcOoNy0VoSWWM06rKMYMhyhY7g@mail.gmail.com>
     [not found]     ` <CAFx4rwSQQuqeCFm+60+Gm75D49tg+mVjU=BnQSZThdE7E6KqPQ@mail.gmail.com>
2016-08-23 14:54       ` Matt Garman
2016-08-23 18:00         ` Doug Ledford
2016-08-23 18:27           ` Doug Dumitru
2016-08-23 19:10             ` Doug Ledford
2016-08-23 19:19               ` Doug Dumitru
2016-08-23 19:26                 ` Doug Ledford
2016-08-23 19:26             ` Matt Garman
2016-08-23 19:41               ` Doug Dumitru
2016-08-23 20:15               ` Doug Ledford
2016-08-23 21:42                 ` Phil Turmel
2016-08-24  1:02 ` Shaohua Li
2016-08-25 15:07   ` Matt Garman
2016-08-25 23:39     ` Adam Goryachev
2016-08-26 13:01       ` Matt Garman
2016-08-26 20:04         ` Doug Dumitru
2016-08-26 21:57           ` Phil Turmel
2016-08-26 22:11             ` Doug Dumitru
2016-08-26 18:11       ` Wols Lists
